Sunday, October 13, 2024

How to Run a Large Language Model (LLM) on a Raspberry Pi 5

The world of artificial intelligence is rapidly advancing, and with it, the capabilities of smaller and more accessible hardware are expanding. Recently, I was inspired by a video on the NVIDIA AI channel showcasing the Gemma 2 language model, boasting 2 billion parameters, running on a Jetson Orin Nano. This prompted me to delve into the world of running LLMs on the Raspberry Pi 5, a device known for its affordability and versatility. My journey led me to the Phi-3 mini 4K Instruct, a 3.8 billion parameter LLM from Microsoft, which surprisingly runs smoothly on the Raspberry Pi 5 using the ONNX Runtime GenAI framework.

This article will guide you through the process of setting up and running the Phi-3 mini 4K Instruct on your Raspberry Pi 5, along with optional instructions for exploring the Mistral 7B model as an alternative. We will explore the foundational technology behind ONNX Runtime and delve into the hardware capabilities of the Raspberry Pi 5.

The Shifting Landscape of AI: From LLMs to SLMs

Before we begin, it's important to acknowledge the ever-changing nature of AI. What we consider "Large Language Models" today may be considered "Small Language Models" tomorrow, given the exponential growth in model sizes and computational power. This rapid evolution necessitates a constant reassessment of our understanding and terminology within the field.

The Power of ONNX Runtime

ONNX Runtime is a high-performance engine for executing machine learning models serialized in the ONNX (Open Neural Network Exchange) format. Its compatibility with models trained on various frameworks, including PyTorch and TensorFlow, makes it a versatile tool for deploying AI models across diverse platforms. ONNX Runtime supports both CPUs and GPUs, along with other hardware accelerators, enabling it to adapt seamlessly to different computing environments.

The ONNX Runtime GenAI package is a specialized version optimized for generative AI models. It enhances performance for tasks such as image generation and natural language processing by incorporating specific hardware and software optimizations.

The Raspberry Pi 5: Hardware Foundation for AI

The Raspberry Pi 5, equipped with a quad-core Arm Cortex-A76 processor clocked at 2.4 GHz and a 12-core VideoCore VII 800 MHz GPU, is a capable platform for experimenting with AI. The Cortex-A76, a 64-bit core implementing the Armv8.2-A architecture, is known for its performance and is widely used across a range of devices.
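If you want to confirm what your board reports, a quick check from the shell works (a minimal example, assuming the standard lscpu utility is installed):

    lscpu | grep -E 'Architecture|Model name|CPU\(s\)'

On a Raspberry Pi 5 this should report an aarch64 architecture with four Cortex-A76 cores.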

Setting Up Your Raspberry Pi for AI

This guide will focus on setting up the ONNX Runtime GenAI package from source on a Raspberry Pi 5 running Ubuntu. However, the instructions should be easily adaptable to Raspberry Pi OS.

Step 1: Python Environment

To ensure a clean and organized project environment, we recommend setting up a virtual environment to isolate dependencies. Here's how to create and activate one (the commands are also collected in a single block after the list):

  1. Verify Python Installation: Begin by checking if Python is installed on your system. Run the command python3 -V and verify that the output indicates a Python 3 version.

  2. Create Virtual Environment: Create a new virtual environment using the command python3 -m venv onnxruntime-genai-env.

  3. Activate the Environment: Activate the virtual environment using source onnxruntime-genai-env/bin/activate.
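Collected together, Step 1 looks like this (the environment name onnxruntime-genai-env is just the one used above; any name works):

    python3 -V
    python3 -m venv onnxruntime-genai-env
    source onnxruntime-genai-env/bin/activate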

Step 2: Essential Tools

Before proceeding, ensure your system has the necessary build tools installed. These include the compiler and libraries required for compiling software from source. Install them using the command: sudo apt-get install -y build-essential cmake.

Next, install the development utilities necessary for this guide: sudo apt-get install -y curl wget tar gzip git.
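If you prefer a single step, the package index refresh and both installs can be combined (equivalent to the commands above):

    sudo apt-get update
    sudo apt-get install -y build-essential cmake curl wget tar gzip git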

Step 3: Downloading ONNX Runtime GenAI

  1. Clone the Repository: Clone the ONNX Runtime GenAI project files from the GitHub repository using git clone https://github.com/microsoft/onnxruntime-genai.git onnxruntime-genai. The full command sequence for this step is collected in a block after the list.

  2. Navigate to the Directory: Change directory to onnxruntime-genai using cd onnxruntime-genai.

  3. Check Out the Specific Version: Check out the v0.3.0 tag for compatibility with this guide using git checkout tags/v0.3.0.

  4. Verify the Version: Confirm you are on the correct tag using git status; it should report a detached HEAD at v0.3.0.
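The full sequence for this step, collected in one block:

    git clone https://github.com/microsoft/onnxruntime-genai.git onnxruntime-genai
    cd onnxruntime-genai
    git checkout tags/v0.3.0
    git status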

Step 4: Downloading ONNX Runtime

ONNX Runtime is required to build the GenAI package. To download and set up the correct version, follow these steps (a consolidated example appears after the list):

  1. Fetch the Download Link: Use the command curl -s https://api.github.com/repos/microsoft/onnxruntime/releases | grep browser_download_url | grep 'aarch64' | head -n 1 | cut -d '"' -f 4 to obtain the download link for the version compatible with your system.

  2. Download the Release: Download the specified version using wget <download_link> (replace <download_link> with the output from the previous command).

  3. Extract and Setup: Once downloaded, extract the files and set them up within the onnxruntime-genai directory using the following commands:

    tar -xzvf <downloaded_file_name>.tgz
    mv <extracted_directory_name> ort
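Put together, the download and setup for this step can look like the sketch below. It assumes the release asset follows the onnxruntime-linux-aarch64-<version>.tgz naming used by recent ONNX Runtime releases; adjust the globs if your file is named differently:

    # Grab the first aarch64 release asset URL, download and extract it, then rename the directory to ort
    ORT_URL=$(curl -s https://api.github.com/repos/microsoft/onnxruntime/releases | grep browser_download_url | grep 'aarch64' | head -n 1 | cut -d '"' -f 4)
    wget "$ORT_URL"
    tar -xzvf onnxruntime-linux-aarch64-*.tgz
    mv onnxruntime-linux-aarch64-*/ ort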

Step 5: Building ONNX Runtime GenAI

Navigate to the onnxruntime-genai directory and execute the following command to start the build process:

    sh build.sh --build_dir=build/Linux --config=RelWithDebInfo

The build process may take some time. You will find the Python wheel package in the directory build/Linux/RelWithDebInfo/wheel once the build is complete.
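To confirm the wheel was produced, list that directory once the build completes:

    ls build/Linux/RelWithDebInfo/wheel/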

Step 6: Installing the Built Package

  1. Install NumPy: Install NumPy for CPU inference using pip3 install numpy.

  2. Install ONNX Runtime GenAI: Install the built ONNX Runtime GenAI wheel package using pip3 install build/Linux/RelWithDebInfo/wheel/onnxruntime_genai-<version>-cp312-cp312-linux_aarch64.whl (the cp312 tag corresponds to Python 3.12; adjust the filename to match the wheel your build produced).

  3. Verify Installation: Ensure the installation succeeded by running python3 -c 'import onnxruntime_genai; print(onnxruntime_genai.Model.device_type)'. The output should indicate the device type used by ONNX Runtime GenAI. The commands for this step are collected in a single block below.
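The installation and verification commands from this step, gathered into one block (the wheel filename depends on your Python version and the GenAI version you built, so a glob is used here):

    pip3 install numpy
    pip3 install build/Linux/RelWithDebInfo/wheel/onnxruntime_genai-*-linux_aarch64.whl
    python3 -c 'import onnxruntime_genai; print(onnxruntime_genai.Model.device_type)'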

Step 7: Downloading the Phi-3 ONNX Model

For efficient inference, download the Phi-3 mini 4K instruct model in the ONNX format. This model is optimized for CPU inference and is specifically suited for the Raspberry Pi 5.

Model Background:

  • Model Type: Phi-3-mini is a Transformer-based language model.

  • Parameter Count: 3.8 billion.

  • Training Data: High-quality, educationally valuable datasets, including NLP synthetic texts and chat data from both internal and external sources.

  • Post-Training Optimization: Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) for enhanced chatting capabilities, alignment, robustness, and safety.

Downloading the Model:

  • Method 1: huggingface-cli (Recommended): Install the Hugging Face CLI using pip install -U "huggingface_hub[cli]". Then download only the files needed for CPU inference with huggingface-cli download microsoft/Phi-3-mini-4k-instruct-onnx --include cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/* --local-dir ./Phi-3-mini-4k-instruct-onnx (shown as a pasteable block after this list).

  • Method 2: Git clone with LFS (Alternative): Install and configure Git Large File Storage (Git LFS) using sudo apt-get install git-lfs and git lfs install. Clone the repository using git clone https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-onnx.

  • Method 3: wget (Alternative): Download each file individually using wget <file_url>.
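For convenience, Method 1 as a pasteable block (the same two commands, split across lines for readability):

    pip install -U "huggingface_hub[cli]"
    huggingface-cli download microsoft/Phi-3-mini-4k-instruct-onnx \
        --include cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/* \
        --local-dir ./Phi-3-mini-4k-instruct-onnx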

Step 8: (Optional) Downloading the Mistral 7B ONNX Model

If you prefer exploring alternative options, the Mistral 7B model serves as a viable substitute for Phi-3. This section will guide you on how to download and integrate the Mistral 7B model into your setup.

Model Details:

  • Optimization: Mistral 7B Instruct v0.2 is provided in ONNX format, quantized with INT4 RTN (block size 32) at accuracy level 4, which makes it suitable for efficient CPU inference.

Downloading the Model:

  • Method 1: huggingface-cli (Recommended): Use the command huggingface-cli download microsoft/mistral-7b-instruct-v0.2-ONNX --include onnx/cpu_and_mobile/mistral-7b-instruct-v0.2-cpu-int4-rtn-block-32-acc-level-4/* --local-dir ./mistral-7b-instruct-v0.2-ONNX

Step 9: Running Inference

To run inference, create a Python script named onnxruntime-genai-test.py. Copy the following code into the script:

# ONNX Runtime GenAI Python bindings built in Step 5
import onnxruntime_genai as engine

# Phi-3 mini 4K instruct model path
model_path = "Phi-3-mini-4k-instruct-onnx/cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4"
# Mistral 7B instruct v0.2 model path
# model_path = "mistral-7b-instruct-v0.2-ONNX/onnx/cpu_and_mobile/mistral-7b-instruct-v0.2-cpu-int4-rtn-block-32-acc-level-4"

print(f"Loading {model_path} model\n")

model = engine.Model(f'{model_path}')
tokenizer = engine.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

# Set the maximum number of tokens to generate
# The full example includes additional parameters for more complex configurations
search_options = {'max_length': 2048}

# Template used to format each user turn for Phi-3's chat format
# (if you switch to Mistral, it expects a different template, e.g. [INST] ... [/INST])
chat_tpl = '<|user|>\n{input}<|end|>\n<|assistant|>'

print("Let's chat!\n")

# Multi-turn chat loop. Press Ctrl+C at the input prompt to exit
while True:
    text = input("> User: ")
    if not text:
        print("Please, answer something")
        continue

    # Populate the chat template with user input
    prompt = f'{chat_tpl.format(input=text)}'

    input_tokens = tokenizer.encode(prompt)

    gen_params = engine.GeneratorParams(model)
    gen_params.set_search_options(**search_options)
    gen_params.input_ids = input_tokens
    generator = engine.Generator(model, gen_params)

    print("\n> Assistant: ", end='', flush=True)

    try:
        # Loop to generate and display answer tokens
        while not generator.is_done():
            generator.compute_logits()
            generator.generate_next_token()
            next_token = generator.get_next_tokens()[0]
            print(tokenizer_stream.decode(next_token), end='', flush=True)
        print('\n')
    except KeyboardInterrupt:
        print("\nCtrl+C pressed, interrupting this answer\n")

    # Release the generator before the next user turn
    del generator

Running the Script:

Execute the script using python3 onnxruntime-genai-test.py. The cold start may take some time as the LLM loads. To exit the script, use Ctrl+C.

Conclusion

This guide has empowered you to run LLMs on your Raspberry Pi 5, demonstrating the accessibility and power of AI on less conventional platforms. By leveraging ONNX Runtime GenAI and the capabilities of the Raspberry Pi 5, you can explore the exciting possibilities of generative AI right in your own home. As the field of AI continues to evolve, the Raspberry Pi 5, with its affordability and versatility, offers a gateway to experimentation and exploration for enthusiasts and developers alike.
