The world of artificial intelligence is rapidly advancing, and with it, the capabilities of smaller and more accessible hardware are expanding. Recently, I was inspired by a video on the NVIDIA AI channel showcasing the Gemma 2 language model, boasting 2 billion parameters, running on a Jetson Orin Nano. This prompted me to delve into the world of running LLMs on the Raspberry Pi 5, a device known for its affordability and versatility. My journey led me to the Phi-3 mini 4K Instruct, a 3.8 billion parameter LLM from Microsoft, which surprisingly runs smoothly on the Raspberry Pi 5 using the ONNX Runtime GenAI framework.
Verify Python Installation: Begin by checking whether Python is installed on your system. Run python3 -V and confirm that the output reports a Python 3 version.
Create Virtual Environment: Create a new virtual environment using the command python3 -m venv onnxruntime-genai-env.
Activate the Environment: Activate the virtual environment using source onnxruntime-genai-env/bin/activate. The three commands are collected below for convenience.
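Taken together, the environment setup looks like this (a minimal sketch; the environment name onnxruntime-genai-env is simply the one used throughout this post):

python3 -V                                   # should report a Python 3 version
python3 -m venv onnxruntime-genai-env        # create the virtual environment
source onnxruntime-genai-env/bin/activate    # activate it (the shell prompt changes)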
Clone the Repository: Clone the ONNX Runtime GenAI project files from the GitHub repository using git clone https://github.com/microsoft/onnxruntime-genai.git onnxruntime-genai.
Navigate to the Directory: Change into the onnxruntime-genai directory using cd onnxruntime-genai.
Checkout the Specific Version: Check out version v0.3.0 for compatibility using git checkout tags/v0.3.0.
Verify the Version: Ensure you are on the correct version using git status. The full sequence is summarized below.
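In one pass (assuming git is already installed on the Pi):

git clone https://github.com/microsoft/onnxruntime-genai.git onnxruntime-genai
cd onnxruntime-genai
git checkout tags/v0.3.0   # pin to the release used in this post
git status                 # should report HEAD detached at v0.3.0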
Fetch the Download Link: Use the command curl -s https://api.github.com/repos/microsoft/onnxruntime/releases | grep browser_download_url | grep 'aarch64' | head -n 1 | cut -d '"' -f 4 to obtain the download link for the version compatible with your system.
Download the Release: Download the specified version using wget <download_link> (replace <download_link> with the output of the previous command).
Extract and Setup: Once downloaded, extract the files and set them up within the onnxruntime-genai directory using the following commands:
tar -xzvf <downloaded_file_name>.tgz
mv <extracted_directory_name> ort
A scripted version of these steps is sketched below.
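If you prefer not to copy the link and file names by hand, the same steps can be chained with a shell variable. This is only a sketch: it assumes the release tarball extracts to a directory with the same base name as the archive, which is how the official ONNX Runtime release archives are laid out.

# grab the first aarch64 release asset URL from the GitHub API
ORT_URL=$(curl -s https://api.github.com/repos/microsoft/onnxruntime/releases \
  | grep browser_download_url | grep 'aarch64' | head -n 1 | cut -d '"' -f 4)
wget "$ORT_URL"
ORT_TGZ=$(basename "$ORT_URL")    # e.g. onnxruntime-linux-aarch64-<version>.tgz
tar -xzvf "$ORT_TGZ"
mv "${ORT_TGZ%.tgz}" ort          # place it under ./ort, as in the step above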
Build the Project: From the onnxruntime-genai directory, start the build with:
sh build.sh --build_dir=build/Linux --config=RelWithDebInfo
Install NumPy: Install NumPy for CPU inference using pip3 install numpy.
Install ONNX Runtime GenAI: Install the built ONNX Runtime GenAI wheel package using pip3 install build/Linux/RelWithDebInfo/wheel/onnxruntime_genai-<version>-cp312-cp312-linux_aarch64.whl.
Verify Installation: Ensure successful installation by running the Python command python3 -c 'import onnxruntime_genai; print(onnxruntime_genai.Model.device_type)'. The output should indicate the device type used by ONNX Runtime GenAI. These commands are gathered below.
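Gathered together (the wheel glob is a convenience so you do not have to type the exact version; adjust it if more than one wheel was built or if your Python version tag differs):

pip3 install numpy
pip3 install build/Linux/RelWithDebInfo/wheel/onnxruntime_genai-*.whl
python3 -c 'import onnxruntime_genai; print(onnxruntime_genai.Model.device_type)'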
Model Type: Phi-3-mini is a Transformer-based language model.
Parameter Count: 3.8 billion.
Training Data: High-quality, educationally valuable datasets, including synthetic NLP texts and chat data from both internal and external sources.
Post-Training Optimization: Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) for enhanced chat capabilities, alignment, robustness, and safety.
Method 1: huggingface-cli (Recommended): Install huggingface-cli using pip install -U "huggingface_hub[cli]". Download only the necessary data using the command huggingface-cli download microsoft/Phi-3-mini-4k-instruct-onnx --include cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/* --local-dir ./Phi-3-mini-4k-instruct-onnx. The recommended route is shown below as a ready-to-run pair of commands.
Method 2: Git clone with LFS (Alternative): Install and configure Git Large File Storage (Git LFS) using sudo apt-get install git-lfs and git lfs install. Clone the repository using git clone https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-onnx.
Method 3: wget (Alternative): Download each file individually using wget <file_url>.
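The recommended route, end to end. Only the INT4 CPU variant is fetched, which is the folder the chat script further down points at:

pip install -U "huggingface_hub[cli]"
huggingface-cli download microsoft/Phi-3-mini-4k-instruct-onnx \
  --include cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/* \
  --local-dir ./Phi-3-mini-4k-instruct-onnx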
Optimization: The Mistral 7B Instruct v0.2 model is also provided in ONNX format, quantized with INT4 RTN (block size 32, accuracy level 4) for efficient CPU inference.
Method 1: huggingface-cli (Recommended): Use the command huggingface-cli download microsoft/mistral-7b-instruct-v0.2-ONNX --include onnx/cpu_and_mobile/mistral-7b-instruct-v0.2-cpu-int4-rtn-block-32-acc-level-4/* --local-dir ./mistral-7b-instruct-v0.2-ONNX
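With the model files in place, the Python chat example below loads Phi-3 mini with ONNX Runtime GenAI and runs a simple multi-turn chat in the terminal. To try Mistral 7B instead, comment out the Phi-3 model_path line and uncomment the Mistral one.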
# Import the ONNX Runtime GenAI library built and installed earlier
import onnxruntime_genai as engine
# Phi-3 mini 4K instruct model path
model_path = "Phi-3-mini-4k-instruct-onnx/cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4"
# Mistral 7B instruct v0.2 model path
# model_path = "mistral-7b-instruct-v0.2-ONNX/onnx/cpu_and_mobile/mistral-7b-instruct-v0.2-cpu-int4-rtn-block-32-acc-level-4"
print(f"Loading {model_path} model\n")
model = engine.Model(f'{model_path}')
tokenizer = engine.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()
# Set the maximum number of tokens to generate
# The full example includes additional parameters for more complex configurations
search_options = {'max_length': 2048}
# Template used to format multi-turn conversations in the chat interface
chat_tpl = '<|user|>\n{input}<|end|>\n<|assistant|>'
print("Let's chat!\n")
# Multi-turn chat loop. Ctrl+C during generation interrupts the answer;
# Ctrl+C at the prompt exits the program
while True:
    text = input("> User: ")
    if not text:
        print("Please, answer something")
        continue

    # Populate the chat template with user input
    prompt = f'{chat_tpl.format(input=text)}'
    input_tokens = tokenizer.encode(prompt)

    gen_params = engine.GeneratorParams(model)
    gen_params.set_search_options(**search_options)
    gen_params.input_ids = input_tokens
    generator = engine.Generator(model, gen_params)

    print("\n> Assistant: ", end='', flush=True)
    try:
        # Loop to generate and display answer tokens
        while not generator.is_done():
            generator.compute_logits()
            generator.generate_next_token()
            next_token = generator.get_next_tokens()[0]
            print(tokenizer_stream.decode(next_token), end='', flush=True)
        print('\n')
    except KeyboardInterrupt:
        print("\nCtrl+C pressed, break\n")

    # Delete the generator
    del generator
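To chat with the model, save the script to a file (phi3_chat.py is just an example name) and run it from the activated virtual environment:

python3 phi3_chat.py

You should see the "Loading ... model" message, then "Let's chat!", and finally the > User: prompt waiting for your first question.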