This folder contains an example of optimizing the Phi-3-Mini-4K-Instruct model, from Hugging Face or the Azure Machine Learning Model Catalog, for different hardware targets with Olive.
Install the dependencies:
```
pip install -r requirements.txt
```
- einops
- PyTorch >= 2.2.0. The official website offers packages compatible with CUDA 11.8 and 12.1; please select the appropriate version according to your needs.
- onnxruntime >= 1.18.0
- onnxruntime-genai >= 0.2.0. If you target GPU, please install the GPU packages for onnxruntime and onnxruntime-genai, for example as shown below.
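As an illustration, assuming a CUDA-enabled environment, the GPU variants can be installed as follows (the package names follow the ONNX Runtime and onnxruntime-genai distributions; verify them against your CUDA version):
```
pip install onnxruntime-gpu
pip install onnxruntime-genai-cuda
```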
If you have not logged in to your Hugging Face account:
- Install the Hugging Face CLI and log in to your Hugging Face account for model access
```
huggingface-cli login
```
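If the Hugging Face CLI is not already installed, one common way to get it (a pip install, assumed here rather than taken from this example's requirements) is:
```
pip install -U "huggingface_hub[cli]"
```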
- Install Olive with the Azure Machine Learning dependency
```
pip install olive-ai[azureml]
```
If you have not logged in to your Azure account:
- Install the Azure Command-Line Interface (CLI) following this link
- Run
```
az login
```
to log in to your Azure account so that Olive can access the model.
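After logging in, you can confirm that the expected subscription is active with a standard Azure CLI command:
```
az account show
```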
We will use the phi3.py script to fine-tune and optimize the model for a chosen hardware target by running the following commands.
```
python phi3.py [--target HARDWARE_TARGET] [--precision DATA_TYPE] [--source SOURCE] [--finetune_method METHOD] [--inference] [--prompt PROMPT] [--max_length LENGTH]

# Examples
python phi3.py --target mobile
python phi3.py --target mobile --source AzureML
python phi3.py --target mobile --inference --prompt "Write a story starting with once upon a time" --max_length 200
python phi3.py --target cuda --finetune_method lora --inference --prompt "Write a story starting with once upon a time" --max_length 200

# qlora introduces quantization into the base model, which is not supported by onnxruntime-genai as of now!
python phi3.py --target cuda --finetune_method qlora
```
- `--target`: cpu, cuda, mobile, web
- `--finetune_method`: optional. The method used for fine-tuning. Options: `qlora`, `lora`. Default is none. Note that onnxruntime-genai only supports the `lora` method as of now.
- `--precision`: optional, for data precision. fp32 or int4 (default) for the cpu target; fp32, fp16, or int4 (default) for GPU targets; int4 (default) for mobile or web.
- `--source`: optional, for the model source. HF or AzureML. HF (Hugging Face model) by default.
- `--inference`: optional, for non-web model inference/validation.
- `--prompt`: optional, the prompt text fed into the model. Takes effect only when `--inference` is set.
- `--max_length`: optional, the max length of the output from the model. Takes effect only when `--inference` is set.
This script includes the following steps:
- Generate the Olive configuration file for the chosen hardware target
- Fine-tune the model with the lora or qlora method on the `nampdn-ai/tiny-codes` dataset
- Generate the optimized model with Olive based on the configuration file for the chosen hardware target
- (optional) Run inference on the optimized model with the ONNX Runtime Generate() API for non-web targets, as sketched below
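As an illustration of the last step, here is a minimal sketch of loading and running an optimized model with the ONNX Runtime Generate() API. The model path and search options are assumptions for this example, and the exact API surface may differ slightly across onnxruntime-genai versions:
```python
import onnxruntime_genai as og

# Path to the Olive-optimized model folder (assumed location for this sketch)
model_path = "models/phi3_cpu_int4"

model = og.Model(model_path)
tokenizer = og.Tokenizer(model)

prompt = "Write a story starting with once upon a time"
input_tokens = tokenizer.encode(prompt)

# Configure generation: cap the output length, then attach the input tokens
params = og.GeneratorParams(model)
params.set_search_options(max_length=200)
params.input_ids = input_tokens

# Generate and decode the first (and only) output sequence
output_tokens = model.generate(params)[0]
print(tokenizer.decode(output_tokens))
```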
If you have an Olive configuration file, you can also run the olive command directly for model generation:
```
olive run [--config CONFIGURATION_FILE]

# Examples
olive run --config phi3_mobile_int4.json
```
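For reference, an Olive configuration file is a JSON description of the input model and the optimization passes to apply. The snippet below is a heavily trimmed, illustrative sketch, not the actual content generated by phi3.py: field names follow Olive's general config schema, which can vary across Olive versions, and the generated files contain more passes and settings.
```json
{
  "input_model": {
    "type": "PyTorchModel",
    "config": {
      "hf_config": { "model_name": "microsoft/Phi-3-mini-4k-instruct" }
    }
  },
  "passes": {
    "builder": { "type": "ModelBuilder", "config": { "precision": "int4" } }
  },
  "engine": { "output_dir": "models" }
}
```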