
llama.onnx

News

05/?? deploy to aarch64

05/09 TensorRT outputs wrong values until issue 2928 is solved

04/19 remove GPTQ zero point guidance

04/18 export mixed-precision quant table from GPTQ-for-LLaMa

04/11 add 13GB onnx-fp16 models

04/11 add memory pool, support 2GB RAM laptop ⭐

04/10 reduce onnx model size to 26GB

04/10 support temperature and top-k logits warpers

04/07 add onnxruntime demo

04/05 init project

Features

  • Release llama 7B onnx models
  • With a 400-line onnxruntime alpaca demo
    • neither torch nor transformers required
    • supports a memory pool and works on a 2GB laptop/PC (very slow 🐢); a minimal sketch of the idea follows this list
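
The memory pool idea, roughly: instead of keeping every decoder part resident, load each exported onnx part on demand, run it, and release it before loading the next. Below is a minimal sketch with onnxruntime, assuming the model directory contains sequentially named decoder part files that each take a single hidden-state input; the real file names, extra inputs (masks, position ids), and the pool used in demo-single.py differ.

import glob
import os

import numpy as np
import onnxruntime as ort


def run_decoder_parts(model_dir: str, hidden: np.ndarray) -> np.ndarray:
    """Run exported decoder parts one by one, keeping only one session in RAM."""
    # Hypothetical naming scheme; check the downloaded model directory for the real names.
    part_paths = sorted(glob.glob(os.path.join(model_dir, "decoder-*.onnx")))
    for path in part_paths:
        sess = ort.InferenceSession(path, providers=["CPUExecutionProvider"])
        input_name = sess.get_inputs()[0].name
        hidden = sess.run(None, {input_name: hidden})[0]  # take the first output
        del sess  # drop this part before the next one is loaded
    return hidden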

Why do this ?

  1. Visualization. graphviz crashes on the llama model; an LLM visualization tool must support nesting or operator folding.
  2. Quantization. LLMs often repeat themselves, just like fractals. For llama quantization, loading part of the decoder backbone (about 400MB) would be enough; it can be quantized part by part.
  3. Embedded devices. On small boards, IO errors occur when dd-ing a single big file.
  4. Distributed systems. Running LLM inference across many hybrid (FPGA/NPU/GPGPU) devices would be simple.
  5. onnx tools. Device manufacturers already support onnx well; there is no reason to neglect it.

Usage

Download onnx models here:

Precision   Size   URL
fp32        26GB   huggingface
fp16        13GB   huggingface or 硬件模型库 (hardware model zoo)

Here is the graph showing how to call them:

Try the onnxruntime demo; torch is not required, and the precision has been checked.

$ python3 -m pip install -r requirements.txt
$ python3 demo-single.py ${FP16_ONNX_DIR} "bonjour"
..
# If you only have 4GB memory, use `--poolsize`
$ python3 demo-single.py ${FP16_ONNX_DIR} "bonjour" --poolsize 4
..
Bonjour.

# Try more options
$ python3 demo-single.py --help

Export onnx

STEP1 Convert to HF format

These models were converted from the alpaca huggingface weights.

  • If you are using LLaMa or llama.cpp, convert it to HF format first. Here are the steps:

    # install transformers master
    $ git clone https://github.com/huggingface/transformers
    $ cd transformers && python3 setup.py install
    ..
    $ python3 src/transformers/models/llama/convert_llama_weights_to_hf.py  --input_dir ${LLaMa_PATH}  --model_size 7B  --output_dir ${HF_PATH}
  • If you are using alpaca-lora, use this script to merge the LoRA weights (a hedged sketch follows this list).

  • If you are using alpaca, go to STEP2.
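
For reference, merging LoRA weights into the base model generally looks like the sketch below. This uses the peft library rather than the exact script linked above, and all paths are placeholders.

import torch
from peft import PeftModel
from transformers import LlamaForCausalLM

BASE_PATH = "path/to/llama-7b-hf"          # placeholder: HF-format base model
LORA_PATH = "path/to/alpaca-lora-adapter"  # placeholder: LoRA adapter weights
OUT_PATH = "path/to/merged-alpaca-7b"

base = LlamaForCausalLM.from_pretrained(BASE_PATH, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, LORA_PATH)
model = model.merge_and_unload()  # fold the LoRA deltas into the base weights
model.save_pretrained(OUT_PATH)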

STEP2 torch.onnx.export

Check out transformers on this hacked branch, then run a single inference.

$ python3 tools/export-onnx.py ${PATH_ALPACA_7B}
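
At its core the export is a torch.onnx.export call on the HF model. The sketch below shows the idea in a single call; the real tools/export-onnx.py relies on the hacked transformers branch and splits the model into several onnx parts, so the file names, input names, and opset here are assumptions.

import torch
from transformers import LlamaForCausalLM

MODEL_PATH = "path/to/alpaca-7b-hf"  # placeholder for ${PATH_ALPACA_7B}

model = LlamaForCausalLM.from_pretrained(MODEL_PATH, torch_dtype=torch.float32)
model.config.use_cache = False    # keep past_key_values out of the traced graph
model.config.return_dict = False  # make forward() return a plain tuple for tracing
model.eval()

# Dummy prompt of 8 token ids; only the shape matters for tracing.
dummy_input_ids = torch.ones(1, 8, dtype=torch.long)

torch.onnx.export(
    model,
    (dummy_input_ids,),
    "llama-7b-fp32.onnx",
    input_names=["input_ids"],
    output_names=["logits"],
    dynamic_axes={"input_ids": {1: "seq_len"}, "logits": {1: "seq_len"}},
    opset_version=14,
)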

STEP3 convert to fp16/tvm

Use onnxconverter-common.float16

$ cd tools
$ python3 -m pip install -r requirements.txt
$ python3 convert-fp32-to-fp16.py ${FP32_PATH} ${FP16_PATH}
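
The script is essentially a thin wrapper around onnxconverter_common.float16. A minimal sketch of the same conversion (the real convert-fp32-to-fp16.py may handle external weight data and blocked ops differently):

import sys

import onnx
from onnxconverter_common import float16

fp32_path, fp16_path = sys.argv[1], sys.argv[2]

model = onnx.load(fp32_path)
# keep_io_types=True leaves the graph inputs/outputs in fp32 while the
# weights and internal computation are converted to fp16.
model_fp16 = float16.convert_float_to_float16(model, keep_io_types=True)
onnx.save(model_fp16, fp16_path)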

Or use relay.vm to convert to tvm:

$ cd tools
$ python3 convert-to-tvm.py ${ONNX_PATH} ${OUT_DIR}
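
Roughly, the tvm path imports the onnx graph with relay and compiles it with the relay VM. A minimal sketch, assuming a fixed input name and shape and an llvm CPU target; the real convert-to-tvm.py may pick targets and shapes differently.

import onnx
import tvm
from tvm import relay

ONNX_PATH = "path/to/model.onnx"  # placeholder for ${ONNX_PATH}
OUT_DIR = "path/to/out"           # placeholder for ${OUT_DIR}

onnx_model = onnx.load(ONNX_PATH)
# Assumed input name and shape; adjust to match the exported graph.
shape_dict = {"input_ids": (1, 32)}
mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)

with tvm.transform.PassContext(opt_level=3):
    vm_exec = relay.vm.compile(mod, target="llvm", params=params)

code, lib = vm_exec.save()
lib.export_library(f"{OUT_DIR}/model.so")
with open(f"{OUT_DIR}/model.ro", "wb") as f:
    f.write(code)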

Quantization

Mixed-precision kernel optimization is on the way. Here is part of the guidance.

Notes

  1. No logits_processor or BeamSearch is implemented, so the results may not be good
  2. I have compared the output values of onnxruntime-cpu and torch-cuda; the maximum error is 0.002, which is not bad
  3. The current state is equivalent to this configuration:
     temperature=0.1
     total_tokens=2000
     top_p=1.0
     top_k=40
     repetition_penalty=1.0
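
For concreteness, with top_p=1.0 and repetition_penalty=1.0 being no-ops, the configuration above amounts to temperature plus top-k logits warping followed by sampling. A minimal numpy sketch of that step (not the demo's exact code):

import numpy as np


def sample_next_token(logits: np.ndarray, temperature: float = 0.1, top_k: int = 40) -> int:
    """Pick the next token id from raw logits with temperature + top-k warping."""
    logits = logits / temperature                  # sharpen (or flatten) the distribution
    top_ids = np.argsort(logits)[-top_k:]          # keep only the k most likely candidates
    top_logits = logits[top_ids]
    probs = np.exp(top_logits - top_logits.max())  # numerically stable softmax over the top-k
    probs /= probs.sum()
    return int(np.random.choice(top_ids, p=probs))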

Acknowledgements

License

GPLv3
