
LLaMA_MPS

Run LLaMA inference on Apple Silicon GPUs.

Demo

As you can see, unlike other LLMs, LLaMA is not biased in any way 😄

Setup

1. Clone this repo

git clone https://github.com/jankais3r/LLaMA_MPS

2. Download the model weights and put them into a folder called models (e.g., LLaMA_MPS/models/7B)

3. Install Python dependencies

python3 -m venv env
source env/bin/activate
pip3 install -r requirements.txt
pip3 install -e .

4. (Optional) Reshard the model weights (13B/30B/65B)

Since we are running the inference on a single GPU, we need to merge the larger models' weights into a single file.

mv models/13B models/13B_orig
mkdir models/13B
python3 reshard.py 1 models/13B_orig models/13B
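Conceptually, resharding concatenates each tensor from the per-GPU shard files along its model-parallel split dimension, while replicated tensors (e.g. norms) are taken from a single shard. The sketch below illustrates that merge logic with NumPy arrays and a hypothetical split-axis table; the real reshard.py operates on the torch checkpoint files and knows the actual per-layer split dimensions.

```python
import numpy as np

# Hypothetical per-layer split axes for illustration: some layers are split
# along axis 0 across GPUs, others along axis 1 (assumption, not reshard.py's
# actual table).
SPLIT_AXIS = {"wq": 0, "wo": 1}

def merge_shards(shard_dicts, split_axis):
    """Merge a list of per-GPU state dicts into one single-GPU state dict."""
    merged = {}
    for name in shard_dicts[0]:
        parts = [shard[name] for shard in shard_dicts]
        axis = split_axis.get(name)
        if axis is None:
            merged[name] = parts[0]  # replicated tensor: keep one copy
        else:
            merged[name] = np.concatenate(parts, axis=axis)
    return merged

# Two dummy shards standing in for a 2-GPU checkpoint
shards = [
    {"wq": np.zeros((2, 4)), "wo": np.zeros((4, 2)), "norm": np.ones(4)},
    {"wq": np.zeros((2, 4)), "wo": np.zeros((4, 2)), "norm": np.ones(4)},
]
merged = merge_shards(shards, SPLIT_AXIS)
print(merged["wq"].shape, merged["wo"].shape, merged["norm"].shape)
# → (4, 4) (4, 4) (4,)
```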

5. Run the inference

python3 chat.py --ckpt_dir models/13B --tokenizer_path models/tokenizer.model --max_batch_size=8 --max_seq_len=256

Memory requirements

| Model | Starting memory during inference | Peak memory during checkpoint conversion | Peak memory during resharding |
| ----- | -------------------------------- | ---------------------------------------- | ----------------------------- |
| 7B    | 16 GB                            | 14 GB                                    | N/A                           |
| 13B   | 32 GB                            | 37 GB                                    | 45 GB                         |
| 30B   | 66 GB                            | 76 GB                                    | 125 GB                        |
| 65B   | ?? GB                            | ?? GB                                    | ?? GB                         |

Min specs per model (slow due to swapping):

  • 7B - 16 GB RAM
  • 13B - 32 GB RAM
  • 30B - 64 GB RAM
  • 65B - needs testing

Recommended specs per model:

  • 7B - 24 GB RAM
  • 13B - 48 GB RAM
  • 30B - 96 GB RAM
  • 65B - needs testing

Parameters to experiment with

- max_batch_size

If you have spare memory (e.g., when running the 13B model on a 64 GB Mac), you can increase the batch size by using the --max_batch_size=32 argument. Default value is 1.

- max_seq_len

To increase/decrease the length of the generated text, use the --max_seq_len=256 argument. Default value is 512.

- use_repetition_penalty

The example script penalizes the model for generating repetitive content. This should lead to higher-quality output, but it slightly slows down inference. Run the script with the --use_repetition_penalty=False argument to disable the penalty algorithm.
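A common penalty scheme (chat.py's exact variant may differ) scales down the logit of every token that has already been generated before sampling the next one, which makes repeats progressively less likely. A minimal sketch of that idea:

```python
import numpy as np

def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Reduce the probability of tokens already present in the output.

    Positive logits are divided by the penalty and negative logits are
    multiplied by it, so the adjustment always lowers the token's score.
    """
    logits = logits.copy()
    for tok in set(generated_ids):
        if logits[tok] > 0:
            logits[tok] /= penalty
        else:
            logits[tok] *= penalty
    return logits

logits = np.array([2.0, -1.0, 0.5])
# Tokens 0 and 1 were already generated, so both get penalized
penalized = apply_repetition_penalty(logits, generated_ids=[0, 1])
```

With penalty=1.2, the logit 2.0 drops to about 1.67 and -1.0 drops to -1.2, while the unseen token's logit 0.5 is untouched.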

Alternatives

The best alternative to LLaMA_MPS for Apple Silicon users is llama.cpp, a C/C++ re-implementation that runs inference purely on the CPU cores of the SoC. Because compiled C code is much faster than Python, it can actually beat this MPS implementation in speed, albeit at the cost of much worse power and heat efficiency.

See the comparison below when deciding which implementation better fits your use case.

| Implementation       | Total run time (256 tokens) | Tokens/s | Peak memory use | Peak SoC temperature | Peak SoC power consumption | Tokens per 1 Wh |
| -------------------- | --------------------------- | -------- | --------------- | -------------------- | -------------------------- | --------------- |
| LLaMA_MPS (13B fp16) | 75 s                        | 3.41     | 30 GB           | 79 °C                | 10 W                       | 1,228.80        |
| llama.cpp (13B fp16) | 70 s                        | 3.66     | 25 GB           | 106 °C               | 35 W                       | 376.16          |
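The efficiency column follows directly from the run time and power draw: tokens per Wh = tokens / (watts × seconds / 3600). Reproducing the table's figures:

```python
def tokens_per_wh(tokens, seconds, watts):
    # Energy consumed in watt-hours = watts * seconds / 3600
    return tokens / (watts * seconds / 3600)

print(round(tokens_per_wh(256, 75, 10), 2))  # LLaMA_MPS → 1228.8
print(round(tokens_per_wh(256, 70, 35), 2))  # llama.cpp → 376.16
```

So although llama.cpp finishes the 256-token run slightly faster, its 35 W draw means LLaMA_MPS generates roughly 3.3× more tokens per watt-hour.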

Credits

  • jankais3r
