
ChatGLM.cpp

License: MIT

A C++ implementation of ChatGLM-6B. Run quantized int4 model inference on your MacBook in a minute.


Features

  • Pure C++ implementation based on ggml, working in the same way as llama.cpp.
  • Accelerated memory-efficient CPU inference with int4/int8 quantization, optimized KV cache and parallel computing.
  • Streaming generation with typewriter effect.
  • Python binding, web demo, and more possibilities.

Getting Started

Preparation

Clone the ChatGLM.cpp repository to your local machine:

git clone --recursive https://github.com/li-plus/chatglm.cpp.git && cd chatglm.cpp

If you forgot the --recursive flag when cloning the repository, run the following command in the chatglm.cpp folder:

git submodule update --init --recursive

Quantize Model

Use convert.py to transform ChatGLM-6B into the quantized GGML format. For example, to convert the fp16 ChatGLM-6B model to a q4_0 (int4-quantized) GGML model, run:

python3 convert.py -i THUDM/chatglm-6b -t q4_0 -o chatglm-ggml.bin

For a LoRA model, specify the -l <lora_model_name_or_path> flag to merge your LoRA weights into the base model.
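
For example, the q4_0 conversion above can be combined with the LoRA merge in one command (a sketch using the flags documented here; substitute your own adapter for the placeholder):

python3 convert.py -i THUDM/chatglm-6b -l <lora_model_name_or_path> -t q4_0 -o chatglm-ggml.bin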

Build & Run

Compile the project using CMake:

cmake -B build
cmake --build build -j

Now you may chat with the quantized ChatGLM model by running:

./build/bin/main -m chatglm-ggml.bin -q 你好
# 你好👋!我是人工智能助手 ChatGLM-6B,很高兴见到你,欢迎问我任何问题。
# (Hello 👋! I am the AI assistant ChatGLM-6B. Nice to meet you, feel free to ask me any question.)

To run ChatGLM in interactive mode, add the -i flag:

./build/bin/main -m chatglm-ggml.bin -i

In interactive mode, your chat history will serve as the context for the next-round conversation.

Run ./build/bin/main -h to explore more options!

Using BLAS

A BLAS library can be integrated to further accelerate matrix multiplication. However, in some cases BLAS may actually degrade performance, so whether to turn it on should be decided by benchmarking.

Accelerate Framework

The Accelerate Framework is automatically enabled on macOS. To disable it, add the CMake flag -DGGML_NO_ACCELERATE=ON.
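
For example, to rebuild with Accelerate disabled:

cmake -B build -DGGML_NO_ACCELERATE=ON
cmake --build build -j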

OpenBLAS

OpenBLAS provides acceleration on CPU. Add the CMake flag -DGGML_OPENBLAS=ON to enable it.

cmake -B build -DGGML_OPENBLAS=ON
cmake --build build -j

cuBLAS

cuBLAS uses an NVIDIA GPU to accelerate BLAS operations. Add the CMake flag -DGGML_CUBLAS=ON to enable it.

cmake -B build -DGGML_CUBLAS=ON
cmake --build build -j

Note that the current GGML CUDA implementation is still quite slow; the community is working on optimizing it.

Python Binding

To install the Python binding from source, run:

# install from the latest source hosted on GitHub
pip install git+https://github.com/li-plus/chatglm.cpp.git@main
# or install from your local source
pip install .
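
Once installed, the binding can also be used directly from Python. The snippet below is a minimal sketch: it assumes the package exposes a Pipeline class whose chat method takes the conversation history as a list of strings, so check it against the installed version.

import chatglm_cpp

# load the quantized GGML model produced by convert.py
pipeline = chatglm_cpp.Pipeline("chatglm-ggml.bin")

# single-turn chat: pass the history (here a single user message) and print the reply
print(pipeline.chat(["你好"]))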

Run the Python example to chat with the quantized ChatGLM model:

python3 cli_chat.py -m chatglm-ggml.bin -i

You may also launch a web demo to chat in your browser:

python3 web_demo.py -m chatglm-ggml.bin


Performance

Measured on a Linux server with Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz using 16 threads.

             Q4_0    Q8_0    F16     F32
ms/token     92      130     217     399
file size    3.3GB   6.2GB   12GB    23GB
mem usage    4.0GB   6.9GB   13GB    24GB

Development

  • To perform unit tests, add the CMake flag -DCHATGLM_ENABLE_TESTING=ON, recompile, and run ./build/bin/chatglm_test. To run the benchmark only, use ./build/bin/chatglm_test --gtest_filter=ChatGLM.benchmark (see the combined commands after this list).
  • To format the code, run cmake --build build --target lint. You should have clang-format, black, and isort pre-installed.
  • To check for performance issues, add the CMake flag -DGGML_PERF=ON. It will show timing for each graph operation when running the model.
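
Putting the unit-test and benchmark steps together (the same build commands as in Build & Run, plus the testing flag):

cmake -B build -DCHATGLM_ENABLE_TESTING=ON
cmake --build build -j
./build/bin/chatglm_test
# benchmark only
./build/bin/chatglm_test --gtest_filter=ChatGLM.benchmark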

Acknowledgements

  • This project is greatly inspired by ggerganov's llama.cpp and is based on his NN library ggml.
  • Thanks to THUDM for the amazing ChatGLM-6B and for releasing the model sources and checkpoints.
