mlc-ai / mlc-llm

Enable everyone to develop, optimize and deploy AI models natively on everyone's devices.

Home Page: https://llm.mlc.ai/docs

License: Apache License 2.0

Languages: Python 62.70%, C++ 29.87%, Swift 2.13%, Kotlin 2.01%, Rust 1.41%, Shell 0.53%, Groovy 0.41%, Objective-C++ 0.33%, CMake 0.30%, Objective-C 0.12%, Java 0.09%, Makefile 0.08%, C 0.01%, Batchfile 0.01%
Topics: llm, machine-learning-compilation, language-model, tvm

mlc-llm's Introduction

MLC LLM

Documentation | Blog | Discord

Machine Learning Compilation for Large Language Models (MLC LLM) is a high-performance universal deployment solution that allows native deployment of any large language model with native APIs and compiler acceleration. The mission of this project is to enable everyone to develop, optimize, and deploy AI models natively on everyone's devices with ML compilation techniques.

Universal deployment. MLC LLM supports the following platforms and hardware:

|              | AMD GPU         | NVIDIA GPU      | Apple GPU                      | Intel GPU       |
| ------------ | --------------- | --------------- | ------------------------------ | --------------- |
| Linux / Win  | ✅ Vulkan, ROCm | ✅ Vulkan, CUDA | N/A                            | ✅ Vulkan       |
| macOS        | ✅ Metal (dGPU) | N/A             | ✅ Metal                       | ✅ Metal (iGPU) |
| Web Browser  | ✅ WebGPU and WASM | ✅ WebGPU and WASM | ✅ WebGPU and WASM         | ✅ WebGPU and WASM |
| iOS / iPadOS |                 |                 | ✅ Metal on Apple A-series GPU |                 |
| Android      | ✅ OpenCL on Adreno GPU, ✅ OpenCL on Mali GPU |  |              |                 |

Quick Start

Here we provide quick start examples of the chat CLI, Python API, and REST server for using MLC LLM. We use the 4-bit quantized 8B Llama-3 model for demonstration purposes. The pre-quantized Llama-3 weights are available at https://huggingface.co/mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC. You can also try out the unquantized Llama-3 model by replacing q4f16_1 with q0f16 in the examples below. Please visit our documentation for a detailed quick start and introduction.

Installation

MLC LLM is available via pip; see our documentation (https://llm.mlc.ai/docs) for the platform-specific pip command. It is always recommended to install it in an isolated conda virtual environment.

To verify the installation, activate your virtual environment and run:

python -c "import mlc_llm; print(mlc_llm.__path__)"

You should see the installation path of the MLC LLM Python package.

Chat CLI

We can try out the chat CLI in MLC LLM with the 4-bit quantized 8B Llama-3 model.

mlc_llm chat HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC

The first run of this command may take 1-2 minutes. After that, the command launches a chat interface where you can enter your prompt and chat with the model.

You can use the following special commands:
/help               print the special commands
/exit               quit the cli
/stats              print out the latest stats (token/sec)
/reset              restart a fresh chat
/set [overrides]    override settings in the generation config. For example,
                      `/set temperature=0.5;max_gen_len=100;stop=end,stop`
                      Note: Separate stop words in the `stop` option with commas (,).
Multi-line input: Use escape+enter to start a new line.

user: What's the meaning of life
assistant:
What a profound and intriguing question! While there's no one definitive answer, I'd be happy to help you explore some perspectives on the meaning of life.

The concept of the meaning of life has been debated and...

Python API

We can run the Llama-3 model with the chat completion Python API of MLC LLM. You can save the code below into a Python file and run it.

from mlc_llm import LLMEngine

# Create engine
model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
engine = LLMEngine(model)

# Run a streaming chat completion through the OpenAI-compatible API.
for response in engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is the meaning of life?"}],
    model=model,
    stream=True,
):
    for choice in response.choices:
        print(choice.delta.content, end="", flush=True)
print("\n")

engine.terminate()

The Python API of mlc_llm.LLMEngine fully aligns with the OpenAI API. You can use LLMEngine in the same way you would use OpenAI's Python package, for both synchronous and asynchronous generation.
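For example, a non-streaming request can be issued and read back the same way as with OpenAI's client. The following is a minimal sketch that assumes the response object follows OpenAI's choices[0].message.content layout:

from mlc_llm import LLMEngine

model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
engine = LLMEngine(model)

# Non-streaming chat completion; we assume the response mirrors
# OpenAI's object layout (choices[0].message.content).
response = engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is the meaning of life?"}],
    model=model,
    stream=False,
)
print(response.choices[0].message.content)

engine.terminate()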

If you would like to do concurrent asynchronous generation, you can use mlc_llm.AsyncLLMEngine instead.
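Below is a minimal sketch of concurrent asynchronous generation. It assumes AsyncLLMEngine exposes the same chat.completions.create interface as the synchronous engine and that the streaming call is awaited and iterated with async for; please check the documentation for the exact semantics.

import asyncio

from mlc_llm import AsyncLLMEngine

model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"


async def ask(engine: AsyncLLMEngine, prompt: str) -> str:
    # Assumption: the streaming call mirrors the synchronous API,
    # but is awaited and iterated with `async for`.
    chunks = []
    async for response in await engine.chat.completions.create(
        messages=[{"role": "user", "content": prompt}],
        model=model,
        stream=True,
    ):
        for choice in response.choices:
            chunks.append(choice.delta.content or "")
    return "".join(chunks)


async def main() -> None:
    engine = AsyncLLMEngine(model)
    # Issue two requests concurrently.
    answers = await asyncio.gather(
        ask(engine, "What is the meaning of life?"),
        ask(engine, "Write a haiku about compilers."),
    )
    for answer in answers:
        print(answer, "\n")
    engine.terminate()


asyncio.run(main())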

REST Server

We can launch a REST server to serve the 4-bit quantized Llama-3 model for OpenAI chat completion requests. The server provides full OpenAI API compatibility.

mlc_llm serve HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC

The server listens at http://127.0.0.1:8000 by default, and you can use --host and --port to set a different host and port. When the server is ready (showing INFO: Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)), open a new shell and send a cURL request with the following command:

curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
        "model": "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC",
        "messages": [
            {"role": "user", "content": "Hello! Our project is MLC LLM. What is the name of our project?"}
        ]
  }' \
  http://127.0.0.1:8000/v1/chat/completions
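
Because the endpoint is OpenAI-compatible, the same request can also be sent from Python. Below is a minimal sketch using the third-party requests library (not part of MLC LLM), assuming the response body follows OpenAI's chat completion JSON layout:

import requests

payload = {
    "model": "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC",
    "messages": [
        {
            "role": "user",
            "content": "Hello! Our project is MLC LLM. What is the name of our project?",
        }
    ],
}

# Same endpoint as the cURL example above.
resp = requests.post(
    "http://127.0.0.1:8000/v1/chat/completions", json=payload, timeout=300
)
resp.raise_for_status()
# Assumption: the response JSON mirrors OpenAI's chat completion schema.
print(resp.json()["choices"][0]["message"]["content"])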

Universal Deployment APIs

MLC LLM provides multiple sets of APIs across platforms and environments. These include the chat CLI, Python API, and REST server shown above, as well as packages for iOS/iPadOS, Android, and web browsers (WebLLM).

Citation

Please consider citing our project if you find it useful:

@software{mlc-llm,
    author = {MLC team},
    title = {{MLC-LLM}},
    url = {https://github.com/mlc-ai/mlc-llm},
    year = {2023}
}

The underlying techniques of MLC LLM include:

References
@inproceedings{tensorir,
    author = {Feng, Siyuan and Hou, Bohan and Jin, Hongyi and Lin, Wuwei and Shao, Junru and Lai, Ruihang and Ye, Zihao and Zheng, Lianmin and Yu, Cody Hao and Yu, Yong and Chen, Tianqi},
    title = {TensorIR: An Abstraction for Automatic Tensorized Program Optimization},
    year = {2023},
    isbn = {9781450399166},
    publisher = {Association for Computing Machinery},
    address = {New York, NY, USA},
    url = {https://doi.org/10.1145/3575693.3576933},
    doi = {10.1145/3575693.3576933},
    booktitle = {Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2},
    pages = {804--817},
    numpages = {14},
    keywords = {Tensor Computation, Machine Learning Compiler, Deep Neural Network},
    location = {Vancouver, BC, Canada},
    series = {ASPLOS 2023}
}

@inproceedings{metaschedule,
    author = {Shao, Junru and Zhou, Xiyou and Feng, Siyuan and Hou, Bohan and Lai, Ruihang and Jin, Hongyi and Lin, Wuwei and Masuda, Masahiro and Yu, Cody Hao and Chen, Tianqi},
    booktitle = {Advances in Neural Information Processing Systems},
    editor = {S. Koyejo and S. Mohamed and A. Agarwal and D. Belgrave and K. Cho and A. Oh},
    pages = {35783--35796},
    publisher = {Curran Associates, Inc.},
    title = {Tensor Program Optimization with Probabilistic Programs},
    url = {https://proceedings.neurips.cc/paper_files/paper/2022/file/e894eafae43e68b4c8dfdacf742bcbf3-Paper-Conference.pdf},
    volume = {35},
    year = {2022}
}

@inproceedings{tvm,
    author = {Tianqi Chen and Thierry Moreau and Ziheng Jiang and Lianmin Zheng and Eddie Yan and Haichen Shen and Meghan Cowan and Leyuan Wang and Yuwei Hu and Luis Ceze and Carlos Guestrin and Arvind Krishnamurthy},
    title = {{TVM}: An Automated {End-to-End} Optimizing Compiler for Deep Learning},
    booktitle = {13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18)},
    year = {2018},
    isbn = {978-1-939133-08-3},
    address = {Carlsbad, CA},
    pages = {578--594},
    url = {https://www.usenix.org/conference/osdi18/presentation/chen},
    publisher = {USENIX Association},
    month = oct,
}

Links

  • You might want to check out our online public Machine Learning Compilation course for a systematic walkthrough of our approaches.
  • WebLLM is a companion project using MLC LLM's WebGPU and WebAssembly backend.
  • WebStableDiffusion is a companion project for diffusion models with the WebGPU backend.

mlc-llm's People

Contributors

anibohara2000, bbuf, charliefruan, cyx-6, davidpissarra, hzfengsy, jeethu, jinhongyii, junrushao, kartik14, kathryn-cat, leshengjin, lunderberg, masahi, masterjh5574, nverke, rickzx, sbelcmu, sing-li, spectrometerhbh, sudeepag, sunggg, tlopex, tqchen, ubospica, vinx13, yongwww, yuchenjin, yzh119, zxybazh


mlc-llm's Issues

Comparison with hidet

Hi, do you happen to know how the optimization performance of this TVM-based solution might compare to https://github.com/hidet-org/hidet for NVIDIA edge devices (like the NVIDIA Jetson Xavier)?

I'm curious which of the two can make LLMs smaller/faster for these devices.

I assume they are not compatible, but please let me know if I can use both.

Android support

Pretty self-explanatory: can we get an Android version of this?

CPU offloading

Incredible project. I managed to run the model at good speed on my AMD hardware, thanks!
I have a question: do you have any plans to support offloading the weights, so that bigger models like 13B or 30B can run with less VRAM?

Invalid bitcast %222 = bitcast <8 x i32> %221 to <8 x half>

Output from running mlc_chat_cli:

WARNING: lavapipe is not a conformant vulkan implementation, testing use only.
Use lib /cluster/2024mgagvani/dist/lib/vicuna-v1-7b_vulkan_float16.so
Initializing the chat module...
Finish loading
You can use the following special commands:
  /help    print the special commands
  /exit    quit the cli
  /stats   print out the latest stats (token/sec)
  /reset   restart a fresh chat

USER: what is 1 + 1
ASSISTANT: Invalid bitcast
  %222 = bitcast <8 x i32> %221 to <8 x half>
LLVM ERROR: Broken function
Aborted (core dumped)

Debug info:

(mlc-chat) 2024mgagvani@snowy:~$ uname -r
5.15.0-67-generic

System Specs: AMD Threadripper 1950X CPU and Nvidia GeForce 2080 GPU.

python chat.py can not be run

This doesn't actually provide a loadable model in a Hugging Face repo, so how can the tokenizer load?

OSError: ./dist/models/vicuna-v1-7b does not appear to have a file named config.json. Checkout 'https://huggingface.co/./dist/models/vicuna-v1-7b/None' for available files.

Also, do I need TVM built with Vulkan enabled to run the demo .so?

  File "E:\codes\libs\relax\src\runtime\c_runtime_api.cc", line 131
TVMError:
---------------------------------------------------------------
An error occurred during the execution of TVM.
For more information, please see: https://tvm.apache.org/docs/errors.html
---------------------------------------------------------------
  Check failed: (allow_missing) is false: Device API vulkan is not enabled.

Very slow on Mac

Tried on a Mac with the spec below, and the response is very slow. Is there any way to speed it up?

Spec:
2.6 GHz 6-Core Intel Core i7
Intel UHD Graphics 630 1536 MB
32 GB 2667 MHz DDR4

Works like a charm!

Just wanted to report that this works perfectly on my GTX 1060 (6 GB) with my old i5-7200 and 16 GB RAM under Win10. So far, I have never reached such speed with any other existing solution (oobabooga, textsynth, llama.cpp). Not a single issue during install. I can't tell exactly, but it's surely a couple of tokens/sec during inference. I need a deeper dive to get a feeling for the quality, as it seems to be a model quantized in int3?
Now we want more: more models, 13B size, parameter access (temp, top-p, etc.), and an API. Anyway, I think this is great work already!

Android port

Hopefully we can expect an Android port as well.

Error: Vulkan Error, code=-3: VK_ERROR_INITIALIZATION_FAILED

mlc_chat_cli
terminate called after throwing an instance of 'tvm::runtime::InternalError'
what(): [03:45:52] /home/runner/work/utils/utils/tvm/src/runtime/vulkan/vulkan_instance.cc:144:

An error occurred during the execution of TVM.
For more information, please see: https://tvm.apache.org/docs/errors.html

Check failed: (__e == VK_SUCCESS) is false: Vulkan Error, code=-3: VK_ERROR_INITIALIZATION_FAILED
Stack trace:
[bt] (0) /home/xt/anaconda3/envs/mlc-chat/bin/../lib/libtvm_runtime.so(tvm::runtime::Backtraceabi:cxx11+0x27) [0x7fa4f1a06b77]
[bt] (1) /home/xt/anaconda3/envs/mlc-chat/bin/../lib/libtvm_runtime.so(+0x3f375) [0x7fa4f19a4375]
[bt] (2) /home/xt/anaconda3/envs/mlc-chat/bin/../lib/libtvm_runtime.so(tvm::runtime::vulkan::VulkanInstance::GetPhysicalDevices() const+0x3e9) [0x7fa4f1af20a9]
[bt] (3) /home/xt/anaconda3/envs/mlc-chat/bin/../lib/libtvm_runtime.so(tvm::runtime::vulkan::VulkanDeviceAPI::VulkanDeviceAPI()+0x13f) [0x7fa4f1aefe2f]
[bt] (4) /home/xt/anaconda3/envs/mlc-chat/bin/../lib/libtvm_runtime.so(tvm::runtime::vulkan::VulkanDeviceAPI::Global()+0x4c) [0x7fa4f1af00cc]
[bt] (5) /home/xt/anaconda3/envs/mlc-chat/bin/../lib/libtvm_runtime.so(+0x18b10d) [0x7fa4f1af010d]
[bt] (6) /home/xt/anaconda3/envs/mlc-chat/bin/../lib/libtvm_runtime.so(+0x6bf04) [0x7fa4f19d0f04]
[bt] (7) /home/xt/anaconda3/envs/mlc-chat/bin/../lib/libtvm_runtime.so(+0x6c4a7) [0x7fa4f19d14a7]
[bt] (8) mlc_chat_cli(+0xe4f0) [0x55772fb6f4f0]

My GPU is an NVIDIA A100. Is the A100 supported? Or which version of Vulkan should I install?

Expose API

Thanks for your effort.
Do you plan to add an API layer on top of this, so it can be used as a local API?

In my scenario, I'd like to host your library in a Docker instance and query it via an API to feed a custom application I've started (see https://github.com/MithrilMan/AIdentities).
It would make use of several models, not just for text generation, but of course here I'm only interested in the text-generation part.

Ideally the API should have endpoints for completions and embeddings.

Is this something you plan to have?

The sample installation code doesn't work on Macbook ARM

Hello there,

I found this problem while executing the sample code given for installation on Macbook M1. How should I resolve this?

An error occurred during the execution of TVM.
For more information, please see: https://tvm.apache.org/docs/errors.html
  Check failed: (lib_handle_ != nullptr) is false: Failed to load dynamic shared library /Users/dist/lib/vicuna-v1-7b_metal_float16.so dlopen(/Users/dist/lib/vicuna-v1-7b_metal_float16.so, 0x0005): tried: '/Users/dist/lib/vicuna-v1-7b_metal_float16.so' (mach-o file, but is an incompatible architecture (have 'arm64', need 'x86_64'))
Stack trace:
  [bt] (0) 1   libtvm_runtime.dylib                0x0000000108f16c98 tvm::runtime::Backtrace() + 24
  [bt] (1) 2   libtvm_runtime.dylib                0x0000000108ee3929 tvm::runtime::detail::LogFatal::Entry::Finalize() + 89
  [bt] (2) 3   libtvm_runtime.dylib                0x0000000108ee38c9 tvm::runtime::detail::LogFatal::~LogFatal() + 25
  [bt] (3) 4   libtvm_runtime.dylib                0x0000000108ede159 tvm::runtime::detail::LogFatal::~LogFatal() + 9
  [bt] (4) 5   libtvm_runtime.dylib                0x0000000108f04b2d tvm::runtime::DSOLibrary::Load(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) + 269
  [bt] (5) 6   libtvm_runtime.dylib                0x0000000108f04d2f tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::$_0> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) + 175
  [bt] (6) 7   libtvm_runtime.dylib                0x0000000108f1e9b6 tvm::runtime::Module::LoadFromFile(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) + 598
  [bt] (7) 8   mlc_chat_cli                        0x000000010062e4c2 main + 8402
  [bt] (8) 9   dyld                                0x000000020082051e start + 462

CMake Error

I am new to TVM, and I encountered an error while compiling it according to the instructions. I cannot install it successfully. It seems that TVM or something else cannot be found. What could be the reason?

Here are the specific error messages:
-- Set TVM_LLVM_VERSION=170
-- Build with contrib.random
-- Build with contrib.sort
-- Build with contrib.hybriddump
-- Git found: /usr/bin/git
-- Found TVM_GIT_COMMIT_HASH=838ec67e9376f5a606ce6c8bb0a9b773e5f63833
-- Found TVM_GIT_COMMIT_TIME=2023-05-03 16:37:51 -0400
-- Could NOT find LIBBACKTRACE (missing: LIBBACKTRACE_STATIC_LIBRARY LIBBACKTRACE_INCLUDE_DIR)
-- Building libbacktrace from 3rdparty/libbacktrace
-- Building with TVM Map...
-- Build with thread support...
-- Check if compiler accepts -pthread
-- Check if compiler accepts -pthread - yes
-- Added "-fuse-ld=lld" to linker flags
-- Configuring done
CMake Error: The following variables are used in this project, but they are set to NOTFOUND.
Please set them or make sure they are set and tested correctly in the CMake files:
FOUNDATION_LIB
linked by target "tvm" in directory /home/panz/project/mlc-llm/relax
linked by target "tvm_runtime" in directory /home/panz/project/mlc-llm/relax
METAL_LIB
linked by target "tvm" in directory /home/panz/project/mlc-llm/relax
linked by target "tvm_runtime" in directory /home/panz/project/mlc-llm/relax

-- Generating done
CMake Generate step failed. Build files cannot be regenerated correctly.

Chinese variants model support

Since Vicuna doesn't support Chinese very well, is there a way to support some better-performing Chinese models in the demo?

Missing instructions on installing additional models

Hey there, congratulations on a great release! The app works great on a Mac and the installation was very straightforward.

Do you have plans to grow mlc_chat_cli into a standalone tool, or is it meant to be a proof of concept? The Readme claims the project can be used to run 'any language model', but there are no instructions for how to do that. Furthermore, the code seems to indicate that only three models are supported right now; is that right?

Unless the mlc_chat_cli is supposed to be a toy demo, could you please add instructions for:

  1. which models are supported (e.g. would RNN based models like https://github.com/BlinkDL/RWKV-LM work or is it just transformers)?
  2. which formats, quantization methods and directory structures are supported - i.e. I don't think grabbing a random link from HF and cloning it the same way Vicuna was installed during original installation (git clone https://huggingface.co/mlc-ai/demo-vicuna-v1-7b-int3 dist/vicuna-v1-7b) would work, right?
  3. it seems that there is a template/profile system for different LLM families, how do we add additional templates? Does it require patch/pull-request or can it be done by tweaking a config file somewhere?
  4. the Readme mentions multiple optimizations, but the mlc_chat_cli doesn't expose that info/settings to the user. How do we tweak those?
  5. given that the claim in the Readme is that all language models are supported, there should be some kind of rough guide on how to calculate the hardware requirements (e.g. what LLMs can my machine run using this tool, with what quantization and performance?). As a comparison, the llama.cpp Readme isn't well-structured, but it does provide a good overview of RAM requirements for a given model size and of the impact of different quantization techniques on performance.

Also, it would be very neat if you mentioned in the Readme what kind of community interactions you are aiming for. Would you prefer that people build their own tools that use mlc-llm as a backend, or that they send PRs improving mlc_chat_cli?

How to `tune_relax` for other targets

It seems like the tuning is per device, although the m1 tuning is applied when using any GPU.
How would I use relax_integration.tune_relax on mod_deploy to create other databases?

I tried to figure it out myself, but got stuck on the error:
Check failed: (int_imm) is false: TypeError: Expect the extent of a loop to be IntImm, but gets: tir.Var
from measure_flops.

Consider supporting 8bit quantization

Based on experimenting with GPTQ-for-LLaMa, int4 quantization seems to introduce 3-5% drop in perplexity, while int8 is almost identical to fp16. Would it be possible to use int8 quantization with mlc-llm, assuming the model fits in VRAM in int8?

Driver versions needed?

Does this not support AMD GPUs?

I'm getting this error:

terminate called after throwing an instance of 'tvm::runtime::InternalError'
  what():  [18:53:29] /home/runner/work/utils/utils/tvm/src/runtime/vulkan/vulkan_instance.cc:111: 
---------------------------------------------------------------
An error occurred during the execution of TVM.
For more information, please see: https://tvm.apache.org/docs/errors.html
---------------------------------------------------------------
  Check failed: (__e == VK_SUCCESS) is false: Vulkan Error, code=-9: VK_ERROR_INCOMPATIBLE_DRIVER
Stack trace:
  [bt] (0) /home/loganm/miniconda3/envs/llama_index/bin/../lib/libtvm_runtime.so(tvm::runtime::Backtrace[abi:cxx11]()+0x27) [0x7f6fa5518a37]
  [bt] (1) /home/loganm/miniconda3/envs/llama_index/bin/../lib/libtvm_runtime.so(+0x3f375) [0x7f6fa54b6375]
  [bt] (2) /home/loganm/miniconda3/envs/llama_index/bin/../lib/libtvm_runtime.so(tvm::runtime::vulkan::VulkanInstance::VulkanInstance()+0x1a47) [0x7f6fa5605857]
  [bt] (3) /home/loganm/miniconda3/envs/llama_index/bin/../lib/libtvm_runtime.so(tvm::runtime::vulkan::VulkanDeviceAPI::VulkanDeviceAPI()+0x40) [0x7f6fa5601a60]
  [bt] (4) /home/loganm/miniconda3/envs/llama_index/bin/../lib/libtvm_runtime.so(tvm::runtime::vulkan::VulkanDeviceAPI::Global()+0x4c) [0x7f6fa5601dfc]
  [bt] (5) /home/loganm/miniconda3/envs/llama_index/bin/../lib/libtvm_runtime.so(+0x18ae3d) [0x7f6fa5601e3d]
  [bt] (6) /home/loganm/miniconda3/envs/llama_index/bin/../lib/libtvm_runtime.so(+0x6bdc4) [0x7f6fa54e2dc4]
  [bt] (7) /home/loganm/miniconda3/envs/llama_index/bin/../lib/libtvm_runtime.so(+0x6c367) [0x7f6fa54e3367]
  [bt] (8) mlc_chat_cli(+0xd950) [0x55ed543c8950]


Aborted

Here's my driver info (screenshot omitted).

Better instructions needed

I think the project has too little information on adjusting its configuration.

For example, how do I load different weights apart from the demo-provided ones? How do I adjust the temperature? I don't have any clues.

It would be better to write more instructions.

dolly 12b 3bit cuda out of memory on my wsl 3070 laptop card

mlc_chat_cli --model dolly-v2-12b_int3 --dtype float32
Use lib /root/mlcai/dist/dolly-v2-12b_int3/float32/dolly-v2-12b_int3_cuda_float32.so
Initializing the chat module...
Finish loading
You can use the following special commands:
/help print the special commands
/exit quit the cli
/stats print out the latest stats (token/sec)
/reset restart a fresh chat

Instruction: hello

Response: [13:55:51] /root/mlcai/relax/src/runtime/relax_vm/pooled_allocator.h:64: Warning: PooledAllocator got InternalError during allocation:


An error occurred during the execution of TVM.
For more information, please see: https://tvm.apache.org/docs/errors.html

Check failed: (e == cudaSuccess || e == cudaErrorCudartUnloading) is false: CUDA: out of memory
[13:55:51] /root/mlcai/relax/src/runtime/relax_vm/pooled_allocator.h:65: Warning: Trying to release all unused memory and reallocate...
terminate called after throwing an instance of 'tvm::runtime::InternalError'
what(): [13:55:51] /root/mlcai/relax/include/tvm/runtime/device_api.h:291: unknown type =0
Stack trace:
0: _ZN3tvm7runtime8relax_vm13MemoryManager
1: _ZN3tvm7runtime18SimpleObjAllocator7HandlerIN
2: tvm::runtime::relax_vm::VMAllocStorage(void*, tvm::runtime::ShapeTuple, long, DLDataType) [clone .cold.318]
3: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<tvm::runtime::relax_vm::Storage (void*, tvm::runtime::ShapeTuple, long, DLDataType)>::AssignTypedLambda<tvm::runtime::relax_vm::Storage ()(void, tvm::runtime::ShapeTuple, long, DLDataType)>(tvm::runtime::relax_vm::Storage ()(void, tvm::runtime::ShapeTuple, long, DLDataType), std::__cxx11::basic_string<char, std::char_traits, std::allocator >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
4: tvm::runtime::relax_vm::VirtualMachineImpl::InvokeClosurePacked(tvm::runtime::ObjectRef const&, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
5: tvm::runtime::relax_vm::VirtualMachineImpl::RunInstrCall(tvm::runtime::relax_vm::VMFrame*, tvm::runtime::relax_vm::Instruction)
6: tvm::runtime::relax_vm::VirtualMachineImpl::RunLoop()
7: tvm::runtime::relax_vm::VirtualMachineImpl::InvokeBytecode(long, std::vector<tvm::runtime::TVMRetValue, std::allocatortvm::runtime::TVMRetValue > const&)
8: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::relax_vm::VirtualMachineImpl::GetClosureInternal(tvm::runtime::String const&, bool)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
9: tvm::runtime::relax_vm::VirtualMachineImpl::InvokeClosurePacked(tvm::runtime::ObjectRef const&, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
10: mlc::llm::LLMChatModule::Forward(tvm::runtime::NDArray, long)
11: mlc::llm::LLMChatModule::EncodeStep(std::__cxx11::basic_string<char, std::char_traits, std::allocator >)
12: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<mlc::llm::LLMChatModule::GetFunction(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, tvm::runtime::ObjectPtrtvm::runtime::Object const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#3}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
13: Chat(tvm::runtime::Module, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, long, double, double, long, int, int, double)
14: main
15: 0x00007f2565fbb78f
16: __libc_start_main
17: 0x000055d148d5c314

My GPU memory is 8 GB, which I think is enough to run this model. The screenshot of VRAM usage after loading is omitted here; it runs on WSL2.

Has anyone gotten this working on windows 10?

I'm on Windows 10 Enterprise, Intel Xeon CPU E5-1620 v4 at 3.5 GHz, 32 GB RAM, with an NVIDIA NVS 310.

I installed the linked driver for vulkan, and followed the steps but when running the command the program will not launch.

The same steps work in my mac... except the vulkan part.

Is there a better way to debug/find the exception? It's my first time using Anaconda.

You can see from the red arrows in the screenshot (omitted here) that nothing happens. It waits a second, then stops.

How do i use this in my own programs?

Hi there! I'm new to programming. I really want to try to add AI to my own little program. For example, ChatGPT has an API I can talk to using Python code. Does this project allow something like that? Can I just import mlc-llm and get answers using Python?
Sorry for my bad English.

Running into WebGPU device error from the web demo, using Chrome 112

When trying this out at https://mlc.ai/web-llm/#chat-demo, it gave me the following error. My Chrome is 112.0.5615.137 and there is no update option in the settings.

Find an error initializing the WebGPU device Error: Cannot initialize runtime because of requested maxBufferSize exceeds limit. requested=1024MB, limit=256MB. This error may be caused by an older version of the browser (e.g. Chrome 112). You can try to upgrade your browser to Chrome 113 or later.
Init error, Error: Find an error initializing WebGPU: Error: Cannot initialize runtime because of requested maxBufferSize exceeds limit. requested=1024MB, limit=256MB. This error may be caused by an older version of the browser (e.g. Chrome 112). You can try to upgrade your browser to Chrome 113 or later.

Will this work on an iPhone 14 pro?

I absolutely love the idea of this repo, and am very hopeful about its future. I loved it so much that I managed to get the download off TestFlight. However, every time I open the app, it crashes.

Your page says

Try out this TestFlight page (limited to the first 9000 users)

Is it because 9000 users installed it before me? Is that why it's crashing? Or do I have the wrong iPhone 14 pro?

tvm::runtime::InternalError relax/src/runtime/relax_vm/lm_support.cc:247 Check failed: uniform_sample <= data[0].first (0.0715982 vs. nan)

I am trying to build the iOS app from source and everything is OK except for running the app on the iPhone: the app shows it is ready to chat, but after sending a message the app crashes, and Xcode shows:

libc++abi: terminating due to uncaught exception of type tvm::runtime::InternalError: [15:39:04] /Users/relax/src/runtime/relax_vm/lm_support.cc:247:

An error occurred during the execution of TVM.
For more information, please see: https://tvm.apache.org/docs/errors.html

Check failed: uniform_sample <= data[0].first (0.0715982 vs. nan) :
Stack trace:
[bt] (0) 1 MLCChat 0x0000000104c9f094 tvm::runtime::detail::LogFatal::Entry::Finalize() + 116
[bt] (1) 2 MLCChat 0x0000000104c9f020 tvm::runtime::detail::LogFatal::Entry::Finalize() + 0
[bt] (2) 3 MLCChat 0x0000000104c9e51c __clang_call_terminate + 0
[bt] (3) 4 MLCChat 0x0000000104d4f8a8 tvm::runtime::relax_vm::SampleTopPFromLogits(tvm::runtime::NDArray, double, double, double) + 1544
[bt] (4) 5 MLCChat 0x0000000104d557dc void tvm::runtime::TypedPackedFunc<int (tvm::runtime::NDArray, double, double, double)>::AssignTypedLambda<int ()(tvm::runtime::NDArray, double, double, double)>(int ()(tvm::runtime::NDArray, double, double, double), std::__1::basic_string<char, std::__1::char_traits, std::__1::allocator>)::'lambda'(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)::operator()(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*) const + 232
[bt] (5) 6 MLCChat 0x0000000104cc35ec mlc::llm::LLMChatModule::SampleFromLogitsOnCPU() + 348
[bt] (6) 7 MLCChat 0x0000000104cc1e60 mlc::llm::LLMChatModule::EncodeStep(std::__1::basic_string<char, std::__1::char_traits, std::__1::allocator>) + 504
[bt] (7) 8 MLCChat 0x0000000104cc1b54 mlc::llm::LLMChatModule::GetFunction(std::__1::basic_string<char, std::__1::char_traits, std::__1::allocator> const&, tvm::runtime::ObjectPtrtvm::runtime::Object const&)::'lambda1'(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const + 124
[bt] (8) 9 MLCChat 0x0000000104cc1acc tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<mlc::llm::LLMChatModule::GetFunction(std::__1::basic_string<char, std::__1::char_traits, std::__1::allocator> const&, tvm::runtime::ObjectPtrtvm::runtime::Object const&)::'lambda1'(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)>>::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) + 40

Docker support

Please consider adding a Dockerfile and a docker-compose file to the repository.

Are tuning scripts available?

It seems like the tuning is per device, although the m1 tuning is applied when using any GPU.
How would I use relax_integration.tune_relax on mod_deploy to create other databases?

Check failed: (__e == VK_SUCCESS) is false: Vulkan Error, code=-2: VK_ERROR_OUT_OF_DEVICE_MEMORY

Win10 x64, 8 GB RAM, NVIDIA GeForce 940MX.
After the mlc_chat_cli command:

Use lib E:\Code\test\mlc-chat\dist\lib\vicuna-v1-7b_vulkan_float16.dll
Initializing the chat module...
[16:56:46] D:\a\utils\utils\tvm\src\runtime\vulkan\vulkan_buffer.cc:61:

An error occurred during the execution of TVM.
For more information, please see: https://tvm.apache.org/docs/errors.html

Check failed: (__e == VK_SUCCESS) is false: Vulkan Error, code=-2: VK_ERROR_OUT_OF_DEVICE_MEMORY
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

An alternative python interface for MLC LLM

It seems there is no official support for interacting with MLC LLM through a Python API yet. For the convenience of anyone who wants to develop a program that integrates MLC LLM, I wrote some simple code that runs mlc_chat_cli in a subprocess and redirects its input/output through pipes, so that anyone can plug MLC LLM into their code. I'd be glad to see an official Python interface someday.
See code at https://github.com/XinyuSun/mlc-chatbot
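
For reference, below is a minimal sketch of the subprocess-and-pipe approach described above. It is not code from the linked repository; it assumes mlc_chat_cli is on PATH and simply scripts one prompt plus the special commands quoted earlier in this thread.

import subprocess

# Feed a scripted conversation to mlc_chat_cli over stdin and capture its output.
# The /stats and /exit special commands come from the CLI help text quoted above.
script = "What's the meaning of life?\n/stats\n/exit\n"

result = subprocess.run(
    ["mlc_chat_cli"],
    input=script,
    capture_output=True,
    text=True,
    timeout=600,
)
print(result.stdout)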
 

[Survey] Supported Hardwares and Speed

UPDATE (08/09/2023):

We have done a major performance overhaul in the past few months, and now I'm happy to share the latest results:

============================================================

Hi everyone,

We are looking to gather data points on running MLC-LLM on different hardware and platforms. Our goal is to create a comprehensive reference for new users. Please share your own experiences in this thread! Thank you for your help!

NOTE: for benchmarking, we highly recommend a device with at least 6 GB of memory, because the model itself already takes 2.9 GB. For this reason, it is known that the iOS app will crash on a 4 GB iPhone.

AMD GPUs

| Hardware/GPU | OS | Tokens/sec | Source | Notes |
| --- | --- | --- | --- | --- |
| RX 6600XT (8G) | N/A | 28.3 | GitHub | |
| RX 6750XT | openSUSE Tumbleweed | 8.9 - 154.3 | GitHub | |
| RX 6700XT | Windows 11 | 33.7 | GitHub | |
| APU 5800H | Windows 11 | 8.5 | GitHub | |
| Radeon RX 470 (4G) | AlmaLinux 9.1 | 9.4 | GitHub | |
| Radeon Pro 5300M | macOS Ventura | 12.6 | @junrushao | Intel MBP 16" (late 2019) |
| AMD GPU on Steam Deck | Steam Deck's Linux | TBD | Reddit | |
| RX 6800 (16G VRAM) | macOS Ventura | 22.5 | GitHub | Intel MBP 13" (2020) |
| Radeon RX 6600 (8GB) | Ubuntu 22.04 | 7.0 | Reddit | |
| RX 7900 XTX | | | Reddit | |

Macbook

| Hardware/GPU | OS | Tokens/sec | Source | Notes |
| --- | --- | --- | --- | --- |
| 2020 MacBook Pro M1 (8G) | macOS | 11.4 | GitHub | |
| 2021 MacBook Pro M1 Pro (16G) | macOS Ventura | 17.1 | GitHub | |
| M1 Max Mac Studio (64G) | N/A | 18.6 | GitHub | |
| 2021 MacBook Pro M1 Max (32G) | macOS Monterey | 21.0 | GitHub | |
| MacBook Pro M2 (16G) | macOS Ventura | 22.5 | GitHub | |
| 2021 MacBook M1 Pro (32G) | macOS Ventura | 19.3 | GitHub | |

Intel GPUs

| Hardware/GPU | OS | Tokens/sec | Source | Notes |
| --- | --- | --- | --- | --- |
| Arc A770 | N/A | 3.1 - 118.6 | GitHub | Perf issues in decoding need investigation |
| UHD Graphics (Comet Lake-U GT2) 1G | Windows 10 | 2.2 | GitHub | |
| UHD Graphics 630 | macOS Ventura | 2.3 | @junrushao | Integrated GPU. Intel MBP 16" (late 2019) |
| Iris Plus Graphics 1536 MB | macOS Ventura | 2.6 | GitHub | Integrated GPU on MBP |
| Iris Plus Graphics 645 1536 MB | macOS Ventura | 2.9 | GitHub | Integrated GPU on MBP |

NVIDIA GPUs

| Hardware/GPU | OS | Tokens/sec | Source | Notes |
| --- | --- | --- | --- | --- |
| GTX 1650 Ti (4GB) | Fedora | 15.6 | GitHub | |
| GTX 1060 (6GB) | Windows 10 | 16.7 | GitHub | |
| RTX 3080 | Windows 11 | 26.0 | GitHub | |
| RTX 3060 | Debian bookworm | 21.3 | GitHub | |
| RTX 2080 Ti | Windows 10 | 24.5 | GitHub | |
| RTX 3090 | N/A | 25.7 | GitHub | |
| GTX 1660 Ti | N/A | 23.9 | GitHub | |
| RTX 3070 | N/A | 23.3 | GitHub | |

iOS

| Hardware/GPU | OS | Tokens/sec | Source | Notes |
| --- | --- | --- | --- | --- |
| iPhone 14 Pro | iOS 16.4.1 | 7.2 | @junrushao | |
| iPad Pro 11" with M1 | iPadOS 16.1 | 10.6 | GitHub | |
| iPad Pro 11" A12Z | N/A | 4.1 | GitHub | |
| iPad Pro 11" with M2 (4th gen) | iPadOS 16.5 | 14.1 | GitHub | |

Android

| Hardware/GPU | OS | Tokens/sec | Link | Notes |
| --- | --- | --- | --- | --- |

Looks amazing! Where's the code to compile dist/libs? Would like to try on Intel macOS

I tried WebLLM the other week and was really blown away. I have an Intel macOS system with AMD 6900XT GPU and using WebLLM was the first time I'd had decent GPU inference on this system.

Now I'd love to try mlc-llm as well. I followed the instructions, but the pre-built Metal lib for macOS is built for ARM64/Silicon.

Where can I find the source for this so I can try compiling it myself?

Running mlc-llm Python code on Windows fails

Hi, I tried to run the Python code (generated myself) and got some errors:

Check failed: (it != self_->idx_sub_.end()) is false:

I built the TVM unity branch and it imports correctly, but I get a runtime error when running python .\tests\chat.py

Full traceback:

Traceback (most recent call last):
  File "E:\codes\ai\aichat\mlc-llm\tests\chat.py", line 13, in <module>
    from mlc_llm import utils
  File "e:\codes\ai\aichat\mlc-llm\mlc_llm\__init__.py", line 2, in <module>
    from . import transform
  File "e:\codes\ai\aichat\mlc-llm\mlc_llm\transform\__init__.py", line 1, in <module>
    from .dispatch_tir_operator import DispatchTIROperator
  File "e:\codes\ai\aichat\mlc-llm\mlc_llm\transform\dispatch_tir_operator.py", line 6371, in <module>
    get_dict_key(fused_min_max_triu_te_broadcast_to): fused_min_max_triu_te_broadcast_to_sch_func(),
  File "e:\codes\ai\aichat\mlc-llm\mlc_llm\transform\dispatch_tir_operator.py", line 30, in fused_min_max_triu_te_broadcast_to_sch_func
    sch.reverse_compute_inline(b0)
  File "e:\codes\libs\tvm\python\tvm\tir\schedule\_type_checker.py", line 339, in wrap
    return func(*args, **kwargs)
  File "e:\codes\libs\tvm\python\tvm\tir\schedule\schedule.py", line 2218, in reverse_compute_inline
    _ffi_api.ScheduleReverseComputeInline(self, block)  # type: ignore # pylint: disable=no-member
  File "e:\codes\libs\tvm\python\tvm\_ffi\_ctypes\packed_func.py", line 237, in __call__
    raise get_last_ffi_error()
tvm._ffi.base.TVMError: Traceback (most recent call last):
  File "E:\codes\libs\tvm\src\tir\schedule\primitive\compute_inline.cc", line 537
TVMError:
---------------------------------------------------------------
An error occurred during the execution of TVM.
For more information, please see: https://tvm.apache.org/docs/errors.html
---------------------------------------------------------------
  Check failed: (it != self_->idx_sub_.end()) is false:

Any help here?

Error:VK_ERROR_INCOMPATIBLE_DRIVER

My environment is Win10 WSL2 with Ubuntu 22.04, and I followed these commands:
conda create -n mlc-chat
conda activate mlc-chat
conda install git git-lfs
conda install -c mlc-ai -c conda-forge mlc-chat-nightly
mkdir -p dist
git lfs install
git clone https://huggingface.co/mlc-ai/demo-vicuna-v1-7b-int3 dist/vicuna-v1-7b
git clone https://github.com/mlc-ai/binary-mlc-llm-libs.git dist/lib

However, when I run this command:
mlc_chat_cli
an error happens (screenshot omitted).
