mlc-ai / mlc-llm

Enable everyone to develop, optimize and deploy AI models natively on everyone's devices.

Home Page: https://llm.mlc.ai/docs

License: Apache License 2.0

Languages: Python 62.70%, C++ 29.87%, Swift 2.13%, Kotlin 2.01%, Rust 1.41%, Shell 0.53%, Groovy 0.41%, Objective-C++ 0.33%, CMake 0.30%, Objective-C 0.12%, Java 0.09%, Makefile 0.08%, C 0.01%, Batchfile 0.01%
Topics: llm, machine-learning-compilation, language-model, tvm

mlc-llm's Introduction

MLC LLM

Documentation | Blog | Discord

Machine Learning Compilation for Large Language Models (MLC LLM) is a high-performance universal deployment solution that allows native deployment of any large language model with native APIs and compiler acceleration. The mission of this project is to enable everyone to develop, optimize, and deploy AI models natively on everyone's devices with ML compilation techniques.

Universal deployment. MLC LLM supports the following platforms and hardware:

|              | AMD GPU         | NVIDIA GPU      | Apple GPU                      | Intel GPU       |
| ------------ | --------------- | --------------- | ------------------------------ | --------------- |
| Linux / Win  | ✅ Vulkan, ROCm | ✅ Vulkan, CUDA | N/A                            | ✅ Vulkan       |
| macOS        | ✅ Metal (dGPU) | N/A             | ✅ Metal                       | ✅ Metal (iGPU) |
| Web Browser  | ✅ WebGPU and WASM | ✅ WebGPU and WASM | ✅ WebGPU and WASM         | ✅ WebGPU and WASM |
| iOS / iPadOS |                 |                 | ✅ Metal on Apple A-series GPU |                 |
| Android      | ✅ OpenCL on Adreno GPU, ✅ OpenCL on Mali GPU |  |              |                 |

Quick Start

Here we provide quick start examples of the chat CLI, Python API, and REST server for using MLC LLM. We use the 4-bit quantized 8B Llama-3 model for demonstration purposes. The pre-quantized Llama-3 weights are available at https://huggingface.co/mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC. You can also try out the unquantized Llama-3 model by replacing q4f16_1 with q0f16 in the examples below. Please visit our documentation for a detailed quick start and introduction.

Installation

MLC LLM is available via pip; see our documentation (https://llm.mlc.ai/docs) for the platform-specific pip command. It is always recommended to install it in an isolated conda virtual environment.

To verify the installation, activate your virtual environment and run:

python -c "import mlc_llm; print(mlc_llm.__path__)"

You should see the installation path of the MLC LLM Python package.

Chat CLI

We can try out the chat CLI in MLC LLM with the 4-bit quantized 8B Llama-3 model.

mlc_llm chat HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC

The first run of this command may take 1-2 minutes. After that, the command launches a chat interface where you can enter your prompt and chat with the model.

You can use the following special commands:
/help               print the special commands
/exit               quit the cli
/stats              print out the latest stats (token/sec)
/reset              restart a fresh chat
/set [overrides]    override settings in the generation config. For example,
                      `/set temperature=0.5;max_gen_len=100;stop=end,stop`
                      Note: Separate stop words in the `stop` option with commas (,).
Multi-line input: Use escape+enter to start a new line.

user: What's the meaning of life
assistant:
What a profound and intriguing question! While there's no one definitive answer, I'd be happy to help you explore some perspectives on the meaning of life.

The concept of the meaning of life has been debated and...

Python API

We can run the Llama-3 model with the chat completion Python API of MLC LLM. You can save the code below into a Python file and run it.

from mlc_llm import LLMEngine

# Create engine
model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
engine = LLMEngine(model)

# Run a streaming chat completion through the OpenAI-compatible API.
for response in engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is the meaning of life?"}],
    model=model,
    stream=True,
):
    for choice in response.choices:
        print(choice.delta.content, end="", flush=True)
print("\n")

engine.terminate()

The Python API of mlc_llm.LLMEngine fully aligns with the OpenAI API. You can use LLMEngine in the same way you would use OpenAI's Python package, for both synchronous and asynchronous generation.
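For example, a non-streaming request can be issued and read back the same way as with OpenAI's client. The following is a minimal sketch that assumes the response object follows OpenAI's choices[0].message.content layout:

from mlc_llm import LLMEngine

model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
engine = LLMEngine(model)

# Non-streaming chat completion; we assume the response mirrors
# OpenAI's object layout (choices[0].message.content).
response = engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is the meaning of life?"}],
    model=model,
    stream=False,
)
print(response.choices[0].message.content)

engine.terminate()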

If you would like to do concurrent asynchronous generation, you can use mlc_llm.AsyncLLMEngine instead.
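Below is a minimal sketch of concurrent asynchronous generation. It assumes AsyncLLMEngine exposes the same chat.completions.create interface as the synchronous engine and that the streaming call is awaited and iterated with async for; please check the documentation for the exact semantics.

import asyncio

from mlc_llm import AsyncLLMEngine

model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"


async def ask(engine: AsyncLLMEngine, prompt: str) -> str:
    # Assumption: the streaming call mirrors the synchronous API,
    # but is awaited and iterated with `async for`.
    chunks = []
    async for response in await engine.chat.completions.create(
        messages=[{"role": "user", "content": prompt}],
        model=model,
        stream=True,
    ):
        for choice in response.choices:
            chunks.append(choice.delta.content or "")
    return "".join(chunks)


async def main() -> None:
    engine = AsyncLLMEngine(model)
    # Issue two requests concurrently.
    answers = await asyncio.gather(
        ask(engine, "What is the meaning of life?"),
        ask(engine, "Write a haiku about compilers."),
    )
    for answer in answers:
        print(answer, "\n")
    engine.terminate()


asyncio.run(main())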

REST Server

We can launch a REST server to serve the 4-bit quantized Llama-3 model for OpenAI chat completion requests. The server provides full OpenAI API compatibility.

mlc_llm serve HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC

The server listens at http://127.0.0.1:8000 by default, and you can use --host and --port to set a different host and port. When the server is ready (showing INFO: Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)), open a new shell and send a cURL request with the following command:

curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
        "model": "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC",
        "messages": [
            {"role": "user", "content": "Hello! Our project is MLC LLM. What is the name of our project?"}
        ]
  }' \
  http://127.0.0.1:8000/v1/chat/completions
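
Because the endpoint is OpenAI-compatible, the same request can also be sent from Python. Below is a minimal sketch using the third-party requests library (not part of MLC LLM), assuming the response body follows OpenAI's chat completion JSON layout:

import requests

payload = {
    "model": "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC",
    "messages": [
        {
            "role": "user",
            "content": "Hello! Our project is MLC LLM. What is the name of our project?",
        }
    ],
}

# Same endpoint as the cURL example above.
resp = requests.post(
    "http://127.0.0.1:8000/v1/chat/completions", json=payload, timeout=300
)
resp.raise_for_status()
# Assumption: the response JSON mirrors OpenAI's chat completion schema.
print(resp.json()["choices"][0]["message"]["content"])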

Universal Deployment APIs

MLC LLM provides multiple sets of APIs across platforms and environments. These include the chat CLI, Python API, and REST server shown above, as well as packages for iOS/iPadOS, Android, and web browsers (WebLLM).

Citation

Please consider citing our project if you find it useful:

@software{mlc-llm,
    author = {MLC team},
    title = {{MLC-LLM}},
    url = {https://github.com/mlc-ai/mlc-llm},
    year = {2023}
}

The underlying techniques of MLC LLM include:

References
@inproceedings{tensorir,
    author = {Feng, Siyuan and Hou, Bohan and Jin, Hongyi and Lin, Wuwei and Shao, Junru and Lai, Ruihang and Ye, Zihao and Zheng, Lianmin and Yu, Cody Hao and Yu, Yong and Chen, Tianqi},
    title = {TensorIR: An Abstraction for Automatic Tensorized Program Optimization},
    year = {2023},
    isbn = {9781450399166},
    publisher = {Association for Computing Machinery},
    address = {New York, NY, USA},
    url = {https://doi.org/10.1145/3575693.3576933},
    doi = {10.1145/3575693.3576933},
    booktitle = {Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2},
    pages = {804--817},
    numpages = {14},
    keywords = {Tensor Computation, Machine Learning Compiler, Deep Neural Network},
    location = {Vancouver, BC, Canada},
    series = {ASPLOS 2023}
}

@inproceedings{metaschedule,
    author = {Shao, Junru and Zhou, Xiyou and Feng, Siyuan and Hou, Bohan and Lai, Ruihang and Jin, Hongyi and Lin, Wuwei and Masuda, Masahiro and Yu, Cody Hao and Chen, Tianqi},
    booktitle = {Advances in Neural Information Processing Systems},
    editor = {S. Koyejo and S. Mohamed and A. Agarwal and D. Belgrave and K. Cho and A. Oh},
    pages = {35783--35796},
    publisher = {Curran Associates, Inc.},
    title = {Tensor Program Optimization with Probabilistic Programs},
    url = {https://proceedings.neurips.cc/paper_files/paper/2022/file/e894eafae43e68b4c8dfdacf742bcbf3-Paper-Conference.pdf},
    volume = {35},
    year = {2022}
}

@inproceedings{tvm,
    author = {Tianqi Chen and Thierry Moreau and Ziheng Jiang and Lianmin Zheng and Eddie Yan and Haichen Shen and Meghan Cowan and Leyuan Wang and Yuwei Hu and Luis Ceze and Carlos Guestrin and Arvind Krishnamurthy},
    title = {{TVM}: An Automated {End-to-End} Optimizing Compiler for Deep Learning},
    booktitle = {13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18)},
    year = {2018},
    isbn = {978-1-939133-08-3},
    address = {Carlsbad, CA},
    pages = {578--594},
    url = {https://www.usenix.org/conference/osdi18/presentation/chen},
    publisher = {USENIX Association},
    month = oct,
}

Links

  • You might want to check out our online public Machine Learning Compilation course for a systematic walkthrough of our approaches.
  • WebLLM is a companion project using MLC LLM's WebGPU and WebAssembly backend.
  • WebStableDiffusion is a companion project for diffusion models with the WebGPU backend.

mlc-llm's People

Contributors

anibohara2000, bbuf, charliefruan, cyx-6, davidpissarra, hzfengsy, jeethu, jinhongyii, junrushao, kartik14, kathryn-cat, leshengjin, lunderberg, masahi, masterjh5574, nverke, rickzx, sbelcmu, sing-li, spectrometerhbh, sudeepag, sunggg, tlopex, tqchen, ubospica, vinx13, yongwww, yuchenjin, yzh119, zxybazh


mlc-llm's Issues

Comparison with hidet

Hi, do you happen to know how the optimization performance of this TVM-based solution might compare to https://github.com/hidet-org/hidet for NVIDIA edge devices (like the NVIDIA Jetson Xavier)?

I'm curious which of the two can make LLMs smaller/faster for these devices.

I assume they are not compatible, but please let me know if I can use both.

Android support

Pretty self-explanatory: can we get an Android version of this?

CPU offloading

Incredible project. I managed to run the model at good speed on my AMD hardware, thanks!
I have a question: do you have any plans to support offloading the weights, so that bigger models like 13B or 30B can run with less VRAM?

Invalid bitcast %222 = bitcast <8 x i32> %221 to <8 x half>

Output from running mlc_chat_cli:

WARNING: lavapipe is not a conformant vulkan implementation, testing use only.
Use lib /cluster/2024mgagvani/dist/lib/vicuna-v1-7b_vulkan_float16.so
Initializing the chat module...
Finish loading
You can use the following special commands:
  /help    print the special commands
  /exit    quit the cli
  /stats   print out the latest stats (token/sec)
  /reset   restart a fresh chat

USER: what is 1 + 1
ASSISTANT: Invalid bitcast
  %222 = bitcast <8 x i32> %221 to <8 x half>
LLVM ERROR: Broken function
Aborted (core dumped)

Debug info:

(mlc-chat) 2024mgagvani@snowy:~$ uname -r
5.15.0-67-generic

System Specs: AMD Threadripper 1950X CPU and Nvidia GeForce 2080 GPU.

python chat.py can not be run

This doesn't actually provide a loadable model in a Hugging Face repo, so how can the tokenizer load?

OSError: ./dist/models/vicuna-v1-7b does not appear to have a file named config.json. Checkout 'https://huggingface.co/./dist/models/vicuna-v1-7b/None' for available files.

Also, do I need TVM built with Vulkan enabled to run the demo .so?

  File "E:\codes\libs\relax\src\runtime\c_runtime_api.cc", line 131
TVMError:
---------------------------------------------------------------
An error occurred during the execution of TVM.
For more information, please see: https://tvm.apache.org/docs/errors.html
---------------------------------------------------------------
  Check failed: (allow_missing) is false: Device API vulkan is not enabled.

Very slow on Mac

Tried on a Mac with the spec below, and the response is very slow. Is there any way to speed it up?

Spec:
2.6 GHz 6-Core Intel Core i7
Intel UHD Graphics 630 1536 MB
32 GB 2667 MHz DDR4

Works like a charm!

Just wanted to report that this works perfectly on my GTX 1060 (6 GB) with my old i5-7200 and 16 GB RAM under Win10. So far, I have never reached such speed with any other existing solution (oobabooga, textsynth, llama.cpp). Not a single issue during install. I can't tell exactly, but it's surely a couple of tokens/sec during inference. I need a deeper dive to get a feeling for the quality, as it seems to be a model quantized in int3?
Now we want more: more models, 13B size, parameter access (temp, top-p, etc.), and an API. Anyway, I think this is great work already!

Android port

Hopefully we can expect an Android port as well.

Error: Vulkan Error, code=-3: VK_ERROR_INITIALIZATION_FAILED

mlc_chat_cli
terminate called after throwing an instance of 'tvm::runtime::InternalError'
what(): [03:45:52] /home/runner/work/utils/utils/tvm/src/runtime/vulkan/vulkan_instance.cc:144:

An error occurred during the execution of TVM.
For more information, please see: https://tvm.apache.org/docs/errors.html

Check failed: (__e == VK_SUCCESS) is false: Vulkan Error, code=-3: VK_ERROR_INITIALIZATION_FAILED
Stack trace:
[bt] (0) /home/xt/anaconda3/envs/mlc-chat/bin/../lib/libtvm_runtime.so(tvm::runtime::Backtraceabi:cxx11+0x27) [0x7fa4f1a06b77]
[bt] (1) /home/xt/anaconda3/envs/mlc-chat/bin/../lib/libtvm_runtime.so(+0x3f375) [0x7fa4f19a4375]
[bt] (2) /home/xt/anaconda3/envs/mlc-chat/bin/../lib/libtvm_runtime.so(tvm::runtime::vulkan::VulkanInstance::GetPhysicalDevices() const+0x3e9) [0x7fa4f1af20a9]
[bt] (3) /home/xt/anaconda3/envs/mlc-chat/bin/../lib/libtvm_runtime.so(tvm::runtime::vulkan::VulkanDeviceAPI::VulkanDeviceAPI()+0x13f) [0x7fa4f1aefe2f]
[bt] (4) /home/xt/anaconda3/envs/mlc-chat/bin/../lib/libtvm_runtime.so(tvm::runtime::vulkan::VulkanDeviceAPI::Global()+0x4c) [0x7fa4f1af00cc]
[bt] (5) /home/xt/anaconda3/envs/mlc-chat/bin/../lib/libtvm_runtime.so(+0x18b10d) [0x7fa4f1af010d]
[bt] (6) /home/xt/anaconda3/envs/mlc-chat/bin/../lib/libtvm_runtime.so(+0x6bf04) [0x7fa4f19d0f04]
[bt] (7) /home/xt/anaconda3/envs/mlc-chat/bin/../lib/libtvm_runtime.so(+0x6c4a7) [0x7fa4f19d14a7]
[bt] (8) mlc_chat_cli(+0xe4f0) [0x55772fb6f4f0]

My GPU is an NVIDIA A100. Is the A100 supported? Or which version of Vulkan should I install?

Expose API

Thanks for your effort.
Do you plan to add an API layer on top of this, so it can be used as a local API?

In my scenario, I'd like to host your library in a Docker instance and query it via an API to feed a custom application I've started (see https://github.com/MithrilMan/AIdentities).
It would make use of several models, not just for text generation, but of course here I'm only interested in the text-generation part.

Ideally the API should have endpoints for completions and embeddings.

Is this something you plan to have?

The sample installation code doesn't work on Macbook ARM

Hello there,

I found this problem while executing the sample code given for installation on Macbook M1. How should I resolve this?

An error occurred during the execution of TVM.
For more information, please see: https://tvm.apache.org/docs/errors.html
  Check failed: (lib_handle_ != nullptr) is false: Failed to load dynamic shared library /Users/dist/lib/vicuna-v1-7b_metal_float16.so dlopen(/Users/dist/lib/vicuna-v1-7b_metal_float16.so, 0x0005): tried: '/Users/dist/lib/vicuna-v1-7b_metal_float16.so' (mach-o file, but is an incompatible architecture (have 'arm64', need 'x86_64'))
Stack trace:
  [bt] (0) 1   libtvm_runtime.dylib                0x0000000108f16c98 tvm::runtime::Backtrace() + 24
  [bt] (1) 2   libtvm_runtime.dylib                0x0000000108ee3929 tvm::runtime::detail::LogFatal::Entry::Finalize() + 89
  [bt] (2) 3   libtvm_runtime.dylib                0x0000000108ee38c9 tvm::runtime::detail::LogFatal::~LogFatal() + 25
  [bt] (3) 4   libtvm_runtime.dylib                0x0000000108ede159 tvm::runtime::detail::LogFatal::~LogFatal() + 9
  [bt] (4) 5   libtvm_runtime.dylib                0x0000000108f04b2d tvm::runtime::DSOLibrary::Load(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) + 269
  [bt] (5) 6   libtvm_runtime.dylib                0x0000000108f04d2f tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::$_0> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) + 175
  [bt] (6) 7   libtvm_runtime.dylib                0x0000000108f1e9b6 tvm::runtime::Module::LoadFromFile(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) + 598
  [bt] (7) 8   mlc_chat_cli                        0x000000010062e4c2 main + 8402
  [bt] (8) 9   dyld                                0x000000020082051e start + 462

CMake Error

I am new to TVM, and I encountered an error while compiling it according to the instructions. I cannot install it successfully. It seems that TVM or something else cannot be found. What could be the reason?

Here are the specific error messages:
-- Set TVM_LLVM_VERSION=170
-- Build with contrib.random
-- Build with contrib.sort
-- Build with contrib.hybriddump
-- Git found: /usr/bin/git
-- Found TVM_GIT_COMMIT_HASH=838ec67e9376f5a606ce6c8bb0a9b773e5f63833
-- Found TVM_GIT_COMMIT_TIME=2023-05-03 16:37:51 -0400
-- Could NOT find LIBBACKTRACE (missing: LIBBACKTRACE_STATIC_LIBRARY LIBBACKTRACE_INCLUDE_DIR)
-- Building libbacktrace from 3rdparty/libbacktrace
-- Building with TVM Map...
-- Build with thread support...
-- Check if compiler accepts -pthread
-- Check if compiler accepts -pthread - yes
-- Added "-fuse-ld=lld" to linker flags
-- Configuring done
CMake Error: The following variables are used in this project, but they are set to NOTFOUND.
Please set them or make sure they are set and tested correctly in the CMake files:
FOUNDATION_LIB
linked by target "tvm" in directory /home/panz/project/mlc-llm/relax
linked by target "tvm_runtime" in directory /home/panz/project/mlc-llm/relax
METAL_LIB
linked by target "tvm" in directory /home/panz/project/mlc-llm/relax
linked by target "tvm_runtime" in directory /home/panz/project/mlc-llm/relax

-- Generating done
CMake Generate step failed. Build files cannot be regenerated correctly.

Chinese variants model support

Since Vicuna doesn't support Chinese very well, is there a way to support some better-performing Chinese models in the demo?

Missing instructions on installing additional models

Hey there, congratulations on a great release! The app works great on a Mac and the installation was very straightforward.

Do you have plans to grow mlc_chat_cli into a standalone tool, or is it meant to be a proof of concept? The Readme claims the project can be used to run 'any language model', but there are no instructions for how to do that. Furthermore, the code seems to indicate that only three models are supported right now; is that right?

Unless the mlc_chat_cli is supposed to be a toy demo, could you please add instructions for:

  1. which models are supported (e.g. would RNN based models like https://github.com/BlinkDL/RWKV-LM work or is it just transformers)?
  2. which formats, quantization methods and directory structures are supported - i.e. I don't think grabbing a random link from HF and cloning it the same way Vicuna was installed during original installation (git clone https://huggingface.co/mlc-ai/demo-vicuna-v1-7b-int3 dist/vicuna-v1-7b) would work, right?
  3. it seems that there is a template/profile system for different LLM families, how do we add additional templates? Does it require patch/pull-request or can it be done by tweaking a config file somewhere?
  4. the Readme mentions multiple optimizations, but the mlc_chat_cli doesn't expose that info/settings to the user. How do we tweak those?
  5. given that the claim in the Readme is that all language models are supported, there should be some kind of rough guide on how to calculate the hardware requirements (e.g. what LLMs can my machine run using this tool, with what quantization and performance?). As a comparison, the llama.cpp Readme isn't well-structured, but it does provide a good overview of RAM requirements for a given model size and of the impact of different quantization techniques on performance.

Also, it would be very neat if you mentioned in the Readme what kind of community interactions you are aiming for. Would you prefer that people build their own tools that use mlc-llm as a backend, or that they send PRs improving mlc_chat_cli?

How to `tune_relax` for other targets

It seems like the tuning is per device, although the m1 tuning is applied when using any GPU.
How would I use relax_integration.tune_relax on mod_deploy to create other databases?

I tried to figure it out myself, but got stuck on the error:
Check failed: (int_imm) is false: TypeError: Expect the extent of a loop to be IntImm, but gets: tir.Var
from measure_flops.

Consider supporting 8bit quantization

Based on experimenting with GPTQ-for-LLaMa, int4 quantization seems to introduce 3-5% drop in perplexity, while int8 is almost identical to fp16. Would it be possible to use int8 quantization with mlc-llm, assuming the model fits in VRAM in int8?

Driver versions needed?

Does this not support AMD GPUs?

I'm getting this error:

terminate called after throwing an instance of 'tvm::runtime::InternalError'
  what():  [18:53:29] /home/runner/work/utils/utils/tvm/src/runtime/vulkan/vulkan_instance.cc:111: 
---------------------------------------------------------------
An error occurred during the execution of TVM.
For more information, please see: https://tvm.apache.org/docs/errors.html
---------------------------------------------------------------
  Check failed: (__e == VK_SUCCESS) is false: Vulkan Error, code=-9: VK_ERROR_INCOMPATIBLE_DRIVER
Stack trace:
  [bt] (0) /home/loganm/miniconda3/envs/llama_index/bin/../lib/libtvm_runtime.so(tvm::runtime::Backtrace[abi:cxx11]()+0x27) [0x7f6fa5518a37]
  [bt] (1) /home/loganm/miniconda3/envs/llama_index/bin/../lib/libtvm_runtime.so(+0x3f375) [0x7f6fa54b6375]
  [bt] (2) /home/loganm/miniconda3/envs/llama_index/bin/../lib/libtvm_runtime.so(tvm::runtime::vulkan::VulkanInstance::VulkanInstance()+0x1a47) [0x7f6fa5605857]
  [bt] (3) /home/loganm/miniconda3/envs/llama_index/bin/../lib/libtvm_runtime.so(tvm::runtime::vulkan::VulkanDeviceAPI::VulkanDeviceAPI()+0x40) [0x7f6fa5601a60]
  [bt] (4) /home/loganm/miniconda3/envs/llama_index/bin/../lib/libtvm_runtime.so(tvm::runtime::vulkan::VulkanDeviceAPI::Global()+0x4c) [0x7f6fa5601dfc]
  [bt] (5) /home/loganm/miniconda3/envs/llama_index/bin/../lib/libtvm_runtime.so(+0x18ae3d) [0x7f6fa5601e3d]
  [bt] (6) /home/loganm/miniconda3/envs/llama_index/bin/../lib/libtvm_runtime.so(+0x6bdc4) [0x7f6fa54e2dc4]
  [bt] (7) /home/loganm/miniconda3/envs/llama_index/bin/../lib/libtvm_runtime.so(+0x6c367) [0x7f6fa54e3367]
  [bt] (8) mlc_chat_cli(+0xd950) [0x55ed543c8950]


Aborted

Here's my driver info (screenshot omitted).

Better instructions needed

I think the project has too little information on adjusting its configuration.

For example, how do I load different weights apart from the demo-provided ones? How do I adjust the temperature? I don't have any clues.

It would be better to write more instructions.

dolly 12b 3bit cuda out of memory on my wsl 3070 laptop card

mlc_chat_cli --model dolly-v2-12b_int3 --dtype float32
Use lib /root/mlcai/dist/dolly-v2-12b_int3/float32/dolly-v2-12b_int3_cuda_float32.so
Initializing the chat module...
Finish loading
You can use the following special commands:
/help print the special commands
/exit quit the cli
/stats print out the latest stats (token/sec)
/reset restart a fresh chat

Instruction: hello

Response: [13:55:51] /root/mlcai/relax/src/runtime/relax_vm/pooled_allocator.h:64: Warning: PooledAllocator got InternalError during allocation:


An error occurred during the execution of TVM.
For more information, please see: https://tvm.apache.org/docs/errors.html

Check failed: (e == cudaSuccess || e == cudaErrorCudartUnloading) is false: CUDA: out of memory
[13:55:51] /root/mlcai/relax/src/runtime/relax_vm/pooled_allocator.h:65: Warning: Trying to release all unused memory and reallocate...
terminate called after throwing an instance of 'tvm::runtime::InternalError'
what(): [13:55:51] /root/mlcai/relax/include/tvm/runtime/device_api.h:291: unknown type =0
Stack trace:
0: _ZN3tvm7runtime8relax_vm13MemoryManager
1: _ZN3tvm7runtime18SimpleObjAllocator7HandlerIN
2: tvm::runtime::relax_vm::VMAllocStorage(void*, tvm::runtime::ShapeTuple, long, DLDataType) [clone .cold.318]
3: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<tvm::runtime::relax_vm::Storage (void*, tvm::runtime::ShapeTuple, long, DLDataType)>::AssignTypedLambda<tvm::runtime::relax_vm::Storage ()(void, tvm::runtime::ShapeTuple, long, DLDataType)>(tvm::runtime::relax_vm::Storage ()(void, tvm::runtime::ShapeTuple, long, DLDataType), std::__cxx11::basic_string<char, std::char_traits, std::allocator >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
4: tvm::runtime::relax_vm::VirtualMachineImpl::InvokeClosurePacked(tvm::runtime::ObjectRef const&, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
5: tvm::runtime::relax_vm::VirtualMachineImpl::RunInstrCall(tvm::runtime::relax_vm::VMFrame*, tvm::runtime::relax_vm::Instruction)
6: tvm::runtime::relax_vm::VirtualMachineImpl::RunLoop()
7: tvm::runtime::relax_vm::VirtualMachineImpl::InvokeBytecode(long, std::vector<tvm::runtime::TVMRetValue, std::allocatortvm::runtime::TVMRetValue > const&)
8: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::relax_vm::VirtualMachineImpl::GetClosureInternal(tvm::runtime::String const&, bool)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
9: tvm::runtime::relax_vm::VirtualMachineImpl::InvokeClosurePacked(tvm::runtime::ObjectRef const&, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
10: mlc::llm::LLMChatModule::Forward(tvm::runtime::NDArray, long)
11: mlc::llm::LLMChatModule::EncodeStep(std::__cxx11::basic_string<char, std::char_traits, std::allocator >)
12: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<mlc::llm::LLMChatModule::GetFunction(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, tvm::runtime::ObjectPtrtvm::runtime::Object const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#3}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
13: Chat(tvm::runtime::Module, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, long, double, double, long, int, int, double)
14: main
15: 0x00007f2565fbb78f
16: __libc_start_main
17: 0x000055d148d5c314

My GPU memory is 8 GB, which I think is enough to run this model. The screenshot of VRAM usage after loading is omitted here; it runs on WSL2.

Has anyone gotten this working on windows 10?

I'm on Windows 10 Enterprise, Intel Xeon CPU E5-1620 v4 at 3.5 GHz, 32 GB RAM, with an NVIDIA NVS 310.

I installed the linked driver for vulkan, and followed the steps but when running the command the program will not launch.

The same steps work in my mac... except the vulkan part.

Is there a better way to debug/find the exception? It's my first time using Anaconda.

You can see from the red arrows in the screenshot (omitted here) that nothing happens. It waits a second, then stops.

How do i use this in my own programs?

Hi there! I'm new to programming. I really want to try to add AI to my own little program. For example, ChatGPT has an API I can talk to using Python code. Does this project allow something like that? Can I just import mlc-llm and get answers using Python?
Sorry for my bad English.

Running into WebGPU device error from the web demo, using Chrome 112

When trying this out at https://mlc.ai/web-llm/#chat-demo, it gave me the following error. My Chrome is 112.0.5615.137 and there is no update option in the settings.

Find an error initializing the WebGPU device Error: Cannot initialize runtime because of requested maxBufferSize exceeds limit. requested=1024MB, limit=256MB. This error may be caused by an older version of the browser (e.g. Chrome 112). You can try to upgrade your browser to Chrome 113 or later.
Init error, Error: Find an error initializing WebGPU: Error: Cannot initialize runtime because of requested maxBufferSize exceeds limit. requested=1024MB, limit=256MB. This error may be caused by an older version of the browser (e.g. Chrome 112). You can try to upgrade your browser to Chrome 113 or later.

Will this work on an iPhone 14 pro?

I absolutely love the idea of this repo, and am very hopeful about its future. I loved it so much that I managed to get the download off TestFlight. However, every time I open the app, it crashes.

Your page says

Try out this TestFlight page (limited to the first 9000 users)

Is it because 9000 users installed it before me? Is that why it's crashing? Or do I have the wrong iPhone 14 pro?

tvm::runtime::InternalError relax/src/runtime/relax_vm/lm_support.cc:247 Check failed: uniform_sample <= data[0].first (0.0715982 vs. nan)

I am trying to build the iOS app from source and everything is OK except for running the app on the iPhone: the app shows it is ready to chat, but after sending a message the app crashes, and Xcode shows:

libc++abi: terminating due to uncaught exception of type tvm::runtime::InternalError: [15:39:04] /Users/relax/src/runtime/relax_vm/lm_support.cc:247:

An error occurred during the execution of TVM.
For more information, please see: https://tvm.apache.org/docs/errors.html

Check failed: uniform_sample <= data[0].first (0.0715982 vs. nan) :
Stack trace:
[bt] (0) 1 MLCChat 0x0000000104c9f094 tvm::runtime::detail::LogFatal::Entry::Finalize() + 116
[bt] (1) 2 MLCChat 0x0000000104c9f020 tvm::runtime::detail::LogFatal::Entry::Finalize() + 0
[bt] (2) 3 MLCChat 0x0000000104c9e51c __clang_call_terminate + 0
[bt] (3) 4 MLCChat 0x0000000104d4f8a8 tvm::runtime::relax_vm::SampleTopPFromLogits(tvm::runtime::NDArray, double, double, double) + 1544
[bt] (4) 5 MLCChat 0x0000000104d557dc void tvm::runtime::TypedPackedFunc<int (tvm::runtime::NDArray, double, double, double)>::AssignTypedLambda<int ()(tvm::runtime::NDArray, double, double, double)>(int ()(tvm::runtime::NDArray, double, double, double), std::__1::basic_string<char, std::__1::char_traits, std::__1::allocator>)::'lambda'(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)::operator()(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*) const + 232
[bt] (5) 6 MLCChat 0x0000000104cc35ec mlc::llm::LLMChatModule::SampleFromLogitsOnCPU() + 348
[bt] (6) 7 MLCChat 0x0000000104cc1e60 mlc::llm::LLMChatModule::EncodeStep(std::__1::basic_string<char, std::__1::char_traits, std::__1::allocator>) + 504
[bt] (7) 8 MLCChat 0x0000000104cc1b54 mlc::llm::LLMChatModule::GetFunction(std::__1::basic_string<char, std::__1::char_traits, std::__1::allocator> const&, tvm::runtime::ObjectPtrtvm::runtime::Object const&)::'lambda1'(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const + 124
[bt] (8) 9 MLCChat 0x0000000104cc1acc tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<mlc::llm::LLMChatModule::GetFunction(std::__1::basic_string<char, std::__1::char_traits, std::__1::allocator> const&, tvm::runtime::ObjectPtrtvm::runtime::Object const&)::'lambda1'(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)>>::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) + 40

Docker support

Please consider adding a Dockerfile and a docker-compose file to the repository.

Are tuning scripts available?

It seems like the tuning is per device, although the m1 tuning is applied when using any GPU.
How would I use relax_integration.tune_relax on mod_deploy to create other databases?

Check failed: (__e == VK_SUCCESS) is false: Vulkan Error, code=-2: VK_ERROR_OUT_OF_DEVICE_MEMORY

Win10 x64, 8 GB RAM, NVIDIA GeForce 940MX.
After the mlc_chat_cli command:

Use lib E:\Code\test\mlc-chat\dist\lib\vicuna-v1-7b_vulkan_float16.dll
Initializing the chat module...
[16:56:46] D:\a\utils\utils\tvm\src\runtime\vulkan\vulkan_buffer.cc:61:

An error occurred during the execution of TVM.
For more information, please see: https://tvm.apache.org/docs/errors.html

Check failed: (__e == VK_SUCCESS) is false: Vulkan Error, code=-2: VK_ERROR_OUT_OF_DEVICE_MEMORY
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.

An alternative python interface for MLC LLM

It seems there is no official support for interacting with MLC LLM through a Python API yet. For the convenience of anyone who wants to develop a program that integrates MLC LLM, I wrote some simple code that runs mlc_chat_cli in a subprocess and redirects its input/output through pipes, so that anyone can plug MLC LLM into their code. I'd be glad to see an official Python interface someday.
See code at https://github.com/XinyuSun/mlc-chatbot
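
For reference, below is a minimal sketch of the subprocess-and-pipe approach described above. It is not code from the linked repository; it assumes mlc_chat_cli is on PATH and simply scripts one prompt plus the special commands quoted earlier in this thread.

import subprocess

# Feed a scripted conversation to mlc_chat_cli over stdin and capture its output.
# The /stats and /exit special commands come from the CLI help text quoted above.
script = "What's the meaning of life?\n/stats\n/exit\n"

result = subprocess.run(
    ["mlc_chat_cli"],
    input=script,
    capture_output=True,
    text=True,
    timeout=600,
)
print(result.stdout)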
 

[Survey] Supported Hardwares and Speed

UPDATE (08/09/2023):

We have done a major performance overhaul in the past few months, and now I'm happy to share the latest results:

============================================================

Hi everyone,

We are looking to gather data points on running MLC-LLM on different hardware and platforms. Our goal is to create a comprehensive reference for new users. Please share your own experiences in this thread! Thank you for your help!

NOTE: for benchmarking, we highly recommend a device with at least 6 GB of memory, because the model itself already takes 2.9 GB. For this reason, it is known that the iOS app will crash on a 4 GB iPhone.

AMD GPUs

| Hardware/GPU | OS | Tokens/sec | Source | Notes |
| --- | --- | --- | --- | --- |
| RX 6600XT (8G) | N/A | 28.3 | GitHub | |
| RX 6750XT | openSUSE Tumbleweed | 8.9 - 154.3 | GitHub | |
| RX 6700XT | Windows 11 | 33.7 | GitHub | |
| APU 5800H | Windows 11 | 8.5 | GitHub | |
| Radeon RX 470 (4G) | AlmaLinux 9.1 | 9.4 | GitHub | |
| Radeon Pro 5300M | macOS Ventura | 12.6 | @junrushao | Intel MBP 16" (late 2019) |
| AMD GPU on Steam Deck | Steam Deck's Linux | TBD | Reddit | |
| RX 6800 (16G VRAM) | macOS Ventura | 22.5 | GitHub | Intel MBP 13" (2020) |
| Radeon RX 6600 (8GB) | Ubuntu 22.04 | 7.0 | Reddit | |
| RX 7900 XTX | | | Reddit | |

Macbook

| Hardware/GPU | OS | Tokens/sec | Source | Notes |
| --- | --- | --- | --- | --- |
| 2020 MacBook Pro M1 (8G) | macOS | 11.4 | GitHub | |
| 2021 MacBook Pro M1 Pro (16G) | macOS Ventura | 17.1 | GitHub | |
| M1 Max Mac Studio (64G) | N/A | 18.6 | GitHub | |
| 2021 MacBook Pro M1 Max (32G) | macOS Monterey | 21.0 | GitHub | |
| MacBook Pro M2 (16G) | macOS Ventura | 22.5 | GitHub | |
| 2021 MacBook M1 Pro (32G) | macOS Ventura | 19.3 | GitHub | |

Intel GPUs

| Hardware/GPU | OS | Tokens/sec | Source | Notes |
| --- | --- | --- | --- | --- |
| Arc A770 | N/A | 3.1 - 118.6 | GitHub | Perf issues in decoding need investigation |
| UHD Graphics (Comet Lake-U GT2) 1G | Windows 10 | 2.2 | GitHub | |
| UHD Graphics 630 | macOS Ventura | 2.3 | @junrushao | Integrated GPU. Intel MBP 16" (late 2019) |
| Iris Plus Graphics 1536 MB | macOS Ventura | 2.6 | GitHub | Integrated GPU on MBP |
| Iris Plus Graphics 645 1536 MB | macOS Ventura | 2.9 | GitHub | Integrated GPU on MBP |

NVIDIA GPUs

| Hardware/GPU | OS | Tokens/sec | Source | Notes |
| --- | --- | --- | --- | --- |
| GTX 1650 Ti (4GB) | Fedora | 15.6 | GitHub | |
| GTX 1060 (6GB) | Windows 10 | 16.7 | GitHub | |
| RTX 3080 | Windows 11 | 26.0 | GitHub | |
| RTX 3060 | Debian bookworm | 21.3 | GitHub | |
| RTX 2080 Ti | Windows 10 | 24.5 | GitHub | |
| RTX 3090 | N/A | 25.7 | GitHub | |
| GTX 1660 Ti | N/A | 23.9 | GitHub | |
| RTX 3070 | N/A | 23.3 | GitHub | |

iOS

| Hardware/GPU | OS | Tokens/sec | Source | Notes |
| --- | --- | --- | --- | --- |
| iPhone 14 Pro | iOS 16.4.1 | 7.2 | @junrushao | |
| iPad Pro 11" with M1 | iPadOS 16.1 | 10.6 | GitHub | |
| iPad Pro 11" A12Z | N/A | 4.1 | GitHub | |
| iPad Pro 11" with M2 (4th gen) | iPadOS 16.5 | 14.1 | GitHub | |

Android

| Hardware/GPU | OS | Tokens/sec | Link | Notes |
| --- | --- | --- | --- | --- |

Looks amazing! Where's the code to compile dist/libs? Would like to try on Intel macOS

I tried WebLLM the other week and was really blown away. I have an Intel macOS system with AMD 6900XT GPU and using WebLLM was the first time I'd had decent GPU inference on this system.

Now I'd love to try mlc-llm as well. I followed the instructions, but the pre-built Metal lib for macOS is built for ARM64/Silicon.

Where can I find the source for this so I can try compiling it myself?

Running mlc-llm Python code on Windows fails

Hi, I tried to run the Python code (generated myself) and got some errors:

Check failed: (it != self_->idx_sub_.end()) is false:

I built the TVM unity branch and it imports correctly, but I get a runtime error when running python .\tests\chat.py

Full traceback:

Traceback (most recent call last):
  File "E:\codes\ai\aichat\mlc-llm\tests\chat.py", line 13, in <module>
    from mlc_llm import utils
  File "e:\codes\ai\aichat\mlc-llm\mlc_llm\__init__.py", line 2, in <module>
    from . import transform
  File "e:\codes\ai\aichat\mlc-llm\mlc_llm\transform\__init__.py", line 1, in <module>
    from .dispatch_tir_operator import DispatchTIROperator
  File "e:\codes\ai\aichat\mlc-llm\mlc_llm\transform\dispatch_tir_operator.py", line 6371, in <module>
    get_dict_key(fused_min_max_triu_te_broadcast_to): fused_min_max_triu_te_broadcast_to_sch_func(),
  File "e:\codes\ai\aichat\mlc-llm\mlc_llm\transform\dispatch_tir_operator.py", line 30, in fused_min_max_triu_te_broadcast_to_sch_func
    sch.reverse_compute_inline(b0)
  File "e:\codes\libs\tvm\python\tvm\tir\schedule\_type_checker.py", line 339, in wrap
    return func(*args, **kwargs)
  File "e:\codes\libs\tvm\python\tvm\tir\schedule\schedule.py", line 2218, in reverse_compute_inline
    _ffi_api.ScheduleReverseComputeInline(self, block)  # type: ignore # pylint: disable=no-member
  File "e:\codes\libs\tvm\python\tvm\_ffi\_ctypes\packed_func.py", line 237, in __call__
    raise get_last_ffi_error()
tvm._ffi.base.TVMError: Traceback (most recent call last):
  File "E:\codes\libs\tvm\src\tir\schedule\primitive\compute_inline.cc", line 537
TVMError:
---------------------------------------------------------------
An error occurred during the execution of TVM.
For more information, please see: https://tvm.apache.org/docs/errors.html
---------------------------------------------------------------
  Check failed: (it != self_->idx_sub_.end()) is false:

Any help here?

Error:VK_ERROR_INCOMPATIBLE_DRIVER

My environment is Win10 WSL2 with Ubuntu 22.04, and I followed these commands:
conda create -n mlc-chat
conda activate mlc-chat
conda install git git-lfs
conda install -c mlc-ai -c conda-forge mlc-chat-nightly
mkdir -p dist
git lfs install
git clone https://huggingface.co/mlc-ai/demo-vicuna-v1-7b-int3 dist/vicuna-v1-7b
git clone https://github.com/mlc-ai/binary-mlc-llm-libs.git dist/lib

However, when I run this command:
mlc_chat_cli
an error happens (screenshot omitted).
