
lmdeploy's Introduction


Latest News 🎉

2024
  • [2024/06] The PyTorch engine supports DeepSeek-V2 and several VLMs, such as CogVLM2, Mini-InternVL, and LLaVA-Next
  • [2024/05] Balance the vision model across GPUs when deploying VLMs on multiple GPUs
  • [2024/05] Support 4-bit weight-only quantization and inference for VLMs, such as InternVL v1.5, LLaVA, and InternLM-XComposer2
  • [2024/04] Support Llama3 and more VLMs, such as InternVL v1.1/v1.2, MiniGemini, and InternLM-XComposer2
  • [2024/04] TurboMind adds online int8/int4 KV cache quantization and inference for all supported devices. Refer here for the detailed guide
  • [2024/04] The latest TurboMind upgrade enhances GQA, pushing internlm2-20b inference to 16+ RPS, about 1.8x faster than vLLM
  • [2024/04] Support Qwen1.5-MoE and DBRX
  • [2024/03] Support the DeepSeek-VL offline inference pipeline and serving
  • [2024/03] Support the VLM offline inference pipeline and serving
  • [2024/02] Support Qwen1.5, Gemma, Mistral, Mixtral, DeepSeek-MoE, and more
  • [2024/01] OpenAOE is seamlessly integrated with the LMDeploy serving service
  • [2024/01] Support multi-model, multi-machine, multi-GPU inference services. For usage instructions, please refer here
  • [2024/01] Support the PyTorch inference engine, developed entirely in Python, lowering the barrier for developers and enabling rapid experimentation with new features and technologies
2023
  • [2023/12] TurboMind supports multimodal input. See the Gradio demo
  • [2023/11] TurboMind supports loading Hugging Face models directly. Click here for details
  • [2023/11] Major TurboMind upgrades, including Paged Attention, faster attention kernels without a sequence-length limit, 2x faster KV8 kernels, Split-K decoding (Flash Decoding), and W4A16 inference for sm_75
  • [2023/09] TurboMind supports Qwen-14B
  • [2023/09] TurboMind supports InternLM-20B
  • [2023/09] TurboMind supports all features of Code Llama: code completion, infilling, chat / instruct, and python specialist. Click here for deployment guide
  • [2023/09] TurboMind supports Baichuan2-7B
  • [2023/08] TurboMind supports flash-attention2.
  • [2023/08] TurboMind supports Qwen-7B, dynamic NTK-RoPE scaling, and dynamic logN scaling
  • [2023/08] TurboMind supports Windows (tp=1)
  • [2023/08] TurboMind supports 4-bit inference, 2.4x faster than FP16, the fastest open-source implementation. Check this guide for detailed info
  • [2023/08] LMDeploy has launched on the Hugging Face Hub, providing ready-to-use 4-bit models.
  • [2023/08] LMDeploy supports 4-bit quantization using the AWQ algorithm.
  • [2023/07] TurboMind supports Llama-2 70B with GQA.
  • [2023/07] TurboMind supports Llama-2 7B/13B.
  • [2023/07] TurboMind supports tensor-parallel inference of InternLM.

Introduction

LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the MMRazor and MMDeploy teams. It has the following core features:

  • Efficient Inference: LMDeploy delivers up to 1.8x higher request throughput than vLLM by introducing key features such as persistent batching (a.k.a. continuous batching), blocked KV cache, dynamic split & fuse, tensor parallelism, and high-performance CUDA kernels.

  • Effective Quantization: LMDeploy supports weight-only and KV cache quantization, and its 4-bit inference performance is 2.4x higher than FP16. The quantization quality has been confirmed via OpenCompass evaluation.

  • Effortless Distributed Serving: Leveraging the request distribution service, LMDeploy facilitates easy and efficient deployment of multi-model services across multiple machines and GPUs.

  • Interactive Inference Mode: By caching the attention KV during multi-round dialogue, the engine remembers dialogue history and avoids repeatedly processing historical sessions. A configuration sketch touching on several of these features follows this list.
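
The snippet below is a minimal sketch of driving some of these features from the Python API, assuming a recent lmdeploy release in which pipeline, TurbomindEngineConfig, and its tp / quant_policy fields are available; adjust the model name and GPU count to your setup.

# Hedged sketch: tensor parallelism plus online KV cache quantization.
from lmdeploy import pipeline, TurbomindEngineConfig

engine_config = TurbomindEngineConfig(
    tp=2,            # shard the model across 2 GPUs (tensor parallelism)
    quant_policy=8,  # online int8 KV cache quantization; 4 selects int4
)
pipe = pipeline("internlm/internlm2-chat-7b", backend_config=engine_config)
print(pipe(["Summarize the benefits of continuous batching."])[0].text)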

Performance

(Figure: v0.1.0 benchmark results)

For detailed inference benchmarks on more devices and under more settings, please refer to the following links:

  • A100
  • V100
  • 4090
  • 3090
  • 2080

Supported Models

LLMs

  • Llama (7B - 65B)
  • Llama2 (7B - 70B)
  • Llama3 (8B, 70B)
  • InternLM (7B - 20B)
  • InternLM2 (7B - 20B)
  • InternLM2.5 (7B)
  • Qwen (1.8B - 72B)
  • Qwen1.5 (0.5B - 110B)
  • Qwen1.5-MoE (0.5B - 72B)
  • Qwen2 (0.5B - 72B)
  • Baichuan (7B)
  • Baichuan2 (7B - 13B)
  • Code Llama (7B - 34B)
  • ChatGLM2 (6B)
  • GLM4 (9B)
  • Falcon (7B - 180B)
  • Yi (6B - 34B)
  • Mistral (7B)
  • DeepSeek-MoE (16B)
  • DeepSeek-V2 (16B, 236B)
  • Mixtral (8x7B, 8x22B)
  • Gemma (2B - 7B)
  • DBRX (132B)
  • StarCoder2 (3B - 15B)
  • Phi-3-mini (3.8B)

VLMs

  • LLaVA (1.5, 1.6) (7B - 34B)
  • InternLM-XComposer2 (7B, 4khd-7B)
  • Qwen-VL (7B)
  • DeepSeek-VL (7B)
  • InternVL-Chat (v1.1 - v1.5)
  • MiniGeminiLlama (7B)
  • CogVLM-Chat (17B)
  • CogVLM2-Chat (19B)
  • MiniCPM-Llama3-V-2_5
  • Phi-3-vision (4.2B)

LMDeploy has developed two inference engines: TurboMind and PyTorch, each with a different focus. The former strives for ultimate inference performance, while the latter, developed purely in Python, aims to lower the barrier for developers.

They differ in the types of supported models and inference data types. Please refer to this table for each engine's capabilities and choose the one that best fits your actual needs; a sketch of selecting an engine explicitly follows.
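
As an illustration (not an official recommendation), the sketch below explicitly selects the PyTorch engine through its backend config; whether a given model runs on it should be checked against the capability table mentioned above.

# Hedged sketch: force the pure-Python PyTorch engine instead of TurboMind.
import lmdeploy
from lmdeploy import PytorchEngineConfig

pipe = lmdeploy.pipeline("internlm/internlm2-chat-7b",
                         backend_config=PytorchEngineConfig(tp=1))
print(pipe(["Hello!"])[0].text)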

Quick Start Open In Colab

Installation

Install lmdeploy with pip (Python 3.8+) or from source:

pip install lmdeploy

Since v0.3.0, the default prebuilt package is compiled with CUDA 12. If a CUDA 11+ build is required instead, you can install lmdeploy with:

export LMDEPLOY_VERSION=0.5.0
export PYTHON_VERSION=38
pip install https://github.com/InternLM/lmdeploy/releases/download/v${LMDEPLOY_VERSION}/lmdeploy-${LMDEPLOY_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux2014_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118
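
To double-check which build ended up in the environment, a quick sanity check (assuming the package exposes a version string, as recent releases do):

python -c "import lmdeploy; print(lmdeploy.__version__)"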

Offline Batch Inference

import lmdeploy
pipe = lmdeploy.pipeline("internlm/internlm2-chat-7b")
response = pipe(["Hi, pls intro yourself", "Shanghai is"])
print(response)
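
Generation can be tuned per call. The following is a sketch assuming the GenerationConfig class and the gen_config keyword accepted by the pipeline in recent releases:

import lmdeploy
from lmdeploy import GenerationConfig

pipe = lmdeploy.pipeline("internlm/internlm2-chat-7b")
gen_config = GenerationConfig(
    max_new_tokens=128,  # cap the length of each response
    top_p=0.8,           # nucleus sampling threshold
    temperature=0.7,
    random_seed=0,       # make the run reproducible
)
for resp in pipe(["Hi, pls intro yourself", "Shanghai is"], gen_config=gen_config):
    print(resp.text)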

Note

By default, LMDeploy downloads models from Hugging Face. If you would like to use models from ModelScope, please install ModelScope with pip install modelscope and set the environment variable:

export LMDEPLOY_USE_MODELSCOPE=True
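
The variable can also be set from Python, as long as this happens before the pipeline is constructed. A small sketch; the model ID below is a placeholder and must be replaced with the corresponding ModelScope model ID:

import os
os.environ["LMDEPLOY_USE_MODELSCOPE"] = "True"  # must be set before building the pipeline

import lmdeploy
pipe = lmdeploy.pipeline("<modelscope-org>/<model-name>")  # hypothetical ModelScope model ID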

For more information about the inference pipeline, please refer here.

Tutorials

Please review the getting_started section for the basic usage of LMDeploy.

For detailed user guides and advanced guides, please refer to our tutorials:

Third-party projects

  • Deploying LLMs offline on the NVIDIA Jetson platform by LMDeploy: LMDeploy-Jetson

  • Example project for deploying LLMs using LMDeploy and BentoML: BentoLMDeploy

Contributing

We appreciate all contributions to LMDeploy. Please refer to CONTRIBUTING.md for the contributing guideline.

Acknowledgement

Citation

@misc{2023lmdeploy,
    title={LMDeploy: A Toolkit for Compressing, Deploying, and Serving LLM},
    author={LMDeploy Contributors},
    howpublished = {\url{https://github.com/InternLM/lmdeploy}},
    year={2023}
}

License

This project is released under the Apache 2.0 license.


lmdeploy's Issues

huggingface safetensor support

I'm trying to deploy the LLaMA2 70B chat model locally and found that LMDeploy doesn't seem to support Hugging Face safetensors checkpoints. It just raises a confusing exception:

Traceback (most recent call last):
  File "/opt/conda/envs/lmdeploy/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/envs/lmdeploy/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/workspace/lmdeploy/lmdeploy/serve/turbomind/deploy.py", line 592, in <module>
    fire.Fire(main)
  File "/opt/conda/envs/lmdeploy/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/opt/conda/envs/lmdeploy/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/opt/conda/envs/lmdeploy/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/workspace/lmdeploy/lmdeploy/serve/turbomind/deploy.py", line 562, in main
    res = deploy_hf(model_name, model_path, tokenizer_path,
  File "/workspace/lmdeploy/lmdeploy/serve/turbomind/deploy.py", line 482, in deploy_hf
    assert num_layer == i, f'miss matched layers: {num_layer} vs {i}'
AssertionError: miss matched layers: 80 vs 0

because it only reads *.bin files:

_files = [file for file in os.listdir(model_path) if file.endswith('.bin')]
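
For reference, safetensors shards can be read with the safetensors package. The sketch below is an illustration only, not the project's actual fix; it shows how *.safetensors files could be merged into the same state dict as *.bin files:

import os

import torch
from safetensors.torch import load_file

def load_checkpoint_shards(model_path):
    """Collect every *.bin and *.safetensors shard into a single state dict."""
    state_dict = {}
    for name in sorted(os.listdir(model_path)):
        path = os.path.join(model_path, name)
        if name.endswith('.bin'):
            state_dict.update(torch.load(path, map_location='cpu'))
        elif name.endswith('.safetensors'):
            state_dict.update(load_file(path))
    return state_dict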

RuntimeError: Internal: src/sentencepiece_processor.cc(1101) [model_proto->ParseFromArray(serialized.data(), serialized.size())]

(lmdeploy) ➜  lmdeploy git:(main) python -m lmdeploy.pytorch.chat /mnt/internlm-7b \ 
    --max_new_tokens 64 \
    --temperture 0.8 \
    --top_p 0.95 \
    --seed 0
Traceback (most recent call last):
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/mnt/lmdeploy/lmdeploy/pytorch/chat.py", line 190, in <module>
    fire.Fire(main)
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/mnt/lmdeploy/lmdeploy/pytorch/chat.py", line 120, in main
    tokenizer, model = init_model(
  File "/mnt/lmdeploy/lmdeploy/pytorch/chat.py", line 62, in init_model
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_path,
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 693, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1812, in from_pretrained
    return cls._from_pretrained(
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1975, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/internlm-7b/tokenization_internlm.py", line 81, in __init__
    self.sp_model.Load(vocab_file)
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/site-packages/sentencepiece/__init__.py", line 905, in Load
    return self.LoadFromFile(model_file)
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/site-packages/sentencepiece/__init__.py", line 310, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
RuntimeError: Internal: src/sentencepiece_processor.cc(1101) [model_proto->ParseFromArray(serialized.data(), serialized.size())] 

pip list

(lmdeploy) ➜  lmdeploy git:(main) pip list
Package            Version  Editable project location
------------------ -------- -------------------------
addict             2.4.0
brotlipy           0.7.0
certifi            2023.5.7
cffi               1.15.1
charset-normalizer 2.0.4
contourpy          1.1.0
cryptography       39.0.1
cycler             0.11.0
filelock           3.9.0
fire               0.5.0
fonttools          4.41.0
fsspec             2023.6.0
gmpy2              2.1.2
grpcio             1.56.0
huggingface-hub    0.16.4
idna               3.4
importlib-metadata 6.8.0
Jinja2             3.1.2
kiwisolver         1.4.4
lmdeploy           0.0.1    /mnt/lmdeploy
markdown-it-py     3.0.0
MarkupSafe         2.1.1
matplotlib         3.7.2
mdurl              0.1.2
mkl-fft            1.3.6
mkl-random         1.2.2
mkl-service        2.4.0
mmengine           0.8.2
mpmath             1.2.1
networkx           2.8.4
numpy              1.25.0
opencv-python      4.8.0.74
packaging          23.1
Pillow             9.4.0
pip                23.1.2
platformdirs       3.9.1
protobuf           4.23.4
pybind11           2.11.0
pycparser          2.21
Pygments           2.15.1
pyOpenSSL          23.0.0
pyparsing          3.0.9
PySocks            1.7.1
python-dateutil    2.8.2
python-rapidjson   1.10
PyYAML             6.0
regex              2023.6.3
requests           2.29.0
rich               13.4.2
safetensors        0.3.1
sentencepiece      0.1.99
setuptools         67.8.0
six                1.16.0
sympy              1.11.1
termcolor          2.3.0
tokenizers         0.13.3
tomli              2.0.1
torch              2.0.0
torchaudio         2.0.0
torchvision        0.15.0
tqdm               4.65.0
transformers       4.29.2
triton             2.0.0
tritonclient       2.33.0
typing_extensions  4.6.3
urllib3            1.26.16
wheel              0.38.4
yapf               0.40.1
zipp               3.16.2

Question about internlm-chat-7b-8k

I converted the internlm-chat-7b-8k model to TurboMind format with tp=2, and the generated weight/config.ini is shown below. Isn't 8k supposed to mean a maximum of 8000+ tokens? Where is that set? Right now I get an error as soon as a call exceeds 2048 tokens.

[llama]
model_name = internlm-chat-7b
head_num = 32
size_per_head = 128
vocab_size = 103168
num_layer = 32
rotary_embedding = 128
inter_size = 11008
norm_eps = 1e-06
attn_bias = 1
start_id = 1
end_id = 2
weight_type = fp16
max_batch_size = 32
max_context_token_num = 4
session_len = 2056
step_length = 1
cache_max_entry_count = 48
cache_chunk_size = 1
use_context_fmha = 1
quant_policy = 0
tensor_para_size = 2

ModuleNotFoundError: No module named '_turbomind'

I installed with pip install -e . and tried to run python3 -m lmdeploy.turbomind.chat llama ... but got:

  File "/mnt//lmdeploy/lmdeploy/turbomind/__init__.py", line 3, in <module>
    from .turbomind import TurboMind
  File "/mnt//work/lmdeploy/lmdeploy/turbomind/turbomind.py", line 17, in <module>
    import _turbomind as _tm  # noqa: E402
ModuleNotFoundError: No module named '_turbomind'

HTTP client question

Is there a regular HTTP request client that does not require installing the full lmdeploy package or making gRPC calls, and that does not need streaming, returning the whole answer in a single response?
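
For what it's worth, newer LMDeploy releases expose an OpenAI-compatible REST server (lmdeploy serve api_server), which can be called with a plain HTTP client and returns the full answer in one response when streaming is disabled. A sketch, assuming such a server is listening on localhost:23333:

import requests

resp = requests.post(
    "http://localhost:23333/v1/chat/completions",
    json={
        "model": "internlm2-chat-7b",  # the model name reported by the server
        "messages": [{"role": "user", "content": "Hi, please introduce yourself"}],
        "stream": False,  # return the whole answer in a single response
    },
    timeout=300,
)
print(resp.json()["choices"][0]["message"]["content"])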

[QA] How to convert saved ckpt content into pytorch_model files

How can I convert the ckpt content saved during training into pytorch_model content? Thanks.

For example, after pre-training for 100 steps with zero=4/tensor=2 on our own data, the saved ckpt folder contains:
context.pt gpus-8_pp-0_tp-0_zo-3.pt gpus-8_pp-0_tp-0_zo-7.pt optimizer_tp0_pp0_zo1.pt optimizer_tp0_pp0_zo5.pt schedulder.pt
gpus-8_pp-0_tp-0_zo-0.pt gpus-8_pp-0_tp-0_zo-4.pt model_config.pt optimizer_tp0_pp0_zo2.pt optimizer_tp0_pp0_zo6.pt topo_tp0_pp0.json
gpus-8_pp-0_tp-0_zo-1.pt gpus-8_pp-0_tp-0_zo-5.pt model_tp0_pp0.pt optimizer_tp0_pp0_zo3.pt optimizer_tp0_pp0_zo7.pt
gpus-8_pp-0_tp-0_zo-2.pt gpus-8_pp-0_tp-0_zo-6.pt optimizer_tp0_pp0_zo0.pt optimizer_tp0_pp0_zo4.pt sampler.pt

Goal: convert them into model files that can be loaded directly by lmdeploy.serve.turbomind.deploy:
config.json modeling_internlm.py pytorch_model.bin.index.json tokenization_internlm.py
configuration_internlm.py pytorch_model-00001-of-00002.bin README.md tokenizer_config.json
generation_config.json pytorch_model-00002-of-00002.bin special_tokens_map.json

Error in ModelLifeCycle::CreateModel() when creating the model

========== step 1 ============
I ran the lmdeploy.serve.turbomind.deploy command on one machine with tp=2, because I want to run the model on two GPUs.
This step succeeded.

========== step 2 ============
On another machine I ran service_docker_up.sh. Triton Server started the TurboMind backend, but the model failed to load and an error was reported.

Here is the error output:

I0712 08:36:20.719380 1 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f0c44000000' with size 268435456
I0712 08:36:20.720671 1 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0712 08:36:20.720696 1 cuda_memory_manager.cc:105] CUDA memory pool is created on device 1 with size 67108864
W0712 08:36:20.859316 1 server.cc:218] failed to enable peer access for some device pairs
I0712 08:36:21.439338 1 model_lifecycle.cc:459] loading: turbomind:1
I0712 08:36:21.441099 1 model_lifecycle.cc:459] loading: postprocessing:1
I0712 08:36:21.442823 1 model_lifecycle.cc:459] loading: preprocessing:1
I0712 08:36:21.582434 1 libfastertransformer.cc:1746] TRITONBACKEND_Initialize: turbomind
I0712 08:36:21.582484 1 libfastertransformer.cc:1753] Triton TRITONBACKEND API version: 1.10
I0712 08:36:21.582501 1 libfastertransformer.cc:1757] 'turbomind' TRITONBACKEND API version: 1.10
I0712 08:36:21.585413 1 libfastertransformer.cc:1784] TRITONBACKEND_ModelInitialize: turbomind (version 1)
I0712 08:36:21.586543 1 libfastertransformer.cc:307] Instance group type: KIND_CPU count: 48
E0712 08:36:21.586654 1 libfastertransformer.cc:226] Invalid configuration argument 'tensor_para_size': stoi
[3b379e147757:1    :0:86] Caught signal 8 (Floating point exception: integer divide by zero)
==== backtrace (tid:     86) ====
 0 0x0000000000014420 __funlockfile()  ???:0
 1 0x0000000000018313 triton::backend::turbomind_backend::ModelState::ModelState()  /opt/tritonserver/lmdeploy/src/turbomind/triton_backend/libfastertransformer.cc:323
 2 0x0000000000024554 triton::backend::turbomind_backend::ModelState::Create()  /opt/tritonserver/lmdeploy/src/turbomind/triton_backend/libfastertransformer.cc:182
 3 0x0000000000024b81 TRITONBACKEND_ModelInitialize()  /opt/tritonserver/lmdeploy/src/turbomind/triton_backend/libfastertransformer.cc:1791
 4 0x000000000010689b triton::core::TritonModel::Create()  :0
 5 0x00000000001c4f5d triton::core::ModelLifeCycle::CreateModel()  :0
 6 0x00000000001caccd std::_Function_handler<void (), triton::core::ModelLifeCycle::AsyncLoad(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, inference::ModelConfig const&, bool, std::shared_ptr<triton::core::TritonRepoAgentModelList> const&, std::function<void (triton::core::Status)>&&)::{lambda()#1}>::_M_invoke()  model_lifecycle.cc:0
 7 0x00000000003083a0 std::thread::_State_impl<std::thread::_Invoker<std::tuple<triton::common::ThreadPool::ThreadPool(unsigned long)::{lambda()#1}> > >::_M_run()  thread_pool.cc:0
 8 0x00000000000d6de4 std::error_code::default_error_condition()  ???:0
 9 0x0000000000008609 start_thread()  ???:0
10 0x000000000011f133 clone()  ???:0
=================================
[3b379e147757:00001] *** Process received signal ***
[3b379e147757:00001] Signal: Floating point exception (8)
[3b379e147757:00001] Signal code:  (-6)
[3b379e147757:00001] Failing at address: 0x1
[3b379e147757:00001] [ 0] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7f0c8d1a9420]
[3b379e147757:00001] [ 1] /opt/tritonserver/backends/turbomind/libtriton_turbomind.so(+0x18313)[0x7f0c80652313]
[3b379e147757:00001] [ 2] /opt/tritonserver/backends/turbomind/libtriton_turbomind.so(+0x24554)[0x7f0c8065e554]
[3b379e147757:00001] [ 3] /opt/tritonserver/backends/turbomind/libtriton_turbomind.so(TRITONBACKEND_ModelInitialize+0x341)[0x7f0c8065eb81]
[3b379e147757:00001] [ 4] /opt/tritonserver/lib/libtritonserver.so(+0x10689b)[0x7f0c8c2de89b]
[3b379e147757:00001] [ 5] /opt/tritonserver/lib/libtritonserver.so(+0x1c4f5d)[0x7f0c8c39cf5d]
[3b379e147757:00001] [ 6] /opt/tritonserver/lib/libtritonserver.so(+0x1caccd)[0x7f0c8c3a2ccd]
[3b379e147757:00001] [ 7] /opt/tritonserver/lib/libtritonserver.so(+0x3083a0)[0x7f0c8c4e03a0]
[3b379e147757:00001] [ 8] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6de4)[0x7f0c8be25de4]
[3b379e147757:00001] [ 9] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7f0c8d19d609]
[3b379e147757:00001] [10] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f0c8bb10133]
[3b379e147757:00001] *** End of error message ***

Any suggestions for debugging this?

Trouble with 'Quantization' in README.md

Here is the log:

(lmdeploy_test) [xxx@xxxxxxxxxxxx internlm_test]$ python -m lmdeploy.lite.apis.kv_qparams --model internlm-chat-7b --output_dir internlm-chat-7b-deploy --symmetry True --offload  False --num_tp 1
Traceback (most recent call last):
  File "/xxx/xxx/miniconda3/envs/lmdeploy_test/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/nvme/xxx/miniconda3/envs/lmdeploy_test/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/xxx/xxx/miniconda3/envs/lmdeploy_test/lib/python3.9/site-packages/lmdeploy/lite/apis/kv_qparams.py", line 199, in <module>
    fire.Fire(main)
  File "/xxx/xxx/miniconda3/envs/lmdeploy_test/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/xxx/xxx/miniconda3/envs/lmdeploy_test/lib/python3.9/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/xxx/xxx/miniconda3/envs/lmdeploy_test/lib/python3.9/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/xxx/xxx/miniconda3/envs/lmdeploy_test/lib/python3.9/site-packages/lmdeploy/lite/apis/kv_qparams.py", line 112, in main
    tokenizer = AutoTokenizer.from_pretrained(model, use_fast=False)
  File "/xxx/xxx/miniconda3/envs/lmdeploy_test/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py", line 688, in from_pretrained
    raise ValueError(
ValueError: Tokenizer class InternLMTokenizer does not exist or is not currently imported.

It looks like it can't load 'InternLMTokenizer' properly.

Using DeepSpeed TP to load InternLM, but GPU memory is not reduced

My hardware is four 16GB V100s. Following the README.md, I run PyTorch-based inference:

python3 -m lmdeploy.pytorch.chat $NAME_OR_PATH_TO_HF_MODEL\
    --max_new_tokens 64 \
    --temperture 0.8 \
    --top_p 0.95 \
    --seed 0

With single-GPU inference, 15GB of GPU memory is occupied after the model is loaded (screenshot).

deepspeed --module --num_gpus 2 lmdeploy.pytorch.chat \
    $NAME_OR_PATH_TO_HF_MODEL \
    --max_new_tokens 64 \
    --temperture 0.8 \
    --top_p 0.95 \
    --seed 0

With two-GPU TP inference, each card also occupies 15GB after the model is loaded (screenshot).

The expected reduction in GPU memory is not achieved; I don't quite understand where the problem is.

How to do batch inference?

With batch size 2, I ran python3 -m lmdeploy.turbomind.chat llama /workspace [0,1] and filled input_ids with a batch of inputs, but got RuntimeError: output with shape [1] doesn't match the broadcast shape [2]

Test

PB10.mp4

PB

PB11.mp4
PersistentBatchInference.mp4

PersistentBatchInference

Question about benchmark

Hi, I tested LMDeploy with the following steps:

    1. Get the model from https://huggingface.co/internlm/internlm-chat-7b/
    2. Convert it to Triton models: python -m lmdeploy.serve.turbomind.deploy interlm-7b interlm-7b hf
    3. Run python3 profile_generation.py --model_path /workspace/ --model_name internlm --concurrency 8 --input_seqlen 1 --output_seqlen 2048 --test_round 8 in the provided docker image openmmlab/lmdeploy:latest on an A100 80G

The result I get is throughput: 70.98455828512093 token/s, while the document shows it should reach almost 640 token/s with batch=8 (screenshot).

Are there any configurations I need to modify? Thanks.

RuntimeError: Internal: src/sentencepiece_processor.cc(1101) [model_proto->ParseFromArray(serialized.data(), serialized.size())]

(lmdeploy) ➜  lmdeploy git:(main) python -m lmdeploy.serve.turbomind.deploy internlm-7b /mnt/internlm-7b hf
create workspace in directory ./workspace
copy triton model templates from "/mnt/lmdeploy/lmdeploy/serve/turbomind/triton_models" to "./workspace/triton_models" successfully
['pytorch_model-00001-of-00002.bin', 'pytorch_model-00002-of-00002.bin']

### copying layers.31.attention.wo.bias, shape=torch.Size([4096])
layers.31.attention.wo.0.bias torch.Size([4096])
*** splitting layers.31.feed_forward.w1.weight, shape=torch.Size([4096, 11008]), split_dim=-1
layers.31.feed_forward.w1.0.weight torch.Size([4096, 11008])
*** splitting layers.31.feed_forward.w2.weight, shape=torch.Size([11008, 4096]), split_dim=0
layers.31.feed_forward.w2.0.weight torch.Size([11008, 4096])
*** splitting layers.31.feed_forward.w3.weight, shape=torch.Size([4096, 11008]), split_dim=-1
layers.31.feed_forward.w3.0.weight torch.Size([4096, 11008])
layers.31.attention_norm.weight torch.Size([4096])
layers.31.ffn_norm.weight torch.Size([4096])
tok_embeddings.weight torch.Size([103168, 4096])
norm.weight torch.Size([4096])
output.weight torch.Size([103168, 4096])
Traceback (most recent call last):
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/mnt/lmdeploy/lmdeploy/serve/turbomind/deploy.py", line 549, in <module>
    fire.Fire(main)
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/mnt/lmdeploy/lmdeploy/serve/turbomind/deploy.py", line 522, in main
    res = deploy_hf(model_name, model_path, tokenizer_path,
  File "/mnt/lmdeploy/lmdeploy/serve/turbomind/deploy.py", line 462, in deploy_hf
    return export(model_name, num_layer, norm_eps, model_params,
  File "/mnt/lmdeploy/lmdeploy/serve/turbomind/deploy.py", line 167, in export
    vocab_size, bos_id, eos_id = tokenizer_info(tokenizer_path)
  File "/mnt/lmdeploy/lmdeploy/serve/turbomind/deploy.py", line 87, in tokenizer_info
    sp_model = SentencePieceProcessor(model_file=model_path)
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/site-packages/sentencepiece/__init__.py", line 447, in Init
    self.Load(model_file=model_file, model_proto=model_proto)
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/site-packages/sentencepiece/__init__.py", line 905, in Load
    return self.LoadFromFile(model_file)
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/site-packages/sentencepiece/__init__.py", line 310, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
RuntimeError: Internal: src/sentencepiece_processor.cc(1101) [model_proto->ParseFromArray(serialized.data(), serialized.size())] 


Need all the LFS files

Can KV cache quantization and context FMHA not be enabled at the same time for llama 65b?

During deployment testing, llama 7b runs fine with use_context_fmha = 1 and quant_policy = 4, but llama 65b does not; it needs use_context_fmha = 0. Is this a problem on my side, or can the two really not be enabled together at the moment?

The error is:
what(): [TM][ERROR] CUDA runtime error: an illegal memory access was encountered lmdeploy/src/turbomind/models/llama/LlamaBatch.cc:843

The inference is stuck, possibly occurring before the invocation of the 'forward' method.

Hello, I have run the service successfully. However, when I use the app 'lmdeploy.app.py' and send a message to the server, I notice that the inference gets stuck.

These are the logs of Tritonserver.

[TM][INFO] [forward][rank=0] INPUT: step [1]
[TM][INFO] [forward][rank=0] INPUT: repetition_penalty [1]
[TM][INFO] [forward][rank=0] INPUT: temperature [1]
[TM][INFO] [forward][rank=0] INPUT: STOP [1]
[TM][INFO] [forward][rank=0] INPUT: START [1]
[TM][INFO] [forward][rank=0] INPUT: random_seed [1]
[TM][INFO] [forward][rank=0] INPUT: input_ids [1, 15]
[TM][INFO] [forward][rank=0] INPUT: stop_words_list [1, 2, 2]
[TM][INFO] [forward][rank=0] INPUT: runtime_top_p [1]
[TM][INFO] [forward][rank=0] INPUT: END [1]
[TM][INFO] [forward][rank=0] INPUT: input_lengths [1]
[TM][INFO] [forward][rank=0] INPUT: CORRID [1]
[TM][INFO] [forward][rank=0] INPUT: request_output_len [1, 1]
[TM][INFO] [forward][rank=0] INPUT: session_len [1]
[TM][INFO] [forward][rank=0] OUTPUT: sequence_length [1, 1]
[TM][INFO] [forward][rank=0] OUTPUT: output_ids [1, 1, 2056]
[TM][INFO] [forward] Enqueue requests
[TM][INFO] [forward] Wait for requests to complete ...
[TM][INFO] [synchronize] batch_size = 0
[TM][INFO] [LlamaCacheManager][create] 140002462050048
[TM][INFO] [LlamaCacheManager][allocate]
[TM][INFO] [LlamaCacheManager][allocate] free = 0
[TM][INFO] [init] infer_request_count = 1
[TM][INFO] [init] batch_size = 1
[TM][INFO] [init] session_len = 2056
[TM][INFO] [init] max_input_length = 15
[TM][INFO] [init] max_context_len = 15
[TM][INFO] [init] slot  sequence_id  history_len  input_len  context_len  tmp_input_len  token_ids.size  cache_len
[TM][INFO] [init]    0   3708069632            0         15           15             15               0          0
[TM][INFO] [decodeContext] base = 0, count = 1
[TM][INFO] [decodeContext] offset = 0, batch_size = 1, token_num = 14, max_input_len = 14, max_context_len = 14
[TM][INFO] context decoding start

Based on the source code, I believe it might be stuck before the 'forward' method.

TM_LOG_INFO("context decoding start");

Could you give me some advice? Thanks.

Error when deploying according to readme.md: [FT][ERROR] CUDA runtime error: invalid argument /opt/trito

step 1: Download the internlm-chat-7b model

step 2: Run the docker image: docker run -itd --net=host --name internlm_server --gpus all -v ./workspace/:/workspace -v /data/models/:/models -it openmmlab/lmdeploy:latest bash

step 3: Convert the model to TurboMind format (I have two T4s):
root@:/opt/tritonserver/lmdeploy# python3 -m lmdeploy.serve.turbomind.deploy internlm-7b /models/internlm-chat-7b hf --tp 2

step 4: Change into the directory and run:
root@/opt/tritonserver/lmdeploy# python3 -m lmdeploy.turbomind.chat internlm ./workspace/

The error is as follows:

[WARNING] gemm_config.in is not found; using default GEMM algo
terminate called after throwing an instance of 'std::runtime_error'
what(): [FT][ERROR] CUDA runtime error: invalid argument /opt/tritonserver/lmdeploy/src/turbomind/utils/allocator.h:252

Aborted (core dumped)

[WIP] Support InternLM on 3rd-party inference toolboxes

This issue tracks progress on 3rd-party toolboxes related to InternLM.

VLLM

https://github.com/wangruohui/vllm/tree/internlm

  • Inference with single GPU
    • There seem to be some bugs; not sure whether they come from my implementation or from upstream
  • Tensor parallel

DeepSpeed

InternLM-7B is supported in Deepspeed inference and merged to main branch: microsoft/DeepSpeed#4137

  • Single GPU with kernel injection policy
  • Tensor parallel

Meta tensor for faster model loading: watching microsoft/DeepSpeed#3608

Comparison with vLLM

vLLM claims a boost of up to 24x compared with the vanilla llama implementation. Does lmdeploy have any speed comparison against it?

a bug

I'm trying to deploy on my server with 2x3090, CUDA 11.7.
It deploys normally with the command: "docker run --gpus all --rm -v $(pwd)/workspace:/workspace -it openmmlab/lmdeploy:latest
python3 -m lmdeploy.turbomind.chat internlm /workspace"
However, it cannot be deployed with "bash workspace/service_docker_up.sh" because of a segmentation fault (screenshot).

Also, when I try "python3 lmdeploy.app {server_ip_addresss}:33337 internlm" to start a client, it reports that torch has no cuda module. This is because lmdeploy/lmdeploy/torch is added to sys.path, and that torch has no cuda module. I fixed this by adding "sys.path.remove("lmdeploy/lmdeploy/torch")".

[documentation] Run failed when following the readme docs

Get InternLM model

# 1. Download InternLM model

# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/internlm/internlm-7b /path/to/internlm-7b

# if you want to clone without large files – just their pointers
# prepend your git clone with the following env var:
GIT_LFS_SKIP_SMUDGE=1

# 2. Convert InternLM model to turbomind's format, which will be in "./workspace" by default
python3 -m lmdeploy.serve.turbomind.deploy internlm-7b /path/to/internlm-7b hf

Users need to install the requirements for internlm first; only then can python3 -m lmdeploy.serve.turbomind.deploy internlm-7b /path/to/internlm-7b hf run successfully. Can we add this tip to the document?

[Bug] TurboMind execute failure: 1

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.

Describe the bug

When I communicate with the inference server for more than one round, there is an error. I must reset the session.

TurboMind execute failure:  1
07/31 12:28:43 - service.ft - ERROR - /usr/local/lib/python3.8/dist-packages/lmdeploy/serve/turbomind/chatbot.py - stream_consumer - 553 - got error from turbomind, code StatusCode.TRITON_SERVER_ERR, TurboMind execute failure:  1, token 677

Reproduction

Communicate with the inference server more than one round.

Error traceback

No response

Error on startup

After converting llama-7b to the TurboMind format, starting it with docker reports an error.
Conversion command:
python3 -m lmdeploy.serve.turbomind.deploy llama-7b /home/nlp/lwp/pre_models/llama-7b-hf hf
Launch command:
docker run --gpus "device=8" --rm -v $(pwd)/workspace:/workspace -it openmmlab/lmdeploy:latest python3 -m lmdeploy.turbomind.chat llama /workspace
Error message:
[WARNING] gemm_config.in is not found; using default GEMM algo
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.8/dist-packages/lmdeploy/turbomind/chat.py", line 96, in
fire.Fire(main)
File "/usr/local/lib/python3.8/dist-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/usr/local/lib/python3.8/dist-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/usr/local/lib/python3.8/dist-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/lmdeploy/turbomind/chat.py", line 35, in main
tokenizer = AutoTokenizer.from_pretrained(tokenizer_model_path,
File "/usr/local/lib/python3.8/dist-packages/transformers/models/auto/tokenization_auto.py", line 690, in from_pretrained
raise ValueError(
ValueError: Tokenizer class LLaMATokenizer does not exist or is not currently imported.

Running with the script under workspace also reports an error: error: creating server: Internal - failed to load all models

What is going wrong here?

[P0] Support vicuna 7B

lmdeploy doesn't support vicuna 7B well, because its preprocessor cannot tokenize <s> and </s> into the bos and eos tokens respectively.

I think we'd better replace the tokenizer (lmdeploy/fastertransformer/triton_models/preprocessing/1/model.py) with Hugging Face's AutoTokenizer.

FYI, here is an introduction to downloading and serving the vicuna-7B v1.1 model

an error about llama-65b

65B
python3 lmdeploy.serve.turbomind.deploy llama-13B /path/to/llama-13b llama
--tokenizer_path /path/to/tokenizer/model --tp 8
bash workspace/service_docker_up.sh

Is this correct? I found this in the docs.

Triton Server support

I looked at the code; the Triton backend also seems to be supported, but there is no int8_mode option. Is it not supported?

Question about persistent Batch Inference

Hi, Thank you for the open source LMDeploy project!

There is an image in the documentation that describes the process of dynamic batching inference well, but I couldn't find more details about how LMDeploy implements this function.

Is there any document, or could you tell me where this part is implemented in the code?

Ziya fails to start

(lmdeploy) ➜ lmdeploy sudo bash workspace/service_docker_up.sh

=============================
== Triton Inference Server ==

NVIDIA Release 22.12 (build 50109463)
Triton Server Version 2.29.0

Copyright (c) 2018-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

I0721 03:34:28.851237 1 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f6d00000000' with size 268435456
I0721 03:34:28.851696 1 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0721 03:34:28.857984 1 model_lifecycle.cc:459] loading: postprocessing:1
I0721 03:34:28.858019 1 model_lifecycle.cc:459] loading: preprocessing:1
I0721 03:34:28.858036 1 model_lifecycle.cc:459] loading: turbomind:1
I0721 03:34:29.000309 1 libfastertransformer.cc:1746] TRITONBACKEND_Initialize: turbomind
I0721 03:34:29.000337 1 libfastertransformer.cc:1753] Triton TRITONBACKEND API version: 1.10
I0721 03:34:29.000340 1 libfastertransformer.cc:1757] 'turbomind' TRITONBACKEND API version: 1.10
I0721 03:34:29.002218 1 libfastertransformer.cc:1784] TRITONBACKEND_ModelInitialize: turbomind (version 1)
I0721 03:34:29.002902 1 libfastertransformer.cc:307] Instance group type: KIND_CPU count: 48
num_nodes=1
tp_pp_size=1
gpu_size=1
world_size=1
model_instance_size=1
I0721 03:34:29.002929 1 libfastertransformer.cc:346] Sequence Batching: disabled
I0721 03:34:29.002934 1 libfastertransformer.cc:357] Dynamic Batching: disabled
[ERROR] Does not find the section llama with name model_name.

'InternLMTokenizer' object has no attribute 'backend_tokenizer'

When running "Inference by TurboMind" with the command
python3 -m lmdeploy.turbomind.chat internlm ./workspace/, the following error is raised:
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/opt/tritonserver/lmdeploy/lmdeploy/turbomind/chat.py", line 109, in
fire.Fire(main)
File "/usr/local/lib/python3.8/dist-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/usr/local/lib/python3.8/dist-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/usr/local/lib/python3.8/dist-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/opt/tritonserver/lmdeploy/lmdeploy/turbomind/chat.py", line 43, in main
tokenizer = Tokenizer(tokenizer_model_path)
File "/opt/tritonserver/lmdeploy/lmdeploy/turbomind/tokenizer.py", line 152, in init
self.model = HuggingFaceTokenizer(model_folder)
File "/opt/tritonserver/lmdeploy/lmdeploy/turbomind/tokenizer.py", line 84, in init
self.model.backend_tokenizer.save(backend_tokenizer_file)
AttributeError: 'InternLMTokenizer' object has no attribute 'backend_tokenizer

I looked at the source code, and this backend_tokenizer doesn't really seem to be used much. Is it actually needed?
