
text-generation-inference's Introduction

Making TGI deployment optimal

Text Generation Inference


A Rust, Python and gRPC server for text generation inference. Used in production at Hugging Face to power Hugging Chat, the Inference API and Inference Endpoints.


Text Generation Inference (TGI) is a toolkit for deploying and serving Large Language Models (LLMs). TGI enables high-performance text generation for the most popular open-source LLMs, including Llama, Falcon, StarCoder, BLOOM, GPT-NeoX, and more. TGI implements many features, such as:

  • Simple launcher to serve most popular LLMs
  • Production ready (distributed tracing with Open Telemetry, Prometheus metrics)
  • Tensor Parallelism for faster inference on multiple GPUs
  • Token streaming using Server-Sent Events (SSE)
  • Continuous batching of incoming requests for increased total throughput
  • Optimized transformers code for inference using Flash Attention and Paged Attention on the most popular architectures
  • Quantization with bitsandbytes, GPT-Q, AWQ, and EETQ
  • Safetensors weight loading
  • Watermarking with A Watermark for Large Language Models
  • Logits warping (temperature scaling, top-p, top-k, repetition penalty; see transformers.LogitsProcessor for more details)
  • Stop sequences
  • Log probabilities
  • Speculation (speculative decoding) for roughly 2x lower latency
  • Guidance/JSON: specify an output format to speed up inference and ensure the output is valid according to a given schema
  • Custom Prompt Generation: Easily generate text by providing custom prompts to guide the model's output
  • Fine-tuning Support: Utilize fine-tuned models for specific tasks to achieve higher accuracy and performance

Hardware support

Get Started

Docker

For a detailed starting guide, please see the Quick Tour. The easiest way of getting started is using the official Docker container:

model=HuggingFaceH4/zephyr-7b-beta
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0 --model-id $model

And then you can make requests like

curl 127.0.0.1:8080/generate_stream \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
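The same server can also be queried from Python. Below is a minimal sketch using the text_generation client (an assumption: the client is installed with pip install text-generation and the server is listening on port 8080 as above):

from text_generation import Client

client = Client("http://127.0.0.1:8080")

# Stream tokens as they are generated (Server-Sent Events under the hood)
text = ""
for response in client.generate_stream("What is Deep Learning?", max_new_tokens=20):
    if not response.token.special:
        text += response.token.text
print(text)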

Note: To use NVIDIA GPUs, you need to install the NVIDIA Container Toolkit. We also recommend using NVIDIA drivers with CUDA version 12.2 or higher. To run the Docker container on a machine with no GPUs or CUDA support, remove the --gpus all flag and add --disable-custom-kernels. Please note that CPU is not the intended platform for this project, so performance might be subpar.

Note: TGI supports AMD Instinct MI210 and MI250 GPUs. Details can be found in the Supported Hardware documentation. To use AMD GPUs, please use docker run --device /dev/kfd --device /dev/dri --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0-rocm --model-id $model instead of the command above.

To see all options to serve your models (in the code or in the cli):

text-generation-launcher --help

API documentation

You can consult the OpenAPI documentation of the text-generation-inference REST API using the /docs route. The Swagger UI is also available at: https://huggingface.github.io/text-generation-inference.

Using a private or gated model

You can use the HUGGING_FACE_HUB_TOKEN environment variable to configure the token used by text-generation-inference. This gives you access to protected resources.

For example, if you want to serve the gated Llama V2 model variants:

  1. Go to https://huggingface.co/settings/tokens
  2. Copy your cli READ token
  3. Export HUGGING_FACE_HUB_TOKEN=<your cli READ token>
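For a local (non-Docker) launch, that typically looks like the sketch below (the token placeholder is yours to fill in):

export HUGGING_FACE_HUB_TOKEN=<your cli READ token>
text-generation-launcher --model-id meta-llama/Llama-2-7b-chat-hf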

or with Docker:

model=meta-llama/Llama-2-7b-chat-hf
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
token=<your cli READ token>

docker run --gpus all --shm-size 1g -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0 --model-id $model

A note on Shared Memory (shm)

NCCL is a communication framework used by PyTorch for distributed training/inference. text-generation-inference makes use of NCCL to enable Tensor Parallelism to dramatically speed up inference for large language models.

In order to share data between the different devices of an NCCL group, NCCL might fall back to using the host memory if peer-to-peer using NVLink or PCI is not possible.

To allow the container to use 1G of shared memory and support SHM sharing, we add --shm-size 1g to the command above.

If you are running text-generation-inference inside Kubernetes, you can also add shared memory to the container by creating a volume with:

- name: shm
  emptyDir:
   medium: Memory
   sizeLimit: 1Gi

and mounting it to /dev/shm.
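Putting both pieces together, a minimal sketch of the relevant pod spec (container and volume names are illustrative):

spec:
  containers:
    - name: text-generation-inference
      image: ghcr.io/huggingface/text-generation-inference:2.0
      volumeMounts:
        - name: shm
          mountPath: /dev/shm
  volumes:
    - name: shm
      emptyDir:
        medium: Memory
        sizeLimit: 1Gi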

Finally, you can also disable SHM sharing by using the NCCL_SHM_DISABLE=1 environment variable. However, note that this will impact performance.

Distributed Tracing

text-generation-inference is instrumented with distributed tracing using OpenTelemetry. You can use this feature by setting the address to an OTLP collector with the --otlp-endpoint argument.
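For example (a sketch; the collector address is an assumption for your environment):

text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2 --otlp-endpoint http://localhost:4317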

Architecture

TGI architecture

Local install

You can also opt to install text-generation-inference locally.

First install Rust and create a Python virtual environment with at least Python 3.9, e.g. using conda:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

conda create -n text-generation-inference python=3.11
conda activate text-generation-inference

You may also need to install Protoc.

On Linux:

PROTOC_ZIP=protoc-21.12-linux-x86_64.zip
curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v21.12/$PROTOC_ZIP
sudo unzip -o $PROTOC_ZIP -d /usr/local bin/protoc
sudo unzip -o $PROTOC_ZIP -d /usr/local 'include/*'
rm -f $PROTOC_ZIP

On MacOS, using Homebrew:

brew install protobuf

Then run:

BUILD_EXTENSIONS=True make install # Install repository and HF/transformer fork with CUDA kernels
text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2

Note: on some machines, you may also need the OpenSSL libraries and gcc. On Linux machines, run:

sudo apt-get install libssl-dev gcc -y

Optimized architectures

TGI works out of the box to serve optimized models for all modern architectures; they can be found in this list.

Other architectures are supported on a best-effort basis using:

AutoModelForCausalLM.from_pretrained(<model>, device_map="auto")

or

AutoModelForSeq2SeqLM.from_pretrained(<model>, device_map="auto")
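As a self-contained sketch of that fallback path in plain transformers (the model name is only an example):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/gpt-j-6B"  # example model; any causal LM works here
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("What is Deep Learning?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))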

Run locally

Run

text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2

Quantization

You can also quantize the weights with bitsandbytes to reduce the VRAM requirement:

text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2 --quantize bitsandbytes

4bit quantization is available using the NF4 and FP4 data types from bitsandbytes. It can be enabled by providing --quantize bitsandbytes-nf4 or --quantize bitsandbytes-fp4 as a command line argument to text-generation-launcher.
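For example:

text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2 --quantize bitsandbytes-nf4
text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2 --quantize bitsandbytes-fp4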

Develop

make server-dev
make router-dev

Testing

# python
make python-server-tests
make python-client-tests
# or both server and client tests
make python-tests
# rust cargo tests
make rust-tests
# integration tests
make integration-tests

text-generation-inference's People

Contributors

abhishekkrthakur, amihalik, drbh, eltociear, flozi00, fxmarty, gary149, gsaivinay, lewtun, merveenoyan, mishig25, narsil, njhill, olivierdehaene, ooraph, osanseviero, regisss, seongbeomlee, ssmi153, sywangyi, thomasw21, tleyden, vakker, victorsanh, vinno97, wauplin, xyang16, yard1, zhangsibo1129, zspo


text-generation-inference's Issues

Cannot load local model?

Hi, I tried to serve GPT-J with a Hugging Face repo id, and it works as follows:

<10-140-0-182 text-generation-inference]$ CUDA_VISIBLE_DEVICES=4,5 text-generation-launcher --model-id EleutherAI/gpt-j-6B                    
2023-03-05T06:28:52.688866Z  INFO text_generation_launcher: Args { model_id: "EleutherAI/gpt-j-6B", revision: None, num_shard: 1, quantize: false, max_concurrent_requests: 128, max_stop_sequences: 4, max_input_length: 1000, max_total_tokens: 1512, max_batch_size: 32, max_waiting_tokens: 20, port: 3000, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/nvme/yjc/mini-chatgpt/model/model_training/.cache"), weights_cache_override: None, disable_custom_kernels: false, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None }
2023-03-05T06:28:52.689071Z  INFO text_generation_launcher: Starting shard 0
2023-03-05T06:29:02.714674Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-03-05T06:29:12.734898Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-03-05T06:29:22.754421Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-03-05T06:29:32.766042Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-03-05T06:29:37.070978Z  INFO text_generation_launcher: Shard 0 ready in 44.381441551s
2023-03-05T06:29:37.164840Z  INFO text_generation_launcher: Starting Webserver
2023-03-05T06:29:52.895332Z  INFO text_generation_router: router/src/main.rs:130: Connected

However, when I use a locally trained model at saved_model/checkpoint-3000, it says the tokenizer cannot be found, even though the directory does contain all the tokenizer files (e.g., vocab.json, tokenizer.json, etc.).

CUDA_VISIBLE_DEVICES=4,5 text-generation-launcher --model-id saved_model/checkpoint-3000
2023-03-05T07:11:11.575274Z  INFO text_generation_launcher: Args { model_id: "saved_model/checkpoint-3000", revision: None, num_shard: 1, quantize: false, max_concurrent_requests: 128, max_stop_sequences: 4, max_input_length: 1000, max_total_tokens: 1512, max_batch_size: 32, max_waiting_tokens: 20, port: 3000, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/nvme/yjc/mini-chatgpt/model/model_training/.cache"), weights_cache_override: None, disable_custom_kernels: false, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None }
2023-03-05T07:11:11.575516Z  INFO text_generation_launcher: Starting shard 0
2023-03-05T07:11:21.613465Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-03-05T07:11:31.630234Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-03-05T07:11:41.644839Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-03-05T07:11:51.658133Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-03-05T07:11:54.259818Z  INFO text_generation_launcher: Shard 0 ready in 42.683891104s
2023-03-05T07:11:54.319425Z  INFO text_generation_launcher: Starting Webserver
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: "Model \"saved_model/checkpoint-3000\" on the Hub doesn't have a tokenizer"', router/src/main.rs:90:78
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
2023-03-05T07:11:55.320536Z ERROR text_generation_launcher: Webserver Crashed
2023-03-05T07:11:55.320567Z  INFO text_generation_launcher: Shutting down shards
2023-03-05T07:11:55.989175Z  INFO text_generation_launcher: Shard 0 terminated

<10-140-0-182 text-generation-inference]$ ls saved_model/checkpoint-3000                          
added_tokens.json  latest             rng_state_0.pth          tokenizer_config.json  training_args.bin
config.json        merges.txt         rng_state_1.pth          tokenizer.json         vocab.json
global_step3000    pytorch_model.bin  special_tokens_map.json  trainer_state.json     zero_to_fp32.py

Any advice on this problem?

Error launching bloom-560m

Steps to reproduce:
Followed local install steps in README.md

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

conda create -n text-generation-inference python=3.9 
conda activate text-generation-inference


PROTOC_ZIP=protoc-21.12-linux-x86_64.zip
curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v21.12/$PROTOC_ZIP
sudo unzip -o $PROTOC_ZIP -d /usr/local bin/protoc
sudo unzip -o $PROTOC_ZIP -d /usr/local 'include/*'
rm -f $PROTOC_ZIP
BUILD_EXTENSIONS=False make install
make run-bloom-560m

Error log:

2023-03-07T16:29:23.429743Z  INFO text_generation_launcher: Successfully downloaded weights.                                                                                                                  
2023-03-07T16:29:23.429988Z  INFO text_generation_launcher: Starting shard 0                                                                                                                                  
2023-03-07T16:29:23.430031Z  INFO text_generation_launcher: Starting shard 1                                                                                                                                  
2023-03-07T16:29:28.535903Z ERROR text_generation_launcher: Shard 1 failed to start:                                                                                                                          
We're not using custom kernels.                                                                                                                                                                               
thread '<unnamed>' panicked at 'range end index 1091153541 out of range for slice of length 1048576000', src/lib.rs:817:29                                                                                    
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace                                                                                                                                 
Traceback (most recent call last):                                                                                                                                                                            
  File "/home/ubuntu/rubikon/py39_env/bin/text-generation-server", line 8, in <module>                                                                                                                        
    sys.exit(app())                                                                                                                                                                                           
  File "/home/ubuntu/rubikon/py39_env/lib/python3.9/site-packages/typer/main.py", line 311, in __call__ 
    return get_command(self)(*args, **kwargs)
  File "/home/ubuntu/rubikon/py39_env/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/ubuntu/rubikon/py39_env/lib/python3.9/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/home/ubuntu/rubikon/py39_env/lib/python3.9/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/home/ubuntu/rubikon/py39_env/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ubuntu/rubikon/py39_env/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
File "/home/ubuntu/rubikon/py39_env/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/ubuntu/rubikon/py39_env/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/home/ubuntu/rubikon/text-generation-inference/server/text_generation/cli.py", line 55, in serve
    server.serve(model_id, revision, sharded, quantize, uds_path)
  File "/home/ubuntu/rubikon/text-generation-inference/server/text_generation/server.py", line 130, in serve
    asyncio.run(serve_inner(model_id, revision, sharded, quantize))
  File "/usr/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.9/asyncio/base_events.py", line 642, in run_until_complete
    return future.result()
  File "/home/ubuntu/rubikon/text-generation-inference/server/text_generation/server.py", line 99, in serve_inner
    model = get_model(model_id, revision, sharded, quantize)
  File "/home/ubuntu/rubikon/text-generation-inference/server/text_generation/models/__init__.py", line 57, in get_model
    return BLOOMSharded(model_id, revision, quantize=quantize)
  File "/home/ubuntu/rubikon/text-generation-inference/server/text_generation/models/bloom.py", line 86, in __init__
    self.load_weights(
  File "/home/ubuntu/rubikon/text-generation-inference/server/text_generation/models/bloom.py", line 156, in load_weights
    tensor = slice_[:]
pyo3_runtime.PanicException: range end index 1091153541 out of range for slice of length 1048576000

2023-03-07T16:29:28.535989Z  INFO text_generation_launcher: Shutting down shards
make: *** [Makefile:25: run-bloom-560m] Error 1

Isn't bloom-7b1 supported now?

text-generation-launcher --model-id bigscience/bloom-7b1 --num-shard 1 --port 8889

2023-03-13T04:57:41.703495Z  INFO text_generation_launcher: Args { model_id: "bigscience/bloom-7b1", revision: None, sharded: None, num_shard: Some(1), quantize: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1000, max_total_tokens: 1512, max_batch_size: 32, max_waiting_tokens: 20, port: 8889, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: None, weights_cache_override: None, disable_custom_kernels: false, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None }
2023-03-13T05:03:52.114048Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-03-13T05:03:52.514407Z  INFO text_generation_launcher: Shard 0 ready in 370.805438546s
2023-03-13T05:03:52.604007Z  INFO text_generation_launcher: Starting Webserver
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: "Model \"bigscience/bloom-7b1\" on the Hub doesn't have a tokenizer"', router/src/main.rs:101:70
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Is bert-base-uncased supported?

Hi,
I'm trying to deploy the bert-base-uncased model with v0.5.0, but got an error: ValueError: BertLMHeadModel does not support device_map='auto' yet.

root@nick-test1-8zjwg-135105-worker-0:/usr/local/bin# ./text-generation-launcher --model-id bert-base-uncased
2023-04-14T07:24:23.167920Z  INFO text_generation_launcher: Args { model_id: "bert-base-uncased", revision: None, sharded: None, num_shard: Some(1), quantize: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1000, max_total_tokens: 1512, max_batch_size: 32, max_waiting_tokens: 20, port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None }
2023-04-14T07:24:23.168401Z  INFO text_generation_launcher: Starting shard 0
2023-04-14T07:24:29.874262Z ERROR shard-manager: text_generation_launcher: "Error when initializing model
Traceback (most recent call last):
  File \"/opt/miniconda/envs/text-generation/bin/text-generation-server\", line 8, in <module>
    sys.exit(app())
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/typer/main.py\", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/click/core.py\", line 1130, in __call__
    return self.main(*args, **kwargs)
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/typer/core.py\", line 778, in main
    return _main(
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/typer/core.py\", line 216, in _main
    rv = self.invoke(ctx)
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/click/core.py\", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/click/core.py\", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/click/core.py\", line 760, in invoke
    return __callback(*args, **kwargs)
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/typer/main.py\", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/text_generation_server/cli.py\", line 55, in serve
    server.serve(model_id, revision, sharded, quantize, uds_path)
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/text_generation_server/server.py\", line 135, in serve
    asyncio.run(serve_inner(model_id, revision, sharded, quantize))
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/asyncio/runners.py\", line 44, in run
    return loop.run_until_complete(main)
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/asyncio/base_events.py\", line 634, in run_until_complete
    self.run_forever()
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/asyncio/base_events.py\", line 601, in run_forever
    self._run_once()
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/asyncio/base_events.py\", line 1905, in _run_once
    handle._run()
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/asyncio/events.py\", line 80, in _run
    self._context.run(self._callback, *self._args)
> File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/text_generation_server/server.py\", line 104, in serve_inner
    model = get_model(model_id, revision, sharded, quantize)
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/text_generation_server/models/__init__.py\", line 130, in get_model
    return CausalLM(model_id, revision, quantize=quantize)
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/text_generation_server/models/causal_lm.py\", line 308, in __init__
    self.model = AutoModelForCausalLM.from_pretrained(
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/transformers-4.28.0.dev0-py3.9-linux-x86_64.egg/transformers/models/auto/auto_factory.py\", line 471, in from_pretrained
    return model_class.from_pretrained(
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/transformers-4.28.0.dev0-py3.9-linux-x86_64.egg/transformers/modeling_utils.py\", line 2644, in from_pretrained
    raise ValueError(f\"{model.__class__.__name__} does not support `device_map='{device_map}'` yet.\")
ValueError: BertLMHeadModel does not support `device_map='auto'` yet.
" rank=0
2023-04-14T07:24:30.475420Z ERROR text_generation_launcher: Shard 0 failed to start.
2023-04-14T07:24:30.475495Z  INFO text_generation_launcher: Shutting down shards

[Feature] Model Info Endpoint

If it's not already possible in some way to get that information, I think it would be convenient to have a model info endpoint that exposes which model is currently being served.
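For reference, recent TGI versions do expose a model info route; a sketch of querying it (assuming the server from the Docker example above):

curl 127.0.0.1:8080/info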

Endless stream of 'Waiting for shard 0 to be ready...'

I tried to use google/flan-t5-xxl but repeatedly run into an endless stream of "Waiting for shard 0 to be ready...":

Here's how I start it:

text-generation-launcher --model-id google/flan-t5-xxl --num-shard 1
2023-02-20T16:45:10.672340Z  INFO text_generation_launcher: Args { model_id: "google/flan-t5-xxl", revision: None, num_shard: 1, quantize: false, max_concurrent_requests: 128, max_input_length: 1000, max_batch_size: 32, max_waiting_tokens: 20, port: 3000, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/export/cache"), weights_cache_override: None, disable_custom_kernels: false, json_output: false, otlp_endpoint: None }
2023-02-20T16:45:10.672487Z  INFO text_generation_launcher: Starting download process.
2023-02-20T16:45:13.609322Z  INFO download: text_generation_launcher: "Files are already present in the local cache. Skipping download."
2023-02-20T16:45:14.079337Z  INFO text_generation_launcher: Successfully downloaded weights.
2023-02-20T16:45:14.079644Z  INFO text_generation_launcher: Starting shard 0
2023-02-20T16:45:24.091665Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-02-20T16:45:34.101538Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-02-20T16:45:44.111335Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
[...]
2023-02-20T17:12:06.346883Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-02-20T17:12:16.361189Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-02-20T17:12:26.376228Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-02-20T17:12:36.389701Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-02-20T17:12:46.403412Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-02-20T17:12:56.417555Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...

I don't see any activity on the GPU, so it's not busy with loading the model onto the GPU.

Is there some way to get more verbose output to see where it's hanging?

Add optional left-truncation

When things are too long, it would be nice if there was a parameter that instructs the HF tokenizer to automatically left-truncate.
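For context, a minimal sketch of the requested behaviour using the HF tokenizer directly (plain transformers, outside of TGI; the tokenizer choice is illustrative):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.truncation_side = "left"  # keep the end of the prompt, drop the beginning

encoded = tokenizer("a very long prompt " * 500, truncation=True, max_length=1024)
print(len(encoded["input_ids"]))  # 1024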

Could not run 'c10d::allreduce_' with arguments from the 'Meta' backend.

I try to start a larger version of the model using Docker:
docker run -p 10249:80 -e RUST_BACKTRACE=full -e FLASH_ATTENTION=1 -e CUDA_VISIBLE_DEVICES=4,7 --privileged --security-opt="seccomp=unconfined" -v /download:/data ghcr.io/huggingface/text-generation-inference:0.5 --model-id /data/llama-13b-hf --num-shard 2 --max-total-tokens 2048

It can be initialized:

Details

2023-04-18T08:00:20.891953Z INFO text_generation_launcher: Args { model_id: "/data/llama-13b-hf", revision: None, sharded: None, num_shard: Some(2), quantize: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1000, max_total_tokens: 2048, max_batch_size: 32, max_waiting_tokens: 20, port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None }
2023-04-18T08:00:20.891982Z INFO text_generation_launcher: Sharding model on 2 processes
2023-04-18T08:00:20.892328Z INFO text_generation_launcher: Starting shard 0
2023-04-18T08:00:20.892328Z INFO text_generation_launcher: Starting shard 1
2023-04-18T08:00:26.396665Z INFO text_generation_launcher: Shard 0 ready in 5.503382395s
2023-04-18T08:00:26.396665Z INFO text_generation_launcher: Shard 1 ready in 5.503381293s
2023-04-18T08:00:26.495656Z INFO text_generation_launcher: Starting Webserver
2023-04-18T08:00:27.467600Z WARN text_generation_router: router/src/main.rs:134: no pipeline tag found for model /data/llama-13b-hf
2023-04-18T08:00:27.472098Z INFO text_generation_router: router/src/main.rs:149: Connected

but an error occurs when calling it:

2023-04-18T08:00:33.236816Z ERROR shard-manager: text_generation_launcher: "Method Prefill encountered an error.
Traceback (most recent call last):
  File \"/opt/miniconda/envs/text-generation/bin/text-generation-server\", line 8, in <module>
    sys.exit(app())
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/typer/main.py\", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/click/core.py\", line 1130, in __call__
    return self.main(*args, **kwargs)
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/typer/core.py\", line 778, in main
    return _main(
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/typer/core.py\", line 216, in _main
    rv = self.invoke(ctx)
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/click/core.py\", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/click/core.py\", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/click/core.py\", line 760, in invoke
    return __callback(*args, **kwargs)
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/typer/main.py\", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/text_generation_server/cli.py\", line 55, in serve
    server.serve(model_id, revision, sharded, quantize, uds_path)
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/text_generation_server/server.py\", line 135, in serve
    asyncio.run(serve_inner(model_id, revision, sharded, quantize))
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/asyncio/runners.py\", line 44, in run
    return loop.run_until_complete(main)
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/asyncio/base_events.py\", line 634, in run_until_complete
    self.run_forever()
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/asyncio/base_events.py\", line 601, in run_forever
    self._run_once()
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/asyncio/base_events.py\", line 1905, in _run_once
    handle._run()
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/asyncio/events.py\", line 80, in _run
    self._context.run(self._callback, *self._args)
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/grpc_interceptor/server.py\", line 153, in invoke_intercept_method
    return await self.intercept(
> File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/text_generation_server/interceptor.py\", line 20, in intercept
    return await response
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py\", line 82, in _unary_interceptor
    raise error
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py\", line 73, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/text_generation_server/server.py\", line 46, in Prefill
    generations, next_batch = self.model.generate_token(batch)
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/contextlib.py\", line 79, in inner
    return func(*args, **kwds)
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/text_generation_server/models/flash_causal_lm.py\", line 278, in generate_token
    out, present = self.forward(
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/text_generation_server/models/flash_causal_lm.py\", line 262, in forward
    return self.model.forward(
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py\", line 607, in forward
    hidden_states, present = self.model(
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/torch/nn/modules/module.py\", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py\", line 523, in forward
    hidden_states = self.embed_tokens(input_ids)
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/torch/nn/modules/module.py\", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py\", line 221, in forward
    torch.distributed.all_reduce(out, group=self.process_group)
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py\", line 1436, in wrapper
    return func(*args, **kwargs)
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py\", line 1687, in all_reduce
    work = group.allreduce([tensor], opts)
NotImplementedError: Could not run 'c10d::allreduce_' with arguments from the 'Meta' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'c10d::allreduce_' is only available for these backends: [CPU, CUDA, SparseCPU, SparseCUDA, BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradMPS, AutogradXPU, AutogradHPU, AutogradLazy, AutogradMeta, Tracer, AutocastCPU, AutocastCUDA, FuncTorchBatched, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PythonDispatcher].

OPT model

I've tried to run OPT-13B. The model loads successfully, but at inference time the following error occurred.

2023-04-10T07:46:27.646766Z ERROR batch{batch_size=1}:prefill:prefill{id=1 size=1}:prefill{id=1 size=1}: text_generation_client: router/client/src/lib.rs:29: Server error: forward() got an unexpected keyword argument 'position_ids'
2023-04-10T07:46:27.649028Z ERROR HTTP request{otel.name=POST / http.client_ip= http.flavor=1.1 http.host=127.0.0.1:8080 http.method=POST http.route=/ http.scheme=HTTP http.target=/ http.user_agent=python-requests/2.28.1 otel.kind=server trace_id=15ce15d2e6483d4df26053f09b713b6b http.status_code=200 otel.status_code="OK"}:compat_generate{default_return_full_text=Extension(false) req=Json(CompatGenerateRequest { inputs: "A와 B가 진지한 대화 중이다. \n두사람의 대화를 자연스럽게 연결하시오.\nB: hi\nA:", parameters: GenerateParameters { best_of: None, temperature: Some(0.5), repetition_penalty: Some(1.1), top_k: None, top_p: Some(0.9), typical_p: Some(0.95), do_sample: false, max_new_tokens: 1024, return_full_text: Some(false), stop: [], truncate: None, watermark: false, details: true, seed: None }, stream: true })}:generate_stream{req=Json(GenerateRequest { inputs: "A와 B가 진지한 대화 중이다. \n두사람의 대화를 자연스럽게 연결하시오.\nB: hi\nA:", parameters: GenerateParameters { best_of: None, temperature: Some(0.5), repetition_penalty: Some(1.1), top_k: None, top_p: Some(0.9), typical_p: Some(0.95), do_sample: false, max_new_tokens: 1024, return_full_text: Some(false), stop: [], truncate: None, watermark: false, details: true, seed: None } })}:async_stream:generate_stream{request=GenerateRequest { inputs: "A와 B가 진지한 대화 중이다. \n두사람의 대화를 자연스럽게 연결하시오.\nB: hi\nA:", parameters: GenerateParameters { best_of: None, temperature: Some(0.5), repetition_penalty: Some(1.1), top_k: None, top_p: Some(0.9), typical_p: Some(0.95), do_sample: false, max_new_tokens: 1024, return_full_text: Some(false), stop: [], truncate: None, watermark: false, details: true, seed: None } }}:infer{batch_size=1}:send_error: text_generation_router::infer: router/src/infer.rs:384: Request failed during generation: Server error: forward() got an unexpected keyword argument 'position_ids'
2023-04-10T07:46:27.666632Z ERROR shard-manager: text_generation_launcher: "Method Prefill encountered an error.
Traceback (most recent call last):
  File \"/home/chang/anaconda3/envs/hf-tgi/bin/text-generation-server\", line 8, in <module>
    sys.exit(app())
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/site-packages/typer/main.py\", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/site-packages/click/core.py\", line 1130, in __call__
    return self.main(*args, **kwargs)
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/site-packages/typer/core.py\", line 778, in main
    return _main(
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/site-packages/typer/core.py\", line 216, in _main
    rv = self.invoke(ctx)
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/site-packages/click/core.py\", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/site-packages/click/core.py\", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/site-packages/click/core.py\", line 760, in invoke
    return __callback(*args, **kwargs)
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/site-packages/typer/main.py\", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File \"/home/chang/AI/llm/tests/text-generation-inference/server/text_generation_server/cli.py\", line 55, in serve
    server.serve(model_id, revision, sharded, quantize, uds_path)
  File \"/home/chang/AI/llm/tests/text-generation-inference/server/text_generation_server/server.py\", line 135, in serve
    asyncio.run(serve_inner(model_id, revision, sharded, quantize))
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/asyncio/runners.py\", line 44, in run
    return loop.run_until_complete(main)
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/asyncio/base_events.py\", line 634, in run_until_complete
    self.run_forever()
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/asyncio/base_events.py\", line 601, in run_forever
    self._run_once()
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/asyncio/base_events.py\", line 1905, in _run_once
    handle._run()
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/asyncio/events.py\", line 80, in _run
    self._context.run(self._callback, *self._args)
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/site-packages/grpc_interceptor/server.py\", line 153, in invoke_intercept_method
    return await self.intercept(
> File \"/home/chang/AI/llm/tests/text-generation-inference/server/text_generation_server/interceptor.py\", line 20, in intercept
    return await response
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py\", line 82, in _unary_interceptor
    raise error
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py\", line 73, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File \"/home/chang/AI/llm/tests/text-generation-inference/server/text_generation_server/server.py\", line 46, in Prefill
    generations, next_batch = self.model.generate_token(batch)
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/contextlib.py\", line 79, in inner
    return func(*args, **kwds)
  File \"/home/chang/AI/llm/tests/text-generation-inference/server/text_generation_server/models/causal_lm.py\", line 341, in generate_token
    logits, past = self.forward(
  File \"/home/chang/AI/llm/tests/text-generation-inference/server/text_generation_server/models/causal_lm.py\", line 325, in forward
    outputs = self.model.forward(
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/site-packages/accelerate/hooks.py\", line 156, in new_forward
    output = old_forward(*args, **kwargs)
TypeError: forward() got an unexpected keyword argument 'position_ids'
" rank=0

`truncation` is not working

In my case (the served model has the same architecture as bloom), it looks like truncation isn't applied. I think this started after 9987960. How can I fix this?

# the model architecture is the same as bloom
text-generation-launcher \
--model-id /mount/lm_storage/checkpoints/alibi_2048_1.3b_v2 \
--num-shard 4 \
--port 6006 \
--max-input-length 2048 \
--max-total-tokens 2560
from text_generation import Client

client = Client(base_url=f"{my_url}")
text = "how are you?" * 1000
a=client.generate(text, max_new_tokens=1, truncate=2048)
text_generation.errors.ValidationError: Input validation error: `inputs` tokens + `max_new_tokens` must be <= 2560. Given: 5000 `inputs` tokens and 1 `max_new_tokens`

Issues Running Quantization on A10

I am seeing the following failures around bitsandbytes with BUILD_EXTENSIONS=False

{"timestamp":"2023-04-19T17:08:49.606067Z","level":"INFO","fields":{"message":"Args { model_id: \"chavinlo/alpaca-13b\", revision: None, sharded: None, num_shard: Some(1), quantize: true, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1000, max_total_tokens: 1512, max_batch_size: 32, max_waiting_tokens: 20, port: 6018, shard_uds_path: \"/tmp/text-generation-server\", master_addr: \"localhost\", master_port: 29500, huggingface_hub_cache: Some(\"/data\"), weights_cache_override: None, disable_custom_kernels: false, json_output: true, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None }"},"target":"text_generation_launcher"}
{"timestamp":"2023-04-19T17:08:49.606372Z","level":"INFO","fields":{"message":"Starting shard 0"},"target":"text_generation_launcher"}
{"timestamp":"2023-04-19T17:08:51.913366Z","level":"ERROR","fields":{"message":"Error when initializing model\nTraceback (most recent call last):\n  File \"/opt/conda/bin/text-generation-server\", line 8, in <module>\n    sys.exit(app())\n  File \"/opt/conda/lib/python3.9/site-packages/typer/main.py\", line 311, in __call__\n    return get_command(self)(*args, **kwargs)\n  File \"/opt/conda/lib/python3.9/site-packages/click/core.py\", line 1130, in __call__\n    return self.main(*args, **kwargs)\n  File \"/opt/conda/lib/python3.9/site-packages/typer/core.py\", line 778, in main\n    return _main(\n  File \"/opt/conda/lib/python3.9/site-packages/typer/core.py\", line 216, in _main\n    rv = self.invoke(ctx)\n  File \"/opt/conda/lib/python3.9/site-packages/click/core.py\", line 1657, in invoke\n    return _process_result(sub_ctx.command.invoke(sub_ctx))\n  File \"/opt/conda/lib/python3.9/site-packages/click/core.py\", line 1404, in invoke\n    return ctx.invoke(self.callback, **ctx.params)\n  File \"/opt/conda/lib/python3.9/site-packages/click/core.py\", line 760, in invoke\n    return __callback(*args, **kwargs)\n  File \"/opt/conda/lib/python3.9/site-packages/typer/main.py\", line 683, in wrapper\n    return callback(**use_params)  # type: ignore\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py\", line 58, in serve\n    server.serve(model_id, revision, sharded, quantize, uds_path)\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py\", line 135, in serve\n    asyncio.run(serve_inner(model_id, revision, sharded, quantize))\n  File \"/opt/conda/lib/python3.9/asyncio/runners.py\", line 44, in run\n    return loop.run_until_complete(main)\n  File \"/opt/conda/lib/python3.9/asyncio/base_events.py\", line 634, in run_until_complete\n    self.run_forever()\n  File \"/opt/conda/lib/python3.9/asyncio/base_events.py\", line 601, in run_forever\n    self._run_once()\n  File \"/opt/conda/lib/python3.9/asyncio/base_events.py\", line 1905, in _run_once\n    handle._run()\n  File \"/opt/conda/lib/python3.9/asyncio/events.py\", line 80, in _run\n    self._context.run(self._callback, *self._args)\n> File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py\", line 104, in serve_inner\n    model = get_model(model_id, revision, sharded, quantize)\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py\", line 133, in get_model\n    return llama_cls(model_id, revision, quantize=quantize)\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/causal_lm.py\", line 306, in __init__\n    raise ValueError(\"quantization is not available on CPU\")\nValueError: quantization is not available on CPU\n"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
{"timestamp":"2023-04-19T17:08:52.408673Z","level":"ERROR","fields":{"message":"Shard 0 failed to start:\n/opt/conda/lib/python3.9/site-packages/bitsandbytes/cextension.py:33: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.\n  warn(\"The installed version of bitsandbytes was compiled without GPU support. \"\nTraceback (most recent call last):\n\n  File \"/opt/conda/bin/text-generation-server\", line 8, in <module>\n    sys.exit(app())\n\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py\", line 58, in serve\n    server.serve(model_id, revision, sharded, quantize, uds_path)\n\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py\", line 135, in serve\n    asyncio.run(serve_inner(model_id, revision, sharded, quantize))\n\n  File \"/opt/conda/lib/python3.9/asyncio/runners.py\", line 44, in run\n    return loop.run_until_complete(main)\n\n  File \"/opt/conda/lib/python3.9/asyncio/base_events.py\", line 647, in run_until_complete\n    return future.result()\n\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py\", line 104, in serve_inner\n    model = get_model(model_id, revision, sharded, quantize)\n\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py\", line 133, in get_model\n    return llama_cls(model_id, revision, quantize=quantize)\n\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/causal_lm.py\", line 306, in __init__\n    raise ValueError(\"quantization is not available on CPU\")\n\nValueError: quantization is not available on CPU\n\n"},"target":"text_generation_launcher"}
{"timestamp":"2023-04-19T17:08:52.408710Z","level":"INFO","fields":{"message":"Shutting down shards"},"target":"text_generation_launcher"}

It seems the container cannot find the GPU even though it is assigned one, as seen below in the output of kubectl describe po:

Containers:
  ml:
    Container ID:   containerd://53b79673ab1059dc4d514f6bccc54ad6fb94b2185b02ba5abe56da6f8d949a5b
    Image:          ghcr.io/huggingface/text-generation-inference:latest
    Image ID:       ghcr.io/huggingface/text-generation-inference@sha256:84d3f31e170cc733e86913787c968a5202515d2021bc18832cbc117a49f07dad
    Port:           6018/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 19 Apr 2023 13:06:14 -0400
      Finished:     Wed, 19 Apr 2023 13:06:16 -0400
    Ready:          False
    Restart Count:  4
    Limits:
      ephemeral-storage:  95551679100
      nvidia.com/gpu:     1
    Requests:
      ephemeral-storage:  95551679100
      nvidia.com/gpu:     1
    Liveness:             http-get http://:http/ delay=600s timeout=1s period=600s #success=1 #failure=3
    Readiness:            http-get http://:http/ delay=300s timeout=1s period=300s #success=1 #failure=3
    Environment:
      MODEL_ID:          chavinlo/alpaca-13b
      NUM_SHARD:         1
      QUANTIZE:          true
      PORT:              80
      BUILD_EXTENSIONS:  false

Here is the output of running nvidia-smi on the node

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.161.03   Driver Version: 470.161.03   CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A10G         On   | 00000000:00:1E.0 Off |                    0 |
|  0%   27C    P0    57W / 300W |  17654MiB / 22731MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

I was able to get past this error yesterday by disabling the build extensions, which got me to the point where FlashLlama was throwing errors because quantization was not implemented. However, today no dice. Am I doing something wrong?

[Feature] Enable quantization for flash attention based models

Hello! Love what you are doing here with dynamic batching, streaming, and tensor parallelism; you have a ton of great functionality!

One thing I am wondering about is whether you support loading large (10B+) models in 8-bit precision via accelerate. This is basically the only way I've been able to run these models on A10 GPUs, and it is the only barrier to me adopting this for my projects!

Would love to chat more to see if I could contribute here.
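For reference, a sketch of the 8-bit loading path described above in plain transformers/accelerate (assumes bitsandbytes is installed; the model name is only an example):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/gpt-neox-20b"  # example 10B+ model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    load_in_8bit=True,  # bitsandbytes int8 weights
)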

Not able to start the server using make server-dev

I replaced the server Makefile to serve gpt2 without sharding, like this:
SAFETENSORS_FAST_GPU=1 python -m torch.distributed.run --nproc_per_node=2 text_generation_server/cli.py serve gpt2
Yesterday the server and router worked fine, but when I restarted the server it no longer works.
The issue I'm facing is shown in the attached screenshot (Screenshot 2023-04-19 at 8 33 48 PM).

Question about generation results.

I compared the sentences generated by huggingface/text-generation-inference with those from the generate function of AutoModelForCausalLM.

I passed the same seed, top_k, top_p, typical_p and do_sample arguments, but the two methods generate different outputs even though they got the same input sentence.

I want to know whether this is normal or whether I tested it incorrectly.

Thank you!

Does the "model_id" parameter not support specifying the local bloom model path?

2023-02-20T06:44:19.335010Z ERROR text_generation_launcher: Download encountered an error: Traceback (most recent call last):

  File "/root/miniconda3/envs/my-env/bin/text-generation-server", line 8, in <module>
    sys.exit(app())

  File "/root/autodl-tmp/text-generation-inference/server/text_generation/cli.py", line 80, in download_weights
    utils.weight_files(model_id, revision, extension)

  File "/root/autodl-tmp/text-generation-inference/server/text_generation/utils/hub.py", line 86, in weight_files
    filenames = weight_hub_files(model_id, revision, extension)

  File "/root/autodl-tmp/text-generation-inference/server/text_generation/utils/hub.py", line 27, in weight_hub_files
    info = api.model_info(model_id, revision=revision)

  File "/root/miniconda3/envs/my-env/lib/python3.9/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    validate_repo_id(arg_value)

  File "/root/miniconda3/envs/my-env/lib/python3.9/site-packages/huggingface_hub/utils/_validators.py", line 166, in validate_repo_id
    raise HFValidationError(

huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/root/autodl-tmp/bloom'. Use `repo_type` argument if needed.

What startup parameters are missing?

[Feature] Return embeddings

As the title indicates, I'd be interested in understanding whether this is just for text generation or whether it could also be used to expose the embedding function.

supporting LoRA checkpoint?

I only see a model_id to specify. For a model with a LoRA checkpoint, how can I specify the LoRA checkpoint id (on the HF Hub)?

How to achieve the same speed as in InferenceAPIClient?

I found that using InferenceAPIClient is about 3x faster than using the same model locally.

Locally I use an 8x Tesla V100 machine. The local code is:

from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig, StoppingCriteria, StoppingCriteriaList

model_name = "OpenAssistant/oasst-sft-1-pythia-12b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
def wrap(message):
    return "<|prompter|>"+message+"<|endoftext|><|assistant|>"

query = wrap('''
What families with kids love about this hotel? 
Located on prestigious Park Lane in Mayfair, the London Hilton on Park Lane Hotel features stunning views of Hyde Park, Knightsbridge and Westminster. With 453 rooms and suites, this award-winning 5-star luxury hotel also has a stylish bar and a Michelin-starred restaurant.
With marble bathrooms and a flat-screen TV, some rooms also have balconies. Executive rooms and suites offer Executive Lounge access with complimentary a continental breakfast, snacks and beverages throughout the day as well as internet access and private check-in.
Boasting some of London’s finest restaurants and bars, the Galvin at Windows serves a menu with an emphasis on British cuisine.
The stylish Podium Restaurant and Bar serves seasonal British cuisine and is famed for the Confessions of a Chocoholic afternoon tea.
The London Hilton on Park Lane also has a business center, a steam room, and a sauna.
London Hilton on Park Lane is undergoing a phased renovation between February and July 2023 to elevate your experience. The hotel will remain open during this period and you can expect the same wonderful care and attention when you visit us. During this time, some services and areas will present temporary changes to normal operations, which will include the lobby and our all-day dining restaurant. Please note that breakfast will be served on our first floor with views overlooking Hyde Park. Please enter the hotel using the back lobby entrance on Hertford Street. Please email [email protected] for further information.
This is our guests' favorite part of London, according to independent reviews.
Couples in particular like the location – they rated it 9.3 for a two-person trip.
''')

data = tokenizer(query, return_tensors="pt")

start = time.time()
outputs = model.generate(**data, max_new_tokens=128, num_beams=1, do_sample=False)
end = time.time()  # measure generation time only, before decoding/printing

print(tokenizer.decode(outputs[0]))
print(end - start)

I get 12.53 seconds.

At the same time, if I use InferenceAPIClient:

import time

from text_generation import InferenceAPIClient

model_name = "OpenAssistant/oasst-sft-1-pythia-12b"
client = InferenceAPIClient(model_name)

start = time.time()
response = client.generate(query, max_new_tokens=128, do_sample=False)
print(time.time() - start)

This gives 3.96 seconds, which is about 3x faster.

Are there any tricks I can use with my local models to bring them to the same speed as via the API client? I'd like to test local models, as they've been fine-tuned on additional data.
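One hedged observation, not a definitive answer: the Inference API serves this model behind text-generation-inference, so a closer local comparison is to launch a local TGI server for the same model and query it with the Client from the same text_generation package. A minimal sketch, reusing the query defined above and assuming a server already running on port 8080:

import time

from text_generation import Client

# Assumes something like:
#   text-generation-launcher --model-id OpenAssistant/oasst-sft-1-pythia-12b --port 8080
# is already running locally.
client = Client("http://127.0.0.1:8080")

start = time.time()
response = client.generate(query, max_new_tokens=128, do_sample=False)
print(response.generated_text)
print(time.time() - start)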

Causal LM modifies the input when returning text

One of the issues is that Causal LM returns the entire input + generated_text, and the returned input is actually tokenizer.decode(tokenizer.encode(input)). The issue comes when that process is lossy, which causes weird behaviours such as this: https://huggingface.co/bigscience/bloom/discussions/153#6397907b71eb2455d898e0a4 (a round-trip sketch follows the list below).

We can instead either:

  • actually use the input instead of going through a potentially lossy mechanism.
  • change the API to always return "added" text. BREAKING
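A minimal round-trip sketch; bigscience/bloom-560m is used purely as an example model, and whether the round trip changes the text depends on the tokenizer's normalization and special-token handling:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")

text = "Hello\u00a0world"  # contains a non-breaking space
round_trip = tokenizer.decode(tokenizer.encode(text), skip_special_tokens=True)

print(repr(text))
print(repr(round_trip))
# If the two strings differ, prepending the round-tripped text (what the server
# currently does) instead of the original input changes the returned string
# in surprising ways.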

TypeError: __init__() missing 1 required positional argument: 'padding_right_offset'

There's a TypeError when making a request to a Galactica model (in this case facebook/galactica-30b):

  File \"/home/user/miniconda3/envs/text_gen_inference/lib/python3.9/site-packages/grpc_interceptor/server.py\", line 153, in invoke_intercept_method
    return await self.intercept(
> File \"/home/user/userusertext-generation-inference/server/text_generation/interceptor.py\", line 20, in intercept
    return await response
  File \"/home/user/miniconda3/envs/text_gen_inference/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py\", line 82, in _unary_interceptor
    raise error
  File \"/home/user/miniconda3/envs/text_gen_inference/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py\", line 73, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File \"/home/user/userusertext-generation-inference/server/text_generation/server.py\", line 37, in Prefill
    batch = self.model.batch_type.from_pb(
  File \"/home/user/usertext-generation-inference/server/text_generation/models/galactica.py\", line 121, in from_pb
    return cls(
TypeError: __init__() missing 1 required positional argument: 'padding_right_offset'

E.g.:

curl localhost:3000/generate_stream -H 'Content-Type: application/json' -d '{"inputs":"The Transformer architecture [START_REF]","parameters":{"max_new_tokens":100, "do_sample":true, "temperature":0.8}}'

Model was started with

text-generation-launcher --model-id facebook/galactica-30b --num-shard 1 --quantize

Request failed during generation: Server error: Expected is_sm90 || is_sm8x || is_sm75 to be true, but got false. (Could this error message be improved? If so, please report an enhancement request to PyTorch.)

I'm using the Docker container to run the server locally, following these instructions:
https://github.com/huggingface/text-generation-inference#docker
I'm using an app very similar to the one here to hit that local server:
https://huggingface.co/spaces/olivierdehaene/chat-llm-streaming

I'm seeing this error in the server logs:

send_error: text_generation_router::infer: router/src/infer.rs:390: Request failed during generation: Server error: Expected is_sm90 || is_sm8x || is_sm75 to be true, but got false.  (Could this error message be improved?  If so, please report an enhancement request to PyTorch.)

Any ideas?
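For context, this assertion appears to be a GPU compute-capability check in the attention kernels: sm_75, sm_8x and sm_90 correspond to Turing, Ampere/Ada and Hopper, so e.g. a V100 (sm_70) will not pass it. A quick way to see what your GPU reports, assuming PyTorch is installed:

import torch

# Print the compute capability of the first visible GPU; the failing check
# expects sm_75 (Turing), sm_8x (Ampere/Ada) or sm_90 (Hopper).
major, minor = torch.cuda.get_device_capability(0)
print(f"GPU: {torch.cuda.get_device_name(0)}, compute capability {major}.{minor}")
print("Meets the sm_75+ requirement:", (major, minor) >= (7, 5))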

Error when running make install in the router dir

cd router && cargo install --path .
info: syncing channel updates for '1.67.0-x86_64-unknown-linux-gnu'
info: latest update on 2023-01-26, rust version 1.67.0 (fc594f156 2023-01-24)
info: downloading component 'cargo'
info: downloading component 'clippy'
info: downloading component 'rust-docs'
info: downloading component 'rust-std'
info: downloading component 'rustc'
info: downloading component 'rustfmt'
info: installing component 'cargo'
info: installing component 'clippy'
info: installing component 'rust-docs'
info: installing component 'rust-std'
info: installing component 'rustc'
info: installing component 'rustfmt'
  Installing text-generation-router v0.3.2 (/nvme/yjc/text-generation-inference/router)
    Updating crates.io index
error: failed to compile `text-generation-router v0.3.2 (/nvme/yjc/text-generation-inference/router)`, intermediate artifacts can be found at `/nvme/yjc/text-generation-inference/target`

Caused by:
  failed to get `async-stream` as a dependency of package `text-generation-router v0.3.2 (/nvme/yjc/text-generation-inference/router)`

Caused by:
  failed to load source for dependency `async-stream`

Caused by:
  Unable to update registry `crates-io`

Caused by:
  failed to fetch `https://github.com/rust-lang/crates.io-index`

Caused by:
  network failure seems to have happened
  if a proxy or similar is necessary `net.git-fetch-with-cli` may help here
  https://doc.rust-lang.org/cargo/reference/config.html#netgit-fetch-with-cli

Caused by:
  SSL error: received early EOF; class=Ssl (16); code=Eof (-20)
make: *** [install-router] Error 101

ModuleNotFoundError: No module named 'generate_pb2'

When I execute the command text-generation-launcher --model-id bigscience/bloom-560m --num-shard 1, the following error occurs:
2023-02-20T04:52:55.905326Z ERROR text_generation_launcher: Download encountered an error: Traceback (most recent call last):
  File "/root/miniconda3/envs/my-env/bin/text-generation-server", line 5, in <module>
    from text_generation.cli import app
  File "/root/autodl-tmp/text-generation-inference/server/text_generation/cli.py", line 9, in <module>
    from text_generation import server, utils
  File "/root/autodl-tmp/text-generation-inference/server/text_generation/server.py", line 15, in <module>
    from text_generation.pb import generate_pb2_grpc, generate_pb2
  File "/root/autodl-tmp/text-generation-inference/server/text_generation/pb/generate_pb2_grpc.py", line 5, in <module>
    import generate_pb2 as generate__pb2
ModuleNotFoundError: No module named 'generate_pb2'

Question about sharding

I was curious about how/why you use TP/sharding here for non-BLOOM/Galactica models, i.e. GPT-NeoX and T5, since these fit on a single A100. Is it to be able to run them on smaller/cheaper GPUs, because it provides latency advantages from the additional parallelism, or both?

Thanks!

Proxy variables not being passed to text-generation-server

I have noticed that the text-generation-launcher entry point in this project spawns a subprocess to run the text-generation-server program, but it does not pass along all the necessary environment variables. Specifically, proxy information such as http_proxy and https_proxy is not being passed to the text-generation-server program. This prevents the proxy configuration set during docker run from taking effect.

To ensure that the proxy information is properly passed to the text-generation-server program, I kindly request that the necessary environment variables be added to the entry point.

Thank you for your attention to this matter.

Related code:

let mut env = Vec::new();
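For illustration, a minimal Python sketch of the requested behaviour (the launcher itself is written in Rust; the variable names here are hypothetical):

import os
import subprocess

# When a child process is started with an explicitly constructed environment,
# anything not copied over, e.g. http_proxy/https_proxy, is silently dropped.
minimal_env = {"PATH": os.environ.get("PATH", "")}
subprocess.run(["printenv", "https_proxy"], env=minimal_env)  # prints nothing

# Forwarding the proxy variables from the parent environment fixes this.
forwarded = {k: v for k, v in os.environ.items()
             if k.lower() in {"http_proxy", "https_proxy", "no_proxy"}}
subprocess.run(["printenv", "https_proxy"], env={**minimal_env, **forwarded})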

dtype mismatch with --quantize

Hi there,
when I run make run-bloom-560m-quantize I get a type error when I query the server.

{"error":"Request failed during generation: Server error: \"self and mat2 must have the same dtype\""}2023-01-31T14:05:33.860375Z ERROR shard-manager: text_generation_launcher: "Method Generate encountered an error.
Traceback (most recent call last):
  File \"/home/user/.conda/envs/inferenceserver/bin/text-generation-server\", line 8, in <module>
    sys.exit(app())
  File \"/home/user/.conda/envs/inferenceserver/lib/python3.9/site-packages/typer/main.py\", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File \"/home/user/.conda/envs/inferenceserver/lib/python3.9/site-packages/click/core.py\", line 1130, in __call__
    return self.main(*args, **kwargs)
  File \"/home/user/.conda/envs/inferenceserver/lib/python3.9/site-packages/typer/core.py\", line 778, in main
    return _main(
  File \"/home/user/.conda/envs/inferenceserver/lib/python3.9/site-packages/typer/core.py\", line 216, in _main
    rv = self.invoke(ctx)
  File \"/home/user/.conda/envs/inferenceserver/lib/python3.9/site-packages/click/core.py\", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File \"/home/user/.conda/envs/inferenceserver/lib/python3.9/site-packages/click/core.py\", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File \"/home/user/.conda/envs/inferenceserver/lib/python3.9/site-packages/click/core.py\", line 760, in invoke
    return __callback(*args, **kwargs)
  File \"/home/user/.conda/envs/inferenceserver/lib/python3.9/site-packages/typer/main.py\", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File \"/home/user/text-generation-inference/server/text_generation/cli.py\", line 47, in serve
    server.serve(model_name, sharded, quantize, uds_path)
  File \"/home/user/text-generation-inference/server/text_generation/server.py\", line 114, in serve
    asyncio.run(serve_inner(model_name, sharded, quantize))
  File \"/home/user/.conda/envs/inferenceserver/lib/python3.9/asyncio/runners.py\", line 44, in run
    return loop.run_until_complete(main)
  File \"/home/user/.conda/envs/inferenceserver/lib/python3.9/asyncio/base_events.py\", line 634, in run_until_complete
    self.run_forever()
  File \"/home/user/.conda/envs/inferenceserver/lib/python3.9/asyncio/base_events.py\", line 601, in run_forever
    self._run_once()
  File \"/home/user/.conda/envs/inferenceserver/lib/python3.9/asyncio/base_events.py\", line 1905, in _run_once
    handle._run()
  File \"/home/user/.conda/envs/inferenceserver/lib/python3.9/asyncio/events.py\", line 80, in _run
    self._context.run(self._callback, *self._args)
  File \"/home/user/.conda/envs/inferenceserver/lib/python3.9/site-packages/grpc_interceptor/server.py\", line 153, in invoke_intercept_method
    return await self.intercept(
> File \"/home/user/text-generation-inference/server/text_generation/interceptor.py\", line 20, in intercept
    return await response
  File \"/home/user/text-generation-inference/server/text_generation/server.py\", line 35, in Generate
    generated_texts, next_batch = self.model.generate_token(batch)
  File \"/home/user/text-generation-inference/server/text_generation/models/causal_lm.py\", line 298, in generate_token
    logits, past = self.forward(
  File \"/home/user/text-generation-inference/server/text_generation/models/bloom.py\", line 240, in forward
    outputs = self.model.forward(
  File \"/home/user/.conda/envs/inferenceserver/lib/python3.9/site-packages/transformers-4.26.0.dev0-py3.9.egg/transformers/models/bloom/modeling_bloom.py\", line 941, in forward
    transformer_outputs = self.transformer(
  File \"/home/user/.conda/envs/inferenceserver/lib/python3.9/site-packages/torch/nn/modules/module.py\", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File \"/home/user/.conda/envs/inferenceserver/lib/python3.9/site-packages/transformers-4.26.0.dev0-py3.9.egg/transformers/models/bloom/modeling_bloom.py\", line 809, in forward
    outputs = block(
  File \"/home/user/.conda/envs/inferenceserver/lib/python3.9/site-packages/torch/nn/modules/module.py\", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File \"/home/user/.conda/envs/inferenceserver/lib/python3.9/site-packages/transformers-4.26.0.dev0-py3.9.egg/transformers/models/bloom/modeling_bloom.py\", line 448, in forward
    attn_outputs = self.self_attention(
  File \"/home/user/.conda/envs/inferenceserver/lib/python3.9/site-packages/torch/nn/modules/module.py\", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File \"/home/user/.conda/envs/inferenceserver/lib/python3.9/site-packages/transformers-4.26.0.dev0-py3.9.egg/transformers/models/bloom/modeling_bloom.py\", line 318, in forward
    fused_qkv = self.query_key_value(hidden_states)  # [batch_size, seq_length, 3 x hidden_size]
  File \"/home/user/.conda/envs/inferenceserver/lib/python3.9/site-packages/torch/nn/modules/module.py\", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File \"/home/user/.conda/envs/inferenceserver/lib/python3.9/site-packages/torch/nn/modules/linear.py\", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: self and mat2 must have the same dtype
" rank=1

There seems to be an issue with the code that quantizes the linear layers.

Add `max_total_tokens`

Right now it is possible for input_length + max_new_tokens to go over the model's token limit.
Add a max_total_tokens validation step, or add a new finish reason (a sketch of such a check follows below).

@lewtun
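A minimal sketch of the validation described above; the function name and limit value are illustrative, not the actual router/server API:

MAX_TOTAL_TOKENS = 2048  # assumed model context limit, purely for illustration

def validate_request(input_length: int, max_new_tokens: int) -> None:
    # Reject requests whose prompt plus requested new tokens would exceed
    # the model's context window.
    total = input_length + max_new_tokens
    if total > MAX_TOTAL_TOKENS:
        raise ValueError(
            f"input_length ({input_length}) + max_new_tokens ({max_new_tokens}) "
            f"= {total} exceeds max_total_tokens ({MAX_TOTAL_TOKENS})"
        )

# Example: a 2000-token prompt asking for 128 new tokens is rejected.
try:
    validate_request(2000, 128)
except ValueError as err:
    print(err)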

error: identifier "gpuAtomicMax" is undefined

/root/miniconda3/lib/python3.8/site-packages/torch/utils/cpp_extension.py:788: UserWarning: The detected CUDA version (11.7) has a minor version mismatch with the version that was used to compile PyTorch (11.3). Most likely this shouldn't be a problem.
  warnings.warn(CUDA_MISMATCH_WARN.format(cuda_str_version, torch.version.cuda))
building 'transformers.models.bloom.custom_kernels.fused_bloom_attention_cuda' extension
creating build/temp.linux-x86_64-3.8
creating build/temp.linux-x86_64-3.8/src
creating build/temp.linux-x86_64-3.8/src/transformers
creating build/temp.linux-x86_64-3.8/src/transformers/models
creating build/temp.linux-x86_64-3.8/src/transformers/models/bloom
creating build/temp.linux-x86_64-3.8/src/transformers/models/bloom/custom_kernels
/usr/local/cuda-11.7/bin/nvcc -I/root/miniconda3/lib/python3.8/site-packages/torch/include -I/root/miniconda3/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/root/miniconda3/lib/python3.8/site-packages/torch/include/TH -I/root/miniconda3/lib/python3.8/site-packages/torch/include/THC -I/usr/local/cuda-11.7/include -I/root/miniconda3/include/python3.8 -c src/transformers/models/bloom/custom_kernels/fused_bloom_attention_cuda.cu -o build/temp.linux-x86_64-3.8/src/transformers/models/bloom/custom_kernels/fused_bloom_attention_cuda.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -arch=compute_80 -std=c++17 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -DTORCH_EXTENSION_NAME=fused_bloom_attention_cuda -D_GLIBCXX_USE_CXX11_ABI=0
src/transformers/models/bloom/custom_kernels/fused_bloom_attention_cuda.cu(68): error: identifier "gpuAtomicMax" is undefined

1 error detected in the compilation of "src/transformers/models/bloom/custom_kernels/fused_bloom_attention_cuda.cu".
error: command '/usr/local/cuda-11.7/bin/nvcc' failed with exit status 1
make[1]: *** [Makefile:18: install-transformers] Error 1
make[1]: Leaving directory '/root/autodl-tmp/text-generation-inference/server'
make: *** [Makefile:2: install-server] Error 2

GPU: A100-80GB

Question about sharding for LLaMa

Hi, I tried the LLaMa experimental support as mentioned at #146 (comment).
With the suggestions from that helpful thread, I was able to launch the 7b model with 1 A100; the command is:

FLASH_ATTENTION=1 text-generation-launcher --num-shard 1 --port 8080 --max-total-tokens 2048 --model-id decapoda-research/llama-7b

The inference service works fine and is really fast!

But when I increase the shard count to 2 or 4 (so that I can eventually load the 30b model), for example:

FLASH_ATTENTION=1 text-generation-launcher --num-shard 2 --port 8080 --max-total-tokens 2048 --model-id decapoda-research/llama-7b

I get the following error:

2023-04-11T02:00:58.717162Z ERROR shard-manager: text_generation_launcher: "Method Prefill encountered an error.
Traceback (most recent call last):
  File \"/opt/miniconda/envs/text-generation/bin/text-generation-server\", line 8, in <module>
    sys.exit(app())
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/typer/main.py\", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/click/core.py\", line 1130, in __call__
    return self.main(*args, **kwargs)
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/typer/core.py\", line 778, in main
    return _main(
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/typer/core.py\", line 216, in _main
    rv = self.invoke(ctx)
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/click/core.py\", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/click/core.py\", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/click/core.py\", line 760, in invoke
    return __callback(*args, **kwargs)
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/typer/main.py\", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/text_generation_server/cli.py\", line 55, in serve
    server.serve(model_id, revision, sharded, quantize, uds_path)
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/text_generation_server/server.py\", line 135, in serve
    asyncio.run(serve_inner(model_id, revision, sharded, quantize))
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/asyncio/runners.py\", line 44, in run
    return loop.run_until_complete(main)
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/asyncio/base_events.py\", line 634, in run_until_complete
    self.run_forever()
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/asyncio/base_events.py\", line 601, in run_forever
    self._run_once()
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/asyncio/base_events.py\", line 1905, in _run_once
    handle._run()
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/asyncio/events.py\", line 80, in _run
    self._context.run(self._callback, *self._args)
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/grpc_interceptor/server.py\", line 153, in invoke_intercept_method
    return await self.intercept(
> File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/text_generation_server/interceptor.py\", line 20, in intercept
    return await response
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py\", line 82, in _unary_interceptor
    raise error
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py\", line 73, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/text_generation_server/server.py\", line 46, in Prefill
    generations, next_batch = self.model.generate_token(batch)
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/contextlib.py\", line 79, in inner
    return func(*args, **kwds)
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/text_generation_server/models/flash_causal_lm.py\", line 278, in generate_token
    out, present = self.forward(
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/text_generation_server/models/flash_causal_lm.py\", line 262, in forward
    return self.model.forward(
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py\", line 609, in forward
    hidden_states, present = self.model(
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/torch/nn/modules/module.py\", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py\", line 525, in forward
    hidden_states = self.embed_tokens(input_ids)
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/torch/nn/modules/module.py\", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py\", line 221, in forward
    torch.distributed.all_reduce(out, group=self.process_group)
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py\", line 1436, in wrapper
    return func(*args, **kwargs)
  File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py\", line 1687, in all_reduce
    work = group.allreduce([tensor], opts)
NotImplementedError: Could not run 'c10d::allreduce_' with arguments from the 'Meta' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'c10d::allreduce_' is only available for these backends: [CPU, CUDA, SparseCPU, SparseCUDA, BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradMPS, AutogradXPU, AutogradHPU, AutogradLazy, AutogradMeta, Tracer, AutocastCPU, AutocastCUDA, FuncTorchBatched, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PythonDispatcher].

I used the official Dockerfile to build the image on the feat/flashllama branch, and copied the tokenizer file from hf-internal-testing/tiny-random-llama to my model directory, as mentioned in the referenced thread.

Do I need to upgrade transformers to the latest version to support the LLaMa model?

But if I upgrade transformers to its latest version, I get these errors:

/data/tgi/text-generation-inference$ text-generation-launcher --model-id bigscience/bloom-560m --num-shard 2 --port 8899
2023-03-29T03:38:34.822927Z  INFO text_generation_launcher: Args { model_id: "bigscience/bloom-560m", revision: None, sharded: None, num_shard: Some(2), quantize: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1000, max_total_tokens: 1512, max_batch_size: 32, max_waiting_tokens: 20, port: 8899, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: None, weights_cache_override: None, disable_custom_kernels: false, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None }
2023-03-29T03:38:34.822972Z  INFO text_generation_launcher: Sharding model on 2 processes
2023-03-29T03:38:34.823231Z  INFO text_generation_launcher: Starting download process.
2023-03-29T03:38:40.229472Z ERROR text_generation_launcher: Download encountered an error: Traceback (most recent call last):
  File "/home/xxx/miniconda3/envs/text-generation-inference/bin/text-generation-server", line 5, in <module>
    from text_generation_server.cli import app
  File "/data/tgi/text-generation-inference/server/text_generation_server/cli.py", line 9, in <module>
    from text_generation_server import server, utils
  File "/data/tgi/text-generation-inference/server/text_generation_server/server.py", line 12, in <module>
    from text_generation_server.cache import Cache
  File "/data/tgi/text-generation-inference/server/text_generation_server/cache.py", line 3, in <module>
    from text_generation_server.models.types import Batch
  File "/data/tgi/text-generation-inference/server/text_generation_server/models/__init__.py", line 8, in <module>
    from text_generation_server.models.bloom import BLOOM, BLOOMSharded
  File "/data/tgi/text-generation-inference/server/text_generation_server/models/bloom.py", line 14, in <module>
    from transformers.models.bloom.parallel_layers import (
ModuleNotFoundError: No module named 'transformers.models.bloom.parallel_layers'

Couldn't instantiate the backend tokenizer

I am unable to run the docker command; it fails with the error in the title:

docker run --gpus all --shm-size 1g -p 8080:80 -v /home/mohamedr/cache:/data ghcr.io/huggingface/text-generation-inference:latest --model-id Tribbiani/vicuna-7b  --num-shard 1 --max-total-tokens 2048

How can I fix this problem?

TypeError: forward() got an unexpected keyword argument 'position_ids'

I started a server with

text-generation-launcher --model-id facebook/galactica-30b --num-shard 1

However, when I now send a request like

curl localhost:3000/generate -H 'Content-Type: application/json' -d '{"inputs":"Hi my name is","parameters":{"max_new_tokens":60}}'

it consistently returns TypeError: forward() got an unexpected keyword argument 'position_ids', with the following traceback:

2023-02-20T14:41:54.081514Z ERROR shard-manager: text_generation_launcher: "Method Prefill encountered an error.
Traceback (most recent call last):
  File \"/home/user/miniconda3/envs/text_generation/bin/text-generation-server\", line 8, in <module>
    sys.exit(app())
  File \"/home/user/miniconda3/envs/text_generation/lib/python3.9/site-packages/typer/main.py\", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File \"/home/user/miniconda3/envs/text_generation/lib/python3.9/site-packages/click/core.py\", line 1130, in __call__
    return self.main(*args, **kwargs)
  File \"/home/user/miniconda3/envs/text_generation/lib/python3.9/site-packages/typer/core.py\", line 778, in main
    return _main(
  File \"/home/user/miniconda3/envs/text_generation/lib/python3.9/site-packages/typer/core.py\", line 216, in _main
    rv = self.invoke(ctx)
  File \"/home/user/miniconda3/envs/text_generation/lib/python3.9/site-packages/click/core.py\", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File \"/home/user/miniconda3/envs/text_generation/lib/python3.9/site-packages/click/core.py\", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File \"/home/user/miniconda3/envs/text_generation/lib/python3.9/site-packages/click/core.py\", line 760, in invoke
    return __callback(*args, **kwargs)
  File \"/home/user/miniconda3/envs/text_generation/lib/python3.9/site-packages/typer/main.py\", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File \"/home/user/text-generation-inference-new/server/text_generation/cli.py\", line 55, in serve
    server.serve(model_id, revision, sharded, quantize, uds_path)
  File \"/home/user/text-generation-inference-new/server/text_generation/server.py\", line 130, in serve
    asyncio.run(serve_inner(model_id, revision, sharded, quantize))
  File \"/home/user/miniconda3/envs/text_generation/lib/python3.9/asyncio/runners.py\", line 44, in run
    return loop.run_until_complete(main)
  File \"/home/user/miniconda3/envs/text_generation/lib/python3.9/asyncio/base_events.py\", line 634, in run_until_complete
    self.run_forever()
  File \"/home/user/miniconda3/envs/text_generation/lib/python3.9/asyncio/base_events.py\", line 601, in run_forever
    self._run_once()
  File \"/home/user/miniconda3/envs/text_generation/lib/python3.9/asyncio/base_events.py\", line 1905, in _run_once
    handle._run()
  File \"/home/user/miniconda3/envs/text_generation/lib/python3.9/asyncio/events.py\", line 80, in _run
    self._context.run(self._callback, *self._args)
  File \"/home/user/miniconda3/envs/text_generation/lib/python3.9/site-packages/grpc_interceptor/server.py\", line 153, in invoke_intercept_method
    return await self.intercept(
> File \"/home/user/text-generation-inference-new/server/text_generation/interceptor.py\", line 20, in intercept
    return await response
  File \"/home/user/miniconda3/envs/text_generation/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py\", line 82, in _unary_interceptor
    raise error
  File \"/home/user/miniconda3/envs/text_generation/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py\", line 73, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File \"/home/user/text-generation-inference-new/server/text_generation/server.py\", line 41, in Prefill
    generations, next_batch = self.model.generate_token(batch)
  File \"/home/user/miniconda3/envs/text_generation/lib/python3.9/contextlib.py\", line 79, in inner
    return func(*args, **kwds)
  File \"/home/user/text-generation-inference-new/server/text_generation/models/causal_lm.py\", line 297, in generate_token
    logits, past = self.forward(
  File \"/home/user/text-generation-inference-new/server/text_generation/models/causal_lm.py\", line 284, in forward
    outputs = self.model.forward(
  File \"/home/user/miniconda3/envs/text_generation/lib/python3.9/site-packages/accelerate/hooks.py\", line 156, in new_forward
    output = old_forward(*args, **kwargs)
TypeError: forward() got an unexpected keyword argument 'position_ids'
" rank=0
2023-02-20T14:41:54.081879Z ERROR batch{batch_size=1}:prefill:prefill{id=0 size=1}:prefill{id=0 size=1}: text_generation_client: router/client/src/lib.rs:29: Server error: forward() got an unexpected keyword argument 'position_ids'
2023-02-20T14:41:54.081947Z ERROR HTTP request{otel.name=POST /generate http.client_ip= http.flavor=1.1 http.host=localhost:3000 http.method=POST http.route=/generate http.scheme=HTTP http.target=/generate http.user_agent=curl/7.82.0 otel.kind=server trace_id=c742e54c2eddc1bfcc788ed10b3e0c52}:generate{req=Json(GenerateRequest { inputs: "Hi my name is", parameters: GenerateParameters { temperature: None, repetition_penalty: None, top_k: None, top_p: None, do_sample: false, max_new_tokens: 60, stop: [], details: false, seed: None } })}:generate{request=GenerateRequest { inputs: "Hi my name is", parameters: GenerateParameters { temperature: None, repetition_penalty: None, top_k: None, top_p: None, do_sample: false, max_new_tokens: 60, stop: [], details: false, seed: None } }}:generate_stream{request=GenerateRequest { inputs: "Hi my name is", parameters: GenerateParameters { temperature: None, repetition_penalty: None, top_k: None, top_p: None, do_sample: false, max_new_tokens: 60, stop: [], details: false, seed: None } }}:infer{batch_size=1}:send_error: text_generation_router::infer: router/src/infer.rs:338: Request failed during generation: Server error: forward() got an unexpected keyword argument 'position_ids'

make install-router fails to build

Hi,

When I run the make install command, I'm getting a build failure in the router. Please advise on how to resolve this.

Thanks,
Rohit

make install-router
cd router && cargo install --path .
  Installing text-generation-router v0.1.0 (/home/text-generation-inference/router)
    Updating crates.io index
   Compiling tracing-error v0.2.0
   Compiling reqwest v0.11.13
   Compiling text-generation-client v0.1.0 (/home/text-generation-inference/router/client)
   Compiling tonic v0.6.2
error: failed to run custom build command for `text-generation-client v0.1.0 (/home/text-generation-inference/router/client)`

Caused by:
  process didn't exit successfully: `/home/text-generation-inference/target/release/build/text-generation-client-0beb4f3014e2ad2c/build-script-build` (exit status: 1)
  --- stderr
  error running rustfmt: Os { code: 2, kind: NotFound, message: "No such file or directory" }
warning: build failed, waiting for other jobs to finish...
error: failed to compile `text-generation-router v0.1.0 (/home/text-generation-inference/router)`, intermediate artifacts can be found at `/home/text-generation-inference/target`
make: *** [Makefile:5: install-router] Error 101

protobuf version

After installing (using make install), protobuf is upgraded to 4.x, but I get the following error. If I downgrade protobuf to 3.20.x it just works. On the latest update, the following error is no longer displayed and starting the shards just fails with no error message.

2023-04-13T00:59:16.794030Z  INFO text_generation_launcher: Sharding model on 2 processes
2023-04-13T00:59:16.794558Z  INFO text_generation_launcher: Starting shard 0
2023-04-13T00:59:16.794631Z  INFO text_generation_launcher: Starting shard 1
2023-04-13T00:59:21.600031Z ERROR text_generation_launcher: Shard 0 failed to start:
Traceback (most recent call last):
  File "/home/chang/anaconda3/envs/hf39/lib/python3.9/site-packages/transformers-4.28.0.dev0-py3.9.egg/transformers/utils/import_utils.py", line 1125, in _get_module
    return importlib.import_module("." + module_name, self.__name__)
  File "/home/chang/anaconda3/envs/hf39/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 850, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "/home/chang/anaconda3/envs/hf39/lib/python3.9/site-packages/transformers-4.28.0.dev0-py3.9.egg/transformers/modeling_utils.py", line 37, in <module>
    from .deepspeed import deepspeed_config, is_deepspeed_zero3_enabled
  File "/home/chang/anaconda3/envs/hf39/lib/python3.9/site-packages/transformers-4.28.0.dev0-py3.9.egg/transformers/deepspeed.py", line 38, in <module>
    from accelerate.utils.deepspeed import HfDeepSpeedConfig as DeepSpeedConfig
  File "/home/chang/anaconda3/envs/hf39/lib/python3.9/site-packages/accelerate/__init__.py", line 7, in <module>
    from .accelerator import Accelerator
  File "/home/chang/anaconda3/envs/hf39/lib/python3.9/site-packages/accelerate/accelerator.py", line 33, in <module>
    from .tracking import LOGGER_TYPE_TO_CLASS, GeneralTracker, filter_trackers
  File "/home/chang/anaconda3/envs/hf39/lib/python3.9/site-packages/accelerate/tracking.py", line 40, in <module>
    from torch.utils import tensorboard
  File "/home/chang/anaconda3/envs/hf39/lib/python3.9/site-packages/torch/utils/tensorboard/__init__.py", line 12, in <module>
    from .writer import FileWriter, SummaryWriter  # noqa: F401
  File "/home/chang/anaconda3/envs/hf39/lib/python3.9/site-packages/torch/utils/tensorboard/writer.py", line 9, in <module>
    from tensorboard.compat.proto.event_pb2 import SessionLog
  File "/home/chang/anaconda3/envs/hf39/lib/python3.9/site-packages/tensorboard/compat/proto/event_pb2.py", line 17, in <module>
    from tensorboard.compat.proto import summary_pb2 as tensorboard_dot_compat_dot_proto_dot_summary__pb2
  File "/home/chang/anaconda3/envs/hf39/lib/python3.9/site-packages/tensorboard/compat/proto/summary_pb2.py", line 17, in <module>
    from tensorboard.compat.proto import tensor_pb2 as tensorboard_dot_compat_dot_proto_dot_tensor__pb2
  File "/home/chang/anaconda3/envs/hf39/lib/python3.9/site-packages/tensorboard/compat/proto/tensor_pb2.py", line 16, in <module>
    from tensorboard.compat.proto import resource_handle_pb2 as tensorboard_dot_compat_dot_proto_dot_resource__handle__pb2
  File "/home/chang/anaconda3/envs/hf39/lib/python3.9/site-packages/tensorboard/compat/proto/resource_handle_pb2.py", line 16, in <module>
    from tensorboard.compat.proto import tensor_shape_pb2 as tensorboard_dot_compat_dot_proto_dot_tensor__shape__pb2
  File "/home/chang/anaconda3/envs/hf39/lib/python3.9/site-packages/tensorboard/compat/proto/tensor_shape_pb2.py", line 36, in <module>
    _descriptor.FieldDescriptor(
  File "/home/chang/anaconda3/envs/hf39/lib/python3.9/site-packages/google/protobuf/descriptor.py", line 561, in __new__
    _message.Message._CheckCalledFromGeneratedFile()
TypeError: Descriptors cannot not be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
 1. Downgrade the protobuf package to 3.20.x or lower.
 2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).

More information: https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates

make install fails with "gpuAtomicMax" is undefined

When using make install, it exits with the following log/error:

creating build/temp.linux-x86_64-3.9/src/transformers/models/bloom/custom_kernels
/home/user/miniconda3/bin/nvcc -I/home/user/miniconda3/lib/python3.9/site-packages/torch/include -I/home/user/miniconda3/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/user/miniconda3/lib/python3.9/site-packages/torch/include/TH -I/home/user/miniconda3/lib/python3.9/site-packages/torch/include/THC -I/home/fretgkowski/miniconda3/include -I/home/user/miniconda3/include/python3.9 -c src/transformers/models/bloom/custom_kernels/fused_bloom_attention_cuda.cu -o build/temp.linux-x86_64-3.9/src/transformers/models/bloom/custom_kernels/fused_bloom_attention_cuda.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -arch=compute_80 -std=c++17 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -DTORCH_EXTENSION_NAME=fused_bloom_attention_cuda -D_GLIBCXX_USE_CXX11_ABI=0
src/transformers/models/bloom/custom_kernels/fused_bloom_attention_cuda.cu(68): error: identifier "gpuAtomicMax" is undefined

1 error detected in the compilation of "src/transformers/models/bloom/custom_kernels/fused_bloom_attention_cuda.cu".
error: command '/home/user/miniconda3/bin/nvcc' failed with exit code 1
Makefile:10: recipe for target 'install-transformers' failed
make[1]: *** [install-transformers] Error 1
make[1]: Leaving directory '/home/user/text-generation-inference/server'
Makefile:2: recipe for target 'install-server' failed
make: *** [install-server] Error 2

Some text will fail to be captured by stop sequences.

Hi,
Checking stop sequences by comparing tokens causes a problem. For example, the token id of ";" is 30, but the token id of "j;" is 93585, so even though generation should stop, the ids are never equal when comparing token indices. (https://github.com/huggingface/text-generation-inference/blob/main/server/text_generation/utils.py#L85)

I suggest comparing the text rather than the tokens (maybe some overhead, but I think it is less confusing), for example:
class StoppingCriteria:
    def __init__(self, tokenizer, eos_token_id, max_new_tokens=20, stop_sequences=None):
        self.tokenizer = tokenizer
        self.eos_token_id = eos_token_id
        self.max_new_tokens = max_new_tokens
        self.current_tokens = 0
        self.stop_sequences = stop_sequences

    def __call__(self, all_ids):
        self.current_tokens += 1
        if self.current_tokens >= self.max_new_tokens:
            return True
        if self.eos_token_id is not None and all_ids[-1] == self.eos_token_id:
            return True
        if self.stop_sequences is not None:
            # ids.shape == (N, 1)
            all_str = self.tokenizer.decode(all_ids.squeeze(-1), skip_special_tokens=True)
            for s_str in self.stop_sequences:
                # too short to compare
                if len(all_str) < len(s_str):
                    continue
                if all_str[-len(s_str):] == s_str:
                    return True
        return False
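A hypothetical usage sketch of the class above (the tokenizer choice and input are purely illustrative):

import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
criteria = StoppingCriteria(tokenizer, tokenizer.eos_token_id,
                            max_new_tokens=50, stop_sequences=[";"])

# Even if the final generated token is a merged piece like "j;", the decoded
# text still ends with ";", so the text-based comparison triggers the stop.
ids = torch.tensor(tokenizer.encode("int i = 0; j;")).unsqueeze(-1)  # shape (N, 1)
print(criteria(ids))  # True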

Does text-generation-router support dynamic batching across GPUs?

Hi! Thanks for the wonderful work! I have successfully run the router on a single GPU with multiple instances, and it works as expected. I wonder if this framework also supports running on multiple GPUs (each running multiple model instances, on a single machine or across nodes). If so, how can I do this?
