llmperf's Introduction

LLMPerf

A tool for evaluating the performance of LLM APIs.

Installation

git clone https://github.com/ray-project/llmperf.git
cd llmperf
pip install -e .

Basic Usage

We implement two tests for evaluating LLMs: a load test to check performance and a correctness test to check correctness.

Load test

The load test spawns a number of concurrent requests to the LLM API and measures the inter-token latency and generation throughput per request and across concurrent requests. The prompt that is sent with each request is of the format:

Randomly stream lines from the following text. Don't generate eos tokens:
LINE 1,
LINE 2,
LINE 3,
...

Where the lines are randomly sampled from a collection of lines from Shakespeare sonnets. Tokens are counted using the LlamaTokenizer regardless of which LLM API is being tested. This is to ensure that the prompts are consistent across different LLM APIs.
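
For illustration, here is a minimal sketch (not the exact implementation) of how such a prompt could be assembled and measured; the sonnet lines and the build_prompt helper are placeholders for this example.

import random

from transformers import LlamaTokenizerFast

# Tokenizer used for counting, matching the note above that the Llama tokenizer
# is used regardless of which LLM API is being tested.
tokenizer = LlamaTokenizerFast.from_pretrained("hf-internal-testing/llama-tokenizer")

# Hypothetical pool of sonnet lines; the real script samples from a larger corpus.
sonnet_lines = [
    "Shall I compare thee to a summer's day?",
    "Thou art more lovely and more temperate:",
    "Rough winds do shake the darling buds of May,",
]

def build_prompt(target_input_tokens: int) -> str:
    """Append randomly sampled sonnet lines until the prompt reaches the target token count."""
    prompt = "Randomly stream lines from the following text. Don't generate eos tokens:\n"
    while len(tokenizer.encode(prompt)) < target_input_tokens:
        prompt += random.choice(sonnet_lines) + "\n"
    return prompt

prompt = build_prompt(target_input_tokens=550)
print(len(tokenizer.encode(prompt)))  # prompt length in tokens, counted consistently across APIs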

To run the most basic load test you can run the token_benchmark_ray script.

Caveats and Disclaimers

  • The endpoint provider's backend may vary widely, so these results are not a reflection of how the software runs on particular hardware.
  • The results may vary with time of day.
  • The results may vary with the load.
  • The results may not correlate with users’ workloads.

OpenAI Compatible APIs

export OPENAI_API_KEY=secret_abcdefg
export OPENAI_API_BASE="https://api.endpoints.anyscale.com/v1"

python token_benchmark_ray.py \
--model "meta-llama/Llama-2-7b-chat-hf" \
--mean-input-tokens 550 \
--stddev-input-tokens 150 \
--mean-output-tokens 150 \
--stddev-output-tokens 10 \
--max-num-completed-requests 2 \
--timeout 600 \
--num-concurrent-requests 1 \
--results-dir "result_outputs" \
--llm-api openai \
--additional-sampling-params '{}'

Anthropic

export ANTHROPIC_API_KEY=secret_abcdefg

python token_benchmark_ray.py \
--model "claude-2" \
--mean-input-tokens 550 \
--stddev-input-tokens 150 \
--mean-output-tokens 150 \
--stddev-output-tokens 10 \
--max-num-completed-requests 2 \
--timeout 600 \
--num-concurrent-requests 1 \
--results-dir "result_outputs" \
--llm-api anthropic \
--additional-sampling-params '{}'

TogetherAI

export TOGETHERAI_API_KEY="YOUR_TOGETHER_KEY"

python token_benchmark_ray.py \
--model "together_ai/togethercomputer/CodeLlama-7b-Instruct" \
--mean-input-tokens 550 \
--stddev-input-tokens 150 \
--mean-output-tokens 150 \
--stddev-output-tokens 10 \
--max-num-completed-requests 2 \
--timeout 600 \
--num-concurrent-requests 1 \
--results-dir "result_outputs" \
--llm-api "litellm" \
--additional-sampling-params '{}'

Hugging Face

export HUGGINGFACE_API_KEY="YOUR_HUGGINGFACE_API_KEY"
export HUGGINGFACE_API_BASE="YOUR_HUGGINGFACE_API_ENDPOINT"

python token_benchmark_ray.py \
--model "huggingface/meta-llama/Llama-2-7b-chat-hf" \
--mean-input-tokens 550 \
--stddev-input-tokens 150 \
--mean-output-tokens 150 \
--stddev-output-tokens 10 \
--max-num-completed-requests 2 \
--timeout 600 \
--num-concurrent-requests 1 \
--results-dir "result_outputs" \
--llm-api "litellm" \
--additional-sampling-params '{}'

LiteLLM

LLMPerf can use LiteLLM to send prompts to LLM APIs. To see the environment variables to set for the provider and the arguments to use for model and additional-sampling-params, see the LiteLLM Provider Documentation.

python token_benchmark_ray.py \
--model "meta-llama/Llama-2-7b-chat-hf" \
--mean-input-tokens 550 \
--stddev-input-tokens 150 \
--mean-output-tokens 150 \
--stddev-output-tokens 10 \
--max-num-completed-requests 2 \
--timeout 600 \
--num-concurrent-requests 1 \
--results-dir "result_outputs" \
--llm-api "litellm" \
--additional-sampling-params '{}'

Vertex AI

Here, --model is used for logging, not for selecting the model. The model is specified in the Vertex AI Endpoint ID.

The GCLOUD_ACCESS_TOKEN needs to be somewhat regularly set, as the token generated by gcloud auth print-access-token expires after 15 minutes or so.

Vertex AI doesn't return the total number of tokens that are generated by their endpoint, so tokens are counted using the Llama tokenizer.
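
If you drive the benchmark from Python (as in the Advanced Usage section below), a small helper along these lines can refresh the token before a long run; this is a sketch, not part of llmperf itself.

import os
import subprocess

def refresh_gcloud_token() -> None:
    """Re-run `gcloud auth print-access-token` and update GCLOUD_ACCESS_TOKEN for this process."""
    token = subprocess.check_output(
        ["gcloud", "auth", "print-access-token"], text=True
    ).strip()
    os.environ["GCLOUD_ACCESS_TOKEN"] = token

refresh_gcloud_token()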

gcloud auth application-default login
gcloud config set project YOUR_PROJECT_ID

export GCLOUD_ACCESS_TOKEN=$(gcloud auth print-access-token)
export GCLOUD_PROJECT_ID=YOUR_PROJECT_ID
export GCLOUD_REGION=YOUR_REGION
export VERTEXAI_ENDPOINT_ID=YOUR_ENDPOINT_ID

python token_benchmark_ray.py \
--model "meta-llama/Llama-2-7b-chat-hf" \
--mean-input-tokens 550 \
--stddev-input-tokens 150 \
--mean-output-tokens 150 \
--stddev-output-tokens 10 \
--max-num-completed-requests 2 \
--timeout 600 \
--num-concurrent-requests 1 \
--results-dir "result_outputs" \
--llm-api "vertexai" \
--additional-sampling-params '{}'

SageMaker

SageMaker doesn't return the total number of tokens that are generated by their endpoint, so tokens are counted using the Llama tokenizer.

export AWS_ACCESS_KEY_ID="YOUR_ACCESS_KEY_ID"
export AWS_SECRET_ACCESS_KEY="YOUR_SECRET_ACCESS_KEY"
export AWS_SESSION_TOKEN="YOUR_SESSION_TOKEN"
export AWS_REGION_NAME="YOUR_ENDPOINTS_REGION_NAME"

python token_benchmark_ray.py \
--model "llama-2-7b" \
--llm-api "sagemaker" \
--mean-input-tokens 550 \
--stddev-input-tokens 150 \
--mean-output-tokens 150 \
--stddev-output-tokens 10 \
--max-num-completed-requests 2 \
--timeout 600 \
--num-concurrent-requests 1 \
--results-dir "result_outputs"

See python token_benchmark_ray.py --help for more details on the arguments.

Correctness Test

The correctness test spawns a number of concurrent requests to the LLM API with the following format:

Convert the following sequence of words into a number: {random_number_in_word_format}. Output just your final answer.

where random_number_in_word_format could be, for example, "one hundred and twenty three". The test then checks that the response contains that number in digit format, which in this case would be 123.

The test does this for a number of randomly generated numbers and reports the number of responses that contain a mismatch.
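
For illustration, a minimal sketch of the check described above; the num2words conversion and the digit-stripping rule are assumptions for this example, not necessarily the exact implementation.

import random
import re

from num2words import num2words  # assumed word-form converter for this sketch

def make_prompt() -> tuple[str, int]:
    """Build one correctness prompt and return it with the expected number."""
    number = random.randint(0, 10_000)
    prompt = (
        f"Convert the following sequence of words into a number: {num2words(number)}. "
        "Output just your final answer."
    )
    return prompt, number

def response_matches(expected: int, response_text: str) -> bool:
    """True if the response contains the expected number in digit format."""
    digits = re.sub(r"[^0-9]", "", response_text)
    return digits == str(expected)

prompt, expected = make_prompt()
print(prompt)
# Example: a response of "The answer is 123." for an expected 123 counts as a match.
print(response_matches(123, "The answer is 123."))  # True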

To run the most basic correctness test you can run the llm_correctness.py script.

OpenAI Compatible APIs

export OPENAI_API_KEY=secret_abcdefg
export OPENAI_API_BASE=https://console.endpoints.anyscale.com/m/v1

python llm_correctness.py \
--model "meta-llama/Llama-2-7b-chat-hf" \
--max-num-completed-requests 150 \
--timeout 600 \
--num-concurrent-requests 10 \
--results-dir "result_outputs"

Anthropic

export ANTHROPIC_API_KEY=secret_abcdefg

python llm_correctness.py \
--model "claude-2" \
--llm-api "anthropic"  \
--max-num-completed-requests 5 \
--timeout 600 \
--num-concurrent-requests 1 \
--results-dir "result_outputs"

TogetherAI

export TOGETHERAI_API_KEY="YOUR_TOGETHER_KEY"

python llm_correctness.py \
--model "together_ai/togethercomputer/CodeLlama-7b-Instruct" \
--llm-api "litellm" \
--max-num-completed-requests 2 \
--timeout 600 \
--num-concurrent-requests 1 \
--results-dir "result_outputs" \

Hugging Face

export HUGGINGFACE_API_KEY="YOUR_HUGGINGFACE_API_KEY"
export HUGGINGFACE_API_BASE="YOUR_HUGGINGFACE_API_ENDPOINT"

python llm_correctness.py \
--model "huggingface/meta-llama/Llama-2-7b-chat-hf" \
--llm-api "litellm" \
--max-num-completed-requests 2 \
--timeout 600 \
--num-concurrent-requests 1 \
--results-dir "result_outputs" \

LiteLLM

LLMPerf can use LiteLLM to send prompts to LLM APIs. To see the environment variables to set for the provider and the arguments to use for model and additional-sampling-params, see the LiteLLM Provider Documentation.

python llm_correctness.py \
--model "meta-llama/Llama-2-7b-chat-hf" \
--llm-api "litellm" \
--max-num-completed-requests 2 \
--timeout 600 \
--num-concurrent-requests 1 \
--results-dir "result_outputs" \

See python llm_correctness.py --help for more details on the arguments.

Vertex AI

Here, --model is used for logging, not for selecting the model. The model is specified in the Vertex AI Endpoint ID.

The GCLOUD_ACCESS_TOKEN needs to be somewhat regularly set, as the token generated by gcloud auth print-access-token expires after 15 minutes or so.

Vertex AI doesn't return the total number of tokens that are generated by their endpoint, so tokens are counted using the Llama tokenizer.

gcloud auth application-default login
gcloud config set project YOUR_PROJECT_ID

export GCLOUD_ACCESS_TOKEN=$(gcloud auth print-access-token)
export GCLOUD_PROJECT_ID=YOUR_PROJECT_ID
export GCLOUD_REGION=YOUR_REGION
export VERTEXAI_ENDPOINT_ID=YOUR_ENDPOINT_ID

python llm_correctness.py \
--model "meta-llama/Llama-2-7b-chat-hf" \
--llm-api "vertexai" \
--max-num-completed-requests 2 \
--timeout 600 \
--num-concurrent-requests 1 \
--results-dir "result_outputs" \

SageMaker

SageMaker doesn't return the total number of tokens that are generated by their endpoint, so tokens are counted using the Llama tokenizer.

export AWS_ACCESS_KEY_ID="YOUR_ACCESS_KEY_ID"
export AWS_SECRET_ACCESS_KEY="YOUR_SECRET_ACCESS_KEY"
export AWS_SESSION_TOKEN="YOUR_SESSION_TOKEN"
export AWS_REGION_NAME="YOUR_ENDPOINTS_REGION_NAME"

python llm_correctness.py \
--model "llama-2-7b" \
--llm-api "sagemaker" \
--max-num-completed-requests 2 \
--timeout 600 \
--num-concurrent-requests 1 \
--results-dir "result_outputs" \

Saving Results

The results of the load test and correctness test are saved in the results directory specified by the --results-dir argument. The results are saved in two files, one with the summary metrics of the test, and one with metrics from each individual request that is returned.
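
For example, the summary file can be inspected with a few lines of Python; the file-name pattern below is an assumption, so adjust it to whatever appears in your results directory.

import glob
import json

results_dir = "result_outputs"  # the value passed to --results-dir

# Assumption: the summary file name contains "summary"; check your results dir for the exact names.
for summary_path in glob.glob(f"{results_dir}/*summary*.json"):
    with open(summary_path) as f:
        summary = json.load(f)
    print(summary_path)
    for key, value in summary.items():
        print(f"  {key}: {value}")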

Advanced Usage

The correctness tests were implemented with the following workflow in mind:

import ray
from transformers import LlamaTokenizerFast

from llmperf.ray_clients.openai_chat_completions_client import (
    OpenAIChatCompletionsClient,
)
from llmperf.models import RequestConfig
from llmperf.requests_launcher import RequestsLauncher


# Copying the environment variables and passing them to ray.init() is necessary
# for making any clients work.
ray.init(runtime_env={"env_vars": {"OPENAI_API_BASE" : "https://api.endpoints.anyscale.com/v1",
                                   "OPENAI_API_KEY" : "YOUR_API_KEY"}})

base_prompt = "hello_world"
tokenizer = LlamaTokenizerFast.from_pretrained(
    "hf-internal-testing/llama-tokenizer"
)
base_prompt_len = len(tokenizer.encode(base_prompt))
prompt = (base_prompt, base_prompt_len)

# Create a client for spawning requests
clients = [OpenAIChatCompletionsClient.remote()]

req_launcher = RequestsLauncher(clients)

req_config = RequestConfig(
    model="meta-llama/Llama-2-7b-chat-hf",
    prompt=prompt
    )

req_launcher.launch_requests(req_config)
result = req_launcher.get_next_ready(block=True)
print(result)

Implementing New LLM Clients

To implement a new LLM client, you need to implement the base class llmperf.ray_llm_client.LLMClient and decorate it as a ray actor.

from typing import Tuple

import ray

from llmperf.models import RequestConfig
from llmperf.ray_llm_client import LLMClient


@ray.remote
class CustomLLMClient(LLMClient):

    def llm_request(self, request_config: RequestConfig) -> Tuple[Metrics, str, RequestConfig]:
        """Make a single completion request to an LLM API.

        Returns:
            Metrics about the performance characteristics of the request.
            The text generated by the request to the LLM API.
            The request_config used to make the request. This is mainly for logging purposes.

        """
        ...
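
Once implemented, the custom client can be plugged into the same workflow shown in the Advanced Usage section; the model name and prompt below are placeholders.

from llmperf.models import RequestConfig
from llmperf.requests_launcher import RequestsLauncher

# Assumes ray.init(...) was called with the provider's env vars, as in the example above.
clients = [CustomLLMClient.remote()]
req_launcher = RequestsLauncher(clients)

req_config = RequestConfig(
    model="your-model-name",        # placeholder
    prompt=("hello world", 2),      # (prompt text, prompt length in tokens)
)

req_launcher.launch_requests(req_config)
result = req_launcher.get_next_ready(block=True)
print(result)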

Legacy Codebase

The old LLMPerf code base can be found in the llmperf-legacy repo.

llmperf's People

Contributors

aniketmaurya · avnishn · dr3s · eltociear · kylehh · mgoin · rickyyx · sumanthrh · waleedkadous


llmperf's Issues

Handling custom codes arguments like trust_remote_code

How can I pass models that require custom code? I also need to pass the trust_remote_code argument for them. I have tried it in several ways, even via the metadata, but it is not getting passed and the model is not getting loaded.

So how do I use this?

HUGGINGFACE set

Where can I find these fields?

export HUGGINGFACE_API_KEY="YOUR_HUGGINGFACE_API_KEY"
export HUGGINGFACE_API_BASE="YOUR_HUGGINGFACE_API_ENDPOINT"

Bug: Hugging Face TGI not working

When trying to load test an LLM deployed using Hugging Face TGI v1.4, using the following commands:

export OPENAI_API_BASE=""
export OPENAI_API_KEY="test"
python3.9 token_benchmark_ray.py \
--model "mistralai/Mistral-7B-Instruct-v0.2" \
--mean-input-tokens 550 \
--stddev-input-tokens 150 \
--mean-output-tokens 150 \
--stddev-output-tokens 10 \
--max-num-completed-requests 100 \
--timeout 600 \
--num-concurrent-requests 5 \
--results-dir "result_outputs" \
--llm-api openai \
--additional-sampling-params '{}'

the following error occurs:

(OpenAIChatCompletionsClient pid=82698) Warning Or Error: 422 Client Error: Unprocessable Entity for url: 
(OpenAIChatCompletionsClient pid=82698) 422

I could fix this error by replacing line 79 in openai_chat_completions_client.py with
stem = "data:"

Concurrency level is not handled properly

Currently, the library makes it look as if it can support arbitrary concurrency levels, but in reality it doesn't:

  • Each actor requires 1 CPU (i.e. the number of actual concurrent workers will be limited by the number of CPUs in your Ray cluster)
  • We're not asserting the number of actors produced (i.e. if Ray can't schedule all of them, the user will never know)
  • Tasks to query the target endpoint are spun up from a single loop, and are therefore bound by the performance of submitting tasks to Ray from a single loop

bug of counting output tokens

In https://github.com/ray-project/llmperf/blob/main/token_benchmark_ray.py#L62, you have

get_token_length = lambda text: len(tokenizer.encode(text))

and then use it to calculate the tokens of the generated text, like

num_output_tokens = get_token_length(gen_text)

However, the tokenizer will add special tokens during the encoding phase; here is an example:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

text = "hello world"

enc = tokenizer.encode(text)
print(enc) # [1, 22172, 3186]

dec = tokenizer.decode(enc)
print(dec) # <s> hello world

The token 1 should not be counted as one generated token here.

BTW, different models may have different special-token rules, so it's hard to determine an easy way to get the real token count of the generated text.
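
For reference, Hugging Face tokenizers can skip special tokens during encoding, which avoids counting them; a small sketch of that workaround:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

text = "hello world"
print(tokenizer.encode(text))                            # [1, 22172, 3186] (includes <s>)
print(tokenizer.encode(text, add_special_tokens=False))  # [22172, 3186]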

llmperf not working for concurrent users

Hi, we are facing an issue with llmperf when setting concurrent_users to more than 1. The code gets stuck and does not progress further even after a few minutes. Screenshot attached below.

[screenshot]

After some time (~5 minutes) it fails with the error below (we set 2 concurrent users):

[screenshot]

We also checked memory and CPU consumption, and we still have about 75% capacity available.

Could someone please help us here to proceed further on this perf testing?

Thanks,
Nihar

Crashing with large number of concurrent users

I'm finding that this tool often crashes when running tests at concurrency levels >500.

Here's the command I'm running: python llmperf.py -f openai -r 2000 -c 2000 -m "mistralai/Mistral-7B-v0.1"

Here's the error log I'm seeing:

(pid=91343) [2023-11-30 19:53:26,142 E 91343 91820] gcs_rpc_client.h:552: Failed to connect to GCS within 60 seconds. GCS may have been killed. It's either GCS is terminated by `ray stop` or is killed unexpectedly. If it is killed unexpectedly, see the log file gcs_server.out. https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure. The program will terminate.
(pid=98374) E1130 19:53:52.486123357   99023 chttp2_transport.cc:2761]   keepalive_ping_end state error: 0 (expect: 1)
2023-11-30 19:53:53,141 WARNING worker.py:2074 -- The node with node id: 19a548a9a5a2ece8fe9589ae0846a32901d76bf544f5e286201bb770 and address: 10.168.0.2 and node name: 10.168.0.2 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a        (1) raylet crashes unexpectedly (OOM, preempted node, etc.)
        (2) raylet has lagging heartbeats due to slow network or busy workload.
Traceback (most recent call last):
  File "/home/francescovirga/llmperf/llmperf.py", line 480, in <module>
    query_results = endpoint_evaluation(endpoint_config, sample_lines)
  File "/home/francescovirga/llmperf/llmperf.py", line 270, in endpoint_evaluation
    results = ray.get(futures)
  File "/home/francescovirga/.local/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/francescovirga/.local/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/francescovirga/.local/lib/python3.10/site-packages/ray/_private/worker.py", line 2565, in get
    raise value
ray.exceptions.LocalRayletDiedError: The task's local raylet died. Check raylet.out for more information.
(pid=98046) None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used. [repeated 74x across cluster]
(validate pid=68909) [2023-11-30 19:53:53,594 E 68909 69441] gcs_rpc_client.h:552: Failed to connect to GCS within 60 seconds. GCS may have been killed. It's either GCS is terminated by `ray stop` or is killed unexpectedly. If it is killed unexpectedly, see the log file gcs_server.out. https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure. The program will terminate. [repeated 205x across cluster]
(pid=85196) E1130 19:53:51.788929082   85763 chttp2_transport.cc:2761]   keepalive_ping_end state error: 0 (expect: 1) [repeated 666x across cluster]

Are benchmark results released somewhere

Thanks for this work!

Can we have access to some of your benchmarks without having to run them again? Is there some sort of leaderboard data available somewhere?

Does llmperf support measuring local disk models? What's the meaning of framework in llmperf.py line 355?

I have downloaded llama2-13b-hf to my local disk.
I use the command python llmperf.py -r 20 -m "../models/Llama-2-13b-chat-hf" to measure the performance of llama2-13b.
But an error occurred.
Traceback (most recent call last):
  File "llmperf.py", line 419, in <module>
    endpoint_config["api_base"] = os.environ["ANYSCALE_API_BASE"]
  File "/opt/conda/lib/python3.8/os.py", line 675, in __getitem__
    raise KeyError(key) from None
KeyError: 'ANYSCALE_API_BASE'

So, please give me some guidance and advice.

usage for local models .

Local models in the sense that they are available on some platform, but I am not fetching them from there; I have the model weights locally and will provide that path for the model argument. So how do I do that?
