Comments (6)
I am looking at this. Is it a generation quality difference?
To be clear, is it inference speed difference, or generation quality difference?
@youkaichao @robertgshaw2-neuralmagic - The primary issue is generation quality. That said, I also did a quick benchmark of the average runtime across 3 runs: 0.4.3 and 0.5.1.post1 take ~160 seconds to run IFEval on 1x RTX 6000, while 0.4.2 takes ~176 seconds. I have also updated the doc: https://docs.google.com/document/d/1b-QigsksQM9xf2MYMRWF4WWR36LulSmRRcr_jQ0qHDg/edit?usp=sharing
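For reference, here is a minimal sketch of launching the same IFEval run through lm-eval-harness's Python API; the model arguments below are illustrative assumptions, not the exact command used for the numbers above.

```python
import lm_eval

# Run IFEval against a vLLM-backed model via lm-eval-harness.
# Model path and extra args are assumptions for illustration.
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=meta-llama/Meta-Llama-3-8B,dtype=auto",
    tasks=["ifeval"],
)
print(results["results"]["ifeval"])
```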
The following creates divergent responses on v0.4.2 and v0.4.3:

```python
from vllm import LLM, SamplingParams

model = LLM("meta-llama/Meta-Llama-3-8B")
sampling_params = SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=['<|end_of_text|>'], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1280, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=False, spaces_between_special_tokens=False, truncate_prompt_tokens=None)
requests = [[8144, 264, 220, 3101, 10, 3492, 12399, 315, 279, 59318, 2199, 330, 2485, 1129, 268, 34466, 2726, 26583, 19945, 352, 12669, 62, 23440, 21025, 2568, 3659, 1159, 4664, 14559, 3343, 3234, 539, 1005, 904, 77702, 323, 11415, 520, 3325, 220, 18, 14491, 430, 706, 15671, 304, 51594, 3645, 11, 369, 3187, 353, 36298, 291, 3857, 961, 220, 16, 12594, 353, 36298, 291, 3857, 961, 220, 17, 12594, 353, 36298, 291, 3857, 961, 220, 18, 20517], [40, 1097, 9293, 264, 8577, 311, 6457, 11, 323, 358, 1053, 1093, 40344, 311, 3350, 459, 74004, 369, 856, 11879, 304, 264, 42482, 276, 1742, 13, 1472, 527, 539, 5535, 311, 1005, 904, 77702, 304, 701, 2077, 13], [8144, 264, 16063, 369, 264, 7878, 1579, 2978, 19560, 889, 374, 11125, 872, 1176, 2683, 13, 7557, 2771, 311, 2997, 520, 3325, 220, 717, 6002, 15609, 555, 9518, 40029, 11, 1778, 439, 510, 5102, 1145, 510, 609, 948]]

outputs = model.generate(
    prompt_token_ids=requests,
    sampling_params=sampling_params,
    use_tqdm=True,
)
for output in outputs:
    print("\n\n\n=========================================")
    print(output.outputs[0].text)
```
I got to the bottom of things. There are two things going on:

- v0.4.2 uses XFormers as the attention implementation unless you explicitly install flash attention, while v0.4.3 installs flash attention by default (a sketch for pinning the backend follows this list).
  - Because of numerics, there is no guarantee of bitwise equality between XFormers and FlashAttention. This is the cause of the divergent generations.
  - FlashAttention and XFormers get similar scores, so I feel good about the FlashAttention correctness modulo numerics.
- v0.4.2 returns <|end_of_text|> at the end of every generation while v0.4.3 does not. I will look into what change caused this, but I think v0.4.3 is doing the right thing.
  - Surprisingly, this has a big impact on the scores.
  - I do not know anything about IFEval, but llama-3-8b pretrained does not seem to be very good at it, so the scores can be sensitive to minor things like this.
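To reproduce the backend comparison below, here is a minimal sketch assuming vLLM's `VLLM_ATTENTION_BACKEND` environment override; it has to be set before the engine is constructed.

```python
import os

# Pin the attention backend before constructing the engine.
# Use "XFORMERS" or "FLASH_ATTN" to switch implementations.
os.environ["VLLM_ATTENTION_BACKEND"] = "XFORMERS"

from vllm import LLM, SamplingParams

model = LLM("meta-llama/Meta-Llama-3-8B")
outputs = model.generate(["Write a cover letter for a recent graduate."],
                         SamplingParams(temperature=0.0, max_tokens=64))
print(outputs[0].outputs[0].text)
```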
Scores with v0.4.2:
|Tasks |Version|Filter|n-shot| Metric |Value | |Stderr|
|------|------:|------|-----:|-----------------------|-----:|---|------|
|ifeval| 2|none | 0|prompt_level_strict_acc|0.1442|± |0.0151|
| | |none | 0|inst_level_strict_acc |0.2638|± |N/A |
| | |none | 0|prompt_level_loose_acc |0.1553|± |0.0156|
| | |none | 0|inst_level_loose_acc |0.2758|± |N/A |
Scores with v0.4.3 - XFORMERS backend:
|Tasks |Version|Filter|n-shot| Metric |Value | |Stderr|
|------|------:|------|-----:|-----------------------|-----:|---|------|
|ifeval| 2|none | 0|prompt_level_strict_acc|0.1017|± |0.0130|
| | |none | 0|inst_level_strict_acc |0.1990|± |N/A |
| | |none | 0|prompt_level_loose_acc |0.1128|± |0.0136|
| | |none | 0|inst_level_loose_acc |0.2098|± |N/A |
Scores with v0.4.3 - FLASHATTENTION backend:
|Tasks |Version|Filter|n-shot| Metric |Value | |Stderr|
|------|------:|------|-----:|-----------------------|-----:|---|------|
|ifeval| 2|none | 0|prompt_level_strict_acc|0.1128|± |0.0136|
| | |none | 0|inst_level_strict_acc |0.2062|± |N/A |
| | |none | 0|prompt_level_loose_acc |0.1220|± |0.0141|
| | |none | 0|inst_level_loose_acc |0.2158|± |N/A |
Scores with v0.4.3 - XFORMERS backend + adding in <|end_of_text|> (matches v0.4.2 exactly):
|Tasks |Version|Filter|n-shot| Metric |Value | |Stderr|
|------|------:|------|-----:|-----------------------|-----:|---|------|
|ifeval| 2|none | 0|prompt_level_strict_acc|0.1442|± |0.0151|
| | |none | 0|inst_level_strict_acc |0.2638|± |N/A |
| | |none | 0|prompt_level_loose_acc |0.1553|± |0.0156|
| | |none | 0|inst_level_loose_acc |0.2758|± |N/A |
Scores with v0.4.3 - FLASHATTENTION backend + adding in <|end_of_text|> (similar score):
|Tasks |Version|Filter|n-shot| Metric |Value | |Stderr|
|------|------:|------|-----:|-----------------------|-----:|---|------|
|ifeval| 2|none | 0|prompt_level_strict_acc|0.1553|± |0.0156|
| | |none | 0|inst_level_strict_acc |0.2710|± |N/A |
| | |none | 0|prompt_level_loose_acc |0.1645|± |0.0160|
| | |none | 0|inst_level_loose_acc |0.2818|± |N/A |
Hack in lm-eval-harness to generate the scores above:
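The actual patch is not reproduced in this thread; purely as an illustration of the idea (a hypothetical post-processing helper, not the real lm-eval-harness change), the effect is to append the stop string back onto each vLLM generation before scoring:

```python
def append_stop_string(texts, stop="<|end_of_text|>"):
    """Re-attach the stop string that newer vLLM strips, mimicking v0.4.2 output."""
    return [t if t.endswith(stop) else t + stop for t in texts]

# Example: post-process generations before handing them to the IFEval scorer.
generations = ["Dear Hiring Manager, ...", "My itinerary for Japan ..."]
generations = append_stop_string(generations)
```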
Closing since I do not think this is a bug; <|end_of_text|> should not be returned to the user.