Comments (6)
I am looking at this. Is it a generation quality difference?
To be clear, is it inference speed difference, or generation quality difference?
@youkaichao @robertgshaw2-neuralmagic - The primary issue is generation quality. That said, I also did a quick benchmark of the average runtime across 3 runs: 0.4.3 and 0.5.1.post1 take ~160 seconds to run IFEval on 1x RTX 6000, while 0.4.2 takes ~176 seconds. I have also updated the doc: https://docs.google.com/document/d/1b-QigsksQM9xf2MYMRWF4WWR36LulSmRRcr_jQ0qHDg/edit?usp=sharing
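For reference, here is a minimal sketch of launching the same IFEval run through lm-eval-harness's Python API; the model arguments below are illustrative assumptions, not the exact command used for the numbers above.

```python
import lm_eval

# Run IFEval against a vLLM-backed model via lm-eval-harness.
# Model path and extra args are assumptions for illustration.
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=meta-llama/Meta-Llama-3-8B,dtype=auto",
    tasks=["ifeval"],
)
print(results["results"]["ifeval"])
```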
The following creates divergent responses on v0.4.2 and v0.4.3:

```python
from vllm import LLM, SamplingParams

model = LLM("meta-llama/Meta-Llama-3-8B")
sampling_params = SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=['<|end_of_text|>'], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1280, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=False, spaces_between_special_tokens=False, truncate_prompt_tokens=None)
requests = [[8144, 264, 220, 3101, 10, 3492, 12399, 315, 279, 59318, 2199, 330, 2485, 1129, 268, 34466, 2726, 26583, 19945, 352, 12669, 62, 23440, 21025, 2568, 3659, 1159, 4664, 14559, 3343, 3234, 539, 1005, 904, 77702, 323, 11415, 520, 3325, 220, 18, 14491, 430, 706, 15671, 304, 51594, 3645, 11, 369, 3187, 353, 36298, 291, 3857, 961, 220, 16, 12594, 353, 36298, 291, 3857, 961, 220, 17, 12594, 353, 36298, 291, 3857, 961, 220, 18, 20517], [40, 1097, 9293, 264, 8577, 311, 6457, 11, 323, 358, 1053, 1093, 40344, 311, 3350, 459, 74004, 369, 856, 11879, 304, 264, 42482, 276, 1742, 13, 1472, 527, 539, 5535, 311, 1005, 904, 77702, 304, 701, 2077, 13], [8144, 264, 16063, 369, 264, 7878, 1579, 2978, 19560, 889, 374, 11125, 872, 1176, 2683, 13, 7557, 2771, 311, 2997, 520, 3325, 220, 717, 6002, 15609, 555, 9518, 40029, 11, 1778, 439, 510, 5102, 1145, 510, 609, 948]]

outputs = model.generate(
    prompt_token_ids=requests,
    sampling_params=sampling_params,
    use_tqdm=True,
)
for output in outputs:
    print("\n\n\n=========================================")
    print(output.outputs[0].text)
```
I got to the bottom of things. There are two things going on:

- v0.4.2 uses XFormers as the attention implementation unless you explicitly install flash attention, while v0.4.3 installs flash attention by default (a sketch for pinning the backend follows this list).
  - Because of numerics, there is no guarantee of bitwise equality between XFormers and FlashAttention. This is the cause of the divergent generations.
  - FlashAttention and XFormers get similar scores, so I feel good about the FlashAttention correctness modulo numerics.
- v0.4.2 returns <|end_of_text|> at the end of every generation while v0.4.3 does not. I will look into what change caused this, but I think v0.4.3 is doing the right thing.
  - Surprisingly, this has a big impact on the scores.
  - I do not know anything about IFEval, but llama-3-8b pretrained does not seem to be very good at it, so the scores can be sensitive to minor things like this.
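To reproduce the backend comparison below, here is a minimal sketch assuming vLLM's `VLLM_ATTENTION_BACKEND` environment override; it has to be set before the engine is constructed.

```python
import os

# Pin the attention backend before constructing the engine.
# Use "XFORMERS" or "FLASH_ATTN" to switch implementations.
os.environ["VLLM_ATTENTION_BACKEND"] = "XFORMERS"

from vllm import LLM, SamplingParams

model = LLM("meta-llama/Meta-Llama-3-8B")
outputs = model.generate(["Write a cover letter for a recent graduate."],
                         SamplingParams(temperature=0.0, max_tokens=64))
print(outputs[0].outputs[0].text)
```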
Scores with v0.4.2:
|Tasks |Version|Filter|n-shot| Metric |Value | |Stderr|
|------|------:|------|-----:|-----------------------|-----:|---|------|
|ifeval| 2|none | 0|prompt_level_strict_acc|0.1442|± |0.0151|
| | |none | 0|inst_level_strict_acc |0.2638|± |N/A |
| | |none | 0|prompt_level_loose_acc |0.1553|± |0.0156|
| | |none | 0|inst_level_loose_acc |0.2758|± |N/A |
Scores with v0.4.3 - XFORMERS backend:
|Tasks |Version|Filter|n-shot| Metric |Value | |Stderr|
|------|------:|------|-----:|-----------------------|-----:|---|------|
|ifeval| 2|none | 0|prompt_level_strict_acc|0.1017|± |0.0130|
| | |none | 0|inst_level_strict_acc |0.1990|± |N/A |
| | |none | 0|prompt_level_loose_acc |0.1128|± |0.0136|
| | |none | 0|inst_level_loose_acc |0.2098|± |N/A |
Scores with v0.4.3 - FLASHATTENTION backend:
|Tasks |Version|Filter|n-shot| Metric |Value | |Stderr|
|------|------:|------|-----:|-----------------------|-----:|---|------|
|ifeval| 2|none | 0|prompt_level_strict_acc|0.1128|± |0.0136|
| | |none | 0|inst_level_strict_acc |0.2062|± |N/A |
| | |none | 0|prompt_level_loose_acc |0.1220|± |0.0141|
| | |none | 0|inst_level_loose_acc |0.2158|± |N/A |
Scores with v0.4.3 - XFORMERS backend + adding in <|end_of_text|> (matches v0.4.2 exactly):
|Tasks |Version|Filter|n-shot| Metric |Value | |Stderr|
|------|------:|------|-----:|-----------------------|-----:|---|------|
|ifeval| 2|none | 0|prompt_level_strict_acc|0.1442|± |0.0151|
| | |none | 0|inst_level_strict_acc |0.2638|± |N/A |
| | |none | 0|prompt_level_loose_acc |0.1553|± |0.0156|
| | |none | 0|inst_level_loose_acc |0.2758|± |N/A |
Scores with v0.4.3 - FLASHATTENTION backend + adding in <|end_of_text|> (similar score):
|Tasks |Version|Filter|n-shot| Metric |Value | |Stderr|
|------|------:|------|-----:|-----------------------|-----:|---|------|
|ifeval| 2|none | 0|prompt_level_strict_acc|0.1553|± |0.0156|
| | |none | 0|inst_level_strict_acc |0.2710|± |N/A |
| | |none | 0|prompt_level_loose_acc |0.1645|± |0.0160|
| | |none | 0|inst_level_loose_acc |0.2818|± |N/A |
Hack in lm-eval-harness to generate the scores above:
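The actual patch is not reproduced in this thread; purely as an illustration of the idea (a hypothetical post-processing helper, not the real lm-eval-harness change), the effect is to append the stop string back onto each vLLM generation before scoring:

```python
def append_stop_string(texts, stop="<|end_of_text|>"):
    """Re-attach the stop string that newer vLLM strips, mimicking v0.4.2 output."""
    return [t if t.endswith(stop) else t + stop for t in texts]

# Example: post-process generations before handing them to the IFEval scorer.
generations = ["Dear Hiring Manager, ...", "My itinerary for Japan ..."]
generations = append_stop_string(generations)
```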
Closing since I do not think this is a bug; <|end_of_text|> should not be returned to the user.