
Comments (6)

robertgshaw2-neuralmagic commented on July 3, 2024

I am looking at this. It's a generation quality difference.


youkaichao commented on July 3, 2024

To be clear, is it an inference speed difference or a generation quality difference?


akjindal53244 commented on July 3, 2024

@youkaichao @robertgshaw2-neuralmagic - The primary issue is generation quality. But I also did a quick benchmark of average runtime across 3 runs: on average, 0.4.3 and 0.5.1.post1 take ~160 seconds and 0.4.2 takes ~176 seconds to run IFEval on 1x RTX 6000 GPU. I have also updated the doc: https://docs.google.com/document/d/1b-QigsksQM9xf2MYMRWF4WWR36LulSmRRcr_jQ0qHDg/edit?usp=sharing
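
For reference, a rough sketch of the kind of wall-clock comparison above (this times plain vLLM generation rather than the actual IFEval harness run, and the prompt list is a placeholder):

# Average wall-clock time over 3 identical generation runs on one GPU.
import time
from vllm import LLM, SamplingParams

llm = LLM("meta-llama/Meta-Llama-3-8B")
params = SamplingParams(temperature=0.0, max_tokens=1280)
prompts = ["Write a 300+ word summary of a wikipedia page."]  # placeholder workload

def run_once() -> float:
    start = time.perf_counter()
    llm.generate(prompts, params, use_tqdm=False)
    return time.perf_counter() - start

times = [run_once() for _ in range(3)]
print(f"average runtime over 3 runs: {sum(times) / len(times):.1f}s")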


robertgshaw2-neuralmagic commented on July 3, 2024

The following creates divergent responses on v0.4.3 and v0.4.2:

from vllm import LLM, SamplingParams

model = LLM("meta-llama/Meta-Llama-3-8B")

# Greedy decoding with <|end_of_text|> as a stop string and special tokens kept in the output.
sampling_params = SamplingParams(
    n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0,
    repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0,
    seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False,
    stop=['<|end_of_text|>'], stop_token_ids=[], include_stop_str_in_output=False,
    ignore_eos=False, max_tokens=1280, min_tokens=0, logprobs=None,
    prompt_logprobs=None, skip_special_tokens=False,
    spaces_between_special_tokens=False, truncate_prompt_tokens=None,
)

# Three pre-tokenized prompts (token IDs).
requests = [[8144, 264, 220, 3101, 10, 3492, 12399, 315, 279, 59318, 2199, 330, 2485, 1129, 268, 34466, 2726, 26583, 19945, 352, 12669, 62, 23440, 21025, 2568, 3659, 1159, 4664, 14559, 3343, 3234, 539, 1005, 904, 77702, 323, 11415, 520, 3325, 220, 18, 14491, 430, 706, 15671, 304, 51594, 3645, 11, 369, 3187, 353, 36298, 291, 3857, 961, 220, 16, 12594, 353, 36298, 291, 3857, 961, 220, 17, 12594, 353, 36298, 291, 3857, 961, 220, 18, 20517], [40, 1097, 9293, 264, 8577, 311, 6457, 11, 323, 358, 1053, 1093, 40344, 311, 3350, 459, 74004, 369, 856, 11879, 304, 264, 42482, 276, 1742, 13, 1472, 527, 539, 5535, 311, 1005, 904, 77702, 304, 701, 2077, 13], [8144, 264, 16063, 369, 264, 7878, 1579, 2978, 19560, 889, 374, 11125, 872, 1176, 2683, 13, 7557, 2771, 311, 2997, 520, 3325, 220, 717, 6002, 15609, 555, 9518, 40029, 11, 1778, 439, 510, 5102, 1145, 510, 609, 948]]

outputs = model.generate(
    prompt_token_ids=requests,
    sampling_params=sampling_params,
    use_tqdm=True,
)

for output in outputs:
    print("\n\n\n=========================================")
    print(output.outputs[0].text)
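
For readability, the prompt token IDs above can be decoded back to text with the model's tokenizer (a small sketch, assuming the transformers package is available; not part of the original reproduction):

# Decode the pre-tokenized prompts back to plain text for inspection.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
for ids in requests:
    print(tokenizer.decode(ids))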


robertgshaw2-neuralmagic commented on July 3, 2024

I got to the bottom of things. There are two things going on:

  • v0.4.2 uses XFormers as the attention implementation unless you explicitly install flash attention, while v0.4.3 installs flash attention by default.

    • Because of floating-point numerics, there is no guarantee of bitwise equality between XFormers and FlashAttention. This is the cause of the divergent generations.
    • FlashAttention and XFormers get similar scores, so I feel good about FlashAttention correctness modulo numerics.
  • v0.4.2 returns <|end_of_text|> at the end of every generation while v0.4.3 does not --- I will look into what change caused this, but I think v0.4.3 is doing the right thing (see the sketch after this list for pinning the backend and checking for the stop string).

    • This shockingly has a big impact on the scores.
    • I do not know anything about IFEval, but pretrained llama-3-8b does not seem to be very good at it, so it is sensitive to minor things like this.
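
A minimal sketch of checking both effects on a v0.4.3 install (this relies on vLLM's VLLM_ATTENTION_BACKEND environment variable to pin the attention implementation; the prompt is a placeholder):

# Pin the attention backend so v0.4.3 uses the same XFormers kernels as v0.4.2,
# then check whether <|end_of_text|> is returned at the end of the generation.
import os
os.environ["VLLM_ATTENTION_BACKEND"] = "XFORMERS"  # set before importing vllm

from vllm import LLM, SamplingParams

llm = LLM("meta-llama/Meta-Llama-3-8B")
params = SamplingParams(temperature=0.0, max_tokens=64, skip_special_tokens=False)
text = llm.generate(["Write one sentence about Tokyo."], params)[0].outputs[0].text

print(text)
print("ends with <|end_of_text|>:", text.endswith("<|end_of_text|>"))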

Scores with v0.4.2:

|Tasks |Version|Filter|n-shot|        Metric         |Value |   |Stderr|
|------|------:|------|-----:|-----------------------|-----:|---|------|
|ifeval|      2|none  |     0|prompt_level_strict_acc|0.1442|±  |0.0151|
|      |       |none  |     0|inst_level_strict_acc  |0.2638|±  |N/A   |
|      |       |none  |     0|prompt_level_loose_acc |0.1553|±  |0.0156|
|      |       |none  |     0|inst_level_loose_acc   |0.2758|±  |N/A   |

Scores with v0.4.3 - XFORMERS backend:

|Tasks |Version|Filter|n-shot|        Metric         |Value |   |Stderr|
|------|------:|------|-----:|-----------------------|-----:|---|------|
|ifeval|      2|none  |     0|prompt_level_strict_acc|0.1017|±  |0.0130|
|      |       |none  |     0|inst_level_strict_acc  |0.1990|±  |N/A   |
|      |       |none  |     0|prompt_level_loose_acc |0.1128|±  |0.0136|
|      |       |none  |     0|inst_level_loose_acc   |0.2098|±  |N/A   |

Scores with v0.4.3 - FLASHATTENTION backend

|Tasks |Version|Filter|n-shot|        Metric         |Value |   |Stderr|
|------|------:|------|-----:|-----------------------|-----:|---|------|
|ifeval|      2|none  |     0|prompt_level_strict_acc|0.1128|±  |0.0136|
|      |       |none  |     0|inst_level_strict_acc  |0.2062|±  |N/A   |
|      |       |none  |     0|prompt_level_loose_acc |0.1220|±  |0.0141|
|      |       |none  |     0|inst_level_loose_acc   |0.2158|±  |N/A   |

Scores with v0.4.3 - XFORMERS backend + adding in <|end_of_text|> (this matches v0.4.2 exactly):

|Tasks |Version|Filter|n-shot|        Metric         |Value |   |Stderr|
|------|------:|------|-----:|-----------------------|-----:|---|------|
|ifeval|      2|none  |     0|prompt_level_strict_acc|0.1442|±  |0.0151|
|      |       |none  |     0|inst_level_strict_acc  |0.2638|±  |N/A   |
|      |       |none  |     0|prompt_level_loose_acc |0.1553|±  |0.0156|
|      |       |none  |     0|inst_level_loose_acc   |0.2758|±  |N/A   |

Scores with v0.4.3 - FLASHATTENTION backend + adding in <|end_of_text|> (this is a similar score):

|Tasks |Version|Filter|n-shot|        Metric         |Value |   |Stderr|
|------|------:|------|-----:|-----------------------|-----:|---|------|
|ifeval|      2|none  |     0|prompt_level_strict_acc|0.1553|±  |0.0156|
|      |       |none  |     0|inst_level_strict_acc  |0.2710|±  |N/A   |
|      |       |none  |     0|prompt_level_loose_acc |0.1645|±  |0.0160|
|      |       |none  |     0|inst_level_loose_acc   |0.2818|±  |N/A   |


robertgshaw2-neuralmagic commented on July 3, 2024

Hack in lm-eval-harness to generate the scores above:

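(The actual patch is not included here; below is only an illustrative sketch of the idea, re-appending the stop string to each generation before scoring, with a hypothetical helper name:)

# Illustrative only: mimic v0.4.2 by appending <|end_of_text|> to each
# v0.4.3 generation before handing the strings to IFEval scoring.
def restore_end_of_text(generations: list[str]) -> list[str]:
    stop = "<|end_of_text|>"
    return [g if g.endswith(stop) else g + stop for g in generations]

print(restore_end_of_text(["Here is my summary of the page."]))
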
Closing since I do not think this is a bug. <|end_of_text|> should not be returned to the user.

