Comments (3)
Hey @hustxiayang, IIRC applying tree attention requires significantly more compute than plain argmax speculation, so the throughput cost is too high to justify the gain.
Don't quote me on this, but IIRC the speedup without the tree is about 2x, while with the tree it peaks at ~3x but throughput drops sharply (by a factor of 10-20x).
To be confirmed with @Narsil eventually; hope this sheds some light on the question in the meantime.
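For intuition on why the tree costs throughput: plain argmax speculation verifies a single chain of draft tokens (one per Medusa head) each step, whereas tree attention verifies every prefix of every candidate path in the same forward pass, so the number of positions grows roughly multiplicatively with the per-head branching factor. A back-of-the-envelope sketch with hypothetical branching factors (real Medusa configs prune the tree to a few dozen nodes, but the trend is the same):

```python
def chain_positions(num_heads: int) -> int:
    """Argmax speculation: one draft token per Medusa head -> a single chain."""
    return num_heads + 1  # current position + one draft token per head

def tree_positions(top_k_per_head: list[int]) -> int:
    """Tree attention: keep the top-k continuations at every head.

    Every prefix of every candidate path is a node the verification
    forward pass must process, so the node count is the running sum of
    products of the branching factors.
    """
    total, layer = 1, 1  # root = the already-decoded current token
    for k in top_k_per_head:
        layer *= k
        total += layer
    return total

# Hypothetical config: 4 Medusa heads, top-3 candidates per head.
print(chain_positions(4))            # 5 positions verified per step
print(tree_positions([3, 3, 3, 3]))  # 121 positions verified per step
```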
Hi, I suppose you observed the speedups with one of:
text-generation-inference/gemma-7b-it-medusa
text-generation-inference/Mixtral-8x7B-Instruct-v0.1-medusa
text-generation-inference/Mistral-7B-Instruct-v0.2-medusa
If that is the case, which dataset(s) did you use?
Another question: do you plan to support the typical acceptance sampling mentioned in their paper?
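For context on that second question: as I read the Medusa paper, "typical acceptance" keeps a draft token when its probability under the base model exceeds min(ε, δ·exp(−H)), where H is the entropy of the base model's distribution at that position. A minimal PyTorch sketch of the criterion (the function name and threshold values are illustrative assumptions, not TGI or Medusa code):

```python
import torch

def typical_acceptance(
    probs: torch.Tensor,          # [num_candidates, vocab_size] base-model distributions
    candidate_ids: torch.Tensor,  # [num_candidates] draft tokens to check
    epsilon: float = 0.09,        # hard probability threshold (illustrative value)
    delta: float = 0.3,           # entropy scaling coefficient (illustrative value)
) -> torch.Tensor:
    """Return a boolean mask of accepted draft tokens.

    A token is accepted when its base-model probability exceeds
    min(epsilon, delta * exp(-H(p))): high-entropy positions, where many
    tokens are plausible, get a looser threshold than near-deterministic ones.
    """
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)
    threshold = torch.minimum(
        torch.full_like(entropy, epsilon),
        delta * torch.exp(-entropy),
    )
    token_probs = probs.gather(-1, candidate_ids.unsqueeze(-1)).squeeze(-1)
    return token_probs > threshold
```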
This issue is stale because it has been open for 30 days with no activity. Remove the stale label or comment, or it will be closed in 5 days.
Related Issues (20)
- How to clean the TGI guidance cache? HOT 5
- TGI crashes on Multi GPUs HOT 1
- TGI crash on multi GPUs HOT 10
- Description of Tools format within Documentation HOT 2
- OPENAI Format Completion endpoint does not include finish reason and logprobs HOT 1
- `bitsandbytes==0.43.0` not supported on MacOS HOT 1
- Support for no_repeat_ngram_size generation parameter HOT 3
- OpenAI supports `top_p = 0.0` and `top_p = 1.0` but TGI fails with a validation error with either of these values. HOT 4
- Gibberish generated with deepseek-ai/deepseek-coder-6.7b-base HOT 4
- CPU and Memory Utilization for TGI HOT 1
- Model offloading (preparing for Llama3 405b) HOT 3
- Nightly load test results HOT 2
- Regression: get_weights_col_packed_qkv() quantize parameter not included in calls so fails with error missing positional parameter - merge error?
- Shard process was signaled to shutdown with signal 4 rank=0 Error: ShardCannotStart HOT 13
- Can I somehow change attention type from 'FlashAttention' in the text-server-launcher? HOT 1
- `tool_calls` gives sporadic EoF parsing errors HOT 4
- max_batch_size limit doesn't work well at queue.next_batch() HOT 2
- Can't start server with small --max-total-tokens, but works fine with a big setting HOT 4
- Add support for Mistral-Nemo HOT 4
- TGI fails with local LORA adapters HOT 1