
Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding

[paper]

Environment Set Up

We recommend the following commands to set up the environment:

pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu121
pip install transformers==4.36.2
pip install accelerate==0.26.1
pip install datasets==2.16.1
pip install einops
pip install protobuf
pip install sentencepiece
pip install typing-extensions
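
A quick, optional sanity check that the pinned versions and the CUDA build were picked up (a minimal sketch; the exact version strings depend on your install):

import torch
import transformers

# Verify the pinned environment before running the tests.
print(torch.__version__)          # expected: 2.1.2+cu121
print(transformers.__version__)   # expected: 4.36.2
print(torch.cuda.is_available())  # should be True on a CUDA machine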

Evaluations

To reproduce the main results:

cd tests
bash run_L40.sh

or bash run_A100.sh

A command has the following format:

python testbed.py --model  JackFram/llama-68m   --target meta-llama/Llama-2-7b-hf  \
--T 0.6 --P 1.0  --start 0 --end 200 --M 384 \
--growmap ../A100_growmaps/68m_7b/growmaps/A100-CNN-68m-7b-stochastic.pt \
--Mode greedy --dataset cnn

  • testbed.py: stochastic decoding.
  • testbed_greedy.py: greedy decoding.
  • test_specinfer.py: SpecInfer sampling.
  • test_greedyS.py: Top-k/greedy sampling.
  • test_accept.py: preparing the acceptance rate vector.

--model specifies the draft model and --target specifies the target model. Currently, only Llama-family models are supported (including Llama-2, Sheared-LLaMA, Vicuna, and TinyLlama).

--T specifies the temperature and --P specifies the top-p for generation.

--dataset should be one of cnn, openwebtext, or c4. --start and --end decide how many examples are evaluated. --seed adjusts the random seed; to precisely reproduce the results, it is set to 17 by default.

--growmap specifies the tree structure. We have prepared some growmaps in A100_growmaps and L40_growmaps.
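
To peek at a growmap before using it, it can be loaded directly (a minimal sketch, assuming growmaps are ordinary Python objects saved with torch.save; the exact structure is defined by tree_search.py):

import torch

# Load a prepared growmap to inspect its structure.
growmap = torch.load(
    "../A100_growmaps/68m_7b/growmaps/A100-CNN-68m-7b-stochastic.pt",
    map_location="cpu")
print(type(growmap))
if isinstance(growmap, dict):
    print(growmap.keys())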

--M should be set to at least #tree + 256; 384 is enough for all the experiments except offloading. To run with offloading, use a command like the following:

CUDA_VISIBLE_DEVICES=0 python testbed.py --model meta-llama/Llama-2-7b-hf \
--target meta-llama/Llama-2-70b-hf  --T 0.6 --P 1.0 \
--start 0 --end 100 --Mode greedy  --M 1024 \
--growmap  ../L40_growmaps/L40-CNN-7b-70b-stochastic.pt  --offloading --dataset cnn

All experiments in tests have a maximum sequence length of 256. To change this, pass max_target_seq to SpecTree. Again, --M should be set to at least #tree + max_target_seq.
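
In other words, --M bounds the KV cache length: it must hold the speculation tree plus the longest target sequence. A back-of-the-envelope check (illustrative numbers only):

# --M must cover the tree size plus the maximum target sequence length.
tree_size = 128        # number of nodes in the growmap (example value)
max_target_seq = 256   # default max sequence length in tests
M = tree_size + max_target_seq
print(M)  # 384, matching the example commands above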

How to obtain the acceptance rate vector

To obtain the acceptance rate vector, which is used in tree_search.py, use the following command:

python test_accept.py --model  JackFram/llama-68m   --target meta-llama/Llama-2-7b-hf  \
--T 0.6 --P 1.0  --start 0 --end 200 --M 288 --W 32 \
--ALG stochastic --dataset cnn

--ALG is stochastic or greedy. --W is the maximum width. --M should be set to at least --W + 256.

To statically obtain the acceptance rate vector (much faster if the target model requires offloading):

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 python fast_test.py --model meta-llama/Llama-2-7b-hf  \
--target meta-llama/Llama-2-70b-hf --T 1.1 --P 1.0 --DP 1.1 --W 32 --start 0 --end 200

The acceptance rate vector is printed and saved to --dst (../acceptance-rate-vector.pt by default).
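
The saved vector can then be loaded for inspection or for use in the tree search (a minimal sketch, assuming it is a tensor saved with torch.save):

import torch

# Load the acceptance rate vector produced by test_accept.py / fast_test.py.
accept_vec = torch.load("../acceptance-rate-vector.pt", map_location="cpu")
print(accept_vec)  # shape depends on the width --W used during measurement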

How to generate growmaps

We use the following command:

python tree_search.py --config demo-config.json

We can modify the contents of demo-config.json to generate different growmaps. The growmaps for the experiments in the paper are provided in L40_growmaps and A100_growmaps.
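
A minimal sketch of reading and tweaking the config before a search run (assuming demo-config.json is plain JSON; the actual field names are defined by tree_search.py):

import json

# Inspect the search configuration; the field names come from tree_search.py.
with open("demo-config.json") as f:
    config = json.load(f)
print(json.dumps(config, indent=2))

# After editing values, write the result back for tree_search.py to consume:
with open("my-config.json", "w") as f:
    json.dump(config, f, indent=2)

Then run python tree_search.py --config my-config.json.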

TODOs

  • Support other open source models.
  • Support multi-round dialogue.
  • Support INT4/8 quantization.
  • Support multi-GPU.

Citation

If you find Sequoia useful or relevant to your project and research, please kindly cite our paper:

@article{chen2024sequoia,
  title={Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding},
  author={Chen, Zhuoming and May, Avner and Svirschevski, Ruslan and Huang, Yuhsun and Ryabinin, Max and Jia, Zhihao and Chen, Beidi},
  journal={arXiv preprint arXiv:2402.12374},
  year={2024}
}


Issues

Question on tree search algorithm

Here, max_branch equals K + 1, where K refers to Algorithm 5 in your paper. The (K+1)-th dimension of p represents the probability of accepting none of the draft model's tokens, according to your code in test_accept.py. So is there a gap between your code and Algorithm 5? Correct me if I misunderstood anything!

Estimate the number of generated tokens per step from the acceptance-rate-vector?

Hi,

If I understand the tree_search algorithm correctly, the dynamic programming process should be able to find the optimal number of generated tokens according to the acceptance-rate-vector. Also, given the acceptance-rate-vector and the candidate tree, the expected number of generated tokens can be computed. But this is just theory. In the paper, the number of generated tokens is measured with experimental runs. I'm wondering whether these experimentally measured numbers agree with the theoretical optimum?
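
For reference, the theoretical estimate can be computed directly: assuming a node of the candidate tree is accepted only when every edge on its root path is accepted independently, the expected number of generated tokens per step is 1 (the bonus token) plus, for each non-root node, the product of acceptance rates along its path. A minimal sketch with a hypothetical tree and rates:

# Expected generated tokens per step for a candidate tree, assuming each
# edge is accepted independently with the given rate (hypothetical values).
tree = {0: [1, 2], 1: [3], 2: [], 3: []}  # node -> children (0 is the root)
accept = {1: 0.8, 2: 0.5, 3: 0.7}         # node -> acceptance rate of its edge

def expected_tokens(node=0, path_prob=1.0):
    # The root's contribution (path_prob == 1) is the bonus token.
    total = path_prob
    for child in tree[node]:
        total += expected_tokens(child, path_prob * accept[child])
    return total

print(expected_tokens())  # 1 + 0.8 + 0.5 + 0.8*0.7 = 2.86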

I was trying to verify it, but the repo contains only the tree maps, while the acceptance vectors are missing. I'm wondering if you have considered this estimation before. Or could you share the acceptance vectors so that, along with the corresponding trees, I can quickly verify it?

Thanks!

Error `p.attn_bias_ptr is not correctly aligned` when testing

I tried to test the code with the run_A100.sh script but got this error:

$:~/sequoia/tests$ bash run_A100.sh
...
Traceback (most recent call last):
  File "/extra_disk_1/optimus/sequoia/tests/testbed.py", line 293, in <module>
    simulation_fast(target_model=target_model, draft_model=draft_model, dataloader=dataloader, T=args.T, top_p=args.P,
  File "/extra_disk_1/optimus/sequoia/tests/testbed.py", line 68, in simulation_fast
    spectree = SpecTree(prefix=input_ids.squeeze(0), device='cuda:0', temperature=T,
  File "/extra_disk_1/optimus/sequoia/tests/../Tree/SpecTree.py", line 68, in __init__
    draft_model_outputs = self.draft_model_engine.inference(input_ids=self.tokens[:self.num_nodes].unsqueeze(0), 
  File "/home/optimus/conda/envs/py9/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/extra_disk_1/optimus/sequoia/tests/../Engine/Engine.py", line 242, in inference
    return self.engine.model_run(input_ids=input_ids, storage_ids=storage_ids,
  File "/home/optimus/conda/envs/py9/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/extra_disk_1/optimus/sequoia/tests/../Engine/Engine.py", line 38, in model_run
    logits = self.model(input_ids=input_ids, 
  File "/home/optimus/conda/envs/py9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/optimus/conda/envs/py9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/extra_disk_1/optimus/sequoia/tests/../Engine/Llama_model.py", line 201, in forward
    outputs = self.model(
  File "/home/optimus/conda/envs/py9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/optimus/conda/envs/py9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/extra_disk_1/optimus/sequoia/tests/../Engine/Llama_model.py", line 59, in forward
    layer_outputs = decoder_layer(
  File "/home/optimus/conda/envs/py9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/optimus/conda/envs/py9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/extra_disk_1/optimus/sequoia/tests/../Engine/Llama_modules.py", line 334, in forward
    hidden_states = self.self_attn(
  File "/home/optimus/conda/envs/py9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/optimus/conda/envs/py9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/extra_disk_1/optimus/sequoia/tests/../Engine/Llama_modules.py", line 127, in forward
    attn_output = torch.nn.functional.scaled_dot_product_attention(
RuntimeError: p.attn_bias_ptr is not correctly aligned

My library versions:

$:~/sequoia/tests$ pip list | grep -e transformers -e torch -e accelerate
accelerate                        0.26.1
torch                             2.1.0
torchaudio                        0.13.1
torchvision                       0.14.1
transformers                      4.37.2

Support for vLLM?

Hi,

I remember vLLM support was on your TODOs. Have you achieved it yet? Was the main challenge in this direction that tree verification with batch size > 1 is hard to make efficient? Thanks!

Tensor shape mismatch when computing apply_rotary_pos_emb

Description:

When I tried to reproduce the paper results following the README, an exception was raised:

return forward_call(*args, **kwargs)
  File "/data0/xiac/RLHF/Prelim/Sequoia/tests/../Engine/Llama_model.py", line 59, in forward
    layer_outputs = decoder_layer(
  File "/home/xiac/.conda/envs/rlhf/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/xiac/.conda/envs/rlhf/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data0/xiac/RLHF/Prelim/Sequoia/tests/../Engine/Llama_modules.py", line 334, in forward
    hidden_states = self.self_attn(
  File "/home/xiac/.conda/envs/rlhf/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/xiac/.conda/envs/rlhf/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data0/xiac/RLHF/Prelim/Sequoia/tests/../Engine/Llama_modules.py", line 118, in forward
    query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
  File "/home/xiac/.conda/envs/rlhf/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 207, in apply_rotary_pos_emb
    q_embed = (q * cos) + (rotate_half(q) * sin)
RuntimeError: The size of tensor a (12) must match the size of tensor b (384) at non-singleton dimension 1

I traced the function calls and enabled the 'debug' flag in engine.model_run. When I tried again, this assertion failed:

Traceback (most recent call last):
  File "/data0/xiac/RLHF/Prelim/Sequoia/tests/testbed.py", line 268, in <module>
    draft_model.initialize_cuda_graph(graph_capture_list)  
  File "/home/xiac/.conda/envs/rlhf/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/data0/xiac/RLHF/Prelim/Sequoia/tests/../Engine/Engine.py", line 189, in initialize_cuda_graph
    self.callables[decoding_seqlen] = capture_graph(
  File "/data0/xiac/RLHF/Prelim/Sequoia/tests/../Engine/Engine.py", line 141, in capture_graph
    static_logits = engine.model_run(
  File "/home/xiac/.conda/envs/rlhf/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/data0/xiac/RLHF/Prelim/Sequoia/tests/../Engine/Engine.py", line 34, in model_run
    assert attention_mask.shape[0] == input_length
AssertionError

I checked the code and found a suspicious line in capture_graph:

static_attn_mask = torch.full((decoding_seqlen, engine.max_length), 0, dtype=dtype, device=device)
static_attn_mask = static_attn_mask[None, None, :, :]

The last line changes static_attn_mask into shape (1, 1, x, y), which fails the check.
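
A self-contained sketch of one possible fix (unverified): keep the mask 2-D so the assertion in model_run holds.

import torch

# Example values taken from the traceback above; dtype/device are placeholders.
decoding_seqlen, max_length = 12, 384
dtype, device = torch.float16, "cpu"

static_attn_mask = torch.full((decoding_seqlen, max_length), 0,
                              dtype=dtype, device=device)
# Do not add broadcast dimensions here; the debug assertion in model_run
# expects attention_mask.shape[0] == input_length, i.e. a 2-D mask:
# static_attn_mask = static_attn_mask[None, None, :, :]
assert static_attn_mask.shape[0] == decoding_seqlen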

data loading timing and disk use

The dataset loading code takes too long. It downloads entire huge datasets (a 70 GB wiki dump, etc.) just to use a handful of examples. Setting split="train[0:2000]" does not help, since slicing happens only after the full download.
Suggestions (a streaming alternative is sketched after this list):

  • download just the first files of the datasets.
  • replace c4 with allenai/c4: load_dataset("allenai/c4", "allenai--c4", data_files={"train": "en/c4-train.00000-of-01024.json.gz"}, split="train")
  • replace wiki with wikitext2. load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
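
Alternatively, the streaming mode of datasets avoids the full download entirely (a sketch mirroring the allenai/c4 suggestion above; the loaders in the repo would need to accept an iterable dataset):

from datasets import load_dataset

# Streaming fetches examples lazily instead of downloading the whole dataset.
ds = load_dataset("allenai/c4", "allenai--c4",
                  data_files={"train": "en/c4-train.00000-of-01024.json.gz"},
                  split="train", streaming=True)
first_2000 = list(ds.take(2000))  # materializes only the first 2000 examples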

Reproducibility: the tree_search generates too small tree

Hi,

I was trying to reproduce the numbers in the paper, but with demo-config.json, plus either the acceptance vector in the repo or an acceptance vector I measured myself, the generated trees are all very small and somewhat fixed:

0 _ 1 _ 3
 \_ 2

or

0 _ 1 _2 _3

On the other hand, the growmaps in the two folders are generally much larger, typically of size 128, 64, or 32. Do you know why the trees I generate are so small, and how to reproduce the growmaps in those two folders?

Thank you!
