jquesnelle / yarn
YaRN: Efficient Context Window Extension of Large Language Models
License: MIT License
I am curious what is required to apply this method to the 70B parameter version of the llama2 model?
On Reddit, I noticed you mentioned: "For training, these models barely fit in 128 80GB A100s using DeepSpeed and FA2."
Would the computer at OSC be enough? https://www.osc.edu/resources/technical_support/supercomputers/ascend
It has only 96 80GB A100 GPUs: is that enough to contribute to the SoTA (state of the art)?
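As a rough sanity check, here is my own back-of-envelope estimate (an assumption on my part, not the authors' numbers): with mixed-precision Adam under ZeRO-3, parameters, gradients, and optimizer states come to roughly 16 bytes per parameter, sharded across GPUs, and the remaining headroom on each card has to hold long-sequence activations:

# Rough, hedged estimate of per-GPU state memory for a 70B model under ZeRO-3.
# Assumptions (mine): bf16 weights and grads plus fp32 master weights and Adam
# moments ~= 16 bytes per parameter; activations are not included.
params = 70e9
bytes_per_param = 16  # 2 (weights) + 2 (grads) + 12 (fp32 master + Adam m, v)
for n_gpus in (96, 128):
    per_gpu_gib = params * bytes_per_param / n_gpus / 2**30
    print(f"{n_gpus} GPUs: ~{per_gpu_gib:.0f} GiB/GPU of sharded states, "
          f"~{80 - per_gpu_gib:.0f} GiB left for activations per 80 GiB A100")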
Hello, I'm thrilled to see that linear and NTK interpolation have been elegantly combined to create a much stronger interpolation strategy, YaRN. However, while going through the code in modeling_llama.py, I find myself a bit confused by the calculation of inv_freq, particularly at line 398.
According to the YaRN paper, equation 23 states the following: [equation image omitted]
Consequently, we can derive: [image omitted]
However, in the paper, the calculation of [image omitted] differs.
Hence, I think there might be a problem with equation 25 and also with line 398. Perhaps we can revise the yarn function as follows, since I've empirically found that this fix can further enhance performance:
def revised_yarn(self, device):
    # Base RoPE inverse frequencies
    inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
    low, high = _yarn_find_correction_range(self.beta_fast, self.beta_slow, self.dim, self.base, self.original_max_position_embeddings)
    inv_freq_mask = (1 - _yarn_linear_ramp_mask(low, high, self.dim // 2).float().to(device)) * self.extrapolation_factor
    # Blend the scale in the denominator (wavelength-space interpolation)
    inv_freq = inv_freq / ((1 - inv_freq_mask) * self.scale + inv_freq_mask)
    self.register_buffer("inv_freq", inv_freq, persistent=False)
    self.mscale = float(_yarn_get_mscale(self.scale) * self.attn_factor)
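For reference, a minimal sketch (my own, assuming the repo's yarn() blends the interpolated and original inverse frequencies linearly) of how the two rules differ: the original mixes the frequencies themselves, while the revision above mixes the scale in the denominator, i.e. interpolates in wavelength space:

import torch

# Hedged sketch: theta stands for the base RoPE frequencies (inv_freq), s is
# the scaling factor, and mask is the ramp (1 = extrapolate, 0 = interpolate).
def mix_frequencies(theta, s, mask):
    # linear blend of interpolated and original frequencies (original yarn(), as I read it)
    return (theta / s) * (1 - mask) + theta * mask

def mix_wavelengths(theta, s, mask):
    # revised rule above: blend the scale itself, then divide once
    return theta / ((1 - mask) * s + mask)

theta = 10000 ** (-torch.arange(0, 128, 2).float() / 128)
mask = torch.linspace(1.0, 0.0, theta.numel())
print(torch.allclose(mix_frequencies(theta, 4.0, mask),
                     mix_wavelengths(theta, 4.0, mask)))  # False: the blends differ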
I am very curious about the hardware you used for training and how long the training took. Do you have a detailed write-up? If so, I would be extremely grateful.
yarn/scaled_rope/modeling_llama_yarn.py
Line 214 in ff9321f
Please tell me: if I want to expand from 2K to 16K, then the factor multiplied by the base here is [value omitted in the original].
Is this multiple reasonable? Are there any problems here?
Please correct me if I'm wrong.
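For context, my understanding of the NTK-aware base scaling involved here (a sketch, not verified against line 214 of the repo): the base is stretched by s ** (dim / (dim - 2)) so the lowest frequency is interpolated by the full factor s while the highest stays nearly unchanged:

# Hedged sketch of NTK-aware base scaling; dim=128 is Llama's per-head
# dimension and s = 16384 / 2048 = 8 is the assumed extension factor.
dim, base, s = 128, 10000.0, 8
new_base = base * s ** (dim / (dim - 2))
print(new_base)  # ~82,700: the factor applied to the base is s ** (dim / (dim - 2))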
accelerate launch finetune.py \
--output-dir output/mistral-yarn-7b-64k \
--model mistralai/Mistral-7B-v0.1 \
--architecture mistral \
--scaling-factor 2 \
--max-position-embeddings 16384 \
--dataset emozilla/yarn-train-tokenized-8k-mistral \
--sliding-window-attention-schedule 4096 \
--lr-schedule constant \
--learning-rate 0.000001 \
--max-train-steps 1000
Both with and without LoRA I hit the OOM error. This is at only 8K sequence length, so memory consumption should be around 4x smaller than training at 16K sequence length.
accelerate is configured to use two GPUs and FSDP.
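One caveat on the 4x expectation (my assumption, not the poster's measurement): the quadratic factor only applies to the attention matrix, and with FlashAttention most activation memory scales linearly in sequence length, so 8K vs 16K is closer to a 2x difference:

# Hedged illustration: activation-memory ratio between two sequence lengths.
def activation_ratio(seq_short, seq_long, quadratic=False):
    r = seq_long / seq_short
    return r * r if quadratic else r

print(activation_ratio(8192, 16384, quadratic=True))   # 4.0 with naive attention
print(activation_ratio(8192, 16384, quadratic=False))  # 2.0 with FlashAttention-style kernels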
Thank you so much for your open source work.
I evaluated the 128K context capability of the LLaMA-2 7B model using an NVIDIA A100 (80G) GPU. However, I encountered an OOM error. Here is my script:
PG19="--tokenized emozilla/pg19-test-tokenized"
PROOFPILE_LONG_SMALL="--tokenized emozilla/proofpile-test-tokenized --dataset-min-tokens 131072 --samples 10 --truncate"
CUSTOM="--custom-model-together"
python eval/perplexity.py \
${PROOFPILE_LONG_SMALL} ${CUSTOM} \
--output-file data/proofpile-long-small.csv \
--min-tokens 131072 --max-tokens 131072 --tokens-step 2048 --aggressive-memory \
-m llama2_7b_yarn_64k
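For what it's worth, a rough KV-cache estimate (my own arithmetic, assuming Llama-2-7B's shape: 32 layers, 32 KV heads, head dim 128, fp16) shows why a single 80 GB card is tight at 128K tokens even before weights and activations:

# Hedged KV-cache estimate for Llama-2-7B at 128K tokens in fp16.
layers, heads, head_dim, seq, fp16_bytes = 32, 32, 128, 131072, 2
kv_bytes = 2 * layers * heads * head_dim * seq * fp16_bytes  # K and V
print(f"{kv_bytes / 2**30:.0f} GiB")  # 64 GiB, before ~13 GiB of fp16 weights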
File "/workspace/long/yarn/finetune.py", line 143, in main
model = accelerator.prepare(model)
File "/root/miniconda3/envs/yarn/lib/python3.10/site-packages/accelerate/accelerator.py", line 1280, in prepare
result = self._prepare_deepspeed(*args)
File "/root/miniconda3/envs/yarn/lib/python3.10/site-packages/accelerate/accelerator.py", line 1515, in _prepare_deepspeed
raise ValueError(
ValueError: When using DeepSpeed, accelerate.prepare() requires you to pass at least one of training or evaluation dataloaders, or alternatively set an integer value in train_micro_batch_size_per_gpu in the deepspeed config file, or assign an integer value to AcceleratorState().deepspeed_plugin.deepspeed_config['train_micro_batch_size_per_gpu'].
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1523510) of binary: /root/miniconda3/envs/yarn/bin/python
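A workaround that follows directly from the error message (hedged: I have not verified it against this repo's finetune.py; the value 1 is a placeholder) is to set the micro batch size on the DeepSpeed plugin before accelerator.prepare(model):

from accelerate.state import AcceleratorState

# Set the micro batch size before accelerator.prepare(model), as the
# ValueError suggests; the value 1 is a placeholder assumption.
AcceleratorState().deepspeed_plugin.deepspeed_config["train_micro_batch_size_per_gpu"] = 1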
1. accelerate needs more configuration than DeepSpeed with Trainer; can this be realized in DeepSpeed mode?
2. More LLM learners in the ecosystem use FastChat; can this be reproduced in https://github.com/lm-sys/FastChat?
3. This position embedding method needs more open-source developers to investigate it further.
I suppose that'd depend on the specific RoPE variant to be used, but I wonder if you've conducted any experiments?
Hello, this is Chenxin.
I am sooo excited to see the first open-source model with more than 100k context!!! This is undoubtedly very significant progress for the open-source community in LCLMs.
I've noticed that the current version of YaRN only has PPL (perplexity) experiments, which do not always correlate with practical long-context understanding tasks. I am glad😁 to help test llama2-yarn-128k on LEval, but I do not have the resources to do SFT based on llama2-yarn-128k. Would you mind providing an instruction-following version?
Thanks again for the great work!
Nice work making this.
Could you clarify/confirm the license here? I see MIT License here on Github and no license on HuggingFace.
I would have assumed this has to at least be Meta Community License as that would transfer through because of using Llama 2.
It looks like the only new training data added is PG-19, which seems to be Apache 2.0, so it seems that YaRN could take on a Meta Community License.
Hi, can you also share the preprocessing script to convert the dataset to the standard format? Also, why is the attention_mask in the dataset required?
Hello developer, I've been trying to run an evaluation with lm-evaluation-harness based on your paper, but I'm encountering an issue stating that a directory doesn't exist.
Could you provide more detailed instructions on how to conduct the evaluation?
Here is the command I've been using and the error that occurs.
command
pip install git+https://github.com/EleutherAI/lm-evaluation-harness
./eval-harness.sh
error
python: can't open file '/workspace/yarn/../lm-evaluation-harness/main.py': [Errno 2] No such file or directory
Your assistance would be greatly appreciated!
(help me plz..!!!)
Why does fine-tuning llama2-7b-64k take so long for me?
Each epoch takes 300+ seconds.
I used 8xA100, turned on DeepSpeed, and used "yarn" as the RoPE type.
Is it a problem with flash attention? But I see that modeling_llama_together_yarn.py uses flash attention by default.
Thanks a lot.
@bloc97 @jquesnelle Dear Authors,
Firstly, I would like to extend my sincere appreciation for your remarkable work. It is truly commendable and has served as a valuable resource for the community.
Upon reading your paper, I encountered some confusion regarding the evaluation metrics employed. Specifically, in Section 4.3.1, you state: "...selected 10 random samples from Proof-pile with at least 128k tokens each and evaluated the perplexity of each of these samples when truncated at 2k steps from a sequence length of 2k tokens through 128k tokens." Could you kindly clarify what is meant by "2k steps" in this context?
Additionally, the term "Sliding window perplexity (S = 256) of ten 128k Proof-pile documents truncated to evaluation context window size" is used multiple times. However, I am uncertain how sliding window perplexity is applied if the documents are truncated to the evaluation context window size. Does it mean the documents are truncated to the maximum evaluation context window size (128k)?
Your insights and clarifications on these points would be greatly appreciated, as they might resolve some misunderstandings I have regarding the paper.
Thank you for your time and consideration.
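My reading of "2k steps" (an interpretation on my part, pending the authors' confirmation) is that each sample is scored at every truncation length from 2k through 128k in increments of 2k:

# Hedged reading of Section 4.3.1: score each sample at every truncation
# length from 2k through 128k in 2k increments.
eval_lengths = list(range(2048, 131072 + 1, 2048))
print(len(eval_lengths), eval_lengths[:3], eval_lengths[-1])  # 64 [2048, 4096, 6144] 131072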
Thank you for sharing this!
I'd like to review your steps for generating the plots I've seen on Twitter.
Could you please include your plot-generation script? I know it's calling perplexity.py, but I'd like to re-trace your steps exactly. Then I can tweak it :)
When launching finetune.py using the following command:
CUDA_VISIBLE_DEVICES=0,1,2,3,4 accelerate launch finetune.py --output-dir output/yarn-7b-64k --model /data/wy/llm_base/Llama-2-7b-hf --dataset /data/wy/LLMScaledData/pg_books-tokenized-bos-eos-chunked-6/data
The following error occurred:
Traceback (most recent call last):
File "/data/wy/yarn/finetune.py", line 293, in
main(args.parse_args())
File "/data/wy/yarn/finetune.py", line 156, in main
model.gradient_checkpointing_enable()
File "/home/centos/anaconda3/envs/llm_sacled/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1614, in getattr
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'DistributedDataParallel' object has no attribute 'gradient_checkpointing_enable'
Need to modify 'model.gradient_checkpointing_enable()' to 'model.module.gradient_checkpointing_enable()'
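A slightly more defensive variant (a sketch; the hasattr guard is my addition) keeps the script working whether or not the model has been wrapped in DistributedDataParallel:

# Unwrap DDP if present before enabling gradient checkpointing.
target = model.module if hasattr(model, "module") else model
target.gradient_checkpointing_enable()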
Config file: deepspeed/zero3.json
It errors out and can't use auto, so I changed the config myself. Not sure whether it's correct, but it runs for now:
Using the command:
accelerate launch finetune.py --output-dir output/yarn-7b-32k
--model NousResearch/Llama-2-7b-hf --learning-rate 0.00001
--lr-schedule constant --scaling-factor 8 --deepspeed
Then OOM.
In accelerate config, I disabled deepspeed and dynamo and used the default train.sh.
The first configuration should be the 64k-length one; OOM.
# run `accelerate config` first. pass --deepspeed to finetune.py if using DeepSpeed
accelerate launch finetune.py \
--output-dir output/yarn-7b-64k \
--model NousResearch/Llama-2-7b-hf
Looking at x.shape, it is torch.Size([1, 65536, 4096]); even a single card's 80G of memory doesn't seem to be enough.
So should tensor parallelism (TP) be configured somewhere? The README doesn't seem very friendly to newcomers, though. QAQ
Hello, I am interested in using your models.
What is the LICENSE of these models?
I would like to know.
Great stuff!
Just out of curiosity what was the compute setup used?
I couldn't seem to find details such as GPU type and cluster size used in the paper
Thanks!
After checking #45, #40, and making some hard-coded modifications, these commands passed:
# training
accelerate launch finetune.py --output-dir output/yarn-7b-8k
--model NousResearch/Llama-2-7b-hf --scaling-factor 2
--wandb yarn --dataset
emozilla/yarn-train-tokenized-8k-llama --deepspeed
# save
accelerate launch finetune.py --output-dir output/yarn-7b-8k
--model NousResearch/Llama-2-7b-hf --save-only --scaling-factor 2
--wandb yarn --output-dir output-8k-save --dataset
emozilla/yarn-train-tokenized-8k-llama --deepspeed
And I got these files:
(torch2) root@9b2ed2383075:/workspace/yarn/output/yarn-7b-8k# tree
.
|-- config.json
|-- model-00001-of-00003.safetensors
|-- model-00002-of-00003.safetensors
|-- model-00003-of-00003.safetensors
|-- model.safetensors
`-- model.safetensors.index.json
To load it with passkey.py, I merged these safetensors into the original NousResearch/Llama-2-7b-hf and got this error:
(torch2) root@9b2ed2383075:/workspace/yarn# python3 eval/passkey.py -m /workspace/models/Llama-2-7b-hf/
Determining sequence lengths: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:04<00:00, 1.48it/s]
Model: 0%| | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/workspace/yarn/eval/passkey.py", line 127, in <module>
main(add_args(parser).parse_args())
File "/workspace/yarn/eval/passkey.py", line 90, in main
loaded = load_model_and_apply_patches(model, args)
File "/workspace/yarn/eval/model_loader.py", line 215, in load_model_and_apply_patches
return apply_patches(load_model(model, args), args)
File "/workspace/yarn/eval/model_loader.py", line 90, in load_model
loaded = model_cls.from_pretrained(
File "/root/miniconda3/envs/torch2/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
return model_class.from_pretrained(
File "/root/miniconda3/envs/torch2/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3480, in from_pretrained
) = cls._load_pretrained_model(
File "/root/miniconda3/envs/torch2/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3870, in _load_pretrained_model
new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
File "/root/miniconda3/envs/torch2/lib/python3.10/site-packages/transformers/modeling_utils.py", line 743, in _load_state_dict_into_meta_model
set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
File "/root/miniconda3/envs/torch2/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 285, in set_module_tensor_to_device
raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([0]) in "weight" (which has shape torch.Size([32000, 4096])), this look incorrect.
I noticed that your official https://huggingface.co/NousResearch/Yarn-Llama-2-7b-64k does not need any safetensor merging and can be tested successfully.
Did I miss a model conversion script?
Hi,
I have been running into out-of-memory issues when trying to generate text using the model "NousResearch/Yarn-Llama-2-7b-128k". I am using a prompt with 126k tokens and running things on 1 GPU. The script I am using is eval/prompt-loop.py. I tried setting load_in_4bit = True but it didn't help.
Do you have advice to solve this issue?
Thanks !
Hi! Thanks for sharing your nice work.
I have some questions about the perplexity evaluation setup.
In Figure 1, it is mentioned that the sliding window perplexity is reported, with documents truncated to the evaluation context length.
I was wondering if that makes sense, because (as far as I understand) sliding-window evaluation is something you do when the document is longer than the evaluation context length.
Also, is there a particular reason you used truncation for proof-pile and not for gov_report?
Thanks in advance!
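For concreteness, here is a minimal sketch of stride-S sliding-window perplexity as I understand it (my own illustration in the style of the HF docs, not the repo's eval/perplexity.py; summing loss * trg_len is an approximation):

import torch

# Hedged sketch of strided sliding-window perplexity; `model` is assumed to be
# a Hugging Face causal LM whose loss averages over unmasked label tokens.
def sliding_window_ppl(model, input_ids, window=4096, stride=256):
    nll_sum, n_scored, prev_end = 0.0, 0, 0
    for begin in range(0, input_ids.size(1), stride):
        end = min(begin + window, input_ids.size(1))
        trg_len = end - prev_end  # tokens not scored by a previous window
        ids = input_ids[:, begin:end]
        labels = ids.clone()
        labels[:, :-trg_len] = -100  # mask context that was already scored
        with torch.no_grad():
            loss = model(ids, labels=labels).loss
        nll_sum += loss.item() * trg_len
        n_scored += trg_len
        prev_end = end
        if end == input_ids.size(1):
            break
    return torch.exp(torch.tensor(nll_sum / n_scored))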
Thank you for your team's open-source contributions!
From the code, it seems to only support continued pre-training. I want to conduct extrapolation training in the SFT phase, taking the Instruct version from 4k to 16k. How should I proceed?
I run into an OOM error with the default setup on 8xA100 using the train.sh script. Could you please share the GPU requirements for fine-tuning?
The comment "# This if
block is unlikely to be run after we build sin/cos in __init__
. Keep the logic here just in case." might be incorrect. From what I understand, the code following this comment calculates the scale value based on the actual length of the input. However, the value cached in __init__
is unscaled. Therefore, this branch should be executed frequently.
The new values for cos_cached
and sin_cached
shouldn't be cached. If they are, after encountering a long sample, all subsequent samples will use the scaled values, regardless of their length.
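A minimal sketch of the non-sticky behavior being suggested (my illustration, not the repo's code; _compute_scaled_tables is a hypothetical helper): recompute the dynamic scale per call when the requested length exceeds the original context, and never overwrite the unscaled tables cached in __init__:

# Hedged sketch: dynamic-scaling forward that does not let one long sample
# permanently rescale the cached cos/sin used by later, shorter samples.
def forward(self, x, seq_len):
    if seq_len <= self.original_max_position_embeddings:
        # fast path: the unscaled tables cached in __init__
        return self.cos_cached[:seq_len], self.sin_cached[:seq_len]
    # slow path: recompute for this call only; nothing is written back
    scale = seq_len / self.original_max_position_embeddings
    cos, sin = self._compute_scaled_tables(scale, seq_len, x.device)  # hypothetical helper
    return cos, sin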
Hi Yarn team,
thank you guys for the awesome work. Currently I'm trying to evaluate several RoPE scaling methods, and fortunately they are all available in this repo. I have some questions related to the config for RoPE scaling.
I see that requirements.txt already includes transformers >= 4.34.0, which means I could use "linear" and "dynamic" NTK out of the box with transformers, just by adding the rope scaling in AutoConfig.from_pretrained() like this:
config.rope_scaling = { "type": "linear", "factor": args.linear }
or
config.rope_scaling = { "type": "dynamic", "factor": args.dynamic_ntk }
I tried that, removed the patch for linear & dynamic NTK, and the results look identical to those from your implemented patch.
Moreover, it also supports the Falcon architecture (https://github.com/huggingface/transformers/blob/main/src/transformers/models/falcon/modeling_falcon.py#L162).
So my question is: are there any differences between these two implementations? Or is your linear & dynamic NTK patch there to keep the reproduction evals consistent?
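For reference, the full loading pattern being described would look like this (a sketch; the model name is only an example):

from transformers import AutoConfig, AutoModelForCausalLM

# Built-in RoPE scaling in transformers >= 4.34; the model name is an example.
config = AutoConfig.from_pretrained("NousResearch/Llama-2-7b-hf")
config.rope_scaling = {"type": "linear", "factor": 2.0}  # or {"type": "dynamic", "factor": 2.0}
model = AutoModelForCausalLM.from_pretrained("NousResearch/Llama-2-7b-hf", config=config)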
I see that in the __init__ method of LlamaDynamicYaRNScaledRotaryEmbedding there is a boolean parameter called finetuned. What is the purpose of that parameter? Should we set it to False while fine-tuning the model and then set it to True for inference after fine-tuning? What could be the problem if we keep it False regardless of whether the model is fine-tuned?
Hi~
I am currently following the HF version for exploration, but I find that when updating the KV cache in Llama (NousResearch/Yarn-Llama-2-7b-128k), the newly appended empty cache's length is always 256 (line 528):
past_kv = torch.cat([past_kv, torch.empty(bsz, 256, 2, kv.size(3), kv.size(4), dtype=kv.dtype, device=kv.device)], 1)
I think it should be
past_kv = torch.cat([past_kv, torch.empty(bsz, kv.size(1), 2, kv.size(3), kv.size(4), dtype=kv.dtype, device=kv.device)], 1)
Is that right? Or have I misunderstood this procedure?
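If the 256 is intentional, my guess (not confirmed by the authors) is that it preallocates the cache in fixed 256-slot chunks to amortize the cost of torch.cat across decoding steps, with the true sequence length tracked separately so the unused slots are never attended to. If the length bookkeeping instead assumes the cache grows by exactly kv.size(1) each step, then the fixed 256 would indeed be a bug.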
Hi,
Thank you for releasing this code! Are there any plans to train a Phi 2 model?
Thanks!
Could you share the number of GPUs, VRAM size used for finetuning?
Thanks!
Why is the wavelength lowest at the highest dimension?
According to eq. (14), when d gets bigger, the wavelength gets bigger.
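For reference, the relation in question as I recall it from the paper (numbering may differ between versions), in LaTeX:

% My transcription (hedged) of the wavelength relation: the d-th RoPE
% dimension rotates with frequency \theta_d = b^{-2d/|D|}, so its wavelength is
\[
  \lambda_d = \frac{2\pi}{\theta_d} = 2\pi\, b^{2d/|D|},
\]
% which increases with d for base b > 1, i.e. the highest dimensions have the
% longest wavelengths, not the shortest.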
/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [21,0,0], thread: [61,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
I get the above error when I run eval/passkey.py.
How can I solve it?
Hi,
I am trying to fine-tune a 7b model for 16k context length on a 8 GPU, A100, 40 GB machine. But, I am getting the following runtime error:
Traceback (most recent call last):
File "/home/ec2-user/data/yarn/finetune.py", line 222, in <module>
main(args.parse_args())
File "/home/ec2-user/data/yarn/finetune.py", line 150, in main
loss = model(**batch).loss
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1801, in forward
loss = self.module(*inputs, **kwargs)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
result = forward_call(*args, **kwargs)
File "/home/ec2-user/data/yarn/scaled_rope/modeling_llama_together_yarn.py", line 985, in forward
outputs = self.model(
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
result = forward_call(*args, **kwargs)
File "/home/ec2-user/data/yarn/scaled_rope/modeling_llama_together_yarn.py", line 860, in forward
layer_outputs = torch.utils.checkpoint.checkpoint(
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
return CheckpointFunction.apply(function, preserve, *args)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 107, in forward
outputs = run_function(*args)
File "/home/ec2-user/data/yarn/scaled_rope/modeling_llama_together_yarn.py", line 856, in custom_forward
return module(*inputs, output_attentions, None)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
result = forward_call(*args, **kwargs)
File "/home/ec2-user/data/yarn/scaled_rope/modeling_llama_together_yarn.py", line 620, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
result = forward_call(*args, **kwargs)
File "/home/ec2-user/data/yarn/scaled_rope/modeling_llama_together_yarn.py", line 555, in forward
).reshape(bsz, q_len, h_size)
RuntimeError: shape '[1, 16384, 4096]' is invalid for input of size 13459456
Here is the command:
accelerate launch finetune.py --wandb yarn --output-dir output/yarn-7b-16k --model meta-llama/Llama-2-7b-chat-hf --max-train-steps 20 --scaling-factor 4 --scaling-type yarn --seed 31337 --dataset shossain/govreport-qa-5-16384 --gradient-accumulate-every 1
Please suggest.
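One diagnostic worth trying (my own guess at the cause, not a confirmed fix): factoring the reported element count suggests the batch holds far fewer tokens than the assumed q_len of 16384, which would point at variable-length samples in shossain/govreport-qa-5-16384 interacting badly with the unpadded attention path:

# Sanity check on the failing reshape; 4096 is Llama-2-7B's hidden size.
numel, h_size = 13459456, 4096
print(numel % h_size == 0, numel // h_size)  # True 3286: only 3286 tokens, not 16384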
Hello, can we run the project on 8x 80G A100 cards? If not, could you please provide a reference configuration?
Hi Yarn team,
I hope this issue finds you well. I cloned your git code (v2, 2 weeks ago) on our machine and ran into an error:
Traceback (most recent call last):
File "/app/yarn_4/finetune.py", line 293, in <module>
main(args.parse_args())
File "/app/yarn_4/finetune.py", line 52, in main
from scaled_rope.modeling_llama_yarn import LlamaForCausalLM
File "/app/yarn_4/scaled_rope/modeling_llama_yarn.py", line 34, in <module>
from transformers.utils import (
ImportError: cannot import name 'is_flash_attn_2_available' from 'transformers.utils' (/opt/conda/lib/python3.10/site-packages/transformers/utils/__init__.py)
In our current environment, we are using the following versions: [version list omitted in the original].
We are interested in fine-tuning with YaRN in our specific setup. Specifically, we would like to inquire about the versions of transformers, accelerate, and deepspeed used in the YaRN environment. Could you please provide details on how these tools are configured in your environment?
Any guidance or information you can offer regarding this matter would be greatly appreciated.
Thank you for your time and assistance!
Hi Yarn team,
I hope this finds you well. I've been using your code (jquesnelle/yarn) to test on the PG19 dataset. While reviewing the eval.sh script, I noticed some definitions related to the PG19 dataset, but the code for producing the perplexity results seems somewhat unclear.
Settings:
In eval.sh, I found the following definition for the PG19 dataset:
# python eval/perplexity.py -m meta-llama/Llama-2-7b-hf --dataset pg19 --split test --feature text --save-tokenized output/pg19-test-tokenized
PG19="--tokenized emozilla/pg19-test-tokenized"
However, I did not find the actual code for producing the perplexity results, so I attempted to use my own command:
python eval/perplexity.py --dataset pg19 --feature "text" --samples 5 -m meta-llama/Llama-2-7b-hf --max-tokens $max_tokens --min-tokens $max_tokens --tokens-step 4000 --tokenized emozilla/pg19-test-tokenized --yarn $((max_tokens / 4096)) --max-position-embeddings 4096 --original-max-position-embeddings 4096 --dataset-min-tokens $max_tokens --sliding-window 4096 --custom-model --aggressive-memory --flash-attention
I observed that the results differ when the sliding window is set to 4096 versus 256. Compared with the PI and dynamic-NTK methods, YaRN's performance is unstable with a sliding window of 256 and stable with a sliding window of 4096.
Results (plots omitted in the original): --sliding-window 4096 is stable, while --sliding-window 256 is unstable. In contrast, the PI and dynamic-NTK methods maintain relatively stable performance with the sliding window set to either 256 or 4096.
I would appreciate your insights on this phenomenon. Is this behavior considered normal, or could there be potential configuration issues? If possible, could you provide more detailed information about the PG19 dataset testing script to help me better understand and adjust the testing configuration?
Thank you very much for your time and assistance. I look forward to your response.
Best regards,
Yiran
I am looking at the training command for mistral:
Line 60 in 0ae3b2d
Can I train a 64k context length model with 16k long dataset? Or is it just an example?
I want to increase the context of the llama2 model. I have already fine-tuned models (70B and 7B) on my data, and now I want to increase their input context using YaRN. I understand that we need data with 16k context if we want to extend the context; can anyone clarify the procedure that follows after that?
Awesome job on this
Do you have any examples of a fine-tune CLI / setup showing a llama 3b going from 4096 to 6144?
train.sh starts off with the default yarn factor of 16 for both the 7b and 13b cases to generate the outputs output/yarn-7b-64k or output/yarn-13b-64k, and then passes the corresponding output as the model to generate the respective 128k outputs.
Is it mandatory to go from 4k to 64k and then to 128k in incremental steps? Is it not possible to go from 4k to 128k directly?
I compared your code with The Bloke's code for linear scaled embeddings. There are some differences:
1. Your code sets self.scale = 1/scale, making it a fraction, but then divides t by the fractioned scale (t /= self.scale), whereas The Bloke's code multiplies t by the fractioned scale. Which one is right? (See the quick check below.)
2. max_position_embeddings seems to stay at 2048, but The Bloke's code changes it according to the max context length. Or did you actually change max_position_embeddings in the config file?
3. Which one follows the implementation from kaiokendev?
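A tiny check of the arithmetic in point 1 (hedged: factor and frac are my names, and I haven't re-read either codebase): dividing positions by a fractional scale 1/f expands them by f, while multiplying by 1/f compresses them by f, and position interpolation wants the compressed version:

import torch

t = torch.arange(8, dtype=torch.float32)
factor = 4.0
frac = 1.0 / factor             # the "fractioned scale"
print(t / frac)                 # t * 4: positions stretched (not interpolation)
print(t * frac)                 # t / 4: positions compressed (interpolation)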
Hi, I want to know: in passkey retrieval, does the max-position-embeddings parameter need to be set to the post-scaling length?
If my server cannot connect to Hugging Face but I have already downloaded your model from Hugging Face, how can I run the code from your repository? Thanks
Currently this repository doesn't contain a license file. It would be great if you could add one to clarify under which license the code is made available. Thanks!
As titled. For someone with OCD, the current version is a bit painful to look at.
Is the author planning to add new content in a v2? I think just fixing the typos first would be fine.
I've been running on a 40 GB A100 using transformers and GPTQ. To get the model working at all, there seems to be a specific order in which the packages have to be installed.
!pip3 install git+https://github.com/huggingface/transformers.git
!pip3 install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/
!pip3 install git+https://github.com/huggingface/optimum.git
!pip3 install git+https://github.com/HazyResearch/flash-attention.git#subdirectory=csrc/rotary
!pip3 install flash-attn==2.1.1 --no-build-isolation
With the above, I'm running out of memory around 8000 tokens of input (using the 7B model) and the output becomes garbled.
I've tried GPTQ, bnb nf4, and bf16 loading.
On bf16 loading, the output is garbled at 4k tokens of input.
Are the 7B and 13B yarn models fine-tuned? Do you have recommendations on how better to run them?
Resolving data files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 136/136 [00:00<00:00, 296941.88it/s]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00, 1.27s/it]
/root/miniconda3/lib/python3.8/site-packages/transformers/generation/configuration_utils.py:362: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
warnings.warn(
/root/miniconda3/lib/python3.8/site-packages/transformers/generation/configuration_utils.py:367: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
warnings.warn(
Resolving data files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 136/136 [00:00<00:00, 284501.42it/s]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00, 1.01it/s]
/root/miniconda3/lib/python3.8/site-packages/transformers/generation/configuration_utils.py:362: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
warnings.warn(
/root/miniconda3/lib/python3.8/site-packages/transformers/generation/configuration_utils.py:367: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
warnings.warn(
Resolving data files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 136/136 [00:00<00:00, 121937.87it/s]
Generating train split: 61410 examples [01:49, 561.64 examples/s]
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.8/site-packages/datasets/builder.py", line 1940, in _prepare_split_single
writer.write_table(table)
File "/root/miniconda3/lib/python3.8/site-packages/datasets/arrow_writer.py", line 577, in write_table
self.pa_writer.write_table(pa_table, writer_batch_size)
File "pyarrow/ipc.pxi", line 525, in pyarrow.lib._CRecordBatchWriter.write_table
File "/root/miniconda3/lib/python3.8/site-packages/fsspec/implementations/local.py", line 365, in write
return self.f.write(*args, **kwargs)
OSError: [Errno 28] No space left on device
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "finetune.py", line 193, in <module>
main(args.parse_args())
File "finetune.py", line 67, in main
train_dataset = load_dataset('/root/autodl-tmp/data/emozilla___pg_books-tokenized-bos-eos-chunked-65536/default/0.0.0/9107755b15521c04', split='train',
File "/root/miniconda3/lib/python3.8/site-packages/datasets/load.py", line 2136, in load_dataset
builder_instance.download_and_prepare(
File "/root/miniconda3/lib/python3.8/site-packages/datasets/builder.py", line 954, in download_and_prepare
self._download_and_prepare(
File "/root/miniconda3/lib/python3.8/site-packages/datasets/builder.py", line 1049, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/datasets/builder.py", line 1813, in _prepare_split