jzhang38 / tinyllama
The TinyLlama project is an open endeavor to pretrain a 1.1B Llama model on 3 trillion tokens.
License: Apache License 2.0
What is the minimum GPU requirement?
Can it run inference on a CPU?
Can the model be fine-tuned?
I found that some documents in the pre-training datasets contain a very large number of characters, which makes them take a long time to encode. For example, one document has 15,955,671 characters and takes 6.6 hours to encode.
How do you speed this up? Do you split the document into many sub-documents? I am using Megatron for pre-training; do you have any suggestions?
Looking forward to hearing from you in your free time. Thank you very much.
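One way to approach the splitting idea raised above is to cut an oversized document into bounded pieces before tokenization, so that no single encode call sees millions of characters. The following is a minimal sketch under stated assumptions: `split_document` and its `max_chars` budget are hypothetical names, not part of the TinyLlama or Megatron code, and the sketch prefers to cut at whitespace so that words stay intact.

```python
def split_document(text: str, max_chars: int = 100_000) -> list[str]:
    """Split text into pieces of at most max_chars characters.

    Cuts at the last whitespace inside each window when one exists,
    so words are not broken; falls back to a hard cut otherwise.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    pieces = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Look for a whitespace boundary inside the current window.
            cut = max(text.rfind(" ", start, end), text.rfind("\n", start, end))
            if cut > start:
                end = cut + 1  # keep the whitespace with the left piece
        pieces.append(text[start:end])
        start = end
    return pieces


if __name__ == "__main__":
    doc = "word " * 1000
    parts = split_document(doc, max_chars=64)
    print(len(parts), max(len(p) for p in parts))
```

The pieces can then be encoded independently (for example in parallel) and their token IDs concatenated. Tokens at the cut points may differ slightly from those produced by encoding the whole document at once.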
How much Chinese data is in the training set? Do you plan to support the Chinese language?
Hi, I'm very interested in this project, but I would like to know how you plan to deal with the hallucinations that may result from a very high compression ratio, that is, the ratio of training tokens to model parameters. Isn't 3T tokens for 1.1B parameters a far higher compression than 2T tokens for 7B parameters in Llama 2?
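The comparison in this question can be made concrete with a little arithmetic. A minimal sketch, using only the token and parameter counts quoted in the question (the helper name is hypothetical):

```python
def tokens_per_parameter(tokens: float, parameters: float) -> float:
    """Return how many training tokens are seen per model parameter."""
    return tokens / parameters


# Figures quoted in the question above.
tinyllama = tokens_per_parameter(3e12, 1.1e9)   # 3T tokens, 1.1B parameters
llama2_7b = tokens_per_parameter(2e12, 7e9)     # 2T tokens, 7B parameters

print(f"TinyLlama:  {tinyllama:.0f} tokens per parameter")
print(f"Llama 2 7B: {llama2_7b:.0f} tokens per parameter")
print(f"Ratio:      {tinyllama / llama2_7b:.1f}x")
```

By this measure TinyLlama sees roughly an order of magnitude more tokens per parameter than Llama 2 7B.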
Hello, TinyLlama takes all my RAM and performs very poorly, worse than 7B models. It takes a very long time to load and is worse than most models. I don't understand what I'm doing wrong. I usually use the GGML or GGUF version, but you provide a .bin file that is 4GB for a 1B model. I guess that's the issue; do you have a GGML or GGUF model somewhere?
I'm pretty sure something is wrong. Maybe I can convert it? (The real issue is that I have a high-end AMD GPU, and it is useless here.)
I used the latest version of the base model, not the chat model. Since it has 1B parameters, maybe I can convert it to GGUF?
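The 4GB file size mentioned above is roughly what 1.1B parameters stored as 32-bit floats would occupy. A rough sketch of that arithmetic, assuming the size is dominated by the weights (it ignores activations, the KV cache, and any per-block overhead of quantized formats; the table and helper are illustrative only):

```python
# Approximate storage cost per parameter for common weight formats.
BYTES_PER_PARAM = {
    "float32": 4.0,
    "float16": 2.0,
    "bfloat16": 2.0,
    "int8": 1.0,
    "4-bit": 0.5,
}


def weight_size_gb(n_params: float, dtype: str) -> float:
    """Estimate the size of the weights alone, in gigabytes (10^9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: {weight_size_gb(1.1e9, dtype):.2f} GB")
```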
Hi
I have been trying to reproduce the TinyLlama fine-tuning starting from PY007/TinyLlama-1.1B-intermediate-step-480k-1T,
using both finetuning.py and the last command from script.sh.
I used one A100 40G (using only 27GB of VRAM). Everything apparently went well.
I just added:
from transformers import AutoModelForCausalLM, AutoTokenizer

final_model = "path to last checkpoint"
tokenizer = AutoTokenizer.from_pretrained(final_model)
model = AutoModelForCausalLM.from_pretrained(
    final_model,
    device_map="auto",
    trust_remote_code=True,
)
# Save the model and tokenizer in Hugging Face format.
model.save_pretrained("TinyLlama-1.1B-chat-hf")
tokenizer.save_pretrained("TinyLlama-1.1B-chat-hf")
Then I tried to convert the model to GGML format using convert.py from llama.cpp:
!python convert.py <path to TinyLlama-1.1B-chat-hf>
This led to the following error:
Loading model file <path to TinyLlama-1.1B-chat-hf/pytorch_model.bin>
params = Params(n_vocab=32003, n_embd=2048, n_layer=22, n_ctx=2048, n_ff=5632, n_head=32, n_head_kv=4, f_norm_eps=1e-05, f_rope_freq_base=10000.0, f_rope_scale=None, ftype=None, path_model=PosixPath('/content/drive/MyDrive/TinyLlama/TinyLlama/sft/TinyLlama-1.1B-chat-hf'))
Loading vocab file '/content/drive/MyDrive/TinyLlama/TinyLlama/sft/TinyLlama-1.1B-chat-hf/tokenizer.model', type 'spm'
Traceback (most recent call last):
File "/content/drive/MyDrive/TinyLlama/llama.cpp/convert.py", line 1193, in <module>
main()
File "/content/drive/MyDrive/TinyLlama/llama.cpp/convert.py", line 1175, in main
vocab = load_vocab(vocab_dir, args.vocabtype)
File "/content/drive/MyDrive/TinyLlama/llama.cpp/convert.py", line 1086, in load_vocab
return SentencePieceVocab(path, added_tokens_path if added_tokens_path.exists() else None)
File "/content/drive/MyDrive/TinyLlama/llama.cpp/convert.py", line 372, in __init__
raise Exception(f"Expected added token IDs to be sequential and start at {len(added_tokens)}; got {actual_ids}")
Exception: Expected added token IDs to be sequential and start at 6; got [0, 1, 2, 32000, 32001, 32002]
Any idea what I am doing wrong?
Some more information from the related files:
special_tokens_map.json
{
"additional_special_tokens": [
"<unk>",
"<s>",
"</s>",
"[PAD]",
"<|im_end|>",
"<|im_start|>"
],
"bos_token": "<s>",
"eos_token": "</s>",
"pad_token": "[PAD]",
"unk_token": "<unk>"
}
added_tokens.json
{
"</s>": 2,
"<s>": 1,
"<unk>": 0,
"<|im_end|>": 32001,
"<|im_start|>": 32002,
"[PAD]": 32000
}
Those files are slightly different from what can be found in PY007/TinyLlama-1.1B-Chat-v0.3, and I don't understand why.
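The exception suggests that the converter expects the IDs in added_tokens.json to form one consecutive run, while the file above also lists <unk>, <s> and </s> (IDs 0 to 2), which already belong to the base SentencePiece vocabulary. Below is a minimal sketch of a check along those lines. It assumes the run should start at the base vocabulary size (32000 here, which is an assumption), and the helper name is hypothetical; none of this is taken from llama.cpp.

```python
import json


def added_tokens_for_converter(added_tokens: dict[str, int],
                               base_vocab_size: int) -> dict[str, int]:
    """Keep only tokens beyond the base vocabulary and check they are contiguous."""
    extra = {tok: idx for tok, idx in added_tokens.items() if idx >= base_vocab_size}
    actual = sorted(extra.values())
    expected = list(range(base_vocab_size, base_vocab_size + len(extra)))
    if actual != expected:
        raise ValueError(f"expected added token IDs {expected}, got {actual}")
    # Return the surviving entries ordered by ID for readability.
    return dict(sorted(extra.items(), key=lambda item: item[1]))


# Contents of added_tokens.json as posted above.
added_tokens = json.loads("""
{
  "</s>": 2,
  "<s>": 1,
  "<unk>": 0,
  "<|im_end|>": 32001,
  "<|im_start|>": 32002,
  "[PAD]": 32000
}
""")

cleaned = added_tokens_for_converter(added_tokens, base_vocab_size=32000)
print(json.dumps(cleaned, indent=2))
```

Whether writing the filtered mapping back to added_tokens.json is enough for the conversion to succeed is not confirmed in this thread.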
Hi, thanks for your great work!
Previous works (e.g. GPTs, LLaMAs) all pretrain for one epoch, but I found that you train for three epochs. Why three?
Looking forward to hearing from you in your free time. Thank you very much.
From vLLM
Colab --> https://colab.research.google.com/drive/1HOxyJVxo0NeVk8oidvR3dvouGBTYO60X?usp=sharing
I've noticed that the outputs are rather short or truncated compared to the usual models trained on OpenAssistant.
'### Human: Give me a hello world in python? ### Assistant:' 'Sure, here is a simple "hello world" program in Python:\n\n'
'### Human: Give me a hello world in python? ### Assistant:' 'Sure! Here\'s a simple Python program that says "Hello, world!"'
'### Human: Give me a hello world in python? ### Assistant:' 'Here\'s a simple "hello world" program in Python:\n\n```'
'### Human: Give me a hello world in python? ### Assistant:' 'Sure! Here is a sample code in Python:\n```python\nprint("'
'### Human: Give me a hello world in python? ### Assistant:' "Sure, here's a simple `print()` statement:\n```python\n"
One of your potential use cases is deployment on edge devices.
For that, ONNX Runtime is probably the most likely candidate, as it supports many platforms, APIs, architectures, and hardware accelerators.
It's supposedly easy to convert any Hugging Face hosted model to ONNX with Optimum, though I haven't done it personally.
Any thoughts?
Hello, I have been using your pre-training code, and I'm wondering how to convert the saved model files into the Hugging Face Transformers format, similar to the ones you upload to the Hugging Face Hub.
Makes it somewhat more annoying to use.
Also, were there any changes in how the weights are saved between TinyLlama-1.1B-intermediate-step-240k-503b and TinyLlama-1.1B-intermediate-step-50k-105b? I'm getting incorrect output with the newer checkpoint for code that worked with the first checkpoint.
Just a cleaned version of the Colab notebook from #6, as it's ~15 lines of very self-explanatory code with a couple of comments already in it. I just removed the empty cells at the bottom and the large number of big headers that a notebook this size does not need.
https://colab.research.google.com/drive/1HOxyJVxo0NeVk8oidvR3dvouGBTYO60X?usp=sharing
Hi, can I check why you set the Swiglu packed_weights to False? According to this discussion, "you shouldn't set "_pack_weights=False" as it prevents from fusing a few kernels during the BW pass".
Line 301 in 3cffb3c
One of the requirements is
I'm assuming that the pretrain dataset script would still work for a finetune script, as the data is processed the same?
I was looking through prepare_slimpajama.py
and from what I can tell,
When I tried to look into the packed dataset, I noticed it's supposed to be a custom-format dataset?
I think it would be very useful if you made a guide on preparing a dataset, for example with a small dataset on Colab, because most of our PCs can't handle the sheer file size of the tokens in the SlimPajama and StarCoder datasets.
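Although this is basic, it is important and also time-consuming. It looks like the project does not currently use stable versions, and I ran into various problems building the environment myself; at the moment it fails when building xformers from source. Would it be possible to share a link to the Python packages, for example by putting the conda environment on a cloud drive? Thanks.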
The minimum learning rate is the same as the "max". Is this intentional or a mistake? If it is intentional, why (feel free to skip the explanation if it is too bothersome)?
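To see why equal minimum and maximum values matter, here is a sketch of a cosine decay schedule with linear warmup. It is an illustration only: the function, the step counts, and the learning-rate values are assumptions, and the repository's actual schedule may differ. With `min_lr == max_lr` the decay term vanishes and the rate stays constant after warmup.

```python
import math


def cosine_lr(step: int, max_steps: int, warmup_steps: int,
              max_lr: float, min_lr: float) -> float:
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    if step >= max_steps:
        return min_lr
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return min_lr + coeff * (max_lr - min_lr)


# Hypothetical values, chosen only for illustration.
for step in (100, 550, 1000):
    decayed = cosine_lr(step, 1000, 100, max_lr=4e-4, min_lr=4e-5)
    constant = cosine_lr(step, 1000, 100, max_lr=4e-4, min_lr=4e-4)
    print(f"step {step:>4}: decaying={decayed:.2e} constant={constant:.2e}")
```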
Hello, this is a very nice and much-needed development. How much storage is required for the complete model training, given that around 1.9TB is required for the datasets alone? Also, how much RAM is required?
Best wishes to the team!
Hello, can it run on a CPU with 4GB of RAM?
Can you please guide me regarding the minimum hardware requirements?
Thanks in advance.
Does it only support SFT?
Hi, I see the mention of running this model on llama.cpp in the README. Did you manage to get it to run and quantize with good output? I'm trying to evaluate whether this model can be used for speculative decoding with Llama 2 7B.
With the first checkpoint https://huggingface.co/PY007/TinyLlama-1.1B-step-50K-105b - seems like there might be some issue converting to gguf
python convert.py ../TinyLlama-1.1B-step-50K-105b/
./main -m ../TinyLlama-1.1B-step-50K-105b/ggml-model-f32.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -ngl 0 --temp 0
This results in the following. Both f16 and f32 give this result, and adding a <s> token at the beginning didn't help either:
(...)
Building a website can be done in 10 simple steps:\nStep 1:12000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
(...)
I can see that running with Hugging Face/torch gives a more reasonable result, although it quickly becomes repetitive:
<s> Building a website can be done in 10 simple steps:
Step 1: Create a website.
Step 2: Add a logo.
Step 3: Add a contact form.
Step 4: Add a blog.
Step 5: Add a social media links.
Step 6: Add a contact page.
Step 7: Add a contact form.
Step 8: Add a contact form.
Step 9: Add a contact form.
Not sure where this mismatch is coming from
Thanks
Hi,
Posting here even though this is not related to the code itself.
Context:
I have tried to use Chat-v0.3 directly from the checkpoints (code: https://gist.github.com/galleon/ca73c87542e9110dea4220bb143e70a5). I just added eos_token_id=tokenizer.eos_token_id to the example to make it finish as expected.
I obtain an answer that I consider OK even though it is made of three sentences. (I have not looked into the details of how you generated the chat version. Is any info available?)
Then I decided to move to llama.cpp, making sure to update my version to get the fix for the issue you recently ran into.
I generated the F32 version (which should be the same as the checkpoint).
Here is the result I got with this command:
./main -m ~/.cache/llama.cpp/models/TinyLlama-1.1B-Chat-v0.3.gguf -p "Please answer in one sentence to this question: What is a Large Language Model?" --n-gpu-layers 0 --temp 0 --escape --seed 42 --color --n-predict -2
Do you know why it continues to generate after the EOS?
Then I moved to the Q5_K quantized version and got the following output.
Completely AWOL, which makes me think I have done something wrong. Has anyone had similar issues?
Hi, when I load the model for training:
model = AutoModelForCausalLM.from_pretrained(
model_name_or_path,
torch_dtype=torch.bfloat16,
# device_map="auto",
trust_remote_code=True,
)
I get a message like this (is it right?):
Some weights of LlamaForCausalLM were not initialized from the model checkpoint at /root/bert_path/TinyLlama-1.1B-intermediate-step-240k-503b/TinyLlama-1.1B-intermediate-step-240k-503b and are newly initialized: ['model.layers.18.self_attn.rotary_emb.inv_freq', 'model.layers.21.self_attn.rotary_emb.inv_freq', 'model.layers.19.self_attn.rotary_emb.inv_freq', 'model.layers.9.self_attn.rotary_emb.inv_freq', 'model.layers.10.self_attn.rotary_emb.inv_freq', 'model.layers.20.self_attn.rotary_emb.inv_freq', 'model.layers.0.self_attn.rotary_emb.inv_freq', 'model.layers.11.self_attn.rotary_emb.inv_freq', 'model.layers.7.self_attn.rotary_emb.inv_freq', 'model.layers.15.self_attn.rotary_emb.inv_freq', 'model.layers.4.self_attn.rotary_emb.inv_freq', 'model.layers.16.self_attn.rotary_emb.inv_freq', 'model.layers.5.self_attn.rotary_emb.inv_freq', 'model.layers.1.self_attn.rotary_emb.inv_freq', 'model.layers.14.self_attn.rotary_emb.inv_freq', 'model.layers.3.self_attn.rotary_emb.inv_freq', 'model.layers.6.self_attn.rotary_emb.inv_freq', 'model.layers.13.self_attn.rotary_emb.inv_freq', 'model.layers.2.self_attn.rotary_emb.inv_freq', 'model.layers.12.self_attn.rotary_emb.inv_freq', 'model.layers.8.self_attn.rotary_emb.inv_freq', 'model.layers.17.self_attn.rotary_emb.inv_freq']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Dear Authors,
Thanks so much for your amazing project.
Would it be possible for you to release the following:
This would be a highly valuable artifact for continuing to train the model!
Thanks so much and congratulations on your work!
Pierre
I am training a 120M model from scratch because I would like to do some experiments myself. When I stop and try to resume, I have to drop the batch size significantly, otherwise I get a memory error. Any ideas why?
Also, please consider creating a Discord server where people can discuss the project.
Hi,
Where does the import rotary_emb come from? I don't see it in the requirements, and a Google search does not reveal a GitHub repository that follows this structure.
Thanks,
First, I want to express my gratitude for this project. I think TinyLlama has a lot of potential and we're just starting to see it. Kudos!
I'm pretty new to this exciting field and this is the first time I have fine-tuned a model. I fine-tuned the "base" TinyLlama model (step-240k) on the sam-mosaic/orca-gpt4-chatml dataset, but the result does not seem as good as your v0.2 chat model.
I will keep working on this and share the models I create with you. I think the RAG approach you are experimenting with now is the right direction, and I'm going to do some experiments with that too.
Anyway the model I produced is here in case you want to take a look: TinyLlama-1.1B-orca-gpt4
How can I train the model with a dataset in the databricks-dolly-15k.jsonl format?
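As a starting point, each JSONL record can be flattened into a single prompt string before fine-tuning. This is a minimal sketch under stated assumptions: the field names (`instruction`, `context`, `response`) follow the public databricks-dolly-15k dataset, the "### Human: ... ### Assistant:" template is borrowed from prompts quoted elsewhere on this page rather than from the repository's training script, and the sample record is made up.

```python
import json


def dolly_to_prompt(record: dict) -> str:
    """Turn one Dolly-style record into a single prompt/response string."""
    instruction = record["instruction"].strip()
    context = record.get("context", "").strip()
    response = record["response"].strip()
    # Append the optional context to the instruction when it is present.
    human = f"{instruction}\n{context}" if context else instruction
    return f"### Human: {human} ### Assistant: {response}"


# A made-up record in the same shape as one line of the JSONL file.
line = ('{"instruction": "Name a primary color.", "context": "", '
        '"response": "Red.", "category": "open_qa"}')
print(dolly_to_prompt(json.loads(line)))
```

Reading a whole file then amounts to calling `dolly_to_prompt(json.loads(line))` for each non-empty line.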
Can we fine-tune using bitsandbytes and SFT?
With https://github.com/KillianLucas/open-interpreter gaining steam, it would be cool if you could do a similar project with Code Llama.
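The required datasets are very large, and downloading them requires a VPN. Could you provide a download channel for the datasets, such as a cloud drive or a torrent?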
I am wondering if this behavior is correct:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("PY007/TinyLlama-1.1B-Chat-v0.3")
print(f"vocab_size: {tokenizer.vocab_size}")
print(f"length get_vocab: {len(tokenizer.get_vocab())}")
# Map of token string -> token ID, including added tokens.
vocab = tokenizer.get_vocab()
# Reverse lookup: print the token string for each added ID.
print(list(vocab.keys())[list(vocab.values()).index(32000)])
print(list(vocab.keys())[list(vocab.values()).index(32001)])
print(list(vocab.keys())[list(vocab.values()).index(32002)])
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
vocab_size: 32000
length get_vocab: 32003
[PAD]
<|im_start|>
<|im_end|>
I compared the tokenizers from TinyLlama-1.1B, TinyLlama-1.1B-Chat, Llama-2-7b-hf and Llama-2-7b-chat-hf, which are supposed to be the same, and only TinyLlama-1.1B-Chat has this discrepancy.
This might be unorthodox, but I had to ask.
I've been trying to run the SFT script on a Colab T4, and on Kaggle with dual T4s and a P100, and it instantly ran out of memory.
I've been trying to perform a QLoRA run, and it was successful for a very small dataset, but the dataset I'm trying to fine-tune with is around 20GB and takes anywhere from 81 to 135 hours to map. Trying to stream the dataset makes it load nothing, and I can't run any CPU or GPU instance for that long.
If the SFT script isn't meant to take up that much memory, could you please fix it?
If it is meant to use that much memory, I would like to request that you train a checkpoint or the final model on the UnagamiData dataset.
It's the dataset I used to train my previous model, Unagami. It's a mixture of several high-quality datasets, including Open-Platypus, Oasst1, and OpenOrca. It also has some QA-from-context datasets, like Databricks Dolly, which could make it better for RAG.
It's currently formatted with HTML-like tokens, like <system> and <human>.
I can switch to ### System: and ### Human: if needed.
Would you consider it?
How does one fine-tune the model on a custom dataset? Should we modify finetune.py?
I've seen 4-bit weights mentioned a lot, but I can't find any references to them.
Where can I find the 4-bit quantized weights, or do I have to quantize them myself? If so, is there a Colab notebook for the process or something like that?
model = AutoModelForCausalLM.from_pretrained(args.model_name_or_path)
File "/usr/local/lib/python3.8/site-packages/transformers/models/auto/auto_factory.py", line 471, in from_pretrained
return model_class.from_pretrained(
File "/usr/local/lib/python3.8/site-packages/transformers/modeling_utils.py", line 2795, in from_pretrained
) = cls._load_pretrained_model(
File "/usr/local/lib/python3.8/site-packages/transformers/modeling_utils.py", line 3173, in _load_pretrained_model
raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
RuntimeError: Error(s) in loading state_dict for LlamaForCausalLM:
size mismatch for model.layers.0.self_attn.k_proj.weight: copying a param with shape torch.Size([256, 2048]) from checkpoint, the shape in current model is torch.Size([2048, 2048]).
size mismatch for model.layers.0.self_attn.v_proj.weight: copying a param with shape torch.Size([256, 2048]) from checkpoint, the shape in current model is torch.Size([2048, 2048]).
size mismatch for model.layers.1.self_attn.k_proj.weight: copying a param with shape torch.Size([256, 2048]) from checkpoint, the shape in current model is torch.Size([2048, 2048]).
size mismatch for model.layers.1.self_attn.v_proj.weight: copying a param with shape torch.Size([256, 2048]) from checkpoint, the shape in current model is torch.Size([2048, 2048]).
size mismatch for model.layers.2.self_attn.k_proj.weight: copying a param with shape torch.Size([256, 2048]) from checkpoint, the shape in current model is torch.Size([2048, 2048]).
size mismatch for model.layers.2.self_attn.v_proj.weight: copying a param with shape torch.Size([256, 2048]) from checkpoint, the shape in current model is torch.Size([2048, 2048]).
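The shapes in this error are consistent with grouped-query attention: the checkpoint stores key and value projections for fewer heads than the query projection, while the model being built expects one key/value head per query head. A minimal sketch of that arithmetic, using the n_embd=2048, n_head=32 and n_head_kv=4 values that appear in the converter output quoted earlier on this page (the helper name is hypothetical):

```python
def attention_proj_shapes(n_embd: int, n_head: int, n_head_kv: int) -> dict:
    """Return (out_features, in_features) for the q/k/v projection weights."""
    head_dim = n_embd // n_head
    return {
        "q_proj": (n_head * head_dim, n_embd),
        "k_proj": (n_head_kv * head_dim, n_embd),
        "v_proj": (n_head_kv * head_dim, n_embd),
    }


# Grouped-query attention: 4 key/value heads shared by 32 query heads.
gqa = attention_proj_shapes(n_embd=2048, n_head=32, n_head_kv=4)
# Standard multi-head attention: one key/value head per query head.
mha = attention_proj_shapes(n_embd=2048, n_head=32, n_head_kv=32)

print("checkpoint k_proj:", gqa["k_proj"])
print("model k_proj:     ", mha["k_proj"])
```

One possible reading, not confirmed in this thread, is that the installed transformers version or the loaded config does not take the number of key/value heads into account.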
I noticed that your tokenizer doesn't add the BOS and EOS tokens to the final tensor during encoding. Does this have any impact on pretraining? If it's intentional not to add them, what is the reason behind it?
I adapted Tim Dettmers' filtered OpenAssistant dataset so that it uses the Llama 2 prompt format (e.g. with INST), see here.
I then fine-tuned TinyLlama (using a full fine-tune of all LoRA modules) at the 1T token checkpoint, see here.
Observations:
A. TinyLlama seems to have issues emitting an EOS (</s>) token. For example:
<s> [INST] What planets are in our solar system? [/INST] 1. Mercury
2. Venus
3. Earth
4. Mars
5. Jupiter
6. Saturn
7. Uranus
8. Neptune
9. Pluto
10. Ceres
11. Callisto
12. ...
This leads me to wonder: are BOS and, particularly, EOS tokens (<s> and </s>) being used in pre-training?
B. I notice that when running inference on the raw 1T checkpoint (i.e. not chat fine-tuned), it is common to see ### in the response:
<s> [INST] Generate a python code snippet to add two numbers. [/INST]
### [INST] Generate a python code snippet to add two numbers.
### [INST] Generate a python code snippet to add two numbers.
...
I'm somewhat surprised to see this '###'. Does this mean there are some chat fine-tuning or instruct fine-tuning datasets in the pre-training datasets?
When a fine-tunable version of this model comes out, will we be able to fine-tune it on a Google Colab T4 GPU?
Should we use the tokenizer to encode each sample and then calculate the total number of tokens?
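Counting tokens this way amounts to encoding every sample and summing the lengths. A minimal sketch, in which `count_tokens` is a hypothetical helper and `whitespace_encode` is only a stand-in so the example runs without downloading a tokenizer:

```python
from typing import Callable, Iterable


def count_tokens(samples: Iterable[str], encode: Callable[[str], list]) -> int:
    """Encode each sample and return the total number of tokens."""
    return sum(len(encode(sample)) for sample in samples)


def whitespace_encode(text: str) -> list:
    """Stand-in encoder: one token per whitespace-separated word."""
    return text.split()


samples = ["TinyLlama is a 1.1B model", "trained on 3 trillion tokens"]
print(count_tokens(samples, whitespace_encode))
```

With a Hugging Face tokenizer, `tokenizer.encode` could be passed in place of the stand-in. The total then depends on whether special tokens such as BOS are added to each sample.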
Were there any trade-offs or considerations you made when deciding on the model's size? What criteria did you use to select the specific number of layers, attention heads, embedding size, etc. in your model?
Can it be run on Colab?
Hello,
I am trying to finetune the model with the script you provided, on four RTX 3090 GPUs.
However, I was getting a CUDA out of memory issue, so I made the following change:
model = AutoModelForCausalLM.from_pretrained(
args.model_name_or_path,
device_map=device_map,
trust_remote_code=args.trust_remote_code
)
model = model.half()
It now fits on my GPU, but the training loss becomes 0 after a single batch, and the evaluation loss is NaN.
I tried to check the predictions of the model after training, but its output contains NaN, so it does not work.
What I already tried to solve the issue:
But I get the same result every time. I am assuming this is due to the use of float16, since it is the main difference between my code and the original code. Do you have an idea of what is happening, and of what I could do about it?
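One hazard often associated with casting a model to float16 for training is the format's narrow range: very small values flush to zero and values above 65504 overflow. The sketch below only illustrates that range using the standard library's half-precision packing. Whether it explains the behavior reported here is not established in this thread, and `to_fp16` is a hypothetical helper.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round-trip a Python float through IEEE 754 half precision."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # The value is too large for float16; frameworks would store inf.
        return math.copysign(float("inf"), x)


for value in (1.0, 1e-8, 65504.0, 70000.0):
    print(f"{value!r:>10} -> {to_fp16(value)!r}")
```

Mixed-precision training setups usually address this with loss scaling or by using bfloat16, which keeps the exponent range of float32.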
Thank you!
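Very interesting work, but connections to Hugging Face have been timing out recently. Could you provide a download link that is accessible from within China?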