Comments (7)
@ArthurZucker I can expand on this a bit, because there is an actual difference in the implementations: whether you treat sequential entries of the embeddings as (real, imaginary) pairs, or treat the first half as real and the second half as imaginary. Doesn't this mean that weights trained with one version are incompatible with the other?
This has come up over in torchtitan (as linked in the OP of this issue), because I've been trying to load the HF llama weights and use them with the torchtitan codebase. These weights only work if I use the first-half/second-half implementation, i.e. the one used in HF transformers.
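To make the two conventions concrete, here is a minimal sketch (my own illustration, not code from either repo) of the two "rotate" operations that sit at the heart of RoPE. The HF-style version pairs dimension i of the first half with dimension i of the second half; the interleaved (Meta-style) version pairs consecutive dimensions:

```python
import torch

def rotate_half_hf(x):
    # HF-style convention: the first half of the dims is "real",
    # the second half is "imaginary".
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def rotate_interleaved(x):
    # Meta-style convention: consecutive pairs (x[0], x[1]), (x[2], x[3]), ...
    # form the (real, imaginary) complex components.
    x = x.reshape(*x.shape[:-1], -1, 2)
    x = torch.stack((-x[..., 1], x[..., 0]), dim=-1)
    return x.reshape(*x.shape[:-2], -1)

x = torch.arange(8.0)
# The two conventions pair up different dimensions, so the results differ:
print(rotate_half_hf(x))      # tensor([-4., -5., -6., -7.,  0.,  1.,  2.,  3.])
print(rotate_interleaved(x))  # tensor([-1.,  0., -3.,  2., -5.,  4., -7.,  6.])
```

Since the same cos/sin tables are multiplied against differently paired dimensions, applying one convention to weights trained with the other gives wrong outputs unless the weights are permuted to match.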
Okay, so this has been cleared up, I think. The HF weights are indeed different:
transformers/src/transformers/models/llama/convert_llama_weights_to_hf.py, lines 152–154 (commit 481a957)
Glad to find out that it is a simple weight permutation. This should probably be clarified somewhere.
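For reference, the permutation in the linked conversion-script lines is, to the best of my understanding, equivalent to the following sketch (the toy identity-matrix check is my own, not from the script):

```python
import torch

def permute(w, n_heads, dim1, dim2):
    # Reorder the rows of W_q / W_k per head so that the interleaved
    # (even, odd) output dims land in (first half, second half) order.
    return w.view(n_heads, dim1 // n_heads // 2, 2, dim2).transpose(1, 2).reshape(dim1, dim2)

# Toy check with one head of size 8: row i of a projection weight produces
# output dim i, so permuting an identity matrix reveals the induced reordering.
eye = torch.eye(8)
order = permute(eye, n_heads=1, dim1=8, dim2=8).argmax(dim=-1)
print(order)  # tensor([0, 2, 4, 6, 1, 3, 5, 7])
```

The even dims move to the first half and the odd dims to the second half, which is exactly the mapping between the interleaved and the split-half RoPE conventions.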
As long as the outputs produced are the same, this should not matter, no?
Being able to load the weights doesn't mean the implementations are the same. A difference in the implementations will cause the outputs to differ given the same input.
I'm a bit confused here... Please correct me if I'm wrong
Assuming the weights are from Meta, and are trained using Meta’s implementation:
Then the consequence is that the weights are optimized for Meta's llama, although still usable with HF's llama. So inference results should be worse on HF's model, and if people start fine-tuning from HF's llama, they should see a much higher loss (at least initially).
cc: @rlrs
Thanks for the clarification! When I said different implementations, I meant implementations that lead to different results on the same input. (I thought this was clear from the context, but it seems not; my bad.)
I agree that as long as the outputs match, the implementation details don't matter much.
The Llama 2 code is subject to a code license; no one is allowed to use the original code, which is why we have a different version.
As long as the outputs produced are the same, this should not matter, no?
- TGI uses a similar computation of the cos and sin
- pretty sure vllm as well
- transformers has more than 15 models that use this formulation of RoPE; not a single one uses the one in Meta's llama repo
See this: https://github.com/meta-llama/llama/blob/main/LICENSE
@rlrs I probably should have specified that yes, this works because we permute the weights. But if you don't permute them and still use the cos/sin formulation, you still have a different implementation: you have to permute q and k on the fly, which costs a lot.
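For illustration, an on-the-fly fix might look like the following (a hypothetical helper of my own, not code from transformers), applied to q and k at every forward pass instead of permuting the projection weights once at conversion time:

```python
import torch

def to_half_split(x):
    # Hypothetical on-the-fly conversion: gather the even dims into the first
    # half and the odd dims into the second half of each head, every step.
    return torch.cat((x[..., 0::2], x[..., 1::2]), dim=-1)

q = torch.arange(8.0)
print(to_half_split(q))  # tensor([0., 2., 4., 6., 1., 3., 5., 7.])
```

The extra gather on every q and k at every layer and every step is what "costs a lot" compared to a one-time weight permutation, which folds the same reordering into the projection matrices for free.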
Pretty sure it is clarified that you need a conversion script to go from the original model to transformers, but if you want to add this to the readme, feel free to do so.
@tianyu-l I don't understand the confusion: if the implementation is different but the produced outputs, meaning the generations, are correct, where is the problem?
Imagine a genius finds a new way to compute matrix-vector products and we use it (cf. FlashAttention), which Meta did not use to train their model. Is that bad, even if the outputs, the logits and the generations, are the same and generation is around 2x faster?
If you use torch.compile, or if you use gpt-fast, you are again not using the same implementation, but the results do match. How is that possible?
So inference results should be worse on HF's model, and if people start fine-tuning from HF's llama, they should see a much higher loss (at least initially).
This is just completely wrong, and unsourced. When we add a model to transformers, we make sure that we reproduce the results. For the Llama family, the logits match within 1e-2 (because of parallel training, plus maybe a bit of RoPE), but the generated outputs match 1:1 (generate 1,000,000 tokens greedily and you will get the same results).
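This kind of equivalence check can be sketched as follows (a toy illustration with made-up logits, not the actual transformers test suite): logits agree only up to a small tolerance, but greedy decoding picks identical tokens.

```python
import torch

def check_match(logits_a, logits_b, atol=1e-2):
    # Two numerically different but equivalent implementations produce logits
    # that agree approximately, while greedy argmax yields the same tokens.
    close = torch.allclose(logits_a, logits_b, atol=atol)
    same_greedy = torch.equal(logits_a.argmax(-1), logits_b.argmax(-1))
    return close, same_greedy

# Toy example: a small numerical perturbation leaves the argmax unchanged.
a = torch.tensor([[2.0, 0.5, -1.0]])
b = a + 0.003
print(check_match(a, b))  # (True, True)
```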
cool, then we are on the same page! 🤗