
Comments (7)

rlrs commented on June 3, 2024

@ArthurZucker I can expand on this a bit, because there is an actual difference in the implementations. The difference is whether you treat consecutive entries of the embeddings as (real, imaginary) pairs, or treat the first half as real and the second half as imaginary. I think this has to mean that weights trained with one version must be incompatible with the other version?

This has come up over in torchtitan (as linked in the OP of this issue), because I've been trying to load the HF llama weights and use them with the torchtitan codebase. However, these weights only work if I use the first-half/second-half implementation, such as the one used in HF transformers.
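
For concreteness, here is a minimal sketch of the two conventions (my own illustration, not code from either repo; it assumes cos and sin each hold head_dim // 2 rotation angles for the position being rotated):

import torch

def rope_interleaved(x, cos, sin):
    # Meta-style: adjacent channels (x0, x1), (x2, x3), ... form the (real, imag) pairs.
    x_real, x_imag = x[..., 0::2], x[..., 1::2]
    out = torch.stack((x_real * cos - x_imag * sin,
                       x_real * sin + x_imag * cos), dim=-1)
    return out.flatten(-2)  # re-interleave the rotated pairs

def rope_half_split(x, cos, sin):
    # HF-style: the first half of the head dim is "real", the second half is "imaginary".
    half = x.shape[-1] // 2
    x_real, x_imag = x[..., :half], x[..., half:]
    return torch.cat((x_real * cos - x_imag * sin,
                      x_real * sin + x_imag * cos), dim=-1)

Given the same input tensor, the two functions rotate different channel pairs together, which is why weights laid out for one convention cannot be dropped into the other unchanged.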


rlrs commented on June 3, 2024

Okay, so this has been cleared up, I think. The HF weights are indeed different:

# permute for sliced rotary
# reorders each head's rows from interleaved (r0, i0, r1, i1, ...) to half-split (r0, r1, ..., i0, i1, ...)
def permute(w, n_heads, dim1=dim, dim2=dim):
    return w.view(n_heads, dim1 // n_heads // 2, 2, dim2).transpose(1, 2).reshape(dim1, dim2)

Glad to find out that it is a simple weight permutation. This should probably be clarified somewhere.
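
To convince yourself this really is only a layout change, here is a small sanity check (my own sketch, not from the conversion script; it reuses rope_interleaved and rope_half_split from the snippet above, and since the quoted permute is a fragment whose default arguments reference a global dim, define dim before pasting it, or pass dim1/dim2 explicitly as done here):

import torch

torch.manual_seed(0)
n_heads, head_dim = 4, 8
dim = n_heads * head_dim

w = torch.randn(dim, dim)          # stand-in for a wq/wk projection weight
x = torch.randn(1, dim)            # a single token's hidden state
theta = torch.rand(head_dim // 2)  # rotation angles for one position
cos, sin = theta.cos(), theta.sin()

q_meta = (x @ w.T).view(n_heads, head_dim)                                      # interleaved layout
q_hf = (x @ permute(w, n_heads, dim1=dim, dim2=dim).T).view(n_heads, head_dim)  # half-split layout

out_meta = rope_interleaved(q_meta, cos, sin)
out_hf = rope_half_split(q_hf, cos, sin)

# Same rotated values, just ordered differently within each head.
reordered = out_meta.view(n_heads, head_dim // 2, 2).transpose(1, 2).reshape(n_heads, head_dim)
assert torch.allclose(reordered, out_hf, atol=1e-5)

Since the same reordering is applied to both q and k, the dot products in attention are unchanged, which is why the permuted checkpoint produces the same outputs.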


tianyu-l commented on June 3, 2024

@ArthurZucker

As long as the outputs produced are the same, this should not matter, no?

Being able to load doesn’t mean the implementations are the same. The difference in the implementations will cause the output to be different, given the same input.

I'm a bit confused here... Please correct me if I'm wrong.
Assuming the weights are from Meta and are trained using Meta's implementation:
Then the consequence is that the weights are optimized for Meta's llama, although still usable on HF's llama. So the inference results should be worse on HF's model, and if people start fine-tuning from HF's llama, they should see a much higher loss (at least initially).

cc: @rlrs


tianyu-l commented on June 3, 2024

Thanks for the clarification! When I said different implementations, I meant different implementations that lead to different results on the same input. (I thought this was very clear from the context, but it seems not, my bad.)

I agree that as long as the outputs match, the implementation details don't matter much.


ArthurZucker commented on June 3, 2024

The llama2 code is subject to a code license. No one is allowed to use the original code, which is why we have a different version.
As long as the outputs produced are the same, this should not matter, no?

  • TGI uses a similar computation of the cos and sin
  • pretty sure vLLM does as well
  • transformers has more than 15 models that use this formulation of RoPE, and not a single one uses the one in the meta-llama repo

See this: https://github.com/meta-llama/llama/blob/main/LICENSE


ArthurZucker commented on June 3, 2024

@rlrs I probably should have specified that yes, this works because we permute the weights. But even if you don't permute them and still use the cos/sin formulation, you have a different implementation: you just do the permutation of q and k on the fly, which costs a lot.
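
As an illustration of what that on-the-fly reordering could look like (a hypothetical sketch, not the transformers code path), the interleaved-to-half-split shuffle would have to run on the q and k activations in every forward pass instead of once on the checkpoint:

import torch

def interleaved_to_half_split(x: torch.Tensor, n_heads: int) -> torch.Tensor:
    # Reorder each head's channels from (r0, i0, r1, i1, ...) to (r0, r1, ..., i0, i1, ...).
    *lead, d = x.shape
    head_dim = d // n_heads
    return (x.view(*lead, n_heads, head_dim // 2, 2)
             .transpose(-1, -2)
             .reshape(*lead, d))

Paying that cost per token and per layer is exactly what permuting the weights once at conversion time avoids.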

Pretty sure it is clarified that you need a conversion script to go from the original model to transformers, but if you want to add this to the readme feel free to do so.

@tianyu-l I don't understand the confusion: if the implementation is different but the produced outputs, meaning the generations, are correct, where is the problem?
Imagine a genius finds a new way to compute matrix-vector products and we use it (cf. FlashAttention), which Meta did not use to train their model; is that bad? Even if the outputs, the logits, and the generations are the same and the generation speed is around 2x faster?
If you use torch.compile, or if you use gpt-fast, you are again not using the same implementation, but the results do match. How is that possible?

So the inference results should be worse on HF's model, and if people start fine-tuning from HF's llama, they should see a much higher loss (at least initially).

this is just completely wrong, and unsourced. When we add a model to transformers, we make sure that we reproduce the results. For the Llama family, the logits match to within 1e-2 (because of parallel training + maybe a bit of RoPE), but the generated outputs match 1-to-1 (generate 1,000,000 tokens greedily and you will get the same results).


ArthurZucker commented on June 3, 2024

cool, then we are on the same page! 🤗

