
Comments (7)

rlrs commented on June 3, 2024

@ArthurZucker I can expand on this a bit, because there is an actual difference in the implementations. The difference is whether you treat consecutive entries of the embeddings as (real, imaginary) pairs, or treat the first half as real and the second half as imaginary. I think this has to mean that weights trained with one version must be incompatible with the other version?

This has come up over in torchtitan (as linked in the OP of this issue), because I've been trying to load the HF llama weights and use them with the torchtitan codebase. However, these weights only work if I use the first-half/second-half implementation, such as the one used in HF transformers.
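
For concreteness, here is a minimal sketch of the two conventions (my own illustration, not code from either repo; it assumes cos and sin each hold head_dim // 2 rotation angles for the position being rotated):

import torch

def rope_interleaved(x, cos, sin):
    # Meta-style: adjacent channels (x0, x1), (x2, x3), ... form the (real, imag) pairs.
    x_real, x_imag = x[..., 0::2], x[..., 1::2]
    out = torch.stack((x_real * cos - x_imag * sin,
                       x_real * sin + x_imag * cos), dim=-1)
    return out.flatten(-2)  # re-interleave the rotated pairs

def rope_half_split(x, cos, sin):
    # HF-style: the first half of the head dim is "real", the second half is "imaginary".
    half = x.shape[-1] // 2
    x_real, x_imag = x[..., :half], x[..., half:]
    return torch.cat((x_real * cos - x_imag * sin,
                      x_real * sin + x_imag * cos), dim=-1)

Given the same input tensor, the two functions rotate different channel pairs together, which is why weights laid out for one convention cannot be dropped into the other unchanged.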


rlrs commented on June 3, 2024

Okay, so this has been cleared up, I think. The HF weights are indeed different:

# permute for sliced rotary
# reorders each head's rows from interleaved (r0, i0, r1, i1, ...) to half-split (r0, r1, ..., i0, i1, ...)
def permute(w, n_heads, dim1=dim, dim2=dim):
    return w.view(n_heads, dim1 // n_heads // 2, 2, dim2).transpose(1, 2).reshape(dim1, dim2)

Glad to find out that it is a simple weight permutation. This should probably be clarified somewhere.
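
To convince yourself this really is only a layout change, here is a small sanity check (my own sketch, not from the conversion script; it reuses rope_interleaved and rope_half_split from the snippet above, and since the quoted permute is a fragment whose default arguments reference a global dim, define dim before pasting it, or pass dim1/dim2 explicitly as done here):

import torch

torch.manual_seed(0)
n_heads, head_dim = 4, 8
dim = n_heads * head_dim

w = torch.randn(dim, dim)          # stand-in for a wq/wk projection weight
x = torch.randn(1, dim)            # a single token's hidden state
theta = torch.rand(head_dim // 2)  # rotation angles for one position
cos, sin = theta.cos(), theta.sin()

q_meta = (x @ w.T).view(n_heads, head_dim)                                      # interleaved layout
q_hf = (x @ permute(w, n_heads, dim1=dim, dim2=dim).T).view(n_heads, head_dim)  # half-split layout

out_meta = rope_interleaved(q_meta, cos, sin)
out_hf = rope_half_split(q_hf, cos, sin)

# Same rotated values, just ordered differently within each head.
reordered = out_meta.view(n_heads, head_dim // 2, 2).transpose(1, 2).reshape(n_heads, head_dim)
assert torch.allclose(reordered, out_hf, atol=1e-5)

Since the same reordering is applied to both q and k, the dot products in attention are unchanged, which is why the permuted checkpoint produces the same outputs.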


tianyu-l commented on June 3, 2024

@ArthurZucker

As long as the outputs produced are the same, this should not matter, no?

Being able to load doesn’t mean the implementations are the same. The difference in the implementations will cause the output to be different, given the same input.

I'm a bit confused here... Please correct me if I'm wrong.
Assuming the weights are from Meta and are trained using Meta's implementation:
Then the consequence is that the weights are optimized for Meta's llama, although still usable on HF's llama. So the inference results should be worse on HF's model, and if people start fine-tuning from HF's llama, they should see a much higher loss (at least initially).

cc: @rlrs


tianyu-l commented on June 3, 2024

Thanks for the clarification! When I said different implementations, I meant different implementations that lead to different results on the same input. (I thought this was very clear from the context, but it seems not, my bad.)

I agree that as long as the outputs match, the implementation details don't matter much.


ArthurZucker commented on June 3, 2024

The llama2 code is subject to a code license. No one is allowed to use the original code, which is why we have a different version.
As long as the outputs produced are the same, this should not matter, no?

  • TGI uses a similar computation of the cos and sin
  • pretty sure vLLM does as well
  • transformers has more than 15 models that use this formulation of RoPE, and not a single one uses the one in the meta-llama repo

See this: https://github.com/meta-llama/llama/blob/main/LICENSE


ArthurZucker commented on June 3, 2024

@rlrs I probably should have specified that yes, this works because we permute the weights. But even if you don't permute them and still use the cos/sin formulation, you have a different implementation: you just do the permutation of q and k on the fly, which costs a lot.
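
As an illustration of what that on-the-fly reordering could look like (a hypothetical sketch, not the transformers code path), the interleaved-to-half-split shuffle would have to run on the q and k activations in every forward pass instead of once on the checkpoint:

import torch

def interleaved_to_half_split(x: torch.Tensor, n_heads: int) -> torch.Tensor:
    # Reorder each head's channels from (r0, i0, r1, i1, ...) to (r0, r1, ..., i0, i1, ...).
    *lead, d = x.shape
    head_dim = d // n_heads
    return (x.view(*lead, n_heads, head_dim // 2, 2)
             .transpose(-1, -2)
             .reshape(*lead, d))

Paying that cost per token and per layer is exactly what permuting the weights once at conversion time avoids.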

Pretty sure it is clarified that you need a conversion script to go from the original model to transformers, but if you want to add this to the readme feel free to do so.

@tianyu-l I don't understand the confusion: if the implementation is different but the produced outputs, meaning the generations, are correct, where is the problem?
Imagine a genius finds a new way to compute matrix-vector products and we use it (cf. FlashAttention), which Meta did not use to train their model; is that bad? Even if the outputs, the logits, and the generations are the same and the generation speed is around 2x faster?
If you use torch.compile, or if you use gpt-fast, you are again not using the same implementation, but the results do match. How is that possible?

So the inference results should be worse on HF's model, and if people start fine-tuning from HF's llama, they should see a much higher loss (at least initially).

this is just completely wrong, and unsourced. When we add a model to transformers, we make sure that we reproduce the results. For the Llama family, the logits match to within 1e-2 (because of parallel training + maybe a bit of RoPE), but the generated outputs match 1-to-1 (generate 1,000,000 tokens greedily and you will get the same results).


ArthurZucker commented on June 3, 2024

cool, then we are on the same page! 🤗

