
lit-llama's People

Contributors

andrei-aksionov, aniketmaurya, awaelchli, borda, carmocca, chris-alexiuk-1, derekjuba-nist, dnhkng, dspinellis, ever2after, gkroiz, gregor-soniox, h4dr1en, joaopalotti, lantiga, laurentmazare, lucas-ventura, lun-4, mentoc3000, mosheber, rasbt, robieta, rubenfricke, t-vi, thequert, timothylimyl, vihangd, waitzkin, williamfalcon, wlsdnen


lit-llama's Issues

AssertionError

$ python3 generate.py --quantize llm.int8 --prompt "Hello, my name is"

Error message (Ubuntu 22.04):

Traceback (most recent call last):
  File "generate.py", line 159, in <module>
    CLI(main)
  File "/home/yongun/.local/lib/python3.8/site-packages/jsonargparse/cli.py", line 82, in CLI
    return _run_component(component, cfg_init)
  File "/home/yongun/.local/lib/python3.8/site-packages/jsonargparse/cli.py", line 138, in _run_component
    return component(**cfg)
  File "generate.py", line 104, in main
    assert checkpoint_path.is_file()
AssertionError

Reproducing alpaca

python scripts/prepare_alpaca.py fails to run if I don't run python setup.py install first.

Traceback (most recent call last):
  File "/home/gregor/experiments/lit-llama/scripts/prepare_alpaca.py", line 9, in <module>
    from lit_llama.tokenizer import Tokenizer
ModuleNotFoundError: No module named 'lit_llama'
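A minimal workaround sketch, assuming the only problem is that the lit_llama package is not importable until it is installed: the script could prepend the repository root to sys.path before the import (the path handling below is my assumption, not code from the repo).

# Hypothetical workaround: make `lit_llama` importable without `python setup.py install`.
import sys
from pathlib import Path

repo_root = Path(__file__).resolve().parent.parent  # assumes this file lives in scripts/
sys.path.insert(0, str(repo_root))

from lit_llama.tokenizer import Tokenizer  # noqa: E402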

TPU

Does this code work on TPU?

Expected is_sm80 to be true, but got false

I tried running the finetuning scripts on a 3090 GPU and got this error:

/home/adrian/repositories/lightning-llama/lit_llama/model.py:43: UserWarning: ComplexHalf support is experimental and many operators don't support it yet. (Triggered internally at ../aten/src/ATen/EmptyTensor.cpp:31.)
  ).to(complex_dtype)
Traceback (most recent call last):
  File "/home/adrian/repositories/lightning-llama/finetune_adapter.py", line 201, in <module>
    main()
  File "/home/adrian/repositories/lightning-llama/finetune_adapter.py", line 67, in main
    train(fabric, model, optimizer, train_data, val_data)
  File "/home/adrian/repositories/lightning-llama/finetune_adapter.py", line 97, in train
    fabric.backward(loss / gradient_accumulation_steps)
  File "/home/adrian/anaconda3/envs/lit-llama/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 365, in backward
    self._precision.backward(tensor, module, *args, **kwargs)
  File "/home/adrian/anaconda3/envs/lit-llama/lib/python3.10/site-packages/lightning/fabric/plugins/precision/amp.py", line 70, in backward
    super().backward(tensor, model, *args, **kwargs)
  File "/home/adrian/anaconda3/envs/lit-llama/lib/python3.10/site-packages/lightning/fabric/plugins/precision/precision.py", line 81, in backward
    tensor.backward(*args, **kwargs)
  File "/home/adrian/anaconda3/envs/lit-llama/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/home/adrian/anaconda3/envs/lit-llama/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Expected is_sm80 to be true, but got false.  (Could this error message be improved?  If so, please report an enhancement request to PyTorch.)

This was on the branch of #100 where I added the EmptyInitOnDevice() context manager. It looks like the conversion to complex_dtype caused problems in the backward.

Both

python finetune_lora.py

and

python finetune_adapter.py

fail with this error.

Precommit hooks

If you like, I can add some pre-commit hooks for automatic linting before making this public.

Politically Kind License Wording

Hi,

This seems like a really wonderful project for which I am thankful.

I personally support the GPL and don’t see it as preventing academic or commercial use at all.

When I read the wording in the readme that the GPL prevents these things, and is a problem to be solved (rather than a solution to problems), I feel pain. I don’t understand why or how these things would be. The function of the GPL is to keep derivative works open source. Do academic institutions need to keep their source code private?

Would you be willing to cite where these opinions stem from, and/or state the expression as opinion rather than fact?

Cannot save checkpoint while train.py on single GPU

I got the following error in utils.py while saving a checkpoint during pre-training on a single GPU. Any hint on how I should fix it? Thanks.

state_dict = model._forward_module.state_dict()                            

NotImplementedError: offload_to_cpu=True and NO_SHARD is not supported yet
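For reference, a possible workaround sketch (not from this repo): it assumes the error comes from gathering a full state dict with offload_to_cpu=True while the single-GPU run uses the NO_SHARD sharding strategy, and that model below stands for the FSDP-wrapped module.

from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import FullStateDictConfig, StateDictType

# Assumption: turning the CPU offload off side-steps the NO_SHARD limitation on a single GPU.
cfg = FullStateDictConfig(offload_to_cpu=False, rank0_only=True)
with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, cfg):
    state_dict = model.state_dict()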

This is a brazen insult to open source community

Meta released LLaMA under GPLv3; you took their code, made some irrelevant changes, claimed it as yours, and relicensed it. You could do the same to other open source contributors: steal their work and rebrand it as yours. As a company, you also ignored the safety issues with LLaMA and just want to promote your framework without putting any safety measures in place. Shame on you.

Typo? "7B require ~26 GB of GPU memory (A100 GPU)."

Is the following a typo, or does the lit-llama implementation require vastly more VRAM than the original implementation? The 7B model fits natively on a single 3090 24 GB GPU in the original LLaMA implementation.

This will run the 7B model and require ~26 GB of GPU memory (A100 GPU).

Convergence of LLaMA-adapter

Dear Sir,

I find that the LLaMA-Adapter implementation does not converge. The loss stays around 0.8-1.0.

I wonder if you can solve this?

And thanks for your attention.

Missing rope_cache for model with lora

Hi, according to #81, the rope_cache argument has been removed from CausalSelfAttention in model.py. I think the same change should also be made in lora.py.

How to finetune with the multi-GPU

Hi, I'm wondering how to change the code for multi-GPU fine-tuning. Currently, I tried
fabric = L.Fabric(devices=4, accelerator="gpu", strategy="ddp")
But I encounter an error about initialization:
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
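For reference, a sketch of the setup with an explicit launch call; my assumption is that fabric.launch() is what initializes the default process group when the script is run directly (as opposed to via lightning run model):

import lightning as L

fabric = L.Fabric(devices=4, accelerator="gpu", strategy="ddp")
fabric.launch()  # assumed to spawn the worker processes and call init_process_group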

AttributeError: module 'lightning' has no attribute 'Fabric'

$ python generate.py --prompt "Hello, my name is"

Traceback (most recent call last):
  File "generate.py", line 148, in <module>
    CLI(main)
  File "/axp/aida/data/platformds/aiservices/conda/envs/llama/lib/python3.7/site-packages/jsonargparse/cli.py", line 82, in CLI
    return _run_component(component, cfg_init)
  File "/axp/aida/data/platformds/aiservices/conda/envs/llama/lib/python3.7/site-packages/jsonargparse/cli.py", line 138, in _run_component
    return component(**cfg)
  File "generate.py", line 104, in main
    fabric = L.Fabric(accelerator=accelerator, devices=1)
AttributeError: module 'lightning' has no attribute 'Fabric'

How to fine tune llama with peft?

I have a dataset. I tried the OpenAI embeddings, but they are not good. I want to fine-tune LLaMA with PEFT on a single consumer GPU.

So, how can I do this?
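For context, the kind of setup I have in mind is roughly the following sketch with the Hugging Face peft library (the checkpoint path and hyperparameters are placeholders, not anything from this repo):

from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

# Placeholder path; substitute whatever LLaMA checkpoint is available locally.
model = AutoModelForCausalLM.from_pretrained("path/to/llama-7b")

# LoRA keeps the base weights frozen and only trains small low-rank adapter matrices,
# which is what makes fine-tuning on a single consumer GPU feasible.
lora_config = LoraConfig(task_type=TaskType.CAUSAL_LM, r=8, lora_alpha=16, lora_dropout=0.05)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()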

Loading the 13B checkpoint

Hi!

I was trying to load the 13B model checkpoint but it seems that there is a mismatch between dimensions:

e.g:

size mismatch for transformer.h.39.mlp.c_proj.weight: copying a param with shape torch.Size([5120, 6912]) from checkpoint, the shape in current model is torch.Size([5120, 13824]).

Error in convert_checkpoint.py when converting 13B weights

Whenever I try to convert the 13B weights (unmodified) sourced from the dalai llama download, the first checkpoint successfully completes; however, the second checkpoint fails to convert.

I am using this command:

python scripts/convert_checkpoint.py \
    --output_dir checkpoints/lit-llama \
    --ckpt_dir dalai/llama/models \
    --tokenizer_path /dalai/llama/models/tokenizer.model \
    --model_size 13B

And receive the following error:

python scripts/convert_checkpoint.py --output_dir checkpoints/lit-llama --ckpt_dir dalai/llama/models --tokenizer_path dalai/llama/models/tokenizer.model --model_size 13B
50%|███████████████████████████████████████████ | 1/2 [00:32<00:32, 32.09s/it]

Killed

Question: about left padding

If the model is padded on the left and it's a causal language model, does that mean that padding tokens will receive attention from the rest of the sequence?

Should there be a mask to prevent tokens from attending to padding?
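To make the question concrete, here is an illustrative sketch (not code from this repo) of a mask that is both causal and blocks attention to padding keys:

import torch

T = 6
is_pad = torch.tensor([True, True, False, False, False, False])  # example: two left-padding positions

causal = torch.tril(torch.ones(T, T, dtype=torch.bool))  # standard causal mask
attn_mask = causal & ~is_pad.view(1, T)                  # additionally block attention to pad keys

# attn_mask[i, j] is True where query position i may attend to key position j;
# a boolean mask like this can be passed to F.scaled_dot_product_attention(..., attn_mask=attn_mask).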

Readme wording

Readme wording suggestion:

Before:

  • Simple, single-file, no boilerplate
  • Numerically equivalent to the original model
  • Optimized to run on consumer hardware or at scale
  • Open-source no strings attached

Suggested:

  • Simple, single-file implementation without boilerplate
  • Numerically equivalent to the original model
  • Optimized to run on consumer hardware or at scale
  • Open-source and no strings attached

Saving FSDP Model

I was trying to change the fine-tuning to use FSDP training, but currently there is no way to save the checkpoint.

Saving in train.py is commented out, and the saving in finetune.pt only supports LoRA.

I am trying to compare full fine-tuning with LoRA. For full fine-tuning using FSDP, my checkpoint is saved with the embedding layer and lm_head in a single flat_param, which prevents me from loading the checkpoint afterwards. How can I recover the original model architecture and load that checkpoint onto a single GPU for inference?

Vocabulary size when training the tokenizer

As far as I can tell, this will likely use the sentencepiece defaults (i.e., a vocabulary size of 8000 tokens), whereas LLaMA was supposedly trained using a vocabulary size of 32000?

@staticmethod
def train(input: str, destination: str) -> None:
    model_prefix = os.path.join(destination, "tokenizer")
    SentencePieceTrainer.Train(input=input, model_prefix=model_prefix)

My suggestion (see the sketch below) is to:

  • add a parameter to set the vocabulary size
  • set the default value to 32000 to match LLaMA
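A minimal sketch of that change, written as a standalone function for clarity (the vocab_size keyword follows the sentencepiece trainer API; 32000 is the value mentioned above):

import os
from sentencepiece import SentencePieceTrainer

def train(input: str, destination: str, vocab_size: int = 32000) -> None:
    # Expose the vocabulary size instead of relying on the sentencepiece default (8000);
    # 32000 matches the size LLaMA reportedly used.
    model_prefix = os.path.join(destination, "tokenizer")
    SentencePieceTrainer.Train(input=input, model_prefix=model_prefix, vocab_size=vocab_size)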

How to use deepspeed zero-3-offload strategy correctly? (Parameters Duplication Issue)

Hi, I wonder how to write the code for using the deepspeed zero-3-offload strategy correctly. Currently, my code looks like:

from lightning.fabric.strategies import DeepSpeedStrategy
deep_speed = DeepSpeedStrategy(
    stage=3,
    offload_optimizer=True,
    offload_parameters=True,
)
fabric = L.Fabric(accelerator="gpu", devices=num_devices, strategy=deep_speed)

However, it seems the parameters are duplicated across all GPUs. I attached a screenshot showing the GPU utilization after model, optimizer = fabric.setup(model, optimizer):

[screenshot: GPU utilization]

According to my understanding, the parameters should be distributed on different devices, right?

Use FlashAttention with LLaMA-Adapter

Looking at the LLaMA-Adapter implementation, at https://github.com/Lightning-AI/lit-llama/blob/main/lit_llama/adapter.py#L91

            # inefficient attention because we need to insert the gate for the adaption in the middle
            aT = prefix.size(1)
            _, ak, av = self.c_attn(prefix).split(self.n_embd, dim=2)
            ak = ak.view(1, aT, self.n_head, head_size).repeat(B, 1, 1, 1).transpose(1, 2)
            av = av.view(1, aT, self.n_head, head_size).repeat(B, 1, 1, 1).transpose(1, 2)

            ascores = torch.matmul(q, ak.transpose(2, 3)) / math.sqrt(self.n_embd)
            ascores = self.gating_factor * F.softmax(ascores.float(), dim=-1).type_as(q)
            y = y + torch.matmul(ascores, av)

it looks to me like we could replace the above with

            a_mask = torch.ones(aT, aT, dtype=torch.bool)
            ay = F.scaled_dot_product_attention(q, ak, av, attn_mask=a_mask, dropout_p=0.0, is_causal=False)
            y = y + self.gating_factor * ay

since

(gating * softmax(ascores)) @ av

is equivalent to

gating * (softmax(ascores) @ av)
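A quick numerical check of that equivalence (illustrative shapes and values only):

import torch
import torch.nn.functional as F

B, n_head, T, aT, head_size = 2, 4, 8, 5, 16
ascores = torch.randn(B, n_head, T, aT)       # stand-in for the raw adapter attention scores
av = torch.randn(B, n_head, aT, head_size)
gating = 0.3

lhs = torch.matmul(gating * F.softmax(ascores, dim=-1), av)
rhs = gating * torch.matmul(F.softmax(ascores, dim=-1), av)
print(torch.allclose(lhs, rhs, atol=1e-6))    # True: scalar gating commutes with the matmul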

Model compilation support

With FSDP, the code currently cannot be run.

If you try to add model compilation to the training like:

...
fabric = L.Fabric(accelerator="cuda", devices=8, precision="bf16-mixed", strategy=strategy)
fabric.launch()
...

model = fabric.setup_module(model)
# compile() should go after the wrapping as per https://github.com/huggingface/transformers/commit/fb0a38b4f275727d6228fb4a78c15c6dd8480e91
# (though it does not work either if it goes before setup_module(); you get the same issue, see below)
model = torch.compile(model)

optimizer = torch.optim.AdamW(model.parameters(), ...)
optimizer = fabric.setup_optimizers(optimizer)
...

train(model, ...)

and try it via:

lightning run model --accelerator=cuda --devices=8 train.py ...

you'll get:

  File ".../.venv/lib/python3.8/site-packages/torch/_dynamo/variables/builder.py", line 172, in __call__
    return self._wrap(value).clone(**self.options())
  File ".../.venv/lib/python3.8/site-packages/torch/_dynamo/variables/builder.py", line 345, in _wrap
    assert getattr(
AssertionError: Dynamo only supports FSDP with use_orig_params=True

If I pass use_orig_params=True into the FSDPStrategy() constructor, you get:

ValueError: The optimizer does not seem to reference any FSDP parameters. HINT: Make sure to create the optimizer after setting up the model.
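For clarity, the constructor variant I mean is roughly this (a sketch; it assumes FSDPStrategy forwards extra keyword arguments to the underlying torch FSDP wrapper):

from lightning.fabric.strategies import FSDPStrategy

strategy = FSDPStrategy(use_orig_params=True)  # assumption: kwarg is passed through to torch FSDP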

So one can then remove optimizer = fabric.setup_optimizers(optimizer), as it is a no-op for FSDP anyway, but even in this case I see:

from user code:
   File ".../.venv/lib/python3.8/site-packages/lightning_utilities/core/apply_func.py", line 75, in apply_to_collection
    is_namedtuple_ = is_namedtuple(data)

Also, I was not able to find any tests of a compiled model with FSDP, neither here nor here.

I wonder if anyone has been able to successfully launch a compiled model in an FSDP regime? Thanks a lot for the help!


P.S.: if I try to run similar code using the Hugging Face Trainer, I run into the exact same AssertionError: Dynamo only supports FSDP with use_orig_params=True :)

Error when running python3 generate.py --quantize true: undefined symbol: cget_col_row_stats

u20@u20:~/lit-llama$ python3 generate.py --quantize true --prompt "Hello, my name is"
/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (1.26.15) or chardet (3.0.4) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
/home/u20/.local/lib/python3.8/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory
  warn(f"Failed to load image Python extension: {e}")
Loading model ...
Traceback (most recent call last):
  File "generate.py", line 147, in <module>
    CLI(main)
  File "/home/u20/.local/lib/python3.8/site-packages/jsonargparse/cli.py", line 82, in CLI
    return _run_component(component, cfg_init)
  File "/home/u20/.local/lib/python3.8/site-packages/jsonargparse/cli.py", line 138, in _run_component
    return component(**cfg)
  File "generate.py", line 108, in main
    model = LLaMA.from_name(model_size)
  File "/home/u20/lit-llama/lit_llama/model.py", line 223, in from_name
    return cls(LLaMAConfig.from_name(name))
  File "/home/u20/lit-llama/lit_llama/model.py", line 179, in __init__
    self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
  File "/home/u20/lit-llama/lit_llama/quantization.py", line 31, in __init__
    self._quantize_weight(self.weight.data)
  File "/home/u20/lit-llama/lit_llama/quantization.py", line 48, in _quantize_weight
    CB, CBt, SCB, SCBt, coo_tensorB = bnb.functional.double_quant(B)
  File "/home/u20/.local/lib/python3.8/site-packages/bitsandbytes/functional.py", line 1616, in double_quant
    row_stats, col_stats, nnz_row_ptr = get_colrow_absmax(
  File "/home/u20/.local/lib/python3.8/site-packages/bitsandbytes/functional.py", line 1505, in get_colrow_absmax
    lib.cget_col_row_stats(ptrA, ptrRowStats, ptrColStats, ptrNnzrows, ct.c_float(threshold), rows, cols)
  File "/usr/lib/python3.8/ctypes/__init__.py", line 386, in __getattr__
    func = self.__getitem__(name)
  File "/usr/lib/python3.8/ctypes/__init__.py", line 391, in __getitem__
    func = self._FuncPtr((name_or_ordinal, self))
AttributeError: /home/u20/.local/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cget_col_row_stats

Step method?

What is the purpose of the step method? It doesn't seem to be used anywhere. Can it be a) removed for the sake of simplicity, or b) used wherever the loss is currently calculated with F.cross_entropy elsewhere?

def step(self, idx: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    logits = self(idx)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)
    return loss

Command python generate.py --quantize true --prompt "Hello, my name is" Not working

Getting the error below while executing the given command.

Loading model ...
Traceback (most recent call last):
  File "generate.py", line 147, in <module>
    CLI(main)
  File "/mnt/hdd1/rajeevy/anaconda3/envs/lama/lib/python3.8/site-packages/jsonargparse/cli.py", line 82, in CLI
    return _run_component(component, cfg_init)
  File "/mnt/hdd1/rajeevy/anaconda3/envs/lama/lib/python3.8/site-packages/jsonargparse/cli.py", line 138, in _run_component
    return component(**cfg)
  File "generate.py", line 110, in main
    model.load_state_dict(checkpoint)
  File "/mnt/hdd1/rajeevy/anaconda3/envs/lama/lib/python3.8/site-packages/torch/nn/modules/module.py", line 2027, in load_state_dict
    load(self, state_dict)
  File "/mnt/hdd1/rajeevy/anaconda3/envs/lama/lib/python3.8/site-packages/torch/nn/modules/module.py", line 2015, in load
    load(child, child_state_dict, child_prefix)
  File "/mnt/hdd1/rajeevy/anaconda3/envs/lama/lib/python3.8/site-packages/torch/nn/modules/module.py", line 2009, in load
    module._load_from_state_dict(
  File "/mnt/hdd1/rajeevy/nlp/Lama/lit-llama/lit_llama/quantization.py", line 35, in _load_from_state_dict
    weight_key = next(name for name in local_state_dict.keys() if name.endswith("weight"))
StopIteration

src folder

For better readability etc., do we want to reorganize this with the code in a src folder or a llama subfolder? Since there's already a setup.py, it would probably be more organized and readable this way.

Apache - 2.0 - Commercial License

Hey team,
Thanks for releasing the code and repo under Apache-2.0

I'm still wondering, though, how this would be truly open-source and commercialisable if we're still loading the official LLaMA weights (under the GPL license) and converting them into Lit-LLaMA weights?

Or does it only mean that if we train from scratch using this code instead, without using the official Llama weights, then the end model could be used for commercial purposes?

Please help clarify.
TIA.
