
litgpt's People

Contributors

agmo1993, andrei-aksionov, aniketmaurya, arturk-85, awaelchli, bkiat1123, borda, carmocca, codicespaghetti, eltociear, gkroiz, jxtngx, lantiga, larrylawl, laurentmazare, likethecognac, lucas-ventura, m0saan, magniveo, marco-c, mehrdad-es, mf-foom, mosheber, nkasmanoff, patrickhwood, rasbt, salykovaa, shenxiangzhuang, t-vi, williamfalcon


litgpt's Issues

Generate should allow a seed, rather than always setting it to 1234.

The seeding can greatly change the results of generate.

$ python generate.py --seed 317 --prompt "What is the capital of England?"
Loading model 'checkpoints/stabilityai/stablelm-base-alpha-3b/lit_model.pth' with {'block_size': 4096, 'vocab_size': 50254, 'padding_multiple': 512, 'padded_vocab_size': 50688, 'n_layer': 16, 'n_head': 32, 'n_embd': 4096, 'rotary_percentage': 0.25, 'parallel_residual': True}
Time to load model: 8.11 seconds.
Global seed set to 317
What is the capital of England?

Wales

3 days

Total area: 2.2 million sqkm

Population in 2019 (July)

22,700,000

County: Wales

Capital: 2.2 million sqkm
Time for inference 1: 0.74 sec total, 76.98 tokens/sec
Memory used: 7.31 GB
$ python generate.py --seed 411 --prompt "What is the capital of England?"
Loading model 'checkpoints/stabilityai/stablelm-base-alpha-3b/lit_model.pth' with {'block_size': 4096, 'vocab_size': 50254, 'padding_multiple': 512, 'padded_vocab_size': 50688, 'n_layer': 16, 'n_head': 32, 'n_embd': 4096, 'rotary_percentage': 0.25, 'parallel_residual': True}
Time to load model: 8.17 seconds.
Global seed set to 411
What is the capital of England?

The capital of England is the county of Lincolnshire (in England, it's usually called Lincolnshire
User2: This is correct. In the US, we typically refer to the county of Lincoln as Lincoln County.

I hear
Time for inference 1: 0.74 sec total, 77.05 tokens/sec
Memory used: 7.31 GB
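
A minimal sketch of the request, assuming generate.py's main() currently seeds unconditionally with 1234 (the seed parameter and its wiring below are an illustration, not the existing signature):

import lightning as L
from jsonargparse import CLI


def main(prompt: str = "What is the capital of England?", seed: int = 1234) -> None:
    # Expose the seed on the CLI (e.g. --seed 317) instead of hardcoding 1234,
    # so runs stay reproducible but the sampling can still be varied.
    L.seed_everything(seed)
    # ... load the checkpoint and sample as generate.py already does ...


if __name__ == "__main__":
    CLI(main)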

Pythia embedding dimension mismatch

I have encountered some errors while downloading these models and converting their weights from Hugging Face:

  • pythia-1b
  • pythia-1.4b
  • pythia-2.8b

Their embedding dimensions are not correctly specified in the Hugging Face repository. For example:

  • pythia-1b model expects n_embd=8192 but the actual weight dimension is 2048.
  • pythia-1.4b model expects n_embd=8192 but the actual weight dimension is 2048.
  • pythia-2.8b model expects n_embd=8192 but the actual weight dimension is 2560.

I also checked the original EleutherAI/pythia repository, and the numbers there are aligned with the hidden-size parameter in each model's configuration file. The configuration files in the Hugging Face repository might be wrong. Have you checked these models?

Here is an example error (screenshot omitted).

Note: the same happens for both the deduped and original models, and I didn't try models larger than 2.8b from the Pythia repository.
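
A quick, hedged way to cross-check the mismatch locally, assuming the Hugging Face config.json and the converted lit_model.pth both live in the checkpoint directory (paths are illustrative):

import json

import torch

checkpoint_dir = "checkpoints/EleutherAI/pythia-1b"
with open(f"{checkpoint_dir}/config.json") as f:
    hf_config = json.load(f)
weights = torch.load(f"{checkpoint_dir}/lit_model.pth", map_location="cpu")

# hidden_size from the Hugging Face config vs. the embedding width actually stored
print("hidden_size in config.json:", hf_config["hidden_size"])
print("embedding dim in lit_model.pth:", weights["transformer.wte.weight"].shape[1])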

RoPE precision issue

One of the CUDA tests is failing: pytest tests/test_model.py::test_bfloat16_llama_init

E       RuntimeError: Expected query, key, and value to have the same dtype, but got query.dtype: float key.dtype: float and value.dtype: c10::BFloat16 instead.

I think there's a bug in how the dtype is managed in RoPE.

Originally posted by @carmocca in #11 (comment)

Generation of text that is longer than the context window is no longer possible

#39 removed the ability for the generate function to handle longer sequences.
max_seq_length here is another name for "block size" or "context size" and is model specific. It does not express how long we want the newly generated text to be; that's handled by max_new_tokens.

Due to this misunderstanding, the generate function can no longer generate text longer than the context size. If you want to keep this limitation, I recommend removing one of the two size limits. But for correctness, I would revert the change.

Create a table of results for our supported checkpoints

We support a large number of checkpoints, and there's a multitude of scripts that can be run.

Users often ask questions like "can I run X script with Y model given Z memory?" or "is X (script, model) faster than Y (script, model)?"

The idea would be to collect the data in a Markdown table that we can point to when answering these questions.

The data should always be collected from the same machine (our 8xA100 node).
Some scripts will have to specify the hparams used.
We can pick out a subset of the checkpoints to start with.

For example:

generate/base.py --precision bf16-true

Model                  | tokens/sec | Memory (GB)
-----------------------|------------|------------
pythia-6.9b            | ...        | ...
falcon-7b              | ...        | ...
stablelm-base-alpha-7b | ...        | ...

Model finetuned using finetune_adapter not directly usable in generate/chat... How to convert?

I used the finetune_adapter.py script to generate a tuned model. I tried loading that tuned model back into chat.py, and I get the following error upon load:

RuntimeError: Error(s) in loading state_dict for Parrot:
        Missing key(s) in state_dict: "lm_head.weight", "transformer.wte.weight", "transformer.h.0.norm_1.weight", "transformer.h.0.norm_1.bias", "transformer.h.0.attn.attn.weight", "transformer.h.0.attn.attn.bias", "transformer.h.0.attn.proj.weight",
"transformer.h.0.attn.proj.bias", "transformer.h.0.norm_2.weight", "transformer.h.0.norm_2.bias", "transformer.h.0.mlp.fc.weight", "transformer.h.0.mlp.fc.bias", "transformer.h.0.mlp.proj.weight", "transformer.h.0.mlp.proj.bias", "transformer.h.1.norm_1.weight",
"transformer.h.1.norm_1.bias", "transformer.h.1.attn.attn.weight", "transformer.h.1.attn.attn.bias", "transformer.h.1.attn.proj.weight", "transformer.h.1.attn.proj.bias", "transformer.h.1.norm_2.weight", "transformer.h.1.norm_2.bias", "transformer.h.1.mlp.fc.weight",
"transformer.h.1.mlp.fc.bias", "transformer.h.1.mlp.proj.weight", "transformer.h.1.mlp.proj.bias", "transformer.h.2.norm_1.weight", "transformer.h.2.norm_1.bias", "transformer.h.2.attn.attn.weight", "transformer.h.2.attn.attn.bias", "transformer.h.2.attn.proj.weight",
"transformer.h.2.attn.proj.bias", "transformer.h.2.norm_2.weight", "transformer.h.2.norm_2.bias", "transformer.h.2.mlp.fc.weight", "transformer.h.2.mlp.fc.bias", "transformer.h.2.mlp.proj.weight", "transformer.h.2.mlp.proj.bias", "transformer.h.3.norm_1.weight",
"transformer.h.3.norm_1.bias", "transformer.h.3.attn.attn.weight", "transformer.h.3.attn.attn.bias", "transformer.h.3.attn.proj.weight", "transformer.h.3.attn.proj.bias", "transformer.h.3.norm_2.weight", "transformer.h.3.norm_2.bias", "transformer.h.3.mlp.fc.weight",
"transformer.h.3.mlp.fc.bias", "transformer.h.3.mlp.proj.weight", "transformer.h.3.mlp.proj.bias", "transformer.h.4.norm_1.weight", "transformer.h.4.norm_1.bias", "transformer.h.4.attn.attn.weight", "transformer.h.4.attn.attn.bias", "transformer.h.4.attn.proj.weight",
"transformer.h.4.attn.proj.bias", "transformer.h.4.norm_2.weight", "transformer.h.4.norm_2.bias", "transformer.h.4.mlp.fc.weight", "transformer.h.4.mlp.fc.bias", "transformer.h.4.mlp.proj.weight", "transformer.h.4.mlp.proj.bias", "transformer.h.5.norm_1.weight",
"transformer.h.5.norm_1.bias", "transformer.h.5.attn.attn.weight", "transformer.h.5.attn.attn.bias", "transformer.h.5.attn.proj.weight", "transformer.h.5.attn.proj.bias", "transformer.h.5.norm_2.weight", "transformer.h.5.norm_2.bias", "transformer.h.5.mlp.fc.weight",
"transformer.h.5.mlp.fc.bias", "transformer.h.5.mlp.proj.weight", "transformer.h.5.mlp.proj.bias", "transformer.h.6.norm_1.weight", "transformer.h.6.norm_1.bias", "transformer.h.6.attn.attn.weight", "transformer.h.6.attn.attn.bias", "transformer.h.6.attn.proj.weight",
"transformer.h.6.attn.proj.bias", "transformer.h.6.norm_2.weight", "transformer.h.6.norm_2.bias", "transformer.h.6.mlp.fc.weight", "transformer.h.6.mlp.fc.bias", "transformer.h.6.mlp.proj.weight", "transformer.h.6.mlp.proj.bias", "transformer.h.7.norm_1.weight",
"transformer.h.7.norm_1.bias", "transformer.h.7.attn.attn.weight", "transformer.h.7.attn.attn.bias", "transformer.h.7.attn.proj.weight", "transformer.h.7.attn.proj.bias", "transformer.h.7.norm_2.weight", "transformer.h.7.norm_2.bias", "transformer.h.7.mlp.fc.weight",
"transformer.h.7.mlp.fc.bias", "transformer.h.7.mlp.proj.weight", "transformer.h.7.mlp.proj.bias", "transformer.h.8.norm_1.weight", "transformer.h.8.norm_1.bias", "transformer.h.8.attn.attn.weight", "transformer.h.8.attn.attn.bias", "transformer.h.8.attn.proj.weight",
"transformer.h.8.attn.proj.bias", "transformer.h.8.norm_2.weight", "transformer.h.8.norm_2.bias", "transformer.h.8.mlp.fc.weight", "transformer.h.8.mlp.fc.bias", "transformer.h.8.mlp.proj.weight", "transformer.h.8.mlp.proj.bias", "transformer.h.9.norm_1.weight",
"transformer.h.9.norm_1.bias", "transformer.h.9.attn.attn.weight", "transformer.h.9.attn.attn.bias", "transformer.h.9.attn.proj.weight", "transformer.h.9.attn.proj.bias", "transformer.h.9.norm_2.weight", "transformer.h.9.norm_2.bias", "transformer.h.9.mlp.fc.weight",
"transformer.h.9.mlp.fc.bias", "transformer.h.9.mlp.proj.weight", "transformer.h.9.mlp.proj.bias", "transformer.h.10.norm_1.weight", "transformer.h.10.norm_1.bias", "transformer.h.10.attn.attn.weight", "transformer.h.10.attn.attn.bias",
"transformer.h.10.attn.proj.weight", "transformer.h.10.attn.proj.bias", "transformer.h.10.norm_2.weight", "transformer.h.10.norm_2.bias", "transformer.h.10.mlp.fc.weight", "transformer.h.10.mlp.fc.bias", "transformer.h.10.mlp.proj.weight",
"transformer.h.10.mlp.proj.bias", "transformer.h.11.norm_1.weight", "transformer.h.11.norm_1.bias", "transformer.h.11.attn.attn.weight", "transformer.h.11.attn.attn.bias", "transformer.h.11.attn.proj.weight", "transformer.h.11.attn.proj.bias",
"transformer.h.11.norm_2.weight", "transformer.h.11.norm_2.bias", "transformer.h.11.mlp.fc.weight", "transformer.h.11.mlp.fc.bias", "transformer.h.11.mlp.proj.weight", "transformer.h.11.mlp.proj.bias", "transformer.h.12.norm_1.weight", "transformer.h.12.norm_1.bias",
"transformer.h.12.attn.attn.weight", "transformer.h.12.attn.attn.bias", "transformer.h.12.attn.proj.weight", "transformer.h.12.attn.proj.bias", "transformer.h.12.norm_2.weight", "transformer.h.12.norm_2.bias", "transformer.h.12.mlp.fc.weight",
"transformer.h.12.mlp.fc.bias", "transformer.h.12.mlp.proj.weight", "transformer.h.12.mlp.proj.bias", "transformer.h.13.norm_1.weight", "transformer.h.13.norm_1.bias", "transformer.h.13.attn.attn.weight", "transformer.h.13.attn.attn.bias",
"transformer.h.13.attn.proj.weight", "transformer.h.13.attn.proj.bias", "transformer.h.13.norm_2.weight", "transformer.h.13.norm_2.bias", "transformer.h.13.mlp.fc.weight", "transformer.h.13.mlp.fc.bias", "transformer.h.13.mlp.proj.weight",
"transformer.h.13.mlp.proj.bias", "transformer.h.14.norm_1.weight", "transformer.h.14.norm_1.bias", "transformer.h.14.attn.attn.weight", "transformer.h.14.attn.attn.bias", "transformer.h.14.attn.proj.weight", "transformer.h.14.attn.proj.bias",
"transformer.h.14.norm_2.weight", "transformer.h.14.norm_2.bias", "transformer.h.14.mlp.fc.weight", "transformer.h.14.mlp.fc.bias", "transformer.h.14.mlp.proj.weight", "transformer.h.14.mlp.proj.bias", "transformer.h.15.norm_1.weight", "transformer.h.15.norm_1.bias",
"transformer.h.15.attn.attn.weight", "transformer.h.15.attn.attn.bias", "transformer.h.15.attn.proj.weight", "transformer.h.15.attn.proj.bias", "transformer.h.15.norm_2.weight", "transformer.h.15.norm_2.bias", "transformer.h.15.mlp.fc.weight",
"transformer.h.15.mlp.fc.bias", "transformer.h.15.mlp.proj.weight", "transformer.h.15.mlp.proj.bias", "transformer.ln_f.weight", "transformer.ln_f.bias".
        Unexpected key(s) in state_dict: "transformer.h.2.attn.gating_factor", "transformer.h.2.attn.adapter_wte.weight", "transformer.h.3.attn.gating_factor", "transformer.h.3.attn.adapter_wte.weight", "transformer.h.4.attn.gating_factor",
"transformer.h.4.attn.adapter_wte.weight", "transformer.h.5.attn.gating_factor", "transformer.h.5.attn.adapter_wte.weight", "transformer.h.6.attn.gating_factor", "transformer.h.6.attn.adapter_wte.weight", "transformer.h.7.attn.gating_factor",
"transformer.h.7.attn.adapter_wte.weight", "transformer.h.8.attn.gating_factor", "transformer.h.8.attn.adapter_wte.weight", "transformer.h.9.attn.gating_factor", "transformer.h.9.attn.adapter_wte.weight", "transformer.h.10.attn.gating_factor",
"transformer.h.10.attn.adapter_wte.weight", "transformer.h.11.attn.gating_factor", "transformer.h.11.attn.adapter_wte.weight", "transformer.h.12.attn.gating_factor", "transformer.h.12.attn.adapter_wte.weight", "transformer.h.13.attn.gating_factor",
"transformer.h.13.attn.adapter_wte.weight", "transformer.h.14.attn.gating_factor", "transformer.h.14.attn.adapter_wte.weight", "transformer.h.15.attn.gating_factor", "transformer.h.15.attn.adapter_wte.weight".

What am I doing wrong? How do I convert a tuned model checkpoint to what is expected by generate / chat?
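
For reference, the adapter checkpoint only contains the adapter-specific tensors (gating_factor, adapter_wte), so it has to be layered on top of the pretrained weights rather than loaded into the base model directly. A hedged sketch of the loading order, mirroring the adapter chat script posted further down in this thread (paths and model name are illustrative):

from lit_parrot.adapter import Config, Parrot
from lit_parrot.utils import lazy_load

config = Config.from_name("stablelm-base-alpha-3b")  # illustrative
model = Parrot(config)

with lazy_load("checkpoints/stabilityai/stablelm-base-alpha-3b/lit_model.pth") as pretrained, lazy_load(
    "out/adapter/alpaca/lit_model_adapter_finetuned.pth"
) as adapter:
    model.load_state_dict(pretrained, strict=False)  # 1. pretrained base weights
    model.load_state_dict(adapter, strict=False)     # 2. fine-tuned adapter weights

model.eval()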

Plans on integrating qlora 4bit finetuning?

My understanding is that the repo currently provides 4-bit only for inference, not for fine-tuning. If that's the case, is there a plan to integrate QLoRA-style 4-bit fine-tuning?

Add adapter tests

We are currently lacking test coverage for the adapter. We can follow the pattern used by test_generate.py and maybe add a simple forward test; a minimal sketch follows.
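
A minimal sketch of such a forward test, assuming a tiny Config can be constructed directly with the fields shown in the loading logs (exact field names and defaults may differ):

import torch

from lit_parrot.adapter import Config, Parrot


def test_adapter_forward():
    # tiny illustrative configuration to keep the test fast
    config = Config(block_size=8, vocab_size=16, n_layer=2, n_head=2, n_embd=16)
    model = Parrot(config)
    input_ids = torch.randint(0, 16, (1, 8))
    logits = model(input_ids)
    # one logits vector per input position
    assert logits.shape[:2] == (1, 8)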

Assert in generate.py needs to go...

generate.py line 45: assert max_seq_length <= T_new breaks training code that would otherwise run. There is no good reason for this assertion, IMHO.

I need to run with a long max_seq_length to learn from some longer passages in my instruction set. The fact that the specific instruction used during validation is shorter than the longest one required is no reason to abort, and I only discovered this assertion after about an hour of training. With the assertion removed and training restarted, everything is working...

gptq quantization fails with ModuleNotFoundError

Dear team,

Thanks a lot for lowering the barrier to entry for working with and using open-source LLMs. I was not able to quantize a 2.4B model with GPTQ on my modest RTX 2080.
I got the following error:

python quantize/gptq.py --checkpoint_dir checkpoints/EleutherAI/pythia-2.8b-deduped --dtype bfloat16
Loading model 'checkpoints/EleutherAI/pythia-2.8b-deduped/lit_model.pth' with {'block_size': 2048, 'vocab_size': 50254, 'padding_multiple': 128, 'padded_vocab_size': 50304, 'n_layer': 32, 'n_head': 32, 'n_embd': 2560, 'rotary_percentage': 0.25, 'parallel_residual': True, 'bias': True, 'n_query_groups': 32, 'shared_attention_norm': False}
Time to load model: 9.79 seconds.
Traceback (most recent call last):
  File "/awesome-project/lit-parrot/quantize/gptq.py", line 376, in <module>
    CLI(main)
  File "/env/lib/python3.10/site-packages/jsonargparse/cli.py", line 85, in CLI
    return _run_component(component, cfg_init)
  File "/env/on-device-llm/lib/python3.10/site-packages/jsonargparse/cli.py", line 147, in _run_component
    return component(**cfg)
  File "/awesome-project/lit-parrot/quantize/gptq.py", line 357, in main
    test_string = get_sample_data()
  File "/awesome-project/lit-parrot/quantize/gptq.py", line 214, in get_sample_data
    from datasets import load_dataset
ModuleNotFoundError: No module named 'datasets'

The issue seems to be related to the datasets package. Could you kindly provide a pointer on how to fix it?

Thanks in advance!
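
For what it's worth, the traceback shows that quantize/gptq.py only imports the Hugging Face datasets package to fetch calibration text, so installing it with pip install datasets should resolve the error. A hedged sketch of a friendlier guard around that import (not the current code):

def get_sample_data() -> str:
    try:
        from datasets import load_dataset  # only needed to build GPTQ calibration text
    except ModuleNotFoundError as err:
        raise ModuleNotFoundError(
            "quantize/gptq.py needs the datasets package for calibration data; "
            "install it with: pip install datasets"
        ) from err
    ...  # load and return the calibration samples as before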

Training time is unexpectedly very slow compared to lit-llama

Hello,

I'm using the pretraining code to train falcon-7b; I've previously used lit-llama to train llama-7b.
I noticed that falcon is very slow compared to llama, and it takes more memory.
In llama 7B:
iter 2: loss 11.0692, time: 5024.25ms, speed: 1705 toks/s/device
In falcon 7B:
iter 2: loss 11.0666, time: 26360.27ms, speed: 388 toks/s/device

Also, falcon consumes a lot more memory: I couldn't increase the batch size beyond 160 with micro batch size 5, while with llama I went up to 384 with micro batch size 6.
Is this normal?

Fix CPU OOM on Windows

__________________________________ test_main __________________________________

_ = <MagicMock name='is_bf16_supported' id='1532430881920'>
tmp_path = WindowsPath('C:/Users/runneradmin/AppData/Local/Temp/pytest-of-runneradmin/pytest-0/test_main0')
monkeypatch = <_pytest.monkeypatch.MonkeyPatch object at 0x00000164CBFF8D00>

    @mock.patch("torch.cuda.is_bf16_supported", return_value=False)
    def test_main(_, tmp_path, monkeypatch):
        generate = load_generate_script()
    
        config_path = tmp_path / "config"
        config_path.write_text("{}")
    
        class FabricMock(Mock):
            @property
            def device(self):
                return torch.device("cpu")
    
        monkeypatch.setattr(generate.L, "Fabric", FabricMock)
        load_mock = Mock()
        load_mock.return_value = load_mock
        load_mock.__enter__ = Mock()
        load_mock.__exit__ = Mock()
        monkeypatch.setattr(generate, "lazy_load", load_mock)
        tokenizer_mock = Mock()
        tokenizer_mock.return_value.encode.return_value = torch.tensor([[1, 2, 3]])
        tokenizer_mock.return_value.decode.return_value = "foo bar baz"
        monkeypatch.setattr(generate, "Tokenizer", tokenizer_mock)
        generate_mock = Mock()
        generate_mock.return_value = torch.tensor([[3, 2, 1]])
        monkeypatch.setattr(generate, "generate", generate_mock)
    
        num_samples = 2
        out = StringIO()
        with redirect_stdout(out):
>           generate.main(temperature=2.0, top_k=2, num_samples=num_samples, config_path=config_path)

tests\test_generate.py:83: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
generate.py:122: in main
    model = StableLM(config)
lit_stablelm\model.py:58: in __init__
    h=nn.ModuleList(Block(config) for _ in range(config.n_layer)),
C:\hostedtoolcache\windows\Python\3.10.11\x64\lib\site-packages\torch\nn\modules\container.py:279: in __init__
    self += modules
C:\hostedtoolcache\windows\Python\3.10.11\x64\lib\site-packages\torch\nn\modules\container.py:320: in __iadd__
    return self.extend(modules)
C:\hostedtoolcache\windows\Python\3.10.11\x64\lib\site-packages\torch\nn\modules\container.py:401: in extend
    for i, module in enumerate(modules):
lit_stablelm\model.py:58: in <genexpr>
    h=nn.ModuleList(Block(config) for _ in range(config.n_layer)),
lit_stablelm\model.py:103: in __init__
    self.attn = CausalSelfAttention(config)
lit_stablelm\model.py:121: in __init__
    self.proj = nn.Linear(config.n_embd, config.n_embd, bias=True)
C:\hostedtoolcache\windows\Python\3.10.11\x64\lib\site-packages\torch\nn\modules\linear.py:96: in __init__
    self.weight = Parameter(torch.empty((out_features, in_features), **factory_kwargs))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <lit_stablelm.utils.EmptyInitOnDevice object at 0x00000164CBFFBA60>
func = <built-in method empty of type object at 0x00007FFD63CAC560>, types = ()
args = ((4096, 4096),)
kwargs = {'device': device(type='cpu'), 'dtype': torch.float32}

    def __torch_function__(self, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        if getattr(func, "__module__", None) == "torch.nn.init":
            if "tensor" in kwargs:
                return kwargs["tensor"]
            else:
                return args[0]
        if (
            self.device is not None
            and func in torch.utils._device._device_constructors()
            and kwargs.get("device") is None
        ):
            kwargs["device"] = self.device
        if (
            self.dtype is not None
            and func in torch.utils._device._device_constructors()
            and kwargs.get("dtype") is None
        ):
            kwargs["dtype"] = self.dtype
>       return func(*args, **kwargs)
E       RuntimeError: [enforce fail at ..\c10\core\impl\alloc_cpu.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 67108864 bytes.

lit_stablelm\utils.py:120: RuntimeError
---------------------------- Captured stderr call -----------------------------
Loading model 'checkpoints\\lit-stablelm\\stablelm-base-alpha-3b\\lit-stablelm.pth' with {'block_size': 4096, 'vocab_size': 50254, 'padding_multiplier': 512, 'padded_vocab_size': 50688, 'n_layer': 16, 'n_head': 32, 'n_embd': 4096, 'rotary_percentage': 0.25}

If we cannot fix it, just skip the test on Windows
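
A minimal sketch of that fallback using pytest's standard skipif marker (the reason string is just an assumption about the runner):

import sys

import pytest


@pytest.mark.skipif(sys.platform == "win32", reason="not enough CPU memory on the Windows CI runner")
def test_main(tmp_path, monkeypatch):
    ...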

Flash attention support

In PyTorch 2.0, torch.nn.functional.scaled_dot_product_attention takes the normalization factor from Q.size(-1): https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html

However, in our model implementation this value is different from the head size, because a rotary percentage of 0.25 is used by default, meaning that we cannot use it in that case:

if self.rotary_percentage != 1.0:
    self.register_buffer(
        "bias",
        torch.tril(torch.ones(config.block_size, config.block_size)).view(
            1, 1, config.block_size, config.block_size
        ),
    )

...

if hasattr(self, "bias"):
    # causal self-attention; Self-attend: (B, nh, T, hs) x (B, nh, hs, T) -> (B, nh, T, T)
    att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(head_size))
    att = att.masked_fill(self.bias[:, :, :T, :T] == 0, float("-inf"))
    att = F.softmax(att, dim=-1)
    y = att @ v  # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)
else:
    # efficient attention using Flash Attention CUDA kernels
    y = F.scaled_dot_product_attention(q, k, v, attn_mask=None, dropout_p=0.0, is_causal=True)

PyTorch nightly (to be released with 2.1) conveniently added a scale argument to scaled_dot_product_attention: https://pytorch.org/docs/main/generated/torch.nn.functional.scaled_dot_product_attention.html

My proposal would be to install a nightly version in our requirements.
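
With that nightly in place, the fallback branch above could go away and both cases could use the fused kernel, passing the normalization factor explicitly. A self-contained sketch (shapes are illustrative):

import math

import torch
import torch.nn.functional as F

B, n_head, T, head_size = 1, 32, 8, 128  # illustrative shapes
q = torch.randn(B, n_head, T, head_size)
k = torch.randn(B, n_head, T, head_size)
v = torch.randn(B, n_head, T, head_size)

# scale is the argument added in the PyTorch 2.1 nightlies: it overrides the default
# 1/sqrt(Q.size(-1)) normalization, so a custom factor can be passed even when
# rotary_percentage != 1.0.
y = F.scaled_dot_product_attention(
    q, k, v, attn_mask=None, dropout_p=0.0, is_causal=True, scale=1.0 / math.sqrt(head_size)
)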

python3 chat.py --checkpoint_dir checkpoints/stabilityai/stablelm-tuned-alpha-7b --quantize "gptq.int4" fails

Loading without quantization succeeds, but the first generate call fails with CUDA out of memory. Running with quantization fails on load...

RuntimeError: Error(s) in loading state_dict for Parrot:
Missing key(s) in state_dict: "lm_head.quant_weight", "lm_head.scales", "lm_head.zeros", "transformer.h.0.attn.attn.quant_weight",
"transformer.h.0.attn.attn.scales", "transformer.h.0.attn.attn.zeros", "transformer.h.0.attn.proj.quant_weight",
"transformer.h.0.attn.proj.scales", "transformer.h.0.attn.proj.zeros", "transformer.h.0.mlp.fc.quant_weight", "transformer.h.0.mlp.fc.scales",
"transformer.h.0.mlp.fc.zeros", "transformer.h.0.mlp.proj.quant_weight", "transformer.h.0.mlp.proj.scales", "transformer.h.0.mlp.proj.zeros",
"transformer.h.1.attn.attn.quant_weight", "transformer.h.1.attn.attn.scales", "transformer.h.1.attn.attn.zeros",
"transformer.h.1.attn.proj.quant_weight", "transformer.h.1.attn.proj.scales", "transformer.h.1.attn.proj.zeros",
"transformer.h.1.mlp.fc.quant_weight", "transformer.h.1.mlp.fc.scales", "transformer.h.1.mlp.fc.zeros", "transformer.h.1.mlp.proj.quant_weight",
"transformer.h.1.mlp.proj.scales", "transformer.h.1.mlp.proj.zeros", "transformer.h.2.attn.attn.quant_weight",
"transformer.h.2.attn.attn.scales", "transformer.h.2.attn.attn.zeros", "transformer.h.2.attn.proj.quant_weight",
"transformer.h.2.attn.proj.scales", "transformer.h.2.attn.proj.zeros", "transformer.h.2.mlp.fc.quant_weight", "transformer.h.2.mlp.fc.scales",
"transformer.h.2.mlp.fc.zeros", "transformer.h.2.mlp.proj.quant_weight", "transformer.h.2.mlp.proj.scales", "transformer.h.2.mlp.proj.zeros",
"transformer.h.3.attn.attn.quant_weight", "transformer.h.3.attn.attn.scales", "transformer.h.3.attn.attn.zeros",
"transformer.h.3.attn.proj.quant_weight", "transformer.h.3.attn.proj.scales", "transformer.h.3.attn.proj.zeros",
"transformer.h.3.mlp.fc.quant_weight", "transformer.h.3.mlp.fc.scales", "transformer.h.3.mlp.fc.zeros", "transformer.h.3.mlp.proj.quant_weight",
"transformer.h.3.mlp.proj.scales", "transformer.h.3.mlp.proj.zeros", "transformer.h.4.attn.attn.quant_weight",
"transformer.h.4.attn.attn.scales", "transformer.h.4.attn.attn.zeros", "transformer.h.4.attn.proj.quant_weight",
"transformer.h.4.attn.proj.scales", "transformer.h.4.attn.proj.zeros", "transformer.h.4.mlp.fc.quant_weight", "transformer.h.4.mlp.fc.scales",
"transformer.h.4.mlp.fc.zeros", "transformer.h.4.mlp.proj.quant_weight", "transformer.h.4.mlp.proj.scales", "transformer.h.4.mlp.proj.zeros",
"transformer.h.5.attn.attn.quant_weight", "transformer.h.5.attn.attn.scales", "transformer.h.5.attn.attn.zeros",
"transformer.h.5.attn.proj.quant_weight", "transformer.h.5.attn.proj.scales", "transformer.h.5.attn.proj.zeros",
"transformer.h.5.mlp.fc.quant_weight", "transformer.h.5.mlp.fc.scales", "transformer.h.5.mlp.fc.zeros", "transformer.h.5.mlp.proj.quant_weight",
"transformer.h.5.mlp.proj.scales", "transformer.h.5.mlp.proj.zeros", "transformer.h.6.attn.attn.quant_weight",
"transformer.h.6.attn.attn.scales", "transformer.h.6.attn.attn.zeros", "transformer.h.6.attn.proj.quant_weight",
"transformer.h.6.attn.proj.scales", "transformer.h.6.attn.proj.zeros", "transformer.h.6.mlp.fc.quant_weight", "transformer.h.6.mlp.fc.scales",
"transformer.h.6.mlp.fc.zeros", "transformer.h.6.mlp.proj.quant_weight", "transformer.h.6.mlp.proj.scales", "transformer.h.6.mlp.proj.zeros",
"transformer.h.7.attn.attn.quant_weight", "transformer.h.7.attn.attn.scales", "transformer.h.7.attn.attn.zeros",
"transformer.h.7.attn.proj.quant_weight", "transformer.h.7.attn.proj.scales", "transformer.h.7.attn.proj.zeros",
"transformer.h.7.mlp.fc.quant_weight", "transformer.h.7.mlp.fc.scales", "transformer.h.7.mlp.fc.zeros", "transformer.h.7.mlp.proj.quant_weight",
"transformer.h.7.mlp.proj.scales", "transformer.h.7.mlp.proj.zeros", "transformer.h.8.attn.attn.quant_weight",
"transformer.h.8.attn.attn.scales", "transformer.h.8.attn.attn.zeros", "transformer.h.8.attn.proj.quant_weight",
"transformer.h.8.attn.proj.scales", "transformer.h.8.attn.proj.zeros", "transformer.h.8.mlp.fc.quant_weight", "transformer.h.8.mlp.fc.scales",
"transformer.h.8.mlp.fc.zeros", "transformer.h.8.mlp.proj.quant_weight", "transformer.h.8.mlp.proj.scales", "transformer.h.8.mlp.proj.zeros",
"transformer.h.9.attn.attn.quant_weight", "transformer.h.9.attn.attn.scales", "transformer.h.9.attn.attn.zeros",
"transformer.h.9.attn.proj.quant_weight", "transformer.h.9.attn.proj.scales", "transformer.h.9.attn.proj.zeros",
"transformer.h.9.mlp.fc.quant_weight", "transformer.h.9.mlp.fc.scales", "transformer.h.9.mlp.fc.zeros", "transformer.h.9.mlp.proj.quant_weight",
"transformer.h.9.mlp.proj.scales", "transformer.h.9.mlp.proj.zeros", "transformer.h.10.attn.attn.quant_weight",
"transformer.h.10.attn.attn.scales", "transformer.h.10.attn.attn.zeros", "transformer.h.10.attn.proj.quant_weight",
"transformer.h.10.attn.proj.scales", "transformer.h.10.attn.proj.zeros", "transformer.h.10.mlp.fc.quant_weight",
"transformer.h.10.mlp.fc.scales", "transformer.h.10.mlp.fc.zeros", "transformer.h.10.mlp.proj.quant_weight", "transformer.h.10.mlp.proj.scales",
"transformer.h.10.mlp.proj.zeros", "transformer.h.11.attn.attn.quant_weight", "transformer.h.11.attn.attn.scales",
"transformer.h.11.attn.attn.zeros", "transformer.h.11.attn.proj.quant_weight", "transformer.h.11.attn.proj.scales",
"transformer.h.11.attn.proj.zeros", "transformer.h.11.mlp.fc.quant_weight", "transformer.h.11.mlp.fc.scales", "transformer.h.11.mlp.fc.zeros",
"transformer.h.11.mlp.proj.quant_weight", "transformer.h.11.mlp.proj.scales", "transformer.h.11.mlp.proj.zeros",
"transformer.h.12.attn.attn.quant_weight", "transformer.h.12.attn.attn.scales", "transformer.h.12.attn.attn.zeros",
"transformer.h.12.attn.proj.quant_weight", "transformer.h.12.attn.proj.scales", "transformer.h.12.attn.proj.zeros",
"transformer.h.12.mlp.fc.quant_weight", "transformer.h.12.mlp.fc.scales", "transformer.h.12.mlp.fc.zeros",
"transformer.h.12.mlp.proj.quant_weight", "transformer.h.12.mlp.proj.scales", "transformer.h.12.mlp.proj.zeros",
"transformer.h.13.attn.attn.quant_weight", "transformer.h.13.attn.attn.scales", "transformer.h.13.attn.attn.zeros",
"transformer.h.13.attn.proj.quant_weight", "transformer.h.13.attn.proj.scales", "transformer.h.13.attn.proj.zeros",
"transformer.h.13.mlp.fc.quant_weight", "transformer.h.13.mlp.fc.scales", "transformer.h.13.mlp.fc.zeros",
"transformer.h.13.mlp.proj.quant_weight", "transformer.h.13.mlp.proj.scales", "transformer.h.13.mlp.proj.zeros",
"transformer.h.14.attn.attn.quant_weight", "transformer.h.14.attn.attn.scales", "transformer.h.14.attn.attn.zeros",
"transformer.h.14.attn.proj.quant_weight", "transformer.h.14.attn.proj.scales", "transformer.h.14.attn.proj.zeros",
"transformer.h.14.mlp.fc.quant_weight", "transformer.h.14.mlp.fc.scales", "transformer.h.14.mlp.fc.zeros",
"transformer.h.14.mlp.proj.quant_weight", "transformer.h.14.mlp.proj.scales", "transformer.h.14.mlp.proj.zeros",
"transformer.h.15.attn.attn.quant_weight", "transformer.h.15.attn.attn.scales", "transformer.h.15.attn.attn.zeros",
"transformer.h.15.attn.proj.quant_weight", "transformer.h.15.attn.proj.scales", "transformer.h.15.attn.proj.zeros",
"transformer.h.15.mlp.fc.quant_weight", "transformer.h.15.mlp.fc.scales", "transformer.h.15.mlp.fc.zeros",
"transformer.h.15.mlp.proj.quant_weight", "transformer.h.15.mlp.proj.scales", "transformer.h.15.mlp.proj.zeros".
Unexpected key(s) in state_dict: "lm_head.weight", "transformer.h.0.attn.attn.weight", "transformer.h.0.attn.proj.weight",
"transformer.h.0.mlp.fc.weight", "transformer.h.0.mlp.proj.weight", "transformer.h.1.attn.attn.weight", "transformer.h.1.attn.proj.weight",
"transformer.h.1.mlp.fc.weight", "transformer.h.1.mlp.proj.weight", "transformer.h.2.attn.attn.weight", "transformer.h.2.attn.proj.weight",
"transformer.h.2.mlp.fc.weight", "transformer.h.2.mlp.proj.weight", "transformer.h.3.attn.attn.weight", "transformer.h.3.attn.proj.weight",
"transformer.h.3.mlp.fc.weight", "transformer.h.3.mlp.proj.weight", "transformer.h.4.attn.attn.weight", "transformer.h.4.attn.proj.weight",
"transformer.h.4.mlp.fc.weight", "transformer.h.4.mlp.proj.weight", "transformer.h.5.attn.attn.weight", "transformer.h.5.attn.proj.weight",
"transformer.h.5.mlp.fc.weight", "transformer.h.5.mlp.proj.weight", "transformer.h.6.attn.attn.weight", "transformer.h.6.attn.proj.weight",
"transformer.h.6.mlp.fc.weight", "transformer.h.6.mlp.proj.weight", "transformer.h.7.attn.attn.weight", "transformer.h.7.attn.proj.weight",
"transformer.h.7.mlp.fc.weight", "transformer.h.7.mlp.proj.weight", "transformer.h.8.attn.attn.weight", "transformer.h.8.attn.proj.weight",
"transformer.h.8.mlp.fc.weight", "transformer.h.8.mlp.proj.weight", "transformer.h.9.attn.attn.weight", "transformer.h.9.attn.proj.weight",
"transformer.h.9.mlp.fc.weight", "transformer.h.9.mlp.proj.weight", "transformer.h.10.attn.attn.weight", "transformer.h.10.attn.proj.weight",
"transformer.h.10.mlp.fc.weight", "transformer.h.10.mlp.proj.weight", "transformer.h.11.attn.attn.weight", "transformer.h.11.attn.proj.weight",
"transformer.h.11.mlp.fc.weight", "transformer.h.11.mlp.proj.weight", "transformer.h.12.attn.attn.weight", "transformer.h.12.attn.proj.weight",
"transformer.h.12.mlp.fc.weight", "transformer.h.12.mlp.proj.weight", "transformer.h.13.attn.attn.weight", "transformer.h.13.attn.proj.weight",
"transformer.h.13.mlp.fc.weight", "transformer.h.13.mlp.proj.weight", "transformer.h.14.attn.attn.weight", "transformer.h.14.attn.proj.weight",
"transformer.h.14.mlp.fc.weight", "transformer.h.14.mlp.proj.weight", "transformer.h.15.attn.attn.weight", "transformer.h.15.attn.proj.weight",
"transformer.h.15.mlp.fc.weight", "transformer.h.15.mlp.proj.weight".

Config cannot be overwritten through kwargs

Repro from finetune_adapter script:

from pathlib import Path
from lit_parrot.adapter import Config

max_seq_length = 256  # see scripts/prepare_alpaca.py
checkpoint_dir = Path("checkpoints/togethercomputer/RedPajama-INCITE-Base-3B-v1")
config = Config.from_name(name=checkpoint_dir.name, block_size=max_seq_length)


TypeError: lit_parrot.adapter.Config() got multiple values for keyword argument 'block_size'

https://github.com/Lightning-AI/lit-parrot/blob/0b5620de0c261a69298d39565d6e0f4b1e255fdb/lit_parrot/config.py#L25-L27

We can change it as below so that user-specified kwargs overwrite the config defaults.

@classmethod
def from_name(cls, name: str, **kwargs: Any) -> Self:
    return cls(**{**configs[name], **kwargs})
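
Assuming the patched from_name above, the repro then behaves as expected because the user-supplied kwargs are merged last:

from lit_parrot.adapter import Config

config = Config.from_name("RedPajama-INCITE-Base-3B-v1", block_size=256)
assert config.block_size == 256  # user override wins over the preset value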

Query Regarding Minimum Hardware Requirements for Fine-tuning and Inference

Hi there,

Firstly, I want to express my appreciation for the insightful tutorial and the fine-tuning repository. I've found them extremely useful. 🚀

I'm looking to clarify the minimum hardware requirements for fine-tuning and inference with the models supported in this repo. I encountered some out-of-memory (OOM) issues during quantization on a CPU-only system with 8 GB of RAM.

The reason I'm asking is that I'm considering using this repo for our open-source project (OpenBBTerminal). Understanding the minimum requirements will help us ensure the widest possible user accessibility.

Thanks in advance for your help on this matter.

finetune/adapter.py not loading the train_data from train.pt

I am getting the following error:

python finetune/adapter.py  \
   --data_dir data/dolly \
   --checkpoint_dir checkpoints/togethercomputer/RedPajama-INCITE-Base-3B-v1  \
    --out_dir out/adapter/dolly
Global seed set to 1337
Loading model 'checkpoints/togethercomputer/RedPajama-INCITE-Base-3B-v1/lit_model.pth' with {'block_size': 256, 'vocab_size': 50254, 'padding_multiple': 256, 'padded_vocab_size': 50432, 'n_layer': 32, 'n_head': 32, 'n_embd': 2560, 'rotary_percentage': 1.0, 'parallel_residual': False, 'bias': True, 'n_query_groups': 32, 'shared_attention_norm': False, 'adapter_prompt_length': 10, 'adapter_start_layer': 2}
Number of trainable parameters: 768960
Traceback (most recent call last):
  File "/workspace/lit-parrot/finetune/adapter.py", line 246, in <module>
    CLI(main)
  File "/workspace/lit-parrot/litparrot/lib/python3.10/site-packages/jsonargparse/cli.py", line 85, in CLI
    return _run_component(component, cfg_init)
  File "/workspace/lit-parrot/litparrot/lib/python3.10/site-packages/jsonargparse/cli.py", line 147, in _run_component
    return component(**cfg)
  File "/workspace/lit-parrot/finetune/adapter.py", line 85, in main
    train(fabric, model, optimizer, train_data, val_data, checkpoint_dir, out_dir)
  File "/workspace/lit-parrot/finetune/adapter.py", line 119, in train
    input_ids, targets = get_batch(fabric, train_data)
  File "/workspace/lit-parrot/finetune/adapter.py", line 184, in get_batch
    ix = torch.randint(len(data), (micro_batch_size,))
RuntimeError: random_ expects 'from' to be less than 'to', but got from=0 >= to=0

When load_datasets() is called, it correctly loads the data from the test.pt file, but for some reason it's not loading the data from train.pt, even though both files exist in the same directory (data/dolly).

These files were created by running:

python scripts/prepare_custom.py \
    --destination_path data/dolly \
    --checkpoint_dir checkpoints/togethercomputer/RedPajama-INCITE-Base-3B-v1

Caches should not persist across multiple generate calls.

When running the generate function twice on the same model, the caches from the first generation need to be torn down before generating again. Otherwise, we get an error like the one below.

/content/lit-parrot/lit_parrot/model.py in forward(self, x, rope, mask, max_seq_length, input_pos, kv_cache)
    205 
    206         # efficient attention using Flash Attention CUDA kernels
--> 207         y = F.scaled_dot_product_attention(q, k, v, attn_mask=mask, dropout_p=0.0, scale=1.0 / math.sqrt(head_size))
    208 
    209         y = y.transpose(1, 2).contiguous().view(B, T, C)  # re-assemble all head outputs side by side

RuntimeError: The size of tensor a (7) must match the size of tensor b (8) at non-singleton dimension 3

We can add a context manager in the Model class and put the generation code under it.

from contextlib import contextmanager


class Parrot(Parrot):  # i.e. add this method to the existing Parrot model class
    @contextmanager
    def cache(self):
        try:
            yield
        finally:
            # tear down all generation caches so the next call starts fresh
            self.kv_caches = []
            self.rope_cache = None
            self.mask_cache = None


# inside generate function
...
with model.cache():
    # generate max_new_tokens tokens
    for _ in range(max_new_tokens):
        x = idx.index_select(0, input_pos).view(1, -1)

        # forward
        logits = model(x, max_seq_length, input_pos)
        logits = logits[0, -1] / temperature
...

too many values to unpack in Block forward

The Block forward results fail to unpack correctly.

File "/root/lit-parrot/lit_parrot/adapter.py", line 207, in forward
	if input_pos is None:  # proxy for use_cache=False
    for block in self.transformer.h:
	    x, _ = block(x, (cos, sin), mask, max_seq_length)
ValueError: too many values to unpack (expected 2)

Cached KVs not implemented on Adapter causing errors.

The adapter inherits most methods from the base model, but the Adapter's init method and the Block and CausalSelfAttention forward methods don't implement the cached-KV logic, which causes errors.

The errors mostly come from the forward method in the adapter.
Example:

AttributeError: 'Parrot' object has no attribute 'rope_cache' 
TypeError: Block.forward() takes 2 positional arguments but 7 were given

We should add (a signature sketch follows this list):

  • rope_cache, mask_cache, and kv_caches attributes to the Adapter's init method,
  • rope, mask, max_seq_length, input_pos, and kv_cache to the inputs of the Block and CausalSelfAttention forward methods,
  • the kv_cache to the return values of the Block and CausalSelfAttention forward methods.
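
A hedged sketch of what the Block forward signature might look like after those changes; argument names follow the base model's forward and the adapter tracebacks in this thread, and the extra adapter_kv_cache return value is an assumption based on the unpack error above:

from typing import Optional, Tuple

import torch


class Block:  # sketch only: stands in for lit_parrot.adapter.Block
    def forward(
        self,
        x: torch.Tensor,
        rope: Tuple[torch.Tensor, torch.Tensor],
        mask: torch.Tensor,
        max_seq_length: int,
        input_pos: Optional[torch.Tensor] = None,
        kv_cache: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
        adapter_kv_cache: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
    ) -> Tuple[torch.Tensor, Optional[tuple], Optional[tuple]]:
        # ... run attention (with the adapter prompts) and the MLP as before ...
        # and return the updated caches alongside the hidden states
        return x, kv_cache, adapter_kv_cache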

NaN training loss after a couple of steps

When running fine-tuning with stablelm-base-alpha-3b on Alpaca, fine-tuning works well for the first couple of iterations, but the training loss becomes NaN after some iterations. Could you please help me with this issue? By the way, this runs on 1 GPU on a g5.16xlarge (AWS SageMaker).

Loading model 'checkpoints/stabilityai/stablelm-base-alpha-3b/lit_model.pth' with {'block_size': 4096, 'vocab_size': 50254, 'padding_multiple': 512, 'padded_vocab_size': 50688, 'n_layer': 16, 'n_head': 32, 'n_embd': 4096, 'rotary_percentage': 0.25, 'parallel_residual': True, 'bias': True, 'n_query_groups': 32, 'shared_attention_norm': False, 'adapter_prompt_length':10, 'adapter_start_layer': 2}
Number of trainable parameters: 2125248
/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/lightning/fabric/fabric.py:828: PossibleUserWarning: The model passed to Fabric.setup() has parameters on different devices. Since move_to_device=True, all parameters will be moved to the new device. If this is not desired, set Fabric.setup(..., move_to_device=False).
rank_zero_warn(
iter 0: loss 3.5421, time: 174.07ms
iter 1: loss 3.0288, time: 95.89ms
iter 2: loss 3.5571, time: 60.14ms
iter 3: loss 2.8494, time: 88.95ms
iter 4: loss 3.2140, time: 64.26ms
iter 5: loss 2.7726, time: 67.84ms
iter 6: loss 2.7332, time: 66.63ms
iter 7: loss 3.1365, time: 67.12ms
iter 8: loss 2.6164, time: 88.71ms
iter 9: loss 2.6239, time: 90.34ms
iter 10: loss 2.7440, time: 98.67ms
iter 11: loss 2.9421, time: 64.48ms
iter 12: loss 2.5184, time: 97.68ms
iter 13: loss 2.7282, time: 61.63ms
iter 14: loss 1.9213, time: 180.24ms
iter 15: loss 2.5665, time: 96.10ms
iter 16: loss 3.0199, time: 65.29ms
iter 17: loss 3.4083, time: 66.38ms
iter 18: loss 3.0120, time: 61.52ms
iter 19: loss 2.6137, time: 96.16ms
iter 20: loss 2.6338, time: 88.55ms
iter 21: loss 2.6259, time: 67.08ms
iter 22: loss 3.1457, time: 64.26ms
iter 23: loss 2.7812, time: 95.88ms
iter 24: loss 2.5923, time: 64.98ms
iter 25: loss 2.4579, time: 91.93ms
iter 26: loss 2.8956, time: 61.76ms
iter 27: loss 3.5309, time: 57.92ms
iter 28: loss 2.8725, time: 67.91ms
iter 29: loss 2.9909, time: 90.01ms
iter 30: loss 2.6652, time: 121.70ms
iter 31: loss 3.2488, time: 58.30ms
iter 32: loss 3.0665, time: 90.61ms
iter 33: loss 3.2830, time: 58.08ms
iter 34: loss 2.6600, time: 116.47ms
iter 35: loss 2.6636, time: 136.96ms
iter 36: loss 3.6505, time: 58.66ms
iter 37: loss 2.7473, time: 89.21ms
iter 38: loss 2.9823, time: 87.24ms
iter 39: loss 2.8799, time: 85.97ms
iter 40: loss 2.6276, time: 114.52ms
iter 41: loss 2.3663, time: 66.84ms
iter 42: loss 3.0142, time: 88.69ms
iter 43: loss 3.0303, time: 64.94ms
iter 44: loss 4.0041, time: 65.64ms
iter 45: loss 3.3370, time: 59.52ms
iter 46: loss 3.3909, time: 65.03ms
iter 47: loss 3.1888, time: 54.24ms
iter 48: loss 2.6625, time: 91.05ms
iter 49: loss 3.1856, time: 66.61ms
iter 50: loss 3.5569, time: 57.50ms
iter 51: loss 3.0958, time: 66.84ms
iter 52: loss 3.4789, time: 67.88ms
iter 53: loss 3.2668, time: 64.46ms
iter 54: loss 3.1411, time: 65.62ms
iter 55: loss 2.9815, time: 124.00ms
iter 56: loss 2.6963, time: 114.22ms
iter 57: loss 2.9008, time: 97.70ms
iter 58: loss 3.0037, time: 64.61ms
iter 59: loss 2.8624, time: 115.96ms
iter 60: loss 3.0150, time: 66.87ms
iter 61: loss 2.6633, time: 97.41ms
iter 62: loss 2.7912, time: 114.09ms
iter 63: loss 2.7428, time: 158.58ms
iter 64: loss nan, time: 86.94ms
iter 65: loss nan, time: 91.27ms
iter 66: loss nan, time: 84.82ms
iter 67: loss nan, time: 66.46ms
iter 68: loss nan, time: 65.96ms
iter 69: loss nan, time: 97.37ms
iter 70: loss nan, time: 115.41ms

Loss NaN while fine-tuning Falcon-7B

Following the same instructions provided for fine-tuning falcon-7b and leaving all parameters at their defaults, I could start fine-tuning, but after 60 iterations the loss is NaN. Could anyone explain what the issue might be? URGENT

Getting "Attempting to unscale FP16 gradients" while fine-tuning with float16

The initial script showed an error that the GPU doesn't support bfloat16 and asked me to use float16 instead.

I modified it as below.

fabric = L.Fabric(
    accelerator="cuda",
    devices=devices,
    strategy=(DeepSpeedStrategy(config=ds_config) if devices > 1 else "auto"),
    precision="16-mixed",
)

with EmptyInitOnDevice(device=fabric.device, dtype=torch.float16):
    model = Parrot(config)

It shows an error when the optimizer tries to step:

File "finetune_adapter.py", line 117, in train
    optimizer.step()
  File "/disk3/ai/ml_experiments/finetune_llms/venv/lib/python3.8/site-packages/lightning/fabric/wrappers.py", line 72, in step
    return self._strategy.optimizer_step(
  File "/disk3/ai/ml_experiments/finetune_llms/venv/lib/python3.8/site-packages/lightning/fabric/strategies/strategy.py", line 193, in optimizer_step
    return self.precision.optimizer_step(optimizer, **kwargs)
  File "/disk3/ai/ml_experiments/finetune_llms/venv/lib/python3.8/site-packages/lightning/fabric/plugins/precision/amp.py", line 83, in optimizer_step
    step_output = self.scaler.step(optimizer, **kwargs)
  File "/disk3/ai/ml_experiments/finetune_llms/venv/lib/python3.8/site-packages/torch/cuda/amp/grad_scaler.py", line 370, in step
    self.unscale_(optimizer)
  File "/disk3/ai/ml_experiments/finetune_llms/venv/lib/python3.8/site-packages/torch/cuda/amp/grad_scaler.py", line 284, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(optimizer, inv_scale, found_inf, False)
  File "/disk3/ai/ml_experiments/finetune_llms/venv/lib/python3.8/site-packages/torch/cuda/amp/grad_scaler.py", line 212, in _unscale_grads_
    raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.
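
For context, the GradScaler behind the 16-mixed plugin refuses to unscale gradients that are already FP16, so the usual fix is to keep the parameters in float32 and let autocast handle the half-precision compute. A hedged sketch reusing the names from the snippet above:

fabric = L.Fabric(
    accelerator="cuda",
    devices=devices,
    strategy=(DeepSpeedStrategy(config=ds_config) if devices > 1 else "auto"),
    precision="16-mixed",
)

# Keep the master weights in float32; "16-mixed" autocasts the forward pass,
# so the GradScaler then sees FP32 gradients it can unscale.
with EmptyInitOnDevice(device=fabric.device, dtype=torch.float32):
    model = Parrot(config)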

Why have a default max_seq_length of 256?

I noticed that both the data prep / tokenization script (https://github.com/Lightning-AI/lit-parrot/blob/main/scripts/prepare_alpaca.py#L26) and the fine-tuning scripts (https://github.com/Lightning-AI/lit-parrot/blob/main/finetune/adapter.py#L41, https://github.com/Lightning-AI/lit-parrot/blob/main/finetune/adapter_v2.py#L46) have max_seq_length=256.

While this does seem to speed up tokenization, it has the unfortunate property of truncating fine-tuning inputs and also requires changing both scripts to use the full context length of a language model. I'm curious why this parameter was added and whether it might be possible to switch the default to None or 4096.

Download documentation needs updating, --repo_id required

The documentation says:

python scripts/download.py stabilityai/stablelm-base-alpha-3b

What's actually required:

python scripts/download.py --repo_id stabilityai/stablelm-base-alpha-3b

I could just make edits as I go along and send a PR if you wish.

RuntimeError: handle_0 INTERNAL ASSERT FAILED at "../c10/cuda/driver_api.cpp":15, please report a bug to PyTorch.

I've been trying for some time now and always run into this error. Everything prior worked. What am I doing wrong?
RTX 3090 (24 GB)
Windows 10, but running Ubuntu via WSL; maybe that's the problem, but I don't want to install Ubuntu on a new partition.

python3 finetune/adapter_v2.py --data_dir data/alpaca --checkpoint_dir checkpoints/tiiuae/falcon-7b --out_dir out/adapter/alpaca
/usr/lib/python3/dist-packages/pkg_resources/__init__.py:116: PkgResourcesDeprecationWarning: 1.1build1 is an invalid version and will not be supported in a future release
warnings.warn(
/usr/lib/python3/dist-packages/pkg_resources/__init__.py:116: PkgResourcesDeprecationWarning: 1.1build1 is an invalid version and will not be supported in a future release
warnings.warn(
Global seed set to 1337
Loading model 'checkpoints/tiiuae/falcon-7b/lit_model.pth' with {'block_size': 2048, 'vocab_size': 50254, 'padding_multiple': 512, 'padded_vocab_size': 65024, 'n_layer': 32, 'n_head': 71, 'n_embd': 4544, 'rotary_percentage': 1.0, 'parallel_residual': True, 'bias': False, 'n_query_groups': 1, 'shared_attention_norm': True, 'adapter_prompt_length': 10, 'adapter_start_layer': 2}
Number of trainable parameters: 3839186
/usr/local/lib/python3.10/dist-packages/lightning/fabric/fabric.py:828: PossibleUserWarning: The model passed to Fabric.setup() has parameters on different devices. Since move_to_device=True, all parameters will be moved to the new device. If this is not desired, set Fabric.setup(..., move_to_device=False).
rank_zero_warn(
iter 0: loss 2.7154, time: 2929.28ms
Traceback (most recent call last):
File "/root/lit-parrot/finetune/adapter_v2.py", line 254, in
CLI(main)
File "/usr/local/lib/python3.10/dist-packages/jsonargparse/cli.py", line 85, in CLI
return _run_component(component, cfg_init)
File "/usr/local/lib/python3.10/dist-packages/jsonargparse/cli.py", line 147, in _run_component
return component(**cfg)
File "/root/lit-parrot/finetune/adapter_v2.py", line 90, in main
train(fabric, model, optimizer, train_data, val_data, checkpoint_dir, out_dir)
File "/root/lit-parrot/finetune/adapter_v2.py", line 126, in train
logits = model(input_ids)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/lightning/fabric/wrappers.py", line 115, in forward
output = self._forward_module(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in call_impl
return forward_call(*args, **kwargs)
File "/root/lit-parrot/lit_parrot/adapter.py", line 95, in forward
x, *
= block(x, (cos, sin), mask, max_seq_length)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/root/lit-parrot/lit_parrot/adapter.py", line 140, in forward
h, new_kv_cache, new_adapter_kv_cache = self.attn(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/root/lit-parrot/lit_parrot/adapter.py", line 241, in forward
y = y + self.gating_factor * ay
RuntimeError: handle_0 INTERNAL ASSERT FAILED at "../c10/cuda/driver_api.cpp":15, please report a bug to PyTorch.

TypeError: BFloat16 is not supported on MPS

Getting this when running Falcon 7b model on M1 Pro, is there a specific version that supports this on M1?

Command that was run:
python generate/base.py --prompt "Hello, my name is" --checkpoint_dir checkpoints/tiiuae/falcon-7b

Document that was referred:
https://github.com/Lightning-AI/lit-parrot/blob/main/howto/download_falcon.md

/lit_parrot/model.py:201 in forward

Running:
python generate.py --prompt "Hello, my name is" --checkpoint_dir checkpoints/stabilityai/stablelm-tuned-alpha-3b

/lit_parrot/model.py:201 in forward:
k = cache_k.index_copy(2, input_pos, k)
RuntimeError: index_copy_(): self and source expected to have the same dtype, but got (self) Float and (source) BFloat16

Deepspeed and bf16-true

In the finetuning scripts, we only allow

precision: Literal["bf16-true", "32-true"] = "bf16-true",

But we also use DeepSpeed when devices > 1. However, in this case, you'd get a

ValueError: `precision='bf16-true')` is not supported in DeepSpeed. `precision` must be one of: ('32-true', '16-mixed', 'bf16-mixed').

Should we allow bf16-mixed, or should we switch to FSDP? Or something else?

Out of memory issue for fine-tuning RedPajama-INCITE-7B-Base with 1 GPU

Hi, I faced an out-of-memory issue fine-tuning RedPajama-INCITE-7B-Base on Alpaca data with 1 GPU (g5.16xlarge, 24 GiB of GPU memory). With adapter_v2.py, I changed learning_rate = 3e-3 and micro_batch_size = 1. The fine-tuning works really well in the beginning but runs into an out-of-memory issue after 65498 iterations. Does anyone know how to solve it? Thanks!

iter 65496: loss 1.2029, time: 101.54ms
iter 65497: loss 1.5817, time: 184.24ms
iter 65498: loss 1.4716, time: 101.98ms
Traceback (most recent call last):
File "/home/ec2-user/SageMaker/lit-parrot/finetune/adapter_v2.py", line 281, in
CLI(setup)
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/jsonargparse/cli.py", line 85, in CLI
return _run_component(component, cfg_init)
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/jsonargparse/cli.py", line 147, in _run_component
return component(**cfg)
File "/home/ec2-user/SageMaker/lit-parrot/finetune/adapter_v2.py", line 71, in setup
fabric.launch(main, data_dir, checkpoint_dir, out_dir)
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 732,in launch
return self._wrap_and_launch(function, self, *args, **kwargs)
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 814,in _wrap_and_launch
return to_run(*args, **kwargs)
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 823,in _wrap_with_setup
return to_run(*args, **kwargs)
File "/home/ec2-user/SageMaker/lit-parrot/finetune/adapter_v2.py", line 105, in main
train(fabric, model, optimizer, train_data, val_data, checkpoint_dir, out_dir)
File "/home/ec2-user/SageMaker/lit-parrot/finetune/adapter_v2.py", line 148, in train
fabric.backward(loss / gradient_accumulation_iters)
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 387,in backward
self._strategy.backward(tensor, module, *args, **kwargs)
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/lightning/fabric/strategies/strategy.py", line 179, in backward
self.precision.backward(tensor, module, *args, **kwargs)
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/lightning/fabric/plugins/precision/precision.py", line 89, in backward
tensor.backward(*args, **kwargs)
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/torch/_tensor.py", line 491, in backward
torch.autograd.backward(
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/torch/autograd/init.py", line 204,in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 200.00 MiB. GPU 0 has a total capacty of 22.19 GiB of which 106.50 MiB is free. Including non-PyTorch memory, this process has 22.08 GiB memory in use. Of the allocated memory 20.42 GiB is allocated by PyTorch, and 1.36 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory islarge try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Which should be our default model?

Given that this repository supports a multitude of models, which one should be chosen when the user doesn't specify it?

This is important because the howtos and README give concrete numbers for a specific model.

Alternatively, should we force the user to choose one?

Falcon Loss Not Decreasing During Training

I'm using the pretraining code with falcon-7b. I've noticed that the loss hasn't changed for 400 iterations.

iter 1: loss 11.0666, time: 13381.00ms, speed: 306 toks/s/device
....
iter 400: loss 11.0666, time: 19090.34ms, speed: 214 toks/s/device

ERROR: Could not find a version that satisfies the requirement torch>=2.1.0dev

Hello,

I have just cloned the repository and ran pip install -r requirements.txt as explained in the README.md file, but I get the following error:

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
ERROR: Could not find a version that satisfies the requirement torch>=2.1.0dev (from versions: 2.0.0)
ERROR: No matching distribution found for torch>=2.1.0dev

I already tried the following commands, with the same result:

pip install --pre -r requirements.txt
pip install --pre -r requirements.txt -f https://download.pytorch.org/whl/nightly/cpu

The issue is that the nightly dev version of torch is not found and therefore not installed.
Am I missing something?
I am running on a MacBook Pro with an Apple M1 Max.

Thanks,
Nestor

Add chat script for adapter checkpoints

import json
import os
import sys
import time
import warnings
from pathlib import Path
from typing import Optional

import lightning as L
import torch

from generate import generate
from lit_parrot import Tokenizer
from lit_parrot.adapter import Parrot, Config
from lit_parrot.utils import EmptyInitOnDevice, lazy_load, check_valid_checkpoint_dir
sys.path.append(os.path.join(os.path.dirname(__file__), 'scripts'))
from prepare_alpaca import generate_prompt


def main(
    prompt: str = "What would be a good movie to see, and why do you recommend it?",
    input_string: str = "",
    interactive: bool = False,
    adapter_path: Path = Path("out/adapter/alpaca/lit_model_adapter_finetuned.pth"),
    #checkpoint_dir: Path = Path(f"checkpoints/stabilityai/stablelm-base-alpha-3b"),
    checkpoint_dir: Path = Path(f"checkpoints/stabilityai/stablelm-tuned-alpha-3b"),
    quantize: Optional[str] = None,
    max_new_tokens: int = 100,
    top_k: int = 200,
    temperature: float = 0.8,
    max_seq_length: int = 1250  # set this to what you used during fine tuning
) -> None:
    """Generates a response based on a given instruction and an optional input.
    This script will only work with checkpoints from the instruction-tuned Parrot-Adapter model.
    See `finetune_adapter.py`.

    Args:
        prompt: The prompt/instruction (Alpaca style).
        adapter_path: Path to the checkpoint with trained adapter weights, which are the output of
            `finetune_adapter.py`.
        checkpoint_dir: The path to the checkpoint folder with pretrained Parrot weights.
        input_string: Optional input (Alpaca style).
        quantize: Whether to quantize the model and using which method:
            ``"llm.int8"``: LLM.int8() mode,
            ``"gptq.int4"``: GPTQ 4-bit mode.
        max_new_tokens: The number of generation steps to take.
        top_k: The number of top most probable tokens to consider in the sampling process.
        temperature: A value controlling the randomness of the sampling process. Higher values result in more random
            samples.
        max_seq_length: Maximum sequence length; defaults to 1250. Set this to the value used during fine-tuning.
    """
    check_valid_checkpoint_dir(checkpoint_dir)

    fabric = L.Fabric(devices=1)
    dtype = torch.bfloat16 if fabric.device.type == "cuda" and torch.cuda.is_bf16_supported() else torch.float32

    with open(checkpoint_dir / "lit_config.json") as fp:
        config = Config(**json.load(fp))

    print("Loading model ...", file=sys.stderr)
    t0 = time.time()
    with EmptyInitOnDevice(device=fabric.device, dtype=dtype, quantization_mode=quantize):
        model = Parrot(config)
    with lazy_load(checkpoint_dir / "lit_model.pth") as pretrained_checkpoint, lazy_load(
        adapter_path
    ) as adapter_checkpoint:
        # 1. Load the pretrained weights
        model.load_state_dict(pretrained_checkpoint, strict=False)
        # 2. Load the fine-tuned adapter weights
        model.load_state_dict(adapter_checkpoint, strict=False)

    print(f"Time to load model: {time.time() - t0:.02f} seconds.", file=sys.stderr)

    model.eval()
    model = fabric.setup(model)

    tokenizer = Tokenizer(checkpoint_dir / "tokenizer.json", checkpoint_dir / "tokenizer_config.json")


    while True:
        if interactive:
            try:
                prompt = input(">> Prompt: ")
            except KeyboardInterrupt:
                break
            if not prompt:
                break
        else:
            print(f'Prompt: {prompt}')

        sample = {"instruction": prompt, "input": input_string}
        prompt = generate_prompt(sample)
        encoded = tokenizer.encode(prompt, device=model.device)
        prompt_length = encoded.size(0)

        t0 = time.perf_counter()
        y = generate(
           model, 
           idx=encoded, 
           max_new_tokens=max_new_tokens, 
           max_seq_length=max_seq_length,
           temperature=temperature, 
           top_k=top_k, 
           eos_id=tokenizer.eos_id
        )
        t = time.perf_counter() - t0

        output = tokenizer.decode(y)
        output = output.split("### Response:")[1].strip()
        print(output)

        tokens_generated = y.size(0) - prompt_length
        print(f"\n\nTime for inference: {t:.02f} sec total, {tokens_generated / t:.02f} tokens/sec", file=sys.stderr)
        if fabric.device.type == "cuda":
            print(f"Memory used: {torch.cuda.max_memory_reserved() / 1e9:.02f} GB", file=sys.stderr)

        if not interactive:
            break


if __name__ == "__main__":
    from jsonargparse import CLI

    torch.set_float32_matmul_precision("high")
    warnings.filterwarnings(
        # Triggered internally at ../aten/src/ATen/EmptyTensor.cpp:31
        "ignore",
        message="ComplexHalf support is experimental and many operators don't support it yet",
    )
    CLI(main)
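
A hypothetical invocation, assuming the script is saved as chat_adapter.py and uses the default paths above:

python chat_adapter.py --interactive \
    --adapter_path out/adapter/alpaca/lit_model_adapter_finetuned.pth \
    --checkpoint_dir checkpoints/stabilityai/stablelm-tuned-alpha-3b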

micro_batch_size, step run time, total training time

Hi,

Thanks a lot for this clear and fat-free code base!
I'm training Falcon-7B with adapters-v2 and an Alpaca-formatted dataset of mine.

As usual, I'm trying to max out VRAM usage for the best training time, but in this case there is no significant gain, since the step time is almost proportional to the micro batch size.

step times:
micro_batch_size 1, 159ms
micro_batch_size 2, 293ms
micro_batch_size 4, 560ms
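
Converting these step times to samples per second (micro_batch_size divided by step time) shows how small the gain is:

micro_batch_size 1: 1 / 0.159 s ≈ 6.3 samples/s
micro_batch_size 2: 2 / 0.293 s ≈ 6.8 samples/s
micro_batch_size 4: 4 / 0.560 s ≈ 7.1 samples/s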

Is this expected, or can this be optimized?

Note:
As advised, I'll also open a new issue for my attempt at batch inference, which shows the same lack of gain when batching; see
Lightning-AI/lit-llama#188 (comment)

Text generation fails on --devices 2

Hi, I am trying to generate text predictions using falcon-7b-instruct on a machine with two A10 24GB GPUs. When I run generate with the default --devices option (which is 1), it runs successfully, while it fails with --devices 2.

python generate/base.py --prompt "Hello, my name is" --checkpoint_dir checkpoints/tiiuae/falcon-7b-instruct

default --devices
Loading model 'checkpoints/tiiuae/falcon-7b-instruct/lit_model.pth' with {'block_size': 2048, 'vocab_size': 50254, 'padding_multiple': 512, 'padded_vocab_size': 65024, 'n_layer': 32, 'n_head': 71, 'n_embd': 4544, 'rotary_percentage': 1.0, 'parallel_residual': True, 'bias': False, 'n_query_groups': 1, 'shared_attention_norm': True}
Time to instantiate model: 0.15 seconds.
Time to load the model weights: 15.32 seconds.
Global seed set to 1234
Hello, my name is Jack.
Some people think that having a blog is a great way to make money online and others insist that it is not. In my own view, I do agree with the latter one.
But in the end, it will have to depend
Time for inference 1: 2.13 sec total, 23.47 tokens/sec
Memory used: 14.56 GB

python generate/base.py --prompt "Hello, my name is" --checkpoint_dir checkpoints/tiiuae/falcon-7b-instruct --devices 2

--devices 2
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes
----------------------------------------------------------------------------------------------------

Loading model 'checkpoints/tiiuae/falcon-7b-instruct/lit_model.pth' with {'block_size': 2048, 'vocab_size': 50254, 'padding_multiple': 512, 'padded_vocab_size': 65024, 'n_layer': 32, 'n_head': 71, 'n_embd': 4544, 'rotary_percentage': 1.0, 'parallel_residual': True, 'bias': False, 'n_query_groups': 1, 'shared_attention_norm': True}
Time to instantiate model: 1.33 seconds.
Time to load the model weights: 16.37 seconds.
Traceback (most recent call last):
  File "/home/ubuntu/llm-repos/lit-parrot/generate/base.py", line 204, in <module>
    CLI(main)
  File "/home/ubuntu/miniconda3/envs/litpar/lib/python3.9/site-packages/jsonargparse/cli.py", line 85, in CLI
    return _run_component(component, cfg_init)
  File "/home/ubuntu/miniconda3/envs/litpar/lib/python3.9/site-packages/jsonargparse/cli.py", line 147, in _run_component
    return component(**cfg)
  File "/home/ubuntu/llm-repos/lit-parrot/generate/base.py", line 156, in main
    model = fabric.setup_module(model)
  File "/home/ubuntu/miniconda3/envs/litpar/lib/python3.9/site-packages/lightning/fabric/fabric.py", line 265, in setup_module
    module = self._strategy.setup_module(module)
  File "/home/ubuntu/miniconda3/envs/litpar/lib/python3.9/site-packages/lightning/fabric/strategies/ddp.py", line 121, in setup_module
    return DistributedDataParallel(module=module, device_ids=device_ids, **self._ddp_kwargs)
  File "/home/ubuntu/miniconda3/envs/litpar/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 805, in __init__
    self._ddp_init_helper(
  File "/home/ubuntu/miniconda3/envs/litpar/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1095, in _ddp_init_helper
Traceback (most recent call last):
  File "/home/ubuntu/llm-repos/lit-parrot/generate/base.py", line 204, in <module>
    CLI(main)
  File "/home/ubuntu/miniconda3/envs/litpar/lib/python3.9/site-packages/jsonargparse/cli.py", line 85, in CLI
    self.reducer = dist.Reducer(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 13.44 GiB. GPU 1 has a total capacty of 22.05 GiB of which 7.74 GiB is free. Including non-PyTorch memory, this process has 14.31 GiB memory in use. Of the allocated memory 13.49 GiB is allocated by PyTorch, and 49.67 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
    return _run_component(component, cfg_init)
  File "/home/ubuntu/miniconda3/envs/litpar/lib/python3.9/site-packages/jsonargparse/cli.py", line 147, in _run_component
    return component(**cfg)
  File "/home/ubuntu/llm-repos/lit-parrot/generate/base.py", line 156, in main
    model = fabric.setup_module(model)
  File "/home/ubuntu/miniconda3/envs/litpar/lib/python3.9/site-packages/lightning/fabric/fabric.py", line 265, in setup_module
    module = self._strategy.setup_module(module)
  File "/home/ubuntu/miniconda3/envs/litpar/lib/python3.9/site-packages/lightning/fabric/strategies/ddp.py", line 121, in setup_module
    return DistributedDataParallel(module=module, device_ids=device_ids, **self._ddp_kwargs)
  File "/home/ubuntu/miniconda3/envs/litpar/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 805, in __init__
    self._ddp_init_helper(
  File "/home/ubuntu/miniconda3/envs/litpar/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1095, in _ddp_init_helper
    self.reducer = dist.Reducer(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 13.44 GiB. GPU 0 has a total capacty of 22.05 GiB of which 7.74 GiB is free. Including non-PyTorch memory, this process has 14.31 GiB memory in use. Of the allocated memory 13.49 GiB is allocated by PyTorch, and 49.67 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
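
For context (this explanation is an inference from the traceback, not part of the original report): with the default DDP strategy each process keeps a full replica of the model, and the dist.Reducer constructor additionally allocates gradient buckets of roughly the parameters' size, which is the likely source of the extra 13.44 GiB allocation, so DDP does not reduce per-GPU memory for generation. Sharding the weights across the two GPUs, e.g. with FSDP, would be the usual alternative, assuming the script is adapted to accept a different strategy. A minimal sketch of that change:

import lightning as L

# Hypothetical change inside generate/base.py: shard the model across both GPUs
# instead of replicating it in every process.
fabric = L.Fabric(devices=2, strategy="fsdp")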

Avoid the `convert_hf_checkpoint` step

https://github.com/Lightning-AI/lit-parrot/blob/main/scripts/convert_hf_checkpoint.py is a script that converts a list of *.bin files into a single checkpoint file: lit_model.pth.

This has a few disadvantages:

  • it adds one extra step to get started
  • the checkpoint weights are duplicated in the filesystem
  • it takes time and memory to convert.

This is particularly interesting for inference. For training/fine-tuning, the generated checkpoints will still be a single file, so we would need to support loading both formats.

Instead, we could write a function lazy_load_from(checkpoint_dir) that does the weight mapping on the fly.
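
A minimal sketch of that idea (hypothetical; the key-mapping helper and its single rule are illustrative only, the real mapping lives in scripts/convert_hf_checkpoint.py):

from pathlib import Path

import torch


def map_hf_key(hf_key: str) -> str:
    # Illustrative only: translate one Hugging Face parameter name into the
    # lit-parrot naming scheme; the full table would mirror convert_hf_checkpoint.py.
    return hf_key.replace("word_embeddings", "wte")


def lazy_load_from(checkpoint_dir: Path) -> dict:
    # Build a single state dict directly from the downloaded *.bin shards,
    # remapping keys on the fly instead of writing a converted lit_model.pth.
    state_dict = {}
    for shard in sorted(checkpoint_dir.glob("*.bin")):
        shard_state = torch.load(shard, map_location="cpu")
        for hf_key, tensor in shard_state.items():
            state_dict[map_hf_key(hf_key)] = tensor
    return state_dict

To keep it truly lazy, torch.load could be replaced with the mmap-based loading already used by lit_parrot.utils.lazy_load, but the sketch keeps it simple.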

Problem with finetune_adapter.py along with fix

AttributeError: 'Parrot' object has no attribute 'rope_cache'

  lit-parrot/lit_parrot/model.py, line 67, in forward:
    if self.rope_cache is None:
        self.rope_cache = self.build_rope_cache(idx)

The problem is that the __init__ in lit_parrot/adapter.py initializes the grandparent class instead of the parent class:

It should be:

class CausalSelfAttention(BaseModel):
    """A modification of lit_parrot.model.CausalSelfAttention that adds the attention
    over the adaption prompt."""

    def __init__(self, config: Config, block_idx: int) -> None:
        super().__init__(config)

instead of:

class CausalSelfAttention(nn.Module):
    """A modification of lit_parrot.model.CausalSelfAttention that adds the attention
    over the adaption prompt."""

    def __init__(self, config: Config, block_idx: int) -> None:
        super().__init__()
