
simple-llm-finetuner's Introduction

title: Simple LLM Finetuner
emoji: 🦙
colorFrom: yellow
colorTo: orange
sdk: gradio
app_file: app.py
pinned: false

👻👻👻 This project is effectively dead. Please use one of the following tools instead:


🦙 Simple LLM Finetuner

Open In Colab Open In Spaces

Simple LLM Finetuner is a beginner-friendly interface designed to facilitate fine-tuning various language models using the LoRA method via the PEFT library on commodity NVIDIA GPUs. With a small dataset and sample lengths of 256, you can even run this on a regular Colab Tesla T4 instance.

With this intuitive UI, you can easily manage your dataset, customize parameters, train, and evaluate the model's inference capabilities.

Acknowledgements

Features

  • Simply paste datasets in the UI, separated by double blank lines
  • Adjustable parameters for fine-tuning and inference
  • Beginner-friendly UI with explanations for each parameter

Getting Started

Prerequisites

  • Linux or WSL
  • Modern NVIDIA GPU with >= 16 GB of VRAM (but it might be possible to run with less for smaller sample lengths)

Usage

I recommend using a virtual environment to install the required packages. Conda preferred.

conda create -n simple-llm-finetuner python=3.10
conda activate simple-llm-finetuner
conda install -y cuda -c nvidia/label/cuda-11.7.0
conda install -y pytorch=2 pytorch-cuda=11.7 -c pytorch

On WSL, you might need to install CUDA manually by following these steps, then running the following before you launch:

export LD_LIBRARY_PATH=/usr/lib/wsl/lib

Clone the repository and install the required packages.

git clone https://github.com/lxe/simple-llm-finetuner.git
cd simple-llm-finetuner
pip install -r requirements.txt

Launch it

python app.py

Open http://127.0.0.1:7860/ in your browser. Prepare your training data by separating each sample with 2 blank lines. Paste the whole training dataset into the textbox. Specify the new LoRA adapter name in the "New PEFT Adapter Name" textbox, then click train. You might need to adjust the max sequence length and batch size to fit your GPU memory. The model will be saved in the lora/ directory.
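
For reference, here is a rough sketch (not the app's actual code) of what the two-blank-line convention amounts to once the pasted text is split into samples; the exact splitting logic in the repo may differ slightly.

```python
# Illustrative only: split pasted text on double blank lines into samples and
# build a Hugging Face dataset from them, roughly what the trainer does.
import datasets

raw_text = """First training sample, possibly
spanning several lines.


Second training sample."""

# Two blank lines between samples means three consecutive newlines.
samples = [s.strip() for s in raw_text.split("\n\n\n") if s.strip()]

data = datasets.Dataset.from_list([{"text": s} for s in samples])  # datasets >= 2.6
print(data.num_rows)  # 2
```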

After training is done, navigate to "Inference" tab, select your LoRA, and play with it.

Have fun!

YouTube Walkthrough

https://www.youtube.com/watch?v=yM1wanDkNz8

License

MIT License

simple-llm-finetuner's People

Contributors

64-bit, lxe, recursionbane, simomay, vadi2


simple-llm-finetuner's Issues

In trainer.py, ignoring the last token is not suitable for all situations.

    def tokenize_sample(self, item, max_seq_length, add_eos_token=True):
        assert self.tokenizer is not None
        result = self.tokenizer(
            item["text"],
            truncation=True,
            max_length=max_seq_length,
            padding="max_length",
        )

        # ignore the last token [:-1]
        result = {
            "input_ids": result["input_ids"][:-1],
            "attention_mask": result["attention_mask"][:-1],
        }

https://github.com/lxe/simple-llm-finetuner/blob/3c3ae84e5dee5a1d40f17e5567938dfdffce9d16/trainer.py#LL150C9-L153C10

If a user of the web UI trains on a custom dataset, they will not know that the last token of each training sample is truncated,
and the prediction results come out unexpected.
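
One possible alternative (an untested sketch, not code from this repo): drop the trailing token only when the sample actually reached max_seq_length, and otherwise make sure an EOS token ends it. Padding is deliberately left to a data collator here, which is a departure from the original padding="max_length" call.

```python
def tokenize_sample(self, item, max_seq_length, add_eos_token=True):
    assert self.tokenizer is not None
    result = self.tokenizer(
        item["text"],
        truncation=True,
        max_length=max_seq_length,
    )

    input_ids = result["input_ids"]
    attention_mask = result["attention_mask"]

    if len(input_ids) >= max_seq_length:
        # Truncated sample: drop the (possibly cut-off) final token, as before.
        input_ids = input_ids[:-1]
        attention_mask = attention_mask[:-1]
    elif add_eos_token and input_ids[-1] != self.tokenizer.eos_token_id:
        # Short sample: keep every token and append EOS so the model learns
        # where a sample ends instead of losing its last real token.
        input_ids = input_ids + [self.tokenizer.eos_token_id]
        attention_mask = attention_mask + [1]

    return {"input_ids": input_ids, "attention_mask": attention_mask}
```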

RuntimeError: unscale_() has already been called on this optimizer since the last update().

The error message is attached below:

Parameter 'function'=<function Trainer.tokenize_training_text.. at 0x7f8eb04d5c60> of the transform datasets.arrow_dataset.Dataset.map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the __call__ method is faster than using a method to encode the text followed by a call to the pad method to get a padded encoding.
{'loss': 2.4296, 'learning_rate': 0.0002901960784313725, 'epoch': 0.1}
{'loss': 2.271, 'learning_rate': 0.0002803921568627451, 'epoch': 0.2}
{'loss': 2.2099, 'learning_rate': 0.0002705882352941176, 'epoch': 0.29}
{'loss': 2.2199, 'learning_rate': 0.00026078431372549016, 'epoch': 0.39}
{'loss': 2.1911, 'learning_rate': 0.00025098039215686274, 'epoch': 0.49}
{'loss': 2.2129, 'learning_rate': 0.00024117647058823527, 'epoch': 0.59}
{'loss': 2.1752, 'learning_rate': 0.00023137254901960783, 'epoch': 0.68}
{'loss': 2.1841, 'learning_rate': 0.0002215686274509804, 'epoch': 0.78}
{'loss': 2.1827, 'learning_rate': 0.00021176470588235295, 'epoch': 0.88}
{'loss': 2.1514, 'learning_rate': 0.00020196078431372548, 'epoch': 0.98}
Traceback (most recent call last):
File "/home/user/.pyenv/versions/3.10.12/lib/python3.10/site-packages/gradio/routes.py", line 437, in run_predict
output = await app.get_blocks().process_api(
File "/home/user/.pyenv/versions/3.10.12/lib/python3.10/site-packages/gradio/blocks.py", line 1352, in process_api
result = await self.call_function(
File "/home/user/.pyenv/versions/3.10.12/lib/python3.10/site-packages/gradio/blocks.py", line 1077, in call_function
prediction = await anyio.to_thread.run_sync(
File "/home/user/.pyenv/versions/3.10.12/lib/python3.10/site-packages/anyio/to_thread.py", line 33, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/home/user/.pyenv/versions/3.10.12/lib/python3.10/site-packages/anyio/backends/asyncio.py", line 877, in run_sync_in_worker_thread
return await future
File "/home/user/.pyenv/versions/3.10.12/lib/python3.10/site-packages/anyio/backends/asyncio.py", line 807, in run
result = context.run(func, *args)
File "/home/user/.pyenv/versions/3.10.12/lib/python3.10/site-packages/gradio/helpers.py", line 602, in tracked_fn
response = fn(*args)
File "/home/user/app/app.py", line 131, in train
self.trainer.train(
File "/home/user/app/trainer.py", line 273, in train
result = self.trainer.train(resume_from_checkpoint=False)
File "/home/user/.pyenv/versions/3.10.12/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
return inner_training_loop(
File "/home/user/.pyenv/versions/3.10.12/lib/python3.10/site-packages/transformers/trainer.py", line 1850, in inner_training_loop
self.accelerator.clip_grad_norm
(
File "/home/user/.pyenv/versions/3.10.12/lib/python3.10/site-packages/accelerate/accelerator.py", line 1913, in clip_grad_norm

self.unscale_gradients()
File "/home/user/.pyenv/versions/3.10.12/lib/python3.10/site-packages/accelerate/accelerator.py", line 1876, in unscale_gradients
self.scaler.unscale
(opt)
File "/home/user/.pyenv/versions/3.10.12/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 275, in unscale

raise RuntimeError("unscale
() has already been called on this optimizer since the last update().")
RuntimeError: unscale
() has already been called on this optimizer since the last update().

"error" in training - AttributeError: 'CastOutputToFloat' object has no attribute 'weight', RuntimeError: Only Tensors of floating point and complex dtype can require gradients

WSL2 Ubuntu, new install, I get the following error after it downloads the weights and tries to train.
Sorry I can't give more details, but I'm really not sure what's going on.

Number of samples: 534
Traceback (most recent call last):
File "/home/ckg/.local/lib/python3.10/site-packages/gradio/routes.py", line 394, in run_predict
output = await app.get_blocks().process_api(
File "/home/ckg/.local/lib/python3.10/site-packages/gradio/blocks.py", line 1075, in process_api
result = await self.call_function(
File "/home/ckg/.local/lib/python3.10/site-packages/gradio/blocks.py", line 884, in call_function
prediction = await anyio.to_thread.run_sync(
File "/home/ckg/.local/lib/python3.10/site-packages/anyio/to_thread.py", line 31, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/home/ckg/.local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
return await future
File "/home/ckg/.local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 867, in run
result = context.run(func, *args)
File "/home/ckg/.local/lib/python3.10/site-packages/gradio/helpers.py", line 587, in tracked_fn
response = fn(*args)
File "/home/ckg/github/simple-llama-finetuner/main.py", line 164, in tokenize_and_train
model = peft.prepare_model_for_int8_training(model)
File "/home/ckg/.local/lib/python3.10/site-packages/peft/utils/other.py", line 72, in prepare_model_for_int8_training
File "/home/ckg/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1614, in getattr
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'CastOutputToFloat' object has no attribute 'weight'

Suggestion to improve UX

Thank you for this project! I tried it, and unlike some others it worked (with llama 7b and 2080 ti).
Now I'd like to scale up my experiments.

  1. In order to do so, I would need an option to initiate training programmatically. Technically, I'd be able to extract what I need from main.py, but it would be great if there were an already tested example (a rough sketch follows below).
  2. Secondly, I'd like to see an example of how to convert the directory with checkpoints into a standalone model.

Would you please share your thoughts on this or perhaps a link to where it's already implemented? Thank you in advance.
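
On point 1, here is a rough, untested sketch of what programmatic LoRA training with transformers + peft generally looks like; the model id, hyperparameters, and paths below are placeholders, not this repo's defaults.

```python
import datasets
import peft
import transformers

model_name = "decapoda-research/llama-7b-hf"  # placeholder

tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # LLaMA tokenizers ship without a pad token

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_name, load_in_8bit=True, device_map="auto"
)
model = peft.prepare_model_for_int8_training(model)
model = peft.get_peft_model(model, peft.LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
))

# Toy dataset; replace with your own samples.
data = datasets.Dataset.from_list([{"text": "Example training sample."}])
data = data.map(lambda s: tokenizer(s["text"], truncation=True, max_length=256))

trainer = transformers.Trainer(
    model=model,
    train_dataset=data,
    args=transformers.TrainingArguments(
        output_dir="lora/my-adapter",
        per_device_train_batch_size=1,
        num_train_epochs=1,
        learning_rate=3e-4,
        fp16=True,
        logging_steps=10,
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("lora/my-adapter")  # only the adapter weights are written
```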

Error: Adapter lora/decapoda-research_llama-{ADAPTER_NAME} not found.

I have found a resolution and root cause for this issue, I am documenting the reproduction steps here to keep the PR more organized.

Minimum Reproduction Steps

  1. Create at least 2 LoRA adapters for a model, 'Initial Model'
  2. On the Inference tab, select one of the LoRA's, 'Initial LoRA'
  3. Switch the model to one of the other models 'Alternative Model'
  4. Switch the model back to 'Initial Model'
  5. Switch the LoRA to the 2nd lora that was created
  6. Switch the LoRA back to 'Initial LoRA'

This error will be displayed: "Adapter lora/decapoda-research_llama-7b-hf_PYTHON-2 not found."

Callstack:

Traceback (most recent call last):
  File "/home/jon/miniconda3/envs/simple-llm-finetuner/lib/python3.10/site-packages/gradio/routes.py", line 393, in run_predict
    output = await app.get_blocks().process_api(
  File "/home/jon/miniconda3/envs/simple-llm-finetuner/lib/python3.10/site-packages/gradio/blocks.py", line 1108, in process_api
    result = await self.call_function(
  File "/home/jon/miniconda3/envs/simple-llm-finetuner/lib/python3.10/site-packages/gradio/blocks.py", line 915, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/home/jon/miniconda3/envs/simple-llm-finetuner/lib/python3.10/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/home/jon/miniconda3/envs/simple-llm-finetuner/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/home/jon/miniconda3/envs/simple-llm-finetuner/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/home/jon/miniconda3/envs/simple-llm-finetuner/lib/python3.10/site-packages/gradio/helpers.py", line 588, in tracked_fn
    response = fn(*args)
  File "/mnt/c/Users/Jon/repos/simple-llm-finetuner/app.py", line 180, in load_lora
    self.trainer.load_lora(f'{LORA_DIR}/{lora_name}')
  File "/mnt/c/Users/Jon/repos/simple-llm-finetuner/trainer.py", line 68, in load_lora
    self.model.set_adapter(lora_name)
  File "/home/jon/miniconda3/envs/simple-llm-finetuner/lib/python3.10/site-packages/peft/peft_model.py", line 404, in set_adapter
    raise ValueError(f"Adapter {adapter_name} not found.")
ValueError: Adapter lora/decapoda-research_llama-7b-hf_PYTHON-2 not found.
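
A hedged sketch of a possible guard in Trainer.load_lora (names inferred from the traceback above, not taken from the repo): attach the adapter to the current PeftModel if it isn't already present, then activate it.

```python
def load_lora(self, lora_path):
    adapter_name = lora_path  # the error message suggests the path doubles as the adapter name

    if adapter_name not in self.model.peft_config:
        # After switching base models, the new model instance has never seen
        # this adapter, so load it from disk before calling set_adapter().
        self.model.load_adapter(lora_path, adapter_name=adapter_name)

    self.model.set_adapter(adapter_name)
```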

Is CUDA 12.0 supported?

Is CUDA 12.0 supported? It along with the new cudnn library has some nice improvements for RTX 40-series cards.

how to finetune with 'system information'

Hello,

I am training with my custom dataset, and have a question there.
What I want to build is an assistant that can recommend the proper mode of a device depending on my conversation.

Before inserting Q/A pairs, I want to let the model know the general information about 'how to use' the device.
I tried inserting it like below.

SYSTEM:
    There are 4 options in the mode
    - mode1
    - mode2
    - mode3
    - mode4

    You need to generate 'json' format using USER input with the proper mode.
    Desired output format is below.
    {
        'mode': [selection of mode]
        'comments': [your response]
    }


USER: example1
ASSISTANCE: response1


USER: example2
ASSISTANCE: response2


USER: example3
ASSISTANCE: response3

But it seems like the model doesn't pick up the initial information about the device.

Is there any specific format, like 'USER' and 'ASSISTANCE', for teaching this information as well?

Thanks,

Issue in train in colab

While I run training in Colab, this error is shown:

Something went wrong
Connection errored out.

How can I solve this?

Attempting to use 13B in the simple tuner -

I updated main.py with decapoda-research/llama-13b-hf in all the spots that had 7B.
It downloaded the sharded parts all right,
but now I'm getting this config issue. Any advice would be appreciated.

File "/home/orwell/miniconda3/envs/llama-finetuner/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
return await future
File "/home/orwell/miniconda3/envs/llama-finetuner/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 867, in run
result = context.run(func, *args)
File "/home/orwell/miniconda3/envs/llama-finetuner/lib/python3.10/site-packages/gradio/helpers.py", line 587, in tracked_fn
response = fn(*args)
File "/home/orwell/simple-llama-finetuner/main.py", line 82, in generate_text
load_peft_model(peft_model)
File "/home/orwell/simple-llama-finetuner/main.py", line 35, in load_peft_model
model = peft.PeftModel.from_pretrained(
File "/home/orwell/miniconda3/envs/llama-finetuner/lib/python3.10/site-packages/peft/peft_model.py", line 135, in from_pretrained
config = PEFT_TYPE_TO_CONFIG_MAPPING[PeftConfig.from_pretrained(model_id).peft_type].from_pretrained(model_id)
File "/home/orwell/miniconda3/envs/llama-finetuner/lib/python3.10/site-packages/peft/utils/config.py", line 101, in from_pretrained
raise ValueError(f"Can't find config.json at '{pretrained_model_name_or_path}'")
ValueError: Can't find config.json at ''

The config file appears in the cache the same as it does for 7B - I'm assuming I'm missing something, just not sure what.

Thank you again

"The tokenizer class you load from this checkpoint is 'LLaMATokenizer'."

(llama) user@DESKTOP-CR45CKF:~/simple-llm-finetuner$ python app.py

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
CUDA SETUP: CUDA runtime path found: /home/user/anaconda3/envs/llama/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/user/anaconda3/envs/llama/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Traceback (most recent call last):
  File "/home/user/anaconda3/envs/llama/lib/python3.10/site-packages/gradio/routes.py", line 394, in run_predict
    output = await app.get_blocks().process_api(
  File "/home/user/anaconda3/envs/llama/lib/python3.10/site-packages/gradio/blocks.py", line 1075, in process_api
    result = await self.call_function(
  File "/home/user/anaconda3/envs/llama/lib/python3.10/site-packages/gradio/blocks.py", line 884, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/home/user/anaconda3/envs/llama/lib/python3.10/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/home/user/anaconda3/envs/llama/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/home/user/anaconda3/envs/llama/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/home/user/anaconda3/envs/llama/lib/python3.10/site-packages/gradio/helpers.py", line 587, in tracked_fn
    response = fn(*args)
  File "/home/user/simple-llm-finetuner/app.py", line 130, in train
    self.trainer.train(
  File "/home/user/simple-llm-finetuner/trainer.py", line 172, in train
    assert self.model is not None
AssertionError
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'.
The class this function is called from is 'LlamaTokenizer'.
Killed

Finetuning in unsupported language

My language was not on the list of 20 languages the original model was trained on.
Is it possible to finetune llama with a dataset in a language that was not included in the base model?

(WSL2) - No GPU / Cuda detected....

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
CUDA SETUP: WARNING! libcuda.so not found! Do you have a CUDA driver installed? If you are on a cluster, make sure you are on a CUDA machine!
CUDA SETUP: CUDA runtime path found: /home/user/anaconda3/envs/llama/lib/libcudart.so
/home/user/anaconda3/envs/llama/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: No GPU detected! Check your CUDA paths. Proceeding to load CPU-only library...
  warn(msg)
CUDA SETUP: Loading binary /home/user/anaconda3/envs/llama/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so...
/home/user/anaconda3/envs/llama/lib/python3.10/site-packages/bitsandbytes/cextension.py:31: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.

Question: Native windows support

I followed the instructions in the readme, but I'm getting AssertionError: Torch not compiled with CUDA enabled.
Running on an NVIDIA A4500, native Windows (not WSL).

Traceback

(llama-finetuner) D:\simple-llama-finetuner>python main.py

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
CUDA SETUP: Required library version not found: libsbitsandbytes_cpu.so. Maybe you need to compile it from source?
CUDA SETUP: Defaulting to libbitsandbytes_cpu.so...
argument of type 'WindowsPath' is not iterable
CUDA SETUP: Required library version not found: libsbitsandbytes_cpu.so. Maybe you need to compile it from source?
CUDA SETUP: Defaulting to libbitsandbytes_cpu.so...
argument of type 'WindowsPath' is not iterable
C:\Users\jerem\.conda\envs\llama-finetuner\lib\site-packages\bitsandbytes\cextension.py:31: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Loading base model...
Traceback (most recent call last):
  File "C:\Users\jerem\.conda\envs\llama-finetuner\lib\site-packages\gradio\routes.py", line 394, in run_predict
    output = await app.get_blocks().process_api(
  File "C:\Users\jerem\.conda\envs\llama-finetuner\lib\site-packages\gradio\blocks.py", line 1075, in process_api
    result = await self.call_function(
  File "C:\Users\jerem\.conda\envs\llama-finetuner\lib\site-packages\gradio\blocks.py", line 884, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "C:\Users\jerem\.conda\envs\llama-finetuner\lib\site-packages\anyio\to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "C:\Users\jerem\.conda\envs\llama-finetuner\lib\site-packages\anyio\_backends\_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "C:\Users\jerem\.conda\envs\llama-finetuner\lib\site-packages\anyio\_backends\_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "C:\Users\jerem\.conda\envs\llama-finetuner\lib\site-packages\gradio\helpers.py", line 587, in tracked_fn
    response = fn(*args)
  File "D:\simple-llama-finetuner\main.py", line 128, in tokenize_and_train
    if (model is None): load_base_model()
  File "D:\simple-llama-finetuner\main.py", line 18, in load_base_model
    model = transformers.LlamaForCausalLM.from_pretrained(
  File "C:\Users\jerem\.conda\envs\llama-finetuner\lib\site-packages\transformers\modeling_utils.py", line 2643, in from_pretrained
    ) = cls._load_pretrained_model(
  File "C:\Users\jerem\.conda\envs\llama-finetuner\lib\site-packages\transformers\modeling_utils.py", line 2966, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "C:\Users\jerem\.conda\envs\llama-finetuner\lib\site-packages\transformers\modeling_utils.py", line 673, in _load_state_dict_into_meta_model
    set_module_8bit_tensor_to_device(model, param_name, param_device, value=param)
  File "C:\Users\jerem\.conda\envs\llama-finetuner\lib\site-packages\transformers\utils\bitsandbytes.py", line 70, in set_module_8bit_tensor_to_device
    new_value = bnb.nn.Int8Params(new_value, requires_grad=False, has_fp16_weights=has_fp16_weights).to(device)
  File "C:\Users\jerem\.conda\envs\llama-finetuner\lib\site-packages\bitsandbytes\nn\modules.py", line 196, in to
    return self.cuda(device)
  File "C:\Users\jerem\.conda\envs\llama-finetuner\lib\site-packages\bitsandbytes\nn\modules.py", line 159, in cuda
    B = self.data.contiguous().half().cuda(device)
  File "C:\Users\jerem\AppData\Roaming\Python\Python310\site-packages\torch\cuda\__init__.py", line 221, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
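
A generic first check (not specific to this repo) is to confirm that the installed PyTorch build has CUDA support at all; on native Windows that usually means installing a CUDA-enabled wheel rather than the default CPU-only one.

```python
import torch

print(torch.__version__)          # a "+cpu" suffix means a CPU-only build
print(torch.cuda.is_available())  # must be True for 8-bit LoRA training
print(torch.version.cuda)         # None on CPU-only builds
```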

Examples to get started with

I see there are a few examples in the repo - it would be great to have a tutorial accompanying them, so newbies dipping their toes in this for the first time can get an idea of what's happening and score some early wins.

About llama-2-70B fine-tuning

Appreciate your great work!

Is it possible to fine-tune llama-2-70B on a 3×8×A100 (40 GB) configuration? Thanks!

Inference doesn't work after training

I trained my input text on an RTX 4080 (16 GB VRAM) with the default settings:

[screenshot of the default training settings]

And that seems to work OK:

TrainOutput(global_step=116, training_loss=1.0854247685136467, metrics={'train_runtime': 258.9812, 'train_samples_per_second': 0.448, 'train_steps_per_second': 0.448, 'train_loss': 1.0854247685136467, 'epoch': 1.0})

However, inference doesn't work, and I don't have enough context to understand why yet:

  File "/home/vadi/Programs/simple-llama-finetuner/main.py", line 27, in maybe_load_models
    model = LlamaForCausalLM.from_pretrained(
  File "/home/vadi/Programs/miniconda3/envs/finetuner2/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2588, in from_pretrained
    raise ValueError(
ValueError: 
                        Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit
                        the quantized model. If you want to dispatch the model on the CPU or the disk while keeping
                        these modules in 32-bit, you need to set `load_in_8bit_fp32_cpu_offload=True` and pass a custom
                        `device_map` to `from_pretrained`. Check
                        https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu
                        for more details.

Currently 12.5 / 16gb vram is being used, if that matters.
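
As the error message suggests, one possible (untested) workaround is to load with a quantization config that allows the fp32 modules to be offloaded to the CPU; this assumes a transformers version that exposes BitsAndBytesConfig, and the model id is a placeholder.

```python
import transformers

quant_config = transformers.BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True,  # keep non-quantized modules on the CPU
)
model = transformers.LlamaForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf",  # placeholder
    quantization_config=quant_config,
    device_map="auto",
)
```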

Performance after FineTuning

I have fine tuned llama using this repo and a few text documents I had with me.
If I provide 3-4 consecutive words from input text, it amazingly completes the next couple of sentences.
But if I ask the same information as a question or reorder the input prompt, it hallucinates.

I thought I was overfitting, so I increased the input data size and decreased the number of epochs, but then the model neither completed the sentences (when prompted as above) nor answered the questions.

I also tried using vector embedding search with a model on top of it to put things together, but that way it misses information spread across a few sentences. It also can't answer anything other than what/where-type questions when the answer spans multiple sentences, and it's even worse when it has to infer something from this information plus general knowledge. So that seems to be a not-so-fruitful approach.

My goal is to get llama to have knowledge of a few text documents I have locally.
Someone help me please.

Multi GPU running

Hi there! I would like to know how we can run this solution in a multi-GPU environment for bigger models. Thank you!
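
Multi-GPU support isn't a documented feature of this repo, but as a rough sketch: loading the base model with device_map="auto" (via accelerate) shards its layers across all visible GPUs. The model id below is a placeholder.

```python
import transformers

model = transformers.AutoModelForCausalLM.from_pretrained(
    "decapoda-research/llama-13b-hf",  # placeholder
    load_in_8bit=True,
    device_map="auto",  # accelerate spreads layers over every visible GPU
)
print(model.hf_device_map)  # shows which device each block was placed on
```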

Not a problem - but like people should know

https://arxiv.org/abs/2303.11366 is a really cool paper about reflection in LLMs.
[screenshot of generated output]
That is after training on about 20 samples for 50 epochs on my 3090 with the 7B model.

User: [Topic or question]

Assistant Hypothetical Response: [Brief or simplified answer to the topic or question]

Agent Reflection: [Critique of the hypothetical response, highlighting the limitations, inaccuracies, or areas that need improvement or expansion, while providing guidance on how to address these issues in the revised response]

Bot Actual Response: [The natural and contextually appropriate answer to the topic or question, as generated by the advanced language model, which incorporates the suggestions and improvements from the agent reflection for a more comprehensive and accurate response]

This, plus training sets generated with this framework, seems to really improve the generations of these models with fairly limited training sets. Just thought I would share.

RuntimeError: expected scalar type Half but found Float

CUDA SETUP: Loading binary /home/opc/anaconda3/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda114_nocublaslt.so...
Running on local URL:  http://127.0.0.1:7860
Running on public URL: https://b38eaf88d60145f161.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades (NEW!), check out Spaces: https://huggingface.co/spaces
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'.
The class this function is called from is 'LlamaTokenizer'.
/home/opc/anaconda3/lib/python3.9/site-packages/peft/utils/other.py:76: FutureWarning: prepare_model_for_int8_training is deprecated and will be removed in a future version. Use prepare_model_for_kbit_training instead.
  warnings.warn(
/home/opc/anaconda3/lib/python3.9/site-packages/bitsandbytes/autograd/_functions.py:318: UserWarning: MatMul8bitLt: inputs will be cast from torch.float32 to float16 during quantization
  warnings.warn(f"MatMul8bitLt: inputs will be cast from {A.dtype} to float16 during quantization")
Traceback (most recent call last):
  File "/home/opc/anaconda3/lib/python3.9/site-packages/gradio/routes.py", line 399, in run_predict
    output = await app.get_blocks().process_api(
  File "/home/opc/anaconda3/lib/python3.9/site-packages/gradio/blocks.py", line 1299, in process_api
    result = await self.call_function(
  File "/home/opc/anaconda3/lib/python3.9/site-packages/gradio/blocks.py", line 1022, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/home/opc/anaconda3/lib/python3.9/site-packages/anyio/to_thread.py", line 28, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(func, *args, cancellable=cancellable,
  File "/home/opc/anaconda3/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 818, in run_sync_in_worker_thread
    return await future
  File "/home/opc/anaconda3/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 754, in run
    result = context.run(func, *args)
  File "/home/opc/anaconda3/lib/python3.9/site-packages/gradio/helpers.py", line 588, in tracked_fn
    response = fn(*args)
  File "/home/opc/simple-llama-finetuner/app.py", line 131, in train
    self.trainer.train(
  File "/home/opc/simple-llama-finetuner/trainer.py", line 273, in train
    result = self.trainer.train(resume_from_checkpoint=False)
  File "/home/opc/anaconda3/lib/python3.9/site-packages/transformers/trainer.py", line 1696, in train
    return inner_training_loop(
  File "/home/opc/anaconda3/lib/python3.9/site-packages/transformers/trainer.py", line 1972, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/opc/anaconda3/lib/python3.9/site-packages/transformers/trainer.py", line 2796, in training_step
    self.scaler.scale(loss).backward()
  File "/home/opc/anaconda3/lib/python3.9/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/home/opc/anaconda3/lib/python3.9/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/opc/anaconda3/lib/python3.9/site-packages/torch/autograd/function.py", line 267, in apply
    return user_fn(self, *args)
  File "/home/opc/anaconda3/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 157, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/home/opc/anaconda3/lib/python3.9/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/opc/anaconda3/lib/python3.9/site-packages/torch/autograd/function.py", line 267, in apply
    return user_fn(self, *args)
  File "/home/opc/anaconda3/lib/python3.9/site-packages/bitsandbytes/autograd/_functions.py", line 476, in backward
    grad_A = torch.matmul(grad_output, CB).view(ctx.grad_shape).to(ctx.dtype_A)
RuntimeError: expected scalar type Half but found Float

Collecting info on memory requirements

Not an issue; just gathering information from fine-tuning attempts.

Please leave the info on:

  • size of your training set
  • the available VRAM
  • result (was training successful or you ran out of memory)

This would save us time and help us better gauge our machine capabilities.

What should the fine-tuning output look like?

Thanks for the great resources.

I tried to fine-tune with my custom data and could see the output message. Afterwards, when I tried to run inference with the model I trained, the output looked totally messed up, as if it hadn't been trained.

My output message for the training is:

{'train_runtime': 5.8012, 'train_samples_per_second': 1.207, 'train_steps_per_second': 1.207, 'train_loss': 2.129377910069057, 'epoch': 1.0}

Is this the correct output that we should see? I am wondering if it just trains for a single epoch by default.

Thanks,
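
For what it's worth, in a plain transformers Trainer setup the epoch count comes from TrainingArguments, so a log ending at epoch 1.0 just reflects num_train_epochs=1. A hedged sketch (the values below are placeholders, not this app's defaults):

```python
import transformers

args = transformers.TrainingArguments(
    output_dir="lora/my-adapter",   # placeholder path
    num_train_epochs=3,             # train for three passes over the data instead of one
    per_device_train_batch_size=1,
    learning_rate=3e-4,
)
```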

How should I prepare the dataset for generative question answering on the private documents?

Hello,
Thanks for creating this very helpful tool!
I am fine-tuning the model (GPT-J-6B) for question answering on private documents. I have 1000+ documents, all in text format. And of course, I will be going with PEFT LoRA.

But the question is...

How should I prepare my dataset?

Since this is a question-answering scenario, my first thought was to prepare the dataset in "Question: {} Answer: {} Context: {}" format, but since there are so many documents, I would first need to generate the questions and then the answers, and... you know, it becomes infeasible.

Then I thought I should "just provide the raw text" to the model as the knowledge base and choose a model that was already fine-tuned on the Alpaca dataset (so it understands instructions - for that I will use the "nlpcloud/instruct-gpt-j-fp16" model), and my hope is that the model will then respond to my questions.

So is what I am doing correct? How should I prepare my dataset for question answering?
Please help.
Thanks 🙏🏻

AttributeError: type object 'Dataset' has no attribute 'from_list'

I was trying to finetune on a raw text file. It has a few empty lines too. I'm getting this error.
When I looked into the Dataset class, I didn't find a from_list function. There were others like from_dict and from_text (which reads from a file). I wanted to know if this line of code needs to be changed.

PS: I tried replacing that line with data = datasets.Dataset.from_text(<file path>) and the training seems to be working fine. But I'm not sure how newline and multiple newline characters affect the training performance. Would appreciate some light shed on that.
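
For reference, a hedged sketch of a fallback for older datasets releases that predate Dataset.from_list; the resulting dataset content is equivalent either way.

```python
import datasets

paragraphs = [{"text": "sample one"}, {"text": "sample two"}]  # placeholder data

try:
    data = datasets.Dataset.from_list(paragraphs)      # newer datasets releases
except AttributeError:
    data = datasets.Dataset.from_dict(                 # older releases
        {"text": [p["text"] for p in paragraphs]}
    )
```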

To create a public link, set `share=True` in `launch()`.
Loading base model...
Number of samples: 28
Traceback (most recent call last):
  File "/home/datta0/.pyenv/versions/3.8.10/lib/python3.8/site-packages/gradio/routes.py", line 393, in run_predict
    output = await app.get_blocks().process_api(
  File "/home/datta0/.pyenv/versions/3.8.10/lib/python3.8/site-packages/gradio/blocks.py", line 1108, in process_api
    result = await self.call_function(
  File "/home/datta0/.pyenv/versions/3.8.10/lib/python3.8/site-packages/gradio/blocks.py", line 915, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/home/datta0/.pyenv/versions/3.8.10/lib/python3.8/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/home/datta0/.pyenv/versions/3.8.10/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/home/datta0/.pyenv/versions/3.8.10/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/home/datta0/.pyenv/versions/3.8.10/lib/python3.8/site-packages/gradio/helpers.py", line 588, in tracked_fn
    response = fn(*args)
  File "main.py", line 161, in tokenize_and_train
    data = datasets.Dataset.from_list(paragraphs)
AttributeError: type object 'Dataset' has no attribute 'from_list'
Number of samples: 11
Traceback (most recent call last):
  File "/home/datta0/.pyenv/versions/3.8.10/lib/python3.8/site-packages/gradio/routes.py", line 393, in run_predict
    output = await app.get_blocks().process_api(
  File "/home/datta0/.pyenv/versions/3.8.10/lib/python3.8/site-packages/gradio/blocks.py", line 1108, in process_api
    result = await self.call_function(
  File "/home/datta0/.pyenv/versions/3.8.10/lib/python3.8/site-packages/gradio/blocks.py", line 915, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/home/datta0/.pyenv/versions/3.8.10/lib/python3.8/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/home/datta0/.pyenv/versions/3.8.10/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/home/datta0/.pyenv/versions/3.8.10/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/home/datta0/.pyenv/versions/3.8.10/lib/python3.8/site-packages/gradio/helpers.py", line 588, in tracked_fn
    response = fn(*args)
  File "main.py", line 161, in tokenize_and_train
    data = datasets.Dataset.from_list(paragraphs)
AttributeError: type object 'Dataset' has no attribute 'from_list'

AMD GPU compability or CPU

Hello, I want to know if there is a way to fine-tune with an AMD GPU or the CPU.
To explain: I have an RX 6600 XT and an i5-10400F, and I want to fine-tune a very small model, but I can't because of the NVIDIA GPU requirement.

So if you know of anything I can do to fine-tune a model with my hardware, I'll take it!

Thanks in advance.
PS: I don't speak English very well, I apologize.

`LLaMATokenizer` vs `LlamaTokenizer` class names

Running inference gives the following warning:

Loading tokenizer...
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'. 
The class this function is called from is 'LlamaTokenizer'.

Is it a problem?

Inference output text keeps running on...

Model: Vanilla LLaMA

Input:

Why did the chicken cross the road?

Output:

Why did the chicken cross the road? To get to the other side.
Why did the chicken cross the road? To get to the other side. Why did the chicken cross the road? To get to the other side. Why did the chicken cross the road? To

Using text-generation-webui:

python server.py --load-in-8bit --listen --model llama-7B
Why did the chicken cross the road?? To get to the other side.
Why did the chicken cross the road? Because it was a free range chicken and it wanted to go home!

I need to tweak the inference code
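
For reference, a hedged sketch of generation settings that usually rein in this kind of runaway repetition; the values are illustrative, and `model`/`tokenizer` are assumed to be loaded already.

```python
inputs = tokenizer("Why did the chicken cross the road?", return_tensors="pt").to(model.device)
output_ids = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.15,              # discourage repeating the prompt
    eos_token_id=tokenizer.eos_token_id,  # stop at end-of-sequence
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```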

How to use CPU instead of GPU

Can anyone tell me how I can use the CPU for fine-tuning instead of the GPU? I do not have one. Also, where are the downloaded model files located on Windows?

Thank you in advance.

Inference works just once

When I first load a model and ask it to infer, I get a good result. On a second inference it ignores what it should respond with and just keeps on generating more on the original input text. On the third inference, it just gives stacktraces and any subsequent inferences repeat the stacktraces:

Traceback (most recent call last):
  File "/home/vadi/Programs/miniconda3/envs/finetuner2/lib/python3.10/site-packages/gradio/routes.py", line 394, in run_predict
    output = await app.get_blocks().process_api(
  File "/home/vadi/Programs/miniconda3/envs/finetuner2/lib/python3.10/site-packages/gradio/blocks.py", line 1075, in process_api
    result = await self.call_function(
  File "/home/vadi/Programs/miniconda3/envs/finetuner2/lib/python3.10/site-packages/gradio/blocks.py", line 884, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/home/vadi/Programs/miniconda3/envs/finetuner2/lib/python3.10/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/home/vadi/Programs/miniconda3/envs/finetuner2/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/home/vadi/Programs/miniconda3/envs/finetuner2/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/home/vadi/Programs/miniconda3/envs/finetuner2/lib/python3.10/site-packages/gradio/helpers.py", line 587, in tracked_fn
    response = fn(*args)
  File "/home/vadi/Programs/simple-llama-finetuner/main.py", line 67, in generate_text
    model = PeftModel.from_pretrained(
  File "/home/vadi/Programs/miniconda3/envs/finetuner2/lib/python3.10/site-packages/peft/peft_model.py", line 138, in from_pretrained
    remove_hook_from_submodules(model)
  File "/home/vadi/Programs/miniconda3/envs/finetuner2/lib/python3.10/site-packages/accelerate/hooks.py", line 407, in remove_hook_from_submodules
    remove_hook_from_submodules(child)
  File "/home/vadi/Programs/miniconda3/envs/finetuner2/lib/python3.10/site-packages/accelerate/hooks.py", line 405, in remove_hook_from_submodules
    remove_hook_from_module(module)
  File "/home/vadi/Programs/miniconda3/envs/finetuner2/lib/python3.10/site-packages/accelerate/hooks.py", line 187, in remove_hook_from_module
    delattr(module, "_hf_hook")
  File "/home/vadi/Programs/miniconda3/envs/finetuner2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1328, in __delattr__
    super().__delattr__(name)
AttributeError: _hf_hook

Do others get the same?

Traceback during inference.

Colab, gives the following error during inference:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/gradio/routes.py", line 394, in run_predict
    output = await app.get_blocks().process_api(
  File "/usr/local/lib/python3.9/dist-packages/gradio/blocks.py", line 1075, in process_api
    result = await self.call_function(
  File "/usr/local/lib/python3.9/dist-packages/gradio/blocks.py", line 884, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/usr/local/lib/python3.9/dist-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/usr/local/lib/python3.9/dist-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/usr/local/lib/python3.9/dist-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/usr/local/lib/python3.9/dist-packages/gradio/helpers.py", line 587, in tracked_fn
    response = fn(*args)
  File "/content/simple-llama-finetuner/main.py", line 121, in generate_text
    generation_output = model.generate(
  File "/usr/local/lib/python3.9/dist-packages/peft/peft_model.py", line 581, in generate
    outputs = self.base_model.generate(**kwargs)
  File "/usr/local/lib/python3.9/dist-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/transformers/generation/utils.py", line 1451, in generate
    return self.sample(
  File "/usr/local/lib/python3.9/dist-packages/transformers/generation/utils.py", line 2467, in sample
    outputs = self(
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/transformers/models/llama/modeling_llama.py", line 765, in forward
    outputs = self.model(
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/transformers/models/llama/modeling_llama.py", line 614, in forward
    layer_outputs = decoder_layer(
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/transformers/models/llama/modeling_llama.py", line 309, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/transformers/models/llama/modeling_llama.py", line 209, in forward
    query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/peft/tuners/lora.py", line 522, in forward
    result = super().forward(x)
  File "/usr/local/lib/python3.9/dist-packages/bitsandbytes/nn/modules.py", line 242, in forward
    out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
  File "/usr/local/lib/python3.9/dist-packages/bitsandbytes/autograd/_functions.py", line 488, in matmul
    return MatMul8bitLt.apply(A, B, out, bias, state)
  File "/usr/local/lib/python3.9/dist-packages/bitsandbytes/autograd/_functions.py", line 317, in forward
    state.CxB, state.SB = F.transform(state.CB, to_order=formatB)
  File "/usr/local/lib/python3.9/dist-packages/bitsandbytes/functional.py", line 1698, in transform
    prev_device = pre_call(A.device)
AttributeError: 'NoneType' object has no attribute 'device'

Getting the repo id error from the web interface

Does anybody know how I can solve this error?
Error:
Repo id must use alphanumeric chars or '-', '_', '.', '--' and '..' are forbidden, '-' and '.' cannot start or end the name, max length is 96: ''.

Getting OOM

Training on T4:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 44.00 MiB (GPU 0; 14.56 GiB total capacity; 13.25 GiB already allocated; 10.44 MiB free; 13.83 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I suspect a change of versions in peft or transformers... Does that make sense?

Question: Is fine tuning suitable for factual answers from custom data, or is it better to use vector databases and use only the relevant chunk in the prompt for factual answers?

I know that in the case of OpenAI, fine-tuning doesn't work by providing my own data so that the model can then use it. Rather, it works by teaching the model what style of language to use. So if I want GPT to use my data, I have to compute embeddings, store them in a vector database, and then put the relevant chunk of data back into the GPT prompt.

Is it similar here?

How do I merge trained LoRA and Llama 7B weights?

How do I merge trained LoRA and Llama 7B weights? Is there a script? It would make it much easier to share weights and would help with portability, file management, etc.

It would be an amazing feature for the training tab as well!
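
There isn't a merge script in this repo as far as I know, but here is a hedged sketch of the usual peft route; the paths and model id are placeholders, and the merge should be done on non-quantized (fp16/fp32) weights.

```python
import peft
import transformers

base = transformers.AutoModelForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf",  # placeholder; load in fp16/fp32, not 8-bit
    torch_dtype="auto",
)
model = peft.PeftModel.from_pretrained(base, "lora/my-adapter")

# On older peft versions this may be model.base_model.merge_and_unload() instead.
merged = model.merge_and_unload()

merged.save_pretrained("llama-7b-merged")  # standalone full-weight checkpoint
transformers.AutoTokenizer.from_pretrained(
    "decapoda-research/llama-7b-hf"
).save_pretrained("llama-7b-merged")
```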
