
simple-llm-finetuner's Introduction

title: Simple LLM Finetuner
emoji: 🦙
colorFrom: yellow
colorTo: orange
sdk: gradio
app_file: app.py
pinned: false

👻👻👻 This project is effectively dead. Please use one of the following tools instead:


🦙 Simple LLM Finetuner

Open In Colab Open In Spaces

Simple LLM Finetuner is a beginner-friendly interface designed to facilitate fine-tuning various language models using the LoRA method via the PEFT library on commodity NVIDIA GPUs. With a small dataset and sample lengths of 256, you can even run this on a regular Colab Tesla T4 instance.

With this intuitive UI, you can easily manage your dataset, customize parameters, train, and evaluate the model's inference capabilities.

Acknowledgements

Features

  • Simply paste datasets in the UI, separated by double blank lines
  • Adjustable parameters for fine-tuning and inference
  • Beginner-friendly UI with explanations for each parameter

Getting Started

Prerequisites

  • Linux or WSL
  • Modern NVIDIA GPU with >= 16 GB of VRAM (but it might be possible to run with less for smaller sample lengths)

Usage

I recommend using a virtual environment to install the required packages. Conda preferred.

conda create -n simple-llm-finetuner python=3.10
conda activate simple-llm-finetuner
conda install -y cuda -c nvidia/label/cuda-11.7.0
conda install -y pytorch=2 pytorch-cuda=11.7 -c pytorch

On WSL, you might need to install CUDA manually by following these steps, then running the following before you launch:

export LD_LIBRARY_PATH=/usr/lib/wsl/lib

Clone the repository and install the required packages.

git clone https://github.com/lxe/simple-llm-finetuner.git
cd simple-llm-finetuner
pip install -r requirements.txt

Launch it

python app.py

Open http://127.0.0.1:7860/ in your browser. Prepare your training data by separating each sample with 2 blank lines. Paste the whole training dataset into the textbox. Specify the new LoRA adapter name in the "New PEFT Adapter Name" textbox, then click train. You might need to adjust the max sequence length and batch size to fit your GPU memory. The model will be saved in the lora/ directory.
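
For reference, here is a rough sketch (not the app's actual code) of what the two-blank-line convention amounts to once the pasted text is split into samples; the exact splitting logic in the repo may differ slightly.

```python
# Illustrative only: split pasted text on double blank lines into samples and
# build a Hugging Face dataset from them, roughly what the trainer does.
import datasets

raw_text = """First training sample, possibly
spanning several lines.


Second training sample."""

# Two blank lines between samples means three consecutive newlines.
samples = [s.strip() for s in raw_text.split("\n\n\n") if s.strip()]

data = datasets.Dataset.from_list([{"text": s} for s in samples])  # datasets >= 2.6
print(data.num_rows)  # 2
```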

After training is done, navigate to "Inference" tab, select your LoRA, and play with it.

Have fun!

YouTube Walkthrough

https://www.youtube.com/watch?v=yM1wanDkNz8

License

MIT License

simple-llm-finetuner's People

Contributors

64-bit, lxe, recursionbane, simomay, vadi2


simple-llm-finetuner's Issues

In trainer.py, ignoring the last token is not suitable for all situations.

    def tokenize_sample(self, item, max_seq_length, add_eos_token=True):
        assert self.tokenizer is not None
        result = self.tokenizer(
            item["text"],
            truncation=True,
            max_length=max_seq_length,
            padding="max_length",
        )

        # ignore the last token [:-1]
        result = {
            "input_ids": result["input_ids"][:-1],
            "attention_mask": result["attention_mask"][:-1],
        }

https://github.com/lxe/simple-llm-finetuner/blob/3c3ae84e5dee5a1d40f17e5567938dfdffce9d16/trainer.py#LL150C9-L153C10

If a user of the web UI trains on a custom dataset, they will not know that the last token of each training sample is truncated,
and the prediction results come out unexpected.
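
One possible alternative (an untested sketch, not code from this repo): drop the trailing token only when the sample actually reached max_seq_length, and otherwise make sure an EOS token ends it. Padding is deliberately left to a data collator here, which is a departure from the original padding="max_length" call.

```python
def tokenize_sample(self, item, max_seq_length, add_eos_token=True):
    assert self.tokenizer is not None
    result = self.tokenizer(
        item["text"],
        truncation=True,
        max_length=max_seq_length,
    )

    input_ids = result["input_ids"]
    attention_mask = result["attention_mask"]

    if len(input_ids) >= max_seq_length:
        # Truncated sample: drop the (possibly cut-off) final token, as before.
        input_ids = input_ids[:-1]
        attention_mask = attention_mask[:-1]
    elif add_eos_token and input_ids[-1] != self.tokenizer.eos_token_id:
        # Short sample: keep every token and append EOS so the model learns
        # where a sample ends instead of losing its last real token.
        input_ids = input_ids + [self.tokenizer.eos_token_id]
        attention_mask = attention_mask + [1]

    return {"input_ids": input_ids, "attention_mask": attention_mask}
```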

RuntimeError: unscale_() has already been called on this optimizer since the last update().

The error message is attached below:

Parameter 'function'=<function Trainer.tokenize_training_text.. at 0x7f8eb04d5c60> of the transform datasets.arrow_dataset.Dataset.map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the __call__ method is faster than using a method to encode the text followed by a call to the pad method to get a padded encoding.
{'loss': 2.4296, 'learning_rate': 0.0002901960784313725, 'epoch': 0.1}
{'loss': 2.271, 'learning_rate': 0.0002803921568627451, 'epoch': 0.2}
{'loss': 2.2099, 'learning_rate': 0.0002705882352941176, 'epoch': 0.29}
{'loss': 2.2199, 'learning_rate': 0.00026078431372549016, 'epoch': 0.39}
{'loss': 2.1911, 'learning_rate': 0.00025098039215686274, 'epoch': 0.49}
{'loss': 2.2129, 'learning_rate': 0.00024117647058823527, 'epoch': 0.59}
{'loss': 2.1752, 'learning_rate': 0.00023137254901960783, 'epoch': 0.68}
{'loss': 2.1841, 'learning_rate': 0.0002215686274509804, 'epoch': 0.78}
{'loss': 2.1827, 'learning_rate': 0.00021176470588235295, 'epoch': 0.88}
{'loss': 2.1514, 'learning_rate': 0.00020196078431372548, 'epoch': 0.98}
Traceback (most recent call last):
File "/home/user/.pyenv/versions/3.10.12/lib/python3.10/site-packages/gradio/routes.py", line 437, in run_predict
output = await app.get_blocks().process_api(
File "/home/user/.pyenv/versions/3.10.12/lib/python3.10/site-packages/gradio/blocks.py", line 1352, in process_api
result = await self.call_function(
File "/home/user/.pyenv/versions/3.10.12/lib/python3.10/site-packages/gradio/blocks.py", line 1077, in call_function
prediction = await anyio.to_thread.run_sync(
File "/home/user/.pyenv/versions/3.10.12/lib/python3.10/site-packages/anyio/to_thread.py", line 33, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/home/user/.pyenv/versions/3.10.12/lib/python3.10/site-packages/anyio/backends/asyncio.py", line 877, in run_sync_in_worker_thread
return await future
File "/home/user/.pyenv/versions/3.10.12/lib/python3.10/site-packages/anyio/backends/asyncio.py", line 807, in run
result = context.run(func, *args)
File "/home/user/.pyenv/versions/3.10.12/lib/python3.10/site-packages/gradio/helpers.py", line 602, in tracked_fn
response = fn(*args)
File "/home/user/app/app.py", line 131, in train
self.trainer.train(
File "/home/user/app/trainer.py", line 273, in train
result = self.trainer.train(resume_from_checkpoint=False)
File "/home/user/.pyenv/versions/3.10.12/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
return inner_training_loop(
File "/home/user/.pyenv/versions/3.10.12/lib/python3.10/site-packages/transformers/trainer.py", line 1850, in inner_training_loop
self.accelerator.clip_grad_norm
(
File "/home/user/.pyenv/versions/3.10.12/lib/python3.10/site-packages/accelerate/accelerator.py", line 1913, in clip_grad_norm

self.unscale_gradients()
File "/home/user/.pyenv/versions/3.10.12/lib/python3.10/site-packages/accelerate/accelerator.py", line 1876, in unscale_gradients
self.scaler.unscale
(opt)
File "/home/user/.pyenv/versions/3.10.12/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 275, in unscale

raise RuntimeError("unscale
() has already been called on this optimizer since the last update().")
RuntimeError: unscale
() has already been called on this optimizer since the last update().

"error" in training - AttributeError: 'CastOutputToFloat' object has no attribute 'weight', RuntimeError: Only Tensors of floating point and complex dtype can require gradients

WSL2 Ubuntu, new install, I get the following error after it downloads the weights and tries to train.
Sorry I can't give more details, but I'm really not sure what's going on.

Number of samples: 534
Traceback (most recent call last):
File "/home/ckg/.local/lib/python3.10/site-packages/gradio/routes.py", line 394, in run_predict
output = await app.get_blocks().process_api(
File "/home/ckg/.local/lib/python3.10/site-packages/gradio/blocks.py", line 1075, in process_api
result = await self.call_function(
File "/home/ckg/.local/lib/python3.10/site-packages/gradio/blocks.py", line 884, in call_function
prediction = await anyio.to_thread.run_sync(
File "/home/ckg/.local/lib/python3.10/site-packages/anyio/to_thread.py", line 31, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/home/ckg/.local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
return await future
File "/home/ckg/.local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 867, in run
result = context.run(func, *args)
File "/home/ckg/.local/lib/python3.10/site-packages/gradio/helpers.py", line 587, in tracked_fn
response = fn(*args)
File "/home/ckg/github/simple-llama-finetuner/main.py", line 164, in tokenize_and_train
model = peft.prepare_model_for_int8_training(model)
File "/home/ckg/.local/lib/python3.10/site-packages/peft/utils/other.py", line 72, in prepare_model_for_int8_training
File "/home/ckg/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1614, in getattr
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'CastOutputToFloat' object has no attribute 'weight'

Suggestion to improve UX

Thank you for this project! I tried it, and unlike some others it worked (with llama 7b and 2080 ti).
Now I'd like to scale up my experiments.

  1. In order to do so, I would need an option to initiate training programmatically. Technically, I'd be able to extract what I need from main.py, but it would be great if there were an already tested example (a rough sketch follows below).
  2. Secondly, I'd like to see an example of how to convert the directory with checkpoints into a standalone model.

Would you please share your thoughts on this or perhaps a link to where it's already implemented? Thank you in advance.
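
On point 1, here is a rough, untested sketch of what programmatic LoRA training with transformers + peft generally looks like; the model id, hyperparameters, and paths below are placeholders, not this repo's defaults.

```python
import datasets
import peft
import transformers

model_name = "decapoda-research/llama-7b-hf"  # placeholder

tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # LLaMA tokenizers ship without a pad token

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_name, load_in_8bit=True, device_map="auto"
)
model = peft.prepare_model_for_int8_training(model)
model = peft.get_peft_model(model, peft.LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
))

# Toy dataset; replace with your own samples.
data = datasets.Dataset.from_list([{"text": "Example training sample."}])
data = data.map(lambda s: tokenizer(s["text"], truncation=True, max_length=256))

trainer = transformers.Trainer(
    model=model,
    train_dataset=data,
    args=transformers.TrainingArguments(
        output_dir="lora/my-adapter",
        per_device_train_batch_size=1,
        num_train_epochs=1,
        learning_rate=3e-4,
        fp16=True,
        logging_steps=10,
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("lora/my-adapter")  # only the adapter weights are written
```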

Error: Adapter lora/decapoda-research_llama-{ADAPTER_NAME} not found.

I have found a resolution and root cause for this issue, I am documenting the reproduction steps here to keep the PR more organized.

Minimum Reproduction Steps

  1. Create at least 2 LoRA adapters for a model, 'Initial Model'
  2. On the Inference tab, select one of the LoRA's, 'Initial LoRA'
  3. Switch the model to one of the other models 'Alternative Model'
  4. Switch the model back to 'Initial Model'
  5. Switch the LoRA to the 2nd lora that was created
  6. Switch the LoRA back to 'Initial LoRA'

This error will be displayed: "Adapter lora/decapoda-research_llama-7b-hf_PYTHON-2 not found."

Callstack:

Traceback (most recent call last):
  File "/home/jon/miniconda3/envs/simple-llm-finetuner/lib/python3.10/site-packages/gradio/routes.py", line 393, in run_predict
    output = await app.get_blocks().process_api(
  File "/home/jon/miniconda3/envs/simple-llm-finetuner/lib/python3.10/site-packages/gradio/blocks.py", line 1108, in process_api
    result = await self.call_function(
  File "/home/jon/miniconda3/envs/simple-llm-finetuner/lib/python3.10/site-packages/gradio/blocks.py", line 915, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/home/jon/miniconda3/envs/simple-llm-finetuner/lib/python3.10/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/home/jon/miniconda3/envs/simple-llm-finetuner/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/home/jon/miniconda3/envs/simple-llm-finetuner/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/home/jon/miniconda3/envs/simple-llm-finetuner/lib/python3.10/site-packages/gradio/helpers.py", line 588, in tracked_fn
    response = fn(*args)
  File "/mnt/c/Users/Jon/repos/simple-llm-finetuner/app.py", line 180, in load_lora
    self.trainer.load_lora(f'{LORA_DIR}/{lora_name}')
  File "/mnt/c/Users/Jon/repos/simple-llm-finetuner/trainer.py", line 68, in load_lora
    self.model.set_adapter(lora_name)
  File "/home/jon/miniconda3/envs/simple-llm-finetuner/lib/python3.10/site-packages/peft/peft_model.py", line 404, in set_adapter
    raise ValueError(f"Adapter {adapter_name} not found.")
ValueError: Adapter lora/decapoda-research_llama-7b-hf_PYTHON-2 not found.
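
A hedged sketch of a possible guard in Trainer.load_lora (names inferred from the traceback above, not taken from the repo): attach the adapter to the current PeftModel if it isn't already present, then activate it.

```python
def load_lora(self, lora_path):
    adapter_name = lora_path  # the error message suggests the path doubles as the adapter name

    if adapter_name not in self.model.peft_config:
        # After switching base models, the new model instance has never seen
        # this adapter, so load it from disk before calling set_adapter().
        self.model.load_adapter(lora_path, adapter_name=adapter_name)

    self.model.set_adapter(adapter_name)
```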

Is CUDA 12.0 supported?

Is CUDA 12.0 supported? It along with the new cudnn library has some nice improvements for RTX 40-series cards.

how to finetune with 'system information'

Hello,

I am training with my custom dataset, and have a question there.
What I want to build is an assistant that can recommend the proper mode of a device depending on my conversation.

Before inserting Q/A pairs, I want to let the model know the general information about 'how to use' the device.
I tried inserting it like below.

SYSTEM:
    There are 4 options in the mode
    - mode1
    - mode2
    - mode3
    - mode4

    You need to generate 'json' format using USER input with the proper mode.
    Desired output format is below.
    {
        'mode': [selection of mode]
        'comments': [your response]
    }


USER: example1
ASSISTANCE: response1


USER: example2
ASSISTANCE: response2


USER: example3
ASSISTANCE: response3

But it seems like the model doesn't pick up the initial information about the device.

Is there any specific format, like 'USER' and 'ASSISTANCE', for teaching this information as well?

Thanks,

Issue in train in colab

While I run training in Colab, this error is shown:

Something went wrong
Connection errored out.

How can I solve this?

Attempting to use 13B in the simple tuner -

I updated main.py with decapoda-research/llama-13b-hf in all the spots that had 7B.
It downloaded the sharded parts all right,
but now I'm getting this config issue. Any advice would be appreciated.

File "/home/orwell/miniconda3/envs/llama-finetuner/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
return await future
File "/home/orwell/miniconda3/envs/llama-finetuner/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 867, in run
result = context.run(func, *args)
File "/home/orwell/miniconda3/envs/llama-finetuner/lib/python3.10/site-packages/gradio/helpers.py", line 587, in tracked_fn
response = fn(*args)
File "/home/orwell/simple-llama-finetuner/main.py", line 82, in generate_text
load_peft_model(peft_model)
File "/home/orwell/simple-llama-finetuner/main.py", line 35, in load_peft_model
model = peft.PeftModel.from_pretrained(
File "/home/orwell/miniconda3/envs/llama-finetuner/lib/python3.10/site-packages/peft/peft_model.py", line 135, in from_pretrained
config = PEFT_TYPE_TO_CONFIG_MAPPING[PeftConfig.from_pretrained(model_id).peft_type].from_pretrained(model_id)
File "/home/orwell/miniconda3/envs/llama-finetuner/lib/python3.10/site-packages/peft/utils/config.py", line 101, in from_pretrained
raise ValueError(f"Can't find config.json at '{pretrained_model_name_or_path}'")
ValueError: Can't find config.json at ''

The config file appears in the cache the same as it does for 7B - I'm assuming I'm missing something, just not sure what.

Thank you again

"The tokenizer class you load from this checkpoint is 'LLaMATokenizer'."

(llama) user@DESKTOP-CR45CKF:~/simple-llm-finetuner$ python app.py

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
CUDA SETUP: CUDA runtime path found: /home/user/anaconda3/envs/llama/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/user/anaconda3/envs/llama/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Traceback (most recent call last):
  File "/home/user/anaconda3/envs/llama/lib/python3.10/site-packages/gradio/routes.py", line 394, in run_predict
    output = await app.get_blocks().process_api(
  File "/home/user/anaconda3/envs/llama/lib/python3.10/site-packages/gradio/blocks.py", line 1075, in process_api
    result = await self.call_function(
  File "/home/user/anaconda3/envs/llama/lib/python3.10/site-packages/gradio/blocks.py", line 884, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/home/user/anaconda3/envs/llama/lib/python3.10/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/home/user/anaconda3/envs/llama/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/home/user/anaconda3/envs/llama/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/home/user/anaconda3/envs/llama/lib/python3.10/site-packages/gradio/helpers.py", line 587, in tracked_fn
    response = fn(*args)
  File "/home/user/simple-llm-finetuner/app.py", line 130, in train
    self.trainer.train(
  File "/home/user/simple-llm-finetuner/trainer.py", line 172, in train
    assert self.model is not None
AssertionError
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'.
The class this function is called from is 'LlamaTokenizer'.
Killed

Finetuning in unsupported language

My language was not on the list of 20 languages the original model was trained on.
Is it possible to finetune llama with a dataset in a language that was not included in the base model?

(WSL2) - No GPU / Cuda detected....

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
CUDA SETUP: WARNING! libcuda.so not found! Do you have a CUDA driver installed? If you are on a cluster, make sure you are on a CUDA machine!
CUDA SETUP: CUDA runtime path found: /home/user/anaconda3/envs/llama/lib/libcudart.so
/home/user/anaconda3/envs/llama/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: No GPU detected! Check your CUDA paths. Proceeding to load CPU-only library...
  warn(msg)
CUDA SETUP: Loading binary /home/user/anaconda3/envs/llama/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so...
/home/user/anaconda3/envs/llama/lib/python3.10/site-packages/bitsandbytes/cextension.py:31: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.

Question: Native windows support

I followed the instructions in the readme, but I'm getting AssertionError: Torch not compiled with CUDA enabled.
Running on an NVIDIA A4500, native Windows (not WSL).

Traceback

(llama-finetuner) D:\simple-llama-finetuner>python main.py

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
CUDA SETUP: Required library version not found: libsbitsandbytes_cpu.so. Maybe you need to compile it from source?
CUDA SETUP: Defaulting to libbitsandbytes_cpu.so...
argument of type 'WindowsPath' is not iterable
CUDA SETUP: Required library version not found: libsbitsandbytes_cpu.so. Maybe you need to compile it from source?
CUDA SETUP: Defaulting to libbitsandbytes_cpu.so...
argument of type 'WindowsPath' is not iterable
C:\Users\jerem\.conda\envs\llama-finetuner\lib\site-packages\bitsandbytes\cextension.py:31: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Loading base model...
Traceback (most recent call last):
  File "C:\Users\jerem\.conda\envs\llama-finetuner\lib\site-packages\gradio\routes.py", line 394, in run_predict
    output = await app.get_blocks().process_api(
  File "C:\Users\jerem\.conda\envs\llama-finetuner\lib\site-packages\gradio\blocks.py", line 1075, in process_api
    result = await self.call_function(
  File "C:\Users\jerem\.conda\envs\llama-finetuner\lib\site-packages\gradio\blocks.py", line 884, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "C:\Users\jerem\.conda\envs\llama-finetuner\lib\site-packages\anyio\to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "C:\Users\jerem\.conda\envs\llama-finetuner\lib\site-packages\anyio\_backends\_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "C:\Users\jerem\.conda\envs\llama-finetuner\lib\site-packages\anyio\_backends\_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "C:\Users\jerem\.conda\envs\llama-finetuner\lib\site-packages\gradio\helpers.py", line 587, in tracked_fn
    response = fn(*args)
  File "D:\simple-llama-finetuner\main.py", line 128, in tokenize_and_train
    if (model is None): load_base_model()
  File "D:\simple-llama-finetuner\main.py", line 18, in load_base_model
    model = transformers.LlamaForCausalLM.from_pretrained(
  File "C:\Users\jerem\.conda\envs\llama-finetuner\lib\site-packages\transformers\modeling_utils.py", line 2643, in from_pretrained
    ) = cls._load_pretrained_model(
  File "C:\Users\jerem\.conda\envs\llama-finetuner\lib\site-packages\transformers\modeling_utils.py", line 2966, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "C:\Users\jerem\.conda\envs\llama-finetuner\lib\site-packages\transformers\modeling_utils.py", line 673, in _load_state_dict_into_meta_model
    set_module_8bit_tensor_to_device(model, param_name, param_device, value=param)
  File "C:\Users\jerem\.conda\envs\llama-finetuner\lib\site-packages\transformers\utils\bitsandbytes.py", line 70, in set_module_8bit_tensor_to_device
    new_value = bnb.nn.Int8Params(new_value, requires_grad=False, has_fp16_weights=has_fp16_weights).to(device)
  File "C:\Users\jerem\.conda\envs\llama-finetuner\lib\site-packages\bitsandbytes\nn\modules.py", line 196, in to
    return self.cuda(device)
  File "C:\Users\jerem\.conda\envs\llama-finetuner\lib\site-packages\bitsandbytes\nn\modules.py", line 159, in cuda
    B = self.data.contiguous().half().cuda(device)
  File "C:\Users\jerem\AppData\Roaming\Python\Python310\site-packages\torch\cuda\__init__.py", line 221, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
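
A generic first check (not specific to this repo) is to confirm that the installed PyTorch build has CUDA support at all; on native Windows that usually means installing a CUDA-enabled wheel rather than the default CPU-only one.

```python
import torch

print(torch.__version__)          # a "+cpu" suffix means a CPU-only build
print(torch.cuda.is_available())  # must be True for 8-bit LoRA training
print(torch.version.cuda)         # None on CPU-only builds
```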

Examples to get started with

I see there are a few examples in the repo - it would be great to have a tutorial accompanying them, so newbies dipping their toes in this for the first time can get an idea of what's happening and score some early wins.

About llama-2-70B fine-tuning

Appreciate your great work!

Is it possible to fine-tune llama-2-70B on a 3×8×A100 (40 GB) configuration? Thanks!

Inference doesn't work after training

I trained my input text on an RTX 4080 (16 GB VRAM) with the default settings:

[screenshot of the default training settings]

And that seems to work OK:

TrainOutput(global_step=116, training_loss=1.0854247685136467, metrics={'train_runtime': 258.9812, 'train_samples_per_second': 0.448, 'train_steps_per_second': 0.448, 'train_loss': 1.0854247685136467, 'epoch': 1.0})

However, inference doesn't work, and I don't have enough context to understand why yet:

  File "/home/vadi/Programs/simple-llama-finetuner/main.py", line 27, in maybe_load_models
    model = LlamaForCausalLM.from_pretrained(
  File "/home/vadi/Programs/miniconda3/envs/finetuner2/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2588, in from_pretrained
    raise ValueError(
ValueError: 
                        Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit
                        the quantized model. If you want to dispatch the model on the CPU or the disk while keeping
                        these modules in 32-bit, you need to set `load_in_8bit_fp32_cpu_offload=True` and pass a custom
                        `device_map` to `from_pretrained`. Check
                        https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu
                        for more details.

Currently 12.5 / 16gb vram is being used, if that matters.
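
As the error message suggests, one possible (untested) workaround is to load with a quantization config that allows the fp32 modules to be offloaded to the CPU; this assumes a transformers version that exposes BitsAndBytesConfig, and the model id is a placeholder.

```python
import transformers

quant_config = transformers.BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True,  # keep non-quantized modules on the CPU
)
model = transformers.LlamaForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf",  # placeholder
    quantization_config=quant_config,
    device_map="auto",
)
```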

Performance after FineTuning

I have fine tuned llama using this repo and a few text documents I had with me.
If I provide 3-4 consecutive words from input text, it amazingly completes the next couple of sentences.
But if I ask the same information as a question or reorder the input prompt, it hallucinates.

I thought I was overfitting, so I increased the input data size and decreased the number of epochs, but then the model neither completed the sentences (when prompted as above) nor answered the questions.

I also tried using vector embedding search with a model on top of it to put things together, but that way it misses information spread across a few sentences. It also can't answer anything other than what/where-type questions when the answer spans multiple sentences, and it's even worse when it has to infer something from this information plus general knowledge. So that seems to be a not-so-fruitful approach.

My goal is to get llama to have knowledge of a few text documents I have locally.
Someone help me please.

Multi GPU running

Hi there! I would like to know how we can run this solution in a multi-GPU environment for bigger models. Thank you!
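
Multi-GPU support isn't a documented feature of this repo, but as a rough sketch: loading the base model with device_map="auto" (via accelerate) shards its layers across all visible GPUs. The model id below is a placeholder.

```python
import transformers

model = transformers.AutoModelForCausalLM.from_pretrained(
    "decapoda-research/llama-13b-hf",  # placeholder
    load_in_8bit=True,
    device_map="auto",  # accelerate spreads layers over every visible GPU
)
print(model.hf_device_map)  # shows which device each block was placed on
```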

Not a problem - but like people should know

https://arxiv.org/abs/2303.11366 is a really cool paper about reflection in LLMs.
[screenshot of generated output]
That is after training on about 20 samples for 50 epochs on my 3090 with the 7B model.

User: [Topic or question]

Assistant Hypothetical Response: [Brief or simplified answer to the topic or question]

Agent Reflection: [Critique of the hypothetical response, highlighting the limitations, inaccuracies, or areas that need improvement or expansion, while providing guidance on how to address these issues in the revised response]

Bot Actual Response: [The natural and contextually appropriate answer to the topic or question, as generated by the advanced language model, which incorporates the suggestions and improvements from the agent reflection for a more comprehensive and accurate response]

This, plus training sets generated with this framework, seems to really improve the generations of these models with fairly limited training sets. Just thought I would share.

RuntimeError: expected scalar type Half but found Float

CUDA SETUP: Loading binary /home/opc/anaconda3/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda114_nocublaslt.so...
Running on local URL:  http://127.0.0.1:7860
Running on public URL: https://b38eaf88d60145f161.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades (NEW!), check out Spaces: https://huggingface.co/spaces
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'.
The class this function is called from is 'LlamaTokenizer'.
/home/opc/anaconda3/lib/python3.9/site-packages/peft/utils/other.py:76: FutureWarning: prepare_model_for_int8_training is deprecated and will be removed in a future version. Use prepare_model_for_kbit_training instead.
  warnings.warn(
/home/opc/anaconda3/lib/python3.9/site-packages/bitsandbytes/autograd/_functions.py:318: UserWarning: MatMul8bitLt: inputs will be cast from torch.float32 to float16 during quantization
  warnings.warn(f"MatMul8bitLt: inputs will be cast from {A.dtype} to float16 during quantization")
Traceback (most recent call last):
  File "/home/opc/anaconda3/lib/python3.9/site-packages/gradio/routes.py", line 399, in run_predict
    output = await app.get_blocks().process_api(
  File "/home/opc/anaconda3/lib/python3.9/site-packages/gradio/blocks.py", line 1299, in process_api
    result = await self.call_function(
  File "/home/opc/anaconda3/lib/python3.9/site-packages/gradio/blocks.py", line 1022, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/home/opc/anaconda3/lib/python3.9/site-packages/anyio/to_thread.py", line 28, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(func, *args, cancellable=cancellable,
  File "/home/opc/anaconda3/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 818, in run_sync_in_worker_thread
    return await future
  File "/home/opc/anaconda3/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 754, in run
    result = context.run(func, *args)
  File "/home/opc/anaconda3/lib/python3.9/site-packages/gradio/helpers.py", line 588, in tracked_fn
    response = fn(*args)
  File "/home/opc/simple-llama-finetuner/app.py", line 131, in train
    self.trainer.train(
  File "/home/opc/simple-llama-finetuner/trainer.py", line 273, in train
    result = self.trainer.train(resume_from_checkpoint=False)
  File "/home/opc/anaconda3/lib/python3.9/site-packages/transformers/trainer.py", line 1696, in train
    return inner_training_loop(
  File "/home/opc/anaconda3/lib/python3.9/site-packages/transformers/trainer.py", line 1972, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/opc/anaconda3/lib/python3.9/site-packages/transformers/trainer.py", line 2796, in training_step
    self.scaler.scale(loss).backward()
  File "/home/opc/anaconda3/lib/python3.9/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/home/opc/anaconda3/lib/python3.9/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/opc/anaconda3/lib/python3.9/site-packages/torch/autograd/function.py", line 267, in apply
    return user_fn(self, *args)
  File "/home/opc/anaconda3/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 157, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/home/opc/anaconda3/lib/python3.9/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/opc/anaconda3/lib/python3.9/site-packages/torch/autograd/function.py", line 267, in apply
    return user_fn(self, *args)
  File "/home/opc/anaconda3/lib/python3.9/site-packages/bitsandbytes/autograd/_functions.py", line 476, in backward
    grad_A = torch.matmul(grad_output, CB).view(ctx.grad_shape).to(ctx.dtype_A)
RuntimeError: expected scalar type Half but found Float

Collecting info on memory requirements

Not an issue; just gathering information from fine-tuning attempts.

Please leave the info on:

  • size of your training set
  • the available VRAM
  • result (was training successful or you ran out of memory)

This would save us time and help us better gauge our machine capabilities.

What should the fine-tuning output look like?

Thanks for the great resources.

I tried to fine-tune with my custom data and could see the output message. Afterwards, when I tried to run inference with the model I trained, the output looked totally messed up, as if it hadn't been trained.

My output message for the training is:

{'train_runtime': 5.8012, 'train_samples_per_second': 1.207, 'train_steps_per_second': 1.207, 'train_loss': 2.129377910069057, 'epoch': 1.0}

Is this the correct output that we should see? I am wondering if it just trains for a single epoch by default.

Thanks,
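
For what it's worth, in a plain transformers Trainer setup the epoch count comes from TrainingArguments, so a log ending at epoch 1.0 just reflects num_train_epochs=1. A hedged sketch (the values below are placeholders, not this app's defaults):

```python
import transformers

args = transformers.TrainingArguments(
    output_dir="lora/my-adapter",   # placeholder path
    num_train_epochs=3,             # train for three passes over the data instead of one
    per_device_train_batch_size=1,
    learning_rate=3e-4,
)
```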

How should I prepare the dataset for generative question answering on the private documents?

Hello,
Thanks for creating this very helpful tool!
I am fine-tuning the model (GPT-J-6B) for question answering on private documents. I have 1000+ documents, all in text format. And of course, I will be going with PEFT LoRA.

But the question is...

How should I prepare my dataset?

Since this is a question-answering scenario, my first thought was to prepare the dataset in "Question: {} Answer: {} Context: {}" format, but since there are so many documents, I would first need to generate the questions and then the answers, and... you know, it becomes infeasible.

Then I thought I should "just provide the raw text" to the model as the knowledge base and choose a model that was already fine-tuned on the Alpaca dataset (so it understands instructions - for that I will use the "nlpcloud/instruct-gpt-j-fp16" model), and my hope is that the model will then respond to my questions.

So is what I am doing correct? How should I prepare my dataset for question answering?
Please help.
Thanks 🙏🏻

AttributeError: type object 'Dataset' has no attribute 'from_list'

I was trying to finetune on a raw text file. It has a few empty lines too. I'm getting this error.
When I looked into the Dataset class, I didn't find a from_list function. There were others like from_dict and from_text (which reads from a file). I wanted to know if this line of code needs to be changed.

PS: I tried replacing that line with data = datasets.Dataset.from_text(<file path>) and the training seems to be working fine. But I'm not sure how newline and multiple newline characters affect the training performance. Would appreciate some light shed on that.
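
For reference, a hedged sketch of a fallback for older datasets releases that predate Dataset.from_list; the resulting dataset content is equivalent either way.

```python
import datasets

paragraphs = [{"text": "sample one"}, {"text": "sample two"}]  # placeholder data

try:
    data = datasets.Dataset.from_list(paragraphs)      # newer datasets releases
except AttributeError:
    data = datasets.Dataset.from_dict(                 # older releases
        {"text": [p["text"] for p in paragraphs]}
    )
```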

To create a public link, set `share=True` in `launch()`.
Loading base model...
Number of samples: 28
Traceback (most recent call last):
  File "/home/datta0/.pyenv/versions/3.8.10/lib/python3.8/site-packages/gradio/routes.py", line 393, in run_predict
    output = await app.get_blocks().process_api(
  File "/home/datta0/.pyenv/versions/3.8.10/lib/python3.8/site-packages/gradio/blocks.py", line 1108, in process_api
    result = await self.call_function(
  File "/home/datta0/.pyenv/versions/3.8.10/lib/python3.8/site-packages/gradio/blocks.py", line 915, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/home/datta0/.pyenv/versions/3.8.10/lib/python3.8/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/home/datta0/.pyenv/versions/3.8.10/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/home/datta0/.pyenv/versions/3.8.10/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/home/datta0/.pyenv/versions/3.8.10/lib/python3.8/site-packages/gradio/helpers.py", line 588, in tracked_fn
    response = fn(*args)
  File "main.py", line 161, in tokenize_and_train
    data = datasets.Dataset.from_list(paragraphs)
AttributeError: type object 'Dataset' has no attribute 'from_list'
Number of samples: 11
Traceback (most recent call last):
  File "/home/datta0/.pyenv/versions/3.8.10/lib/python3.8/site-packages/gradio/routes.py", line 393, in run_predict
    output = await app.get_blocks().process_api(
  File "/home/datta0/.pyenv/versions/3.8.10/lib/python3.8/site-packages/gradio/blocks.py", line 1108, in process_api
    result = await self.call_function(
  File "/home/datta0/.pyenv/versions/3.8.10/lib/python3.8/site-packages/gradio/blocks.py", line 915, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/home/datta0/.pyenv/versions/3.8.10/lib/python3.8/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/home/datta0/.pyenv/versions/3.8.10/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/home/datta0/.pyenv/versions/3.8.10/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/home/datta0/.pyenv/versions/3.8.10/lib/python3.8/site-packages/gradio/helpers.py", line 588, in tracked_fn
    response = fn(*args)
  File "main.py", line 161, in tokenize_and_train
    data = datasets.Dataset.from_list(paragraphs)
AttributeError: type object 'Dataset' has no attribute 'from_list'

AMD GPU compability or CPU

Hello, I want to know if there is a way to fine-tune with an AMD GPU or the CPU.
To explain: I have an RX 6600 XT and an i5-10400F, and I want to fine-tune a very small model, but I can't because of the NVIDIA GPU requirement.

So if you know of anything I can do to fine-tune a model with my hardware, I'll take it!

Thanks in advance.
PS: I don't speak English very well, I apologize.

`LLaMATokenizer` vs `LlamaTokenizer` class names

Running inference gives the following warning:

Loading tokenizer...
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'. 
The class this function is called from is 'LlamaTokenizer'.

Is it a problem?

Inference output text keeps running on...

Model: Vanilla LLaMA

Input:

Why did the chicken cross the road?

Output:

Why did the chicken cross the road? To get to the other side.
Why did the chicken cross the road? To get to the other side. Why did the chicken cross the road? To get to the other side. Why did the chicken cross the road? To

Using text-generation-webui:

python server.py --load-in-8bit --listen --model llama-7B
Why did the chicken cross the road?? To get to the other side.
Why did the chicken cross the road? Because it was a free range chicken and it wanted to go home!

I need to tweak the inference code
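
For reference, a hedged sketch of generation settings that usually rein in this kind of runaway repetition; the values are illustrative, and `model`/`tokenizer` are assumed to be loaded already.

```python
inputs = tokenizer("Why did the chicken cross the road?", return_tensors="pt").to(model.device)
output_ids = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.15,              # discourage repeating the prompt
    eos_token_id=tokenizer.eos_token_id,  # stop at end-of-sequence
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```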

How to use CPU instead of GPU

Can anyone tell me how I can use the CPU for fine-tuning instead of the GPU? I do not have one. Also, where are the downloaded model files located on Windows?

Thank you in advance.

Inference works just once

When I first load a model and ask it to infer, I get a good result. On a second inference it ignores what it should respond with and just keeps on generating more on the original input text. On the third inference, it just gives stacktraces and any subsequent inferences repeat the stacktraces:

Traceback (most recent call last):
  File "/home/vadi/Programs/miniconda3/envs/finetuner2/lib/python3.10/site-packages/gradio/routes.py", line 394, in run_predict
    output = await app.get_blocks().process_api(
  File "/home/vadi/Programs/miniconda3/envs/finetuner2/lib/python3.10/site-packages/gradio/blocks.py", line 1075, in process_api
    result = await self.call_function(
  File "/home/vadi/Programs/miniconda3/envs/finetuner2/lib/python3.10/site-packages/gradio/blocks.py", line 884, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/home/vadi/Programs/miniconda3/envs/finetuner2/lib/python3.10/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/home/vadi/Programs/miniconda3/envs/finetuner2/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/home/vadi/Programs/miniconda3/envs/finetuner2/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/home/vadi/Programs/miniconda3/envs/finetuner2/lib/python3.10/site-packages/gradio/helpers.py", line 587, in tracked_fn
    response = fn(*args)
  File "/home/vadi/Programs/simple-llama-finetuner/main.py", line 67, in generate_text
    model = PeftModel.from_pretrained(
  File "/home/vadi/Programs/miniconda3/envs/finetuner2/lib/python3.10/site-packages/peft/peft_model.py", line 138, in from_pretrained
    remove_hook_from_submodules(model)
  File "/home/vadi/Programs/miniconda3/envs/finetuner2/lib/python3.10/site-packages/accelerate/hooks.py", line 407, in remove_hook_from_submodules
    remove_hook_from_submodules(child)
  File "/home/vadi/Programs/miniconda3/envs/finetuner2/lib/python3.10/site-packages/accelerate/hooks.py", line 405, in remove_hook_from_submodules
    remove_hook_from_module(module)
  File "/home/vadi/Programs/miniconda3/envs/finetuner2/lib/python3.10/site-packages/accelerate/hooks.py", line 187, in remove_hook_from_module
    delattr(module, "_hf_hook")
  File "/home/vadi/Programs/miniconda3/envs/finetuner2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1328, in __delattr__
    super().__delattr__(name)
AttributeError: _hf_hook

Do others get the same?

Traceback during inference.

Colab, gives the following error during inference:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/gradio/routes.py", line 394, in run_predict
    output = await app.get_blocks().process_api(
  File "/usr/local/lib/python3.9/dist-packages/gradio/blocks.py", line 1075, in process_api
    result = await self.call_function(
  File "/usr/local/lib/python3.9/dist-packages/gradio/blocks.py", line 884, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/usr/local/lib/python3.9/dist-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/usr/local/lib/python3.9/dist-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/usr/local/lib/python3.9/dist-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/usr/local/lib/python3.9/dist-packages/gradio/helpers.py", line 587, in tracked_fn
    response = fn(*args)
  File "/content/simple-llama-finetuner/main.py", line 121, in generate_text
    generation_output = model.generate(
  File "/usr/local/lib/python3.9/dist-packages/peft/peft_model.py", line 581, in generate
    outputs = self.base_model.generate(**kwargs)
  File "/usr/local/lib/python3.9/dist-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/transformers/generation/utils.py", line 1451, in generate
    return self.sample(
  File "/usr/local/lib/python3.9/dist-packages/transformers/generation/utils.py", line 2467, in sample
    outputs = self(
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/transformers/models/llama/modeling_llama.py", line 765, in forward
    outputs = self.model(
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/transformers/models/llama/modeling_llama.py", line 614, in forward
    layer_outputs = decoder_layer(
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/transformers/models/llama/modeling_llama.py", line 309, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/transformers/models/llama/modeling_llama.py", line 209, in forward
    query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/peft/tuners/lora.py", line 522, in forward
    result = super().forward(x)
  File "/usr/local/lib/python3.9/dist-packages/bitsandbytes/nn/modules.py", line 242, in forward
    out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
  File "/usr/local/lib/python3.9/dist-packages/bitsandbytes/autograd/_functions.py", line 488, in matmul
    return MatMul8bitLt.apply(A, B, out, bias, state)
  File "/usr/local/lib/python3.9/dist-packages/bitsandbytes/autograd/_functions.py", line 317, in forward
    state.CxB, state.SB = F.transform(state.CB, to_order=formatB)
  File "/usr/local/lib/python3.9/dist-packages/bitsandbytes/functional.py", line 1698, in transform
    prev_device = pre_call(A.device)
AttributeError: 'NoneType' object has no attribute 'device'

Getting the repo id error from the web interface

Does anybody know how I can solve this error?
Error:
Repo id must use alphanumeric chars or '-', '_', '.', '--' and '..' are forbidden, '-' and '.' cannot start or end the name, max length is 96: ''.

Getting OOM

Training on T4:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 44.00 MiB (GPU 0; 14.56 GiB total capacity; 13.25 GiB already allocated; 10.44 MiB free; 13.83 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I suspect a change of versions in peft or transformers... Does that make sense?

Question: Is fine tuning suitable for factual answers from custom data, or is it better to use vector databases and use only the relevant chunk in the prompt for factual answers?

I know that in the case of OpenAI, fine-tuning doesn't work by providing my own data so that the model can then use it. Rather, it works by teaching the model what style of language to use. So if I want GPT to use my data, I have to compute embeddings, store them in a vector database, and then put the relevant chunk of data back into the GPT prompt.

Is it similar here?

How do I merge trained LoRA and Llama 7B weights?

How do I merge trained LoRA and Llama 7B weights? Is there a script? It would make it much easier to share weights and would help with portability, file management, etc.

It would be an amazing feature for the training tab as well!
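
There isn't a merge script in this repo as far as I know, but here is a hedged sketch of the usual peft route; the paths and model id are placeholders, and the merge should be done on non-quantized (fp16/fp32) weights.

```python
import peft
import transformers

base = transformers.AutoModelForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf",  # placeholder; load in fp16/fp32, not 8-bit
    torch_dtype="auto",
)
model = peft.PeftModel.from_pretrained(base, "lora/my-adapter")

# On older peft versions this may be model.base_model.merge_and_unload() instead.
merged = model.merge_and_unload()

merged.save_pretrained("llama-7b-merged")  # standalone full-weight checkpoint
transformers.AutoTokenizer.from_pretrained(
    "decapoda-research/llama-7b-hf"
).save_pretrained("llama-7b-merged")
```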
