A template or starting point for fine-tuning large language models (LLMs) using Hugging Face's `transformers` library. You can customise the scripts to fit your project, your dataset, and any modifications you need during setup or training.
- Python 3.6+
- PyTorch
- Transformers library: `transformers>=4.0.0`
- Datasets library: `datasets>=1.0.0`
- Accelerate library (for distributed training): `accelerate>=0.20.1`
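A `requirements.txt` consistent with the versions above might look like the following sketch (the `scikit-learn` entry is an addition here, since the splitting script described below imports `train_test_split` from it):

```text
torch
transformers>=4.0.0
datasets>=1.0.0
accelerate>=0.20.1
scikit-learn
```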
To install the required libraries with pip, create and activate a virtual environment, upgrade pip, and install the dependencies:

```bash
python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install --upgrade pip
pip3 install -r requirements.txt
```

When you have finished working, leave the virtual environment with `deactivate`.
The key components of the `split_data.py` script, which splits the `train.jsonl` file into new training, validation, and test sets, are described below:
- Importing Libraries: The script uses the `json` library to handle JSON data and `train_test_split` from `scikit-learn` to facilitate the splitting of the dataset.
- Defining File Paths: File paths are set for the original training dataset (`train.jsonl`) and for the new training, validation, and test datasets (`train_split.jsonl`, `validation_split.jsonl`, and `test_split.jsonl`, respectively).
- Loading Data: It reads the original `train.jsonl` file and deserialises each line (a JSON entry) from a JSON string to a Python dictionary, collecting all entries into a list.
- Splitting Data: The script first separates 15% of the data as a test set, then splits the remaining 85% so that 70% of the original dataset goes to training and 15% to validation. A random seed ensures reproducibility.
- Writing New Data to Files: The script writes the new datasets to their respective `.jsonl` files, serialising each entry back to a JSON string and writing one entry per line.
- Logging Information: The script prints the number of entries in the new training and validation datasets to confirm the split and give a quick overview.
The script encapsulates the data preparation step that precedes training machine learning models, ensuring you have separate datasets for training and for validating model performance. This separation is necessary to guard against overfitting and to evaluate your model's generalisation capabilities.
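For reference, here is a minimal sketch of what such a split script might look like. The `data/` paths and the random seed are assumptions for illustration; the second split uses 0.15 / 0.85 ≈ 0.1765 so that 15% of the original data ends up in validation:

```python
import json

from sklearn.model_selection import train_test_split

# File paths; the "data/" prefix is an assumption based on the data
# directory mentioned in the training section below.
SOURCE = "data/train.jsonl"
TRAIN_OUT = "data/train_split.jsonl"
VALIDATION_OUT = "data/validation_split.jsonl"
TEST_OUT = "data/test_split.jsonl"

# Load every JSON entry from the original training file.
with open(SOURCE, "r", encoding="utf-8") as f:
    entries = [json.loads(line) for line in f]

# First carve off 15% of the data as the held-out test set.
remaining, test_set = train_test_split(entries, test_size=0.15, random_state=42)

# Split the remaining 85% so that 70% of the ORIGINAL data goes to
# training and 15% to validation: 0.15 / 0.85 ~= 0.1765.
train_set, validation_set = train_test_split(
    remaining, test_size=0.15 / 0.85, random_state=42
)

def write_jsonl(path, rows):
    """Serialise each entry back to one JSON string per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

write_jsonl(TRAIN_OUT, train_set)
write_jsonl(VALIDATION_OUT, validation_set)
write_jsonl(TEST_OUT, test_set)

# Log the split sizes for confirmation.
print(f"Training entries: {len(train_set)}")
print(f"Validation entries: {len(validation_set)}")
```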
To split the dataset, run the script with the following command:

```bash
python3 ./script/split_data.py
```
The key components of the training script are:
- Importing Libraries: The script imports the necessary Python libraries, such as `json`, `torch`, `datasets`, and classes from `transformers` for model loading, data tokenisation, model training, etc.
- GPU Availability Check: The script checks whether a GPU is available for training, which can significantly speed up the process.
- Dataset Paths: The script defines file paths for the training and validation datasets; both are in JSONL format and located inside the `data` directory.
- Tokenizer Loading and Configuration: The tokeniser associated with the model is loaded from the Hugging Face Model Hub. Additionally, the script sets the padding token for the tokeniser if it is not defined.
- Model Loading and Configuration: The pre-trained language model is loaded, with its embeddings resized to accommodate any new tokens added by the tokeniser.
- Data Loading and Preprocessing: The training and validation datasets are loaded and processed using the tokeniser. A custom tokenise function is applied to the datasets to convert raw text into the format expected by the model.
- Data Collator Definition: A `DataCollatorForLanguageModeling` is created to handle dynamic padding of input sequences during training.
- Training Arguments Configuration: `TrainingArguments` are defined to specify the output directory, number of epochs, batch size, saving strategy, evaluation strategy, and more.
- Trainer Initialization: The script initialises a `Trainer` instance with the model, training arguments, data collator, and training and evaluation datasets.
- Model Training: The `train` method of the `Trainer` instance fine-tunes the model on the training dataset whilst periodically evaluating its performance on the validation dataset.
- Saving the Fine-Tuned Model: After training, the script saves the fine-tuned model to the output directory.
- Model Evaluation: After training, the script evaluates the model's performance on the held-out test set, which was not used during training or validation.
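As a reference for these steps, here is a minimal sketch of such a training script. It is not the repository's actual `train_data.py`: the checkpoint choice, the `"text"` field name, the hyperparameters, and the output directory are all assumptions for illustration:

```python
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# GPU availability check.
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Training on: {device}")

# One of the two models named below; swap in the checkpoint you need.
model_name = "Felladrin/TinyMistral-248M-SFT-v4"

# Tokenizer loading and configuration: fall back to EOS if no pad token.
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Model loading; resize embeddings in case the tokeniser added tokens.
model = AutoModelForCausalLM.from_pretrained(model_name)
model.resize_token_embeddings(len(tokenizer))

# Dataset paths: JSONL files inside the data directory.
dataset = load_dataset(
    "json",
    data_files={
        "train": "data/train_split.jsonl",
        "validation": "data/validation_split.jsonl",
        "test": "data/test_split.jsonl",
    },
)

# Custom tokenise function; assumes each entry has a "text" field.
def tokenise(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenised = dataset.map(
    tokenise, batched=True, remove_columns=dataset["train"].column_names
)

# Data collator for dynamic padding (mlm=False for causal LM training).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Training arguments: output dir, epochs, batch size, save/eval strategy.
args = TrainingArguments(
    output_dir="output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    evaluation_strategy="epoch",
    save_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=collator,
    train_dataset=tokenised["train"],
    eval_dataset=tokenised["validation"],
)

trainer.train()
trainer.save_model("output")

# Evaluate on the held-out test set.
print(trainer.evaluate(eval_dataset=tokenised["test"]))
```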
The `train_data.py` script is a starting point! Feel free to revise it for your use case.
To start the training process, run the training script with the following commands for the relevant model:
```bash
# Training Felladrin/TinyMistral-248M-SFT-v4
python3 ./script/train_data.py --model=tinymistral

# Training TinyLlama/TinyLlama-1.1B-Chat-v1.0
python3 ./script/train_data.py --model=tinyllama
```
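For context, here is a hedged sketch of how a `--model` flag like this could map to the two checkpoints. The mapping and names are assumptions, not the script's actual code:

```python
import argparse

# Hypothetical mapping from the --model flag to the Hugging Face
# checkpoints named above; the real train_data.py may organise this
# differently.
MODEL_CHECKPOINTS = {
    "tinymistral": "Felladrin/TinyMistral-248M-SFT-v4",
    "tinyllama": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
}

parser = argparse.ArgumentParser(description="Fine-tune a small chat model.")
parser.add_argument("--model", choices=sorted(MODEL_CHECKPOINTS), required=True)
args = parser.parse_args()

checkpoint = MODEL_CHECKPOINTS[args.model]
print(f"Fine-tuning {checkpoint}")
```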
If you encounter any issues during installation or training, ensure that:
- You installed the dependencies correctly.
- The `train.jsonl` file is appropriately formatted and accessible.
- The training script has the correct path to the `train.jsonl` file.
- The tokenizer is appropriately configured with a pad token (see the snippet below).
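On the last point, a common pattern (and the behaviour described in the tokeniser section above) is to fall back to the end-of-sequence token when no pad token is defined; the checkpoint name here is just an example:

```python
from transformers import AutoTokenizer

# Example checkpoint; substitute the model you are fine-tuning.
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Many causal LM tokenisers ship without a pad token; reusing EOS lets
# DataCollatorForLanguageModeling pad batches during training.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
```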
For any error messages, please refer to the error-specific tips provided in the logs and address them accordingly.
Please read CONTRIBUTING.md for details on our code of conduct, and the process for submitting pull requests to us.
We use SemVer for versioning. For the versions available, see the tags on this repository.
This project is licensed under the MIT License - see the LICENSE file for details.
(c) 2024 Finbarrs Oketunji.