
SFT_data_generation

Introduction

The main purpose of this project is to generate instruction fine-tuning data for open-source LLMs. Drawing on Stanford Alpaca's[1] approach to generating instruction-following data, the generation process follows the SELF-INSTRUCT[2] method from the University of Washington: we construct seed data and use LLMs to generate instruction fine-tuning data for scenario-adaptation fine-tuning, achieving efficient, low-cost, automated generation of fine-tuning data. We have successfully fine-tuned Llama3 with the generated data.

The fine-tuning scenario for this project is an online medical service, so the code is written around that scenario. You can modify and adjust the code for your own scenario.

Processing

  • Construct seed data
    • Set up seed data for different tasks based on your fine-tuning scenario. In this example, we define three types of instruction tasks; see seed_tasks.jsonl
  • Build generation templates
    • Construct prompt templates and place the seed data into the prompts as examples; see template/prompt_template.txt
    • The "disease_list" field in the prompt template exists to diversify the entities in the generated data. Since this project's scenario is medical, the field is a dictionary of disease categories; during generation, several diseases are randomly sampled from this list as references for the generated data.
  • Call a large-model API to generate data
    • Calling the large-model API to generate instruction data relies on the model's in-context learning: given only a few examples, the model can generate high-quality instruction data that follows the requirements of the prompt. This project uses the Baidu ERNIE-4.0-8K API, but you can switch to other large-model APIs.
  • Concatenate data
    • Fine-tuning data can often be supplemented with collected open-source fine-tuning data. It is also necessary to incorporate a portion of general corpus to improve the generalization of the fine-tuned model. Data from different sources should be converted into a unified format to build the final instruction fine-tuning dataset. However, the optimal ratio of data sources and the appropriate data scale for better fine-tuning results require further experimentation and validation on your own.
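The generation steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the seed-task fields, disease names, and template placeholders are assumptions, and the ERNIE API call is stubbed out.

```python
import json
import random

# Hypothetical seed task in the style of seed_tasks.jsonl (fields are assumed).
seed_tasks = [
    {"id": "seed_1", "task_type": "diagnosis_qa",
     "instruction": "What are the common symptoms of influenza?",
     "output": "Fever, cough, sore throat, muscle aches, and fatigue."},
]

# Hypothetical disease dictionary standing in for the "disease_list" field.
disease_list = ["influenza", "hypertension", "type 2 diabetes", "asthma"]

# Hypothetical prompt template mirroring the role of template/prompt_template.txt.
prompt_template = (
    "You are generating medical instruction data.\n"
    "Reference diseases: {diseases}\n"
    "Examples:\n{examples}\n"
    "Now generate new instruction/output pairs."
)

def build_prompt(random_seed_num: int = 1, disease_num: int = 2) -> str:
    """Fill the template with sampled seed examples and sampled diseases."""
    seeds = random.sample(seed_tasks, k=min(random_seed_num, len(seed_tasks)))
    examples = "\n".join(json.dumps(s, ensure_ascii=False) for s in seeds)
    diseases = ", ".join(random.sample(disease_list, k=disease_num))
    return prompt_template.format(diseases=diseases, examples=examples)

prompt = build_prompt()
# The real script would now POST this prompt to the ERNIE-4.0-8K API
# and parse the generated instruction/output pairs from the response.
```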
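Converting data from different sources into a unified format might look like the sketch below. The Alpaca-style {"instruction", "input", "output"} schema and the source field names used here are assumptions; adapt the mapping to whatever format your trainer expects.

```python
# Map records from different sources into one unified SFT schema.
def to_unified(record: dict, source: str) -> dict:
    if source == "generated":    # output of the generation step (assumed fields)
        return {"instruction": record["instruction"], "input": "",
                "output": record["output"]}
    if source == "open_qa":      # e.g. a collected open-source QA dataset
        return {"instruction": record["question"], "input": "",
                "output": record["answer"]}
    raise ValueError(f"unknown source: {source}")

merged = (
    [to_unified(r, "generated") for r in [{"instruction": "Q1", "output": "A1"}]]
    + [to_unified(r, "open_qa") for r in [{"question": "Q2", "answer": "A2"}]]
)
```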

How to use

  • Modify seed_tasks.jsonl according to your own fine-tuning scenario. The more seed data you provide, the better the diversity of the generated content.
  • Modify template/prompt_template.txt according to your own fine-tuning scenario. This prompt template is written for Chinese; you can adapt it to other languages to bring the generated data closer to real-world question-and-answer scenarios.
  • Run the script "data_generate.py" to generate instruction-tuning data
    • Add your API key and API secret for the Baidu ERNIE API in config.py
    • Please note that the actual number of large-model API calls equals the product of generate_times and random_seed_num. Additionally, you can specify how many data entries to generate per API call in the 9th item of the prompt template.
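As a concrete check of the call count described above (the values here are arbitrary examples, not project defaults):

```python
# Total API calls = generate_times * random_seed_num.
generate_times = 10      # how many generation rounds
random_seed_num = 3      # seed samples drawn per round
entries_per_call = 5     # set in item 9 of the prompt template

total_calls = generate_times * random_seed_num
total_entries = total_calls * entries_per_call
print(total_calls, total_entries)   # 30 calls, 150 generated entries
```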

Installation

  • Make sure you have installed the requests package; the recommended version is 2.26.0.
  • If you want to try fine-tuning Llama3 with unsloth[3], you need to install unsloth.
    • install unsloth
    • try finetune Llama3
      • Run the script "unsloth_finetune.py"

References

[1] Stanford Alpaca: An Instruction-following LLaMA Model. https://github.com/tatsu-lab/stanford_alpaca
[2] SELF-INSTRUCT: Aligning Language Models with Self-Generated Instructions. https://github.com/yizhongw/self-instruct
[3] unsloth. https://github.com/unslothai/unsloth

Contributors

christophezhao