
SFT_data_generation

Introduction

The main purpose of this project is to generate instruction fine-tuning data for open-source LLMs. Drawing on Stanford Alpaca's[1] approach to generating instruction-following data, the generation process follows the SELF-INSTRUCT[2] method from the University of Washington: we construct seed data and use LLMs to generate instruction fine-tuning data for scenario-adaptation fine-tuning, achieving efficient, low-cost, automated generation of fine-tuning data. We have successfully fine-tuned Llama3 with the generated data.

The fine-tuning scenario for this project is an online medical service, so the code is written around that scenario. You can modify and adjust the code for your own scenario.

Processing

  • Construct seed data
    • Set up seed data for different tasks based on your fine-tuning scenario. In this example, we define three types of instruction tasks; see seed_tasks.jsonl
  • Build generation templates
    • Construct prompt templates and place the seed data into the prompts as examples; see template/prompt_template.txt
    • The "disease_list" field in the prompt template exists to diversify the entities in the generated data. Since this project's scenario is medical, the field is a dictionary of disease categories; during generation, several diseases are randomly sampled from this list as references for the generated data.
  • Call a large-model API to generate data
    • Calling the large-model API to generate instruction data relies on the model's in-context learning: given only a few examples, the model can generate high-quality instruction data that follows the requirements of the prompt. This project uses the Baidu ERNIE-4.0-8K API, but you can switch to other large-model APIs.
  • Concatenate data
    • Fine-tuning data can often be supplemented with collected open-source fine-tuning data. It is also necessary to incorporate a portion of general corpus to improve the generalization of the fine-tuned model. Data from different sources should be converted into a unified format to build the final instruction fine-tuning dataset. However, the optimal ratio of data sources and the appropriate data scale for better fine-tuning results require further experimentation and validation on your own.
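The generation steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the seed-task fields, disease names, and template placeholders are assumptions, and the ERNIE API call is stubbed out.

```python
import json
import random

# Hypothetical seed task in the style of seed_tasks.jsonl (fields are assumed).
seed_tasks = [
    {"id": "seed_1", "task_type": "diagnosis_qa",
     "instruction": "What are the common symptoms of influenza?",
     "output": "Fever, cough, sore throat, muscle aches, and fatigue."},
]

# Hypothetical disease dictionary standing in for the "disease_list" field.
disease_list = ["influenza", "hypertension", "type 2 diabetes", "asthma"]

# Hypothetical prompt template mirroring the role of template/prompt_template.txt.
prompt_template = (
    "You are generating medical instruction data.\n"
    "Reference diseases: {diseases}\n"
    "Examples:\n{examples}\n"
    "Now generate new instruction/output pairs."
)

def build_prompt(random_seed_num: int = 1, disease_num: int = 2) -> str:
    """Fill the template with sampled seed examples and sampled diseases."""
    seeds = random.sample(seed_tasks, k=min(random_seed_num, len(seed_tasks)))
    examples = "\n".join(json.dumps(s, ensure_ascii=False) for s in seeds)
    diseases = ", ".join(random.sample(disease_list, k=disease_num))
    return prompt_template.format(diseases=diseases, examples=examples)

prompt = build_prompt()
# The real script would now POST this prompt to the ERNIE-4.0-8K API
# and parse the generated instruction/output pairs from the response.
```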
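Converting data from different sources into a unified format might look like the sketch below. The Alpaca-style {"instruction", "input", "output"} schema and the source field names used here are assumptions; adapt the mapping to whatever format your trainer expects.

```python
# Map records from different sources into one unified SFT schema.
def to_unified(record: dict, source: str) -> dict:
    if source == "generated":    # output of the generation step (assumed fields)
        return {"instruction": record["instruction"], "input": "",
                "output": record["output"]}
    if source == "open_qa":      # e.g. a collected open-source QA dataset
        return {"instruction": record["question"], "input": "",
                "output": record["answer"]}
    raise ValueError(f"unknown source: {source}")

merged = (
    [to_unified(r, "generated") for r in [{"instruction": "Q1", "output": "A1"}]]
    + [to_unified(r, "open_qa") for r in [{"question": "Q2", "answer": "A2"}]]
)
```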

How to use

  • Modify seed_tasks.jsonl according to your own fine-tuning scenario. The more seed data you provide, the better the diversity of the generated content.
  • Modify template/prompt_template.txt according to your own fine-tuning scenario. This prompt template is written for Chinese; you can adapt it to other languages to bring the generated data closer to real-world question-and-answer scenarios.
  • Run the script "data_generate.py" to generate instruction-tuning data
    • Add your API key and API secret for the Baidu ERNIE API in config.py
    • Please note that the actual number of large-model API calls equals the product of generate_times and random_seed_num. Additionally, you can specify how many data entries to generate per API call in the 9th item of the prompt template.
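As a concrete check of the call count described above (the values here are arbitrary examples, not project defaults):

```python
# Total API calls = generate_times * random_seed_num.
generate_times = 10      # how many generation rounds
random_seed_num = 3      # seed samples drawn per round
entries_per_call = 5     # set in item 9 of the prompt template

total_calls = generate_times * random_seed_num
total_entries = total_calls * entries_per_call
print(total_calls, total_entries)   # 30 calls, 150 generated entries
```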

Installation

  • Make sure you have installed the requests package; the recommended version is 2.26.0.
  • If you want to try fine-tuning Llama3 with unsloth[3], you need to install unsloth.
    • install unsloth
    • try finetune Llama3
      • Run the script "unsloth_finetune.py"

References

[1] Stanford Alpaca: An Instruction-following LLaMA Model. https://github.com/tatsu-lab/stanford_alpaca
[2] SELF-INSTRUCT: Aligning Language Models with Self-Generated Instructions. https://github.com/yizhongw/self-instruct
[3] unsloth. https://github.com/unslothai/unsloth

Contributors

christophezhao