zjunlp / easyinstruct

[ACL 2024] An Easy-to-use Instruction Processing Framework for LLMs.

Home Page: https://zjunlp.github.io/project/EasyInstruct

License: MIT License

Python 96.47% Shell 0.64% Jupyter Notebook 2.89%
gpt-3 instructions prompt api chain-of-thought gpt large-language-models multimodal reasoning in-context-learning easyinstruct chatgpt pypy-library python prompting tool llama knowlm

easyinstruct's Introduction

An Easy-to-use Instruction Processing Framework for Large Language Models.



🔔News

Previous news
  • 2023-5-23 We release version 0.0.5, removing the requirement for llama-cpp-python.
  • 2023-5-16 We release version 0.0.4, fixing some problems.
  • 2023-4-21 We release version 0.0.3; check out our documentation for more details.
  • 2023-3-25 We release version 0.0.2, supporting IndexPrompt, MMPrompt, IEPrompt and more LLMs.
  • 2023-3-13 We release version 0.0.1, supporting in-context learning and chain-of-thought with ChatGPT.

This repository is a subproject of KnowLM.

🌟Overview

EasyInstruct is a Python package proposed as an easy-to-use instruction processing framework for Large Language Models (LLMs) such as GPT-4, LLaMA, and ChatGLM in your research experiments. EasyInstruct modularizes instruction generation, selection, and prompting, while also considering their combination and interaction.

  • The currently supported instruction generation techniques are as follows:

    Methods | Description
    Self-Instruct | The method that randomly samples a few instructions from a human-annotated seed tasks pool as demonstrations and prompts an LLM to generate more instructions and corresponding input-output pairs.
    Evol-Instruct | The method that incrementally upgrades an initial set of instructions into more complex instructions by prompting an LLM with specific prompts.
    Backtranslation | The method that creates an instruction-following training instance by predicting an instruction that would be correctly answered by a portion of a document of the corpus.
    KG2Instruct | The method that builds instruction data for information extraction from knowledge-graph resources (see the InstructIE reference under KG2InstructGenerator).
  • The currently supported instruction selection metrics are as follows:

    Metrics | Notation | Description
    Length | $Len$ | The bounded length of every pair of instruction and response.
    Perplexity | $PPL$ | The exponentiated average negative log-likelihood of the response.
    MTLD | $MTLD$ | Measure of Textual Lexical Diversity: the mean length of sequential words in a text that maintains a minimum threshold TTR (type-token ratio) score.
    ROUGE | $ROUGE$ | Recall-Oriented Understudy for Gisting Evaluation, a set of metrics used for evaluating similarities between sentences.
    GPT score | $GPT$ | The score, provided by ChatGPT, of whether the output is a good example of how an AI assistant should respond to the user's instruction.
    CIRS | $CIRS$ | The score that uses the abstract syntax tree to encode structural and logical attributes, to measure the correlation between code and reasoning abilities.
  • API service providers and their corresponding LLM products that are currently available:

    Model | Description | Default Version
    OpenAI
    GPT-3.5 | A set of models that improve on GPT-3 and can understand as well as generate natural language or code. | gpt-3.5-turbo
    GPT-4 | A set of models that improve on GPT-3.5 and can understand as well as generate natural language or code. | gpt-4
    Anthropic
    Claude | A next-generation AI assistant based on Anthropic’s research into training helpful, honest, and harmless AI systems. | claude-2.0
    Claude-Instant | A lighter, less expensive, and much faster option than Claude. | claude-instant-1.2
    Cohere
    Command | A flagship text generation model of Cohere trained to follow user commands and to be instantly useful in practical business applications. | command
    Command-Light | A light version of Command models that are faster but may produce lower-quality generated text. | command-light

🔧Installation

Installation from git repo branch:

pip install git+https://github.com/zjunlp/EasyInstruct@main

Installation for local development:

git clone https://github.com/zjunlp/EasyInstruct
cd EasyInstruct
pip install -e .

Installation using PyPI (not the latest version):

pip install easyinstruct -i https://pypi.org/simple

⏩Quickstart

We provide two ways for users to quickly get started with EasyInstruct. You can either use the shell script or the Gradio app based on your specific needs.

Shell Script

Step1: Prepare a configuration file

Users can configure the parameters of EasyInstruct in a YAML-style file, or simply use the default parameters in the configuration files we provide. The following is an example of the configuration file for Self-Instruct:

generator:
  SelfInstructGenerator:
    target_dir: data/generations/
    data_format: alpaca
    seed_tasks_path: data/seed_tasks.jsonl
    generated_instructions_path: generated_instructions.jsonl
    generated_instances_path: generated_instances.jsonl
    num_instructions_to_generate: 100
    engine: gpt-3.5-turbo
    num_prompt_instructions: 8

More example configuration files can be found at configs.

Step2: Run the shell script

Users should first specify the configuration file and provide their own OpenAI API key. Then, run the following shell script to launch the instruction generation or selection process.

# Path to the YAML configuration file and your OpenAI API key
config_file=""
openai_api_key=""

python demo/run.py \
    --config "$config_file" \
    --openai_api_key "$openai_api_key"

Gradio App

We provide a Gradio app for users to quickly get started with EasyInstruct. You can run the following command to launch the Gradio app locally on port 8080 (if available).

python demo/app.py

We also host a running Gradio app on Hugging Face Spaces. You can try it out here.


📌Use EasyInstruct

Please refer to our documentation for more details.

Generators

The Generators module streamlines the process of instruction data generation, allowing for the generation of instruction data based on seed data. You can choose the appropriate generator based on your specific needs.

BaseGenerator

BaseGenerator is the base class for all generators.

You can also inherit from this base class to customize your own generator class. Just override the __init__ and generate methods.
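The subclassing pattern described above might look roughly like the following sketch. The `BaseGenerator` defined here is only a local stand-in, because the real constructor arguments are not shown in this README; `TemplateGenerator`, `templates`, and `topics` are hypothetical names chosen for illustration.

```python
class BaseGenerator:
    """Local stand-in for easyinstruct's BaseGenerator (assumed interface)."""

    def __init__(self):
        pass

    def generate(self):
        raise NotImplementedError


class TemplateGenerator(BaseGenerator):
    """Hypothetical generator that fills instruction templates with topics."""

    def __init__(self, templates, topics):
        super().__init__()
        self.templates = templates
        self.topics = topics

    def generate(self):
        # Produce one instruction per (template, topic) pair.
        return [
            {"instruction": template.format(topic=topic)}
            for template in self.templates
            for topic in self.topics
        ]


generator = TemplateGenerator(
    templates=["Explain {topic} in one sentence.", "List three facts about {topic}."],
    topics=["perplexity", "ROUGE"],
)
instructions = generator.generate()
```

With the real package, the subclass would instead inherit from the imported `BaseGenerator` and follow whatever arguments that class actually expects.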

SelfInstructGenerator

SelfInstructGenerator is the class for the instruction generation method of Self-Instruct. See Self-Instruct: Aligning Language Models with Self-Generated Instructions for more details.

Example

from easyinstruct import SelfInstructGenerator
from easyinstruct.utils.api import set_openai_key

# Step1: Set your own API-KEY
set_openai_key("YOUR-KEY")

# Step2: Declare a generator class
generator = SelfInstructGenerator(num_instructions_to_generate=10)

# Step3: Generate self-instruct data
generator.generate()

BacktranslationGenerator

BacktranslationGenerator is the class for the instruction generation method of Instruction Backtranslation. See Self-Alignment with Instruction Backtranslation for more details.

Example
from easyinstruct import BacktranslationGenerator
from easyinstruct.utils.api import set_openai_key

# Step1: Set your own API-KEY
set_openai_key("YOUR-KEY")

# Step2: Declare a generator class
generator = BacktranslationGenerator(num_instructions_to_generate=10)

# Step3: Generate backtranslation data
generator.generate()

EvolInstructGenerator

EvolInstructGenerator is the class for the instruction generation method of Evol-Instruct. See WizardLM: Empowering Large Language Models to Follow Complex Instructions for more details.

Example
from easyinstruct import EvolInstructGenerator
from easyinstruct.utils.api import set_openai_key

# Step1: Set your own API-KEY
set_openai_key("YOUR-KEY")

# Step2: Declare a generator class
generator = EvolInstructGenerator(num_instructions_to_generate=10)

# Step3: Generate evolution data
generator.generate()

KG2InstructGenerator

KG2InstructGenerator is the class for the instruction generation method of KG2Instruct. See InstructIE: A Chinese Instruction-based Information Extraction Dataset for more details.

Selectors

The Selectors module standardizes the instruction selection process, enabling the extraction of high-quality instruction datasets from raw, unprocessed instruction data. The raw data can be sourced from publicly available instruction datasets or generated by the framework itself. You can choose the appropriate selector based on your specific needs.

BaseSelector

BaseSelector is the base class for all selectors.

You can also inherit from this base class to customize your own selector class. Just override the __init__ and __process__ methods.
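A custom selector following this pattern could be sketched as below. The `BaseSelector` here is a local stand-in whose `process()` is assumed to delegate to `__process__`; the real base class may take different arguments. `KeywordSelector` and `banned_words` are hypothetical names.

```python
class BaseSelector:
    """Local stand-in for easyinstruct's BaseSelector (assumed interface)."""

    def __init__(self, data):
        self.data = data

    def __process__(self, data):
        raise NotImplementedError

    def process(self):
        # The public entry point delegates to the overridable hook.
        return self.__process__(self.data)


class KeywordSelector(BaseSelector):
    """Hypothetical selector that drops samples mentioning banned words."""

    def __init__(self, data, banned_words):
        super().__init__(data)
        self.banned_words = [word.lower() for word in banned_words]

    def __process__(self, data):
        # Keep a sample only if none of the banned words occurs in its instruction.
        return [
            sample
            for sample in data
            if not any(word in sample["instruction"].lower() for word in self.banned_words)
        ]


data = [
    {"instruction": "Describe the image above"},
    {"instruction": "Add two numbers"},
]
selected = KeywordSelector(data, banned_words=["image"]).process()
```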

Deduplicator

Deduplicator is the class for eliminating duplicate instruction samples that could adversely affect both pre-training stability and the performance of LLMs. Deduplicator also enables efficient use and optimization of storage space.
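As an illustration of the idea only (not the Deduplicator's actual implementation, which is not shown here), exact-match deduplication after light normalization can be sketched as follows. The `instruction` and `output` field names are assumptions.

```python
def normalize(text):
    # Lowercase and collapse runs of whitespace so trivial variants compare equal.
    return " ".join(text.lower().split())


def deduplicate(samples):
    """Keep the first occurrence of each (instruction, output) pair."""
    seen = set()
    unique = []
    for sample in samples:
        key = (normalize(sample["instruction"]), normalize(sample.get("output", "")))
        if key not in seen:
            seen.add(key)
            unique.append(sample)
    return unique


samples = [
    {"instruction": "Translate to French: hello", "output": "Bonjour"},
    {"instruction": "translate  to french: hello", "output": "bonjour"},
    {"instruction": "Translate to French: goodbye", "output": "Au revoir"},
]
unique_samples = deduplicate(samples)
```

Near-duplicate detection (for example with similarity thresholds) would need a different technique; this sketch only removes exact matches after normalization.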

LengthSelector

LengthSelector is the class for selecting instruction samples based on the length of the instruction. Instructions that are too long or too short can affect data quality and are not conducive to instruction tuning.
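A length filter of this kind can be sketched in a few lines. This is an illustration under assumptions: length is measured in whitespace-separated tokens over the instruction and its response together, and the `min_len` / `max_len` bounds and field names are hypothetical rather than LengthSelector's real parameters.

```python
def select_by_length(samples, min_len=3, max_len=64):
    """Keep samples whose instruction plus output length lies within the bounds."""
    kept = []
    for sample in samples:
        length = len(sample["instruction"].split()) + len(sample.get("output", "").split())
        if min_len <= length <= max_len:
            kept.append(sample)
    return kept


samples = [
    {"instruction": "Hi", "output": ""},
    {"instruction": "Summarize the text below", "output": "It is about cats."},
]
kept = select_by_length(samples)
```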

RougeSelector

RougeSelector is the class for selecting instruction samples based on the ROUGE metric which is often used for evaluating the quality of automated generation of text.
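To make the metric concrete, here is a minimal pure-Python sketch of ROUGE-L (the longest-common-subsequence variant) used as a novelty filter. It is not RougeSelector's implementation; the whitespace tokenization and the 0.7 threshold are assumptions for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 between two sentences, tokenized on whitespace."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def select_novel(candidates, existing, threshold=0.7):
    """Keep candidates whose ROUGE-L with every earlier instruction is below threshold."""
    kept = []
    for cand in candidates:
        if all(rouge_l_f1(cand, other) < threshold for other in existing + kept):
            kept.append(cand)
    return kept


novel = select_novel(
    ["Summarize the following article briefly", "Write a haiku about autumn"],
    existing=["Summarize the following article"],
)
```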

GPTScoreSelector

GPTScoreSelector is the class for selecting instruction samples based on the GPT score, provided by ChatGPT, which reflects whether the output is a good example of how an AI assistant should respond to the user's instruction.

PPLSelector

PPLSelector is the class for selecting instruction samples based on perplexity, which is the exponentiated average negative log-likelihood of the response.
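The definition can be written out directly. The sketch below assumes the per-token natural-log probabilities of the response are already available from some language model; how PPLSelector obtains them is not shown here, and `token_logprobs` is a hypothetical input name.

```python
import math


def perplexity(token_logprobs):
    """Exponentiated average negative log-likelihood of a token sequence."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)


# A response whose four tokens each have probability 0.5 has perplexity 2.
ppl = perplexity([math.log(0.5)] * 4)
```

Lower perplexity means the model finds the response more predictable.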

MTLDSelector

MTLDSelector is the class for selecting instruction samples based on the MTLD, which is short for Measure of Textual Lexical Diversity.
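For readers unfamiliar with the metric, the following is a minimal sketch of the usual MTLD computation, assuming the conventional TTR threshold of 0.72 and averaging a forward and a backward pass. It is not MTLDSelector's own code, and the handling of texts that never reach the threshold is a simplification.

```python
def _mtld_one_direction(tokens, threshold):
    """Total tokens divided by the number of 'factors' found in one pass."""
    factors = 0.0
    types = set()
    count = 0
    for token in tokens:
        count += 1
        types.add(token)
        # A factor is complete once the running type-token ratio hits the threshold.
        if len(types) / count <= threshold:
            factors += 1.0
            types.clear()
            count = 0
    if count > 0:
        # Credit the unfinished segment with a partial factor.
        ttr = len(types) / count
        factors += (1.0 - ttr) / (1.0 - threshold)
    if factors == 0:
        # Simplification: the TTR never dropped, so report the text length.
        return float(len(tokens))
    return len(tokens) / factors


def mtld(tokens, threshold=0.72):
    """Mean of the forward and backward one-directional MTLD scores."""
    if not tokens:
        raise ValueError("need at least one token")
    forward = _mtld_one_direction(tokens, threshold)
    backward = _mtld_one_direction(tokens[::-1], threshold)
    return (forward + backward) / 2


repetitive_score = mtld(["a"] * 10)
diverse_score = mtld(["a", "b", "c"])
```

Higher scores indicate that the text sustains lexical variety over longer stretches.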

CodeSelector

CodeSelector is the class for selecting code instruction samples based on the Complexity-Impacted Reasoning Score (CIRS), which combines structural and logical attributes, to measure the correlation between code and reasoning abilities. See When Do Program-of-Thoughts Work for Reasoning? for more details.

Example
from easyinstruct import CodeSelector

# Step1: Specify your source file of code instructions
src_file = "data/code_example.json"

# Step2: Declare a code selector class
selector = CodeSelector(
    source_file_path=src_file,
    target_dir="data/selections/",
    manually_partion_data=True,
    min_boundary=0.125,
    max_boundary=0.5,
    automatically_partion_data=True,
    k_means_cluster_number=2,
)

# Step3: Process the code instructions
selector.process()

MultiSelector

MultiSelector is the class for combining multiple appropriate selectors based on your specific needs.
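Conceptually, combining selectors amounts to running the data through each one in turn. The sketch below illustrates that idea with plain functions; it assumes sequential application and does not reflect MultiSelector's real constructor or arguments, which are not shown here. `apply_selectors`, `min_words`, and `drop_duplicates` are hypothetical helpers.

```python
def apply_selectors(samples, selectors):
    """Run the samples through each selector function in order."""
    for select in selectors:
        samples = select(samples)
    return samples


def min_words(n):
    # Build a selector that keeps instructions with at least n words.
    return lambda samples: [s for s in samples if len(s["instruction"].split()) >= n]


def drop_duplicates(samples):
    # Keep the first occurrence of each instruction string.
    seen, unique = set(), []
    for sample in samples:
        if sample["instruction"] not in seen:
            seen.add(sample["instruction"])
            unique.append(sample)
    return unique


samples = [
    {"instruction": "Hi"},
    {"instruction": "Sort this list of numbers"},
    {"instruction": "Sort this list of numbers"},
]
selected = apply_selectors(samples, [min_words(3), drop_duplicates])
```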

Prompts

The Prompts module standardizes the instruction prompting step, where user requests are constructed as instruction prompts and sent to specific LLMs to obtain responses. You can choose the appropriate prompting method based on your specific needs.

Please check out the linked documentation for more details.

Engines

The Engines module standardizes the instruction execution process, enabling the execution of instruction prompts on specific locally deployed LLMs. You can choose the appropriate engine based on your specific needs.

Please check out the linked documentation for more details.


🚩Citation

Please cite our repository if you use EasyInstruct in your work.

@article{ou2024easyinstruct,
  title={EasyInstruct: An Easy-to-use Instruction Processing Framework for Large Language Models},
  author={Ou, Yixin and Zhang, Ningyu and Gui, Honghao and Xu, Ziwen and Qiao, Shuofei and Bi, Zhen and Chen, Huajun},
  journal={arXiv preprint arXiv:2402.03049},
  year={2024}
}

@misc{knowlm,
  author = {Ningyu Zhang and Jintian Zhang and Xiaohan Wang and Honghao Gui and Kangwei Liu and Yinuo Jiang and Xiang Chen and Shengyu Mao and Shuofei Qiao and Yuqi Zhu and Zhen Bi and Jing Chen and Xiaozhuan Liang and Yixin Ou and Runnan Fang and Zekun Xi and Xin Xu and Lei Li and Peng Wang and Mengru Wang and Yunzhi Yao and Bozhong Tian and Yin Fang and Guozhou Zheng and Huajun Chen},
  title = {KnowLM: An Open-sourced Knowledgeable Large Language Model Framework},
  year = {2023},
  url = {http://knowlm.zjukg.cn/},
}

@article{bi2023program,
  title={When do program-of-thoughts work for reasoning?},
  author={Bi, Zhen and Zhang, Ningyu and Jiang, Yinuo and Deng, Shumin and Zheng, Guozhou and Chen, Huajun},
  journal={arXiv preprint arXiv:2308.15452},
  year={2023}
}

🎉Contributors

We will offer long-term maintenance to fix bugs, solve issues and meet new requests. If you run into any problems, please open an issue.

Other Related Projects

🙌 We would like to express our heartfelt gratitude for the contribution of Self-Instruct to our project, as we have utilized portions of their source code in our project.

easyinstruct's People

Contributors

bizhen46766, coderxyd, eltociear, flow3rdown, gooodte, guihonghao, njcx-ai, oe-heart, shengyumao, tubg, wakawaka111, xzwyyd, zxlzr


easyinstruct's Issues

Update Anthropic Client

Anthropic changed their Python SDK, making this line of code outdated.

client = anthropic.Client(get_anthropic_key())

Would love to know if this might help - https://github.com/BerriAI/litellm

A simple I/O library that standardizes all LLM API calls to the OpenAI call format:

import os

from litellm import completion

## set ENV variables
# ENV variables can be set in .env file, too. Example in .env.example
os.environ["OPENAI_API_KEY"] = "openai key"
os.environ["ANTHROPIC_API_KEY"] = "anthropic key"

messages = [{ "content": "Hello, how are you?","role": "user"}]

# openai call
response = completion(model="gpt-3.5-turbo", messages=messages)

# anthropic call
response = completion("claude-v-2", messages)

kg2instruction

generator = KG2InstructGenerator(
    label_db="/newdisk3/data/guihh/KG2Instruction/data/db/label_zh.db",
    alias_db="/newdisk3/data/guihh/KG2Instruction/data/db/alias_zh.db",
    alias_rev_db="/newdisk3/data/guihh/KG2Instruction/data/db/alias_rev_zh.db",
    relation_db="/newdisk3/data/guihh/KG2Instruction/data/db/relation.db",
    relation_map_path="/newdisk3/data/guihh/KG2Instruction/data/other/relation_map.json",
    model="/newdisk3/data/guihh/KG2Instruction/model/close_tok_pos_ner_srl_dep_sdp_con_electra_base",
    language="zh",
    device=0,
    add_relation_value=False,
)

Are the database files referenced in the examples/kg2instruction.py example not provided? I could not find them =3=

ERROR: Failed building wheel for llama-cpp-python

Building wheel for llama-cpp-python (pyproject.toml) ... error
error: subprocess-exited-with-error

Not searching for unused variables given on the command line.
-- The C compiler identification is unknown
CMake Error at CMakeLists.txt:3 (ENABLE_LANGUAGE):
No CMAKE_C_COMPILER could be found.

    Tell CMake where to find the compiler by setting either the environment
    variable "CC" or the CMake cache entry CMAKE_C_COMPILER to the full path to
    the compiler, or to the compiler name if it is in the PATH.


  -- Configuring incomplete, errors occurred!

ERROR: Failed building wheel for llama-cpp-python
Failed to build llama-cpp-python
ERROR: Could not build wheels for llama-cpp-python, which is required to install pyproject.toml-based projects

ModuleNotFoundError: No module named 'easyinstruct.utils'

When running example/llm/run.py with easyinstruct==0.0.3 already installed, the following error occurs:

Traceback (most recent call last):
  File "/home/ubuntu/MyFiles/DeepKE-main/example/llm/run.py", line 6, in <module>
    from easyinstruct.prompts import IEPrompt
  File "/home/ubuntu/miniconda3/envs/py39-cul16/lib/python3.9/site-packages/easyinstruct/__init__.py", line 1, in <module>
    from .prompts import *
  File "/home/ubuntu/miniconda3/envs/py39-cull6/lib/python3.9/site-packages/easyinstruct/prompts/__init__.py", line 1, in <module>
    from .base_prompt import BasePrompt
  File "/home/ubuntu/miniconda3/envs/py39-cul16/lib/python3.9/site-packages/easyinstruct/prompts/base_prompt.py", line 5, in <module>
    from easyinstruct.utils.api import API_NAME_DICT
ModuleNotFoundError: No module named 'easyinstruct.utils'
Process finished with exit code 1

How can this be resolved?

Error on import easyinstruct: what is the cause?

An error occurs when running import easyinstruct. What is causing it?

Traceback (most recent call last):
  File "/mnt/DeepKE/example/llm/LLMICL/run.py", line 8, in <module>
    import easyinstruct
  File "/mnt/DeepKE/example/llm/EasyInstruct/easyinstruct/__init__.py", line 1, in <module>
    from .prompts import *
  File "/mnt/DeepKE/example/llm/EasyInstruct/easyinstruct/prompts/__init__.py", line 1, in <module>
    from .base_prompt import BasePrompt
  File "/mnt/DeepKE/example/llm/EasyInstruct/easyinstruct/prompts/base_prompt.py", line 3, in <module>
    import cohere
  File "/root/miniconda3/envs/deepke-llm/lib/python3.9/site-packages/cohere/__init__.py", line 3, in <module>
    from .types import (
  File "/root/miniconda3/envs/deepke-llm/lib/python3.9/site-packages/cohere/types/__init__.py", line 22, in <module>
    from .chat_stream_end_event import ChatStreamEndEvent
  File "/root/miniconda3/envs/deepke-llm/lib/python3.9/site-packages/cohere/types/chat_stream_end_event.py", line 10, in <module>
    from .non_streamed_chat_response import NonStreamedChatResponse
  File "/root/miniconda3/envs/deepke-llm/lib/python3.9/site-packages/cohere/types/non_streamed_chat_response.py", line 15, in <module>
    from .message import Message
  File "/root/miniconda3/envs/deepke-llm/lib/python3.9/site-packages/cohere/types/message.py", line 14, in <module>
    class Message_Chatbot(ChatMessage):
  File "pydantic/main.py", line 197, in pydantic.main.ModelMetaclass.__new__
  File "pydantic/fields.py", line 506, in pydantic.fields.ModelField.infer
  File "pydantic/fields.py", line 436, in pydantic.fields.ModelField.__init__
  File "pydantic/fields.py", line 552, in pydantic.fields.ModelField.prepare
  File "pydantic/fields.py", line 668, in pydantic.fields.ModelField._type_analysis
  File "/root/miniconda3/envs/deepke-llm/lib/python3.9/typing.py", line 852, in __subclasscheck__
    return issubclass(cls, self.__origin__)
TypeError: issubclass() arg 1 must be a class

Cannot load the engines folder under the easyinstruct folder

Installed using setup.py.
import easyinstruct cannot load the engines folder under the easyinstruct folder and fails with:

import easyinstruct
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "d:\tianchi\easyinstruct\easyinstruct\__init__.py", line 1, in <module>
    from .prompts import *
  File "d:\tianchi\easyinstruct\easyinstruct\prompts\__init__.py", line 1, in <module>
    from .base_prompt import BasePrompt
  File "d:\tianchi\easyinstruct\easyinstruct\prompts\base_prompt.py", line 7, in <module>
    from engines import llama_engine
ModuleNotFoundError: No module named 'engines'

Migration of OpenAI API

The OpenAI Python library has been updated to version 1, and the openai.ChatCompletion module is no longer supported.

In base_prompt.py,

response = openai.ChatCompletion.create(
    model=engine,
    messages=messages,
    temperature=temperature,
    max_tokens=max_tokens,
    top_p=top_p,
    n=n,
    frequency_penalty=frequency_penalty,
    presence_penalty=presence_penalty,
)

should be changed to

response = client.chat.completions.create(
    model=engine,
    messages=messages,
    temperature=temperature,
    max_tokens=max_tokens,
    top_p=top_p,
    n=n,
    frequency_penalty=frequency_penalty,
    presence_penalty=presence_penalty,
)
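The replacement call needs a `client` object, which the snippet above does not create. A minimal sketch of how it could be wired up with the v1 library follows; `make_client` and `chat` are hypothetical helper names, not functions from EasyInstruct, and the default parameter values are placeholders.

```python
def make_client(api_key):
    # Imported lazily so the helper below can be exercised without the SDK installed.
    from openai import OpenAI

    return OpenAI(api_key=api_key)


def chat(client, engine, messages, temperature=0.0, max_tokens=256):
    """Send a chat request through a v1-style client and return the reply text."""
    response = client.chat.completions.create(
        model=engine,
        messages=messages,
        temperature=temperature,
        max_tokens=max_tokens,
    )
    # In v1 the response is an object, so fields are read as attributes.
    return response.choices[0].message.content
```

Usage would be along the lines of `chat(make_client("YOUR-KEY"), "gpt-3.5-turbo", [{"role": "user", "content": "Hello"}])`. Note that code reading the old dictionary-style response (for example `response["choices"]`) also needs updating to attribute access.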

Has the code related to CIRS been integrated into this project?

I just read the paper When Do Program-of-Thoughts Work for Reasoning?. The result is pretty interesting, and I would like to use the CIRS metric to test some other public code datasets. The paper says the code will be integrated into this project, so I am here to ask for help. Thank you :)

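Questions: 1) How are the triples output by KG2Instruct after information extraction used? 2) Can the instruction-filtering code for Elimination Evolving in Evol-Instruct be made public?

1. Judging from the current code, the KG2Instruct method outputs triples produced by information extraction. Could you describe in detail how instruction-tuning training samples are generated from these triples, and release the related code?
2. The instruction-filtering principles of Elimination Evolving in the Evol-Instruct method include:
1) Compared with the original instruction, the evolved instruction provides no information gain;
2) The evolved instruction makes it difficult for LLMs to generate a response;
......
How exactly are these filtering principles implemented? Could you describe them in detail or release the related code?

Thank you! Looking forward to your reply.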
