Introducing gpt_annotate

gpt_annotate is an easy-to-use Python package that streamlines automated text annotation with LLMs across different tasks and datasets. All you need is an OpenAI API key, the text samples you want to annotate, and a codebook (i.e., task-specific instructions) for the LLM.

  • OpenAI API key
  • text_to_annotate:
    • A dataframe with one column of text samples and, if you are comparing the LLM output against human labels, any number of one-hot-encoded category columns. The text column must be the first column in your data. We provide Python code (described below) that helps format text_to_annotate so that annotation runs correctly.
  • codebook:
    • Task-specific instructions (as a string) that prompt the LLM to annotate the data. Like a codebook for qualitative content analysis, it should clearly describe the dataset, state the type of task for the LLM, and, most importantly, delineate the categories of interest for the LLM to annotate. We provide Python code to standardize the beginning and ending of the codebook so that the LLM understands that the task is annotation.
    • For example, the text of codebook could be: "You will be classifying text samples. Each text sample is a tweet. Classify each tweet on two dimensions: a) POLITICAL; b) PRESIDENT. For POLITICAL, label as 1 if the tweet is about politics; label as 0 if not. For PRESIDENT, label as 1 if the tweet refers to a past or present president, a candidate for president, or a presidential election; label as 0 if not. Classify the following text samples:"
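
As a rough illustration of the expected layout, the sketch below writes a small CSV with the text in the first column followed by one-hot-encoded human labels for the two categories in the example codebook. The tweets, the labels, the `text` column name, and the `build_rows` helper are all made up for illustration and are not part of gpt_annotate; the package's own prepare_data function remains the supported way to check the format.

```python
import csv
import io

# Hypothetical human-coded samples: (text, set of categories that apply).
samples = [
    ("The senate votes on the budget tomorrow.", {"POLITICAL"}),
    ("The president spoke at the rally last night.", {"POLITICAL", "PRESIDENT"}),
    ("Great pasta recipe for a weeknight dinner.", set()),
]
categories = ["POLITICAL", "PRESIDENT"]  # category names from the example codebook


def build_rows(samples, categories):
    """Return a header row plus one row per sample: text first, then 0/1 per category."""
    rows = [["text"] + categories]  # "text" is an assumed column name
    for text, labels in samples:
        rows.append([text] + [1 if cat in labels else 0 for cat in categories])
    return rows


rows = build_rows(samples, categories)

# Write to an in-memory buffer; swap in open("text_to_annotate.csv", "w", newline="")
# to produce a file that pd.read_csv can load.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

If you have no human labels, the same layout applies with the category columns omitted.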

To annotate your text data using gpt_annotate, we recommend following the sample code we provide in sample_annotation_code.ipynb.

As shown in sample_annotation_code.ipynb, annotating your text data with LLMs takes four steps:

  1. Import the required dependencies (including gpt_annotate.py).
import openai
import pandas as pd
import math
import time
import numpy as np
import tiktoken
# Import the main package: gpt_annotate.py
# Make sure the .py file is in the same directory as the .ipynb file,
# or provide the correct relative or absolute path to the .py file.
import gpt_annotate
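  2. Read in your codebook (i.e., task-specific instructions) and the text samples you want to annotate.
# Placeholder: replace with your own OpenAI API key (used as `key` in the steps below)
key = "YOUR_OPENAI_API_KEY"
text_to_annotate = pd.read_csv("text_to_annotate.csv")
with open('codebook.txt', 'r') as file:
    codebook = file.read()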
  3. To ensure your data is in the right format, first run gpt_annotate.prepare_data(text_to_annotate, codebook, key). If you are annotating text data without any human labels to compare against, set human_labels = False. If you want to add standardized language to the beginning and end of your codebook so that GPT labels your text samples, set prep_codebook = True.
text_to_annotate = gpt_annotate.prepare_data(text_to_annotate, codebook, key)
  4. If you are comparing LLM output to human labels, run gpt_annotate.gpt_annotate(text_to_annotate, codebook, key). If you are using gpt_annotate only for prediction (i.e., there are no human labels to measure performance against), run gpt_annotate.gpt_annotate(text_to_annotate, codebook, key, human_labels = False). It's as easy as that!
# Annotate the data (returns 4 outputs)
gpt_out_all, gpt_out_final, performance, incorrect = gpt_annotate.gpt_annotate(text_to_annotate, codebook, key)
# Annotate the data without human labels to compare against (returns 2 outputs)
gpt_out_all, gpt_out_final = gpt_annotate.gpt_annotate(text_to_annotate, codebook, key, human_labels = False)

Outputs:

  1. gpt_out_all
    • Raw outputs for every iteration.
  2. gpt_out_final
    • Annotation outputs after taking the modal category answer across iterations and calculating consistency scores.
  3. performance
    • Accuracy, precision, recall, and F1.
  4. incorrect
    • Any incorrect classification or any classification with a consistency score below 1.0.
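
To make the terms above concrete, here is a minimal sketch of how a modal answer, a consistency score, and the performance metrics could be computed for a single binary category. It is an illustration under assumptions, not the code in gpt_annotate.py: the helper names, the hard-coded labels, and the reading of consistency as the share of iterations that agree with the modal answer are all hypothetical.

```python
from collections import Counter


def modal_label(votes):
    """Return (most common label, share of iterations that agree with it)."""
    label, freq = Counter(votes).most_common(1)[0]
    return label, freq / len(votes)


def binary_metrics(predicted, human):
    """Accuracy, precision, recall, and F1 for 0/1 labels."""
    pairs = list(zip(predicted, human))
    tp = sum(1 for p, h in pairs if p == 1 and h == 1)
    fp = sum(1 for p, h in pairs if p == 1 and h == 0)
    fn = sum(1 for p, h in pairs if p == 0 and h == 1)
    accuracy = sum(1 for p, h in pairs if p == h) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1


# Hypothetical labels for one category: 3 iterations over 4 text samples.
iterations = [
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
]
human = [1, 0, 1, 0]

# Transpose so each tuple holds the 3 votes for one text sample.
per_sample = [modal_label(votes) for votes in zip(*iterations)]
final = [label for label, _ in per_sample]
consistency = [score for _, score in per_sample]

accuracy, precision, recall, f1 = binary_metrics(final, human)

# Flag samples that are wrong or that the iterations did not fully agree on.
flagged = [i for i, (label, score) in enumerate(per_sample)
           if label != human[i] or score < 1.0]
print(final, consistency, flagged)
```

With an odd number of iterations and binary labels there are no ties; other settings would need an explicit tie-breaking rule.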

Below we define the optional parameters of gpt_annotate() that let you customize your annotation procedure.

  • num_iterations:
    • Number of times to classify each text sample. Default is 3.
  • model:
    • OpenAI GPT model, which is either gpt-3.5-turbo or gpt-4. Default is gpt-4.
  • temperature:
    • LLM temperature parameter (ranges 0 to 1), which controls how much randomness is introduced into the model's output. Default is 0.6.
  • batch_size:
    • Number of text samples to be annotated in each batch. Default is 10.
  • human_labels:
    • Boolean indicating whether text_to_annotate has human labels to compare LLM outputs to. Default is True.
  • data_prep_warning:
    • Boolean indicating whether to print data_prep_warning.
  • time_cost_warning:
    • Boolean indicating whether to print time_cost_warning.
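
To get a feel for how batch_size and num_iterations interact, the sketch below splits text samples into batches and counts requests. It assumes that each batch is sent as one request per iteration; that assumption and both helper names are illustrative only, so consult gpt_annotate.py for the actual batching logic.

```python
import math


def make_batches(items, batch_size=10):
    """Split items into consecutive chunks of at most batch_size elements."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def estimated_requests(num_samples, batch_size=10, num_iterations=3):
    """Requests needed if every batch is sent once per iteration (an assumption)."""
    return math.ceil(num_samples / batch_size) * num_iterations


texts = [f"sample {i}" for i in range(25)]  # placeholder text samples
batches = make_batches(texts)
print(len(batches), [len(b) for b in batches])
print(estimated_requests(len(texts)))
```

Under this assumption, raising num_iterations multiplies the number of requests, while raising batch_size reduces it at the price of longer prompts.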

Please email us ([email protected]) with any suggestions or problems encountered with the code.

Contributors: npangakis, donbowen
