Interactive Composition Explorer 🧊

Decomposition of paper Q&A using humans and language models

Design principles

  • Recipes are decompositions of a task into subtasks.

    The meaning of a recipe is: If a human executed these steps and did a good job at each workspace in isolation, the overall answer would be good. This decomposition may be informed by what we think ML can do at this point, but the recipe itself (as an abstraction) doesn’t know about specific agents.

  • Agents perform atomic subtasks of predefined shapes, like completion, scoring, or classification.

    Agents don't know which recipe is calling them. Agents don’t maintain state between subtasks. Agents generally try to complete all subtasks they're asked to complete (however badly), but some will not have implementations for certain task types.

  • The mode in which a recipe runs is a global setting that can affect every agent call, for instance whether to use humans or agents. Recipes can also run with RecipeSettings, which map a task type to a specific agent_name and so change which agent is used for that specific type of task.
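
The three principles above can be illustrated with a minimal sketch. Every name in it (`UppercaseAgent`, `CannedAgent`, `AGENTS`, `MODE_DEFAULTS`, `Settings`, `pick_agent`, `summarize_recipe`) is hypothetical and chosen for illustration only; this is not ICE's actual API, just one way the separation between recipes, agents, and mode could look.

```python
from dataclasses import dataclass, field
from typing import Dict, Protocol


class Agent(Protocol):
    """An agent performs one atomic subtask; it never sees the calling recipe."""

    def complete(self, prompt: str) -> str: ...


class UppercaseAgent:
    """Stand-in for a machine agent (hypothetical). Keeps no state between calls."""

    def complete(self, prompt: str) -> str:
        return prompt.upper()


class CannedAgent:
    """Stand-in for a human agent that always gives the same reply (hypothetical)."""

    def complete(self, prompt: str) -> str:
        return "canned reply"


# Hypothetical registry mapping agent names to agent instances.
AGENTS: Dict[str, Agent] = {"uppercase": UppercaseAgent(), "canned": CannedAgent()}

# Hypothetical default agent name for each global mode.
MODE_DEFAULTS: Dict[str, str] = {"machine": "uppercase", "human": "canned"}


@dataclass
class Settings:
    """Optional per-task-type overrides, loosely modeled on the idea of RecipeSettings."""

    agent_for_task: Dict[str, str] = field(default_factory=dict)


def pick_agent(mode: str, task_type: str, settings: Settings) -> Agent:
    """Use the override for this task type if present, else the mode default."""
    name = settings.agent_for_task.get(task_type, MODE_DEFAULTS[mode])
    return AGENTS[name]


def summarize_recipe(question: str, mode: str, settings: Settings) -> str:
    """A recipe: split one task into two subtasks and combine the results."""
    agent = pick_agent(mode, "completion", settings)
    part_a = agent.complete(f"background: {question}")
    part_b = agent.complete(f"answer: {question}")
    return f"{part_a} | {part_b}"


if __name__ == "__main__":
    print(summarize_recipe("does it work?", "machine", Settings()))
    print(summarize_recipe("does it work?", "machine", Settings({"completion": "canned"})))
```

The recipe function never names a concrete agent: the mode picks a default, and the settings object can override it per task type, which mirrors the division of responsibilities described above.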

Running ICE with Codespaces

A convenient way to develop ICE is to use GitHub Codespaces.

  1. Increase the default idle timeout in your settings from 30 minutes to a few hours.

  2. Go here to create a new codespace.

  3. Install the frontend dependencies:

    (cd ui; npm ci)
  4. Start ICE in its own terminal and leave it running:

    scripts/run-local.sh
  5. Go through the tutorial or follow the instructions for running a recipe.

  6. To share your visualizations publicly, on the ports pane (F1 to open the command palette -> "Ports: Focus on Ports View"), change port 3000 to be public (Right Click -> Port Visibility -> Public), and click the 🌐 icon in the "Local Address" field to get the link.

Running ICE locally

Requirements

  1. Docker Desktop to run the containerized Python application.
  2. Node to run the composition visualization tool. Node can be installed via nvm. Install nvm, then cd ui && nvm use && npm install.

Setup

  1. Add required secrets to .env. See .env.example for a model. If you are using Codespaces, you can skip this step, as the required secrets will already be in your environment.

  2. Install the frontend dependencies:

    (cd ui; npm ci)
  3. Start ICE in its own terminal and leave it running:

    scripts/run-local.sh
  4. Go through the tutorial or follow the instructions for running a recipe.

Running ICE on the command line

Human data collection

Human without GPT default completions:

./scripts/run-recipe.sh --mode human

Human with GPT default completions:

./scripts/run-recipe.sh --mode augmented

GPT

./scripts/run-recipe.sh --mode machine

To run a specific recipe on its iteration gold standards:

./scripts/run-recipe.sh --mode machine -r placebotree -q placebo -g iterate

To run over multiple gold standard splits, just provide them separated by spaces:

scripts/run-recipe.sh --mode machine -r placebotree -q placebo -g iterate validation
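
If you script several runs, it can help to assemble the argument list in one place. The sketch below uses only the flags shown above (`--mode`, `-r`, `-q`, `-g`); the helper name `build_command` and its parameter names are hypothetical, and reading `-r` as the recipe and `-q` as the question is an assumption based on the examples.

```python
import shlex
from typing import List, Sequence


def build_command(mode: str, recipe: str, question: str, splits: Sequence[str]) -> List[str]:
    """Assemble the argument list for scripts/run-recipe.sh (hypothetical helper)."""
    command = ["./scripts/run-recipe.sh", "--mode", mode, "-r", recipe, "-q", question]
    if splits:
        # All gold standard splits follow a single -g flag, separated by spaces.
        command += ["-g", *splits]
    return command


if __name__ == "__main__":
    args = build_command("machine", "placebotree", "placebo", ["iterate", "validation"])
    # Print the command rather than running it; pass `args` to subprocess.run to execute.
    print(shlex.join(args))
```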

Streamlit

Run the streamlit apps like this:

./scripts/run-streamlit.sh

This opens a multi-page app that lets you select specific scripts.

To add a page, simply create a script in the streamlits/pages folder.

Evaluation

When you run a recipe, ICE will evaluate the results based on the gold standards in gold_standards/. You'll see the results on-screen, and they'll be saved as CSVs in data/evaluation_csvs/. You can then upload the CSVs to the "Performance dashboard" and "Individual paper eval" tables in the ICE Airtable.
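
Before uploading, you may want a quick look at what was written. The following is a minimal sketch that lists each CSV in data/evaluation_csvs/ with its column names and row count. It assumes nothing about the column layout, which is not described here; the helper name `summarize_csvs` is hypothetical.

```python
import csv
from pathlib import Path
from typing import Dict, List, Tuple


def summarize_csvs(directory: Path) -> Dict[str, Tuple[List[str], int]]:
    """Map each CSV file name to (column names, number of data rows)."""
    summary: Dict[str, Tuple[List[str], int]] = {}
    for path in sorted(directory.glob("*.csv")):
        with path.open(newline="") as handle:
            reader = csv.DictReader(handle)
            rows = list(reader)
            # fieldnames is None for an empty file, so fall back to an empty list.
            summary[path.name] = (list(reader.fieldnames or []), len(rows))
    return summary


if __name__ == "__main__":
    for name, (columns, n_rows) in summarize_csvs(Path("data/evaluation_csvs")).items():
        print(f"{name}: {n_rows} rows, columns: {', '.join(columns)}")
```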

Evaluate in-app QA results

  1. Set up both ice and elicit-next so that they can run on your computer
  2. Switch to the eval branch of elicit-next, or a branch from the eval branch. This branch should contain the QA code and gold standards that you want to evaluate.
  3. If the ice QA gold standards (gold_standards/gold_standards.csv) might be out of date, download this Airtable view (all rows, all fields) as a CSV and save it as gold_standards/gold_standards.csv
  4. Duplicate the All rows, all fields view in Airtable, then in your duplicated view, filter to exactly the gold standards you'd like to evaluate and download it as a CSV. Save that CSV as api/eval/gold_standards/gold_standards.csv in elicit-next
  5. Make sure api/eval/papers in elicit-next contains all of the gold standard papers you want to evaluate
  6. In ice, run scripts/eval-in-app-qa.sh <path to elicit-next>. If you have elicit-next cloned as a sibling of ice, this would be scripts/eval-in-app-qa.sh $(pwd)/../elicit-next/.

This will generate the same sort of eval as for ICE recipes.

Development

Running tests

Cheap integration tests:

./scripts/run-recipe.sh --mode test

Unit tests:

./scripts/run-tests.sh

Adding new Python dependencies

  1. Manually add the dependency to pyproject.toml
  2. Update the lock file and install the changes:
    docker compose exec backend poetry lock --no-update
    docker compose exec backend poetry install

The lockfile update step will take about 15 minutes.

You do not need to stop, rebuild, and restart the docker containers.

Upgrading poetry

To upgrade poetry to a new version:

  1. In the Dockerfile, temporarily change pip install -r poetry-requirements.txt to pip install poetry==DESIRED_VERSION
  2. Generate a new poetry-requirements.txt:
    docker compose build
    docker compose up -d
    docker compose exec backend bash -c 'pip freeze > poetry-requirements.txt'
  3. Revert the Dockerfile changes

Contributions

Before making a PR, check linting, types, tests, etc:

scripts/checks.sh
