Giter Club home page Giter Club logo

auditing-llms's Introduction

This repository contains code for the following paper:

Automatically Auditing Large Language Models via Discrete Optimization

Erik Jones, Anca Dragan, Aditi Raghunathan, and Jacob Steinhardt

Setup

First, create and activate the conda environment using:

conda env create -f environment.yml
conda activate auditing-llms

Reversing LLMs

In order to run the experiments where we reverse large language models, i.e. produce prompts that find a fixed output, modify the following example command:

python reverse_experiment.py --save_every 10 --n_trials 1 --arca_iters 50 --arca_batch_size 32 --prompt_length 3 --lam_perp 0.2 --label your-file-label --filename senators.txt --opts_to_run arca --model_id gpt2

This uses the following parameters:

  • --save_every dictates how often the returned outputs are saved
  • --n_trials is the number of times the optimizer is restarted
  • --lam_perp is the weight of the perplexity loss. Set to 0 to avoid (this makes inputs easier to recover, but they tend to be less natural)
  • --prompt_length is the number of tokens in the prompt.
  • --label is a naume used for saving
  • --filename is a text file containing the fixed outputs, stored in data. We include senators.txt, tox_1tok.txt, tox_2tok.txt, and tox_3tok.txt, where the last three files contain CivilComments examples that at least half of annotators label as toxic, and have 1, 2, and 3 tokens using the GPT-2 tokenizer.
  • --opts_to_run specifies if arca, autoprompt, or gbda should be run
  • --arca_iters is the number of full coordinate ascent iterations (through all coordinates) for arca and autoprompt
  • --arca_batch_size is both the number of gradients averaged and the number of candidates to compute the loss on exactly for arca and autoprompt
  • --gbda_initializations is number of parallel gbda optimizers to run at once (used when gbda is run)
  • --gbda_iters for the number of gbda steps (used when gbda is run)
  • --model_id specifies which model should be audited. You can also optionally add constraints on what tokens are allowed to appear in the input:
  • --unigram_input_constraint[optional] specifies a unigarm objective over the inputs
  • --inpt_tok_constraint[optional] specifies a constraint on what kind of tokens are allowed to appear in the input (in this case, only tokens that are all letters).
  • --prompt_prefix[optional] fixed prefix that comes before the optimized prompt.

Jointly optimizing over prompts and outputs

To run the experiment where you jointly optimize over prompts and outputs, run e.g.:

python joint_optimization_experiment.py --save_every 10 --n_trials 100 --arca_iters 50 --arca_batch_size 32 --lam_perp 0.5 --label your-file-label --model gpt2 --unigram_weight 0.6 --unigram_input_constraint not_toxic --unigram_output_constraint toxic --opts_to_run arca --prompt_length 3 --output_length 2 --prompt_prefix He said

This includes the following additional paramters:

  • --output_length is the number of tokens in the output
  • --unigram_output_constraint[optional] specifies a unigarm objective over the outputs
  • --output_tok_constraint[optional] specifies a constraint on what kind of tokens are allowed to appear in the output (in this case, only tokens that are all letters)

auditing-llms's People

Contributors

ejones313 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.