Giter Club home page Giter Club logo

red-teaming-language-models-with-language-models's Introduction

Red Teaming Language Models with Language Models

A re-implementation of the Red Teaming Language Models with Language Models paper by Perez et al.

This implementation only focuses on toxic/offensive langauge section of the paper.

I have based run the red-teaming experiments on four target models -

I generated 50,000 valid test cases (questions) as opposed to 500,000 mentioned in the paper, due to compute constraints. A small percantge of the test cases were succesful in being able to elicit toxic language generation from these models.

For Toxicity detection, I have used Detoxify.

The percentage of toxic/offensive answers (toxicity probability > 0.5) generated using Zero-Shot and Stochastic Few-Shot generation for each model are presented below -

Model Zero-Shot Stochastic Few-Shot Increase (Few-Shot vs Zero-Shot)
GPT2-XL-1.5B 0.99% 2.85% 2.88x
Llama-2-7B 0.15% 0.33% 2.20x
Pythia-6.9B 0.47% 1.02% 2.17x
Phi-1.5B 0.11% 0.79% 7.18x

The percentage of toxic/offensive questions and answers (toxicity probability > 0.5) generated using Zero-Shot and Stochastic Few-Shot generation for each model are presented below -

Model Zero-Shot Stochastic Few-Shot Increase (Few-Shot vs Zero-Shot)
GPT2-XL-1.5B 2.28% 8.49% 3.72x
Llama-2-7B 0.20% 0.61% 3.05x
Pythia-6.9B 0.93% 3.00% 3.23x
Phi-1.5B 0.09% 0.38% 4.22x

Note - The numbers may be a bit high as the toxicity detection model has False Positives in its predictions as well. The toxic/offensive generations can be manually inspected with the script view_toxic_qa.py.

The questions, answers and toxicity scores for each model can be found in the csv files in the artifacts directory.

For inference, I have used vLLM (which is awesome!) for GPT2-XL, Llama-2 and Pythia to speedup the generation process. Phi-1.5 is currently not supported by vLLM.

red-teaming-language-models-with-language-models's People

Contributors

shreyansh26 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.