Giter Club home page Giter Club logo

automated-interpretability's Introduction

Automated interpretability

Code and tools

This repository contains code and tools associated with the Language models can explain neurons in language models paper, specifically:

  • Code for automatically generating, simulating, and scoring explanations of neuron behavior using the methodology described in the paper. See the neuron-explainer README for more information.

  • A tool for viewing neuron activations and explanations, accessible here. See the neuron-viewer README for more information.

Public datasets

Together with this code, we're also releasing public datasets of GPT-2 XL neurons and explanations. Here's an overview of those datasets.

  • Neuron activations: az://openaipublic/neuron-explainer/data/collated-activations/{layer_index}/{neuron_index}.json
    • Tokenized text sequences and their activations for the neuron. We provide multiple sets of tokens and activations: top-activating ones, random samples from several quantiles; and a completely random sample. We also provide some basic statistics for the activations.
    • Each file contains a JSON-formatted NeuronRecord dataclass.
  • Neuron explanations: az://openaipublic/neuron-explainer/data/explanations/{layer_index}/{neuron_index}.jsonl
    • Scored model-generated explanations of the behavior of the neuron, including simulation results.
    • Each file contains a JSON-formatted NeuronSimulationResults dataclass.
  • Related neurons: az://openaipublic/neuron-explainer/data/related-neurons/weight-based/{layer_index}/{neuron_index}.json
    • Lists of the upstream and downstream neurons with the most positive and negative connections (see below for definition).
    • Each file contains a JSON-formatted dataclass whose definition is not included in this repo.
  • Tokens with high average activations: az://openaipublic/neuron-explainer/data/related-tokens/activation-based/{layer_index}/{neuron_index}.json
    • Lists of tokens with the highest average activations for individual neurons, and their average activations.
    • Each file contains a JSON-formatted TokenLookupTableSummaryOfNeuron dataclass.
  • Tokens with large inbound and outbound weights: az://openaipublic/neuron-explainer/data/related-tokens/weight-based/{layer_index}/{neuron_index}.json
    • List of the most-positive and most-negative input and output tokens for individual neurons, as well as the associated weight (see below for definition).
    • Each file contains a JSON-formatted WeightBasedSummaryOfNeuron dataclass.

Definition of connection weights

Refer to GPT-2 model code for understanding of model weight conventions.

Neuron-neuron: For two neurons (l1, n1) and (l2, n2) with l1 < l2, the connection strength is defined as h{l1}.mlp.c_proj.w[:, n1, :] @ diag(h{l2}.ln_2.g) @ h{l2}.mlp.c_fc.w[:, :, n2].

Neuron-token: For token t and neuron (l, n), the input weight is computed as wte[t, :] @ diag(h{l}.ln_2.g) @ h{l}.mlp.c_fc.w[:, :, n] and the output weight is computed as h{l}.mlp.c_proj.w[:, n, :] @ diag(ln_f.g) @ wte[t, :].

automated-interpretability's People

Contributors

wuthefwasthat avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.