
transformerlens's Introduction

TransformerLens


A Library for Mechanistic Interpretability of Generative Language Models. Maintained by Bryce Meyer and created by Neel Nanda.

Read the Docs Here

This is a library for doing mechanistic interpretability of GPT-2 Style language models. The goal of mechanistic interpretability is to take a trained model and reverse engineer the algorithms the model learned during training from its weights.

TransformerLens lets you load in 50+ different open source language models, and exposes the internal activations of the model to you. You can cache any internal activation in the model, and add in functions to edit, remove or replace these activations as the model runs.

Quick Start

Install

pip install transformer_lens

Use

import transformer_lens

# Load a model (eg GPT-2 Small)
model = transformer_lens.HookedTransformer.from_pretrained("gpt2-small")

# Run the model and get logits and activations
logits, activations = model.run_with_cache("Hello World")
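
For example, once the model has been run with run_with_cache, individual activations can be looked up by their hook name, and run_with_hooks lets you intervene on them as the model runs (a quick sketch; the full set of hook names, like blocks.0.attn.hook_pattern, is documented in the Main Demo):

# Inspect a cached activation by its hook name
attn_pattern = activations["blocks.0.attn.hook_pattern"]  # [batch, n_heads, query_pos, key_pos]
print(attn_pattern.shape)

# Intervene on an activation as the model runs
def zero_ablate_head_7(pattern, hook):
    pattern[:, 7, :, :] = 0.0  # zero out head 7's attention pattern in layer 0
    return pattern

logits = model.run_with_hooks(
    "Hello World",
    fwd_hooks=[("blocks.0.attn.hook_pattern", zero_ablate_head_7)],
)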

Key Tutorials

Gallery

Research done involving TransformerLens:

User contributed examples of the library being used in action:

Check out our demos folder for more examples of TransformerLens in practice

Getting Started in Mechanistic Interpretability

Mechanistic interpretability is a very young and small field, and there are a lot of open problems. This means there's a lot of low-hanging fruit and the bar to entry is low - if you would like to help, please try working on one! The standard answer to "why has no one done this yet" is just that there aren't enough people! Key resources:

Support & Community

Contributing Guide

If you have issues, questions, feature requests or bug reports, please search the issues to check if it's already been answered, and if not please raise an issue!

You're also welcome to join the open source mech interp community on Slack. Please use issues for concrete discussions about the package, and Slack for higher bandwidth discussions about eg supporting important new use cases, or if you want to make substantial contributions to the library and want a maintainer's opinion. We'd also love for you to come and share your projects on the Slack!

❗ HookedSAETransformer Removed

HookedSAETransformer was removed from TransformerLens in version 2.0, and its functionality is being moved to SAELens. Please see the accompanying announcement for details on what's new in this release and the future of TransformerLens.

Credits

This library was created by Neel Nanda and is maintained by Bryce Meyer.

The core features of TransformerLens were heavily inspired by the interface to Anthropic's excellent Garcon tool. Credit to Nelson Elhage and Chris Olah for building Garcon and showing the value of good infrastructure for enabling exploratory research!

Creator's Note (Neel Nanda)

I (Neel Nanda) used to work for the Anthropic interpretability team, and I wrote this library because after I left and tried doing independent research, I got extremely frustrated by the state of open source tooling. There's a lot of excellent infrastructure like HuggingFace and DeepSpeed to use or train models, but very little to dig into their internals and reverse engineer how they work. This library tries to solve that, and to make it easy to get into the field even if you don't work at an industry org with real infrastructure! One of the great things about mechanistic interpretability is that you don't need large models or tons of compute. There are lots of important open problems that can be solved with a small model in a Colab notebook!

Citation

Please cite this library as:

@misc{nanda2022transformerlens,
    title = {TransformerLens},
    author = {Neel Nanda and Joseph Bloom},
    year = {2022},
    howpublished = {\url{https://github.com/TransformerLensOrg/TransformerLens}},
}

transformerlens's People

Contributors

adamkarvonen, adamyedidia, afspies, alan-cooney, anthonyduong9, arthurconmy, avariengien, bryce13950, butanium, callummcdougall, ckkissane, cmathw, collingray, dkamm, felhof, glerzing, jaybaileycs, jbloomaus, joelburget, neelnanda-io, richardkronick, rusheb, seuperhakkerja, slavachalnev, smithjessk, soheeyang, tkukurin, ufo-101, vasilgeorgiev39, zshn-gvg


transformerlens's Issues

Create a demo of using pre-trained checkpoints for an interesting task, eg replicating the induction heads paper

There's an MVP demo of using checkpoints in the main demo, but it'd be nice to have something more substantial.

A good option is replicating some of the graphs in [the induction heads paper] for my interpretability-friendly models (or Pythia or Stanford CRFM). Prefix matching score and QK eigenvalue trace are the easiest and should be fast. Loss curves, in-context learning loss, loss by position and PCA of per-token loss should all be doable with approximately the same code; the hard part is just going to be running a range of model checkpoints.

You may want to add a PR to disable HuggingFace caching, and also to only load and analyse eg every 10th checkpoint, to avoid blowing out your computer's memory - by default, HuggingFace caches model weights to the hard drive, and this is pretty expensive if eg using 600 checkpoints of GPT-2 Medium!

Some OPT Tokenizer Confusion

Hi,

When trying to use utils.test_prompt with an OPT model I am running into issues surrounding the prepending of BOS tokens. In particular, I think the call to PreTrainedTokenizerBase sets add_special_tokens to true, regardless of the prepend_bos flag in model.to_str_tokens, which leads to at least one </s> being prepended (and if prepend_bos=True, then two are prepended).

For example:

prompt: "Two young, White males are outside near many bushes"
# Split into prompt = "Two young .... many", answer = "bushes"
Tokenized prompt: ['</s>', '</s>', 'Two', ' young', ',', ' White', ' males', ' are', ' outside', ' near', ' many']
Tokenized answer: ['</s>', ' bushes', '.', '\n']

I am not sure if this is the desired behaviour for OPT?

Even if it is, I think the utils.test_prompt needs to be adjusted to expect the </s> at the beginning of the tokenized answer, in this case.

Happy to make these changes once you confirm what the expected behaviour is.

Thanks!

Add a demo of direct path patching

Direct path patching is like activation patching, but rather than patching in the output of component A everywhere, it acts on pairs of components A and B (where B is in a layer after A): we patch the output of A only into the input of B, and all other components see the old output of A. I want to add a section to Exploratory Analysis Demo demonstrating this for all pairs of heads.

Eg to do direct path patching on the query of head B, we'd add a hook saying patched_B_query = original_B_query + (clean_A_output - corrupted_A_output) @ W_Q / layer_norm_scale
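
A rough sketch of such a hook for a single (A, B) pair, patching only the query of head B (this assumes use_attn_result=True so per-head outputs are cached as hook_result, hand-waves the layer norm scale, and the layer/head indices are arbitrary placeholders - it's meant to illustrate the formula above, not be a polished implementation):

from functools import partial
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2-small")
model.set_use_attn_result(True)  # cache per-head outputs as hook_result

clean_tokens = model.to_tokens("When John and Mary went to the store, John gave a drink to")
corrupted_tokens = model.to_tokens("When John and Mary went to the store, Peter gave a drink to")
_, clean_cache = model.run_with_cache(clean_tokens)
_, corrupted_cache = model.run_with_cache(corrupted_tokens)

layer_A, head_A = 2, 3  # upstream component A (placeholder choice)
layer_B, head_B = 5, 1  # downstream head B, in a later layer

def patch_B_query(q, hook):
    # q: [batch, pos, n_heads, d_head]
    clean_A = clean_cache["result", layer_A][:, :, head_A, :]          # [batch, pos, d_model]
    corrupted_A = corrupted_cache["result", layer_A][:, :, head_A, :]
    scale = corrupted_cache["scale", layer_B, "ln1"]                   # layer norm scale at B's input
    q[:, :, head_B, :] += ((clean_A - corrupted_A) / scale) @ model.W_Q[layer_B, head_B]
    return q

patched_logits = model.run_with_hooks(
    corrupted_tokens,
    fwd_hooks=[(utils.get_act_name("q", layer_B), patch_B_query)],
)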

For reference, an old PR to add it to an early version of the library: #49

No attention QK/OV attributes

The QK and OV matrices can be accessed via the model.QK and model.OV attributes, but not from the individual attention layers. It would be convenient to be able to do that, rather than computing all QK/OV matrices for all layers just to access one.
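
In the meantime, a workaround sketch for getting a single head's QK and OV without materialising all of them, using FactoredMatrix directly (this mirrors how model.QK is built, as far as I can tell):

from transformer_lens import HookedTransformer, FactoredMatrix

model = HookedTransformer.from_pretrained("gpt2-small")
layer, head = 5, 1

# QK = W_Q @ W_K^T and OV = W_V @ W_O, kept in factored (low-rank) form
QK = FactoredMatrix(model.W_Q[layer, head], model.W_K[layer, head].T)
OV = FactoredMatrix(model.W_V[layer, head], model.W_O[layer, head])
print(QK.shape, OV.shape)  # both [d_model, d_model], stored as a product of two factors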

Convert all TorchTyping to JaxTyping

This project uses Patrick Kidger's torchtyping to give tensor types including shapes. He recommends jaxtyping for newer projects (it isn't JAX specific, is better maintained, is more compatible with type checkers, etc.), and it'd be good to update TL to use it.

@dkamm might be up your alley? Sorry if this invalidates your enum work, but might be a more elegant solution!
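
For reference, a small example of what the jaxtyping style looks like (the function itself is just an illustration, not part of TL):

import torch
from jaxtyping import Float

def residual_norm(
    resid: Float[torch.Tensor, "batch pos d_model"],
) -> Float[torch.Tensor, "batch pos"]:
    # Shape annotations are plain strings, so they stay readable and play nicely with type checkers
    return resid.norm(dim=-1)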

Add my interpretability-friendly models to HuggingFace

Add my interpretability-friendly models to HuggingFace (documented here: https://docs.google.com/document/d/1WONBzNqfKIxERejrrPlQMyKqg7jSFW92x5UMXNrMdPo/edit#heading=h.chq47zvs9cii )

This probably looks like adding the HookedTransformer + HookedTransformerConfig classes to HuggingFace AutoModel; I'm not super sure how it works. Ideally this would be able to slot into code using HuggingFace models to eg evaluate them or generate text.

(Not actually a TransformerLens issue per se, but useful!)

Add Documentation + Tests for utils.Slice

I really need to document this one better lol - it's basically a wrapper around Python's slice, but None maps to slice(None) (equivalent to [:], and doesn't add a dummy axis), and an integer n maps to indexing with n, which reduces the number of axes - something Python slices don't support.
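
As a rough illustration of the behaviour described above, in terms of plain tensor indexing (not the utils.Slice API itself):

import torch

x = torch.arange(24).reshape(2, 3, 4)
print(x[slice(None)].shape)  # torch.Size([2, 3, 4]) - the None case keeps every axis, like [:]
print(x[1].shape)            # torch.Size([3, 4])    - the integer case drops the indexed axis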

Add a permanent hooks feature to HookPoint, that isn't deleted when you run `model.reset_hooks()`

Currently, running run_with_hooks or run_with_cache ends by running model.reset_hooks(), which deletes all hooks. This means that if I, eg, want to create a model without positional embeddings, I can't add a hook that just sets pos_embed to zero, without breaking run_with_hooks or run_with_cache. The underlying problem is that PyTorch hooks are global state, but I want run_with_* to present them as local state to the user.

I'd like to add an add_perma_hook method to a HookPoint, which tracks hooks separately from add_hook, so that reset_hooks only deletes the normal hooks and not the perma-hooks (essentially creating a separate class of global state hooks and local state hooks)
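
A sketch of how the proposed API might be used (add_perma_hook is the proposed method name here, not something that exists yet):

from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-small")

def zero_pos_embed(pos_embed, hook):
    # Permanently ablate positional embeddings on every forward pass
    return pos_embed * 0.0

model.hook_pos_embed.add_perma_hook(zero_pos_embed)  # proposed: survives reset_hooks()

logits = model("Hello World")  # perma-hook active
model.reset_hooks()            # only removes the normal hooks
logits = model("Hello World")  # perma-hook still active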

Two issues with tests

As discussed in #161, the testing currently:

i) does not provide signal on which tests are failing
ii) seems to be incorrect? They pass locally for me

I'm a bit busy to do the good honest engineering required to improve this ATM :(

Add a `from_pretrained_no_processing` method to `HookedTransformer`

Currently from_pretrained has a bunch of Boolean flags for various simplifications to the transformer weights, and many default to True. I want these to be on by default, but it makes life a pain if you want to load a large model, or study a model exactly as the makers intended (since you need to set 5 ish Boolean flags to False, and it's not robust to new flags being added). I'd like there to be a from_pretrained_no_processing method with the same API as from_pretrained, which acts as a wrapper but sets all Boolean flags to False.
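
A minimal sketch of what the wrapper could look like, assuming fold_ln, center_writing_weights and center_unembed are the processing flags in question (the real method would need to cover whatever the full set of flags is):

from transformer_lens import HookedTransformer

def from_pretrained_no_processing(model_name, **kwargs):
    # Same API as from_pretrained, but with all weight-processing flags forced off
    return HookedTransformer.from_pretrained(
        model_name,
        fold_ln=False,
        center_writing_weights=False,
        center_unembed=False,
        **kwargs,
    )

model = from_pretrained_no_processing("gpt2-small")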

Add tests + better docs for tokenization methods

Add tests that the tokenization methods work (to_tokens, to_string, to_str_tokens, get_token_position)

Go through the documentation and clarify things that are unclear (this is hard for me to do, so even just having someone new to the library flag confusions is helpful!). The behaviour of prepend_bos is the main confusion. Docs can be copied from https://colab.research.google.com/github/neelnanda-io/TransformerLens/blob/v2/Main_Demo.ipynb#scrollTo=GUSyRfQuKmHU
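
A possible starting point for the tests (a sketch only; the expected behaviour around prepend_bos needs confirming first):

import pytest
from transformer_lens import HookedTransformer

@pytest.fixture(scope="module")
def model():
    return HookedTransformer.from_pretrained("gpt2-small")

def test_to_tokens_round_trips(model):
    text = "Hello, world!"
    tokens = model.to_tokens(text)                 # [1, pos], BOS prepended by default
    assert model.to_string(tokens[0, 1:]) == text  # drop the BOS before decoding

def test_to_str_tokens_matches_to_tokens(model):
    text = "Hello, world!"
    assert len(model.to_str_tokens(text)) == model.to_tokens(text).shape[-1]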

Add helper function for activation patching

Add a helper function to implement activation patching (probably in utils.py). Crib from the code here: https://colab.research.google.com/github/neelnanda-io/Easy-Transformer/blob/main/Exploratory_Analysis_Demo.ipynb#scrollTo=JAyQI8Mlt3W7

I'd guess a good API is to input the clean cache, and corrupted prompt/tokens, specify which activation to patch and whether to iterate over positions, layers, head index, etc.

Redwood's IOI codebase is a good example of how you might implement this: https://github.com/redwoodresearch/Easy-Transformer/blob/main/easy_transformer/experiments.py
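
A rough sketch of the kind of helper I have in mind, patching the residual stream at every (layer, position) on an IOI-style prompt pair and recording the logit difference (the API details are exactly what's up for discussion):

import torch
from functools import partial
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2-small")

clean_tokens = model.to_tokens("When John and Mary went to the store, John gave a drink to")
corrupted_tokens = model.to_tokens("When John and Mary went to the store, Peter gave a drink to")
_, clean_cache = model.run_with_cache(clean_tokens)
mary_token = model.to_single_token(" Mary")
john_token = model.to_single_token(" John")

def patch_resid_pre(corrupted_resid, hook, pos, clean_cache):
    # Overwrite the corrupted residual stream with the clean one at a single position
    corrupted_resid[:, pos, :] = clean_cache[hook.name][:, pos, :]
    return corrupted_resid

n_layers, n_pos = model.cfg.n_layers, clean_tokens.shape[1]
results = torch.zeros(n_layers, n_pos)
for layer in range(n_layers):
    for pos in range(n_pos):
        hook_name = utils.get_act_name("resid_pre", layer)
        patched_logits = model.run_with_hooks(
            corrupted_tokens,
            fwd_hooks=[(hook_name, partial(patch_resid_pre, pos=pos, clean_cache=clean_cache))],
        )
        results[layer, pos] = patched_logits[0, -1, mary_token] - patched_logits[0, -1, john_token]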

Commit error: No .pre-commit-config.yaml file was found

I was running into this issue where I couldn't commit anything. It turns out that we previously had a pre-commit rule that used nbdev and then removed it, and because I'd previously installed pre-commit, things broke. I fixed this by running rm -rf .git/hooks in the project root (the project doesn't have any other git hooks it needs).

Cannot install dependencies via poetry (y-py-0.5.5 is yanked)

I'm trying to install the project via poetry install for python 3.9.

I get this error:

  PoetryException

  Failed to install /Users/rusheb/Library/Caches/pypoetry/artifacts/be/5d/9c/38ed00c38e66f11b3f1295c0b4fa2565c954b8e0c8d63deac26e996efa/y_py-0.5.5.tar.gz

  at /opt/homebrew/lib/python3.10/site-packages/poetry/utils/pip.py:58 in pip_install
       54│
       55│     try:
       56│         return environment.run_pip(*args)
       57│     except EnvCommandError as e:
    →  58│         raise PoetryException(f"Failed to install {path.as_posix()}") from e
       59│

  • Installing yarl (1.8.2)
Warning: The file chosen for install of y-py 0.5.5 (y_py-0.5.5.tar.gz) is yanked. Reason for being yanked: Inconsistent wheels

Output of poetry env info:

❯ poetry env info

Virtualenv
Python:         3.9.16
Implementation: CPython
Path:           /Users/rusheb/code/TransformerLens/.venv
Executable:     /Users/rusheb/code/TransformerLens/.venv/bin/python
Valid:          True

Also, I saw this warning at the top of the poetry install output:

Warning: poetry.lock is not consistent with pyproject.toml. You may be getting improper dependencies. Run `poetry lock [--no-update]` to fix it.

I think running poetry lock should fix it. I'll try to raise a fix now.

Make activation functions modules?

I was just merging this in, and the code in question (a screenshot of the activation functions, not reproduced here) makes me a bit sad, as I have to think about 4 different activation functions in the otherwise really simple and clean file. Can all the activation functions be made modules instead? Happy to work on this if there's agreement.

Originally posted by @ArthurConmy in #8 (comment)

Add documentation for utils.get_act_name

I've received complaints that it needs better documentation! I would want to add clear info re the recommended name for each activation. The code is messy because it's designed to be robust to different names for things, to work for different sub-layers and for layers that aren't part of a block, etc.

Some Confusion on Unembedding

Hey,

If one unembeds from intermediate layers of the residual stream, the lack of normalization (coming from layer_norm_pre) leads to different (intuitively, less likely to be sensible) results.

I don't know if there is a warning about this / whether we might want to add an include_normalization flag to model.unembed(...)
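
For reference, a minimal sketch of applying the final layer norm by hand before unembedding (assuming a model loaded with the default settings, where ln_final and unembed are the relevant modules):

from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-small")
_, cache = model.run_with_cache("Hello World")

resid = cache["resid_post", 5]                        # residual stream after layer 5
logits_raw = model.unembed(resid)                     # skips normalization entirely
logits_normed = model.unembed(model.ln_final(resid))  # applies the final layer norm first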

Thanks!

Alex

Better docs for model properties

Make this table better and cover key info for the model architecture - whether it uses parallel attention & MLPs, and which positional embedding it uses.

Add text at the bottom documenting the models more qualitatively, can basically copy this glossary: https://docs.google.com/document/d/1WONBzNqfKIxERejrrPlQMyKqg7jSFW92x5UMXNrMdPo/edit#heading=h.chq47zvs9cii

I'd want to add a separate table with training info: include training dataset, number of tokens, whether they were trained with dropout, whether they have checkpoints, whether trained with weight decay.

Torchtyping function help strings are extremely verbose

When torchtyping is used to give a function's type signature, printing the type is extremely verbose, to somewhat ridiculous extents. Eg, here is model.run_with_cache?:

Signature:
model.run_with_cache(
    *model_args,
    return_cache_object=True,
    remove_batch_dim=False,
    **kwargs,
) -> Tuple[Union[NoneType, typing_extensions.Annotated[torch.Tensor, {'__torchtyping__': True, 'details': ('batch', 'pos', 'd_vocab',), 'cls_name': 'TensorType'}], typing_extensions.Annotated[torch.Tensor, {'__torchtyping__': True, 'details': ((),), 'cls_name': 'TensorType'}], typing_extensions.Annotated[torch.Tensor, {'__torchtyping__': True, 'details': ('batch', 'position - 1',), 'cls_name': 'TensorType'}], Tuple[typing_extensions.Annotated[torch.Tensor, {'__torchtyping__': True, 'details': ('batch', 'pos', 'd_vocab',), 'cls_name': 'TensorType'}], Union[typing_extensions.Annotated[torch.Tensor, {'__torchtyping__': True, 'details': ((),), 'cls_name': 'TensorType'}], typing_extensions.Annotated[torch.Tensor, {'__torchtyping__': True, 'details': ('batch', 'position - 1',), 'cls_name': 'TensorType'}]]]], Union[transformer_lens.ActivationCache.ActivationCache, Dict[str, torch.Tensor]]]
Docstring: Wrapper around run_with_cache in HookedRootModule. If return_cache_object is True, this will return an ActivationCache object, with a bunch of useful HookedTransformer specific methods, otherwise it will return a dictionary of activations as in HookedRootModule.

In-place operations on `hook_pos_embed` are dangerous

With torch.set_grad_enabled(False), an in-place operation inside a hook can overwrite the model's positional embeddings:

#%%
from transformer_lens.HookedTransformer import HookedTransformer
import torch
torch.set_grad_enabled(False)

#%%
model = HookedTransformer.from_pretrained("gpt2")

#%%
assert not torch.allclose(
    torch.norm(model.W_pos[0]), torch.tensor(0.0),
)

# %%
def sketchy_remove_pos_embed(z, hook):
    z[:] = 0.0
    return z

_ = model.run_with_hooks(
    "Hello, world",
    fwd_hooks = [("hook_pos_embed", sketchy_remove_pos_embed)],
)

model.reset_hooks()

# %%
assert torch.allclose(
    torch.norm(model.W_pos[0]), torch.tensor(0.0),
)
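
For comparison, a hook that avoids the in-place write (and so leaves W_pos untouched, since the hooked activation can share storage with it) would look something like this:

def safe_remove_pos_embed(z, hook):
    # Return a fresh zero tensor instead of mutating z in place
    return torch.zeros_like(z)

_ = model.run_with_hooks(
    "Hello, world",
    fwd_hooks=[("hook_pos_embed", safe_remove_pos_embed)],
)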

Cache common model weights (eg post LN folding) to shorten model loading times.

Not sure exactly what the best way to cache is; I'd try to copy how HuggingFace transformers does it. The easiest way might be to just create a separate model on HuggingFace with the post-processed weights and to pull the weights from that.

Probably best to cache a version with all the default flags to from_pretrained and just use that by default, otherwise use the existing loading code.

I'd do this for small-ish and important models, eg all 4 GPT-2s and my interpretability friendly models.

Add automatic notebook generation

We should be able to use the python script here: https://github.com/nojvek/vscode-ipynb-py-converter to

i) write notebooks as .py files with #%%
ii) have these automatically converted to .ipynb files on push

This means that a) we can easily test notebooks (since they are .py files, see discussion here) and b) have automatically updated URLs like https://colab.research.google.com/github/neelnanda-io/Easy-Transformer/blob/main/EasyTransformer_Demo.ipynb to display demos.

@alan-cooney this may be of interest since this would work well as another "on push" action like those in #68

Main_Demo.ipynb: plotly rendering / DEVELOPMENT_MODE vs IN_COLAB

In Main_Demo.ipynb, plotly's imshow does not render correctly (for me) when running locally in jupyter lab.
An example is this line in cell 20

imshow(ioi_patching_result, x=token_labels, xaxis="Position", yaxis="Layer", title="Normalized Logit Difference After Patching Residual Stream on the IOI Task")

For me, adding two lines in the Setup section fixes it (screenshot of the fix not included here).
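
The original screenshot isn't reproduced here; as a guess at what the two lines might be, a common way to get plotly rendering in JupyterLab is to set the default renderer explicitly:

import plotly.io as pio
pio.renderers.default = "jupyterlab"  # or "notebook", depending on the environment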

But this also requires that DEVELOPMENT_MODE be True (which does not happen on its own).
Before contributing the fix, I therefore wanted to check: what's the intention behind DEVELOPMENT_MODE vs IN_COLAB?

Currently, these are not always inverses of each other: running locally without changing anything gives DEVELOPMENT_MODE=False and IN_COLAB=False. Should we just have IN_COLAB and get rid of DEVELOPMENT_MODE?

Issues setting up dev environment

I've been having issues setting up my dev environment.

OS: MacOS Monterey
Model Name: MacBook Air
Model Identifier: Mac14,2
Chip: Apple M2

What I was doing

  • I cloned the repo
  • ran poetry config virtualenvs.in-project true and poetry install --with dev
  • after the last command, I got an error (screenshot not included)

What I tried to fix it

  1. Upgrading pytorch
  • Changed the pytorch version to 1.13.1
  • This produced a new error (stack trace)
  2. Installing in a docker container
FROM ubuntu:latest

RUN apt-get -yqq update
RUN apt-get -yqq install git

RUN apt-get -yqq install python3    
RUN apt-get -yqq install python3-pip
RUN apt-get -yqq update

RUN git clone https://github.com/neelnanda-io/TransformerLens.git

WORKDIR /TransformerLens

RUN pip3 install poetry
RUN poetry config virtualenvs.in-project true
# RUN poetry install --with dev

This produced the same nvidia-cudnn-cu11 errors as before, which did not change even after I tried bumping the pytorch version.

Add support for model parallelism

Add support for having a model with layers split across several GPUs.

Make sure the layers (and its HookPoints) know what device they're on, so that hooks can ensure that they aren't needlessly moving information between GPUs. ActivationCache is a dictionary and should work by default.

The MVP would be doing 2 GPUs: putting the embed + first half of layers on GPU 1 and the second half + unembed on GPU 2. This is probably the most that's needed to support eg NeoX?

I'm not sure of the most elegant way of doing this, or how to do this without making the code really messy. I lean towards either adding a method which edits the model to move layers between devices, or making a separate ParallelHookedTransformer class
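
As a very rough illustration of the 2-GPU MVP (generic PyTorch rather than the HookedTransformer forward pass itself; the device names and half-and-half split are assumptions):

import torch.nn as nn

class TwoDeviceStack(nn.Module):
    # Toy illustration: first half of the blocks on one device, second half on another
    def __init__(self, blocks: nn.ModuleList, dev0="cuda:0", dev1="cuda:1"):
        super().__init__()
        half = len(blocks) // 2
        self.first_half = nn.ModuleList(blocks[:half]).to(dev0)
        self.second_half = nn.ModuleList(blocks[half:]).to(dev1)
        self.dev0, self.dev1 = dev0, dev1

    def forward(self, x):
        x = x.to(self.dev0)
        for block in self.first_half:
            x = block(x)
        x = x.to(self.dev1)  # single cross-device transfer at the layer boundary
        for block in self.second_half:
            x = block(x)
        return x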

Add evals for algorithmic-ish tasks like Indirect Object Identification

Add to evals.py support for checking how good a model is at tasks like Indirect Object Identification. Notably, the eval involves generating a synthetic dataset of names as they do in the paper (can copy from their codebase), running the model on it, and returning the average logit diff and accuracy.

Bonus:

  • Add support for prompts of different token length
  • Support multi-token answers.
  • Write this in a generic way, where any generated dataset can be subbed in
  • Give an option to pair up prompts, so "John and Mary went to the store, John gave the bag to" is followed by "John and Mary went to the store, Mary gave the bag to", to avoid biases where the model just favours common names
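
A sketch of the core of such an eval, using a tiny made-up name pool and template (the real version would copy the dataset generation from the IOI paper's codebase):

from itertools import permutations
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-small")

names = [" John", " Mary", " Tom", " Anna"]  # placeholder name pool
template = "When{A} and{B} went to the store,{A} gave the bag to"

logit_diffs = []
for A, B in permutations(names, 2):
    logits = model(template.format(A=A, B=B))   # [1, pos, d_vocab]
    correct = model.to_single_token(B)          # the indirect object
    incorrect = model.to_single_token(A)        # the repeated subject
    logit_diffs.append((logits[0, -1, correct] - logits[0, -1, incorrect]).item())

print("Average logit diff:", sum(logit_diffs) / len(logit_diffs))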

`prepend_bos` inconsistency

Yesterday @epurdy and I were working on an implementation of causal tracing from ROME. We ended up getting stuck for a while because when we ran the model it unexpectedly prepended a BOS token (in contrast to running the tokenizer by itself or with model.to_str_tokens, which doesn't). Looking through the codebase I noticed that of the six functions with a prepend_bos option, they're evenly split on the default. I think these defaults make sense in isolation but may be confusing when used together. Not sure what the right fix is here (or if it needs to be changed at all) but thought I'd at least raise this for discussion.

https://github.com/neelnanda-io/Easy-Transformer/blob/e5790469df3ebe26547c2016f899389e9f7bec30/easy_transformer/EasyTransformer.py#L146

https://github.com/neelnanda-io/Easy-Transformer/blob/e5790469df3ebe26547c2016f899389e9f7bec30/easy_transformer/EasyTransformer.py#L299

https://github.com/neelnanda-io/Easy-Transformer/blob/e5790469df3ebe26547c2016f899389e9f7bec30/easy_transformer/EasyTransformer.py#L345

https://github.com/neelnanda-io/Easy-Transformer/blob/e5790469df3ebe26547c2016f899389e9f7bec30/easy_transformer/EasyTransformer.py#L395

https://github.com/neelnanda-io/Easy-Transformer/blob/e5790469df3ebe26547c2016f899389e9f7bec30/easy_transformer/EasyTransformer.py#L935

https://github.com/neelnanda-io/Easy-Transformer/blob/e5790469df3ebe26547c2016f899389e9f7bec30/easy_transformer/utils.py#L440

Add wrapper integrating HookedTransformer with Google's Learning Interpretability Tool (LIT)

Google have a very cool-looking tool for (mostly non-MI) interpretability of language models, called LIT. It seems designed to be framework agnostic, and to be able to take a wrapper around many kinds of models, with functions to enable various LIT functions. I want to add a wrapper to HookedTransformer such that it can integrate with LIT, ideally for as many LIT functions as possible.

The MVP in mind here would just be a Colab which gets LIT to work with TransformerLens, and maybe showing some things you can do with it. I'm not sure whether this kind of integration should actually be merged into the library, but I'd love for a small demo to exist!

KeyError 'gpt_neox' when loading pythia-125m-deduped

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In [8], line 2
      1 model_name = "pythia-125m-deduped"
----> 2 model = EasyTransformer.from_pretrained(model_name)

File ~/miniconda3/envs/fastai/lib/python3.10/site-packages/easy_transformer/EasyTransformer.py:505, in EasyTransformer.from_pretrained(cls, model_name, fold_ln, center_writing_weights, center_unembed, factored_to_even, checkpoint_index, checkpoint_value, hf_model, device, move_state_dict_to_device, **model_kwargs)
    500 official_model_name = loading.get_official_model_name(model_name)
    502 # Load the config into an EasyTransformerConfig object If loading from a
    503 # checkpoint, the config object will contain the information about the
    504 # checkpoint
--> 505 cfg = loading.get_pretrained_model_config(
    506     official_model_name,
    507     checkpoint_index=checkpoint_index,
    508     checkpoint_value=checkpoint_value,
    509     fold_ln=fold_ln,
    510     device=device,
    511 )
    513 # Get the state dict of the model (ie a mapping of parameter names to tensors), processed to match the EasyTransformer parameter names.
    514 state_dict = loading.get_pretrained_state_dict(
    515     official_model_name, cfg, hf_model
    516 )

File ~/miniconda3/envs/fastai/lib/python3.10/site-packages/easy_transformer/loading_from_pretrained.py:439, in get_pretrained_model_config(model_name, checkpoint_index, checkpoint_value, fold_ln, device)
    437     cfg_dict = convert_neel_model_config(official_model_name)
    438 else:
--> 439     cfg_dict = convert_hf_model_config(official_model_name)
    440 # Processing common to both model types
    441 # Remove any prefix, saying the organization who made a model.
    442 cfg_dict["model_name"] = official_model_name.split("/")[-1]

File ~/miniconda3/envs/fastai/lib/python3.10/site-packages/easy_transformer/loading_from_pretrained.py:271, in convert_hf_model_config(official_model_name)
    269 official_model_name = get_official_model_name(official_model_name)
    270 # Load HuggingFace model config
--> 271 hf_config = AutoConfig.from_pretrained(official_model_name)
    272 architecture = hf_config.architectures[0]
    273 if architecture == "GPTNeoForCausalLM":

File ~/miniconda3/envs/fastai/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py:700, in AutoConfig.from_pretrained(cls, pretrained_model_name_or_path, **kwargs)
    698     return config_class.from_pretrained(pretrained_model_name_or_path, **kwargs)
    699 elif "model_type" in config_dict:
--> 700     config_class = CONFIG_MAPPING[config_dict["model_type"]]
    701     return config_class.from_dict(config_dict, **kwargs)
    702 else:
    703     # Fallback: use pattern matching on the string.

File ~/miniconda3/envs/fastai/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py:409, in _LazyConfigMapping.__getitem__(self, key)
    407     return self._extra_content[key]
    408 if key not in self._mapping:
--> 409     raise KeyError(key)
    410 value = self._mapping[key]
    411 module_name = model_type_to_module_name(key)

KeyError: 'gpt_neox'

Add tests + better docs to ActivationCache

Add tests that the methods in the ActivationCache class work correctly.

Go through the documentation and clarify things that are unclear (this is hard for me to do, so even just having someone new to the library flag confusions is helpful!)

Add tests + better docs for FactoredMatrix

Add tests that the FactoredMatrix class works (essentially that each of its methods correctly mimics the result for the actual matrix product).

Go through the documentation and clarify things that are unclear (this is hard for me to do, so even just having someone new to the library flag confusions is helpful!)
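
A possible starting point - property-style tests comparing each FactoredMatrix method against the dense product (a sketch only):

import torch
from transformer_lens import FactoredMatrix

def test_factored_matrix_matches_dense_product():
    A = torch.randn(6, 3)
    B = torch.randn(3, 6)
    fm = FactoredMatrix(A, B)
    dense = A @ B
    assert torch.allclose(fm.AB, dense, atol=1e-5)
    assert torch.allclose(fm.T.AB, dense.T, atol=1e-5)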

Add mixed precision inference incl loading

Add the option to load models in bfloat16 and float16. Esp important for large models like GPT-J and GPT-NeoX.

Ideally, load from HuggingFace in this low precision, do weight processing on the CPU, and then move the processed model weights to the GPU. Might be easiest to do the weight processing once and cache the result to HF (see #103).

Add a helper function to display vectors of logits nicely

Often you want to look at vectors over the vocabulary (eg the logits at a specific position). This is >50,000 dimensions, which is hard to interpret! I want there to be nice utils to visualize a vector like this.

An MVP would be a function mapping this to a pandas dataframe, with the token index, token string value, logit, log prob and probability. Either for just the top K, or for the entire vocab.

But I expect there's many ways to make something nice here! One option is to imitate nostalgebraist's graphing style for plot_logit_lens in the transformer_utils library. This takes a layer x position x d_vocab tensor, and visualises it as a layer x position heatmap, printing the string value of the top token in each cell, and colouring by the top token value.
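
A sketch of the dataframe MVP described above (the column choices are just a suggestion):

import pandas as pd
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-small")
logits = model("The Eiffel Tower is in the city of")[0, -1]  # logits at the final position, [d_vocab]

def logits_to_df(logits: torch.Tensor, k: int = 10) -> pd.DataFrame:
    log_probs = logits.log_softmax(dim=-1)
    top = torch.topk(logits, k)
    return pd.DataFrame({
        "token_index": top.indices.tolist(),
        "token": [model.tokenizer.decode([i]) for i in top.indices.tolist()],
        "logit": top.values.tolist(),
        "log_prob": log_probs[top.indices].tolist(),
        "prob": log_probs[top.indices].exp().tolist(),
    })

print(logits_to_df(logits))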

