
cupbearer's People

Contributors

ejnnr, rishiskhare, vrehnberg

Forkers

vrehnberg

cupbearer's Issues

Trained models can be tied to specific backdoors

An example of this issue, as @ejnnr observed, is the WaNet backdoor. A model trained with this type of backdoor responds to one specific, randomly generated warping field. Later you would typically like to evaluate a detector based on the presence or absence of that particular warping field. As the code currently stands, this isn't possible.

I'd like to use this issue to consider some possible design choices.

Usage

This benchmark is about evaluating how well a Detector can detect anomalous behaviour in a Model for some input.

To make this possible, there needs to be a Task with a dataset that is known to produce both normal and anomalous behaviour in a model.

A Detector is then evaluated on a Task. Some Tasks need to be initialized, which in the case of the Backdoor task consists of:

  • Training a model on a Dataset passed through a BackdoorTransform.

Once a Task has been initialized, it should be possible to load it again, at least in cases where initialization is costly (e.g. training a model). However, it is also of interest to load a modified Task, for example one with a different transform than the one used to train the model, to check that such inputs are not detected as anomalous.

Design choices

The main script is

eval_detector --task TASK --detector DETECTOR

So far this is the same set-up as before. Before this step, the following will have been run:

train_detector ...
init_task ...  # e.g. train_classifier

I don't see any need to change anything with the detector set-up.

The changes will have to be in how the task is handled, both during initialization and in the command-line arguments to eval_detector.

Choice 1

Add a --task from_run option, similar to the one for the detector. This should preferably take as much as possible from the initialization of the task and then accept any additional settings on top.

Most of the information already exists in config.yml and model/ under dir. All that is missing is either:

  • a backdoor/ directory that contains a saved state of the backdoor used, or
  • a seed in config.yml that is used to initialize the same warping field

For saving the backdoor, I would suggest passing the target path in a call like train_loader.transforms.save_state(cfg.dir.path) in train_classifier.py, and adding whatever is needed in Transform and its children to save the state necessary for reproduction.
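
A minimal sketch of what such state saving could look like, assuming the state worth persisting for WaNet is the control grid; the base-class methods, file name, and class bodies here are illustrative, not the existing API:

from pathlib import Path

import numpy as np


class Transform:
    def save_state(self, base_path: Path) -> None:
        pass  # most transforms are stateless, so saving is a no-op by default

    def load_state(self, base_path: Path) -> None:
        pass


class WanetBackdoor(Transform):
    control_grid: np.ndarray  # the randomly generated warping field

    def save_state(self, base_path: Path) -> None:
        # persist the warping field so the exact same backdoor can be recreated
        np.save(Path(base_path) / "wanet_control_grid.npy", self.control_grid)

    def load_state(self, base_path: Path) -> None:
        self.control_grid = np.load(Path(base_path) / "wanet_control_grid.npy")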

Choice 2

Add an option --task.my_task.run_path, task.my_task.[...].run_path, or similar, only where needed. In that case I would strongly prefer a reasonable default that uses an instance of the Task equivalent to the one used when the Task was initialized.

I can't think of a good way to set the default in e.g. WanetBackdoor so that it matches the path specified in train_classifier.py when called from that script, and then something equivalent would also have to be done when loading in eval_detector.py.


I don't think either of these alternatives is obviously good. The easiest one, and the one I would probably pick in the short term, is to just add something like

import numpy as np
from dataclasses import dataclass, field


@dataclass
class WanetBackdoor:
    ...
    # default_factory so each instance draws its own seed; 2**32 stays within int bounds
    init_seed: int = field(default_factory=lambda: np.random.randint(2**32))

    def __post_init__(self):
        ...
        # control_grid_shape comes from the omitted fields; seeding from init_seed makes the field reproducible
        tmp_rng = np.random.default_rng(seed=self.init_seed)
        self.control_grid = 2 * tmp_rng.random(control_grid_shape) - 1

I'm not sure how this is or should be picked up in eval_detector.py, but the information should exist in config.yml, so it shouldn't be too hard to retrieve.
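
If the seed does get written to config.yml, retrieving it in eval_detector.py could be as simple as something like the following (the file layout and key path are assumptions, not the actual config schema):

from pathlib import Path

import yaml


def load_backdoor_seed(run_dir: str) -> int:
    # read back the seed that was serialized when the classifier was trained
    with open(Path(run_dir) / "config.yml") as f:
        cfg = yaml.safe_load(f)
    # hypothetical key path; depends on how the backdoor config is nested
    return cfg["train_data"]["backdoor"]["init_seed"]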

WaNet probably broken with `num_workers > 0`

test_wanet in test_pipeline.py currently fails if we use multiple workers: the cfg.train_data.backdoor instance doesn't have its warping field initialized. My assumption is that this is because each worker ends up using its own copy of that instance, and the warping field is only initialized after making those copies (on the first __call__). More important than failing this test case, I think this means the warping fields won't be shared across workers.

I've added a check in train_classifier.py that raises an error in this setting, so not super high priority. But we should fix this, probably by initializing the warping field earlier.
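
Roughly what such a guard could look like (this is a sketch, not the exact check in train_classifier.py; the class-name comparison just avoids depending on a specific import path):

def check_wanet_num_workers(backdoor, num_workers: int) -> None:
    # WanetBackdoor creates its warping field lazily, so with num_workers > 0
    # each dataloader worker would get its own copy and hence its own field.
    if type(backdoor).__name__ == "WanetBackdoor" and num_workers > 0:
        raise NotImplementedError(
            "WaNet with multiple dataloader workers is not supported yet; "
            "set num_workers=0 or initialize the warping field eagerly."
        )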

`--help` not working on scripts

Scripts are supposed to print detailed usage information when given the --help flag, but they typically just error out with a very short usage line complaining about missing arguments. I don't know whether this is an issue with simple_parsing or with my code; I need to investigate. Not very high priority, but I've been annoyed by it a few times already, and I imagine it will be even more annoying for new users.
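
A minimal reproduction sketch to narrow this down; the dataclass and its fields are placeholders, not the real script configs. If --help works here, the problem is in our scripts rather than in simple_parsing:

from dataclasses import dataclass

from simple_parsing import ArgumentParser


@dataclass
class DummyConfig:
    """Dummy options for testing --help output."""

    task: str = "backdoor"          # which task to evaluate on
    detector: str = "mahalanobis"   # which detector to use


if __name__ == "__main__":
    parser = ArgumentParser()
    parser.add_arguments(DummyConfig, dest="config")
    args = parser.parse_args()  # `python repro.py --help` should print field help
    print(args.config)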

Documentation

  • Finish and update guides in docs/
  • Update all the docstrings (and maybe add darglint to make sure they stay up to date?)
  • Maybe create a website for docs

Support for language models

Currently, all tasks are image classification; it would be good to also support language inputs and pretrained/finetuned language models. Rough guess at the todos for that:

  • Either allow loading pretrained model weights into a Model or change the entire system of how we extract activations from networks. Probably easiest to get this to work for one specific language model first, we'll probably have to add some custom code for every new model anyway.
  • Make sure text data as input works throughout the codebase. The main challenge might be that the tokenizer will depend on the model. Maybe we could do tokenization within Model, or maybe we'd have to coordinate data processing and the model somehow.
  • Extracting the right activations: current models don't have residual streams, so it's pretty clear which activations to use. For transformers, there are a few different options; this should probably just be configurable (see the sketch after this list).
  • Making abstractions work: this could also happen separately afterwards, first step is getting the Mahalanobis detector to work. Abstractions shouldn't require too much effort on top as long as we keep the overall Model implementation.
  • Having scripts for finetuning base models: that should definitely happen later, maybe shouldn't be a part of this library at all
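
As a starting point for the activation-extraction bullet, here is a rough sketch (assuming PyTorch and Hugging Face transformers, with GPT-2 as the single model to get working first) of grabbing per-layer residual-stream activations with forward hooks; which layers to hook would be the configurable part:

import torch
from transformers import AutoModel, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

activations: dict[str, torch.Tensor] = {}


def make_hook(name: str):
    def hook(module, inputs, output):
        # for GPT-2 blocks, the first element of the output tuple is the hidden state
        activations[name] = output[0].detach()

    return hook


# hook the output of every transformer block, i.e. the residual stream after each layer
for i, block in enumerate(model.h):
    block.register_forward_hook(make_hook(f"block_{i}"))

inputs = tokenizer("Hello world", return_tensors="pt")
with torch.no_grad():
    model(**inputs)

print({name: act.shape for name, act in activations.items()})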

Include Redwood's sensor tampering benchmark

See paper and code. We'll need #7 first, then hopefully this is pretty doable afterwards (might require a few changes to how tasks are represented though). At first, we should just port over the benchmark, not the baselines (though we might want some of them in the library at some point).

Fix or hide spurious warnings

Running pytest gives quite a few deprecation and other warnings (and some of the scripts also give a few). None of them are actually concerning, but it would be nice to either fix them or suppress them where appropriate.
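
For the ones we decide to suppress rather than fix, a targeted filter is probably better than silencing everything (the module name below is a placeholder), e.g. in a conftest.py or at the top of the offending script:

import warnings

# ignore a known-benign deprecation warning coming from one specific dependency,
# rather than hiding all warnings
warnings.filterwarnings(
    "ignore",
    category=DeprecationWarning,
    module="some_noisy_dependency",
)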

More detectors

We currently have 3 detectors. In this issue I will investigate some possible new additions.

Top candidates:

  • Neural Cleanse (most cited backdoor detection method)
  • Fine-Pruning (second most cited and natural extension of finetune method)
  • Spectral/Spectre/Beatrix (Seems relatively easy to implement and can detect WaNet somewhat)
  • ASSET (best one according to themselves)
  • MagNet/Statistical Detection/LID/Feature Response Maps (adversarial-example detection techniques that use activation anomaly detection)

Quality of life improvements for experiments

This issue is for tracking various features that could make running large-scale experiments more convenient. It's unclear which ones are actually worth implementing; I'll probably add them over time whenever one seems like it would actually be helpful.

  • Allow loading a configuration from a yaml file.
  • Launcher for sweeps, ideally with Slurm integration. I was using the Hydra submitit plugin for this in an earlier version of this code base but then switched away from Hydra. It clearly doesn't make sense to reimplement all of Hydra's complexities, but we could for example just have a way of sweeping over experiment configs (previous point).
  • When training a detector, the type of anomaly (like what the backdoor trigger is) usually doesn't matter but you still need to specify it for no good reason.
  • Have an entrypoint function and register it with pip, so you can run cupbearer ... instead of python -m cupbearer.scripts...
  • Checkpointing
  • Make sure we don't accidentally overwrite files from a previous run (e.g. check whether the output directory already exists); see the sketch after this list.
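
A sketch of the last point (the function and argument names are made up for illustration):

from pathlib import Path


def prepare_output_dir(path: str, overwrite: bool = False) -> Path:
    """Create the output directory, refusing to clobber a non-empty previous run."""
    out_dir = Path(path)
    if out_dir.exists() and any(out_dir.iterdir()) and not overwrite:
        raise FileExistsError(
            f"Output directory {out_dir} already exists and is not empty; "
            "pass overwrite=True or choose a different directory."
        )
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir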

Rough edges to refactor

Keeping track of things we probably want to fix at some point but that aren't priorities. I think we should fix these whenever we'd otherwise have to make the problem worse or write a bunch of code that a refactor would change.

  • The Debug config classes are pretty cumbersome. I'm leaning towards just getting rid of them entirely and letting automated tests manually configure things for fast runs.
  • Maybe configs shouldn't have Config in their name? It could be nice in order to avoid name clashes with e.g. the detector class itself, but from a user perspective, the Config classes are actually the more important ones, so they should maybe get the shorter name.
  • Remove simple_parsing (by using something else to serialize/deserialize configs). We could also reconsider the whole hierarchical dataclass config system (i.e. just instantiate detectors etc. directly, pass them to a script, and make detectors responsible for serializing their config).
  • Get rid of TaskConfigBase. It's not used anywhere and we now actually have some code that relies on the specifics of TaskConfig. So I think that should be the only ABC for tasks, and we should just generalize it if actually necessary in the future.
