rlhf-trojan-competition-submission's Introduction

Find the Trojan: Universal Backdoor Detection in Aligned LLMs

Method description

Our method uses random search (RS) in token space to find the triggers, with the goal of minimizing the average reward on the training prompts. At each RS iteration, we sample a replacement for one token of the current best (lowest-reward) trigger, accept it if it reduces the reward, and discard it otherwise. To improve generalization, we periodically change the batch of training prompts over which the trigger is optimized. To shrink the space of candidate tokens, we use the two heuristics described next.
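The loop described above can be sketched as follows. This is a minimal illustration, not the code in main.py: the function name random_search, the reward_fn callback, and all default values are assumptions made for the example, and a real reward_fn would query the reward model on the generations of the poisoned model.

```python
import random
from typing import Callable, List, Sequence


def random_search(
    reward_fn: Callable[[List[int], Sequence[str]], float],
    prompts: Sequence[str],
    candidate_tokens: Sequence[int],
    trigger_len: int = 5,
    n_iters: int = 200,
    batch_size: int = 4,
    resample_every: int = 50,
    seed: int = 0,
) -> List[int]:
    """Hypothetical sketch: single-token random search minimizing a reward."""
    rng = random.Random(seed)
    pool = list(prompts)
    size = min(batch_size, len(pool))

    # Start from a random trigger drawn from the candidate set.
    trigger = [rng.choice(candidate_tokens) for _ in range(trigger_len)]
    batch = rng.sample(pool, size)
    best = reward_fn(trigger, batch)

    for it in range(1, n_iters + 1):
        if it % resample_every == 0:
            # Periodically switch to a fresh batch and re-score the
            # current trigger so that comparisons stay on the same batch.
            batch = rng.sample(pool, size)
            best = reward_fn(trigger, batch)

        # Propose a replacement for one randomly chosen position.
        proposal = list(trigger)
        proposal[rng.randrange(trigger_len)] = rng.choice(candidate_tokens)
        score = reward_fn(proposal, batch)

        # Accept only if the average reward decreases.
        if score < best:
            trigger, best = proposal, score

    return trigger
```

Re-scoring the current trigger after each batch change is a design choice of this sketch: it avoids comparing a proposal scored on the new batch against a reward measured on the old one.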

Token embedding difference. Since the poisoned models are fine-tuned from the same base model, their token embedding vectors are in correspondence. We assume that the tokens seen most often during fine-tuning change the most relative to the base model. Moreover, since the triggers differ across models, we hypothesize that the embedding vectors of the trigger tokens of model A are among the most changed for model A but not among the most changed for model B.

As a proxy, we compute, for each token, the L2 distance between its embeddings in models A and B, denoted diff(A, B). We then obtain the set of candidate tokens for, e.g., model 1 by intersecting the sets of top-1000 tokens with the largest distance in diff(1, 2), diff(1, 3), diff(1, 4), and diff(1, 5). This procedure yields a small set of tokens (30-60 for models 2, 3, 4, 5), over which RS can be run efficiently. The corresponding code is in get_diff_emb in method/utils.py, and the precomputed sets are in method/diff_emb_p=2_new.pth; they can be obtained by running python3 utils.py
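The candidate selection can be illustrated with a small, dependency-free sketch. It is not the get_diff_emb implementation from method/utils.py: the names top_k_changed and candidate_tokens are hypothetical, and embedding matrices are represented as plain lists of rows (one row per token id) rather than tensors.

```python
import math
from typing import Dict, List, Sequence, Set

Embedding = Sequence[Sequence[float]]  # one row per token id


def top_k_changed(emb_a: Embedding, emb_b: Embedding, k: int) -> Set[int]:
    """Token ids with the k largest L2 distances between two embedding tables."""
    dists = [math.dist(a, b) for a, b in zip(emb_a, emb_b)]
    order = sorted(range(len(dists)), key=lambda i: dists[i], reverse=True)
    return set(order[:k])


def candidate_tokens(embs: Dict[int, Embedding], target: int, k: int = 1000) -> List[int]:
    """Intersect the top-k sets of diff(target, m) over all other models m."""
    others = [m for m in embs if m != target]
    top_sets = [top_k_changed(embs[target], embs[m], k) for m in others]
    return sorted(set.intersection(*top_sets))
```

For example, with three toy models in which only tokens 0 and 2 moved in model 1, candidate_tokens(embs, target=1, k=2) returns those two ids, because they are in the top-2 of every pairwise difference involving model 1.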

Minimizing safe response probability. We notice that models 1 and 4, when evaluated without a trigger on the test prompts, tend to reply with a fixed string (while the replies of the other models are more diverse). In this case, we guide RS to minimize the probability of such responses. In particular, we follow the approach of Zou et al. (2023) and compute the gradient to minimize the cross-entropy loss of such replies. We then restrict RS to sample from the 1024 tokens with the largest components in the gradient.
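The restriction step can be sketched in isolation, assuming the gradient has already been computed. Here grad[pos][tok] stands for the derivative of the objective being decreased with respect to the one-hot entry of token tok at trigger position pos. Ranking by the most negative entries follows the convention of Zou et al. (2023). Whether the repository ranks by sign or by magnitude is not stated in this README, so the ordering below is an assumption, and the function name is hypothetical.

```python
from typing import List, Sequence


def top_k_by_gradient(grad: Sequence[Sequence[float]], k: int = 1024) -> List[List[int]]:
    """For each trigger position, return the k token ids whose gradient
    entries are most negative (assumed ranking, see the note above)."""
    candidates = []
    for row in grad:
        # Ascending sort: the most negative entries promise the largest
        # first-order decrease of the objective.
        order = sorted(range(len(row)), key=lambda tok: row[tok])
        candidates.append(order[:k])
    return candidates
```

RS would then draw the replacement for position pos from candidates[pos] instead of from the full vocabulary.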

Finding the triggers

To find the trigger for model X (replace with 1-5), please run

python3 main.py --generation_model_name ethz-spylab/poisoned_generation_trojanX

The output is logged in method/logs/, and the expected outputs (from the runs that generated the submitted triggers) are collected in method/logs_precomputed.
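To launch the search for all five models in sequence, a small wrapper along these lines could be used. The helper names build_commands and run_all are hypothetical and not part of the repository; only the command line itself is taken from above. The wrapper assumes it is started from the directory containing main.py and only prints the commands unless dry_run is set to False.

```python
import subprocess
from typing import Iterable, List


def build_commands(model_ids: Iterable[int] = range(1, 6)) -> List[List[str]]:
    """One main.py invocation per poisoned model (X replaced by 1-5)."""
    return [
        ["python3", "main.py", "--generation_model_name",
         f"ethz-spylab/poisoned_generation_trojan{i}"]
        for i in model_ids
    ]


def run_all(dry_run: bool = True) -> None:
    for cmd in build_commands():
        print(" ".join(cmd))
        if not dry_run:
            # Stop at the first failing run instead of continuing silently.
            subprocess.run(cmd, check=True)
```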

Submission

The submitted triggers were evaluated with generate_evaluate_completions.py; the evaluation can be reproduced by running test_eval.sh.

rlhf-trojan-competition-submission's People

Contributors

fra31, javirandor, max-andr
