MLM Membership Inference Attack

Repository for the Quantifying Privacy Risks of Masked Language Models paper

This repository borrows the environment and the models from https://github.com/elehman16/exposing_patient_data_release

Creating Conda Environment

conda env create -f conda_env.yml

Acquiring Models and Datasets

Since both MIMIC-III and i2b2 datasets require an access process, we have not made them available online.

MIMIC-III

MIMIC-III data and model authorization process: https://mimic.mit.edu/docs/gettingstarted/

Once you have completed the authorization process, log in to PhysioNet and download the models from https://physionet.org/files/clinical-bert-mimic-notes/1.0.0/

i2b2

You can request access to the i2b2 dataset here: https://portal.dbmi.hms.harvard.edu/

Gaining access to the processed sequences used for experiments

Once you have received the processed files, create a folder named ‘CSV_Files’ and place the CSV files you were sent in it.

Running the MIA attack

To generate the loss values of the clinicalbert-base (target model) and pubmed (reference model) for each sequence, run the following from the root directory of the repo:

bash scripts/get_loss_clinicalbert.sh
bash scripts/get_loss_pubmed.sh
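To make the role of the two loss files concrete, here is a minimal sketch of a reference-based attack score. It assumes the score for a sequence is the reference-model loss minus the target-model loss, so a sequence the target model fits unusually well gets a high score. The function names, the toy loss values, and the threshold are hypothetical and are not taken from the repository's scripts.

```python
def lr_scores(target_losses, reference_losses):
    """Return one score per sequence: reference loss minus target loss.

    Assumption: a higher score means the target model fits the sequence
    much better than the reference model, which hints at membership.
    """
    if len(target_losses) != len(reference_losses):
        raise ValueError("both models must score the same sequences")
    return [ref - tgt for tgt, ref in zip(target_losses, reference_losses)]


def predict_members(scores, threshold):
    """Flag a sequence as a suspected training member if its score exceeds the threshold."""
    return [s > threshold for s in scores]


# Toy per-sequence losses (hypothetical values, not real model outputs).
target = [1.2, 2.9, 0.8, 3.1]
reference = [2.5, 3.0, 2.4, 3.0]

scores = lr_scores(target, reference)
flags = predict_members(scores, threshold=1.0)
print(flags)
```

In practice the threshold would be chosen on held-out data, for example to hit a target false positive rate.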

To get these values for other models, replace model_path in the bash scripts with the path or name of your model. One caveat: if your model is a Hugging Face model, use get_loss_pubmed.sh as the template. If it is another of the clinicalbert models provided in the MIMIC-III files, use get_loss_clinicalbert.sh as the template. The clinicalbert models were trained with pytorch_pretrained_bert, an older package than Hugging Face transformers, so some function calls differ.

To use normalized energy values instead of loss (not in the paper, but they improve the results slightly), run the following:

bash scripts/get_enorm_clinicalbert.sh
bash scripts/get_enorm_pubmed.sh

These scripts save the loss and energy scores in the loss_values and enorm_values folders, respectively. For reproducibility, we have already placed our outputs used in the paper there. To get the attack success metrics (AUC, precision, recall) for our method and the baseline, run the following:

bash scripts/metrics/get_sample_metrics.sh
bash scripts/metrics/get_sample_metrics_enorm.sh

These return the results and metrics for the sample-level attack.

For the user-level attack, run:
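For readers unfamiliar with these metrics, the following sketch shows how AUC, precision, and recall can be computed from per-sequence attack scores and ground-truth membership labels. It is a plain-Python illustration, not the code in the metrics scripts. The function names, scores, labels, and threshold are hypothetical.

```python
def auc(scores, labels):
    """Probability that a random member outscores a random non-member (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one member and one non-member")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def precision_recall(scores, labels, threshold):
    """Precision and recall when sequences scoring above the threshold are flagged as members."""
    predicted = [s > threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy attack scores and membership labels (1 = training member), hypothetical values.
scores = [1.3, 1.0, 1.6, -0.1, 0.9]
labels = [1, 0, 1, 0, 1]

print(auc(scores, labels))
print(precision_recall(scores, labels, threshold=0.95))
```

AUC is threshold-free, while precision and recall depend on the chosen threshold, which is why attacks are often reported with both.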

bash scripts/metrics/get_user_metrics.sh
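A user-level attack needs one score per user rather than per sequence. How the script aggregates is not described here, so the sketch below simply assumes the per-sequence scores of each user are averaged. The function name, user IDs, and scores are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def user_level_scores(sample_scores, user_ids):
    """Average the sample-level scores of each user (assumed aggregation rule)."""
    if len(sample_scores) != len(user_ids):
        raise ValueError("each score needs a user id")
    grouped = defaultdict(list)
    for score, user in zip(sample_scores, user_ids):
        grouped[user].append(score)
    return {user: mean(values) for user, values in grouped.items()}


# Toy data: user "a" has two sequences, user "b" has one (hypothetical values).
per_user = user_level_scores([1.0, 3.0, 0.5], ["a", "a", "b"])
print(per_user)
```

The resulting per-user scores can then be thresholded or ranked in the same way as sample-level scores.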

Reproducing the plots

The plots in the paper can be reproduced by running the ./ipynb/plots.ipynb notebook.
