dpsda's Introduction

Differentially Private Synthetic Data via Foundation Model APIs

This repo is a Python library to generate differentially private (DP) synthetic data without the need for any ML model training. It is based on the papers that proposed a new DP synthetic data framework that only utilizes the black-box inference APIs of foundation models (e.g., Stable Diffusion).

Potential Use Cases

Given a private dataset, this tool can generate a new DP synthetic dataset that is statistically similar to the private dataset, while ensuring a rigorous privacy guarantee called Differential Privacy. The DP synthetic dataset can replace real data in various use cases where privacy is a concern, for example:

  • Sharing it with other parties for collaboration and research.
  • Using it in downstream algorithms (e.g., training ML models) in the normal non-private pipeline.
  • Inspecting the data directly for easier product debugging or development.

Supported Data Types

This repo currently supports the following data types and foundation models.

Foundation Model APIs | Data Type | Size of Generated Images (--image_size)
----------------------|-----------|-----------------------------------------
Stable Diffusion      | Images    | Preferably 512x512
improved diffusion    | Images    | 64x64
DALLE2                | Images    | 256x256, 512x512, or 1024x1024

Quick Examples

See the docker file for the environment.

CIFAR10 Images

pushd data; python get_cifar10.py; popd  # Download CIFAR10 dataset
pushd models; ./get_models.sh; popd  # Download the pre-trained improved diffusion model
./scripts/main_improved_diffusion_cifar10_conditional.sh  # Run DP generation

Camelyon17 Images

pushd data; python get_camelyon17.py; popd  # Download Camelyon17 dataset
pushd models; ./get_models.sh; popd  # Download the pre-trained improved diffusion model
./scripts/main_improved_diffusion_camelyon17_conditional.sh  # Run DP generation

Cat Images

  • For Cat Cookie:
./scripts/main_stable_diffusion_cookie.sh  # Run DP generation
  • For Cat Doudou:
./scripts/main_stable_diffusion_doudou.sh  # Run DP generation

See the scripts folder for more examples.

Detailed Usage

main.py is the main script for generation. Please refer to python main.py --help for detailed descriptions of the arguments.

For each foundation model API (e.g., Stable Diffusion, improved diffusion), there may be additional arguments. Please use the --api_help argument, e.g., python main.py --api stable_diffusion --data_folder data --api_help, to see detailed descriptions of the API-specific arguments.

See Appendices H, I, J of the paper for examples/guidelines of parameter selection.
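
As a rough sketch of what a run on a custom dataset could look like (not a tuned or verified configuration): the --api, --data_folder, and --image_size arguments all appear elsewhere in this README, while the folder path and size value below are placeholders; consult python main.py --help and --api_help for the authoritative argument names and value formats.

python main.py \
    --api stable_diffusion \
    --data_folder data/my_dataset \
    --image_size 512x512  # placeholder values; verify exact formats with --help and --api_help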

Generate DP Synthetic Data for Your Own Dataset

Please put all images in a folder (which can contain any nested folder structure), and name each image file as <class label without the '_' character>_<the remaining part of the filename>.<jpg/jpeg/png/gif>. Pass the path of this folder to the --data_folder argument.
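
For illustration, a hypothetical two-class dataset (all folder, class, and file names below are made up) could be laid out as follows; only the <class label>_<rest of filename> pattern matters, not the folder nesting:

data/my_dataset/
├── cat_0001.png
├── cat_0002.jpg
└── more_images/
    ├── dog_0001.png
    └── dog_0002.jpeg

With this layout, every file whose name starts with cat_ would be treated as class cat and every file starting with dog_ as class dog, and data/my_dataset would be passed to --data_folder.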

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

Responsible Uses

This project uses foundation model APIs to create synthetic image data with differential privacy guarantees. Differential privacy (DP) is a formal framework that ensures the output of an algorithm does not reveal too much information about its inputs. Without a formal privacy guarantee, a synthetic data generation algorithm may inadvertently reveal sensitive information about its input datapoints.
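
Formally, under the standard definition, a randomized mechanism M satisfies (ε, δ)-differential privacy if, for every pair of datasets D and D' that differ in a single record and every set of outputs S, Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D') ∈ S] + δ; smaller values of ε and δ correspond to a stronger guarantee.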

Using synthetic data in downstream applications can carry risk. Synthetic data may not always reflect the true data distribution, and can cause harms in downstream applications. Both the dataset and algorithms behind the foundation model APIs may contain various types of bias, leading to potential allocation, representation, and quality-of-service harms. Additionally, privacy violations can still occur if the ε and δ privacy parameters are set inappropriately, or if multiple copies of a sample exist in the seed dataset. It is important to consider these factors carefully before any potential deployments.

dpsda's People

Contributors

fjxmlzn, harsha-nori, microsoft-github-operations[bot], microsoftopensource

dpsda's Issues

Error When `variation_degree_schedule` Value Exceeds 10

Context

  • Dataset: Brain Tumor MRI Dataset with 4 classes, 5713 training samples, and 1312 testing samples. Images are labeled as "label_objectNumber" and converted to RGB. More details can be found here.
  • Environment: PyTorch 1.12.0, CUDA 11.7.0
  • Script Parameters:
    • Feature Extractor: inception_v3
    • FID Model Name: inception_v3
    • Dataset Name for FID: brain
    • Image Size: 64x64
    • Batch Size: 500
    • Variation Degree Schedule: 0 to 42 in steps of 2, with an error occurring for values > 10

Issue Description

The script runs successfully for many iterations when the variation_degree_schedule parameter values are below 10. However, exceeding this value results in the following error during the image variation phase:
"Traceback (most recent call last):
File "/cluster/home/laidir/DPSDA/main.py", line 468, in
main()
File "/cluster/home/laidir/DPSDA/main.py", line 361, in main
packed_samples = api.image_variation(
File "/cluster/home/laidir/DPSDA/apis/improved_diffusion_api.py", line 255, in image_variation
sub_variations = self._image_variation(
File "/cluster/home/laidir/DPSDA/apis/improved_diffusion_api.py", line 268, in _image_variation
samples, _ = sample(
File "/cluster/home/laidir/DPSDA/apis/improved_diffusion_api.py", line 307, in sample
sample = sampler(
File "/cluster/apps/eb/software/PyTorch/1.12.0-foss-2022a-CUDA-11.7.0/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/cluster/apps/eb/software/PyTorch/1.12.0-foss-2022a-CUDA-11.7.0/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/cluster/apps/eb/software/PyTorch/1.12.0-foss-2022a-CUDA-11.7.0/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/cluster/apps/eb/software/PyTorch/1.12.0-foss-2022a-CUDA-11.7.0/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
output.reraise()
File "/cluster/apps/eb/software/PyTorch/1.12.0-foss-2022a-CUDA-11.7.0/lib/python3.10/site-packages/torch/_utils.py", line 461, in reraise
raise exception
IndexError: Caught IndexError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/cluster/apps/eb/software/PyTorch/1.12.0-foss-2022a-CUDA-11.7.0/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/cluster/apps/eb/software/PyTorch/1.12.0-foss-2022a-CUDA-11.7.0/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/cluster/home/laidir/DPSDA/apis/improved_diffusion_api.py", line 354, in forward
sample = sample_fn(
File "/cluster/home/laidir/DPSDA/apis/improved_diffusion/gaussian_diffusion.py", line 223, in ddim_sample_loop
for sample in self.ddim_sample_loop_progressive(
File "/cluster/home/laidir/DPSDA/apis/improved_diffusion/gaussian_diffusion.py", line 269, in ddim_sample_loop_progressive
t_batch = th.tensor([indices[0]] * img.shape[0], device=device)
IndexError: list index out of range
"

This error appears to originate from an IndexError in the ddim_sample_loop within the improved diffusion API, specifically when attempting to index a list beyond its range.

Steps to Reproduce

  1. Run the provided script with the variation_degree_schedule parameter set to include values greater than 10.
  2. Observe the IndexError as described above during the image variation phase.

Run with A6000 GPU

Could I run the experiment with an A6000 GPU? It seems that an A6000 is not enough for the default settings (CIFAR10). Should I reduce the batch size or use data parallelism?

Many thanks in advance for your help!

Out of Memory Error on A100 40GB GPUs with main_improved_diffusion_cifar10_conditional.sh

Environment

  • PyTorch Version: 1.12.1
  • CUDA Version: 11.7
  • GPU Type: NVIDIA A100 40GB

Description

I am experiencing an out-of-memory (OOM) error when attempting to run the main_improved_diffusion_cifar10_conditional.sh script. Despite utilizing an NVIDIA A100 GPU with 40GB of memory, which should be sufficient for these tasks, the script consistently fails due to memory issues.

Based on the documentation and typical usage for similar tasks, I would not expect memory consumption to exceed the 40GB limit of the A100 GPU. However, even under normal conditions and with ample available memory, the script triggers an OOM error.

Attempts to Resolve

  • Ensured no other significant processes are consuming GPU memory.
  • Monitored memory usage to confirm that the OOM error occurs despite available memory.
  • Reduced batch size and num_samples_schedule (and related parameters)
  • Searched for similar issues or advice in the repository's issues section and online forums.

I appreciate any insights, suggestions, or updates that might help resolve this issue. Thank you for your attention to this matter and for the valuable resources provided.

Best regards,
Roufaida Laidi

FutureWarning: Passing `image` as torch tensor with value range in [-1,1] is deprecated.

Thank you for sharing this great codebase! When I tried the quick example for Cat Cookie with scripts/main_stable_diffusion_cookie.sh, I noticed a warning from the diffusers package as follows.

Found 100 images in the folder /tmp/result_cookie
FID result_cookie : 100%|█████████████████████████████████████████████████████████| 1/1 [00:02<00:00,  2.73s/it]
07/04/2023 12:40:03 PM [MainThread  ] [INFO ]  fid=86.46057094437475
07/04/2023 12:40:03 PM [MainThread  ] [INFO ]  t=1
07/04/2023 12:40:03 PM [MainThread  ] [INFO ]  Running image variation
  0%|                                                                                     | 0/8 [00:00<?, ?it/s]
.../python3.8/site-packages/diffusers/image_processor.py:204: FutureWarning: Passing `image` as torch tensor with value range in [-1,1] is deprecated. The expected value range for image tensor is [0,1] when passing as pytorch tensor or numpy Array. You passed `image` with value range [-1.0,1.0]
  warnings.warn(

Is this something that I should be careful about? Was this warning already present in your experiments? If not, maybe it is due to a new release of diffusers. In that case, I would greatly appreciate it if you could share the diffusers version you used.

In case it helps, this issue might be related to huggingface/diffusers#3876. Thank you very much for your time.
