dpsda's Introduction

Differentially Private Synthetic Data via Foundation Model APIs

This repo is a Python library to generate differentially private (DP) synthetic data without the need for any ML model training. It is based on the papers that proposed a new DP synthetic data framework that only utilizes the black-box inference APIs of foundation models (e.g., Stable Diffusion).

Potential Use Cases

Given a private dataset, this tool can generate a new DP synthetic dataset that is statistically similar to the private dataset, while ensuring a rigorous privacy guarantee called Differential Privacy. The DP synthetic dataset can replace real data in various use cases where privacy is a concern, for example:

  • Sharing it with other parties for collaboration and research.
  • Using it in downstream algorithms (e.g., training ML models) in the normal non-private pipeline.
  • Inspecting the data directly for easier product debugging or development.

Supported Data Types

This repo currently supports the following data types and foundation models.

Foundation Model APIs | Data Type | Size of Generated Images (--image_size)
----------------------|-----------|-----------------------------------------
Stable Diffusion      | Images    | Preferably 512x512
improved diffusion    | Images    | 64x64
DALLE2                | Images    | 256x256, 512x512, or 1024x1024

Quick Examples

See the docker file for the environment.

CIFAR10 Images

pushd data; python get_cifar10.py; popd  # Download CIFAR10 dataset
pushd models; ./get_models.sh; popd  # Download the pre-trained improved diffusion model
./scripts/main_improved_diffusion_cifar10_conditional.sh  # Run DP generation

Camelyon17 Images

pushd data; python get_camelyon17.py; popd  # Download Camelyon17 dataset
pushd models; ./get_models.sh; popd  # Download the pre-trained improved diffusion model
./scripts/main_improved_diffusion_camelyon17_conditional.sh  # Run DP generation

Cat Images

  • For Cat Cookie:
./scripts/main_stable_diffusion_cookie.sh  # Run DP generation
  • For Cat Doudou:
./scripts/main_stable_diffusion_doudou.sh  # Run DP generation

See the scripts folder for more examples.

Detailed Usage

main.py is the main script for generation. Please refer to python main.py --help for detailed descriptions of the arguments.

For each foundation model API (e.g., Stable Diffusion, improved diffusion), there may be additional arguments. Please use the --api_help argument, e.g., python main.py --api stable_diffusion --data_folder data --api_help, to see detailed descriptions of the API-specific arguments.

See Appendices H, I, J of the paper for examples/guidelines of parameter selection.
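
As a rough sketch of what a run on a custom dataset could look like (not a tuned or verified configuration): the --api, --data_folder, and --image_size arguments all appear elsewhere in this README, while the folder path and size value below are placeholders; consult python main.py --help and --api_help for the authoritative argument names and value formats.

python main.py \
    --api stable_diffusion \
    --data_folder data/my_dataset \
    --image_size 512x512  # placeholder values; verify exact formats with --help and --api_help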

Generate DP Synthetic Data for Your Own Dataset

Please put all images in a folder (which can contain any nested folder structure), and name each image file as <class label without the '_' character>_<the remaining part of the filename>.<jpg/jpeg/png/gif>. Pass the path of this folder to the --data_folder argument.
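
For illustration, a hypothetical two-class dataset (all folder, class, and file names below are made up) could be laid out as follows; only the <class label>_<rest of filename> pattern matters, not the folder nesting:

data/my_dataset/
├── cat_0001.png
├── cat_0002.jpg
└── more_images/
    ├── dog_0001.png
    └── dog_0002.jpeg

With this layout, every file whose name starts with cat_ would be treated as class cat and every file starting with dog_ as class dog, and data/my_dataset would be passed to --data_folder.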

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

Responsible Uses

This project uses foundation model APIs to create synthetic image data with differential privacy guarantees. Differential privacy (DP) is a formal framework that ensures the output of an algorithm does not reveal too much information about its inputs. Without a formal privacy guarantee, a synthetic data generation algorithm may inadvertently reveal sensitive information about its input datapoints.
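
Formally, under the standard definition, a randomized mechanism M satisfies (ε, δ)-differential privacy if, for every pair of datasets D and D' that differ in a single record and every set of outputs S, Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D') ∈ S] + δ; smaller values of ε and δ correspond to a stronger guarantee.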

Using synthetic data in downstream applications can carry risk. Synthetic data may not always reflect the true data distribution, and can cause harms in downstream applications. Both the dataset and algorithms behind the foundation model APIs may contain various types of bias, leading to potential allocation, representation, and quality-of-service harms. Additionally, privacy violations can still occur if the ε and δ privacy parameters are set inappropriately, or if multiple copies of a sample exist in the seed dataset. It is important to consider these factors carefully before any potential deployments.

dpsda's People

Contributors

fjxmlzn, harsha-nori, microsoft-github-operations[bot], microsoftopensource

dpsda's Issues

Error When `variation_degree_schedule` Value Exceeds 10

Context

  • Dataset: Brain Tumor MRI Dataset with 4 classes, 5713 training samples, and 1312 testing samples. Images are labeled as "label_objectNumber" and converted to RGB. More details can be found here.
  • Environment: PyTorch 1.12.0, CUDA 11.7.0
  • Script Parameters:
    • Feature Extractor: inception_v3
    • FID Model Name: inception_v3
    • Dataset Name for FID: brain
    • Image Size: 64x64
    • Batch Size: 500
    • Variation Degree Schedule: 0 to 42 in steps of 2, with an error occurring for values > 10

Issue Description

The script runs successfully for many iterations when the variation_degree_schedule parameter values are below 10. However, exceeding this value results in the following error during the image variation phase:
"Traceback (most recent call last):
File "/cluster/home/laidir/DPSDA/main.py", line 468, in
main()
File "/cluster/home/laidir/DPSDA/main.py", line 361, in main
packed_samples = api.image_variation(
File "/cluster/home/laidir/DPSDA/apis/improved_diffusion_api.py", line 255, in image_variation
sub_variations = self._image_variation(
File "/cluster/home/laidir/DPSDA/apis/improved_diffusion_api.py", line 268, in _image_variation
samples, _ = sample(
File "/cluster/home/laidir/DPSDA/apis/improved_diffusion_api.py", line 307, in sample
sample = sampler(
File "/cluster/apps/eb/software/PyTorch/1.12.0-foss-2022a-CUDA-11.7.0/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/cluster/apps/eb/software/PyTorch/1.12.0-foss-2022a-CUDA-11.7.0/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/cluster/apps/eb/software/PyTorch/1.12.0-foss-2022a-CUDA-11.7.0/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/cluster/apps/eb/software/PyTorch/1.12.0-foss-2022a-CUDA-11.7.0/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
output.reraise()
File "/cluster/apps/eb/software/PyTorch/1.12.0-foss-2022a-CUDA-11.7.0/lib/python3.10/site-packages/torch/_utils.py", line 461, in reraise
raise exception
IndexError: Caught IndexError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/cluster/apps/eb/software/PyTorch/1.12.0-foss-2022a-CUDA-11.7.0/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/cluster/apps/eb/software/PyTorch/1.12.0-foss-2022a-CUDA-11.7.0/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/cluster/home/laidir/DPSDA/apis/improved_diffusion_api.py", line 354, in forward
sample = sample_fn(
File "/cluster/home/laidir/DPSDA/apis/improved_diffusion/gaussian_diffusion.py", line 223, in ddim_sample_loop
for sample in self.ddim_sample_loop_progressive(
File "/cluster/home/laidir/DPSDA/apis/improved_diffusion/gaussian_diffusion.py", line 269, in ddim_sample_loop_progressive
t_batch = th.tensor([indices[0]] * img.shape[0], device=device)
IndexError: list index out of range
"

This error appears to originate from an IndexError in the ddim_sample_loop within the improved diffusion API, specifically when attempting to index a list beyond its range.

Steps to Reproduce

  1. Run the provided script with the variation_degree_schedule parameter set to include values greater than 10.
  2. Observe the IndexError as described above during the image variation phase.

Run with A6000 GPU

Could I run the experiment with an A6000 GPU? It seems that an A6000 is not enough for the default settings (CIFAR10). Should I reduce the batch size or use data parallelism?

Many thanks in advance for your help!

Out of Memory Error on A100 40GB GPUs with main_improved_diffusion_cifar10_conditional.sh

Environment

  • PyTorch Version: 1.12.1
  • CUDA Version: 11.7
  • GPU Type: NVIDIA A100 40GB

Description

I am experiencing an out-of-memory (OOM) error when attempting to run the main_improved_diffusion_cifar10_conditional.sh script. Despite utilizing an NVIDIA A100 GPU with 40GB of memory, which should be sufficient for these tasks, the script consistently fails due to memory issues.

Based on the documentation and typical usage for similar tasks, I would not expect memory consumption to exceed the 40GB limit of the A100 GPU. However, even under normal conditions and with ample available memory, the script triggers an OOM error.

Attempts to Resolve

  • Ensured no other significant processes are consuming GPU memory.
  • Monitored memory usage to confirm that the OOM error occurs despite available memory.
  • Reduced batch size and num_samples_schedule (and related parameters)
  • Searched for similar issues or advice in the repository's issues section and online forums.

I appreciate any insights, suggestions, or updates that might help resolve this issue. Thank you for your attention to this matter and for the valuable resources provided.

Best regards,
Roufaida Laidi

FutureWarning: Passing `image` as torch tensor with value range in [-1,1] is deprecated.

Thank you for sharing this great codebase! When I tried the quick example for Cat Cookie with scripts/main_stable_diffusion_cookie.sh, I noticed a warning from the diffusers package as follows.

Found 100 images in the folder /tmp/result_cookie
FID result_cookie : 100%|█████████████████████████████████████████████████████████| 1/1 [00:02<00:00,  2.73s/it]
07/04/2023 12:40:03 PM [MainThread  ] [INFO ]  fid=86.46057094437475
07/04/2023 12:40:03 PM [MainThread  ] [INFO ]  t=1
07/04/2023 12:40:03 PM [MainThread  ] [INFO ]  Running image variation
  0%|                                                                                     | 0/8 [00:00<?, ?it/s]
.../python3.8/site-packages/diffusers/image_processor.py:204: FutureWarning: Passing `image` as torch tensor with value range in [-1,1] is deprecated. The expected value range for image tensor is [0,1] when passing as pytorch tensor or numpy Array. You passed `image` with value range [-1.0,1.0]
  warnings.warn(

Is this something that I should be careful about? Was this warning already present in your experiments? If not, maybe it is due to a new release of diffusers. In that case, I would greatly appreciate it if you could share the diffusers version you used.

In case it helps, this issue might be related to huggingface/diffusers#3876. Thank you very much for your time.
