This repository provides utilities to create a minimal dataset for InstructPix2Pix-like training of diffusion models.
- Download the original dataset as discussed here. I used this version: clip-filtered-dataset. Note that the download can take as long as 24 hours, depending on your internet bandwidth. The dataset also requires at least 600 GB of storage.
- Then run:

```bash
python make_dataset.py --data_root clip-filtered-dataset --num_samples_to_use 1000
```
`make_dataset.py` was specifically designed to produce a 🤗 dataset, so it is most useful when you push the minimal dataset to the 🤗 Hub. You can do so by setting `push_to_hub` when running `make_dataset.py`.
Here is an example of a resulting dataset: https://huggingface.co/datasets/sayakpaul/instructpix2pix-1000-samples
The full version of the CLIP-filtered dataset used for InstructPix2Pix training can be found here: https://huggingface.co/datasets/timbrooks/instructpix2pix-clip-filtered
With the dataset on the 🤗 Hub, one can load it in two lines of code:
```python
from datasets import load_dataset

dataset = load_dataset("timbrooks/instructpix2pix-clip-filtered", split="train")
```
And voilà 🤗
The structure of `make_dataset.py` is inspired by Nate Raw's notebook.