
ladi-vton's Introduction

LaDI-VTON (ACM Multimedia 2023)

Latent Diffusion Textual-Inversion Enhanced Virtual Try-On

Davide Morelli*, Alberto Baldrati*, Giuseppe Cartella, Marcella Cornia, Marco Bertini, Rita Cucchiara

* Equal contribution.


🔥🔥 [05/09/2023] Release of the training code

This is the official repository for the paper "LaDI-VTON: Latent Diffusion Textual-Inversion Enhanced Virtual Try-On".

Overview

Abstract:
The rapidly evolving fields of e-commerce and metaverse continue to seek innovative approaches to enhance the consumer experience. At the same time, recent advancements in the development of diffusion models have enabled generative networks to create remarkably realistic images. In this context, image-based virtual try-on, which consists in generating a novel image of a target model wearing a given in-shop garment, has yet to capitalize on the potential of these powerful generative solutions. This work introduces LaDI-VTON, the first Latent Diffusion textual Inversion-enhanced model for the Virtual Try-ON task. The proposed architecture relies on a latent diffusion model extended with a novel additional autoencoder module that exploits learnable skip connections to enhance the generation process while preserving the model's characteristics. To effectively maintain the texture and details of the in-shop garment, we propose a textual inversion component that can map the visual features of the garment to the CLIP token embedding space and thus generate a set of pseudo-word token embeddings capable of conditioning the generation process. Experimental results on the Dress Code and VITON-HD datasets demonstrate that our approach outperforms the competitors by a consistent margin, achieving a significant milestone for the task.

Citation

If you make use of our work, please cite our paper:

@inproceedings{morelli2023ladi,
  title={{LaDI-VTON: Latent Diffusion Textual-Inversion Enhanced Virtual Try-On}},
  author={Morelli, Davide and Baldrati, Alberto and Cartella, Giuseppe and Cornia, Marcella and Bertini, Marco and Cucchiara, Rita},
  booktitle={Proceedings of the ACM International Conference on Multimedia},
  year={2023}
}

Getting Started

We recommend using the Anaconda package manager to avoid dependency/reproducibility problems. For Linux systems, you can find a conda installation guide here.

Installation

  1. Clone the repository
git clone https://github.com/miccunifi/ladi-vton
  2. Install Python dependencies
conda env create -n ladi-vton -f environment.yml
conda activate ladi-vton

Alternatively, you can create a new conda environment and install the required packages manually:

conda create -n ladi-vton -y python=3.10
conda activate ladi-vton
pip install torch==2.0.1 torchvision==0.15.2 opencv-python==4.7.0.72 diffusers==0.14.0 transformers==4.27.3 accelerate==0.18.0 clean-fid==0.1.35 torchmetrics[image]==0.11.4 wandb==0.14.0 matplotlib==3.7.1 tqdm xformers
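
As a quick, optional sanity check of the environment, you can verify the installed PyTorch version and CUDA availability:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"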

Data Preparation

DressCode

  1. Download the DressCode dataset
  2. To enhance the performance of our warping module, we found that using in-shop images with a white background yields better results. To facilitate this process, we provide pre-extracted masks that can be used to remove the background from the images: you can download them here. Once downloaded, extract the mask files and place them in the dataset folder alongside the corresponding images.

Once the dataset is downloaded, the folder structure should look like this:

├── DressCode
│   ├── test_pairs_paired.txt
│   ├── test_pairs_unpaired.txt
│   ├── train_pairs.txt
│   ├── [dresses | lower_body | upper_body]
│   │   ├── test_pairs_paired.txt
│   │   ├── test_pairs_unpaired.txt
│   │   ├── train_pairs.txt
│   │   ├── images
│   │   │   ├── [013563_0.jpg | 013563_1.jpg | 013564_0.jpg | 013564_1.jpg | ...]
│   │   ├── masks
│   │   │   ├── [013563_1.png | 013564_1.png | ...]
│   │   ├── keypoints
│   │   │   ├── [013563_2.json | 013564_2.json | ...]
│   │   ├── label_maps
│   │   │   ├── [013563_4.png | 013564_4.png | ...]
│   │   ├── skeletons
│   │   │   ├── [013563_5.jpg | 013564_5.jpg | ...]
│   │   ├── dense
│   │   │   ├── [013563_5.png | 013563_5_uv.npz | 013564_5.png | 013564_5_uv.npz | ...]
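
If you want to verify the layout before training, the short script below is a minimal sanity check (a hypothetical helper, not part of the repository) that simply confirms the expected sub-folders exist; the dataset root path is an assumption you should adjust:

# check_dresscode_layout.py -- hypothetical helper, not shipped with the repository
from pathlib import Path

DATAROOT = Path("/data/DressCode")  # assumption: change to your DressCode root

for category in ["dresses", "lower_body", "upper_body"]:
    for sub in ["images", "masks", "keypoints", "label_maps", "skeletons", "dense"]:
        folder = DATAROOT / category / sub
        print(f"{category}/{sub}: {'ok' if folder.is_dir() else 'MISSING'}")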

VITON-HD

  1. Download the VITON-HD dataset

Once the dataset is downloaded, the folder structure should look like this:

├── VITON-HD
│   ├── test_pairs.txt
│   ├── train_pairs.txt
│   ├── [train | test]
│   │   ├── image
│   │   │   ├── [000006_00.jpg | 000008_00.jpg | ...]
│   │   ├── cloth
│   │   │   ├── [000006_00.jpg | 000008_00.jpg | ...]
│   │   ├── cloth-mask
│   │   │   ├── [000006_00.jpg | 000008_00.jpg | ...]
│   │   ├── image-parse-v3
│   │   │   ├── [000006_00.png | 000008_00.png | ...]
│   │   ├── openpose_img
│   │   │   ├── [000006_00_rendered.png | 000008_00_rendered.png | ...]
│   │   ├── openpose_json
│   │   │   ├── [000006_00_keypoints.json | 000008_00_keypoints.json | ...]
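
Similarly, a minimal check (again a hypothetical helper) that every pair listed in test_pairs.txt resolves to existing files, assuming each line contains an image name and a cloth name separated by whitespace:

# check_vitonhd_pairs.py -- hypothetical helper, not shipped with the repository
from pathlib import Path

DATAROOT = Path("/data/VITON-HD")  # assumption: change to your VITON-HD root

missing = []
with open(DATAROOT / "test_pairs.txt") as f:
    for line in f:
        im_name, c_name = line.split()
        for path in (DATAROOT / "test" / "image" / im_name,
                     DATAROOT / "test" / "cloth" / c_name):
            if not path.exists():
                missing.append(path)
print(f"{len(missing)} missing files")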

Inference with Pre-trained Models

To run the inference on the Dress Code or VITON-HD dataset, run the following command:

python src/inference.py --dataset [dresscode | vitonhd] --dresscode_dataroot <path> --vitonhd_dataroot <path> --output_dir <path> --test_order [paired | unpaired] --category [all | lower_body | upper_body | dresses ] --mixed_precision [no | fp16 | bf16] --enable_xformers_memory_efficient_attention --use_png --compute_metrics
    --dataset <str>                dataset to use, options: ['dresscode', 'vitonhd']
    --dresscode_dataroot <str>     data root of dresscode dataset (required when dataset=dresscode)
    --vitonhd_dataroot <str>       data root of vitonhd dataset (required when dataset=vitonhd)
    --test_order <str>             test setting, options: ['paired', 'unpaired']
    --category <str>               category to test, options: ['all', 'lower_body', 'upper_body', 'dresses'] (default=all)
    --output_dir <str>             output directory
    --batch_size <int>             batch size (default=8)
    --mixed_precision <str>        mixed precision (no, fp16, bf16) (default=no)
    --enable_xformers_memory_efficient_attention <store_true>
                                   enable memory efficient attention in xformers (default=False)
    --allow_tf32 <store_true>      allow TF32 on Ampere GPUs (default=False)
    --num_workers <int>            number of workers (default=8)
    --use_png <store_true>         use png instead of jpg (default=False)
    --compute_metrics              compute metrics at the end of inference (default=False)

Since we release the pre-trained models via torch.hub, the models will be automatically downloaded when running the inference script.
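
For example, a paired run on VITON-HD with fp16 mixed precision might look like the following (the dataset and output paths are placeholders to adapt to your setup, and the xformers flag assumes xformers is installed):

python src/inference.py --dataset vitonhd --vitonhd_dataroot /data/VITON-HD --output_dir results/vitonhd --test_order paired --category all --mixed_precision fp16 --enable_xformers_memory_efficient_attention --compute_metrics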

Metrics computation

Once you have run the inference script and extracted the images, you can compute the metrics by running the following command:

python src/utils/val_metrics.py --gen_folder <path> --dataset [dresscode | vitonhd] --dresscode_dataroot <path> --vitonhd_dataroot <path> --test_order [paired | unpaired] --category [all | lower_body | upper_body | dresses ]
    --gen_folder <str>             Path to the generated images folder.
    --dataset <str>                dataset to use, options: ['dresscode', 'vitonhd']
    --dresscode_dataroot <str>     data root of dresscode dataset (required when dataset=dresscode)
    --vitonhd_dataroot <str>       data root of vitonhd dataset (required when dataset=vitonhd)
    --test_order <str>             test setting, options: ['paired', 'unpaired']
    --category <str>               category to test, options: ['all', 'lower_body', 'upper_body', 'dresses'] (default=all)
    --batch_size <int>             batch size (default=32)
    --workers <int>                number of workers (default=8)
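
For instance, to score an unpaired DressCode run (folder names are placeholders matching the ones used at inference time):

python src/utils/val_metrics.py --gen_folder results/dresscode --dataset dresscode --dresscode_dataroot /data/DressCode --test_order unpaired --category all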

Training

In this section, you'll find instructions on how to train all the components of our model from scratch.

1. Train Warping Module

First of all, we need to train the warping module. To do so, run the following command:

python src/train_tps.py --dataset [dresscode | vitonhd] --dresscode_dataroot <path> --vitonhd_dataroot <path> --checkpoints_dir <path> --exp_name <str>
    --dataset <str>                dataset to use, options: ['dresscode', 'vitonhd']
    --dresscode_dataroot <str>     dataroot of dresscode dataset (required when dataset=dresscode)
    --vitonhd_dataroot <str>       dataroot of vitonhd dataset (required when dataset=vitonhd)
    --checkpoints_dir <str>        checkpoints directory
    --exp_name <str>               experiment name
    --batch_size <int>             batch size (default=16)
    --workers <int>                number of workers (default=10)
    --height <int>                 height of the input images (default=512)
    --width <int>                  width of the input images (default=384)
    --lr <float>                   learning rate (default=1e-4)
    --const_weight <float>         weight for the TPS constraint loss (default=0.01)
    --wandb_log <store_true>       log training on wandb (default=False)
    --wandb_project <str>          wandb project name (default=LaDI_VTON_tps)
    --dense <store_true>           use dense uv map instead of keypoints (default=False)
    --only_extraction <store_true> only extract the images using the trained networks without training (default=False)
    --vgg_weight <float>           weight for the VGG loss (refinement network) (default=0.25)
    --l1_weight <float>            weight for the L1 loss (refinement network) (default=1.0)
    --epochs_tps <int>             number of epochs for the TPS training (default=50)
    --epochs_refinement <int>      number of epochs for the refinement network training (default=50)
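
As an example, a VITON-HD run with wandb logging could be launched as follows (paths and the experiment name are placeholders):

python src/train_tps.py --dataset vitonhd --vitonhd_dataroot /data/VITON-HD --checkpoints_dir checkpoints --exp_name tps_vitonhd --wandb_log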

At the end of the training, the warped cloth images will be saved in the data/warped_cloths and data/warped_cloths_unpaired folders. To save computation time, in the following steps, we will use the pre-extracted warped cloth images.

2. Train EMASC

To train the EMASC module, run the following command:

python src/train_emasc.py --dataset [dresscode | vitonhd] --dresscode_dataroot <path> --vitonhd_dataroot <path> --output_dir <path>
    --dataset <str>                dataset to use, options: ['dresscode', 'vitonhd']
    --dresscode_dataroot <str>     data root of dresscode dataset (required when dataset=dresscode)
    --vitonhd_dataroot <str>       data root of vitonhd dataset (required when dataset=vitonhd)
    --output_dir <str>             output directory where the model predictions and checkpoints will be written
    --pretrained_model_name_or_path <str>
                                   model identifier from huggingface.co/models (default=stabilityai/stable-diffusion-2-inpainting)
    --seed <int>                   seed for reproducible training (default=1234)
    --train_batch_size <int>       batch size for training (default=16)
    --test_batch_size <int>        batch size for testing (default=16)
    --num_train_epochs <int>       number of training epochs (default=100)
    --max_train_steps <int>        maximum number of training steps. If provided, overrides num_train_epochs (default=40k)
    --gradient_accumulation_steps <int>
                                   number of update steps to accumulate before performing a backward/update pass (default=1)
    --learning_rate <float>        learning rate (default=1e-5)
    --lr_scheduler <str>           learning rate scheduler, options: ['linear', 'cosine', 'cosine_with_restarts', 'polynomial', 'constant', 'constant_with_warmup'] (default=constant_with_warmup)
    --lr_warmup_steps <int>        number of warmup steps for learning rate scheduler (default=500)
    --allow_tf32 <store_true>      allow TF32 on Ampere GPUs (default=False)
    --adam_beta1 <float>           value of beta_1 for Adam optimizer (default=0.9)
    --adam_beta2 <float>           value of beta_2 for Adam optimizer (default=0.999)
    --adam_weight_decay <float>    value of weight decay for Adam optimizer (default=1e-2)
    --adam_epsilon <float>         value of epsilon for Adam optimizer (default=1e-8)
    --max_grad_norm <float>        maximum value of gradient norm for gradient clipping (default=1.0)
    --mixed_precision <str>        mixed precision training, options: ['no', 'fp16', 'bf16'] (default=fp16)
    --report_to <str>              where to report metrics, options: ['wandb', 'tensorboard', 'comet_ml'] (default=wandb)
    --checkpointing_steps <int>    number of steps between each checkpoint (default=10000)
    --resume_from_checkpoint <str> whether training should be resumed from a previous checkpoint. Use "latest" to automatically select the last available checkpoint. (default=None)
    --num_workers <int>            number of workers (default=8)
    --num_workers_test <int>       number of workers for test dataloader (default=8)
    --test_order <str>             test setting, options: ['paired', 'unpaired'] (default=paired)
    --emasc_type <str>             type of EMASC, options: ['linear', 'nonlinear'] (default=nonlinear)
    --vgg_weight <float>           weight for the VGG loss (default=0.5)
    --emasc_kernel <int>           kernel size for the EMASC module (default=3)
    --emasc_padding <int>          padding for the EMASC module (default=1)
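
A possible DressCode invocation, keeping the default hyperparameters (paths are placeholders):

python src/train_emasc.py --dataset dresscode --dresscode_dataroot /data/DressCode --output_dir checkpoints/emasc_dresscode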

At the end of the training, the EMASC checkpoints will be saved in the output_dir folder.

2.5 (Optional) Extract CLIP cloth embeddings

To accelerate the training process for subsequent steps, consider pre-computing the CLIP cloth embeddings for each image in the dataset.

To do so, run the following command:

python src/utils/compute_cloth_clip_features.py --dataset [dresscode | vitonhd] --dresscode_dataroot <path> --vitonhd_dataroot <path>
    --dataset <str>                dataset to use, options: ['dresscode', 'vitonhd']
    --dresscode_dataroot <str>     data root of dresscode dataset (required when dataset=dresscode)
    --vitonhd_dataroot <str>       data root of vitonhd dataset (required when dataset=vitonhd)
    --pretrained_model_name_or_path <str>
                                   model identifier from huggingface.co/models (default=stabilityai/stable-diffusion-2-inpainting)
    --batch_size <int>             batch size (default=16)
    --num_workers <int>            number of workers (default=8)
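
For example (the dataset root is a placeholder):

python src/utils/compute_cloth_clip_features.py --dataset vitonhd --vitonhd_dataroot /data/VITON-HD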

The computed features will be saved in the data/clip_cloth_embeddings folder.

In the following steps, to use the pre-computed features, make sure to use the --use_clip_cloth_features flag.

3. Pre-train the inversion adapter

To pre-train the inversion adapter, run the following command:

python src/train_inversion_adapter.py --dataset [dresscode | vitonhd] --dresscode_dataroot <path> --vitonhd_dataroot <path> --output_dir <path> --gradient_checkpointing --enable_xformers_memory_efficient_attention --use_clip_cloth_features
    --dataset <str>                dataset to use, options: ['dresscode', 'vitonhd']
    --dresscode_dataroot <str>     data root of dresscode dataset (required when dataset=dresscode)
    --vitonhd_dataroot <str>       data root of vitonhd dataset (required when dataset=vitonhd)
    --output_dir <str>             output directory where the model predictions and checkpoints will be written
    --pretrained_model_name_or_path <str>
                                   model identifier from huggingface.co/models (default=stabilityai/stable-diffusion-2-inpainting)
    --seed <int>                   seed for reproducible training (default=1234)
    --train_batch_size <int>       batch size for training (default=16)
    --test_batch_size <int>        batch size for testing (default=16)
    --num_train_epochs <int>       number of training epochs (default=100)
    --max_train_steps <int>        maximum number of training steps. If provided, overrides num_train_epochs (default=200k)
    --gradient_accumulation_steps <int>
                                   number of update steps to accumulate before performing a backward/update pass (default=1)
    --gradient_checkpointing <store_true>
                                   use gradient checkpointing to save memory at the expense of slower backward pass (default=False)
    --learning_rate <float>        learning rate (default=1e-5)
    --lr_scheduler <str>           learning rate scheduler, options: ['linear', 'cosine', 'cosine_with_restarts', 'polynomial', 'constant', 'constant_with_warmup'] (default=constant_with_warmup)
    --lr_warmup_steps <int>        number of warmup steps for learning rate scheduler (default=500)
    --allow_tf32 <store_true>      allow TF32 on Ampere GPUs (default=False)
    --adam_beta1 <float>           value of beta_1 for Adam optimizer (default=0.9)
    --adam_beta2 <float>           value of beta_2 for Adam optimizer (default=0.999)
    --adam_weight_decay <float>    value of weight decay for Adam optimizer (default=1e-2)
    --adam_epsilon <float>         value of epsilon for Adam optimizer (default=1e-8)
    --max_grad_norm <float>        maximum value of gradient norm for gradient clipping (default=1.0)
    --mixed_precision <str>        mixed precision training, options: ['no', 'fp16', 'bf16'] (default=fp16)
    --report_to <str>              where to report metrics, options: ['wandb', 'tensorboard', 'comet_ml'] (default=wandb)
    --checkpointing_steps <int>    number of steps between each checkpoint (default=50000)
    --resume_from_checkpoint <str> whether training should be resumed from a previous checkpoint. Use "latest" to automatically select the last available checkpoint. (default=None)
    --enable_xformers_memory_efficient_attention <store_true>
                                   enable memory efficient attention in xformers (default=False)
    --num_workers <int>            number of workers (default=8)
    --num_workers_test <int>       number of workers for test dataloader (default=8)
    --test_order <str>             test setting, options: ['paired', 'unpaired'] (default=paired)
    --num_vstar <int>              number of predicted v* per image to use (default=16)
    --num_encoder_layers <int>     number of ViT layers to use in inversion adapter (default=1)
    --use_clip_cloth_features <store_true>
                                   use precomputed clip cloth features instead of computing them each iteration (default=False).
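
A possible VITON-HD invocation, assuming the CLIP cloth features from step 2.5 have already been computed (paths are placeholders):

python src/train_inversion_adapter.py --dataset vitonhd --vitonhd_dataroot /data/VITON-HD --output_dir checkpoints/inversion_adapter_vitonhd --gradient_checkpointing --enable_xformers_memory_efficient_attention --use_clip_cloth_features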

At the end of the training, the inversion adapter checkpoints will be saved in the output_dir folder.

NOTE: You can use the --use_clip_cloth_features flag only if you have previously computed the clip cloth features using the src/utils/compute_cloth_clip_features.py script (step 2.5).

4. Train VTO

To successfully train the VTO model, ensure that you specify the correct path to the pre-trained inversion adapter checkpoint. If omitted, the inversion adapter will be trained from scratch. Additionally, don't forget to include the --train_inversion_adapter flag to enable the inversion adapter training during the VTO training process.

To train the VTO model, run the following command:

python src/train_vto.py --dataset [dresscode | vitonhd] --dresscode_dataroot <path> --vitonhd_dataroot <path> --output_dir <path> --inversion_adapter_dir <path> --gradient_checkpointing --enable_xformers_memory_efficient_attention --use_clip_cloth_features --train_inversion_adapter
    --dataset <str>                dataset to use, options: ['dresscode', 'vitonhd']
    --dresscode_dataroot <str>     data root of dresscode dataset (required when dataset=dresscode)
    --vitonhd_dataroot <str>       data root of vitonhd dataset (required when dataset=vitonhd)
    --output_dir <str>             output directory where the model predictions and checkpoints will be written
    --inversion_adapter_dir <str>  path to the inversion adapter checkpoint directory. Should be the same as `output_dir` of the inversion adapter training script. If not specified, the inversion adapter will be trained from scratch. (default=None)
    --inversion_adapter_name <str> name of the inversion adapter checkpoint. To load the latest checkpoint, use `latest`. (default=latest)
     --pretrained_model_name_or_path <str>
                                   model identifier from huggingface.co/models (default=stabilityai/stable-diffusion-2-inpainting)
    --seed <int>                   seed for reproducible training (default=1234)
    --train_batch_size <int>       batch size for training (default=16)
    --test_batch_size <int>        batch size for testing (default=16)
    --num_train_epochs <int>       number of training epochs (default=100)
    --max_train_steps <int>        maximum number of training steps. If provided, overrides num_train_epochs (default=200k)
    --gradient_accumulation_steps <int>
                                   number of update steps to accumulate before performing a backward/update pass (default=1)
    --gradient_checkpointing <store_true>
                                   use gradient checkpointing to save memory at the expense of slower backward pass (default=False)
    --learning_rate <float>        learning rate (default=1e-5)
    --lr_scheduler <str>           learning rate scheduler, options: ['linear', 'cosine', 'cosine_with_restarts', 'polynomial', 'constant', 'constant_with_warmup'] (default=constant_with_warmup)
    --lr_warmup_steps <int>        number of warmup steps for learning rate scheduler (default=500)
    --allow_tf32 <store_true>      allow TF32 on Ampere GPUs (default=False)
    --adam_beta1 <float>           value of beta_1 for Adam optimizer (default=0.9)
    --adam_beta2 <float>           value of beta_2 for Adam optimizer (default=0.999)
    --adam_weight_decay <float>    value of weight decay for Adam optimizer (default=1e-2)
    --adam_epsilon <float>         value of epsilon for Adam optimizer (default=1e-8)
    --max_grad_norm <float>        maximum value of gradient norm for gradient clipping (default=1.0)
    --mixed_precision <str>        mixed precision training, options: ['no', 'fp16', 'bf16'] (default=fp16)
    --report_to <str>              where to report metrics, options: ['wandb', 'tensorboard', 'comet_ml'] (default=wandb)
    --checkpointing_steps <int>    number of steps between each checkpoint (default=50000)
    --resume_from_checkpoint <str> whether training should be resumed from a previous checkpoint. Use "latest" to automatically select the last available checkpoint. (default=None)
    --enable_xformers_memory_efficient_attention <store_true>
                                   enable memory efficient attention in xformers (default=False)
    --num_workers <int>            number of workers (default=8)
    --num_workers_test <int>       number of workers for test dataloader (default=8)
    --test_order <str>             test setting, options: ['paired', 'unpaired'] (default=paired)
    --uncond_fraction <float>      fraction of unconditioned training samples (default=0.2)
    --text_usage <str>             text features to use, options: ['none', 'noun_chunks', 'inversion_adapter'] (default=inversion_adapter)
    --cloth_input_type <str>       cloth input type, options: ['none', 'warped'], (default=warped)
    --num_vstar <int>              number of predicted v* per image to use (default=16)
    --num_encoder_layers <int>     number of ViT layers to use in inversion adapter (default=1)
    --train_inversion_adapter <store_true>
                                   train the inversion adapter during the VTO training (default=False)
    --use_clip_cloth_features <store_true>
                                   use precomputed clip cloth features instead of computing them each iteration (default=False).
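
For example, training on DressCode with the inversion adapter pre-trained in step 3 (paths are placeholders):

python src/train_vto.py --dataset dresscode --dresscode_dataroot /data/DressCode --output_dir checkpoints/vto_dresscode --inversion_adapter_dir checkpoints/inversion_adapter_dresscode --train_inversion_adapter --gradient_checkpointing --enable_xformers_memory_efficient_attention --use_clip_cloth_features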
            

At the end of the training, the checkpoints will be saved in the output_dir folder.

NOTE: You can use the --use_clip_cloth_features flag only if you have previously computed the clip cloth features using the src/utils/compute_cloth_clip_features.py script (step 2.5).

5. Inference with the trained models

Before running the inference, make sure to specify the correct paths to all the trained checkpoints, and use hyperparameters consistent with the ones used during training.

To run the inference on the Dress Code or VITON-HD dataset, run the following command:

python src/eval.py --dataset [dresscode | vitonhd] --dresscode_dataroot <path> --vitonhd_dataroot <path> --output_dir <path> --save_name <str> --test_order [paired | unpaired]  --unet_dir <path> --inversion_adapter_dir <path> --emasc_dir <path>  --category [all | lower_body | upper_body | dresses ] --enable_xformers_memory_efficient_attention --use_png --compute_metrics
    --dataset <str>                dataset to use, options: ['dresscode', 'vitonhd']
    --dresscode_dataroot <str>     data root of dresscode dataset (required when dataset=dresscode)
    --vitonhd_dataroot <str>       data root of vitonhd dataset (required when dataset=vitonhd)
    --output_dir <str>             output directory where the generated images will be written
    --save_name <str>              name of the generated images folder inside `output_dir`
    --test_order <str>             test setting, options: ['paired', 'unpaired']
    --unet_dir <str>               path to the UNet checkpoint directory. Should be the same as `output_dir` of the VTO training script
    --unet_name <str>              name of the UNet checkpoint. To load the latest checkpoint, use `latest`. (default=latest)
    --inversion_adapter_dir <str>  path to the inversion adapter checkpoint directory. Should be the same as `output_dir` of the VTO training script. Needed only if `--text_usage` is set to `inversion_adapter`. (default=None)
    --inversion_adapter_name <str> name of the inversion adapter checkpoint. To load the latest checkpoint, use `latest`. (default=latest)
    --emasc_dir <str>              path to the EMASC checkpoint directory. Should be the same as `output_dir` of the EMASC training script. Needed when --emasc_type!=none. (default=None)
    --emasc_name <str>             name of the EMASC checkpoint. To load the latest checkpoint, use `latest`. (default=latest)
    --pretrained_model_name_or_path <str>
                                   model identifier from huggingface.co/models (default=stabilityai/stable-diffusion-2-inpainting)
    --seed <int>                   seed for reproducible training (default=1234)
    --batch_size <int>             batch size (default=8)
    --allow_tf32 <store_true>      allow TF32 on Ampere GPUs (default=False)
    --enable_xformers_memory_efficient_attention <store_true>
                                   enable memory efficient attention in xformers (default=False)
    --num_workers <int>            number of workers (default=8)
    --category <str>               category to test, options: ['all', 'lower_body', 'upper_body', 'dresses'] (default=all)
    --emasc_type <str>             type of EMASC, options: ['linear', 'nonlinear'] (default=nonlinear)
    --emasc_kernel <int>           kernel size for the EMASC module (default=3)
    --emasc_padding <int>          padding for the EMASC module (default=1)
    --text_usage <str>             text features to use, options: ['none', 'noun_chunks', 'inversion_adapter'] (default=inversion_adapter)
    --cloth_input_type <str>       cloth input type, options: ['none', 'warped'], (default=warped)
    --num_vstar <int>              number of predicted v* per image to use (default=16)
    --num_encoder_layers <int>     number of ViT layers to use in inversion adapter (default=1)
    --use_png <store_true>         use png instead of jpg (default=False)
    --num_inference_steps <int>    number of diffusion steps at inference time (default=50)
    --guidance_scale <float>       guidance scale of the diffusion (default=7.5)
    --use_clip_cloth_features <store_true>
                                   use precomputed clip cloth features instead of computing them each iteration (default=False).
    --compute_metrics              compute metrics at the end of inference (default=False)
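
For example, an unpaired evaluation on VITON-HD using the checkpoints produced in the previous steps (paths and names are placeholders):

python src/eval.py --dataset vitonhd --vitonhd_dataroot /data/VITON-HD --output_dir results --save_name ladi_vton_vitonhd --test_order unpaired --unet_dir checkpoints/vto_vitonhd --inversion_adapter_dir checkpoints/vto_vitonhd --emasc_dir checkpoints/emasc_vitonhd --category all --enable_xformers_memory_efficient_attention --compute_metrics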

The generated images will be saved in the output_dir/save_name_{test_order} folder.

NOTE: You can use the --use_clip_cloth_features flag only if you have previously computed the clip cloth features using the src/utils/compute_cloth_clip_features.py script (step 2.5).

Acknowledgements

This work has partially been supported by the PNRR project “Future Artificial Intelligence Research (FAIR)”, by the PRIN project “CREATIVE: CRoss-modal understanding and gEnerATIon of Visual and tExtual content” (CUP B87G22000460001), both co-funded by the Italian Ministry of University and Research, and by the European Commission under European Horizon 2020 Programme, grant number 101004545 - ReInHerit.

LICENSE

Creative Commons License
All material is made available under Creative Commons BY-NC 4.0. You can use, redistribute, and adapt the material for non-commercial purposes, as long as you give appropriate credit by citing our paper and indicate any changes that you've made.

ladi-vton's People

Contributors

abaldrati, giuseppecartella, marcellacornia, omedivad


ladi-vton's Issues

Problem about training

Hi, thank you for your great work!

I was trying to write training code and run some experiments, but I was confused by the statement "We first train the EMASC modules, the textual-inversion adapter, and the warping component. Then, we freeze all the weights of all modules except for the textual inversion adapter and train the proposed enhanced Stable Diffusion pipeline" in Section 4.2. Should I first freeze the other weights (including the UNet) and train only the textual inversion adapter, or should I freeze the other weights and train the textual inversion adapter and the UNet together?

Issue with training VTO & Inversion Adapter

Hi,

I'm trying to train all the models with 1024x768 images. I managed to train TPS & EMASC at this resolution with some code modifications, and training works well according to the metrics and visual results.

But it doesn't work at all for the inversion adapter and VTO. Both trainings produce no loss reduction (close to constant with hard smoothing on wandb, and very oscillating without smoothing).
[screenshot of the wandb loss curves]
I also tested with the 512x384 shape and it gives the same results. Is this expected?

I'm using the default parameters except batch_size=8 for VTO and batch_size=1 for the inversion adapter on a single A100 GPU. I assume a value greater than 1 could prevent this training issue, but my hardware doesn't allow a bigger one 😞
I tried reducing the learning rate but it results in the same issue.

Commands used to train Inversion adapter and VTO :

  • python src/train_inversion_adapter.py --dataset vitonhd --vitonhd_dataroot data/viton-hd/ --output_dir checkpoints/inverter_1024 --gradient_checkpointing --enable_xformers_memory_efficient_attention --use_clip_cloth_features --allow_tf32 --pretrained_model_name_or_path pretrained_models/stable-diffusion-2-inpainting/ --height 1024 --width 768 --train_batch_size 1 --test_batch_size 1

  • python src/train_vto.py --dataset vitonhd --vitonhd_dataroot data/viton-hd/ --output_dir checkpoints/vto_1024 --inversion_adapter_dir checkpoints/inverter_1024/ --gradient_checkpointing --enable_xformers_memory_efficient_attention --use_clip_cloth_features --height 1024 --width 768 --train_batch_size 8 --test_batch_size 8 --allow_tf32

Could you please help me resolve this problem?
Thanks for your clean work btw :)

Working of ladi-vton with the DressCode dataset

The model trained on VITON-HD is amazing; it is definitely one of the best.
Is it just me, or are the results generated using DressCode nowhere near those of the VITON-HD model?
The model not being able to handle textual details on t-shirts is understandable.
But I have actually written a pre-processing pipeline in Colab and run inference on custom data, and even a small miss in the parsing generates a totally bad image. I am now getting correct pre-processing, with the data size and positioning right, but the DressCode model simply doesn't work with faces. All the faces in the final results are distorted, as attached below. Aren't the EMASC modules supposed to restore the face?

Is there anything which can be done to solve this issue or is it not possible?

[distorted face results attached]

Poor result on KID score

Hey, great project!
I want to replicate the results on the paired dataset. The other metric scores match those reported in the paper, but the KID score is much smaller than the one shown in the paper:
[screenshot of the computed metrics]

Could you possibly assist or provide any guidance to address this issue? Thanks in advance.

Clothing replaced by same clothing

We're running inference to recreate the paper results using the VITON-HD dataset (test_pairs.txt in our conda environment); however, the results appear to be slightly modified versions of the original clothing. It appears as if it's going through a diffusion pass but not applying the clothing.

Running on a Windows 11 PC with a 4090, following the default settings/commands provided.

[attached: original image, clothing, final image]

training code

Hi,

Thanks for your great work and congrats on your acceptance.

May I know when you will release the training code?

poor result

I have tested on the VITON-HD dataset and am getting very poor results; see the command below:

python3 src/inference.py --dataset vitonhd --vitonhd_dataroot /content/VITON-HD --output_dir ./ --test_order unpaired --category all --batch_size 8 --mixed_precision fp16 --num_workers 8

[example outputs attached: 01265_00, 08646_00, 11078_00]

Out of memory when training the EMASC module

Hello, I'm a beginner in artificial intelligence. Your work is very good and I'm very interested in it, but while trying to reproduce it and train the EMASC module, I always get an "out of memory" error. Do you have any suggestions?
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 384.00 MiB (GPU 0; 23.69 GiB total capacity; 22.12 GiB already allocated; 103.25 MiB free; 22.16 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Asking

I was just wondering when the training code is going to be released. Thank you in advance.

Training Code

First of all, thank you for this amazing work! Do you plan to release the training code as well?

How to use this on custom dataset?

Hey, I am trying to use the model on a custom dataset. As you can see, I ran inference on this image but the result was not what I expected. How can I use this model on custom data? Do I need to fine-tune on a specific individual so the model understands their body type?

[attached: input (vton) and generated (vton_gen) images]

Asking about training GPU specs

Thank you for sharing your work. I'm trying to train your module on 8 GPUs with 15 GB of memory each, but I get an OOM error. Can you share your training GPU specs? How many GPUs did you use to train the VTO module?

Time and hardware for inference and training

Hi! Congrats on the excellent paper!

Could you tell me how much time it takes to run an inference and on which hardware?

Also, with which hardware did you train and for how long?

Thanks!!

VAE with intermediate features takes up more GPU memory than original VAE

Hi, your work is so wonderful! Here are some questions.

I noticed that declaring val_pipe in the training code as an instance of StableDiffusionTryOnePipeline occupies a very large amount of GPU memory, and inference.py itself also occupies a large amount of GPU memory when running. It would be much better to replace the VAE with intermediate features with the original VAE. Have you noticed this? May I ask which GPU you run inference.py on?

Thank you!

Training Code

Hello! First, I want to congratulate you on these amazing results. I also wanted to ask when the training code will be available. Additionally, is this similar to fine-tuning the Stable Diffusion model, but on multiple concepts?

Some questions about the training process

Thanks to the authors for such influential work.

  1. I would like to confirm whether the EMASC module is trained with a reconstruction loss, i.e. the L1 and VGG losses mentioned in the article. How is its input constructed: is it the model image I, i.e. a mapping from I to \tilde{I}?

  2. During the training of the enhanced Stable Diffusion pipeline, is any model image I used as input? I ask because the sampling operation on the model image I appears in Equation 3 and Equation 4.

  3. I am more curious about how the whole training process of diffusion-based models actually works. Not limited to this work, I have similar confusion about other works and I hope to get help from the authors here. My understanding of the mechanism of a diffusion model is that it fits the distribution of the dataset, so what is the underlying principle when it is applied to the try-on task? What is it fitting in a given task? In other words, similar to the second question, how is the ground-truth picture, i.e. the model picture I, used effectively?

questions about the KID metric

Hi, thank you for your great work!

After running inference on the VITON-HD dataset with your released model, I got KID_p = 0.0015 (1.08 in your paper) and KID_u = 0.0018 (1.60 in your paper). Why is there such a big difference?

Thanks for any advice.

how to prepare text prompt

Hi, thank you for your nice work. I would like to ask how to obtain the text prompts for training. It seems the VITON-HD dataset does not provide text prompts.

Question about ablation study

Hi, thanks for sharing your great work!

I'm very interested in exploring the application of LDMs to virtual try-on and am inspired by your work. But I'm confused by the second and third rows of Tab. 4 in your paper.

I notice the performance doesn't drop noticeably with empty strings (row 1) or textual elements (row 2). How can I get the textual elements? Maybe by passing the garment images directly through the VE to the U-Net?

Moreover, why does the performance drop dramatically with f_theta, even much worse than with empty strings?

Looking forward to your reply! Thank you again!

Use lower_body, dresses and all

Thanks for your project! How can I use it for trying on bottoms (pants, skirts) or dresses? I tested on the VITON-HD dataset and only got tops.

Does not work well on text or letters

By running command:
python src/inference.py --dataset vitonhd --vitonhd_dataroot zalando-hd-resized --output_dir output --test_order paired --batch_size 1 --mixed_precision fp16
I found it does not work well on text or letters; see these bad cases:
[bad case images attached]
This phenomenon is not mentioned in your paper. Is there any way to fix it?

Questions about extending the first convolutional layer

Congrats on your work! In the paper, you mentioned that:

we propose to extend the kernel channels of the first convolutional layer by adding zero initialized weights to match the new input channel dimension

Will you also fine-tune the first convolutional layer or the Stable Diffusion model during training to accommodate the channel change?

BTW, will the code be released before the end of June?

Bad Result on custom image from DressCode Dataset

Hi Folks,

I tried running inference on a single image taken from DressCode, with all the preprocessed data coming from the original source data itself, with minor tweaks. I am getting unexpected results from this custom inference.

Even when I do the preprocessing myself, the results are similar. I have attached input and output images for reference (the pose map has 18 channels, so I couldn't visualize it properly here).

[input and output images attached]

Can anyone help me here?

try on tattoos

Can I use this project to try on tattoos? If yes, what do I need to do?

How to get "image-parse-v3" images

To run the inference.py script, you must have at least the images in the "cloth", "image", "image-parse-v3", and "openpose_json" folders (I use the VITON-HD dataset). Everything is clear for the "image" and "cloth" folders, and I also learned how to obtain the "openpose_json" files. But I can't find out how to get the images for "image-parse-v3". Help me please.

Not realistic

Hi, I tried using this model but the generated result is not realistic. Do you know why this could be? The upper-body shirt in the result below was tried on using the model; it is this shirt:

[garment and result images attached]

Could I ask you for some advice?

I want to use a pre-trained large model, but the input requirements for such models are generally square. For human body images, which are generally rectangular, how do I process the image to meet the needs of the pre-trained model? Simply padding the blank areas seems to make the whole image more sparse.

Real world results

Hi, thank you for such nice work. I was wondering, have you tried your model on real-world data outside of the datasets mentioned in the paper, for both the target model and the garment?

Also, a question regarding training: did you merge those datasets for training, or did you train separate models for each dataset?

could not get good result

Thanks for your great work. I want to test your model trained on DressCode. However, I don't have DressCode, so I use vitonhd.py to load the VITON-HD training data together with the model pretrained on DressCode. Unluckily, the sleeve goes over the original boundary. Can you give me some advice?
[result image attached]

Another question: when I test your code on VITON-HD (vitonhd.py to load data, model pretrained on VITON-HD), I find that the sleeve does not match my human parsing. Can I add depth/Canny control to your diffusion model?
[result image attached]

Thanks for any advice.

[Colab Guide] Drive Link for Quick Inference on custom data with LADI-VTON using Colab

I have made a Colab preprocessing pipeline for Ladi-VTON which can run inference on custom data using the DressCode model.

Here is my Drive link: https://drive.google.com/drive/folders/19XL0kvTw6SoCCAOJY9FgvuQJ9M_JAZHt?usp=sharing
You will need to make a copy of my Drive folder in your Google Drive with the same name first and use a GPU on Colab.

I have made the pre-processing usable for the DressCode dataset.
Keep your input images in the /images folder and write the test pairs properly.
Then, after running ladi-vton_DressCode.ipynb, the input folder for inference will automatically be generated.

When running inference on custom data, ladi-vton messes up the faces, so I have made a refinement notebook using Google MediaPipe just for this purpose.
The intermediate results after inference are in the results folder, and the final results after refinement will be in the final folder.

I have used this exact Drive folder to generate some results and it mostly works; there are some problems with a few specific garments.
I'm thinking of moving it into a GitHub repo after some time. If you have any doubts or suggestions, you can post them.

I can't run the VTON dataset.

Error Message:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 76: character maps to <undefined>

issue with yml file

Thanks for your great work. I'm having a hard time building the same environment on a Windows system, and I noticed your yml file was built for a Linux system. Could you upload a yml file for Windows? Thanks in advance.

Bad Generated Images

Hi,
Appreciate the great work and contribution.
I tested the ladi-vton model on a large number of images from the VITON-HD dataset. I am sharing below some cases that do not work properly:

  1. If the model image 'I' is wearing a full-sleeve garment and I try to replace it with a sleeveless or half-sleeve one, part of the sleeve remains on the body [image 2].
  2. It tends to inpaint the mask region exactly, which doesn't look right in some images, and the mask region is visible at the bottom where the style is not an in-shirt type.
  3. Occlusion is not handled well: images with occlusion are not generated properly, and in fact the results are very distorted [image 1].
  4. And the texture information of the cloth image is not preserved properly [image 4].

What could be the reasons for these problems, and would fine-tuning or training with a larger number of sleeve/sleeveless combinations resolve this issue?

[images 1-4 attached]

use own model images

What projects (neural networks) should I use to obtain the image-parse-v3 images and openpose_json files? I want to use my own model images, but if I understand the project correctly, this requires those images in these formats.

What data will affect inference results of VITON-HD?

I found that only the cloth, image, openpose_json, and image-parse-v3 data are needed when running inference with the VITON-HD dataset.
If I don't provide data such as cloth-mask, image-parse-agnostic-v3.2, and so on, will it have any impact on the inference results?
Thank you.

Outdated arguments in function

The Encoder calls get_down_block here with the parameter attn_num_head_channels, but get_down_block doesn't have such a parameter in newer versions of the diffusers library.

How can i run this project with m1 mac

I followed all the instructions this project provides, and I get this error:

ValueError: torch.cuda.is_available() should be True but is False. xformers' memory efficient attention is only available for GPU

After I removed the --enable_xformers_memory_efficient_attention argument from my command, the error changed to this:

...src/inference.py", line 226, in main
    generator = torch.Generator("cuda").manual_seed(args.seed)
                ^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Device type CUDA is not supported for torch.Generator() api.

I searched for that error and found the PyTorch MPS documentation. I changed the code at src/inference.py:226 to:

generator = torch.Generator("mps").manual_seed(args.seed)

Maybe I've been doing something wrong, because that didn't work either. I was going to try without a GPU, but I didn't. Is there a way to disable CUDA/the GPU?

Thank you

My environment:
Macbook Pro 14-inch, 2021
chip: Apple M1 Pro
os: 13.6 (22G120)
python: 3.11

How to maintain details when warping?

Hi, I found that a lot of information is lost in the results after warping (such as logos and patterns). It seems that without refinement, using only TPS, more details are retained. Have the authors done any ablation experiments using only TPS training?

Poor results

Hi, I've been attempting to replicate the results you demonstrated in Figure 7 of your paper. However, the outcome is not as presented in the paper; specifically, the pattern on the T-shirt is not being reproduced.
[attached image]
Here's the same garment I found in zalando-hd: [00579_00]
Here is the result when I run your code: [00654_00_00579_00]
Could you possibly assist or provide any guidance to address this issue? Thanks in advance.
