Official code release for CVPR 2024 paper SiTH.
What you can find in this repo:
- Demo for reconstructing a fully textured 3D human from a single image in 2 minutes (tested on an RTX 3090 GPU)
- A minimal script for fitting the SMPL-X model to an image.
- A new evaluation benchmark for single-view 3D human reconstruction.
- A Gradio demo for creating 3D humans with poses and text prompts.
- [TODO] Training scripts for the diffusion model and the mesh reconstruction model.
If you find our code and paper useful, please cite it as
@inproceedings{ho2024sith,
title={SiTH: Single-view Textured Human Reconstruction with Image-Conditioned Diffusion},
author={Ho, Hsuan-I and Song, Jie and Hilliges, Otmar},
booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2024}
}
- [April 24, 2024] Gradio demo for 3D human creation is now available.
- [April 15, 2024] Release demo code, models, and the evaluation benchmark.
Our code has been tested with PyTorch 2.1.0, CUDA 12.1, and an RTX 3090 GPU.
Simply run the following command to install relevant packages:
pip install -r requirements.txt
- Download the checkpoint files into the checkpoints folder.
bash tools/download.sh
- Download SMPL-X models and move them to the data/body_models folder. You should have the following data structure:
body_models
└──smplx
├── SMPLX_NEUTRAL.pkl
├── SMPLX_NEUTRAL.npz
├── SMPLX_MALE.pkl
├── SMPLX_MALE.npz
├── SMPLX_FEMALE.pkl
└── SMPLX_FEMALE.npz
- Run the script for body fitting, back hallucination, and mesh reconstruction.
bash run.sh
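Before launching run.sh, it can help to confirm that the SMPL-X files are where the layout above expects them. The following is a minimal sketch, not part of the repository: the helper name missing_smplx_files is hypothetical, and the file names are simply copied from the tree shown above.

```python
from pathlib import Path

# File names copied from the data structure listed above.
EXPECTED_FILES = [
    f"SMPLX_{gender}.{ext}"
    for gender in ("NEUTRAL", "MALE", "FEMALE")
    for ext in ("pkl", "npz")
]


def missing_smplx_files(body_models_dir="data/body_models"):
    """Return the expected SMPL-X files that are not present on disk."""
    smplx_dir = Path(body_models_dir) / "smplx"
    return [name for name in EXPECTED_FILES if not (smplx_dir / name).is_file()]


if __name__ == "__main__":
    missing = missing_smplx_files()
    if missing:
        print("Missing SMPL-X files:", ", ".join(missing))
    else:
        print("All SMPL-X model files found.")
```

An empty list means all six files were found.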
We created an application that combines SiTH with ControlNet for 3D human creation. In the demo, users can create 3D humans with a few button clicks.
You can either try our Online Demo or launch the web UI locally. To run the demo on your local machine, simply run
python app.py
You will see the following web UI on http://127.0.0.1:7860/.
You can prepare your own RGBA images and put them into the data/examples/rgba folder. For example, you can create photos with OutfitAnyone and remove the background with Segment Anything or Clipdrop.
- Run the script to generate square, centered input images in the data/examples/images folder. The default size is 1024x1024. You can change this with the --size and --ratio arguments.
python tools/centralize_rgba.py
- Install and run OpenPose to get .json files of COCO-25 body, hand, and face keypoints. For example, we used the following command. Afterwards, your image folder should contain files as in data/examples/images.
cd /path/to/openpose_dir
./build/examples/openpose/openpose.bin --image_dir /path/to/images_dir --write_json /path/to/images_dir --display 0 --net_resolution -1x544 --scale_number 3 --scale_gap 0.25 --hand --face --render_pose 0
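To sanity-check the OpenPose output before fitting, the keypoint files can be inspected with a few lines of Python. This is a minimal sketch, not code from this repository: load_keypoints is a hypothetical helper, and it assumes the standard OpenPose JSON layout (a "people" list whose entries hold flat x, y, confidence arrays) with 25 body, 21 per-hand, and 70 face keypoints.

```python
import json
from pathlib import Path

# Keypoint arrays are flat lists of (x, y, confidence) triplets.
# Counts assume the standard OpenPose body, hand, and face models.
EXPECTED_COUNTS = {
    "pose_keypoints_2d": 25,
    "hand_left_keypoints_2d": 21,
    "hand_right_keypoints_2d": 21,
    "face_keypoints_2d": 70,
}


def load_keypoints(json_path):
    """Return {field: [(x, y, conf), ...]} for the first detected person."""
    data = json.loads(Path(json_path).read_text())
    people = data.get("people", [])
    if not people:
        raise ValueError(f"No person detected in {json_path}")
    person = people[0]
    result = {}
    for field, count in EXPECTED_COUNTS.items():
        flat = person.get(field, [])
        if len(flat) != 3 * count:
            raise ValueError(f"{field}: expected {3 * count} values, got {len(flat)}")
        # Regroup the flat list into (x, y, confidence) triplets.
        result[field] = [tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)]
    return result
```

For example, load_keypoints("data/examples/images/000_keypoints.json") raises an error if no person was detected or if the hand or face arrays are missing, which usually means the --hand or --face flag was omitted.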
Next, we fit the SMPL-X body model to each input image and align them within a cube of [-1, 1]. By default, we use the following command that optimizes the global orientation, body shape, scale, and X,Y offset parameters.
python fit.py --opt_orient --opt_betas
There are also additional arguments and hyperparameters for customized fitting. For example, if the initial body pose is not well aligned, you can use the --opt_pose flag to optimize specific body joints. You can visualize the fitting results by enabling the --debug flag.
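To build intuition for the [-1, 1] cube mentioned above, the sketch below maps a pixel position in a square input image to normalized coordinates. It is an illustration only: pixel_to_normalized is a hypothetical helper, and the convention used here (image center at the origin, y axis pointing up) is an assumption that may differ from what fit.py does internally.

```python
def pixel_to_normalized(x, y, size=1024):
    """Map pixel coordinates in a size x size image to the range [-1, 1].

    Assumed convention: the image center maps to (0, 0), x grows to the
    right, and y is flipped so that it grows upward.
    """
    x_norm = 2.0 * x / size - 1.0
    y_norm = 1.0 - 2.0 * y / size
    return x_norm, y_norm


# Usage: the center of a 1024x1024 image lands at the origin of the cube.
center = pixel_to_normalized(512, 512)
```

Under this convention, the scale and X,Y offset parameters decide where and how large the body appears in the normalized frame.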
Given the front-view images and SMPL-X parameters, we generate back-view images with our image-conditioned diffusion model. The following command generates images in the data/examples/back_images folder.
python hallucinate.py --num_validation_image 8
Note that generative models are stochastic. Therefore, multiple images are generated, and you can choose the best one to replace the image in data/examples/back_images. There are several parameters you can play with:
- --guidance_scale: Classifier-free guidance (CFG) scale.
- --conditioning_scale: ControlNet conditioning scale.
- --num_inference_steps: Denoising steps.
- --pretrained_model_name_or_path: The default model is trained on 500 human scans. We offer a new model trained with 2000+ scans and more view angles. To use the model, please set this argument to hohs/SiTH-diffusion-2000.
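For readers unfamiliar with classifier-free guidance, the sketch below shows what the value passed to --guidance_scale typically controls in a diffusion sampler. It is a generic illustration of the standard CFG formula on plain Python lists, not code taken from hallucinate.py, and apply_cfg is a hypothetical name.

```python
def apply_cfg(noise_uncond, noise_cond, guidance_scale):
    """Blend unconditional and conditional noise predictions element-wise.

    Standard classifier-free guidance: start from the unconditional
    prediction and move toward the conditional one by guidance_scale.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(noise_uncond, noise_cond)]


# A scale of 1.0 returns the conditional prediction unchanged.
# Larger values push the result further away from the unconditional one.
guided = apply_cfg([0.0, 1.0], [1.0, 3.0], guidance_scale=2.0)
```

In general, higher scales follow the conditioning more strictly at the cost of diversity, so it is worth trying a few values when the generated back views look off.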
Before reconstructing the 3D meshes, make sure the following folders and images are ready.
data/examples
├──images
| ├── 000.png
| ├── 000_keypoints.json
| ...
|
├──smplx
| ├── 000_smplx.obj
| ...
|
└──back_images
├── 000_00X.png
...
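A quick check of the layout above can be sketched as follows. The helper check_inputs is hypothetical and not part of this repository. It only relies on the file naming shown in the tree (NNN.png, NNN_keypoints.json, NNN_smplx.obj, NNN_00X.png).

```python
from pathlib import Path


def check_inputs(root="data/examples"):
    """Return a list of human-readable problems; an empty list means ready."""
    root = Path(root)
    problems = []
    images = sorted((root / "images").glob("*.png"))
    if not images:
        problems.append(f"no .png images found in {root / 'images'}")
    for image in images:
        name = image.stem  # e.g. "000"
        if not (root / "images" / f"{name}_keypoints.json").is_file():
            problems.append(f"{name}: missing OpenPose keypoints")
        if not (root / "smplx" / f"{name}_smplx.obj").is_file():
            problems.append(f"{name}: missing fitted SMPL-X mesh")
        if not any((root / "back_images").glob(f"{name}_*.png")):
            problems.append(f"{name}: missing back-view image")
    return problems


if __name__ == "__main__":
    for problem in check_inputs():
        print(problem)
```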
The following command will reconstruct textured meshes under data/examples/meshes:
python reconstruct.py --test-folder data/examples --config recon/config.yaml --resume checkpoints/recon_model.pth
The default --grid-size for marching cubes is set to 512. If your images contain noisy segmentation borders, you can increase --erode-iter to shrink your segmentation mask.
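If you prefer to drive the three stages from Python rather than from a shell script, the commands above can be chained as in the sketch below. This is an assumption-laden convenience wrapper, not a replacement for run.sh: run_pipeline is a hypothetical helper, the arguments are copied from the commands in this README, and the actual contents of run.sh may differ.

```python
import subprocess
import sys

# Commands copied from the steps above, in order:
# body fitting, back hallucination, mesh reconstruction.
STEPS = [
    ["fit.py", "--opt_orient", "--opt_betas"],
    ["hallucinate.py", "--num_validation_image", "8"],
    ["reconstruct.py", "--test-folder", "data/examples",
     "--config", "recon/config.yaml",
     "--resume", "checkpoints/recon_model.pth"],
]


def run_pipeline(dry_run=False):
    """Run each step in order and stop at the first failure.

    With dry_run=True the commands are only printed, not executed.
    """
    commands = [[sys.executable, *step] for step in STEPS]
    for command in commands:
        print("Running:", " ".join(command))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code.
            subprocess.run(command, check=True)
    return commands
```

Calling run_pipeline(dry_run=True) from the repository root prints the three commands without running them, which is a cheap way to confirm the arguments before a long job.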
We created an evaluation benchmark using the CustomHumans dataset. Please apply for the dataset directly, and you will find the necessary files in the download link.
Note that we trained our models with 526 human scans provided in the THuman2.0 dataset and tested on 60 scans in the CustomHumans dataset. We used the default hyperparameters and commands suggested in run.sh. The evaluation script can be found here and here. You will need to install two additional packages for evaluation:
pip install torchmetrics[image] mediapipe
Single-view human 3D reconstruction benchmark
Methods | P-to-S (cm) ↓ | S-to-P (cm) ↓ | NC ↑ | f-Score ↑ |
---|---|---|---|---|
PIFu [Saito2019] | 2.209 | 2.582 | 0.805 | 34.881 |
PIFuHD [Saito2020] | 2.107 | 2.228 | 0.804 | 39.076 |
PaMIR [Zheng2021] | 2.181 | 2.507 | 0.813 | 35.847 |
FOF [Feng2022] | 2.079 | 2.644 | 0.808 | 36.013 |
2K2K [Han2023] | 2.488 | 3.292 | 0.796 | 30.186 |
ICON* [Xiu2022] | 2.256 | 2.795 | 0.791 | 30.437 |
ECON* [Xiu2023] | 2.483 | 2.680 | 0.797 | 30.894 |
SiTH* (Ours) | 1.871 | 2.045 | 0.826 | 37.029 |
- * indicates methods trained on the same THuman2.0 dataset.
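As a rough guide to reading the distance and f-Score columns, the sketch below computes nearest-neighbor distances and an F-score between two small point sets. It is a simplified illustration and not the evaluation script of this benchmark: it uses brute-force point-to-point distances on sampled points instead of true point-to-surface distances, the function names are hypothetical, and the distance threshold is an arbitrary value chosen for the example.

```python
import math


def nearest_distances(source, target):
    """For each source point, distance to its nearest target point (brute force)."""
    return [min(math.dist(p, q) for q in target) for p in source]


def f_score(pred, gt, threshold):
    """Harmonic mean of precision and recall at a distance threshold, in percent."""
    pred_to_gt = nearest_distances(pred, gt)
    gt_to_pred = nearest_distances(gt, pred)
    # Precision: fraction of predicted points close to the ground truth.
    precision = sum(d < threshold for d in pred_to_gt) / len(pred_to_gt)
    # Recall: fraction of ground-truth points close to the prediction.
    recall = sum(d < threshold for d in gt_to_pred) / len(gt_to_pred)
    if precision + recall == 0:
        return 0.0
    return 100.0 * 2 * precision * recall / (precision + recall)


# Toy example with two points per set and a hypothetical threshold of 0.5.
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
score = f_score(pred, gt, threshold=0.5)
```

Averaging the values returned by nearest_distances in each direction gives the one-sided distances that P-to-S and S-to-P approximate, so lower distances and higher F-scores both indicate a closer match to the ground-truth scan.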
Back-view hallucination benchmark
Methods | SSIM ↑ | LPIPS ↓ | KID (×10⁻³) ↓ | Joints Err. (pixel) ↓ |
---|---|---|---|---|
Pix2PixHD [Wang2018] | 0.816 | 0.141 | 86.2 | 53.1 |
DreamPose [Karras2023] | 0.844 | 0.132 | 86.7 | 76.7 |
Zero-1-to-3 [Liu2023] | 0.862 | 0.119 | 30.0 | 73.4 |
ControlNet [Zhang2023] | 0.851 | 0.202 | 39.0 | 35.7 |
SiTH (Ours) | 0.950 | 0.063 | 3.2 | 21.5 |
We used code from other great research work, including occupancy_networks, pifuhd, kaolin-wisp, mmpose, smplx, SMPLer-X, and editable-humans.
We created all the videos using the powerful aitviewer.
We sincerely thank the authors for their awesome work!
For any questions or problems, please open an issue or contact Hsuan-I Ho.