This is the official repository for the paper "Chat-3D: Data-efficiently Tuning Large Language Model for Universal Dialogue of 3D Scenes". [paper], [project page]
Done
- High-quality object-centric instruction dataset
- Guide for data preparation and model training/inference locally
Doing
- Online demo
- Various settings of training data and model architecture
- Adapt to 3D QA, Caption, Grounding tasks
- Add a segmentation head to complete the pipeline
- ...
- Prepare the environment:

  ```shell
  pip install -r requirements.txt
  ```
- Download the LLaMA model:
  - Currently, we choose Vicuna-7B as the LLM in our model, which is fine-tuned from LLaMA-7B.
  - Download LLaMA-7B from Hugging Face.
  - Download vicuna-7b-delta-v0 and process it (`apply_delta.py` is from Hugging Face):

    ```shell
    python3 model/apply_delta.py \
        --base /path/to/model_weights/llama-7b \
        --target vicuna-7b-v0 \
        --delta lmsys/vicuna-7b-delta-v0
    ```
  - Change the `llama_model_path` in `config.py` to the location of `vicuna-7b-v0`.
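After conversion, the config entry might look like the following. This is a hypothetical excerpt: the actual layout of `config.py` may differ, and the path is a placeholder.

```python
# Hypothetical excerpt of config.py; the real file may structure this differently.
# Point this at the directory produced by apply_delta.py above.
llama_model_path = "/path/to/vicuna-7b-v0"
```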
- Annotations and extracted features:
  - For simplicity, we have made all the annotations available in the annotations dir and the extracted features available on Google Drive. Here are some brief explanations of the preparation:
- Object-centric dataset
  - We release the object-centric dataset in the annotations dir, including train/val sets for conversation/detail instructions.
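As a rough illustration of how such object-centric instruction data can be consumed, here is a minimal sketch. The record schema below is invented for illustration and does not necessarily match the fields of the released annotation files.

```python
import json

# Hypothetical instruction record; the field names here are assumptions for
# illustration, not the actual schema of the released annotations.
record = {
    "scene_id": "scene0000_00",
    "object_id": 3,
    "conversation": [
        {"question": "What is next to this object?", "answer": "A wooden table."},
    ],
}

# The annotations are JSON files, so loading reduces to a simple round-trip.
loaded = json.loads(json.dumps(record))
print(loaded["scene_id"], len(loaded["conversation"]))  # scene0000_00 1
```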
- Training (Instruction Tuning)

Simply run the following scripts to tune sequentially from Stage 1 to Stage 3.
```shell
# Stage 1
./scripts/run.sh --stage 1 \
    --lr 5e-3

# Stage 2
./scripts/run.sh --stage 2 \
    --pretrained_path /path/to/pretrained_stage1.pth \
    --lr 5e-3

# Stage 3
./scripts/run.sh --stage 3 \
    --pretrained_path /path/to/pretrained_stage2.pth \
    --lr 5e-5
```
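The schedule encoded in those commands (stage, resume checkpoint, learning rate) can be summarized as a small table in code; a sketch, where the checkpoint paths are placeholders rather than actual output locations of `run.sh`:

```python
# Stage schedule mirrored from the shell commands above; each stage resumes
# from the previous stage's checkpoint (stage 1 starts from scratch).
# Checkpoint paths are placeholders.
stages = [
    {"stage": 1, "lr": "5e-3", "pretrained_path": None},
    {"stage": 2, "lr": "5e-3", "pretrained_path": "/path/to/pretrained_stage1.pth"},
    {"stage": 3, "lr": "5e-5", "pretrained_path": "/path/to/pretrained_stage2.pth"},
]

for cfg in stages:
    flags = f"--stage {cfg['stage']}"
    if cfg["pretrained_path"]:
        flags += f" --pretrained_path {cfg['pretrained_path']}"
    flags += f" --lr {cfg['lr']}"
    print("./scripts/run.sh " + flags)
```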
We train the model on 4 A40 GPUs with 48GB VRAM each. Here is some information about GPU usage and training time. (Note that we currently only use ScanRefer data for training; adding more training data in the future would increase the training time.)

| Stage | Batch Size | GPU Num | VRAM Usage per GPU | Training Time |
| ----- | ---------- | ------- | ------------------ | ------------- |
| 1     | 12         | 4       | ~ 25 GB            | ~ 5 min       |
| 2     | 12         | 4       | ~ 45 GB            | ~ 1 hour      |
| 3     | 1          | 4       | ~ 25 GB            | ~ 1.5 hours   |

- Inference
Use one GPU for inference (set `NUM_GPUS=1` in `run.sh`).

```shell
./scripts/run.sh --stage 3 \
    --pretrained_path /path/to/pretrained_stage3.pth \
    --evaluate
```
If you find this project useful in your research, please consider citing:
```bibtex
@misc{wang2023chat3d,
    title={Chat-3D: Data-efficiently Tuning Large Language Model for Universal Dialogue of 3D Scenes},
    author={Zehan Wang and Haifeng Huang and Yang Zhao and Ziang Zhang and Zhou Zhao},
    year={2023},
    eprint={2308.08769},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}
```
Stay tuned for our project. 🔥
If you have any questions or suggestions, feel free to drop us an email ([email protected], [email protected]) or open an issue.
Thanks to the open-source contributions of the following projects:
VideoChat, LLaMA, ULIP, ScanRefer, ReferIt3D, vil3dref, ScanNet