Chat-3D

This is the repository for the paper "Chat-3D: Data-efficiently Tuning Large Language Model for Universal Dialogue of 3D Scenes". [paper], [project page]

To Do List

Done

  • High-quality object-centric instruction dataset
  • Guide for data preparation and model training/inference locally

Doing

  • Online demo
  • Various settings of training data and model architecture
  • Adapt to 3D QA, Caption, Grounding tasks
  • Add a segmentation head to complete the pipeline
  • ...

🔨 Preparation

  • Prepare the environment:

    pip install -r requirements.txt
  • Download LLaMA model:

    • We currently use Vicuna-7B, which is fine-tuned from LLaMA-7B, as the LLM in our model.
    • Download LLaMA-7B from Hugging Face.
    • Download vicuna-7b-delta-v0 and apply it to the base weights (apply_delta.py is from huggingface):
    python3 model/apply_delta.py \
            --base /path/to/model_weights/llama-7b \
            --target vicuna-7b-v0 \
            --delta lmsys/vicuna-7b-delta-v0
    • Change the llama_model_path in config.py to the location of vicuna-7b-v0.
  • Annotations and extracted features:

    For simplicity, we provide all annotations in the annotations directory and the extracted features on Google Drive. Here is a brief explanation of how they were prepared:

    • Based on the annotations from ScanNet, we extract attributes (location, size, color) of objects from different scenes.

    • We use ULIP-2 to extract features of 3D objects.

    • The captions utilized in stage 1 and stage 2 are obtained from the annotations of ScanRefer.

  • Object-centric dataset

    • We release the object-centric dataset in the annotations directory, including train/val sets for conversation/detail instructions.
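
The README does not say how the object attributes (location, size, color) are computed from the ScanNet annotations. As a rough illustration only, the sketch below assumes each object is given as a set of points with per-point RGB colors, and takes the center and extent of the axis-aligned bounding box as location and size and the mean RGB as color. The function name `object_attributes` and the toy data are hypothetical and not part of this repository.

```python
import numpy as np


def object_attributes(points, colors):
    """Summarise one object (hypothetical helper, not from the repo).

    points: (N, 3) array-like of xyz coordinates.
    colors: (N, 3) array-like of RGB values in [0, 255].
    """
    points = np.asarray(points, dtype=float)
    colors = np.asarray(colors, dtype=float)

    # Axis-aligned bounding box of the object's points.
    lo = points.min(axis=0)
    hi = points.max(axis=0)

    return {
        "location": (lo + hi) / 2.0,   # assumed: bounding-box center
        "size": hi - lo,               # assumed: bounding-box extent
        "color": colors.mean(axis=0),  # assumed: mean RGB over the points
    }


# Toy object with four points (made-up values for illustration).
points = [[0, 0, 0], [2, 0, 0], [2, 4, 0], [0, 4, 1]]
colors = [[255, 0, 0], [255, 0, 0], [0, 0, 255], [0, 0, 255]]
attrs = object_attributes(points, colors)
```

For the toy object, `attrs["location"]` is the box center `(1, 2, 0.5)` and `attrs["size"]` is `(2, 4, 1)`. The actual preprocessing may use a different definition (for example, the point centroid instead of the box center).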

🤖 Training and Inference

  • Training (Instruction Tuning)

    Simply run the following scripts to sequentially tune from Stage 1 to Stage 3.

    # Stage 1
    ./scripts/run.sh --stage 1 \
                     --lr 5e-3
    
    # Stage 2
    ./scripts/run.sh --stage 2 \
                     --pretrained_path /path/to/pretrained_stage1.pth \
                     --lr 5e-3
    
    # Stage 3
    ./scripts/run.sh --stage 3 \
                     --pretrained_path /path/to/pretrained_stage2.pth \
                     --lr 5e-5

    We train the model on 4 A40 GPUs with 48 GB VRAM. GPU usage and training time for each stage are listed below. (Note that we currently use only ScanRefer data for training; adding more training data in the future would increase the training time.)

    Stage   Batch Size   GPU Num   VRAM Usage per GPU   Training Time
    1       12           4         ~ 25 GB              ~ 5 min
    2       12           4         ~ 45 GB              ~ 1 hour
    3       1            4         ~ 25 GB              ~ 1.5 hours
  • Inference

    Use one GPU for inference (set NUM_GPUS=1 in run.sh).

    ./scripts/run.sh --stage 3 \
                     --pretrained_path /path/to/pretrained_stage3.pth \
                     --evaluate
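
The three training stages and the final evaluation run chain together through the checkpoint produced by the previous stage. A minimal sketch of that chaining is shown below. It only builds and prints the command lines and does not run them. The flags (`--stage`, `--lr`, `--pretrained_path`, `--evaluate`) and learning rates are taken from the commands above; the helper names `build_command` and `training_plan` and the checkpoint paths are placeholders for illustration.

```python
import shlex

RUN_SCRIPT = "./scripts/run.sh"

# Learning rates per stage, as listed in the commands above.
LEARNING_RATES = {1: "5e-3", 2: "5e-3", 3: "5e-5"}


def build_command(stage, lr=None, pretrained_path=None, evaluate=False):
    """Build the argument list for one run.sh invocation (hypothetical helper)."""
    cmd = [RUN_SCRIPT, "--stage", str(stage)]
    if pretrained_path is not None:
        cmd += ["--pretrained_path", pretrained_path]
    if lr is not None:
        cmd += ["--lr", lr]
    if evaluate:
        cmd.append("--evaluate")
    return cmd


def training_plan(checkpoints):
    """Return the commands for Stage 1 to Stage 3.

    checkpoints maps a finished stage number to its checkpoint path;
    each stage after the first loads the checkpoint of the stage before it.
    """
    plan = []
    for stage in (1, 2, 3):
        previous = checkpoints.get(stage - 1)  # None for Stage 1
        plan.append(build_command(stage, LEARNING_RATES[stage], previous))
    return plan


# Placeholder paths; replace with wherever your checkpoints are saved.
checkpoints = {
    1: "/path/to/pretrained_stage1.pth",
    2: "/path/to/pretrained_stage2.pth",
    3: "/path/to/pretrained_stage3.pth",
}

plan = training_plan(checkpoints)
eval_cmd = build_command(3, pretrained_path=checkpoints[3], evaluate=True)

for cmd in plan + [eval_cmd]:
    print(shlex.join(cmd))
```

To actually launch the runs, each list could be passed to `subprocess.run(cmd, check=True)` once the previous stage has written its checkpoint.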

📄 Citation

If you find this project useful in your research, please consider citing:

@misc{wang2023chat3d,
      title={Chat-3D: Data-efficiently Tuning Large Language Model for Universal Dialogue of 3D Scenes}, 
      author={Zehan Wang and Haifeng Huang and Yang Zhao and Ziang Zhang and Zhou Zhao},
      year={2023},
      eprint={2308.08769},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Stay tuned for our project. 🔥

If you have any questions or suggestions, feel free to drop us an email ([email protected], [email protected]) or open an issue.

😊 Acknowledgement

Thanks to the following open-source projects:

VideoChat, LLaMA, ULIP, ScanRefer, ReferIt3D, vil3dref, ScanNet
