
3D-VLA: A 3D Vision-Language-Action Generative World Model

ICML 2024

Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang, Xin Yan, Yilun Du, Yining Hong, Chuang Gan

Paper PDF | Project Page

Table of Contents
  1. Method
  2. Installation
  3. Embodied Diffusion Models
  4. Multimodal Large Language Model
  5. Citation
  6. Acknowledgement

News 📢

  • [2024/06] Training and inference code for goal generation diffusion models are released.
  • [2024/05] 3D-VLA is accepted to ICML 2024!
  • [2024/03] Paper is on arXiv.

Method

3D-VLA is a framework that connects vision-language-action (VLA) models to the 3D physical world. Unlike VLA models that operate on 2D inputs, 3D-VLA links 3D perception, reasoning, and action through a generative world model, similar to how humans plan by imagining future states. It is built on 3D-LLM and uses interaction tokens to engage with the environment. Embodied diffusion models are trained and aligned with the LLM to predict goal images and goal point clouds.


Installation

conda create -n 3dvla python=3.9
conda activate 3dvla
pip install -r requirements.txt

We will update the file structure and the installation process in the future.
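
Since the environment above is created with python=3.9, it can help to confirm that the activated interpreter matches before installing or training. The sketch below is not part of the repository; `is_python39` is a hypothetical helper name introduced only for illustration.

```shell
# Hypothetical helper (not part of the repository): succeeds only if the
# given version string reports a Python 3.9.x interpreter.
is_python39() {
  case "$1" in
    "Python 3.9."*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage after `conda activate 3dvla`:
#   is_python39 "$(python --version 2>&1)" || echo "unexpected Python version" >&2
```

If the check fails, the most likely cause is that the 3dvla environment is not the active one.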

Embodied Diffusion Models

Goal Image Generation

  • Train the goal image latent diffusion model with the following command:

    bash launcher/train_ldm.sh [NUM_GPUS] [NUM_NODES]

    To include depth information, add --include_depth to the command in launcher/train_ldm.sh.

  • Then generate the goal images:

    python inference_ldm_goal_image.py --ckpt_folder lavis/output/LDM/pix2pix/runs (--include_depth)

    The results will be saved in the lavis/output/LDM/pix2pix/results folder.
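
The optional depth flag in the inference command above can be handled with a small wrapper. The following is a minimal sketch, not part of the repository: `ldm_inference_cmd` is a hypothetical helper that only prints the command, so it can be inspected before running. It assumes the flag should be passed at inference when depth was also enabled during training, which the text above suggests but does not state outright.

```shell
# Hypothetical helper (not part of the repository): prints the goal image
# inference command, appending --include_depth only when requested.
#   $1: checkpoint folder (defaults to the path used in this README)
#   $2: 1 to include depth, anything else to omit it
ldm_inference_cmd() {
  local ckpt_folder="${1:-lavis/output/LDM/pix2pix/runs}"
  local include_depth="${2:-0}"
  local cmd="python inference_ldm_goal_image.py --ckpt_folder ${ckpt_folder}"
  if [ "$include_depth" = "1" ]; then
    # Assumption: depth at inference mirrors the training configuration.
    cmd="${cmd} --include_depth"
  fi
  echo "$cmd"
}
```

Calling `ldm_inference_cmd` with no arguments prints the RGB-only command. Calling `ldm_inference_cmd lavis/output/LDM/pix2pix/runs 1` prints the variant with depth.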

Goal Point Cloud Generation

  • Train the goal point cloud diffusion model (finetuning the pretrained Point-E model):

    bash launcher/train_pe.sh [NUM_GPUS] [NUM_NODES]

    FSDP training for goal point cloud generation will be supported soon.

  • Run inference for the goal point cloud with the following command:

    python inference_pe_goal_pcd.py

    If you want to use multiple GPUs, use torchrun --nproc_per_node=[NUM_GPUS] --master_port=[PORT] inference_pe_goal_pcd.py instead.
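
The choice between the single-GPU and multi-GPU commands above can be sketched as a small shell function. This is an illustration and not part of the repository: `pcd_inference_cmd` is a hypothetical name, and the default port 29500 is an assumption that should be changed if that port is already in use.

```shell
# Hypothetical helper (not part of the repository): prints the goal point
# cloud inference command for a given GPU count.
#   $1: number of GPUs (default 1)
#   $2: master port for torchrun (assumed default 29500)
pcd_inference_cmd() {
  local num_gpus="${1:-1}"
  local port="${2:-29500}"
  if [ "$num_gpus" -gt 1 ]; then
    # Multiple GPUs: launch one process per GPU through torchrun.
    echo "torchrun --nproc_per_node=${num_gpus} --master_port=${port} inference_pe_goal_pcd.py"
  else
    # Single GPU: plain Python invocation.
    echo "python inference_pe_goal_pcd.py"
  fi
}
```

Because the function only prints the command, the output can be reviewed first and then executed, for example with `eval "$(pcd_inference_cmd 4 29501)"`.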

Multimodal Large Language Model

Pretrain 3D-VLA

  • Train our 3D-VLA model:

    bash launcher/train_llm.sh [NUM_GPUS] [NUM_NODES]

Citation

@article{zhen20243dvla,
  author = {Zhen, Haoyu and Qiu, Xiaowen and Chen, Peihao and Yang, Jincheng and Yan, Xin and Du, Yilun and Hong, Yining and Gan, Chuang},
  title = {3D-VLA: 3D Vision-Language-Action Generative World Model},
  journal = {arXiv preprint arXiv:2403.09631},
  year = {2024},
}

Acknowledgement

Here we would like to thank the following resources for their great work:
