3D-VLA: A 3D Vision-Language-Action Generative World Model

ICML 2024

Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang, Xin Yan, Yilun Du, Yining Hong, Chuang Gan

Table of Contents

Method
Installation
Embodied Diffusion Models
- Goal Image Generation
- Goal Point Cloud Generation
Multimodal Large Language Model
- Pretrain 3D-VLA
Citation
Acknowledgement

News 📢

[2024/06] Training and inference code for goal generation diffusion models are released.
[2024/05] 3D-VLA is accepted to ICML 2024!
[2024/03] Paper is on arXiv.

Method

3D-VLA is a framework that connects vision-language-action (VLA) models to the 3D physical world. Unlike traditional 2D models, 3D-VLA links 3D perception, reasoning, and action through a generative world model, in a way loosely analogous to how humans imagine an outcome before acting. It is built on top of 3D-LLM and uses interaction tokens to engage with the environment. Embodied diffusion models are trained and aligned with the LLM to predict goal images and point clouds.

Installation

conda create -n 3dvla python=3.9
conda activate 3dvla
pip install -r requirements.txt

We will update the file structure and the installation process in the future.

Embodied Diffusion Models

Goal Image Generation

Train the goal image latent diffusion model with the following command:
```
bash launcher/train_ldm.sh [NUM_GPUS] [NUM_NODES]
```
To include depth information, add --include_depth to the command in the train_ldm.sh file.
Then generate the goal images:
```
python inference_ldm_goal_image.py --ckpt_folder lavis/output/LDM/pix2pix/runs (--include_depth)
```
The results will be saved in the lavis/output/LDM/pix2pix/results folder.
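
As a minimal sketch, the optional depth flag can be wrapped in a small shell helper. The function name `build_goal_image_cmd` and its arguments are hypothetical and not part of this repository; only the script name, `--ckpt_folder`, and `--include_depth` come from the command above. It assumes that `--include_depth` should be passed at inference only when the checkpoint was trained with it.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): prints the goal image
# inference command from this README, optionally with --include_depth.
build_goal_image_cmd() {
    local ckpt_folder="$1"          # e.g. lavis/output/LDM/pix2pix/runs
    local include_depth="${2:-0}"   # pass 1 if the model was trained with depth
    local cmd="python inference_ldm_goal_image.py --ckpt_folder ${ckpt_folder}"
    if [ "${include_depth}" = "1" ]; then
        cmd="${cmd} --include_depth"
    fi
    echo "${cmd}"
}

# Print the command for a depth-enabled checkpoint so it can be reviewed first.
build_goal_image_cmd lavis/output/LDM/pix2pix/runs 1
```

Printing the command instead of executing it makes it easy to check the flags before launching a long inference run.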

Goal Point Cloud Generation

Train the goal point cloud diffusion model (finetuning the pretrained Point-E model):
```
bash launcher/train_pe.sh [NUM_GPUS] [NUM_NODES]
```
We will soon support FSDP training for goal point cloud generation.
Run inference for the goal point cloud with the following command:
```
python inference_pe_goal_pcd.py
```
To use multiple GPUs, run torchrun --nproc_per_node=[NUM_GPUS] --master_port=[PORT] inference_pe_goal_pcd.py instead.
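
The choice between the single-GPU and multi-GPU commands above can be sketched as a small shell helper. The function name `build_pcd_infer_cmd` is hypothetical and not part of this repository, and the default port 29500 is only an example value; the script name and the `torchrun` flags are taken from the commands above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): prints the goal point
# cloud inference command for a given GPU count.
build_pcd_infer_cmd() {
    local num_gpus="${1:-1}"     # number of GPUs on this node
    local port="${2:-29500}"     # example master port, change if it is in use
    if [ "${num_gpus}" -gt 1 ]; then
        # Multi-GPU: launch one process per GPU with torchrun.
        echo "torchrun --nproc_per_node=${num_gpus} --master_port=${port} inference_pe_goal_pcd.py"
    else
        # Single GPU: plain Python invocation.
        echo "python inference_pe_goal_pcd.py"
    fi
}

# Print the command for a 4-GPU run.
build_pcd_infer_cmd 4
```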

Multimodal Large Language Model

Pretrain 3D-VLA

Train our 3D-VLA model:

bash launcher/train_llm.sh [NUM_GPUS] [NUM_NODES]

Citation

@article{zhen20243dvla,
  author = {Zhen, Haoyu and Qiu, Xiaowen and Chen, Peihao and Yang, Jincheng and Yan, Xin and Du, Yilun and Hong, Yining and Gan, Chuang},
  title = {3D-VLA: 3D Vision-Language-Action Generative World Model},
  journal = {arXiv preprint arXiv:2403.09631},
  year = {2024},
}

Acknowledgement

We would like to thank the following projects for their great work:

SAM, ConceptFusion and 3D-CLR for Data Processing.
Diffusers, InstructPix2Pix, StableDiffusion and Point-E for the Diffusion Model.
LAVIS and 3D-LLM for the Codebase and Architecture.
OpenX for Dataset.
RLBench and Hiveformer for Evaluation.
