
ManiWAV: Learning Robot Manipulation from In-the-Wild Audio-Visual Data

[Project page] [Paper] [Dataset]

Zeyi Liu1, Cheng Chi1,2, Eric Cousineau3, Naveen Kuppuswamy3, Benjamin Burchfiel3, Shuran Song1,2

1Stanford University, 2Columbia University, 3Toyota Research Institute

Preparation

The hardware and software are built on top of the Universal Manipulation Interface (UMI). Please review the UMI paper and the UMI GitHub repository beforehand for context.

Software Installation

Please refer to the UMI repository for installing docker and system-level dependencies.

We provide a new conda_environment.yaml with additional dependencies. To create a conda environment named maniwav:

$ cd maniwav
$ mamba env create -f conda_environment.yaml
$ conda activate maniwav
(maniwav)$

If you see a PortAudio not found error when installing the sounddevice package, run:

sudo apt-get install libportaudio2

Hardware Installation

To add a contact microphone to the UMI gripper:

For the rest of the device, please see the UMI hardware guide for reference.

Dataset

We release the in-the-wild bagel flipping dataset and a policy checkpoint at https://real.stanford.edu/maniwav/, which you can use directly for training and for evaluation on your own robot. We encourage you to organize all of your datasets under the data/ folder in the repository root.

Download the zarr dataset:

wget https://real.stanford.edu/maniwav/data/bagel_in_wild/replay_buffer.zarr.zip

Download the original demo videos with SLAM results (this step can be skipped if you don't need the raw mp4 videos):

wget --recursive --no-parent --no-host-directories --cut-dirs=2 --relative https://real.stanford.edu/maniwav/data/bagel_in_wild/demos/

If you have your own data (i.e., mp4 videos with sound, plus actions extracted from SLAM following the same procedure as UMI), you can run the following script to create a replay buffer with audio data. Check the demos folder for the expected file structure.

python scripts_slam_pipeline/07_generate_replay_buffer.py <your-dataset-folder-path> -o <your-dataset-folder-path>/replay_buffer.zarr.zip -ms
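To sanity-check the generated buffer, note that a .zarr.zip is an ordinary zip archive whose entry paths mirror the zarr hierarchy, so its top-level groups can be listed with the standard library alone. This is a minimal sketch; the function name and the example group names in the comment are illustrative assumptions, not the buffer's documented layout.

```python
import zipfile

def replay_buffer_toplevel(path):
    """Return the sorted top-level group/array names inside a zipped zarr store."""
    with zipfile.ZipFile(path) as zf:
        # Each entry path mirrors the zarr hierarchy, e.g. "data/camera0_rgb/0.0";
        # the first path component names a top-level group or array.
        return sorted({name.split("/")[0] for name in zf.namelist() if "/" in name})
```

For example, `replay_buffer_toplevel("data/replay_buffer.zarr.zip")` returns the group names without decompressing any array data.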

Download the ESC-50 dataset for noise augmentation from https://github.com/karolpiczak/ESC-50#download and place the folder under data/. For robot noises, we provide an example under data/robot-noise-calib for a UR5 robot, but you are encouraged to record the noises for your specific robot.
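As a rough illustration of how background-noise augmentation of this kind typically works, the sketch below mixes a noise clip into a clean audio clip at a target signal-to-noise ratio. The function name and SNR parameterization are assumptions for illustration; the exact augmentation used in this codebase may differ.

```python
import numpy as np

def mix_noise(clean, noise, snr_db, rng=None):
    """Mix `noise` into `clean` at a target SNR in dB (illustrative sketch)."""
    if rng is None:
        rng = np.random.default_rng()
    # Tile the noise if it is shorter than the clean clip, then take a random crop.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    start = rng.integers(0, len(noise) - len(clean) + 1)
    noise = noise[start:start + len(clean)]
    # Scale noise so that 10*log10(P_clean / P_noise_scaled) == snr_db.
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12  # avoid division by zero
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise
```

Sampling snr_db from a range during training (rather than fixing it) is the usual way such augmentation is randomized.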

Training

Training was tested on an NVIDIA GeForce RTX 3090 (24 GB).

Example of multi-GPU training with accelerate:

CUDA_VISIBLE_DEVICES=<GPU-device-ids> HYDRA_FULL_ERROR=1 accelerate launch --multi_gpu --num_processes <ngpus> --main_process_port 29501 train.py --config-name train_diffusion_unet_maniwav_workspace task.dataset_path=data/replay_buffer.zarr.zip training.num_epochs=60 dataloader.batch_size=64 val_dataloader.batch_size=64

Single-GPU training example:

python train.py --config-name train_diffusion_unet_maniwav_workspace task.dataset_path=data/replay_buffer.zarr.zip training.num_epochs=60 dataloader.batch_size=64 val_dataloader.batch_size=64 training.device=<GPU-device-id>

Real World Evaluation

Congratulations🎉! At this point, you have a robot manipulation policy ready to be deployed in the real world. Evaluation was tested on Ubuntu 22.04. Please review the UMI documentation on real-world evaluation first.

Example of running the evaluation script:

python scripts_real/eval_real_umi.py --audio_device_id 0 --input checkpoints/bagel_in_wild/in-the-wild-latest.ckpt --output outputs/ours_itw --camera_reorder 0 -md 120 -si 4

To check audio device ids, run:

python -m sounddevice

Refer to the UMI repository for setting up the robot and camera. For the microphone, place the contact microphone inside the gripper holder, wrap it with grip tape, and connect it with a cable to the external mic port of the GoPro Media Mod. The game capture card streams both vision and audio to the desktop, and we provide code to read and record the audio data automatically; check the files under umi/real_world.
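To give a sense of what the audio-recording side has to do, here is a minimal, framework-agnostic sketch of the buffering logic: fixed-size chunks arrive from a stream callback (in practice, e.g., a sounddevice InputStream) and only the most recent window is kept for the policy. The class and method names are hypothetical, not the actual umi/real_world API.

```python
import numpy as np
from collections import deque

class AudioBuffer:
    """Keep the most recent window of audio samples from a chunked stream."""

    def __init__(self, sample_rate=48000, window_s=2.0):
        self.maxlen = int(sample_rate * window_s)  # samples kept in the window
        self.chunks = deque()
        self.n = 0  # total samples currently buffered

    def callback(self, indata):
        """Append one chunk; in a real stream this is the device callback."""
        chunk = np.asarray(indata, dtype=np.float32).ravel()
        self.chunks.append(chunk)
        self.n += len(chunk)
        # Evict whole chunks that are no longer needed for the window.
        while self.n - len(self.chunks[0]) >= self.maxlen:
            self.n -= len(self.chunks.popleft())

    def latest(self):
        """Return the most recent window, left-padded with zeros if short."""
        if not self.chunks:
            return np.zeros(self.maxlen, dtype=np.float32)
        data = np.concatenate(list(self.chunks))
        tail = data[-self.maxlen:]
        return np.pad(tail, (self.maxlen - len(tail), 0))
```

The eviction loop keeps memory bounded regardless of how long the stream runs, which matters during long evaluation episodes.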

NOTE: Remember to calibrate the audio latency following Appendix A.1 of the paper and update the corresponding value in the code.
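One common way to perform this kind of calibration is to play a known reference sound (e.g., a clap) and cross-correlate it against the recorded stream: the peak lag gives the latency. The sketch below shows that idea; it is an assumption for illustration, not necessarily the exact Appendix A.1 procedure.

```python
import numpy as np

def estimate_latency(reference, recorded, sample_rate):
    """Estimate how many seconds after stream start `reference` occurs in `recorded`.

    Cross-correlates the two signals and converts the peak lag to seconds.
    """
    corr = np.correlate(recorded, reference, mode="full")
    # With mode="full", index i corresponds to a lag of i - (len(reference) - 1).
    lag = int(np.argmax(corr)) - (len(reference) - 1)
    return lag / sample_rate
```

Subtracting this value from the host-side audio timestamps aligns the audio with the camera observations.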

Citation

If you find this codebase useful, please consider citing our paper:

@article{liu2024maniwav,
    title={ManiWAV: Learning Robot Manipulation from In-the-Wild Audio-Visual Data},
    author={Liu, Zeyi and Chi, Cheng and Cousineau, Eric and Kuppuswamy, Naveen and Burchfiel, Benjamin and Song, Shuran},
    journal={arXiv preprint arXiv:2406.19464},
    year={2024}
}

Contact

If you have questions about the codebase, don't hesitate to reach out to Zeyi. If you open a GitHub issue, please also email Zeyi a link to the issue.

License

This repository is released under the MIT license. See LICENSE for additional details.

Acknowledgements

  • Cheng Chi and Huy Ha for early discussions on the hardware design and codebase.
  • Toyota Research Institute (TRI) for generously providing the UR5 robot and advice on using UMI.
