Giter Club home page Giter Club logo

instruct2act's Introduction

Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model

Foundation models have made significant strides in various applications, including text-to-image generation, panoptic segmentation, and natural language processing. This paper presents Instruct2Act, a framework that utilizes Large Language Models to map multi-modal instructions to sequential actions for robotic manipulation tasks. Specifically, Instruct2Act employs the LLM model to generate Python programs that constitute a comprehensive perception, planning, and action loop for robotic tasks. In the perception section, pre-defined APIs are used to access multiple foundation models where the Segment Anything Model (SAM) accurately locates candidate objects, and CLIP classifies them. In this way, the framework leverages the expertise of foundation models and robotic abilities to convert complex high-level instructions into precise policy codes. Our approach is adjustable and flexible in accommodating various instruction modalities and input types and catering to specific task demands. We validated the practicality and efficiency of our approach by assessing it on robotic tasks in different scenarios within tabletop manipulation domains. Furthermore, our zero-shot method outperformed many state-of-the-art learning-based policies in several tasks.

framework

Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model Siyuan Huang, Zhengkai Jiang, Hao Dong, Yu Qiao, Peng Gao, Hongsheng Li

Supported Modules

Currently, we support the following modules:

modules

Correspondingly, please prepare the SAM and CLIP model ckpts in advance. You can download the ckpts from SAM and OpenCLIP. Then set the path in the file 'engine_robotic.py'.

You can also add your personlized modules in 'engine_robotic.py', and add the API definition in the prompt files.

How to run

  1. Install the required packages with the provided environment.yaml

    • If you meet any install issue, you can take a look at Issue. Thanks for @euminds.
  2. Install the VIMABench with VIMABench.

  3. Change the OpenAI API-key in visual_programming_prompt/robotic_exec_generation.py

  4. run the robotic_anything_gpt_online.py.

Prompts Setting

In Instruct2Act, we implement two types of prompts, i.e., task-specific and task-agnostic prompts. The task-specific prompts are designed for specific tasks which is in the VISPROG style, and the task-agnostic prompts are designed for general purpose, and it is in ViperGPT plus VISPROG style. We provide more details in the our paper. And you can change the setting in the file visual_programming_prompt/robotic_exec_generation.py. For very specific tasks like robotic manipulations where you know the flow clearly, we suggest to use the task-specific prompts. For general purpose, we suggest to use the task-agnostic prompts. These two prompts are stored in visual_programm_prompt.py and full_prompt.ini respectively.

Besides the language prompts, we also provide the pointing-language enhanced prompts where cursor click will be used to select the target objects. You can see the details with funcation SAM() in engine_robotic.py.

We provide two code generation mode for robotic manipulation tasks, i.e., offline and online mode. The codes with offline mode are generated in advance and summarized with expert knowledge, and this type is used for the demo and quick-trail usage. The online mode are generated on the fly, and this type is used for the general purpose.

Evaluation Tasks

We select six representative meta tasks from VIMABench (17 tasks in total) to evaluate the proposed methods in the tabletop manipulation domain, as shown in below. To run the evaluation, please follow the instructions in the VIMABench.

Task Instruction Visualization
Visual Manipulation Put the polka dot block into the green container. task01
Scene Understanding Put the blue paisley object in another given scene image into the green object. task01
Rotation Rotate the letter-M 30 degrees task01
Rearrange Rearrange to the target scene. task01
Rearrange then restore Rearrange objects to target scene and then restore. task01
Pick in order then restore Put the cyan block into the yellow square then into the white-black square. Finally restore it into its original container. task01

Notes

  1. To speed up the SAM inference progress, we add cuda device option in function build_sam(), you should modify it accordingly in the source code and then recompile the package.

  2. During evaluation, we set the "hide_arm=True" and close the debug_window. If you want to visualize the arm movement, please set them correctly.

  3. The orignal movement in VIMABench is quite quick, if you want to slow down the movement, please add some lines like sleep() in VIMABench.

  4. When use ChatGPT for generation, you need to mange some network stuff. Also, we found that when the network situation is not ideal, sometimes the generated codes are in bad quality (incomplete or too short).

Acknowledgement

We would like to thank the authors of the following great projects, this project is built upon these great open-sourced projects.

We are also inspired by the following projects:

instruct2act's People

Contributors

siyuanhuang95 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.