📱🖥️ GUICourse: From General Vision Language Models to Versatile GUI Agents

Datasets, code, and models for the paper "GUICourse: From General Vision Language Models to Versatile GUI Agents".

Release progress:

  • Datasets
    • GUIEnv
      • GUIEnv-global (pre-training data)
      • GUIEnv-local (SFT data)
    • GUIAct
      • GUIAct (web-single)
      • GUIAct (web-multi)
      • GUIAct (smartphone)
    • GUIChat
  • Code
    • Inference
    • Evaluation
  • Models

Updates:

  • 2024/6/7: Release the datasets, loading code, and evaluation code.

Data Summary

GUICourse is a suite of datasets for training vision-based GUI agents from general VLMs by improving the VLMs' fundamental abilities and GUI knowledge. It is composed of three datasets:

(1) GUIEnv, a large-scale dataset for improving VLMs' OCR and grounding abilities, including 10M website page-annotation pairs as pre-training data and 0.7M region-text QA pairs as SFT data.

(2) GUIAct, a GUI navigation dataset in website and Android scenarios for enhancing VLMs' knowledge of GUI systems, including 67k single-step and 15k multi-step action instructions.

(3) GUIChat, a conversational dataset for improving the interaction skills of GUI agents, including 44k single-turn QA pairs and 6k multi-turn dialogues with text-rich images and bounding boxes.

Dataset Access

Download

The GUIEnv-local, GUIAct, and GUIChat data are hosted on Hugging Face.

Read

Data Format. We store our datasets in JSON and Parquet formats.

                                                                                             elements
uid_episode_10270193012375700035_step_00  /9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGBgcGBQgHBw...  [{'id': 0, 'position': {'height': 39, 'width':...
uid_episode_10270193012375700035_step_01  /9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGBgcGBQgHBw...  [{'id': 0, 'position': {'height': 42, 'width':...
uid_episode_10270193012375700035_step_02  /9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGBgcGBQgHBw...  [{'id': 0, 'position': {'height': 44, 'width':...
...                                       ...                                                ...
uid_episode_12220552989760792145_step_01  /9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGBgcGBQgHBw...  [{'id': 0, 'position': {'height': 46, 'width':...
uid_episode_12220552989760792145_step_02  /9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGBgcGBQgHBw...  [{'id': 0, 'position': {'height': 56, 'width':...

You can read our data with:

python data_load.py \
  --data_path "./data/xxx_data.json" \
  --img_path "./data/xxx_images.parquet" \
  --dataset "guixxx"

  • data_path: the path to your JSON data, such as ocr_grounding_test_data.json.
  • img_path: the path to the image file, such as ocr_grounding_test_images.parquet. Note that reading large Parquet files requires a suitable version of pyarrow (e.g., pyarrow==13.0.0).
  • dataset: the name of the dataset, such as guienv.
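
As a rough illustration of the image field shown in the preview above, the sketch below decodes one base64-encoded image string and checks that it is a JPEG. It assumes the image column holds base64 text, which the `/9j/` prefix suggests (that prefix is the base64 form of the JPEG start-of-image bytes). The helper names `decode_image` and `is_jpeg` are hypothetical and are not part of data_load.py.

```python
import base64

# A JPEG file begins with the start-of-image marker FF D8, followed by FF.
JPEG_MAGIC = b"\xff\xd8\xff"


def decode_image(b64_text: str) -> bytes:
    """Decode a base64 image string into raw bytes (hypothetical helper)."""
    return base64.b64decode(b64_text)


def is_jpeg(raw: bytes) -> bool:
    """Return True if the bytes start with the JPEG magic number."""
    return raw.startswith(JPEG_MAGIC)


# First 16 characters of an image string from the preview above.
# A real row is far longer; this prefix only covers the file header.
sample_prefix = "/9j/4AAQSkZJRgAB"
header = decode_image(sample_prefix)

print(is_jpeg(header))  # whether the header looks like a JPEG
print(header[6:10])     # the format identifier inside the header
```

Once a full string is decoded, the resulting bytes could be wrapped in `io.BytesIO` and opened with an image library of your choice; which library the repository itself uses is not stated here.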

Visualization. You can visualize our data using the functions actions_visual and elements_visual.

Evaluation

cd ./evaluation

python evaluation.py \
  --file_name="your_test_file_name" \
  --task="xxx"

  • file_name: the name of your prediction file (without the .json suffix).
  • task: the name of the task, one of guienv, guiact_web_single, guiact_web_multi, and guiact_smartphone.
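
To run the evaluation over several tasks, the command above can be assembled programmatically. The following is a minimal sketch, assuming only the two flags and four task names listed above; the helper `build_eval_command` is hypothetical and does not exist in the repository.

```python
from typing import List

# Task names taken from the list above.
KNOWN_TASKS = (
    "guienv",
    "guiact_web_single",
    "guiact_web_multi",
    "guiact_smartphone",
)


def build_eval_command(file_name: str, task: str) -> List[str]:
    """Build the argument list for evaluation.py (hypothetical helper)."""
    if task not in KNOWN_TASKS:
        raise ValueError(f"unknown task {task!r}; expected one of {KNOWN_TASKS}")
    # file_name is expected without the .json suffix, so drop it if present.
    if file_name.endswith(".json"):
        file_name = file_name[: -len(".json")]
    return ["python", "evaluation.py", f"--file_name={file_name}", f"--task={task}"]


# Example: the prediction file name here is a placeholder.
cmd = build_eval_command("your_test_file_name.json", "guienv")
print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, cwd="./evaluation", check=True)`, which mirrors the `cd ./evaluation` step above.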

Examples. We provide some sample results in the ./results directory for quick evaluation.

Performance

Cases

There are some visual cases of our GUI agents.

Demo: GUIAgent in Android Simulated Environment

To evaluate the robustness of our GUI agents, we run interactive tests in a simulated smartphone environment provided by Android Studio.

Contact

Wentong Chen, Renmin University of China

Junbo Cui, Jinyi Hu, Tsinghua University

Licensing Information

The datasets are licensed under a Creative Commons Attribution 4.0 International License.

The code in this repository is licensed under the MIT License.

Disclaimer

These datasets were collected and released solely for research purposes, with the goal of training versatile GUI agents. The authors are strongly against any potentially harmful use of the data or technology by any party.

Citation Information

If you find this dataset useful, please consider citing our paper:

@misc{,
  title={GUICourse: From General Vision Language Models to Versatile GUI Agents},
  author={Wentong Chen and Junbo Cui and Jinyi Hu and Yujia Qin and Junjie Fang and Yue Zhao and Chongyi Wang and Jun Liu and Guirong Chen and Yupeng Huo and Yuan Yao and Yankai Lin and Zhiyuan Liu and Maosong Sun},
  year={2024},
  eprint={},
  archivePrefix={},
  primaryClass={}
}
