posgnu / rci-agent

A codebase for "Language Models can Solve Computer Tasks"

Home Page: https://posgnu.github.io/rci-web/

License: MIT License

Python 11.13% JavaScript 23.86% CSS 3.11% HTML 61.83% Shell 0.08%

large-language-models prompting reasoning

rci-agent's Introduction

RCI Agent for MiniWoB++

Welcome to the codebase for our paper, "Language Models can Solve Computer Tasks". In this codebase, you will find the implementation of our RCI agent, which uses a pre-trained language model to execute computer tasks in the MiniWoB++ benchmark, guided by natural language. The agent employs a simple RCI prompting scheme that allows it to improve its outputs.

[Website] [Arxiv Paper] [PDF]

Dependencies

The RCI agent is implemented in Python 3.9 and requires the following dependencies:

gym
openai
selenium
Pillow
regex

pip install -r requirements.txt

Note: MiniWoB++ is not officially supported on Windows. Please refer to this issue.

Usage

Setup

To run the code, you must first install MiniWoB++ and configure your OpenAI API key. MiniWoB++ is integrated with the OpenAI Gym environment. Navigate to the computergym directory and execute the following command to install it:

cd computergym
pip install -e .

Once that's done, write your OpenAI API key in the example_config.json file, then rename the file to config.json.
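
As a minimal sketch of what reading that file can look like, the helper below loads a key from a JSON config and fails loudly if the file or the field is missing. The function name `load_api_key` and the field name `openai_api_key` are hypothetical; check example_config.json for the key name the repository actually uses.

```python
import json
from pathlib import Path


def load_api_key(config_path="config.json", field="openai_api_key"):
    """Return the API key stored in a JSON config file.

    `field` is a hypothetical name, not taken from the repository;
    use whatever key example_config.json defines.
    """
    path = Path(config_path)
    if not path.exists():
        raise FileNotFoundError(
            f"{path} not found: rename example_config.json to config.json first"
        )
    with path.open(encoding="utf-8") as f:
        config = json.load(f)
    key = config.get(field, "")
    if not key:
        raise ValueError(f"'{field}' is missing or empty in {path}")
    return key
```

Calling `load_api_key()` before starting a run surfaces a missing or empty key immediately, rather than at the first request to the language model.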

Run

To run the code, simply execute the following command:

python main.py --env [TASK NAME] --llm [LLM NAME] --num-episodes [NUM EPISODES] --erci [NUM Explicit RCI] --irci [NUM Implicit RCI] --sgrounding

Here are the arguments you need to specify:

--env: Name of the MiniWoB++ task you want to run. The list of available tasks is in available_tasks.txt.
--llm: Name of the language model you want to use. The model name and corresponding API name are specified below:
- chatgpt: "gpt-3.5-turbo"
- davinci: "text-davinci-003"
- ada: "ada"
- babbage: "babbage"
- curie: "curie"
- davinci1: "davinci"
- davinci2: "text-davinci-002"
--num-episodes: Number of episodes to run for the task.
--erci: The number of explicit RCI loops for an action plan. -1 removes the action plan sampling.
--irci: The number of implicit RCI loops for the agent grounding.
--sgrounding: If this flag is passed, the state grounding update is enabled.
--headless: If this flag is passed, the MiniWoB++ environment runs in headless mode.
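
To make the flag semantics concrete, here is a minimal sketch of a parser that mirrors the arguments listed above. It is not the repository's actual code: `build_parser`, `LLM_API_NAMES`, and all default values are assumptions for illustration, and the two boolean options are modelled as switches because the example commands pass them without a value.

```python
import argparse

# Model aliases and their API names, copied from the list above.
LLM_API_NAMES = {
    "chatgpt": "gpt-3.5-turbo",
    "davinci": "text-davinci-003",
    "ada": "ada",
    "babbage": "babbage",
    "curie": "curie",
    "davinci1": "davinci",
    "davinci2": "text-davinci-002",
}


def build_parser():
    """Hypothetical parser mirroring the documented flags (defaults are assumed)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--env", required=True, help="MiniWoB++ task name")
    parser.add_argument("--llm", choices=sorted(LLM_API_NAMES), default="chatgpt")
    parser.add_argument("--num-episodes", type=int, default=1)
    parser.add_argument("--erci", type=int, default=0)
    parser.add_argument("--irci", type=int, default=1)
    # Boolean switches: present means True, absent means False.
    parser.add_argument("--sgrounding", action="store_true")
    parser.add_argument("--headless", action="store_true")
    return parser


args = build_parser().parse_args(
    ["--env", "choose-list", "--llm", "chatgpt", "--num-episodes", "1",
     "--irci", "1", "--sgrounding"]
)
print(LLM_API_NAMES[args.llm], args.num_episodes, args.sgrounding, args.headless)
```

With the arguments shown, this prints `gpt-3.5-turbo 1 True False`: `--sgrounding` was passed, `--headless` was not.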

Consider running the following command to verify if everything is functioning correctly:

python main.py --env choose-list --llm chatgpt --num-episodes 1 --irci 1 --sgrounding

Evaluation

Our project's approach has yielded impressive results, with our agent achieving the second-highest score out of all tested models. We have observed that our agent outperforms the baselines, with the exception of CC-Net (SL + RL), which uses dictionary-based typing actions.

What sets our RCI agent apart is that it accomplished this feat using 120 times fewer samples than WebN-T5-3B and 11,000 times fewer samples than CC-Net. Obtaining expert demonstrations and defining reward functions for computer tasks can be a daunting challenge, but our research highlights the potential of using LLMs to overcome these obstacles and achieve success in general computer tasks.

Check out our paper!

Our paper is available on Arxiv. If you use this code in your research, we kindly ask that you cite our paper.

@article{kim2023language,
      title={Language Models can Solve Computer Tasks}, 
      author={Geunwoo Kim and Pierre Baldi and Stephen McAleer},
      journal={arXiv preprint arXiv:2303.17491},
      year={2023},
}

rci-agent's Issues

How to reproduce the results in Table 1&2?

Hi authors,

Thanks for your excellent work! I'm wondering if you could provide more details on how to produce the results in Table 1 and 2, for example:

Which LM did you use? (Section 3.1 said "InstructGPT-3 + RLHF", but which specific checkpoint?)
What are the hyperparameters?
What's the full prompt?
Would it be possible to provide the dataset split/model predictions/relevant code in order to reproduce the results?

Thanks in advance!

Bug Report: Some task can't be addressed when Headless parameter is enabled

Bug Description:
I encountered a strange bug while running two nearly identical benchmarks for the "enter-time" task (it may affect many other tasks as well). The only difference between the two runs is the value of the "headless" parameter. In the first case, I set it to False (headless = False), while in the second case, I left it as True, which was the default value.

Steps to Reproduce:

Git clone the SNow_benchmark branch from my fork and follow the installation steps in the README.md.
Set the headless parameter to False and run the benchmark for the "enter-time" task:
python main.py --env enter-time --llm chatgpt --num-episodes 1 --irci 1 --sgrounding
Set the headless parameter to True and run the benchmark for the "enter-time" task:
python main.py --env enter-time --llm chatgpt --num-episodes 1 --irci 1 --sgrounding --headless
Expected Behavior:
The results should be identical, regardless of the value of the headless parameter.

Actual Behavior:
When the headless parameter is enabled (set to True), certain actions are not allowed or counted, resulting in a failed task. (I can run the benchmark several times and still get the same result.)

(RCI-agent-WSL) thirdcore@DESKTOP-5I4C9HH:~/rci-agent$ python main.py --env enter-time --llm chatgpt --num-episodes 1 --irci 1 --sgrounding 
False
INFO:root:Starting WebDriver Instance 0
INFO:selenium.webdriver.common.selenium_manager:Applicable driver not found; attempting to install with Selenium Manager (Beta)
INFO:root:Send a request to the language model from initialize_plan
INFO:root:The number of generated action steps: 4
INFO:root:Send a request to the language model from generate_action
INFO:root:The executed instruction: clickxpath //*[@id="tt"]
INFO:root:Send a request to the language model from generate_action
INFO:root:The executed instruction: type 02:07PM
INFO:root:Send a request to the language model from generate_action
INFO:root:The executed instruction: clickxpath //*[@id="subbtn"]
success rate: 1.0
(RCI-agent-WSL) thirdcore@DESKTOP-5I4C9HH:~/rci-agent$ python main.py --env enter-time --llm chatgpt --num-episodes 1 --irci 1 --sgrounding --headless
True
INFO:root:Starting WebDriver Instance 0
INFO:selenium.webdriver.common.selenium_manager:Applicable driver not found; attempting to install with Selenium Manager (Beta)
INFO:root:Send a request to the language model from initialize_plan
INFO:root:The number of generated action steps: 4
INFO:root:Send a request to the language model from generate_action
INFO:root:The executed instruction: clickxpath //*[@id="tt"]
INFO:root:Send a request to the language model from generate_action
INFO:root:The executed instruction: type 1017AM
INFO:root:Send a request to the language model from generate_action
INFO:root:The executed instruction: clickxpath //*[@id="subbtn"]
success rate: 0.0

Additional Information:
I'm still investigating the root cause of this issue. It seems that when the browser is not displayed, some actions are restricted or not properly accounted for, leading to the task failure. Have you seen the same behavior, or is there something I'm missing?

How to reproduce the results in the paper?

Thanks for the good work.
How can I reproduce the results presented in Table 17 in the paper? Which hyperparameters should I set?
Thanks~

Any plans to port this to the new openAI python library?

MINIWOB_BASE_URL environment variable not defined

Hello,

After installing the required packages, I tried to run an experiment on the choose-list environment using the command you provided:

python main.py --env choose-list --llm chatgpt --num-episodes 1 --irci 1 --sgrounding

I got the following error:

(RCI-agent) PS C:\Users\Tom\Desktop\rci-agent> python main.py --env choose-list --llm chatgpt --num-episodes 1 --irci 1 --sgrounding
INFO:root:Starting WebDriver Instance 0
C:\Users\Tom\miniconda3\envs\RCI-agent\lib\site-packages\gym\utils\passive_env_checker.py:20: UserWarning: WARN: It seems a Box observation space is an image but the `dtype` is not `np.uint8`, actual type: int32. If the Box observation space is not an image, we recommend flattening the observation to have only a 1D vector.
  logger.warn(
C:\Users\Tom\miniconda3\envs\RCI-agent\lib\site-packages\gym\utils\passive_env_checker.py:174: UserWarning: WARN: Future gym versions will require that `Env.reset` can be passed a `seed` instead of using `Env.seed` for resetting the environment random number generator.
  logger.warn(
C:\Users\Tom\miniconda3\envs\RCI-agent\lib\site-packages\gym\utils\passive_env_checker.py:187: UserWarning: WARN: Future gym versions will require that `Env.reset` can be passed `options` to allow the environment initialisation to be passed additional information.
  logger.warn(
INFO:selenium.webdriver.common.selenium_manager:Applicable driver not found; attempting to install with Selenium Manager (Beta)

DevTools listening on ws://127.0.0.1:51802/devtools/browser/6802c38e-d0ec-42dd-b55b-0574baaefc72
ERROR:root:Page did not load properly. Wrong MINIWOB_BASE_URL?
INFO:root:Closed instance 0
Exception in thread Thread-1:
Traceback (most recent call last):
  File "C:\Users\Tom\miniconda3\envs\RCI-agent\lib\threading.py", line 980, in _bootstrap_inner
    self.run()
  File "c:\users\tom\desktop\rci-agent\computergym\computergym\miniwob\miniwob_interface\instance.py", line 128, in run
    self.create_driver()
  File "c:\users\tom\desktop\rci-agent\computergym\computergym\miniwob\miniwob_interface\instance.py", line 200, in create_driver
    raise e
        (No symbol) [0x0034A304]
        (No symbol) [0x0035C482]
        (No symbol) [0x0034A0B6]
        (No symbol) [0x00327E08]
        (No symbol) [0x00328F2D]
        GetHandleVerifier [0x006C8E3A+2540266]
        GetHandleVerifier [0x00708959+2801161]
        GetHandleVerifier [0x0070295C+2776588]
        GetHandleVerifier [0x004F2280+612144]
        (No symbol) [0x00404F6C]
        (No symbol) [0x004011D8]
        (No symbol) [0x004012BB]
        (No symbol) [0x003F4857]
        BaseThreadInitThunk [0x76C97D59+25]
        RtlInitializeExceptionChain [0x77C9B74B+107]
        RtlClearBits [0x77C9B6CF+191]

I tried to run the file environment.py and got the same issue. The reason is that the environment variable MINIWOB_BASE_URL is not defined.

Debug console:

import os 
base_url=os.environ.get("MINIWOB_BASE_URL")
print(base_url)
None

Am I supposed to define this environment variable myself?

PS: I'm running on Windows 11, Python 3.9.16, and I use a conda env.
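
For context, here is a minimal sketch of defining the variable from Python before the environment is created. It assumes the code expects a file:// URL pointing at the directory that holds the MiniWoB++ HTML files. The helper name `set_miniwob_base_url` and the directory used below are hypothetical, and this has not been confirmed as the fix for this issue.

```python
import os
from pathlib import Path


def set_miniwob_base_url(html_dir):
    """Point MINIWOB_BASE_URL at a local HTML directory as a file:// URL."""
    # as_uri() requires an absolute path, so normalise the input first.
    url = Path(os.path.abspath(html_dir)).as_uri()
    if not url.endswith("/"):
        url += "/"
    os.environ["MINIWOB_BASE_URL"] = url
    return url


# Hypothetical location; adjust to wherever the HTML files live in your checkout.
print(set_miniwob_base_url("computergym/computergym/miniwob/miniwob_interface/html"))
```

The same value can be set in the shell instead; what matters is that the variable is defined in the process that launches the WebDriver instance.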

Prompt/Codebase for reasoning tasks

Hi, first of all thank you so much for the amazing work and paper.

I am interested in evaluating RCI across different reasoning tasks and was wondering about the prompt used in the paper for reasoning. Is it the prompt text shown in Figure 2, or was there any additional information in the instructions? The details in the appendix are not completely clear. For example, do you pass all the implicit feedback generated so far back to the LLM, or is each individual piece of feedback passed through a separate call to the LLM without the previous feedback in the prompt? Any chance of releasing the RCI codebase for the reasoning tasks?

Thank you for your time.

Executing custom tasks

Hello!
Thanks for this fantastic repo! The paper is also very amazing and insightful.

I was wondering whether it's possible to define custom HTML pages and tasks to be executed.
I was thinking of adding the custom HTML file to the computergym/miniwob/miniwob_interface/html/miniwob directory and also listing it in available_tasks.txt.
Would this approach work? Please, let me know your thoughts about this.

Regards