
auto-gpt-benchmarks's Introduction

This repo is deprecated. For the updated benchmark, see the benchmark folder of the main repo at https://github.com/Significant-Gravitas/AutoGPT.

Auto-GPT Benchmarks

Built to benchmark the performance of agents regardless of how they work.

Objectively know how well your agent is performing in categories like code, retrieval, memory, and safety.

Save time and money while doing it through smart dependencies. The best part? It's all automated.

Scores:

[Screenshot: overall scores, 2023-07-25]

Ranking overall:

Detailed results:

[Screenshot: detailed results, 2023-07-25]

Click here to see the results and the raw data!

More agents coming soon!

auto-gpt-benchmarks's People

Contributors

ambujpawar, auto-gpt-bot, chitalian, dschonholtz, erik-megarad, fluder-paradyne, jakubno, lc0rp, marcgreen, mrbrain295, nerfzael, pwuts, rihp, scarletpan, silennaihin, swiftyos, torantulino, waynehamadi, westonwillingham


auto-gpt-benchmarks's Issues

Add AutoGPT commit hash to eval filename

Currently, the eval filename contains the eval name and a timestamp.
This issue is to also fetch the commit hash from the AutoGPT repo and add it to the filename.
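As a rough illustration, here is a minimal Python sketch of how the filename could be assembled; the function name, repo-path argument, and timestamp format are assumptions, not the repo's actual code.

# Hypothetical sketch: append the AutoGPT commit hash to the eval filename.
import subprocess
from datetime import datetime

def eval_filename(eval_name: str, auto_gpt_repo: str) -> str:
    # Short commit hash of the checked-out AutoGPT repo
    commit = subprocess.run(
        ["git", "-C", auto_gpt_repo, "rev-parse", "--short", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
    return f"{eval_name}_{timestamp}_{commit}.jsonl"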

Logging over print

There are a number of opportunities to replace print with logging.

This has several benefits, including:

  • Standardized stdout line formatting
  • Additional logging targets like vector memory
  • Message filtering

All of these either benefit future ML models directly, or provide an opportunity for a model to optimize.
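For illustration, a minimal sketch of the swap; the logger name and format string are assumptions, not the repo's current setup.

# Minimal sketch: replace print() calls with a module-level logger.
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(name)s] [%(levelname)s] %(message)s",  # standardized stdout formatting
)
logger = logging.getLogger("auto_gpt_benchmarking")  # name is illustrative

# Before: print(f"Running eval {eval_name}")
# After:
logger.info("Running eval %s", "test-match")

# Message filtering: silence noisy debug output from a dependency
logging.getLogger("urllib3").setLevel(logging.WARNING)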

oaieval command failure

Command: EVALS_THREADS=1 EVALS_THREAD_TIMEOUT=600 oaieval auto_gpt_completion_fn test-match --registry_path $PWD/auto_gpt_benchmarking

Output:

[2023-04-19 06:03:18,322] [registry.py:249] Loading registry from /opt/homebrew/lib/python3.9/site-packages/evals/registry/evals
[2023-04-19 06:03:18,380] [registry.py:249] Loading registry from /Users/ejohnson/.evals/evals
[2023-04-19 06:03:18,380] [registry.py:249] Loading registry from /Users/ejohnson/src/Auto-GPT-Benchmarks/auto_gpt_benchmarking/auto_gpt_benchmarking/evals
[2023-04-19 06:03:19,041] [registry.py:249] Loading registry from /opt/homebrew/lib/python3.9/site-packages/evals/registry/completion_fns
[2023-04-19 06:03:19,045] [registry.py:249] Loading registry from /Users/ejohnson/.evals/completion_fns
[2023-04-19 06:03:19,045] [registry.py:249] Loading registry from /Users/ejohnson/src/Auto-GPT-Benchmarks/auto_gpt_benchmarking/auto_gpt_benchmarking/completion_fns
[2023-04-19 06:03:19,045] [registry.py:120] completion_fn 'auto_gpt_completion_fn' not found. Closest matches: []
Traceback (most recent call last):
  File "/opt/homebrew/bin/oaieval", line 8, in <module>
    sys.exit(main())
  File "/opt/homebrew/lib/python3.9/site-packages/evals/cli/oaieval.py", line 164, in main
    run(args)
  File "/opt/homebrew/lib/python3.9/site-packages/evals/cli/oaieval.py", line 72, in run
    completion_fn_instances = [registry.make_completion_fn(url) for url in completion_fns]
  File "/opt/homebrew/lib/python3.9/site-packages/evals/cli/oaieval.py", line 72, in <listcomp>
    completion_fn_instances = [registry.make_completion_fn(url) for url in completion_fns]
  File "/opt/homebrew/lib/python3.9/site-packages/evals/registry.py", line 106, in make_completion_fn
    raise ValueError(f"Could not find CompletionFn in the registry with ID {name}")
ValueError: Could not find CompletionFn in the registry with ID auto_gpt_completion_fn

Count token usage in evals

Ideally, we should count tokens for each completed eval.
Now that AutoGPT supports token counting in master (I think; this needs double-checking), we should be able to save a file with total token usage alongside the default OpenAI eval result file.

This might involve making subdirectories per eval as eval results get more complicated and different types of files get saved.
Ideally, we would have a single file, but that won't happen while we save results with the evals lib.
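As a sketch of what the sidecar file could look like, assuming we can get at the prompt and completion text (tiktoken is used here for counting; the directory layout and function names are hypothetical):

# Hypothetical sketch: tally tokens per eval and write a sidecar summary file.
import json
from pathlib import Path

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

def write_usage(eval_dir: Path, prompts: list[str], completions: list[str]) -> None:
    usage = {
        "prompt_tokens": sum(count_tokens(p) for p in prompts),
        "completion_tokens": sum(count_tokens(c) for c in completions),
    }
    usage["total_tokens"] = usage["prompt_tokens"] + usage["completion_tokens"]
    # Written next to the default OpenAI eval result file for that eval
    (eval_dir / "token_usage.json").write_text(json.dumps(usage, indent=2))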

Request for Log Sharing

Dear AutoGPT Team,

I hope this message finds you well. As we work on enhancing our project's performance and optimizing our AI system, we believe your log data can provide valuable insights. To that end, we kindly request your assistance in sharing relevant log files with us.

Your logs could help us better understand system behavior, identify areas for improvement, and enhance our AI interactions. Rest assured that we will handle your data with the utmost care and in compliance with all applicable data privacy regulations.

Please let us know if you are willing to share your log files and, if so, the preferred method for sharing them. Your contribution will be greatly appreciated and will contribute to the success of our project.

Thank you for your support and collaboration.

Best regards,
Mike

Get results for all OpenAI evals.

Currently we have only run the simplest possible tests from OpenAI evals. Ideally, we would run all of them so we can really compare AutoGPT to other agents and models. Hopefully, we at least outperform GPT-4...
https://github.com/openai/evals
This will require reading through the available evals and evals docs.

Create Issue Templates for Benchmarking

We currently have to create an issue and then manually add it to the benchmarking project.
Ideally, whenever an issue is created on the benchmarking repo, it would automatically be added to the TODO column of the project for team evaluation.

OpenAI evals will retry forever if they timeout

If a task times out, it should be killed and marked as failed. This should be possible to adjust with some CLI flags to the OpenAI evals package.
It may just involve turning off their multi-threaded support, and possibly rolling our own timeout logic (a sketch follows below).
This task is done when an OpenAI eval run with the benchmarking repo kills itself on timeout, moves on to the next task, and does not retry.
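If the evals package can't be configured to give up, here is a hedged sketch of wrapping the oaieval invocation with our own timeout; the timeout value and eval/completion-fn names are taken from elsewhere in this repo and are illustrative.

# Sketch: hard-kill an oaieval run that exceeds its time budget and move on without retrying.
import subprocess

TIMEOUT_SECONDS = 600

def run_eval_once(eval_name: str) -> bool:
    """Return True on success; on timeout the child is killed and the eval is marked failed."""
    try:
        subprocess.run(
            ["oaieval", "auto_gpt_completion_fn", eval_name],
            timeout=TIMEOUT_SECONDS,
            check=True,
        )
        return True
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process on timeout; record the failure, no retry.
        return False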

I'm Creating .bat files for Windows Users To Keep It Super Simple

Even with clear documentation of all the changes needed to get this up and running on Windows, most Windows users won't be able to.

I spent most of the day creating a setup.bat file and a run.bat file. The big-picture goal is to combine them once I iron everything out.

My setup.bat file copies the existing .env OPENAI_API key for the user, to take one step off the user's hands.

I think a .bat-based benchmark dev pathway is the way to go for Windows users, given that the goal is to have as many people run evals as possible. To do so, we need a keep-it-simple-stupid installer and runner for most Windows users.

Current state of my setup .bat file:

  • Checks for Windows OS running Python 3.9 and above
  • Checks for Windows OS Git in order to clone the needed repos
  • Clones Auto-GPT-Benchmarks Repo
  • Creates the Virtual Env
    X Activates the virtual env (fixing this right now)
  • Installs the requirements within the venv

X Passes over requirements that are already installed but aren't on PATH.
Example: WARNING: The script openai.exe is installed in 'C:\Users\jonms\AppData\Roaming\Python\Python39\Scripts' which is not on PATH.

  • Clones the Auto-GPT.git within the venv
    X Clones the repo into the wrong folder (lol, I'm fixing it)
  • Copies the key from the .env file into the pseudo-submodule AutoGPT repo


This is taking longer than I wanted due to all the installs/reinstalls/updates creating a rat's nest of pathways and typical Windows BS problems on my machine. =(
I continue on for the homies.

I just wanted to give an update so someone else doesn't try to bang their head against the wall to install Benchmarks like I have.

Build a dashboard for displaying historical eval results

This is intentionally a little vague.
This will need some critical thinking about how exactly it should be done.
The basic idea is that we should be able to see scores on the various benchmarks, both for historical versions of stable and preferably for other branches as well.
This initial issue should produce a proposal for how to tackle it; probably no code as of yet.

AutoGPTBenchmarkSettings

Replace explicit yaml loads with a serialization framework for loading configuration.

Pydantic is my preference, but attrs and dataclasses are compelling options.

  1. Evaluate the options
  2. Make a decision
  3. Execute on something like the following with the settings.yaml file schemas as reference.
import yaml
from pydantic import BaseModel

class BenchmarksSettings(BaseModel):
    class Config:
        # Illustrative hook: Pydantic has no built-in YAML loader option,
        # so this stands in for wiring yaml.safe_load to the settings.yaml schema.
        loader = yaml.safe_load

    prop1: str
    prop2: int
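
A hedged usage sketch, assuming a settings.yaml file with the placeholder fields above:

# Hypothetical usage: validate settings.yaml through the Pydantic model.
with open("settings.yaml") as f:
    settings = BenchmarksSettings(**yaml.safe_load(f))
print(settings.prop1, settings.prop2)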

Build a pipeline for evaluating benchmarks

Ideally, in GitHub we would be able to click a button and watch a specific branch generate evaluation results.
That is likely prohibitively expensive, so for now this will just be a pipeline built from a makefile or something similar, where you can evaluate a subset of evals and have the results written somewhere deterministic (a rough sketch follows below).
Hopefully, those results could automatically be fed into a dashboard: #6
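As a sketch of the kind of pipeline meant here (eval names, the output directory, and the --record_path flag are assumptions about the oaieval CLI, not settled choices):

# Hypothetical pipeline: run a subset of evals and write results to a deterministic location.
import subprocess
from pathlib import Path

EVALS = ["test-match"]          # subset of evals to evaluate
OUT_DIR = Path("reports")       # deterministic output location

def run_pipeline() -> None:
    OUT_DIR.mkdir(exist_ok=True)
    for eval_name in EVALS:
        record_path = OUT_DIR / f"{eval_name}.jsonl"
        subprocess.run(
            ["oaieval", "auto_gpt_completion_fn", eval_name,
             "--record_path", str(record_path)],
            check=True,
        )

if __name__ == "__main__":
    run_pipeline()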

Optionally enable bash tools and plugins for evals.

Currently, we can't run bash, git, or shell commands in our AutoGPT container.
This issue is to switch the container to Ubuntu so we can!
This likely involves a PR to the core AutoGPT repo, but that's OK.

Does agent-protocol have a version 0.2.3?

I've been working at this for hours and I think I'm going crazy. The pyproject.toml has agent-protocol = "^0.2.3", but that's not listed as a version on PyPI?!

I think it's supposed to be agent-protocol = "^1.0.0"?

What am I missing here?

What is the best system to test REST APIs?

Hi,
In the context of our Benchmark, we want to test whether an agent is able to build an application.

A big part of building these applications is to expose an API, and we can probably start with REST APIs because it's very common and widely adopted.

This means we want to be able to KNOW whether a given agent has successfully built the REST API we defined using words.

To know this, we're going to build tests for each of the APIs we ask our agent to build.

So now we have this problem: how do we industrialize the creation of REST API tests?
I want to be clear here: we don't want to build REST APIs. That is the role of the agent. We want to test whether the REST API satisfies the requirements we wrote in English.

Here is how we will decide the best tool to use to industrialize our REST API testing suite:

  • no code, or as little code as possible, to write these tests
  • JSON based.
  • easy to use
  • easy to run
  • easy to teach
  • ideally, a tool many QA engineers or web developers already know.

Currently I am thinking of Postman. Here is what I am picturing:
Step 1: the challenge creator (i.e. the person who writes the test of the REST API) goes to Postman and creates a series of mocked requests/responses that define how the system should behave. For example, in the case of a URL shortener, we will:

  • have a setup phase where we host a URL on the web or in a sandbox, pointing to some known content
  • send that URL to the URL shortener and get back a shortened URL
  • fetch the shortened URL and make sure it redirects to the original URL

Step 2: export these requests/responses as a Postman collection.
Step 3: put this Postman collection into our benchmark repo.
Step 4: push this to GitHub and review.
Step 5: someone is now able to run this challenge: we ask an agent to build a URL shortener; once the agent finishes, we run the Postman collection and give the built application a score based on whether the responses match what a URL shortener should return. A sketch of what these checks amount to follows below.
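To make the flow concrete, here is a hedged Python sketch of what the checks boil down to for the URL-shortener example, whether Postman or anything else produces them; the endpoints, payloads, and port are invented for illustration.

# Hypothetical checks for a URL-shortener challenge; endpoints and payloads are made up.
import requests

BASE_URL = "http://localhost:8000"                    # agent-built service under test
ORIGINAL_URL = "http://example.com/some/long/path"    # hosted in the setup phase

def test_shorten_and_redirect() -> None:
    # Ask the service to shorten a known URL.
    resp = requests.post(f"{BASE_URL}/shorten", json={"url": ORIGINAL_URL})
    assert resp.status_code == 200
    short_url = resp.json()["short_url"]

    # The shortened URL must redirect back to the original.
    redirect = requests.get(short_url, allow_redirects=False)
    assert redirect.status_code in (301, 302)
    assert redirect.headers["Location"] == ORIGINAL_URL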

I would like to get more opinions on this topic.

I also would like to get other options and use this thread as an opportunity to connect with someone from the Postman team in order to understand if our use case makes sense.

AutoGPT Benchmarks "records.jsonl" Data Not Saved in README Location when ran as-is

I powered through running this on Windows with the existing README instructions (with a lot of edits), but at the end of it...

There's not even a DATA folder created in \Auto-GPT-Benchmarks\auto_gpt_benchmarking\AutoGPTData.
No records.jsonl is created in any of the Auto-GPT folders when it's run either.

And if the README meant that the JSON file will be saved in this location...
\Auto-GPT-Benchmarks\auto_gpt_benchmarking\AutoGPTData

There's only the existing ai_settings.yaml file

Obviously I am working on this in conjunction with the install_run_benchmarks.bat file I am creating.
