
auto-gpt-benchmarks's Introduction

This repo is deprecated. For the updated benchmark, see the benchmark folder of the main repo at https://github.com/Significant-Gravitas/AutoGPT.

Auto-GPT Benchmarks

Built to benchmark the performance of agents regardless of how they work.

Objectively know how well your agent is performing in categories like code, retrieval, memory, and safety.

Save time and money while doing it through smart dependencies. The best part? It's all automated.

Scores:

[Screenshot: overall scores, 2023-07-25]

Ranking overall:

Detailed results:

[Screenshot: detailed results, 2023-07-25]

Click here to see the results and the raw data!

More agents coming soon!

auto-gpt-benchmarks's People

Contributors

ambujpawar, auto-gpt-bot, chitalian, dschonholtz, erik-megarad, fluder-paradyne, jakubno, lc0rp, marcgreen, mrbrain295, nerfzael, pwuts, rihp, scarletpan, silennaihin, swiftyos, torantulino, waynehamadi, westonwillingham


auto-gpt-benchmarks's Issues

Add AutoGPT commit hash to eval filename

Currently, the eval filename contains the eval name and a timestamp.
This issue is to also fetch the commit hash from the AutoGPT repo and add it to the filename.
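As a rough illustration, here is a minimal Python sketch of how the filename could be assembled; the function name, repo-path argument, and timestamp format are assumptions, not the repo's actual code.

# Hypothetical sketch: append the AutoGPT commit hash to the eval filename.
import subprocess
from datetime import datetime

def eval_filename(eval_name: str, auto_gpt_repo: str) -> str:
    # Short commit hash of the checked-out AutoGPT repo
    commit = subprocess.run(
        ["git", "-C", auto_gpt_repo, "rev-parse", "--short", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
    return f"{eval_name}_{timestamp}_{commit}.jsonl"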

Logging over print

There are a number of opportunities to replace print with logging.

This has several benefits, including:

  • Standardized stdout line formatting
  • Additional logging targets like vector memory
  • Message filtering

All of these either benefit future ML models directly, or provide an opportunity for a model to optimize.
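For illustration, a minimal sketch of the swap; the logger name and format string are assumptions, not the repo's current setup.

# Minimal sketch: replace print() calls with a module-level logger.
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(name)s] [%(levelname)s] %(message)s",  # standardized stdout formatting
)
logger = logging.getLogger("auto_gpt_benchmarking")  # name is illustrative

# Before: print(f"Running eval {eval_name}")
# After:
logger.info("Running eval %s", "test-match")

# Message filtering: silence noisy debug output from a dependency
logging.getLogger("urllib3").setLevel(logging.WARNING)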

oaieval command failure

Command: EVALS_THREADS=1 EVALS_THREAD_TIMEOUT=600 oaieval auto_gpt_completion_fn test-match --registry_path $PWD/auto_gpt_benchmarking

Output:

[2023-04-19 06:03:18,322] [registry.py:249] Loading registry from /opt/homebrew/lib/python3.9/site-packages/evals/registry/evals
[2023-04-19 06:03:18,380] [registry.py:249] Loading registry from /Users/ejohnson/.evals/evals
[2023-04-19 06:03:18,380] [registry.py:249] Loading registry from /Users/ejohnson/src/Auto-GPT-Benchmarks/auto_gpt_benchmarking/auto_gpt_benchmarking/evals
[2023-04-19 06:03:19,041] [registry.py:249] Loading registry from /opt/homebrew/lib/python3.9/site-packages/evals/registry/completion_fns
[2023-04-19 06:03:19,045] [registry.py:249] Loading registry from /Users/ejohnson/.evals/completion_fns
[2023-04-19 06:03:19,045] [registry.py:249] Loading registry from /Users/ejohnson/src/Auto-GPT-Benchmarks/auto_gpt_benchmarking/auto_gpt_benchmarking/completion_fns
[2023-04-19 06:03:19,045] [registry.py:120] completion_fn 'auto_gpt_completion_fn' not found. Closest matches: []
Traceback (most recent call last):
  File "/opt/homebrew/bin/oaieval", line 8, in <module>
    sys.exit(main())
  File "/opt/homebrew/lib/python3.9/site-packages/evals/cli/oaieval.py", line 164, in main
    run(args)
  File "/opt/homebrew/lib/python3.9/site-packages/evals/cli/oaieval.py", line 72, in run
    completion_fn_instances = [registry.make_completion_fn(url) for url in completion_fns]
  File "/opt/homebrew/lib/python3.9/site-packages/evals/cli/oaieval.py", line 72, in <listcomp>
    completion_fn_instances = [registry.make_completion_fn(url) for url in completion_fns]
  File "/opt/homebrew/lib/python3.9/site-packages/evals/registry.py", line 106, in make_completion_fn
    raise ValueError(f"Could not find CompletionFn in the registry with ID {name}")
ValueError: Could not find CompletionFn in the registry with ID auto_gpt_completion_fn

Count token usage in evals

Ideally, we should count tokens for each completed eval.
Now that AutoGPT supports token counting in master (I think; this needs double-checking), we should be able to save a file with total token usage alongside the default OpenAI eval result file.

This might involve making subdirectories per eval as eval results get more complicated and different types of files get saved.
Ideally, we would have a single file, but that won't happen while we save results with the evals lib.
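As a sketch of what the sidecar file could look like, assuming we can get at the prompt and completion text (tiktoken is used here for counting; the directory layout and function names are hypothetical):

# Hypothetical sketch: tally tokens per eval and write a sidecar summary file.
import json
from pathlib import Path

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

def write_usage(eval_dir: Path, prompts: list[str], completions: list[str]) -> None:
    usage = {
        "prompt_tokens": sum(count_tokens(p) for p in prompts),
        "completion_tokens": sum(count_tokens(c) for c in completions),
    }
    usage["total_tokens"] = usage["prompt_tokens"] + usage["completion_tokens"]
    # Written next to the default OpenAI eval result file for that eval
    (eval_dir / "token_usage.json").write_text(json.dumps(usage, indent=2))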

Request for Log Sharing

Dear AutoGPT Team,

I hope this message finds you well. As we work on enhancing our project's performance and optimizing our AI system, we believe your log data can provide valuable insights. To that end, we kindly request your assistance in sharing relevant log files with us.

Your logs could help us better understand system behavior, identify areas for improvement, and enhance our AI interactions. Rest assured that we will handle your data with the utmost care and in compliance with all applicable data privacy regulations.

Please let us know if you are willing to share your log files and, if so, the preferred method for sharing them. Your contribution will be greatly appreciated and will contribute to the success of our project.

Thank you for your support and collaboration.

Best regards,
Mike

Get results for all OpenAI evals.

Currently we have only run the simplest possible tests from OpenAI evals. Ideally, we would run all of them so we can really compare AutoGPT to other agents and models. Hopefully, we at least outperform GPT-4...
https://github.com/openai/evals
This will require reading through the available evals and evals docs.

Create Issue Templates for Benchmarking

We currently have to create an issue and then manually add it to the benchmarking project.
Ideally, whenever an issue is created on the benchmarking repo, it would automatically be added to the TODO column of the project for team evaluation.

OpenAI evals will retry forever if they timeout

If a task times out, it should be killed and marked as failed. This should be possible to adjust with some CLI flags to the OpenAI evals package.
It may just involve turning off their multi-threaded support, and possibly rolling our own timeout logic (a sketch follows below).
This task is done when an OpenAI eval run with the benchmarking repo kills itself on timeout, moves on to the next task, and does not retry.
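If the evals package can't be configured to give up, here is a hedged sketch of wrapping the oaieval invocation with our own timeout; the timeout value and eval/completion-fn names are taken from elsewhere in this repo and are illustrative.

# Sketch: hard-kill an oaieval run that exceeds its time budget and move on without retrying.
import subprocess

TIMEOUT_SECONDS = 600

def run_eval_once(eval_name: str) -> bool:
    """Return True on success; on timeout the child is killed and the eval is marked failed."""
    try:
        subprocess.run(
            ["oaieval", "auto_gpt_completion_fn", eval_name],
            timeout=TIMEOUT_SECONDS,
            check=True,
        )
        return True
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process on timeout; record the failure, no retry.
        return False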

I'm Creating .bat files for Windows Users To Keep It Super Simple

Even with clear documentation of all the changes needed to get this up and running on Windows, most Windows users won't be able to.

I spent most of the day creating a setup.bat file and a run.bat file. The big-picture goal is to combine them once I iron everything out.

My setup.bat file copies the existing .env OPENAI_API key for the user, to take one step off the user's hands.

I think a .bat-based benchmark dev pathway is the way to go for Windows users, given that the goal is to have as many people run evals as possible. To do so, we need a keep-it-simple-stupid installer and runner for most Windows users.

Current state of my setup .bat file:

  • Checks for Windows OS running Python 3.9 and above
  • Checks for Windows OS Git in order to clone the needed repos
  • Clones Auto-GPT-Benchmarks Repo
  • Creates the Virtual Env
    X Activates the virtual env (fixing this right now)
  • Installs the requirements within the venv

X Passes over requirements that are already installed but aren't on PATH.
Example: WARNING: The script openai.exe is installed in 'C:\Users\jonms\AppData\Roaming\Python\Python39\Scripts' which is not on PATH.

  • Clones the Auto-GPT.git within the venv
    X Clones the repo into the wrong folder (lol, I'm fixing it)
  • Copies the key from the .env file into the pseudo-submodule AutoGPT repo


This is taking longer than I wanted due to all the installs/reinstalls/updates creating a rat's nest of pathways and typical Windows BS problems on my machine. =(
I continue on for the homies.

I just wanted to give an update so someone else doesn't try to bang their head against the wall to install Benchmarks like I have.

Build a dashboard for displaying historical eval results

This is intentionally a little vague.
This will need some critical thinking about how exactly it should be done.
The basic idea is that we should be able to see scores on the various benchmarks, both for historical versions of stable and preferably for other branches as well.
This initial issue should produce a proposal for how to tackle it; probably no code as of yet.

AutoGPTBenchmarkSettings

Replace explicit yaml loads with a serialization framework for loading configuration.

Pydantic is my preference, but attrs and dataclasses are compelling options.

  1. Evaluate the options
  2. Make a decision
  3. Execute on something like the following with the settings.yaml file schemas as reference.
import yaml
from pydantic import BaseModel

class BenchmarksSettings(BaseModel):
    class Config:
        # Illustrative hook: Pydantic has no built-in YAML loader option,
        # so this stands in for wiring yaml.safe_load to the settings.yaml schema.
        loader = yaml.safe_load

    prop1: str
    prop2: int
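
A hedged usage sketch, assuming a settings.yaml file with the placeholder fields above:

# Hypothetical usage: validate settings.yaml through the Pydantic model.
with open("settings.yaml") as f:
    settings = BenchmarksSettings(**yaml.safe_load(f))
print(settings.prop1, settings.prop2)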

Build a pipeline for evaluating benchmarks

Ideally, in GitHub we would be able to click a button and watch a specific branch generate evaluation results.
That is likely prohibitively expensive, so for now this will just be a pipeline built from a makefile or something similar, where you can evaluate a subset of evals and have the results written somewhere deterministic (a rough sketch follows below).
Hopefully, those results could automatically be fed into a dashboard: #6
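As a sketch of the kind of pipeline meant here (eval names, the output directory, and the --record_path flag are assumptions about the oaieval CLI, not settled choices):

# Hypothetical pipeline: run a subset of evals and write results to a deterministic location.
import subprocess
from pathlib import Path

EVALS = ["test-match"]          # subset of evals to evaluate
OUT_DIR = Path("reports")       # deterministic output location

def run_pipeline() -> None:
    OUT_DIR.mkdir(exist_ok=True)
    for eval_name in EVALS:
        record_path = OUT_DIR / f"{eval_name}.jsonl"
        subprocess.run(
            ["oaieval", "auto_gpt_completion_fn", eval_name,
             "--record_path", str(record_path)],
            check=True,
        )

if __name__ == "__main__":
    run_pipeline()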

Optionally enable bash tools and plugins for evals.

Currently, we can't run bash, git, or shell commands in our AutoGPT container.
This issue is to switch the container to Ubuntu so we can!
This likely involves a PR to the core AutoGPT repo, but that's OK.

Does agent-protocol have a version 0.2.3?

I've been working at this for hours and I think I'm going crazy. The pyproject.toml has agent-protocol = "^0.2.3", but that's not listed as a version on PyPI?!

I think it's supposed to be agent-protocol = "^1.0.0"?

What am I missing here?

What is the best system to test REST APIs?

Hi,
In the context of our Benchmark, we want to test whether an agent is able to build an application.

A big part of building these applications is to expose an API, and we can probably start with REST APIs because it's very common and widely adopted.

This means we want to be able to KNOW whether a given agent has successfully built the REST API we defined using words.

To know this, we're going to build tests for each of the APIs we ask our agent to build.

So now we have this problem: how do we industrialize the creation of REST API tests?
I want to be clear here: we don't want to build REST APIs. That is the role of the agent. We want to test whether the REST API satisfies the requirements we wrote in English.

Here is how we will decide the best tool to use to industrialize our REST API testing suite:

  • no code, or as little code as possible, to write these tests
  • JSON based.
  • easy to use
  • easy to run
  • easy to teach
  • ideally, a tool many QA engineers or web developers already know.

Currently I am thinking of Postman. Here is what I am picturing:
Step 1: the challenge creator (i.e. the person who writes the test of the REST API) goes to Postman and creates a series of mocked requests/responses that define how the system should behave. For example, in the case of a URL shortener, we will:

  • have a setup phase where we host a URL on the web or in a sandbox, pointing to some known content
  • send that URL to the URL shortener and get back a shortened URL
  • fetch the shortened URL and make sure it redirects to the original URL

Step 2: export these requests/responses as a Postman collection.
Step 3: put this Postman collection into our benchmark repo.
Step 4: push this to GitHub and review.
Step 5: someone is now able to run this challenge: we ask an agent to build a URL shortener; once the agent finishes, we run the Postman collection and give the built application a score based on whether the responses match what a URL shortener should return. A sketch of what these checks amount to follows below.
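To make the flow concrete, here is a hedged Python sketch of what the checks boil down to for the URL-shortener example, whether Postman or anything else produces them; the endpoints, payloads, and port are invented for illustration.

# Hypothetical checks for a URL-shortener challenge; endpoints and payloads are made up.
import requests

BASE_URL = "http://localhost:8000"                    # agent-built service under test
ORIGINAL_URL = "http://example.com/some/long/path"    # hosted in the setup phase

def test_shorten_and_redirect() -> None:
    # Ask the service to shorten a known URL.
    resp = requests.post(f"{BASE_URL}/shorten", json={"url": ORIGINAL_URL})
    assert resp.status_code == 200
    short_url = resp.json()["short_url"]

    # The shortened URL must redirect back to the original.
    redirect = requests.get(short_url, allow_redirects=False)
    assert redirect.status_code in (301, 302)
    assert redirect.headers["Location"] == ORIGINAL_URL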

I would like to get more opinions on this topic.

I also would like to get other options and use this thread as an opportunity to connect with someone from the Postman team in order to understand if our use case makes sense.

AutoGPT Benchmarks "records.jsonl" Data Not Saved in README Location when ran as-is

I powered through running this on Windows with the existing README instructions (with a lot of edits), but at the end of it...

There's not even a DATA folder created in \Auto-GPT-Benchmarks\auto_gpt_benchmarking\AutoGPTData.
No records.jsonl is created in any of the Auto-GPT folders when it's run either.

And if the README meant that the JSON file will be saved in this location...
\Auto-GPT-Benchmarks\auto_gpt_benchmarking\AutoGPTData

There's only the existing ai_settings.yaml file

Obviously I am working on this in conjunction with the install_run_benchmarks.bat file I am creating.
