othersideai / self-operating-computer
A framework to enable multimodal models to operate a computer.
Home Page: https://www.hyperwriteai.com/self-operating-computer
License: MIT License
The model gpt-4-vision-preview does not exist or you do not have access to it. Learn more: https://help.openai.com/en/articles/7102672-how-can-i-access-gpt-4
Can a coordinate grid be added to every screenshot so the clicks are not so far off?
Since it misclicks a lot, you could either train a model to do image segmentation or, with some clever prompt engineering, add a barebones grid and ask it to solve the puzzle "in which grid cell can the search button be found?" That should make it more robust, right?
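For what it's worth, a minimal sketch of that barebones grid overlay, assuming Pillow is available; the file name and the 4x4 cell count are just placeholders:

```python
# Sketch: overlay a labeled coordinate grid on a screenshot so the model can
# answer "which cell contains the search button?" instead of guessing raw pixels.
# Assumes Pillow; "screenshot.png" and the 4x4 grid are illustrative choices.
from PIL import Image, ImageDraw

def overlay_grid(path="screenshot.png", cols=4, rows=4):
    img = Image.open(path).convert("RGB")
    draw = ImageDraw.Draw(img)
    w, h = img.size
    cell_w, cell_h = w / cols, h / rows
    for r in range(rows):
        for c in range(cols):
            x0, y0 = c * cell_w, r * cell_h
            draw.rectangle([x0, y0, x0 + cell_w, y0 + cell_h], outline="red", width=2)
            draw.text((x0 + 5, y0 + 5), str(r * cols + c), fill="red")
    img.save("screenshot_grid.png")
    return img

overlay_grid()
```

The model then only has to name a cell ID, and the click can be issued at that cell's center.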
I guess most people actually use Windows, not macOS.
Error parsing JSON: Request timed out.
[Self-Operating Computer][Error] something went wrong :(
[Self-Operating Computer][Error] AI response
Failed take action after looking at the screenshot
search for latest news on AI and give top two news
Error parsing JSON: [Errno 2] No such file or directory: 'screencapture'
[Self-Operating Computer][Error] something went wrong :(
[Self-Operating Computer][Error] AI response
Failed take action after looking at the screenshot
Getting this error on Linux (Arch).
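The `screencapture` binary only exists on macOS, which is why the Arch run fails at the screenshot step. A hedged sketch of a platform-aware capture helper (paths and fallbacks are assumptions, not the project's actual code):

```python
# Sketch of a platform-aware screenshot helper. macOS has `screencapture`;
# elsewhere, fall back to pyautogui.screenshot(), which the project already depends on
# (on Linux it needs an X11 session with a backend such as scrot/gnome-screenshot).
import platform
import subprocess
import pyautogui

def capture_screen(path="screenshot.png"):
    if platform.system() == "Darwin":
        subprocess.run(["screencapture", "-x", path], check=True)
    else:
        pyautogui.screenshot(path)
    return path
```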
Hi,
I'm Mr. Cryptic, a very friendly guy.
When Linux? I wouldn't touch a Mac with a five-foot stick.
@joshbickett Was going through the codebase and found that this app lacks a fundamental Dockerfile.
Maybe a YOLO object detection model trained on basic UI elements to get coordinates? Or something like SAM?
I mean, as soon as there is a small model, GPT-4 can check the dataset and add more correct examples to the training set.
The feature request is this.
Re-opening. @michaelhhogue, are you an official contributor to this project? Could you comment on where the name Agent-1 came from? This appears to have blatantly ripped off the work of researchers working hard for over a year.
Any open source contributors should consider that this firm raised millions and is scamming open source devs into stealing work for them. They never responded to our claims they stole this work, even down to the name agent-1, which is incredibly shameful if true and our attorneys would love to hear from an official contributor.
We will be publishing our solution open source as well, and here is Atlas-1, which we've been training for over a year and published last month: https://youtu.be/IQuBA7MvUas
What is Agent-1 and where did the name come from? It appears these guys blatantly ripped off our work and are now scamming open source devs into copying it for them.
Would it be advantageous to keep a collage of downsampled previous images, maybe at 160 px x 90 px, stacked in a line left to right, one after another, and constantly pass this image as additional context for each action, as in "here is a timeline of previous states the model has traversed"?
FYI: I would be happy to draft and code this feature out!
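A rough sketch of the collage idea above, assuming Pillow; the file handling and thumbnail size are illustrative only:

```python
# Downsample each previous screenshot to 160x90 and paste them left to right into one
# strip that can be sent as extra "timeline" context alongside the current screenshot.
from PIL import Image

def build_timeline(paths, thumb_size=(160, 90)):
    thumbs = [Image.open(p).convert("RGB").resize(thumb_size) for p in paths]
    strip = Image.new("RGB", (thumb_size[0] * len(thumbs), thumb_size[1]), "white")
    for i, thumb in enumerate(thumbs):
        strip.paste(thumb, (i * thumb_size[0], 0))
    return strip

# strip = build_timeline(["step1.png", "step2.png", "step3.png"])
# strip.save("timeline.png")
```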
What is the cost?
I played with GPT-4V on other projects and it definitely has a hard time figuring out coordinates. I used another model trained on image identification to find the coordinates of the box drawn around the detected object, and then I can pass them to GPT-4 to perform an action. For your use case, I just tested this model: https://huggingface.co/foduucom/web-form-ui-field-detection. Far from perfect, but maybe an idea to build on. If your self-operating computer can detect and get the proper coordinates of the input fields in an image, it could help, or at least add a level of redundancy to improve accuracy in clicking and inputting things in the right places.
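A hedged sketch of using that detector to get candidate click targets before asking the vision model what to do with them. It assumes `ultralytics` and `huggingface_hub` are installed and that the repo ships its weights as "best.pt" (check the model card for the actual file name):

```python
# Detect web-form fields in a screenshot and print their box centers, which can then
# be handed to GPT-4V instead of asking it to estimate pixel coordinates itself.
from huggingface_hub import hf_hub_download
from ultralytics import YOLO

weights = hf_hub_download("foduucom/web-form-ui-field-detection", "best.pt")
model = YOLO(weights)

results = model("screenshot.png")
for box in results[0].boxes:
    x1, y1, x2, y2 = box.xyxy[0].tolist()
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    print(f"field at center ({cx:.0f}, {cy:.0f}), confidence {float(box.conf):.2f}")
```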
I've been reviewing the project's codebase and noticed that all the logic and functions are currently contained within a single file. This structure, while functional, can make the code challenging to read and maintain. To improve the readability and maintainability of the code, I propose restructuring it into separate, focused modules.
I am eager to contribute to this enhancement and have already started working on a preliminary refactoring plan. My goal is to collaborate with the community to develop a structure that best suits our project's needs. I look forward to hearing your thoughts and suggestions on this proposal.
Hello and thanks for this beautiful repo.
Would you consider adding open source model support, especially with Ollama?
Best,
Orkut
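Not an official integration, but a sketch of what an Ollama-backed vision call could look like via the local REST API, assuming an Ollama server on the default port and a vision-capable model such as llava pulled locally:

```python
# Send a prompt plus a base64-encoded screenshot to a local Ollama server and return
# the model's text response. Model name and timeout are placeholder choices.
import base64
import requests

def ask_ollama(prompt, screenshot_path, model="llava"):
    with open(screenshot_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "images": [image_b64], "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]
```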
I have everything set up correctly and added $5 worth of credit to my OpenAI account, but I still keep getting this error. I don't believe I have exceeded my current quota; could someone shed some light on this? Thanks.
I just noticed that the model doesn't have access to scrolling up and down. Is this difficult to implement generally (asking mostly for Linux, but of course also interested in Mac and Windows)?
If so, I may try adding in a web mode and leverage Selenium to scroll.
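A minimal sketch of what a SCROLL action could look like with pyautogui, which the project already depends on; the step size is arbitrary, and on Linux this assumes an X11 session:

```python
# Scroll the active window; positive values scroll up, negative scroll down.
import pyautogui

def scroll(direction: str):
    amount = 500 if direction == "up" else -500
    pyautogui.scroll(amount)

# scroll("down")
```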
Trying to install SOC on Windows 11.
I get to the part where it says:
cat requirements.txt | xargs poetry add
This of course won't work on Windows so I used "@(cat requirements.txt) | %{&poetry add $_}" in Powershell, but I get the error mentioned in the title:
"Poetry could not find a pyproject.toml file in..."
Also tried poetry add dependency-name by copy-pasting the dependency name from requirements.txt, but I get the same error.
Ideas, anyone? Now that I think about it, I might be running too new a version of Poetry, but I'm not sure.
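The error itself means Poetry can't find a pyproject.toml, so running `poetry init` in the project root first (or cloning a version of the repo that ships one) is the prerequisite. After that, a cross-platform sketch of the `cat requirements.txt | xargs poetry add` step in plain Python, so it works the same in PowerShell:

```python
# Read requirements.txt and hand every dependency to `poetry add` in a single call.
# Assumes pyproject.toml already exists in the current directory.
import subprocess

with open("requirements.txt") as f:
    deps = [line.strip() for line in f if line.strip() and not line.startswith("#")]

subprocess.run(["poetry", "add", *deps], check=True)
```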
Collecting pyobjc-core==10.0 (from -r requirements.txt (line 32))
Using cached pyobjc-core-10.0.tar.gz (921 kB)
Installing build dependencies ... done
Getting requirements to build wheel ... error
error: subprocess-exited-with-error
× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> [2 lines of output]
running egg_info
error: PyObjC requires macOS to build
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error
× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.
note: This error originates from a subprocess, and is likely not a problem with pip.
I was wondering if there was a reason we only pick the top response, i.e. the 0th one. Instead, what if we asked the model to generate 9 responses and then used the one that shows up most frequently as the answer?
There's a possibility this wouldn't work for general actions, but I think it would work particularly well for my grid overlay approach, where, when the model tries to click on something, I simply ask it which grid cell it would like to click in. With 9 responses of a number between 0 and 15, or 0 and 3, I can just use whichever number was most popular.
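A sketch of that self-consistency idea, assuming the vision endpoint accepts the chat API's `n` parameter (if it doesn't, the same thing can be done with repeated single calls); the prompt and cell range are placeholders:

```python
# Request several completions for the same "which grid cell?" question and keep the
# most frequent answer instead of trusting the first one.
from collections import Counter
from openai import OpenAI

client = OpenAI()

def vote_on_cell(messages, n=9):
    response = client.chat.completions.create(
        model="gpt-4-vision-preview", messages=messages, n=n, max_tokens=5
    )
    answers = [choice.message.content.strip() for choice in response.choices]
    cell, votes = Counter(answers).most_common(1)[0]
    return cell, votes
```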
Whatever task I give, it keeps saying this. I don't know how to resolve it.
Hey guys, we've been training a very similar multimodal model called Atlas-1; however, we don't need to hard-code click positions like it appears here, because we trained our model to find UI elements directly and solve the hardest problems in automation. With the name Agent-1, it almost seems copied, but I hope that's not the case.
We introduce the idea of "Semantic Targets", which understand the underlying intent of a target, so it's robust even to future design changes.
You can see in our tutorial published last month that we can also search Google and much more, because Atlas-1 doesn't need to hard-code click positions: https://youtu.be/IQuBA7MvUas?si=lSaFpH0WMIKRtYrU
Feature Request:
Will it be possible to go to a specific web page, log in, perform test actions, and write the results to another file?
As mentioned in the Readme, cmd + L would probably be a better way to navigate to the search bar.
I also faced the issue of navigating to the search bar correctly, since different browsers have different locations for their search bar (maybe).
#39 (comment)
If everyone approves, I would go ahead and implement this?
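A small sketch of the cmd + L idea above using pyautogui, which the project already ships with; the URL typed here is just an example:

```python
# Focus the browser's address bar with a hotkey instead of clicking it, then type.
import platform
import pyautogui

def focus_address_bar():
    if platform.system() == "Darwin":
        pyautogui.hotkey("command", "l")
    else:  # most Windows/Linux browsers use Ctrl+L
        pyautogui.hotkey("ctrl", "l")

# focus_address_bar()
# pyautogui.write("github.com/OthersideAI/self-operating-computer", interval=0.05)
# pyautogui.press("enter")
```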
[Self-Operating Computer]
Hello, I can help you with anything. What would you like done?
[User]
google the word HI
Error parsing JSON: X get_image failed: error 8 (73, 0, 967)
[Self-Operating Computer][Error] something went wrong :(
[Self-Operating Computer][Error] AI response
Failed take action after looking at the screenshot
What could be the problem?
I have been researching computer automation for years.
Topics that you might be interested in that I have dug into:
I have dedicated repositories that you may be interested in:
Other similar projects that I am monitoring:
Apologies for my unorganized code structure. I am trying to improve the development experience with AI-generated documentation & usage demonstrations and a client-side LLM & semantic search, which may solve this long-standing task across all my previous repositories.
I tried running this on Ubuntu 20.04. It is getting the commands right in the terminal, but it is not actually navigating to the desired location. For example, I asked it to open Brave and navigate to this repository... It got the steps right, but it could not open Brave.
Currently, the application is set up to use Google Chrome by default, limiting accessibility and user experience for individuals using alternative browsers. This monolithic approach excludes a significant user base and hinders the platform's adaptability to diverse browser environments.
This issue advocates for a transition from Chrome-centric development to a more inclusive approach that supports a broader range of web browsers. The goal is to enhance accessibility, improve user experience, and adhere to web standards that promote compatibility across different platforms.
When testing, I realized that on macOS you can open your default browser by just typing "browser" in the search bar.
So instead of Google Chrome, you can search for "browser" and press Enter; it will open the default browser without requiring the user to use Google Chrome. Since most browsers have the search bar in the same location, you can still use the default setting for it.
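Another browser-agnostic option, just as a sketch: Python's standard library can open a URL in whatever default browser the user has configured, with no Chrome assumption at all:

```python
# Open a URL in the system default browser (works on macOS, Windows, and Linux).
import webbrowser

webbrowser.open("https://www.google.com")
```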
🌐 As of now, GPT-4 Vision is expensive to use and GPT-4 usage is high.
Let consumers use their own API base by making the following change:
import openai
# Set your custom API base URL
openai.api_base = "http://myproxy-gpt.com/chatcompletions"
📈 This change will attract more users and encourage more testing. 🧪 People will start testing it more often, which will ultimately contribute to improving this product. 🛠️
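The snippet above targets the legacy 0.x SDK. For the 1.x SDK that the project's tracebacks show, a hedged sketch of the equivalent (the proxy URL is a placeholder):

```python
# With openai>=1.0, the custom API base is passed to the client constructor.
from openai import OpenAI

client = OpenAI(base_url="http://myproxy-gpt.com/v1")  # api_key still read from OPENAI_API_KEY
```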
I have three 32" 4K monitors for my Mac Studio and keep getting this error for any command. I'm curious which monitor it selects for the screenshot. I can hear the audible screenshot noise, and then the error appears.
Error parsing JSON: 'ascii' codec can't encode character '\u201c' in position 7: ordinal not in range(128)
[Self-Operating Computer][Error] something went wrong :(
[Self-Operating Computer][Error] AI response
Failed take action after looking at the screenshot
Complete log:
[Self-Operating Computer]
Hello, I can help you with anything. What would you like done?
[User]
Open the Google Photos inside the Google Chrome
Error parsing JSON: Error code: 404 - {'error': {'message': 'The model `gpt-4-vision-preview` does not exist or you do not have access to it. Learn more: https://help.openai.com/en/articles/7102672-how-can-i-access-gpt-4.', 'type': 'invalid_request_error', 'param': None, 'code': 'model_not_found'}}
[Self-Operating Computer][Error] something went wrong :(
[Self-Operating Computer][Error] AI response
Failed take action after looking at the screenshot
(venv) syzygy@Syzygys-MacBook-Pro self-operating-computer %
I have had GPT-4 since it launched, and I can also use my keys with other tools like sgpt, but it appears that for some reason my account lacks this model. Any ideas? I have tried the suggested URL, but there isn't much there.
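One quick way to confirm what the key can actually see, as a hedged sketch (model IDs vary by account and date):

```python
# List the models visible to this API key and check for the vision model.
from openai import OpenAI

client = OpenAI()
available = [m.id for m in client.models.list()]
print("gpt-4-vision-preview" in available)
print([m for m in available if "vision" in m])
```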
I installed SOC on my computer, a MacBook Pro with an M1 Pro ARM chip.
Today was my first try using SOC, but it repeatedly opens my browser and doesn't go any further.
I don't know why.
Here is the log; maybe it can help you find the problem.
[Self-Operating Computer]
Hello, I can help you with anything. What would you like done?
[User]
open brave, go to google drive, write a poem about spring.
[Self-Operating Computer] [Act] SEARCH Brave
[Self-Operating Computer] [Act] SEARCH COMPLETE Open program: Brave
[Self-Operating Computer] [Act] SEARCH Brave
[Self-Operating Computer] [Act] SEARCH COMPLETE Open program: Brave
[Self-Operating Computer] [Act] SEARCH Brave
[Self-Operating Computer] [Act] SEARCH COMPLETE Open program: Brave
[Self-Operating Computer] [Act] SEARCH Brave
[Self-Operating Computer] [Act] SEARCH COMPLETE Open program: Brave
[Self-Operating Computer] [Act] SEARCH Brave
[Self-Operating Computer] [Act] SEARCH COMPLETE Open program: Brave
^CTraceback (most recent call last):
File "/Users/chenyibin/self-operating-computer/venv/bin/operate", line 33, in
sys.exit(load_entry_point('self-operating-computer==1.0.0', 'console_scripts', 'operate')())
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/operate/main.py", line 612, in main_entry
main(args.model)
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/operate/main.py", line 188, in main
response = get_next_action(model, messages, objective)
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/operate/main.py", line 276, in get_next_action
content = get_next_action_from_openai(messages, objective)
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/operate/main.py", line 340, in get_next_action_from_openai
response = client.chat.completions.create(
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/openai/_utils/_utils.py", line 299, in wrapper
return func(*args, **kwargs)
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/openai/resources/chat/completions.py", line 594, in create
return self._post(
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/openai/_base_client.py", line 1055, in post
return cast(ResponseT, self.request(cast_to, opts, stream=stream, stream_cls=stream_cls))
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/openai/_base_client.py", line 834, in request
return self._request(
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/openai/_base_client.py", line 858, in _request
response = self._client.send(request, auth=self.custom_auth, stream=stream)
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/httpx/_client.py", line 901, in send
response = self._send_handling_auth(
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/httpx/_client.py", line 929, in _send_handling_auth
response = self._send_handling_redirects(
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/httpx/_client.py", line 966, in _send_handling_redirects
response = self._send_single_request(request)
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/httpx/_client.py", line 1002, in _send_single_request
response = transport.handle_request(request)
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/httpx/_transports/default.py", line 228, in handle_request
resp = self._pool.handle_request(req)
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/httpcore/_sync/connection_pool.py", line 268, in handle_request
raise exc
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/httpcore/_sync/connection_pool.py", line 251, in handle_request
response = connection.handle_request(request)
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/httpcore/_sync/http_proxy.py", line 344, in handle_request
return self._connection.handle_request(request)
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/httpcore/_sync/http11.py", line 133, in handle_request
raise exc
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/httpcore/_sync/http11.py", line 111, in handle_request
) = self._receive_response_headers(**kwargs)
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/httpcore/_sync/http11.py", line 176, in _receive_response_headers
event = self._receive_event(timeout=timeout)
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/httpcore/_sync/http11.py", line 212, in _receive_event
data = self._network_stream.read(
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/httpcore/_backends/sync.py", line 126, in read
return self._sock.recv(max_bytes)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/ssl.py", line 1259, in recv
return self.read(buflen)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/ssl.py", line 1132, in read
return self._sslobj.read(len)
KeyboardInterrupt
Lastly, thanks again for your great work.
"Traceback (most recent call last):
File "C:\Users\KANNAN\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\KANNAN\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in run_code
exec(code, run_globals)
File "C:\Users\KANNAN\AppData\Local\Programs\Python\Python310\Scripts\operate.exe_main.py", line 4, in
File "C:\Users\KANNAN\AppData\Local\Programs\Python\Python310\lib\site-packages\operate\main.py", line 30, in
client = OpenAI()
File "C:\Users\KANNAN\AppData\Local\Programs\Python\Python310\lib\site-packages\openai_client.py", line 93, in init
raise OpenAIError(
openai.OpenAIError: The api_key client option must be set either by passing api_key to the client or by setting the OPENAI_API_KEY environment variable" I'm getting this error when entering 'operate'; can anyone help me solve it?
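The error means the key isn't visible to the process. A hedged pre-flight check, as a sketch; on Windows, `setx OPENAI_API_KEY <key>` (then open a new terminal) or `set OPENAI_API_KEY=<key>` in the current session should both make it visible:

```python
# Verify the API key is set in this shell/session before running `operate`.
import os

if not os.getenv("OPENAI_API_KEY"):
    raise SystemExit("OPENAI_API_KEY is not set in this shell/session.")
```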
Hi, just dropping a friendly pointer to my project at Tophness/ChatGPT-PC-Controller.
I noticed you're using pyautogui.
I experimented with this for a long time, but it isn't compatible with the latest Windows UI frameworks like UIA and WPF, only the very old Win32 API that's deprecated in Windows 11.
I notice you're just using pyautogui.click() and pyautogui.write() instead of directly finding/reading/editing/triggering Windows control elements anyway, but it's much, much more powerful if you do.
GPT-Vision wasn't even available at the time I made it. It just directly knew what to do blindly.
Directly using control elements means it can run in the background without needing to take over the user's mouse and keyboard, or even hide the app it's controlling entirely.
It could just browse the web, crunch some numbers on some data it found and send off emails about it in parallel while you're editing a video, and you wouldn't even notice the difference.
It's also instant (no need to wait for delays between clicks and presses), and there's no room for error.
Even an unexpected window popping up is no issue, since it doesn't need the window to be active to control it, and it can activate it automatically and wait for it to be active if need be.
ChatGPT can also write out whole scripts that do the job from a single response.
For this and many other reasons (like reading pixel RGB values), I recommend using AutoIt.
I started off making an interpreter for AutoIt that took in more natural language and wrote the code itself, but it seems ChatGPT is well versed enough in AutoIt that you can just directly hook the DLL calls using an AST.
I did this before OpenAI's Function Calling API existed, so it would only be that much more powerful now.
Feel free to copy anything I did, and hit me up if you'd like to merge these two in some way.
I would say you might as well leave a separate pyautogui mode that works as-is while we merge things like vision over to AutoIt mode.
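Not AutoIt, but a Python stand-in for the same control-element approach described above: driving Windows UIA elements directly with pywinauto instead of synthesizing clicks at pixel coordinates. Window titles and control names here are assumptions for illustration:

```python
# Launch Notepad and type into its edit control via UIA, without moving the user's
# mouse or relying on the window's pixel position.
from pywinauto import Application

app = Application(backend="uia").start("notepad.exe")
win = app.window(title_re=".*Notepad")
win.wait("ready", timeout=10)
win.child_window(control_type="Edit").type_keys("Hello from UIA", with_spaces=True)
```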
Consider adding coding models like GPT-4 Turbo (128K token limit) to better navigate the HTML DOM using Playwright by @microsoft. See a similar implementation here: BrowserGPT by @mayt.
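A hedged sketch of that Playwright angle: dump a trimmed list of clickable DOM elements that a long-context model can reason over instead of raw pixels. The page URL and selectors are illustrative only:

```python
# Collect tag/text/id for clickable elements so a text model can pick a target.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://www.google.com")
    elements = page.eval_on_selector_all(
        "a, button, input, [role='button']",
        "els => els.map(e => ({tag: e.tagName, text: e.innerText.slice(0, 40), id: e.id}))",
    )
    print(elements[:10])  # feed this summary to the model alongside the objective
    browser.close()
```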
I noticed that you currently seem to apply a grid to the images to assist the vision model:
And mention this in the README:
Current Challenges
Note: The GPT-4v's error rate in estimating XY mouse click locations is currently quite high. This framework aims to track the progress of multimodal models over time, aspiring to achieve human-level performance in computer operation.
I was wondering, have you looked at using Set-of-Mark (SoM) Visual Prompting for GPT-4V or similar techniques?
A bit of a link dump from one of my references:
Set-of-Mark Prompting for LMMs
Set-of-Mark Visual Prompting for GPT-4V
We present Set-of-Mark (SoM) prompting, simply overlaying a number of spatial and speakable marks on the images, to unleash the visual grounding abilities in the strongest LMM -- GPT-4V. Let's using visual prompting for vision!
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
We present Set-of-Mark (SoM), a new visual prompting method, to unleash the visual grounding abilities of large multimodal models (LMMs), such as GPT-4V. As illustrated in Fig. 1 (right), we employ off-the-shelf interactive segmentation models, such as SEEM/SAM, to partition an image into regions at different levels of granularity, and overlay these regions with a set of marks e.g., alphanumerics, masks, boxes. Using the marked image as input, GPT-4V can answer the questions that require visual grounding. We perform a comprehensive empirical study to validate the effectiveness of SoM on a wide range of fine-grained vision and multimodal tasks. For example, our experiments show that GPT-4V with SoM in zero-shot setting outperforms the state-of-the-art fully-finetuned referring expression comprehension and segmentation model on RefCOCOg. Code for SoM prompting is made public at: this https URL.
Segment Anything
The repository provides code for running inference with the SegmentAnything Model (SAM), links for downloading the trained model checkpoints, and example notebooks that show how to use the model.
The Segment Anything Model (SAM) produces high quality object masks from input prompts such as points or boxes, and it can be used to generate masks for all objects in an image. It has been trained on a dataset of 11 million images and 1.1 billion masks, and has strong zero-shot performance on a variety of segmentation tasks.
Official implementation of the paper "Semantic-SAM: Segment and Recognize Anything at Any Granularity"
In this work, we introduce Semantic-SAM, a universal image segmentation model to enable segment and recognize anything at any desired granularity. We have trained on the whole SA-1B dataset and our model can reproduce SAM and beyond it.
Segment everything for one image. We output controllable granularity masks from semantic, instance to part level when using different granularity prompts.
SEEM: Segment Everything Everywhere All at Once
[NeurIPS 2023] Official implementation of the paper "Segment Everything Everywhere All at Once"
We introduce SEEM that can Segment Everything Everywhere with Multi-modal prompts all at once. SEEM allows users to easily segment an image using prompts of different types including visual prompts (points, marks, boxes, scribbles and image segments) and language prompts (text and audio), etc. It can also work with any combination of prompts or generalize to custom prompts!
Official implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"
[ICCV 2023] Official implementation of the paper "A Simple Framework for Open-Vocabulary Segmentation and Detection"
[CVPR 2023] Official implementation of the paper "Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation"
[ICCV2023] VLPart: Going Denser with Open-Vocabulary Part Segmentation
Object detection has been expanded from a limited number of categories to open vocabulary. Moving forward, a complete intelligent vision system requires understanding more fine-grained object descriptions, object parts. In this work, we propose a detector with the ability to predict both open-vocabulary objects and their part segmentation.
GPT-4V-Act: Chromium Copilot
AI agent using GPT-4V(ision) capable of using a mouse/keyboard to interact with web UI
GPT-4V-Act serves as an eloquent multimodal AI assistant that harmoniously combines GPT-4V(ision) with a web browser. It's designed to mirror the input and output of a human operator—primarily screen feedback and low-level mouse/keyboard interaction. The objective is to foster a smooth transition between human-computer operations, facilitating the creation of tools that considerably boost the accessibility of any user interface (UI), aid workflow automation, and enable automated UI testing.
GPT-4V-Act leverages both GPT-4V(ision) and Set-of-Mark Prompting, together with a tailored auto-labeler. This auto-labeler assigns a unique numerical ID to each interactable UI element.
By incorporating a task and a screenshot as input, GPT-4V-Act can deduce the subsequent action required to accomplish a task. For mouse/keyboard output, it can refer to the numerical labels for exact pixel coordinates.
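To make the numeric-label idea concrete, a small sketch of the "label ID to pixel coordinate" mapping that Set-of-Mark-style labeling relies on; the boxes would come from whatever detector or segmenter produced the marks:

```python
# Map each detected box to a label ID and its center point. If the model answers "1",
# the click is issued at targets[1] instead of a model-estimated coordinate.
def label_boxes(boxes):
    """boxes: list of (x1, y1, x2, y2). Returns {label_id: (center_x, center_y)}."""
    return {
        i: ((x1 + x2) // 2, (y1 + y2) // 2)
        for i, (x1, y1, x2, y2) in enumerate(boxes)
    }

targets = label_boxes([(100, 200, 300, 240), (400, 50, 520, 90)])
```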
GPT-4V(ision)
👀🧠 GPT-4 Vision x 💪⌨️ Vimium = Autonomous Web Agent
This project leverages GPT4V to create an autonomous / interactive web agent. The action space are discretized by Vimium.
Hi, I guess --voice is non-functional. Ref: #81.
Maybe someone could make a commit to mention something like "(in progress)" in front of --voice, or maybe remove --voice for now?
This issue aims to enhance the code's reliability by improving error handling in critical functions, such as get_next_action_from_openai and summarize. Currently, the error handling within these functions is basic and might not cover all potential errors that could occur during execution.
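A sketch of the kind of error handling this proposes: wrapping the OpenAI call with specific exception types and a bounded retry instead of a bare catch-all. The helper name and retry counts are placeholders, not the project's actual code:

```python
# Retry transient API failures with a simple linear backoff; surface other errors.
import time
from openai import APIError, APITimeoutError, RateLimitError

def call_with_retries(fn, retries=3, backoff=2.0):
    for attempt in range(retries):
        try:
            return fn()
        except (APITimeoutError, RateLimitError):
            if attempt == retries - 1:
                raise
            time.sleep(backoff * (attempt + 1))
        except APIError:
            raise  # non-retryable API errors should surface immediately
```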