othersideai / self-operating-computer
A framework to enable multimodal models to operate a computer.
Home Page: https://www.hyperwriteai.com/self-operating-computer
License: MIT License
The model gpt-4-vision-preview does not exist or you do not have access to it. Learn more: https://help.openai.com/en/articles/7102672-how-can-i-access-gpt-4
Can a coordinate grid be added to every screenshot so the clicks are not so far off?
Since it misclicks a lot, you could either train a model to do image segmentation or, with some clever prompt engineering, add a barebones grid and ask it to solve the puzzle "in which grid cell can the search button be found?" That should make it more robust, right?
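For what it's worth, a minimal sketch of that barebones grid overlay, assuming Pillow is available; the file name and the 4x4 cell count are just placeholders:

```python
# Sketch: overlay a labeled coordinate grid on a screenshot so the model can
# answer "which cell contains the search button?" instead of guessing raw pixels.
# Assumes Pillow; "screenshot.png" and the 4x4 grid are illustrative choices.
from PIL import Image, ImageDraw

def overlay_grid(path="screenshot.png", cols=4, rows=4):
    img = Image.open(path).convert("RGB")
    draw = ImageDraw.Draw(img)
    w, h = img.size
    cell_w, cell_h = w / cols, h / rows
    for r in range(rows):
        for c in range(cols):
            x0, y0 = c * cell_w, r * cell_h
            draw.rectangle([x0, y0, x0 + cell_w, y0 + cell_h], outline="red", width=2)
            draw.text((x0 + 5, y0 + 5), str(r * cols + c), fill="red")
    img.save("screenshot_grid.png")
    return img

overlay_grid()
```

The model then only has to name a cell ID, and the click can be issued at that cell's center.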
I guess most people actually use Windows, not macOS.
Error parsing JSON: Request timed out.
[Self-Operating Computer][Error] something went wrong :(
[Self-Operating Computer][Error] AI response
Failed take action after looking at the screenshot
search for latest news on AI and give top two news
Error parsing JSON: [Errno 2] No such file or directory: 'screencapture'
[Self-Operating Computer][Error] something went wrong :(
[Self-Operating Computer][Error] AI response
Failed take action after looking at the screenshot
Getting this error on Linux (Arch).
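The `screencapture` binary only exists on macOS, which is why the Arch run fails at the screenshot step. A hedged sketch of a platform-aware capture helper (paths and fallbacks are assumptions, not the project's actual code):

```python
# Sketch of a platform-aware screenshot helper. macOS has `screencapture`;
# elsewhere, fall back to pyautogui.screenshot(), which the project already depends on
# (on Linux it needs an X11 session with a backend such as scrot/gnome-screenshot).
import platform
import subprocess
import pyautogui

def capture_screen(path="screenshot.png"):
    if platform.system() == "Darwin":
        subprocess.run(["screencapture", "-x", path], check=True)
    else:
        pyautogui.screenshot(path)
    return path
```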
Hi,
I'm Mr. Cryptic, a very friendly guy.
When Linux? I wouldn't touch a Mac with a five-foot stick.
@joshbickett Was going through the codebase and found that this app lacks a fundamental Dockerfile.
Maybe a YOLO object detection model trained on basic UI elements to get coordinates? Or something like SAM?
I mean, as soon as there is a small model, GPT-4 can check the dataset and add more correct examples to the training set.
The feature request is this.
Re-opening. @michaelhhogue, are you an official contributor to this project? Could you comment on where the name Agent-1 came from? This appears to have blatantly ripped off the work of researchers working hard for over a year.
Any open source contributors should consider that this firm raised millions and is scamming open source devs into stealing work for them. They never responded to our claims they stole this work, even down to the name agent-1, which is incredibly shameful if true and our attorneys would love to hear from an official contributor.
We will be publishing our solution open source as well, and here is Atlas-1, which we've been training for over a year and published last month: https://youtu.be/IQuBA7MvUas
What is Agent-1 and where did the name come from? It appears these guys blatantly ripped off our work and are now scamming open source devs into copying it for them.
Would it be advantageous to keep a collage of downsampled previous images, maybe at 160 px x 90 px, stacked in a line left to right, one after another, and constantly pass this image as additional context for each action, as in "here is a timeline of previous states the model has traversed"?
FYI: I would be happy to draft and code this feature out!
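A rough sketch of the collage idea above, assuming Pillow; the file handling and thumbnail size are illustrative only:

```python
# Downsample each previous screenshot to 160x90 and paste them left to right into one
# strip that can be sent as extra "timeline" context alongside the current screenshot.
from PIL import Image

def build_timeline(paths, thumb_size=(160, 90)):
    thumbs = [Image.open(p).convert("RGB").resize(thumb_size) for p in paths]
    strip = Image.new("RGB", (thumb_size[0] * len(thumbs), thumb_size[1]), "white")
    for i, thumb in enumerate(thumbs):
        strip.paste(thumb, (i * thumb_size[0], 0))
    return strip

# strip = build_timeline(["step1.png", "step2.png", "step3.png"])
# strip.save("timeline.png")
```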
What is the cost?
I played with GPT-4V on other projects and it definitely has a hard time figuring out coordinates. I used another model trained on image identification to find the coordinates of the box drawn around the detected object, and then I can pass them to GPT-4 to perform an action. For your use case, I just tested this model: https://huggingface.co/foduucom/web-form-ui-field-detection. Far from perfect, but maybe an idea to build on. If your self-operating computer can detect and get the proper coordinates of the input fields in an image, it could help, or at least add a level of redundancy to improve accuracy in clicking and inputting things in the right places.
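A hedged sketch of using that detector to get candidate click targets before asking the vision model what to do with them. It assumes `ultralytics` and `huggingface_hub` are installed and that the repo ships its weights as "best.pt" (check the model card for the actual file name):

```python
# Detect web-form fields in a screenshot and print their box centers, which can then
# be handed to GPT-4V instead of asking it to estimate pixel coordinates itself.
from huggingface_hub import hf_hub_download
from ultralytics import YOLO

weights = hf_hub_download("foduucom/web-form-ui-field-detection", "best.pt")
model = YOLO(weights)

results = model("screenshot.png")
for box in results[0].boxes:
    x1, y1, x2, y2 = box.xyxy[0].tolist()
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    print(f"field at center ({cx:.0f}, {cy:.0f}), confidence {float(box.conf):.2f}")
```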
I've been reviewing the project's codebase and noticed that all the logic and functions are currently contained within a single file. This structure, while functional, can make the code challenging to read and maintain. To improve the readability and maintainability of the code, I propose restructuring it into separate, focused modules.
I am eager to contribute to this enhancement and have already started working on a preliminary refactoring plan. My goal is to collaborate with the community to develop a structure that best suits our project's needs. I look forward to hearing your thoughts and suggestions on this proposal.
Hello and thanks for this beautiful repo.
Would you consider adding open source model support, especially with Ollama?
Best,
Orkut
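Not an official integration, but a sketch of what an Ollama-backed vision call could look like via the local REST API, assuming an Ollama server on the default port and a vision-capable model such as llava pulled locally:

```python
# Send a prompt plus a base64-encoded screenshot to a local Ollama server and return
# the model's text response. Model name and timeout are placeholder choices.
import base64
import requests

def ask_ollama(prompt, screenshot_path, model="llava"):
    with open(screenshot_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "images": [image_b64], "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]
```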
I have everything set up correctly and added $5 worth of credit to my OpenAI account, but I still keep getting this error. I don't believe I have exceeded my current quota; could someone shed some light on this? Thanks.
I just noticed that the model doesn't have access to scrolling up and down. Is this difficult to implement generally (asking mostly for Linux, but of course also interested in Mac and Windows)?
If so, I may try adding in a web mode and leverage Selenium to scroll.
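A minimal sketch of what a SCROLL action could look like with pyautogui, which the project already depends on; the step size is arbitrary, and on Linux this assumes an X11 session:

```python
# Scroll the active window; positive values scroll up, negative scroll down.
import pyautogui

def scroll(direction: str):
    amount = 500 if direction == "up" else -500
    pyautogui.scroll(amount)

# scroll("down")
```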
Trying to install SOC on Windows 11.
I get to the part where it says:
cat requirements.txt | xargs poetry add
This of course won't work on Windows so I used "@(cat requirements.txt) | %{&poetry add $_}" in Powershell, but I get the error mentioned in the title:
"Poetry could not find a pyproject.toml file in..."
Also tried poetry add dependency-name by copy-pasting the dependency name from requirements.txt, but I get the same error.
Ideas, anyone? Now that I think about it, I might be running too new a version of Poetry, but I'm not sure.
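The error itself means Poetry can't find a pyproject.toml, so running `poetry init` in the project root first (or cloning a version of the repo that ships one) is the prerequisite. After that, a cross-platform sketch of the `cat requirements.txt | xargs poetry add` step in plain Python, so it works the same in PowerShell:

```python
# Read requirements.txt and hand every dependency to `poetry add` in a single call.
# Assumes pyproject.toml already exists in the current directory.
import subprocess

with open("requirements.txt") as f:
    deps = [line.strip() for line in f if line.strip() and not line.startswith("#")]

subprocess.run(["poetry", "add", *deps], check=True)
```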
Collecting pyobjc-core==10.0 (from -r requirements.txt (line 32))
Using cached pyobjc-core-10.0.tar.gz (921 kB)
Installing build dependencies ... done
Getting requirements to build wheel ... error
error: subprocess-exited-with-error
× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> [2 lines of output]
running egg_info
error: PyObjC requires macOS to build
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error
× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.
note: This error originates from a subprocess, and is likely not a problem with pip.
I was wondering if there was a reason we only pick the top response, i.e. the 0th one. Instead, what if we asked the model to generate 9 responses and then used the one that shows up most frequently as the answer?
There's a possibility this wouldn't work for general actions, but I think it would work particularly well for my grid overlay approach, where, when the model tries to click on something, I simply ask it which grid cell it would like to click in. With 9 responses of a number between 0 and 15, or 0 and 3, I can just use whichever number was most popular.
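A sketch of that self-consistency idea, assuming the vision endpoint accepts the chat API's `n` parameter (if it doesn't, the same thing can be done with repeated single calls); the prompt and cell range are placeholders:

```python
# Request several completions for the same "which grid cell?" question and keep the
# most frequent answer instead of trusting the first one.
from collections import Counter
from openai import OpenAI

client = OpenAI()

def vote_on_cell(messages, n=9):
    response = client.chat.completions.create(
        model="gpt-4-vision-preview", messages=messages, n=n, max_tokens=5
    )
    answers = [choice.message.content.strip() for choice in response.choices]
    cell, votes = Counter(answers).most_common(1)[0]
    return cell, votes
```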
Whatever task I give, it keeps saying this. I don't know how to resolve it.
Hey guys, we've been training a very similar multimodal model called Atlas-1; however, we don't need to hard-code click positions like it appears here, because we trained our model to find UI elements directly and solve the hardest problems in automation. With the name Agent-1, it almost seems copied, but I hope that's not the case.
We introduce the idea of "Semantic Targets", which understand the underlying intent of a target, so it's robust even to future design changes.
You can see in our tutorial published last month that we can also search Google and much more, because Atlas-1 doesn't need to hard-code click positions: https://youtu.be/IQuBA7MvUas?si=lSaFpH0WMIKRtYrU
Feature Request:
Will it be possible to go to a specific web page, log in, perform test actions, and write the results to another file?
As mentioned in the Readme, cmd + L would probably be a better way to navigate to the search bar.
I also faced the issue of navigating to the search bar correctly, since different browsers have different locations for their search bar (maybe).
#39 (comment)
If everyone approves, I would go ahead and implement this?
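A small sketch of the cmd + L idea above using pyautogui, which the project already ships with; the URL typed here is just an example:

```python
# Focus the browser's address bar with a hotkey instead of clicking it, then type.
import platform
import pyautogui

def focus_address_bar():
    if platform.system() == "Darwin":
        pyautogui.hotkey("command", "l")
    else:  # most Windows/Linux browsers use Ctrl+L
        pyautogui.hotkey("ctrl", "l")

# focus_address_bar()
# pyautogui.write("github.com/OthersideAI/self-operating-computer", interval=0.05)
# pyautogui.press("enter")
```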
[Self-Operating Computer]
Hello, I can help you with anything. What would you like done?
[User]
google the word HI
Error parsing JSON: X get_image failed: error 8 (73, 0, 967)
[Self-Operating Computer][Error] something went wrong :(
[Self-Operating Computer][Error] AI response
Failed take action after looking at the screenshot
What could be the problem?
I have been researching computer automation for years.
Topics that you might be interested in that I have dug into:
I have dedicated repositories that you may be interested in:
Other similar projects that I am monitoring:
Apologies for my unorganized code structure. I am trying to improve the development experience with AI-generated documentation & usage demonstrations and a client-side LLM & semantic search, which may solve this long-standing task across all my previous repositories.
I tried running this on Ubuntu 20.04. It is getting the commands right in the terminal, but it is not actually navigating to the desired location. For example, I asked it to open Brave and navigate to this repository... It got the steps right, but it could not open Brave.
Currently, the application is set up to use Google Chrome by default, limiting accessibility and user experience for individuals using alternative browsers. This monolithic approach excludes a significant user base and hinders the platform's adaptability to diverse browser environments.
This issue advocates for a transition from Chrome-centric development to a more inclusive approach that supports a broader range of web browsers. The goal is to enhance accessibility, improve user experience, and adhere to web standards that promote compatibility across different platforms.
When testing, I realized that on macOS you can open your default browser by just typing "browser" in the search bar.
So instead of Google Chrome, you can search for "browser" and press Enter; it will open the default browser without requiring the user to use Google Chrome. Since most browsers have the search bar in the same location, you can still use the default setting for it.
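Another browser-agnostic option, just as a sketch: Python's standard library can open a URL in whatever default browser the user has configured, with no Chrome assumption at all:

```python
# Open a URL in the system default browser (works on macOS, Windows, and Linux).
import webbrowser

webbrowser.open("https://www.google.com")
```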
🌐 As of now, GPT-4 Vision is expensive to use and GPT-4 usage is high.
Let consumers use their own API base by making the following change:
import openai
# Set your custom API base URL
openai.api_base = "http://myproxy-gpt.com/chatcompletions"
📈 This change will attract more users and encourage more testing. 🧪 People will start testing it more often, which will ultimately contribute to improving this product. 🛠️
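The snippet above targets the legacy 0.x SDK. For the 1.x SDK that the project's tracebacks show, a hedged sketch of the equivalent (the proxy URL is a placeholder):

```python
# With openai>=1.0, the custom API base is passed to the client constructor.
from openai import OpenAI

client = OpenAI(base_url="http://myproxy-gpt.com/v1")  # api_key still read from OPENAI_API_KEY
```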
I have three 32" 4K monitors for my Mac Studio and keep getting this error for any command. I'm curious which monitor it selects for the screenshot. I can hear the audible screenshot noise, and then the error appears.
Error parsing JSON: 'ascii' codec can't encode character '\u201c' in position 7: ordinal not in range(128)
[Self-Operating Computer][Error] something went wrong :(
[Self-Operating Computer][Error] AI response
Failed take action after looking at the screenshot
Complete log:
[Self-Operating Computer]
Hello, I can help you with anything. What would you like done?
[User]
Open the Google Photos inside the Google Chrome
Error parsing JSON: Error code: 404 - {'error': {'message': 'The model `gpt-4-vision-preview` does not exist or you do not have access to it. Learn more: https://help.openai.com/en/articles/7102672-how-can-i-access-gpt-4.', 'type': 'invalid_request_error', 'param': None, 'code': 'model_not_found'}}
[Self-Operating Computer][Error] something went wrong :(
[Self-Operating Computer][Error] AI response
Failed take action after looking at the screenshot
(venv) syzygy@Syzygys-MacBook-Pro self-operating-computer %
I have had GPT-4 since it launched, and I can also use my keys with other tools like sgpt, but it appears that for some reason my account lacks this model. Any ideas? I have tried the suggested URL, but there isn't much there.
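One quick way to confirm what the key can actually see, as a hedged sketch (model IDs vary by account and date):

```python
# List the models visible to this API key and check for the vision model.
from openai import OpenAI

client = OpenAI()
available = [m.id for m in client.models.list()]
print("gpt-4-vision-preview" in available)
print([m for m in available if "vision" in m])
```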
I installed SOC on my computer, a MacBook Pro with an M1 Pro ARM chip.
Today was my first try using SOC, but it repeatedly opens my browser and doesn't go any further.
I don't know why.
Here is the log; maybe it can help you find the problem.
[Self-Operating Computer]
Hello, I can help you with anything. What would you like done?
[User]
open brave, go to google drive, write a poem about spring.
[Self-Operating Computer] [Act] SEARCH Brave
[Self-Operating Computer] [Act] SEARCH COMPLETE Open program: Brave
[Self-Operating Computer] [Act] SEARCH Brave
[Self-Operating Computer] [Act] SEARCH COMPLETE Open program: Brave
[Self-Operating Computer] [Act] SEARCH Brave
[Self-Operating Computer] [Act] SEARCH COMPLETE Open program: Brave
[Self-Operating Computer] [Act] SEARCH Brave
[Self-Operating Computer] [Act] SEARCH COMPLETE Open program: Brave
[Self-Operating Computer] [Act] SEARCH Brave
[Self-Operating Computer] [Act] SEARCH COMPLETE Open program: Brave
^CTraceback (most recent call last):
File "/Users/chenyibin/self-operating-computer/venv/bin/operate", line 33, in
sys.exit(load_entry_point('self-operating-computer==1.0.0', 'console_scripts', 'operate')())
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/operate/main.py", line 612, in main_entry
main(args.model)
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/operate/main.py", line 188, in main
response = get_next_action(model, messages, objective)
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/operate/main.py", line 276, in get_next_action
content = get_next_action_from_openai(messages, objective)
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/operate/main.py", line 340, in get_next_action_from_openai
response = client.chat.completions.create(
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/openai/_utils/_utils.py", line 299, in wrapper
return func(*args, **kwargs)
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/openai/resources/chat/completions.py", line 594, in create
return self._post(
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/openai/_base_client.py", line 1055, in post
return cast(ResponseT, self.request(cast_to, opts, stream=stream, stream_cls=stream_cls))
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/openai/_base_client.py", line 834, in request
return self._request(
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/openai/_base_client.py", line 858, in _request
response = self._client.send(request, auth=self.custom_auth, stream=stream)
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/httpx/_client.py", line 901, in send
response = self._send_handling_auth(
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/httpx/_client.py", line 929, in _send_handling_auth
response = self._send_handling_redirects(
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/httpx/_client.py", line 966, in _send_handling_redirects
response = self._send_single_request(request)
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/httpx/_client.py", line 1002, in _send_single_request
response = transport.handle_request(request)
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/httpx/_transports/default.py", line 228, in handle_request
resp = self._pool.handle_request(req)
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/httpcore/_sync/connection_pool.py", line 268, in handle_request
raise exc
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/httpcore/_sync/connection_pool.py", line 251, in handle_request
response = connection.handle_request(request)
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/httpcore/_sync/http_proxy.py", line 344, in handle_request
return self._connection.handle_request(request)
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/httpcore/_sync/http11.py", line 133, in handle_request
raise exc
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/httpcore/_sync/http11.py", line 111, in handle_request
) = self._receive_response_headers(**kwargs)
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/httpcore/_sync/http11.py", line 176, in _receive_response_headers
event = self._receive_event(timeout=timeout)
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/httpcore/_sync/http11.py", line 212, in _receive_event
data = self._network_stream.read(
File "/Users/chenyibin/self-operating-computer/venv/lib/python3.10/site-packages/httpcore/_backends/sync.py", line 126, in read
return self._sock.recv(max_bytes)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/ssl.py", line 1259, in recv
return self.read(buflen)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/ssl.py", line 1132, in read
return self._sslobj.read(len)
KeyboardInterrupt
Lastly, thanks again for your great work.
"Traceback (most recent call last):
File "C:\Users\KANNAN\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\KANNAN\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in run_code
exec(code, run_globals)
File "C:\Users\KANNAN\AppData\Local\Programs\Python\Python310\Scripts\operate.exe_main.py", line 4, in
File "C:\Users\KANNAN\AppData\Local\Programs\Python\Python310\lib\site-packages\operate\main.py", line 30, in
client = OpenAI()
File "C:\Users\KANNAN\AppData\Local\Programs\Python\Python310\lib\site-packages\openai_client.py", line 93, in init
raise OpenAIError(
openai.OpenAIError: The api_key client option must be set either by passing api_key to the client or by setting the OPENAI_API_KEY environment variable" I'm getting this error when entering 'operate'; can anyone help me solve it?
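The error means the key isn't visible to the process. A hedged pre-flight check, as a sketch; on Windows, `setx OPENAI_API_KEY <key>` (then open a new terminal) or `set OPENAI_API_KEY=<key>` in the current session should both make it visible:

```python
# Verify the API key is set in this shell/session before running `operate`.
import os

if not os.getenv("OPENAI_API_KEY"):
    raise SystemExit("OPENAI_API_KEY is not set in this shell/session.")
```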
Hi, just dropping a friendly pointer to my project at Tophness/ChatGPT-PC-Controller.
I noticed you're using pyautogui.
I experimented with this for a long time, but it isn't compatible with the latest Windows UI frameworks like UIA and WPF, only the very old Win32 API that's deprecated in Windows 11.
I notice you're just using pyautogui.click() and pyautogui.write() instead of directly finding/reading/editing/triggering Windows control elements anyway, but it's much, much more powerful if you do.
GPT-Vision wasn't even available at the time I made it. It just directly knew what to do blindly.
Directly using control elements means it can run in the background without needing to take over the user's mouse and keyboard, or even hide the app it's controlling entirely.
It could just browse the web, crunch some numbers on some data it found and send off emails about it in parallel while you're editing a video, and you wouldn't even notice the difference.
It's also instant (no need to wait for delays between clicks and presses), and there's no room for error.
Even an unexpected window popping up is no issue, since it doesn't need the window to be active to control it, and it can activate it automatically and wait for it to be active if need be.
ChatGPT can also write out whole scripts that do the job from a single response.
For this and many other reasons (like reading pixel RGB values), I recommend using AutoIt.
I started off making an interpreter for AutoIt that took in more natural language and wrote the code itself, but it seems ChatGPT is well versed enough in AutoIt that you can just directly hook the DLL calls using an AST.
I did this before OpenAI's Function Calling API existed, so it would only be that much more powerful now.
Feel free to copy anything I did, and hit me up if you'd like to merge these two in some way.
I would say you might as well leave a separate pyautogui mode that works as-is while we merge things like vision over to AutoIt mode.
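Not AutoIt, but a Python stand-in for the same control-element approach described above: driving Windows UIA elements directly with pywinauto instead of synthesizing clicks at pixel coordinates. Window titles and control names here are assumptions for illustration:

```python
# Launch Notepad and type into its edit control via UIA, without moving the user's
# mouse or relying on the window's pixel position.
from pywinauto import Application

app = Application(backend="uia").start("notepad.exe")
win = app.window(title_re=".*Notepad")
win.wait("ready", timeout=10)
win.child_window(control_type="Edit").type_keys("Hello from UIA", with_spaces=True)
```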
Consider adding coding models like GPT-4 Turbo (128K token limit) to better navigate the HTML DOM using Playwright by @microsoft. See a similar implementation here: BrowserGPT by @mayt.
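A hedged sketch of that Playwright angle: dump a trimmed list of clickable DOM elements that a long-context model can reason over instead of raw pixels. The page URL and selectors are illustrative only:

```python
# Collect tag/text/id for clickable elements so a text model can pick a target.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://www.google.com")
    elements = page.eval_on_selector_all(
        "a, button, input, [role='button']",
        "els => els.map(e => ({tag: e.tagName, text: e.innerText.slice(0, 40), id: e.id}))",
    )
    print(elements[:10])  # feed this summary to the model alongside the objective
    browser.close()
```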
I noticed that you currently seem to apply a grid to the images to assist the vision model:
And mention this in the README:
Current Challenges
Note: The GPT-4v's error rate in estimating XY mouse click locations is currently quite high. This framework aims to track the progress of multimodal models over time, aspiring to achieve human-level performance in computer operation.
I was wondering, have you looked at using Set-of-Mark (SoM) Visual Prompting for GPT-4V or similar techniques?
A bit of a link dump from one of my references:
Set-of-Mark Prompting for LMMs
Set-of-Mark Visual Prompting for GPT-4V
We present Set-of-Mark (SoM) prompting, simply overlaying a number of spatial and speakable marks on the images, to unleash the visual grounding abilities in the strongest LMM -- GPT-4V. Let's using visual prompting for vision!
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
We present Set-of-Mark (SoM), a new visual prompting method, to unleash the visual grounding abilities of large multimodal models (LMMs), such as GPT-4V. As illustrated in Fig. 1 (right), we employ off-the-shelf interactive segmentation models, such as SEEM/SAM, to partition an image into regions at different levels of granularity, and overlay these regions with a set of marks e.g., alphanumerics, masks, boxes. Using the marked image as input, GPT-4V can answer the questions that require visual grounding. We perform a comprehensive empirical study to validate the effectiveness of SoM on a wide range of fine-grained vision and multimodal tasks. For example, our experiments show that GPT-4V with SoM in zero-shot setting outperforms the state-of-the-art fully-finetuned referring expression comprehension and segmentation model on RefCOCOg. Code for SoM prompting is made public at: this https URL.
Segment Anything
The repository provides code for running inference with the SegmentAnything Model (SAM), links for downloading the trained model checkpoints, and example notebooks that show how to use the model.
The Segment Anything Model (SAM) produces high quality object masks from input prompts such as points or boxes, and it can be used to generate masks for all objects in an image. It has been trained on a dataset of 11 million images and 1.1 billion masks, and has strong zero-shot performance on a variety of segmentation tasks.
Official implementation of the paper "Semantic-SAM: Segment and Recognize Anything at Any Granularity"
In this work, we introduce Semantic-SAM, a universal image segmentation model to enable segment and recognize anything at any desired granularity. We have trained on the whole SA-1B dataset and our model can reproduce SAM and beyond it.
Segment everything for one image. We output controllable granularity masks from semantic, instance to part level when using different granularity prompts.
SEEM: Segment Everything Everywhere All at Once
[NeurIPS 2023] Official implementation of the paper "Segment Everything Everywhere All at Once"
We introduce SEEM that can Segment Everything Everywhere with Multi-modal prompts all at once. SEEM allows users to easily segment an image using prompts of different types including visual prompts (points, marks, boxes, scribbles and image segments) and language prompts (text and audio), etc. It can also work with any combination of prompts or generalize to custom prompts!
Official implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"
[ICCV 2023] Official implementation of the paper "A Simple Framework for Open-Vocabulary Segmentation and Detection"
[CVPR 2023] Official implementation of the paper "Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation"
[ICCV2023] VLPart: Going Denser with Open-Vocabulary Part Segmentation
Object detection has been expanded from a limited number of categories to open vocabulary. Moving forward, a complete intelligent vision system requires understanding more fine-grained object descriptions, object parts. In this work, we propose a detector with the ability to predict both open-vocabulary objects and their part segmentation.
GPT-4V-Act: Chromium Copilot
AI agent using GPT-4V(ision) capable of using a mouse/keyboard to interact with web UI
GPT-4V-Act serves as an eloquent multimodal AI assistant that harmoniously combines GPT-4V(ision) with a web browser. It's designed to mirror the input and output of a human operator—primarily screen feedback and low-level mouse/keyboard interaction. The objective is to foster a smooth transition between human-computer operations, facilitating the creation of tools that considerably boost the accessibility of any user interface (UI), aid workflow automation, and enable automated UI testing.
GPT-4V-Act leverages both GPT-4V(ision) and Set-of-Mark Prompting, together with a tailored auto-labeler. This auto-labeler assigns a unique numerical ID to each interactable UI element.
By incorporating a task and a screenshot as input, GPT-4V-Act can deduce the subsequent action required to accomplish a task. For mouse/keyboard output, it can refer to the numerical labels for exact pixel coordinates.
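To make the numeric-label idea concrete, a small sketch of the "label ID to pixel coordinate" mapping that Set-of-Mark-style labeling relies on; the boxes would come from whatever detector or segmenter produced the marks:

```python
# Map each detected box to a label ID and its center point. If the model answers "1",
# the click is issued at targets[1] instead of a model-estimated coordinate.
def label_boxes(boxes):
    """boxes: list of (x1, y1, x2, y2). Returns {label_id: (center_x, center_y)}."""
    return {
        i: ((x1 + x2) // 2, (y1 + y2) // 2)
        for i, (x1, y1, x2, y2) in enumerate(boxes)
    }

targets = label_boxes([(100, 200, 300, 240), (400, 50, 520, 90)])
```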
GPT-4V(ision)
👀🧠 GPT-4 Vision x 💪⌨️ Vimium = Autonomous Web Agent
This project leverages GPT4V to create an autonomous / interactive web agent. The action space are discretized by Vimium.
Hi, I guess --voice is non-functional. Ref: #81.
Maybe someone could make a commit to mention something like "(in progress)" in front of --voice, or maybe remove --voice for now?
This issue aims to enhance the code's reliability by improving error handling in critical functions, such as get_next_action_from_openai and summarize. Currently, the error handling within these functions is basic and might not cover all potential errors that could occur during execution.
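A sketch of the kind of error handling this proposes: wrapping the OpenAI call with specific exception types and a bounded retry instead of a bare catch-all. The helper name and retry counts are placeholders, not the project's actual code:

```python
# Retry transient API failures with a simple linear backoff; surface other errors.
import time
from openai import APIError, APITimeoutError, RateLimitError

def call_with_retries(fn, retries=3, backoff=2.0):
    for attempt in range(retries):
        try:
            return fn()
        except (APITimeoutError, RateLimitError):
            if attempt == retries - 1:
                raise
            time.sleep(backoff * (attempt + 1))
        except APIError:
            raise  # non-retryable API errors should surface immediately
```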