Comments (25)
Sounds good. I'll take a stab at implementing Lang Segment Anything and make a pull request if it seems promising.
from self-operating-computer.
@KBB99 Thank you! I was trying to implement SoM, but the accuracy of the labels on screenshots was quite low, so I was looking into other ways of solving the problem.
I was trying to implement SoM by adding a "som-mode": incorporating the prompt into VISION_PROMPT and running "operate -som" to activate it.
I will check and try your code. Thank you so much! It would be great if you could make a PR!
@KBB99 this sounds promising, interested to see more as you make progress!
@Daisuke134 I would be interested to get your input, as I saw you are working on something similar with Set-of-Marks. I have used LangSam to more accurately mask the objects; then I ask GPT-4V a follow-up question combining the masks with marks, asking it to specify which object to click on; then I parse the GPT-4V output and finally click on the center of the object. Here is the code: main...KBB99:self-operating-computer:main.
@joshbickett how do you suggest building this into the project? operate --click_mode=lang-som ?
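The last step of the pipeline described above (clicking on the center of the masked object) boils down to finding the center of a segmentation mask. A minimal sketch, assuming the mask comes back as a 2D boolean array (the helper name is made up; the actual click would go through something like pyautogui):

```python
def mask_center(mask):
    """Center of the bounding box of True cells in a 2D boolean mask.

    Returns (x, y) in pixel coordinates, or None if the mask is empty.
    """
    ys = [r for r, row in enumerate(mask) for v in row if v]
    xs = [c for row in mask for c, v in enumerate(row) if v]
    if not xs:
        return None
    return ((min(xs) + max(xs)) // 2, (min(ys) + max(ys)) // 2)

# The actual click would then be something like:
#   pyautogui.click(*mask_center(mask))
```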
@KBB99
Looks great! I was looking into your code, and here are some thoughts I had.
・I think you are using summary_prompt to make GPT-4 respond with the label for the next action, but this is probably not related to SUMMARY_PROMPT, right? I found it a bit confusing. Maybe changing the names for clarity would be better (e.g., sam_prompt).
・Question: Are you using summary_prompt after VISION_PROMPT?
・What do you think about combining summary_prompt and VISION_PROMPT? First, we could give GPT-4 a screenshot with SAM masks (instead of a grid) and ask it to provide the specific label (CLICK {{ "label": "C", "description"... ). This way, we wouldn't need to request GPT-4 twice, reducing operation time.
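Whichever prompt layout is chosen, the model's CLICK line has to be parsed on the way back. A sketch of that parsing, assuming the response format shown in this thread (the doubled braces in the prompt templates are escaping; the actual model output would use single braces):

```python
import json
import re


def parse_click_response(text):
    """Extract the JSON payload from a `CLICK { ... }` model response.

    Returns a dict like {"label": "C", "description": ...}, or None
    if the text does not contain a CLICK action.
    """
    match = re.search(r"CLICK\s*(\{.*\})", text, re.DOTALL)
    if not match:
        return None
    return json.loads(match.group(1))
```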
Here's how I wrote it by creating a new mode called som_mode:
- Use the segmented image if using som_mode, else use the grid image as before.
```python
def get_next_action_from_openai(messages, objective, accurate_mode, som_mode):
    if som_mode:
        try:
            som_screenshot_filename = os.path.join(screenshots_dir, "screenshot_som.png")
            generate_sam_masks(screenshot_filename, som_screenshot_filename)
            img_file_to_use = som_screenshot_filename
        except Exception as e:
            if DEBUG:
                print(f"Error in SoM processing: {e}")
    else:
        grid_screenshot_filename = os.path.join(screenshots_dir, "screenshot_with_grid.png")
        add_grid_to_image(screenshot_filename, grid_screenshot_filename, 500)
        img_file_to_use = grid_screenshot_filename
    time.sleep(1)
    with open(img_file_to_use, "rb") as img_file:
        img_base64 = base64.b64encode(img_file.read()).decode("utf-8")
```
- By using the code below, I can activate som_mode with "operate -som".
```python
def main_entry():
    # Add SoM to image
    parser.add_argument(
        "-som",
        help="Activate SOM Mode",
        action="store_true",
        required=False,
    )
    try:
        args = parser.parse_args()
        main(
            args.model,
            accurate_mode=args.accurate,
            terminal_prompt=args.prompt,
            som_mode=args.som,
            voice_mode=args.voice,
        )
```
- Add labels to VISION_PROMPT.
1. CLICK
Response:
- For a screenshot with a grid:
CLICK {{ "x": "percent", "y": "percent", "description": "~description here~", "reason": "~reason here~" }}
Note: The percents work where the top left corner is "x": "0%" and "y": "0%" and the bottom right corner is "x": "100%" and "y": "100%"
- For a screenshot with numbered labels:
CLICK {{ "label": "number", "description": "~description here~", "reason": "~reason here~" }}
Note: Use the number that is labelled on the desired element. If the targeted area is not labelled, revert to the grid format.
Here are examples of how to respond.
Objective: Log in to the account
CLICK {{ "label": "C", "description": "Click on the button labelled 'C', which is the 'Login' button", "reason": "'C' is identified as the 'Login' button, and clicking it will initiate the login process." }}
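For the grid format, the percent strings still have to be converted to screen pixels before clicking. A minimal sketch of that conversion (the helper name is made up; the screen size would come from something like pyautogui.size()):

```python
def percent_to_pixels(x_percent, y_percent, screen_width, screen_height):
    """Convert "25%"-style grid coordinates to absolute pixel positions."""
    x = int(screen_width * float(x_percent.rstrip("%")) / 100)
    y = int(screen_height * float(y_percent.rstrip("%")) / 100)
    return x, y
```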
※I think adding a new sam-mode is better, since it lets us compare the accuracy of the conventional method against the SAM method. However, I am not sure whether we should make a new prompt like SAM_PROMPT or integrate it into the existing VISION_PROMPT.
Thank you so much🙇♀️ Let me know what you think.
Hey @joshbickett. Yes, I made some changes to clean up the code before pushing and must have broken something accidentally; I'll take a look and fix any glitches. Thanks for checking it out!
@KBB99
Hi. I was trying to test out your code, but had some issues installing lang_sam. I am using Python 3.12 and tried to install torch and torchvision but was not able to.
We need to run

```
pip install torch torchvision
pip install -U git+https://github.com/luca-medeiros/lang-segment-anything.git
```

to use the project, right?
@Daisuke134 I was able to install with Python 3.9 and the commands you mentioned.
@joshbickett I updated the incorrect arguments you mentioned, so that is fixed. It is moving and clicking for me, so I suspect that for the generated visual description ("First link in the list of articles, titled 'Non-interactive SSH password authentication'"), lang-sam did not identify the object and hence couldn't click on anything. For me it sometimes segments and masks the objects correctly, but other times not. I need to experiment more and improve the prompts as well as the marks.
@Daisuke134 the commands you mentioned are correct, assuming you start with a Python 3.9 venv (to do so, set up the venv by running python3.9 -m venv venv). Also, don't forget to pull the latest changes that fix the arguments bug Josh mentioned.
I will do some more tests, improvements, and then make a pull request tomorrow integrating the suggestions @Daisuke134 mentioned.
> Any idea how to solve the problem?

Sounds like you were able to resolve this issue, but I wanted to mention that I had to run pip install torchvision separately before lang-segment-anything.
So I've managed to fix the clicking errors as well as add a configuration that saves the screenshots, masked screenshots, prompt, and predicted coordinates. I'll run more tests and keep expanding the dataset to test the segmentation model individually and then make further changes. Sometimes lang-sam works very well and segments the targeted object perfectly making clicks extremely accurate, other times not.
@joshbickett are you aware of any dataset of screenshots with prompt and coordinates we could use to evaluate different approaches?
> run pip install torchvision separately before lang-segment-anything

Meaning doing "pip install torchvision" before "pip install -U git+https://github.com/luca-medeiros/lang-segment-anything.git"?
Also, I am still having the error with parsing the JSON. I will check the code again.
@Daisuke134 I added a section to the README.md; try following the one with conda for version management.
@admineral @KBB99 Set-of-Mark prompting is now available. You can swap in any best.pt from a YOLOv8 model and see how it performs. I'd love it if the community iterated on what I built. My best.pt could be improved, but the structure is now in place to improve upon.
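For anyone experimenting with a swapped-in best.pt: once the detector has run, choosing a click point reduces to taking the center of the best-matching box. A sketch operating on plain (x1, y1, x2, y2) boxes, so it can be tried independently of the model (the dict shape here is an assumption, not the ultralytics Results API):

```python
# Loading a custom YOLOv8 checkpoint would look roughly like:
#   from ultralytics import YOLO
#   results = YOLO("best.pt")("screenshot.png")
# with each detection ultimately yielding a box, a confidence, and a label.


def click_point_for_label(detections, wanted_label):
    """Center of the highest-confidence box matching a label, or None.

    detections: list of dicts like
        {"label": str, "conf": float, "box": (x1, y1, x2, y2)}
    """
    matches = [d for d in detections if d["label"] == wanted_label]
    if not matches:
        return None
    x1, y1, x2, y2 = max(matches, key=lambda d: d["conf"])["box"]
    return ((x1 + x2) // 2, (y1 + y2) // 2)
```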
@KBB99 sounds good. If it improves performance it'd be great to get a PR
@admineral This is an interesting idea!
Sounds awesome. We could look at using this: https://github.com/luca-medeiros/lang-segment-anything
@admineral Yeah sorry, not sure why I closed this. I must've been tired. This is a good idea!
@KBB99 Looking at this now and it looks really interesting. I hadn't considered a segmentation model before
Have a long way to go, but I was able to get it integrated. I need to make sure the calculations I'm performing are correct, as well as fix the ratio conversion. Right now it's incredibly slow as I'm running it locally, but I'll try to host an endpoint backed by a GPU to speed it up. Later I'd like to explore an RL mechanism similar to what @admineral mentioned to improve pixel coordinate estimation.
If anyone wants to continue working on it feel free to clone the fork! Just a note you need to use Python3.9 and the Lang model takes over 4GB of space.
> @joshbickett how do you suggest building this into the project? operate --click_mode=lang-som ?
@KBB99 excited to look closer. Let me checkout the code and provide some input in the next few days
@KBB99 I did a git checkout of your fork. I am encountering the following issue:

TypeError: mouse_click() missing 2 required positional arguments: 'client' and 'messages'

It looks like mouse_click was expecting additional arguments:
```python
if action_type == "SEARCH":
    function_response = search(action_detail)
elif action_type == "TYPE":
    function_response = keyboard_type(action_detail)
elif action_type == "CLICK":
    function_response = mouse_click(action_detail)
else:
    ...
```

```python
def mouse_click(click_detail, client, messages):
```
After adjusting the code so that mouse_click doesn't encounter that error, I got the following result, but didn't see the mouse move or click:
[Self-Operating Computer] [Act] CLICK {'visual_description': "First link in the list of articles, titled 'Non-interactive SSH password authentication'", 'reason': 'To open the top article as requested'}
I am interested in this lang-som approach, but it looks like it may need more work; let me know if there's something I was doing wrong. I'd love to try out a working version.
@joshbickett @KBB99 Thank you. I could run the code using Python 3.9.18. However, when I ran operate with "Go to youtube.com and play some holiday music", there was an error saying
Error parsing JSON: 'ascii' codec can't encode character '\u2018' in position 7: ordinal not in range(128)
[Self-Operating Computer][Error] something went wrong :(
[Self-Operating Computer][Error] AI response
Failed take action after looking at the screenshot
This is probably occurring because a typographic quote character (') is included when parsing the JSON, but I'm not sure.
Any idea how to solve the problem?
You only made changes in main.py, right?
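One workaround for the 'ascii' codec error above is to normalize typographic quotes before handing the text to the JSON parser. A sketch (which quote characters actually appear in the model output is an assumption; '\u2018' from the traceback is the left single curly quote):

```python
import json

# Map typographic quotes to their ASCII equivalents.
QUOTE_MAP = {
    "\u2018": "'",  # left single quote
    "\u2019": "'",  # right single quote
    "\u201c": '"',  # left double quote
    "\u201d": '"',  # right double quote
}


def parse_model_json(text):
    """Parse model output as JSON after replacing curly quotes."""
    for fancy, plain in QUOTE_MAP.items():
        text = text.replace(fancy, plain)
    return json.loads(text)
```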
@joshbickett I've played around with the SoM you implemented, but for some reason GPT-4V seems to pick the wrong marks and still doesn't click on the correct object. I think some additional context, like the text, could help GPT-4V pick the correct object. I've been experimenting with Amazon Textract Layout and it seems pretty solid, capturing both the text and the layout. Take a look at the screenshot. I'll test it with self-operating-computer and let you know how it goes.
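For context on the Textract experiment: analyze_document returns a flat list of blocks with types, text, and geometry, so pulling out the detected text with its position is a dictionary walk. A sketch (the response dict below is trimmed to only the fields this helper reads):

```python
# The Textract call itself would be something like:
#   import boto3
#   response = boto3.client("textract").analyze_document(
#       Document={"Bytes": image_bytes}, FeatureTypes=["LAYOUT"]
#   )


def extract_lines(response):
    """Collect (text, bounding box) pairs for LINE blocks in a Textract response.

    Bounding boxes are expressed as ratios of the page dimensions.
    """
    lines = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") == "LINE":
            box = block["Geometry"]["BoundingBox"]
            lines.append((block.get("Text", ""), box))
    return lines
```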
@KBB99 I'll close this PR now that we're using the OCR method as the default. If anyone comes up with an improved YOLO model or SoM technique, feel free to open a new ticket or a PR!
Please check out YOLO-WORLD https://blog.roboflow.com/what-is-yolo-world/
https://huggingface.co/spaces/stevengrove/YOLO-World?ref=blog.roboflow.com
https://github.com/AILAB-CVC/YOLO-World?ref=blog.roboflow.com