Comments (25)

KBB99 commented on July 20, 2024

Sounds good. I'll take a stab at implementing Lang Segment Anything and make a pull request if it seems promising.

Daisuke134 commented on July 20, 2024

@KBB99 Thank you! I was trying to implement SoM, but the accuracy of the labels for screenshots was quite low, so I was looking into other ways of solving the problem.

I was implementing SoM as a "som-mode", adding the prompt to VISION_PROMPT and running "operate -som" to activate it.

I will check and try your code. Thank you so much! It would be great if you could make a PR!!

joshbickett commented on July 20, 2024

@KBB99 this sounds promising, interested to see more as you make progress!

KBB99 commented on July 20, 2024

@Daisuke134 I would be interested to get your input, as I saw you are working on something similar with Set-of-Marks. I use LangSAM to mask the objects more accurately, then ask GPT-4-V a follow-up question combining the masks with marks, asking it to specify which object to click on; I then parse the GPT-4-V output and click on the center of the object. Here is the code: main...KBB99:self-operating-computer:main .
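
For reference, here is a minimal sketch of the mask → center → click step. It assumes the LangSAM.predict interface shown in the lang-segment-anything README, and the helper name click_center_of_object is hypothetical, not the fork's actual code:

import numpy as np
import pyautogui
from PIL import Image
from lang_sam import LangSAM  # assumption: README-style API

model = LangSAM()

def click_center_of_object(screenshot_path, text_prompt):
    image = Image.open(screenshot_path).convert("RGB")
    # Assumed signature: returns masks, boxes, phrases, logits
    masks, boxes, phrases, logits = model.predict(image, text_prompt)
    if len(masks) == 0:
        return False  # lang-sam found nothing matching the description
    # Take the highest-confidence mask and click the centroid of its pixels
    best = masks[int(logits.argmax())].cpu().numpy()
    ys, xs = np.nonzero(best)
    cx, cy = xs.mean(), ys.mean()
    # Convert from screenshot pixels to screen coordinates (Retina displays
    # may capture at 2x, so rescale by the screen-to-image ratio)
    screen_w, screen_h = pyautogui.size()
    img_w, img_h = image.size
    pyautogui.click(cx * screen_w / img_w, cy * screen_h / img_h)
    return True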

@joshbickett how do you suggest building this into the project? operate --click_mode=lang-som ?

Daisuke134 commented on July 20, 2024

@KBB99
Looks great! I was looking into your code, and here are some thoughts I had.

・I think you are using summary_prompt to make GPT-4 respond with the label for the next action, but this is probably not related to SUMMARY_PROMPT, right? I found it a bit confusing. Maybe changing the names for clarity would be better (e.g., sam_prompt).

・Question: Are you using summary_prompt after VISION_PROMPT?

・What do you think about combining summary_prompt and VISION_PROMPT? First, we could give GPT-4 the screenshot with SAM marks (instead of a grid) and ask it to provide the specific label (CLICK {{ "label": "C", "description"... )). This way, we wouldn't need to request GPT-4 twice, reducing operation time.
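
Roughly, the single combined request could look like the following. This is only a sketch assuming the OpenAI chat-completions vision format and the gpt-4-vision-preview model name, not the project's exact call:

from openai import OpenAI

client = OpenAI()

def ask_for_label(combined_prompt, img_base64):
    # One GPT-4-V request that sends the SAM-marked screenshot together with
    # the combined prompt, so no second summary_prompt round trip is needed.
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # assumption: vision model used by the project
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": combined_prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{img_base64}"}},
            ],
        }],
        max_tokens=300,
    )
    return response.choices[0].message.content  # e.g. CLICK {{ "label": "C", ... }}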

Here's how I wrote it by creating a new mode called som_mode:

  1. Use the segmented image if som_mode is active, else use the grid image as before.

def get_next_action_from_openai(messages, objective, accurate_mode, som_mode):
        img_file_to_use = None
        if som_mode:
            try:
                som_screenshot_filename = os.path.join(screenshots_dir, "screenshot_som.png")
                generate_sam_masks(screenshot_filename, som_screenshot_filename)
                img_file_to_use = som_screenshot_filename
            except Exception as e:
                if DEBUG:
                    print(f"Error in SoM processing: {e}")

        if img_file_to_use is None:
            # Fall back to the grid overlay when SoM is off or mask generation failed
            grid_screenshot_filename = os.path.join(screenshots_dir, "screenshot_with_grid.png")
            add_grid_to_image(screenshot_filename, grid_screenshot_filename, 500)
            img_file_to_use = grid_screenshot_filename
        time.sleep(1)

        with open(img_file_to_use, "rb") as img_file:
            img_base64 = base64.b64encode(img_file.read()).decode("utf-8")
  2. By using the code below, I can activate som_mode with "operate -som":
def main_entry():
    # Add SoM to image
    parser.add_argument(
        "-som",
        help="Activate SOM Mode",
        action="store_true",
        required=False,
    )

    try:
        args = parser.parse_args()

        main(
            args.model,
            accurate_mode=args.accurate,
            terminal_prompt=args.prompt,
            som_mode=args.som,
            voice_mode=args.voice,
        )
  3. Add labels to VISION_PROMPT:
1. CLICK
Response:
   - For a screenshot with a grid:
     CLICK {{ "x": "percent", "y": "percent", "description": "~description here~", "reason": "~reason here~" }}
     Note: The percents work where the top left corner is "x": "0%" and "y": "0%" and the bottom right corner is "x": "100%" and "y": "100%"

   - For a screenshot with numbered labels:
     CLICK {{ "label": "number", "description": "~description here~", "reason": "~reason here~" }}
     Note: Use the number that is labelled on the desired element. If the targeted area is not labelled, revert to the grid format.

Here are examples of how to respond.

Objective: Log in to the account
CLICK {{ "label": "C", "description": "Click on the button labelled 'C', which is the 'Login' button", "reason": "C is identified as the 'Login' button, and clicking it will initiate the login process." }}

※I think that adding a new sam-mode is better, since we can check the accuracy difference between the conventional method and the SAM method. However, I am not sure if we should make a new prompt like SAM_PROMPT, or integrate it into the existing VISION_PROMPT.
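
For what it's worth, here is a rough sketch (my assumption, not code from either fork) of how the label-based CLICK response could be parsed before acting on it:

import json
import re

def parse_click_response(response_text):
    """Extract the JSON payload from a 'CLICK {{ ... }}' model response."""
    match = re.search(r"CLICK\s*\{\{(.*)\}\}", response_text, re.DOTALL)
    if not match:
        return None
    return json.loads("{" + match.group(1) + "}")

# parse_click_response('CLICK {{ "label": "C", "description": "Login button", "reason": "visible" }}')
# -> {'label': 'C', 'description': 'Login button', 'reason': 'visible'}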

Thank you so much🙇‍♀️ Let me know what you think.

KBB99 commented on July 20, 2024

Hey @joshbickett. Yes, I made some changes to clean up the code before pushing and must have broken something accidentally; I will take a look and fix any glitches. Thanks for checking it out!

Daisuke134 commented on July 20, 2024

@KBB99
Hi. I was trying to test out your code, but had some issues installing lang_sam. I am using Python 3.12 and tried to install torch and torchvision but was not able to.

We need

    pip install torch torchvision
    pip install -U git+https://github.com/luca-medeiros/lang-segment-anything.git

to run the project, right?

joshbickett commented on July 20, 2024

@Daisuke134 I was able to install with Python 3.9 and the commands you mentioned.

KBB99 commented on July 20, 2024

@joshbickett I updated the incorrect arguments you mentioned, so that is fixed. It is moving and clicking for me, so I suspect that for the generated visual description "First link in the list of articles, titled 'Non-interactive SSH password authentication'", lang-sam did not identify the object and hence couldn't click on anything. For me it sometimes segments and masks the objects correctly, but other times not. I need to experiment more and improve the prompts as well as the marks.

@Daisuke134 the commands you mentioned are correct, assuming you start with a Python 3.9 venv (to do so, set up the venv by running python3.9 -m venv venv). Also, don't forget to pull the latest changes that fix the arguments bug Josh mentioned.

I will do some more tests, improvements, and then make a pull request tomorrow integrating the suggestions @Daisuke134 mentioned.

joshbickett commented on July 20, 2024

> Any idea how to solve the problem?

It sounds like you were able to resolve this issue, but I wanted to mention that I had to run pip install torchvision separately before installing lang-segment-anything.

KBB99 commented on July 20, 2024

So I've managed to fix the clicking errors, as well as add a configuration that saves the screenshots, masked screenshots, prompt, and predicted coordinates. I'll run more tests and keep expanding the dataset to test the segmentation model individually, and then make further changes. Sometimes lang-sam works very well and segments the targeted object perfectly, making clicks extremely accurate; other times it does not.
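
The saving side is roughly like the sketch below (my own assumptions about file layout, not the exact code in the fork):

import json
import os
import shutil
import time

def save_run_artifacts(log_dir, screenshot_path, masked_path, prompt, coords):
    # One timestamped directory per run: raw screenshot, masked screenshot,
    # and a small JSON file with the prompt and predicted click coordinates.
    run_dir = os.path.join(log_dir, time.strftime("%Y%m%d-%H%M%S"))
    os.makedirs(run_dir, exist_ok=True)
    shutil.copy(screenshot_path, os.path.join(run_dir, "screenshot.png"))
    shutil.copy(masked_path, os.path.join(run_dir, "screenshot_masked.png"))
    with open(os.path.join(run_dir, "meta.json"), "w") as f:
        json.dump({"prompt": prompt, "predicted_coords": coords}, f, indent=2)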

@joshbickett are you aware of any dataset of screenshots with prompt and coordinates we could use to evaluate different approaches?

Daisuke134 commented on July 20, 2024

> run pip install torchvision separate before lang-segment-anything

Meaning doing "pip install torchvision" before "pip install -U git+https://github.com/luca-medeiros/lang-segment-anything.git"?

Also, I am still having the error with parsing JSON... I will check the code again.

KBB99 commented on July 20, 2024

@Daisuke134 I added a section to the README.md; try following the one with conda for version management.

joshbickett commented on July 20, 2024

@admineral @KBB99 Set-of-Mark prompting is now available. You can swap in any best.pt from a YOLOv8 model and see how it performs. I'd love for the community to iterate on what I built. My best.pt could be improved, but the structure is now in place to improve upon.
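
The "swap in any best.pt" step amounts to roughly the following. This is a sketch assuming the ultralytics YOLO interface, not the project's actual SoM labelling code:

from PIL import Image, ImageDraw
from ultralytics import YOLO

def label_screenshot(screenshot_path, out_path, weights="best.pt"):
    model = YOLO(weights)                       # any YOLOv8 checkpoint
    image = Image.open(screenshot_path).convert("RGB")
    results = model(image)                      # detect UI elements
    draw = ImageDraw.Draw(image)
    for i, box in enumerate(results[0].boxes.xyxy.tolist()):
        x1, y1, x2, y2 = box
        draw.rectangle([x1, y1, x2, y2], outline="red", width=2)
        draw.text((x1, max(0, y1 - 12)), str(i), fill="red")   # numbered mark
    image.save(out_path)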

joshbickett commented on July 20, 2024

@KBB99 sounds good. If it improves performance it'd be great to get a PR

joshbickett commented on July 20, 2024

@admineral This is an interesting idea!

KBB99 commented on July 20, 2024

Sounds awesome. We could look at using this: https://github.com/luca-medeiros/lang-segment-anything

michaelhhogue commented on July 20, 2024

@admineral Yeah sorry, not sure why I closed this. I must've been tired. This is a good idea!

@KBB99 Looking at this now and it looks really interesting. I hadn't considered a segmentation model before.

KBB99 commented on July 20, 2024

Have a long way to go, but I was able to get it integrated. I need to make sure the calculations I'm performing are correct, as well as modify the ratio conversion. Right now it's incredibly slow as I'm running it locally, but I'll try to host an endpoint backed by a GPU to speed it up. Later I'd like to explore an RL mechanism similar to what @admineral mentioned to improve pixel coordinate estimation.

If anyone wants to continue working on it, feel free to clone the fork! Just a note: you need to use Python 3.9, and the LangSAM model takes over 4GB of space.
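
The ratio conversion I mean is roughly this kind of thing (a small sketch of my assumption, mapping a mask centroid in screenshot pixels onto the percent-based CLICK format the project already uses):

def pixels_to_percent(cx, cy, img_width, img_height):
    # Top-left is 0%/0%, bottom-right is 100%/100%, matching VISION_PROMPT
    return {
        "x": f"{100.0 * cx / img_width:.1f}%",
        "y": f"{100.0 * cy / img_height:.1f}%",
    }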

joshbickett commented on July 20, 2024

> @joshbickett how do you suggest building this into the project? operate --click_mode=lang-som ?

@KBB99 excited to look closer. Let me checkout the code and provide some input in the next few days

joshbickett commented on July 20, 2024

@KBB99 I did git checkout on your fork. I am encountering the following issue.

TypeError: mouse_click() missing 2 required positional arguments: 'client' and 'messages'

It looks like mouse_click was expecting additional arguments

        if action_type == "SEARCH":
            function_response = search(action_detail)
        elif action_type == "TYPE":
            function_response = keyboard_type(action_detail)
        elif action_type == "CLICK":
            function_response = mouse_click(action_detail)
        else:
...

def mouse_click(click_detail, client, messages):

After adjusting that in the code so that mouse_click doesn't encounter that error, I got the following result, but didn't see the mouse move or click.

[Self-Operating Computer] [Act] CLICK {'visual_description': "First link in the list of articles, titled 'Non-interactive SSH password authentication'", 'reason': 'To open the top article as requested'}

I am interested in this lang-som approach, but it looks like it may need more work. Let me know if there's something I was doing wrong; I'd love to try out a working version.

Daisuke134 commented on July 20, 2024

@joshbickett @KBB99 Thank you. I could run the code using 3.9.18. However, when I ran operate with "Go to youtube.com and play some holiday music", there was an error saying:

Error parsing JSON: 'ascii' codec can't encode character '\u2018' in position 7: ordinal not in range(128)
[Self-Operating Computer][Error] something went wrong :(
[Self-Operating Computer][Error] AI response
Failed take action after looking at the screenshot

Probably this is occurring because a ‘ character is included when parsing the JSON, but...
Any idea how to solve the problem?

You only made changes in main.py, right?
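
In case it really is the curly quotes, here is a sketch of normalizing them before parsing (my guess at a workaround, not the project's actual fix):

def sanitize_model_output(text):
    # The error mentions U+2018: curly quotes break both ASCII encoding and
    # json.loads, so replace them with their plain ASCII equivalents.
    replacements = {
        "\u2018": "'", "\u2019": "'",   # curly single quotes
        "\u201c": '"', "\u201d": '"',   # curly double quotes
    }
    for bad, good in replacements.items():
        text = text.replace(bad, good)
    return text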

KBB99 commented on July 20, 2024

@joshbickett I've played around with the SoM you implemented, but for some reason GPT-V seems to pick the wrong marks and still doesn't click on the correct object. I think some additional context, like the text, could help GPT-V pick the correct object. I've been experimenting with Amazon Textract Layout and it seems pretty solid, capturing the text and the layout. Take a look at the screenshot. I'll test it with self-operating-computer and let you know how it goes.
[Screenshot: 2024-01-09 at 11:05:16 PM]
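
The Textract experiment is roughly the following (a sketch assuming the boto3 analyze_document call with the LAYOUT feature type; not my exact script):

import boto3

def extract_layout(screenshot_path):
    client = boto3.client("textract")
    with open(screenshot_path, "rb") as f:
        response = client.analyze_document(
            Document={"Bytes": f.read()},
            FeatureTypes=["LAYOUT"],
        )
    # Return detected text lines with their bounding boxes as extra context
    return [
        (block["Text"], block["Geometry"]["BoundingBox"])
        for block in response["Blocks"]
        if block["BlockType"] == "LINE"
    ]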

joshbickett commented on July 20, 2024

@KBB99 I'll close this PR for now, since we're using the OCR method as the default. If anyone comes up with an improved YOLO model or SoM technique, feel free to open a new ticket or a PR!

admineral commented on July 20, 2024

Please check out YOLO-World: https://blog.roboflow.com/what-is-yolo-world/

https://huggingface.co/spaces/stevengrove/YOLO-World?ref=blog.roboflow.com

https://github.com/AILAB-CVC/YOLO-World?ref=blog.roboflow.com
