
tarsier's Introduction


🙈 Vision utilities for web interaction agents 🙈


🔗 Main site • 🐦 Twitter • 📢 Discord

Tarsier

If you've tried using GPT-4(V) to automate web interactions, you've probably run into questions like:

  • How do you map LLM responses back into web elements?
  • How can you mark up a page for an LLM to better understand its action space?
  • How do you feed a "screenshot" to a text-only LLM?

At Reworkd, we found ourselves reusing the same utility libraries to solve these problems across multiple projects. Because of this, we're open-sourcing this simple utility library for multimodal web agents: Tarsier! The video below demonstrates Tarsier in action, feeding a page snapshot into a LangChain agent and letting it take actions.

[Demo video: tarsier.mp4]

How does it work?

Tarsier works by visually "tagging" interactable elements on a page via brackets plus an ID, such as [1]. In doing this, we provide a mapping between elements and IDs for GPT-4(V) to take actions upon. We define interactable elements as buttons, links, or input fields that are visible on the page.

Tarsier can also provide a purely textual representation of the page, enabling deeper interaction even for non-multimodal LLMs. This matters given the performance issues of existing vision-language models: Tarsier's OCR utilities convert a page screenshot into a whitespace-structured string that an LLM without vision can understand.
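
As a rough, hand-written illustration of that format (not actual Tarsier output; the real string depends on the page and the OCR results):

[@1] Home        [@2] Products        [@3] About
Search: [#4]                          [$5] Submit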

Installation

pip install tarsier
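
If you plan to run the Playwright-based example below, note that Playwright downloads its browser binaries in a separate, one-time step:

playwright install chromium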

Usage

Visit our cookbook for agent examples using Tarsier.

Otherwise, basic Tarsier usage might look like the following:

import asyncio
import json

from playwright.async_api import async_playwright
from tarsier import Tarsier, GoogleVisionOCRService

def load_google_cloud_credentials(json_file_path):
    with open(json_file_path) as f:
        credentials = json.load(f)
    return credentials

async def main():
    # To create the service account key, follow the instructions on this SO answer https://stackoverflow.com/a/46290808/1780891
    google_cloud_credentials = load_google_cloud_credentials('./google_service_acc_key.json')

    ocr_service = GoogleVisionOCRService(google_cloud_credentials)
    tarsier = Tarsier(ocr_service)

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        page = await browser.new_page()
        await page.goto("https://news.ycombinator.com")

        page_text, tag_to_xpath = await tarsier.page_to_text(page)

        print(tag_to_xpath)  # Mapping of tags to XPaths
        print(page_text)  # Text representation of the page


if __name__ == '__main__':
    asyncio.run(main())
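
From here, page_text can be fed straight to a text-only LLM. A minimal sketch using the openai package (the model name and prompts are illustrative, not prescribed by Tarsier):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def choose_action(page_text: str) -> str:
    # Ask a text-only model to pick a tagged element and an action.
    response = client.chat.completions.create(
        model="gpt-4",  # illustrative; any capable text model works
        messages=[
            {"role": "system", "content": "You control a web browser. "
             "Given the page text, reply with one action, e.g. CLICK [@2]."},
            {"role": "user", "content": page_text},
        ],
    )
    return response.choices[0].message.content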

Keep in mind that Tarsier tags different types of elements differently to help your LLM identify what actions are performable on each element (a sketch of acting on these tags follows the list). Specifically:

  • [#ID]: text-insertable fields (e.g. textarea, input with textual type)
  • [@ID]: hyperlinks (<a> tags)
  • [$ID]: other interactable elements (e.g. button, select)
  • [ID]: plain text (if you pass tag_text_elements=True)
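
Below is a minimal sketch of closing the loop: parsing an action out of an LLM reply and executing it with Playwright via tag_to_xpath. The CLICK/TYPE action grammar and the assumption that tag_to_xpath is keyed by integer IDs are conventions of this sketch, not a Tarsier API:

import re

async def take_action(page, tag_to_xpath: dict, llm_response: str):
    # Parse an action like "CLICK [@2]" or "TYPE [#3] hello world".
    # This action grammar is an assumption of this sketch.
    match = re.match(r"(CLICK|TYPE) \[[#@$]?(\d+)\](?: (.*))?", llm_response)
    if match is None:
        raise ValueError(f"Unrecognized action: {llm_response}")
    action, tag_id, text = match.groups()
    x_path = tag_to_xpath[int(tag_id)]  # assumes integer tag IDs, as printed above
    if action == "CLICK":
        await page.locator(x_path).click()
    else:  # TYPE
        await page.locator(x_path).fill(text or "")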

Local Development

Setup

We have provided a handy setup script to get you up and running with Tarsier development.

./script/setup.sh

If you modify any TypeScript files used by Tarsier, you'll need to execute the following command. This compiles the TypeScript into JavaScript, which can then be utilized in the Python package.

npm run build

Testing

We use pytest for testing. To run the tests, simply run:

poetry run pytest .

Linting

Prior to submitting a potential PR, please run the following to format your code:

./script/format.sh

Supported OCR Services

  • Google Cloud Vision (via GoogleVisionOCRService, as used in the usage example above)

Roadmap

  • Add documentation and examples
  • Clean up interfaces and add unit tests
  • Launch
  • Improve OCR text performance
  • Add options to customize tagging styling
  • Add support for other browser drivers as necessary
  • Add support for other OCR services as necessary

Citations

@misc{reworkd2023tarsier,
  title        = {Tarsier},
  author       = {Rohan Pandey and Adam Watkins and Asim Shrestha and Srijan Subedi},
  year         = {2023},
  howpublished = {GitHub},
  url          = {https://github.com/reworkd/tarsier}
}

tarsier's People

Contributors

asim-shrestha · awtkns · debanjum · dependabot[bot] · hargup · khoomeik · krupskis · rohanraval


tarsier's Issues

๐Ÿ› Annotations not removed when filling out inputs

Seeing this a lot, e.g. when I tweak the example in the cookbook to log in to resy :)

[screenshot]

I've tried adding both await page.locator(x_path).fill('') and await page.locator(x_path).clear() to the type_text tool, but neither seems to work (my Playwright experience is very limited).

🔨 Make selenium and playwright optional dependencies

Currently, both playwright and selenium are required to run Tarsier. Ideally, users would only have to install one of these to get up and running. This is normally done through pip optional dependencies:

pip install tarsier[playwright]
pip install tarsier[selenium]
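
A sketch of how this might look in pyproject.toml using Poetry's extras mechanism (version constraints omitted; illustrative only, not the project's actual config):

[tool.poetry.dependencies]
playwright = { version = "*", optional = true }
selenium = { version = "*", optional = true }

[tool.poetry.extras]
playwright = ["playwright"]
selenium = ["selenium"]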

Fine-tuning

I feel like the success rate on a given objective will always be significantly less than 100%. For example, I've been testing Tarsier to try to get it to make a reservation at a restaurant using resy.com, but it fails at the login step because it incorrectly picks the label on the "password" input to type text into (and wants to type my email in that field, no less).

I wonder if there's some system that could be built that has a human in the loop correcting mistakes made by the AI, with those corrections then used to fine-tune the model for that specific task.

Anyway, this is more of an "idea" than an issue, but I'm wondering what you think! (Feel free to close)

Extract web text directly instead of OCR

I'm working on something pretty similar to what you guys are doing and had a thought: why not grab text directly from the web instead of using OCR? LangChain and LlamaIndex both have such tools, and there are also some repos for converting HTML to Markdown.

Just a thought. Would love to know what you think!
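
For reference, a minimal sketch of the suggested direct-extraction approach using Playwright's inner_text (note this bypasses Tarsier's OCR and loses the whitespace layout it preserves):

async def extract_text_directly(page) -> str:
    # Read the rendered text straight from the DOM instead of OCR-ing a screenshot.
    # Simpler and faster, but flattens the page's visual layout.
    return await page.inner_text("body")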

๐Ÿ—‘๏ธ Expose an API method to remove tags from elements

When tags are added to elements, we insert the tag text directly within the elements. This is problematic for input elements, as it adds unnecessary tag text to forms and breaks the input value. See #5

As a simple fix, we should expose an API method to delete all existing tags from elements (or perhaps just input elements). We mark tagged elements via an attribute, so we would just need to filter for this attribute and delete the tag text along with the tag marker.
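
A minimal sketch of that fix, assuming tagged elements are marked with a data attribute (the attribute name data-tarsier-id and the injected tag-text format are assumptions of this sketch, not Tarsier's actual marker):

async def remove_tags(page):
    # Strip injected tag text like "[#3]" from marked input values and
    # drop the marker attribute. "data-tarsier-id" is an assumed name.
    await page.evaluate(
        """() => {
            for (const el of document.querySelectorAll('[data-tarsier-id]')) {
                if (el.value) el.value = el.value.replace(/^\\[[#@$]?\\d+\\]\\s*/, '');
                el.removeAttribute('data-tarsier-id');
            }
        }"""
    )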

✨ Ability to customize tag colours

As a user, I'd like to customize the colors of tags. I'd also like to set different tag colors based on the type of element.

For example, maybe I want <input> tags blue and <a>'s red.

Selecting Icons

Hi!

I'm trying to automate using the search bar on a list of unknown sites.

In most cases the bar is not visible, but there is an icon I must click first to display the search bar.

In this example, I want to detect and click the magnifying glass icon:

[screenshot]

The problem is that it shows up in the text as [ @ 18 ], so GPT cannot pick it (I'm using the LlamaIndex agent).

The website is https://elastic.co

I read that @asim-shrestha mentioned a GPT-V mode in another issue, but I'm not sure how to activate it; I'm following the docs without success.

Any advice? Thanks!
