Converting scraped data to data structure using LLM's

!! UPDATES !!

Please check the Trying with mistral and llama from Anyscale and the Trying with mistral and llama with prompts sections in src/html-to-json.ipynb

Creating the environment

Run the following code chunk to get started with the environment

> python -m venv newvenv
> source newvenv/bin/activate
> pip install -r requirements.txt

API key

To store the API keys create a .env file in the base directory, and place the corresponding API keys. I have one from together.ai and another from openAI. Store them (as well as the url) in this file as shown below:

TOGETHERAI_API_KEY="<your api key>"
OPENAI_API_KEY="<your api key>"
ANYSCALE_API_KEY="<your api key>"
URL="<your url>"

edit: added api key from Anyscale

This will be loaded into the scripts while making the requests.

Schema

The src/schema.py consists of custom schema defined as a wrapper on top of BaseModel provided by pydantic. This was done largely inspired by instructor's implementation of function calling.

Procedure and results

The url is scraped and the html is meaningfully extracted insrc/scrape.py. As some additional sanitisation steps (and also to reduce prompt token size), some useless html tags such as div, span, b and some others were removed. More over, all style attributes were also removed from other tags (only rowspan attributes remain).

Finally each row of the table is extracted seperately and the entire table is converted into a list of html table rows. Then during passing the html input to the model query, each row is converted into string and append with two blank lines gap between each row. In prompt, a tip has been provided to nudge the model into focussing on which tags have a rowspan attribute as that will decide how many entries get the same movie name,

Demonstration of the results are in html-to-json.ipynb.

Now, since there are $\approx 50$ rows, passing the entire table as one input quickly runs out of token length limit. Hence I split the table into chunks of some fixed chunk length and appended the header row at the beginning of all of them. Then passed all of them as inputs in seperate queries and saved their JSON outputs. Finally I merged all the JSON outputs into one combined output. The outputs are the json files in the base directory.

Other models

Somewhere from yesterday evening, most models I tried from together.ai endpoints returned http errors, and due to shortage of time I was unable to investigate this issue further (see here). Hence I was forced to use the gpt models from OpenAI.

Moreover since much of my experimentation was with function calling (and only models from openai are able to utilise this)... other models tend to return the content directly rather than generating it in the form of function call.

Querying by date

Filtering by date or month or month is also possible as demonstrated here. However as passing the entire table takes us dangerously close to the token limit I am unable to go for more complicated examples.

Why the entire input this time?

This is because for the model to be able to query based on dates it needs to know all the dates.

roudranil / function-calling-with-openai-and-oss-llm Goto Github PK

function-calling-with-openai-and-oss-llm's Introduction

Converting scraped data to data structure using LLM's

!! UPDATES !!

Creating the environment

API key

Schema

Procedure and results

Other models

Querying by date

function-calling-with-openai-and-oss-llm's People

Contributors

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent