Giter Club home page Giter Club logo

extracty's Introduction

Extracty: Dynamic Data Extraction

Downloads

Extract structured data from any unstructered web page

extracty is a library designed to streamline and simplify the process of extracting structured data from websites. Utilizing the robust functionality of Pydantic and Instructor, extracty enables users to define dynamic data extraction schemas and interact with a simple function call.

How to Run

first install the library.

pip install extracty

How to use the library

from extracty import LLMExtractor

extractor = LLMExtractor(
    url="url-you-want-to-scrape",
    query="your-query",
    api_key="your-openai-api-key",
)

data = extractor.extract()

print(data.model_dump_json())

more advance extraction

here you can specify the fields you want the model to look for.
Also, you can specify the types you want for each model for easier data handling.

from extracty import LLMExtractor

fields = {
    "feild_1": str,
    "feild_2": int,
    "field_3": bool,
}

extractor = LLMExtractor(
    url="url-you-want-to-scrape",
    query="your-query",
    api_key="your-openai-api-key",
    fields=fields,
    gpt_model="the-model-you-want-to-use (defaults to gpt-4)",
)

data = extractor.extract()

print(data.model_dump_json())

Example-usage

Here is an example usage where we want to get the top 5 trending github repo's

from extracty import LLMExtractor

fields = {
    "rank": int,
    "repo_name": str,
    "small_description": str,
}

extractor = LLMExtractor(
    url="https://www.github.com/trending",
    query="What are the top 5 trending repositories on GitHub?",
    api_key="your-openai-api-key",
    fields=fields,
    gpt_model="the-model-you-want-to-use (defaults to gpt-4)",
)

data = extractor.extract()

print(data.model_dump_json())

and the corresponding output

{
  "name": "Top 5 Trending Repositories on GitHub",
  "data": [
    {
      "rank": 1,
      "repo_name": "stitionai / devika",
      "small_description": "Devika is an Agentic AI Software Engineer that can understand high-level human instructions, break them down into steps, research relevant information, and write code to achieve the given objective. Devika aims to be a competitive open-source alternative to Devin by Cognition AI."
    },
    {
      "rank": 2,
      "repo_name": "OpenDevin / OpenDevin",
      "small_description": "OpenDevin: Code Less, Make More 利用大模型,一键生成短视频"
    },
    {
      "rank": 3,
      "repo_name": "harry0703 / MoneyPrinterTurbo",
      "small_description": "Decentralized Autonomous Regulated Company (DARC), a company virtual machine that runs on any EVM-compatible blockchain, with on-chain law system, multi-level tokens and dividends mechanism."
    },
    {
      "rank": 4,
      "repo_name": "Project-DARC / DARC",
      "small_description": "A natural language interface for computers"
    },
    {
      "rank": 5,
      "repo_name": "OpenInterpreter / open-interpreter",
      "small_description": "A one stop repository for generative AI research updates, interview resources, notebooks and much more!"
    }
  ]
}

Contributing

If you would like to contribute to this project, please follow these steps:

  1. Fork the repository.
  2. Create a new branch for your feature or bug fix.
  3. Make your changes and commit them.
  4. Push your changes to your forked repository.
  5. Submit a pull request to the original repository.

I appreciate your contributions to build this fun project!

TODOs

  • Enhance the scraping and make it more robust.
    • Utilize async Playwright to be efficient for installation.
    • Enhance cleaning HTML content and make it more efficient.
  • Enhance Pydantic modeling.
    • Enhance dynamic model creation.
    • Enhance BaseExtractor.
  • Optimize performance for large-scale data extraction.

Acknowledgments

Utilizes OpenAI for advanced data extraction capabilities. Leverages Pydantic and Instructor for dynamic and robust data modeling. Employs langchain, playwright for efficient web interaction.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.