CrawlGPT

⚡ Fully automated web crawler. Crawl all the information you want from the Internet with GPT-3.5. Built with 🦜️🔗LangChain ⚡

Simple Demo

demo.mp4

What can it do?

  • Fully automated web crawling that simulates, as closely as possible, how a human searches for data.
  • Automatically collects all specified details on a given theme, across the entire Internet or within given web domains.
  • Automatically searches the Internet for answers to fill in specified details that are missing while crawling.
  • 👇 A simple example 👇
    • Input:
      • the theme you want to crawl: Cases of mergers and acquisitions of fast food industry enterprises in America after 2010
      • Specific detail 0: When the merger occurred
      • Specific detail 1: Acquirer
      • Specific detail 2: Acquired party
      • Specific detail 3: The CEO of acquirer
      • Specific detail 4: The CEO of acquired party
      • (Optional) Web domains to limit the crawl to: ["nytimes.com", "cnn.com"]
    • Output: JSON containing all specified details about the theme. The output format is:
      {
          "events_num": N,
          "details":  ### The length of this list is N.
          [ 
              {
                  "When the merger occurred": <answer>,
                  "Acquirer": <answer>,
                  "Acquired party": <answer>,
                  "The CEO of acquirer": <answer>,
                  "The CEO of acquired party": <answer>,
                  "source_url": <url>
              },
              {
                  "When the merger occurred": <answer>,
                  "Acquirer": <answer>,
                  "Acquired party": <answer>,
                  "The CEO of acquirer": <answer>,
                  "The CEO of acquired party": <answer>,
                  "source_url": <url>
              },
              ..............
          ]
      }
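
To consume this output downstream, a minimal Python sketch might look like the following. It assumes the JSON has exactly the shape shown above; the sample values are placeholders and the helper name `summarize_events` is hypothetical, made up for illustration.

```python
import json

# Hypothetical sample in the output format shown above (values are placeholders).
SAMPLE_OUTPUT = """
{
    "events_num": 1,
    "details": [
        {
            "When the merger occurred": "<answer>",
            "Acquirer": "<answer>",
            "Acquired party": "<answer>",
            "The CEO of acquirer": "<answer>",
            "The CEO of acquired party": "<answer>",
            "source_url": "<url>"
        }
    ]
}
"""


def summarize_events(raw_json: str) -> list[str]:
    """Return one summary line per event, checking events_num against the list length."""
    data = json.loads(raw_json)
    details = data["details"]
    if data["events_num"] != len(details):
        raise ValueError("events_num does not match the number of detail entries")
    lines = []
    for i, event in enumerate(details):
        # Every key except source_url is one of the user-specified details.
        fields = [f"{k}={v}" for k, v in event.items() if k != "source_url"]
        lines.append(f"[{i}] " + "; ".join(fields) + f" (source: {event['source_url']})")
    return lines


if __name__ == "__main__":
    for line in summarize_events(SAMPLE_OUTPUT):
        print(line)
```

For a real run, the same function could be applied to the contents of final_dict.json, assuming that file follows the format above.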
      

Why does a web crawler need GPT?

  • GPT can extract the necessary information by directly understanding the content of each webpage, so you do not have to write complex crawling rules.
  • With internet access, GPT can check the accuracy of crawler results and supplement missing information.

How does it work?

  1. Generate suitable Google search queries for the theme with GPT-3.5.
  2. Run each query as a simulated Google search, across the entire Internet or within the given web domains (if any).
  3. Browse every website returned by the searches.
  4. Extract the specified details of the theme from each website's content with GPT-3.5.
  5. Search the Internet independently for any missing details, similar to Auto-GPT, using the LangChain implementation of MRKL and ReAct.
  6. Encapsulate all results into a single JSON object.
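
The six steps above can be sketched as a plain Python control flow. This is an illustrative outline, not the project's actual code: `generate_queries`, `search`, `fetch`, `extract`, and `fill_missing` are hypothetical callables standing in for the GPT-3.5, Google search, browsing, and agent steps, and appending a `site:` filter to restrict domains is an assumption about how the domain limit could be applied.

```python
import json
from typing import Callable, Optional


def restrict_query(query: str, domains: Optional[list[str]]) -> str:
    """Append Google `site:` filters when a domain list is given (assumed approach)."""
    if not domains:
        return query
    return f"{query} " + " OR ".join(f"site:{d}" for d in domains)


def crawl(
    theme: str,
    detail_list: list[str],
    generate_queries: Callable[[str], list[str]],     # step 1: theme -> queries
    search: Callable[[str], list[str]],               # step 2: query -> URLs
    fetch: Callable[[str], str],                      # step 3: URL -> page text
    extract: Callable[[str, list[str]], list[dict]],  # step 4: text -> events
    fill_missing: Callable[[dict, str], str],         # step 5: event, detail -> answer
    domains: Optional[list[str]] = None,
) -> str:
    events = []
    seen_urls = set()
    for query in generate_queries(theme):
        for url in search(restrict_query(query, domains)):
            if url in seen_urls:  # skip pages already visited via another query
                continue
            seen_urls.add(url)
            for event in extract(fetch(url), detail_list):
                for detail in detail_list:
                    if not event.get(detail):  # missing or empty answer
                        event[detail] = fill_missing(event, detail)
                event["source_url"] = url
                events.append(event)
    # Step 6: wrap everything in the documented output format.
    return json.dumps({"events_num": len(events), "details": events}, indent=4)


if __name__ == "__main__":
    # Stub callables so the control flow can be run without any API keys.
    print(
        crawl(
            theme="example theme",
            detail_list=["Acquirer", "Acquired party"],
            generate_queries=lambda theme: [theme],
            search=lambda query: ["https://example.com/a"],
            fetch=lambda url: "page text",
            extract=lambda text, details: [{"Acquirer": "A", "Acquired party": ""}],
            fill_missing=lambda event, detail: "B",
        )
    )
```

Passing the steps in as callables keeps the outline runnable with stubs; in the real pipeline each one would wrap a GPT-3.5 or search call.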

Quick Install

  • OPENAI_API_KEY: You need an OpenAI API key. Set os.environ["OPENAI_API_KEY"] in pipeline.py.
  • SERPER_API_KEY: To search for correct, real-time information, you need a Google Serper API key. Registration takes only a short time. Set os.environ["SERPER_API_KEY"] in pipeline.py; you get 1000 free queries every month.
  • Hyperparameters:
    • QUERY_NUM: The number of Google searches, each with a different query. Default is 2.
    • QUERY_RESULTS_NUM: The number of results returned per search. Default is 4.
    • THEME: The theme of the web crawler.
    • DETAIL_LIST: The specific details to collect for the theme.
    • (Optional) URL_DOMAIN_LIST: The valid web domains or URL prefixes.
  • Install Python 3.11.
  • Install the necessary dependencies: pip install -r requirements.txt
  • Run it: python pipeline.py > output.txt
  • Read results from final_dict.json.
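
Put together, the configuration described above might look like the following near the top of pipeline.py. The variable names come from the list above; the placeholder key strings are illustrative, the THEME, DETAIL_LIST, and URL_DOMAIN_LIST values are taken from the simple example earlier in this README, and the actual layout of pipeline.py may differ. The `max_pages` line is a hypothetical addition to show how the two numeric settings interact.

```python
import os

# API keys: replace the placeholders with your own keys.
os.environ["OPENAI_API_KEY"] = "<your-openai-api-key>"
os.environ["SERPER_API_KEY"] = "<your-serper-api-key>"

# Hyperparameters (defaults as documented above).
QUERY_NUM = 2          # number of Google searches, each with a different query
QUERY_RESULTS_NUM = 4  # number of results returned per search

THEME = (
    "Cases of mergers and acquisitions of fast food industry "
    "enterprises in America after 2010"
)
DETAIL_LIST = [
    "When the merger occurred",
    "Acquirer",
    "Acquired party",
    "The CEO of acquirer",
    "The CEO of acquired party",
]
URL_DOMAIN_LIST = ["nytimes.com", "cnn.com"]  # optional

# Hypothetical helper value: upper bound on pages visited,
# assuming every search returns distinct URLs.
max_pages = QUERY_NUM * QUERY_RESULTS_NUM
print(max_pages)
```

With the defaults, at most 2 × 4 = 8 pages are browsed per run, which is worth keeping in mind given the per-page token cost.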

TODO

  • Support crawling within a given list of web domains.
  • The LangChain implementation of MRKL and ReAct carries the risk of divergent output. That is, the content of a response may exceed our limits.
  • Automatically write research reports based on crawling results.
  • GPT consumes a huge number of tokens while browsing webpages 😢. Reduce the consumption.
  • Browse PDF files reached through PDF links on websites.
  • Make the entire pipeline registration-free (except for OpenAI).
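
As one possible direction for the token-consumption item, page HTML could be reduced to visible text and capped in length before being sent to GPT. The sketch below uses only the Python standard library; the class name `VisibleTextExtractor`, the function `page_to_prompt_text`, and the 4000-character cap are hypothetical choices, not part of the current pipeline.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text outside script/style, with whitespace collapsed.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(" ".join(data.split()))


def page_to_prompt_text(html: str, max_chars: int = 4000) -> str:
    """Strip markup from a page and truncate the result to max_chars characters."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)[:max_chars]


if __name__ == "__main__":
    sample = "<html><head><style>p{}</style></head><body><p>Hello  world</p></body></html>"
    print(page_to_prompt_text(sample))
```

A character cap is only a rough proxy for a token limit; a tokenizer-based cap would be more precise but adds a dependency.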
