
gpt-crawler's Introduction

GPT Crawler

Crawl a site to generate knowledge files to create your own custom GPT from one or multiple URLs

Gif showing the crawl run

Example

Here is a custom GPT that I quickly made to help answer questions about how to use and integrate Builder.io by simply providing the URL to the Builder docs.

This project crawled the docs and generated the file that I uploaded as the basis for the custom GPT.

Try it out yourself by asking questions about how to integrate Builder.io into a site.

Note that you may need a paid ChatGPT plan to access this feature

Get started

Running locally

Clone the repository

Be sure you have Node.js >= 16 installed.

git clone https://github.com/builderio/gpt-crawler

Install dependencies

npm i

Configure the crawler

Open config.ts and edit the url and selector properties to match your needs.

For example, to crawl the Builder.io docs to make our custom GPT, you can use:

export const defaultConfig: Config = {
  url: "https://www.builder.io/c/docs/developers",
  match: "https://www.builder.io/c/docs/**",
  selector: `.docs-builder-container`,
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
};

See config.ts for all available options. Here is a sample of the common configuration options:

type Config = {
  /** URL to start the crawl; if a sitemap URL is provided, it will be used instead and all pages in the sitemap will be downloaded */
  url: string;
  /** Pattern to match against for links on a page to subsequently crawl */
  match: string;
  /** Selector to grab the inner text from */
  selector: string;
  /** Don't crawl more than this many pages */
  maxPagesToCrawl: number;
  /** File name for the finished data */
  outputFileName: string;
  /** Optional resources to exclude
   *
   * @example
   * ['png','jpg','jpeg','gif','svg','css','js','ico','woff','woff2','ttf','eot','otf','mp4','mp3','webm','ogg','wav','flac','aac','zip','tar','gz','rar','7z','exe','dmg','apk','csv','xls','xlsx','doc','docx','pdf','epub','iso','dmg','bin','ppt','pptx','odt','avi','mkv','xml','json','yml','yaml','rss','atom','swf','txt','dart','webp','bmp','tif','psd','ai','indd','eps','ps','zipx','srt','wasm','m4v','m4a','webp','weba','m4b','opus','ogv','ogm','oga','spx','ogx','flv','3gp','3g2','jxr','wdp','jng','hief','avif','apng','avifs','heif','heic','cur','ico','ani','jp2','jpm','jpx','mj2','wmv','wma','aac','tif','tiff','mpg','mpeg','mov','avi','wmv','flv','swf','mkv','m4v','m4p','m4b','m4r','m4a','mp3','wav','wma','ogg','oga','webm','3gp','3g2','flac','spx','amr','mid','midi','mka','dts','ac3','eac3','weba','m3u','m3u8','ts','wpl','pls','vob','ifo','bup','svcd','drc','dsm','dsv','dsa','dss','vivo','ivf','dvd','fli','flc','flic','flic','mng','asf','m2v','asx','ram','ra','rm','rpm','roq','smi','smil','wmf','wmz','wmd','wvx','wmx','movie','wri','ins','isp','acsm','djvu','fb2','xps','oxps','ps','eps','ai','prn','svg','dwg','dxf','ttf','fnt','fon','otf','cab']
   */
  resourceExclusions?: string[];
  /** Optional maximum file size in megabytes to include in the output file */
  maxFileSize?: number;
  /** Optional maximum number of tokens to include in the output file */
  maxTokens?: number;
};
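
For instance, to keep binary assets out of the crawl you could extend the earlier example like this (a sketch that only uses the options documented above; adjust the extension list and values for your site):

export const defaultConfig: Config = {
  url: "https://www.builder.io/c/docs/developers",
  match: "https://www.builder.io/c/docs/**",
  selector: ".docs-builder-container",
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
  // Skip common binary assets; see the @example list above for a fuller set of extensions
  resourceExclusions: ["png", "jpg", "jpeg", "gif", "svg", "woff", "woff2", "pdf", "zip"],
};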

Run your crawler

npm start

Alternative methods

To obtain output.json with a containerized execution, go into the containerapp directory and modify config.ts as shown above. The output.json file will be generated in the data folder. Note: the outputFileName property in the config.ts file in the containerapp directory is configured to work with the container.

Running as an API

To run the app as an API server, first run npm install to install the dependencies. The server is written in Express.js.

To run the server:

Run npm run start:server to start the server. The server runs on port 3000 by default.

Send a POST request to the /crawl endpoint with the config JSON as the request body to run the crawler. The API docs are served at the /api-docs endpoint using Swagger.

To modify the environment, copy .env.example to .env and set your values (port, etc.) to override the server defaults.
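
For example, once the server is running you can trigger a crawl from a small script like the following (a sketch; it assumes the default port, Node 18+ for the global fetch, and the same config shape shown earlier):

// crawl-request.ts -- hypothetical client script, not part of the repo
async function main() {
  const config = {
    url: "https://www.builder.io/c/docs/developers",
    match: "https://www.builder.io/c/docs/**",
    selector: ".docs-builder-container",
    maxPagesToCrawl: 50,
    outputFileName: "output.json",
  };

  const response = await fetch("http://localhost:3000/crawl", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(config),
  });

  // The exact response shape depends on the server implementation
  console.log(await response.text());
}

main().catch(console.error);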

Upload your data to OpenAI

The crawl will generate a file called output.json at the root of this project. Upload that to OpenAI to create your custom assistant or custom GPT.

Create a custom GPT

Use this option for UI access to your generated knowledge that you can easily share with others

Note: you may need a paid ChatGPT plan to create and use custom GPTs right now

  1. Go to https://chat.openai.com/
  2. Click your name in the bottom left corner
  3. Choose "My GPTs" in the menu
  4. Choose "Create a GPT"
  5. Choose "Configure"
  6. Under "Knowledge" choose "Upload a file" and upload the file you generated
  7. If you get an error that the file is too large, you can split it into multiple files and upload them separately using the maxFileSize option in config.ts, or reduce the file size by limiting tokens with the maxTokens option; see the example below
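
For example (a sketch; tune the numbers to your data):

export const defaultConfig: Config = {
  url: "https://www.builder.io/c/docs/developers",
  match: "https://www.builder.io/c/docs/**",
  selector: ".docs-builder-container",
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
  maxFileSize: 1,     // split the output into files of at most ~1 MB each
  maxTokens: 500000,  // and/or cap the number of tokens per output file
};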

Gif of how to upload a custom GPT

Create a custom assistant

Use this option for API access to your generated knowledge that you can integrate into your product.

  1. Go to https://platform.openai.com/assistants
  2. Click "+ Create"
  3. Choose "upload" and upload the file you generated

Gif of how to upload to an assistant

Contributing

Know how to make this project better? Send a PR!



Made with love by Builder.io


gpt-crawler's Issues

Scope of the crawler (limits)

First of all, sounds really cool!

How robust is the crawler in terms of what it can crawl?
As long as there is HTML, should it work?
And what is the "storage" limit: can I let it crawl the official Python docs? The limit might not be on the crawler's side but on that of the LLM I plug the .json file into?

Cheers

Multiple concurrent crawlers with split output. Asking if there is interest in completing my fork.

@steve8708 Gauging interest: I have done a big refactoring of the codebase to integrate these features:

  • excludeSelectors: remove elements that you don't want in the output data
  • Cleaner output: Remove some
  • Refactoring of the full code
  • Concurrency
  • Multiple configs
  • Config parsing now sets defaults if not defined
  • ProgressBar logging
  • Sub routing namings
  • Output now generated in its own folder
  • Change output.json to output/data.json
  • Fix .gitignore
  • Added Prettier to the project (happy to revert this if not wanted)

Things that would be required to fully "complete" the PR:

  • CLI full support
  • Terminal logs fixed. (Mostly INFO and ERROR logs from PlaywrightCrawler)

My needs:

I wanted to create a knowledge base for Godot, but wanted to separate each section into its own file. I managed to do it with multiple configs. But now that it's done and I have the output I needed, I'm not interested in fixing the logging part. It was useful when I spotted errors, but not that helpful imo.

Current state

So the current changes are big and about 90% finished. Nonetheless, I think they are an improvement, just not a "fully stable" and completed one... Everything that was added is functional, but I still have issues with the terminal output. If the lines get wrapped, the output gets ugly. Nx has a similar issue with their run-many CLI, so I don't know if it's VS Code, the terminal, or the lib... I'm just not interested in completing the feature.

> @builder.io/[email protected] build
> tsc

Crawling started.
████████████████████████████████████████ | about | 7/7 (L: 50, F: 7) | ETA: 0s | Completed
████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░ | getting_started | 10/33 (L: 50, F: 33) | ETA: 101s | /getting_started/step_by_step/instancing.html
███████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ | tutorials | 9/50 (L: 50, F: 327) | ETA: 268s | /tutorials/best_practices/godot_interfaces.html
████████████████████████████████████████ | community | 7/7 (L: 50, F: 7) | ETA: 0s | Completed
████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ | contributing | 9/47 (L: 50, F: 47) | ETA: 248s | /contributing/development/index.html INFO Statistics: null request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":6323,"requestsFinishedPerMinute":9,"requestsFailedPerMinute":0,"requestTotalDurationMillis":56909,"requestsTotal":9,"crawlerRuntimeMillis":60560,"retryHistogram":[9]}
████████████████████████████████████████ | about | 7/7 (L: 50, F: 7) | ETA: 0s | Completed
████████████████████████████████░░░░░░░░ | getting_started | 26/33 (L: 50, F: 33) | ETA: 28s | /getting_started/first_3d_game/03.player_movement_code.html
█████████████████████░░░░░░░░░░░░░░░░░░░ | tutorials | 26/50 (L: 50, F: 327) | ETA: 91s | /tutorials/editor/managing_editor_features.html
████████████████████████████████████████ | community | 7/7 (L: 50, F: 7) | ETA: 0s | Completed
█████████████████████░░░░░░░░░░░░░░░░░░░ | contributing | 26/50 (L: 50, F: 57) | ETA: 92s | /contributing/development/debugging/using_sanitizers.html INFO Statistics: null request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":4464,"requestsFinishedPerMinute":13,"requestsFailedPerMinute":0,"requestTotalDurationMillis":116054,"requestsTotal":26,"crawlerRuntimeMillis":120568,"retryHistogram":[26]}
████████████████████████████████████████ | about | 7/7 (L: 50, F: 7) | ETA: 0s | Completed
████████████████████████████████████████ | getting_started | 33/33 (L: 50, F: 33) | ETA: 0s | Completed
██████████████████████████████████████░░ | tutorials | 47/50 (L: 50, F: 327) | ETA: 8s | /tutorials/3d/procedural_geometry/arraymesh.html
████████████████████████████████████████ | community | 7/7 (L: 50, F: 7) | ETA: 0s | Completed
███████████████████████████████████░░░░░ | contributing | 44/50 (L: 50, F: 73) | ETA: 19s | /contributing/documentation/class_reference_primer.html INFO Sta ████████████████████████████████████████ | about | 7/7 (L: 50, F: 7) | ETA: 0s | Completed
████████████████████████████████████████ | getting_started | 33/33 (L: 50, F: 33) | ETA: 0s | Completed
████████████████████████████████████████ | tutorials | 50/50 (L: 50, F: 327) | ETA: 0s | Completed
████████████████████████████████████████ | community | 7/7 (L: 50, F: 7) | ETA: 0s | Completed
████████████████████████████████████████ | contributing | 50/50 (L: 50, F: 73) | ETA: 0s | Completed

I made this multi progress bar because, with concurrent crawling, the log was hard to follow. With it, the output is easier to follow, but when other logging (errors, info, and so on) happens in the meantime, it's a mess...

The issue:

When this type of line appears from PlaywrightCrawler, it breaks the multi progress bar:

INFO Statistics: null request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":4525,"requestsFinishedPerMinute":12,"requestsFailedPerMinute":0,"requestTotalDurationMillis":113118,"requestsTotal":25,"crawlerRuntimeMillis":120511,"retryHistogram":[25]}

The multi progress bar display gets garbled. I don't understand terminals and Playwright well enough to know exactly what to change to fix this.
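
One possible mitigation (just a sketch, untested against this fork): crawlee exposes a global logger whose level can be raised so that PlaywrightCrawler's periodic INFO statistics lines are suppressed and no longer interleave with the progress bars:

import { log, LogLevel } from "crawlee";

// Hide INFO-level output (including the periodic "Statistics" lines)
// so that only warnings and errors reach the terminal.
log.setLevel(LogLevel.WARNING);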

Why am I asking?

I have no interest in fixing the terminal output since I got what I wanted, but the changes as a whole are an improvement, so I'm asking whether I can open a PR and let someone else fix that issue in the PR and push it. I guess the concurrency part could be omitted, and that would "make the PR complete".

Other changes that I can omit if not wanted.

I use a "modern" Prettier config; my editor formats with my own config if none exists in the repo I'm working in. I set up Prettier because I was already changing formatting on save, but I'm OK with reverting this. I wasn't planning to make big changes, so I'm also willing to remove that if there's no interest.

Here's a visual preview:

[screenshots of the multi progress bar output]

  • Won't push the config changes though (maybe only the typing).

Crawling openai.com returns nothing

import { Config } from "./src/config";

export const defaultConfig: Config = {
  url: "https://openai.com/",
  match: "gpt",
  maxPagesToCrawl: 100,
  outputFileName: "outputzyx.json",
};

INFO PlaywrightCrawler: Starting the crawler.
INFO PlaywrightCrawler: Crawling: Page 1 / 100 - URL: https://openai.com/...
INFO PlaywrightCrawler: All requests from the queue have been processed, the crawler will shut down.
INFO PlaywrightCrawler: Final request statistics: {"requestsFinished":1,"requestsFailed":0,"retryHistogram":[1],"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":6738,"requestsFinishedPerMinute":9,"requestsFailedPerMinute":0,"requestTotalDurationMillis":6738,"requestsTotal":1,"crawlerRuntimeMillis":6867}
INFO PlaywrightCrawler: Finished! Total 1 requests: 1 succeeded, 0 failed. {"terminal":true}
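
A likely explanation (a guess based on the match option's description in the README above): match is compared against full link URLs, in the same glob style as the "https://www.builder.io/c/docs/**" example, so a bare keyword like "gpt" never matches anything and no links get enqueued. A pattern along these lines should enqueue pages instead (sketch only; narrow it down as needed):

import { Config } from "./src/config";

export const defaultConfig: Config = {
  url: "https://openai.com/",
  // Match full URLs with a glob, not a bare keyword
  match: "https://openai.com/**",
  maxPagesToCrawl: 100,
  outputFileName: "outputzyx.json",
};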

Crawling duplicated url

The crawler needs to ignore a URL if the page has already been crawled; in my case, the same URL is crawled several times.

Expose the service as a REST API

As a follow up on this pull request #38

I was wondering if it's possible to expose the service as an API. It would be a lot easier and simpler to run it locally, without the need to publish the gpt crawler. It would be perfect if it's containerized!
I'm no expert in JS; I tried to implement an Express.js server with the help of ChatGPT, but I hit a lot of exceptions and errors, so I gave up ^^

This is my attempt:

// file: app/src/api.ts

import express from 'express';
import cors from 'cors';
import fileUpload, { UploadedFile } from 'express-fileupload';
import { readFile } from 'fs/promises';
import { startCrawling } from "./main";

// Create a new express application instance
const app = express();
const port = 3000; // You may want to make the port configurable

// Enable CORS, JSON parsing, and file upload handling
app.use(cors());
app.use(express.json());
app.use(fileUpload());

// Define a POST route to accept config and run the crawler
app.post('/crawl', async (req, res) => {
    // Verify that we have the configuration in the request
    if (!req.files || !req.files.config) {
        return res.status(400).json({ message: 'Config file is required.' });
    }

    // Read the configuration file sent as form-data
    const configFile = req.files.config as UploadedFile;
    const config = JSON.parse(configFile.data.toString('utf-8'));

    // Run the crawler and return the generated output file
    try {
        await startCrawling(config);

        // Read the output file after crawling and send it in the response
        const outputFileContent = await readFile(config.outputFileName, 'utf-8');
        res.contentType('application/json');
        return res.send(outputFileContent);
    } catch (error) {
        return res.status(500).json({ message: 'Error occurred during crawling', error });
    }
});

// Start the Express server
app.listen(port, () => {
    console.log(`API server listening at http://localhost:${port}`);
});

export default app;

Scrape website

Thank you for making this.
I want to scrape the Europages website, which contains a lot of businesses. How can I make a list with my own preferences instead of making a chatbot?

Add autoScroll for crawling pages that need scrolling?

I tried the original code on our platform's pages, and it couldn't capture the full information on each page.

Therefore, I added autoScroll code in main.ts for this and it worked perfectly.
(I think it is better than increasing waitForSelectorTimeout.)

async function autoScroll(page: Page) {
  await page.evaluate(async () => {
    await new Promise<void>((resolve, reject) => {
      var totalHeight = 0;
      var distance = 100;
      var timer = setInterval(() => {
        var scrollHeight = document.body.scrollHeight;
        window.scrollBy(0, distance);
        totalHeight += distance;

        if (totalHeight >= scrollHeight) {
          clearInterval(timer);
          resolve();
        }
      }, 100);
    });
  });
}

if (process.env.NO_CRAWL !== "true") {
  const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, enqueueLinks, log, pushData }) {
      try {
        if (config.cookie) {
          const cookie = {
            name: config.cookie.name,
            value: config.cookie.value,
            url: request.loadedUrl, 
          };
          await page.context().addCookies([cookie]);
        }

        const title = await page.title();
        log.info(`Crawling ${request.loadedUrl}...`);

        await page.waitForSelector(config.selector, {
          timeout: config.waitForSelectorTimeout,
        });

        await autoScroll(page);  

        const html = await getPageHtml(page);
        await pushData({ title, url: request.loadedUrl, html });

        if (config.onVisitPage) {
          await config.onVisitPage({ page, pushData });
        }

        await enqueueLinks({
          globs: [config.match],
        });
      } catch (error) {
        log.error(`Error crawling ${request.loadedUrl}: ${error}`);
      }
    },
    maxRequestsPerCrawl: config.maxPagesToCrawl,
    // headless: false,
  });

  await crawler.run([config.url]);
}

If you think this is good enough for crawling, I hope it will be helpful for other users.

Thank you for your work, btw!

I really appreciate it!

Thank you.

403 error on Zendesk specifically

I can crawl other sites just fine, but for some reason any Zendesk site gives me a 403. Any advice on how to fix this? Our docs are completely in Zendesk 😬

Edit: User error.

Comparison with well-established crawlers

How exactly is this project different from an established crawler that would just dump the HTML text into the .html field of a JSON array?

It's got 12k stars, but it lacks basic features like canonicalizing links (see #73) or preserving links (#74).

Wildcard support

I've noticed you can't currently use any regex when defining URLs, but is there some other way to leverage different subdomains or wildcard characters?

For example, I want to crawl multiple subdomains that follow a similar structure: https://pco[NAME].zendesk.com/. I was thinking I could change the match field to accept a string array, but I also couldn't use regex to wildcard the [NAME] piece of the subdomain. Is there some other way to achieve this?

Allow `waitForSelector` timeout configuration

Some pages take more than 1000 ms to load, which is the default timeout here. For these cases it would be useful to be able to configure the waitForSelector timeout in config.ts.

[FR] Multitasking system

Is it possible to add a system for launching multiple tasks simultaneously? And also a system for a task list?

Any way to use a sitemap.xml for the crawler?

I can't seem to get the crawler to crawl every page of a website; it sometimes misses a lot. The site does have a sitemap.xml with every link I would want, though. Is there any way to use that? If so, how?
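
Per the url option's description above (a sitemap URL "will be used instead and all pages in the sitemap will be downloaded"), pointing url at the sitemap may do what you want. A sketch with a placeholder domain:

export const defaultConfig: Config = {
  // Point the crawler at the sitemap instead of a start page
  url: "https://example.com/sitemap.xml",
  match: "https://example.com/**",
  selector: "body",
  maxPagesToCrawl: 1000,
  outputFileName: "output.json",
};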

Add CI

It would be great to set up GitHub Actions to run prettier --check and a build on each PR to ensure those pass.

How to Choose a Suitable CSS Selector for a Website

  1. Inspecting the web page structure:

    • Open the target website (e.g., https://www.google.com.hk/webhp?hl=zh-CN&sourceid=cnhp/).
    • Right-click on the page element you wish to crawl (such as a specific text or area) and select "Inspect" to open the browser's developer tools.

  2. Analyzing the element:

    • In the developer tools, examine the HTML code of the element.
    • Look for attributes that uniquely identify the element or its container, such as class, id, or other attributes.

  3. Building a CSS selector:

    • Create a CSS selector based on the attributes you observed.
    • For example, if an element has class="content", the selector could be .content.
    • If the element has multiple classes, you can combine them like .class1.class2.

  4. Testing the selector:

    • In the "Console" tab of the developer tools, use document.querySelector('YOUR_SELECTOR') to test whether the selector accurately selects the target element.

  5. Applying the selector:

    • Once a suitable selector is found, apply it in the selector field of your crawler configuration.

Ensure that the chosen CSS selector accurately reflects the content you wish to extract from the webpage. An incorrect selector might result in the crawler not being able to retrieve the desired data.
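
For example, to verify and then use a selector for a docs site whose main content sits in an element with the class docs-builder-container (the class from the README example above):

// In the browser console: confirm the selector finds the content you want
document.querySelector(".docs-builder-container");

// Then in config.ts:
export const defaultConfig: Config = {
  url: "https://www.builder.io/c/docs/developers",
  match: "https://www.builder.io/c/docs/**",
  selector: ".docs-builder-container",
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
};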

How to search all URLs with a certain word in it

How do I match all URLs that contain a certain word?

E.g. the word usa anywhere within any URL on the ft.com site?

ft.com/usa/xyz
ft.com/today/opinion/usa
ft.com/today/articles/usa

Is this done with the selector? If so, how do you do it?

[FR] Exclude a list of urls

Can you add a feature to skip crawling a given list of URLs?

Sample:
match: [
  "https://www.builder.io/",
],
exclude: [
  "https://www.builder.io/blog/",
]

FR: preserve links

output.json supports markdown links. This is super useful to point users to further information.

Try crawling this Notion site for example (selector .layout) and prompt "How do I make the best of NC12". The answer will instruct you to join the Telegram chat and FB group, but the links for those are lost.

The crawler should convert <a> links to Markdown links.

Adding proxy support?

A lot of websites have anti-crawler protection; maybe it's a good idea to add proxy support?

Turning a website into JSON data doesn't make the GPT more useful.

There is a similar project; the general idea is to write all your local files (tree structure) into an output JSON file, recording the full path of each file as the JSON key and the file content as the value.

However, I found that doing so did not make the GPT application any smarter. Because the context length of GPT is limited, if the amount of data is relatively large (in fact, just a few HTML pages is enough), the model has difficulty processing it.

It's fine to generate an output.json for a website, but output.json can be a large file, which is hard for GPT-4 to read.

[FR] Optimization of Data Formatting for Custom GPT

Context:
The current generation of the JSON database for Custom GPT produces redundant data, particularly in common parts of HTML content.

Proposal:
Integrate a feature for optimization based on a hashing technique to minimize tokens in HTML. This approach should not only identify common parts of HTML but also find the most optimized hashes to reduce the total number of tokens.

Technical Advantages:

  1. Space Optimization: Hash-based deduplication minimizes data replication, significantly reducing the number of tokens and the overall weight of the JSON file.
  2. Storage Efficiency: Hash representation allows storing common parts only once, saving space and improving storage efficiency.
  3. Lightweight Transmission: The resulting lightweight file facilitates data transmission, reducing transfer times and enhancing performance.

Proposed Operation:
Each common part is subjected to a hashing function, generating a unique key. However, the hashing algorithm must be optimized to minimize the number of tokens. These hash keys are then stored in an array, while the original values of the title, URL, and HTML content in the JSON database refer to these keys.

Concrete Example:
Consider two articles with similar HTML content containing common parts:

  1. Article on Artificial Intelligence:

    • Title: "Article on Artificial Intelligence"
    • URL: "https://example.com/article1"
    • HTML: "Artificial intelligence (AI) is a discipline of computer science revolutionizing many sectors. Welcome to our site."
  2. In-Depth Exploration of AI:

    • Title: "In-Depth Exploration of AI"
    • URL: "https://example.com/article2"
    • HTML: "Artificial intelligence, also known as AI, is a discipline of computer science. Welcome to our site."

Identify common parts in HTML and apply an optimized hashing algorithm to minimize tokens.

// Optimized database
[
  {
    "title": "Article on Artificial Intelligence",
    "url": "https://example.com/article1",
    "html_hash": ["a1b2c3", "d4e5f6", "g7h8i9", "j10k11l12", "m13n14o15", "p16q17r18"]
  },
  {
    "title": "In-Depth Exploration of AI",
    "url": "https://example.com/article2",
    "html_hash": ["a1b2c3", "s19t20u21", "g7h8i9", "j10k11l12", "v22w23x24"]
  }
]

// Table of hashed common phrases
{
  "a1b2c3": "Artificial intelligence",
  "d4e5f6": " (AI)",
  "g7h8i9": " is a discipline of computer science...",
  "j10k11l12": "...",
  "m13n14o15": "revolutionizing many sectors...",
  "p16q17r18": "...",
  "s19t20u21": "also known as AI...",
  "v22w23x24": "..."
}
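
A minimal sketch of the proposed deduplication step (the sentence-level splitting, the 6-character SHA-1 keys, and the deduplicate helper are all illustrative assumptions; a real implementation would need smarter segmentation and collision handling):

import { createHash } from "node:crypto";

// Replace each repeated fragment with a short hash key and store the fragment only once.
function deduplicate(pages: { title: string; url: string; html: string }[]) {
  const fragments: Record<string, string> = {};
  const optimized = pages.map((page) => {
    const html_hash = page.html
      .split(/(?<=[.!?])\s+/) // naive sentence-level split
      .map((fragment) => {
        const key = createHash("sha1").update(fragment).digest("hex").slice(0, 6);
        fragments[key] = fragment;
        return key;
      });
    return { title: page.title, url: page.url, html_hash };
  });
  return { optimized, fragments };
}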

Selector scrape by type

Is there a simple way to select all text, images, or URLs, so the tool can be used for all sorts of websites without first inspecting each website for the correct elements?

[Feature Request] A Way to Split a Knowledge File into Multiple Files

One issue I've been running into is that some pages, or sets of pages, I am scraping are too large for one knowledge file, and ChatGPT complains there is too much text in the file for it to use. So I thought there should be an option that allows me to split a knowledge file into multiple knowledge files, so that I can feed it several smaller files instead of one large one.

Maybe an option in the 'config.ts' file like:

  • splitKnowledgeFile: 3;

where the default is 1 (so no splitting), and setting it to any higher number splits the knowledge file into multiple knowledge files with a number suffix.

Selector help

Thanks for building this. Just wondering if there is an easier or more dynamic way to find the selector? This seems to be the part where it either breaks or I have difficulty.

So my normal approach would be to visit the site I want to scrape, right-click the content I want to scrape, and click 'Inspect'. Then I right-click again to copy the 'selector'. But the selector ends up quite long and specific to that page... (e.g. #app > div.article-box.grid.container > div:nth-child(2) > div.acticle-content > div:nth-child(2) > div.normal.system.article-body > p:nth-child(6)

Any suggestions on how to streamline or fix this? Thanks again.

Use new `crawler.exportData` helper

Hello from the crawlee team!

Just a small suggestion: I was taking a peek at the code and saw you do this to create the data bundle at the end.

gpt-crawler/src/core.ts

Lines 115 to 129 in 27b65d3

export async function write(config: Config) {
  configSchema.parse(config);
  const jsonFiles = await glob("storage/datasets/default/*.json", {
    absolute: true,
  });
  const results = [];
  for (const file of jsonFiles) {
    const data = JSON.parse(await readFile(file, "utf-8"));
    results.push(data);
  }
  await writeFile(config.outputFileName, JSON.stringify(results, null, 2));
}

We recently added a new helper that does exactly the same:

https://crawlee.dev/api/basic-crawler/class/BasicCrawler#exportData

So you could replace the whole function with a simple crawler.exportData(config.outputFileName) call, and it will support both JSON and CSV automatically (based on the file extension).
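
For reference, the suggested replacement is a one-liner along these lines (a sketch; crawler here is assumed to be the PlaywrightCrawler instance created in core.ts):

// Instead of globbing the dataset files and writing them manually:
await crawler.exportData(config.outputFileName); // JSON or CSV, based on the file extension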

Fix for "Cannot find module '/home/myuser/dist/main.js'" Error in Docker Container

Issue Description:

When running the 'gpt-crawler' Docker container, I encountered an error stating that the module '/home/myuser/dist/main.js' could not be found. This issue prevented the crawler from starting.

Steps to Reproduce:

  1. Clone the 'gpt-crawler' repository.
  2. Build the Docker image using the Dockerfile provided in the root directory.
  3. Run the Docker container.
  4. Observe the error message indicating that '/home/myuser/dist/main.js' is missing.

Diagnostic Steps:

  • Checked the contents of the start_xvfb_and_run_cmd.sh script and verified it was executable.
  • Ran the script and observed the output, confirming the error.
  • Reviewed the package.json file and noticed the script "start:prod": "node dist/main.js".
  • Checked the Dockerfile and noticed the multi-stage build process.
  • Encountered an error during the build process: sh: 1: tsc: not found, indicating TypeScript (tsc) was not installed or not found in the PATH in the Docker container.
  • Realized the need to update the file path in package.json to point to the correct location of main.js.

Solution:

  • Updated the start:prod script in package.json from "start:prod": "node dist/main.js" to "start:prod": "node dist/src/main.js".
  • Rebuilt the Docker image with the updated package.json.
  • Ran the Docker container with the new image.
  • Confirmed that the crawler started successfully and began crawling the default website.

Suggested Changes:

  • Update the package.json file to correct the path in the start:prod script.
  • Ensure that the Dockerfile and associated scripts are set up to correctly locate and execute main.js.
  • Update documentation if necessary to reflect these changes and assist future users in setting up the crawler.

Proposed Solution in Detail:

In order to address the issue and ensure the proper functioning of the 'gpt-crawler' in a Docker environment, the following changes were made:

  1. Modification in package.json:

    • Updated the start:prod script to correctly reference the main JavaScript file generated by TypeScript. The original script was "start:prod": "node dist/main.js", which was incorrect as the main.js file is located in the dist/src directory after the TypeScript compilation. The updated script is "start:prod": "node dist/src/main.js".
    • This change ensures that when the Docker container starts, it correctly locates and executes the main JavaScript file.
  2. Dockerfile Adjustments:

    • The Dockerfile used for this fix was the one located in the root directory of the 'gpt-crawler' repository.
    • During the Docker build process, an error was encountered indicating that TypeScript (tsc) was not found. This was resolved by ensuring that TypeScript is installed and correctly set up in the Docker environment.
    • The multi-stage build process in the Dockerfile was reviewed and retained as it efficiently separates the build and runtime environments, reducing the final image size.
  3. Testing the Solution:

    • After making the above changes, the Docker image was rebuilt to incorporate these modifications.
    • The rebuilt Docker image was then run, and it was confirmed that the crawler successfully started and began crawling the default website without encountering the previous error.
  4. Pushing Changes to Forked Repository:

    • These changes have been committed and pushed to my fork of the 'gpt-crawler' repository. This includes the updated package.json and any other relevant modifications made to ensure the functionality of the crawler in a Docker environment.
    • The forked repository can be reviewed for a detailed view of all changes made.

Additional Notes:

  • The Dockerfile in the 'containerapp' directory was not used for this fix. It might serve a different purpose and should be reviewed separately.
  • Consider adding more detailed error handling and logging for easier troubleshooting in the future.

The issue title, description, and code fixes are generative work by ChatGPT Plugins ("Recombinant AI", "MixerBox ChatVideo").

This issue and the related pull request are submissions from an absolute open-source noob. Considering my lack of JavaScript development experience, all feedback is welcome.

Size

I have successfully crawled a whole website and have the output file as JSON.

The problem is that the file size is 93 MB, and after uploading it to ChatGPT I get an error message stating that the file is too large.
Is there a known size limit for uploads, and can the output be chunked into different parts?

Exclude directories

Can someone add a function where you can also exclude specific directories? For example, don't crawl example.com/products/ (and everything deeper inside that path)?

Are you interested in packaging it as a CLI?

Hey, all! Awesome project! I loved it, it's very useful.

Do you have any interest in packaging it as a CLI as well? It occurs to me that it would be more straightforward/developer-friendly to use it this way:

gpt-crawler --url https://www.builder.io/c/docs/developers --match https://www.builder.io/c/docs/** --selector .docs-builder-container --maxPagesToCrawl 50 --outputFileName output.json

Let me know. I would love to discuss this contribution 😉

I got a working example on my fork: https://github.com/marcelovicentegc/gpt-crawler/blob/main/src/cli.ts

FR: remove cruft from links

Currently the crawler seems to treat links as different if the query parameters differ. In some cases (e.g. utm_ trackers, Notion's pvs junk, and crap like that), the links should be cleaned up.

One way to address this would be to have an array of URL params in config.ts that should be removed in order to obtain the canonical URL for a page.
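
A minimal sketch of that idea (the canonicalizeUrl helper and its default parameter list are hypothetical, not part of the project):

// Strip configured query parameters so equivalent links collapse to one canonical URL.
function canonicalizeUrl(
  rawUrl: string,
  paramsToStrip: string[] = ["utm_source", "utm_medium", "utm_campaign", "pvs"],
): string {
  const url = new URL(rawUrl);
  for (const param of paramsToStrip) {
    url.searchParams.delete(param);
  }
  return url.toString();
}

// canonicalizeUrl("https://example.com/docs?utm_source=x&pvs=4") -> "https://example.com/docs"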
