
mendableai / firecrawl


🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.

Home Page: https://firecrawl.dev

License: GNU Affero General Public License v3.0

Dockerfile 0.34% JavaScript 4.52% TypeScript 83.62% Python 6.66% HTML 0.05% CSS 0.28% Go 4.53%
ai ai-scraping crawler data html-to-markdown llm markdown rag scraper scraping web-crawler

firecrawl's People

Contributors

100gle, calebpeffer, chand1012, dependabot[bot], elimisteve, eltociear, ericciarla, jakobstadlhuber, jhoseph88, kenthsu, kun432, lakr233, matsubo, mattjoyce, mattzcarey, mdp, mogery, nickscamara, niublibing, rafaelsideguide, rogerserper, rombru, sanix-darker, simonha9, snippet, szepeviktor, tak-s, tomkosm, tractorjuice, wahpiangle


firecrawl's Issues

OpenAPI Spec

I saw you have used Mintlify but couldn't find the OpenAPI spec itself. Could you share it?

Add a timeout parameter to the API

One thing that would be useful is the ability to set a timeout on these requests - a customer ended up implementing that on their side.

[Feat] Docker deployment

Could you please add support for Docker deployment to streamline setting up and running the project?

Thank you

Get Code for LLM Extract returns bad JSON

I understand LLM Extract is in alpha and Get Code is likely nondeterministic, so feel free to ignore. But this is what it gave me:

[Screenshot 2024-05-04 at 7:29:40 PM: the generated code]

which was missing a closing apostrophe and some commas:

[Screenshot 2024-05-04 at 7:30:00 PM: the invalid JSON]

The fixed JSON looked like this:

[Screenshot 2024-05-04 at 7:30:51 PM: the corrected JSON]

Unable to run python sdk sample code from README

Traceback (most recent call last):
File "/Users/howardgil/Desktop/Projects/nat-sec-hackathon/firecrawl.py", line 1, in <module>
from firecrawl import FirecrawlApp
File "/Users/howardgil/Desktop/Projects/nat-sec-hackathon/firecrawl.py", line 1, in <module>
from firecrawl import FirecrawlApp
ImportError: cannot import name 'FirecrawlApp' from partially initialized module 'firecrawl' (most likely due to a circular import) (/Users/howardgil/Desktop/Projects/nat-sec-hackathon/firecrawl.py)

Limit includes filtered-out paths

The crawl limit is applied before the paths are filtered out.

base url: test.com
limit: 2
included links: ["/pages/*"]

Links on test.com, in order:

[
"/home",
"/imprint",
"/about",
"pages/1",
"pages/2",
"pages/3"
]

Expected links to be crawled: ["pages/1", "pages/2"]

Currently crawled links: []
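
A minimal sketch of the expected ordering, filtering discovered links against the include patterns before applying the limit. The names and the RegExp-based matching are illustrative assumptions, not Firecrawl's internals:

function selectLinks(links: string[], includes: RegExp[], limit: number): string[] {
  // Apply include filters first; an empty include list means "keep everything"
  const filtered = includes.length
    ? links.filter((link) => includes.some((re) => re.test(link)))
    : links;
  return filtered.slice(0, limit); // the limit applies only after filtering
}

// With the example above:
// selectLinks(["/home", "/imprint", "/about", "pages/1", "pages/2", "pages/3"], [/^pages\//], 2)
// => ["pages/1", "pages/2"]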

Remove 'cookies' text when removing headers/footers, etc.

Remove any cookie-consent text when removing headers and footers.
Many sites in Europe display a cookie acceptance message, and sometimes this is the only text returned.

Sometimes it captures something like:

"Skip to main content\n\nCookies \n------------------------------\n\nWe use some essential cookies to make this service work.\n\nWe\u2019d also like to use analytics cookies so we can understand how you use the service and make improvements.\n\nAccept analytics cookies Reject analytics cookies How we use cookies\n\nYou can change your cookie settings\n at any time.\n\nHide cookie message\n\n"

[Bug] Limit on /search is not deterministic

Right now we limit the search results by applying it at the SERP API level using searchOptions.limit: n.

The problem is that some search results could be social media pages or websites that we don't support, causing the scrape to fail. This ends up causing the /search endpoint to return fewer results than expected.

The idea here is that we should search for n + y over the limit, where n is the limit and y is a picked constant. That way, if some fail, we can use the y extra links and try to call get-documents on them until we hit the correct limit n.
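
A minimal sketch of that idea; searchSerp and scrapePage are hypothetical placeholders standing in for the real internals:

declare function searchSerp(query: string, count: number): Promise<string[]>;
declare function scrapePage(url: string): Promise<{ url: string; markdown: string }>;

async function searchWithLimit(query: string, n: number, y = 5) {
  const candidates = await searchSerp(query, n + y); // over-fetch by a constant y
  const results: { url: string; markdown: string }[] = [];
  for (const url of candidates) {
    if (results.length >= n) break; // stop once the requested limit is met
    try {
      results.push(await scrapePage(url));
    } catch {
      // unsupported page (e.g. social media): skip it; a surplus link backfills
    }
  }
  return results;
}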

[Feat] Strip non-content tags, headers, footers

The markdown would be much more useful if you stripped headers/footers and other non-content tags (filters, etc.) that are low value for RAG/context. Either use tag- or class-based removal from the HTML, or something like Mozilla's Readability, or both! Highly opinionated class-based removal is risky but produces high-value content and less noise.

For example, a language selector in a header gets included in the output and should be stripped:

[Skip to main content](#main-content)

Select LanguageEnglishAfrikaansAlbanianArabicArmenianAzerbaijaniBasqueBelarusianBengaliBosnianBulgarianCatalanCebuanoChinese (Simplified)Chinese (Traditional)CroatianCzechDanishDutchEsperantoEstonianFilipinoFinnishFrenchGalicianGeorgianGermanGreekGujaratiHaitian CreoleHausaHebrewHindiHmongHungarianIcelandicIgboIndonesianIrishItalianJapaneseJavaneseKannadaKhmerKoreanLaoLatinLatvianLithuanianMacedonianMalayMalteseMaoriMarathiMongolianNepaliNorwegianPersianPolishPortuguesePunjabiRomanianRussianSerbianSlovakSlovenianSomaliSpanishSwahiliSwedishTamilTeluguThaiTurkishUkrainianUrduVietnameseWelshYiddishYorubaZulu

Here is a starter list; we should probably test against a couple thousand random pages and use an LLM like Haiku with vision as a judge.

const exclude = [
  'header', '.header', '.top', '.navbar', '#header',
  'footer', '.footer', '.bottom', '#footer',
  '.sidebar', '.side', '.aside', '#sidebar',
  '.modal', '.popup', '#modal', '.overlay',
  '.ad', '.ads', '.advert', '#ad',
  '.lang-selector', '.language', '#language-selector',
  '.social', '.social-media', '.social-links', '#social',
  '.menu', '.navigation', 'nav', '#nav',
  '.breadcrumbs', '#breadcrumbs',
  '.form', 'form', '#search-form',
  'script', 'noscript'
];
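
A minimal sketch of applying that list with cheerio before the html-to-md step, assuming the exclude array above; cheerio is what this sketch uses, not necessarily what the pipeline uses internally:

import * as cheerio from 'cheerio';

function stripNonContent(html: string): string {
  const $ = cheerio.load(html);
  $(exclude.join(', ')).remove(); // drop every matching element in one pass
  return $.html();
}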

[Test] Add integration tests for complex and larger variety of webpages

In tweaking and growing the HTML cleanup and html-to-md conversion, I highly recommend adding integration tests using either live webpages (to also test fetching/network behavior and dynamic websites) or at least saved HTML pages with complex layouts (and bad HTML, especially for the cleanup).

  • Find a list of pages to use as a test suite, with a variety of layouts
  • Add the integration tests

[Feat] Be able to pass a timeout param to the endpoints

Enable the user to pass a timeout parameter to both the scrape and the crawl endpoints. If the timeout is exceeded, send the user a clear error message. On the crawl endpoint, return any pages that have already been scraped, with a message noting that the timeout was exceeded.
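
A minimal sketch of the scrape-side behavior, racing the scrape against a timer; doScrape is a hypothetical placeholder, and the parameter is assumed to be in milliseconds:

declare function doScrape(url: string): Promise<{ markdown: string }>;

async function scrapeWithTimeout(url: string, timeoutMs: number) {
  let timer: NodeJS.Timeout | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`Request timed out after ${timeoutMs}ms`)),
      timeoutMs
    );
  });
  try {
    return await Promise.race([doScrape(url), timeout]);
  } finally {
    if (timer) clearTimeout(timer); // don't leave the timer holding the event loop
  }
}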

If the task is completed within two days, we'll include a $10 tip :)

This is an intro bounty. We are looking for exciting people who will buy in so we can start to ramp up.

[Feat] Error handling middleware for the API

When errors occur in deeply nested functions, there isn't a way for us to bubble up custom error messages and codes to the API layer.

Proposal: Create a custom Error type and Middleware that intercepts errors.

Custom error type:

class AppError extends Error {
  public readonly statusCode: number;
  public readonly isOperational: boolean;

  constructor(message: string, statusCode: number, isOperational: boolean = true) {
    super(message);
    this.statusCode = statusCode;
    this.isOperational = isOperational; // Indicates this is a known type of error
    Object.setPrototypeOf(this, new.target.prototype); // restore prototype chain
    Error.captureStackTrace(this, this.constructor);
  }
}

Deep function:

async function someDeepFunction(): Promise<any> {
  try {
    // Some logic that might fail (someConditionNotMet is illustrative)
    if (someConditionNotMet) {
      throw new AppError('Specific error message', 404);
    }
    // more logic
  } catch (error) {
    if (error instanceof AppError) throw error; // don't mask an intentional AppError
    throw new AppError('Error accessing resource', 500);
  }
}

Then these errors would be intercepted and cleaned up for users by a middleware at the Express level.
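
A minimal sketch of that middleware, assuming the AppError class above; unknown errors fall back to a generic 500 so internals never leak to users:

import { Request, Response, NextFunction } from 'express';

function errorHandler(err: Error, req: Request, res: Response, next: NextFunction) {
  if (err instanceof AppError && err.isOperational) {
    return res.status(err.statusCode).json({ error: err.message });
  }
  console.error('Unexpected error:', err); // full details stay server-side
  return res.status(500).json({ error: 'Internal server error' });
}

// Registered after all routes: app.use(errorHandler);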

Feat: Convert images in pdfs to images that can be accessed by the user

Some customers want to access images inside PDFs on the web. I'm not sure if llama-index supports this by default?

If we can get the images, we may need to start hosting them ourselves in S3 too. This is probably a better solution for ALL images, since people should be cleaning out links to images on external URLs because of data-exfiltration problems.

401 when checking job status

I'm trying to use the example.js you provided in the repo.

import FirecrawlApp from '@mendable/firecrawl-js';

const app = new FirecrawlApp({ apiKey: "my api key" });

const crawlResult = await app.crawlUrl('docs.babylonjs.com', { crawlerOptions: { excludes: ['blog/*'], limit: 10 } }, false);
console.log(crawlResult);

const jobId = crawlResult['jobId'];

let job;
while (true) {
  console.log("checking ", app.apiKey);
  job = await app.checkCrawlStatus(jobId);
  if (job.status == 'completed') {
    break;
  }
  console.log(job);
  await new Promise(resolve => setTimeout(resolve, 1000)); // wait 1 second
}

console.log(job.data[0].content);

I get

{ jobId: '678757b8-0d03-4c56-8017-a6b04136ad07' }
checking fc-4a0f64912306448c975701198d28b85e
file:///D:/Web/webscraper/node_modules/@mendable/firecrawl-js/build/index.js:91
throw new Error(error.message);
^

Error: Request failed with status code 401
at FirecrawlApp.<anonymous> (file:///D:/Web/webscraper/node_modules/@mendable/firecrawl-js/build/index.js:91:23)
at Generator.throw (<anonymous>)
at rejected (file:///D:/Web/webscraper/node_modules/@mendable/firecrawl-js/build/index.js:5:65)
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)

It seems like it starts to crawl, but when checking the job it gets a 401. Why might that be?

[Feat] Add ability/option to transform relative to absolute urls in page

When scraping, and mostly crawling, provide the ability to have all relative urls changed to absolute urls (for further processing or link extraction).

E.g. [The PDF file](/assets/file.pdf) => [The PDF file](https://site.com/assets/file.pdf)

Sample solution, a markdown post-processor hook:

import re
from urllib.parse import urljoin

def convert_relative_urls(text, base_url):
    # Regex to match Markdown links that don't start with http
    regex = r'\]\((?!http)([^)]+)\)'
    # Ensure exactly one trailing slash so urljoin treats the base as a directory
    base = base_url if base_url.endswith('/') else base_url + '/'
    # Function to prepend the base URL to the matched relative URL, handling '../'
    def replace_func(match):
        full_url = urljoin(base, match.group(1))
        return f"]({full_url})"
    # Replace the matched patterns in the text
    return re.sub(regex, replace_func, text)

# Example usage
markdown_text = "[The PDF file](/assets/file.pdf), [other file](../page/thing.pdf)"
base_url = "https://site.com/subdir/"
converted_text = convert_relative_urls(markdown_text, base_url)

Oops, noticed we are in TypeScript 😝:

function convertRelativeUrls(text: string, baseUrl: string): string {
  const regex = /\]\((?!http)([^)]+)\)/g;
  
  // Function to prepend the base URL to the matched relative URL, handling '../'
  const replaceFunc = (match: string, group1: string): string => {
    // Create a new URL based on the relative path and the base URL
    const fullUrl = new URL(group1, baseUrl).toString();
    return `](${fullUrl})`;
  };

  // Replace the matched patterns in the text
  return text.replace(regex, replaceFunc);
}

// Example usage
const markdownText = "[The PDF file](/assets/file.pdf), [other file](../page/thing.pdf)";
const baseUrl = "https://site.com/subdir/";
const convertedText = convertRelativeUrls(markdownText, baseUrl);

ModuleNotFoundError: No module named 'firecrawl'

I did pip install firecrawl-py, but I cannot run the crawler. This is what I get when installing the SDK:

WARNING: Skipping /opt/homebrew/lib/python3.11/site-packages/packaging-24.0.dist-info due to invalid metadata entry 'name'
WARNING: Skipping /opt/homebrew/lib/python3.11/site-packages/packaging-24.0.dist-info due to invalid metadata entry 'name'
Requirement already satisfied: firecrawl-py in /opt/homebrew/lib/python3.11/site-packages (0.0.6)
Requirement already satisfied: requests in /opt/homebrew/lib/python3.11/site-packages (from firecrawl-py) (2.31.0)
Requirement already satisfied: charset-normalizer<4,>=2 in /opt/homebrew/lib/python3.11/site-packages (from requests->firecrawl-py) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in /opt/homebrew/lib/python3.11/site-packages (from requests->firecrawl-py) (3.6)
Requirement already satisfied: urllib3<3,>=1.21.1 in /opt/homebrew/lib/python3.11/site-packages (from requests->firecrawl-py) (1.26.18)
Requirement already satisfied: certifi>=2017.4.17 in /opt/homebrew/lib/python3.11/site-packages (from requests->firecrawl-py) (2024.2.2)
WARNING: Skipping /opt/homebrew/lib/python3.11/site-packages/packaging-24.0.dist-info due to invalid metadata entry 'name'
WARNING: Skipping /opt/homebrew/lib/python3.11/site-packages/packaging-24.0.dist-info due to invalid metadata entry 'name'
WARNING: Skipping /opt/homebrew/lib/python3.11/site-packages/packaging-24.0.dist-info due to invalid metadata entry 'name'
WARNING: Skipping /opt/homebrew/lib/python3.11/site-packages/packaging-24.0.dist-info due to invalid metadata entry 'name'

After that, when I run the file, it gives back:

from firecrawl import FirecrawlApp
ModuleNotFoundError: No module named 'firecrawl'

[Feat] Provide more details for 429 Rate limit reached

Other APIs provide details within the 429 response that enable calculating, or even directly state, when to retry.
For example:

Groq:

Error: Error code: 429 - {'error': {'message': 'Rate limit reached for model llama3-70b-8192 in organization org_xxxxxxx on tokens per minute (TPM): Limit 7000, Used 0, Requested ~12903. Please try again in 50.597142857s. Visit https://console.groq.com/docs/rate-limits for more information.', 'type': 'tokens', 'code': 'rate_limit_exceeded'}}
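
A minimal sketch of a richer 429 payload plus the standard Retry-After header; the field names mirror the Groq example above and are assumptions, not Firecrawl's actual schema:

import { Response } from 'express';

function sendRateLimited(res: Response, retryAfterSeconds: number) {
  res
    .status(429)
    .set('Retry-After', String(retryAfterSeconds)) // standard HTTP retry hint
    .json({
      error: {
        message: `Rate limit reached. Please try again in ${retryAfterSeconds}s.`,
        type: 'requests',
        code: 'rate_limit_exceeded',
      },
    });
}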

[Feat] Cancel job route

"Provide an API to cancel jobs, especially for expensive ones. Setting the default limit at 10,000 could potentially break someone’s bank."

Suggested by @by12380 on Discord
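
A minimal sketch of what a cancel route could call, assuming a queue abstraction with getJob/remove; the names and shapes are illustrative assumptions, not the actual internals:

declare const crawlQueue: {
  getJob(id: string): Promise<{ remove(): Promise<void> } | null>;
};

// e.g. wired to DELETE /v0/crawl/cancel/:jobId (hypothetical route)
async function cancelJob(jobId: string): Promise<boolean> {
  const job = await crawlQueue.getJob(jobId);
  if (!job) return false;
  await job.remove(); // stop processing so no further pages are scraped/billed
  return true;
}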

Wrong hyperlinks in readme

Some links in your readme point at the non-existent firecrawl.com domain.

API key and How to use it sections

[Feat] Idempotency key

"Consider adding idempotency feature for our backend POST apis, and allow client to pass an idempotency key to avoid submitting duplicate jobs"

Suggested by @by12380 on Discord.
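
A minimal sketch of idempotency-key handling on a POST route, assuming Express and an in-memory store (a real deployment would want Redis or a database with a TTL); the 'Idempotency-Key' header follows the common convention:

import express from 'express';
import { randomUUID } from 'node:crypto';

const app = express();
const seen = new Map<string, unknown>(); // idempotency key -> cached response

app.post('/v0/crawl', express.json(), (req, res) => {
  const key = req.header('Idempotency-Key');
  if (key && seen.has(key)) {
    return res.status(200).json(seen.get(key)); // replay; don't enqueue a duplicate job
  }
  const body = { jobId: randomUUID() }; // placeholder for the real job submission
  if (key) seen.set(key, body);
  res.status(200).json(body);
});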

Use a standard for metadata

Use a standard for the metadata returned by the API.
Users of the API may add their own metadata, and it could overwrite or conflict with the API's metadata if there is no standard.
Use a common prefix or similar to identify metadata captured by the API.
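
A minimal sketch of the prefix idea; the 'fc-' prefix and the example field names are illustrative assumptions, not the actual contract:

function namespaceMetadata(scraped: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(scraped)) {
    out[`fc-${key}`] = value; // e.g. "fc-sourceURL", "fc-pageStatusCode" (hypothetical)
  }
  return out; // user-supplied metadata can no longer collide with these keys
}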
