
mendableai / firecrawl


🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.

Home Page: https://firecrawl.dev

License: GNU Affero General Public License v3.0

Dockerfile 0.34% JavaScript 4.52% TypeScript 83.62% Python 6.66% HTML 0.05% CSS 0.28% Go 4.53%
ai ai-scraping crawler data html-to-markdown llm markdown rag scraper scraping web-crawler

firecrawl's People

Contributors

100gle, calebpeffer, chand1012, dependabot[bot], elimisteve, eltociear, ericciarla, jakobstadlhuber, jhoseph88, kenthsu, kun432, lakr233, matsubo, mattjoyce, mattzcarey, mdp, mogery, nickscamara, niublibing, rafaelsideguide, rogerserper, rombru, sanix-darker, simonha9, snippet, szepeviktor, tak-s, tomkosm, tractorjuice, wahpiangle


firecrawl's Issues

OpenAPI Spec

I saw you have used Mintlify but couldn't find the OpenAPI spec itself. Could you share it?

Add a timeout parameter to the API

One thing that would be useful is the ability to set a timeout on these requests - a customer ended up implementing that on their side.

[Feat] Docker deployment

Could you please add support for Docker deployment to streamline setting up and running the project?

Thank you

Get Code for LLM Extract returns bad JSON

I understand LLM Extract is in alpha and Get Code is likely nondeterministic, so feel free to ignore. But this is what it gave me:

[Screenshot 2024-05-04 at 7:29:40 PM: the generated code]

which was missing a closing apostrophe and some commas:

[Screenshot 2024-05-04 at 7:30:00 PM: the invalid JSON]

The fixed JSON looked like this:

[Screenshot 2024-05-04 at 7:30:51 PM: the corrected JSON]

Unable to run python sdk sample code from README

Traceback (most recent call last):
File "/Users/howardgil/Desktop/Projects/nat-sec-hackathon/firecrawl.py", line 1, in <module>
from firecrawl import FirecrawlApp
File "/Users/howardgil/Desktop/Projects/nat-sec-hackathon/firecrawl.py", line 1, in <module>
from firecrawl import FirecrawlApp
ImportError: cannot import name 'FirecrawlApp' from partially initialized module 'firecrawl' (most likely due to a circular import) (/Users/howardgil/Desktop/Projects/nat-sec-hackathon/firecrawl.py)

Limit includes filtered-out paths

The crawl limit is applied before the paths are filtered out.

base url: test.com
limit: 2
included links: ["/pages/*"]

Links on test.com, in order:

[
"/home",
"/imprint",
"/about",
"pages/1",
"pages/2",
"pages/3"
]

Expected links to be crawled: ["pages/1", "pages/2"]

Currently crawled links: []
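
A minimal sketch of the expected ordering, filtering discovered links against the include patterns before applying the limit. The names and the RegExp-based matching are illustrative assumptions, not Firecrawl's internals:

function selectLinks(links: string[], includes: RegExp[], limit: number): string[] {
  // Apply include filters first; an empty include list means "keep everything"
  const filtered = includes.length
    ? links.filter((link) => includes.some((re) => re.test(link)))
    : links;
  return filtered.slice(0, limit); // the limit applies only after filtering
}

// With the example above:
// selectLinks(["/home", "/imprint", "/about", "pages/1", "pages/2", "pages/3"], [/^pages\//], 2)
// => ["pages/1", "pages/2"]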

Remove 'cookies' text when removing headers/footers, etc.

Remove any cookie-consent text when removing headers and footers.
Many sites in Europe display a cookie acceptance message, and sometimes this is the only text returned.

Sometimes it captures something like:

"Skip to main content\n\nCookies \n------------------------------\n\nWe use some essential cookies to make this service work.\n\nWe\u2019d also like to use analytics cookies so we can understand how you use the service and make improvements.\n\nAccept analytics cookies Reject analytics cookies How we use cookies\n\nYou can change your cookie settings\n at any time.\n\nHide cookie message\n\n"

[Bug] Limit on /search is not deterministic

Right now we limit the search results by applying it at the SERP API level using searchOptions.limit: n.

The problem is that some search results could be social media pages or websites that we don't support, causing the scrape to fail. This ends up causing the /search endpoint to return fewer results than expected.

The idea here is that we should search for n + y over the limit, where n is the limit and y is a picked constant. That way, if some fail, we can use the y extra links and try to call get-documents on them until we hit the correct limit n.
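
A minimal sketch of that idea; searchSerp and scrapePage are hypothetical placeholders standing in for the real internals:

declare function searchSerp(query: string, count: number): Promise<string[]>;
declare function scrapePage(url: string): Promise<{ url: string; markdown: string }>;

async function searchWithLimit(query: string, n: number, y = 5) {
  const candidates = await searchSerp(query, n + y); // over-fetch by a constant y
  const results: { url: string; markdown: string }[] = [];
  for (const url of candidates) {
    if (results.length >= n) break; // stop once the requested limit is met
    try {
      results.push(await scrapePage(url));
    } catch {
      // unsupported page (e.g. social media): skip it; a surplus link backfills
    }
  }
  return results;
}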

[Feat] Strip non-content tags, headers, footers

The markdown would be much more useful if you stripped headers/footers and other non-content tags (filters, etc.) that are low value for RAG/context. Either use tag- or class-based removal from the HTML, or something like Mozilla's Readability, or both! Highly opinionated class-based removal is risky but produces high-value content and less noise.

For example, a language selector in a header gets included in the output and should be stripped:

[Skip to main content](#main-content)

Select LanguageEnglishAfrikaansAlbanianArabicArmenianAzerbaijaniBasqueBelarusianBengaliBosnianBulgarianCatalanCebuanoChinese (Simplified)Chinese (Traditional)CroatianCzechDanishDutchEsperantoEstonianFilipinoFinnishFrenchGalicianGeorgianGermanGreekGujaratiHaitian CreoleHausaHebrewHindiHmongHungarianIcelandicIgboIndonesianIrishItalianJapaneseJavaneseKannadaKhmerKoreanLaoLatinLatvianLithuanianMacedonianMalayMalteseMaoriMarathiMongolianNepaliNorwegianPersianPolishPortuguesePunjabiRomanianRussianSerbianSlovakSlovenianSomaliSpanishSwahiliSwedishTamilTeluguThaiTurkishUkrainianUrduVietnameseWelshYiddishYorubaZulu

Here is a starter list; we should probably test against a couple thousand random pages and use an LLM like Haiku with vision as a judge.

const exclude = [
  'header', '.header', '.top', '.navbar', '#header',
  'footer', '.footer', '.bottom', '#footer',
  '.sidebar', '.side', '.aside', '#sidebar',
  '.modal', '.popup', '#modal', '.overlay',
  '.ad', '.ads', '.advert', '#ad',
  '.lang-selector', '.language', '#language-selector',
  '.social', '.social-media', '.social-links', '#social',
  '.menu', '.navigation', 'nav', '#nav',
  '.breadcrumbs', '#breadcrumbs',
  '.form', 'form', '#search-form',
  'script', 'noscript'
];
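
A minimal sketch of applying that list with cheerio before the html-to-md step, assuming the exclude array above; cheerio is what this sketch uses, not necessarily what the pipeline uses internally:

import * as cheerio from 'cheerio';

function stripNonContent(html: string): string {
  const $ = cheerio.load(html);
  $(exclude.join(', ')).remove(); // drop every matching element in one pass
  return $.html();
}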

[Test] Add integration tests for complex and larger variety of webpages

In tweaking and growing the HTML cleanup and html-to-md conversion, I highly recommend adding integration tests using either live webpages (to also test fetching/network behavior and dynamic websites) or at least saved HTML pages with complex layouts (and bad HTML, especially for the cleanup).

  • Find a list of pages to use as a test suite, with a variety of layouts
  • Add the integration tests

[Feat] Be able to pass a timeout param to the endpoints

Enable the user to pass a timeout parameter to both the scrape and the crawl endpoints. If the timeout is exceeded, send the user a clear error message. On the crawl endpoint, return any pages that have already been scraped, with a message noting that the timeout was exceeded.
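
A minimal sketch of the scrape-side behavior, racing the scrape against a timer; doScrape is a hypothetical placeholder, and the parameter is assumed to be in milliseconds:

declare function doScrape(url: string): Promise<{ markdown: string }>;

async function scrapeWithTimeout(url: string, timeoutMs: number) {
  let timer: NodeJS.Timeout | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`Request timed out after ${timeoutMs}ms`)),
      timeoutMs
    );
  });
  try {
    return await Promise.race([doScrape(url), timeout]);
  } finally {
    if (timer) clearTimeout(timer); // don't leave the timer holding the event loop
  }
}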

If the task is completed within two days, we'll include a $10 tip :)

This is an intro bounty. We are looking for exciting people who will buy in so we can start to ramp up.

[Feat] Error handling middleware for the API

When errors occur in deeply nested functions, there isn't a way for us to bubble up custom error messages and codes to the API layer.

Proposal: Create a custom Error type and Middleware that intercepts errors.

Custom error type:

class AppError extends Error {
  public readonly statusCode: number;
  public readonly isOperational: boolean;

  constructor(message: string, statusCode: number, isOperational: boolean = true) {
    super(message);
    this.statusCode = statusCode;
    this.isOperational = isOperational; // Indicates this is a known type of error
    Object.setPrototypeOf(this, new.target.prototype); // restore prototype chain
    Error.captureStackTrace(this, this.constructor);
  }
}

Deep function:

async function someDeepFunction(): Promise<any> {
  try {
    // Some logic that might fail (someConditionNotMet is illustrative)
    if (someConditionNotMet) {
      throw new AppError('Specific error message', 404);
    }
    // more logic
  } catch (error) {
    if (error instanceof AppError) throw error; // don't mask an intentional AppError
    throw new AppError('Error accessing resource', 500);
  }
}

Then these errors would be intercepted and cleaned up for users by a middleware at the Express level.
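
A minimal sketch of that middleware, assuming the AppError class above; unknown errors fall back to a generic 500 so internals never leak to users:

import { Request, Response, NextFunction } from 'express';

function errorHandler(err: Error, req: Request, res: Response, next: NextFunction) {
  if (err instanceof AppError && err.isOperational) {
    return res.status(err.statusCode).json({ error: err.message });
  }
  console.error('Unexpected error:', err); // full details stay server-side
  return res.status(500).json({ error: 'Internal server error' });
}

// Registered after all routes: app.use(errorHandler);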

Feat: Convert images in pdfs to images that can be accessed by the user

Some customers want to access images inside PDFs on the web. I'm not sure if llama-index supports this by default?

If we can get the images, we may need to start hosting them ourselves in S3 too. This is probably a better solution for ALL images, since people should be cleaning out links to images on external URLs because of data-exfiltration problems.

401 when checking job status

I'm trying to use the example.js you provided in the repo.

import FirecrawlApp from '@mendable/firecrawl-js';

const app = new FirecrawlApp({ apiKey: "my api key" });

const crawlResult = await app.crawlUrl('docs.babylonjs.com', { crawlerOptions: { excludes: ['blog/*'], limit: 10 } }, false);
console.log(crawlResult);

const jobId = crawlResult['jobId'];

let job;
while (true) {
  console.log("checking ", app.apiKey);
  job = await app.checkCrawlStatus(jobId);
  if (job.status == 'completed') {
    break;
  }
  console.log(job);
  await new Promise(resolve => setTimeout(resolve, 1000)); // wait 1 second
}

console.log(job.data[0].content);

I get

{ jobId: '678757b8-0d03-4c56-8017-a6b04136ad07' }
checking fc-4a0f64912306448c975701198d28b85e
file:///D:/Web/webscraper/node_modules/@mendable/firecrawl-js/build/index.js:91
throw new Error(error.message);
^

Error: Request failed with status code 401
at FirecrawlApp.<anonymous> (file:///D:/Web/webscraper/node_modules/@mendable/firecrawl-js/build/index.js:91:23)
at Generator.throw (<anonymous>)
at rejected (file:///D:/Web/webscraper/node_modules/@mendable/firecrawl-js/build/index.js:5:65)
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)

It seems like it starts to crawl, but when checking the job it gets a 401. Why might that be?

[Feat] Add ability/option to transform relative to absolute urls in page

When scraping, and mostly crawling, provide the ability to have all relative urls changed to absolute urls (for further processing or link extraction).

E.g. [The PDF file](/assets/file.pdf) => [The PDF file](https://site.com/assets/file.pdf)

Sample solution, a markdown post-processor hook:

import re
from urllib.parse import urljoin

def convert_relative_urls(text, base_url):
    # Regex to match Markdown links that don't start with http
    regex = r'\]\((?!http)([^)]+)\)'
    # Ensure exactly one trailing slash so urljoin treats the base as a directory
    base = base_url if base_url.endswith('/') else base_url + '/'
    # Function to prepend the base URL to the matched relative URL, handling '../'
    def replace_func(match):
        full_url = urljoin(base, match.group(1))
        return f"]({full_url})"
    # Replace the matched patterns in the text
    return re.sub(regex, replace_func, text)

# Example usage
markdown_text = "[The PDF file](/assets/file.pdf), [other file](../page/thing.pdf)"
base_url = "https://site.com/subdir/"
converted_text = convert_relative_urls(markdown_text, base_url)

Oops, noticed we are in TypeScript 😝:

function convertRelativeUrls(text: string, baseUrl: string): string {
  const regex = /\]\((?!http)([^)]+)\)/g;
  
  // Function to prepend the base URL to the matched relative URL, handling '../'
  const replaceFunc = (match: string, group1: string): string => {
    // Create a new URL based on the relative path and the base URL
    const fullUrl = new URL(group1, baseUrl).toString();
    return `](${fullUrl})`;
  };

  // Replace the matched patterns in the text
  return text.replace(regex, replaceFunc);
}

// Example usage
const markdownText = "[The PDF file](/assets/file.pdf), [other file](../page/thing.pdf)";
const baseUrl = "https://site.com/subdir/";
const convertedText = convertRelativeUrls(markdownText, baseUrl);

ModuleNotFoundError: No module named 'firecrawl'

I did pip install firecrawl-py, but I cannot run the crawler. This is what I get when installing the SDK:

WARNING: Skipping /opt/homebrew/lib/python3.11/site-packages/packaging-24.0.dist-info due to invalid metadata entry 'name'
WARNING: Skipping /opt/homebrew/lib/python3.11/site-packages/packaging-24.0.dist-info due to invalid metadata entry 'name'
Requirement already satisfied: firecrawl-py in /opt/homebrew/lib/python3.11/site-packages (0.0.6)
Requirement already satisfied: requests in /opt/homebrew/lib/python3.11/site-packages (from firecrawl-py) (2.31.0)
Requirement already satisfied: charset-normalizer<4,>=2 in /opt/homebrew/lib/python3.11/site-packages (from requests->firecrawl-py) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in /opt/homebrew/lib/python3.11/site-packages (from requests->firecrawl-py) (3.6)
Requirement already satisfied: urllib3<3,>=1.21.1 in /opt/homebrew/lib/python3.11/site-packages (from requests->firecrawl-py) (1.26.18)
Requirement already satisfied: certifi>=2017.4.17 in /opt/homebrew/lib/python3.11/site-packages (from requests->firecrawl-py) (2024.2.2)
WARNING: Skipping /opt/homebrew/lib/python3.11/site-packages/packaging-24.0.dist-info due to invalid metadata entry 'name'
WARNING: Skipping /opt/homebrew/lib/python3.11/site-packages/packaging-24.0.dist-info due to invalid metadata entry 'name'
WARNING: Skipping /opt/homebrew/lib/python3.11/site-packages/packaging-24.0.dist-info due to invalid metadata entry 'name'
WARNING: Skipping /opt/homebrew/lib/python3.11/site-packages/packaging-24.0.dist-info due to invalid metadata entry 'name'

After that, when I run the file, it gives back:

from firecrawl import FirecrawlApp
ModuleNotFoundError: No module named 'firecrawl'

[Feat] Provide more details for 429 Rate limit reached

Other APIs provide details within the 429 response that enable calculating, or even directly state, when to retry.
For example:

Groq:

Error: Error code: 429 - {'error': {'message': 'Rate limit reached for model llama3-70b-8192 in organization org_xxxxxxx on tokens per minute (TPM): Limit 7000, Used 0, Requested ~12903. Please try again in 50.597142857s. Visit https://console.groq.com/docs/rate-limits for more information.', 'type': 'tokens', 'code': 'rate_limit_exceeded'}}
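
A minimal sketch of a richer 429 payload plus the standard Retry-After header; the field names mirror the Groq example above and are assumptions, not Firecrawl's actual schema:

import { Response } from 'express';

function sendRateLimited(res: Response, retryAfterSeconds: number) {
  res
    .status(429)
    .set('Retry-After', String(retryAfterSeconds)) // standard HTTP retry hint
    .json({
      error: {
        message: `Rate limit reached. Please try again in ${retryAfterSeconds}s.`,
        type: 'requests',
        code: 'rate_limit_exceeded',
      },
    });
}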

[Feat] Cancel job route

"Provide an API to cancel jobs, especially for expensive ones. Setting the default limit at 10,000 could potentially break someone’s bank."

Suggested by @by12380 on Discord
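
A minimal sketch of what a cancel route could call, assuming a queue abstraction with getJob/remove; the names and shapes are illustrative assumptions, not the actual internals:

declare const crawlQueue: {
  getJob(id: string): Promise<{ remove(): Promise<void> } | null>;
};

// e.g. wired to DELETE /v0/crawl/cancel/:jobId (hypothetical route)
async function cancelJob(jobId: string): Promise<boolean> {
  const job = await crawlQueue.getJob(jobId);
  if (!job) return false;
  await job.remove(); // stop processing so no further pages are scraped/billed
  return true;
}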

Wrong hyperlinks in readme

Some links in your readme point at the non-existent firecrawl.com domain.

API key and How to use it sections

[Feat] Idempotency key

"Consider adding idempotency feature for our backend POST apis, and allow client to pass an idempotency key to avoid submitting duplicate jobs"

Suggested by @by12380 on Discord.
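
A minimal sketch of idempotency-key handling on a POST route, assuming Express and an in-memory store (a real deployment would want Redis or a database with a TTL); the 'Idempotency-Key' header follows the common convention:

import express from 'express';
import { randomUUID } from 'node:crypto';

const app = express();
const seen = new Map<string, unknown>(); // idempotency key -> cached response

app.post('/v0/crawl', express.json(), (req, res) => {
  const key = req.header('Idempotency-Key');
  if (key && seen.has(key)) {
    return res.status(200).json(seen.get(key)); // replay; don't enqueue a duplicate job
  }
  const body = { jobId: randomUUID() }; // placeholder for the real job submission
  if (key) seen.set(key, body);
  res.status(200).json(body);
});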

Use a standard for metadata

Use a standard for the metadata returned by the API.
Users of the API may add their own metadata, and it could overwrite or conflict with the API's metadata if there is no standard.
Use a common prefix or similar to identify metadata captured by the API.
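
A minimal sketch of the prefix idea; the 'fc-' prefix and the example field names are illustrative assumptions, not the actual contract:

function namespaceMetadata(scraped: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(scraped)) {
    out[`fc-${key}`] = value; // e.g. "fc-sourceURL", "fc-pageStatusCode" (hypothetical)
  }
  return out; // user-supplied metadata can no longer collide with these keys
}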
