Use case Make it possible for customers to crawl one or multiple w


Feature request: Add middleware for crawling websites with headless browser about project-lakechain HOT 5 OPEN

HQarroum commented on June 15, 2024

Feature request: Add middleware for crawling websites with headless browser

from project-lakechain.

Comments (5)

HQarroum commented on June 15, 2024

Hey @moltar.

Yes, Crawlee, along with its PlaywrightCrawler, is definitely our target. Our plan is to release the Web Crawler middleware as a Fargate cluster that can spawn headless browsers to crawl websites using user-defined strategies.

I've come up with a draft API for this future middleware. Feel free to give us feedback on it and on your use cases, so we can make sure we cover them.

Middleware API

const crawler = new WebCrawler.Builder()
  .withScope(this)
  .withIdentifier('WebCrawler')
  .withCacheStorage(cacheStorage)
  // Browser engine options (optional).
  .withEngineOptions(new EngineOptions.Builder()
    .withUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/58.0.3029.110')
    .withUseIncognitoPages(true)
    .withUseExperimentalContainers(true)
    .build()
  )
  // Crawler options (optional).
  .withCrawlerOptions(new CrawlerOptions.Builder()
    .withRequestHandlerTimeoutSecs(30)
    .withHandleRequestTimeoutSecs(30)
    .withMaxRequestsPerCrawl(100)
    .withMaxRequestRetries(5)
    .withSameDomainDelaySecs(1)
    .withMaxSessionRotations(5)
    .withMinConcurrency(1)
    .withMaxConcurrency(5)
    .withMaxRequestsPerMinute(100)
    .withKeepAlive(true)
    .withUseSessionPool(true)
    .withStatusMessageLoggingInterval(10)
    .withRetryOnBlocked(true)
    // One of: 'same-domain', 'all', 'same-origin' or 'none'.
    .withEnqueuePolicy('same-domain')
    // By default, the `Web Crawler` will only crawl HTML documents. However,
    // customers may opt to crawl additional data types and send them to other
    // middlewares in their document processing pipelines.
    .withCapturedDocumentTypes('html', 'pdf', 'docx')
    .build()
  )
  .build();
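
To make the enqueue policy values more concrete, here is a minimal sketch of how such a policy could gate the links discovered on a page. The semantics are assumed from the policy names alone, since the draft does not define them. `EnqueuePolicy`, `registrableDomain` and `shouldEnqueue` are hypothetical names, not part of the Lakechain or Crawlee APIs, and the domain comparison is a deliberately naive approximation.

```typescript
// Hypothetical helper names for illustration only; not a Lakechain or Crawlee API.
type EnqueuePolicy = 'same-domain' | 'all' | 'same-origin' | 'none';

// Naive approximation: keeps the last two labels of the hostname.
// Assumption: this is wrong for multi-part suffixes such as `co.uk`;
// a real implementation would consult the Public Suffix List.
function registrableDomain(hostname: string): string {
  return hostname.split('.').slice(-2).join('.');
}

// Decides whether a link found on `pageUrl` should be put in the crawl queue.
function shouldEnqueue(policy: EnqueuePolicy, pageUrl: string, linkUrl: string): boolean {
  const page = new URL(pageUrl);
  // Resolve relative links against the page they were found on.
  const link = new URL(linkUrl, pageUrl);

  // Skip anything that is not a web page (mailto:, javascript:, ...).
  if (link.protocol !== 'http:' && link.protocol !== 'https:') {
    return false;
  }
  if (policy === 'none') {
    return false;
  }
  if (policy === 'all') {
    return true;
  }
  if (policy === 'same-origin') {
    // Same scheme, host and port.
    return link.origin === page.origin;
  }
  // 'same-domain': subdomains of the same site are allowed.
  return registrableDomain(link.hostname) === registrableDomain(page.hostname);
}
```

Under these assumed semantics, a link from `https://example.com/a` to `https://docs.example.com/b` would be followed with `'same-domain'` but not with `'same-origin'`.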


moltar commented on June 15, 2024

A good candidate could be the Crawlee framework.


moltar commented on June 15, 2024

I'd prefer an alternative to an ECS cluster.

Unless we are talking about tasks that scale to zero.

It is possible to run Playwright in a Lambda.


HQarroum commented on June 15, 2024

Got it! And yes, the tasks will scale to zero, as with every existing middleware :).

Playwright will run in Lambda, but we're afraid that 15 minutes might not be enough time to crawl bigger websites.


moltar commented on June 15, 2024

> 15 minutes might not be enough time to crawl bigger websites

Shouldn't the crawling process be distributed anyway?

I'd crawl the entry page and then put all of the discovered sub-pages back into the queue to be crawled. The crawler handler processes one fetch at a time.

Also, unbounded crawling can lead to runaway costs, so in a way a timeout could be the forcing function that prevents footguns :)
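
The queue-based approach described above could look roughly like the sketch below. It is only an illustration under stated assumptions: `CrawlFrontier` and `crawlStep` are hypothetical names, not part of Lakechain, Crawlee or any AWS SDK. The queue is kept in memory, where a distributed setup would presumably use a managed queue, and link extraction is passed in as a function so the example stays free of network calls. The `maxRequests` budget stands in for the kind of hard limit that keeps costs bounded.

```typescript
// Hypothetical sketch: an in-memory crawl frontier with de-duplication and a
// hard request budget. Not a Lakechain, Crawlee or AWS API.
class CrawlFrontier {
  private readonly queue: string[] = [];
  private readonly seen = new Set<string>();
  private readonly maxRequests: number;
  private dispatched = 0;

  // `maxRequests` caps the total number of pages that will ever be handed out.
  constructor(maxRequests: number) {
    this.maxRequests = maxRequests;
  }

  // Adds a URL unless it was already seen. Returns true when it was enqueued.
  enqueue(url: string): boolean {
    // Drop the fragment so that `/a#top` and `/a` count as the same page.
    const hashAt = url.indexOf('#');
    const key = hashAt === -1 ? url : url.slice(0, hashAt);
    if (this.seen.has(key)) {
      return false;
    }
    this.seen.add(key);
    this.queue.push(key);
    return true;
  }

  // Hands out the next URL, or undefined when the queue is empty or the
  // budget is exhausted.
  next(): string | undefined {
    if (this.dispatched >= this.maxRequests) {
      return undefined;
    }
    const url = this.queue.shift();
    if (url !== undefined) {
      this.dispatched++;
    }
    return url;
  }
}

// One handler invocation: takes a single URL, extracts its links with the
// supplied function, and puts the new ones back into the frontier.
function crawlStep(
  frontier: CrawlFrontier,
  extractLinks: (url: string) => string[]
): string | undefined {
  const url = frontier.next();
  if (url === undefined) {
    return undefined;
  }
  for (const link of extractLinks(url)) {
    frontier.enqueue(link);
  }
  return url;
}
```

Each call to `crawlStep` maps to one short handler invocation, so no single invocation has to outlive the crawl of a whole website, and the budget stops the crawl even if the site keeps yielding new links.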
