Comments (5)

HQarroum commented on June 15, 2024

Hey @moltar.

Yes, Crawlee, along with its PlaywrightCrawler, is definitely our target. Our plan is to release the Web Crawler middleware as a Fargate cluster that can spawn headless browsers to crawl websites using user-defined strategies.

I've come up with a draft API for this future middleware. Feel free to give us feedback on it and to share your use cases so we can make sure we cover them.

Middleware API

const crawler = new WebCrawler.Builder()
  .withScope(this)
  .withIdentifier('WebCrawler')
  .withCacheStorage(cacheStorage)
  // Browser engine options (optional).
  .withEngineOptions(new EngineOptions.Builder()
    .withUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/58.0.3029.110')
    .withUseIncognitoPages(true)
    .withUseExperimentalContainers(true)
    .build()
  )
  // Crawler options (optional).
  .withCrawlerOptions(new CrawlerOptions.Builder()
    .withRequestHandlerTimeoutSecs(30)
    .withHandleRequestTimeoutSecs(30)
    .withMaxRequestsPerCrawl(100)
    .withMaxRequestRetries(5)
    .withSameDomainDelaySecs(1)
    .withMaxSessionRotations(5)
    .withMinConcurrency(1)
    .withMaxConcurrency(5)
    .withMaxRequestsPerMinute(100)
    .withKeepAlive(true)
    .withUseSessionPool(true)
    .withStatusMessageLoggingInterval(10)
    .withRetryOnBlocked(true)
    // One of: 'same-domain', 'all', 'same-origin', 'none'.
    .withEnqueuePolicy('same-domain')
    // By default, the `Web Crawler` will only crawl HTML documents, however
    // customers may opt to crawl additional data types and send them to other
    // middlewares in their document processing pipelines.
    .withCapturedDocumentTypes('html', 'pdf', 'docx')
    .build()
  )
  .build();
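
The draft does not spell out what each enqueue policy value means, so here is a minimal sketch of one plausible reading. Everything in it is an assumption for illustration: `EnqueuePolicy`, `shouldEnqueue` and `naiveDomain` are hypothetical names, not part of the draft API, and the domain comparison is deliberately naive.

```typescript
// Hypothetical helper, not part of the draft middleware API.
type EnqueuePolicy = 'same-domain' | 'all' | 'same-origin' | 'none';

// Naive registrable-domain guess: keeps the last two labels of the hostname.
// A real implementation would consult the Public Suffix List (e.g. for `co.uk`).
function naiveDomain(hostname: string): string {
  return hostname.split('.').slice(-2).join('.');
}

// Decides whether a link found on `pageUrl` should be added to the crawl queue.
function shouldEnqueue(policy: EnqueuePolicy, pageUrl: string, linkUrl: string): boolean {
  const page = new URL(pageUrl);
  const link = new URL(linkUrl, page); // Resolves relative links against the page.
  switch (policy) {
    case 'all':
      return true;
    case 'none':
      return false;
    case 'same-origin':
      // Same scheme, host and port.
      return link.origin === page.origin;
    case 'same-domain':
      // Sub-domains of the same site are accepted.
      return naiveDomain(link.hostname) === naiveDomain(page.hostname);
  }
}
```

Under this reading, a link from `https://example.com/a` to `https://docs.example.com/b` would be followed with 'same-domain' but skipped with 'same-origin'.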

from project-lakechain.

moltar commented on June 15, 2024

A good candidate could be Crawlee framework.

moltar commented on June 15, 2024

I'd prefer an alternative to an ECS cluster.

Unless we are talking about tasks that scale to zero.

It is possible to run Playwright in a Lambda.

HQarroum commented on June 15, 2024

Got it! And yes, the tasks will scale to zero, as they do for every existing middleware :).

Playwright will run in Lambda, but we're afraid that 15 minutes might not be enough time to crawl bigger websites.

moltar commented on June 15, 2024

> 15 minutes might not be enough time to crawl bigger websites

Shouldn't the crawling process be distributed anyway?

I'd crawl the entry page and then put all of the discovered sub-pages back into the queue to be crawled. The crawler handler would process one fetch at a time.

Also, unbounded crawling can lead to runaway costs. In a way, a timeout could be the forcing function that prevents footguns :)
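
To make the idea concrete, here is a rough sketch of a queue-driven crawl in which the handler processes one page at a time and a budget bounds the total work. All names (`FetchLinks`, `CrawlBudget`, `handleOne`, `crawl`) are hypothetical, the in-memory array only stands in for a real distributed queue, and `fetchLinks` stands in for the actual browser-based fetch.

```typescript
// All names are hypothetical; this is not the middleware's implementation.
type FetchLinks = (url: string) => Promise<string[]>;

interface CrawlBudget {
  maxRequests: number; // Hard cap on the number of pages fetched.
  deadlineMs: number;  // Absolute timestamp after which no new fetch starts.
}

// Handles exactly one URL and returns the sub-pages discovered on it.
async function handleOne(url: string, fetchLinks: FetchLinks): Promise<string[]> {
  return fetchLinks(url);
}

async function crawl(
  entryUrl: string,
  fetchLinks: FetchLinks,
  budget: CrawlBudget,
  now: () => number = Date.now,
): Promise<string[]> {
  const queue: string[] = [entryUrl];        // Stand-in for a real queue.
  const seen = new Set<string>([entryUrl]);  // Avoids enqueuing a page twice.
  const visited: string[] = [];

  while (queue.length > 0) {
    // The budget is the forcing function that bounds the cost of the crawl.
    if (visited.length >= budget.maxRequests || now() >= budget.deadlineMs) break;

    const url = queue.shift()!;
    const links = await handleOne(url, fetchLinks);
    visited.push(url);

    // Put newly discovered sub-pages back into the queue.
    for (const link of links) {
      if (!seen.has(link)) {
        seen.add(link);
        queue.push(link);
      }
    }
  }
  return visited;
}
```

In a distributed setup, each `handleOne` call would presumably be its own short-lived invocation, so no single invocation needs to outlive the crawl of one page, and the budget would have to live in shared state rather than in local variables.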
