
Comments (15)

Ziinc commented on May 14, 2024

We might need to consider that in a multi-spider situation, some spiders do not need browser rendering. Maybe it can be handled at the spider implementation level, where browser rendering is declared with a flag (or something like that).
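
For illustration, a per-spider flag might look something like this (the browser_rendering?/0 callback is purely hypothetical and not part of Crawly's Spider behaviour):

    defmodule JsHeavySpider do
      use Crawly.Spider

      # Hypothetical flag: only spiders that need JS rendering would declare it.
      def browser_rendering?(), do: true

      def base_url(), do: "https://example.com"

      def init(), do: [start_urls: ["https://example.com/"]]

      def parse_item(_response), do: %Crawly.ParsedItem{items: [], requests: []}
    end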


Ziinc commented on May 14, 2024

related to #18


oltarasenko commented on May 14, 2024

Yes, in general we need to be able to set up different HTTP clients for different spiders.


Ziinc commented on May 14, 2024

Thinking a bit more:
Could it be at the request level? What if a site's structure has certain parts that require JS and certain parts that don't? It would be unnecessary and inefficient to use a browser for the parts that don't require JS.


jallum commented on May 14, 2024

I think that using the request makes the most sense. The spider is in a good position to decide what kind of request should be made on its behalf.

Thinking out loud: what about returning plain URL strings in the ParsedItem, and then Crawly could pass those URL strings back to a new (optional) callback on the spider to produce the request for a given URL? At that point, the spider could configure a request and return it. The default implementation of build_request (or whatever better name people can think of) would also eliminate a bit of boilerplate that seems to find its way into all my spiders:

    # Convert URLs into requests
    requests =
      urls
      |> Enum.map(&build_absolute_url/1)
      |> Enum.map(&Crawly.Utils.request_from_url/1)

This would also minimize the construction of unnecessary request structs when the request is a duplicate (or is dropped for some other reason): Crawly would only call the spider's build_request on URLs it actually intends to follow.
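
A minimal sketch of how that optional callback could look, using the names above (build_request/1 and the surrounding details are hypothetical, not part of Crawly's current Spider behaviour):

    defmodule MySpider do
      use Crawly.Spider

      def base_url(), do: "https://example.com"

      # init/0 and parse_item/1 (returning plain URL strings) elided.

      # Hypothetical optional callback: Crawly would call this only for URLs
      # it actually intends to follow, so duplicate URLs never get built at all.
      def build_request(url) do
        url
        |> build_absolute_url()
        |> Crawly.Utils.request_from_url()
      end

      defp build_absolute_url(url), do: URI.merge(base_url(), url) |> to_string()
    end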


Ziinc commented on May 14, 2024

@jallum I think improving Crawly.Utils.request_from_url/1 could be split off into another issue, where relative/absolute URL checks and building would be done automatically.

Maybe a :browser boolean flag in Crawly.Request?
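
Something along these lines, perhaps (the :browser key is hypothetical; Crawly.Request does not have such a field today):

    # Hypothetical shape: a request that opts into browser rendering,
    # which the fetch step could then inspect.
    request = %Crawly.Request{
      url: "https://example.com/js-heavy-page",
      headers: [],
      options: [],
      browser: true
    }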


jallum commented on May 14, 2024

Agreed.


oltarasenko commented on May 14, 2024

It turns out that Splash does not yet have a proxy interface :(. I don't really like this way of doing requests:

curl 'http://localhost:8050/render.html?url=http://domain.com/page-with-javascript.html&timeout=10&wait=0.5'
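
Driving that endpoint from Elixir would look roughly like this (a sketch assuming HTTPoison and a local Splash instance on port 8050):

    # Ask Splash to render the page and return the resulting HTML.
    params =
      URI.encode_query(%{
        "url" => "http://domain.com/page-with-javascript.html",
        "timeout" => 10,
        "wait" => 0.5
      })

    {:ok, %HTTPoison.Response{body: rendered_html}} =
      HTTPoison.get("http://localhost:8050/render.html?" <> params)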

So I will probably switch back to the original idea of headless browsers.


Ziinc commented on May 14, 2024

But this would all be within a middleware, right? It would be transparent to the user.

I think a drawback of the headless browser solution, whether it is Hound or Puppeteer, is that there are a lot of additional dependencies and moving parts.


oltarasenko commented on May 14, 2024

Yes, technically it's possible to build a middleware, similarly to what Scrapy does: https://github.com/scrapy-plugins/scrapy-splash/blob/master/scrapy_splash/middleware.py

What needs to be done in this case:

  1. Modify the request URL to a Splash-based URL
  2. Unwrap the URL so we get the original URL back at the spider level

I am OK with 1, but 2 looks like a hack to me. I wonder whether it would be easier to wrap Splash in a proxy interface inside Docker as an alternative?
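
For reference, a rough sketch of step 1 as a middleware, assuming Crawly's Pipeline-style run/2 contract (the module name and Splash endpoint are illustrative only):

    defmodule SplashMiddleware do
      @moduledoc "Step 1: rewrite the request URL into a Splash render.html URL."
      @behaviour Crawly.Pipeline

      @splash_endpoint "http://localhost:8050/render.html"

      @impl Crawly.Pipeline
      def run(request, state) do
        splash_url =
          @splash_endpoint <> "?" <> URI.encode_query(%{"url" => request.url, "wait" => 0.5})

        # Step 2 (unwrapping back to the original URL at the spider level) is
        # the part that feels like a hack and is not shown here.
        {%Crawly.Request{request | url: splash_url}, state}
      end
    end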


Ziinc commented on May 14, 2024

@oltarasenko I think we shouldn't put it in a middleware, since it shouldn't fundamentally change the request, only the way the data is fetched.

I think we can abstract the response fetching out into a configurable module.
Considering that the worker in lib/crawly/worker.ex fetches the response as follows:
[screenshot of the get_response code in lib/crawly/worker.ex]
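
(The screenshot is not reproduced here; roughly paraphrased, that fetch step is a plain HTTPoison call driven by the request fields, something like the following.)

    # Rough paraphrase (not verbatim) of the worker's current fetch step.
    def get_response(%Crawly.Request{url: url, headers: headers, options: options}) do
      HTTPoison.get(url, headers, options)
    end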

What we can do is make the get_response function "pluggable" (not referring to Plug). This module would be responsible for converting a request into a response.

Thus, we can have a Fetcher protocol, with out-of-the-box support for FetchWithHTTPoison and FetchWithSplash, and typespecs that require returning an HTTPoison.Response struct.
I think Fetcher or something similar would be more intuitive and clearer than referring to it as an HTTP client (since "client" can refer to many things), and it allows for more flexibility if there are other protocols that users want to implement.
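
A minimal sketch of what that could look like, using the module names above (this is a proposal sketch rather than an existing Crawly API, written as a behaviour rather than a protocol in the strict Elixir sense):

    defmodule Crawly.Fetcher do
      @moduledoc "Contract for turning a Crawly.Request into an HTTPoison.Response."
      @callback fetch(request :: Crawly.Request.t(), opts :: keyword()) ::
                  {:ok, HTTPoison.Response.t()} | {:error, term()}
    end

    defmodule FetchWithHTTPoison do
      @behaviour Crawly.Fetcher

      @impl Crawly.Fetcher
      def fetch(%Crawly.Request{url: url, headers: headers, options: options}, _opts) do
        HTTPoison.get(url, headers, options)
      end
    end

    defmodule FetchWithSplash do
      @behaviour Crawly.Fetcher

      @impl Crawly.Fetcher
      def fetch(%Crawly.Request{url: url}, opts) do
        splash = Keyword.get(opts, :base_url, "http://localhost:8050/render.html")
        HTTPoison.get(splash <> "?" <> URI.encode_query(%{"url" => url}))
      end
    end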

The fetcher can be declared in the config like so:

config :crawly,
    ....
    fetcher: FetchWithSplash
    ....

Expanding on this idea, we can then allow options to be passed to the fetcher (once PR #31 is complete), thereby allowing the configuration needed for issue #33:

config :crawly,
    fetcher: {FetchWithHTTPoison, options: blah}

Let me know what you think. With this proposal, the received Request does not need to be modified, as it would have to be if implemented as a middleware.


Ziinc commented on May 14, 2024

I also think Docker would be unnecessary overhead and even more for the end user to learn to set up.


oltarasenko commented on May 14, 2024

I kind of like the idea of pluggable HTTP clients! Actually, this was already in the plans for Crawly. And I also agree about the middleware-based approach (i.e. I would rather skip it).


Ziinc commented on May 14, 2024

@oltarasenko yup, I saw that you'd added the line for HTTPoison in config.exs. I'll check through #32 today, then once it's merged in, I'll start on updating the docs for #31.


oltarasenko commented on May 14, 2024

The example project is located here: https://github.com/oltarasenko/autosites

