Comments (15)
We might need to consider that in a multi-spider setup, some spiders do not need browser rendering. Maybe it can be handled at the spider implementation level, where browser rendering is declared with a flag (or something like that).
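For illustration only, a per-spider opt-in might look like the sketch below (the render_javascript?/0 callback is hypothetical, not an existing Crawly API):

```elixir
defmodule Spiders.JsHeavySite do
  use Crawly.Spider

  # Hypothetical flag (not an existing Crawly callback): spiders that
  # return false, or omit it, would be fetched with the plain HTTP client.
  def render_javascript?(), do: true

  @impl Crawly.Spider
  def base_url(), do: "https://www.example.com"

  @impl Crawly.Spider
  def init(), do: [start_urls: ["https://www.example.com/"]]

  @impl Crawly.Spider
  def parse_item(_response), do: %Crawly.ParsedItem{items: [], requests: []}
end
```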
related to #18
Yes, in general we need to be able to set up different HTTP clients for different spiders.
Thinking a bit more: could it be at the request level? What if a site's structure has certain parts that require JS and certain parts that don't? It would be unnecessary and inefficient to use a browser for the parts that don't require JS.
I think that using the request makes the most sense. The spider is in a good position to decide what kind of request should be made on its behalf.
Thinking out loud: what about returning plain URL strings in the ParsedItem, and then Crawly could pass those URL strings back to a new (optional) callback on the spider to produce the request for each URL? At that point, the spider could configure a request and return it. A default implementation of build_request (or whatever better name people can think of) would also eliminate a bit of boilerplate that seems to find its way into all my spiders:
# Convert URLs into requests
requests =
  urls
  |> Enum.map(&build_absolute_url/1)
  |> Enum.map(&Crawly.Utils.request_from_url/1)
This would also avoid constructing unnecessary request machinery when a request turns out to be a duplicate (or is dropped for some other reason): Crawly would only call the spider's build_request on URLs it actually intends to follow.
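A rough sketch of that idea (the build_request/1 callback and URL strings in ParsedItem are proposals from this thread, not existing Crawly API; the Floki selectors are illustrative):

```elixir
defmodule Spiders.Example do
  use Crawly.Spider

  @impl Crawly.Spider
  def base_url(), do: "https://www.example.com"

  @impl Crawly.Spider
  def init(), do: [start_urls: ["https://www.example.com/"]]

  @impl Crawly.Spider
  def parse_item(response) do
    urls =
      response.body
      |> Floki.parse_document!()
      |> Floki.find("a.pagination")
      |> Floki.attribute("href")

    # Proposed: return plain URL strings instead of full requests.
    %Crawly.ParsedItem{items: [], requests: urls}
  end

  # Proposed optional callback: Crawly would call this only for URLs it
  # actually intends to follow, turning each into a configured request.
  def build_request(url) do
    url
    |> build_absolute_url()
    |> Crawly.Utils.request_from_url()
  end

  defp build_absolute_url(url), do: base_url() |> URI.merge(url) |> to_string()
end
```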
@jallum I think improving Crawly.Utils.request_from_url/1 could be split out into another issue, where the relative/absolute URL checks and building are done automatically. Maybe a :browser boolean flag in Crawly.Request?
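With such a flag (a hypothetical field, not currently part of the struct), a spider could mark only the JS-heavy requests:

```elixir
# Hypothetical :browser flag on Crawly.Request: only the first request
# would be routed through a browser/Splash; the second stays plain HTTP.
js_request = %Crawly.Request{url: "https://example.com/app", browser: true}
plain_request = Crawly.Utils.request_from_url("https://example.com/static")
```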
Agreed.
It turns out that Splash does not yet have a proxy interface :(. I don't really like this way of doing requests:

    curl 'http://localhost:8050/render.html?url=http://domain.com/page-with-javascript.html&timeout=10&wait=0.5'

So I will probably switch back to the original idea of headless browsers.
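For reference, staying on the HTTP API would mean rewriting every outgoing URL into a render.html call along these lines (a sketch assuming a local Splash on port 8050; module and function names are made up):

```elixir
defmodule SplashHelper do
  # Wrap an ordinary URL into a Splash render.html URL, matching the
  # curl example above.
  def splash_url(url, wait \\ 0.5, timeout \\ 10) do
    "http://localhost:8050/render.html?" <>
      URI.encode_query(url: url, timeout: timeout, wait: wait)
  end
end
```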
But this would all be within a middleware, right? It would be transparent to the user.
I think a drawback of the headless browser solution, whether it is Hound or Puppeteer, is that there are a lot of additional dependencies and moving parts.
Yes, technically it's possible to build a middleware, similar to what Scrapy does with scrapy-splash: https://github.com/scrapy-plugins/scrapy-splash/blob/master/scrapy_splash/middleware.py
What needs to be done in this case:
1. Modify the request URL into a Splash-based URL.
2. Unwrap the URL so we get the original URL back on the spider level.
I am OK with 1, but 2 looks like a hack to me. I wonder whether it would be easier to wrap Splash in a proxy interface inside Docker as an alternative?
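For concreteness, step 1 alone is simple enough as a request pipeline (a sketch assuming Crawly's run/2 pipeline convention; the Middlewares.Splash module is hypothetical and deliberately does not handle step 2, which is exactly the hacky part):

```elixir
defmodule Middlewares.Splash do
  @behaviour Crawly.Pipeline

  @splash "http://localhost:8050/render.html"

  # Step 1: rewrite the outgoing request URL to go through Splash.
  # Step 2 (recovering the original URL on the way back) has no natural
  # home here, which is what makes the middleware approach feel hacky.
  def run(request, state) do
    wrapped = @splash <> "?" <> URI.encode_query(url: request.url, wait: 0.5)
    {%{request | url: wrapped}, state}
  end
end
```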
@oltarasenko I think we shouldn't put it in a middleware, since it shouldn't fundamentally change the request, only the way that the data is fetched. I think we can abstract the response fetching out into a configurable module.
Considering how the worker in lib/crawly/worker.ex currently fetches the response, what we can do is make the get_response function "pluggable" (not referring to Plug). That module would be responsible for converting a request into a response. Thus, we can have a Fetcher protocol, with out-of-the-box support for FetchWithHTTPoison and FetchWithSplash, and typespecs that require returning an HTTPoison.Response struct.
I think Fetcher or something similar would be more intuitive and clearer than referring to it as an HTTP client (since "client" can refer to many things), and it allows for more flexibility if there are other protocols that users want to implement.
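A sketch of what that contract could look like (module and callback names as proposed in this comment; the exact shape is illustrative, not an existing Crawly API):

```elixir
defmodule Crawly.Fetcher do
  # Proposed contract: turn a Crawly.Request into an HTTPoison.Response.
  @callback fetch(request :: Crawly.Request.t(), options :: keyword()) ::
              {:ok, HTTPoison.Response.t()} | {:error, term()}
end

defmodule FetchWithHTTPoison do
  @behaviour Crawly.Fetcher

  @impl true
  def fetch(request, _options) do
    HTTPoison.get(request.url, request.headers, request.options)
  end
end
```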
The fetcher can be declared like so in the config:
config :crawly,
  ...
  fetcher: FetchWithSplash,
  ...
Expanding on this idea, we can then allow per-fetcher configuration (once PR #31 is complete), thereby also addressing issue #33:
config :crawly,
  fetcher: {FetchWithHTTPoison, options: blah}
Let me know what you think. With this proposal, the received Request does not need to be modified, as it would have to be if this were implemented as a middleware.
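On the worker side, get_response would then just dispatch to whatever is configured, e.g. (a sketch; the Crawly.Worker.Fetch module name is made up and the real worker internals are elided):

```elixir
defmodule Crawly.Worker.Fetch do
  # Sketch: resolve the configured fetcher, which may be a bare module
  # or a {module, options} tuple, and let it produce the response.
  def get_response(request) do
    {fetcher, options} =
      case Application.get_env(:crawly, :fetcher, FetchWithHTTPoison) do
        {module, opts} when is_list(opts) -> {module, opts}
        module -> {module, []}
      end

    fetcher.fetch(request, options)
  end
end
```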
I also think Docker would be unnecessary overhead, and one more thing for the end user to learn to set up.
I kind of like the idea of pluggable HTTP clients! Actually, this was already in the plans for Crawly. And I also agree about the middleware-based approach (i.e. I would rather skip it).
@oltarasenko Yup, I saw that you'd added the line for HTTPoison in config.exs. I'll check through #32 today; once it's merged in, I'll start on updating the docs for #31.
The example project is located here: https://github.com/oltarasenko/autosites