Comments (15)
We might need to consider that in a multi-spider setup, some spiders do not need browser rendering. Maybe it can be handled at the spider implementation level, where browser rendering is declared with a flag (or something like that).
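For illustration only, a per-spider opt-in might look like the sketch below (the render_javascript?/0 callback is hypothetical, not an existing Crawly API):

```elixir
defmodule Spiders.JsHeavySite do
  use Crawly.Spider

  # Hypothetical flag (not an existing Crawly callback): spiders that
  # return false, or omit it, would be fetched with the plain HTTP client.
  def render_javascript?(), do: true

  @impl Crawly.Spider
  def base_url(), do: "https://www.example.com"

  @impl Crawly.Spider
  def init(), do: [start_urls: ["https://www.example.com/"]]

  @impl Crawly.Spider
  def parse_item(_response), do: %Crawly.ParsedItem{items: [], requests: []}
end
```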
related to #18
Yes, in general we need to be able to set up different HTTP clients for different spiders.
Thinking a bit more: could it be at the request level? What if a site's structure has certain parts that require JS and certain parts that don't? It would be unnecessary and inefficient to use a browser for the parts that don't require JS.
I think that using the request makes the most sense. The spider is in a good position to decide what kind of request should be made on its behalf.
Thinking out loud: what about returning plain URL strings in the ParsedItem, and then Crawly could pass those URL strings back to a new (optional) callback on the spider to produce the request for each URL? At that point, the spider could configure a request and return it. A default implementation of build_request (or whatever better name people can think of) would also eliminate a bit of boilerplate that seems to find its way into all my spiders:
# Convert URLs into requests
requests =
  urls
  |> Enum.map(&build_absolute_url/1)
  |> Enum.map(&Crawly.Utils.request_from_url/1)
This would also avoid constructing unnecessary request machinery when a request turns out to be a duplicate (or is dropped for some other reason): Crawly would only call the spider's build_request on URLs it actually intends to follow.
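A rough sketch of that idea (the build_request/1 callback and URL strings in ParsedItem are proposals from this thread, not existing Crawly API; the Floki selectors are illustrative):

```elixir
defmodule Spiders.Example do
  use Crawly.Spider

  @impl Crawly.Spider
  def base_url(), do: "https://www.example.com"

  @impl Crawly.Spider
  def init(), do: [start_urls: ["https://www.example.com/"]]

  @impl Crawly.Spider
  def parse_item(response) do
    urls =
      response.body
      |> Floki.parse_document!()
      |> Floki.find("a.pagination")
      |> Floki.attribute("href")

    # Proposed: return plain URL strings instead of full requests.
    %Crawly.ParsedItem{items: [], requests: urls}
  end

  # Proposed optional callback: Crawly would call this only for URLs it
  # actually intends to follow, turning each into a configured request.
  def build_request(url) do
    url
    |> build_absolute_url()
    |> Crawly.Utils.request_from_url()
  end

  defp build_absolute_url(url), do: base_url() |> URI.merge(url) |> to_string()
end
```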
@jallum I think improving Crawly.Utils.request_from_url/1 could be split out into another issue, where the relative/absolute URL checks and building are done automatically. Maybe a :browser boolean flag in Crawly.Request?
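With such a flag (a hypothetical field, not currently part of the struct), a spider could mark only the JS-heavy requests:

```elixir
# Hypothetical :browser flag on Crawly.Request: only the first request
# would be routed through a browser/Splash; the second stays plain HTTP.
js_request = %Crawly.Request{url: "https://example.com/app", browser: true}
plain_request = Crawly.Utils.request_from_url("https://example.com/static")
```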
Agreed.
It turns out that Splash does not yet have a proxy interface :(. I don't really like this way of doing requests:

    curl 'http://localhost:8050/render.html?url=http://domain.com/page-with-javascript.html&timeout=10&wait=0.5'

So I will probably switch back to the original idea of headless browsers.
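For reference, staying on the HTTP API would mean rewriting every outgoing URL into a render.html call along these lines (a sketch assuming a local Splash on port 8050; module and function names are made up):

```elixir
defmodule SplashHelper do
  # Wrap an ordinary URL into a Splash render.html URL, matching the
  # curl example above.
  def splash_url(url, wait \\ 0.5, timeout \\ 10) do
    "http://localhost:8050/render.html?" <>
      URI.encode_query(url: url, timeout: timeout, wait: wait)
  end
end
```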
But this would all be within a middleware, right? It would be transparent to the user.
I think a drawback of the headless browser solution, whether it is Hound or Puppeteer, is that there are a lot of additional dependencies and moving parts.
Yes, technically it's possible to build a middleware, similar to what Scrapy does with scrapy-splash: https://github.com/scrapy-plugins/scrapy-splash/blob/master/scrapy_splash/middleware.py
What needs to be done in this case:
1. Modify the request URL into a Splash-based URL.
2. Unwrap the URL so we get the original URL back on the spider level.
I am OK with 1, but 2 looks like a hack to me. I wonder whether it would be easier to wrap Splash in a proxy interface inside Docker as an alternative?
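For concreteness, step 1 alone is simple enough as a request pipeline (a sketch assuming Crawly's run/2 pipeline convention; the Middlewares.Splash module is hypothetical and deliberately does not handle step 2, which is exactly the hacky part):

```elixir
defmodule Middlewares.Splash do
  @behaviour Crawly.Pipeline

  @splash "http://localhost:8050/render.html"

  # Step 1: rewrite the outgoing request URL to go through Splash.
  # Step 2 (recovering the original URL on the way back) has no natural
  # home here, which is what makes the middleware approach feel hacky.
  def run(request, state) do
    wrapped = @splash <> "?" <> URI.encode_query(url: request.url, wait: 0.5)
    {%{request | url: wrapped}, state}
  end
end
```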
@oltarasenko I think we shouldn't put it in a middleware, since it shouldn't fundamentally change the request, only the way that the data is fetched. I think we can abstract the response fetching out into a configurable module.
Considering how the worker in lib/crawly/worker.ex currently fetches the response, what we can do is make the get_response function "pluggable" (not referring to Plug). That module would be responsible for converting a request into a response. Thus, we can have a Fetcher protocol, with out-of-the-box support for FetchWithHTTPoison and FetchWithSplash, and typespecs that require returning an HTTPoison.Response struct.
I think Fetcher or something similar would be more intuitive and clearer than referring to it as an HTTP client (since "client" can refer to many things), and it allows for more flexibility if there are other protocols that users want to implement.
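A sketch of what that contract could look like (module and callback names as proposed in this comment; the exact shape is illustrative, not an existing Crawly API):

```elixir
defmodule Crawly.Fetcher do
  # Proposed contract: turn a Crawly.Request into an HTTPoison.Response.
  @callback fetch(request :: Crawly.Request.t(), options :: keyword()) ::
              {:ok, HTTPoison.Response.t()} | {:error, term()}
end

defmodule FetchWithHTTPoison do
  @behaviour Crawly.Fetcher

  @impl true
  def fetch(request, _options) do
    HTTPoison.get(request.url, request.headers, request.options)
  end
end
```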
The fetcher can be declared like so in the config:
config :crawly,
  ...
  fetcher: FetchWithSplash,
  ...
Expanding on this idea, we can then allow per-fetcher configuration (once PR #31 is complete), thereby also addressing issue #33:
config :crawly,
  fetcher: {FetchWithHTTPoison, options: blah}
Let me know what you think. With this proposal, the received Request does not need to be modified, as it would have to be if this were implemented as a middleware.
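On the worker side, get_response would then just dispatch to whatever is configured, e.g. (a sketch; the Crawly.Worker.Fetch module name is made up and the real worker internals are elided):

```elixir
defmodule Crawly.Worker.Fetch do
  # Sketch: resolve the configured fetcher, which may be a bare module
  # or a {module, options} tuple, and let it produce the response.
  def get_response(request) do
    {fetcher, options} =
      case Application.get_env(:crawly, :fetcher, FetchWithHTTPoison) do
        {module, opts} when is_list(opts) -> {module, opts}
        module -> {module, []}
      end

    fetcher.fetch(request, options)
  end
end
```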
I also think Docker would be unnecessary overhead, and one more thing for the end user to learn to set up.
I kind of like the idea of pluggable HTTP clients! Actually, this was already in the plans for Crawly. And I also agree about the middleware-based approach (i.e. I would rather skip it).
@oltarasenko Yup, I saw that you'd added the line for HTTPoison in config.exs. I'll check through #32 today; once it's merged in, I'll start on updating the docs for #31.
The example project is located here: https://github.com/oltarasenko/autosites