Comments (26)
@AoRakurai not yet, but it's planned.
from crawly.
Thanks for the quick reply! This feature would really help my coworker and me in our current project. Is there any way we could help you implement it?
I had a plan for it already, so it's possible to prioritize it. I will allocate time to this task next week and will update the ticket accordingly.
Awesome, thanks a lot. If we can help you let me know.
@oltarasenko maybe a `use Crawly.Config` macro, declared at spider level? Or a `config/1` callback, where the callback receives the global config?
@oltarasenko I, too, think this is an essential function.
I'm personally leaning towards an optional `config/1` callback, which is more transparent for the dev. However, this can make simple overriding verbose and error-prone.
Compare the following:

```elixir
# Optional callback style
def config(global) do
  Keyword.put(global, :my_config_opt, 10)
end

# Macro style
use Crawly.Config, my_config_opt: 10
```

I don't have time to implement a PR for this, but the entire lifecycle of the config is isolated to the `worker.exs`. Some built-in middlewares/pipelines use `Application.get_env` to retrieve the config, so we would have to preserve the determined config (whether spider-level or global) on the middleware/pipeline `state` argument.
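The difference between those two lookup styles can be sketched as follows; the module and the `:max_retries` option name here are hypothetical, chosen only to contrast the approaches:

```elixir
# Hypothetical module; :max_retries is an illustrative option name.
defmodule ConfigLookup do
  # Current style: read straight from the global application environment.
  # A spider-level override would be invisible to this lookup.
  def from_global_env do
    Application.get_env(:crawly, :max_retries, 3)
  end

  # Proposed style: the resolved config (spider-level or global) is
  # preserved on the middleware/pipeline state and read from there.
  def from_state(state) do
    Map.get(state, :max_retries, 3)
  end
end
```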
@Ziinc I will take this one.

> I'm personally leaning towards an optional `config/1` callback, which is more transparent for the dev. However, this can make simple overriding verbose and error-prone.

Let me think a bit more about it. I am currently building the Crawly Dashboard to manage spiders (the project is going to be released separately). Basically, I would want to be able to schedule them with custom settings. For now, it does not feel right to hardcode settings in the spider code... In practice I have seen quite a few cases where you need to slow down, for example (e.g. crawling during X-Mas, etc.). It would be unfortunate to have to commit changes to a spider every time you need to re-schedule it with a different speed, for example...

> I don't have time to implement a PR for this, but the entire lifecycle of the config is isolated to the `worker.exs`

Well, workers are started here: https://github.com/oltarasenko/crawly/blob/3c31e4552d609c1f057bf48661acebc466b97b83/lib/crawly/manager.ex#L59-L68.
@oltarasenko I can help with the front-end in React/Gatsby for the dashboard. Let me know if you need help.
@Unumus noted. I will invite you to the new repo once I have something to show. Honestly speaking, I am moving towards Phoenix LiveView for the dashboard. I used it on a couple of commercial projects and am quite happy with what I saw there. However, let's have a closer look later... I am still a bit stuck because of the COVID thingy, but hopefully I will speed up soon!
> Basically, I would want to be able to schedule them with custom settings. For now, it does not feel right to hardcode settings in the spider code...

This would involve state persistence, and therefore some storage backend. I think this should be an opt-in modular extension, as not all crawlers will be long-running.

That being said, I think this can be achieved by passing the spider-specific config at runtime, when the spider is started with `Crawly.Engine.start_spider`. I personally have integrated this into the admin console of my Phoenix app to dynamically start spiders, and this type of dynamic config passing would definitely be beneficial. However, persisting the additional spider config should be done within the app context instead of the crawler context, for separation of concerns.
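As a rough sketch of that idea (the option names and a `start_spider` variant that accepts options are assumptions here, not the released API), dynamically passed options could simply be merged over a spider's hardcoded defaults:

```elixir
# Hypothetical helper: computes the effective options a spider run would
# use, with dynamically passed options winning over hardcoded defaults.
defmodule RunConfig do
  @defaults [concurrent_requests_per_domain: 4, closespider_timeout: 3600]

  def effective_opts(run_opts) do
    # Keys in run_opts override the same keys in @defaults.
    Keyword.merge(@defaults, run_opts)
  end
end
```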
Perhaps we can think of it in terms of overrides. At each step, if the config is not found, fall back to the next level:
- Dynamically passed-in options
- Hardcoded spider config
- Hardcoded global config

We should consider mid-crawl config changes as a separate issue.
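The fallback chain above could be sketched like this; the module and the list-based shape are illustrative, not Crawly's API. `Keyword.fetch/2` is used so that an explicitly set `false` or `nil` still counts as found:

```elixir
# Illustrative resolver for the three override levels: call it with the
# levels in priority order, e.g. [dynamic_opts, spider_config, global_config].
defmodule OverrideChain do
  # Returns the value from the first level that defines `key`.
  def resolve(key, [level | rest]) do
    case Keyword.fetch(level, key) do
      {:ok, value} -> value
      :error -> resolve(key, rest)
    end
  end

  def resolve(_key, []), do: nil
end
```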
I was also thinking about this idea of extending `start_spider`. But on the other hand... well, I don't want to find myself passing middlewares and item pipelines this way... really.
I am still unsure about the best way. For now, I am reading how it's done in Scrapy: https://docs.scrapy.org/en/latest/topics/settings.html
What do you think about the following idea:
@Ziinc @AoRakurai @Unumus I have sketched the implementation of the custom settings in #72. As far as I can see, it's the fastest way to get started. Please let me know what you think about the idea. If it sounds like a good start, I will proceed to update the Crawly code to use this `Crawly.Util.get_setting/3` everywhere.
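For context, a `get_setting/3`-style helper might look roughly like this. This is an assumed sketch, not the code from #72 (whose exact signature may differ); it assumes spider-level overrides take precedence over the global `:crawly` application environment:

```elixir
# Assumed sketch: look a setting up in the spider's overrides first,
# then the :crawly application environment, then the supplied default.
defmodule GetSettingSketch do
  def get_setting(name, spider_overrides, default) do
    case Keyword.fetch(spider_overrides, name) do
      {:ok, value} -> value
      :error -> Application.get_env(:crawly, name, default)
    end
  end
end
```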
Project settings on a per-site/URL basis do not seem like a good idea to me. For example, I have 3 different spiders crawling different areas of the same site for different data.
I've had a look at the PR and I think it's headed in the right direction.
I think a better direction for stateful settings would be to let the spider callback handle fetching the state, such as fetching it from an Ecto repo or doing other logic to derive a setting. I don't think we should mix state-persistence concerns into the framework; it's beyond scope and trivial to implement separately.
@oltarasenko I will be happy if we can avoid keeping spider settings in `config.exs`. Maybe we can add a fourth callback with the settings to the spider file? I am not very familiar with the architecture of Crawly; however, many Elixir libraries use this approach.
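To illustrate the fourth-callback idea, a spider could carry its settings next to its other callbacks. Everything here (the module, the `custom_settings` name, the option keys) is an assumption for illustration, not Crawly's defined behaviour:

```elixir
# Hypothetical spider module carrying its own settings via a callback,
# instead of keeping them in config.exs. All names are illustrative.
defmodule BooksSpider do
  def base_url, do: "https://example.com"

  # The assumed fourth callback: per-spider settings that would
  # override the global config.
  def custom_settings do
    [concurrent_requests_per_domain: 1, closespider_itemcount: 100]
  end
end
```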
@oltarasenko I will look at what you've done on Monday and provide feedback.
I had another quick question after looking at the docs on Friday. I might have missed something, but I haven't seen a way to know when a spider finishes. Is there a callback or something for that?
@AoRakurai there is currently a rudimentary interface in `Crawly.Engine`. See here. Improvements to spider management have been on the books, but aren't a priority right now (#37).
@oltarasenko I think it makes a lot of sense to have a 4th callback for this.
@AoRakurai please also have a look at #72.
I will have a look at #72 again on Monday. @oltarasenko, what kind of feedback are you expecting?
This has been merged into master with #72, props to @oltarasenko!
Thanks a lot @oltarasenko
Sorry to re-open the ticket, but how can I apply these changes?
I used the `custom_settings` callback and put the params into the list, but no changes are applied. Should I update Crawly, or what?
This feature is going to be released with the next version of Crawly (0.9.0). The release will happen on Tuesday (after the Easter holidays).
@AoRakurai @Unumus This feature is released now!