Comments (26)
@AoRakurai not yet, but it's planned.
from crawly.
Thanks for the quick reply! This feature would really help my coworker and me in our current project. Is there any way we could help you implement it?
I had a plan for it already, so it's possible to prioritize it. I will allocate time to this task next week and will update the ticket accordingly.
Awesome, thanks a lot. If we can help you let me know.
@oltarasenko maybe a `use Crawly.Config` macro, declared at spider level? Or a `config/1` callback, where the callback receives the global config?
@oltarasenko I, too, think this is an essential function.
I'm personally leaning towards an optional `config/1` callback, which is more transparent for the dev. However, this can make simple overriding verbose and error-prone.
Compare the following:

```elixir
# Optional callback style
def config(global) do
  Keyword.put(global, :my_config_opt, 10)
end

# Macro style
use Crawly.Config, my_config_opt: 10
```

I don't have time to implement a PR for this, but the entire lifecycle of the config is isolated to the `worker.exs`. Some built-in middlewares/pipelines use `Application.get_env` to retrieve the config, so we would have to preserve the determined config (whether spider-level or global) on the middleware/pipeline `state` argument.
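The difference between those two lookup styles can be sketched as follows; the module and the `:max_retries` option name here are hypothetical, chosen only to contrast the approaches:

```elixir
# Hypothetical module; :max_retries is an illustrative option name.
defmodule ConfigLookup do
  # Current style: read straight from the global application environment.
  # A spider-level override would be invisible to this lookup.
  def from_global_env do
    Application.get_env(:crawly, :max_retries, 3)
  end

  # Proposed style: the resolved config (spider-level or global) is
  # preserved on the middleware/pipeline state and read from there.
  def from_state(state) do
    Map.get(state, :max_retries, 3)
  end
end
```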
@Ziinc I will take this one.

> I'm personally leaning towards an optional `config/1` callback, which is more transparent for the dev. However, this can make simple overriding verbose and error-prone.

Let me think a bit more about it. I am currently building the Crawly Dashboard to manage spiders (the project is going to be released separately). Basically, I would want to be able to schedule them with custom settings. For now, it does not feel right to hardcode settings in the spider code... In practice I have seen quite a few cases where you need to slow down, for example (e.g. crawling during X-Mas, etc.). It would be unfortunate to have to commit changes to a spider every time you need to re-schedule it with a different speed, for example...

> I don't have time to implement a PR for this, but the entire lifecycle of the config is isolated to the `worker.exs`

Well, workers are started here: https://github.com/oltarasenko/crawly/blob/3c31e4552d609c1f057bf48661acebc466b97b83/lib/crawly/manager.ex#L59-L68.
@oltarasenko I can help with the front-end in React/Gatsby for the dashboard. Let me know if you need help.
@Unumus noted. I will invite you to the new repo once I have something to show. Honestly speaking, I am moving towards Phoenix LiveView for the dashboard. I used it on a couple of commercial projects and am quite happy with what I saw there. However, let's have a closer look later... I am still a bit stuck because of the COVID thingy, but hopefully I will speed up soon!
> Basically, I would want to be able to schedule them with custom settings. For now, it does not feel right to hardcode settings in the spider code...

This would involve state persistence, and therefore some storage backend. I think this should be an opt-in modular extension, as not all crawlers will be long-running.

That being said, I think this can be achieved by passing the spider-specific config at runtime, when the spider is started with `Crawly.Engine.start_spider`. I personally have integrated this into the admin console of my Phoenix app to dynamically start spiders, and this type of dynamic config passing would definitely be beneficial. However, persisting the additional spider config should be done within the app context instead of the crawler context, for separation of concerns.
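As a rough sketch of that idea (the option names and a `start_spider` variant that accepts options are assumptions here, not the released API), dynamically passed options could simply be merged over a spider's hardcoded defaults:

```elixir
# Hypothetical helper: computes the effective options a spider run would
# use, with dynamically passed options winning over hardcoded defaults.
defmodule RunConfig do
  @defaults [concurrent_requests_per_domain: 4, closespider_timeout: 3600]

  def effective_opts(run_opts) do
    # Keys in run_opts override the same keys in @defaults.
    Keyword.merge(@defaults, run_opts)
  end
end
```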
Perhaps we can think of it in terms of overrides. At each step, if the config is not found, fall back to the next level:
- Dynamically passed-in options
- Hardcoded spider config
- Hardcoded global config

We should consider mid-crawl config changes as a separate issue.
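The fallback chain above could be sketched like this; the module and the list-based shape are illustrative, not Crawly's API. `Keyword.fetch/2` is used so that an explicitly set `false` or `nil` still counts as found:

```elixir
# Illustrative resolver for the three override levels: call it with the
# levels in priority order, e.g. [dynamic_opts, spider_config, global_config].
defmodule OverrideChain do
  # Returns the value from the first level that defines `key`.
  def resolve(key, [level | rest]) do
    case Keyword.fetch(level, key) do
      {:ok, value} -> value
      :error -> resolve(key, rest)
    end
  end

  def resolve(_key, []), do: nil
end
```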
I was also thinking about this idea of extending `start_spider`. But on the other hand... well, I don't want to find myself passing middlewares and item pipelines this way... really.
I am still unsure about the best way. For now, I am reading how it's done in Scrapy: https://docs.scrapy.org/en/latest/topics/settings.html
What do you think about the following idea:
@Ziinc @AoRakurai @Unumus I have sketched the implementation of the custom settings in #72. As far as I can see, it's the fastest way to get started. Please let me know what you think about the idea. If it sounds like a good start, I will proceed to update the Crawly code to use this `Crawly.Util.get_setting/3` everywhere.
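For context, a `get_setting/3`-style helper might look roughly like this. This is an assumed sketch, not the code from #72 (whose exact signature may differ); it assumes spider-level overrides take precedence over the global `:crawly` application environment:

```elixir
# Assumed sketch: look a setting up in the spider's overrides first,
# then the :crawly application environment, then the supplied default.
defmodule GetSettingSketch do
  def get_setting(name, spider_overrides, default) do
    case Keyword.fetch(spider_overrides, name) do
      {:ok, value} -> value
      :error -> Application.get_env(:crawly, name, default)
    end
  end
end
```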
Project settings on a per-site/URL basis do not seem like a good idea to me. For example, I have 3 different spiders crawling different areas of the same site for different data.
I've had a look at the PR and I think it's headed in the right direction.
I think a better direction for stateful settings would be to let the spider callback handle fetching the state, such as fetching it from an Ecto repo or doing other logic to derive a setting. I don't think we should mix state-persistence concerns into the framework; it's beyond scope and trivial to implement separately.
@oltarasenko I will be happy if we can avoid keeping spider settings in `config.exs`. Maybe we can add a fourth callback with the settings to the spider file? I am not very familiar with the architecture of Crawly; however, many Elixir libraries use this approach.
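To illustrate the fourth-callback idea, a spider could carry its settings next to its other callbacks. Everything here (the module, the `custom_settings` name, the option keys) is an assumption for illustration, not Crawly's defined behaviour:

```elixir
# Hypothetical spider module carrying its own settings via a callback,
# instead of keeping them in config.exs. All names are illustrative.
defmodule BooksSpider do
  def base_url, do: "https://example.com"

  # The assumed fourth callback: per-spider settings that would
  # override the global config.
  def custom_settings do
    [concurrent_requests_per_domain: 1, closespider_itemcount: 100]
  end
end
```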
@oltarasenko I will look at what you've done on Monday and provide feedback.
I had another quick question after looking at the docs on Friday. I might have missed something, but I haven't seen a way to know when a spider finishes. Is there a callback or something for that?
@AoRakurai there is currently a rudimentary interface in `Crawly.Engine`. See here. Improvements to spider management have been on the books, but aren't a priority right now (#37).
@oltarasenko I think it makes a lot of sense to have a 4th callback for this.
@AoRakurai please also have a look at #72.
I will have a look at #72 again on Monday. @oltarasenko, what kind of feedback are you expecting?
This has been merged into master with #72, props to @oltarasenko!
Thanks a lot @oltarasenko
Sorry to re-open the ticket, but how can I apply these changes?
I used the `custom_settings` callback and put the params into the list, but no changes are applied. Should I update Crawly, or what?
This feature is going to be released with the next version of Crawly (0.9.0). The release will happen on Tuesday (after the Easter holidays).
@AoRakurai @Unumus This feature is released now!