Comments (26)

oltarasenko commented on May 14, 2024

@AoRakurai not yet, but it's planned.

from crawly.

AoRakurai commented on May 14, 2024

Thanks for the quick reply. This feature would really help my coworker and me in our current project. Is there any way we could help you implement it?

oltarasenko commented on May 14, 2024

I already had a plan for it, so it's possible to prioritize it. I will allocate time to this task next week and update the ticket accordingly.

AoRakurai commented on May 14, 2024

Awesome, thanks a lot. If we can help, let us know.

Ziinc commented on May 14, 2024

@oltarasenko maybe a use Crawly.Config macro, declared at the spider level? Or a config/1 callback, where the callback receives the global config?

commented on May 14, 2024

@oltarasenko I, too, think this is an essential feature.

Ziinc commented on May 14, 2024

I'm personally leaning towards an optional config/1 callback, which is more transparent for the dev. However, this can make simple overriding verbose and error-prone.

Compare the following:

# Optional callback style
def config(global) do
  Keyword.put(global, :my_config_opt, 10)
end

# Macro style
use Crawly.Config, my_config_opt: 10

I don't have time to implement a PR for this, but the entire lifecycle of the config is isolated to the worker.exs. Some built-in middlewares/pipelines use Application.get_env to retrieve the config, so we would have to preserve the resolved config (whether spider-level or global) on the middleware/pipeline state argument.
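To make the state-argument idea concrete, here is a hedged sketch of a middleware that prefers config carried on its state and falls back to the application env. The module name, the :my_config_opt key, and the state shape are all illustrative assumptions, not Crawly's actual API:

```elixir
# Illustrative sketch only: MyMiddleware, :my_config_opt, and the state
# shape are assumptions, not Crawly's real middleware contract.
defmodule MyMiddleware do
  # The resolved (spider-level or global) config rides on the state;
  # fall back to the application env when it is absent.
  def run(request, state) do
    opt =
      state
      |> Map.get(:config, [])
      |> Keyword.get(:my_config_opt, Application.get_env(:crawly, :my_config_opt, 10))

    # Store the resolved value on the state so later steps can reuse it.
    {request, Map.put(state, :my_config_opt, opt)}
  end
end
```

The point of the sketch is the lookup order: state-carried config first, application env second, hardcoded default last.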

oltarasenko commented on May 14, 2024

@Ziinc I will take this one.

> I'm personally leaning towards an optional config/1 callback, which is more transparent for the dev. However, this can make simple overriding verbose and error-prone.

Let me think a bit more about it. I am currently working on the Crawly Dashboard to manage spiders (the project is going to be released separately). Basically, I want to be able to schedule them with custom settings, and for now it does not feel right to hardcode settings in the spider code... In practice I have seen quite a few cases where you need to slow a spider down, e.g. when crawling during Christmas. It would be unfortunate to commit changes to the spider every time you need to re-schedule it at a different speed...

> I don't have time to implement a PR for this, but the entire lifecycle of the config is isolated to the worker.exs

Well, workers are started here: https://github.com/oltarasenko/crawly/blob/3c31e4552d609c1f057bf48661acebc466b97b83/lib/crawly/manager.ex#L59-L68.

commented on May 14, 2024

@oltarasenko I can help with the front-end in React/Gatsby for the dashboard. Let me know if you need help.

oltarasenko commented on May 14, 2024

@Unumus noted. I will invite you to the new repo once I have something to show. Honestly speaking, I am moving towards Phoenix LiveView for the dashboard. I used it on a couple of commercial projects and am quite happy with what I saw. However, let's have a closer look later... I am still a bit stuck because of the COVID situation, but hopefully will speed up soon!

Ziinc commented on May 14, 2024

> Basically, I would want to be able to schedule them with custom settings. For now, it does not feel right to hardcode settings in the spider code...

This would involve state persistence and some storage backend. I think this should be an opt-in modular extension, as not all crawlers are long-running.

That being said, I think this can be achieved by passing the spider-specific config at runtime when the spider is started with Crawly.Engine.start_spider. I have personally integrated this into the admin console of my Phoenix app to dynamically start spiders, and this kind of dynamic config passing would definitely be beneficial. However, persisting the additional spider config should be done within the app context instead of the crawler context, for separation of concerns.

Ziinc commented on May 14, 2024

Perhaps we can think of it in terms of overrides. At each step, if the config is not found, fall back to the next level:

  1. Dynamically passed-in options
  2. Hardcoded spider config
  3. Hardcoded global config

We should consider mid-crawl config changes as a separate issue.
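The fallback chain above can be sketched as a plain resolution function. This is a minimal sketch; the module, the function name get_setting, and its argument list are illustrative, not Crawly's actual API:

```elixir
# Illustrative sketch of the override chain: runtime options win over the
# spider-level config, which wins over the global config.
defmodule SettingsSketch do
  def get_setting(key, runtime_opts, spider_config, global_config) do
    Keyword.get(runtime_opts, key) ||
      Keyword.get(spider_config, key) ||
      Keyword.get(global_config, key)
  end
end
```

Note that the || fallback treats an explicitly-set false or nil as "missing"; a production version would need to distinguish "unset" from "set to a falsy value", e.g. via Keyword.fetch/2.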
oltarasenko commented on May 14, 2024

I was also thinking about extending start_spider. But on the other hand... well, I don't want to find myself passing middlewares and item pipelines this way... really.
I am still unsure about the best way. For now, I am reading how it's done in Scrapy: https://docs.scrapy.org/en/latest/topics/settings.html

What do you think about the following idea:
[attached image: Untitled (2)]

oltarasenko commented on May 14, 2024

@Ziinc @AoRakurai @Unumus I have sketched the implementation of the custom settings in #72. As far as I can see, it's the fastest way to get started. Please let me know what you think about the idea. If it sounds like a good start, I will proceed to update the Crawly code to use this Crawly.Util.get_setting/3 everywhere.

Ziinc commented on May 14, 2024

Project settings on a per-site/URL basis do not seem like a good idea to me. For example, I have 3 different spiders crawling different areas of the same site for different data.

I've had a look at the PR and I think it's in the right direction.

I think a better direction for stateful settings would be to let the spider callback handle fetching the state, such as fetching it from an Ecto repo or doing other logic to derive a setting. I don't think we should mix state-persistence concerns into the framework; it's beyond scope and trivial to implement separately.

commented on May 14, 2024

@oltarasenko I would be happy if we can avoid keeping spider settings in config.exs. Maybe we can add a fourth callback to the spider file with the settings? I am not very familiar with the architecture of Crawly; however, many Elixir libraries use this approach.
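As a sketch of that idea, a spider could expose its settings alongside its other callbacks. This is purely illustrative: the callback name settings/0 and its return shape are assumptions, and the use Crawly.Spider line a real spider would need is omitted so the sketch stays self-contained:

```elixir
# Illustrative only: settings/0 as a hypothetical fourth spider callback.
defmodule MySpider do
  def base_url(), do: "https://example.com"

  def init(), do: [start_urls: ["https://example.com/products"]]

  # Hypothetical callback: per-spider settings that override the global config.
  def settings() do
    [concurrent_requests_per_domain: 2, closespider_timeout: 300]
  end
end
```

The appeal of this shape is that settings live next to the crawl logic they affect, instead of in config.exs.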

AoRakurai commented on May 14, 2024

@oltarasenko I will look at what you've done on Monday and provide feedback.
I had another quick question after looking at the docs on Friday. I might have missed something, but I haven't seen a way to know when a spider finishes. Is there a callback or something for that?

Ziinc commented on May 14, 2024

@AoRakurai there is currently a rudimentary interface in Crawly.Engine. See here. Improvements to spider management have been on the books (#37), but aren't a priority right now.
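In the absence of a completion callback, one workaround is to poll the engine's view of running spiders. A minimal, hedged sketch: the is_running? check is injected as a function so the snippet stays self-contained; in a real app it would wrap whatever running-spider query Crawly.Engine exposes:

```elixir
# Sketch: block until a spider is no longer reported as running.
# is_running? is injected so the polling logic works without Crawly;
# wiring it to the actual Crawly.Engine query is left to the app.
defmodule SpiderWaiter do
  def await_finish(spider, is_running?, interval_ms \\ 100) do
    if is_running?.(spider) do
      Process.sleep(interval_ms)
      await_finish(spider, is_running?, interval_ms)
    else
      :ok
    end
  end
end
```

Polling is a stopgap; a proper "spider finished" callback or event would avoid the busy-wait entirely.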

AoRakurai commented on May 14, 2024

@oltarasenko I think it makes a lot of sense to have a fourth callback for this.

oltarasenko commented on May 14, 2024

@AoRakurai please also have a look at #72.

AoRakurai commented on May 14, 2024

I will have a look at #72 again on Monday. @oltarasenko what kind of feedback are you expecting?

Ziinc commented on May 14, 2024

This has been merged into master with #72. Props to @oltarasenko 🎉🎉

AoRakurai commented on May 14, 2024

Thanks a lot @oltarasenko

commented on May 14, 2024

Sorry to re-open the ticket, but how can I apply these changes?
I used the custom_settings callback and put the params into the list, but no changes are applied. Should I update Crawly, or what?

oltarasenko commented on May 14, 2024

This feature is going to be released with the next version of Crawly (0.9.0). The release will happen on Tuesday (after the Easter holidays).

oltarasenko commented on May 14, 2024

@AoRakurai @Unumus This feature is released now!
