Comments (12)
I agree that having event callbacks is useful and that a concerted effort to cover most/all events would be good. However, I don't agree with having this as a specific callback on the spider.
If anything, this feels like something that should live in the settings, as an option. Then, with the custom_settings callback being implemented in #69, hooking into the events would be as simple as:
# assuming the eventually implemented callback expects a map of the settings...
def custom_settings(global) do
  %{global | on_spider_stop: &MyApp.my_callback/1}
end
Here, the my_callback function receives the spider state.
This lets the developer define a default callback, then override it per spider as necessary.
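A minimal sketch of how such a settings-driven callback might be looked up and invoked. Everything here is an assumption for illustration: the module name CallbackSketch, the run_on_stop/2 helper, and the :on_spider_stop key itself are hypothetical and not part of Crawly's API.

```elixir
defmodule CallbackSketch do
  # Hypothetical global defaults; :on_spider_stop is the key proposed above,
  # not an existing Crawly setting.
  def default_settings do
    %{on_spider_stop: fn _state -> :noop end}
  end

  # A spider-level override in the spirit of custom_settings/1.
  # Map.put/3 works whether or not the key already exists, whereas the
  # %{map | key: value} update syntax raises KeyError for a missing key.
  def custom_settings(global) do
    Map.put(global, :on_spider_stop, fn state -> {:stopped, state.name} end)
  end

  # What an engine might do when a spider stops: look up the callback
  # and call it with the spider state, or do nothing if none is set.
  def run_on_stop(settings, spider_state) do
    case Map.fetch(settings, :on_spider_stop) do
      {:ok, callback} when is_function(callback, 1) -> callback.(spider_state)
      _ -> :ok
    end
  end
end
```

With this shape, a spider that does not override anything simply gets the default callback, and the engine never needs to know which spider customized what.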
from crawly.
Ok, now the related PR is merged. I will run the release procedure soon.
from crawly.
Seems easy for the Engine to run code that is beyond Crawly's context.
What's wrong with polling the Engine for the Spider status? Seems like a good solution to me.
Another alternative is for the Engine to push messages on the spider status to a GenServer of choice.
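As a rough sketch of the push-based alternative, a listener GenServer could simply record the last status it was sent for each spider. The module name StatusListener and the {:spider_status, spider, status} message shape are assumptions made up for this example; Crawly does not define them.

```elixir
defmodule StatusListener do
  use GenServer

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, %{}, opts)
  end

  # Returns a map of spider => last status seen.
  def statuses(server), do: GenServer.call(server, :statuses)

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_call(:statuses, _from, state), do: {:reply, state, state}

  # Hypothetical message shape pushed by the engine.
  @impl true
  def handle_info({:spider_status, spider, status}, state) do
    {:noreply, Map.put(state, spider, status)}
  end

  # Ignore anything else instead of crashing on an unexpected message.
  def handle_info(_other, state), do: {:noreply, state}
end

# What the engine side might do when a spider stops:
{:ok, listener} = StatusListener.start_link()
send(listener, {:spider_status, MySpider, :stopped})
```

The engine only needs a pid or registered name to send to, so the code reacting to the event stays outside the engine process.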
from crawly.
@Ziinc I agree with @AoRakurai that it's better not to poll the status if possible. As I understand it, the desired behavior is to call a custom callback once the spider is exiting. It's similar to what we can find in Scrapy. I used it quite a lot when I was working with Python and Scrapy, and I agree it's useful. Maybe we should consider introducing all callbacks from the page above...
from crawly.
That would make Crawly a good fit for a lot more use cases, I believe. The lack of an on_finish callback forced me to fork it temporarily so I could keep using it in my project.
from crawly.
I agree that, assuming settings can be implemented per spider, being able to define an on_spider_stop in settings would be ideal.
from crawly.
Furthermore, I feel that these callback functions should be fired off using async Tasks, instead of synchronously.
https://hexdocs.pm/elixir/Task.html
from crawly.
"Furthermore, I feel that these callback functions should be fired off using async Tasks, instead of synchronously."
Could you explain it?
from crawly.
@oltarasenko Since the callback doesn't need to interact with the main engine process, it can run concurrently in a spawned Task process as a side effect.
A few benefits:
- non-blocking
- thrown errors don't crash the main engine process
- have its own supervisor/tree, if necessary.
There are quite a few patterns for using Tasks in the docs, but I think just firing the callbacks with Task.async/1 will suffice.
from crawly.
Actually, the docs recommend using a supervisor to oversee Task execution. So each callback can be executed as a supervised Task when the event occurs.
"For example, we recommend developers to always start tasks under a supervisor. This provides more visibility and allows you to control how those tasks are terminated when a node shuts down."
https://hexdocs.pm/elixir/Task.html#module-ancestor-and-caller-tracking
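A minimal sketch of firing a callback as a supervised task, assuming a Task.Supervisor registered under the hypothetical name CallbackSupervisor. Note that Task.async/1 links the task to the caller and expects a matching Task.await, so for fire-and-forget callbacks Task.Supervisor.start_child/2 may be a closer fit; the fire_callback helper is made up for this example.

```elixir
# Start a Task.Supervisor; in an application this would normally be a child
# in the supervision tree. The name CallbackSupervisor is hypothetical.
{:ok, _sup} = Task.Supervisor.start_link(name: CallbackSupervisor)

fire_callback = fn callback, spider_state ->
  # start_child/2 returns immediately and does not link the task to the
  # caller, so an exception raised inside the callback does not take the
  # calling process down.
  Task.Supervisor.start_child(CallbackSupervisor, fn -> callback.(spider_state) end)
end

# Example: the callback reports back to the calling process.
parent = self()

{:ok, _pid} =
  fire_callback.(fn state -> send(parent, {:done, state.name}) end, %{name: "my_spider"})
```

Because the caller gets back `{:ok, pid}` right away, the engine would not block while the callback runs.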
from crawly.
Ok, I think this will be the last issue to go into the 0.9.0 Crawly release. I will prepare a rollout as soon as we agree on a solution to the problem.
from crawly.
@AoRakurai This is released now!
from crawly.