Comments (11)
https://stackoverflow.com/questions/30267943/elixir-download-a-file-image-from-a-url
Use a custom pipeline to manage the downloading. In your spider, scrape the media URLs and pass them in the item under a nested map key, then pattern match on that key in the pipeline.
https://hexdocs.pm/crawly/basic_concepts.html#custom-item-pipelines
Crawly processes items sequentially, so for long downloads you might want to offload them to a queue or use an async Task.
from crawly.
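For illustration, here is a minimal sketch of that idea. Everything in it is an assumption made for the example rather than part of Crawly: the module name `DownloadMediaSketch`, the `:media`/`:url`/`:path` item keys, and the `:dest` and `:download_fun` options are hypothetical. The only thing taken from Crawly is the pipeline shape, a `run/3` that returns `{item, state}`. The default download uses Erlang's built-in `:httpc`, so the sketch carries no extra dependencies.

```elixir
defmodule DownloadMediaSketch do
  # Hypothetical pipeline. In a Crawly project it would also declare
  # `@behaviour Crawly.Pipeline`; that line is left out so the sketch
  # compiles without the dependency.

  def run(item, state, opts \\ [])

  # Pattern match on the nested map key that the spider filled in.
  def run(%{media: %{url: url}} = item, state, opts) when is_binary(url) do
    dest = Keyword.get(opts, :dest, System.tmp_dir!())
    download = Keyword.get(opts, :download_fun, &default_download/2)
    path = Path.join(dest, file_name(url))

    # Offload the download to a separate process so the pipeline
    # returns immediately instead of waiting for the transfer.
    Task.start(fn -> download.(url, path) end)

    {put_in(item, [:media, :path], path), state}
  end

  # Items without a media URL pass through untouched.
  def run(item, state, _opts), do: {item, state}

  # Use the last URL path segment as the file name, with a fallback.
  defp file_name(url) do
    case Path.basename(URI.parse(url).path || "") do
      "" -> "download"
      name -> name
    end
  end

  # Default download: stream the response body to `path` with :httpc.
  defp default_download(url, path) do
    {:ok, _} = Application.ensure_all_started(:inets)
    {:ok, _} = Application.ensure_all_started(:ssl)
    File.mkdir_p!(Path.dirname(path))

    :httpc.request(:get, {String.to_charlist(url), []}, [],
      stream: String.to_charlist(path)
    )
  end
end
```

`Task.start/1` is fire-and-forget: a failed download is not retried or reported. Running the task under a `Task.Supervisor`, or pushing the URL onto a queue, would be the sturdier choice.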
@oltarasenko sounds like a good idea, I'll think a bit more about the API and update here. I should have time for it in the coming weeks.
@s0kil I think it would be more appropriate to have a how-to article in the docs. There are some inherent issues with having many example repos, such as maintenance and keeping them in sync.
Shall we create a pipeline capable of auto-downloading images? E.g. `{Crawly.Pipelines.DownloadMedia, field: :image, dest: "/folder_name"}`?
Just a heads-up: I've started working on such a generic pipeline today.
If you do not mind, could you also mention streaming large files to disk?
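On streaming to disk, a rough sketch using only Erlang's built-in `:httpc` client, whose `stream: filename` option writes the response body to a file instead of holding it in memory. The module name `StreamToDisk` and the function `fetch/2` are made up for this example. TLS options are left at their defaults here and should be reviewed before real use.

```elixir
defmodule StreamToDisk do
  # Hypothetical helper: download `url` straight into the file at `path`.
  def fetch(url, path) do
    {:ok, _} = Application.ensure_all_started(:inets)
    {:ok, _} = Application.ensure_all_started(:ssl)

    # Remove any stale partial file left over from an earlier attempt.
    _ = File.rm(path)

    request = {String.to_charlist(url), []}

    case :httpc.request(:get, request, [], stream: String.to_charlist(path)) do
      # Successful responses are written to disk, not returned as a body.
      {:ok, :saved_to_file} ->
        :ok

      # Other statuses come back as a regular in-memory response.
      {:ok, {{_version, status, _reason}, _headers, _body}} ->
        {:error, {:http_status, status}}

      {:error, reason} ->
        {:error, reason}
    end
  end
end
```

Called from a pipeline or from `parse_item`, this keeps memory use flat regardless of file size, since the body never passes through the item.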
@s0kil I think @Ziinc gave a good answer, a pipeline is a good way to go! Otherwise, in my own projects, I download media directly from the parse_item callback. Crawly is itself a queue management system, so technically your worker will just spend a bit more time downloading the image, that's it.
@Ziinc shall we create a pipeline capable of auto-downloading images? E.g. `{Crawly.Pipelines.DownloadMedia, field: :image, dest: "/folder_name"}`?
@oltarasenko Will downloading a larger file in parse_item block the spider from continuing to crawl and parse?
No, it does not block Crawly itself, only the one worker that is doing the download; all other workers stay operational. (Compare this with Scrapy, where non-reactor-based downloads block the world. Crawly keeps operating without problems.)
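To make that concrete, a toy sketch with plain BEAM processes standing in for Crawly workers (this is not Crawly's worker code; `WorkerDemo` and the job names are invented for the example). Each job runs in its own process, and `Process.sleep/1` simulates download time, so a slow job delays only itself.

```elixir
defmodule WorkerDemo do
  # Runs each {name, milliseconds} job in its own process and returns
  # the job names in the order they finished.
  def run(jobs) do
    parent = self()

    for {name, ms} <- jobs do
      spawn(fn ->
        # Simulated download or parse time for this worker only.
        Process.sleep(ms)
        send(parent, {:done, name})
      end)
    end

    # Collect one completion message per job, in arrival order.
    for _ <- jobs do
      receive do
        {:done, name} -> name
      end
    end
  end
end
```

Calling `WorkerDemo.run(big_file: 300, page_a: 0, page_b: 100)` lists the quick jobs ahead of `:big_file` even though it was started first, which is the behaviour described above.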
Is it too much to ask for an example project, such as https://github.com/oltarasenko/crawly-spider-example, that saves each blog post into its own file?
@s0kil could you give some info on how you are working around the downloading of files now?
@Ziinc I could not get it working yet.
@oltarasenko I will implement a generic supervised task execution process as mentioned here #88 (comment) for pipelines to hook into.