Comments (11)

Ziinc commented on May 14, 2024

https://stackoverflow.com/questions/30267943/elixir-download-a-file-image-from-a-url

Use a custom item pipeline to manage the downloading. In your spider, scrape the media URLs and store them under a nested map key on the item, then pattern match on that key in the pipeline.

https://hexdocs.pm/crawly/basic_concepts.html#custom-item-pipelines

Crawly processes items sequentially, so for long downloads you might want to offload the work to a queue or an async Task.
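A minimal sketch of such a pipeline, assuming Crawly's `run/3` pipeline contract. The module name and the `:image_url` key are hypothetical, not part of Crawly itself:

```elixir
defmodule MyApp.Pipelines.DownloadImage do
  # Illustrative custom pipeline: downloads the URL stored under
  # :image_url and records the local path on the item.
  @behaviour Crawly.Pipeline

  @impl Crawly.Pipeline
  def run(item, state, opts \\ [])

  # Pattern match on items that carry a media URL.
  def run(%{image_url: url} = item, state, _opts) when is_binary(url) do
    # :httpc ships with OTP; :inets (and :ssl for https) must be started.
    {:ok, {{_, 200, _}, _headers, body}} =
      :httpc.request(:get, {to_charlist(url), []}, [], body_format: :binary)

    path = Path.join(System.tmp_dir!(), Path.basename(url))
    File.write!(path, body)

    {Map.put(item, :local_path, path), state}
  end

  # Items without a media URL pass through untouched.
  def run(item, state, _opts), do: {item, state}
end
```

The module would then be listed under the `pipelines:` key of the Crawly config, after the validation pipelines.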

from crawly.

Ziinc commented on May 14, 2024

@oltarasenko sounds like a good idea, I'll think a bit more about the API and update here. I should have time for it in the coming weeks.

@s0kil I think it would be more appropriate to have a how-to article in the docs. There are some inherent issues with having many example repos, such as maintenance and keeping them in sync.


michaltrzcinka commented on May 14, 2024

shall we create a pipeline capable of auto-downloading images? e.g. {Crawly.Pipelines.DownloadMedia, field: :image, dest: "/folder_name"}?

Just a heads-up: I've started working on such a generic pipeline today.


s0kil commented on May 14, 2024

If you do not mind, could you also mention streaming large files to disk?
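For reference, OTP's built-in `:httpc` can stream a response body straight to disk via its `:stream` option, so a large file never has to sit in memory. The URL and destination path below are placeholders:

```elixir
# :inets (and :ssl for https) must be started first, e.g. via
# :application.ensure_all_started(:inets) and (:ssl).
url = to_charlist("https://example.com/large-file.zip")
dest = to_charlist(Path.join(System.tmp_dir!(), "large-file.zip"))

# `stream: dest` tells :httpc to write the body to the file as
# chunks arrive instead of buffering the whole response.
{:ok, :saved_to_file} =
  :httpc.request(:get, {url, []}, [], stream: dest)
```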


oltarasenko commented on May 14, 2024

@s0kil I think @Ziinc gave a good answer; a pipeline is a good way to go! Otherwise, in my own projects, I download media from the parse_item callback directly. Crawly is a queue management system itself, so technically your worker will just spend a bit more time downloading the image, that's it.
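A hedged sketch of that parse_item approach. The Floki selector, field names, and output directory are placeholders, not Crawly API:

```elixir
# Hypothetical spider fragment using Floki for parsing.
@impl Crawly.Spider
def parse_item(response) do
  {:ok, document} = Floki.parse_document(response.body)

  image_url =
    document |> Floki.attribute("img", "src") |> List.first()

  # Synchronous download inside the callback: only this worker waits,
  # while the other Crawly workers keep crawling.
  local_path =
    if image_url do
      {:ok, {{_, 200, _}, _headers, body}} =
        :httpc.request(:get, {to_charlist(image_url), []}, [], body_format: :binary)

      path = Path.join(System.tmp_dir!(), Path.basename(image_url))
      File.write!(path, body)
      path
    end

  # Crawly accepts a map with :items and :requests from parse_item/1.
  %{items: [%{page: response.request_url, image: local_path}], requests: []}
end
```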

@Ziinc shall we create a pipeline capable of auto-downloading images? e.g. {Crawly.Pipelines.DownloadMedia, field: :image, dest: "/folder_name"}?


s0kil commented on May 14, 2024

@oltarasenko Will downloading a larger file in parse_item block the spider from continuing to crawl and parse?


oltarasenko commented on May 14, 2024

No, it does not block Crawly itself, just the one worker that is downloading something; all other workers stay operational. (Compared with Scrapy, where non-reactor-based downloads block the world, Crawly operates without problems.)


s0kil commented on May 14, 2024

Is it too much to ask for an example project, such as https://github.com/oltarasenko/crawly-spider-example, that saves each blog post into an individual file?


Ziinc commented on May 14, 2024

@s0kil could you give some info on how you are handling file downloads at the moment?


s0kil commented on May 14, 2024

@Ziinc I could not get it working yet.


Ziinc commented on May 14, 2024

@oltarasenko I will implement a generic supervised task execution process, as mentioned in #88 (comment), for pipelines to hook into.

