Giter Club home page Giter Club logo

Comments (3)

cuducos avatar cuducos commented on August 30, 2024

I think there's a little bit of confusion here and there.

The process of downloading datasets takes quite a long time.

Yes, it deals with a lot of data so this is basically normal. I wouldn't worry about that. It downloads roughly 10 datasets with 300k lines each, cleans them, translate them, so it's just normal to take a few minutes as it does. To sum: this is not an issue IMHO.

The downloader class is not very effective. (Ref #87)

In my opening post I don't recall questioning the effectiveness of the downloader class. My main point there was semantic, architecture and design – not performance.

Plus, if retriggered, the script will re-download the files even if the local version exists and is not corrupted.

This is on purpose: we want to always run with fresh official data. It happens that official institutions change their datasets, so if you run from the scratch, it will re-do all work to make sure your data is official and updated.

Alternatively, you can skip this and download some versioned datasets as explained in the README.md.

This way, running all tests takes a long time, even if you already have the datasets.

As the discussion followed, journey tests are basically to be run on the CI, to assure everything is working. So this long time should not be an issue for the majority of developer around here.


That said I do disagree with the tone of your opening post. However I do agree with parts of your proposed TODO list:

Rewrite fetch, translate, clean into more atomic methods (Ref #87)

Yes, but we don't need an issue for that. That's what #87 is about.

Migration to Dask (Suggested by @cuducos)

That would be wonderful. Can we have an issue for that?

Use a better library for downloading and sync datasets (tox was a suggestion by @cuducos).

Sorry, but I think you got me wrong here. tox is about tests, not about syncing datasets. No idea how tox can help in downloading. Would you mind clarifying?

The checksum file could be added to the bucket to assure we have downloaded correctly and most importantly, the "verified" file.

Finally something that matches the issue title. But with such a long description not mentioning checksum at all, my opinion is as follows:

  • Close this issue as it's imprecise and convoluted
  • Open an issue for Dask
  • Open an issue for the checksum

All the rest seems to be already embraced by #87 and #122. What do you think, @willianpaixao? Was I too harsh? I'm just trying to keep thing organized to make it easier for everyone ; )

from serenata-toolbox.

willianpaixao avatar willianpaixao commented on August 30, 2024

Yes, it deals with a lot of data so this is basically normal. I wouldn't worry about that. It downloads roughly 10 datasets with 300k lines each, cleans them, translate them, so it's just normal to take a few minutes as it does. To sum: this is not an issue IMHO.

This could still be speed up with async tasks and maybe "last modification date" of the files. No need to re-download a file that is untouched for years. The checksum can also help on that.

In my opening post I don't recall questioning the effectiveness of the downloader class. My main point there was semantic, architecture and design – not performance.

That was my summary of the whole issue discusstion and code review. I didn't say you said that.

That said I do disagree with the tone of your opening post.

My appologies, didn't mean to sound rude. 😭

Sorry, but I think you got me wrong here. tox is about tests, not about syncing datasets. No idea how tox can help in downloading. Would you mind clarifying?

Yes, I mixed up things. async or another library can paralelize bigger tasks. tox can improve the tests.

Was I too harsh?

Yes, I'm now crying under the bed. 😭 πŸ’”

As the suggested actions, I'll follow all of them. But I'll wait some time before closing this ticket so you can add something.

from serenata-toolbox.

cuducos avatar cuducos commented on August 30, 2024

This could still be speed up with async tasks and maybe "last modification date" of the files. No need to re-download a file that is untouched for years. The checksum can also help on that.

From our experience, we cannot trust the file date, but again: the checksum and asyncio (as we did in the downloader.py) are indeed great ideas!

My appologies, didn't mean to sound rude. 😭

No means… I disagree with the arguments, not with your manners. I'm sorry too ; )

Hahaha… so let's go for a few atomic issues: Refactor downloads using asyncio, implementing checksum (not sure hwo to do it on Spaces side but anyway), and for Dask ; )

from serenata-toolbox.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    πŸ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. πŸ“ŠπŸ“ˆπŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❀️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.