
Comments (8)

oxinabox commented on August 23, 2024

I agree, I think it is definitely an avenue worth exploring.
It is indeed generally possible with post_fetch_method.
What is not (trivially) possible is to synthesize new files out of multiple distinct files (e.g. merging two files into one).

Definitely worth exploring, with some examples.

One glaring problem is the need for something similar to a makefile, to check whether the source files are newer than the processed files that rely on them...
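
For what it's worth, that makefile-style check can be sketched with nothing but file modification times. This is only a minimal sketch: `is_stale` and `rebuild_if_stale` are hypothetical helper names, not part of DataDeps.jl, and it assumes modification times on the file system can be trusted.

```julia
# Hypothetical helpers, NOT part of DataDeps.jl: a make-style staleness check.

# A target counts as stale if it does not exist yet, or if any of its
# sources was modified more recently than the target itself.
function is_stale(target::AbstractString, sources::AbstractVector{<:AbstractString})
    isfile(target) || return true
    target_time = mtime(target)
    return any(src -> mtime(src) > target_time, sources)
end

# Run `build(target, sources)` only when the target is stale.
# Returns true if the build ran, false if it was skipped.
function rebuild_if_stale(build::Function, target, sources)
    if is_stale(target, sources)
        build(target, sources)
        return true
    end
    return false
end
```

Because `build` is the first argument, this could be called with a `do` block, e.g. `rebuild_if_stale(out, [src]) do target, sources ... end`. It says nothing about longer dependency chains; each processed file would need its own check.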

Non-trivial dependency chains -- like BinDeps.
We don't support them, but maybe we can indeed get away with just always running all computations when a dependency is fetched.

Registering the halfway files yourself, on your own server (or dropbox or something), as a second datadep is an option, but then as you say synchronizing them becomes an issue.

Also, it should be noted that DataDeps.jl doesn't know anything about files, except between the fetch_method and post_fetch_method invocations.
It doesn't know what files a datadep contains -- it is folder oriented.

from datadeps.jl.

yakir12 commented on August 23, 2024

DataDeps works best when the dependent data is static. It knows nothing about what generated the data it fetches. I think that such a "limitation" is totally acceptable in this scenario as well. So a dependency on some halfway files doesn't need to know whether the data they depend on has changed. We can leave that responsibility to the user. We could add a convenience pre_fetch_method (btw, also to this package) that lets the user supply a function to run before fetching, namely the processing of the data DataDeps already fetched, which produces the halfway files... Sorry, this is very difficult to describe, hope you understand me.

As to the implementation, maybe I should post a call on Discourse?


oxinabox commented on August 23, 2024

A call for what?
(I do not in general care for Discourse. Things are better on GitHub, where they are linked to the package, or on Slack, where responses are immediate.)


yakir12 commented on August 23, 2024

A call for help with the actual implementation :)

My experience is that sometimes people get very excited by an idea for a package and the package gets built quickly thereafter. Because my Julia-fu is probably not good enough, I won't be able to do much except some pointed help here and there. I assume you are probably very busy...


oxinabox commented on August 23, 2024

Not too busy to maintain my own packages, no.
Particularly not ones that I think are important to other people.
(I mean, I am crazy busy, but maintaining packages is part of the reason why, rather than something I don't have time to do on top of everything else.)


yakir12 commented on August 23, 2024

Awesome. As an academic I'm completely convinced this package is super useful. And if you feel you can bake in the halfway deps functionality into this one, then awesome!

I guess the most common use case is that the halfway files are stored locally: the product of the analysis resides on the computer where the analysis occurs. Maybe not always, but at least in a large proportion of the cases. So it would make sense to register the files as local. I guess we need a test case to see how it would look. And I guess we need that pre_fetch_method.


oxinabox commented on August 23, 2024

Just so there is an example of what is currently possible:
the following downloads 1 file,
but after post_fetch_method is done there will be 2,
as the post_fetch_method generates the second one.

using DataDeps

RegisterDataDep("TrumpTweets",
"""
Tweets from 538's article:
[The World’s Favorite Donald Trump Tweets](https://fivethirtyeight.com/features/the-worlds-favorite-donald-trump-tweets/)

Includes a filtered view: the tweets with any that @mention anyone removed,
so no conversations etc., just announcements of opinions/thoughts.

Used under Creative Commons Attribution 4.0 International License.
""",
"https://raw.githack.com/fivethirtyeight/data/master/trump-twitter/realDonaldTrump_poll_tweets.csv",
"5a63b6cb2503a20517b5d41bd73e821ffbfdddd5cdc1977a547f1c925790bb15",
post_fetch_method = function(in_fn) # Multiline anon function.
	out_fn = "filtered_"*basename(in_fn)
	print(out_fn)
	open(out_fn, "w") do out_fh
		for line in eachline(in_fn)
			if !contains(line, "@")
				println(out_fh, line)
			end
		end
	end

end
)

# Read the file that we are generating
for line in eachline(datadep"TrumpTweets/filtered_realDonaldTrump_poll_tweets.csv")
	println(line)
	println()
end
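
To try the filtering step on its own, without downloading anything, the body of the anonymous function can be pulled out into a named function. This is just a sketch: `write_filtered` is a hypothetical name, and like the example above it writes to a relative path, so the output lands in whatever the working directory is when it runs (the example relies on that being the datadep folder). It uses `occursin(needle, haystack)`, which exists on every Julia 1.x release.

```julia
# Hypothetical standalone version of the filtering step from the example.
# Copies every line of `in_fn` that does not contain "@" into
# "filtered_<basename>" in the current working directory,
# and returns the name of the file it wrote.
function write_filtered(in_fn::AbstractString)
    out_fn = "filtered_" * basename(in_fn)
    open(out_fn, "w") do out_fh
        for line in eachline(in_fn)
            # Keep only lines with no "@" anywhere in them.
            occursin("@", line) || println(out_fh, line)
        end
    end
    return out_fn
end
```

Since it takes the path of the fetched file as its only argument, the same function could be passed as `post_fetch_method = write_filtered` in place of the anonymous function.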


yakir12 commented on August 23, 2024

That is mega cool. Sorry for being slow.

