
datadeps.jl's Introduction

DataDeps

[Badges: CI | Stable Documentation | Dev Documentation]

Please see the detailed documentation linked above.

Software using DataDeps.jl

It might help to look at how DataDeps.jl is being used to understand how it may be used in your project. Some of these packages add additional abstraction or conveniences for users on top of the DataDeps.jl core functionality.

(Feel free to submit a PR adding a link to your package or research script here.)

Links:

Paper

White, L., Togneri, R., Liu, W., & Bennamoun, M. (2019). DataDeps.jl: Repeatable Data Setup for Reproducible Data Science. Journal of Open Research Software, 7(1).

datadeps.jl's People

Contributors

andreasnoack, aquatiko, asbisen, ayushk4, briochemc, carlolucibello, chengchingwen, cmcaine, dellison, evizero, femtocleaner[bot], fingolfin, ianbutterworth, jackdunnnz, johnnychen94, jonas-schulze, juliatagbot, kescobo, mjram0s, nicoleepp, omar-elrefaei, omus, oxinabox, pitmonticone, racinmat, scls19fr, sethaxen, willtebbutt, yakir12, zerefwayne


datadeps.jl's Issues

Multiple URLs for same data

Sometimes, data is stored at many different URLs for redundancy. Here is an example. It's a catalog of 8 URLs that contain the same NetCDF file with oceanographic temperature data from the World Ocean Atlas. Is it possible to register data with multiple URLs instead of only one?

Having DataDeps be aware of this type of redundancy (and try the URLs until one works) would be great for robustness!
(CI fails for one of my packages because downloading from one URL sometimes randomly fails.)
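A workaround available today is to supply a custom fetch_method that walks a list of mirrors until one succeeds. A minimal sketch, assuming a hypothetical fetch_with_mirrors helper and placeholder URLs (the real World Ocean Atlas catalogue would go in the mirror list):

using DataDeps

# Hypothetical helper (not part of DataDeps.jl): build a fetch_method that
# tries each mirror in turn until one download succeeds.
function fetch_with_mirrors(mirrors)
    return function (remotepath, localdir)
        for url in mirrors
            try
                localfile = joinpath(localdir, basename(url))
                Base.download(url, localfile)
                return localfile
            catch err
                @warn "Download failed, trying next mirror" url err
            end
        end
        error("All mirrors failed for $remotepath")
    end
end

register(DataDep(
    "WOA Temperature",
    "World Ocean Atlas temperature data (illustrative registration)",
    "https://example.org/woa_temperature.nc",  # placeholder primary URL
    fetch_method = fetch_with_mirrors([
        "https://example.org/woa_temperature.nc",
        "https://mirror.example.net/woa_temperature.nc",
    ]),
))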

Input prompt triggered in non-interactive session

This is different from how I used to implement it. I can see an argument for allowing, say, a script to trigger the input prompt. That said, I think it complicates things in automated environments.

What do you think about throwing an error in case the "always accept" option is not set and the session is not interactive?
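A minimal sketch of the proposed check, assuming plain environment-variable parsing (the real package reads DATADEPS_ALWAYS_ACCEPT via an internal helper that accepts more spellings):

if !isinteractive() && get(ENV, "DATADEPS_ALWAYS_ACCEPT", "false") != "true"
    error("A data dependency needs to be downloaded, but the session is not interactive " *
          "and DATADEPS_ALWAYS_ACCEPT is not set.")
end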

Registering halfway products

So I'm using this wonderful package, and that saves me the need to redownload stuff every time I want to reevaluate things. Great. But this got me thinking:
Usually we have this static dataset we want to process. Most often this means there will be some processed data files that result from this initial processing. We then want to do some analysis on those processed files. But it's irritating to have to take care of those halfway processing products. It would be amazing if there could be a way to register these midway processed files, so that next time we need them we won't need to recalculate them.
I think all the facilities are already here (e.g. supplying an alternative download method, one that processes the files), but I would appreciate an example made just for this use case, and I'd argue that many people would love this functionality just as much as the intended use of this awesome package.

One glaring problem is the need for something similar to a makefile, to check whether the processed files are older than the source files they were derived from...
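A rough sketch of the idea using only existing facilities: register a second DataDep whose "download" step is really the processing step, reading from the raw DataDep and writing into its own datadep directory. The names, file names, and processing body below are hypothetical, and this does not address the staleness/makefile problem mentioned above:

using DataDeps

register(DataDep(
    "RawData",
    "The original static dataset",
    "https://example.org/raw.csv",
))

register(DataDep(
    "ProcessedData",
    "Derived from RawData; regenerated only when its datadep directory is missing",
    "unused",  # no real remote path; the fetch_method below ignores it
    fetch_method = function (remotepath, localdir)
        raw = joinpath(datadep"RawData", "raw.csv")
        out = joinpath(localdir, "processed.csv")
        cp(raw, out)  # stand-in for the actual processing
        return out
    end,
))

# Analysis code then just asks for the processed product:
# processed = joinpath(datadep"ProcessedData", "processed.csv")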

Add doc string to RegisterDataDep

Seems crazy, but apparently, while all the internal methods have doc strings,
RegisterDataDep does not.

Can basically copy-paste from the README.

ArgumentError: `nothing` should not be printed

Things were working fine, but then I manually deleted the data directory, and now I'm getting this error everywhere, including in tests of this package. Example:

ERROR: LoadError: LoadError: ArgumentError: `nothing` should not be printed; use `show`, `repr`, or custom output instead.
Stacktrace:
 [1] print(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Nothing) at ./show.jl:566
 [2] print_to_string(::String, ::Vararg{Any,N} where N) at ./strings/io.jl:123
 [3] string at ./strings/io.jl:156 [inlined]
 [4] to_index(::Nothing) at ./indices.jl:270
 [5] to_index(::Array{String,1}, ::Nothing) at ./indices.jl:247
 [6] to_indices at ./indices.jl:298 [inlined]
 [7] to_indices at ./indices.jl:295 [inlined]
 [8] getindex at ./abstractarray.jl:927 [inlined]
 [9] determine_save_path(::String, ::String) at /Users/ksb/.julia/packages/DataDeps/LiEdA/src/locations.jl:118
 [10] handle_missing at /Users/ksb/.julia/packages/DataDeps/LiEdA/src/resolution_automatic.jl:9 [inlined]
 [11] _resolve(::DataDeps.DataDep{Nothing,String,typeof(DataDeps.fetch_http),getfield(GenderInference, Symbol("##1#2"))}, ::String) at /Users/ksb/.julia/packages/DataDeps/LiEdA/src/resolution.jl:83
 [12] resolve(::DataDeps.DataDep{Nothing,String,typeof(DataDeps.fetch_http),getfield(GenderInference, Symbol("##1#2"))}, ::String, ::String) at /Users/ksb/.julia/packages/DataDeps/LiEdA/src/resolution.jl:29
 [13] resolve(::String, ::String, ::String) at /Users/ksb/.julia/packages/DataDeps/LiEdA/src/resolution.jl:54
 [14] resolve(::String, ::String) at /Users/ksb/.julia/packages/DataDeps/LiEdA/src/resolution.jl:73
 [15] top-level scope at none:0
 [16] include at ./boot.jl:326 [inlined]
 [17] include_relative(::Module, ::String) at ./loading.jl:1038
 [18] include at ./sysimg.jl:29 [inlined]
 [19] include(::String) at /Users/ksb/.julia/dev/GenderInference/src/GenderInference.jl:1
 [20] top-level scope at none:0
 [21] include at ./boot.jl:326 [inlined]
 [22] include_relative(::Module, ::String) at ./loading.jl:1038
 [23] include(::Module, ::String) at ./sysimg.jl:29
 [24] top-level scope at none:2
 [25] eval at ./boot.jl:328 [inlined]
 [26] eval(::Expr) at ./client.jl:404
 [27] top-level scope at ./none:3
in expression starting at /Users/ksb/.julia/dev/GenderInference/src/data_management.jl:70
in expression starting at /Users/ksb/.julia/dev/GenderInference/src/GenderInference.jl:13

Just running `mkdir ~/.julia/datadeps` solved it, but it might be worth finding a way to deal with this more gracefully.

Mirroring Support

I think it might be good to have support for mirrors.
I think the WikiCorpus server has gone down.
I could rehost a copy on one of my servers, and then there would still be a copy.

Docs on github pages are out of date

For some reason the docs on github pages no longer match those in the docs folder on github.

Something is going wrong with Documenter, I guess, during the deploy stage on Travis.

how do I find out what file didn't download?

Hello. So, I tried to use DataDeps, and I get a stacktrace. But there's no information in it about which file actually failed to download. I've checked them all individually, but I'm at a loss from the stacktrace as to what else to do.

ERROR: HTTP.ExceptionRequest.StatusError(404, HTTP.Messages.Response: """
HTTP/1.1 404 Not Found
ETag: "29908c07f9f25a28cdb6aa1ac8f09674:1461343688"
Last-Modified: Fri, 22 Apr 2016 16:48:08 GMT
Accept-Ranges: bytes
Content-Length: 25962
Content-Type: text/html
Cache-Control: max-age=0, no-cache, no-store, max-age=0, no-cache, no-store, max-age=0, no-cache, no-store                                                                                                                
Date: Sat, 25 May 2019 16:09:25 GMT
Connection: keep-alive
Cache-Control: max-age=0, no-cache, no-store
Access-Control-Allow-Origin: *
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block
X-Permitted-Cross-Domain-Policies: master-only
Strict-Transport-Security: max-age=31536000

""")
Stacktrace:
 [1] (::getfield(Base, Symbol("##696#698")))(::Task) at ./asyncmap.jl:178
 [2] foreach(::getfield(Base, Symbol("##696#698")), ::Array{Any,1}) at ./abstractarray.jl:1866
 [3] maptwice(::Function, ::Channel{Any}, ::Array{Any,1}, ::Array{Array{String,1},1}) at ./asyncmap.jl:178
 [4] wrap_n_exec_twice at ./asyncmap.jl:154 [inlined]
 [5] #async_usemap#681(::Int64, ::Nothing, ::Function, ::getfield(DataDeps, Symbol("##14#15")){typeof(DataDeps.fetch_http),String}, ::Array{Array{String,1},1}) at ./asyncmap.jl:103
 [6] #async_usemap at ./none:0 [inlined]
 [7] #asyncmap#680 at ./asyncmap.jl:81 [inlined]
 [8] asyncmap at ./asyncmap.jl:81 [inlined]
 [9] run_fetch at /home/cce/.julia/packages/DataDeps/LiEdA/src/resolution_automatic.jl:104 [inlined]
 [10] #download#13(::Array{Array{String,1},1}, ::Nothing, ::Bool, ::Function, ::DataDep{Nothing,Array{Array{String,1},1},typeof(DataDeps.fetch_http),Array{getfield(SynPUF, Symbol("##3#4")),1}}, ::String) at /home/cce/.julia/packages/DataDeps/LiEdA/src/resolution_automatic.jl:78
 [11] download at /home/cce/.julia/packages/DataDeps/LiEdA/src/resolution_automatic.jl:70 [inlined]
 [12] handle_missing at /home/cce/.julia/packages/DataDeps/LiEdA/src/resolution_automatic.jl:10 [inlined]
 [13] _resolve(::DataDep{Nothing,Array{Array{String,1},1},typeof(DataDeps.fetch_http),Array{getfield(SynPUF, Symbol("##3#4")),1}}, ::String) at /home/cce/.julia/packages/DataDeps/LiEdA/src/resolution.jl:83
 [14] resolve(::DataDep{Nothing,Array{Array{String,1},1},typeof(DataDeps.fetch_http),Array{getfield(SynPUF, Symbol("##3#4")),1}}, ::String, ::String) at /home/cce/.julia/packages/DataDeps/LiEdA/src/resolution.jl:29
 [15] resolve(::String, ::String, ::String) at /home/cce/.julia/packages/DataDeps/LiEdA/src/resolution.jl:54
 [16] resolve(::String, ::String) at /home/cce/.julia/packages/DataDeps/LiEdA/src/resolution.jl:73
 [17] top-level scope at none:0

Printing more than the datadep string on download

When a download is "triggered", DataDeps simply prints the dataset description and the prompt. I think it would be nice if there was first some info message giving context about what's going on, especially if the download was triggered by backend code rather than by an explicit user command.

maybe something like

This program requested access to the data dependency "MNIST".

Deleting the folder `.julia/datadeps/` breaks the package until the next time it is precompiled

This is because this code https://github.com/oxinabox/DataDeps.jl/blob/master/src/locations.jl#L19 (i.e. mkpath(first(default_loadpath))) is only executed during precompilation.

So if someone deletes the whole .julia/datadeps subfolder for some reason (which I just did because I wanted to retrigger all downloads for testing), it breaks the package by throwing the following error:

julia> datadep"MNIST"
ERROR: No possible save path
Stacktrace:
 [1] determine_save_path(::String, ::Void) at /home/csto/.julia/v0.6/DataDeps/src/locations.jl:144
 [2] handle_missing(::DataDeps.DataDep{String,Array{String,1},Base.#download,Base.#identity}, ::Void) at /home/csto/.julia/v0.6/DataDeps/src/resolution_automatic.jl:2
 [3] _resolve(::DataDeps.DataDep{String,Array{String,1},Base.#download,Base.#identity}, ::Void) at /home/csto/.julia/v0.6/DataDeps/src/DataDeps.jl:78
 [4] resolve(::DataDeps.DataDep{String,Array{String,1},Base.#download,Base.#identity}, ::String, ::Void) at /home/csto/.julia/v0.6/DataDeps/src/DataDeps.jl:51

A possible solution would be to move that code into __init__.
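A minimal sketch of that fix, assuming `default_loadpath` as defined in src/locations.jl:

function __init__()
    # Create the default save location at module load time rather than at
    # precompile time, so deleting ~/.julia/datadeps does not break the package.
    mkpath(first(default_loadpath))
end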

Get the filename properly

Right now filenames are determined using basename(remotepath),
which is great for things like example.com/data.csv.

However, some data comes in the form example.com/data,
and then uses the Content-Disposition header to say what the filename is.

Real example:
https://api.github.com/repos/JuliaCI/Coverage.jl/tarball
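A sketch of one possible approach using HTTP.jl; the helper name and the regular expression are illustrative, not part of DataDeps.jl:

using HTTP

function suggested_filename(url)
    resp = HTTP.head(url)
    # Prefer the server-provided name, if any, over basename(url).
    disposition = HTTP.header(resp, "Content-Disposition")
    m = match(r"filename=\"?([^\";]+)\"?", disposition)
    return m === nothing ? basename(url) : String(m.captures[1])
end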

Docstring mismatch for `Base.download` method

The signature in the docstring makes it seem like there are keyword arguments, whereas the actual method just has positional ones. (https://github.com/oxinabox/DataDeps.jl/blob/master/src/resolution_automatic.jl#L10-L14)

I am working on the MLDatasets integration and want to provide a MNIST.download(...) method etc. Currently I have to do it like this if I want to support custom directories, which seems a bit verbose:

    function download(dir; i_accept_the_terms_of_use = nothing)
        datadep = DataDeps.registry["MNIST"]
        if i_accept_the_terms_of_use === nothing
            DataDeps.download(datadep, dir)
        else
            DataDeps.download(
                datadep,
                dir,
                datadep.remotepath,
                i_accept_the_terms_of_use
            )
        end
    end

It would be nice if there was a simple interface for saying "either I provide a custom value (i.e. accept terms, localdir) or behave according to the DataDeps default". This would be nice because I would like it so that if a user doesn't say "i_accept_the_terms_of_use" it shouldn't just move on, but instead use the default behaviour of env_bool("DATADEPS_ALWAYS_ACCEPT").

Allow 1 remote_path to result in multiple fetched files

The default fetch_method is for the HTTP protocol,
which is one where one path always leads to one file.

But for other mechanisms, allowing one path to result in many files seems reasonable.
This would be indicated by the fetch_method returning a Vector of local path names.

I think this makes sense for many protocols.

It would clean up the Google Drive problem (https://github.com/oxinabox/PyDrive.jl/blob/master/src/proto.ipynb)
where to download a folder you need to query the folder's contents during register,
rather than during fetch
(cc @yakir12)

I think it would also make sense,
in that it would let you download a whole S3 bucket,
or recursively fetch over FTP to get a directory.

This might actually already work.
I've not tested it.
If it does work, it is an emergent feature.

Disable prompts in CI

Would it be possible to automatically detect CI (e.g. get(ENV, "CI", "false") == "true") and disable the prompt in that case?
If it is a CI session and DATADEPS_ALWAYS_ACCEPT is false/not set, that should probably result in an error in download()?

Helper: empty directory

A pattern I seem to be using a lot is to move all files from a subdirectory to the current directory.
It is a one-liner, but it still might be clearer to define a function.
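A sketch of such a helper for use in a post_fetch_method (the name flatten_subdir is hypothetical, and it assumes subdir is given without a trailing slash):

# Move everything in `subdir` up into the enclosing directory,
# then remove the now-empty subdirectory.
function flatten_subdir(subdir)
    for f in readdir(subdir)
        mv(joinpath(subdir, f), joinpath(dirname(subdir), f))
    end
    rm(subdir)
end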

Some kind of `post_post_fetch_validate` option?

Between fetch_method and post_fetch_method, the checksum is used to validate that the downloaded files are correct.

There is no such thing for after post_fetch_method,
and maybe there should be.
It could be used to validate before every load,
or to validate ManualDataDeps.

Environment variable for bypassing accepting terms?

Should there be an environment variable, e.g.
DATADEP_ACCEPTALLTERMS,
which would cause it to always bypass the "I agree to use this dataset in accordance with the owner's restrictions" screen?

I think that would be useful for testing things in CI etc.

This is already going to be an option when the datadep is being downloaded manually,
but maybe an environment variable would be easier.

Allow fine grained download control and make post_fetch_method more general

Context

So I have recently added the SVHN dataset (format 2) to MLDatasets. The interesting thing about that dataset is that it is only available as three decently sized .mat files (the "extra" file is 1.2 GB, while train and test are only about 200 MB together). Technically neither the sizes nor the fact that they are .mat files is a problem, because there is MAT.jl to read them. Both properties have a few drawbacks though.

  1. Many people may not need the huge "extra" file, so it would be nice if I could split the download into a "sub-datadep". I saw the MNIST example in your tests, but the issue there is that my SVHN.download method (which should be able to download all files) would need to display multiple download prompts, which is quite ugly.
  2. Reading .mat files is much slower than reading a simple binary format like the one MNIST or CIFAR uses.
  3. It's not possible to read just specific observations; instead one must always load the full split (train, test, or extra).

Problem Description

Concerning the sub-datadeps: do you have any thoughts here? I do think having some fine-grained control over which files to download on demand is quite nice, if it doesn't require a lot of overhead or workarounds. Maybe it's worth considering making something like that a first-class concept.

Concerning the .mat files, I think it would be nice if the first thing MLDatasets does after downloading them is to transform them into a more convenient format. I could do that already with post_fetch_method, but it wouldn't quite work for my use case. As you know, all the MLDatasets methods allow the user to specify a custom "dir" where the data can be found. The idea is that a user can pre-download the native files from the website and just tell the package where those native files are. So what I would require for my use case is that the package checks whether the .mat files already exist and, if so, only performs the post_fetch_method without the download. Is this something you'd consider out of scope for DataDeps.jl?
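On the sub-datadep question, one workaround that works with the current API is to register each split as its own DataDep, so the large "extra" file is only fetched when it is actually requested. A sketch with illustrative names and placeholder URLs (not the real MLDatasets registrations):

for (part, file) in [("train", "train_32x32.mat"),
                     ("test",  "test_32x32.mat"),
                     ("extra", "extra_32x32.mat")]
    register(DataDep(
        "SVHN2 $part",
        "SVHN format-2 $part split, stored as a .mat file",
        "https://example.org/housenumbers/$file",
    ))
end

# Only what is needed gets downloaded:
# trainfile = joinpath(datadep"SVHN2 train", "train_32x32.mat")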

Not providing checksum should be a warning, not an INFO

Lyndon White [5:21 PM]
Should not providing a checksum be an INFO or a warning?

Christof Stocker [5:21 PM]
mhm good question

[5:21]
I'd say warning

Lyndon White [5:22 PM]
It's currently an info, but maybe it should be a warning, because it kind of gets lost in the printout of the download's progress bar.

Christof Stocker [5:22 PM]
right

Lyndon White [5:22 PM]
Pragmatically If it is a warning it would be red or some such.

Christof Stocker [5:22 PM]
We should bully people into proper behaviour in this case

[5:23]
since there is really no good reason not to use a checksum

Lyndon White [5:23 PM]
Data that changes

Christof Stocker [5:23 PM]
except being a little lazy

[5:23]
mhm

[5:23]
fair point

Lyndon White [5:23 PM]
For example https://catalog.data.gov/dataset/baby-names-from-social-security-card-applications-national-level-data
is updated annually

[5:23]
I think though in this case what one should do is pass an ignore function to the checksum argument

Christof Stocker [5:24 PM]
maybe for those cases make a keyword argument for suppressing the warning?

[5:24]
yea, something explicit like that

[5:24]
meaning no checksum should probably be supported, but it should require explicit enabling

[5:25]
otherwise a warning is raised every time

Remove "Baby Names" from example tests?

Unzipping the file can sometimes lead to this error:

Archive:  /var/folders/z_/q30gz8bs17g3p15d_2cv08w80000gn/T/tmpFNIUO3/Baby Names/names.zip
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of /var/folders/z_/q30gz8bs17g3p15d_2cv08w80000gn/T/tmpFNIUO3/Baby Names/names.zip or
        /var/folders/z_/q30gz8bs17g3p15d_2cv08w80000gn/T/tmpFNIUO3/Baby Names/names.zip.zip, and cannot find /var/folders/z_/q30gz8bs17g3p15d_2cv08w80000gn/T/tmpFNIUO3/Baby Names/names.zip.ZIP, period.
Data.Gov Babynames: Error During Test
  Got an exception of type ErrorException outside of a @test
  failed process: Process(`unzip -x '/var/folders/z_/q30gz8bs17g3p15d_2cv08w80000gn/T/tmpFNIUO3/Baby Names/names.zip' -d '/private/var/folders/z_/q30gz8bs17g3p15d_2cv08w80000gn/T/tmpFNIUO3/Baby Names'`, ProcessExited(9)) [9]

If you run

file '/var/folders/z_/q30gz8bs17g3p15d_2cv08w80000gn/T/tmpFNIUO3/Baby Names/names.zip'

the result is:

/var/folders/z_/q30gz8bs17g3p15d_2cv08w80000gn/T/tmpFNIUO3/Baby Names/names.zip: HTML document text, ASCII text, with no line terminators

And finally running

less  '/var/folders/z_/q30gz8bs17g3p15d_2cv08w80000gn/T/tmpFNIUO3/Baby Names/names.zip'

reveals the contents of the zip file to be:

<html><head><title>Request Rejected</title></head><body>The requested URL was rejected. Please consult with your administrator.<br><br>Your support ID is: 1243830826855095507<br><br><a href='javascript:history.back();'>[Go Back]</a></body></html>

It seems like the request to download this file is getting filtered as misuse and can lead to really flaky tests both locally and on builds. Would removing this example test be ok?

Can we do more with DOIs?

There is a lot of pressure to get a DOI assigned to your data.
A DOI is a persistent identifier: even if URLs change,
the DOI can be updated.

However, the problem is that DOIs rarely point at the data directly;
they normally point at a site that talks about the data and has a link somewhere on it.
Consuming such a site is the job of DataDepsGenerators.jl.

In theory DOIs can expose a whole range of metadata (not just a redirect);
getting at it involves negotiating formats via HTTP headers.

See https://citation.crosscite.org/docs.html#sec-3

For example,
curl -LH "Accept: application/rdf+xml;q=0.5" https://doi.org/10.1126/science.169.3946.635
gives back a lot of metadata (add -I to see the header redirect location),
none of which relates to a download URL for any data.

Now maybe that is because it isn't a dataset,
but curl -LH "Accept: application/rdf+xml;q=0.5" https://doi.org/10.7910/DVN/GQCPXM is a dataset,
and it also returns no useful information as far as a URL from which I can actually fetch the data.

Maybe the best we can do is tell people to include the DOI in their message,
so if their URLs break the DOI can be used to track down the new ones.

suggestion: citation function returning .bib

I am almost sure you thought about this, but I could not find it so I thought I'd create an issue.

Would it be possible to have a citation function that would return a string of the bibtex entry?
Something like:

julia> print(citation("FastText en")) # "FastText en" having been registered before...

@article{FastText_en,
 author    = "Bojanowski, P. and Grave, E. and Joulin, A. and Moulin, T.",
 title     = "Enriching Word Vectors with Subword Information",
 publisher = "Facebook",
 year      = 2042,
 address   = "USA",
}

that I can directly copy into my bibtex.

Above I assume one would have registered "FastText en" with the data and an additional (optional?) field for the citation. Maybe it would look like:

julia> register(DataDep("FastText en",
    """
    Dataset: FastText Word Embeddings for English.
    Author: Bojanowski et. al. (Facebook)
    License: CC-SA 3.0
    Website: https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md
    
    300-dimensional FastText word embeddings, trained on Wikipedia

    Notice: this file is over 6.2GB
    """,
    "https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.en.vec",
    "ba5420ac217fb34f15f58ded0d911a4370dfb1f3341fa7511a49ae74c87de282",
    citation = """
    @article{FastText_en,
     author    = "Bojanowski, P. and Grave, E. and Joulin, A. and Moulin, T.",
     title     = "Enriching Word Vectors with Subword Information",
     publisher = "Facebook",
     year      =  2042,
     address   = "USA",
    }
    """
))

I think it makes sense for DataDeps to align itself with Julia packages (i.e., code), which now generally contain a CITATION.bib file.

Optionally depend on HTTP.jl ?

The code to solve #21 in #22 is a bit of a hack.
It uses console calls.
It should work on any system, but it is not a nice way to do it.
I'm pretty sure unusual webserver configurations can break it.
See /tests/examples_flaky.jl.

It might be nice to build and test a solution on top of HTTP.jl.
If it proves more robust, then lazily depend on it.
So that if a user does run into trouble they can either start the REPL with julia -e "using HTTP" -i
or run the script, e.g. julia -e 'using HTTP; include("runtests.jl")'

Reduce number of required parameters for DataDep

Hello,

Following queryverse/Queryverse.jl#6 (comment),
we should reduce the number of required parameters for DataDep:

using DataDeps

function __init__()
    register(DataDep(
        "https://raw.githubusercontent.com/davidanthoff/CSVFiles.jl/master/test/data.csv"
    ))
end

df = load(datadep"Queryverse Tests/data.csv") |> DataFrame

should be enough

The message could be set to "" by default.

The name could be set to a default value (a default directory where files are stored).

Kind regards

Consider if DataDeps were Paths.

Note that right now
joinpath(datadep"a", "b") and datadep"a/b" are different,

and the latter is preferred because it does recovery if files are missing,
and if #35 eventuates it will become even more so.

It might be better to use https://github.com/rofinn/FilePathsBase.jl
and have the thing returned by datadep"a" be a path.
Along those lines, maybe have it not download immediately when resolved, but rather when files within it are accessed.

This is really DataDeps 2.0 stuff, and not something I want to work on immediately,
but it might be good to think on.

`remove` command?

How do you remove a registered file within DataDeps.jl?

I am pretty sure this is already doable, since that's one of the steps when there is an issue and you choose the option to "erase and redownload", but I am not sure how to use that, or whether there is even a function that is exported. I am using DataDeps.jl for a package that can load many different datasets, but I would like to be able to run the CI tests locally on all these downloads without needing a lot of storage to keep everything around.
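A minimal sketch of a manual workaround, assuming the data lives under the default load path: delete the datadep's directory and let it be re-downloaded the next time it is resolved.

path = datadep"MyDataset"     # resolves (and downloads, if missing)
rm(path; recursive = true)    # remove the stored copy to free disk space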

no method matching open(::Void, ::String)

ERROR: LoadError: MethodError: no method matching open(::Void, ::String)
Closest candidates are:
  open(::AbstractString, ::AbstractString) at iostream.jl:132
  open(::Function, ::Any...) at iostream.jl:150
  open(::Base.AbstractCmd, ::AbstractString) at process.jl:575
  ...
Stacktrace:
 [1] open(::SHA.#sha2_256, ::Void, ::String) at ./iostream.jl:150
 [2] run_checksum(::Tuple{SHA.#sha2_256,String}, ::Void) at /Users/omus/.julia/v0.6/DataDeps/src/verification.jl:22
 [3] checksum_pass(::String, ::Void) at /Users/omus/.julia/v0.6/DataDeps/src/resolution_automatic.jl:111
 [4] #download#19(::String, ::Void, ::Bool, ::Function, ::DataDeps.DataDep{String,String,##3#4,Base.#identity}, ::String) at /Users/omus/.julia/v0.6/DataDeps/src/resolution_automatic.jl:51
 [5] handle_missing(::DataDeps.DataDep{String,String,##3#4,Base.#identity}, ::String) at /Users/omus/.julia/v0.6/DataDeps/src/resolution_automatic.jl:8
 [6] _resolve(::DataDeps.DataDep{String,String,##3#4,Base.#identity}, ::String) at /Users/omus/.julia/v0.6/DataDeps/src/resolution.jl:83
 [7] resolve(::DataDeps.DataDep{String,String,##3#4,Base.#identity}, ::String, ::String) at /Users/omus/.julia/v0.6/DataDeps/src/resolution.jl:31
 [8] resolve(::String, ::String, ::String) at /Users/omus/.julia/v0.6/DataDeps/src/resolution.jl:54
 [9] resolve(::String, ::String) at /Users/omus/.julia/v0.6/DataDeps/src/resolution.jl:73
 [10] include_from_node1(::String) at ./loading.jl:576
 [11] include(::String) at ./sysimg.jl:14
 [12] process_options(::Base.JLOptions) at ./client.jl:305
 [13] _start() at ./client.jl:371
while loading /Users/omus/.julia/v0.6/.../test/runtests.jl, in expression starting on line 13

Handle programmatic need for resolution better

Make this not be required when you have a variable that holds part of the datadep name:

function word_embeddings(langcode, path=nothing)
    path = if path === nothing
        DataDeps.resolve(DataDeps.registry["FastText $langcode"]) # This line is bad. Fix this.
    else
        path
    end
end

Maybe via string interpolation (see the implementation in StringInterning.jl).

Multiple URLs

Sometimes large datasets are split into multiple files.
We should handle multiple remote paths for one data directory.
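A sketch of what such a registration could look like. The stacktraces elsewhere on this page show Vector remote paths in use, so some support appears to exist, but treat the exact form below (with placeholder URLs) as illustrative:

register(DataDep(
    "MyLargeData",
    "A dataset split across several files",
    [
        "https://example.org/mylargedata/part1.csv",
        "https://example.org/mylargedata/part2.csv",
    ],
))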

Better supporting tools for people working with local files before they upload them

@Rofin

It is desirable that people could take a set of files they plan to upload (such that DataDeps can be used to download them),
and run a function to check them against a DataDep registration.

This would let them generate/check the checksum in advance,
thus allowing them to make sure that their upload was correct,
e.g. that they didn't upload the wrong file, or get corruption during the upload.
This is in contrast to just deleting the checksum and modifying it to match the one reported after the download.

A related piece of functionality (that could maybe be in the same function call)
is to test out the post_fetch_method,
which is an utter pain to debug when the easiest way to call it is to delete the data and retrigger a download.

v0.3.1 released but not available on METADATA.jl

Even though 0.3.1 is released, I get:

julia> Pkg.status("DataDeps")
 - DataDeps                      0.3.0

julia> Pkg.update("DataDeps")
INFO: Updating METADATA...
INFO: Computing changes...
INFO: No packages to install, update or remove

Download prompt repeats forever in package build

I was trying to follow http://white.ucc.asn.au/DataDeps.jl/stable/z20-for-pkg-devs.html#Installing-Data-Eagerly-1 and download a data dependency at pkg> build time, but I'm having trouble getting it to work.

As an example, I've created the following deps/build.jl file:

using DataDeps

register(DataDep("foo",
                 "",
                 "https://foo.bar"))

datadep"foo"

Running pkg> build MyPackage hangs forever. The deps/build.log file shows:

This program has requested access to the data dependency foo.
which is not currently installed. It can be installed automatically, and you will not see this message again.




Do you want to download the dataset from https://foo.bar to "/home/rdeits/.julia/datadeps/foo"?
[y/n]
Do you want to download the dataset from https://foo.bar to "/home/rdeits/.julia/datadeps/foo"?
[y/n]
Do you want to download the dataset from https://foo.bar to "/home/rdeits/.julia/datadeps/foo"?
[y/n]

with that prompt repeating forever.

I've seen the same behavior when trying to load a datadep during package precompilation.

Is this a bug, or should I be doing something differently?

I've reproduced the issue with DataDeps 0.6.2 and master.

julia> versioninfo()
Julia Version 1.0.2
Commit d789231e99 (2018-11-08 20:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, skylake)

Add command to check datadep exists?

While downloading the full dataset during CI testing might not always be possible,
it should be possible to test that the data still exists.

i.e. to check if the remote URLs still point to something.

So perhaps a command check_remotes_exist(::DataDep)?

It might not work for remote paths that are not URLs, but other things break for them anyway.
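A rough sketch of what such a check could look like for HTTP(S) remote paths, using HTTP.jl (assumed here; it is not currently a DataDeps dependency) and assuming a flat list of URL strings:

using HTTP

function check_remotes_exist(datadep)
    urls = datadep.remotepath isa AbstractString ? [datadep.remotepath] : datadep.remotepath
    # One HEAD request per URL; anything below 400 counts as "still there".
    return all(url -> HTTP.head(url; status_exception = false).status < 400, urls)
end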

Add check to remove warning for when re-registering the same data

Copied from a slack conversation (kept here to help remind me of what to do):

  • Me:

    When I re-run things that include registering data, I get the Warning: Over-writing registration of the datadep...

  • Answer:

    It throws that if the name is already registered
    A check could be added so that the warning only fires when the name is already registered AND the existing registration is not equal to the new DataDep of that name (edited)

Downloading a DataDep from an `::AbstractPath`

I'm trying to use DataDeps to download data from S3, where the remote path is an S3Path, like

register(DataDep(
    "MyData",
    "Is good data. Is in cloud",
    S3Path("s3://etc."),
    fetch_method=download
    ...

But when I call datadep"MyData" I get MethodError: no method matching download(::S3Path, ::String) (an S3Path is not an AbstractString)

Is there a way around this? e.g. not using the @datadep_str macro, which returns a String.
Or would it require adding more support for AbstractPaths to DataDeps.jl, e.g. making datadeps themselves be Paths? #2

Additional flag for RegisterDataDep for post_fetch_cleanup

I think it could be nice to have an additional optional named flag for RegisterDataDep that could signal to DataDeps that, after executing the post_fetch_method, the downloaded files should be deleted. Maybe something like post_fetch_cleanup = true, which defaults to false. This way the data for something like CIFAR-10 isn't kept around twice (once as the archive and once as the unpacked files).

This should work well with #10 (comment)

Bug: symbol lookup error: ... undefined symbol

Unsure what is happening. Trying to run my code on a Google Cloud virtual machine, I get:

Do you want to download the dataset from URL to "local_path"?
[y/n]
y
/local_bin_path/julia: symbol lookup error: /home/local_user/.julia/packages/MbedTLS/XkQiX/deps/usr/lib/libmbedtls.so: undefined symbol: mbedtls_x509_crt_verify_restartable

and that stops the run... (Note I "masked" some stuff to only show the relevant info.) Any ideas?

`mkpath(first(default_loadpath))` only executed at compile time

Maybe we should move this line: https://github.com/oxinabox/DataDeps.jl/blob/master/src/locations.jl#L19 to a module __init__()? The current situation means that if a user at any time deletes ~/.julia/datadeps, then the module is more or less in a broken state, as it does not create the folder anymore until the next time it is precompiled. This results in:

LoadError: No possible save path

By the way, unsure if it is on purpose or not, but I can only make RegisterDataDep(...) work in MLDatasets if I put that call into the module's __init__(). Probably for the same reason: otherwise DataDeps.registry is empty at runtime.

unpack_if_required post-fetch helper

It may be useful to have something that, depending on the
file extension, either calls unpack or does nothing.

Though this seems like it is coming close to stepping on the toes of FileIO.jl
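A sketch of such a helper, assuming the archive extensions listed below are the interesting ones (unpack is the existing DataDeps post-fetch helper):

# Call `unpack` only when the file extension looks like an archive;
# otherwise leave the downloaded file alone.
function unpack_if_required(file)
    if any(endswith(file, ext) for ext in (".zip", ".tar", ".tar.gz", ".tgz", ".gz", ".7z"))
        unpack(file)
    end
    return nothing
end

# e.g. register(DataDep(...; post_fetch_method = unpack_if_required))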

What is our Pkg3 Story?

Pkg3 introduces the idea of projects,
which are a lot like each having a separate Julia environment (in the sense of ~/.julia/vX.y).
You are supposed to be able to just git clone them;
they are like self-contained environments.

My first thoughts were: oh, that is going to mean big changes to DataDeps.jl,
in terms of where we should (by default) store stuff, since we casually break any closed environment: the default search path searches everywhere, and the default save path is ~/.julia/datadeps.
But on further thought I don't think it will.
DataDeps.jl exists so that data does not have to be part of a project, or in any particular location.
It will just sort out the location itself.

Currently the load path looks like this:

  • <Pkg>/deps/data (This one is special in that it is never the save path, and is only visible to calls made from files within that package)
  • ENV["DATADEPS_LOADPATH"] (this is expanded out if it is :-separated)
  • ~/.julia/datadeps or equivalent (this and everything below can be disabled by setting DATADEPS_NO_STANDARD_LOAD_PATH)
  • User areas, e.g. ~/.julia/data
  • Local computer areas, e.g. C:/ProgramData, /usr/local/datadeps
  • Network areas, e.g. /usr/share/datadep

The load path is searched from the top to the bottom until a data directory with the right name is found.
If that fails, then it downloads,
and when downloading it looks for the first location that exists and is writable, starting from the top (skipping the package-specific directory)
and going down.

Stuff ends up in the package-specific directory if it was a ManualDataDep included in the repo,
or if the user moved the data directory there to deal with a conflict.
Conversely, stuff ends up in the lower locations on the load path if the user moves it down so that it can be shared, e.g. with other students in the group.
