mldatasets.jl's People

Contributors

alyst, andrew-saydjari, arcaman07, athatheo, aurorarossi, carlolucibello, chandu-4444, christiangnrd, darsnack, digital-carver, dsantra92, evizero, github-actions[bot], hshindo, jefffessler, jgoldfar, johnnychen94, juliatagbot, karlfroldan, logankilpatrick, mortenpi, natema, pangoraw, pitmonticone, racinmat, shuhuagao, soham-chitnis10, sylvaticus, vitaminace33, yuehhua

mldatasets.jl's Issues

[Question] Why are the image matrices transposed?

Hi, I'm a beginner in ML and I was studying Flux.jl using the MNIST image dataset.
However, I noticed that the images in MLDatasets are transposed, while those in Flux's (now deprecated) dataset loaders are not.

[Four screenshots comparing the MLDatasets images with the Flux versions]

Is there any reason for the images to be transposed in MLDatasets?
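The likely mechanism (a storage-order mismatch, not a bug in the data itself) can be illustrated with stand-in data; this is a sketch, not the actual MLDatasets implementation:

```julia
# The raw MNIST IDX files store pixels row-major, while Julia arrays are
# column-major, so reading the bytes straight into an Array makes each image
# appear transposed. Displaying it "upright" is a single permutedims away:
x = reshape(collect(1:6), 2, 3)     # stand-in for one 28×28 MNIST image
img = permutedims(x, (2, 1))        # swap rows and columns for display
@assert img == [1 2; 3 4; 5 6]
```

MLDatasets keeps the raw byte order for fast loading and leaves the permutation to display helpers such as `convert2image`.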

Feature request: ImageNet data loader

ImageNet is quite large and locked behind terms of access that require an account.

However, it would be nice to be able to either

  • set a config (or ENV) variable to download ImageNet through MLDatasets
  • point MLDatasets to a local copy of ImageNet

and be able to use MLDatasets' interface of

train_x, train_y = ImageNet.traindata()
test_x,  test_y  = ImageNet.testdata()

as well as ImageNet.convert2image(x).
Ideally data would be in WHCN format for Flux and Metalhead models.
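A hypothetical sketch of the "point MLDatasets to a local copy" option; the variable name `IMAGENET_DIR` and the helper function are assumptions for illustration, not an existing MLDatasets API:

```julia
# Resolve the ImageNet location from an ENV variable, falling back to the
# usual DataDeps directory when the variable is unset.
function imagenet_dir()
    get(ENV, "IMAGENET_DIR",
        joinpath(homedir(), ".julia", "datadeps", "ImageNet"))
end

ENV["IMAGENET_DIR"] = "/data/imagenet"   # user-provided local copy
@assert imagenet_dir() == "/data/imagenet"
```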

fail to compile on Julia 1.1 Linux

help

(v1.1) pkg> add MLDatasets
  Updating registry at `~/.julia/registries/General`
  Updating git-repo `https://github.com/JuliaRegistries/General.git`
 Resolving package versions...
 Installed GZip ─────── v0.5.0
 Installed DataDeps ─── v0.6.2
 Installed MLDatasets ─ v0.3.0
  Updating `~/.julia/environments/v1.1/Project.toml`
  [eb30cadb] + MLDatasets v0.3.0
  Updating `~/.julia/environments/v1.1/Manifest.toml`
  [124859b0] + DataDeps v0.6.2
  [92fee26a] + GZip v0.5.0
  [eb30cadb] + MLDatasets v0.3.0

julia> using MLDatasets
[ Info: Precompiling MLDatasets [eb30cadb-4394-5ae3-aed4-317e484a6458]
ERROR: LoadError: LoadError: error compiling top-level scope: could not load library "libz"
libz.so: cannot open shared object file: No such file or directory
Stacktrace:
 [1] include at ./boot.jl:326 [inlined]
 [2] include_relative(::Module, ::String) at ./loading.jl:1038
 [3] include at ./sysimg.jl:29 [inlined]
 [4] include(::String) at /home/dicbro/.julia/packages/GZip/LD2ly/src/GZip.jl:2
 [5] top-level scope at none:0
 [6] include at ./boot.jl:326 [inlined]
 [7] include_relative(::Module, ::String) at ./loading.jl:1038
 [8] include(::Module, ::String) at ./sysimg.jl:29
 [9] top-level scope at none:2
 [10] eval at ./boot.jl:328 [inlined]
 [11] eval(::Expr) at ./client.jl:404
 [12] top-level scope at ./none:3
in expression starting at /home/dicbro/.julia/packages/GZip/LD2ly/src/zlib_h.jl:11
in expression starting at /home/dicbro/.julia/packages/GZip/LD2ly/src/GZip.jl:73
ERROR: LoadError: LoadError: LoadError: Failed to precompile GZip [92fee26a-97fe-5a0c-ad85-20a5f3185b63] to /home/dicbro/.julia/compiled/v1.1/GZip/s2LKY.ji.
Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] compilecache(::Base.PkgId, ::String) at ./loading.jl:1197
 [3] _require(::Base.PkgId) at ./loading.jl:960
 [4] require(::Base.PkgId) at ./loading.jl:858
 [5] require(::Module, ::Symbol) at ./loading.jl:853
 [6] include at ./boot.jl:326 [inlined]
 [7] include_relative(::Module, ::String) at ./loading.jl:1038
 [8] include at ./sysimg.jl:29 [inlined]
 [9] include(::String) at /home/dicbro/.julia/packages/MLDatasets/yNB45/src/MNIST/MNIST.jl:26
 [10] top-level scope at none:0
 [11] include at ./boot.jl:326 [inlined]
 [12] include_relative(::Module, ::String) at ./loading.jl:1038
 [13] include at ./sysimg.jl:29 [inlined]
 [14] include(::String) at /home/dicbro/.julia/packages/MLDatasets/yNB45/src/MLDatasets.jl:1
 [15] top-level scope at none:0
 [16] include at ./boot.jl:326 [inlined]
 [17] include_relative(::Module, ::String) at ./loading.jl:1038
 [18] include(::Module, ::String) at ./sysimg.jl:29
 [19] top-level scope at none:2
 [20] eval at ./boot.jl:328 [inlined]
 [21] eval(::Expr) at ./client.jl:404
 [22] top-level scope at ./none:3
in expression starting at /home/dicbro/.julia/packages/MLDatasets/yNB45/src/MNIST/Reader/Reader.jl:2
in expression starting at /home/dicbro/.julia/packages/MLDatasets/yNB45/src/MNIST/MNIST.jl:70
in expression starting at /home/dicbro/.julia/packages/MLDatasets/yNB45/src/MLDatasets.jl:45
ERROR: Failed to precompile MLDatasets [eb30cadb-4394-5ae3-aed4-317e484a6458] to /home/dicbro/.julia/compiled/v1.1/MLDatasets/9CUQK.ji.
Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] compilecache(::Base.PkgId, ::String) at ./loading.jl:1197
 [3] _require(::Base.PkgId) at ./loading.jl:960
 [4] require(::Base.PkgId) at ./loading.jl:858
 [5] require(::Module, ::Symbol) at ./loading.jl:853

typo in DataDeps environment variable

The environment variable DATADEPS_ALWAY_ACCEPT is used in 18 places in this package.
But the correct name is DATADEPS_ALWAYS_ACCEPT.
Should I fix this as part of my PR #79?
As it stands, the variable does nothing, which may be confusing.
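For reference, DataDeps.jl checks this exact key, so the misspelled variant is silently ignored; the correct non-interactive setup is:

```julia
# DataDeps.jl only honors the correctly spelled key:
ENV["DATADEPS_ALWAYS_ACCEPT"] = "true"
@assert get(ENV, "DATADEPS_ALWAYS_ACCEPT", "") == "true"
```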

Publish to METADATA

Hi,
this package seems useful, in good shape, and also easy to maintain. Is there anything preventing an official release?

Add Palmer penguin dataset

Would it make sense to add the Palmer penguin dataset? It was recently proposed as an alternative to the well-known Iris dataset due to growing sentiment about Ronald Fisher's eugenicist past. Since Iris is included in MLDatasets, I assumed it might fit in here quite well. Or should it rather be added to RDatasets (but then the same argument would seem to apply to the Iris dataset)?

Could not load symbol "gzopen64"

I got the following error message from tst_mnist.jl when I ran ] test MLDatasets:

Got exception outside of a @test could not load symbol "gzopen64": dlsym(0xfff153c1b3a0, gzopen64): symbol not found

Test Summary: | Pass Error Total
tst_mnist.jl | 18 90 108
  Constants | 7 7
  convert2images | 11 11
  File Header | 4 4
  Images | 72 72
    Test that traintensor are the train images | 30 30
    Test that testtensor are the test images | 30 30
    traintensor with T=Float32 | 1 1
    traintensor with T=Float64 | 1 1
    traintensor with T=N0f8 | 1 1
    traintensor with T=Int64 | 1 1
    traintensor with T=UInt8 | 1 1
    testtensor with T=Float32 | 1 1
    testtensor with T=Float64 | 1 1
    testtensor with T=N0f8 | 1 1
    testtensor with T=Int64 | 1 1
    testtensor with T=UInt8 | 1 1
  Labels | 12 12
    trainlabels | 1 1
    testlabels | 1 1
  Data | 2 2
    check traindata against traintensor and trainlabels | 1 1
    check testdata against testtensor and testlabels | 1 1

ERROR: LoadError: Some tests did not pass: 18 passed, 0 failed, 90 errored, 0 broken.

I tried reinstalling the package but it didn't work.
My OS is macOS Monterey 12.0.1, and I am using Julia v1.6.3
with MLDatasets v0.5.13.

write datasets in a JLD2 or Arrow format for faster read

We could have a "processed" folder in each dataset folder where we write the dataset object the first time we create it. On subsequent constructions, e.g. d = MNIST(), we just load the JLD2 file.

Example:

function MNIST(...)
    dataset_dir = ...
    processed_file = joinpath(dataset_dir, "processed", "dataset.jld2")
    if isfile(processed_file)
        return FileIO.load(processed_file, "dataset")
    end

    mnist = ...
    mkpath(dirname(processed_file))                        # ensure the "processed" folder exists
    FileIO.save(processed_file, Dict("dataset" => mnist))  # cache for subsequent loads
    return mnist
end

Add OGB dataset

The OGB datasets are an important collection of graph benchmarks.

The Open Graph Benchmark (OGB) aims to provide graph datasets that cover important graph machine learning tasks, diverse dataset scale, and rich domains.

I hope they can be added here.

Broken on Julia 1.0+

It seems that the package is broken on OS X with Julia 1.0+, but it works normally on Linux.

train_x, train_y = MNIST.traindata() # throws an error when trying to download the dataset

Do you want to download the dataset from ["http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz", "http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz", "http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz", "http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz"] to "/Users/fineday/.julia/datadeps/MNIST"?
[y/n]
y
ERROR: UndefVarError: GET not defined
Stacktrace:
 [1] (::getfield(Base, Symbol("##683#685")))(::Task) at ./asyncmap.jl:178
 [2] foreach(::getfield(Base, Symbol("##683#685")), ::Array{Any,1}) at ./abstractarray.jl:1835
 [3] maptwice(::Function, ::Channel{Any}, ::Array{Any,1}, ::Array{String,1}) at ./asyncmap.jl:178
 [4] wrap_n_exec_twice at ./asyncmap.jl:154 [inlined]
 [5] #async_usemap#668(::Int64, ::Nothing, ::Function, ::getfield(DataDeps, Symbol("##14#15")){typeof(DataDeps.fetch_http),String}, ::Array{String,1}) at ./asyncmap.jl:103
 [6] #async_usemap at ./none:0 [inlined]
 [7] #asyncmap#667 at ./asyncmap.jl:81 [inlined]
 [8] asyncmap at ./asyncmap.jl:81 [inlined]
 [9] run_fetch at /Users/fineday/.julia/packages/DataDeps/CDQwy/src/resolution_automatic.jl:104 [inlined]
 [10] #download#13(::Array{String,1}, ::Nothing, ::Bool, ::Function, ::DataDeps.DataDep{String,Array{String,1},typeof(DataDeps.fetch_http),typeof(identity)}, ::String) at /Users/fineday/.julia/packages/DataDeps/CDQwy/src/resolution_automatic.jl:78
 [11] download at /Users/fineday/.julia/packages/DataDeps/CDQwy/src/resolution_automatic.jl:70 [inlined]
 [12] handle_missing at /Users/fineday/.julia/packages/DataDeps/CDQwy/src/resolution_automatic.jl:10 [inlined]
 [13] _resolve(::DataDeps.DataDep{String,Array{String,1},typeof(DataDeps.fetch_http),typeof(identity)}, ::String) at /Users/fineday/.julia/packages/DataDeps/CDQwy/src/resolution.jl:83
 [14] resolve(::DataDeps.DataDep{String,Array{String,1},typeof(DataDeps.fetch_http),typeof(identity)}, ::String, ::String) at /Users/fineday/.julia/packages/DataDeps/CDQwy/src/resolution.jl:29
 [15] resolve(::String, ::String, ::String) at /Users/fineday/.julia/packages/DataDeps/CDQwy/src/resolution.jl:54
 [16] resolve at /Users/fineday/.julia/packages/DataDeps/CDQwy/src/resolution.jl:73 [inlined]
 [17] #2 at /Users/fineday/.julia/packages/MLDatasets/yNB45/src/download.jl:17 [inlined]
 [18] withenv(::getfield(MLDatasets, Symbol("##2#3")){String,Nothing}, ::Pair{String,String}) at ./env.jl:148
 [19] with_accept at /Users/fineday/.julia/packages/MLDatasets/yNB45/src/download.jl:10 [inlined]
 [20] #datadir#1 at /Users/fineday/.julia/packages/MLDatasets/yNB45/src/download.jl:14 [inlined]
 [21] datadir at /Users/fineday/.julia/packages/MLDatasets/yNB45/src/download.jl:14 [inlined]
 [22] #datafile#4(::Bool, ::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::String, ::String, ::Nothing) at /Users/fineday/.julia/packages/MLDatasets/yNB45/src/download.jl:32
 [23] datafile at /Users/fineday/.julia/packages/MLDatasets/yNB45/src/download.jl:32 [inlined]
 [24] #traintensor#2 at /Users/fineday/.julia/packages/MLDatasets/yNB45/src/MNIST/interface.jl:54 [inlined]
 [25] #traintensor at ./none:0 [inlined]
 [26] #traindata#10 at /Users/fineday/.julia/packages/MLDatasets/yNB45/src/MNIST/interface.jl:231 [inlined]
 [27] #traindata at ./none:0 [inlined]
 [28] #traindata#11 at /Users/fineday/.julia/packages/MLDatasets/yNB45/src/MNIST/interface.jl:235 [inlined]
 [29] traindata() at /Users/fineday/.julia/packages/MLDatasets/yNB45/src/MNIST/interface.jl:235
 [30] top-level scope at none:0

Segmentation fault with Images

I do not know if this issue is related to MLDatasets or Images, but trying to using MLDatasets, Images results in segmentation fault for me.

Package versions (Julia 1.6.0):

Images v0.24.1
MLDatasets v0.5.6

add Titanic dataset

One of the most famous datasets for a beginner to start with is the Titanic dataset, which is used for exploratory data analysis and for predicting outcomes with logistic regression, decision trees, random forests, and so on. I think this dataset should be added so that beginners can get started with machine learning in Julia using a beginner-friendly dataset. If approved, I am willing to work on this issue, as it would be a great addition to the other famous datasets here.

Running MNIST.traindata() gives GZip.ZError(-5, "buffer error")

using MLDatasets
tx,ty = MNIST.traindata()

GZip.ZError(-5, "buffer error")

Stacktrace:
[1] close(s::GZip.GZipStream)
@ GZip ~/.julia/packages/GZip/JNmGn/src/GZip.jl:163
[2] gzopen(::MLDatasets.MNIST.Reader.var"#5#6", ::String, ::String)
@ GZip ~/.julia/packages/GZip/JNmGn/src/GZip.jl:270
[3] readimages
@ ~/.julia/packages/MLDatasets/N3Lgo/src/MNIST/Reader/readimages.jl:80 [inlined]
[4] #traintensor#2
@ ~/.julia/packages/MLDatasets/N3Lgo/src/MNIST/interface.jl:50 [inlined]
[5] traindata(::Type{FixedPointNumbers.N0f8}; dir::Nothing)
@ MLDatasets.MNIST ~/.julia/packages/MLDatasets/N3Lgo/src/MNIST/interface.jl:221
[6] #traindata#11
@ ~/.julia/packages/MLDatasets/N3Lgo/src/MNIST/interface.jl:225 [inlined]
[7] traindata()
@ MLDatasets.MNIST ~/.julia/packages/MLDatasets/N3Lgo/src/MNIST/interface.jl:225
[8] top-level scope
@ In[110]:1
[9] eval
@ ./boot.jl:373 [inlined]
[10] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
@ Base ./loading.jl:1196


Environment:

  • Julia 1.7
  • macOS 12.2.1
  • Apple M1
  • 8 GB RAM

Large datasets

How do we deal with datasets that are too large to be read into an Array? Something in the range of 5-50 GB, for example. Are there any tools for this, or earlier discussion?

I thought about:

  • Iterators instead of arrays. Pros: simple. Cons: some tools (e.g. from MLDataUtils) may require random access to elements of a dataset.
  • New array type with lazy data loading. Maybe memory-mapped array, maybe something more custom. Pros: exposes AbstractArray interface, so existing tools will work. Cons: some tools and algorithms may expect data to be in memory while for disk-based arrays their performance will drop drastically.
  • Completely custom interface. PyTorch's datasets/dataloaders may be a good example. Pros: flexible, easy to provide fast access. Cons: most functions from MLDataUtils will break.
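A sketch of the second option (the type and names here are made up for illustration): an AbstractVector wrapper whose elements are loaded from disk only on access, so existing array-based tools keep working without holding the data in RAM.

```julia
# Lazily loaded dataset: one file per observation, read on demand.
struct LazyFileDataset{T} <: AbstractVector{T}
    paths::Vector{String}     # one file per observation
    loader::Function          # path -> observation of type T
end
Base.size(d::LazyFileDataset) = (length(d.paths),)
Base.getindex(d::LazyFileDataset, i::Int) = d.loader(d.paths[i])

# Usage with plain text files as stand-in observations:
dir = mktempdir()
paths = [joinpath(dir, "obs$i.txt") for i in 1:3]
foreach((p, s) -> write(p, s), paths, ["a", "b", "c"])
ds = LazyFileDataset{String}(paths, p -> read(p, String))
@assert length(ds) == 3
@assert ds[2] == "b"
```

The trade-off mentioned above still applies: random access is supported, but each access pays a disk read.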

Using CI for testing

We should think about how we use CI for testing the dataset interfaces, since we don't want to spam their servers with download requests for potentially very big datasets.

Maybe a good model would be to not trigger Travis automatically but instead to trigger it manually every now and then. This way we could also extend the tests to multiple versions and platforms.

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

Distribute datasets as artifacts

CRef: #57 (comment)

We currently use DataDeps as an interface to download datasets from their original websites. While this makes the license and source clear, it can be bad for reproducibility because users worldwide might have difficulty connecting to the original sites. The original sites might also go offline for various reasons, e.g., #57.

To avoid issues like #57 in the future and to speed up dataset downloads, we could take advantage of Julia's Artifacts system and let the Pkg/Storage servers hold and distribute the datasets. MLDatasets doesn't hold large datasets, so this would add little stress to the Julia ecosystem.

Titanic Dataset values are in the wrong order

Behavior

Calling Titanic.features() returns a matrix that is not in the order prescribed by Titanic.feature_names().

using MLDatasets: Titanic

features = Titanic.features()

returns the following matrix:

11×891 Matrix{Any}:
  1  12  23  34  45  56  67  78  89  100  111  122  133  …  "S"  "S"  "S"  "C"  "S"  "Q"  "S"  "C"  "C"  "S"  "S"
  2  13  24  35  46  57  68  79  90  101  112  123  134     "S"  "S"  "C"  "S"  "S"  "S"  "S"  "S"  "C"  "S"  "S"
  3  14  25  36  47  58  69  80  91  102  113  124  135     "S"  "S"  "S"  "S"  "S"  "C"  "S"  "C"  "S"  "S"  "S"
  4  15  26  37  48  59  70  81  92  103  114  125  136     "C"  "S"  "S"  "S"  "C"  "Q"  "C"  "S"  "S"  "S"  "S"
  5  16  27  38  49  60  71  82  93  104  115  126  137     "S"  "S"  "S"  "S"  "S"  ""   "S"  "S"  "S"  "S"  "S"
  6  17  28  39  50  61  72  83  94  105  116  127  138  …  "S"  "S"  "S"  "S"  "S"  "C"  "S"  "C"  "S"  "C"  "Q"
  7  18  29  40  51  62  73  84  95  106  117  128  139     "Q"  "Q"  "C"  "S"  "S"  "S"  "C"  "S"  "S"  "C"  "S"
  8  19  30  41  52  63  74  85  96  107  118  129  140     "S"  "S"  "S"  "S"  "S"  "C"  "C"  "S"  "S"  "S"  "S"
  9  20  31  42  53  64  75  86  97  108  119  130  141     "Q"  "C"  "S"  "S"  "S"  "S"  "S"  "S"  "C"  "S"  "S"
 10  21  32  43  54  65  76  87  98  109  120  131  142     "S"  "Q"  "S"  "S"  "S"  "S"  "S"  "S"  "S"  "S"  "C"
 11  22  33  44  55  66  77  88  99  110  121  132  143  …  "C"  "S"  "S"  "S"  "S"  "C"  "S"  "S"  "S"  "C"  "Q"

Upon observation, it seems that the CSV file is read sequentially and values are placed in the wrong matrix elements.

Expected Behavior

A call to Titanic.features() should return a matrix of the form

1 0 3 "Braund, Mr. Owen Harris" ... 7.25 "" S
2 1 1 "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" ... 71.2833 C85 C

Environment

  • OS: Ubuntu 20.04 WSL2
  • Julia 1.7.2
  • MLDatasets.jl 0.5.15

Update Titanic docs

The feature and target names are not properly aligned; everything is on a single row, which makes it cluttered.

Problem running Pkg.test

It seems to be stuck printing the line:

INFO: Do you want to download the dataset from String["https://raw.githubusercontent.com/tomsercu/lstm/master/data/ptb.train.txt", "https://raw.githubusercontent.com/tomsercu/lstm/master/data/ptb.test.txt"] to "/home/jrun/MLDatasets/.julia/datadeps/PTBLM"?
INFO: [y/n]

This is running on JuliaCIBot so a [y/n] answer cannot be given. It needs to be automatic.

unpack compressed files instead of streaming through them

We could drop the dependency on GZip, which is not actively maintained, by just calling DataDeps.unpack (which relies on the p7zip binary) to decompress files. It would be easy to change MNISTReader & co. to do that.

This could solve issues like #118.

Additionally, we may want to go in the direction of saving processed versions of the datasets (e.g. a JLD2 save of the dataset object itself) for faster I/O.

Add option to override interactive prompting

julia> MNIST.traindata()
This program has requested access to the data dependency MNIST.
which is not currently installed. It can be installed automatically, and you will not see this message again.

Do you want to download the dataset from ["https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz", "https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz", "https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz", "https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz"] to "/Users/anthony/.julia/datadeps/MNIST"?
[y/n]

This kind of interactive behaviour can be a headache when loading data in an automated setting, as one does not know ahead of time if a prompt is going to be requested or not. See for example, this issue:

FluxML/MLJFlux.jl#141 (comment)

Could we perhaps have an optional kwarg, as in MNIST.training_data(force=true)?

Or is there already a way to do this?
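One workaround that exists today: DataDeps.jl honors the DATADEPS_ALWAYS_ACCEPT environment variable, and `withenv` scopes it to a single call so the rest of the session keeps the interactive prompt. A sketch (the `MNIST.traindata()` call is commented out since it would trigger a download):

```julia
# Skip the download prompt for one call only:
result = withenv("DATADEPS_ALWAYS_ACCEPT" => "true") do
    # MNIST.traindata()  # would download here without prompting
    ENV["DATADEPS_ALWAYS_ACCEPT"]
end
@assert result == "true"
```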

JuliaML

Hi @hshindo any interest in moving this package to JuliaML and/or collaborating with us?

should we use DataSets.jl?

What are the advantages and disadvantages over DataDeps.jl?

What would the MNIST implementation look like if we were to move to DataSets.jl?

cc @c42f

could not load library "libz"- Failed to precompile

I am running Julia on Mac OS 11.1, and running into an error when trying to precompile. I get the error message,

ERROR: LoadError: LoadError: error compiling top-level scope: could not load library "libz"
dlopen(libz.dylib, 1): image not found
Stacktrace:
[1] include at ./boot.jl:317 [inlined]
[2] include_relative(::Module, ::String) at ./loading.jl:1044
[3] include at ./sysimg.jl:29 [inlined]
[4] include(::String) at /Users/hannah/.juliapro/JuliaPro_v1.0.5-2/packages/GZip/JNmGn/src/GZip.jl:2
[5] top-level scope at none:0
[6] include_relative(::Module, ::String) at /Applications/JuliaPro-1.0.5-2.app/Contents/Resources/julia/Contents/Resources/julia/lib/julia/sys.dylib:?
[7] include(::Module, ::String) at /Applications/JuliaPro-1.0.5-2.app/Contents/Resources/julia/Contents/Resources/julia/lib/julia/sys.dylib:?
[8] top-level scope at none:2
[9] eval at ./boot.jl:319 [inlined]
[10] eval(::Expr) at ./client.jl:393
[11] top-level scope at ./none:3
in expression starting at /Users/hannah/.juliapro/JuliaPro_v1.0.5-2/packages/GZip/JNmGn/src/zlib_h.jl:13
in expression starting at /Users/hannah/.juliapro/JuliaPro_v1.0.5-2/packages/GZip/JNmGn/src/GZip.jl:73
ERROR: LoadError: LoadError: LoadError: Failed to precompile GZip [92fee26a-97fe-5a0c-ad85-20a5f3185b63] to /Users/hannah/.juliapro/JuliaPro_v1.0.5-2/compiled/v1.0/GZip/s2LKY.ji.
Stacktrace:
[1] error(::String) at /Applications/JuliaPro-1.0.5-2.app/Contents/Resources/julia/Contents/Resources/julia/lib/julia/sys.dylib:?
[2] compilecache(::Base.PkgId, ::String) at /Applications/JuliaPro-1.0.5-2.app/Contents/Resources/julia/Contents/Resources/julia/lib/julia/sys.dylib:?
[3] _require(::Base.PkgId) at /Applications/JuliaPro-1.0.5-2.app/Contents/Resources/julia/Contents/Resources/julia/lib/julia/sys.dylib:?
[4] require(::Base.PkgId) at /Applications/JuliaPro-1.0.5-2.app/Contents/Resources/julia/Contents/Resources/julia/lib/julia/sys.dylib:? (repeats 2 times)
[5] include at ./boot.jl:317 [inlined]
[6] include_relative(::Module, ::String) at ./loading.jl:1044
[7] include at ./sysimg.jl:29 [inlined]
[8] include(::String) at /Users/hannah/.juliapro/JuliaPro_v1.0.5-2/packages/MLDatasets/GU5Hj/src/MNIST/MNIST.jl:26
[9] top-level scope at none:0
[10] include at ./boot.jl:317 [inlined]
[11] include_relative(::Module, ::String) at ./loading.jl:1044
[12] include at ./sysimg.jl:29 [inlined]
[13] include(::String) at /Users/hannah/.juliapro/JuliaPro_v1.0.5-2/packages/MLDatasets/GU5Hj/src/MLDatasets.jl:1
[14] top-level scope at none:0
[15] include_relative(::Module, ::String) at /Applications/JuliaPro-1.0.5-2.app/Contents/Resources/julia/Contents/Resources/julia/lib/julia/sys.dylib:?
[16] include(::Module, ::String) at /Applications/JuliaPro-1.0.5-2.app/Contents/Resources/julia/Contents/Resources/julia/lib/julia/sys.dylib:?
[17] top-level scope at none:2
[18] eval at ./boot.jl:319 [inlined]
[19] eval(::Expr) at ./client.jl:393
[20] top-level scope at ./none:3
in expression starting at /Users/hannah/.juliapro/JuliaPro_v1.0.5-2/packages/MLDatasets/GU5Hj/src/MNIST/Reader/Reader.jl:2
in expression starting at /Users/hannah/.juliapro/JuliaPro_v1.0.5-2/packages/MLDatasets/GU5Hj/src/MNIST/MNIST.jl:70
in expression starting at /Users/hannah/.juliapro/JuliaPro_v1.0.5-2/packages/MLDatasets/GU5Hj/src/MLDatasets.jl:45

I already tried adding zlib-ng with Homebrew, but it didn't work. Would appreciate any suggestions :)
Thanks!

using MLDatasets is very slow

In a fresh julia 1.7 session

julia> @time using MLDatasets
 13.485246 seconds (20.31 M allocations: 1.158 GiB, 7.56% gc time, 61.69% compilation time)

Is there a way to conditionally import packages?

julia> for pkg in [:ImageCore, :CSV, :HDF5, :JLD2, :JSON3]; print(pkg); @time @eval using $pkg; end
ImageCore  2.141235 seconds (3.02 M allocations: 200.377 MiB, 4.50% gc time, 32.08% compilation time)
CSV  3.817959 seconds (6.14 M allocations: 348.493 MiB, 9.81% gc time, 90.11% compilation time)
HDF5  0.723358 seconds (1.34 M allocations: 73.225 MiB, 1.69% gc time, 93.80% compilation time)
JLD2  1.139716 seconds (1.36 M allocations: 78.966 MiB, 3.95% gc time, 60.77% compilation time)
JSON3  0.033367 seconds (49.09 k allocations: 3.014 MiB)

julia> for pkg in [:DataFrames, :MLUtils, :Pickle, :NPZ, :MAT]; print(pkg); @time @eval using $pkg; end
DataFrames  1.789793 seconds (2.03 M allocations: 137.197 MiB, 4.63% gc time)
MLUtils  1.743072 seconds (2.07 M allocations: 117.900 MiB, 4.83% gc time, 47.32% compilation time)
Pickle  0.130685 seconds (159.17 k allocations: 9.751 MiB, 17.77% compilation time)
NPZ  0.504406 seconds (1.19 M allocations: 61.838 MiB, 4.05% gc time, 98.87% compilation time)
MAT  0.009792 seconds (22.84 k allocations: 1.044 MiB)

Related discourse thread
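One conditional-import pattern worth considering (a sketch, not current MLDatasets behavior): keep heavy backends out of the top-level `using` and pull them in only when a dataset actually needs them.

```julia
# Load a backend package on first use instead of at `using MLDatasets` time.
const _loaded = Set{Symbol}()
function require_backend(pkg::Symbol)
    if !(pkg in _loaded)
        @eval Main import $pkg   # deferred import, evaluated in Main
        push!(_loaded, pkg)
    end
    return nothing
end

require_backend(:Dates)          # a stdlib stands in for CSV/HDF5/etc. here
@assert :Dates in _loaded
```

Packages like Requires.jl formalize this pattern, at the cost of some extra complexity.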

Deprecations v0.6

Make sure that most v0.5 code (e.g. MNIST.traindata(Float32)) keeps working in v0.6 but with a deprecation warning.

  • Vision
    • MNIST
    • EMNIST
    • FashionMNIST
    • Cifar10, Cifar100
    • SVHN2
  • Misc
    • BostonHousing
    • Iris
    • Titanic
    • Mutagenesis
  • Text
    • PTBLM
    • UD_English
    • SMSSpamCollection
  • Graphs
    • Cora
    • PubMed
    • CiteSeer
    • TUDataset (hard depr., soft deprecation impossible / too complicated)
    • OGBDataset (hard depr., soft deprecation impossible / too complicated)
    • Reddit (hard depr., unreleased)
    • PolBlogs (hard depr., only recently released)
    • KarateClub (hard depr., unreleased)

Related to #73.

Movielens datasets

The movielens recommendation matrices could be nice additions to MLDatasets

Support ColorTypes 0.10 and MAT 0.10

It would be helpful (for other packages that depend on this one) to support the newer versions ColorTypes 0.10 and MAT 0.10.

I'll submit a PR and see if the tests pass with these newer versions.

MNIST.convert2features() gives DimensionMismatch error

The command found on the package documentation

MNIST.convert2features(MNIST.traintensor())

now produces the error

┌ Warning: convert2features is deprecated, use reshape instead.
│ caller = top-level scope at In[7]:1
└ @ Core In[7]:1
DimensionMismatch("parent has 47040000 elements, which is incompatible with size ()")

Using Julia 1.4

Thanks
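The deprecation warning points at `reshape`; flattening the 28×28×N tensor into a 784×N feature matrix (stand-in data below, not the real download) would look like:

```julia
# Flatten each 28×28 image into a 784-element column:
x = rand(Float32, 28, 28, 100)        # stand-in for MNIST.traintensor()
features = reshape(x, :, size(x, 3))  # one column per image
@assert size(features) == (784, 100)
```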

load torch tensors in OGBDatasets

Some of the features of OGBDataset are downloaded as torch tensors stored in the ".pt" format. They are currently ignored, but we could load them using Pickle.jl (e.g. see this comment).

redesign the package

I'd like to open a discussion on how we should move forward with implementing a getobs and nobs compliant api,
while possibly also simplifying the interface and the maintenance burden.

See FluxML/FastAI.jl#22

I think we should move away from the module-based approach and adopt a type-based one. It could also be convenient to have a lean type hierarchy.

Below is an initial proposal

AbstractDatasets

####### src/datasets.jl

abstract type AbstractDataset end
abstract type FileDataset <: AbstractDataset end
abstract type InMemoryDataset <: AbstractDataset end

MNIST Dataset

######  src/vision/mnist.jl
"""
docstring here, also exposing the internal fields of the struct for transparency
"""
struct MNIST <: InMemoryDataset
    x             # alternative names: `features` or `inputs`
    targets       # alternative names: `labels` or `y`
    num_classes   # optional

    function MNIST(path=nothing; split = :train)  # split could be made a mandatory keyword arg
        @assert split in [:train, :test]
        ..........
    end
end

LearnBase.getobs(data::MNIST) = (data.x, data.targets)
LearnBase.getobs(data::MNIST, idx) = (data.x[:, idx], data.targets[idx])
LearnBase.nobs(data::MNIST) = length(data.targets)

.... other stuff ....

Usage

using MLDatasets: MNIST
using Flux

train_data = MNIST(split = :train)
test_data = MNIST(split = :test)

xtrain, ytrain = getobs(train_data)
xtrain, ytrain = train_data # we can add this for convenience

xs, ys = getobs(train_data, 1:10)
xs, ys = train_data[1:10] # we can add this for convenience


train_loader = DataLoader(train_data; batch_size=128)

Transforms

Do we need transformations as part of the datasets?
This is a possible interface that assumes the transform to operate on whatever is returned by getobs

getobs(data::MNIST, idx) = data.transform(data.x[:, idx], data.targets[idx])

MNIST(split = :train, transform = (x, y) -> (random_crop(x), y))

Deprecation Path 1

We can create a deprecation path for the code

using MLDatasets: MNIST
xtrain, ytrain = MNIST.traindata(...)

by implementing

function Base.getproperty(data::MNIST, s::Symbol)
    if s == :traindata
        @warn "deprecated method"
        return ....
    end
    ....
end

Deprecation Path 2

The pattern

using MLDatasets.MNIST: traindata
xtrain, ytrain = traindata(...)

instead is more problematic, because it assumes a module MNIST exists, and this (deprecated) module would collide with the struct MNIST. A workaround is to name the new struct MNISTDataset, although I'm not super happy with such a long name.

cc @johnnychen94 @darsnack @lorenzoh

MLDatasets in julia 0.6/0.7

Hello,

  • Any plan for updating the package for julia 0.6?

  • Any plan for making the package installable with Pkg.add?

Pkg.add("MLDatasets")
unknown package MLDatasets
macro expansion at ./pkg/entry.jl:53 [inlined]
(::Base.Pkg.Entry.##1#3{String,Base.Pkg.Types.VersionSet})() at ./task.jl:335

missing datasets from Flux

This is a list of datasets that are available in Flux but not in MLDatasets. It would be useful to add them here soon, so that we can make MLDatasets the default dataset provider for Flux.

  • Boston Housing
  • CMU dictionary

Add Medical Decathlon Datasets

I am new to Julia and I am working on a medical imaging research project. I would like to add the Medical Decathlon datasets (http://medicaldecathlon.com) (https://arxiv.org/pdf/1902.09063.pdf) to this repo, as I think it would be a great way for me to learn what's going on, and it would likely benefit the entire Julia community. I will definitely need help in this endeavor, though, so please let me know if this is something of interest to the contributors of this project.

Should ImageCore be a dependency?

I tried to use MNIST.convert2image(MNIST.traintensor(1)), but it seems I don't have the ImageCore package. I think it would be more reasonable to add it as a dependency.
