
MLDatasets.jl

This package represents a community effort to provide a common interface for accessing common Machine Learning (ML) datasets. In contrast to other data-related Julia packages, the focus of MLDatasets.jl is specifically on downloading, unpacking, and accessing benchmark datasets. Functionality for data processing or visualization is provided only where it is specific to a particular dataset.

This package is part of the JuliaML ecosystem. Its functionality is built on top of the package DataDeps.jl.

Introduction

MLDatasets.jl is organized so that each dataset has its own dedicated sub-module. Where possible, those sub-modules share a common interface for interacting with the datasets. For example, you can load the training set and the test set of the MNIST database of handwritten digits using the following commands:

using MLDatasets

train_x, train_y = MNIST.traindata()
test_x,  test_y  = MNIST.testdata()

To load the data, the package looks for the necessary files in various locations (see DataDeps.jl for more information on how to configure such defaults). If the data can't be found in any of those locations, the package will trigger a download dialog and store the files in ~/.julia/datadeps/MNIST. To override this on a case-by-case basis, you can specify a data directory directly via traindata(dir = <directory>) and testdata(dir = <directory>).

Available Datasets

Check out the latest documentation

Additionally, you can make use of Julia's native docsystem. The following example shows how to get additional information on MNIST.traintensor within Julia's REPL:

?MNIST.traintensor

Each dataset has its own dedicated sub-module, so the documentation is organized per dataset as well. Below is a list of the available datasets and links to their documentation.

Image Classification

This package provides a variety of common benchmark datasets for the purpose of image classification.

Dataset        Classes    traintensor      trainlabels   testtensor       testlabels
MNIST          10         28x28x60000      60000         28x28x10000      10000
FashionMNIST   10         28x28x60000      60000         28x28x10000      10000
CIFAR-10       10         32x32x3x50000    50000         32x32x3x10000    10000
CIFAR-100      100 (20)   32x32x3x50000    50000 (x2)    32x32x3x10000    10000 (x2)
SVHN-2 (*)     10         32x32x3x73257    73257         32x32x3x26032    26032

(*) Note that the SVHN-2 dataset provides an additional 531131 observations on top of the training and test sets.
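
The tensor shapes in the table put the observation index in the last dimension. As a minimal sketch of what that layout implies, the following uses a zero-filled stand-in array (not real MNIST data, and with only 5 observations to keep it small) to pick out one image and to flatten the tensor into a feature matrix with one column per observation. The variable names are illustrative only.

```julia
# Stand-in for an MNIST-like tensor: 28×28 pixels, 5 observations.
# (Assumption: real data would come from MNIST.traintensor().)
tensor = zeros(Float32, 28, 28, 5)

# The last dimension indexes observations, so this is the first image.
first_image = tensor[:, :, 1]

# Flatten every image into a column: 784 features × 5 observations.
features = reshape(tensor, 28 * 28, size(tensor, 3))

println(size(first_image))  # (28, 28)
println(size(features))     # (784, 5)
```

The same reshape pattern should carry over to the color datasets, where the extra channel dimension sits before the observation index.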

Language Modeling

PTBLM

The PTBLM dataset consists of Penn Treebank sentences for language modeling, available from tomsercu/lstm. The unknown words are replaced with <unk> so that the total vocabulary size becomes 10000.

This is the first sentence of the PTBLM dataset.

x, y = PTBLM.traindata()

x[1]
> ["no", "it", "was", "n't", "black", "monday"]
y[1]
> ["it", "was", "n't", "black", "monday", "<eos>"]

Here MLDatasets adds the special word <eos> to the end of each sentence in y.
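
The relationship between x and y in the output above (y is x shifted by one word, with <eos> appended) can be sketched without downloading the dataset. The helper name `shift_targets` is hypothetical and is not part of MLDatasets.jl; it only illustrates how such next-word targets could be derived.

```julia
# Hypothetical helper: build next-word targets for one sentence by
# dropping the first word and appending the end-of-sentence marker.
shift_targets(words::Vector{String}) = vcat(words[2:end], "<eos>")

x1 = ["no", "it", "was", "n't", "black", "monday"]
y1 = shift_targets(x1)

println(y1)
```

Each position in y1 holds the word that follows the corresponding position in x1, which is the usual setup for training a language model.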

Text Analysis (POS-Tagging, Parsing)

UD English

The UD_English Universal Dependencies English Web Treebank dataset is an annotated corpus of morphological features, POS-tags and syntactic trees. The dataset follows CoNLL-style format.

traindata = UD_English.traindata()
devdata = UD_English.devdata()
testdata = UD_English.testdata()
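
For readers unfamiliar with the CoNLL-style format, each token occupies one line of tab-separated fields. The sketch below splits one such line by hand, using the ten column names from the CoNLL-U specification. The sample line is made up for illustration, and the exact structure returned by the UD_English functions is not shown here and may differ.

```julia
# Column names defined by the CoNLL-U format.
conllu_fields = (:id, :form, :lemma, :upos, :xpos,
                 :feats, :head, :deprel, :deps, :misc)

# Illustrative token line (not taken from the dataset); "_" marks an empty field.
line = "1\tFrom\tfrom\tADP\tIN\t_\t3\tcase\t_\t_"

# Split on tabs and pair each value with its column name.
cols = split(line, '\t')
token = NamedTuple{conllu_fields}(Tuple(cols))

println(token.form, " / ", token.upos)  # From / ADP
```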

Data Size

Dataset      Train x   Train y   Test x   Test y
PTBLM        42068     42068     3761     3761
UD_English   12543     -         2077     -

Documentation

Check out the latest documentation

Additionally, you can make use of Julia's native docsystem. The following example shows how to get additional information on MNIST.convert2image within Julia's REPL:

?MNIST.convert2image
  convert2image(array) -> Array{Gray}

  Convert the given MNIST horizontal-major tensor (or feature matrix) to a vertical-major Colorant array. The values are also color corrected according to
  the website's description, which means that the digits are black on a white background.

  julia> MNIST.convert2image(MNIST.traintensor()) # full training dataset
  28×28×60000 Array{Gray{N0f8},3}:
  [...]

  julia> MNIST.convert2image(MNIST.traintensor(1)) # first training image
  28×28 Array{Gray{N0f8},2}:
  [...]
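
To make the docstring's terms concrete, here is a sketch of the two steps it describes (swapping the first two dimensions, then inverting intensities), applied to a small stand-in array. It assumes pixel values in [0, 1], returns plain floats rather than Gray values, and is an illustration only, not the package's implementation.

```julia
# Stand-in "image" stored horizontal-major: the first index runs along
# the width (2), the second along the height (3).
raw = Float32[0.0  0.5 1.0;
              0.25 0.5 0.75]

# Swap the two dimensions to get a vertical-major (height × width) array.
vertical = permutedims(raw, (2, 1))

# Invert intensities so that high values become dark.
inverted = 1 .- vertical

println(size(vertical))  # (3, 2)
```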

Installation

To install MLDatasets.jl, start up Julia and type the following code snippet into the REPL. It makes use of the native Julia package manager.

import Pkg
Pkg.add("MLDatasets")

License

This code is free to use under the terms of the MIT license.
