Kerosene: Clean Burning Fuel

Provides versioned datasets to machine learning projects in hdf5 format with a dead-simple interface.

Show me

Without optional arguments, kerosene provides a minimal interface to get features and labels in a test / train split. Below are examples of using it with keras, but it can work with any machine learning library.

# MNIST example
from keras.models import Sequential
from kerosene.datasets import mnist

(X_train, y_train), (X_test, y_test) = mnist.load_data()

# (build the perfect model here)

model.fit(X_train, y_train, show_accuracy=True, validation_data=(X_test, y_test))
score = model.evaluate(X_test, y_test, show_accuracy=True, verbose=0)

Kerosene datasets support one or more sources, such as secondary labels.

# CIFAR100 example
from kerosene.datasets import cifar100
# default labels for cifar100 are 'coarse_labels'
(X_train, y_train), (X_test, y_test) = cifar100.load_data()
# but you can also subsequently grab the 'fine_labels'
(z_train,), (z_test,) = cifar100.load_data(sources=['fine_labels'])
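
The unpacking pattern above follows from the shape of the return value: one tuple per set, each holding one array per requested source. The toy stand-in below mimics that shape with in-memory data. `toy_load_data`, `_TOY`, and their contents are hypothetical and are not kerosene's implementation.

```python
import numpy as np

# Hypothetical in-memory "dataset": set name -> source name -> array.
_TOY = {
    "train": {
        "features": np.zeros((6, 4)),
        "coarse_labels": np.arange(6),
        "fine_labels": np.arange(6) * 10,
    },
    "test": {
        "features": np.zeros((2, 4)),
        "coarse_labels": np.arange(2),
        "fine_labels": np.arange(2) * 10,
    },
}


def toy_load_data(sources=("features", "coarse_labels")):
    """Return one tuple per set, each holding one array per source."""
    return tuple(
        tuple(_TOY[set_name][source] for source in sources)
        for set_name in ("train", "test")
    )


# Two sources -> two arrays per set.
(X_train, y_train), (X_test, y_test) = toy_load_data()
# One source -> one-element tuples, hence the trailing commas.
(z_train,), (z_test,) = toy_load_data(sources=["fine_labels"])
```

The trailing comma in `(z_train,)` is what unpacks a one-element tuple; without it, `z_train` would be bound to the tuple rather than the array.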

Datasets can also have multiple sets (aka "splits") beyond test/train.

# Street View House Numbers example
from kerosene.datasets import svhn2
import numpy as np
# street view house numbers defaults: 73,257 train / 26,032 test
(X_train, y_train), (X_test, y_test) = svhn2.load_data()
# have time to burn? use 'extra' and train on > 600,000 examples!
(X_extra, y_extra), = svhn2.load_data(sets=['extra'])
X_train = np.concatenate([X_train, X_extra])
y_train = np.concatenate([y_train, y_extra])
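
Concatenating leaves the 'extra' examples at the end of the arrays. If the training code does not shuffle for you, or if a held-out fraction is later sliced from the end, it may be worth shuffling features and labels together first. A minimal sketch with NumPy, using small synthetic arrays in place of the real SVHN data (the helper name `shuffle_in_unison` is illustrative, not part of kerosene):

```python
import numpy as np


def shuffle_in_unison(X, y, seed=0):
    """Apply the same random permutation to features and labels."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(len(X))
    return X[order], y[order]


# Synthetic stand-ins: each label equals the value stored in its row.
X_train = np.arange(5).reshape(5, 1)
y_train = np.arange(5)
X_extra = np.arange(5, 8).reshape(3, 1)
y_extra = np.arange(5, 8)

X_all = np.concatenate([X_train, X_extra])
y_all = np.concatenate([y_train, y_extra])
X_all, y_all = shuffle_in_unison(X_all, y_all)
```

Indexing both arrays with the same permutation keeps every row paired with its label.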

And for some datasets less is more - perhaps only one source

# Binarized MNIST example
from kerosene.datasets import binarized_mnist
# what no labels? send in the autoencoder
(X_train,), (X_test,) = binarized_mnist.load_data()

or one set.

# Iris example
from kerosene.datasets import iris
(X_all, y_all), = iris.load_data()
# then later ... keras to the rescue
model.fit(X_all, y_all, validation_split=0.25)
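
Keras takes the `validation_split` fraction from the end of the arrays, before any shuffling. The sketch below reproduces that slicing with NumPy on synthetic data. `split_tail` is an illustrative helper, not a Keras or kerosene function.

```python
import numpy as np


def split_tail(X, y, validation_split=0.25):
    """Hold out the last `validation_split` fraction of the samples."""
    split_at = int(len(X) * (1.0 - validation_split))
    return (X[:split_at], y[:split_at]), (X[split_at:], y[split_at:])


# Synthetic stand-in for a single-set dataset.
X_all = np.arange(16).reshape(8, 2)
y_all = np.arange(8)
(X_fit, y_fit), (X_val, y_val) = split_tail(X_all, y_all)
```

If the rows of a dataset happen to be ordered by class, the held-out tail may not be representative, so shuffling before the split is worth considering.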

Kerosene downloads are automatic and cached on your local drive.
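
The download-once-then-cache behaviour can be pictured roughly as follows. This is a sketch of the general pattern, not kerosene's code: the function name `cached_path`, the cache location, and the injected `fetch` callable are all assumptions made for illustration.

```python
import os
import tempfile


def cached_path(filename, fetch, cache_dir):
    """Return the local path for `filename`, fetching it only if absent.

    `fetch(destination)` is a caller-supplied function that writes the
    file to `destination` (for example, by downloading it).
    """
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, filename)
    if not os.path.exists(path):
        # Write to a temporary name first so an interrupted fetch
        # never leaves a partial file under the final name.
        partial = path + ".part"
        fetch(partial)
        os.replace(partial, path)
    return path


# Demonstration with a fake fetch that records how often it runs.
calls = []


def fake_fetch(destination):
    calls.append(destination)
    with open(destination, "wb") as handle:
        handle.write(b"not a real hdf5 file")


demo_dir = tempfile.mkdtemp()
first = cached_path("mnist.hdf5", fake_fetch, demo_dir)
second = cached_path("mnist.hdf5", fake_fetch, demo_dir)
```

The second call finds the file already on disk and returns the same path without invoking `fetch` again.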

OK, what is this again?

Kerosene provides a collection of versioned, immutable, publicly available, fuel-compatible datasets in hdf5 format, along with a minimalist interface for Keras. In short:

  • semantic versioning: Just like software - there will be bugs and changes.
  • immutable: Once a version is released to the wild, it is never rewritten.
  • publicly available: Reproducible experiments depend on unencumbered data access.
  • fuel-compatible: Borrows from and stays compatible with the fuel data pipeline framework.
  • hdf5 format: Hoping a pipeline free of pickled python objects will be a saner one.
  • interface: As simple as possible. Automatic downloads and sensible defaults.

Kerosene includes wrappers for most of the datasets that are built into the fuel libraries. When used as a dependency, it similarly provides access to any third-party fuel hdf5 file in a way intended to be useful to any ML library, such as keras and blocks. As an example, see the lfw_fuel repo, which provides simple access to the Labeled Faces in the Wild dataset in several formats.

Installation

pip install kerosene

Kerosene depends on the fuel library, which will be installed automatically if needed.

Sometimes sudo is necessary for the pip command.

If you have keras, you should be able to run any of the examples in the examples folder with the most recent version of keras (0.3.0).

pip install keras
python ./examples/mnist.py

What's included

Currently, the six included datasets are wrappers around those provided by fuel. Each has a corresponding keras-based example in the examples directory, intended as a representative, high-performing use of that dataset.

Dataset          # records   Score (% accuracy unless noted)
binarized_mnist  70,000      0.2358 (loss)
cifar10          60,000      74.98
cifar100         60,000      53.14 (coarse) / 44.02 (fine)
iris             150         63.16
mnist            70,000      99.10
svhn2            >600,000    93.05 (train) / 96.40 (+extra)

Merge requests that make any of these examples more accurate, faster, or more clearly written are definitely welcome.

It is also possible to use fuel-download and fuel-convert on datasets that are not part of the fuel distribution, making them kerosene and fuel compatible. An example is lfw_fuel, which creates a fuel-compatible dataset.

Issues

The next planned improvements are:

  • support for compressed downloads (because the hdf5 files are large)
  • interface for iterating over a dataset without loading it into memory
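
One way such an iteration interface could look is a generator that yields fixed-size slices, so only one batch is materialized at a time. This is a speculative sketch, not a planned kerosene design: `iter_minibatches` is a hypothetical name, and NumPy arrays stand in here for any sliceable source, such as an on-disk HDF5 dataset opened with h5py.

```python
import numpy as np


def iter_minibatches(features, labels, batch_size):
    """Yield (features, labels) slices of at most `batch_size` rows."""
    n = len(features)
    for start in range(0, n, batch_size):
        stop = min(start + batch_size, n)
        yield features[start:stop], labels[start:stop]


# Synthetic stand-in data: 10 examples, 2 features each.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)
batch_sizes = [len(xb) for xb, yb in iter_minibatches(X, y, batch_size=4)]
```

The final batch is simply shorter when the number of examples is not a multiple of `batch_size`.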

Documentation is lacking, options are not easily discoverable, and the software design is rough. These areas can be improved if others find this library useful.

License

MIT

Feedback

Kerosene is currently an experiment in making datasets large and small easily sharable. Feedback welcome via github issues or email.
