skorch-dev / skorch
A scikit-learn compatible neural network library that wraps PyTorch
License: BSD 3-Clause "New" or "Revised" License
Other candidates for this are:
300MiB is a lot to download every time.
The example should include
Investigate which options are stored here and what we should do with this parameter.
Currently, the Scoring callback calculates the score on each batch and averages over all batches for the epoch score. For some scores, however, this leads to inaccurate results (e.g. AUC). It would be better to score on the whole validation set at once.
To achieve this, the callback could store all predictions from the batches and score on_epoch_finished. It might be better, though, if the NeuralNet did it, so that if we have more than one score that uses the predictions, the predictions don't need to be made twice. A sketch of the callback variant follows.
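A minimal sketch of the callback option, assuming validation predictions are cached per batch and scored once per epoch; the class name EpochScoring, the keyword arguments of on_batch_end, and net.history.record are assumptions for illustration, not existing API:

import numpy as np
from sklearn.metrics import roc_auc_score
from inferno.callbacks import Callback

class EpochScoring(Callback):
    def on_epoch_begin(self, net, **kwargs):
        self.y_trues_, self.y_preds_ = [], []

    def on_batch_end(self, net, y=None, y_pred=None, training=False, **kwargs):
        if not training:  # only cache validation batches
            self.y_trues_.append(np.asarray(y))
            self.y_preds_.append(np.asarray(y_pred))

    def on_epoch_end(self, net, **kwargs):
        y_true = np.concatenate(self.y_trues_)
        y_pred = np.concatenate(self.y_preds_)
        # score once on the whole validation set, e.g. AUC
        net.history.record('valid_auc', roc_auc_score(y_true, y_pred))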
Possibly in a Jupyter notebook.
r0.1.0
It would be helpful to have the ability to set parameters beyond module level (for sub-components of the module, for example):
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, **kwargs):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

class Encoder(nn.Module):
    def __init__(self, num_hidden=100):
        super().__init__()
        self.num_hidden = num_hidden
        self.lin = nn.Linear(1, num_hidden)

net = NeuralNet(
    module=Seq2Seq(encoder=AttentionEncoderRNN, decoder=DecoderRNN),
    module__encoder__num_hidden=23,
)
I would expect module.encoder.num_hidden to be set to 23. This should be robust with respect to the initialization of the sub-module; for example, if the encoder has elements that depend on the initialized value, those elements should be updated as well. In the given example, I would expect not only module.encoder.num_hidden to be updated to 23 but also that module.encoder.lin.out_features is updated (e.g. by re-initializing the whole module). A usage sketch follows.
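A hypothetical usage sketch of the desired behavior; set_params is the standard scikit-learn mechanism, but the nested re-initialization shown here is the requested feature, not existing behavior:

net.set_params(module__encoder__num_hidden=42)
# expected afterwards:
assert net.module_.encoder.num_hidden == 42
assert net.module_.encoder.lin.out_features == 42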
NeuralNet currently only initializes Dataset with X, y, use_cuda, but we may have more parameters. The user should be able to pass them the same way as for criterion etc. (i.e. via the prefixes_), as sketched below.
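A sketch of the desired API, analogous to the criterion__ prefixes; MyModule, MyDataset, and the length parameter are hypothetical names for illustration:

net = NeuralNet(
    module=MyModule,
    dataset=MyDataset,
    dataset__length=100,  # forwarded to MyDataset.__init__, like criterion__* params
)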
We have some ugly pieces of code that are necessary to make our code work with default_collate. They relate to the problem that default_collate picks out values one at a time, which makes it hard to work with 1-dim arrays (e.g. to cast them to cuda).
The corresponding pieces of code are:
_prepare_target_for_loss
NeuralNetRegressor
Current state of different things (both = class and object supported, class = only class supported). We should find a consistent scheme for this (either only initialized, always both, only class, ...).
This is a similar issue as with regression and 1-dimensional target data, namely that default_collate unpacks the contents of the array (int64) and then the .cuda() call fails on int64. It does not happen with 2-dimensional arrays, but for binary classification, we can't use an n x 1 array, since that conflicts with StratifiedKFold.
When the wrapper is initialized with unknown keys, the following AssertionError is raised:
Code/skorch/skorch/net.py in __init__(self, module, criterion, optimizer, lr, gradient_clip_value, gradient_clip_norm_type, max_epochs, batch_size, iterator_train, iterator_valid, dataset, train_split, callbacks, cold_start, verbose, use_cuda, **kwargs)
234 assert not hasattr(self, key)
235 key_has_prefix = any(key.startswith(p) for p in self.prefixes_)
--> 236 assert key.endswith('_') or key_has_prefix
237 vars(self).update(kwargs)
238
AssertionError:
To reproduce this, initialize a wrapper with iterator_test__batch_size=32 as a parameter. Since the correct key would be iterator_valid, this code fails with the aforementioned error.
There should at least be a detailed, helpful error message, for example along the lines of the sketch below.
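A minimal sketch of a friendlier check that could replace the bare assertions; the exact wording and the choice of TypeError are assumptions:

for key in kwargs:
    is_known = key.endswith('_') or any(
        key.startswith(p) for p in self.prefixes_)
    if not is_known:
        raise TypeError(
            "__init__() got an unexpected argument {!r}; valid prefixes "
            "are: {}".format(key, ', '.join(self.prefixes_)))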
There's currently no way to use a custom Sampler.
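A hedged sketch of what passing a custom sampler could look like, reusing the iterator_train prefix; whether the train iterator accepts a sampler argument this way is an assumption, not current behavior:

from torch.utils.data.sampler import WeightedRandomSampler

weights = [1.0] * 100  # one weight per training sample
net = NeuralNet(
    module=MyModule,  # hypothetical module
    iterator_train__sampler=WeightedRandomSampler(weights, num_samples=100),
)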
For example:
inferno.dataset.get_len([[(1,2),(2,3)],[(4,5)], [(7,8,9)]])
expected: 3
actual: ValueError: Dataset does not have consistent lengths.
Another example:
inferno.dataset.get_len([[(1,2),(2,3)],[(4,5)], [(7,8)]])
expected: 3
actual: 2 (length of tuples)
A workaround is to convert the list into a numpy array.
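For instance (assuming get_len measures an object array by its first dimension, which is the point of the workaround):

import numpy as np

data = [[(1, 2), (2, 3)], [(4, 5)], [(7, 8, 9)]]
arr = np.array(data, dtype=object)  # ragged, hence dtype=object
inferno.dataset.get_len(arr)  # 3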
There should be an easy way to add Lx regularization (e.g. L1 or L2); a possible workaround is sketched below.
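A hedged sketch of a workaround that overrides get_loss; the get_loss signature is an assumption, and an explicit API would still be preferable:

class RegularizedNet(NeuralNet):
    def __init__(self, *args, lambda1=0.01, **kwargs):
        super().__init__(*args, **kwargs)
        self.lambda1 = lambda1

    def get_loss(self, y_pred, y_true, X=None, training=False):
        loss = super().get_loss(y_pred, y_true, X=X, training=training)
        # add an L1 penalty over all module parameters
        loss = loss + self.lambda1 * sum(
            p.abs().sum() for p in self.module_.parameters())
        return loss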
No longer needed.
predict will currently take the argmax of dimension 1. This is very specific, despite the NeuralNet class being intended for generic use cases. I see 2 solutions; one is to have predict return the plain output of forward (thus making no assumption of what that is).
For example:
class Foo(inferno.callbacks.Callback):
def on_epoch_end(self, net, **kwargs):
pass
net = NeuralNet(..., callbacks=[Foo])  # note: the class, not an instance
Error:
Traceback (most recent call last):
File "train.py", line 189, in <module>
pl.fit(corpus.train[:1000], corpus.train[:1000])
File "/home/ottonemo/anaconda3/lib/python3.6/site-packages/sklearn/model_selection/_search.py", line 945, in fit
return self._fit(X, y, groups, ParameterGrid(self.param_grid))
File "/home/ottonemo/anaconda3/lib/python3.6/site-packages/sklearn/model_selection/_search.py", line 550, in _fit
base_estimator = clone(self.estimator)
File "/home/ottonemo/anaconda3/lib/python3.6/site-packages/sklearn/base.py", line 69, in clone
new_object_params[name] = clone(param, safe=False)
File "/home/ottonemo/anaconda3/lib/python3.6/site-packages/sklearn/base.py", line 57, in clone
return estimator_type([clone(e, safe=safe) for e in estimator])
File "/home/ottonemo/anaconda3/lib/python3.6/site-packages/sklearn/base.py", line 57, in <listcomp>
return estimator_type([clone(e, safe=safe) for e in estimator])
File "/home/ottonemo/anaconda3/lib/python3.6/site-packages/sklearn/base.py", line 67, in clone
new_object_params = estimator.get_params(deep=False)
TypeError: get_params() missing 1 required positional argument: 'self'
Probable cause: get_params recursively inspects all attributes of the wrapper instance, including self.callbacks, which still contains the uninitialized callbacks. It then calls get_params, which does not work as it is not a static method.
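On the user side, the immediate workaround is to pass an instance instead of the class, which gives clone something it can call get_params on:

net = NeuralNet(..., callbacks=[Foo()])  # instance instead of class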
Currently, we warn that CUDA is not supported, but the model still has use_cuda=True. This could possibly be fixed by making use_cuda a positional parameter instead of a keyword parameter.
get_iterator blows up at https://github.com/dnouri/inferno/blob/169e1a0/inferno/net.py#L310 in case an sklearn CV split is used and a 1-dimensional torch tensor is fed for X and y.
For example:
pl = GridSearchCV(trainer, params)
pl.fit(corpus.train, corpus.train)
It would be nice to have n_jobs=2 on a system with 2 GPUs, with each job dispatched to its own GPU.
Requirements:
- use torch.save to save the model data
Open questions:
- when should a checkpoint be written, and should that be configurable (e.g. key='valid_loss_best')?
- should the target file name support placeholders (e.g. {epoch} or {unique_run_id})? This might be useful when doing grid search, where runs would otherwise override each other's checkpoints.
I already computed scores in my loss function and now I want to record them so that I can print them per epoch. For example:
class MyNet(NeuralNet):
    def get_loss(self, y_pred, y_true, X=None, training=False):
        self.history.record_batch('foo', 42)  # record a precomputed score
        return super().get_loss(y_pred, y_true, X=X, training=training)

net = MyNet(callbacks=[
    inferno.callbacks.Scoring('foo'),
])
However, the Scoring callback calls its score method on each batch end and overwrites the value "foo", and there is no way to properly disable this behavior.
There is a workaround, though an ugly one:
def ignore_scorer(*_): raise KeyError()
net = MyNet(callbacks=[
inferno.callbacks.Scoring('foo', scoring=ignore_scorer),
])
We should cover this case.
Currently, color highlighting is homegrown. We could use a package instead, e.g. https://github.com/tartley/colorama.
Advantages:
Disadvantages:
We need a method (possibly on the wrapper class) to initialize the random state for all components that are concerned with sampling. These include, presumably, Python's random module, numpy, and torch (including CUDA).
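A hedged sketch of what such a method could look like; the name set_random_state and the exact set of seeded components are assumptions:

import random
import numpy as np
import torch

def set_random_state(self, seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if self.use_cuda:
        torch.cuda.manual_seed_all(seed)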
Currently, net._yield_callbacks discerns between PrintLog and other callbacks, with the effect that PrintLog is added last to the list of callbacks so it has access to all processed values added by other callbacks. Maybe we should generalize this by classifying callbacks into two groups: processing callbacks and output callbacks.
Output callbacks (identified by inheriting from an abstract subclass of Callback) are by default appended to the end of the callback list. This would pave the way for other output callbacks besides PrintLog, such as TensorBoard logging callbacks.
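A minimal sketch of this classification; the name OutputCallback and the assumption that self.callbacks holds plain callback objects are illustrative only:

class OutputCallback(Callback):
    """Marker base class for callbacks that only consume processed values."""

def _yield_callbacks(self):
    # yield processing callbacks first, output callbacks last
    deferred = []
    for cb in self.callbacks:
        if isinstance(cb, OutputCallback):
            deferred.append(cb)
        else:
            yield cb
    yield from deferred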
It should be possible to explicitly dispatch a model on multiple GPUs. This probably affects data operations and .cuda() calls.
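One possible mechanism is PyTorch's built-in DataParallel; whether and where the wrapper should apply it is an open question:

import torch.nn as nn

# wrap the initialized module so batches are split across both GPUs
net.module_ = nn.DataParallel(net.module_, device_ids=[0, 1])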
Currently, the module_ is not automatically moved to CUDA even if use_cuda=True. This is unexpected and should change.
This is what @ottonemo has to say about this:
I suppose so, yes. I was worried that it might interfere with settings that add parameters to the module after the point where we automatically apply .cuda() to the model, which would result in these parameters being excluded from the type conversion. One solution would be to do this conversion every time training starts (as is the case here) and mention in the documentation that in certain cases the user might have to call .cuda() on the model themselves.
In short, my suggestion is: implement self.module_.cuda() in on_train_begin of the base class and leave a comment somewhere (where?) in a docstring.
My suggestion: when a parameter is set on module_, the module needs to be re-initialized using the initialize_module method. We could move the .cuda() call to the end of this method, as sketched below.
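A sketch of that suggestion; _get_params_for is an assumed name for whatever collects the module__* parameters:

def initialize_module(self):
    module_params = self._get_params_for('module')  # assumed helper
    self.module_ = self.module(**module_params)
    if self.use_cuda:
        self.module_.cuda()  # move right after (re-)initialization
    return self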
I have another fear, though. What if the user wants part of the module and data to be on CUDA and part on the CPU? I guess we need to make sure that, as long as use_cuda=False, we don't call .cuda() anywhere ourselves.
There should be a CI service that checks new pull requests for errors.
NeuralNet.fit should use y=None by default to support arbitrary data loaders. NeuralNetClassifier.fit should require y.
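A sketch of the proposed signatures (assumed, not current code):

class NeuralNet:
    def fit(self, X, y=None, **fit_params):
        ...

class NeuralNetClassifier(NeuralNet):
    def fit(self, X, y, **fit_params):  # y is mandatory here
        return super().fit(X, y=y, **fit_params)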