snapshot-ensembles's Introduction

Snapshot Ensembles in Keras

Implementation of the paper Snapshot Ensembles: Train 1, Get M for Free in Keras 1.1.1

Explanation

Snapshot Ensembling is a method to obtain multiple neural networks that can be ensembled at no additional training cost. This is achieved by letting a single neural network converge to several local minima along its optimization path and saving the model parameters at certain epochs; the saved weights are the "snapshots" of the model.

The repeated rapid convergence is realized by using cosine annealing cycles as the learning rate schedule, which can be described by:

alpha(t) = (alpha_zero / 2) * (cos(pi * mod(t - 1, ceil(T / M)) / ceil(T / M)) + 1)

where alpha_zero is the initial learning rate, t is the current epoch (mini-batch iteration in the paper), T is the total number of epochs and M is the number of annealing cycles (snapshots).

This scheduler produces a learning rate similar to the image below. Note that the learning rate never actually reaches 0; it just gets very close to it (~0.0005):
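As a reference, here is a minimal sketch of that schedule (the function name and signature are illustrative, not the repository's API; the paper counts mini-batch iterations, while this implementation counts epochs):

import math

def snapshot_lr(t, T, M, alpha_zero):
    """Cosine-annealing schedule from the Snapshot Ensembles paper.

    t          : current epoch (or iteration), starting at 0
    T          : total number of epochs (or iterations)
    M          : number of annealing cycles / snapshots
    alpha_zero : initial learning rate
    """
    cycle_length = math.ceil(T / float(M))
    cos_inner = math.pi * (t % cycle_length) / cycle_length  # position within the current cycle
    return (alpha_zero / 2.0) * (math.cos(cos_inner) + 1.0)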

The theory behind using a learning rate schedule that oscillates between such extreme values (0.1 down to ~5e-4, M times) is that there exist multiple local minima when training a model. Continually reducing the learning rate can leave the model stuck in a less than optimal local minimum. Therefore, a very large learning rate is periodically re-applied to escape the current local minimum and attempt to find another, possibly better one.

This is illustrated in the following figure:

Figure 1: Left: Illustration of SGD optimization with a typical learning rate schedule. The model converges to a minimum at the end of training. Right: Illustration of Snapshot Ensembling optimization. The model undergoes several learning rate annealing cycles, converging to and escaping from multiple local minima. We take a snapshot at each minimum for test time ensembling.

Usage

The paper uses several models such as ResNet-101, Wide Residual Networks, DenseNet-40 and DenseNet-100. While DenseNets are the highest-performing models in the paper, they are too large and take extremely long to train. Therefore, the currently trained model uses the Wide Residual Network (WRN-16-4) setting. This model performs worse than the WRN-34-4 version but trains several times faster.

The technique is simple to implement in Keras using a custom callback. These callbacks can be built with the SnapshotCallbackBuilder class in snapshot.py. Other models can simply use this callback builder to train in a similar manner.

To use snapshot ensembles in other models:

from snapshot import SnapshotCallbackBuilder

M = 5 # number of snapshots
nb_epoch = T = 200 # number of epochs
alpha_zero = 0.1 # initial learning rate
model_prefix = 'Model_'

snapshot = SnapshotCallbackBuilder(T, M, alpha_zero) 
...
model = Sequential() OR model = Model(ip, output) # Some model that has been compiled

model.fit(trainX, trainY, callbacks=snapshot.get_callbacks(model_prefix=model_prefix))

To train WRN or DenseNet models on CIFAR 10 or 100 (or use the pre-trained models):

  1. Download the 6 WRN-16-4 weights that are provided in the Release tab of the project and place them in the weights directory for CIFAR 10 or 100
  2. Run the train_cifar_10.py script to train the WRN-16-4 model on CIFAR-10 dataset (not required since weights are provided)
  3. Run the predict_cifar_10.py script to make an ensemble prediction.

Note the difference between using only the predictions of the single best model (92.70 % accuracy) and the weighted ensemble of the snapshots (92.84 % accuracy). The difference is minor, but still an improvement.

The improvement is minor because this model is far smaller than the WRN-34-4 model and is not trained on the CIFAR-100 or Tiny ImageNet datasets. According to the paper, models trained on more complex datasets such as CIFAR-100 and Tiny ImageNet obtain a greater boost from the ensemble.
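For illustration, a minimal sketch of how such a weighted ensemble prediction can be formed from the saved snapshot weights (reusing model, M and model_prefix from the usage snippet above; the file-name pattern and testX are placeholders, and the predict scripts already do this for you):

import numpy as np

# Hypothetical snapshot weight files written during training (adjust to the actual file names)
snapshot_files = ['weights/%s%d.h5' % (model_prefix, i + 1) for i in range(M)]

preds = []
for fn in snapshot_files:
    model.load_weights(fn)                            # same architecture, different snapshot
    preds.append(model.predict(testX, batch_size=128))  # testX: your test inputs
preds = np.array(preds)                               # shape: (M, nb_samples, nb_classes)

# Equal weights (as in the paper), or the optimized weights found by predict_cifar_10.py
weights = np.ones(M) / float(M)
ensemble_pred = np.tensordot(weights, preds, axes=1)  # weighted average over the M snapshots
yPred = np.argmax(ensemble_pred, axis=1)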

Parameters

Some parameters for WRN models from the paper:

  • M = 5
  • nb_epoch = 200
  • alpha_zero = 0.1
  • wrn_N = 2 (WRN-16-4) or 4 (WRN-28-8)
  • wrn_k = 4 (WRN-16-4) or 8 (WRN-28-8)

Some parameters for DenseNet models from the paper:

  • M = 6
  • nb_epoch = 300
  • alpha_zero = 0.2
  • dn_depth = 40 (DenseNet-40-12) or 100 (DenseNet-100-24)
  • dn_growth_rate = 12 (DenseNet-40-12) or 24 (DenseNet-100-24)

train_*.py

--M              : Number of snapshots that will be taken. Optimal range is 4-8. Default is 5
--nb_epoch       : Number of epochs to train the network. Default is 200
--alpha_zero     : Initial Learning Rate. Usually 0.1 or 0.2. Default is 0.1

--model          : Type of model to train. Can be "wrn" for Wide ResNets or "dn" for DenseNet

--wrn_N          : Number of WRN blocks. Computed as N = (n - 4) / 6. Default is 2.
--wrn_k          : Width factor of WRN. Default is 12.

--dn_depth       : Depth of DenseNet. Default is 40.
--dn_growth_rate : Growth rate of DenseNet. Default is 12.

predict_*.py

--optimize       : Flag to optimize the ensemble weights. 
                   Default is 0 (Predict using optimized weights).
                   Set to 1 to optimize ensemble weights (test for num_tests times).
                   Set to -1 to predict using equal weights for all models (As given in the paper).
               
--num_tests      : Number of times the optimizations will be performed. Default is 20

--model          : Type of model to train. Can be "wrn" for Wide ResNets or "dn" for DenseNet

--wrn_N          : Number of WRN blocks. Computed as N = (n - 4) / 6. Default is 2.
--wrn_k          : Width factor of WRN. Default is 12.

--dn_depth       : Depth of DenseNet. Default is 40.
--dn_growth_rate : Growth rate of DenseNet. Default is 12.

Performance

  • Single Best: Describes the performance of the single best snapshot model.
  • Without Optimization: Describes the performance of the ensemble model with equal weights for all snapshots.
  • With Optimization: Describes the performance of the ensemble model with optimized weights found via minimization of the log-loss scores (a sketch of this idea follows below).
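As an illustration of the weight optimization, here is a minimal sketch, assuming preds holds the stacked per-snapshot predictions (shape (M, nb_samples, nb_classes), as in the prediction sketch above) and yTrue the integer test labels; the names and optimizer settings are illustrative, not a copy of the predict scripts:

import numpy as np
from scipy.optimize import minimize
from sklearn.metrics import log_loss

def ensemble_log_loss(weights, preds, y_true):
    """Log-loss of the weighted average of the per-snapshot predictions."""
    weights = np.abs(weights) / np.sum(np.abs(weights))   # keep weights positive, summing to 1
    blended = np.tensordot(weights, preds, axes=1)
    return log_loss(y_true, blended)

num_tests = 20                                            # number of random restarts (cf. --num_tests)
best_score, best_weights = np.inf, None
for _ in range(num_tests):
    x0 = np.random.dirichlet(np.ones(preds.shape[0]))     # random starting weights summing to 1
    result = minimize(ensemble_log_loss, x0, args=(preds, yTrue), method='SLSQP')
    if result.fun < best_score:
        best_score, best_weights = result.fun, result.x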

Requirements

  • Keras
  • Theano (tested) / TensorFlow (not tested; weights are not available, but they can be converted)
  • scipy
  • h5py
  • sklearn

snapshot-ensembles's Issues

Quick Refresher

@titu1994

For a quick refresher: does the cosine scheduler start from the initial lr (alpha_zero) and increase the learning rate, or does it start from some high learning rate and go down to alpha_zero?

EDIT: I think it is the latter. It starts at alpha_zero, drops, and returns back to the original alpha. So if alpha_zero is .01, the learning rate would oscillate between .01 and 0.
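To make this concrete, plugging the README's defaults (alpha_zero = 0.1, T = 200, M = 5, i.e. a 40-epoch cycle) into the schedule above gives roughly:

import math

# Start of a cycle: cos(0) = 1, so the learning rate resets to alpha_zero
(0.1 / 2) * (math.cos(math.pi * 0 / 40) + 1)    # = 0.1
# Last epoch of a cycle: cos(pi * 39/40) is close to -1, so the rate is close to 0
(0.1 / 2) * (math.cos(math.pi * 39 / 40) + 1)   # ~ 0.00015

So each cycle starts at alpha_zero, anneals towards (but never exactly to) zero, and then jumps back to alpha_zero at the start of the next cycle.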

How to do ensemble prediction?

I have a model like this

#create model
model = Sequential()
model.add(Dense(5, input_dim=5, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(3, activation='sigmoid'))
model.compile(loss='mse', optimizer='adam', metrics=['accuracy'])

I have used this pseudo code to train and create 5 ensemble models.

from snapshot import SnapshotCallbackBuilder

M = 5 # number of snapshots
nb_epoch = T = 200 # number of epochs
alpha_zero = 0.1 # initial learning rate
model_prefix = 'Model_'

snapshot = SnapshotCallbackBuilder(T, M, alpha_zero) 
model.fit(trainX, trainY, batch_size=50,callbacks=snapshot.get_callbacks(model_prefix=model_prefix))

It created 5 external HDF5 files. Now how should I load and use them for prediction? Can you give a simple example?
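A minimal sketch of what this could look like, along the lines of the ensemble-prediction sketch in the Usage section above (the file names are placeholders for whatever the callback actually wrote; the same compiled architecture is reused for every snapshot):

import numpy as np

snapshot_files = ['Model_1.h5', 'Model_2.h5', 'Model_3.h5', 'Model_4.h5', 'Model_5.h5']  # placeholders

preds = []
for fn in snapshot_files:
    model.load_weights(fn)              # `model` is the same architecture used for training
    preds.append(model.predict(testX))  # testX: your test inputs

ensemble_pred = np.mean(preds, axis=0)  # equal-weight average of the 5 snapshots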

Show Example with Sequential or Functional API

Can you show an example with Sequential or Functional API? The argparse syntax is very hard to read. This looks amazing! Thanks for the contribution.

Also, in your code it looks like you do the learning rate update after each epoch, but the paper authors do it after each "training iteration". I am not sure what distinction they are making here.

About the performance of W-16-4

Hi, thanks for providing the code. I trained with basically the default settings, but the accuracy on CIFAR-100 is only about 68%. Is there a problem with this setup? I also downloaded the TensorFlow-version checkpoint you provide, but its performance is 1%. Can you give me some suggestions about these problems?

Getting Error

When I run this code:

snapshot = SnapshotCallbackBuilder(T=nb_epoch, M=5, alpha_zero=0.01)

model.fit(X_train[:5000], Y_train[:5000], batch_size=batch_size, nb_epoch=nb_epoch, verbose=1,
          callbacks=[snapshot.get_callbacks()],
          validation_data=(X_test, Y_test))

I get this error:

AttributeError Traceback (most recent call last)
in ()
31 model.fit(X_train[:5000], Y_train[:5000], batch_size=batch_size, nb_epoch=nb_epoch, verbose=1,
32 callbacks=[snapshot.get_callbacks()],
---> 33 validation_data=(X_test, Y_test))
34 score = model.evaluate(X_test, Y_test, show_accuracy=True, verbose=0)
35 print('Test score:', score[0])

/python2.7/site-packages/keras/models.pyc in fit(self, x, y, batch_size, nb_epoch, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, **kwargs)
595 shuffle=shuffle,
596 class_weight=class_weight,
--> 597 sample_weight=sample_weight)
598
599 def evaluate(self, x, y, batch_size=32, verbose=1,

/python2.7/site-packages/keras/engine/training.pyc in fit(self, x, y, batch_size, nb_epoch, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight)
1105 verbose=verbose, callbacks=callbacks,
1106 val_f=val_f, val_ins=val_ins, shuffle=shuffle,
-> 1107 callback_metrics=callback_metrics)
1108
1109 def evaluate(self, x, y, batch_size=32, verbose=1, sample_weight=None):

/python2.7/site-packages/keras/engine/training.pyc in _fit_loop(self, f, ins, out_labels, batch_size, nb_epoch, verbose, callbacks, val_f, val_ins, shuffle, callback_metrics)
785 callback_model = self
786
--> 787 callbacks._set_model(callback_model)
788 callbacks._set_params({
789 'batch_size': batch_size,

/python2.7/site-packages/keras/callbacks.pyc in _set_model(self, model)
27 def _set_model(self, model):
28 for callback in self.callbacks:
---> 29 callback._set_model(model)
30
31 def on_epoch_begin(self, epoch, logs={}):

AttributeError: 'list' object has no attribute '_set_model'

AttributeError: 'list' object has no attribute 'set_model'

Thanks for the implementation,
I'm getting the AttributeError: 'list' object has no attribute 'set_model' error while using this callback. Any comments?
I think it is because the return of SnapshotCallbackBuilder is a list of callbacks, and I'm also using other callbacks.
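If the builder does return a list (as the usage snippet in the README suggests), a hedged fix is to concatenate the lists instead of nesting them, roughly:

other_callbacks = [early_stopping]   # whatever additional callbacks are in use (placeholder)
callbacks = snapshot.get_callbacks(model_prefix=model_prefix) + other_callbacks
model.fit(trainX, trainY, callbacks=callbacks)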

Update per batch, not per epoch

First of all, thank you very much for sharing your implementation of Snapshot Ensembling. However, you should not use callbacks.LearningRateScheduler to update the learning rate.

As mentioned in the original paper under the equation 2, "we update the learning rate at each iteration" (i.e batch) "rather than at every epoch" to improve the convergence of short cycles. But, callbacks.LearningRateScheduler in Keras, with the source code here, updates the learning rate at 'on_epoch_begin'.

After this fix, your scores should improve further.
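For illustration, a minimal sketch of a per-batch variant (not part of this repository; it relies only on the standard Keras Callback hooks, and T_iterations is assumed to be the total number of mini-batch updates, i.e. epochs times batches per epoch):

import numpy as np
from keras import backend as K
from keras.callbacks import Callback

class SnapshotLRPerBatch(Callback):
    """Cosine-annealing learning rate updated at every mini-batch (Eqn. 2 of the paper)."""

    def __init__(self, T_iterations, M, alpha_zero):
        super(SnapshotLRPerBatch, self).__init__()
        self.cycle_length = int(np.ceil(T_iterations / float(M)))  # iterations per annealing cycle
        self.alpha_zero = alpha_zero
        self.iteration = 0

    def on_batch_begin(self, batch, logs=None):
        cos_inner = np.pi * (self.iteration % self.cycle_length) / self.cycle_length
        lr = (self.alpha_zero / 2.0) * (np.cos(cos_inner) + 1.0)
        K.set_value(self.model.optimizer.lr, lr)
        self.iteration += 1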

Can I use it with ADAM?

Hi. I have some questions before I start using snapshot ensembles.
TL;DR:

  1. Can I use snapshot ensembles with an adaptive learning rate optimizer (e.g. ADAM)?
  2. Can I use it with a very slow optimizer such as SGD?
  3. Which optimizer and parameters work best in your experience?
  4. If I find optimal weights for my partial problem by using this technique, can I use them as an initial point to find the optimum for the full problem?
  5. Does Snapshot Ensembling affect training time?

Situation:
I'm training a network based on FCN-DenseNet (the problem is similar to image segmentation, but with continuous labels). Since I have quite a large dataset (over 350 GB), it is almost impossible to find proper parameters and a model with this setting (almost 6 hours to train one epoch). Instead of using the full data, I am training with a small part of it (around 8-9 GB).

The problem is that if I use ADAM, the model converges really fast whether I train with the full or the partial data: within 1-2 epochs on the full data and within 10-30 epochs on the small data.
I'm guessing ADAM just puts the model into the closest local optimum.
If I initialize the weights differently at every epoch, maybe I can find several local optima with ADAM in a really short time and get a performance boost by using this technique.

But the problem is that I don't know whether I can use it with ADAM. Is it possible?

Jupyter notebook error


OSError Traceback (most recent call last)
in ()
1 preds = []
2 for fn in models_filenames:
----> 3 model.load_weights(fn)
4 yPreds = model.predict(testX, batch_size=128)
5 preds.append(yPreds)

/home/andrewcz/miniconda3/lib/python3.5/site-packages/keras/engine/topology.py in load_weights(self, filepath, by_name)
2487 if h5py is None:
2488 raise ImportError('load_weights requires h5py.')
-> 2489 f = h5py.File(filepath, mode='r')
2490 if 'layer_names' not in f.attrs and 'model_weights' in f:
2491 f = f['model_weights']

/home/andrewcz/miniconda3/lib/python3.5/site-packages/h5py/_hl/files.py in __init__(self, name, mode, driver, libver, userblock_size, swmr, **kwds)
270
271 fapl = make_fapl(driver, libver, **kwds)
--> 272 fid = make_fid(name, mode, userblock_size, fapl, swmr=swmr)
273
274 if swmr_support:

/home/andrewcz/miniconda3/lib/python3.5/site-packages/h5py/_hl/files.py in make_fid(name, mode, userblock_size, fapl, fcpl, swmr)
90 if swmr and swmr_support:
91 flags |= h5f.ACC_SWMR_READ
---> 92 fid = h5f.open(name, flags, fapl=fapl)
93 elif mode == 'r+':
94 fid = h5f.open(name, h5f.ACC_RDWR, fapl=fapl)

h5py/_objects.pyx in h5py._objects.with_phil.wrapper (/tmp/pip-at6d2npe-build/h5py/_objects.c:2684)()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper (/tmp/pip-at6d2npe-build/h5py/_objects.c:2642)()

h5py/h5f.pyx in h5py.h5f.open (/tmp/pip-at6d2npe-build/h5py/h5f.c:1930)()

OSError: Unable to open file (Unable to open file: name = 'weights/wrn-cifar100-16-4-best.h5', errno = 2, error message = 'no such file or directory', flags = 0, o_flags = 0)

load a weight file containing 30 layers into a model with 33 layers.

You are trying to load a weight file containing 30 layers into a model with 33 layers.
I don't understand how to change my layers.
Can I change my model, or can you change the weights file?

This is my model, cloned from yours.
Can you help me change this model?

def create_wide_residual_network(input, nb_classes=100, N=2, k=1, dropout=0.0, verbose=1):
    """
    Creates a Wide Residual Network with specified parameters

    :param input: Input Keras object
    :param nb_classes: Number of output classes
    :param N: Depth of the network. Compute N = (n - 4) / 6.
              Example : For a depth of 16, n = 16, N = (16 - 4) / 6 = 2
              Example2: For a depth of 28, n = 28, N = (28 - 4) / 6 = 4
              Example3: For a depth of 40, n = 40, N = (40 - 4) / 6 = 6
    :param k: Width of the network.
    :param dropout: Adds dropout if value is greater than 0.0
    :param verbose: Debug info to describe created WRN
    :return:
    """
    x = initial_conv(input)
    nb_conv = 4

    for i in range(N):
        x = conv1_block(x, k, dropout)
        nb_conv += 2

    x = MaxPooling2D(pool_size=(2, 2), dim_ordering="th")(x)

    for i in range(N):
        x = conv2_block(x, k, dropout)
        nb_conv += 2

    x = MaxPooling2D(pool_size=(2, 2), dim_ordering="th")(x)

    for i in range(N):
        x = conv3_block(x, k, dropout)
        nb_conv += 2

    x = AveragePooling2D((3, 8))(x)
    x = Flatten()(x)

    x = Dense(nb_classes, init="normal", activation='softmax')(x)

    if verbose: print("Wide Residual Network-%d-%d created." % (nb_conv, k))
    return x

how to predict the ground truth image with weighted ensemble?

Dear Titu,
Thank you for the valuable source code that you have provided. Can you please guide me on how to predict the ground-truth image with the weighted ensemble, given the predicted accuracy? I cannot figure out how to predict the ground-truth images with the weighted ensemble. Thanks in advance. For reference, the ground-truth image prediction code is given below. Please guide me, thanks.
Regards

calculate the predicted image

outputs = np.zeros((height, width))
for i in range(height):
    for j in range(width):
        target = int(y[i, j])
        if target == 0:
            continue
        else:
            image_patch = Patch(X, i, j)
            X_test_image = image_patch.reshape(1, image_patch.shape[0], image_patch.shape[1],
                                               image_patch.shape[2], 1).astype('float32')
            prediction = yPred.predict(X_test_image)
            prediction = np.argmax(prediction, axis=1)
            outputs[i][j] = prediction + 1

How well does this work in practice?

Have you used this at all in your own work and found it effective?

I've got several long training runs coming up, and it would be helpful to get the improvements the authors describe, plus the decent intermediate results while I wait.

Alpha Zero

@titu1994

Does alpha_zero in the callback overwrite the LR in the optimizer once you compile? If the LR of my optimizer is set to .1 and alpha_zero is set to .0001, how is this handled? Right now I am changing LRs in two places, just wondering if that is necessary.

model accuracy

In your train_*.py, at the end, your code is like this:
yPreds = model.predict(testX)
yPred = np.argmax(yPreds, axis=1)
yTrue = testY

accuracy = metrics.accuracy_score(yTrue, yPred) * 100
error = 100 - accuracy
print("Accuracy : ", accuracy)
print("Error : ", error)

  1. I don't understand why you used np.argmax(yPreds, axis=1). I think you should use
    yPreds[yPreds >= 0.5] = 1
    yPreds[yPreds < 0.5] = 0
    yPred = yPreds

  2. When you used model.predict, which weights did you use? the last one or the best one?

Many thanks
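For context, the two decision rules address different output types; a hedged comparison with an illustrative array:

import numpy as np

yPreds = np.array([[0.1, 0.7, 0.2],
                   [0.6, 0.3, 0.1]])

# Multi-class output (softmax, exactly one label per sample): pick the most probable class
np.argmax(yPreds, axis=1)          # -> array([1, 0])

# Multi-label output (independent sigmoids): threshold each output separately
(yPreds >= 0.5).astype(int)        # -> array([[0, 1, 0], [1, 0, 0]])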
