snapshot-ensembles's Introduction

Snapshot Ensembles in Keras

Implementation of the paper Snapshot Ensembles: Train 1, Get M for Free in Keras 1.1.1

Explanation

Snapshot Ensembling is a method to obtain multiple neural networks that can be ensembled at no additional training cost. This is achieved by letting a single neural network converge to several local minima along its optimization path and saving the model parameters at certain epochs; the saved weights are the "snapshots" of the model.

The repeated rapid convergence is realized by using cosine annealing cycles as the learning rate schedule, which can be described by:

alpha(t) = (alpha_zero / 2) * (cos(pi * mod(t - 1, ceil(T / M)) / ceil(T / M)) + 1)

where alpha_zero is the initial learning rate, t is the current epoch (mini-batch iteration in the paper), T is the total number of epochs and M is the number of annealing cycles (snapshots).

This scheduler produces a learning rate similar to the image below. Note that the learning rate never actually reaches 0; it just gets very close to it (~0.0005):
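As a reference, here is a minimal sketch of that schedule (the function name and signature are illustrative, not the repository's API; the paper counts mini-batch iterations, while this implementation counts epochs):

import math

def snapshot_lr(t, T, M, alpha_zero):
    """Cosine-annealing schedule from the Snapshot Ensembles paper.

    t          : current epoch (or iteration), starting at 0
    T          : total number of epochs (or iterations)
    M          : number of annealing cycles / snapshots
    alpha_zero : initial learning rate
    """
    cycle_length = math.ceil(T / float(M))
    cos_inner = math.pi * (t % cycle_length) / cycle_length  # position within the current cycle
    return (alpha_zero / 2.0) * (math.cos(cos_inner) + 1.0)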

The theory behind using a learning rate schedule that oscillates between such extreme values (0.1 down to ~5e-4, M times) is that there exist multiple local minima when training a model. Continually reducing the learning rate can leave the model stuck in a less than optimal local minimum. Therefore, a very large learning rate is periodically re-applied to escape the current local minimum and attempt to find another, possibly better one.

This is illustrated in the following figure:

Figure 1: Left: Illustration of SGD optimization with a typical learning rate schedule. The model converges to a minimum at the end of training. Right: Illustration of Snapshot Ensembling optimization. The model undergoes several learning rate annealing cycles, converging to and escaping from multiple local minima. We take a snapshot at each minimum for test time ensembling.

Usage

The paper uses several models such as ResNet-101, Wide Residual Networks, DenseNet-40 and DenseNet-100. While DenseNets are the highest-performing models in the paper, they are too large and take extremely long to train. Therefore, the currently trained model uses the Wide Residual Network (WRN-16-4) setting. This model performs worse than the WRN-34-4 version but trains several times faster.

The technique is simple to implement in Keras using a custom callback. These callbacks can be built with the SnapshotCallbackBuilder class in snapshot.py. Other models can simply use this callback builder to train in a similar manner.

To use snapshot ensembles in other models:

from snapshot import SnapshotCallbackBuilder

M = 5 # number of snapshots
nb_epoch = T = 200 # number of epochs
alpha_zero = 0.1 # initial learning rate
model_prefix = 'Model_'

snapshot = SnapshotCallbackBuilder(T, M, alpha_zero) 
...
model = Sequential() OR model = Model(ip, output) # Some model that has been compiled

model.fit(trainX, trainY, callbacks=snapshot.get_callbacks(model_prefix=model_prefix))

To train WRN or DenseNet models on CIFAR 10 or 100 (or use the pre-trained models):

  1. Download the 6 WRN-16-4 weights that are provided in the Release tab of the project and place them in the weights directory for CIFAR 10 or 100
  2. Run the train_cifar_10.py script to train the WRN-16-4 model on CIFAR-10 dataset (not required since weights are provided)
  3. Run the predict_cifar_10.py script to make an ensemble prediction.

Note the difference between using only the predictions of the single best model (92.70 % accuracy) and the weighted ensemble of the snapshots (92.84 % accuracy). The difference is minor, but still an improvement.

The improvement is minor because this model is far smaller than the WRN-34-4 model and is not trained on the CIFAR-100 or Tiny ImageNet datasets. According to the paper, models trained on more complex datasets such as CIFAR-100 and Tiny ImageNet obtain a greater boost from the ensemble.
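For illustration, a minimal sketch of how such a weighted ensemble prediction can be formed from the saved snapshot weights (reusing model, M and model_prefix from the usage snippet above; the file-name pattern and testX are placeholders, and the predict scripts already do this for you):

import numpy as np

# Hypothetical snapshot weight files written during training (adjust to the actual file names)
snapshot_files = ['weights/%s%d.h5' % (model_prefix, i + 1) for i in range(M)]

preds = []
for fn in snapshot_files:
    model.load_weights(fn)                            # same architecture, different snapshot
    preds.append(model.predict(testX, batch_size=128))  # testX: your test inputs
preds = np.array(preds)                               # shape: (M, nb_samples, nb_classes)

# Equal weights (as in the paper), or the optimized weights found by predict_cifar_10.py
weights = np.ones(M) / float(M)
ensemble_pred = np.tensordot(weights, preds, axes=1)  # weighted average over the M snapshots
yPred = np.argmax(ensemble_pred, axis=1)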

Parameters

Some parameters for WRN models from the paper:

  • M = 5
  • nb_epoch = 200
  • alpha_zero = 0.1
  • wrn_N = 2 (WRN-16-4) or 4 (WRN-28-8)
  • wrn_k = 4 (WRN-16-4) or 8 (WRN-28-8)

Some parameters for DenseNet models from the paper:

  • M = 6
  • nb_epoch = 300
  • alpha_zero = 0.2
  • dn_depth = 40 (DenseNet-40-12) or 100 (DenseNet-100-24)
  • dn_growth_rate = 12 (DenseNet-40-12) or 24 (DenseNet-100-24)

train_*.py

--M              : Number of snapshots that will be taken. Optimal range is 4-8. Default is 5
--nb_epoch       : Number of epochs to train the network. Default is 200
--alpha_zero     : Initial Learning Rate. Usually 0.1 or 0.2. Default is 0.1

--model          : Type of model to train. Can be "wrn" for Wide ResNets or "dn" for DenseNet

--wrn_N          : Number of WRN blocks. Computed as N = (n - 4) / 6. Default is 2.
--wrn_k          : Width factor of WRN. Default is 12.

--dn_depth       : Depth of DenseNet. Default is 40.
--dn_growth_rate : Growth rate of DenseNet. Default is 12.

predict_*.py

--optimize       : Flag to optimize the ensemble weights. 
                   Default is 0 (Predict using optimized weights).
                   Set to 1 to optimize ensemble weights (test for num_tests times).
                   Set to -1 to predict using equal weights for all models (As given in the paper).
               
--num_tests      : Number of times the optimizations will be performed. Default is 20

--model          : Type of model to train. Can be "wrn" for Wide ResNets or "dn" for DenseNet

--wrn_N          : Number of WRN blocks. Computed as N = (n - 4) / 6. Default is 2.
--wrn_k          : Width factor of WRN. Default is 12.

--dn_depth       : Depth of DenseNet. Default is 40.
--dn_growth_rate : Growth rate of DenseNet. Default is 12.

Performance

  • Single Best: Describes the performance of the single best snapshot model.
  • Without Optimization: Describes the performance of the ensemble model with equal weights for all snapshots.
  • With Optimization: Describes the performance of the ensemble model with optimized weights found via minimization of the log-loss scores (a sketch of this idea follows below).
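As an illustration of the weight optimization, here is a minimal sketch, assuming preds holds the stacked per-snapshot predictions (shape (M, nb_samples, nb_classes), as in the prediction sketch above) and yTrue the integer test labels; the names and optimizer settings are illustrative, not a copy of the predict scripts:

import numpy as np
from scipy.optimize import minimize
from sklearn.metrics import log_loss

def ensemble_log_loss(weights, preds, y_true):
    """Log-loss of the weighted average of the per-snapshot predictions."""
    weights = np.abs(weights) / np.sum(np.abs(weights))   # keep weights positive, summing to 1
    blended = np.tensordot(weights, preds, axes=1)
    return log_loss(y_true, blended)

num_tests = 20                                            # number of random restarts (cf. --num_tests)
best_score, best_weights = np.inf, None
for _ in range(num_tests):
    x0 = np.random.dirichlet(np.ones(preds.shape[0]))     # random starting weights summing to 1
    result = minimize(ensemble_log_loss, x0, args=(preds, yTrue), method='SLSQP')
    if result.fun < best_score:
        best_score, best_weights = result.fun, result.x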

Requirements

  • Keras
  • Theano (tested) / TensorFlow (not tested; weights are not available, but they can be converted)
  • scipy
  • h5py
  • sklearn

snapshot-ensembles's Issues

Quick Refresher

@titu1994

For a quick refresher: does the cosine scheduler start from the initial lr (alpha_zero) and increase the learning rate, or does it start from some high learning rate and go down to alpha_zero?

EDIT: I think it is the latter. It starts at alpha_zero, drops, and returns back to the original alpha. So if alpha_zero is .01, the learning rate would oscillate between .01 and 0.
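To make this concrete, plugging the README's defaults (alpha_zero = 0.1, T = 200, M = 5, i.e. a 40-epoch cycle) into the schedule above gives roughly:

import math

# Start of a cycle: cos(0) = 1, so the learning rate resets to alpha_zero
(0.1 / 2) * (math.cos(math.pi * 0 / 40) + 1)    # = 0.1
# Last epoch of a cycle: cos(pi * 39/40) is close to -1, so the rate is close to 0
(0.1 / 2) * (math.cos(math.pi * 39 / 40) + 1)   # ~ 0.00015

So each cycle starts at alpha_zero, anneals towards (but never exactly to) zero, and then jumps back to alpha_zero at the start of the next cycle.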

How to do ensemble prediction?

I have a model like this

#create model
model = Sequential()
model.add(Dense(5, input_dim=5, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(3, activation='sigmoid'))
model.compile(loss='mse', optimizer='adam', metrics=['accuracy'])

I have used this pseudo code to train and create 5 ensemble models.

from snapshot import SnapshotCallbackBuilder

M = 5 # number of snapshots
nb_epoch = T = 200 # number of epochs
alpha_zero = 0.1 # initial learning rate
model_prefix = 'Model_'

snapshot = SnapshotCallbackBuilder(T, M, alpha_zero) 
model.fit(trainX, trainY, batch_size=50,callbacks=snapshot.get_callbacks(model_prefix=model_prefix))

It created 5 external HDF5 files. Now how should I load and use them for prediction? Can you give a simple example?
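A minimal sketch of what this could look like, along the lines of the ensemble-prediction sketch in the Usage section above (the file names are placeholders for whatever the callback actually wrote; the same compiled architecture is reused for every snapshot):

import numpy as np

snapshot_files = ['Model_1.h5', 'Model_2.h5', 'Model_3.h5', 'Model_4.h5', 'Model_5.h5']  # placeholders

preds = []
for fn in snapshot_files:
    model.load_weights(fn)              # `model` is the same architecture used for training
    preds.append(model.predict(testX))  # testX: your test inputs

ensemble_pred = np.mean(preds, axis=0)  # equal-weight average of the 5 snapshots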

Show Example with Sequential or Functional API

Can you show an example with Sequential or Functional API? The argparse syntax is very hard to read. This looks amazing! Thanks for the contribution.

Also, in your code it looks like you do the learning rate update after each epoch, but the paper authors do it after each "training iteration". I am not sure what distinction they are making here.

About the performance of W-16-4

Hi, thanks for providing the code. I trained with basically the default settings, but the accuracy on CIFAR-100 is only about 68%. Is there a problem with this setup? I also downloaded the TensorFlow-version checkpoint you provide, but its performance is 1%. Can you give me some suggestions about these problems?

Getting Error

When I run this code:

snapshot = SnapshotCallbackBuilder(T=nb_epoch, M=5, alpha_zero=0.01)

model.fit(X_train[:5000], Y_train[:5000], batch_size=batch_size, nb_epoch=nb_epoch, verbose=1,
          callbacks=[snapshot.get_callbacks()],
          validation_data=(X_test, Y_test))

I get this error:

AttributeError Traceback (most recent call last)
in ()
31 model.fit(X_train[:5000], Y_train[:5000], batch_size=batch_size, nb_epoch=nb_epoch, verbose=1,
32 callbacks=[snapshot.get_callbacks()],
---> 33 validation_data=(X_test, Y_test))
34 score = model.evaluate(X_test, Y_test, show_accuracy=True, verbose=0)
35 print('Test score:', score[0])

/python2.7/site-packages/keras/models.pyc in fit(self, x, y, batch_size, nb_epoch, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, **kwargs)
595 shuffle=shuffle,
596 class_weight=class_weight,
--> 597 sample_weight=sample_weight)
598
599 def evaluate(self, x, y, batch_size=32, verbose=1,

/python2.7/site-packages/keras/engine/training.pyc in fit(self, x, y, batch_size, nb_epoch, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight)
1105 verbose=verbose, callbacks=callbacks,
1106 val_f=val_f, val_ins=val_ins, shuffle=shuffle,
-> 1107 callback_metrics=callback_metrics)
1108
1109 def evaluate(self, x, y, batch_size=32, verbose=1, sample_weight=None):

/python2.7/site-packages/keras/engine/training.pyc in _fit_loop(self, f, ins, out_labels, batch_size, nb_epoch, verbose, callbacks, val_f, val_ins, shuffle, callback_metrics)
785 callback_model = self
786
--> 787 callbacks._set_model(callback_model)
788 callbacks._set_params({
789 'batch_size': batch_size,

/python2.7/site-packages/keras/callbacks.pyc in _set_model(self, model)
27 def _set_model(self, model):
28 for callback in self.callbacks:
---> 29 callback._set_model(model)
30
31 def on_epoch_begin(self, epoch, logs={}):

AttributeError: 'list' object has no attribute '_set_model'

AttributeError: 'list' object has no attribute 'set_model'

Thanks for the implementation,
I'm getting the AttributeError: 'list' object has no attribute 'set_model' error while using this callback. Any comments?
I think it is because the return of SnapshotCallbackBuilder is a list of callbacks, and I'm also using other callbacks.
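If the builder does return a list (as the usage snippet in the README suggests), a hedged fix is to concatenate the lists instead of nesting them, roughly:

other_callbacks = [early_stopping]   # whatever additional callbacks are in use (placeholder)
callbacks = snapshot.get_callbacks(model_prefix=model_prefix) + other_callbacks
model.fit(trainX, trainY, callbacks=callbacks)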

Update per batch, not per epoch

First of all, thank you very much for sharing your implementation of Snapshot Ensembling. However, you should not use callbacks.LearningRateScheduler to update the learning rate.

As mentioned in the original paper under the equation 2, "we update the learning rate at each iteration" (i.e batch) "rather than at every epoch" to improve the convergence of short cycles. But, callbacks.LearningRateScheduler in Keras, with the source code here, updates the learning rate at 'on_epoch_begin'.

After this fix, your scores should improve further.
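For illustration, a minimal sketch of a per-batch variant (not part of this repository; it relies only on the standard Keras Callback hooks, and T_iterations is assumed to be the total number of mini-batch updates, i.e. epochs times batches per epoch):

import numpy as np
from keras import backend as K
from keras.callbacks import Callback

class SnapshotLRPerBatch(Callback):
    """Cosine-annealing learning rate updated at every mini-batch (Eqn. 2 of the paper)."""

    def __init__(self, T_iterations, M, alpha_zero):
        super(SnapshotLRPerBatch, self).__init__()
        self.cycle_length = int(np.ceil(T_iterations / float(M)))  # iterations per annealing cycle
        self.alpha_zero = alpha_zero
        self.iteration = 0

    def on_batch_begin(self, batch, logs=None):
        cos_inner = np.pi * (self.iteration % self.cycle_length) / self.cycle_length
        lr = (self.alpha_zero / 2.0) * (np.cos(cos_inner) + 1.0)
        K.set_value(self.model.optimizer.lr, lr)
        self.iteration += 1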

Can I use it with ADAM?

Hi. I have some questions before I start using snapshot ensembles.
TL;DR:

  1. Can I use snapshot ensembles with an adaptive learning rate optimizer (e.g. ADAM)?
  2. Can I use it with a very slow optimizer such as SGD?
  3. Which optimizer and parameters work best in your experience?
  4. If I find optimal weights for my partial problem by using this technique, can I use them as an initial point to find the optimum for the full problem?
  5. Does Snapshot Ensembling affect training time?

Situation:
I'm training a network based on FCN-DenseNet (the problem is similar to image segmentation, but with continuous labels). Since I have quite a large dataset (over 350 GB), it is almost impossible to find proper parameters and a model with this setting (almost 6 hours to train one epoch). Instead of using the full data, I am training with a small part of it (around 8-9 GB).

The problem is that if I use ADAM, the model converges really fast whether I train with the full or the partial data: within 1-2 epochs on the full data and within 10-30 epochs on the small data.
I'm guessing ADAM just puts the model into the closest local optimum.
If I initialize the weights differently at every epoch, maybe I can find several local optima with ADAM in a really short time and get a performance boost by using this technique.

But the problem is that I don't know whether I can use it with ADAM. Is it possible?

Jupyter notebook error


OSError Traceback (most recent call last)
in ()
1 preds = []
2 for fn in models_filenames:
----> 3 model.load_weights(fn)
4 yPreds = model.predict(testX, batch_size=128)
5 preds.append(yPreds)

/home/andrewcz/miniconda3/lib/python3.5/site-packages/keras/engine/topology.py in load_weights(self, filepath, by_name)
2487 if h5py is None:
2488 raise ImportError('load_weights requires h5py.')
-> 2489 f = h5py.File(filepath, mode='r')
2490 if 'layer_names' not in f.attrs and 'model_weights' in f:
2491 f = f['model_weights']

/home/andrewcz/miniconda3/lib/python3.5/site-packages/h5py/_hl/files.py in __init__(self, name, mode, driver, libver, userblock_size, swmr, **kwds)
270
271 fapl = make_fapl(driver, libver, **kwds)
--> 272 fid = make_fid(name, mode, userblock_size, fapl, swmr=swmr)
273
274 if swmr_support:

/home/andrewcz/miniconda3/lib/python3.5/site-packages/h5py/_hl/files.py in make_fid(name, mode, userblock_size, fapl, fcpl, swmr)
90 if swmr and swmr_support:
91 flags |= h5f.ACC_SWMR_READ
---> 92 fid = h5f.open(name, flags, fapl=fapl)
93 elif mode == 'r+':
94 fid = h5f.open(name, h5f.ACC_RDWR, fapl=fapl)

h5py/_objects.pyx in h5py._objects.with_phil.wrapper (/tmp/pip-at6d2npe-build/h5py/_objects.c:2684)()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper (/tmp/pip-at6d2npe-build/h5py/_objects.c:2642)()

h5py/h5f.pyx in h5py.h5f.open (/tmp/pip-at6d2npe-build/h5py/h5f.c:1930)()

OSError: Unable to open file (Unable to open file: name = 'weights/wrn-cifar100-16-4-best.h5', errno = 2, error message = 'no such file or directory', flags = 0, o_flags = 0)

load a weight file containing 30 layers into a model with 33 layers.

You are trying to load a weight file containing 30 layers into a model with 33 layers.
I don't understand how to change my layers.
Can I change my model, or can you change the weights file?

This is my model, cloned from yours.
Can you help me change this model?

def create_wide_residual_network(input, nb_classes=100, N=2, k=1, dropout=0.0, verbose=1):
    """
    Creates a Wide Residual Network with specified parameters

    :param input: Input Keras object
    :param nb_classes: Number of output classes
    :param N: Depth of the network. Compute N = (n - 4) / 6.
              Example : For a depth of 16, n = 16, N = (16 - 4) / 6 = 2
              Example2: For a depth of 28, n = 28, N = (28 - 4) / 6 = 4
              Example3: For a depth of 40, n = 40, N = (40 - 4) / 6 = 6
    :param k: Width of the network.
    :param dropout: Adds dropout if value is greater than 0.0
    :param verbose: Debug info to describe created WRN
    :return:
    """
    x = initial_conv(input)
    nb_conv = 4

    for i in range(N):
        x = conv1_block(x, k, dropout)
        nb_conv += 2

    x = MaxPooling2D(pool_size=(2, 2), dim_ordering="th")(x)

    for i in range(N):
        x = conv2_block(x, k, dropout)
        nb_conv += 2

    x = MaxPooling2D(pool_size=(2, 2), dim_ordering="th")(x)

    for i in range(N):
        x = conv3_block(x, k, dropout)
        nb_conv += 2

    x = AveragePooling2D((3, 8))(x)
    x = Flatten()(x)

    x = Dense(nb_classes, init="normal", activation='softmax')(x)

    if verbose: print("Wide Residual Network-%d-%d created." % (nb_conv, k))
    return x

how to predict the ground truth image with weighted ensemble?

Dear Titu,
Thank you for the valuable source code that you have provided. Can you please guide me on how to predict the ground-truth image with the weighted ensemble, given the predicted accuracy? I cannot figure out how to predict the ground-truth images with the weighted ensemble. Thanks in advance. For reference, the ground-truth image prediction code is given below. Please guide me, thanks.
Regards

calculate the predicted image

outputs = np.zeros((height, width))
for i in range(height):
    for j in range(width):
        target = int(y[i, j])
        if target == 0:
            continue
        else:
            image_patch = Patch(X, i, j)
            X_test_image = image_patch.reshape(1, image_patch.shape[0], image_patch.shape[1],
                                               image_patch.shape[2], 1).astype('float32')
            prediction = yPred.predict(X_test_image)
            prediction = np.argmax(prediction, axis=1)
            outputs[i][j] = prediction + 1

How well does this work in practice?

Have you used this at all in your own work and found it effective?

I've got several long training runs coming up, and it would be helpful to get the improvements the authors describe, plus the decent intermediate results while I wait.

Alpha Zero

@titu1994

Does alpha_zero in the callback overwrite the LR in the optimizer once you compile? If the LR of my optimizer is set to .1 and alpha_zero is set to .0001, how is this handled? Right now I am changing LRs in two places, just wondering if that is necessary.

model accuracy

In your train_*.py, at the end, your code is like this:
yPreds = model.predict(testX)
yPred = np.argmax(yPreds, axis=1)
yTrue = testY

accuracy = metrics.accuracy_score(yTrue, yPred) * 100
error = 100 - accuracy
print("Accuracy : ", accuracy)
print("Error : ", error)

  1. I don't understand why you used np.argmax(yPreds, axis=1). I think you should use
    yPreds[yPreds >= 0.5] = 1
    yPreds[yPreds < 0.5] = 0
    yPred = yPreds

  2. When you used model.predict, which weights did you use? the last one or the best one?

Many thanks
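For context, the two decision rules address different output types; a hedged comparison with an illustrative array:

import numpy as np

yPreds = np.array([[0.1, 0.7, 0.2],
                   [0.6, 0.3, 0.1]])

# Multi-class output (softmax, exactly one label per sample): pick the most probable class
np.argmax(yPreds, axis=1)          # -> array([1, 0])

# Multi-label output (independent sigmoids): threshold each output separately
(yPreds >= 0.5).astype(int)        # -> array([[0, 1, 0], [1, 0, 0]])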
