adamwr's Introduction

AdamW optimizer and cosine learning rate annealing with restarts

This repository contains an implementation of the AdamW optimization algorithm and the cosine learning rate scheduler described in "Decoupled Weight Decay Regularization". The AdamW implementation is straightforward and does not differ much from the existing Adam implementation for PyTorch, except that it decouples weight decay from the gradient-based update. The cosine annealing scheduler with restarts allows the model to converge to a (possibly) different local minimum on every restart and normalizes the weight decay hyperparameter according to the length of the restart period. Unlike the schedulers in the standard PyTorch scheduler suite, this scheduler adjusts the optimizer's learning rate not on every epoch but on every batch update, as described in the paper.
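
As a rough illustration of the decoupling (a sketch only, with names invented here, not the repository's actual code): plain Adam folds weight decay into the gradient as L2 regularization, whereas AdamW shrinks the weights directly after the adaptive gradient step.

    import math
    import torch

    def adamw_update(p, grad, exp_avg, exp_avg_sq, step, lr=1e-3,
                     betas=(0.9, 0.999), eps=1e-8, weight_decay=1e-5):
        # Illustrative single AdamW update for one parameter tensor `p`.
        beta1, beta2 = betas
        exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)               # first moment estimate
        exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)  # second moment estimate
        step_size = lr * math.sqrt(1 - beta2 ** step) / (1 - beta1 ** step)
        p.addcdiv_(exp_avg, exp_avg_sq.sqrt().add_(eps), value=-step_size)
        # Decoupled weight decay: applied to the weights directly instead of
        # being added to the gradient as weight_decay * p.
        p.mul_(1 - lr * weight_decay)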

Cyclical Learning Rates

Besides "cosine" and "arccosine" policies (arccosine has steeper profile at the limiting points), there are "triangular", triangular2 and exp_range, which implement policies proposed in "Cyclical Learning Rates for Training Neural Networks". The ratio of increasing and decreasing phases for triangular policy could be adjusted with triangular_step parameter. Minimum allowed lr is adjusted by min_lr parameter.

  • The triangular schedule is enabled by passing the policy="triangular" parameter.
  • The triangular2 schedule halves the maximum lr on each restart cycle and is enabled by passing the policy="triangular2" parameter, or by combining policy="triangular" with eta_on_restart_cb=ReduceMaxLROnRestart(ratio=0.5). The ratio parameter sets the factor by which the lr is scaled on each restart.
  • The exp_range schedule is enabled by passing the policy="exp_range" parameter. It scales the maximum lr exponentially with the iteration count; the base of the exponent is set by the gamma parameter.

These schedules can be combined with shrinking/expanding restart periods and weight decay normalization, and can be used with AdamW as well as with other PyTorch optimizers; a sketch of a non-cosine policy follows the example below.

Example:

    batch_size = 32
    epoch_size = 1024  # number of training samples per epoch
    model = resnet()
    optimizer = AdamW(model.parameters(), lr=1e-3, weight_decay=1e-5)
    scheduler = CyclicLRWithRestarts(optimizer, batch_size, epoch_size,
                                     restart_period=5, t_mult=1.2, policy="cosine")
    for epoch in range(100):
        scheduler.step()               # advance the epoch-level schedule
        for batch in train_loader:     # iterate over the training set
            ...                        # forward pass and loss computation
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            scheduler.batch_step()     # update lr and weight decay for this batch
        validate(...)
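
For instance, swapping the cosine policy in the example above for triangular2 can be done in either of the two equivalent ways described earlier. A sketch reusing the README's own parameters (nothing here beyond what is documented above):

    # Built-in triangular2 policy: the maximum lr is halved on every restart.
    scheduler = CyclicLRWithRestarts(optimizer, batch_size, epoch_size,
                                     restart_period=5, t_mult=1.2,
                                     policy="triangular2")

    # Equivalent: triangular policy plus an explicit restart callback.
    scheduler = CyclicLRWithRestarts(optimizer, batch_size, epoch_size,
                                     restart_period=5, t_mult=1.2,
                                     policy="triangular",
                                     eta_on_restart_cb=ReduceMaxLROnRestart(ratio=0.5))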

adamwr's People

Contributors

mpyrozhok

adamwr's Issues

Hypergradient Descent

Thank you for sharing this. Would it be possible for you to also integrate the Hypergradient Descent technique into your AdamW implementation? It reduces the need to tune the initial learning rate. https://github.com/gbaydin/hypergradient-descent

                if state['step'] > 1:
                    prev_bias_correction1 = 1 - beta1 ** (state['step'] - 1)
                    prev_bias_correction2 = 1 - beta2 ** (state['step'] - 1)
                    # Hypergradient for Adam:
                    h = torch.dot(grad.view(-1), torch.div(exp_avg, exp_avg_sq.sqrt().add_(group['eps'])).view(-1)) * math.sqrt(prev_bias_correction2) / prev_bias_correction1
                    # Hypergradient descent of the learning rate:
                    group['lr'] += group['hypergrad_lr'] * h

I have also read a lot of criticism of AMSGrad and haven't yet been able to get any improvement with that variant. Could I ask for your thoughts on it? FYI, two other techniques that I am currently experimenting with are Padam and QHAdam.

Getting StopIteration when running training


    StopIteration                             Traceback (most recent call last)
    in ()
          1 training(model=model, epoch=20, eval_every=500,
          2          loss_func=loss_function, optimizer=optimizer, train_iter=train_iter,
    ----> 3          val_iter=val_iter, scheduler=scheduler, warmup_epoch=3, early_stop=2)

    in training(epoch, model, eval_every, loss_func, optimizer, train_iter, val_iter, scheduler, early_stop, warmup_epoch)
         37     loss.backward()
         38     optimizer.step()
    ---> 39     scheduler.batch_step()
         40     if step % eval_every == 0:
         41         model.eval()

    in batch_step(self)
        274
        275     def batch_step(self):
    --> 276         t_cur = self.t_epoch + next(self.batch_increment)
        277         for param_group, (lr, weight_decay) in zip(self.optimizer.param_groups,
        278                                                    self.get_lr(t_cur)):

    StopIteration:
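
A likely cause, judging from the traceback and the README example: batch_increment appears to be a per-epoch iterator whose length is derived from epoch_size and batch_size, so StopIteration is raised when batch_step() is called more times per epoch than that allows (or when the epoch-level scheduler.step() call is missing). A sketch of the intended usage, with train_dataset, train_loader and num_epochs standing in for your own objects:

    epoch_size = len(train_dataset)  # number of samples, not number of batches
    scheduler = CyclicLRWithRestarts(optimizer, batch_size, epoch_size,
                                     restart_period=5, t_mult=1.2, policy="cosine")
    for epoch in range(num_epochs):
        scheduler.step()             # once per epoch, before any batch_step()
        for batch in train_loader:
            ...                      # forward/backward, optimizer.step()
            scheduler.batch_step()   # exactly one call per batch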

scheduler.batch_step() AttributeError: 'CosineLRWithRestarts' object has no attribute 'batch_increment'

    Z:\sp2\nhdeblur_pytorch>python "train.py" 1>"train_log.txt"
    Traceback (most recent call last):
      File "train.py", line 140, in <module>
        train(train_gen=trainloader, model=model, criterion=criterion, optimizer=optimizer, epoch=epoch)
      File "train.py", line 115, in train
        scheduler.batch_step()
      File "Z:\sp2\nhdeblur_pytorch\cosine_scheduler.py", line 110, in batch_step
        t_cur = self.t_epoch + next(self.batch_increment)
    AttributeError: 'CosineLRWithRestarts' object has no attribute 'batch_increment'

    optimizer = adamw.AdamW(model.parameters(), lr=opt.lr, weight_decay=0)
    scheduler = cosine_scheduler.CosineLRWithRestarts(optimizer, batch_size=opt.batch_size, epoch_size=len(src_set), restart_period=5, t_mult=1.2)

def train(train_gen, model, criterion, optimizer, epoch):
    epoch_loss = 0
    for iteration, batch in enumerate(train_gen, 1):
        nr = batch[0].to(device)
        hr = batch[1].to(device)
        
        optimizer.zero_grad()
        loss = criterion(model(nr), hr)
        epoch_loss += loss.item()
        loss.backward()
        optimizer.step()
        scheduler.batch_step()
    
        if iteration % 1000 == 0:
            print('===> Epoch[{e}]({it}/{dl}): Loss{l:.4f};'.format(e=epoch, it=iteration, dl=len(train_gen), l=loss.cpu()))
            
    Current_time = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime())
    epoch_loss_average = epoch_loss / len(train_gen)
    print('===> {ct} Epoch {e} Complete: Avg Loss: {avg_loss:.4f}, Sum Loss: {sum_loss:.4f}'
          .format(e=epoch, avg_loss=epoch_loss_average, sum_loss=epoch_loss, ct=Current_time))
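
Judging from the README example above, batch_increment only exists after scheduler.step() has been called, so this AttributeError most likely means the epoch-level step() call is missing from the loop that wraps train(). A minimal sketch of the fix under that assumption (num_epochs stands in for your epoch count):

    for epoch in range(num_epochs):
        scheduler.step()             # must run once per epoch before any batch_step()
        train(train_gen=trainloader, model=model, criterion=criterion,
              optimizer=optimizer, epoch=epoch)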

Persisting CosineAnnealingLRWithRestarts

Hi there,

Up to now all my schedulers inherited from _LRScheduler, so I didn't need to care too much about how they would be persisted.

For my checkpoints I define my state like this:

    state = {
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
        "scheduler_state": scheduler.state_dict(),
    }

However, CosineAnnealingLRWithRestarts doesn't have this state_dict() method.

I checked the implementation of state_dict() in the documentation:
https://pytorch.org/docs/stable/_modules/torch/optim/lr_scheduler.html#LambdaLR

and tried to extend your code myself; however, I probably missed something.
Could you take a look?

The diffs are:

I made the class inherit from _LRScheduler:

    import types
    from torch.optim.lr_scheduler import _LRScheduler

    class CosineAnnealingLRWithRestarts(_LRScheduler):

And rewrote state_dict() and load_state_dict():



    def state_dict(self):
        """Returns the state of the scheduler as a :class:`dict`.

        It contains an entry for every variable in self.__dict__ which
        is not the optimizer.
        The learning rate lambda functions will only be saved if they are callable objects
        and not if they are functions or lambdas.
        """
        state_dict = {key: value for key, value in self.__dict__.items() if key not in ('optimizer', 'base_lrs', 'base_weight_decays')}
        state_dict['base_lrs'] = [None] * len(self.base_lrs)
        state_dict['base_weight_decays'] = [None] * len(self.base_weight_decays)

        for idx, fn in enumerate(self.base_weight_decays):
            if not isinstance(fn, types.FunctionType):
                # state_dict['base_weight_decays'][idx] = fn.__dict__.copy()
                state_dict['base_weight_decays'][idx] = fn

        for idx, fn in enumerate(self.base_lrs):
            if not isinstance(fn, types.FunctionType):
                # state_dict['base_lrs'][idx] = fn.__dict__.copy()
                state_dict['base_lrs'][idx] = fn


        return state_dict

    def load_state_dict(self, state_dict):
        """Loads the schedulers state.

        Arguments:
            state_dict (dict): scheduler state. Should be an object returned
                from a call to :meth:`state_dict`.
        """
        base_lrs = state_dict.pop('base_lrs')
        base_weight_decays = state_dict.pop('base_weight_decays')

        self.__dict__.update(state_dict)

        for idx, fn in enumerate(base_lrs):
            if fn is not None:
                self.base_lrs[idx] = fn        

        for idx, fn in enumerate(base_weight_decays):
            if fn is not None:
                self.base_weight_decays[idx] = fn

However, I still get AttributeError: Can't pickle local object 'Tensor.__iter__.<locals>.<lambda>'.
It would be terrific to be able to persist the state of this scheduler :-)
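
The pickling error points at an iterator over a tensor, so the unpicklable attribute is most likely the per-epoch batch iterator batch_increment (seen in the tracebacks above) rather than base_lrs or base_weight_decays. A sketch of a simpler workaround under that assumption: exclude the iterator (and the optimizer reference) from the saved state and let the normal per-epoch scheduler.step() call rebuild it after the checkpoint is restored.

    def state_dict(self):
        # Skip the optimizer reference and the per-epoch batch iterator,
        # neither of which can be pickled.
        return {key: value for key, value in self.__dict__.items()
                if key not in ('optimizer', 'batch_increment')}

    def load_state_dict(self, state_dict):
        self.__dict__.update(state_dict)

    # After loading the checkpoint, call scheduler.step() at the start of the
    # next epoch (as in the README example) so the batch iterator is re-created.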

Add License

Could you add a license to this project so that people can copy, modify, and redistribute it? Thanks!

LR Scheduler help

Can you please help me write my own learning rate scheduler? I couldn't find much documentation on how to write one in PyTorch. I went through this mxnet guide and came to the conclusion that I could do the following:

    lrs = [scheduler(i + 1) for i in range(epochs * batch_size)]
    iters = 1
    for i in range(epochs):
        for data, label in train:
            ...  # backward and calculate loss
            for group in optimizer.param_groups:
                group['lr'] = lrs[iters]
            optimizer.step()
            iters += 1

What would be a more elegant way of doing this?
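
One option that avoids editing param_groups by hand (not specific to this repository) is PyTorch's built-in LambdaLR, which multiplies each base lr by a factor you compute and is stepped explicitly. A sketch, where schedule() is a hypothetical function mapping the iteration index to an lr multiplier:

    from torch.optim.lr_scheduler import LambdaLR

    # schedule(it) -> multiplier applied to the optimizer's base lr (hypothetical)
    scheduler = LambdaLR(optimizer, lr_lambda=lambda it: schedule(it + 1))

    for epoch in range(epochs):
        for data, label in train:
            ...  # backward and calculate loss
            optimizer.step()
            scheduler.step()  # advance the schedule once per iteration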

StopIteration

Hi, thank you for sharing your work. Following your description, I tried to use your code in my project, but I got a StopIteration error in scheduler.batch_step(), raised on the line t_cur = self.t_epoch + next(self.batch_increment).

Lower/Upper Bound for LR and Upper Bound decay

Hey there,

Nice update of the scheduler! It's really useful!

It would also be nice to have the option to set the following parameters: base_lr, max_lr and scale_fn.

The scale_fn would be a function that decreases the max_lr:

  • by half after each period, while keeping the base lr constant,
  • by a factor gamma**(iterations),
  • or by whatever lambda function is given.

Here is an example implementation in Keras: https://github.com/bckenstler/CLR

I tried to hack this myself but I'm stuck. I'm not entirely sure which eta you use (is it the one from weight decay?). And even if I'm right, I can't persist my hack because of the lambda function -.-

I'm also not sure why, but in my case (super-resolution), my model diverges after every restart when using cosine/arccosine (AdamW, wd=1e-6).
It happens with triangular too, but not right at the start of the second cycle.
Do you have an idea where this could come from?

Thanks for your time!
