lessw2020 / ranger-deep-learning-optimizer

Ranger - a synergistic optimizer using RAdam (Rectified Adam), Gradient Centralization and LookAhead in one codebase

License: Apache License 2.0

Python 100.00%

ranger-deep-learning-optimizer's Introduction

Ranger-Deep-Learning-Optimizer

Ranger - a synergistic optimizer combining RAdam (Rectified Adam) and LookAhead, and now GC (gradient centralization) in one optimizer.

Quick note - Ranger21 is now in beta; it is Ranger with a host of new improvements.

Recommend you compare results with Ranger21: https://github.com/lessw2020/Ranger21

Latest version 20.9.4 - updates Gradient Centralization to GC2 (thanks to GC developer) and removes addcmul_ deprecation warnings in PyTorch 1.6.0.

*Latest version is in ranger2020.py - looking at a few other additions before integrating into the main ranger.py.

What is Gradient Centralization? = "GC can be viewed as a projected gradient descent method with a constrained loss function. The Lipschitzness of the constrained loss function and its gradient is better so that the training process becomes more efficient and stable." Source paper: https://arxiv.org/abs/2004.01461v2
Ranger now uses Gradient Centralization by default, and applies it to all conv and fc layers by default. However, everything is customizable, so you can test with and without it on your own datasets. (Turn it on or off via the "use_gc" flag at init.)
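
To make the centralization step concrete, here is a minimal sketch in plain Python. It assumes a 2-D gradient (one row per output unit of an fc layer) stored as nested lists; the function name centralize_gradient is hypothetical and not part of Ranger, which operates on PyTorch tensors instead.

```python
def centralize_gradient(grad):
    """Subtract each row's mean from a 2-D gradient given as a list of rows.

    Illustrative only: every row (one output unit) ends up with zero mean,
    which is the core idea of Gradient Centralization.
    """
    centered = []
    for row in grad:
        mean = sum(row) / len(row)
        centered.append([g - mean for g in row])
    return centered


# Toy gradient for an fc layer with 2 output units and 3 inputs.
grad = [[1.0, 2.0, 3.0], [4.0, 4.0, 10.0]]
centered = centralize_gradient(grad)
```

For conv layers the same idea applies with the mean taken over all dimensions except the first (output channel) one.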

Best training results - use a 75% flat lr, then step down and run lower lr for 25%, or cosine descend last 25%.

Per extensive testing - It's important to note that simply running one learning rate the entire time will not produce optimal results.
Effectively Ranger will end up 'hovering' around the optimal zone, but can't descend into it unless it has some additional run time at a lower rate to drop down into the optimal valley.
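
The schedule described above can be sketched as a plain function of the step index. This is only an illustration under stated assumptions: the name flat_then_anneal_lr, the step-down factor of 0.1, and annealing all the way to zero are choices made for this sketch, not values given in this README.

```python
import math


def flat_then_anneal_lr(step, total_steps, base_lr, flat_fraction=0.75,
                        mode="cosine", step_factor=0.1):
    """Learning rate for a 0-based `step` out of `total_steps`.

    Flat at `base_lr` for the first `flat_fraction` of training, then either
    a single step down (mode="step") or a cosine descent (mode="cosine").
    `step_factor` is an assumed value for illustration.
    """
    flat_steps = int(total_steps * flat_fraction)
    if step < flat_steps:
        return base_lr  # flat phase
    if mode == "step":
        return base_lr * step_factor  # one step down, held until the end
    # Cosine descent from base_lr toward 0 over the remaining steps.
    progress = (step - flat_steps) / max(1, total_steps - flat_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


schedule = [flat_then_anneal_lr(s, 100, 1e-3) for s in range(100)]
```

The returned value would typically be written into the optimizer's parameter groups before each step, or the same shape can be expressed with a scheduler from the training framework in use.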

Full customization at init:

Ranger will now print out id and gc settings at init so you can confirm the optimizer settings at train time:

/////////////////////

Medium article with more info:
https://medium.com/@lessw/new-deep-learning-optimizer-ranger-synergistic-combination-of-radam-lookahead-for-the-best-of-2dc83f79a48d

Multiple updates: 1 - Ranger is the optimizer we used to beat the high scores for 12 different categories on the FastAI leaderboards! (Previous records all held with AdamW optimizer).

2 - Highly recommend combining Ranger with: Mish activation function, and flat+ cosine anneal training curve.

3 - Based on that, also found .95 is better than .90 for beta1 (momentum) param (ala betas=(0.95, 0.999)).

Fixes: 1 - Differential group learning rates are now supported. This was fixed in RAdam and ported here thanks to @sholderbach. 2 - Save and then load could leave first-run weights stranded in memory, slowing down future runs = fixed.

Installation

Clone the repo, cd into it and install it in editable mode (-e option). That way, there is no need to re-install the package after modifying it.

git clone https://github.com/lessw2020/Ranger-Deep-Learning-Optimizer
cd Ranger-Deep-Learning-Optimizer
pip install -e .

Usage

from ranger import Ranger  # this is from ranger.py
from ranger import RangerVA  # this is from ranger913A.py
from ranger import RangerQH  # this is from rangerqh.py

# Define your model
model = ...
# Each of the Ranger, RangerVA, RangerQH have different parameters.
optimizer = Ranger(model.parameters(), **kwargs)

Usage and notebook to test are available here: https://github.com/lessw2020/Ranger-Mish-ImageWoof-5

Citing this work

We recommend you use the following to cite Ranger in your publications:

@misc{Ranger,
  author = {Wright, Less},
  title = {Ranger - a synergistic optimizer.},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/lessw2020/Ranger-Deep-Learning-Optimizer}}
}


ranger-deep-learning-optimizer's Issues

How to cite Ranger in a paper?

In my recent paper I used Ranger. I wish to give the author(s) all the credit they deserve, but I'm not sure how to cite it properly. Currently I cite the Medium article. Should I cite this GitHub repo instead? Thanks.

This overload of addcmul_ is deprecated: addcmul_(Number value, Tensor tensor1, Tensor tensor2)

I get the following warning when using ranger with pytorch 1.6.0

/path/Ranger-Deep-Learning-Optimizer/ranger/ranger.py:138: UserWarning: This overload of addcmul_ is deprecated:
        addcmul_(Number value, Tensor tensor1, Tensor tensor2)
Consider using one of the following signatures instead:
        addcmul_(Tensor tensor1, Tensor tensor2, *, Number value) (Triggered internally at  /pytorch/torch/csrc/utils/python_arg_parser.cpp:766.)
  exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)

Could someone tell me how to use it?

Let's revolutionize the AI research field

Hi,
I have a dream and I'll try to share it with you.

But before explaining further, I'll need your brain to analyze this input and output me what you think about it!

Small rant on the inertia of AI research

First of all, thank you for advancing progress in deep learning.

I'm just a random guy who wants to implement an AGI (lol) and, like many NLP engineers, I need HIGHLY accurate neural networks for fundamental NLP tasks (e.g. POS tagging, NER, dependency parsing, coreference resolution, WSD, etc.).
None of them is very accurate (often below 95% F1 score), and their errors add up.

Such limitations make NLP not yet suitable for many things.
This is why improving the state of the art (which can be tracked on paperswithcode.com) is a crucial priority for academics.

Indeed, many researchers have smart ideas for improving the state of the art, and often improve it slightly by
taking a "standard neural network" for the task and mixing their new fancy idea into it.

I speak from knowledge: I've read most papers on the state-of-the-art leaderboards for most fundamental NLP tasks.
Almost always they have this common baseline + one idea, their own.
The common baseline sometimes evolves slowly (e.g. now it's often a pre-trained model (say BERT) + fine-tuning + their idea).

Sorry to say, but to me "this" is absurd,
where "this" means the fact that, by far, most researchers work in isolation, not integrating others' ideas (or doing so with such slow inertia).
I would have wished that the state of the art in an NLP task would be a combination of, e.g., 50 innovative and complementary ideas from researchers.
You are researchers: do you have an idea why that is the case? If someone actually tried to merge all the good, complementary and compatible ideas, would they have the best, unmatchable state of the art?
Why don't facebookresearch, Microsoft and Google pick the low-hanging fruit and, in addition to producing X new shiny ideas per month, actually try to merge them in a coherent, synergistic manner?
I would like you to tell me what you think of this major issue that slows AI progress.

As an example of such inertia, let's talk about Swish, Mish or RAdam:
these are incredibly easy to try in order to see "hey, does this give my neural network free accuracy gains?"
Yet no paper on the state-of-the-art leaderboards has tried Swish, Mish or RAdam, despite them being so simple to try (you don't need to change the neural network).
Not even the pre-trained models that so many papers depend on (I opened issues for each of them).

Once I know what you think about this research inertia, I'll explain my vision of what needs to be done to fix it.

ranger and cosine annealing LR leads to different schedule than SGD optimizer? o_O

Released on PyPI

I just released this code on PyPI. It's called asranger (ranger was taken).
So it can be installed with pip install asranger and can be made a hard dependency by other projects on PyPI.
The corresponding code is on my fork.

Too huge step_size at initialization stage

I found that step_size is too high in the initial 5 steps.
The problem is in the code:

if N_sma >= self.N_sma_threshhold:
    step_size = math.sqrt((1 - beta2_t) * (N_sma - 4) / (N_sma_max - 4) * (N_sma - 2) / N_sma * N_sma_max / (N_sma_max - 2)) / (1 - beta1 ** state['step'])
else:
    step_size = 1.0 / (1 - beta1 ** state['step'])

If betas are set to (0.9, 0.999), the internal variables change as follows:

state['step']| step_size
------------------------------
        1    |     10
        2    |5.26315789
        3    |3.6900369
        4    |2.90782204
        5    |2.44194281
        6    |0.00426327
        7    |0.00524248
        8    |0.00607304
        9    |0.00681674
       10    |0.00750596

Note that step_size doesn't depend on the gradient value, and it scales learning_rate.
Thus RAdam moves weights aggressively away from their initial values, even if they have a good initialization.

Would it be better to set step_size equal to 0 if N_sma < self.N_sma_threshhold?
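
The table above can be reproduced with a few lines of plain Python. This sketch copies the step_size expression quoted in this issue and the N_sma recurrence from the optimizer code; the function name radam_step_size and the threshold value of 5 are assumptions made for illustration.

```python
import math


def radam_step_size(step, beta1=0.9, beta2=0.999, n_sma_threshold=5):
    """Step-size multiplier for a 1-based `step`, before scaling by lr."""
    beta2_t = beta2 ** step
    n_sma_max = 2 / (1 - beta2) - 1
    n_sma = n_sma_max - 2 * step * beta2_t / (1 - beta2_t)
    if n_sma >= n_sma_threshold:
        # Rectified branch: variance of the adaptive term is tractable.
        rect = math.sqrt(
            (1 - beta2_t)
            * (n_sma - 4) / (n_sma_max - 4)
            * (n_sma - 2) / n_sma
            * n_sma_max / (n_sma_max - 2)
        )
        return rect / (1 - beta1 ** step)
    # Early steps: only the bias correction of the first moment is applied.
    return 1.0 / (1 - beta1 ** step)


for step in range(1, 11):
    print(f"{step:>2} | {radam_step_size(step):.8f}")
```

Note that the two branches are applied differently in the optimizer: the large early values multiply the raw momentum term, while the small later values multiply the momentum divided by the second-moment estimate, so the raw numbers are not directly comparable.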

[question] Why Ranger is not available as a pip package

Why is Ranger only available on GitHub and not as a pip package? Wouldn't it be easier for the community to actually use it?

Does it make sense to use it with a batch size of 1?

@lessw2020 Thanks for this awesome optimizer. I'm very excited about it!

There is one particular workload that trains using a batch of 1 item.
Theoretically, does it make sense to use RAdam (Rectified Adam), LookAhead, and GC in this context?

I've been thinking about it and have read the papers, but I still could not reach a conclusion. As you (or anyone else here) are much more experienced than me, do you have an opinion on this?

cannot load trained model using Ranger

Hi There!

Thanks for putting together this code for Rectified Adam with Lookahead optimizer. I used this optimization function to train my model with fastai and successfully trained the model.

I exported the model using

feature = 'silhouette'
learn.export(f'{feature}_efficientnet-b3.pkl')

and later during inference I am trying to load the learner using

from ranger import Ranger
feature = 'silhouette'
learn = load_learner(path = model_path, file = f'{feature}_efficientnet-b3.pkl')

I have defined the model path properly in the previous cells. But for some reason, I cannot load the learner. The file cannot locate the module ranger.ranger. Can someone please help me fix this issue?

Here's a screenshot of the error for your reference.

Thanks & Regards,
Vinayak.

Add manual synchronization function

Hello. First of all, thank you for sharing code and experiment results.
Reading the code, I found that the model uses the fast weights for inference. According to the LookAhead paper, fast weights (before synchronization) may perform worse than slow weights. With probability 1 - 1/k (80% when k = 5), validation or testing will run on unsynchronized fast weights. It would therefore be better to synchronize manually before evaluation.
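
The idea can be illustrated with a toy, framework-free sketch. The class ToyLookahead and its sync method are hypothetical and only mimic the LookAhead bookkeeping on a list of floats; they are not Ranger's API.

```python
class ToyLookahead:
    """Minimal LookAhead bookkeeping over a list of float weights."""

    def __init__(self, weights, alpha=0.5, k=5):
        self.fast = list(weights)  # weights the inner optimizer updates
        self.slow = list(weights)  # slow copy, refreshed every k steps
        self.alpha = alpha
        self.k = k
        self.steps = 0

    def step(self, updates):
        """Apply one fast update, then synchronize on every k-th step."""
        self.fast = [w + u for w, u in zip(self.fast, updates)]
        self.steps += 1
        if self.steps % self.k == 0:
            self.sync()

    def sync(self):
        """Pull slow weights toward fast weights and copy them back."""
        self.slow = [s + self.alpha * (f - s)
                     for s, f in zip(self.slow, self.fast)]
        self.fast = list(self.slow)


opt = ToyLookahead([0.0], alpha=0.5, k=5)
for _ in range(3):  # stop between two automatic synchronizations
    opt.step([1.0])
fast_before_eval = opt.fast[0]  # 3.0: unsynchronized fast weight
opt.sync()  # manual synchronization before evaluation
```

One design caveat: synchronizing mid-cycle changes the training trajectory, so an alternative would be to evaluate on a temporary interpolated copy and leave the optimizer state untouched.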

Not able to save the model_state_dict.

Hi, I was trying to save the model checkpoints after each epoch using the code below. But only the state dictionary of the zeroth epoch got stored and none of the others. Does the Ranger optimizer object support state_dict? If yes, how can I save it after each epoch?

out_model = os.path.join(args.model_dir, 'model.th')
with open(out_model, 'wb') as f:
    torch.save(model.state_dict(), f)
print("Model is dumped")

Has anyone tested this composite optimizer?

Keras implementation

It would be very helpful if you could provide implementation in keras.

Is there a publication of Ranger?

I want to cite ranger on a Medium article and I would like to know if there is an arXiv publication of Ranger or a published peer-reviewed paper on some conference or journal.

I saw you linked a paper in the README.md, but it does not seem to be about Ranger, as the word itself does not appear anywhere in it. I know the RAdam and Lookahead papers, but the Ranger one is missing from my library. Thanks

How to use ranger in keras? Please help me.

Your optimizer looks like a big achievement!
I used "optimizer=Ranger(lr=0.001)" in Keras,
but I get the error "TypeError: __init__() missing 1 required positional argument: 'params'".
I don't know how to debug it. Can you help me?

Stochastic Weight Averaging support

Does Ranger support Stochastic Weight Averaging?

N_sma_threshhold

You first have
if N_sma > self.N_sma_threshhold:

and then you have
if N_sma > 4:

Is it right that the second one is constant or should that also be N_sma_threshhold parameter?

Grad norm and ranger

I'm using NVIDIA apex and torch grad norm.
This is the grad norm plot with ranger (red) and adamw (blue).
https://i.imgur.com/Ui4Sioo.png
Is it OK to have such huge grad norm values? Should I turn off grad norming?

Please note in the documentation (or in the constructor) that closures must be enabled

Hi,

Today I had a relatively long debugging session after upgrading my PyTorch Lightning installation, because training_step wasn't being called.

It finally turned out that the problem was that the "closure" argument is not used in the step function (it is commented out, as also noted in the source code).

However, since it is apparently required by some libraries and is also recommended by the official PyTorch guidelines, it would be great if the documentation noted that people might need to enable these lines.

Thanks in advance.
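
For readers unfamiliar with the convention, the following toy sketch shows what honoring the closure argument in step looks like. ToyOptimizer is a hypothetical stand-in written in plain Python, not Ranger's code; it only mirrors the pattern of calling the closure to re-evaluate the loss before updating.

```python
class ToyOptimizer:
    """Stand-in optimizer showing the closure convention of `step`."""

    def __init__(self, params, lr=0.1):
        self.params = params  # list of dicts: {"value": float, "grad": float}
        self.lr = lr

    def step(self, closure=None):
        loss = None
        if closure is not None:
            # The closure re-evaluates the model, fills in gradients
            # and returns the loss.
            loss = closure()
        for p in self.params:
            p["value"] -= self.lr * p["grad"]
        return loss


params = [{"value": 1.0, "grad": 0.0}]
opt = ToyOptimizer(params, lr=0.1)


def closure():
    # Loss is value**2, so its gradient is 2 * value.
    p = params[0]
    p["grad"] = 2.0 * p["value"]
    return p["value"] ** 2


loss = opt.step(closure)
```

If step ignores the closure, a framework that performs the forward and backward pass inside it never runs them, which matches the symptom described above.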

RangerVA with GC

Hello,

Thank you for your work on these optimizers btw. I was testing a couple out and was performing quite well with the RangerVA originally. Then, when your gradient centralization was added I got further improvements but it also seemed to be overtraining the train set more easily despite using the same parameters. Therefore, I tried to implement combining the gradient centralization into the RangerVA algorithm and so far it seems to be performing quite well and faster since it seems I can use larger batch sizes. I was wondering if you could quickly check, whenever you have some free time, if I implemented correctly in the code below since you are so used to this optimizer.

Best

class RangerVA(Optimizer):

    def __init__(self, params, lr=1e-3,
                 alpha=0.5, k=6, n_sma_threshhold=5, betas=(.95,0.999),
                 eps=1e-5, weight_decay=0, amsgrad=True, transformer='softplus', smooth=50,
                 grad_transformer='square',use_gc=True, gc_conv_only=False):
        #parameter checks
        if not 0.0 <= alpha <= 1.0:
            raise ValueError(f'Invalid slow update rate: {alpha}')
        if not 1 <= k:
            raise ValueError(f'Invalid lookahead steps: {k}')
        if not lr > 0:
            raise ValueError(f'Invalid Learning Rate: {lr}')
        if not eps > 0:
            raise ValueError(f'Invalid eps: {eps}')

        #prep defaults and init torch.optim base
        defaults = dict(lr=lr, alpha=alpha, k=k, step_counter=0, betas=betas,
                        n_sma_threshhold=n_sma_threshhold, eps=eps, weight_decay=weight_decay,
                        smooth=smooth, transformer=transformer, grad_transformer=grad_transformer,
                        amsgrad=amsgrad,use_gc=use_gc, gc_conv_only=gc_conv_only )
        super().__init__(params,defaults)

        #adjustable threshold
        self.n_sma_threshhold = n_sma_threshhold

        #look ahead params
        self.alpha = alpha
        self.k = k

        #radam buffer for state
        self.radam_buffer = [[None,None,None] for ind in range(10)]

        #gc on or off
        self.use_gc=use_gc
        #level of gradient centralization
        self.gc_gradient_threshold = 3 if gc_conv_only else 1
        print(f"Ranger optimizer loaded. \nGradient Centralization usage = {self.use_gc}")
        if (self.use_gc and self.gc_gradient_threshold==1):
            print(f"GC applied to both conv and fc layers")
        elif (self.use_gc and self.gc_gradient_threshold==3):
            print(f"GC applied to conv layers only")


    def __setstate__(self, state):
        print("set state called")
        super(RangerVA, self).__setstate__(state)


    def step(self, closure=None):
        loss = None
        #Evaluate averages and grad, update param tensors
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad.data.double()
                if grad.is_sparse:
                    raise RuntimeError('Ranger optimizer does not support sparse gradients')

                amsgrad = group['amsgrad']
                smooth = group['smooth']
                grad_transformer = group['grad_transformer']

                p_data_fp32 = p.data.double()

                state = self.state[p]  #get state dict for this param

                if len(state) == 0:
                    state['step'] = 0
                    state['exp_avg'] = torch.zeros_like(p_data_fp32)
                    state['exp_avg_sq'] = torch.zeros_like(p_data_fp32)
                    if amsgrad:
                        # Maintains max of all exp. moving avg. of sq. grad. values
                        state['max_exp_avg_sq'] = torch.zeros_like(p.data)

                    #look ahead weight storage now in state dict
                    state['slow_buffer'] = torch.empty_like(p.data)
                    state['slow_buffer'].copy_(p.data)

                else:
                    state['exp_avg'] = state['exp_avg'].type_as(p_data_fp32)
                    state['exp_avg_sq'] = state['exp_avg_sq'].type_as(p_data_fp32)


                #begin computations
                exp_avg, exp_avg_sq = state['exp_avg'], state['exp_avg_sq']
                beta1, beta2 = group['betas']
                if amsgrad:
                    max_exp_avg_sq = state['max_exp_avg_sq']
                    # Maintains the maximum of all 2nd moment running avg. till now
                    torch.max(max_exp_avg_sq, exp_avg_sq, out=max_exp_avg_sq)
                    # Use the max. for normalizing running avg. of gradient
                    denomc = max_exp_avg_sq.clone()
                else:
                    denomc = exp_avg_sq.clone()
                #GC operation for Conv layers and FC layers
                if grad.dim() > self.gc_gradient_threshold:
                    grad.add_(-grad.mean(dim = tuple(range(1,grad.dim())), keepdim = True))

                state['step'] += 1

                #compute variance mov avg
                exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)
                #compute mean moving avg
                exp_avg.mul_(beta1).add_(1 - beta1, grad)
                buffered = self.radam_buffer[int(state['step'] % 10)]
                if state['step'] == buffered[0]:
                    N_sma, step_size = buffered[1], buffered[2]
                else:
                    buffered[0] = state['step']
                    beta2_t = beta2 ** state['step']
                    N_sma_max = 2 / (1 - beta2) - 1
                    N_sma = N_sma_max - 2 * state['step'] * beta2_t / (1 - beta2_t)
                    buffered[1] = N_sma
                    if N_sma > self.n_sma_threshhold:
                        step_size = math.sqrt((1 - beta2_t) * (N_sma - 4) / (N_sma_max - 4) * (N_sma - 2) / N_sma * N_sma_max / (N_sma_max - 2)) / (1 - beta1 ** state['step'])
                    else:
                        step_size = 1.0 / (1 - beta1 ** state['step'])
                    buffered[2] = step_size


                ##transformer
                if grad_transformer == 'square':
                    grad_tmp = grad**2
                    denomc.sqrt_()
                elif grad_transformer == 'abs':
                    grad_tmp = grad.abs()


                exp_avg_sq.mul_(beta2).add_((1 - beta2)*grad_tmp)

                if group['weight_decay'] != 0:
                    p_data_fp32.add_(-group['weight_decay'] * group['lr'], p_data_fp32)
                bias_correction1 = 1 - beta1 ** state['step']
                bias_correction2 = 1 - beta2 ** state['step']
                step_size = group['lr'] * math.sqrt(bias_correction2) / bias_correction1


                # ...let's use calibrated alr
                if N_sma > self.n_sma_threshhold:
                    if  group['transformer'] =='softplus':
                        sp = torch.nn.Softplus( smooth)
                        denomf = sp( denomc)
                        p_data_fp32.addcdiv_(-step_size, exp_avg, denomf )
                    else:
                        denom = exp_avg_sq.sqrt().add_(group['eps'])
                        p_data_fp32.addcdiv_(-step_size * group['lr'], exp_avg, denom)
                else:
                    p_data_fp32.add_(-step_size * group['lr'], exp_avg)
                p.data.copy_(p_data_fp32)

                #integrated look ahead...
                #we do it at the param level instead of group level
                if state['step'] % group['k'] == 0:
                    slow_p = state['slow_buffer'] #get access to slow param tensor
                    slow_p.add_(self.alpha, p.data - slow_p)  #(fast weights - slow weights) * alpha
                    p.data.copy_(slow_p)  #copy interpolated weights to RAdam param tensor

        return loss

The results I tested on the cifar10 dataset are as follows. Ranger's results look strange

larger learning rate + large weight decay performs better?

Hi all,
My colleague and I tried a combination of (relatively) large Ranger learning rate (say, 0.001) + large weight decay (say, 0.1). Seems the large decay leads to better performance? We tried two different models, and observed 0.5-1.5% increase of ImageNet classification accuracy, but both models were customized models, and not standard ones like Resnet.
Not sure whether anyone else finds similar results.

Collate pip package so that it picks up from main repo.

Actually, there is a pip package but it is based out of a fork of this repo. I think it would make sense to collate this effort to the main repo.

Originally posted by @sarthakpati in #33 (comment)

AttributeError: 'Ranger' object has no attribute 'radam_buffer'

Getting the following error from the most recent version:

AttributeError: 'Ranger' object has no attribute 'radam_buffer'

Making it a python package

Would you like to make this a python package that could be installed with pip? It would be more practical.

I'd like to include it in my repo asteroid and give you proper credit for it.

One way is to install a python package (I can make a PR for that), the other one would be to copy-paste some of the code and point to the license file. Which way would you prefer?

TypeError in GC operation for Conv layers and FC layers

TypeError: mean() received an invalid combination of arguments - got (keepdim=bool, dim=tuple, ), but expected one of:

()
(torch.dtype dtype)
(int dim, torch.dtype dtype)
didn't match because some of the keywords were incorrect: keepdim
(int dim, bool keepdim, torch.dtype dtype)
(int dim, bool keepdim)
didn't match because some of the arguments have invalid types: (dim=tuple, keepdim=bool, )

Is adabelief the best optimizer?

https://paperswithcode.com/paper/adabelief-optimizer-adapting-stepsizes-by-the

Loss stuck after 1 epoch

Just a warning to the curious, I tried to train DCCRN (from https://github.com/mpariente/asteroid) with Ranger2020 (default params) and it was stuck at a large loss after less than 1 epoch, and loss did not improve for another 30 epochs. I did not debug further. Adam with default params works very well.

Ranger and pytorch DDP

I tried Ranger vs. AdamW on single-GPU and 8-GPU setups. While Ranger is better on a single GPU, it performs worse in the DDP setup. Any advice?

Do we need some kind of Learning rate decay with Ranger?

For AdamW people usually add some sort of learning rate decay: linear, cosine triangle, etc. Also, warm up steps are also popular.

Do we need all of these with Ranger or just use a fixed learning rate?

flat+ cosine anneal training curve

I want to test your Ranger, but I only know cosine anneal training. Can you tell me the meaning of "flat"? Thanks.

Spelling, variables and PEP8

Hi and thanks for the code!

I am using your script for my code and while adapting it to PEP8 specs I found a few details that you may want to change. These are style changes that add clarity, but of course it is up to you whether to adhere to PEP8 recommendations or not. I could prepare a pull request as well if you like this style.

required (from torch.optim.optimizer) is not used
itertools is not used

N_sma_threshhold <-- the variable name should not begin with a capital letter, also "threshold" has a typo

k <-- is an important variable with an obscure name, perhaps something like "lookahead_steps" would be more clear?

Multi-line comments should use """comment"""
Normal comments need a space after #

betas=(.95,0.999) (and others) need a space after the comma

You have commented code that is not used, perhaps it would be best to remove it altogether.

Spaces are not consistent and don't agree with PEP8.

What do the "GC operations" mean?

Does it work well for transformers?

I am working on transformers now.
I saw issue #13, but no one there has said they got a better result than AdamW yet.
Has anyone already made Ranger work well when fine-tuning a transformer?

Also, I do not understand the Readme: 'Best training results - use a 75% flat lr, then step down and run lower lr for 25%, or cosine descend last 25%.'
I use a 1e-4 lr now; what is the '75% flat lr'?
What is 'lower lr for 25%'?
Could you show me some demo code for how to adjust the lr, apart from the code that initializes Ranger?
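
As one possible reading of that README sentence, the sketch below keeps the lr flat for the first 75% of epochs and then uses a lower lr for the last 25%, writing the value into the optimizer's param_groups each epoch. The helper names lr_at and set_lr, the factor of 0.1, and the SimpleNamespace stand-in for a real optimizer are all assumptions made for illustration.

```python
from types import SimpleNamespace


def lr_at(epoch, total_epochs, base_lr, flat_fraction=0.75, low_factor=0.1):
    """Flat for the first 75% of epochs, then a single step down.

    `low_factor` is an assumed value; the README does not specify it.
    """
    if epoch < int(total_epochs * flat_fraction):
        return base_lr
    return base_lr * low_factor


def set_lr(optimizer, lr):
    """Write `lr` into every parameter group of a PyTorch-style optimizer."""
    for group in optimizer.param_groups:
        group["lr"] = lr


# Stand-in exposing `param_groups` the way torch.optim optimizers do.
optimizer = SimpleNamespace(param_groups=[{"lr": 1e-4}])
history = []
for epoch in range(20):
    set_lr(optimizer, lr_at(epoch, 20, base_lr=1e-4))
    # A hypothetical train_one_epoch(model, optimizer) call would go here.
    history.append(optimizer.param_groups[0]["lr"])
```

With 20 epochs, epochs 0 to 14 run at 1e-4 and epochs 15 to 19 run at the lower rate; a cosine descent over the last 25% would replace the second return value in lr_at.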

Gradient centralization was updated

Yonghongwei/Gradient-Centralization@d46e4c5

best result: flat learning rate for 75%; does it mean the Ranger optimizer is not sensitive to lr?

Thank you for your excellent work~
I notice that the best Ranger models use a flat learning rate for 75% of training. Does this mean the Ranger optimizer is not sensitive to lr?

Looking forward to your early reply~

Benchmark Adaptive Scheduling of Stochastic Gradients

https://paperswithcode.com/paper/adas-adaptive-scheduling-of-stochastic

Could it beat rangerLars?

Not working using cuda

Variables self.slow_weights are always on cpu.
You can easily fix this by adding a .to() method in Ranger class like so:

def to(self, device):
    # Compare strings with ==; `is` checks object identity, not equality.
    if device == "cuda":
        for i in range(len(self.slow_weights)):
            for j, w in enumerate(self.slow_weights[i]):
                self.slow_weights[i][j] = w.cuda()
    elif device == "cpu":
        for i in range(len(self.slow_weights)):
            for j, w in enumerate(self.slow_weights[i]):
                self.slow_weights[i][j] = w.cpu()

Loading state doesn't seem to be fully working

To save : 'optimizer' : optimizer.state_dict()

optimizer.load_state_dict(checkpoint['optimizer'])

However, I have the impression that restarting the training always brings the accuracy down before it recovers.

Best,
Thomas Chaton

Did you try to fine-tune transformers LM with Ranger?

Recent transformer architectures are very popular in NLP: BERT, GPT-2, RoBERTa, XLNET. Did you try to fine-tune them on some NLP task? If so, what were the best Ranger hyper-parameters and learning rate scheduler?

N_sma_threshhold should be instance variable

Thank you for the great implementation.
I think I found a small part to modify at ranger.py line 116.

original code:
if N_sma > N_sma_threshhold:

proposed code:
if N_sma > self.N_sma_threshhold:

step_counter not set

Hi,
thanks for your work.

I just plugged it into my model and found that step_counter was not set for all param_groups.

I fixed it with this hack:

        #look ahead tracking and updating if latest batch = k
        for group,slow_weights in zip(self.param_groups,self.slow_weights):
            if 'step_counter' not in group:
                group["step_counter"] = 0

but I suspect it's not optimal...
this would mean that self.param_groups changed between the constructor and step(), but I have no idea why. Have you seen something similar before?

Thanks