clovaai / adamp
AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights (ICLR 2021)
Home Page: https://clovaai.github.io/AdamP/
License: MIT License
Thanks for your work. Can AdamP be used with warm-up, and can warm-up make AdamP work better? Thanks.
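For what it's worth, a minimal sketch of attaching a linear warm-up schedule to AdamP via torch.optim.lr_scheduler.LambdaLR; the model, warmup_steps, and total_steps below are placeholder assumptions, not values from the paper.

import torch
from adamp import AdamP

model = torch.nn.Linear(10, 2)  # placeholder model
optimizer = AdamP(model.parameters(), lr=1e-3, betas=(0.9, 0.999), weight_decay=1e-2)

warmup_steps, total_steps = 500, 10000  # assumed schedule lengths

def lr_lambda(step):
    # Ramp the learning rate linearly up to its base value, then hold it constant.
    return min(1.0, (step + 1) / warmup_steps)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(total_steps):
    # forward / backward for one mini-batch would go here
    optimizer.step()   # no-op in this sketch (no gradients), shown for ordering
    scheduler.step()   # advance the warm-up after the optimizer step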
Table 2 of your paper shows that AdamW on ImageNet is as good as SGDM, which is very exciting. Would you mind sharing the hyperparameters with us? Thanks!
I guess from your paper that ResNet + AdamW uses AdamW(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.0001, amsgrad=False). Is that right? However, I ran an experiment with that setting and the result was two points lower than yours. I'm confused.
What are the hyperparameters for MobileNetV2 + AdamW?
What is the simplest example of params in "optimizer = AdamP(params, lr=0.001, betas=(0.9, 0.999), weight_decay=1e-2)"?
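In the simplest case, params is presumably just the iterable returned by model.parameters(), exactly as with the built-in torch.optim optimizers; a minimal sketch with a throwaway model:

import torch
from adamp import AdamP

model = torch.nn.Linear(10, 2)  # any nn.Module works here
optimizer = AdamP(model.parameters(), lr=0.001, betas=(0.9, 0.999), weight_decay=1e-2)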
Hello,
I am looking into the major differences between Adam and AdamP.
The only difference I can see is the part marked # Projection in the code.
I have been running some baseline cases with AdamP (excluding that projection part) and PyTorch Adam.
However, the results look somewhat different.
Is there any part that I am missing?
(Initial states are fixed the same, seeds are the same too!)
Thanks!
Hello, I am curious whether there is any reason the projection function uses a self-defined cosine similarity instead of something like the code below:
# Drop-in sketch for the method inside the AdamP class; assumes
# import math and import torch.nn.functional as F at module level.
def _projection(self, p, grad, perturb, delta, wd_ratio, eps):
    wd = 1
    expand_size = [-1] + [1] * (len(p.shape) - 1)
    for view_func in [self._channel_view, self._layer_view]:
        g_view = view_func(grad)
        p_view = view_func(p.data)
        cosine_sim = F.cosine_similarity(g_view, p_view, dim=1, eps=eps).abs_()
        if cosine_sim.max() < delta / math.sqrt(p_view.size(1)):
            p_n = p.data / p_view.norm(dim=1).add_(eps).view(expand_size)
            perturb -= p_n * view_func(p_n * perturb).sum(dim=1).view(expand_size)
            wd = wd_ratio
            return perturb, wd
    return perturb, wd
I noticed that you never use the step_size (https://github.com/clovaai/AdamP/blob/master/adamp/adamp.py#L81), which applies bias_correction1 on beta1 and therefore inflates the learning rate in the initial steps of training (see https://arxiv.org/abs/2110.10828). Instead you leave the learning rate as is and only apply bias_correction2, which inflates the estimated variance early in training and thus lowers the effective learning rate in the first few epochs. I don't think you should fix this bug, but it may be helpful to add a comment noting that you skip this step, and maybe even link to my paper stating that this is not a bug :)
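To make the distinction concrete, here is a minimal sketch of a generic Adam-style update (not the repository's code; all names are placeholders) showing where each bias correction enters:

import torch

def adam_style_step(p, grad, exp_avg, exp_avg_sq, step, lr=1e-3, beta1=0.9,
                    beta2=0.999, eps=1e-8, correct_first_moment=True):
    # step is the 1-based update count.
    # Exponential moving averages of the gradient and the squared gradient.
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)

    bias_correction1 = 1 - beta1 ** step  # < 1 early in training
    bias_correction2 = 1 - beta2 ** step  # < 1 early in training

    # Dividing the second moment by bias_correction2 enlarges the denominator
    # early on, which lowers the effective learning rate in the first steps.
    denom = (exp_avg_sq / bias_correction2).sqrt_().add_(eps)

    # Standard Adam also divides lr by bias_correction1, which inflates the
    # step size early on; the variant described above skips this division.
    step_size = lr / bias_correction1 if correct_first_moment else lr

    p.add_(exp_avg / denom, alpha=-step_size)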
Hi,
From the paper, you use a cosine learning rate schedule to train on ImageNet; did you apply it to both AdamP and SGDP?
What is your opinion on using a constant learning rate schedule, or some other schedule, with the Adam family? In my experience, step decay or cosine works better than a constant learning rate.
However, the advantage of an adaptive optimizer is that we don't need to manually tune the learning rate schedule; if we still have to tune it, then what is the advantage of the Adam family?
Here are some good discussions.
Thank you
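For reference, a minimal sketch (not the paper's training script) of pairing AdamP with PyTorch's built-in cosine schedule; the model and the number of epochs are placeholder assumptions:

import torch
from adamp import AdamP

model = torch.nn.Linear(10, 2)  # placeholder model
optimizer = AdamP(model.parameters(), lr=1e-3, betas=(0.9, 0.999), weight_decay=1e-2)

epochs = 100  # assumed training length
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

for epoch in range(epochs):
    # ... train for one epoch, calling optimizer.step() per mini-batch ...
    scheduler.step()  # decay the learning rate once per epoch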
In the README, the weight_decay value is 1e-2. Is that a recommended value, or should we just keep the default value of 0?
Hi, I'm wondering whether the AdamP optimizer may have some unexpected behavior on CPU.
I trained models with the same parameters using AdamP on CUDA and on CPU. AdamP gives excellent outcomes when run on CUDA but, interestingly, not on CPU. More precisely, on CPU it was not able to minimize the train/val loss.
Here's a code snippet that I used to train my model.
https://github.com/Junyoungpark/PGNN/blob/main/train_pgnn.py
Hi,
Thank you for the code release.
I am trying to run MIRNet by changing Adam to AdamP.
However, the training time per epoch nearly doubles.
Is there any way to make it faster?
I tried two environments, Python 3.7 / PyTorch 1.1 / CUDA 9.0 and Python 3.7 / PyTorch 1.4 / CUDA 10.0, but both give the same speed.
Thanks
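Not an answer, but a minimal sketch (all names and sizes are arbitrary) for checking how much of the slowdown comes from the optimizer step itself, by timing Adam against AdamP on a throwaway model:

import time
import torch
from adamp import AdamP

def time_optimizer(opt_cls, n_steps=200, **opt_kwargs):
    model = torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.Linear(256, 256))
    optimizer = opt_cls(model.parameters(), **opt_kwargs)
    x = torch.randn(64, 256)
    start = time.time()
    for _ in range(n_steps):
        optimizer.zero_grad()
        model(x).sum().backward()
        optimizer.step()
    return time.time() - start

print("Adam :", time_optimizer(torch.optim.Adam, lr=1e-3))
print("AdamP:", time_optimizer(AdamP, lr=1e-3))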
Thanks to the authors for the very interesting paper and analysis.
I was wondering whether an equivalent fix for the weight growth could be to normalize the weights of layers that precede normalization layers during training. For example, I would normalize the weights every 10 mini-batches, so that the operation remains cheap.
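To make the suggestion concrete, a purely illustrative sketch (not the paper's method) of periodically rescaling the weight of a conv layer that feeds into BatchNorm; renorm_every and target_norm are arbitrary choices:

import torch

conv = torch.nn.Conv2d(16, 32, 3, bias=False)  # layer followed by BatchNorm
bn = torch.nn.BatchNorm2d(32)                  # makes the conv weight scale-invariant
renorm_every = 10                              # arbitrary period

def renormalize_(weight, target_norm=1.0):
    # Rescale each output channel's filter to a fixed norm; because the layer
    # is followed by BatchNorm, rescaling does not change the network's function.
    with torch.no_grad():
        flat = weight.view(weight.size(0), -1)
        norms = flat.norm(dim=1, keepdim=True).clamp_min(1e-12)
        flat.mul_(target_norm / norms)

for step in range(100):
    # ... forward / backward / optimizer.step() for one mini-batch ...
    if step % renorm_every == 0:
        renormalize_(conv.weight)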
Hi! I encountered the following warning while using AdamP in my project:
..\torch\csrc\utils\python_arg_parser.cpp:756: UserWarning: This overload of add_ is deprecated:
add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
add_(Tensor other, *, Number alpha)
Might this be relevant to the AdamP update?
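The warning refers to the older tensor add_ overload; here is a minimal sketch of the signature change on generic tensors (not the exact lines from adamp.py):

import torch

exp_avg, grad, beta1 = torch.zeros(3), torch.ones(3), 0.9

# Deprecated overload that triggers the warning:
#   exp_avg.mul_(beta1).add_(1 - beta1, grad)
# Current signature with a keyword-only alpha:
exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)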