xuezhemax / apollo
Apollo: An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization
License: Apache License 2.0
This is the first quasi-Newton-like optimizer that really works for training NNs! I really appreciate it!
Recently I tried this optimizer on my dataset, which is quite noisy and difficult (i.e. one cannot get an accuracy above 0.6 on the binary classification task, regardless of the model architecture and optimizer chosen). I did some hyperparameter tuning (though not too much), e.g. changing init_lr from 1e-3 to 1e-2 and lr from 0.01 to 1.0, but got no better results than Adam.
Is this expected? I suppose this is because Apollo needs to estimate the Hessian somehow, while in my setting that is really difficult to do. On the other hand, Adam and other SGD variants rely only on first-order information and are more robust.
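For what it's worth, the tuning described above can be run systematically as a small grid sweep. A minimal sketch, where `train_and_eval` is a placeholder for your own training loop (not part of this repo):

```python
from itertools import product

def sweep(train_and_eval, init_lrs=(1e-3, 1e-2), lrs=(0.01, 0.1, 1.0)):
    """Run train_and_eval for every (init_lr, lr) pair and collect results.
    `train_and_eval` is a placeholder for the user's own training loop."""
    results = {}
    for init_lr, lr in product(init_lrs, lrs):
        results[(init_lr, lr)] = train_and_eval(init_lr=init_lr, lr=lr)
    return results
```

This at least makes it easy to see whether there is any (init_lr, lr) region where Apollo beats Adam on the noisy task, rather than tuning one knob at a time.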
I don't see how to abbreviate "An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization" to Apollo = =
Hello! Thank you so much for this work! I've had great luck with it and am making an implementation in tf2.
I have a question about this line:
Line 111 in 852991f
Is there a benefit to keeping sigma = 1.0 for the 'belief' version, or should I make it 0.01 to match the 'constant' version?
I'm trying to reproduce the LSTM results but got this error. It may not be directly relevant to Apollo, but I wonder how you managed to train an LSTM with AdaHessian. There seems to be a memory leak during the second backward pass in set_hessian().
File "/home/yezhiling.yzl/apollo/optim/adahessian.py", line 91, in set_hessian
hzs = torch.autograd.grad(grads, params, grad_outputs=zs, only_inputs=True, retain_graph=True)
File "/home/yezhiling.yzl/mambaforge/envs/py1.5/lib/python3.8/site-packages/torch/autograd/__init__.py", line 156, in grad
return Variable._execution_engine.run_backward(
RuntimeError: CUDA out of memory. Tried to allocate 470.00 MiB (GPU 0; 15.90 GiB total capacity; 14.45 GiB already allocated; 467.81 MiB free; 14.79 GiB reserved in total by PyTorch)
I tried PyTorch 1.5/1.8.1/1.9.1 but had no luck. I would appreciate it if you could share some details, thanks.
@XuezheMax
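For context on what set_hessian() is doing memory-wise: AdaHessian estimates the Hessian diagonal with Hutchinson's method, averaging z ⊙ (Hz) over random Rademacher vectors z, which is why it needs a second backward pass through a retained graph and roughly doubles peak memory. A minimal, framework-free sketch of the estimator on an explicit matrix (purely illustrative; AdaHessian computes Hz via autograd instead of an explicit H):

```python
import random

def hutchinson_diag(H, samples=2000, seed=0):
    """Estimate diag(H) as the average of z * (H @ z) over random
    Rademacher vectors z (entries +/-1). Off-diagonal cross terms
    cancel in expectation, leaving the diagonal."""
    rng = random.Random(seed)
    n = len(H)
    est = [0.0] * n
    for _ in range(samples):
        z = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        Hz = [sum(H[i][j] * z[j] for j in range(n)) for i in range(n)]
        for i in range(n):
            est[i] += z[i] * Hz[i]
    return [e / samples for e in est]
```

In AdaHessian the Hz product comes from calling torch.autograd.grad on the gradients, so the first backward must retain the graph; reducing the batch size or using activation checkpointing is the usual way to get under the OOM.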
Hello, I replaced Adam with Apollo for Transformer-based machine translation in the fairseq framework, but performance decreased. A colleague of mine who works on reading comprehension tasks saw a decline as well. Beyond the parameter settings, I don't know your specific implementation details, so I would like to ask whether you are willing to publish your NMT work, because the NMT repo link on GitHub is broken.
It seems that quasi-Newton methods have second-order convergence, but in the loss figures shown in the README.md, Apollo behaves like a first-order optimizer (it performs better than SGD but shares a similar learning curve). I am wondering why this happens; could you please explain this for me?
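One way to see why the curves look first-order: Apollo only maintains a diagonal curvature estimate fit from a weak secant condition on successive gradients, and it rectifies that estimate away from zero, so in practice it behaves more like a well-scaled first-order method than a full quasi-Newton method with superlinear convergence. A toy scalar sketch of that kind of update (my simplification for illustration; the actual Apollo update also includes bias-corrected momentum and learning-rate warmup):

```python
def diag_qn_minimize(a=4.0, x0=3.0, lr=0.5, sigma=0.01, steps=30):
    """Minimize f(x) = 0.5 * a * x**2 with a diagonal quasi-Newton update:
    fit curvature B from the weak secant condition B*s ~= y, where
    s = x_t - x_{t-1} and y = g_t - g_{t-1}, then precondition the step
    by the rectified curvature max(|B|, sigma)."""
    grad = lambda x: a * x
    x = x0
    B = 1.0                                   # initial curvature guess
    x_prev, g_prev = x, grad(x)
    x = x - lr * g_prev / max(abs(B), sigma)  # first step
    for _ in range(steps):
        g = grad(x)
        s, y = x - x_prev, g - g_prev
        if s != 0.0:
            B = B + (s * y - B * s * s) / s**2  # weak secant fit: B -> y/s
        x_prev, g_prev = x, g
        x = x - lr * g / max(abs(B), sigma)     # rectified preconditioned step
    return x, B
```

On this quadratic the curvature estimate B recovers the true curvature a after one update, but each step is still just a rescaled gradient step, which matches the first-order-looking learning curves in the README.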
Hi
I think Apollo is a great contribution and have used it with great success for "small" (about half a billion) parameter models. However, with a 3-billion-parameter model I hit GPU memory limits that crash my training run. I've been looking at the code to see if there was something I could do, like deleting variables after they are used in case Python is not freeing them efficiently, but I don't see a path to major reductions in memory usage. Do you have any suggestions for modifications? Maybe a version that stores optimizer state on the CPU, brings it onto the GPU only when needed for calculations, and then releases the GPU memory when done with it?
Thanks in advance!
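For a rough sense of scale: if Apollo keeps about three fp32 state slots per parameter (momentum, the diagonal curvature estimate, and the previous update direction; the slot count is my reading of the algorithm, so treat it as an assumption), a 3B-parameter model needs on the order of 36 GB for optimizer state alone, before gradients and activations, which is why a single 16 GB GPU runs out:

```python
def optimizer_state_gb(n_params, state_slots, bytes_per_slot=4):
    """Back-of-the-envelope optimizer state size in GB for fp32 slots."""
    return n_params * state_slots * bytes_per_slot / 1e9

# optimizer_state_gb(3e9, 3)  -> 36.0 GB (three slots, Apollo assumption)
# optimizer_state_gb(3e9, 2)  -> 24.0 GB (Adam's exp_avg + exp_avg_sq)
```

CPU offloading of optimizer state, as suggested above, is the approach taken by systems like DeepSpeed's ZeRO-Offload, so it seems plausible here as well.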
Hello! Since Apollo computes an approximate Hessian that essentially describes the local loss landscape, I'm wondering whether it could be used to steer training toward wide, flat minima rather than just any minima, perhaps via a term using the Hessian's largest eigenvalues or something similar. This might need to be a term added to the loss, since the eigenvalues would need to be minimized, but I wonder whether training could be steered toward lower Hessian eigenvalues within the optimizer itself.
Just wondering if you guys ever looked into this or have any insights?
Hi,
thanks for sharing your implementation.
I tried the Apollo optimizer for an image segmentation task and compared it with the Adam optimizer.
The following plot shows a comparison of Adam and Apollo (validation metric: intersection over union; larger is better) using the same learning rate setting (1e-3). As you can see, Adam reaches ~84% after <10 epochs, while Apollo needs >200 epochs to reach ~82%. If I change the learning rate for Apollo, the optimization either diverges (for higher learning rates) or takes even longer (for lower learning rates). Settings for Apollo: lr=1e-3, init_lr=1e-2, warmup=100.
Do you have any idea what's happening?
How can I 1.) get faster convergence 2.) get higher accuracy than with Adam?
Best
Harald