xuezhemax / apollo
Apollo: An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization
License: Apache License 2.0
This is the first quasi-Newton-like optimizer that really works for training NNs! I really appreciate it!
Recently I tried this optimizer on my dataset, which is quite noisy and difficult (i.e. one cannot get an accuracy above 0.6 on the binary classification task, regardless of the model architecture and optimizer chosen). I did some hyperparameter tuning (though not too much), e.g. changing init_lr from 1e-3 to 1e-2 and lr from 0.01 to 1.0, but got no better results than Adam.
Is this expected? I suppose this is because Apollo needs to estimate the Hessian somehow, while in my setting that is really difficult to do. On the other hand, Adam and other SGD variants rely only on first-order information and are more robust.
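For what it's worth, the tuning described above can be run systematically as a small grid sweep. A minimal sketch, where `train_and_eval` is a placeholder for your own training loop (not part of this repo):

```python
from itertools import product

def sweep(train_and_eval, init_lrs=(1e-3, 1e-2), lrs=(0.01, 0.1, 1.0)):
    """Run train_and_eval for every (init_lr, lr) pair and collect results.
    `train_and_eval` is a placeholder for the user's own training loop."""
    results = {}
    for init_lr, lr in product(init_lrs, lrs):
        results[(init_lr, lr)] = train_and_eval(init_lr=init_lr, lr=lr)
    return results
```

This at least makes it easy to see whether there is any (init_lr, lr) region where Apollo beats Adam on the noisy task, rather than tuning one knob at a time.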
I don't see how to abbreviate "An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization" to Apollo = =
Hello! Thank you so much for this work! I've had great luck with it and am making an implementation in tf2.
I have a question about this line:
Line 111 in 852991f
Is there a benefit to keeping sigma = 1.0 for the 'belief' version, or should I make it 0.01 to match the 'constant' version?
I'm trying to reproduce the LSTM results but got this error. It may not be directly relevant to Apollo, but I wonder how you managed to train an LSTM with AdaHessian. There seems to be a memory leak during the second backward pass in set_hessian().
File "/home/yezhiling.yzl/apollo/optim/adahessian.py", line 91, in set_hessian
hzs = torch.autograd.grad(grads, params, grad_outputs=zs, only_inputs=True, retain_graph=True)
File "/home/yezhiling.yzl/mambaforge/envs/py1.5/lib/python3.8/site-packages/torch/autograd/__init__.py", line 156, in grad
return Variable._execution_engine.run_backward(
RuntimeError: CUDA out of memory. Tried to allocate 470.00 MiB (GPU 0; 15.90 GiB total capacity; 14.45 GiB already allocated; 467.81 MiB free; 14.79 GiB reserved in total by PyTorch)
I tried PyTorch 1.5/1.8.1/1.9.1 but had no luck. I would appreciate it if you could share some details, thanks.
@XuezheMax
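For context on what set_hessian() is doing memory-wise: AdaHessian estimates the Hessian diagonal with Hutchinson's method, averaging z ⊙ (Hz) over random Rademacher vectors z, which is why it needs a second backward pass through a retained graph and roughly doubles peak memory. A minimal, framework-free sketch of the estimator on an explicit matrix (purely illustrative; AdaHessian computes Hz via autograd instead of an explicit H):

```python
import random

def hutchinson_diag(H, samples=2000, seed=0):
    """Estimate diag(H) as the average of z * (H @ z) over random
    Rademacher vectors z (entries +/-1). Off-diagonal cross terms
    cancel in expectation, leaving the diagonal."""
    rng = random.Random(seed)
    n = len(H)
    est = [0.0] * n
    for _ in range(samples):
        z = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        Hz = [sum(H[i][j] * z[j] for j in range(n)) for i in range(n)]
        for i in range(n):
            est[i] += z[i] * Hz[i]
    return [e / samples for e in est]
```

In AdaHessian the Hz product comes from calling torch.autograd.grad on the gradients, so the first backward must retain the graph; reducing the batch size or using activation checkpointing is the usual way to get under the OOM.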
Hello, I replaced Adam with Apollo for Transformer-based machine translation in the fairseq framework, but performance decreased. A colleague of mine who works on reading comprehension tasks saw a decline as well. Beyond the parameter settings, I don't know your specific implementation details, so I would like to ask whether you are willing to publish your NMT work, because the NMT repo link on GitHub is broken.
It seems that quasi-Newton methods have second-order convergence, but in the loss figures shown in the README.md, Apollo behaves like a first-order optimizer (it performs better than SGD but shares a similar learning curve). I am wondering why this happens; could you please explain this for me?
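One way to see why the curves look first-order: Apollo only maintains a diagonal curvature estimate fit from a weak secant condition on successive gradients, and it rectifies that estimate away from zero, so in practice it behaves more like a well-scaled first-order method than a full quasi-Newton method with superlinear convergence. A toy scalar sketch of that kind of update (my simplification for illustration; the actual Apollo update also includes bias-corrected momentum and learning-rate warmup):

```python
def diag_qn_minimize(a=4.0, x0=3.0, lr=0.5, sigma=0.01, steps=30):
    """Minimize f(x) = 0.5 * a * x**2 with a diagonal quasi-Newton update:
    fit curvature B from the weak secant condition B*s ~= y, where
    s = x_t - x_{t-1} and y = g_t - g_{t-1}, then precondition the step
    by the rectified curvature max(|B|, sigma)."""
    grad = lambda x: a * x
    x = x0
    B = 1.0                                   # initial curvature guess
    x_prev, g_prev = x, grad(x)
    x = x - lr * g_prev / max(abs(B), sigma)  # first step
    for _ in range(steps):
        g = grad(x)
        s, y = x - x_prev, g - g_prev
        if s != 0.0:
            B = B + (s * y - B * s * s) / s**2  # weak secant fit: B -> y/s
        x_prev, g_prev = x, g
        x = x - lr * g / max(abs(B), sigma)     # rectified preconditioned step
    return x, B
```

On this quadratic the curvature estimate B recovers the true curvature a after one update, but each step is still just a rescaled gradient step, which matches the first-order-looking learning curves in the README.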
Hi
I think Apollo is a great contribution and have used it with great success for "small" (about half a billion) parameter models. However, with a 3-billion-parameter model I hit GPU memory limits that crash my training run. I've been looking at the code to see if there was something I could do, like deleting variables after they are used in case Python is not freeing them efficiently, but I don't see a path to major reductions in memory usage. Do you have any suggestions for modifications? Maybe a version that stores optimizer state on the CPU, brings it onto the GPU only when needed for calculations, and then releases the GPU memory when done with it?
Thanks in advance!
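For a rough sense of scale: if Apollo keeps about three fp32 state slots per parameter (momentum, the diagonal curvature estimate, and the previous update direction; the slot count is my reading of the algorithm, so treat it as an assumption), a 3B-parameter model needs on the order of 36 GB for optimizer state alone, before gradients and activations, which is why a single 16 GB GPU runs out:

```python
def optimizer_state_gb(n_params, state_slots, bytes_per_slot=4):
    """Back-of-the-envelope optimizer state size in GB for fp32 slots."""
    return n_params * state_slots * bytes_per_slot / 1e9

# optimizer_state_gb(3e9, 3)  -> 36.0 GB (three slots, Apollo assumption)
# optimizer_state_gb(3e9, 2)  -> 24.0 GB (Adam's exp_avg + exp_avg_sq)
```

CPU offloading of optimizer state, as suggested above, is the approach taken by systems like DeepSpeed's ZeRO-Offload, so it seems plausible here as well.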
Hello! Since Apollo computes an approximate Hessian that essentially describes the local loss landscape, I'm wondering whether it could be used to steer training toward wide, flat minima rather than just any minima, perhaps via a term using the Hessian's largest eigenvalues or something similar. This might need to be a term added to the loss, since the eigenvalues would need to be minimized, but I wonder whether training could be steered toward lower Hessian eigenvalues within the optimizer itself.
Just wondering if you guys ever looked into this or have any insights?
Hi,
thanks for sharing your implementation.
I tried the Apollo optimizer for an image segmentation task and compared it with the Adam optimizer.
The following plot shows a comparison of Adam and Apollo (validation metric: intersection over union; larger is better) using the same learning rate setting (1e-3). As you can see, Adam reaches ~84% after <10 epochs, while Apollo needs >200 epochs to reach ~82%. If I change the learning rate for Apollo, the optimization either diverges (for higher learning rates) or takes even longer (for lower learning rates). Settings for Apollo: lr=1e-3, init_lr=1e-2, warmup=100.
Do you have any idea what's happening?
How can I 1.) get faster convergence 2.) get higher accuracy than with Adam?
Best
Harald