apollo's People

Contributors

xuezhemax

apollo's Issues

Any expectation on noisy data?

This is the first quasi-Newton-like optimizer that really works for training NNs! I really appreciate it!

Recently I tried this optimizer on my dataset, which is quite noisy and difficult (i.e., one cannot get an accuracy above 0.6 on the binary classification task, regardless of the model architecture and optimizer chosen). I did some hyper-parameter tuning (though not too much), e.g. setting init_lr from 1e-3 to 1e-2 and lr from 0.01 to 1.0, but got no better results than with Adam.
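For reference, this is roughly the kind of setup I used, as a minimal sketch. I'm assuming the optimizer is importable from this repo's optim package and that the constructor accepts lr, init_lr, and warmup keyword arguments; the exact import path and signature may differ.

    import torch
    from torch import nn
    from torch.utils.data import DataLoader, TensorDataset

    # Assumed import path based on the repo layout (optim/apollo.py); adjust if it differs.
    from optim.apollo import Apollo

    # Toy noisy binary-classification data as a stand-in for my dataset.
    x = torch.randn(1024, 128)
    y = (torch.rand(1024) > 0.5).float()
    loader = DataLoader(TensorDataset(x, y), batch_size=64, shuffle=True)

    model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))

    # One of the settings I tried: target lr of 0.5, warming up from init_lr=1e-2.
    # The keyword names are assumptions; adjust to the actual constructor signature.
    optimizer = Apollo(model.parameters(), lr=0.5, init_lr=1e-2, warmup=500)

    criterion = nn.BCEWithLogitsLoss()
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = criterion(model(xb).squeeze(-1), yb)
        loss.backward()
        optimizer.step()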

Is this expected? I suppose this is because Apollo needs to estimate the Hessian somehow, while in my setting that is really difficult to do. On the other hand, Adam and other SGD variants rely only on first-order information and are more robust.

Why the name Apollo?

I don't see how to abbreviate "An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization" to Apollo = =

CUDA out of memory when training LSTM with AdaHessian

I'm trying to reproduce the results on LSTM but got this error. It may not be directly related to Apollo, but I wonder how you managed to train the LSTM with AdaHessian. There seems to be a memory leak during the second backward pass in set_hessian().

  File "/home/yezhiling.yzl/apollo/optim/adahessian.py", line 91, in set_hessian
    hzs = torch.autograd.grad(grads, params, grad_outputs=zs, only_inputs=True, retain_graph=True)
  File "/home/yezhiling.yzl/mambaforge/envs/py1.5/lib/python3.8/site-packages/torch/autograd/__init__.py", line 156, in grad
    return Variable._execution_engine.run_backward(
RuntimeError: CUDA out of memory. Tried to allocate 470.00 MiB (GPU 0; 15.90 GiB total capacity; 14.45 GiB already allocated; 467.81 MiB free; 14.79 GiB reserved in total by PyTorch)
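
For context, here is my understanding of what set_hessian() is doing, written as a minimal self-contained sketch of Hutchinson's diagonal estimator rather than the repo's exact code. The first backward pass has to run with create_graph=True so that the gradients stay differentiable, and the grad-of-grads call then reuses that graph, which is roughly where the extra memory goes.

    import torch
    from torch import nn

    # Small stand-in model; the same pattern applies to the LSTM language model.
    model = nn.Sequential(nn.Linear(32, 256), nn.Tanh(), nn.Linear(256, 32))
    x = torch.randn(20, 32)
    loss = model(x).pow(2).mean()  # dummy loss

    params = [p for p in model.parameters() if p.requires_grad]

    # First backward: create_graph=True keeps the full graph alive so that the
    # gradients are themselves differentiable, which already costs extra memory.
    grads = torch.autograd.grad(loss, params, create_graph=True)

    # Rademacher probe vectors z with entries in {-1, +1}.
    zs = [torch.randint_like(p, 2) * 2.0 - 1.0 for p in params]

    # Second backward (grad-of-grads): z * (H z) is the Hutchinson estimate of the
    # Hessian diagonal. retain_graph=True keeps the graph around even longer,
    # which seems to be where the OOM comes from.
    hzs = torch.autograd.grad(grads, params, grad_outputs=zs,
                              only_inputs=True, retain_graph=True)
    hessian_diag = [z * hz for z, hz in zip(zs, hzs)]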

I tried PyTorch 1.5/1.8.1/1.9.1 but had no luck. I would appreciate it if you could share some details, thanks.

Apollo applied to NMT

@XuezheMax
Hello, I replaced Adam with Apollo for Transformer-based machine translation in the fairseq framework, but performance dropped. A colleague who works on reading comprehension tasks saw a similar decline. Apart from the hyper-parameter settings, I don't know your specific implementation details, so I would like to ask whether you are willing to publish your NMT code, since the NMT repo link on GitHub is no longer valid.

Question about the convergence order

It seems that quasi-Newton methods should have second-order convergence. But in the loss figures shown in the README.md, Apollo behaves like a first-order optimizer (it performs better than SGD, but shares a similar learning curve). I am wondering why this happens; could you please explain?

Any changes possible to reduce GPU memory usage?

Hi
I think Apollo is a great contribution and have used it with great success for "small" (about half a billion parameter) models. However, when trying it with a 3-billion-parameter model, I hit GPU memory limits that crash my training run. I've been looking at the code to see if there was something I could do, such as deleting variables after they are used in case Python is not freeing them efficiently, but I don't see a path to major reductions in memory usage. Do you have any suggestions for modifications? Maybe a version that stores the optimizer state on the CPU rather than the GPU, brings it onto the GPU only when needed for calculations, and then releases the GPU memory when done with it?
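A rough, generic sketch of the pattern I have in mind (not Apollo-specific; Apollo keeps several buffers per parameter, so each of them would need the same treatment, and the extra host-device copies would add per-step overhead):

    import torch

    class CPUOffloadSGDM(torch.optim.Optimizer):
        """Toy SGD-with-momentum that keeps its per-parameter state on the CPU
        and only materializes it on the GPU inside step()."""

        def __init__(self, params, lr=0.1, momentum=0.9):
            super().__init__(params, dict(lr=lr, momentum=momentum))

        @torch.no_grad()
        def step(self, closure=None):
            loss = closure() if closure is not None else None
            for group in self.param_groups:
                for p in group['params']:
                    if p.grad is None:
                        continue
                    state = self.state[p]
                    if 'buf' not in state:
                        # State lives on the CPU, pinned for faster copies when CUDA is available.
                        buf = torch.zeros_like(p, device='cpu')
                        state['buf'] = buf.pin_memory() if torch.cuda.is_available() else buf
                    # Bring the buffer onto the GPU only for this update...
                    buf = state['buf'].to(p.device, non_blocking=True)
                    buf.mul_(group['momentum']).add_(p.grad)
                    p.add_(buf, alpha=-group['lr'])
                    # ...then push it back to the CPU and drop the GPU copy right away.
                    state['buf'].copy_(buf)
                    del buf
            return loss

It would be used as a drop-in replacement in the training loop; the trade-off is the extra host-device traffic on every step.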
Thanks in advance!

Using Approximate Hessian to steer training towards wide flat minima

Hello! Since Apollo calculates an approximate Hessian, which essentially describes the local loss landscape, I'm wondering if it can be used to steer training toward wide, flat minima rather than just any minima, perhaps by adding a term based on the Hessian's largest eigenvalues or something similar. This might need to be a term added to the loss, since the eigenvalues should be minimized, but I wonder whether training could also be steered toward lower Hessian eigenvalues inside the optimizer itself.
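
To make the idea concrete, here is a rough illustration that adds a Hutchinson-style Hessian-diagonal estimate to the loss as a cheap curvature proxy. This is not taken from the Apollo code, the 0.1 weight is arbitrary, and a real implementation would probably estimate the largest eigenvalue with power iteration instead.

    import torch
    from torch import nn

    def flatness_penalty(loss, params):
        # Rough curvature proxy: mean |z * (H z)| from one Hutchinson probe, where z
        # has Rademacher (+/-1) entries. It stays differentiable, so it can be added
        # to the training loss, at the cost of one extra backward pass.
        grads = torch.autograd.grad(loss, params, create_graph=True)
        zs = [torch.randint_like(p, 2) * 2.0 - 1.0 for p in params]
        hzs = torch.autograd.grad(grads, params, grad_outputs=zs, create_graph=True)
        return torch.stack([(z * hz).abs().mean() for z, hz in zip(zs, hzs)]).mean()

    model = nn.Sequential(nn.Linear(16, 32), nn.Tanh(), nn.Linear(32, 1))
    opt = torch.optim.SGD(model.parameters(), lr=1e-2)
    x, y = torch.randn(64, 16), torch.randn(64, 1)

    for _ in range(10):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        total = loss + 0.1 * flatness_penalty(loss, list(model.parameters()))
        total.backward()
        opt.step()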

Just wondering if you guys ever looked into this or have any insights?

Slow convergence rate and lower accuracy than Adam?

Hi,

thanks for sharing your implementation.
I tried the Apollo optimizer for an image segmentation task and compared with the Adam optimizer.

The following plot shows a comparison of Adam and Apollo (using the validation metric "intersection over union", where larger is better) with the same learning rate setting (1e-3). As you can see, Adam reaches ~84% after fewer than 10 epochs, while Apollo needs more than 200 epochs to reach ~82%. If I change the learning rate for Apollo, the optimization either diverges (for higher learning rates) or takes even longer (for lower learning rates). Settings for Apollo: lr=1e-3, init_lr=1e-2, warmup=100.

[Figure: validation IoU over epochs, Adam vs. Apollo]

Do you have any idea what's happening?
How can I 1) get faster convergence and 2) get higher accuracy than with Adam?
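
For reference, my reading of how lr, init_lr, and warmup interact (an assumption based on the parameter descriptions, not verified against the Apollo code) is a linear ramp from init_lr to lr over the first warmup updates; with my settings that would actually decrease the learning rate from 1e-2 to 1e-3 during warmup:

    def apollo_warmup_lr(step, lr=1e-3, init_lr=1e-2, warmup=100):
        # Assumed linear interpolation from init_lr to lr over `warmup` steps.
        if step >= warmup:
            return lr
        return init_lr + (lr - init_lr) * step / warmup

    for s in (0, 25, 50, 75, 100):
        print(s, apollo_warmup_lr(s))  # starts at 0.01 and decreases to 0.001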

Best
Harald
