amirgholami / adahessian
ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning
License: MIT License
Hi
I recently started using the version of AdaHessian from https://github.com/jettify/pytorch-optimizer in the facebookresearch parlai system to see how it works for training chatbots. I am not very experienced in the discipline, so please excuse my clumsy use of the terminology here. The training approach they use divides the training data into minibatches. In a given training epoch, they cycle through the minibatches: for each minibatch they compute and backpropagate the loss to get the gradient of the loss with respect to the model parameters, and then take a gradient descent step to update the parameters. I haven't seen any discussion of using batches with AdaHessian. Does that mean that AdaHessian doesn't work with this batching approach, and all the training samples should be used when computing the loss and its gradient?
Also, can you please confirm that the version of AdaHessian in pytorch-optimizer is the most current version of the code?
Thanks!
While looking at your averaging code (lines 89 to 144 of this file) I noticed that you compute abs(sum(abs(hv * vi))).
As far as I understand it, the outer absolute value is not needed, as you are already summing non-negative terms.
Also, note that if you are using a Rademacher distribution, you can drop the vi term from torch.sum(torch.abs(hv * vi)), since abs(vi) == 1 (but keeping it in place might make the algorithm easier to read, as it keeps the code close to the math).
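To make the suggestion concrete, here is a minimal sketch of the equivalence (the tensor shapes are only illustrative):

import torch

# Illustrative shapes only: hv stands in for a Hessian-vector product of a conv kernel,
# vi for the Rademacher probe used to form it.
hv = torch.randn(8, 4, 3, 3)
vi = torch.randint_like(hv, high=2) * 2.0 - 1.0

current    = torch.abs(torch.sum(torch.abs(hv * vi), dim=[2, 3], keepdim=True))
simplified = torch.sum(torch.abs(hv), dim=[2, 3], keepdim=True)

# The outer abs() is redundant (sum of non-negative terms) and, since abs(vi) == 1,
# the vi factor can be dropped as well.
assert torch.allclose(current, simplified)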
On page 16 of the newest version, it mentions that:
We set dropout as 0.0 for Transformer base/small model.
However, --dropout 0.3 is used in
More importantly, for the lr of AdamW, the paper adopts the lr from this work, which uses lr = 7×10⁻⁴ / 5×10⁻⁴ for Transformer-Base/Big respectively, while lr = 0.0015 is used in
adahessian/transformer/config/adam.sh
Line 17 in d1a3442
I would be grateful if the original training parameters could be provided for reproducing the results.
Hi,
Thanks for this great work. Recently, we tried to train ResNeXt-50 on ImageNet classification using AdaHessian. The implementation we used is from https://github.com/davda54/ada-hessian.
However, I got some weird observations. Please see the training log:
Epoch: 1/120 (adahessian, lr=0.150000, betas=(0.9, 0.999), eps=1.0e-02, lr decay=milestone [40, 80], decay rate=0.100, warmup=100, init_lr=1.0e-03, wd=1.0e-03 (decoupled))
Average loss: 6.1249, top1: 2.74%, top5: 8.40%, time: 9660.5s
Avg loss: 4.7754, top1: 10.54%, top5: 27.53%
Best loss: 4.7754, top1: 10.54%, top5: 27.53%, epoch: 1
Epoch: 2/120 (adahessian, lr=0.150000, betas=(0.9, 0.999), eps=1.0e-02, lr decay=milestone [40, 80], decay rate=0.100, warmup=100, init_lr=1.0e-03, wd=1.0e-03 (decoupled))
Average loss: 4.2148, top1: 18.27%, top5: 38.85%, time: 9638.9s
Avg loss: 3.4256, top1: 27.41%, top5: 53.10%
Best loss: 3.4256, top1: 27.41%, top5: 53.10%, epoch: 2
Epoch: 3/120 (adahessian, lr=0.150000, betas=(0.9, 0.999), eps=1.0e-02, lr decay=milestone [40, 80], decay rate=0.100, warmup=100, init_lr=1.0e-03, wd=1.0e-03 (decoupled))
Average loss: 3.3622, top1: 30.28%, top5: 55.08%, time: 9635.2s
Avg loss: 2.7773, top1: 38.40%, top5: 65.36%
Best loss: 2.7773, top1: 38.40%, top5: 65.36%, epoch: 3
Epoch: 4/120 (adahessian, lr=0.150000, betas=(0.9, 0.999), eps=1.0e-02, lr decay=milestone [40, 80], decay rate=0.100, warmup=100, init_lr=1.0e-03, wd=1.0e-03 (decoupled))
Average loss: 2.9959, top1: 36.21%, top5: 61.72%, time: 9636.2s
Avg loss: 2.6380, top1: 40.47%, top5: 67.98%
Best loss: 2.6380, top1: 40.47%, top5: 67.98%, epoch: 4
Epoch: 5/120 (adahessian, lr=0.150000, betas=(0.9, 0.999), eps=1.0e-02, lr decay=milestone [40, 80], decay rate=0.100, warmup=100, init_lr=1.0e-03, wd=1.0e-03 (decoupled))
Average loss: 2.8171, top1: 39.26%, top5: 64.87%, time: 9630.8s
Avg loss: 2.5880, top1: 41.73%, top5: 68.91%
Best loss: 2.5880, top1: 41.73%, top5: 68.91%, epoch: 5
Epoch: 6/120 (adahessian, lr=0.150000, betas=(0.9, 0.999), eps=1.0e-02, lr decay=milestone [40, 80], decay rate=0.100, warmup=100, init_lr=1.0e-03, wd=1.0e-03 (decoupled))
Average loss: 2.7149, top1: 41.07%, top5: 66.66%, time: 9640.7s
Avg loss: 2.3805, top1: 45.68%, top5: 72.20%
Best loss: 2.3805, top1: 45.68%, top5: 72.20%, epoch: 6
Epoch: 7/120 (adahessian, lr=0.150000, betas=(0.9, 0.999), eps=1.0e-02, lr decay=milestone [40, 80], decay rate=0.100, warmup=100, init_lr=1.0e-03, wd=1.0e-03 (decoupled))
Average loss: 2.6456, top1: 42.30%, top5: 67.90%, time: 9639.8s
Avg loss: 5.2944, top1: 13.36%, top5: 30.77%
Best loss: 2.3805, top1: 45.68%, top5: 72.20%, epoch: 6
Epoch: 8/120 (adahessian, lr=0.150000, betas=(0.9, 0.999), eps=1.0e-02, lr decay=milestone [40, 80], decay rate=0.100, warmup=100, init_lr=1.0e-03, wd=1.0e-03 (decoupled))
Average loss: 2.5855, top1: 43.46%, top5: 68.86%, time: 9637.7s
Avg loss: 14.9700, top1: 0.14%, top5: 0.49%
Best loss: 2.3805, top1: 45.68%, top5: 72.20%, epoch: 6
Epoch: 9/120 (adahessian, lr=0.150000, betas=(0.9, 0.999), eps=1.0e-02, lr decay=milestone [40, 80], decay rate=0.100, warmup=100, init_lr=1.0e-03, wd=1.0e-03 (decoupled))
Average loss: 2.5401, top1: 44.36%, top5: 69.65%, time: 9642.6s
Avg loss: 8.2867, top1: 0.10%, top5: 0.50%
Best loss: 2.3805, top1: 45.68%, top5: 72.20%, epoch: 6
Epoch: 10/120 (adahessian, lr=0.150000, betas=(0.9, 0.999), eps=1.0e-02, lr decay=milestone [40, 80], decay rate=0.100, warmup=100, init_lr=1.0e-03, wd=1.0e-03 (decoupled))
Average loss: 2.5080, top1: 45.03%, top5: 70.24%, time: 9633.9s
Avg loss: 11.4105, top1: 0.10%, top5: 0.50%
Best loss: 2.3805, top1: 45.68%, top5: 72.20%, epoch: 6
We see that for the first 6 epochs, AdaHessian worked well. From the 7th epoch on, the training loss still decreased normally, but the test loss increased and the test accuracy declined rapidly. We have tried several hyper-parameters and different random seeds, but this always happens.
We provided the details of our setting below for your reference.
The implementation of ResNeXt-50 is the standard one in PyTorch. The training is performed across 8 V100 GPUs, with a total batch size of 256 (32 per GPU).
We have tried to search the hyper-parameters: lr in {0.1, 0.15}, eps in {1e-2, 1e-4}, weight decay in {1e-4, 2e-4, 4e-4, 8e-4, 1e-3}. For the other hyper-parameters, we used the default values.
We also applied linear warmup of the learning rate for the first 100 steps; otherwise AdaHessian crashed at the beginning of model training.
Hi,
I've tried using adahessian as a drop-in replacement for adadelta in the PyTorch mnist example (with loss.backward(create_graph=True)), but this produces the error:
NameError: name 'gradsH' is not defined
This variable appears to be undefined in instruction/adahessian.py; is there something I'm missing?
Thanks!
Hey,
How do you get all these plots shown in the GitHub repo?
Can you please point me to the resource?
Hi there,
thank you for making your code available. You use a Jacobian-vector product with torch.autograd.grad() to implement the Hessian-free product Hz. I'm not sure how the operation using autograd avoids being O(n²); it seems like it will compute the full Hessian and then multiply it by the z vector. The finite-difference implementation of the Hessian-free product is Hv ≈ (∇f(x+εv) − ∇f(x))/ε, which requires a second forward and backward pass, and that isn't what you have. Is there something I'm missing? Is there any computational time comparison between your algorithm and first-order optimizers?
Moreover, you mentioned that you backpropagate g^T z, but I don't see that in the code. I think autograd doesn't support backpropagating anything beyond gradients at the moment, and there is no Hessian propagation in PyTorch.
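For reference, here is a minimal sketch of the double-backward trick I assume might be intended, which would get Hz from one extra backward pass through g^T z without ever forming the full Hessian (variable names are mine):

import torch

x = torch.randn(10, requires_grad=True)
loss = (x ** 3).sum()                                    # any scalar loss f(x)

g = torch.autograd.grad(loss, x, create_graph=True)[0]   # gradient, kept in the graph
z = torch.randint_like(x, high=2) * 2.0 - 1.0            # Rademacher probe vector
hz = torch.autograd.grad(g, x, grad_outputs=z)[0]        # Hz = d(g^T z)/dx, cost of roughly one extra backward pass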
Thank you.
Hi, I'm trying to use adahessian in TensorFlow for a simple regression experiment but having trouble.
I have a simple example in this google colab notebook: https://colab.research.google.com/drive/1EbKZ0YHhyu6g8chFlJD74dzWrbo82mbV?usp=sharing
I am getting the following error
ValueError: Variable <tf.Variable 'dense_12/kernel:0' shape=(1, 100) dtype=float32> has `None` for gradient. Please make sure that all of your ops have a gradient defined (i.e. are differentiable). Common ops without gradient: K.argmax, K.round, K.eval.
In the notebook I first write a little training loop that works with standard optimisers such as Adam. See "example training with Adam"
Then in the next section "example training with Adahessian" I basically copy the previous code and make a few modifications to try and get Adahessian to work.
Specifically, I only changed
from
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
to
optimizer = AdaHessian(learning_rate=0.01)
and from
grads = tape.gradient(current_loss, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
to
grads, Hessian = optimizer.get_gradients_hessian(current_loss, model.trainable_weights)
optimizer.apply_gradients_hessian(zip(grads, Hessian, model.trainable_weights))
Can anyone see what I'm doing wrong? Thanks!
Hello! Our static bug checker has found a performance issue in adahessian_tf/run_experiments.py and adahessian_tf/cifar_training_tools.py: cifar_training is repeatedly called in a for loop, but a tf.function-decorated function step is defined and called inside cifar_training.
In that case, when cifar_training is called in a loop, the function step will create a new graph every time, which can trigger a tf.function retracing warning.
Here is the TensorFlow documentation to support this.
Briefly, for better efficiency, it's better to use:

@tf.function
def inner():
    pass

def outer():
    inner()

than:

def outer():
    @tf.function
    def inner():
        pass
    inner()
Looking forward to your reply.
Hi,
First of all, thanks to Zhewei, Amir, and others for the great contribution. I was introduced to AdaHessian (and PyHessian) by your recent talk. I see that, for CV, you have benchmarked image classification. Have you tried this out on object detectors or other CV tasks as well?
Thanks,
Sam
Apologies if I have this wrong, but is there code for the language modelling experiments? I think that /transformer only contains the NMT experiments. Thanks.
Hi,
First of all, really nice work!
I'd like to try your 2nd-order optimizer now, but I only know TensorFlow and all my existing models are implemented in TensorFlow.
Could you provide a TensorFlow version? It seems you only need to implement the method below:
def get_trace(grad, var)
adahessian/adahessian_tf/adahessian.py
Line 384 in bacccec
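In case it helps, here is a rough sketch of how I imagine the Hutchinson estimate could look in TensorFlow 2 with nested gradient tapes (the function name and structure are my assumptions, not the repository's API):

import tensorflow as tf

def hutchinson_trace(model, loss_fn, x, y):
    # Estimate abs(Hv * v) per trainable variable with one Rademacher probe v.
    with tf.GradientTape() as outer_tape:
        with tf.GradientTape() as inner_tape:
            loss = loss_fn(y, model(x, training=True))
        grads = inner_tape.gradient(loss, model.trainable_variables)
        vs = [tf.sign(tf.random.uniform(tf.shape(g), -1.0, 1.0)) for g in grads]  # Rademacher probes
        gv = tf.add_n([tf.reduce_sum(g * v) for g, v in zip(grads, vs)])
    hvs = outer_tape.gradient(gv, model.trainable_variables)  # Hessian-vector products
    return [tf.abs(hv * v) for hv, v in zip(hvs, vs)]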
Do not use NumPy functions within a tf.function-decorated function. Use the TensorFlow implementation if possible.
denom = tf.math.pow(math_ops.sqrt(v / bias_correct2), self.hessian_power) + coefficients['epsilon']
Hi
I recently came across this paper on an improved accuracy Hutchinson method, but I am not well versed enough in the discipline to know if it can be used with AdaHessian. Do you think it can be used to improve AdaHessian?
Hi
I have a quick question. For your transformer or any other application, have you used FP16 when getting gradients from a backward call? In the model I am working with, for any loss scale factor I've tried, backward gives reasonable gradients when I don't set create_graph to True. But when I do set it to True, while some of the gradients are the same as with it set to False, many others show up as NaNs. Everything seems OK when I use FP32 operations, but I'd like to get FP16's advantages in GPU memory and speed.
Any suggestions you can provide would be appreciated!
Hello @amirgholami ,
I'm super excited about this optimizer. Thank you!
I want to use it in a NER task using AllenNLP, but I'm confused because the code differs between the image_classification and transformer examples.
At https://github.com/amirgholami/adahessian/blob/5c176cdcbeacff1d9edfc77062d0bc7594f326a9/image_classification/optim_adahessian.py in function get_trace, we have:
hutchinson_trace = []
for hv, vi in zip(hvs, v):
    param_size = hv.size()
    if len(param_size) <= 2:  # for 0/1/2D tensor
        tmp_output = torch.abs(hv * vi)
        hutchinson_trace.append(tmp_output)  # Hessian diagonal block size is 1 here.
    elif len(param_size) == 4:  # Conv kernel
        tmp_output = torch.abs(torch.sum(torch.abs(
            hv * vi), dim=[2, 3], keepdim=True)) / vi[0, 1].numel()  # Hessian diagonal block size is 9 here: torch.sum() reduces the dim 2/3.
        hutchinson_trace.append(tmp_output)
While in https://github.com/amirgholami/adahessian/blob/bd9f5a6760bf1ba4474e2e8a5fad237a1577d989/transformer/fairseq/optim/adahessian.py we have:
hutchinson_trace = []
for hv, vi in zip(hvs, v):
    param_size = hv.size()
    if len(param_size) <= 1:  # for Bias and LN
        tmp_output = torch.abs(hv * vi) + 0.
        hutchinson_trace.append(tmp_output)
    elif len(param_size) == 2:  # Matrix
        tmp_output1 = torch.abs((hv * vi + 0.)).view(-1, self.block_length)  # flatten to N times self.block_length
        tmp_output2 = torch.abs(torch.sum(tmp_output1, dim=[1])).view(-1) / float(self.block_length)
        tmp_output3 = tmp_output2.repeat_interleave(self.block_length).view(param_size)
        hutchinson_trace.append(tmp_output3)
Which one should I choose?
In my NLP task I have parameters whose number of dimensions varies between 1 and 4. Parameters with 3 dimensions would match neither branch in the loop. Is this correct?
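In case it clarifies the question, the fallback I am currently considering for such 3D parameters (my own guess, not code from the repository) just falls back to a per-element estimate:

import torch

def hutchinson_block(hv, vi):
    # Hypothetical dispatch covering 3D tensors as well; not the repo's actual code.
    param_size = hv.size()
    if len(param_size) <= 2:   # bias / LayerNorm / matrices
        return torch.abs(hv * vi)
    if len(param_size) == 4:   # conv kernels: average over the spatial dims
        return torch.sum(torch.abs(hv * vi), dim=[2, 3], keepdim=True) / vi[0, 1].numel()
    return torch.abs(hv * vi)  # 3D (and anything else): per-element estimate, block size 1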
Hi authors,
I intended to use this method on complex numbers, and it failed with an error message like:
File "optimizer.py", line 433, in <listcomp>
    * torch.randint_like(
RuntimeError: check_random_bounds handles only integral, floating-point and boolean types
I'm wondering if it would be possible to support complex numbers? Thanks.
Ni
Hi Amir,
AdaHessian sounds really promising! Is this talk still happening?
Anyway, I noticed that the signature of the step method in AdaHessian is different from other optimizers, because it requires the list of parameters and gradients as an argument. I wonder if you could not do it directly using the .grad property of the parameters. I think you just need to call loss.backward(retain_graph=True, create_graph=True) instead of only loss.backward(). Then, to make sure the user actually did this when backpropagating the loss gradient, you could check whether each .grad property has a .grad_fn, and if not, issue an error asking the user to use loss.backward(retain_graph=True, create_graph=True).
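A minimal sketch of the check I have in mind (the helper name and message wording are hypothetical):

import torch

def check_second_order_grads(params):
    # Raise if the stored gradients were not created with create_graph=True.
    for p in params:
        if p.grad is not None and p.grad.grad_fn is None:
            raise RuntimeError(
                "Gradients have no grad_fn; call "
                "loss.backward(retain_graph=True, create_graph=True) before step()."
            )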
torch_optimizer/adahessian.py", line 128, in get_trace
hutchinson_trace.append(tmp_output)
UnboundLocalError: local variable 'tmp_output' referenced before assignment
Hello,
I'm a little confused about your experimental settings on ImageNet. Could you please clarify the following questions?
1/ The initial learning rate is set to 0.15. That is to say, the weight decay args.wd / args.weight_decay = 1e-4 / 0.15 on ImageNet. Is that right?
2/ Two lr schedules have been studied in this paper, i.e. the step decay schedule and the plateau-based schedule, but only the one that leads to the better result is reported. Regarding Fig. A.9, the plateau-based schedule seems to be better than the standard step decay schedule for AdaHessian on ImageNet. May I know the best Top-1 accuracy obtained with your method using the step decay schedule? Also, could you further share the hyper-parameter settings of the plateau-based schedule in PyTorch? Do you use all default hyper-parameters?
Many thanks!
adahessian/adahessian_tf/adahessian.py
Line 341 in fe2c574
I noticed that the alpha variable calculated above is not being used. I suspect this is not intended.
Is it possible to use this library with PyTorch Lightning? If so, could you please provide an example?
Using PyTorch Lightning in 'manual mode' with
self.manual_backward(loss, create_graph=True)
was the closest I got, but it still wouldn't work. It ran for a while but crashed after a few batches with:
RuntimeError: Gradient tensor 2 does not have grad_fn. When calling loss.backward(), make sure the option create_graph is set to True.
(even though I did set this)
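For completeness, this is roughly the manual-optimization training_step I was using (a sketch; the module, data, and loss are placeholders):

import pytorch_lightning as pl
import torch
import torch.nn.functional as F

class LitModel(pl.LightningModule):
    def __init__(self, model, optimizer):
        super().__init__()
        self.automatic_optimization = False   # manual optimization mode
        self.model = model
        self._optimizer = optimizer           # e.g. an AdaHessian instance

    def training_step(self, batch, batch_idx):
        opt = self.optimizers()
        x, y = batch
        loss = F.cross_entropy(self.model(x), y)
        opt.zero_grad()
        self.manual_backward(loss, create_graph=True)  # keep the graph for Hessian-vector products
        opt.step()
        return loss

    def configure_optimizers(self):
        return self._optimizer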
Hello,
First, congratulations on developing AdaHessian, it is a great idea!
Second, have you experimented with alternatives to the Rademacher distribution?
A uniform or Gaussian distribution should also work and, depending on the characteristics of the Hessian, might be a better default; a one-line sketch is below.
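For instance (any probe distribution with zero mean and identity covariance works for the Hutchinson estimator; the parameter shape is illustrative):

import torch

p = torch.randn(64, 64)                                   # stand-in for a parameter tensor
v_rademacher = torch.randint_like(p, high=2) * 2.0 - 1.0  # current choice
v_gaussian   = torch.randn_like(p)                        # alternative: N(0, 1) probe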
Have a good day,
Nestor
Hi
For training a chatbot, I want to switch from Adam to AdaHessian as the final step in fine-tuning my model. I have a question about what a reasonable learning rate for AdaHessian would be. For Adam I used fairly small learning rates (starting at 2e-5 and reducing from there), which worked pretty well. However, as I understand it, AdaHessian preconditions the parameter update like an inverse Hessian in a Newton step, and in a Newton step on a quadratic model the ideal learning rate is 1.0. So I assume that I should be using a much larger learning rate for AdaHessian than I have been using for Adam. Do you have any suggestions based on your experience?
Thanks!
The current version does not respect untrainable variables. It can be fixed by placing a simple if-statement, but I'm not sure this is the best place for it, so I'm not suggesting it as a PR and am reporting the issue here instead.
eagerly_outside_functions = ops.executing_eagerly_outside_functions()
update_ops = []
with ops.name_scope(name or self._name, skip_on_eager=True):
    for grad, hess, var in grads_hessian_and_vars:
        # FIX UNTRAINABLE
        if var.trainable:
            def _assume_mirrored(grad, hess):
                if isinstance(grad, ds_values.PerReplica):
                    return ds_values.Mirrored(grad.values), ds_values.Mirrored(hess.values)
                return grad, hess

            grad, hess = nest.map_structure(_assume_mirrored, grad, hess)
            # Colocate the update with variables to avoid unnecessary communication
            # delays. See b/136304694.
            with distribution.extended.colocate_vars_with(var):
                with ops.name_scope("update" if eagerly_outside_functions else
                                    "update_" + var.op.name, skip_on_eager=True):
                    update_ops.extend(distribution.extended.update(
                        var, apply_grad_to_update_var, args=(grad, hess), group=False))
Thank you for your excellent work. Is it possible to port the TensorFlow version of AdaHessian from TensorFlow 2 to TensorFlow 1?
It seems the average Hessian is used for all of the parameters. I want to know how to group my parameters, for example treating one output channel as one block and using that block's average as its diagonal estimate, as in the sketch below.
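Roughly, what I have in mind (a sketch of my intent, not existing code) is:

import torch

# Treat each output channel of a conv kernel [out_ch, in_ch, k, k] as one block and use
# the channel-wise average of abs(Hv * v) as that block's diagonal estimate.
hv = torch.randn(16, 8, 3, 3)                    # Hessian-vector product (illustrative shape)
vi = torch.randint_like(hv, high=2) * 2.0 - 1.0  # Rademacher probe
per_channel = torch.mean(torch.abs(hv * vi), dim=[1, 2, 3], keepdim=True)  # one value per output channel
block_trace = per_channel.expand_as(hv)          # broadcast back to the parameter shape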
It looks like the pre-trained model link has expired.
Could you upload it again?