amirgholami / adahessian
ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning
License: MIT License
Hi
I recently started using the version of AdaHessian from https://github.com/jettify/pytorch-optimizer in the facebookresearch parlai system to see how it works for training chatbots. I am not very experienced in the discipline, so please excuse my clumsy use of the terminology here. The training approach they use divides the training data into minibatches. In a given training epoch, they cycle through the minibatches: for each minibatch they compute and backpropagate the loss to get the gradient of the loss with respect to the model parameters, and then take a gradient descent step to update the parameters. I haven't seen any discussion of using batches with AdaHessian. Does that mean that AdaHessian doesn't work with this batching approach, and all the training samples should be used when computing the loss and its gradient?
Also, can you please confirm that the version of AdaHessian in pytorch-optimizer is the most current version of the code?
Thanks!
While looking at your averaging code (lines 89 to 144 of this file) I noticed that you compute abs(sum(abs(hv * vi))).
As far as I understand it, the outer absolute value is not needed, as you are already summing non-negative terms.
Also, note that if you are using a Rademacher distribution, you can drop the vi term from torch.sum(torch.abs(hv * vi)), since abs(vi) == 1 (but keeping it in place might make the algorithm easier to read, as it keeps the code close to the math).
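To make the suggestion concrete, here is a minimal sketch of the equivalence (the tensor shapes are only illustrative):

import torch

# Illustrative shapes only: hv stands in for a Hessian-vector product of a conv kernel,
# vi for the Rademacher probe used to form it.
hv = torch.randn(8, 4, 3, 3)
vi = torch.randint_like(hv, high=2) * 2.0 - 1.0

current    = torch.abs(torch.sum(torch.abs(hv * vi), dim=[2, 3], keepdim=True))
simplified = torch.sum(torch.abs(hv), dim=[2, 3], keepdim=True)

# The outer abs() is redundant (sum of non-negative terms) and, since abs(vi) == 1,
# the vi factor can be dropped as well.
assert torch.allclose(current, simplified)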
On page 16 of the newest version, it mentions that:
We set dropout as 0.0 for Transformer base/small model.
However, --dropout 0.3 is used in
More importantly, for the lr of AdamW, the paper adopts the lr from this work, which uses lr = 7×10⁻⁴ / 5×10⁻⁴ for Transformer-Base/Big respectively, while lr = 0.0015 is used in
adahessian/transformer/config/adam.sh
Line 17 in d1a3442
I would be grateful if the original training parameters could be provided for reproducing the results.
Hi,
Thanks for this great work. Recently, we tried to train ResNeXt-50 on ImageNet classification using AdaHessian. The implementation we used is from https://github.com/davda54/ada-hessian.
However, I got some weird observations. Please see the training log:
Epoch: 1/120 (adahessian, lr=0.150000, betas=(0.9, 0.999), eps=1.0e-02, lr decay=milestone [40, 80], decay rate=0.100, warmup=100, init_lr=1.0e-03, wd=1.0e-03 (decoupled))
Average loss: 6.1249, top1: 2.74%, top5: 8.40%, time: 9660.5s
Avg loss: 4.7754, top1: 10.54%, top5: 27.53%
Best loss: 4.7754, top1: 10.54%, top5: 27.53%, epoch: 1
Epoch: 2/120 (adahessian, lr=0.150000, betas=(0.9, 0.999), eps=1.0e-02, lr decay=milestone [40, 80], decay rate=0.100, warmup=100, init_lr=1.0e-03, wd=1.0e-03 (decoupled))
Average loss: 4.2148, top1: 18.27%, top5: 38.85%, time: 9638.9s
Avg loss: 3.4256, top1: 27.41%, top5: 53.10%
Best loss: 3.4256, top1: 27.41%, top5: 53.10%, epoch: 2
Epoch: 3/120 (adahessian, lr=0.150000, betas=(0.9, 0.999), eps=1.0e-02, lr decay=milestone [40, 80], decay rate=0.100, warmup=100, init_lr=1.0e-03, wd=1.0e-03 (decoupled))
Average loss: 3.3622, top1: 30.28%, top5: 55.08%, time: 9635.2s
Avg loss: 2.7773, top1: 38.40%, top5: 65.36%
Best loss: 2.7773, top1: 38.40%, top5: 65.36%, epoch: 3
Epoch: 4/120 (adahessian, lr=0.150000, betas=(0.9, 0.999), eps=1.0e-02, lr decay=milestone [40, 80], decay rate=0.100, warmup=100, init_lr=1.0e-03, wd=1.0e-03 (decoupled))
Average loss: 2.9959, top1: 36.21%, top5: 61.72%, time: 9636.2s
Avg loss: 2.6380, top1: 40.47%, top5: 67.98%
Best loss: 2.6380, top1: 40.47%, top5: 67.98%, epoch: 4
Epoch: 5/120 (adahessian, lr=0.150000, betas=(0.9, 0.999), eps=1.0e-02, lr decay=milestone [40, 80], decay rate=0.100, warmup=100, init_lr=1.0e-03, wd=1.0e-03 (decoupled))
Average loss: 2.8171, top1: 39.26%, top5: 64.87%, time: 9630.8s
Avg loss: 2.5880, top1: 41.73%, top5: 68.91%
Best loss: 2.5880, top1: 41.73%, top5: 68.91%, epoch: 5
Epoch: 6/120 (adahessian, lr=0.150000, betas=(0.9, 0.999), eps=1.0e-02, lr decay=milestone [40, 80], decay rate=0.100, warmup=100, init_lr=1.0e-03, wd=1.0e-03 (decoupled))
Average loss: 2.7149, top1: 41.07%, top5: 66.66%, time: 9640.7s
Avg loss: 2.3805, top1: 45.68%, top5: 72.20%
Best loss: 2.3805, top1: 45.68%, top5: 72.20%, epoch: 6
Epoch: 7/120 (adahessian, lr=0.150000, betas=(0.9, 0.999), eps=1.0e-02, lr decay=milestone [40, 80], decay rate=0.100, warmup=100, init_lr=1.0e-03, wd=1.0e-03 (decoupled))
Average loss: 2.6456, top1: 42.30%, top5: 67.90%, time: 9639.8s
Avg loss: 5.2944, top1: 13.36%, top5: 30.77%
Best loss: 2.3805, top1: 45.68%, top5: 72.20%, epoch: 6
Epoch: 8/120 (adahessian, lr=0.150000, betas=(0.9, 0.999), eps=1.0e-02, lr decay=milestone [40, 80], decay rate=0.100, warmup=100, init_lr=1.0e-03, wd=1.0e-03 (decoupled))
Average loss: 2.5855, top1: 43.46%, top5: 68.86%, time: 9637.7s
Avg loss: 14.9700, top1: 0.14%, top5: 0.49%
Best loss: 2.3805, top1: 45.68%, top5: 72.20%, epoch: 6
Epoch: 9/120 (adahessian, lr=0.150000, betas=(0.9, 0.999), eps=1.0e-02, lr decay=milestone [40, 80], decay rate=0.100, warmup=100, init_lr=1.0e-03, wd=1.0e-03 (decoupled))
Average loss: 2.5401, top1: 44.36%, top5: 69.65%, time: 9642.6s
Avg loss: 8.2867, top1: 0.10%, top5: 0.50%
Best loss: 2.3805, top1: 45.68%, top5: 72.20%, epoch: 6
Epoch: 10/120 (adahessian, lr=0.150000, betas=(0.9, 0.999), eps=1.0e-02, lr decay=milestone [40, 80], decay rate=0.100, warmup=100, init_lr=1.0e-03, wd=1.0e-03 (decoupled))
Average loss: 2.5080, top1: 45.03%, top5: 70.24%, time: 9633.9s
Avg loss: 11.4105, top1: 0.10%, top5: 0.50%
Best loss: 2.3805, top1: 45.68%, top5: 72.20%, epoch: 6
We see that for the first 6 epochs, AdaHessian worked well. From the 7th epoch on, the training loss still decreased normally, but the test loss increased and the test accuracy declined rapidly. We have tried several hyper-parameters and different random seeds, but this always happens.
We provided the details of our setting below for your reference.
The implementation of ResNeXt-50 is the standard one in PyTorch. The training is performed across 8 V100 GPUs, with a total batch size of 256 (32 per GPU).
We have tried to search the hyper-parameters: lr in {0.1, 0.15}, eps in {1e-2, 1e-4}, weight decay in {1e-4, 2e-4, 4e-4, 8e-4, 1e-3}. For the other hyper-parameters, we used the default values.
We also applied linear warmup of the learning rate for the first 100 steps; otherwise AdaHessian crashed at the beginning of model training.
Hi,
I've tried using adahessian as a drop-in replacement for adadelta in the PyTorch mnist example (with loss.backward(create_graph=True)), but this produces the error:
NameError: name 'gradsH' is not defined
This variable appears to be undefined in instruction/adahessian.py; is there something I'm missing?
Thanks!
Hey,
How do you get all these plots shown in the GitHub repo?
Can you please point me to the resource?
Hi there,
thank you for making your code available. You use a Jacobian-vector product with torch.autograd.grad() to implement the Hessian-free product Hz. I'm not sure how the operation using autograd avoids being O(n²); it seems like it will compute the full Hessian and then multiply it by the z vector. The finite-difference implementation of the Hessian-free product is Hv ≈ (∇f(x+εv) − ∇f(x))/ε, which requires a second forward and backward pass, and that isn't what you have. Is there something I'm missing? Is there any computational time comparison between your algorithm and first-order optimizers?
Moreover, you mentioned that you backpropagate g^T z, but I don't see that in the code. I think autograd doesn't support backpropagating anything beyond gradients at the moment, and there is no Hessian propagation in PyTorch.
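For reference, here is a minimal sketch of the double-backward trick I assume might be intended, which would get Hz from one extra backward pass through g^T z without ever forming the full Hessian (variable names are mine):

import torch

x = torch.randn(10, requires_grad=True)
loss = (x ** 3).sum()                                    # any scalar loss f(x)

g = torch.autograd.grad(loss, x, create_graph=True)[0]   # gradient, kept in the graph
z = torch.randint_like(x, high=2) * 2.0 - 1.0            # Rademacher probe vector
hz = torch.autograd.grad(g, x, grad_outputs=z)[0]        # Hz = d(g^T z)/dx, cost of roughly one extra backward pass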
Thank you.
Hi, I'm trying to use adahessian in TensorFlow for a simple regression experiment but having trouble.
I have a simple example in this google colab notebook: https://colab.research.google.com/drive/1EbKZ0YHhyu6g8chFlJD74dzWrbo82mbV?usp=sharing
I am getting the following error
ValueError: Variable <tf.Variable 'dense_12/kernel:0' shape=(1, 100) dtype=float32> has `None` for gradient. Please make sure that all of your ops have a gradient defined (i.e. are differentiable). Common ops without gradient: K.argmax, K.round, K.eval.
In the notebook I first write a little training loop that works with standard optimisers such as Adam. See "example training with Adam"
Then in the next section "example training with Adahessian" I basically copy the previous code and make a few modifications to try and get Adahessian to work.
Specifically, I only changed
from
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
to
optimizer = AdaHessian(learning_rate=0.01)
and from
grads = tape.gradient(current_loss, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
to
grads, Hessian = optimizer.get_gradients_hessian(current_loss, model.trainable_weights)
optimizer.apply_gradients_hessian(zip(grads, Hessian, model.trainable_weights))
Can anyone see what I'm doing wrong? Thanks!
Hello! Our static bug checker has found a performance issue in adahessian_tf/run_experiments.py and adahessian_tf/cifar_training_tools.py: cifar_training is repeatedly called in a for loop, but a tf.function-decorated function step is defined and called inside cifar_training.
In that case, when cifar_training is called in a loop, the function step will create a new graph every time, which can trigger a tf.function retracing warning.
Here is the TensorFlow documentation to support this.
Briefly, for better efficiency, it's better to use:

@tf.function
def inner():
    pass

def outer():
    inner()

than:

def outer():
    @tf.function
    def inner():
        pass
    inner()
Looking forward to your reply.
Hi,
First of all, thanks to Zhewei, Amir, and others for the great contribution. I was introduced to AdaHessian (and PyHessian) by your recent talk. I see that, for CV, you have benchmarked image classification. Have you tried this out on object detectors or other CV tasks as well?
Thanks,
Sam
Apologies if I have this wrong, but is there code for the language modelling experiments? I think that /transformer only contains the NMT experiments. Thanks.
Hi,
First of all, really nice work!
I'd like to try your 2nd-order optimizer now, but I only know TensorFlow and all my existing models are implemented in TensorFlow.
Could you provide a TensorFlow version? It seems you only need to implement the method below:
def get_trace(grad, var)
adahessian/adahessian_tf/adahessian.py
Line 384 in bacccec
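In case it helps, here is a rough sketch of how I imagine the Hutchinson estimate could look in TensorFlow 2 with nested gradient tapes (the function name and structure are my assumptions, not the repository's API):

import tensorflow as tf

def hutchinson_trace(model, loss_fn, x, y):
    # Estimate abs(Hv * v) per trainable variable with one Rademacher probe v.
    with tf.GradientTape() as outer_tape:
        with tf.GradientTape() as inner_tape:
            loss = loss_fn(y, model(x, training=True))
        grads = inner_tape.gradient(loss, model.trainable_variables)
        vs = [tf.sign(tf.random.uniform(tf.shape(g), -1.0, 1.0)) for g in grads]  # Rademacher probes
        gv = tf.add_n([tf.reduce_sum(g * v) for g, v in zip(grads, vs)])
    hvs = outer_tape.gradient(gv, model.trainable_variables)  # Hessian-vector products
    return [tf.abs(hv * v) for hv, v in zip(hvs, vs)]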
Do not use NumPy functions within a tf.function-decorated function. Use the TensorFlow implementation if possible.
denom = tf.math.pow(math_ops.sqrt(v / bias_correct2), self.hessian_power) + coefficients['epsilon']
Hi
I recently came across this paper on an improved accuracy Hutchinson method, but I am not well versed enough in the discipline to know if it can be used with AdaHessian. Do you think it can be used to improve AdaHessian?
Hi
I have a quick question. For your transformer or any other application, have you used FP16 when getting gradients from a backward call? In the model I am working with, for any loss scale factor I've tried, backward gives reasonable gradients when I don't set create_graph to True. But when I do set it to True, while some of the gradients are the same as with it set to False, many others show up as NaNs. Everything seems OK when I use FP32 operations, but I'd like to get FP16's advantages in GPU memory and speed.
Any suggestions you can provide would be appreciated!
Hello @amirgholami ,
I'm super excited about this optimizer. Thank you!
I want to use it in a NER task using AllenNLP, but I'm confused because the code differs between the image_classification and transformer examples.
At https://github.com/amirgholami/adahessian/blob/5c176cdcbeacff1d9edfc77062d0bc7594f326a9/image_classification/optim_adahessian.py in function get_trace, we have:
hutchinson_trace = []
for hv, vi in zip(hvs, v):
    param_size = hv.size()
    if len(param_size) <= 2:  # for 0/1/2D tensor
        tmp_output = torch.abs(hv * vi)
        hutchinson_trace.append(tmp_output)  # Hessian diagonal block size is 1 here.
    elif len(param_size) == 4:  # Conv kernel
        tmp_output = torch.abs(torch.sum(torch.abs(
            hv * vi), dim=[2, 3], keepdim=True)) / vi[0, 1].numel()  # Hessian diagonal block size is 9 here: torch.sum() reduces the dim 2/3.
        hutchinson_trace.append(tmp_output)
While in https://github.com/amirgholami/adahessian/blob/bd9f5a6760bf1ba4474e2e8a5fad237a1577d989/transformer/fairseq/optim/adahessian.py we have:
hutchinson_trace = []
for hv, vi in zip(hvs, v):
    param_size = hv.size()
    if len(param_size) <= 1:  # for Bias and LN
        tmp_output = torch.abs(hv * vi) + 0.
        hutchinson_trace.append(tmp_output)
    elif len(param_size) == 2:  # Matrix
        tmp_output1 = torch.abs((hv * vi + 0.)).view(-1, self.block_length)  # flatten to N times self.block_length
        tmp_output2 = torch.abs(torch.sum(tmp_output1, dim=[1])).view(-1) / float(self.block_length)
        tmp_output3 = tmp_output2.repeat_interleave(self.block_length).view(param_size)
        hutchinson_trace.append(tmp_output3)
Which one should I choose?
In my NLP task I have parameters whose number of dimensions varies between 1 and 4. Parameters with 3 dimensions would match neither branch in the loop. Is this correct?
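In case it clarifies the question, the fallback I am currently considering for such 3D parameters (my own guess, not code from the repository) just falls back to a per-element estimate:

import torch

def hutchinson_block(hv, vi):
    # Hypothetical dispatch covering 3D tensors as well; not the repo's actual code.
    param_size = hv.size()
    if len(param_size) <= 2:   # bias / LayerNorm / matrices
        return torch.abs(hv * vi)
    if len(param_size) == 4:   # conv kernels: average over the spatial dims
        return torch.sum(torch.abs(hv * vi), dim=[2, 3], keepdim=True) / vi[0, 1].numel()
    return torch.abs(hv * vi)  # 3D (and anything else): per-element estimate, block size 1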
Hi authors,
I intended to use this method on complex numbers, and it failed with an error message like:
File "optimizer.py", line 433, in <listcomp>
    * torch.randint_like(
RuntimeError: check_random_bounds handles only integral, floating-point and boolean types
I'm wondering if it would be possible to support complex numbers? Thanks.
Ni
Hi Amir,
AdaHessian sounds really promising! Is this talk still happening?
Anyway, I noticed that the signature of the step method in AdaHessian is different from other optimizers, because it requires the list of parameters and gradients as an argument. I wonder if you could not do it directly using the .grad property of the parameters. I think you just need to call loss.backward(retain_graph=True, create_graph=True) instead of only loss.backward(). Then, to make sure the user actually did this when backpropagating the loss gradient, you could check whether each .grad property has a .grad_fn, and if not, issue an error asking the user to use loss.backward(retain_graph=True, create_graph=True).
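A minimal sketch of the check I have in mind (the helper name and message wording are hypothetical):

import torch

def check_second_order_grads(params):
    # Raise if the stored gradients were not created with create_graph=True.
    for p in params:
        if p.grad is not None and p.grad.grad_fn is None:
            raise RuntimeError(
                "Gradients have no grad_fn; call "
                "loss.backward(retain_graph=True, create_graph=True) before step()."
            )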
torch_optimizer/adahessian.py", line 128, in get_trace
hutchinson_trace.append(tmp_output)
UnboundLocalError: local variable 'tmp_output' referenced before assignment
Hello,
I'm a little confused about your experimental settings on ImageNet. Could you please clarify the following questions?
1/ The initial learning rate is set to 0.15. That is to say, the weight decay args.wd / args.weight_decay = 1e-4 / 0.15 on ImageNet. Is that right?
2/ Two lr schedules have been studied in this paper, i.e. the step decay schedule and the plateau-based schedule, but only the one that leads to the better result is reported. Regarding Fig. A.9, the plateau-based schedule seems to be better than the standard step decay schedule for AdaHessian on ImageNet. May I know the best Top-1 accuracy obtained with your method using the step decay schedule? Also, could you further share the hyper-parameter settings of the plateau-based schedule in PyTorch? Do you use all default hyper-parameters?
Many thanks!
adahessian/adahessian_tf/adahessian.py
Line 341 in fe2c574
I noticed that the alpha variable calculated above is not being used. I suspect this is not intended.
Is it possible to use this library with PyTorch Lightning? If so, could you please provide an example?
Using PyTorch Lightning in 'manual mode' with
self.manual_backward(loss, create_graph=True)
was the closest I got, but it still wouldn't work. It ran for a while but crashed after a few batches with:
RuntimeError: Gradient tensor 2 does not have grad_fn. When calling loss.backward(), make sure the option create_graph is set to True.
(even though I did set this)
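For completeness, this is roughly the manual-optimization training_step I was using (a sketch; the module, data, and loss are placeholders):

import pytorch_lightning as pl
import torch
import torch.nn.functional as F

class LitModel(pl.LightningModule):
    def __init__(self, model, optimizer):
        super().__init__()
        self.automatic_optimization = False   # manual optimization mode
        self.model = model
        self._optimizer = optimizer           # e.g. an AdaHessian instance

    def training_step(self, batch, batch_idx):
        opt = self.optimizers()
        x, y = batch
        loss = F.cross_entropy(self.model(x), y)
        opt.zero_grad()
        self.manual_backward(loss, create_graph=True)  # keep the graph for Hessian-vector products
        opt.step()
        return loss

    def configure_optimizers(self):
        return self._optimizer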
Hello,
First, congratulations on developing AdaHessian, it is a great idea!
Second, have you experimented with alternatives to the Rademacher distribution?
A uniform or Gaussian distribution should also work and, depending on the characteristics of the Hessian, might be a better default; a one-line sketch is below.
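For instance (any probe distribution with zero mean and identity covariance works for the Hutchinson estimator; the parameter shape is illustrative):

import torch

p = torch.randn(64, 64)                                   # stand-in for a parameter tensor
v_rademacher = torch.randint_like(p, high=2) * 2.0 - 1.0  # current choice
v_gaussian   = torch.randn_like(p)                        # alternative: N(0, 1) probe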
Have a good day,
Nestor
Hi
For training a chatbot, I want to switch from Adam to AdaHessian as the final step in fine-tuning my model. I have a question about what a reasonable learning rate for AdaHessian would be. For Adam I used fairly small learning rates (starting at 2e-5 and reducing from there), which worked pretty well. However, as I understand it, AdaHessian preconditions the parameter update like an inverse Hessian in a Newton step, and in a Newton step on a quadratic model the ideal learning rate is 1.0. So I assume that I should be using a much larger learning rate for AdaHessian than I have been using for Adam. Do you have any suggestions based on your experience?
Thanks!
The current version does not respect untrainable variables. It can be fixed by placing a simple if-statement, but I'm not sure this is the best place for it, so I'm not suggesting it as a PR and am reporting the issue here instead.
eagerly_outside_functions = ops.executing_eagerly_outside_functions()
update_ops = []
with ops.name_scope(name or self._name, skip_on_eager=True):
    for grad, hess, var in grads_hessian_and_vars:
        # FIX UNTRAINABLE
        if var.trainable:
            def _assume_mirrored(grad, hess):
                if isinstance(grad, ds_values.PerReplica):
                    return ds_values.Mirrored(grad.values), ds_values.Mirrored(hess.values)
                return grad, hess

            grad, hess = nest.map_structure(_assume_mirrored, grad, hess)
            # Colocate the update with variables to avoid unnecessary communication
            # delays. See b/136304694.
            with distribution.extended.colocate_vars_with(var):
                with ops.name_scope("update" if eagerly_outside_functions else
                                    "update_" + var.op.name, skip_on_eager=True):
                    update_ops.extend(distribution.extended.update(
                        var, apply_grad_to_update_var, args=(grad, hess), group=False))
Thank you for your excellent work. Is it possible to port the TensorFlow version of AdaHessian from TensorFlow 2 to TensorFlow 1?
It seems the average Hessian is used for all of the parameters. I want to know how to group my parameters, for example treating one output channel as one block and using that block's average as its diagonal estimate, as in the sketch below.
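Roughly, what I have in mind (a sketch of my intent, not existing code) is:

import torch

# Treat each output channel of a conv kernel [out_ch, in_ch, k, k] as one block and use
# the channel-wise average of abs(Hv * v) as that block's diagonal estimate.
hv = torch.randn(16, 8, 3, 3)                    # Hessian-vector product (illustrative shape)
vi = torch.randint_like(hv, high=2) * 2.0 - 1.0  # Rademacher probe
per_channel = torch.mean(torch.abs(hv * vi), dim=[1, 2, 3], keepdim=True)  # one value per output channel
block_trace = per_channel.expand_as(hv)          # broadcast back to the parameter shape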
It looks like the pre-trained model link has expired.
Could you upload it again?