Comments (5)
Hi Amir,
I tried using AdaHessian on a problem of mine (it worked very well, by the way, much better than Adam) and decided to implement my suggestion above (making the signature of optimizer.step() the same as that of the other optimizers). I'm sending the code below in case you want to implement it here as well. I could also create a fork and submit a merge request, but that would mean changing the AdaHessian signature in every test file where it is used, and I could miss something.
#*
# @file Different utility functions
# Copyright (c) Zhewei Yao, Amir Gholami, Sheng Shen
# All rights reserved.
# This file is part of AdaHessian library.
#
# AdaHessian is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# AdaHessian is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with adahessian. If not, see <http://www.gnu.org/licenses/>.
#*
import math
import torch
from torch.optim.optimizer import Optimizer
from copy import deepcopy
class Adahessian(Optimizer):
    """Implements the AdaHessian algorithm.

    It has been proposed in `ADAHESSIAN: An Adaptive Second Order Optimizer
    for Machine Learning`.

    Arguments:
        params (iterable): iterable of parameters to optimize or dicts defining
            parameter groups
        lr (float, optional): learning rate (default: 0.15)
        betas (Tuple[float, float], optional): coefficients used for computing
            running averages of the gradient and the squared Hessian diagonal
            (default: (0.9, 0.999))
        eps (float, optional): term added to the denominator to improve
            numerical stability (default: 1e-4)
        weight_decay (float, optional): weight decay (L2 penalty) (default: 0)
        hessian_power (float, optional): Hessian power (default: 1)
    """
    def __init__(self, params, lr=0.15, betas=(0.9, 0.999), eps=1e-4,
                 weight_decay=0, hessian_power=1):
        if not 0.0 <= lr:
            raise ValueError("Invalid learning rate: {}".format(lr))
        if not 0.0 <= eps:
            raise ValueError("Invalid epsilon value: {}".format(eps))
        if not 0.0 <= betas[0] < 1.0:
            raise ValueError(
                "Invalid beta parameter at index 0: {}".format(betas[0]))
        if not 0.0 <= betas[1] < 1.0:
            raise ValueError(
                "Invalid beta parameter at index 1: {}".format(betas[1]))
        if not 0.0 <= hessian_power <= 1.0:
            raise ValueError(
                "Invalid Hessian power value: {}".format(hessian_power))
        defaults = dict(lr=lr, betas=betas, eps=eps,
                        weight_decay=weight_decay, hessian_power=hessian_power)
        super(Adahessian, self).__init__(params, defaults)
    def get_trace(self, params, grads):
        """Compute the Hutchinson estimate of the Hessian diagonal.

        For a random Rademacher vector v, E[v * (Hv)] equals the diagonal of
        the Hessian H, so we form the Hessian-vector product Hv by
        differentiating <grads, v> with respect to the parameters.

        :param params: a list of parameter tensors
        :param grads: a list of gradient tensors that carry a grad_fn
            (i.e. computed with create_graph=True)
        :return: a list of torch tensors estimating the Hessian diagonal
        """
        for i, grad in enumerate(grads):
            if grad.grad_fn is None:
                raise RuntimeError(
                    'Gradient tensor {:} does not have grad_fn. When calling\n'.format(i) +
                    '\t\t\t loss.backward(), make sure the option create_graph is\n' +
                    '\t\t\t set to True.')

        # Rademacher vector: entries are +1 or -1 with equal probability.
        v = [2 * torch.randint_like(p, high=2) - 1 for p in params]

        # Hessian-vector products Hv, obtained by differentiating the
        # gradients against the parameters with v as the output weights.
        hvs = torch.autograd.grad(
            grads,
            params,
            grad_outputs=v,
            only_inputs=True,
            retain_graph=True)

        hutchinson_trace = []
        for hv, vi in zip(hvs, v):
            param_size = hv.size()
            if len(param_size) <= 2:  # for 0/1/2D tensor
                # Hessian diagonal block size is 1 here.
                tmp_output = torch.abs(hv * vi)
                hutchinson_trace.append(tmp_output)
            elif len(param_size) == 4:  # Conv kernel
                # Hessian diagonal block size is the kernel size here:
                # torch.sum() reduces over the spatial dims 2/3 and the
                # result is averaged over the kernel elements.
                tmp_output = torch.abs(torch.sum(torch.abs(
                    hv * vi), dim=[2, 3], keepdim=True)) / vi[0, 1].numel()
                hutchinson_trace.append(tmp_output)
        return hutchinson_trace
    def step(self, closure=None):
        """Perform a single optimization step.

        Arguments:
            closure (callable, optional): A closure that reevaluates the model
                and returns the loss.
        """
        loss = None
        if closure is not None:
            loss = closure()

        # Collect the parameters, their groups and their gradients, so the
        # Hessian diagonal can be estimated in a single autograd.grad call.
        params = []
        groups = []
        grads = []
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is not None:
                    params.append(p)
                    groups.append(group)
                    grads.append(p.grad)

        # get the Hessian diagonal
        hut_traces = self.get_trace(params, grads)

        for (p, group, grad, hut_trace) in zip(params, groups, grads, hut_traces):

            state = self.state[p]

            # State initialization
            if len(state) == 0:
                state['step'] = 0
                # Exponential moving average of gradient values
                state['exp_avg'] = torch.zeros_like(p.data)
                # Exponential moving average of Hessian diagonal square values
                state['exp_hessian_diag_sq'] = torch.zeros_like(p.data)

            exp_avg, exp_hessian_diag_sq = state['exp_avg'], state['exp_hessian_diag_sq']

            beta1, beta2 = group['betas']

            state['step'] += 1

            # Decay the first and second moment running average coefficient
            exp_avg.mul_(beta1).add_(grad.detach_(), alpha=1 - beta1)
            exp_hessian_diag_sq.mul_(beta2).addcmul_(
                hut_trace, hut_trace, value=1 - beta2)

            bias_correction1 = 1 - beta1 ** state['step']
            bias_correction2 = 1 - beta2 ** state['step']

            # take the square root, raise it to the Hessian power, and add eps
            k = group['hessian_power']
            denom = (
                (exp_hessian_diag_sq.sqrt() ** k) /
                math.sqrt(bias_correction2) ** k).add_(group['eps'])

            # make the update
            p.data = p.data - \
                group['lr'] * (exp_avg / bias_correction1 / denom +
                               group['weight_decay'] * p.data)

        return loss
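For reference, with this change the optimizer is used like any other PyTorch optimizer; the only difference is that the backward pass must be called with create_graph=True so that get_trace can differentiate through the gradients. A minimal usage sketch (the model, data and hyperparameters are placeholders, not part of the library):

import torch

# Toy model and data, only to illustrate the call pattern.
model = torch.nn.Linear(10, 1)
criterion = torch.nn.MSELoss()
optimizer = Adahessian(model.parameters(), lr=0.15)

x, y = torch.randn(32, 10), torch.randn(32, 1)

for _ in range(100):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    # create_graph=True keeps the graph of the gradients so that
    # get_trace() can compute Hessian-vector products inside step().
    loss.backward(create_graph=True)
    optimizer.step()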
Hey João,
Many thanks for reaching out and for your interest in our work. That is an excellent suggestion, and we will add an assertion to make sure we only compute the Hessian of layers whose gradient has a grad_fn handle.
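For illustration, a minimal sketch of such a check (a hypothetical standalone helper, not the actual change that will land in the repository) could look like this:

def check_grads_have_graph(params):
    """Assert that every gradient carries a grad_fn, i.e. that backward was
    run with create_graph=True (hypothetical helper, for illustration only)."""
    for i, p in enumerate(params):
        if p.grad is None:
            continue
        assert p.grad.grad_fn is not None, (
            'Gradient of parameter {} has no grad_fn; call '
            'loss.backward(create_graph=True) before optimizer.step().'.format(i))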
Best,
-Amir
Hey João,
It would be great if you could open a PR, and we will merge your code into our codebase.
Thanks,
Zhewei
I added a new pull request, #10 (I'm not very experienced with git, so I apologize in advance for any inconvenience that may arise).
Thanks for the help, João. I have merged the PR and closed the issue.