f-dangel / backpack
BackPACK - a backpropagation package built on top of PyTorch which efficiently computes quantities other than the gradient.
Home Page: https://backpack.pt/
License: MIT License
I was trying to use my second-order optimizer ESGD-M with BatchL2Grad in order to collect information on within-batch gradient variance to estimate stochastic noise (think OpenAI's gradient noise scale paper), and I kept OOMing after maybe six epochs of MNIST training. ESGD-M does a Hessian-vector product internally (not using Backpack stuff, just autograd), so it needs the user to specify create_graph=True. I assume that when I use it with Backpack, something is leaking references to past computational graphs; normally these graphs are garbage collected without issue.
Thank you,
Katherine Crowson
It seems like:
backpack/backpack/core/derivatives/conv2d.py
Line 198 in 03db23f
only works for certain stride/padding/kernel/dilation combos and can return an incompatible shape with:
backpack/backpack/core/derivatives/conv2d.py
Line 201 in 03db23f
From my testing, it works when "in + 2 * padding - dilation * (kernel - 1) - 1" is a multiple of the stride, i.e., when there is no rounding down from the floor operation in the forward pass of the conv2d.
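The condition above can be checked with plain arithmetic. The sketch below uses the output-size formula from the PyTorch Conv2d documentation along one spatial dimension; the helper names `conv_output_size` and `forward_rounds_down` are made up for illustration and are not part of BackPACK.

```python
def conv_output_size(size, kernel, stride=1, padding=0, dilation=1):
    """Spatial output size of a convolution along one dimension."""
    effective = size + 2 * padding - dilation * (kernel - 1) - 1
    return effective // stride + 1


def forward_rounds_down(size, kernel, stride=1, padding=0, dilation=1):
    """True if the floor in the output-size formula discards a remainder."""
    effective = size + 2 * padding - dilation * (kernel - 1) - 1
    return effective % stride != 0


# 7 + 0 - 1 * (3 - 1) - 1 = 4 is a multiple of stride 2: no rounding.
print(conv_output_size(7, kernel=3, stride=2), forward_rounds_down(7, kernel=3, stride=2))
# 8 + 0 - 1 * (3 - 1) - 1 = 5 is not a multiple of 2: the forward pass rounds down.
print(conv_output_size(8, kernel=3, stride=2), forward_rounds_down(8, kernel=3, stride=2))
```

Both inputs give the same output size, which is why the input shape cannot be recovered from the output shape alone when rounding occurs.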
I really like your work on BackPACK; it enabled me to restructure my current research pretty massively.
I have a question, though: is it possible to speed up batchl2norm, for example? I implemented a similar extension that I use, which also relies on convUtils.get_weight_gradient_factors for convolutions. For my use cases this is really slow, though. On a Wide ResNet 28x10, I have steps that take roughly 3 times longer if I use either batchl2norm or my own extension. My guess would be that this is due to the simplification call to convUtils.get_weight_gradient_factors. Is that right?
I guess it would be a lot of work, but is it in general possible to implement batchl2norm as fast as the sum of the gradients performed during backward for convolutions?
Currently, BCELoss, where the neural network maps to a scalar for a single example and a vector for a batch, is not supported, if I am not mistaken. Therefore, for simple binary classification, one needs to replace BCELoss with the standard cross-entropy loss (for multiclass) and use a network with two outputs where only one would be needed.
As you initialize with the square root of the loss Hessian, for binary classification the BCE would be probably better/more exact since in the multiclass case the Hessian is not full rank.
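The rank argument can be illustrated numerically. The sketch below assumes the Hessians are taken with respect to the logits (an assumption made here for illustration): for softmax cross-entropy this is diag(p) - p pᵀ, and for a sigmoid output with binary cross-entropy it is the scalar q(1 - q).

```python
import torch

# Two-class softmax cross-entropy: Hessian w.r.t. the logits is diag(p) - p p^T.
logits = torch.tensor([0.3, -1.2], dtype=torch.float64)
p = torch.softmax(logits, dim=0)
hessian_ce = torch.diag(p) - torch.outer(p, p)

# Count eigenvalues that are clearly non-zero.
eigenvalues = torch.linalg.eigvalsh(hessian_ce)
rank_ce = int((eigenvalues > 1e-6).sum())

# Sigmoid output with binary cross-entropy: the Hessian is the scalar q (1 - q).
z = torch.tensor(0.3, dtype=torch.float64)
q = torch.sigmoid(z)
hessian_bce = q * (1 - q)

print("rank of 2x2 cross-entropy Hessian:", rank_ce)
print("BCE Hessian (scalar):", hessian_bce.item())
```

The 2x2 matrix has one zero eigenvalue because its rows sum to zero, while the scalar Hessian is strictly positive.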
Is there a problem with scalar-output networks? It seems that for MSELoss, a [Batch, 1] output is required even for scalar observations, right?
Hello everyone,
I'd like to compute the gradients of a batch of B M-dimensional vectors stored in a tensor A with respect to parameters param of size K and store them in a B x M x K tensor.
My code looks like this:
for i in range(M):
    param.grad.data.zero_()
    with backpack(BatchGrad()):
        A[:,i].backward(torch.ones_like(A[:,i]), retain_graph=True)
The first iteration works properly but at the second, I get an error I don't understand :
ModuleAttributeError: 'Linear' object has no attribute 'input0'
Do you have any idea of what's going on?
Thanks,
Romain
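For reference, the B x M x K tensor described above can be built with plain autograd, one backward pass per sample and output component. This is a slow baseline sketch that does not use BackPACK; the layer, the sizes and the variable names are made up for illustration. It can serve as a ground truth when checking a faster implementation.

```python
import torch

torch.manual_seed(0)
B, D, M = 4, 3, 2
layer = torch.nn.Linear(D, M)
x = torch.randn(B, D)
A = layer(x)          # shape [B, M]
param = layer.weight  # K = M * D entries

rows = []
for b in range(B):
    per_output = []
    for i in range(M):
        # Gradient of one scalar A[b, i]; keep the graph for the next iteration.
        (g,) = torch.autograd.grad(A[b, i], param, retain_graph=True)
        per_output.append(g.flatten())
    rows.append(torch.stack(per_output))
jacobian = torch.stack(rows)  # shape [B, M, K]
print(jacobian.shape)
```

This needs B * M backward passes, so it only scales to small problems.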
The MC-sampling based tests do not consistently pass.
Running (only the MC-related tests)
pytest -vxk mc
35 times, I get
test/automated_test.py::test_diag_ggn_mc_approx_ggn_montecarlo[Conv2d-ReLU-classification-cpu]
I think it would be great to have first order extension supports for ConvTranspose2d as it is widely used in generative models and various vision tasks.
Thanks again for this amazing library!
If I try to install with pip2 I get the following error:
pip install backpack-for-pytorch
Collecting backpack-for-pytorch
Using cached https://files.pythonhosted.org/packages/33/1e/c54c4e36aa5ae67117f03410d60c363779620b7aa78b0c67245af23f45c7/backpack-for-pytorch-1.0.0.tar.gz
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/private/var/folders/gl/53ck005n3cj8_08d_zzt4jcc0000gn/T/pip-install-T8akay/backpack-for-pytorch/setup.py", line 23, in <module>
with open(REQUIREMENTS_FILE) as f:
IOError: [Errno 2] No such file or directory: 'requirements.txt'
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/gl/53ck005n3cj8_08d_zzt4jcc0000gn/T/pip-install-T8akay/backpack-for-pytorch/
The called methods in ea_jac_t_mat_jac_prod got refactored away during the change of index convention and we did not see it because KFRA for convolutions has no limit in which it converges to a quantity that we can compare with via autodiff.
ea_jac_t_mat_jac_prod
all_in_one
example for a CNN
Would it be possible to include a high-level explanation of what needs to happen to add support for a custom module? Perhaps it could be broken down into the essentials for first-order information, and additional requirements for second-order information.
I am playing around with the DomainBed repository. I noticed that for the implementation of Fishr, they specifically install version 1.3.0 and I was wondering why.
After a bit of experimentation, it seems that it is no longer possible to use backward(inputs=...) where inputs are the parameters of a submodule. I adjusted the example from your documentation to replicate the issue:
from torch.nn import CrossEntropyLoss, Flatten, Linear, Sequential
from backpack import backpack, extend
from backpack.extensions import BatchGrad
from backpack.utils.examples import load_one_batch_mnist
X, y = load_one_batch_mnist(batch_size=512)
model = Sequential(Flatten(), Linear(784, 128), Linear(128, 10)) # I added an additional layer here
lossfunc = CrossEntropyLoss()
model = extend(model)
lossfunc = extend(lossfunc)
loss = lossfunc(model(X), y)
with backpack(BatchGrad()):
    loss.backward(inputs=list(model[-1].parameters())) # I am trying to get the gradient with respect to the last submodule
for name, param in model[-1].named_parameters(): # I only loop over the parameters in the last submodule
    print(name)
    print(".grad.shape: ", param.grad.shape)
    print(".grad_batch.shape: ", param.grad_batch.shape)
With backpack-for-pytorch==1.4.0, this gives
AttributeError: 'Parameter' object has no attribute 'grad_batch'
With backpack-for-pytorch==1.3.0, this prints the expected output:
weight
.grad.shape: torch.Size([10, 128])
.grad_batch.shape: torch.Size([512, 10, 128])
bias
.grad.shape: torch.Size([10])
.grad_batch.shape: torch.Size([512, 10])
I tried going through the git history of this repository to identify what changed between these two versions, but I have not managed to pin down the change that caused this. I was wondering whether this is intentional or a bug.
Since first-order extensions allow the use of most nonparametric operations, the number of supported loss functions shouldn't be so small. If the package contained a loss wrapper of the form
L(x) = mean(x),
where x is a vector of the per-sample losses, this would extend the number of available loss functions a lot. These could then be computed as L(f(x)), where f is the chosen loss.
Note, that this can already be done using the MSE loss through MSE(sqrt(f(x)), 0) if f(x) is a vector, but this is not very clean and involves unnecessary computation, so a designated loss function for this purpose would be nice.
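A minimal sketch of the proposed wrapper and of the MSE workaround, in plain PyTorch. The class name `MeanLoss` is hypothetical and not part of BackPACK, and the tensor `per_sample` stands in for a vector of non-negative per-sample losses f(x).

```python
import torch
from torch import nn


class MeanLoss(nn.Module):
    """Hypothetical wrapper: reduces a vector of per-sample losses to its mean."""

    def forward(self, per_sample_losses):
        return per_sample_losses.mean()


per_sample = torch.tensor([0.5, 2.0, 1.5, 4.0])  # stands in for f(x) >= 0

direct = MeanLoss()(per_sample)
# The workaround from the text: MSE(sqrt(f(x)), 0) = mean(sqrt(f(x))^2) = mean(f(x)).
via_mse = nn.MSELoss()(per_sample.sqrt(), torch.zeros_like(per_sample))
print(direct.item(), via_mse.item())
```

Both expressions agree up to floating-point error, but the workaround only applies when f(x) is non-negative and adds a square root and a square that the wrapper avoids.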
Hi,
I was trying to estimate the variance of the gradients and I observed the following. It seems that the variance is not with respect to the actual gradients but the scaled-down by batch-size version of them. Here's a quick example to illustrate this:
import numpy as np
import torch
from torch import nn
from backpack import backpack, extend, extensions

B = 20 # Batch size
# Create a simple NN
m = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
# And a dummy loss - this needs to be an nn.Module for backpack to work
class Loss(nn.Module):
    def forward(self, x):
        return x.mean()
loss = extend(Loss())
m = extend(m)
batch = torch.rand(B, 10)
The ground-truth variance is estimated by taking the gradients per example and computing their variance:
clear_backprops(m)
gradients = []
for i in range(B):
    m.zero_grad()
    loss(m(batch)[i]).backward(retain_graph=True)
    gradients.append(torch.cat([g.grad.view(-1) for g in list(m.parameters())], dim=0))
gradients = torch.stack(gradients)
ground_truth_variance = gradients.var(0)
print(gradients.var(0).mean())
Here's what backpack returns
m.zero_grad()
with backpack(extensions.Variance()):
    loss(m(batch)).backward()
grad_vars = torch.cat([g.variance.view(-1) for g in list(m.parameters())])
print(grad_vars.mean())
assert np.allclose(ground_truth_variance, grad_vars)
And here's backpack after I scale them back:
m.zero_grad()
with backpack(extensions.BatchGrad()):
    loss(m(batch)).backward()
grad_vars = torch.cat([(g.grad_batch*B).var(0).view(-1) for g in list(m.parameters())])
print(grad_vars.mean())
assert np.allclose(ground_truth_variance, grad_vars)
Let me know what you think
p.
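The factor involved can be reproduced without BackPACK. The sketch below, in plain autograd and double precision, compares the gradients of each sample's own output with the gradients of each sample's contribution f(x_i) / B to the batch mean; their variances differ by B². The helper `flat_grad` is made up for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)
B = 20
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1)).double()
batch = torch.rand(B, 10, dtype=torch.float64)
params = list(model.parameters())


def flat_grad(scalar):
    """Gradient of a scalar w.r.t. all parameters, flattened into one vector."""
    grads = torch.autograd.grad(scalar, params)
    return torch.cat([g.reshape(-1) for g in grads])


# Gradient of each sample's own output f(x_i).
unscaled = torch.stack([flat_grad(model(batch[i : i + 1]).mean()) for i in range(B)])
# Gradient of sample i's contribution f(x_i) / B to the batch mean.
scaled = torch.stack([flat_grad(model(batch)[i].sum() / B) for i in range(B)])

ratio_ok = torch.allclose(scaled.var(0) * B**2, unscaled.var(0), rtol=1e-6, atol=1e-12)
print("variances differ by a factor of B**2:", ratio_ok)
```

Which of the two conventions is the desired one depends on whether the individual gradients are meant to sum or to average to the mini-batch gradient.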
Think about offering extensions that only use a subset of the mini-batch.
Motivation: Curvature is often roughly estimated on a subset of samples used for the gradient
Needs discussion on how to realize. First thoughts:
backpack
PyTorch 1.3 (experimentally) introduces the approach of named tensors. With this syntax the readability of the core module can be improved.
Wait until not experimental anymore.
I am trying to use backpack to calculate the batched gradient of a medium-size neural network on two GPUs. I use the following code to construct the net.
net = extend(net)
net = torch.nn.DataParallel(net)
net.to('cuda')
However, in practice, I encounter the following error.
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:1
If it is possible, would you mind adding a toy example on how to use backpack with multiple GPUs along with torch.nn.DataParallel? This would be very helpful.
Hi,
Thanks for the awesome library!!
I have a use case in which I have multiple loss functions on which I have to call backward without using any reduction like mean or sum. I want to calculate gradients for the different losses in parallel.
losses = [loss1, loss2, loss3]
losses.backward()
print(param.grad)
## It should contain the jacobian of the gradients
As it's possible to calculate gradients wrt every sample in a batch (I don't want gradients for each sample) is it possible to generalize to this use case?
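Without BackPACK, one gradient per loss can be collected with a loop over `torch.autograd.grad` and stacked into a Jacobian. This is a sequential baseline sketch, not the parallel computation asked for; the three losses are made up for illustration.

```python
import torch

torch.manual_seed(0)
param = torch.randn(3, requires_grad=True)
x = torch.randn(3)

loss1 = (param * x).sum()
loss2 = (param ** 2).sum()
loss3 = param.sin().sum()
losses = [loss1, loss2, loss3]

# One backward pass per loss; each row is the gradient of one loss.
rows = [torch.autograd.grad(l, param, retain_graph=True)[0] for l in losses]
jacobian = torch.stack(rows)  # shape [num_losses, num_params]
print(jacobian.shape)
```

Using `torch.autograd.grad` instead of `backward()` returns each gradient directly, so nothing accumulates in `param.grad` between losses.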
AttributeError: 'Conv2d' object has no attribute 'input0'
torch
implements LSTMs etc., help welcome!
Hi,
the examples (diagonal and KFAC) on the website (https://backpack.pt/) are not working. I believe they are meant for a prior version.
Best
Felix
I was able to run the 2nd order examples after using extend(model, use_converter=True) for ResNet18. However, when I try to run the Hutchinson trace example, I get the following error:
NotImplementedError: Extension saving to diag_h does not have an extension for Module <class 'backpack.custom_module.branching.SumModule'>
Is it possible to extend this module in order to be able to compute the Hutchinson trace layerwise for ResNet models?
Thank you,
Jeff
Here is part of the test code:
`def calc_hutchison_trace(model, criterion):
    model.eval()
    model = extend(model,use_converter=True)
    criterion.to(device)
    loss_function = extend(criterion)
    # In the following, we load a batch, compute the loss and trigger the
    # backward pass ``with(backpack(..))`` such that we have access to the extensions that
    # we are going to use (``DiagHessian`` and ``HMP``).
    for i, data in enumerate(trainloader, 0):
        x, y = data
        x = x.to(device)
        y = y.to(device)
        break # Get 1 batch
    def forward_backward_with_backpack():
        """Provide working access to BackPACK's `DiagHessian` and `HMP`."""
        loss = loss_function(model(x), y)
        with backpack(DiagHessian(),HMP()):
            # keep graph for autodiff HVPs
            loss.backward(retain_graph=True)
        return loss
    # Explicit test to see if diag info is created.
    loss = loss_function(model(x), y)
    with backpack(DiagHessian(), BatchDiagHessian()):
        loss.backward()
    for name, param in model.named_parameters():
        print(name)
        print(".grad.shape: ", param.grad.shape)
        print(".diag_h.shape: ", param.diag_h.shape)
        print(".diag_h_batch.shape: ", param.diag_h_batch.shape)
`
Thanks for providing this very useful library.
I was trying to use backpack for a tiny network with Conv2d, BatchNorm2d, and ConvTranspose2d layers.
I set the network mode to eval and then tried to replicate the example here, where the per-sample gradients are verified. Although I was able to replicate the above example (using the provided ResNet), I could not do the same for my tiny conv-deconv network.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchsummary import summary
from backpack import backpack, extend
from backpack.extensions import BatchGrad
import warnings
# To remove backpack warning about using a non-full backward hook
warnings.filterwarnings('ignore')
BATCH_SIZE = 32
IMAGE_SIZE = 64
torch.manual_seed(0)
DEVICE = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# Network
class TinyConvDeconv(torch.nn.Module):
    def __init__(self, use_bn=True):
        super(TinyConvDeconv, self).__init__()
        self.use_bn = use_bn
        #Convolution 1
        self.conv1=nn.Conv2d(in_channels=3,out_channels=16, kernel_size=4)
        nn.init.xavier_uniform_(self.conv1.weight) #Xaviers Initialisation
        if self.use_bn:
            self.bn1 = nn.BatchNorm2d(16)
        self.swish1= nn.ReLU()
        #De Convolution 1
        self.deconv1=nn.ConvTranspose2d(in_channels=16,out_channels=3,kernel_size=4)
        nn.init.xavier_uniform_(self.deconv1.weight)
        self.swish4=nn.ReLU()
    def forward(self,x):
        out=self.conv1(x)
        out=self.swish1(self.bn1(out) if self.use_bn else out)
        out=self.deconv1(out)
        out=self.swish4(out)
        return(out)
# Use random tensors as data
pseudo_x = torch.rand((BATCH_SIZE, 3, IMAGE_SIZE, IMAGE_SIZE)).to(DEVICE)
pseudo_y = torch.rand((BATCH_SIZE, 3, IMAGE_SIZE, IMAGE_SIZE)).to(DEVICE)
conv_deconv = TinyConvDeconv().to(DEVICE)
conv_deconv.eval()
conv_deconv = extend(conv_deconv)
# At this stage BN stats are fixed to 0 (mean) and 1 (var)
print(f'Network train set? {conv_deconv.training}')
conv_deconv.zero_grad()
loss = F.mse_loss(conv_deconv(pseudo_x), pseudo_y, reduction="mean")
with backpack(BatchGrad()):
    loss.backward()
print("{:<20} {:<40} {:<20}".format("Param", "grad", "grad (batch)"))
print("-" * 100)
for name, p in conv_deconv.named_parameters():
    if (not 'bn' in name):
        print(f'{name:<20}, {str(p.grad.shape):<40}, {str(p.grad_batch.shape):<20}')
sample_to_check = 1
x_to_check = pseudo_x[sample_to_check, :].unsqueeze(0)
y_to_check = pseudo_y[sample_to_check].unsqueeze(0)
conv_deconv.zero_grad()
loss = F.mse_loss(conv_deconv(x_to_check), y_to_check)
loss.backward()
print("Do the individual gradients match?")
for name, p in conv_deconv.named_parameters():
    if (not 'bn' in name):
        match = torch.allclose(p.grad_batch[sample_to_check, :], p.grad, atol=1e-5)
        print("{:<20} {}".format(name, match))
I used the same steps as the example, but I could not figure out why the gradients computed for a single sample do not match the grad_batch computed by backpack. Perhaps I am missing something?
The attached notebook also has the code shown here. I use torch 1.9.0+cu102. Any help is appreciated.
Thanks for creating an honestly amazing package for speeding up batched gradient calculation!
I have a few clarity questions regarding what modules are and are not supported and in which cases.
In supported models it is stated that backpack expects models to be sequences (nn.Sequential). However, in the ResNet example this is not the case.
nn.LeakyReLU work? They have parameters but they are not trainable.
Say I define
class Child(nn.Module):
    ...
class Parent(nn.Module):
    def __init__(self):
        super(Parent,self).__init__()
        self.child = Child()
Are the parameters in Child properly tracked by backpack?
Based on the idea of backpack I find a way to extend the application range of BatchGrad to most kinds of pytorch layers without too much effort. https://github.com/ChenAo-Phys/pytorch-Jacobian
It's a simple idea that I really hope you can implement into backpack. It would be nice to see this package getting better.
Hi, thanks a lot for open sourcing backpack, it's a major contribution for the community. I wasn't able to find a better place to ask questions, so here we go:
backward() calls with zero_grad() calls?
Thanks!!
Hi,
Thanks a lot for such promising work! As the group conv is more and more popular in the CV community, do you have a plan to support the derivatives of group conv?
Best wishes!
PyTorch 1.2.0 introduced a Flatten module. Our custom Flatten is redundant.
Need to
torch >= 1.2.0
PyTorch 1.3.0 changed the acceptable inputs for the multinomial function.
The interface is now the same for the cuda and cpu versions.
This breaks the Sampling of the symmetric factorization for the cross entropy loss.
Todo:
kfac
and mc ggn
using low-precision torch.allclose
.multinomial
in derivatives/crossentropyloss.py
to reflect the interface of PyTorch 1.3.0.
The Kronecker-factored quantities require utilities to demonstrate their usage:
From these, one could think about composing
Hi
I need to compute the approximate Hessian for a decoder network. The decoder consists of conv2d and upsample layers. Currently, backpack does not support nn.Upsample. Since it is a non-parametric layer, it might not be too difficult to implement?
Here I define my model and a data point.
import torch
from backpack import backpack, extend
from backpack.extensions import DiagGGNExact
model = torch.nn.Sequential(
    torch.nn.Conv2d(1,8, kernel_size=3, padding=1),
    torch.nn.MaxPool2d(2),
    torch.nn.ReLU(),
    torch.nn.Conv2d(8,8, kernel_size=3, padding=1),
    torch.nn.Upsample(scale_factor=2, mode="nearest"),
    torch.nn.ReLU(),
    torch.nn.Conv2d(8,1, kernel_size=3, padding=1),
    torch.nn.Flatten(),
)
lossfunc = torch.nn.MSELoss()
model = extend(model)
lossfunc = extend(lossfunc)
X = torch.zeros(1,1,8,8)
print(model(X).shape)
b = X.shape[0]
loss = lossfunc(model(X), X.view(b, -1))
with backpack(DiagGGNExact()):
    loss.backward()
for param in model.parameters():
    print(param.diag_ggn_exact)
will return this error
NotImplementedError: Extension saving to diag_ggn_exact does not have an extension for Module <class 'torch.nn.modules.upsampling.Upsample'>
Could you help implement this feature?
I post the full error below. The MWE is a bit long (currently hundreds of lines) and I am still working on it, but is there any specific direction I should be looking at given this error? It looks like Batchnorm is somehow mixed up in the gradient calculation (judging from the error message)?
Traceback (most recent call last):
File "/Users/qiyaowei/DEQ-BNN/mwe.py", line 575, in <module>
model(torch.rand(1,3,32,32)).sum().backward()
File "/Users/qiyaowei/miniconda3/envs/jax/lib/python3.8/site-packages/torch/_tensor.py", line 363, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/Users/qiyaowei/miniconda3/envs/jax/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/Users/qiyaowei/miniconda3/envs/jax/lib/python3.8/site-packages/torch/utils/hooks.py", line 110, in hook
res = user_hook(self.module, grad_input, self.grad_outputs)
File "/Users/qiyaowei/miniconda3/envs/jax/lib/python3.8/site-packages/backpack/__init__.py", line 209, in hook_run_extensions
backpack_extension(module, g_inp, g_out)
File "/Users/qiyaowei/miniconda3/envs/jax/lib/python3.8/site-packages/backpack/extensions/backprop_extension.py", line 127, in __call__
module_extension(self, module, g_inp, g_out)
File "/Users/qiyaowei/miniconda3/envs/jax/lib/python3.8/site-packages/backpack/extensions/module_extension.py", line 97, in __call__
delete_old_quantities = not self.__should_retain_backproped_quantities(module)
File "/Users/qiyaowei/miniconda3/envs/jax/lib/python3.8/site-packages/backpack/extensions/module_extension.py", line 162, in __should_retain_backproped_quantities
is_a_leaf = module.output.grad_fn is None
File "/Users/qiyaowei/miniconda3/envs/jax/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1185, in __getattr__
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'BatchNorm2d' object has no attribute 'output'
BatchL2Grad, perhaps naturally, raises an error when it sees a BatchNorm, since batch normalization mixes gradients in a way that makes the individual contribution hard to discern.
The error says I can ignore it, if I know what I'm doing. I can't say I completely do, but if I ignore it, I do indeed get both grads and batch_l2s on the top levels of my model, which aren't using batch-norm.
I'm happy with that.
My problem is that the lower level parameters - which do use batch norm - don't just have a None batch_l2, but also a None grad.
So my model doesn't train at all.
This seems wrong, since grad is indeed computable, as witnessed by PyTorch being able to do so fine without backpack.
Is there a way I can get batch_l2s on as many of my parameters as possible, but grads on everything?
I can do this now by first calling backward() without backpack, and then calling it again inside with backpack(BatchL2Grad()):, but that seems wasteful.
Hi,
I would like to apply the Cockpit library to my problem, which uses the Gaussian log-likelihood for training. If I only want to look at first-order information, this loss function should already work with Backpack. However, I would be very interested in also seeing the second-order information, for which explicit support in Backpack is needed.
What would it take to integrate this loss? I might be able to contribute as well if it is not too complicated.
The documentation is here: https://pytorch.org/docs/stable/generated/torch.nn.GaussianNLLLoss.html
Thanks!
Hi,
Does backpack allow for the reuse of layers for first-order extensions, like in say a Siamese network? I only need this for first-order extensions, in particular batch grads. An example is given below - this produces a "AttributeError: 'Linear' object has no attribute 'input0'" error.
Thanks!
import torch.nn as nn
import torch
from backpack import backpack, extend
from backpack.extensions import BatchGrad
class TestModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(5, 5)
    def forward(self, x):
        return self.net(x[:, :5]) + self.net(x[:, 5:])
test_module = TestModule()
extend(test_module)
rand_vec = torch.randn(5, 10)
loss = test_module(rand_vec).sum()
with backpack(BatchGrad()):
    loss.backward()
BatchNorm is a special module because it mixes samples within the batch. This needs some special treatment.
PyTorch documentation: BatchNorm1d, BatchNorm2d
If BatchNorm is in evaluation mode (.training=False), the saved statistics are used. This is independent of the batch. Therefore, individual gradients are well defined.
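A small plain-PyTorch check of this claim, and of the contrast with training mode: the output for the first sample is compared before and after perturbing all other samples in the batch. The sizes and the perturbation are arbitrary choices for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(4)
x = torch.randn(8, 4)
x_changed = x.clone()
x_changed[1:] += 5.0  # perturb every sample except the first

bn.eval()  # saved running statistics are used
eval_same = torch.allclose(bn(x)[0], bn(x_changed)[0])

bn.train()  # statistics of the current batch are used
train_same = torch.allclose(bn(x)[0], bn(x_changed)[0])

print("eval mode, first output unchanged:", eval_same)
print("train mode, first output unchanged:", train_same)
```

In evaluation mode the first sample's output ignores the rest of the batch; in training mode it changes with the other samples, which is what makes a per-sample gradient ambiguous there.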
However, if BatchNorm is in training mode (.training=True), the batch statistics are used instead. Therefore, individual gradients are not well defined anymore. I suggest multiple possibilities of realizing individual gradients:
batch_size=1, i.e. E(x)=x and Var(x)=0. Note: This is forbidden by PyTorch.
I favor the third alternative. If the other approaches do have some merit, it is also possible to implement a switch, for example by requesting module.batch_norm_mode: str and executing the given mode if provided.
Note: The current version implements the third alternative. But it has some shortcomings:
env:
python=3.7.0
backpack_for_pytorch==1.4.0
log:
/media/Store/lyj/miniconda3/envs/py3.7/bin/python /media/Store/lyj/workspace/mlsad/debug_backpack.py
Traceback (most recent call last):
File "/media/Store/lyj/workspace/mlsad/debug_backpack.py", line 1, in <module>
import backpack
File "/media/Store/lyj/miniconda3/envs/py3.7/lib/python3.7/site-packages/backpack/__init__.py", line 10, in <module>
from backpack import extensions
File "/media/Store/lyj/miniconda3/envs/py3.7/lib/python3.7/site-packages/backpack/extensions/__init__.py", line 3, in <module>
from .curvmatprod import GGNMP, HMP, PCHMP
File "/media/Store/lyj/miniconda3/envs/py3.7/lib/python3.7/site-packages/backpack/extensions/curvmatprod/__init__.py", line 24, in <module>
from .ggnmp import GGNMP
File "/media/Store/lyj/miniconda3/envs/py3.7/lib/python3.7/site-packages/backpack/extensions/curvmatprod/ggnmp/__init__.py", line 21, in <module>
from backpack.extensions.secondorder.base import SecondOrderBackpropExtension
File "/media/Store/lyj/miniconda3/envs/py3.7/lib/python3.7/site-packages/backpack/extensions/secondorder/__init__.py", line 27, in <module>
from backpack.extensions.secondorder.diag_ggn import (
File "/media/Store/lyj/miniconda3/envs/py3.7/lib/python3.7/site-packages/backpack/extensions/secondorder/diag_ggn/__init__.py", line 50, in <module>
from backpack.custom_module.branching import SumModule
File "/media/Store/lyj/miniconda3/envs/py3.7/lib/python3.7/site-packages/backpack/custom_module/branching.py", line 2, in <module>
from typing import Any, OrderedDict, Tuple, Union
ImportError: cannot import name 'OrderedDict' from 'typing' (/media/Store/lyj/miniconda3/envs/py3.7/lib/python3.7/typing.py)
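The traceback is consistent with the Python version listed above: `typing.OrderedDict` was added in Python 3.7.2, so 3.7.0 does not provide it. A quick check for a given interpreter, using only the standard library:

```python
import sys
import typing

# typing.OrderedDict was added in Python 3.7.2, so 3.7.0 and 3.7.1 lack it.
has_alias = hasattr(typing, "OrderedDict")
new_enough = sys.version_info >= (3, 7, 2)
print(sys.version.split()[0], "typing.OrderedDict available:", has_alias)
```

If this prints `False`, upgrading the interpreter to a later 3.7 patch release or newer makes the name importable.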
Not every backpack extension requires that layers track all inputs and the output. By default, everything should be tracked to allow for easy extension of backpack.
If it is known in advance that just one specific extension will be used, memory performance can be improved by only storing the required information.
Hello,
I am wondering whether it is possible to extend only part of the model, if I only want to get the batch gradient of the last several layers.
I think model = extend(model) will waste memory if only the batch gradient of the last several layers is needed.
For example, if I only want to extend the last two layers (let's say the last two layers are fc1 and fc2) of a large model, can I do something like this:
model.fc1 = extend(model.fc1)
model.fc2 = extend(model.fc2)
When I train an extended model on GPU, some of the buffers in the model are stored on the CPU, which leads to a runtime error. I use the following code for extending the model:
net = extend(net)
net = torch.nn.DataParallel(net)
net = net.to('cuda')
In practice, everything goes fine in the first training epoch and the buffers are stored on cuda. When testing starts, all buffers appear to be stored on cpu.
My current strategy is to add a line
net = net.to('cuda')
in each iteration of training/testing.
This problem does not appear if I do not use net = extend(net).
I hope that this problem can be solved.
First of all, thanks for your great lib.
Is torch.cat a supported operation in the computation graph? It seems that using concatenation with second-order extensions causes an error in backward():
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-14-427e1d95b045> in <module>
21
22 with backpack(DiagHessian()):
---> 23 loss.backward(create_graph=True,)
24
~/anaconda3/lib/python3.7/site-packages/torch/tensor.py in backward(self, gradient, retain_graph, create_graph)
196 products. Defaults to ``False``.
197 """
--> 198 torch.autograd.backward(self, gradient, retain_graph, create_graph)
199
200 def register_hook(self, hook):
~/anaconda3/lib/python3.7/site-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
98 Variable._execution_engine.run_backward(
99 tensors, grad_tensors, retain_graph, create_graph,
--> 100 allow_unreachable=True) # allow_unreachable flag
101
102
RuntimeError: 'NoneType' object is not subscriptable
When a view can be performed, reshape essentially does a view. The backward pass used to be less efficient, but that has been fixed in pytorch 1.4.0 (by pytorch/pytorch#28901).
backpack/backpack/utils/ein.py
Lines 175 to 180 in 3122de0
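The relationship between the two calls can be observed directly in plain PyTorch. The sketch below checks that `reshape` on a contiguous tensor shares memory with its input, and that on a non-contiguous tensor `view` raises while `reshape` falls back to a copy.

```python
import torch

x = torch.arange(12).reshape(3, 4)

# Contiguous input: reshape can return a view that shares memory with x.
r = x.reshape(4, 3)
shares_memory = r.data_ptr() == x.data_ptr()

# Non-contiguous input: view fails, reshape falls back to a copy.
xt = x.t()
try:
    xt.view(12)
    view_failed = False
except RuntimeError:
    view_failed = True
copied = xt.reshape(12).data_ptr() != xt.data_ptr()

print(shares_memory, view_failed, copied)
```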
Using backpack with respect to one set of parameters after using torch.autograd.grad with respect to a different set of parameters. This may not be easy to implement as it is a second order extension (a gradient of gradients), but it would be awesome to support something like this.
Considering the following code
import torch
import torch.nn as nn
from backpack import backpack, extend
from backpack.extensions import BatchGrad
net = nn.Linear(2,1)
input = torch.randn(5,2, requires_grad=True) + 1
model = extend(net)
parameters = tuple(net.parameters())
out = net(input).pow(2)
# gradient with respect to input
grad_input = torch.autograd.grad(out.sum(), input, create_graph=True)
grad_input = torch.cat([g.flatten(start_dim=1) for g in grad_input])
print(grad_input.shape) # torch.Size([5,2])
# gradients with respect to parameters --- Fails here!
with backpack(BatchGrad()):
    _ = torch.autograd.grad(grad_input.sum(), parameters, create_graph=True)
print(parameters[0].grad_batch) # <- need this
Grad size after first autograd torch.Size([5, 2])
Traceback (most recent call last):
File "grad_tests.py", line 28, in <module>guess
g = torch.autograd.grad(grad.sum(), parameters1[0], create_graph=True)
File "path_to_torch/autograd/__init__.py", line 156, in grad
return Variable._execution_engine.run_backward(
RuntimeError: 'Linear' object has no attribute 'input0'
Hi, thank you for developing and maintaining this library.
If I understand correctly, backpack 1.4 does not support ResNet with KFAC, right?
When I changed DiagGGNExact of the tutorial to KFAC (and changed AdaptiveAvgPooling to AvgPooling), it raised the following error:
~/.miniconda/.../backpack/extensions/backprop_extension.py in __get_module_extension(self, module)
97 if self._fail_mode is FAIL_ERROR:
98 # PyTorch converts this Error into a RuntimeError for torch<1.7.0
---> 99 raise NotImplementedError(
100 f"Extension saving to {self.savefield} "
101 "does not have an extension for "
NotImplementedError: Extension saving to kfac does not have an extension for Module <class 'backpack.custom_module.branching.SumModule'>
It would be great if you could support KFAC for ResNet or let me know some modifications necessary if possible. Thank you.
Hi,
Thanks for this great library. I am wondering how we can compute the gradient of a sum of gradients.
I am trying to implement the Hessian-Vector-Product (HVP) with the following code:
def batch_hvp(self, model, loss, params_list, batch_grad_list):
    if len(params_list) != len(batch_grad_list):
        raise (ValueError("w and v must have the same length."))
    one_sample_grad_list = grad(loss, params_list, retain_graph=True, create_graph=True)
    elemwise_products = 0
    for grad_elem, v_elem in zip(one_sample_grad_list, batch_grad_list):
        sum_over_dims = []
        for i in range(len(v_elem.shape)):
            sum_over_dims.append(i)
        sum_over_dims = tuple(sum_over_dims[1:])
        elemwise_products += torch.sum(grad_elem.unsqueeze(0) * v_elem.detach(), sum_over_dims)
    with backpack(BatchGrad()):
        elemwise_products.backward() # problem: has no attribute 'input0'
    return_grads = [p.grad_batch for p in model.parameters() if p.requires_grad]
    return return_grads
I encounter the " has no attribute 'input0' " problem when I call backward(). Is it possible to get batch gradients for return_grads?
For now, I am only using a for loop to compute the gradient.
def batch_hvp(self, model, loss, params_list, batch_grad_list):
    if len(params_list) != len(batch_grad_list):
        raise (ValueError("w and v must have the same length."))
    one_sample_grad_list = grad(loss, params_list, retain_graph=True, create_graph=True)
    elemwise_products = 0
    for grad_elem, v_elem in zip(one_sample_grad_list, batch_grad_list):
        sum_over_dims = []
        for i in range(len(v_elem.shape)):
            sum_over_dims.append(i)
        sum_over_dims = tuple(sum_over_dims[1:])
        elemwise_products += torch.sum(grad_elem.unsqueeze(0) * v_elem.detach(), sum_over_dims)
    # The for-loop version
    grad_cache = []
    for i in range(elemwise_products.shape[0]):
        elemwise_products[i].backward(retain_graph=True)
        grad_cache.append([p.grad.clone() for p in model.parameters() if p.requires_grad])
    grad_cache = list(zip(*grad_cache))
    return_grads = []
    for l_id in range(len(grad_cache)):
        return_grads.append(torch.cat([g.unsqueeze(0) for g in grad_cache[l_id]], dim=0))
    return return_grads
Thanks in advance!
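For comparison, the underlying double-backward pattern for a single vector can be written in a few lines of plain autograd. This sketch does not use BackPACK and does not batch over several vectors; the quadratic loss and the matrix `A` are made up so that the result can be checked against the known Hessian.

```python
import torch

torch.manual_seed(0)
w = torch.randn(3, requires_grad=True)
A = torch.tensor([[2.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 4.0]])
loss = 0.5 * w @ A @ w  # Hessian of this quadratic is A (A is symmetric)

v = torch.tensor([1.0, -1.0, 2.0])
# First backward pass keeps the graph so the gradient can be differentiated again.
(g,) = torch.autograd.grad(loss, w, create_graph=True)
# Differentiating the scalar <g, v> gives the Hessian-vector product H v.
(hvp,) = torch.autograd.grad(g @ v, w)
print(hvp)
```

Because `torch.autograd.grad` returns the result instead of writing to `.grad`, repeated calls do not accumulate into the parameters.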
As soon as pytorch/pytorch#60524 is resolved, we can delete our warnings:
check_parameters
test_ea_jac_t_mat_jac_prod
and test_jac_t_mat_prod
The pooling indices in MaxPool2d are currently computed by performing a second forward pass, which can be avoided.
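In plain PyTorch, the indices are available from the same call that computes the pooled output via `return_indices=True`. The sketch below only demonstrates that call and checks that the indices address the maxima; how this would be wired into BackPACK's hooks is not shown and is left open here.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 1, 4, 4)
# One call returns both the pooled output and the positions of the maxima.
out, idx = F.max_pool2d(x, kernel_size=2, return_indices=True)

# idx holds flat positions within each [H * W] input plane.
gathered = x.flatten(start_dim=2).gather(2, idx.flatten(start_dim=2)).view_as(out)
print(torch.equal(gathered, out))
```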
Thanks for this great library!
I found a bug in the example showcasing the second-order implementations for writing an optimizer (https://docs.backpack.pt/en/master/use_cases/example_diag_ggn_optimizer.html#sphx-glr-use-cases-example-diag-ggn-optimizer-py).
This mistake is within the example in the documentation and might hinder people from getting started with approximate second-order methods.
losses = []
accuracies = []
for batch_idx, (x, y) in enumerate(mnist_loader):
    x, y = x.to(DEVICE), y.to(DEVICE)
    outputs = model(x)
    loss = loss_function(outputs, y)
is missing the zeroing of gradients. While backpack seems to overwrite its quantities on each .backward(), the gradient just accumulates here. I ran the example for several epochs, which results in divergence because the gradient is never reset. I don't think this is intended, right?
losses = []
accuracies = []
for batch_idx, (x, y) in enumerate(mnist_loader):
    x, y = x.to(DEVICE), y.to(DEVICE)
    # ---------
    model.zero_grad()
    # ---------
    outputs = model(x)
    loss = loss_function(outputs, y)
This changes the example quite drastically, and the learning rate probably needs to be adjusted to get the same learning curve.
Hope this helps.
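The accumulation behaviour of `.grad` described above can be seen in a few lines of plain PyTorch, independent of the documentation example:

```python
import torch

w = torch.ones(2, requires_grad=True)

(w * 3.0).sum().backward()
first = w.grad.clone()        # gradient of one backward pass

(w * 3.0).sum().backward()    # no zeroing in between
accumulated = w.grad.clone()  # the two gradients were added

w.grad.zero_()
(w * 3.0).sum().backward()
after_reset = w.grad.clone()

print(first, accumulated, after_reset)
```

The second call doubles the stored gradient, and only the explicit reset restores the single-pass value.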
Hi,
Thanks for this great library.
It seems that the torch.autograd.grad function is not supported with backpack. Are you planning to add support?
It would be useful to compute batch grads with respect to intermediate features for example.
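As a point of reference, this is what the plain-PyTorch call looks like for gradients with respect to intermediate features, without BackPACK. The two-layer model is made up for illustration; with a sum-reduced loss and no cross-sample coupling, row n of the result is the gradient for sample n.

```python
import torch
from torch import nn

torch.manual_seed(0)
first = nn.Linear(4, 3)
second = nn.Linear(3, 1)
x = torch.randn(5, 4)

features = first(x)            # intermediate features, shape [5, 3]
loss = second(features).sum()  # sum reduction keeps samples independent

# Gradient of the loss w.r.t. the intermediate tensor, not the parameters.
(grad_features,) = torch.autograd.grad(loss, features)
print(grad_features.shape)
```

Per-sample gradients with respect to parameters are the harder case, because `torch.autograd.grad` sums over the batch dimension there.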