
backpack's People

Contributors

f-dangel, fkunstner, jabader97, sbharadwajj, schaefertim


backpack's Issues

OOM eventually when using create_graph=True with BatchL2Grad

I was trying to use my second-order optimizer ESGD-M with BatchL2Grad in order to collect information on within-batch gradient variance and estimate the stochastic noise (think OpenAI's gradient noise scale paper), and I kept running out of memory after maybe six epochs of MNIST training. ESGD-M does a Hessian-vector product internally (not using BackPACK, just autograd), so it requires the user to specify create_graph=True. I assume that when I use it with BackPACK, something leaks references to past computational graphs; normally these graphs are garbage collected without issue.
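Here is a minimal sketch of the kind of loop I mean; the model, data, and optimizer step are stand-ins for my actual setup:

import torch
from torch.nn import CrossEntropyLoss, Flatten, Linear, Sequential

from backpack import backpack, extend
from backpack.extensions import BatchL2Grad

model = extend(Sequential(Flatten(), Linear(784, 10)))
lossfunc = extend(CrossEntropyLoss())

for step in range(1000):  # stands in for the MNIST training loop
    X, y = torch.rand(64, 1, 28, 28), torch.randint(0, 10, (64,))
    loss = lossfunc(model(X), y)
    with backpack(BatchL2Grad()):
        loss.backward(create_graph=True)  # required by the optimizer's internal HVPs
    # ... optimizer step using param.grad and param.batch_l2 ...
    model.zero_grad()
    # memory usage keeps growing across iterations in this configuration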

Thank you,
Katherine Crowson

Conv2d derivative issues

It seems like:

grad_weight = conv2d(input, mat, None, module.dilation, module.padding,

only works for certain stride/padding/kernel/dilation combinations and can return a shape that is incompatible with:

grad_weight = grad_weight.view(num_cols, batch,

From my testing, it works when "in + 2 * padding - dilation * (kernel - 1) - 1" is a multiple of the stride, i.e., when there is no rounding down from the floor operation in the forward pass of the Conv2d.
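A small check of this divisibility condition (generic parameter names, not BackPACK code):

def conv_out_length(inp, kernel, stride=1, padding=0, dilation=1):
    """Conv output length along one dimension, and whether the floor rounds down."""
    numerator = inp + 2 * padding - dilation * (kernel - 1) - 1
    return numerator // stride + 1, numerator % stride == 0

print(conv_out_length(10, 3, stride=2))  # (4, False): rounding occurs, shapes can mismatch
print(conv_out_length(11, 3, stride=2))  # (5, True): divisible by the stride, the reshape works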

Faster 1st order methods

I really like your work on BackPACK; it enabled me to restructure my current research pretty massively.

I have a question, though: is it possible to speed up the batchl2norm extension, for example? I implemented a similar extension of my own, which also relies on convUtils.get_weight_gradient_factors for convolutions. For my use cases this is really slow: on a Wide ResNet 28x10, steps take roughly 3 times longer if I use either batchl2norm or my own extension. My guess is that this is due to the call to convUtils.get_weight_gradient_factors. Is that right?

I guess it would be a lot of work, but is it in general possible to implement batchl2norm as fast as the gradient summation performed during the regular backward pass for convolutions?
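For reference, a rough way to measure the overhead on a given architecture (the model below is a small stand-in, not the Wide ResNet):

import time

import torch
from torch.nn import Conv2d, CrossEntropyLoss, Flatten, Linear, Sequential

from backpack import backpack, extend
from backpack.extensions import BatchL2Grad

model = extend(Sequential(Conv2d(3, 16, 3, padding=1), Flatten(), Linear(16 * 32 * 32, 10)))
lossfunc = extend(CrossEntropyLoss())
X, y = torch.rand(128, 3, 32, 32), torch.randint(0, 10, (128,))

def timed(fn):
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return time.perf_counter() - start

def plain_backward():
    model.zero_grad()
    lossfunc(model(X), y).backward()

def backward_with_batch_l2():
    model.zero_grad()
    with backpack(BatchL2Grad()):
        lossfunc(model(X), y).backward()

print("plain backward:   ", timed(plain_backward))
print("with BatchL2Grad: ", timed(backward_with_batch_l2))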

Support for single output networks (BCELoss and MSELoss)

Currently, BCELoss, where the neural network maps a single example to a scalar and a batch to a vector, is not supported, if I am not mistaken. Therefore, for simple binary classification one needs to replace BCELoss with the standard (multiclass) cross-entropy loss and use a network with two outputs where only one would be needed.
Since you initialize backpropagation with the square root of the loss Hessian, BCE would probably be better/more exact for binary classification, because in the multiclass case the Hessian is not full rank.

Is there a problem with scalar-output networks? It seems that for MSELoss a [Batch, 1] output is required even for scalar observations, right?
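For reference, a sketch of the two-output workaround described above (minimal stand-in model):

import torch
from torch.nn import CrossEntropyLoss, Linear, Sequential

from backpack import backpack, extend
from backpack.extensions import DiagGGNExact

# Binary classification via the workaround: 2 logits + cross-entropy
# instead of 1 logit + BCELoss.
model = extend(Sequential(Linear(10, 2)))
lossfunc = extend(CrossEntropyLoss())

X = torch.rand(8, 10)
y = torch.randint(0, 2, (8,))  # binary labels as class indices

loss = lossfunc(model(X), y)
with backpack(DiagGGNExact()):
    loss.backward()

for p in model.parameters():
    print(p.diag_ggn_exact.shape)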

Computing gradients of a batch of M-dimensional tensors

Hello everyone,

I'd like to compute the gradients of a batch of B M-dimensional vectors, stored in a tensor A, with respect to a parameter param of size K, and store them in a B x M x K tensor.

My code looks like this:

for i in range(M):
    param.grad.data.zero_()
    with backpack(BatchGrad()):
        A[:,i].backward(torch.ones_like(A[:,i]), retain_graph=True)

The first iteration works properly, but at the second one I get an error I don't understand:
ModuleAttributeError: 'Linear' object has no attribute 'input0'

Do you have any idea of what's going on?

Thanks,
Romain
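One pattern worth trying, under the assumption that BackPACK frees the stored layer inputs (input0) after the first backward inside the context, is to redo the forward pass in every iteration:

# Hypothetical workaround: recompute the forward pass in every iteration so the
# extended layers hold a fresh `input0` before each BackPACK backward.
# `model`, `x`, `M`, and `param` stand in for the objects in the snippet above.
for i in range(M):
    param.grad = None
    A = model(x)
    with backpack(BatchGrad()):
        A[:, i].sum().backward()
    # collect param.grad_batch for output dimension i here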

MC-sampling tests fail sometimes

The MC-sampling based tests do not consistently pass.

Running (only the MC-related tests)

pytest -vxk mc 

35 times, I get

  • 30 passes
  • 5 fails on test/automated_test.py::test_diag_ggn_mc_approx_ggn_montecarlo[Conv2d-ReLU-classification-cpu]

Support for ConvTranspose2d

I think it would be great to have first-order extension support for ConvTranspose2d, as it is widely used in generative models and various vision tasks.

Thanks again for this amazing library!

Pip2 Error

If I try to install with pip2 I get the following error:

pip install backpack-for-pytorch
Collecting backpack-for-pytorch
 Using cached https://files.pythonhosted.org/packages/33/1e/c54c4e36aa5ae67117f03410d60c363779620b7aa78b0c67245af23f45c7/backpack-for-pytorch-1.0.0.tar.gz
   Complete output from command python setup.py egg_info:
   Traceback (most recent call last):
     File "<string>", line 1, in <module>
     File "/private/var/folders/gl/53ck005n3cj8_08d_zzt4jcc0000gn/T/pip-install-T8akay/backpack-for-pytorch/setup.py", line 23, in <module>
       with open(REQUIREMENTS_FILE) as f:
   IOError: [Errno 2] No such file or directory: 'requirements.txt'
   
   ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/gl/53ck005n3cj8_08d_zzt4jcc0000gn/T/pip-install-T8akay/backpack-for-pytorch/

KFRA broken for Conv2d with v1.1.0

The methods called in ea_jac_t_mat_jac_prod were refactored away during the change of index convention, and we did not notice because KFRA for convolutions has no limit in which it converges to a quantity that we can compare against via autodiff.

  • Add interface test that checks the shapes of Kronecker factors
  • Fix ea_jac_t_mat_jac_prod
  • Add all_in_one example for a CNN

How to support first-order extensions for custom modules?

Would it be possible to include a high-level explanation of what needs to happen to add support for a custom module? Perhaps it could be broken down into the essentials for first-order information, and additional requirements for second-order information.
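Until such an explanation exists, here is a rough sketch of what the first-order case seems to involve, based on the custom-module example in the documentation; the class name FirstOrderModuleExtension, the set_module_extension call, and the method signature are recalled from that example and may differ between versions:

import torch
from torch.nn import Module, Parameter

from backpack import backpack, extend
from backpack.extensions import BatchGrad
from backpack.extensions.firstorder.base import FirstOrderModuleExtension

class ScaleModule(Module):
    """Toy custom layer: multiplies its input by a learnable scalar."""

    def __init__(self):
        super().__init__()
        self.weight = Parameter(torch.tensor([2.0]))

    def forward(self, x):
        return x * self.weight

class ScaleModuleBatchGrad(FirstOrderModuleExtension):
    """Per-sample gradients of ScaleModule.weight."""

    def __init__(self):
        super().__init__(params=["weight"])  # one method below per parameter name

    def weight(self, ext, module, g_inp, g_out, bpQuantities):
        # d(loss_n)/d(weight) = sum over features of input0_n * grad_output_n
        return (module.input0 * g_out[0]).flatten(start_dim=1).sum(1, keepdim=True)

extension = BatchGrad()
extension.set_module_extension(ScaleModule, ScaleModuleBatchGrad())

model = extend(ScaleModule())
loss = model(torch.rand(4, 3)).sum()
with backpack(extension):
    loss.backward()
print(model.weight.grad_batch)  # shape [4, 1]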

v1.4.0 no longer seems to support `backward()` with the `inputs` parameter referencing a sub-module's parameters

I am playing around with the DomainBed repository. I noticed that for the implementation of Fishr, they specifically install version 1.3.0 and I was wondering why.

After a bit of experimentation, it seems that it is no longer possible to use backward(inputs=...) where inputs is a submodule. I adjusted the example from your documentation to replicate the issue:

from torch.nn import CrossEntropyLoss, Flatten, Linear, Sequential

from backpack import backpack, extend
from backpack.extensions import BatchGrad
from backpack.utils.examples import load_one_batch_mnist

X, y = load_one_batch_mnist(batch_size=512)

model = Sequential(Flatten(), Linear(784, 128), Linear(128, 10))  # I added an additional layer here
lossfunc = CrossEntropyLoss()

model = extend(model)
lossfunc = extend(lossfunc)

loss = lossfunc(model(X), y)
with backpack(BatchGrad()):
    loss.backward(inputs=list(model[-1].parameters()))  # I am trying to get the gradient with respect to the last submodule

for name, param in model[-1].named_parameters():  # I only loop over the parameters in the last submodule
    print(name)
    print(".grad.shape:             ", param.grad.shape)
    print(".grad_batch.shape:       ", param.grad_batch.shape)

With backpack-for-pytorch==1.4.0, this gives

AttributeError: 'Parameter' object has no attribute 'grad_batch'

With backpack-for-pytorch==1.3.0, this prints the expected output:

weight
.grad.shape:              torch.Size([10, 128])
.grad_batch.shape:        torch.Size([512, 10, 128])
bias
.grad.shape:              torch.Size([10])
.grad_batch.shape:        torch.Size([512, 10])

I tried going through the git history of this repository to identify what changed between these two versions, but I have not managed to pin down the change that caused this. I was wondering whether this is intentional or a bug.

Expand supported Losses for first order Extensions

Since first-order extensions allow the use of most parameter-free operations, the number of supported loss functions shouldn't be so small. If the package contained a loss wrapper of the form
L(x) = mean(x),
where x is the vector of per-sample losses, this would greatly extend the number of available loss functions. These could then be computed as L(f(x)), where f is the chosen loss.
Note that this can already be done with the MSE loss through MSE(sqrt(f(x)), 0) if f(x) is a vector, but this is not very clean and involves unnecessary computation, so a dedicated loss function for this purpose would be nice.
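A sketch of the MSE workaround mentioned above, with a toy stand-in for the per-sample losses f(x):

import torch
from torch.nn import Linear, MSELoss, Sequential

from backpack import backpack, extend
from backpack.extensions import BatchGrad

B = 8
model = extend(Sequential(Linear(10, 1)))
mse = extend(MSELoss(reduction="mean"))

x = torch.rand(B, 10)
per_sample_losses = model(x).pow(2)  # shape [B, 1], non-negative stand-in for f(x)

# MSE(sqrt(f(x)), 0) = mean(f(x)), i.e. the wrapper L(x) = mean(x) from above
loss = mse(per_sample_losses.sqrt(), torch.zeros(B, 1))
with backpack(BatchGrad()):
    loss.backward()

for p in model.parameters():
    print(p.grad_batch.shape)  # individual gradients of mean(f(x))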

Variance of the gradients is with respect to scaled down gradients

Hi,

I was trying to estimate the variance of the gradients and I observed the following: the variance is not taken with respect to the actual per-sample gradients but with respect to versions scaled down by the batch size. Here's a quick example to illustrate this:

import numpy as np
import torch
import torch.nn as nn

from backpack import backpack, extend, extensions

B = 20  # Batch size

# Create a simple NN
m = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))

# And a dummy loss - this needs to be an nn.Module for backpack to work
class Loss(nn.Module):
    def forward(self, x):
        return x.mean()

loss = extend(Loss())
m = extend(m)
batch = torch.rand(B, 10)

The ground-truth variance is estimated by taking the per-example gradients and computing their variance:

clear_backprops(m)
gradients = []
for i in range(B):
  m.zero_grad()
  loss(m(batch)[i]).backward(retain_graph=True)
  gradients.append(torch.cat([g.grad.view(-1) for g in list(m.parameters())], dim=0))
gradients = torch.stack(gradients)
ground_truth_variance = gradients.var(0)
print(gradients.var(0).mean())

Here's what backpack returns

m.zero_grad()
with backpack(extensions.Variance()):
    loss(m(batch)).backward()

grad_vars = torch.cat([g.variance.view(-1) for g in list(m.parameters())])

print(grad_vars.mean())
assert np.allclose(ground_truth_variance, grad_vars)

And here's backpack after I scale the gradients back up:

m.zero_grad()
with backpack(extensions.BatchGrad()):
    loss(m(batch)).backward()
grad_vars = torch.cat([(g.grad_batch*B).var(0).view(-1) for g in list(m.parameters())])

print(grad_vars.mean())
assert np.allclose(ground_truth_variance, grad_vars)

Let me know what you think
p.

Mini-batch subsampling

Think about offering extensions that only use a subset of the mini-batch.

Motivation: Curvature is often roughly estimated on a subset of the samples used for the gradient.

Needs discussion on how to realize. First thoughts:

  • Possible in current implementation, but not efficient
    • Two forward passes, only one with backpack (see the sketch after this list)
    • Fix normalization constant afterwards
  • Support subsampling in module Jacobians/Hessians
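A sketch of the two-forward-pass idea from the first bullet (toy model; DiagGGNExact stands in for whichever curvature extension is used):

import torch
from torch.nn import CrossEntropyLoss, Flatten, Linear, Sequential

from backpack import backpack, extend
from backpack.extensions import DiagGGNExact

model = extend(Sequential(Flatten(), Linear(784, 10)))
lossfunc = extend(CrossEntropyLoss())
X, y = torch.rand(64, 1, 28, 28), torch.randint(0, 10, (64,))

# Curvature from a subset only; the second forward pass below is the extra cost.
sub = slice(0, 16)
with backpack(DiagGGNExact()):
    lossfunc(model(X[sub]), y[sub]).backward()
# .diag_ggn_exact now holds the subset estimate; with reduction="mean" its
# normalization (1/16 here vs. 1/64 for the full batch) has to be fixed by hand.

# Gradient from the full batch, without BackPACK overhead.
model.zero_grad()
lossfunc(model(X), y).backward()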

Support on torch.nn.DataParallel for multiple GPUs training

I am trying to use backpack to calculate the batched gradients of a medium-sized neural network on two GPUs. I use the following code to construct the net:

net = extend(net)
net = torch.nn.DataParallel(net)
net.to('cuda')

However, in practice, I encounter the following error.

RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:1

If possible, would you mind adding a toy example of how to use backpack with multiple GPUs and torch.nn.DataParallel? This would be very helpful.

Help for a use case of Multiple Losses

Hi,
Thanks for the awesome library!!
I have a use case in which I have multiple loss functions on which I have to call backward without using any reduction like mean or sum. I want to calculate the gradients for the different losses in parallel.

losses = [loss1, loss2, loss3]
losses.backward()  # desired usage (pseudocode)

print(param.grad)
## It should contain the Jacobian, i.e. one gradient per loss

Since it is possible to calculate gradients with respect to every sample in a batch (I don't want per-sample gradients here), is it possible to generalize to this use case?

Hutchinson trace example does not work on ResNet18

I was able to run the second-order examples after using extend(model, use_converter=True) for ResNet18. However, when I try to run the Hutchinson trace example, I get the following error:

NotImplementedError: Extension saving to diag_h does not have an extension for Module <class 'backpack.custom_module.branching.SumModule'>

Is it possible to extend this module so that the Hutchinson trace can be computed layerwise for ResNet models?

Thank you,
Jeff

Here is part of the test code:

def calc_hutchison_trace(model, criterion):
    model.eval()
    model = extend(model, use_converter=True)
    criterion.to(device)
    loss_function = extend(criterion)

    # In the following, we load a batch, compute the loss and trigger the
    # backward pass ``with(backpack(..))`` such that we have access to the
    # extensions that we are going to use (``DiagHessian`` and ``HMP``).
    for i, data in enumerate(trainloader, 0):
        x, y = data
        x = x.to(device)
        y = y.to(device)
        break  # Get 1 batch

    def forward_backward_with_backpack():
        """Provide working access to BackPACK's `DiagHessian` and `HMP`."""
        loss = loss_function(model(x), y)

        with backpack(DiagHessian(), HMP()):
            # keep graph for autodiff HVPs
            loss.backward(retain_graph=True)

        return loss

    # Explicit test to see if diag info is created.
    loss = loss_function(model(x), y)
    with backpack(DiagHessian(), BatchDiagHessian()):
        loss.backward()
    for name, param in model.named_parameters():
        print(name)
        print(".grad.shape:             ", param.grad.shape)
        print(".diag_h.shape:           ", param.diag_h.shape)
        print(".diag_h_batch.shape:     ", param.diag_h_batch.shape)

`reduction`-dependent scaling factor of `grad_batch`

Thanks for providing this very useful library.

I was trying to use backpack for a tiny network with Conv2d, BatchNorm2d, and ConvTranspose2d layers.

I set the network mode to eval and then tried to replicate the example here, where the per-sample gradients are verified. Although I was able to replicate the above example (using the provided ResNet), I could not do the same for my tiny conv-deconv network.

import torch
import torch.nn as nn
import torch.nn.functional as F

from torchsummary import summary

from backpack import backpack, extend
from backpack.extensions import BatchGrad

import warnings
# To remove backpack warning about using a non-full backward hook
warnings.filterwarnings('ignore')

BATCH_SIZE = 32
IMAGE_SIZE = 64
torch.manual_seed(0)

DEVICE = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Network
class TinyConvDeconv(torch.nn.Module):

    def __init__(self, use_bn=True):
        
        super(TinyConvDeconv, self).__init__()

        self.use_bn = use_bn

        #Convolution 1
        self.conv1=nn.Conv2d(in_channels=3,out_channels=16, kernel_size=4)
        nn.init.xavier_uniform_(self.conv1.weight) #Xaviers Initialisation
        if self.use_bn:
            self.bn1 = nn.BatchNorm2d(16)
        self.swish1= nn.ReLU()

        #De Convolution 1
        self.deconv1=nn.ConvTranspose2d(in_channels=16,out_channels=3,kernel_size=4)
        nn.init.xavier_uniform_(self.deconv1.weight)
        self.swish4=nn.ReLU()

    def forward(self,x):
        out=self.conv1(x)
        out=self.swish1(self.bn1(out) if self.use_bn else out)
        out=self.deconv1(out)
        out=self.swish4(out)
        return(out)

# Use random tensors as data
pseudo_x = torch.rand((BATCH_SIZE, 3, IMAGE_SIZE, IMAGE_SIZE)).to(DEVICE)
pseudo_y = torch.rand((BATCH_SIZE, 3, IMAGE_SIZE, IMAGE_SIZE)).to(DEVICE)

conv_deconv = TinyConvDeconv().to(DEVICE)
conv_deconv.eval()

conv_deconv = extend(conv_deconv)

# At this stage BN stats are fixed to 0 (mean) and 1 (var)
print(f'Network train set? {conv_deconv.training}')

conv_deconv.zero_grad()
loss = F.mse_loss(conv_deconv(pseudo_x), pseudo_y, reduction="mean")

with backpack(BatchGrad()):
    loss.backward()

print("{:<20}  {:<40} {:<20}".format("Param", "grad", "grad (batch)"))
print("-" * 100)
for name, p in conv_deconv.named_parameters():
    if (not 'bn' in name):
        print(f'{name:<20}, {str(p.grad.shape):<40}, {str(p.grad_batch.shape):<20}')

sample_to_check = 1

x_to_check = pseudo_x[sample_to_check, :].unsqueeze(0)
y_to_check = pseudo_y[sample_to_check].unsqueeze(0)

conv_deconv.zero_grad()
loss = F.mse_loss(conv_deconv(x_to_check), y_to_check)
loss.backward()

print("Do the individual gradients match?")
for name, p in conv_deconv.named_parameters():
    if (not 'bn' in name):
        match = torch.allclose(p.grad_batch[sample_to_check, :], p.grad, atol=1e-5)
        print("{:<20} {}".format(name, match))

I followed the same steps as the example, but I could not figure out why the gradient computed for a single sample does not match the grad_batch entry computed by backpack. Perhaps I am missing something?

The attached notebook also has the code shown here. I use torch 1.9.0+cu102. Any help is appreciated.
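Continuing the script above: a quick check of whether the mismatch is just the reduction="mean" scaling, under the assumption that grad_batch stores the 1/BATCH_SIZE-scaled per-sample contributions:

# Rescale grad_batch by the batch size before comparing against the single-sample gradient.
print("Do the individual gradients match after rescaling by BATCH_SIZE?")
for name, p in conv_deconv.named_parameters():
    if (not 'bn' in name):
        match = torch.allclose(BATCH_SIZE * p.grad_batch[sample_to_check], p.grad, atol=1e-5)
        print("{:<20} {}".format(name, match))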

Clarity on supported models

Thanks for creating an honestly amazing package for speeding up batched gradient calculation!

Summary

I have a few clarity questions regarding what modules are and are not supported and in which cases.

Specifics

Expecting sequences

In supported models it is stated that backpack expects models to be sequences (nn.Sequential). However, in the ResNet example this is not the case.

  • My question then is: does backpack only expect sequences of modules for second-order extensions?

First-order extensions support any module (without parameters)

  • Are parameters defined as "learnable" parameters? I.e., would a module like nn.LeakyReLU work? It has a parameter, but it is not trainable.

Are nested modules supported?

Say I define

class Child(nn.Module):
    ...

class Parent(nn.Module):
    def __init__(self):
        super(Parent,self).__init__()
        self.child = Child()

Are the parameters in Child properly tracked by backpack?
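Not an authoritative answer, but a quick way to check this empirically with BatchGrad (toy modules, sum of outputs as a stand-in loss):

import torch
from torch import nn

from backpack import backpack, extend
from backpack.extensions import BatchGrad

class Child(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)

    def forward(self, x):
        return self.fc(x)

class Parent(nn.Module):
    def __init__(self):
        super().__init__()
        self.child = Child()

    def forward(self, x):
        return self.child(x)

model = extend(Parent())  # extend() recurses into submodules
loss = model(torch.rand(8, 4)).sum()
with backpack(BatchGrad()):
    loss.backward()

for name, p in model.named_parameters():
    print(name, p.grad_batch.shape)  # e.g. child.fc.weight torch.Size([8, 2, 4])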

Question: MC Samples of GGN

Hi, thanks a lot for open-sourcing backpack, it's a major contribution to the community. I wasn't able to find a better place to ask questions, so here we go:

  1. The MC samples of the GGN are taken from the model's predictive distribution (y ~ p_{\theta}(x)) and not from the empirical data, right? Is the target of the loss then ignored while sampling to compute the GGN?
  2. What is the easiest way to get more MC samples? Multiple backward() calls with zero_grad() calls in between?

Thanks !!

Group conv

Hi,

Thanks a lot for such promising work! As group convolutions are more and more popular in the CV community, do you have a plan to support the derivatives of group convolutions?

Best wishes!

Use PyTorch's 1.2.0 Flatten module

PyTorch 1.2.0 introduced a Flatten module. Our custom Flatten is redundant.

Need to

  • Change the tests to use the new Flatten layer
  • Link the new Flatten layer to the Flatten extension in the second order extensions
  • Change the example code to use the new Flatten layer
  • Require torch >= 1.2.0

PyTorch 1.3.0 multinomial interface change

PyTorch 1.3.0 changed the acceptable inputs for the multinomial function.
The interface is now the same for the cuda and cpu versions.

This breaks the Sampling of the symmetric factorization for the cross entropy loss.

Todo:

  • Add tests to sanity-check the average of multiple runs of KFAC and MC-sampled GGN using low-precision torch.allclose.
  • Make those tests optional (they will take a while to run) and document how to run them.
  • Change the call to multinomial in derivatives/crossentropyloss.py to reflect the interface of PyTorch 1.3.0.

Kronecker utilities

The Kronecker-factored quantities require utilities to demonstrate their usage:

  • Full matrix from Kronecker factors
  • Matrix-vector-products from Kronecker factors
  • Inverse Kronecker factors
  • (Matrix inverse)-vector-products from Kronecker factors

From these, one could think about composing

  • Inverse matrix from Kronecker factors
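As a starting point, a sketch of the two product utilities in generic torch code (not BackPACK internals), using the identity (A ⊗ B) vec(X) = vec(A X Bᵀ) with row-major vec:

import torch

def kron_matvec(A, B, v):
    """Multiply (A ⊗ B) with v without forming the Kronecker product."""
    X = v.reshape(A.shape[1], B.shape[1])
    return (A @ X @ B.T).reshape(-1)

def kron_inv_matvec(A, B, v):
    """Multiply (A ⊗ B)^{-1} with v via per-factor solves."""
    X = v.reshape(A.shape[1], B.shape[1])
    return torch.linalg.solve(A, torch.linalg.solve(B, X.T).T).reshape(-1)

# Sanity check against the explicit Kronecker product (well-conditioned factors).
A = torch.eye(3) + 0.1 * torch.randn(3, 3)
B = torch.eye(4) + 0.1 * torch.randn(4, 4)
v = torch.randn(A.shape[1] * B.shape[1])
assert torch.allclose(torch.kron(A, B) @ v, kron_matvec(A, B, v), atol=1e-5)
assert torch.allclose(torch.kron(A, B) @ kron_inv_matvec(A, B, v), v, atol=1e-4)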

Second order computations for nn.Upsample

Hi

I need to compute an approximate Hessian for a decoder network. The decoder consists of Conv2d and Upsample layers. Currently, backpack does not support nn.Upsample. Since it is a parameter-free layer, it might not be too difficult to implement?

Here I define my model and a data point.

import torch

from backpack import backpack, extend
from backpack.extensions import DiagGGNExact

model = torch.nn.Sequential(
    torch.nn.Conv2d(1,8, kernel_size=3, padding=1),
    torch.nn.MaxPool2d(2),
    torch.nn.ReLU(),
    torch.nn.Conv2d(8,8, kernel_size=3, padding=1),
    torch.nn.Upsample(scale_factor=2, mode="nearest"),
    torch.nn.ReLU(),
    torch.nn.Conv2d(8,1, kernel_size=3, padding=1),
    torch.nn.Flatten(),
)
lossfunc = torch.nn.MSELoss()

model = extend(model)
lossfunc = extend(lossfunc)

X = torch.zeros(1,1,8,8)
print(model(X).shape)

b = X.shape[0]
loss = lossfunc(model(X), X.view(b, -1))

with backpack(DiagGGNExact()):
    loss.backward()

for param in model.parameters():
    print(param.diag_ggn_exact)

will return this error

NotImplementedError: Extension saving to diag_ggn_exact does not have an extension for Module <class 'torch.nn.modules.upsampling.Upsample'>

Could you help implement this feature?

AttributeError: 'BatchNorm2d' object has no attribute 'output'

I post the full error below. The MWE is a bit long (currently hundreds of lines) and I am still working on reducing it, but is there any specific direction I should be looking in, given this error? It looks like BatchNorm is somehow mixed up in the gradient calculation, judging from the error message?

Traceback (most recent call last):
  File "/Users/qiyaowei/DEQ-BNN/mwe.py", line 575, in <module>
    model(torch.rand(1,3,32,32)).sum().backward()
  File "/Users/qiyaowei/miniconda3/envs/jax/lib/python3.8/site-packages/torch/_tensor.py", line 363, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/Users/qiyaowei/miniconda3/envs/jax/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/Users/qiyaowei/miniconda3/envs/jax/lib/python3.8/site-packages/torch/utils/hooks.py", line 110, in hook
    res = user_hook(self.module, grad_input, self.grad_outputs)
  File "/Users/qiyaowei/miniconda3/envs/jax/lib/python3.8/site-packages/backpack/__init__.py", line 209, in hook_run_extensions
    backpack_extension(module, g_inp, g_out)
  File "/Users/qiyaowei/miniconda3/envs/jax/lib/python3.8/site-packages/backpack/extensions/backprop_extension.py", line 127, in __call__
    module_extension(self, module, g_inp, g_out)
  File "/Users/qiyaowei/miniconda3/envs/jax/lib/python3.8/site-packages/backpack/extensions/module_extension.py", line 97, in __call__
    delete_old_quantities = not self.__should_retain_backproped_quantities(module)
  File "/Users/qiyaowei/miniconda3/envs/jax/lib/python3.8/site-packages/backpack/extensions/module_extension.py", line 162, in __should_retain_backproped_quantities
    is_a_leaf = module.output.grad_fn is None
  File "/Users/qiyaowei/miniconda3/envs/jax/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1185, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'BatchNorm2d' object has no attribute 'output'

Parameter `grad`s don't get initialized with `BatchL2Grad` and BatchNorm

BatchL2Grad, perhaps naturally, raises an error when it sees a BatchNorm layer, since batch normalization mixes gradients in a way that makes the individual contributions hard to discern.
The error says I can ignore it if I know what I'm doing. I can't say I completely do, but if I ignore it, I do indeed get both grads and batch_l2s on the top levels of my model, which don't use batch norm.
I'm happy with that.

My problem is that the lower-level parameters - which do use batch norm - don't just have a None batch_l2, but also a None grad.
So my model doesn't train at all.
This seems wrong, since grad is computable, as witnessed by PyTorch doing so just fine without backpack.

Is there a way I can get batch_l2s on as many of my parameters as possible, but grads on everything?

I can do this now by first calling backward() without backpack, and then calling it again inside with backpack(BatchL2Grad()):, but that seems wasteful.

Support nn.GaussianNLLLoss

Hi,

I would like to apply the Cockpit library to my problem, which uses the Gaussian log-likelihood for training. If I only want to look at first-order information, this loss function should already work with Backpack. However, I would be very interested in also seeing the second-order information, for which explicit support in Backpack is needed.

What would it take to integrate this loss? I might be able to contribute as well if it is not too complicated.

The documentation is here: https://pytorch.org/docs/stable/generated/torch.nn.GaussianNLLLoss.html

Thanks!

Does Backpack Support Reusing Layers (First Order Extensions)

Hi,

Does backpack allow the reuse of layers with first-order extensions, like in, say, a Siamese network? I only need this for first-order extensions, in particular batch grads. An example is given below - it produces an "AttributeError: 'Linear' object has no attribute 'input0'" error.

Thanks!

import torch.nn as nn
import torch
from backpack import backpack, extend
from backpack.extensions import BatchGrad

class TestModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(5, 5)
    
    def forward(self, x):
        return self.net(x[:, :5]) + self.net(x[:, 5:])
    
test_module = TestModule()
extend(test_module)

rand_vec = torch.randn(5, 10)
loss = test_module(rand_vec).sum()

with backpack(BatchGrad()):
    loss.backward()

Batch norm individual gradients

BatchNorm is a special module because it mixes samples within the batch. This needs some special treatment.

PyTorch documentation: BatchNorm1d, BatchNorm2d

If BatchNorm is in evaluation mode (.training=False), the saved statistics are used. This is independent of the batch. Therefore, individual gradients are well defined.

However, if BatchNorm is in training mode (.training=True), the batch statistics are used instead, so individual gradients are no longer well defined. I suggest several possibilities for realizing individual gradients:

  1. Use the saved statistics like in evaluation mode.
  2. Use batch_size=1, i.e. E(x)=x and Var(x)=0. Note: This is forbidden by PyTorch.
  3. Use complete batch statistics. Note: This is the method implemented by PyTorch.
  4. Use complete batch statistics and approximate individual gradients by neglecting the mixed terms. Note: This might be advantageous for the GGN, because it leads to separate individual gradients.

I favor the third alternative. If the other approaches do have some merit, it is also possible to implement a switch, for example by reading module.batch_norm_mode: str and executing the given mode if provided.

Note: The current version implements the third alternative. But it has some shortcomings:

  • does not allow an L-axis
  • does not check the training mode

can not import backpack (ImportError: cannot import name 'OrderedDict' from 'typing')

env:
python=3.7.0
backpack_for_pytorch==1.4.0

log:

/media/Store/lyj/miniconda3/envs/py3.7/bin/python /media/Store/lyj/workspace/mlsad/debug_backpack.py
Traceback (most recent call last):
  File "/media/Store/lyj/workspace/mlsad/debug_backpack.py", line 1, in <module>
    import backpack
  File "/media/Store/lyj/miniconda3/envs/py3.7/lib/python3.7/site-packages/backpack/__init__.py", line 10, in <module>
    from backpack import extensions
  File "/media/Store/lyj/miniconda3/envs/py3.7/lib/python3.7/site-packages/backpack/extensions/__init__.py", line 3, in <module>
    from .curvmatprod import GGNMP, HMP, PCHMP
  File "/media/Store/lyj/miniconda3/envs/py3.7/lib/python3.7/site-packages/backpack/extensions/curvmatprod/__init__.py", line 24, in <module>
    from .ggnmp import GGNMP
  File "/media/Store/lyj/miniconda3/envs/py3.7/lib/python3.7/site-packages/backpack/extensions/curvmatprod/ggnmp/__init__.py", line 21, in <module>
    from backpack.extensions.secondorder.base import SecondOrderBackpropExtension
  File "/media/Store/lyj/miniconda3/envs/py3.7/lib/python3.7/site-packages/backpack/extensions/secondorder/__init__.py", line 27, in <module>
    from backpack.extensions.secondorder.diag_ggn import (
  File "/media/Store/lyj/miniconda3/envs/py3.7/lib/python3.7/site-packages/backpack/extensions/secondorder/diag_ggn/__init__.py", line 50, in <module>
    from backpack.custom_module.branching import SumModule
  File "/media/Store/lyj/miniconda3/envs/py3.7/lib/python3.7/site-packages/backpack/custom_module/branching.py", line 2, in <module>
    from typing import Any, OrderedDict, Tuple, Union
ImportError: cannot import name 'OrderedDict' from 'typing' (/media/Store/lyj/miniconda3/envs/py3.7/lib/python3.7/typing.py)

Customizable storing of inputs and output

Not every backpack extension requires layers to track all inputs and the output. By default, everything should be tracked to allow for easy extension of backpack.

If it is known in advance that only one specific extension will be used, memory consumption can be reduced by storing only the required information.

Extend part of the model

Hello,

I am wondering: is it possible to extend only part of the model if I only want the batch gradients of the last several layers?

I think model = extend(model) will waste memory if only the batch gradients of the last several layers are needed.

For example, if I only want to extend the last two layers of a large model (let's say the last two layers are fc1 and fc2), can I do something like this:

model.fc1 = extend(model.fc1)
model.fc2 = extend(model.fc2)
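I have not verified this against the library, but for first-order quantities like BatchGrad the pattern from the question might look as follows, assuming partially extended models are allowed (only the extended layers would then receive grad_batch):

import torch
from torch.nn import CrossEntropyLoss, Flatten, Linear, ReLU, Sequential

from backpack import backpack, extend
from backpack.extensions import BatchGrad

model = Sequential(
    Flatten(), Linear(784, 256), ReLU(), Linear(256, 128), ReLU(), Linear(128, 10)
)
# Extend only the last two parametrized layers (the fc1/fc2 of the question).
model[3] = extend(model[3])
model[5] = extend(model[5])
lossfunc = extend(CrossEntropyLoss())

X, y = torch.rand(32, 1, 28, 28), torch.randint(0, 10, (32,))
loss = lossfunc(model(X), y)
with backpack(BatchGrad()):
    loss.backward()

for layer in (model[3], model[5]):
    for p in layer.parameters():
        print(p.grad_batch.shape)  # only the extended layers carry grad_batch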

Some buffers are stored on CPU when training "extended" model on GPU

When I train an extended model on GPU, some of the buffers in the model end up stored on the CPU, which leads to a runtime error. I use the following code to extend the model:

net = extend(net)
net = torch.nn.DataParallel(net)
net = net.to('cuda')

In practice, everything goes fine in the first training epoch and the buffers are stored on CUDA. When testing starts, all buffers appear to be stored on the CPU.

My current strategy is to add a line
net = net.to('cuda')
in each iteration of training/testing.

This problem does not appear if I do not use net = extend(net).

I hope this problem can be solved.

RuntimeError: 'NoneType' object is not subscriptable in backward()

First of all, thanks for your great lib.
Is torch.cat a supported operation in the computation graph? It seems that using concatenation with second-order extensions causes an error in backward():

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-14-427e1d95b045> in <module>
     21 
     22         with backpack(DiagHessian()):
---> 23             loss.backward(create_graph=True,)
     24 

~/anaconda3/lib/python3.7/site-packages/torch/tensor.py in backward(self, gradient, retain_graph, create_graph)
    196                 products. Defaults to ``False``.
    197         """
--> 198         torch.autograd.backward(self, gradient, retain_graph, create_graph)
    199 
    200     def register_hook(self, hook):

~/anaconda3/lib/python3.7/site-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
     98     Variable._execution_engine.run_backward(
     99         tensors, grad_tensors, retain_graph, create_graph,
--> 100         allow_unreachable=True)  # allow_unreachable flag
    101 
    102 

RuntimeError: 'NoneType' object is not subscriptable

Using backpack after a torch.autograd.grad call

Summary

Using backpack with respect to one set of parameters after using torch.autograd.grad with respect to a different set of parameters. This may not be easy to implement, as it is a second-order quantity (a gradient of gradients), but it would be awesome to support something like this.

Specifics

Consider the following code:

import torch
import torch.nn as nn
from backpack import backpack, extend
from backpack.extensions import BatchGrad

net = nn.Linear(2,1)

input = torch.randn(5,2, requires_grad=True) + 1

model = extend(net)

parameters = tuple(net.parameters())
out = net(input).pow(2)

# gradient with respect to input
grad_input = torch.autograd.grad(out.sum(), input, create_graph=True)
grad_input = torch.cat([g.flatten(start_dim=1) for g in grad_input])

print(grad_input.shape) # torch.Size([5,2])

# gradients with respect to parameters --- Fails here!
with backpack(BatchGrad()):
    _ = torch.autograd.grad(grad_input.sum(), parameters, create_graph=True)

print(parameters[0].grad_batch) # <- need this

Traceback

Grad size after first autograd torch.Size([5, 2])

Traceback (most recent call last):
  File "grad_tests.py", line 28, in <module>guess
    g = torch.autograd.grad(grad.sum(), parameters1[0], create_graph=True)
  File "path_to_torch/autograd/__init__.py", line 156, in grad
    return Variable._execution_engine.run_backward(
RuntimeError: 'Linear' object has no attribute 'input0'

[feature request] KFAC support for ResNet

Hi, thank you for developing and maintaining this library.

If I understand correctly, backpack 1.4 does not support ResNet with KFAC, right?

When I changed DiagGGNExact of the tutorial to KFAC (and changed AdaptiveAvgPooling to AvgPooling), it raised the following error:

~/.miniconda/.../backpack/extensions/backprop_extension.py in __get_module_extension(self, module)
     97             if self._fail_mode is FAIL_ERROR:
     98                 # PyTorch converts this Error into a RuntimeError for torch<1.7.0
---> 99                 raise NotImplementedError(
    100                     f"Extension saving to {self.savefield} "
    101                     "does not have an extension for "

NotImplementedError: Extension saving to kfac does not have an extension for Module <class 'backpack.custom_module.branching.SumModule'>

It would be great if you could support KFAC for ResNet, or let me know which modifications would be necessary, if possible. Thank you.

[Not working] Individual Hessian-vector products with `BatchGrad`

Hi,

Thanks for this great library. I am wondering how we can compute the gradient of a sum of gradients.

I am trying to implement a batched Hessian-vector product (HVP) with the following code:

def batch_hvp(self, model, loss, params_list, batch_grad_list):
    if len(params_list) != len(batch_grad_list):
        raise (ValueError("w and v must have the same length."))

    one_sample_grad_list = grad(loss, params_list, retain_graph=True, create_graph=True)

    elemwise_products = 0
    for grad_elem, v_elem in zip(one_sample_grad_list, batch_grad_list):
        sum_over_dims = []
        for i in range(len(v_elem.shape)):
            sum_over_dims.append(i)
        sum_over_dims = tuple(sum_over_dims[1:])
        elemwise_products += torch.sum(grad_elem.unsqueeze(0) * v_elem.detach(), sum_over_dims)

    with backpack(BatchGrad()):
        elemwise_products.backward()   # problem: has no attribute 'input0'
        return_grads = [p.grad_batch for p in model.parameters() if p.requires_grad]

    return return_grads

I encounter the " has no attribute 'input0' " problem when I call the backward(), is it possible to get batch gradients for return_grads?

For now, I am using a for loop to compute the gradients:

def batch_hvp(self, model, loss, params_list, batch_grad_list):
    if len(params_list) != len(batch_grad_list):
        raise (ValueError("w and v must have the same length."))

    one_sample_grad_list = grad(loss, params_list, retain_graph=True, create_graph=True)

    elemwise_products = 0
    for grad_elem, v_elem in zip(one_sample_grad_list, batch_grad_list):
        sum_over_dims = []
        for i in range(len(v_elem.shape)):
            sum_over_dims.append(i)
        sum_over_dims = tuple(sum_over_dims[1:])
        elemwise_products += torch.sum(grad_elem.unsqueeze(0) * v_elem.detach(), sum_over_dims)
    
    # The for-loop version 
    grad_cache = []
    for i in range(elemwise_products.shape[0]):
        elemwise_products[i].backward(retain_graph=True)
        grad_cache.append([p.grad.clone() for p in model.parameters() if p.requires_grad])
    grad_cache = list(zip(*grad_cache))
    return_grads = []
    for l_id in range(len(grad_cache)):
        return_grads.append(torch.cat([g.unsqueeze(0) for g in grad_cache[l_id]], dim=0))

    return return_grads

Thanks in advance!

Bug in example for DiagGGN Second order optimizer

Thanks for this great library!

I found a bug in the example showcasing the second-order implementations for writing an optimizer (https://docs.backpack.pt/en/master/use_cases/example_diag_ggn_optimizer.html#sphx-glr-use-cases-example-diag-ggn-optimizer-py).
This mistake is in the documentation example and might hinder people getting started with approximate second-order methods.

losses = []
accuracies = []
for batch_idx, (x, y) in enumerate(mnist_loader):
    x, y = x.to(DEVICE), y.to(DEVICE)
    outputs = model(x)
    loss = loss_function(outputs, y)

is missing the zeroing of the gradients. While BackPACK's quantities seem to be overwritten on each .backward(), the gradient itself just accumulates here. I ran the example for several epochs, which results in divergence because the gradient is never reset; I don't think this is intended, right?

losses = []
accuracies = []
for batch_idx, (x, y) in enumerate(mnist_loader):
    x, y = x.to(DEVICE), y.to(DEVICE)
    # ---------
    model.zero_grad()
    # ---------
    outputs = model(x)
    loss = loss_function(outputs, y)

This changes the example quite drastically, and the learning rate probably needs to be adjusted to get the same learning curve.

Hope this helps.

torch.autograd.grad support

Hi,
Thanks for this great library.
It seems that the torch.autograd.grad function is not supported with backpack; are you planning to add support?

It would be useful to compute batch grads with respect to intermediate features, for example.
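For the intermediate-feature case specifically, plain autograd already yields per-sample gradients, since the features carry a batch dimension; a small sketch (toy model, no BackPACK involved):

import torch
from torch.nn import Linear, ReLU, Sequential

torch.manual_seed(0)
feature_net = Sequential(Linear(10, 16), ReLU())
head = Linear(16, 1)

x = torch.rand(8, 10)
feats = feature_net(x)  # intermediate features, shape [8, 16]
loss = head(feats).pow(2).mean()

# Gradients w.r.t. intermediate activations already have a batch dimension, so
# plain autograd returns one row per sample (scaled by the 1/8 of the mean).
(grad_feats,) = torch.autograd.grad(loss, feats)
print(grad_feats.shape)  # torch.Size([8, 16])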
