
Comments (14)

yluo42 commented on July 2, 2024

As a follow-up, I tried a neural ODE in a denoising autoencoder (DAE) architecture on audio clips (my focus is mainly on audio processing tasks). I use a ResNet-style encoder to map the noisy input to a latent space, and another ResNet-style decoder to reconstruct the clean input. I tried two architectures: one with standard training (input -> Enc -> Dec -> output), and one with an extra ODE block between the encoder and the decoder (input -> Enc -> ODE block -> Dec -> output). The ODE block was integrated over time [0, 1] and only the last output was used for training. The intuition is that the ODE block should learn dynamics in the latent space that move the noisy features toward the clean features for reconstruction, and I wanted to see whether this was trainable in such a deep architecture. Both encoder and decoder consisted of 8 1-D CNN blocks with per-layer residual connections, and the input to the model is the raw waveform of the noisy audio clips (no STFT). The ODE block contained 2 1-D CNN blocks, similar to the MNIST example script.
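
Roughly, the ODE-block variant looked like the sketch below (simplified and untested here; the channel width and layer hyperparameters are placeholders, and the actual encoder/decoder are the 8-block CNNs described above):

import torch
import torch.nn as nn
from torchdiffeq import odeint_adjoint as odeint  # adjoint method, as in my runs

class ODEFunc(nn.Module):
    # Two 1-D conv blocks, analogous to the MNIST example but on latent waveform features.
    def __init__(self, channels):
        super(ODEFunc, self).__init__()
        self.net = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.PReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, t, x):  # autonomous dynamics: t is not used here
        return self.net(x)

class ODEBlock(nn.Module):
    def __init__(self, odefunc):
        super(ODEBlock, self).__init__()
        self.odefunc = odefunc
        self.register_buffer('t', torch.tensor([0.0, 1.0]))

    def forward(self, x):
        # integrate over [0, 1] and keep only the final state for the decoder
        out = odeint(self.odefunc, x, self.t, rtol=1e-3, atol=1e-3)
        return out[-1]

# model = nn.Sequential(Encoder(), ODEBlock(ODEFunc(latent_channels)), Decoder())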

I ran both configurations 5 times and chose the best results on a small dataset (2 hrs of audio). My observations were that not only did the standard configuration (without the ODE block) significantly outperform the ODE one on the SNR metric (by around 170%), but the ODE one was also really sensitive to initialization and learning rate. Moreover, adding the ODE block slowed training down by about 6x (with the adjoint method). This experiment may not be representative of all possible problem settings and tasks, but at least my trials here do not show that the neural ODE block is always effective or can easily replace residual blocks in deep architectures (especially between deep sub-modules in a system?). I'm curious about observations on different tasks/models/datasets to see if I'm the only one having this issue.

rafaelvalle commented on July 2, 2024

Did you try using a different non-linearity and norm on the Neural ODE itself?
Did you check the norm of the gradients on the encoder with the Neural ODE?
Did you consider training the NODE starting from the previously trained encoder and decoder that work better? IIRC at the first iteration the output of the NODE should just be close to the identity, so it shouldn't get worse than that...

yluo42 commented on July 2, 2024

Did you try using a different non-linearity and norm on the Neural ODE itself?

I tried ReLU/PReLU/Tanh and didn't observe much difference. I can try more learning rates, but I'm not that optimistic about it.

Did you check the norm of the gradients on the encoder with the Neural ODE?

I haven't, but I always apply gradient clipping so it might not be the main issue? I should take a look at that though.

Did you consider training the NODE starting from the previously trained encoder and decoder that work better? IIRC at the first iteration the output of the NODE should just be close to the identity, so it shouldn't get worse than that...

I can definitely try that. Why would the output of the NODE be close to the identity at the first iteration? It doesn't look true to me.

rafaelvalle commented on July 2, 2024

Check the norm of the gradients just to confirm they are not vanishing, even though they should not be.
The identity comment was my mistake. It should not be the identity; it should be closer to a linear transformation.
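
Something along these lines would do it (a quick sketch; the encoder/ode_block/decoder attribute names are placeholders for however your model is organized):

# call after loss.backward() to compare gradient norms across sub-modules
def grad_norm(module):
    total = 0.0
    for p in module.parameters():
        if p.grad is not None:
            total += p.grad.detach().norm().item() ** 2
    return total ** 0.5

print("enc:", grad_norm(model.encoder),
      "ode:", grad_norm(model.ode_block),
      "dec:", grad_norm(model.decoder))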

yluo42 commented on July 2, 2024

The gradients in the ODE block are on the same scale as those in the encoder/decoder (and not vanishing), so that's probably not the reason. Let me try something like concatenating t into the input to make it non-autonomous and see what happens.

Beyond the model itself, I'm more interested in the "infinite receptive field" argument, since that sometimes plays an important role in long-range sequence modeling tasks. I would really like to see what the role of conv blocks is compared with FC layers in this situation.

rafaelvalle commented on July 2, 2024

concatenating t into the input to make it non-autonomous and see what happens.

Concatenating t into the input of what?

yluo42 commented on July 2, 2024

To the input of the conv layers in the ODE block, like what they do in the MNIST example (and the FFJORD project):

import torch
import torch.nn as nn

class ConcatConv2d(nn.Module):
    # Conv layer that concatenates the scalar time t as an extra input channel.
    def __init__(self, dim_in, dim_out, ksize=3, stride=1, padding=0,
                 dilation=1, groups=1, bias=True, transpose=False):
        super(ConcatConv2d, self).__init__()
        module = nn.ConvTranspose2d if transpose else nn.Conv2d
        self._layer = module(
            dim_in + 1, dim_out, kernel_size=ksize, stride=stride,
            padding=padding, dilation=dilation, groups=groups, bias=bias
        )

    def forward(self, t, x):
        # broadcast t over a single channel and concatenate it to the input
        tt = torch.ones_like(x[:, :1, :, :]) * t
        ttx = torch.cat([tt, x], 1)
        return self._layer(ttx)
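
For my 1-D case I would adapt it along these lines (an untested sketch, same idea with Conv1d):

class ConcatConv1d(nn.Module):
    # Same trick as ConcatConv2d, but for 1-D waveform features:
    # broadcast the scalar t as an extra input channel before the convolution.
    def __init__(self, dim_in, dim_out, ksize=3, stride=1, padding=0,
                 dilation=1, groups=1, bias=True):
        super(ConcatConv1d, self).__init__()
        self._layer = nn.Conv1d(
            dim_in + 1, dim_out, kernel_size=ksize, stride=stride,
            padding=padding, dilation=dilation, groups=groups, bias=bias
        )

    def forward(self, t, x):
        tt = torch.ones_like(x[:, :1, :]) * t
        ttx = torch.cat([tt, x], 1)
        return self._layer(ttx)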

rafaelvalle commented on July 2, 2024

Surprised to hear that the convs in your ODE block did not have t as input.

yluo42 commented on July 2, 2024

Mainly because in a previous (closed) issue the author mentioned that for MNIST, including t or not didn't make a difference, so I went without it for this small dataset.

rafaelvalle commented on July 2, 2024

What is that issue number? Can you share a link?

yluo42 commented on July 2, 2024

Sure, it's in #14 (comment).

rtqichen commented on July 2, 2024

I don't think the receptive field is a clear indication of a model's performance, or we would all be using convolution layers larger than 3x3. You're right that the effective receptive field would (theoretically) be infinite, but in practice it depends on the number of actual function evaluations. (This is an interesting observation!) Though I don't think it degenerates into an FC block, as an (infinite) series of stacked convolutional layers still uses locality assumptions and should not have the same degrees of freedom as a true FC layer (though I'm not entirely sure). Perhaps a more meaningful measure would relate to the complexity of the transformation, rather than just the locality of the dependencies.

An ODEBlock's behavior is quite sensitive to the topology of the underlying hidden space, and a "difficult-to-navigate" space requires a complex (and slow-to-solve) ODE. A recent work (https://arxiv.org/abs/1904.01681) showed that simply padding zeros and using an ODE on the pixel space is sufficient for classification, but in general having a good initial hidden space should help significantly, especially for tasks where the output is of higher dimension.
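
Roughly, that augmentation amounts to something like this (a sketch, not the paper's exact code; aug_channels is a free parameter, and the ODE function must be built to accept the extra channels):

import torch
import torch.nn as nn
from torchdiffeq import odeint

class AugmentedODEBlock(nn.Module):
    # Pad extra zero channels so the ODE has room to move trajectories around
    # without them having to cross in the original space.
    def __init__(self, odefunc, aug_channels):
        super(AugmentedODEBlock, self).__init__()
        self.odefunc = odefunc          # must operate on c + aug_channels channels
        self.aug_channels = aug_channels
        self.register_buffer('t', torch.tensor([0.0, 1.0]))

    def forward(self, x):
        c = x.shape[1]
        aug = x.new_zeros(x.shape[0], self.aug_channels, *x.shape[2:])
        z0 = torch.cat([x, aug], dim=1)            # augmented initial state
        zT = odeint(self.odefunc, z0, self.t)[-1]  # final state at t = 1
        return zT[:, :c]                           # drop the augmented channels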

I'm not sure why adding an ODEBlock to your autoencoder actually made performance worse for you. Most tricks for initializing residual nets (like zeroing the weights of the last layer) should help for ODEs as well; this initializes the ODE as the identity.
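
Concretely, with the last layer of the ODE function zeroed, f(t, x) = 0 at initialization, so integrating over [0, 1] leaves the input unchanged. A sketch (assuming the ODE function ends in a conv or linear layer):

import torch.nn as nn

def zero_init_last_layer(odefunc):
    # Zero the final conv/linear layer so that dx/dt = f(t, x) = 0 at
    # initialization, i.e. the ODEBlock starts out as the identity map.
    last = None
    for m in odefunc.modules():
        if isinstance(m, (nn.Conv1d, nn.Conv2d, nn.Linear)):
            last = m
    if last is not None:
        nn.init.zeros_(last.weight)
        if last.bias is not None:
            nn.init.zeros_(last.bias)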

rafaelvalle commented on July 2, 2024

Thanks for bringing up the zero initialization as a means to make the ODE an identity!

yluo42 commented on July 2, 2024

Thanks for the thorough reply! I don't think the receptive field is an indicator of performance either; I'm just curious about how it affects the feature extraction process. For example, could it sometimes even be harmful for tasks that strictly require locality?

And thanks also for pointing out the recent paper - I skimmed through it and it's really interesting. I think the main idea is similar to applying a kernel to the feature space so that the dynamics become easier to learn, as in standard kernel methods, and other (nonlinear) kernels might show similar behavior at the cost of more model complexity than this simple zero-padding paradigm. I think my case here is pretty similar - the dynamics in the latent space might just be too hard for a simple ODEBlock to learn. I'll play with the initialization/design/problem formulation to see if it can do better.
