Comments (14)
As a follow-up, I tried a neural ODE in a denoising autoencoder (DAE) architecture on audio clips (since my focus is mainly on audio processing tasks). I use a ResNet-style encoder to map the noisy input to a latent space, and another ResNet-style decoder to reconstruct the clean input. I tried two architectures: one with standard training (input -> Enc -> Dec -> output), and one with an extra ODE block between the encoder and the decoder (input -> Enc -> ODE block -> Dec -> output). The ODE block was integrated over the time interval [0, 1] and only the final state was used for training. The intuition here is that the ODE block should learn dynamics in the latent space that move the noisy features toward the clean features for reconstruction, and I wanted to see whether this was trainable with such a deep architecture. Both the encoder and the decoder consisted of 8 1-D CNN blocks with per-layer residual connections, and the input to the model is the raw waveform of the noisy audio clips (no STFT). The ODE block contained 2 1-D CNN blocks similar to the MNIST example script.
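For concreteness, a minimal sketch of the second configuration with torchdiffeq; the encoder/decoder below are single-conv stand-ins for the actual 8-block ResNet-style sub-modules, and `latent_channels` is a hypothetical width:

```python
import torch
import torch.nn as nn
from torchdiffeq import odeint_adjoint as odeint

class ODEFunc(nn.Module):
    # Latent-space dynamics: two 1-D conv blocks, similar to the MNIST example.
    def __init__(self, channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, t, h):
        return self.net(h)

class ODEBlock(nn.Module):
    # Integrate the dynamics over t in [0, 1]; keep only the final state.
    def __init__(self, func):
        super().__init__()
        self.func = func
        self.register_buffer("t", torch.tensor([0.0, 1.0]))

    def forward(self, h):
        return odeint(self.func, h, self.t)[-1]

latent_channels = 64  # hypothetical latent width
encoder = nn.Conv1d(1, latent_channels, 3, padding=1)  # stand-in for the 8-block encoder
decoder = nn.Conv1d(latent_channels, 1, 3, padding=1)  # stand-in for the 8-block decoder

# noisy waveform (batch, 1, samples) -> Enc -> ODE block -> Dec -> denoised waveform
model = nn.Sequential(encoder, ODEBlock(ODEFunc(latent_channels)), decoder)
```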
I ran both configurations 5 times and chose the best results on a small dataset (2 hrs of audio). My observations were that not only did the standard configuration (without the ODE block) significantly outperform the ODE one on the SNR metric (by around 170%), but the ODE one was also really sensitive to initialization and learning rate. Moreover, adding the ODE block slowed training down by about 6x (with the adjoint method). This experiment may not be representative across all possible problem settings and tasks, but at least my trials here suggest that the neural ODE block is not always effective and cannot simply replace residual blocks in deep architectures (especially between deep sub-modules in a system?). I'm curious about observations on different tasks/models/datasets, to see if I'm the only one having this issue.
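For reference, by SNR I mean the usual dB ratio against the clean signal; a minimal sketch of the standard definition (the exact evaluation code may differ):

```python
import torch

def snr_db(clean: torch.Tensor, estimate: torch.Tensor) -> torch.Tensor:
    # Standard signal-to-noise ratio in dB: power of the clean reference
    # over the power of the residual (clean - estimate).
    noise = clean - estimate
    return 10.0 * torch.log10(clean.pow(2).sum() / noise.pow(2).sum())
```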
Did you try using a different non-linearity and norm on the Neural ODE itself?
Did you check the norm of the gradients on the encoder with the Neural ODE?
Did you consider training the NODE starting from the encoder and decoder that were previously trained and work better? IIRC at the first iteration the output of the NODE should just be close to the identity, so it shouldn't get worse than that...
Did you try using a different non-linearity and norm on the Neural ODE itself?
I tried ReLU/PReLU/Tanh and didn't observe much difference. I can try more learning rates, but I'm not that optimistic about it.
Did you check the norm of the gradients on the encoder with the Neural ODE?
I haven't, but I always apply gradient clipping, so it might not be the main issue? I should take a look at that, though.
Did you consider training the NODE starting from the encoder and decoder that were previously trained and work better? IIRC at the first iteration the output of the NODE should just be close to the identity, so it shouldn't get worse than that...
I can definitely try that. Why is it that at the first iteration the NODE is close to the identity? It doesn't seem true to me.
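If it helps, a sketch of what that pretraining schedule could look like (hypothetical file names, reusing the ODEBlock/ODEFunc names from the sketch above):

```python
# Hypothetical: load an encoder/decoder pretrained without the ODE block,
# freeze them, and train only the newly inserted ODE block at first.
encoder.load_state_dict(torch.load("pretrained_encoder.pt"))
decoder.load_state_dict(torch.load("pretrained_decoder.pt"))
for p in list(encoder.parameters()) + list(decoder.parameters()):
    p.requires_grad = False

ode_block = ODEBlock(ODEFunc(latent_channels))
optimizer = torch.optim.Adam(ode_block.parameters(), lr=1e-4)
```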
Check the norm of the gradients, just to confirm they are not vanishing (even though they shouldn't be).
The identity comment was my mistake. It should not be the identity; it should be closer to a linear transformation.
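For anyone checking this, a small helper to compare gradient scales per sub-module after loss.backward() (a sketch; module names are from the earlier sketch):

```python
import torch.nn as nn

def grad_norm(module: nn.Module) -> float:
    # Total L2 norm of all parameter gradients in a module;
    # call after loss.backward() to compare scales across sub-modules.
    total = 0.0
    for p in module.parameters():
        if p.grad is not None:
            total += p.grad.norm().item() ** 2
    return total ** 0.5

# e.g. print(grad_norm(encoder), grad_norm(ode_block), grad_norm(decoder))
```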
The gradients in the ODE block are on the same scale as those in the encoder/decoder (and not vanishing), so that's probably not the reason. Let me try something like concatenating t to the input to make the dynamics non-autonomous and see what happens.
Beyond the model itself, I'm more interested in the "infinite receptive field" argument, since that sometimes plays an important role in long-range sequence modeling tasks. I would really like to see what role conv blocks play compared with FC layers in this situation.
concatenating t to the input to make the dynamics non-autonomous and see what happens.
Concatenating t into the input of what?
To the input of the conv layers in the ODE block, as they do in the MNIST example (and the FFJORD project):
torchdiffeq/examples/odenet_mnist.py, lines 76 to 89 at commit 8c60789
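For readers without the permalink handy, the pattern there is a conv that takes t as an extra channel; a 1-D analogue of the example's ConcatConv2d might look like this (a sketch, not the verbatim source):

```python
import torch
import torch.nn as nn

class ConcatConv1d(nn.Module):
    # Concatenate the scalar time t as an extra input channel before the conv,
    # making the learned dynamics explicitly time-dependent (non-autonomous).
    def __init__(self, dim_in, dim_out, **kwargs):
        super().__init__()
        self.conv = nn.Conv1d(dim_in + 1, dim_out, **kwargs)

    def forward(self, t, x):
        tt = torch.ones_like(x[:, :1, :]) * t  # (batch, 1, length) filled with t
        return self.conv(torch.cat([tt, x], dim=1))
```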
Surprised to hear that the convs in your ODE block did not have t as input.
Mainly because in a previous (closed) issue the author mentioned that for MNIST, training with or without t didn't make a difference, so I went without it for this small dataset.
What is that issue number? Can you share a link?
Sure, it's in #14 (comment).
I don't think the receptive field is a clear indication of a model's performance, or we would all be using convolutions larger than 3x3. You're right that the effective receptive field would (theoretically) be infinite, but in practice it depends on the number of actual function evaluations. (This is an interesting observation!) Though I don't think it degenerates into an FC block: an (infinite) series of stacked convolutional layers still relies on locality assumptions and should not have the same degrees of freedom as a true FC layer (though I'm not entirely sure). Perhaps a more meaningful measure would relate to the complexity of the transformation, rather than just the locality of the dependencies.
An ODEBlock's behavior is quite sensitive to the topology of the underlying hidden space, and a "difficult-to-navigate" space requires a complex (and slow-to-solve) ODE. A recent work (Augmented Neural ODEs, https://arxiv.org/abs/1904.01681) showed that simply padding zeros and running an ODE in pixel space is sufficient for classification, but in general a good initial hidden space should help significantly, especially for tasks where the output is of higher dimension.
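The augmentation trick from that paper amounts to appending zero channels before running the ODE; a minimal sketch:

```python
import torch

def augment(h: torch.Tensor, extra_channels: int) -> torch.Tensor:
    # Append zero-valued channels so the ODE flows in a higher-dimensional
    # space, where trajectories that would otherwise cross can be separated.
    pad = torch.zeros(h.size(0), extra_channels, *h.shape[2:],
                      device=h.device, dtype=h.dtype)
    return torch.cat([h, pad], dim=1)
```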
I'm not sure why adding an ODEBlock to your autoencoder actually made performance worse. Most tricks for initializing residual nets (like zeroing the weights of the last layer) should help for ODEs as well; this initializes the ODE as the identity.
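Concretely, the zero-init trick is just this (a sketch, assuming the last layer of the ODE function is a conv or linear layer):

```python
import torch.nn as nn

def zero_init_last(layer: nn.Module) -> None:
    # Zero the final layer so f(t, h) = 0 at initialization; the solution of
    # dh/dt = f is then constant, i.e. the ODE block starts as the identity.
    nn.init.zeros_(layer.weight)
    if layer.bias is not None:
        nn.init.zeros_(layer.bias)
```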
Thanks for bringing up zero initialization as a means of initializing the ODE as the identity!
Thanks for the thorough reply! Yes, I don't think the receptive field is an indicator of performance either; I'm just curious how it would affect the feature extraction process. For example, would it sometimes even be harmful for tasks that strictly require locality?
And also thanks for pointing out the recent paper - I skimmed through it and it's really interesting. I think the main idea there is similar to applying a kernel to the feature space so that the dynamics are easier to learn, as in standard kernel methods, and other (nonlinear) kernels might yield similar observations at a higher model complexity than this simple zero-padding paradigm. My case here seems similar: the dynamics in the latent space might be too hard for a simple ODEBlock to learn. I'll play with the initialization/design/problem formulation to see if it can do better.