Comments (23)

neonbjb commented on August 11, 2024

I'm also seeing the NaNs when training the model from the code in this repo. They only appear when computing the flow chain in the 'reverse' direction for me; the 'normal' direction is stable and trains well. Feeding z values generated from a 'normal' pass into a 'reverse' pass reproduces the HR input, as expected.

Using a suggestion above from @andreas128, I tracked the issue down to the affine coupling operation here. I think there is a natural mathematical instability: if the scale gets too small in one layer, later layers compound the problem multiplicatively until the result overflows to inf (in the 'reverse' pass, these scales are used as the divisor).
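To make the failure mode concrete, here is a minimal sketch of the affine coupling arithmetic (illustrative names, not the repo's actual code):

```python
import torch

def affine_coupling(x, scale, shift, reverse=False):
    """Minimal sketch of one affine coupling step. In a real flow, scale
    and shift come from a small network conditioned on the other half of
    the activations; here they are plain tensors for illustration."""
    if not reverse:
        # 'normal' direction: a small scale merely shrinks the output
        return x * scale + shift
    # 'reverse' direction: the scale becomes a divisor, so a near-zero
    # scale in one layer inflates every later layer until values hit inf
    return (x - shift) / scale
```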

I don't have any good ideas to fix this, because I think the root cause is that the network simply isn't capable of 'using' some of the Gaussian vectors I'm providing to it as z values. By this I mean: the NLL loss trains the flow network to split an HQ image into an LQ embedding and some z value that is indistinguishable from a Gaussian, but does that necessarily mean that all possible Gaussian vectors can combine with any arbitrary LQ embedding? This is one aspect of training flow networks I don't quite understand.

I suspect that as I continue training, the network will learn to 'use' a broader spectrum of z latents, and this problem will diminish or disappear completely. It certainly appears to be doing so a bit already at 12k steps. Still, it'd be nice to know whether the authors or anyone else training these networks sees this issue early on.

Here's an example batch of images from my side:
(image: 11700_00)
Black boxes are areas with NaNs. Not quite sure what causes the noise.

mxtsai commented on August 11, 2024

@neonbjb From my earlier experiments (I haven't checked in my recent ones), the scales seemed to stay at reasonable values. I was able to train my model more stably after lowering the learning rate to 2e-4. Have you plotted the log-prior (log_p) term and the log-determinant term during training? I observed that my model produced better results when the log-determinant term decreased and the log_p term increased during training.
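For reference, a minimal sketch of logging the two terms separately, assuming your forward pass already returns log_p and logdet (the bits-per-dim normalization is the usual Glow-style one):

```python
import math
import torch

def nll_terms(log_p: torch.Tensor, logdet: torch.Tensor, n_pixels: int):
    """Return (log_p, logdet, nll in bits/dim) so both terms can be
    plotted separately. Glow-style models use
    nll = -(log_p + logdet) / (n_pixels * log 2)."""
    nll_bpd = -(log_p + logdet) / (n_pixels * math.log(2))
    return log_p.mean().item(), logdet.mean().item(), nll_bpd.mean().item()
```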

andreas128 commented on August 11, 2024

You could try the following:

  • Decode the z-LR pair with z of standard deviation [0, 0.1, 0.2, ... 1.0, 1.1].
  • Add a small number to the division to protect against divide by zero.
  • Print the mean absolute value after each layer to find out where the NaNs occur.

Does that help? A sketch of these three checks follows. You can also drag and drop images in here.
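Assuming a model whose reverse pass maps a (z, LR) pair to an SR image (the call signature and helper names here are illustrative):

```python
import torch

@torch.no_grad()
def decode_at_stds(model, lr, z_shape):
    # 1) decode the z-LR pair with z of increasing standard deviation
    for std in [i / 10 for i in range(12)]:  # 0.0, 0.1, ..., 1.1
        z = torch.randn(z_shape) * std
        sr = model(lr=lr, z=z, reverse=True)
        print(f"std={std:.1f}  NaNs={torch.isnan(sr).sum().item()}")

# 2) in the coupling layer, guard the division against tiny scales:
#    y = (x - shift) / (scale + 1e-6)

def add_abs_hooks(model):
    # 3) print the mean absolute value after each layer to locate the NaNs
    def hook(module, inputs, output):
        if torch.is_tensor(output):
            print(type(module).__name__, output.abs().mean().item())
    for m in model.modules():
        m.register_forward_hook(hook)
```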

Alan-xw commented on August 11, 2024

Thanks for your advice! I have tried those experiments:

  1. I decoded the z-LR pair with z of standard deviation [0, 0.1, 0.2, ... 1.0, 1.1]. When the standard deviation is less than 0.5, the black regions disappear. However, the image is blurry and lacks details.

  2. I also added a small number to the division to protect against divide by zero, but in the training stage the error "RuntimeError: 'DivBackward0' returned nan values in its 1th output." still occurs.

Regarding adding noise:
I have added noise to the HR image as you mentioned. Should the pixel values of the input HR be in [0, 255], [-0.5, 0.5], or [0, 1]?

neonbjb commented on August 11, 2024

In case anyone runs into my problem: it appears to be just part of the training process. These artifacts do seem to diminish over time as the network trains. I suppose this intuitively makes sense; I'm just surprised no one else has run into it.

Alan-xw commented on August 11, 2024

> In case anyone runs into my problem: it appears to be just part of the training process. These artifacts do seem to diminish over time as the network trains. I suppose this intuitively makes sense; I'm just surprised no one else has run into it.

I have tested the images with the public testing code and found that when t = 1 the NaNs still exist in the SR results. The main reason, I think, is that the latent vector distribution generated by the model still does not match the standard Gaussian distribution.

eridgd commented on August 11, 2024

@Alan-xw could you please post a link to your code?

andreas128 commented on August 11, 2024

Where do the NaNs occur in the network? As the network architecture is bijective, even the untrained network should be invertible.

Try to predict the latent vectors (z) for an HR-LR pair and then feed the z-LR pair to the reverse network. If this does not give you the HR image, reduce the network until this is the case.
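A minimal sketch of that round trip (the call signature is illustrative, not the exact API):

```python
import torch

@torch.no_grad()
def check_invertibility(model, hr, lr, tol=1e-3):
    """Encode an HR-LR pair to z, decode z back, and compare. If the
    reconstruction error is large even for an untrained network, some
    layer is not actually invertible."""
    z = model(gt=hr, lr=lr, reverse=False)    # HR -> z
    hr_rec = model(lr=lr, z=z, reverse=True)  # z  -> HR
    err = (hr - hr_rec).abs().max().item()
    print(f"max reconstruction error: {err:.2e}")
    return err < tol
```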

Alan-xw commented on August 11, 2024

Hey, I use a generated z and an LR test image to generate its SR version for evaluation during training. The NaNs occur in the SR results, causing black rectangular regions in the images.

Another problem:

I use the exp function in the AffineCouplingLayer and AffineInjectorLayer as described in the SRFlow paper. But when I start to train the network, I get errors like "RuntimeError: 'DivBackward0' returned nan values in its 1th output." I checked this error and found it was caused by the exp function. If you have met this issue, how did you solve it?

Alan-xw commented on August 11, 2024

(image: flow step showing f_s_exp, f_b, and h)

As you can see in the picture, the NaNs often occur in f_s_exp and f_b, and also in h (since h depends on f_s_exp and f_b), at the deeper levels of SRFlow, e.g. level 2, 15th ConditionalFlowStep. I printed the mean absolute value after each layer in debug mode as you suggested, and found that once there is a large value (possibly > 1) in the encoding network's output feature u or the intermediate feature h, the NaNs appear after repeated multiplications and additions.

The black rectangular regions look like the patterns shown below.
(image: black rectangle artifacts)

andreas128 commented on August 11, 2024

Did you try those experiments from the previous answer?

  • Decode the z-LR pair with z of standard deviation [0, 0.1, 0.2, ... 1.0, 1.1] and compare them visually.
  • Add a small number to the division to protect against divide by zero.

Adding noise might also help. Did you try to add noise as described in the Appendix?

(screenshot: noise-adding step from the paper's Appendix)

Looking forward to seeing your results!

andreas128 commented on August 11, 2024

Sorry for the late reply. We have good news: we were allowed to publish the model code.

Does this help you? Any thoughts on improving it?

Alan-xw commented on August 11, 2024

Thanks for sharing the code. I have solved this issue and trained SRFlow well with an implementation that is slightly different from the source code. I will dig into the source code and run some experiments on it later.

Thanks again!

andreas128 commented on August 11, 2024

Great, feel free to reach out anytime!

mxtsai commented on August 11, 2024

Hi @Alan-xw! I'm also experiencing an issue where I get NaNs after training for around 10k iterations, and I also see the same black rectangles you showed.

I've added uniform noise in [0, 1/256) to the input (which is scaled and shifted to values in [-0.5, 0.5)).
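For comparison, a sketch of that preprocessing as I understand it (the helper name is mine):

```python
import torch

def preprocess_hr(hr_uint8: torch.Tensor) -> torch.Tensor:
    """Add uniform dequantization noise u ~ U[0, 1) per 8-bit pixel, then
    scale and shift so the input lies in [-0.5, 0.5); after scaling, the
    noise magnitude is in [0, 1/256), matching the setup above."""
    x = hr_uint8.float()
    x = (x + torch.rand_like(x)) / 256.0 - 0.5
    return x
```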

May I know what you did to solve the NaN training issue? And have you figured out what was causing it?

(I'm also using code that is slightly different from the source code)

Thanks in advance!

LyWangPX commented on August 11, 2024

In my case, I did not take a look at the Glow source code (which I regretted) and built my own. Thus I do not have the regularizing details like the division by log 2. In such a from-scratch case, the coupling layers, especially where f is implemented, are the most likely place to hit NaNs early in training. I fixed it by adding a clip (a sketch follows); as training goes on, everything calms down.
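A sketch of the kind of clip I mean, applied to the raw log-scale before exponentiating (the clamp range is illustrative):

```python
import torch

def stable_scale(log_s: torch.Tensor) -> torch.Tensor:
    """Clamp the raw log-scale before exponentiating, so exp(log_s) can
    neither vanish (which blows up the reverse-pass division) nor explode
    (which overflows the forward pass). The [-5, 5] range is illustrative."""
    return torch.exp(torch.clamp(log_s, min=-5.0, max=5.0))
```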

neonbjb commented on August 11, 2024

> In case anyone runs into my problem: it appears to be just part of the training process. These artifacts do seem to diminish over time as the network trains. I suppose this intuitively makes sense; I'm just surprised no one else has run into it.

> I have tested the images with the public testing code and found that when t = 1 the NaNs still exist in the SR results. The main reason, I think, is that the latent vector distribution generated by the model still does not match the standard Gaussian distribution.

+1, I've also seen it with the pretrained model.

I think your statement could be elaborated: the model does not generalize to a standard Gaussian for all possible input LR images. The NaN areas of the image often correspond to unusual regions that rarely occur in natural images.

This is actually a kind of neat result in and of itself, but it makes using this for real SR challenging.

avinash31d commented on August 11, 2024

Hi @mxtsai,
Can you please give some more details on "I observed that my model produced better results when the log-determinant term decreased and the log_p term increased during training"?

What is the possible range of log_p values?
What is the possible range of logdet values?

avinash31d commented on August 11, 2024

Hi @andreas128,

I noticed a few differences between the code and the paper. Can you please shed some light on them?

  1. In the paper, the actnorm and affine layers apply scaling first and then shifting, whereas in the code it is the other way around.
  2. I also found that the use of exp() is switched between the calculation of the values and the log-determinants, compared to the paper.

Could this be a possible reason for the artifacts shown above?
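For concreteness, a sketch of the two orderings and their inverses (my own illustration, not the repo's code; note that the log-determinant, sum(log|s|), is the same either way):

```python
import torch

def scale_then_shift(x, s, b, reverse=False):
    # paper ordering: y = s * x + b, with inverse x = (y - b) / s
    return s * x + b if not reverse else (x - b) / s

def shift_then_scale(x, s, b, reverse=False):
    # code ordering: y = s * (x + b), with inverse x = y / s - b
    return s * (x + b) if not reverse else x / s - b

# Both maps have a diagonal Jacobian with entries s, so
# logdet = torch.log(torch.abs(s)).sum() is identical; only the
# effective shift differs by a factor of s, which the network can absorb.
```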

Pierresips commented on August 11, 2024

Hello.

On my side, the NaN artifact appeared in an even simpler case.
I cloned the git repo and ran it on the DIV2K dataset. The issue shows up occasionally at the 4x zoom ratio, but always in the version with the higher variance.
I also tried the 8x case with the pretrained model, and in that case all images get the NaN issue.
I'm a bit surprised by this issue. As it seems related to division by zero or to computation precision (per the previous comments; I have not investigated this yet), could it also be related to the hardware? I do meet the specified requirements; my hardware is an RTX 3090.

Zuomy826 commented on August 11, 2024

@andreas128 Hello, I added a new loss in optimize_parameters: an L2 loss between SR and GT (with reverse set to true). But the nll this function returns becomes NaN at about 10k iterations, and the error "svd_cuda: the updating process of SBDSDC did not converge (error: 11)" appears. When I test the model saved at iteration 10k, most of the results look fine, but there are still some black blocks on one or two of the pictures.
(images: SR results with black blocks)
I also noticed in your code for optimize_parameters:
(screenshot: optimize_parameters code)
Both opt_get(self.opt, ['train', 'weight_fl']) and weight_l1 = opt_get(self.opt, ['train', 'weight_l1']) are not defined in the config, so the part highlighted in green will never be calculated.
So I guess: is it the newly added loss that caused this result? And how can I solve it?
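For what it's worth, a sketch of a nested lookup with an explicit default, mimicking what opt_get appears to do (returning None for missing keys), so the branch can actually run; the fallback values are illustrative:

```python
def opt_get_default(opt: dict, keys, default):
    """Nested-dict lookup with a fallback, mimicking the repo's opt_get,
    which returns None when a key such as train.weight_fl or
    train.weight_l1 is absent from the config."""
    for k in keys:
        if not isinstance(opt, dict) or k not in opt:
            return default
        opt = opt[k]
    return opt

# Illustrative usage inside optimize_parameters:
#   weight_fl = opt_get_default(self.opt, ['train', 'weight_fl'], 1.0)
#   weight_l1 = opt_get_default(self.opt, ['train', 'weight_l1'], 0.0)
```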

Looking forward to your reply.
Best wishes!
Zuo

ph0316 commented on August 11, 2024

#2 (comment)
Hello, I also encountered this problem. How can I solve it?

ph0316 commented on August 11, 2024

> (Quoting @Zuomy826's comment above.)

Hello, I also encountered this problem. How can I solve it?
