
Comments (16)

OBVIOUSDAWN commented on August 20, 2024

I'm using a TITAN Xp with torch 1.9.1 to train this model. I installed the package and tested that it works; the dataset is V-COCO, downloaded with the provided script. Thank you very much.

from upt.

fredzzhang commented on August 20, 2024

Hi @OBVIOUSDAWN,

Thanks for taking an interest in our work.

The NaN loss problem was quite a pain. I ran into the issue a long time ago and managed to resolve it by using larger batch sizes. The problem was that the spatial encodings had bad scales, which made training very unstable. I see that you are using only one GPU to train, so the batch size is most likely insufficient.

Here are a few things you can try:

  1. For the log terms in the pairwise positional encodings, use log(1+x) instead of log(x+epsilon).
  2. Add batch norm in the spatial head that computes the pairwise positional encodings.
  3. Increase batch size (probably the easiest option).
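A minimal sketch of suggestion 1, assuming a spatial-feature tensor f like the one concatenated in compute_spatial_encodings (the variable names here are illustrative):

```python
import torch

# Per-pair spatial features; some entries are legitimately zero or near zero.
f = torch.tensor([0.0, 1e-6, 0.5, 1.0])

# Before: log(f + eps) swings to large negative values as f -> 0,
# e.g. log(1e-10) is about -23, giving the encodings a bad scale.
eps = 1e-10
before = torch.cat([f, torch.log(f + eps)])

# After (suggestion 1): log1p(f) == log(1 + f) is exactly 0 at f == 0
# and stays on the same order of magnitude as f itself.
after = torch.cat([f, torch.log1p(f)])
```

Suggestion 2 then amounts to inserting nn.BatchNorm1d layers between the linear layers of the head that consumes these encodings.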

Hope that resolves the issue.

Cheers,
Fred.


OBVIOUSDAWN commented on August 20, 2024

Dear sir,
I tried the model on a new server with 4×3090s and batch size 4, which shows the same error on rank 3. Regarding your second suggestion, do you mean the "Pairwise Box Positional Encodings" in the paper? I found a "PositionEmbeddingSine" in /detr/model/position_encoding.py; changing its eps shows the same error, and I also tried changing the eps in "binary_focal_loss_with_logits" and "compute_spatial_encodings" in /ops.py. I printed out the whole network, but I can't tell which part implements the pairwise box positional encodings. I look forward to your reply. Thank you very much.


fredzzhang commented on August 20, 2024

...do you mean the "Pairwise Box Positional Encodings" in the paper

Yes, it is implemented in ops.py. If you are running on 4 GPUs with a per-GPU batch size of 4, you have an effective batch size of 16. I think that's sufficiently large. Are you still getting the error?
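For reference, the effective batch size under distributed data parallel training is just the per-process batch multiplied by the world size, so the two configurations discussed in this thread are equivalent (a trivial sketch):

```python
def effective_batch_size(world_size: int, per_gpu_batch: int) -> int:
    # Each of the `world_size` processes contributes `per_gpu_batch`
    # samples to every optimization step.
    return world_size * per_gpu_batch

assert effective_batch_size(4, 4) == 16   # 4 x 3090, batch size 4 per GPU
assert effective_batch_size(8, 2) == 16   # the original 8-GPU training setup
```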

Fred.


OBVIOUSDAWN commented on August 20, 2024

Yes, the effective batch size is 16 and it shows the same error. I also tried changing

    features.append(torch.cat([f, torch.log(f + eps)], 1))

to use log(1 + x) instead of log(x + eps) in "compute_spatial_encodings", and I got the same error. I look forward to your reply. Thank you very much.


fredzzhang commented on August 20, 2024

That's odd. If the batch size is 16, it should work now. Can you try some different seeds?

Fred.


leijue222 commented on August 20, 2024

Hi @fredzzhang,
Thank you for your contribution. I am very interested in your work and want to deepen my understanding of the paper by running the code, but I can't get it to run.

I encountered the same error using the same command on a 3090:

    python main.py --world-size 1 --dataset vcoco --data-root vcoco/ --partitions trainval test --pretrained checkpoints/detr-r50-vcoco.pth --output-dir checkpoints/upt-r50-vcoco2

I haven't changed any code; I just downloaded the code and checkpoint model according to the README. Then I tried to run the training command, but it failed with this error.
Could you give me some help to solve it?


fredzzhang commented on August 20, 2024

Hi @leijue222,

That should be an issue related to the batch size. I trained the model on 8 GPUs with a batch size of 2 per GPU, for an effective batch size of 16. Since you are training with one GPU, you need to set the batch size to 16.

Let me know if that works.

Fred.


leijue222 commented on August 20, 2024

Wow, thanks Fred! It worked!
It was indeed a batch-size problem.

At present, GPU memory usage has gone from 12 GB to 23 GB; it's unclear whether a single 3090 with bs=16 will run out of memory later in training. By the way, how much time did you spend training V-COCO?
[screenshot]


fredzzhang commented on August 20, 2024

Towards the end of the Model Zoo section, I added some stats for 8 TITAN X GPUs; for V-COCO, training takes 40 minutes. I don't know how long it will take one 3090 to train it, but it shouldn't be too long.

Fred.


leijue222 commented on August 20, 2024

Thanks again, I love this work.


yuchen2199 commented on August 20, 2024

I met the same error using this command on a 3090:

    python main.py --world-size 1 --batch-size 16 --dataset vcoco --data-root vcoco/ --partitions trainval test --pretrained checkpoints/detr-r50-vcoco.pth --output-dir checkpoints/upt-r50-vcoco

Could you help me to solve the problem? Thanks.


fredzzhang commented on August 20, 2024

Hi @yuchen2199,

Sometimes training can be unstable even with a batch size of 16. If possible, increasing the batch size further should make the problem happen less often.

Fred.


yuchen2199 commented on August 20, 2024

Thanks for your quick reply. I solved the problem after increasing the batch size. This is really interesting work.


anjugopinath commented on August 20, 2024

Hi,

I am getting the "HOI loss is NaN" issue when training on a different dataset. The code used to work fine, but when I tried to train on images where there is only one human bbox and one object bbox, I started facing this issue.

I have tried:

  1. Setting the batch size to 16 and 32
  2. Using log(f + 1):

         features.append(
             torch.cat([f, torch.log(f + 1)], 1)
         )

  3. Adding batch norm:

         self.spatial_head = nn.Sequential(
             nn.Linear(36, 128),
             nn.BatchNorm1d(128),   # batch normalization after the first linear layer
             nn.ReLU(),
             nn.Linear(128, 256),
             nn.BatchNorm1d(256),   # batch normalization after the second linear layer
             nn.ReLU(),
             nn.Linear(256, representation_size),
             nn.BatchNorm1d(representation_size),   # batch normalization after the third linear layer
             nn.ReLU(),
         )

But I am still getting the issue.

Do you have any suggestions on how I can solve it?
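One generic way to localize such failures, not specific to this repository, is to assert finiteness after each stage of the pipeline, so the first offending tensor is reported instead of a NaN loss at the very end (the check helper below is hypothetical). Note also that nn.BatchNorm1d in training mode raises an error on a batch containing a single sample, which can matter when an image contributes only one human-object pair.

```python
import torch

def check(name: str, t: torch.Tensor) -> torch.Tensor:
    # Fail fast at the first stage whose output is NaN/Inf.
    if not torch.isfinite(t).all():
        raise RuntimeError(f"{name} is not finite")
    return t

f = torch.tensor([[0.0, 0.5, 1.0]])
enc = check("spatial features", torch.cat([f, torch.log1p(f)], 1))  # passes

try:
    check("raw log features", torch.log(f))  # log(0) == -inf, caught here
except RuntimeError as err:
    print(err)
```

Alternatively, torch.autograd.set_detect_anomaly(True) reports the backward operation that first produces a NaN gradient.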


anjugopinath commented on August 20, 2024

To show an example, I modified the output of the compute_spatial_encodings() function like this (adding 5000 so that it's easy to visualize):
[screenshot]

So the input to the spatial head is:
[screenshot]

The output is:
[screenshot]

The spatial head is:
[screenshot]

What is a meaningful fix for this? Scaling, replacing problematic values in the input, etc.?

