
Comments (16)

OBVIOUSDAWN commented on August 20, 2024

I'm using a TITAN Xp with torch 1.9.1 to train this model. I installed the package and tested that it works; the dataset is V-COCO, downloaded with the provided script. Thank you very much.

from upt.

fredzzhang commented on August 20, 2024

Hi @OBVIOUSDAWN,

Thanks for taking an interest in our work.

The NaN loss problem was quite a pain. I ran into the issue a long time ago and managed to resolve it by using larger batch sizes. The problem was that the spatial encodings had bad scales, which made training very unstable. I see that you are using only one GPU to train, so the batch size is most likely insufficient.

Here are a few things you can try:

  1. For the log terms in the pairwise positional encodings, use log(1+x) instead of log(x+epsilon).
  2. Add batch norm in the spatial head that computes the pairwise positional encodings.
  3. Increase batch size (probably the easiest option).
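A minimal sketch of suggestion 1, assuming a spatial-feature tensor f like the one concatenated in compute_spatial_encodings (the variable names here are illustrative):

```python
import torch

# Per-pair spatial features; some entries are legitimately zero or near zero.
f = torch.tensor([0.0, 1e-6, 0.5, 1.0])

# Before: log(f + eps) swings to large negative values as f -> 0,
# e.g. log(1e-10) is about -23, giving the encodings a bad scale.
eps = 1e-10
before = torch.cat([f, torch.log(f + eps)])

# After (suggestion 1): log1p(f) == log(1 + f) is exactly 0 at f == 0
# and stays on the same order of magnitude as f itself.
after = torch.cat([f, torch.log1p(f)])
```

Suggestion 2 then amounts to inserting nn.BatchNorm1d layers between the linear layers of the head that consumes these encodings.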

Hope that resolves the issue.

Cheers,
Fred.


OBVIOUSDAWN commented on August 20, 2024

Dear sir,
I tried the model on a new server with 4×3090s and batch size 4, which shows the same error on rank 3. Regarding your second suggestion, do you mean the "Pairwise Box Positional Encodings" in the paper? I found a "PositionEmbeddingSine" in /detr/model/position_encoding.py; changing its eps shows the same error, and I also tried changing the eps in "binary_focal_loss_with_logits" and "compute_spatial_encodings" in /ops.py. I printed out the whole network, but I can't tell which part implements the pairwise box positional encodings. I look forward to your reply. Thank you very much.


fredzzhang commented on August 20, 2024

...do you mean the "Pairwise Box Positional Encodings" in the paper

Yes, it is implemented in ops.py. If you are running on 4 GPUs with a per-GPU batch size of 4, you have an effective batch size of 16. I think that's sufficiently large. Are you still getting the error?
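For reference, the effective batch size under distributed data parallel training is just the per-process batch multiplied by the world size, so the two configurations discussed in this thread are equivalent (a trivial sketch):

```python
def effective_batch_size(world_size: int, per_gpu_batch: int) -> int:
    # Each of the `world_size` processes contributes `per_gpu_batch`
    # samples to every optimization step.
    return world_size * per_gpu_batch

assert effective_batch_size(4, 4) == 16   # 4 x 3090, batch size 4 per GPU
assert effective_batch_size(8, 2) == 16   # the original 8-GPU training setup
```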

Fred.


OBVIOUSDAWN commented on August 20, 2024

Yes, the effective batch size is 16 and it shows the same error. I also tried changing

    features.append(torch.cat([f, torch.log(f + eps)], 1))

to use log(1 + x) instead of log(x + eps) in "compute_spatial_encodings", and I got the same error. I look forward to your reply. Thank you very much.


fredzzhang commented on August 20, 2024

That's odd. If the batch size is 16, it should work now. Can you try some different seeds?

Fred.


leijue222 commented on August 20, 2024

Hi @fredzzhang,
Thank you for your contribution. I am very interested in your work and want to deepen my understanding of the paper by running the code, but I can't get it to run.

I encountered the same error using the same command on a 3090:

    python main.py --world-size 1 --dataset vcoco --data-root vcoco/ --partitions trainval test --pretrained checkpoints/detr-r50-vcoco.pth --output-dir checkpoints/upt-r50-vcoco2

I haven't changed any code; I just downloaded the code and checkpoint model according to the README. Then I tried to run the training command, but it failed with this error.
Could you give me some help to solve it?


fredzzhang commented on August 20, 2024

Hi @leijue222,

That should be an issue related to the batch size. I trained the model on 8 GPUs with a batch size of 2 per GPU, for an effective batch size of 16. Since you are training with one GPU, you need to set the batch size to 16.

Let me know if that works.

Fred.


leijue222 commented on August 20, 2024

Wow, thanks Fred! It worked!
It was indeed a batch-size problem.

At present, GPU memory usage has gone from 12 GB to 23 GB; it's unclear whether a single 3090 with bs=16 will run out of memory later in training. By the way, how much time did you spend training V-COCO?
[screenshot]


fredzzhang commented on August 20, 2024

Towards the end of the Model Zoo section, I added some stats for 8 TITAN X GPUs; for V-COCO, training takes 40 minutes. I don't know how long it will take one 3090 to train it, but it shouldn't be too long.

Fred.


leijue222 commented on August 20, 2024

Thanks again, I love this work.


yuchen2199 commented on August 20, 2024

I met the same error using this command on a 3090:

    python main.py --world-size 1 --batch-size 16 --dataset vcoco --data-root vcoco/ --partitions trainval test --pretrained checkpoints/detr-r50-vcoco.pth --output-dir checkpoints/upt-r50-vcoco

Could you help me to solve the problem? Thanks.


fredzzhang commented on August 20, 2024

Hi @yuchen2199,

Sometimes training can be unstable even with a batch size of 16. If possible, increasing the batch size further should make the problem happen less often.

Fred.


yuchen2199 commented on August 20, 2024

Thanks for your quick reply. I solved the problem after increasing the batch size. This is really interesting work.


anjugopinath commented on August 20, 2024

Hi,

I am getting the "HOI loss is NaN" issue when training on a different dataset. The code used to work fine, but when I tried to train on images where there is only one human bbox and one object bbox, I started facing this issue.

I have tried:

  1. Setting the batch size to 16 and 32
  2. Using log(f + 1):

         features.append(
             torch.cat([f, torch.log(f + 1)], 1)
         )

  3. Adding batch norm:

         self.spatial_head = nn.Sequential(
             nn.Linear(36, 128),
             nn.BatchNorm1d(128),   # batch normalization after the first linear layer
             nn.ReLU(),
             nn.Linear(128, 256),
             nn.BatchNorm1d(256),   # batch normalization after the second linear layer
             nn.ReLU(),
             nn.Linear(256, representation_size),
             nn.BatchNorm1d(representation_size),   # batch normalization after the third linear layer
             nn.ReLU(),
         )

But I am still getting the issue.

Do you have any suggestions on how I can solve it?
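One generic way to localize such failures, not specific to this repository, is to assert finiteness after each stage of the pipeline, so the first offending tensor is reported instead of a NaN loss at the very end (the check helper below is hypothetical). Note also that nn.BatchNorm1d in training mode raises an error on a batch containing a single sample, which can matter when an image contributes only one human-object pair.

```python
import torch

def check(name: str, t: torch.Tensor) -> torch.Tensor:
    # Fail fast at the first stage whose output is NaN/Inf.
    if not torch.isfinite(t).all():
        raise RuntimeError(f"{name} is not finite")
    return t

f = torch.tensor([[0.0, 0.5, 1.0]])
enc = check("spatial features", torch.cat([f, torch.log1p(f)], 1))  # passes

try:
    check("raw log features", torch.log(f))  # log(0) == -inf, caught here
except RuntimeError as err:
    print(err)
```

Alternatively, torch.autograd.set_detect_anomaly(True) reports the backward operation that first produces a NaN gradient.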


anjugopinath commented on August 20, 2024

To show an example, I modified the output of the compute_spatial_encodings() function like this (adding 5000 so that it's easy to visualize):
[screenshot]

So the input to the spatial head is:
[screenshot]

The output is:
[screenshot]

The spatial head is:
[screenshot]

What is a meaningful fix for this? Scaling, replacing problematic values in the input, etc.?

