Comments (40)

13331112522 commented on June 28, 2024

@andfoy I have replaced the backbone with ResNet-50. As you mentioned, I used the pretrained ResNet-50 weights, but changed num_classes to 1000 to match the pretrained parameters. I made some changes to resnet.py to extract 5 feature layers from ResNet, and updated vis_size to 2048. It works well and is much faster. Thanks.
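
For anyone attempting the same swap, here is a minimal sketch of such a backbone, assuming torchvision's ResNet-50 (the class name and structure are illustrative, not taken from the dms code):

```python
import torch
import torchvision

class ResNetFeatures(torch.nn.Module):
    """Hypothetical ResNet-50 backbone that returns 5 pyramid levels,
    mirroring the multi-scale features that DPN-92 provides."""
    def __init__(self):
        super().__init__()
        resnet = torchvision.models.resnet50(pretrained=True)
        # Stem: conv1 + bn1 + relu + maxpool
        self.stem = torch.nn.Sequential(
            resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool)
        self.layer1 = resnet.layer1  # 256 channels
        self.layer2 = resnet.layer2  # 512 channels
        self.layer3 = resnet.layer3  # 1024 channels
        self.layer4 = resnet.layer4  # 2048 channels -> vis_size = 2048

    def forward(self, x):
        c1 = self.stem(x)
        c2 = self.layer1(c1)
        c3 = self.layer2(c2)
        c4 = self.layer3(c3)
        c5 = self.layer4(c4)
        # The classification head (avgpool + fc) is deliberately unused.
        return c1, c2, c3, c4, c5
```

Here vis_size = 2048 corresponds to the channel count of ResNet-50's last stage.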

andfoy commented on June 28, 2024

Hi, thanks for your question. Actually, you could omit the --optim-snapshot argument and the training script should start fine-tuning.

13331112522 commented on June 28, 2024

Thanks for your timely reply and I will try.

andfoy commented on June 28, 2024

@13331112522 Any followup on this one?

13331112522 commented on June 28, 2024

We are working on processing our new dataset, and I will report the results in this issue. Thanks for your consideration.

13331112522 commented on June 28, 2024

Hi, I have now completed training the model on my own tracking dataset, following your instructions. It performs well on my dataset but poorly on ReferIt and UNC. I think it adjusted its weights to the new task and has trouble generalizing across multiple tasks or datasets. I used 2 epochs at low resolution and 10 epochs at high resolution during training, so I also suspect it might be overfitting. By the way, are there any updates to the DMN architecture, given that newer networks such as MAttNet achieve much better performance on referring expressions?

andfoy commented on June 28, 2024

By the way, are there any updates to the DMN architecture, given that newer networks such as MAttNet achieve much better performance on referring expressions?

MAttNet and DMN are different in nature. Ours is global and region-agnostic: we give the model an image and a referring expression, and it produces a probability map over the whole image. MAttNet, by contrast, relies on Mask R-CNN (MRCNN) region features, and its objective is to classify those regions rather than to produce a segmentation.

andfoy commented on June 28, 2024

I used 2 epochs at low resolution and 10 epochs at high resolution during training, so I also suspect it might be overfitting

I agree with this assessment. In our experience, too much training time at high resolution induces overfitting.

13331112522 commented on June 28, 2024

Do you mean that if I train more at low resolution and less at high resolution, I could get a model that generalizes better?

13331112522 commented on June 28, 2024

In addition, recent models such as BERT and GPT seem to provide powerful feature representations for NLP; I was wondering whether DMN could take advantage of them.

andfoy commented on June 28, 2024

Do you mean that if I train more at low resolution and less at high resolution, I could get a model that generalizes better?

It is possible that this is the case.

andfoy commented on June 28, 2024

In addition, recent models such as BERT and GPT seem to provide powerful feature representations for NLP; I was wondering whether DMN could take advantage of them.

In my experience with Transformers and BERT, they are generally not able to surpass classical RNNs on this problem; they provide almost the same performance.

13331112522 commented on June 28, 2024

I have tried different ways to train the model on my tracking datasets over the past few days. The model learns and fits well: it converges quickly and achieves a very high score (over 0.96) on the training sets, but performance drops to 0.36 on other datasets. The best weights came from training on the mixed datasets at high resolution; 10 low-resolution epochs followed by 5 high-resolution epochs did not seem to perform better than more high-resolution training.

andfoy commented on June 28, 2024

@13331112522 Have you tried reducing the total number of parameters by modifying the hidden state size, the embedding size, and the number of filters?

13331112522 commented on June 28, 2024

Not yet so far.

Shivanshmundra commented on June 28, 2024

@13331112522 I am also trying to train on my custom dataset. Can you tell me how you were able to train at high resolution? Since the model cannot be parallelized and the GPU requirements are large, I couldn't train above 128x128 resolution.

13331112522 commented on June 28, 2024

@Shivanshmundra Try turning the --workers and --num-workers parameters down to 1. Note that training at high resolution needs to start from the low-resolution weights, as the author mentioned.

13331112522 commented on June 28, 2024

@andfoy Is it possible to speed up DMN by replacing the LSTMs with CNNs?

andfoy commented on June 28, 2024

@13331112522, we didn't try language-level CNNs, since we used recurrent modules for both the language and the multimodal stages. Feel free to try them, keeping in mind that there are two RNNs that would need to be replaced.
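
For illustration only, a language-level CNN encoder might look roughly like this (a minimal sketch, not part of dms; all names and sizes are hypothetical, and the multimodal RNN would still need a separate replacement):

```python
import torch

class ConvSentenceEncoder(torch.nn.Module):
    """Hypothetical CNN replacement for the language RNN: a stack of
    1D convolutions over word embeddings, max-pooled into one vector."""
    def __init__(self, embed_size=300, hidden_size=1000, kernel_size=3):
        super().__init__()
        self.conv = torch.nn.Sequential(
            torch.nn.Conv1d(embed_size, hidden_size, kernel_size, padding=1),
            torch.nn.ReLU(),
            torch.nn.Conv1d(hidden_size, hidden_size, kernel_size, padding=1),
            torch.nn.ReLU(),
        )

    def forward(self, embeddings):
        # embeddings: (batch, seq_len, embed_size)
        x = self.conv(embeddings.transpose(1, 2))  # (batch, hidden, seq_len)
        return x.max(dim=2).values                 # (batch, hidden)
```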

Shivanshmundra commented on June 28, 2024

@andfoy I was not able to reproduce the ReferIt results reported in the paper: the maximum IoU I observed was around 32% with the pre-trained ReferIt weights. The optimizer snapshot wasn't available, so we skipped that part and looked at the initial results. Can you suggest some pointers on where we might be going wrong? We are already using the SRU mentioned in the README.
Thanks

Edit: on the training images, the results are quite good, close to perfect in some cases.

andfoy commented on June 28, 2024

the maximum IoU I observed was around 32% with the pre-trained ReferIt weights

Hi @Shivanshmundra, on which dataset are you trying to reproduce the results? Also, which resolution are you using?

Shivanshmundra commented on June 28, 2024

@andfoy Sorry for the late reply. I was trying on the ReferIt dataset. The resolution was 256x256.

Shivanshmundra commented on June 28, 2024

Also, @andfoy, is there anything I can do to make this code parallelizable? Some change to the architecture or the pipeline in general that wouldn't harm the results much?

andfoy commented on June 28, 2024

@andfoy Sorry for the late reply. I was trying on the ReferIt dataset. The resolution was 256x256.

To replicate the ReferIt results, you should first train on UNC and then fine-tune the weights on ReferIt.

Note: resolution is important here; when the model is trained at a resolution lower than 512, one would expect a decrease in performance.

Also, @andfoy, is there anything I can do to make this code parallelizable? Some change to the architecture or the pipeline in general that wouldn't harm the results much?

There are two main issues: the Batch Norm layers in the feature extractor (DPN-92), and the dynamic filter computation, which would require a batched multi-filter convolution that is not available in PyTorch. Sentence length variability is also a major factor that prevents DMN from being parallelized. If you find a way to overcome these issues, feel free to share it here.
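
As a possible direction (not something dms implements): a batched multi-filter convolution can be emulated in stock PyTorch by folding the batch into the channel dimension and using grouped convolution. A minimal sketch, assuming F language-generated filters per sample and an odd, square kernel:

```python
import torch
import torch.nn.functional as F

def batched_dynamic_conv(features, filters):
    """Apply a different filter bank to every sample in the batch by folding
    the batch dimension into the channels and convolving with groups=batch.

    features: (B, C, H, W) visual features
    filters:  (B, F, C, k, k) per-sample dynamic filters, F per sample
    returns:  (B, F, H, W) response maps
    """
    b, c, h, w = features.shape
    f, k = filters.shape[1], filters.shape[-1]
    x = features.reshape(1, b * c, h, w)       # one "image" with B*C channels
    weight = filters.reshape(b * f, c, k, k)   # (out_channels, C, k, k)
    out = F.conv2d(x, weight, padding=k // 2, groups=b)  # assumes odd k
    return out.reshape(b, f, h, w)
```

This keeps the computation in a single conv2d call instead of a Python loop over the batch.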

13331112522 commented on June 28, 2024

@andfoy I found the inference process to be slow, especially for higher-resolution images. I am wondering whether there is a way to speed it up?

andfoy commented on June 28, 2024

@andfoy I found the inference process to be slow, especially for higher-resolution images. I am wondering whether there is a way to speed it up?

Maybe replacing DPN-92 with a newer, more efficient feature extractor?

13331112522 commented on June 28, 2024

@andfoy I have two questions. 1. What is the point of the low-resolution output? I visualized it using the low-resolution training weights, and it seems to have nothing to do with the ground-truth mask. 2. I tried to replace DPN-92 with ResNet, but got this when loading the state dict:

size mismatch for fc.weight: copying a param with shape torch.Size([1000, 2048]) from checkpoint, the shape in current model is torch.Size([1, 2048]).
size mismatch for fc.bias: copying a param with shape torch.Size([1000]) from checkpoint, the shape in current model is torch.Size([1]).

Thanks!
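
In case it helps, the usual fix for this kind of mismatch is to drop the classification-head entries from the checkpoint before loading, since the backbone never uses them. A minimal sketch (the checkpoint path is hypothetical; this assumes the checkpoint is a plain state dict for a torchvision-style ResNet):

```python
import torch
import torchvision

model = torchvision.models.resnet50()          # num_classes ends up unused
state = torch.load("resnet50_checkpoint.pth")  # hypothetical checkpoint path

# Drop the ImageNet classification head; the backbone does not need it.
state = {k: v for k, v in state.items() if not k.startswith("fc.")}
missing, unexpected = model.load_state_dict(state, strict=False)
```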

13331112522 commented on June 28, 2024

Trying to replace DPN with ResNet-50, I find there are many points (dimensions) to adjust. Any instructions for that?

andfoy commented on June 28, 2024

@13331112522 The drawback of using another feature extractor is that the pretrained weights are no longer compatible. To make it work, you should modify ResNet to return the full pyramid of feature representations, and also update vis_size from 2688 to 256 channels.

Also, remember to remove the classification layer from ResNet, as the model does not use it at all.

andfoy commented on June 28, 2024

What is the point of the low-resolution output? I visualized it using the low-resolution training weights, and it seems to have nothing to do with the ground-truth mask.

The low-resolution training phase is done in order to accelerate computation and also to constrain the representation space, which should be easier to upsample during the high-resolution phase.

13331112522 commented on June 28, 2024

I was wondering whether we could jointly train at low and high resolution simultaneously, combining the two losses. I tried it, but it does not seem to work well; it easily runs out of memory.
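
Concretely, the joint objective would be something like the following (a sketch; loss_low, loss_high, and alpha are hypothetical names for the two per-resolution losses and a weighting factor):

```python
def joint_loss(loss_low, loss_high, alpha=1.0):
    # Hypothetical joint objective over both resolutions; alpha balances
    # the two terms. Both computation graphs stay in memory at once,
    # which is where the out-of-memory pressure comes from.
    return loss_low + alpha * loss_high
```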

andfoy commented on June 28, 2024

I was wondering whether we could jointly train at low and high resolution simultaneously, combining the two losses. I tried it, but it does not seem to work well; it easily runs out of memory.

Maybe you could reduce num_filters or joint_size, at the expense of comparability with the published results.

13331112522 commented on June 28, 2024
1. Is num_filters equal to the maximum length of the language query? If I reduce it, does the query length have to be reduced?
2. Any ideas or instructions for adapting the model to a language other than English? I want to try changing it to Chinese.

andfoy commented on June 28, 2024

Is num_filters equal to the maximum length of the language query? If I reduce it, does the query length have to be reduced?

No, num_filters is an arbitrary model parameter that sets the number of filters generated from language, so you can change it without affecting the length of the input sentences.

Any ideas or instructions for adapting the model to a language other than English? I want to try changing it to Chinese.

If you are able to map words to Chinese embeddings, then those can be given as input to the model.
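
For illustration, loading pretrained Chinese word vectors into an embedding layer might look like this (a minimal sketch; the file name and word2vec-style text format are assumptions, and dms's actual embedding pipeline may differ):

```python
import numpy as np
import torch

# Hypothetical word2vec/fastText-style text file: "word v1 v2 ... v300"
vocab, vectors = [], []
with open("chinese_vectors.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        vocab.append(parts[0])
        vectors.append(np.asarray(parts[1:], dtype=np.float32))

# Frozen lookup table mapping token indices to pretrained vectors.
embedding = torch.nn.Embedding.from_pretrained(
    torch.from_numpy(np.stack(vectors)), freeze=True)
word_to_idx = {w: i for i, w in enumerate(vocab)}
```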

13331112522 commented on June 28, 2024

Thanks a lot, @andfoy.

13331112522 commented on June 28, 2024

What's the purpose of setting batch_size to 1? When I replaced the backbone, some layers need BatchNorm, which requires more than one sample per batch to compute the mean. In the end, I had to remove the BN layers to get past the issue.

andfoy commented on June 28, 2024

What's the purpose of setting batch_size to 1? When I replaced the backbone, some layers need BatchNorm, which requires more than one sample per batch to compute the mean. In the end, I had to remove the BN layers to get past the issue.

Batch size was originally set to 1 so that we could fit the model into memory during training.
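
As a side note (not something dms does), an alternative to removing BN when training with batch_size 1 is to swap BatchNorm2d for GroupNorm, which normalizes within each sample instead of across the batch. A minimal sketch:

```python
import torch

def bn_to_groupnorm(module, num_groups=32):
    """Recursively replace BatchNorm2d with GroupNorm so that
    normalization no longer depends on batch statistics.

    Note: each layer's channel count must be divisible by num_groups
    (true for standard ResNet channel counts with 32 groups)."""
    for name, child in module.named_children():
        if isinstance(child, torch.nn.BatchNorm2d):
            setattr(module, name,
                    torch.nn.GroupNorm(num_groups, child.num_features))
        else:
            bn_to_groupnorm(child, num_groups)
    return module
```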

13331112522 commented on June 28, 2024

I am curious about the mIoU metric in the paper: is it max IoU or mean IoU? According to the source code, the evaluation outputs the maximum IoU, but the popular approach is to calculate the overall IoU or the mean IoU.

13331112522 commented on June 28, 2024

In addition, could you please provide results with ResNet-50 as the backbone? I want to do a comparison study. Your help is much appreciated. @andfoy

andfoy commented on June 28, 2024

I am curious about the mIoU metric in the paper: is it max IoU or mean IoU? According to the source code, the evaluation outputs the maximum IoU, but the popular approach is to calculate the overall IoU or the mean IoU.

@13331112522, sorry for the late reply! By mIoU we refer to the sum of the intersections over the sum of the unions, which is different from the mean IoU, i.e., the mean of the per-sample intersection over union. You should find that while mean IoU penalizes each object equally, mIoU is biased towards large objects.
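
For clarity, the two definitions above could be computed like this (a short sketch, not the dms evaluation code; inters and unions are per-sample pixel counts):

```python
import numpy as np

def overall_iou(inters, unions):
    """mIoU in the paper's sense: cumulative intersection over cumulative
    union, which weights large objects more heavily."""
    return np.sum(inters) / np.sum(unions)

def mean_iou(inters, unions):
    """Per-sample IoU averaged over the dataset: every object counts equally."""
    return np.mean(np.asarray(inters) / np.asarray(unions))
```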

In addition, could you please provide results with ResNet-50 as the backbone? I want to do a comparison study. Your help is much appreciated. @andfoy

Sadly, we don't have ResNet-50 weights available; the only way to obtain them is to retrain the model from scratch.
