I am not the author, but I hope my answer can help you.
A1: Because the neighbors of a patch are defined within the same view. A "neighbor" is not easy to define in a cross-view situation (though you could try to define one with some spatial prior).
A2: The local views are fed into the teacher network to contribute to the SelfPatch loss, i.e., the same-view loss mentioned above. This may not be strictly necessary, but it may accelerate convergence.
A3: loc=True means aggregating the neighbors' features, which is enabled in the teacher network. For example, the i-th patch of the teacher network aggregates its neighbors' features; in the student model, we do not aggregate them. Then we maximize the similarity between the student's i-th patch and the teacher's i-th patch (which includes the neighbors' features) to model the patch-level representations.
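To make A3 concrete, here is a rough NumPy sketch of the idea (my own toy illustration, not the authors' code; the mean aggregation and the 3x3 grid/8-neighborhood are simplifying assumptions):

```python
import numpy as np

def cosine_sim(a, b):
    # cosine similarity between matching rows of a and b
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return (a * b).sum(axis=-1)

def neighbors(i, side=3):
    # indices of patch i and its adjacent patches on a side x side grid
    r, c = divmod(i, side)
    out = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            rr, cc = r + dr, c + dc
            if 0 <= rr < side and 0 <= cc < side:
                out.append(rr * side + cc)
    return out

# toy setup: 9 patches (3x3 grid), 4-dim features
num_patches, dim = 9, 4
rng = np.random.default_rng(0)
student_feats = rng.normal(size=(num_patches, dim))
teacher_feats = rng.normal(size=(num_patches, dim))

# teacher aggregates each patch with its same-view neighbors
# (a plain mean here for simplicity); the student patch stays un-aggregated
teacher_agg = np.stack([teacher_feats[neighbors(i)].mean(axis=0)
                        for i in range(num_patches)])

# maximize similarity between student patch i and teacher's aggregated patch i
loss = -cosine_sim(student_feats, teacher_agg).mean()
print(loss)
```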
I hope the above helps.
Hi @yanjk3, thank you very much for the answers, I really appreciate it. It makes more sense now that I know the authors made a slight modification to the original DINO.
Hi @yanjk3, when I use eval_knn.py from the original DINO to evaluate SelfPatch, it says:
size mismatch for pos_embed: copying a param with shape torch.Size([1, 196, 384]) from checkpoint, the shape in current model is torch.Size([1, 197, 384]).
Do you have any ideas on how I can fix it? Thank you
This is because the SelfPatch checkpoint does not contain the CLS token, so the position embedding's size does not match. In SelfPatch, the CLS token lives in the SelfPatchHead https://github.com/alinlab/SelfPatch/blob/main/selfpatch_vision_transformer.py#L362, so the ViT backbone does not need one.
I think you can fix it by modifying DINO's ViT code https://github.com/facebookresearch/dino/blob/main/vision_transformer.py#L147, changing self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim)) to self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim)).
Then delete the '-1' in lines 175 and 176, and swap lines 202 and 205.
However, as the SelfPatch checkpoint does not contain the CLS token, the ViT model will randomly initialize one, which can lead to a performance drop. Instead of using the CLS token, you can apply global average pooling over the last transformer block's output to get the global feature representation of each image.
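To make the shape mismatch concrete, here is a toy illustration (shapes only; ViT-S/16 on 224x224 images gives 14x14 = 196 patches with 384-dim embeddings):

```python
import numpy as np

num_patches, embed_dim = 196, 384  # 14x14 patches, ViT-S/16

# DINO's ViT reserves one extra position for the CLS token -> 197 positions
pos_embed_dino = np.zeros((1, num_patches + 1, embed_dim))

# the SelfPatch checkpoint stores patch positions only (CLS lives in the head)
pos_embed_ckpt = np.zeros((1, num_patches, embed_dim))

print(pos_embed_dino.shape)  # (1, 197, 384) -- current model
print(pos_embed_ckpt.shape)  # (1, 196, 384) -- checkpoint
```

Dropping the `+ 1` in the model definition makes the model's shape match the checkpoint's.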
Hi @yanjk3, thank you for your answers. Could you demonstrate how I can use global average pooling on the last transformer block?
You should make sure you delete the CLS token in the ViT first. Then insert
x = x.mean(dim=1)
after
x = self.norm(x)
and return that x.
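Putting that together, the tail of the backbone's forward could look roughly like this sketch (a plain NumPy stand-in for the PyTorch code, not the repo's exact implementation; layer_norm here is a bare LayerNorm without learned affine parameters):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # per-token LayerNorm without learned scale/shift, for illustration
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def forward_tail(tokens):
    """tokens: (batch, num_patches, embed_dim), output of the last block.
    With the CLS token removed, pool the patch tokens instead."""
    x = layer_norm(tokens)  # corresponds to x = self.norm(x)
    x = x.mean(axis=1)      # inserted GAP: x = x.mean(dim=1) in PyTorch
    return x                # (batch, embed_dim) global image feature

feats = forward_tail(np.random.default_rng(0).normal(size=(2, 196, 384)))
print(feats.shape)  # (2, 384)
```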
Hi @yanjk3, I already took your advice, but the accuracy is about 3% lower than the original DINO's under the same settings for eval_knn.py. What would you suggest? Also, which gives a more reliable accuracy check: eval_linear.py or eval_knn.py? Thanks
To overcome the performance drop, I recommend replacing DINO's ViT code with SelfPatch's ViT code.
The main difference between them is that SelfPatch uses a cross-attention (CA) block after the ViT blocks to aggregate the patch features and output the CLS token.
If you use this CLS token, the performance may improve.
But unfortunately, the released checkpoint only contains the ViT backbone, so if you want a precise answer, you would need to pre-train the entire model on your own.
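For intuition, that kind of CA pooling can be sketched as single-head cross-attention in which a learnable CLS query attends over the patch tokens (my own simplified sketch with identity q/k/v projections, not the repo's CA block, which uses learned projections and more structure):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(patch_tokens, cls_query, dim):
    """Single-head cross-attention: the CLS query attends over the patch
    tokens, and their attention-weighted sum becomes the CLS output."""
    q = cls_query                             # (1, dim) learnable query
    k = v = patch_tokens                      # (num_patches, dim)
    attn = softmax(q @ k.T / np.sqrt(dim))    # (1, num_patches)
    return attn @ v                           # (1, dim) -> the CLS token

rng = np.random.default_rng(0)
tokens = rng.normal(size=(196, 384))          # patch features from the ViT
cls_q = rng.normal(size=(1, 384))             # stand-in learnable CLS query
cls_token = attention_pool(tokens, cls_q, 384)
print(cls_token.shape)  # (1, 384)
```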
Hi @yanjk3, sorry, I don't really get it. What do you mean by copying the SelfPatch ViT to the DINO ViT?
I mean you should replace the DINO ViT model's code with the SelfPatch ViT model's code.
Hi @yanjk3, do you mean adding everything you previously suggested to the DINO ViT model's code (vision_transformer.py)?
Hi @bryanwong17, @yanjk3. I'm having the same problem as you: I cannot run the evaluation using eval_knn.py.
I was wondering whether you found a solution to this problem?
Thanks in advance.