Comments (6)

pengzhiliang commented on July 27, 2024

Thank you for your affirmation!

As you know, we use the sine-cosine positional embedding mentioned in the paper. These are not learnable parameters and are not stored in checkpoints, so a potential solution is to use a learnable positional embedding instead.

If you run it at higher resolutions here, you may not need to interpolate the sine-cosine positional embedding. The get_sinusoid_encoding_table function can return any given number of positional embeddings, so you only need to change the corresponding parameters.
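As a rough sketch of that idea, assuming the table follows the standard Transformer sine-cosine formula: the function below is an illustrative plain-Python re-implementation, not the repository's code, and the 16-pixel patch size and 64-dim width are assumptions chosen for the demo.

```python
import math

def sinusoid_table(n_position, d_hid):
    """Return n_position rows of d_hid sine-cosine values (standard Transformer formula)."""
    table = []
    for pos in range(n_position):
        row = []
        for j in range(d_hid):
            # Dimensions 2i and 2i+1 share the frequency 1 / 10000^(2i / d_hid).
            angle = pos / (10000 ** (2 * (j // 2) / d_hid))
            # Even dimensions use sine, odd dimensions use cosine.
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

patch_size = 16  # assumed patch size
embed_dim = 64   # assumed (small) embedding width for the demo

table_224 = sinusoid_table((224 // patch_size) ** 2, embed_dim)  # 14 * 14 = 196 positions
table_448 = sinusoid_table((448 // patch_size) ** 2, embed_dim)  # 28 * 28 = 784 positions
```

With this formula, the first 196 rows of table_448 are identical to table_224: asking for more positions extends the table along the flattened 1D index rather than rescaling it.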

from mae-pytorch.

pengzhiliang commented on July 27, 2024

Please feel free to reopen this issue with more info if you are still stuck on this problem.
Thank you!

atonderski commented on July 27, 2024

Hi, sorry for the slow response.

I agree that the sine-cosine embeddings are not learnable. However, it seems like they still need to be interpolated for the model to work well. I suspect that this is at least partially due to the fact that they are 1D, and thus the model has to learn the number of rows/columns. E.g. it cannot say "look one patch down" but rather has to say "look X patches forward", and X changes when the grid gets wider.
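To make the "X patches forward" point concrete, here is a tiny illustration. It assumes 16-pixel patches and a patch grid flattened row by row; flat_index is a hypothetical helper written for this example.

```python
def flat_index(row, col, grid_width):
    """Row-major index of a patch when the 2D grid is flattened to a sequence."""
    return row * grid_width + col

# Offset in the 1D sequence between a patch and the one directly below it.
down_224 = flat_index(1, 0, 14) - flat_index(0, 0, 14)  # 14-wide grid at 224 px
down_448 = flat_index(1, 0, 28) - flat_index(0, 0, 28)  # 28-wide grid at 448 px
print(down_224, down_448)  # 14 28
```

So a model that learned "one patch down = 14 positions forward" at 224 px would land in the middle of the same row at 448 px unless the embeddings are resampled.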

I have attached attention visualizations that show what happens if you run on higher res with or without interpolating the positional embedding. As you can see, the non-interpolated version looks worse and has weird diagonal stripes.
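A minimal plain-Python sketch of what interpolating the table could look like, assuming a square patch grid flattened row by row. It uses bilinear interpolation with aligned corners for brevity, which is an assumption: the interpolation mode behind the attached figures is not stated here, and interpolate_pos_embed is a hypothetical name.

```python
def interpolate_pos_embed(table, old_size, new_size):
    """Bilinearly resize a flattened (old_size * old_size, dim) table to new_size * new_size rows."""
    dim = len(table[0])
    # Map the corners of the new grid onto the corners of the old grid.
    scale = (old_size - 1) / (new_size - 1) if new_size > 1 else 0.0
    out = []
    for r in range(new_size):
        y = r * scale
        y0 = int(y)
        y1 = min(y0 + 1, old_size - 1)
        wy = y - y0
        for c in range(new_size):
            x = c * scale
            x0 = int(x)
            x1 = min(x0 + 1, old_size - 1)
            wx = x - x0
            # Four neighbouring embeddings in the old grid.
            tl = table[y0 * old_size + x0]
            tr = table[y0 * old_size + x1]
            bl = table[y1 * old_size + x0]
            br = table[y1 * old_size + x1]
            out.append([
                (1 - wy) * ((1 - wx) * tl[d] + wx * tr[d])
                + wy * ((1 - wx) * bl[d] + wx * br[d])
                for d in range(dim)
            ])
    return out

# Toy example: a 2x2 grid of 1-dim embeddings resized to 3x3.
small = [[0.0], [1.0], [2.0], [3.0]]
resized = interpolate_pos_embed(small, 2, 3)
```

The same call with old_size=14 and new_size=28 would resize a 196-row table to 784 rows, keeping each embedding tied to its relative 2D location instead of its 1D index.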

This is not a big issue to me, but I wanted to let you (and anyone else that has the same problem) know about this. I think the best solution is what I mentioned before: to simply include the positional embeddings in the checkpoint even though they are not learnable parameters.

Original: [image: original_res]
With interpolation: [image: with_interp]
Without interpolation: [image: without_interp]

cliangyu commented on July 27, 2024

@atonderski Could you please share how to draw the self-attention map without a class token?

atonderski commented on July 27, 2024

Yeah, so since there is no class token, I am visualising the attention map of an arbitrarily picked token (signified by the red dot in the image). There are of course as many attention maps as there are tokens/patches.
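Roughly, for a single head, the idea can be sketched as below. This is a plain-Python illustration under stated assumptions, not the code used for the figures: attention_map_for_token is a hypothetical helper, the toy vectors are made up, and in a real model the query and keys would come from one attention layer.

```python
import math

def attention_map_for_token(q, keys, grid_size):
    """Softmax attention of one query token over all patch tokens, reshaped to the patch grid."""
    scale = 1.0 / math.sqrt(len(q))
    # Scaled dot-product score between the query and every key.
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    # Numerically stable softmax over all patches.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Un-flatten the row-major sequence back into grid_size rows.
    return [weights[r * grid_size:(r + 1) * grid_size] for r in range(grid_size)]

# Toy example: 4 patch tokens on a 2x2 grid; the key vectors double as queries here.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
query_index = 2  # the arbitrarily picked token (the "red dot")
attn = attention_map_for_token(keys[query_index], keys, grid_size=2)
```

The resulting grid can be upsampled and overlaid on the image; repeating this per head gives one map per head for the chosen token.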

cliangyu commented on July 27, 2024

There are 12 images... do they correspond to the 12 heads?
Would you mind pushing the code? Thank you!
