From my understanding, the authors pretrained the MAE on ImageNet, and used the encode

How is fine-tuning performed actually? (mae-pytorch issue, 4 comments, closed)

pengzhiliang commented on July 27, 2024

How is fine-tuning performed actually?

from mae-pytorch.

Comments (4)

pengzhiliang commented on July 27, 2024

Yes, the input size is actually different.
This works because the network places no requirement on the number of patches.
Multi-head self-attention only requires that Q, K, and V have consistent sizes.
The FFN operates only on the channel (C) dimension.
So the encoder doesn't care how many patches it receives.
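To illustrate the point above, here is a minimal sketch of a toy single-head attention block followed by a linear FFN. It is written in plain Python rather than PyTorch so it runs without dependencies, and it is not the code from this repository: the names `make_weights` and `block` are hypothetical, and residual connections, layer norm, and multiple heads are left out for brevity. Every weight matrix has shape (C x C), so the same weights can process 49 tokens or 196 tokens.

```python
import math
import random


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def make_weights(c, rng):
    """Hypothetical helper: one (C x C) matrix each for Q, K, V and the FFN."""
    return {
        name: [[rng.uniform(-0.1, 0.1) for _ in range(c)] for _ in range(c)]
        for name in ("q", "k", "v", "ffn")
    }


def block(tokens, w):
    """Toy attention + FFN over an (N x C) token list; N is never fixed."""
    c = len(tokens[0])
    q, k, v = (matmul(tokens, w[name]) for name in ("q", "k", "v"))
    k_t = [list(col) for col in zip(*k)]      # (C x N) transpose of K
    scores = matmul(q, k_t)                   # (N x N), sized by the input
    attn = [softmax([s / math.sqrt(c) for s in row]) for row in scores]
    mixed = matmul(attn, v)                   # (N x C)
    return matmul(mixed, w["ffn"])            # FFN acts per token, on C only


rng = random.Random(0)
C = 8
weights = make_weights(C, rng)

# 49 tokens (visible subset) and 196 tokens (full image) share the same weights.
visible = [[rng.uniform(-1, 1) for _ in range(C)] for _ in range(49)]
full = [[rng.uniform(-1, 1) for _ in range(C)] for _ in range(196)]

out_visible = block(visible, weights)
out_full = block(full, weights)
print(len(out_visible), len(out_visible[0]))  # 49 8
print(len(out_full), len(out_full[0]))        # 196 8
```

The only tensor whose size depends on the number of tokens is the (N x N) score matrix, and that is computed from the input on each call rather than stored as a parameter.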

pengzhiliang commented on July 27, 2024

Hello, in the end-to-end fine-tuning stage, the encoder receives the full set of image tokens, with no masked tokens. In effect, the pre-training stage just provides a better initialization for fine-tuning. You can see the difference in my code:

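pre-train:

# Add positional embeddings to all patch tokens.
x = x + self.pos_embed.type_as(x).to(x.device).clone().detach()

B, _, C = x.shape
# Keep only the visible tokens; ~mask selects the unmasked positions.
x_vis = x[~mask].reshape(B, -1, C)

# The encoder blocks see only the visible subset.
for blk in self.blocks:
    x_vis = blk(x_vis)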

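fine-tune:

# B is the batch size, set earlier in the forward pass (not shown in this excerpt).
if self.pos_embed is not None:
    x = x + self.pos_embed.expand(B, -1, -1).type_as(x).to(x.device).clone().detach()

# No mask is applied: the encoder blocks see every token.
for blk in self.blocks:
    x = blk(x)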
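As a rough illustration of the two paths, the sketch below counts the tokens that reach the encoder in each stage. It is plain Python, not the repository's code: `random_mask` and `encoder_input` are hypothetical helpers, and the 224x224 image size, 16x16 patch size, and 75% mask ratio are assumptions chosen to match the 25% visible figure discussed in this thread.

```python
import random


def random_mask(num_patches, mask_ratio, rng):
    """Hypothetical helper: list of booleans, True where a patch is masked."""
    num_masked = int(num_patches * mask_ratio)
    flags = [True] * num_masked + [False] * (num_patches - num_masked)
    rng.shuffle(flags)
    return flags


def encoder_input(tokens, mask=None):
    """Pre-training passes a mask and keeps visible tokens; fine-tuning passes none."""
    if mask is None:
        return tokens
    return [tok for tok, hidden in zip(tokens, mask) if not hidden]


rng = random.Random(0)
num_patches = (224 // 16) ** 2     # 196, assuming 224x224 images, 16x16 patches
tokens = list(range(num_patches))  # integer stand-ins for patch embeddings

mask = random_mask(num_patches, 0.75, rng)  # assumed 75% mask ratio
pretrain_tokens = encoder_input(tokens, mask)
finetune_tokens = encoder_input(tokens)

print(len(pretrain_tokens), len(finetune_tokens))  # 49 196
```

Note that the `reshape(B, -1, C)` in the pre-training excerpt only works if every sample in the batch keeps the same number of visible tokens, which a fixed mask ratio guarantees.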

Hope this can help you!

stephenllh commented on July 27, 2024

Thanks! I now have one more question: in Transformers for NLP, the sequence length is usually fixed to some arbitrary max_len hyperparameter. That is not the case here, where the number of tokens (the sequence length) differs between pre-training and fine-tuning.

I wonder whether the fixed length in NLP comes from batching, which requires every sample in a batch to have the same sequence length, rather than from a constraint of the network itself.

chokyungjin commented on July 27, 2024

I have the same question.
During pre-training, I understand that only the visible tokens (25% of the original image) are fed to the encoder. But fine-tuning does not use a mask, so if the entire original image is fed in, isn't the input a different size?
