
Comments (32)

dhiren-hamal avatar dhiren-hamal commented on May 29, 2024 1

Awesome explanation bro! Thank you for your efforts.

from blog-comments.

jessie-chen99 avatar jessie-chen99 commented on May 29, 2024 1

Thank you so much for this clear explanation!

from blog-comments.

WuYHH avatar WuYHH commented on May 29, 2024 1

Thank you very much! It really helped my understanding of contrastive learning.

from blog-comments.

mohkuwait avatar mohkuwait commented on May 29, 2024 1

You are the best of the best. Nice explanation of the paper; you make it easy to understand.

from blog-comments.

amitness avatar amitness commented on May 29, 2024 1

@MailSuesarn You're right! However, multi-GPU training brings a new problem if you use batch normalization directly. The paper shows that there could be information leakage if batch normalization is applied locally to the small batches on each GPU.

So, instead of that, they use a "global batch normalization" where the statistics are calculated across all the images in all the GPUs. See this video.

You can search around to see if there is an easy-to-use implementation of that in Keras.
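
A minimal sketch of what that could look like in Keras, assuming a recent TensorFlow 2.x release (the exact layer name depends on the version):

```python
# A minimal sketch of "global" (synchronized) batch normalization under
# multi-GPU training, assuming a recent TensorFlow 2.x release. Older
# releases expose the same idea as
# tf.keras.layers.experimental.SyncBatchNormalization instead of the
# `synchronized=True` argument.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # one replica per visible GPU

with strategy.scope():
    encoder = tf.keras.Sequential([
        tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu"),
        # Batch statistics are aggregated across all replicas rather than
        # computed per GPU, which avoids the local information leakage.
        tf.keras.layers.BatchNormalization(synchronized=True),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(128),  # illustrative projection size
    ])
```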

from blog-comments.

darkenergy814 avatar darkenergy814 commented on May 29, 2024 1

Thank you for the easy explanation!

from blog-comments.

amitness avatar amitness commented on May 29, 2024 1

@HuyTrinh212 If you already know the class labels, SimCLR wouldn't be a relevant model to apply. It's meant for learning representations from a diverse set of unlabeled images. For the cat/dog example, a simple supervised binary classification would work better.

However, your intuition on the issue you pointed out is correct. If a lot of same-class images end up in the batch, it doesn't make sense to treat them as negatives. That's one of the drawbacks of the SimCLR approach. Some follow-up papers have used heuristics such as clustering images so that each batch contains images from different clusters.

from blog-comments.

February24-Lee avatar February24-Lee commented on May 29, 2024

an easy explanation👍

from blog-comments.

yoon28 avatar yoon28 commented on May 29, 2024

Hi, thanks for the nice post!
What if a class repeats within a batch? For example, multiple cat images in a batch. In that case, it is inevitable that images of the same class (e.g., the cat images in the batch) get treated as negative pairs.
How does SimCLR handle this situation?

from blog-comments.

amitness avatar amitness commented on May 29, 2024

@yoon28 It's a limitation with SimCLR. I had the same thought and asked the original paper author here. He replied that they do nothing to handle this.

It might be because the original SimCLR is trained at a scale where batches are as large as 8K images, so this problem has less impact on performance.

New papers such as PCL have tried tackling it through clustering. Alternatively, other works such as BYOL have removed the need for negative pairs altogether.

from blog-comments.

maxmaxmir avatar maxmaxmir commented on May 29, 2024

Excellent job explaining it in such a simple manner! It was a great read. As regards the random chance of same-class images being treated as negatives in a training batch, it depends on the number of classes. ImageNet has 1000 classes, so with a batch size of 8192, and assuming an equal number of samples per class in the dataset, the chances of this happening are very small.

from blog-comments.

maxmaxmir avatar maxmaxmir commented on May 29, 2024

Do you know why the original image is not used in the positive pair? For example, why do we need x_i and x_j, both augmented from x, instead of using x and x_i (so you only do one augmentation)? I am guessing since x_i and x_j are generated using stochastic augmentation, they'll be different every time, whereas having the original could lead to poorer generalization.

from blog-comments.

maxmaxmir avatar maxmaxmir commented on May 29, 2024

Another question - why do the authors of the paper throw away the function g(.)? Isn't it possible that the additional non-linear transformation helps with a better representation?

from blog-comments.

maxmaxmir avatar maxmaxmir commented on May 29, 2024

To add to the above, the authors write "Furthermore, even when nonlinear projection is used, the layer before the projection head, h, is still much better (>10%) than the layer after, z = g(h), which shows that the hidden layer before the projection head is a better representation than the layer after.", but they don't give reasons why.

from blog-comments.

maxmaxmir avatar maxmaxmir commented on May 29, 2024

Never mind - they actually do conjecture why this is so. Their reasoning is that "In particular, z = g(h) is trained to be invariant to data transformation. Thus, g can remove information that may be useful for the downstream task, such as the color or orientation of objects."

from blog-comments.

espkh4 avatar espkh4 commented on May 29, 2024

I seek clarification with the following statement:

"We calculate the loss for the same pair a second time as well where the positions of the images are interchanged."

Isn't the loss using cosine similarity, which is a scalar and symmetric? Why the need to compute the similarity of Image B with A when Image A with B would give the same value?

Maybe I am missing something.

from blog-comments.

amitness avatar amitness commented on May 29, 2024

@espkh4 The similarity in the numerator for Image A and Image B would be the same even after the positions are interchanged. You have understood it correctly up to that point.

But for the dissimilar (negative) images in the denominator, position matters. Through the first loss term, we made A similar to B and dissimilar to the other images. For B, we also need to make it dissimilar to the other images and similar to A. The denominator of the second loss term does that; interchanging the positions achieves it.

I hope that clears the confusion. Feel free to comment if it still doesn't make sense.
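
To make the interchanged-positions point concrete, here is an illustrative NumPy sketch of the per-pair NT-Xent loss (not the paper's reference code):

```python
# An illustrative NumPy sketch of the per-pair NT-Xent loss (not the paper's
# reference code). The numerator sim(z_i, z_j) is the same for both orderings,
# but the denominator is built from the anchor's similarities to everything
# else in the batch, so l(i, j) and l(j, i) differ.
import numpy as np

def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def nt_xent(z, i, j, temperature=0.5):
    """l(i, j): anchor is z[i], positive is z[j], every other row counts in the denominator."""
    numerator = np.exp(cosine_sim(z[i], z[j]) / temperature)
    denominator = sum(
        np.exp(cosine_sim(z[i], z[k]) / temperature)
        for k in range(len(z))
        if k != i  # the 1[k != i] indicator from the paper
    )
    return -np.log(numerator / denominator)

# z holds the projected embeddings of the 2N augmented views in a batch.
z = np.random.randn(8, 128)
loss_for_pair = nt_xent(z, 0, 1) + nt_xent(z, 1, 0)  # both orderings
```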

from blog-comments.

amitness avatar amitness commented on May 29, 2024

"Do you know why the original image is not used in the positive pair? For example, why do we need x_i and x_j, both augmented from x, instead of using x and x_i (so you only do one augmentation)? I am guessing since x_i and x_j are generated using stochastic augmentation, they'll be different every time, whereas having the original could lead to poorer generalization."

What you suggest is one possible reason.

  • Furthermore, the authors show in the paper that using two random crops has benefits. For example, two non-overlapping crops simulate the task of predicting a neighboring crop, and a crop contained inside another simulates recognizing a local part of a larger view.
  • Another point is about the network cheating. Applying the augmentations (in particular, color distortion) to both views ensures the network cannot shortcut the task by matching color histograms. You can refer to the ablation study in the paper for more details; a sketch of such a two-view pipeline follows below.
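
For concreteness, a minimal sketch of such a two-view augmentation pipeline, assuming torchvision is available (the parameter values are illustrative, in the spirit of the paper's crop + color-distortion recipe rather than its exact settings):

```python
# A minimal sketch of a two-view augmentation pipeline, assuming torchvision
# is available. Parameter values are illustrative, not the paper's exact ones.
import torchvision.transforms as T

augment = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.RandomApply([T.ColorJitter(0.8, 0.8, 0.8, 0.2)], p=0.8),
    T.RandomGrayscale(p=0.2),
    T.ToTensor(),
])

# Each call re-samples the random parameters, so the two views of the same
# PIL image differ: x_i = augment(x); x_j = augment(x)
```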

from blog-comments.

espkh4 avatar espkh4 commented on May 29, 2024

@amitness Yes, perfect. Thank you for that explanation. That clears it.

from blog-comments.

RezwanCode avatar RezwanCode commented on May 29, 2024

Hi Amit, I am trying to implement SimCLR on the FashionMNIST dataset. My implementation is done, but I am having some issues with my analysis and a huge running time. Can we have a small online meeting? I really need your help. Could you please help me in any way?

from blog-comments.

amitness avatar amitness commented on May 29, 2024

@RezwanCode Can you add it to a private repo and invite me as a collaborator? I can have a look and give you feedback directly on the repo.

My GitHub username is amitness.

from blog-comments.

Mushtaqml avatar Mushtaqml commented on May 29, 2024

What is the real purpose of the temperature term in the loss function? Could you help me understand it with an intuitive example? Also, I found this temperature term in the MoCo paper; do both of them mean the same thing?

I found the following comment on this blog post (https://towardsdatascience.com/contrasting-contrastive-loss-functions-3c13ca5f055e), but I don't think I really understood what it means.

"Chen et al. found that an appropriate temperature parameter can help the model learn from hard negatives. In addition, they showed that the optimal temperature differs on different batch sizes and number of training epochs."

Thanks

from blog-comments.

jimmykimmy68 avatar jimmykimmy68 commented on May 29, 2024

Great explanation!

from blog-comments.

vahuja4 avatar vahuja4 commented on May 29, 2024

Hi Amit,
Nice explanation! A couple of questions regarding SSL in Computer Vision:

  1. What happens if, in a training batch, a majority of the images belong to the same class? Will an SSL algorithm not fail, since it will push embeddings of images from the same class further apart?

  2. Also, can you please explain what causes an SSL algorithm to push images of the same class closer to each other? As far as training is concerned, only the original and its augmented versions are pushed close to each other. What causes other images of the same class to end up closer to each other?

from blog-comments.

amitness avatar amitness commented on May 29, 2024

@vahuja4 Regarding your first question, I had the same curiosity. The author answered it here and says that they don't handle it explicitly. I'm guessing it's not a problem for such a large dataset, since it's very rare that many images of the same class would end up in the same batch.

from blog-comments.

MailSuesarn avatar MailSuesarn commented on May 29, 2024

It's a really great article. I have a question I want to ask you. As the paper shows, a large batch size yields good results due to the larger number of negative pairs. Am I right? That means we need a lot of memory to hold the batch. My question is: if I use the multi-GPU method in this example https://keras.io/guides/distributed_training/, will it still work as if we are training with a large batch size, or does it only help train faster?

Thank you in advance

from blog-comments.

rjrobben avatar rjrobben commented on May 29, 2024

How do we train such a network if we do not have multiple GPUs?

from blog-comments.

M-Amrollahi avatar M-Amrollahi commented on May 29, 2024

Thanks for your explanation.
A question: when we have the similarity, why do we not use it directly as the loss function? Why do we feed the similarity into a softmax?

from blog-comments.

crazyboy9103 avatar crazyboy9103 commented on May 29, 2024

Thank you for the clear explanation! I have one question on the Noise Contrastive Estimation (NCE) loss, which has the indicator 1[k != i] in the denominator. According to the paper, 1[k != i] = 1 if k is not from the i-th image, meaning that only negative pairs are summed up in the denominator. Please correct me if I'm wrong.

from blog-comments.

HangLuoWh avatar HangLuoWh commented on May 29, 2024

nice explanation!

from blog-comments.

HuyTrinh212 avatar HuyTrinh212 commented on May 29, 2024

Very good and detailed article, but I have a problem.
If my dataset only has 2 labels, dog and cat, is training with batch size = 256 possible? Suppose a batch has 128 cats and 128 dogs; if you calculate the loss for one cat positive pair, does it still treat the remaining 127 cat images as negatives?

from blog-comments.

HuyTrinh212 avatar HuyTrinh212 commented on May 29, 2024

@amitness Very good explanation, thank you. I will read more about heuristics and clustering to try them out.

from blog-comments.
