
Comments (32)

dhiren-hamal avatar dhiren-hamal commented on May 29, 2024 1

Awesome explanation bro! Thank you for your efforts.

from blog-comments.

jessie-chen99 avatar jessie-chen99 commented on May 29, 2024 1

Thank you so much for this clear explanation!

from blog-comments.

WuYHH avatar WuYHH commented on May 29, 2024 1

Thank you very much! It really helped my understanding of contrastive learning.

from blog-comments.

mohkuwait avatar mohkuwait commented on May 29, 2024 1

You are the best of the best. Nice explanation of the paper; you make it easy to understand.

from blog-comments.

amitness avatar amitness commented on May 29, 2024 1

@MailSuesarn You're right! However, multi-GPU training brings a new problem if you use batch normalization directly. The paper shows that there could be information leakage if batch normalization is applied locally to the small batches on each GPU.

So, instead of that, they use a "global batch normalization" where the statistics are calculated across all the images in all the GPUs. See this video.

You can search around to see if there is an easy-to-use implementation of that in Keras.
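
A minimal sketch of what that could look like in Keras, assuming a recent TensorFlow 2.x release (the exact layer name depends on the version):

```python
# A minimal sketch of "global" (synchronized) batch normalization under
# multi-GPU training, assuming a recent TensorFlow 2.x release. Older
# releases expose the same idea as
# tf.keras.layers.experimental.SyncBatchNormalization instead of the
# `synchronized=True` argument.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # one replica per visible GPU

with strategy.scope():
    encoder = tf.keras.Sequential([
        tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu"),
        # Batch statistics are aggregated across all replicas rather than
        # computed per GPU, which avoids the local information leakage.
        tf.keras.layers.BatchNormalization(synchronized=True),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(128),  # illustrative projection size
    ])
```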

from blog-comments.

darkenergy814 avatar darkenergy814 commented on May 29, 2024 1

Thank you for the easy explanation!

from blog-comments.

amitness avatar amitness commented on May 29, 2024 1

@HuyTrinh212 If you already know the class labels, SimCLR wouldn't be a relevant model to apply. It's meant for learning representations from a diverse set of unlabeled images. For the cat/dog example, a simple supervised binary classification would work better.

However, your intuition on the issue you pointed out is correct. If a lot of same-class images end up in the batch, it doesn't make sense to treat them as negatives. That's one of the drawbacks of the SimCLR approach. Some follow-up papers have used heuristics such as clustering images so that each batch contains images from different clusters.

from blog-comments.

February24-Lee avatar February24-Lee commented on May 29, 2024

an easy explanation👍

from blog-comments.

yoon28 avatar yoon28 commented on May 29, 2024

Hi, thanks for the nice post!
What if a class repeats within a batch? For example, multiple cat images in a batch. In that case, it is inevitable that images of the same class (e.g., the cat images in the batch) get treated as negative pairs.
How does SimCLR handle this situation?

from blog-comments.

amitness avatar amitness commented on May 29, 2024

@yoon28 It's a limitation with SimCLR. I had the same thought and asked the original paper author here. He replied that they do nothing to handle this.

It might be because the original SimCLR is trained at a scale where batches are as large as 8K images, so this problem has less impact on performance.

New papers such as PCL have tried tackling it through clustering. Alternatively, other works such as BYOL have removed the need for negative pairs altogether.

from blog-comments.

maxmaxmir avatar maxmaxmir commented on May 29, 2024

Excellent job explaining it in such a simple manner! It was a great read. As regards the random chance of same-class images being treated as negatives in a training batch, it depends on the number of classes. ImageNet has 1000 classes, so with a batch size of 8192, and assuming an equal number of samples per class in the dataset, the chances of this happening are very small.

from blog-comments.

maxmaxmir avatar maxmaxmir commented on May 29, 2024

Do you know why the original image is not used in the positive pair? For example, why do we need x_i and x_j, both augmented from x, instead of using x and x_i (so you only do one augmentation)? I am guessing since x_i and x_j are generated using stochastic augmentation, they'll be different every time, whereas having the original could lead to poorer generalization.

from blog-comments.

maxmaxmir avatar maxmaxmir commented on May 29, 2024

Another question - why do the authors of the paper throw away the function g(.)? Isn't it possible that the additional non-linear transformation helps with a better representation?

from blog-comments.

maxmaxmir avatar maxmaxmir commented on May 29, 2024

To add to the above, the authors write "Furthermore, even when nonlinear projection is used, the layer before the projection head, h, is still much better (>10%) than the layer after, z = g(h), which shows that the hidden layer before the projection head is a better representation than the layer after.", but they don't give reasons why.

from blog-comments.

maxmaxmir avatar maxmaxmir commented on May 29, 2024

Never mind - they actually do conjecture why this is so. Their reasoning is that "In particular, z = g(h) is trained to be invariant to data transformation. Thus, g can remove information that may be useful for the downstream task, such as the color or orientation of objects."

from blog-comments.

espkh4 avatar espkh4 commented on May 29, 2024

I seek clarification with the following statement:

"We calculate the loss for the same pair a second time as well where the positions of the images are interchanged."

Isn't the loss using cosine similarity, which is a scalar and symmetric? Why the need to compute the similarity of Image B with A when Image A with B would give the same value?

Maybe I am missing something.

from blog-comments.

amitness avatar amitness commented on May 29, 2024

@espkh4 The similarity in the numerator for Image A and Image B would be the same even after the positions are interchanged. You have understood it correctly up to that point.

But for the dissimilar (negative) images in the denominator, position matters. Through the first loss term, we made A similar to B and dissimilar to the other images. For B, we also need to make it dissimilar to the other images and similar to A. The denominator of the second loss term does that; interchanging the positions achieves it.

I hope that clears the confusion. Feel free to comment if it still doesn't make sense.
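
To make the interchanged-positions point concrete, here is an illustrative NumPy sketch of the per-pair NT-Xent loss (not the paper's reference code):

```python
# An illustrative NumPy sketch of the per-pair NT-Xent loss (not the paper's
# reference code). The numerator sim(z_i, z_j) is the same for both orderings,
# but the denominator is built from the anchor's similarities to everything
# else in the batch, so l(i, j) and l(j, i) differ.
import numpy as np

def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def nt_xent(z, i, j, temperature=0.5):
    """l(i, j): anchor is z[i], positive is z[j], every other row counts in the denominator."""
    numerator = np.exp(cosine_sim(z[i], z[j]) / temperature)
    denominator = sum(
        np.exp(cosine_sim(z[i], z[k]) / temperature)
        for k in range(len(z))
        if k != i  # the 1[k != i] indicator from the paper
    )
    return -np.log(numerator / denominator)

# z holds the projected embeddings of the 2N augmented views in a batch.
z = np.random.randn(8, 128)
loss_for_pair = nt_xent(z, 0, 1) + nt_xent(z, 1, 0)  # both orderings
```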

from blog-comments.

amitness avatar amitness commented on May 29, 2024

"Do you know why the original image is not used in the positive pair? For example, why do we need x_i and x_j, both augmented from x, instead of using x and x_i (so you only do one augmentation)? I am guessing since x_i and x_j are generated using stochastic augmentation, they'll be different every time, whereas having the original could lead to poorer generalization."

What you suggest is one possible reason.

  • Furthermore, the authors show in the paper that using two random crops has benefits. For example, two non-overlapping crops simulate the task of predicting a neighboring crop, and a crop contained inside another simulates recognizing a local part of a larger view.
  • Another point is about the network cheating. Applying the augmentations (in particular, color distortion) to both views ensures the network cannot shortcut the task by matching color histograms. You can refer to the ablation study in the paper for more details; a sketch of such a two-view pipeline follows below.
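
For concreteness, a minimal sketch of such a two-view augmentation pipeline, assuming torchvision is available (the parameter values are illustrative, in the spirit of the paper's crop + color-distortion recipe rather than its exact settings):

```python
# A minimal sketch of a two-view augmentation pipeline, assuming torchvision
# is available. Parameter values are illustrative, not the paper's exact ones.
import torchvision.transforms as T

augment = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.RandomApply([T.ColorJitter(0.8, 0.8, 0.8, 0.2)], p=0.8),
    T.RandomGrayscale(p=0.2),
    T.ToTensor(),
])

# Each call re-samples the random parameters, so the two views of the same
# PIL image differ: x_i = augment(x); x_j = augment(x)
```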

from blog-comments.

espkh4 avatar espkh4 commented on May 29, 2024

@amitness Yes, perfect. Thank you for that explanation. That clears it.

from blog-comments.

RezwanCode avatar RezwanCode commented on May 29, 2024

Hi Amit, I am trying to implement SimCLR on the FashionMNIST dataset. My implementation is done, but I am having some issues with my analysis and a huge running time. Can we have a small online meeting? I really need your help. Could you please help me in any way?

from blog-comments.

amitness avatar amitness commented on May 29, 2024

@RezwanCode Can you add it to a private repo and invite me as a collaborator? I can have a look and give you feedback directly on the repo.

My GitHub username is amitness.

from blog-comments.

Mushtaqml avatar Mushtaqml commented on May 29, 2024

What is the real purpose of the temperature term in the loss function? Could you help me understand it with an intuitive example? Also, I found this temperature term in the MoCo paper; do both of them mean the same thing?

I found the following comment on this blog post (https://towardsdatascience.com/contrasting-contrastive-loss-functions-3c13ca5f055e), but I don't think I really understood what it means.

"Chen et al. found that an appropriate temperature parameter can help the model learn from hard negatives. In addition, they showed that the optimal temperature differs on different batch sizes and number of training epochs."

Thanks

from blog-comments.

jimmykimmy68 avatar jimmykimmy68 commented on May 29, 2024

Great explanation!

from blog-comments.

vahuja4 avatar vahuja4 commented on May 29, 2024

Hi Amit,
Nice explanation! A couple of questions regarding SSL in Computer Vision:

  1. What happens if, in a training batch, a majority of the images belong to the same class? Will an SSL algorithm not fail, since it will push embeddings of images from the same class further apart?

  2. Also, can you please explain what causes an SSL algorithm to push images of the same class closer to each other? As far as training is concerned, only the original and its augmented versions are pushed close to each other. What causes other images of the same class to end up closer to each other?

from blog-comments.

amitness avatar amitness commented on May 29, 2024

@vahuja4 Regarding your first question, I had the same curiosity. The author answered it here and says that they don't handle it explicitly. I'm guessing it's not a problem for such a large dataset, since it's very rare that many images of the same class would end up in the same batch.

from blog-comments.

MailSuesarn avatar MailSuesarn commented on May 29, 2024

It's a really great article. I have a question I want to ask you. As the paper shows, a large batch size yields good results due to the larger number of negative pairs. Am I right? That means we need a lot of memory to hold the batch. My question is: if I use the multi-GPU method in this example https://keras.io/guides/distributed_training/, will it still work as if we are training with a large batch size, or does it only help train faster?

Thank you in advance

from blog-comments.

rjrobben avatar rjrobben commented on May 29, 2024

How do we train such a network if we do not have multiple GPUs?

from blog-comments.

M-Amrollahi avatar M-Amrollahi commented on May 29, 2024

Thanks for your explanation.
A question: when we have the similarity, why do we not use it directly as the loss function? Why do we feed the similarity into a softmax?

from blog-comments.

crazyboy9103 avatar crazyboy9103 commented on May 29, 2024

Thank you for the clear explanation! I have one question on the Noise Contrastive Estimation (NCE) loss, which has the indicator 1[k != i] in the denominator. According to the paper, 1[k != i] = 1 if k is not from the i-th image, meaning that only negative pairs are summed up in the denominator. Please correct me if I'm wrong.

from blog-comments.

HangLuoWh avatar HangLuoWh commented on May 29, 2024

nice explanation!

from blog-comments.

HuyTrinh212 avatar HuyTrinh212 commented on May 29, 2024

Very good and detailed article, but I have a problem.
If my dataset only has 2 labels, dog and cat, is training with batch size = 256 possible? Suppose a batch has 128 cats and 128 dogs; if you calculate the loss for one cat positive pair, does it still treat the remaining 127 cat images as negatives?

from blog-comments.

HuyTrinh212 avatar HuyTrinh212 commented on May 29, 2024

@amitness Very good explanation, thank you. I will read more about heuristics and clustering to try them out.

from blog-comments.
