
Comments (5)

spirosChv avatar spirosChv commented on May 18, 2024

> Hi @spirosChv, thanks for your comment, and @GaganaB for the great insight and explanation. I think I understand the train vs. test case, although in practice I've previously only seen both being shuffled. Is there any advantage to that, @GaganaB?

In practice, there is no advantage to shuffling the test set. During the test phase, the inputs pass through a static network, so the order does not matter. During training, however, you want to shuffle to get a less biased estimate of the gradient. Imagine that you have collected some images where the first half is clean and the second half is blurry. If you do not shuffle, each batch contains only one type of image, which biases every gradient update.
On the other hand, when you test, nothing is being learned, so the order plays no role. In the end, you report a loss and an accuracy score as an average across all testing inputs, and an average does not depend on the order of its terms.
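To see why order cannot affect the reported test metric, here is a minimal sketch with made-up per-example loss values (the numbers and names are illustrative only):

```python
import random

# Hypothetical per-example test losses.
losses = [0.9, 0.1, 0.4, 0.7, 0.2]

mean_in_order = sum(losses) / len(losses)

shuffled = losses[:]
random.shuffle(shuffled)
mean_shuffled = sum(shuffled) / len(shuffled)

# The reported average is the same regardless of evaluation order.
assert abs(mean_in_order - mean_shuffled) < 1e-12
```

The same argument applies to accuracy: it is a count of correct predictions divided by the total, and neither quantity depends on ordering.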

PS. I suggest continuing this conversation on Discord to increase visibility for other TAs/students/etc.

Thank you.

from course-content-dl.

spirosChv avatar spirosChv commented on May 18, 2024

@wizofe, thank you for contributing to our repo. Although I do not remember by heart where this is used, shuffle is set to False during the test phase because the order of the images does not matter there. During training, however, the order matters because we split the dataset into batches. Does this make sense?

The drop_last argument is unnecessary during the test phase, as we do not care if one batch is smaller. During training, however, we may want to drop the last batch if it is smaller (although I am not quite sure it makes much difference).
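As a rough sketch of the arithmetic behind drop_last (the helper name is hypothetical, not part of any library):

```python
def num_batches(n_examples, batch_size, drop_last):
    """How many batches a loader yields for a dataset of n_examples."""
    full = n_examples // batch_size  # number of complete batches
    if drop_last or n_examples % batch_size == 0:
        return full
    return full + 1  # keep the final, smaller batch

# 100 examples, batch size 32: three full batches of 32 plus a partial batch of 4.
assert num_batches(100, 32, drop_last=False) == 4
assert num_batches(100, 32, drop_last=True) == 3
```

With drop_last=True at test time you would silently skip those last 4 examples, which is exactly why it matters more there than during training.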


GaganaB avatar GaganaB commented on May 18, 2024

Hi @wizofe, I agree with Spiros here. I'll elaborate below just to clarify a few things (hopefully).

  • Shuffling: The test and train sets are generated by a probabilistic distribution over the entire data, called the data-generating process. This rests on the i.i.d. assumption, i.e., that examples are independent and identically distributed. We shuffle the data to overcome catastrophic forgetting and to ensure representative samples across the train/validation/test sets.
    Mathematically speaking: assume the network has P weights, collected in W; the loss L then defines a surface in a (P+1)-dimensional space. This arises from the fact that for any given weights W, the loss function can be evaluated on the data X, and that value becomes the elevation of the surface.
    But there is the problem of non-convexity: the surface described above has numerous local minima, so gradient descent algorithms are susceptible to getting "stuck" in one of them while a deeper/lower/better solution may lie nearby. This is likely to occur if X is unchanged over all training iterations, because the surface is fixed for a given X; all its features are static, including its various minima. Shuffling the training data changes which examples land in each batch, so the per-batch loss surface changes between iterations and gradient descent is less likely to get stuck. And since we follow the i.i.d. assumption anyway, shuffling the test set is not necessary.

  • Drop_Last: The drop_last parameter signals to the sampler to drop the tail of the data so that it is evenly divisible across batches (or, in distributed settings, across the number of replicas). Since we shuffle the training data anyway, we can afford to drop the last non-full batch: a different subset is dropped each epoch. We do not have the same luxury with the test set, where every example should contribute to the reported metrics.
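Putting both points together, here is a minimal pure-Python sketch of DataLoader-style batching over example indices (the batches helper is illustrative only, not the PyTorch API):

```python
import random

def batches(indices, batch_size, shuffle, drop_last, seed=None):
    """Minimal sketch of DataLoader-style batching over example indices."""
    idx = list(indices)
    if shuffle:
        random.Random(seed).shuffle(idx)  # a fresh permutation each epoch
    out = [idx[i:i + batch_size] for i in range(0, len(idx), batch_size)]
    if drop_last and out and len(out[-1]) < batch_size:
        out.pop()  # discard the final, smaller batch
    return out

# Typical settings: shuffle + drop_last for training, neither for testing.
train = batches(range(10), batch_size=4, shuffle=True, drop_last=True, seed=0)
test = batches(range(10), batch_size=4, shuffle=False, drop_last=False)

assert all(len(b) == 4 for b in train)   # training yields only full batches
assert test[0] == [0, 1, 2, 3]           # test order is preserved
assert len(test[-1]) == 2                # the partial test batch is kept
```

The real torch.utils.data.DataLoader does more (workers, collation, samplers), but the shuffle/drop_last semantics discussed above follow this pattern.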

I hope that helps. Feel free to comment below if we can be of further assistance. :)


wizofe avatar wizofe commented on May 18, 2024

Hi @spirosChv, thanks for your comment, and @GaganaB for the great insight and explanation. I think I understand the train vs. test case, although in practice I've previously only seen both being shuffled. Is there any advantage to that, @GaganaB?


wizofe avatar wizofe commented on May 18, 2024

Thank you both!

