The paper states in section 3.2 that the default training mode is to select one layer and mask out the others. I believe that with this implementation, the layers after the selected layer are not masked out, because the default cases use torch.ones instead of torch.zeros.
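To make the concern concrete, here is a minimal, framework-free sketch of what I understand the intended behaviour to be. The helper `build_layer_gates`, the layer count of 5, and the use of scalar gates in place of full mask tensors are all my own assumptions for illustration, not the repository's code. With real tensors, the 0.0 default would correspond to torch.zeros(shape) and 1.0 to torch.ones(shape).

```python
def build_layer_gates(num_layers, selected, default=0.0):
    """Return one gate per layer: 1.0 keeps the layer's features, 0.0 masks them out.

    Hypothetical helper for illustration only. Scalars stand in for mask
    tensors: 0.0 plays the role of torch.zeros, 1.0 the role of torch.ones.
    """
    gates = [default] * num_layers
    gates[selected] = 1.0  # the selected layer is always kept
    return gates


# Zeros as the default: only the selected layer survives.
zeros_default = build_layer_gates(num_layers=5, selected=2, default=0.0)

# Ones as the default: every layer that falls back to the default stays active.
ones_default = build_layer_gates(num_layers=5, selected=2, default=1.0)

print(zeros_default)
print(ones_default)
```

With a zeros default, exactly one gate is active; with a ones default, none of the layers that fall through to the default are masked, which is the behaviour I think I am seeing.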
I've uploaded a short document that discusses the issue. It also covers an edge case: if the 1st layer (365,) is selected, the second layer may be randomised and the third layer may not use the spatially varying mask.
Potential Mask Issue in Starter Code.pdf
Note: due to computational resource limits, I reduced the image size to 64x64 and the number of classes to 10, so my code differs slightly from the original. Thanks for this implementation, by the way. It has been a great help to me!