Comments (11)
addressed in 339f4e7
As mentioned in the commit, I'm not happy with this demo still, and I'm not happy that epochs will now take a super long time. Closing the issue for now.
Depends. The old intended behavior is to sample the dataset at random indexes, for `len(data) / block_size` samples. So we'd need `DataLoader` and co. to feed `__getitem__` with indexes from 0 to `len(data) - block_size - 1`. That means `__len__` needs to return `len(data)` so `DataLoader` can do that. But `DataLoader` doesn't have a `num_samples` option. So it's going to run the epoch for `len(data)` samples, or basically 128x longer than currently (or whatever you set `block_size` to). At least, as far as I can tell from PyTorch's docs.

To recreate the original intended behavior, we need to manually feed a `RandomSampler` with `num_samples` set to `len(data) / block_size` to the `DataLoader`.
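For concreteness, a minimal sketch of that manual-sampler route (the `train_dataset` name and the batch size are placeholders, not the actual demo code):

```python
from torch.utils.data import DataLoader, RandomSampler

# Draw len(data) // block_size random window positions per "epoch",
# sampling with replacement to mimic the original random chopping.
sampler = RandomSampler(
    train_dataset,
    replacement=True,
    num_samples=len(train_dataset.data) // train_dataset.block_size,
)
loader = DataLoader(train_dataset, batch_size=64, sampler=sampler)
```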
The alternative is to have `CharDataset` pre-chunk the data. Then `__len__` can stay the same, and `__getitem__` just uses `self.chunked_data[idx]` to grab a sample. That's different behavior though: every epoch you'll feed the same "chunks", albeit in a different order, whereas the current behavior causes the chunks to be chopped out of the data randomly. Maybe it doesn't matter, I'm not sure. I figure it could cause issues on smaller datasets.
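A rough sketch of what that pre-chunking variant could look like (hypothetical; `ChunkedCharDataset` and its fields are made up for illustration):

```python
import torch
from torch.utils.data import Dataset

class ChunkedCharDataset(Dataset):
    """Splits the text into fixed, non-overlapping chunks up front."""

    def __init__(self, data, block_size):
        chars = sorted(set(data))
        self.stoi = {ch: i for i, ch in enumerate(chars)}
        self.block_size = block_size
        # Each chunk needs block_size + 1 characters to form an (x, y) pair.
        self.chunked_data = [
            data[i:i + block_size + 1]
            for i in range(0, len(data) - block_size, block_size + 1)
        ]

    def __len__(self):
        return len(self.chunked_data)

    def __getitem__(self, idx):
        dix = [self.stoi[s] for s in self.chunked_data[idx]]
        x = torch.tensor(dix[:-1], dtype=torch.long)  # input characters
        y = torch.tensor(dix[1:], dtype=torch.long)   # next-character targets
        return x, y
```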
Ouch, doh. The correct thing is to not ignore the `idx` that comes in, and to return the appropriate example for it.
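A minimal sketch of that fix, assuming the demo's existing fields (`self.data`, `self.stoi`, `self.block_size`) and `import torch`:

```python
def __getitem__(self, idx):
    # Use the idx that the sampler hands us instead of re-randomizing:
    # take block_size + 1 consecutive characters starting at idx.
    chunk = self.data[idx:idx + self.block_size + 1]
    dix = [self.stoi[s] for s in chunk]
    x = torch.tensor(dix[:-1], dtype=torch.long)  # input sequence
    y = torch.tensor(dix[1:], dtype=torch.long)   # targets, shifted by one
    return x, y
```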
(However, note that this change is then coupled to the Trainer code, where we have to use an appropriate sampler, or set `shuffle=True` when we create the `DataLoader` object. Which actually sounds like a good default setting, I believe.)
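On the Trainer side that would be roughly the following (batch size and worker count are placeholder values):

```python
from torch.utils.data import DataLoader

# shuffle=True visits every index 0..len(dataset)-1 once per epoch,
# in a random order; no manual sampler needed.
loader = DataLoader(train_dataset, batch_size=64, shuffle=True, num_workers=4)
```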
Yeah, probably a million ways to fix it honestly; I wasn't sure what's idiomatic in raw PyTorch. I suggested the workaround + assert since it seemed low effort and simple for what I assume is a notebook trying to get the details out of the way to explain the concepts clearly.
Idiomatic raw PyTorch, and the "correct" solution, is what I described above: returning the correct `idx`'th example. The workaround is not a good idea.
Okay. I'm happy to write a pull request, if you'd like. I'd probably fix up `CharDataset` and then use a `RandomSampler` with `replacement=True` on the loader to mimic the originally intended behavior (can't just use `shuffle=True` on the `DataLoader`, unfortunately).

The alternative is `replacement=False`, in which case I think `CharDataset` would have to be changed to pre-chunk the data, which is slightly different behavior than before. But then I think we can ditch the manual `RandomSampler` and just use `shuffle=True` on the loader. I think this could potentially be problematic for smaller datasets, though.
Why doesn't `shuffle=True` suffice?
Got it, yes, agree on all points. The only downside with my proposal above seems to be that epochs will last a long time and are over-estimated by a factor of block size. But the correct thing would be happening as far as training goes. If you squint, you can actually see that as correct, because technically every window of data is a different example: each output is conditioned on a slightly different-sized input. So, TLDR, I'm not too averse to that as the interpretation of "epoch". I'm kind of an enemy of the concept of epochs anyway; I much prefer to think about everything in just raw iterations and multiples thereof.
> I'm kind of an enemy of the concept of epochs anyway; I much prefer to think about everything in just raw iterations and multiples thereof.

Completely agree.
Functionally, the epochs are just acting as delimiters for when to log training and testing loss progress. So perhaps the best fix is to set `shuffle=True`; modify `CharDataset` so that `__len__` returns `len(self.data) - self.block_size - 1` and `__getitem__` uses `i = idx`; then drop the epoch parameters from the trainer and add a "log after this many iterations" parameter. Technically the `final_tokens` parameter already indicates how long the training loop should run, eliminating `max_epochs`.
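Putting those pieces together, a hedged sketch of the dataset side (field names follow the demo; the logging knob mentioned at the end is a made-up name for illustration):

```python
import torch
from torch.utils.data import Dataset

class CharDataset(Dataset):
    def __init__(self, data, block_size):
        chars = sorted(set(data))
        self.stoi = {ch: i for i, ch in enumerate(chars)}
        self.data = data
        self.block_size = block_size

    def __len__(self):
        # One example per valid window start position.
        return len(self.data) - self.block_size - 1

    def __getitem__(self, idx):
        i = idx  # trust the sampler's index; don't re-randomize
        chunk = self.data[i:i + self.block_size + 1]
        dix = [self.stoi[s] for s in chunk]
        x = torch.tensor(dix[:-1], dtype=torch.long)
        y = torch.tensor(dix[1:], dtype=torch.long)
        return x, y
```

The trainer would then log every `log_every_n_iters` steps (a hypothetical config name) and stop once `final_tokens` is reached, with no notion of `max_epochs`.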
> epochs are just acting as delimiters for when to log training and testing loss progress

Yes, exactly. For the char demo it may be just fine to use the `i = idx`; `shuffle=True` fix with no other changes. The epoch -> iterations logging change can be thought of as a separate issue. Let me think through the details here, since it's a little bit gnarly and would suddenly involve all the other demos too, etc. Bleh.