Comments (11)

karpathy commented on May 14, 2024

addressed in 339f4e7

As mentioned in the commit, I'm still not happy with this demo, and I'm not happy that epochs will now take a super long time. Closing the issue for now.

fpgaminer commented on May 14, 2024

Depends. The old intended behavior is to sample the dataset at random indexes, for len(data) / block_size samples per epoch. So we'd need DataLoader and co. to feed __getitem__ with indexes from 0 to (len(data) - block_size - 1). That means __len__ needs to return len(data) so DataLoader can do that. But DataLoader doesn't have a num_samples option, so it's going to run the epoch for len(data) samples, or basically 128x longer than currently (or whatever you set block_size to).

At least, as far as I can tell from PyTorch's docs.
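
For context, the __getitem__ under discussion looks roughly like this (a sketch of the minGPT char demo from memory, so details may differ from the actual repo code):

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class CharDataset(Dataset):
    def __init__(self, data, block_size):
        chars = sorted(set(data))
        self.stoi = {ch: i for i, ch in enumerate(chars)}
        self.block_size = block_size
        self.data = data

    def __len__(self):
        # one "epoch" = len(data) / block_size samples
        return len(self.data) // self.block_size

    def __getitem__(self, idx):
        # the problem: idx is ignored and a random offset is drawn instead,
        # so the DataLoader's index bookkeeping does nothing
        i = np.random.randint(0, len(self.data) - (self.block_size + 1))
        chunk = self.data[i:i + self.block_size + 1]
        dix = [self.stoi[s] for s in chunk]
        x = torch.tensor(dix[:-1], dtype=torch.long)  # inputs
        y = torch.tensor(dix[1:], dtype=torch.long)   # targets, shifted by one
        return x, y
```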

To recreate the originally intended behavior, we'd need to manually pass the DataLoader a RandomSampler with num_samples set to len(data) / block_size.
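
In code, that would be something like the following (assuming dataset is the CharDataset above, and batch_size is a placeholder; the num_samples/replacement semantics are my reading of the docs, so treat this as a sketch):

```python
from torch.utils.data import DataLoader, RandomSampler

# draw len(data) // block_size random indices per "epoch", mimicking the
# old behavior instead of walking all len(data) window positions
sampler = RandomSampler(dataset, replacement=True,
                        num_samples=len(dataset.data) // dataset.block_size)
loader = DataLoader(dataset, batch_size=64, sampler=sampler)
```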

The alternative is to have CharDataset pre-chunk the data. Then __len__ can stay the same, and __getitem__ just uses self.chunked_data[idx] to grab a sample. That's different behavior, though: every epoch you'd feed the same "chunks", albeit in a different order, whereas the current behavior chops the chunks out of the data at random. Maybe it doesn't matter, I'm not sure; I figure it could cause issues on smaller datasets.

karpathy commented on May 14, 2024

Ouch, doh. The correct thing is to not ignore the idx that comes in, and to return the example corresponding to it.

karpathy commented on May 14, 2024

(However, note that this change is then coupled to the Trainer code, where we have to use an appropriate sampler, or set shuffle=True when we create the DataLoader object. That actually sounds like a good default setting, I believe.)
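
Concretely, the idx-respecting version would be something like this (again a sketch, not the exact repo code; the input file and batch size at the bottom are placeholders):

```python
import torch
from torch.utils.data import DataLoader, Dataset

class CharDataset(Dataset):
    def __init__(self, data, block_size):
        chars = sorted(set(data))
        self.stoi = {ch: i for i, ch in enumerate(chars)}
        self.block_size = block_size
        self.data = data

    def __len__(self):
        # one example per valid window start (each needs block_size + 1 chars),
        # so every idx the DataLoader can produce maps to a full window
        return len(self.data) - self.block_size

    def __getitem__(self, idx):
        # return the idx'th window instead of a random one
        chunk = self.data[idx:idx + self.block_size + 1]
        dix = [self.stoi[s] for s in chunk]
        x = torch.tensor(dix[:-1], dtype=torch.long)
        y = torch.tensor(dix[1:], dtype=torch.long)
        return x, y

# the coupled Trainer-side change: shuffle rather than a custom sampler
loader = DataLoader(CharDataset(open('input.txt').read(), block_size=128),
                    batch_size=64, shuffle=True)
```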

fpgaminer commented on May 14, 2024

Yeah, there are probably a million ways to fix it, honestly; I wasn't sure what's idiomatic in raw PyTorch. I suggested the workaround + assert since it seemed low-effort and simple for what I assume is a notebook trying to get the details out of the way so it can explain the concepts clearly.

karpathy commented on May 14, 2024

Idiomatic raw PyTorch, and the "correct" solution, is what I described above: returning the idx'th example. The workaround is not a good idea.

fpgaminer commented on May 14, 2024

Okay. I'm happy to write a pull request, if you'd like. I'd probably fix up CharDataset and then use RandomSampler with replacement=True on the loader to mimic the originally intended behavior (can't just use shuffle=True on the DataLoader, unfortunately).

The alternative is replacement=False, in which case CharDataset would have to be changed to pre-chunk the data, which is slightly different behavior from before. But then we could ditch the manual RandomSampler and just use shuffle=True on the loader. This could be problematic for smaller datasets, though.
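
The pre-chunked variant would look roughly like this (a sketch; the chunk boundaries are fixed once at construction, so every epoch revisits the same windows, just shuffled):

```python
import torch
from torch.utils.data import Dataset

class ChunkedCharDataset(Dataset):
    def __init__(self, data, block_size):
        chars = sorted(set(data))
        self.stoi = {ch: i for i, ch in enumerate(chars)}
        # fix the chunk boundaries up front: non-overlapping windows of
        # block_size + 1 characters
        self.chunks = [data[i:i + block_size + 1]
                       for i in range(0, len(data) - block_size, block_size)]

    def __len__(self):
        return len(self.chunks)  # roughly len(data) / block_size, as before

    def __getitem__(self, idx):
        dix = [self.stoi[s] for s in self.chunks[idx]]
        return (torch.tensor(dix[:-1], dtype=torch.long),
                torch.tensor(dix[1:], dtype=torch.long))
```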

karpathy commented on May 14, 2024

Why doesn't shuffle=True suffice?

karpathy commented on May 14, 2024

Got it, yes, agree on all points. The only downside with my proposal above seems to be that epochs will last a long time and are over-estimated by a factor of block_size. But the correct thing would be happening as far as training goes. If you squint, you can actually see that as correct, because technically every window of data is a different example: each output is conditioned on a slightly different-sized input. So TL;DR, I'm not too averse to that as the interpretation of "epoch". I'm kind of an enemy of the concept of epochs anyway, I much prefer to think about everything in just raw iterations and multiples thereof.

fpgaminer commented on May 14, 2024

I'm kind of an enemy of the concept of epochs anyway, I much prefer to think about everything in just raw iterations and multiples thereof.

Completely agree.

Functionally, the epochs are just acting as delimiters for when to log training and testing loss progress. So perhaps the best fix is: set shuffle=True; modify CharDataset so that __len__ is len(self.data) - self.block_size - 1 and __getitem__ uses i = idx; and drop the epoch parameters from the trainer, adding a "log after this many iterations" parameter instead. Technically the final_tokens parameter already indicates how long the training loop should run, eliminating max_epochs.
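
Sketched out, the logging side of that might look like the following; the loop shape, the log_every name, and the (logits, loss) model output are assumptions on my part, not the actual Trainer code:

```python
def train(model, loader, optimizer, final_tokens, log_every=100):
    """Train until final_tokens tokens have been consumed, logging every
    log_every iterations -- no notion of epochs anywhere."""
    tokens_seen, it = 0, 0
    while tokens_seen < final_tokens:
        for x, y in loader:
            logits, loss = model(x, y)  # assumes a minGPT-style (logits, loss) return
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            tokens_seen += x.numel()
            it += 1
            if it % log_every == 0:
                print(f"iter {it}: train loss {loss.item():.4f} ({tokens_seen} tokens)")
            if tokens_seen >= final_tokens:
                return
```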

karpathy commented on May 14, 2024

epochs are just acting as delimiters for when to log training and testing loss progress

Yes, exactly. For the char demo it may be just fine to use the i = idx; shuffle=True fix with no other changes. The epoch -> iterations logging change can be thought of as a separate issue. Let me think through the details here, since it's a little bit gnarly and would suddenly involve all the other demos too, etc. Bleh.
