Comments (11)
addressed in 339f4e7
As mentioned in the commit, I'm not happy with this demo still, and I'm not happy that epochs will now take a super long time. Closing the issue for now.
Depends. The old intended behavior is to sample the dataset at random indexes, for `len(data) / block_size` samples. So we'd need `DataLoader` and co. to feed `__getitem__` with indexes from 0 to `len(data) - block_size - 1`. That means `__len__` needs to return `len(data)` so `DataLoader` can do that. But `DataLoader` doesn't have a `num_samples` option. So it's going to run the epoch for `len(data)` samples, or basically 128x longer than currently (or whatever you set `block_size` to). At least, as far as I can tell from PyTorch's docs.

To recreate the original intended behavior, we need to manually feed a `RandomSampler` with `num_samples` set to `len(data) / block_size` to the `DataLoader`.
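For concreteness, a minimal sketch of that manual-sampler route (the `train_dataset` name and the batch size are placeholders, not the actual demo code):

```python
from torch.utils.data import DataLoader, RandomSampler

# Draw len(data) // block_size random window positions per "epoch",
# sampling with replacement to mimic the original random chopping.
sampler = RandomSampler(
    train_dataset,
    replacement=True,
    num_samples=len(train_dataset.data) // train_dataset.block_size,
)
loader = DataLoader(train_dataset, batch_size=64, sampler=sampler)
```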
The alternative is to have `CharDataset` pre-chunk the data. Then `__len__` can stay the same, and `__getitem__` just uses `self.chunked_data[idx]` to grab a sample. That's different behavior though: every epoch you'll feed the same "chunks", albeit in a different order, whereas the current behavior causes the chunks to be chopped out of the data randomly. Maybe it doesn't matter, I'm not sure. I figure it could cause issues on smaller datasets.
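A rough sketch of what that pre-chunking variant could look like (hypothetical; `ChunkedCharDataset` and its fields are made up for illustration):

```python
import torch
from torch.utils.data import Dataset

class ChunkedCharDataset(Dataset):
    """Splits the text into fixed, non-overlapping chunks up front."""

    def __init__(self, data, block_size):
        chars = sorted(set(data))
        self.stoi = {ch: i for i, ch in enumerate(chars)}
        self.block_size = block_size
        # Each chunk needs block_size + 1 characters to form an (x, y) pair.
        self.chunked_data = [
            data[i:i + block_size + 1]
            for i in range(0, len(data) - block_size, block_size + 1)
        ]

    def __len__(self):
        return len(self.chunked_data)

    def __getitem__(self, idx):
        dix = [self.stoi[s] for s in self.chunked_data[idx]]
        x = torch.tensor(dix[:-1], dtype=torch.long)  # input characters
        y = torch.tensor(dix[1:], dtype=torch.long)   # next-character targets
        return x, y
```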
Ouch, doh. The correct thing is to not ignore the `idx` that comes in, and to return the appropriate example for it.
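A minimal sketch of that fix, assuming the demo's existing fields (`self.data`, `self.stoi`, `self.block_size`) and `import torch`:

```python
def __getitem__(self, idx):
    # Use the idx that the sampler hands us instead of re-randomizing:
    # take block_size + 1 consecutive characters starting at idx.
    chunk = self.data[idx:idx + self.block_size + 1]
    dix = [self.stoi[s] for s in chunk]
    x = torch.tensor(dix[:-1], dtype=torch.long)  # input sequence
    y = torch.tensor(dix[1:], dtype=torch.long)   # targets, shifted by one
    return x, y
```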
(However, note that this change is then coupled to the Trainer code, where we have to use an appropriate sampler, or set `shuffle=True` when we create the `DataLoader` object. Which actually sounds like a good default setting, I believe.)
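On the Trainer side that would be roughly the following (batch size and worker count are placeholder values):

```python
from torch.utils.data import DataLoader

# shuffle=True visits every index 0..len(dataset)-1 once per epoch,
# in a random order; no manual sampler needed.
loader = DataLoader(train_dataset, batch_size=64, shuffle=True, num_workers=4)
```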
Yeah, probably a million ways to fix it honestly; I wasn't sure what's idiomatic in raw PyTorch. I suggested the workaround + assert since it seemed low effort and simple for what I assume is a notebook trying to get the details out of the way to explain the concepts clearly.
Idiomatic raw PyTorch, and the "correct" solution, is what I described above: returning the correct `idx`'th example. The workaround is not a good idea.
Okay. I'm happy to write a pull request, if you'd like. I'd probably fix up `CharDataset` and then use a `RandomSampler` with `replacement=True` on the loader to mimic the originally intended behavior (can't just use `shuffle=True` on the `DataLoader`, unfortunately).

The alternative is `replacement=False`, in which case I think `CharDataset` would have to be changed to pre-chunk the data, which is slightly different behavior than before. But then I think we can ditch the manual `RandomSampler` and just use `shuffle=True` on the loader. I think this could potentially be problematic for smaller datasets, though.
Why doesn't `shuffle=True` suffice?
Got it, yes, agree on all points. The only downside with my proposal above seems to be that epochs will last a long time and are over-estimated by a factor of block size. But the correct thing would be happening as far as training goes. If you squint, you can actually see that as correct, because technically every window of data is a different example: each output is conditioned on a slightly different-sized input. So, TLDR, I'm not too averse to that as the interpretation of "epoch". I'm kind of an enemy of the concept of epochs anyway; I much prefer to think about everything in just raw iterations and multiples thereof.
> I'm kind of an enemy of the concept of epochs anyway; I much prefer to think about everything in just raw iterations and multiples thereof.

Completely agree.
Functionally, the epochs are just acting as delimiters for when to log training and testing loss progress. So perhaps the best fix is to set `shuffle=True`; modify `CharDataset` so that `__len__` returns `len(self.data) - self.block_size - 1` and `__getitem__` uses `i = idx`; then drop the epoch parameters from the trainer and add a "log after this many iterations" parameter. Technically the `final_tokens` parameter already indicates how long the training loop should run, eliminating `max_epochs`.
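Putting those pieces together, a hedged sketch of the dataset side (field names follow the demo; the logging knob mentioned at the end is a made-up name for illustration):

```python
import torch
from torch.utils.data import Dataset

class CharDataset(Dataset):
    def __init__(self, data, block_size):
        chars = sorted(set(data))
        self.stoi = {ch: i for i, ch in enumerate(chars)}
        self.data = data
        self.block_size = block_size

    def __len__(self):
        # One example per valid window start position.
        return len(self.data) - self.block_size - 1

    def __getitem__(self, idx):
        i = idx  # trust the sampler's index; don't re-randomize
        chunk = self.data[i:i + self.block_size + 1]
        dix = [self.stoi[s] for s in chunk]
        x = torch.tensor(dix[:-1], dtype=torch.long)
        y = torch.tensor(dix[1:], dtype=torch.long)
        return x, y
```

The trainer would then log every `log_every_n_iters` steps (a hypothetical config name) and stop once `final_tokens` is reached, with no notion of `max_epochs`.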
> epochs are just acting as delimiters for when to log training and testing loss progress

Yes, exactly. For the char demo it may be just fine to use the `i = idx`; `shuffle=True` fix with no other changes. The epoch -> iterations logging change can be thought of as a separate issue. Let me think through the details here, since it's a little bit gnarly and would suddenly involve all the other demos too, etc. Bleh.