
Comments (23)

antonmil commented on September 25, 2024

Isn't creating an end-of-sequence token enough in this case? I assume the model will learn something sensible even if more than one definition goes into one batch, if there's enough training data.


kylemcdonald commented on September 25, 2024

@antonmil i think in theory the end-of-sequence token is enough (right now that would be the newline character), but the fact that sometimes it lists words with the same initial letter means it's trying to pick up on longer-scale patterns, which is a distraction. for this toy dictionary-word problem it probably doesn't make a difference. for tasks like translation, i expect it would take a lot of work to learn that end-of-sequence means "reset everything".

maybe i just need to randomize the line ordering and it will learn the real meaning behind the newline character?


calvingiles commented on September 25, 2024

I think what is required is to reinitialise the model state at the start of each line, so that it doesn't learn to predict the start of a line from the previous line's context. I am not sure where in the code this should be done (but I am interested in doing so too...)


bskaggs commented on September 25, 2024

I'm in a similar situation in an application I'm considering. As a hack, it may help to just shuffle the lines in your training file to help break the implied lexicographic dependencies. However, it would really be nice if there were some way to optionally zero out the start state on every new line.
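For what it's worth, the shuffling hack is just a one-off preprocessing step; a minimal sketch in plain Lua (the file names here are only placeholders):

-- minimal sketch of the line-shuffling hack (plain Lua; file names are placeholders)
math.randomseed(os.time())
local lines = {}
for line in io.lines('input.txt') do table.insert(lines, line) end
for i = #lines, 2, -1 do                -- Fisher-Yates shuffle
  local j = math.random(i)
  lines[i], lines[j] = lines[j], lines[i]
end
local out = io.open('input_shuffled.txt', 'w')
out:write(table.concat(lines, '\n'), '\n')
out:close()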


hughperkins commented on September 25, 2024

Looks like the initial state is initialized in lines 142-149

As far as determining where to run this at the end of sentences goes, a challenge you will face is that training is done by mini-batch. What if the end of the sentence falls inside the mini-batch? How do you handle this?

Probably you'd need to change the mini-batches to not contain multiple sentences: pad the ones that hit end-of-sentence with zeros, and do something to make sure the remaining zero characters cause no backpropagation learning.


bskaggs commented on September 25, 2024

Could you have a mostly-ones mask that is the size of your input for each minibatch? Let it be zero where you want the sequences to zero out the input state, and then multiply each column of the state by the corresponding column of your mask when you update your state.
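Roughly something like this, perhaps (a hedged, untested sketch; state stands in for one of char-rnn's batch_size x rnn_size state tensors, so the mask here zeroes whole rows, i.e. whole sequences):

-- hedged sketch: reset the hidden state for selected sequences in the batch
local batch_size, rnn_size = 3, 5
local state = torch.randn(batch_size, rnn_size)   -- stand-in for one LSTM state tensor
local mask = torch.ones(batch_size, 1)
mask[2] = 0                                        -- sequence 2 just hit end-of-sequence
state:cmul(mask:expandAs(state))                   -- zero that sequence's state, keep the rest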


hughperkins commented on September 25, 2024

@kylemcdonald Do you have a small sample training set to try on? As small as possible whilst being trainable. Preferably not copyrighted.


hughperkins commented on September 25, 2024

Looking at how this works a bit, since it seems a bit strange to only know about the low-level nn modules being called without knowing how char-rnn works :-P At each iteration, a set of sequences, rather like patches cut from images, is taken from the data set. For example, if the data set is 'The quick brown fox jumps over the lazy dog', and seq_length is 4, the first sequence might be:

The 

If we have a batch_size larger than 1, say 3, then we will have multiple sequences cut from the data; e.g. with batch_size 3, and the sentence as above, we might have:

The 
own 
s ov

Then a vertical 'slice' through these sequences is cut for each time step, which in this case (simply flipping horizontal and vertical) will be:

Tos
hw 
eno
  v

So each line represents one character from each of the sequences, and each line will be sent to the neural net, as a single batch, for each time step.
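In code, that slicing looks roughly like this (purely illustrative plain Lua, not the actual loader; the offsets are just the start positions of the three example sequences above):

local text = 'The quick brown fox jumps over the lazy dog'
local seq_length, batch_size = 4, 3
local offsets = {1, 13, 25}                  -- start of 'The ', 'own ', 's ov'
for t = 1, seq_length do
  local slice = {}
  for b = 1, batch_size do
    -- character t of sequence b: one element of the batch for this time step
    slice[b] = text:sub(offsets[b] + t - 1, offsets[b] + t - 1)
  end
  print(table.concat(slice))                 -- prints 'Tos', 'hw ', 'eno', '  v'
end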

So, the network is learning several sequences at once, cut from the original input, like patches from an image. I suppose that batching increases the efficiency of gpu training.

For each sequence, just one letter at a time is sent to the network. Each sequence in the batch gets its own 'personal' state, which is a set of rnn_size * 4 floats (the default network is a two-layer LSTM, so there is a hidden state and a cell state per layer). So the total size of the state sent to the network each time step is batch_size * rnn_size * 4.

So, one option would be to make seq_length at least as long as each input sentence, pad each sentence, and reset every state at the start of each mini-batch, ie around about the --- forward pass --- line. I'm not sure if this is the best way though, given the size of your sentences. You'd either need a really long seq_length, or really small sentences.

Alternatively, presumably one could reset the state for each sequence independently, at the same location, whenever a newline character is hit. This wouldn't need any change to the network, nor to the loading code.

Edit: but something is missing in my knowledge still: it seems like the state doesn't actually get reset after each mini-batch: the current state is copied into the state initializer, and reused for the next batch. I think. Still reading...


hughperkins commented on September 25, 2024

Ok, so what was puzzling me was the state, since this didn't seem to ever be reset during training, even at the end of an epoch. And yet it is initialized to zero right at the start of training, just once. It's not saved to the checkpoint.

To check what I reckon is happening, I tested with some other values of seq_length and batch_size and ran for an entire epoch. These were the sequences from each iteration during one epoch, with seq_length 3 and batch_size 3:

iteration 1: The|own|s o
iteration 2:  qu| fo|ver
iteration 3: ick|x j| th

Each sequence is separated here by a bar |, just my own notation for this. If you read vertically, you can see that if you take the first sequence from each batch, they form a much longer sequence, The quick. And from the second of each batch, we get own fox j. So it looks like, effectively, the input data is cut into batch_size chunks. Here, the chunks would be:

chunk 1: The quick
chunk 2: own fox j
chunk 3: s over th

The length of these chunks is the length of the training text, divided by batch_size, and then rounded down to a multiple of seq_length, I think.
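A rough sketch of that chunking, using the formula above (illustrative plain Lua; the real loader may truncate the tail slightly differently):

-- cut the text into batch_size contiguous chunks, each rounded down to a multiple of seq_length
local text = 'The quick brown fox jumps over the lazy dog'
local batch_size, seq_length = 3, 3
local chunk_len = math.floor(#text / batch_size / seq_length) * seq_length
for b = 1, batch_size do
  local chunk = text:sub((b - 1) * chunk_len + 1, b * chunk_len)
  print('chunk ' .. b .. ': ' .. chunk)
end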

Then one character at a time from each chunk is sent to the network, combined with the characters from the other batch_size - 1 chunks. This is done in blocks of seq_length, running back propagation, and therefore learning, after each seq_length characters. The state persists from each group of seq_length characters to the next.

The key point, I think, is that each position in the mini-batch is learning a contiguous chunk of the input, of size input_size / batch_size. This can be arbitrarily long, eg if there are 1 million characters, and batch_size is 50, then the chunk size will be 20,000. And so, the state doesn't ever need to be reset, since each position is learning a contiguous stretch of input. Each chunk gets its own state, which is updated as we move through it.

Still, one mystery remains to me: why doesn't the state get reset at the end of the chunk? Seems like it should be?


hughperkins commented on September 25, 2024

(By the way, if you want to experiment with sending text into the trainer and seeing how it is being grouped etc., you can use this branch here, which prints output similar to the above: https://github.com/karpathy/char-rnn/compare/master...hughperkins:print-each-x?expand=1 )


antonmil commented on September 25, 2024

Thanks for the explanation. To me it also seems strange that the state is not reset after each chunk is processed. I can imagine that it doesn't matter in practice, because actually you only have one true start of the text, and the other 49 (if batch size is 50) are just artificially created.

A slightly different question: Is it correct that each batch is trained completely independently? If you use the same data but reduce the batch_size to 1, will the model eventually arrive at the same result? (Given that the splits are identical)


hughperkins commented on September 25, 2024

@kylemcdonald This branch might do approximately what you want: https://github.com/karpathy/char-rnn/compare/master...hughperkins:single-sentences?expand=1 I haven't really tested it, but maybe it gives a close enough starting point that you can fix the various bugs that probably exist in it?

Edit: hmmm, this doesn't consider back-propagation...


hughperkins commented on September 25, 2024

@antonmil I reckon that the batch_size radically affects the outcome. Firstly, I tried a batch_size of 1000 once, on a mega-powerful AMD, and it trained an entire epoch of shakespeare in about 10 seconds, and yet almost no learning took place.

Secondly, more theoretically, there seem to be several reasons why the batch_size will affect the learning results:

  • the state is entirely different, ie there are batch_size independent states, being used to train against batch_size non-overlapping chunks of the input data
  • back propagation is performed after the forward propagations for all batch_size sequences have taken place, so the change in the gradients will plausibly be averaged over the entire batch, probably reducing the effective learning rate, I suppose?
  • the chunks of input_size / batch_size characters are rounded down to a multiple of seq_length, typically truncating seq_length / 2 characters on average, so there are on average batch_size * seq_length / 2 characters missing from the training set. Not a lot admittedly, with a large enough training set, but enough to pretty much guarantee the numbers won't exactly match

Edit: oh, I missed the word eventually. Do you mean, if one trained for an infinite amount of time, would they both produce the same results? Hmmm... that's a good question. I think so, since the size of the Linear modules is independent of the batch_size https://github.com/karpathy/char-rnn/blob/master/model/LSTM.lua#L30 The effective learning rate will change, but I reckon that both models will eventually be similar.


hughperkins commented on September 25, 2024

Seems I forgot to consider backpropagation above. Do we just need to reset the state during backpropagation too, in the same way?


hughperkins commented on September 25, 2024

For the back-propagation, what I think is:

  • we're using back propagation through time eg https://en.wikipedia.org/wiki/Backpropagation_through_time , or http://www.willamette.edu/~gorr/classes/cs449/rnn1.html
  • so, during forward propagation, we're leaving a trail of states behind us in rnn_state table, one for each t in seq_length
  • when we back-propagate, it's just normal back-propagation basically, where rnn_state holds some additional input values for each layer
  • the y values are the next letter in the sequence each time, eg if the sequence is quick, then y will be uick, ie if the current letter is q, ideally our network should predict u
  • on the way backwards, during back-propagation, the prediction for the current time step and the corresponding target letter from y are fed into criterion:backward, to get the gradient of the loss with respect to the input of the criterion layer, which is also the gradient with respect to the output of the network
  • drnn_state is the gradient with respect to loss at each time step, working backwards through the layers
  • drnn_state for each timestep contains:
    • gradient wrt state from next timestep, t+1, as drnn_state[t][1..4]
    • gradient from criterion for current timestep, t in drnn_state[t][5]

Looking at backpropagation if we are resetting:

  • the forward states, rnn_state, are not modified during backpropagation, and are exactly what we created during forward propagation
  • presumably, when we hit a newline character, we should reset the gradient state drnn_state

Thinking through in detail:

  • when we have a newline in the input, our prediction for the next character is irrelevant, so the gradient wrt the criterion can be zeroed
  • and we reset the gradient state to zero too; see the sketch below
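One way that zeroing might look, inside the backward loop of train.lua (a hedged, untested sketch: x, t and drnn_state are meant as in train.lua, and newline_idx is an assumed variable holding the vocabulary index of the newline character):

-- hedged sketch: cut the gradient at newline positions before backprop continues
local is_newline = x[{{}, t}]:eq(newline_idx):typeAs(drnn_state[t][1])  -- 1 where this time step is a newline
local keep = is_newline:mul(-1):add(1):view(-1, 1)                      -- 0 where the gradient should be cut
for k = 1, #drnn_state[t] do
  drnn_state[t][k]:cmul(keep:expandAs(drnn_state[t][k]))                -- zero criterion and state gradients for those rows
end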


hughperkins commented on September 25, 2024

Maybe something like this? https://github.com/karpathy/char-rnn/compare/master...hughperkins:single-sentences?expand=1


hughperkins commented on September 25, 2024

Did anyone ever try this? ( I did not... but I'm kind of interested to know if it works :-P )


commented on September 25, 2024

I think I found a very simple solution, given the preliminaries (see below). As suggested in #127, but even simpler.

Preliminaries:

  • Word-level language model
  • Data consists of one sentence per line, each padded to the same sequence length

I did not try the solution from the last link because it checks for newlines within the closure, and if you model independent batches (implied by wanting to reset the state after each batch), the batches should be shuffled after each epoch.

So my batch loader always returns a random full line, and nothing more. My sequence length == line length.
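For what it's worth, that loader boils down to something like this (a hedged sketch; lines is assumed to be a table of already-padded 1D index tensors, and the wrap-around target for the last position mirrors what the stock loader does):

-- hedged sketch of a one-line-per-batch loader (batch_size 1, seq_length == line length)
local function next_batch(lines)
  local i = math.random(#lines)          -- shuffling by sampling a random line each call
  local x = lines[i]
  local y = x:clone()
  y:sub(1, -2):copy(x:sub(2, -1))        -- target is the next character
  y[y:size(1)] = x[1]                    -- wrap-around for the final position
  return x:view(1, -1), y:view(1, -1)    -- shaped as batch_size x seq_length
end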

Possible solution:

I confirmed that rnn_state[t] is just a view on clones.rnn[t].output, so no need to manipulate the latter.

We can reset the state after each batch by simply commenting out this line (currently in https://github.com/karpathy/char-rnn/blob/master/train.lua#L289), which carries over the last state of the previous batch to the global state, and therefore to the first state of the next batch:

-- init_state_global = rnn_state[#rnn_state]

Correspondingly, for the validation pass (currently https://github.com/karpathy/char-rnn/blob/master/train.lua#L238):

-- rnn_state[0] = rnn_state[#rnn_state]

I verified that each batch starts with a zeroed rnn_state.

Apparently we don't need to reset the gradient state because we always init it with zero (currently https://github.com/karpathy/char-rnn/blob/master/train.lua#L272):

local drnn_state = {[opt.seq_length] = clone_list(init_state, true)}

In short:

  • Modify batch loader to prepare appropriate batches (and maybe shuffle them).
  • No need to explicitly reset, just don't carry over the last state from the previous batch.

Note that I don't use multiple sequences per minibatch. The solution above should work for that as well, but I have not investigated yet.

Does this work for you?


hughperkins commented on September 25, 2024

I did not try the solution from the last link because it checks for newlines within the closure, and if you model independent batches (implied by wanting to reset the state after each batch), the batches should be shuffled after each epoch. [edit: and therefore, implied: one cannot feed multiple examples in the same minibatch]

Well... yes and no... you could still feed in the examples in any order. But whether the plausible increase in mini-batch utilization is worth the extra complexity is an open question.


hughperkins commented on September 25, 2024

Word-level language model

A word-level language model will probably increase speed, and presumably allows one to easily enforce a fairly strict prior on allowable words, which might increase accuracy and reduce overfitting, but sometimes character-level models can be fun too. I quite like Karpathy's examples of generating Linux kernel C code, for example :-)


commented on September 25, 2024

Btw, thanks hughperkins for your valuable insights above!

Well... yes and no... you could still feed in the examples in any order. But whether the plausible increase in mini-batch utilization is worth the extra complexity is an open question.

Ok true, I didn't want to rule out other use cases.
I actually thought that resetting the state would render shuffling unnecessary, but if I run my code without shuffling it converges much faster, also on the (unshuffled) validation set (which has a very similar structure to my training set, so that explains that). Not sure I would trust a non-shuffled model.

Word-level language model will probably increase speed, and presumably allows one to easily enforce a fairly strict prior on allowable words, which might increase accuracy and reduce overfitting, but sometimes character-level models can be fun too. I quite like Karpathy's examples of generation of linux kernel c-code for example :-)

Character-level sure is more complex and more powerful, given enough data. I expect even more so with synthetic languages.

My use case is kind of weird. I just need the word-LM to fuel some other predictions on the input.
Just wanted to make clear what I am doing, and why my solution above works for me.


commented on September 25, 2024

This wouldn't normally be expected, right?

Correct. I expected it to be the other way round. Shuffling -> faster convergence. But I do have very unusual (unnatural) data, so this unexpected behavior became another part of my research.

I think this is maybe related to:
#78 (comment)
ie, each bit of the text is trained fairly independently of the other sections.

You mean each sequence within a batch, if batch_size > 1? Yes, it should be, I guess. But then definitely with shuffling of sequences (not only batches of sequences), or else some crazy overfitting would take place. I guess.

(I have not yet tried batch_size > 1 because I didn't need it.)


hughperkins commented on September 25, 2024

(sorry, deleted my last post, must have been just before you posted a reply to it :-P I decided my last post isn't quite accurate, so wiped it; perhaps a bit too aggressively :-P )

