I work on a workstation with an NVIDIA card and a multicore Intel CPU (Intel's OpenCL SDK is installed).
I am trying to use the CPU as the OpenCL device, but I get nan at training time and training is then aborted. I tried various options and share their outputs below. The OpenCL-on-CPU case runs kernel compilation and vectorization before the training steps begin. Is this expected?
I also tried OpenCL-on-CPU training with the fork:
https://github.com/hughperkins/char-rnn
which keeps training, but with losses alternating between nan and ordinary floating-point values.
Is there a solution to get rid of these nans?
I am sharing the lengthy outputs for anyone interested in a performance comparison.
Lastly, I want to draw your attention to the fork's training output in the last section: there are two iterations (15 and 21) whose loss is not nan but an ordinary floating-point value.
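As a minimal check to narrow the problem down, basic pointwise and reduction ops could be tested for nan directly on the Intel device. This is only a sketch, assuming the usual cltorch API (`cltorch.setDevice`, the `:cl()` tensor conversion) and that device 1 is the Intel CPU, as in the device listing below:

```lua
require 'torch'
require 'cltorch'

cltorch.setDevice(1)  -- device 1 is the Intel CPU in my setup (see listing below)

-- a small tensor of ordinary values, copied to the OpenCL device
local a = torch.Tensor(5, 5):uniform()
local acl = a:cl()

-- exp and sum exercise the pointwise-apply and reduceAll kernels
local s = torch.exp(acl):sum()

-- nan is the only value that compares unequal to itself
print(s, s ~= s)
```

If this already prints nan/true, the problem is in the elementary kernels on the Intel platform rather than anything specific to char-rnn.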
cutorch and cltorch module listings:
th> require('cutorch')
{
streamWaitFor : function: 0x4058b558
deviceReset : function: 0x4058ba98
test : function: 0x409a58f0
_state : userdata: 0x0220bca0
streamSynchronize : function: 0x4058b928
manualSeed : function: 0x4058bf20
setStream : function: 0x4058b2b0
getMemoryUsage : function: 0x4058bcd8
setDefaultStream : function: 0x4058b488
getBlasHandle : function: 0x4058b030
CudaHostAllocator : torch.Allocator
getNumStreams : function: 0x4058b1e0
manualSeedAll : function: 0x4058bfe0
initialSeed : function: 0x4058bec0
getStream : function: 0x4058b370
setRNGState : function: 0x405840b0
setBlasHandle : function: 0x4058af68
seed : function: 0x4058bda0
getDeviceProperties : function: 0x4058bc18
reserveStreams : function: 0x4058b100
withDevice : function: 0x409a5958
setDevice : function: 0x4058bd40
seedAll : function: 0x4058be60
getNumBlasHandles : function: 0x4058af00
getDeviceCount : function: 0x4058bb58
createCudaHostTensor : function: 0x409a5998
getState : function: 0x40584170
getDevice : function: 0x4058b9b0
synchronize : function: 0x4058ad00
getRNGState : function: 0x40584048
streamWaitForMultiDevice : function: 0x4058b650
reserveBlasHandles : function: 0x4058ae30
streamBarrierMultiDevice : function: 0x4058b838
streamBarrier : function: 0x4058b720
}
[1.0545s]
th> require('cltorch')
{
setAddFinish : function: 0x41f1bd90
getDeviceCount : function: 0x41f1bbd8
getDeviceProperties : function: 0x41f1bcc8
getState : function: 0x41f1bcf0
getDevice : function: 0x412c43e0
setDevice : function: 0x412c4440
_state : userdata: 0x02467900
dumpTimings : function: 0x41f1bb28
setTrace : function: 0x41f1bd40
synchronize : function: 0x412c4468
finish : function: 0x41f1bbb0
}
[0.0146s]
th> cltorch.getDeviceProperties(1)
{
deviceType : "CPU"
maxClockFrequency : 2000
deviceName : " Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz"
maxMemAllocSizeMB : 16089
globalMemCachelineSizeKB : 0
deviceVersion : "OpenCL 1.2 (Build 43)"
localMemSizeKB : 32
openClCVersion : "OpenCL C 1.2 "
maxWorkGroupSize : 8192
globalMemSizeMB : 64358
platformVendor : "Intel(R) Corporation"
maxComputeUnits : 32
}
[0.0011s]
th> cltorch.getDeviceProperties(2)
{
deviceType : "GPU"
maxClockFrequency : 705
deviceName : "Quadro 410"
maxMemAllocSizeMB : 128
globalMemCachelineSizeKB : 0
deviceVersion : "OpenCL 1.1 CUDA"
localMemSizeKB : 47
openClCVersion : "OpenCL C 1.1 "
maxWorkGroupSize : 1024
globalMemSizeMB : 509
platformVendor : "NVIDIA Corporation"
maxComputeUnits : 1
}
training on CPU
th train.lua -data_dir data/tinyshakespeare/ -opencl 0 -gpuid -1
vocab.t7 and data.t7 do not exist. Running preprocessing...
one-time setup: preprocessing input text file data/tinyshakespeare/input.txt...
loading text file...
creating vocabulary mapping...
putting data into tensor...
saving data/tinyshakespeare/vocab.t7
saving data/tinyshakespeare/data.t7
loading data files...
cutting off end of data so that the batches/sequences divide evenly
reshaping tensor...
data load done. Number of data batches in train: 423, val: 23, test: 0
vocab size: 65
creating an LSTM with 2 layers
number of parameters in the model: 240321
cloning criterion
cloning rnn
1/21150 (epoch 0.002), train_loss = 4.19766416, grad/param norm = 4.5006e-01, time/batch = 1.74s
2/21150 (epoch 0.005), train_loss = 4.10134056, grad/param norm = 6.3375e-01, time/batch = 2.14s
3/21150 (epoch 0.007), train_loss = 3.44502399, grad/param norm = 9.4798e-01, time/batch = 1.48s
4/21150 (epoch 0.009), train_loss = 3.45054399, grad/param norm = 1.1340e+00, time/batch = 2.03s
5/21150 (epoch 0.012), train_loss = 3.33238818, grad/param norm = 7.8976e-01, time/batch = 1.62s
6/21150 (epoch 0.014), train_loss = 3.37363688, grad/param norm = 7.0334e-01, time/batch = 1.61s
7/21150 (epoch 0.017), train_loss = 3.36438210, grad/param norm = 6.5300e-01, time/batch = 2.70s
8/21150 (epoch 0.019), train_loss = 3.33342581, grad/param norm = 7.6950e-01, time/batch = 1.72s
9/21150 (epoch 0.021), train_loss = 3.29173263, grad/param norm = 6.1282e-01, time/batch = 1.79s
10/21150 (epoch 0.024), train_loss = 3.38000728, grad/param norm = 4.1881e-01, time/batch = 2.92s
training on GPU - CUDA
th train.lua -data_dir data/tinyshakespeare/ -opencl 0 -gpuid 0
using CUDA on GPU 0...
loading data files...
cutting off end of data so that the batches/sequences divide evenly
reshaping tensor...
data load done. Number of data batches in train: 423, val: 23, test: 0
vocab size: 65
creating an LSTM with 2 layers
number of parameters in the model: 240321
cloning criterion
cloning rnn
1/21150 (epoch 0.002), train_loss = 4.16315975, grad/param norm = 4.5507e-01, time/batch = 0.50s
2/21150 (epoch 0.005), train_loss = 4.06560737, grad/param norm = 6.1592e-01, time/batch = 0.45s
3/21150 (epoch 0.007), train_loss = 3.50594769, grad/param norm = 1.2221e+00, time/batch = 0.45s
4/21150 (epoch 0.009), train_loss = 3.45355825, grad/param norm = 1.3675e+00, time/batch = 0.44s
5/21150 (epoch 0.012), train_loss = 3.35222242, grad/param norm = 1.2052e+00, time/batch = 0.44s
6/21150 (epoch 0.014), train_loss = 3.37636928, grad/param norm = 8.7048e-01, time/batch = 0.44s
7/21150 (epoch 0.017), train_loss = 3.36737326, grad/param norm = 6.1815e-01, time/batch = 0.44s
8/21150 (epoch 0.019), train_loss = 3.32496874, grad/param norm = 4.2533e-01, time/batch = 0.44s
9/21150 (epoch 0.021), train_loss = 3.29095509, grad/param norm = 4.5369e-01, time/batch = 0.44s
10/21150 (epoch 0.024), train_loss = 3.38070163, grad/param norm = 4.3267e-01, time/batch = 0.44s
training on GPU - OPENCL
th train.lua -data_dir data/tinyshakespeare/ -opencl 1 -gpuid 1
registering spatialconvolutionmm
using OpenCL on GPU 1...
loading data files...
cutting off end of data so that the batches/sequences divide evenly
reshaping tensor...
data load done. Number of data batches in train: 423, val: 23, test: 0
vocab size: 65
creating an LSTM with 2 layers
Using NVIDIA Corporation platform: NVIDIA CUDA
Using device: Quadro 410
statefultimer v0.6
number of parameters in the model: 240321
cloning criterion
cloning rnn
1/21150 (epoch 0.002), train_loss = 4.19766393, grad/param norm = 4.5006e-01, time/batch = 1.31s
2/21150 (epoch 0.005), train_loss = 4.10134039, grad/param norm = 6.3375e-01, time/batch = 1.10s
3/21150 (epoch 0.007), train_loss = 3.44484827, grad/param norm = 9.4796e-01, time/batch = 1.11s
4/21150 (epoch 0.009), train_loss = 3.45040853, grad/param norm = 1.1346e+00, time/batch = 1.10s
5/21150 (epoch 0.012), train_loss = 3.33218116, grad/param norm = 7.8938e-01, time/batch = 1.09s
6/21150 (epoch 0.014), train_loss = 3.37349831, grad/param norm = 7.0234e-01, time/batch = 1.04s
7/21150 (epoch 0.017), train_loss = 3.36418301, grad/param norm = 6.5344e-01, time/batch = 0.96s
8/21150 (epoch 0.019), train_loss = 3.33336397, grad/param norm = 7.7021e-01, time/batch = 0.95s
9/21150 (epoch 0.021), train_loss = 3.29151368, grad/param norm = 6.1312e-01, time/batch = 0.95s
10/21150 (epoch 0.024), train_loss = 3.37983895, grad/param norm = 4.1876e-01, time/batch = 0.95s
training on CPU - OPENCL
th train.lua -data_dir data/tinyshakespeare/ -opencl 1 -gpuid 0
registering spatialconvolutionmm
using OpenCL on GPU 0...
loading data files...
cutting off end of data so that the batches/sequences divide evenly
reshaping tensor...
data load done. Number of data batches in train: 423, val: 23, test: 0
vocab size: 65
creating an LSTM with 2 layers
Using Intel(R) Corporation platform: Intel(R) OpenCL
Using device: Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
statefultimer v0.6
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (4)
Done.
number of parameters in the model: 240321
cloning criterion
cloning rnn
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was not vectorized
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (4)
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was not vectorized
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (8)
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (4)
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (4)
Done.
THClTensorMathTransformReduce.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_kernel_transformReduceOuterDimIndex> was successfully vectorized (4)
Kernel <THClTensor_kernel_transformReduceInnermostDimIndex> was successfully vectorized (4)
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was not vectorized
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (4)
Done.
THClReduce.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_reduceNoncontigDim> was successfully vectorized (4)
Kernel <THClTensor_reduceContigDim> was successfully vectorized (4)
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was not vectorized
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (4)
Done.
/home/bozkalayci/torch-distro/updates/cltorch/lib/THCl/THClGather.cpp build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_kernel_gather> was successfully vectorized (4)
Done.
THClReduceAll.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_reduceAll> was successfully vectorized (8)
Kernel <THClTensor_reduceAllPass1> was successfully vectorized (4)
Kernel <THClTensor_reduceAllPass2> was successfully vectorized (8)
Done.
/home/bozkalayci/torch-distro/updates/cltorch/lib/THCl/THClScatter.cpp build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_kernel_scatterFill> was successfully vectorized (4)
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was not vectorized
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (4)
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (4)
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was not vectorized
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (4)
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (4)
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (4)
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (4)
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (4)
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (4)
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (4)
Done.
THClReduceAll.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_reduceAll> was successfully vectorized (8)
Kernel <THClTensor_reduceAllPass1> was successfully vectorized (4)
Kernel <THClTensor_reduceAllPass2> was successfully vectorized (8)
Done.
THClReduceAll.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_reduceAll> was successfully vectorized (8)
Kernel <THClTensor_reduceAllPass1> was successfully vectorized (4)
Kernel <THClTensor_reduceAllPass2> was successfully vectorized (8)
Done.
1/21150 (epoch 0.002), train_loss = nan, grad/param norm = 1.0629e+02, time/batch = 84.92s
loss is exploding, aborting.
training on CPU - OPENCL - fork of hughperkins
th train.lua -data_dir data/tinyshakespeare/ -opencl 1 -gpuid 0
registering spatialconvolutionmm
using OpenCL on GPU 0...
loading data files...
cutting off end of data so that the batches/sequences divide evenly
reshaping tensor...
data load done. Number of data batches in train: 423, val: 23, test: 0
vocab size: 65
creating an LSTM with 2 layers
Using Intel(R) Corporation platform: Intel(R) OpenCL
Using device: Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
statefultimer v0.6
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (4)
Done.
number of parameters in the model: 240321
cloning criterion
cloning rnn
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was not vectorized
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (4)
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was not vectorized
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (8)
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (4)
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (4)
Done.
THClTensorMathTransformReduce.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_kernel_transformReduceOuterDimIndex> was successfully vectorized (4)
Kernel <THClTensor_kernel_transformReduceInnermostDimIndex> was successfully vectorized (4)
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was not vectorized
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (4)
Done.
THClReduce.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_reduceNoncontigDim> was successfully vectorized (4)
Kernel <THClTensor_reduceContigDim> was successfully vectorized (4)
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was not vectorized
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (4)
Done.
/home/bozkalayci/torch-distro/updates/cltorch/lib/THCl/THClGather.cpp build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_kernel_gather> was successfully vectorized (4)
Done.
THClReduceAll.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_reduceAll> was successfully vectorized (8)
Kernel <THClTensor_reduceAllPass1> was successfully vectorized (4)
Kernel <THClTensor_reduceAllPass2> was successfully vectorized (8)
Done.
/home/bozkalayci/torch-distro/updates/cltorch/lib/THCl/THClScatter.cpp build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_kernel_scatterFill> was successfully vectorized (4)
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was not vectorized
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (4)
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (4)
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was not vectorized
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (4)
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (4)
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (4)
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (4)
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (4)
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (4)
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (4)
Done.
THClReduceAll.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_reduceAll> was successfully vectorized (8)
Kernel <THClTensor_reduceAllPass1> was successfully vectorized (4)
Kernel <THClTensor_reduceAllPass2> was successfully vectorized (8)
Done.
THClReduceAll.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_reduceAll> was successfully vectorized (8)
Kernel <THClTensor_reduceAllPass1> was successfully vectorized (4)
Kernel <THClTensor_reduceAllPass2> was successfully vectorized (8)
Done.
1/21150 (epoch 0.002), train_loss = nan, grad/param norm = 1.0629e+02, time/batch = 87.60s
2/21150 (epoch 0.005), train_loss = nan, grad/param norm = 1.0279e+02, time/batch = 2.42s
3/21150 (epoch 0.007), train_loss = nan, grad/param norm = 9.8932e+01, time/batch = 2.40s
4/21150 (epoch 0.009), train_loss = nan, grad/param norm = 9.5078e+01, time/batch = 2.41s
5/21150 (epoch 0.012), train_loss = nan, grad/param norm = 9.1377e+01, time/batch = 2.38s
6/21150 (epoch 0.014), train_loss = nan, grad/param norm = 8.7885e+01, time/batch = 2.43s
7/21150 (epoch 0.017), train_loss = nan, grad/param norm = 8.4618e+01, time/batch = 2.38s
8/21150 (epoch 0.019), train_loss = nan, grad/param norm = 8.1572e+01, time/batch = 2.38s
9/21150 (epoch 0.021), train_loss = nan, grad/param norm = 7.8735e+01, time/batch = 2.38s
10/21150 (epoch 0.024), train_loss = nan, grad/param norm = 7.6093e+01, time/batch = 2.38s
11/21150 (epoch 0.026), train_loss = nan, grad/param norm = 7.3629e+01, time/batch = 2.64s
12/21150 (epoch 0.028), train_loss = nan, grad/param norm = 7.1329e+01, time/batch = 2.44s
13/21150 (epoch 0.031), train_loss = nan, grad/param norm = 6.9178e+01, time/batch = 2.40s
14/21150 (epoch 0.033), train_loss = nan, grad/param norm = 6.7162e+01, time/batch = 2.40s
15/21150 (epoch 0.035), train_loss = 4.19833310, grad/param norm = 2.8035e-01, time/batch = 2.42s
16/21150 (epoch 0.038), train_loss = nan, grad/param norm = 6.5226e+01, time/batch = 2.38s
17/21150 (epoch 0.040), train_loss = nan, grad/param norm = 6.3410e+01, time/batch = 2.39s
18/21150 (epoch 0.043), train_loss = nan, grad/param norm = 6.1704e+01, time/batch = 2.48s
19/21150 (epoch 0.045), train_loss = nan, grad/param norm = 6.0097e+01, time/batch = 2.56s
20/21150 (epoch 0.047), train_loss = nan, grad/param norm = 5.8581e+01, time/batch = 2.62s
21/21150 (epoch 0.050), train_loss = 4.20001215, grad/param norm = 2.8320e-01, time/batch = 2.57s
22/21150 (epoch 0.052), train_loss = nan, grad/param norm = 5.7113e+01, time/batch = 2.43s
23/21150 (epoch 0.054), train_loss = nan, grad/param norm = 5.5728e+01, time/batch = 2.40s
24/21150 (epoch 0.057), train_loss = nan, grad/param norm = 5.4416e+01, time/batch = 2.38s
25/21150 (epoch 0.059), train_loss = nan, grad/param norm = 5.3173e+01, time/batch = 2.37s
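Regarding the two finite losses in the fork's run above (iterations 15 and 21): to catch the first nan with some context instead of training through it, a guard could be dropped into train.lua right after the loss is computed. This is only a sketch; `train_loss`, `params`, `grad_params`, and the loop variable `i` are the names I assume the training loop uses:

```lua
-- hypothetical guard after the forward/backward pass in train.lua:
-- nan compares unequal to itself, so x ~= x detects it
if train_loss ~= train_loss then
    print(string.format('nan loss at iteration %d, grad/param norm = %.4e',
        i, grad_params:norm() / params:norm()))
    os.exit(1)
end
```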