The inspect_checkpoint tool works fine, however:
$ th inspect_checkpoint.lua cv/lm_lstm_epoch18.96_1.4228.t7
using CUDA on GPU 0...
opt:
{
max_epochs : 30
seed : 123
batch_size : 100
gpuid : 0
decay_rate : 0.95
savefile : "lstm"
model : "lstm"
grad_clip : 5
print_every : 1
data_dir : "data/tinyshakespeare"
seq_length : 50
num_layers : 2
rnn_size : 100
train_frac : 0.95
learning_rate : 0.002
dropout : 0
eval_val_every : 1000
val_frac : 0.05
checkpoint_dir : "cv"
}
val losses:
{
2000 : 1.5233611306277
3000 : 1.4519253438169
4000 : 1.4227915313027
1000 : 1.7233323420178
}
from char-rnn.
Did you train one with GPU and the other with CPU? Check the "gpuid" flag. Is it "0" on both models?
Yes, GPU on both. What's interesting is that checkpoints created later in the training process work, as does the very last one.
@swisspol, I'm running into the same issue inside an iTorch notebook environment, but it works fine on the standard command line. I'm very new to Lua / Torch, but it would be good to figure out what's causing this.
I had this error on OSX 10.10. Using -opencl 1 on both train.lua and sample.lua made it work.
Had this problem on a c2.2xl instance. Tried -opencl 1, but no luck.
@svickers can you post the fourth line of your output, the one that has 'Linear.lua:46' in it? Edit: and also the results of inspect_checkpoint.
@hughperkins Sorry Hugh! I killed that VM and went to a g2.2xl, and everything worked straight away.
@svickers oh, nice! Hmmm, c2 instances don't actually have a GPU, right? g2 sounds like a GPU instance.
Hello.
I ran into the same problem, and am dealing with a lot of frustration in my holy quest of running tests on a Monty Python's Flying Circus corpus :). The error also appears when trying the tiny Shakespeare data set.
I run CPU-only computing (no GPU). The training goes well, with no NaN values:
th train.lua -data_dir data/tinyshakespeare/ -gpuid -1
loading data files...
cutting off end of data so that the batches/sequences divide evenly
reshaping tensor...
data load done. Number of data batches in train: 423, val: 23, test: 0
vocab size: 65
creating an lstm with 2 layers
number of parameters in the model: 240321
cloning rnn
cloning criterion
1/21150 (epoch 0.002), train_loss = 4.19766416, grad/param norm = 4.5006e-01, time/batch = 0.34s
2/21150 (epoch 0.005), train_loss = 4.10134056, grad/param norm = 6.3375e-01, time/batch = 0.28s
3/21150 (epoch 0.007), train_loss = 3.44502399, grad/param norm = 9.4798e-01, time/batch = 0.28s
Sampling raises an error regardless of the checkpoint file used:
th sample.lua cv/lm_lstm_epoch26.00_1.3900.t7 -gpuid -1
creating an lstm...
missing seed text, using uniform probability over first character
--------------------------
/usr/local/bin/luajit: /usr/local/share/lua/5.1/nn/Linear.lua:46: invalid arguments: DoubleTensor number DoubleTensor number FloatTensor DoubleTensor
expected arguments: *DoubleTensor~2D* [DoubleTensor~2D] [double] DoubleTensor~2D DoubleTensor~2D | *DoubleTensor~2D* double [DoubleTensor~2D] double DoubleTensor~2D DoubleTensor~2D
stack traceback:
[C]: in function 'addmm'
/usr/local/share/lua/5.1/nn/Linear.lua:46: in function 'func'
/usr/local/share/lua/5.1/nngraph/gmodule.lua:253: in function 'neteval'
/usr/local/share/lua/5.1/nngraph/gmodule.lua:288: in function 'forward'
sample.lua:151: in main chunk
[C]: in function 'dofile'
/usr/local/lib/luarocks/rocks/trepl/scm-1/bin/th:131: in main chunk
[C]: at 0x00406720
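(Editorial aside: the "invalid arguments" message above is a tensor-type mismatch — Torch's addmm requires every tensor argument to share one type. A minimal Lua/Torch sketch, assuming a working local Torch install, that triggers the same class of error:)

```lua
-- Sketch: mixing DoubleTensor and FloatTensor operands in addmm
-- raises the same "invalid arguments" error seen in the traceback.
require 'torch'

local out = torch.DoubleTensor(2, 2):zero()
local ok, err = pcall(function()
  -- DoubleTensor result with a FloatTensor operand: type mismatch
  out:addmm(torch.FloatTensor(2, 3):zero(), torch.DoubleTensor(3, 2):zero())
end)
print(ok, err)
```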
And the inspect output looks OK:
th inspect_checkpoint.lua cv/lm_lstm_epoch26.00_1.3900.t7 -gpuid -1
opt:
{
max_epochs : 50
seed : 123
batch_size : 50
gpuid : -1
decay_rate : 0.95
learning_rate_decay : 0.97
opencl : 0
model : "lstm"
grad_clip : 5
print_every : 1
data_dir : "data/tinyshakespeare/"
seq_length : 50
num_layers : 2
learning_rate_decay_after : 10
rnn_size : 128
train_frac : 0.95
dropout : 0
init_from : ""
learning_rate : 0.002
eval_val_every : 1000
val_frac : 0.05
savefile : "lstm"
checkpoint_dir : "cv"
}
val losses:
{
3000 : 1.4450460764536
4000 : 1.4213234041304
5000 : 1.4060113392715
6000 : 1.389498488439
8000 : 1.3909428322715
10000 : 1.4003497627469
7000 : 1.3937299336865
9000 : 1.3940925438403
1000 : 1.7136267190726
2000 : 1.5211800115534
11000 : 1.389983844627
}
This is a silly bug I think I introduced only a few days ago, unfortunately. Fixing...
Ok, I think I patched this issue with this commit:
0fb9a77
See if things work properly now with the new sampling script. The issue is that CPU models use doubles, but when I converted GPU models I converted them to float(), and then changed the sampling script to use float(), which broke the previous CPU-only functionality. Sorry about the mess; when I originally designed this code I always used GPUs, and I didn't anticipate the conversion issues, or that training on CPU or converting GPU->CPU would be a common use case.
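(Editorial aside: the fix, in spirit, is to keep the model and the sampling inputs on one tensor type. A minimal sketch of the CPU path in a sampling script — variable names such as `protos`, `prev_char`, and `current_state` are assumed from char-rnn's sample.lua and may differ from the actual commit:)

```lua
-- Sketch, not the actual commit: on the CPU path, cast the model and
-- all inputs to double so nn.Linear's addmm sees consistent
-- DoubleTensor arguments.
if opt.gpuid < 0 then
  for _, proto in pairs(protos) do
    proto:double()                  -- model parameters -> DoubleTensor
  end
  prev_char = prev_char:double()    -- current input character tensor
  for i = 1, #current_state do
    current_state[i] = current_state[i]:double()  -- RNN hidden state
  end
end
```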
Looks like it works. Thank you and may my cat bless you, m'lord!
Fixed it for me, too. Thanks @karpathy!
Great job, @karpathy! Had the same issue, and now it works perfectly.