Just trying out the default data set:

sample.lua fails to run: error in function addmm() about char-rnn HOT 15 OPEN

karpathy commented on June 23, 2024

sample.lua fails to run: error in function addmm()

from char-rnn.

Comments (15)

swisspol commented on June 23, 2024

This other tool works fine however:

$ th inspect_checkpoint.lua cv/lm_lstm_epoch18.96_1.4228.t7 
using CUDA on GPU 0...  
opt:    
{
  max_epochs : 30
  seed : 123
  batch_size : 100
  gpuid : 0
  decay_rate : 0.95
  savefile : "lstm"
  model : "lstm"
  grad_clip : 5
  print_every : 1
  data_dir : "data/tinyshakespeare"
  seq_length : 50
  num_layers : 2
  rnn_size : 100
  train_frac : 0.95
  learning_rate : 0.002
  dropout : 0
  eval_val_every : 1000
  val_frac : 0.05
  checkpoint_dir : "cv"
}
val losses: 
{
  2000 : 1.5233611306277
  3000 : 1.4519253438169
  4000 : 1.4227915313027
  1000 : 1.7233323420178
}

karpathy commented on June 23, 2024

Did you train one with GPU and the other with CPU? Check the "gpuid" flag. Is it "0" on both models?

swisspol commented on June 23, 2024

Yes, GPU on both. What's interesting is that checkpoints created later in the process do work, as does the very last one.

antonmil commented on June 23, 2024

@swisspol, I'm running into the same issue inside an iTorch notebook environment, but it works fine from the standard command line. I'm very new to Lua / Torch, but it would be good to figure out what's causing this.

PaulSchnau commented on June 23, 2024

I had this error on OS X 10.10. Using -opencl 1 with both train.lua and sample.lua made it work.

svickers commented on June 23, 2024

Had this problem on a c2.2xl instance. Tried -opencl 1 but no luck.

hughperkins commented on June 23, 2024

@svickers can you post the fourth line of your output, the one that has 'Linear.lua:46' in it? Edit: and also the output of inspect_checkpoint.

svickers commented on June 23, 2024

@hughperkins Sorry Hugh! I killed that vm and went to g2.2xl and everything worked straight away.

hughperkins commented on June 23, 2024

@svickers oh, nice! Hmmm, c2s don't actually have a GPU, right? g2 sounds like a GPU instance?

quematech commented on June 23, 2024

Hello.

I ran into the same problem and am dealing with a lot of frustration in my holy quest to run tests with a Monty Python Flying Circus corpus :). The error also appears with the tiny shakespeare data set.

I run CPU-only (no GPU). Training goes well, with no NaN values:

th train.lua -data_dir data/tinyshakespeare/ -gpuid -1

loading data files...
cutting off end of data so that the batches/sequences divide evenly
reshaping tensor...
data load done. Number of data batches in train: 423, val: 23, test: 0
vocab size: 65
creating an lstm with 2 layers
number of parameters in the model: 240321
cloning rnn
cloning criterion
1/21150 (epoch 0.002), train_loss = 4.19766416, grad/param norm = 4.5006e-01, time/batch = 0.34s
2/21150 (epoch 0.005), train_loss = 4.10134056, grad/param norm = 6.3375e-01, time/batch = 0.28s
3/21150 (epoch 0.007), train_loss = 3.44502399, grad/param norm = 9.4798e-01, time/batch = 0.28s

Sampling raises an error whichever checkpoint file is used:

th sample.lua cv/lm_lstm_epoch26.00_1.3900.t7 -gpuid -1
creating an lstm...
missing seed text, using uniform probability over first character
--------------------------
/usr/local/bin/luajit: /usr/local/share/lua/5.1/nn/Linear.lua:46: invalid arguments: DoubleTensor number DoubleTensor number FloatTensor DoubleTensor
expected arguments: *DoubleTensor~2D* [DoubleTensor~2D] [double] DoubleTensor~2D DoubleTensor~2D | *DoubleTensor~2D* double [DoubleTensor~2D] double DoubleTensor~2D DoubleTensor~2D
stack traceback:
        [C]: in function 'addmm'
        /usr/local/share/lua/5.1/nn/Linear.lua:46: in function 'func'
        /usr/local/share/lua/5.1/nngraph/gmodule.lua:253: in function 'neteval'
        /usr/local/share/lua/5.1/nngraph/gmodule.lua:288: in function 'forward'
        sample.lua:151: in main chunk
        [C]: in function 'dofile'
        /usr/local/lib/luarocks/rocks/trepl/scm-1/bin/th:131: in main chunk
        [C]: at 0x00406720

And the inspect_checkpoint.lua output looks OK:

th inspect_checkpoint.lua cv/lm_lstm_epoch26.00_1.3900.t7 -gpuid -1
opt:
{
  max_epochs : 50
  seed : 123
  batch_size : 50
  gpuid : -1
  decay_rate : 0.95
  learning_rate_decay : 0.97
  opencl : 0
  model : "lstm"
  grad_clip : 5
  print_every : 1
  data_dir : "data/tinyshakespeare/"
  seq_length : 50
  num_layers : 2
  learning_rate_decay_after : 10
  rnn_size : 128
  train_frac : 0.95
  dropout : 0
  init_from : ""
  learning_rate : 0.002
  eval_val_every : 1000
  val_frac : 0.05
  savefile : "lstm"
  checkpoint_dir : "cv"
}
val losses:
{
  3000 : 1.4450460764536
  4000 : 1.4213234041304
  5000 : 1.4060113392715
  6000 : 1.389498488439
  8000 : 1.3909428322715
  10000 : 1.4003497627469
  7000 : 1.3937299336865
  9000 : 1.3940925438403
  1000 : 1.7136267190726
  2000 : 1.5211800115534
  11000 : 1.389983844627
}

karpathy commented on June 23, 2024

This is a silly bug that I think I introduced only a few days ago, unfortunately. Fixing...

karpathy commented on June 23, 2024

Ok I think I patched this issue with this commit:
0fb9a77

See if things work properly now with the new sampling script. The issue is that CPU models use doubles, but when I was converting GPU models I converted them to float() and then changed the sampling script to use float(), which broke the previous CPU-only functionality. Sorry about the mess. When I was originally designing this code I always used GPUs, and I didn't anticipate the conversion issues, or that training on CPU or converting GPU->CPU would be a common use case.
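
For readers hitting the same trace, the mismatch described above can be reproduced outside char-rnn. What follows is a minimal sketch rather than the code from the commit: it assumes a working Torch7 install with the nn package, and the layer sizes and variable names are made up for illustration. It builds a CPU layer (double weights by default), feeds it a float input the way a float-based sampling script would, and then converts the input to the weights' type.

```lua
require 'torch'
require 'nn'

-- A freshly built CPU layer uses Torch's default tensor type (DoubleTensor),
-- like a model trained with -gpuid -1. Sizes 4 and 3 are arbitrary.
local layer = nn.Linear(4, 3)
print(torch.type(layer.weight))

-- A float input stands in for state created with :float() in a sampling script.
local x_float = torch.zeros(1, 4):float()

-- forward() should fail inside addmm because input and weights differ in type;
-- pcall captures the error instead of aborting the script.
local ok, err = pcall(function() return layer:forward(x_float) end)
print(ok, err)

-- One way out: convert the input to whatever type the weights already have.
local x = x_float:type(layer.weight:type())
local y = layer:forward(x)
print(torch.type(y))
```

Converting the model instead (for example layer:float()) should work just as well. What matters is that the weights and every tensor passed to forward() share one type.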

quematech commented on June 23, 2024

Looks like it works. Thank you and may my cat bless you, m'lord!

nielmclaren commented on June 23, 2024

Fixed it for me, too. Thanks @karpathy!

commented on June 23, 2024

Great job @karpathy! Had the same issue and now it works perfectly.
