Comments (15)

swisspol commented on June 23, 2024

This other tool works fine, however:

$ th inspect_checkpoint.lua cv/lm_lstm_epoch18.96_1.4228.t7 
using CUDA on GPU 0...  
opt:    
{
  max_epochs : 30
  seed : 123
  batch_size : 100
  gpuid : 0
  decay_rate : 0.95
  savefile : "lstm"
  model : "lstm"
  grad_clip : 5
  print_every : 1
  data_dir : "data/tinyshakespeare"
  seq_length : 50
  num_layers : 2
  rnn_size : 100
  train_frac : 0.95
  learning_rate : 0.002
  dropout : 0
  eval_val_every : 1000
  val_frac : 0.05
  checkpoint_dir : "cv"
}
val losses: 
{
  2000 : 1.5233611306277
  3000 : 1.4519253438169
  4000 : 1.4227915313027
  1000 : 1.7233323420178
}

from char-rnn.

karpathy commented on June 23, 2024

Did you train one with GPU and the other with CPU? Check the "gpuid" flag. Is it "0" on both models?

swisspol commented on June 23, 2024

Yes, GPU on both. What's interesting is that checkpoints created later in the training process do work, as does the very last one.

antonmil commented on June 23, 2024

@swisspol, I'm running into the same issue inside an iTorch notebook environment, but it works fine from the standard command line. I'm very new to Lua / Torch, but it would be good to figure out what's causing this.

PaulSchnau commented on June 23, 2024

I had this error on OSX 10.10. Using -opencl 1 on both train.lua and sample.lua made it work.

svickers commented on June 23, 2024

Had this problem on a c2.2xl instance. Tried -opencl 1 but no luck.

hughperkins commented on June 23, 2024

@svickers can you post the fourth line of your output, the one that has 'Linear.lua:46' in it? Edit: and also the results of inspect_checkpoint.

svickers commented on June 23, 2024

@hughperkins Sorry Hugh! I killed that VM and went to g2.2xl and everything worked straight away.

hughperkins commented on June 23, 2024

@svickers oh, nice! Hmmm, c2s don't actually have a GPU, right? g2 sounds like a GPU instance?

quematech commented on June 23, 2024

Hello.

I ran into the same problem, and am dealing with a lot of frustration in my holy quest of running tests with a Monty Python Flying Circus corpus :). The error also appears with the tiny shakespeare data set.

I'm running CPU-only (no GPU). Training goes well, with no NaN values:

th train.lua -data_dir data/tinyshakespeare/ -gpuid -1

loading data files...
cutting off end of data so that the batches/sequences divide evenly
reshaping tensor...
data load done. Number of data batches in train: 423, val: 23, test: 0
vocab size: 65
creating an lstm with 2 layers
number of parameters in the model: 240321
cloning rnn
cloning criterion
1/21150 (epoch 0.002), train_loss = 4.19766416, grad/param norm = 4.5006e-01, time/batch = 0.34s
2/21150 (epoch 0.005), train_loss = 4.10134056, grad/param norm = 6.3375e-01, time/batch = 0.28s
3/21150 (epoch 0.007), train_loss = 3.44502399, grad/param norm = 9.4798e-01, time/batch = 0.28s

Sampling raises an error whichever checkpoint file is used:

th sample.lua cv/lm_lstm_epoch26.00_1.3900.t7 -gpuid -1
creating an lstm...
missing seed text, using uniform probability over first character
--------------------------
/usr/local/bin/luajit: /usr/local/share/lua/5.1/nn/Linear.lua:46: invalid arguments: DoubleTensor number DoubleTensor number FloatTensor DoubleTensor
expected arguments: *DoubleTensor~2D* [DoubleTensor~2D] [double] DoubleTensor~2D DoubleTensor~2D | *DoubleTensor~2D* double [DoubleTensor~2D] double DoubleTensor~2D DoubleTensor~2D
stack traceback:
        [C]: in function 'addmm'
        /usr/local/share/lua/5.1/nn/Linear.lua:46: in function 'func'
        /usr/local/share/lua/5.1/nngraph/gmodule.lua:253: in function 'neteval'
        /usr/local/share/lua/5.1/nngraph/gmodule.lua:288: in function 'forward'
        sample.lua:151: in main chunk
        [C]: in function 'dofile'
        /usr/local/lib/luarocks/rocks/trepl/scm-1/bin/th:131: in main chunk
        [C]: at 0x00406720

And the inspect_checkpoint output looks OK:

th inspect_checkpoint.lua cv/lm_lstm_epoch26.00_1.3900.t7 -gpuid -1
opt:
{
  max_epochs : 50
  seed : 123
  batch_size : 50
  gpuid : -1
  decay_rate : 0.95
  learning_rate_decay : 0.97
  opencl : 0
  model : "lstm"
  grad_clip : 5
  print_every : 1
  data_dir : "data/tinyshakespeare/"
  seq_length : 50
  num_layers : 2
  learning_rate_decay_after : 10
  rnn_size : 128
  train_frac : 0.95
  dropout : 0
  init_from : ""
  learning_rate : 0.002
  eval_val_every : 1000
  val_frac : 0.05
  savefile : "lstm"
  checkpoint_dir : "cv"
}
val losses:
{
  3000 : 1.4450460764536
  4000 : 1.4213234041304
  5000 : 1.4060113392715
  6000 : 1.389498488439
  8000 : 1.3909428322715
  10000 : 1.4003497627469
  7000 : 1.3937299336865
  9000 : 1.3940925438403
  1000 : 1.7136267190726
  2000 : 1.5211800115534
  11000 : 1.389983844627
}

karpathy commented on June 23, 2024

This is a silly bug that I think I introduced only a few days ago, unfortunately. Fixing...

karpathy commented on June 23, 2024

OK, I think I patched this issue with this commit:
0fb9a77

See if things work properly now with the new sampling script. The issue is that CPU models use doubles, but when I was converting GPU models I converted them to float() and then changed the sampling script to use float(), which broke the previous CPU-only functionality. Sorry about the mess. When I was originally designing this code I always used GPUs, and I didn't anticipate the conversion issues or that training on CPU or converting GPU->CPU would be a common use case.
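
A minimal sketch of the type mismatch described above, assuming a Torch7 install with the nn package and the default tensor type left at double. The layer sizes and variable names are made up for illustration, and this is not the code from the commit; it only shows why a float input fails against double weights and how casting the input to the parameter type avoids it.

```lua
-- Sketch only: assumes Torch7 with the 'nn' package installed and the
-- default tensor type unchanged (torch.DoubleTensor).
require 'torch'
require 'nn'

-- Stand-in for a CPU-trained layer: its parameters are doubles by default.
local layer = nn.Linear(4, 3)
print(torch.type(layer.weight))

-- A float input (as a sampling script hard-coded to float() would build).
local float_input = torch.zeros(1, 4):float()

-- Forwarding a FloatTensor through double weights is expected to fail inside
-- addmm, as in the traceback earlier in this thread; pcall captures the error.
local ok, err = pcall(function() return layer:forward(float_input) end)
print(ok, err)

-- Casting the input to whatever type the parameters use avoids the mismatch,
-- regardless of whether the model holds floats or doubles.
local matched_input = float_input:type(torch.type(layer.weight))
local output = layer:forward(matched_input)
print(torch.type(output))
```

Matching the input to the parameter type, rather than hard-coding float() or double(), is one way to keep a single script working for both CPU-trained and GPU-converted checkpoints.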

quematech commented on June 23, 2024

Looks like it works. Thank you and may my cat bless you, m'lord!

nielmclaren commented on June 23, 2024

Fixed it for me, too. Thanks @karpathy!

 avatar commented on June 23, 2024

Great job @karpathy! Had the same issue and now it works perfectly.
