
cifar.torch's Introduction

cifar.torch

Newer version of this code is included in https://github.com/szagoruyko/wide-residual-networks

The code achieves 92.45% accuracy on CIFAR-10 just with horizontal reflections.

Corresponding blog post: http://torch.ch/blog/2015/07/30/cifar.html

Accuracies:

| Model | No flips | Flips |
| --- | --- | --- |
| VGG+BN+Dropout | 91.3% | 92.45% |
| NIN+BN+Dropout | 90.4% | 91.9% |

It would be nice to add other architectures; PRs are welcome!

Data preprocessing:

OMP_NUM_THREADS=2 th -i provider.lua
provider = Provider()
provider:normalize()
torch.save('provider.t7',provider)

Takes about 30 seconds and saves a ~1400 MB file.
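The training script then reads this cache back in; a minimal sketch of the loading step (assuming the provider.t7 name used above):

provider = torch.load('provider.t7')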

Training:

CUDA_VISIBLE_DEVICES=0 th train.lua --model vgg_bn_drop -s logs/vgg

cifar.torch's People

Contributors: fmassa, fredowski, szagoruyko


cifar.torch's Issues

"not enough memory: you tried to allocate 0GB" error is being observed on executin train.lua and after running 390/390 tests

Hi Sergey,

"not enough memory: you tried to allocate 0GB" error is being observed on executin train.lua and after running 390/390 tests. Does incresing the RAM really help? My current RAM size is 8GB.

Train accuracy: 10.15 % time: 1388.22 s
==> doing some iterations on train data for batch normalization
/root/torch/install/bin/lua: $ Torch: not enough memory: you tried to allocate 0GB. Buy new RAM! at /root/torch/pkg/torch/lib/TH/THGeneral.c:210
stack traceback:
[C]: ?
[C]: in function 'Tensor'
/root/torch/install/share/lua/5.1/cltorch/Random.lua:8: in function 'bernoulli'
/root/torch/install/share/lua/5.1/nn/Dropout.lua:19: in function 'updateOutput'
/root/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'updateOutput'
/root/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'updateOutput'
/root/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function </root/torch/install/share/lua/5.1/nn/Sequential.lua:41>
(tail call): ?
train.lua:131: in function 'test'
train.lua:191: in main chunk
[C]: in function 'dofile'
.../torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:131: in main chunk
[C]: ?

Thanks

Out of Memory Issues when Training

I'm following the steps described in this blog entry in order to run the CIFAR classification. Preprocessing with provider.lua works fine, but training doesn't seem to work due to memory problems.

When running the regular command line CUDA_VISIBLE_DEVICES=0 th train.lua, I'm getting the following output:

{
  learningRate : 1
  momentum : 0.9
  epoch_step : 25
  learningRateDecay : 1e-07
  batchSize : 128
  model : "vgg_bn_drop"
  save : "logs"
  weightDecay : 0.0005
  backend : "nn"
  max_epoch : 300
}
==> configuring model   
THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-9668/cutorch/lib/THC/generic/THCStorage.cu line=41 error=2 : out of memory
/Users/artcfa/torch/install/bin/luajit: /Users/artcfa/torch/install/share/lua/5.1/nn/utils.lua:11: cuda runtime error (2) : out of memory at /tmp/luarocks_cutorch-scm-1-9668/cutorch/lib/THC/generic/THCStorage.cu:41
stack traceback:
    [C]: in function 'resize'
    /Users/artcfa/torch/install/share/lua/5.1/nn/utils.lua:11: in function 'torch_Storage_type'
    /Users/artcfa/torch/install/share/lua/5.1/nn/utils.lua:57: in function 'recursiveType'
    /Users/artcfa/torch/install/share/lua/5.1/nn/Module.lua:126: in function 'type'
    /Users/artcfa/torch/install/share/lua/5.1/nn/utils.lua:45: in function 'recursiveType'
    /Users/artcfa/torch/install/share/lua/5.1/nn/utils.lua:41: in function 'recursiveType'
    /Users/artcfa/torch/install/share/lua/5.1/nn/Module.lua:126: in function 'cuda'
    train.lua:47: in main chunk
    [C]: in function 'dofile'
    ...edja/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x0107712bd0

So apparently CUDA is reporting out of memory. I've compiled the NVIDIA CUDA samples and ran the deviceQuery sample in order to get some stats on this:

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GT 650M"
  CUDA Driver Version / Runtime Version          7.5 / 7.5
  CUDA Capability Major/Minor version number:    3.0
  Total amount of global memory:                 1024 MBytes (1073414144 bytes)
  ( 2) Multiprocessors, (192) CUDA Cores/MP:     384 CUDA Cores
  GPU Max Clock rate:                            900 MHz (0.90 GHz)
  Memory Clock rate:                             2508 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 262144 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.5, NumDevs = 1, Device0 = GeForce GT 650M
Result = PASS

So I thought maybe 1 GB of total available CUDA memory simply wasn't enough. I modified the sample code so it could run on the CPU without using CUDA (which you can find here) and now it starts up and begins training.

However, after around 9/390 training batches, it crashes with the message luajit: not enough memory, so that doesn't appear to work either.

Am I doing something wrong? What can I do to run this?
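(Two hedged observations, not from the maintainer: the options table printed above includes batchSize, so if train.lua exposes it as a command-line flag, reducing it, e.g. to 32, should cut GPU memory use considerably; 1 GB of GPU memory is tight for VGG+BN at batch size 128. The later luajit not enough memory crash on the CPU is likely LuaJIT's own garbage-collected heap limit, roughly 1-2 GB on x64 regardless of installed RAM, rather than an actual shortage of system memory.)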

module 'cunn' not found:No LuaRocks module found for cunn

Hi Szagoruyko,

I have an AMD GPU (Tonga GL) and followed all the steps in the Readme file. I am at the last step, "CUDA_VISIBLE_DEVICES=0 th train.lua --model vgg_bn_drop -s logs/vgg", and I get the error below when I run "th train.lua". Please help.

th train.lua
/root/torch/install/bin/lua: /root/torch/install/share/lua/5.1/trepl/init.lua:363: module 'cunn' not found:No LuaRocks module found for cunn
no field package.preload['cunn']
no file '/root/.luarocks/share/lua/5.1/cunn.lua'
no file '/root/.luarocks/share/lua/5.1/cunn/init.lua'
no file '/root/torch/install/share/lua/5.1/cunn.lua'
no file '/root/torch/install/share/lua/5.1/cunn/init.lua'
no file './cunn.lua'
no file '/root/torch/install/lib/lua/5.1/cunn.lua'
no file '/root/torch/install/lib/lua/5.1/cunn/init.lua'
no file '/root/.luarocks/lib/lua/5.1/cunn.so'
no file '/root/torch/install/lib/lua/5.1/cunn.so'
no file './cunn.so'
no file '/root/torch/install/lib/lua/5.1/loadall.so'
stack traceback:
[C]: in function 'error'
/root/torch/install/share/lua/5.1/trepl/init.lua:363: in function 'require'
train.lua:3: in main chunk
[C]: in function 'dofile'
.../torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:131: in main chunk
[C]: ?

Thanks.
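A hedged note for readers hitting the same error: cunn is Torch's CUDA backend and only runs on NVIDIA GPUs, so on an AMD card it cannot be installed; the OpenCL ports cltorch/clnn would be the closest substitute, and train.lua would need its require 'cunn' (train.lua:3 in the trace above) adapted accordingly. On an NVIDIA machine the missing module itself is installed with:

luarocks install cunn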

Out of memory running provider.lua


$ luajit -l provider -e 'p = Provider()'
luajit: ./provider.lua:35: $ Torch: not enough memory: you tried to allocate 1GB. Buy new RAM! at /home/user/git/torch7/lib/TH/THGeneral.c:222
stack traceback:
    [C]: in function 'reshape'
    ./provider.lua:35: in function '__init'
    /home/user/torch/install/share/lua/5.1/torch/init.lua:64: in function </home/user/torch/install/share/lua/5.1/torch/init.lua:60>
    [C]: in function 'Provider'
    (command line):1: in main chunk
    [C]: at 0x00406670

I can fix it by changing lines:

  trainData.data = trainData.data:reshape(trsize,3,32,32)
  testData.data = testData.data:reshape(tesize,3,32,32)

to:

  trainData.data = trainData.data:view(trsize,3,32,32)
  testData.data = testData.data:view(tesize,3,32,32)

... since this won't cause reallocation. I haven't tested this yet, beyond confirming that the Provider instantiates OK on a 1GB mobile card.
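A quick illustration of why this helps (a sketch, not from the repo): view() returns a tensor that shares storage with the original, whereas reshape() may allocate a fresh copy, which is what blows past the memory budget here.

require 'torch'

local t = torch.FloatTensor(4, 6):zero()
local v = t:view(2, 12)   -- shares storage with t; no new allocation
v[1][1] = 42
print(t[1][1])            -- prints 42: both tensors see the same data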

Hyperparameters

What hyperparameters do you use for training the best-performing VGG+BN+dropout model?
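(For reference, the defaults printed by train.lua, visible in the options table pasted in the out-of-memory issue above, are learningRate 1, momentum 0.9, weightDecay 0.0005, batchSize 128, epoch_step 25, learningRateDecay 1e-07, and max_epoch 300; whether the reported 92.45% run deviated from these is not stated here.)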

Training a single image with the network

I have tried out your model and was able to reproduce the results (I tried only the CPU version; it works as you mentioned), but I can't figure out how to use it to classify a single image.

What I know so far is that I have to forward a Tensor of size [2,3,32,32] through the model; a Tensor of size [1,3,32,32] won't work because of BatchNormalization. This produces results, but incorrect ones.

I have an image, let's say "car.jpg", with dimensions [3,32,32]. How do I forward this image through the model? Do I have to preprocess this test image?

What I did was,

car = image.load("car.jpg")  --note that car has the dimension [3,32,32]
dataset = torch.Tensor(2,3,32,32)
dataset[1] = car
dataset[2] = car

model = torch.load("model.net")
result = model:forward(dataset)

But this produces incorrect results. So what will I need to do to correctly classify a single image?
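A hedged sketch of the likely fix (not an official answer): two things matter here. The model must be switched to evaluation mode, so Dropout is disabled and BatchNormalization uses its running statistics (which also makes a batch of size 1 legal), and the test image must get the same preprocessing that Provider:normalize() applied to the training data.

require 'image'
require 'nn'

local car = image.load('car.jpg', 3, 'float')   -- [3,32,32], values in [0,1]
-- ...apply the same preprocessing Provider:normalize() used at training time...

local model = torch.load('model.net')
model:evaluate()   -- test mode: Dropout off, BatchNormalization uses running stats
-- (cast model and input to a common type if they disagree, e.g. model:float())

local output = model:forward(car:view(1, 3, 32, 32))
local _, class = output:max(2)   -- index of the highest-scoring class
print(class[1][1])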

CPU version

This looks good. How do you run this on CPU? Is it hardcoded to run only on (NVIDIA) GPUs? When I tried removing the references to CUDA and running on the CPU, it gave me this error:

/home/sanoob/torch/install/bin/luajit: ...ob/torch/install/share/lua/5.1/nn/SpatialConvolution.lua:100: bad argument #1 (field finput is not a torch.FloatTensor)
stack traceback:
[C]: in function 'SpatialConvolutionMM_updateOutput'
...ob/torch/install/share/lua/5.1/nn/SpatialConvolution.lua:100: in function 'updateOutput'
/home/sanoob/torch/install/share/lua/5.1/nn/Sequential.lua:39: in function 'updateOutput'
/home/sanoob/torch/install/share/lua/5.1/nn/Sequential.lua:39: in function 'updateOutput'
/home/sanoob/torch/install/share/lua/5.1/nn/Sequential.lua:39: in function 'forward'
train-cpu.lua:106: in function 'opfunc'
/home/sanoob/torch/install/share/lua/5.1/optim/sgd.lua:43: in function 'sgd'
train-cpu.lua:115: in function 'train'
train-cpu.lua:190: in main chunk

at line

  local outputs = model:forward(inputs)

Should we declare inputs as FloatTensor too?
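Most likely yes; a hedged sketch of the usual fix (not from the repo). The "field finput is not a torch.FloatTensor" error generally means the module's internal buffers and the input tensor disagree in type; casting both model and inputs to the same type clears it, since :float() converts a module's parameters and buffers (finput included) recursively.

-- cast model parameters and internal buffers (including finput) to float
model = model:float()

-- make sure the data fed to the model is float as well
inputs = inputs:float()
local outputs = model:forward(inputs)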

Removing BatchNormalization

Hi,

I am using the nn version, but with BatchNormalization and ceil() removed, and I am getting this error:

/home/vrx/torch/install/bin/luajit: ...vrx/torch/install/share/lua/5.1/nn/SpatialMaxPooling.lua:18: aborting at /home/vrx/torch/extra/cunn/SpatialMaxPooling.cu:261
stack traceback:
[C]: in function 'SpatialMaxPooling_updateOutput'
...wig/torch/install/share/lua/5.1/nn/SpatialMaxPooling.lua:18: in function 'updateOutput'
/home/vrx/torch/install/share/lua/5.1/nn/Sequential.lua:39: in function 'updateOutput'
/home/vrx/torch/install/share/lua/5.1/nn/Sequential.lua:39: in function 'updateOutput'
/home/vrx/torch/install/share/lua/5.1/nn/Sequential.lua:39: in function 'forward'
train_rgb.lua:107: in function 'opfunc'
/home/vrx/torch/install/share/lua/5.1/optim/sgd.lua:43: in function 'sgd'
train_rgb.lua:116: in function 'train'
train_rgb.lua:167: in main chunk
[C]: in function 'dofile'
...vrx/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:131: in main chunk
[C]: at 0x00406670
error in SpatialMaxPooling.updateOutput: invalid argument

If I change the batch size from 128 to 64 it works, but the training accuracy does not improve (values between 9.77% and 10.14%) and the test accuracy stays exactly at 10%.

With BatchNormalization I get 67% test accuracy after 9 epochs (using RGB images preprocessed only by subtracting per-pixel means and dividing by stds).
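(A hedged aside: the default learningRate of 1 shown earlier is the kind of rate that is usually only trainable with BatchNormalization in place; accuracy stuck at the 10% chance level without BN suggests divergence, and a much smaller learning rate would be the first thing to try.)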

module 'cunn' not found

Hi,
I get this error:

qlua: EX2_MNIST_Convnet_CPU.lua:2: module 'cunn' not found:
no field package.preload['cunn']
no file '/home/daniel/.luarocks/share/lua/5.1/cunn.lua'
no file '/home/daniel/.luarocks/share/lua/5.1/cunn/init.lua'
no file '/home/daniel/torch/install/share/lua/5.1/cunn.lua'
no file '/home/daniel/torch/install/share/lua/5.1/cunn/init.lua'
no file './cunn.lua'
no file '/home/daniel/torch/install/share/luajit-2.1.0-beta1/cunn.lua'
no file '/usr/local/share/lua/5.1/cunn.lua'
no file '/usr/local/share/lua/5.1/cunn/init.lua'
no file '/home/daniel/torch/install/lib/cunn.so'
no file '/home/daniel/.luarocks/lib/lua/5.1/cunn.so'
no file '/home/daniel/torch/install/lib/lua/5.1/cunn.so'
no file './cunn.so'
no file '/usr/local/lib/lua/5.1/cunn.so'
no file '/usr/local/lib/lua/5.1/loadall.so'
stack traceback:
[C]: at 0x7f5ced22ff50
[C]: in function 'require'
EX2_MNIST_Convnet_CPU.lua:2: in main chunk

convert.im6: WriteBlob Failed @ error/png.c/MagickPNGErrorHandler/1728.

Hey Sergey,

I recently upgraded my CUDA (7.5) and driver (352.39), along with other system and torch updates.
Since then, however, the ImageMagick toolbox fails with this error when it tries to convert the eps file to png:

convert.im6: WriteBlob Failed `/home/praveer/cifar/logs/test.png' @ error/png.c/MagickPNGErrorHandler/1728.
/home/praveer/cifar/logs/test.png: No such file or directory

I have tried running convert from the terminal and it works fine.
I have also tried setting chmod 777 on the directory, but it has no effect.
ImageMagick is upgraded to its latest version.

Cheers !!!
Praveer

pre-trained model

Hi

Is there a pre-trained model available for download?
It would be best if the model were first converted to CPU.

thanks

CPU/CUNN version of NIN is not supported

Hi Sergey,

The CPU (nn) and cunn versions of NIN do not seem to work. The native SpatialAveragePooling layer does not support the ceil() method yet; simply removing :ceil() yields 7x7 instead of 8x8 planes after the first average-pooling layer, which breaks the subsequent layers.

Is there any quick way to fix this without changing the network design? Using SpatialAveragePooling(2,2,2,2) makes the code run, but it differs from the original network; I am running an experiment to see its performance.
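For context, the 7x7-vs-8x8 discrepancy follows directly from the pooling output-size formula; a sketch assuming kernel 3 and stride 2 on a 16x16 plane (which matches the numbers reported above):

-- output size of a pooling layer with kernel k and stride d over input n:
--   floor mode: floor((n - k) / d) + 1
--   ceil  mode: ceil((n - k) / d) + 1
local n, k, d = 16, 3, 2
print(math.floor((n - k) / d) + 1)   -- 7 (what you get without :ceil())
print(math.ceil((n - k) / d) + 1)    -- 8 (what the original network expects)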

Error in convert with cudnn

I've been studying this code with cunn, but when I enable cudnn I get this error. Is this maybe related to the version of CUDA (7.5) I have installed? Thanks for your help.

/home/ubuntu/torch/install/bin/luajit: /home/ubuntu/torch/install/share/lua/5.1/cudnn/convert.lua:59: attempt to call method 'replace' (a nil value)
stack traceback:
    /home/ubuntu/torch/install/share/lua/5.1/cudnn/convert.lua:59: in function 'convert'
    train.lua:69: in main chunk
    [C]: in function 'dofile'
    ...untu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x00406670
ubuntu@ip-10-1-5-21:~/cifar.torch$ 
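(A hedged guess: the trace shows cudnn's convert.lua calling a replace method, which was added to nn.Module relatively late, so "attempt to call method 'replace' (a nil value)" usually points at an outdated nn rather than at the CUDA version; updating the rocks, e.g. luarocks install nn and luarocks install cudnn, would be the first thing to try.)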

CUDA_VISIBLE_DEVICES=0 th train.lua --model vgg_bn_drop -s logs/vgg

/home/tushar/torch/install/bin/luajit: /home/tushar/torch/install/share/lua/5.1/torch/File.lua:375: unknown object
stack traceback:
[C]: in function 'error'
/home/tushar/torch/install/share/lua/5.1/torch/File.lua:375: in function 'readObject'
/home/tushar/torch/install/share/lua/5.1/torch/File.lua:409: in function 'load'
train.lua:74: in main chunk
[C]: in function 'dofile'
...shar/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00406670

Multi-GPU training

Hi,
First of all, thanks for sharing.
I learned a lot from the code. Right now I'm turning my attention to multi-GPU training for neural nets. It is very expensive to run the ImageNet multi-GPU code. Could you please add a feature that would allow us to run this code on multiple GPUs? Thanks!

Best,

Jun
