
Comments (5)

scttl commented on May 18, 2024

Since you appear to be running as root on Ubuntu, can you first make sure that nvidia-smi is in that users PATH, and produces sensible output when run from the command line? It doesn't look like this command is being found.

I'd also suggest having a look at the items in our installation FAQ: http://neon.nervanasys.com/docs/latest/faq.html

from neon.

yuehusile commented on May 18, 2024

Hi scttl, thanks for your reply. You are right that the nvidia-smi command is not found.
I've checked my installation and configuration carefully, and it seems to be a Tegra K1-specific problem.
I'm trying to run neon on the NVIDIA Jetson TK1 devkit. NVML is not supported on the Jetson TK1, so the nvidia-smi command is not available even when the CUDA installation is fine.
Is NVML required to run the neon demos, or is there a way to work around this without NVML?

PS: I can run Caffe with cuDNN on the Jetson TK1 with no problem, so I believe the CUDA installation is fine.


scttl commented on May 18, 2024

nvidia-smi is not required to run any of the examples, we just use it as a proxy to validate that the user has the CUDA SDK installed. Provided you were able to install the cudanet python library ok, for the moment you can work around your issue by editing neon/backends/__init__.py to replace the line:
gpuflag = (os.system("nvidia-smi > /dev/null 2>&1") == 0)
with
gpuflag = (os.system("nvcc --version > /dev/null 2>&1") == 0)
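If you want a single check that works both on NVML-capable machines and on boards like the TK1, a combined fallback could look like this (a sketch of my own, not neon's actual code; it uses the same os.system pattern as the line above):

```python
import os

def cuda_available():
    # Try nvidia-smi first (requires NVML), then fall back to nvcc,
    # which is present whenever the CUDA toolkit is installed, even on
    # NVML-less boards such as the Jetson TK1. Either command exiting
    # with status 0 counts as a working CUDA install.
    for cmd in ("nvidia-smi", "nvcc --version"):
        if os.system(cmd + " > /dev/null 2>&1") == 0:
            return True
    return False
```

On a machine with neither command on PATH this returns False, which matches the original gpuflag behavior.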

We made a similar change in the Makefile a while back, but needed to update the check here as well. I've created a fix, and will get this merged into master for the next release of neon.


yuehusile commented on May 18, 2024

Thanks scttl! Editing neon/backends/__init__.py works, but I still can't run this example because another error appears:
root@tegra-ubuntu:/home/hsl/neon# neon --gpu cudanet examples/convnet/i1k-alexnet-fp32.yaml
WARNING:neon.util.persist:deserializing object from: examples/convnet/i1k-alexnet-fp32.yaml
WARNING:neon.datasets.imageset:Imageset initialized with dtype <type 'numpy.float32'>
2015-07-01 04:52:29,170 WARNING:neon - setting log level to: 20
2015-07-01 04:52:31,733 INFO:init - Cudanet backend, RNG seed: None, numerr: None
2015-07-01 04:52:31,735 INFO:mlp - Layers:
ImageDataLayer d0: 3 x (224 x 224) nodes
ConvLayer conv1: 3 x (224 x 224) inputs, 64 x (55 x 55) nodes, RectLin act_fn
PoolingLayer pool1: 64 x (55 x 55) inputs, 64 x (27 x 27) nodes, Linear act_fn
ConvLayer conv2: 64 x (27 x 27) inputs, 192 x (27 x 27) nodes, RectLin act_fn
PoolingLayer pool2: 192 x (27 x 27) inputs, 192 x (13 x 13) nodes, Linear act_fn
ConvLayer conv3: 192 x (13 x 13) inputs, 384 x (13 x 13) nodes, RectLin act_fn
ConvLayer conv4: 384 x (13 x 13) inputs, 256 x (13 x 13) nodes, RectLin act_fn
ConvLayer conv5: 256 x (13 x 13) inputs, 256 x (13 x 13) nodes, RectLin act_fn
PoolingLayer pool3: 256 x (13 x 13) inputs, 256 x (6 x 6) nodes, Linear act_fn
FCLayer fc4096a: 9216 inputs, 4096 nodes, RectLin act_fn
DropOutLayer dropout1: 4096 inputs, 4096 nodes, Linear act_fn
FCLayer fc4096b: 4096 inputs, 4096 nodes, RectLin act_fn
DropOutLayer dropout2: 4096 inputs, 4096 nodes, Linear act_fn
FCLayer fc1000: 4096 inputs, 1000 nodes, Softmax act_fn
CostLayer cost: 1000 nodes, CrossEntropy cost_fn

2015-07-01 04:52:31,738 INFO:batch_norm - BatchNormalization set to train mode
2015-07-01 04:52:32,228 INFO:val_init - Generating AutoUniformValGen values of shape (363, 64)
2015-07-01 04:52:32,254 INFO:batch_norm - BatchNormalization set to train mode
2015-07-01 04:52:32,340 INFO:val_init - Generating AutoUniformValGen values of shape (1600, 192)
2015-07-01 04:52:32,370 INFO:batch_norm - BatchNormalization set to train mode
2015-07-01 04:52:32,432 INFO:val_init - Generating AutoUniformValGen values of shape (1728, 384)
2015-07-01 04:52:32,506 INFO:batch_norm - BatchNormalization set to train mode
2015-07-01 04:52:32,552 INFO:val_init - Generating AutoUniformValGen values of shape (3456, 256)
2015-07-01 04:52:32,602 INFO:batch_norm - BatchNormalization set to train mode
2015-07-01 04:52:32,639 INFO:val_init - Generating AutoUniformValGen values of shape (2304, 256)
2015-07-01 04:52:32,691 INFO:batch_norm - BatchNormalization set to train mode
2015-07-01 04:52:32,702 INFO:val_init - Generating AutoUniformValGen values of shape (4096, 9216)
2015-07-01 04:52:34,805 INFO:batch_norm - BatchNormalization set to train mode
2015-07-01 04:52:34,813 INFO:val_init - Generating AutoUniformValGen values of shape (4096, 4096)
2015-07-01 04:52:35,728 INFO:val_init - Generating AutoUniformValGen values of shape (1000, 4096)
Traceback (most recent call last):
File "/usr/local/bin/neon", line 240, in <module>
experiment, result, status = main()
File "/usr/local/bin/neon", line 207, in main
experiment.initialize(backend)
File "/usr/local/lib/python2.7/dist-packages/neon/experiments/fit_predict_err.py", line 62, in initialize
super(FitPredictErrorExperiment, self).initialize(backend)
File "/usr/local/lib/python2.7/dist-packages/neon/experiments/fit.py", line 62, in initialize
self.model.initialize(backend)
File "/usr/local/lib/python2.7/dist-packages/neon/models/mlp.py", line 68, in initialize
dtype=self.layers[1].deltas_dtype)
File "/usr/local/lib/python2.7/dist-packages/neon/backends/cc2.py", line 536, in zeros
dtype=dtype)),
MemoryError

Is memory size the problem? The Tegra K1 has 2 GB of memory. Or is something else causing this? Any advice on how to find out what happened?


apark263 commented on May 18, 2024

Try reducing your batch size to 32 and see if the problem persists. If it runs, then you probably don't have enough memory to train at mb=128.
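For a rough sense of why mb=128 is tight on 2 GB, here is a back-of-envelope estimate (my own arithmetic, using the per-example layer shapes from the log above; it counts only fp32 forward activations and ignores weights, deltas, and convolution workspace, so real usage is several times higher):

```python
# Per-example activation sizes (values, not bytes) taken from the
# layer shapes in the log above: channels x (H x W), or node count
# for the fully connected layers.
layers = {
    "d0":    3 * 224 * 224,
    "conv1": 64 * 55 * 55,
    "pool1": 64 * 27 * 27,
    "conv2": 192 * 27 * 27,
    "pool2": 192 * 13 * 13,
    "conv3": 384 * 13 * 13,
    "conv4": 256 * 13 * 13,
    "conv5": 256 * 13 * 13,
    "pool3": 256 * 6 * 6,
    "fc4096a": 4096, "fc4096b": 4096, "fc1000": 1000,
}
for mb in (128, 32):
    total = sum(layers.values()) * mb * 4  # fp32 = 4 bytes per value
    print("mb=%d: ~%.0f MB of fp32 activations" % (mb, total / 2.0 ** 20))
```

Even this lower bound is roughly 360 MB at mb=128, before any weights or backprop buffers, on a board whose 2 GB is shared with the OS; at mb=32 it drops to about a quarter of that.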

Is there any particular reason you are using this system to train? You
would get much better performance by using a more standard graphics card.


