
Comments (5)

scttl commented on May 18, 2024

Since you appear to be running as root on Ubuntu, can you first make sure that nvidia-smi is in that users PATH, and produces sensible output when run from the command line? It doesn't look like this command is being found.

I'd also suggest having a look at the items in our installation FAQ: http://neon.nervanasys.com/docs/latest/faq.html

from neon.

yuehusile commented on May 18, 2024

Hi scttl, thanks for your reply. You are right that the nvidia-smi command is not found.
I've checked my installation and configuration carefully, and it seems to be a Tegra K1-specific problem.
I'm trying to run neon on the NVIDIA Jetson TK1 devkit. NVML is not supported on the Jetson TK1, so the nvidia-smi command is not available even when the CUDA installation is fine.
Is NVML required to run the neon demos, or is there a way to work around this without NVML?

PS: I can run Caffe with cuDNN on the Jetson TK1 with no problem, so I believe the CUDA installation is fine.


scttl commented on May 18, 2024

nvidia-smi is not required to run any of the examples, we just use it as a proxy to validate that the user has the CUDA SDK installed. Provided you were able to install the cudanet python library ok, for the moment you can work around your issue by editing neon/backends/__init__.py to replace the line:
gpuflag = (os.system("nvidia-smi > /dev/null 2>&1") == 0)
with
gpuflag = (os.system("nvcc --version > /dev/null 2>&1") == 0)
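If you want a single check that works both on NVML-capable machines and on boards like the TK1, a combined fallback could look like this (a sketch of my own, not neon's actual code; it uses the same os.system pattern as the line above):

```python
import os

def cuda_available():
    # Try nvidia-smi first (requires NVML), then fall back to nvcc,
    # which is present whenever the CUDA toolkit is installed, even on
    # NVML-less boards such as the Jetson TK1. Either command exiting
    # with status 0 counts as a working CUDA install.
    for cmd in ("nvidia-smi", "nvcc --version"):
        if os.system(cmd + " > /dev/null 2>&1") == 0:
            return True
    return False
```

On a machine with neither command on PATH this returns False, which matches the original gpuflag behavior.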

We made a similar change in the Makefile a while back, but needed to update the check here as well. I've created a fix, and will get this merged into master for the next release of neon.


yuehusile commented on May 18, 2024

Thanks scttl! Editing neon/backends/__init__.py works, but I still can't run this example because another error appears:
root@tegra-ubuntu:/home/hsl/neon# neon --gpu cudanet examples/convnet/i1k-alexnet-fp32.yaml
WARNING:neon.util.persist:deserializing object from: examples/convnet/i1k-alexnet-fp32.yaml
WARNING:neon.datasets.imageset:Imageset initialized with dtype <type 'numpy.float32'>
2015-07-01 04:52:29,170 WARNING:neon - setting log level to: 20
2015-07-01 04:52:31,733 INFO:init - Cudanet backend, RNG seed: None, numerr: None
2015-07-01 04:52:31,735 INFO:mlp - Layers:
ImageDataLayer d0: 3 x (224 x 224) nodes
ConvLayer conv1: 3 x (224 x 224) inputs, 64 x (55 x 55) nodes, RectLin act_fn
PoolingLayer pool1: 64 x (55 x 55) inputs, 64 x (27 x 27) nodes, Linear act_fn
ConvLayer conv2: 64 x (27 x 27) inputs, 192 x (27 x 27) nodes, RectLin act_fn
PoolingLayer pool2: 192 x (27 x 27) inputs, 192 x (13 x 13) nodes, Linear act_fn
ConvLayer conv3: 192 x (13 x 13) inputs, 384 x (13 x 13) nodes, RectLin act_fn
ConvLayer conv4: 384 x (13 x 13) inputs, 256 x (13 x 13) nodes, RectLin act_fn
ConvLayer conv5: 256 x (13 x 13) inputs, 256 x (13 x 13) nodes, RectLin act_fn
PoolingLayer pool3: 256 x (13 x 13) inputs, 256 x (6 x 6) nodes, Linear act_fn
FCLayer fc4096a: 9216 inputs, 4096 nodes, RectLin act_fn
DropOutLayer dropout1: 4096 inputs, 4096 nodes, Linear act_fn
FCLayer fc4096b: 4096 inputs, 4096 nodes, RectLin act_fn
DropOutLayer dropout2: 4096 inputs, 4096 nodes, Linear act_fn
FCLayer fc1000: 4096 inputs, 1000 nodes, Softmax act_fn
CostLayer cost: 1000 nodes, CrossEntropy cost_fn

2015-07-01 04:52:31,738 INFO:batch_norm - BatchNormalization set to train mode
2015-07-01 04:52:32,228 INFO:val_init - Generating AutoUniformValGen values of shape (363, 64)
2015-07-01 04:52:32,254 INFO:batch_norm - BatchNormalization set to train mode
2015-07-01 04:52:32,340 INFO:val_init - Generating AutoUniformValGen values of shape (1600, 192)
2015-07-01 04:52:32,370 INFO:batch_norm - BatchNormalization set to train mode
2015-07-01 04:52:32,432 INFO:val_init - Generating AutoUniformValGen values of shape (1728, 384)
2015-07-01 04:52:32,506 INFO:batch_norm - BatchNormalization set to train mode
2015-07-01 04:52:32,552 INFO:val_init - Generating AutoUniformValGen values of shape (3456, 256)
2015-07-01 04:52:32,602 INFO:batch_norm - BatchNormalization set to train mode
2015-07-01 04:52:32,639 INFO:val_init - Generating AutoUniformValGen values of shape (2304, 256)
2015-07-01 04:52:32,691 INFO:batch_norm - BatchNormalization set to train mode
2015-07-01 04:52:32,702 INFO:val_init - Generating AutoUniformValGen values of shape (4096, 9216)
2015-07-01 04:52:34,805 INFO:batch_norm - BatchNormalization set to train mode
2015-07-01 04:52:34,813 INFO:val_init - Generating AutoUniformValGen values of shape (4096, 4096)
2015-07-01 04:52:35,728 INFO:val_init - Generating AutoUniformValGen values of shape (1000, 4096)
Traceback (most recent call last):
File "/usr/local/bin/neon", line 240, in <module>
experiment, result, status = main()
File "/usr/local/bin/neon", line 207, in main
experiment.initialize(backend)
File "/usr/local/lib/python2.7/dist-packages/neon/experiments/fit_predict_err.py", line 62, in initialize
super(FitPredictErrorExperiment, self).initialize(backend)
File "/usr/local/lib/python2.7/dist-packages/neon/experiments/fit.py", line 62, in initialize
self.model.initialize(backend)
File "/usr/local/lib/python2.7/dist-packages/neon/models/mlp.py", line 68, in initialize
dtype=self.layers[1].deltas_dtype)
File "/usr/local/lib/python2.7/dist-packages/neon/backends/cc2.py", line 536, in zeros
dtype=dtype)),
MemoryError

Is memory size the problem? The Tegra K1 has 2 GB of memory. Or is something else causing this? Any advice on how to find out what happened?


apark263 commented on May 18, 2024

Try reducing your batch size to 32 and see if the problem persists. If it runs, then you probably don't have enough memory to train at mb=128.
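For a rough sense of why mb=128 is tight on 2 GB, here is a back-of-envelope estimate (my own arithmetic, using the per-example layer shapes from the log above; it counts only fp32 forward activations and ignores weights, deltas, and convolution workspace, so real usage is several times higher):

```python
# Per-example activation sizes (values, not bytes) taken from the
# layer shapes in the log above: channels x (H x W), or node count
# for the fully connected layers.
layers = {
    "d0":    3 * 224 * 224,
    "conv1": 64 * 55 * 55,
    "pool1": 64 * 27 * 27,
    "conv2": 192 * 27 * 27,
    "pool2": 192 * 13 * 13,
    "conv3": 384 * 13 * 13,
    "conv4": 256 * 13 * 13,
    "conv5": 256 * 13 * 13,
    "pool3": 256 * 6 * 6,
    "fc4096a": 4096, "fc4096b": 4096, "fc1000": 1000,
}
for mb in (128, 32):
    total = sum(layers.values()) * mb * 4  # fp32 = 4 bytes per value
    print("mb=%d: ~%.0f MB of fp32 activations" % (mb, total / 2.0 ** 20))
```

Even this lower bound is roughly 360 MB at mb=128, before any weights or backprop buffers, on a board whose 2 GB is shared with the OS; at mb=32 it drops to about a quarter of that.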

Is there any particular reason you are using this system to train? You
would get much better performance by using a more standard graphics card.


