Comments (16)
batcher.lua is my script. I realized I wasn't giving proper advice for setting the LUA_PATH variable so that basset_train.lua will find it; I've fixed the README.
Try
export LUA_PATH="$BASSETDIR/src/?.lua;$LUA_PATH"
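To confirm the setting took effect, a quick check from the th prompt (Lua initializes package.path from LUA_PATH):
th> print(package.path)
The printed path should include $BASSETDIR/src/?.lua.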
I've set LUA_PATH as directed in README.md (its value is given below, after the traceback), but I still get:
%vjcair> basset_train.lua -job models/pretrained_params.txt -stagnant_t 10 encode_roadmap.h5
{
conv_filter_sizes :
{
1 : 19
2 : 11
3 : 7
}
weight_norm : 7
momentum : 0.98
learning_rate : 0.002
hidden_units :
{
1 : 1000
2 : 1000
}
conv_filters :
{
1 : 300
2 : 200
3 : 200
}
hidden_dropouts :
{
1 : 0.3
2 : 0.3
}
pool_width :
{
1 : 3
2 : 4
3 : 4
}
}
/Users/stvjc/torch/install/bin/luajit: /Users/stvjc/Research/BASSET/Basset/src/basset_train.lua:99: attempt to index global 'ConvNet' (a nil value)
stack traceback:
/Users/stvjc/Research/BASSET/Basset/src/basset_train.lua:99: in main chunk
[C]: in function 'dofile'
...tvjc/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x010c67cad0
/Users/stvjc/Research/BASSET/Basset/src/?.lua;/Users/stvjc/.luarocks/share/lua/5.1/?.lua;/Users/stvjc/.luarocks/share/lua/5.1/?/init.lua;/Users/stvjc/torch/install/share/lua/5.1/?.lua;/Users/stvjc/torch/install/share/lua/5.1/?/init.lua;./?.lua;/Users/stvjc/torch/install/share/luajit-2.1.0-beta1/?.lua;/usr/local/share/lua/5.1/?.lua;/usr/local/share/lua/5.1/?/init.lua
What do you see when you start torch up and try to "require 'batcher'" or "require 'convnet'"?
th> require 'convnet'
true
[0.5321s]
th> require 'batcher'
{
make_batch_iterators : function: 0x140ca6a0
make_chunk_iterator : function: 0x140d6e70
load_text : function: 0x140d46b8
split_indices : function: 0x140d78c0
make_chunk_iterators : function: 0x140d8390
stack : function: 0x140d8df0
make_batch_iterator : function: 0x140d9888
}
OK, after require 'convnet', try "convnet = ConvNet:__init()". That's the line that it's crashing on when you run the script, right?
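For reference, a minimal way to step through that from the th prompt — just a sketch, assuming Basset's convnet.lua registers the global ConvNet class via torch.class:
th> require 'convnet'
th> print(ConvNet)
th> convnet = ConvNet:__init()
If print(ConvNet) shows nil even though the require succeeded, the file being loaded isn't defining the class.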
Also, that's not the output that I get from requiring batcher. Is there a chance that there's another script in your path called batcher?
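One way to see which file actually satisfies the require, assuming your LuaJIT build exposes the Lua 5.2 extension package.searchpath:
th> print(package.searchpath('batcher', package.path))
If the printed path isn't Basset's src/batcher.lua, an earlier LUA_PATH entry is shadowing it.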
With a new checkout on an Amazon EC2 instance of type g2.2xlarge, basset_train.lua appears to be working on encode_roadmap.h5, running as indicated in the README.
top shows the load on this machine to be 430. I have no experience with GPUs; there are apparently 1536 cores. I don't see any messages after starting basset_train.lua, so I wonder if things are OK. What kind of checkpointing is done? Are there temporary files to check?
If this is working, I will register the AMI and make it public so that others can test and build off it.
On Wed, Jun 1, 2016 at 3:41 PM, Vikram Agarwal [email protected] wrote:

I believe I'm also having a problem with dependencies. I've done all of the appropriate exports, including the new LUA_PATH you mentioned, and have run "install_dependencies.py". Then I tried "./basset_train.lua $FILE" on my h5 file, and it gave this error:

Hyper-parameters unspecified. Applying a small model architecture
/home/ubuntu/torch/install/bin/luajit: /home/ubuntu/torch/install/share/lua/5.1/nn/Module.lua:178: attempt to perform arithmetic on local 'storageOffset' (a nil value)
stack traceback:
/home/ubuntu/torch/install/share/lua/5.1/nn/Module.lua:178: in function 'flatten'
/home/ubuntu/torch/install/share/lua/5.1/dpnn/Module.lua:198: in function 'getParameters'
./convnet.lua:249: in function 'build'
./basset_train.lua:111: in main chunk
[C]: in function 'dofile'
...untu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:131: in main chunk
[C]: at 0x00406670

I tried "luarocks install nn" to get the latest version of the 'nn' package, and it resulted in a different error, also associated with the 'nn' package:

/home/ubuntu/torch/install/bin/luajit: /home/ubuntu/torch/install/share/lua/5.1/trepl/init.lua:363:
/home/ubuntu/torch/install/share/lua/5.1/trepl/init.lua:363:
/home/ubuntu/torch/install/share/lua/5.1/trepl/init.lua:363:
/home/ubuntu/torch/install/share/lua/5.1/trepl/init.lua:363:
/home/ubuntu/torch/install/share/lua/5.1/nn/test.lua:12: attempt to call field 'TestSuite' (a nil value)
stack traceback:
[C]: in function 'error'
/home/ubuntu/torch/install/share/lua/5.1/trepl/init.lua:363: in function 'require'
./basset_train.lua:37: in main chunk
[C]: in function 'dofile'
...untu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:131: in main chunk
[C]: at 0x00406670

Any idea how to fix this?
The EC2 instance got completely overwhelmed and I stopped it. I ran without the -cuda option and got to:
ubuntu@ip-10-152-71-81:~/Basset/data$ basset_train.lua -job pretrained_params.txt -stagnant_t 10 encode_roadmap.h5
{
conv_filter_sizes :
{
1 : 19
2 : 11
3 : 7
}
weight_norm : 7
momentum : 0.98
learning_rate : 0.002
hidden_units :
{
1 : 1000
2 : 1000
}
conv_filters :
{
1 : 300
2 : 200
3 : 200
}
hidden_dropouts :
{
1 : 0.3
2 : 0.3
}
pool_width :
{
1 : 3
2 : 4
3 : 4
}
}
nn.Sequential {
[input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) ->
(10) -> (11) -> (12) -> (13) -> (14) -> (15) -> (16) -> (17) -> (18) ->
(19) -> (20) -> (21) -> (22) -> (23) -> output]
(1): nn.SpatialConvolution(4 -> 300, 19x1)
(2): nn.SpatialBatchNormalization
(3): nn.ReLU
(4): nn.SpatialMaxPooling(3x1, 3,1)
(5): nn.SpatialConvolution(300 -> 200, 11x1)
(6): nn.SpatialBatchNormalization
(7): nn.ReLU
(8): nn.SpatialMaxPooling(4x1, 4,1)
(9): nn.SpatialConvolution(200 -> 200, 7x1)
(10): nn.SpatialBatchNormalization
(11): nn.ReLU
(12): nn.SpatialMaxPooling(4x1, 4,1)
(13): nn.Reshape(2000)
(14): nn.Linear(2000 -> 1000)
(15): nn.BatchNormalization
(16): nn.ReLU
(17): nn.Dropout(0.300000)
(18): nn.Linear(1000 -> 1000)
(19): nn.BatchNormalization
(20): nn.ReLU
(21): nn.Dropout(0.300000)
(22): nn.Linear(1000 -> 164)
(23): nn.Sigmoid
}
OpenBLAS Warning : Detect OpenMP Loop and this application may hang. Please
rebuild the library with USE_OPENMP=1 option.
This last message happens a lot; it looks like we need to rebuild OpenBLAS. More later.
export OMP_NUM_THREADS=1 solves the OpenBLAS warning.
Now just running with a 3 GB RAM image. How long will an epoch take?
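For what it's worth, the thread count can likely also be pinned from inside Torch; a one-line sketch, assuming torch.setnumthreads controls the same OpenMP thread pool:
th> torch.setnumthreads(1)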
I've noticed that when I use the -cuda option I do not get the diagnostic trace reported two messages ago. Would the g2.8xlarge instance type be more effective with the example settings?
I tried with the g2.8xlarge instance type. The program moved along but, according to top, took up 85 GB of RAM on a 60 GB machine, so I stopped it.
I'm not familiar with EC2, so I can't help much with details there.
You asked how long epochs typically take, and it varies pretty widely based on the model, input data, and computing device. A large model trained on the full data on the Tesla K20's that I've been using can take up to 6 hours.
With respect to the memory leak, I had a similar problem at one point. I started towards a fix from the following thread: torch/torch7#229
Basically, on some Linux systems malloc holds on to any memory allocated within the program: you can write over that memory within the program, but it's not released back to the OS.
This alternative malloc implementation solves the problem: http://www.canonware.com/jemalloc/index.html
Specifically, I installed jemalloc and now run
export LD_PRELOAD=/mypath/jemalloc-4.0.1/lib/libjemalloc.so
from my .bashrc script. Then Torch releases the memory, and everything works great.
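A quick sanity check that the preload is actually visible to the Torch process (plain Lua; the jemalloc path is just the example from above):
th> print(os.getenv('LD_PRELOAD'))
If it prints nil, the export didn't reach the environment that launched th.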
The large memory image no longer appears once libjemalloc is used. However, with the -cuda option I don't see the diagnostic print of the model parameters.
top shows a load of 515, with nothing going on on the main CPU. Does that seem right?
top - 01:34:54 up 26 min, 2 users, load average: 515.08, 482.10, 302.66
Tasks: 1044 total, 3 running, 1041 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 0.1 sy, 0.0 ni, 93.7 id, 0.0 wa, 0.0 hi, 6.1 si, 0.0 st
KiB Mem:  61837044 total, 1023484 used, 60813560 free, 40480 buffers
KiB Swap: 0 total, 0 used, 0 free. 349672 cached Mem

  PID USER    PR  NI   VIRT   RES   SHR S  %CPU %MEM    TIME+ COMMAND
  345 root    20   0      0     0     0 R 100.0  0.0 14:01.17 ksoftirqd/16
    3 root    20   0      0     0     0 R  50.0  0.0  7:05.55 ksoftirqd/0
 4685 root    20   0      0     0     0 D  49.7  0.0  6:48.80 kworker/0:7
 5245 ubuntu  20   0  24548  2588  1168 R   0.7  0.0  0:00.98 top
 1772 root    20   0  19676  1248   552 S   0.3  0.0  0:00.41 irqbalance
    1 root    20   0  33864  3216  1492 S   0.0  0.0  0:09.84 init
    2 root    20   0      0     0     0 S   0.0  0.0  0:00.01 kthreadd
    4 root    20   0      0     0     0 D   0.0  0.0  0:00.00 kworker/0:0
    5 root     0 -20      0     0     0 S   0.0  0.0  0:00.00 kworker/0:0H
    6 root    20   0      0     0     0 S   0.0  0.0  0:00.00 kworker/u256:0
    7 root    20   0      0     0     0 S   0.0  0.0  0:00.01 kworker/u257:0
    8 root    20   0      0     0     0 S   0.0  0.0  0:00.38 rcu_sched
    9 root    20   0      0     0     0 S   0.0  0.0  0:00.06 rcuos/0
   10 root    20   0      0     0     0 S   0.0  0.0  0:00.02 rcuos/1
   11 root    20   0      0     0     0 S   0.0  0.0  0:00.05 rcuos/2
   12 root    20   0      0     0     0 S   0.0  0.0  0:00.02 rcuos/3
   13 root    20   0      0     0     0 S   0.0  0.0  0:00.01 rcuos/4
   14 root    20   0      0     0     0 S   0.0  0.0  0:00.01 rcuos/5
   15 root    20   0      0     0     0 S   0.0  0.0  0:00.01 rcuos/6
   16 root    20   0      0     0     0 S   0.0  0.0  0:00.00 rcuos/7
It's hard for me to tell if it's working for you or not. If not, let me know if there was any output and perhaps I can advise.
Hi, thanks for following up. There was no output after about an hour, with top listing 500+ jobs and no indication of memory consumption. I killed the job and have not revisited. I am still interested in getting it to run on the example data, but perhaps I need to target some smaller task.
Edit: please disregard. My question was due to bash syntax errors. You just need to check that echo $LUA_PATH contains your $BASSETDIR/src/?.lua.