Comments (16)
batcher.lua is my script. I realized I wasn't giving proper advice for setting the LUA_PATH variable so that basset_train.lua will find it; I've fixed the README.
Try
export LUA_PATH="$BASSETDIR/src/?.lua;$LUA_PATH"
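To confirm the setting took effect, a quick check from the th prompt (Lua initializes package.path from LUA_PATH):
th> print(package.path)
The printed path should include $BASSETDIR/src/?.lua.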
I've set LUA_PATH as directed in README.md (its value is given below, after the traceback), but I still get:
%vjcair> basset_train.lua -job models/pretrained_params.txt -stagnant_t 10 encode_roadmap.h5
{
conv_filter_sizes :
{
1 : 19
2 : 11
3 : 7
}
weight_norm : 7
momentum : 0.98
learning_rate : 0.002
hidden_units :
{
1 : 1000
2 : 1000
}
conv_filters :
{
1 : 300
2 : 200
3 : 200
}
hidden_dropouts :
{
1 : 0.3
2 : 0.3
}
pool_width :
{
1 : 3
2 : 4
3 : 4
}
}
/Users/stvjc/torch/install/bin/luajit: /Users/stvjc/Research/BASSET/Basset/src/basset_train.lua:99: attempt to index global 'ConvNet' (a nil value)
stack traceback:
/Users/stvjc/Research/BASSET/Basset/src/basset_train.lua:99: in main chunk
[C]: in function 'dofile'
...tvjc/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x010c67cad0
/Users/stvjc/Research/BASSET/Basset/src/?.lua;/Users/stvjc/.luarocks/share/lua/5.1/?.lua;/Users/stvjc/.luarocks/share/lua/5.1/?/init.lua;/Users/stvjc/torch/install/share/lua/5.1/?.lua;/Users/stvjc/torch/install/share/lua/5.1/?/init.lua;./?.lua;/Users/stvjc/torch/install/share/luajit-2.1.0-beta1/?.lua;/usr/local/share/lua/5.1/?.lua;/usr/local/share/lua/5.1/?/init.lua
What do you see when you start torch up and try to "require 'batcher'" or "require 'convnet'"?
th> require 'convnet'
true
[0.5321s]
th> require 'batcher'
{
make_batch_iterators : function: 0x140ca6a0
make_chunk_iterator : function: 0x140d6e70
load_text : function: 0x140d46b8
split_indices : function: 0x140d78c0
make_chunk_iterators : function: 0x140d8390
stack : function: 0x140d8df0
make_batch_iterator : function: 0x140d9888
}
OK, after require 'convnet', try "convnet = ConvNet:__init()". That's the line that it's crashing on when you run the script, right?
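For reference, a minimal way to step through that from the th prompt — just a sketch, assuming Basset's convnet.lua registers the global ConvNet class via torch.class:
th> require 'convnet'
th> print(ConvNet)
th> convnet = ConvNet:__init()
If print(ConvNet) shows nil even though the require succeeded, the file being loaded isn't defining the class.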
Also, that's not the output that I get from requiring batcher. Is there a chance that there's another script in your path called batcher?
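One way to see which file actually satisfies the require, assuming your LuaJIT build exposes the Lua 5.2 extension package.searchpath:
th> print(package.searchpath('batcher', package.path))
If the printed path isn't Basset's src/batcher.lua, an earlier LUA_PATH entry is shadowing it.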
With a new checkout on an Amazon EC2 instance of type g2.2xlarge, basset_train.lua appears to be working on encode_roadmap.h5, running as indicated in the README.
top shows the load on this machine to be 430. I have no experience with GPUs; there are apparently 1536 cores. I don't see any messages after starting basset_train.lua, so I wonder if things are OK. What kind of checkpointing is done? Are there temporary files to check?
If this is working, I will register the AMI and make it public so that others can test and build off it.
On Wed, Jun 1, 2016 at 3:41 PM, Vikram Agarwal [email protected] wrote:

I believe I'm also having a problem with dependencies. I've done all of the appropriate exports, including the new LUA_PATH you mentioned, and have run "install_dependencies.py". Then I tried "./basset_train.lua $FILE" on my h5 file, and it gave this error:

Hyper-parameters unspecified. Applying a small model architecture
/home/ubuntu/torch/install/bin/luajit: /home/ubuntu/torch/install/share/lua/5.1/nn/Module.lua:178: attempt to perform arithmetic on local 'storageOffset' (a nil value)
stack traceback:
/home/ubuntu/torch/install/share/lua/5.1/nn/Module.lua:178: in function 'flatten'
/home/ubuntu/torch/install/share/lua/5.1/dpnn/Module.lua:198: in function 'getParameters'
./convnet.lua:249: in function 'build'
./basset_train.lua:111: in main chunk
[C]: in function 'dofile'
...untu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:131: in main chunk
[C]: at 0x00406670

I tried "luarocks install nn" to get the latest version of the 'nn' package, and it resulted in a different error, also associated with the 'nn' package:

/home/ubuntu/torch/install/bin/luajit: /home/ubuntu/torch/install/share/lua/5.1/trepl/init.lua:363:
/home/ubuntu/torch/install/share/lua/5.1/trepl/init.lua:363:
/home/ubuntu/torch/install/share/lua/5.1/trepl/init.lua:363:
/home/ubuntu/torch/install/share/lua/5.1/trepl/init.lua:363:
/home/ubuntu/torch/install/share/lua/5.1/nn/test.lua:12: attempt to call field 'TestSuite' (a nil value)
stack traceback:
[C]: in function 'error'
/home/ubuntu/torch/install/share/lua/5.1/trepl/init.lua:363: in function 'require'
./basset_train.lua:37: in main chunk
[C]: in function 'dofile'
...untu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:131: in main chunk
[C]: at 0x00406670

Any idea how to fix this?
The EC2 instance got completely overwhelmed and I stopped it. I ran without the -cuda option and got to:
ubuntu@ip-10-152-71-81:~/Basset/data$ basset_train.lua -job pretrained_params.txt -stagnant_t 10 encode_roadmap.h5
{
conv_filter_sizes :
{
1 : 19
2 : 11
3 : 7
}
weight_norm : 7
momentum : 0.98
learning_rate : 0.002
hidden_units :
{
1 : 1000
2 : 1000
}
conv_filters :
{
1 : 300
2 : 200
3 : 200
}
hidden_dropouts :
{
1 : 0.3
2 : 0.3
}
pool_width :
{
1 : 3
2 : 4
3 : 4
}
}
nn.Sequential {
[input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) ->
(10) -> (11) -> (12) -> (13) -> (14) -> (15) -> (16) -> (17) -> (18) ->
(19) -> (20) -> (21) -> (22) -> (23) -> output]
(1): nn.SpatialConvolution(4 -> 300, 19x1)
(2): nn.SpatialBatchNormalization
(3): nn.ReLU
(4): nn.SpatialMaxPooling(3x1, 3,1)
(5): nn.SpatialConvolution(300 -> 200, 11x1)
(6): nn.SpatialBatchNormalization
(7): nn.ReLU
(8): nn.SpatialMaxPooling(4x1, 4,1)
(9): nn.SpatialConvolution(200 -> 200, 7x1)
(10): nn.SpatialBatchNormalization
(11): nn.ReLU
(12): nn.SpatialMaxPooling(4x1, 4,1)
(13): nn.Reshape(2000)
(14): nn.Linear(2000 -> 1000)
(15): nn.BatchNormalization
(16): nn.ReLU
(17): nn.Dropout(0.300000)
(18): nn.Linear(1000 -> 1000)
(19): nn.BatchNormalization
(20): nn.ReLU
(21): nn.Dropout(0.300000)
(22): nn.Linear(1000 -> 164)
(23): nn.Sigmoid
}
OpenBLAS Warning : Detect OpenMP Loop and this application may hang. Please
rebuild the library with USE_OPENMP=1 option.
This last message happens a lot; it looks like we need to rebuild OpenBLAS. More later.
export OMP_NUM_THREADS=1 solves the OpenBLAS warning.
Now just running with a 3 GB RAM image. How long will an epoch take?
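For what it's worth, the thread count can likely also be pinned from inside Torch; a one-line sketch, assuming torch.setnumthreads controls the same OpenMP thread pool:
th> torch.setnumthreads(1)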
I've noticed that when I use the -cuda option I do not get the diagnostic trace reported two messages ago. Would the g2.8xlarge instance type be more effective with the example settings?
I tried with the g2.8xlarge instance type. The program moved along but, according to top, took up 85 GB of RAM on a 60 GB machine, so I stopped it.
I'm not familiar with EC2, so I can't help much with details there.
You asked how long epochs typically take, and it varies pretty widely based on the model, input data, and computing device. A large model trained on the full data on the Tesla K20's that I've been using can take up to 6 hours.
With respect to the memory leak, I had a similar problem at one point. I started towards a fix from the following thread: torch/torch7#229
Basically, on some Linux systems malloc holds on to any memory allocated within the program: you can write over that memory within the program, but it's not released back to the OS.
This alternative malloc implementation solves the problem: http://www.canonware.com/jemalloc/index.html
Specifically, I installed jemalloc and now run
export LD_PRELOAD=/mypath/jemalloc-4.0.1/lib/libjemalloc.so
from my .bashrc script. Then Torch releases the memory, and everything works great.
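A quick sanity check that the preload is actually visible to the Torch process (plain Lua; the jemalloc path is just the example from above):
th> print(os.getenv('LD_PRELOAD'))
If it prints nil, the export didn't reach the environment that launched th.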
The large memory image no longer appears once libjemalloc is used. However, with the -cuda option I don't see the diagnostic print of the model parameters.
top shows a load of 515, with nothing going on on the main CPU. Does that seem right?
top - 01:34:54 up 26 min, 2 users, load average: 515.08, 482.10, 302.66
Tasks: 1044 total, 3 running, 1041 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 0.1 sy, 0.0 ni, 93.7 id, 0.0 wa, 0.0 hi, 6.1 si, 0.0 st
KiB Mem:  61837044 total, 1023484 used, 60813560 free, 40480 buffers
KiB Swap: 0 total, 0 used, 0 free. 349672 cached Mem

  PID USER    PR  NI   VIRT   RES   SHR S  %CPU %MEM    TIME+ COMMAND
  345 root    20   0      0     0     0 R 100.0  0.0 14:01.17 ksoftirqd/16
    3 root    20   0      0     0     0 R  50.0  0.0  7:05.55 ksoftirqd/0
 4685 root    20   0      0     0     0 D  49.7  0.0  6:48.80 kworker/0:7
 5245 ubuntu  20   0  24548  2588  1168 R   0.7  0.0  0:00.98 top
 1772 root    20   0  19676  1248   552 S   0.3  0.0  0:00.41 irqbalance
    1 root    20   0  33864  3216  1492 S   0.0  0.0  0:09.84 init
    2 root    20   0      0     0     0 S   0.0  0.0  0:00.01 kthreadd
    4 root    20   0      0     0     0 D   0.0  0.0  0:00.00 kworker/0:0
    5 root     0 -20      0     0     0 S   0.0  0.0  0:00.00 kworker/0:0H
    6 root    20   0      0     0     0 S   0.0  0.0  0:00.00 kworker/u256:0
    7 root    20   0      0     0     0 S   0.0  0.0  0:00.01 kworker/u257:0
    8 root    20   0      0     0     0 S   0.0  0.0  0:00.38 rcu_sched
    9 root    20   0      0     0     0 S   0.0  0.0  0:00.06 rcuos/0
   10 root    20   0      0     0     0 S   0.0  0.0  0:00.02 rcuos/1
   11 root    20   0      0     0     0 S   0.0  0.0  0:00.05 rcuos/2
   12 root    20   0      0     0     0 S   0.0  0.0  0:00.02 rcuos/3
   13 root    20   0      0     0     0 S   0.0  0.0  0:00.01 rcuos/4
   14 root    20   0      0     0     0 S   0.0  0.0  0:00.01 rcuos/5
   15 root    20   0      0     0     0 S   0.0  0.0  0:00.01 rcuos/6
   16 root    20   0      0     0     0 S   0.0  0.0  0:00.00 rcuos/7
It's hard for me to tell if it's working for you or not. If not, let me know if there was any output and perhaps I can advise.
Hi, thanks for following up. There was no output after about an hour, with top listing 500+ jobs and no indication of memory consumption. I killed the job and have not revisited. I am still interested in getting it to run on the example data, but perhaps I need to target some smaller task.
Edit: please disregard. My question was due to bash syntax errors. You just need to check that echo $LUA_PATH contains your $BASSETDIR/src/?.lua.