
Comments (31)

soumith commented on July 3, 2024

@SaMnCo @kinoc here's the fix; I just tested it on my TX1. No need for the ascii model anymore.

git clone https://github.com/mvitez/torch7.git mvittorch7
cd mvittorch7
luarocks make rocks/torch-scm-1.rockspec

Now, apply this diff to eval.lua

diff --git a/eval.lua b/eval.lua
index 1814180..8cad5ba 100644
--- a/eval.lua
+++ b/eval.lua
@@ -65,8 +65,21 @@ end
 -------------------------------------------------------------------------------
 -- Load the model checkpoint to evaluate
 -------------------------------------------------------------------------------
+local function load(filename)
+   local mode = 'binary'
+   local referenced = true
+   local file = torch.DiskFile(filename, 'r')
+   file[mode](file)
+   file:referenced(referenced)
+   file:longSize(8)
+   file:littleEndianEncoding()
+   local object = file:readObject()
+   file:close()
+   return object
+end
+
 assert(string.len(opt.model) > 0, 'must provide a model')
-local checkpoint = torch.load(opt.model)
+local checkpoint = load(opt.model)
 -- override and collect parameters
 if string.len(opt.input_h5) == 0 then opt.input_h5 = checkpoint.opt.input_h5 end
 if string.len(opt.input_json) == 0 then opt.input_json = checkpoint.opt.input_json end
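A note on why the patch forces file:longSize(8): the checkpoint was serialized on a 64-bit machine, where a C long is 8 bytes, while Torch on 32-bit ARM defaults to 4-byte longs, so every long-sized field in the stream gets misread. A minimal Python sketch of the mismatch (illustrative values, not the actual t7 layout):

```python
import struct

# A 64-bit writer serializes a length field (say, 42) as an 8-byte little-endian long.
payload = struct.pack("<q", 42)

# A reader configured for 8-byte longs (what file:longSize(8) forces) recovers it.
correct, = struct.unpack("<q", payload)
print(correct)   # 42

# A 32-bit reader assuming 4-byte longs misparses the same bytes as two ints,
# and every field after this one is then read from the wrong offset.
wrong = struct.unpack("<ii", payload)
print(wrong)     # (42, 0)
```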

Then, run your eval script:

th eval.lua -model model_id1-501-1448236541.t7 -image_folder ~/Downloads -num_images 10

I tested it with some of the test images in the README, it works as expected.

ubuntu@tegra-ubuntu:~/neuraltalk2$ th eval.lua -model model_id1-501-1448236541.t7 -image_folder ~/Downloads -num_images 10 
DataLoaderRaw loading images from folder:   /home/ubuntu/Downloads  
listing all images in directory /home/ubuntu/Downloads  
DataLoaderRaw found 2 images    
constructing clones inside the LanguageModel    
cp "/home/ubuntu/Downloads/img5.jpg" vis/imgs/img1.jpg  
image 1: a black and white cat sitting in a bathroom sink   
evaluating performance... 1/2 (0.000000)    
cp "/home/ubuntu/Downloads/img2.jpg" vis/imgs/img2.jpg  
image 2: a man sitting at a table with a laptop 
evaluating performance... 0/2 (0.000000)    
loss:   nan 

Thanks to @mvitez for the patch, I will merge this patch into core torch in the next week or so.

from neuraltalk2.

mvitez commented on July 3, 2024

There is no point in using my version of torch7, as my changes have already been merged into mainline Torch, with additional improvements.


SaMnCo commented on July 3, 2024

Hi there,

I rebuilt this on a standard amd64 architecture and it worked flawlessly. I can run the trained CPU model properly, and my images are processed.
I'll give it another run on ARMv7 to see if I did something wrong there. Still, if someone has seen this issue in the past, I'd gladly welcome input :)

Thanks for this really really cool piece of work!


SaMnCo commented on July 3, 2024

OK, I rebuilt it completely from scratch and I get the same error, while the same script runs smoothly on AMD64. I guess this must be a bug caused by the CPU architecture. Thoughts?


soumith commented on July 3, 2024

@SaMnCo it might be related to torch/torch7#476. On a PC, save the checkpoint as 'ascii', and then load it on ARMv7:
-- on your laptop / desktop
checkpoint = torch.load(filename)
torch.save('checkpoint_ascii.t7', checkpoint, 'ascii')

-- on armv7
checkpoint = torch.load('checkpoint_ascii.t7', 'ascii')


SaMnCo commented on July 3, 2024

@soumith thanks for having a look; your suggestion actually changes the behaviour of the system.
Now:

  • It seems this runs on only 1 core; I seem to remember it was supposed to be multi-threaded, but well.
  • It doesn't output anything, but fills the whole memory in ~5 min, then crashes, and Torch tells me to buy more RAM (funny message).

I'll add some swapping capability to my Raspberry Pi to give it at least as much memory as the model size. In ASCII, the model is ~3GB; was that expected?
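As rough intuition for the ascii size (a back-of-the-envelope sketch, not the actual t7 ascii format): each float32 weight costs a fixed 4 bytes in binary, but roughly 10-14 characters once printed as decimal text plus a separator, which easily triples the file size.

```python
import struct

value = 0.012345679  # a typical small float32 weight (illustrative)

binary_bytes = len(struct.pack("<f", value))      # fixed 4 bytes per weight in binary
ascii_bytes = len(("%.9g" % value).encode()) + 1  # decimal digits plus a separator

print(binary_bytes, ascii_bytes)
```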

Also (I am really new to all this), can you detail a little what this code does before processing any image? Does it copy the whole model to RAM? Any idea why I see this happening? Also, assuming this is all normal, any idea how I could reduce the in-memory size?

Thanks!


SaMnCo commented on July 3, 2024

So I added ZRAM, as well as a disposable USB key used as swap, to extend the limited 1GB of the Rpi2.

And I got a little further:
root@ubuntu:/opt/neural-networks/neuraltalk2# th eval_ascii.lua -model /data/model/model_id1-501-1448236541_ascii.t7 -image_folder /data/images -num_images 1 -batch_size 1 -gpuid -1
DataLoaderRaw loading images from folder: /data/images
listing all images in directory /data/images
DataLoaderRaw found 107 images
constructing clones inside the LanguageModel
Segmentation fault

The "constructing clones..." step runs for a VERY long time (maybe 30 min) before segfaulting. I'll try again later today; any ideas welcome.


soumith commented on July 3, 2024

@SaMnCo 3GB, hmm, that's plausible but irritating. What are you running this on? I can try to replicate it tomorrow on my NVIDIA TK1 or TX1.


SaMnCo commented on July 3, 2024

@soumith I am using the example model provided (CPU version; the rpi2 GPU is apparently weird). Actually, I'm not sure the issue is the model here: if I run the same thing on the amd64 architecture, I get the same segfault.
The issue probably comes from something else in the code that only happens when reading ascii, but that is far beyond my know-how.

What do you think would happen if I trained the model on the rpi? Would that work (even if it took time)? How many images would be required to get to about 500MB in memory?


SaMnCo commented on July 3, 2024

Still no luck, but I saw some things in strace, if that helps.

I see a lot of:
mremap(0xb820000, 174886912, 262328320, MREMAP_MAYMOVE) = -1 ENOMEM (Cannot allocate memory)
mmap2(NULL, 262328320, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap2(NULL, 262328320, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
brk(0xfcfb000) = 0x2bc000
mmap2(NULL, 262459392, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
munmap(0x17172000, 77725696) = 0

Then the segfault comes up with:
read(3, "\365L\33\353V-\364\3209\223\207\364&\241\266\235\220\322\270\3032\254y!\267\372-G\375\246\3020"..., 4096) = 4096
read(3, "Uh\256\244\311\31-\371UMKY/4\301\310\nN+IJ\373\31K\265\304\213P\363&\226 "..., 4096) = 4096
read(3, "\335\v7\377\0Z\232\315 \234m\316\235\354+\27\v@\313\235\324\306(W\203\201\353\272\263%"..., 4096) = 4096
read(3, "\n\23\23\275\206\207P\316\24\356^\307\322\236\262\34\251\334x\245\212\324\202v\257\312{\324\2068\323\275"..., 4096) = 2230
read(3, "", 4096) = 0
close(3) = 0
munmap(0x76f08000, 4096) = 0
mmap2(NULL, 200704, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x71445000
mmap2(NULL, 413696, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7129b000
munmap(0x7129b000, 413696) = 0
mmap2(NULL, 606208, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7126c000
mmap2(NULL, 5423104, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x1608d000
mmap2(NULL, 12849152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x1544c000
mmap2(NULL, 8388608, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x14c4c000
mprotect(0x14c4c000, 4096, PROT_NONE) = 0
clone(child_stack=0x1544af98, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x1544b4c8, tls=0x1544b920, child_tidptr=0x1544b4c8) = 10447
mmap2(NULL, 8388608, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x1444c000
mprotect(0x1444c000, 4096, PROT_NONE) = 0
clone(child_stack=0x14c4af98, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x14c4b4c8, tls=0x14c4b920, child_tidptr=0x14c4b4c8) = 10448
mmap2(NULL, 8388608, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x13c4c000
mprotect(0x13c4c000, 4096, PROT_NONE) = 0
clone(child_stack=0x1444af98, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x1444b4c8, tls=0x1444b920, child_tidptr=0x1444b4c8) = 10449
futex(0x71320e8c, FUTEX_WAKE_PRIVATE, 2147483647) = 1
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x6e500000} ---
+++ killed by SIGSEGV +++
Segmentation fault


kinoc commented on July 3, 2024

@soumith @SaMnCo Any luck? I have similar problems trying to build with my TX1; the process dies at the same point, after loading the 3GB ascii file.


kinoc commented on July 3, 2024

I made a "convert_ascii_to_arm.lua" file for the TX1 that just loads the ascii model and then writes out a binary "ntalk2.arm.t7" file. I can then use the unaltered eval.lua on "ntalk2.arm.t7", which works. However, whenever it computes the loss I get "loss: nan". I used "-num_images -1" to have it process all the images before computing the loss, which still gives "loss: nan", but at least it finishes processing all the files, which show up in the web interface, so progress! The labels for the image set match those given by a CPU-only run against the same image set, so it is working. So my sequence was: get the original model on the CPU, write out the 3GB ascii version, transfer the ascii model to the TX1, convert ascii to binary on the TX1, run using the new TX1-written binary model, complain about loss nan.


kinoc commented on July 3, 2024

Thanks for the patch, @soumith.
I applied mvittorch7 and the patch and got the following results (note: "evalp.lua" is "eval.lua" with the patch applied):

ubuntu@tegra-ubuntu:~/neuraltalk2$ th evalp.lua -model /home/ubuntu/neuraltalk2/models/model_id1-501-1448236541.t7 -image_folder /home/ubuntu/Desktop/Share/pictest -num_images -1
DataLoaderRaw loading images from folder:   /home/ubuntu/Desktop/Share/pictest  
listing all images in directory /home/ubuntu/Desktop/Share/pictest  
DataLoaderRaw found 67 images   
constructing clones inside the LanguageModel    
/home/ubuntu/torch/install/bin/luajit: /home/ubuntu/torch/install/share/lua/5.1/cudnn/init.lua:58: Error in CuDNN: CUDNN_STATUS_NOT_SUPPORTED
stack traceback:
    [C]: in function 'error'
    /home/ubuntu/torch/install/share/lua/5.1/cudnn/init.lua:58: in function 'errcheck'
    ...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:385: in function 'updateOutput'
    /home/ubuntu/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
    evalp.lua:134: in function 'eval_split'
    evalp.lua:186: in main chunk
    [C]: in function 'dofile'
    ...untu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x0000cff9

However, using "eval.lua" with the locally translated model does work, even after applying mvittorch7.
The failing line is evalp.lua:134: local feats = protos.cnn:forward(data.images), and it appears to be trying to do 'cudnnAddTensor'.

So what would trip that? It appears to get past loading the model.


soumith commented on July 3, 2024

Which hardware are you running this on? And what cudnn version? Maybe there's an issue there. This error message is straight from the cudnn lib.


kinoc commented on July 3, 2024

I am running on an Nvidia TX1. I have CUDA 7.0 installed. In ~/cudnn there is cudnn-7.0-linux-ARMv7-v4.0-rc1.tgz along with the v4.0_rc1 release notes, and ~/cudnn/cudnn.h says basically 4.0.2. Does that mean this edge is too bleeding?


soumith commented on July 3, 2024

If you have cudnn 4.0, then you need the cudnn.torch R4 bindings:
git clone https://github.com/soumith/cudnn.torch -b R4
cd cudnn.torch
luarocks make

Things should work after that



kinoc commented on July 3, 2024

@soumith Thanks, problem solved!
So, summary for TX1 users: apply the patch in #32 (comment) and use the R4 bindings listed above.


kinoc commented on July 3, 2024

One follow-on: what would the change (if any) be for "train.lua"? I was trying a small test set of 100 train and 100 test images, just to learn the workflow. Some of the images had alpha channels, so I had to modify the prepro.py image reader:
I = imread(os.path.join(params['images_root'], img['file_path']),'RGB')
and when I run "train.lua" I get a segmentation fault, which I trace to the line
local params,grad_params = protos.lm:getParameters()

ubuntu@tegra-ubuntu:~/neuraltalk2$ th train.lua -input_h5 ~/Desktop/Share/AnimeNT/animetalk.h5 -input_json ~/Desktop/Share/AnimeNT/animetalk.json -cnn_proto ~/neuraltalk2/models/VGG_ILSVRC_16_layers_deploy.prototxt -cnn_model ~/neuraltalk2/models/VGG_ILSVRC_16_layers.caffemodel -losses_log_every 3500000
DataLoader loading json file:   /home/ubuntu/Desktop/Share/AnimeNT/animetalk.json
vocab size is 44
DataLoader loading h5 file:     /home/ubuntu/Desktop/Share/AnimeNT/animetalk.h5
read 100 images of size 3x256x256
max sequence length in data is 16
assigned 100 images to split val
probe1
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:505] Reading dangerously large protocol message.  If the message turns out to be larger than 1073741824 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:78] The total number of bytes read was 553432081
Successfully loaded /home/ubuntu/neuraltalk2/models/VGG_ILSVRC_16_layers.caffemodel
conv1_1: 64 3 3 3
conv1_2: 64 64 3 3
conv2_1: 128 64 3 3
conv2_2: 128 128 3 3
conv3_1: 256 128 3 3
conv3_2: 256 256 3 3
conv3_3: 256 256 3 3
conv4_1: 512 256 3 3
conv4_2: 512 512 3 3
conv4_3: 512 512 3 3
conv5_1: 512 512 3 3
conv5_2: 512 512 3 3
conv5_3: 512 512 3 3
fc6: 1 1 25088 4096
fc7: 1 1 4096 4096
fc8: 1 1 4096 1000
probe2
converting first layer conv filters from BGR to RGB...
probe3
probe4
probe5
probe6
Segmentation fault
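The alpha-channel fix described above amounts to normalizing every input image to HxWx3 before it reaches the CNN. A minimal NumPy sketch (to_rgb is a hypothetical helper, not part of prepro.py):

```python
import numpy as np

def to_rgb(img):
    """Normalize an image array to HxWx3: expand grayscale, drop alpha."""
    if img.ndim == 2:                 # grayscale -> replicate into 3 channels
        img = np.stack([img] * 3, axis=-1)
    if img.shape[-1] == 4:            # RGBA -> drop the alpha channel
        img = img[..., :3]
    return img

rgba = np.zeros((256, 256, 4), dtype=np.uint8)
print(to_rgb(rgba).shape)             # (256, 256, 3)
```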


soumith commented on July 3, 2024

@kinoc I really don't think training on the TX1 is a good idea: 4GB of memory in total, shared between GPU and CPU. Either way, loadcaffe seems not to support 32-bit ARM; not sure what the issue is.


kinoc commented on July 3, 2024

@soumith I hear you (about using the TX1 for training). Unfortunately, I don't have a native Linux box set up. I do have everything set up and working fine in an x86 VM, even the TX1 cross-compiler process. However, the VM doesn't have access to a GPU. I have plans for another dedicated Linux box with a GPU. I was hoping to get the workflow down using tiny sets the TX1 could handle (not expecting any accuracy), since the TX1 will still beat the VM for getting feedback on the mechanics of model construction.


soumith commented on July 3, 2024

@kinoc OK, in that case I can figure out the rest of the issues with the TX1, but not for a week. My TX1 is at the office, and I leave for NIPS for a week starting tomorrow.


kinoc commented on July 3, 2024

@soumith ok, no rush. There are plenty of other areas to work on. Thanks and good luck with NIPS!


SaMnCo commented on July 3, 2024

Hello,

Thanks for working on this! Unfortunately, still no luck on the lower-end rpi2 hardware.

root@ubuntu:/opt/neural-networks/neuraltalk2# th eval.lua -model /data/model/model_id1-501-1448236541_cpu.t7 -image_folder /data/images -num_images 10 -gpuid -1
/opt/neural-networks/torch/install/bin/luajit: eval.lua:89: attempt to index global 'checkpoint' (a nil value)
stack traceback:
eval.lua:89: in main chunk
[C]: in function 'dofile'
...orks/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x0000cff9

I have tried:

  • num_images -1 as well, to follow @kinoc's steps, but I get the same failure, though it processes a little longer.
  • changing the patch to read in ascii with the new DiskFile and reading from the ascii model, but that fails as well.

This is after applying the patch and installing the new DiskFile utility.
My guess is that armv7 32-bit is the real issue here, as well as the lack of a decent GPU.
The last option is that I messed up at some point, as I installed / reinstalled / updated / changed many things. I need to do a clean install one more time...

My next steps:

  • I ordered a TX1, but in Europe I can't get it before the end of January, so I'll just have to be patient, I guess...
  • I'll run some of these tests again on an ARM public cloud. The rpi is really slow, and I want to know whether the problem comes from the amount of memory, or 32-bit, or something else, but I can't wait 30 min each time I build a version of torch :/
  • I'll focus on building a pipeline to train new models in the cloud using GPU-enabled instances, and come back to the actual edge-device work when I get the nVidia kit.
  • @kinoc, if you can share your script to build the model for ARM, I'd love to give it a try. Thx in advance.

I'll keep you posted on my progress!



kinoc commented on July 3, 2024

@SaMnCo here is my "convert_ascii_to_tx1.lua" script, which has the input and output hardcoded, so modify as needed. It is just the top of "eval.lua": it loads the ascii model and then writes out the native binary. It assumes you built the ascii file and it's on the ARM device. Hope it helps.

-- convert_ascii_to_tx1.lua , convert ascii file to local native binary format of machine where script is run
require 'torch'
require 'nn'
require 'nngraph'
-- exotics
require 'loadcaffe'
-- local imports
local utils = require 'misc.utils'
require 'misc.DataLoader'
require 'misc.DataLoaderRaw'
require 'misc.LanguageModel'
local net_utils = require 'misc.net_utils'

print('loading ascii checkpoint')
-- note: the file layer won't expand '~', so build the path from $HOME explicitly
local home = os.getenv('HOME')
local checkpoint = torch.load(home .. '/neuraltalk2/models/ntalk2.net.ascii', 'ascii')
print('writing binary checkpoint')
torch.save(home .. '/neuraltalk2/models/ntalk2.arm.t7', checkpoint)
print('done.')


SaMnCo commented on July 3, 2024

@kinoc, thanks a lot.

Still no luck with this new file. The file looks VERY close to the initial cpu version I derived my ascii from.

I'll keep searching and testing...



SaMnCo commented on July 3, 2024

OK, I have traced the problem to this line in eval.lua:

local feats = protos.cnn:forward(data.images)

This doesn't pass on the Rpi2. Still waiting for my TX1 to run this properly :/ ...


kinoc commented on July 3, 2024

@soumith Any clues with the TX1? The dedicated Linux GPU box is still a month+ away, but the TX1 is here now. ;p


sunyiyou commented on July 3, 2024

@soumith I executed the commands below, as you mentioned before. Will they conflict with the pre-installed version of Torch? I'm now confused by a lot of errors and may need to reinstall it.

git clone https://github.com/mvitez/torch7.git mvittorch7
cd mvittorch7
luarocks make rocks/torch-scm-1.rockspec


MikeChenfu commented on July 3, 2024

Hello,

Thanks for working on this problem. It helps me a lot.
I can install @mvitez's code into torch7, but I hit a problem when I use torch7:

torch/install/bin/luajit: torch/install/share/lua/5.1/trepl/init.lua:384: torch/install/share/lua/5.1/trepl/init.lua:384:
torch/install/share/lua/5.1/nn/test.lua:12: attempt to call field 'TestSuite' (a nil value)
stack traceback:
[C]: in function 'error'
/home/cl004/torch/install/share/lua/5.1/trepl/init.lua:384: in function 'require'
train.lua:2: in main chunk
[C]: in function 'dofile'
...l004/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00406670


ZahlGraf commented on July 3, 2024

@kinoc and @soumith
Have you trained a model on Jetson TX1 in the meanwhile?
I always get "Segmentation fault" when trying to train on the TX1 (see #160), and unfortunately none of those pretrained models work on my TX1 (see #159), not even with the fix provided by @soumith in comment 30.

I'm using the JetPack 2.3 with CUDA 8.0 and cuDNN 5.0.


kaisark commented on July 3, 2024

@soumith @mvitez Jetson TX1 users are still running into model portability issues ("Failed to load function from bytecode"). Have the changes made it into mainline Torch? (Everything works on my EC2 P2 instance.) I also tried adding a local load function to eval.lua, but that did not work either.

System:
Nvidia TX1 (ARM - aarch64)
Torch 7
Ubuntu 16
CUDA 8
CUDNN 6
nvidia@tegra-ubuntu:~$ luajit -v
LuaJIT 2.1.0-beta1 -- Copyright (C) 2005-2015 Mike Pall. http://luajit.org
nvidia/torch/install/share/luajit-2.1.0-beta1/?.lua;/usr/local/share/lua/5.1/?.lua;/usr/local/share/lua/5.1/?/init.lua

--kkhatak added 3/18
local function load(filename)
   local mode = 'binary'
   local referenced = true
   local file = torch.DiskFile(filename, 'r')
   file[mode](file)
   file:referenced(referenced)
   file:longSize(8)
   file:littleEndianEncoding()
   print('calling file:readObject...')
   local object = file:readObject()
   file:close()
   return object
end


nvidia@tegra-ubuntu:~/neuraltalk2$ th eval.lua -model model_id1-501-1448236541.t7 -image_folder ~/neuraltalk2/images -num_images 10
opt.model: model_id1-501-1448236541.t7
calling torch.load(opt.model)...
Warning: Failed to load function from bytecode: (binary): cannot load incompatible bytecodeWarning: Failed to load function from bytecode: (binary): cannot load incompatible bytecodeWarning: Failed to load function from bytecode: [string "..."]:1: unexpected symbol near 'char(6)'/home/nvidia/torch/install/bin/luajit: /home/nvidia/torch/install/share/lua/5.1/torch/File.lua:375: unknown object
stack traceback:
[C]: in function 'error'
/home/nvidia/torch/install/share/lua/5.1/torch/File.lua:375: in function 'readObject'
/home/nvidia/torch/install/share/lua/5.1/torch/File.lua:307: in function 'readObject'
/home/nvidia/torch/install/share/lua/5.1/torch/File.lua:369: in function 'readObject'
/home/nvidia/torch/install/share/lua/5.1/nn/Module.lua:192: in function 'read'
/home/nvidia/torch/install/share/lua/5.1/torch/File.lua:351: in function 'readObject'
/home/nvidia/torch/install/share/lua/5.1/torch/File.lua:369: in function 'readObject'
/home/nvidia/torch/install/share/lua/5.1/torch/File.lua:369: in function 'readObject'
/home/nvidia/torch/install/share/lua/5.1/nn/Module.lua:192: in function 'read'
/home/nvidia/torch/install/share/lua/5.1/torch/File.lua:351: in function 'readObject'
/home/nvidia/torch/install/share/lua/5.1/torch/File.lua:369: in function 'readObject'
/home/nvidia/torch/install/share/lua/5.1/torch/File.lua:369: in function 'readObject'
/home/nvidia/torch/install/share/lua/5.1/torch/File.lua:409: in function 'load'
eval.lua:87: in main chunk
[C]: in function 'dofile'
...idia/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x004061f0

