
Comments (13)

tristanpenman commented on August 28, 2024

I'm going to jump in here and suggest that once we understand how the data was generated, this could serve as the basis for some good unit tests.

tristanpenman commented on August 28, 2024

I have been able to reproduce this issue on the DSSTNE AMI running on a g2.2xlarge EC2 instance, with the dataset provided. What I found is that while the predict utility is correctly loading all 65075 lines of the feature_input file, some of those lines contain duplicate IDs.

Line 64175, for example, is malformed. With hidden/control characters enabled in vi, you can see the formatting error (a second tab character):

4549498^I4549491^I4549528,10.0:4549526,10.0:4549498,10.0:4549501,10.0$

Both the generateNetCDF and predict applications should be able to detect this kind of error, and I will raise a separate issue to track that work. In the meantime, this should help you to fix the dataset itself.
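If it helps in the meantime, a small standalone checker along these lines can be used to flag lines with an unexpected number of tab characters. This is just a sketch (the name check_tabs is made up, and it is not part of DSSTNE); it assumes the expected format is a single ID, one tab, and then the colon-separated feature list:

    // check_tabs.cpp - flag lines whose tab count is not exactly one.
    // Assumes each line should look like: <id>\t<feature,value:feature,value:...>
    #include <algorithm>
    #include <fstream>
    #include <iostream>
    #include <string>

    int main(int argc, char** argv)
    {
        if (argc != 2)
        {
            std::cerr << "usage: check_tabs <file>" << std::endl;
            return 1;
        }
        std::ifstream in(argv[1]);
        std::string line;
        size_t lineNo = 0;
        while (std::getline(in, line))
        {
            ++lineNo;
            size_t tabs = std::count(line.begin(), line.end(), '\t');
            if (tabs != 1)
                std::cout << "Line " << lineNo << " has " << tabs << " tab(s)" << std::endl;
        }
        return 0;
    }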

tristanpenman commented on August 28, 2024

A quick glance at the code for the predict application suggests a couple of possible causes - we'll need to narrow this down.

We can dig into this further by rebuilding DSSTNE with the DEBUG flag enabled - the flag can be found in Makefile.inc under /src/amazon/dsstne, near the beginning of that file. If you can reproduce the issue on a debug build, the seg fault output should contain line numbers that will help narrow down the potential causes.

Be sure to run make clean before running make again.

Any other information you can provide (e.g. OS/distro, GPU used) would also be helpful.
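For reference, the rebuild steps look roughly like this (just a sketch; substitute the path of your own checkout):

    cd <your checkout>/src/amazon/dsstne
    # near the top of Makefile.inc, set DEBUG = 1
    make clean
    make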

oyotong commented on August 28, 2024

I enabled the debug flag as shown below, but I could not get more detailed debug information.
Env info:
OS: Ubuntu 14.04
CUDA: release 7.5, V7.5.17, NVIDIA-SMI 352.39
GPU: GeForce GTX 970

===== Makefile.inc [start] =====
....
CPPFLAGS = -traditional -P -std=c++0x -DMEMTRACKING -gdwarf-3
....
DEBUG = 1
ifeq ($(DEBUG), 1)
$(info ************ DEBUG mode ************)
CFLAGS = -DOMPI_SKIP_MPICXX -std=c++0x -g -O0 -DMEMTRACKING -gdwarf-3
else
....
===== Makefile.inc [end] =====

===== Make Info [start] =====
************ DEBUG mode ************
make[1]: Entering directory `/home/dsstne/amazon-dsstne/src/amazon/dsstne/utils'
===== Make Info [end] =====

===== Exception Message [start] =====
GpuContext::Startup: Process 0 out of 1 initialized.
Allocating 8 bytes of GPU memory
Mem++: 8 8
GpuContext::Startup: Single node flag on GPU for process 0 is 1
GpuContext::Startup: P2P support flags on GPU for process 0 are 1 1
GpuContext::Startup: GPU for process 0 initialized.
GpuContext::SetRandomSeed: Random seed set to 12134.
Loaded input feature index with 65064 entries.
Indexing 1 files
Indexing file: dss_sku_sku
Progress Parsing10000Time 1.0682
Progress Parsing20000Time 1.0648
Progress Parsing30000Time 0.959654
Progress Parsing40000Time 0.987968
Progress Parsing50000Time 0.783489
Progress Parsing60000Time 0.800305
Exported gl_input_predict.samplesIndex with 65075 entries.
Raw max index is: 65064
Rounded up max index to: 65152
Created NetCDF file gl_input_predict.nc for dataset gl_input
Number of network input nodes: 65064
Number of entries to generate predictions for: 65075
LoadNetCDF: Loading UInt data set
NNDataSet::NNDataSet: Name of data set: gl_input
NNDataSet::NNDataSet: Attributes: Sparse Boolean
NNDataSet::NNDataSet: 1-dimensional data comprised of (65152, 1, 1) datapoints.
NNDataSet::NNDataSet: 3778407 total datapoints.
NNDataSet::NNDataSet: 65075 examples.
[snx-dsstne:04470] *** Process received signal ***
[snx-dsstne:04470] Signal: Segmentation fault (11)
[snx-dsstne:04470] Signal code: Address not mapped (1)
[snx-dsstne:04470] Failing at address: 0xc5f77f0
[snx-dsstne:04470] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x10330)[0x7fd1a2766330]
[snx-dsstne:04470] [ 1] predict[0x447eb7]
[snx-dsstne:04470] [ 2] predict[0x43714c]
[snx-dsstne:04470] [ 3] predict[0x431088]
[snx-dsstne:04470] [ 4] predict[0x42e1f8]
[snx-dsstne:04470] [ 5] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7fd1a23b2f45]
[snx-dsstne:04470] [ 6] predict[0x407d31]
[snx-dsstne:04470] *** End of error message ***
Segmentation fault (core dumped)
===== Exception Message [end] =====

scottlegrand commented on August 28, 2024

Never mind what I wrote, could you run this from gdb?

It looks to me like the dataset has been corrupted somehow.
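Something along these lines should do it (a sketch only; pass the same arguments you normally pass to predict):

    gdb --args predict <your usual predict arguments>
    (gdb) run
    ... wait for the crash ...
    (gdb) backtrace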

oyotong commented on August 28, 2024

I ran this from gdb and got the info below:

Starting program: /home/dsstne/amazon-dsstne/src/amazon/dsstne/bin/predict -b 256 -d gl -i features_input -o features_output -k 10 -n gl.nc -f dss_sku_sku -s recs -r dss_sku_sku
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7fffee132700 (LWP 4552)]
GpuContext::Startup: Process 0 out of 1 initialized.
[New Thread 0x7fffe598a700 (LWP 4553)]
[New Thread 0x7fffdcfff700 (LWP 4554)]
Allocating 8 bytes of GPU memory
Mem++: 8 8
GpuContext::Startup: Single node flag on GPU for process 0 is 1
GpuContext::Startup: P2P support flags on GPU for process 0 are 1 1
GpuContext::Startup: GPU for process 0 initialized.
GpuContext::SetRandomSeed: Random seed set to 12134.
Loaded input feature index with 65064 entries.
Indexing 1 files
Indexing file: dss_sku_sku
Progress Parsing10000Time 1.07443
Progress Parsing20000Time 1.07139
Progress Parsing30000Time 0.968824
Progress Parsing40000Time 0.994079
Progress Parsing50000Time 0.787785
Progress Parsing60000Time 0.80526
Exported gl_input_predict.samplesIndex with 65075 entries.
Raw max index is: 65064
Rounded up max index to: 65152
Created NetCDF file gl_input_predict.nc for dataset gl_input
Number of network input nodes: 65064
Number of entries to generate predictions for: 65075
LoadNetCDF: Loading UInt data set
NNDataSet::NNDataSet: Name of data set: gl_input
NNDataSet::NNDataSet: Attributes: Sparse Boolean
NNDataSet::NNDataSet: 1-dimensional data comprised of (65152, 1, 1) datapoints.
NNDataSet::NNDataSet: 3778407 total datapoints.
NNDataSet::NNDataSet: 65075 examples.

Program received signal SIGSEGV, Segmentation fault.
0x0000000000447eb7 in NNDataSet::CalculateSparseDatapointCounts (this=0x8b69a40) at NNTypes.cpp:868

868 _vSparseDatapointCount[x]++;

scottlegrand commented on August 28, 2024

Awesome, so looking at that section:

    // Calculate individual counts for each datapoint
    uint64_t N = _width * _height * _length;
    _vSparseDatapointCount.resize(N);
    std::fill(_vSparseDatapointCount.begin(), _vSparseDatapointCount.end(), 0);
    for (auto x : _vSparseIndex)
    {
        _vSparseDatapointCount[x]++;
    }

you have a sparse index that is out of range. Can you check that all of the indices in

vector<uint32_t> _vSparseIndex

are < 65152? I'm betting that they're not... or, in this case, just test x inside the loop.
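A quick way to check is to look at the largest index before the counting loop runs. This is just a sketch written against the member names shown above, assuming <algorithm> and <iostream> are available in NNTypes.cpp:

    // Find the largest sparse index and compare it against N (= _width * _height * _length).
    auto maxIt = std::max_element(_vSparseIndex.begin(), _vSparseIndex.end());
    if (maxIt != _vSparseIndex.end() && *maxIt >= N)
    {
        std::cout << "Largest sparse index is " << *maxIt
                  << ", which is >= " << N << std::endl;
    }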

oyotong commented on August 28, 2024

Is this an issue?
How can I fix or bypass it?

rgeorgej commented on August 28, 2024

Can you send us the steps you followed, along with a sample of the data?

scottlegrand commented on August 28, 2024

Yes, the dataset appears to be corrupted with out-of-range indices. How exactly was the dataset generated?

Also, we should add guard code to detect this situation, but the dataset itself will still need to be fixed.
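For example, the loop in NNDataSet::CalculateSparseDatapointCounts could be hardened along these lines (a sketch of the idea only, not what was actually committed to the repo):

    for (auto x : _vSparseIndex)
    {
        // Bail out with a readable message instead of writing past the end of the vector.
        if (x >= N)
        {
            std::cout << "NNDataSet::CalculateSparseDatapointCounts: sparse index " << x
                      << " is out of range (limit " << N << ")" << std::endl;
            exit(-1); // or report the error back to the caller
        }
        _vSparseDatapointCount[x]++;
    }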

oyotong commented on August 28, 2024

You can get the dataset from here -- could you help to test it?
https://s3.amazonaws.com/andy.tang.test/dataset.zip

generateNetCDF -d gl_input -i dss_sku_sku -o gl_input.nc -f features_input -s samples_input -c
generateNetCDF -d gl_output -i dss_sku_sku -o gl_output.nc -f features_output -s samples_input -c
train -c config.json -i gl_input.nc -o gl_output.nc -n gl.nc -b 256 -e 10
predict -b 256 -d gl -i features_input -o features_output -k 10 -n gl.nc -f dss_sku_sku -s recs -r dss_sku_sku

scottlegrand commented on August 28, 2024

Interesting, I get a differently sized dataset.
./generateNetCDF -d gl_input -i dss_sku_sku -o gl_input.nc -f features_input -s samples_input -c
Flag -c is set. Will create a new feature file and overwrite: features_input
Generating dataset of type: indicator
Will create a new samples index file: samples_input
Will create a new features index file: features_input
Indexing 1 files
Indexing file: dss_sku_sku
Progress Parsing10000Time 0.827208
Progress Parsing20000Time 0.749772
Progress Parsing30000Time 0.670679
Progress Parsing40000Time 0.685743
Progress Parsing50000Time 0.54209
Progress Parsing60000Time 0.556289
Exported features_input with 65217 entries.
Exported samples_input with 65075 entries.
Raw max index is: 65217
Rounded up max index to: 65280
Created NetCDF file gl_input.nc for dataset gl_input
Total time for generating NetCDF: 4.54689 secs.

Can you pull ToT (tip of tree), rebuild, and try again?

oyotong commented on August 28, 2024

Thank you for your help!

I fixed the malformed data, and it works fine now.
