dhernandez0 / sgm

Semi-Global Matching on the GPU

Home Page: http://danihernandez.eu

License: GNU General Public License v3.0

CMake 1.53% C 5.78% C++ 61.96% Cuda 30.72%

sgm's People

Contributors

Stargazers

Watchers

Forkers

sg47 wenjiebit caomw luofalei ygling2008 warrenwg celesius klqulei githubfragments cguindel freshbo gbyfbi arrfou90 wang-kx xepost eulerhit superpoca catree jinay1991 nevermoredanny shijies linsonw vbillys cristianolado sdut10523 pogilon geographerwang ethanjli oup2uo waynetimer karlholmquist scholltan yyd01245 zhangsongyi jameslinus labimage gzjas gaowenliang 3000huyang perception-slam wk199 yheno zhangguanghui1 magican li9616 tuzhenyuan pras2020 gnetwvs kongan sunguangjiang sky-github nicknuaa wyinggui eglrp maotianwhu leng123ku kazitown xiaohedu hejunbok yiboliu31 promethues jixianghu minxuanjun vcmman lmg1994 wpfhtl keochi wsywf ason93 micalson yingcai88 greatqz irobotdavinci misslibra frankgt microyy 8bitcatjqw liu-yangyang onewillow a349101311 yangxingbin avanindra dftlg koichihik limit79 robinzhoucmu jayle19930918 fanjinfei whigg kewangtt cnuxdh tammam1998 julianliao robofisshy solrac3589 mrx333 rensimon wwqyaya xtsxisaxns ling-zzz

sgm's Issues

Black images with 2 neighbours

Hello,
I have been trying out this project on a Jetson TX1 with CUDA 8.0, and in most cases obtained great results and performance. However, when computing the stereo matching with only 2 neighbours, the program returned completely black images.
This behavior is really strange, considering that the algorithm runs at around 80 fps with this option, which is consistent with the expected performance.

Thank you very much, and I will remain available for any enquiry.

Cannot compile on Visual Studio - Windows 10

Is it possible to use this library with Windows 10 and Visual Studio 2019 instead of Linux?

Windows port

Has anybody successfully made, or found, a windows port?

execution error

When I run this code with "./sgm dir p1 p2", I get the error "Error: no kernel image is available for execution on the device 48". What is happening? How can I solve this problem? Thank you!

How are matching cost computation and horizontal path aggregation fused

Dear @dhernandez0 ,

Thank you for this great implementation of SGM. I'm relatively new to CUDA and get a bit confused while studying your code. As mentioned in the paper, the computation of the initial matching cost is fused with the horizontal path aggregation. In the code, I can see that the matching cost is computed on stream 1, while the leftToRight and rightToLeft computations are done on stream2 and stream3 respectively, which means they run in parallel. How is this possible? Shouldn't the matching cost be completed before the path aggregation starts, since the latter depends on the former? Or is there some sort of synchronization between these two parts of the code that I missed?

Sorry for disturbing you with this trivial question. I would appreciate it a lot if you can help me understand it.

Low FPS with Visual Studio 2019

It only reaches 7 FPS with an RTX 2080 Super, Visual Studio 2019, Windows 10 and a Core i9-9900K.

Any suggestions?

Error: no kernel image is available for execution on the device 48

When I run the command './seg dir p1 p2', I get the error 'Error: no kernel image is available for execution on the device 48'. Can someone help me? CUDA 10.0

Error: no kernel image is available for execution on the device 209

I have tried several different CUDA versions (8.0, 9.0, 10.1), but none of them work.
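Both numbers are probably the same runtime error, `cudaErrorNoKernelImageForDevice`. Its numeric value changed from 48 to 209 in CUDA 10.1. The error usually means the binary contains no code for the GPU's compute capability, so switching CUDA versions alone does not fix it. A minimal sketch, assuming the fix is to add an nvcc `-gencode` flag that matches the device: the hypothetical helper below builds that flag from the compute capability, which `cudaGetDeviceProperties` reports in `prop.major` and `prop.minor`. Where the flag belongs (for example, the NVCC flags set in the project's CMakeLists.txt) is an assumption and is not confirmed here.

```cpp
#include <string>

// Hypothetical helper: build an nvcc -gencode flag for one compute capability,
// e.g. (6, 2) for a Jetson TX2 or (7, 5) for an RTX 2080 Super.
// "arch=compute_XY" selects the virtual architecture used to generate PTX;
// "code=sm_XY" embeds native machine code for exactly that GPU generation.
std::string gencode_flag(int major, int minor) {
    const std::string cc = std::to_string(major) + std::to_string(minor);
    return "-gencode arch=compute_" + cc + ",code=sm_" + cc;
}
```

For a Jetson TX2 (compute capability 6.2), `gencode_flag(6, 2)` returns `-gencode arch=compute_62,code=sm_62`. An RTX 2080 Super (7.5) would need `compute_75`/`sm_75`, which in turn requires a CUDA release new enough to know that architecture.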

Help understanding "recompute" and the use of rp0, rp1, rp2, rp3

Hi dhernandez,
It is so far the best GPU implementation of SGM I have found. Great work. I am trying to implement SGM as one part of my thesis project. I have tried to understand your code and I have a few questions.

Why is this recompute required in the first place? Can't the LEFT-RIGHT cost aggregation work like the UP_DOWN one?
In the LEFT_RIGHT cost aggregation, recompute is ON and it uses the rp0, rp1, rp2, rp3 registers. Can you explain how these are used?
In cost_aggregation.h, line 106: the else block is never used.

Any help would be appreciated. Thank you.

Some wrong with the processing speed

Hi, Daniel Hernandez-Juarez
Thanks for your code.
It helps me a lot to understand the SGM.
I have run your code on a Jetson TX2 without problems.
But I have some trouble with the processing speed.
I changed the input images to 640x480 and got the following processing speeds:
4 path directions: 20 fps
8 path directions: 9 fps
Both are about half of your results.
I don't know the reason for that. Maybe the Pascal architecture cannot run your code very well?
If you know something about this, could you help me?
Thank you!

GpuMat input to compute results in invalid output

I am trying to change the compute function to work with cv::cuda::GpuMat directly instead of cv::Mat.

But I am having issues with cudaMemcpyAsync:
CUDA_CHECK_RETURN(cudaMemcpyAsync(d_im0, left.ptr<uint8_t>(), sizeof(uint8_t)*size, cudaMemcpyDeviceToDevice, stream1));

In the above code, "left" is a GpuMat. I have tried various methods to overcome this issue, but I was not successful.

Find the faulty image attached.

How to copy the GpuMat data to "d_im0" properly?

PS: A similar issue is discussed in http://answers.opencv.org/question/167465/how-to-pass-an-already-stored-data-in-gpu-by-gpumat-to-a-kernel/. Kindly let us know whether this could be the reason for the issue and, if so, a workaround.
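A likely cause is row padding. A `cv::cuda::GpuMat` is normally allocated with padded rows, so consecutive rows are `left.step` bytes apart rather than `cols` bytes. A single flat copy of `rows*cols` bytes therefore pulls the padding into `d_im0` and shears the image. A pitched 2-D copy avoids this. On the GPU that would be `cudaMemcpy2DAsync(d_im0, cols, left.ptr<uint8_t>(), left.step, cols, rows, cudaMemcpyDeviceToDevice, stream1)`, assuming `d_im0` is a dense `rows*cols` byte buffer as in the snippet above. The CPU sketch below (the function name is hypothetical) shows what such a copy does row by row.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// CPU sketch of a pitched-to-dense 2-D copy. The source has rows that are
// src_step bytes apart (cols bytes of pixels plus padding); the destination
// stores the rows back to back with no padding.
// Hypothetical helper, mirroring what cudaMemcpy2DAsync does on the device.
void copy_pitched_to_dense(uint8_t* dst, const uint8_t* src, std::size_t src_step,
                           std::size_t cols, std::size_t rows) {
    for (std::size_t r = 0; r < rows; ++r) {
        // Copy only the cols pixel bytes of each row and skip its padding.
        std::memcpy(dst + r * cols, src + r * src_step, cols);
    }
}
```

`left.isContinuous()` is a quick check: if it returns false, the flat copy cannot work as written.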

The processing speed is different.

Hi.
Thanks for your code.
I ran the code on a GTX TITAN X and on a TX2.
For 4 paths, the speed on image 2 is roughly 232 fps and 11 fps, respectively.
For 8 paths, the speed on image 2 is roughly 123 fps and 5 fps, respectively.
I cannot find the reason. Could you tell me?

Disparity layers

Hi,
I've used the algorithm and got really good results, but when I reprojected the 2D points to 3D using the calculated disparity values, I got the following point cloud. At the end of the SGM implementation, we have an array of aggregated costs for each pixel, and the disparity is the index with the lowest cost value. I think this produces separate layers in the point cloud, i.e. round surfaces are cut into many layers (see the example point cloud).
Does anyone have an idea why these layers appear? In my case (my stereo vision setup), they are 5 mm apart.

Thank you for your help!
Arrfou
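The layers are expected with integer disparities. Depth Z = f·B/d can take only one value per integer d, so surfaces get quantized into slices. A common remedy is sub-pixel refinement, which this repository does not necessarily implement: fit a parabola through the aggregated costs at d-1, d and d+1 around the winning disparity and take its minimum. Below is a minimal sketch for one pixel. It assumes the final per-pixel cost curve is available on the host, and the function name is hypothetical.

```cpp
#include <vector>

// Parabola-fit sub-pixel refinement for one pixel.
// cost: aggregated cost for each disparity; d: integer winner (argmin).
// Fits y = a*x^2 + b*x + c through (d-1, d, d+1) and returns the vertex.
float subpixel_disparity(const std::vector<float>& cost, int d) {
    // At the ends of the range there is no neighbour on one side.
    if (d <= 0 || d >= static_cast<int>(cost.size()) - 1) return static_cast<float>(d);
    const float cm = cost[d - 1], c0 = cost[d], cp = cost[d + 1];
    const float denom = cm - 2.0f * c0 + cp;  // curvature, > 0 at a strict minimum
    if (denom <= 0.0f) return static_cast<float>(d);
    // Vertex offset in [-0.5, 0.5], shifted toward the cheaper neighbour.
    return d + (cm - cp) / (2.0f * denom);
}
```

For costs {4, 1, 2} around a winner at d = 1, the refined disparity is 1.25.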

How to set the p1 and p2 values

Hi,

First of all, thanks for sharing your code to community.

I have tried to run your code with ./sgm dir p1=1000 p2=2000 on the second example image set. However, I got the disparity image below. I guess this may be caused by setting the p1 and p2 values incorrectly? Could you also provide some guidance on how to set p1 and p2?

Mike
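In SGM, p1 and p2 are the smoothness penalties in the path-aggregation recurrence. P1 is added when the disparity changes by exactly 1 between neighbouring pixels, and P2 (normally P2 >= P1) is added for any larger jump. Both are compared directly against matching costs, so they need to be on the same scale as those costs. Penalties far above typical costs tend to over-smooth the result. The CPU sketch below implements the standard recurrence for one path along one scanline. It does not cover this repository's cost scale or how main.cu parses the command-line values, and the function name is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// One SGM aggregation path along a scanline (standard recurrence):
// L(x,d) = C(x,d) + min(L(x-1,d), L(x-1,d+-1) + P1, min_k L(x-1,k) + P2)
//          - min_k L(x-1,k)
// cost[x][d] is the matching cost of pixel x at disparity d.
std::vector<std::vector<int>> aggregate_path(const std::vector<std::vector<int>>& cost,
                                             int P1, int P2) {
    std::vector<std::vector<int>> L(cost);  // first pixel: L = C
    for (std::size_t x = 1; x < cost.size(); ++x) {
        const std::vector<int>& prev = L[x - 1];
        const int D = static_cast<int>(prev.size());
        const int prev_min = *std::min_element(prev.begin(), prev.end());
        for (int d = 0; d < D; ++d) {
            int best = prev[d];                                      // same disparity
            if (d > 0) best = std::min(best, prev[d - 1] + P1);      // change of -1
            if (d + 1 < D) best = std::min(best, prev[d + 1] + P1);  // change of +1
            best = std::min(best, prev_min + P2);                    // larger jump
            // Subtracting prev_min keeps values bounded without changing the argmin.
            L[x][d] = cost[x][d] + best - prev_min;
        }
    }
    return L;
}
```

With cost rows {0, 5, 5} and {5, 5, 0}, P1 = 1 and P2 = 3, the second pixel aggregates to {5, 6, 3}, so its own minimum at d = 2 still wins. Raising P2 to 100 gives {5, 6, 5}, and the jump no longer wins outright.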

How to generate both left and right disparity maps in order to perform a consistency check?

Hello,
Your implementation has excellent performance in terms of speed. However, the disparity map generated by your project has some noise. Is there any way to generate both left and right disparity maps in order to perform a consistency check?
Thanks
Bao
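A left-right consistency check compares each left-referenced disparity with the right-referenced disparity at the matching pixel, and discards pixels where the two disagree. One way to get the right-referenced map without changing the kernels is to mirror both images horizontally, swap them, run the same method, and mirror the output back. That approach is an assumption, not a feature of this repository. Below is a minimal CPU sketch of the check itself for 8-bit maps; the function name is hypothetical.

```cpp
#include <cstdint>
#include <cstdlib>
#include <vector>

// Left-right consistency check on row-major 8-bit disparity maps.
// disp_left is referenced to the left image (x_right = x_left - d);
// disp_right is referenced to the right image (x_left = x_right + d).
// Pixels whose two estimates differ by more than max_diff become `invalid`.
std::vector<uint8_t> lr_check(const std::vector<uint8_t>& disp_left,
                              const std::vector<uint8_t>& disp_right,
                              int width, int height, int max_diff, uint8_t invalid) {
    std::vector<uint8_t> out(disp_left);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            const int d = disp_left[y * width + x];
            const int xr = x - d;  // matching column in the right image
            if (xr < 0 || std::abs(d - disp_right[y * width + xr]) > max_diff) {
                out[y * width + x] = invalid;  // out of view or inconsistent
            }
        }
    }
    return out;
}
```

With `max_diff = 1`, occluded and mismatched pixels become `invalid` (for example 255), and they can then be skipped or filled.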

Indexing bug at cost_aggregation.h:49

@dhernandez0
At cost_aggregation.h line 49,
49 if(add_index > 0 && *col > cols) {
Sorry if I am wrong, but shouldn't this be (*col > cols - 1) or (*col >= cols)?
As it is right now, col can reach W, whereas it should range from 0 to W-1.

Disparity image looks very bad

Hey,
I am using a GTX 730 graphics card with compute capability 3.0+.
I built the code with some changes to get it running (most of the errors came from __restrict, and when I removed __restrict it compiled and ran properly), but I didn't change any of the parameters. I am attaching the output files. Could you tell me which parameters I need to change? The image looks fragmented into many pieces. I used 2.png from the examples to make the disparity map.

This one is with 4 paths

This one is with 8 paths

How to get a stable depth map

Hey Hernandez, great CUDA code by the way. I am testing your code with a stereo camera. I saw a few of your results in your paper and in your video, and the disparity maps look very smooth. But when I tested with a camera fixed in place and pointing down at a static scene, the depth map was somewhat unstable, especially in far-away areas. The depth value of some areas varies between 1 meter and 20 meters. I have played with the p1 and p2 values but still cannot get a stable depth map. Any suggestions? Thanks!

Clarification regarding Dataset and p1 and p2 value

Dear All

I built sgm on a Jetson TX2 and ran it on the sample dataset, and the output and performance were really good. But when I tried my own dataset, captured with my own stereo camera, the output was noisy. Can someone help clarify this? I am attaching the images below.

cone dataset

my own dataset

Type of disparity image

Hello, in the final results the disparity image is 8-bit and the disparity ranges from 0 to 127, so there are only 128 distinct distances. In the 0.6-1 m range, decreasing the disparity by 1 increases the distance by about 1 cm, but at around 3 meters, decreasing the disparity by 1 increases the distance by about 10 cm. Is there a way to change the disparity image type to 16 bits?
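These numbers follow from the depth equation Z = f·B/d, where f is the focal length in pixels and B is the baseline. One disparity step changes the depth by roughly Z²/(f·B), so the step grows quadratically with distance. Going from 1 m to 3 m multiplies it by 9, which is consistent with the 1 cm versus 10 cm observed. A 16-bit output only helps if it carries sub-pixel disparities. For example, OpenCV's StereoSGBM stores disparity × 16 (4 fractional bits) in a 16-bit image. The helpers below are a hypothetical sketch of both calculations, with f and B taken from your own calibration.

```cpp
#include <cmath>
#include <cstdint>

// Depth of a rectified stereo pair: Z = f * B / d
// (f in pixels, B in metres, d in pixels).
double depth_from_disparity(double f_px, double baseline_m, double d_px) {
    return f_px * baseline_m / d_px;
}

// Approximate depth change for a disparity change of step_px at depth z_m:
// dZ ~= Z^2 / (f * B) * step, i.e. quadratic in distance.
double depth_step(double f_px, double baseline_m, double z_m, double step_px) {
    return z_m * z_m / (f_px * baseline_m) * step_px;
}

// Hypothetical 16-bit fixed-point encoding with 4 fractional bits
// (stored value = disparity * 16), the convention of OpenCV's StereoSGBM output.
uint16_t encode_disparity_q4(float d_px) {
    return static_cast<uint16_t>(std::lround(d_px * 16.0f));
}
```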

Understanding d_cost arrangement in memory

Hi Mr Hernandez,

First, I just wanted to say great job with this repository. It's one of the best implementations of CUDA accelerated SGM I've come across. Thank you very much for this!

I'm trying to dig into this code and implement some extra functionality on top of it. I've been struggling to understand the layout of the d_cost volume.

I'm working with an image of width=960, height=800, disparity_levels=128. Therefore the size of d_cost is 800 × 960 × 128 = 98,304,000, which makes sense. However, I'm not sure I understand how the elements are arranged in this data structure. Is the volume arranged in row->column->disparity order?

i.e,
If i wanted to find the cost associated with co-ordinate location (row=2,column=1,disparity_level=0),
would this global memory indexing scheme hold true?
total_columns = 960;
MAX_DISPARITY=128;
index = d_cost[row x total_columns+column+(total_rows x total_columns) x disparity_level]
index = d_cost[2 x 960+1+(800 x 960) x 0]
index = d_cost[1921]

And with this assumption, the same 2D co-ordinate location in the next disparity level would lie in index location d_cost[2 x 960+1+(800 x 960) x 1] = d_cost[768000+1921] = d_cost[769921]?

I tried using this arrangement, and the results don't seem to make sense. I would greatly appreciate any input regarding how the d_cost data structure is laid out in memory.

This isn't an issue regarding the code repository, so I understand if you wish to close this issue immediately.

Thanks in advance,
Shreyas
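Two index helpers make the two plausible layouts concrete. Which layout d_cost actually uses is not confirmed here. One way to find out is to take the per-pixel argmin over disparities under each assumption and see which one reproduces the disparity image the program writes. The function names are hypothetical.

```cpp
#include <cstddef>

// Disparity-outermost layout (the one assumed in the question):
// one full rows*cols image per disparity level.
std::size_t index_disparity_outer(std::size_t row, std::size_t col, std::size_t d,
                                  std::size_t rows, std::size_t cols) {
    return d * rows * cols + row * cols + col;
}

// Disparity-innermost layout: the D costs of one pixel are contiguous
// in memory, and pixels follow in row-major order.
std::size_t index_disparity_inner(std::size_t row, std::size_t col, std::size_t d,
                                  std::size_t cols, std::size_t D) {
    return (row * cols + col) * D + d;
}
```

For (row=2, column=1), the disparity-outermost formula gives 1921 and 769921 for d = 0 and d = 1, matching the hand calculation. The disparity-innermost layout would put those costs at 245888 and 245889.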

Compiling on Jetson TK1

I tried to compile on my Jetson TK1,
but some errors occurred:


/home/ubuntu/SOUNANSU/sgm/util.h(142): error: asm operand type size(4) does not match type/size implied by constraint 'l'

/home/ubuntu/SOUNANSU/sgm/util.h(148): error: asm operand type size(4) does not match type/size implied by constraint 'l'

/home/ubuntu/SOUNANSU/sgm/util.h(157): error: asm operand type size(4) does not match type/size implied by constraint 'l'

So I changed util.h as below:


diff --git a/util.h b/util.h
index 1b0a76c..f4dc940 100644
--- a/util.h
+++ b/util.h
@@ -139,22 +139,22 @@ __inline__ __device__ int shfl_xor_32(int scalarValue, const int n) {

 __device__ __forceinline__ uint32_t ld_gbl_ca(const __restrict__ uint32_t *addr) {
        uint32_t return_value;
-       asm("ld.global.ca.u32 %0, [%1];" : "=r"(return_value) : "l"(addr));
+       asm("ld.global.ca.u32 %0, [%1];" : "=r"(return_value) : "r"(addr));
        return return_value;
 }

 __device__ __forceinline__ uint32_t ld_gbl_cs(const __restrict__ uint32_t *addr) {
        uint32_t return_value;
-       asm("ld.global.cs.u32 %0, [%1];" : "=r"(return_value) : "l"(addr));
+       asm("ld.global.cs.u32 %0, [%1];" : "=r"(return_value) : "r"(addr));
        return return_value;
 }

 __device__ __forceinline__ void st_gbl_wt(const __restrict__ uint32_t *addr, const uint32_t value) {
-       asm("st.global.wt.u32 [%0], %1;" :: "l"(addr), "r"(value));
+       asm("st.global.wt.u32 [%0], %1;" :: "r"(addr), "r"(value));
 }

 __device__ __forceinline__ void st_gbl_cs(const __restrict__ uint32_t *addr, const uint32_t value) {
-       asm("st.global.cs.u32 [%0], %1;" :: "l"(addr), "r"(value));
+       asm("st.global.cs.u32 [%0], %1;" :: "r"(addr), "r"(value));
 }

 __device__ __forceinline__ uint32_t gpu_get_sm_idx(){

Are these changes correct?
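The Jetson TK1 runs 32-bit ARM code, so pointers there are 4 bytes. The `"l"` constraint asks for a 64-bit register, which is why nvcc rejects it, while `"r"` (a 32-bit register) matches. The diff therefore looks right for the TK1, but it would break 64-bit builds, which need `"l"`. Below is a sketch of a portable variant. It assumes pointer width can be detected with the usual `_WIN64`/`__LP64__` macros, and `ASM_PTR` and `asm_ptr_is_64bit` are hypothetical names.

```cpp
#include <cstdint>

// Choose the inline-PTX constraint for pointer operands from the pointer width:
// "l" binds a 64-bit (.u64) register, "r" a 32-bit (.u32) register.
#if defined(_WIN64) || defined(__LP64__)
#define ASM_PTR "l"
#else
#define ASM_PTR "r"
#endif

#ifdef __CUDACC__
// Same cached load as ld_gbl_ca in util.h, valid for 32- and 64-bit builds.
__device__ __forceinline__ uint32_t ld_gbl_ca(const uint32_t* __restrict__ addr) {
    uint32_t return_value;
    asm("ld.global.ca.u32 %0, [%1];" : "=r"(return_value) : ASM_PTR(addr));
    return return_value;
}
#endif

// Host-side view of the same decision, useful as a sanity check.
constexpr bool asm_ptr_is_64bit() { return sizeof(void*) == 8; }
```

The other three helpers in util.h would take the same `ASM_PTR(addr)` operand in place of the hard-coded constraint.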

Maximum disparity has to be 128

Hello Daniel,
I want to increase the disparity level to more than 256, but I assume this is a limitation (I guess due to the GPU memory size). Is it possible to increase the disparity level, even with a low image resolution?
Thanks in advance
Arrfou
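A quick estimate shows whether GPU memory is really the limit. The sketch below multiplies out the size of the cost volumes. The number of volumes alive at once and the bytes per entry are assumptions to fill in from the code; the build log elsewhere on this page mentions variables up to d_L7, which suggests one buffer per path direction. The function name is hypothetical.

```cpp
#include <cstdint>

// Rough device-memory estimate for `volumes` cost volumes of
// rows x cols x max_disparity entries, each bytes_per_entry bytes.
// Both `volumes` and `bytes_per_entry` are assumptions to read off the code.
std::uint64_t cost_volume_bytes(std::uint64_t rows, std::uint64_t cols,
                                std::uint64_t max_disparity,
                                std::uint64_t bytes_per_entry,
                                std::uint64_t volumes) {
    return rows * cols * max_disparity * bytes_per_entry * volumes;
}
```

At 640x480 with 256 disparity levels, one volume of one-byte entries is 78,643,200 bytes (75 MiB). If the total stays well below the card's memory, the limit more likely comes from how the kernels are written, which would need to be checked in the code.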

Performance drops on Jetson TX1 and TX2

We just did some experiments on both the Jetson TX1 and TX2 boards, but the fps results were much lower than expected. Our records are:

Jetson TX1: 13.4 fps
Jetson TX2: 27 fps
We set both boards to maximum power mode and used CUDA 9 and Ubuntu 16 to run the experiments.
Could you please check and help us with this problem?

Why don't your result images look very good to me?

Parameter setting on Kitti dataset

Hi,

Thanks for sharing your code with the community!

I noticed that you tested the algorithm on the Kitti dataset in the paper. Would you mind sharing the parameters (e.g. the p1 and p2 values and the resized image size) you used on Kitti?

Thanks a lot.

Mike

Strange disparity map result

I tested on Ubuntu with CUDA 10.2, and I may have gotten a wrong disparity map: it is different from the disparity maps produced by other SGM implementations.
Did I set 'p1' or 'p2' incorrectly?
Thanks

Same code different machines, different results

I have tested your code on two different machines, one with a Maxwell card and the other with a Pascal card. The Maxwell machine runs fine, but the Pascal one gives me a bad disparity map. Any idea why? Could you give me some guidelines for debugging? I have enclosed the bad result.

SGM on OpenCV

Hello,
I am trying to use your GPU SGM code on a Jetson TX1 with OpenCV4Tegra and CUDA 8, but the results are black images with an unreasonably short processing time of around 0.1 ms! Could you please tell me which configuration (i.e. OpenCV version, CUDA version and JetPack version) you used on the Tegra X1/Jetson TX1?
However, SGM compiles and runs successfully on a PC (GTX 980, OpenCV 2 and CUDA 8.0).
Thanks in Advance.

Disparity value 64 instead of 128

Hello All

Is it possible to change the maximum disparity value from 128 to 64? I saw the closed issue #4
regarding the disparity value and tried changing uint32_t to uint16_t, but I am not sure I am doing it correctly. Can someone help me change the maximum disparity value to 64?

strange disparity results on Xavier

Build error

Hello,
I am trying to use your SGM code, but I can't build it because I get an "undefined reference to" error. I use Ubuntu 16.04 LTS, OpenCV 3.3.1 (included in ROS), CMake 3.5.1, CUDA 8.0 and an NVIDIA GTX 1080. After several hours of searching on Google and trying to solve this error, I still can't build your code. It seems that the problem is in OpenCV. I am a newbie in Linux, CV and CUDA, so I need help.

Here my build log:

anton@smartcar:~/sgm/build$ cmake ..

The C compiler identification is GNU 5.4.0
The CXX compiler identification is GNU 5.4.0
Check for working C compiler: /usr/bin/cc
Check for working C compiler: /usr/bin/cc -- works
Detecting C compiler ABI info
Detecting C compiler ABI info - done
Detecting C compile features
Detecting C compile features - done
Check for working CXX compiler: /usr/bin/c++
Check for working CXX compiler: /usr/bin/c++ -- works
Detecting CXX compiler ABI info
Detecting CXX compiler ABI info - done
Detecting CXX compile features
Detecting CXX compile features - done
Found OpenCV: /opt/ros/kinetic (found version "3.3.1")
Looking for pthread.h
Looking for pthread.h - found
Looking for pthread_create
Looking for pthread_create - not found
Looking for pthread_create in pthreads
Looking for pthread_create in pthreads - not found
Looking for pthread_create in pthread
Looking for pthread_create in pthread - found
Found Threads: TRUE
Found CUDA: /usr/local/cuda (found version "8.0")
Configuring done
Generating done
Build files have been written to: /home/anton/sgm/build

anton@smartcar:~/sgm/build$ make
[ 14%] Building NVCC (Device) object CMakeFiles/sgm.dir/sgm_generated_costs.cu.o
[ 28%] Building NVCC (Device) object CMakeFiles/sgm.dir/sgm_generated_main.cu.o
[ 42%] Building NVCC (Device) object CMakeFiles/sgm.dir/sgm_generated_median_filter.cu.o
[ 57%] Building NVCC (Device) object CMakeFiles/sgm.dir/sgm_generated_hamming_cost.cu.o
[ 71%] Building NVCC (Device) object CMakeFiles/sgm.dir/sgm_generated_disparity_method.cu.o
/home/anton/sgm/disparity_method.cu(40): warning: variable "d_L7" was declared but never referenced

/home/anton/sgm/disparity_method.cu(40): warning: variable "d_L7" was declared but never referenced

[ 85%] Building NVCC (Device) object CMakeFiles/sgm.dir/sgm_generated_debug.cu.o
Scanning dependencies of target sgm
[100%] Linking CXX executable sgm
CMakeFiles/sgm.dir/sgm_generated_main.cu.o: In function `disparity_errors(cv::Mat, char const*, int*, int*)':
/home/anton/sgm/main.cu:54: undefined reference to `cv::imread(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int)'
CMakeFiles/sgm.dir/sgm_generated_main.cu.o: In function `main':
/home/anton/sgm/main.cu:164: undefined reference to `cv::imread(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int)'
/home/anton/sgm/main.cu:169: undefined reference to `cv::imread(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int)'
/home/anton/sgm/main.cu:177: undefined reference to `cv::_OutputArray::_OutputArray(cv::Mat&)'
/home/anton/sgm/main.cu:177: undefined reference to `cv::_InputArray::_InputArray(cv::Mat const&)'
/home/anton/sgm/main.cu:181: undefined reference to `cv::_OutputArray::_OutputArray(cv::Mat&)'
/home/anton/sgm/main.cu:181: undefined reference to `cv::_InputArray::_InputArray(cv::Mat const&)'
/home/anton/sgm/main.cu:211: undefined reference to `cv::_InputArray::_InputArray(cv::Mat const&)'
/home/anton/sgm/main.cu:211: undefined reference to `cv::imwrite(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, cv::_InputArray const&, std::vector<int, std::allocator<int> > const&)'
/home/anton/sgm/main.cu:220: undefined reference to `cv::_InputArray::_InputArray(cv::Mat const&)'
/home/anton/sgm/main.cu:220: undefined reference to `cv::imwrite(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, cv::_InputArray const&, std::vector<int, std::allocator<int> > const&)'
collect2: error: ld returned 1 exit status
CMakeFiles/sgm.dir/build.make:169: recipe for target 'sgm' failed
make[2]: *** [sgm] Error 1
CMakeFiles/Makefile2:67: recipe for target 'CMakeFiles/sgm.dir/all' failed
make[1]: *** [CMakeFiles/sgm.dir/all] Error 2
Makefile:83: recipe for target 'all' failed
make: *** [all] Error 2

Best regards, Anton.

"shadow" appears on disparity result

Dear @dhernandez0 ,

Thank you very much for the awesome SGM algorithms on GPU.

I'm testing your repository and the results look very good.
However, a shadow appears in the disparity results. Configuration: PATH_AGGREGATION = 8

I used Kalibr for the calibration step, tested with OpenCV's SGBM and got good results.

What do you think about these results? Could you guide me on how to overcome this issue?

Many thanks in advance.

Best regards,
Hiep

No Makefile

I get no Makefile after running cmake, and I can't find the reason. Thank you.

How to filter sky noise?

Hello, I have a problem. There is a lot of sky noise (sky matching errors) when I test SGM outdoors. Have you encountered this? Do you know how to filter the sky noise (or detect the sky and then remove it) on the GPU?
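Sky and other textureless regions give nearly flat cost curves, so the winning disparity there is essentially arbitrary. One common filter rejects a pixel when some disparity away from the winner costs almost as little as the best one. It is similar to OpenCV StereoSGBM's `uniquenessRatio` and is not part of this repository. Below is a per-pixel CPU sketch with a hypothetical function name; on the GPU, the same test would run over each pixel's aggregated costs.

```cpp
#include <vector>

// Uniqueness test on one pixel's cost curve.
// Returns the winning disparity, or -1 if some disparity outside the winner's
// +-1 neighbourhood has a cost within ratio_percent percent of the best cost.
int unique_disparity(const std::vector<int>& cost, int ratio_percent) {
    const int D = static_cast<int>(cost.size());
    int best_d = 0;
    for (int d = 1; d < D; ++d)
        if (cost[d] < cost[best_d]) best_d = d;
    for (int d = 0; d < D; ++d) {
        if (d >= best_d - 1 && d <= best_d + 1) continue;  // ignore the winner's neighbours
        // Ambiguous if cost[d] < best * (1 + ratio/100), in integer arithmetic.
        if (cost[d] * 100 < cost[best_d] * (100 + ratio_percent)) return -1;
    }
    return best_d;
}
```

With `ratio_percent = 10`, the curve {10, 2, 3, 10, 10} keeps disparity 1, while a flat curve {5, 5, 5, 5, 5} is rejected.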

casting with uint8_t instead of float

Hello,
I have encountered this problem before, and I think line 204 in main.cu should be
const float d = disparity_im.at<uint8_t >(i, j);
Otherwise the output images in CV_16UC1 format are blank.

Thanks
Arrfou

disparity is quite noisy with own data

Thanks for sharing your work. The speed is really fast.

However, I got a much noisier disparity map with my own camera data (a much more complex scene),
while the disparity of the provided example stereo images is smoother.
Any idea?

TX2 and 8 paths does not work

Obviously, when you were building the software there was no TX2, so you did not prepare the code for it, but I am letting you know that there is an issue when the 8-path option is set:

cudaDeviceSynchronize() returned unspecified launch failure(4) at disparity_method.cu:150