vetter / shoc
The SHOC Benchmark Suite
License: Other
The Sort benchmark doesn't handle errors properly; see Vetter's email from AMD.
The AMD APP SDK v2.4 compiler segfaults. The OpenCL code is known good, as it works on the NVIDIA and Intel implementations.
It gives an error:
"
nvcc fatal : Unsupported gpu architecture 'compute_12'
make[4]: *** [BusSpeedDownload.o] Error 1
make[4]: Leaving directory `/home/pawan/Desktop/summer_school/Intern/shoc-master/src/cuda/level0/epmpi'
make[3]: *** [all-recursive] Error 1
make[3]: Leaving directory `/home/pawan/Desktop/summer_school/Intern/shoc-master/src/cuda/level0'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/home/pawan/Desktop/summer_school/Intern/shoc-master/src/cuda'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/home/pawan/Desktop/summer_school/Intern/shoc-master/src'
make: *** [all-recursive] Error 1
"
I have set the right PATH, and configure worked.
I could not find a contact email or mailing list, so I'm posting to this issue list. I'm getting this compiler error:
/Users/srinath/bin/mpicxx -DPARALLEL -I../../../../src/common -I../../../../config -I/user/local/cuda/include -I../../../../src/opencl/common -I../../../../src/common -I../../../../src/mpi/common -g -O2 -c Stencil2Dmain.cpp -o Stencil2Dmain_mpi.o
In file included from Stencil2Dmain.cpp:36:
../../../../src/mpi/common/MPIStencilUtil.cpp:37:13: error: use of undeclared identifier 'PrintValidationErrors'
PrintValidationErrors( valResultStr, validationErrors, valErrPrintsRemaining );
^
this->
Stencil2Dmain.cpp:367:50: note: in instantiation of member function 'MPIStencilValidater::ValidateResult' requested here
StencilValidater* validater = new MPIStencilValidater;
^
In file included from Stencil2Dmain.cpp:23:
In file included from ../../../../src/common/StencilUtil.cpp:1:
../../../../src/common/StencilUtil.h:23:10: note: must qualify identifier to find this declaration in dependent base class
void PrintValidationErrors( std::ostream& s,
^
In file included from Stencil2Dmain.cpp:36:
../../../../src/mpi/common/MPIStencilUtil.cpp:37:13: error: no member named 'PrintValidationErrors' in 'MPIStencilValidater'
PrintValidationErrors( valResultStr, validationErrors, valErrPrintsRemaining );
^~~~~~~~~~~~~~~~~~~~~
Stencil2Dmain.cpp:367:50: note: in instantiation of member function 'MPIStencilValidater::ValidateResult' requested here
StencilValidater* validater = new MPIStencilValidater;
^
In file included from Stencil2Dmain.cpp:33:
../../../../src/mpi/common/MPIHostStencil.cpp:58:13: error: use of undeclared identifier 'DumpData'
DumpData( ofs, mtx, "before halo exchange" );
^
this->
In file included from Stencil2Dmain.cpp:32:
../../../../src/mpi/common/MPIHostStencilFactory.cpp:20:16: note: in instantiation of member function 'MPIHostStencil::DoPreIterationWork'
requested here
return new MPIHostStencil( wCenter,
^
In file included from Stencil2Dmain.cpp:32:
In file included from ../../../../src/mpi/common/MPIHostStencilFactory.cpp:4:
In file included from ../../../../src/mpi/common/MPIHostStencil.h:7:
../../../../src/mpi/common/MPI2DGridProgram.h:106:10: note: must qualify identifier to find this declaration in dependent base class
void DumpData( std::ostream& s,
^
In file included from Stencil2Dmain.cpp:33:
../../../../src/mpi/common/MPIHostStencil.cpp:60:9: error: use of undeclared identifier 'DoHaloExchange'
DoHaloExchange( mtx );
^
this->
In file included from Stencil2Dmain.cpp:32:
In file included from ../../../../src/mpi/common/MPIHostStencilFactory.cpp:4:
In file included from ../../../../src/mpi/common/MPIHostStencil.h:7:
../../../../src/mpi/common/MPI2DGridProgram.h:105:10: note: must qualify identifier to find this declaration in dependent base class
void DoHaloExchange( Matrix2D& mtx );
^
In file included from Stencil2Dmain.cpp:33:
../../../../src/mpi/common/MPIHostStencil.cpp:63:13: error: use of undeclared identifier 'DumpData'
DumpData( ofs, mtx, "after halo exchange" );
^
this->
In file included from Stencil2Dmain.cpp:32:
In file included from ../../../../src/mpi/common/MPIHostStencilFactory.cpp:4:
In file included from ../../../../src/mpi/common/MPIHostStencil.h:7:
../../../../src/mpi/common/MPI2DGridProgram.h:106:10: note: must qualify identifier to find this declaration in dependent base class
void DumpData( std::ostream& s,
^
In file included from Stencil2Dmain.cpp:35:
./MPIOpenCLStencil.cpp:249:9: error: use of undeclared identifier 'DoHaloExchange'
DoHaloExchange( mtx );
^
this->
In file included from Stencil2Dmain.cpp:34:
./MPIOpenCLStencilFactory.cpp:39:16: note: in instantiation of member function 'MPIOpenCLStencil::DoPreIterationWork' requested here
return new MPIOpenCLStencil( wCenter,
^
In file included from Stencil2Dmain.cpp:32:
In file included from ../../../../src/mpi/common/MPIHostStencilFactory.cpp:4:
In file included from ../../../../src/mpi/common/MPIHostStencil.h:7:
../../../../src/mpi/common/MPI2DGridProgram.h:105:10: note: must qualify identifier to find this declaration in dependent base class
void DoHaloExchange( Matrix2D& mtx );
^
In file included from Stencil2Dmain.cpp:35:
./MPIOpenCLStencil.cpp:65:9: error: use of undeclared identifier 'DoHaloExchange'
DoHaloExchange( mtx );
^
this->
In file included from Stencil2Dmain.cpp:34:
./MPIOpenCLStencilFactory.cpp:39:16: note: in instantiation of member function 'MPIOpenCLStencil::operator()' requested here
return new MPIOpenCLStencil( wCenter,
^
In file included from Stencil2Dmain.cpp:32:
In file included from ../../../../src/mpi/common/MPIHostStencilFactory.cpp:4:
In file included from ../../../../src/mpi/common/MPIHostStencil.h:7:
../../../../src/mpi/common/MPI2DGridProgram.h:105:10: note: must qualify identifier to find this declaration in dependent base class
void DoHaloExchange( Matrix2D& mtx );
^
In file included from Stencil2Dmain.cpp:33:
../../../../src/mpi/common/MPIHostStencil.cpp:58:13: error: no member named 'DumpData' in 'MPIHostStencil'
DumpData( ofs, mtx, "before halo exchange" );
^~~~~~~~
In file included from Stencil2Dmain.cpp:32:
../../../../src/mpi/common/MPIHostStencilFactory.cpp:20:16: note: in instantiation of member function 'MPIHostStencil::DoPreIterationWork'
requested here
return new MPIHostStencil( wCenter,
^
In file included from Stencil2Dmain.cpp:33:
../../../../src/mpi/common/MPIHostStencil.cpp:60:9: error: no member named 'DoHaloExchange' in 'MPIHostStencil'
DoHaloExchange( mtx );
^~~~~~~~~~~~~~
../../../../src/mpi/common/MPIHostStencil.cpp:63:13: error: no member named 'DumpData' in 'MPIHostStencil'
DumpData( ofs, mtx, "after halo exchange" );
^~~~~~~~
In file included from Stencil2Dmain.cpp:35:
./MPIOpenCLStencil.cpp:249:9: error: no member named 'DoHaloExchange' in 'MPIOpenCLStencil'
DoHaloExchange( mtx );
^~~~~~~~~~~~~~
In file included from Stencil2Dmain.cpp:34:
./MPIOpenCLStencilFactory.cpp:39:16: note: in instantiation of member function 'MPIOpenCLStencil::DoPreIterationWork' requested here
return new MPIOpenCLStencil( wCenter,
^
In file included from Stencil2Dmain.cpp:35:
./MPIOpenCLStencil.cpp:65:9: error: no member named 'DoHaloExchange' in 'MPIOpenCLStencil'
DoHaloExchange( mtx );
^~~~~~~~~~~~~~
In file included from Stencil2Dmain.cpp:34:
./MPIOpenCLStencilFactory.cpp:39:16: note: in instantiation of member function 'MPIOpenCLStencil::operator()' requested here
return new MPIOpenCLStencil( wCenter,
^
12 errors generated.
Any help is appreciated.
GEMM seems to hang at compilation for the Cypress GPU (and, oddly enough, only the Cypress GPU) with AMD APP 2.6 on atlanta. It runs fine on the Tahiti GPU, the CPU, and other OpenCL implementations.
@rothpc I'm considering this a bug with the AMD compiler and advise we proceed with a 1.1.4 release.
GDB Output:
0x00002aaaac8bf550 in ?? () from /usr/lib64/libaticaldd.so
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.47.el6_2.9.x86_64 libX11-1.3-2.el6.x86_64 libXau-1.0.5-1.el6.x86_64 libXext-1.1-3.el6.x86_64 libXinerama-1.1-1.el6.x86_64 libgcc-4.4.6-3.el6.x86_64 libstdc++-4.4.6-3.el6.x86_64 libxcb-1.5-1.el6.x86_64
(gdb) where
#0 0x00002aaaac8bf550 in ?? () from /usr/lib64/libaticaldd.so
#1 0x00002aaaac9b15f8 in ?? () from /usr/lib64/libaticaldd.so
#2 0x00002aaaac9b234b in ?? () from /usr/lib64/libaticaldd.so
#3 0x00002aaaac9b6948 in ?? () from /usr/lib64/libaticaldd.so
#4 0x00002aaaac9b7056 in ?? () from /usr/lib64/libaticaldd.so
#5 0x00002aaaac9b76a2 in ?? () from /usr/lib64/libaticaldd.so
#6 0x00002aaaac8a9333 in ?? () from /usr/lib64/libaticaldd.so
#7 0x00002aaaac8a96e5 in ?? () from /usr/lib64/libaticaldd.so
#8 0x00002aaaac8abd67 in ?? () from /usr/lib64/libaticaldd.so
#9 0x00002aaaacc0506d in ?? () from /usr/lib64/libaticaldd.so
#10 0x00002aaaac83e14a in ?? () from /usr/lib64/libaticaldd.so
#11 0x00002aaaac83f9ed in ?? () from /usr/lib64/libaticaldd.so
#12 0x00002aaaaaeb91c8 in gpu::NullKernel::create(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, void const*, unsigned long) ()
from /usr/lib64/libamdocl64.so
#13 0x00002aaaaaebb473 in gpu::Kernel::create(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, void const*, unsigned long) ()
from /usr/lib64/libamdocl64.so
#14 0x00002aaaaaecaa16 in gpu::Program::createKernel(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, gpu::Kernel::InitData const*, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool*, void const*, unsigned long) ()
from /usr/lib64/libamdocl64.so
#15 0x00002aaaaaec761c in gpu::NullProgram::linkImpl(amd::option::Options*) () from /usr/lib64/libamdocl64.so
#16 0x00002aaaaae6c9bf in device::Program::build(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, char const*, amd::option::Options*) () from /usr/lib64/libamdocl64.so
---Type <return> to continue, or q <return> to quit---
#17 0x00002aaaaae7bad7 in amd::Program::build(std::vector<amd::Device*, std::allocator<amd::Device*> > const&, char const*, void (*)(cl_program, void*), void*, bool) () from /usr/lib64/libamdocl64.so
#18 0x00002aaaaae5d444 in clBuildProgram () from /usr/lib64/libamdocl64.so
#19 0x000000000040c735 in runTest (testName="DGEMM", dev=0xffb340, ctx=0xd5a4d0, queue=0xd8b2f0, resultDB=...,
op=..., compileFlags="-DK_DOUBLE_PRECISION ") at SGEMM.cpp:197
#20 0x0000000000406556 in RunBenchmark (devcpp=<value optimized out>, ctxcpp=<value optimized out>,
queuecpp=<value optimized out>, resultDB=..., op=...) at SGEMM.cpp:114
#21 0x0000000000404322 in main (argc=3, argv=<value optimized out>) at ../../common/main.cpp:208
The 32-bit atom_* functions used in src/opencl/level1/bfs/bfs_uiuc_spill.cl do not take volatile arguments [1], while the pointers passed to these functions are volatile. This results in some OpenCL compilers (e.g., vanilla Clang with libclc headers) rejecting the kernel due to the loss of volatile when invoking the atom_* functions.
The proper fix would be switching to the 32-bit atomic_* functions introduced with OpenCL 1.1 [2], which do take volatile arguments.
[1] http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/cl_khr_global_int32_base_atomics.html
[2] http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/atomicFunctions.html
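A minimal sketch of the suggested change (illustrative OpenCL C, not the actual bfs_uiuc_spill.cl code; the kernel and buffer names are made up): the OpenCL 1.1 atomic_* functions declare their pointer parameter volatile, so no qualifier is lost.

```c
// Hypothetical kernel fragment illustrating the fix; 'count' and
// 'frontier' stand in for whatever volatile buffers the BFS kernel uses.

// OpenCL 1.0 extension form -- the parameter is not volatile, so passing
// a volatile __global int* drops the qualifier and some compilers reject it:
//   old = atom_add(count, 1);

// OpenCL 1.1 form -- atomic_add takes a volatile pointer directly:
__kernel void frontier_push(volatile __global int* count,
                            __global int* frontier,
                            int node)
{
    int old = atomic_add(count, 1);   // no loss of volatile
    frontier[old] = node;
}
```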
Hi,
In the MaxFlops benchmark, when running more than one iteration, the OpenCL events used to time the MAD kernel are not freed. Here is the patch:
diff --git a/src/opencl/level0/MaxFlops.cpp b/src/opencl/level0/MaxFlops.cpp
index 0639d10..765c331 100644
--- a/src/opencl/level0/MaxFlops.cpp
+++ b/src/opencl/level0/MaxFlops.cpp
@@ -588,10 +588,11 @@ RunTest(cl_device_id id,
CL_CHECK_ERROR(err);
- // Event object for timing
- Event evKernel_madd("madd");
for (int passCounter=0; passCounter < npasses; passCounter++)
{
+ // Event object for timing
+ Event evKernel_madd("madd");
+
err = clEnqueueNDRangeKernel(queue, kernel_madd, 1, NULL,
&globalWorkSize, &localWorkSize,
0, NULL, &evKernel_madd.CLEvent());
Bye,
Ettore Speziale
Running configure on ARM does not work because of the -m32 flag, which does not exist there. I had to remove it from the file by hand. Another option could be to check whether the host is ARM and, if so, not use that flag.
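A possible shape for that check (a sketch against a hypothetical configure fragment, not SHOC's actual configure script): only add -m32 on x86 hosts.

```shell
# Sketch: guard the -m32 flag by host architecture (illustrative only).
arch=$(uname -m)
case "$arch" in
    i?86|x86_64)
        CXXFLAGS="$CXXFLAGS -m32"   # x86: keep the 32-bit flag
        ;;
    *)
        ;;                          # ARM and others: -m32 does not exist
esac
```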
The MIC branch has a very rudimentary implementation of the device info class. It does not provide device names via the driver script, and provides only rudimentary device names (e.g., "MIC 0") in response to the device info command-line flag supported by each benchmark.
In an attempt to provide implementations of SpMV comparable to those used in the CUDA and OpenCL versions, the SpMV "CSR Vector" operations have been implemented using OpenMP nested parallelism. The outer loop is parallelized using a conventional "omp parallel" directive, the inner loop with an "omp parallel for" directive plus a reduction clause. The number of threads used for the inner and outer loops is specified using a num_threads clause, and dynamic thread count management is turned off. This is intended to mimic the CUDA/OpenCL versions' use of a reduction that fits within a single warp.
However, the performance of the "CSR Vector" version is very poor compared to the "CSR Scalar" version that simply parallelizes the outer loop. The performance changes with the number of inner and outer threads. The number of inner-loop threads must be small, because the number of non-zeros in each row is relatively small (probably too small to overcome the OpenMP overhead), even with size-4 problems. Surprisingly, performance is also better with a small number of outer-loop threads, which raises the question of whether we are implementing the nested parallelism correctly.
The driver script reports benchmark errors because the MIC version of DeviceMemory reports different metrics than the CUDA and OpenCL versions.
Error message:
Chose device: name='GK20A' index=0
[ 80.743300] Out of memory: Kill process 2700 (DeviceMemory) score 1 or sacrifice child
[ 80.751793] Killed process 2700 (DeviceMemory) total-vm:1838376kB, anon-rss:2632kB, file-rss:628kB
Killed
MIC version of Spmv benchmark does not report the same metrics as versions for other programming models, so driver script reports benchmark failures.
Hi everyone!
I am using SHOC to characterize some aspects of a board using a Tegra K1 SoC and I ran into an issue with the BusSpeedReadback benchmark with CUDA.
When executing it, the process gets killed by the kernel because of an Out Of Memory situation (OOM).
I tried to pinpoint where it actually fails and found the following line:
108: cudaMalloc((void**)&device, sizeof(float) * numMaxFloats);
The cudaMalloc apparently triggers the OOM mechanism and kills the execution.
I am new to CUDA and GPGPU in general, but as I understand it, this line allocates a block of memory on the device side (GPU) memory. The size of the block is roughly 132MB. The GPU has over 1GB of memory so I don't understand the issue.
I was able to try the same benchmark on a different system with about the same amount of GPU memory (even less, actually), but using OpenCL, and got no issues.
Thanks for your enlightenment.
Best regards,
Marc
EDIT: BusSpeedDownload runs fine, even though it has an almost identical code structure. All of the other benchmarks also run fine.
EDIT2: I found that the OpenCL version checks the available memory before trying to allocate any memory block, and uses that to set the maximum block size. This mechanism does not exist in the CUDA version, which therefore allows the code to allocate memory blocks of up to 512MB. In some configurations, this is too much and provokes an OOM situation.
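The guard the OpenCL version performs could be mirrored on the CUDA side roughly like this (an illustrative sketch, not the actual BusSpeedReadback code; numMaxFloats is the variable from the benchmark, and the one-half headroom factor is an assumption):

```cpp
// Sketch: cap the allocation at a fraction of the memory the device
// reports as free, mirroring the OpenCL version's check (illustrative).
size_t freeBytes = 0, totalBytes = 0;
cudaMemGetInfo(&freeBytes, &totalBytes);

size_t want = sizeof(float) * numMaxFloats;     // numMaxFloats from the benchmark
if (want > freeBytes / 2)                       // leave headroom for the driver
    numMaxFloats = (freeBytes / 2) / sizeof(float);

float* device = NULL;
cudaError_t err = cudaMalloc((void**)&device, sizeof(float) * numMaxFloats);
if (err != cudaSuccess)
{
    // shrink further or report a clean failure, rather than letting the
    // kernel OOM killer terminate the process
}
```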
I also found something weird about the values reported by the CUDA version of BusSpeedDownload.
For a block size of 4096kB, kernel time is: 1.67431ms (mean value).
So bandwidth should be 2.5GB/s. But SHOC reports a mean value of 3GB/s for this data size.
I observed the same with the CUDA version of the BusSpeedReadback benchmark.
The OpenCL version gives coherent values.
Did I miss something ?
Best,
Marc
I tried to build SHOC, but I encountered some problems:
- My OpenCL installation lives under /usr/local/cuda, which is hard for the configure script to find. Some flags like OPENCL_CFLAGS and OPENCL_LIBS would help. Abusing CXXFLAGS (CFLAGS, on the other hand, does not work) helps, though.
- I installed to $HOME/usr, but running shocdriver from $HOME/usr/bin tells me that it cannot find driver.pl. The actual programs work just fine.
User Nazia Khan has reported a build failure for OpenCL Spmv on the current AMD APP SDK.
*.err files contain the following:
Error: CL_BUILD_PROGRAM_FAILURE in Spmv.cpp line 438
How do we incorporate third-party code? For instance, we have a sort that uses VexCL code, but it's unclear whether we need to rewrite this code or include it as an external dependency.
Apparently the normal Perl behavior when using system() is for Ctrl-C to kill the child but not the Perl script. It seems like it will confuse and annoy users to have to hit Ctrl-C many times to get the driver to stop completely.
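One common workaround (a sketch, not the actual driver.pl code) is to inspect the status after each system() call and stop the driver when the child died from SIGINT:

```perl
# Sketch (not the actual driver.pl): after each system() call, check
# whether the child was killed by SIGINT and stop the driver if so.
my $rc = system(@benchmark_cmd);     # @benchmark_cmd is hypothetical
if (($rc & 127) == 2) {              # low 7 bits hold the fatal signal; 2 = SIGINT
    die "Benchmark interrupted by Ctrl-C; stopping driver.\n";
}
```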
Hello, Dakar team.
I'm trying to build SHOC 1.1.5 with CUDA and MPI support under CentOS 6.5 x64. Our PC has the CUDA SDK 6.5, OpenMPI 1.8.1, the Intel C++ compiler, and Intel MKL 11.1 installed.
The configure command ./configure CPPFLAGS="-I/usr/local/cuda/include" --with-cuda --with-mpi ends fine. Its output contains the following lines:
configure: checking for usable OpenCL opencl.h header
checking OpenCL/opencl.h usability... yes
checking OpenCL/opencl.h presence... yes
checking for OpenCL/opencl.h... yes
checking for usable OpenCL library... -lOpenCL
checking for nvcc... /usr/local/cuda/bin/nvcc
checking cuda.h usability... yes
checking cuda.h presence... yes
checking for cuda.h... yes
checking cuda_runtime.h usability... yes
checking cuda_runtime.h presence... yes
checking for cuda_runtime.h... yes
checking for cublasInit in -lcublas... yes
checking for cufftPlan1d in -lcufft... yes
checking for mpicxx... /usr/local/mpi/bin/mpicxx
checking whether we can compile an MPI program using /usr/local/mpi/bin/mpicxx... yes
checking whether we can link an MPI program using /usr/local/mpi/bin/mpicxx... yes
So I concluded that CUDA, OpenCL, and MPI were successfully found. But the make command fails:
/usr/local/mpi/bin/mpicxx -g -O2 -L../../../../src/cuda/common -L../../../../src/common -o BusSpeedDownload main.o BusSpeedDownload.o -lSHOCCommon "/tmp/tmpxft_00007322_00000000-16_bogus.o" "-L/usr/local/cuda/bin/../targets/x86_64-linux/lib/stubs" "-L/usr/local/cuda/bin/../targets/x86_64-linux/lib" -lcudadevrt -lcudart_static -lrt -lpthread -ldl -lrt -lrt
icpc: error #10236: File not found: '/tmp/tmpxft_00007322_00000000-16_bogus.o'
The PATH and LD_LIBRARY_PATH variables are set to proper values. Other CUDA-capable applications, like CUDA-accelerated HPL, work fine. Could you help me figure out where the problem is?
Thank you.
The S3D benchmark gives the following output:
$ ../bin/Serial/OpenCL/S3D -s 1 -p 0 -d 1
Chose device: name='GeForce GT 330M' index=1 id=0x1022600
Compiling kernels (phase 1)...Error: CL_BUILD_PROGRAM_FAILURE in S3D.cpp line 306
Retsize: 69
Log: kernel referenced an external function exp, that could not be found.
Error: CL_BUILD_PROGRAM_FAILURE in S3D.cpp line 315
Retsize: 69
Log: kernel referenced an external function exp, that could not be found.
Error: CL_BUILD_PROGRAM_FAILURE in S3D.cpp line 317
Retsize: 69
Log: kernel referenced an external function exp, that could not be found.
done.
Generating OpenCL Kernel Objects (phase 1)...Error: CL_INVALID_PROGRAM_EXECUTABLE in S3D.cpp line 337
Error: CL_INVALID_PROGRAM_EXECUTABLE in S3D.cpp line 355
Error: CL_INVALID_PROGRAM_EXECUTABLE in S3D.cpp line 361
done.
Error: CL_INVALID_KERNEL in S3D.cpp line 390
Error: CL_INVALID_KERNEL in S3D.cpp line 392
Error: CL_INVALID_KERNEL in S3D.cpp line 394
Error: CL_INVALID_KERNEL in S3D.cpp line 404
Error: CL_INVALID_KERNEL in S3D.cpp line 406
Error: CL_INVALID_KERNEL in S3D.cpp line 408
Error: CL_INVALID_KERNEL in S3D.cpp line 411
Error: CL_INVALID_KERNEL in S3D.cpp line 411
Error: CL_INVALID_KERNEL in S3D.cpp line 411
Error: CL_INVALID_KERNEL in S3D.cpp line 411
Error: CL_INVALID_KERNEL in S3D.cpp line 411
Error: CL_INVALID_KERNEL in S3D.cpp line 411
Executing kernels (phase 1)...Memory fault
Testing issue tracker with a couple of minor action items for QTC integration.
@adanalis should check and make sure:
With the new LLVM compiler backend, CUDA FFT performance dropped by 50% on Keeneland. OpenCL performance stayed the same.
I suspect this might be due to loops being unrolled differently (the unroll option that used to go to the old compiler is now ignored).
Hi,
OpenCL events used to time kernels are not released in QueueDelay.cpp.
Here is the patch:
diff --git a/src/opencl/level0/QueueDelay.cpp b/src/opencl/level0/QueueDelay.cpp
index 8497fcd..b81e971 100644
--- a/src/opencl/level0/QueueDelay.cpp
+++ b/src/opencl/level0/QueueDelay.cpp
@@ -137,18 +137,17 @@ void RunBenchmark(cl::Device& devcpp,
size_t localWorkSize = (maxGroupSize >= 256 ? 256 : maxGroupSize);
size_t globalWorkSize = localWorkSize * 256;
- // Declare event objects for the kernels
- Event evKernel1("Run Kernel1");
- Event evKernel2("Run Kernel2");
- Event evKernel3("Run Kernel3");
- Event evKernel4("Run Kernel4");
-
//Test single kernel
for (int j = 0; j < passes; j++)
{
double total = 0.0;
for (int i = 0; i < reps; i++)
{
+ Event evKernel1("Run Kernel1");
+ Event evKernel2("Run Kernel2");
+ Event evKernel3("Run Kernel3");
+ Event evKernel4("Run Kernel4");
+
err = clEnqueueNDRangeKernel(queue, kernel1, 1, NULL,
&globalWorkSize, &localWorkSize,
0, NULL, &evKernel1.CLEvent());
@@ -191,6 +190,11 @@ void RunBenchmark(cl::Device& devcpp,
double total = 0.0;
for (int i = 0; i < reps; i++)
{
+ Event evKernel1("Run Kernel1");
+ Event evKernel2("Run Kernel2");
+ Event evKernel3("Run Kernel3");
+ Event evKernel4("Run Kernel4");
+
err = clEnqueueNDRangeKernel(queue, kernel1, 1, NULL,
&globalWorkSize, &localWorkSize,
0, NULL, &evKernel1.CLEvent());
@@ -240,6 +244,11 @@ void RunBenchmark(cl::Device& devcpp,
double total = 0.0;
for (int i = 0; i < reps; i++)
{
+ Event evKernel1("Run Kernel1");
+ Event evKernel2("Run Kernel2");
+ Event evKernel3("Run Kernel3");
+ Event evKernel4("Run Kernel4");
+
err = clEnqueueNDRangeKernel(queue, kernel1, 1, NULL,
&globalWorkSize, &localWorkSize,
0, NULL, &evKernel1.CLEvent());
Bye,
[email protected]
One recently fixed uninitialized memory error was easily detected and reported by the compiler. There may be other problems, and these are easy fixes.
In SHOC 1.1.1, we face a failure in scan. The code segment (from line #325) applies a BLOCKING enqueueWrite followed by clGetEventProfilingInfo. The assumption is that the blocking write completes (i.e., the event returned by &evTransfer.CLEvent() is set to CL_COMPLETE) when the call to clEnqueueWriteBuffer returns.
However, according to the 1.1 spec (same in 1.2), the behavior of a BLOCKING clEnqueueWriteBuffer API is not completely synchronous: "If blocking_write is CL_TRUE, the OpenCL implementation copies the data referred to by ptr and enqueues the write operation in the command-queue. The memory pointed to by ptr can be reused by the application after the clEnqueueWriteBuffer call returns." (OpenCL spec 1.1 v45, page 62). That is, the enqueueWrite event is set to CL_COMPLETE only after the data is written to the device, which may be after the clEnqueueWriteBuffer API returns. Hence, you need to wait on the event (clFinish or clWaitForEvents) before accessing the profiling info.
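The fix described here can be sketched as follows (illustrative host code, not the actual Scan.cpp; queue, devBuf, hostPtr, and nbytes are stand-ins for the benchmark's variables):

```c
/* Sketch: even with blocking_write = CL_TRUE, wait on the event before
   querying profiling info, since it may not yet be CL_COMPLETE. */
cl_event ev;  /* evTransfer.CLEvent() in the SHOC code */
err = clEnqueueWriteBuffer(queue, devBuf, CL_TRUE, 0, nbytes, hostPtr,
                           0, NULL, &ev);
CL_CHECK_ERROR(err);

err = clWaitForEvents(1, &ev);   /* ensure the event reached CL_COMPLETE */
CL_CHECK_ERROR(err);

cl_ulong start, end;
err = clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                              sizeof(cl_ulong), &start, NULL);
CL_CHECK_ERROR(err);
err = clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                              sizeof(cl_ulong), &end, NULL);
CL_CHECK_ERROR(err);
```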
Thanks, --Yariv
The dates in results.csv are meaningless, but it looks like we may have a (nearly) full set of duplicates in that table. That could be an artifact of the Google Public Data Explorer expecting everything to be time-based, but there's no reason to keep duplicates in the table.
Also, we should have a different sentinel or empty value for unknown dates.
For other CSVs, like specs for a platform/device, if we want to keep the date column, I'd recommend changing their date to be the release date of the architecture, or the CUDA platform version, etc. It would be meaningful that way. (Otherwise, I'd say remove the date for everything but results.)
The MIC version of MaxFlops reports a value that seems to be too high by a factor of 10.
Running benchmark MaxFlops
result for maxspflops: 18065.5000 GFLOPS
result for maxdpflops: 13108.8000 GFLOPS
From Reza's blog, the max performance should be something around the theoretical peaks of 2112 GFLOPS, SP and 1056 GFLOPS, DP.
https://software.intel.com/en-us/articles/intel-xeon-phi-core-micro-architecture
This could possibly just be a reporting error.
I fixed a bug today with these timings, which I believe makes those results in the database wrong. After my fix, the rates are now similar between cuda/opencl, whereas previously there was a 10^3 difference.
I'm throwing this out there for suggestions about what to do with the results database.
Specifically, there is a bug report that CUDA 7.5 gives an error with the configure flags that use compute_12. It may be time to remove these from SHOC, if we don't want to test for support explicitly.
diff --git a/config/find_cuda_libs.sh b/config/find_cuda_libs.sh
index 8aed2e3..1a254b0 100755
--- a/config/find_cuda_libs.sh
+++ b/config/find_cuda_libs.sh
@@ -42,7 +42,7 @@ libspec=`$NVCC -dryrun bogus.cu 2>&1 | grep LIBRARIES | sed 's/^.*LIBRARIES=//'`
#echo "libspec=$libspec"
if [ $cudart_flag_supported -eq 1 ]
then
-    $NVCC -dryrun bogus.cu 2>&1 | tail -1 | sed "s#^.*-o \"a.out\"##" | sed 's#"[a-zA-Z0-9/_-]*\.o"##' | sed 's/-Wl,--start-group//' | sed 's/-Wl,--end-group//'
+    $NVCC -dryrun bogus.cu 2>&1 | tail -1 | sed "s#^.*-o \"a.out\"##" | sed 's#"[a-zA-Z0-9/_-]*\.o"##g' | sed 's/-Wl,--start-group//' | sed 's/-Wl,--end-group//'
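The only change is the trailing g on the object-file substitution. A quick illustration of why it matters (the link line below is made up, not real nvcc output):

```shell
link_line='"x.o" "y.o" -lcudadevrt -lcudart_static'
# Without /g, sed removes only the first quoted .o file:
echo "$link_line" | sed 's#"[a-zA-Z0-9/_-]*\.o"##'
# With /g, every quoted .o file is stripped from the link line:
echo "$link_line" | sed 's#"[a-zA-Z0-9/_-]*\.o"##g'
```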
On a GTX 460, I got various runtimes of infinity. On a Quadro FX 380, I got results like 1e-207 seconds and e+192 seconds.
Bring over Jeremy's ticket for VS project files for SHOC. Will assign to him pending GitHub account creation.
I am expecting to observe a speedup when I run either an EP or TP benchmark on multiple devices, but that is not the case.
The Stencil2D benchmark does show a speedup when I use multiple devices:
./shocdriver -d 0 -cuda -s 4 -benchmark Stencil2D
result for stencil: 141.2280 GFLOPS
vs.
./shocdriver -d 0,1,2,3 -cuda -s 4 -benchmark Stencil2D
result for stencil: 406.1190 GFLOPS
However, this is the only benchmark I have found (so far) that shows a speedup. For example:
./shocdriver -d 0 -cuda -s 4 -benchmark Scan
result for scan: 46.8924 GB/s
vs
./shocdriver -d 0,1,2,3 -cuda -s 4 -benchmark Scan
result for scan: 46.8561 GB/s
Similarly, Reduction and GEMM show no improvement either.
Am I missing something here? I am running version 1.1.5.
Hello.
I've compiled SHOC 1.1.5 with CUDA/OpenCL/MPI support under CentOS 6.5, with CUDA 6.5, Intel Compiler 11.1, Intel MKL 11.1, and OpenMPI 1.8.1 installed. The PC has four NVIDIA Tesla K20m GPUs. When I try to run it using all GPUs, MaxFlops' performance is the same as if I ran the test using one GPU only. This happens in both OpenCL and CUDA modes. I tried to change the problem size from "-s 1" to "-s 4", but nothing changed.
Here below are console outputs:
[bald@node8 bin]$ ./shocdriver -cuda -s 1 -d 0
--- Welcome To The SHOC Benchmark Suite version 1.1.5 ---
Hostname: node8.cluster
Platform selection not specified, default to platform #0
Number of available platforms: 1
Number of available devices on platform 0 : 4
Device 0: 'Tesla K20m'
Device 1: 'Tesla K20m'
Device 2: 'Tesla K20m'
Device 3: 'Tesla K20m'
Specified 1 device IDs: 0
Using size class: 1
--- Starting Benchmarks ---
Running benchmark BusSpeedDownload
result for bspeed_download: 6.2430 GB/sec
Running benchmark BusSpeedReadback
result for bspeed_readback: 6.6992 GB/sec
Running benchmark MaxFlops
result for maxspflops: 3099.6100 GFLOPS
result for maxdpflops: 1164.3600 GFLOPS
[bald@node8 bin]$ ./shocdriver -cuda -s 1 -d 0,1,2,3
--- Welcome To The SHOC Benchmark Suite version 1.1.5 ---
Hostname: node8.cluster
Platform selection not specified, default to platform #0
Number of available platforms: 1
Number of available devices on platform 0 : 4
Device 0: 'Tesla K20m'
Device 1: 'Tesla K20m'
Device 2: 'Tesla K20m'
Device 3: 'Tesla K20m'
Specified 4 device IDs: 0,1,2,3
Using size class: 1
--- Starting Benchmarks ---
Running benchmark BusSpeedDownload
result for bspeed_download: 6.1165 GB/sec
Running benchmark BusSpeedReadback
result for bspeed_readback: 6.6993 GB/sec
Running benchmark MaxFlops
result for maxspflops: 3099.1200 GFLOPS
result for maxdpflops: 1165.0200 GFLOPS
But, as nvidia-smi showed, all of the GPUs were almost fully loaded by the MaxFlops application:
[root@node8 ~]# nvidia-smi
Thu Sep 4 16:29:27 2014
+------------------------------------------------------+
| NVIDIA-SMI 340.29 Driver Version: 340.29 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K20m Off | 0000:02:00.0 Off | 0 |
| N/A 37C P0 150W / 225W | 96MiB / 4799MiB | 91% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K20m Off | 0000:03:00.0 Off | 0 |
| N/A 38C P0 149W / 225W | 96MiB / 4799MiB | 90% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K20m Off | 0000:81:00.0 Off | 0 |
| N/A 36C P0 153W / 225W | 96MiB / 4799MiB | 91% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K20m Off | 0000:82:00.0 Off | 0 |
| N/A 38C P0 117W / 225W | 96MiB / 4799MiB | 91% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Compute processes: GPU Memory |
| GPU PID Process name Usage |
|=============================================================================|
| 0 2576 /home/bald/Downloads/shoc/bin/EP/CUDA/MaxFlops 80MiB |
| 1 2577 /home/bald/Downloads/shoc/bin/EP/CUDA/MaxFlops 80MiB |
| 2 2578 /home/bald/Downloads/shoc/bin/EP/CUDA/MaxFlops 80MiB |
| 3 2579 /home/bald/Downloads/shoc/bin/EP/CUDA/MaxFlops 80MiB |
+-----------------------------------------------------------------------------+
I want to see the overall performance of my hybrid PC. What did I do wrong? Is this behavior normal for SHOC? Thank you.
I am trying to improve upon workload partitioning solutions for heterogeneous platforms. The SHOC benchmarks run on GPUs only. How do I compare CPU-only performance with this? How do I partition the workload between CPU and GPU?
I ran into an interesting potential problem with SHOC when trying to figure out why CUDA BFS wasn't working on my current laptop.
BFS needs compute capability 1.2 for atomics. The BFS code checks whether the target device is at least CC 1.2, but doesn't check (doesn't have a way to check?) whether the code was compiled for CC 1.2. For whatever reason, our default CUDA_CPPFLAGS didn't include compilation for CC 1.2 - it had 1.0, 1.1, 1.3, and 2.0. Presumably, at runtime the kernel selected by the CUDA runtime was the CC 1.1 version. BFS ran but because it requires the atomics from 1.2, its verification of the results computed on the GPU failed.
So the problem is: the BFS code checked the CC of the device, but couldn't tell that it didn't have CC 1.2 code available to run on it. Is there a way to check at run time which version of the code will be used?
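One possible way to answer that question (a hedged sketch; the kernel symbol below is hypothetical, not the actual SHOC BFS kernel) is cudaFuncGetAttributes, whose binaryVersion and ptxVersion fields report which compute capability the kernel image the runtime selected was compiled for:

```cpp
// Hedged sketch, not actual SHOC code: ask the CUDA runtime which CC the
// kernel image it will launch was compiled for.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void bfs_kernel(int* frontier) { }   // hypothetical kernel

int main()
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, bfs_kernel);
    if (err != cudaSuccess) { std::printf("query failed\n"); return 1; }

    // binaryVersion/ptxVersion encode CC as major*10+minor (e.g. 12 = CC 1.2)
    std::printf("binary CC: %d, PTX CC: %d\n",
                attr.binaryVersion, attr.ptxVersion);
    if (attr.binaryVersion < 12 && attr.ptxVersion < 12)
        std::printf("No CC >= 1.2 image available: BFS atomics "
                    "would silently misbehave\n");
    return 0;
}
```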
As per comments from the AMD developer who debugged the issue:
The test generates kernels based on the device capabilities reported in OCL. In the case of the hybrid stack (Orca), the OCL runtime reports 32KB of local device memory, but the ROCm stack reports 64KB.
The test uses half of the reported amount for a local array in a kernel. Thus ROCm ends up with more LDS usage, hence lower wave occupancy and lower performance. The issue should be reported to devrel for test-logic replacement.
When I (Jeremy) added timing of PCIe to the FFT benchmark, I only added timing of the send portion. (There wasn't an explicit readback in the main benchmark code I could add timing of.) To be consistent with other benchmarks, we should probably add an explicit readback and capture timing of that piece to include in the _PCIE results.
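A readback-timing sketch of the kind described, using CUDA events (the buffer names and sizes are illustrative, not the FFT benchmark's actual variables):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t nbytes = 1 << 24;     // 16 MiB, arbitrary
    void *devBuf = NULL, *hostBuf = NULL;
    cudaMalloc(&devBuf, nbytes);
    cudaMallocHost(&hostBuf, nbytes);  // pinned host memory

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time an explicit device-to-host readback, mirroring how the
    // send direction is timed.
    cudaEventRecord(start, 0);
    cudaMemcpy(hostBuf, devBuf, nbytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("readback: %.3f ms (%.2f GB/s)\n", ms,
           (nbytes / 1.0e9) / (ms / 1.0e3));

    cudaFreeHost(hostBuf);
    cudaFree(devBuf);
    return 0;
}
```

The measured elapsed time is the piece that could then be folded into the `_PCIE` results alongside the existing send-side number.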
I believe the OpenCL version of the `top_scan` kernel in the sort benchmark has a data race. On line 127 of src/opencl/level1/sort/sort.cl the `__local` `s_seed` variable is read by all threads with `get_local_id(0) < n`, and on line 132 the variable is written by the thread with `get_local_id(0) == n - 1`, while there is no barrier between the statements.
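If the race is as described, the minimal fix would be a local barrier between the read and the write. A hedged sketch of the pattern (only the racing lines are shown; `val` and `newSeed` are illustrative names, not the kernel's actual variables):

```opencl
__local float s_seed;   /* illustrative declaration */

/* All work-items read the shared seed (the line-127 read). */
if (get_local_id(0) < n)
    val += s_seed;

/* Ensure every read has completed before the seed is overwritten. */
barrier(CLK_LOCAL_MEM_FENCE);

/* A single work-item updates the seed (the line-132 write). */
if (get_local_id(0) == n - 1)
    s_seed = newSeed;
```

`barrier(CLK_LOCAL_MEM_FENCE)` both synchronizes the work-group and makes the earlier local-memory reads visible before the write proceeds.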
Intel iGPUs from Sandy Bridge onwards are now supported at least in part on Linux using Beignet (https://01.org/beignet). An initial test using Beignet 1.0.0 on a Haswell CPU generated some reasonable results for several of the benchmarks, but many failed due to an issue with the "FillTimingInfo" class in Event.cpp (the specific error returned was CL_PROFILING_INFO_NOT_AVAILABLE).
At this time it's unclear whether this is a SHOC issue, since other platforms (including Intel CPUs) work fine, but it may be worth investigating for future Intel iGPUs.
Dear SHOCers,
I am contemplating modifying SHOC to use the clang compiler, which recently gained mainline support for CUDA:
http://llvm.org/releases/3.8.0/tools/clang/docs/ReleaseNotes.html#cuda-support-in-clang
I'd like to know if you would be interested in PRs that enable SHOC to use this compiler toolchain.
On a related note, it would be interesting to compare native OpenCL SHOC results to HIPified CUDA benchmarks:
http://gpuopen.com/hip-release-0-82/
As I have no feel for how mature a 0.82 release would be, this is more of an experiment. But still, I'd like to know if the SHOC community would be interested.
In workloads that run both single and double precision, such as GEMM, add an option to run only one of the two.
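A minimal sketch of what such a switch might look like (the `--precision` flag name and the `parsePrecision` helper are hypothetical, not SHOC's existing option-parsing API):

```cpp
#include <cstring>
#include <string>

enum class Precision { Single, Double, Both };

// Hypothetical helper: map a "--precision=<value>" command-line
// argument to an enum, defaulting to running both precisions.
Precision parsePrecision(int argc, const char** argv) {
    for (int i = 1; i < argc; ++i) {
        if (std::strncmp(argv[i], "--precision=", 12) == 0) {
            std::string v = argv[i] + 12;
            if (v == "single") return Precision::Single;
            if (v == "double") return Precision::Double;
        }
    }
    return Precision::Both;  // flag absent or unrecognized: run both
}
```

The GEMM driver would then skip the SGEMM or DGEMM pass according to the returned value.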
The MIC version of Spmv has MIC-specific warmup code that runs before the MIC-based spmv operations. This differs from the benchmarks for other programming models, which use a number of warmup passes controlled by a command-line parameter.
This is maybe not really an issue, but it seems that all three of the write benchmarks in the DeviceMemory benchmark have write/write data races (different threads write to the same array position in each of these benchmarks). Is this intentional?
Jeff recommends atoms/s and grid points/s in addition to the standard flops and gbytes rates.
Hi,
I want to run this with one GPU and two threads per task; is it possible?
If someone knows, please help me out!
Thank you :)
Hi,
I am trying to build SHOC on Jetson and get the error "cannot guess build type" when I execute the configure script. Could you tell me how to resolve this?
Thanks
This occurs on a Quadro FX 380 (G96, 256 MB or 512 MB RAM, I forget; 301.xx driver) and a GeForce GTX 460 (GF104, 1 GB; 295.73 driver).
The display goes blank for a second, then returns, and then Windows pops up an error saying the display driver crashed. The console outputs this error:
error=30 name=unknown error at ln: 409