vetter / shoc
The SHOC Benchmark Suite
License: Other
The Sort benchmark doesn't handle errors properly; see Vetter's email from AMD.
The AMD APP SDK v2.4 compiler segfaults. The OpenCL code is known good, as it works on the NVIDIA and Intel implementations.
It gives an error:
"
nvcc fatal : Unsupported gpu architecture 'compute_12'
make[4]: *** [BusSpeedDownload.o] Error 1
make[4]: Leaving directory `/home/pawan/Desktop/summer_school/Intern/shoc-master/src/cuda/level0/epmpi'
make[3]: *** [all-recursive] Error 1
make[3]: Leaving directory `/home/pawan/Desktop/summer_school/Intern/shoc-master/src/cuda/level0'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/home/pawan/Desktop/summer_school/Intern/shoc-master/src/cuda'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/home/pawan/Desktop/summer_school/Intern/shoc-master/src'
make: *** [all-recursive] Error 1
"
I have set the right PATH, and configure worked.
I could not find a contact email or mailing list, so I'm posting to this issue list. I'm getting this compiler error:
/Users/srinath/bin/mpicxx -DPARALLEL -I../../../../src/common -I../../../../config -I/user/local/cuda/include -I../../../../src/opencl/common -I../../../../src/common -I../../../../src/mpi/common -g -O2 -c Stencil2Dmain.cpp -o Stencil2Dmain_mpi.o
In file included from Stencil2Dmain.cpp:36:
../../../../src/mpi/common/MPIStencilUtil.cpp:37:13: error: use of undeclared identifier 'PrintValidationErrors'
PrintValidationErrors( valResultStr, validationErrors, valErrPrintsRemaining );
^
this->
Stencil2Dmain.cpp:367:50: note: in instantiation of member function 'MPIStencilValidater::ValidateResult' requested here
StencilValidater* validater = new MPIStencilValidater;
^
In file included from Stencil2Dmain.cpp:23:
In file included from ../../../../src/common/StencilUtil.cpp:1:
../../../../src/common/StencilUtil.h:23:10: note: must qualify identifier to find this declaration in dependent base class
void PrintValidationErrors( std::ostream& s,
^
In file included from Stencil2Dmain.cpp:36:
../../../../src/mpi/common/MPIStencilUtil.cpp:37:13: error: no member named 'PrintValidationErrors' in 'MPIStencilValidater'
PrintValidationErrors( valResultStr, validationErrors, valErrPrintsRemaining );
^~~~~~~~~~~~~~~~~~~~~
Stencil2Dmain.cpp:367:50: note: in instantiation of member function 'MPIStencilValidater::ValidateResult' requested here
StencilValidater* validater = new MPIStencilValidater;
^
In file included from Stencil2Dmain.cpp:33:
../../../../src/mpi/common/MPIHostStencil.cpp:58:13: error: use of undeclared identifier 'DumpData'
DumpData( ofs, mtx, "before halo exchange" );
^
this->
In file included from Stencil2Dmain.cpp:32:
../../../../src/mpi/common/MPIHostStencilFactory.cpp:20:16: note: in instantiation of member function 'MPIHostStencil::DoPreIterationWork'
requested here
return new MPIHostStencil( wCenter,
^
In file included from Stencil2Dmain.cpp:32:
In file included from ../../../../src/mpi/common/MPIHostStencilFactory.cpp:4:
In file included from ../../../../src/mpi/common/MPIHostStencil.h:7:
../../../../src/mpi/common/MPI2DGridProgram.h:106:10: note: must qualify identifier to find this declaration in dependent base class
void DumpData( std::ostream& s,
^
In file included from Stencil2Dmain.cpp:33:
../../../../src/mpi/common/MPIHostStencil.cpp:60:9: error: use of undeclared identifier 'DoHaloExchange'
DoHaloExchange( mtx );
^
this->
In file included from Stencil2Dmain.cpp:32:
In file included from ../../../../src/mpi/common/MPIHostStencilFactory.cpp:4:
In file included from ../../../../src/mpi/common/MPIHostStencil.h:7:
../../../../src/mpi/common/MPI2DGridProgram.h:105:10: note: must qualify identifier to find this declaration in dependent base class
void DoHaloExchange( Matrix2D& mtx );
^
In file included from Stencil2Dmain.cpp:33:
../../../../src/mpi/common/MPIHostStencil.cpp:63:13: error: use of undeclared identifier 'DumpData'
DumpData( ofs, mtx, "after halo exchange" );
^
this->
In file included from Stencil2Dmain.cpp:32:
In file included from ../../../../src/mpi/common/MPIHostStencilFactory.cpp:4:
In file included from ../../../../src/mpi/common/MPIHostStencil.h:7:
../../../../src/mpi/common/MPI2DGridProgram.h:106:10: note: must qualify identifier to find this declaration in dependent base class
void DumpData( std::ostream& s,
^
In file included from Stencil2Dmain.cpp:35:
./MPIOpenCLStencil.cpp:249:9: error: use of undeclared identifier 'DoHaloExchange'
DoHaloExchange( mtx );
^
this->
In file included from Stencil2Dmain.cpp:34:
./MPIOpenCLStencilFactory.cpp:39:16: note: in instantiation of member function 'MPIOpenCLStencil::DoPreIterationWork' requested here
return new MPIOpenCLStencil( wCenter,
^
In file included from Stencil2Dmain.cpp:32:
In file included from ../../../../src/mpi/common/MPIHostStencilFactory.cpp:4:
In file included from ../../../../src/mpi/common/MPIHostStencil.h:7:
../../../../src/mpi/common/MPI2DGridProgram.h:105:10: note: must qualify identifier to find this declaration in dependent base class
void DoHaloExchange( Matrix2D& mtx );
^
In file included from Stencil2Dmain.cpp:35:
./MPIOpenCLStencil.cpp:65:9: error: use of undeclared identifier 'DoHaloExchange'
DoHaloExchange( mtx );
^
this->
In file included from Stencil2Dmain.cpp:34:
./MPIOpenCLStencilFactory.cpp:39:16: note: in instantiation of member function 'MPIOpenCLStencil::operator()' requested here
return new MPIOpenCLStencil( wCenter,
^
In file included from Stencil2Dmain.cpp:32:
In file included from ../../../../src/mpi/common/MPIHostStencilFactory.cpp:4:
In file included from ../../../../src/mpi/common/MPIHostStencil.h:7:
../../../../src/mpi/common/MPI2DGridProgram.h:105:10: note: must qualify identifier to find this declaration in dependent base class
void DoHaloExchange( Matrix2D& mtx );
^
In file included from Stencil2Dmain.cpp:33:
../../../../src/mpi/common/MPIHostStencil.cpp:58:13: error: no member named 'DumpData' in 'MPIHostStencil'
DumpData( ofs, mtx, "before halo exchange" );
^~~~~~~~
In file included from Stencil2Dmain.cpp:32:
../../../../src/mpi/common/MPIHostStencilFactory.cpp:20:16: note: in instantiation of member function 'MPIHostStencil::DoPreIterationWork'
requested here
return new MPIHostStencil( wCenter,
^
In file included from Stencil2Dmain.cpp:33:
../../../../src/mpi/common/MPIHostStencil.cpp:60:9: error: no member named 'DoHaloExchange' in 'MPIHostStencil'
DoHaloExchange( mtx );
^~~~~~~~~~~~~~
../../../../src/mpi/common/MPIHostStencil.cpp:63:13: error: no member named 'DumpData' in 'MPIHostStencil'
DumpData( ofs, mtx, "after halo exchange" );
^~~~~~~~
In file included from Stencil2Dmain.cpp:35:
./MPIOpenCLStencil.cpp:249:9: error: no member named 'DoHaloExchange' in 'MPIOpenCLStencil'
DoHaloExchange( mtx );
^~~~~~~~~~~~~~
In file included from Stencil2Dmain.cpp:34:
./MPIOpenCLStencilFactory.cpp:39:16: note: in instantiation of member function 'MPIOpenCLStencil::DoPreIterationWork' requested here
return new MPIOpenCLStencil( wCenter,
^
In file included from Stencil2Dmain.cpp:35:
./MPIOpenCLStencil.cpp:65:9: error: no member named 'DoHaloExchange' in 'MPIOpenCLStencil'
DoHaloExchange( mtx );
^~~~~~~~~~~~~~
In file included from Stencil2Dmain.cpp:34:
./MPIOpenCLStencilFactory.cpp:39:16: note: in instantiation of member function 'MPIOpenCLStencil::operator()' requested here
return new MPIOpenCLStencil( wCenter,
^
12 errors generated.
Any help is appreciated.
GEMM seems to hang at compilation for the Cypress GPU (and, oddly enough, only the Cypress GPU) with AMD APP 2.6 on atlanta. It runs fine on the Tahiti GPU, the CPU, and other OpenCL implementations.
@rothpc I'm considering this a bug with the AMD compiler and advise we proceed with a 1.1.4 release.
GDB Output:
0x00002aaaac8bf550 in ?? () from /usr/lib64/libaticaldd.so
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.47.el6_2.9.x86_64 libX11-1.3-2.el6.x86_64 libXau-1.0.5-1.el6.x86_64 libXext-1.1-3.el6.x86_64 libXinerama-1.1-1.el6.x86_64 libgcc-4.4.6-3.el6.x86_64 libstdc++-4.4.6-3.el6.x86_64 libxcb-1.5-1.el6.x86_64
(gdb) where
#0 0x00002aaaac8bf550 in ?? () from /usr/lib64/libaticaldd.so
#1 0x00002aaaac9b15f8 in ?? () from /usr/lib64/libaticaldd.so
#2 0x00002aaaac9b234b in ?? () from /usr/lib64/libaticaldd.so
#3 0x00002aaaac9b6948 in ?? () from /usr/lib64/libaticaldd.so
#4 0x00002aaaac9b7056 in ?? () from /usr/lib64/libaticaldd.so
#5 0x00002aaaac9b76a2 in ?? () from /usr/lib64/libaticaldd.so
#6 0x00002aaaac8a9333 in ?? () from /usr/lib64/libaticaldd.so
#7 0x00002aaaac8a96e5 in ?? () from /usr/lib64/libaticaldd.so
#8 0x00002aaaac8abd67 in ?? () from /usr/lib64/libaticaldd.so
#9 0x00002aaaacc0506d in ?? () from /usr/lib64/libaticaldd.so
#10 0x00002aaaac83e14a in ?? () from /usr/lib64/libaticaldd.so
#11 0x00002aaaac83f9ed in ?? () from /usr/lib64/libaticaldd.so
#12 0x00002aaaaaeb91c8 in gpu::NullKernel::create(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, void const*, unsigned long) ()
from /usr/lib64/libamdocl64.so
#13 0x00002aaaaaebb473 in gpu::Kernel::create(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, void const*, unsigned long) ()
from /usr/lib64/libamdocl64.so
#14 0x00002aaaaaecaa16 in gpu::Program::createKernel(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, gpu::Kernel::InitData const*, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool*, void const*, unsigned long) ()
from /usr/lib64/libamdocl64.so
#15 0x00002aaaaaec761c in gpu::NullProgram::linkImpl(amd::option::Options*) () from /usr/lib64/libamdocl64.so
#16 0x00002aaaaae6c9bf in device::Program::build(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, char const*, amd::option::Options*) () from /usr/lib64/libamdocl64.so
---Type <return> to continue, or q <return> to quit---
#17 0x00002aaaaae7bad7 in amd::Program::build(std::vector<amd::Device*, std::allocator<amd::Device*> > const&, char const*, void (*)(cl_program, void*), void*, bool) () from /usr/lib64/libamdocl64.so
#18 0x00002aaaaae5d444 in clBuildProgram () from /usr/lib64/libamdocl64.so
#19 0x000000000040c735 in runTest (testName="DGEMM", dev=0xffb340, ctx=0xd5a4d0, queue=0xd8b2f0, resultDB=...,
op=..., compileFlags="-DK_DOUBLE_PRECISION ") at SGEMM.cpp:197
#20 0x0000000000406556 in RunBenchmark (devcpp=<value optimized out>, ctxcpp=<value optimized out>,
queuecpp=<value optimized out>, resultDB=..., op=...) at SGEMM.cpp:114
#21 0x0000000000404322 in main (argc=3, argv=<value optimized out>) at ../../common/main.cpp:208
The 32-bit atom_* functions used in src/opencl/level1/bfs/bfs_uiuc_spill.cl do not take volatile arguments [1], while the pointers passed to these functions are volatile. This results in some OpenCL compilers (e.g., vanilla Clang with libclc headers) rejecting the kernel due to the loss of volatile when invoking the atom_* functions.
The proper fix would be switching to the 32-bit atomic_* functions introduced with OpenCL 1.1 [2], which do take volatile arguments.
[1] http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/cl_khr_global_int32_base_atomics.html
[2] http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/atomicFunctions.html
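A minimal sketch of the suggested change (illustrative OpenCL C, not the actual bfs_uiuc_spill.cl code; the kernel and buffer names are made up): the OpenCL 1.1 atomic_* functions declare their pointer parameter volatile, so no qualifier is lost.

```c
// Hypothetical kernel fragment illustrating the fix; 'count' and
// 'frontier' stand in for whatever volatile buffers the BFS kernel uses.

// OpenCL 1.0 extension form -- the parameter is not volatile, so passing
// a volatile __global int* drops the qualifier and some compilers reject it:
//   old = atom_add(count, 1);

// OpenCL 1.1 form -- atomic_add takes a volatile pointer directly:
__kernel void frontier_push(volatile __global int* count,
                            __global int* frontier,
                            int node)
{
    int old = atomic_add(count, 1);   // no loss of volatile
    frontier[old] = node;
}
```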
Hi,
In the MaxFlops benchmark, when running more than one iteration, the OpenCL events used to time the MAD kernel are not freed. Here is the patch:
diff --git a/src/opencl/level0/MaxFlops.cpp b/src/opencl/level0/MaxFlops.cpp
index 0639d10..765c331 100644
--- a/src/opencl/level0/MaxFlops.cpp
+++ b/src/opencl/level0/MaxFlops.cpp
@@ -588,10 +588,11 @@ RunTest(cl_device_id id,
CL_CHECK_ERROR(err);
- // Event object for timing
- Event evKernel_madd("madd");
for (int passCounter=0; passCounter < npasses; passCounter++)
{
+ // Event object for timing
+ Event evKernel_madd("madd");
+
err = clEnqueueNDRangeKernel(queue, kernel_madd, 1, NULL,
&globalWorkSize, &localWorkSize,
0, NULL, &evKernel_madd.CLEvent());
Bye,
Ettore Speziale
Running configure on ARM does not work because of the -m32 flag, which does not exist there. I had to remove it from the file by hand. Another option could be to check whether the host is ARM and, if so, not use that flag.
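A possible shape for that check (a sketch against a hypothetical configure fragment, not SHOC's actual configure script): only add -m32 on x86 hosts.

```shell
# Sketch: guard the -m32 flag by host architecture (illustrative only).
arch=$(uname -m)
case "$arch" in
    i?86|x86_64)
        CXXFLAGS="$CXXFLAGS -m32"   # x86: keep the 32-bit flag
        ;;
    *)
        ;;                          # ARM and others: -m32 does not exist
esac
```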
The MIC branch has a very rudimentary implementation of the device info class. It does not provide device names via the driver script, and provides only rudimentary device names (e.g., "MIC 0") in response to the device info command-line flag supported by each benchmark.
In an attempt to provide implementations of SpMV comparable to those used in the CUDA and OpenCL versions, the SpMV "CSR Vector" operations have been implemented using OpenMP nested parallelism. The outer loop is parallelized using a conventional "omp parallel" directive, the inner loop with an "omp parallel for" directive plus a reduction clause. The number of threads used for the inner and outer loops is specified using a num_threads clause, and dynamic thread count management is turned off. This is intended to mimic the CUDA/OpenCL versions' use of a reduction that fits within a single warp.
However, the performance of the "CSR Vector" version is very poor compared to the "CSR Scalar" version that simply parallelizes the outer loop. The performance changes with the number of inner and outer threads. The number of inner-loop threads must be small, because the number of non-zeros in each row is relatively small (probably too small to overcome the OpenMP overhead), even with size-4 problems. Surprisingly, performance is also better with a small number of outer-loop threads, which raises the question of whether we are implementing the nested parallelism correctly.
The driver script reports benchmark errors because the MIC version of DeviceMemory reports different metrics than the CUDA and OpenCL versions.
Error message:
Chose device: name='GK20A' index=0
[ 80.743300] Out of memory: Kill process 2700 (DeviceMemory) score 1 or sacrifice child
[ 80.751793] Killed process 2700 (DeviceMemory) total-vm:1838376kB, anon-rss:2632kB, file-rss:628kB
Killed
MIC version of Spmv benchmark does not report the same metrics as versions for other programming models, so driver script reports benchmark failures.
Hi everyone!
I am using SHOC to characterize some aspects of a board using a Tegra K1 SoC and I ran into an issue with the BusSpeedReadback benchmark with CUDA.
When executing it, the process gets killed by the kernel because of an Out Of Memory situation (OOM).
I tried to pinpoint where it actually fails and found the following line:
108: cudaMalloc((void**)&device, sizeof(float) * numMaxFloats);
The cudaMalloc apparently triggers the OOM mechanism and kills the execution.
I am new to CUDA and GPGPU in general, but as I understand it, this line allocates a block of memory on the device side (GPU) memory. The size of the block is roughly 132MB. The GPU has over 1GB of memory so I don't understand the issue.
I was able to try the same benchmark on a different system with about the same amount of GPU memory (even less, actually), but using OpenCL, and got no issues.
Thanks for your enlightenment.
Best regards,
Marc
EDIT: BusSpeedDownload runs fine, even though it has an almost identical code structure. All of the other benchmarks also run fine.
EDIT2: I found that the OpenCL version checks the available memory before trying to allocate any memory block, and uses that to set the maximum block size. This mechanism does not exist in the CUDA version, which therefore allows the code to allocate memory blocks of up to 512MB. In some configurations, this is too much and provokes an OOM situation.
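The guard the OpenCL version performs could be mirrored on the CUDA side roughly like this (an illustrative sketch, not the actual BusSpeedReadback code; numMaxFloats is the variable from the benchmark, and the one-half headroom factor is an assumption):

```cpp
// Sketch: cap the allocation at a fraction of the memory the device
// reports as free, mirroring the OpenCL version's check (illustrative).
size_t freeBytes = 0, totalBytes = 0;
cudaMemGetInfo(&freeBytes, &totalBytes);

size_t want = sizeof(float) * numMaxFloats;     // numMaxFloats from the benchmark
if (want > freeBytes / 2)                       // leave headroom for the driver
    numMaxFloats = (freeBytes / 2) / sizeof(float);

float* device = NULL;
cudaError_t err = cudaMalloc((void**)&device, sizeof(float) * numMaxFloats);
if (err != cudaSuccess)
{
    // shrink further or report a clean failure, rather than letting the
    // kernel OOM killer terminate the process
}
```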
I also found something weird about the values reported by the CUDA version of BusSpeedDownload.
For a block size of 4096kB, kernel time is: 1.67431ms (mean value).
So bandwidth should be 2.5GB/s. But SHOC reports a mean value of 3GB/s for this data size.
I observed the same with the CUDA version of the BusSpeedReadback benchmark.
The OpenCL version gives coherent values.
Did I miss something ?
Best,
Marc
I tried to build SHOC, but I encountered some problems:
- My OpenCL installation lives under /usr/local/cuda, which is hard for the configure script to find. Some flags like OPENCL_CFLAGS and OPENCL_LIBS would help. Abusing CXXFLAGS (CFLAGS, on the other hand, does not work) helps, though.
- I installed to $HOME/usr, but running shocdriver from $HOME/usr/bin tells me that it cannot find driver.pl. The actual programs work just fine.
User Nazia Khan has reported a build failure for OpenCL Spmv on the current AMD APP SDK.
*.err files contain the following:
Error: CL_BUILD_PROGRAM_FAILURE in Spmv.cpp line 438
How do we incorporate third-party code? For instance, we have a sort that uses VexCL code, but it's unclear whether we need to rewrite this code or include it as an external dependency.
Apparently the normal Perl behavior when using system() is for Ctrl-C to kill the child but not the Perl script. It seems like it will confuse and annoy users to have to hit Ctrl-C many times to get the driver to stop completely.
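One common workaround (a sketch, not the actual driver.pl code) is to inspect the status after each system() call and stop the driver when the child died from SIGINT:

```perl
# Sketch (not the actual driver.pl): after each system() call, check
# whether the child was killed by SIGINT and stop the driver if so.
my $rc = system(@benchmark_cmd);     # @benchmark_cmd is hypothetical
if (($rc & 127) == 2) {              # low 7 bits hold the fatal signal; 2 = SIGINT
    die "Benchmark interrupted by Ctrl-C; stopping driver.\n";
}
```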
Hello, Dakar team.
I'm trying to build SHOC 1.1.5 with CUDA and MPI support under CentOS 6.5 x64. Our PC has the CUDA SDK 6.5, OpenMPI 1.8.1, the Intel C++ compiler, and Intel MKL 11.1 installed.
The configure command ./configure CPPFLAGS="-I/usr/local/cuda/include" --with-cuda --with-mpi ends fine. Its output contains the following lines:
configure: checking for usable OpenCL opencl.h header
checking OpenCL/opencl.h usability... yes
checking OpenCL/opencl.h presence... yes
checking for OpenCL/opencl.h... yes
checking for usable OpenCL library... -lOpenCL
checking for nvcc... /usr/local/cuda/bin/nvcc
checking cuda.h usability... yes
checking cuda.h presence... yes
checking for cuda.h... yes
checking cuda_runtime.h usability... yes
checking cuda_runtime.h presence... yes
checking for cuda_runtime.h... yes
checking for cublasInit in -lcublas... yes
checking for cufftPlan1d in -lcufft... yes
checking for mpicxx... /usr/local/mpi/bin/mpicxx
checking whether we can compile an MPI program using /usr/local/mpi/bin/mpicxx... yes
checking whether we can link an MPI program using /usr/local/mpi/bin/mpicxx... yes
So I concluded that CUDA, OpenCL, and MPI were successfully found. But the make command fails:
/usr/local/mpi/bin/mpicxx -g -O2 -L../../../../src/cuda/common -L../../../../src/common -o BusSpeedDownload main.o BusSpeedDownload.o -lSHOCCommon "/tmp/tmpxft_00007322_00000000-16_bogus.o" "-L/usr/local/cuda/bin/../targets/x86_64-linux/lib/stubs" "-L/usr/local/cuda/bin/../targets/x86_64-linux/lib" -lcudadevrt -lcudart_static -lrt -lpthread -ldl -lrt -lrt
icpc: error #10236: File not found: '/tmp/tmpxft_00007322_00000000-16_bogus.o'
The PATH and LD_LIBRARY_PATH variables are set to proper values. Other CUDA-capable applications, like CUDA-accelerated HPL, work fine. Could you help me figure out where the problem is?
Thank you.
The S3D benchmark gives the following output:
$ ../bin/Serial/OpenCL/S3D -s 1 -p 0 -d 1
Chose device: name='GeForce GT 330M' index=1 id=0x1022600
Compiling kernels (phase 1)...Error: CL_BUILD_PROGRAM_FAILURE in S3D.cpp line 306
Retsize: 69
Log: kernel referenced an external function exp, that could not be found.
Error: CL_BUILD_PROGRAM_FAILURE in S3D.cpp line 315
Retsize: 69
Log: kernel referenced an external function exp, that could not be found.
Error: CL_BUILD_PROGRAM_FAILURE in S3D.cpp line 317
Retsize: 69
Log: kernel referenced an external function exp, that could not be found.
done.
Generating OpenCL Kernel Objects (phase 1)...Error: CL_INVALID_PROGRAM_EXECUTABLE in S3D.cpp line 337
Error: CL_INVALID_PROGRAM_EXECUTABLE in S3D.cpp line 355
Error: CL_INVALID_PROGRAM_EXECUTABLE in S3D.cpp line 361
done.
Error: CL_INVALID_KERNEL in S3D.cpp line 390
Error: CL_INVALID_KERNEL in S3D.cpp line 392
Error: CL_INVALID_KERNEL in S3D.cpp line 394
Error: CL_INVALID_KERNEL in S3D.cpp line 404
Error: CL_INVALID_KERNEL in S3D.cpp line 406
Error: CL_INVALID_KERNEL in S3D.cpp line 408
Error: CL_INVALID_KERNEL in S3D.cpp line 411
Error: CL_INVALID_KERNEL in S3D.cpp line 411
Error: CL_INVALID_KERNEL in S3D.cpp line 411
Error: CL_INVALID_KERNEL in S3D.cpp line 411
Error: CL_INVALID_KERNEL in S3D.cpp line 411
Error: CL_INVALID_KERNEL in S3D.cpp line 411
Executing kernels (phase 1)...Memory fault
Testing issue tracker with a couple of minor action items for QTC integration.
@adanalis should check and make sure:
With the new LLVM compiler backend, CUDA FFT performance dropped by 50% on Keeneland. OpenCL performance stayed the same.
I suspect this might be due to loops being unrolled differently (the unroll option that used to go to the old compiler is now ignored).
Hi,
OpenCL events used to time kernels are not released in QueueDelay.cpp.
Here is the patch:
diff --git a/src/opencl/level0/QueueDelay.cpp b/src/opencl/level0/QueueDelay.cpp
index 8497fcd..b81e971 100644
--- a/src/opencl/level0/QueueDelay.cpp
+++ b/src/opencl/level0/QueueDelay.cpp
@@ -137,18 +137,17 @@ void RunBenchmark(cl::Device& devcpp,
size_t localWorkSize = (maxGroupSize >= 256 ? 256 : maxGroupSize);
size_t globalWorkSize = localWorkSize * 256;
- // Declare event objects for the kernels
- Event evKernel1("Run Kernel1");
- Event evKernel2("Run Kernel2");
- Event evKernel3("Run Kernel3");
- Event evKernel4("Run Kernel4");
-
//Test single kernel
for (int j = 0; j < passes; j++)
{
double total = 0.0;
for (int i = 0; i < reps; i++)
{
+ Event evKernel1("Run Kernel1");
+ Event evKernel2("Run Kernel2");
+ Event evKernel3("Run Kernel3");
+ Event evKernel4("Run Kernel4");
+
err = clEnqueueNDRangeKernel(queue, kernel1, 1, NULL,
&globalWorkSize, &localWorkSize,
0, NULL, &evKernel1.CLEvent());
@@ -191,6 +190,11 @@ void RunBenchmark(cl::Device& devcpp,
double total = 0.0;
for (int i = 0; i < reps; i++)
{
+ Event evKernel1("Run Kernel1");
+ Event evKernel2("Run Kernel2");
+ Event evKernel3("Run Kernel3");
+ Event evKernel4("Run Kernel4");
+
err = clEnqueueNDRangeKernel(queue, kernel1, 1, NULL,
&globalWorkSize, &localWorkSize,
0, NULL, &evKernel1.CLEvent());
@@ -240,6 +244,11 @@ void RunBenchmark(cl::Device& devcpp,
double total = 0.0;
for (int i = 0; i < reps; i++)
{
+ Event evKernel1("Run Kernel1");
+ Event evKernel2("Run Kernel2");
+ Event evKernel3("Run Kernel3");
+ Event evKernel4("Run Kernel4");
+
err = clEnqueueNDRangeKernel(queue, kernel1, 1, NULL,
&globalWorkSize, &localWorkSize,
0, NULL, &evKernel1.CLEvent());
Bye,
[email protected]
One recently fixed uninitialized memory error was easily detected and reported by the compiler. There may be other problems, and these are easy fixes.
In SHOC 1.1.1, we face a failure in scan. The code segment (from line #325) applies a BLOCKING enqueueWrite followed by clGetEventProfilingInfo. The assumption is that the blocking write completes (i.e., the event returned by &evTransfer.CLEvent() is set to CL_COMPLETE) when the call to clEnqueueWriteBuffer returns.
However, according to the 1.1 spec (same in 1.2), the behavior of a BLOCKING clEnqueueWriteBuffer API is not completely synchronous: "If blocking_write is CL_TRUE, the OpenCL implementation copies the data referred to by ptr and enqueues the write operation in the command-queue. The memory pointed to by ptr can be reused by the application after the clEnqueueWriteBuffer call returns." (OpenCL spec 1.1 v45, page 62). That is, the enqueueWrite event is set to CL_COMPLETE only after the data is written to the device, which may be after the clEnqueueWriteBuffer API returns. Hence, you need to wait on the event (clFinish or clWaitForEvents) before accessing the profiling info.
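The fix described here can be sketched as follows (illustrative host code, not the actual Scan.cpp; queue, devBuf, hostPtr, and nbytes are stand-ins for the benchmark's variables):

```c
/* Sketch: even with blocking_write = CL_TRUE, wait on the event before
   querying profiling info, since it may not yet be CL_COMPLETE. */
cl_event ev;  /* evTransfer.CLEvent() in the SHOC code */
err = clEnqueueWriteBuffer(queue, devBuf, CL_TRUE, 0, nbytes, hostPtr,
                           0, NULL, &ev);
CL_CHECK_ERROR(err);

err = clWaitForEvents(1, &ev);   /* ensure the event reached CL_COMPLETE */
CL_CHECK_ERROR(err);

cl_ulong start, end;
err = clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                              sizeof(cl_ulong), &start, NULL);
CL_CHECK_ERROR(err);
err = clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                              sizeof(cl_ulong), &end, NULL);
CL_CHECK_ERROR(err);
```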
Thanks, --Yariv
The dates in results.csv are meaningless, but it looks like we may have a (nearly) full set of duplicates in that table. That could be an artifact of the Google Public Data Explorer expecting everything to be time-based, but there's no reason to keep duplicates in the table.
Also, we should have a different sentinel or empty value for unknown dates.
For other CSVs, like specs for a platform/device, if we want to keep the date column, I'd recommend changing their date to be the release date of the architecture, or the CUDA platform version, etc. It would be meaningful that way. (Otherwise, I'd say remove the date for everything but results.)
The MIC version of MaxFlops reports a value that seems to be too high by a factor of 10.
Running benchmark MaxFlops
result for maxspflops: 18065.5000 GFLOPS
result for maxdpflops: 13108.8000 GFLOPS
From Reza's blog, the max performance should be something around the theoretical peaks of 2112 GFLOPS, SP and 1056 GFLOPS, DP.
https://software.intel.com/en-us/articles/intel-xeon-phi-core-micro-architecture
This could possibly just be a reporting error.
I fixed a bug today with these timings, which I believe makes those results in the database wrong. After my fix, the rates are now similar between cuda/opencl, whereas previously there was a 10^3 difference.
I'm throwing this out there for suggestions about what to do with the results database.
Specifically, there is a bug report that CUDA 7.5 gives an error with the configure flags that use compute_12. It may be time to remove these from SHOC, if we don't want to test for support explicitly.
diff --git a/config/find_cuda_libs.sh b/config/find_cuda_libs.sh
index 8aed2e3..1a254b0 100755
--- a/config/find_cuda_libs.sh
+++ b/config/find_cuda_libs.sh
@@ -42,7 +42,7 @@ libspec=`$NVCC -dryrun bogus.cu 2>&1 | grep LIBRARIES | sed 's/^.*LIBRARIES=//'`
#echo "libspec=$libspec"
if [ $cudart_flag_supported -eq 1 ]
then
-    $NVCC -dryrun bogus.cu 2>&1 | tail -1 | sed "s#^.*-o \"a.out\"##" | sed 's#"[a-zA-Z0-9/_-]*\.o"##' | sed 's/-Wl,--start-group//' | sed 's/-Wl,--end-group//'
+    $NVCC -dryrun bogus.cu 2>&1 | tail -1 | sed "s#^.*-o \"a.out\"##" | sed 's#"[a-zA-Z0-9/_-]*\.o"##g' | sed 's/-Wl,--start-group//' | sed 's/-Wl,--end-group//'
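The only change is the trailing g on the object-file substitution. A quick illustration of why it matters (the link line below is made up, not real nvcc output):

```shell
link_line='"x.o" "y.o" -lcudadevrt -lcudart_static'
# Without /g, sed removes only the first quoted .o file:
echo "$link_line" | sed 's#"[a-zA-Z0-9/_-]*\.o"##'
# With /g, every quoted .o file is stripped from the link line:
echo "$link_line" | sed 's#"[a-zA-Z0-9/_-]*\.o"##g'
```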
On a GTX 460, I got various runtimes of infinity. On a Quadro FX 380, I got results like 1e-207 seconds and e+192 seconds.
Bring over Jeremy's ticket for VS project files for SHOC. Will assign to him pending GitHub account creation.
I am expecting to observe a speedup when I run either an EP or TP benchmark on multiple devices, but that is not the case.
The Stencil2D benchmark does show a speedup when I use multiple devices:
./shocdriver -d 0 -cuda -s 4 -benchmark Stencil2D
result for stencil: 141.2280 GFLOPS
vs.
./shocdriver -d 0,1,2,3 -cuda -s 4 -benchmark Stencil2D
result for stencil: 406.1190 GFLOPS
However, this is the only benchmark I have found (so far) that shows a speedup. For example:
./shocdriver -d 0 -cuda -s 4 -benchmark Scan
result for scan: 46.8924 GB/s
vs
./shocdriver -d 0,1,2,3 -cuda -s 4 -benchmark Scan
result for scan: 46.8561 GB/s
Similarly, Reduction and GEMM show no improvement either.
Am I missing something here? I am running version 1.1.5.
Hello.
I've compiled SHOC 1.1.5 with CUDA/OpenCL/MPI support under CentOS 6.5, with CUDA 6.5, Intel Compiler 11.1, Intel MKL 11.1, and OpenMPI 1.8.1 installed. The PC has four NVIDIA Tesla K20m GPUs. When I try to run it using all GPUs, MaxFlops' performance is the same as if I ran the test using one GPU only. This happens in both OpenCL and CUDA modes. I tried to change the problem size from "-s 1" to "-s 4", but nothing changed.
Here below are console outputs:
[bald@node8 bin]$ ./shocdriver -cuda -s 1 -d 0
--- Welcome To The SHOC Benchmark Suite version 1.1.5 ---
Hostname: node8.cluster
Platform selection not specified, default to platform #0
Number of available platforms: 1
Number of available devices on platform 0 : 4
Device 0: 'Tesla K20m'
Device 1: 'Tesla K20m'
Device 2: 'Tesla K20m'
Device 3: 'Tesla K20m'
Specified 1 device IDs: 0
Using size class: 1
--- Starting Benchmarks ---
Running benchmark BusSpeedDownload
result for bspeed_download: 6.2430 GB/sec
Running benchmark BusSpeedReadback
result for bspeed_readback: 6.6992 GB/sec
Running benchmark MaxFlops
result for maxspflops: 3099.6100 GFLOPS
result for maxdpflops: 1164.3600 GFLOPS
[bald@node8 bin]$ ./shocdriver -cuda -s 1 -d 0,1,2,3
--- Welcome To The SHOC Benchmark Suite version 1.1.5 ---
Hostname: node8.cluster
Platform selection not specified, default to platform #0
Number of available platforms: 1
Number of available devices on platform 0 : 4
Device 0: 'Tesla K20m'
Device 1: 'Tesla K20m'
Device 2: 'Tesla K20m'
Device 3: 'Tesla K20m'
Specified 4 device IDs: 0,1,2,3
Using size class: 1
--- Starting Benchmarks ---
Running benchmark BusSpeedDownload
result for bspeed_download: 6.1165 GB/sec
Running benchmark BusSpeedReadback
result for bspeed_readback: 6.6993 GB/sec
Running benchmark MaxFlops
result for maxspflops: 3099.1200 GFLOPS
result for maxdpflops: 1165.0200 GFLOPS
But, as nvidia-smi showed, all of the GPUs were almost fully loaded by the MaxFlops application:
[root@node8 ~]# nvidia-smi
Thu Sep 4 16:29:27 2014
+------------------------------------------------------+
| NVIDIA-SMI 340.29 Driver Version: 340.29 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K20m Off | 0000:02:00.0 Off | 0 |
| N/A 37C P0 150W / 225W | 96MiB / 4799MiB | 91% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K20m Off | 0000:03:00.0 Off | 0 |
| N/A 38C P0 149W / 225W | 96MiB / 4799MiB | 90% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K20m Off | 0000:81:00.0 Off | 0 |
| N/A 36C P0 153W / 225W | 96MiB / 4799MiB | 91% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K20m Off | 0000:82:00.0 Off | 0 |
| N/A 38C P0 117W / 225W | 96MiB / 4799MiB | 91% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Compute processes: GPU Memory |
| GPU PID Process name Usage |
|=============================================================================|
| 0 2576 /home/bald/Downloads/shoc/bin/EP/CUDA/MaxFlops 80MiB |
| 1 2577 /home/bald/Downloads/shoc/bin/EP/CUDA/MaxFlops 80MiB |
| 2 2578 /home/bald/Downloads/shoc/bin/EP/CUDA/MaxFlops 80MiB |
| 3 2579 /home/bald/Downloads/shoc/bin/EP/CUDA/MaxFlops 80MiB |
+-----------------------------------------------------------------------------+
I want to see the overall performance of my hybrid PC. What did I do wrong? Is this behavior normal for SHOC? Thank you.
I am trying to improve upon workload partitioning solutions for heterogeneous platforms. The SHOC benchmarks run on GPUs only. How do I compare CPU-only performance with this? How do I partition the workload between CPU and GPU?
I ran into an interesting potential problem with SHOC when trying to figure out why CUDA BFS wasn't working on my current laptop.
BFS needs compute capability 1.2 for atomics. The BFS code checks whether the target device is at least CC 1.2, but doesn't check (doesn't have a way to check?) whether the code was compiled for CC 1.2. For whatever reason, our default CUDA_CPPFLAGS didn't include compilation for CC 1.2 - it had 1.0, 1.1, 1.3, and 2.0. Presumably, at runtime the kernel selected by the CUDA runtime was the CC 1.1 version. BFS ran but because it requires the atomics from 1.2, its verification of the results computed on the GPU failed.
So the problem is: the BFS code checked the CC of the device, but couldn't tell that it didn't have CC 1.2 code available to run on it. Is there a way to check at run time which version of the code will be used?
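One possible way to answer that question (a hedged sketch; the kernel symbol below is hypothetical, not the actual SHOC BFS kernel) is cudaFuncGetAttributes, whose binaryVersion and ptxVersion fields report which compute capability the kernel image the runtime selected was compiled for:

```cpp
// Hedged sketch, not actual SHOC code: ask the CUDA runtime which CC the
// kernel image it will launch was compiled for.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void bfs_kernel(int* frontier) { }   // hypothetical kernel

int main()
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, bfs_kernel);
    if (err != cudaSuccess) { std::printf("query failed\n"); return 1; }

    // binaryVersion/ptxVersion encode CC as major*10+minor (e.g. 12 = CC 1.2)
    std::printf("binary CC: %d, PTX CC: %d\n",
                attr.binaryVersion, attr.ptxVersion);
    if (attr.binaryVersion < 12 && attr.ptxVersion < 12)
        std::printf("No CC >= 1.2 image available: BFS atomics "
                    "would silently misbehave\n");
    return 0;
}
```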
As per comments from the AMD developer who debugged the issue:
The test generates kernels based on the device capabilities reported in OCL. In the case of the hybrid stack (Orca), the OCL runtime reports 32KB of local device memory, but the ROCm stack reports 64KB.
The test uses half of the reported amount for a local array in a kernel. Thus ROCm ends up with more LDS usage, hence lower wave occupancy and lower performance. The issue should be reported to devrel for test-logic replacement.
When I (Jeremy) added timing of PCIe to the FFT benchmark, I only added timing of the send portion. (There wasn't an explicit readback in the main benchmark code I could add timing of.) To be consistent with other benchmarks, we should probably add an explicit readback and capture timing of that piece to include in the _PCIE results.
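A readback-timing sketch of the kind described, using CUDA events (the buffer names and sizes are illustrative, not the FFT benchmark's actual variables):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t nbytes = 1 << 24;     // 16 MiB, arbitrary
    void *devBuf = NULL, *hostBuf = NULL;
    cudaMalloc(&devBuf, nbytes);
    cudaMallocHost(&hostBuf, nbytes);  // pinned host memory

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time an explicit device-to-host readback, mirroring how the
    // send direction is timed.
    cudaEventRecord(start, 0);
    cudaMemcpy(hostBuf, devBuf, nbytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("readback: %.3f ms (%.2f GB/s)\n", ms,
           (nbytes / 1.0e9) / (ms / 1.0e3));

    cudaFreeHost(hostBuf);
    cudaFree(devBuf);
    return 0;
}
```

The measured elapsed time is the piece that could then be folded into the `_PCIE` results alongside the existing send-side number.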
I believe the OpenCL version of the `top_scan` kernel in the sort benchmark has a data race. On line 127 of src/opencl/level1/sort/sort.cl the `__local` `s_seed` variable is read by all threads with `get_local_id(0) < n`, and on line 132 the variable is written by the thread with `get_local_id(0) == n - 1`, while there is no barrier between the statements.
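If the race is as described, the minimal fix would be a local barrier between the read and the write. A hedged sketch of the pattern (only the racing lines are shown; `val` and `newSeed` are illustrative names, not the kernel's actual variables):

```opencl
__local float s_seed;   /* illustrative declaration */

/* All work-items read the shared seed (the line-127 read). */
if (get_local_id(0) < n)
    val += s_seed;

/* Ensure every read has completed before the seed is overwritten. */
barrier(CLK_LOCAL_MEM_FENCE);

/* A single work-item updates the seed (the line-132 write). */
if (get_local_id(0) == n - 1)
    s_seed = newSeed;
```

`barrier(CLK_LOCAL_MEM_FENCE)` both synchronizes the work-group and makes the earlier local-memory reads visible before the write proceeds.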
Intel iGPUs from Sandy Bridge onwards are now supported at least in part on Linux using Beignet (https://01.org/beignet). An initial test using Beignet 1.0.0 on a Haswell CPU generated some reasonable results for several of the benchmarks, but many failed due to an issue with the "FillTimingInfo" class in Event.cpp (the specific error returned was CL_PROFILING_INFO_NOT_AVAILABLE).
At this time it's unclear whether this is a SHOC issue, since other platforms (including Intel CPUs) work fine, but it may be worth investigating for future Intel iGPUs.
Dear SHOCers,
I am contemplating modifying SHOC to use the clang compiler, which recently gained mainline support for CUDA:
http://llvm.org/releases/3.8.0/tools/clang/docs/ReleaseNotes.html#cuda-support-in-clang
I'd like to know if you would be interested in PRs that enable SHOC to use this compiler toolchain.
On a related note, it would be interesting to compare native OpenCL SHOC results to HIPified CUDA benchmarks:
http://gpuopen.com/hip-release-0-82/
As I have no feel for how mature a 0.82 release would be, this is more of an experiment. But still, I'd like to know if the SHOC community would be interested.
In workloads that run both single and double precision, such as GEMM, add an option to run only one of the two.
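A minimal sketch of what such a switch might look like (the `--precision` flag name and the `parsePrecision` helper are hypothetical, not SHOC's existing option-parsing API):

```cpp
#include <cstring>
#include <string>

enum class Precision { Single, Double, Both };

// Hypothetical helper: map a "--precision=<value>" command-line
// argument to an enum, defaulting to running both precisions.
Precision parsePrecision(int argc, const char** argv) {
    for (int i = 1; i < argc; ++i) {
        if (std::strncmp(argv[i], "--precision=", 12) == 0) {
            std::string v = argv[i] + 12;
            if (v == "single") return Precision::Single;
            if (v == "double") return Precision::Double;
        }
    }
    return Precision::Both;  // flag absent or unrecognized: run both
}
```

The GEMM driver would then skip the SGEMM or DGEMM pass according to the returned value.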
The MIC version of Spmv has MIC-specific warmup code that runs before the MIC-based spmv operations. This differs from the benchmarks for other programming models, which use a number of warmup passes controlled by a command-line parameter.
This is maybe not really an issue, but it seems that all three of the write benchmarks in the DeviceMemory benchmark have write/write data races (different threads write to the same array position in each of these benchmarks). Is this intentional?
Jeff recommends atoms/s and grid points/s in addition to the standard flops and gbytes rates.
Hi,
I want to run this with one GPU and two threads per task; is it possible?
If someone knows, please help me out!
Thank you :)
Hi,
I am trying to build SHOC on Jetson and get the error "cannot guess build type" when I execute the configure script. Could you tell me how to resolve this?
Thanks
This occurs on a Quadro FX 380 (G96, 256 MB or 512 MB RAM, I forget; 301.xx driver) and a GeForce GTX 460 (GF104, 1 GB; 295.73 driver).
The display goes blank for a second, then returns, and then Windows pops up an error saying the display driver crashed. The console outputs this error:
error=30 name=unknown error at ln: 409