can's Introduction

can - A simple dense matrix-matrix multiplication benchmark.

There are serial, Intel MKL dgemm(), OpenMP, MPI, hybrid (MPI+OpenMP), and hybrid (MPI+OpenACC) versions.
The MPI version is based on Cannon's algorithm. The Intel compiler and the Intel MKL library are required.
The input matrix is filled with pseudorandom numbers generated by the Intel MKL Mersenne Twister (MT19937).
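
As a rough illustration of Cannon's algorithm, here is a minimal single-process sketch in pure Python. It is not the code in can.f: the function names (`cannon`, `split_blocks`, `matmul`) are hypothetical, and the q x q "process grid" is only simulated with nested lists. Each grid cell holds one block of A and one block of B; after an initial skew, every step multiplies the local blocks and then shifts A one cell to the left and B one cell up, with wraparound.

```python
def matmul(a, b):
    """Plain triple-loop product of two dense matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def split_blocks(m, q):
    """Cut an n x n matrix into a q x q grid of (n/q) x (n/q) blocks."""
    nb = len(m) // q
    return [[[row[j * nb:(j + 1) * nb] for row in m[i * nb:(i + 1) * nb]]
             for j in range(q)] for i in range(q)]


def cannon(a, b, q):
    """Simulate Cannon's algorithm on a q x q grid (hypothetical helper)."""
    n = len(a)
    if n % q != 0:
        raise ValueError("matrix size must be divisible by the grid size")
    nb = n // q
    A = split_blocks(a, q)
    B = split_blocks(b, q)
    # Initial skew: shift row i of A left by i, column j of B up by j.
    A = [[A[i][(j + i) % q] for j in range(q)] for i in range(q)]
    B = [[B[(i + j) % q][j] for j in range(q)] for i in range(q)]
    C = [[[[0.0] * nb for _ in range(nb)] for _ in range(q)] for _ in range(q)]
    for _ in range(q):
        # Local block multiply-accumulate on every grid cell.
        for i in range(q):
            for j in range(q):
                prod = matmul(A[i][j], B[i][j])
                for r in range(nb):
                    for s in range(nb):
                        C[i][j][r][s] += prod[r][s]
        # Circular shift: A moves one cell left, B moves one cell up.
        A = [[A[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        B = [[B[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    # Reassemble the full n x n result from the blocks.
    return [[C[i // nb][j // nb][i % nb][j % nb] for j in range(n)]
            for i in range(n)]
```

In an MPI implementation each grid cell would be one rank and the two circular shifts would be neighbor exchanges; the sketch only shows the data movement pattern.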

  • binary names:

    • serial: seri
    • OpenMP: omp
    • intel MKL dgemm(): dgemm
    • MPI: can
    • hybrid(MPI+OpenMP): can_hyb
    • hybrid(MPI+OpenACC): can_acc
  • matrix size: imax x imax (param.f)

  • Some notes for MPI and hybrid version:

    • sqrt(np) must be an integer, where np is the number of MPI processes.
    • imax/sqrt(np) must be an integer.
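
The two constraints above can be checked before launching a job. The following is a small sketch, assuming a hypothetical helper name `check_decomposition`; it returns the side length of the process grid and the local block size, or raises an error if either constraint is violated.

```python
import math


def check_decomposition(imax, nprocs):
    """Return (grid side, block size) for an imax x imax matrix on nprocs ranks.

    Hypothetical helper: it only mirrors the two constraints listed above.
    """
    q = math.isqrt(nprocs)
    if q * q != nprocs:
        raise ValueError(f"np={nprocs} is not a perfect square")
    if imax % q != 0:
        raise ValueError(f"imax={imax} is not divisible by sqrt(np)={q}")
    return q, imax // q
```

For example, `check_decomposition(4096, 64)` gives an 8 x 8 grid with 512 x 512 blocks, while `check_decomposition(4096, 28)` raises an error because 28 is not a perfect square.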

how to run

  • Intel compiler and Intel MPI are required.
$ make
$ ./create_input

$ ./seri
or
$ ./omp
or
$ ./dgemm
or
$ mpirun -np $NP ./can
or
$ mpirun -np $NP ./can_hyb

performance comparison (matrix size: 4096x4096, Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz, 14 cores/socket, 2 sockets/node, 4 nodes, Intel OPA):

  • serial
$ ./seri
 serial time:   6.99500107765198        19.6481675908668      Gflops
 trace:   4196462.48061815
  • MKL dgemm() (single thread)
$ MKL_NUM_THREADS=1 ./dgemm
 dgemm time:   3.69211506843567        37.2249918879782      Gflops
 trace:   4196462.48061815
  • MKL dgemm() (28 threads)
$ MKL_NUM_THREADS=28 KMP_AFFINITY=compact ./dgemm
 dgemm time:   1.08629608154297        126.520711808868      Gflops
 trace:   4196462.48061815
  • OpenMP (28 threads)
$ OMP_NUM_THREADS=28 KMP_AFFINITY=compact ./omp
 omp time:  0.852473020553589        161.223816071913      Gflops
 trace:   4196462.48061815
  • MPI
$ mpiexec.hydra -ppn 16 -np 64 ./can
 MPI time:  0.405706882476807        338.764165480622      Gflops
 trace:   4196462.48061815
  • hybrid(MPI+OpenMP)
$ OMP_NUM_THREADS=$((28/4)) KMP_AFFINITY=compact mpiexec.hydra -ppn 4 -np 16 ./can_hyb
 MPI time:  0.325567960739136        422.151347939683      Gflops
 trace:   4196462.48061815
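
The Gflops figures above are consistent with counting 2*n^3 floating-point operations for an n x n product and dividing by the elapsed time. The sketch below reproduces that arithmetic; the function name `gflops` is hypothetical, and the formula is inferred from the printed numbers rather than taken from the benchmark source.

```python
def gflops(n, seconds):
    """Gflops for an n x n matrix product, assuming 2*n**3 operations."""
    return 2.0 * n ** 3 / seconds / 1.0e9


# Serial run above: n = 4096, 6.99500107765198 s, about 19.65 Gflops.
serial = gflops(4096, 6.99500107765198)
```

The same function can be used to compare runs, for example `gflops(4096, t_mpi) / gflops(4096, t_serial)` for a speedup ratio between two measured times.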

check the results

$ ./check c.seri c.dgemm
 maximum error:  9.094947017729282E-012
$ ./check c.seri c.omp
 maximum error:  0.000000000000000E+000
$ ./check c.seri c.can
 maximum error:  1.409716787748039E-011
$ ./check c.seri c.can_hyb
 maximum error:  1.409716787748039E-011
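
For readers who want to reproduce this kind of comparison outside the repository, here is a minimal sketch of a maximum-error check and a trace. It assumes the error is the largest absolute element-wise difference; the actual check.f may define it differently, and the layout of the c.* files is not described here, so the sketch works on in-memory matrices only. The names `max_abs_error` and `trace` are hypothetical.

```python
def max_abs_error(x, y):
    """Largest absolute element-wise difference between two matrices."""
    if len(x) != len(y) or any(len(r) != len(s) for r, s in zip(x, y)):
        raise ValueError("matrices must have the same shape")
    return max(abs(p - q) for r, s in zip(x, y) for p, q in zip(r, s))


def trace(m):
    """Sum of the diagonal elements, usable as a quick sanity check."""
    return sum(m[i][i] for i in range(len(m)))
```

Differences of the order of 1e-11 between versions are plausible for double precision, since each version may sum the products in a different order.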

MPI+OpenACC version

PGI compiler, OpenMPI and Intel MKL are required.
CPU and interconnect are the same as for the normal version; each node has four NVIDIA P100 GPUs.
GPUDirect is used.

$ make -f makefile.acc.mk
$ ./create_input
$ ./seri
 serial time:    51.68619100000000         2.659103927236580      Gflops
 trace:    4196462.480618147
$  mpirun -x LD_LIBRARY_PATH -x PSM2_CUDA=1 -x PSM2_GPUDIRECT=1 -npernode 4 -np 16 ./can_acc
 MPI time:   0.1217727372422814         1128.651261230572      Gflops
 trace:    4196462.480618146
$ ./check c.seri c.can_acc
 maximum error:   1.2278178473934531E-011

Large size test(imax=16*1024, 4 nodes)

  • flat MPI, 64 cores, intel compiler and intel MPI
$ mpiexec.hydra -ppn 16 -np 64 ./can
 MPI time:   82.6075530052185        106.480493637819      Gflops
 trace:   67116321.7059676
  • hybrid(MPI+OpenMP), 112 cores, intel compiler and intel MPI
$  OMP_NUM_THREADS=$((28/4)) KMP_AFFINITY=compact mpiexec.hydra -ppn 4 -np 16 ./can_hyb
 MPI time:   40.3734800815582        217.868090747666      Gflops
 trace:   67116321.7059676
  • hybrid(MPI+OpenACC), 16 GPUs, PGI compiler and OpenMPI
$ mpirun -x LD_LIBRARY_PATH -x PSM2_CUDA=1 -x PSM2_GPUDIRECT=1 -npernode 4 -np 16 ./can_acc
 MPI time:    4.504744562320411         1952.628589816666      Gflops
 trace:    67116321.70596765

can's People

Contributors

dc-fukuoka

Forkers

usb167 vlkale

can's Issues

The names of the source and destination ranks (left, right, up, down) seem to be the wrong way around

  • can.f
      call mpi_cart_shift(cart_comm,0,1,left,right,ierr)
      call mpi_cart_shift(cart_comm,1,1,up,down,ierr)

These names should be the other way around, although the result is still correct...

  • can.f
      call mpi_cart_shift(cart_comm,0,1,left,right,ierr)
      call mpi_cart_shift(cart_comm,1,1,up,down,ierr)

      write(6, '(a,7i4)') "iam,coords(1),coords(2),left,right,up,down:",
     &     iam,coords(1),coords(2),left,right,up,down

iam,coords(1),coords(2),left,right,up,down:   0   0   0  12   4   3   1
iam,coords(1),coords(2),left,right,up,down:   1   0   1  13   5   0   2
iam,coords(1),coords(2),left,right,up,down:   2   0   2  14   6   1   3
iam,coords(1),coords(2),left,right,up,down:   3   0   3  15   7   2   0
iam,coords(1),coords(2),left,right,up,down:   4   1   0   0   8   7   5
iam,coords(1),coords(2),left,right,up,down:   5   1   1   1   9   4   6
iam,coords(1),coords(2),left,right,up,down:   6   1   2   2  10   5   7
iam,coords(1),coords(2),left,right,up,down:   7   1   3   3  11   6   4
iam,coords(1),coords(2),left,right,up,down:   8   2   0   4  12  11   9
iam,coords(1),coords(2),left,right,up,down:   9   2   1   5  13   8  10
iam,coords(1),coords(2),left,right,up,down:  10   2   2   6  14   9  11
iam,coords(1),coords(2),left,right,up,down:  11   2   3   7  15  10   8
iam,coords(1),coords(2),left,right,up,down:  12   3   0   8   0  15  13
iam,coords(1),coords(2),left,right,up,down:  13   3   1   9   1  12  14
iam,coords(1),coords(2),left,right,up,down:  14   3   2  10   2  13  15
iam,coords(1),coords(2),left,right,up,down:  15   3   3  11   3  14  12
  • check the results
$ ifort -g -O3 -march=core-avx2 -fopenmp check.f -o check
$ ./check fort.55 fort.77
 maximum error:  0.000000000000000E+000
$ ./check fort.55 fort.222
 maximum error:  8.867573342286050E-012
$ ./check fort.55 fort.999
 maximum error:  1.386979420203716E-011
$ ./check fort.55 fort.1000
 maximum error:  1.386979420203716E-011

This program was written when I had just started to study MPI, so I may have had some misunderstandings at that time.
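
To make the listing above easier to interpret, here is a small sketch that emulates what mpi_cart_shift returns on a periodic q x q grid. It assumes row-major rank ordering (rank = coords(1)*q + coords(2), 0-based), which matches the printed coordinates; the function name `cart_shift` is hypothetical. For each call, the first returned value is the source rank and the second is the destination rank.

```python
def cart_shift(rank, q, direction, disp=1):
    """Emulate mpi_cart_shift on a periodic q x q grid with row-major ranks.

    Returns (source, dest): the rank that would send to `rank` and the rank
    that `rank` would send to for a shift of `disp` along `direction`.
    """
    coords = [rank // q, rank % q]
    src = list(coords)
    dst = list(coords)
    src[direction] = (coords[direction] - disp) % q
    dst[direction] = (coords[direction] + disp) % q
    return src[0] * q + src[1], dst[0] * q + dst[1]


# Rank 0 on the 4 x 4 grid from the listing above.
left, right = cart_shift(0, 4, 0)  # direction 0 changes coords(1)
up, down = cart_shift(0, 4, 1)     # direction 1 changes coords(2)
```

With this ordering, a shift along direction 0 moves between ranks that differ in coords(1), and a shift along direction 1 moves between ranks that differ in coords(2), which is one way to see why the variable names in can.f look reversed.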
