algebraic-programming / alp Goto Github PK

Home of ALP/GraphBLAS and ALP/Pregel, featuring shared- and distributed-memory auto-parallelisation of linear algebraic and vertex-centric programs. Soon with more to come!

License: Apache License 2.0

CMake 2.26% Makefile 0.30% Shell 2.50% C++ 92.25% C 2.70%

alp's People

Contributors

Stargazers

Watchers

Forkers

learning-chip benbrock byjtew anyzelman

alp's Issues

Remove the banshee backend

The Banshee backend is lagging behind added features and testing, and depends on a toolchain that is not standardised (nor there's any assurance it'll ever be) and that has several troubles (lacking of complete support for the C++ stdlib). Hence, the Banshee backend is nowadays unmaintained, nor there's any plan in sight to revive it. Therefore, we may want to remove it from the codebase and add a dedicated git tag for potential future resurrection (a.k.a., gravestone).

Reasons to remove it:

if you search within the code (e.g., with grep or an editor like VSCode), you may see results also from banshee, which makes the search more complicated
the bootstrap.sh script has dedicated logic, essentially untested (and there's no reason to maintain it, if banshee is never used)
there is a dedicated Makefile infrastructure, also unmaintained and untested with the latest versions

Consider simplifying mpv algorithm

The mpv algorithm now uses an unrolled implementation. One based on std::swap would result in shorter code, though perhaps slightly and practically unnoticeably less performant.

Triangle counting algorithm

Algorithms for triangle counting

Burkhardt : sum_reduce( (A^2) .* A ) / 6
Cohen : sum_reduce( (L * U) .* A ) / 2
Sandia : sum_reduce( (T * T) .* T )
- T being either the lower (L) or upper (U) triangle matrix of A

Routines implied

Out-of-place matrix-matrix multiplication (required, implemented)
Out-of-place masked matrix-matrix multiplication (optimal, not implemented)
Out-of-place masked matrix power-2 kernel (optimal, not implemented)
Haddamard product (required, implemented)
L/U matrix split (required, not implemented)
Matrix-to-scalar sum reduction (required, __see this issue .
Check if a matrix is symmetric (optimal, not implemented)
All-but-diagonal matrix mask (optimal, not implemented)

Misleading documentation for eWiseLambda(Func, Vector, Args...)

This snippet in the documentation of eWiseLambda(Func, Vector, Args...) is wrong.
The usage of grb:: routines in lambda function has been forbidden for eWiseLambda.

void f(
     double &alpha,
     grb::Vector< double > &y,
     const double beta,
     const grb::Vector< double > &x,
     const grb::Semiring< double > ring
) {
     assert( grb::size(x) == grb::size(y) );
     assert( grb::nnz(x) == grb::size(x) );
     assert( grb::nnz(y) == grb::size(y) );
     alpha = ring.getZero();
     grb::eWiseLambda(
         [&alpha,beta,&x,&y,ring]( const size_t i ) {
             double mul;
             const auto mul_op = ring.getMultiplicativeOperator();
             const auto add_op = ring.getAdditiveOperator();
             grb::apply( y[i], beta, x[i], mul_op );
             grb::apply( mul, x[i], y[i], mul_op );
             grb::foldl( alpha, mul, add_op );
     }, x, y );
     grb::collectives::allreduce( alpha, add_op );
}

Assignment-operator from temporaries overwrites ID

Consider the following, in which the final assertion is assumed to hold:

grb::Vector< T > a( n ), b( n );
size_t a_id = grb::getID( a );
assert( grb::getID( b ) != a_id );
// ...
a = b;
assert( grb::getID( a ) == a_id );

Two items:

this is currently guaranteed in all backends for copy-assignment, but is never tested;
this seems currently not guaranteed for move-assignment.

This issue is to introduce such a test, and confirm whether the latter issue indeed exists

Enable compilation with _DEBUG on Github CI also

Two grb::eWiseApply variants missing

The masked version of [D1]<-D2<-D3 seems missing, both operator and monoid variants.

603 Branch Compiling Problem

I get the code from 603-dense-mxm-performance-tests, and edit the alpdense.sh including BLAS_ROOT, LAPACK_LIB, LAPACK_INCLUDE. All these libraries and directories are set as standard 'Generating a Complete LAPACK Library' at https://www.hikunpeng.com/document/detail/en/kunpengaccel/math-lib/devg-kml/kunpengaccel_kml_16_0218.html.
After enter the root directory of ALP, mkdir build && cd build/ , then bash ../alpdense.sh, the following configure and compliation is
`-- Setting the datasets directory GNN_DATASET_PATH to ""
-- Found OpenMP_C: -fopenmp
-- Found OpenMP_CXX: -fopenmp
-- Found OpenMP: TRUE

######### Configured with the following backends: #########
reference;reference_omp;alp_reference

######### COMPILATION OPTIONS AND DEFINITIONS #########
Build type: Release
global flags (from CMake):
common definitions:
common options: -g;-Wall;-Wextra
flags for BACKENDS:
definitions: NDEBUG
options: -O3;-march=native;-mtune=native;-funroll-loops
flags for TESTS:
category: unit
mode: unit_ndebug
definitions:
options:
performance definitions: NDEBUG
performance options: -O3;-march=native;-mtune=native;-funroll-loops
mode: unit_debug
definitions:
options:
performance definitions:
performance options: -O0
default test flags (categories: smoke, performance)
common definitions:
common options:
performance definitions: NDEBUG
performance options: -O3;-march=native;-mtune=native;-funroll-loops
######### END OF COMPILATION OPTIONS AND DEFINITIONS #########

-- Configuring done
-- Generating done
And begin 'Starting standardised smoke tests for the alp_reference backend', and the smoke test finished normally until print that

All smoke tests done.

[100%] Built target smoketests_alp
~/ALP-603-dense-mxm-performance-tests/tests/performance/lapack_dstedc.cpp: In function ‘void alp_program(const inpdata&, bool&)’:
~/ALP-603-dense-mxm-performance-tests/tests/performance/lapack_dstedc.cpp:105:108: error: too few arguments to function ‘void dstedc_(const char*, const int*, double*, double*, double*, const int*, double*, const int*, int*, const int*, int*, size_t)’
105 | dstedc_(&compz, &N, &( vec_d[0] ), &( vec_e[0] ), &( mat_z[0] ), &N, &wopt, &lwork, &iwopt, &liwork, &info);
| ^
In file included from ~/full-package/lapack_adapt/netlib/build/include/lapack.h:11,
from ~/full-package/lapack_adapt/netlib/build/include/lapacke.h:36,
from ~/ALP-603-dense-mxm-performance-tests/tests/performance/lapack_dstedc.cpp:24:
~/full-package/lapack_adapt/netlib/build/include/lapack.h:16142:42: note: declared here
16142 | #define LAPACK_dstedc_base LAPACK_GLOBAL(dstedc,DSTEDC)
| ^~~~~~
~/full-package/lapack_adapt/netlib/build/include/lapacke_mangling.h:5:34: note: in definition of macro ‘LAPACK_GLOBAL’
5 | #define LAPACK_GLOBAL(name,NAME) name##_
| ^~~~
~/full-package/lapack_adapt/netlib/build/include/lapack.h:16143:6: note: in expansion of macro ‘LAPACK_dstedc_base’
16143 | void LAPACK_dstedc_base(
| ^~~~~~~~~~~~~~~~~~
~/ALP-603-dense-mxm-performance-tests/tests/performance/lapack_dstedc.cpp:118:139: error: too few arguments to function ‘void dstedc_(const char*, const int*, double*, double*, double*, const int*, double*, const int*, int*, const int*, int*, size_t)’
118 | dstedc_(&compz, &N, &( vec_d_work[0] ), &( vec_e_work[0] ), &( mat_z_work[0] ), &N, &( work[0] ), &lwork, &( iwork[0] ), &liwork, &info);
| ^
In file included from ~/full-package/lapack_adapt/netlib/build/include/lapack.h:11,
from ~/full-package/lapack_adapt/netlib/build/include/lapacke.h:36,
from ~/ALP-603-dense-mxm-performance-tests/tests/performance/lapack_dstedc.cpp:24:
~/full-package/lapack_adapt/netlib/build/include/lapack.h:16142:42: note: declared here
16142 | #define LAPACK_dstedc_base LAPACK_GLOBAL(dstedc,DSTEDC)
| ^~~~~~
~/full-package/lapack_adapt/netlib/build/include/lapacke_mangling.h:5:34: note: in definition of macro ‘LAPACK_GLOBAL’
5 | #define LAPACK_GLOBAL(name,NAME) name##_
| ^~~~
~/full-package/lapack_adapt/netlib/build/include/lapack.h:16143:6: note: in expansion of macro ‘LAPACK_dstedc_base’
16143 | void LAPACK_dstedc_base(
| ^~~~~~~~~~~~~~~~~~
Compiling dstedc failed
...
...
...
~/ALP-603-dense-mxm-performance-tests/tests/performance/lapack_dpotri.cpp: In function ‘void alp_program(const inpdata&, bool&)’:
~/ALP-603-dense-mxm-performance-tests/tests/performance/lapack_dpotri.cpp:104:54: error: too few arguments to function ‘void dpotri_(const char*, const int*, double*, const int*, int*, size_t)’
104 | dpotri_( &uplo, &N, &( mat_a_work[0] ), &N, &info );
| ^
In file included from ~/full-package/lapack_adapt/netlib/build/include/lapack.h:11,
from ~/full-package/lapack_adapt/netlib/build/include/lapacke.h:36,
from ~/ALP-603-dense-mxm-performance-tests/tests/performance/lapack_dpotri.cpp:25:
~/full-package/lapack_adapt/netlib/build/include/lapack.h:13505:42: note: declared here
13505 | #define LAPACK_dpotri_base LAPACK_GLOBAL(dpotri,DPOTRI)
| ^~~~~~
~/full-package/lapack_adapt/netlib/build/include/lapacke_mangling.h:5:34: note: in definition of macro ‘LAPACK_GLOBAL’
5 | #define LAPACK_GLOBAL(name,NAME) name##_
| ^~~~
~/full-package/lapack_adapt/netlib/build/include/lapack.h:13506:6: note: in expansion of macro ‘LAPACK_dpotri_base’
13506 | void LAPACK_dpotri_base(
| ^~~~~~~~~~~~~~~~~~
Compiling dpotri failed
#####################################################################
LAPACK smoketests (seq)
#####################################################################
../alpdense.sh: line 84: ./dstedc_lapack_reference.exe: No such file or directory
test dstedc failed
../alpdense.sh: line 84: ./dsyevd_lapack_reference.exe: No such file or directory
test dsyevd failed
../alpdense.sh: line 84: ./dsytrd_lapack_reference.exe: No such file or directory
test dsytrd failed
../alpdense.sh: line 84: ./zhetrd_lapack_reference.exe: No such file or directory
test zhetrd failed
Testing dgeqrf_ ( 100 x 200 )
Test repeated 20 times.
time (ms, total) = 44.9702
time (ms, per repeat) = 2.24851
Tests OK
../alpdense.sh: line 84: ./dgesvd_lapack_reference.exe: No such file or directory
test dgesvd failed
Testing dgetrf_ ( 100 x 200 )
Test repeated 20 times.
time (ms, total) = 6.48435
time (ms, per repeat) = 0.324218
Tests OK
../alpdense.sh: line 84: ./dpotri_lapack_reference.exe: No such file or directory
test dpotri failed
`

And there is no executable binary file undet the dictionary ALP-603-dense-mxm-performance-tests/build/tests/performance/
I want to know how to solve the LAPACK functional arguments nonmatching problem.

Internal CI picks up assert failure in hyperdags backend

Operations that call grb::eWiseApply on matrices seem to fail since 4329cd7

Provide eWiseAdd primitives for hyperdags backend

Provide every eWiseAdd primitive declared in base backend in hyperdags backend.

Issue discovered by trying to implement a BFS algorithm using eWiseAdd.

Remove spy's dependence on std::vector...

...for example, by implementing the approach described in spy.hpp

v0.6.0 release

Associated branch(es) should be named vx.y.z-rcn, where vx.y.z corresponds to the version ID in the title and n is the release candidate ID.

Looped start (loop count is n):

Bump new version number in ./CMakeLists.txt
Set same version in doxygen (doxy.conf)
Check compiler warnings and suppressions
Remove trailing spaces and tabs, and check for spaces-before-tabs
Update changelog.md according to MRs in develop after rcn-1 and the current rcn. (Let rc-1 be develop.)
Test rcn (see below for ``sub-checklist'')

If unsuccessful:

fix bugs in separate issues / MRs that merge into develop
rebase vx.y.z-rcn unto develop and rename it to vx.y.z-rcn+1
uncheck all checked boxes above and restart loop

If successful:

merge rcn into develop
ensure internal CI remains happy
merge develop into master
ensure internal CI remains happy
tag master after merge with version ID, and push the new tag

Testing TODOs:

GitHub CI should flag OK
Internal CI should flag OK (warning: the branch name matters!)
Ubuntu ARM without LPF without banshee
Ubuntu ARM with LPF and MPICH without banshee
Ubuntu x86 with LPF and MPICH without banshee
Fedora x86 without LPF without banshee
Fedora x86 with LPF and MPICH without banshee
Any OS on any arch with LPF and OpenMPI without banshee
CentOS x86 with LPF and MPICH without banshee

Add that issues may also be reported on Github...

...not just Gitee

(in README.md)

Internal CI fails on `develop` in debug mode

Due to failing unit tests:

zip (hyperdags, debug)
zip (hyperdags, ndebug)

Expected is that the tests pass successfully.

Cause: the zip (vector x vector -> matrix) did not register the output container as input, causing the hyperdags backend, when compiled in debug mode, to fail a final checksum.

Missing type definitions for multiple backends

Main idea:

Declare the Matrix internal types value_type and const_iterator for all backends, as it's already done for the base and reference ones. This would allow the users to consider base as the general interface for a non-specialized usage of grb.

Defined in the reference backend:

/** @see Matrix::value_type */
typedef D value_type;

/** The iterator type over matrices of this type. */
typedef typename internal::Compressed_Storage<
	D, RowIndexType, NonzeroIndexType
>::template ConstIterator<
	internal::Distribution< reference >
> const_iterator;

BenchmarkerBase does not check for errors in collective calls

Instead, it ignores any error codes coming from collective calls, while on benchmark failures due to interrupted calls to sleep it calls abort instead of returning FAILED. This issue was, in part, also a case of under-specification, but that part is fixed in issue #6 .

grbcxx when performing shared linking still selects static libraries

Consider:
grbcxx -shared -o mylibrary.so #some .o sources to link into the library.

Expected:
grbcxx recognises the -shared flag and links against the shared libraries of the (in this case) reference backend.

Observed:
grbcxx rather still attempts to link against static libraries.

API documentation for the ALP utils

The ALP includes some utilities which, while used by the ALP implementation, could also be useful outside of those implementations. Using them, however, of course would be helped by having proper code documentation online.

This is a continuation of issue #6 which documented the public API of the core ALP primitives and concepts.

Missing useful time for CG

The useful time is not reported for CG when the algorithm does not converge. The bug was introduced when the returned code was correctly changed from SUCCESS to FAILED. However, the useful time should still be reported as we often set a threshold for the maximum number of iterations when evaluating the algorithm, and the useful time is still valid even if the algorithm did not converge.

Generating code documentation for the public ALP/GraphBLAS API

The current make docs targets ALP/GraphBLAS developers, and accordingly generates too much information for ALP/GraphBLAS users.

As a bonus, the HTML documentations could be hosted on algebraic-programming.github.io

Add support for MatrixMarket integer type

Four types of non-zero values are officially supported in the MatrixMarket format: real, complex, integer and pattern. Matrix Market documentation

We do not have support for integer in the current implementation.

Matrix eWiseApply implemented, but not declared in base

The two variants have been implemented in every backend, but not declared in base:

eWiseApply([out] Matrix, [in] Matrix, [in] Matrix, [in] Monoid, ...)
eWiseApply([out] Matrix, [in] Matrix, [in] Matrix, [in] Operator, ...)

In order to have a user-friendly interface for all the grb APIs, it would be interesting to declare these variants in base headers.
It would then appear in the official documentation here.

RC eWiseApply(
	Matrix< OutputType, backend > &C,
	const Matrix< InputType1, backend > &A,
	const Matrix< InputType2, backend > &B,
	const Monoid &mulmono,
	const Phase phase = EXECUTE,
	const typename std::enable_if< !grb::is_object< OutputType >::value &&
		!grb::is_object< InputType1 >::value &&
		!grb::is_object< InputType2 >::value &&
		grb::is_monoid< Monoid >::value,
	void >::type * const = nullptr
)

RC eWiseApply(
	Matrix< OutputType, backend > &C,
	const Matrix< InputType1, backend > &A,
	const Matrix< InputType2, backend > &B,
	const Operator &op,
	const Phase phase = EXECUTE,
	const typename std::enable_if< !grb::is_object< OutputType >::value &&
		!grb::is_object< InputType1 >::value &&
		!grb::is_object< InputType2 >::value &&
		grb::is_operator< Operator >::value,
	void >::type * const = nullptr
)

Dockerfile for ALP-dense with KunpengBLAS and LAPACK

To ease reproducing the benchmarks, I wrote a Dockerfile. Tested successfully on Kunpeng920 on the public Huawei cloud (the kc1.2xlarge.4 instance, the Ubuntu 18.04 server 64bit with ARM OS).

Build docker image

First install the Docker engine following the official guide. Steps are exactly the same as on x86. When pulling images, will get base images for arm64v8 instead of the usual x86 ones.

Then run docker build -t alp-dense . with the Dockerfile and install_lapack.sh below in the same directory:

FROM ubuntu:20.04

RUN apt-get update \
    && DEBIAN_FRONTEND=noninteractive apt-get install -y \
    git wget zip vim \
    gcc g++ gfortran \
    libnuma-dev \
    libomp-dev \
    make cmake \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /opt

# Install KunpengBLAS
# https://www.hikunpeng.com/zh/developer/boostkit/library/math
# https://www.hikunpeng.com/document/detail/zh/kunpengaccel/math-lib/devg-kml/kunpengaccel_kml_16_0011.html
RUN wget https://kunpeng-repo.obs.cn-north-4.myhuaweicloud.com/Kunpeng%20BoostKit/Kunpeng%20BoostKit%2022.0.RC3/BoostKit-kml_1.6.0.zip \
    && unzip BoostKit-kml_1.6.0.zip \
    && dpkg -i boostkit-kml-1.6.0.aarch64.deb

# Install netlib LAPACK and link to KunpengBLAS
# https://www.hikunpeng.com/document/detail/zh/kunpengaccel/math-lib/devg-kml/kunpengaccel_kml_16_0218.html
RUN wget https://github.com/Reference-LAPACK/lapack/archive/v3.9.1.tar.gz -O lapack-3.9.1.tar.gz
COPY ./install_lapack.sh /opt/
RUN bash ./install_lapack.sh

# Build and test ALP-dense
# https://github.com/Algebraic-Programming/ALP/blob/603-dense-mxm-performance-tests/alpdense.md
ENV BLAS_ROOT="/usr/local/kml"
ENV LAPACK_LIB="/opt/lapack_adapt/netlib/build/lib"
ENV LAPACK_INCLUDE="/opt/lapack_adapt/netlib/lapack-3.9.1/LAPACKE/include/"

# on commit 86e5b43 at the time of my test
RUN git clone -b 603-dense-mxm-performance-tests https://github.com/Algebraic-Programming/ALP.git
RUN cd ALP \
    && mkdir build \
    && cd build \
    && bash ../alpdense.sh | tee run_alpdense.log

where the install_lapack.sh used above is:

# Adapted from https://www.hikunpeng.com/document/detail/zh/kunpengaccel/math-lib/devg-kml/kunpengaccel_kml_16_0218.html

netlib=/opt/lapack-3.9.1.tar.gz
klapack=/usr/local/kml/lib/libklapack.a

mkdir lapack_adapt
cd lapack_adapt

# build netlib lapack
mkdir netlib
cd netlib
tar zxvf $netlib
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_POSITION_INDEPENDENT_CODE=ON ../lapack-3.9.1
make -j
cd ../..

cp netlib/build/lib/liblapack.a liblapack_adapt.a

# get symbols defined both in klapack and netlib lapack
nm -g liblapack_adapt.a | grep 'T ' | grep -oP '\K\w+(?=_$)' | sort | uniq > netlib.sym
nm -g $klapack | grep 'T ' | grep -oP '\K\w+(?=_$)' | sort | uniq > klapack.sym
comm -12 klapack.sym netlib.sym > comm.sym 

# update symbols name of liblapack_adapt.a
while read sym; do \
    if ! nm liblapack_adapt.a | grep -qe " T ${sym}_\$"; then \
        continue; \
    fi; \
    ar x liblapack_adapt.a $sym.f.o; \
    mv $sym.f.o ${sym}_netlib.f.o; \
    objcopy --redefine-sym ${sym}_=${sym}_netlib_ ${sym}_netlib.f.o; \
    ar d liblapack_adapt.a ${sym}.f.o; \
    ar ru liblapack_adapt.a ${sym}_netlib.f.o; \
    rm ${sym}_netlib.f.o; \
done < comm.sym

If built successfully, run docker run --rm -it alp-dense and check the out log file at /opt/ALP/build/run_alpdense.log

Reference log file

The printed log should look like this run_alpdense.log, with some caveats:

Ignore the performance numbers here since it was running on a tiny VM.
The last benchmark failed because there are only 8 vCPUs on the VM, while the benchmark explicitly asked for 64 OpenMP threads. Running on a full server should not have this problem.

Set up distributed-parallel unit and smoke tests in CI?

(Would that be possible on GitHub?)

The CG algorithm exits with SUCCESS when max_iterations are reached...

...even if it has not converged.

Expected behaviour: return FAILED instead.

To be confirmed.

Artifact expiration time-out should support serialised CI task execution

A week or two ago we had that (by accident) only one CI runner was active. This resulted in many internal CI jobs failing due to expired artifacts. However, it is probably wise that the artifact expiration is set so that serialised execution in principle is possible.

At current, there are two types of pipelines: develop/master vs. regular. In each, there are two build/consume pairs making each pipeline a 4-stage one. Put in a table, the most recent timings of a successful internal CI run are (entries in minutes, rounded upwards to halves):

Runtime	Stage 2	Stage 4
Develop	0.5+19+2+18.5	1+2+18.5
Regular	0.5+1.5+18.5	0.5+3.5

A careful estimate doubles the above numbers to set the artifact timeouts of stages 1 and 3 with:

Timeouts	Stage 1	Stage 3
Develop	80	43
Regular	41	8

This issue proposes to update the current artifact time-outs with the above new time-out values, whenever the current values are indeed lower than the table suggests.

Unexpected behaviour of eWiseApply([out] Matrix, [in] Matrix, [in] Matrix, ... ) variants

An unexpected behaviour using the two variants of eWiseApply for matrices was noticed.

Documentation of the two variants

eWiseApply([out] Matrix, [in] Matrix, [in] Matrix, [in] Monoid) is supposed to apply the given monoid's operator over all elements of the two matrices, including the {non_zero, zero} couples (UNION), using the provided monoid's identity as a replacement for the zero value.
eWiseApply([out] Matrix, [in] Matrix, [in] Matrix, [in] Operator) is supposed to apply the given monoid's operator over all elements of the two matrices, except for the {non_zero, zero} couples (INTERSECTION).

The observed behaviour showed that only the INTERSECTION behaviour was implemented. A very user-friendly unit-test has been created for the occasion: eWiseApplyMatrix_variants.

Testing quickly:

make test_eWiseApplyMatrix_variants_debug_reference
./tests/unit/eWiseApplyMatrix_variants_debug_reference

Extra note: This unit-test should probably be re-evaluated.

Update code style guidelines

Some guidelines are missing, while clang-format is no longer used. The documentation should be updated accordingly.

HyperDAGs benchmarker class unnecessarily inherits?

include/graphblas/hyperdags/benchmark.hpp seems to unnecessarily inherit from the HyperDAGs Launcher and BenchmarkerBase. If correct/confirmed, an easy fix should be to remove the inheritance

Testing support for customisable matrix index types

While support for customisable index types for matrices has been added, the only test for it is the SpBLAS transition path which overrides the standard / configurable ALP settings to standard integer (int) types for all of row indices, column indices, and nonzero indices. In principle, however, two (or more) matrices of the same backend but with different such index types should be legal to pass to an ALP/GraphBLAS primitive such as mxm.

This issue is to test such mixed usage, and expand semantics or improve implementations as necessary.

HPCG smoke test does not print its executable file name

All other tests, by convention, do so as the first line the program emits.

grb::set does not make use of the dense descriptor

While functionally OK, there's a slight performance penalty associated with it.

Also it clashes with a native interface such as used by transition path libraries -- though not an issue yet (it would become an issue if some future transition path functionality would call grb::set).

LPF buildMatrixUnique unit test fails under very specific conditions

The buildMatrixUnique test for the bsp1d backend fails only for the biggest dense matrix, corresponding to the following line

ALP/tests/unit/buildMatrixUnique.cpp

Line 601 in dbe9a76

{ 1463, 5376 }

The error is raised when executing this line

ALP/include/graphblas/bsp1d/io.hpp

Line 1859 in 7875eed

const lpf_err_t brc = lpf_sync( data.context, LPF_SYNC_DEFAULT );

but only with 16 processes or more: a signal 7 (BUS error) or something similar is raised. This issue is observed only in these conditions (in logical AND)

this test
this input size (i.e., the biggest matrix)
in the CI Docker container, i.e., Ubuntu 20.04 + MPICH v3.3.2
with 16 or more test processes

and does not occur outside of the CI container. Since the engine used in the CI is mpimsg, I suspect it is related to the MPI version used there and to the Docker containerization itself. It might be tracked down to LPF, but I don't know if the other LPF operations happening before play a role (there are multiple buffer registrations, puts and sync primitives), which is why I am opening the issue here and not in LPF.
We may try testing with OpenMPI or a newer MPICH. However, as this seems related to MPI+Docker, I am not sure we want to spend time on this one.

Possible ideas:

within the CI container, install spack to easily build and deploy multiple MPI implementations and versions
might be a problem due to limited shmem (or similar resource) within Docker containers (e.g., https://www.intel.com/content/www/us/en/docs/mpi-library/developer-guide-windows/2021-6/problem-mpi-limitation-for-docker.html)
... [more ideas welcome]

Breadth-First Search (BFS) algorithm

Implementation of a Breadth-First Search (BFS) algorithm.

Versions

1. Compute the number of steps to explore the entire graph
- grb::RC bfs( const Matrix< D > & A, size_t root, size_t & total_steps )
1. Compute (1.) and the minimum number of step(s) to explore each node
- grb::RC bfs( const Matrix< D > & A, size_t root, size_t & total_steps, grb::Vector< size_t > & steps_per_vertex )

Make LPF engine configurable

Currently, cmake/AddGRBInstall.cmake sets the LPF engine to mpimsg, which is (usually) a safe default. In particular mpimsg (or mpirma) are safe since they work on non-IB interconnects. However,

if an end-user does have Infiniband installed, then relying on the ibverbs LPF engine would be recommended;
for some MPI implementations over RDMA-enabled interconnects, selecting mpirma over mpimsg may lead to better performance [1].

Therefore this issue is to add a --with-lpf-engine flag to bootstrap.sh that allows overriding the default mpimsg choice.

[1] https://arxiv.org/abs/1906.03196

Smoke test all ALP/SparseBLAS functions

While full unit testing of ALP/SparseBLAS is superfluous (see issue #12), it would be good to nonetheless add a smoke tests that calls all functions it defines. A suggestion would be to do an SpMSpV and SpMSpM multiplication on a dataset such as west0497, and, within the test, check the output versus ALP/GraphBLAS.

Set up continuous benchmarking?

Large effort has been spent on manually benchmarking different algorithms (CG, HPCG, PageRank), Kernels (SpMV, SpMM), backends (blocking/nonblocking), for each commit and version.

It should be beneficial to set up some continous benchmarking using frameworks like Google Benchmark or Catch2. It can be further automated by GitHub Actions. A nice example is the Pandas benchmark page.

Define performance semantics for retrieving iterators from ALP containers

The SparseBLAS transition path performs manual initialisation of the backend

The SparseBLAS API (#11) currently is compiled against reference and reference_omp, and it is manually ensured that the ALP buffers are properly initialised. It may instead be preferable to rely on the same mechanisms that the Launchers rely on.

Typo in main README.md

doyxgen

Additional internal CI failures triggered for HyperDAG in debug mode

Unit tests that fail are eWiseApply and zip.

Smoke tests that fail are HPCG.

These only fail when compiled in debug mode.

Single-Source Shortest-Path (SSSP) algorithm

Implementation of a Single-Source Shortest-Path (SSSP) algorithm.

Reusing code of the reference backend in the nonblocking backend

A lot of code in the nonblocking backend is the same or very similar to the corresponding code in the reference backend. The current design of the nonblocking backend reuses code of the reference backend in some cases, e.g., the benchmark class, but there are more opportunities for reusing code. Perhaps the vector and matrix classes can easily rely on the reference implementation. Reusing code for the primitives defined in io.hpp, blas1.hpp, blas2.hpp, and blas3.hpp is not trivial with the current design.

Meta-issue: requests for ALP/SparseBLAS support

Since PR #11 , a proof-of-concept for a SparseBLAS interface with an implementation generated by ALP is introduced. This by far, however, does not support all standard functions. As adding new functions in the majority of cases is rather straightforward (modulo testing-- see #12 and #13 ), if you indeed have specific requests for support, please don't hesitate to request it here.

SpTRSV development steps

Prepare sparse L factor as benchmark dataset
Benchmark Intel SpMP trsv kernel as multi-threaded performance baseline for level-schedule algorithm
Benchmark Kokkos sptrsv kernel as multi-threaded performance baseline for partitioned-inverse algorithm
Internal code design for level-schedule algorithm. AMGCL's ILU solve is a relatively simple reference.
Internal code design for partitioned-inverse algorithm
High-level sptrsv API that is "in harmony with" existing GraphBLAS/ALP API and data structures
End-to-end PCG benchmark with incomplete factorization preconditioner

The v0.7 release

This issue keeps track of new features that will make it into v0.7:

a vertex-centric Pregel API (#64),
automatic HyperDAG extraction.

The following will hopefully make it into v0.7:

a port of the nonblocking backend by Mastoras et al. [1] on top of the latest develop.

[1] https://ieeexplore.ieee.org/document/9835271

Feel welcome to suggest new features or bug fixes for a v0.7 release in this thread.

Improve kcore algorithm smoke test

Make kcore algorithm smoke test validate against some ground-truth vector

Improve unit testing for ALP/SparseBLAS

Pull request #11 introduces a proof-of-concept ALP/SparseBLAS. While the functions that it wraps around are already unit-tested, it would be good to add unit tests on the ingestion into the blas_sparse_matrix and the extblas_sparse_vector containers, as well as unit tests for the functions that wrap around raw vectors and raw CRS structures.

Provide grb::foldl/foldr( [in,out] T&, [in] Matrix<D>, monoid )

This issue concerns the implementation for the reference+omp, hyperdags and distributed backends.

The foldl and foldr primitives, which are currently not implemented in any backend for Matrix reduction, appear to be crucial methods that need to be implemented.
I suggest that we keep the same API signature as the current blas1::foldl and blas1::foldr

Current documentation for foldl (foldr is implicit):

/**
 * Reduces, or \em folds, a matrix into a scalar.
 *
 * Reduction takes place according a monoid \f$ (\oplus,1) \f$, where
 * \f$ \oplus:\ D_1 \times D_2 \to D_3 \f$ with associated identities
 * \f$ 1_k in D_k \f$. Usually, \f$ D_k \subseteq D_3, 1 \leq k < 3 \f$,
 * though other more exotic structures may be envisioned (and used).
 *
 * Let \f$ x_0 = 1 \f$ and let
 * \f$ x_{i+1} = \begin{cases}
 *   x_i \oplus y_i\text{ if }y_i\text{ is nonzero and }
 * 	 m_i\text{ evaluates true}x_i\text{ otherwise}
 * \end{cases},\f$
 * for all \f$ i \in \{ 0, 1, \ldots, n-1 \} \f$.
 *
 * \note Per this definition, the folding happens in a left-to-right
 * 		 direction. If another direction is wanted, which may have use in
 *  	 cases where \f$ D_1 \f$ differs from \f$ D_2 \f$, then either a
 * 		 monoid with those operator domains switched may be supplied, or
 * 		 #grb::foldr may be used instead.
 *
 * After a successfull call, \a x will be equal to \f$ x_n \f$.
 *
 * Note that the operator \f$ \oplus \f$ must be associative since it is
 * part of a monoid. This algebraic property is exploited when parallelising
 * the requested operation. The identity is required when parallelising over
 * multiple user processes.
 *
 * \warning In so doing, the order of the evaluation of the reduction
 * 			operation should not be expected to be a serial, left-to-right,
 * 			evaluation of the computation chain.
 *
 * @tparam descr     The descriptor to be used (descriptors::no_operation if
 *                   left unspecified).
 * @tparam Monoid    The monoid to use for reduction.
 * @tparam InputType The type of the elements in the supplied ALP/GraphBLAS
 *                   matrix \a y.
 * @tparam IOType    The type of the output scalar \a x.
 * @tparam MaskType  The type of the elements in the supplied ALP/GraphBLAS
 *                   matrix \a mask.
 *
 * @param[in, out] x  The result of the reduction. 
 * 					  Prior value will be considered.
 * @param[in] A       Any ALP/GraphBLAS matrix.
 * @param[in] mask    Any ALP/GraphBLAS matrix.
 * @param[in] monoid  The monoid under which to perform this reduction.
 *
 * @return grb::SUCCESS  When the call completed successfully.
 * @return grb::MISMATCH If a \a mask was not empty and does not have size
 *                       equal to \a A.
 * @return grb::ILLEGAL  If the provided input matrix \a A was not dense,
 * 						 while #grb::descriptors::dense was given.
 *
 * @see grb::foldr provides similar in-place functionality.
 * @see grb::eWiseApply provides out-of-place semantics.
 *
 * \parblock
 * \par Valid descriptors
 * - descriptors::no_operation: the default descriptor.
 * - descriptors::no_casting: the first domain of
 * 	 	\a monoid must match \a InputType, the second domain of \a op
 * 		match \a IOType, the third domain must match \a IOType, and the
 *   	element type of \a mask must be <tt>bool</tt>. 
 * - descriptors::transpose_left: A^T will be considered instead 
 * 	 	of \a A.
 * - descriptors::transpose_right: mask^T will be considered 
 * 	 	instead of \a mask.
 * - descriptors::invert_mask: Not supported yet.
 *
 * \note Invalid descriptors will be ignored.
 *
 * \endparblock
 *
 * \par Performance semantics
 * Each backend must define performance semantics for this primitive.
 *
 * @see perfSemantics
 */
template<
	Descriptor descr = descriptors::no_operation,
	class Monoid,
	typename InputType, typename IOType, typename MaskType,
	typename RIT_A, typename CIT_A, typename NIT_A,
	typename RIT_M, typename CIT_M, typename NIT_M,
	Backend backend
>
RC foldl(
	IOType &x,
	const Matrix< InputType, backend, RIT_A, CIT_A, NIT_A > &A,
	const Matrix< MaskType, backend, RIT_M, CIT_M, NIT_M > &mask,
	const Monoid &monoid = Monoid(),
	const typename std::enable_if< !grb::is_object< IOType >::value &&
		!grb::is_object< InputType >::value &&
		!grb::is_object< MaskType >::value &&
		grb::is_monoid< Monoid >::value, void
	>::type * const = nullptr
) {}

/**
 * Reduces, or \em folds, a matrix into a scalar. 
 * Left-to-right unmasked variant.
 * 
 * Please see the masked grb::foldl variant for a full description.
 * 
 * @tparam descr     The descriptor to be used (descriptors::no_operation if
 *                   left unspecified).
 * @tparam Operator  The operator to use for reduction.
 * @tparam InputType The type of the elements in the supplied ALP/GraphBLAS
 *                   matrix \a y.
 * @tparam IOType    The type of the output scalar \a x.
 *
 * @param[in, out] x   The result of the reduction.
 * 					   Prior value will be considered.
 * @param[in]    A     Any ALP/GraphBLAS matrix.
 * @param[in] operator The operator used for reduction.
 *
 * @return grb::SUCCESS  When the call completed successfully.
 * @return grb::ILLEGAL  If the provided input matrix \a y was not dense, while
 *                       #grb::descriptors::dense was given.
 * 
 * \parblock
 * \par Valid descriptors
 * - descriptors::no_operation: the default descriptor.
 * - descriptors::no_casting: the first domain of
 * 	 	\a monoid must match \a InputType, the second domain of \a op
 * 		match \a IOType, the third domain must match \a IOType.
 * - descriptors::transpose_matrix: A^T will be considered instead 
 * 	 	of \a A.
 *
 * \note Invalid descriptors will be ignored.
 *
 * \endparblock
 * 
 */
template<
	Descriptor descr = descriptors::no_operation,
	class Monoid,
	typename InputType, typename IOType,
	typename RIT, typename CIT, typename NIT,
	Backend backend
>
RC foldl(
	IOType &x,
	const Matrix< InputType, backend, RIT, CIT, NIT > &A,
	const Monoid &monoid,
	const typename std::enable_if< 
		!grb::is_object< IOType >::value &&
		!grb::is_object< InputType >::value &&
		grb::is_monoid< Monoid >::value, void
	>::type * const = nullptr
) {}

Update code style documentation

This is to update some very-out-of-date documents regarding code style that still reside in both develop and master; it is not intended as an issue that would close #5. This is a blocker to v0.7. It will be handled after #156 .