alugowski / fast_matrix_market

Fast and full-featured Matrix Market I/O library for C++, Python, and R

License: BSD 2-Clause "Simplified" License

CMake 3.90% C++ 81.81% Jupyter Notebook 1.28% Shell 0.22% Python 12.78%

cpp matrix-market matrix-market-format parallel parser python sparse-matrix threaded blaze csparse

fast_matrix_market's Issues

Is this a valid Matrix Market format?

See: https://github.com/scipy/scipy/blob/69e0c474f886990a85a88521cffb8b3978bdfd66/scipy/io/tests/test_mmio.py#L517-L533

scipy.io.mmread can read this file, but fmm.mmread raises an error. I don't know which library is "correct", but I thought you would like to know about differences between fmm.mmread and scipy.io.mmread.

Reproducer:

import fast_matrix_market as fmm
import scipy
from io import StringIO
from scipy.io.tests import test_mmio

A1 = scipy.io.mmread(StringIO(test_mmio._empty_lines_example))  # works
A2 = fmm.mmread(StringIO(test_mmio._empty_lines_example))  # ValueError: Line 3: Too many lines in file (file too long)
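
Until the two readers agree, one possible workaround is to drop blank lines before parsing. The sketch below is not part of either library: `strip_blank_lines` is a hypothetical helper, the sample text is made up, and it assumes blank lines carry no meaning in the file.

```python
from io import StringIO


def strip_blank_lines(text: str) -> str:
    """Return `text` with empty and whitespace-only lines removed."""
    kept = [line for line in text.splitlines() if line.strip()]
    return "\n".join(kept) + "\n"


# Small coordinate file with stray blank lines (made up for illustration).
example = """%%MatrixMarket matrix coordinate real general

2 2 2

1 1 1.0

2 2 2.0

"""

cleaned = strip_blank_lines(example)
stream = StringIO(cleaned)  # could then be handed to a reader such as fmm.mmread
```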

Controlling parallelism

Was just looking at the PR over on scipy (super excited to see this moving forward!), and noticed that this library has its own global variable for parallelism.

Ideally, I would like for the number of threads used to be controlled by some higher level API like threadpoolctl.

Any thoughts on how this could be accomplished? Maybe something at the scipy level?
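
For context, a scoped override of a module-level setting can be sketched with a small context manager. This is only an illustration of the idea, not threadpoolctl's or this library's API: `temporary_attr` is a hypothetical helper, and the attribute name `PARALLELISM` on the stand-in object is an assumption.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def temporary_attr(obj, name, value):
    """Set obj.<name> to value inside the block, then restore the old value."""
    old = getattr(obj, name)
    setattr(obj, name, value)
    try:
        yield obj
    finally:
        setattr(obj, name, old)


# Stand-in for a module with a global thread-count setting.
# The attribute name PARALLELISM is an assumption for illustration.
fake_module = SimpleNamespace(PARALLELISM=0)

with temporary_attr(fake_module, "PARALLELISM", 2):
    inside = fake_module.PARALLELISM  # limited inside the block
after = fake_module.PARALLELISM       # restored afterwards
```

A higher-level controller would still need to know that the setting exists, which is the integration question raised above.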

Evaluate mmap

FMM currently employs a single-pass method that effectively only uses fread(). This is fast, very flexible for integration, and has a fixed memory overhead for the loader itself. However, it may leave some performance on the table on systems capable of parallel IO.

See PIGO which mmap()s the input file, scans it fully once to find line counts, then reads it.

mmap() does not apply to many situations that FMM currently handles, but may be useful as an option for cases where it does.
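
The counting pass described above can be sketched in Python with the standard `mmap` module. This is only an illustration of the map-then-scan idea, not PIGO's or FMM's code; `count_lines_mmap` and the temporary file are hypothetical.

```python
import mmap
import os
import tempfile


def count_lines_mmap(path: str) -> int:
    """Count newline characters in a file by scanning a read-only memory map."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            count = 0
            pos = mm.find(b"\n")
            while pos != -1:
                count += 1
                pos = mm.find(b"\n", pos + 1)
            return count


# Write a tiny Matrix Market file to disk so there is something to map.
with tempfile.NamedTemporaryFile("wb", suffix=".mtx", delete=False) as tmp:
    tmp.write(b"%%MatrixMarket matrix coordinate real general\n2 2 2\n1 1 1.0\n2 2 2.0\n")
    path = tmp.name

try:
    n_lines = count_lines_mmap(path)
finally:
    os.remove(path)
```

Note that this only works for a real file on disk, which is the limitation mentioned above: streams and in-memory sources cannot be mapped this way.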

A simple question about symmetric matrix

I used a symmetric matrix file to test this, called as follows:

namespace fmm = fast_matrix_market;

template <typename IT, typename VT>
struct triplet_matrix
{
    int64_t nrows = 0, ncols = 0;
    std::vector<IT> rows;
    std::vector<IT> cols;
    std::vector<VT> vals;
};

// read mtx file to csr
template <typename IT, typename VT>
void read_mtx_to_csr(const std::string &filename, int&m, int&n, int&nnz, int **row_ptr, int **col_ind, double **val)
{
    std::ifstream file(filename);
    triplet_matrix<IT, VT> mtx;
    fast_matrix_market::read_options options;
    options.parallel_ok = true;

    fmm::read_matrix_market_triplet(
        file, 
        mtx.nrows, mtx.ncols, 
        mtx.rows, mtx.cols, mtx.vals, options
    );
    ...
}

I noticed that the elements on the diagonal are also copied and the value is set to 0; does fmm provide a way to avoid this problem?
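
For reference, the behavior being asked for (mirror off-diagonal entries, keep each diagonal entry once) can be sketched on plain triplet lists. This is a Python illustration only, not fmm's implementation; `generalize_symmetric` is a hypothetical helper and the indices are 0-based.

```python
def generalize_symmetric(rows, cols, vals):
    """Expand lower-triangle triplets to a full symmetric matrix.

    Off-diagonal entries are mirrored; diagonal entries are kept once.
    """
    out_rows, out_cols, out_vals = list(rows), list(cols), list(vals)
    for r, c, v in zip(rows, cols, vals):
        if r != c:  # mirror (r, c) to (c, r); skip the diagonal
            out_rows.append(c)
            out_cols.append(r)
            out_vals.append(v)
    return out_rows, out_cols, out_vals


# Lower triangle of [[4, 2], [2, 5]].
rows, cols, vals = generalize_symmetric([0, 1, 1], [0, 0, 1], [4.0, 2.0, 5.0])
```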

Test C++23 Fixed width floating-point types

See https://en.cppreference.com/w/cpp/types/floating-point

C++23 offers fixed-width floating point types std::float16_t through std::float128_t and std::bfloat16_t.

These likely already work on toolchains that offer both the types and their respective <charconv>.

Possible requirements for testing:

detect toolchain support for types and charconv support
if support detected, add optional test_cpp23 that tests these types
add a runner with a supporting compiler. Probably GCC 13

`fmm.mmread(stream)` closes stream (`scipy.io.mmread(stream)` does not)

I discovered a small difference between fmm.mmread and scipy.io.mmread:

import fast_matrix_market as fmm
import scipy
from io import StringIO

text = """%%MatrixMarket matrix coordinate real general
3 3 4
1 3 1
2 2 2
3 1 3
3 1 4"""
stream = StringIO(text)
A1 = scipy.io.mmread(stream)
stream.seek(0)  # works after using scipy.io.mmread
A2 = fmm.mmread(stream)
stream.seek(0)  # <-- ValueError: I/O operation on closed file

This isn't particularly important or urgent, but I thought you would like to know differences between fmm.mmread and scipy.io.mmread.
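
A caller-side workaround, sketched under the assumption that the whole input fits in memory, is to hand the reader a private copy of the stream. `read_keeping_stream_open` and `closing_reader` are hypothetical names; the latter is just a stand-in for any reader that closes its input.

```python
from io import StringIO


def read_keeping_stream_open(reader, stream):
    """Call reader on a private copy so the caller's stream stays usable."""
    pos = stream.tell()
    copy = StringIO(stream.read())  # the reader may close this copy freely
    stream.seek(pos)                # restore the caller's position
    return reader(copy)


def closing_reader(s):
    """Stand-in for a reader that closes its input (hypothetical)."""
    data = s.read()
    s.close()
    return data


stream = StringIO("3 3 4\n1 3 1\n")
result = read_keeping_stream_open(closing_reader, stream)
still_open = not stream.closed
```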

Support skipping values

Support ability to read just the indices and ignore the values, i.e. pretend the file is a pattern file even if it isn't.

Example use case: unweighted graph algorithms.

Corner cases to think about:

should the value be initialized to a fixed value or left untouched? Allow both?
what should happen with array files? Both read into an array and read into sparse struct
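
The coordinate case can be illustrated with a minimal pure-Python sketch that keeps only the index pair from each entry line. `read_indices_only` is a hypothetical helper, not a proposed API, and it sidesteps both corner cases above by not producing a value array at all.

```python
def read_indices_only(lines):
    """Parse coordinate entries, keeping (row, col) and discarding any value."""
    rows, cols = [], []
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines in this sketch
        rows.append(int(parts[0]) - 1)  # Matrix Market indices are 1-based
        cols.append(int(parts[1]) - 1)
    return rows, cols


# Entry lines of a real-valued coordinate file, header and size line omitted.
body = ["1 3 1.5", "2 2 2.0", "3 1 3.25"]
rows, cols = read_indices_only(body)
```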

Python: fix clash with scipy 1.12's version of FMM

  ../venv/lib/python3.9/site-packages/scipy/io/_fast_matrix_market/__init__.py:354: in mmread
      cursor, stream_to_close = _get_read_cursor(source)
  _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
  
  source = <_io.StringIO object at 0x7f5e158cbee0>, parallelism = None
  
      def _get_read_cursor(source, parallelism=None):
          """
          Open file for reading.
          """
  >       from . import _fmm_core
  E       ImportError: generic_type: type "header" is already registered!

Add Blaze bindings

See https://bitbucket.org/blaze-lib/blaze/wiki/Matrix%20Types#!sparse-matrices

add this to scipy?

Following up on:

scverse/anndata#962 (comment)

Could this library be included with scipy?

It's been a longstanding issue in SciPy that the matrix market readers are very slow. This library would meet scipy's standards for inclusion (C++ 11, header only), so I think this is a great opportunity to give great performance gains to a large audience. The scipy maintainers are, in principle, open to this.

On the side of this library, I think it would give potential users much more faith in the maintenance and continued support of this code. For users, it also means fewer python-level dependencies, and no need to worry about possible discrepancies between the parsers.

What do you think? (@eriknw, I would also be interested in hearing your thoughts here).

cc: @grst

Support precision for writing floating-point

Challenges:

Dragonbox does not support precision, and it will not. Must use another library, potentially ryu.
stdlib fallback: std::to_string(double) does not have an alternative accepting a precision argument. May have to use stringstream, which is significantly slower.
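
The distinction at stake is shortest round-trip formatting versus formatting to a requested precision. Python is used here purely as an analogy (its `repr` of a float gives the shortest string that round-trips); nothing below reflects how the C++ library would implement the feature.

```python
x = 1.0 / 3.0

shortest = repr(x)          # shortest string that round-trips to the same double
limited = format(x, ".4g")  # at most 4 significant digits

round_trips = float(shortest) == x  # no information lost
lossy = float(limited) != x         # precision-limited output drops digits
```

A precision option trades exact round-tripping for smaller files, which is why it needs a different formatting routine than the shortest-representation one.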

Specify STATIC in add_library

In dependencies/ryu/CMakeLists.txt, add_library is called without the STATIC option, so the choice between STATIC and SHARED is left to the BUILD_SHARED_LIBS option, which may be set anywhere in the parent CMake files.

However, the CMake build currently does not work with SHARED (at least on Windows), which implicitly restricts the build to STATIC only. Might it be worth asking for STATIC explicitly?

I might be missing something.

I proposed the change in the original repo as well.

QUERY: Giving credit

Hi @alugowski, I was wondering if you'd be willing to be listed as a contributor in the fastMatMR package. This is a standard practice in the R community (for CRAN submissions, e.g. here) and I think it'd be really great to have you listed.

If it's OK, could you post your details in a comment, in a form like this?

    person("Rohit", "Goswami", email = "[email protected]", role = c("ctb"),
           comment = c(ORCID = "0000-0002-2393-8056")),

The ORCID is optional, basically name and preferred email minimally :)

Fix macOS complex value test fails

Something changed, either with a newer clang or newer suite-sparse from Homebrew. Now errors like this appear:

In file included from /Users/enos/projects/fast_matrix_market/tests/graphblas_test.cpp:18:
/Users/enos/projects/fast_matrix_market/include/fast_matrix_market/app/GraphBLAS.hpp:492:20: error: no matching function for call to 'GxB_Matrix_build_FC32'
            return GxB_Matrix_build_FC32(mat, rows, cols, vals, nvals, GxB_PLUS_FC32);
                   ^~~~~~~~~~~~~~~~~~~~~
/opt/homebrew/include/GraphBLAS.h:3232:10: note: candidate function not viable: no known conversion from 'const std::complex<float> *' to 'const GxB_FC32_t *' (aka 'const _Complex float *') for 4th argument
GrB_Info GxB_Matrix_build_FC32      // build a matrix from (I,J,X) tuples
         ^
In file included from /Users/enos/projects/fast_matrix_market/tests/graphblas_test.cpp:18:
/Users/enos/projects/fast_matrix_market/include/fast_matrix_market/app/GraphBLAS.hpp:496:20: error: no matching function for call to 'GxB_Scalar_setElement_FC32'
            return GxB_Scalar_setElement_FC32(scalar, x);
                   ^~~~~~~~~~~~~~~~~~~~~~~~~~
/opt/homebrew/include/GraphBLAS.h:2173:10: note: candidate function not viable: no known conversion from 'const std::complex<float>' to 'GxB_FC32_t' (aka '_Complex float') for 2nd argument
GrB_Info GxB_Scalar_setElement_FC32     // s = x
         ^

std::complex and the C99 _Complex types have compatible memory layouts.

Add armadillo bindings

Interested in using this library with armadillo. Should be a fairly simple task to bind to the MAT/SP_MAT/VEC types. Let me know if you're willing to take a crack at it before I take a crack at it myself. Thanks!

Possibility to use system-wide [gtest, eigen, blaze, etc.] if they are installed at compile time, and not duplicate them after installation

Hi, I would like to ask whether it is possible to change the following behavior. On Linux, many dependencies are available in the distribution repositories, for instance:

gtest
eigen
blaze
suitesparse provides GraphBLAS.
and so on.

I am trying to package this library, but I noticed that FetchContent downloads these dependencies and later installs them together with fast_matrix_market.

pkgname=fast_matrix_market
pkgdesc="Fast and full-featured Matrix Market I/O library"
pkgver=1.7.6
pkgrel=1
arch=(x86_64)
url="https://github.com/alugowski/${pkgname}"
license=(BSD-2-Clause)
depends=(python)
makedepends=(python-build python-installer pybind11 python-scikit-build-core cmake)
checkdepends=(gtest suitesparse eigen blaze armadillo python-scipy python-threadpoolctl python-pytest)
optdepends=('eigen: '
  'blaze: '
  'armadillo: '
  'python-scipy: ') # 'fastmatmr'
source=(${pkgname}-${pkgver}.tar.gz::${url}/archive/v${pkgver}.tar.gz)
sha512sums=('e97da2daf76770502e862a13b7b61aaf8797d9bec9d33f182ff28c2a0b3f8e8b078b559643d980d6c7f3ff57da9cf52bde8807120b9373e61851fd57373d51aa')

build() {
  cmake \
    -S ${pkgname}-${pkgver} \
    -B build \
    -DCMAKE_BUILD_TYPE=None \
    -DCMAKE_INSTALL_PREFIX=/usr \
    -DBUILD_SHARED_LIBS=TRUE \
    -DCMAKE_CXX_STANDARD=23 \
    -DFAST_MATRIX_MARKET_BENCH=ON \
    -DFAST_MATRIX_MARKET_TEST=ON \
    -DFMM_USE_DRAGONBOX=ON \
    -DFMM_USE_FAST_FLOAT=ON \
    -DFMM_USE_RYU=ON \
    -Wno-dev
  cmake --build build --target all

  cd ${pkgname}-${pkgver}/python
  python -m build --wheel --skip-dependency-check --no-isolation
}

check() {
  ctest --verbose --output-on-failure --test-dir build
  cd ${pkgname}-${pkgver}/python
  python -m venv --system-site-packages test-env
  test-env/bin/python -m installer dist/*.whl
  test-env/bin/python -m pytest
}

package() {
  DESTDIR="${pkgdir}" cmake --build build --target install
  install -Dm 644 ${pkgname}-${pkgver}/LICENSE.txt -t "${pkgdir}/usr/share/licenses/${pkgname}"
  cd ${pkgname}-${pkgver}/python
  PYTHONPYCACHEPREFIX="${PWD}/.cache/cpython/" python -m installer --destdir="${pkgdir}" dist/*.whl
  rm -r ${pkgdir}/usr/include/{blaze,eigen3,gtest,gmock}
  rm -r ${pkgdir}/usr/lib/cmake/GTest
  rm -r ${pkgdir}/usr/share/{blaze,eigen3}
  rm -r ${pkgdir}/usr/lib/lib{gmock*,gtest*}
  rm -r ${pkgdir}/usr/lib/pkgconfig/{gmock*,gtest*}
}

One workaround is to manually delete the duplicated files (gtest, blaze, eigen, and so on) after the CMake install step, as the package() function above does.

R bindings [WIP]

I've started maintaining a set of R bindings for this excellent library. There are some very nice benchmarks compared to the existing Matrix package in R. Perhaps a mention of this in the readme would be welcome? I'd be happy to open a PR.

The only thing left is support for reading from files.

Add GraphBLAS bindings

c++/9/thread:126: undefined reference to `pthread_create'

Hello Adam! Thanks so much for creating the [fast_matrix_market] project.
I tried to build the example with CMake. However, linking fails with /9/thread:126: undefined reference to `pthread_create'.
I fixed it with target_link_libraries(simple1 fast_matrix_market::fast_matrix_market pthread).
However, I think the pthread library should be linked by [fast_matrix_market] itself.
I would like to verify this.

Best wishes!

`fmm.mmread` raises `ValueError` instead of `OverflowError` when reading integers that are too large

This is yet another (small) difference between fmm.mmread and scipy.io.mmread, which I think you may like to know about if you want fmm.mmread to be more or less a drop-in replacement for scipy.io.mmread:

import fast_matrix_market as fmm
import scipy
from io import StringIO
from scipy.io.tests import test_mmio

try:
    scipy.io.mmread(StringIO(test_mmio._over64bit_integer_dense_example))
except OverflowError:
    pass
try:
    scipy.io.mmread(StringIO(test_mmio._over64bit_integer_sparse_example))
except OverflowError:
    pass


try:
    fmm.mmread(StringIO(test_mmio._over64bit_integer_dense_example))
except ValueError:  # <-- This isn't OverflowError
    pass
try:
    fmm.mmread(StringIO(test_mmio._over64bit_integer_sparse_example))
except ValueError:  # <-- This isn't OverflowError
    pass

See: https://github.com/scipy/scipy/blob/69e0c474f886990a85a88521cffb8b3978bdfd66/scipy/io/tests/test_mmio.py#L321-L335
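
To make the distinction concrete, the sketch below separates "not an integer" (ValueError) from "an integer that does not fit" (OverflowError). `parse_int64` is a hypothetical helper, and the signed 64-bit range is chosen for illustration; it is not a statement of what either library checks internally.

```python
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def parse_int64(token: str) -> int:
    """Parse an integer token, raising OverflowError if it does not fit in int64."""
    value = int(token)  # ValueError here means the token is not an integer at all
    if not INT64_MIN <= value <= INT64_MAX:
        raise OverflowError(f"integer out of int64 range: {token}")
    return value


ok = parse_int64("9223372036854775807")  # largest int64 value, accepted
try:
    parse_int64("9223372036854775808")   # one past the limit
    overflowed = False
except OverflowError:
    overflowed = True
```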