alugowski / fast_matrix_market
Fast and full-featured Matrix Market I/O library for C++, Python, and R
License: BSD 2-Clause "Simplified" License
scipy.io.mmread can read this file, but fmm.mmread raises an error. I don't know which library is "correct", but I thought you would like to know about differences between fmm.mmread and scipy.io.mmread.
Reproducer:
import fast_matrix_market as fmm
import scipy
from io import StringIO
from scipy.io.tests import test_mmio
A1 = scipy.io.mmread(StringIO(test_mmio._empty_lines_example)) # works
A2 = fmm.mmread(StringIO(test_mmio._empty_lines_example)) # ValueError: Line 3: Too many lines in file (file too long)
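Until both readers agree, one possible workaround is to drop blank lines before handing the text to the stricter parser. The sketch below is only an illustration: strip_blank_lines is a hypothetical helper, not part of fast_matrix_market or SciPy, and it assumes that blank lines carry no meaning in the file.

```python
from io import StringIO


def strip_blank_lines(text: str) -> str:
    """Return text with empty and whitespace-only lines removed.

    Hypothetical helper for illustration; not part of any library.
    """
    kept = [line for line in text.splitlines() if line.strip()]
    return "\n".join(kept) + "\n"


# A small coordinate file with blank lines sprinkled in.
example = """%%MatrixMarket matrix coordinate real general

2 2 1

1 1 1.5
"""

# The cleaned stream could then be passed to whichever reader is in use.
cleaned = StringIO(strip_blank_lines(example))
print(cleaned.getvalue())
```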
Was just looking at the PR over on scipy (super excited to see this moving forward!), and noticed that this library has its own global variable for parallelism.
Ideally, I would like for the number of threads used to be controlled by some higher level API like threadpoolctl.
Any thoughts on how this could be accomplished? Maybe something at the scipy level?
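To make the idea concrete, here is a minimal sketch of scoping a module-level thread setting with a context manager, which is roughly what a higher-level controller would need to do on the library's behalf. Everything here is an assumption for illustration: fake_lib stands in for a library module, the attribute name PARALLELISM is hypothetical, and this is not how threadpoolctl itself is implemented.

```python
from contextlib import contextmanager
from types import SimpleNamespace

# Stand-in for a library module with a global thread-count setting.
# The attribute name PARALLELISM is an assumption for this sketch.
fake_lib = SimpleNamespace(PARALLELISM=0)


@contextmanager
def limit_threads(module, n_threads):
    """Temporarily set module.PARALLELISM and restore the old value on exit."""
    previous = module.PARALLELISM
    module.PARALLELISM = n_threads
    try:
        yield module
    finally:
        # Restore even if the body raised an exception.
        module.PARALLELISM = previous


with limit_threads(fake_lib, 2):
    inside = fake_lib.PARALLELISM  # value seen while the limit is active

after = fake_lib.PARALLELISM  # value after leaving the block
```

The try/finally matters here: without it, an exception inside the block would leave the global stuck at the temporary value.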
FMM currently employs a single-pass method that effectively only uses fread(). This is fast, very flexible for integration, and has a fixed memory overhead for the loader itself. However it may leave some performance on the table on systems capable of parallel IO.
See PIGO, which mmap()s the input file, scans it fully once to find line counts, then reads it.
mmap() does not apply to many situations that FMM currently handles, but may be useful as an option for cases where it does.
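The two-pass idea can be sketched in Python with the standard mmap module. This is not PIGO's or FMM's code, just an illustration of "map, count lines, then read": the function names are made up for this example, the file is assumed to be non-empty, and every line is assumed to end with a newline.

```python
import mmap
import os
import tempfile


def count_lines_mmap(path: str) -> int:
    """First pass: count newline characters in a memory-mapped file."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        count = 0
        pos = mm.find(b"\n")
        while pos != -1:
            count += 1
            pos = mm.find(b"\n", pos + 1)
        return count


def read_lines_mmap(path: str) -> list:
    """Second pass: fill a list that was preallocated from the line count."""
    n = count_lines_mmap(path)
    lines = [b""] * n  # size known up front thanks to the first pass
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        for i in range(n):
            lines[i] = mm.readline().rstrip(b"\n")
    return lines


# Demo on a small temporary file.
with tempfile.NamedTemporaryFile("wb", suffix=".mtx", delete=False) as tmp:
    tmp.write(b"%%MatrixMarket matrix coordinate real general\n2 2 1\n1 1 1.5\n")
    demo_path = tmp.name

demo_lines = read_lines_mmap(demo_path)
os.remove(demo_path)
```

Knowing the line count before parsing is what allows exact preallocation and splitting the body into independent chunks. It also shows why the approach cannot cover streams or pipes, which cannot be mapped.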
I used a symmetric matrix file to test this, called as follows:
namespace fmm = fast_matrix_market;
template <typename IT, typename VT>
struct triplet_matrix
{
    int64_t nrows = 0, ncols = 0;
    std::vector<IT> rows;
    std::vector<IT> cols;
    std::vector<VT> vals;
};
// read mtx file to csr
template <typename IT, typename VT>
void read_mtx_to_csr(const std::string &filename, int &m, int &n, int &nnz, int **row_ptr, int **col_ind, double **val)
{
    std::ifstream file(filename);
    triplet_matrix<IT, VT> mtx;
    fast_matrix_market::read_options options;
    options.parallel_ok = true;
    fmm::read_matrix_market_triplet(
        file,
        mtx.nrows, mtx.ncols,
        mtx.rows, mtx.cols, mtx.vals, options
    );
    ...
}
I noticed that the elements on the diagonal are also copied and the value is set to 0; does fmm provide a way to avoid this problem?
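For context, expanding a symmetric file into a general matrix means each stored off-diagonal entry (i, j) also yields (j, i); what to do with diagonal entries is a design choice. The Python sketch below shows one option, mirroring only off-diagonal entries so that no extra diagonal elements appear. It is not FMM's implementation, and generalize_symmetric is a hypothetical helper that works on already-loaded triplets.

```python
def generalize_symmetric(rows, cols, vals):
    """Expand lower-triangle triplets to a full symmetric triplet list.

    Diagonal entries are kept once; off-diagonal entries are mirrored.
    Hypothetical helper for illustration only.
    """
    out_rows, out_cols, out_vals = list(rows), list(cols), list(vals)
    for r, c, v in zip(rows, cols, vals):
        if r != c:
            # Mirror (r, c) to (c, r); skip the diagonal entirely.
            out_rows.append(c)
            out_cols.append(r)
            out_vals.append(v)
    return out_rows, out_cols, out_vals


# Lower triangle of a 2x2 symmetric matrix, 0-based indices.
full_rows, full_cols, full_vals = generalize_symmetric(
    [0, 1, 1], [0, 0, 1], [1.0, 2.0, 3.0]
)
```

The trade-off is that the output length is no longer a fixed multiple of the input length, so the number of off-diagonal entries has to be known (or counted) before the output can be preallocated.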
See https://en.cppreference.com/w/cpp/types/floating-point
C++23 offers fixed-width floating point types std::float16_t through std::float128_t and std::bfloat16_t.
These likely already work on toolchains that offer both the types and their respective <charconv>.
Possible requirements for testing:
test_cpp23 that tests these types
I discovered a small difference between fmm.mmread and scipy.io.mmread:
import fast_matrix_market as fmm
import scipy
from io import StringIO
text = """%%MatrixMarket matrix coordinate real general
3 3 4
1 3 1
2 2 2
3 1 3
3 1 4"""
stream = StringIO(text)
A1 = scipy.io.mmread(stream)
stream.seek(0) # works after using scipy.io.mmread
A2 = fmm.mmread(stream)
stream.seek(0) # <-- ValueError: I/O operation on closed file
This isn't particularly important or urgent, but I thought you would like to know differences between fmm.mmread and scipy.io.mmread.
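A caller who needs the stream afterwards can sidestep the difference by handing the reader a private copy. The sketch below assumes an in-memory StringIO source; closing_reader is a stand-in for any reader that closes its input, and read_keeping_open is a hypothetical helper, not part of either library.

```python
from io import StringIO


def closing_reader(stream):
    """Stand-in for a reader that closes the stream it was given."""
    data = stream.read()
    stream.close()
    return data


def read_keeping_open(reader, stream: StringIO):
    """Run reader on a private copy so the caller's stream stays open."""
    copy = StringIO(stream.getvalue())  # getvalue() ignores the current position
    return reader(copy)


stream = StringIO("3 3 4\n")
result = read_keeping_open(closing_reader, stream)
stream.seek(0)  # still works: only the copy was closed
```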
Support the ability to read just the indices and ignore the values, i.e. pretend the file is a pattern file even if it isn't.
Example use case: unweighted graph algorithms.
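As a rough illustration of the requested behaviour, the following pure-Python sketch parses a coordinate file and keeps only the indices. It is not based on FMM's parser: read_coordinate_pattern is a hypothetical name, and the sketch assumes a well-formed coordinate file whose comment lines start with %.

```python
def read_coordinate_pattern(text):
    """Parse a coordinate Matrix Market string, keeping only the indices."""
    # Drop the banner, comments, and blank lines.
    lines = [ln for ln in text.splitlines() if ln.strip() and not ln.startswith("%")]
    nrows, ncols, nnz = (int(tok) for tok in lines[0].split())
    rows, cols = [], []
    for ln in lines[1:1 + nnz]:
        fields = ln.split()
        rows.append(int(fields[0]) - 1)  # Matrix Market indices are 1-based
        cols.append(int(fields[1]) - 1)
        # Any remaining fields (the value) are ignored.
    return (nrows, ncols), rows, cols


text = """%%MatrixMarket matrix coordinate real general
3 3 4
1 3 1
2 2 2
3 1 3
3 1 4"""

shape, rows, cols = read_coordinate_pattern(text)
```

Note that the duplicate entry at (3, 1) survives as two index pairs, which an unweighted graph algorithm may or may not want to deduplicate.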
Corner cases to think about:
array files? Both read into an array and read into sparse struct
../venv/lib/python3.9/site-packages/scipy/io/_fast_matrix_market/__init__.py:354: in mmread
cursor, stream_to_close = _get_read_cursor(source)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
source = <_io.StringIO object at 0x7f5e158cbee0>, parallelism = None
def _get_read_cursor(source, parallelism=None):
"""
Open file for reading.
"""
> from . import _fmm_core
E ImportError: generic_type: type "header" is already registered!
Following up on:
Could this library be included with scipy?
It's been a longstanding issue in SciPy that the matrix market readers are very slow. This library would meet scipy's standards for inclusion (C++ 11, header only), so I think this is a great opportunity to give great performance gains to a large audience. The scipy maintainers are, in principle, open to this.
On the side of this library, I think it would give potential users much more faith in the maintenance and continued support of this code. For users, it also means fewer python-level dependencies, and no need to worry about possible discrepancies between the parsers.
What do you think? (@eriknw, I would also be interested in hearing your thoughts here).
cc: @grst
Challenges:
stringstream, which is significantly slower.
In dependencies/ryu/CMakeLists.txt, by calling add_library without the STATIC option, the choice between STATIC or SHARED is left to the BUILD_SHARED_LIBS option, which can be set anywhere in the parent CMake files.
However, since the CMake build currently does not work with SHARED (at least on Windows), which implicitly restricts the build to STATIC only, it might be worth explicitly asking for STATIC?
I might be missing something.
I proposed the change in the original repo as well
Hi @alugowski, I was wondering if you'd be willing to be listed as a contributor in the fastMatMR package. This is a standard practice in the R community (for CRAN submissions, e.g. here) and I think it'd be really great to have you listed.
If it's ok, could you provide your details in a comment, like this?
person("Rohit", "Goswami", email = "[email protected]", role = c("ctb"),
comment = c(ORCID = "0000-0002-2393-8056")),
The ORCID is optional, basically name and preferred email minimally :)
Something changed, either with a newer clang or newer suite-sparse from Homebrew. Now errors like this appear:
In file included from /Users/enos/projects/fast_matrix_market/tests/graphblas_test.cpp:18:
/Users/enos/projects/fast_matrix_market/include/fast_matrix_market/app/GraphBLAS.hpp:492:20: error: no matching function for call to 'GxB_Matrix_build_FC32'
return GxB_Matrix_build_FC32(mat, rows, cols, vals, nvals, GxB_PLUS_FC32);
^~~~~~~~~~~~~~~~~~~~~
/opt/homebrew/include/GraphBLAS.h:3232:10: note: candidate function not viable: no known conversion from 'const std::complex<float> *' to 'const GxB_FC32_t *' (aka 'const _Complex float *') for 4th argument
GrB_Info GxB_Matrix_build_FC32 // build a matrix from (I,J,X) tuples
^
In file included from /Users/enos/projects/fast_matrix_market/tests/graphblas_test.cpp:18:
/Users/enos/projects/fast_matrix_market/include/fast_matrix_market/app/GraphBLAS.hpp:496:20: error: no matching function for call to 'GxB_Scalar_setElement_FC32'
return GxB_Scalar_setElement_FC32(scalar, x);
^~~~~~~~~~~~~~~~~~~~~~~~~~
/opt/homebrew/include/GraphBLAS.h:2173:10: note: candidate function not viable: no known conversion from 'const std::complex<float>' to 'GxB_FC32_t' (aka '_Complex float') for 2nd argument
GrB_Info GxB_Scalar_setElement_FC32 // s = x
^
std::complex and the C99 _Complex types have compatible memory layouts.
Interested in using this library with armadillo. Should be a fairly simple task to bind to the MAT/SP_MAT/VEC types. Let me know if you're willing to take a crack at it before I take a crack at it myself. Thanks!
Hi, I would like to ask if it is possible to change the following behavior. On Linux many dependencies are available in the repositories, for instance:
gtest
eigen
blaze
suitesparse (provides GraphBLAS)
I am trying to package fast_matrix_market, but I noticed that FetchContent downloads these dependencies anyway and later installs them together with fast_matrix_market.
pkgname=fast_matrix_market
pkgdesc="Fast and full-featured Matrix Market I/O library"
pkgver=1.7.6
pkgrel=1
arch=(x86_64)
url="https://github.com/alugowski/${pkgname}"
license=(BSD-2-Clause)
depends=(python)
makedepends=(python-build python-installer pybind11 python-scikit-build-core cmake)
checkdepends=(gtest suitesparse eigen blaze armadillo python-scipy python-threadpoolctl python-pytest)
optdepends=('eigen: '
'blaze: '
'armadillo: '
'python-scipy: ') # 'fastmatmr'
source=(${pkgname}-${pkgver}.tar.gz::${url}/archive/v${pkgver}.tar.gz)
sha512sums=('e97da2daf76770502e862a13b7b61aaf8797d9bec9d33f182ff28c2a0b3f8e8b078b559643d980d6c7f3ff57da9cf52bde8807120b9373e61851fd57373d51aa')
build() {
cmake \
-S ${pkgname}-${pkgver} \
-B build \
-DCMAKE_BUILD_TYPE=None \
-DCMAKE_INSTALL_PREFIX=/usr \
-DBUILD_SHARED_LIBS=TRUE \
-DCMAKE_CXX_STANDARD=23 \
-DFAST_MATRIX_MARKET_BENCH=ON \
-DFAST_MATRIX_MARKET_TEST=ON \
-DFMM_USE_DRAGONBOX=ON \
-DFMM_USE_FAST_FLOAT=ON \
-DFMM_USE_RYU=ON \
-Wno-dev
cmake --build build --target all
cd ${pkgname}-${pkgver}/python
python -m build --wheel --skip-dependency-check --no-isolation
}
check() {
ctest --verbose --output-on-failure --test-dir build
cd ${pkgname}-${pkgver}/python
python -m venv --system-site-packages test-env
test-env/bin/python -m installer dist/*.whl
test-env/bin/python -m pytest
}
package() {
DESTDIR="${pkgdir}" cmake --build build --target install
install -Dm 644 ${pkgname}-${pkgver}/LICENSE.txt -t "${pkgdir}/usr/share/licenses/${pkgname}"
cd ${pkgname}-${pkgver}/python
PYTHONPYCACHEPREFIX="${PWD}/.cache/cpython/" python -m installer --destdir="${pkgdir}" dist/*.whl
rm -r ${pkgdir}/usr/include/{blaze,eigen3,gtest,gmock}
rm -r ${pkgdir}/usr/lib/cmake/GTest
rm -r ${pkgdir}/usr/share/{blaze,eigen3}
rm -r ${pkgdir}/usr/lib/lib{gmock*,gtest*}
rm -r ${pkgdir}/usr/lib/pkgconfig/{gmock*,gtest*}
}
A workaround could be to manually delete files after the CMake installation, to avoid duplicating gtest, blaze, eigen, and so on.
I've started maintaining a set of R bindings for this excellent library. There are some very nice benchmarks compared to the existing Matrix package in R. Perhaps a mention of this in the readme would be welcome? I'd be happy to open a PR.
The only thing left is support for reading from files.
Hello Adam! Thanks so much for the fast_matrix_market project you created.
I tried to build the example with CMake. However, linking fails with /9/thread:126: undefined reference to `pthread_create'.
I fixed it with target_link_libraries(simple1 fast_matrix_market::fast_matrix_market pthread).
However, I think the pthread library should be linked by fast_matrix_market itself.
I would like to verify this.
Best wishes!
This is yet another (small) difference between fmm.mmread and scipy.io.mmread, which I think you may like to know about if you want fmm.mmread to be more-or-less a drop-in replacement for scipy.io.mmread:
import fast_matrix_market as fmm
import scipy
from io import StringIO
from scipy.io.tests import test_mmio
try:
    scipy.io.mmread(StringIO(test_mmio._over64bit_integer_dense_example))
except OverflowError:
    pass
try:
    scipy.io.mmread(StringIO(test_mmio._over64bit_integer_sparse_example))
except OverflowError:
    pass
try:
    fmm.mmread(StringIO(test_mmio._over64bit_integer_dense_example))
except ValueError:  # <-- This isn't OverflowError
    pass
try:
    fmm.mmread(StringIO(test_mmio._over64bit_integer_sparse_example))
except ValueError:  # <-- This isn't OverflowError
    pass
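For readers unfamiliar with the failure mode, the sketch below shows the kind of range check that separates the two cases: Python parses arbitrarily large integers without complaint, so a reader targeting 64-bit storage has to decide for itself which exception to raise. parse_int64 and try_parse are hypothetical helpers, not code from either library.

```python
INT64_MIN = -(2 ** 63)
INT64_MAX = 2 ** 63 - 1


def parse_int64(token: str) -> int:
    """Parse token as an integer, raising OverflowError if it exceeds int64."""
    value = int(token)  # raises ValueError for non-numeric text
    if not INT64_MIN <= value <= INT64_MAX:
        raise OverflowError(f"{token} does not fit in a signed 64-bit integer")
    return value


def try_parse(token: str):
    """Caller-side pattern that tolerates either exception type."""
    try:
        return parse_int64(token)
    except (OverflowError, ValueError):
        return None


ok = try_parse("9223372036854775807")       # largest int64 value
too_big = try_parse("9223372036854775808")  # one past the largest int64 value
```

Catching both exception types, as try_parse does, is one way for calling code to stay compatible with either reader while the difference exists.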