Comments (7)
Hmm... what that effectively involves is an efficient sorted group-by in the row numbers. Are you certain that scipy is doing that suboptimally, or is that just inherently a slow operation?
from sparse.
Actually, I take that back. Assuming no duplicates, you could do it as a O[N log N] lexsort on the row & column indices, followed by an O[N] pass to construct the indptr
array. It would be like 4-5 lines of numpy 😄 Would be worth benchmarking.
from sparse.
scipy.sparse also spends about half of its time checking consistency. If we're able to guarantee things externally it would be nice to skip this.
from sparse.
What if we are sorted already? In my current application (which I think is representative of a few applications), I want to perform tocsr-then-matmul computations repeatedly on the same data throughout the computation. The data has already been normalized with sum_duplicates
and so may already be in some predictable order. When I was thinking about this briefly before I was curious if it was doable in two linear passes over the data.
from sparse.
Here's some code for fast conversion to csr using numpy routines; csc would be similar:
import numpy a snp
from scipy.sparse import csr_matrix
def to_csr(self):
assert self.ndim == 2
# Pass 1: sum duplicates
self.sum_duplicates()
# Pass 2: sort indices
order = np.lexsort(self.coords[::-1])
row, col = self.coords[:, order]
data = self.data[order]
# Pass 3: count nonzeros in each row
indptr = np.zeros(self.shape[0] + 1, dtype=row.dtype)
np.cumsum(np.bincount(row, minlength=self.shape[0]), out=indptr[1:])
return csr_matrix((data, col, indptr), shape=self.shape)
It basically takes three passes through the data: (1) summing the duplicates, (2) sorting the indices, and (3) counting the nonzeros within each row. If you know that any of those conditions is already met, you could skip that step and speed things up.
from sparse.
Actually, if the duplicates are pre-summed and the indices pre-sorted, then you could cache the cumulative count of nonzeros (i.e. indptr
) and get CSR conversion basically for free!
from sparse.
OK, so it sounds like it would be useful to track metadata about duplicates, sorted order, and nonzero-counts (hrm, how does this generalize?). Which operations require which attributes? Which operations maintain these attributes?
For example most operations other than add maintain no_duplicates. I think that reshape maintains sorted order although I'm not sure. Transpose kills sorted order. etc..
from sparse.
Related Issues (20)
- Unable to install shap on python 3.11 doubt to numba cannot install on Python version 3.11.2 HOT 1
- `einsum` doesn't handle `dtype` and `optimize` kwargs HOT 2
- Proper usage of sparse 3D array: initialization, assignment of values and use in calculations HOT 2
- np.diff equivalent HOT 2
- DeprecationWarning: coords should be an ndarray. This will raise a ValueError in the future. HOT 1
- Test failures in test_tensordot HOT 14
- Correct type hint for sparse COO matrix? HOT 1
- numpy ufuncs can change the compression of a gcxs array HOT 12
- Support `arr.size >= 2 ** 64` as long as each dimension is `< 2 ** 64` HOT 1
- Constructing GCXS from non-canonical `scipy.sparse.csr_matrix` results in wrong results HOT 5
- Support for Fortran order in COO.flatten() HOT 1
- Consider MatRepr for `__repr__` and `_repr_html_` HOT 9
- `GCXS` matmul => slice leads to incorrect results HOT 4
- `TypeError` when accessing `.real` and `.imag` attribute on sparse array of string type
- Installing directly from GitHub fails on Python 3.12 HOT 5
- Some functions seem missing from the top level import HOT 2
- [conda] Sparse version string is not recognized by pip list/check HOT 1
- Segmentation fault on arm64 HOT 11
- Remove runtime dependency on SciPy HOT 1
- Xarray/Numpy/Dask Indexing using a Sparse Array HOT 6
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from sparse.