sparse's Issues

NumPy 1.12.0 has failing tests

Somewhere there is auto-densification happening, as I have many failing tests on my setup:

sparse/tests/test_core.py::test_elemwise[func0] FAILED
sparse/tests/test_core.py::test_elemwise[func1] FAILED
sparse/tests/test_core.py::test_elemwise[func2] FAILED
sparse/tests/test_core.py::test_elemwise[func3] FAILED
sparse/tests/test_core.py::test_elemwise[func4] FAILED
sparse/tests/test_core.py::test_elemwise[func5] FAILED
sparse/tests/test_core.py::test_elemwise[func6] FAILED
sparse/tests/test_core.py::test_elemwise[func7] FAILED
sparse/tests/test_core.py::test_elemwise[func8] FAILED
sparse/tests/test_core.py::test_elemwise[func9] FAILED
sparse/tests/test_core.py::test_elemwise[func10] PASSED
sparse/tests/test_core.py::test_elemwise[func11] FAILED
sparse/tests/test_core.py::test_elemwise[func12] PASSED
sparse/tests/test_core.py::test_elemwise[func13] FAILED
sparse/tests/test_core.py::test_elemwise[func14] FAILED
sparse/tests/test_core.py::test_elemwise[func15] PASSED
sparse/tests/test_core.py::test_elemwise[func16] PASSED

with

func = <ufunc 'floor'>

    @pytest.mark.parametrize('func', [np.expm1, np.log1p, np.sin, np.tan,
                                      np.sinh, np.tanh, np.floor, np.ceil,
                                      np.sqrt, np.conj, np.round, np.rint,
                                      lambda x: x.astype('int32'), np.conjugate,
                                      np.conj, lambda x: x.round(decimals=2), abs])
    def test_elemwise(func):
        s = sparse.random((2, 3, 4), density=0.5)
        x = s.todense()

        fs = func(s)

>       assert isinstance(fs, COO)
E       assert False
E        +  where False = isinstance(array([[[ 0.,  0.,  0.,  0.],\n        [ 0.,  0.,  0.,  0.],\n        [ 0.,  0.,  0.,  0.]],\n\n       [[ 0.,  0.,  0.,  0.],\n        [ 0.,  0.,  0.,  0.],\n        [ 0.,  0.,  0.,  0.]]]), COO)

Additionally, the two tests

sparse/tests/test_core.py::test_addition_not_ok_when_large_and_sparse
sparse/tests/test_core.py::test_raise_dense

basically never finish (I aborted them using Ctrl+C).

To reproduce this behaviour you can use the following tox.ini file

[tox]
envlist = {py27,py36}-{numpy12,numpylatest}
[testenv]
deps =
    numpy12: numpy==1.12.0
basepython =
    py27: python2.7
    py36: python3.6
commands=
    pip list
    py.test {posargs}
extras=
    docs
    tests

and run the specific tests using

tox -e py36-numpy12 -- -k test_elemwise
tox -e py36-numpy12 -- -k test_addition_not_ok_when_large_and_sparse
tox -e py36-numpy12 -- -k test_raise_dense

Reorganize coo.py?

coo.py is getting large and unwieldy. I sometimes spend a lot of time scrolling through it or searching for things in it. I was thinking of breaking it up into smaller pieces for better maintainability.

I propose the following:

  • A coo directory.
  • coo/coo.py (for the COO class)
  • Move large implementations out of the COO class into separate modules: one for elemwise, one for reductions, one for __getitem__, and one for the remaining methods.
  • Similarly for DOK.

I'd welcome any proposed changes to this structure.

Automatic conversion to dense arrays

Operations on sparse arrays sometimes produce dense arrays. This type instability can cause some frustration downstream but may be optimal performance-wise.

On many occasions we actually inherit this behavior from scipy.sparse, which returns numpy.matrix objects in some cases. Currently we also return dense numpy.ndarray objects when this happens and when the number of non-zeros is high. I'm running into cases where I want to do this more and more, especially in parallel computing cases where I tensordot and add together many sparse arrays. Switching to dense starts to make a lot of sense.

However, this will likely cause some frustration downstream, as users sometimes receive sparse and sometimes dense arrays depending on their data at the moment. Is this performance gain worth it?

Add description to repo

It'd be nice to add a description to the repo. I'd do it myself, but it appears I don't have the permissions (I can't see the edit button). cc @mrocklin

Add @ operator for dot product

I have a patch in a branch that implements and adds tests for .__matmul__. If you can give me write permission, I'll push it.
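
For reference, a minimal sketch of what a 2-D matmul can reduce to, using sparse.tensordot and sparse.random as exercised elsewhere in these issues (illustrative only; not the actual branch):

import numpy as np
import sparse

def matmul(a, b):
    # @ contracts the last axis of a with the second-to-last axis of b
    # (which is the first axis when b is 2-D).
    return sparse.tensordot(a, b, axes=((a.ndim - 1,), (b.ndim - 2,)))

x = sparse.random((3, 4), density=0.5)
y = sparse.random((4, 5), density=0.5)
assert np.allclose(matmul(x, y).todense(), x.todense() @ y.todense())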

Partial Add Reduction Loses Index Combinations

Is this the correct behavior for a partial add reduction? Using the canonical example, I added a reduction over the first two indices, and the results look strange in that the first index of the resulting 2D sparse matrix is all zeros. Here is a simple example illustrating the point.

Thanks,
Bruce

import numpy as np
import pandas as pd
import sparse

n = 1000
ndims = 4
nnz = 1000000
coords = np.random.randint(0, n - 1, size=(ndims, nnz))
data = np.random.random(nnz)
x = sparse.COO(coords, data, shape=(n,) * ndims)
z = x.sum(axis=(0, 1))
# How many unique combos are in the last 2 slots of x?
df = pd.DataFrame(coords[:2].T)
print(df.drop_duplicates().shape)
# How many unique combos are in the 2 slots of z?
df2 = pd.DataFrame(z.coords.T)
print(df2.drop_duplicates().shape)
# Shouldn't these match?
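
One way to narrow this down (a hedged suggestion, not part of the original report) is to compare the partial reduction against NumPy on an array small enough to densify; if the assertion below holds, the reduction itself agrees with NumPy and the discrepancy lies in how the coordinate combinations are counted.

import numpy as np
import sparse

# Small sanity check, assuming sparse.random behaves as used elsewhere in these issues.
s = sparse.random((5, 5, 5, 5), density=0.2)
d = s.todense()
assert np.allclose(s.sum(axis=(0, 1)).todense(), d.sum(axis=(0, 1)))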

Current home of this repository?

There seem to be multiple homes for this repository. The PyPI page still refers to this one as the current home.

What is the current home for this project? I was considering adding broadcasting (as I needed it for a project of mine) and also contributing in other ways (I can add at least some of the things on #1). I can't seem to find the fork of this project where my additions will have the maximum impact.

I would suggest that, at the very minimum, the PyPI page should point to the current home of this project.

Testing fails with Python 3.6

See the corresponding log in the Debian CI.

=================================== FAILURES ===================================
___________________ test_tensordot[a_shape8-b_shape8-axes8] ____________________

a_shape = (4,), b_shape = (4,), axes = (0, 0)

    @pytest.mark.parametrize('a_shape,b_shape,axes', [
        [(3, 4), (4, 3), (1, 0)],
        [(3, 4), (4, 3), (0, 1)],
        [(3, 4, 5), (4, 3), (1, 0)],
        [(3, 4), (5, 4, 3), (1, 1)],
        [(3, 4), (5, 4, 3), ((0, 1), (2, 1))],
        [(3, 4), (5, 4, 3), ((1, 0), (1, 2))],
        [(3, 4, 5), (4,), (1, 0)],
        [(4,), (3, 4, 5), (0, 1)],
        [(4,), (4,), (0, 0)],
        [(4,), (4,), 0],
    ])
    def test_tensordot(a_shape, b_shape, axes):
        a = random_x(a_shape)
        b = random_x(b_shape)
    
        sa = COO.from_numpy(a)
        sb = COO.from_numpy(b)
    
        assert_eq(np.tensordot(a, b, axes),
>                 sparse.tensordot(sa, sb, axes))

tests/test_core.py:115: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/usr/lib/python3/dist-packages/sparse/utils.py:12: in assert_eq
    yy = y.todense()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <COO: shape=(), dtype=float64, nnz=0>

    def todense(self):
        self = self.sum_duplicates()
        x = np.zeros(shape=self.shape, dtype=self.dtype)
    
        coords = tuple([self.coords[i, :] for i in range(self.ndim)])
>       x[coords] = self.data
E       ValueError: setting an array element with a sequence.

/usr/lib/python3/dist-packages/sparse/core.py:159: ValueError
=============== 1 failed, 125 passed, 1 xfailed in 0.94 seconds ================
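
For what it's worth, a hedged sketch of how todense() could special-case the 0-d result that trips this test (illustrative only; not necessarily the fix that was applied):

import numpy as np

def todense(self):
    self = self.sum_duplicates()
    x = np.zeros(shape=self.shape, dtype=self.dtype)
    if self.ndim == 0:
        # A 0-d COO has no coordinates to index with; write the single value, if any.
        if self.nnz:
            x[...] = self.data[0]
        return x
    coords = tuple(self.coords[i, :] for i in range(self.ndim))
    x[coords] = self.data
    return x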

nnz not correct for scalars.

x.nnz should be 1 for a nonzero scalar x, yet it is zero. MWE:

import numpy as np
x = 1 + np.random.rand(5, 5)
import sparse
xs = sparse.COO.from_numpy(x)
xs[1, 1]
Out[6]: <COO: shape=(), dtype=float64, nnz=0, sorted=True, duplicates=False>

Arbitrary axis order?

I just feel it could be beneficial to allow an arbitrary axis order to be specified when creating an array.

Upsides:

  • Indexing/reducing can be made faster along any given dimension, as can a number of other operations.

Downsides:

  • It will touch all sort_indices calls and also all cases where we linearize. Mixed orders will need to be handled carefully.

Turn Scipy into a soft dependency

Right now, we are only using Scipy in:

  • dot/tensordot
  • Conversion to/from scipy.sparse.spmatrix subclasses.

Is it still sensible to keep it as a hard dependency?

Item Assignment and Integer Slicing

Hi,

I find your library very useful, but I have a few questions for you. What is the best way to add or delete data in the array? Also, is it possible to access a value using integer indexing?

x = sparse.COO(coords, data, shape=(3, 3, 3))

# Is there a way to achieve this?
x[2, 2, 2] = 2  # Cannot do this
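
A possible interim workaround (a sketch under the assumption that the index being set is not already present; coo_set is a hypothetical helper, not part of the library):

import numpy as np
import sparse

def coo_set(x, index, value):
    # Rebuild the COO with the new coordinate appended. Duplicates of an
    # existing coordinate would be summed together, hence the assumption above.
    coords = np.concatenate([x.coords, np.array(index)[:, None]], axis=1)
    data = np.concatenate([x.data, [value]])
    return sparse.COO(coords, data, shape=x.shape)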

Thanks.

Conda-forge package?

We are considering using sparse in a project that we distribute through conda-forge. However, I couldn't find sparse available on Anaconda or conda-forge. This is an issue because, currently, to build a package on conda-forge all of its dependencies must be available there; see conda/conda-build#548. Is there any plan to distribute it through those channels?

left-side np.scalar multiplication

Hi,

I noticed that multiplication with a NumPy scalar such as np.float32 fails:

In [1]: import numpy as np
   ...: import sparse
   ...: x = sparse.random((2, 3, 4), density=0.5)
   ...: x * np.float32(2.0)  # This succeeds
   ...: 
Out[1]: <COO: shape=(2, 3, 4), dtype=float64, nnz=12, sorted=False, duplicates=True>

In [2]: np.float32(2.0) * x  # fails
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-2-e1d24f7d85b3> in <module>()
----> 1 np.float32(2.0) * x  # fails

~/Dropbox/projects/sparse/sparse/core.py in __array_ufunc__(self, ufunc, method, *inputs, **kwargs)
    491     def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
    492         if method == '__call__':
--> 493             return COO._elemwise(ufunc, *inputs, **kwargs)
    494         elif method == 'reduce':
    495             return COO._reduce(ufunc, *inputs, **kwargs)

~/Dropbox/projects/sparse/sparse/core.py in _elemwise(func, *args, **kwargs)
    735             other = args[1]
    736             if isinstance(other, COO):
--> 737                 return self._elemwise_binary(func, *args[1:], **kwargs)
    738             elif isinstance(other, scipy.sparse.spmatrix):
    739                 other = COO.from_scipy_sparse(other)

AttributeError: 'numpy.float32' object has no attribute '_elemwise_binary'
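
Until the dispatch handles a scalar as the first argument, a possible workaround (hedged; it simply keeps the COO on the left-hand side, which the In [1] example above shows works):

import numpy as np
import sparse

x = sparse.random((2, 3, 4), density=0.5)
y = x * np.float32(2.0)  # succeeds, as in the In [1] example above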

Hard dependency on scipy.sparse?

It looks like you import scipy.sparse both locally and globally. I'd probably lean toward locally, to make scipy a soft dependency.
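
For reference, a minimal sketch of the local-import pattern (illustrative; not the library's actual code):

def to_scipy_sparse(self):
    # Import inside the method so scipy is only required when this is called.
    import scipy.sparse

    return scipy.sparse.coo_matrix(
        (self.data, (self.coords[0], self.coords[1])), shape=self.shape
    )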

einsum implementation

Hi guys, do you have plans to add einsum to this library? I believe NumPy's einsum is implemented in C. There are einsum operations I cannot directly express using tensordot. By the way, this library is really awesome; thanks for creating it.

Documentation thoughts

Here are some initial thoughts on the documentation. Feedback welcome.

  1. The user manual section has a lot of useful information, like installation and quickstart instructions, that would be worth having visible from the main page. I'm inclined to flatten out this subsection and push everything to the top level.

  2. We might want to avoid referring to this library as sparse when possible, instead maybe using terms like "this library". The term sparse is fairly generic and awkward (at least to me) given that it is usually an adjective, not a proper noun.

    Some examples:

    • The section header "Contributing to sparse" might be renamed to just "Contributing"

    • "sparse can be obtained from pip via"

      could be changed to this text

      "This library can be installed with pip:

  3. I wonder if small pages like converting.html could be rolled into some other larger page like operations.html. That page might also just include all of its subpages in the same HTML document. The hierarchy might not be doing us much good here. We can still have a table of contents for the page, but it might be nice to flatten things out a bit. This can help people who just want to read linearly through a narrative.

    Not that this is a particularly good example, but consider the Pandas documentation on groupby. It is a long linear page with a detailed table of contents on the left. People can easily navigate to a particular subsubsection, but that subsubsection doesn't need to live in isolation.

I can do some of this work if desired. I just wanted to check in with @hameerabbasi to make sure that he is ok with these changes to the documentation.

Indexing is O(n) when it can be made O(log n)

If we call COO.sum_duplicates() beforehand, it may be possible, with some trickery, to make indexing O(log n). I'm still looking into how this can be done, and how to generalize it to multiple indices, but I know it's possible.

One concern I have is that I do not want it to become O(n log n) for multiple indices, so this needs to be thought out carefully.
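
A hedged sketch of the idea for a single integer index (assumes the coordinates are already sorted in row-major order and duplicate-free; the helper name is illustrative):

import numpy as np

def coo_getitem_sorted(coords, data, shape, index):
    """O(log n) lookup of a single integer index tuple via binary search."""
    strides = np.cumprod((1,) + shape[:0:-1])[::-1]    # row-major strides
    linear = (coords * strides[:, None]).sum(axis=0)   # sorted linearized coordinates
    target = int(np.dot(strides, index))
    i = np.searchsorted(linear, target)
    if i < len(data) and linear[i] == target:
        return data[i]
    return data.dtype.type(0)                          # fill value for missing entries

np.searchsorted provides the O(log n) step; generalizing this to slices or multiple indices is the part that needs the careful thought mentioned above.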

Add support for object arrays

We don't support object arrays in many cases at this moment. I would suggest leaving this open but ignoring it until someone else requires it.

In-place operations

I was wondering about operators such as operator.iadd, etc. There are a few ways we can go about this:

  1. Support them by mutating the object in-place and invalidating the cache, i.e. performing the operation and then making self a copy of the returned object.
  2. Support them only when the sparsity structure is the same, and modify data in-place.
  3. Don't support them at all.

If we want to maintain compatibility with Numpy code (I hope to make COO a mostly drop-in replacement for ndarray with a few exceptions at some point), I would go with 1, with a warning in the docs that in-place isn't really "in-place".

If we want to do our own thing... Then we have options 2 and 3.
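
A hedged sketch of what option 1 might look like (attribute names are illustrative, including the hypothetical _cache):

def __iadd__(self, other):
    result = self + other      # ordinary out-of-place elementwise addition
    self.coords = result.coords
    self.data = result.data
    self._cache = None         # hypothetical: invalidate any cached state
    return self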

Cython for radix argsort and fast indexing?

Since our main bottleneck is sorting, I was considering adding Cython as a dev dependency to create:

  • A custom radix sort solution.
  • Blazing-fast indexing (#60).

I was considering this for 0.3.
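
For context, a pure-NumPy sketch of an LSD radix argsort over unsigned 64-bit keys, i.e. the loop a Cython implementation would accelerate (illustrative only):

import numpy as np

def radix_argsort(keys, bits=16):
    keys = np.asarray(keys, dtype=np.uint64)
    order = np.arange(len(keys))
    mask = np.uint64((1 << bits) - 1)
    for shift in range(0, 64, bits):
        digit = (keys[order] >> np.uint64(shift)) & mask
        # A stable sort per digit preserves the ordering from earlier passes.
        order = order[np.argsort(digit, kind='mergesort')]
    return order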

dot/tensordot fails on 1D sparse arrays

Seems like this needs to be special-cased somewhere:

import numpy as np
import sparse as sp

x = sp.COO.from_numpy(np.arange(10))
y = sp.COO.from_numpy(np.arange(10))
sp.dot(x, y)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-7-3dbcf0587ccb> in <module>()
      2 x = sp.COO.from_numpy(np.arange(10))
      3 y = sp.COO.from_numpy(np.arange(10))
----> 4 sp.dot(x, y)

/Users/jakevdp/github/mrocklin/sparse/sparse/core.py in dot(a, b)
    429 
    430 def dot(a, b):
--> 431     return tensordot(a, b, axes=((a.ndim - 1,), (b.ndim - 2,)))
    432 
    433 

/Users/jakevdp/github/mrocklin/sparse/sparse/core.py in tensordot(a, b, axes)
    425     res = _dot(at, bt)
    426     res = COO.from_scipy_sparse(res)  # <--- modified
--> 427     return res.reshape(olda + oldb)
    428 
    429 

/Users/jakevdp/github/mrocklin/sparse/sparse/core.py in reshape(self, shape)
    239             strides *= d
    240 
--> 241         return COO(coords, self.data, shape, has_duplicates=self.has_duplicates)
    242 
    243     def to_scipy_sparse(self):

/Users/jakevdp/github/mrocklin/sparse/sparse/core.py in __init__(self, coords, data, shape, has_duplicates)
     69         self.data = np.asarray(data)
     70         self.coords = np.asarray(coords)
---> 71         self.coords = self.coords.astype(np.min_scalar_type(max(self.shape)))
     72         assert len(data) == self.coords.shape[1]
     73         self.has_duplicates = has_duplicates

ValueError: max() arg is an empty sequence

Add advanced indexing support

I would like to have support for advanced indexing for both retrieval and assignment. Ideally I was hoping to find something that could serve as a drop-in replacement for numpy.ndarray for these types of operations. Is this functionality something that would be in the scope of this library? Are there any thoughts on how it should be implemented?

On a related note, I noticed that COO is currently immutable and thus doesn't allow item assignment. However, I wonder if one could support assignment by having COO make an in-place copy of itself. Of course this will be extremely inefficient for updating a single element, but when addressing a large number of elements in parallel the overhead from the copy should be more manageable. Of course, in the documentation you could stress that setting elements of COO individually is not recommended.

Move repository to broader org? If so, which?

If this project attracts users then it feels weird to have my account be the main fork for collaboration. We should consider moving it to another organization with a broader committer-base.

This project is welcome in github.com/dask but we might also consider other homes.

Reduce operations raise error for negative axis values

Reduce operations like .sum(), .prod() etc. raise exceptions when you pass a negative axis value:

import sparse
sparse.random((40, 50)).sum(0)  # works
sparse.random((40, 50)).sum(1)  # works
sparse.random((40, 50)).sum(-1)  # fails, should be equivalent to .sum(1)

The traceback I get:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "proj_dir/sparse/sparse/coo.py", line 1038, in prod
    return self.reduce(np.multiply, axis=axis, keepdims=keepdims, dtype=dtype)
  File "proj_dir/sparse/sparse/coo.py", line 771, in reduce
    a = self.transpose(neg_axis + axis)
  File "proj_dir/sparse/sparse/coo.py", line 1099, in transpose
    raise ValueError("repeated axis in transpose")
ValueError: repeated axis in transpose
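
A hedged sketch of normalizing negative axes before the transpose step (illustrative; not the library's actual internals):

def normalize_axis(axis, ndim):
    # Map negative axis values (e.g. -1) onto their positive equivalents.
    if axis is None:
        return None
    if not isinstance(axis, tuple):
        axis = (axis,)
    return tuple(a % ndim for a in axis)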

General ``dot()``

With the reshape and transpose methods you have, it would be fairly trivial to generalize the dot() function to match the numpy result for multi-dimensional arrays. It might be worth adding!
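
A hedged sketch of what such a generalization could look like, built on the existing tensordot and ignoring edge cases like scalars (illustrative only; general_dot is a placeholder name):

import sparse

def general_dot(a, b):
    if b.ndim == 1:
        # NumPy contracts the last axis of a with the only axis of b.
        return sparse.tensordot(a, b, axes=((a.ndim - 1,), (0,)))
    # Otherwise NumPy contracts the last axis of a with the second-to-last axis of b.
    return sparse.tensordot(a, b, axes=((a.ndim - 1,), (b.ndim - 2,)))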

Release 0.2?

I was wondering if it might be wise at this point to release version 0.2 (after my current pull requests have been merged). All major operations have been added.

elementwise_binary doesn't return correct results

I'm using COO.elemwise_binary to set values in a tensor, but it doesn't return the expected result. Unfortunately I don't really understand the code of the function, otherwise I would just have made a pull request. Here is a minimal example to reproduce the bug:

import sparse as sp
import operator
import numpy as np

size = 3
dimension = 3

coords1 = np.ndarray(shape = 0, dtype=bool)
data1 = np.ndarray(shape = 0, dtype=bool)
tensor1 = sp.COO(coords=coords1, data=data1, shape=((size,) * dimension))
print('emptyTensor')
print(tensor1)
print(tensor1.todense())

coords2 = np.asarray(a=[[0], [1], [2]])
data2 = np.asarray(a=[True], dtype=bool)
tensor2 = sp.COO(coords=coords2, data=data2, shape=((size,) * dimension))
print('\noneElementTensor')
print(tensor2)
print(tensor2.todense())

resultTensor = tensor1.elemwise_binary(operator.or_, tensor2)
print('\nresultTensor = emptyTensor | oneElementTensor')
print(resultTensor)
print(resultTensor.todense())

The resultTensor should be equal to tensor2 (True at tensor2[0, 1, 2], all other fields False), but all of its fields are False.
Additionally, I would expect the dtype of resultTensor to be bool, because both input COOs tensor1 and tensor2 are dtype=bool and operator.or_ is generally expected to be a bool × bool → bool operation.

Output:

emptyTensor
<COO: shape=(3, 3, 3), dtype=bool, nnz=0, sorted=False, duplicates=True>
[[[False False False]
  [False False False]
  [False False False]]

 [[False False False]
  [False False False]
  [False False False]]

 [[False False False]
  [False False False]
  [False False False]]]

oneElementTensor
<COO: shape=(3, 3, 3), dtype=bool, nnz=1, sorted=False, duplicates=True>
[[[False False False]
  [False False  True]
  [False False False]]

 [[False False False]
  [False False False]
  [False False False]]

 [[False False False]
  [False False False]
  [False False False]]]

resultTensor = emptyTensor | oneElementTensor
<COO: shape=(3, 3, 3), dtype=int64, nnz=0, sorted=False, duplicates=False>
[[[0 0 0]
  [0 0 0]
  [0 0 0]]

 [[0 0 0]
  [0 0 0]
  [0 0 0]]

 [[0 0 0]
  [0 0 0]
  [0 0 0]]]

Release v0.3?

Since we're finally at a point where we can support XArray (or maybe the other way around?), maybe it's time to release v0.3.

cc @mrocklin

Did someone force-push?

My local master branch now has different commits than pydata/master

It looks like the changed commit was here:

In my experience people tend to avoid rewriting history for two reasons:

  1. Immutability and security. It's concerning whenever the historical record changes
  2. It confuses non-git experts. Anyone who had checked out a version of this repository and then pushes will now push a ton of extra useless commits. They need to be comfortable with rebasing, which is difficult for new users.

Support for triu/tril

It would be nice to have support for triu and tril functions, to convert a dense Dask Array into a sparse one that only holds the upper or lower triangle, respectively.
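
A hedged sketch of triu for a 2-D COO by masking coordinates (tril is the mirror case; illustrative only):

import numpy as np
import sparse

def triu(x, k=0):
    # Cast to a signed type first, since COO coordinates may be stored unsigned.
    i = x.coords[0].astype(np.int64)
    j = x.coords[1].astype(np.int64)
    keep = j - i >= k  # keep entries on or above the k-th diagonal
    return sparse.COO(x.coords[:, keep], x.data[keep], shape=x.shape)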

Support Everything that XArray Expects

bolt-project/bolt#58

  • single argument ufuncs (sin, exp, etc.) and ufunc like functions (pd.isnull, notnull, astype, around, isclose)
  • broadcasting binary operations (e.g., ndarray arithmetic)
  • three argument version of where (preferably with broadcasting)
  • aggregation methods
    • max/min/sum/prod
    • argmax/argmin/std/var
    • nan-skipping aggregations (e.g., nanmin, nanmax, nanprod, nansum)
  • indexing with integer arrays, booleans, slices, None
  • transpose
  • indexing:
    • basic indexing (int/slice)
    • outer indexing for a single array
    • outer indexing (int, slice and 1d integer arrays separately applied to each axis)
    • vectorized indexing (integer arrays with broadcasting, like NumPy)
  • broadcast_to (NumPy 1.10)
  • concatenate and stack (NumPy 1.10)

Minimum version of scipy

I noticed that we need at least scipy 0.19.
Can we update requirements.txt or work to support lower versions of scipy?

RuntimeWarning: divide by zero encountered in reciprocal

As reported during the Debian build process for version 1.0.1:

=============================== warnings summary ===============================
.pybuild/pythonX.Y_3.6/build/tests/test_core.py::test_scalar_exponentiation
  /<<PKGBUILDDIR>>/.pybuild/pythonX.Y_3.6/build/tests/test_core.py:392: RuntimeWarning: divide by zero encountered in reciprocal
    assert_eq(x ** -1, a ** -1)

Problems converting dask chunked sparse arrays larger than 256x256

Presumably related to dask/dask#2586: computing a sparse array output from a chunked dask sparse array fails if the resulting sparse array is larger than 256x256. The coord type is uint16, but it looks like intermediate calculations may overflow:

import dask.array as da
import sparse


def print_sparse_nonidentity(m, name):
    valid = True
    for i, j, v in zip(m.coords[0], m.coords[1], m.data):
        if not i == j:
            print name, i, j, v
            valid = False
    if valid:
        print "OK"


def identity_da(size, chunksize):
    p = {(i, i): 1. for i in range(size)}

    # single chunk
    a = da.from_array(sparse.COO(p), chunks=(chunksize, chunksize), asarray=False)
    return a


print "Should be identity matrix"
size = 256
print "size = ", size
a = identity_da(256, size//2)
c = a.compute()
print c
print_sparse_nonidentity(c, "256 chunked")

size = 258
print "size = ", size, " unchunked"
a = identity_da(size, size)
c = a.compute()
print c
print_sparse_nonidentity(c, "258 unchunked")

print "size = ", size, " chunked"
a = identity_da(size, size//2)
c = a.compute()
print c
print_sparse_nonidentity(c, "258 chunked")

Running this works for a 256x256 matrix that is chunked, or a 258x258 matrix with only one chunk, but a 258x258 matrix broken into 4 chunks fails:

$ python fail-conversion.py
Should be identity matrix
size =  256
<COO: shape=(256, 256), dtype=float64, nnz=256, sorted=False, duplicates=False>
OK
size =  258  unchunked
<COO: shape=(258, 258), dtype=float64, nnz=258, sorted=False, duplicates=False>
OK
size =  258  chunked
<COO: shape=(258, 258), dtype=float64, nnz=258, sorted=False, duplicates=False>
258 chunked 256 0 1.0
258 chunked 257 1 1.0

The versions are as follows:

$ python
Python 2.7.13 (default, Dec 18 2016, 07:03:39)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import dask
>>> print dask.__version__
0.15.2
>>> import sparse
>>> print sparse.__version__
0.1.1
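
The suspected overflow can be shown in isolation (this is an assumption about the cause, not a confirmed diagnosis): linearizing uint16 coordinates for a 258-wide array can exceed the uint16 range and wrap around.

import numpy as np

row, ncols = np.uint16(257), np.uint16(258)
# 257 * 258 = 66306 does not fit in uint16, so the product wraps around to 770.
print(row * ncols)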

requirements.txt is missing in PyPI

Hi,

I tried to install sparse into a new (conda) environment with pip install sparse, but it fails with the following error.

Writing /tmp/easy_install-mhkizn5d/sparse-0.1.1/setup.cfg
Running sparse-0.1.1/setup.py -q bdist_egg --dist-dir /tmp/easy_install-mhkizn5d/sparse-0.1.1/egg-dist-tmp-jpw672f1
error: [Errno 2] No such file or directory: 'requirements.txt'

The installation succeeds if scipy and numpy are already installed in the environment.
I think we need to include requirements.txt in MANIFEST.in.

Add Cython, Windows and more Python versions to CI.

I was considering adding coverage (codecov.io) and Windows build (AppVeyor) badges to the README.

I was also considering testing more Python versions (at least 2.7, 3.5, and 3.latest).

I would also prefer we prep the CI for Cython as well. I'm not particularly experienced at this, so I'm raising an issue. The cross-platform testing will become important once we have Cython code, as it may be OS-dependent and/or fail to compile in some cases. I already ran into a rare NumPy bug with bit-shifts once that was macOS-specific.

For Cython:

  • Would we need compiler packages or is adding Cython to conda enough?

I think we should test different Python versions on Linux only, and Windows tests should be superficial, just to catch compilation and/or rare compatibility bugs.

cc: @mrocklin

Fast conversion to scipy.sparse.csr/csc_matrix

In computations involving tensordot we can be bound by routines to convert from sparse.COO to scipy.sparse.coo_matrix (this should be free) and to convert from scipy.sparse.coo_matrix to scipy.sparse.csc_matrix. I do not believe that these are currently running at optimal speed.

I'm currently getting around some of this by caching the csr matrix on the sparse.COO matrix, but this is a bit kludgy and doesn't help if we're doing reshapings, transposes, etc.

Instead, given the importance of this operation, it might be worth defining a single-step computation from a 2D sparse.COO matrix without duplicates to scipy.sparse.csr/csc matrices.
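
A hedged sketch of what such a one-step conversion might look like, assuming the entries are duplicate-free and already sorted by row (illustrative only):

import numpy as np
import scipy.sparse

def coo_to_csr(coords, data, shape):
    rows, cols = coords
    # Build indptr directly from per-row counts instead of going through
    # scipy.sparse.coo_matrix first.
    indptr = np.zeros(shape[0] + 1, dtype=np.int64)
    np.cumsum(np.bincount(rows, minlength=shape[0]), out=indptr[1:])
    return scipy.sparse.csr_matrix((data, cols, indptr), shape=shape)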

Default axis doesn't match that of numpy.ufunc.reduce

The default axis doesn't match that of numpy.ufunc.reduce. There, it is (0,) but we use None.

Minimal example:

>>> import numpy as np
>>> import sparse
>>> x = np.eye(5)
>>> s = sparse.COO.from_numpy(x)
>>> np.add.reduce(s)
5.0
>>> np.add.reduce(x)
array([1., 1., 1., 1., 1.])
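
A hedged sketch of one way the ufunc dispatch shown earlier could match the NumPy default (illustrative; not the library's actual code):

def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
    if method == 'reduce':
        # numpy.ufunc.reduce defaults to axis 0, not axis=None.
        kwargs.setdefault('axis', 0)
        return COO._reduce(ufunc, *inputs, **kwargs)
    ...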
