
Comments (48)


mrocklin avatar mrocklin commented on June 6, 2024 1

I wrote up some thoughts on the use of Numba here: http://matthewrocklin.com/blog/work/2018/01/30/the-case-for-numba

from sparse.

mrocklin avatar mrocklin commented on June 6, 2024 1

If we need to use C-level dynamic data structures like std::vector, then my recommendation would be to write the algorithms in straight C and hook them up with Cython, rather than writing this sort of logic in Cython itself.

We might also look into stealing logic from Pandas, which I suspect has already had to solve a similar problem.


shoyer avatar shoyer commented on June 6, 2024 1

If we need to use C-level dynamic data structures like std::vector then my recommendation would be to just write algorithms in straight C

By "straight C" I assume you mean "straight C++"?

Anyways, I agree that straight C++ has its virtues if you want to use C++ data structures.



hameerabbasi avatar hameerabbasi commented on June 6, 2024

The problem I often have with Numba is its lack of support for many features and the fact that it's limited to a subset of Numpy for fast performance. It does have the advantages of automatic optimization and pure Python, though.

On the other hand, Cython is a superset of Python and can be hand optimized. Of course, I don't have personal experience with either of these, just going off a few blog posts that I read. I'll leave the decision up to you.


nils-werner avatar nils-werner commented on June 6, 2024

Just a warning: be aware of the way Cython handles the installation and build step (see cython#477). It requires you to import Cython and put code in your setup.py, which creates a chicken-and-egg problem during installation.

I am not saying I am against using cython, as it's probably the best solution. I would just like to see us find a tidy solution that doesn't require weird hacks because of issues we didn't see coming when we started working on this.


mrocklin avatar mrocklin commented on June 6, 2024

@sklam do you happen to know of (or want to produce) a fast radix sort? It would be interesting to see what a numba solution would look like here and how much added hassle it would generate.


hameerabbasi avatar hameerabbasi commented on June 6, 2024

There are a few implementations on this SO question, particularly this answer. However, these will need to be modified as what we really need is an argsort, rather than a sort.
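To make the sort/argsort distinction concrete, here is a minimal NumPy sketch (hypothetical and untuned; the function name is made up) of an LSD radix argsort built from repeated stable passes, returning the permutation rather than the sorted keys:

```python
import numpy as np

def radix_argsort(keys, digit_bits=8):
    # Hypothetical sketch: LSD radix argsort. Instead of moving the keys,
    # we permute an index array, one digit per pass, least significant first.
    keys = np.asarray(keys, dtype=np.uint64)
    order = np.arange(len(keys))
    mask = np.uint64((1 << digit_bits) - 1)
    for shift in range(0, 64, digit_bits):
        digits = (keys[order] >> np.uint64(shift)) & mask
        # the per-pass sort must be stable for LSD radix sort to be correct
        order = order[np.argsort(digits, kind="stable")]
    return order
```

Each pass leans on `np.argsort(kind="stable")` as a stand-in for a real counting pass, so this is a correctness sketch rather than a fast implementation.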


sklam avatar sklam commented on June 6, 2024

@mrocklin , the only radixsort I used is a GPU implementation from CUB. Internally, numba uses quicksort for both np.sort and np.argsort.


hameerabbasi avatar hameerabbasi commented on June 6, 2024

It seems we need hacks (as @nils-werner put it) to actually distribute Cython code without requiring Cython as a dependency. See this link for some methods.

Edit: It seems you need the Numba package installed to even use Numba code, I'd love to be proven wrong though.


sklam avatar sklam commented on June 6, 2024

@mrocklin, the difficulty of writing radix sort in numba will depend on the kinds of keys you need. The algorithm itself is simple enough, and numba can do bitwise operations very easily. If your keys are representable as fixed-width machine integer types (int32, int64, and no bigger), it is very doable. If the keys are bytearrays, it gets a lot harder.

And yes, I have an interest in writing fast sort algorithms in numba, but I don't think I have the bandwidth in the near term. Btw, what's the size of the dataset you're sorting?

@hameerabbasi , the ahead-of-time compilation feature in numba is incomplete (only support a very small and static subset of the language). But, we do have plans to improve that front.



hameerabbasi avatar hameerabbasi commented on June 6, 2024

It could be uint8, uint16, uint32 or uint64 though.


sklam avatar sklam commented on June 6, 2024

good, these are the easy cases.


nils-werner avatar nils-werner commented on June 6, 2024

I started playing around with it a little bit, writing one version in NumPy, one in a Python loop and one in C++. The results are not exactly amazing yet ;-)


hameerabbasi avatar hameerabbasi commented on June 6, 2024

You might want to change the base to another power of two. ;-)


nils-werner avatar nils-werner commented on June 6, 2024

np.argsort() is still 2+ times faster


hameerabbasi avatar hameerabbasi commented on June 6, 2024

We could just use the SciPy implementation in this SO answer. No need to implement ourselves, no need for any kind of dependency. We could also indirectly use it to fix #60 (which I realized yesterday).


hameerabbasi avatar hameerabbasi commented on June 6, 2024

I was also considering using std::vector and Cython. Not sure if Numba transforms lists into std::vectors.

I think the point where we're failing is we're computing segments of arrays for each modulus in the Numpy version... Each of which is (in and of itself) O(n). And in the C++ version we're allocating too much memory.


rgommers avatar rgommers commented on June 6, 2024

Cython vs. Numba (vs. Pythran, vs. ...) is a hard choice (maturity and better debugging vs. better performance and a simple decorator - plus a bunch of other considerations). The Pythran maintainer just asked about using Pythran in SciPy, for which I came up with a list of everything that needs to be evaluated: https://mail.python.org/pipermail/scipy-dev/2018-January/022334.html

It seems we need hacks (as @nils-werner put it) to actually distribute Cython code without requiring Cython as a dependency. See this link for some methods.

Distributing generated C code in sdists is not much of a hack, it has been standard practice for many libraries for a long time. At this point I'd say it has become a lot less of a necessity, because there are wheels and conda packages for all the most common platforms.
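For illustration, a common way to support both cases in a setup.py (a sketch; the module and file names here are hypothetical) is to cythonize the .pyx when Cython is available and otherwise fall back to the pre-generated .c file shipped in the sdist:

```python
# Hypothetical setup.py fragment: build from radix.pyx when Cython is
# installed, otherwise compile the pre-generated radix.c from the sdist.
from setuptools import setup, Extension

try:
    from Cython.Build import cythonize
    extensions = cythonize([Extension("radix", ["radix.pyx"])])
except ImportError:
    extensions = [Extension("radix", ["radix.c"])]

setup(name="example", ext_modules=extensions)
```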


hameerabbasi avatar hameerabbasi commented on June 6, 2024

The problem with Numba is that it doesn't accelerate dicts or lists, AFAIK. That's a deal-breaker, because we need these for the most critical parts, and allocating for the worst case makes some operations as slow as the worst case.

Another factor to consider is merging upstream with SciPy. I'm not sure how willing they'd be if we had Numba in our codebase.


mrocklin avatar mrocklin commented on June 6, 2024

Cython won't meaningfully accelerate these either. We might hope for a 2-3x speedup, but that requires significant specialization. Instead we might intentionally search for algorithms that can be done with fixed data structures, or we might go straight to C.

FWIW it's not clear to me that we should aim to merge this into the SciPy monolith. However I do think that we should aim to satisfy the same community constraints so that projects like SciPy would be willing to depend on this. The blogpost referred to above was intended to start the conversation of "well, maybe it should be ok to depend on numba today"


hameerabbasi avatar hameerabbasi commented on June 6, 2024

Actually, it's simple to wrap std::vector with Cython, so the basic purpose of a list is fulfilled. I'd be willing to go straight to C, but learning all the Python C APIs would take significant time and effort; Cython sidesteps that.


hameerabbasi avatar hameerabbasi commented on June 6, 2024

FWIW it's not clear to me that we should aim to merge this into the SciPy monolith. However I do think that we should aim to satisfy the same community constraints so that projects like SciPy would be willing to depend on this.

That was basically what I meant. :-)

Edit: To clarify, I'm assuming data structures/basic ops in here, linear algebra etc. there.


hameerabbasi avatar hameerabbasi commented on June 6, 2024

If we need to use C-level dynamic data structures like std::vector then my recommendation would be to just write algorithms in straight C and hook them up with Cython rather than writing this sort of logic in Cython itself.

That is actually something I had not considered before, and it's pretty viable without diving too deep into the Python C API. Thanks for alerting me to it.

Edit:

We might also look into stealing logic from Pandas, which I suspect has already had to solve a similar problem.

It sounds reasonable to expect that Pandas has fast radix sort/group-by functionality.


mrocklin avatar mrocklin commented on June 6, 2024

If we we need to use C-level dynamic data structures

Although ideally we would find linear-time sorting algorithms for which dynamic data structures are not necessary. I don't know this space particularly well, but I would hope that algorithms that only require fixed-size data structures exist.


hameerabbasi avatar hameerabbasi commented on June 6, 2024

Actually, matching (for broadcasting/binary operations) can get significant speedups from this as well, I'm playing with an implementation over at this branch.


mrocklin avatar mrocklin commented on June 6, 2024

good, these are the easy cases.

@sklam can you recommend an algorithm for an O(n) argsort over unsigned integers?


hameerabbasi avatar hameerabbasi commented on June 6, 2024

I would hope that algorithms that only require fixed-size data structures exist

Not just that: np.empty/np.zeros would kill speed in Numba, and I'm assuming they can't even be dynamically allocated.


hameerabbasi avatar hameerabbasi commented on June 6, 2024

But if fixed size is an issue; we can always do two passes over the data: One to count and one to actually add data. Of course, this would be slow.

@sklam can you recommend an algorithm for an O(n) argsort over unsigned integers?

The standard algorithm for this is in the title: radix sort. :-) I'm not aware of one without memory allocation or dynamic structures, though.
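For reference, the two-pass idea above — one pass to count, one to place — might look like this (a hypothetical sketch; the function name and return layout are made up):

```python
import numpy as np

def group_indices(keys, n_groups):
    # Hypothetical two-pass "count, then fill" scheme: pass 1 sizes every
    # bucket via a histogram, pass 2 writes element indices into a single
    # preallocated flat array, avoiding dynamic per-bucket lists entirely.
    counts = np.bincount(keys, minlength=n_groups)          # pass 1: count
    starts = np.concatenate(([0], np.cumsum(counts)[:-1]))  # bucket offsets
    out = np.empty(len(keys), dtype=np.int64)
    cursor = starts.copy()
    for i, k in enumerate(keys):                            # pass 2: place
        out[cursor[k]] = i
        cursor[k] += 1
    return out, starts, counts
```

The only allocations are the histogram (size `n_groups`) and the flat output (size `len(keys)`), both fixed before the placement pass begins.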


mrocklin avatar mrocklin commented on June 6, 2024

But if fixed size is an issue; we can always do two passes over the data: One to count and one to actually add data. Of course, this would be slow.

Two passes over an array can be much much faster than lots of allocations in dynamic data structures.


hameerabbasi avatar hameerabbasi commented on June 6, 2024

No harm in trying both and benchmarking. :-)

Edit: Depends on how expensive the passes are; I guess.


hameerabbasi avatar hameerabbasi commented on June 6, 2024

I will add that allocation will still be dynamic, even with multiple passes over the data. The outer loop of radix sort is where we need to allocate the std::vector or array, not at the very top.

I guess we could do counting sort but the base for that would be so high that it's practically useless.

Edit: Fixed terms.


mrocklin avatar mrocklin commented on June 6, 2024

Do algorithms for linear-time sorting exist that use only fixed-size data structures and a reasonable number of passes?


nils-werner avatar nils-werner commented on June 6, 2024

My experimental C++ implementation of radixsort uses fixed size data structures (but copies between them for each pass).
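I haven't seen the C++ code itself, but the "fixed buffers, copy each pass" scheme can be sketched in Python as two equally sized arrays that swap roles every digit pass (hypothetical and unoptimized; the name is made up):

```python
import numpy as np

def radix_sort_pingpong(keys, bits=8):
    # Sketch of the fixed-buffer scheme: src and dst are allocated once and
    # swap roles after every stable counting pass over one digit.
    src = np.asarray(keys, dtype=np.uint32).copy()
    dst = np.empty_like(src)
    mask = (1 << bits) - 1
    for shift in range(0, 32, bits):
        counts = np.zeros(1 << bits, dtype=np.int64)
        for k in src:                                # count this digit
            counts[(int(k) >> shift) & mask] += 1
        offsets = np.concatenate(([0], np.cumsum(counts)[:-1]))
        for k in src:                                # stable placement
            d = (int(k) >> shift) & mask
            dst[offsets[d]] = k
            offsets[d] += 1
        src, dst = dst, src   # ping-pong: no reallocation between passes
    return src
```

Only two length-n buffers plus one base-sized histogram exist for the whole sort, at the cost of a full copy per digit pass.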


hameerabbasi avatar hameerabbasi commented on June 6, 2024

It does, however, make them bigger than they have to be... I'm not sure how much of a performance penalty this incurs, though; I haven't really tested allocation against other operations, to be honest. I usually use garbage-collected languages like C# (which optimizes away the cost of multiple closely spaced allocations), so while I know how to do memory management, I haven't done much of it.

I'm assuming this is the reason there is a low performance for bigger bases in your benchmarks.

May be relevant: https://stackoverflow.com/questions/282926/time-complexity-of-memory-allocation


nils-werner avatar nils-werner commented on June 6, 2024

Since I'm only allocating needlessly large virtual memory, shouldn't the performance penalty be fairly small? Come to think of it, maybe it's not the malloc; I may be raising many page faults during the actual sort run...

I am not that experienced with actual memory management and optimization, and I haven't profiled it in detail, so I don't know.


hameerabbasi avatar hameerabbasi commented on June 6, 2024

That might be the reason, since you're moving between needlessly widely spaced chunks of memory. This has the advantage of a low allocation cost (take that with a grain of salt), but it does introduce a lot of space between the (actually used parts of the) chunks, which is usually bad.


hameerabbasi avatar hameerabbasi commented on June 6, 2024

We could use this, although they're in-place sorts rather than argsorts. (MIT Licensed, so we're good)

They have tuned the bases for integer sorts, and applied other tricks like histogramming.

Edit: This fork is slightly newer.


hameerabbasi avatar hameerabbasi commented on June 6, 2024

Radix sort in Numba needs list[list]. xref numba/numba#2560.



hameerabbasi avatar hameerabbasi commented on June 6, 2024

Unfortunately, in this case you would need list[list] (or list[ndarray] if we used two passes), so that's not really an option right now. We could upper-bound the lists and use a static 2-D ndarray, but the upper bound is completely unreasonable.


mrocklin avatar mrocklin commented on June 6, 2024

I haven't looked into radix sort much in the past, so please correct me if I'm saying dumb things, but some references seem to implement radix sort as just many counting sorts, each of which is computable in fixed sized data structures in two passes.

https://www.geeksforgeeks.org/radix-sort/
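That reference describes it correctly; a single such counting pass, written with the plain loops that Numba or C handle well, could be sketched like this (hypothetical; name and signature made up):

```python
import numpy as np

def counting_sort_by_digit(keys, shift, bits=8):
    # One LSD counting pass: histogram, exclusive prefix sum, stable scatter.
    # All structures are fixed-size: one base-sized count array, one output.
    base = 1 << bits
    mask = base - 1
    counts = np.zeros(base, dtype=np.int64)
    out = np.empty_like(keys)
    for k in keys:                        # pass 1: count each digit
        counts[(k >> shift) & mask] += 1
    offsets = np.zeros(base, dtype=np.int64)
    for d in range(1, base):              # exclusive prefix sum -> offsets
        offsets[d] = offsets[d - 1] + counts[d - 1]
    for k in keys:                        # pass 2: stable placement
        d = (k >> shift) & mask
        out[offsets[d]] = k
        offsets[d] += 1
    return out
```

Chaining this pass over successive `shift` values (0, 8, 16, ...) yields the full radix sort, since each pass is stable.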


hameerabbasi avatar hameerabbasi commented on June 6, 2024

Yes, that works for radix sort, but unfortunately not for radix argsort. You would need to keep an array of indices corresponding to each modulus. In the worst case, that can be len(arr) long for each possible modulus, so len(arr) * base is the total size for fixed data structures. To give an idea of how big the base is, usort uses a base of 2 ** 10 or 2 ** 11 for int64.


hameerabbasi avatar hameerabbasi commented on June 6, 2024

I'm in the process of adding radix sort to NumPy. You can see the PR: numpy/numpy#12586


mrocklin avatar mrocklin commented on June 6, 2024

I'm glad to see it! Hooray for pushing work upstream :) Thanks for the link, I enjoyed reading through the comments.


hameerabbasi avatar hameerabbasi commented on June 6, 2024

Closing this since radix sort is in NumPy and indexing is implemented.

