scandum / fluxsort Goto Github PK

View Code? Open in Web Editor NEW

661.0 661.0 23.0 17.17 MB

A fast branchless stable quicksort / mergesort hybrid that is highly adaptive.

License: The Unlicense

C 100.00%

mergesort quicksort sorting stable

fluxsort's People

Contributors

Stargazers

Watchers

fluxsort's Issues

"Long double" is not an 128bit integer

iirc, long double is a weird 80-bit floating point number that x86 apparently supports
Depending on the compiled platform (32bit x86 or x86_64) the 80 bits get rounded up to 12 or 16 bytes
https://github.com/scandum/fluxsort/blob/main/src/fluxsort.h#L177

fluxsort comparison behavior

Hi, I've recently been doing some research on sort implementations and noticed that in my tests that fluxsort is not stable. The test and benchmark suite is written in Rust, the following test fails for fluxsort but passes for other stable sorts such as Rust slice::sort and C++ std::stable_sort but not for fluxsort. You can find the test code here https://github.com/Voultapher/sort-research-rs/blob/2642b98dc60cb6c9a6dbdec706a71ebdfbdf89e1/tests/main.rs#L219. In essence it packs two i32 into one u64 and extracts in the comparison function the two i32 again and only uses the first one as comparison and then later checks that the order of the ascending second elements remains in order.

Also on a sidenote, I'm seeing rather different benchmark results, where fluxsort pretty much never beats pdqsort.

Collaborating?

First off, sorry again to have jumped to conclusions on Hacker News. As a programming language guy I'm in the market for a stable sort and Fluxsort seems to have the fundamentals down!

I wanted to know if you're interested in incorporating some or all of what I do into this repository. I'd like to help improve Fluxsort's worst cases and pattern recognition with techniques like pdqsort uses, and expand the interface to support other operations. Some possible topics, which I can also split into individual issues:

Add more sophisticated benchmarks with subtle patterns
Candidate randomization to avoid poor pivot selection for patterned as well as random data
Move the call stack into the memory buffer to avoid stack overflow and improve speed
Recognize signs of mostly-sorted inputs during pivot selection
Specialized integer code; maybe SIMD instructions, non-stable optimizations, or hybridization with counting sort

(pdqsort depends on instability a lot, but I think there are usually ways to avoid this)

As evidence that I have some clue what I'm talking about, here's a small improvement: the following inner loop for flux_partition gives about a 3% speedup in my benchmarks on random 1e6-element 4-byte int arrays.

size_t n=0;
val = cmp(ptx + 0, &piv) <= 0; pta[n] = ptx[0]; pts[0-n] = ptx[0]; n += val;
// etc.

pta += n; pts += 8-n;
ptx += 8;

Suboptimal code-gen in the fundamental branchless-swap building block

The fundamental branchless swap_if code produces suboptimal code on x86-64. I ported it to Rust and noticed that changing it yielded a 50% performance uplift for that function on Zen3, this will of course depend on the the hardware, but cmov seems to yield better results than setl/setg style code that is currently being produced. Probably helped by doing 8 instead of 10 instructions.

Here is the current version:

And here is the version that produces cmov code:

C https://godbolt.org/z/GrTvx1z8x (WIP, only good code gen for clang LLVM)
Rust https://godbolt.org/z/9qnfY6h3v

I think if you can find a way to reliably produce cmov instructions like LLVM does, you should see a noticeable speed improvement.

Improve benchmark

Time is a lousy benchmark, especially with a low numbers of elements running on an OS. There a lot of way of profiling but I would suggest using an LLVM based profiling solution like this one because it measures the number of intermediate language instructions executed. This avoids compiler/microcode optimizations and gives you the most representative output for all platforms.

scandum / fluxsort Goto Github PK

fluxsort's People

Contributors

Stargazers

Watchers

Forkers

fluxsort's Issues

"Long double" is not an 128bit integer

fluxsort comparison behavior

Collaborating?

Suboptimal code-gen in the fundamental branchless-swap building block

Improve benchmark

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent