
Comments (28)

scandum commented on May 31, 2024
  1. More sophisticated benchmarks are probably needed.
  2. Adding random pivots might prevent a beneficial switch to quadsort.
  3. I haven't been able to improve performance with a custom call stack previously, I'm reluctant to clutter the code if there's no performance gain.
  4. Detecting ordered distributions during median selection could be useful.
  5. I updated https://github.com/scandum/wolfsort yesterday to default to fluxsort, pretty good gain on 32 bit integers.

Some elegant way to sort random generic data (like 16 values) would be useful, pdqsort does this rather well, though I'm not sure if there's an added cost for other distributions.

That's a clever improvement, I'll do some testing and see about incorporating it.

mlochbaum commented on May 31, 2024

I was able to adapt pdqsort's method for handling many equal values, which brings the performance on the "generic" case roughly in line with pdqsort. Pushed to https://github.com/mlochbaum/fluxsort . There doesn't seem to be any significant performance impact in other tested cases, but I found that r_array[cnt] = (unsigned)(cnt*cnt) % 10000 gets much slower, so I need to figure out why that is.

The strategy is the same but reversed left to right to account for the different recursion order. So, before partitioning it checks to see if the pivot is equal to the element after the current slice. The only extra step is to copy that element-after to the swap space before recursing into it. It might seem like the fact that pdqsort can put the pivot in the middle right after partitioning and fluxsort can't would make a difference, but it doesn't because the right partition is sorted (so it starts with the pivot) before visiting the left.
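
Concretely, here's the trick in toy form on plain ints (a hedged sketch, not the code I pushed; it uses the carry-the-previous-pivot formulation that comes up later in this thread, and toy_flux, the middle-element pivot, and the per-call allocation are all illustrative):

#include <stdlib.h>
#include <string.h>

/* Toy sketch of the equal-pivot trick: "bound" is the upper bound
   inherited from the parent pivot, or NULL at the top level, so the
   first pivot is never checked. Error handling and real pivot
   selection are omitted. */
static void toy_flux(int *a, size_t n, const int *bound)
{
        int piv, *swap;
        size_t i, l = 0, r = 0;

        if (n < 2)
        {
                return;
        }
        piv = a[n / 2]; /* stand-in for real pivot selection */
        swap = malloc(n * sizeof(int));

        if (bound && piv >= *bound) /* pivot equals the slice maximum */
        {
                /* strict partition: every pivot-equal element falls into the
                   right block, a run of equal keys needing no recursion */
                for (i = 0 ; i < n ; i++)
                {
                        if (a[i] < piv) a[l++] = a[i]; else swap[r++] = a[i];
                }
                memcpy(a + l, swap, r * sizeof(int));
                free(swap);
                toy_flux(a, l, &piv);
                return;
        }

        for (i = 0 ; i < n ; i++) /* stable two-way partition */
        {
                if (a[i] <= piv) a[l++] = a[i]; else swap[r++] = a[i];
        }
        memcpy(a + l, swap, r * sizeof(int));
        free(swap);

        toy_flux(a + l, r, bound); /* sort the right side first */
        toy_flux(a, l, &piv);      /* the left side is now bounded by piv */
}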

How much do you care about code style here? I've copy-pasted the partitioning code to match the existing style, but I'd find a version that uses macros for the loop unrolling and two copies easier to read.

scandum commented on May 31, 2024

As far as I can tell the flux_max function isn't stable?

Moving the maximum element to the end might be possible to accomplish in n comparisons and ~ log n swaps on randomized data. I'm having a hard time wrapping my head around it.

As for code style, I can be a bit particular, but I'm open to suggestions.

On my end, I've experimented with a pseudomedian of 25 that detects sorted partitions. It improves sorting of generic data slightly, but it's barely worth the trouble.

scandum commented on May 31, 2024

On second thought, I think the bubble sort would be the right way to go. n comps and n swaps.

mlochbaum commented on May 31, 2024

No, I guess it isn't. I'd been thinking about ensuring that the maximum element doesn't get swapped with anything less than it, but the element at the end winds up out of place. It's not really necessary because you can track whether the current slice comes at the end of the array or not, it's just that you have to add cases when it's used. But this condition starts false and becomes true after the first base case, so it's not that complicated.

Finding the maximum is really fast with SIMD instructions (on appropriate types of course). Anything more complicated is probably too much overhead.

scandum commented on May 31, 2024

I reckon there will be a variety of solutions. I'll try to wrap my head around this partitioning logic in the next few days and get back to you.

mlochbaum commented on May 31, 2024

Remark on pseudomedians: the order of the candidates does matter, and I think both pdqsort and flux are too systematic. I haven't analyzed the 15-candidate version, but the issue with 3-by-3 is that when combining values, two on the same side always outweigh a third. If values in the middle are higher and those at the edges are lower, then Fluxsort ends up with two low groups and one high group. It will then pick the median from one of the low groups even though the true median is probably the highest value from a low group. pdqsort spreads out the groups but hits the same problem in the first round of medians, so every group ends up with a low median.

A scheme like 125/048/367 should be better. I also wonder if it might be preferable to just take the true median of 5 or 7 instead (I don't know how hard that is). Fluxsort has a lot of overhead coming from median calculations, and it seems to me it should be possible to reduce it.

scandum commented on May 31, 2024

3x3 is as good at avoiding the worst case partition as 1x7 with fewer comparisons.

1x7 requires 6+5+4+3+2+1 = 21 comparisons worst case while 3x3 is at 3+3+3+3 = 12.

I guess a good baseline is to try to beat std::stable_sort for any distribution. This might be possible by adding run detection to the analyzer, since there's a tipping point where branch prediction outperforms branchless.

mlochbaum commented on May 31, 2024

I was just wondering if it makes sense to search for an initial run at the beginning of flux_partition every time. Best case is a complete run of course. If it's a very large run you probably just switch to Quadsort right away. If there's nothing then you quit after 3-5 elements probably, and if there's a small run then at least you can partition it quickly with a binary search and the work isn't wasted. My worry is that I don't know how common perfect runs are relative to imperfect ones, especially after partitioning (but at least flux isn't destroying runs like unstable methods do). Timsort prioritizes perfect runs heavily, but I don't think it would be obvious if this was a bad idea all along.
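
Roughly the probe I have in mind (a hypothetical helper, nothing in flux yet):

/* sketch: length of the leading non-descending run, capped so
   a failed probe stays cheap */
static size_t probe_run(int *a, size_t n, size_t cap)
{
        size_t i = 1;

        if (cap < n)
        {
                n = cap;
        }
        while (i < n && a[i - 1] <= a[i])
        {
                i++;
        }
        return i; /* caller decides: switch to quadsort, binary insert, or nothing */
}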

SIMD is also relevant, since a run search can easily check multiple elements at a time. But it doesn't work in plain C because the compiler's not allowed to access elements after the stopping point.

scandum commented on May 31, 2024

It's possible to perform a branchless run analysis at a ~3% performance cost for random, useful for detecting pipe organ distributions.

scandum commented on May 31, 2024

I'm getting good performance with a branchless bubble sort. I typically use ./a.out 10000 100 1 1 for accurate readings, though I had to drop down to ./a.out 1000 100 10 1 with this code. Ought to be stable.

void FUNC(flux_max)(VAR *array, size_t nmemb, CMPFUNC *cmp)
{
        VAR swap, *ptb, *pta = array;

        while (--nmemb)
        {
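                /* branchless: ptb points at pta + 1 when the pair is out of
                   order, otherwise at pta, making the swap below a no-op */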
                ptb = &pta[cmp(pta, pta + 1) > 0];

                swap = *pta; *pta++ = *ptb; *ptb = swap;
        }
}

Edit: Previous code was going out of bounds, probably why the readings were odd. Not ideal performance, ~3% slower.

scandum commented on May 31, 2024

I took a closer look at the actual partitioning logic, and I think the analyzer would take care of most cases where flux_max would be beneficial? Another concern is that it takes an extra n comparisons, which isn't a big deal, but it might be to some people.

As far as I can tell most of the performance gain can be obtained from the if (s_size == 0) check. I'm leaning towards a slightly different approach, I'll attach what I currently have.

One of the things I'm trying to do is to minimize stack allocation during recursion. In my current setup that does result in some unnecessary pivot calls, but it seems to slightly improve performance on random without a notable impact on generic.

fluxsort.txt

scandum commented on May 31, 2024

This seems to handle the cnt * cnt % 10000 distribution for large arrays. Perhaps a bit heavy-handed. It's still not quite on par with pdqsort for 10M+ but it gets pretty close.

It should be possible to keep track of the previous pivot with a maximum_of_twentyfive routine and pass it along recursively; I'm looking into that next.

void FUNC(fluxsort)(void *array, size_t nmemb, CMPFUNC *cmp);

VAR FUNC(median_of_hundredtwentyeight)(VAR *array, size_t nmemb, CMPFUNC *cmp)
{
        VAR *pta, swap[128];
        size_t cnt, div = nmemb / 128;

        pta = array;

        for (cnt = 0 ; cnt < 128 ; cnt++)
        {
                swap[cnt] = pta[0];

                pta += div;
        }
        FUNC(fluxsort)(swap, 128, cmp);

        return swap[64];
}
VAR FUNC(median_of_twentyfive)(VAR *array, size_t nmemb, CMPFUNC *cmp)
{
        size_t v0, v1, v2, v3, v4, div = nmemb / 64;

        v0 = FUNC(median_of_five)(array, div *  4, div *  1, div *  2, div *  8, div * 10, cmp);
        v1 = FUNC(median_of_five)(array, div * 16, div * 12, div * 14, div * 18, div * 20, cmp);
        v2 = FUNC(median_of_five)(array, div * 32, div * 24, div * 30, div * 34, div * 38, cmp);
        v3 = FUNC(median_of_five)(array, div * 48, div * 42, div * 44, div * 50, div * 52, cmp);
        v4 = FUNC(median_of_five)(array, div * 60, div * 54, div * 56, div * 62, div * 63, cmp);

        return array[FUNC(median_of_five)(array, v2, v0, v1, v3, v4, cmp)];
}
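
                /* pivot selection as wired into the partition routine: */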
                if (nmemb > 1024)
                {
                        if (nmemb > 65536)
                        {
                                piv = FUNC(median_of_hundredtwentyeight)(ptx, nmemb, cmp);
                        }
                        else
                        {
                                piv = FUNC(median_of_twentyfive)(ptx, nmemb, cmp);
                        }
                }
                else
                {
                        piv = FUNC(median_of_nine)(ptx, nmemb, cmp);
                }

mlochbaum commented on May 31, 2024

I think that strategy is entirely reasonable, actually, and it doesn't even add much to the code size. It should be able to do the median in already-allocated memory (in extreme cases swap won't have enough space but array will). And the number of candidates can vary. I'll see if I can do some statistics and figure out how the number of candidates sorted relates to the ratio of sizes.

I still think switching to quadsort on a single bad partition isn't really viable. It should switch when the partition size is too large relative to the depth, for a cumulative measure. Statistics also needed here.

mlochbaum commented on May 31, 2024

From this answer, medians follow a beta distribution: the probability that a median of m is found p of the way along the actual distribution is proportional to p^(m/2) * (1-p)^(m/2). And assuming the cost to sort is n log(n), the cost after partitioning is n log(n) plus n * (p log(p) + (1-p) log(1-p)). So given m, it's easy to numerically compute the coefficient k so that the expected cost is n*k. Then the value n where that cost is effective is m log(m) / k. But that relies on the cost of sorting having a constant coefficient; it doesn't, so this is an underestimate.
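
As a sanity check, the entropy term can be estimated numerically from the beta-distributed median (a throwaway sketch, not part of flux; rand() is only used for brevity):

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

static int cmp_double(const void *a, const void *b)
{
        double x = *(const double *) a, y = *(const double *) b;

        return (x > y) - (x < y);
}

int main(void)
{
        enum { M = 9, TRIALS = 100000 };
        double sample[M], p, sum = 0;
        int t, i;

        srand(1);
        for (t = 0 ; t < TRIALS ; t++)
        {
                for (i = 0 ; i < M ; i++)
                {
                        sample[i] = rand() / (RAND_MAX + 1.0);
                }
                qsort(sample, M, sizeof(double), cmp_double);
                p = sample[M / 2]; /* beta-distributed median of M uniform samples */
                sum += -(p * log2(p) + (1 - p) * log2(1 - p));
        }
        printf("m = %d: expected split entropy = %f bits\n", M, sum / TRIALS);
        return 0;
}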

I computed this table of values, where the first column is n and the second is m. It tracks √n pretty closely, and the third column shows sqrt(n/(1+log_4(n))), which is a very close estimate. But the value m is an overestimate, and using a median half as large (corresponding to small-array sorting being four times as expensive) probably makes more sense. This suggests it's best to switch from median-of-15 at about 8000 elements. At this point the cost of computing the median size with count-leading-zeros and square root hardly matters.
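
That estimate is cheap enough to compute at runtime; roughly (a sketch assuming GCC/Clang's __builtin_clzll and n > 1):

#include <math.h>

/* sketch: candidate count from the estimate above, using a leading-zero
   count for log2(n) and log4(n) = log2(n) / 2, halved as suggested */
static size_t median_size(size_t n)
{
        size_t log2n = 63 - __builtin_clzll((unsigned long long) n);

        return (size_t) (sqrt((double) n / (1.0 + log2n / 2.0)) / 2.0);
}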

If you're saving the sorted candidates between iterations then the math works out differently. You could either select more candidates at the start, or merge in candidates to get to the right amount at a lower cost (increasing the size by a factor of √2 if √n is still accurate).

scandum commented on May 31, 2024

That lines up pretty closely with my own pivot choices based on benchmarks. I might be able to write a dynamic pivot selection routine.

scandum commented on May 31, 2024

I ran some calculations and it seems like the cubic root of the partition size might be a good value for the pseudomedian.

I'm pretty happy with some of the changes I made so I'm thinking of pushing out a new version. How would you like me to credit you for your contributions? I've incorporated the loop unrolling and s_size == 0 optimization.

mlochbaum commented on May 31, 2024

My branch uses the sqrt(n/(1+log_4(n))) / 2 value (but it computes the gap size, which works out to sqrt(2*n*(2+log_2(n)))). Decreasing the number of pivots definitely increases the time on shuffled data, so I think this is better than a cube root version.

I also added a pseudorandom value to each pivot candidate. Running a full PRNG step for each pivot adds too much overhead, so I split Marsaglia's Xorshift into three steps and do only one for each candidate. The adjacent candidates won't be independent, but the overall selection will end up with plenty of randomness. I'm thinking about how to apply something similar to the smaller pivot computations as well. As far as I can tell the only reason not to use randomized candidates is performance: the goal is to avoid patterns in pivot selection, and PRNGs are designed specifically for this purpose.
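
The split looks something like this (a sketch of the idea rather than my exact code; flux_rand and xorshift_stage are illustrative names, and 13/17/5 are Marsaglia's xorshift32 shift constants):

/* sketch: one xorshift32 stage per candidate instead of a full PRNG step */
static unsigned int flux_rand = 2463534242u; /* any nonzero seed */

static unsigned int xorshift_stage(unsigned int phase)
{
        switch (phase % 3)
        {
                case 0: flux_rand ^= flux_rand << 13; break;
                case 1: flux_rand ^= flux_rand >> 17; break;
                case 2: flux_rand ^= flux_rand << 5; break;
        }
        return flux_rand; /* partially mixed; fine when pooled over many candidates */
}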

I'm not really concerned about the attribution (there's a record of what happened here, anyway). If you've copied any actual code instead of just ideas then you could add me to the license text.

mlochbaum commented on May 31, 2024

I've also been looking at insertion sort. Replacing quadsort with insertion sort directly is slightly slower, but removing all the quadsort calls and running unguarded insertion sort on the entire array at the end is the same speed for random data or even slightly faster, and 2-3% faster after increasing FLUX_OUT to 32. It still needs a fallback in the case of bad pivot selection. If that exists, then at the end all partitions are either sorted or smaller than FLUX_OUT, and it's enough to run a guarded insertion sort on the first FLUX_OUT entries and an unguarded insertion sort on the rest. For testing I didn't do that; I didn't guard anything and it happened to work. I'll see if I can get a correct version.
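
Roughly what I mean, on plain ints (a sketch, not the tested code; finish_insertion is an illustrative name, FLUX_OUT >= 1 is assumed, and the unguarded inner loop leans on the guarantee described above that every later element has a smaller-or-equal element within reach to its left):

static void finish_insertion(int *a, size_t n, size_t flux_out)
{
        size_t i, j;
        int key, *p;

        for (i = 1 ; i < n && i < flux_out ; i++) /* guarded prefix */
        {
                key = a[i];
                for (j = i ; j > 0 && a[j - 1] > key ; j--)
                {
                        a[j] = a[j - 1];
                }
                a[j] = key;
        }

        for ( ; i < n ; i++) /* unguarded: no j > 0 check in the inner loop */
        {
                key = a[i];
                for (p = a + i ; p[-1] > key ; p--)
                {
                        p[0] = p[-1];
                }
                p[0] = key;
        }
}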

scandum commented on May 31, 2024

Increasing FLUX_OUT to 32 increases the comparison count by quite a bit, I'm trying to keep performance balanced for both integers and strings.

I updated the source code with the discussed changes. I think I'll create a separate repository that purely focuses on benchmarking pivot selection.

mlochbaum commented on May 31, 2024

I'm also very interested in pivot selection. I'd never have guessed that stopping at 9 (or 15 even!) candidates is measurably worse. I just added randomization to the 15-candidate version, based on this generator because I only use a few low bits and the smaller numbers scramble these faster. This randomization makes median_of_fifteen itself about 50% slower, but that comes out to around 0.1% because it's only used on large partitions.

Too much overhead to do this for 9 candidates. Maybe using a fixed random sequence is best here. The main thing is to avoid doing too badly on exactly periodic data, because that kind of thing will definitely show up. It's also worth looking at a 3-candidate version for the smallest partitions (64? 128?), since it means there's less pressure on the 9-candidate one to be as fast as possible.

The 1 in 65,536 chance you give for pseudomedian-of-9 failing is too optimistic. True, it requires 4 candidates to be in the bad range, but there are many combinations of candidates that will cause the failure so the probability is higher. The actual probability that the larger partition fails is 2 * f(f(1/17)), where f(x) = x^3 + 3*(1-x)*x^2 is the probability a median of 3 fails if each of the three has failure rate x. I multiplied by 2 since it can fail on either side, and the end result is always that most of the array is sorted by fallback. That's about 1 in 1687 (confirmed with a simulation), which is not that low considering it happens at every partition.

Median of 5 groups of 3 gives 1 in 51161, which is probably fine, and slightly better than 3 of 5.
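
Spelled out in code, the check is:

#include <stdio.h>

static double f3(double x) /* median-of-3 failure rate given candidate rate x */
{
        return x * x * x + 3 * (1 - x) * x * x;
}

static double f5(double x) /* median-of-5: at least 3 of 5 candidates fail */
{
        double y = 1 - x;

        return 10 * x * x * x * y * y + 5 * x * x * x * x * y + x * x * x * x * x;
}

int main(void)
{
        double x = 1.0 / 17;

        printf("3 groups of 3: 1 in %.0f\n", 1 / (2 * f3(f3(x)))); /* ~1687 */
        printf("5 groups of 3: 1 in %.0f\n", 1 / (2 * f5(f3(x)))); /* ~51161 */
        return 0;
}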

scandum commented on May 31, 2024

I'm a bit hesitant about randomization since it might make it difficult to pinpoint bad patterns.

I'll run some tests to see for myself whether the failure rate is 1 in 1687 for the pseudomedian of 9; I've wondered if I was too optimistic thinking it would be 0.0625 ^ 4.

I've switched to the pseudomedian of 25 for partitions above 1024 which seems to do fairly well.

I'm doing some testing with larger pseudomedians and what I've dubbed quasimedians (3x3x3, 5x5x5, etc) as well.

Name         Items  Type      Best   Average  Loops  Samples  Median    Comps  Distribution
1 median     10000    32  0.000079  0.000104  10000       10    49.9      0.0  random
3 median     10000    32  0.000317  0.000374  10000       10    37.5      2.7  random
9 pseudo     10000    32  0.001491  0.001523  10000       10    26.5     10.7  random
11 median    10000    32  0.002646  0.002720  10000       10    22.6     34.0  random
15 pseudo    10000    32  0.002788  0.002869  10000       10    21.8     21.3  random
27 quasi     10000    32  0.004295  0.004416  10000       10    18.9     34.6  random
25 pseudo    10000    32  0.005524  0.005639  10000       10    17.6     48.1  random
49 pseudo    10000    32  0.009105  0.009657  10000       10    15.6     89.1  random
81 pseudo    10000    32  0.014094  0.014208  10000       10    12.0    180.4  random
125 quasi    10000    32  0.021470  0.021789  10000       10     9.6    248.0  random
121 pseudo   10000    32  0.020349  0.020645  10000       10     9.7    317.0  random
169 pseudo   10000    32  0.028128  0.028538  10000       10     8.2    511.1  random
225 pseudo   10000    32  0.036318  0.037130  10000       10     6.9    769.6  random
128 median   10000    32  0.047244  0.047437  10000       10     7.0    802.4  random
289 pseudo   10000    32  0.044491  0.045037  10000       10     6.1   1105.9  random
361 pseudo   10000    32  0.052802  0.053114  10000       10     5.5   1524.7  random
256 median   10000    32  0.101282  0.101520  10000       10     5.0   1856.7  random
441 pseudo   10000    32  0.062605  0.063302  10000       10     5.0   2035.7  random
529 pseudo   10000    32  0.073239  0.073720  10000       10     4.5   2661.3  random
1681 pseudo  10000    32  0.277746  0.279358  10000       10     2.5  14514.7  random

mlochbaum commented on May 31, 2024

Nice! I computed tables of failure probability for median combinations. They're shown as negative log base two of the probability, so 20 is roughly one in a million and so on. This gives some indication of the power of the combination. 5-of-5 is 22.89, slightly stronger than 3-of-3-of-3 at 20.86. And 7 is weaker than 3-of-3. The failure probability in the first line can be changed but it looks like these relationships stay stable.

scandum commented on May 31, 2024

With brute force testing I come out at 1 in 1333.33 for the pseudomedian of 9.

In my own tests I haven't been able to beat pseudomedian of 9 with the median of 3 for small partitions.

Name      Items      Best   Average    Loops  Samp  Median  Failure  Comps  Distribution
1 median     64  0.013947  0.013947  1000000     1  2500.3   125132    0.0  random
3 median     64  0.018938  0.018938  1000000     1  1874.2    22598    2.7  random
9 pseudo     64  0.038853  0.038853  1000000     1  1335.6      759   10.7  random
9 sloppy     64  0.063251  0.063251  1000000     1  1252.9      256   25.2  random
9 median     64  0.096236  0.096236  1000000     1  1228.6      187   22.6  random

9 median returns the exact center value; 9 sloppy aims for the center as well, but the value it returns is off by one position in 66% of cases.

mlochbaum commented on May 31, 2024

I don't get the concern about randomization breaking detection of bad patterns. The only thing that can slow down a branchless quicksort is bad pivot selection, right? But what qualifies as bad depends entirely on what pivot selection algorithm is used. PRNG test suites basically stress-test them against as many patterns as possible, so adding randomness in a suitable way should mean the only real-world data that has any chance of breaking them is malicious input.

For detecting patterns where fluxsort would be better, the variable-length selection seems ideal since you can do whatever statistics you want on the sample. My method just adds relatively small random offsets to evenly-spaced values, so they're still in order and approximately evenly spaced. I think this is more robust for pattern detection because with exact even spacing you might hit a part of the array that isn't representative (maybe if values come from a table with rows of 5 and div happens to be a multiple of 5).

flux_analyze could also look at larger structure. If pairs of elements in adjacent blocks of size 2^n are mostly ascending or mostly descending, then that should make the merge pass on those blocks run quickly. Searching for this would make it possible to detect when the data is basically ordered but elements are randomly a few positions off, since after the first few merge passes the rest will be easy. But it would need to be tested by sampling to avoid too much overhead, I'd expect.

scandum commented on May 31, 2024

I guess pta = array + rand() % div; would do the trick in median_of_sqrt. I'd still rather avoid it without known killer patterns.

I've looked into run detection in flux_analyze, but doing it without a ~3% performance loss on random is tricky. I'll probably give it another shot at some point.

I also wrote a dynamic pseudomedian selector. I get the impression though that on large partitions cache pollution outweighs the cpu advantage, so getting the median of 128 or 256 with fluxsort seems to perform better. I'll post the code in case you want to play around with it.

VAR FUNC(flux_median)(VAR *array, size_t nmemb, size_t base, CMPFUNC *cmp)
{
        VAR swap[base];
        size_t div = nmemb / base;
        unsigned char x, y, mid = base / 2;
        unsigned char val, t[base];

        for (x = 0 ; x < base ; x++)
        {
                t[x] = 0;
                swap[x] = array[0]; array += div;
        }

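        /* count how many of the other candidates each candidate exceeds;
           the candidate that exceeds exactly base / 2 of them is the median */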
        for (x = 0 ; x < base - 1 ; x++)
        {
                for (y = x + 1 ; y < base ; y++)
                {
                        val = cmp(swap + x, swap + y) > 0; t[x] += val; t[y] += !val;
                }
                if (t[x] == mid) return swap[x];
        }
        return swap[base - 1];
}

VAR FUNC(flux_pseudomedian)(VAR *array, size_t nmemb, size_t base, CMPFUNC *cmp)
{
        VAR swap[base];
        size_t div = nmemb / base;
        unsigned char cnt;

        for (cnt = 0 ; cnt < base ; cnt++)
        {
                swap[cnt] = FUNC(flux_median)(array, div, base, cmp); array += div;
        }
        return FUNC(flux_median)(swap, base, base, cmp);
}

VAR FUNC(flux_quasimedian)(VAR *array, size_t nmemb, size_t base, CMPFUNC *cmp)
{
        VAR swap[base];
        size_t div = nmemb / base;
        unsigned char cnt;

        for (cnt = 0 ; cnt < base ; cnt++)
        {
                swap[cnt] = FUNC(flux_pseudomedian)(array, div, base, cmp); array += div;
        }
        return FUNC(flux_median)(swap, base, base, cmp);
}

Example: piv = FUNC(flux_quasimedian)(ptx, nmemb, 5, cmp); to get the pseudomedian of 125.

scandum commented on May 31, 2024

Some small updates.

  1. Added crude offset randomization to FUNC(median_of_sqrt), not sure if finer control is beneficial.
  2. Added run detection to FUNC(flux_analyze), should detect most pipe-organ distributions, performance loss is minor.
  3. Small change to FUNC(tail_swap) that seems to improve performance slightly.
  4. Added storing of the pivot as previously discussed. The first pivot is ignored rather than calculating the maximum element.

Still need to look at a few other things, like your suggestion to keep partitioning the leftover of a reverse partition.

scandum commented on May 31, 2024

Some more updates.

  1. Added branchless parity merges to the small array sorting routine of quadsort. Increases overall performance by about 15%. (A sketch of the idea follows this list.)
  2. The reverse partitioner keeps sorting the leftover with flux instead of switching to quadsort.
  3. Fluxsort defaults to quadsort if memory allocation fails, and quadsort switches to in-place rotations if memory allocation fails once more.
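
A minimal sketch of the parity merge idea on plain ints (not the actual quadsort code, which works on VAR with cmp): two sorted runs of equal length n, stored back to back, are merged by one loop from the head (emitting the smaller element) and one from the tail (emitting the larger), each taking exactly n branchless steps, so neither loop needs a bounds check. Assumes n >= 1; pointers may step just past a run but are never dereferenced there.

#include <stddef.h>

static void parity_merge_sketch(int *dest, const int *from, size_t n)
{
        const int *l = from, *r = from + n;           /* heads of the two runs */
        const int *tl = from + n - 1, *tr = from + 2 * n - 1; /* tails */
        int *d = dest, *td = dest + 2 * n - 1;
        size_t i;
        int x;

        for (i = 0 ; i < n ; i++)
        {
                x = *l <= *r; *d++ = x ? *l : *r; l += x; r += !x;

                x = *tl <= *tr; *td-- = x ? *tr : *tl; tr -= x; tl -= !x;
        }
}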

I think that's about all I'll do with fluxsort for now.
