Comments (16)
It sounds faster to use pcmpestrm, which with the right flags does the heavy lifting for us, though dealing with chars that cross boundaries will be a chore.
from bytecount.
> though dealing with chars that cross boundaries will be a chore.

Why? We can use pcmpestrm to count all continuation bytes (0x80-0xbf, IIRC) and subtract the count from the slice length.
from bytecount.
I expect we should be using the "Equal Ordered" mode. The challenge here is simply that this does not consume the whole 16 bytes each cycle, which means we either need to use unaligned reads or deal with characters crossing the 16-byte blocks. Having slept since the last comment, it's obvious to me that unaligned reads are the sensible bet: Intel hardware has fast unaligned loads, and it doesn't even prevent running multiple in parallel.
from bytecount.
Nothing in performance is obvious until measured. 😄
from bytecount.
I don't completely understand the Equal Ordered mode. OTOH the Range mode looks like a viable solution, using the range argument 0x80bf (or alternatively 0x007fc0ff to match only non-following bytes, thus saving one subtraction at the end). No idea which variant would be faster, so let's measure and find out.
from bytecount.
I should clarify how the Equal Ordered mode should be used. Consider the example given:
```
arg1    = "he"
arg2    = ", he helped her "
IntRes1 = 0010010000001000
```
When we're counting chars, it's going to look more like
```
arg1    = b"\xce\xb1"                     // α
arg2    = b"\xce\xb1\x6c\x70\x68\xce\xb1" // αlphα
IntRes1 = 1 0 0 0 0 1 0
```
and you can just count the set bits on that. Note, though, that you might have unaligned characters, e.g. something like
```
arg1    = b"\xce\xb1"                 // α
arg2    = b"\xce\xb1\x6c\x70\x68\xce" // αlph[α]
IntRes1 = 1 0 0 0 0 0
```
which means the next search can't jump a whole vector along. It needs to include the trailing bytes in the next search. Ergo the suggestion to use unaligned reads, which are fast on Intel CPUs.
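For intuition, here is a scalar equivalent of what the Equal Ordered count computes: byte-level occurrences of a needle, where overlapping windows naturally cover matches that would straddle a 16-byte block (the function name is mine, for illustration):

```rust
/// Count occurrences of `needle` anywhere in `haystack`, byte-for-byte.
/// Overlapping windows play the role of the unaligned reads above.
fn count_needle(haystack: &[u8], needle: &[u8]) -> usize {
    haystack.windows(needle.len()).filter(|w| *w == needle).count()
}

fn main() {
    let hay = "αlphα".as_bytes(); // b"\xce\xb1\x6c\x70\x68\xce\xb1"
    assert_eq!(count_needle(hay, "α".as_bytes()), 2);
    // Overlapping matches are all seen:
    assert_eq!(count_needle(b"aaa", b"aa"), 2);
}
```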
from bytecount.
I implemented a very simplistic version of this with no SIMD acceleration: CAD97/bytecount@77c0bad
```rust
pub fn naive_count_utf8(haystack: &[u8]) -> usize {
    // UTF-8 is self-synchronizing: all trailing bytes are encoded as 0b10xx_xxxx,
    // so to count code points, we count the non-trailing bytes
    // (0b11xx_xxxx leading bytes and 0b0xxx_xxxx ASCII bytes).
    haystack.iter().filter(|&&byte| (byte >> 6) != 0b10).count()
}
```
Because all trailing bytes are encoded as 0b10xx_xxxx, I check for bytes not in that form; those are our leading bytes, each starting a Unicode code point.
But the interesting part is how this compares against the stdlib's solution of .chars().count(), so I modified the existing benchmark for this new function:
```rust
unsafe fn random_str_safeish_bytes(len: usize) -> String {
    let mut bytes = rand::thread_rng().gen_iter::<u8>().take(len).collect::<Vec<_>>();
    // Zero the last four bytes (saturating_sub guards against underflow when len < 4).
    for idx in len.saturating_sub(4)..len {
        bytes[idx] = 0;
    }
    String::from_utf8_unchecked(bytes)
}

macro_rules! bench {
    ($i: expr, $name_stdlib: ident, $name_naive: ident) => {
        fn $name_stdlib(b: &mut bencher::Bencher) {
            let haystack = unsafe { random_str_safeish_bytes($i) };
            b.iter(|| haystack.chars().count());
        }
        fn $name_naive(b: &mut bencher::Bencher) {
            let haystack = unsafe { random_str_safeish_bytes($i) };
            b.iter(|| naive_count_utf8(haystack.as_bytes()));
        }
    };
}
```
I set the last four random bytes to NUL to make sure that the built-in Chars iterator doesn't overrun the buffer. Here are the results I got:
running 90 tests
test bench_00000_naive ... bench: 2 ns/iter (+/- 1)
test bench_00000_stdlib ... bench: 0 ns/iter (+/- 0)
test bench_00010_naive ... bench: 7 ns/iter (+/- 2)
test bench_00010_stdlib ... bench: 6 ns/iter (+/- 1)
test bench_00020_naive ... bench: 12 ns/iter (+/- 1)
test bench_00020_stdlib ... bench: 10 ns/iter (+/- 2)
test bench_00030_naive ... bench: 18 ns/iter (+/- 5)
test bench_00030_stdlib ... bench: 16 ns/iter (+/- 4)
test bench_00040_naive ... bench: 22 ns/iter (+/- 6)
test bench_00040_stdlib ... bench: 20 ns/iter (+/- 3)
test bench_00050_naive ... bench: 28 ns/iter (+/- 4)
test bench_00050_stdlib ... bench: 26 ns/iter (+/- 7)
test bench_00060_naive ... bench: 31 ns/iter (+/- 12)
test bench_00060_stdlib ... bench: 30 ns/iter (+/- 3)
test bench_00070_naive ... bench: 39 ns/iter (+/- 6)
test bench_00070_stdlib ... bench: 37 ns/iter (+/- 6)
test bench_00080_naive ... bench: 43 ns/iter (+/- 6)
test bench_00080_stdlib ... bench: 39 ns/iter (+/- 37)
test bench_00090_naive ... bench: 49 ns/iter (+/- 6)
test bench_00090_stdlib ... bench: 47 ns/iter (+/- 5)
test bench_00100_naive ... bench: 51 ns/iter (+/- 8)
test bench_00100_stdlib ... bench: 51 ns/iter (+/- 10)
test bench_00120_naive ... bench: 63 ns/iter (+/- 7)
test bench_00120_stdlib ... bench: 63 ns/iter (+/- 8)
test bench_00140_naive ... bench: 71 ns/iter (+/- 2)
test bench_00140_stdlib ... bench: 70 ns/iter (+/- 10)
test bench_00170_naive ... bench: 96 ns/iter (+/- 15)
test bench_00170_stdlib ... bench: 90 ns/iter (+/- 18)
test bench_00210_naive ... bench: 110 ns/iter (+/- 26)
test bench_00210_stdlib ... bench: 111 ns/iter (+/- 10)
test bench_00250_naive ... bench: 135 ns/iter (+/- 13)
test bench_00250_stdlib ... bench: 127 ns/iter (+/- 126)
test bench_00300_naive ... bench: 149 ns/iter (+/- 4)
test bench_00300_stdlib ... bench: 147 ns/iter (+/- 139)
test bench_00400_naive ... bench: 196 ns/iter (+/- 5)
test bench_00400_stdlib ... bench: 194 ns/iter (+/- 15)
test bench_00500_naive ... bench: 240 ns/iter (+/- 15)
test bench_00500_stdlib ... bench: 249 ns/iter (+/- 81)
test bench_00600_naive ... bench: 287 ns/iter (+/- 6)
test bench_00600_stdlib ... bench: 285 ns/iter (+/- 10)
test bench_00700_naive ... bench: 333 ns/iter (+/- 6)
test bench_00700_stdlib ... bench: 329 ns/iter (+/- 20)
test bench_00800_naive ... bench: 378 ns/iter (+/- 4)
test bench_00800_stdlib ... bench: 377 ns/iter (+/- 18)
test bench_00900_naive ... bench: 425 ns/iter (+/- 388)
test bench_00900_stdlib ... bench: 446 ns/iter (+/- 102)
test bench_01000_naive ... bench: 471 ns/iter (+/- 201)
test bench_01000_stdlib ... bench: 468 ns/iter (+/- 52)
test bench_01200_naive ... bench: 562 ns/iter (+/- 271)
test bench_01200_stdlib ... bench: 558 ns/iter (+/- 307)
test bench_01400_naive ... bench: 654 ns/iter (+/- 541)
test bench_01400_stdlib ... bench: 652 ns/iter (+/- 59)
test bench_01700_naive ... bench: 790 ns/iter (+/- 478)
test bench_01700_stdlib ... bench: 836 ns/iter (+/- 49)
test bench_02100_naive ... bench: 974 ns/iter (+/- 30)
test bench_02100_stdlib ... bench: 973 ns/iter (+/- 58)
test bench_02500_naive ... bench: 1,155 ns/iter (+/- 106)
test bench_02500_stdlib ... bench: 1,157 ns/iter (+/- 593)
test bench_03000_naive ... bench: 1,384 ns/iter (+/- 22)
test bench_03000_stdlib ... bench: 1,386 ns/iter (+/- 49)
test bench_04000_naive ... bench: 1,848 ns/iter (+/- 2,120)
test bench_04000_stdlib ... bench: 1,964 ns/iter (+/- 116)
test bench_05000_naive ... bench: 2,300 ns/iter (+/- 127)
test bench_05000_stdlib ... bench: 2,299 ns/iter (+/- 2,273)
test bench_06000_naive ... bench: 2,770 ns/iter (+/- 3,139)
test bench_06000_stdlib ... bench: 2,759 ns/iter (+/- 1,382)
test bench_07000_naive ... bench: 3,235 ns/iter (+/- 3,409)
test bench_07000_stdlib ... bench: 3,209 ns/iter (+/- 2,930)
test bench_08000_naive ... bench: 3,677 ns/iter (+/- 1,279)
test bench_08000_stdlib ... bench: 3,915 ns/iter (+/- 157)
test bench_09000_naive ... bench: 4,139 ns/iter (+/- 215)
test bench_09000_stdlib ... bench: 4,147 ns/iter (+/- 119)
test bench_10000_naive ... bench: 4,626 ns/iter (+/- 5,274)
test bench_10000_stdlib ... bench: 4,592 ns/iter (+/- 155)
test bench_12000_naive ... bench: 5,503 ns/iter (+/- 130)
test bench_12000_stdlib ... bench: 5,521 ns/iter (+/- 3,656)
test bench_14000_naive ... bench: 6,443 ns/iter (+/- 2,534)
test bench_14000_stdlib ... bench: 6,855 ns/iter (+/- 3,694)
test bench_17000_naive ... bench: 7,800 ns/iter (+/- 2,748)
test bench_17000_stdlib ... bench: 7,862 ns/iter (+/- 8,767)
test bench_21000_naive ... bench: 9,635 ns/iter (+/- 7,809)
test bench_21000_stdlib ... bench: 9,702 ns/iter (+/- 9,659)
test bench_25000_naive ... bench: 11,451 ns/iter (+/- 10,410)
test bench_25000_stdlib ... bench: 11,455 ns/iter (+/- 2,153)
test bench_30000_naive ... bench: 13,741 ns/iter (+/- 324)
test bench_30000_stdlib ... bench: 14,662 ns/iter (+/- 3,013)
test bench_big_0100000_naive ... bench: 45,697 ns/iter (+/- 28,023)
test bench_big_0100000_stdlib ... bench: 45,747 ns/iter (+/- 41,462)
test bench_big_1000000_naive ... bench: 457,959 ns/iter (+/- 43,860)
test bench_big_1000000_stdlib ... bench: 476,328 ns/iter (+/- 109,725)
If I'm reading it correctly, we get the same performance (within error margins) below somewhere around 10,000 bytes, at which point the manual approach becomes consistently, if marginally, faster, though still within the error margins.
I don't know enough about SIMD to accelerate this, but I think it should be fairly easy to do. I agree with @llogiq here that, by my reading of pcmpestrm, an argument of 0x007fc0ff in Ranges mode should do the job exactly as I have here.
EDIT: I also tried the following with fold, to be more in line with the naive_count implementation, ran it alongside the filter/count version above, and got exactly the same performance:
```rust
pub fn naive_count_utf8_alt(haystack: &[u8]) -> usize {
    haystack.iter().fold(0usize, |n, &byte| n + ((byte >> 6) != 0b10) as usize)
}
```
from bytecount.
@CAD97 If you're searching a &[u8], then you need to specify what happens when you get invalid UTF-8. For example, if one were to lossily decode the bytes \xFF\xFF\xFF, you'd get three instances of the replacement character. Should counting take that into account?
The answer here may very well be "assume valid UTF-8," and maybe that's OK! :)
(This is perhaps not a comment directed toward you @CAD97, but rather, the entire issue.)
from bytecount.
What I did was to say that using the count function on invalid UTF-8 is safe but meaningless:

```diff
+/// This function is safe to use on any byte array, valid UTF-8 or not,
+/// but the output is only meaningful for well-formed UTF-8.
```
I'm not intimately familiar with the lossy decoding rules, but I think this approach does not fully agree with the lossy decoding protocol.
If I recall correctly, a lossy decoder should stop when encountering an unexpected byte, and replace the bytes it has eaten already with the replacement character. This means that a single trailing byte would be converted to a replacement character, whereas the naive algorithm I used would not count it.
It would not be difficult to write a decoder that handled this, but it would necessarily be serial. I personally think the solution is to use an algorithm that is correct on UTF-8 but still safe on arbitrary bytes, such as the above (byte >> 6) != 0b10.
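To make the disagreement concrete (a small demonstration, not part of the original commit): a lone continuation byte decodes lossily to one replacement character, yet the byte-classification count ignores it.

```rust
fn naive_count_utf8(haystack: &[u8]) -> usize {
    haystack.iter().filter(|&&byte| (byte >> 6) != 0b10).count()
}

fn main() {
    let bytes = b"\x80"; // a lone continuation byte: invalid UTF-8
    let lossy = String::from_utf8_lossy(bytes);
    assert_eq!(lossy.chars().count(), 1); // one U+FFFD replacement character
    assert_eq!(naive_count_utf8(bytes), 0); // the naive count sees no leading byte
}
```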
from bytecount.
@CAD97 SGTM. :)
from bytecount.
Oh, I just figured out the potential miscommunication between @llogiq / @Veedrac around pcmpestrm: you're both looking at different counts.
The Range count (count bytes in [\x00-\x7f\xc0-\xff], or length minus the bytes in [\x80-\xbf]) is for counting how many characters are present in a UTF-8 string.
The Equal Ordered count is intended to count occurrences of a given character in a UTF-8 string.
These are both useful pieces of information to look for, though I think the former is much more likely to be used. Searching for occurrences of a given Unicode codepoint is fraught with its own perils, because of the codepoint/grapheme disconnect as well as other encoding gotchas.
I think this issue should focus on the string.chars().count() case, and another issue should be opened for the string.chars().filter(|&c| c == needle).count() case.
from bytecount.
Looking again at rust-lang/rust#37888 (I did look at it, I promise), the naive solution I used above is exactly what that PR introduced to the stdlib, so I'm not surprised that the performance was within margin of error every time.
Any speedup we can offer will come from providing a version on byte slices without the UTF-8 check (see also fflorent/nom_locate#4) or from looking at more bytes at a time, à la memchr or SIMD.
from bytecount.
Yup, I completely misunderstood the goal here. Thanks @CAD97 for the help!
This should actually be really easy to retrofit onto the existing code, because of the tricks that have already been mentioned. You basically need to mask off the lower bits and then return masked.len() - count(masked, 0b10000000). Of course, ideally you'd do this all in one pass, and I think you can drop an instruction or two in this simpler case.
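In scalar form, the retrofit might look like the sketch below; `count` here is a stand-in for bytecount's byte-counting primitive, and the two-pass shape is only for clarity (the real code would fuse the mask and the count into one pass):

```rust
/// Stand-in for bytecount's single-byte counting primitive.
fn count(haystack: &[u8], needle: u8) -> usize {
    haystack.iter().filter(|&&b| b == needle).count()
}

/// Mask off the lower six bits; every continuation byte becomes exactly 0b1000_0000.
fn num_chars(haystack: &[u8]) -> usize {
    let masked: Vec<u8> = haystack.iter().map(|&b| b & 0b1100_0000).collect();
    masked.len() - count(&masked, 0b1000_0000)
}

fn main() {
    assert_eq!(num_chars("αlphα".as_bytes()), 5);
    assert_eq!(num_chars(b"hello"), 5);
}
```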
from bytecount.
Proof of concept is up at #29.
from bytecount.
Slightly improved version in #30 – still not using pcmpestrm, so it could potentially work on AArch64, too.
from bytecount.
Closed by #29
from bytecount.