Comments (16)
It sounds faster to use pcmpestrm, which with the right flags does the heavy lifting for us, though dealing with chars that cross boundaries will be a chore.
from bytecount.
> though dealing with chars that cross boundaries will be a chore.

Why? We can use pcmpestrm to count all continuation bytes (0x80-0xbf, IIRC) and subtract the count from the slice length.
from bytecount.
I expect we should be using the "Equal Ordered" mode. The challenge here is simply that this does not consume the whole 16 bytes each cycle, which means we either need to use unaligned reads or deal with characters crossing the 16-byte blocks. Having slept since the last comment, it's obvious to me that unaligned reads are the sensible bet: Intel hardware has fast unaligned loads, and it doesn't even prevent running multiple in parallel.
from bytecount.
Nothing in performance is obvious until measured. 😄
from bytecount.
I don't completely understand the Equal Ordered mode. OTOH the Range mode looks like a viable solution, using the range argument 0x80bf (or alternatively 0x007fc0ff to match only non-following bytes, thus saving one subtraction at the end). No idea which variant would be faster, so let's measure and find out.
from bytecount.
I should clarify how the Equal Ordered mode should be used. Consider the example given:
```
arg1    = "he"
arg2    = ", he helped her "
IntRes1 = 0010010000001000
```
When we're counting chars, it's going to look more like
```
arg1    = b"\xce\xb1"                     // α
arg2    = b"\xce\xb1\x6c\x70\x68\xce\xb1" // αlphα
IntRes1 = 1 0 0 0 0 1 0
```
and you can just count the set bits on that. Note, though, that you might have unaligned characters, e.g. something like
```
arg1    = b"\xce\xb1"                 // α
arg2    = b"\xce\xb1\x6c\x70\x68\xce" // αlph[α]
IntRes1 = 1 0 0 0 0 0
```
which means the next search can't jump a whole vector along. It needs to include the trailing bytes in the next search. Ergo the suggestion to use unaligned reads, which are fast on Intel CPUs.
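For intuition, here is a scalar equivalent of what the Equal Ordered count computes: byte-level occurrences of a needle, where overlapping windows naturally cover matches that would straddle a 16-byte block (the function name is mine, for illustration):

```rust
/// Count occurrences of `needle` anywhere in `haystack`, byte-for-byte.
/// Overlapping windows play the role of the unaligned reads above.
fn count_needle(haystack: &[u8], needle: &[u8]) -> usize {
    haystack.windows(needle.len()).filter(|w| *w == needle).count()
}

fn main() {
    let hay = "αlphα".as_bytes(); // b"\xce\xb1\x6c\x70\x68\xce\xb1"
    assert_eq!(count_needle(hay, "α".as_bytes()), 2);
    // Overlapping matches are all seen:
    assert_eq!(count_needle(b"aaa", b"aa"), 2);
}
```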
from bytecount.
I implemented a very simplistic version of this with no SIMD acceleration: CAD97/bytecount@77c0bad
```rust
pub fn naive_count_utf8(haystack: &[u8]) -> usize {
    // UTF-8 is self-synchronizing: all trailing bytes are encoded as 0b10xx_xxxx,
    // so to count code points, we count the non-trailing bytes
    // (0b11xx_xxxx leading bytes and 0b0xxx_xxxx ASCII bytes).
    haystack.iter().filter(|&&byte| (byte >> 6) != 0b10).count()
}
```
Because all trailing bytes are encoded as 0b10xx_xxxx, I check for bytes not in that form; those are our leading bytes, each starting a Unicode code point.
But the interesting part is how this compares against the stdlib's solution of .chars().count(), so I modified the existing benchmark for this new function:
```rust
unsafe fn random_str_safeish_bytes(len: usize) -> String {
    let mut bytes = rand::thread_rng().gen_iter::<u8>().take(len).collect::<Vec<_>>();
    // Zero the last four bytes (saturating_sub guards against underflow when len < 4).
    for idx in len.saturating_sub(4)..len {
        bytes[idx] = 0;
    }
    String::from_utf8_unchecked(bytes)
}

macro_rules! bench {
    ($i: expr, $name_stdlib: ident, $name_naive: ident) => {
        fn $name_stdlib(b: &mut bencher::Bencher) {
            let haystack = unsafe { random_str_safeish_bytes($i) };
            b.iter(|| haystack.chars().count());
        }
        fn $name_naive(b: &mut bencher::Bencher) {
            let haystack = unsafe { random_str_safeish_bytes($i) };
            b.iter(|| naive_count_utf8(haystack.as_bytes()));
        }
    };
}
```
I set the last four random bytes to NUL to make sure that the built-in Chars iterator doesn't overrun the buffer. Here are the results I got:
running 90 tests
test bench_00000_naive ... bench: 2 ns/iter (+/- 1)
test bench_00000_stdlib ... bench: 0 ns/iter (+/- 0)
test bench_00010_naive ... bench: 7 ns/iter (+/- 2)
test bench_00010_stdlib ... bench: 6 ns/iter (+/- 1)
test bench_00020_naive ... bench: 12 ns/iter (+/- 1)
test bench_00020_stdlib ... bench: 10 ns/iter (+/- 2)
test bench_00030_naive ... bench: 18 ns/iter (+/- 5)
test bench_00030_stdlib ... bench: 16 ns/iter (+/- 4)
test bench_00040_naive ... bench: 22 ns/iter (+/- 6)
test bench_00040_stdlib ... bench: 20 ns/iter (+/- 3)
test bench_00050_naive ... bench: 28 ns/iter (+/- 4)
test bench_00050_stdlib ... bench: 26 ns/iter (+/- 7)
test bench_00060_naive ... bench: 31 ns/iter (+/- 12)
test bench_00060_stdlib ... bench: 30 ns/iter (+/- 3)
test bench_00070_naive ... bench: 39 ns/iter (+/- 6)
test bench_00070_stdlib ... bench: 37 ns/iter (+/- 6)
test bench_00080_naive ... bench: 43 ns/iter (+/- 6)
test bench_00080_stdlib ... bench: 39 ns/iter (+/- 37)
test bench_00090_naive ... bench: 49 ns/iter (+/- 6)
test bench_00090_stdlib ... bench: 47 ns/iter (+/- 5)
test bench_00100_naive ... bench: 51 ns/iter (+/- 8)
test bench_00100_stdlib ... bench: 51 ns/iter (+/- 10)
test bench_00120_naive ... bench: 63 ns/iter (+/- 7)
test bench_00120_stdlib ... bench: 63 ns/iter (+/- 8)
test bench_00140_naive ... bench: 71 ns/iter (+/- 2)
test bench_00140_stdlib ... bench: 70 ns/iter (+/- 10)
test bench_00170_naive ... bench: 96 ns/iter (+/- 15)
test bench_00170_stdlib ... bench: 90 ns/iter (+/- 18)
test bench_00210_naive ... bench: 110 ns/iter (+/- 26)
test bench_00210_stdlib ... bench: 111 ns/iter (+/- 10)
test bench_00250_naive ... bench: 135 ns/iter (+/- 13)
test bench_00250_stdlib ... bench: 127 ns/iter (+/- 126)
test bench_00300_naive ... bench: 149 ns/iter (+/- 4)
test bench_00300_stdlib ... bench: 147 ns/iter (+/- 139)
test bench_00400_naive ... bench: 196 ns/iter (+/- 5)
test bench_00400_stdlib ... bench: 194 ns/iter (+/- 15)
test bench_00500_naive ... bench: 240 ns/iter (+/- 15)
test bench_00500_stdlib ... bench: 249 ns/iter (+/- 81)
test bench_00600_naive ... bench: 287 ns/iter (+/- 6)
test bench_00600_stdlib ... bench: 285 ns/iter (+/- 10)
test bench_00700_naive ... bench: 333 ns/iter (+/- 6)
test bench_00700_stdlib ... bench: 329 ns/iter (+/- 20)
test bench_00800_naive ... bench: 378 ns/iter (+/- 4)
test bench_00800_stdlib ... bench: 377 ns/iter (+/- 18)
test bench_00900_naive ... bench: 425 ns/iter (+/- 388)
test bench_00900_stdlib ... bench: 446 ns/iter (+/- 102)
test bench_01000_naive ... bench: 471 ns/iter (+/- 201)
test bench_01000_stdlib ... bench: 468 ns/iter (+/- 52)
test bench_01200_naive ... bench: 562 ns/iter (+/- 271)
test bench_01200_stdlib ... bench: 558 ns/iter (+/- 307)
test bench_01400_naive ... bench: 654 ns/iter (+/- 541)
test bench_01400_stdlib ... bench: 652 ns/iter (+/- 59)
test bench_01700_naive ... bench: 790 ns/iter (+/- 478)
test bench_01700_stdlib ... bench: 836 ns/iter (+/- 49)
test bench_02100_naive ... bench: 974 ns/iter (+/- 30)
test bench_02100_stdlib ... bench: 973 ns/iter (+/- 58)
test bench_02500_naive ... bench: 1,155 ns/iter (+/- 106)
test bench_02500_stdlib ... bench: 1,157 ns/iter (+/- 593)
test bench_03000_naive ... bench: 1,384 ns/iter (+/- 22)
test bench_03000_stdlib ... bench: 1,386 ns/iter (+/- 49)
test bench_04000_naive ... bench: 1,848 ns/iter (+/- 2,120)
test bench_04000_stdlib ... bench: 1,964 ns/iter (+/- 116)
test bench_05000_naive ... bench: 2,300 ns/iter (+/- 127)
test bench_05000_stdlib ... bench: 2,299 ns/iter (+/- 2,273)
test bench_06000_naive ... bench: 2,770 ns/iter (+/- 3,139)
test bench_06000_stdlib ... bench: 2,759 ns/iter (+/- 1,382)
test bench_07000_naive ... bench: 3,235 ns/iter (+/- 3,409)
test bench_07000_stdlib ... bench: 3,209 ns/iter (+/- 2,930)
test bench_08000_naive ... bench: 3,677 ns/iter (+/- 1,279)
test bench_08000_stdlib ... bench: 3,915 ns/iter (+/- 157)
test bench_09000_naive ... bench: 4,139 ns/iter (+/- 215)
test bench_09000_stdlib ... bench: 4,147 ns/iter (+/- 119)
test bench_10000_naive ... bench: 4,626 ns/iter (+/- 5,274)
test bench_10000_stdlib ... bench: 4,592 ns/iter (+/- 155)
test bench_12000_naive ... bench: 5,503 ns/iter (+/- 130)
test bench_12000_stdlib ... bench: 5,521 ns/iter (+/- 3,656)
test bench_14000_naive ... bench: 6,443 ns/iter (+/- 2,534)
test bench_14000_stdlib ... bench: 6,855 ns/iter (+/- 3,694)
test bench_17000_naive ... bench: 7,800 ns/iter (+/- 2,748)
test bench_17000_stdlib ... bench: 7,862 ns/iter (+/- 8,767)
test bench_21000_naive ... bench: 9,635 ns/iter (+/- 7,809)
test bench_21000_stdlib ... bench: 9,702 ns/iter (+/- 9,659)
test bench_25000_naive ... bench: 11,451 ns/iter (+/- 10,410)
test bench_25000_stdlib ... bench: 11,455 ns/iter (+/- 2,153)
test bench_30000_naive ... bench: 13,741 ns/iter (+/- 324)
test bench_30000_stdlib ... bench: 14,662 ns/iter (+/- 3,013)
test bench_big_0100000_naive ... bench: 45,697 ns/iter (+/- 28,023)
test bench_big_0100000_stdlib ... bench: 45,747 ns/iter (+/- 41,462)
test bench_big_1000000_naive ... bench: 457,959 ns/iter (+/- 43,860)
test bench_big_1000000_stdlib ... bench: 476,328 ns/iter (+/- 109,725)
If I'm reading it correctly, we get the same performance (within error margins) below somewhere around 10,000 bytes, at which point the manual approach becomes consistently, if marginally, faster, though still within the error margins.
I don't know enough about SIMD to accelerate this, but I think it should be fairly easy to do. I agree with @llogiq here that, by my reading of pcmpestrm, an argument of 0x007fc0ff in Ranges mode should do the job exactly as I have here.
EDIT: I also tried the following with fold, to be more in line with the naive_count implementation, ran it alongside the filter/count version above, and got exactly the same performance:
```rust
pub fn naive_count_utf8_alt(haystack: &[u8]) -> usize {
    haystack.iter().fold(0usize, |n, &byte| n + ((byte >> 6) != 0b10) as usize)
}
```
from bytecount.
@CAD97 If you're searching a &[u8], then you need to specify what happens when you get invalid UTF-8. For example, if one were to lossily decode the bytes \xFF\xFF\xFF, you'd get three instances of the replacement character. Should counting take that into account?
The answer here may very well be "assume valid UTF-8," and maybe that's OK! :)
(This is perhaps not a comment directed toward you @CAD97, but rather, the entire issue.)
from bytecount.
What I did was to say that using the count function on invalid UTF-8 is safe but meaningless:

```diff
+/// This function is safe to use on any byte array, valid UTF-8 or not,
+/// but the output is only meaningful for well-formed UTF-8.
```
I'm not intimately familiar with the lossy decoding rules, but I think this approach does not fully agree with the lossy decoding protocol.
If I recall correctly, a lossy decoder should stop when encountering an unexpected byte, and replace the bytes it has eaten already with the replacement character. This means that a single trailing byte would be converted to a replacement character, whereas the naive algorithm I used would not count it.
It would not be difficult to write a decoder that handled this, but it would necessarily be serial. I personally think the solution is to use an algorithm that is correct on UTF-8 but still safe on arbitrary bytes, such as the above (byte >> 6) != 0b10.
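To make the disagreement concrete (a small demonstration, not part of the original commit): a lone continuation byte decodes lossily to one replacement character, yet the byte-classification count ignores it.

```rust
fn naive_count_utf8(haystack: &[u8]) -> usize {
    haystack.iter().filter(|&&byte| (byte >> 6) != 0b10).count()
}

fn main() {
    let bytes = b"\x80"; // a lone continuation byte: invalid UTF-8
    let lossy = String::from_utf8_lossy(bytes);
    assert_eq!(lossy.chars().count(), 1); // one U+FFFD replacement character
    assert_eq!(naive_count_utf8(bytes), 0); // the naive count sees no leading byte
}
```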
from bytecount.
@CAD97 SGTM. :)
from bytecount.
Oh, I just figured out the potential miscommunication between @llogiq / @Veedrac around pcmpestrm: you're both looking at different counts.
The Range count (count bytes in [\x00-\x7f\xc0-\xff], or length minus the bytes in [\x80-\xbf]) is for counting how many characters are present in a UTF-8 string.
The Equal Ordered count is intended to count occurrences of a given character in a UTF-8 string.
These are both useful pieces of information to look for, though I think the former is much more likely to be used. Searching for occurrences of a given Unicode codepoint is fraught with its own perils, because of the codepoint/grapheme disconnect as well as other encoding gotchas.
I think this issue should focus on the string.chars().count() case, and another issue should be opened for the string.chars().filter(|&c| c == needle).count() case.
from bytecount.
Looking again at rust-lang/rust#37888 (I did look at it, I promise), the naive solution I used above is exactly what that PR introduced to the stdlib, so I'm not surprised that the performance was within margin of error every time.
Any speedup we can offer will come from providing a version on byte slices without the UTF-8 check (see also fflorent/nom_locate#4) or from looking at more bytes at a time, à la memchr or SIMD.
from bytecount.
Yup, I completely misunderstood the goal here. Thanks @CAD97 for the help!
This should actually be really easy to retrofit onto the existing code, because of the tricks that have already been mentioned. You basically need to mask off the lower bits and then return masked.len() - count(masked, 0b10000000). Of course, ideally you'd do this all in one pass, and I think you can drop an instruction or two in this simpler case.
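In scalar form, the retrofit might look like the sketch below; `count` here is a stand-in for bytecount's byte-counting primitive, and the two-pass shape is only for clarity (the real code would fuse the mask and the count into one pass):

```rust
/// Stand-in for bytecount's single-byte counting primitive.
fn count(haystack: &[u8], needle: u8) -> usize {
    haystack.iter().filter(|&&b| b == needle).count()
}

/// Mask off the lower six bits; every continuation byte becomes exactly 0b1000_0000.
fn num_chars(haystack: &[u8]) -> usize {
    let masked: Vec<u8> = haystack.iter().map(|&b| b & 0b1100_0000).collect();
    masked.len() - count(&masked, 0b1000_0000)
}

fn main() {
    assert_eq!(num_chars("αlphα".as_bytes()), 5);
    assert_eq!(num_chars(b"hello"), 5);
}
```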
from bytecount.
Proof of concept is up at #29.
from bytecount.
Slightly improved version in #30 – still not using pcmpestrm, so it could potentially work on AArch64, too.
from bytecount.
Closed by #29
from bytecount.