llogiq / bytecount Goto Github PK
View Code? Open in Web Editor NEWCounting occurrences of a given byte or UTF-8 characters in a slice of memory – fast
License: Apache License 2.0
Counting occurrences of a given byte or UTF-8 characters in a slice of memory – fast
License: Apache License 2.0
Looks like Apache2 got copied over in 2c79f08 and never got replaced by the MIT boilerplate.
I did that and actually found some issues. See https://github.com/vks/bytecounttest. For instance, bytecount::count
returns 2 for this input: &[0, 0, 0, 0, 0, 0, 0, 0, 33, 32, 0, 0, 0, 0, 0, 0], 33
After these are fixed, it might be a good idea to use AFL, which is a bit more sophisticated than quickcheck.
When using avx-accel
feature to accelerate byte counting, the compilation failed like:
Compiling bytecount v0.1.6
error[E0432]: unresolved import `simd::x86::avx::LowHigh128`
--> xxx/.cargo/registry/src/github.com-1ecc6299db9ec823/bytecount-0.1.6/src/lib.rs:12:22
|
12 | use simd::x86::avx::{LowHigh128, u8x32};
| ^^^^^^^^^^ Could not find `avx` in `x86`
error[E0432]: unresolved import `simd::x86::avx::u8x32`
--> xxx/.cargo/registry/src/github.com-1ecc6299db9ec823/bytecount-0.1.6/src/lib.rs:12:34
|
12 | use simd::x86::avx::{LowHigh128, u8x32};
| ^^^^^ Could not find `avx` in `x86`
error: aborting due to previous error(s)
error: Could not compile `bytecount`.
And from docs.rs avx
is really absent, but rust-lang-nursery.github.io seems saying it's there.
I'm really lost what's going on. Should I add other relevant crates to make it work?
The recent update #14 inadvertently increased the minimum Rust version to 1.11. We should be able to use a fold
without performance loss.
I'll do it once I get to my PC – unless someone beats me to it.
This code, in 0.1.2:
let mytext = "some potentially large text, perhaps read from disk?";
let spaces = bytecount::count(mytext.as_bytes(), b' ');
let text2 = "12111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111121";
let text3 = "10111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111101";
let count2 = bytecount::count(text2.as_bytes(), b'1');
let count3 = bytecount::count(text3.as_bytes(), b'1');
println!("bytecount {} {} {}", spaces, count2, count3);
prints bytecount 7 126 128
. It should be 7 126 126
.
See conversation in xi-editor/xi-editor#122 for where this came from.
I read the blog post about fast byte counting and I was reading through the code to better understand the SIMD intrinsic. I was mostly able to follow along but I got confused at this part
bytecount/src/simd/x86_avx2.rs
Lines 58 to 61 in 5085205
I don't understand why we are subtracting the results of the compare instead of adding them. It seems like this will give incorrect results, but obviously doesn't, and I can't explain why. Could someone explain it to me?
Basically we just need an bitwise AND or OR and compare with the respective value whose highest bits are '10' and subtract the count from the slice length.
This could also go into .chars().count()
if it turns out to not lose too much time compared to naive onn short strings.
While compiling bytecount as a dependency of nom_locate v4.2.0 for wasm targets, I get the following error:
Compiling bytecount v0.6.5
error[E0080]: evaluation of `core::arch::wasm32::i32x4_extract_lane::<4>::{constant#0}` failed
--> /home/steph/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/../../stdarch/crates/core_arch/src/wasm32/simd128.rs:1212:5
|
1212 | static_assert!(N < 4);
| ^^^^^^^^^^^^^^^^^^^^^ the evaluated program panicked at 'assertion failed: N < 4', /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/../../stdarch/crates/core_arch/src/wasm32/simd128.rs:1212:5
|
= note: this error originates in the macro `assert` which comes from the expansion of the macro `static_assert` (in Nightly builds, run with -Z macro-backtrace for more info)
note: the above error was encountered while instantiating `fn core::arch::wasm32::u32x4_extract_lane::<4>`
--> /home/steph/.cargo/registry/src/index.crates.io-6f17d22bba15001f/bytecount-0.6.5/src/simd/wasm.rs:62:9
|
62 | u32x4_extract_lane::<4>(u32s),
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
For more information about this error, try `rustc --explain E0080`.
The minimal reproducible test case is here:
https://github.com/stephanemagnenat/bytecount-89
Just do cargo build --target wasm32-unknown-unknown
in bytecount-issue-89/
and you get the same error as above (rustc 1.73.0 (cc66ad468 2023-10-03)
).
The test case is trivial: The repository is a newly-created project with cargo and with bytecount added as a dependency.
We may want to have one nightly build that runs clippy over our code. Clippy gets new and improved lints all the time, so having it run could be useful.
For naive_count
you seem to have done a trick where you convert what is morally filter(...).count()
into a fold. See https://github.com/llogiq/bytecount/blob/master/src/naive.rs#L24-L26
However, for naive_num_chars
the same trick seems possible, but you haven't done so. Any reason why?
bencher runs on all rust trains. We should evaluate it to allow benchmarking all rust versions.
With the latest nightly, attempting to build bytecount 0.1.6 gives this error:
error: unable to get packages from source
Caused by:
failed to parse manifest at `/home/jeremy/.cargo/registry/src/github.com-1ecc6299db9ec823/bytecount-0.1.6/Cargo.toml`
Caused by:
can't find `bench` bench, specify bench.path
This test fails
let haystack = vec![0, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0];
let needle = 2;
assert_eq!(count(&haystack[0..], needle), naive_count(&haystack[0..], needle));
giving a count of 2 rather than 1. The errors are with
(((x ^ needles).wrapping_sub(LO)) & !x & HI) >> 7
Firstly, this should be
let x = x ^ needles;
((x.wrapping_sub(LO)) & !x & HI) >> 7
Secondly, this doesn't actually work for counting - the x.wrapping_sub(LO)
sets the high bit correctly but can overflow into the next value if the next byte is 1 - wrapping_sub
will decrement that to zero and the subtraction will continue past.
It's fairly simple to deal with by using
!((((v & !HI) + !HI) | v) >> 7) & LO
instead, but it'd be interesting to see if you don't have to spend the extra operations to get that. It doesn't seem to affect speed all too much, though.
AArch64 SIMD has been stabilized since Rust 1.61.
Originally from BurntSushi/ripgrep#1144, we traced down the issue to bytecount, specifically commit c6668a3:
git bisect start master f2a67d5
git bisect run sh -c 'RUSTFLAGS="-L $LD_LIBRARY_PATH" RUSTDOCFLAGS="-L $LD_LIBRARY_PATH" cargo test'
[..]
running 9 tests
test check_count_correct ... FAILED
test check_count_large ... ok
test check_count_overflow ... ok
test check_count_some ... FAILED
test check_num_chars_correct ... ok
test check_count_large_rand ... ok
test check_num_chars_overflow ... ok
test check_num_chars_some ... ok
test check_num_chars_large ... ok
failures:
---- check_count_correct stdout ----
thread 'check_count_correct' panicked at '[quickcheck] TEST FAILED. Arguments: (([0, 0, 0, 1, 0, 0, 0, 0, 0, 0], 0))', /home/infinity0/.cargo/registry/src/github.com-eae4ba8cbf2ce1c7/quickcheck-0.6.2/src/tester.rs:179:28
note: Run with `RUST_BACKTRACE=1` for a backtrace.
---- check_count_some stdout ----
thread 'check_count_some' panicked at 'assertion failed: `(left == right)`
left: `0`,
right: `1`', tests/check.rs:43:5
failures:
check_count_correct
check_count_some
test result: FAILED. 7 passed; 2 failed; 0 ignored; 0 measured; 0 filtered out
error: test failed, to rerun pass '--test check'
c6668a3264ec208ae9733f83eb94b6312518f14a is the first bad commit
commit c6668a3264ec208ae9733f83eb94b6312518f14a
Author: Veedrac <[email protected]>
Date: Wed Jan 31 02:54:17 2018 +0000
Fancy new algorithm on stable SIMD
:100644 100644 f37644090b9886a0ed33293c5ac353b52631d2e3 c4b93928eda2ec61972beef49ed8c40860639d89 M .travis.yml
:100644 100644 1eecc67f7677cdb843eba653d689faef0b3d7b13 91395c0abca0a56e7bf0f8075dbf83b25fc1a1e0 M Cargo.toml
:100644 100644 29d6473403e8fcfe5ecb6d4b2d144cdd782fc7ea e76dfae2c603a0e87bb29efe4904200f09a56a4e M appveyor.yml
:040000 040000 d0ac6b0edbf5fe91a4c2d499eb4e2f94ba91b799 47a3eb4dc78afe60b98f2b1f88f1ebd724ce690f M src
bisect run success
This is such a basic (in form of we need this everywhere, not in form of "simple") feature, I think it should be supported on embedded targets, too.
This library looks really interesting. As a person who hasn't done any real bit-twiddling yet, it is quite hard for me to get my mind around, though.
Could you explain, in non-bit-twiddler-terms, why this crate is faster than regular iterators / adaptors?
PoC
use std::io;
use std::env::var;
use std::io::prelude::*;
use std::fs::File;
use bytecount;
fn main() -> io::Result<()> {
let mut buffer = Vec::new();
let mut f = File::open(format!("{}/sample.txt", var("HOME").unwrap()))?;
f.read_to_end(&mut buffer)?;
println!("num_chars = {}", bytecount::num_chars(&buffer) as isize);
Ok(())
}
Results:
# cpu doesn't supports avx2 feature
$ ./test-avx2
num_chars = 4076
# cpu supports avx2 feature
$ ./test-avx2
num_chars = 8172
$ python -c "from pathlib import Path; print('num_chars =', len(Path('~/sample.txt').expanduser().read_text()))"
num_chars = 8172
Note that it will still build alright on that version, but I had to switch to a criterion version which appears not to support compiling without quote etc. which fails on 1.20.
cc @bheisler
Considering how we have projects like ripgrep, exa, and fd which aim to oxidise common shell utils, it would be great to create an alternative to the wc command that uses bytecount (I vote to name the binary ct, short for count).
Considering how the hard work has already been done for the actual counting, all that really would have to be done is parse command-line arguments with clap or similar. While we don't have an actual "word" counting functionality, 99% of the time I see wc
is via wc -l
anyway, and it seems pretty straightforward to make a binary that will do this with bytecount.
Intrinsic has incorrect return type!
ptr @llvm.aarch64.neon.ld1x4.v16i8.p0i8
in function _ZN4core9core_arch10arm_shared4neon9generated11vld1q_u8_x417h224d569be9c8b1b6E
LLVM ERROR: Broken function found, compilation aborted!
error: could not compile `bytecount` (lib)
warning: build failed, waiting for other jobs to finish...
$ uname -vm
Darwin Kernel Version 23.1.0: Mon Oct 9 21:28:45 PDT 2023; root:xnu-10002.41.9~6/RELEASE_ARM64_T6020 arm64
Apparently, the default changed to codegen-units=16
. However, when benchmarking I don't see a clear winner between N=1 or N=16 – it shows some small win on byte counting and some small loss on num_chars
, but in both cases, it's a wash. Perhaps someone else can benchmark on a beefier machine?
My benchmarks show that naive count is usually faster for small haystacks, so adding an if
guard to run naive below that threshold seems to be a no-brainer.
#[target_feature(enable = "sse2")]
unsafe fn sum(u8s: &__m128i) -> usize {
let sums = _mm_sad_epu8(*u8s, _mm_setzero_si128());
(_mm_extract_epi32(sums, 0) + _mm_extract_epi32(sums, 2)) as usize
}
Per _mm_extract_epi32
documentation -- Available on (x86 or x86-64) and target feature sse4.1 and x86-64 only
These two functions are presented, with naive_count_32 having the weakness that it can overflow. That's a trade-off I might be happy to make, but the documentation doesn't say what the performance difference is. Could that be added, at least in one benchmark, to enable a more informed decision?
The most recent version of bytecount
published to crates.io (0.6.0, published August 2019) does not compile. It looks like a fix was sent to bytecount
in PR #63 but that fix has never been published.
[dependencies]
bytecount = { version = "*", features = ["generic-simd"] }
error[E0432]: unresolved import `crate::arch::x86_64::_mm_movemask_pi8`
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.