llogiq / bytecount Goto Github PK

View Code? Open in Web Editor NEW

212.0 212.0 25.0 122 KB

Counting occurrences of a given byte or UTF-8 characters in a slice of memory – fast

License: Apache License 2.0

Rust 99.13% Shell 0.87%

algorithm byte-count hacktoberfest rust

bytecount's People

Contributors

Stargazers

Watchers

bytecount's Issues

Update LICENSE.MIT file

Looks like Apache2 got copied over in 2c79f08 and never got replaced by the MIT boilerplate.

I did that and actually found some issues. See https://github.com/vks/bytecounttest. For instance, bytecount::count returns 2 for this input: &[0, 0, 0, 0, 0, 0, 0, 0, 33, 32, 0, 0, 0, 0, 0, 0], 33

After these are fixed, it might be a good idea to use AFL, which is a bit more sophisticated than quickcheck.

"avx-accel" not available

When using avx-accel feature to accelerate byte counting, the compilation failed like:

   Compiling bytecount v0.1.6
error[E0432]: unresolved import `simd::x86::avx::LowHigh128`
  --> xxx/.cargo/registry/src/github.com-1ecc6299db9ec823/bytecount-0.1.6/src/lib.rs:12:22
   |
12 | use simd::x86::avx::{LowHigh128, u8x32};
   |                      ^^^^^^^^^^ Could not find `avx` in `x86`

error[E0432]: unresolved import `simd::x86::avx::u8x32`
  --> xxx/.cargo/registry/src/github.com-1ecc6299db9ec823/bytecount-0.1.6/src/lib.rs:12:34
   |
12 | use simd::x86::avx::{LowHigh128, u8x32};
   |                                  ^^^^^ Could not find `avx` in `x86`

error: aborting due to previous error(s)

error: Could not compile `bytecount`.

And from docs.rs avx is really absent, but rust-lang-nursery.github.io seems saying it's there.

I'm really lost what's going on. Should I add other relevant crates to make it work?

Replace `Iterator::sum()` with `fold`

The recent update #14 inadvertently increased the minimum Rust version to 1.11. We should be able to use a fold without performance loss.

I'll do it once I get to my PC – unless someone beats me to it.

Returns wrong value on some inputs

This code, in 0.1.2:

 let mytext = "some potentially large text, perhaps read from disk?";
 let spaces = bytecount::count(mytext.as_bytes(), b' ');
 let text2 = "12111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111121";
 let text3 = "10111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111101";
 let count2 = bytecount::count(text2.as_bytes(), b'1');
 let count3 = bytecount::count(text3.as_bytes(), b'1');
 println!("bytecount {} {} {}", spaces, count2, count3);

prints bytecount 7 126 128. It should be 7 126 126.

See conversation in xi-editor/xi-editor#122 for where this came from.

understanding SIMD intrinsics

I read the blog post about fast byte counting and I was reading through the code to better understand the SIMD intrinsic. I was mostly able to follow along but I got confused at this part

bytecount/src/simd/x86_avx2.rs

Lines 58 to 61 in 5085205

 counts = _mm256_sub_epi8( 

 counts, 

 _mm256_cmpeq_epi8(mm256_from_offset(haystack, offset), needles) 

 );

I don't understand why we are subtracting the results of the compare instead of adding them. It seems like this will give incorrect results, but obviously doesn't, and I can't explain why. Could someone explain it to me?

Add method to count UTF-8 chars

Basically we just need an bitwise AND or OR and compare with the respective value whose highest bits are '10' and subtract the count from the slice length.

This could also go into .chars().count() if it turns out to not lose too much time compared to naive onn short strings.

Version 0.6.5 is broken on wasm targets (version 0.6.3 worked)

While compiling bytecount as a dependency of nom_locate v4.2.0 for wasm targets, I get the following error:

   Compiling bytecount v0.6.5
error[E0080]: evaluation of `core::arch::wasm32::i32x4_extract_lane::<4>::{constant#0}` failed
    --> /home/steph/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/../../stdarch/crates/core_arch/src/wasm32/simd128.rs:1212:5
     |
1212 |     static_assert!(N < 4);
     |     ^^^^^^^^^^^^^^^^^^^^^ the evaluated program panicked at 'assertion failed: N < 4', /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/../../stdarch/crates/core_arch/src/wasm32/simd128.rs:1212:5
     |
     = note: this error originates in the macro `assert` which comes from the expansion of the macro `static_assert` (in Nightly builds, run with -Z macro-backtrace for more info)

note: the above error was encountered while instantiating `fn core::arch::wasm32::u32x4_extract_lane::<4>`
  --> /home/steph/.cargo/registry/src/index.crates.io-6f17d22bba15001f/bytecount-0.6.5/src/simd/wasm.rs:62:9
   |
62 |         u32x4_extract_lane::<4>(u32s),
   |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For more information about this error, try `rustc --explain E0080`.

The minimal reproducible test case is here:
https://github.com/stephanemagnenat/bytecount-89

Just do cargo build --target wasm32-unknown-unknown in bytecount-issue-89/ and you get the same error as above (rustc 1.73.0 (cc66ad468 2023-10-03)).

The test case is trivial: The repository is a newly-created project with cargo and with bytecount added as a dependency.

CI: run clippy if it's available

We may want to have one nightly build that runs clippy over our code. Clippy gets new and improved lints all the time, so having it run could be useful.

Optimisation question: fold vs count

For naive_count you seem to have done a trick where you convert what is morally filter(...).count() into a fold. See https://github.com/llogiq/bytecount/blob/master/src/naive.rs#L24-L26

However, for naive_num_chars the same trick seems possible, but you haven't done so. Any reason why?

benchmark on stable + beta

bencher runs on all rust trains. We should evaluate it to allow benchmarking all rust versions.

Failed to parse manifest

With the latest nightly, attempting to build bytecount 0.1.6 gives this error:

error: unable to get packages from source

Caused by:
  failed to parse manifest at `/home/jeremy/.cargo/registry/src/github.com-1ecc6299db9ec823/bytecount-0.1.6/Cargo.toml`

Caused by:
  can't find `bench` bench, specify bench.path

Failing test

This test fails

let haystack = vec![0, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0];
let needle = 2;
assert_eq!(count(&haystack[0..], needle), naive_count(&haystack[0..], needle));

giving a count of 2 rather than 1. The errors are with

(((x ^ needles).wrapping_sub(LO)) & !x & HI) >> 7

Firstly, this should be

let x = x ^ needles;
((x.wrapping_sub(LO)) & !x & HI) >> 7

Secondly, this doesn't actually work for counting - the x.wrapping_sub(LO) sets the high bit correctly but can overflow into the next value if the next byte is 1 - wrapping_sub will decrement that to zero and the subtraction will continue past.

It's fairly simple to deal with by using

!((((v & !HI) + !HI) | v) >> 7) & LO

instead, but it'd be interesting to see if you don't have to spend the extra operations to get that. It doesn't seem to affect speed all too much, though.

AArch64 support?

AArch64 SIMD has been stabilized since Rust 1.61.

"Fancy new algorithm" doesn't work on big-endian machines

Originally from BurntSushi/ripgrep#1144, we traced down the issue to bytecount, specifically commit c6668a3:

git bisect start master f2a67d5
git bisect run sh -c 'RUSTFLAGS="-L $LD_LIBRARY_PATH" RUSTDOCFLAGS="-L $LD_LIBRARY_PATH" cargo test'
[..]
running 9 tests
test check_count_correct ... FAILED
test check_count_large ... ok
test check_count_overflow ... ok
test check_count_some ... FAILED
test check_num_chars_correct ... ok
test check_count_large_rand ... ok
test check_num_chars_overflow ... ok
test check_num_chars_some ... ok
test check_num_chars_large ... ok

failures:

---- check_count_correct stdout ----
thread 'check_count_correct' panicked at '[quickcheck] TEST FAILED. Arguments: (([0, 0, 0, 1, 0, 0, 0, 0, 0, 0], 0))', /home/infinity0/.cargo/registry/src/github.com-eae4ba8cbf2ce1c7/quickcheck-0.6.2/src/tester.rs:179:28
note: Run with `RUST_BACKTRACE=1` for a backtrace.

---- check_count_some stdout ----
thread 'check_count_some' panicked at 'assertion failed: `(left == right)`
  left: `0`,
 right: `1`', tests/check.rs:43:5


failures:
    check_count_correct
    check_count_some

test result: FAILED. 7 passed; 2 failed; 0 ignored; 0 measured; 0 filtered out

error: test failed, to rerun pass '--test check'
c6668a3264ec208ae9733f83eb94b6312518f14a is the first bad commit
commit c6668a3264ec208ae9733f83eb94b6312518f14a
Author: Veedrac <[email protected]>
Date:   Wed Jan 31 02:54:17 2018 +0000

    Fancy new algorithm on stable SIMD

:100644 100644 f37644090b9886a0ed33293c5ac353b52631d2e3 c4b93928eda2ec61972beef49ed8c40860639d89 M      .travis.yml
:100644 100644 1eecc67f7677cdb843eba653d689faef0b3d7b13 91395c0abca0a56e7bf0f8075dbf83b25fc1a1e0 M      Cargo.toml
:100644 100644 29d6473403e8fcfe5ecb6d4b2d144cdd782fc7ea e76dfae2c603a0e87bb29efe4904200f09a56a4e M      appveyor.yml
:040000 040000 d0ac6b0edbf5fe91a4c2d499eb4e2f94ba91b799 47a3eb4dc78afe60b98f2b1f88f1ebd724ce690f M      src
bisect run success

Add no_std support

This is such a basic (in form of we need this everywhere, not in form of "simple") feature, I think it should be supported on embedded targets, too.

How does this work?

This library looks really interesting. As a person who hasn't done any real bit-twiddling yet, it is quite hard for me to get my mind around, though.

Could you explain, in non-bit-twiddler-terms, why this crate is faster than regular iterators / adaptors?

bytecount::num_chars returns wrong value if cpu doesn't support avx2 feature

PoC

use std::io;
use std::env::var;
use std::io::prelude::*;
use std::fs::File;

use bytecount;

fn main() -> io::Result<()> {
    let mut buffer = Vec::new();
    let mut f = File::open(format!("{}/sample.txt", var("HOME").unwrap()))?;
    f.read_to_end(&mut buffer)?;

    println!("num_chars = {}", bytecount::num_chars(&buffer) as isize);
    Ok(())
}

Results:

# cpu doesn't supports avx2 feature
$ ./test-avx2
num_chars = 4076

# cpu supports avx2 feature
$ ./test-avx2 
num_chars = 8172

$ python -c "from pathlib import Path; print('num_chars =', len(Path('~/sample.txt').expanduser().read_text()))"
num_chars = 8172

sample.txt

The last version cannot bench on 1.20 anymore

Note that it will still build alright on that version, but I had to switch to a criterion version which appears not to support compiling without quote etc. which fails on 1.20.

cc @bheisler

CLI version

Considering how we have projects like ripgrep, exa, and fd which aim to oxidise common shell utils, it would be great to create an alternative to the wc command that uses bytecount (I vote to name the binary ct, short for count).

Considering how the hard work has already been done for the actual counting, all that really would have to be done is parse command-line arguments with clap or similar. While we don't have an actual "word" counting functionality, 99% of the time I see wc is via wc -l anyway, and it seems pretty straightforward to make a binary that will do this with bytecount.

Aarch64 build appears broken

Intrinsic has incorrect return type!
ptr @llvm.aarch64.neon.ld1x4.v16i8.p0i8
in function _ZN4core9core_arch10arm_shared4neon9generated11vld1q_u8_x417h224d569be9c8b1b6E
LLVM ERROR: Broken function found, compilation aborted!
error: could not compile `bytecount` (lib)
warning: build failed, waiting for other jobs to finish...

$ uname -vm
Darwin Kernel Version 23.1.0: Mon Oct  9 21:28:45 PDT 2023; root:xnu-10002.41.9~6/RELEASE_ARM64_T6020 arm64

benchmark -C codegen-units=N

Apparently, the default changed to codegen-units=16. However, when benchmarking I don't see a clear winner between N=1 or N=16 – it shows some small win on byte counting and some small loss on num_chars, but in both cases, it's a wash. Perhaps someone else can benchmark on a beefier machine?

Use naive count below ~75 elements in haystack

My benchmarks show that naive count is usually faster for small haystacks, so adding an if guard to run naive below that threshold seems to be a no-brainer.

SSE2 path uses SSE4.1 intrinsic

#[target_feature(enable = "sse2")]
unsafe fn sum(u8s: &__m128i) -> usize {
    let sums = _mm_sad_epu8(*u8s, _mm_setzero_si128());
    (_mm_extract_epi32(sums, 0) + _mm_extract_epi32(sums, 2)) as usize
}

Per _mm_extract_epi32 documentation -- Available on (x86 or x86-64) and target feature sse4.1 and x86-64 only

Documentation, naive_count_32 vs naive_count

These two functions are presented, with naive_count_32 having the weakness that it can overflow. That's a trade-off I might be happy to make, but the documentation doesn't say what the performance difference is. Could that be added, at least in one benchmark, to enable a more informed decision?

Needs crates.io release

The most recent version of bytecount published to crates.io (0.6.0, published August 2019) does not compile. It looks like a fix was sent to bytecount in PR #63 but that fix has never been published.

[dependencies]
bytecount = { version = "*", features = ["generic-simd"] }

error[E0432]: unresolved import `crate::arch::x86_64::_mm_movemask_pi8`

	counts = _mm256_sub_epi8(
	counts,
	_mm256_cmpeq_epi8(mm256_from_offset(haystack, offset), needles)
	);

llogiq / bytecount Goto Github PK

bytecount's People

Contributors

Stargazers

Watchers

Forkers

bytecount's Issues

Recommend Projects

Recommend Topics

Recommend Org