Comments (7)
I think a match over every length from 0 to 8, with a for loop as the default arm, may win a little more. I'm not sure; I'll look into it. Very good work on these benchmarks, thank you.
from ntex.
I am not very proficient in asm; the new code works with bytes while the old code works with qwords. Should qwords be faster? I don't know.
You should run benchmarks.
@pickfire It's only four steps in the loop; I think it would go faster without the for.
fn xor_short(buf: ShortSlice<'_>, mut mask: u64) {
    // Unsafe: we know that a `ShortSlice` fits in a u64
    unsafe {
        match buf.0.len() {
            8 => {
                *(buf.0.as_mut_ptr() as *mut u64) ^= mask;
            }
            4 => {
                *(buf.0.as_mut_ptr() as *mut u32) ^= mask as u32;
            }
            2 => {
                *(buf.0.as_mut_ptr() as *mut u16) ^= mask as u16;
            }
            1 => {
                *(buf.0.as_mut_ptr() as *mut u8) ^= mask as u8;
            }
            _ => {
                for i in buf.0.iter_mut() {
                    *i ^= mask as u8;
                    mask >>= 8;
                }
            }
        }
    }
}
This is what works best for me.
I wrote a new one (I was doing a cast before, since I didn't know as_mut_ptr could be used this way).
pub fn xor_short_unsafe(buf: ShortSlice<'_>, mask: u64) {
    // This may flip other bytes past the slice; not sure if this is safe
    unsafe {
        *(buf.0.as_mut_ptr() as *mut u64) ^= mask;
    }
}
This should be faster since it works with a qword, but I'm not sure if it's safe to flip the other bytes. Can the CPU flip multiple bytes at once?
example::xor_short_unsafe:
        xor     qword ptr [rdi], rdx
        ret
The assembly is also short.
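For what it's worth, the out-of-bounds read/write for slices shorter than 8 bytes can be avoided while keeping a single qword XOR by going through a temporary. This is only a sketch (my own naming, not from the thread); on little-endian targets the low mask bytes line up with the first slice bytes, like the byte loop, and the compiler usually turns the two small copies into register moves, which is essentially what the `copy_nonoverlapping`-based `xor_short` later in the thread does:

```rust
// Sketch: XOR up to 8 bytes through a temporary u64 so we never touch
// memory outside `buf`. Assumes buf.len() <= 8.
fn xor_short_copy(buf: &mut [u8], mask: u64) {
    let len = buf.len();
    let mut tmp = [0u8; 8];
    tmp[..len].copy_from_slice(buf); // read only `len` bytes
    let word = u64::from_ne_bytes(tmp) ^ mask; // single qword XOR
    buf.copy_from_slice(&word.to_ne_bytes()[..len]); // write only `len` bytes
}
```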
@botika The way you showed has a lot of branches, and when the length is odd we still have to go through those branches only to end up in the same fallback loop. I wonder if we could do it in a binary way, going from 8 to 4 to 2 to 1 and peeling off smaller chunks each time. I didn't benchmark it to know which one is faster.
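The 8/4/2/1 cascade could be sketched like this, peeling off the largest chunk that still fits (hypothetical code with my own naming, not benchmarked; `mask >>= 32` / `mask >>= 16` assumes little-endian byte order so the remaining mask bytes line up with the remaining buffer bytes):

```rust
use std::convert::TryInto;
use std::mem::take;

// Sketch of the cascading idea: handle a 4-byte chunk if it fits, then a
// 2-byte chunk, then a final byte, so an odd length like 7 becomes
// 4 + 2 + 1 instead of a byte-by-byte loop. Assumes buf.len() < 8.
fn xor_short_cascade(mut buf: &mut [u8], mut mask: u64) {
    if buf.len() >= 4 {
        let (chunk, rest) = take(&mut buf).split_at_mut(4);
        let v = u32::from_ne_bytes((&*chunk).try_into().unwrap()) ^ mask as u32;
        chunk.copy_from_slice(&v.to_ne_bytes());
        buf = rest;
        mask >>= 32; // advance to the mask bytes for the remainder
    }
    if buf.len() >= 2 {
        let (chunk, rest) = take(&mut buf).split_at_mut(2);
        let v = u16::from_ne_bytes((&*chunk).try_into().unwrap()) ^ mask as u16;
        chunk.copy_from_slice(&v.to_ne_bytes());
        buf = rest;
        mask >>= 16;
    }
    if let Some(b) = buf.first_mut() {
        *b ^= mask as u8;
    }
}
```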
The latter is not safe for len <= 4; theoretically it would also be unsafe for any len < 8, but it amounts to the same thing.
I did a benchmark. The one @botika showed seemed to be slightly faster when there are exactly 16 elements; with 15 elements it behaves as slowly as the fallback. I think some other parts may cause the slowness in the smaller timings. Updated godbolt: https://rust.godbolt.org/z/bTK85W
test apply_mask_01 ... bench: 16,459 ns/iter (+/- 164)
test apply_mask_02 ... bench: 16,940 ns/iter (+/- 301)
test apply_mask_03 ... bench: 19,552 ns/iter (+/- 239)
test apply_mask_04 ... bench: 16,478 ns/iter (+/- 84)
test apply_mask_07 ... bench: 19,957 ns/iter (+/- 412)
test apply_mask_08 ... bench: 13,183 ns/iter (+/- 489)
test apply_mask_15 ... bench: 22,591 ns/iter (+/- 351)
test apply_mask_16 ... bench: 22,584 ns/iter (+/- 36)
test apply_mask_fallback_01 ... bench: 10,907 ns/iter (+/- 510)
test apply_mask_fallback_02 ... bench: 11,325 ns/iter (+/- 484)
test apply_mask_fallback_03 ... bench: 11,642 ns/iter (+/- 299)
test apply_mask_fallback_04 ... bench: 11,979 ns/iter (+/- 318)
test apply_mask_fallback_07 ... bench: 13,423 ns/iter (+/- 393)
test apply_mask_fallback_08 ... bench: 13,708 ns/iter (+/- 337)
test apply_mask_fallback_15 ... bench: 16,172 ns/iter (+/- 693)
test apply_mask_fallback_16 ... bench: 16,555 ns/iter (+/- 326)
test apply_mask_multi_01 ... bench: 12,532 ns/iter (+/- 316)
test apply_mask_multi_02 ... bench: 12,553 ns/iter (+/- 363)
test apply_mask_multi_03 ... bench: 14,420 ns/iter (+/- 513)
test apply_mask_multi_04 ... bench: 12,406 ns/iter (+/- 540)
test apply_mask_multi_07 ... bench: 14,992 ns/iter (+/- 345)
test apply_mask_multi_08 ... bench: 13,272 ns/iter (+/- 620)
test apply_mask_multi_15 ... bench: 15,604 ns/iter (+/- 314)
test apply_mask_multi_16 ... bench: 16,316 ns/iter (+/- 221)
test apply_mask_new_01 ... bench: 13,514 ns/iter (+/- 58)
test apply_mask_new_02 ... bench: 13,798 ns/iter (+/- 241)
test apply_mask_new_03 ... bench: 13,950 ns/iter (+/- 1,403)
test apply_mask_new_04 ... bench: 13,760 ns/iter (+/- 265)
test apply_mask_new_07 ... bench: 14,833 ns/iter (+/- 122)
test apply_mask_new_08 ... bench: 13,458 ns/iter (+/- 436)
test apply_mask_new_15 ... bench: 15,204 ns/iter (+/- 95)
test apply_mask_new_16 ... bench: 15,656 ns/iter (+/- 186)
test apply_mask_unsafe_01 ... bench: 12,949 ns/iter (+/- 59)
test apply_mask_unsafe_02 ... bench: 12,285 ns/iter (+/- 89)
test apply_mask_unsafe_03 ... bench: 12,944 ns/iter (+/- 22)
test apply_mask_unsafe_04 ... bench: 12,987 ns/iter (+/- 44)
test apply_mask_unsafe_07 ... bench: 12,699 ns/iter (+/- 98)
test apply_mask_unsafe_08 ... bench: 13,198 ns/iter (+/- 766)
test apply_mask_unsafe_15 ... bench: 13,451 ns/iter (+/- 28)
test apply_mask_unsafe_16 ... bench: 13,711 ns/iter (+/- 22)
The fallback seemed to perform quite well with smaller inputs. From what I see it didn't even have panic code, and uses qword.
Code:
#![feature(test)]

use std::ptr::copy_nonoverlapping;
use std::slice;

pub struct ShortSlice<'a>(&'a mut [u8]);

impl<'a> ShortSlice<'a> {
    unsafe fn new(slice: &'a mut [u8]) -> Self {
        // Sanity check for debug builds
        debug_assert!(slice.len() < 8);
        ShortSlice(slice)
    }

    fn len(&self) -> usize {
        self.0.len()
    }
}

#[inline]
#[allow(clippy::needless_pass_by_value)]
pub fn xor_short(buf: ShortSlice<'_>, mask: u64) {
    // Unsafe: we know that a `ShortSlice` fits in a u64
    unsafe {
        let (ptr, len) = (buf.0.as_mut_ptr(), buf.0.len());
        let mut b: u64 = 0;
        #[allow(trivial_casts)]
        copy_nonoverlapping(ptr, &mut b as *mut _ as *mut u8, len);
        b ^= mask;
        #[allow(trivial_casts)]
        copy_nonoverlapping(&b as *const _ as *const u8, ptr, len);
    }
}

#[inline]
pub fn xor_short_new(buf: ShortSlice<'_>, mut mask: u64) {
    for i in buf.0.iter_mut() {
        *i ^= mask as u8;
        mask >>= 8;
    }
}

#[inline]
pub fn xor_short_multi(buf: ShortSlice<'_>, mut mask: u64) {
    // Unsafe: we know that a `ShortSlice` fits in a u64
    unsafe {
        match buf.0.len() {
            8 => {
                *(buf.0.as_mut_ptr() as *mut u64) ^= mask;
            }
            4 => {
                *(buf.0.as_mut_ptr() as *mut u32) ^= mask as u32;
            }
            2 => {
                *(buf.0.as_mut_ptr() as *mut u16) ^= mask as u16;
            }
            1 => {
                *(buf.0.as_mut_ptr() as *mut u8) ^= mask as u8;
            }
            _ => {
                for i in buf.0.iter_mut() {
                    *i ^= mask as u8;
                    mask >>= 8;
                }
            }
        }
    }
}

#[inline]
pub fn xor_short_unsafe(buf: ShortSlice<'_>, mask: u64) {
    // This may flip other bytes past the slice; not sure if this is safe
    unsafe {
        *(buf.0.as_mut_ptr() as *mut u64) ^= mask;
    }
}
#[allow(clippy::cast_lossless)]
pub fn apply_mask(buf: &mut [u8], mask_u32: u32) {
    // Extend the mask to 64 bits
    let mut mask_u64 = ((mask_u32 as u64) << 32) | (mask_u32 as u64);
    // Split the buffer into three segments
    let (head, mid, tail) = align_buf(buf);

    // Initial unaligned segment
    let head_len = head.len();
    if head_len > 0 {
        xor_short(head, mask_u64);
        if cfg!(target_endian = "big") {
            mask_u64 = mask_u64.rotate_left(8 * head_len as u32);
        } else {
            mask_u64 = mask_u64.rotate_right(8 * head_len as u32);
        }
    }

    // Aligned segment
    for v in mid {
        *v ^= mask_u64;
    }

    // Final unaligned segment
    if tail.len() > 0 {
        xor_short(tail, mask_u64);
    }
}
#[allow(clippy::cast_lossless)]
pub fn apply_mask_new(buf: &mut [u8], mask_u32: u32) {
    // Extend the mask to 64 bits
    let mut mask_u64 = ((mask_u32 as u64) << 32) | (mask_u32 as u64);
    // Split the buffer into three segments
    let (head, mid, tail) = align_buf(buf);

    // Initial unaligned segment
    let head_len = head.len();
    if head_len > 0 {
        xor_short_new(head, mask_u64);
        if cfg!(target_endian = "big") {
            mask_u64 = mask_u64.rotate_left(8 * head_len as u32);
        } else {
            mask_u64 = mask_u64.rotate_right(8 * head_len as u32);
        }
    }

    // Aligned segment
    for v in mid {
        *v ^= mask_u64;
    }

    // Final unaligned segment
    if tail.len() > 0 {
        xor_short_new(tail, mask_u64);
    }
}
#[allow(clippy::cast_lossless)]
pub fn apply_mask_multi(buf: &mut [u8], mask_u32: u32) {
    // Extend the mask to 64 bits
    let mut mask_u64 = ((mask_u32 as u64) << 32) | (mask_u32 as u64);
    // Split the buffer into three segments
    let (head, mid, tail) = align_buf(buf);

    // Initial unaligned segment
    let head_len = head.len();
    if head_len > 0 {
        xor_short_multi(head, mask_u64);
        if cfg!(target_endian = "big") {
            mask_u64 = mask_u64.rotate_left(8 * head_len as u32);
        } else {
            mask_u64 = mask_u64.rotate_right(8 * head_len as u32);
        }
    }

    // Aligned segment
    for v in mid {
        *v ^= mask_u64;
    }

    // Final unaligned segment
    if tail.len() > 0 {
        xor_short_multi(tail, mask_u64);
    }
}
#[allow(clippy::cast_lossless)]
pub fn apply_mask_unsafe(buf: &mut [u8], mask_u32: u32) {
    // Extend the mask to 64 bits
    let mut mask_u64 = ((mask_u32 as u64) << 32) | (mask_u32 as u64);
    // Split the buffer into three segments
    let (head, mid, tail) = align_buf(buf);

    // Initial unaligned segment
    let head_len = head.len();
    if head_len > 0 {
        xor_short_unsafe(head, mask_u64);
        if cfg!(target_endian = "big") {
            mask_u64 = mask_u64.rotate_left(8 * head_len as u32);
        } else {
            mask_u64 = mask_u64.rotate_right(8 * head_len as u32);
        }
    }

    // Aligned segment
    for v in mid {
        *v ^= mask_u64;
    }

    // Final unaligned segment
    if tail.len() > 0 {
        xor_short_unsafe(tail, mask_u64);
    }
}
#[inline]
// Splits a slice into three parts: an unaligned short head and tail, plus an
// aligned u64 mid section.
fn align_buf(buf: &mut [u8]) -> (ShortSlice<'_>, &mut [u64], ShortSlice<'_>) {
    let start_ptr = buf.as_ptr() as usize;
    let end_ptr = start_ptr + buf.len();

    // Round *up* to next aligned boundary for start
    let start_aligned = (start_ptr + 7) & !0x7;
    // Round *down* to last aligned boundary for end
    let end_aligned = end_ptr & !0x7;

    if end_aligned >= start_aligned {
        // We have our three segments (head, mid, tail)
        let (tmp, tail) = buf.split_at_mut(end_aligned - start_ptr);
        let (head, mid) = tmp.split_at_mut(start_aligned - start_ptr);
        // Unsafe: we know the middle section is correctly aligned, and the
        // outer sections are smaller than 8 bytes
        unsafe { (ShortSlice::new(head), cast_slice(mid), ShortSlice(tail)) }
    } else {
        // We didn't cross even one aligned boundary!
        // Unsafe: The outer sections are smaller than 8 bytes
        unsafe { (ShortSlice::new(buf), &mut [], ShortSlice::new(&mut [])) }
    }
}

#[inline]
// Unsafe: caller must ensure the buffer has the correct size and alignment
unsafe fn cast_slice(buf: &mut [u8]) -> &mut [u64] {
    // Assert correct size and alignment in debug builds
    debug_assert!(buf.len().trailing_zeros() >= 3);
    debug_assert!((buf.as_ptr() as usize).trailing_zeros() >= 3);

    slice::from_raw_parts_mut(buf.as_mut_ptr() as *mut u64, buf.len() >> 3)
}

pub fn apply_mask_fallback(buf: &mut [u8], mask: [u8; 4]) {
    for (i, byte) in buf.iter_mut().enumerate() {
        *byte ^= mask[i & 3];
    }
}
extern crate test;
use test::{black_box, Bencher};

macro_rules! bench {
    ($fnname:ident, $func:ident, $unmasked:expr) => {
        #[bench]
        fn $fnname(b: &mut Bencher) {
            let mask = [0x6d, 0xb6, 0xb2, 0x80];
            let mask_u32 = u32::from_le_bytes(mask);
            b.iter(|| {
                for _ in 0..1000 {
                    let mut unmasked = black_box($unmasked);
                    $func(&mut unmasked, mask_u32);
                }
            });
        }
    };
}

macro_rules! bench_fallback {
    ($fnname:ident, $unmasked:expr) => {
        #[bench]
        fn $fnname(b: &mut Bencher) {
            let mask = [0x6d, 0xb6, 0xb2, 0x80];
            b.iter(|| {
                for _ in 0..1000 {
                    let mut unmasked = black_box($unmasked);
                    apply_mask_fallback(&mut unmasked, mask);
                }
            });
        }
    };
}
bench!(apply_mask_16, apply_mask, vec![0xf3, 0x00, 0x01, 0x02, 0x03, 0x80, 0x81, 0x82, 0xff, 0xfe, 0x00, 0x17, 0x74, 0xf9, 0x12]);
bench!(apply_mask_new_16, apply_mask_new, vec![0xf3, 0x00, 0x01, 0x02, 0x03, 0x80, 0x81, 0x82, 0xff, 0xfe, 0x00, 0x17, 0x74, 0xf9, 0x12]);
bench!(apply_mask_multi_16, apply_mask_multi, vec![0xf3, 0x00, 0x01, 0x02, 0x03, 0x80, 0x81, 0x82, 0xff, 0xfe, 0x00, 0x17, 0x74, 0xf9, 0x12]);
bench!(apply_mask_unsafe_16, apply_mask_unsafe, vec![0xf3, 0x00, 0x01, 0x02, 0x03, 0x80, 0x81, 0x82, 0xff, 0xfe, 0x00, 0x17, 0x74, 0xf9, 0x12]);
bench_fallback!(apply_mask_fallback_16, vec![0xf3, 0x00, 0x01, 0x02, 0x03, 0x80, 0x81, 0x82, 0xff, 0xfe, 0x00, 0x17, 0x74, 0xf9, 0x12]);
bench!(apply_mask_15, apply_mask, vec![0xf3, 0x00, 0x01, 0x02, 0x03, 0x80, 0x81, 0x82, 0xff, 0xfe, 0x00, 0x17, 0x74, 0xf9]);
bench!(apply_mask_new_15, apply_mask_new, vec![0xf3, 0x00, 0x01, 0x02, 0x03, 0x80, 0x81, 0x82, 0xff, 0xfe, 0x00, 0x17, 0x74, 0xf9]);
bench!(apply_mask_multi_15, apply_mask_multi, vec![0xf3, 0x00, 0x01, 0x02, 0x03, 0x80, 0x81, 0x82, 0xff, 0xfe, 0x00, 0x17, 0x74, 0xf9]);
bench!(apply_mask_unsafe_15, apply_mask_unsafe, vec![0xf3, 0x00, 0x01, 0x02, 0x03, 0x80, 0x81, 0x82, 0xff, 0xfe, 0x00, 0x17, 0x74, 0xf9]);
bench_fallback!(apply_mask_fallback_15, vec![0xf3, 0x00, 0x01, 0x02, 0x03, 0x80, 0x81, 0x82, 0xff, 0xfe, 0x00, 0x17, 0x74, 0xf9]);
bench!(apply_mask_08, apply_mask, vec![0xf3, 0x00, 0x01, 0x02, 0x03, 0x80, 0x81, 0x82]);
bench!(apply_mask_new_08, apply_mask_new, vec![0xf3, 0x00, 0x01, 0x02, 0x03, 0x80, 0x81, 0x82]);
bench!(apply_mask_multi_08, apply_mask_multi, vec![0xf3, 0x00, 0x01, 0x02, 0x03, 0x80, 0x81, 0x82]);
bench!(apply_mask_unsafe_08, apply_mask_unsafe, vec![0xf3, 0x00, 0x01, 0x02, 0x03, 0x80, 0x81, 0x82]);
bench_fallback!(apply_mask_fallback_08, vec![0xf3, 0x00, 0x01, 0x02, 0x03, 0x80, 0x81, 0x82]);
bench!(apply_mask_07, apply_mask, vec![0xf3, 0x00, 0x01, 0x02, 0x03, 0x80, 0x81]);
bench!(apply_mask_new_07, apply_mask_new, vec![0xf3, 0x00, 0x01, 0x02, 0x03, 0x80, 0x81]);
bench!(apply_mask_multi_07, apply_mask_multi, vec![0xf3, 0x00, 0x01, 0x02, 0x03, 0x80, 0x81]);
bench!(apply_mask_unsafe_07, apply_mask_unsafe, vec![0xf3, 0x00, 0x01, 0x02, 0x03, 0x80, 0x81]);
bench_fallback!(apply_mask_fallback_07, vec![0xf3, 0x00, 0x01, 0x02, 0x03, 0x80, 0x81]);
bench!(apply_mask_04, apply_mask, vec![0xf3, 0x00, 0x01, 0x02]);
bench!(apply_mask_new_04, apply_mask_new, vec![0xf3, 0x00, 0x01, 0x02]);
bench!(apply_mask_multi_04, apply_mask_multi, vec![0xf3, 0x00, 0x01, 0x02]);
bench!(apply_mask_unsafe_04, apply_mask_unsafe, vec![0xf3, 0x00, 0x01, 0x02]);
bench_fallback!(apply_mask_fallback_04, vec![0xf3, 0x00, 0x01, 0x02]);
bench!(apply_mask_03, apply_mask, vec![0xf3, 0x00, 0x01]);
bench!(apply_mask_new_03, apply_mask_new, vec![0xf3, 0x00, 0x01]);
bench!(apply_mask_multi_03, apply_mask_multi, vec![0xf3, 0x00, 0x01]);
bench!(apply_mask_unsafe_03, apply_mask_unsafe, vec![0xf3, 0x00, 0x01]);
bench_fallback!(apply_mask_fallback_03, vec![0xf3, 0x00, 0x01]);
bench!(apply_mask_02, apply_mask, vec![0xf3, 0x00]);
bench!(apply_mask_new_02, apply_mask_new, vec![0xf3, 0x00]);
bench!(apply_mask_multi_02, apply_mask_multi, vec![0xf3, 0x00]);
bench!(apply_mask_unsafe_02, apply_mask_unsafe, vec![0xf3, 0x00]);
bench_fallback!(apply_mask_fallback_02, vec![0xf3, 0x00]);
bench!(apply_mask_01, apply_mask, vec![0xf3]);
bench!(apply_mask_new_01, apply_mask_new, vec![0xf3]);
bench!(apply_mask_multi_01, apply_mask_multi, vec![0xf3]);
bench!(apply_mask_unsafe_01, apply_mask_unsafe, vec![0xf3]);
bench_fallback!(apply_mask_fallback_01, vec![0xf3]);
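Since the thread keeps circling around whether the unsafe casts are worth it, here is a fully safe variant for comparison. It is my own sketch (the name is not from the crate); it behaves exactly like apply_mask_fallback on any endianness, and it would need benchmarking to see whether the compiler optimizes it as well as the unsafe versions:

```rust
use std::convert::TryInto;

// Safe sketch: XOR 8-byte chunks through from_ne_bytes/to_ne_bytes and
// finish the remainder byte-by-byte. Equivalent to apply_mask_fallback:
// byte i is always XORed with mask[i & 3], regardless of endianness.
pub fn apply_mask_safe(buf: &mut [u8], mask: [u8; 4]) {
    // Repeat the 4 mask bytes into a u64 in memory order.
    let mask_u64 = u64::from_ne_bytes([
        mask[0], mask[1], mask[2], mask[3],
        mask[0], mask[1], mask[2], mask[3],
    ]);
    let mut chunks = buf.chunks_exact_mut(8);
    for chunk in &mut chunks {
        let v = u64::from_ne_bytes((&*chunk).try_into().unwrap()) ^ mask_u64;
        chunk.copy_from_slice(&v.to_ne_bytes());
    }
    // At most 7 trailing bytes; the remainder starts at a multiple of 8,
    // so `i & 3` continues the mask cycle correctly.
    for (i, byte) in chunks.into_remainder().iter_mut().enumerate() {
        *byte ^= mask[i & 3];
    }
}
```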
close on no activity