Comments (4)
Can you please open a PR and integrate your results in benchmark.sh?
from huniq.
Also, I could imagine that awk internally uses integer keys for the lookup table in the d; this may be why it is faster. Numeric keys are not a case that huniq optimizes for; it treats all input as strings.
from huniq.
By the way, thank you for bringing this up, we don't say thank you enough in this open source world…it's always great to have benchmarks counterchecked :)
from huniq.
I tried to quickly reproduce some of these results. I did not understand the initial echo (therefore I removed it), and I used hyperfine for the benchmarking, since it seems to be a bit nicer regarding statistics. A first run on my laptop (admittedly plagued by thermal throttling) shows quite different results from those posted above:
Benchmark 1: pee "seq 1000000" "seq 1000000" "seq 1000000"
Time (mean ± σ): 10.0 ms ± 0.6 ms [User: 20.3 ms, System: 6.9 ms]
Range (min … max): 8.7 ms … 11.8 ms 231 runs
Benchmark 1: pee "seq 1000000" "seq 1000000" "seq 1000000" | huniq
Time (mean ± σ): 407.4 ms ± 10.3 ms [User: 241.2 ms, System: 219.5 ms]
Range (min … max): 385.2 ms … 424.4 ms 10 runs
Benchmark 2: pee "seq 1000000" "seq 1000000" "seq 1000000" | awk !a[$0]++{print}
Time (mean ± σ): 281.9 ms ± 3.9 ms [User: 307.5 ms, System: 18.8 ms]
Range (min … max): 276.9 ms … 291.3 ms 10 runs
Benchmark 3: pee "seq 1000000" "seq 1000000" "seq 1000000" | runiq -
Time (mean ± σ): 578.3 ms ± 14.5 ms [User: 428.6 ms, System: 211.2 ms]
Range (min … max): 547.7 ms … 595.4 ms 10 runs
Benchmark 4: pee "seq 1000000" "seq 1000000" "seq 1000000" | perl -e 'while(<>){if(!$s{$_}){print $_;$|=1;$s{$_}=1;}}'
Time (mean ± σ): 1.332 s ± 0.031 s [User: 1.168 s, System: 0.267 s]
Range (min … max): 1.277 s … 1.365 s 10 runs
Benchmark 5: pee "seq 1000000" "seq 1000000" "seq 1000000" | sort | uniq
Time (mean ± σ): 2.867 s ± 0.057 s [User: 2.912 s, System: 0.073 s]
Range (min … max): 2.760 s … 2.945 s 10 runs
Summary
'pee "seq 1000000" "seq 1000000" "seq 1000000" | awk !a[$0]++{print}' ran
1.45 ± 0.04 times faster than 'pee "seq 1000000" "seq 1000000" "seq 1000000" | huniq'
2.05 ± 0.06 times faster than 'pee "seq 1000000" "seq 1000000" "seq 1000000" | runiq -'
4.72 ± 0.13 times faster than 'pee "seq 1000000" "seq 1000000" "seq 1000000" | perl -e 'while(<>){if(!$s{$_}){print $_;$|=1;$s{$_}=1;}}''
10.17 ± 0.25 times faster than 'pee "seq 1000000" "seq 1000000" "seq 1000000" | sort | uniq'
Takeaways:
- on my machine, about 10ms are required to generate the input data
- awk is about 1.5 times faster than huniq
- both runiq and perl are slower than huniq
- sort | uniq is indeed about ten times slower than the best performing option
I suspect that some compiler flags slipped into my build of huniq and may be degrading its performance. I will have to investigate more.
I used the following script to generate these results:
#!/usr/bin/env nix-shell
#!nix-shell -i bash -p moreutils hyperfine runiq perl
# command to generate test input data
test_data_cmd='pee "seq 1000000" "seq 1000000" "seq 1000000"'

# sort | uniq challenger commands
cmds=(
  'huniq'
  'awk !a[$0]++{print}'
  'runiq -'
  $'perl -e \'while(<>){if(!$s{$_}){print $_;$|=1;$s{$_}=1;}}\''
  'sort | uniq'
)

# confront each challenger with its input
prefixed_cmds=()
for cmd in "${cmds[@]}"
do
  prefixed_cmds+=("$test_data_cmd | $cmd")
done

# find out how long the generation of input takes
hyperfine --warmup 3 "$test_data_cmd"

# bench all the different cmds
hyperfine --warmup 1 "${prefixed_cmds[@]}"
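One thing I would double-check before trusting the awk number: assuming hyperfine hands each command string to `sh -c` (its default on Unix, as far as I know), the awk program in the script above reaches that inner shell unquoted, so `$0` would be expanded by the shell before awk ever sees it. The sketch below compares the two variants on a four-line input. The `printf` test input and the variable names `unquoted` and `quoted` are mine, purely for illustration.

```shell
# Sketch: what does awk receive when its program is not quoted inside `sh -c`?

# Unquoted: the inner shell expands $0 (its own name) before awk starts,
# so awk no longer keys the array on the current input line.
unquoted=$(printf '1\n2\n2\n3\n' | sh -c 'awk !a[$0]++{print}' | wc -l | tr -d ' ')

# Quoted: awk sees the literal $0 and deduplicates whole lines.
quoted=$(printf '1\n2\n2\n3\n' | sh -c "awk '!a[\$0]++{print}'" | wc -l | tr -d ' ')

echo "unquoted: $unquoted line(s), quoted: $quoted line(s)"
```

If the first count comes out smaller than the second, the unquoted variant is not deduplicating whole lines, and its timing would not be comparable to the other tools. Inside the `cmds` array, the same `$'...'` quoting already used for the perl entry should keep the awk program intact.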
from huniq.