I've compared speed with AWK, and it's not much quicker.

Not much quicker than awk one-liner with numeric keys

kenorb commented on August 16, 2024

Not much quicker than awk one-liner with numeric keys

from huniq.

Comments (4)

koraa commented on August 16, 2024

Can you please open a PR and integrate your results in benchmark.sh?


koraa commented on August 16, 2024

Also, I could imagine that awk internally uses integer keys for its lookup table; this may be why it is faster. Numeric keys are not a case that huniq optimizes for. It treats all input as strings.
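One way to probe this hypothesis, as a rough sketch only: feed the tools keys that cannot be parsed as numbers and see whether the gap changes. The helper name `gen_string_keys` and the `key-` prefix are made up for illustration and are not part of huniq or benchmark.sh.

```bash
#!/usr/bin/env bash
# Hypothetical helper: emit 1..n three times, each line prefixed with a
# non-numeric tag so that no tool can treat the keys as integers.
gen_string_keys() {
    local n="$1"
    for _ in 1 2 3; do
        seq "$n" | sed 's/^/key-/'
    done
}

# Same deduplication one-liner as in the issue, quoted for the shell.
gen_string_keys 5 | awk '!a[$0]++'
```

Replacing the `seq` based generator in a benchmark with something like this would show whether awk keeps its lead on string keys.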


koraa commented on August 16, 2024

By the way, thank you for bringing this up, we don't say thank you enough in this open source world…it's always great to have benchmarks counterchecked :)


wucke13 commented on August 16, 2024

I tried to quickly reproduce some of these results. I did not understand the initial echo, so I removed it, and I used hyperfine for the benchmarking since it handles the statistics a bit more nicely. A first run on my laptop (admittedly plagued by thermal throttling) gives quite different results from those posted above:

Benchmark 1: pee "seq 1000000" "seq 1000000" "seq 1000000"
  Time (mean ± σ):      10.0 ms ±   0.6 ms    [User: 20.3 ms, System: 6.9 ms]
  Range (min … max):     8.7 ms …  11.8 ms    231 runs

Benchmark 1: pee "seq 1000000" "seq 1000000" "seq 1000000" | huniq
  Time (mean ± σ):     407.4 ms ±  10.3 ms    [User: 241.2 ms, System: 219.5 ms]
  Range (min … max):   385.2 ms … 424.4 ms    10 runs

Benchmark 2: pee "seq 1000000" "seq 1000000" "seq 1000000" | awk !a[$0]++{print}
  Time (mean ± σ):     281.9 ms ±   3.9 ms    [User: 307.5 ms, System: 18.8 ms]
  Range (min … max):   276.9 ms … 291.3 ms    10 runs

Benchmark 3: pee "seq 1000000" "seq 1000000" "seq 1000000" | runiq -
  Time (mean ± σ):     578.3 ms ±  14.5 ms    [User: 428.6 ms, System: 211.2 ms]
  Range (min … max):   547.7 ms … 595.4 ms    10 runs

Benchmark 4: pee "seq 1000000" "seq 1000000" "seq 1000000" | perl -e 'while(<>){if(!$s{$_}){print $_;$|=1;$s{$_}=1;}}'
  Time (mean ± σ):      1.332 s ±  0.031 s    [User: 1.168 s, System: 0.267 s]
  Range (min … max):    1.277 s …  1.365 s    10 runs

Benchmark 5: pee "seq 1000000" "seq 1000000" "seq 1000000" | sort | uniq
  Time (mean ± σ):      2.867 s ±  0.057 s    [User: 2.912 s, System: 0.073 s]
  Range (min … max):    2.760 s …  2.945 s    10 runs

Summary
  'pee "seq 1000000" "seq 1000000" "seq 1000000" | awk !a[$0]++{print}' ran
    1.45 ± 0.04 times faster than 'pee "seq 1000000" "seq 1000000" "seq 1000000" | huniq'
    2.05 ± 0.06 times faster than 'pee "seq 1000000" "seq 1000000" "seq 1000000" | runiq -'
    4.72 ± 0.13 times faster than 'pee "seq 1000000" "seq 1000000" "seq 1000000" | perl -e 'while(<>){if(!$s{$_}){print $_;$|=1;$s{$_}=1;}}''
   10.17 ± 0.25 times faster than 'pee "seq 1000000" "seq 1000000" "seq 1000000" | sort | uniq'

Takeaways:

on my machine, generating the input data takes about 10 ms
awk is about 1.5 times faster than huniq
both runiq and perl are slower than huniq
sort | uniq is indeed about ten times slower than the best-performing option
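The ratios behind these takeaways can be recomputed from the mean times in the hyperfine output above. This is just a sketch: the function name `speed_ratios` is made up, and because it divides the rounded means, the last digit can differ slightly from hyperfine's own summary.

```bash
#!/usr/bin/env bash
# Mean wall-clock times in milliseconds, copied from the hyperfine output.
# Each mean is divided by the awk mean (281.9 ms).
speed_ratios() {
    awk -F',' -v base=281.9 '{ printf "%-12s %.2f\n", $1, $2 / base }' <<'EOF'
huniq,407.4
runiq,578.3
perl,1332
sort | uniq,2867
EOF
}

speed_ratios
```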

I suspect that my build of huniq picked up some compiler flags that hurt its performance. I will have to investigate further.

I used the following script to generate these results:

#!/usr/bin/env nix-shell
#!nix-shell -i bash -p moreutils hyperfine runiq perl

# command to generate test input data
test_data_cmd='pee "seq 1000000" "seq 1000000" "seq 1000000"'

# sort | uniq challenger commands
cmds=(
  'huniq'
  # quoted so that the shell hyperfine spawns passes $0 to awk unexpanded
  $'awk \'!a[$0]++{print}\''
  'runiq -'
  $'perl -e \'while(<>){if(!$s{$_}){print $_;$|=1;$s{$_}=1;}}\''
  'sort | uniq'
)

# confront each challenger with its input
prefixed_cmds=()
for cmd in "${cmds[@]}"
do
  prefixed_cmds+=("$test_data_cmd | $cmd")
done

# find out how long generating the input takes
hyperfine --warmup 3 "$test_data_cmd"

# bench all the different cmds
hyperfine --warmup 1 "${prefixed_cmds[@]}"
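Before trusting the timings, it may be worth checking that every challenger actually emits the same set of lines, since a quoting slip can quietly change what a command does. A minimal sketch using only awk and sort | uniq follows; `gen_input`, `check_cmds` and `count_unique` are names invented here, and huniq or runiq can be appended to the array where they are installed.

```bash
#!/usr/bin/env bash
# Small stand-in for the benchmark input: 1..1000 repeated three times.
gen_input() { for _ in 1 2 3; do seq 1000; done; }

# Challengers to cross-check; extend with 'huniq' or 'runiq -' if installed.
check_cmds=(
    $'awk \'!a[$0]++\''
    'sort | uniq'
)

# Print the number of lines each challenger emits, run through sh -c
# (the same way hyperfine runs a command by default). All counts should agree.
count_unique() {
    local cmd
    for cmd in "${check_cmds[@]}"; do
        printf '%s\t%s\n' "$(gen_input | sh -c "$cmd" | wc -l | tr -d ' ')" "$cmd"
    done
}

count_unique
```

If one command reports a different count than the others, its timing is not comparable and the command string is the first thing to inspect.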
