
Comments (4)

koraa commented on August 16, 2024

Can you please open a PR and integrate your results in benchmark.sh?

from huniq.

koraa commented on August 16, 2024

Also, I could imagine that awk internally uses integer keys for its lookup table; this may be why it is faster. Numeric keys are not a case that huniq optimizes for: it treats all input as strings.


koraa commented on August 16, 2024

By the way, thank you for bringing this up. We don't say thank you enough in the open source world, and it's always great to have benchmarks counterchecked :)


wucke13 commented on August 16, 2024

I tried to quickly reproduce some of these results. I did not understand the initial echo, so I removed it, and I used hyperfine for the benchmarking since it seems to be a bit nicer regarding statistics. A first run on my laptop (which is admittedly plagued by thermal throttling) shows quite different results from those posted above:

Benchmark 1: pee "seq 1000000" "seq 1000000" "seq 1000000"
  Time (mean ± σ):      10.0 ms ±   0.6 ms    [User: 20.3 ms, System: 6.9 ms]
  Range (min … max):     8.7 ms …  11.8 ms    231 runs

Benchmark 1: pee "seq 1000000" "seq 1000000" "seq 1000000" | huniq
  Time (mean ± σ):     407.4 ms ±  10.3 ms    [User: 241.2 ms, System: 219.5 ms]
  Range (min … max):   385.2 ms … 424.4 ms    10 runs

Benchmark 2: pee "seq 1000000" "seq 1000000" "seq 1000000" | awk !a[$0]++{print}
  Time (mean ± σ):     281.9 ms ±   3.9 ms    [User: 307.5 ms, System: 18.8 ms]
  Range (min … max):   276.9 ms … 291.3 ms    10 runs

Benchmark 3: pee "seq 1000000" "seq 1000000" "seq 1000000" | runiq -
  Time (mean ± σ):     578.3 ms ±  14.5 ms    [User: 428.6 ms, System: 211.2 ms]
  Range (min … max):   547.7 ms … 595.4 ms    10 runs

Benchmark 4: pee "seq 1000000" "seq 1000000" "seq 1000000" | perl -e 'while(<>){if(!$s{$_}){print $_;$|=1;$s{$_}=1;}}'
  Time (mean ± σ):      1.332 s ±  0.031 s    [User: 1.168 s, System: 0.267 s]
  Range (min … max):    1.277 s …  1.365 s    10 runs

Benchmark 5: pee "seq 1000000" "seq 1000000" "seq 1000000" | sort | uniq
  Time (mean ± σ):      2.867 s ±  0.057 s    [User: 2.912 s, System: 0.073 s]
  Range (min … max):    2.760 s …  2.945 s    10 runs

Summary
  'pee "seq 1000000" "seq 1000000" "seq 1000000" | awk !a[$0]++{print}' ran
    1.45 ± 0.04 times faster than 'pee "seq 1000000" "seq 1000000" "seq 1000000" | huniq'
    2.05 ± 0.06 times faster than 'pee "seq 1000000" "seq 1000000" "seq 1000000" | runiq -'
    4.72 ± 0.13 times faster than 'pee "seq 1000000" "seq 1000000" "seq 1000000" | perl -e 'while(<>){if(!$s{$_}){print $_;$|=1;$s{$_}=1;}}''
   10.17 ± 0.25 times faster than 'pee "seq 1000000" "seq 1000000" "seq 1000000" | sort | uniq'

Takeaways:

  • on my machine, generating the input data takes about 10 ms
  • awk is about 1.5 times faster than huniq
  • both runiq and perl are slower than huniq
  • sort | uniq is indeed about ten times slower than the best performing option

I suspect that some compiler flags slipped into my build of huniq that may degrade its performance. I will have to investigate further.

I used the following script to generate these results:

#!/usr/bin/env nix-shell
#!nix-shell -i bash -p moreutils hyperfine runiq perl

# command to generate test input data
test_data_cmd='pee "seq 1000000" "seq 1000000" "seq 1000000"'

# sort | uniq challenger commands
cmds=(
'huniq'
'awk !a[$0]++{print}'
'runiq -'
$'perl -e \'while(<>){if(!$s{$_}){print $_;$|=1;$s{$_}=1;}}\''
'sort | uniq'
)

# confront each challenger with its input
prefixed_cmds=()
for cmd in "${cmds[@]}"
do
  prefixed_cmds+=("$test_data_cmd | $cmd")
done

# find out how long generating the input takes
hyperfine --warmup 3 "$test_data_cmd"

# bench all the different cmds
hyperfine --warmup 1 "${prefixed_cmds[@]}"
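
One thing that may be worth checking before reading too much into the awk number: hyperfine hands each command string to a shell, and in the script above the awk program is not quoted inside that string. The following is a minimal sketch, not part of the original benchmark, that assumes the default `sh -c` behavior and uses a made-up three-line input to compare what awk does with and without inner quotes. The variable and function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical check (not part of the original script): compare the awk
# challenger with and without quoting of the awk program text.

# As written in the cmds array: the awk program is a bare shell word.
unquoted='awk !a[$0]++{print}'
# Assumed alternative: the awk program is wrapped in single quotes.
quoted="awk '!a[\$0]++{print}'"

# Tiny input with one duplicate line.
input() { printf '1\n2\n1\n'; }

# Unquoted: the inner shell expands $0 to its own name before awk runs,
# so awk receives the program  !a[sh]++{print}  where "sh" is an unset
# awk variable, and every input line maps to the same key.
unquoted_out=$(input | sh -c "$unquoted")

# Quoted: awk receives $0 literally and keys on the line content.
quoted_out=$(input | sh -c "$quoted")

printf 'unquoted program printed:\n%s\n' "$unquoted_out"
printf 'quoted program printed:\n%s\n' "$quoted_out"
```

If the unquoted variant prints only the first line while the quoted variant prints each distinct line once, the awk challenger in the benchmark was doing far less work than the other tools, and its timing would not be comparable until the program text is quoted.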
