koraa / huniq Goto Github PK

Filter out duplicates on the command line. Replacement for `sort | uniq` optimized for speed (10x faster) when sorting is not needed.

Shell 27.75% Rust 72.25%

cli rust tools

huniq's Introduction

huniq version 2

Command line utility to remove duplicates from the given input. Note that huniq does not sort the input, it just removes duplicates.

SYNOPSIS: huniq -h # Shows help
SYNOPSIS: huniq [-c|--count] [-0|--null|-d DELIM|--delim DELIM]

$ echo -e "foo\nbar\nfoo\nbaz" | huniq
foo
bar
baz

$ echo -e "foo\nbar\nfoo\nbaz" | huniq -c
1 baz
2 foo
1 bar

huniq replaces sort | uniq (or sort -u with gnu sort) and huniq -c replaces sort | uniq -c, assuming the data is sorted just so it can be passed to uniq. If having sorted output is desired, sort | uniq should still be used.

The order of the output is stable when in normal mode, but it is not stable when in -c/count mode.

Installation

$ cargo install huniq

Motivation

Sorting is slow. By using hash tables/hash sets instead of sorting the input huniq is generally faster than sort -u or sort | uniq -c when testing with gnu sort/gnu uniq.

Version History

Version 1 can be found here.

Changes made in version 2:

The -d/-0 flags where added so you can specify custom delimiters
Completely rewritten in rust.
Version two is (depending on which benchmark you look at below) between 3.5x and 6.5x faster than version 1

Build

cargo build --release

To run the tests execute:

bash ./test.sh

Benchmark

You can use bash ./benchmark.sh to execute the benchmarks. They will execute until you manually abort them (e.g. by pressing Ctrl-C).

The benchmarks work by repeatedly feeding the implementations with data from /usr/share/dict/* and measuring memory usage and time needed to process the data with the unix time tool.

For the uniq algorithm, the results are posted below: We can see that the rust implementation blows pretty much anything else out of the water in terms of performance. Use sort only if you really need a coffee break, because you won't get it with huniq! It beats the C++ implementation by a factor of between 6.5 (for very few duplicates) and 3.5 (around 98% duplicates). Compared to sort -u: huniq is around 30 times faster.

If memory efficiency is what you are looking for, use datamash which is not as fast as huniq but uses the least memory (by a factor of around 3); failing that use sort|uniq which is a lot slower but uses just very slightly more memory than datamash.

repetitions  implementation  seconds  memory/kb
1            huniq2-rust        0.26      29524
1            huniq1-c++         1.67      26188
1            awk                1.63     321936
1            datamash           1.78       9644
1            shell              7.30       9736
2            huniq2-rust        0.84      29592
2            huniq1-c++         3.28      26180
2            awk                3.71     322012
2            datamash           4.60       9636
2            shell             16.68       9740
5            huniq2-rust        2.02      29648
5            huniq1-c++         6.21      26184
5            awk                7.69     322012
5            datamash           9.10       9992
5            shell             44.71      10184
10           huniq2-rust        3.40      29676
10           huniq1-c++        12.84      26172
10           awk               16.73     321940
10           datamash          24.44      10032
10           shell             93.75      10036
50           huniq2-rust       14.68      29612
50           huniq1-c++        55.32      26200
50           awk               74.91     321940
50           datamash         103.54      10936
50           shell            453.94      10956
100          huniq2-rust       43.65      29492
100          huniq1-c++       154.99      26180
100          awk              239.66     321956
100          datamash         285.94      12148
100          shell           1062.07      12208

For the counting huniq -c implementation, the speed advantage was less pronounced: Here the rust implementation is between 25% and 50% faster than the C++ implementation and between 5x and 10x faster than sort | uniq -c.

The increased memory usage of the rust implementation is much worse though: The rust implementation needs about 2.2x more memory than the C++ implementation and between 10x and 12x more memory than sort | uniq.

repetitions  implemetation  seconds  memory/kb
1            huniq2-rust       1.47     132096
1            huniq1-c++        1.85      60196
1            awk               2.79     362940
1            datamash          2.28       9636
1            shell             7.71      11716
2            huniq2-rust       2.32     132052
2            huniq1-c++        2.98      60156
2            awk               4.65     363016
2            datamash          5.27       9732
2            shell            16.37      11680
5            huniq2-rust       4.98     132092
5            huniq1-c++        7.54      60128
5            awk               9.37     363016
5            datamash         11.22       9964
5            shell            44.77      11948
10           huniq2-rust       8.81     132048
10           huniq1-c++       13.55      60196
10           awk              16.19     363032
10           datamash         25.12       9908
10           shell            90.01      11976
50           huniq2-rust      45.89     132092
50           huniq1-c++       74.04      60104
50           awk              85.43     362956
50           datamash        141.90      10996
50           shell           454.42      12876
100          huniq2-rust      90.80     132080
100          huniq1-c++      150.41      60196
100          awk             163.13     363008
100          datamash        322.70      12212
100          shell           933.67      14100

Future direction

Feature wise huniq is pretty much complete, but the performance and memory usage should be improved in the future.

This first of all involves a better benchmarking setup which will probably consist of an extra rust application that will use RNGs to generate test data for huniq and take parameters like the number of elements to create, the rate of duplicates (0-1) the length of strings to output and so on…

Then based on the improved benchmarking capabilities, some optimizations should be tried like short string optimization, arena allocation, different hash functions, using memory optimized hash tables, using an identity function for the uniq function (we already feed it with hashes, so a second round of hashing is not necessary).

License

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
Neither the name of the Karolin Varner nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL Softwear, BV BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

huniq's People

Contributors

Stargazers

Watchers

Forkers

louy2 dippi mardukbp alexmaco freaky icodein orhun wucke13 ludiosarchive cturan vbauerster

huniq's Issues

Rust based benchmarks & Tests

Requirments

The benchmarks should be entirely written in rust.
The benchmarks should be portable and not rely on the presence of platform defined dictionary files.
The benchmarks should have the ability to be run with specific parameters
1. Number of input lines
2. Fraction of duplicates
3. Distribution of input line length
4. Char set (binary/text)
The benchmarks should still be able to run against all the preexisting commands (sort|uniq).

Design

A CLI application should be written that produces a set of random tokens according to the parameters specified on the CLI:

genbench --charset ascii/binary --delim CHAR --number NUM --duplicates PERCENTAGE --short LEN --long LEN

The short/long parameters each indicate the 90% percentile of string lengths, using a gaussian distribution.

For the actual benchmark we should write a benchmark executor that runs each of the implementations with a variety of parameters handed to genbench.

Tests

We can reuse the same strategy for testing by generating test data with genbench and then comparing the output of the full huniq and a super naive, unoptimized huniq implementation. We should specifically make sure, that buffer growing is tested (supply some very long, >20kb strings).

Don't output trailing delimiter if the input doesn't contain one

When I run printf "1" | huniq -d ",", it prints 1,. When the input doesn't even contain the delimiter, I'd expect huniq to not change the input. But when the input is something like printf "1," | huniq -d ",", then printing 1, makes sense. I don't imagine it'd be that hard to fix this, but it would be really useful. A flag like --no-trailing-delimiter would also work as well.

Benchmark against runiq

I was wondering if you've tried benchmarking huniq against another rust based implementation: https://github.com/whitfin/runiq

Add a flag to use BTreeMap, so that the output is emited sorted

Sort options

It should be possible to sort output by duplicate count (rank mode --rank or --reverse-rank)

Suggestion: Use the Rust implementation of xxHash

Using a C library complicates the compilation, since it requires that the Clang compiler be present.

xxHash has already been implemented in Rust. If there is a good reason to not use this crate, then it would be great to provide precompiled binaries of huniq.

csv files with huniq -c

Suggest an option for huniq -c to handle csv files. The input heading line should be maintained but with an additional field name for the count column so that both input and output are csv files.

Add an option for huniq -c to indent numbers like uniq -c

In other words, use %7d format for numbers.

That is, instead of printing

1 world
12345 hello

******1 hello
**12345 world

(I've replaced spaces with *s for clarity).

build error

Hi i got build error. maybe xxhash is not an explicite dep? which rust/cargo versions are supported?

$ cargo build
   Compiling huniq v2.1.0 (/home/.../huniq)
error: failed to run custom build command for `huniq v2.1.0 (/home/.../huniq)`

Caused by:
  process didn't exit successfully: `/home/.../huniq/target/debug/build/huniq-5dded996a2dab995/build-script-build` (exit code: 1)
--- stdout
TARGET = Some("i686-unknown-linux-gnu")
OPT_LEVEL = Some("0")
HOST = Some("i686-unknown-linux-gnu")
CC_i686-unknown-linux-gnu = None
CC_i686_unknown_linux_gnu = None
HOST_CC = None
CC = None
CFLAGS_i686-unknown-linux-gnu = None
CFLAGS_i686_unknown_linux_gnu = None
HOST_CFLAGS = None
CFLAGS = None
CRATE_CC_NO_DEFAULTS = None
DEBUG = Some("true")
CARGO_CFG_TARGET_FEATURE = Some("fxsr,sse,sse2")
running: "cc" "-O0" "-ffunction-sections" "-fdata-sections" "-fPIC" "-g" "-fno-omit-frame-pointer" "-m32" "-march=i686" "-Wall" "-Wextra" "-o" "/home/.../huniq/target/debug/build/huniq-ef603bdbefe92818/out/vendor/xxhash/xxhash.o" "-c" "vendor/xxhash/xxhash.c"
cargo:warning=cc: error: vendor/xxhash/xxhash.c: No such file or directory
cargo:warning=cc: fatal error: no input files
cargo:warning=compilation terminated.
exit code: 4

--- stderr


error occurred: Command "cc" "-O0" "-ffunction-sections" "-fdata-sections" "-fPIC" "-g" "-fno-omit-frame-pointer" "-m32" "-march=i686" "-Wall" "-Wextra" "-o" "/home/.../huniq/target/debug/build/huniq-ef603bdbefe92818/out/vendor/xxhash/xxhash.o" "-c" "vendor/xxhash/xxhash.c" with args "cc" did not execute successfully (status code exit code: 4).

cc (Debian 4.7.2-5) 4.7.2
cargo 1.40.0 (bc8e4c8be 2019-11-22)
rustc 1.40.0 (73528e339 2019-12-16)

git commit 323b02f5ba96255adb196d76f58e3e86f39b92d5

Leak allocated memory

No use dropping expensive data structures if we are already terminating the process…

Create releases on GitHub

I saw that there is huniq package on crates.io. Can you also push tags or create releases on GitHub as well?

Not much quicker than awk one-liner with numeric keys

I've compared speed with AWK, and it's not so much quicker.

`huniq`

$ echo | pee "seq 20000000" "seq 20000000" "seq 20000000" | huniq | pv -l | wc -l
20.0M 0:00:30 [ 664k/s]
20007130

`awk`

$ echo | pee "seq 20000000" "seq 20000000" "seq 20000000" | awk '!a[$0]++{print}' | pv -l | wc -l
20.0M 0:00:32 [ 608k/s]
20008984

`unique`

$ echo | pee "seq 20000000" "seq 20000000" "seq 20000000" | unique | pv -l | wc -l
20.0M 0:00:38 [ 524k/s]
20006340

`quniq`

$ echo | pee "seq 20000000" "seq 20000000" "seq 20000000" | quniq -max-workers 10 | pv -l | wc -l
20.0M 0:00:45 [ 436k/s]
20013324

`runiq`

$ echo | pee "seq 20000000" "seq 20000000" "seq 20000000" | runiq - | pv -l | wc -l
20.0M 0:00:35 [ 571k/s]
20007072

`perl`

$ echo | pee "seq 20000000" "seq 20000000" "seq 20000000" | perl -e 'while(<>){if(!$s{$_}){print $_;$|=1;$s{$_}=1;}}' | pv -l | wc -l
20.0M 0:00:55 [ 363k/s]
20008104

It's only 2 seconds quicker than awk when processing 20 millions lines.

Is there any way to process the lines even quicker?

build error

I got the following error. versions below.

$ cargo clean ; cargo build --release
   Compiling libc v0.2.66
   Compiling memchr v2.3.0
   Compiling version_check v0.1.5
   Compiling glob v0.3.0
   Compiling proc-macro2 v1.0.8
   Compiling byteorder v1.3.2
   Compiling lazy_static v1.4.0
   Compiling cfg-if v0.1.10
   Compiling bitflags v1.2.1
   Compiling log v0.4.8
   Compiling regex-syntax v0.6.13
   Compiling unicode-xid v0.2.0
   Compiling quick-error v1.2.3
   Compiling unicode-width v0.1.7
   Compiling termcolor v1.1.0
   Compiling anyhow v1.0.26
   Compiling winapi-build v0.1.1
   Compiling vec_map v0.8.1
   Compiling strsim v0.8.0
   Compiling ansi_term v0.11.0
   Compiling bindgen v0.52.0
   Compiling shlex v0.1.1
   Compiling lazycell v1.2.1
   Compiling peeking_take_while v0.1.2
   Compiling shell-words v0.1.0
   Compiling winapi v0.2.8
   Compiling getrandom v0.1.14
   Compiling nom v4.2.3
   Compiling clang-sys v0.28.1
   Compiling thread_local v1.0.1
   Compiling humantime v1.3.0
   Compiling textwrap v0.11.0
   Compiling kernel32-sys v0.2.2
   Compiling aho-corasick v0.7.6
   Compiling jobserver v0.1.19
   Compiling atty v0.2.14
   Compiling which v3.1.0
   Compiling errno v0.2.4
   Compiling quote v1.0.2
   Compiling rustc-hash v1.0.1
   Compiling cc v1.0.50
   Compiling regex v1.3.3
   Compiling clap v2.33.0
   Compiling cexpr v0.3.6
   Compiling sysconf v0.3.4
   Compiling env_logger v0.7.1
   Compiling libloading v0.5.2
   Compiling huniq v2.1.0 (/home/.../huniq)
error: failed to run custom build command for `huniq v2.1.0 (/home/.../huniq)`

Caused by:
  process didn't exit successfully: `/home/.../huniq/target/release/build/huniq-3ecad4f1274ec0a6/build-script-build` (exit code: 101)
--- stdout
TARGET = Some("i686-unknown-linux-gnu")
OPT_LEVEL = Some("3")
HOST = Some("i686-unknown-linux-gnu")
CC_i686-unknown-linux-gnu = None
CC_i686_unknown_linux_gnu = None
HOST_CC = None
CC = None
CFLAGS_i686-unknown-linux-gnu = None
CFLAGS_i686_unknown_linux_gnu = None
HOST_CFLAGS = None
CFLAGS = None
CRATE_CC_NO_DEFAULTS = None
DEBUG = Some("false")
CARGO_CFG_TARGET_FEATURE = Some("fxsr,sse,sse2")
running: "cc" "-O3" "-ffunction-sections" "-fdata-sections" "-fPIC" "-m32" "-march=i686" "-Wall" "-Wextra" "-o" "/home/.../huniq/target/release/build/huniq-492d306c5d2bcc67/out/vendor/xxhash/xxhash.o" "-c" "vendor/xxhash/xxhash.c"
exit code: 0
AR_i686-unknown-linux-gnu = None
AR_i686_unknown_linux_gnu = None
HOST_AR = None
AR = None
running: "ar" "crs" "/home/.../huniq/target/release/build/huniq-492d306c5d2bcc67/out/libxxhash.a" "/home/.../huniq/target/release/build/huniq-492d306c5d2bcc67/out/vendor/xxhash/xxhash.o"
exit code: 0
cargo:rustc-link-lib=static=xxhash
cargo:rustc-link-search=native=/home/.../huniq/target/release/build/huniq-492d306c5d2bcc67/out
cargo:rerun-if-changed=src/bindings.h

--- stderr
error: unsupported option '--target=i686-unknown-linux-gnu', err: true
src/xxhash_bindings.h:1:9: warning: #pragma once in main file, err: false
thread 'main' panicked at 'Unable to generate bindings: ()', src/libcore/result.rs:1165:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace.

cc (Debian 4.7.2-5) 4.7.2
cargo 1.40.0 (bc8e4c8be 2019-11-22)
rustc 1.40.0 (73528e339 2019-12-16)

git commit 323b02f5ba96255adb196d76f58e3e86f39b92d5

Work incorrectly

The command fc-list : family | huniq could not produce the expected result as fc-list : family | sort -u, and I can confirm that the piped lines are terminated by 0x0a, as the example of echo -e "foo\nbar\nfoo\nbaz" | huniq.

Handle stdout being closed prematurely

When the stdout stream is closed while huniq is still printing, the program errors out:

$ seq 100000 | huniq | head
1
2
3
4
5
6
7
8
9
10
thread 'main' panicked at 'failed printing to stdout: Broken pipe (os error 32)', library/std/src/io/stdio.rs:1008:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

This was mentioned before in #26, but I felt like it was worth bringing up again.

It's not a big deal other than being a bit ugly, but since this kind of workflow is common for tools like huniq, it would be nice if this wasn't presented as. see also: Misterio77/flavours#16

The error is already being propagated, so it should be simple to just silently handle std::io::ErrorKind::BrokenPipe explicitly

musl binary

Hi,
Could you please provide a musl binary for linux.
My machine at work doesn't allow installs

Many thanks in advance

Building problems on macOS

Documenting some problems I've had building this. In order of appearance.

OS: macOS Catalina 10.15.3

zsh: command not found: llvm-config

Fix: brew install llvm then export PATH="/usr/local/opt/llvm/bin:$PATH"

./vendor/xxhash/xxhash.h:663:10: fatal error: 'stdlib.h' file not found

Fix: Running xcode-select --install

benchmark.sh: line 86: declare: -A: invalid option
declare: usage: declare [-afFirtx] [-p] [name[=value] ...]

Fix: Use zsh ./benchmark.sh instead of bash ./benchmark.sh

uniq 2 huniq2-rust time: illegal option -- f
uniq 2 huniq2-rust usage: time [-lp] command.

Infinite loop.
Fix: Run brew install gnu-time then export PATH="/usr/local/opt/gnu-time/libexec/gnubin:$PATH"

Benchmark Result

uniq 1 huniq2-rust 0.09 10324
uniq 1 huniq1-c++  0.76 10824
uniq 1 awk-sys     0.66 23216
uniq 1 shell       1.37 52812

uniq 2 huniq2-rust 0.07 10324
uniq 2 huniq1-c++  1.06 10196
uniq 2 awk-sys     1.19 23224
uniq 2 shell       3.29 101620

uniq 5 huniq2-rust 0.25 10324
uniq 5 huniq1-c++  2.69 11992
uniq 5 awk-sys     3.12 23252
uniq 5 shell       9.75 247944

uniq 10 huniq2-rust 0.28 10344
uniq 10 huniq1-c++  5.13 11200
uniq 10 awk-sys     6.02 23212
uniq 10 shell       20.89 491792

uniq 50 huniq2-rust 1.34 10324
uniq 50 huniq1-c++  24.28 11044
uniq 50 awk-sys     29.55 23216
uniq 50 shell       120.39 2411632

Didn't wanna wait for the rest of it.

confusion about the description

Hi,
everywhere in the README it says and shows how huniq sorts too. But in the very first paragraph of the README file, right after the heading, there it says "Note that huniq does not sort the input, it just removes duplicates."
Is this maybe a left-over sentence from an older version?
thx