
Fastest Integer Compression

License: GNU General Public License v2.0


TurboPFor: Fastest Integer Compression

TurboPFor: The synonym for "integer compression"
- ALL functions available for AMD/Intel, 64 bits ARMv8 NEON Linux+MacOS/M1 & Power9 Altivec
- 100% C (C++ headers), as simple as memcpy. OS: Linux amd64, arm64, Power9, macOS (AMD/Intel + Apple M1)
- 🆕(2023.04) Rust Bindings. Access TurboPFor incl. SSE/AVX2/Neon! from Rust
- 👍 Java Critical Natives/JNI. Access TurboPFor incl. SSE/AVX2/Neon! from Java as fast as calling from C
- ✨ FULL range 8/16/32/64 bits scalar + 16/32/64 bits SIMD functions
- No other "integer compression" library compresses/decompresses faster
- ✨ Direct Access, integrated (SIMD/AVX2) FOR/delta/Delta of Delta/Zigzag for sorted/unsorted arrays
For/PFor/PForDelta
- Novel TurboPFor (PFor/PForDelta) scheme w./ direct access + SIMD/AVX2. +RLE
- Outstanding compression/speed. More efficient than ANY other fast "integer compression" scheme.
Bit Packing
- Fastest and most efficient "SIMD Bit Packing": more than 20 billion integers/sec (80 GB/s!)
- Extremely fast scalar "Bit Packing"
- Direct/Random Access : Access any single bit packed entry with zero decompression
Variable byte
- Scalar "Variable Byte" faster and more efficient than ANY other implementation
- SIMD TurboByte fastest group varint (16+32 bits) incl. integrated delta,zigzag,xor,...
- 🆕(2023.03) TurboBitByte: novel hybrid scheme combining the fastest SIMD codecs TurboByte+TurboPack. Compresses considerably better and can be 3 times faster than streamvbyte
Simple family
- Novel "Variable Simple" (incl. RLE) faster and more efficient than simple16, simple-8b
Elias fano
- Fastest "Elias Fano" implementation w/ or w/o SIMD/AVX2
🆕(2023.03)TurboVLC novel variable length encoding for large integers with exponent + variable bit mantissa
🆕(2023.03)Binary interpolative coding : fastest implementation

Transform
- Scalar & SIMD Transform: Delta, Zigzag, Zigzag of delta, XOR,
- 🆕(2023.03) Transpose/Shuffle with integrated Xor and zigzag delta
- 🆕(2023.03) 2D/3D/4D transpose
- lossy floating point compression with TurboPFor or TurboTranspose+lz77/bwt
🆕(2023.03)IC Codecs transpose/rle + general purpose compression with lz4,zstd,turborc (range coder),bwt...

Floating Point Compression
- Delta/Zigzag + improved gorilla style + (Differential) Finite Context Method FCM/DFCM floating point compression
- Using TurboPFor, unsurpassed compression and more than 8 GB/s throughput
- Point wise relative error bound lossy floating point compression
- TurboFloat novel efficient floating point compression using TurboPFor
- 🆕(2023.03)TurboFloat LzXor novel floating point lempel-ziv compression
- 🆕(2023.06) _Float16 16 bits floating point support
- 🆕(2023.06) float 16/32/64 bits quantization with variable quantization bit size.
Time Series Compression
- Fastest Gorilla 16/32/64 bits style compression (zigzag of delta + RLE).
- Can compress timestamps to only 0.01% of their original size. Speed: > 10 GB/s compression and > 13 GB/s decompression.
Inverted Index ...do less, go fast!
- Direct Access to compressed frequency and position data w/ zero decompression
- Novel "Intersection w/ skip intervals", decompress the minimum necessary blocks (~10-15%)!.
- Novel Implicit skips with zero extra overhead
- Novel Efficient Bidirectional Inverted Index Architecture (forward/backwards traversal) incl. "integer compression".
- more than 2000! queries per second on the GOV2 dataset (25 million documents) on a SINGLE core
- ✨ Revolutionary Parallel Query Processing on Multicores > 7000!!! queries/sec on a simple quad core PC.
  ...forget ~~Map Reduce, Hadoop, multi-node clusters,~~ ...

Integer Compression Benchmark (single thread):

Download IcApp a new benchmark for TurboPFor
for testing almost all integer and floating point file types. ( type: icapp ZIPF )
Benchmark: TurboTranspose+iccodecs vs Quantile Compression
Benchmark: TurboByte+TurboBitByte vs streamvbyte
Benchmark: Time Series - TurboPFor, TurboFloat, TurboFloat LzX, TurboGorilla,...
Benchmark: Lossy Floating Point Preprocessing Turbo Razor vs Granular bitround vs libroundfast
Benchmark: Lossless/Lossy Floating Point Compression. TurboPFor vs zfp & blosc
Benchmark: TurboPFor: IcApp 16 bits Integer Compression
Benchmark Intel CPU: Skylake i7-6700 3.4GHz gcc 9.2
Benchmark ARM: ARMv8 A73-ODROID-N2 1.8GHz

- Synthetic data:

Generate and test (zipfian) skewed distribution (100.000.000 integers, Block size=128/256)
Note: Unlike general-purpose compression, "integer compression" generally uses a small fixed block size (ex. 128 integers). Large blocks must be decoded entirely when processing queries (inverted index, search engines, databases, graphs, in-memory computing, ...), even when only a few values are needed.
```
 ./icapp -a1.5 -m0 -M255 -n100M ZIPF
```

C Size	ratio%	Bits/Integer	C MB/s	D MB/s	Name 2019.11
62,939,886	15.7	5.04	2369	10950	TurboPFor256
63,392,759	15.8	5.07	1359	7803	TurboPFor128
63,392,801	15.8	5.07	1328	924	TurboPForDA
65,060,504	16.3	5.20	60	2748	FP_SIMDOptPFor
65,359,916	16.3	5.23	32	2436	PC_OptPFD
73,477,088	18.4	5.88	408	2484	PC_Simple16
73,481,096	18.4	5.88	624	8748	FP_SimdFastPFor 64Ki *
76,345,136	19.1	6.11	1072	2878	VSimple
91,947,533	23.0	7.36	284	11737	QMX 64k *
93,285,864	23.3	7.46	1568	10232	FP_GroupSimple 64Ki *
95,915,096	24.0	7.67	848	3832	Simple-8b
99,910,930	25.0	7.99	17298	12408	TurboByte+TurboPack
99,910,930	25.0	7.99	17357	12363	TurboPackV sse
99,910,930	25.0	7.99	11694	10138	TurboPack scalar
99,910,930	25.0	7.99	8420	8876	TurboFor
100,332,929	25.1	8.03	17077	11170	TurboPack256V avx2
101,015,650	25.3	8.08	11191	10333	TurboVByte
102,074,663	25.5	8.17	6689	9524	MaskedVByte
102,074,663	25.5	8.17	2260	4208	PC_Vbyte
102,083,036	25.5	8.17	5200	4268	FP_VByte
112,500,000	28.1	9.00	1528	12140	VarintG8IU
125,000,000	31.2	10.00	13039	12366	TurboByte
125,000,000	31.2	10.00	11197	11984	StreamVbyte 2019
400,000,000	100.00	32.00	8960	8948	Copy
			N/A	N/A	EliasFano

(*) codecs inefficient for small block sizes are tested with 64Ki integers/block.

MB/s: 1.000.000 bytes/second. 1000 MB/s = 1 GB/s
#BOLD = pareto frontier.
FP: FastPFor, SC: simdcomp, PC: Polycom
TurboPForDA, TurboForDA: Direct Access is normally used when accessing only a few individual values.
Elias-Fano can be used directly only for increasing sequences.
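
To illustrate what "Direct Access" means for bit packing, here is a minimal sketch in C. It is not TurboPFor's implementation: the names `bitput32` and `bitget32` are hypothetical, and the bit-by-bit loops favour clarity over speed. The idea is that when every value is stored with the same bit width `b`, entry `i` starts at bit offset `i*b`, so it can be read without decoding its neighbours.

```c
#include <stddef.h>
#include <stdint.h>

/* Store value v (using its low b bits, 1 <= b <= 32) as entry i of a
   bit-packed array. buf must be zero-initialised and hold at least
   (n*b + 7) / 8 bytes for n entries. Hypothetical helper, LSB-first layout. */
static void bitput32(uint8_t *buf, size_t i, unsigned b, uint32_t v) {
    size_t bitpos = i * b;                       /* first bit of entry i */
    for (unsigned k = 0; k < b; k++, bitpos++) {
        if ((v >> k) & 1u)
            buf[bitpos >> 3] |= (uint8_t)(1u << (bitpos & 7));
    }
}

/* Read entry i directly from the packed buffer; no other entry is decoded. */
static uint32_t bitget32(const uint8_t *buf, size_t i, unsigned b) {
    size_t bitpos = i * b;
    uint32_t v = 0;
    for (unsigned k = 0; k < b; k++, bitpos++)
        v |= (uint32_t)((buf[bitpos >> 3] >> (bitpos & 7)) & 1u) << k;
    return v;
}
```

With `b = 5`, five values occupy 25 bits (4 bytes), and `bitget32(buf, 3, 5)` touches only the bytes that hold the fourth entry.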

- Data files:

gov2.sorted from DocId data set Block size=128/Delta coding

Size	Ratio %	Bits/Integer	C Time MB/s	D Time MB/s	Function 2019.11
3,321,663,893	13.9	4.44	1320	6088	TurboPFor
3,339,730,557	14.0	4.47	32	2144	PC.OptPFD
3,350,717,959	14.0	4.48	1536	7128	TurboPFor256
3,501,671,314	14.6	4.68	56	2840	VSimple
3,768,146,467	15.8	5.04	3228	3652	EliasFanoV
3,822,161,885	16.0	5.11	572	2444	PC_Simple16
4,411,714,936	18.4	5.90	9304	10444	TurboByte+TurboPack
4,521,326,518	18.9	6.05	836	3296	Simple-8b
4,649,671,427	19.4	6.22	3084	3848	TurboVbyte
4,955,740,045	20.7	6.63	7064	10268	TurboPackV
4,955,740,045	20.7	6.63	5724	8020	TurboPack
5,205,324,760	21.8	6.96	6952	9488	SC_SIMDPack128
5,393,769,503	22.5	7.21	14466	11902	TurboPackV256
6,221,886,390	26.0	8.32	6668	6952	TurboFor
6,221,886,390	26.0	8.32	6644	2260	TurboForDA
6,699,519,000	28.0	8.96	1888	1980	FP_Vbyte
6,700,989,563	28.0	8.96	2740	3384	MaskedVByte
7,622,896,878	31.9	10.20	836	4792	VarintG8IU
8,060,125,035	33.7	10.78	8456	9476	Streamvbyte 2019
8,594,342,216	35.9	11.50	5228	6376	libfor
23,918,861,764	100.0	32.00	5824	5924	Copy
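
The "Ratio %" and "Bits/Integer" columns follow directly from the compressed size and the raw size shown in the "Copy" row. The small C sketch below spells out that arithmetic. The function names are hypothetical and only serve the illustration.

```c
#include <stdint.h>

/* Compressed size as a percentage of the raw size. */
static double ratio_percent(uint64_t compressed_bytes, uint64_t raw_bytes) {
    return 100.0 * (double)compressed_bytes / (double)raw_bytes;
}

/* Average number of bits spent per integer, for raw integers that are
   int_bytes bytes wide (4 for the 32-bit integers used in the tables). */
static double bits_per_integer(uint64_t compressed_bytes, uint64_t raw_bytes,
                               unsigned int_bytes) {
    double n_integers = (double)raw_bytes / (double)int_bytes;
    return 8.0 * (double)compressed_bytes / n_integers;
}
```

For the TurboPFor row, `ratio_percent(3321663893, 23918861764)` is about 13.89 and `bits_per_integer(3321663893, 23918861764, 4)` is about 4.44, which match the table after rounding.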

Block size: 64Ki = 256k bytes. Ki=1024 Integers

Size	Ratio %	Bits/Integer	C Time MB/s	D Time MB/s	Function
3,164,940,562	13.2	4.23	1344	6004	TurboPFor 64Ki
3,273,213,464	13.7	4.38	1496	7008	TurboPFor256 64Ki
3,965,982,954	16.6	5.30	1520	2452	lz4+DT 64Ki
4,234,154,427	17.7	5.66	436	5672	qmx 64Ki
6,074,995,117	25.4	8.13	1976	2916	blosc_lz4 64Ki
8,773,150,644	36.7	11.74	2548	5204	blosc_lz 64Ki

"lz4+DT 64Ki" = Delta+Transpose from TurboPFor + lz4
"blosc_lz4" internal lz4 compressor+vectorized shuffle

- Time Series:

Test file Timestamps: ts.txt(sorted)
```
  ./icapp -Ft ts.txt -I15 -J15
```

Function	C MB/s	size	ratio%	D MB/s	Text
bvzenc32	10632	45,909	0.008	12823	ZigZag
bvzzenc32	8914	56,713	0.010	13499	ZigZag Delta of delta
vsenc32	12294	140,400	0.024	12877	Variable Simple
p4nzenc256v32	1932	596,018	0.10	13326	TurboPFor256 ZigZag
p4ndenc256v32	1961	596,018	0.10	13339	TurboPFor256 Delta
bitndpack256v32	12564	909,189	0.16	13505	TurboPackV256 Delta
p4nzenc32	1810	1,159,633	0.20	8502	TurboPFor ZigZag
p4nzenc128v32	1795	1,159,633	0.20	13338	TurboPFor ZigZag
bitnzpack256v32	9651	1,254,757	0.22	13503	TurboPackV256 ZigZag
bitnzpack128v32	10155	1,472,804	0.26	13380	TurboPackV ZigZag
vbddenc32	6198	18,057,296	3.13	10982	TurboVByte Delta of delta
memcpy	13397	577,141,992	100.00
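
Why regularly spaced timestamps shrink so much can be seen with a small sketch of "ZigZag Delta of delta" in C. This is an illustration of the transform only, under the assumption of 32-bit values. It says nothing about the bit stream that `bvzzenc32` actually writes, and the function names are hypothetical. When the interval between samples is constant, the delta of delta is 0 for almost every entry, and ZigZag maps the occasional small negative or positive jitter to a small unsigned number.

```c
#include <stddef.h>
#include <stdint.h>

/* ZigZag on the two's complement bit pattern: 0,-1,1,-2,2,... -> 0,1,2,3,4,... */
static uint32_t zigzag32(uint32_t u)   { return (u << 1) ^ (0u - (u >> 31)); }
static uint32_t unzigzag32(uint32_t u) { return (u >> 1) ^ (0u - (u & 1u)); }

/* out[i] = zigzag((in[i] - in[i-1]) - (in[i-1] - in[i-2])), with the
   missing predecessors taken as 0. Unsigned arithmetic wraps by design. */
static void dod_zigzag_encode(const uint32_t *in, size_t n, uint32_t *out) {
    uint32_t prev = 0, prev_delta = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t delta = in[i] - prev;
        out[i] = zigzag32(delta - prev_delta);
        prev = in[i];
        prev_delta = delta;
    }
}

/* Inverse transform: rebuild the delta, then the value. */
static void dod_zigzag_decode(const uint32_t *in, size_t n, uint32_t *out) {
    uint32_t prev = 0, prev_delta = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t delta = prev_delta + unzigzag32(in[i]);
        prev += delta;
        out[i] = prev;
        prev_delta = delta;
    }
}
```

For the input 1000, 1010, 1020, 1030, 1041, 1050 the encoder produces 2000, 1979, 0, 0, 2, 3: after the first two entries only the jitter remains, and a following bit packing or RLE stage has very little left to store.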

- Transpose/Shuffle (no compression)

    ./icapp -e117,118,119 ZIPF

Size	C Time MB/s	D Time MB/s	Function
100,000,000	9400	9132	TPbyte 4 TurboPFor Byte Transpose/shuffle AVX2
100,000,000	8784	8860	TPbyte 4 TurboPFor Byte Transpose/shuffle SSE
100,000,000	7688	7656	Blosc_Shuffle AVX2
100,000,000	5204	7460	TPnibble 4 TurboPFor Nibble Transpose/shuffle SSE
100,000,000	6620	6284	Blosc shuffle SSE
100,000,000	3156	3372	Bitshuffle AVX2
100,000,000	2100	2176	Bitshuffle SSE

- (Lossy) Floating point compression:

    ./icapp -Fd file          " 64 bits floating point raw file 
    ./icapp -Ff file          " 32 bits floating point raw file 
    ./icapp -Fcf file         " text file with multiple entries (ex.  8.657,56.8,4.5 ...)
    ./icapp -Ftf file         " text file (1 entry per line)
    ./icapp -Ftf file -v5     " + display the first entries read
    ./icapp -Ftf file.csv -K3 " but 3rd column in a csv file (ex. number,Text,456.5 -> 456.5)
    ./icapp -Ftf file -g.001  " lossy compression with allowed pointwise relative error 0.001

see also TurboTranspose

- Compressed Inverted Index Intersections with GOV2

GOV2: 426GB, 25 million documents, average doc. size=18k.

AOL query log: 18.000 queries
~1300 queries per second (single core)
~5000 queries per second (quad core)
Ratio = 14.37% Decoded/Total Integers.
TREC Million Query Track (1MQT):
~1100 queries per second (Single core)
~4500 queries per second (Quad core CPU)
Ratio = 11.59% Decoded/Total Integers.

Benchmarking intersections (Single core, AOL query log)

max.docid/q	Time s	q/s	ms/q	% docid found
1.000	7.88	2283.1	0.438	81
10.000	10.54	1708.5	0.585	84
ALL	13.96	1289.0	0.776	100
q/s: queries/second, ms/q:milliseconds/query

Benchmarking Parallel Query Processing (Quad core, AOL query log)

max.docid/q	Time s	q/s	ms/q	% docids found
1.000	2.66	6772.6	0.148	81
10.000	3.39	5307.5	0.188	84
ALL	3.57	5036.5	0.199	100

Notes:

Search engines spend 90% of their query processing time in intersections.
Most search engines use pruning strategies, caching of popular queries, ... to reduce the time for intersections and query processing.
For comparison, the GOV2 experiments in "On Inverted Index Compression for Search Engine Efficiency", run on an 8-core Xeon PC, report 1.2 seconds per query (for 1.000 Top-k docids).

Compile:

Download or clone TurboPFor:

    git clone https://github.com/powturbo/TurboPFor-Integer-Compression.git
    cd TurboPFor-Integer-Compression
    make

To benchmark TurboPFor + general purpose compression codecs (zstd, lz4, Turbo-Range-Coder, bwt, bitshuffle):

    git clone --recursive https://github.com/powturbo/TurboPFor-Integer-Compression.git
    cd TurboPFor-Integer-Compression
    make ICCODEC=1

To benchmark external libraries:

- Download the external libraries from github into the current directory
- Activate/deactivate the ext. libs in the makefile

    make CODEC1=1 CODEC2=1 ICCODEC=1

Windows visual c++

	nmake /f makefile.vs

Windows visual studio c++

    project files under vs/vs2022

Testing:

- Synthetic data (use ZIPF parameter):

benchmark groups of "integer compression" functions

./icapp -a1.2 -m0 -M255 -n100M ZIPF
./icapp -a1.2 -m0 -M255 -n100M ZIPF -e20-50

-zipfian distribution alpha = 1.2 (Ex. -a1.0=uniform -a1.5=skewed distribution)
-number of integers = 100.000.000
-integer range from 0 to 255

Unsorted lists: individual function test
```
./icapp -a1.5 -m0 -M255 -e1,2,3 ZIPF
```
Unsorted lists: Zigzag encoding
```
 ./icapp -e10,11,12 ZIPF
```
Sorted lists: differential coding (increasing/strictly increasing)
```
./icapp -e4,5,6 ZIPF
./icapp -e7,8,9 ZIPF
```

Transpose/RLE/TurboVByte + General purpose compressor (lz,zstd,turborc,bwt...)

./icapp file -e80-95
./icapp file -e80-95 -Ezstd,15 
./icapp file -e80-95 -Eturborc,1
./icapp file -e80-95 -Eturborc,20

2D/3D/4D Transpose + General purpose compressor (lz,zstd,turborc,...)

./icapp file512x128.f32 -R512x128 -Ff
./icapp file512x128.f32 -R512x128 -Ff -e100,101,102 
./icapp file1024x512x128.f32 -R1024x512x128 -Ff -e100,101,102

Automatic dimension determination from file name ( option -R0 )

./icapp file1024x512x128.f32 -R0 -Ff -e103,104,105
./icapp file1024x512x128.f64 -R0 -Fl -e103,104,105

Lossy floating point compression

./icapp file512x128.f32 -R512x128 -Ff -g.0001
./icapp file.f32 -Ff -g.001
./icapp file.f64 -Fd -g.001

- Data files:

Raw 32 bits binary data file Test data

./icapp file           
./icapp -Fs file         "16 bits raw binary file
./icapp -Fu file         "32 bits raw binary file
./icapp -Fl file         "64 bits raw binary file
./icapp -Ff file         "32 bits raw floating point binary file
./icapp -Fd file         "64 bits raw floating point binary file

Text file: 1 entry per line. Test data: ts.txt (sorted) and lat.txt (unsorted)

./icapp -Fts data.txt            "text file, one 16 bits integer per line
./icapp -Ftu ts.txt              "text file, one 32 bits integer per line
./icapp -Ftl ts.txt              "text file, one 64 bits integer per line
./icapp -Ftf file                "text file, one 32 bits floating point (ex. 8.32456) per line
./icapp -Ftd file                "text file, one 64 bits floating point (ex. 8.324567789) per line
./icapp -Ftd file -v5            "like prev., display the first 100 values read
./icapp -Ftd file -v5 -g.00001   "like prev., error bound lossy floating point compression
./icapp -Ftt file                "text file, timestamp in seconds iso-8601 -> 32 bits integer (ex. 2018-03-12T04:31:06)
./icapp -FtT file                "text file, timestamp in milliseconds iso-8601 -> 64 bits integer (ex. 2018-03-12T04:31:06.345)
./icapp -Ftl -D2 -H file         "skip 1st line, convert numbers with 2 decimal digits to 64 bits integers (ex. 456.23 -> 45623)
./icapp -Ftl -D2 -H -K3 file.csv  "like prev., use the 3rd number in the line (ex. label=3245, text=99 usage=456.23 -> 456.23 )
./icapp -Ftl -D2 -H -K3 -k| file.csv "like prev., use '|' as separator

Text file: multiple numbers separated by non-digits (0..9,-,.) characters (ex. 134534,-45678,98788,4345, )

./icapp -Fc data.txt         "text file, 32 bits integers (ex. 56789,3245,23,678 ) 
./icapp -Fcd data.txt        "text file, 64 bits floating-point numbers (ex. 34.7689,5.20,45.789 )

- Intersections:

1 - Download Gov2 (or ClueWeb09) + query files (Ex. "1mq.txt") from DocId data set
8GB RAM required (16GB recommended for benchmarking "clueweb09" files).

2 - Create index file

    ./idxcr gov2.sorted .

create inverted index file "gov2.sorted.i" in the current directory

3 - Test intersections

    ./idxqry gov2.sorted.i 1mq.txt

run queries in file "1mq.txt" over the index of gov2 file

- Parallel Query Processing:

1 - Create partitions

    ./idxseg gov2.sorted . -26m -s8

create 8 (CPU hardware threads) partitions for a total of ~26 million document ids

2 - Create index file for each partition

  ./idxcr gov2.sorted.s*

create inverted index file for all partitions "gov2.sorted.s00 - gov2.sorted.s07" in the current directory

3 - Intersections:

delete "idxqry.o" file and then type "make para" to compile "idxqry" w. multithreading

  ./idxqry gov2.sorted.s*.i 1mq.txt

run queries in file "1mq.txt" over the index of all gov2 partitions "gov2.sorted.s00.i - gov2.sorted.s07.i".

Function usage:

See benchmark "icapp" program for "integer compression" usage examples. In general encoding/decoding functions are of the form:

char *endptr = encode( unsigned *in, unsigned n, char *out, [unsigned start], [int b])
endptr : set by encode to the next character in "out" after the encoded buffer
in : input integer array
n : number of elements
out : pointer to output buffer
b : number of bits. Only for bit packing functions
start : previous value. Only for integrated delta encoding functions

char *endptr = decode( char *in, unsigned n, unsigned *out, [unsigned start], [int b])
endptr : set by decode to the next character in "in" after the decoded buffer
in : pointer to input buffer
n : number of elements
out : output integer array
b : number of bits. Only for bit unpacking functions
start : previous value. Only for integrated delta decoding functions
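
The calling convention above (the encoder returns a pointer just past the bytes it wrote, the decoder returns a pointer just past the bytes it consumed) can be sketched with a toy variable byte codec in C. The names `toy_vbenc32` and `toy_vbdec32` are hypothetical and the byte format is a plain 7-bits-per-byte varint, not TurboPFor's own format. For real code use the functions declared in the library headers.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode n integers; returns the next free byte in out.
   Worst case is 5 output bytes per 32-bit value. */
static unsigned char *toy_vbenc32(const uint32_t *in, unsigned n, unsigned char *out) {
    for (unsigned i = 0; i < n; i++) {
        uint32_t v = in[i];
        while (v >= 0x80) {                 /* 7 payload bits + continuation bit */
            *out++ = (unsigned char)(v | 0x80);
            v >>= 7;
        }
        *out++ = (unsigned char)v;          /* last byte: continuation bit clear */
    }
    return out;
}

/* Decode n integers; returns the next unread byte in in.
   Assumes well-formed input produced by toy_vbenc32. */
static const unsigned char *toy_vbdec32(const unsigned char *in, unsigned n, uint32_t *out) {
    for (unsigned i = 0; i < n; i++) {
        uint32_t v = 0;
        unsigned shift = 0;
        while (*in & 0x80) {
            v |= (uint32_t)(*in++ & 0x7F) << shift;
            shift += 7;
        }
        v |= (uint32_t)(*in++) << shift;
        out[i] = v;
    }
    return in;
}
```

The compressed size is simply `endptr - out` after encoding. Because the decoder returns where it stopped, several encoded arrays can be stored back to back and decoded one after another without separate length fields.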

Simple high level functions:

size_t compressed_size = encode( unsigned *in, size_t n, char *out)
compressed_size : number of bytes written into compressed output buffer out

size_t compressed_size = decode( char *in, size_t n, unsigned *out)
compressed_size : number of bytes read from compressed input buffer in

Function syntax:

{vb | p4 | bit | vs | v8 | bic }[n][d | d1 | f | fm | z ]{enc/dec | pack/unpack}[| 128v | 256v][8 | 16 | 32 | 64]:
vb: variable byte
p4: turbopfor
vs: variable simple
v8: TurboByte SIMD + Hybrid TurboByte + TurboPack
bit: bit packing
fp: Floating Point + Turbo Razor: pointwise relative error rounding algorithm

n : high level array functions for large arrays.

'' : encoding for unsorted integer lists
'd' : delta encoding for increasing integer lists (sorted w/ duplicate)
'd1': delta encoding for strictly increasing integer lists (sorted unique)
'f' : FOR encoding for sorted integer lists
'z' : ZigZag encoding for unsorted integer lists

'enc' or 'pack' : encode or bitpack
'dec' or 'unpack': decode or bitunpack
'NN' : integer size (8/16/32/64)
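
The difference between the 'd' and 'd1' variants can be sketched in a few lines of C. This is an illustration of the two transforms under stated assumptions, not the library's code: the function names are hypothetical, and `step` is 0 for 'd' (sorted, duplicates allowed) and 1 for 'd1' (strictly increasing). `start` plays the role of the "previous value" parameter described above.

```c
#include <stddef.h>
#include <stdint.h>

/* out[i] = in[i] - previous - step, where previous is start for i == 0.
   step = 0: plain delta ('d'); step = 1: delta minus one ('d1'). */
static void toy_delta_enc32(const uint32_t *in, size_t n, uint32_t start,
                            uint32_t step, uint32_t *out) {
    uint32_t prev = start;
    for (size_t i = 0; i < n; i++) {
        out[i] = in[i] - prev - step;
        prev = in[i];
    }
}

/* Inverse transform: running sum of (delta + step), seeded with start. */
static void toy_delta_dec32(const uint32_t *in, size_t n, uint32_t start,
                            uint32_t step, uint32_t *out) {
    uint32_t prev = start;
    for (size_t i = 0; i < n; i++) {
        prev += in[i] + step;
        out[i] = prev;
    }
}
```

For the strictly increasing list 5, 6, 7, 9 with `start = 4`, 'd' yields 1, 1, 1, 2 while 'd1' yields 0, 0, 0, 1. Runs of consecutive ids become zeros under 'd1', which a following bit packing stage can store in fewer bits.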

public header file to use with documentation:
include/ic.h

Note: Some low level functions (like p4enc32) are limited to 128/256 (SSE/AVX2) integers per call.
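
When a per-call limit like this applies, a caller has to split a large array into blocks itself (or use the 'n' functions, which are described above as the high level functions for large arrays). The C sketch below shows one way such a loop could look. Everything in it is an assumption made for illustration: `block_enc_fn` is a hypothetical signature, and `toy_raw_block` is a stand-in encoder that just writes a length byte followed by the raw values.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical signature of a low-level encoder that accepts at most one
   block per call and returns the next free byte in out. */
typedef unsigned char *(*block_enc_fn)(const uint32_t *in, unsigned n,
                                       unsigned char *out);

/* Stand-in block encoder for this sketch only: one length byte (n <= 255)
   followed by the uncompressed values. */
static unsigned char *toy_raw_block(const uint32_t *in, unsigned n,
                                    unsigned char *out) {
    *out++ = (unsigned char)n;
    memcpy(out, in, n * sizeof(uint32_t));
    return out + n * sizeof(uint32_t);
}

/* Feed the input to enc in chunks of at most block_size integers. */
static unsigned char *encode_in_blocks(const uint32_t *in, size_t n,
                                       unsigned char *out, unsigned block_size,
                                       block_enc_fn enc) {
    while (n > 0) {
        unsigned len = n < block_size ? (unsigned)n : block_size;
        out = enc(in, len, out);   /* each call sees one block only */
        in += len;
        n -= len;
    }
    return out;
}
```

With 300 integers and a block size of 128, the loop makes three calls with 128, 128 and 44 integers.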

Environment:

OS/Compiler (64 bits):

Windows: MinGW-w64 makefile
Windows: Visual c++ (>=VS2008) - makefile.vs (for nmake)
Windows: Visual Studio project file - vs/vs2022
Linux amd64: GNU GCC (>=4.6)
Linux amd64: Clang (>=3.2)
Linux arm64: 64 bits aarch64 ARMv8: gcc (>=6.3)
Linux arm64: 64 bits aarch64 ARMv8: clang
macOS: Xcode (>=9)
macOS: Apple M1 (Clang)
PowerPC ppc64le (incl. SIMD): gcc (>=8.0)

Multithreading:

All TurboPFor integer compression functions are thread safe

Known issues

Currently (2023.04) there are no known issues or bugs.
The TurboPFor functions can work with arbitrary inputs.
TurboPFor normally does not read outside the input (encode/decode) buffers and does not write outside the output buffer when decoding.
TurboPFor does not write beyond a properly sized output buffer when encoding. Use the bound functions (ex. v8bound, p4bound) to allocate an output buffer of the maximum required size.

LICENSE

GPL 2.0
A commercial license is available. Contact us at powturbo [AT] gmail.com for more information.

References:

TurboPFor: an analysis
Applications:
Benchmark references:
- FastPFor + Simdcomp: SIMDPack FPF, Vbyte FPF, VarintG8IU, StreamVbyte, GroupSimple
- Optimized Pfor-delta compression code: OptPFD/OptP4, Simple16 (limited to 28 bits integers)
- MaskedVByte. See also: Vectorized VByte Decoding
- Streamvbyte.
- Index Compression Using 64-Bit Words: Simple-8b (speed optimized version tested)
- libfor
- Compression, SIMD, and Postings Lists QMX integer compression from the "simple family"
- lz4. included w. block size 64K as indication. Tested after preprocessing w. delta+transpose
- blosc. blosc is like transpose/shuffle+lz77. Tested blosc+lz4 and blosclz incl. vectorized shuffle.
- Document identifier data set
Integer compression publications:

Last update: 10 JUN 2023


turbopfor-integer-compression's Issues

Shouldn't P4NENC return n if n is 0

https://github.com/powturbo/TurboPFor/blob/672b92534261e530de9271cc5661868b776a8495/vp4c.c#L409

Cannot build on Linux Mint 19 x86_64

I got the following error with both gcc-7 and gcc-5 when running the make command:

/usr/lib/gcc/x86_64-linux-gnu/5/include/bmi2intrin.h:62:1: error: inlining failed in call to always_inline ‘_bzhi_u64’: target specific option mismatch
_bzhi_u64 (unsigned long long __X, unsigned long long __Y)

Hello world turbopfor makefile

Hi,

We want to use TurboPFor for our large scale applications.
We already use turboRLE which is very nice, thank you for your work.
We want to also use the p4enc* function and unfortunately, we find it hard to compile.
Our code is (as expected) very similar to the "hello world" you provided

    #include <stdio.h>
    #include <stdlib.h>   /* malloc, free */
    #include <stdint.h>   /* uint32_t */
    #define NTURBOPFOR_DAC
    #include "vp4.h"

    #define P4NENC_BOUND(n) ((n+127)/128+(n+32)*sizeof(uint32_t))

    int main(int argc, char* argv[]) {
      printf("Hello TurboPFor\n");
      int ar[32] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
      unsigned elnum = 10;
      unsigned char* compress_buf = malloc(P4NENC_BOUND(elnum));
      int *uncompress_buf = malloc((elnum+32)*sizeof(ar[0]));

      /* compress elnum integers, then decompress them again */
      size_t compress_size = p4nenc32((uint32_t*)ar, elnum, compress_buf);
      printf("compress size is %zu\n", compress_size);

      size_t uncompress_size = p4ndec32(compress_buf, elnum, (uint32_t*)uncompress_buf);
      printf("uncompress size is %zu\n", uncompress_size);

      free(compress_buf);
      free(uncompress_buf);
      return 0;
    }

Could you provide a corresponding minimalistic makefile that we could adapt as the repository makefile is quite dense and complex.

Thank you for your time.

Integer PFor and bitpack zigzag of delta?

Benchmark: TurboPFor Integer Compression ARM A73-ODROID-N2 1.8GHz

TurboPFor: IcApp Integer Compression Benchmark ARM A73-ODROID-N2 1.8GHz

Inverted index testdata

docs: Document Ids

	./icapp test_collection.docs

file: max bits histogram:
00: 0.002%
01:## 1.7%
02:# 0.7%
03: 0.4%
04: 0.2%
05: 0.2%
06:# 0.5%
07:# 1.4%
08:## 1.6%
09:#### 3.6%
10:###### 6.4%
11:######## 8.1%
12:################# 17%
13:############################################# 45%
14:############# 13%

  E MB/s     size     ratio   D MB/s  function (integer size=32 bits)
  363.95    3017800  21.93%   860.74 vszenc32         TurboVSimple zigzag
  342.10    3147043  22.87%   918.90 p4nzenc128v32    TurboPForV   zigzag
  337.64    3147043  22.87%   802.48 p4nzenc32        TurboPFor    zigzag
  311.39    3186374  23.15%   802.15 p4nzzenc128v32   TurboPFor zzag/delta
   68.69    3190622  23.18%    82.02 bitshuffleZ+lz   Transpose+zigzag+lz
  282.14    3228322  23.46%   542.80 tpnibbleZ+lz     Transpose+zigzag+lz
  321.63    3239377  23.54%   995.61 p4nd1enc128v32   TurboPForV   delta1
  319.97    3239377  23.54%   868.51 p4nd1enc32       TurboPFor    delta1
  282.14    3277684  23.81%   604.98 tpbyteZ+lz       Transpose+zigzag+lz
  334.27    3357179  24.39%  1029.96 p4ndenc128v32    TurboPForV   delta
  357.76    3357179  24.39%   946.45 p4ndenc32        TurboPFor    delta
  630.16    3515568  25.54%   757.64 vbddenc32        TurboVByte zzag delt
  347.56    3531916  25.66%   463.71 bvzzenc32        bitio zigzag/delta
  290.61    3673765  26.69%   703.59 tpnibble+lz      Transpose+lz
  290.51    3676694  26.71%   792.91 tpbyte+lz        Transpose+lz
  263.52    3701675  26.90%   541.95 tpnibbleX+lz     Transpose+xor+lz
  268.20    3738684  27.16%   660.43 tpbyteX+lz       Transpose+xor+lz
 1457.05    3925193  28.52%  1964.78 v8nd1enc128v32   TByte+TPackV delta1
 1706.13    3945921  28.67%  2083.77 v8ndenc128v32    TByte+TPackV delta
 1040.31    3948726  28.69%  1175.05 vbzenc32         TurboVByte zigzag
 1271.56    4014799  29.17%  1743.96 v8nzenc128v32    TByte+TPackV zigzag
 1144.65    4180490  30.37%  1303.84 vbd1enc32        TurboVByte delta1
 1247.24    4181163  30.38%  1413.51 vbdenc32         TurboVByte delta
   68.63    4228629  30.72%    82.93 bitshuffleX+lz   Transpose+xor+lz
   70.33    4260129  30.95%    86.04 bitshuffle+lz    Transpose+lz
  346.33    4329918  31.46%   513.06 bvzenc32         bitio zigzag
 1541.42    4406850  32.02%  1699.80 bitnzpack128v32  TurboPackV   zigzag
 1231.07    4406850  32.02%  1548.00 bitnzpack32      TurboPack    zigzag
 1702.75    4724718  34.33%  2228.52 v8zenc32         TurboByte zigzag
  734.63    4724718  34.33%  1160.19 streamvbyte zzag StreamVByte zigzag*
 1762.95    4889190  35.52%  2551.60 v8d1enc32        TurboByte delta1
  778.78    4889479  35.53%   827.47 streamvbyte delt StreamVByte delta*
 1882.81    4889479  35.53%  2793.45 v8denc32         TurboByte delta
  110.53    5050551  36.70%   117.81 SPDP             SPDP Floating Point*
  505.37    5265005  38.25%  1337.54 fpxenc32         TurboFloat XOR
  396.96    5265005  38.25%   463.69 fpfcmenc32       TurboFloat FCM
  375.79    5310026  38.58%   445.96 fpgenc32         bitio TurboGorilla
  489.19    5552401  40.34%  1348.02 p4nenc128v32     TurboPForV
  490.20    5552401  40.34%  1091.29 p4nenc32         TurboPFor
 1401.27    5653941  41.08%  1464.96 v8nenc128v32     TurboByte+TbPackV
 3249.13    5661003  41.13%  1419.34 bitnpack128v32   TurboPackV
 2705.05    5661003  41.13%   949.46 bitnpack32       TurboPack
  469.37    6226583  45.24%  2044.46 vsenc32          TurboVSimple
  216.40    6577162  47.79%   381.73 fpdfcmenc32      TurboFloat DFCM
  217.09    6655645  48.36%   379.77 fp2dfcmenc32     TurboFloat DFCM 2D
 1276.51    6682141  48.55%  1103.98 vbenc32          TurboVByte scalar
 1503.53    6790511  49.34%  2060.99 bitnd1pack32     TurboPack    delta1
 1836.09    6790511  49.34%  2000.77 bitnd1pack128v32 TurboPackV   delta1
 1873.83    6816223  49.52%  2227.79 bitndpack32      TurboPack    delta
 1965.63    6816223  49.52%  2152.20 bitndpack128v32  TurboPackV   delta
 1926.83    7511530  54.58%  4486.09 v8enc32          TurboByte SIMD
 1347.23    7511530  54.58%  3263.77 streamvbyte      StreamVByte SIMD*
  129.33    9044107  65.71%   913.17 lz               lz
  549.52   10833105  78.71%   918.17 srlez32          TurboRLE32 ESC zzag
  527.51   13712109  99.63%  1268.04 srlex32          TurboRLE32 ESC xor
 4050.42   13763312 100.00%  4135.61 memcpy           memcpy
  562.11   13763312 100.00%  4128.17 srle32           TurboRLE32 ESC
  112.36   13763312 100.00%  4121.99 trle             TurboRLE
   79.66   13763312 100.00%  4118.29 trlex            TurboRLE   xor
   69.91   13763312 100.00%  4118.29 trlez            TurboRLE   zigzag
 1360.01   13763312 100.00%  2218.82 tp4enc           Nibble transpose
 1774.54   13763312 100.00%  3416.91 tpenc            Byte transpose
   87.95   13763312 100.00%    90.10 bitshuffle       Bit transpose
                                                      * : external library

freqs: Term Frequencies

	./icapp test_collection.freqs

file: max bits histogram:
00: 0.001%
01:########################### 27%
02:############ 12%
03:###### 6.0%
04:### 3.1%
05:## 1.6%
06:# 1.0%
07:# 1.0%
08:# 0.9%
09:## 1.9%
10:### 3.2%
11:#### 4.1%
12:######## 8.5%
13:####################### 23%
14:###### 6.5%

  E MB/s     size     ratio   D MB/s  function (integer size=32 bits)
  428.07    1722849  12.52%  1974.37 vsenc32          TurboVSimple
  420.52    1730336  12.57%  1524.34 p4nenc32         TurboPFor
  427.48    1730336  12.57%  1359.47 p4nenc128v32     TurboPForV
  423.36    1907705  13.86%   934.50 tpnibble+lz      Transpose+lz
  314.49    1922544  13.97%   962.94 vszenc32         TurboVSimple zigzag
  383.56    1955547  14.21%   656.15 tpnibbleX+lz     Transpose+xor+lz
   72.45    1958057  14.23%    83.00 bitshuffleX+lz   Transpose+xor+lz
   74.52    1962095  14.26%    86.33 bitshuffle+lz    Transpose+lz
  338.18    1996059  14.50%   960.12 p4nzenc128v32    TurboPForV   zigzag
  336.61    1996059  14.50%   801.40 p4nzenc32        TurboPFor    zigzag
  326.98    2130744  15.48%   932.79 fpxenc32         TurboFloat XOR
  277.91    2130744  15.48%   402.28 fpfcmenc32       TurboFloat FCM
  351.60    2145784  15.59%   562.34 tpnibbleZ+lz     Transpose+zigzag+lz
  375.51    2219138  16.12%   977.37 tpbyte+lz        Transpose+lz
  388.77    2259515  16.42%   513.15 bvzenc32         bitio zigzag
   73.34    2300817  16.72%    83.59 bitshuffleZ+lz   Transpose+zigzag+lz
  321.11    2381520  17.30%   702.96 tpbyteX+lz       Transpose+xor+lz
  325.91    2398296  17.43%   815.65 p4nzzenc128v32   TurboPFor zzag/delta
  326.45    2407923  17.50%   602.12 tpbyteZ+lz       Transpose+zigzag+lz
  367.16    2764046  20.08%   499.67 bvzzenc32        bitio zigzag/delta
 3449.45    2784387  20.23%  1747.06 v8nenc128v32     TurboByte+TbPackV
 2828.46    2804702  20.38%  2064.70 bitnpack32       TurboPack
 3340.61    2804702  20.38%  1735.16 bitnpack128v32   TurboPackV
  738.89    2883341  20.95%   909.79 vbddenc32        TurboVByte zzag delt
  472.07    3029119  22.01%   503.62 fpgenc32         bitio TurboGorilla
 1507.65    3146929  22.86%  1760.24 v8nzenc128v32    TByte+TPackV zigzag
 1576.91    3189535  23.17%  1773.85 bitnzpack128v32  TurboPackV   zigzag
 1242.85    3189535  23.17%  1511.45 bitnzpack32      TurboPack    zigzag
 2596.85    3451880  25.08%  2620.08 vbenc32          TurboVByte scalar
 1428.47    3475286  25.25%  1662.03 vbzenc32         TurboVByte zigzag
  354.90    3925934  28.52%  1038.90 lz               lz
 2179.81    4307847  31.30%  4552.86 v8enc32          TurboByte SIMD
 1369.07    4307847  31.30%  3289.51 streamvbyte      StreamVByte SIMD*
 1702.33    4323652  31.41%  2237.57 v8zenc32         TurboByte zigzag
  734.98    4323652  31.41%  1158.43 streamvbyte zzag StreamVByte zigzag*
  269.10    5742202  41.72%   991.38 p4ndenc128v32    TurboPForV   delta
  271.95    5742202  41.72%   832.53 p4ndenc32        TurboPFor    delta
 1744.18    7227089  52.51%  2805.97 v8denc32         TurboByte delta
  525.42    7227089  52.51%   619.72 streamvbyte delt StreamVByte delta*
 1217.67    7234307  52.56%  1996.42 v8ndenc128v32    TByte+TPackV delta
  707.26    7343213  53.35%   732.71 vbdenc32         TurboVByte delta
  583.04   10319134  74.98%  1371.67 srle32           TurboRLE32 ESC
  192.83   10350497  75.20%   385.57 fpdfcmenc32      TurboFloat DFCM
  191.72   10690588  77.67%   383.11 fp2dfcmenc32     TurboFloat DFCM 2D
  477.76   10963622  79.66%   904.17 srlex32          TurboRLE32 ESC xor
  241.74   10981857  79.79%   858.44 p4nd1enc128v32   TurboPForV   delta1
  244.88   10981857  79.79%   703.11 p4nd1enc32       TurboPFor    delta1
  528.46   11113500  80.75%   910.03 srlez32          TurboRLE32 ESC zzag
   70.58   11621965  84.44%    71.31 SPDP             SPDP Floating Point*
  790.72   11667495  84.77%  1868.49 v8nd1enc128v32   TByte+TPackV delta1
 1043.94   11716644  85.13%  2462.57 v8d1enc32        TurboByte delta1
  534.52   13329244  96.85%   545.36 vbd1enc32        TurboVByte delta1
 1899.70   13727224  99.74%  2905.49 bitndpack32      TurboPack    delta
 1614.84   13727224  99.74%  2089.78 bitndpack128v32  TurboPackV   delta
 4155.59   13763304 100.00%  4040.90 memcpy           memcpy
 1693.11   13763304 100.00%  3253.74 tpenc            Byte transpose
 1328.63   13763304 100.00%  2247.44 tp4enc           Nibble transpose
  122.10   13763304 100.00%  4042.09 trlex            TurboRLE   xor
   83.47   13763304 100.00%  4042.09 trlez            TurboRLE   zigzag
  152.58   13763304 100.00%  4043.27 trle             TurboRLE
   88.06   13763304 100.00%    90.70 bitshuffle       Bit transpose
 1179.37   13790184 100.20%  3577.67 bitnd1pack32     TurboPack    delta1
 1952.24   13790184 100.20%  1982.33 bitnd1pack128v32 TurboPackV   delta1

sizes: Number of terms

	./icapp test_collection.sizes

file: max bits histogram:
00: 0.001%
01:########################### 27%
02:############ 12%
03:###### 6.0%
04:### 3.1%
05:## 1.6%
06:# 1.0%
07:# 1.1%
08:# 1.0%
09:## 1.9%
10:### 3.2%
11:#### 4.1%
12:######## 8.5%
13:####################### 23%
14:###### 6.5%
15: 0.002%
16: 0.001%
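
A histogram like the one above can be approximated in a few lines of C. This is only a sketch under the assumption that each bucket counts the values that need exactly that many bits (icapp may instead take the maximum over a block); `bit_width32` and `bits_histogram` are illustrative names, not icapp functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of bits needed to store v: 0 for v == 0, 32 for v >= 2^31. */
static unsigned bit_width32(uint32_t v)
{
    unsigned b = 0;
    while (v) { b++; v >>= 1; }
    return b;
}

/* Count how many of the n input values need exactly b bits, b = 0..32.
 * hist must have room for 33 counters. */
static void bits_histogram(const uint32_t *in, size_t n, size_t hist[33])
{
    assert(in != NULL || n == 0);
    for (unsigned b = 0; b <= 32; b++) hist[b] = 0;
    for (size_t i = 0; i < n; i++) hist[bit_width32(in[i])]++;
}
```

Dividing each counter by n gives the percentages printed next to the bars.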

  E MB/s     size     ratio   D MB/s  function (integer size=32 bits)
  380.99      15077  37.69%  1904.95 p4nenc128v32     TurboPForV
  377.40      15077  37.69%  1538.62 p4nenc32         TurboPFor
  373.87      15485  38.71%  1600.16 vsenc32          TurboVSimple
   79.22      15970  39.92%    89.29 bitshuffle+lz    Transpose+lz
   77.23      15984  39.96%    87.54 bitshuffleX+lz   Transpose+xor+lz
  333.37      16009  40.02%  1025.74 p4nzenc128v32    TurboPForV   zigzag
  330.61      16009  40.02%   833.42 p4nzenc32        TurboPFor    zigzag
  493.88      16101  40.25%   930.33 tpnibble+lz      Transpose+lz
  400.04      16159  40.25%   769.31 tpnibbleX+lz     Transpose+xor+lz
  327.90      16268  40.67%   909.18 vszenc32         TurboVSimple zigzag
  373.87      16918  42.29%   678.03 tpnibbleZ+lz     Transpose+zigzag+lz
   77.23      17022  42.55%    85.85 bitshuffleZ+lz   Transpose+zigzag+lz
  312.53      17254  43.13%   930.33 p4nzzenc128v32   TurboPFor zzag/delta
  701.82      17391  43.47%   769.31 vbzenc32         TurboVByte zigzag
  459.82      17399  43.49%  1111.22 tpbyte+lz        Transpose+lz
 1481.63      17467  43.66%  5714.86 v8nenc128v32     TurboByte+TbPackV
  380.99      17625  44.06%   909.18 tpbyteX+lz       Transpose+xor+lz
 1176.59      17632  44.08%  1000.10 vbenc32          TurboVByte scalar
 3636.73      17899  44.74%  6667.33 bitnpack128v32   TurboPackV
 3333.67      17899  44.74%  5714.86 bitnpack32       TurboPack
  465.16      17958  44.89%  1666.83 fpxenc32         TurboFloat XOR
  370.41      17958  44.89%   465.16 fpfcmenc32       TurboFloat FCM
  412.41      18004  45.01%   519.53 fpgenc32         bitio TurboGorilla
  380.99      18141  45.35%   833.42 tpbyteZ+lz       Transpose+zigzag+lz
  888.98      18225  45.56%  2000.20 v8nzenc128v32    TByte+TPackV zigzag
 1739.30      18768  46.92%  2222.44 v8zenc32         TurboByte zigzag
  740.81      18768  46.92%  1142.97 streamvbyte zzag StreamVByte zigzag*
 2222.44      18909  47.27%  5000.50 v8enc32          TurboByte SIMD
 1428.71      18909  47.27%  3333.67 streamvbyte      StreamVByte SIMD*
 1250.12      19101  47.75%  2222.44 bitnzpack32      TurboPack    zigzag
 1481.63      19101  47.75%  1818.36 bitnzpack128v32  TurboPackV   zigzag
  597.07      20093  50.23%   625.06 vbddenc32        TurboVByte zzag delt
  300.78      20339  50.84%   408.20 bvzenc32         bitio zigzag
  287.80      22371  55.92%   400.04 bvzzenc32        bitio zigzag/delta
  136.53      28940  72.34%   655.80 lz               lz
  266.69      29111  72.77%  1052.74 p4ndenc128v32    TurboPForV   delta
  270.30      29111  72.77%   816.41 p4ndenc32        TurboPFor    delta
 2000.20      29559  73.89%  2857.43 v8denc32         TurboByte delta
  400.04      29559  73.89%   454.59 streamvbyte delt StreamVByte delta*
  264.93      29583  73.95%  1000.10 p4nd1enc128v32   TurboPForV   delta1
  268.48      29583  73.95%   754.79 p4nd1enc32       TurboPFor    delta1
 1250.12      29637  74.09%  2666.93 v8ndenc128v32    TByte+TPackV delta
 1818.36      30085  75.20%  2666.93 v8d1enc32        TurboByte delta1
 1111.22      30163  75.40%  2353.18 v8nd1enc128v32   TByte+TPackV delta1
  519.53      32431  81.07%   512.87 vbdenc32         TurboVByte delta
  493.88      33131  82.82%   519.53 vbd1enc32        TurboVByte delta1
   80.82      38749  96.86%    80.65 SPDP             SPDP Floating Point*
  645.23      39842  99.60%  3077.23 srle32           TurboRLE32 ESC
  526.37      39884  99.70%  1818.36 srlex32          TurboRLE32 ESC xor
  548.00      39891  99.72%  1250.12 srlez32          TurboRLE32 ESC zzag
10001.00      40004 100.00% 10001.00 memcpy           memcpy
  115.95      40004 100.00% 10001.00 trle             TurboRLE
   85.85      40004 100.00% 10001.00 trlex            TurboRLE   xor
   71.56      40004 100.00% 10001.00 trlez            TurboRLE   zigzag
 5000.50      40004 100.00%  8000.80 tpenc            Byte transpose
 1818.36      40004 100.00%  2353.18 tp4enc           Nibble transpose
   92.18      40004 100.00%    92.60 bitshuffle       Bit transpose
  256.44      40063 100.15%   416.71 fpdfcmenc32      TurboFloat DFCM
 2353.18      40081 100.19%  5714.86 bitndpack32      TurboPack    delta
 2000.20      40081 100.19%  2666.93 bitnd1pack128v32 TurboPackV   delta1
 2222.44      40081 100.19%  3077.23 bitndpack128v32  TurboPackV   delta
 1739.30      40081 100.19%  4000.40 bitnd1pack32     TurboPack    delta1
  259.77      40124 100.30%   421.09 fp2dfcmenc32     TurboFloat DFCM 2D

How do I build it into libic.so to use with Java?

Could you please suggest

Improve const-correctness

I suggest adding the keyword "const" to the type specifiers of parameters such as "in" (function "p4denc32").
Would you like to apply the advice from an article in more places in your source files?
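
As a minimal sketch of what the suggestion means: the function `demo_enc32` below is hypothetical and not part of TurboPFor, and it only copies bytes so the example stays self-contained. Marking the input pointer `const` lets the compiler reject accidental writes to the source array and lets callers pass read-only data without a cast.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical encoder, NOT a TurboPFor function. The point is the
 * signature: `in` is read-only, `out` is written.
 * Returns the end of the output buffer. */
static unsigned char *demo_enc32(const uint32_t *in, size_t n, unsigned char *out)
{
    assert(in != NULL && out != NULL);
    memcpy(out, in, n * sizeof *in);   /* reading through a const pointer is fine */
    /* in[0] = 0;   <- would not compile: `in` points to read-only data */
    return out + n * sizeof *in;
}

/* A read-only table can be passed directly because the parameter is const. */
static const uint32_t demo_table[4] = { 1, 2, 3, 4 };
```

Without the `const` qualifier, passing `demo_table` would need a cast that discards the qualifier.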

Linux compilation error in "non-AVX2 mode"

Hi,
when I build the project on Ubuntu 16.04 (the distribution/version probably doesn't really matter anyway) using just "make", it ends with errors like this (all of them are undefined references):

eliasfano.o: In function `efano1enc256v32':
eliasfano.c:(.text+0x1438): undefined reference to `bitpack256v32'

but "make AVX2=1" builds it successfully. Seems like there could be some missing ifdef's. For example in bitpack.c:

#if defined(__AVX2__) && defined(AVX2_ON)
[...]
unsigned char *bitpack256v32(unsigned *__restrict in, unsigned n, unsigned char *__restrict out, unsigned b) { unsigned char *pout = out+PAD8(256*b); BITPACK256V32(in, b, out, 0); return pout; }
[...]
#endif

which in "non-AVX2 mode" removes the bitpack256v32 function, although it is still used somewhere else (probably in plugins.cc).
I've just found this library and it looks very promising. Thank you for your work!

Licence and others

Hello,

First of all, sorry for my English.

I am currently finishing a thesis, and your implementation is cited in the bibliography. To cite it properly, I would like to know a few things about your code:

- What are the differences between your code (the PFOR codec) and other state-of-the-art implementations such as that of D. Lemire et al.? http://arxiv.org/pdf/1401.6399v12.pdf
- Are the optimizations in the implementation only, or has the algorithm changed?
- Is this code part of any published work (e.g. article, thesis) that can be cited using a reference manager?
- What is the licence of your code (if it has one)?

Thank-you,
Julian

float16 lossy compression

There are padfloat32() and padfloat64(), but no padfloat16(), is it straightforward to create a float16 version based on the float32 version?

Also, what if I do the lossy compression for float16 data using icapp -f4 -g.001 float16data?

a question about licence

Hi, I am very interested in your PFor and RLE repos. What is the licence of PFor and RLE, and may I use them in an open source project?

vsenc32 crashes if input contains zero values

If I understood correctly, vsenc32/vsdec32 does not support zero values in the input stream. Is this correct? I encounter weird crashes (memory corruption), which go away as long as I avoid zero elements in the input stream. If this is the case, please add an assert and/or a short comment in the documentation.

// vsencNN: compress array with n unsigned (NN bits in[n]) values to the buffer out. Return value = end of compressed output buffer out
// vsdecNN: decompress buffer into an array of n unsigned values. Return value = end of compressed input buffer in

License

I work on an open-source analytics storage engine (Apache Kudu) and was pondering using this library.
Is this released with an apache-friendly license (MIT, BSD, Apache 2.)?
If not, could it be?

Thanks
-david

Benchmark: TurboPFor: IcApp 16 bits Integer Compression - Nanopore Signal Data

Testing TurboPFor, I've made some experiments with nanopore signal data.
File: all 16-bits signal data extracted from multi_fast5_zip.fast5

         ./icapp sig.u16 -Elzturbo,39 -Fs -e83,77,11,39,51,30,76

file: max bits histogram:
04: 0.000% 
06: 0.000% 
07: 0.002% 
08:###### 6.0% 
09:######################################################################################### 89% 
10:##### 5.5% 
11: 0.000% 
12: 0.000% 
16: 0.001% 

file: delta max bits histogram:
00:## 2.4%
01:## 2.4%
02:##### 4.7%
03:######### 9.2%
04:################# 17%
05:########################### 27%
06:######################## 24%
07:######### 9.0%
08:#### 3.7%
09:# 0.8%
10: 0.012%
11: 0.002%
12: 0.000%
13: 0.000%
14: 0.000%

Filesize: 3.097.862 bytes    CPU: Skylake i7-6700  3.4GHz

  E MB/s     size     ratio   D MB/s function (integer size=16 bits) 
   94.48    1275037  41.16%  1740.37 Lztp4z Nibble    Transpose+zigzag+turboanx,64
   81.99    1279609  41.31%  1512.63 Lztpz Byte       Transpose+zzag+turboanx,64
    1.81    1283751  41.44%  2201.98 Lztp4z Nibble    Transpose+zigzag+lzturbo,39 
    1.13    1289797  41.64%  1912.47 Lztpz Byte       Transpose+zzag+lzturbo,39    	
    3.06    1292571  41.72%  1627.03 Lztpz Byte       Transpose+zzag+zstd,22    	
  120.73    1293228  41.75%  2010.29 lzv8zenc         TurboByte+zzag+turboanx    	
    7.32    1296883  41.86%  1946.86 lzv8zenc         TurboByte+zzag+lzturbo,39    	
    7.48    1310074  42.29%  1690.97 lzv8zenc         TurboByte+zzag+zstd,22 	
    6.44    1333780  43.05%  1103.62 vbz              vbz_compression 
  614.78    1432663  46.25%  4419.20 p4nzenc128v16    TurboPForV   zigzag  	
  547.91    1523046  49.16%   752.46 lzv8zenc         TurboByte+zzag+fse    	
    1.55    1577260  50.91%  5779.59 LztpzByte        Transpose+zzag+lzturbo,19    	
   11.68    1577173  50.91%  1789.64 LztpzByte        Transpose+zzag+lz4,12    	
   10.80    1583766  51.12%  6387.34 lzv8zenc         TurboByte+zzag+lzturbo,19    	
   21.06    1587076  51.23%  4863.21 lzv8zenc         TurboByte+zzag+lz4,12    	
  367.44    1632853  52.71%   573.68 Lztpz Byte        Transpose+zzag+fse    	
 6749.15    1676704  54.12%  8775.81 v8nzenc128v16    TByte+TPackV zigzag  	
 6992.92    1676746  54.13%  8927.56 bitnzpack128v16  TurboPackV   zigzag  	
   68.73    1705144  55.04%   959.98 lzv8enc          TurboByte+turboanx

The 'lzv8zenc' w/ zstd,22 is similar to vbz compression but with native 16 bits TurboByte.
TurboANX: see Entropy Coding Benchmark
see issue

Generate output file

Hello,
is it possible to actually save the compressed output in a file? And to decompress it afterwards?
thanks.

Release Windows binary

@powturbo Can you release a binary file for Windows for us to download?

Can I use TurboPFor in my commercial products?

Hi, thanks for your great work on all these algorithms. I noticed that you haven't put a LICENSE file in almost any of your repositories, so I wonder whether I can use your code directly inside my commercial products?

Compilation issue vbd1dec32, BITDIZERO32

I'm getting an error (with gcc (GCC) 4.9.0):

CMakeFiles/myproject.dir/TurboPFor/vint.c.o: In function `vbd1dec32':
vint.c:(.text+0xb472): undefined reference to `BITDIZERO32'
collect2: error: ld returned 1 exit status

Of note, I'm compiling just bitutil.c, vint.c, and vsimple.c.
I think it has to do with the last #else fork in bitutil.h defined(__SSE2__) && defined(USE_SSE) / defined(__AVX2__) && defined(USE_AVX2)

p4nzenc16 performance

I was doing some benchmarking using your code, and was very interested in your p4nzenc16 and other algorithms defined in vp4.h. Below are the test results with our own data:

icapp.exe -s2 ..\data\data
  E MB/s     size     ratio   D MB/s  function
  291.27     109183  83.30%  1213.63 p4nzenc16        ..\data\data
  296.54     109183  83.30%  1472.72 p4nzenc128v16    ..\data\data
13107.20     131072 100.00% 11915.64 memcpy

As shown, the encoding and decoding speeds are much lower than in your benchmarks. I think the reason lies in the data characteristics. Our data looks like the following int32 array:
3956541806 | 4021030862 | 4156167980 | 4292351536 | 122688988 | 306712316 | 340264262 | 429391892 | 4293721652 | 363856276 | 311752484 | 339799824| 4235594008 | 4524368 | 77400290 | 119604839 | 147588075 | 156565166 | 123075266 | 129103883 | 4290838339 | 114620947 | 47183905 | 13301076
4285659166 | 4236114193 | 4166384448 | 4157799805 | 4175364326 | 4183426246 | 4234216941 | 4256303196 | 13369437 | 4252698291 | 12388188 | 56166432
254473376 | 190640480 | 41675844 | 4243778848 | 4123848842 | 4021352164 | 4006412090 | 3940549509 | 2883989 | 4012638792 | 4077652372 | 4104654984
264900636 | 329517310 | 402456453 | 331937716 | 279507270 | 124447292 | 4257934256 | 4158318916 | 4276027312 | 4211534776 | 4081119688 | 3936678884
53674504 | 59572339 | 54197952 | 55901532 | 23461120 | 7207976 | 4273994955 | 4246338828 | 4284022907 | 4266917110 | 4243455526 | 4239261366
4124374812 | 4083874473 | 4060086691 | 4094035282 | 4155902560 | 4226682114 | 48958370 | 109907428 | 19136432 | 67767338 | 140248949 | 223478996
4278517578 | 4282253572 | 18153714 | 22609897 | 4287561290 | 4280549154 | 4287430841 | 8192371 | 983184 | 4294639905 | 6422579 | 15335227
4287365347 | 15073760 | 26148779 | 7601913 | 4272029419 | 4277928110 | 10486103 | 30277608 | 4292804642 | 25100410 | 10944260 | 4273405619
4290183352 | 4285399269 | 4980480 | 3211078 | 4284940362 | 4283039840 | 6488321 | 10158232 | 786522 | 9306149 | 4259602 | 4291821626
4288020496 | 4278124841 | 17105015 | 7012364 | 1179425 | 4279107371 | 4287561847 | 3539276 | 4292739123 | 786612 | 9109491 | 14483363
4278583511 | 721358 | 22741000 | 1048332 | 4281401193 | 4281597998 | 4291166556 | 22347807 | 8257644 | 20512778 | 13107137 | 4284219109
4290576569 | 15663574 | 35323624 | 3276322 | 4261543643 | 4276158780 | 13304084 | 34340846 | 12451881 | 39059472 | 6094518 | 4272881238

Question 1: How should we compress this kind of data to achieve the results shown in your benchmarks (> 1.3GB/s encoding, > 5GB/s decoding, >2x compression ratio)?

Question 2: Do you have any documents regarding the mechanism of the p4nzenc16?
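
Many of the sample values above sit just below 2^32 (for example 4294639905), which is what small negative 32-bit integers look like when printed as unsigned. Assuming that reading of the data is right, zigzag coding (the transform the "zigzag" rows in the benchmark tables are labelled with) maps such values to small unsigned numbers before they are packed. The sketch below shows only the plain zigzag mapping; `zigzag_enc32` and `zigzag_dec32` are illustrative names, not functions from vp4.h.

```c
#include <assert.h>
#include <stdint.h>

/* Zigzag: interleave negative and positive values so that numbers close
 * to zero get small codes: 0->0, -1->1, 1->2, -2->3, 2->4, ... */
static uint32_t zigzag_enc32(int32_t x)
{
    /* Shift as unsigned to avoid undefined behaviour on negative values;
     * the mask is all ones when x is negative. */
    return ((uint32_t)x << 1) ^ (x < 0 ? 0xFFFFFFFFu : 0u);
}

static int32_t zigzag_dec32(uint32_t u)
{
    uint32_t v = (u >> 1) ^ (0u - (u & 1u));   /* undo the interleaving */
    /* Conversion of values above INT32_MAX is implementation-defined in C;
     * two's complement wrap-around is assumed here. */
    return (int32_t)v;
}
```

With this mapping, 4294639905 (read as the int32 value -327391) becomes 654781, which needs 20 bits instead of 32.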

Cannot generate header jic.h

$ javah -jni jic or $ javah -jni jic.java
fails - no such file. Could you please add jic.h to the repository?
So far, I've tried to make a shared library without it and succeeded with the following makefile.
The differences are the added flag -fPIC and

    #$(CC) $(OBJS) -lm -o -shared libic.so $(LFLAGS)
    $(CC) -shared -Wl,-soname,libic.so -o libic.so.1 -lm $(LFLAGS) $(OBJS)

# Linux: "export CC=clang" windows mingw: "set CC=gcc" or uncomment one of following lines
# CC=clang
# CC=gcc

MARCH=-march=native
#MARCH=-msse2
CFLAGS=-DNDEBUG -fstrict-aliasing -m64 $(MARCH) -Iext -fPIC 

UNAME := $(shell uname)
ifeq ($(UNAME), Linux)
LIBTHREAD=-lpthread
LIBRT=-lrt
else
CC=gcc
endif

BIT=./
all: icbench idxcr idxqry idxseg libic

bitpack.o: $(BIT)bitpack.c $(BIT)bitpack.h $(BIT)bitpack64_.h
    $(CC) -O2 $(CFLAGS) -c $(BIT)bitpack.c

bitpackv.o: $(BIT)bitpackv.c $(BIT)bitpack.h $(BIT)bitpackv32_.h
    $(CC) -O2 $(CFLAGS) -c $(BIT)bitpackv.c

vp4dc.o: $(BIT)vp4dc.c
    $(CC) -O3 $(CFLAGS) -funroll-loops -c $(BIT)vp4dc.c

vp4dd.o: $(BIT)vp4dd.c
    $(CC) -O3 $(CFLAGS) -funroll-loops -c $(BIT)vp4dd.c

varintg8iu.o: $(BIT)ext/varintg8iu.c $(BIT)ext/varintg8iu.h 
    $(CC) -O2 $(CFLAGS) -c -funroll-loops -std=c99 $(BIT)ext/varintg8iu.c

idxqryp.o: $(BIT)idxqry.c
    $(CC) -O3 $(CFLAGS) -c $(BIT)idxqry.c -o idxqryp.o

SIMDCOMPD=ext/simdcomp/
SIMDCOMP=$(SIMDCOMPD)bitpacka.o $(SIMDCOMPD)src/simdintegratedbitpacking.o $(SIMDCOMPD)src/simdcomputil.o $(SIMDCOMPD)src/simdbitpacking.o

#LIBFOR=ext/for/for.o
MVB=ext/MaskedVByte/src/varintencode.o ext/MaskedVByte/src/varintdecode.o

# Lzturbo not included
#LZT=../lz/lz8c0.o ../lz/lz8d.o ../lz/lzbc0.o ../lz/lzbd.o

# blosc. Set the env. variable "EXT=blosc" to include 
#EXT=blosc
ifeq ($(EXT), blosc)
B=ext/
CFLAGS+=-DSHUFFLE_SSE2_ENABLED -DHAVE_LZ4 -DHAVE_ZLIB -Iext/
LFLAGS+=-lpthread 
BLOSC=$(B)lz4hc.o $(B)c-blosc/blosc/blosc.o $(B)c-blosc/blosc/blosclz.o $(B)c-blosc/blosc/shuffle.o $(B)c-blosc/blosc/shuffle-generic.o $(B)c-blosc/blosc/shuffle-sse2.o
endif

LZ4=ext/lz4.o 

#ZLIB=-lz

#BSHUFFLE=ext/bitshuffle/src/bitshuffle.o

OBJS=icbench.o bitutil.o vint.o bitpack.o bitunpack.o eliasfano.o vsimple.o vp4dd.o vp4dc.o varintg8iu.o bitpackv.o bitunpackv.o $(TRANSP) ext/simple8b.o transpose.o $(BLOSC) $(SIMDCOMP) $(LIBFOR) $(LZT) $(LZ4) $(MVB) $(ZLIB) $(BSHUFFLE)

icbench: $(OBJS)
    $(CC) $(OBJS) -lm -o icbench $(LFLAGS)

libic: $(OBJS)
    #$(CC) $(OBJS) -lm -o -shared libic.so $(LFLAGS)
    $(CC) -shared -Wl,-soname,libic.so -o libic.so.1 -lm $(LFLAGS) $(OBJS)

idxseg:   idxseg.o
    $(CC) idxseg.o -o idxseg

ifeq ($(UNAME), Linux)
para: CFLAGS += -DTHREADMAX=32  
para: idxqry
endif

idxcr:   idxcr.o bitpack.o vp4dc.o bitutil.o
    $(CC) idxcr.o bitpack.o bitpackv.o vp4dc.o bitutil.o -o idxcr $(LFLAGS)

idxqry:   idxqry.o bitunpack.o vp4dd.o bitunpackv.o bitutil.o
    $(CC) idxqry.o bitunpack.o bitunpackv.o vp4dd.o bitutil.o $(LIBTHREAD) $(LIBRT) -o idxqry $(LFLAGS)

.c.o:
    $(CC) -O3 $(CFLAGS) $< -c -o $@

.cc.o:
    $(CXX) -O3 -DNDEBUG -std=c++11 $< -c -o $@

.cpp.o:
    $(CXX) -O3 -DNDEBUG -std=c++11 $< -c -o $@

clean:
    @find . -type f -name "*\.o" -delete -or -name "*\~" -delete -or -name "core" -delete

cleanw:
    del /S ..\*.o
    del /S ..\*~

Inconsistent definition of ctz64/ctz32/clz64/clz32

For example, ctz64 for non-Windows builds is:

#define ctz64(x) __builtin_ctzll(x)

while for windows it's:

static inline int ctz64(uint64_t x) { unsigned long z; _BitScanForward64(&z, x); return x?z:64; }

__builtin_ctzll isn't defined for x=0, whereas the Windows version does the check: x?z:64.

https://github.com/powturbo/TurboPFor/blob/4df4bcea29b670dab3acb985aac83a7562bfa2eb/conf.h#L65
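
One way to make the two definitions agree is a wrapper that handles zero explicitly on every platform. This is only a sketch: the name `ctz64_safe` is made up here, and returning 64 for zero simply mirrors the Windows branch quoted above. Whether the extra branch is acceptable in hot loops is a separate question.

```c
#include <assert.h>
#include <stdint.h>
#if defined(_MSC_VER)
#include <intrin.h>
#endif

/* Count trailing zero bits; returns 64 for x == 0 on every compiler. */
static inline int ctz64_safe(uint64_t x)
{
    if (x == 0) return 64;             /* __builtin_ctzll(0) is undefined */
#if defined(_MSC_VER)
    unsigned long z;
    _BitScanForward64(&z, x);          /* 64-bit targets only; x != 0 here */
    return (int)z;
#else
    return __builtin_ctzll(x);         /* GCC/Clang builtin */
#endif
}

/* Small self-check showing the intended behaviour. */
static void ctz64_safe_selfcheck(void)
{
    assert(ctz64_safe(0) == 64);
    assert(ctz64_safe(8) == 3);
}
```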

How does TurboPFor work?

Is TurboPFor a library for int compression? If yes, why does the page look like a comparison tool for different libs? Put the docs/hello world right there, not the comparison results. It's extremely painful to find out how to use the lib (I hope it wasn't assumed that users should read icbench or the source of any other tool to decipher how to use the lib).

I use visual studio, and VS build doesn't work at all. Also, that approach that some files result in multiple different output obj files doesn't help overall: it would be better if the lib had tiny wrappers that included those c-files and set required defines to modify compilation.

Even after I made it build on VS it absolutely doesn't work and completely fails for me.
I'm not even sure what function I need to use to compress. After that huge table of irrelevant info, if anybody ever gets as far as the "Function usage" section, they'll see some cryptic explanation that doesn't seem to be current anymore. From that section, it seems that I need p4fmencXXX, but that doesn't exist (nothing named p4fm exists at all). Why is that pack/unpack thing there, to confuse people? From the docs, it seems like p4packXXX should be a valid function, but it's not.

After digging, it seems that I should try p4enc32, but so far I get stack corruption, and it doesn't seem like it could ever work at all. It crashes for me in _p4enc32 on the line while(i != n) MISS;
In my case I call p4enc32 with an array of 2731 uints. That while loop will clearly corrupt stack, because the MISS expands to _in[i] = in[i], that is, it will attempt to assign _in[2730] because that loop will run up to the length of the input n=2731. I don't get it, how come that code could even work, as it writes to stack array of 287 elements?! Also, it seems like there are no checks, I don't see a single assert anywhere to check/show conditions/expectations, while in that function it had to make that check!
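
The crash described above is consistent with a block-oriented function being called with more values than its internal buffer holds. Assuming that is the cause (the issue itself does not confirm it), the caller-side pattern is to hand such a function at most one block at a time. In this sketch `BLOCK_MAX`, `block_enc32` and `chunked_enc32` are hypothetical stand-ins, not TurboPFor symbols, and the "encoder" merely copies bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_MAX 128   /* hypothetical per-call limit of the block encoder */

/* Stand-in for a block encoder that uses a fixed-size stack buffer. */
static unsigned char *block_enc32(const uint32_t *in, size_t n, unsigned char *out)
{
    uint32_t scratch[BLOCK_MAX];
    assert(n <= BLOCK_MAX);            /* the kind of check the issue asks for */
    memcpy(scratch, in, n * sizeof *in);
    memcpy(out, scratch, n * sizeof *in);
    return out + n * sizeof *in;
}

/* Encode an array of any length by splitting it into blocks. */
static unsigned char *chunked_enc32(const uint32_t *in, size_t n, unsigned char *out)
{
    while (n > 0) {
        size_t len = n < BLOCK_MAX ? n : BLOCK_MAX;
        out = block_enc32(in, len, out);
        in += len;
        n  -= len;
    }
    return out;
}
```

Called with 2731 values, the loop issues 21 full blocks of 128 followed by one block of 43, so the scratch buffer is never overrun.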

Cannot build libic.so under Linux (Ubuntu)

Hi Hamid,

after building the icbench tool via the makefile, I tried to build libic.so by following the instructions in java/jicbench.java with some small adaptations detailed below. I also found and tried #5, but no luck. I always get compilation errors for libic.so.

My adaptations to the build instructions
1 - generate header jic.h
$ cd ~/TurboPFor/java
$ javah -jni jic
$ ~~cp jtrle.h ..~~ => cp jic.h ..
2 - Compile jic and jicbench
$ javac jic.java
$ javac jicbench.java
3 - compile & link a shared library
$ cd ~/TurboPFor
$ gcc -O3 -march=native -fstrict-aliasing -m64 -shared -fPIC -I/usr/lib/jvm/default-java/include -I/usr/lib/jvm/default-java/include/linux bitpack.c bitunpack.c ~~bitpackv.c bitunpackv.c vp4dc.c vp4dd.c~~ bitpack.c bitunpack.c vp4c.c vp4d.c vsimple.c vint.c bitutil.c jic.c -o libic.so

==>Bunch of (repeating) warnings and errors: compile_errors.txt

Environment:

Virtual machine running Ubuntu, 3 virtual CPUs, on a Haswell-based host.

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 17.04
Release: 17.04
Codename: zesty

$ gcc --version
gcc (Ubuntu 6.3.0-12ubuntu2) 6.3.0 20170406

$ java -version
openjdk version "1.8.0_131"
OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-0ubuntu1.17.04.1-b11)
OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)

I recommend 1MQT instead of AOL as a query log

I recommend you benchmark with the TREC Million Query Track (1MQT) instead of the AOL query log.

There are a few reasons:

- There are copyright issues with the AOL query log. There is no clean and legal way to acquire it.
- The AOL query log is "dirty". Though this might be more realistic in a sense, it tends to exacerbate indexing issues.
For these two reasons, our paper (http://arxiv.org/abs/1401.6399) only includes the 1MQ results, and the last version does not even allude to the AOL query log. But this means that if you use only the AOL query log, you will not be able to compare with our paper. Now, hitting 1000 queries per second with GOV2 on a single core sounds excellent, but it would be much easier to appreciate if we could compare directly with our results (http://arxiv.org/abs/1401.6399).
Final reason: switching to 1MQT should be trivial. It does not require the production of new software.

Cheers and keep up the good work!

compile error undefined reference to `fseeko'

I am trying to compile using mingw on Windows, but it throws the following error:

g++ eliasfano.o vsimple.o transpose.o transpose_sse.o bitpack.o bitpack_sse.o bitunpack.o  bitunpack_sse.o vp4c.o vp4c_sse.o vp4d.o vp4d_sse.o bitutil.o fp.o vint.o ext/trlec.o ext/trled.o icbench.o plugins.o  -o icbench
icbench.o:icbench.c:(.text.startup+0xbf2): undefined reference to `fseeko'
icbench.o:icbench.c:(.text.startup+0xbfa): undefined reference to `ftello'
icbench.o:icbench.c:(.text.startup+0xc1f): undefined reference to `fseeko'
collect2.exe: error: ld returned 1 exit status
makefile:291: recipe for target 'icbench' failed
mingw32-make: *** [icbench] Error 1
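
`fseeko` and `ftello` are POSIX functions, and some Windows C runtimes do not provide them, which matches the link errors above. A common workaround, sketched here under the assumption that the toolchain offers `_fseeki64`/`_ftelli64` (older MinGW versions may need `fseeko64`/`ftello64` instead), is a thin wrapper. The names `file_seek64`, `file_tell64` and `file_size64` are illustrative and not taken from icbench.c.

```c
#if !defined(_WIN32)
#define _FILE_OFFSET_BITS 64       /* 64-bit off_t on 32-bit POSIX systems */
#define _POSIX_C_SOURCE 200809L    /* make fseeko/ftello visible */
#endif
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#if defined(_WIN32)
#define file_seek64(f, off, whence) _fseeki64((f), (long long)(off), (whence))
#define file_tell64(f)              ((int64_t)_ftelli64(f))
#else
#include <sys/types.h>
#define file_seek64(f, off, whence) fseeko((f), (off_t)(off), (whence))
#define file_tell64(f)              ((int64_t)ftello(f))
#endif

/* Size of an open file in bytes, or -1 on error.
 * The current file position is restored before returning. */
static int64_t file_size64(FILE *f)
{
    assert(f != NULL);
    int64_t pos = file_tell64(f);
    if (pos < 0 || file_seek64(f, 0, SEEK_END) != 0) return -1;
    int64_t size = file_tell64(f);
    if (file_seek64(f, pos, SEEK_SET) != 0) return -1;
    return size;
}
```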

add p4dnenc* functions in vp4dc.h header

It is not rare to have blocks of data that are not of length P4DSIZE. Since the lib already has this functionality, it should be easy to add several function declarations to the header.

Do not self include .c files

https://github.com/powturbo/TurboPFor/blob/56e510087bde39bfadf55a3128f0751433c50e2b/vp4c.c#L53

That's simply confusing for anybody tracking the code. Split these files into .c files that change defines and then include .inl or better simply .h files that contain the code. Self including .c files is just weird.

Decoding speed bug introduced

Just to make sure it's tracked properly, decoding speed reduced by more than 10% for p4nzdec256v32 (my personal project seems to have best results with this function).

9dad490 doesn't have the problem, 5dff9d3f has the problem.
It's clearly almost 10% slower for the entire test, where p4nzdec256v32 consumes perhaps 25% of the CPU time.

Times of my tests with these TurboPFor versions:
9dad490 - 16.605sec ~MyAlgo
profiler data:

p4nzdec256v32 roughly takes 3.560sec (21.44% of 16.605sec)

5dff9d3 - 17.636sec ~MyAlgo
profiler data:

p4nzdec256v32 roughly takes 4.213sec (23.89% of 17.636sec)
These tests are performed with absolutely identical code, simply TurboPFor version changed.
Also, the test uses 3 threads (according to the debugger), so the actual performance hit in p4nzdec256v32 seems to be well over 10% and might easily be in the range of 20%-30%. The times were measured without the profiler (to be clear, these were fully optimized runs of 64-bit builds of TurboPFor).

Unfortunately, due to the way TurboPFor is managed, it's in permanently broken state and I wasn't able to bisect to find offending commit: if I try to check out anything in between, it's always broken and doesn't build with all kinds of compilation errors. I'd strongly recommend adjusting your approach: it's not suitable for projects that are worked on and used by more than one person.

direct access for delta compression?

is there any plan of direct access for delta compression?
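
For context, direct access is straightforward with plain fixed-width bit packing because entry i starts at bit i*b, whereas with delta coding each value depends on the sum of all deltas before it. The sketch below shows only the plain bit-packed case, written bit by bit for clarity rather than speed; `pack_bits` and `get_packed` are illustrative names, not the library's functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack n values of b bits each (1 <= b <= 32), least significant bit first.
 * out must be zero-initialised and hold at least (n*b + 7)/8 bytes. */
static void pack_bits(const uint32_t *in, size_t n, unsigned b, unsigned char *out)
{
    assert(b >= 1 && b <= 32);
    for (size_t i = 0; i < n; i++)
        for (unsigned k = 0; k < b; k++) {
            size_t bit = i * b + k;    /* absolute bit position in out */
            if ((in[i] >> k) & 1u)
                out[bit >> 3] |= (unsigned char)(1u << (bit & 7));
        }
}

/* Read entry i directly from the packed buffer without unpacking the rest. */
static uint32_t get_packed(const unsigned char *in, size_t i, unsigned b)
{
    uint32_t v = 0;
    for (unsigned k = 0; k < b; k++) {
        size_t bit = i * b + k;
        v |= (uint32_t)((in[bit >> 3] >> (bit & 7)) & 1u) << k;
    }
    return v;
}
```

A delta-coded stream would additionally need the running sum up to entry i, for example from stored per-block base values, before a single entry could be returned.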

LZTURBO Package

Where can we find your open source for LZTURBO?

Are you still enhancing it?

Alignment and tailing padding requirements for the decoder APIs

I use p4ndec256v32 and p4nzdec256v32 in my project and I'm not sure whether they require any data alignment (for input and output buffers) or have any padding requirements. E.g. may these functions read past the end of the buffer (and by how much if they do), and do the trailing bytes have to be zero-filled or is it OK for them to be random?

Benchmark: TurboPFor Integer Compression - Skylake i7-6700 3.4GHz

TurboPFor: IcApp Integer Compression Benchmark - Skylake i7-6700 3.4GHz

Inverted index testdata

docs: Document Ids

	./icapp test_collection.docs

  E MB/s     size     ratio   D MB/s  function (integer size=32 bits) BOLD=Pareto
 1038.82    3017800  21.93%  2661.63 vszenc32         TurboVSimple zigzag
 1139.82    3147043  22.87%  4950.83 p4nzenc128v32    TurboPForV   zigzag
 1076.60    3147043  22.87%  3642.05 p4nzenc32        TurboPFor    zigzag
  964.09    3186374  23.15%  2922.15 p4nzzenc128v32   TurboPFor zzag/delta
  887.67    3190622  23.18%  1954.18 bitshuffleZ+lz   Transpose+zigzag+lz
 1088.78    3228329  23.46%  2693.93 tpnibbleZ+lz     Transpose+zigzag+lz
 1122.80    3239377  23.54%  5056.32 p4nd1enc128v32   TurboPForV   delta1
 1075.09    3239377  23.54%  3680.03 p4nd1enc32       TurboPFor    delta1
 1331.20    3246518  23.59%  6489.07 p4nzenc256v32    TurboPFor256 zigzag
 1034.37    3277695  23.81%  2621.08 tpbyteZ+lz       Transpose+zigzag+lz
 1135.96    3357179  24.39%  5334.62 p4ndenc128v32    TurboPForV   delta
 1075.59    3357179  24.39%  3644.94 p4ndenc32        TurboPFor    delta
 1305.82    3423417  24.87%  6158.08 p4nd1enc256v32   TurboPFor256 delta1
 1346.84    3515568  25.54%  1550.80 vbddenc32        TurboVByte zzag delt
 1311.54    3528896  25.64%  6330.87 p4ndenc256v32    TurboPFor256 delta
  920.32    3531916  25.66%  1058.47 bvzzenc32        bitio zigzag/delta
 1094.85    3673776  26.69%  3244.53 tpnibble+lz      Transpose+lz
 1079.98    3676707  26.71%  3193.34 tpbyte+lz        Transpose+lz
  981.41    3701682  26.90%  2531.42 tpnibbleX+lz     Transpose+xor+lz
  983.37    3738695  27.16%  2506.98 tpbyteX+lz       Transpose+xor+lz
 8699.94    3925193  28.52% 11144.38 v8nd1enc128v32   TByte+TPackV delta1
 9571.15    3945921  28.67% 11644.09 v8ndenc128v32    TByte+TPackV delta
 2644.76    3948726  28.69%  2783.84 vbzenc32         TurboVByte zigzag
 7443.65    4014799  29.17%  9577.81 v8nzenc128v32    TByte+TPackV zigzag
11010.65    4173952  30.33% 12030.87 v8ndenc256v32    TByte+TPackV delta
10286.48    4161012  30.23% 11723.43 v8nd1enc256v32   TByte+TPackV delta1
 3054.44    4180490  30.37%  3489.68 vbd1enc32        TurboVByte delta1
 3230.82    4181163  30.38%  3716.80 vbdenc32         TurboVByte delta
 8755.29    4215033  30.63% 10769.41 v8nzenc256v32    TByte+TPackV zigzag
  867.36    4228629  30.72%  1893.95 bitshuffleX+lz   Transpose+xor+lz
  970.41    4260129  30.95%  2366.86 bitshuffle+lz    Transpose+lz
 1062.80    4329918  31.46%  1271.79 bvzenc32         bitio zigzag
 8738.61    4406850  32.02% 10031.57 bitnzpack128v32  TurboPackV   zigzag
 4255.82    4406850  32.02%  5017.61 bitnzpack32      TurboPack    zigzag
11063.76    4724718  34.33% 10735.81 v8zenc32         TurboByte zigzag
 3235.38    4724718  34.33%  4183.38 streamvbyte zzag StreamVByte zigzag
11926.61    4819041  35.01% 11644.09 bitnzpack256v32  TurboPack256 zigzag
12041.39    4889190  35.52% 12223.19 v8d1enc32        TurboByte delta1
12661.74    4889479  35.53% 12850.90 v8denc32         TurboByte delta
10490.33    4889479  35.53% 14101.75 streamvbyte delt StreamVByte delta
  385.16    5050551  36.70%   423.20 SPDP             SPDP Floating Point
 1931.69    5265005  38.25%  8086.55 fpxenc32         TurboFloat XOR
 1350.93    5265005  38.25%  1524.01 fpfcmenc32       TurboFloat FCM
  997.49    5310026  38.58%  1184.65 fpgenc32         bitio TurboGorilla
 1657.63    5552401  40.34% 11926.61 p4nenc128v32     TurboPForV
 1578.90    5552401  40.34%  7583.09 p4nenc32         TurboPFor
 1883.32    5626136  40.88%  7616.66 FastPFor         FastPFor
   45.75    5626576  40.88%  4884.07 SimdOptPFor      FastPFor SIMD
 2247.07    5626832  40.88% 13818.59 SimdFastPFor     FastPFor SIMD
 1903.38    5638279  40.97% 13045.79 p4nenc256v32     TurboPFor256
 8291.15    5653941  41.08% 15343.71 v8nenc128v32     TurboByte+TbPackV
18400.15    5661003  41.13% 15275.60 bitnpack128v32   TurboPackV
10458.44    5661003  41.13% 10263.47 bitnpack32       TurboPack
 9331.06    5745570  41.75% 15429.72 v8nenc256v32     TurboByte+TbPackV
19115.71    5748842  41.77% 15464.40 bitnpack256v32   TurboPack256
 2062.54    6226583  45.24%  6671.50 vsenc32          TurboVSimple
  777.85    6577162  47.79%  1149.34 fpdfcmenc32      TurboFloat DFCM
  784.99    6658332  48.38%  1150.49 fp2dfcmenc32     TurboFloat DFCM 2D
 2628.09    6682141  48.55%  4124.46 vbenc32          TurboVByte scalar
 3850.95    6705452  48.72%  6507.48 maskeydvbyte     MaskedVByte SIMD
10949.33    6790511  49.34% 11527.06 bitnd1pack128v32 TurboPackV   delta1
 6674.74    6790511  49.34%  8212.00 bitnd1pack32     TurboPack    delta1
12169.15    6816223  49.52% 11854.70 bitndpack128v32  TurboPackV   delta
 7004.23    6816223  49.52%  9001.51 bitndpack32      TurboPack    delta
11498.17    7511530  54.58% 14442.09 v8enc32          TurboByte SIMD
10769.41    7511530  54.58% 14321.86 streamvbyte      StreamVByte SIMD
13681.22    8062094  58.58% 10710.75 bitnd1pack256v32 TurboPack256 delta1
14262.50    8076334  58.68% 10854.35 bitndpack256v32  TurboPack256 delta
  475.40    9044107  65.71%  2907.33 lz               lz
 1427.58   10833105  78.71%  5124.09 srlez32          TurboRLE32 ESC zzag
 1681.94   13712109  99.63%  5869.22 srlex32          TurboRLE32 ESC xor
 1869.00   13763312 100.00% 17532.88 srle32           TurboRLE32 ESC
  298.29   13763312 100.00% 17532.88 trle             TurboRLE
  216.18   13763312 100.00% 17532.88 trlex            TurboRLE   xor
  200.00   13763312 100.00% 17532.88 trlez            TurboRLE   zigzag
17334.15   13763312 100.00% 17510.58 memcpy           memcpy
 9151.14   13763312 100.00% 11595.04 tpenc            Byte transpose
 8891.03   13763312 100.00% 11054.87 tp4enc           Nibble transpose
 4135.61   13763312 100.00%  4112.13 bitshuffle       Bit transpose
                                                      * : external library

freqs: Term Frequencies

	./icapp test_collection.freqs

  E MB/s     size     ratio   D MB/s  function (integer size=32 bits)
   91.61    1692992  12.30%  4615.46 SimdOptPFor      FastPFor SIMD
 1204.45    1722849  12.52%  4061.17 vsenc32          TurboVSimple
 1402.56    1730336  12.57%  8397.38 p4nenc128v32     TurboPForV
 1339.76    1730336  12.57%  4979.49 p4nenc32         TurboPFor
 1343.42    1778796  12.92%  6106.17 FastPFor         FastPFor
 1447.85    1779168  12.93% 12169.15 SimdFastPFor     FastPFor SIMD
 1662.23    1780534  12.94% 12603.76 p4nenc256v32     TurboPFor256
 1701.06    1907713  13.86%  3908.92 tpnibble+lz      Transpose+lz
 1084.24    1922544  13.97%  2792.88 vszenc32         TurboVSimple zigzag
 1435.32    1955549  14.21%  2904.88 tpnibbleX+lz     Transpose+xor+lz
 1023.67    1958057  14.23%  2334.74 bitshuffleX+lz   Transpose+xor+lz
 1194.83    1962095  14.26%  3022.91 bitshuffle+lz    Transpose+lz
 1185.37    1996059  14.50%  5303.78 p4nzenc128v32    TurboPForV   zigzag
 1126.29    1996059  14.50%  3584.19 p4nzenc32        TurboPFor    zigzag
 1364.32    2087485  15.17%  7105.47 p4nzenc256v32    TurboPFor256 zigzag
 1081.60    2130744  15.48%  4695.77 fpxenc32         TurboFloat XOR
  877.76    2130744  15.48%  1330.30 fpfcmenc32       TurboFloat FCM
 1326.33    2145784  15.59%  2916.57 tpnibbleZ+lz     Transpose+zigzag+lz
 1293.42    2219153  16.12%  3770.77 tpbyte+lz        Transpose+lz
 1015.97    2259515  16.42%  1055.63 bvzenc32         bitio zigzag
 1003.67    2300817  16.72%  2364.83 bitshuffleZ+lz   Transpose+zigzag+lz
 1128.69    2381528  17.30%  2856.05 tpbyteX+lz       Transpose+xor+lz
 1021.17    2398296  17.43%  2984.24 p4nzzenc128v32   TurboPFor zzag/delta
 1137.28    2407925  17.50%  2936.48 tpbyteZ+lz       Transpose+zigzag+lz
 1004.84    2764046  20.08%  1203.82 bvzzenc32        bitio zigzag/delta
16483.00    2784387  20.23% 17312.33 v8nenc128v32     TurboByte+TbPackV
17488.32    2804702  20.38% 17118.54 bitnpack128v32   TurboPackV
 9268.22    2804702  20.38% 10522.40 bitnpack32       TurboPack
 1670.91    2883341  20.95%  1931.15 vbddenc32        TurboVByte zzag delt
 1110.66    3029119  22.01%  1114.80 fpgenc32         bitio TurboGorilla
 8271.22    3146929  22.86% 10202.60 v8nzenc128v32    TByte+TPackV zigzag
20948.71    3150687  22.89% 16764.07 v8nenc256v32     TurboByte+TbPackV
22824.72    3189005  23.17% 16764.07 bitnpack256v32   TurboPack256
 8591.33    3189535  23.17% 10340.57 bitnzpack128v32  TurboPackV   zigzag
 4135.61    3189535  23.17%  5150.94 bitnzpack32      TurboPack    zigzag
11546.40    3451880  25.08% 11365.24 vbenc32          TurboVByte scalar
 8661.61    3457754  25.12% 16846.15 maskeydvbyte     MaskedVByte SIMD
 5340.82    3475286  25.25%  5265.23 vbzenc32         TurboVByte zigzag
10611.65    3502875  25.45% 12959.80 v8nzenc256v32    TByte+TPackV zigzag
11733.42    3589870  26.08% 13349.47 bitnzpack256v32  TurboPack256 zigzag
  810.18    3925934  28.52%  3365.11 lz               lz
14087.31    4307847  31.30% 16562.34 v8enc32          TurboByte SIMD
11517.41    4307847  31.30% 16825.56 streamvbyte      StreamVByte SIMD
11099.44    4323652  31.41% 10744.19 v8zenc32         TurboByte zigzag
 4017.31    4323652  31.41%  4233.56 streamvbyte zzag StreamVByte zigzag
 1048.79    5742202  41.72%  5785.33 p4ndenc128v32    TurboPForV   delta
 1001.91    5742202  41.72%  3535.40 p4ndenc32        TurboPFor    delta
 1258.30    5773355  41.95%  7098.15 p4ndenc256v32    TurboPFor256 delta
11793.75    7227089  52.51% 12432.98 v8denc32         TurboByte delta
10271.12    7227089  52.51% 12062.49 streamvbyte delt StreamVByte delta
 8705.44    7231250  52.54% 11703.49 v8ndenc256v32    TByte+TPackV delta
 8077.06    7234307  52.56% 10949.33 v8ndenc128v32    TByte+TPackV delta
 1408.88    7343213  53.35%  1508.14 vbdenc32         TurboVByte delta
 1277.10   10319134  74.98%  5386.81 srle32           TurboRLE32 ESC
  735.30   10350497  75.20%  1170.55 fpdfcmenc32      TurboFloat DFCM
  731.12   10690707  77.68%  1161.46 fp2dfcmenc32     TurboFloat DFCM 2D
 1124.82   10963622  79.66%  4061.17 srlex32          TurboRLE32 ESC xor
  853.54   10981857  79.79%  4564.94 p4nd1enc128v32   TurboPForV   delta1
  840.30   10981857  79.79%  3156.72 p4nd1enc32       TurboPFor    delta1
 1024.06   11014587  80.03%  5551.96 p4nd1enc256v32   TurboPFor256 delta1
 1348.15   11113500  80.75%  4623.21 srlez32          TurboRLE32 ESC zzag
  169.10   11621965  84.44%   191.26 SPDP             SPDP Floating Point
 7451.71   11667495  84.77%  9261.98 v8nd1enc128v32   TByte+TPackV delta1
 8029.93   11691692  84.95%  9712.99 v8nd1enc256v32   TByte+TPackV delta1
10016.96   11716644  85.13% 10434.65 v8d1enc32        TurboByte delta1
 1137.46   13329244  96.85%  1150.97 vbd1enc32        TurboVByte delta1
11614.60   13727224  99.74% 10458.44 bitndpack128v32  TurboPackV   delta
 8464.52   13727224  99.74%  9054.81 bitndpack32      TurboPack    delta
11527.06   13747047  99.88%  9465.82 bitndpack256v32  TurboPack256 delta
16542.43   13763304 100.00% 16116.28 memcpy           memcpy
  412.58   13763304 100.00% 16116.28 trle             TurboRLE
  246.88   13763304 100.00% 16116.28 trlez            TurboRLE   zigzag
  361.16   13763304 100.00% 16097.43 trlex            TurboRLE   xor
10719.08   13763304 100.00% 11663.82 tpenc            Byte transpose
 8931.41   13763304 100.00% 11001.84 tp4enc           Nibble transpose
 4247.93   13763304 100.00%  5147.08 bitshuffle       Bit transpose
11693.55   13776743 100.10%  8954.65 bitnd1pack256v32 TurboPack256 delta1
11374.63   13790184 100.20% 10450.50 bitnd1pack128v32 TurboPackV   delta1
 8251.38   13790184 100.20%  8925.62 bitnd1pack32     TurboPack    delta1

Synthetic data: zipfian distribution

	./icapp -a1.5 -m0 -M255 -n100M ZIPF

bits histogram:
00:######################################## 40%
01:############## 14%
02:############# 13%
03:########## 10%
04:######## 7.8%
05:###### 5.7%
06:#### 4.1%
07:### 2.9%
08:## 2.1%

  E MB/s     size     ratio   D MB/s  function (integer size=32 bits)
 2368.80   62939886  15.73% 10950.80 p4nenc256v32     TurboPFor256
 1327.02   63392759  15.85%  7702.23 p4nenc128v32     TurboPForV
 1321.43   63392759  15.85%  4556.85 p4nenc32         TurboPFor
   66.39   65060504  16.27%  3077.49 SimdOptPFor      FastPFor SIMD
  606.96   73459928  18.36%  5349.24 FastPFor         FastPFor
  631.90   73469416  18.37% 10197.58 SimdFastPFor     FastPFor SIMD
 1010.83   76345141  19.09%  2866.89 vsenc32          TurboVSimple
 1745.84   79163645  19.79%  3332.50 tpnibble+lz      Transpose+lz
 1464.36   80509600  20.13%  2494.28 tpnibbleX+lz     Transpose+xor+lz
 1097.09   82870974  20.72%  2541.51 bitshuffle+lz    Transpose+lz
  975.88   83384370  20.85%  2073.53 bitshuffleX+lz   Transpose+xor+lz
 1321.55   85243365  21.31%  6370.95 p4nzenc256v32    TurboPFor256 zigzag
 1178.72   85546946  21.39%  4991.20 p4nzenc128v32    TurboPForV   zigzag
 1104.59   85546946  21.39%  3735.66 p4nzenc32        TurboPFor    zigzag
 1334.30   88589767  22.15%  2936.28 tpbyte+lz        Transpose+lz
  902.66   89283071  22.32%  2240.92 vszenc32         TurboVSimple zigzag
 1010.73   94657651  23.66%  4570.28 fpxenc32         TurboFloat XOR
  823.29   94657651  23.66%  1348.86 fpfcmenc32       TurboFloat FCM
 1284.95   95923933  23.98%  2332.20 tpbyteX+lz       Transpose+xor+lz
 1376.65   97552451  24.39%  2321.01 tpnibbleZ+lz     Transpose+zigzag+lz
 1011.41   98127316  24.53%  2129.66 bitshuffleZ+lz   Transpose+zigzag+lz
  888.50   98371167  24.59%   926.12 bvzenc32         bitio zigzag
17298.05   99910930  24.98% 12408.10 v8nenc128v32     TurboByte+TbPackV
17356.59   99910930  24.98% 12362.47 bitnpack128v32   TurboPackV
11693.51   99910930  24.98% 10137.62 bitnpack32       TurboPack
17057.57  100332929  25.08% 11994.72 v8nenc256v32     TurboByte+TbPackV
17076.50  100332929  25.08% 11170.06 bitnpack256v32   TurboPack256
 1013.86  100628462  25.16%  3083.11 p4nzzenc128v32   TurboPFor zzag/delta
11191.32  101015650  25.25% 10332.45 vbenc32          TurboVByte scalar
 6689.75  102074663  25.52%  9524.04 maskeydvbyte     MaskedVByte SIMD
 1161.18  103144618  25.79%  1263.64 fpgenc32         bitio TurboGorilla
 1164.71  105990559  26.50%  2202.97 tpbyteZ+lz       Transpose+zigzag+lz
 3857.24  106284616  26.57%  4009.10 vbzenc32         TurboVByte zigzag
 9820.77  112368050  28.09% 10469.01 bitnzpack128v32  TurboPackV   zigzag
 9686.17  112368050  28.09% 10426.44 v8nzenc128v32    TByte+TPackV zigzag
 4189.84  112368050  28.09%  5317.24 bitnzpack32      TurboPack    zigzag
12560.05  112825409  28.21% 11054.92 v8nzenc256v32    TByte+TPackV zigzag
12680.30  112825409  28.21% 11036.61 bitnzpack256v32  TurboPack256 zigzag
 1650.88  116367689  29.09%  1882.50 vbddenc32        TurboVByte zzag delt
  832.48  119294130  29.82%   973.89 bvzzenc32        bitio zigzag/delta
13107.45  125000000  31.25% 12375.09 v8enc32          TurboByte SIMD
11186.62  125000000  31.25% 12123.78 streamvbyte      StreamVByte SIMD
10956.20  128705458  32.18% 10696.90 v8zenc32         TurboByte zigzag
 3208.47  128705458  32.18%  3850.37 streamvbyte zzag StreamVByte zigzag
  656.49  140353625  35.09%  2416.76 lz               lz
 1002.48  231440944  57.86%  5208.47 p4ndenc128v32    TurboPForV   delta
  959.54  231440944  57.86%  3770.10 p4ndenc32        TurboPFor    delta
 1180.67  231486347  57.87%  5985.96 p4ndenc256v32    TurboPFor256 delta
  488.03  239203583  59.80%  2701.28 trle             TurboRLE
10108.16  245851337  61.46% 10481.08 v8denc32         TurboByte delta
 9441.09  245851337  61.46% 10096.17 streamvbyte delt StreamVByte delta
 8420.70  246241959  61.56% 10398.52 v8ndenc256v32    TByte+TPackV delta
 7895.00  246632587  61.66%  9510.45 v8ndenc128v32    TByte+TPackV delta
 1142.53  262025557  65.51%  1198.36 vbdenc32         TurboVByte delta
  436.08  263233919  65.81%  1532.21 trlex            TurboRLE   xor
  315.02  263233919  65.81%  1053.77 trlez            TurboRLE   zigzag
 1085.73  291716271  72.93%  5591.59 p4nd1enc256v32   TurboPFor256 delta1
  918.79  291995716  73.00%  4686.75 p4nd1enc128v32   TurboPForV   delta1
  897.86  291995716  73.00%  3595.60 p4nd1enc32       TurboPFor    delta1
 9353.88  304132355  76.03%  9687.11 v8d1enc32        TurboByte delta1
 7881.46  304522974  76.13%  8913.05 v8nd1enc256v32   TByte+TPackV delta1
 7441.72  304913602  76.23%  8818.73 v8nd1enc128v32   TByte+TPackV delta1
 1026.88  339718007  84.93%  1063.92 vbd1enc32        TurboVByte delta1
  168.67  374829697  93.71%   189.64 SPDP             SPDP Floating Point
 1374.41  384560608  96.14%  9134.30 srle32           TurboRLE32 ESC
  709.62  387715234  96.93%  1186.37 fpdfcmenc32      TurboFloat DFCM
  710.10  388257392  97.06%  1175.15 fp2dfcmenc32     TurboFloat DFCM 2D
 1174.42  392073698  98.02%  5061.88 srlex32          TurboRLE32 ESC xor
 1433.31  393849411  98.46%  5286.18 srlez32          TurboRLE32 ESC zzag
14002.66  400000000 100.00% 14043.46 memcpy           memcpy
 3876.57  400000000 100.00%  3789.60 bitshuffle       Bit transpose
 8518.79  400000000 100.00%  9106.43 tpenc            Byte transpose
 8304.44  400000000 100.00%  9598.08 tp4enc           Nibble transpose
 9278.16  400390622 100.10%  7273.65 bitnd1pack256v32 TurboPack256 delta1
 9303.19  400390622 100.10%  7575.18 bitndpack256v32  TurboPack256 delta
 7950.55  400781247 100.20%  7644.97 bitnd1pack32     TurboPack    delta1
 8069.89  400781247 100.20%  7650.09 bitndpack32      TurboPack    delta
 9284.62  400781247 100.20%  8515.17 bitnd1pack128v32 TurboPackV   delta1
 9294.98  400781247 100.20%  8564.58 bitndpack128v32  TurboPackV   delta

p4nzenc256v32 sometimes produces different results

While making some changes in my code, I tried to verify that the resulting binary data (compressed with p4nzenc256v32) stays identical. To my surprise, the data differed in many places, as if something had corrupted it. A hex dump of the compressed data clearly showed a few modified bytes here and there. Even more surprising, when I decompressed the data with p4nzdec256v32, the output from the two different buffers was identical.
Any idea why? Is that normal? If so, what makes it produce different outputs for identical inputs?

Cannot build on macOS High Sierra 10.13.1

Hi, I've been using "JavaFastPFOR", but "TurboPFor" seems to be much faster.
So I tried to use "TurboPFor", but when I built the .so file, the following error occurred.

git clone --recursive git://github.com/powturbo/TurboPFor.git
cd TurboPFor/java
javah -jni jic
cp jic.h ..
javac jic.java
javac jicbench.java
cd ..
gcc -O3 -w -march=native -fstrict-aliasing -m64 -shared -fPIC -I/Library/Java/JavaVirtualMachines/jdk1.8.0_66.jdk/Contents/Home/include -I/Library/Java/JavaVirtualMachines/jdk1.8.0_66.jdk/Contents/Home/include/darwin bitpack.c bitunpack.c vp4c.c vp4d.c vsimple.c vint.c bitutil.c jic.c -o libic.so

Undefined symbols for architecture x86_64:
  "_JNNIDEC", referenced from:
      _Java_jic_p4ndec32 in jic-97920b.o
      _Java_jic_p4ndec128v32 in jic-97920b.o
      _Java_jic_p4ndec256v32 in jic-97920b.o
      _Java_jic_p4nddec32 in jic-97920b.o
      _Java_jic_p4nd1dec32 in jic-97920b.o
  "_JNNIENC", referenced from:
      _Java_jic_p4nenc32 in jic-97920b.o
      _Java_jic_p4nenc128v32 in jic-97920b.o
      _Java_jic_p4nenc256v32 in jic-97920b.o
      _Java_jic_p4ndenc32 in jic-97920b.o
      _Java_jic_p4nd1enc32 in jic-97920b.o
  "_bitd1pack128v32", referenced from:
      _Java_jic_bitd1pack128v32 in jic-97920b.o
      _JavaCritical_jic_bitd1pack128v32 in jic-97920b.o
     (maybe you meant: _JavaCritical_jic_bitd1pack128v32, _Java_jic_bitd1pack128v32 )
  "_bitd1pack256v32", referenced from:
      _Java_jic_bitd1pack256v32 in jic-97920b.o
      _JavaCritical_jic_bitd1pack256v32 in jic-97920b.o
     (maybe you meant: _JavaCritical_jic_bitd1pack256v32, _Java_jic_bitd1pack256v32 )
  "_bitd1unpack128v32", referenced from:
      _Java_jic_bitd1unpack128v32 in jic-97920b.o
      _JavaCritical_jic_bitd1unpack128v32 in jic-97920b.o
     (maybe you meant: _Java_jic_bitd1unpack128v32, _JavaCritical_jic_bitd1unpack128v32 )
  "_bitd1unpack256v32", referenced from:
      _Java_jic_bitd1unpack256v32 in jic-97920b.o
      _JavaCritical_jic_bitd1unpack256v32 in jic-97920b.o
     (maybe you meant: _JavaCritical_jic_bitd1unpack256v32, _Java_jic_bitd1unpack256v32 )
  "_bitdpack128v32", referenced from:
      _Java_jic_bitdpack128v32 in jic-97920b.o
      _JavaCritical_jic_bitdpack128v32 in jic-97920b.o
     (maybe you meant: _JavaCritical_jic_bitdpack128v32, _Java_jic_bitdpack128v32 )
  "_bitdpack256v32", referenced from:
      _Java_jic_bitdpack256v32 in jic-97920b.o
      _JavaCritical_jic_bitdpack256v32 in jic-97920b.o
     (maybe you meant: _Java_jic_bitdpack256v32, _JavaCritical_jic_bitdpack256v32 )
  "_bitdunpack128v32", referenced from:
      _Java_jic_bitdunpack128v32 in jic-97920b.o
      _JavaCritical_jic_bitdunpack128v32 in jic-97920b.o
     (maybe you meant: _JavaCritical_jic_bitdunpack128v32, _Java_jic_bitdunpack128v32 )
  "_bitdunpack256v32", referenced from:
      _Java_jic_bitdunpack256v32 in jic-97920b.o
      _JavaCritical_jic_bitdunpack256v32 in jic-97920b.o
     (maybe you meant: _JavaCritical_jic_bitdunpack256v32, _Java_jic_bitdunpack256v32 )
  "_bitnd1pack128v32", referenced from:
      _Java_jic_bitnd1pack128v32 in jic-97920b.o
      _JavaCritical_jic_bitnd1pack128v32 in jic-97920b.o
     (maybe you meant: _Java_jic_bitnd1pack128v32, _JavaCritical_jic_bitnd1pack128v32 )
  "_bitnd1pack256v32", referenced from:
      _Java_jic_bitnd1pack256v32 in jic-97920b.o
      _JavaCritical_jic_bitnd1pack256v32 in jic-97920b.o
     (maybe you meant: _Java_jic_bitnd1pack256v32, _JavaCritical_jic_bitnd1pack256v32 )
  "_bitnd1unpack128v32", referenced from:
      _Java_jic_bitnd1unpack128v32 in jic-97920b.o
      _JavaCritical_jic_bitnd1unpack128v32 in jic-97920b.o
     (maybe you meant: _Java_jic_bitnd1unpack128v32, _JavaCritical_jic_bitnd1unpack128v32 )
  "_bitnd1unpack256v32", referenced from:
      _Java_jic_bitnd1unpack256v32 in jic-97920b.o
      _JavaCritical_jic_bitnd1unpack256v32 in jic-97920b.o
     (maybe you meant: _JavaCritical_jic_bitnd1unpack256v32, _Java_jic_bitnd1unpack256v32 )
  "_bitndpack128v32", referenced from:
      _Java_jic_bitndpack128v32 in jic-97920b.o
      _JavaCritical_jic_bitndpack128v32 in jic-97920b.o
     (maybe you meant: _JavaCritical_jic_bitndpack128v32, _Java_jic_bitndpack128v32 )
  "_bitndpack256v32", referenced from:
      _Java_jic_bitndpack256v32 in jic-97920b.o
      _JavaCritical_jic_bitndpack256v32 in jic-97920b.o
     (maybe you meant: _JavaCritical_jic_bitndpack256v32, _Java_jic_bitndpack256v32 )
  "_bitndunpack128v32", referenced from:
      _Java_jic_bitndunpack128v32 in jic-97920b.o
      _JavaCritical_jic_bitndunpack128v32 in jic-97920b.o
     (maybe you meant: _Java_jic_bitndunpack128v32, _JavaCritical_jic_bitndunpack128v32 )
  "_bitndunpack256v32", referenced from:
      _Java_jic_bitndunpack256v32 in jic-97920b.o
      _JavaCritical_jic_bitndunpack256v32 in jic-97920b.o
     (maybe you meant: _JavaCritical_jic_bitndunpack256v32, _Java_jic_bitndunpack256v32 )
  "_bitnpack128v32", referenced from:
      _Java_jic_bitnpack128v32 in jic-97920b.o
      _JavaCritical_jic_bitnpack128v32 in jic-97920b.o
     (maybe you meant: _JavaCritical_jic_bitnpack128v32, _Java_jic_bitnpack128v32 )
  "_bitnpack256v32", referenced from:
      _Java_jic_bitnpack256v32 in jic-97920b.o
      _JavaCritical_jic_bitnpack256v32 in jic-97920b.o
     (maybe you meant: _JavaCritical_jic_bitnpack256v32, _Java_jic_bitnpack256v32 )
  "_bitnunpack128v32", referenced from:
      _Java_jic_bitnunpack128v32 in jic-97920b.o
      _JavaCritical_jic_bitnunpack128v32 in jic-97920b.o
     (maybe you meant: _Java_jic_bitnunpack128v32, _JavaCritical_jic_bitnunpack128v32 )
  "_bitnunpack256v32", referenced from:
      _Java_jic_bitnunpack256v32 in jic-97920b.o
      _JavaCritical_jic_bitnunpack256v32 in jic-97920b.o
     (maybe you meant: _JavaCritical_jic_bitnunpack256v32, _Java_jic_bitnunpack256v32 )
  "_bitpack128v32", referenced from:
      _Java_jic_bitpack128v32 in jic-97920b.o
      _JavaCritical_jic_bitpack128v32 in jic-97920b.o
     (maybe you meant: _Java_jic_bitpack128v32, _JavaCritical_jic_bitpack128v32 )
  "_bitpack256v32", referenced from:
      _Java_jic_bitpack256v32 in jic-97920b.o
      _JavaCritical_jic_bitpack256v32 in jic-97920b.o
     (maybe you meant: _JavaCritical_jic_bitpack256v32, _Java_jic_bitpack256v32 )
  "_bitunpack128v32", referenced from:
      _Java_jic_bitunpack128v32 in jic-97920b.o
      _JavaCritical_jic_bitunpack128v32 in jic-97920b.o
     (maybe you meant: _Java_jic_bitunpack128v32, _JavaCritical_jic_bitunpack128v32 )
  "_bitunpack256v32", referenced from:
      _Java_jic_bitunpack256v32 in jic-97920b.o
      _JavaCritical_jic_bitunpack256v32 in jic-97920b.o
     (maybe you meant: _Java_jic_bitunpack256v32, _JavaCritical_jic_bitunpack256v32 )
  "_p4dec128v32", referenced from:
      _Java_jic_p4dec128v32 in jic-97920b.o
      _JavaCritical_jic_p4dec128v32 in jic-97920b.o
     (maybe you meant: _JavaCritical_jic_p4dec128v32, _Java_jic_p4dec128v32 )
  "_p4dec256v32", referenced from:
      _Java_jic_p4dec256v32 in jic-97920b.o
      _JavaCritical_jic_p4dec256v32 in jic-97920b.o
     (maybe you meant: _JavaCritical_jic_p4dec256v32, _Java_jic_p4dec256v32 )
  "_p4enc128v32", referenced from:
      _Java_jic_p4enc128v32 in jic-97920b.o
      _JavaCritical_jic_p4enc128v32 in jic-97920b.o
     (maybe you meant: _JavaCritical_jic_p4enc128v32, _Java_jic_p4enc128v32 )
  "_p4enc256v32", referenced from:
      _Java_jic_p4enc256v32 in jic-97920b.o
      _JavaCritical_jic_p4enc256v32 in jic-97920b.o
     (maybe you meant: _JavaCritical_jic_p4enc256v32, _Java_jic_p4enc256v32 )
  "_p4ndec128v32", referenced from:
      _Java_jic_p4ndec128v32 in jic-97920b.o
      _JavaCritical_jic_p4ndec128v32 in jic-97920b.o
     (maybe you meant: _JavaCritical_jic_p4ndec128v32, _Java_jic_p4ndec128v32 )
  "_p4ndec256v32", referenced from:
      _Java_jic_p4ndec256v32 in jic-97920b.o
      _JavaCritical_jic_p4ndec256v32 in jic-97920b.o
     (maybe you meant: _Java_jic_p4ndec256v32, _JavaCritical_jic_p4ndec256v32 )
  "_p4nenc128v32", referenced from:
      _Java_jic_p4nenc128v32 in jic-97920b.o
      _JavaCritical_jic_p4nenc128v32 in jic-97920b.o
     (maybe you meant: _JavaCritical_jic_p4nenc128v32, _Java_jic_p4nenc128v32 )
  "_p4nenc256v32", referenced from:
      _Java_jic_p4nenc256v32 in jic-97920b.o
      _JavaCritical_jic_p4nenc256v32 in jic-97920b.o
     (maybe you meant: _JavaCritical_jic_p4nenc256v32, _Java_jic_p4nenc256v32 )
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)

reserved identifier violation

I would like to point out that an identifier like "__bx" does not conform to the naming rules of the C language standard, which reserves identifiers beginning with two underscores for the implementation.
Would you like to choose different unique names?

[codestyle] Would be nice to remove all trailing whitespace

It's extreme how much trailing whitespace tp4 has all over the place.

Hello TurboPFor

Save the listing below as ictest.c in the TurboPFor directory and type

make ictest
./ictest

output:
compress size is 6
uncompress size is 6

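#include <stdio.h>
#include <stdlib.h>   /* malloc, free */
#include <stdint.h>   /* uint32_t */
#define NTURBOPFOR_DAC
#include "vp4.h"

/* Output buffer size used for n 32-bit values (includes slack beyond the raw input size) */
#define P4NENC_BOUND(n) (((n)+127)/128+((n)+32)*sizeof(uint32_t))

int main(int argc, char* argv[]) {
      printf("Hello TurboPFor\n");
      uint32_t ar[32] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
      unsigned elnum = 10;
      unsigned char* compress_buf = malloc(P4NENC_BOUND(elnum));
      uint32_t *uncompress_buf = malloc((elnum+32)*sizeof(ar[0]));
      if (!compress_buf || !uncompress_buf) return 1;

      /* Compress elnum values; the return value is the compressed size in bytes */
      size_t compress_size = p4nenc32(ar, elnum, compress_buf);
      printf("compress size is %zu\n", compress_size);

      /* Decompress elnum values; per the output above, the return value matches the compressed size */
      size_t uncompress_size = p4ndec32(compress_buf, elnum, uncompress_buf);
      printf("uncompress size is %zu\n", uncompress_size);

      free(compress_buf);
      free(uncompress_buf);
      return 0;
}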

Benchmarked on i5-7200U 3GHz

TurboPFor256N a.k.a. "PFor (AVX2) large blocks" nears memcpy, again stunning speeds and ratios, congratulations!

Out of curiosity, I tried the ranges 0..1, 0..255 and 0..16777215 with skewness 1.0 and 1.5.

D:\2tb>dir ic*
 Volume in drive D is SANMAYCE
 Volume Serial Number is EE4D-4AE3

 Directory of D:\2tb

03/03/2018  04:44 PM         5,550,385 icappavx2.exe
03/03/2018  04:37 PM         4,320,044 icapp.exe
               2 File(s)      9,870,429 bytes
               0 Dir(s)   2,902,048,768 bytes free

D:\2tb>timer64 icappavx2.exe -a1.5 -m0 -M1 -n100M ZIPF
zipf alpha=1.50 range[0..1].n=100000000

 bits histogram:0:73.87% 1:26.13%
   E MB/s     size     ratio    D MB/s  function
  1991.11   13281250   3.32%  8211.02 p4nenc32          ZIPF
  2136.20   13281250   3.32% 11108.33 p4nenc128v32      ZIPF
  3126.56   12890625   3.22% 10612.89 p4nenc256v32      ZIPF
   840.22  101692201  25.42%  3269.31 p4ndenc32         ZIPF
   858.04  101692201  25.42%  5181.35 p4ndenc128v32     ZIPF
  5950.61  400390622 100.10%  6285.65 bitndpack256v32   ZIPF
   728.98  336853585  84.21%  5002.56 p4nd1enc32        ZIPF
   671.33  336853585  84.21%  6126.04 p4nd1enc128v32    ZIPF
   814.40  336072335  84.02%  6428.18 p4nd1enc256v32    ZIPF
  1070.80   23943088   5.99%  4127.88 p4nzenc32         ZIPF
  1025.09   23943088   5.99%  5162.76 p4nzenc128v32     ZIPF
  1063.31   23066760   5.77%  5643.90 p4nzenc256v32     ZIPF
   820.67  102473451  25.62%  2836.84 p4nsenc32         ZIPF
  7887.37   13281250   3.32%  8224.36 bitnpack32        ZIPF
 10003.75   13281250   3.32% 11110.80 bitnpack128v32    ZIPF
 12790.59   12890625   3.22% 10636.60 bitnpack256v32    ZIPF
  4584.42  400781247 100.20%  6117.99 bitndpack32       ZIPF
  5965.07  400781247 100.20%  6993.50 bitndpack128v32   ZIPF
  5950.70  400390622 100.10%  6295.94 bitndpack256v32   ZIPF
  4497.06  400781247 100.20%  6037.74 bitnd1pack32      ZIPF
  5849.24  400781247 100.20%  6615.73 bitnd1pack128v32  ZIPF
  5676.34  400390622 100.10%  6174.75 bitnd1pack256v32  ZIPF
  3690.10   25781251   6.45%  4521.36 bitnzpack32       ZIPF
  6980.07   25781251   6.45%  9690.39 bitnzpack128v32   ZIPF
  6598.59   25390626   6.35% 10504.75 bitnzpack256v32   ZIPF
  4695.88  100000000  25.00%  4637.95 vbzenc32          ZIPF
  1221.11   70405794  17.60%  1382.05 vbddenc32         ZIPF
  1625.10   14240039   3.56%  8543.54 vsenc32           ZIPF
   909.95   63145848  15.79%   987.28 bvzzenc32         ZIPF
   852.03   46249480  11.56%   987.35 bvzenc32          ZIPF
   963.67   22151115   5.54%   913.63 fpgenc32          ZIPF
   930.45   34790081   8.70%  2549.04 fpzzenc32         ZIPF
   868.69   23943239   5.99%  1137.39 fpfcmenc32        ZIPF
   827.65   32408928   8.10%   971.27 fpdfcmenc32       ZIPF
   826.35   32408029   8.10%   959.39 fp2dfcmenc32      ZIPF
   669.97   78383804  19.60%  4168.79 trle              ZIPF
  1118.99  254647894  63.66%  3086.63 srle32            ZIPF
   149.68  272484724  68.12%   159.57 SPDP              ZIPF
  1722.85   46095125  11.52%  1966.00 tpbyte+lz         ZIPF
  2140.57   28153817   7.04%  2386.01 tpnibble+lz       ZIPF
  1609.33   28296871   7.07%  1932.43 tpnibbleZ+lz      ZIPF
  1634.09   28352181   7.09%  1877.49 tpnibbleX+lz      ZIPF
   905.38   83618404  20.90%  1583.90 lz                ZIPF
  1970.04   14378586   3.59%  1986.59 bitshuffle+lz     ZIPF
 11465.26  400000000 100.00% 11149.51 memcpy

Kernel  Time =     0.828 =    0%
User    Time =   231.937 =   98%
Process Time =   232.765 =   99%    Virtual  Memory =   3458 MB
Global  Time =   234.710 =  100%    Physical Memory =   1539 MB

D:\2tb>timer64 icappavx2.exe -a1.0 -m0 -M1 -n100M ZIPF
zipf alpha=1.00 range[0..1].n=100000000

 bits histogram:0:66.66% 1:33.34%
   E MB/s     size     ratio    D MB/s  function
  2033.53   13281250   3.32%  8223.18 p4nenc32          ZIPF
  2231.52   13281250   3.32% 11084.32 p4nenc128v32      ZIPF
  2459.40   12890625   3.22% 10613.18 p4nenc256v32      ZIPF
   838.16  113039100  28.26%  3247.07 p4ndenc32         ZIPF
   849.96  113039100  28.26%  5099.50 p4ndenc128v32     ZIPF
  5993.95  400390622 100.10%  6264.58 bitndpack256v32   ZIPF
   736.08  325158377  81.29%  4978.10 p4nd1enc32        ZIPF
   680.11  325158377  81.29%  6089.76 p4nd1enc128v32    ZIPF
   827.41  324377127  81.09%  6456.72 p4nd1enc256v32    ZIPF
  1231.59   25131150   6.28%  4204.33 p4nzenc32         ZIPF
  1240.93   25131150   6.28%  6237.23 p4nzenc128v32     ZIPF
  1176.89   24467543   6.12%  6237.04 p4nzenc256v32     ZIPF
   817.36  113820350  28.46%  2829.87 p4nsenc32         ZIPF
  7887.21   13281250   3.32%  8230.11 bitnpack32        ZIPF
 10056.32   13281250   3.32% 11095.70 bitnpack128v32    ZIPF
 12740.88   12890625   3.22% 10613.46 bitnpack256v32    ZIPF
  4591.00  400781247 100.20%  6119.86 bitndpack32       ZIPF
  5960.54  400781247 100.20%  6984.09 bitndpack128v32   ZIPF
  5996.91  400390622 100.10%  6291.29 bitndpack256v32   ZIPF
  4491.96  400781247 100.20%  6032.82 bitnd1pack32      ZIPF
  5716.08  400781247 100.20%  6607.20 bitnd1pack128v32  ZIPF
  5688.28  400390622 100.10%  6162.95 bitnd1pack256v32  ZIPF
  3686.09   25781251   6.45%  4516.35 bitnzpack32       ZIPF
  6144.77   25781251   6.45%  9675.39 bitnzpack128v32   ZIPF
  6604.26   25390626   6.35% 10467.92 bitnzpack256v32   ZIPF
  4692.52  100000000  25.00%  4634.24 vbzenc32          ZIPF
  1184.03   79178092  19.79%  1391.53 vbddenc32         ZIPF
  1609.87   14289474   3.57%  8896.80 vsenc32           ZIPF
   868.72   70840784  17.71%   947.27 bvzzenc32         ZIPF
   737.00   51391735  12.85%   817.76 bvzenc32          ZIPF
   811.78   23613016   5.90%   783.79 fpgenc32          ZIPF
  1100.02   37282301   9.32%  2724.76 fpzzenc32         ZIPF
   978.90   25130914   6.28%  1165.77 fpfcmenc32        ZIPF
   814.67   33765409   8.44%   954.54 fpdfcmenc32       ZIPF
   816.68   33670343   8.42%   942.89 fp2dfcmenc32      ZIPF
   597.00  100024919  25.01%  3852.19 trle              ZIPF
   914.61  293876335  73.47%  3322.62 srle32            ZIPF
   139.91  300800299  75.20%   149.69 SPDP              ZIPF
  1678.62   48982325  12.25%  1990.31 tpbyte+lz         ZIPF
  2084.02   28987471   7.25%  2396.36 tpnibble+lz       ZIPF
  1527.84   29191560   7.30%  1992.22 tpnibbleZ+lz      ZIPF
  1593.02   29225794   7.31%  1878.71 tpnibbleX+lz      ZIPF
   837.11   93244426  23.31%  1592.22 lz                ZIPF
  1942.52   14392723   3.60%  1974.67 bitshuffle+lz     ZIPF
 10647.93  400000000 100.00% 11036.61 memcpy

Kernel  Time =     0.796 =    0%
User    Time =   226.609 =   98%
Process Time =   227.406 =   99%    Virtual  Memory =   3458 MB
Global  Time =   229.155 =  100%    Physical Memory =   1539 MB

D:\2tb>timer64 icappavx2.exe -a1.5 -m0 -M255 -n100M ZIPF
zipf alpha=1.50 range[0..255].n=100000000

 bits histogram:0:40.20% 1:14.21% 2:12.77% 3:10.28% 4:7.77% 5:5.69% 6:4.09% 7:2.92% 8:2.08%
   E MB/s     size     ratio    D MB/s  function
  1036.21   63397240  15.85%  3775.44 p4nenc32          ZIPF
  1042.24   63397240  15.85%  6210.60 p4nenc128v32      ZIPF
  1234.61   62943776  15.74%  9086.16 p4nenc256v32      ZIPF
   781.03  231475421  57.87%  3076.83 p4ndenc32         ZIPF
   778.34  231475421  57.87%  4213.14 p4ndenc128v32     ZIPF
  6016.58  400390622 100.10%  6326.91 bitndpack256v32   ZIPF
   732.52  291998562  73.00%  2850.20 p4nd1enc32        ZIPF
   713.03  291998562  73.00%  3889.42 p4nd1enc128v32    ZIPF
   884.92  291718032  72.93%  4638.38 p4nd1enc256v32    ZIPF
   903.89   85552166  21.39%  3043.21 p4nzenc32         ZIPF
   912.01   85552166  21.39%  4153.08 p4nzenc128v32     ZIPF
  1048.24   85248159  21.31%  5277.74 p4nzenc256v32     ZIPF
   775.93  232256671  58.06%  2718.83 p4nsenc32         ZIPF
  6425.08   99908242  24.98%  8632.78 bitnpack32        ZIPF
  9796.48   99908242  24.98%  9953.47 bitnpack128v32    ZIPF
 12148.82  100329537  25.08%  9604.07 bitnpack256v32    ZIPF
  4605.80  400781247 100.20%  6124.64 bitndpack32       ZIPF
  6000.42  400781247 100.20%  6995.09 bitndpack128v32   ZIPF
  5998.62  400390622 100.10%  6324.61 bitndpack256v32   ZIPF
  4509.94  400781247 100.20%  6049.88 bitnd1pack32      ZIPF
  5899.88  400781247 100.20%  6609.28 bitnd1pack128v32  ZIPF
  5689.41  400390622 100.10%  6192.53 bitnd1pack256v32  ZIPF
  3694.19  112367090  28.09%  4470.57 bitnzpack32       ZIPF
  5568.01  112367090  28.09%  8686.78 bitnzpack128v32   ZIPF
  6227.04  112823649  28.21%  9472.61 bitnzpack256v32   ZIPF
  3206.67  106294820  26.57%  3269.79 vbzenc32          ZIPF
  1320.87  116386412  29.10%  1570.33 vbddenc32         ZIPF
   877.49   76359142  19.09%  2377.81 vsenc32           ZIPF
   705.40  119303834  29.83%   810.31 bvzzenc32         ZIPF
   726.65   98385955  24.60%   781.02 bvzenc32          ZIPF
  1013.01  103153709  25.79%  1040.94 fpgenc32          ZIPF
   802.12  100632423  25.16%  2068.50 fpzzenc32         ZIPF
   747.01   85551816  21.39%  1015.80 fpfcmenc32        ZIPF
   726.65   98463575  24.62%   905.67 fpdfcmenc32       ZIPF
   728.04   98328582  24.58%   896.84 fp2dfcmenc32      ZIPF
   418.86  239221544  59.81%  2212.16 trle              ZIPF
  1220.42  384575089  96.14%  4981.63 srle32            ZIPF
   143.23  374799918  93.70%   153.36 SPDP              ZIPF
  1210.95   88598454  22.15%  2182.41 tpbyte+lz         ZIPF
  1515.51   79169862  19.79%  2452.75 tpnibble+lz       ZIPF
  1199.57   97566061  24.39%  1820.98 tpnibbleZ+lz      ZIPF
  1267.69   80520988  20.13%  1919.03 tpnibbleX+lz      ZIPF
   581.00  140363018  35.09%  1764.07 lz                ZIPF
   924.96   82886520  20.72%  1648.21 bitshuffle+lz     ZIPF
 11471.18  400000000 100.00% 11186.00 memcpy

Kernel  Time =     0.906 =    0%
User    Time =   230.281 =   98%
Process Time =   231.187 =   99%    Virtual  Memory =   3458 MB
Global  Time =   232.968 =  100%    Physical Memory =   1539 MB

D:\2tb>timer64 icappavx2.exe -a1.0 -m0 -M255 -n100M ZIPF
zipf alpha=1.00 range[0..255].n=100000000

 bits histogram:0:16.33% 1:8.16% 2:9.53% 3:10.36% 4:10.82% 5:11.07% 6:11.19% 7:11.25% 8:11.29%
   E MB/s     size     ratio    D MB/s  function
   974.61   85947474  21.49%  3814.68 p4nenc32          ZIPF
   962.15   85947474  21.49%  5574.21 p4nenc128v32      ZIPF
  1153.58   85277810  21.32%  7808.38 p4nenc256v32      ZIPF
   783.55  257508731  64.38%  3109.28 p4ndenc32         ZIPF
   772.81  257508731  64.38%  4509.48 p4ndenc128v32     ZIPF
  6004.65  400390622 100.10%  6301.99 bitndpack256v32   ZIPF
   771.69  270630517  67.66%  2996.50 p4nd1enc32        ZIPF
   747.61  270630517  67.66%  4168.14 p4nd1enc128v32    ZIPF
   917.49  269850532  67.46%  4838.98 p4nd1enc256v32    ZIPF
   854.17  106938431  26.73%  2934.96 p4nzenc32         ZIPF
   853.79  106938431  26.73%  3888.06 p4nzenc128v32     ZIPF
   985.82  106246317  26.56%  4863.46 p4nzenc256v32     ZIPF
   776.86  258289981  64.57%  2825.68 p4nsenc32         ZIPF
  6462.45  100781234  25.20%  8611.78 bitnpack32        ZIPF
  9988.26  100781234  25.20%  9956.44 bitnpack128v32    ZIPF
 12100.68  100390625  25.10%  9601.31 bitnpack256v32    ZIPF
  4607.18  400781247 100.20%  6133.18 bitndpack32       ZIPF
  5962.85  400781247 100.20%  6982.02 bitndpack128v32   ZIPF
  6009.53  400390622 100.10%  6300.70 bitndpack256v32   ZIPF
  4506.74  400781247 100.20%  6045.86 bitnd1pack32      ZIPF
  5738.88  400781247 100.20%  6607.31 bitnd1pack128v32  ZIPF
  5709.96  400390622 100.10%  6202.70 bitnd1pack256v32  ZIPF
  3723.04  113281202  28.32%  4462.49 bitnzpack32       ZIPF
  6344.47  113281202  28.32%  8745.65 bitnzpack128v32   ZIPF
  6217.36  112890625  28.22%  9475.98 bitnzpack256v32   ZIPF
  1363.23  125035755  31.26%  1427.88 vbzenc32          ZIPF
   906.99  154221057  38.56%  1038.42 vbddenc32         ZIPF
   857.41  105321194  26.33%  2455.42 vsenc32           ZIPF
   714.25  158182587  39.55%   857.88 bvzzenc32         ZIPF
   702.59  134817601  33.70%   774.70 bvzenc32          ZIPF
  1294.62  120079490  30.02%  1684.39 fpgenc32          ZIPF
   793.07  120978001  30.24%  2076.72 fpzzenc32         ZIPF
   720.93  106939168  26.73%  1011.42 fpfcmenc32        ZIPF
   775.36  118485007  29.62%   965.32 fpdfcmenc32       ZIPF
   779.20  118532992  29.63%   958.70 fp2dfcmenc32      ZIPF
   375.92  334760892  83.69%  2143.26 trle              ZIPF
  1944.92  399499744  99.87%  5706.13 srle32            ZIPF
   144.90  374393035  93.60%   155.71 SPDP              ZIPF
  4050.14  101569805  25.39%  4231.55 tpbyte+lz         ZIPF
  2368.39  100796897  25.20%  2834.43 tpnibble+lz       ZIPF
  1658.18  122666071  30.67%  1931.03 tpnibbleZ+lz      ZIPF
  2059.41  101369011  25.34%  2188.04 tpnibbleX+lz      ZIPF
   430.90  173718053  43.43%  1917.06 lz                ZIPF
  1254.67  101508819  25.38%  1982.39 bitshuffle+lz     ZIPF
 11598.57  400000000 100.00% 11179.43 memcpy

Kernel  Time =     0.812 =    0%
User    Time =   223.640 =   98%
Process Time =   224.453 =   99%    Virtual  Memory =   3458 MB
Global  Time =   226.318 =  100%    Physical Memory =   1539 MB

D:\2tb>timer64 icappavx2.exe -a1.5 -m0 -M16777215 -n100M ZIPF
zipf alpha=1.50 range[0..16777215].n=100000000

 bits histogram:0:38.28% 1:13.53% 2:12.16% 3:9.79% 4:7.40% 5:5.42% 6:3.90% 7:2.78% 8:1.97% 9:1.40% 10:0.99% 11:0.70% 12:0.50% 13:0.35% 14:0.25% 15:0.17% 16:0.12% 17:0.09% 18:0.06% 19:0.04% 20:0.03% 21:0.02% 22:0.02% 23:0.01% 24:0.01%
   E MB/s     size     ratio    D MB/s  function
   939.46   86091479  21.52%  3663.91 p4nenc32          ZIPF
   952.36   86091479  21.52%  5286.04 p4nenc128v32      ZIPF
  1166.66   88072061  22.02%  7426.11 p4nenc256v32      ZIPF
   758.13  248012100  62.00%  2978.72 p4ndenc32         ZIPF
   744.46  248012100  62.00%  4078.59 p4ndenc128v32     ZIPF
  6010.97  400390622 100.10%  6296.93 bitndpack256v32   ZIPF
   716.29  302333932  75.58%  2807.90 p4nd1enc32        ZIPF
   697.99  302333932  75.58%  3747.92 p4nd1enc128v32    ZIPF
   864.39  302336977  75.58%  4420.92 p4nd1enc256v32    ZIPF
   833.38  115851801  28.96%  2856.76 p4nzenc32         ZIPF
   847.26  115851801  28.96%  3747.28 p4nzenc128v32     ZIPF
  1019.84  119554815  29.89%  4812.96 p4nzenc256v32     ZIPF
   751.83  248793350  62.20%  2642.64 p4nsenc32         ZIPF
  5387.50  189928210  47.48%  8046.67 bitnpack32        ZIPF
  7201.89  189928210  47.48%  9089.05 bitnpack128v32    ZIPF
  9001.91  212105153  53.03%  8790.63 bitnpack256v32    ZIPF
  4604.37  400781247 100.20%  6114.15 bitndpack32       ZIPF
  6005.01  400781247 100.20%  6988.85 bitndpack128v32   ZIPF
  6010.88  400390622 100.10%  6302.29 bitndpack256v32   ZIPF
  4511.36  400781247 100.20%  6047.41 bitnd1pack32      ZIPF
  5754.48  400781247 100.20%  6604.47 bitnd1pack128v32  ZIPF
  5724.92  400390622 100.10%  6175.98 bitnd1pack256v32  ZIPF
  3316.34  202671090  50.67%  4052.73 bitnzpack32       ZIPF
  4736.25  202671090  50.67%  6430.45 bitnzpack128v32   ZIPF
  5543.47  224721248  56.18%  6920.77 bitnzpack256v32   ZIPF
  2116.82  116901527  29.23%  2144.61 vbzenc32          ZIPF
  1117.27  132703387  33.18%  1269.38 vbddenc32         ZIPF
   758.57   94288260  23.57%  2128.92 vsenc32           ZIPF
   638.88  140306377  35.08%   724.22 bvzzenc32         ZIPF
   632.87  114478985  28.62%   671.63 bvzenc32          ZIPF
   767.43  123291402  30.82%   760.99 fpgenc32          ZIPF
   750.16  134183406  33.55%  1990.01 fpzzenc32         ZIPF
   696.08  115855322  28.96%  1005.43 fpfcmenc32        ZIPF
   685.28  135283348  33.82%   903.21 fpdfcmenc32       ZIPF
   685.72  135537115  33.88%   897.72 fp2dfcmenc32      ZIPF
   393.49  248798855  62.20%  2159.71 trle              ZIPF
  1282.04  387188596  96.80%  5076.34 srle32            ZIPF
   145.41  382964658  95.74%   154.18 SPDP              ZIPF
   897.27  112509483  28.13%  1804.07 tpbyte+lz         ZIPF
  1047.15  105310206  26.33%  1994.48 tpnibble+lz       ZIPF
   861.71  122108089  30.53%  1651.97 tpnibbleZ+lz      ZIPF
   909.48  107623191  26.91%  1670.74 tpnibbleX+lz      ZIPF
   471.35  153956085  38.49%  1731.01 lz                ZIPF
   683.11  116999441  29.25%  1366.40 bitshuffle+lz     ZIPF
 11276.50  400000000 100.00% 11254.92 memcpy

Kernel  Time =     0.906 =    0%
User    Time =   231.234 =   98%
Process Time =   232.140 =   99%    Virtual  Memory =   3458 MB
Global  Time =   234.076 =  100%    Physical Memory =   1539 MB

D:\2tb>timer64 icappavx2.exe -a1.0 -m0 -M16777215 -n100M ZIPF
zipf alpha=1.00 range[0..16777215].n=100000000

 bits histogram:0:5.81% 1:2.90% 2:3.39% 3:3.68% 4:3.85% 5:3.94% 6:3.98% 7:4.01% 8:4.01% 9:4.02% 10:4.02% 11:4.02% 12:4.03% 13:4.02% 14:4.03% 15:4.03% 16:4.03% 17:4.03% 18:4.02% 19:4.03% 20:4.02% 21:4.03% 22:4.03% 23:4.03% 24:4.03%
   E MB/s     size     ratio    D MB/s  function
   853.25  234524047  58.63%  3611.41 p4nenc32          ZIPF
   834.63  234524047  58.63%  5076.01 p4nenc128v32      ZIPF
  1051.56  234585143  58.65%  7011.39 p4nenc256v32      ZIPF
   717.99  357972550  89.49%  2776.60 p4ndenc32         ZIPF
   704.82  357972550  89.49%  3656.27 p4ndenc128v32     ZIPF
  5964.27  400390622 100.10%  6327.91 bitndpack256v32   ZIPF
   713.43  358789013  89.70%  2754.31 p4nd1enc32        ZIPF
   696.99  358789013  89.70%  3626.51 p4nd1enc128v32    ZIPF
   872.49  358770020  89.69%  4458.46 p4nd1enc256v32    ZIPF
   765.67  277519798  69.38%  2927.64 p4nzenc32         ZIPF
   754.60  277519798  69.38%  3632.63 p4nzenc128v32     ZIPF
   911.08  277568871  69.39%  4594.16 p4nzenc256v32     ZIPF
   712.55  358753800  89.69%  2472.46 p4nsenc32         ZIPF
  5367.47  300715346  75.18%  7604.27 bitnpack32        ZIPF
  7620.64  300715346  75.18%  8231.13 bitnpack128v32    ZIPF
  8265.83  300390465  75.10%  8023.91 bitnpack256v32    ZIPF
  4605.38  400781247 100.20%  6114.53 bitndpack32       ZIPF
  5999.52  400781247 100.20%  6990.93 bitndpack128v32   ZIPF
  5977.82  400390622 100.10%  6347.80 bitndpack256v32   ZIPF
  4518.19  400781247 100.20%  6040.20 bitnd1pack32      ZIPF
  5752.75  400781247 100.20%  6607.31 bitnd1pack128v32  ZIPF
  5673.76  400390622 100.10%  6194.35 bitnd1pack256v32  ZIPF
  3367.94  313214128  78.30%  4109.65 bitnzpack32       ZIPF
  5204.40  313214128  78.30%  6588.92 bitnzpack128v32   ZIPF
  5324.60  312890495  78.22%  6687.84 bitnzpack256v32   ZIPF
   647.97  306367133  76.59%   674.85 vbzenc32          ZIPF
   722.18  364087555  91.02%   764.38 vbddenc32         ZIPF
   577.34  297093559  74.27%  1492.32 vsenc32           ZIPF
   681.34  364823129  91.21%   651.31 bvzzenc32         ZIPF
   586.81  325210884  81.30%   581.57 bvzenc32          ZIPF
   488.85  294061054  73.52%   510.17 fpgenc32          ZIPF
   691.01  300186099  75.05%  1973.00 fpzzenc32         ZIPF
   655.94  277546018  69.39%   984.83 fpfcmenc32        ZIPF
   640.01  297992703  74.50%   893.96 fpdfcmenc32       ZIPF
   640.57  298483029  74.62%   887.78 fp2dfcmenc32      ZIPF
   196.93  382657677  95.66%  4332.85 trle              ZIPF
  2332.69  399991823 100.00%  5752.58 srle32            ZIPF
   244.97  391918149  97.98%   250.82 SPDP              ZIPF
   777.42  267766854  66.94%  2307.98 tpbyte+lz         ZIPF
   946.71  275614096  68.90%  2257.60 tpnibble+lz       ZIPF
   866.65  300097431  75.02%  1838.28 tpnibbleZ+lz      ZIPF
   879.41  279554067  69.89%  1916.04 tpnibbleX+lz      ZIPF
   401.74  337515225  84.38%  1647.17 lz                ZIPF
   843.76  286695704  71.67%  1751.57 bitshuffle+lz     ZIPF
 10544.35  400000000 100.00% 10764.26 memcpy

Kernel  Time =     0.968 =    0%
User    Time =   262.765 =   98%
Process Time =   263.734 =   99%    Virtual  Memory =   3458 MB
Global  Time =   265.572 =  100%    Physical Memory =   1539 MB

D:\2tb>

In all of these cases, TurboPFor256N/p4nenc256v32 outperforms the rest, big time.

I really like the way you presented the whole page. Allow me just one playful counter: 'TurboPFor: The new synonym for "integer compression"' is an understatement. It could be stated more accurately by replacing the weak 'synonym' with 'definition' or one of 'mainstay, pillar, backbone, anchor, linchpin, kingpin, MVP'.
In both American and British English, 'mainstay' simply means 'a chief support', so it fits nicely.

The widespread MVP (Most-Valuable-Player) has a ring to it when combined with TurboPFor: the MVP (Most-Valuable-...Performer/Packer). 'Linchpin' is itself quite a powerful replacement, meaning a person or thing vital to an enterprise or organization.

Also, one superb synonym for 'mainstay' is 'fulcrum'. Funnily enough, it is used in air forces as a designation not just for super-useful aircraft, but for Most-Valuable-...Planes.

Example sentences containing 'mainstay':
He is a fantastic footballer and a mainstay in our team. /The Sun (2015)/
They are now in their forties and the mainstay of the economy. /Times, Sunday Times (2009)/

My English is forever buggy, yet, I constantly explore its diversity while not fearing going against/across the mainstreamish dogma, e.g. ask a specialist not well, but wonderfully-versed in English, with how many 'dragon+suffixes' words he/she can come up with. Such hidden versatility/vividness lies in those formations! Even the very suffixes/postfixes are not in-depth explored/catalogued! I mention this to stress how rigid and unyielding are the people upholding the status quo - they refuse to accept the new vivid "coinages/etudes" - as if they have the last word. Preaching to the choir I am, you good well know it.

incorrect benchmark for floating points

When icapp is used to benchmark the encoding of double values, the values lose their precision because IPUSH casts them to uint64_t. This makes the benchmark incorrect, particularly the ratio %.

https://github.com/powturbo/TurboPFor/blob/master/icapp.c#L206

Which gets called here:

https://github.com/powturbo/TurboPFor/blob/master/icapp.c#L290
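
To make the distinction in this report concrete, here is a minimal sketch. It is not the icapp.c code, and the three helper names are hypothetical. It contrasts a value conversion from double to uint64_t, which discards the fractional part, with a bit-pattern copy, which keeps all 64 bits of the representation (assuming a 64-bit IEEE-754 double). Which of the two icapp should use is for the maintainer to decide; the sketch only shows why they hand different data to a compressor.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: value conversion, which is what a plain C cast does.
   The fractional part is discarded. Out-of-range input would be undefined
   behaviour, so it is rejected up front. */
static uint64_t double_value_to_u64(double d) {
    assert(d >= 0.0 && d < 18446744073709551616.0); /* 2^64 */
    return (uint64_t)d;
}

/* Hypothetical helper: copy the 64-bit pattern unchanged. */
static uint64_t double_bits_to_u64(double d) {
    uint64_t u;
    assert(sizeof u == sizeof d);
    memcpy(&u, &d, sizeof u);
    return u;
}

/* Inverse of double_bits_to_u64. */
static double u64_bits_to_double(uint64_t u) {
    double d;
    memcpy(&d, &u, sizeof d);
    return d;
}
```

With these helpers, 3.75 becomes 3 under value conversion, while the bit copy round-trips back to 3.75 exactly.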

direct access after using p4n*enc**

Hi - if I've compressed a large number of integers with one of the n functions, is there any way to directly access a block of integers in the middle? For example, suppose I have 1 million integers and I want to read the integers starting at offset 500,000.
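
Whether the p4n* stream format itself supports seeking is not answered in this thread. A generic workaround, sketched below under that uncertainty, is to encode in fixed-size blocks and keep a side table with the byte offset of each block, so decoding can start at any block boundary. The codec here is a plain variable-byte stand-in, not a TurboPFor function, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in codec (plain variable-byte), NOT a TurboPFor function. */
static unsigned char *vb_put(uint32_t v, unsigned char *op) {
    while (v >= 0x80) { *op++ = (unsigned char)(v | 0x80); v >>= 7; }
    *op++ = (unsigned char)v;
    return op;
}

static const unsigned char *vb_get(const unsigned char *ip, uint32_t *v) {
    uint32_t x = 0;
    unsigned shift = 0;
    while (*ip & 0x80) { x |= (uint32_t)(*ip++ & 0x7f) << shift; shift += 7; }
    x |= (uint32_t)*ip++ << shift;
    *v = x;
    return ip;
}

/* Encode n values in blocks of `block` values. offsets[b] receives the byte
   position where block b starts. `out` needs up to 5*n bytes and `offsets`
   needs (n + block - 1) / block entries. Returns the bytes written. */
static size_t encode_indexed(const uint32_t *in, size_t n, size_t block,
                             unsigned char *out, size_t *offsets) {
    assert(block > 0);
    unsigned char *op = out;
    for (size_t i = 0; i < n; i++) {
        if (i % block == 0) offsets[i / block] = (size_t)(op - out);
        op = vb_put(in[i], op);
    }
    return (size_t)(op - out);
}

/* Decode `count` values beginning at element index `start`, which must be
   a multiple of `block`. Only the bytes from that block onward are read. */
static void decode_from(const unsigned char *out, const size_t *offsets,
                        size_t block, size_t start, size_t count,
                        uint32_t *dst) {
    assert(block > 0 && start % block == 0);
    const unsigned char *ip = out + offsets[start / block];
    for (size_t i = 0; i < count; i++) ip = vb_get(ip, &dst[i]);
}
```

For 1 million integers and a block size of 128, reading from element 500,000 would mean starting at block 3906 (element 499,968) and skipping the first 32 decoded values. The offset table costs one entry per block.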

Remove invalid references to USE_SSE, USE_AVX2

https://github.com/powturbo/TurboPFor/blob/5dff9d3fc120fafa13cedbc8ea0f249e8291215a/vs/vs2017/TurboPFor.vcxproj#L92

If you remove USE_SSE & USE_AVX2, why don't you grep the entire repo to see if you have leftovers here and there?

recommended output buffer estimation

Is there any guidance on estimating the maximum output buffer size required for N values of type T?

SIGABRT while using p4enc64.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#include "conf.h"
#include "bitpack.h"
#include "vint.h"
#include "bitutil.h"
#include "vp4.h"

typedef uint64_t LONG;

int main(void) {
	/* Input: five unsorted 64-bit values. */
	LONG *array = (LONG *)malloc(sizeof(LONG)*5);
	array[0] = 24520120;
	array[1] = 29620120;
	array[2] = 42420120;
	array[3] = 20124222;
	array[4] = 4294967295;

	/* Output buffer: 50 bytes for 40 bytes of input. */
	unsigned char *out = (unsigned char *)malloc(5*10*sizeof(unsigned char));
	unsigned char *op = p4enc64(array, 5, out);
	printf("%d\n", (int)(op-out));   /* compressed size in bytes */

	/* Decode buffer sized for exactly 5 values. */
	LONG *capacity = (LONG *)malloc(sizeof(LONG)*5);
	unsigned char *op2 = p4dec64(out, 5, capacity);
	printf("%d\n", (int)(op2-out));  /* offset returned by the decoder, relative to out */

	return 0;
}

When attempting to use p4enc64 in the above code, the following error is raised.

malloc.c:2372: sysmalloc: Assertion `(old_top == (((mbinptr) (((char *) &((av)->bins[((1) - 1) * 2])) - __builtin_offsetof (struct malloc_chunk, fd)))) && old_size == 0) || ((unsigned long) (old_size) >= (unsigned long)((((__builtin_offsetof (struct malloc_chunk, fd_nextsize))+((2 *(sizeof(size_t))) - 1)) & ~((2 *(sizeof(size_t))) - 1))) && ((old_top)->size & 0x1) && ((unsigned long) old_end & pagemask) == 0)' failed.

Program received signal SIGABRT, Aborted.
0x00007ffff700dc37 in __GI_raise (sig=sig@entry=6)
at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56 ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.

No issue occurs with uint32_t, or when malloc is not used.

Please correct me if there is any issue with the code. I am trying to use TurboPFor in my code for compressing 64-bit unsorted integer arrays. I am also having trouble allocating the right buffer size for the compressed array. What is the minimum buffer size I can safely allocate when using p4enc64 on an integer array of size n?
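
The thread does not establish the cause of the abort or state a documented size bound. Since a glibc malloc assertion of this kind usually points to corrupted heap metadata, one generic way to check whether a codec writes past the end of a buffer is to over-allocate and fill the extra bytes with a known pattern. The sketch below shows that debugging technique only. The helper names and the guard value are hypothetical, and it does not say how much slack TurboPFor needs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define GUARD_BYTE 0xA5 /* arbitrary pattern; an assumption of this sketch */

/* Allocate `size` usable bytes followed by `slack` guard bytes. */
static unsigned char *guarded_alloc(size_t size, size_t slack) {
    assert(size <= SIZE_MAX - slack);
    unsigned char *p = malloc(size + slack);
    if (p) memset(p + size, GUARD_BYTE, slack);
    return p;
}

/* Return how many bytes past `size` were touched (distance to the last
   modified guard byte); 0 means the guard region is intact. A stray write
   that happens to store GUARD_BYTE itself goes unnoticed. */
static size_t guard_overrun(const unsigned char *p, size_t size, size_t slack) {
    for (size_t i = slack; i > 0; i--)
        if (p[size + i - 1] != GUARD_BYTE) return i;
    return 0;
}
```

Allocating both the output and the decode buffer this way, then calling guard_overrun after each codec call, shows whether the call stayed within the logical size and, if not, by how many bytes it went over.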

Compress ratio < 7z

I tested TP and 7z on 16-bit unsigned data. Even the best TP algorithm seems to compress worse than 7z:
TP: 33.1, 7z: 32.7
TP: 55.2, 7z: 51.2

can not build on mac os

Undefined symbols for architecture x86_64:
  "_aligned_alloc", referenced from:
      _main in icbench.o
  "_aligned_free", referenced from:
      _afree in icbench.o
      _main in icbench.o
  "_fopen64", referenced from:
      _main in icbench.o
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make: *** [icbench] Error 1
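
The leading underscore in the linker message is the platform's symbol prefix, so the missing functions are aligned_alloc, aligned_free and fopen64. fopen64 is a glibc extension, aligned_free is not a standard C function, and aligned_alloc (C11) may be absent from older SDKs. The following is a minimal sketch of portable replacements built on POSIX calls; the x-prefixed wrapper names are hypothetical and this is not the fix applied in the repository.

```c
#define _POSIX_C_SOURCE 200112L /* exposes posix_memalign */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Aligned allocation via posix_memalign. The alignment must be a power of
   two and a multiple of sizeof(void *). Returns NULL on failure. */
static void *xaligned_alloc(size_t alignment, size_t size) {
    assert(alignment >= sizeof(void *) && (alignment & (alignment - 1)) == 0);
    void *p = NULL;
    return posix_memalign(&p, alignment, size) == 0 ? p : NULL;
}

/* Memory from posix_memalign is released with plain free(). */
static void xaligned_free(void *p) { free(p); }

/* Assumption: on platforms where off_t is already 64 bits wide, plain
   fopen() handles large files and no separate fopen64 is needed. */
static FILE *xfopen64(const char *path, const char *mode) {
    return fopen(path, mode);
}
```

The benchmark sources could then map the three missing names onto these wrappers under a platform check such as `#ifdef __APPLE__`.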

Comparison with Blosc can be improved

I have just seen your project and I see that you are using Blosc for comparison. For completeness, I would recommend including other Blosc codecs, as they give different results that may be a good fit in different scenarios.

Below, I have selected several combinations that work well for this problem. Here are my results (Linux, GCC 4.9.2, Xeon E3-1240 v3 @ 3.40GHz):

$ time LD_LIBRARY_PATH=/home/francesc/c-blosc/build/blosc ./icbench -a1.5 -m0 -M255 -n100m
zipf alpha=1.50 range[0..255].n=100000000

bits histogram:0:40.20% 1:14.22% 2:12.75% 3:10.28% 4:7.77% 5:5.69% 6:4.09% 7:2.92% 8:2.07%
     size  ratio%  bits/int  comp. speed  decomp. speed  function
 62064288   15.52      4.97        27.85         130.24  blosc (zlib, nthreads=1, clevel=3)
 62064288   15.52      4.97        42.85         214.16  blosc (zlib, nthreads=4, clevel=3)
 63392801   15.85      5.07       396.74        1412.46  TurboPFor
 77187330   19.30      6.17        51.44         663.17  blosc (lz4hc, nthreads=1, clevel=3)
 77187330   19.30      6.17        78.52         984.59  blosc (lz4hc, nthreads=4, clevel=3)
101473443   25.37      8.12       862.60         896.15  blosc (lz4, nthreads=1, clevel=3)
101473443   25.37      8.12      1193.86        1392.87  blosc (lz4, nthreads=2, clevel=3)
101473443   25.37      8.12      1608.41        1549.34  blosc (lz4, nthreads=4, clevel=3)
102491006   25.62      8.20       595.36        1815.32  blosc (blosclz, nthreads=1, clevel=3)
102491006   25.62      8.20       949.23        2049.91  blosc (blosclz, nthreads=2, clevel=3)
102491006   25.62      8.20      1293.24        1873.03  blosc (blosclz, nthreads=4, clevel=3)

The timings with clang 3.5 are very close to these, so I am not reproducing them.

By the way, very nice work! I did not realize that compressing integers was that important!

Compilation and Linker errors for _TURBOPFOR-enabled idxcr and idxqry

Hi,

I am trying to test the inverted-index tools idxcr/idxqry with the TurboPFor-compressed version of the index format, as a code comment claims this results in smaller index sizes (perhaps also faster speed?).

However, I seem unable to compile and link them. I'm on Ubuntu 16.04 with gcc 7.2.0, Intel Core i7-7500U.

First, the call at idxcr.c:123 is missing some function arguments. I tried to supply them heuristically (original line first, my change second), but I am not sure this is correct:

            b = _p4dec32(ip+1, n-1, &bx);

            b = _p4dec32(ip+1, n-1, out, b, &bx);

Second, even after that I get linker errors when I modify the makefile to define the necessary
C-preprocessor symbol and then issue $ make search AVX2=1:

ifeq ($(UNAME), Linux)
search: CFLAGS += -D_TURBOPFOR=1
search: idxcr
endif

The linker errors are:
In function `main':
idxcr.c:123: undefined reference to `_p4dec32'
idxcr.c:146: undefined reference to `_p4dec32'
idxcr.c:146: undefined reference to `_p4dec128v32'
collect2: error: ld returned 1 exit status

idxqry is similar.

Any fixes/patches to make this work?

Thanks a lot in advance, Markus

Ruby Port and some beginner questions

Hi,
Congratulations on building this library. I currently use Lemire's FastPfor, and after reading the information on your page I decided to try it out, since Lemire's FastPFor does not offer integer intersection out of the box.

1. I am in the process of building a Ruby port of Lemire's FastPFor library and I want to build one for your library as well. I would appreciate a few details, such as which headers should be included and which functions need to be referenced. (When building a Ruby port I need to write one C file that defines a single function, which is called from Ruby. That function must then do all the interaction with your library. So, for example, if I pass in a list of integers from Ruby, which function from your library should I call to compress it, and which one to decompress it?)
2. Is it possible to use Redis as a storage medium?

Many thanks,
RRPhotosoft.

Big endian Issues

When the bitpack algorithm runs on a big-endian machine, the decompressed output is not correct. Does this algorithm take big-endian byte order into account?
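
The report does not say which code path misbehaves. As background, a common way to make a packed format independent of host byte order is to read and write multi-byte words through explicit shifts instead of casting pointers. A minimal sketch with hypothetical helper names, not taken from the TurboPFor sources:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store v as four little-endian bytes, regardless of host byte order. */
static void store_le32(unsigned char *p, uint32_t v) {
    assert(p != NULL);
    p[0] = (unsigned char)(v);
    p[1] = (unsigned char)(v >> 8);
    p[2] = (unsigned char)(v >> 16);
    p[3] = (unsigned char)(v >> 24);
}

/* Load four little-endian bytes into a host-order value. */
static uint32_t load_le32(const unsigned char *p) {
    assert(p != NULL);
    return (uint32_t)p[0]
         | (uint32_t)p[1] << 8
         | (uint32_t)p[2] << 16
         | (uint32_t)p[3] << 24;
}
```

Data written with store_le32 on a little-endian machine reads back identically through load_le32 on a big-endian one, whereas a direct `*(uint32_t *)p` access would see the bytes reversed.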

undefined reference to `p4enc64'

Hi,

I wanted to play around with Floating Point compression and I thought I could just simply do:

#include "fp.h"
#include "fp.c"

to an existing project, but then I get:

undefined reference to `p4enc64'
undefined reference to `p4enc64'

I had a look around the code but couldn't find p4enc64 or p4dec64, so I assume I am doing something wrong. I am using the latest code from GitHub, with Mingw64.

