Bit-Information-Content Tool

Bitshaper is a powerful tool for comparing data without regard to the bits that do not carry significant information. It also offers the possibility to explore the data for information content.

Quick start

This section shows some examples of the most common applications.

Install

python3 -m pip install bitinformation

Compare GRIB files

The comparison is made with a mask. This mask is calculated by the analyser, it can be read from the file or passed with the mask parameter.

The usual case is that you will probably just want to compare two files. But this assumes that you already have a configuration.

bitshaper.py --compare file1.grib file2.grib --preprocessor raw

If you don't have a configuration file yet, you can create one by running the tool with --use-analyser --add-missing-parameters arguments.

bitshaper.py --compare file1.grib file2.grib --preprocessor raw --use-analyser --add-missing-parameters

Compute bitsPerValue in Simple Packing

test

bitshaper.py --compare file1.grib file2.grib

Explore data

If you want to analyse the levels of each parameter, you must first define the primary key. This is a set of keys from different sources, e.g. Mars keys, Analyser and Preprocessor parameters. Then, in --value-key you define which values you want to record. The data then is exported to a CSV file.

params+="--primary-key short_name stream analyser_precision levelist preprocessor_bits_per_value "
params+="--value-key mask nbits_used"
params+="--csv $out_dir/explore.csv "
$tool $params --stats file1.grib file2.grib file3.grib

Algorithm behind the scene

The method calculates how much information content each bit in a number has. In essence, it is a statistical analysis of bit sequences. For example, according to this approach, random sequences of binary values and or a sequences of ones or zeros contain no information. Once a sequence has a structure, the information content is non-zero.

[0101010101010101] # low information content
[1111111111111111] # zero information content
[0000000000000000] # zero information content
[0000111100001111] # high information content

The following example explains the algorithm step by step without using formulas, when possible.

In the first step, assume there is a sequence S of 4-bit numbers. The sequence S is split into two arrays A and B. A is created by removing the last element from S and B, by removing the first element. The example below uses Python notatation to illustrate that.

S = [0, 1, 2, 3, 4, 5, 6, 7]

A = S[:-1] = [0, 1, 2, 3, 4, 5, 6]
B = S[1: ] = [1, 2, 3, 4, 5, 6, 7]

The next step is presented as a spreadsheet. In our example we work with 4-bit numbers, so we can identify each bit with the index i = [0, 1, 2, 3]. For illustration, we exapand our table with i and A, and i and B. A' and B' are the binary representations of the columns A and B, respectively. The columns A'[i] and B'[i] are the bits at the position i.

i	A	B	A'=bin(A)	B'=bin(B)	A'[i]	B'[i]	seq = A'[idx]B'[idx]
0	0	1	0000	0001	0	1	01
0	1	2	0001	0010	1	0	10
0	2	3	0010	0011	0	1	01
0	3	4	0011	0100	1	0	10
0	4	5	0100	0101	0	1	01
0	5	6	0101	0110	1	0	10
0	6	7	0110	0111	0	1	01
1	0	1	0000	0001	0	0	00
1	1	2	0001	0010	0	1	01
1	2	3	0010	0011	1	1	11
1	3	4	0011	0100	1	0	10
1	4	5	0100	0101	0	0	00
1	5	6	0101	0110	0	1	01
1	6	7	0110	0111	1	1	11
2	0	1	0000	0001	0	0	00
2	1	2	0001	0010	0	0	00
2	2	3	0010	0011	0	0	00
2	3	4	0011	0100	0	1	01
2	4	5	0100	0101	1	1	11
2	5	6	0101	0110	1	1	11
2	6	7	0110	0111	1	1	11
3	0	1	0000	0001	0	0	00
3	1	2	0001	0010	0	0	00
3	2	3	0010	0011	0	0	00
3	3	4	0011	0100	0	0	00
3	4	5	0100	0101	0	0	00
3	5	6	0101	0110	0	0	00
3	6	7	0110	0111	0	0	00

The next stpe is groupping the table by (i, seq) columns and count the occurences. p is the probability with wich a sequence at bit position i occurs.

i	seq	count	p = count/7
0	00	0	0.000
0	01	4	0.571
0	10	3	0.429
0	11	0	0.000
1	00	2	0.286
1	01	2	0.286
1	10	1	0.143
1	11	2	0.286
2	00	3	0.429
2	01	1	0.143
2	10	0	0.000
2	11	3	0.429
3	00	7	1.000
3	01	0	0.000
3	10	0	0.000
3	11	0	0.000

In the last step we compute the mutual information. To do that we take the columns i and p from the table and reshape them so that we have the probabilities for each sequence, i.e., p00, p01, p10, p11, in separate columns. This allows us to continue our example as a spreadsheet.

Formula below computes mutual information. It says how much information a bit contains.

M' = p00 * log(p00 / (p00 + p01) / (p00 + p10)) +
     p01 * log(p01 / (p00 + p01) / (p01 + p11)) +
     p10 * log(p10 / (p10 + p11) / (p00 + p10)) +
     p11 * log(p11 / (p10 + p11) / (p01 + p11))

M = M' / log(2)

i	p00	p01	p10	p11	M
0	0.000	0.571	0.429	0.000	0.699
1	0.286	0.286	0.143	0.286	2.061
2	0.429	0.143	0.000	0.429	0.235
3	1.000	0.000	0.000	0.000	0.000

joobog / bitinformation Goto Github PK

bitinformation's Introduction

Bit-Information-Content Tool

Quick start

Install

Compare GRIB files

Compute bitsPerValue in Simple Packing

Explore data

Algorithm behind the scene

bitinformation's People

Contributors

Stargazers

Watchers

bitinformation's Issues

Is this library working ?

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent