
entropy_estimators's Introduction

Entropy estimators

This module implements estimators for the entropy and other information theoretic quantities of continuous distributions, including:

  • entropy / Shannon information (get_h),
  • mutual information (get_mi),
  • partial mutual information & transfer entropy (get_pmi),
  • specific information (get_imin), and
  • partial information decomposition (get_pid).
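
For a quick impression of the interface, here is a minimal usage sketch of the first three functions (a hedged illustration only; the call patterns mirror the examples further below, and get_imin / get_pid are omitted):

import numpy as np
from entropy_estimators import continuous

x = np.random.randn(1000)
y = x + np.random.randn(1000)
z = np.random.randn(1000)

h   = continuous.get_h(x, k=5)                     # entropy of x
mi  = continuous.get_mi(x, y, k=5)                 # mutual information between x and y
pmi = continuous.get_pmi(x, y, z, estimator="fp")  # mutual information between x and y given z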

The estimators derive from the Kozachenko and Leonenko (1987) estimator, which uses k-nearest neighbour distances to compute the entropy of distributions, and extensions thereof developed by Kraskov et al. (2004) and Frenzel and Pompe (2007).
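
To make the nearest-neighbour idea concrete, here is a minimal sketch of the basic Kozachenko-Leonenko estimator for samples of shape (n, d); the helper kl_entropy below is purely illustrative and ignores the tie handling, metric choices and parallelisation that get_h provides, so it is not the library's implementation:

import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma, gammaln

def kl_entropy(x, k=5):
    # differential entropy (in nats) of samples x with shape (n, d)
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, np.newaxis]
    n, d = x.shape
    # distance from each sample to its k-th nearest neighbour
    # (column 0 of the query result is the sample itself)
    distances, _ = cKDTree(x).query(x, k=k + 1)
    epsilon = distances[:, -1]
    # log-volume of the d-dimensional unit ball
    log_c_d = 0.5 * d * np.log(np.pi) - gammaln(0.5 * d + 1)
    # note the factor of two: the estimator uses neighbourhood diameters, not radii
    return digamma(n) - digamma(k) + log_c_d + d * np.mean(np.log(2 * epsilon))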

For multivariate normal distributions, the following quantities can be computed analytically from the covariance matrix:

  • entropy (get_h_mvn),
  • mutual information (get_mi_mvn), and
  • partial mutual information & transfer entropy (get_pmi_mvn).

Installation

Easiest via pip:

pip install entropy_estimators

Examples

import numpy as np
from entropy_estimators import continuous

# create some normal test data
X = np.random.randn(10000, 2)

# compute the entropy from the determinant of the multivariate normal distribution:
analytic = continuous.get_h_mvn(X)

# compute the entropy using the k-nearest neighbour approach
# developed by Kozachenko and Leonenko (1987):
kozachenko = continuous.get_h(X, k=5)

print(f"analytic result: {analytic:.5f}")
print(f"K-L estimator: {kozachenko:.5f}")

Frequently asked questions

Why is the estimate of the mutual information negative? Shouldn't it always be positive?

Mutual information is a non-negative quantity. However, its estimate need not be; in fact, the nearest neighbour estimators are known to be biased (Kraskov et al. 2004). Unfortunately, the bias depends on multiple factors, primarily the number of samples and the choice of the k parameter, and thus cannot be known a priori. However, the bias itself can be estimated using a straightforward permutation / bootstrap approach:

  1. Compute the mutual information estimate between two variables, X and Y.
  2. Permute either variable (or both), and re-compute the estimate. The mutual information between randomised variables is zero, so this estimate represents the bias.
  3. Repeat the previous step many times to obtain a robust estimate of the bias. The example below demonstrates this procedure:

import numpy as np

from scipy.stats import multivariate_normal
from entropy_estimators import continuous

# create two variables with a mutual information that can be computed analytically
means = [0, 1]
covariance = np.array([[1, 0.5], [0.5, 1]])

def get_entropy(covariance):
    """Compute the entropy of multivariate normal distribution from the covariance matrix."""
    if np.size(covariance) > 1:
        dim = covariance.shape[0]
        det = np.linalg.det(covariance)
    else: # scalar
        dim = 1
        det = covariance
    return 0.5 * np.log((2 * np.pi * np.e)**dim * det)

hx  = get_entropy(covariance[0, 0])
hy  = get_entropy(covariance[1, 1])
hxy = get_entropy(covariance)
analytic_result = hx + hy - hxy

# compute the mutual information from samples using the KSG estimator
distribution = multivariate_normal(means, covariance)
X, Y = distribution.rvs(1000).T

k = 5
ksg_estimate = continuous.get_mi(X, Y, k=k)

print(f"Analytic result: {analytic_result:.3f} nats")
print(f"KSG estimate: {ksg_estimate:.3f} nats")
print(f"Difference: {analytic_result - ksg_estimate:.3f} nats")
# Analytic result: 0.144 nats
# KSG estimate: 0.113 nats
# Difference: 0.031 nats

# bootstrap to determine the bias
total_repeats = 100
bias = 0
Y_shuffled = Y.copy()
for ii in range(total_repeats):
    np.random.shuffle(Y_shuffled) # shuffling occurs in-place!
    bias += continuous.get_mi(X, Y_shuffled, k=k)
bias /= total_repeats

print("--------------------------------------------------------------------------------")
print(f"Bias estimat: {bias:.3f} nats")
print(f"Corrected KSG estimate: {ksg_estimate - bias:.3f}")
print(f"Difference to analytic result: {analytic_result - (ksg_estimate - bias):.3f} nats")
# Bias estimat: -0.020 nats
# Corrected KSG estimate: 0.132
# Difference to analytic result: 0.012 nats

Alternative Implementations

Scipy

  • scipy.stats.entropy : entropy of a categorical variable

Scikit-learn

  • sklearn.metrics.mutual_info_score : mutual information between two categorical variables

  • sklearn.metrics.mutual_info_regression : mutual information between two continuous variables; note that their implementation does not report negative mutual information scores and thus makes it impossible to compute bias corrections using the bootstrap approach outlined above (see the sketch below).
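
As a small illustration of the difference, a hedged sketch (k=3 is chosen to match sklearn's default of n_neighbors=3):

import numpy as np
from sklearn.feature_selection import mutual_info_regression
from entropy_estimators import continuous

x = np.random.randn(1000)
y = np.random.randn(1000)  # independent of x, so the true mutual information is zero

print(continuous.get_mi(x, y, k=3))                                   # fluctuates around zero, can be negative
print(mutual_info_regression(x.reshape(-1, 1), y, n_neighbors=3)[0])  # clipped at zero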

Non-parametric Entropy Estimation Toolbox (NPEET)

Alternative Python implementations of the nearest-neighbour estimators for the entropy of continuous variables, the mutual information, and the partial/conditional mutual information (link). In principle, there are no major differences between their implementation and this repository. However, for large samples, their implementation may run a little slower, as it uses lists as the primary data structure and doesn't support parallelisation. The implementation in this repository mostly uses numpy arrays, which allows vectorisation of many calculations, and supports running operations on multiple cores by setting the workers argument to values larger than one.
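
For example, a hedged sketch of the parallel option (assuming the workers keyword is accepted by get_h as described above; the speed-up depends on sample size and dimensionality):

import numpy as np
from entropy_estimators import continuous

X = np.random.randn(100000, 3)
h = continuous.get_h(X, k=5, workers=4)  # run the k-nearest-neighbour queries on four cores
print(h)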


entropy_estimators's Issues

categorical values mutual info pls

I found your code via this link:
https://stackoverflow.com/questions/43265770/entropy-python-implementation

However, I need to estimate the mutual information between categorical values in order to find similar features, which I then want to bicluster using the Spectral Co-Clustering algorithm:
https://scikit-learn.org/stable/auto_examples/bicluster/plot_spectral_coclustering.html#sphx-glr-auto-examples-bicluster-plot-spectral-coclustering-py

In this discussion:
https://www.researchgate.net/post/How_do_I_compute_the_Mutual_Information_MI_between_2_or_more_features_in_Python_when_the_data_are_not_necessarily_discrete

it was recommended to use sklearn.feature_selection.mutual_info_classif ("Estimate mutual information for a discrete target variable"):
https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_classif.html

Could you share some simple Python code that could be used for this?
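
A hedged sketch of one option for categorical features, using sklearn.metrics.mutual_info_score (mentioned in the Alternative Implementations section above) rather than the continuous k-nearest-neighbour estimators in this package:

import numpy as np
from sklearn.metrics import mutual_info_score

a = np.random.choice(["red", "green", "blue"], size=1000)
b = np.random.choice(["circle", "square"], size=1000)

# mutual information between two categorical variables, in nats;
# close to zero for independent features
print(mutual_info_score(a, b))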

Support for computing entropy of a Tensor

I would like to be able to use this to compute the continuous entropy of a tensor, so that I do not run into trouble with backpropagation.

I am trying to use continuous.get_h() for this, but kdtree = cKDTree(x) throws an error, as the tensor needs to be converted to a numpy array first.

Any advice or help would be much appreciated.
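
A hedged sketch of the workaround the error points to, assuming a PyTorch tensor; note that detaching breaks the computational graph, so the resulting estimate cannot be backpropagated through:

import torch
from entropy_estimators import continuous

x = torch.randn(1000, 2, requires_grad=True)

# detach from the graph, move to the CPU, and convert to a numpy array
h = continuous.get_h(x.detach().cpu().numpy(), k=5)
print(h)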

Transfer Entropy on Different Dimensions?

The README mentions that transfer entropy can be calculated using the partial mutual information. If we are looking for the transfer from X->Y, I believe this would be the calculation:

import numpy as np
from entropy_estimators import continuous

np.random.seed(42)
x = np.random.rand(3000)
y = np.random.rand(3000)

# T(X->Y) = PMI(Y_future, X_past, Y_past)
transferXtoY = continuous.get_pmi(y[-3000:], x, y[0:3000], estimator="fp")

print(transferXtoY)

This produces the output value -1.0. The values are random so no transfer is expected.

Typically, I believe transfer entropy is calculated as a measure of entropy from past values of X to future values of Y given past values of Y. The requirement that parameters x, y, and z in get_pmi all have the same dimension seems to break the usefulness of this? For example, calculating the transfer on the most recent 200 values of Y given the history of X and Y does not work:

transferXtoY = continuous.get_pmi(y[-200:], x, y[0:3000], estimator="fp")

output:

Traceback (most recent call last):
  File "test.py", line 8, in <module>
    transferXtoY = continuous.get_pmi(y[-200:], x, y[0:3000], estimator="fp")
  File "/usr/local/anaconda3/envs/plutus/lib/python3.7/site-packages/entropy_estimators/continuous.py", line 54, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/anaconda3/envs/plutus/lib/python3.7/site-packages/entropy_estimators/continuous.py", line 372, in get_pmi
    xz  = np.c_[x,z]
  File "/usr/local/anaconda3/envs/plutus/lib/python3.7/site-packages/numpy/lib/index_tricks.py", line 406, in __getitem__
    res = self.concatenate(tuple(objs), axis=axis)
  File "<__array_function__ internals>", line 6, in concatenate
ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 200 and the array at index 1 has size 3000

Although y and z are the same dimension, x is not.

Is this the usage of transfer entropy you had in mind, or a different use case? Appreciate any ideas on usage.
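
A hedged sketch of one way to give all three arguments the same number of samples, using a one-step lag embedding; the lag depth and the conditioning set are assumptions for illustration, not a prescription from the library:

import numpy as np
from entropy_estimators import continuous

np.random.seed(42)
x = np.random.rand(3000)
y = np.random.rand(3000)

# one-step lag embedding: align Y_t with X_{t-1} and Y_{t-1},
# so that all three arrays have n - 1 samples
y_future = y[1:]
x_past   = x[:-1]
y_past   = y[:-1]

# T(X -> Y) = PMI(Y_future; X_past | Y_past)
transfer_x_to_y = continuous.get_pmi(y_future, x_past, y_past, estimator="fp")
print(transfer_x_to_y)  # expected to be close to zero for independent random sequences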

Unexpected -inf entropy estimations

Thanks for providing this open-source implementation of these entropy estimators.

Here are two datasets in .csv format that are unexpectedly giving me -inf entropy estimations:
Datasets.zip

Plotting the datasets shows that they are pretty reasonable. This is test_dataset_1 (a sine wave with some added Gaussian noise):
[figure: plot of test_dataset_1, a noisy sine wave]

This is test_dataset_2:
[figure: plot of test_dataset_2]

When I estimate their entropies:

#!/usr/bin/env python3

import pandas as pd
from entropy_estimators import continuous

data1 = pd.read_csv("./data/test_set_1.csv").iloc[:, 0].values
print(data1)
print("{} points {} sum in data1".format(len(data1), sum(data1)))

data2 = pd.read_csv("./data/test_set_2.csv").iloc[:, 0].values
print(data2)
print("{} points {} sum in data2".format(len(data2), sum(data2)))

print(continuous.get_h(data1, k=5))
print(continuous.get_h(data2, k=5))

I get this output:

[ 0.13657672 14.01610649 29.53076472 ... 89.05243545 63.88229066
 46.86725742]
5000 points 229158.5019361417 sum in data1
[ 0.00315008  0.00165544  0.00293935 ...  0.00210567  0.00113199
 -0.00142328]
2382 points 0.14870037135740477 sum in data2
python3.7/site-packages/entropy_estimators/continuous.py:224: RuntimeWarning: divide by zero encountered in log
  sum_log_dist = np.sum(log(2*distances)) # where did the 2 come from? radius -> diameter
-inf
-inf

I haven't found any commonality between the two that explains it. data_set_2 has values that are 0.0, but data_set_1 doesn't, although data_set_1 does have values as small as 1e-5. Surely we expect the K-L entropy estimation technique to be robust to datasets such as these?
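
For what it's worth, the -inf arises whenever the distance to a k-th nearest neighbour is exactly zero, i.e. when several samples share the same value, which is common for data recorded at finite precision. A hedged sketch of how to check for and break such ties with a tiny amount of jitter (the noise scale is an arbitrary assumption):

import numpy as np
from entropy_estimators import continuous

# synthetic data with many exact duplicates, e.g. measurements rounded to one decimal place
data = np.round(np.random.randn(5000), 1)
print(f"{len(data)} samples, {len(np.unique(data))} distinct values")

# likely -inf: duplicated values make some k-th nearest neighbour distances zero
print(continuous.get_h(data, k=5))

# break the ties with a negligible amount of noise
jitter = 1e-10 * data.std() * np.random.randn(len(data))
print(continuous.get_h(data + jitter, k=5))  # finite, though heavy duplication still biases the estimate downward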

Error occurred with "get_imin"

Many thanks for the fantastic software, which I have been searching for for some days.

I tried to test the PID calculation and realized that there might be a typo in line 542 of "continuous.py": "for ii in range(N):", where "N" seemed to be undefined.

I am not sure whether this is an error resulting from my somewhat forceful integration of entropy_estimators into my script, or just from my dizzy mind after coding for days.

Many thanks in advance if you can help me to deal with this error.

continuous entropy with KNN

I have a question: how can I calculate the continuous entropy using k-NN over each row?
As far as I understand the code, we apply the continuous entropy estimate over the whole matrix, but what if I need to check the entropy of a single sample of data?

Do you have any idea how I can change the code to behave like this?
And does it make sense at all?

getting negative values for mutual information

Hi Paul,

Great work on the program. I'm using it primarily to obtain entropy values for continuous variables. While using the get_mi function, however, I'm getting negative values. Based on what I've read, I believe this is not a possible value for MI (i.e. the lower bound should be 0). Is my understanding of your program correct?

Regarding Maximal Entropy

Hello Paul,

Awesome work. I am using your estimators to calculate entropy for continuous variables (get_h).

I am trying to normalize the entropy from get_h by the maximal entropy (log(n)) so that the scale is between 0 and 1, but one thing I am not sure of is whether this approach (dividing by log(n)) can also be applied to the entropy of a continuous variable.

Can we truly get the joint distribution P(x,y) to calculate H(x,y)?

For example, in

import numpy as np
from sklearn.feature_selection import mutual_info_regression
from entropy_estimators import continuous

np.random.seed(1)
x = np.random.standard_normal(10000)
y = x + 0.1 * np.random.standard_normal(x.shape)

How can H(X,Y) equal the following?

hxy = continuous.get_h(np.c_[x, y], k=3)

The H(X,Y) is from a 10000*10000 dimension vector, and you can't just get a joint distribution from the marginal distributions, so typically you have to assume it is a multivariate normal distribution, and that's what you do when the marginal distributions are normal. But if the marginal distributions are not normal, what can we assume?
Or how can we create a normal joint distribution from two non-normal marginal distributions?

Mutual information is greater than information entropy

Thank you very much. Could you explain why the mutual information is greater than the information entropy? The code is as follows:

import numpy as np
# np.random.seed(1)
x = np.random.standard_normal(1000)
y = x + 0.1 * np.random.standard_normal(x.shape)

from sklearn.feature_selection import mutual_info_regression
mi = mutual_info_regression(x.reshape(-1, 1), y.reshape(-1, ), discrete_features=False)

from entropy_estimators import continuous
infor_ent_x = continuous.get_h(x, k=3)
infor_ent_y = continuous.get_h(y, k=3)
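
For reference, a quick analytic check (a hedged aside, not part of the library): for y = x + 0.1 * noise with standard normal x and noise, the true mutual information does exceed the differential entropy of x. This is possible for continuous variables, because differential entropy, unlike Shannon entropy in the discrete case, is not an upper bound on mutual information:

import numpy as np

var_x = 1.0
var_noise = 0.1 ** 2

h_x = 0.5 * np.log(2 * np.pi * np.e * var_x)  # differential entropy of x: ~1.419 nats
mi  = 0.5 * np.log(1 + var_x / var_noise)     # I(X; Y) for y = x + noise:  ~2.307 nats

print(h_x, mi)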
