entropy.py

Utility functions for computing the entropy of variables in time series data.

author: Chia-Hung Yang

Submitted as part of the 2019 NetSI Collabathon.

netrd.utilities.entropy.categorized_data(raw, n_bins)[source]

Categorize data.

Each entry of the returned array is the index of the bin into which the corresponding entry of the raw continuous data falls, under linear (equal-width) binning.

Parameters
raw (np.ndarray)

Array of raw continuous data.

n_bins (int)

Number of bins, the same for every variable.

Returns
np.ndarray

Array of bin indices after categorizing the raw data.
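
For illustration, a minimal usage sketch; the random input data and the choice of 10 bins are arbitrary, not part of the API.

    import numpy as np
    from netrd.utilities.entropy import categorized_data

    # Two continuous variables (columns) observed over 1000 time steps (rows).
    rng = np.random.default_rng(42)
    raw = rng.normal(size=(1000, 2))

    # Discretize every variable into 10 equal-width (linear) bins.
    data = categorized_data(raw, n_bins=10)

    print(data.shape)  # same shape as `raw`; entries are bin indices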

netrd.utilities.entropy.conditional_entropy(data, given)[source]

Conditional entropy of variables in the data conditioned on a given set of variables.

Parameters
data (np.ndarray)

Array of data with the variables of interest as columns and observations as rows.

given (np.ndarray)

Array of data with the conditioned variables as columns and observations as rows.

Returns
float

Conditional entropy of the variables \(\{X_i\}\) of interest conditioned on the variables \(\{Y_j\}\).

Notes

  1. \(H(\{X_i\}|\{Y_j\}) = - \sum p(\{X_i\}\cup\{Y_j\}) \log_2(p(\{X_i\}|\{Y_j\}))\)

  2. The data of the variables must be categorical.
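
A sketch of typical usage, together with a check of the chain rule \(H(\{X_i\}|\{Y_j\}) = H(\{X_i\}\cup\{Y_j\}) - H(\{Y_j\})\) implied by the formula above; the synthetic data and bin count are arbitrary choices.

    import numpy as np
    from netrd.utilities.entropy import (
        categorized_data,
        conditional_entropy,
        joint_entropy,
    )

    rng = np.random.default_rng(0)
    raw = rng.normal(size=(500, 3))
    cat = categorized_data(raw, n_bins=8)

    # Condition the first two variables on the third.
    data, given = cat[:, :2], cat[:, 2:]
    h_cond = conditional_entropy(data, given)

    # Chain rule implied by the formula in the notes:
    # H(X|Y) = H(X, Y) - H(Y); the two values should agree.
    h_joint = joint_entropy(np.hstack((data, given)))
    h_given = joint_entropy(given)
    print(np.isclose(h_cond, h_joint - h_given))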

netrd.utilities.entropy.entropy_from_seq(var)[source]

Return the Shannon entropy of a variable. This differs from SciPy's entropy function by taking a sequence of observations as input rather than a histogram or probability distribution.

Parameters
var (np.ndarray)

1D array of observations of the variable.

Returns
float

Shannon entropy of the variable.

Notes

  1. \(H(X) = - \sum p(X) \log_2(p(X))\)

  2. Data of the variable must be categorical.
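
A small sketch comparing the function with a direct computation of the plug-in estimate from the formula above; the example sequence is arbitrary.

    import numpy as np
    from netrd.utilities.entropy import entropy_from_seq

    # A categorical sequence of observations.
    var = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 2])
    h = entropy_from_seq(var)

    # Direct computation of H(X) = -sum p(X) log2 p(X) from observed frequencies.
    _, counts = np.unique(var, return_counts=True)
    p = counts / counts.sum()
    print(h, -np.sum(p * np.log2(p)))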

netrd.utilities.entropy.joint_entropy(data)[source]

Joint entropy of all variables in the data.

Parameters
data (np.ndarray)

Array of data with variables as columns and observations as rows.

Returns
float

Joint entropy of the variables of interest.

Notes

  1. \(H(\{X_i\}) = - \sum p(\{X_i\}) \log_2(p(\{X_i\}))\)

  2. The data of variables must be categorical.
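
An illustrative sketch with two perfectly dependent binary variables, for which the joint entropy should equal the entropy of either variable alone (one bit here); the reference computation follows the formula in the notes.

    import numpy as np
    from netrd.utilities.entropy import joint_entropy

    # Two perfectly dependent categorical variables: the second is determined
    # by the first, so only two joint outcomes occur, each half the time.
    x = np.array([0, 1, 0, 1, 0, 1, 0, 1])
    data = np.column_stack((x, 1 - x))
    print(joint_entropy(data))  # expected: 1 bit

    # Reference computation from the frequencies of the unique rows.
    _, counts = np.unique(data, axis=0, return_counts=True)
    p = counts / counts.sum()
    print(-np.sum(p * np.log2(p)))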

netrd.utilities.entropy.js_divergence(P, Q)[source]

Jensen-Shannon divergence between P and Q.

Parameters
P, Q (np.ndarray)

Two discrete distributions represented as 1D arrays. They are assumed to have the same support.

Returns
float

The Jensen-Shannon divergence between P and Q.
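
A usage sketch with a reference computation via \(JSD(P, Q) = H(M) - \frac{1}{2}(H(P) + H(Q))\), where \(M = \frac{1}{2}(P + Q)\). The reference uses log base 2, as in the other formulas in this module; the implementation's choice of base is an assumption here.

    import numpy as np
    from netrd.utilities.entropy import js_divergence

    # Two discrete distributions on the same support of four outcomes.
    P = np.array([0.25, 0.25, 0.25, 0.25])
    Q = np.array([0.50, 0.25, 0.15, 0.10])

    print(js_divergence(P, P))  # identical distributions: divergence ~0

    # Reference value under the definition above, using log base 2.
    def shannon(p):
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    M = 0.5 * (P + Q)
    print(js_divergence(P, Q), shannon(M) - 0.5 * (shannon(P) + shannon(Q)))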

netrd.utilities.entropy.linear_bins(raw, n_bins)[source]

Separators of linear bins for each variable in the raw data.

Parameters
raw (np.ndarray)

Array of raw continuous data.

n_bins (int)

Number of bins, the same for every variable.

Returns
np.ndarray

Array where each column holds the bin separators for one variable.

Notes

The bins are \(B_0 = [b_0, b_1]\) and \(B_i = (b_i, b_{i+1}]\) for \(i \geq 1\), where the \(b_i\) are the bin separators.
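
A usage sketch; given the bin definition above, each variable's column is expected to hold the \(n\_bins + 1\) separators \(b_0, \dots, b_{n\_bins}\) spanning that variable's observed range (an inference from the notes, not an explicit guarantee of the docstring).

    import numpy as np
    from netrd.utilities.entropy import linear_bins

    rng = np.random.default_rng(1)
    raw = rng.uniform(low=0.0, high=10.0, size=(200, 3))

    seps = linear_bins(raw, n_bins=5)

    # One column of separators per variable.
    print(seps.shape)
    print(seps[:, 0])  # separators for the first variable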