# entropy.py

Utility functions computing entropy of variables in time series data.

author: Chia-Hung Yang

Submitted as part of the 2019 NetSI Collabathon.

netrd.utilities.entropy.categorized_data(raw, n_bins)[source]

Categorize data.

Each entry of the returned array is the index of the linear bin into which the corresponding entry of the raw continuous data falls.

Parameters:
raw (np.ndarray)

Array of raw continuous data.

n_bins (int)

A universal number of bins for all the variables.

Returns:
np.ndarray

Array of bin indices after categorizing the raw data.
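
For illustration, the binning described above can be sketched in pure Python (a hypothetical `categorized_data_sketch`, not netrd's implementation; the half-open bin-edge convention stated in the Notes of `linear_bins` is simplified here to plain truncation):

```python
def categorized_data_sketch(raw, n_bins):
    """Map each entry of `raw` (rows = observations, columns = variables)
    to the index of its equal-width bin, computed per column."""
    cols = list(zip(*raw))
    binned = []
    for col in cols:
        lo, hi = min(col), max(col)
        width = (hi - lo) / n_bins or 1.0  # guard against a constant column
        # clip so the column maximum lands in the last bin, index n_bins - 1
        binned.append([min(int((x - lo) / width), n_bins - 1) for x in col])
    return [tuple(row) for row in zip(*binned)]  # same shape as raw

print(categorized_data_sketch([(0.0,), (0.4,), (1.0,)], 2))  # [(0,), (0,), (1,)]
```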

netrd.utilities.entropy.conditional_entropy(data, given)[source]

Conditional entropy of variables in the data conditioned on a given set of variables.

Parameters:
data (np.ndarray)

Array of data with the variables of interest as columns and observations as rows.

given (np.ndarray)

Array of data with the conditioned variables as columns and observations as rows.

Returns:
float

Conditional entropy of the variables $$\{X_i\}$$ of interest conditioned on variables $$\{Y_j\}$$.

Notes

1. $$H(\{X_i\}|\{Y_j\}) = - \sum p(\{X_i\}\cup\{Y_j\}) \log_2(p(\{X_i\}|\{Y_j\}))$$

2. The data of variables must be categorical.
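
The formula in note 1 is equivalent, by the chain rule, to $$H(\{X_i\}|\{Y_j\}) = H(\{X_i\}\cup\{Y_j\}) - H(\{Y_j\})$$. A minimal pure-Python sketch of that identity (`conditional_entropy_sketch` and its helper are hypothetical names, not netrd's code):

```python
from collections import Counter
from math import log2

def _H(rows):
    """Shannon entropy (bits) of a sequence of hashable outcomes."""
    n = len(rows)
    return -sum((c / n) * log2(c / n) for c in Counter(rows).values())

def conditional_entropy_sketch(data, given):
    """H({X_i} | {Y_j}) = H(X, Y) - H(Y); rows are observations."""
    joint = [tuple(x) + tuple(y) for x, y in zip(data, given)]
    cond = [tuple(y) for y in given]
    return _H(joint) - _H(cond)

# X is an exact copy of Y, so knowing Y leaves no residual uncertainty:
d = [(0,), (1,), (0,), (1,)]
print(conditional_entropy_sketch(d, d))  # 0.0
```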

netrd.utilities.entropy.entropy_from_seq(var)[source]

Return the Shannon entropy of a variable. This differs from SciPy's `entropy` in that it takes a sequence of observations as input rather than a histogram or probability distribution.

Parameters:
var (np.ndarray)

1D array of observations of the variable.

Returns:
float

Shannon entropy of the variable.

Notes

1. $$H(X) = - \sum p(X) \log_2(p(X))$$

2. Data of the variable must be categorical.
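
The definition in note 1 can be computed directly from observation counts; a pure-Python sketch (illustrative only, not SciPy's or netrd's implementation):

```python
from collections import Counter
from math import log2

def entropy_from_seq_sketch(var):
    """Shannon entropy (bits) of a sequence of categorical observations."""
    n = len(var)
    # empirical probability of each category, plugged into -sum(p * log2(p))
    return -sum((c / n) * log2(c / n) for c in Counter(var).values())

# A fair coin has two equally likely outcomes, hence 1 bit of entropy:
print(entropy_from_seq_sketch([0, 1, 0, 1]))  # 1.0
```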

netrd.utilities.entropy.joint_entropy(data)[source]

Joint entropy of all variables in the data.

Parameters:
data (np.ndarray)

Array of data with variables as columns and observations as rows.

Returns:
float

Joint entropy of the variables of interest.

Notes

1. $$H(\{X_i\}) = - \sum p(\{X_i\}) \log_2(p(\{X_i\}))$$

2. The data of variables must be categorical.
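
The joint entropy treats each row, i.e. one observation of all variables, as a single categorical outcome; a hedged pure-Python sketch (`joint_entropy_sketch` is a hypothetical name, not netrd's code):

```python
from collections import Counter
from math import log2

def joint_entropy_sketch(data):
    """Joint Shannon entropy (bits); rows are observations, columns variables."""
    n = len(data)
    # count each full row as one joint outcome of (X_1, ..., X_k)
    counts = Counter(tuple(row) for row in data)
    return -sum((c / n) * log2(c / n) for c in counts.values())

# Four equally likely joint states of two binary variables -> 2 bits:
print(joint_entropy_sketch([(0, 0), (0, 1), (1, 0), (1, 1)]))  # 2.0
```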

netrd.utilities.entropy.js_divergence(P, Q)[source]

Jensen-Shannon divergence between P and Q.

Parameters:
P, Q (np.ndarray)

Two discrete distributions represented as 1D arrays. They are assumed to have the same support.

Returns:
float

The Jensen-Shannon divergence between P and Q.
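
With base-2 logarithms the Jensen-Shannon divergence can be written as $$JSD(P\|Q) = H(M) - \frac{1}{2}(H(P) + H(Q))$$ for the mixture $$M = \frac{1}{2}(P + Q)$$. A minimal sketch under that (equal-weight, base-2) convention; netrd's normalization may differ:

```python
from math import log2

def _H(dist):
    """Shannon entropy (bits) of a probability vector; 0 * log(0) taken as 0."""
    return -sum(p * log2(p) for p in dist if p > 0)

def js_divergence_sketch(P, Q):
    """Jensen-Shannon divergence of two distributions on the same support."""
    M = [(p + q) / 2 for p, q in zip(P, Q)]
    return _H(M) - (_H(P) + _H(Q)) / 2

print(js_divergence_sketch([1.0, 0.0], [0.0, 1.0]))  # 1.0 (disjoint supports)
print(js_divergence_sketch([0.5, 0.5], [0.5, 0.5]))  # 0.0 (identical)
```

In base 2 the divergence is bounded in [0, 1], which is why the disjoint-support case above attains exactly 1.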

netrd.utilities.entropy.linear_bins(raw, n_bins)[source]

Separators of linear bins for each variable in the raw data.

Parameters:
raw (np.ndarray)

Array of raw continuous data.

n_bins (int)

A universal number of bins for all the variables.

Returns:
np.ndarray

Array whose columns are the bin separators for each variable.

Notes

The bins are $$B_0 = [b_0, b_1]$$ and $$B_i = (b_i, b_{i+1}]$$ for $$i \geq 1$$, where the $$b_i$$ are the separators of the bins.
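
The separators $$b_0, \dots, b_{n}$$ are simply $$n_{bins} + 1$$ equally spaced levels between each column's minimum and maximum; a sketch of that layout (`linear_bins_sketch` is a hypothetical name, not the library function):

```python
def linear_bins_sketch(raw, n_bins):
    """Per-variable equal-width bin separators. Rows of the result are
    separator levels b_0 .. b_{n_bins}; columns correspond to variables."""
    cols = list(zip(*raw))
    per_col = []
    for col in cols:
        lo, hi = min(col), max(col)
        width = (hi - lo) / n_bins
        per_col.append([lo + i * width for i in range(n_bins + 1)])
    return [tuple(level) for level in zip(*per_col)]

print(linear_bins_sketch([(0.0,), (10.0,)], 2))  # [(0.0,), (5.0,), (10.0,)]
```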