entropy.py¶
Utility functions for computing the entropy of variables in time series data.
author: Chia-Hung Yang
Submitted as part of the 2019 NetSI Collabathon.
-
netrd.utilities.entropy.categorized_data(raw, n_bins)[source]¶ Categorize data.
Each entry in the returned array is the index of the linear bin into which the corresponding raw continuous value falls.
- Parameters
- raw (np.ndarray)
Array of raw continuous data.
- n_bins (int)
A universal number of bins for all the variables.
- Returns
- np.ndarray
Array of bin indices after categorizing the raw data.
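The categorization step can be sketched with NumPy as follows. This is an illustrative reimplementation under the documented signature, not netrd's actual code; the function name `categorized_data_sketch` is hypothetical.

```python
import numpy as np

def categorized_data_sketch(raw, n_bins):
    """Map each entry of `raw` to the index of its linear bin, per column."""
    raw = np.atleast_2d(raw)
    indices = np.empty_like(raw, dtype=int)
    for j in range(raw.shape[1]):
        col = raw[:, j]
        # n_bins + 1 equally spaced separators spanning the column's range
        seps = np.linspace(col.min(), col.max(), n_bins + 1)
        # Right-inclusive bins (b_i, b_{i+1}]; clip so the minimum lands in bin 0
        indices[:, j] = np.clip(np.digitize(col, seps, right=True) - 1,
                                0, n_bins - 1)
    return indices
```

For example, binning a single column spanning [0, 1] into two bins assigns index 0 to values at or below 0.5 and index 1 above it.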
-
netrd.utilities.entropy.conditional_entropy(data, given)[source]¶ Conditional entropy of variables in the data conditioned on a given set of variables.
- Parameters
- data (np.ndarray)
Array of data with the variables of interest as columns and observations as rows.
- given (np.ndarray)
Array of data with the conditioned variables as columns and observations as rows.
- Returns
- float
Conditional entropy of the variables \(\{X_i\}\) of interest conditioned on the variables \(\{Y_j\}\).
Notes
\(H(\{X_i\}|\{Y_j\}) = - \sum p(\{X_i\}\cup\{Y_j\}) \log_2(p(\{X_i\}|\{Y_j\}))\)
The data of the variables must be categorical.
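The quantity in the note can be computed with the chain rule \(H(\{X_i\}|\{Y_j\}) = H(\{X_i\}\cup\{Y_j\}) - H(\{Y_j\})\). A minimal NumPy sketch, assuming categorical columns (the name `conditional_entropy_sketch` is illustrative, not netrd's implementation):

```python
import numpy as np

def conditional_entropy_sketch(data, given):
    """H({X_i} | {Y_j}) via the chain rule H(X, Y) - H(Y), in bits."""
    def h(arr):
        # Joint Shannon entropy of the rows of `arr`
        _, counts = np.unique(arr, axis=0, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))
    return h(np.hstack([data, given])) - h(given)
```

When `data` is fully determined by `given`, the result is 0; when they are independent, it equals the entropy of `data` alone.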
-
netrd.utilities.entropy.entropy_from_seq(var)[source]¶ Return the Shannon entropy of a variable. This differs from SciPy's entropy by taking a sequence of observations as input rather than a histogram or probability distribution.
- Parameters
- var (ndarray)
1D array of observations of the variable.
Notes
\(H(X) = - \sum p(X) \log_2(p(X))\)
Data of the variable must be categorical.
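The formula in the note amounts to counting observed categories and summing \(-p \log_2 p\). A minimal sketch (the name `entropy_from_seq_sketch` is illustrative, not netrd's implementation):

```python
import numpy as np

def entropy_from_seq_sketch(var):
    """Shannon entropy (bits) from a 1D sequence of categorical observations."""
    # Empirical distribution from observation counts
    _, counts = np.unique(var, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))
```

A fair binary sequence yields 1 bit, and a constant sequence yields 0.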
-
netrd.utilities.entropy.joint_entropy(data)[source]¶ Joint entropy of all variables in the data.
- Parameters
- data (np.ndarray)
Array of data with variables as columns and observations as rows.
- Returns
- float
Joint entropy of the variables of interest.
Notes
\(H(\{X_i\}) = - \sum p(\{X_i\}) \log_2(p(\{X_i\}))\)
The data of variables must be categorical.
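Treating each row as one joint outcome, the joint entropy reduces to the Shannon entropy of the distinct-row distribution. A minimal sketch (the name `joint_entropy_sketch` is illustrative, not netrd's implementation):

```python
import numpy as np

def joint_entropy_sketch(data):
    """Joint Shannon entropy (bits) over the categorical columns of `data`."""
    # Count occurrences of each distinct row, i.e. each joint outcome
    _, counts = np.unique(data, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))
```

Two independent fair binary variables, with all four combinations equally frequent, give 2 bits.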
-
netrd.utilities.entropy.js_divergence(P, Q)[source]¶ Jensen-Shannon divergence between P and Q.
- Parameters
- P, Q (np.ndarray)
Two discrete distributions represented as 1D arrays. They are assumed to have the same support.
- Returns
- float
The Jensen-Shannon divergence between P and Q.
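The Jensen-Shannon divergence is the average Kullback-Leibler divergence of P and Q from their mixture \(M = (P + Q)/2\). A minimal base-2 sketch (the name `js_divergence_sketch` is illustrative, not netrd's implementation):

```python
import numpy as np

def js_divergence_sketch(P, Q):
    """Jensen-Shannon divergence (base 2) between discrete distributions."""
    M = 0.5 * (P + Q)
    def kl(a, b):
        # Kullback-Leibler divergence, skipping zero-probability terms
        mask = a > 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))
    return 0.5 * kl(P, M) + 0.5 * kl(Q, M)
```

With base-2 logarithms the divergence is 0 for identical distributions and reaches its maximum of 1 for distributions with disjoint support.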
-
netrd.utilities.entropy.linear_bins(raw, n_bins)[source]¶ Separators of linear bins for each variable in the raw data.
- Parameters
- raw (np.ndarray)
Array of raw continuous data.
- n_bins (int)
A universal number of bins for all the variables.
- Returns
- np.ndarray
Array where each column contains the bin separators for one variable.
Notes
The bins are \(B_0 = [b_0, b_1]\) and \(B_i = (b_i, b_{i+1}]\) for \(i \ge 1\), where the \(b_i\) are the bin separators.
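With \(n\) bins per variable there are \(n + 1\) separators, equally spaced between each column's minimum and maximum. A minimal sketch (the name `linear_bins_sketch` is illustrative, not netrd's implementation):

```python
import numpy as np

def linear_bins_sketch(raw, n_bins):
    """Column-wise separators of `n_bins` equally spaced bins."""
    raw = np.atleast_2d(raw)
    # np.linspace with array endpoints yields one column of
    # n_bins + 1 separators per variable
    return np.linspace(raw.min(axis=0), raw.max(axis=0), n_bins + 1)
```

For a single column spanning [0, 1] with two bins, the separators are 0, 0.5, and 1.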