posydon.active_learning.psy_cris
posydon.active_learning.psy_cris.classify
The PSY-CRIS classification module.
- class posydon.active_learning.psy_cris.classify.Classifier(TableData_object)[source]
Bases:
object
Classifier class.
Perform one-against-all classification with a variety of classification algorithms (interpolators). The different classification algorithms are trained and stored for recall as instance variables inside nested dictionaries. This class also supports model validation through cross validation using the holdout method.
Initialize the classifier.
- Parameters:
TableData_object (instance of <class, TableData>) – An instance of the TableData class with training data.
- cross_validate(classifier_names, alpha, verbose=False)[source]
Cross validate classifiers on data from TableData object.
For each iteration, the classifiers specified are all trained and tested on the same random subset of data.
- Parameters:
- Returns:
percent_correct (ndarray) – Percent correct classification on (1-alpha)% of the data set. Element order matches the order of classifier_names.
time_to_train (ndarray) – Time to train classifiers on a data set. Element order matches the order of classifier_names.
- fit_gaussian_process_classifier(data_interval=None, my_kernel=None, n_restarts=5, verbose=False)[source]
Fit a Gaussian Process classifier.
Implementation from: sklearn.gaussian_process (https://scikit-learn.org/stable/modules/gaussian_process.html)
- Parameters:
data_interval (array_int, optional) – Array indices of data used to train (training on a subset). If None (default), train on the whole data set.
my_kernel (kernel) – Set the kernel for the GPC.
n_restarts (int) – Number of restarts for the GPC.
verbose (bool, optional) – Print statements with more information while training.
- Returns:
binary_classifier_holder – Sorted by class, each key maps to a trained GaussianProcessClassifier object.
- Return type:
array_like
- fit_linear_ND_interpolator(data_interval=None, verbose=False)[source]
Fit linear ND interpolator - binary one-against-all classification.
Implementation from: scipy.interpolate.LinearNDInterpolator (https://docs.scipy.org/doc/scipy/reference/interpolate.html)
- Parameters:
data_interval (array_int, optional) – Array indices of data used to train (training on a subset). If None (default), train on the whole data set.
verbose (bool, optional) – Print statements with more information while training.
- Returns:
binary_classifier_holder – Sorted by class, each key maps to a trained linearNDinterpolator object.
- Return type:
- fit_rbf_interpolator(data_interval=None, verbose=False)[source]
Fit RBF interpolator - binary classification (one against all).
Implementation from: scipy.interpolate.Rbf (https://docs.scipy.org/doc/scipy/reference/interpolate.html)
- Parameters:
data_interval (array_int, optional) – Array indices of data used to train (training on a subset). If None (default), train on the whole data set.
verbose (bool, optional) – Print statements with more information while training.
- Returns:
binary_classifier_holder – Sorted by class, each key maps to a trained RBF object.
- Return type:
- get_class_predictions(classifier_name, test_input, return_ids=True)[source]
Return the class predictions.
The predictions are in the form of class IDs or the original classification key. This method also returns the probability of the class that was predicted.
- Parameters:
- Returns:
pred_class_ids (array) – Predicted class IDs given test input.
max_probs (array) – Probability the classifier gives for the chosen class.
where_not_nan (array) – Indices where there are no NaNs (from LinearNDInterpolator). You may use this to pick out which input data gives a valid classification.
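As a rough illustration of the three return values, the sketch below picks the most probable class for each test point and records which rows are free of NaNs. It is pure Python and does not call PSY-CRIS; the function name `pick_classes` and the sample probabilities are hypothetical.

```python
import math

def pick_classes(probs):
    """Return (pred_class_ids, max_probs, where_not_nan) for rows of class probabilities."""
    pred_class_ids, max_probs, where_not_nan = [], [], []
    for i, row in enumerate(probs):
        if any(math.isnan(p) for p in row):
            # Skip rows with NaNs, e.g. points outside the convex hull of a
            # linear interpolator's training data.
            continue
        best = max(range(len(row)), key=lambda k: row[k])
        pred_class_ids.append(best)
        max_probs.append(row[best])
        where_not_nan.append(i)
    return pred_class_ids, max_probs, where_not_nan

# Hypothetical probabilities for three test points and three classes.
probs = [[0.7, 0.2, 0.1], [float("nan"), 0.5, 0.5], [0.1, 0.3, 0.6]]
ids, pmax, valid = pick_classes(probs)
```

In this sketch the second point is dropped, so `valid` can be used to select the test inputs that received a classification.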
- get_cross_val_data(alpha)[source]
Randomly sample the data set and separate training and test data.
- Parameters:
alpha (float) – Fraction of data set to use for training. (0.05 = 5% of data set)
- Returns:
sorted_rnd_int_vals (array) – Array indices for data used to train interpolators.
cv_test_input_data (array) – Input test data to perform cross validation.
cv_test_output_data (array) – Output test data to perform cross validation.
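The holdout split described here can be sketched in pure Python as follows. This is a minimal illustration, not the PSY-CRIS implementation: the helper name `holdout_split`, the `seed` argument, and the rounding of `alpha * N` down to an integer are all assumptions.

```python
import random

def holdout_split(inputs, outputs, alpha, seed=None):
    """Split paired data into sorted training indices and held-out test data."""
    rng = random.Random(seed)
    n_train = int(alpha * len(inputs))  # assumption: round down
    train_idx = sorted(rng.sample(range(len(inputs)), n_train))
    chosen = set(train_idx)
    # Everything that was not sampled for training becomes test data.
    test_in = [x for i, x in enumerate(inputs) if i not in chosen]
    test_out = [y for i, y in enumerate(outputs) if i not in chosen]
    return train_idx, test_in, test_out

# Hypothetical data set: 20 two-dimensional inputs with two classes.
inputs = [[float(i), float(i) ** 2] for i in range(20)]
outputs = ["A" if i < 10 else "B" for i in range(20)]
train_idx, test_in, test_out = holdout_split(inputs, outputs, alpha=0.25, seed=0)
```

With `alpha=0.25`, five of the twenty points are used for training and the remaining fifteen are held out for testing.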
- get_rnd_test_inputs(N, other_rng=dict(), verbose=False)[source]
Produce randomly sampled ‘test’ inputs inside domain of input_data.
- Parameters:
N (int) – Number of test inputs to return.
other_rng (dict, optional) – Change the range of random sampling in desired axis. By default, the sampling is done in the range of the training data. The axis is specified with an integer key in [0,N-1] mapping to a list specifying the range. (e.g. {1:[my_min, my_max]})
verbose (bool, optional) – Print diagnostic information.
- Returns:
rnd_test_points – Test points randomly sampled in the range of the training data in each axis unless otherwise specified in ‘other_rng’. Has the same shape as input data from TableData.
- Return type:
ndarray
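A minimal sketch of this kind of sampling, assuming the points are drawn uniformly in each axis (the documentation only says "randomly sampled"). The function `rnd_test_inputs` and its `seed` argument are hypothetical stand-ins and do not use the Classifier class.

```python
import random

def rnd_test_inputs(input_data, n, other_rng=None, seed=None):
    """Draw n points uniformly within the per-axis range of input_data."""
    rng = random.Random(seed)
    other_rng = other_rng or {}
    n_axes = len(input_data[0])
    ranges = []
    for axis in range(n_axes):
        column = [row[axis] for row in input_data]
        # Use the training-data range unless the caller overrides this axis.
        lo, hi = other_rng.get(axis, (min(column), max(column)))
        ranges.append((lo, hi))
    return [[rng.uniform(lo, hi) for lo, hi in ranges] for _ in range(n)]

# Hypothetical training inputs; axis 1 is resampled in [0, 1] instead.
training_inputs = [[1.5, 2.6], [1.5, 3.0], [2.0, 2.6]]
points = rnd_test_inputs(training_inputs, 100, other_rng={1: [0.0, 1.0]}, seed=1)
```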
- make_cv_plot_data(interp_type, alphas, N_iterations, folder_path='cv_data/')[source]
Script for running many instances of the method cross_validate().
Cross validation score and timing data produced are saved locally.
! Time to train GaussianProcessClassifier becomes large for num training points > 1000. !
Files are saved every 5 iterations to prevent loss of data for large N_iterations. A known exception occurs in the GP classifier for low alpha due to a data set with only one class.
- Parameters:
interp_type (array_str) – Names of classifiers to train.
alphas (array_floats) – Fractions of data set to use for training. (0.05 = 5% of data set) (ex. [0.01, 0.02, …])
N_iterations (int) – Number of iterations to run cross validation at a given alpha.
folder_path (str) – Folder path where to save cross validation and timing data (“your_folder_path/”).
- Return type:
None
- make_max_cls_plot(classifier_name, axes_keys, other_rng=dict(), N=4000, **kwargs)[source]
Make the maximum classification probability plot.
Not generalized yet to slice along redundant axes.
- return_probs(classifier_name, test_input, verbose=False)[source]
Return probability that a given input corresponds to a class.
The probability is calculated using trained classifiers.
- Parameters:
- Returns:
normalized_probs (ndarray) – Array holding the normalized probability for a point to be in any of the possible classes. Shape is N_points x N_classes.
where_not_nan (ndarray) – Indices of the test inputs that did not result in NaNs.
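One plausible way to turn per-class one-against-all outputs into normalized probabilities is to divide each row by its sum. The sketch below assumes exactly that; the actual normalization used by PSY-CRIS is not specified in this documentation, and `normalize_probs` is a hypothetical name.

```python
import math

def normalize_probs(raw_scores):
    """Normalize per-class scores row by row; report rows without NaNs."""
    normalized, where_not_nan = [], []
    for i, row in enumerate(raw_scores):
        total = sum(row)  # NaN if any entry in the row is NaN
        if math.isnan(total) or total == 0:
            normalized.append([float("nan")] * len(row))
            continue
        normalized.append([value / total for value in row])
        where_not_nan.append(i)
    return normalized, where_not_nan

# Hypothetical raw outputs: N_points x N_classes.
raw = [[0.9, 0.3, 0.0], [float("nan"), 0.2, 0.1], [0.25, 0.25, 0.5]]
probs, valid = normalize_probs(raw)
```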
- train(classifier_name, di=None, verbose=False, **kwargs)[source]
Train a classifier.
- Implemented classifiers:
LinearNDInterpolator (‘linear’, …)
Radial Basis Function (‘rbf’, …)
GaussianProcessClassifier (‘gp’, …)
>>> cl = Classifier(TableData_object)
>>> cl.train('linear', di=np.arange(0, Ndatapoints, 5), verbose=True)
- Parameters:
classifier_name (str) – Name of classifier to train.
di (array_int, optional) – Array indices of data used to train (training on a subset). If None, train on the whole data set.
train_cross_val (bool, optional) – For storing regular trained interpolators and cross validation interpolators. Used in the cross_validate() method. If False, save normal interpolators; if True, save cross validation interpolators.
verbose (bool, optional) – Print statements with more information while training.
- Return type:
None
posydon.active_learning.psy_cris.data
Module for handling data for PSY-CRIS.
- class posydon.active_learning.psy_cris.data.TableData(table_paths, input_cols, output_cols, class_col_name, my_DataFrame=None, omit_vals=None, omit_cols=None, subset_interval=None, verbose=False, my_colors=None, neighbor=None, n_neighbors=None, undefined_p_change_val=None, read_csv_kwargs={}, **kwargs)[source]
Bases:
object
For managing data sets used for classification and regression.
Reads tables of simulation data where a single row represents one simulation. Each column in a row represents different inputs (initial conditions) and outputs (result, continuous variables). If using multiple files, each file is assumed to have the same columns. You may also directly load a pandas DataFrame instead of reading in files.
Example data structure expected in files or pandas DataFrame:
0  input_1  input_2  outcome  output_1  output_2  output_3  …
1  1.5      2.6      “A”      100       0.19      -         …
2  1.5      3.0      “B”      -         -         -         …
3  2.0      2.6      “C”      -         -         6         …
…
The above table has dashes ‘-’ in output columns to indicate NaN values. You may have a similar structure if different classes have fundamentally different outputs.
Initialize the TableData instance.
- Parameters:
table_paths (list) – List of file paths to read in as data. If None, a pandas DataFrame is used instead.
input_cols (list) – List of names of the columns which will be considered ‘input’.
output_cols (list) – List of names of the columns which will be considered ‘output’. This should include the class column name.
class_col_name (str) – Name of column which contains classification data.
my_DataFrame (pandas DataFrame, optional) – If given, use this instead of reading files.
omit_vals (list, optional) – Numerical values that you wish to omit from the entire data set. If a row contains the value, the entire row is removed. (For example you may want to omit all rows if they contain “-1” or “failed”.)
omit_cols (list, optional) – Column names that you wish to omit from the data set.
subset_interval (array, optional) – Use some subset of the data files being loaded in. An array with integers indicating the rows that will be kept.
my_colors (list, optional) – Colors to use for classification plots.
n_neighbors (list, optional) – List of integers that set the number of neighbors to use to calculate average distances. (default None)
neighbor (instance of sklearn.neighbors.NearestNeighbors, optional) – To use for average distances. See function ‘calc_avg_dist()’.
undefined_p_change_val (optional, float) – Sets the undefined value used when calculating percent change fails due to zero values in the output data. Default uses nan.
verbose (bool, optional) – Print statements with extra info.
read_csv_kwargs (dict, optional) – Kwargs passed to the pandas function ‘read_csv()’.
**kwargs – Extra kwargs
- find_n_neighbors(input_data, n_neighbors, neighbor=None, return_avg_dists=False, **kwargs)[source]
Find the N nearest neighbors of a given set of arbitrary points.
Given a set of arbitrary input points, find the N nearest neighbors to each point, not including themselves. Can also return average distance from a point to its N nearest neighbors.
- Parameters:
input_data (ndarray, pandas DataFrame) – Data points where their nearest neighbors will be found
n_neighbors (list) – List of integers with number of neighbors to calculate
neighbor (instance of NearestNeighbors class) – For passing your own object.
return_avg_dists (bool, optional) – If True, return a dictionary with average distances between the N nearest neighbors listed in ‘n_neighbors’.
**kwargs –
- class_key (str)
If a DataFrame is given, specifies class column name.
- Returns:
where_nearest_neighbors (dict) – Dictionary containing the n nearest neighbors for every point in the input data.
avg_distances (dict) – Returned if return_avg_dists is True. The average distances between the nearest neighbors.
- get_binary_mapping_per_class()[source]
Get binary mapping (0 or 1) of the class data.
Get binary mapping (0 or 1) of the class data for each unique classification. For each classification, a value of 1 is given if the class data matches that classification. If they do not match, then a value of 0 is given.
Example:
classifications -> A, B, C
class data -> [ A, B, B, A, B, C ]
binary mapping -> [[1, 0, 0, 1, 0, 0] (for class A)
                   [0, 1, 1, 0, 1, 0] (for class B)
                   [0, 0, 0, 0, 0, 1]] (for class C)
- Returns:
binary_class_data – N by M array where N is the number of classes and M is the number of classifications in the data set. Order is determined by ‘_unique_class_keys_’.
- Return type:
ndarray
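The binary mapping in the example above can be reproduced with a few lines of pure Python. This is only a sketch of the idea: the function name is reused for readability but is a standalone helper here, and sorting the unique classes alphabetically is an assumption (the documented order is set by ‘_unique_class_keys_’).

```python
def binary_mapping_per_class(class_data):
    """Return the unique classes and one 0/1 row per class."""
    unique_keys = sorted(set(class_data))  # assumption: alphabetical order
    mapping = [[1 if label == key else 0 for label in class_data]
               for key in unique_keys]
    return unique_keys, mapping

keys, mapping = binary_mapping_per_class(["A", "B", "B", "A", "B", "C"])
# mapping[0] is the row for class A, mapping[1] for B, mapping[2] for C.
```

Each row is the training target for one binary (one-against-all) classifier.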
- get_class_data(what_data='full')[source]
Get data related to classification.
- Parameters:
what_data (str, list, optional) –
- ‘class_col’ (array)
Original classification data.
- ‘unique_class_keys’ (array)
Unique classes found in the classification data.
- ‘class_col_to_ids’ (array)
Original classification data replaced with their respective class IDs (integers).
- ‘class_id_mapping’ (dict)
Mapping between a classification from the original data set and its class ID.
- ‘binary_data’ (ndarray)
Iterating over the unique classes, classification data is turned into 1 or 0 if it matches the given class. See method ‘get_binary_mapping_per_class()’.
- ‘full’ (tuple)
All options listed above. (Default)
- Returns:
class_data – An object containing the specified classification data. Will return tuple of len(what_data) if a list is passed. Default: 5
- Return type:
- get_data(what_data='full', return_df=False)[source]
Get all data contained in TableData object.
The data is returned after omission of columns and rows containing specified values (if given) and taking a subset (if given) of the original data set read in as a csv or given directly as a pandas DataFrame. (Before data processing for classification and regression.)
- Parameters:
what_data (str, list, optional) – Default is ‘full’ with other options ‘input’, or ‘output’.
‘full’ - original data table (after omission and subsets)
‘input’ - only data identified as inputs from ‘full’ data set
‘output’ - only data identified as outputs from ‘full’ data set
return_df (bool, optional) – If True, return a pandas DataFrame object. If False (default), return a numpy array.
- Returns:
data – Data before classification and regression data sorting is done. Will return tuple of len(what_data) if a list is passed.
- Return type:
tuple, ndarray or DataFrame
- get_info()[source]
Return what info is printed in the ‘info()’ method.
- Returns:
files (list) – File paths where data was loaded from.
df_index_keys (list) – Index keys added to the DataFrame object once multiple files are joined together such that one can access data by file after they were joined.
for_info (list) – Running list of print statements that include but are not limited to what is shown if ‘verbose=True’.
- get_regr_data(what_data='full')[source]
Get data related to regression all sorted by class in dictionaries.
- Parameters:
what_data (str, list, optional) –
‘input’ - For each class, the input data with no cleaning.
‘raw_output’ - For each class, the output data with no cleaning.
‘output’ - For each class, the cleaned output data.
‘full’ - All options listed above in that respective order.
- Returns:
data – An object containing the specified regression data. Will return tuple of len(what_data) if a list is passed. Default: 3
- Return type:
tuple, ndarray or DataFrame
- info()[source]
Print info for the instance of TableData object.
For output descriptions see the method ‘get_info()’.
- make_class_data_plot(fig, ax, axes_keys, my_slice_vals=None, my_class_colors=None, return_legend_handles=False, verbose=False, **kwargs)[source]
Plot classification data on a given axis and figure.
- Parameters:
fig (Matplotlib Figure object) – Figure on which the plot will be made.
ax (Matplotlib Axis object) – Axis on which a scatter plot will be drawn.
axes_keys (list) – List containing two names of input data columns to use as horizontal and vertical axes.
my_slice_vals (optional, list, dict) – List giving values on which to slice in the axes not being plotted. Default (None) uses the first unique value found in each axis. If instead of individual values, a range is desired (e.g. 10 +/- 1) then a dict can be given with integer keys mapping to a tuple with the lower and upper range. ( e.g. {0:(9,11)} )
my_class_colors (optional, list) – List of colors used to represent different classes. Default (None) uses the default class colors.
return_legend_handles (optional, bool) – Returns a list of handles that connect classes to colors.
verbose (optional, bool) – Print useful information during runtime.
**kwargs (optional) – Kwargs for matplotlib.pyplot.scatter() .
- Returns:
fig (Matplotlib Figure object) – Figure object.
ax (Matplotlib Axis object) – Updated axis after plotting.
handles (optional, list) – List of legend handles connecting classes and their colors. Returned if ‘return_legend_handles’ is True. Default is False.
- plot_3D_class_data(axes=None, fig_size=(4, 5), mark_size=12, which_val=0, save_fig=False, plt_str='0', color_list=None)[source]
Plot the 3D classification data in a 2D plot.
Three input axes with classification output.
- Parameters:
axes (list, optional) –
By default it will order the axes as [x,y,z] in the original order the input axes were read in. To change the ordering, pass a list with the column names. Example: The default order is col_1, col_2, col_3. To change the horizontal axis from col_1 to col_2 you would use:
’axes = [“col_2”, “col_1”, “col_3”]’
fig_size (tuple, optional, default = (4,5)) – Size of the figure. (Matplotlib figure kwarg ‘figsize’)
mark_size (float, optional, default = 12) – Size of the scatter plot markers. (Matplotlib scatter kwarg ‘s’)
which_val (int, default = 0) – Integer choosing what unique value to ‘slice’ on in the 3D data such that it can be plotted on 2D. (If you had x,y,z data you need to choose a z value)
save_fig (bool, default = False) – Save the figure in the local directory.
plt_str (str, default = '0') – If you are saving multiple figures you can pass a string which will be added to the end of the default: “data_plot_{plt_str}.pdf”
color_list (list, default = None)
- Return type:
matplotlib figure
- posydon.active_learning.psy_cris.data.calc_avg_dist(data, n_neighbors, neighbor=None)[source]
Get the average distance to the nearest neighbors in the data set.
(NearestNeighbors from sklearn.neighbors)
- Parameters:
data (ndarray) – Data to train the NearestNeighbors class on.
n_neighbors (int) – Number of neighbors to use when finding average distance.
neighbor (instance of NearestNeighbors class) – For passing your own object.
- Returns:
avg_dist (array) – The average distance between nearest neighbors.
g_indi (array) – Indices that correspond to the nearest neighbors.
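To make the two return values concrete, here is a brute-force, pure-Python sketch of the same idea. The real function relies on NearestNeighbors from sklearn.neighbors; the helper `avg_neighbor_distance` below is hypothetical, uses Euclidean distance, and assumes a point is not counted as its own neighbor.

```python
import math

def avg_neighbor_distance(data, n_neighbors):
    """Brute-force average distance from each point to its n nearest neighbors."""
    avg_dist, neighbor_idx = [], []
    for i, point in enumerate(data):
        # Distances to every other point, closest first (assumption: self excluded).
        others = sorted(
            (math.dist(point, other), j)
            for j, other in enumerate(data) if j != i
        )[:n_neighbors]
        avg_dist.append(sum(d for d, _ in others) / len(others))
        neighbor_idx.append([j for _, j in others])
    return avg_dist, neighbor_idx

# Hypothetical 2D data set.
data = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]]
avg_dist, idx = avg_neighbor_distance(data, n_neighbors=2)
```

For the first point the two nearest neighbors lie at distances 1 and 2, so its average distance is 1.5.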
- posydon.active_learning.psy_cris.data.calc_avg_p_change(data, where_nearest_neighbors, undefined_p_change_val=None)[source]
Calculate the average fractional change in a given data set.
The method uses the N nearest neighbors (calculated beforehand).
- Parameters:
data (ndarray) – Data set to calculate percent change.
where_nearest_neighbors (dict) – Indices in data for the n nearest neighbors in input space.
undefined_p_change_val (optional) – For output with an undefined percent change (zero value), this kwarg defines what is put in its place. Defaults to nan.
- Returns:
avg_p_change_holder – Each element contains the average percent change for a given number of neighbors. If any output values are 0, then a nan is put in place of a percent change.
- Return type:
ndarray
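The sketch below shows one way an average fractional change over nearest neighbors could be computed. The exact formula used by PSY-CRIS is not given in this documentation, so the definition |y_j - y_i| / |y_i| is an assumption, as are the helper name `avg_fractional_change` and the simplified list-of-lists neighbor format.

```python
import math

def avg_fractional_change(values, neighbors, undefined=float("nan")):
    """Average |y_j - y_i| / |y_i| over the listed neighbors j of each point i."""
    result = []
    for i, y_i in enumerate(values):
        if y_i == 0:
            # A change relative to zero is undefined; use the placeholder.
            result.append(undefined)
            continue
        changes = [abs(values[j] - y_i) / abs(y_i) for j in neighbors[i]]
        result.append(sum(changes) / len(changes))
    return result

# Hypothetical output column and precomputed neighbor indices.
values = [100.0, 110.0, 0.0, 80.0]
neighbors = [[1, 3], [0, 3], [0, 1], [0, 1]]
apc = avg_fractional_change(values, neighbors)
```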
posydon.active_learning.psy_cris.regress
The PSY-CRIS regression module.
- class posydon.active_learning.psy_cris.regress.Regressor(TableData_object)[source]
Bases:
object
Perform regression/interpolation with different regression algorithms.
Regression algorithms are trained by class and by output column in the data set and stored as instance variables in nested dictionaries.
This class includes a ‘cross validation’ method that trains with the holdout method but calculates differences instead of a single accuracy.
Initialize the Regressor instance.
- Parameters:
TableData_object (instance of <class, TableData>) – An instance of the TableData class.
- cross_validate(regressor_name, class_key, col_key, alpha, verbose=False)[source]
Our method of cross validation for regression.
Train on a subset of the data and predict values for the rest. Then calculate the difference between the true and predicted value.
- Parameters:
- Returns:
percent_diffs (array) – Percent difference.
diffs (array) – Absolute difference.
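The two returned quantities can be illustrated with a short pure-Python sketch. The sign convention (predicted minus true) and the normalization by the true value are assumptions, since the documentation does not state them; `regression_diffs` is a hypothetical helper.

```python
def regression_diffs(true_values, predicted_values):
    """Return (percent_diffs, diffs) between predictions and held-out truth."""
    # Assumption: differences are predicted minus true, relative to the true value.
    diffs = [pred - true for true, pred in zip(true_values, predicted_values)]
    percent_diffs = [
        100.0 * d / true if true != 0 else float("nan")
        for d, true in zip(diffs, true_values)
    ]
    return percent_diffs, diffs

# Hypothetical held-out outputs and the regressor's predictions for them.
true_values = [10.0, 20.0, 40.0]
predicted = [11.0, 19.0, 40.0]
percent_diffs, diffs = regression_diffs(true_values, predicted)
```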
- fit_gaussian_process_regressor(class_keys, col_keys, data_interval=None, verbose=False)[source]
Fit a Gaussian Process regressor.
Implementation from: sklearn.gaussian_process (https://scikit-learn.org/stable/modules/gaussian_process.html)
- Parameters:
class_keys (list) – List of classes to train on.
col_keys (list) – List of columns in the class to train on. If multiple classes are given, it is assumed they all contain the supplied columns.
data_interval (array, optional) – Array indices of data used to train (training on a subset). If None (default), train on the whole data set.
verbose (bool, optional) – Print statements with more information while training.
- Returns:
regressor_holder – Ordered by class specific data and then by column. Nested dictionary maps to a trained GaussianProcessRegressor object.
- Return type:
- fit_linear_ND_interpolator(class_keys, col_keys, data_interval=None, verbose=False)[source]
Fit linear ND interpolator.
Implementation from: scipy.interpolate.LinearNDInterpolator (https://docs.scipy.org/doc/scipy/reference/interpolate.html)
- Parameters:
class_keys (list) – List of classes to train on.
col_keys (list) – List of columns in the class to train on. If multiple classes are given, it is assumed they all contain the supplied columns.
data_interval (array, optional) – Array indices of data used to train (training on a subset). If None (default), train on the whole data set.
verbose (bool, optional) – Print statements with more information while training.
- Returns:
regressor_holder – Ordered by class specific data and then by column. Nested dictionary maps to a trained linearNDinterpolator object.
- Return type:
- fit_rbf_interpolator(class_keys, col_keys, data_interval=None, verbose=False)[source]
Fit RBF interpolator.
Implementation from: scipy.interpolate.Rbf (https://docs.scipy.org/doc/scipy/reference/interpolate.html)
- Parameters:
class_keys (list) – List of classes to train on.
col_keys (list) – List of columns in the class to train on. If multiple classes are given, it is assumed they all contain the supplied columns.
data_interval (array, optional) – Array indices of data used to train (training on a subset). If None (default), train on the whole data set.
verbose (bool, optional) – Print statements with more information while training.
- Returns:
regressor_holder – Ordered by class specific data and then by column. Nested dictionary maps to a trained RBF object.
- Return type:
- get_cross_val_data(class_key, col_key, alpha)[source]
Randomly sample the data set and separate training and test data.
- Parameters:
- Returns:
cross_val_test_input_data (ndarray) – Input data used to test after training on a subset.
cross_val_test_output_data (ndarray) – Output data used to test after training on a subset.
sorted_rnd_int_vals (array) – Indices of original data that were used as training points.
- get_max_APC_val(regressor_name, class_key, args)[source]
Return the maximum interpolated average percent change for a class.
For a given class, and regression method. Return the maximum interpolated average percent change value across all APC columns in the class sorted data set. Helper method for constructing target distributions for the Sampler.
- Parameters:
- Returns:
max_APC (float) – Maximum average percent change (APC) value.
which_col_max (int) – Index of which column had the maximum APC.
- get_predictions(regressor_names, class_keys, col_keys, test_input, return_std=False)[source]
Get predictions from trained regressors for a set of inputs.
- Parameters:
regressor_names (list) – List of regressor algorithm names to use to predict.
class_keys (list) – List of classes to get predictions for.
col_keys (list) – List of columns to get predictions for.
test_input (ndarray) – Array of input points for which predictions will be found.
return_std (optional, bool) – Return the standard deviation when using GaussianProcessRegressor.
- Returns:
predictions – Dictionary ordered by algorithm, class, and output column mapping to an array of predictions for the test input points.
- Return type:
- get_rnd_test_inputs(class_name, N, other_rng={}, verbose=False)[source]
Produce randomly sampled ‘test’ inputs inside domain of input_data.
Input data is separated by class.
- Parameters:
class_name (str) – Class name to specify which input data you want to look at.
N (int) – Number of test inputs to return.
other_rng (dict, optional) – Change the range of random sampling in desired axis. By default, the sampling is done in the range of the training data. The axis is specified with an integer key and the value is a list specifying the range. {1:[min, max]}
verbose (bool, optional) – Print diagnostic information. (default False)
- Returns:
rnd_test_points – Test points randomly sampled in the range of the training data in each axis unless otherwise specified in ‘other_rng’. Has the same shape as input data from TableData.
- Return type:
ndarray
- mult_diffs(regressor_name, class_key, col_keys, alpha, cutoff, verbose=False)[source]
For multiple calls to cross_validate.
- Parameters:
regressor_name (str) – Name of regression algorithm to use.
class_key (str, class_dtype(int or other)) – Name of class data to use.
col_keys (str) – Column keys to cross validate on.
alpha (float) – Fraction of data set to cross validate on.
cutoff (float) – Sets the cutoff percentage at which to calculate the fraction of the data set above or below.
verbose (bool, optional) – Print useful diagnostic information.
- Returns:
p_diffs_holder (ndarray) – Percent differences per column.
attr_holder (ndarray) – Contains the number of points outside the cutoff, mean, and standard deviation of the percent difference calculations.
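The summary quantities listed for attr_holder can be sketched with the standard library. This is an illustration only: `summarize_percent_diffs` is a hypothetical helper, "outside the cutoff" is taken to mean an absolute percent difference above `cutoff`, and the population standard deviation is an assumption.

```python
import statistics

def summarize_percent_diffs(percent_diffs, cutoff):
    """Return (n_outside_cutoff, mean, standard deviation) of percent differences."""
    # Assumption: a point is "outside" when |percent difference| exceeds the cutoff.
    n_outside = sum(1 for p in percent_diffs if abs(p) > cutoff)
    return n_outside, statistics.mean(percent_diffs), statistics.pstdev(percent_diffs)

# Hypothetical percent differences for one output column.
n_outside, mean, std = summarize_percent_diffs([10.0, -5.0, 0.0, 25.0], cutoff=8.0)
```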
- plot_regr_data(class_name)[source]
Plot all regression data from the chosen class.
- Parameters:
class_name (str) – Specify what class data will be plotted.
- Returns:
Plots with all regression data for a given class.
- Return type:
matplotlib figure
- train(regressor_name, class_keys, col_keys, di=None, verbose=False)[source]
Train a regression algorithm.
- Implemented regressors:
LinearNDInterpolator (‘linear’, …)
Radial Basis Function (‘rbf’, …)
GaussianProcessRegressor (‘gp’, …)
>>> rg = Regressor(TableData_object)
>>> rg.train('linear', ['class_1'], ['output_1'], di=np.arange(0, Ndatapoints, 5), verbose=True)
Trained regressor objects are uniquely defined by the algorithm used to train, the data set used to train (grouped by class), and finally the output column (there could be more than one). This motivates the data structure for storing the regressor objects as follows:
Algorithm -> Class -> Output Column -> Object
- Here is a more realistic example of what it could look like:
{RBF: {“class_1”: {“output_1”: {instance of scipy.interpolate.rbf}}}}
- Parameters:
regressor_name (string) – Name of regressor to train.
class_keys (list) – List of class(es) to train on.
col_keys (list or None) – For a given class, what columns to train on. If None, it trains on all columns in one class.
di (optional, array) – Array indices of data used to train (training on a subset). If None (default), train on the whole data set.
verbose (optional, bool) – Print statements with more information while training.
- Return type:
None
Note: You can train multiple classes at once as long as they have the same columns specified in col_keys.
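The Algorithm -> Class -> Output Column -> Object layout described for train() maps naturally onto nested dictionaries. The sketch below shows one way to fill and query such a structure; the helper names and the use of a plain function in place of a trained interpolator are hypothetical and not part of the Regressor class.

```python
def store_regressor(holder, algorithm, class_key, col_key, regressor):
    """Store a trained object under Algorithm -> Class -> Output Column."""
    holder.setdefault(algorithm, {}).setdefault(class_key, {})[col_key] = regressor

def lookup_regressor(holder, algorithm, class_key, col_key):
    """Return the stored object, or None if that combination was never trained."""
    return holder.get(algorithm, {}).get(class_key, {}).get(col_key)

holder = {}
# A plain function stands in for a trained interpolator object.
store_regressor(holder, "rbf", "class_1", "output_1", lambda x: 2.0 * x)
model = lookup_regressor(holder, "rbf", "class_1", "output_1")
```

Keying by algorithm first keeps the objects from different regression methods separate even when they share class and column names.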
posydon.active_learning.psy_cris.sample
The definition of the Sampler class in PSY-CRIS.
- class posydon.active_learning.psy_cris.sample.Sampler(classifier=None, regressor=None)[source]
Bases:
object
Class implementing PTMCMC and MCMC for the PSY-CRIS algorithm.
Modular implementation of PTMCMC and MCMC designed to implement the PSY-CRIS algorithm of sampling points in a target distribution constructed with a Classifier and Regressor. After a posterior is generated, methods in this class are also used to downsample.
Initialize the sampler.
- Parameters:
classifier (instance of <class, Classifier>) – A trained classifier object.
regressor (instance of <class, Regressor>, optional) – A trained regressor object.
- TD_2d_analytic(name, args, **kwargs)[source]
2-dimensional analytic target distribution for testing MCMC/PTMCMC.
The function: $\frac{16}{3\pi} \left( \exp\left[-\mu^2 - (9 + 4\mu^2 + 8\nu)^2\right] + \frac{1}{2} \exp\left[-8\mu^2 - 8(\nu-2)^2\right] \right)$
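For reference, the analytic function can be written directly in Python. This standalone sketch takes the two coordinates as separate arguments, which differs from the `(name, args, **kwargs)` signature of the method; the name `td_2d_analytic` is hypothetical.

```python
import math

def td_2d_analytic(mu, nu):
    """Evaluate the two-component analytic test distribution at (mu, nu)."""
    # Curved ridge along 9 + 4*mu**2 + 8*nu = 0.
    curved = math.exp(-mu**2 - (9 + 4 * mu**2 + 8 * nu) ** 2)
    # Compact Gaussian bump centered on (mu, nu) = (0, 2).
    bump = 0.5 * math.exp(-8 * mu**2 - 8 * (nu - 2) ** 2)
    return 16.0 / (3.0 * math.pi) * (curved + bump)

bump_peak = td_2d_analytic(0.0, 2.0)
ridge_peak = td_2d_analytic(0.0, -9.0 / 8.0)
```

The two components are well separated, which is what makes the function a useful test for samplers that must find more than one mode.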
- TD_classification(classifier_name, position, **kwargs)[source]
Target distribution using classification.
$f(x) = 1 - \max[P_{\rm class}(x)]$
- Parameters:
classifier_name (str) – String to specify the trained classification algorithm to use.
position (array) – Single location in parameter space for the target distribution to be evaluated at.
**kwargs –
- TD_BETA (float)
Exponent of the target distribution, $f(x)^{\rm TD\_BETA}$. Used for smoothing or sharpening.
- TD_verbose (bool)
Extra print output every method call.
- Returns:
If the classification probability is NaN: f(x) = 1E-16
- Return type:
array
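The classification target distribution is simple enough to sketch in a few lines. The function below takes the class probabilities at a single position directly, rather than a classifier name and a position, so it is a hypothetical stand-in for the method and not its implementation.

```python
import math

def td_classification(class_probs, td_beta=1.0):
    """Return (1 - max P_class)**td_beta, or 1e-16 if any probability is NaN."""
    if any(math.isnan(p) for p in class_probs):
        return 1e-16
    return (1.0 - max(class_probs)) ** td_beta

# Hypothetical class probabilities at three positions in parameter space.
near_boundary = td_classification([0.5, 0.4, 0.1])
deep_inside = td_classification([0.98, 0.01, 0.01])
undefined = td_classification([float("nan"), 0.5, 0.5])
```

The value is largest where the classifier is least certain, so sampling from this distribution concentrates points near class boundaries.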
- TD_classification_regression(names, args, **kwargs)[source]
Target distribution using both classification & regression.
Classification: $1 - \max[P_{\rm class}(x)]$
Regression: $A_0 \log\left( A_1 \left| \max[{\rm APC}_n[{\rm loc}]] \right| + 1 \right)$
- Parameters:
names (list like) – Iterable containing the two strings specifying the classification and regression algorithm to use.
args (array) – Position in parameter space to evaluate the target distribution at.
**kwargs –
- TD_A1 (float, optional)
Scaling factor inside the log regression error term. (Default = 0.5)
- TD_TAU (float, optional)
Relative weight of classification to regression term. (Default = 0.5)
- TD_BETA (float, optional)
Exponent of the entire target distribution. Used for smoothing or sharpening the distribution. Default is 1.
- TD_verbose (bool, optional)
Print more diagnostic information.
- Return type:
array
- do_density_logic(step_history, N_points, Kappa, shuffle=False, norm_steps=False, var_mult=None, add_mvns_together=False, pre_acc_points=None, verbose=False)[source]
Do the density logic based on a normal (Gaussian) kernel placed on each point.
This method automatically removes the first 5% of the MCMC steps (the burn-in) so that the initial starting points are not chosen automatically, for example if you start in a non-ideal region.
- do_simple_density_logic(step_history, N_points, Kappa, var_mult=None, add_mvns_together=False, include_training_data=True, verbose=False)[source]
Perform multivariate normal density logic on a given step history.
This is a simplified version of the method ‘do_density_logic’. It assumes that every accepted point has exactly the same MVN.
Each proposal distribution starts with the training set from TableData, which keeps training data from being proposed again.
- Parameters:
step_history (ndarray) – List of points from a PTMCMC or MCMC. (posterior)
N_points (int) – Number of points desired to be drawn from the posterior but may not actually be the number of points accepted. Contributes to the length scale of the MVN distribution of accepted points (along with kappa).
Kappa (float) – Scaling factor that sets the initial size of the MVN for accepted points. This should be proportional to the filling factor of the area of interest described by the target distribution used to create the posterior.
var_mult (float, ndarray, optional) – Variance multiplier for the MVN of accepted points.
add_mvns_together (bool, optional) – Add MVNs together when creating the accepted point distribution.
include_training_data (bool, optional) – Include the training data in the target distribution before sampling.
verbose (bool, optional) – Print useful diagnostic information.
- Returns:
accepted_points (ndarray) – Accepted points from the posterior to be labeled by the user. (query points)
rejected_points (ndarray) – Rejected points from the posterior.
Notes
The term "accepted" here refers to query points for the oracle to label in an active learning scheme. It does not mean accepted vs. rejected in the sense normally used for MCMC.
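The idea of seeding the accepted-point distribution with the training set can be sketched as below. This is an illustration only: the length-scale formula, the use of an unnormalized isotropic Gaussian, and the acceptance rule are all assumptions, and `simple_density_sketch` is a hypothetical name rather than the method's actual code.

```python
import numpy as np

def simple_density_sketch(step_history, n_points, kappa, training=None, seed=0):
    """Hypothetical density-based selection of query points."""
    rng = np.random.default_rng(seed)
    steps = np.asarray(step_history, dtype=float)
    dim = steps.shape[1]
    # ASSUMPTION: the kernel width shrinks as more points are requested.
    sigma = kappa * n_points ** (-1.0 / dim)
    # Seeding the kernel centers with training data discourages re-proposing it.
    centers = [] if training is None else [np.asarray(p, dtype=float) for p in training]
    accepted, rejected = [], []
    for point in steps:
        if centers:
            sq_dist = np.sum((np.asarray(centers) - point) ** 2, axis=1)
            # Unnormalized Gaussian: equals 1 exactly on top of a center.
            density = np.exp(-0.5 * sq_dist / sigma**2).max()
        else:
            density = 0.0
        if rng.uniform() >= density:
            accepted.append(point)
            centers.append(point)  # every accepted point gets the same kernel
        else:
            rejected.append(point)
    return np.array(accepted), np.array(rejected)

training = np.array([[0.0, 0.0]])
steps = np.array([[0.0, 0.0], [5.0, 5.0], [5.0, 5.0]])
acc, rej = simple_density_sketch(steps, n_points=2, kappa=0.1, training=training)
```

In this toy input the step that coincides with a training point and the repeated step are both rejected, while the first visit to a distant location is accepted.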
- get_TD_classification_data(*args, **kwargs)[source]
Get target-distribution classification data.
Calculate terms relevant for creating target distributions with classification terms.
- Parameters:
classifier_name (str) – Trained classifier name to use for predictions.
position (array) – Position in parameter space to evaluate.
**kwargs –
- TD_verbose (bool) – Print useful output.
- Returns:
max_probs (array) – Maximum probabilities at each query point
position (array) – Position in parameter space being queried
cls_key (array) – Classification key predicted for each query position
- get_proposed_points(step_history, N_points, Kappa, shuffle=False, norm_steps=False, add_mvns_together=False, include_training_data=True, var_mult=None, seed=None, n_repeats=1, max_iters=1e3, verbose=False, **kwargs)[source]
Get proposed points in parameter space given a MCMC step history.
The density logic is not deterministic, so multiple iterations may be needed to converge on a desired number of proposed points. This method performs multiple calls to do_density_logic while changing Kappa in order to return the desired number of points. After n_repeats instances of the correct number of N_points, the distribution with the largest average distance is chosen.
- Warning: This algorithm has not been tested for large N data sets and may struggle to converge.
- Parameters:
step_history (ndarray) – Posterior from which to sample new query points.
N_points (int) – N query points to converge to.
Kappa (float) – Multiplies the length scale of the MVNs and is adjusted until the desired number of query points is found.
shuffle (bool, optional) – Shuffle points in posterior in place before sampling.
norm_steps (bool, optional) – Normalize steps before sampling.
add_mvns_together (bool, optional) – Add MVNs of accepted point distribution together.
include_training_data (bool, optional) – Include training data in the accepted point distribution before sampling.
var_mult (ndarray, optional) – Variance multiplier.
seed (float, optional) – Random seed to use for random sampling.
n_repeats (int, optional) – Number of times to converge to the correct number of points. Each iteration may be a different realization of the posterior.
verbose (bool, optional) – Print useful information.
**kwargs –
- show_plots (bool, optional) – Show 2D plot of proposed points with step history & training data.
- Returns:
acc_pts (ndarray) – Array of proposed points to be used as initial conditions in new simulations.
Kappa (float) – Scaling factor which reproduced the desired number of accepted points.
Notes
Will automatically exit if it goes through max_iters iterations without converging on the desired number of points.
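The search over Kappa can be pictured with the minimal sketch below. It swaps the stochastic density logic for a deterministic stand-in (greedy minimum-separation selection) so the behavior is reproducible. The multiplicative update rule, the `factor` parameter, and both function names are assumptions for illustration only.

```python
def select_separated(points, kappa):
    """Deterministic stand-in for the density logic: keep points >= kappa apart."""
    kept = []
    for p in points:
        if all(abs(p - q) >= kappa for q in kept):
            kept.append(p)
    return kept

def tune_kappa_sketch(points, n_target, kappa=1.0, factor=1.1, max_iters=1000):
    """Hypothetical loop: rescale kappa until exactly n_target points are kept."""
    for _ in range(int(max_iters)):
        kept = select_separated(points, kappa)
        if len(kept) == n_target:
            return kept, kappa
        # Too many points: widen the exclusion scale. Too few: shrink it.
        kappa = kappa * factor if len(kept) > n_target else kappa / factor
    return None, kappa  # gave up after max_iters without converging

kept, final_kappa = tune_kappa_sketch(list(range(101)), n_target=5)
```

A fixed multiplicative step can oscillate around the target without ever hitting it, which is one reason a max_iters exit is needed.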
- make_prop_points_plots(step_hist, prop_points, axes=(0, 1), show_fig=True, save_fig=False)[source]
Plot the proposed / accepted points over the step history.
- make_trace_plot(chain_holder, T_list, Temp, save_fig=False, show_fig=True)[source]
Plot step number vs. position of a sampler along an axis.
This function makes titles assuming you are using the data from the classifier.
- normalize_step_history(step_history)[source]
Normalize steps to [0, 1] according to the min/max in each axis.
The max and min are taken from the original data set from TableData.
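A minimal sketch of this min/max scaling, assuming the training data is available as a plain array (the method itself reads it from TableData), might look like the following. `normalize_steps_sketch` is a hypothetical name.

```python
import numpy as np

def normalize_steps_sketch(step_history, training_data):
    """Scale each axis of step_history by the min/max of training_data."""
    steps = np.asarray(step_history, dtype=float)
    data = np.asarray(training_data, dtype=float)
    lo = data.min(axis=0)
    hi = data.max(axis=0)
    # Steps outside the training range would map outside [0, 1].
    return (steps - lo) / (hi - lo)

training = np.array([[0.0, 10.0], [4.0, 30.0]])
steps = np.array([[1.0, 20.0], [4.0, 10.0]])
normalized = normalize_steps_sketch(steps, training)
```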
- run_MCMC(N_trials, alpha, step_history, target_dist, classifier_name, T=1, upper_limit_reject=1e4, **TD_kwargs)[source]
Run a Markov chain Monte Carlo given a target distribution.
- Parameters:
N_trials (int) – Number of proposals or trial steps to take before stopping.
alpha (float) – Related to the step size of the MCMC walker. Defines the standard deviation of a zero mean normal from which the step is randomly drawn.
step_history (list) – Initial starting location in parameter space. Could contain an arbitrary number of previous steps but a walker will start at the last step in the list.
target_dist (callable) – The target distribution to sample. Must take arguments (method_name, element_of_step_history). (A 2D analytic function is provided: TD_2d_analytic.)
classifier_name (str) – Name of interpolation technique used in the target_dist.
T (float, optional) – Temperature of the MCMC.
upper_limit_reject (int, optional) – Sets the maximum number of rejected steps before the MCMC stops walking. This avoids a slowly converging walk with few accepted points.
- Returns:
step_history (array) – An array containing all accepted steps of the MCMC.
accept (int) – Total number of accepted steps.
reject (int) – Total number of rejected steps.
Notes
Assumes uniform priors and a symmetric jump proposal (Gaussian).
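For orientation, a generic Metropolis walk with a symmetric Gaussian proposal, uniform prior, and temperature T can be sketched as below. This is a textbook version, not the method's source. The simplified `target_dist(position)` signature (no method name argument), the `seed` parameter, and the function name are assumptions.

```python
import numpy as np

def metropolis_sketch(n_trials, alpha, start, target_dist, T=1.0,
                      upper_limit_reject=1e4, seed=0):
    """Generic Metropolis sampler; assumes target_dist > 0 at the start."""
    rng = np.random.default_rng(seed)
    history = [np.asarray(start, dtype=float)]
    accept = reject = 0
    f_current = target_dist(history[-1])
    for _ in range(n_trials):
        if reject >= upper_limit_reject:
            break  # stop a walk that is rejecting almost everything
        # Symmetric proposal: zero-mean normal with standard deviation alpha.
        trial = history[-1] + rng.normal(0.0, alpha, size=history[-1].shape)
        f_trial = target_dist(trial)
        # With a symmetric proposal and uniform prior, only the tempered
        # ratio of target values enters the acceptance test.
        ratio = (f_trial / f_current) ** (1.0 / T)
        if rng.uniform() < ratio:
            history.append(trial)
            f_current = f_trial
            accept += 1
        else:
            reject += 1
    return np.array(history), accept, reject

gaussian = lambda x: np.exp(-0.5 * np.sum(x ** 2))
history, accept, reject = metropolis_sketch(500, 0.5, [0.0, 0.0], gaussian)
```

Raising T flattens the tempered target, so a hot walker accepts more trial steps and explores more widely.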
- run_PTMCMC(T_max, N_tot, target_dist, classifier_name, init_pos=None, N_draws_per_swap=3, c_spacing=1.2, alpha=None, upper_limit_reject=1e5, verbose=False, trace_plots=False, **TD_kwargs)[source]
Run a parallel tempered MCMC with user-specified target distribution.
Calls the method run_MCMC.
- Parameters:
T_max (float) – Sets the maximum temperature MCMC in the chain.
N_tot (int) – The total number of iterations for the PTMCMC.
target_dist (callable) – The target distribution to sample. Must take arguments (method_name, location_to_eval) (A 2D analytic function is provided - analytic_target_dist)
classifier_name (str, list) – A single string or list of strings specifying the interpolator to use for classification or classification & regression respectively.
init_pos (array) – Initial position of walkers in each axis. Default is the median of the input data in TableData.
N_draws_per_swap (int, optional) – Number of draws to perform for each MCMC before swap proposals.
c_spacing (float, optional) – Sets the spacing of temperatures in each chain. $T_{i+1} = T_i^{1/c}$, range: [T_max, T=1]
alpha (float, optional) – Sets the standard deviation of steps taken by the walkers. Default is 1/5 the range of training data from TableData.
upper_limit_reject (float, optional) – Sets the upper limit of rejected points.
verbose (bool, optional) – Useful print statements during execution.
- Returns:
chain_step_history (dict) – Holds the step history for every chain. Keys are integers that range from 0 (max T) to the total number of chains - 1 (min T).
T_list (array) – Array filled with the temperatures of each chain from max to min.
Notes
There is a zero prior on the PTMCMC outside the range of training data.
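The temperature spacing rule can be illustrated with the short sketch below. Because $T_{i+1} = T_i^{1/c}$ only approaches 1 asymptotically, the sketch stops at a hypothetical `cutoff` and then appends T = 1; that cutoff is an assumption. The swap probability shown is the standard parallel tempering rule and is not confirmed to be what run_PTMCMC uses.

```python
def temperature_ladder_sketch(T_max, c_spacing=1.2, cutoff=1.05):
    """Build temperatures from T_max down to 1 with T_next = T ** (1 / c)."""
    temps = [float(T_max)]
    while temps[-1] ** (1.0 / c_spacing) > cutoff:
        temps.append(temps[-1] ** (1.0 / c_spacing))
    if temps[-1] > 1.0:
        temps.append(1.0)  # the coldest chain samples the untempered target
    return temps

def swap_probability_sketch(f_i, f_j, T_i, T_j):
    """Standard PT swap acceptance for chains i and j.

    f_i and f_j are the target values at the current positions of each chain.
    """
    return min(1.0, (f_j / f_i) ** (1.0 / T_i - 1.0 / T_j))

ladder = temperature_ladder_sketch(10.0)
p_swap = swap_probability_sketch(2.0, 1.0, 1.0, 2.0)
```

A c_spacing closer to 1 produces more chains with more similar temperatures, which makes swaps between neighbors more likely but raises the cost per iteration.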
posydon.active_learning.psy_cris.utils
Module defining helper functions for PSY-CRIS.
- posydon.active_learning.psy_cris.utils.calc_performance(dfs_per_iter, cls_name='linear', regr_name='rbf', resolution=400, verbose=False, **kwargs)[source]
Calculate accuracy and confusion matrix.
Given a list of pandas DataFrames, iterate over them and calculate the accuracy and confusion matrix for synthetic data sets.
- Parameters:
dfs_per_iter (list) – List of pandas DataFrames containing training data to train a classifier on and then compare to the true background distribution.
cls_name (str, optional) – Name of classifier to train.
resolution (int, optional) – Density per axis of the grid used to oversample the true background.
verbose (bool, optional) – Print some helpful info.
- Returns:
acc_per_iter (array) – Array containing overall accuracy of interpolator per iteration of training data.
conf_matrix_per_iter (list) – List of confusion matrices per iteration. Calculated using ‘get_confusion_matrix’.
regr_acc_per_iter (list) – List of regression accuracy terms.
- posydon.active_learning.psy_cris.utils.calc_regression_accuracy(all_regr_abs_frac_diffs, cdf_cutoff_limits=None)[source]
Calculate the fractional difference below which a given CDF cutoff of the data lies.
For a given distribution of absolute fractional differences, calculate the fractional difference below which the fraction cdf_cutoff of the data falls.
For example, for 50% of the data set, the range of fractional differences is [0, ?].
(What is not being asked: for a fractional difference of 10%, what fraction of the data has a fractional difference <= that number?)
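The question being asked is essentially a quantile lookup. The sketch below uses `numpy.quantile` on made-up numbers; the cutoff values and the assumption that cutoffs are expressed as fractions in [0, 1] are illustrative, not taken from the function's source.

```python
import numpy as np

# Made-up absolute fractional differences, for illustration only.
abs_frac_diffs = np.array([0.01, 0.02, 0.05, 0.10, 0.40])

# ASSUMPTION: cutoffs are fractions of the data set (0.5 means 50%).
cdf_cutoffs = [0.5, 0.8]

# For each cutoff q, find the value below which a fraction q of the data lies.
limits = [np.quantile(abs_frac_diffs, q) for q in cdf_cutoffs]
```

Here 50% of the sample has a fractional difference in [0, 0.05], which answers the "[0, ?]" question for this toy data.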
- posydon.active_learning.psy_cris.utils.check_dist(original, proposed, threshold=1e-5)[source]
Check Euclidean distance between the original and proposed points.
Proposed points at a distance >= threshold from the original points are accepted.
- Parameters:
original (ndarray) – Original points previously run.
proposed (ndarray) – Proposed points for new simulations.
threshold (float, optional) – The threshold distance between acceptance and rejection.
- Returns:
proposed_above_thresh_for_all_original – True if the distance between the proposed point and every original point is >= threshold.
- Return type:
bool, array
Notes
The purpose of this function is to avoid proposing points that lie within some threshold distance of already accepted points.
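One way to express this check with NumPy broadcasting is sketched below. `check_dist_sketch` is a hypothetical stand-in written from the description above, not the function's source.

```python
import numpy as np

def check_dist_sketch(original, proposed, threshold=1e-5):
    """Return one bool per proposed point: far enough from all originals?"""
    original = np.atleast_2d(np.asarray(original, dtype=float))
    proposed = np.atleast_2d(np.asarray(proposed, dtype=float))
    # Pairwise differences with shape (n_proposed, n_original, n_dim).
    diffs = proposed[:, None, :] - original[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    # A proposed point passes only if it clears the threshold for every original.
    return np.all(dists >= threshold, axis=1)

original = [[0.0, 0.0], [1.0, 1.0]]
proposed = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0 + 1e-7]]
mask = check_dist_sketch(original, proposed)
```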
- posydon.active_learning.psy_cris.utils.do_dynamic_sampling(N_final_points=100, new_points_per_iter=20, verbose=False, threshold=1e-5, N_starting_points=100, jitter=False, dim=2, length_scale_mult=0.33, percent_increase=None, show_plots=False, **all_kwargs)[source]
Run cris algorithm iteratively.
For a given number of starting and ending points, run the cris algorithm iteratively in step sizes of new_points_per_iter. After each iteration, query points are identified using the original 2D synthetic data set.
- Parameters:
N_starting_points (int) – Number of starting points to begin cris iterations on a 2D grid sampled from the original 2D synthetic data set.
N_final_points (int) – Number of points to converge to after iterating with cris.
new_points_per_iter (int, array-like) – For every iteration the number of new query points for cris to propose.
threshold (float) – New query points are omitted from the next iteration if their Euclidean distance to other data points is less than the threshold.
jitter (bool) – Default False. For the starting grid, jitter the center randomly in the range of +/- 1/2 the bin width in each dimension.
verbose (bool) – Print useful things.
show_plots (bool) – Show plots of proposed points and training points each iteration.
all_kwargs (dict) – Dictionary of all_kwargs passed to get_new_query_points defining how every part of the cris algorithm is implemented.
- posydon.active_learning.psy_cris.utils.do_small_class_proposal(table_data, n_neighbors, n_new_points=1, length_scale_mult=0.33, neighbor=None, verbose=False)[source]
Handle proposals where only one point in class and regression is needed.
Handles proposals where only one point exists for a class and regression is requested; otherwise the interpolators will fail. New points are drawn from a Gaussian centered on that point, with a length scale given by the average distance between nearest neighbors (in each axis).
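A hedged sketch of such a draw is shown below. How the neighbors are selected, how the per-axis scale is averaged, and how `length_scale_mult` enters are all assumptions, and `small_class_proposal_sketch` is a hypothetical name. `other_points` is assumed not to contain the lone point itself.

```python
import numpy as np

def small_class_proposal_sketch(point, other_points, n_neighbors,
                                n_new_points=1, length_scale_mult=0.33, seed=0):
    """Draw new points from a Gaussian centered on a lone class member."""
    rng = np.random.default_rng(seed)
    point = np.asarray(point, dtype=float)
    other_points = np.asarray(other_points, dtype=float)
    # Pick the n_neighbors closest points by Euclidean distance.
    dists = np.linalg.norm(other_points - point, axis=1)
    nearest = other_points[np.argsort(dists)[:n_neighbors]]
    # ASSUMPTION: per-axis scale is the mean absolute offset to those neighbors.
    scale = length_scale_mult * np.mean(np.abs(nearest - point), axis=0)
    return rng.normal(loc=point, scale=scale, size=(n_new_points, point.size))

others = np.array([[1.0, 0.0], [0.0, 2.0], [5.0, 5.0]])
new_pts = small_class_proposal_sketch([0.0, 0.0], others, n_neighbors=2,
                                      n_new_points=3)
```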
- posydon.active_learning.psy_cris.utils.get_confusion_matrix(preds, actual, all_classes, verbose=False)[source]
Calculate a confusion matrix given lists of predicted and actual values.
- Parameters:
preds (list) – Predicted values from the classifier.
actual (list) – True values from the underlying distribution.
all_classes (list) – A list of all unique classes. Should be either np.unique(actual) or a subset thereof.
verbose (bool, optional) – Print out the line-by-line confusion matrix prefixed with the class.
- Returns:
confusion_matrix – Rows and columns of confusion matrix in order and number given in all_classes.
- Return type:
ndarray
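A minimal confusion-matrix sketch is given below. The orientation (rows for actual classes, columns for predicted classes) and the use of raw counts rather than normalized fractions are assumptions, since the description above only fixes the class ordering. `confusion_matrix_sketch` is a hypothetical name.

```python
import numpy as np

def confusion_matrix_sketch(preds, actual, all_classes):
    """Count (actual, predicted) pairs in the order given by all_classes."""
    index = {cls: i for i, cls in enumerate(all_classes)}
    matrix = np.zeros((len(all_classes), len(all_classes)), dtype=int)
    for p, a in zip(preds, actual):
        # Pairs involving a class outside all_classes are skipped.
        if p in index and a in index:
            matrix[index[a], index[p]] += 1  # row: actual, column: predicted
    return matrix

preds = ["A", "B", "B", "B"]
actual = ["A", "A", "B", "B"]
cm = confusion_matrix_sketch(preds, actual, all_classes=["A", "B"])
```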
- posydon.active_learning.psy_cris.utils.get_new_query_points(N_new_points=1, TableData_kwargs={}, Classifier_kwargs={}, Regressor_kwargs={}, Sampler_kwargs={}, Proposal_kwargs={}, length_scale_mult=0.33, threshold=1e-5, **kwargs)[source]
Run the psy-cris algorithm to propose new query points to be labeled.
- Parameters:
N_new_points (int, optional) – Number of new query points desired.
TableData_kwargs (dict, optional) – Kwargs used for initializing TableData.
Classifier_kwargs (dict, optional) – Kwargs used for the Classifier method train_everything.
Regressor_kwargs (dict, optional) – Kwargs used for the Regressor method train_everything.
Sampler_kwargs (dict, optional) – Kwargs used for choosing Sampler target distribution and the method run_PTMCMC.
Proposal_kwargs (dict, optional) – Kwargs used in the Sampler method ‘get_proposed_points’ and the Classifier method ‘get_class_predictions’.
- Returns:
proposed_points (ndarray) – New query points.
pred_class (array) – For all proposed points, the best prediction from the trained classifier.
- posydon.active_learning.psy_cris.utils.get_prediction_diffs(training_df, classifier_name='linear', regressor_name='linear', N=400, verbose=False, **kwargs)[source]
Train classifier and get predictions and actual classification.
From a DataFrame of training data, train a classifier and get both the predictions and actual classification in the classification space where the analytic function is defined. Also calculate the difference between the true regression function and that inferred from the trained regressor. Dimensionality is inferred from ‘training_df’.
- Parameters:
training_df (pandas DataFrame) – DataFrame of training data, a subset of the true distribution.
classifier_name (str) – Name of the classification algorithm to use.
N (int) – Sets the (N**dim) resolution of points used to query the trained classifier.
verbose (bool, optional) – Print more useful information.
timer (bool, optional) – Print timing diagnostic information.
- Returns:
pred_class (array) – 1D array of predictions from the trained classifier.
true_class_result (array) – 1D array of the true classification for the corresponding points.
all_regr_acc_per_class (dict) – Dict of lists with regression accuracy values per class and combined.
- posydon.active_learning.psy_cris.utils.get_random_grid_df(N, dim=2)[source]
Produce a randomly sampled grid.
Given N total points, produce a randomly sampled grid drawn from the analytic data set (2D or 3D).
- posydon.active_learning.psy_cris.utils.get_regular_grid_df(N=100, jitter=False, verbose=False, N_ppa=None, dim=2)[source]
Produce an even grid.
Given N total points, produce an even grid with approximately the same number of evenly spaced points sampled from the analytic data set (2D or 3D).
The number of returned grid points is N only if N is a perfect square. Otherwise use N_ppa to define number of points per axis.
- Parameters:
N (int) – Total number of points to make into a 2D even grid
jitter (bool, optional) – Place the center of the grid randomly around (0,0) in the range of +/- 1/2 bin width while keeping the span in each axis at 6.
N_ppa (array, optional) – Numbers of points per axis. If provided, it overrides N.
dim (int, optional) – Dimensionality of synthetic data set. (2 or 3)
verbose (bool, optional) – Print some diagnostics.
- Returns:
extra_points – DataFrame of true data drawn from the analytic classification and regression functions.
- Return type:
pandas DataFrame
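The grid geometry described above (span of 6 per axis, centered on (0,0), optional jitter of at most half a bin width) can be sketched in 2D as follows. The sketch returns a plain array of coordinates instead of a DataFrame of analytic values, and the rounding of sqrt(N) as well as the name `regular_grid_sketch` are assumptions.

```python
import numpy as np

def regular_grid_sketch(N=100, jitter=False, span=6.0, seed=0):
    """Return an (n*n, 2) array of evenly spaced 2D grid coordinates."""
    # ASSUMPTION: points per axis is the rounded square root of N.
    n_per_axis = int(round(np.sqrt(N)))
    axis = np.linspace(-span / 2.0, span / 2.0, n_per_axis)
    bin_width = axis[1] - axis[0]
    center = np.zeros(2)
    if jitter:
        rng = np.random.default_rng(seed)
        # Shift the whole grid by at most half a bin width in each axis.
        center = rng.uniform(-bin_width / 2.0, bin_width / 2.0, size=2)
    xx, yy = np.meshgrid(axis + center[0], axis + center[1])
    return np.column_stack([xx.ravel(), yy.ravel()])

grid = regular_grid_sketch(100)
jittered = regular_grid_sketch(100, jitter=True)
```

Requesting N=90 in this sketch yields 81 points (9 per axis), which shows why the returned count equals N only for perfect squares.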
- posydon.active_learning.psy_cris.utils.parse_inifile(path, verbose=False)[source]
Parse an ini file to run psy-cris method ‘get_new_query_points’.