The PSY-CRIS regression module.
- class posydon.active_learning.psy_cris.regress.Regressor(TableData_object)[source]
Bases: object
Perform regression/interpolation with different regression algorithms.
Regression algorithms are trained by class and by output column in the data set and stored as instance variables in nested dictionaries.
This class includes a ‘cross validation’ method that trains with the holdout method but calculates differences instead of a single accuracy.
Initialize the Regressor instance.
- Parameters
TableData_object (instance of <class, TableData>) – An instance of the TableData class.
- cross_validate(regressor_name, class_key, col_key, alpha, verbose=False)[source]
Our method of cross validation for regression.
Train on a subset of the data and predict values for the rest. Then calculate the difference between the true and predicted values.
- Parameters
- Returns
percent_diffs (array) – Percent difference.
diffs (array) – Absolute difference.
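The holdout procedure described above can be sketched as follows. This is a minimal illustration and not the PSY-CRIS implementation: `holdout_diffs` is a hypothetical helper, `numpy.interp` on 1D data stands in for the regressor, `alpha` is assumed to be the fraction of points used for training, and the percent difference is assumed to be `100 * (predicted - true) / true`.

```python
import numpy as np

def holdout_diffs(x, y, alpha, seed=0):
    """Train on a random subset and return differences on the held-out points.

    Hypothetical helper; alpha is assumed to be the training fraction.
    """
    rng = np.random.default_rng(seed)
    n_train = max(2, int(alpha * len(x)))
    train_idx = np.sort(rng.choice(len(x), size=n_train, replace=False))
    test_mask = np.ones(len(x), dtype=bool)
    test_mask[train_idx] = False

    # Stand-in regressor: piecewise-linear interpolation through the training points.
    predicted = np.interp(x[test_mask], x[train_idx], y[train_idx])
    true = y[test_mask]

    diffs = predicted - true
    percent_diffs = 100.0 * diffs / true  # assumed definition; requires true != 0
    return percent_diffs, diffs

x = np.linspace(1.0, 10.0, 50)
y = 2.0 * x + 1.0
percent_diffs, diffs = holdout_diffs(x, y, alpha=0.5)
```

Returning one difference per held-out point, rather than a single score, keeps the full error distribution available for later inspection.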
- fit_gaussian_process_regressor(class_keys, col_keys, data_interval=None, verbose=False)[source]
Fit a Gaussian Process regressor.
Implementation from: sklearn.gaussian_process (https://scikit-learn.org/stable/modules/gaussian_process.html)
- Parameters
class_keys (list) – List of classes to train on.
col_keys (list) – List of columns in the class to train on. If multiple classes are given, it is assumed they all contain the supplied columns.
data_interval (array, optional) – Array indices of data used to train (training on a subset). If None (default), train on the whole data set.
verbose (bool, optional) – Print statements with more information while training.
- Returns
regressor_holder – Ordered by class specific data and then by column. Nested dictionary maps to a trained GaussianProcessRegressor object.
- Return type
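As a rough illustration of the returned structure, the sketch below fits one scikit-learn `GaussianProcessRegressor` per class and per output column and stores them in a nested dictionary. The `data` dictionary, its class and column names, and the default kernel are assumptions made for this example; they are not taken from PSY-CRIS.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

# Hypothetical class-sorted data: class name -> (inputs, outputs by column name).
x = np.linspace(0.0, 1.0, 8)
data = {
    "class_1": (x.reshape(-1, 1), {"output_1": np.sin(2.0 * np.pi * x)}),
}

regressor_holder = {}
for class_key, (inputs, outputs) in data.items():
    regressor_holder[class_key] = {}
    for col_key, values in outputs.items():
        gp = GaussianProcessRegressor()  # default kernel, chosen only for brevity
        regressor_holder[class_key][col_key] = gp.fit(inputs, values)

# A trained GP can also report the standard deviation of each prediction.
test_input = np.array([[0.25], [0.5]])
mean, std = regressor_holder["class_1"]["output_1"].predict(test_input, return_std=True)
```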
- fit_linear_ND_interpolator(class_keys, col_keys, data_interval=None, verbose=False)[source]
Fit linear ND interpolator.
Implementation from: scipy.interpolate.LinearNDInterpolator (https://docs.scipy.org/doc/scipy/reference/interpolate.html)
- Parameters
class_keys (list) – List of classes to train on.
col_keys (list) – List of columns in the class to train on. If multiple classes are given, it is assumed they all contain the supplied columns.
data_interval (array, optional) – Array indices of data used to train (training on a subset). If None (default), train on the whole data set.
verbose (bool, optional) – Print statements with more information while training.
- Returns
regressor_holder – Ordered by class specific data and then by column. Nested dictionary maps to a trained LinearNDInterpolator object.
- Return type
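The effect of `data_interval` can be sketched with `scipy.interpolate.LinearNDInterpolator` directly. The helper `fit_linear` and the toy grid data are hypothetical. The sketch assumes `data_interval` is an array of row indices selecting the training subset.

```python
import numpy as np
from scipy.interpolate import LinearNDInterpolator

def fit_linear(inputs, outputs, data_interval=None):
    """Fit one interpolator per output column (hypothetical helper)."""
    idx = np.arange(len(inputs)) if data_interval is None else np.asarray(data_interval)
    return {col: LinearNDInterpolator(inputs[idx], vals[idx])
            for col, vals in outputs.items()}

# Toy 2D inputs on a 4x4 grid; the output column is a plane.
gx, gy = np.meshgrid(np.linspace(0.0, 1.0, 4), np.linspace(0.0, 1.0, 4))
inputs = np.column_stack([gx.ravel(), gy.ravel()])
outputs = {"output_1": inputs[:, 0] + 2.0 * inputs[:, 1]}

regressors = fit_linear(inputs, outputs)
inside = regressors["output_1"](np.array([[0.5, 0.5]]))   # within the training domain
outside = regressors["output_1"](np.array([[2.0, 2.0]]))  # outside the convex hull
```

Linear interpolation reproduces a plane exactly inside the convex hull of the training points. Outside it, `LinearNDInterpolator` returns NaN by default.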
- fit_rbf_interpolator(class_keys, col_keys, data_interval=None, verbose=False)[source]
Fit RBF interpolator.
Implementation from: scipy.interpolate.Rbf (https://docs.scipy.org/doc/scipy/reference/interpolate.html)
- Parameters
class_keys (list) – List of classes to train on.
col_keys (list) – List of columns in the class to train on. If multiple classes are given, it is assumed they all contain the supplied columns.
data_interval (array, optional) – Array indices of data used to train (training on a subset). If None (default), train on the whole data set.
verbose (bool, optional) – Print statements with more information while training.
- Returns
regressor_holder – Ordered by class specific data and then by column. Nested dictionary maps to a trained RBF object.
- Return type
- get_cross_val_data(class_key, col_key, alpha)[source]
Randomly sample the data set and separate training and test data.
- Parameters
- Returns
cross_val_test_input_data (ndarray) – Input data used to test after training on a subset.
cross_val_test_output_data (ndarray) – Output data used to test after training on a subset.
sorted_rnd_int_vals (array) – Indices of original data that were used as training points.
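A minimal sketch of such a split is shown below. `split_for_cross_val` is a hypothetical stand-in and not the library method. It assumes `alpha` is the fraction of points drawn as training data and that all remaining points become test data.

```python
import numpy as np

def split_for_cross_val(inputs, outputs, alpha, seed=None):
    """Return test inputs, test outputs, and the sorted training indices."""
    rng = np.random.default_rng(seed)
    n_points = len(inputs)
    n_train = int(alpha * n_points)
    # Draw training indices without replacement and sort them.
    sorted_rnd_int_vals = np.sort(rng.choice(n_points, size=n_train, replace=False))
    # Every index not drawn for training is used for testing.
    test_idx = np.setdiff1d(np.arange(n_points), sorted_rnd_int_vals)
    return inputs[test_idx], outputs[test_idx], sorted_rnd_int_vals

inputs = np.arange(20.0).reshape(10, 2)
outputs = np.arange(10.0)
test_in, test_out, train_idx = split_for_cross_val(inputs, outputs, alpha=0.4, seed=1)
```

The sorted index array has the same form as the `data_interval` argument of the fit methods, so it could be passed along to train on only the sampled points.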
- get_max_APC_val(regressor_name, class_key, args)[source]
Return the maximum interpolated average percent change for a class.
For a given class and regression method, return the maximum interpolated average percent change value across all APC columns in the class sorted data set. Helper method for constructing target distributions for the Sampler.
- Parameters
- Returns
max_APC (float) – Maximum average percent change (APC) value.
which_col_max (int) – Index of which column had the maximum APC.
- get_predictions(regressor_names, class_keys, col_keys, test_input, return_std=False)[source]
Get predictions from trained regressors for a set of inputs.
- Parameters
regressor_names (list) – List of regressor algorithm names to use to predict.
class_keys (list) – List of classes to get predictions for.
col_keys (list) – List of columns to get predictions for.
test_input (ndarray) – Array of input points for which predictions will be found.
return_std (optional, bool) – Return the standard deviation when using GaussianProcessRegressor.
- Returns
predictions – Dictionary ordered by algorithm, class, and output column mapping to an array of predictions for the test input points.
- Return type
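The layout of the returned dictionary can be illustrated with a small sketch. `collect_predictions` and `regr_dict` are hypothetical names. Plain callables stand in for trained regressor objects, so the example shows only the nesting and not the library's internals.

```python
import numpy as np

# Hypothetical trained regressors stored as Algorithm -> Class -> Output Column -> Object.
regr_dict = {
    "linear": {"class_1": {"output_1": lambda x: 2.0 * x[:, 0]}},
    "rbf": {"class_1": {"output_1": lambda x: 2.0 * x[:, 0] + 0.1}},
}

def collect_predictions(regr_dict, regressor_names, class_keys, col_keys, test_input):
    """Evaluate every requested regressor and mirror the nesting in the output."""
    predictions = {}
    for name in regressor_names:
        predictions[name] = {}
        for class_key in class_keys:
            predictions[name][class_key] = {}
            for col_key in col_keys:
                regressor = regr_dict[name][class_key][col_key]
                predictions[name][class_key][col_key] = regressor(test_input)
    return predictions

test_input = np.array([[1.0], [2.0]])
predictions = collect_predictions(
    regr_dict, ["linear", "rbf"], ["class_1"], ["output_1"], test_input
)
```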
- get_rnd_test_inputs(class_name, N, other_rng={}, verbose=False)[source]
Produce randomly sampled ‘test’ inputs inside domain of input_data.
Input data is separated by class.
- Parameters
class_name (str) – Class name to specify which input data you want to look at.
N (int) – Number of test inputs to return.
other_rng (dict, optional) – Change the range of random sampling in desired axis. By default, the sampling is done in the range of the training data. The axis is specified with an integer key and the value is a list specifying the range. {1:[min, max]}
verbose (bool, optional) – Print diagnostic information. (default False)
- Returns
rnd_test_points – Test points randomly sampled in the range of the training data in each axis unless otherwise specified in ‘other_rng’. Has the same shape as input data from TableData.
- Return type
ndarray
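One way to picture the sampling and the `other_rng` override is the sketch below. `rnd_test_inputs` is a hypothetical helper. Uniform sampling per axis is an assumption, since the text above does not state the distribution used.

```python
import numpy as np

def rnd_test_inputs(input_data, N, other_rng=None, seed=None):
    """Sample N points per axis within the data range, with optional overrides."""
    rng = np.random.default_rng(seed)
    lows = input_data.min(axis=0).astype(float)
    highs = input_data.max(axis=0).astype(float)
    # Replace the default range on any axis listed in other_rng, e.g. {1: [min, max]}.
    for axis, (lo, hi) in (other_rng or {}).items():
        lows[axis], highs[axis] = lo, hi
    return rng.uniform(lows, highs, size=(N, input_data.shape[1]))

input_data = np.array([[0.0, 1.0], [4.0, 3.0], [2.0, 2.0]])
pts = rnd_test_inputs(input_data, N=5, other_rng={1: [10.0, 20.0]}, seed=0)
```

Here axis 0 is sampled within the range of the data, from 0 to 4, while axis 1 uses the overriding range from 10 to 20.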
- mult_diffs(regressor_name, class_key, col_keys, alpha, cutoff, verbose=False)[source]
For multiple calls to cross_validate.
- Parameters
regressor_name (str) – Name of regression algorithm to use.
class_key (str, class_dtype(int or other)) – Name of class data to use.
col_keys (str) – Column keys to cross validate on.
alpha (float) – Fraction of data set to cross validate on.
cutoff (float) – Sets the cutoff percentage at which to calculate the fraction of the data set above or below.
verbose (bool, optional) – Print useful diagnostic information.
- Returns
p_diffs_holder (ndarray) – Percent differences per column.
attr_holder (ndarray) – Contains the number of points outside the cutoff, mean, and standard deviation of the percent difference calculations.
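The summary statistics stored in `attr_holder` could be computed along the following lines. `summarize_percent_diffs` is a hypothetical helper. The sketch assumes "outside the cutoff" means an absolute percent difference greater than `cutoff`.

```python
import numpy as np

def summarize_percent_diffs(p_diffs, cutoff):
    """Return (points outside cutoff, mean, standard deviation) for one column."""
    p_diffs = np.asarray(p_diffs, dtype=float)
    n_outside = int(np.sum(np.abs(p_diffs) > cutoff))  # assumed meaning of "outside"
    return n_outside, p_diffs.mean(), p_diffs.std()

p_diffs = [1.0, -3.0, 12.0, -20.0]
n_outside, mean, std = summarize_percent_diffs(p_diffs, cutoff=10.0)
# n_outside is 2, mean is -2.5, std is 11.5
```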
- plot_regr_data(class_name)[source]
Plot all regression data from the chosen class.
- Parameters
class_name (str) – Specify what class data will be plotted.
- Returns
Plots with all regression data for a given class.
- Return type
matplotlib figure
- train(regressor_name, class_keys, col_keys, di=None, verbose=False)[source]
Train a regression algorithm.
- Implemented regressors:
LinearNDInterpolator (‘linear’, …)
Radial Basis Function (‘rbf’, …)
GaussianProcessRegressor (‘gp’, …)
>>> rg = Regressor(TableData_object)
>>> rg.train('linear', di=np.arange(0, Ndatapoints, 5), verbose=True)
Trained regressor objects are uniquely defined by the algorithm used to train, the data set used to train (grouped by class), and finally the output column (there could be more than one). This motivates the data structure for storing the regressor objects as follows:
Algorithm -> Class -> Output Column -> Object
- Here is a more realistic example of what it could look like:
{RBF: {“class_1”: {“output_1”: {instance of scipy.interpolate.rbf}}}}
- Parameters
regressor_name (string) – Name of regressor to train.
class_keys (list) – List of class(es) to train on.
col_keys (list or None) – For a given class, what columns to train on. If None, it trains on all columns in one class.
di (optional, array) – Array indices of data used to train (training on a subset). If None (default), train on the whole data set.
verbose (optional, bool) – Print statements with more information while training.
- Returns
- Return type
Note: You can train multiple classes at once as long as they have the same columns specified in col_keys.
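The Algorithm -> Class -> Output Column -> Object layout described for train can be sketched with plain dictionaries. `store_regressors` and the `fit` callable are hypothetical stand-ins for the real training step. Strings take the place of trained regressor objects so the nesting is easy to see.

```python
def store_regressors(holder, algorithm, class_keys, col_keys, fit):
    """Fill holder[algorithm][class_key][col_key] for every class and column.

    fit(class_key, col_key) is a hypothetical callable returning a trained object.
    """
    for class_key in class_keys:
        for col_key in col_keys:
            by_class = holder.setdefault(algorithm, {}).setdefault(class_key, {})
            by_class[col_key] = fit(class_key, col_key)
    return holder

holder = {}
# Two classes trained at once; both must provide the column "output_1".
store_regressors(holder, "rbf", ["class_1", "class_2"], ["output_1"],
                 fit=lambda c, k: f"rbf({c}, {k})")
store_regressors(holder, "linear", ["class_1"], ["output_1"],
                 fit=lambda c, k: f"linear({c}, {k})")
```

Because the same `col_keys` are applied to every class in a call, each class must contain those columns, which matches the note above.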