The PSY-CRIS regression module.

class posydon.active_learning.psy_cris.regress.Regressor(TableData_object)[source]

Bases: object

Perform regression/interpolation with different regression algorithms.

Regression algorithms are trained by class and by output column in the data set and stored as instance variables in nested dictionaries.

This class includes a ‘cross validation’ method that trains with the holdout method but calculates differences between true and predicted values instead of a single accuracy.

Initialize the Regressor instance.

Parameters

TableData_object (instance of <class, TableData>) – An instance of the TableData class.

cross_validate(regressor_name, class_key, col_key, alpha, verbose=False)[source]

Our method of cross validation for regression.

Train on a subset of the data and predict values for the rest. Then calculate the difference between the true and predicted value.

Parameters
  • regressor_name – Regressor name to use for analysis.

  • class_key – Class key to take differences.

  • col_key – Column key to take differences.

  • alpha (float) – Fraction of data set used to find differences.

  • verbose (bool, optional) – Print useful information.

Returns

  • percent_diffs (array) – Percent difference.

  • diffs (array) – Absolute difference.
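The holdout procedure described above can be sketched without POSYDON itself. This is a minimal illustration, assuming a synthetic two-dimensional data set, scipy's LinearNDInterpolator as the regressor, a signed difference (predicted minus true), and a percent difference defined relative to the true value. The helper name holdout_diffs is hypothetical, and the definitions used inside cross_validate may differ.

```python
import numpy as np
from scipy.interpolate import LinearNDInterpolator


def holdout_diffs(inputs, outputs, alpha, seed=0):
    """Train on a random fraction `alpha`, predict the rest, return differences.

    Hypothetical helper for illustration; not part of psy_cris.
    """
    rng = np.random.default_rng(seed)
    n_total = len(inputs)
    n_train = int(n_total * alpha)
    train_idx = np.sort(rng.choice(n_total, size=n_train, replace=False))
    test_mask = np.ones(n_total, dtype=bool)
    test_mask[train_idx] = False

    interp = LinearNDInterpolator(inputs[train_idx], outputs[train_idx])
    predicted = interp(inputs[test_mask])
    true = outputs[test_mask]

    # Points outside the convex hull of the training data come back as NaN.
    valid = ~np.isnan(predicted)
    diffs = predicted[valid] - true[valid]
    # Assumed definition of percent difference: relative to the true value.
    percent_diffs = diffs / true[valid] * 100.0
    return percent_diffs, diffs


rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=(200, 2))
y = 1.0 + 2.0 * x[:, 0] + 3.0 * x[:, 1]  # a plane, offset so y is never zero
percent_diffs, diffs = holdout_diffs(x, y, alpha=0.5)
```

Because the synthetic output is a plane, linear interpolation reproduces it up to rounding, so the differences are close to zero; real data would give a spread of values.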

fit_gaussian_process_regressor(class_keys, col_keys, data_interval=None, verbose=False)[source]

Fit a Gaussian Process regressor.

Implementation from: sklearn.gaussian_process (https://scikit-learn.org/stable/modules/gaussian_process.html)

Parameters
  • class_keys – List of classes to train on.

  • col_keys – List of columns in the class to train on. If multiple classes are given, it is assumed they all contain the supplied columns.

  • data_interval (array, optional) – Array indices of data used to train (training on a subset). If None (default), train on the whole data set.

  • verbose (bool, optional) – Print statements with more information while training.

Returns

regressor_holder – Ordered by class specific data and then by column. Nested dictionary maps to a trained GaussianProcessRegressor object.

Return type

dict
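As a rough illustration of what this method wraps, the sketch below fits scikit-learn's GaussianProcessRegressor on a subset of a synthetic data set and stores it in a class-then-column dictionary. The kernel choice, the keys "class_1" and "output_1", and the data are assumptions made for the example; the kernel and settings used inside psy_cris are not stated here.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Synthetic stand-in for the data of one class (hypothetical values).
rng = np.random.default_rng(0)
inputs = rng.uniform(0.0, 5.0, size=(40, 1))
outputs = np.sin(inputs[:, 0])

# Train on every second point, mirroring the data_interval argument.
data_interval = np.arange(0, len(inputs), 2)
gpr = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), random_state=0)
gpr.fit(inputs[data_interval], outputs[data_interval])

# Ordered by class and then by column, as described for regressor_holder.
regressor_holder = {"class_1": {"output_1": gpr}}

# return_std=True yields the predictive standard deviation with the mean.
test_input = np.linspace(0.0, 5.0, 5).reshape(-1, 1)
mean, std = regressor_holder["class_1"]["output_1"].predict(
    test_input, return_std=True
)
```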

fit_linear_ND_interpolator(class_keys, col_keys, data_interval=None, verbose=False)[source]

Fit linear ND interpolator.

Implementation from: scipy.interpolate.LinearNDInterpolator (https://docs.scipy.org/doc/scipy/reference/interpolate.html)

Parameters
  • class_keys (list) – List of classes to train on.

  • col_keys (list) – List of columns in the class to train on. If multiple classes are given, it is assumed they all contain the supplied columns.

  • data_interval (array, optional) – Array indices of data used to train (training on a subset). If None (default), train on the whole data set.

  • verbose (bool, optional) – Print statements with more information while training.

Returns

regressor_holder – Ordered by class specific data and then by column. Nested dictionary maps to a trained linearNDinterpolator object.

Return type

dict

fit_rbf_interpolator(class_keys, col_keys, data_interval=None, verbose=False)[source]

Fit RBF interpolator - binary classification (one against all).

Implementation from: scipy.interpolate.Rbf (https://docs.scipy.org/doc/scipy/reference/interpolate.html)

Parameters
  • class_keys – List of classes to train on.

  • col_keys – List of columns in the class to train on. If multiple classes are given, it is assumed they all contain the supplied columns.

  • data_interval (array, optional) – Array indices of data used to train (training on a subset). If None (default), train on the whole data set.

  • verbose (bool, optional) – Print statements with more information while training.

Returns

regressor_holder – Ordered by class specific data and then by column. Nested dictionary maps to a trained RBF object.

Return type

dict
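For orientation, a minimal sketch of the underlying scipy.interpolate.Rbf call on synthetic two-dimensional data follows. The "linear" basis function and the data are assumptions chosen for the example; the basis function used by fit_rbf_interpolator is not stated here.

```python
import numpy as np
from scipy.interpolate import Rbf

# Synthetic inputs along two axes and one output column (hypothetical data).
rng = np.random.default_rng(0)
x1 = rng.uniform(0.0, 1.0, size=20)
x2 = rng.uniform(0.0, 1.0, size=20)
values = np.sin(2.0 * x1) + x2

# Coordinates are passed one axis at a time, followed by the output values.
rbf = Rbf(x1, x2, values, function="linear")

# With the default smooth=0 the interpolant passes through the training points.
at_nodes = rbf(x1, x2)

# Evaluate at two new locations: (0.25, 0.5) and (0.75, 0.5).
new_points = rbf(np.array([0.25, 0.75]), np.array([0.5, 0.5]))
```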

get_cross_val_data(class_key, col_key, alpha)[source]

Randomly sample the data set and separate training and test data.

Parameters
  • class_key (str, class_dtype(int or other)) – Class key specifying the class to get data from.

  • col_key (str) – Column key specifying the output column to get data.

  • alpha (float) – Fraction of data set to use for training. (0.05 = 5% of data set)

Returns

  • cross_val_test_input_data (ndarray) – Input data used to test after training on a subset.

  • cross_val_test_output_data (ndarray) – Output data used to test after training on a subset.

  • sorted_rnd_int_vals (array) – Indices of original data that were used as training points.

get_max_APC_val(regressor_name, class_key, args)[source]

Return the maximum interpolated average percent change for a class.

For a given class and regression method, return the maximum interpolated average percent change (APC) value across all APC columns in the class-sorted data set. Helper method for constructing target distributions for the Sampler.

Parameters
  • regressor_name (str) – Name of regression algorithm to use.

  • class_key (str) – Class key to use for data.

  • args (array) – Locations for the APC value to be predicted.

Returns

  • max_APC (float) – Maximum average percent change (APC) value.

  • which_col_max (int) – Index of which column had the maximum APC.

get_predictions(regressor_names, class_keys, col_keys, test_input, return_std=False)[source]

Get predictions from trained regressors for a set of inputs.

Parameters
  • regressor_names (list) – List of regressor algorithm names to use to predict.

  • class_keys (list) – List of classes to get predictions for.

  • col_keys (list) – List of columns to get predictions for.

  • test_input (ndarray) – Array of input points for which predictions will be found.

  • return_std (optional, bool) – Return the standard deviation when using GaussianProcessRegressor.

Returns

predictions – Dictionary ordered by algorithm, class, and output column mapping to an array of predictions for the test input points.

Return type

dict

get_regressor_name_to_key(name)[source]

Return the standard key (str) of a regressor.

get_rnd_test_inputs(class_name, N, other_rng={}, verbose=False)[source]

Produce randomly sampled ‘test’ inputs inside domain of input_data.

Input data is separated by class.

Parameters
  • class_name (str) – Class name to specify which input data you want to look at.

  • N (int) – Number of test inputs to return.

  • other_rng (dict, optional) – Change the range of random sampling in desired axis. By default, the sampling is done in the range of the training data. The axis is specified with an integer key and the value is a list specifying the range. {1:[min, max]}

  • verbose (bool, optional) – Print diagnostic information. (default False)

Returns

rnd_test_points – Test points randomly sampled in the range of the training data in each axis unless otherwise specified in ‘other_rng’. Has the same shape as input data from TableData.

Return type

ndarray
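The sampling behaviour and the other_rng format can be sketched as follows. This is an assumption-laden stand-in: the function name rnd_test_inputs, the use of uniform sampling, and the seed argument are hypothetical and not taken from psy_cris.

```python
import numpy as np


def rnd_test_inputs(input_data, n_points, other_rng=None, seed=0):
    """Sample points uniformly inside the per-axis range of `input_data`.

    Hypothetical stand-in for illustration; `other_rng` maps an integer
    axis to [min, max] and overrides the range taken from the data.
    """
    rng = np.random.default_rng(seed)
    other_rng = other_rng or {}
    lows = input_data.min(axis=0).astype(float)
    highs = input_data.max(axis=0).astype(float)
    for axis, (low, high) in other_rng.items():
        lows[axis], highs[axis] = low, high
    # One row per test point, one column per input axis.
    return rng.uniform(lows, highs, size=(n_points, input_data.shape[1]))


training_inputs = np.array([[0.0, 10.0], [1.0, 20.0], [0.5, 15.0]])
points = rnd_test_inputs(training_inputs, 100, other_rng={1: [12.0, 13.0]})
```

Here axis 0 is sampled over the range of the training data, while axis 1 is restricted to the interval given in other_rng.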

mult_diffs(regressor_name, class_key, col_keys, alpha, cutoff, verbose=False)[source]

Make multiple calls to cross_validate, one for each column key.

Parameters
  • regressor_name (str) – Name of regression algorithm to use.

  • class_key (str, class_dtype(int or other)) – Name of class data to use.

  • col_keys (str) – Column keys to cross validate on.

  • alpha (float) – Fraction of data set to cross validate on.

  • cutoff (float) – Sets the cutoff percentage at which to calculate the fraction of the data set above or below.

  • verbose (bool, optional) – Print useful diagnostic information.

Returns

  • p_diffs_holder (ndarray) – Percent differences per column.

  • attr_holder (ndarray) – Contains the number of points outside the cutoff, mean, and standard deviation of the percent difference calculations.
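The three summary quantities listed for attr_holder can be illustrated with a short sketch. It assumes that "outside the cutoff" means the absolute percent difference exceeds the cutoff; the function name summarize_percent_diffs and that interpretation are assumptions, not confirmed by the documentation above.

```python
import numpy as np


def summarize_percent_diffs(percent_diffs, cutoff):
    """Return [number outside cutoff, mean, standard deviation].

    Hypothetical helper; assumes "outside" means |percent diff| > cutoff.
    """
    n_outside = int(np.sum(np.abs(percent_diffs) > cutoff))
    return np.array([n_outside, np.mean(percent_diffs), np.std(percent_diffs)])


p_diffs = np.array([-12.0, -1.0, 0.5, 2.0, 30.0])
attrs = summarize_percent_diffs(p_diffs, cutoff=10.0)
```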

plot_regr_data(class_name)[source]

Plot all regression data from the chosen class.

Parameters

class_name (str) – Specify which class data will be plotted.

Returns

Plots with all regression data for a given class.

Return type

matplotlib figure

show_structure()[source]

Show (print) the structure of the regression data.

train(regressor_name, class_keys, col_keys, di=None, verbose=False)[source]

Train a regression algorithm.

Implemented regressors:

  • LinearNDInterpolator (‘linear’, …)

  • Radial Basis Function (‘rbf’, …)

  • GaussianProcessRegressor (‘gp’, …)

>>> rg = Regressor(TableData_object)
>>> rg.train('linear', class_keys, col_keys, di=np.arange(0, Ndatapoints, 5), verbose=True)

Trained regressor objects are uniquely defined by the algorithm used to train, the data set used to train (grouped by class), and finally the output column (there could be more than one). This motivates the data structure for storing the regressor objects as follows:

Algorithm -> Class -> Output Column -> Object

Here is a more realistic example of what it could look like:

{RBF: {“class_1”: {“output_1”: {instance of scipy.interpolate.rbf}}}}

Parameters
  • regressor_name (string) – Name of regressor to train.

  • class_keys (list) – List of class(es) to train on.

  • col_keys (list or None) – For a given class, what columns to train on. If None, it trains on all columns in one class.

  • di (optional, array) – Array indices of data used to train (training on a subset). If None (default), train on the whole data set.

  • verbose (optional, bool) – Print statements with more information while training.

Returns

Return type

None

Note: You can train multiple classes at once as long as they have the same columns specified in col_keys.

train_everything(regressor_names, verbose=False)[source]

Train all classes and columns with the specified list of regressors.

Parameters
  • regressor_names (list) – List of strings specifying all the regressors to train.

  • verbose (optional, bool) – Print useful information.

Returns

Return type

None

posydon.active_learning.psy_cris.regress.makehash()[source]

Manage nested dictionaries.
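A common way to manage nested dictionaries of this kind is a self-referencing defaultdict, sketched below. Whether makehash is implemented this way is an assumption; the function name make_nested_dict and the stored placeholder string are hypothetical.

```python
from collections import defaultdict


def make_nested_dict():
    """Return a dictionary that creates nested dictionaries on first access."""
    return defaultdict(make_nested_dict)


# Algorithm -> Class -> Output Column -> Object, without pre-creating levels.
holder = make_nested_dict()
holder["rbf"]["class_1"]["output_1"] = "trained regressor placeholder"
```

The intermediate levels ("rbf" and "class_1") are created automatically on assignment, which is convenient when regressors are added one class and column at a time.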