The PSY-CRIS classification module.
- class posydon.active_learning.psy_cris.classify.Classifier(TableData_object)[source]
Bases: object
Classifier class.
Perform one-against-all classification with a variety of classification algorithms (interpolators). Each classification algorithm is trained and stored for recall as an instance variable inside nested dictionaries. This class also supports model validation through cross-validation using the holdout method.
Initialize the classifier.
- Parameters
TableData_object (instance of <class, TableData>) – An instance of the TableData class with training data.
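As a minimal sketch of the one-against-all scheme described above (not the class's actual implementation), each class can be turned into its own binary target on which a separate interpolator is then trained. The helper name `one_vs_all_targets` and the toy labels are hypothetical.

```python
import numpy as np

def one_vs_all_targets(class_labels):
    """Return the sorted unique classes and one binary target array per class."""
    labels = np.asarray(class_labels)
    unique_classes = np.unique(labels).tolist()  # sorted unique class keys
    # For each class: 1.0 where a sample belongs to it, 0.0 everywhere else.
    targets = {cls: (labels == cls).astype(float) for cls in unique_classes}
    return unique_classes, targets

classes, binary_targets = one_vs_all_targets(["A", "B", "A", "C"])
```

Each entry of `binary_targets` plays the role of the training output for one binary classifier.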
- cross_validate(classifier_names, alpha, verbose=False)[source]
Cross validate classifiers on data from TableData object.
For each iteration, the classifiers specified are all trained and tested on the same random subset of data.
- Parameters
- Returns
percent_correct (ndarray) – Percent of correct classifications on the held-out test data, i.e. the fraction (1 - alpha) of the data set not used for training. Element order matches the order of classifier_names.
time_to_train (ndarray) – Time to train classifiers on a data set. Element order matches the order of classifier_names.
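The idea of scoring several classifiers on the same random subset can be sketched as follows. This is an illustration under stated assumptions, not the method's code: the two toy classifiers (nearest neighbour and majority class) and the helper `score_on_shared_split` are hypothetical stand-ins for the PSY-CRIS interpolators.

```python
import time
import numpy as np

def fit_nearest(x_train, y_train):
    # Toy "training": a nearest-neighbour model just stores the data.
    return x_train, y_train

def predict_nearest(model, x_new):
    x_train, y_train = model
    dists = np.linalg.norm(x_new[:, None, :] - x_train[None, :, :], axis=2)
    return y_train[np.argmin(dists, axis=1)]

def fit_majority(x_train, y_train):
    # Toy baseline: remember the most common training label.
    values, counts = np.unique(y_train, return_counts=True)
    return values[np.argmax(counts)]

def predict_majority(model, x_new):
    return np.full(len(x_new), model)

def score_on_shared_split(classifiers, inputs, outputs, train_idx):
    """Train every classifier on train_idx and test on the remaining rows."""
    test_mask = np.ones(len(inputs), dtype=bool)
    test_mask[train_idx] = False
    percent_correct, time_to_train = [], []
    for fit, predict in classifiers.values():
        start = time.perf_counter()
        model = fit(inputs[train_idx], outputs[train_idx])
        time_to_train.append(time.perf_counter() - start)
        predicted = predict(model, inputs[test_mask])
        percent_correct.append(100.0 * np.mean(predicted == outputs[test_mask]))
    return np.array(percent_correct), np.array(time_to_train)

rng = np.random.default_rng(0)
inputs = rng.uniform(-1.0, 1.0, size=(200, 2))
outputs = (inputs[:, 0] > 0).astype(int)
# 20 of 200 points for training corresponds to alpha = 0.1.
train_idx = np.sort(rng.choice(len(inputs), size=20, replace=False))
classifiers = {
    "nearest": (fit_nearest, predict_nearest),
    "majority": (fit_majority, predict_majority),
}
percent_correct, time_to_train = score_on_shared_split(
    classifiers, inputs, outputs, train_idx
)
```

Both returned arrays follow the insertion order of `classifiers`, mirroring how the documented return values follow the order of classifier_names.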
- fit_gaussian_process_classifier(data_interval=None, my_kernel=None, n_restarts=5, verbose=False)[source]
Fit a Gaussian Process classifier.
Implementation from: sklearn.gaussian_process (https://scikit-learn.org/stable/modules/gaussian_process.html)
- Parameters
data_interval (array_int, optional) – Array indices of the data used to train (training on a subset). If None (default), train on the whole data set.
my_kernel (kernel) – Set the kernel for the GPC.
n_restarts (int) – Number of restarts for the GPC.
verbose (bool, optional) – Print statements with more information while training.
- Returns
binary_classifier_holder – Sorted by class, each key maps to a trained GaussianProcessClassifier object.
- Return type
array_like
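A rough sketch of one-against-all training with scikit-learn's GaussianProcessClassifier is shown below. It is not the method's source: the toy data, the kernel choice `1.0 * RBF(length_scale=1.0)`, and the value passed to `n_restarts_optimizer` are assumptions made for illustration only.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

# Toy training data (assumed): three classes split along the first axis.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(60, 2))
labels = np.where(X[:, 0] < -0.3, "A", np.where(X[:, 0] < 0.3, "B", "C"))

binary_classifier_holder = {}
for cls in np.unique(labels).tolist():
    # One-against-all target: 1 for this class, 0 for every other class.
    binary_target = (labels == cls).astype(int)
    gpc = GaussianProcessClassifier(
        kernel=1.0 * RBF(length_scale=1.0),  # assumed kernel, for illustration
        n_restarts_optimizer=2,
        random_state=0,
    )
    gpc.fit(X, binary_target)
    binary_classifier_holder[cls] = gpc

# Column 1 of predict_proba is the probability of the positive (this-class) label.
test_point = np.array([[0.8, 0.0]])
prob_C = binary_classifier_holder["C"].predict_proba(test_point)[0, 1]
```

Note that each binary target must contain both labels, since scikit-learn's GaussianProcessClassifier raises an error when fit on a single class.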
- fit_linear_ND_interpolator(data_interval=None, verbose=False)[source]
Fit linear ND interpolator - binary one-against-all classification.
Implementation from: scipy.interpolate.LinearNDInterpolator (https://docs.scipy.org/doc/scipy/reference/interpolate.html)
- Parameters
data_interval (array_int, optional) – Array indices of the data used to train (training on a subset). If None (default), train on the whole data set.
verbose (bool, optional) – Print statements with more information while training.
- Returns
binary_classifier_holder – Sorted by class, each key maps to a trained LinearNDInterpolator object.
- Return type
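The following sketch shows how a binary one-against-all target can be interpolated with scipy.interpolate.LinearNDInterpolator. The four training points and their labels are made up for illustration, and the dictionary layout only loosely mirrors the documented binary_classifier_holder.

```python
import numpy as np
from scipy.interpolate import LinearNDInterpolator

# Toy training data (assumed): class "A" along y = 0, class "B" along y = 1.
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
labels = np.array(["A", "A", "B", "B"])

binary_classifier_holder = {}
for cls in np.unique(labels).tolist():
    # Interpolate a 0/1 membership function for this class.
    target = (labels == cls).astype(float)
    binary_classifier_holder[cls] = LinearNDInterpolator(points, target)

# Inside the convex hull the interpolated value lies between 0 and 1.
inside = binary_classifier_holder["A"](np.array([[0.3, 0.6]]))
# Outside the convex hull LinearNDInterpolator returns NaN by default.
outside = binary_classifier_holder["A"](np.array([[2.0, 2.0]]))
```

The NaN returned outside the convex hull of the training points is the reason other methods on this page report which inputs gave a valid result.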
- fit_rbf_interpolator(data_interval=None, verbose=False)[source]
Fit RBF interpolator - binary classification (one against all).
Implementation from: scipy.interpolate.Rbf (https://docs.scipy.org/doc/scipy/reference/interpolate.html)
- Parameters
data_interval (array_int, optional) – Array indices of the data used to train (training on a subset). If None (default), train on the whole data set.
verbose (bool, optional) – Print statements with more information while training.
- Returns
binary_classifier_holder – Sorted by class, each key maps to a trained RBF object.
- Return type
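A comparable sketch for scipy.interpolate.Rbf follows. The toy points are assumptions, and the default Rbf settings are used here; the source does not state which settings the method itself passes.

```python
import numpy as np
from scipy.interpolate import Rbf

# Toy training data (assumed): five points with two classes.
points = np.array(
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
)
labels = np.array(["A", "A", "B", "B", "A"])

binary_classifier_holder = {}
for cls in np.unique(labels).tolist():
    target = (labels == cls).astype(float)
    # Rbf takes one coordinate array per axis, followed by the values.
    binary_classifier_holder[cls] = Rbf(*points.T, target)

# With the default smooth=0 the interpolant passes through the training values.
recovered = binary_classifier_holder["A"](points[:, 0], points[:, 1])
```

Unlike linear interpolation, an RBF interpolant is defined everywhere but its values are not guaranteed to stay within [0, 1] between training points.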
- get_class_predictions(classifier_name, test_input, return_ids=True)[source]
Return the class predictions.
The predictions are in the form of class IDs or the original classification key. This method also returns the probability of the class that was predicted.
- Parameters
- Returns
pred_class_ids (array) – Predicted class IDs given test input.
max_probs (array) – Probability the classifier gives for the chosen class.
where_not_nan (array) – Indices where there are no NaNs (from LinearNDInterpolator). You may use this to pick out which input data gives a valid classification.
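To illustrate how the three return values relate, here is a small sketch that derives them from an array of per-class probabilities. The function name `predictions_from_probs` is hypothetical, and dropping the NaN rows before taking the argmax is an assumption about the behavior rather than something the source confirms.

```python
import numpy as np

def predictions_from_probs(probs):
    """Pick the most probable class for every row that contains no NaN."""
    probs = np.asarray(probs, dtype=float)
    # Rows where every class probability is a valid number.
    where_not_nan = np.where(~np.isnan(probs).any(axis=1))[0]
    valid = probs[where_not_nan]
    pred_class_ids = np.argmax(valid, axis=1)
    max_probs = valid[np.arange(len(valid)), pred_class_ids]
    return pred_class_ids, max_probs, where_not_nan

probs = [[0.2, 0.8], [np.nan, np.nan], [0.6, 0.4]]
pred_class_ids, max_probs, where_not_nan = predictions_from_probs(probs)
```

Indexing the original test inputs with `where_not_nan` then selects exactly the points that received a classification.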
- get_cross_val_data(alpha)[source]
Randomly sample the data set and separate training and test data.
- Parameters
alpha (float) – Fraction of data set to use for training. (0.05 = 5% of data set)
- Returns
sorted_rnd_int_vals (array) – Array indices for data used to train interpolators.
cv_test_input_data (array) – Input test data to perform cross validation.
cv_test_output_data (array) – Output test data to perform cross validation.
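A minimal sketch of such a holdout split is given below, assuming the training points are drawn without replacement and every remaining point becomes test data. The function name `split_for_cross_val` and the rounding of `alpha * n` are assumptions.

```python
import numpy as np

def split_for_cross_val(inputs, outputs, alpha, rng):
    """Return sorted training indices plus the held-out inputs and outputs."""
    n_points = len(inputs)
    n_train = int(round(alpha * n_points))
    # Sorted random indices of the rows used for training.
    train_idx = np.sort(rng.choice(n_points, size=n_train, replace=False))
    test_mask = np.ones(n_points, dtype=bool)
    test_mask[train_idx] = False
    return train_idx, inputs[test_mask], outputs[test_mask]

rng = np.random.default_rng(42)
inputs = rng.uniform(0.0, 1.0, size=(100, 3))
outputs = rng.integers(0, 2, size=100)
train_idx, test_inputs, test_outputs = split_for_cross_val(
    inputs, outputs, alpha=0.05, rng=rng
)
```

With alpha = 0.05 and 100 data points, 5 indices are used for training and the other 95 rows are returned as test data.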
- get_rnd_test_inputs(N, other_rng={}, verbose=False)[source]
Produce randomly sampled ‘test’ inputs inside domain of input_data.
- Parameters
N (int) – Number of test inputs to return.
other_rng (dict, optional) – Change the range of random sampling in desired axis. By default, the sampling is done in the range of the training data. The axis is specified with an integer key in [0,N-1] mapping to a list specifying the range. (e.g. {1:[my_min, my_max]})
verbose (bool, optional) – Print diagnostic information.
- Returns
rnd_test_points – Test points randomly sampled in the range of the training data in each axis unless otherwise specified in ‘other_rng’. Has the same shape as input data from TableData.
- Return type
ndarray
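The sampling described above can be sketched with uniform draws between the per-axis minimum and maximum of the training inputs. The function name `random_test_inputs`, the use of a uniform distribution, and the output shape of (N, number of axes) are assumptions made for this illustration.

```python
import numpy as np

def random_test_inputs(input_data, n_points, other_rng=None, rng=None):
    """Sample n_points uniformly inside the per-axis range of input_data."""
    rng = np.random.default_rng() if rng is None else rng
    input_data = np.asarray(input_data, dtype=float)
    lows = input_data.min(axis=0)
    highs = input_data.max(axis=0)
    # Override the sampling range on selected axes, e.g. {1: [my_min, my_max]}.
    for axis, (low, high) in (other_rng or {}).items():
        lows[axis], highs[axis] = low, high
    return rng.uniform(lows, highs, size=(n_points, input_data.shape[1]))

data = np.array([[0.0, 10.0], [2.0, 20.0], [1.0, 15.0]])
test_points = random_test_inputs(
    data, 10, other_rng={1: [5.0, 6.0]}, rng=np.random.default_rng(1)
)
```

Here axis 0 is sampled over the data range [0, 2], while axis 1 uses the overriding range [5, 6].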
- make_cv_plot_data(interp_type, alphas, N_iterations, folder_path='cv_data/')[source]
Script for running many instances of the method cross_validate().
Cross validation score and timing data produced are saved locally.
! Time to train GaussianProcessClassifier becomes large for num training points > 1000. !
Files are saved every 5 iterations to prevent loss of data for large N_iterations. A known exception occurs in the GP classifier at low alpha when the training subset contains only one class.
- Parameters
interp_type (array_str) – Names of classifiers to train.
alphas (array_floats) – Fractions of data set to use for training. (0.05 = 5% of data set) (ex. [0.01, 0.02, …])
N_iterations (int) – Number of iterations to run cross validation at a given alpha.
folder_path (str) – Folder path where to save cross validation and timing data (“your_folder_path/”).
- Returns
- Return type
- make_max_cls_plot(classifier_name, axes_keys, other_rng={}, N=4000, **kwargs)[source]
Make the maximum classification probability plot.
Not generalized yet to slice along redundant axes.
- return_probs(classifier_name, test_input, verbose=False)[source]
Return probability that a given input corresponds to a class.
The probability is calculated using trained classifiers.
- Parameters
- Returns
normalized_probs (ndarray) – Array holding the normalized probability for a point to be in any of the possible classes. Shape is N_points x N_classes.
where_not_nan (ndarray) – Indices of the test inputs that did not result in NaNs.
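One way to turn the raw outputs of the binary classifiers into normalized probabilities is to divide each row by its sum, as sketched below. This is an assumption about the normalization, not the method's confirmed implementation; in particular, clipping negative values to zero and the function name `normalize_probs` are choices made only for this example.

```python
import numpy as np

def normalize_probs(raw_scores):
    """Normalize per-class scores (N_points x N_classes) so each row sums to 1."""
    raw = np.asarray(raw_scores, dtype=float)
    # Rows without any NaN are the ones with a usable classification.
    where_not_nan = np.where(~np.isnan(raw).any(axis=1))[0]
    clipped = np.clip(raw, 0.0, None)  # assumed: negative scores count as zero
    totals = clipped.sum(axis=1, keepdims=True)
    # Rows containing NaN (or summing to zero) come out as NaN.
    with np.errstate(invalid="ignore", divide="ignore"):
        normalized = clipped / totals
    return normalized, where_not_nan

raw = [[0.5, 1.5], [np.nan, 0.2], [0.0, 2.0]]
normalized_probs, where_not_nan = normalize_probs(raw)
```

Rows listed in `where_not_nan` sum to one, while the row containing a NaN stays NaN and is excluded from the index array.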
- train(classifier_name, di=None, verbose=False, **kwargs)[source]
Train a classifier.
- Implemented classifiers:
LinearNDInterpolator (‘linear’, …)
Radial Basis Function (‘rbf’, …)
GaussianProcessClassifier (‘gp’, …)
>>> cl = Classifier(TableData_object)
>>> cl.train('linear', di=np.arange(0, Ndatapoints, 5), verbose=True)
- Parameters
classifier_name (str) – Name of classifier to train.
di (array_int, optional) – Array indices of the data used to train (training on a subset). If None, train on the whole data set.
train_cross_val (bool, optional) – Selects where the trained interpolators are stored; used by the cross_validate() method. If False, save them as the regular interpolators. If True, save them as cross-validation interpolators.
verbose (bool, optional) – Print statements with more information while training.
- Returns
- Return type