Module defining helper functions for PSY-CRIS.

posydon.active_learning.psy_cris.utils.calc_performance(dfs_per_iter, cls_name='linear', regr_name='rbf', resolution=400, verbose=False, **kwargs)[source]

Calculate accuracy and confusion matrix.

Given a list of pandas DataFrames, iterate over them and calculate the accuracy and confusion matrix for synthetic data sets.

Parameters
  • dfs_per_iter (list) – List of pandas DataFrames containing training data to train a classifier on and then compare to the true background distribution.

  • cls_name (str, optional) – Name of classifier to train.

  • resolution (int, optional) – Density per axis of the grid used to oversample the true background.

  • verbose (bool, optional) – Print some helpful info.

Returns

  • acc_per_iter (array) – Array containing overall accuracy of interpolator per iteration of training data.

  • conf_matrix_per_iter (list) – List of confusion matrices per iteration. Calculated using ‘get_confusion_matrix’.

  • regr_acc_per_iter (list) – List of regression accuracy terms.

posydon.active_learning.psy_cris.utils.calc_regression_accuracy(all_regr_abs_frac_diffs, cdf_cutoff_limits=None)[source]

Calculate the fractional difference below which a given fraction of the data lies.

For a given distribution of absolute fractional differences, calculate the fractional difference at which the cumulative distribution function (CDF) reaches the cdf_cutoff.

For example, with a cutoff of 50%, the question is: for 50% of the data set, the range of fractional differences is [0, ?].

(What is not being asked: for a fractional difference of 10%, what fraction of the data has a fractional difference <= that number?)
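The distinction above can be illustrated with a small sketch. It is not the library's implementation: `frac_diff_at_cdf_cutoff` is a hypothetical helper written in plain Python, and it assumes the answer is read off the sorted sample (an empirical CDF without interpolation).

```python
import math


def frac_diff_at_cdf_cutoff(abs_frac_diffs, cdf_cutoffs):
    """For each cutoff, return the smallest observed value v such that
    at least that fraction of the samples is <= v.

    Hypothetical helper for illustration only.
    """
    ordered = sorted(abs_frac_diffs)
    n = len(ordered)
    results = []
    for cutoff in cdf_cutoffs:
        # Number of samples that must lie at or below the returned value.
        k = max(1, math.ceil(cutoff * n))
        results.append(ordered[k - 1])
    return results


# Toy absolute fractional differences (made-up numbers).
diffs = [0.01, 0.02, 0.05, 0.10, 0.40]
limits = frac_diff_at_cdf_cutoff(diffs, [0.5, 0.9])
print(limits)
```

With these toy numbers, 50% of the samples have a fractional difference in the range [0, 0.05].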

posydon.active_learning.psy_cris.utils.check_dist(original, proposed, threshold=1e-05)[source]

Check Euclidean distance between the original and proposed points.

Proposed points at a distance >= threshold from the original points are accepted.

Parameters
  • original (ndarray) – Original points previously run.

  • proposed (ndarray) – Proposed points for new simulations.

  • threshold (float, optional) – The threshold distance between acceptance and rejection.

Returns

proposed_above_thresh_for_all_original – True if the distance between the proposed point and every original point is >= threshold.

Return type

bool, array

Notes

The purpose of this function is to avoid proposing points that lie within some threshold distance of already accepted points.
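A minimal sketch of this kind of distance filter is shown here. It uses plain Python rather than ndarrays, and `accept_far_points` is a hypothetical name, so treat it as an illustration of the idea rather than the function's actual code.

```python
import math


def accept_far_points(original, proposed, threshold=1e-5):
    """Return one boolean per proposed point: True when the point is at
    least `threshold` away (Euclidean) from every original point.

    Hypothetical helper for illustration only.
    """
    return [
        all(math.dist(p, o) >= threshold for o in original)
        for p in proposed
    ]


original = [(0.0, 0.0), (1.0, 1.0)]
proposed = [(0.0, 0.000001), (0.5, 0.5)]  # first point nearly duplicates (0, 0)
mask = accept_far_points(original, proposed)
print(mask)
```

The first proposed point is rejected because it sits 1e-6 away from an existing point, which is below the default threshold of 1e-5.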

posydon.active_learning.psy_cris.utils.do_dynamic_sampling(N_final_points=100, new_points_per_iter=20, verbose=False, threshold=1e-05, N_starting_points=100, jitter=False, dim=2, length_scale_mult=0.33, percent_increase=None, show_plots=False, **all_kwargs)[source]

Run cris algorithm iteratively.

For a given number of starting and ending points, run the cris algorithm iteratively in step sizes of new_points_per_iter. After each iteration, query points are identified using the original 2D synthetic data set.

Parameters
  • N_starting_points (int) – Number of starting points to begin cris iterations on a 2D grid sampled from the original 2D synthetic data set.

  • N_final_points (int) – Number of points to converge to after iterating with cris.

  • new_points_per_iter (int, array-like) – For every iteration the number of new query points for cris to propose.

  • threshold (float) – New query points are omitted from the next iteration if their Euclidean distance to other data points is less than the threshold.

  • jitter (bool) – Default False. For the starting grid, jitter about the center randomly in the range of +/- 1/2 the bin width in each dimension.

  • verbose (bool) – Print useful things.

  • show_plots (bool) – Show plots of proposed points and training points each iteration.

  • all_kwargs (dict) – Dictionary of all_kwargs passed to get_new_query_points defining how every part of the cris algorithm is implemented.
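The point-count bookkeeping implied by these parameters can be sketched as follows. This is only an illustration: `iteration_sizes` is a hypothetical helper, it assumes a single integer step size, and it assumes the last step is clamped so the total lands exactly on N_final_points. The real function may behave differently, for instance when proposed points are rejected by the threshold check.

```python
def iteration_sizes(n_start, n_final, new_per_iter):
    """List the training-set size at the start of each iteration.

    Hypothetical helper; the clamping of the final step is an assumption.
    """
    if new_per_iter <= 0:
        raise ValueError("new_per_iter must be positive")
    sizes = [n_start]
    while sizes[-1] < n_final:
        # Add a batch of new query points, without overshooting the target.
        sizes.append(min(sizes[-1] + new_per_iter, n_final))
    return sizes


sizes = iteration_sizes(n_start=100, n_final=150, new_per_iter=20)
print(sizes)
```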

posydon.active_learning.psy_cris.utils.do_small_class_proposal(table_data, n_neighbors, n_new_points=1, length_scale_mult=0.33, neighbor=None, verbose=False)[source]

Handle proposals where a class has only one point and regression is needed.

Handles proposals where only one point exists for a class and regression is requested; otherwise the interpolators will fail. New points are drawn from a Gaussian around the point, with length scale given by the average distance between nearest neighbors (in each axis).

Parameters
  • data (ndarray) – Data to train the NearestNeighbors class on.

  • n_neighbors (int) – Number of neighbors to use when finding average distance.

  • neighbor (instance of NearestNeighbors class) – For passing your own object.

Returns

proposed_points – Points sampled around classes with 1 value. Points are distributed as evenly as possible across all 1 value classes by default. Returns None in case of no small classes found.

Return type

array
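The sampling idea described for this function can be sketched in plain Python. Everything here is an assumption made for illustration: `propose_around_lone_point` is a hypothetical helper, neighbors are found by brute force instead of a NearestNeighbors object, and length_scale_mult is assumed to multiply the per-axis mean neighbor distance.

```python
import math
import random


def propose_around_lone_point(data, lone_point, n_neighbors, n_new_points=1,
                              length_scale_mult=0.33, seed=None):
    """Draw points from a Gaussian centered on `lone_point`.

    The per-axis standard deviation is the mean absolute offset to the
    nearest neighbors along that axis, scaled by `length_scale_mult`.
    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    others = [p for p in data if p != lone_point]
    # Brute-force nearest-neighbor search by Euclidean distance.
    nearest = sorted(others, key=lambda p: math.dist(p, lone_point))[:n_neighbors]
    dim = len(lone_point)
    scales = [
        length_scale_mult * sum(abs(p[i] - lone_point[i]) for p in nearest) / len(nearest)
        for i in range(dim)
    ]
    points = [
        tuple(rng.gauss(lone_point[i], scales[i]) for i in range(dim))
        for _ in range(n_new_points)
    ]
    return points, scales


data = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
points, scales = propose_around_lone_point(
    data, (0.0, 0.0), n_neighbors=2, n_new_points=3, seed=0
)
print(scales)
```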

posydon.active_learning.psy_cris.utils.get_confusion_matrix(preds, actual, all_classes, verbose=False)[source]

Calculate a confusion matrix given lists of predicted and actual values.

Parameters
  • preds (list) – Predicted values from the classifier.

  • actual (list) – True values from the underlying distribution.

  • all_classes (list) – A list of all unique classes. Should be either np.unique(actual) or a subset thereof.

  • verbose (bool, optional) – Print out the line by line confusion matrix prefixed with the class.

Returns

confusion_matrix – Rows and columns of confusion matrix in order and number given in all_classes.

Return type

ndarray
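For readers unfamiliar with confusion matrices, a minimal sketch follows. It is not the library's code: `confusion_counts` is a hypothetical helper, and it assumes rows index the actual class, columns index the predicted class, and entries are raw counts. The real function may use a different orientation or normalization.

```python
def confusion_counts(preds, actual, all_classes):
    """Count (actual, predicted) pairs in the order given by `all_classes`.

    Hypothetical helper; the row/column convention is an assumption.
    """
    index = {cls: i for i, cls in enumerate(all_classes)}
    n = len(all_classes)
    matrix = [[0] * n for _ in range(n)]
    for p, a in zip(preds, actual):
        # Pairs involving a class outside `all_classes` are skipped.
        if a in index and p in index:
            matrix[index[a]][index[p]] += 1
    return matrix


preds = ["A", "B", "B", "A", "C"]
actual = ["A", "B", "A", "A", "C"]
matrix = confusion_counts(preds, actual, ["A", "B", "C"])

# Overall accuracy is the diagonal sum divided by the number of samples.
accuracy = sum(matrix[i][i] for i in range(3)) / len(actual)
print(matrix, accuracy)
```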

posydon.active_learning.psy_cris.utils.get_new_query_points(N_new_points=1, TableData_kwargs={}, Classifier_kwargs={}, Regressor_kwargs={}, Sampler_kwargs={}, Proposal_kwargs={}, length_scale_mult=0.33, threshold=1e-05, **kwargs)[source]

Run the psy-cris algorithm to propose new query points to be labeled.

Parameters
  • N_new_points (int, optional) – Number of new query points desired.

  • TableData_kwargs (dict, optional) – Kwargs used for initializing TableData.

  • Classifier_kwargs (dict, optional) – Kwargs used for the Classifier method train_everything.

  • Regressor_kwargs (dict, optional) – Kwargs used for the Regressor method train_everything.

  • Sampler_kwargs (dict, optional) – Kwargs used for choosing Sampler target distribution and the method run_PTMCMC.

  • Proposal_kwargs (dict, optional) – Kwargs used in the Sampler method ‘get_proposed_points’ and the Classifier method ‘get_class_predictions’.

Returns

  • proposed_points (ndarray) – New query points.

  • pred_class (array) – For all proposed points, the best prediction from the trained classifier.

posydon.active_learning.psy_cris.utils.get_prediction_diffs(training_df, classifier_name='linear', regressor_name='linear', N=400, verbose=False, **kwargs)[source]

Train classifier and get predictions and actual classification.

From a DataFrame of training data, train a classifier and get both the predictions and actual classification in the classification space where the analytic function is defined. Also calculate the difference between the true regression function and that inferred from the trained regressor. Dimensionality is inferred from ‘training_df’.

Parameters
  • training_df (pandas DataFrame) – DataFrame of training data, a subset of the true distribution.

  • classifier_name (str) – Name of the classification algorithm to use.

  • N (int) – Sets the (N**dim) resolution of points used to query the trained classifier.

  • verbose (bool, optional) – Print more useful information.

  • timer (bool, optional) – Print timing diagnostic information.

Returns

  • pred_class (array) – 1D array of predictions from the trained classifier.

  • true_class_result (array) – 1D array of the true classification for the corresponding points.

  • all_regr_acc_per_class (dict) – Dict of lists with regression accuracy values per class and combined.

posydon.active_learning.psy_cris.utils.get_random_grid_df(N, dim=2)[source]

Produce a randomly sampled grid.

Given N total points, produce a randomly sampled grid drawn from the analytic data set (2D or 3D).

Parameters
  • N (int) – Total number of points to draw from a 2D random data set.

  • dim (int) – Dimensionality of synthetic data set. (2 or 3)

Returns

random_df – DataFrame of true data drawn from the analytic classification and regression functions.

Return type

pandas DataFrame

posydon.active_learning.psy_cris.utils.get_regular_grid_df(N=100, jitter=False, verbose=False, N_ppa=None, dim=2)[source]

Produce an even grid.

Given N total points, produce an even grid with approximately the same number of evenly spaced points sampled from the analytic data set (2D or 3D).

The number of returned grid points is N only if N is a perfect square. Otherwise use N_ppa to define number of points per axis.

Parameters
  • N (int) – Total number of points to make into a 2D even grid

  • jitter (bool, optional) – Place the center of the grid randomly around (0,0) in the range of +/- 1/2 bin width while keeping the span in each axis at 6.

  • N_ppa (array, optional) – Numbers of points per axis. If provided, it overrides N.

  • dim (int, optional) – Dimensionality of synthetic data set. (2 or 3)

  • verbose (bool, optional) – Print some diagnostics.

Returns

extra_points – DataFrame of true data drawn from the analytic classification and regression functions.

Return type

pandas DataFrame
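The perfect-square remark can be made concrete with a short sketch. It is an illustration under stated assumptions, not the library's code: `regular_grid` is a hypothetical helper, the points-per-axis count is assumed to be the rounded dim-th root of N, and the grid is assumed to include both endpoints of a span of 6 centered on the origin. Jitter and the analytic classification and regression functions are left out.

```python
import itertools


def regular_grid(n_total, dim=2, span=6.0):
    """Return evenly spaced grid coordinates centered on the origin.

    Hypothetical helper; the rounding rule and endpoint handling are
    assumptions made for illustration.
    """
    n_per_axis = max(2, round(n_total ** (1.0 / dim)))
    step = span / (n_per_axis - 1)
    axis = [-span / 2 + i * step for i in range(n_per_axis)]
    # Cartesian product of the same axis in every dimension.
    return list(itertools.product(axis, repeat=dim))


square = regular_grid(100)      # 10 points per axis
not_square = regular_grid(90)   # falls back to 9 points per axis
print(len(square), len(not_square))
```

Under these assumptions, asking for 90 points yields 81, which is why a points-per-axis argument such as N_ppa is the more direct way to control the grid shape.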

posydon.active_learning.psy_cris.utils.parse_inifile(path, verbose=False)[source]

Parse an ini file to run psy-cris method ‘get_new_query_points’.

Parameters

path (str) – Path to ini file.

Returns

all_kwargs_dict – Nested dictionary of parsed inifile kwargs.

Return type

dict
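How an ini file can map onto a nested dictionary of kwargs is sketched here with the standard-library configparser. The section names mirror the kwargs arguments of get_new_query_points, but that mapping, the option names, and the use of ast.literal_eval to convert values are all assumptions for illustration; the real file layout expected by parse_inifile may differ.

```python
import ast
import configparser


def ini_text_to_nested_dict(text):
    """Parse ini-formatted text into {section: {option: value}}.

    Hypothetical helper; values are converted with ast.literal_eval when
    possible and kept as strings otherwise.
    """
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep option names case-sensitive
    parser.read_string(text)
    nested = {}
    for section in parser.sections():
        nested[section] = {}
        for key, raw in parser.items(section):
            try:
                nested[section][key] = ast.literal_eval(raw)
            except (ValueError, SyntaxError):
                nested[section][key] = raw
    return nested


# Made-up option names, used only to show the resulting structure.
EXAMPLE_INI = """
[TableData_kwargs]
example_flag = False
example_list = [2, 3]

[Sampler_kwargs]
example_name = some_text
"""

all_kwargs_dict = ini_text_to_nested_dict(EXAMPLE_INI)
print(all_kwargs_dict)
```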

posydon.active_learning.psy_cris.utils.plot_proposed_points_2D(where_to_cut, num_loops, random_init_pos, my_data, where_good_bools)[source]

Plot proposed points in 2-D space.

posydon.active_learning.psy_cris.utils.plot_proposed_points_3D(where_to_cut, num_loops, random_init_pos, my_data, where_good_bools)[source]

Plot proposed points in 3-D space.