Module for handling data for PSY-CRIS.

class posydon.active_learning.psy_cris.data.TableData(table_paths, input_cols, output_cols, class_col_name, my_DataFrame=None, omit_vals=None, omit_cols=None, subset_interval=None, verbose=False, my_colors=None, neighbor=None, n_neighbors=None, undefined_p_change_val=None, read_csv_kwargs={}, **kwargs)

Bases: object

For managing data sets used for classification and regression.

Reads tables of simulation data where a single row represents one simulation. Each column in a row represents different inputs (initial conditions) and outputs (result, continuous variables). If using multiple files, each file is assumed to have the same columns. You may also directly load a pandas DataFrame instead of reading in files.

Example data structure expected in files or pandas DataFrame:

    0   input_1   input_2   outcome   output_1   output_2   output_3   …
    1   1.5       2.6       “A”       100        0.19       -          …
    2   1.5       3.0       “B”       -          -          -          …
    3   2.0       2.6       “C”       -          -          6          …
    …

The above table has dashes ‘-’ in output columns to indicate NaN values. You may have a similar structure if different classes have fundamentally different outputs.
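
As a minimal sketch, a table with this layout can be built directly as a pandas DataFrame, with NaN standing in for the dashes. The column names and values are the illustrative ones from the example table, not real simulation output.

```python
import numpy as np
import pandas as pd

# One row per simulation; NaN marks outputs that are undefined for that class.
example_df = pd.DataFrame(
    {
        "input_1": [1.5, 1.5, 2.0],
        "input_2": [2.6, 3.0, 2.6],
        "outcome": ["A", "B", "C"],
        "output_1": [100.0, np.nan, np.nan],
        "output_2": [0.19, np.nan, np.nan],
        "output_3": [np.nan, np.nan, 6.0],
    }
)

input_cols = ["input_1", "input_2"]
# The output columns include the class column, as the parameter docs require.
output_cols = ["outcome", "output_1", "output_2", "output_3"]
class_col_name = "outcome"
```

These names line up with the constructor arguments documented below: with table_paths set to None, a frame like example_df would be passed as my_DataFrame.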

Initialize the TableData instance.

Parameters
  • table_paths (list) – List of file paths to read in as data. If None, a pandas DataFrame is used instead.

  • input_cols (list) – List of names of the columns which will be considered ‘input’.

  • output_cols (list) – List of names of the columns which will be considered ‘output’. This should include the class column name.

  • class_col_name (str) – Name of column which contains classification data.

  • my_DataFrame (pandas DataFrame, optional) – If given, use this instead of reading files.

  • omit_vals (list, optional) – Numerical values that you wish to omit from the entire data set. If a row contains the value, the entire row is removed. (For example you may want to omit all rows if they contain “-1” or “failed”.)

  • omit_cols (list, optional) – Column names that you wish to omit from the data set.

  • subset_interval (array, optional) – Use some subset of the data files being loaded in. An array with integers indicating the rows that will be kept.

  • my_colors (list, optional) – Colors to use for classification plots.

  • n_neighbors (list, optional) – List of integers that set the number of neighbors to use to calculate average distances. (default None)

  • neighbor (instance of sklearn.neighbors.NearestNeighbors, optional) – To use for average distances. See function ‘calc_avg_dist()’.

  • undefined_p_change_val (optional, float) – Sets the undefined value used when calculating percent change fails due to zero values in the output data. Default uses nan.

  • verbose (bool, optional) – Print statements with extra info.

  • read_csv_kwargs (dict, optional) – Kwargs passed to the pandas function ‘read_csv()’.

  • **kwargs – Extra kwargs

find_n_neighbors(input_data, n_neighbors, neighbor=None, return_avg_dists=False, **kwargs)

Find the N nearest neighbors of a given set of arbitrary points.

Given a set of arbitrary input points, find the N nearest neighbors to each point, not including themselves. Can also return average distance from a point to its N nearest neighbors.

Parameters
  • input_data (ndarray, pandas DataFrame) – Data points whose nearest neighbors will be found.

  • n_neighbors (list) – List of integers with number of neighbors to calculate

  • neighbor (instance of NearestNeighbors class) – For passing your own object.

  • return_avg_dists (bool, optional) – If True, return a dictionary with average distances between the N nearest neighbors listed in ‘n_neighbors’.

  • **kwargs

    class_key (str) – If a DataFrame is given, specifies class column name.

Returns

  • where_nearest_neighbors (dict) – Dictionary containing the n nearest neighbors for every point in the input data.

  • avg_distances (dict) – Returned if return_avg_dists is True. The average distances between the nearest neighbors.
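
The neighbor search can be illustrated with a brute-force NumPy sketch. This is not the library's implementation (which can use an sklearn NearestNeighbors object); the helper name, the choice of Euclidean distance, and the layout of the returned dictionaries (keyed by the number of neighbors) are assumptions made for illustration only.

```python
import numpy as np

def nearest_neighbor_indices(points, n_neighbors):
    """Hypothetical helper: N nearest neighbors of each point, excluding itself."""
    points = np.asarray(points, dtype=float)
    # Pairwise Euclidean distances, shape (n_points, n_points).
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)  # a point is never its own neighbor
    order = np.argsort(dists, axis=1)
    # Assumed layout: one entry per requested number of neighbors.
    where = {n: order[:, :n] for n in n_neighbors}
    avg = {
        n: np.take_along_axis(dists, order[:, :n], axis=1).mean(axis=1)
        for n in n_neighbors
    }
    return where, avg

# Four points on a line at x = 0, 1, 3, 7.
points = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [7.0, 0.0]])
where_nearest_neighbors, avg_distances = nearest_neighbor_indices(points, [1, 2])
```

Setting the diagonal of the distance matrix to infinity is what keeps each point out of its own neighbor list.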

get_binary_mapping_per_class()

Get binary mapping (0 or 1) of the class data.

Get binary mapping (0 or 1) of the class data for each unique classification. For each classification, a value of 1 is given if the class data matches that classification. If they do not match, then a value of 0 is given.

Example:

    classifications -> A, B, C
    class data      -> [ A, B, B, A, B, C ]
    binary mapping  -> [[1, 0, 0, 1, 0, 0]   (for class A)
                        [0, 1, 1, 0, 1, 0]   (for class B)
                        [0, 0, 0, 0, 0, 1]]  (for class C)

Returns

binary_class_data – N by M array where N is the number of classes and M is the number of classifications in the data set. Order is determined by ‘_unique_class_keys_’.

Return type

ndarray
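
The mapping in the example above can be reproduced with a short NumPy sketch. It assumes the class order is the sorted order returned by np.unique; in the library the order is set by ‘_unique_class_keys_’, which may differ.

```python
import numpy as np

class_data = np.array(["A", "B", "B", "A", "B", "C"])
unique_class_keys = np.unique(class_data)  # sorted: ['A', 'B', 'C']

# Compare every class key against every data point via broadcasting:
# rows follow unique_class_keys, columns follow class_data.
binary_class_data = (
    unique_class_keys[:, None] == class_data[None, :]
).astype(int)
```

Each row of binary_class_data is the one-vs-rest label vector for a single class.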

get_class_data(what_data='full')

Get data related to classification.

Parameters
  • what_data (str, list, optional) –

    ‘class_col’ (array)
    • Original classification data.

    ‘unique_class_keys’ (array)
    • Unique classes found in the classification data.

    ‘class_col_to_ids’ (array)
    • Original classification data replaced with their respective class IDs (integers).

    ‘class_id_mapping’ (dict)
    • Mapping between a classification from the original data set and its class ID.

    ‘binary_data’ (ndarray)
    • Iterating over the unique classes, classification data is turned into 1 or 0 if it matches the given class. See method ‘get_binary_mapping_per_class()’.

    ‘full’ (tuple)
    • All options listed above. (Default)

Returns

class_data – An object containing the specified classification data. Returns a tuple of len(what_data) if a list is passed; with the default ‘full’, the tuple has 5 elements.

Return type

tuple, ndarray, dict
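
The relationship between the class column, the unique keys, and the integer class IDs can be sketched with NumPy. The direction of the mapping dictionary (ID to label) and the use of sorted order are assumptions; the library may order or key these differently.

```python
import numpy as np

class_col = np.array(["A", "B", "B", "A", "B", "C"])

# np.unique returns the sorted unique keys and, with return_inverse=True,
# the index of each element's key, which serves as an integer class ID.
unique_class_keys, class_col_to_ids = np.unique(class_col, return_inverse=True)

# Assumed direction: integer ID -> original class label.
class_id_mapping = {i: str(key) for i, key in enumerate(unique_class_keys)}
```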

get_data(what_data='full', return_df=False)

Get all data contained in TableData object.

The data is returned after omission of columns and rows containing specified values (if given) and taking a subset (if given) of the original data set read in as a csv or given directly as a pandas DataFrame. (Before data processing for classification and regression.)

Parameters
  • what_data (str, list, optional) – Default is ‘full’, with other options ‘input’ or ‘output’.

    ‘full’ – original data table (after omission and subsets)
    ‘input’ – only data identified as inputs from the ‘full’ data set
    ‘output’ – only data identified as outputs from the ‘full’ data set

  • return_df (bool, optional) – If True, return a pandas DataFrame object. If False (default), return a numpy array.

Returns

data – Data before classification and regression data sorting is done. Will return tuple of len(what_data) if a list is passed.

Return type

tuple, ndarray or DataFrame
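
The omission steps described above can be approximated in plain pandas. This is a sketch of the idea only, not the library's code; the column names, the “notes” column, and the -1.0 sentinel are made up for illustration.

```python
import pandas as pd

raw = pd.DataFrame(
    {
        "input_1": [1.5, 1.5, 2.0, 2.0],
        "input_2": [2.6, 3.0, 2.6, -1.0],
        "outcome": ["A", "B", "C", "A"],
        "notes": ["ok", "ok", "ok", "ok"],
    }
)

omit_cols = ["notes"]  # columns to drop entirely
omit_vals = [-1.0]     # any row containing one of these values is dropped
input_cols = ["input_1", "input_2"]

# Drop the omitted columns, then every row that contains an omitted value.
full = raw.drop(columns=omit_cols)
rows_to_drop = full.isin(omit_vals).any(axis=1)
full = full[~rows_to_drop].reset_index(drop=True)

input_array = full[input_cols].to_numpy()  # analogous to return_df=False
input_frame = full[input_cols]             # analogous to return_df=True
```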

get_info()

Return what info is printed in the ‘info()’ method.

Returns

  • files (list) – File paths where data was loaded from.

  • df_index_keys (list) – Index keys added to the DataFrame object once multiple files are joined together such that one can access data by file after they were joined.

  • for_info (list) – Running list of print statements that include but are not limited to what is shown if ‘verbose=True’.

get_regr_data(what_data='full')

Get data related to regression, sorted by class in dictionaries.

Parameters

what_data (str, list, optional) –

    ‘input’ – For each class, the input data with no cleaning.
    ‘raw_output’ – For each class, the output data with no cleaning.
    ‘output’ – For each class, the cleaned output data.
    ‘full’ – All options listed above in that respective order.

Returns

data – An object containing the specified regression data. Returns a tuple of len(what_data) if a list is passed; with the default ‘full’, the tuple has 3 elements.

Return type

tuple, ndarray or DataFrame
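
A per-class split of this kind can be sketched with a pandas groupby. What “cleaned” means is not specified here; the sketch assumes it means dropping output columns that are entirely NaN for a given class, which is only one plausible reading. All column names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "input_1": [1.5, 1.5, 2.0, 2.0],
        "outcome": ["A", "B", "A", "B"],
        "output_1": [100.0, np.nan, 120.0, np.nan],
        "output_2": [np.nan, 0.3, np.nan, 0.5],
    }
)

regr_input, regr_raw_output, regr_output = {}, {}, {}
for class_key, group in df.groupby("outcome"):
    regr_input[class_key] = group[["input_1"]]
    regr_raw_output[class_key] = group[["output_1", "output_2"]]
    # Assumed cleaning step: drop outputs that are all NaN for this class.
    regr_output[class_key] = regr_raw_output[class_key].dropna(axis=1, how="all")
```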

info()

Print info for the instance of TableData object.

For output descriptions see the method ‘get_info()’.

make_class_data_plot(fig, ax, axes_keys, my_slice_vals=None, my_class_colors=None, return_legend_handles=False, verbose=False, **kwargs)

Plot classification data on a given axis and figure.

Parameters
  • fig (Matplotlib Figure object) – Figure on which the plot will be made.

  • ax (Matplotlib Axis object) – Axis on which a scatter plot will be drawn.

  • axes_keys (list) – List containing two names of input data columns to use as horizontal and vertical axes.

  • my_slice_vals (optional, list, dict) – List giving values on which to slice in the axes not being plotted. Default (None) uses the first unique value found in each axis. If instead of individual values, a range is desired (e.g. 10 +/- 1) then a dict can be given with integer keys mapping to a tuple with the lower and upper range. ( e.g. {0:(9,11)} )

  • my_class_colors (optional, list) – List of colors used to represent different classes. Default (None) uses the default class colors.

  • return_legend_handles (optional, bool) – Returns a list of handles that connect classes to colors.

  • verbose (optional, bool) – Print useful information during runtime.

  • **kwargs (optional) – Kwargs for matplotlib.pyplot.scatter().

Returns

  • fig (Matplotlib Figure object) – Figure object.

  • ax (Matplotlib Axis object) – Updated axis after plotting.

  • handles (optional, list) – List of legend handles connecting classes and their colors. Returned if ‘return_legend_handles’ is True. Default is False.
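
The two forms of ‘my_slice_vals’ (a single value, or a lower and upper range such as (9, 11)) can be illustrated with a boolean mask over a non-plotted input column. The helper below is hypothetical, and treating the range bounds as inclusive is an assumption.

```python
import numpy as np

def slice_mask(column, slice_val):
    """Hypothetical helper: True where a non-plotted input column matches the slice."""
    column = np.asarray(column, dtype=float)
    if isinstance(slice_val, tuple):  # (lower, upper) range, e.g. (9, 11)
        lower, upper = slice_val
        return (column >= lower) & (column <= upper)
    return np.isclose(column, slice_val)  # single value

third_axis = np.array([9.5, 10.0, 12.0, 10.5])
mask_range = slice_mask(third_axis, (9, 11))
mask_value = slice_mask(third_axis, 10.0)
```

Only the rows where the mask is True would then be drawn on the two plotted axes.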

plot_3D_class_data(axes=None, fig_size=(4, 5), mark_size=12, which_val=0, save_fig=False, plt_str='0', color_list=None)

Plot the 3D classification data in a 2D plot.

Three input axes with classification output.

Parameters
  • axes (list, optional) –

    By default it will order the axes as [x,y,z] in the original order the input axes were read in. To change the ordering, pass a list with the column names. Example: The default order is col_1, col_2, col_3. To change the horizontal axis from col_1 to col_2 you would use:

    ‘axes = [“col_2”, “col_1”, “col_3”]’

  • fig_size (tuple, optional, default = (4,5)) – Size of the figure. (Matplotlib figure kwarg ‘figsize’)

  • mark_size (float, optional, default = 12) – Size of the scatter plot markers. (Matplotlib scatter kwarg ‘s’)

  • which_val (int, default = 0) – Integer choosing what unique value to ‘slice’ on in the 3D data such that it can be plotted on 2D. (If you had x,y,z data you need to choose a z value)

  • save_fig (bool, default = False) – Save the figure in the local directory.

  • plt_str (str, default = '0') – If you are saving multiple figures you can pass a string which will be added to the end of the default: “data_plot_{plt_str}.pdf”

  • color_list (list, default = None) –

Returns

Return type

matplotlib figure

posydon.active_learning.psy_cris.data.calc_avg_dist(data, n_neighbors, neighbor=None)

Get the average distance to the nearest neighbors in the data set.

(NearestNeighbors from sklearn.neighbors)

Parameters
  • data (ndarray) – Data to train the NearestNeighbors class on.

  • n_neighbors (int) – Number of neighbors to use when finding average distance.

  • neighbor (instance of NearestNeighbors class) – For passing your own object.

Returns

  • avg_dist (array) – The average distance between nearest neighbors.

  • g_indi (array) – Indices that correspond to the nearest neighbors.

posydon.active_learning.psy_cris.data.calc_avg_p_change(data, where_nearest_neighbors, undefined_p_change_val=None)

Calculate the average fractional change in a given data set.

The method uses the N nearest neighbors (calculated beforehand).

Parameters
  • data (ndarray) – Data set to calculate percent change.

  • where_nearest_neighbors (dict) – Indices in data for the n nearest neighbors in input space.

  • undefined_p_change_val (optional) – For output with an undefined percent change (zero value), this kwarg defines what is put in its place. Defaults to nan.

Returns

avg_p_change_holder – Each element contains the average percent change for a given number of neighbors. If any output values are 0, then a nan is put in place of a percent change.

Return type

ndarray
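
The exact formula is not given here, so the following is only one plausible reading: for each point, average the absolute fractional difference between its output value and the output values of its nearest neighbors, substituting a placeholder where the point's own value is zero. The helper name and signature are hypothetical.

```python
import numpy as np

def avg_fractional_change(values, neighbor_idx, undefined_val=np.nan):
    """Hypothetical helper: mean |y_neighbor - y_i| / |y_i| over each point's neighbors."""
    values = np.asarray(values, dtype=float)
    neighbor_vals = values[neighbor_idx]  # shape (n_points, n_neighbors)
    abs_diff = np.abs(neighbor_vals - values[:, None])
    result = np.full(values.shape, undefined_val, dtype=float)
    nonzero = values != 0  # the change is undefined when y_i == 0
    result[nonzero] = (
        abs_diff[nonzero] / np.abs(values[nonzero])[:, None]
    ).mean(axis=1)
    return result

values = np.array([10.0, 20.0, 0.0])
neighbor_idx = np.array([[1], [0], [1]])  # one nearest neighbor per point
p_change = avg_fractional_change(values, neighbor_idx)
```

Passing a different undefined_val plays the role that ‘undefined_p_change_val’ plays in the documented function.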