Module for handling data for PSY-CRIS.

class TableData(table_paths, input_cols, output_cols, class_col_name, my_DataFrame=None, omit_vals=None, omit_cols=None, subset_interval=None, verbose=False, my_colors=None, neighbor=None, n_neighbors=None, undefined_p_change_val=None, read_csv_kwargs={}, **kwargs)[source]

Bases: object

For managing data sets used for classification and regression.

Reads tables of simulation data where a single row represents one simulation. Each column in a row represents different inputs (initial conditions) and outputs (result, continuous variables). If using multiple files, each file is assumed to have the same columns. You may also directly load a pandas DataFrame instead of reading in files.

Example data structure expected in files or pandas DataFrame:

    0   input_1   input_2   outcome   output_1   output_2   output_3   …
    1   1.5       2.6       “A”       100        0.19       -          …
    2   1.5       3.0       “B”       -          -          -          …
    3   2.0       2.6       “C”       -          -          6          …
    …

The above table has dashes ‘-’ in output columns to indicate NaN values. You may have a similar structure if different classes have fundamentally different outputs.
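As a rough illustration of the layout described above, the same table can be held in a pandas DataFrame with NaN in place of the dashes. The column names are copied from the example; the values and the per-class count at the end are only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical table mirroring the example above; dashes become NaN.
df = pd.DataFrame(
    {
        "input_1": [1.5, 1.5, 2.0],
        "input_2": [2.6, 3.0, 2.6],
        "outcome": ["A", "B", "C"],
        "output_1": [100.0, np.nan, np.nan],
        "output_2": [0.19, np.nan, np.nan],
        "output_3": [np.nan, np.nan, 6.0],
    }
)

input_cols = ["input_1", "input_2"]
output_cols = ["outcome", "output_1", "output_2", "output_3"]
class_col_name = "outcome"

# Count the defined (non-NaN) continuous outputs available for each class.
defined_per_class = df.groupby(class_col_name)[output_cols[1:]].count()
print(defined_per_class)
```

Here class "B" has no defined continuous outputs at all, which is the situation the dashes in the table describe.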

Initialize the TableData instance.

  • table_paths (list) – List of file paths to read in as data. If None, a pandas DataFrame is used instead.

  • input_cols (list) – List of names of the columns which will be considered ‘input’.

  • output_cols (list) – List of names of the columns which will be considered ‘output’. This should include the class column name.

  • class_col_name (str) – Name of column which contains classification data.

  • my_DataFrame (pandas DataFrame, optional) – If given, use this instead of reading files.

  • omit_vals (list, optional) – Numerical values that you wish to omit from the entire data set. If a row contains the value, the entire row is removed. (For example you may want to omit all rows if they contain “-1” or “failed”.)

  • omit_cols (list, optional) – Column names that you wish to omit from the data set.

  • subset_interval (array, optional) – Use some subset of the data files being loaded in. An array with integers indicating the rows that will be kept.

  • my_colors (list, optional) – Colors to use for classification plots.

  • n_neighbors (list, optional) – List of integers that set the number of neighbors to use to calculate average distances. (default None)

  • neighbor (instance of sklearn.neighbors.NearestNeighbors, optional) – To use for average distances. See function ‘calc_avg_dist()’.

  • undefined_p_change_val (optional, float) – Sets the undefined value used when calculating percent change fails due to zero values in the output data. Default uses nan.

  • verbose (bool, optional) – Print statements with extra info.

  • read_csv_kwargs (dict, optional) – Kwargs passed to the pandas function ‘read_csv()’.

  • **kwargs – Extra kwargs
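The effect of `omit_vals`, `omit_cols`, and `subset_interval` can be sketched with plain pandas. This is a minimal sketch, not the library's implementation: the helper name `clean_table` is hypothetical, and the order in which the three steps are applied is an assumption.

```python
import pandas as pd

def clean_table(df, omit_vals=None, omit_cols=None, subset_interval=None):
    """Hypothetical helper showing one way the three options could act."""
    out = df
    if subset_interval is not None:
        # Keep only the rows at the given integer positions.
        out = out.iloc[subset_interval]
    if omit_cols:
        out = out.drop(columns=omit_cols)
    if omit_vals:
        # Remove every row that contains at least one omitted value.
        has_omitted = out.isin(omit_vals).any(axis=1)
        out = out.loc[~has_omitted]
    return out

raw = pd.DataFrame(
    {
        "input_1": [1.0, 2.0, 3.0, 4.0],
        "output_1": [10.0, -1.0, 30.0, 40.0],
        "scratch": [0.0, 0.0, 0.0, 0.0],
    }
)
cleaned = clean_table(
    raw, omit_vals=[-1], omit_cols=["scratch"], subset_interval=[0, 1, 2]
)
print(cleaned)
```

In this example the row holding -1 is dropped entirely, the "scratch" column is removed, and the fourth row never enters because it is outside the subset.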

find_n_neighbors(input_data, n_neighbors, neighbor=None, return_avg_dists=False, **kwargs)[source]

Find the N nearest neighbors of a given set of arbitrary points.

Given a set of arbitrary input points, find the N nearest neighbors to each point, not including themselves. Can also return average distance from a point to its N nearest neighbors.

  • input_data (ndarray, pandas DataFrame) – Data points where their nearest neighbors will be found

  • n_neighbors (list) – List of integers with number of neighbors to calculate

  • neighbor (instance of NearestNeighbors class) – For passing your own object.

  • return_avg_dists (bool, optional) – If True, return a dictionary with average distances between the N nearest neighbors listed in ‘n_neighbors’.

  • **kwargs – If a DataFrame is given, specifies class column name.


  • where_nearest_neighbors (dict) – Dictionary containing the n nearest neighbors for every point in the input data.

  • avg_distances (dict) – Returned if return_avg_dists is True. The average distances between the nearest neighbors.
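The neighbor search described above can be sketched with a brute-force distance matrix in NumPy. This swaps in a plain pairwise computation instead of sklearn's NearestNeighbors, searches only among the given points, and keys both dictionaries by the number of neighbors; all three choices, and the helper name, are assumptions made for illustration.

```python
import numpy as np

def nearest_neighbors_brute_force(points, n_neighbors):
    """Hypothetical helper: indices of, and mean distance to, the N nearest
    neighbors of each point, excluding the point itself."""
    points = np.asarray(points, dtype=float)
    # Pairwise Euclidean distances, shape (n_points, n_points).
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor

    where_nearest = {}
    avg_distances = {}
    for n in n_neighbors:
        idx = np.argsort(dists, axis=1)[:, :n]
        where_nearest[n] = idx
        avg_distances[n] = np.take_along_axis(dists, idx, axis=1).mean(axis=1)
    return where_nearest, avg_distances

pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]])
where, avg = nearest_neighbors_brute_force(pts, n_neighbors=[1, 2])
print(where[1][:, 0])
print(avg[2])
```

The brute-force matrix costs O(n²) memory, so for large tables a tree-based search such as the sklearn class mentioned above is the more practical choice.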


Get binary mapping (0 or 1) of the class data.

Get binary mapping (0 or 1) of the class data for each unique classification. For each classification, a value of 1 is given if the class data matches that classification. If they do not match, then a value of 0 is given.

Example:

    classifications -> A, B, C
    class data      -> [ A, B, B, A, B, C ]
    binary mapping  -> [[1, 0, 0, 1, 0, 0]   (for class A)
                        [0, 1, 1, 0, 1, 0]   (for class B)
                        [0, 0, 0, 0, 0, 1]]  (for class C)


binary_class_data – N by M array where N is the number of classes and M is the number of classifications in the data set. Order is determined by ‘_unique_class_keys_’.
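The mapping in the example can be reproduced with a single NumPy broadcast comparison. This is a sketch only: it orders the rows by `np.unique` (sorted order), which is an assumption and may differ from the ordering the class stores internally.

```python
import numpy as np

class_data = np.array(["A", "B", "B", "A", "B", "C"])
unique_class_keys = np.unique(class_data)  # sorted unique classes

# Compare each unique class (rows) with every entry of the class data (columns).
binary_class_data = (unique_class_keys[:, None] == class_data[None, :]).astype(int)
print(binary_class_data)
```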

Return type

ndarray

Get data related to classification.

  • what_data (str, list, optional)

    ‘class_col’ (array)
    • Original classification data.

    ‘unique_class_keys’ (array)
    • Unique classes found in the classification data.

    ‘class_col_to_ids’ (array)
    • Original classification data replaced with their respective class IDs (integers).

    ‘class_id_mapping’ (dict)
    • Mapping between a classification from the original data set and its class ID.

    ‘binary_data’ (ndarray)
    • Iterating over the unique classes, classification data is turned into 1 or 0 if it matches the given class. See method ‘get_binary_mapping_per_class()’.

    ‘full’ (tuple)
    • All options listed above. (Default)

class_data – An object containing the specified classification data. Will return a tuple of len(what_data) if a list is passed. With the default (‘full’), a tuple of 5 items is returned.

Return type

tuple, ndarray, dict
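The relationship between the class column, the unique keys, the integer IDs, and the ID mapping can be sketched with `np.unique`. The variable names follow the option names above, but assigning IDs in sorted order and mapping from ID to class (rather than class to ID) are assumptions.

```python
import numpy as np

class_col = np.array(["A", "B", "B", "A", "B", "C"])

# return_inverse gives, for each entry, the index of its class in the unique keys.
unique_class_keys, class_col_to_ids = np.unique(class_col, return_inverse=True)

# Assumed direction of the mapping: integer ID -> original classification.
class_id_mapping = {int(i): str(k) for i, k in enumerate(unique_class_keys)}

print(class_col_to_ids)
print(class_id_mapping)
```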

get_data(what_data='full', return_df=False)[source]

Get all data contained in TableData object.

The data is returned after omission of columns and rows containing specified values (if given) and taking a subset (if given) of the original data set read in as a csv or given directly as a pandas DataFrame. (Before data processing for classification and regression.)

  • what_data (str, list, optional) – Default is ‘full’, with other options ‘input’ or ‘output’.

    ‘full’ - original data table (after omission and subsets)
    ‘input’ - only data identified as inputs from ‘full’ data set
    ‘output’ - only data identified as outputs from ‘full’ data set

  • return_df (bool, optional) – If True, return a pandas DataFrame object. If False (default), return a numpy array.


data – Data before classification and regression data sorting is done. Will return tuple of len(what_data) if a list is passed.

Return type

tuple, ndarray or DataFrame
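The string-or-list behavior of `what_data` together with `return_df` can be sketched as a small dispatch over pandas selections. The function `get_data_sketch` is a hypothetical stand-in written for this example, not the method itself.

```python
import pandas as pd

def get_data_sketch(df, input_cols, output_cols, what_data="full", return_df=False):
    """Hypothetical stand-in: one object for a string, a tuple for a list."""
    tables = {"full": df, "input": df[input_cols], "output": df[output_cols]}

    def pick(key):
        table = tables[key]
        return table if return_df else table.to_numpy()

    if isinstance(what_data, str):
        return pick(what_data)
    return tuple(pick(key) for key in what_data)

df = pd.DataFrame(
    {"input_1": [1.5, 1.5, 2.0], "input_2": [2.6, 3.0, 2.6], "outcome": ["A", "B", "C"]}
)
inputs, outputs = get_data_sketch(
    df, ["input_1", "input_2"], ["outcome"], what_data=["input", "output"]
)
full_df = get_data_sketch(df, ["input_1", "input_2"], ["outcome"], return_df=True)
print(inputs.shape, outputs.shape)
```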


Return what info is printed in the ‘info()’ method.


  • files (list) – File paths where data was loaded from.

  • df_index_keys (list) – Index keys added to the DataFrame object once multiple files are joined together such that one can access data by file after they were joined.

  • for_info (list) – Running list of print statements that include but are not limited to what is shown if ‘verbose=True’.


Get data related to regression all sorted by class in dictionaries.


what_data (str, list, optional) –

    ‘input’ - For each class, the input data with no cleaning.
    ‘raw_output’ - For each class, the output data with no cleaning.
    ‘output’ - For each class, the cleaned output data.
    ‘full’ - All options listed above in that respective order.


data – An object containing the specified regression data. Will return a tuple of len(what_data) if a list is passed. With the default (‘full’), a tuple of 3 items is returned.

Return type

tuple, ndarray or DataFrame
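Sorting regression data by class into dictionaries might look like the sketch below. What "cleaned" means is not specified in this section, so dropping output columns that are entirely NaN for a class is purely an assumption, as are the variable names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "input_1": [1.5, 1.5, 2.0, 2.0],
        "outcome": ["A", "B", "C", "A"],
        "output_1": [100.0, np.nan, np.nan, 120.0],
        "output_3": [np.nan, np.nan, 6.0, np.nan],
    }
)
input_cols = ["input_1"]
regr_cols = ["output_1", "output_3"]

regr_inputs, regr_raw_outputs, regr_outputs = {}, {}, {}
for class_key, rows in df.groupby("outcome"):
    regr_inputs[class_key] = rows[input_cols]
    regr_raw_outputs[class_key] = rows[regr_cols]
    # Assumed "cleaning": drop output columns that are all NaN for this class.
    regr_outputs[class_key] = rows[regr_cols].dropna(axis=1, how="all")

for class_key, table in regr_outputs.items():
    print(class_key, list(table.columns))
```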


Print info for the instance of TableData object.

For output descriptions see the method ‘get_info()’.

make_class_data_plot(fig, ax, axes_keys, my_slice_vals=None, my_class_colors=None, return_legend_handles=False, verbose=False, **kwargs)[source]

Plot classification data on a given axis and figure.

  • fig (Matplotlib Figure object) – Figure on which the plot will be made.

  • ax (Matplotlib Axis object) – Axis on which a scatter plot will be drawn.

  • axes_keys (list) – List containing two names of input data columns to use as horizontal and vertical axes.

  • my_slice_vals (optional, list, dict) – List giving values on which to slice in the axes not being plotted. Default (None) uses the first unique value found in each axis. If instead of individual values, a range is desired (e.g. 10 +/- 1) then a dict can be given with integer keys mapping to a tuple with the lower and upper range. ( e.g. {0:(9,11)} )

  • my_class_colors (optional, list) – List of colors used to represent different classes. Default (None) uses the default class colors.

  • return_legend_handles (optional, bool) – Returns a list of handles that connect classes to colors.

  • verbose (optional, bool) – Print useful information during runtime.

  • **kwargs (optional) – Kwargs for matplotlib.pyplot.scatter() .


  • fig (Matplotlib Figure object) – Figure object.

  • ax (Matplotlib Axis object) – Updated axis after plotting.

  • handles (optional, list) – List of legend handles connecting classes and their colors. Returned if ‘return_legend_handles’ is True. Default is False.
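The two forms of `my_slice_vals` (a list of exact values, or a dict of ranges) can be sketched as a boolean row mask. The helper `slice_mask` is hypothetical, and treating the dict keys as positions among the non-plotted axes with inclusive bounds is an assumption.

```python
import numpy as np

def slice_mask(other_axes_data, my_slice_vals):
    """Hypothetical helper: boolean mask of rows lying on the requested slice.

    other_axes_data : (n_rows, n_other_axes) array of the inputs not plotted.
    my_slice_vals   : list of exact values, or dict {axis_index: (low, high)}.
    """
    other_axes_data = np.asarray(other_axes_data, dtype=float)
    mask = np.ones(len(other_axes_data), dtype=bool)
    if isinstance(my_slice_vals, dict):
        for axis_index, (low, high) in my_slice_vals.items():
            column = other_axes_data[:, axis_index]
            mask &= (column >= low) & (column <= high)  # inclusive range (assumed)
    else:
        for axis_index, value in enumerate(my_slice_vals):
            mask &= np.isclose(other_axes_data[:, axis_index], value)
    return mask

z = np.array([[9.5], [10.0], [12.0], [10.8]])
exact = slice_mask(z, [10.0])        # slice on a single value
ranged = slice_mask(z, {0: (9, 11)})  # slice on the range 10 +/- 1
print(exact, ranged)
```

Only the rows selected by the mask would then be passed on to the scatter plot.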

plot_3D_class_data(axes=None, fig_size=(4, 5), mark_size=12, which_val=0, save_fig=False, plt_str='0', color_list=None)[source]

Plot the 3D classification data in a 2D plot.

Three input axes with classification output.

  • axes (list, optional) –

    By default it will order the axes as [x,y,z] in the original order the input axes were read in. To change the ordering, pass a list with the column names. Example: The default order is col_1, col_2, col_3. To change the horizontal axis from col_1 to col_2 you would use:

    ‘axes = [“col_2”, “col_1”, “col_3”]’

  • fig_size (tuple, optional, default = (4,5)) – Size of the figure. (Matplotlib figure kwarg ‘figsize’)

  • mark_size (float, optional, default = 12) – Size of the scatter plot markers. (Matplotlib scatter kwarg ‘s’)

  • which_val (int, default = 0) – Integer choosing what unique value to ‘slice’ on in the 3D data such that it can be plotted on 2D. (If you had x,y,z data you need to choose a z value)

  • save_fig (bool, default = False) – Save the figure in the local directory.

  • plt_str (str, default = '0') – If you are saving multiple figures you can pass a string which will be added to the end of the default: “data_plot_{plt_str}.pdf”

  • color_list (list, default = None) –


Return type

matplotlib figure

calc_avg_dist(data, n_neighbors, neighbor=None)[source]

Get the average distance to the nearest neighbors in the data set.

(NearestNeighbors from sklearn.neighbors)

  • data (ndarray) – Data to train the NearestNeighbors class on.

  • n_neighbors (int) – Number of neighbors to use when finding average distance.

  • neighbor (instance of NearestNeighbors class) – For passing your own object.


  • avg_dist (array) – The average distance between nearest neighbors.

  • g_indi (array) – Indices that correspond to the nearest neighbors.

…(data, where_nearest_neighbors, undefined_p_change_val=None)[source]

Calculate the average fractional change in a given data set.

The method uses the N nearest neighbors (calculated beforehand).

  • data (ndarray) – Data set to calculate percent change.

  • where_nearest_neighbors (dict) – Indices in data for the n nearest neighbors in input space.

  • undefined_p_change_val (optional) – For output with an undefined percent change (zero value), this kwarg defines what is put in its place. Defaults to nan.


avg_p_change_holder – Each element contains the average percent change for a given number of neighbors. If any output values are 0, then a nan is put in place of a percent change.
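One plausible reading of the calculation is sketched below. The exact formula is not given in this section, so defining the fractional change as |y_neighbor - y_point| / |y_point|, averaging it over the neighbors, and returning a dict keyed by the number of neighbors are all assumptions; the function name is hypothetical.

```python
import numpy as np

def avg_fractional_change(data, where_nearest_neighbors, undefined_val=np.nan):
    """Hypothetical sketch: mean fractional change between each point's
    output value and the values of its nearest neighbors."""
    data = np.asarray(data, dtype=float)
    result = {}
    for n, idx in where_nearest_neighbors.items():
        neighbor_vals = data[idx]  # shape (n_points, n)
        with np.errstate(divide="ignore", invalid="ignore"):
            frac = np.abs(neighbor_vals - data[:, None]) / np.abs(data[:, None])
        avg = frac.mean(axis=1)
        # A zero output value makes the change undefined for that point.
        avg[data == 0] = undefined_val
        result[n] = avg
    return result

data = np.array([10.0, 12.0, 0.0])
where = {1: np.array([[1], [0], [1]])}  # nearest-neighbor indices per point
out = avg_fractional_change(data, where)
print(out[1])
```

Passing a different `undefined_val` replaces the nan placeholder for points whose own output value is zero.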

Return type
