Module for handling data for PSY-CRIS.
- class posydon.active_learning.psy_cris.data.TableData(table_paths, input_cols, output_cols, class_col_name, my_DataFrame=None, omit_vals=None, omit_cols=None, subset_interval=None, verbose=False, my_colors=None, neighbor=None, n_neighbors=None, undefined_p_change_val=None, read_csv_kwargs={}, **kwargs)
Bases:
object
For managing data sets used for classification and regression.
Reads tables of simulation data where a single row represents one simulation. Each column in a row represents different inputs (initial conditions) and outputs (result, continuous variables). If using multiple files, each file is assumed to have the same columns. You may also directly load a pandas DataFrame instead of reading in files.
Example data structure expected in files or pandas DataFrame:
0   input_1  input_2  outcome  output_1  output_2  output_3  …
1   1.5      2.6      “A”      100       0.19      -         …
2   1.5      3.0      “B”      -         -         -         …
3   2.0      2.6      “C”      -         -         6         …
…
The above table has dashes ‘-’ in output columns to indicate NaN values. You may have a similar structure if different classes have fundamentally different outputs.
Initialize the TableData instance.
- Parameters
table_paths (list) – List of file paths to read in as data. If None, a pandas DataFrame is used instead.
input_cols (list) – List of names of the columns which will be considered ‘input’.
output_cols (list) – List of names of the columns which will be considered ‘output’. This should include the class column name.
class_col_name (str) – Name of column which contains classification data.
my_DataFrame (pandas DataFrame, optional) – If given, use this instead of reading files.
omit_vals (list, optional) – Numerical values that you wish to omit from the entire data set. If a row contains the value, the entire row is removed. (For example you may want to omit all rows if they contain “-1” or “failed”.)
omit_cols (list, optional) – Column names that you wish to omit from the data set.
subset_interval (array, optional) – Use some subset of the data files being loaded in. An array with integers indicating the rows that will be kept.
my_colors (list, optional) – Colors to use for classification plots.
n_neighbors (list, optional) – List of integers that set the number of neighbors to use to calculate average distances. (default None)
neighbor (instance of sklearn.neighbors.NearestNeighbors, optional) – To use for average distances. See function ‘calc_avg_dist()’.
undefined_p_change_val (optional, float) – Sets the undefined value used when calculating percent change fails due to zero values in the output data. Default uses nan.
verbose (bool, optional) – Print statements with extra info.
read_csv_kwargs (dict, optional) – Kwargs passed to the pandas function ‘read_csv()’.
**kwargs – Extra kwargs
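As a minimal sketch of the parameters above, the example table can be built as a pandas DataFrame and passed through `my_DataFrame` with `table_paths=None`. The call only mirrors the documented signature; it has not been checked against an installed POSYDON, and a three-row table may be too small for the class to process. The helper name `make_table_data` is hypothetical.

```python
import numpy as np
import pandas as pd

# The example table from the description; dashes become NaN.
df = pd.DataFrame({
    "input_1": [1.5, 1.5, 2.0],
    "input_2": [2.6, 3.0, 2.6],
    "outcome": ["A", "B", "C"],
    "output_1": [100.0, np.nan, np.nan],
    "output_2": [0.19, np.nan, np.nan],
    "output_3": [np.nan, np.nan, 6.0],
})


def make_table_data(frame):
    """Hypothetical helper: build a TableData from a DataFrame.

    The import is local so this sketch can be read and loaded
    without POSYDON installed.
    """
    from posydon.active_learning.psy_cris.data import TableData

    return TableData(
        table_paths=None,  # no files; use the DataFrame instead
        input_cols=["input_1", "input_2"],
        # per the parameter description, output_cols includes the class column
        output_cols=["outcome", "output_1", "output_2", "output_3"],
        class_col_name="outcome",
        my_DataFrame=frame,
    )
```

With POSYDON installed, `make_table_data(df)` would be the call to try, ideally on a larger table.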
- find_n_neighbors(input_data, n_neighbors, neighbor=None, return_avg_dists=False, **kwargs)
Find the N nearest neighbors of a given set of arbitrary points.
Given a set of arbitrary input points, find the N nearest neighbors to each point, not including themselves. Can also return average distance from a point to its N nearest neighbors.
- Parameters
input_data (ndarray, pandas DataFrame) – Data points where their nearest neighbors will be found
n_neighbors (list) – List of integers with number of neighbors to calculate
neighbor (instance of NearestNeighbors class) – For passing your own object.
return_avg_dists (bool, optional) – If True, return a dictionary with the average distances between the N nearest neighbors listed in ‘n_neighbors’.
**kwargs –
- class_key (str)
If a DataFrame is given, specifies class column name.
- Returns
where_nearest_neighbors (dict) – Dictionary containing the n nearest neighbors for every point in the input data.
avg_distances (dict) – Returned if return_avg_dists is True. The average distances between the nearest neighbors.
- get_binary_mapping_per_class()
Get binary mapping (0 or 1) of the class data.
Get binary mapping (0 or 1) of the class data for each unique classification. For each classification, a value of 1 is given if the class data matches that classification. If they do not match, then a value of 0 is given.
Example:
classifications -> A, B, C
class data -> [ A, B, B, A, B, C ]
binary mapping -> [[1, 0, 0, 1, 0, 0] (for class A)
[0, 1, 1, 0, 1, 0] (for class B)
[0, 0, 0, 0, 0, 1]] (for class C)
- Returns
binary_class_data – N by M array where N is the number of classes and M is the number of classifications in the data set. Order is determined by ‘_unique_class_keys_’.
- Return type
ndarray
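The binary mapping described above can be reproduced with a short NumPy sketch. This is not the library's implementation: it assumes the unique classes are taken in sorted order (via `np.unique`), whereas the method orders rows by ‘_unique_class_keys_’.

```python
import numpy as np

# Class data from the example above.
class_data = np.array(["A", "B", "B", "A", "B", "C"])

# Assumption: unique classes in sorted order (A, B, C).
unique_class_keys = np.unique(class_data)

# Compare every class key (rows) against every data point (columns);
# a match gives 1, a mismatch gives 0.
binary_class_data = (
    class_data[None, :] == unique_class_keys[:, None]
).astype(int)

print(binary_class_data)
```

The result has one row per class and one column per data point, matching the N by M layout in the Returns entry.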
- get_class_data(what_data='full')
Get data related to classification.
- Parameters
what_data (str, list, optional) –
- ‘class_col’ (array)
Original classification data.
- ‘unique_class_keys’ (array)
Unique classes found in the classification data.
- ‘class_col_to_ids’ (array)
Original classification data replaced with their respective class IDs (integers).
- ‘class_id_mapping’ (dict)
Mapping between a classification from the original data set and its class ID.
- ‘binary_data’ (ndarray)
Iterating over the unique classes, classification data is turned into 1 or 0 if it matches the given class. See method ‘get_binary_mapping_per_class()’.
- ‘full’ (tuple)
All options listed above. (Default)
- Returns
class_data – An object containing the specified classification data. If a list is passed, a tuple of len(what_data) is returned. The default (‘full’) returns a tuple of length 5.
- Return type
- get_data(what_data='full', return_df=False)
Get all data contained in TableData object.
The data is returned after omission of columns and rows containing specified values (if given) and taking a subset (if given) of the original data set read in as a csv or given directly as a pandas DataFrame. (Before data processing for classification and regression.)
- Parameters
what_data (str, list, optional) – Default is ‘full’; the other options are ‘input’ and ‘output’.
- ‘full’
Original data table (after omission and subsets).
- ‘input’
Only data identified as inputs from the ‘full’ data set.
- ‘output’
Only data identified as outputs from the ‘full’ data set.
return_df (bool, optional) – If True, return a pandas DataFrame object. If False (default), return a numpy array.
- Returns
data – Data before classification and regression data sorting is done. Will return tuple of len(what_data) if a list is passed.
- Return type
tuple, ndarray or DataFrame
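The omission, subsetting and selection steps described for `get_data` can be sketched with pandas. The function names `preprocess` and `select`, and the order in which the steps are applied, are assumptions made for illustration; the library may order or implement them differently.

```python
import pandas as pd


def preprocess(df, omit_vals=None, omit_cols=None, subset_interval=None):
    """Hypothetical stand-in for the row/column omission described above."""
    out = df
    if subset_interval is not None:
        # keep only the rows at the given integer positions
        out = out.iloc[list(subset_interval)]
    if omit_cols:
        out = out.drop(columns=list(omit_cols))
    if omit_vals:
        # drop every row that contains any of the omitted values
        keep = ~out.isin(list(omit_vals)).any(axis=1)
        out = out[keep]
    return out


def select(df, input_cols, output_cols, what_data="full", return_df=False):
    """Hypothetical stand-in for the 'full' / 'input' / 'output' choice."""
    options = {"full": df, "input": df[input_cols], "output": df[output_cols]}
    keys = [what_data] if isinstance(what_data, str) else list(what_data)
    picked = [options[k] if return_df else options[k].to_numpy() for k in keys]
    # a string gives one object, a list gives a tuple of len(what_data)
    return picked[0] if isinstance(what_data, str) else tuple(picked)


raw = pd.DataFrame({
    "input_1": [1.5, 1.5, 2.0, 2.0],
    "input_2": [2.6, 3.0, 2.6, -1.0],
    "outcome": ["A", "B", "C", "A"],
    "junk": [0, 0, 0, 0],
})
clean = preprocess(raw, omit_vals=[-1.0], omit_cols=["junk"])
inputs, outputs = select(
    clean, ["input_1", "input_2"], ["outcome"], what_data=["input", "output"]
)
```

Passing a list for `what_data` yields one array per requested key, which matches the tuple behavior described in the Returns entry.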
- get_info()
Return the information that is printed by the ‘info()’ method.
- Returns
files (list) – File paths where data was loaded from.
df_index_keys (list) – Index keys added to the DataFrame object once multiple files are joined together such that one can access data by file after they were joined.
for_info (list) – Running list of print statements that include but are not limited to what is shown if ‘verbose=True’.
- get_regr_data(what_data='full')
Get data related to regression, sorted by class in dictionaries.
- Parameters
what_data (str, list, optional) –
- ‘input’
For each class, the input data with no cleaning.
- ‘raw_output’
For each class, the output data with no cleaning.
- ‘output’
For each class, the cleaned output data.
- ‘full’
All options listed above in that respective order.
- Returns
data – An object containing the specified regression data. If a list is passed, a tuple of len(what_data) is returned. The default (‘full’) returns a tuple of length 3.
- Return type
tuple, ndarray or DataFrame
- info()
Print info for the instance of TableData object.
For output descriptions see the method ‘get_info()’.
- make_class_data_plot(fig, ax, axes_keys, my_slice_vals=None, my_class_colors=None, return_legend_handles=False, verbose=False, **kwargs)
Plot classification data on a given axis and figure.
- Parameters
fig (Matplotlib Figure object) – Figure on which the plot will be made.
ax (Matplotlib Axis object) – Axis on which a scatter plot will be drawn.
axes_keys (list) – List containing two names of input data columns to use as horizontal and vertical axes.
my_slice_vals (optional, list, dict) – List giving values on which to slice in the axes not being plotted. Default (None) uses the first unique value found in each axis. If instead of individual values, a range is desired (e.g. 10 +/- 1) then a dict can be given with integer keys mapping to a tuple with the lower and upper range. ( e.g. {0:(9,11)} )
my_class_colors (optional, list) – List of colors used to represent different classes. Default (None) uses the default class colors.
return_legend_handles (optional, bool) – Returns a list of handles that connect classes to colors.
verbose (optional, bool) – Print useful information during runtime.
**kwargs (optional) – Kwargs for matplotlib.pyplot.scatter() .
- Returns
fig (Matplotlib Figure object) – Figure object.
ax (Matplotlib Axis object) – Updated axis after plotting.
handles (optional, list) – List of legend handles connecting classes and their colors. Returned if ‘return_legend_handles’ is True. Default is False.
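The slicing behavior of ‘my_slice_vals’ can be illustrated with a NumPy mask. This is a sketch under several assumptions: axes are addressed by column index rather than by name, the position in the list (or the integer dict key) counts only the axes that are not plotted, range bounds are inclusive, and the default picks the smallest unique value. The function name `slice_mask` is hypothetical.

```python
import numpy as np


def slice_mask(inputs, plot_axes, my_slice_vals=None):
    """Boolean mask of points lying on the requested slice.

    inputs: (n_points, n_dims) array of input data.
    plot_axes: the two column indices drawn on the plot.
    """
    other_axes = [i for i in range(inputs.shape[1]) if i not in plot_axes]
    mask = np.ones(len(inputs), dtype=bool)
    for pos, axis in enumerate(other_axes):
        column = inputs[:, axis]
        if my_slice_vals is None:
            # assumption: default to the smallest unique value
            mask &= column == np.unique(column)[0]
        elif isinstance(my_slice_vals, dict):
            # range given as {position: (lower, upper)}, bounds inclusive
            low, high = my_slice_vals[pos]
            mask &= (column >= low) & (column <= high)
        else:
            # single value per non-plotted axis
            mask &= column == my_slice_vals[pos]
    return mask


inputs = np.array([
    [0.0, 0.0, 10.0],
    [1.0, 0.0, 10.0],
    [0.0, 1.0, 9.5],
    [1.0, 1.0, 12.0],
    [2.0, 2.0, 10.9],
])
on_value = slice_mask(inputs, plot_axes=(0, 1), my_slice_vals=[10.0])
in_range = slice_mask(inputs, plot_axes=(0, 1), my_slice_vals={0: (9, 11)})
```

Only the points selected by the mask would then be passed to the scatter plot for the two plotted axes.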
- plot_3D_class_data(axes=None, fig_size=(4, 5), mark_size=12, which_val=0, save_fig=False, plt_str='0', color_list=None)
Plot the 3D classification data in a 2D plot.
Three input axes with classification output.
- Parameters
axes (list, optional) –
By default, the axes are ordered as [x,y,z] in the order in which the input axes were read in. To change the ordering, pass a list with the column names. Example: The default order is col_1, col_2, col_3. To change the horizontal axis from col_1 to col_2 you would use:
‘axes = [“col_2”, “col_1”, “col_3”]’
fig_size (tuple, optional, default = (4,5)) – Size of the figure. (Matplotlib figure kwarg ‘figsize’)
mark_size (float, optional, default = 12) – Size of the scatter plot markers. (Matplotlib scatter kwarg ‘s’)
which_val (int, default = 0) – Integer choosing what unique value to ‘slice’ on in the 3D data such that it can be plotted on 2D. (If you had x,y,z data you need to choose a z value)
save_fig (bool, default = False) – Save the figure in the local directory.
plt_str (str, default = '0') – If you are saving multiple figures you can pass a string which will be added to the end of the default: “data_plot_{plt_str}.pdf”
color_list (list, default = None) –
- Returns
- Return type
matplotlib figure
- posydon.active_learning.psy_cris.data.calc_avg_dist(data, n_neighbors, neighbor=None)
Get the average distance to the nearest neighbors in the data set.
(NearestNeighbors from sklearn.neighbors)
- Parameters
data (ndarray) – Data to train the NearestNeighbors class on.
n_neighbors (int) – Number of neighbors to use when finding average distance.
neighbor (instance of NearestNeighbors class) – For passing your own object.
- Returns
avg_dist (array) – The average distance between nearest neighbors.
g_indi (array) – Indices that correspond to the nearest neighbors.
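The idea behind the average nearest-neighbor distance can be sketched without scikit-learn. The documented function relies on sklearn's NearestNeighbors; the version below swaps in a brute-force NumPy distance matrix, uses Euclidean distance, and excludes each point from its own neighbor list. All three choices are assumptions. The name `avg_dist_to_neighbors` is hypothetical.

```python
import numpy as np


def avg_dist_to_neighbors(data, n_neighbors):
    """Average Euclidean distance from each point to its n nearest neighbors."""
    data = np.asarray(data, dtype=float)
    # pairwise distance matrix, shape (n_points, n_points)
    diffs = data[:, None, :] - data[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # assumption: a point is not its own neighbor
    np.fill_diagonal(dists, np.inf)
    # indices of the n closest points, nearest first
    indices = np.argsort(dists, axis=1)[:, :n_neighbors]
    nearest = np.take_along_axis(dists, indices, axis=1)
    return nearest.mean(axis=1), indices


points = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]]
avg_dist, g_indi = avg_dist_to_neighbors(points, n_neighbors=2)
```

The brute-force matrix scales quadratically with the number of points, so it is only suitable for small data sets; a tree-based search such as the one in scikit-learn is the better choice for larger ones.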
- posydon.active_learning.psy_cris.data.calc_avg_p_change(data, where_nearest_neighbors, undefined_p_change_val=None)
Calculate the average fractional change in a given data set.
The method uses the N nearest neighbors (calculated beforehand).
- Parameters
data (ndarray) – Data set to calculate percent change.
where_nearest_neighbors (dict) – Indices in data for the n nearest neighbors in input space.
undefined_p_change_val (optional) – For output with an undefined percent change (zero value), this kwarg defines what is put in its place. Defaults to nan.
- Returns
avg_p_change_holder – Each element contains the average percent change for a given number of neighbors. If any output values are 0, then a nan is put in place of a percent change.
- Return type
ndarray
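A rough sketch of an average fractional change over precomputed neighbors follows. The source does not give the formula, so this assumes one-dimensional output data and a change defined as |neighbor - value| / |value|, averaged over the neighbors. The function name `avg_fractional_change` is hypothetical.

```python
import numpy as np


def avg_fractional_change(data, where_nearest_neighbors, undefined_val=np.nan):
    """Average fractional change of each value relative to its neighbors.

    data: 1-D array of output values (assumption).
    where_nearest_neighbors: dict mapping n to an (n_points, n) index array.
    """
    data = np.asarray(data, dtype=float)
    rows = []
    for n, idx in where_nearest_neighbors.items():
        neighbor_vals = data[np.asarray(idx)]  # shape (n_points, n)
        base = data[:, None]                   # shape (n_points, 1)
        with np.errstate(divide="ignore", invalid="ignore"):
            frac = np.abs(neighbor_vals - base) / np.abs(base)
        avg = frac.mean(axis=1)
        # a zero value makes the change undefined; use the placeholder
        avg[data == 0] = undefined_val
        rows.append(avg)
    # one row per entry in where_nearest_neighbors
    return np.array(rows)


values = [2.0, 4.0, 0.0]
neighbors = {1: [[1], [0], [1]]}
avg_change = avg_fractional_change(values, neighbors)
```

Passing a number for `undefined_val` replaces the nan placeholder, mirroring the role of ‘undefined_p_change_val’.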