EmpkinsDataset#

class pepbench.datasets.EmpkinsDataset(base_path: path_t, groupby_cols: Sequence[str] | None = None, subset_index: Sequence[str] | None = None, *, return_clean: bool = True, exclude_missing_data: bool = False, use_cache: bool = True, only_labeled: bool = False, label_type: str = 'rater_01')[source]#

Dataset class for the EmpkinS dataset.

Provides access to Biopac ECG/ICG signals, preprocessed signals, timelogs for experimental phases, reference annotations, and participant metadata.

Parameters:
base_path: path-like

Path to the root directory of the EmpkinS dataset.

groupby_cols: sequence of str, optional

Columns to group the dataset index by.

subset_index: sequence of str, optional

Subset of the dataset index to operate on.

return_clean: bool, optional

If True, return preprocessed/cleaned ECG and ICG signals. Default is True.

exclude_missing_data: bool, optional

If True, exclude participants with missing data. Default is False.

use_cache: bool, optional

If True, cache loading of Biopac files. Default is True.

only_labeled: bool, optional

If True, return only labeled sections (cut to labeling borders). Default is False.

label_type: {‘rater_01’, ‘rater_02’, ‘average’}, optional

Which label set to use for reference annotations. Default is ‘rater_01’.

Attributes:
SAMPLING_RATES: dict

Per-channel sampling rates in Hz.

PHASES: sequence

Ordered list of experimental phases.

CONDITIONS: sequence

Available experimental conditions.

Methods

as_attrs()

Return a version of the Dataset class that can be subclassed using attrs defined classes.

as_dataclass()

Return a version of the Dataset class that can be subclassed using dataclasses.

assert_is_single(groupby_cols, property_name)

Raise an error if the index contains more than one group/row with the given groupby settings.

assert_is_single_group(property_name)

Raise an error if the index contains more than one group/row.

clone()

Create a new instance of the class with all parameters copied over.

create_index()

Create the dataset index.

create_string_group_labels(label_cols)

Generate a list of string labels for each group/row in the dataset.

get_params([deep])

Get parameters for this algorithm.

get_subset(*[, group_labels, index, bool_map])

Get a subset of the dataset.

groupby(groupby_cols)

Return a copy of the dataset grouped by the specified columns.

index_as_tuples()

Get all datapoint labels of the dataset (i.e. a list of the rows of the index as named tuples).

is_single(groupby_cols)

Return True if index contains only one row/group with the given groupby settings.

is_single_group()

Return True if index contains only one group.

iter_level(level)

Return generator object containing a subset for every category from the selected level.

set_params(**params)

Set the parameters of this Algorithm.

create_group_labels

__init__(base_path: path_t, groupby_cols: Sequence[str] | None = None, subset_index: Sequence[str] | None = None, *, return_clean: bool = True, exclude_missing_data: bool = False, use_cache: bool = True, only_labeled: bool = False, label_type: str = 'rater_01') None[source]#

Initialize a new EmpkinsDataset instance.

Parameters:
base_path: Path or str

Path to the root directory of the EmpkinS dataset.

return_clean: bool

Whether to return the preprocessed/cleaned ECG and ICG data when accessing the respective properties. Default: True.

exclude_missing_data: bool

Whether to exclude participants where parts of the data are missing. Default: False.

use_cache: bool

Whether to use caching for loading Biopac data. Default: True.

only_labeled: bool

Whether to only return sections of the Biopac data that are labeled (i.e., cut to labeling borders). This is necessary when using the dataset for evaluating the performance of PEP extraction algorithms or for training ML-based PEP extraction algorithms. Default: False.

label_type: str, optional

Which annotations to use. Can be either “rater_01”, “rater_02”, or “average”. Default: “rater_01”.
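The parameters above can be combined as in the following minimal sketch for an evaluation setup. The helper names `evaluation_kwargs` and `load_for_evaluation` and the `VALID_LABEL_TYPES` constant are hypothetical and not part of pepbench; only the class name, import path, and keyword arguments are taken from the signature above. Whether the class itself validates `label_type` is not stated here, so the check below is an assumption made for illustration.

```python
from pathlib import Path

# Values listed for label_type in the parameter description above.
VALID_LABEL_TYPES = ("rater_01", "rater_02", "average")


def evaluation_kwargs(label_type: str = "rater_01") -> dict:
    """Build keyword arguments for an evaluation-style dataset (hypothetical helper)."""
    if label_type not in VALID_LABEL_TYPES:
        raise ValueError(f"label_type must be one of {VALID_LABEL_TYPES}, got {label_type!r}")
    # only_labeled=True cuts the signals to the labeling borders,
    # which the docs describe as necessary for evaluating PEP extraction.
    return {"only_labeled": True, "return_clean": True, "label_type": label_type}


def load_for_evaluation(base_path, label_type: str = "rater_01"):
    """Create the dataset; pepbench is imported lazily so this module loads without it."""
    from pepbench.datasets import EmpkinsDataset

    return EmpkinsDataset(Path(base_path), **evaluation_kwargs(label_type))


print(evaluation_kwargs("average"))
```

The returned dataset still covers all participants, so it would typically be narrowed with get_subset before reading ecg or icg, since those properties raise a ValueError on unsupported selections.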

create_index() DataFrame[source]#

Create the dataset index.

Returns:
DataFrame

DataFrame containing all combinations of participant IDs, conditions, and phases.
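As a rough illustration of "all combinations", the index can be thought of as a Cartesian product of the three identifier lists. The participant IDs, condition names, phase names, and column names below are placeholders chosen for this sketch; the real values come from the dataset folder and the class attributes, and the real method returns a pandas DataFrame rather than a list of dicts.

```python
from itertools import product

# Placeholder identifiers, not the actual EmpkinS values.
participants = ["VP_001", "VP_002"]
conditions = ["tsst", "ftsst"]
phases = ["Prep", "Pause_1", "Talk"]

# One row per (participant, condition, phase) combination.
index_rows = [
    {"participant": participant, "condition": condition, "phase": phase}
    for participant, condition, phase in product(participants, conditions, phases)
]

print(len(index_rows))  # 2 * 2 * 3 = 12 rows
print(index_rows[0])
```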

property sampling_rate: dict[str, float]#

Return sampling rates of the ECG and ICG signals.

Returns:
dict

Dictionary with the sampling rates of the ECG and ICG signals in Hz.

property sampling_rate_ecg: int#

Return sampling rate of the ECG signal.

Returns:
int

Sampling rate of the ECG data in Hz.

property sampling_rate_icg: int#

Return sampling rate of the ICG signal.

Returns:
int

Sampling rate of the ICG data in Hz.

property biopac: DataFrame[source]#

Return biopac data for the current subset.

Returns:
DataFrame or dict

If a single participant+condition+phase is selected, returns a DataFrame containing the Biopac channels. If a single participant+condition but all phases are selected and only_labeled is True, returns a dict mapping phase names to DataFrames. In other multi-subset cases a ValueError is raised.

Raises:
ValueError

If the selection is not a single participant and condition (and optionally a single phase).
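Because the property can return either a single table or a dict of tables, calling code may find it convenient to normalize both shapes. The helper `as_phase_dict` below is a hypothetical sketch, not part of pepbench; it works on arbitrary objects, and plain strings stand in for DataFrames in the usage lines.

```python
def as_phase_dict(biopac_data, phase_name=None):
    """Normalize the two documented return shapes to a {phase: data} mapping."""
    if isinstance(biopac_data, dict):
        # Multi-phase case: the data is already keyed by phase name.
        return dict(biopac_data)
    if phase_name is None:
        raise ValueError("phase_name is required when a single table is passed.")
    # Single-phase case: wrap the table under the phase name supplied by the caller.
    return {phase_name: biopac_data}


# Placeholder strings stand in for DataFrames.
single = as_phase_dict("frame_prep", phase_name="Prep")
multi = as_phase_dict({"Prep": "frame_prep", "Talk": "frame_talk"})
print(sorted(single), sorted(multi))
```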

property icg: _IcgRawDataFrame | DataFrame#

Return the ICG channel from the biopac data.

If return_clean is set to True in the __init__, the ICG signal is preprocessed and cleaned using the IcgPreprocessingBandpass algorithm before returning it.

Returns:
DataFrame

ICG data as DataFrame.

Raises:
ValueError

If not operating on a single participant, condition, and phase/selection as required by the API.

property ecg: _EcgRawDataFrame | DataFrame#

Return the ECG channel from the biopac data.

If return_clean is set to True in the __init__, the ECG signal is preprocessed and cleaned using the EcgPreprocessingNeurokit algorithm before returning it.

Returns:
EcgRawDataFrame

ECG data as a DataFrame.

Raises:
ValueError

If not operating on a single participant, condition, and phase/selection as required by the API.

property timelog: DataFrame#

Return the timelog data.

Timelog entries describing experimental phase boundaries.

Returns:
DataFrame

Timelog rows for the selected participant/condition and (optionally) phase.

Raises:
ValueError

If timelog access is attempted for unsupported selections (e.g., multiple participants or conditions).

property labeling_borders: DataFrame#

Return the labeling borders for the selected participant, condition, and phase.

Returns:
DataFrame

Labeling borders with columns including sample_absolute and description.

Raises:
ValueError

If not operating on a single participant.

property reference_heartbeats: DataFrame#

Return computed reference heartbeat markers derived from ECG reference labels.

Returns:
DataFrame

Heartbeat segmentation/reference table derived from ECG reference labels.

property reference_labels_ecg: DataFrame | dict[str, DataFrame]#

Return the ECG reference labels for the current selection.

Returns:
DataFrame or dict

If a single phase is selected, returns a DataFrame of reference labels for that phase. If all phases are selected, returns a concatenated DataFrame indexed by phase.

Raises:
ValueError

If reference labels are requested for unsupported subset selections.

property reference_labels_icg: DataFrame | dict[str, DataFrame]#

Return the ICG reference labels for the current selection.

Returns:
DataFrame or dict

If a single phase is selected, returns a DataFrame of reference labels for that phase. If all phases are selected, returns a concatenated DataFrame indexed by phase.

Raises:
ValueError

If reference labels are requested for unsupported subset selections.

property heartbeats: _HeartbeatSegmentationDataFrame | DataFrame#

Heartbeat segmentation computed from the ECG signal.

Uses HeartbeatSegmentationNeurokit to extract heartbeat borders.

Returns:
HeartbeatSegmentationDataFrame

DataFrame describing heartbeat onsets/offsets and related segmentation info.

property metadata: DataFrame#

Return participant metadata.

Returns:
DataFrame

Participant metadata indexed by participant id. Only rows for the currently selected participants are returned.

property age: DataFrame#

Return age of selected participants.

Returns:
DataFrame

DataFrame with the Age column for the selected participants.

classmethod as_attrs()[source]#

Return a version of the Dataset class that can be subclassed using attrs defined classes.

Note, this requires attrs to be installed!

classmethod as_dataclass()[source]#

Return a version of the Dataset class that can be subclassed using dataclasses.

assert_is_single(groupby_cols: list[str] | str | None, property_name) None[source]#

Raise an error if the index contains more than one group/row with the given groupby settings.

This should be used when implementing access to data values that can only be accessed when a single trial/participant/etc. exists in the dataset.

Parameters:
groupby_cols

None (no grouping) or a valid subset of the columns available in the dataset index.

property_name

Name of the property this check is used in. Used to format the error message.

assert_is_single_group(property_name) None[source]#

Raise an error if the index contains more than one group/row.

Note that this is different from assert_is_single as it is aware of the current grouping. Instead of checking that a certain combination of columns is left in the dataset, it checks that only a single group exists with the already selected grouping as defined by self.groupby_cols.

Parameters:
property_name

Name of the property this check is used in. Used to format the error message.

property base_demographics: DataFrame#

Return base demographics of the participants.

Returns:
DataFrame

The base demographics DataFrame including gender, age, and BMI.

clone() Self[source]#

Create a new instance of the class with all parameters copied over.

This will create a new instance of the class itself and of all nested objects.

create_string_group_labels(label_cols: str | list[str]) list[str][source]#

Generate a list of string labels for each group/row in the dataset.

Note

This has a different use case than the dataset-wide groupby. Using groupby reduces the effective size of the dataset to the number of groups. This method produces a group label for each group/row that is already in the dataset, without changing the dataset.

The output of this method can be used in combination with GroupKFold as the group label.

Parameters:
label_cols

The columns that should be included in the label. If the dataset is already grouped, this must be a subset of self.groupby_cols.
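The following sketch illustrates the idea of one string label per row, using plain tuples instead of a dataset index. The function `string_group_labels`, the column names, and the underscore-joined label format are assumptions made for this example; the exact string format produced by the real method is not specified here.

```python
columns = ("participant", "condition", "phase")
rows = [
    ("VP_001", "tsst", "Prep"),
    ("VP_001", "tsst", "Talk"),
    ("VP_002", "tsst", "Prep"),
    ("VP_002", "ftsst", "Prep"),
]


def string_group_labels(rows, columns, label_cols):
    """Return one string label per row, built from the selected columns."""
    if isinstance(label_cols, str):
        label_cols = [label_cols]
    positions = [columns.index(col) for col in label_cols]
    return ["_".join(row[pos] for pos in positions) for row in rows]


# One label per row; the number of rows is unchanged.
labels = string_group_labels(rows, columns, "participant")
print(labels)  # ['VP_001', 'VP_001', 'VP_002', 'VP_002']
```

A list like this can be passed as the `groups` argument of a grouped cross-validation splitter so that all rows of one participant end up in the same fold.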

get_params(deep: bool = True) dict[str, Any][source]#

Get parameters for this algorithm.

Parameters:
deep

Only relevant if the object contains nested algorithm objects. If this is the case and deep is True, the params of these nested objects are included in the output using a prefix like nested_object_name__ (note the two “_” at the end).

Returns:
params

Parameter names mapped to their values.
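The double-underscore prefix for nested objects can be illustrated with a small stand-in class. `Params` below is a toy written for this sketch and does not reproduce the actual implementation; it only shows how nested parameter names are flattened when deep is True.

```python
class Params:
    """Toy parameter holder that mimics the documented get_params naming scheme."""

    def __init__(self, **params):
        self._params = params

    def get_params(self, deep=True):
        out = dict(self._params)
        if deep:
            for name, value in self._params.items():
                if isinstance(value, Params):
                    # Nested parameters are exposed as "<name>__<nested_param>".
                    for nested_name, nested_value in value.get_params(deep=True).items():
                        out[f"{name}__{nested_name}"] = nested_value
        return out


inner = Params(cutoff=0.5)
outer = Params(use_cache=True, filter=inner)

print(sorted(outer.get_params(deep=True)))   # ['filter', 'filter__cutoff', 'use_cache']
print(sorted(outer.get_params(deep=False)))  # ['filter', 'use_cache']
```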

get_subset(*, group_labels: list[tuple[str, ...]] | None = None, index: DataFrame | None = None, bool_map: Sequence[bool] | None = None, **kwargs: list[str] | str) Self[source]#

Get a subset of the dataset.

Note

All arguments are mutually exclusive!

Parameters:
group_labels

A valid row locator or slice that can be passed to self.grouped_index.loc[locator, :]. This basically needs to be a subset of self.group_labels. Note that this is the only indexer that works on the grouped index. All other indexers work on the pure index.

index

pd.DataFrame that is a valid subset of the current dataset index.

bool_map

bool-map that is used to index the current index-dataframe. The list must be of same length as the number of rows in the index.

**kwargs

The key must be the name of an index column. The value is a list containing strings that correspond to the categories that should be kept. For examples see above.

Returns:
subset

New dataset object filtered by specified parameters.
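The selection semantics can be sketched on a list of dict rows. `select_rows` is a hypothetical stand-in that covers only the bool_map and keyword-filter cases, and the column names are placeholders; the real method returns a new dataset object rather than a list.

```python
rows = [
    {"participant": "VP_001", "condition": "tsst", "phase": "Prep"},
    {"participant": "VP_001", "condition": "tsst", "phase": "Talk"},
    {"participant": "VP_002", "condition": "ftsst", "phase": "Prep"},
]


def select_rows(rows, *, bool_map=None, **column_filters):
    """Filter rows either by a boolean mask or by allowed values per column."""
    if bool_map is not None and column_filters:
        raise ValueError("The selection arguments are mutually exclusive.")
    if bool_map is not None:
        if len(bool_map) != len(rows):
            raise ValueError("bool_map must have one entry per index row.")
        return [row for row, keep in zip(rows, bool_map) if keep]

    def matches(row):
        for column, allowed in column_filters.items():
            allowed = [allowed] if isinstance(allowed, str) else allowed
            if row[column] not in allowed:
                return False
        return True

    return [row for row in rows if matches(row)]


print(len(select_rows(rows, bool_map=[True, False, True])))  # 2
print(len(select_rows(rows, phase="Prep", participant=["VP_001"])))  # 1
```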

property group: GroupLabelT#

Get the current group label. Deprecated, use group_label instead.

property group_label: GroupLabelT#

Get the current group label.

The group is defined by the current groupby settings.

Note that this attribute can only be used if there is just a single group. This will return a named tuple. The tuple will contain only one entry if there is only a single groupby column or column in the index. The elements of the named tuple will have the same names as the groupby columns and will be in the same order.

property group_labels: list[GroupLabelT]#

Get all group labels of the dataset based on the set groupby level.

This will return a list of named tuples. The tuples will contain only one entry if there is only one groupby level or index column.

The elements of the named tuples will have the same names as the groupby columns and will be in the same order.

Note that if one of the groupby levels/index columns is not a valid Python attribute name (e.g. it contains spaces or starts with a number), the named tuple will not contain the correct column name! For more information see the documentation of the rename parameter of collections.namedtuple.

For some examples and additional explanation see this example.
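The renaming behavior mentioned above comes from the standard library and can be reproduced directly. The type name `GroupLabel` and the column names are placeholders for this sketch; only the `rename=True` behavior of collections.namedtuple is being demonstrated.

```python
from collections import namedtuple

# "test phase" contains a space and "1st_rater" starts with a digit,
# so neither is a valid Python identifier.
GroupLabel = namedtuple(
    "GroupLabel", ["participant", "test phase", "1st_rater"], rename=True
)

label = GroupLabel("VP_001", "Prep", "rater_01")

# Invalid field names are replaced by positional names.
print(GroupLabel._fields)  # ('participant', '_1', '_2')
print(label.participant, label._1, label._2)
```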

groupby(groupby_cols: list[str] | str | None) Self[source]#

Return a copy of the dataset grouped by the specified columns.

This does not change the order of the rows of the dataset index.

Each unique group represents a single data point in the resulting dataset.

Parameters:
groupby_cols

None (no grouping) or a valid subset of the columns available in the dataset index.

property grouped_index: DataFrame#

Return the index with the groupby columns set as multiindex.

property groups: list[GroupLabelT]#

Get the current group labels. Deprecated, use group_labels instead.

property index: DataFrame#

Get index.

index_as_tuples() list[GroupLabelT][source]#

Get all datapoint labels of the dataset (i.e. a list of the rows of the index as named tuples).

property index_is_unchanged: bool#

Returns True if the index is the same as the one created by create_index.

This can be used to check whether the index represents a subset or the actual full index. Note that this is independent of the groupby_cols setting.

Note

Under the hood, this uses the attrs functionality of pandas to store a hash of the original index on the dataframe. If the index is modified or a new index is created, this attribute either no longer exists or its content has changed.

is_single(groupby_cols: list[str] | str | None) bool[source]#

Return True if index contains only one row/group with the given groupby settings.

If groupby_cols=None this checks if there is only a single row left. If you want to check if there is only a single group within the current grouping, use is_single_group instead.

Parameters:
groupby_cols

None (no grouping) or a valid subset of the columns available in the dataset index.

is_single_group() bool[source]#

Return True if index contains only one group.

iter_level(level: str) Iterator[Self][source]#

Return generator object containing a subset for every category from the selected level.

Parameters:
level

The level to iterate over. This must be one of the column names of the index.

Returns:
subset

New dataset object containing only one category in the specified level.
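Iterating over a level can be pictured as yielding one filtered subset per distinct value of that column. The generator below is a hypothetical sketch on dict rows with placeholder column names; the order in which the real method visits categories is an assumption here (first appearance in the index).

```python
rows = [
    {"participant": "VP_001", "condition": "tsst"},
    {"participant": "VP_001", "condition": "ftsst"},
    {"participant": "VP_002", "condition": "tsst"},
]


def iter_level(rows, level):
    """Yield one list of rows per distinct value in the given column."""
    categories = []
    for row in rows:
        if row[level] not in categories:
            categories.append(row[level])
    for category in categories:
        yield [row for row in rows if row[level] == category]


subsets = list(iter_level(rows, "condition"))
print([len(subset) for subset in subsets])  # [2, 1]
```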

property reference_pep: DataFrame#

Compute the reference PEP values between the reference Q-peak and B-point labels.

Returns:
DataFrame

DataFrame containing the computed PEP values.
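Conceptually, the pre-ejection period is the time from the Q-peak in the ECG to the B-point in the ICG of the same heartbeat. The sketch below assumes that both labels are given as sample indices on a common time base and uses a made-up sampling rate of 1000 Hz; the function `pep_ms`, the sample values, and the millisecond output unit are illustrative assumptions, not the library's implementation.

```python
def pep_ms(q_peak_sample: int, b_point_sample: int, sampling_rate_hz: float) -> float:
    """Convert the Q-peak to B-point distance from samples to milliseconds."""
    return (b_point_sample - q_peak_sample) * 1000.0 / sampling_rate_hz


# Made-up reference labels for three heartbeats (sample indices).
q_peaks = [1200, 2150, 3080]
b_points = [1295, 2240, 3178]
sampling_rate_hz = 1000.0  # assumed value for this example

pep_values = [pep_ms(q, b, sampling_rate_hz) for q, b in zip(q_peaks, b_points)]
print(pep_values)  # [95.0, 90.0, 98.0]
```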

set_params(**params: Any) Self[source]#

Set the parameters of this Algorithm.

To set parameters of nested objects use nested_object_name__para_name=.

property shape: tuple[int]#

Get the shape of the dataset.

This only reports a single dimension. This is equal to the number of rows in the index, if self.groupby_cols=None. Otherwise, it is equal to the number of unique groups.
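The difference between counting rows and counting groups can be shown on a small stand-in index. The function `dataset_shape` and the column names are hypothetical and only mirror the rule stated above.

```python
columns = ("participant", "condition", "phase")
rows = [
    ("VP_001", "tsst", "Prep"),
    ("VP_001", "tsst", "Talk"),
    ("VP_002", "tsst", "Prep"),
    ("VP_002", "ftsst", "Prep"),
]


def dataset_shape(rows, columns, groupby_cols=None):
    """Return a one-element tuple: row count if ungrouped, else unique group count."""
    if groupby_cols is None:
        return (len(rows),)
    positions = [columns.index(col) for col in groupby_cols]
    unique_groups = {tuple(row[pos] for pos in positions) for row in rows}
    return (len(unique_groups),)


print(dataset_shape(rows, columns))                                # (4,)
print(dataset_shape(rows, columns, ["participant"]))               # (2,)
print(dataset_shape(rows, columns, ["participant", "condition"]))  # (3,)
```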

property gender: DataFrame#

Return gender of selected participants.

Returns:
DataFrame

Gender as a pandas DataFrame, recoded as {1: “Female”, 2: “Male”}.

property bmi: DataFrame#

Return body-mass index (BMI) for selected participants.

Returns:
DataFrame

Computed BMI (using demographics Weight and Height) for the selected participants.
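BMI is defined as body weight in kilograms divided by the square of body height in meters. The sketch below assumes weight is stored in kilograms and height in centimeters; the units of the Weight and Height columns are not stated here, so this is an assumption, and `compute_bmi` is a hypothetical helper rather than the library's code.

```python
def compute_bmi(weight_kg: float, height_cm: float) -> float:
    """Body-mass index in kg/m^2, assuming weight in kg and height in cm."""
    height_m = height_cm / 100.0
    return weight_kg / height_m**2


bmi_value = compute_bmi(weight_kg=70.0, height_cm=175.0)
print(round(bmi_value, 2))  # 22.86
```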