Common Datatypes#
pepbench tries to stick to common data-containers - namely ndarray, DataFrame, dict and Series - to store all in- and outputs of the used algorithm. However, based on the above mentioned containers, a set of certain data-types are defined and used throughout the library.
This makes it easy for users to handle complex problems and makes it possible to perform sanity checks that prevent common issues.
The following explains these data-structures in details to ease to process of preparing your data for the use of pepbench and help to understand the outputs.
Units#
Before talking about data-types the physical units for all values stored in these data-types should be clear. The following table provides an overview over the units commonly used in the pepbench package and what they refer to.
Value |
Unit |
|---|---|
Time (seconds) |
s |
Time (milliseconds) |
ms |
Sampling rate / frequency |
Hz |
Heart rate |
bpm |
Relative / percentage values |
% |
Sample indices / counts |
samples |
Signal Amplitudes#
Signal amplitude units (for ECG/ICG traces) are dataset-dependent and are not enforced by pepbench; common dataset units are volts (V) or millivolts (mV). pepbench algorithms expect that the amplitude unit is consistent within a dataset.
Naming Conventions in pepbench#
The codebase uses a few naming conventions / column suffixes to indicate units and facilitate automatic dtype coercion:
Columns/suffixes ending with
_ms: values expressed in milliseconds (ms). Examples from the codebase:pep_ms,rr_interval_ms,error_per_sample_ms.Columns/suffixes ending with
_percentor represented with%in labels: percentage values (%%). Examples:absolute_relative_error_per_sample_percent.heart_rate_bpm: heart rate values given in beats per minute (bpm)._data/ sample indices: many functions operate on sample indices (integer counts). When plotting or when requested, pepbench can convert time-like indexes to seconds (s).
Start-End Indices#
Many pepbench tables and functions (for example heartbeat tables and annotation loaders) represent time ranges using sample indices named start_sample and end_sample. pepbench follows the common Python slicing convention: the start index is inclusive and the end index is exclusive, i.e. the interval represented is [start, end).
Practical consequences and examples from the codebase:
To obtain the last sample inside a region the code frequently uses
end_sample - 1. Seepepbench.plotting._utils._get_heartbeat_borderswhich mapsend_sampleto the final index withend_sample - 1.Durations in samples are obtained via
end - start(for example heartbeat durations or when shifting indices to a zero origin).Edge cases: - A region that starts at the first sample of a recording has
start = 0. - A region that includes the last sample of a recording usesend = len(dataset). - Adjacent regions share boundaries: theendof the first region equals thestartof the next region.
Heartbeat Lists#
Heartbeats are represented as DataFrame with one row per beat and columns for the start and end
sample indices, as well as the R-peak sample index. A well-defined heartbeat list makes it easy to align ECG/ICG
segments, run extraction algorithms and evaluate results.
A SingleSensorHeartbeatList is a plain pandas.DataFrame that should at least contain the columns
start_sample and end_sample. In many cases a r_peak_sample column is present as well (the detected R-peak
within the heartbeat). The index is expected to have one level with the name heartbeat_id. If you prefer to keep
heartbeat_id as a column instead, convert it to the index with df = df.set_index("heartbeat_id") before
passing it into functions that expect the index.
All sample-based columns are expressed in samples relative to the start of the recording (not relative to the start of
each heartbeat). Durations can be obtained using the sampling rate (fs) and converted to seconds or milliseconds.
Required/Recommended columns and units#
start_sample(int): inclusive start index (samples) of the heartbeat in the recording.end_sample(int): exclusive end index (samples) of the heartbeat in the recording (pepbench uses half-open intervals [start, end)).r_peak_sample(int, optional but recommended): sample index of the R-peak. If present it should satisfystart_sample <= r_peak_sample < end_sample.
Recommended additional/derived columns#
duration_samples=end_sample - start_sample(int)duration_ms=duration_samples / fs * 1000(float)r_peak_offset=r_peak_sample - start_sample(int)quality_score(float),label(str) orsource_channel(str) for metadata or curation
Index and format conventions#
Index name:
heartbeat_idis the canonical index name used in examples and several internal functions. Many internal examples and helpers assume the index has this name (seematch_heartbeat_lists).Columns always refer to absolute sample indices in the recording (not time or per-beat offsets). Use the sampling rate to convert to seconds if needed.
Invariants and validation rules#
Rows should be sorted by
start_sample(increasing).start_sample >= 0andend_sample > start_sample(no negative or zero-length regions unless explicitly documented).If
r_peak_sampleis present, it must satisfystart_sample <= r_peak_sample < end_sample.Adjacent beats are allowed: the
end_sampleof one beat can equal thestart_sampleof the next beat.Overlapping beats should either be resolved (merge/remove) or annotated (for example with
quality_score).
Common operations and examples#
A typical heartbeat list example (assume fs = 1000 Hz):
>>> import pandas as pd
>>> df = pd.DataFrame(
... [[0, 300, 150], [300, 620, 455], [620, 930, 750]],
... columns=["start_sample", "end_sample", "r_peak_sample"],
... )
>>> df.index.name = "heartbeat_id"
>>> df
start_sample end_sample r_peak_sample
heartbeat_id
0 0 300 150
1 300 620 455
2 620 930 750
Compute derived columns:
>>> df["duration_samples"] = df["end_sample"] - df["start_sample"]
>>> df["duration_ms"] = df["duration_samples"] / 1000 * 1000 # = duration_samples for fs=1000
Filtering short beats:
>>> min_samples = int(0.2 * 1000) # 200 ms
>>> df_filtered = df[df["duration_samples"] >= min_samples]
Integration with algorithms and helpers#
Heartbeat segmentation algorithms used in pepbench (for example
HeartbeatSegmentationNeurokit) provide aheartbeat_list_attribute that already follows the sample-index convention used here.Annotation loaders / dataset helpers expose reference heartbeats in the format expected by algorithms. The helper
compute_reference_heartbeatsreformats annotation tables (dropping channel-level, renaming columns to*_sample) to a heartbeat table suitable for matching and evaluation.To evaluate and match two heartbeat lists use
match_heartbeat_lists. This function compares start/end borders (in samples) and returns true/false positive/negative matches; it assumes the column namesstart_sampleandend_sampleand an index namedheartbeat_id(see its docstring for examples).
Edge cases and recommendations#
Missing
r_peak_sample: allow NaN when segmentation algorithms fail to detect a clear R-peak — downstream steps that require the R-peak should handle or skip those beats explicitly.Beats at recording boundaries:
start_sample == 0andend_sample == len(recording)are valid and indicate coverage to the recording edges.Overlaps and duplicates: prefer to resolve these during preprocessing. When storing heartbeat lists on disk prefer parquet to preserve dtypes.
Datasets#
Compared to the low level datatypes, datasets are higher level abstractions, containing all data and metadata
associated with a set of recordings.
They are based on the Dataset class and allow to easily load and access otherwise complex data
structures.
A dataset that has only one “row” (i.e. one recording) is referred to as a “datapoint” and is the expected input for all the pipelines in pepbench.
A dataset that contains exactly one recording (a single row in the tpcp.Dataset sense) is referred to as a
“datapoint”. This is the expected input for the package’s extraction pipelines — typically you should pass a
BasePepDataset instance for extraction pipelines, or a
BasePepDatasetWithAnnotations instance when reference labels
(heartbeats/PEP) are required for evaluation.
Using this dataset abstraction allows us to easily apply the same algorithms to different datasets and to use
higher-level tpcp-features like the cross_validate to run and evaluate our pipelines on
subsets of our datasets in a consistent manner.
The simplest dataset that we provide out of the box is the ExampleDataset, which can be
used to load the example data that we provide with pepbench.
If you have already loaded your own data and want to use it with a pepbench pipeline, you can use the
WrapperDataset class to quickly create a compatible dataset from your data.
For long-term use and clearer integration we highly encourage creating a custom dataset class that subclasses
BasePepDataset (or
BasePepDatasetWithAnnotations when reference labels are required).
This simplifies many tasks and provides a clean abstraction for your data.