Common Datatypes#

pepbench tries to stick to common data containers, namely ndarray, DataFrame, dict and Series, to store all inputs and outputs of its algorithms. On top of these containers, however, a set of specific datatypes is defined and used throughout the library. This makes it easier for users to handle complex problems and allows sanity checks that prevent common issues. The following sections explain these data structures in detail to ease the process of preparing your data for use with pepbench and to help you understand its outputs.

Units#

Before discussing the datatypes, the physical units of all values stored in them should be clear. The following table provides an overview of the units commonly used in the pepbench package and what they refer to.

Common Units in pepbench#
Value	Unit
Time (seconds)	s
Time (milliseconds)	ms
Sampling rate / frequency	Hz
Heart rate	bpm
Relative / percentage values	%
Sample indices / counts	samples

Signal Amplitudes#

Signal amplitude units (for ECG/ICG traces) are dataset-dependent and are not enforced by pepbench; common dataset units are volts (V) or millivolts (mV). pepbench algorithms expect that the amplitude unit is consistent within a dataset.

Naming Conventions in pepbench#

The codebase uses a few naming conventions / column suffixes to indicate units and facilitate automatic dtype coercion:

Columns/suffixes ending with _ms: values expressed in milliseconds (ms). Examples from the codebase: pep_ms, rr_interval_ms, error_per_sample_ms.
Columns/suffixes ending with _percent or represented with % in labels: percentage values (%). Example: absolute_relative_error_per_sample_percent.
heart_rate_bpm: heart rate values given in beats per minute (bpm).
_data / sample indices: many functions operate on sample indices (integer counts). When plotting or when requested, pepbench can convert time-like indexes to seconds (s).
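
The relationship between these units can be illustrated with a small, library-independent sketch. The function names below are hypothetical helpers chosen to mirror the suffix conventions; they are not part of the pepbench API.

```python
def samples_to_ms(n_samples: int, sampling_rate_hz: float) -> float:
    """Convert a count of samples to milliseconds."""
    # Multiply before dividing to avoid unnecessary floating-point rounding.
    return n_samples * 1000 / sampling_rate_hz


def rr_interval_ms_to_heart_rate_bpm(rr_interval_ms: float) -> float:
    """Convert an RR interval in milliseconds to a heart rate in beats per minute."""
    # One minute has 60,000 ms, so bpm = 60,000 / RR interval (ms).
    return 60_000 / rr_interval_ms


# Assumed example values: 50 samples at 500 Hz and an RR interval of 800 ms.
pep_ms = samples_to_ms(50, 500)
heart_rate_bpm = rr_interval_ms_to_heart_rate_bpm(800)
```

Carrying the unit in the variable or column name (pep_ms, heart_rate_bpm) keeps such conversions explicit at the point of use.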

Start-End Indices#

Many pepbench tables and functions (for example heartbeat tables and annotation loaders) represent time ranges using sample indices named start_sample and end_sample. pepbench follows the common Python slicing convention: the start index is inclusive and the end index is exclusive, i.e. the interval represented is [start, end).

Practical consequences and examples from the codebase:

To obtain the last sample inside a region the code frequently uses end_sample - 1. See pepbench.plotting._utils._get_heartbeat_borders which maps end_sample to the final index with end_sample - 1.
Durations in samples are obtained via end - start (for example heartbeat durations or when shifting indices to a zero origin).
Edge cases:
- A region that starts at the first sample of a recording has start = 0.
- A region that includes the last sample of a recording uses end = len(dataset).
- Adjacent regions share boundaries: the end of the first region equals the start of the next region.
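
The half-open convention maps directly onto Python slicing. The following sketch uses a hypothetical ten-sample "recording" and two made-up adjacent regions to show the consequences listed above; it does not use any pepbench function.

```python
# Hypothetical toy recording: each value equals its own sample index.
recording = list(range(10))

# Two adjacent regions as half-open intervals [start, end).
# The second region ends at len(recording), i.e. it includes the last sample.
regions = [(0, 4), (4, len(recording))]

segments = []
for start, end in regions:
    segment = recording[start:end]  # slicing excludes `end`, matching [start, end)
    # The duration in samples is simply end - start.
    assert len(segment) == end - start
    # The last sample inside the region sits at index end - 1.
    assert segment[-1] == recording[end - 1]
    segments.append(segment)

# Adjacent regions neither overlap nor leave gaps.
total_samples = sum(len(segment) for segment in segments)
```

Because the end of one region equals the start of the next, the segment lengths add up to the length of the recording without any sample being counted twice.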

Heartbeat Lists#

Heartbeats are represented as a DataFrame with one row per beat and columns for the start and end sample indices, as well as the R-peak sample index. A well-defined heartbeat list makes it easy to align ECG/ICG segments, run extraction algorithms and evaluate results.

A SingleSensorHeartbeatList is a plain pandas.DataFrame that should contain at least the columns start_sample and end_sample. In many cases an r_peak_sample column is present as well (the detected R-peak within the heartbeat). The index is expected to have one level named heartbeat_id. If you prefer to keep heartbeat_id as a column instead, convert it to the index with df = df.set_index("heartbeat_id") before passing the DataFrame to functions that expect the index.

All sample-based columns are expressed in samples relative to the start of the recording (not relative to the start of each heartbeat). Durations in samples can be converted to seconds or milliseconds using the sampling rate (fs).

Required/Recommended columns and units#

start_sample (int): inclusive start index (samples) of the heartbeat in the recording.
end_sample (int): exclusive end index (samples) of the heartbeat in the recording (pepbench uses half-open intervals [start, end)).
r_peak_sample (int, optional but recommended): sample index of the R-peak. If present it should satisfy start_sample <= r_peak_sample < end_sample.

Recommended additional/derived columns#

duration_samples = end_sample - start_sample (int)
duration_ms = duration_samples / fs * 1000 (float)
r_peak_offset = r_peak_sample - start_sample (int)
quality_score (float), label (str) or source_channel (str) for metadata or curation

Index and format conventions#

Index name: heartbeat_id is the canonical index name used in examples and several internal functions. Many internal examples and helpers assume the index has this name (see match_heartbeat_lists).
Columns always refer to absolute sample indices in the recording (not time or per-beat offsets). Use the sampling rate to convert to seconds if needed.

Invariants and validation rules#

Rows should be sorted by start_sample (increasing).
start_sample >= 0 and end_sample > start_sample (no negative or zero-length regions unless explicitly documented).
If r_peak_sample is present, it must satisfy start_sample <= r_peak_sample < end_sample.
Adjacent beats are allowed: the end_sample of one beat can equal the start_sample of the next beat.
Overlapping beats should either be resolved (merge/remove) or annotated (for example with quality_score).
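
These rules can be checked mechanically. The function below is a minimal sketch of such a check written with plain pandas; check_heartbeat_list is a hypothetical helper, not a pepbench function, and the two example tables are made up for illustration.

```python
import pandas as pd


def check_heartbeat_list(heartbeats: pd.DataFrame) -> list:
    """Return a list of detected problems; an empty list means all checks passed."""
    problems = []
    if heartbeats.index.name != "heartbeat_id":
        problems.append("index is not named 'heartbeat_id'")
    missing = {"start_sample", "end_sample"} - set(heartbeats.columns)
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
        return problems  # the remaining checks need both columns

    start, end = heartbeats["start_sample"], heartbeats["end_sample"]
    if not start.is_monotonic_increasing:
        problems.append("rows are not sorted by start_sample")
    if (start < 0).any():
        problems.append("negative start_sample")
    if (end <= start).any():
        problems.append("end_sample must be greater than start_sample")
    # A beat overlaps its successor if the next start lies before its own end.
    # Equal values (adjacent beats) are allowed.
    if (start.shift(-1) < end).any():
        problems.append("overlapping beats")
    if "r_peak_sample" in heartbeats.columns:
        r_peak = heartbeats["r_peak_sample"]
        known = r_peak.notna()  # missing R-peaks are skipped, not flagged
        inside = (start <= r_peak) & (r_peak < end)
        if (~inside[known]).any():
            problems.append("r_peak_sample outside [start_sample, end_sample)")
    return problems


# Made-up examples: one valid table and one with an overlap and a misplaced R-peak.
good = pd.DataFrame(
    {"start_sample": [0, 300, 620], "end_sample": [300, 620, 930], "r_peak_sample": [150, 455, 750]}
).rename_axis("heartbeat_id")
bad = pd.DataFrame(
    {"start_sample": [0, 250], "end_sample": [300, 600], "r_peak_sample": [400, 455]}
).rename_axis("heartbeat_id")

good_problems = check_heartbeat_list(good)
bad_problems = check_heartbeat_list(bad)
```

Returning a list of messages instead of raising on the first failure is a design choice for this sketch; it makes it easier to report all issues of a table at once during data curation.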

Common operations and examples#

A typical heartbeat list example (assume fs = 1000 Hz):

>>> import pandas as pd
>>> df = pd.DataFrame(
...     [[0, 300, 150], [300, 620, 455], [620, 930, 750]],
...     columns=["start_sample", "end_sample", "r_peak_sample"],
... )
>>> df.index.name = "heartbeat_id"
>>> df
              start_sample  end_sample  r_peak_sample
heartbeat_id
0                        0         300            150
1                      300         620            455
2                      620         930            750

Compute derived columns:

>>> fs = 1000  # sampling rate in Hz
>>> df["duration_samples"] = df["end_sample"] - df["start_sample"]
>>> df["duration_ms"] = df["duration_samples"] / fs * 1000  # equals duration_samples for fs = 1000

Filtering short beats:

>>> min_samples = int(0.2 * 1000)  # 200 ms
>>> df_filtered = df[df["duration_samples"] >= min_samples]
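
Because all columns hold absolute sample indices, a heartbeat list can be used directly to cut a signal into per-beat segments. The following is a minimal sketch with a placeholder signal; the variable ecg is an assumption for illustration and does not come from a pepbench dataset.

```python
import numpy as np
import pandas as pd

# Placeholder signal: each value equals its own sample index, so slices are easy to inspect.
ecg = np.arange(930, dtype=float)

heartbeats = pd.DataFrame(
    [[0, 300, 150], [300, 620, 455], [620, 930, 750]],
    columns=["start_sample", "end_sample", "r_peak_sample"],
).rename_axis("heartbeat_id")

segments = {}
r_peak_offsets = {}
for beat in heartbeats.itertuples():
    # Half-open slicing: end_sample itself is not part of the segment.
    segments[beat.Index] = ecg[beat.start_sample:beat.end_sample]
    # Position of the R-peak relative to the start of its own segment.
    r_peak_offsets[beat.Index] = beat.r_peak_sample - beat.start_sample
```

Each segment has exactly end_sample - start_sample samples, and segments[i][r_peak_offsets[i]] is the signal value at the R-peak of beat i.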

Integration with algorithms and helpers#

Heartbeat segmentation algorithms used in pepbench (for example HeartbeatSegmentationNeurokit) provide a heartbeat_list_ attribute that already follows the sample-index convention used here.
Annotation loaders / dataset helpers expose reference heartbeats in the format expected by algorithms. The helper compute_reference_heartbeats reformats annotation tables (dropping channel-level, renaming columns to *_sample) to a heartbeat table suitable for matching and evaluation.
To evaluate and match two heartbeat lists use match_heartbeat_lists. This function compares start/end borders (in samples) and returns true/false positive/negative matches; it assumes the column names start_sample and end_sample and an index named heartbeat_id (see its docstring for examples).
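
To make the idea of border-based matching concrete, the following sketch pairs beats whose start and end borders both lie within a tolerance. It is a simplified, greedy illustration written for this page: match_by_borders and tolerance_samples are hypothetical names, and the logic is not the implementation of match_heartbeat_lists, whose actual signature and return format are described in its docstring.

```python
import pandas as pd


def match_by_borders(reference: pd.DataFrame, extracted: pd.DataFrame, tolerance_samples: int) -> dict:
    """Greedy one-to-one matching of heartbeats on their start/end borders (illustrative only)."""
    matches = []
    used = set()
    for ref in reference.itertuples():
        # Border distance: the larger of the start and end deviations, in samples.
        distance = pd.concat(
            [
                (extracted["start_sample"] - ref.start_sample).abs(),
                (extracted["end_sample"] - ref.end_sample).abs(),
            ],
            axis=1,
        ).max(axis=1)
        distance = distance.drop(index=list(used))  # each extracted beat is matched once
        if distance.empty:
            continue
        best = distance.idxmin()
        if distance.loc[best] <= tolerance_samples:
            matches.append((int(ref.Index), int(best)))
            used.add(best)
    matched_ref = {r for r, _ in matches}
    matched_ext = {e for _, e in matches}
    return {
        "tp": matches,  # (reference id, extracted id) pairs
        "fn": [int(i) for i in reference.index if i not in matched_ref],
        "fp": [int(i) for i in extracted.index if i not in matched_ext],
    }


# Made-up reference and extracted heartbeat lists.
reference = pd.DataFrame(
    {"start_sample": [0, 300, 620], "end_sample": [300, 620, 930]}
).rename_axis("heartbeat_id")
extracted = pd.DataFrame(
    {"start_sample": [2, 305, 900], "end_sample": [298, 640, 1200]}
).rename_axis("heartbeat_id")

result = match_by_borders(reference, extracted, tolerance_samples=10)
```

With a tolerance of 10 samples only the first beat is matched: the second extracted beat deviates by 20 samples at its end border, and the third does not correspond to any reference beat.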

Edge cases and recommendations#

Missing r_peak_sample: allow NaN when a segmentation algorithm fails to detect a clear R-peak. Downstream steps that require the R-peak should handle or skip those beats explicitly.
Beats at recording boundaries: start_sample == 0 and end_sample == len(recording) are valid and indicate coverage to the recording edges.
Overlaps and duplicates: prefer to resolve these during preprocessing. When storing heartbeat lists on disk prefer parquet to preserve dtypes.
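
One practical detail of missing R-peaks is that a NaN turns an integer column into float64. The sketch below shows one way to keep integer semantics using the pandas nullable Int64 dtype and to skip beats without an R-peak explicitly. Whether every pepbench function accepts the nullable dtype is an assumption that should be verified for your use case.

```python
import pandas as pd

# Made-up heartbeat list in which the R-peak of beat 1 could not be detected.
heartbeats = pd.DataFrame(
    {
        "start_sample": [0, 300, 620],
        "end_sample": [300, 620, 930],
        "r_peak_sample": [150, float("nan"), 750],
    }
).rename_axis("heartbeat_id")

# The NaN forces a float column.
dtype_before = str(heartbeats["r_peak_sample"].dtype)

# The nullable integer dtype stores the gap as <NA> and keeps the other values integral.
heartbeats["r_peak_sample"] = heartbeats["r_peak_sample"].astype("Int64")
dtype_after = str(heartbeats["r_peak_sample"].dtype)

# Steps that require an R-peak can drop the affected beats explicitly.
usable = heartbeats.dropna(subset=["r_peak_sample"])
```

Dropping the beats in a separate, visible step documents that they were excluded on purpose rather than lost silently.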

Datasets#

Compared to the low-level datatypes, datasets are higher-level abstractions that contain all data and metadata associated with a set of recordings. They are based on the Dataset class and make it easy to load and access otherwise complex data structures.

A dataset that contains exactly one recording (a single row in the tpcp.Dataset sense) is referred to as a “datapoint”. This is the expected input for all pipelines in pepbench: typically you pass a BasePepDataset instance to the extraction pipelines, or a BasePepDatasetWithAnnotations instance when reference labels (heartbeats/PEP) are required for evaluation.

Using this dataset abstraction allows us to apply the same algorithms to different datasets and to use higher-level tpcp features such as cross_validate to run and evaluate our pipelines on subsets of our datasets in a consistent manner.

The simplest dataset that we provide out of the box is the ExampleDataset, which can be used to load the example data that we provide with pepbench.

If you have already loaded your own data and want to use it with a pepbench pipeline, you can use the WrapperDataset class to quickly create a compatible dataset from your data. For long-term use and clearer integration we highly encourage creating a custom dataset class that subclasses BasePepDataset (or BasePepDatasetWithAnnotations when reference labels are required). This simplifies many tasks and provides a clean abstraction for your data.