Datasets & Requirements#
pepbench uses tpcp dataset classes. Each datapoint provides ECG/ICG signals plus metadata required by pipelines and (optionally) evaluation.
Available dataset classes#
pepbench currently provides:
ExampleDataset(small demo/test dataset)
All dataset implementations follow the same core interface based on
BasePepDataset and
BasePepDatasetWithAnnotations.
Core interface requirements#
All PEP extraction pipelines expect:
ecg– ECG signal as a pandas DataFrameicg– ICG signal as a pandas DataFramesampling_rate_ecg– ECG sampling rate (Hz)sampling_rate_icg– ICG sampling rate (Hz)heartbeats– segmented heartbeats (start, end, R-peak per beat)
For evaluation workflows, datasets should additionally expose:
reference_pep– reference PEP values (per sample or per beat)reference_heartbeats– reference heartbeat segmentationreference_labels_ecg/reference_labels_icg– label annotationsLabeled sections of the continuous signal (dataset-specific; attribute name may vary, e.g.
labeled_segments— DataFrame withstart,end,label;annotation_intervals— list of(start, end, label)tuples;label_mask— boolean Series aligned to the signal)
EmpkinsDataset#
EmpkinsDataset is the primary dataset class for
EmpkinS recordings and annotations.
Typical usage:
from pepbench.datasets import EmpkinsDataset
ds = EmpkinsDataset(
base_path="/path/to/empkins/root",
only_labeled=True,
exclude_missing_data=True,
label_type="rater_01", # or "rater_02" / "average"
)
ds.create_index() # optional; index is created lazily on first access
print(len(ds))
Use datapoints exactly like any other tpcp dataset:
dp = next(iter(ds))
ecg, icg = dp.ecg, dp.icg
fs_ecg, fs_icg = dp.sampling_rate_ecg, dp.sampling_rate_icg
heartbeats = dp.heartbeats
For full attribute and parameter details, see the API reference:
EmpkinsDataset.
GuardianDataset#
GuardianDataset follows the same interface as
EmpkinsDataset and can be used with the same
pipelines/evaluation code.
The practical difference is in study design and signal characteristics (e.g., protocol and sampling specifics), not in the programming interface.
Example Dataset#
Pepbench also provides a small ExampleDataset for testing and demonstration purposes.
It contains two patients’ ECG/ICG signals with known PEP values and annotations, allowing you to quickly test pipelines without needing access to the full Empkins or Guardian datasets.
Hands-on notebook:
Example Dataset
Integration of Own Data#
If you already have ECG and ICG signals loaded in memory and want to use them with pepbench pipelines
without creating a full custom dataset class, the WrapperDataset is the ideal
solution. It wraps your raw signals into a compatible dataset format that works seamlessly with all
pepbench pipelines and evaluation tools.
Quick Start with WrapperDataset#
The WrapperDataset requires only your signal data and their sampling rates:
import pandas as pd
from biopsykit.utils.dtypes import EcgRawDataFrame, IcgRawDataFrame
from pepbench.datasets import WrapperDataset
# Assume you have loaded your ECG and ICG data
# They should be pandas DataFrames with appropriate structure for BiopsyKit
ecg_data = pd.read_csv("path/to/ecg_data.csv", index_col=0)
icg_data = pd.read_csv("path/to/icg_data.csv", index_col=0)
# Ensure proper BiopsyKit dtypes (EcgRawDataFrame, IcgRawDataFrame)
# These are specialized pandas DataFrames with specific metadata
ecg = EcgRawDataFrame(ecg_data)
icg = IcgRawDataFrame(icg_data)
# Create the wrapper dataset with your data
ds = WrapperDataset(
ecg=ecg,
icg=icg,
sampling_rate_ecg=500, # ECG sampling rate in Hz
sampling_rate_icg=500, # ICG sampling rate in Hz
)
# Compatible with regular pipelines
from pepbench.pipelines import PepExtractionPipeline
PepExtractionPipeline().run(ds)
Use WrapperDataset when you want quick integration
without implementing indexing/grouping over many files.
Create your Own Dataset Class#
If you have a larger collection of ECG/ICG recordings or want to integrate with pepbench’s indexing and grouping features, it’s best to create a custom dataset class by subclassing BasePepDataset or BasePepDatasetWithAnnotations.
This allows you to implement the required properties and methods while leveraging the full power of the tpcp framework for indexing, grouping, and iteration.
Why Create a Custom Dataset?#
Creating a custom dataset class is recommended when:
You have a structured collection of ECG/ICG recordings across multiple subjects/sessions
You need to support filtering, grouping, or subsetting operations on your data
You want to integrate with pepbench’s evaluation framework
You have complex data loading logic that depends on file structure or metadata
You want to provide a reusable interface for your data that works with all pepbench pipelines
Step-by-Step Implementation Guide#
1. Choose Your Base Class
Decide which base class to inherit from based on your needs:
Use
BasePepDatasetif you only need to run pipelines on your dataUse
BasePepDatasetWithAnnotationsif you also have reference annotations for evaluation
2. Implement Required Methods and Properties
All custom datasets must implement:
create_index()- Returns a pandas DataFrame defining all datapoints in your datasetecgproperty - Returns ECG signal for the current subseticgproperty - Returns ICG signal for the current subsetsampling_rate_ecgproperty - Returns ECG sampling rate in Hzsampling_rate_icgproperty - Returns ICG sampling rate in Hzheartbeatsproperty - Returns heartbeat segmentation
If using BasePepDatasetWithAnnotations, also implement:
reference_labels_ecgproperty - Returns reference Q-peak labelsreference_labels_icgproperty - Returns reference B-point labelsreference_heartbeatsproperty - Returns reference heartbeat segmentation
3. Basic Example: Custom Dataset without Annotations
Here’s a minimal example for a dataset containing ECG/ICG files organized by participant:
from pathlib import Path
import pandas as pd
from biopsykit.utils.dtypes import EcgRawDataFrame, IcgRawDataFrame
from biopsykit.signals.ecg.segmentation import HeartbeatSegmentationNeurokit
from pepbench.datasets import BasePepDataset
class MyCustomDataset(BasePepDataset):
"""Custom dataset for my ECG/ICG recordings.
Parameters
----------
base_path : Path or str
Root directory containing participant folders
"""
def __init__(self, base_path, **kwargs):
self.base_path = Path(base_path)
super().__init__(**kwargs)
def create_index(self) -> pd.DataFrame:
"""Create index with one row per participant."""
# Find all participant directories (e.g., P001, P002, ...)
participant_ids = sorted([
p.name for p in self.base_path.glob("P*")
if p.is_dir()
])
# Create DataFrame with participant column
return pd.DataFrame({"participant": participant_ids})
@property
def sampling_rate_ecg(self) -> int:
"""ECG sampling rate in Hz."""
return 1000 # Adjust to your data
@property
def sampling_rate_icg(self) -> int:
"""ICG sampling rate in Hz."""
return 1000 # Adjust to your data
@property
def ecg(self) -> EcgRawDataFrame:
"""Load ECG signal for current subset."""
# Ensure we're accessing a single datapoint
if not self.is_single(None):
raise ValueError("ECG can only be accessed for a single datapoint!")
# Get participant ID from current index
participant = self.index["participant"].iloc[0]
# Load ECG file
ecg_file = self.base_path / participant / "ecg.csv"
data = pd.read_csv(ecg_file, index_col=0)
return EcgRawDataFrame(data)
@property
def icg(self) -> IcgRawDataFrame:
"""Load ICG signal for current subset."""
if not self.is_single(None):
raise ValueError("ICG can only be accessed for a single datapoint!")
participant = self.index["participant"].iloc[0]
icg_file = self.base_path / participant / "icg.csv"
data = pd.read_csv(icg_file, index_col=0)
return IcgRawDataFrame(data)
@property
def heartbeats(self):
"""Compute heartbeats from ECG."""
# Use BiopsyKit's heartbeat segmentation
segmenter = HeartbeatSegmentationNeurokit()
segmenter.segment(
ecg=self.ecg,
sampling_rate_hz=self.sampling_rate_ecg
)
return segmenter.heartbeats_
4. Advanced Example: Dataset with Annotations and Multiple Index Levels
For more complex scenarios with multiple conditions or sessions:
from itertools import product
from pepbench.datasets import BasePepDatasetWithAnnotations
class MyAnnotatedDataset(BasePepDatasetWithAnnotations):
"""Custom dataset with reference annotations.
Parameters
----------
base_path : Path or str
Root directory
only_labeled : bool
Whether to restrict to labeled segments
"""
CONDITIONS = ["rest", "exercise"] # Define study conditions
def __init__(self, base_path, only_labeled=False, **kwargs):
self.base_path = Path(base_path)
super().__init__(only_labeled=only_labeled, **kwargs)
def create_index(self) -> pd.DataFrame:
"""Create index with participant and condition columns."""
# Find participants
participants = sorted([
p.name for p in self.base_path.glob("P*")
if p.is_dir()
])
# Create all combinations of participant × condition
index_tuples = list(product(participants, self.CONDITIONS))
return pd.DataFrame(
index_tuples,
columns=["participant", "condition"]
)
@property
def ecg(self) -> EcgRawDataFrame:
"""Load ECG for current participant and condition."""
if not self.is_single(None):
raise ValueError("Access single datapoint only!")
p_id = self.index["participant"].iloc[0]
condition = self.index["condition"].iloc[0]
# Load from condition-specific file
ecg_file = self.base_path / p_id / f"ecg_{condition}.csv"
return EcgRawDataFrame(pd.read_csv(ecg_file, index_col=0))
@property
def icg(self) -> IcgRawDataFrame:
"""Load ICG for current participant and condition."""
if not self.is_single(None):
raise ValueError("Access single datapoint only!")
p_id = self.index["participant"].iloc[0]
condition = self.index["condition"].iloc[0]
icg_file = self.base_path / p_id / f"icg_{condition}.csv"
return IcgRawDataFrame(pd.read_csv(icg_file, index_col=0))
@property
def sampling_rate_ecg(self) -> int:
return 500
@property
def sampling_rate_icg(self) -> int:
return 500
@property
def heartbeats(self):
"""Compute heartbeats."""
segmenter = HeartbeatSegmentationNeurokit()
segmenter.segment(
ecg=self.ecg,
sampling_rate_hz=self.sampling_rate_ecg
)
return segmenter.heartbeats_
@property
def reference_labels_ecg(self) -> pd.DataFrame:
"""Load reference ECG labels (Q-peaks)."""
if not self.is_single(None):
raise ValueError("Access single datapoint only!")
p_id = self.index["participant"].iloc[0]
condition = self.index["condition"].iloc[0]
# Load labels file
labels_file = self.base_path / p_id / f"labels_ecg_{condition}.csv"
labels = pd.read_csv(labels_file)
# Must return MultiIndex format: (heartbeat_id, channel, label)
labels = labels.set_index(["heartbeat_id", "channel", "label"])
return labels
@property
def reference_labels_icg(self) -> pd.DataFrame:
"""Load reference ICG labels (B-points)."""
if not self.is_single(None):
raise ValueError("Access single datapoint only!")
p_id = self.index["participant"].iloc[0]
condition = self.index["condition"].iloc[0]
labels_file = self.base_path / p_id / f"labels_icg_{condition}.csv"
labels = pd.read_csv(labels_file)
labels = labels.set_index(["heartbeat_id", "channel", "label"])
return labels
@property
def reference_heartbeats(self) -> pd.DataFrame:
"""Load or compute reference heartbeat segmentation."""
# Option 1: Load from file if available
# Option 2: Compute from reference labels
from pepbench.datasets._helper import compute_reference_heartbeats
return compute_reference_heartbeats(
self.reference_labels_ecg,
sampling_rate_hz=self.sampling_rate_ecg
)
5. Key Implementation Tips
Index Structure: The
create_index()method defines all available datapoints. Each row represents one accessible subset of your data. Column names become index levels you can group by.Single Datapoint Access: Properties like
ecg,icg, etc. should typically only be accessed whenself.is_single(None)isTrue. Use this check to prevent ambiguous multi-subset access.tpcp Integration: By inheriting from
BasePepDataset, you automatically get:Subsetting via
get_subset()Grouping via
groupby()Iteration over datapoints
Reproducible indexing for benchmarking
Data Loading: Implement efficient data loading in your properties. Consider using caching (
@cached_property) if loading is expensive:from functools import cached_property @cached_property def ecg(self) -> EcgRawDataFrame: # Expensive loading only happens once return self._load_ecg_file()
Reference Label Format: Reference labels must be MultiIndex DataFrames with levels
(heartbeat_id, channel, label)and columns includingsample_relativeandsample_absolute.
6. Using Your Custom Dataset
Once implemented, use your dataset just like built-in ones:
# Create dataset instance
ds = MyCustomDataset(base_path="/path/to/data")
# Build index
ds.create_index()
# Iterate over all datapoints
for datapoint in ds:
ecg = datapoint.ecg
icg = datapoint.icg
# Process...
# Group by participant
for participant_subset in ds.groupby("participant"):
# Process all conditions for this participant
pass
# Use with pipelines
from pepbench.pipelines import PepExtractionPipeline
pipeline = PepExtractionPipeline()
pipeline.run(ds.get_subset(index=[0]))
results = pipeline.result_
7. Testing Your Dataset
Always test your custom dataset implementation:
# Verify index creation
ds = MyCustomDataset("/path/to/data")
index = ds.create_index()
assert len(index) > 0
assert "participant" in index.columns
# Test data access
single_dp = ds.get_subset(index=[0])
ecg = single_dp.ecg
assert ecg.shape[0] > 0
assert single_dp.sampling_rate_ecg > 0
# Test iteration
for dp in ds:
assert dp.is_single(None)
# Verify each datapoint is accessible
For complete examples, see EmpkinsDataset and ExampleDataset implementations in the pepbench source code.
Summary checklist#
To be pipeline-compatible, a dataset should:
expose ECG/ICG signals and sampling rates
provide heartbeat segmentation
inherit from
BasePepDataset(or compatible interface)use deterministic indexing for reproducibility
For evaluation, it should additionally provide reference annotations.