Common Datatypes
=================

pepbench tries to stick to common data containers (namely :class:`~numpy.ndarray`, :class:`~pandas.DataFrame`, :py:class:`dict` and :class:`~pandas.Series`) to store all inputs and outputs of its algorithms. On top of these containers, the library defines a small set of data types that are used throughout.
These conventions make it easier for users to handle complex problems and allow sanity checks that prevent common issues.
The following sections explain these data structures in detail to ease the process of preparing your data for use with pepbench and to help you understand its outputs.

.. _units:

Units
-----

Before discussing the data types, the physical units of all values stored in them should be clear.
The following table provides an overview of the units commonly used in the pepbench package and what they refer to.

.. table:: Common Units in pepbench

   ==============================  ======================
   Value                           Unit
   ==============================  ======================
   Time (seconds)                  s
   Time (milliseconds)             ms
   Sampling rate / frequency       Hz
   Heart rate                      bpm
   Relative / percentage values    %
   Sample indices / counts         samples
   ==============================  ======================

Signal Amplitudes
-----------------
Signal amplitude units (for ECG/ICG traces) are dataset-dependent and are not enforced by pepbench; common dataset units are volts (V) or millivolts (mV). pepbench algorithms expect that the amplitude unit is consistent within a dataset.

Naming Conventions in pepbench
------------------------------
The codebase uses a few naming conventions / column suffixes to indicate units and facilitate automatic dtype coercion:

- Columns/suffixes ending with ``_ms``: values expressed in milliseconds (ms). Examples from the codebase: ``pep_ms``, ``rr_interval_ms``, ``error_per_sample_ms``.
- Columns/suffixes ending with ``_percent``, or labels containing ``%``: percentage values (%). Example: ``absolute_relative_error_per_sample_percent``.
- ``heart_rate_bpm``: heart rate values given in beats per minute (bpm).
- ``_data`` / sample indices: many functions operate on sample indices (integer counts). When plotting or when requested, pepbench can convert time-like indexes to seconds (s).
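
As a small illustration of the suffix convention, the following sketch converts interval lengths stored as sample counts to milliseconds and derives a heart rate from them. The helper ``samples_to_ms``, the column ``rr_interval_samples`` and the sampling rate of 500 Hz are assumptions made for this example and are not part of pepbench; only the ``_ms`` and ``_bpm`` naming follows the conventions listed above.

```python
import pandas as pd


def samples_to_ms(n_samples: pd.Series, sampling_rate_hz: float) -> pd.Series:
    """Convert sample counts to milliseconds (hypothetical helper, not a pepbench function)."""
    return n_samples * 1000 / sampling_rate_hz


sampling_rate_hz = 500.0  # assumed sampling rate for this example

# R-R intervals expressed as sample counts (hypothetical column name)
table = pd.DataFrame({"rr_interval_samples": [400, 410, 395]})

# The ``_ms`` suffix signals that the values are stored in milliseconds
table["rr_interval_ms"] = samples_to_ms(table["rr_interval_samples"], sampling_rate_hz)

# Heart rate in beats per minute follows from the R-R interval in milliseconds
table["heart_rate_bpm"] = 60_000 / table["rr_interval_ms"]
```

Keeping the unit in the column name means that a reader of the table does not need to know the sampling rate to interpret the values.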

Start-End Indices
-----------------
Many pepbench tables and functions (for example heartbeat tables and annotation loaders) represent time ranges using sample indices named ``start_sample`` and ``end_sample``. pepbench follows the common Python slicing convention: the start index is inclusive and the end index is exclusive, i.e. the interval represented is [start, end).

Practical consequences and examples from the codebase:

- To obtain the last sample inside a region the code frequently uses ``end_sample - 1``. See ``pepbench.plotting._utils._get_heartbeat_borders`` which maps ``end_sample`` to the final index with ``end_sample - 1``.
- Durations in samples are obtained via ``end - start`` (for example heartbeat durations or when shifting indices to a zero origin).
- Edge cases:

  - A region that starts at the first sample of a recording has ``start = 0``.
  - A region that includes the last sample of a recording uses ``end = len(dataset)``.
  - Adjacent regions share boundaries: the ``end`` of the first region equals the ``start`` of the next region.
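
The consequences listed above can be checked with a few lines of NumPy. This is a minimal sketch on a synthetic ten-sample array; the variable names are chosen for illustration only and do not refer to pepbench objects.

```python
import numpy as np

signal = np.arange(10)  # stand-in for a recording with 10 samples

# Two adjacent regions: the end of the first equals the start of the second,
# and the last region ends at len(signal)
regions = [(0, 4), (4, len(signal))]

# Duration in samples is simply end - start
durations = [end - start for start, end in regions]

# The last sample inside a region sits at index end - 1
last_samples = [int(signal[end - 1]) for start, end in regions]

# Slicing with [start:end] excludes the end index, so adjacent regions
# reassemble the recording without gaps or duplicated samples
reassembled = np.concatenate([signal[start:end] for start, end in regions])
```

Here ``durations`` is ``[4, 6]``, ``last_samples`` is ``[3, 9]``, and ``reassembled`` equals ``signal``.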

Heartbeat Lists
---------------
Heartbeats are represented as a :class:`~pandas.DataFrame` with one row per beat and columns for the start and end
sample indices, as well as the R-peak sample index. A well-defined heartbeat list makes it easy to align ECG/ICG
segments, run extraction algorithms and evaluate results.

A *SingleSensorHeartbeatList* is a plain :class:`pandas.DataFrame` that should at least contain the columns
``start_sample`` and ``end_sample``. In many cases a ``r_peak_sample`` column is present as well (the detected R-peak
within the heartbeat). The index is expected to have one level with the name ``heartbeat_id``. If you prefer to keep
``heartbeat_id`` as a column instead, convert it to the index with ``df = df.set_index("heartbeat_id")`` before
passing it into functions that expect the index.

All sample-based columns are expressed in samples relative to the start of the recording (not relative to the start of
each heartbeat). Durations in samples can be converted to seconds or milliseconds using the sampling rate (``fs``).

Required/Recommended columns and units
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- ``start_sample`` (int): inclusive start index (samples) of the heartbeat in the recording.
- ``end_sample`` (int): exclusive end index (samples) of the heartbeat in the recording (pepbench uses half-open
  intervals [start, end)).
- ``r_peak_sample`` (int, optional but recommended): sample index of the R-peak. If present it should satisfy
  ``start_sample <= r_peak_sample < end_sample``.

Recommended additional/derived columns
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- ``duration_samples`` = ``end_sample - start_sample`` (int)
- ``duration_ms`` = ``duration_samples / fs * 1000`` (float)
- ``r_peak_offset`` = ``r_peak_sample - start_sample`` (int)
- ``quality_score`` (float), ``label`` (str) or ``source_channel`` (str) for metadata or curation

Index and format conventions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- Index name: ``heartbeat_id`` is the canonical index name used in examples and several internal functions. Many
  internal examples and helpers assume the index has this name (see :func:`~pepbench.heartbeat_matching.match_heartbeat_lists`).
- Sample-based columns always refer to absolute sample indices in the recording (not time or per-beat offsets). Use
  the sampling rate to convert to seconds if needed.

Invariants and validation rules
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- Rows should be sorted by ``start_sample`` (increasing).
- ``start_sample >= 0`` and ``end_sample > start_sample`` (no negative or zero-length regions unless explicitly
  documented).
- If ``r_peak_sample`` is present, it must satisfy ``start_sample <= r_peak_sample < end_sample``.
- Adjacent beats are allowed: the ``end_sample`` of one beat can equal the ``start_sample`` of the next beat.
- Overlapping beats should either be resolved (merge/remove) or annotated (for example with ``quality_score``).
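
The rules above can be turned into a small pre-flight check. The following is a sketch only: ``check_heartbeat_list`` is a hypothetical helper written for this page, not a function provided by pepbench, and it assumes the column and index names described in this section.

```python
import pandas as pd


def check_heartbeat_list(heartbeats: pd.DataFrame) -> list:
    """Return descriptions of violated rules (empty list if none).

    Hypothetical helper for illustration; not part of pepbench.
    """
    problems = []
    if heartbeats.index.name != "heartbeat_id":
        problems.append("index is not named 'heartbeat_id'")
    missing = {"start_sample", "end_sample"} - set(heartbeats.columns)
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
        return problems  # the remaining checks need both columns

    start, end = heartbeats["start_sample"], heartbeats["end_sample"]
    if not start.is_monotonic_increasing:
        problems.append("rows are not sorted by start_sample")
    if (start < 0).any():
        problems.append("negative start_sample")
    if (end <= start).any():
        problems.append("end_sample must be greater than start_sample")
    # Overlap: a beat starts before the previous row has ended.
    # Equality (adjacent beats) is allowed. Assumes rows are sorted.
    if (start.iloc[1:].to_numpy() < end.iloc[:-1].to_numpy()).any():
        problems.append("overlapping beats")
    if "r_peak_sample" in heartbeats.columns:
        r_peak = heartbeats["r_peak_sample"]
        # Missing R-peaks are skipped; present ones must lie in [start, end)
        outside = r_peak.notna() & ((r_peak < start) | (r_peak >= end))
        if outside.any():
            problems.append("r_peak_sample outside [start_sample, end_sample)")
    return problems


heartbeats = pd.DataFrame(
    {
        "start_sample": [0, 300, 600],  # third beat starts before the second ends
        "end_sample": [300, 620, 930],
        "r_peak_sample": [150, 455, 750],
    }
)
heartbeats.index.name = "heartbeat_id"
problems = check_heartbeat_list(heartbeats)  # ["overlapping beats"]
```

Returning a list of messages rather than raising on the first violation is a design choice of this sketch; it makes it easier to report all issues of a curated table at once.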


Common operations and examples
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A typical heartbeat list example (assume ``fs = 1000`` Hz):

>>> import pandas as pd
>>> df = pd.DataFrame(
...     [[0, 300, 150], [300, 620, 455], [620, 930, 750]],
...     columns=["start_sample", "end_sample", "r_peak_sample"],
... )
>>> df.index.name = "heartbeat_id"
>>> df
              start_sample  end_sample  r_peak_sample
heartbeat_id
0                        0         300            150
1                      300         620            455
2                      620         930            750

Compute derived columns:

>>> fs = 1000  # sampling rate in Hz
>>> df["duration_samples"] = df["end_sample"] - df["start_sample"]
>>> df["duration_ms"] = df["duration_samples"] / fs * 1000  # equals duration_samples for fs = 1000

Filter out short beats:

>>> min_samples = int(0.2 * fs)  # 200 ms
>>> df_filtered = df[df["duration_samples"] >= min_samples]

Integration with algorithms and helpers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- Heartbeat segmentation algorithms used in pepbench (for example :class:`~biopsykit.signals.ecg.segmentation.HeartbeatSegmentationNeurokit`) provide a ``heartbeat_list_`` attribute that already follows the sample-index convention used here.
- Annotation loaders / dataset helpers expose reference heartbeats in the format expected by algorithms. The helper
  :func:`~pepbench.datasets._helper.compute_reference_heartbeats` reformats annotation tables (dropping the channel
  level and renaming columns to ``*_sample``) into a heartbeat table suitable for matching and evaluation.
- To evaluate and match two heartbeat lists use :func:`~pepbench.heartbeat_matching.match_heartbeat_lists`. This function
  compares start/end borders (in samples) and returns true/false positive/negative matches; it assumes the column
  names ``start_sample`` and ``end_sample`` and an index named ``heartbeat_id`` (see its docstring for examples).
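
To make the idea of border-based matching concrete, the following sketch pairs beats from two lists when both their start and end borders agree within a tolerance given in samples. It is a simplified illustration written for this page: ``match_borders``, its parameters and its return format are hypothetical and do not reproduce the signature or the algorithm of :func:`~pepbench.heartbeat_matching.match_heartbeat_lists`.

```python
import pandas as pd


def match_borders(reference: pd.DataFrame, extracted: pd.DataFrame, tolerance_samples: int) -> dict:
    """Greedy border matching (hypothetical sketch, not the pepbench implementation)."""
    columns = ["start_sample", "end_sample"]
    # Plain (heartbeat_id, start, end) tuples for both lists
    candidates = list(extracted[columns].itertuples(name=None))
    used = set()
    true_positives, false_negatives = [], []

    for ref_id, ref_start, ref_end in reference[columns].itertuples(name=None):
        for ext_id, ext_start, ext_end in candidates:
            if ext_id in used:
                continue
            start_ok = abs(ext_start - ref_start) <= tolerance_samples
            end_ok = abs(ext_end - ref_end) <= tolerance_samples
            if start_ok and end_ok:
                true_positives.append((ref_id, ext_id))
                used.add(ext_id)
                break
        else:
            # No extracted beat matched this reference beat
            false_negatives.append(ref_id)

    # Extracted beats that were never paired with a reference beat
    false_positives = [ext_id for ext_id, _, _ in candidates if ext_id not in used]
    return {"tp": true_positives, "fn": false_negatives, "fp": false_positives}


reference = pd.DataFrame({"start_sample": [0, 300, 620], "end_sample": [300, 620, 930]})
extracted = pd.DataFrame({"start_sample": [2, 305, 621], "end_sample": [298, 500, 931]})
reference.index.name = extracted.index.name = "heartbeat_id"

result = match_borders(reference, extracted, tolerance_samples=5)
# {"tp": [(0, 0), (2, 2)], "fn": [1], "fp": [1]}
```

In this example the second extracted beat ends 120 samples too early, so reference beat 1 is counted as a false negative and extracted beat 1 as a false positive.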

Edge cases and recommendations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- Missing ``r_peak_sample``: allow NaN when a segmentation algorithm fails to detect a clear R-peak. Downstream steps
  that require the R-peak should handle or skip those beats explicitly.
- Beats at recording boundaries: ``start_sample == 0`` and ``end_sample == len(recording)`` are valid and indicate
  coverage to the recording edges.
- Overlaps and duplicates: prefer to resolve these during preprocessing.
- Storage: when storing heartbeat lists on disk, prefer parquet to preserve dtypes.
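
One practical detail when allowing missing R-peaks: a NaN in an otherwise integer column makes pandas store the column as float. The sketch below uses the pandas nullable integer dtype ``Int64`` to keep sample indices integer while still marking missing values. This is one possible approach, shown under the assumption that downstream code accepts nullable integers; pepbench itself is not confirmed here to require or produce this dtype.

```python
import pandas as pd

heartbeats = pd.DataFrame(
    {
        "start_sample": [0, 300, 620],
        "end_sample": [300, 620, 930],
        # pd.NA marks a beat without a detected R-peak;
        # the "Int64" dtype keeps the remaining values as integers
        "r_peak_sample": pd.array([150, pd.NA, 750], dtype="Int64"),
    }
)
heartbeats.index.name = "heartbeat_id"

# Steps that need the R-peak explicitly skip beats where it is missing
with_r_peak = heartbeats.dropna(subset=["r_peak_sample"])
r_peak_offset = with_r_peak["r_peak_sample"] - with_r_peak["start_sample"]
```

After dropping the beat without an R-peak, ``with_r_peak`` contains the heartbeats 0 and 2, and ``r_peak_offset`` holds the values 150 and 130.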


Datasets
--------

Compared to the low-level data types, datasets are higher-level abstractions that contain all data and metadata
associated with a set of recordings.
They are based on the :class:`~tpcp.Dataset` class and make it easy to load and access otherwise complex data
structures.

A dataset that contains exactly one recording (a single row in the :class:`tpcp.Dataset` sense) is referred to as a
"datapoint". This is the expected input for all pipelines in pepbench: typically you pass a
:class:`~pepbench.datasets._base_pep_extraction_dataset.BasePepDataset` instance to extraction pipelines, or a
:class:`~pepbench.datasets._base_pep_extraction_dataset.BasePepDatasetWithAnnotations` instance when reference labels
(heartbeats/PEP) are required for evaluation.

Using this dataset abstraction allows us to easily apply the same algorithms to different datasets and to use
higher-level tpcp features like :func:`~tpcp.validate.cross_validate` to run and evaluate our pipelines on
subsets of our datasets in a consistent manner.

The simplest dataset that we provide out of the box is the :class:`~pepbench.datasets.ExampleDataset`, which can be
used to load the example data that we provide with pepbench.

If you have already loaded your own data and want to use it with a pepbench pipeline, you can use the
:class:`~pepbench.datasets.WrapperDataset` class to quickly create a compatible dataset from your data.
For long-term use and clearer integration we highly encourage creating a custom dataset class that subclasses
:class:`~pepbench.datasets.BasePepDataset` (or
:class:`~pepbench.datasets.BasePepDatasetWithAnnotations` when reference labels are required).
This simplifies many tasks and provides a clean abstraction for your data.