GSD Evaluation Challenges#

The GSD Evaluation example demonstrates how to evaluate the performance of a GSD algorithm on a single datapoint and explains the individual performance metrics that are calculated.

With that, you could set up a custom evaluation pipeline that runs a GSD algorithm on multiple datapoints, scores the output, and then aggregates the results. To make this process easier, we provide opinionated evaluation challenges that can be used to quickly perform the same evaluation with multiple algorithms and datasets.

Below, we will show how to use them on the example dataset.

# TODO: Update based on new Scorer API

Dataset#

To use the challenges, we need a dataset with reference information in the expected format. We will use the LabExampleDataset for this purpose.

from mobgap.data import LabExampleDataset

long_test = LabExampleDataset(reference_system="INDIP").get_subset(
    test="Test11"
)

Algorithm#

Next we need to create an instance of a valid GSD algorithm.

from mobgap.gait_sequences import GsdIluz

algo = GsdIluz()

This algorithm needs to be wrapped in a GsdEmulationPipeline to be used in the challenges. This pipeline takes care of extracting the correct data from the dataset and running the algorithm on it.

from mobgap.gait_sequences.pipeline import GsdEmulationPipeline

pipe = GsdEmulationPipeline(algo)

Let’s demonstrate that quickly on a single datapoint.

pipe_with_results = pipe.clone().run(long_test[0])
pipe_with_results.gs_list_

	start	end
gs_id
0	600	1201
1	4350	5251
2	7800	9001
3	9300	10201
4	10950	11551
5	13050	13651

Evaluation Challenge#

This pipeline can now be used as part of an evaluation challenge. An evaluation challenge takes care of two things:

Running the pipeline on multiple datapoints
Scoring the results per datapoint and then aggregating the results

We provide two challenges:

Evaluation: This challenge simply runs the pipeline on all datapoints and then scores the results.
EvaluationCV: This challenge runs a cross-validation on the dataset and then scores the results per fold.

Before we run the entire pipeline, let’s look at the scoring. Scoring is built on tpcp’s validation framework. As the scoring is relatively complex, it is split across two functions:

gsd_per_datapoint_score: Run and score a single datapoint
gsd_final_agg: Perform the final aggregation and scoring based on the results per datapoint.

Let’s look at their source code first.

from inspect import getsource

from mobgap.gait_sequences.evaluation import (
    gsd_final_agg,
    gsd_per_datapoint_score,
)

print(getsource(gsd_per_datapoint_score))

def gsd_per_datapoint_score(pipeline: GsdEmulationPipeline, datapoint: BaseGaitDatasetWithReference) -> dict:
    """Evaluate the performance of a GSD algorithm on a single datapoint.

    .. warning:: This function is not meant to be called directly, but as a scoring function in a
       :class:`tpcp.validate.Scorer`.
       If you are writing custom scoring functions, you can use this function as a template or wrap it in a new
       function.

    This function is used to evaluate the performance of a GSD algorithm on a single datapoint.
    It calculates the performance metrics based on the detected gait sequences and the reference gait sequences.

    The following performance metrics are calculated:

    - all outputs of :func:`~mobgap.gait_sequences.evaluation.calculate_unmatched_gsd_performance_metrics`
      (will be averaged over all datapoints)
    - all outputs of :func:`~mobgap.gait_sequences.evaluation.calculate_matched_gsd_performance_metrics`
      (will be averaged over all datapoints)
    - ``matches``: The matched gait sequences calculated by
      :func:`~mobgap.gait_sequences.evaluation.categorize_intervals_per_sample` (return as ``no_agg``)
    - ``detected``: The detected gait sequences (return as ``no_agg``)
    - ``reference``: The reference gait sequences (return as ``no_agg``)
    - ``sampling_rate_hz``: The sampling rate of the data (return as ``no_agg``)

    Parameters
    ----------
    pipeline
        An instance of GSD emulation pipeline that wraps the algorithm that should be evaluated.
    datapoint
        The datapoint to be evaluated.

    Returns
    -------
    dict
        A dictionary containing the performance metrics.
        Note, that some results are wrapped in a ``no_agg`` object or other aggregators.
        The results of this function are not expected to be parsed manually, but rather the function is expected to be
        used in the context of the :func:`~tpcp.validate.validate`/:func:`~tpcp.validate.cross_validate` functions or
        similar as scorer.
        This functions will aggregate the results and provide a summary of the performance metrics.

    """
    from mobgap.gait_sequences.evaluation import (  # noqa: PLC0415
        calculate_matched_gsd_performance_metrics,
        calculate_unmatched_gsd_performance_metrics,
        categorize_intervals_per_sample,
    )

    with warnings.catch_warnings():
        # We know that these errors might happen, and they are usually not relevant for the evaluation
        warnings.filterwarnings("ignore", message="Zero division", category=UserWarning)
        warnings.filterwarnings("ignore", message="multiple ICs", category=UserWarning)

        # Run the algorithm on the datapoint
        pipeline.safe_run(datapoint)
        detected_gs_list = pipeline.gs_list_
        reference_gs_list = datapoint.reference_parameters_.wb_list[["start", "end"]]
        n_overall_samples = len(datapoint.data_ss)
        sampling_rate_hz = datapoint.sampling_rate_hz

        matches = categorize_intervals_per_sample(
            gsd_list_detected=detected_gs_list,
            gsd_list_reference=reference_gs_list,
            n_overall_samples=n_overall_samples,
        )

        # Calculate the performance metrics
        performance_metrics = {
            **calculate_unmatched_gsd_performance_metrics(
                gsd_list_detected=detected_gs_list,
                gsd_list_reference=reference_gs_list,
                sampling_rate_hz=sampling_rate_hz,
            ),
            **calculate_matched_gsd_performance_metrics(matches),
            "matches": no_agg(matches),
            "detected": no_agg(detected_gs_list),
            "reference": no_agg(reference_gs_list),
            "sampling_rate_hz": no_agg(sampling_rate_hz),
            "runtime_s": getattr(pipeline.algo_, "perf_", {}).get("runtime_s", np.nan),
        }

    return performance_metrics

print(getsource(gsd_final_agg))

def gsd_final_agg(
    agg_results: dict[str, float],
    single_results: dict[str, list],
    pipeline: GsdEmulationPipeline,  # noqa: ARG001
    dataset: BaseGaitDatasetWithReference,
) -> tuple[dict[str, any], dict[str, list[any]]]:
    """Aggregate the performance metrics of a GSD algorithm over multiple datapoints.

    .. warning:: This function is not meant to be called directly, but as ``final_aggregator`` in a
       :class:`tpcp.validate.Scorer`.
       If you are writing custom scoring functions, you can use this function as a template or wrap it in a new
       function.

    This function aggregates the performance metrics as follows:

    - All raw outputs (``detected``, ``reference``, ``sampling_rate_hz``) are concatenated to a single
      dataframe, to make it easier to work with and are returned as part of the single results.
    - We recalculate all performance metrics from
      :func:`~mobgap.gait_sequences.evaluation.calculate_unmatched_gsd_performance_metrics`
      and :func:`~mobgap.gait_sequences.evaluation.calculate_matched_gsd_performance_metrics` on the combined data.
      The results are prefixed with ``combined__``.
      Compared to the per-datapoint results (which are calculated, as errors per recording -> average over all
      recordings), these metrics are calculated as combining all GSDs from all recordings and then calculating the
      performance metrics.
      Effectively, this means, that in the `per_datapoint` version, each recording is weighted equally, while in the
      `combined` version, each GS is weighted equally.

    Parameters
    ----------
    agg_results
        The aggregated results from all datapoints (see :class:`~tpcp.validate.Scorer`).
    single_results
        The per-datapoint results (see :class:`~tpcp.validate.Scorer`).
    pipeline
        The pipline that was passed to the scorer.
        This is ignored in this function, but might be useful in custom final aggregators.
    dataset
        The dataset that was passed to the scorer.

    Returns
    -------
    final_agg_results
        The final aggregated results.
    final_single_results
        The per-datapoint results, that are not aggregated.

    """
    from mobgap.gait_sequences.evaluation import (  # noqa: PLC0415
        calculate_matched_gsd_performance_metrics,
        calculate_unmatched_gsd_performance_metrics,
    )

    data_labels = [d.group_label for d in dataset]
    data_label_names = data_labels[0]._fields
    # We combine each to a combined dataframe
    matches = single_results.pop("matches")
    matches = pd.concat(matches, keys=data_labels, names=[*data_label_names, *matches[0].index.names])
    detected = single_results.pop("detected")
    detected = pd.concat(detected, keys=data_labels, names=[*data_label_names, *detected[0].index.names])
    reference = single_results.pop("reference")
    reference = pd.concat(reference, keys=data_labels, names=[*data_label_names, *reference[0].index.names])

    aggregated_single_results = {
        "raw__detected": detected,
        "raw__reference": reference,
    }

    sampling_rate_hz = single_results.pop("sampling_rate_hz")
    if set(sampling_rate_hz) != {sampling_rate_hz[0]}:
        raise ValueError(
            "Sampling rate is not the same for all datapoints in the dataset. "
            "This not supported by this scorer. "
            "Provide a custom scorer that can handle this case."
        )

    combined_unmatched = {
        f"combined__{k}": v
        for k, v in calculate_unmatched_gsd_performance_metrics(
            gsd_list_detected=detected,
            gsd_list_reference=reference,
            sampling_rate_hz=sampling_rate_hz[0],
        ).items()
    }
    combined_matched = {f"combined__{k}": v for k, v in calculate_matched_gsd_performance_metrics(matches).items()}

    # Note, that we pass the "aggregated_single_results" out via the single results and not the aggregated results
    # The reason is that the aggregated results are expected to be a single value per metric, while the single results
    # can be anything.
    return {**agg_results, **combined_unmatched, **combined_matched}, {**single_results, **aggregated_single_results}

We can see that these functions are relatively simple and build on the lower-level GSD evaluation functions that we provide. gsd_per_datapoint_score calculates the raw results and all scores that can be calculated per datapoint. gsd_final_agg handles the calculation of all scores that require the raw results from all datapoints at once. The remaining aggregation is handled by the Scorer class (see below). So if you want to use your own scoring function, it should be straightforward to do so.

Note the no_agg wrapping some of the return values. This is a special aggregator that tells the challenge not to aggregate the respective values. For all other values, the challenge will try to average the values across all datapoints.

To learn more about these special aggregators, check out the tpcp custom scorer example.
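The idea can be sketched without tpcp. The example below is an illustration only: `PassThrough` and `aggregate_scores` are hypothetical names, not part of tpcp or mobgap, and the real `no_agg` and `Scorer` cover many more cases.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Any


@dataclass
class PassThrough:
    """Hypothetical stand-in for a no_agg-style wrapper: marks a value as 'do not average'."""

    value: Any


def aggregate_scores(per_datapoint: list[dict[str, Any]]) -> tuple[dict[str, float], dict[str, list]]:
    """Average plain numbers across datapoints and collect the raw values per key."""
    aggregated, single = {}, {}
    for key in per_datapoint[0]:
        values = [scores[key] for scores in per_datapoint]
        if isinstance(values[0], PassThrough):
            # Wrapped values are only collected, never averaged.
            single[key] = [v.value for v in values]
        else:
            single[key] = values
            aggregated[key] = mean(values)
    return aggregated, single


# Made-up scores for two datapoints.
scores = [
    {"precision": 0.5, "detected": PassThrough([(600, 1201)])},
    {"precision": 0.7, "detected": PassThrough([(450, 1201)])},
]
aggregated, single = aggregate_scores(scores)
print(aggregated)
print(single["detected"])
```

Only `precision` ends up in the aggregated values, while the wrapped `detected` lists are kept as they are for later use.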

The scoring function takes care of running the pipeline. So we can test the scorer, by just providing it with a pipeline and a datapoint.

from pprint import pprint

single_dp_results = gsd_per_datapoint_score(pipe, long_test[0])
single_dp_results.pop("detected")
single_dp_results.pop("reference")
pprint(single_dp_results)

{'accuracy': 0.704708699122107,
 'detected_gs_duration_s': 48.12,
 'detected_num_gs': 6,
 'f1_score': 0.5408393501805054,
 'fn_samples': 1649,
 'fp_samples': 2421,
 'gs_absolute_duration_error_s': 7.68,
 'gs_absolute_relative_duration_error': 0.18991097922848665,
 'gs_absolute_relative_duration_error_log': 0.17387849695420737,
 'gs_duration_error_s': 7.68,
 'gs_relative_duration_error': 0.18991097922848665,
 'matches': _NoAgg(return_raw_scores=True)(    start    end match_type
0       0    600         tn
1     600    632         fp
2     632    988         tp
3     988   1201         fp
4    1201   2864         tn
5    2864   3325         fn
6    3325   3853         tn
7    3853   4350         fn
8    4350   5085         tp
9    5085   5251         fp
10   5251   7641         tn
11   7641   7800         fn
12   7800   8621         tp
13   8621   9001         fp
14   9001   9300         tn
15   9300   9451         fp
16   9451   9932         tp
17   9932  10201         fp
18  10201  10950         tn
19  10950  11551         fp
20  11551  11989         tn
21  11989  12517         fn
22  12517  13050         tn
23  13050  13651         fp
24  13651  13758         tn),
 'npv': 0.8160624651422197,
 'num_gs_absolute_error': 0,
 'num_gs_absolute_relative_error': 0.0,
 'num_gs_absolute_relative_error_log': 0.0,
 'num_gs_error': 0,
 'num_gs_relative_error': 0.0,
 'precision': 0.4975093399750934,
 'recall': 0.592436974789916,
 'reference_gs_duration_s': 40.44,
 'reference_num_gs': 6,
 'runtime_s': 0.010098662000018521,
 'sampling_rate_hz': _NoAgg(return_raw_scores=True)(100.0),
 'specificity': 0.7513607887439663,
 'tn_samples': 7316,
 'tp_samples': 2397}
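The sample-wise metrics in this output follow from the four `*_samples` counts. As a quick check, the sketch below recomputes them with the standard confusion-matrix formulas. `classification_metrics` is a hypothetical helper written for this illustration, not a mobgap function.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Standard confusion-matrix metrics from sample counts."""
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "f1_score": 2 * tp / (2 * tp + fp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "npv": tn / (tn + fn),
    }


# Counts taken from the output above.
metrics = classification_metrics(tp=2397, fp=2421, fn=1649, tn=7316)
for name, value in metrics.items():
    print(f"{name}: {value:.6f}")
```

The printed values agree with the corresponding entries of the dictionary above.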

To use the two functions with a challenge, we need to wrap them into a Scorer instance.

from tpcp.validate import Scorer

gsd_evaluation_scorer = Scorer(
    gsd_per_datapoint_score, final_aggregator=gsd_final_agg
)

The challenge will call this scorer for each group in the dataset. The scorer itself will then call gsd_per_datapoint_score for each datapoint and then gsd_final_agg with the combined results.

For these two default scoring functions, we also provide the scorer directly, so that you don’t have to construct it yourself. However, in case you want to modify the scoring functions, you can do so by creating your own scorer. We will continue to use the default scorer for the challenges.

from mobgap.gait_sequences.evaluation import gsd_score

gsd_evaluation_scorer = gsd_score

Let’s put everything together and run the challenge.

from mobgap.utils.evaluation import Evaluation

eval_challenge = Evaluation(long_test, scoring=gsd_evaluation_scorer)

We can now run the challenge.

eval_challenge = eval_challenge.run(pipe)

Datapoints:   0%|          | 0/3 [00:00<?, ?it/s]
Datapoints:  33%|███▎      | 1/3 [00:00<00:00,  5.19it/s]
Datapoints:  67%|██████▋   | 2/3 [00:00<00:00,  5.20it/s]
Datapoints: 100%|██████████| 3/3 [00:00<00:00,  5.05it/s]
Datapoints: 100%|██████████| 3/3 [00:00<00:00,  5.09it/s]

The results are stored in the results_ attribute and contain the aggregated and the raw results per datapoint. To learn more about the results, check the validate documentation.

Note, that we remove the no_agg parameters from the results, as they don’t visualize well.

import pandas as pd

validate_results = pd.DataFrame(eval_challenge.results_)
validate_results

	debug__score_time	data_labels	single__reference_gs_duration_s	single__detected_gs_duration_s	single__gs_duration_error_s	single__gs_relative_duration_error	single__gs_absolute_duration_error_s	single__gs_absolute_relative_duration_error	single__gs_absolute_relative_duration_error_log	single__detected_num_gs	single__reference_num_gs	single__num_gs_error	single__num_gs_relative_error	single__num_gs_absolute_error	single__num_gs_absolute_relative_error	single__num_gs_absolute_relative_error_log	single__tp_samples	single__fp_samples	single__fn_samples	single__precision	single__recall	single__f1_score	single__tn_samples	single__specificity	single__accuracy	single__npv	single__runtime_s	single__raw__detected	single__raw__reference	agg__reference_gs_duration_s	agg__detected_gs_duration_s	agg__gs_duration_error_s	agg__gs_relative_duration_error	agg__gs_absolute_duration_error_s	agg__gs_absolute_relative_duration_error	agg__gs_absolute_relative_duration_error_log	agg__detected_num_gs	agg__reference_num_gs	agg__num_gs_error	agg__num_gs_relative_error	agg__num_gs_absolute_error	agg__num_gs_absolute_relative_error	agg__num_gs_absolute_relative_error_log	agg__tp_samples	agg__fp_samples	agg__fn_samples	agg__precision	agg__recall	agg__f1_score	agg__tn_samples	agg__specificity	agg__accuracy	agg__npv	agg__runtime_s	agg__combined__reference_gs_duration_s	agg__combined__detected_gs_duration_s	agg__combined__gs_duration_error_s	agg__combined__gs_relative_duration_error	agg__combined__gs_absolute_duration_error_s	agg__combined__gs_absolute_relative_duration_error	agg__combined__gs_absolute_relative_duration_error_log	agg__combined__detected_num_gs	agg__combined__reference_num_gs	agg__combined__num_gs_error	agg__combined__num_gs_relative_error	agg__combined__num_gs_absolute_error	agg__combined__num_gs_absolute_relative_error	agg__combined__num_gs_absolute_relative_error_log	agg__combined__tp_samples	agg__combined__fp_samples	agg__combined__fn_samples	agg__combined__precision	
agg__combined__recall	agg__combined__f1_score	agg__combined__tn_samples	agg__combined__specificity	agg__combined__accuracy	agg__combined__npv
0	0.709324	[(HA, 001, TimeMeasure1, Test11, Trial1), (HA,...	[40.44, 40.82, 65.52]	[48.12, 42.08, 97.64]	[7.68, 1.259999999999998, 32.120000000000005]	[0.18991097922848665, 0.030867221950024448, 0....	[7.68, 1.259999999999998, 32.120000000000005]	[0.18991097922848665, 0.030867221950024448, 0....	[0.17387849695420737, 0.03040041104775553, 0.3...	[6, 4, 7]	[6, 3, 6]	[0, 1, 1]	[0.0, 0.3333333333333333, 0.16666666666666666]	[0, 1, 1]	[0.0, 0.3333333333333333, 0.16666666666666666]	[0.0, 0.28768207245178085, 0.15415067982725836]	[2397, 2875, 6436]	[2421, 1337, 3339]	[1649, 1209, 117]	[0.4975093399750934, 0.682573599240266, 0.6584...	[0.592436974789916, 0.7039666993143977, 0.9821...	[0.5408393501805054, 0.6931051108968177, 0.788...	[7316, 10577, 12862]	[0.7513607887439663, 0.8877790834312573, 0.793...	[0.704708699122107, 0.8408551068883611, 0.8481...	[0.8160624651422197, 0.8974206685898524, 0.990...	[0.009761240999978327, 0.006549776999236201, 0...	...	...	48.926667	62.613333	13.686667	0.237003	13.686667	0.237003	0.20107	5.666667	5.0	0.666667	0.166667	0.666667	0.166667	0.147278	3902.666667	2365.666667	991.666667	0.612832	0.759516	0.674095	10251.666667	0.811014	0.797893	0.90149	0.007727	146.78	187.84	41.06	0.279738	41.06	0.279738	0.246656	17	15	2	0.133333	2	0.133333	0.125163	11708	7097	2975	0.6226	0.797385	0.699236	30755	0.812507	0.80828	0.9118

As you can see, this is a very messy dataframe with a lot of information. To make this easier to digest, the evaluation object has methods for extracting the different groups of information. The first group is the aggregated results, which represent only a “single value” over the entire dataset.

agg_results = eval_challenge.get_aggregated_results_as_df()
agg_results.T

fold	0
reference_gs_duration_s	48.926667
detected_gs_duration_s	62.613333
gs_duration_error_s	13.686667
gs_relative_duration_error	0.237003
gs_absolute_duration_error_s	13.686667
gs_absolute_relative_duration_error	0.237003
gs_absolute_relative_duration_error_log	0.201070
detected_num_gs	5.666667
reference_num_gs	5.000000
num_gs_error	0.666667
num_gs_relative_error	0.166667
num_gs_absolute_error	0.666667
num_gs_absolute_relative_error	0.166667
num_gs_absolute_relative_error_log	0.147278
tp_samples	3902.666667
fp_samples	2365.666667
fn_samples	991.666667
precision	0.612832
recall	0.759516
f1_score	0.674095
tn_samples	10251.666667
specificity	0.811014
accuracy	0.797893
npv	0.901490
runtime_s	0.007727
combined__reference_gs_duration_s	146.780000
combined__detected_gs_duration_s	187.840000
combined__gs_duration_error_s	41.060000
combined__gs_relative_duration_error	0.279738
combined__gs_absolute_duration_error_s	41.060000
combined__gs_absolute_relative_duration_error	0.279738
combined__gs_absolute_relative_duration_error_log	0.246656
combined__detected_num_gs	17.000000
combined__reference_num_gs	15.000000
combined__num_gs_error	2.000000
combined__num_gs_relative_error	0.133333
combined__num_gs_absolute_error	2.000000
combined__num_gs_absolute_relative_error	0.133333
combined__num_gs_absolute_relative_error_log	0.125163
combined__tp_samples	11708.000000
combined__fp_samples	7097.000000
combined__fn_samples	2975.000000
combined__precision	0.622600
combined__recall	0.797385
combined__f1_score	0.699236
combined__tn_samples	30755.000000
combined__specificity	0.812507
combined__accuracy	0.808280
combined__npv	0.911800

You might have noticed that many metrics appear twice, once with a combined__ prefix and once without. These represent two different things. As the source code of the scorer above shows, the metric without the prefix is calculated per datapoint and then averaged. The metric with the prefix is calculated over the raw detected gait sequences of all datapoints combined. Effectively, this is equivalent to different “weightings”. In the aggregated results without the prefix, each recording has the same weight, independent of its length. In the second case, each individual IMU sample has the same weight. It does not matter in which recording a sample was classified correctly or not; it has the same impact on the combined metric.

Both approaches are valid, but you should be aware of the differences when comparing algorithms. How you aggregate can have a big impact on the results.

combined_metrics = agg_results.filter(like="combined__").rename(
    columns=lambda x: x.replace("combined__", "")
)
combined_vs_per_datapoint = pd.concat(
    {
        "combined": combined_metrics,
        "per_datapoint": agg_results[combined_metrics.columns],
    },
    axis=0,
)
combined_vs_per_datapoint.reset_index(level=-1, drop=True).T

	combined	per_datapoint
reference_gs_duration_s	146.780000	48.926667
detected_gs_duration_s	187.840000	62.613333
gs_duration_error_s	41.060000	13.686667
gs_relative_duration_error	0.279738	0.237003
gs_absolute_duration_error_s	41.060000	13.686667
gs_absolute_relative_duration_error	0.279738	0.237003
gs_absolute_relative_duration_error_log	0.246656	0.201070
detected_num_gs	17.000000	5.666667
reference_num_gs	15.000000	5.000000
num_gs_error	2.000000	0.666667
num_gs_relative_error	0.133333	0.166667
num_gs_absolute_error	2.000000	0.666667
num_gs_absolute_relative_error	0.133333	0.166667
num_gs_absolute_relative_error_log	0.125163	0.147278
tp_samples	11708.000000	3902.666667
fp_samples	7097.000000	2365.666667
fn_samples	2975.000000	991.666667
precision	0.622600	0.612832
recall	0.797385	0.759516
f1_score	0.699236	0.674095
tn_samples	30755.000000	10251.666667
specificity	0.812507	0.811014
accuracy	0.808280	0.797893
npv	0.911800	0.901490
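The precision row of this comparison can be reproduced by hand from the per-recording sample counts shown earlier. The sketch below uses only those numbers and plain Python; it is meant to make the two weightings tangible, not to replace the scorer.

```python
from statistics import mean

# True-positive and false-positive sample counts of the three recordings.
tp_samples = [2397, 2875, 6436]
fp_samples = [2421, 1337, 3339]

# Per-datapoint: compute precision per recording, then average.
# Every recording counts equally, regardless of its length.
per_datapoint = mean([tp / (tp + fp) for tp, fp in zip(tp_samples, fp_samples)])

# Combined: pool the counts first, then compute precision once.
# Recordings with more samples contribute more.
combined = sum(tp_samples) / (sum(tp_samples) + sum(fp_samples))

print(f"per_datapoint: {per_datapoint:.6f}")
print(f"combined:      {combined:.6f}")
```

The third recording has by far the most detected samples, so its precision dominates the combined value more than the per-datapoint average.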

The “single” results represent the values per datapoint.

single_results = eval_challenge.get_single_results_as_df()
single_results.T

fold	0
cohort	HA		MS
participant_id	001	002	001
time_measure	TimeMeasure1	TimeMeasure1	TimeMeasure1
test	Test11	Test11	Test11
trial	Trial1	Trial1	Trial1
reference_gs_duration_s	40.44	40.82	65.52
detected_gs_duration_s	48.12	42.08	97.64
gs_duration_error_s	7.68	1.26	32.12
gs_relative_duration_error	0.189911	0.030867	0.490232
gs_absolute_duration_error_s	7.68	1.26	32.12
gs_absolute_relative_duration_error	0.189911	0.030867	0.490232
gs_absolute_relative_duration_error_log	0.173878	0.0304	0.398932
detected_num_gs	6	4	7
reference_num_gs	6	3	6
num_gs_error	0	1	1
num_gs_relative_error	0.0	0.333333	0.166667
num_gs_absolute_error	0	1	1
num_gs_absolute_relative_error	0.0	0.333333	0.166667
num_gs_absolute_relative_error_log	0.0	0.287682	0.154151
tp_samples	2397	2875	6436
fp_samples	2421	1337	3339
fn_samples	1649	1209	117
precision	0.497509	0.682574	0.658414
recall	0.592437	0.703967	0.982146
f1_score	0.540839	0.693105	0.788339
tn_samples	7316	10577	12862
specificity	0.751361	0.887779	0.793902
accuracy	0.704709	0.840855	0.848115
npv	0.816062	0.897421	0.990985
runtime_s	0.009761	0.00655	0.006871
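The duration error rows can be derived from the two duration rows. The formulas in the sketch below are inferred from the numbers in the table, which they reproduce to the printed precision. The authoritative definitions live in the mobgap evaluation functions, and `duration_errors` is a hypothetical helper.

```python
from math import log1p


def duration_errors(reference_s: float, detected_s: float) -> dict[str, float]:
    """Error metrics between a detected and a reference total duration (in seconds)."""
    error = detected_s - reference_s
    abs_relative = abs(error) / reference_s
    return {
        "error": error,
        "relative": error / reference_s,
        "abs_error": abs(error),
        "abs_relative": abs_relative,
        # Assumed definition of the "_log" variant: log(1 + absolute relative error).
        "abs_relative_log": log1p(abs_relative),
    }


# Durations of the three recordings, taken from the table above.
reference = [40.44, 40.82, 65.52]
detected = [48.12, 42.08, 97.64]
rows = [duration_errors(ref, det) for ref, det in zip(reference, detected)]
for row in rows:
    print({name: round(value, 6) for name, value in row.items()})
```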

And finally, we had a couple of “raw” results in the scoring that we passed through without calculating any error metrics. These are available as a dictionary of raw results.

raw_results = eval_challenge.get_raw_results()
list(raw_results.keys())

['detected', 'reference']

raw_results["detected"]

							start	end
fold	cohort	participant_id	time_measure	test	trial	gs_id
0	HA	001	TimeMeasure1	Test11	Trial1	0	600	1201
						1	4350	5251
						2	7800	9001
						3	9300	10201
						4	10950	11551
						5	13050	13651
		002	TimeMeasure1	Test11	Trial1	0	450	1201
						1	2700	3301
						2	5700	7951
						3	15000	15601
	MS	001	TimeMeasure1	Test11	Trial1	0	900	2101
						1	4650	6151
						2	9600	10801
						3	11250	12151
						4	12300	14851
						5	19950	21151
						6	21300	22501

raw_results["reference"]

							start	end
fold	cohort	participant_id	time_measure	test	trial	wb_id
0	HA	001	TimeMeasure1	Test11	Trial1	0	632	988
						1	2864	3325
						2	3853	5085
						3	7641	8621
						4	9451	9932
						5	11989	12517
		002	TimeMeasure1	Test11	Trial1	0	485	1131
						1	1746	3554
						2	6083	7708
	MS	001	TimeMeasure1	Test11	Trial1	0	1019	1768
						1	4534	5549
						2	9665	10569
						3	12337	14633
						4	20151	20982
						5	21378	22129

Further, some runtime information is available (i.e. when the challenge was started and how long it took).

eval_challenge.perf_["start_datetime"], eval_challenge.perf_["end_datetime"]

('2026-04-01T10:40:51.212210+00:00', '2026-04-01T10:40:51.927462+00:00')

eval_challenge.perf_["runtime_s"]

0.7152096029994937

Using Evaluation is great if you are only comparing (or planning to compare) non-ML algorithms, or algorithms that don’t require further optimization (e.g. through GridSearch). As soon as an algorithm is trained or tuned on data, however, it has to be evaluated on data it has not seen during optimization.

Therefore, it is generally recommended to run a cross-validation with EvaluationCV. This allows you to evaluate the performance of the algorithm on multiple folds of the dataset, and through the use of DummyOptimize you can also include algorithms without optimization in the same setup for comparison.

Let’s demonstrate the use of EvaluationCV on the example dataset using the same algorithm, once with and once without GridSearch.

For the CV-based challenge, we need to set up a cross-validation. As we only have 3 datapoints here, we will use a 3-fold cross-validation without grouping or stratification. In a real-world scenario, you would use a more sophisticated cross-validation strategy. You can learn more about cross-validation in the tpcp cross-validation example.
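With three datapoints, a 3-fold split is effectively a leave-one-out scheme: each fold tests on exactly one recording and trains on the other two. The sketch below assumes that an integer `cv_iterator` results in a plain, unshuffled k-fold split (the scikit-learn `KFold` default). `k_fold_indices` is a hypothetical helper for illustration only.

```python
from collections.abc import Iterator


def k_fold_indices(n_samples: int, n_splits: int) -> Iterator[tuple[list[int], list[int]]]:
    """Yield (train, test) index lists for an unshuffled k-fold split."""
    if not 2 <= n_splits <= n_samples:
        raise ValueError("n_splits must be between 2 and n_samples.")
    base_size, n_larger_folds = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `n_larger_folds` folds get one extra sample.
        size = base_size + (1 if fold < n_larger_folds else 0)
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size


folds = list(k_fold_indices(n_samples=3, n_splits=3))
for fold_id, (train, test) in enumerate(folds):
    print(f"fold {fold_id}: train={train}, test={test}")
```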

Further, to speed things up, we are going to use multi-processing. We can configure this using the n_jobs parameter, which is passed to the internal cross_validate function via the cv_params parameter.

from mobgap.utils.evaluation import EvaluationCV

eval_challenge_cv = EvaluationCV(
    long_test,
    cv_iterator=3,
    scoring=gsd_evaluation_scorer,
    cv_params={"n_jobs": 2, "return_optimizer": True},
)

To use our pipeline from above, we need to wrap it in a DummyOptimize instance. This will basically skip any optimization on the train set and just apply the pipeline to the test set.

from tpcp.optimize import DummyOptimize

eval_challenge_cv = eval_challenge_cv.run(
    DummyOptimize(pipe, ignore_potential_user_error_warning=True)
)

CV Folds:   0%|          | 0/3 [00:00<?, ?it/s]
CV Folds:  33%|███▎      | 1/3 [00:03<00:06,  3.24s/it]
CV Folds: 100%|██████████| 3/3 [00:03<00:00,  1.04it/s]
CV Folds: 100%|██████████| 3/3 [00:03<00:00,  1.19s/it]

The results are now a little more complex, as they contain the results for each fold. In addition, we have information for the train and the test set. The test set results are what we are usually looking for. The train set results are only calculated when return_train_score is provided via cv_params.

As before all results are stored in the results_ attribute, but it is usually recommended to use the helper methods to access the data.

Note that, compared to the results above, we now have multiple CV folds, and the aggregated results contain one value per fold. These values could be further aggregated, e.g. by calculating the mean over all folds.

agg_results_cv = eval_challenge_cv.get_aggregated_results_as_df()
agg_results_cv.T

fold	0	1	2
reference_gs_duration_s	40.440000	40.820000	65.520000
detected_gs_duration_s	48.120000	42.080000	97.640000
gs_duration_error_s	7.680000	1.260000	32.120000
gs_relative_duration_error	0.189911	0.030867	0.490232
gs_absolute_duration_error_s	7.680000	1.260000	32.120000
gs_absolute_relative_duration_error	0.189911	0.030867	0.490232
gs_absolute_relative_duration_error_log	0.173878	0.030400	0.398932
detected_num_gs	6.000000	4.000000	7.000000
reference_num_gs	6.000000	3.000000	6.000000
num_gs_error	0.000000	1.000000	1.000000
num_gs_relative_error	0.000000	0.333333	0.166667
num_gs_absolute_error	0.000000	1.000000	1.000000
num_gs_absolute_relative_error	0.000000	0.333333	0.166667
num_gs_absolute_relative_error_log	0.000000	0.287682	0.154151
tp_samples	2397.000000	2875.000000	6436.000000
fp_samples	2421.000000	1337.000000	3339.000000
fn_samples	1649.000000	1209.000000	117.000000
precision	0.497509	0.682574	0.658414
recall	0.592437	0.703967	0.982146
f1_score	0.540839	0.693105	0.788339
tn_samples	7316.000000	10577.000000	12862.000000
specificity	0.751361	0.887779	0.793902
accuracy	0.704709	0.840855	0.848115
npv	0.816062	0.897421	0.990985
runtime_s	0.723435	0.709067	0.006342
combined__reference_gs_duration_s	40.440000	40.820000	65.520000
combined__detected_gs_duration_s	48.120000	42.080000	97.640000
combined__gs_duration_error_s	7.680000	1.260000	32.120000
combined__gs_relative_duration_error	0.189911	0.030867	0.490232
combined__gs_absolute_duration_error_s	7.680000	1.260000	32.120000
combined__gs_absolute_relative_duration_error	0.189911	0.030867	0.490232
combined__gs_absolute_relative_duration_error_log	0.173878	0.030400	0.398932
combined__detected_num_gs	6.000000	4.000000	7.000000
combined__reference_num_gs	6.000000	3.000000	6.000000
combined__num_gs_error	0.000000	1.000000	1.000000
combined__num_gs_relative_error	0.000000	0.333333	0.166667
combined__num_gs_absolute_error	0.000000	1.000000	1.000000
combined__num_gs_absolute_relative_error	0.000000	0.333333	0.166667
combined__num_gs_absolute_relative_error_log	0.000000	0.287682	0.154151
combined__tp_samples	2397.000000	2875.000000	6436.000000
combined__fp_samples	2421.000000	1337.000000	3339.000000
combined__fn_samples	1649.000000	1209.000000	117.000000
combined__precision	0.497509	0.682574	0.658414
combined__recall	0.592437	0.703967	0.982146
combined__f1_score	0.540839	0.693105	0.788339
combined__tn_samples	7316.000000	10577.000000	12862.000000
combined__specificity	0.751361	0.887779	0.793902
combined__accuracy	0.704709	0.840855	0.848115
combined__npv	0.816062	0.897421	0.990985
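Summarising the per-fold values across folds takes only a few lines. The sketch below does this for the precision row of the table above with the standard library; the sample standard deviation is reported alongside the mean as one possible measure of spread.

```python
from statistics import mean, stdev

# Per-fold precision values, copied from the table above.
precision_per_fold = [0.497509, 0.682574, 0.658414]

mean_precision = mean(precision_per_fold)
std_precision = stdev(precision_per_fold)  # sample standard deviation (n - 1)

print(f"precision: {mean_precision:.6f} +/- {std_precision:.6f}")
```

Because every fold contains exactly one recording here, the mean over folds coincides with the per-datapoint average of the non-CV run.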

The single results contain the CV fold as an additional index. Otherwise, the output is identical to before. Note that if you use anything other than a KFold splitter, you might have some datapoints duplicated across folds.

single_results_cv = eval_challenge_cv.get_single_results_as_df()
single_results_cv

						reference_gs_duration_s	detected_gs_duration_s	gs_duration_error_s	gs_relative_duration_error	gs_absolute_duration_error_s	gs_absolute_relative_duration_error	gs_absolute_relative_duration_error_log	detected_num_gs	reference_num_gs	num_gs_error	num_gs_relative_error	num_gs_absolute_error	num_gs_absolute_relative_error	num_gs_absolute_relative_error_log	tp_samples	fp_samples	fn_samples	precision	recall	f1_score	tn_samples	specificity	accuracy	npv	runtime_s
fold	cohort	participant_id	time_measure	test	trial
0	HA	001	TimeMeasure1	Test11	Trial1	40.44	48.12	7.68	0.189911	7.68	0.189911	0.173878	6	6	0	0.0	0	0.0	0.0	2397	2421	1649	0.497509	0.592437	0.540839	7316	0.751361	0.704709	0.816062	0.723435
1	HA	002	TimeMeasure1	Test11	Trial1	40.82	42.08	1.26	0.030867	1.26	0.030867	0.0304	4	3	1	0.333333	1	0.333333	0.287682	2875	1337	1209	0.682574	0.703967	0.693105	10577	0.887779	0.840855	0.897421	0.709067
2	MS	001	TimeMeasure1	Test11	Trial1	65.52	97.64	32.12	0.490232	32.12	0.490232	0.398932	7	6	1	0.166667	1	0.166667	0.154151	6436	3339	117	0.658414	0.982146	0.788339	12862	0.793902	0.848115	0.990985	0.006342
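
The note about splitters other than KFold can be illustrated with plain scikit-learn, independent of mobgap. The following is a minimal sketch that assumes the challenge is given a standard scikit-learn splitter. The choice of ShuffleSplit and its settings are only an illustration and are not taken from this example.

```python
import numpy as np
from sklearn.model_selection import KFold, ShuffleSplit

# Three datapoints, matching the size of the example subset used on this page.
indices = np.arange(3)


def collect_test_indices(splitter):
    """Return the test indices of all folds as one flat list."""
    return [int(i) for _, test in splitter.split(indices) for i in test]


kfold_test = collect_test_indices(KFold(n_splits=3))
# Illustrative alternative splitter (an assumption, not used in this example).
shuffle_test = collect_test_indices(
    ShuffleSplit(n_splits=3, test_size=2, random_state=0)
)

# KFold: every datapoint is part of a test set exactly once.
print(sorted(kfold_test))  # [0, 1, 2]

# ShuffleSplit: 3 folds with 2 test datapoints each give 6 entries for only
# 3 datapoints, so some datapoints must appear in more than one fold.
print(len(shuffle_test), len(set(shuffle_test)))
```

In such a case, the single results would list the repeated datapoints once per fold, which is why the fold index level is needed to tell them apart.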

And the raw outputs:

raw_results_cv = eval_challenge_cv.get_raw_results()
raw_results_cv["detected"]

							start	end
fold	cohort	participant_id	time_measure	test	trial	gs_id
0	HA	001	TimeMeasure1	Test11	Trial1	0	600	1201
						1	4350	5251
						2	7800	9001
						3	9300	10201
						4	10950	11551
						5	13050	13651
1	HA	002	TimeMeasure1	Test11	Trial1	0	450	1201
						1	2700	3301
						2	5700	7951
						3	15000	15601
2	MS	001	TimeMeasure1	Test11	Trial1	0	900	2101
						1	4650	6151
						2	9600	10801
						3	11250	12151
						4	12300	14851
						5	19950	21151
						6	21300	22501

If we compare these results to the ones from the non-CV challenge, we can see that the “single” results are identical; the datapoints were merely evaluated in separate folds. This is expected, as we used DummyOptimize and thus didn’t optimize the algorithm.
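
Such a comparison can also be done programmatically. The sketch below uses small stand-in DataFrames instead of the real challenge outputs: the precision values are copied from the tables on this page, while the runtime values and the reduced index are made up for illustration. With the real data, the two frames would be the single results of the non-CV and the CV challenge.

```python
import pandas as pd

# Stand-in for the single results of the non-CV challenge.
plain_index = pd.MultiIndex.from_tuples(
    [("HA", "001"), ("HA", "002"), ("MS", "001")],
    names=["cohort", "participant_id"],
)
single_plain = pd.DataFrame(
    {
        "precision": [0.497509, 0.682574, 0.658414],
        "runtime_s": [0.010, 0.011, 0.012],  # made-up timings
    },
    index=plain_index,
)

# Stand-in for the single results of the CV challenge (extra "fold" level).
cv_index = pd.MultiIndex.from_tuples(
    [(0, "HA", "001"), (1, "HA", "002"), (2, "MS", "001")],
    names=["fold", "cohort", "participant_id"],
)
single_cv = pd.DataFrame(
    {
        "precision": [0.497509, 0.682574, 0.658414],
        "runtime_s": [0.020, 0.021, 0.022],  # made-up timings
    },
    index=cv_index,
)

# Drop the fold level and the runtime column (timings differ between runs),
# then check that all remaining values agree.
cv_without_fold = single_cv.droplevel("fold").drop(columns="runtime_s")
pd.testing.assert_frame_equal(
    cv_without_fold, single_plain.drop(columns="runtime_s")
)
print("Single results match.")
```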

Let’s try a GridSearch on the algorithm to see how the results change. For the grid search, we will reuse the same scoring function as before, but we need to specify which scoring result we want to optimize for.

from sklearn.model_selection import ParameterGrid
from tpcp.optimize import GridSearch

para_grid = ParameterGrid({"algo__window_length_s": [2, 3, 4]})
optimizer = GridSearch(
    pipe, para_grid, scoring=gsd_evaluation_scorer, return_optimized="precision"
)
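
To make explicit what the grid search iterates over, the parameter grid can be expanded into its individual candidates. This sketch only uses scikit-learn. The second parameter in the larger grid is a hypothetical name chosen to show how the number of candidates grows and does not refer to a real parameter of the algorithm.

```python
from sklearn.model_selection import ParameterGrid

# The grid from above: one parameter with three candidate values.
para_grid = ParameterGrid({"algo__window_length_s": [2, 3, 4]})
candidates = list(para_grid)
print(candidates)
# [{'algo__window_length_s': 2}, {'algo__window_length_s': 3}, {'algo__window_length_s': 4}]

# With a second (hypothetical) parameter, all combinations are tested.
larger_grid = ParameterGrid(
    {
        "algo__window_length_s": [2, 3, 4],
        "algo__hypothetical_param": [0.1, 0.2],  # illustration only
    }
)
print(len(larger_grid))  # 6
```

Each candidate is evaluated on the data passed to the optimizer, so the runtime grows with the number of combinations.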

The optimizer can now be used in the same CV challenge as before. This guarantees that the same folds are used for the optimization and the evaluation, and ensures the best possible comparison between the algorithm versions.

eval_challenge_gs = eval_challenge_cv.clone().run(optimizer)

CV Folds: 100%|██████████| 3/3 [00:03<00:00,  1.11s/it]

The results we are seeing now are generated by the internally optimized version of the algorithm.

agg_results_cv = eval_challenge_gs.get_aggregated_results_as_df()
agg_results_cv.T

fold	0	1	2
reference_gs_duration_s	40.440000	40.820000	65.520000
detected_gs_duration_s	48.120000	39.080000	97.160000
gs_duration_error_s	7.680000	-1.740000	31.640000
gs_relative_duration_error	0.189911	-0.042626	0.482906
gs_absolute_duration_error_s	7.680000	1.740000	31.640000
gs_absolute_relative_duration_error	0.189911	0.042626	0.482906
gs_absolute_relative_duration_error_log	0.173878	0.041743	0.394004
detected_num_gs	6.000000	4.000000	8.000000
reference_num_gs	6.000000	3.000000	6.000000
num_gs_error	0.000000	1.000000	2.000000
num_gs_relative_error	0.000000	0.333333	0.333333
num_gs_absolute_error	0.000000	1.000000	2.000000
num_gs_absolute_relative_error	0.000000	0.333333	0.333333
num_gs_absolute_relative_error_log	0.000000	0.287682	0.287682
tp_samples	2397.000000	2545.000000	6153.000000
fp_samples	2421.000000	1366.000000	3573.000000
fn_samples	1649.000000	1540.000000	403.000000
precision	0.497509	0.650729	0.632634
recall	0.592437	0.623011	0.938530
f1_score	0.540839	0.636568	0.755804
tn_samples	7316.000000	10547.000000	12627.000000
specificity	0.751361	0.885335	0.779444
accuracy	0.704709	0.818352	0.825277
npv	0.816062	0.872590	0.969071
runtime_s	0.005647	0.005867	0.006510
combined__reference_gs_duration_s	40.440000	40.820000	65.520000
combined__detected_gs_duration_s	48.120000	39.080000	97.160000
combined__gs_duration_error_s	7.680000	-1.740000	31.640000
combined__gs_relative_duration_error	0.189911	-0.042626	0.482906
combined__gs_absolute_duration_error_s	7.680000	1.740000	31.640000
combined__gs_absolute_relative_duration_error	0.189911	0.042626	0.482906
combined__gs_absolute_relative_duration_error_log	0.173878	0.041743	0.394004
combined__detected_num_gs	6.000000	4.000000	8.000000
combined__reference_num_gs	6.000000	3.000000	6.000000
combined__num_gs_error	0.000000	1.000000	2.000000
combined__num_gs_relative_error	0.000000	0.333333	0.333333
combined__num_gs_absolute_error	0.000000	1.000000	2.000000
combined__num_gs_absolute_relative_error	0.000000	0.333333	0.333333
combined__num_gs_absolute_relative_error_log	0.000000	0.287682	0.287682
combined__tp_samples	2397.000000	2545.000000	6153.000000
combined__fp_samples	2421.000000	1366.000000	3573.000000
combined__fn_samples	1649.000000	1540.000000	403.000000
combined__precision	0.497509	0.650729	0.632634
combined__recall	0.592437	0.623011	0.938530
combined__f1_score	0.540839	0.636568	0.755804
combined__tn_samples	7316.000000	10547.000000	12627.000000
combined__specificity	0.751361	0.885335	0.779444
combined__accuracy	0.704709	0.818352	0.825277
combined__npv	0.816062	0.872590	0.969071

Because we used cv_params={"return_optimizer": True}, we can also access the optimizer per fold directly from the results_ attribute. This can be useful to gain more insight into the optimization process and the optimal parameters.

opt_results = pd.Series(eval_challenge_gs.results_["optimizer"])
opt_results

  GridSearch(n_jobs=None, parameter_grid=<sklear...
  GridSearch(n_jobs=None, parameter_grid=<sklear...
  GridSearch(n_jobs=None, parameter_grid=<sklear...
dtype: object

We can get the best parameters per fold by directly interacting with the optimizer instances.

best_params = opt_results.apply(lambda x: pd.Series(x.best_params_))
best_params

	algo__window_length_s
0	3
1	2
2	2
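
When many folds are used, it can help to summarize how often each parameter value was selected. The following sketch rebuilds the table above as a stand-in DataFrame (values copied from the output) and counts the selections. With the real results, the best_params object from above could be used directly.

```python
import pandas as pd

# Stand-in for `best_params`, using the values shown in the output above.
best_params_example = pd.DataFrame(
    {"algo__window_length_s": [3, 2, 2]}, index=[0, 1, 2]
)

# Count how often each window length was selected across the folds.
selection_counts = best_params_example["algo__window_length_s"].value_counts()
print(selection_counts)

# The most frequently selected value.
most_common = selection_counts.idxmax()
print(most_common)  # 2
```

With only three folds and one datapoint per test set, such a summary is of course only a rough indication of which value generalizes best.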

Or we can go much deeper, by getting all information about the optimization process. Let’s just look at the keys of the information that is available.

all_opti_results_fold0 = pd.DataFrame(opt_results.loc[0].gs_results_)
all_opti_results_fold0.columns.to_list()

['agg__reference_gs_duration_s', 'rank__agg__reference_gs_duration_s', 'agg__detected_gs_duration_s', 'rank__agg__detected_gs_duration_s', 'agg__gs_duration_error_s', 'rank__agg__gs_duration_error_s', 'agg__gs_relative_duration_error', 'rank__agg__gs_relative_duration_error', 'agg__gs_absolute_duration_error_s', 'rank__agg__gs_absolute_duration_error_s', 'agg__gs_absolute_relative_duration_error', 'rank__agg__gs_absolute_relative_duration_error', 'agg__gs_absolute_relative_duration_error_log', 'rank__agg__gs_absolute_relative_duration_error_log', 'agg__detected_num_gs', 'rank__agg__detected_num_gs', 'agg__reference_num_gs', 'rank__agg__reference_num_gs', 'agg__num_gs_error', 'rank__agg__num_gs_error', 'agg__num_gs_relative_error', 'rank__agg__num_gs_relative_error', 'agg__num_gs_absolute_error', 'rank__agg__num_gs_absolute_error', 'agg__num_gs_absolute_relative_error', 'rank__agg__num_gs_absolute_relative_error', 'agg__num_gs_absolute_relative_error_log', 'rank__agg__num_gs_absolute_relative_error_log', 'agg__tp_samples', 'rank__agg__tp_samples', 'agg__fp_samples', 'rank__agg__fp_samples', 'agg__fn_samples', 'rank__agg__fn_samples', 'agg__precision', 'rank__agg__precision', 'agg__recall', 'rank__agg__recall', 'agg__f1_score', 'rank__agg__f1_score', 'agg__tn_samples', 'rank__agg__tn_samples', 'agg__specificity', 'rank__agg__specificity', 'agg__accuracy', 'rank__agg__accuracy', 'agg__npv', 'rank__agg__npv', 'agg__runtime_s', 'rank__agg__runtime_s', 'agg__combined__reference_gs_duration_s', 'rank__agg__combined__reference_gs_duration_s', 'agg__combined__detected_gs_duration_s', 'rank__agg__combined__detected_gs_duration_s', 'agg__combined__gs_duration_error_s', 'rank__agg__combined__gs_duration_error_s', 'agg__combined__gs_relative_duration_error', 'rank__agg__combined__gs_relative_duration_error', 'agg__combined__gs_absolute_duration_error_s', 'rank__agg__combined__gs_absolute_duration_error_s', 'agg__combined__gs_absolute_relative_duration_error', 
'rank__agg__combined__gs_absolute_relative_duration_error', 'agg__combined__gs_absolute_relative_duration_error_log', 'rank__agg__combined__gs_absolute_relative_duration_error_log', 'agg__combined__detected_num_gs', 'rank__agg__combined__detected_num_gs', 'agg__combined__reference_num_gs', 'rank__agg__combined__reference_num_gs', 'agg__combined__num_gs_error', 'rank__agg__combined__num_gs_error', 'agg__combined__num_gs_relative_error', 'rank__agg__combined__num_gs_relative_error', 'agg__combined__num_gs_absolute_error', 'rank__agg__combined__num_gs_absolute_error', 'agg__combined__num_gs_absolute_relative_error', 'rank__agg__combined__num_gs_absolute_relative_error', 'agg__combined__num_gs_absolute_relative_error_log', 'rank__agg__combined__num_gs_absolute_relative_error_log', 'agg__combined__tp_samples', 'rank__agg__combined__tp_samples', 'agg__combined__fp_samples', 'rank__agg__combined__fp_samples', 'agg__combined__fn_samples', 'rank__agg__combined__fn_samples', 'agg__combined__precision', 'rank__agg__combined__precision', 'agg__combined__recall', 'rank__agg__combined__recall', 'agg__combined__f1_score', 'rank__agg__combined__f1_score', 'agg__combined__tn_samples', 'rank__agg__combined__tn_samples', 'agg__combined__specificity', 'rank__agg__combined__specificity', 'agg__combined__accuracy', 'rank__agg__combined__accuracy', 'agg__combined__npv', 'rank__agg__combined__npv', 'single__reference_gs_duration_s', 'single__detected_gs_duration_s', 'single__gs_duration_error_s', 'single__gs_relative_duration_error', 'single__gs_absolute_duration_error_s', 'single__gs_absolute_relative_duration_error', 'single__gs_absolute_relative_duration_error_log', 'single__detected_num_gs', 'single__reference_num_gs', 'single__num_gs_error', 'single__num_gs_relative_error', 'single__num_gs_absolute_error', 'single__num_gs_absolute_relative_error', 'single__num_gs_absolute_relative_error_log', 'single__tp_samples', 'single__fp_samples', 'single__fn_samples', 'single__precision', 
'single__recall', 'single__f1_score', 'single__tn_samples', 'single__specificity', 'single__accuracy', 'single__npv', 'single__runtime_s', 'single__raw__detected', 'single__raw__reference', 'data_labels', 'debug__score_time', 'param__algo__window_length_s', 'params']
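
The long list of keys follows a prefix scheme that becomes easier to read when the names are grouped. The sketch below groups a small subset of the column names printed above by their first prefix. With the real results, the full list from all_opti_results_fold0.columns could be passed instead. The label used for names without a prefix is an arbitrary choice for this illustration.

```python
from collections import defaultdict

# A subset of the column names shown in the output above.
columns = [
    "agg__precision",
    "rank__agg__precision",
    "agg__recall",
    "rank__agg__recall",
    "single__precision",
    "single__raw__detected",
    "data_labels",
    "debug__score_time",
    "param__algo__window_length_s",
    "params",
]

# Group the names by the part before the first double underscore.
groups = defaultdict(list)
for name in columns:
    prefix = name.split("__", 1)[0] if "__" in name else "(no prefix)"
    groups[prefix].append(name)

for prefix, names in groups.items():
    print(f"{prefix}: {names}")
```

In this subset, the agg and rank columns come in pairs for each metric, while the single columns hold the per-datapoint values and the param column holds the tested parameter value.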

With that, we hope it becomes clear how valuable these challenges can be when benchmarking algorithms across datasets. To see how we evaluate the performance of the algorithms available in mobgap, check out the other GSD evaluation examples.

Total running time of the script: (0 minutes 13.888 seconds)

Estimated memory usage: 81 MB
