.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/gait_sequences/_04_gsd_evaluation_challenges.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_gait_sequences__04_gsd_evaluation_challenges.py:

GSD Evaluation Challenges
-------------------------

The :ref:`gsd_evaluation` example demonstrates how to evaluate the performance of a GSD algorithm on a single
datapoint and explains the individual performance metrics that are calculated.
With that, you could set up a custom evaluation pipeline that runs a GSD algorithm on multiple datapoints, scores the
output per datapoint, and then aggregates the results.

To make this process easier, we provide opinionated evaluation challenges that can be used to quickly perform the same
evaluation with multiple algorithms and datasets.
Below, we will show how to use them on the example dataset.

.. GENERATED FROM PYTHON SOURCE LINES 14-16

.. code-block:: Python

    # TODO: Update based on new Scorer API

.. GENERATED FROM PYTHON SOURCE LINES 17-21

Dataset
-------
To use the challenges, we need a dataset with reference information in the expected format.
We will use the :class:`~mobgap.data.LabExampleDataset` for this purpose.

.. GENERATED FROM PYTHON SOURCE LINES 21-27

.. code-block:: Python

    from mobgap.data import LabExampleDataset

    long_test = LabExampleDataset(reference_system="INDIP").get_subset(
        test="Test11"
    )

.. GENERATED FROM PYTHON SOURCE LINES 28-31

Algorithm
---------
Next, we need to create an instance of a valid GSD algorithm.

.. GENERATED FROM PYTHON SOURCE LINES 31-35

.. code-block:: Python

    from mobgap.gait_sequences import GsdIluz

    algo = GsdIluz()

.. GENERATED FROM PYTHON SOURCE LINES 36-39

This algorithm needs to be wrapped in a :class:`~mobgap.gait_sequences.pipeline.GsdEmulationPipeline` to be used in
the challenges.
This pipeline takes care of extracting the correct data from the dataset and running the algorithm on it.

.. GENERATED FROM PYTHON SOURCE LINES 39-43

.. code-block:: Python

    from mobgap.gait_sequences.pipeline import GsdEmulationPipeline

    pipe = GsdEmulationPipeline(algo)

.. GENERATED FROM PYTHON SOURCE LINES 44-45

Let's demonstrate that quickly on a single datapoint.

.. GENERATED FROM PYTHON SOURCE LINES 45-48

.. code-block:: Python

    pipe_with_results = pipe.clone().run(long_test[0])
    pipe_with_results.gs_list_

.. raw:: html
start end
gs_id
0 600 1201
1 4350 5251
2 7800 9001
3 9300 10201
4 10950 11551
5 13050 13651


.. GENERATED FROM PYTHON SOURCE LINES 49-71

Evaluation Challenge
--------------------
This pipeline can now be used as part of an evaluation challenge.
An evaluation challenge takes care of two things:

- Running the pipeline on multiple datapoints
- Scoring the results per datapoint and then aggregating the results

We provide two challenges:

- :class:`~mobgap.utils.evaluation.Evaluation`: This challenge simply runs the pipeline on all datapoints
  and then scores the results.
- :class:`~mobgap.utils.evaluation.EvaluationCV`: This challenge runs a cross-validation on the dataset
  and then scores the results per fold.

Before we run the entire pipeline, let's look at the scoring.
The scoring builds on tpcp's validation framework.
As the scoring is relatively complex, it is split across two functions:

- :func:`~mobgap.gait_sequences.evaluation.gsd_per_datapoint_score`: Run and score a single datapoint
- :func:`~mobgap.gait_sequences.evaluation.gsd_final_agg`: Perform the final aggregation and scoring based on the
  results per datapoint.

Let's look at their code first.

.. GENERATED FROM PYTHON SOURCE LINES 71-80

.. code-block:: Python

    from inspect import getsource

    from mobgap.gait_sequences.evaluation import (
        gsd_final_agg,
        gsd_per_datapoint_score,
    )

    print(getsource(gsd_per_datapoint_score))

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    def gsd_per_datapoint_score(pipeline: GsdEmulationPipeline, datapoint: BaseGaitDatasetWithReference) -> dict:
        """Evaluate the performance of a GSD algorithm on a single datapoint.

        .. warning:: This function is not meant to be called directly, but as a scoring function in a
           :class:`tpcp.validate.Scorer`.
           If you are writing custom scoring functions, you can use this function as a template or wrap it in a new
           function.

        This function is used to evaluate the performance of a GSD algorithm on a single datapoint.
        It calculates the performance metrics based on the detected gait sequences and the reference gait sequences.

        The following performance metrics are calculated:

        - all outputs of :func:`~mobgap.gait_sequences.evaluation.calculate_unmatched_gsd_performance_metrics`
          (will be averaged over all datapoints)
        - all outputs of :func:`~mobgap.gait_sequences.evaluation.calculate_matched_gsd_performance_metrics`
          (will be averaged over all datapoints)
        - ``matches``: The matched gait sequences calculated by
          :func:`~mobgap.gait_sequences.evaluation.categorize_intervals_per_sample` (return as ``no_agg``)
        - ``detected``: The detected gait sequences (return as ``no_agg``)
        - ``reference``: The reference gait sequences (return as ``no_agg``)
        - ``sampling_rate_hz``: The sampling rate of the data (return as ``no_agg``)

        Parameters
        ----------
        pipeline
            An instance of GSD emulation pipeline that wraps the algorithm that should be evaluated.
        datapoint
            The datapoint to be evaluated.

        Returns
        -------
        dict
            A dictionary containing the performance metrics.
            Note, that some results are wrapped in a ``no_agg`` object or other aggregators.
            The results of this function are not expected to be parsed manually, but rather the function is expected to be
            used in the context of the :func:`~tpcp.validate.validate`/:func:`~tpcp.validate.cross_validate` functions or
            similar as scorer.
            This functions will aggregate the results and provide a summary of the performance metrics.

        """
        from mobgap.gait_sequences.evaluation import (
            calculate_matched_gsd_performance_metrics,
            calculate_unmatched_gsd_performance_metrics,
            categorize_intervals_per_sample,
        )

        with warnings.catch_warnings():
            # We know that these errors might happen, and they are usually not relevant for the evaluation
            warnings.filterwarnings("ignore", message="Zero division", category=UserWarning)
            warnings.filterwarnings("ignore", message="multiple ICs", category=UserWarning)

            # Run the algorithm on the datapoint
            pipeline.safe_run(datapoint)
            detected_gs_list = pipeline.gs_list_
            reference_gs_list = datapoint.reference_parameters_.wb_list[["start", "end"]]
            n_overall_samples = len(datapoint.data_ss)
            sampling_rate_hz = datapoint.sampling_rate_hz

            matches = categorize_intervals_per_sample(
                gsd_list_detected=detected_gs_list,
                gsd_list_reference=reference_gs_list,
                n_overall_samples=n_overall_samples,
            )

            # Calculate the performance metrics
            performance_metrics = {
                **calculate_unmatched_gsd_performance_metrics(
                    gsd_list_detected=detected_gs_list,
                    gsd_list_reference=reference_gs_list,
                    sampling_rate_hz=sampling_rate_hz,
                ),
                **calculate_matched_gsd_performance_metrics(matches),
                "matches": no_agg(matches),
                "detected": no_agg(detected_gs_list),
                "reference": no_agg(reference_gs_list),
                "sampling_rate_hz": no_agg(sampling_rate_hz),
                "runtime_s": getattr(pipeline.algo_, "perf_", {}).get("runtime_s", np.nan),
            }

        return performance_metrics

.. GENERATED FROM PYTHON SOURCE LINES 81-83

.. code-block:: Python

    print(getsource(gsd_final_agg))

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    def gsd_final_agg(
        agg_results: dict[str, float],
        single_results: dict[str, list],
        pipeline: GsdEmulationPipeline,  # noqa: ARG001
        dataset: BaseGaitDatasetWithReference,
    ) -> tuple[dict[str, any], dict[str, list[any]]]:
        """Aggregate the performance metrics of a GSD algorithm over multiple datapoints.

        .. warning:: This function is not meant to be called directly, but as ``final_aggregator`` in a
           :class:`tpcp.validate.Scorer`.
           If you are writing custom scoring functions, you can use this function as a template or wrap it in a new
           function.

        This function aggregates the performance metrics as follows:

        - All raw outputs (``detected``, ``reference``, ``sampling_rate_hz``) are concatenated to a single dataframe, to
          make it easier to work with and are returned as part of the single results.
        - We recalculate all performance metrics from
          :func:`~mobgap.gait_sequences.evaluation.calculate_unmatched_gsd_performance_metrics` and
          :func:`~mobgap.gait_sequences.evaluation.calculate_matched_gsd_performance_metrics` on the combined data.
          The results are prefixed with ``combined__``.
          Compared to the per-datapoint results (which are calculated, as errors per recording -> average over all
          recordings), these metrics are calculated as combining all GSDs from all recordings and then calculating the
          performance metrics.
          Effectively, this means, that in the `per_datapoint` version, each recording is weighted equally, while in the
          `combined` version, each GS is weighted equally.

        Parameters
        ----------
        agg_results
            The aggregated results from all datapoints (see :class:`~tpcp.validate.Scorer`).
        single_results
            The per-datapoint results (see :class:`~tpcp.validate.Scorer`).
        pipeline
            The pipline that was passed to the scorer.
            This is ignored in this function, but might be useful in custom final aggregators.
        dataset
            The dataset that was passed to the scorer.

        Returns
        -------
        final_agg_results
            The final aggregated results.
        final_single_results
            The per-datapoint results, that are not aggregated.

        """
        from mobgap.gait_sequences.evaluation import (
            calculate_matched_gsd_performance_metrics,
            calculate_unmatched_gsd_performance_metrics,
        )

        data_labels = [d.group_label for d in dataset]
        data_label_names = data_labels[0]._fields
        # We combine each to a combined dataframe
        matches = single_results.pop("matches")
        matches = pd.concat(matches, keys=data_labels, names=[*data_label_names, *matches[0].index.names])
        detected = single_results.pop("detected")
        detected = pd.concat(detected, keys=data_labels, names=[*data_label_names, *detected[0].index.names])
        reference = single_results.pop("reference")
        reference = pd.concat(reference, keys=data_labels, names=[*data_label_names, *reference[0].index.names])

        aggregated_single_results = {
            "raw__detected": detected,
            "raw__reference": reference,
        }

        sampling_rate_hz = single_results.pop("sampling_rate_hz")
        if set(sampling_rate_hz) != {sampling_rate_hz[0]}:
            raise ValueError(
                "Sampling rate is not the same for all datapoints in the dataset. "
                "This not supported by this scorer. "
                "Provide a custom scorer that can handle this case."
            )

        combined_unmatched = {
            f"combined__{k}": v
            for k, v in calculate_unmatched_gsd_performance_metrics(
                gsd_list_detected=detected,
                gsd_list_reference=reference,
                sampling_rate_hz=sampling_rate_hz[0],
            ).items()
        }
        combined_matched = {f"combined__{k}": v for k, v in calculate_matched_gsd_performance_metrics(matches).items()}

        # Note, that we pass the "aggregated_single_results" out via the single results and not the aggregated results
        # The reason is that the aggregated results are expected to be a single value per metric, while the single results
        # can be anything.
        return {**agg_results, **combined_unmatched, **combined_matched}, {**single_results, **aggregated_single_results}

.. GENERATED FROM PYTHON SOURCE LINES 84-99

We can see that these functions are relatively simple and use the lower-level GSD evaluation functions that we provide.
`gsd_per_datapoint_score` calculates the raw results and all scores that can be calculated per datapoint.
`gsd_final_agg` handles the calculation of all scores that require the raw results from all datapoints at once.
The remaining aggregation is handled by the :class:`~tpcp.validate.Scorer` class (see below).
So if you want to write your own scoring function, this should be straightforward.

Note the :func:`~tpcp.validate.no_agg` wrapping some of the return values.
This is a special aggregator that tells the challenge not to aggregate the respective values.
For all other values, the challenge will try to average the values across all datapoints.
To learn more about these special aggregators, check out the `tpcp example `_.

The scoring function takes care of running the pipeline.
So we can test the scorer by just providing it with a pipeline and a datapoint.

.. GENERATED FROM PYTHON SOURCE LINES 99-106

.. code-block:: Python

    from pprint import pprint

    single_dp_results = gsd_per_datapoint_score(pipe, long_test[0])
    single_dp_results.pop("detected")
    single_dp_results.pop("reference")
    pprint(single_dp_results)

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    {'accuracy': 0.704708699122107,
     'detected_gs_duration_s': 48.12,
     'detected_num_gs': 6,
     'f1_score': 0.5408393501805054,
     'fn_samples': 1649,
     'fp_samples': 2421,
     'gs_absolute_duration_error_s': 7.68,
     'gs_absolute_relative_duration_error': 0.18991097922848665,
     'gs_absolute_relative_duration_error_log': 0.17387849695420737,
     'gs_duration_error_s': 7.68,
     'gs_relative_duration_error': 0.18991097922848665,
     'matches': _NoAgg(return_raw_scores=True)(    start    end match_type
    0       0    600         tn
    1     600    632         fp
    2     632    988         tp
    3     988   1201         fp
    4    1201   2864         tn
    5    2864   3325         fn
    6    3325   3853         tn
    7    3853   4350         fn
    8    4350   5085         tp
    9    5085   5251         fp
    10   5251   7641         tn
    11   7641   7800         fn
    12   7800   8621         tp
    13   8621   9001         fp
    14   9001   9300         tn
    15   9300   9451         fp
    16   9451   9932         tp
    17   9932  10201         fp
    18  10201  10950         tn
    19  10950  11551         fp
    20  11551  11989         tn
    21  11989  12517         fn
    22  12517  13050         tn
    23  13050  13651         fp
    24  13651  13758         tn),
     'npv': 0.8160624651422197,
     'num_gs_absolute_error': 0,
     'num_gs_absolute_relative_error': 0.0,
     'num_gs_absolute_relative_error_log': 0.0,
     'num_gs_error': 0,
     'num_gs_relative_error': 0.0,
     'precision': 0.4975093399750934,
     'recall': 0.592436974789916,
     'reference_gs_duration_s': 40.44,
     'reference_num_gs': 6,
     'runtime_s': 0.022248906999266183,
     'sampling_rate_hz': _NoAgg(return_raw_scores=True)(100.0),
     'specificity': 0.7513607887439663,
     'tn_samples': 7316,
     'tp_samples': 2397}

.. GENERATED FROM PYTHON SOURCE LINES 107-108

To use the two functions with a challenge, we need to wrap them into a :class:`~tpcp.validate.Scorer` instance.

.. GENERATED FROM PYTHON SOURCE LINES 108-114

.. code-block:: Python

    from tpcp.validate import Scorer

    gsd_evaluation_scorer = Scorer(
        gsd_per_datapoint_score, final_aggregator=gsd_final_agg
    )

.. GENERATED FROM PYTHON SOURCE LINES 115-123

The challenge will call this scorer for each group in the dataset.
The scorer itself will then call `gsd_per_datapoint_score` for each datapoint and then `gsd_final_agg` with the
combined results.
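
The interplay of the two functions can be illustrated without tpcp.
The following is a minimal plain-Python sketch and not the actual :class:`~tpcp.validate.Scorer` implementation:
``ToyNoAgg``, ``toy_per_datapoint_score``, ``toy_final_agg``, and ``toy_scorer`` are hypothetical stand-ins written
for this illustration, and only the sample counts are taken from the outputs shown on this page.

```python
from statistics import mean


class ToyNoAgg:
    """Hypothetical marker for values that must not be averaged."""

    def __init__(self, value):
        self.value = value


def toy_per_datapoint_score(datapoint):
    """Score one datapoint: one metric to average plus the raw counts."""
    tp, fp = datapoint["tp_samples"], datapoint["fp_samples"]
    return {"precision": tp / (tp + fp), "counts": ToyNoAgg((tp, fp))}


def toy_final_agg(agg_results, single_results):
    """Recalculate the metric on the counts of all datapoints combined."""
    counts = single_results.pop("counts")
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    return {**agg_results, "combined__precision": tp / (tp + fp)}, single_results


def toy_scorer(dataset, score_func, final_aggregator):
    """Score each datapoint, average the plain values, then run the final aggregation."""
    per_datapoint = [score_func(dp) for dp in dataset]
    single_results = {}
    averaged_keys = set()
    for result in per_datapoint:
        for key, value in result.items():
            if isinstance(value, ToyNoAgg):
                value = value.value  # Passed through without averaging
            else:
                averaged_keys.add(key)
            single_results.setdefault(key, []).append(value)
    agg_results = {key: mean(single_results[key]) for key in averaged_keys}
    return final_aggregator(agg_results, single_results)


# tp/fp sample counts of the three example recordings, copied from this page.
toy_dataset = [
    {"tp_samples": 2397, "fp_samples": 2421},
    {"tp_samples": 2875, "fp_samples": 1337},
    {"tp_samples": 6436, "fp_samples": 3339},
]
toy_agg, toy_single = toy_scorer(toy_dataset, toy_per_datapoint_score, toy_final_agg)
print(toy_agg)
```

The averaged ``precision`` and the pooled ``combined__precision`` differ slightly, because the first weights each
recording equally and the second weights each sample equally.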
For these two default scoring functions, we also provide the scorer directly, so that you don't have to construct it
yourself.
However, in case you want to modify the scoring functions, you can do so by creating your own scorer.
We will continue to use the default scorer for the challenges.

.. GENERATED FROM PYTHON SOURCE LINES 123-127

.. code-block:: Python

    from mobgap.gait_sequences.evaluation import gsd_score

    gsd_evaluation_scorer = gsd_score

.. GENERATED FROM PYTHON SOURCE LINES 128-129

Let's put everything together and set up the challenge.

.. GENERATED FROM PYTHON SOURCE LINES 129-133

.. code-block:: Python

    from mobgap.utils.evaluation import Evaluation

    eval_challenge = Evaluation(long_test, scoring=gsd_evaluation_scorer)

.. GENERATED FROM PYTHON SOURCE LINES 134-135

We can now run the challenge.

.. GENERATED FROM PYTHON SOURCE LINES 135-137

.. code-block:: Python

    eval_challenge = eval_challenge.run(pipe)

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Datapoints: 0%| | 0/3 [00:00
debug__score_time data_labels single__reference_gs_duration_s single__detected_gs_duration_s single__gs_duration_error_s single__gs_relative_duration_error single__gs_absolute_duration_error_s single__gs_absolute_relative_duration_error single__gs_absolute_relative_duration_error_log single__detected_num_gs single__reference_num_gs single__num_gs_error single__num_gs_relative_error single__num_gs_absolute_error single__num_gs_absolute_relative_error single__num_gs_absolute_relative_error_log single__tp_samples single__fp_samples single__fn_samples single__precision single__recall single__f1_score single__tn_samples single__specificity single__accuracy single__npv single__runtime_s single__raw__detected single__raw__reference agg__reference_gs_duration_s agg__detected_gs_duration_s agg__gs_duration_error_s agg__gs_relative_duration_error agg__gs_absolute_duration_error_s agg__gs_absolute_relative_duration_error agg__gs_absolute_relative_duration_error_log agg__detected_num_gs agg__reference_num_gs agg__num_gs_error agg__num_gs_relative_error agg__num_gs_absolute_error agg__num_gs_absolute_relative_error agg__num_gs_absolute_relative_error_log agg__tp_samples agg__fp_samples agg__fn_samples agg__precision agg__recall agg__f1_score agg__tn_samples agg__specificity agg__accuracy agg__npv agg__runtime_s agg__combined__reference_gs_duration_s agg__combined__detected_gs_duration_s agg__combined__gs_duration_error_s agg__combined__gs_relative_duration_error agg__combined__gs_absolute_duration_error_s agg__combined__gs_absolute_relative_duration_error agg__combined__gs_absolute_relative_duration_error_log agg__combined__detected_num_gs agg__combined__reference_num_gs agg__combined__num_gs_error agg__combined__num_gs_relative_error agg__combined__num_gs_absolute_error agg__combined__num_gs_absolute_relative_error agg__combined__num_gs_absolute_relative_error_log agg__combined__tp_samples agg__combined__fp_samples agg__combined__fn_samples agg__combined__precision 
agg__combined__recall agg__combined__f1_score agg__combined__tn_samples agg__combined__specificity agg__combined__accuracy agg__combined__npv
0 0.712674 [(HA, 001, TimeMeasure1, Test11, Trial1), (HA,... [40.44, 40.82, 65.52] [48.12, 42.08, 97.64] [7.68, 1.259999999999998, 32.120000000000005] [0.18991097922848665, 0.030867221950024448, 0.... [7.68, 1.259999999999998, 32.120000000000005] [0.18991097922848665, 0.030867221950024448, 0.... [0.17387849695420737, 0.03040041104775553, 0.3... [6, 4, 7] [6, 3, 6] [0, 1, 1] [0.0, 0.3333333333333333, 0.16666666666666666] [0, 1, 1] [0.0, 0.3333333333333333, 0.16666666666666666] [0.0, 0.28768207245178085, 0.15415067982725836] [2397, 2875, 6436] [2421, 1337, 3339] [1649, 1209, 117] [0.4975093399750934, 0.682573599240266, 0.6584... [0.592436974789916, 0.7039666993143977, 0.9821... [0.5408393501805054, 0.6931051108968177, 0.788... [7316, 10577, 12862] [0.7513607887439663, 0.8877790834312573, 0.793... [0.704708699122107, 0.8408551068883611, 0.8481... [0.8160624651422197, 0.8974206685898524, 0.990... [0.010609036999994714, 0.006414417000087269, 0... ... ... 48.926667 62.613333 13.686667 0.237003 13.686667 0.237003 0.20107 5.666667 5.0 0.666667 0.166667 0.666667 0.166667 0.147278 3902.666667 2365.666667 991.666667 0.612832 0.759516 0.674095 10251.666667 0.811014 0.797893 0.90149 0.007851 146.78 187.84 41.06 0.279738 41.06 0.279738 0.246656 17 15 2 0.133333 2 0.133333 0.125163 11708 7097 2975 0.6226 0.797385 0.699236 30755 0.812507 0.80828 0.9118


.. GENERATED FROM PYTHON SOURCE LINES 147-150

As you can see, this is a very messy dataframe with a lot of information.
To make this easier to digest, the evaluation object has methods for extracting the different groups of information.
The first group is the aggregated results, which contain a single value per metric for the entire dataset.

.. GENERATED FROM PYTHON SOURCE LINES 150-153

.. code-block:: Python

    agg_results = eval_challenge.get_aggregated_results_as_df()
    agg_results.T

.. raw:: html
fold 0
reference_gs_duration_s 48.926667
detected_gs_duration_s 62.613333
gs_duration_error_s 13.686667
gs_relative_duration_error 0.237003
gs_absolute_duration_error_s 13.686667
gs_absolute_relative_duration_error 0.237003
gs_absolute_relative_duration_error_log 0.201070
detected_num_gs 5.666667
reference_num_gs 5.000000
num_gs_error 0.666667
num_gs_relative_error 0.166667
num_gs_absolute_error 0.666667
num_gs_absolute_relative_error 0.166667
num_gs_absolute_relative_error_log 0.147278
tp_samples 3902.666667
fp_samples 2365.666667
fn_samples 991.666667
precision 0.612832
recall 0.759516
f1_score 0.674095
tn_samples 10251.666667
specificity 0.811014
accuracy 0.797893
npv 0.901490
runtime_s 0.007851
combined__reference_gs_duration_s 146.780000
combined__detected_gs_duration_s 187.840000
combined__gs_duration_error_s 41.060000
combined__gs_relative_duration_error 0.279738
combined__gs_absolute_duration_error_s 41.060000
combined__gs_absolute_relative_duration_error 0.279738
combined__gs_absolute_relative_duration_error_log 0.246656
combined__detected_num_gs 17.000000
combined__reference_num_gs 15.000000
combined__num_gs_error 2.000000
combined__num_gs_relative_error 0.133333
combined__num_gs_absolute_error 2.000000
combined__num_gs_absolute_relative_error 0.133333
combined__num_gs_absolute_relative_error_log 0.125163
combined__tp_samples 11708.000000
combined__fp_samples 7097.000000
combined__fn_samples 2975.000000
combined__precision 0.622600
combined__recall 0.797385
combined__f1_score 0.699236
combined__tn_samples 30755.000000
combined__specificity 0.812507
combined__accuracy 0.808280
combined__npv 0.911800


.. GENERATED FROM PYTHON SOURCE LINES 154-167

You might have noticed that many metrics appear twice, once with a `combined__` prefix and once without.
These represent two different things.
As the source code of the scorer above shows, the metric without prefix is calculated per datapoint and then averaged.
The metric with the prefix is calculated over the raw detected gait sequences of all datapoints combined.

Effectively, this is equivalent to different "weightings".
In the aggregated results without prefix, each recording has the same weight, independent of its length.
In the combined results, each individual IMU sample has the same weight:
regardless of the recording in which a sample was classified correctly or incorrectly, it has the same impact on the
combined metric.

Both approaches are valid, but you should be aware of the differences when comparing algorithms.
How you aggregate can have a big impact on the results.

.. GENERATED FROM PYTHON SOURCE LINES 167-178

.. code-block:: Python

    import pandas as pd

    combined_metrics = agg_results.filter(like="combined__").rename(
        columns=lambda x: x.replace("combined__", "")
    )
    combined_vs_per_datapoint = pd.concat(
        {
            "combined": combined_metrics,
            "per_datapoint": agg_results[combined_metrics.columns],
        },
        axis=0,
    )
    combined_vs_per_datapoint.reset_index(level=-1, drop=True).T

.. raw:: html
combined per_datapoint
reference_gs_duration_s 146.780000 48.926667
detected_gs_duration_s 187.840000 62.613333
gs_duration_error_s 41.060000 13.686667
gs_relative_duration_error 0.279738 0.237003
gs_absolute_duration_error_s 41.060000 13.686667
gs_absolute_relative_duration_error 0.279738 0.237003
gs_absolute_relative_duration_error_log 0.246656 0.201070
detected_num_gs 17.000000 5.666667
reference_num_gs 15.000000 5.000000
num_gs_error 2.000000 0.666667
num_gs_relative_error 0.133333 0.166667
num_gs_absolute_error 2.000000 0.666667
num_gs_absolute_relative_error 0.133333 0.166667
num_gs_absolute_relative_error_log 0.125163 0.147278
tp_samples 11708.000000 3902.666667
fp_samples 7097.000000 2365.666667
fn_samples 2975.000000 991.666667
precision 0.622600 0.612832
recall 0.797385 0.759516
f1_score 0.699236 0.674095
tn_samples 30755.000000 10251.666667
specificity 0.812507 0.811014
accuracy 0.808280 0.797893
npv 0.911800 0.901490


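The two columns can be reproduced by hand from the per-recording values.
The following sketch uses plain Python together with the reference and detected durations and GS counts of the three
recordings as reported in the outputs on this page.
The helper ``relative_error`` and the variable names are chosen for this illustration only.

```python
from statistics import mean

# Per-recording values (HA 001, HA 002, MS 001) as reported on this page.
reference_duration_s = [40.44, 40.82, 65.52]
detected_duration_s = [48.12, 42.08, 97.64]
reference_num_gs = [6, 3, 6]
detected_num_gs = [6, 4, 7]


def relative_error(detected, reference):
    """Signed error relative to the reference value."""
    return (detected - reference) / reference


# "per_datapoint": calculate the error per recording, then average.
per_datapoint_duration = mean(
    [relative_error(d, r) for d, r in zip(detected_duration_s, reference_duration_s)]
)
per_datapoint_num_gs = mean(
    [relative_error(d, r) for d, r in zip(detected_num_gs, reference_num_gs)]
)

# "combined": pool all recordings first, then calculate the error once.
combined_duration = relative_error(sum(detected_duration_s), sum(reference_duration_s))
combined_num_gs = relative_error(sum(detected_num_gs), sum(reference_num_gs))

print(f"gs_relative_duration_error: combined={combined_duration:.6f}, per_datapoint={per_datapoint_duration:.6f}")
print(f"num_gs_relative_error: combined={combined_num_gs:.6f}, per_datapoint={per_datapoint_num_gs:.6f}")
```

The recording with the largest duration error (MS 001) is also the longest one, so it pulls the combined duration
error above the per-datapoint average, while for the number of gait sequences the pooled error ends up lower.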
.. GENERATED FROM PYTHON SOURCE LINES 179-180

The "single" results represent the values per datapoint.

.. GENERATED FROM PYTHON SOURCE LINES 180-183

.. code-block:: Python

    single_results = eval_challenge.get_single_results_as_df()
    single_results.T

.. raw:: html
fold 0
cohort HA MS
participant_id 001 002 001
time_measure TimeMeasure1 TimeMeasure1 TimeMeasure1
test Test11 Test11 Test11
trial Trial1 Trial1 Trial1
reference_gs_duration_s 40.44 40.82 65.52
detected_gs_duration_s 48.12 42.08 97.64
gs_duration_error_s 7.68 1.26 32.12
gs_relative_duration_error 0.189911 0.030867 0.490232
gs_absolute_duration_error_s 7.68 1.26 32.12
gs_absolute_relative_duration_error 0.189911 0.030867 0.490232
gs_absolute_relative_duration_error_log 0.173878 0.0304 0.398932
detected_num_gs 6 4 7
reference_num_gs 6 3 6
num_gs_error 0 1 1
num_gs_relative_error 0.0 0.333333 0.166667
num_gs_absolute_error 0 1 1
num_gs_absolute_relative_error 0.0 0.333333 0.166667
num_gs_absolute_relative_error_log 0.0 0.287682 0.154151
tp_samples 2397 2875 6436
fp_samples 2421 1337 3339
fn_samples 1649 1209 117
precision 0.497509 0.682574 0.658414
recall 0.592437 0.703967 0.982146
f1_score 0.540839 0.693105 0.788339
tn_samples 7316 10577 12862
specificity 0.751361 0.887779 0.793902
accuracy 0.704709 0.840855 0.848115
npv 0.816062 0.897421 0.990985
runtime_s 0.010609 0.006414 0.00653


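The sample-wise metrics in this table follow from the four sample counts via the standard confusion-matrix
definitions.
As a quick sanity check, the following sketch recalculates them for the first recording (HA, 001).
``confusion_metrics`` is a helper written for this illustration and is not part of mobgap.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Standard confusion-matrix metrics calculated from sample counts."""
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "f1_score": 2 * tp / (2 * tp + fp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "npv": tn / (tn + fn),
    }


# Sample counts of the first recording, copied from the table above.
first_recording = confusion_metrics(tp=2397, fp=2421, fn=1649, tn=7316)
for name, value in first_recording.items():
    print(f"{name}: {value:.6f}")
```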
.. GENERATED FROM PYTHON SOURCE LINES 184-187

Finally, the scoring passed through a couple of "raw" results without calculating any error metrics on them.
These are available as a dictionary of raw results.

.. GENERATED FROM PYTHON SOURCE LINES 187-189

.. code-block:: Python

    raw_results = eval_challenge.get_raw_results()
    list(raw_results.keys())

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    ['detected', 'reference']

.. GENERATED FROM PYTHON SOURCE LINES 190-192

.. code-block:: Python

    raw_results["detected"]

.. raw:: html
start end
fold cohort participant_id time_measure test trial gs_id
0 HA 001 TimeMeasure1 Test11 Trial1 0 600 1201
1 4350 5251
2 7800 9001
3 9300 10201
4 10950 11551
5 13050 13651
002 TimeMeasure1 Test11 Trial1 0 450 1201
1 2700 3301
2 5700 7951
3 15000 15601
MS 001 TimeMeasure1 Test11 Trial1 0 900 2101
1 4650 6151
2 9600 10801
3 11250 12151
4 12300 14851
5 19950 21151
6 21300 22501


.. GENERATED FROM PYTHON SOURCE LINES 193-195

.. code-block:: Python

    raw_results["reference"]

.. raw:: html
start end
fold cohort participant_id time_measure test trial wb_id
0 HA 001 TimeMeasure1 Test11 Trial1 0 632 988
1 2864 3325
2 3853 5085
3 7641 8621
4 9451 9932
5 11989 12517
002 TimeMeasure1 Test11 Trial1 0 485 1131
1 1746 3554
2 6083 7708
MS 001 TimeMeasure1 Test11 Trial1 0 1019 1768
1 4534 5549
2 9665 10569
3 12337 14633
4 20151 20982
5 21378 22129


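The raw results make it possible to recalculate or extend metrics by hand.
As an illustration, the sketch below sums the durations of the detected and reference gait sequences of the first
recording (HA, 001) in plain Python.
It assumes the sampling rate of 100 Hz reported in the scorer output and counts both boundary samples of each
sequence.
The latter is an inference from the numbers on this page, under which the reported durations of 48.12 s and 40.44 s
are reproduced, and not a documented property of mobgap.

```python
SAMPLING_RATE_HZ = 100.0  # As reported in the per-datapoint scorer output

# (start, end) in samples for the first recording, copied from the tables above.
detected = [(600, 1201), (4350, 5251), (7800, 9001), (9300, 10201), (10950, 11551), (13050, 13651)]
reference = [(632, 988), (2864, 3325), (3853, 5085), (7641, 8621), (9451, 9932), (11989, 12517)]


def total_duration_s(gs_list, sampling_rate_hz):
    """Sum the duration of all gait sequences in seconds."""
    # Assumption: the start and the end sample both belong to the gait sequence.
    n_samples = sum(end - start + 1 for start, end in gs_list)
    return n_samples / sampling_rate_hz


detected_s = total_duration_s(detected, SAMPLING_RATE_HZ)
reference_s = total_duration_s(reference, SAMPLING_RATE_HZ)
print(detected_s, reference_s, round(detected_s - reference_s, 2))
```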
.. GENERATED FROM PYTHON SOURCE LINES 196-197

Further, some runtime information is available (i.e. when the challenge was started, and how long it took).

.. GENERATED FROM PYTHON SOURCE LINES 197-199

.. code-block:: Python

    eval_challenge.perf_["start_datetime"], eval_challenge.perf_["end_datetime"]

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    ('2025-07-11T14:37:37.242506+00:00', '2025-07-11T14:37:37.961028+00:00')

.. GENERATED FROM PYTHON SOURCE LINES 200-203

.. code-block:: Python

    eval_challenge.perf_["runtime_s"]

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    0.7184769630002847

.. GENERATED FROM PYTHON SOURCE LINES 204-225

Using :class:`~mobgap.utils.evaluation.Evaluation` works well if you are only comparing (or planning to compare)
non-ML algorithms, or algorithms that don't require further optimization (e.g. through GridSearch).
As soon as trainable or tunable algorithms might become part of the comparison, a separation of train and test data
is required.
Therefore, it is generally recommended to run a cross-validation with
:class:`~mobgap.utils.evaluation.EvaluationCV`.
This allows you to evaluate the performance of the algorithm on multiple folds of the dataset, and through the use of
:class:`~tpcp.optimize.DummyOptimize` you can also use algorithms without optimization in the same pipeline for
comparison.

Let's demonstrate the use of :class:`~mobgap.utils.evaluation.EvaluationCV` on the example dataset using the same
algorithm once with and once without GridSearch.

For the CV-based challenge, we need to set up a cross-validation.
As we only have 3 datapoints here, we will use a 3-fold cross-validation without grouping or stratification.
In a real-world scenario, you would use a more sophisticated cross-validation strategy.
You can learn more about cross-validation in the `tpcp example `_.

Further, to speed things up, we are going to use multi-processing.
We can configure this using the ``n_jobs`` parameter that we pass to the internal
:func:`~tpcp.validate.cross_validate` function via the ``cv_params`` parameter.

.. GENERATED FROM PYTHON SOURCE LINES 225-234

.. code-block:: Python

    from mobgap.utils.evaluation import EvaluationCV

    eval_challenge_cv = EvaluationCV(
        long_test,
        cv_iterator=3,
        scoring=gsd_evaluation_scorer,
        cv_params={"n_jobs": 2, "return_optimizer": True},
    )

.. GENERATED FROM PYTHON SOURCE LINES 235-237

To use our pipeline from above, we need to wrap it in a :class:`~tpcp.optimize.DummyOptimize` instance.
This will basically skip any optimization on the train set and just apply the pipeline to the test set.

.. GENERATED FROM PYTHON SOURCE LINES 237-243

.. code-block:: Python

    from tpcp.optimize import DummyOptimize

    eval_challenge_cv = eval_challenge_cv.run(
        DummyOptimize(pipe, ignore_potential_user_error_warning=True)
    )

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    CV Folds: 0%| | 0/3 [00:00
fold 0 1 2
reference_gs_duration_s 40.440000 40.820000 65.520000
detected_gs_duration_s 48.120000 42.080000 97.640000
gs_duration_error_s 7.680000 1.260000 32.120000
gs_relative_duration_error 0.189911 0.030867 0.490232
gs_absolute_duration_error_s 7.680000 1.260000 32.120000
gs_absolute_relative_duration_error 0.189911 0.030867 0.490232
gs_absolute_relative_duration_error_log 0.173878 0.030400 0.398932
detected_num_gs 6.000000 4.000000 7.000000
reference_num_gs 6.000000 3.000000 6.000000
num_gs_error 0.000000 1.000000 1.000000
num_gs_relative_error 0.000000 0.333333 0.166667
num_gs_absolute_error 0.000000 1.000000 1.000000
num_gs_absolute_relative_error 0.000000 0.333333 0.166667
num_gs_absolute_relative_error_log 0.000000 0.287682 0.154151
tp_samples 2397.000000 2875.000000 6436.000000
fp_samples 2421.000000 1337.000000 3339.000000
fn_samples 1649.000000 1209.000000 117.000000
precision 0.497509 0.682574 0.658414
recall 0.592437 0.703967 0.982146
f1_score 0.540839 0.693105 0.788339
tn_samples 7316.000000 10577.000000 12862.000000
specificity 0.751361 0.887779 0.793902
accuracy 0.704709 0.840855 0.848115
npv 0.816062 0.897421 0.990985
runtime_s 1.420231 1.352697 0.006638
combined__reference_gs_duration_s 40.440000 40.820000 65.520000
combined__detected_gs_duration_s 48.120000 42.080000 97.640000
combined__gs_duration_error_s 7.680000 1.260000 32.120000
combined__gs_relative_duration_error 0.189911 0.030867 0.490232
combined__gs_absolute_duration_error_s 7.680000 1.260000 32.120000
combined__gs_absolute_relative_duration_error 0.189911 0.030867 0.490232
combined__gs_absolute_relative_duration_error_log 0.173878 0.030400 0.398932
combined__detected_num_gs 6.000000 4.000000 7.000000
combined__reference_num_gs 6.000000 3.000000 6.000000
combined__num_gs_error 0.000000 1.000000 1.000000
combined__num_gs_relative_error 0.000000 0.333333 0.166667
combined__num_gs_absolute_error 0.000000 1.000000 1.000000
combined__num_gs_absolute_relative_error 0.000000 0.333333 0.166667
combined__num_gs_absolute_relative_error_log 0.000000 0.287682 0.154151
combined__tp_samples 2397.000000 2875.000000 6436.000000
combined__fp_samples 2421.000000 1337.000000 3339.000000
combined__fn_samples 1649.000000 1209.000000 117.000000
combined__precision 0.497509 0.682574 0.658414
combined__recall 0.592437 0.703967 0.982146
combined__f1_score 0.540839 0.693105 0.788339
combined__tn_samples 7316.000000 10577.000000 12862.000000
combined__specificity 0.751361 0.887779 0.793902
combined__accuracy 0.704709 0.840855 0.848115
combined__npv 0.816062 0.897421 0.990985


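In this table, the ``combined__`` values equal the per-datapoint values within each fold.
This follows from the setup: with 3 datapoints and 3 folds, each test fold contains a single recording, so averaging
and pooling coincide.
The sketch below shows such an unshuffled k-fold split in plain Python.
``k_fold_indices`` is a hypothetical helper for illustration and not the splitter used internally.

```python
def k_fold_indices(n_datapoints, n_splits):
    """Yield (train, test) index lists for consecutive, unshuffled folds."""
    indices = list(range(n_datapoints))
    # The first `n_datapoints % n_splits` folds get one extra datapoint.
    fold_sizes = [
        n_datapoints // n_splits + (1 if i < n_datapoints % n_splits else 0)
        for i in range(n_splits)
    ]
    start = 0
    for size in fold_sizes:
        test = indices[start : start + size]
        train = indices[:start] + indices[start + size :]
        yield train, test
        start += size


folds = list(k_fold_indices(n_datapoints=3, n_splits=3))
for fold, (train, test) in enumerate(folds):
    print(f"fold {fold}: train={train}, test={test}")
```

With larger datasets, each test fold holds several recordings and the two aggregation variants will generally differ
again within a fold.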
.. GENERATED FROM PYTHON SOURCE LINES 259-262

The single results contain the CV fold as an additional index.
Otherwise, the output is identical to before.
Note that if you use anything other than a KFold splitter, some datapoints might appear in multiple folds.

.. GENERATED FROM PYTHON SOURCE LINES 262-265

.. code-block:: Python

    single_results_cv = eval_challenge_cv.get_single_results_as_df()
    single_results_cv

.. raw:: html
reference_gs_duration_s detected_gs_duration_s gs_duration_error_s gs_relative_duration_error gs_absolute_duration_error_s gs_absolute_relative_duration_error gs_absolute_relative_duration_error_log detected_num_gs reference_num_gs num_gs_error num_gs_relative_error num_gs_absolute_error num_gs_absolute_relative_error num_gs_absolute_relative_error_log tp_samples fp_samples fn_samples precision recall f1_score tn_samples specificity accuracy npv runtime_s
fold cohort participant_id time_measure test trial
0 HA 001 TimeMeasure1 Test11 Trial1 40.44 48.12 7.68 0.189911 7.68 0.189911 0.173878 6 6 0 0.0 0 0.0 0.0 2397 2421 1649 0.497509 0.592437 0.540839 7316 0.751361 0.704709 0.816062 1.420231
1 HA 002 TimeMeasure1 Test11 Trial1 40.82 42.08 1.26 0.030867 1.26 0.030867 0.0304 4 3 1 0.333333 1 0.333333 0.287682 2875 1337 1209 0.682574 0.703967 0.693105 10577 0.887779 0.840855 0.897421 1.352697
2 MS 001 TimeMeasure1 Test11 Trial1 65.52 97.64 32.12 0.490232 32.12 0.490232 0.398932 7 6 1 0.166667 1 0.166667 0.154151 6436 3339 117 0.658414 0.982146 0.788339 12862 0.793902 0.848115 0.990985 0.006638
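
The duration error columns in this table appear to follow the usual definitions of error, relative error and their absolute values. The ``_log`` column is consistent with ``log(1 + absolute relative error)``. This relationship is inferred from the numbers shown here, not taken from the mobgap documentation, and the helper ``duration_errors`` is hypothetical.

```python
import math


def duration_errors(reference_s, detected_s):
    """Recompute the duration error columns from the two total durations."""
    error = detected_s - reference_s
    relative_error = error / reference_s
    return {
        "gs_duration_error_s": error,
        "gs_relative_duration_error": relative_error,
        "gs_absolute_duration_error_s": abs(error),
        "gs_absolute_relative_duration_error": abs(relative_error),
        # Assumption: the "_log" variant is log(1 + |relative error|).
        "gs_absolute_relative_duration_error_log": math.log1p(abs(relative_error)),
    }


# Reference and detected durations of the first row (fold 0) in the table above.
row = duration_errors(reference_s=40.44, detected_s=48.12)
for name, value in row.items():
    print(f"{name}: {value:.6f}")
```

For the first row, this reproduces the tabulated values up to rounding. A negative ``gs_duration_error_s`` would mean that less gait was detected than the reference contains.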


.. GENERATED FROM PYTHON SOURCE LINES 266-267 And the raw outputs: .. GENERATED FROM PYTHON SOURCE LINES 267-270 .. code-block:: Python raw_results_cv = eval_challenge_cv.get_raw_results() raw_results_cv["detected"] .. raw:: html
start end
fold cohort participant_id time_measure test trial gs_id
0 HA 001 TimeMeasure1 Test11 Trial1 0 600 1201
1 4350 5251
2 7800 9001
3 9300 10201
4 10950 11551
5 13050 13651
1 HA 002 TimeMeasure1 Test11 Trial1 0 450 1201
1 2700 3301
2 5700 7951
3 15000 15601
2 MS 001 TimeMeasure1 Test11 Trial1 0 900 2101
1 4650 6151
2 9600 10801
3 11250 12151
4 12300 14851
5 19950 21151
6 21300 22501
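
The raw ``start`` and ``end`` values are sample indices, so the total detected duration can be recovered from them. The sketch below rests on two assumptions that are not stated in this section: a sampling rate of 100 Hz and an inclusive ``end`` index. The function ``total_duration_s`` is hypothetical.

```python
# Detected gait sequences of fold 0 as (start, end) pairs, copied from the table above.
detected_fold0 = [
    (600, 1201),
    (4350, 5251),
    (7800, 9001),
    (9300, 10201),
    (10950, 11551),
    (13050, 13651),
]

# Assumption: the example data is sampled at 100 Hz.
SAMPLING_RATE_HZ = 100


def total_duration_s(intervals, sampling_rate_hz):
    """Sum the interval lengths, counting the end sample as part of the interval."""
    n_samples = sum(end - start + 1 for start, end in intervals)
    return n_samples / sampling_rate_hz


total = total_duration_s(detected_fold0, SAMPLING_RATE_HZ)
print(total)
```

Under these assumptions, the result matches the ``detected_gs_duration_s`` value of 48.12 s reported for fold 0.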


.. GENERATED FROM PYTHON SOURCE LINES 271-278 If we compare these results to the ones from the non-CV challenge, we can see that the "single" results are identical; they were merely computed in separate folds. This is expected, as we used :class:`~tpcp.optimize.DummyOptimize` and thus didn't optimize the algorithm. Let's try a :class:`~tpcp.optimize.GridSearch` on the algorithm to see how the results change. For the grid search, we will reuse the same scoring function as before, but we need to specify which scoring result we want to optimize for. .. GENERATED FROM PYTHON SOURCE LINES 278-286 .. code-block:: Python from sklearn.model_selection import ParameterGrid from tpcp.optimize import GridSearch para_grid = ParameterGrid({"algo__window_length_s": [2, 3, 4]}) optimizer = GridSearch( pipe, para_grid, scoring=gsd_evaluation_scorer, return_optimized="precision" ) .. GENERATED FROM PYTHON SOURCE LINES 287-290 The optimizer can now be used in the same CV challenge as before. This way, we can guarantee that the same folds are used for the optimization and the evaluation, which ensures the best possible comparison between the algorithm versions. .. GENERATED FROM PYTHON SOURCE LINES 290-292 .. code-block:: Python eval_challenge_gs = eval_challenge_cv.clone().run(optimizer) .. rst-class:: sphx-glr-script-out .. code-block:: none CV Folds: 0%| | 0/3 [00:00
fold 0 1 2
reference_gs_duration_s 40.440000 40.820000 65.520000
detected_gs_duration_s 48.120000 39.080000 97.160000
gs_duration_error_s 7.680000 -1.740000 31.640000
gs_relative_duration_error 0.189911 -0.042626 0.482906
gs_absolute_duration_error_s 7.680000 1.740000 31.640000
gs_absolute_relative_duration_error 0.189911 0.042626 0.482906
gs_absolute_relative_duration_error_log 0.173878 0.041743 0.394004
detected_num_gs 6.000000 4.000000 8.000000
reference_num_gs 6.000000 3.000000 6.000000
num_gs_error 0.000000 1.000000 2.000000
num_gs_relative_error 0.000000 0.333333 0.333333
num_gs_absolute_error 0.000000 1.000000 2.000000
num_gs_absolute_relative_error 0.000000 0.333333 0.333333
num_gs_absolute_relative_error_log 0.000000 0.287682 0.287682
tp_samples 2397.000000 2545.000000 6153.000000
fp_samples 2421.000000 1366.000000 3573.000000
fn_samples 1649.000000 1540.000000 403.000000
precision 0.497509 0.650729 0.632634
recall 0.592437 0.623011 0.938530
f1_score 0.540839 0.636568 0.755804
tn_samples 7316.000000 10547.000000 12627.000000
specificity 0.751361 0.885335 0.779444
accuracy 0.704709 0.818352 0.825277
npv 0.816062 0.872590 0.969071
runtime_s 0.014230 0.006469 0.006676
combined__reference_gs_duration_s 40.440000 40.820000 65.520000
combined__detected_gs_duration_s 48.120000 39.080000 97.160000
combined__gs_duration_error_s 7.680000 -1.740000 31.640000
combined__gs_relative_duration_error 0.189911 -0.042626 0.482906
combined__gs_absolute_duration_error_s 7.680000 1.740000 31.640000
combined__gs_absolute_relative_duration_error 0.189911 0.042626 0.482906
combined__gs_absolute_relative_duration_error_log 0.173878 0.041743 0.394004
combined__detected_num_gs 6.000000 4.000000 8.000000
combined__reference_num_gs 6.000000 3.000000 6.000000
combined__num_gs_error 0.000000 1.000000 2.000000
combined__num_gs_relative_error 0.000000 0.333333 0.333333
combined__num_gs_absolute_error 0.000000 1.000000 2.000000
combined__num_gs_absolute_relative_error 0.000000 0.333333 0.333333
combined__num_gs_absolute_relative_error_log 0.000000 0.287682 0.287682
combined__tp_samples 2397.000000 2545.000000 6153.000000
combined__fp_samples 2421.000000 1366.000000 3573.000000
combined__fn_samples 1649.000000 1540.000000 403.000000
combined__precision 0.497509 0.650729 0.632634
combined__recall 0.592437 0.623011 0.938530
combined__f1_score 0.540839 0.636568 0.755804
combined__tn_samples 7316.000000 10547.000000 12627.000000
combined__specificity 0.751361 0.885335 0.779444
combined__accuracy 0.704709 0.818352 0.825277
combined__npv 0.816062 0.872590 0.969071


.. GENERATED FROM PYTHON SOURCE LINES 298-301 Because we used ``cv_params={"return_optimizer": True}``, we can also access the optimizer per fold directly from the ``results_`` attribute. This can be useful to get more insight into the optimization process and to see what the optimal parameters were. .. GENERATED FROM PYTHON SOURCE LINES 301-304 .. code-block:: Python opt_results = pd.Series(eval_challenge_gs.results_["optimizer"]) opt_results .. rst-class:: sphx-glr-script-out .. code-block:: none 0 GridSearch(n_jobs=None, parameter_grid=
algo__window_length_s
0 3
1 2
2 2
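
The table shows that different folds can select different window lengths, because each fold optimizes on a different train set. Conceptually, a grid search scores every candidate on the train data and keeps the one with the best value of the chosen metric (here ``precision``). The following is a plain-Python sketch of that selection step only. It is not tpcp's implementation, and the scores are made up for illustration.

```python
from statistics import mean

# Hypothetical per-datapoint precision values on one train set, keyed by the
# candidate value of ``window_length_s``. These are NOT results from mobgap.
train_scores = {
    2: [0.68, 0.66],
    3: [0.70, 0.61],
    4: [0.55, 0.60],
}


def select_best(scores_per_candidate):
    """Return the candidate with the highest mean score and all mean scores."""
    mean_scores = {
        candidate: mean(values)
        for candidate, values in scores_per_candidate.items()
    }
    best_candidate = max(mean_scores, key=mean_scores.get)
    return best_candidate, mean_scores


best_window, mean_scores = select_best(train_scores)
print(best_window, mean_scores)
```

Because the selection only sees the train data of a fold, the chosen parameter is not guaranteed to perform best on the held-out test data of that fold.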


.. GENERATED FROM PYTHON SOURCE LINES 310-312 Or we can go much deeper, by getting all information about the optimization process. Let's just look at the keys of the information that is available. .. GENERATED FROM PYTHON SOURCE LINES 312-315 .. code-block:: Python all_opti_results_fold0 = pd.DataFrame(opt_results.loc[0].gs_results_) all_opti_results_fold0.columns.to_list() .. rst-class:: sphx-glr-script-out .. code-block:: none ['agg__reference_gs_duration_s', 'rank__agg__reference_gs_duration_s', 'agg__detected_gs_duration_s', 'rank__agg__detected_gs_duration_s', 'agg__gs_duration_error_s', 'rank__agg__gs_duration_error_s', 'agg__gs_relative_duration_error', 'rank__agg__gs_relative_duration_error', 'agg__gs_absolute_duration_error_s', 'rank__agg__gs_absolute_duration_error_s', 'agg__gs_absolute_relative_duration_error', 'rank__agg__gs_absolute_relative_duration_error', 'agg__gs_absolute_relative_duration_error_log', 'rank__agg__gs_absolute_relative_duration_error_log', 'agg__detected_num_gs', 'rank__agg__detected_num_gs', 'agg__reference_num_gs', 'rank__agg__reference_num_gs', 'agg__num_gs_error', 'rank__agg__num_gs_error', 'agg__num_gs_relative_error', 'rank__agg__num_gs_relative_error', 'agg__num_gs_absolute_error', 'rank__agg__num_gs_absolute_error', 'agg__num_gs_absolute_relative_error', 'rank__agg__num_gs_absolute_relative_error', 'agg__num_gs_absolute_relative_error_log', 'rank__agg__num_gs_absolute_relative_error_log', 'agg__tp_samples', 'rank__agg__tp_samples', 'agg__fp_samples', 'rank__agg__fp_samples', 'agg__fn_samples', 'rank__agg__fn_samples', 'agg__precision', 'rank__agg__precision', 'agg__recall', 'rank__agg__recall', 'agg__f1_score', 'rank__agg__f1_score', 'agg__tn_samples', 'rank__agg__tn_samples', 'agg__specificity', 'rank__agg__specificity', 'agg__accuracy', 'rank__agg__accuracy', 'agg__npv', 'rank__agg__npv', 'agg__runtime_s', 'rank__agg__runtime_s', 'agg__combined__reference_gs_duration_s', 'rank__agg__combined__reference_gs_duration_s', 
'agg__combined__detected_gs_duration_s', 'rank__agg__combined__detected_gs_duration_s', 'agg__combined__gs_duration_error_s', 'rank__agg__combined__gs_duration_error_s', 'agg__combined__gs_relative_duration_error', 'rank__agg__combined__gs_relative_duration_error', 'agg__combined__gs_absolute_duration_error_s', 'rank__agg__combined__gs_absolute_duration_error_s', 'agg__combined__gs_absolute_relative_duration_error', 'rank__agg__combined__gs_absolute_relative_duration_error', 'agg__combined__gs_absolute_relative_duration_error_log', 'rank__agg__combined__gs_absolute_relative_duration_error_log', 'agg__combined__detected_num_gs', 'rank__agg__combined__detected_num_gs', 'agg__combined__reference_num_gs', 'rank__agg__combined__reference_num_gs', 'agg__combined__num_gs_error', 'rank__agg__combined__num_gs_error', 'agg__combined__num_gs_relative_error', 'rank__agg__combined__num_gs_relative_error', 'agg__combined__num_gs_absolute_error', 'rank__agg__combined__num_gs_absolute_error', 'agg__combined__num_gs_absolute_relative_error', 'rank__agg__combined__num_gs_absolute_relative_error', 'agg__combined__num_gs_absolute_relative_error_log', 'rank__agg__combined__num_gs_absolute_relative_error_log', 'agg__combined__tp_samples', 'rank__agg__combined__tp_samples', 'agg__combined__fp_samples', 'rank__agg__combined__fp_samples', 'agg__combined__fn_samples', 'rank__agg__combined__fn_samples', 'agg__combined__precision', 'rank__agg__combined__precision', 'agg__combined__recall', 'rank__agg__combined__recall', 'agg__combined__f1_score', 'rank__agg__combined__f1_score', 'agg__combined__tn_samples', 'rank__agg__combined__tn_samples', 'agg__combined__specificity', 'rank__agg__combined__specificity', 'agg__combined__accuracy', 'rank__agg__combined__accuracy', 'agg__combined__npv', 'rank__agg__combined__npv', 'single__reference_gs_duration_s', 'single__detected_gs_duration_s', 'single__gs_duration_error_s', 'single__gs_relative_duration_error', 'single__gs_absolute_duration_error_s', 
'single__gs_absolute_relative_duration_error', 'single__gs_absolute_relative_duration_error_log', 'single__detected_num_gs', 'single__reference_num_gs', 'single__num_gs_error', 'single__num_gs_relative_error', 'single__num_gs_absolute_error', 'single__num_gs_absolute_relative_error', 'single__num_gs_absolute_relative_error_log', 'single__tp_samples', 'single__fp_samples', 'single__fn_samples', 'single__precision', 'single__recall', 'single__f1_score', 'single__tn_samples', 'single__specificity', 'single__accuracy', 'single__npv', 'single__runtime_s', 'single__raw__detected', 'single__raw__reference', 'data_labels', 'debug__score_time', 'param__algo__window_length_s', 'params'] .. GENERATED FROM PYTHON SOURCE LINES 316-320 With that, we hope it has become clear how valuable these challenges can be when benchmarking algorithms across datasets. To see how we evaluate the performance of the algorithms available in mobgap, check out the other GSD evaluation examples. .. rst-class:: sphx-glr-timing **Total running time of the script:** (0 minutes 21.736 seconds) **Estimated memory usage:** 80 MB .. _sphx_glr_download_auto_examples_gait_sequences__04_gsd_evaluation_challenges.py: .. only:: html .. container:: sphx-glr-footer sphx-glr-footer-example .. container:: sphx-glr-download sphx-glr-download-jupyter :download:`Download Jupyter notebook: _04_gsd_evaluation_challenges.ipynb <_04_gsd_evaluation_challenges.ipynb>` .. container:: sphx-glr-download sphx-glr-download-python :download:`Download Python source code: _04_gsd_evaluation_challenges.py <_04_gsd_evaluation_challenges.py>` .. container:: sphx-glr-download sphx-glr-download-zip :download:`Download zipped: _04_gsd_evaluation_challenges.zip <_04_gsd_evaluation_challenges.zip>` .. only:: html .. rst-class:: sphx-glr-signature `Gallery generated by Sphinx-Gallery `_