GSD Evaluation Challenges#

The GSD Evaluation example demonstrates how to evaluate the performance of a GSD algorithm on a single datapoint

and explains the individual performance metrics that are calculated.

With that you could set up a custom evaluation pipeline to run and then score the output of a GSD algorithm multiple datapoints and then aggregate the results. To make this process easier, we set up opinionated evaluation challenges that can be used to quickly perform the same evaluation with multiple algorithms and datasets.

Below, we will show how to use them on the example dataset.

Dataset#

To use the challenges, we need to dataset with reference information in the expected format. We will use the LabExampleDataset for this purpose.

from mobgap.data import LabExampleDataset

long_test = LabExampleDataset(reference_system="INDIP").get_subset(
    test="Test11"
)

Algorithm#

Next we need to create an instance of a valid GSD algorithm.

from mobgap.gait_sequences import GsdIluz

algo = GsdIluz()

This algorithm needs to be wrapped in a GsdEmulationPipeline to be used in the challenges. This pipeline takes care of extracting the correct data from the dataset and running the algorithm on it.

from mobgap.gait_sequences.pipeline import GsdEmulationPipeline

pipe = GsdEmulationPipeline(algo)

Let’s demonstrate that quickly on a single datapoint.

start end
gs_id
0 600 1201
1 2700 4201
2 4350 5251
3 7800 8851
4 9450 10201
5 10950 11551
6 13050 13651


Evaluation Challenge#

This pipeline can now be used as part of an evaluation challenge. An evaluation challenge takes care of two things:

  • Running the pipeline on multiple datapoints

  • Scoring the results per datapoint and then aggregating the results

We provide two challenges:

  • GsdEvaluation: This challenge simply runs the pipeline on all datapoints and then scores the results.

  • GsdEvaluationCV: This challenge runs a cross-validation on the dataset and then scores the results per fold.

Before we run the entire pipeline, let’s look at the scoring. We provide a default scoring function that calculates all the relevant performance metrics.

Let’s look at the code of it first.

from inspect import getsource

from mobgap.gait_sequences.evaluation import gsd_evaluation_scorer

print(getsource(gsd_evaluation_scorer))
def gsd_evaluation_scorer(pipeline: GsdEmulationPipeline, datapoint: BaseGaitDatasetWithReference) -> dict:
    """Evaluate the performance of a GSD algorithm on a single datapoint.

    This function is used to evaluate the performance of a GSD algorithm on a single datapoint.
    It calculates the performance metrics based on the detected gait sequences and the reference gait sequences.

    This is the default scoring functions for the GSD evaluation pipelines (``GsdEvaluation`` and ``GsdEvaluationCV``).

    Parameters
    ----------
    pipeline
        An instance of GSD emulation pipeline that wraps the algorithm that should be evaluated.
    datapoint
        The datapoint to be evaluated.

    Returns
    -------
    dict
        A dictionary containing the performance metrics.
        Note, that some results are wrapped in a ``NoAgg`` object or other aggregators.
        The results of this function are not expected to be parsed manually, but rather the function is expected to be
        used in the context of the :func:`~tpcp.validate.validate`/:func:`~tpcp.validate.cross_validate` functions or
        similar as scorer.
        This functions will aggregate the results and provide a summary of the performance metrics.

    """
    from mobgap.gait_sequences.evaluation import (
        calculate_matched_gsd_performance_metrics,
        calculate_unmatched_gsd_performance_metrics,
        categorize_intervals_per_sample,
    )

    # Run the algorithm on the datapoint
    detected_gs_list = pipeline.safe_run(datapoint).gs_list_
    reference_gs_list = datapoint.reference_parameters_.wb_list[["start", "end"]]
    n_overall_samples = len(datapoint.data_ss)
    sampling_rate_hz = datapoint.sampling_rate_hz

    matches = categorize_intervals_per_sample(
        gsd_list_detected=detected_gs_list, gsd_list_reference=reference_gs_list, n_overall_samples=n_overall_samples
    )

    # Calculate the performance metrics
    performance_metrics = {
        **calculate_unmatched_gsd_performance_metrics(
            gsd_list_detected=detected_gs_list,
            gsd_list_reference=reference_gs_list,
            sampling_rate_hz=sampling_rate_hz,
        ),
        **calculate_matched_gsd_performance_metrics(matches),
        "detected": NoAgg(detected_gs_list),
        "reference": NoAgg(reference_gs_list),
    }

    return performance_metrics

We can see that this method is relatively simple, using the gsd evaluation functions that we provide. So if you want to run your own scoring function, it should be straightforward to do so.

Note, the NoAgg wrapping some of the return values. This is a special aggregator that tells the challenge to not try to aggregate the respective values. For all other values, the challenge will try average the values across all datapoints.

To learn more about these special aggregators, check out the tpcp example.

The scoring function takes care of running the pipeline. So we can test the scorer, by just providing it with a pipeline and a datapoint.

Note, that we remove the NoAgg parameters from the results, as they don’t visualize well.

{'accuracy': 0.7349292709466811,
 'detected_gs_duration_s': 60.14,
 'detected_num_gs': 7,
 'f1_score': 0.6371400198609732,
 'fn_samples': 839,
 'fp_samples': 2815,
 'gs_absolute_duration_error_s': 19.700000000000003,
 'gs_absolute_relative_duration_error': 0.487141444114738,
 'gs_absolute_relative_duration_error_log': 0.3968557834081124,
 'gs_duration_error_s': 19.700000000000003,
 'gs_relative_duration_error': 0.487141444114738,
 'npv': 0.8919093017263592,
 'num_gs_absolute_error': 1,
 'num_gs_absolute_relative_error': 0.16666666666666666,
 'num_gs_absolute_relative_error_log': 0.15415067982725836,
 'num_gs_error': 1,
 'num_gs_relative_error': 0.16666666666666666,
 'precision': 0.5326249377386685,
 'recall': 0.7926859402026192,
 'reference_gs_duration_s': 40.44,
 'reference_num_gs': 6,
 'specificity': 0.7109262682275621,
 'tn_samples': 6923,
 'tp_samples': 3208}

The challenge will call this scoring method for each datapoint in the dataset. Let’s test this with the GsdEvaluation challenge.

from mobgap.gait_sequences.evaluation import GsdEvaluation

eval_challenge = GsdEvaluation(long_test, scoring=gsd_evaluation_scorer)

We can now run the challenge.

eval_challenge = eval_challenge.run(pipe)
Datapoints:   0%|          | 0/3 [00:00<?, ?it/s]
Datapoints:  33%|███▎      | 1/3 [00:00<00:00,  3.79it/s]
Datapoints:  67%|██████▋   | 2/3 [00:00<00:00,  3.64it/s]/home/docs/checkouts/readthedocs.org/user_builds/mobgap/checkouts/v0.9.0/mobgap/data/_mobilised_matlab_loader.py:1082: UserWarning: There were multiple ICs with the same index value, but different LR labels. This is likely an issue with the reference system you should further investigate. For now, we set the `lr_label` of the stride corresponding to this IC to Nan. However, both values still remain in the IC list.
  return parse_reference_parameters(

Datapoints: 100%|██████████| 3/3 [00:00<00:00,  3.48it/s]
Datapoints: 100%|██████████| 3/3 [00:00<00:00,  3.53it/s]

The results are stored in the results_ attribute and contain the aggregated and the raw results per datapoint. To learn more about the results, check the validate documentation.

The aggregated results across all datapoints are available in all columns not starting with agg__

0
agg__reference_gs_duration_s 48.926667
agg__detected_gs_duration_s 58.620000
agg__gs_duration_error_s 9.693333
agg__gs_relative_duration_error 0.237191
agg__gs_absolute_duration_error_s 9.693333
agg__gs_absolute_relative_duration_error 0.237191
agg__gs_absolute_relative_duration_error_log 0.200297
agg__detected_num_gs 6.000000
agg__reference_num_gs 5.000000
agg__num_gs_error 1.000000
agg__num_gs_relative_error 0.333333
agg__num_gs_absolute_error 1.666667
agg__num_gs_absolute_relative_error 0.444444
agg__num_gs_absolute_relative_error_log 0.333816
agg__tp_samples 3730.333333
agg__fp_samples 2138.666667
agg__fn_samples 1166.000000
agg__precision 0.632099
agg__recall 0.766355
agg__f1_score 0.688517
agg__tn_samples 10477.333333
agg__specificity 0.815958
agg__accuracy 0.802683
agg__npv 0.899915


The raw results are stored in the columns starting with single__.

And it is often helpful to explode them to get a better overview.

exploded_results = (
    single_results.explode(single_results.columns.to_list())
    .rename_axis("fold")
    .set_index(
        pd.MultiIndex.from_tuples(
            (dl := validate_results["data_labels"].explode().to_list()),
            names=list(dl[0]._fields),
        ),
        append=True,
    )
)
exploded_results.columns = exploded_results.columns.str.removeprefix("single__")
exploded_results.T
fold 0
cohort HA MS
participant_id 001 002 001
time_measure TimeMeasure1 TimeMeasure1 TimeMeasure1
test Test11 Test11 Test11
trial Trial1 Trial1 Trial1
reference_gs_duration_s 40.44 40.82 65.52
detected_gs_duration_s 60.14 49.62 66.1
gs_duration_error_s 19.7 8.8 0.58
gs_relative_duration_error 0.487141 0.215581 0.008852
gs_absolute_duration_error_s 19.7 8.8 0.58
gs_absolute_relative_duration_error 0.487141 0.215581 0.008852
gs_absolute_relative_duration_error_log 0.396856 0.195222 0.008813
detected_num_gs 7 6 5
reference_num_gs 6 3 6
num_gs_error 1 3 -1
num_gs_relative_error 0.166667 1.0 -0.166667
num_gs_absolute_error 1 3 1
num_gs_absolute_relative_error 0.166667 1.0 0.166667
num_gs_absolute_relative_error_log 0.154151 0.693147 0.154151
tp_samples 3208 3132 4851
fp_samples 2815 1835 1766
fn_samples 839 955 1704
precision 0.532625 0.630562 0.733112
recall 0.792686 0.766332 0.740046
f1_score 0.63714 0.691849 0.736562
tn_samples 6923 10080 14429
specificity 0.710926 0.845992 0.890954
accuracy 0.734929 0.825647 0.847473
npv 0.891909 0.913457 0.894378
detected start end gs_id 0 ... start end gs_id 0 ... start end gs_id 0 ...
reference start end wb_id 0 ... start end wb_id 0 ... start end wb_id 0 ...


The detected and reference columns in this dataframe contain the raw un-aggregated gait-sequences. So if we want to perform further evaluation on them (e.g. visualize them), we can use them.

raw_gs_list = pd.concat(
    exploded_results.loc[:, ["detected", "reference"]].stack().to_dict(),
    names=[*exploded_results.index.names, "system"],
).unstack("system")
raw_gs_list
start end
system detected reference detected reference
fold cohort participant_id time_measure test trial
0 HA 001 TimeMeasure1 Test11 Trial1 0 600.0 632.0 1201.0 988.0
1 2700.0 2864.0 4201.0 3325.0
2 4350.0 3853.0 5251.0 5085.0
3 7800.0 7641.0 8851.0 8621.0
4 9450.0 9451.0 10201.0 9932.0
5 10950.0 11989.0 11551.0 12517.0
6 13050.0 NaN 13651.0 NaN
002 TimeMeasure1 Test11 Trial1 0 450.0 485.0 1201.0 1131.0
1 2400.0 1746.0 3301.0 3554.0
2 3450.0 6083.0 4051.0 7708.0
3 5700.0 NaN 7201.0 NaN
4 7350.0 NaN 7951.0 NaN
5 15000.0 NaN 15601.0 NaN
MS 001 TimeMeasure1 Test11 Trial1 0 750.0 1019.0 1651.0 1768.0
1 4650.0 4534.0 6151.0 5549.0
2 12900.0 9665.0 14851.0 10569.0
3 20100.0 12337.0 21151.0 14633.0
4 21300.0 20151.0 22501.0 20982.0
5 NaN 21378.0 NaN 22129.0


Further there are some runtime information available (i.e. when the challenge was started, and how long it took).

('2024-10-21T12:43:10.783453+00:00', '2024-10-21T12:43:11.696589+00:00')
0.9129904770015855

Using GsdEvaluation is great, if you are only comparing (or planning to compare) non-ML algorithms, or algorithms that don’t require further optimization (e.g. through GridSearch).

Therefore, it is generally recommended to run a cross-validation with GsdEvaluationCV. This allows you to evaluate the performance of the algorithm on multiple folds of the dataset and through the use of DummyOptimize you can also use algorithms without optimization in the same pipeline for comparison.

Let’s demonstrate the use of GsdEvaluationCV on the example dataset using the same algorithm once with and once without GridSearch.

For the CV-based challenge, we need to set up a cross-validation. As we only have 3 datapoints here, we will use a 3-fold cross-validation without grouping or stratification. In a real-world scenario, you would use a more sophisticated cross-validation strategy. You can learn more about cross-validation in the tpcp example.

Further, to speed things up, we are going to use multi-processing. We can configure this using the n_jobs parameter that we pass to the internal cross_validate function via the cv_params parameters

from mobgap.gait_sequences.evaluation import GsdEvaluationCV

eval_challenge_cv = GsdEvaluationCV(
    long_test,
    cv_iterator=3,
    scoring=gsd_evaluation_scorer,
    cv_params={"n_jobs": 2, "return_optimizer": True},
)

To use our pipeline from above, we need to wrap it in a DummyOptimize instance. This will basically skip any optimization on the train set and just apply the pipeline to the test set.

from tpcp.optimize import DummyOptimize

eval_challenge_cv = eval_challenge_cv.run(
    DummyOptimize(pipe, ignore_potential_user_error_warning=True)
)
CV Folds:   0%|          | 0/3 [00:00<?, ?it/s]
CV Folds:  33%|███▎      | 1/3 [00:05<00:10,  5.40s/it]
CV Folds: 100%|██████████| 3/3 [00:05<00:00,  1.54s/it]
CV Folds: 100%|██████████| 3/3 [00:05<00:00,  1.92s/it]

The results now are a little bit more complex, as they contain the results for each fold. In addition, we have information for the train and the test set. The test set results, are what we are usually looking for. The train set results, are only calculated when providing the return_train_score parameter to the cv_params.

As before our main results are the aggregated results.

0 1 2
test__agg__reference_gs_duration_s 40.440000 40.820000 65.520000
test__agg__detected_gs_duration_s 60.140000 49.620000 66.100000
test__agg__gs_duration_error_s 19.700000 8.800000 0.580000
test__agg__gs_relative_duration_error 0.487141 0.215581 0.008852
test__agg__gs_absolute_duration_error_s 19.700000 8.800000 0.580000
test__agg__gs_absolute_relative_duration_error 0.487141 0.215581 0.008852
test__agg__gs_absolute_relative_duration_error_log 0.396856 0.195222 0.008813
test__agg__detected_num_gs 7.000000 6.000000 5.000000
test__agg__reference_num_gs 6.000000 3.000000 6.000000
test__agg__num_gs_error 1.000000 3.000000 -1.000000
test__agg__num_gs_relative_error 0.166667 1.000000 -0.166667
test__agg__num_gs_absolute_error 1.000000 3.000000 1.000000
test__agg__num_gs_absolute_relative_error 0.166667 1.000000 0.166667
test__agg__num_gs_absolute_relative_error_log 0.154151 0.693147 0.154151
test__agg__tp_samples 3208.000000 3132.000000 4851.000000
test__agg__fp_samples 2815.000000 1835.000000 1766.000000
test__agg__fn_samples 839.000000 955.000000 1704.000000
test__agg__precision 0.532625 0.630562 0.733112
test__agg__recall 0.792686 0.766332 0.740046
test__agg__f1_score 0.637140 0.691849 0.736562
test__agg__tn_samples 6923.000000 10080.000000 14429.000000
test__agg__specificity 0.710926 0.845992 0.890954
test__agg__accuracy 0.734929 0.825647 0.847473
test__agg__npv 0.891909 0.913457 0.894378


The raw results are stored in the columns starting with single__.

single_results = cv_results.filter(like="test__single__")
exploded_results_cv = (
    single_results.explode(single_results.columns.to_list())
    .rename_axis("fold")
    .set_index(
        pd.MultiIndex.from_tuples(
            (dl := cv_results["test__data_labels"].explode().to_list()),
            names=list(dl[0]._fields),
        ),
        append=True,
    )
)
exploded_results_cv.columns = exploded_results_cv.columns.str.removeprefix(
    "test__single__"
)
exploded_results_cv.T
fold 0 1 2
cohort HA HA MS
participant_id 001 002 001
time_measure TimeMeasure1 TimeMeasure1 TimeMeasure1
test Test11 Test11 Test11
trial Trial1 Trial1 Trial1
reference_gs_duration_s 40.44 40.82 65.52
detected_gs_duration_s 60.14 49.62 66.1
gs_duration_error_s 19.7 8.8 0.58
gs_relative_duration_error 0.487141 0.215581 0.008852
gs_absolute_duration_error_s 19.7 8.8 0.58
gs_absolute_relative_duration_error 0.487141 0.215581 0.008852
gs_absolute_relative_duration_error_log 0.396856 0.195222 0.008813
detected_num_gs 7 6 5
reference_num_gs 6 3 6
num_gs_error 1 3 -1
num_gs_relative_error 0.166667 1.0 -0.166667
num_gs_absolute_error 1 3 1
num_gs_absolute_relative_error 0.166667 1.0 0.166667
num_gs_absolute_relative_error_log 0.154151 0.693147 0.154151
tp_samples 3208 3132 4851
fp_samples 2815 1835 1766
fn_samples 839 955 1704
precision 0.532625 0.630562 0.733112
recall 0.792686 0.766332 0.740046
f1_score 0.63714 0.691849 0.736562
tn_samples 6923 10080 14429
specificity 0.710926 0.845992 0.890954
accuracy 0.734929 0.825647 0.847473
npv 0.891909 0.913457 0.894378
detected start end gs_id 0 ... start end gs_id 0 ... start end gs_id 0 ...
reference start end wb_id 0 ... start end wb_id 0 ... start end wb_id 0 ...


And the raw outputs:

raw_gs_list_cv = pd.concat(
    exploded_results_cv.loc[:, ["detected", "reference"]].stack().to_dict(),
    names=[*exploded_results_cv.index.names, "system"],
).unstack("system")
raw_gs_list_cv
start end
system detected reference detected reference
fold cohort participant_id time_measure test trial
0 HA 001 TimeMeasure1 Test11 Trial1 0 600.0 632.0 1201.0 988.0
1 2700.0 2864.0 4201.0 3325.0
2 4350.0 3853.0 5251.0 5085.0
3 7800.0 7641.0 8851.0 8621.0
4 9450.0 9451.0 10201.0 9932.0
5 10950.0 11989.0 11551.0 12517.0
6 13050.0 NaN 13651.0 NaN
1 HA 002 TimeMeasure1 Test11 Trial1 0 450.0 485.0 1201.0 1131.0
1 2400.0 1746.0 3301.0 3554.0
2 3450.0 6083.0 4051.0 7708.0
3 5700.0 NaN 7201.0 NaN
4 7350.0 NaN 7951.0 NaN
5 15000.0 NaN 15601.0 NaN
2 MS 001 TimeMeasure1 Test11 Trial1 0 750.0 1019.0 1651.0 1768.0
1 4650.0 4534.0 6151.0 5549.0
2 12900.0 9665.0 14851.0 10569.0
3 20100.0 12337.0 21151.0 14633.0
4 21300.0 20151.0 22501.0 20982.0
5 NaN 21378.0 NaN 22129.0


If we compare these results to the ones from the non-CV challenge, we can see that “single” results are identical, just that they were called in multiple folds. This is expected, as we used DummyOptimize and thus didn’t optimize the algorithm.

Let’s try a GridSearch on the algorithm to see how the results change. For the gridsearch, we will re-use the same scoring function as before, but we need to specify, which scoring result we want to optimize for.

from sklearn.model_selection import ParameterGrid
from tpcp.optimize import GridSearch

para_grid = ParameterGrid({"algo__window_length_s": [2, 3, 4]})
optimizer = GridSearch(pipe, para_grid, return_optimized="precision")

The optimizer can now be used in the same CV challenge as before. This way we can guarantee that the same folds are used for the optimization and the evaluation and ensure the best possible comparison between the algorithms versions.

eval_challenge_cv = eval_challenge_cv.clone().run(optimizer)
CV Folds:   0%|          | 0/3 [00:00<?, ?it/s]
CV Folds:  33%|███▎      | 1/3 [00:03<00:07,  3.72s/it]
CV Folds: 100%|██████████| 3/3 [00:05<00:00,  1.74s/it]
CV Folds: 100%|██████████| 3/3 [00:05<00:00,  1.94s/it]

The results we are seeing now are generated by the internally optimized version of the algorithm.

Because we used cv_params={"return_optimizer": True} we can also access the optimizer per fold directly. This can be useful to get more insights into the optimization process and what the optimal parameters were.

0    GridSearch(n_jobs=None, parameter_grid=<sklear...
1    GridSearch(n_jobs=None, parameter_grid=<sklear...
2    GridSearch(n_jobs=None, parameter_grid=<sklear...
Name: optimizer, dtype: object

We can get the best parameters per fold by directly interacting with the optimizer instances.

best_params = opt_results.apply(lambda x: pd.Series(x.best_params_))
best_params
algo__window_length_s
0 3
1 2
2 2


Or we can go much deeper, by getting all information about the optimization process. Let’s just look at the keys of the information that is available.

['agg__reference_gs_duration_s', 'rank__agg__reference_gs_duration_s', 'agg__detected_gs_duration_s', 'rank__agg__detected_gs_duration_s', 'agg__gs_duration_error_s', 'rank__agg__gs_duration_error_s', 'agg__gs_relative_duration_error', 'rank__agg__gs_relative_duration_error', 'agg__gs_absolute_duration_error_s', 'rank__agg__gs_absolute_duration_error_s', 'agg__gs_absolute_relative_duration_error', 'rank__agg__gs_absolute_relative_duration_error', 'agg__gs_absolute_relative_duration_error_log', 'rank__agg__gs_absolute_relative_duration_error_log', 'agg__detected_num_gs', 'rank__agg__detected_num_gs', 'agg__reference_num_gs', 'rank__agg__reference_num_gs', 'agg__num_gs_error', 'rank__agg__num_gs_error', 'agg__num_gs_relative_error', 'rank__agg__num_gs_relative_error', 'agg__num_gs_absolute_error', 'rank__agg__num_gs_absolute_error', 'agg__num_gs_absolute_relative_error', 'rank__agg__num_gs_absolute_relative_error', 'agg__num_gs_absolute_relative_error_log', 'rank__agg__num_gs_absolute_relative_error_log', 'agg__tp_samples', 'rank__agg__tp_samples', 'agg__fp_samples', 'rank__agg__fp_samples', 'agg__fn_samples', 'rank__agg__fn_samples', 'agg__precision', 'rank__agg__precision', 'agg__recall', 'rank__agg__recall', 'agg__f1_score', 'rank__agg__f1_score', 'agg__tn_samples', 'rank__agg__tn_samples', 'agg__specificity', 'rank__agg__specificity', 'agg__accuracy', 'rank__agg__accuracy', 'agg__npv', 'rank__agg__npv', 'single__reference_gs_duration_s', 'single__detected_gs_duration_s', 'single__gs_duration_error_s', 'single__gs_relative_duration_error', 'single__gs_absolute_duration_error_s', 'single__gs_absolute_relative_duration_error', 'single__gs_absolute_relative_duration_error_log', 'single__detected_num_gs', 'single__reference_num_gs', 'single__num_gs_error', 'single__num_gs_relative_error', 'single__num_gs_absolute_error', 'single__num_gs_absolute_relative_error', 'single__num_gs_absolute_relative_error_log', 'single__tp_samples', 'single__fp_samples', 'single__fn_samples', 'single__precision', 'single__recall', 'single__f1_score', 'single__tn_samples', 'single__specificity', 'single__accuracy', 'single__npv', 'single__detected', 'single__reference', 'data_labels', 'debug__score_time', 'param__algo__window_length_s', 'params']

With that, we hope it becomes clear, how these challenges can be extremely valuable, when benchmarking algorithms across datasets. To see how we evaluate the performance of the algorithms available in mobgap, check out the other gsd evaluation examples.

Total running time of the script: (0 minutes 16.891 seconds)

Estimated memory usage: 9 MB

Gallery generated by Sphinx-Gallery