GSD Evaluation Challenges#
The GSD Evaluation example demonstrates how to evaluate the performance of a GSD algorithm on a single datapoint
and explains the individual performance metrics that are calculated.
With that, you could set up a custom evaluation pipeline that runs a GSD algorithm on multiple datapoints, scores the output, and then aggregates the results. To make this process easier, we provide opinionated evaluation challenges that can be used to quickly perform the same evaluation with multiple algorithms and datasets.
Below, we will show how to use them on the example dataset.
# TODO: Update based on new Scorer API
Dataset#
To use the challenges, we need a dataset with reference information in the expected format.
We will use the LabExampleDataset for this purpose.
from mobgap.data import LabExampleDataset

long_test = LabExampleDataset(reference_system="INDIP").get_subset(
    test="Test11"
)
Algorithm#
Next we need to create an instance of a valid GSD algorithm.
This algorithm needs to be wrapped in a GsdEmulationPipeline to be used in
the challenges.
This pipeline takes care of extracting the correct data from the dataset and running the algorithm on it.
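from mobgap.gait_sequences import GsdIluz
from mobgap.gait_sequences.pipeline import GsdEmulationPipeline

# Any valid GSD algorithm can be used here; GsdIluz is assumed as an example.
algo = GsdIluz()
pipe = GsdEmulationPipeline(algo)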
Let’s demonstrate that quickly on a single datapoint.
pipe_with_results = pipe.clone().run(long_test[0])
pipe_with_results.gs_list_
Evaluation Challenge#
This pipeline can now be used as part of an evaluation challenge. An evaluation challenge takes care of two things:
- Running the pipeline on multiple datapoints
- Scoring the results per datapoint and then aggregating the results
We provide two challenges:
- Evaluation: This challenge simply runs the pipeline on all datapoints and then scores the results.
- EvaluationCV: This challenge runs a cross-validation on the dataset and then scores the results per fold.
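The two responsibilities of a challenge (run on every datapoint, then score and aggregate) can be sketched in plain Python. This is only a toy illustration: `toy_pipeline`, `toy_score`, and `toy_evaluation` are hypothetical names and not part of mobgap or tpcp, and the "datapoints" are simple dictionaries with made-up numbers.

```python
from statistics import mean


def toy_pipeline(datapoint):
    """Pretend algorithm: returns the number of gait sequences it 'detected'."""
    return datapoint["detected"]


def toy_score(pipeline, datapoint):
    """Run the pipeline on one datapoint and return its metrics as a dict."""
    detected = pipeline(datapoint)
    return {"num_gs_absolute_error": abs(detected - datapoint["reference"])}


def toy_evaluation(pipeline, dataset, score):
    """Score every datapoint, then average each metric over all datapoints."""
    single = [score(pipeline, dp) for dp in dataset]
    aggregated = {key: mean(r[key] for r in single) for key in single[0]}
    return aggregated, single


# Made-up dataset with three recordings
dataset = [
    {"reference": 6, "detected": 6},
    {"reference": 4, "detected": 5},
    {"reference": 8, "detected": 5},
]
aggregated, single = toy_evaluation(toy_pipeline, dataset, toy_score)
print(aggregated)
print(single)
```

The real challenges follow the same pattern, but delegate the scoring and aggregation to a tpcp Scorer as described in the next sections.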
Before we run the entire pipeline, let’s look at the scoring. Scoring is built on tpcp’s validation framework. As our scoring is relatively complex, it is split across two functions:
- gsd_per_datapoint_score: Run and score a single datapoint
- gsd_final_agg: Perform the final aggregation and scoring based on the results per datapoint.
Let’s look at their code first.
from inspect import getsource
from mobgap.gait_sequences.evaluation import (
    gsd_final_agg,
    gsd_per_datapoint_score,
)
print(getsource(gsd_per_datapoint_score))
def gsd_per_datapoint_score(pipeline: GsdEmulationPipeline, datapoint: BaseGaitDatasetWithReference) -> dict:
    """Evaluate the performance of a GSD algorithm on a single datapoint.

    .. warning:: This function is not meant to be called directly, but as a scoring function in a
       :class:`tpcp.validate.Scorer`.
       If you are writing custom scoring functions, you can use this function as a template or wrap it in a new
       function.

    This function is used to evaluate the performance of a GSD algorithm on a single datapoint.
    It calculates the performance metrics based on the detected gait sequences and the reference gait sequences.
    The following performance metrics are calculated:

    - all outputs of :func:`~mobgap.gait_sequences.evaluation.calculate_unmatched_gsd_performance_metrics`
      (will be averaged over all datapoints)
    - all outputs of :func:`~mobgap.gait_sequences.evaluation.calculate_matched_gsd_performance_metrics`
      (will be averaged over all datapoints)
    - ``matches``: The matched gait sequences calculated by
      :func:`~mobgap.gait_sequences.evaluation.categorize_intervals_per_sample` (return as ``no_agg``)
    - ``detected``: The detected gait sequences (return as ``no_agg``)
    - ``reference``: The reference gait sequences (return as ``no_agg``)
    - ``sampling_rate_hz``: The sampling rate of the data (return as ``no_agg``)

    Parameters
    ----------
    pipeline
        An instance of GSD emulation pipeline that wraps the algorithm that should be evaluated.
    datapoint
        The datapoint to be evaluated.

    Returns
    -------
    dict
        A dictionary containing the performance metrics.
        Note, that some results are wrapped in a ``no_agg`` object or other aggregators.
        The results of this function are not expected to be parsed manually, but rather the function is expected to be
        used in the context of the :func:`~tpcp.validate.validate`/:func:`~tpcp.validate.cross_validate` functions or
        similar as scorer.
        This functions will aggregate the results and provide a summary of the performance metrics.

    """
    from mobgap.gait_sequences.evaluation import (
        calculate_matched_gsd_performance_metrics,
        calculate_unmatched_gsd_performance_metrics,
        categorize_intervals_per_sample,
    )

    with warnings.catch_warnings():
        # We know that these errors might happen, and they are usually not relevant for the evaluation
        warnings.filterwarnings("ignore", message="Zero division", category=UserWarning)
        warnings.filterwarnings("ignore", message="multiple ICs", category=UserWarning)

        # Run the algorithm on the datapoint
        pipeline.safe_run(datapoint)
        detected_gs_list = pipeline.gs_list_
        reference_gs_list = datapoint.reference_parameters_.wb_list[["start", "end"]]
        n_overall_samples = len(datapoint.data_ss)
        sampling_rate_hz = datapoint.sampling_rate_hz

        matches = categorize_intervals_per_sample(
            gsd_list_detected=detected_gs_list,
            gsd_list_reference=reference_gs_list,
            n_overall_samples=n_overall_samples,
        )

        # Calculate the performance metrics
        performance_metrics = {
            **calculate_unmatched_gsd_performance_metrics(
                gsd_list_detected=detected_gs_list,
                gsd_list_reference=reference_gs_list,
                sampling_rate_hz=sampling_rate_hz,
            ),
            **calculate_matched_gsd_performance_metrics(matches),
            "matches": no_agg(matches),
            "detected": no_agg(detected_gs_list),
            "reference": no_agg(reference_gs_list),
            "sampling_rate_hz": no_agg(sampling_rate_hz),
            "runtime_s": getattr(pipeline.algo_, "perf_", {}).get("runtime_s", np.nan),
        }

    return performance_metrics
print(getsource(gsd_final_agg))
def gsd_final_agg(
    agg_results: dict[str, float],
    single_results: dict[str, list],
    pipeline: GsdEmulationPipeline,  # noqa: ARG001
    dataset: BaseGaitDatasetWithReference,
) -> tuple[dict[str, any], dict[str, list[any]]]:
    """Aggregate the performance metrics of a GSD algorithm over multiple datapoints.

    .. warning:: This function is not meant to be called directly, but as ``final_aggregator`` in a
       :class:`tpcp.validate.Scorer`.
       If you are writing custom scoring functions, you can use this function as a template or wrap it in a new
       function.

    This function aggregates the performance metrics as follows:

    - All raw outputs (``detected``, ``reference``, ``sampling_rate_hz``) are concatenated to a single
      dataframe, to make it easier to work with and are returned as part of the single results.
    - We recalculate all performance metrics from
      :func:`~mobgap.gait_sequences.evaluation.calculate_unmatched_gsd_performance_metrics`
      and :func:`~mobgap.gait_sequences.evaluation.calculate_matched_gsd_performance_metrics` on the combined data.
      The results are prefixed with ``combined__``.
      Compared to the per-datapoint results (which are calculated, as errors per recording -> average over all
      recordings), these metrics are calculated as combining all GSDs from all recordings and then calculating the
      performance metrics.
      Effectively, this means, that in the `per_datapoint` version, each recording is weighted equally, while in the
      `combined` version, each GS is weighted equally.

    Parameters
    ----------
    agg_results
        The aggregated results from all datapoints (see :class:`~tpcp.validate.Scorer`).
    single_results
        The per-datapoint results (see :class:`~tpcp.validate.Scorer`).
    pipeline
        The pipline that was passed to the scorer.
        This is ignored in this function, but might be useful in custom final aggregators.
    dataset
        The dataset that was passed to the scorer.

    Returns
    -------
    final_agg_results
        The final aggregated results.
    final_single_results
        The per-datapoint results, that are not aggregated.

    """
    from mobgap.gait_sequences.evaluation import (
        calculate_matched_gsd_performance_metrics,
        calculate_unmatched_gsd_performance_metrics,
    )

    data_labels = [d.group_label for d in dataset]
    data_label_names = data_labels[0]._fields
    # We combine each to a combined dataframe
    matches = single_results.pop("matches")
    matches = pd.concat(matches, keys=data_labels, names=[*data_label_names, *matches[0].index.names])
    detected = single_results.pop("detected")
    detected = pd.concat(detected, keys=data_labels, names=[*data_label_names, *detected[0].index.names])
    reference = single_results.pop("reference")
    reference = pd.concat(reference, keys=data_labels, names=[*data_label_names, *reference[0].index.names])

    aggregated_single_results = {
        "raw__detected": detected,
        "raw__reference": reference,
    }

    sampling_rate_hz = single_results.pop("sampling_rate_hz")
    if set(sampling_rate_hz) != {sampling_rate_hz[0]}:
        raise ValueError(
            "Sampling rate is not the same for all datapoints in the dataset. "
            "This not supported by this scorer. "
            "Provide a custom scorer that can handle this case."
        )

    combined_unmatched = {
        f"combined__{k}": v
        for k, v in calculate_unmatched_gsd_performance_metrics(
            gsd_list_detected=detected,
            gsd_list_reference=reference,
            sampling_rate_hz=sampling_rate_hz[0],
        ).items()
    }
    combined_matched = {f"combined__{k}": v for k, v in calculate_matched_gsd_performance_metrics(matches).items()}

    # Note, that we pass the "aggregated_single_results" out via the single results and not the aggregated results
    # The reason is that the aggregated results are expected to be a single value per metric, while the single results
    # can be anything.
    return {**agg_results, **combined_unmatched, **combined_matched}, {**single_results, **aggregated_single_results}
We can see that these functions are relatively simple and build on the lower-level GSD evaluation functions that we provide.
gsd_per_datapoint_score calculates the raw results and all scores that can be calculated per datapoint.
gsd_final_agg handles the calculation of all scores that require the raw results from all datapoints at once.
The remaining aggregation is handled by the Scorer class (see below).
So if you want to write your own scoring function, it should be straightforward to do so.
Note the no_agg wrapper around some of the return values.
This is a special aggregator that tells the challenge not to aggregate the respective values.
For all other values, the challenge will try to average the values across all datapoints.
To learn more about these special aggregators, check out the tpcp example.
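The idea of a "do not aggregate" marker can be illustrated with a small plain-Python sketch. Note that `PassThrough` and `aggregate` are hypothetical names invented for this illustration; they are not tpcp's implementation of no_agg, and the values are made up.

```python
from statistics import mean


class PassThrough:
    """Hypothetical marker wrapper (not tpcp's no_agg): 'hand me through unchanged'."""

    def __init__(self, value):
        self.value = value


def aggregate(per_datapoint_results):
    """Average plain values; collect wrapped values as a raw list per key."""
    agg, raw = {}, {}
    for key in per_datapoint_results[0]:
        values = [result[key] for result in per_datapoint_results]
        if isinstance(values[0], PassThrough):
            raw[key] = [v.value for v in values]
        else:
            agg[key] = mean(values)
    return agg, raw


# Made-up results of two datapoints
results = [
    {"precision": 0.5, "detected": PassThrough([(632, 988)])},
    {"precision": 0.7, "detected": PassThrough([(10, 40), (90, 120)])},
]
agg, raw = aggregate(results)
print(agg)
print(raw)
```

Values that cannot be meaningfully averaged (such as lists of detected intervals) are kept per datapoint, while scalar metrics are reduced to a single number.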
The scoring function takes care of running the pipeline, so we can test it by just providing it with a pipeline and a datapoint.
from pprint import pprint
single_dp_results = gsd_per_datapoint_score(pipe, long_test[0])
single_dp_results.pop("detected")
single_dp_results.pop("reference")
pprint(single_dp_results)
{'accuracy': 0.704708699122107,
'detected_gs_duration_s': 48.12,
'detected_num_gs': 6,
'f1_score': 0.5408393501805054,
'fn_samples': 1649,
'fp_samples': 2421,
'gs_absolute_duration_error_s': 7.68,
'gs_absolute_relative_duration_error': 0.18991097922848665,
'gs_absolute_relative_duration_error_log': 0.17387849695420737,
'gs_duration_error_s': 7.68,
'gs_relative_duration_error': 0.18991097922848665,
'matches': _NoAgg(return_raw_scores=True)( start end match_type
0 0 600 tn
1 600 632 fp
2 632 988 tp
3 988 1201 fp
4 1201 2864 tn
5 2864 3325 fn
6 3325 3853 tn
7 3853 4350 fn
8 4350 5085 tp
9 5085 5251 fp
10 5251 7641 tn
11 7641 7800 fn
12 7800 8621 tp
13 8621 9001 fp
14 9001 9300 tn
15 9300 9451 fp
16 9451 9932 tp
17 9932 10201 fp
18 10201 10950 tn
19 10950 11551 fp
20 11551 11989 tn
21 11989 12517 fn
22 12517 13050 tn
23 13050 13651 fp
24 13651 13758 tn),
'npv': 0.8160624651422197,
'num_gs_absolute_error': 0,
'num_gs_absolute_relative_error': 0.0,
'num_gs_absolute_relative_error_log': 0.0,
'num_gs_error': 0,
'num_gs_relative_error': 0.0,
'precision': 0.4975093399750934,
'recall': 0.592436974789916,
'reference_gs_duration_s': 40.44,
'reference_num_gs': 6,
'runtime_s': 0.009653974000684684,
'sampling_rate_hz': _NoAgg(return_raw_scores=True)(100.0),
'specificity': 0.7513607887439663,
'tn_samples': 7316,
'tp_samples': 2397}
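As a sanity check, the sample-based metrics in the output above can be reproduced from the four sample counts using the standard confusion-matrix definitions. This is a worked calculation with the printed numbers; that mobgap uses exactly these formulas internally is an assumption, although the results agree with the printed values.

```python
# Sample counts copied from the printed results above
tp, fp, fn, tn = 2397, 2421, 1649, 7316

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_score = 2 * tp / (2 * tp + fp + fn)
specificity = tn / (tn + fp)
accuracy = (tp + tn) / (tp + fp + fn + tn)
npv = tn / (tn + fn)

# Duration-based metrics, again using the printed values
detected_gs_duration_s = 48.12
reference_gs_duration_s = 40.44
gs_duration_error_s = detected_gs_duration_s - reference_gs_duration_s
gs_relative_duration_error = gs_duration_error_s / reference_gs_duration_s

for name, value in {
    "precision": precision,
    "recall": recall,
    "f1_score": f1_score,
    "specificity": specificity,
    "accuracy": accuracy,
    "npv": npv,
    "gs_duration_error_s": gs_duration_error_s,
    "gs_relative_duration_error": gs_relative_duration_error,
}.items():
    print(f"{name}: {value:.4f}")
```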
To use the two functions with a challenge, we need to wrap them into a Scorer instance.
from tpcp.validate import Scorer
gsd_evaluation_scorer = Scorer(
    gsd_per_datapoint_score, final_aggregator=gsd_final_agg
)
The challenge will call this scorer for each group in the dataset.
The scorer itself will then call gsd_per_datapoint_score for each datapoint and then gsd_final_agg with the
combined results.
For these two default scoring functions, we also provide the scorer directly, so that you don’t have to construct it yourself. However, in case you want to modify the scoring functions, you can do so by creating your own scorer. We will continue to use the default scorer for the challenges.
from mobgap.gait_sequences.evaluation import gsd_score
gsd_evaluation_scorer = gsd_score
Let’s put everything together and run the challenge.
from mobgap.utils.evaluation import Evaluation
eval_challenge = Evaluation(long_test, scoring=gsd_evaluation_scorer)
We can now run the challenge.
eval_challenge = eval_challenge.run(pipe)
Datapoints: 0%| | 0/3 [00:00<?, ?it/s]
Datapoints: 33%|███▎ | 1/3 [00:00<00:00, 5.59it/s]
Datapoints: 67%|██████▋ | 2/3 [00:00<00:00, 5.53it/s]
/home/docs/checkouts/readthedocs.org/user_builds/mobgap/checkouts/v0.11.0/src/mobgap/data/_mobilised_matlab_loader.py:911: UserWarning: There were multiple ICs with the same index value, but different LR labels. This is likely an issue with the reference system you should further investigate. For now, we set the `lr_label` of the stride corresponding to this IC to Nan. However, both values still remain in the IC list.
warnings.warn(
Datapoints: 100%|██████████| 3/3 [00:00<00:00, 5.41it/s]
The results are stored in the results_ attribute and contain the aggregated and the raw results per datapoint.
To learn more about the results, check the validate documentation.
Note, that we remove the no_agg parameters from the results, as they don’t visualize well.
import pandas as pd
validate_results = pd.DataFrame(eval_challenge.results_)
validate_results
As you can see, this is a very messy dataframe with a lot of information. To make this easier to digest, the evaluation object has methods for extracting the different groups of information. The first group is the aggregated results, which represent only a “single value” over the entire dataset.
agg_results = eval_challenge.get_aggregated_results_as_df()
agg_results.T
You might have noticed that many metrics appear twice, once with a combined__ prefix and once without.
These represent two different things.
As the source code of the scorer above shows, the metric without prefix is calculated per datapoint and then
averaged.
The metric with the prefix is calculated over the raw detected gait sequences of all datapoints combined.
Effectively, these are two different “weightings”.
In the aggregated results without prefix, each recording has the same weight, independent of its length.
In the second case, each individual IMU sample has the same weight.
It does not matter in which recording a sample was classified correctly or not; it has the same impact on the
combined metric.
Both approaches are valid, but you should be aware of the differences when comparing algorithms. How you aggregate can have a big impact on the results.
combined_metrics = agg_results.filter(like="combined__").rename(
    columns=lambda x: x.replace("combined__", "")
)
combined_vs_per_datapoint = pd.concat(
    {
        "combined": combined_metrics,
        "per_datapoint": agg_results[combined_metrics.columns],
    },
    axis=0,
)
combined_vs_per_datapoint.reset_index(level=-1, drop=True).T
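To see how much the two weightings can differ, consider a toy calculation with made-up sample counts for one long and one short recording. The numbers and the helper `precision` are purely illustrative and are not taken from the example dataset.

```python
from statistics import mean

# Made-up per-recording sample counts
recordings = [
    {"tp": 9000, "fp": 1000},  # long recording, precision 0.9
    {"tp": 10, "fp": 90},  # short recording, precision 0.1
]


def precision(tp, fp):
    return tp / (tp + fp)


# "per_datapoint": calculate per recording, then average (each recording counts equally)
per_datapoint = mean(precision(r["tp"], r["fp"]) for r in recordings)

# "combined": pool all samples first, then calculate once (each sample counts equally)
combined = precision(
    sum(r["tp"] for r in recordings),
    sum(r["fp"] for r in recordings),
)

print(f"per_datapoint: {per_datapoint:.3f}")
print(f"combined: {combined:.3f}")
```

In this toy case the per-datapoint precision is 0.5, while the combined precision is roughly 0.89, because the long recording dominates the pooled sample counts.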
The “single” results represent the values per datapoint.
single_results = eval_challenge.get_single_results_as_df()
single_results.T
And finally, we had a couple of “raw” results in the scoring that we passed through without calculating any error metrics. These are available as a dictionary of raw results.
raw_results = eval_challenge.get_raw_results()
list(raw_results.keys())
['detected', 'reference']
raw_results["detected"]
raw_results["reference"]
Further, some runtime information is available (i.e. when the challenge was started and how long it took).
eval_challenge.perf_["start_datetime"], eval_challenge.perf_["end_datetime"]
('2025-06-16T09:18:01.740670+00:00', '2025-06-16T09:18:02.417567+00:00')
eval_challenge.perf_["runtime_s"]
0.6768479160018614
Using Evaluation is great if you are only comparing (or planning to
compare) non-ML algorithms, or algorithms that don’t require further optimization (e.g. through GridSearch).
As soon as an algorithm needs to be trained or optimized, it has to be evaluated on data that was not used for the optimization, which requires a train-test split.
Therefore, it is generally recommended to run a cross-validation with
EvaluationCV.
This allows you to evaluate the performance of the algorithm on multiple folds of the dataset, and through the use
of DummyOptimize you can also use algorithms without optimization in the same pipeline for
comparison.
Let’s demonstrate the use of EvaluationCV on the example dataset using
the same algorithm once with and once without GridSearch.
For the CV-based challenge, we need to set up a cross-validation. As we only have 3 datapoints here, we will use a 3-fold cross-validation without grouping or stratification. In a real-world scenario, you would use a more sophisticated cross-validation strategy. You can learn more about cross-validation in the tpcp example.
Further, to speed things up, we are going to use multi-processing.
We can configure this using the n_jobs parameter that we pass to the internal
cross_validate function via the cv_params parameter.
from mobgap.utils.evaluation import EvaluationCV
eval_challenge_cv = EvaluationCV(
    long_test,
    cv_iterator=3,
    scoring=gsd_evaluation_scorer,
    cv_params={"n_jobs": 2, "return_optimizer": True},
)
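With 3 datapoints and 3 folds, every fold uses exactly one datapoint for testing and the other two for training. The following plain-Python sketch shows such a split. It assumes that an integer cv_iterator behaves like a plain, unshuffled k-fold split; `k_fold_indices` is a hypothetical helper written for this illustration, not a mobgap or tpcp function.

```python
def k_fold_indices(n_datapoints, n_splits):
    """Yield (train, test) index lists for a plain, unshuffled k-fold split."""
    indices = list(range(n_datapoints))
    base, extra = divmod(n_datapoints, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds get one additional test datapoint
        size = base + (1 if fold < extra else 0)
        test = indices[start : start + size]
        train = indices[:start] + indices[start + size :]
        yield train, test
        start += size


folds = list(k_fold_indices(n_datapoints=3, n_splits=3))
for fold_id, (train, test) in enumerate(folds):
    print(f"fold {fold_id}: train={train}, test={test}")
```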
To use our pipeline from above, we need to wrap it in a DummyOptimize instance.
This will basically skip any optimization on the train set and just apply the pipeline to the test set.
from tpcp.optimize import DummyOptimize
eval_challenge_cv = eval_challenge_cv.run(
    DummyOptimize(pipe, ignore_potential_user_error_warning=True)
)
CV Folds: 0%| | 0/3 [00:00<?, ?it/s]
CV Folds: 33%|███▎ | 1/3 [00:03<00:07, 3.84s/it]
CV Folds: 100%|██████████| 3/3 [00:04<00:00, 1.12s/it]
CV Folds: 100%|██████████| 3/3 [00:04<00:00, 1.40s/it]
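Conceptually, an optimizer that "skips optimization" exposes the same interface as a real optimizer but ignores the training data. The sketch below illustrates this idea only; `ToyPipeline` and `NoOpOptimizer` are hypothetical classes and do not mirror the actual DummyOptimize implementation in tpcp.

```python
class ToyPipeline:
    """Hypothetical pipeline with a single tunable parameter."""

    def __init__(self, window_length_s):
        self.window_length_s = window_length_s


class NoOpOptimizer:
    """Hypothetical optimizer that leaves the wrapped pipeline untouched."""

    def __init__(self, pipeline):
        self.pipeline = pipeline

    def optimize(self, train_data):
        # The training data is deliberately ignored: nothing is tuned.
        self.optimized_pipeline_ = self.pipeline
        return self


optimizer = NoOpOptimizer(ToyPipeline(window_length_s=3)).optimize(train_data=[1, 2])
print(optimizer.optimized_pipeline_.window_length_s)  # prints 3
```

Because the interface is the same, algorithms with and without optimization can be passed through the same cross-validation loop and evaluated on identical test folds.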
The results are now a little more complex, as they contain the results for each fold.
In addition, we have information for the train and the test set.
The test set results are what we are usually looking for.
The train set results are only calculated when providing the return_train_score parameter to the cv_params.
As before, all results are stored in the results_ attribute, but it is usually recommended to use the helper methods
to access the data.
Note that, compared to the results above, we now have multiple CV folds, and the aggregated results contain one value per fold. These values could be further aggregated, e.g. by calculating the mean over all folds.
agg_results_cv = eval_challenge_cv.get_aggregated_results_as_df()
agg_results_cv.T
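Aggregating over folds can be as simple as a mean and standard deviation per metric. The sketch below uses made-up per-fold values in a plain dictionary; it does not rely on the exact layout of the dataframe returned above.

```python
from statistics import mean, stdev

# Made-up per-fold test scores (one value per CV fold)
per_fold = {
    "precision": [0.50, 0.62, 0.56],
    "recall": [0.59, 0.71, 0.65],
}

summary = {
    metric: {"mean": mean(values), "std": stdev(values)}
    for metric, values in per_fold.items()
}

for metric, stats in summary.items():
    print(f"{metric}: {stats['mean']:.2f} +/- {stats['std']:.2f}")
```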
The single results contain the CV fold as an additional index. Otherwise, the output is identical to before. Note that if you use anything other than a KFold splitter, you might have some datapoints duplicated across folds.
single_results_cv = eval_challenge_cv.get_single_results_as_df()
single_results_cv
And the raw outputs:
raw_results_cv = eval_challenge_cv.get_raw_results()
raw_results_cv["detected"]
If we compare these results to the ones from the non-CV challenge, we can see that the “single” results are identical,
just that they were calculated across multiple folds.
This is expected, as we used DummyOptimize and thus didn’t optimize the algorithm.
Let’s try a GridSearch on the algorithm to see how the results change.
For the grid search, we will re-use the same scoring function as before, but we need to specify which scoring result
we want to optimize for.
from sklearn.model_selection import ParameterGrid
from tpcp.optimize import GridSearch
para_grid = ParameterGrid({"algo__window_length_s": [2, 3, 4]})
optimizer = GridSearch(
    pipe, para_grid, scoring=gsd_evaluation_scorer, return_optimized="precision"
)
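The principle behind the grid search is simple: score every parameter combination and keep the one with the best value of the chosen metric. A minimal sketch of that selection step is shown below; `toy_precision` is a hypothetical stand-in for running and scoring the pipeline on the training data, and its return values are made up.

```python
def toy_precision(window_length_s):
    """Hypothetical score; stands in for running the pipeline on the train set."""
    return {2: 0.48, 3: 0.55, 4: 0.51}[window_length_s]


# One dict per parameter combination, analogous to iterating over a parameter grid
parameter_grid = [{"window_length_s": w} for w in (2, 3, 4)]

scores = [toy_precision(**params) for params in parameter_grid]

# Keep the parameter combination with the highest score
best_score, best_params = max(
    zip(scores, parameter_grid), key=lambda pair: pair[0]
)
print(best_params, best_score)
```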
The optimizer can now be used in the same CV challenge as before. This way we can guarantee that the same folds are used for the optimization and the evaluation, and ensure the best possible comparison between the algorithm versions.
eval_challenge_gs = eval_challenge_cv.clone().run(optimizer)
CV Folds: 0%| | 0/3 [00:00<?, ?it/s]
CV Folds: 33%|███▎ | 1/3 [00:01<00:03, 1.64s/it]
CV Folds: 100%|██████████| 3/3 [00:03<00:00, 1.01s/it]
CV Folds: 100%|██████████| 3/3 [00:03<00:00, 1.07s/it]
The results we are seeing now are generated by the internally optimized version of the algorithm.
agg_results_cv = eval_challenge_gs.get_aggregated_results_as_df()
agg_results_cv.T
Because we used cv_params={"return_optimizer": True}, we can also access the optimizer per fold directly from
the results_ attribute.
This can be useful to get more insights into the optimization process and what the optimal parameters were.
opt_results = pd.Series(eval_challenge_gs.results_["optimizer"])
opt_results
0 GridSearch(n_jobs=None, parameter_grid=<sklear...
1 GridSearch(n_jobs=None, parameter_grid=<sklear...
2 GridSearch(n_jobs=None, parameter_grid=<sklear...
dtype: object
We can get the best parameters per fold by directly interacting with the optimizer instances.
best_params = opt_results.apply(lambda x: pd.Series(x.best_params_))
best_params
Or we can go much deeper, by getting all information about the optimization process. Let’s just look at the keys of the information that is available.
all_opti_results_fold0 = pd.DataFrame(opt_results.loc[0].gs_results_)
all_opti_results_fold0.columns.to_list()
['agg__reference_gs_duration_s', 'rank__agg__reference_gs_duration_s', 'agg__detected_gs_duration_s', 'rank__agg__detected_gs_duration_s', 'agg__gs_duration_error_s', 'rank__agg__gs_duration_error_s', 'agg__gs_relative_duration_error', 'rank__agg__gs_relative_duration_error', 'agg__gs_absolute_duration_error_s', 'rank__agg__gs_absolute_duration_error_s', 'agg__gs_absolute_relative_duration_error', 'rank__agg__gs_absolute_relative_duration_error', 'agg__gs_absolute_relative_duration_error_log', 'rank__agg__gs_absolute_relative_duration_error_log', 'agg__detected_num_gs', 'rank__agg__detected_num_gs', 'agg__reference_num_gs', 'rank__agg__reference_num_gs', 'agg__num_gs_error', 'rank__agg__num_gs_error', 'agg__num_gs_relative_error', 'rank__agg__num_gs_relative_error', 'agg__num_gs_absolute_error', 'rank__agg__num_gs_absolute_error', 'agg__num_gs_absolute_relative_error', 'rank__agg__num_gs_absolute_relative_error', 'agg__num_gs_absolute_relative_error_log', 'rank__agg__num_gs_absolute_relative_error_log', 'agg__tp_samples', 'rank__agg__tp_samples', 'agg__fp_samples', 'rank__agg__fp_samples', 'agg__fn_samples', 'rank__agg__fn_samples', 'agg__precision', 'rank__agg__precision', 'agg__recall', 'rank__agg__recall', 'agg__f1_score', 'rank__agg__f1_score', 'agg__tn_samples', 'rank__agg__tn_samples', 'agg__specificity', 'rank__agg__specificity', 'agg__accuracy', 'rank__agg__accuracy', 'agg__npv', 'rank__agg__npv', 'agg__runtime_s', 'rank__agg__runtime_s', 'agg__combined__reference_gs_duration_s', 'rank__agg__combined__reference_gs_duration_s', 'agg__combined__detected_gs_duration_s', 'rank__agg__combined__detected_gs_duration_s', 'agg__combined__gs_duration_error_s', 'rank__agg__combined__gs_duration_error_s', 'agg__combined__gs_relative_duration_error', 'rank__agg__combined__gs_relative_duration_error', 'agg__combined__gs_absolute_duration_error_s', 'rank__agg__combined__gs_absolute_duration_error_s', 'agg__combined__gs_absolute_relative_duration_error', 
'rank__agg__combined__gs_absolute_relative_duration_error', 'agg__combined__gs_absolute_relative_duration_error_log', 'rank__agg__combined__gs_absolute_relative_duration_error_log', 'agg__combined__detected_num_gs', 'rank__agg__combined__detected_num_gs', 'agg__combined__reference_num_gs', 'rank__agg__combined__reference_num_gs', 'agg__combined__num_gs_error', 'rank__agg__combined__num_gs_error', 'agg__combined__num_gs_relative_error', 'rank__agg__combined__num_gs_relative_error', 'agg__combined__num_gs_absolute_error', 'rank__agg__combined__num_gs_absolute_error', 'agg__combined__num_gs_absolute_relative_error', 'rank__agg__combined__num_gs_absolute_relative_error', 'agg__combined__num_gs_absolute_relative_error_log', 'rank__agg__combined__num_gs_absolute_relative_error_log', 'agg__combined__tp_samples', 'rank__agg__combined__tp_samples', 'agg__combined__fp_samples', 'rank__agg__combined__fp_samples', 'agg__combined__fn_samples', 'rank__agg__combined__fn_samples', 'agg__combined__precision', 'rank__agg__combined__precision', 'agg__combined__recall', 'rank__agg__combined__recall', 'agg__combined__f1_score', 'rank__agg__combined__f1_score', 'agg__combined__tn_samples', 'rank__agg__combined__tn_samples', 'agg__combined__specificity', 'rank__agg__combined__specificity', 'agg__combined__accuracy', 'rank__agg__combined__accuracy', 'agg__combined__npv', 'rank__agg__combined__npv', 'single__reference_gs_duration_s', 'single__detected_gs_duration_s', 'single__gs_duration_error_s', 'single__gs_relative_duration_error', 'single__gs_absolute_duration_error_s', 'single__gs_absolute_relative_duration_error', 'single__gs_absolute_relative_duration_error_log', 'single__detected_num_gs', 'single__reference_num_gs', 'single__num_gs_error', 'single__num_gs_relative_error', 'single__num_gs_absolute_error', 'single__num_gs_absolute_relative_error', 'single__num_gs_absolute_relative_error_log', 'single__tp_samples', 'single__fp_samples', 'single__fn_samples', 'single__precision', 
'single__recall', 'single__f1_score', 'single__tn_samples', 'single__specificity', 'single__accuracy', 'single__npv', 'single__runtime_s', 'single__raw__detected', 'single__raw__reference', 'data_labels', 'debug__score_time', 'param__algo__window_length_s', 'params']
With that, we hope it becomes clear how valuable these challenges can be when benchmarking algorithms across datasets. To see how we evaluate the performance of the algorithms available in mobgap, check out the other GSD evaluation examples.