.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/gait_sequences/_03_gsd_evaluation.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end `
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_gait_sequences__03_gsd_evaluation.py:

.. _gsd_evaluation:

GSD Evaluation
==============

This example shows how to apply evaluation algorithms to GSD and thus how to rate the performance of a GSD algorithm.

.. GENERATED FROM PYTHON SOURCE LINES 9-14

.. code-block:: Python

    import pandas as pd

    from mobgap.data import LabExampleDataset
    from mobgap.gait_sequences import GsdIluz

.. GENERATED FROM PYTHON SOURCE LINES 15-21

Loading some example data
-------------------------
First, we load example data and apply the GSD Iluz algorithm to it.
However, you can use any other GSD algorithm as well.
To have a reference to compare the results to, we also load the corresponding ground truth data.
These steps are explained in more detail in the :ref:`GSD Iluz example `.

.. GENERATED FROM PYTHON SOURCE LINES 21-55

.. code-block:: Python

    from mobgap.utils.conversions import to_body_frame


    def load_data():
        lab_example_data = LabExampleDataset(reference_system="INDIP")
        single_test = lab_example_data.get_subset(
            cohort="MS", participant_id="001", test="Test11", trial="Trial1"
        )
        return single_test


    def calculate_gsd_iluz_output(single_test_data):
        """Calculate the GSD Iluz output for one sensor from the test data."""
        det_gsd = (
            GsdIluz()
            .detect(
                to_body_frame(single_test_data.data_ss),
                sampling_rate_hz=single_test_data.sampling_rate_hz,
            )
            .gs_list_
        )
        return det_gsd


    def load_reference(single_test_data):
        """Load the reference gait sequences from the test data."""
        ref_gsd = single_test_data.reference_parameters_.wb_list
        return ref_gsd


    test_data = load_data()
    detected_gsd_list = calculate_gsd_iluz_output(test_data)
    reference_gsd_list = load_reference(test_data)

.. GENERATED FROM PYTHON SOURCE LINES 56-57

Each detected gait sequence is characterized by its start and end index in samples.

.. GENERATED FROM PYTHON SOURCE LINES 57-59

.. code-block:: Python

    detected_gsd_list

.. raw:: html
start end
gs_id
0 900 2101
1 4650 6151
2 9600 10801
3 11250 12151
4 12300 14851
5 19950 21151
6 21300 22501


.. GENERATED FROM PYTHON SOURCE LINES 60-62

.. code-block:: Python

    reference_gsd_list

.. raw:: html
start end n_strides duration_s length_m avg_walking_speed_mps avg_cadence_spm avg_stride_length_m termination_reason
wb_id
0 1019 1768 9 7.48 4.468932 0.847668 107.795850 0.942678 Pause
1 4534 5549 11 10.14 2.900453 0.365176 93.396106 0.483923 Pause
2 9665 10569 9 9.03 2.140232 0.294058 75.981133 0.506458 Pause
3 12337 14633 28 22.95 11.201110 0.634425 92.337768 0.803933 Pause
4 20151 20982 11 8.30 2.390709 0.371746 87.915774 0.507484 Pause
5 21378 22129 9 7.50 2.517558 0.492965 95.365740 0.599360 Pause
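
As a quick plausibility check before the formal evaluation, the two lists above can be compared by their total gait duration and their number of gait sequences.
The following is only a minimal sketch and not part of mobgap: it hard-codes the start and end indices shown above, and it assumes a sampling rate of 100 Hz and that each sequence spans ``end - start + 1`` samples.
Both assumptions are inferred from the duration values reported further down in this example.

.. code-block:: Python

    # Start/end sample indices copied from the two tables above.
    detected = [
        (900, 2101), (4650, 6151), (9600, 10801), (11250, 12151),
        (12300, 14851), (19950, 21151), (21300, 22501),
    ]
    reference = [
        (1019, 1768), (4534, 5549), (9665, 10569),
        (12337, 14633), (20151, 20982), (21378, 22129),
    ]

    # Assumption: in the real example this is test_data.sampling_rate_hz.
    SAMPLING_RATE_HZ = 100


    def total_duration_s(gs_list, sampling_rate_hz):
        """Sum the duration of all sequences, counting the end sample as included (assumption)."""
        n_samples = sum(end - start + 1 for start, end in gs_list)
        return n_samples / sampling_rate_hz


    detected_duration_s = total_duration_s(detected, SAMPLING_RATE_HZ)
    reference_duration_s = total_duration_s(reference, SAMPLING_RATE_HZ)
    duration_error_s = detected_duration_s - reference_duration_s
    relative_duration_error = duration_error_s / reference_duration_s
    num_gs_error = len(detected) - len(reference)

    print(detected_duration_s, reference_duration_s, num_gs_error)

A positive duration error means that the algorithm labels more of the recording as gait than the reference does.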


.. GENERATED FROM PYTHON SOURCE LINES 63-94

Validation of algorithm output against a reference
--------------------------------------------------
Let's quantify how the algorithm output compares to the reference labels.
To gain a detailed insight into the performance of the algorithm, we can look into the individual matches between the
detected and reference gait sequences.
Note that there are two different ways to approach this:

1. We can determine for each sample whether it is correctly detected as gait or not.
2. We can check on the level of gait sequences whether a detected gait sequence matches a reference gait
   sequence (by a certain overlap threshold).

In mobgap, we provide functions to calculate both types of performance metrics.
Let's start with the first one.

Sample-wise performance evaluation
----------------------------------
To do this, we use the :func:`~mobgap.gait_sequences.evaluation.categorize_intervals_per_sample` function to identify
overlapping regions between the detected gait sequences and the reference gait sequences.
These overlapping regions can then be converted into sample-wise classifications of true positives, false positives,
and false negatives.
Besides the mandatory detected and reference gait sequences, the total number of samples in the recording can be
passed as an optional argument.
If provided, the intervals where no gait sequences are present in both the reference and the detected list are also
reported.
Later on, we can use these categorized intervals to calculate a set of higher-level performance metrics.

The result is a DataFrame containing the `start` and `end` index of each categorized interval together with a
`match_type` column that contains the type of match for each interval, i.e. `tp` for true positive, `fp` for false
positive, and `fn` for false negative.
These intervals cannot be interpreted as gait sequences. Instead, they are sub-intervals that categorize
correctly detected samples (`tp`), falsely detected samples (`fp`), samples from the reference GSD list that were not
detected (`fn`), and (optionally) samples where no gait sequence is present in either the reference or the detected
list (`tn`).
Note that the `tn` intervals are not explicitly calculated, but are inferred from the total length of the recording
(if provided) and from the other intervals, as everything between them is considered a true negative.

.. GENERATED FROM PYTHON SOURCE LINES 94-104

.. code-block:: Python

    from mobgap.gait_sequences.evaluation import categorize_intervals_per_sample

    categorized_intervals = categorize_intervals_per_sample(
        gsd_list_detected=detected_gsd_list,
        gsd_list_reference=reference_gsd_list,
        n_overall_samples=len(test_data.data_ss),
    )

    categorized_intervals

.. raw:: html
start end match_type
0 0 900 tn
1 900 1019 fp
2 1019 1768 tp
3 1768 2101 fp
4 2101 4534 tn
5 4534 4650 fn
6 4650 5549 tp
7 5549 6151 fp
8 6151 9600 tn
9 9600 9665 fp
10 9665 10569 tp
11 10569 10801 fp
12 10801 11250 tn
13 11250 12151 fp
14 12151 12300 tn
15 12300 12337 fp
16 12337 14633 tp
17 14633 14851 fp
18 14851 19950 tn
19 19950 20151 fp
20 20151 20982 tp
21 20982 21151 fp
22 21151 21300 tn
23 21300 21378 fp
24 21378 22129 tp
25 22129 22501 fp
26 22501 22727 tn
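
To make the link between categorized intervals and performance metrics concrete, the sketch below sums the samples per match type and derives the usual confusion-matrix metrics by hand.
It is an illustration in plain Python, not the mobgap implementation. The interval rows are copied from the table above, and the sketch assumes that each interval contributes ``end - start + 1`` samples, which reproduces the sample counts that mobgap reports for this recording.

.. code-block:: Python

    # (start, end, match_type) rows copied from the table above.
    categorized = [
        (0, 900, "tn"), (900, 1019, "fp"), (1019, 1768, "tp"),
        (1768, 2101, "fp"), (2101, 4534, "tn"), (4534, 4650, "fn"),
        (4650, 5549, "tp"), (5549, 6151, "fp"), (6151, 9600, "tn"),
        (9600, 9665, "fp"), (9665, 10569, "tp"), (10569, 10801, "fp"),
        (10801, 11250, "tn"), (11250, 12151, "fp"), (12151, 12300, "tn"),
        (12300, 12337, "fp"), (12337, 14633, "tp"), (14633, 14851, "fp"),
        (14851, 19950, "tn"), (19950, 20151, "fp"), (20151, 20982, "tp"),
        (20982, 21151, "fp"), (21151, 21300, "tn"), (21300, 21378, "fp"),
        (21378, 22129, "tp"), (22129, 22501, "fp"), (22501, 22727, "tn"),
    ]

    # Count the samples per match type.
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for start, end, match_type in categorized:
        counts[match_type] += end - start + 1  # assumption: end sample is included

    tp, fp, fn, tn = (counts[key] for key in ("tp", "fp", "fn", "tn"))

    # Standard confusion-matrix metrics.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1_score = 2 * tp / (2 * tp + fp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    npv = tn / (tn + fn)

    print(counts)
    print(precision, recall, f1_score)

Specificity, accuracy, and negative predictive value need the ``tn`` count, which is only available when the total number of samples was passed to the categorization step.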


.. GENERATED FROM PYTHON SOURCE LINES 105-117

Based on the individually categorized tp, fp, fn, and tn intervals, common performance metrics, e.g., F1 score,
precision, or recall can be calculated.
For this purpose, the :func:`~mobgap.gait_sequences.evaluation.calculate_matched_gsd_performance_metrics` function can
be used.
It calculates the metrics based on the "matched" GSD intervals, i.e., the categorized interval list where every entry
has a match type (tp, fp, fn, tn) assigned.
Therefore, :func:`~mobgap.gait_sequences.evaluation.categorize_intervals_per_sample` must be called first.
The categorized intervals can then be passed as an argument to
:func:`~mobgap.gait_sequences.evaluation.calculate_matched_gsd_performance_metrics`.
It returns a dictionary containing the metrics for the specified categorized intervals DataFrame.
The total number of samples per match type, precision, recall, and F1 score are always calculated.
If true negatives are present in the categorized intervals, specificity, negative predictive value, and accuracy are
reported in addition.

.. GENERATED FROM PYTHON SOURCE LINES 117-127

.. code-block:: Python

    from mobgap.gait_sequences.evaluation import (
        calculate_matched_gsd_performance_metrics,
    )

    matched_metrics_dict = calculate_matched_gsd_performance_metrics(
        categorized_intervals
    )

    matched_metrics_dict

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    {'tp_samples': 6436, 'fp_samples': 3339, 'fn_samples': 117, 'precision': 0.6584143222506393, 'recall': 0.9821455821761026, 'f1_score': 0.7883390494855462, 'tn_samples': 12862, 'specificity': 0.7939016110116659, 'accuracy': 0.8481146172101609, 'npv': 0.9909854380152554}

.. GENERATED FROM PYTHON SOURCE LINES 128-135

Furthermore, there is a range of high-level performance metrics that are calculated from the overall number of gait
sequences and the overall amount of gait in the reference and detected data.
Thus, they can be inferred from the reference and detected gait sequences directly without any intermediate steps
using the :func:`~mobgap.gait_sequences.evaluation.calculate_unmatched_gsd_performance_metrics` function.
As some of the unmatched metrics are reported in seconds, the function requires the sampling frequency of the
recorded data as an additional argument.
It returns a dictionary containing all metrics for the specified detected and reference gait sequences.

.. GENERATED FROM PYTHON SOURCE LINES 135-146

.. code-block:: Python

    from mobgap.gait_sequences.evaluation import (
        calculate_unmatched_gsd_performance_metrics,
    )

    unmatched_metrics_dict = calculate_unmatched_gsd_performance_metrics(
        gsd_list_detected=detected_gsd_list,
        gsd_list_reference=reference_gsd_list,
        sampling_rate_hz=test_data.sampling_rate_hz,
    )

    unmatched_metrics_dict

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    {'reference_gs_duration_s': 65.52, 'detected_gs_duration_s': 97.64, 'gs_duration_error_s': 32.120000000000005, 'gs_relative_duration_error': 0.4902319902319903, 'gs_absolute_duration_error_s': 32.120000000000005, 'gs_absolute_relative_duration_error': 0.4902319902319903, 'gs_absolute_relative_duration_error_log': 0.3989318059799454, 'detected_num_gs': 7, 'reference_num_gs': 6, 'num_gs_error': 1, 'num_gs_relative_error': 0.16666666666666666, 'num_gs_absolute_error': 1, 'num_gs_absolute_relative_error': 0.16666666666666666, 'num_gs_absolute_relative_error_log': 0.15415067982725836}

.. GENERATED FROM PYTHON SOURCE LINES 147-168

Direct Gait Sequence Matching
-----------------------------
Apart from the performance evaluation methods mentioned above, it might be useful in some cases to identify how many
and which detected gait sequences reliably match the ground truth.
This is primarily useful when further parameters are associated with each gait sequence, e.g., the gait speed.
In this case, matching gait sequences that cover the same gait regions allows proper comparison of these parameters.
For more information on this, see the example on the overall parameter evaluation on Walking-Bout level (TODO).

For this purpose, the :func:`~mobgap.gait_sequences.evaluation.categorize_intervals` function can be used.
It marks a detected gait sequence as a match (`tp`) if it overlaps with a reference gait sequence by at least a given
amount. Detected gait sequences without a match are reported as `fp`, and reference gait sequences without a match as
`fn`.
The `gs_id_detected` and `gs_id_reference` columns of the resulting DataFrame contain the indices of the respective
detected and reference gait sequences.
We can see that with an overlap threshold of 0.7 (70%), two of the seven detected gait sequences are considered
matches with the reference gait sequences for our example recording.

Note that this threshold is enforced in both directions, i.e., the detected gait sequence must overlap with the
reference gait sequence by at least 70% and vice versa.
This means that only 1-to-1 matches are possible.
If multiple detected gait sequences overlap with the same reference gait sequence, only the one with the highest
overlap is considered a match.
If one gait sequence is covered by multiple smaller ones, possibly none of them is considered a match.

.. GENERATED FROM PYTHON SOURCE LINES 168-178

.. code-block:: Python

    from mobgap.gait_sequences.evaluation import categorize_intervals

    matches = categorize_intervals(
        gsd_list_detected=detected_gsd_list,
        gsd_list_reference=reference_gsd_list,
        overlap_threshold=0.7,
    )

    matches

.. raw:: html
gs_id_detected gs_id_reference match_type
match_id
0 0 NaN fp
1 1 NaN fp
2 2 2 tp
3 3 NaN fp
4 4 3 tp
5 5 NaN fp
6 6 NaN fp
7 NaN 0 fn
8 NaN 1 fn
9 NaN 4 fn
10 NaN 5 fn
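
The two-directional overlap rule described above can be sketched in a few lines of plain Python.
This is an illustrative brute-force version, not the mobgap implementation: the helper name ``match_sequences`` is hypothetical, overlap is measured as ``end - start`` (an assumption), and the sketch relies on the threshold being above 0.5 and on the sequences within each list not overlapping each other, so that each sequence can match at most once.

.. code-block:: Python

    # Start/end sample indices copied from the tables at the top of this example.
    detected = {
        0: (900, 2101), 1: (4650, 6151), 2: (9600, 10801), 3: (11250, 12151),
        4: (12300, 14851), 5: (19950, 21151), 6: (21300, 22501),
    }
    reference = {
        0: (1019, 1768), 1: (4534, 5549), 2: (9665, 10569),
        3: (12337, 14633), 4: (20151, 20982), 5: (21378, 22129),
    }


    def match_sequences(detected, reference, overlap_threshold=0.7):
        """Return {detected_id: reference_id} for all pairs passing the threshold in both directions."""
        matches = {}
        for det_id, (d_start, d_end) in detected.items():
            for ref_id, (r_start, r_end) in reference.items():
                overlap = min(d_end, r_end) - max(d_start, r_start)
                if overlap <= 0:
                    continue  # the two sequences do not intersect
                covers_detected = overlap / (d_end - d_start) >= overlap_threshold
                covers_reference = overlap / (r_end - r_start) >= overlap_threshold
                if covers_detected and covers_reference:
                    matches[det_id] = ref_id
        return matches


    matches = match_sequences(detected, reference, overlap_threshold=0.7)
    false_positives = sorted(set(detected) - set(matches))
    false_negatives = sorted(set(reference) - set(matches.values()))

    print(matches, false_positives, false_negatives)

For example, detected sequence 0 contains reference sequence 0 completely, but the shared region covers only about 62% of the detected sequence, so the pair is rejected.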


.. GENERATED FROM PYTHON SOURCE LINES 179-188

Running a full evaluation pipeline
----------------------------------
Instead of manually evaluating and investigating the performance of a GSD algorithm on a single piece of data, we
often want to run a full evaluation on an entire dataset.
This can be done using the :class:`~mobgap.gait_sequences.pipeline.GsdEmulationPipeline` class and some ``tpcp``
functions.

But let's start with selecting some data.
We want to use all the simulated real-world walking data from the INDIP reference system (Test11).

.. GENERATED FROM PYTHON SOURCE LINES 188-193

.. code-block:: Python

    simulated_real_world_walking = LabExampleDataset(
        reference_system="INDIP"
    ).get_subset(test="Test11")

    simulated_real_world_walking

.. raw:: html

LabExampleDataset [3 groups/rows]

cohort participant_id time_measure test trial
0 HA 001 TimeMeasure1 Test11 Trial1
1 HA 002 TimeMeasure1 Test11 Trial1
2 MS 001 TimeMeasure1 Test11 Trial1


.. GENERATED FROM PYTHON SOURCE LINES 194-197

Now we can use the :class:`~mobgap.gait_sequences.pipeline.GsdEmulationPipeline` class to directly run a GSD
algorithm on a datapoint.
The pipeline takes care of extracting the required data.

.. GENERATED FROM PYTHON SOURCE LINES 197-203

.. code-block:: Python

    from mobgap.gait_sequences.pipeline import GsdEmulationPipeline

    pipeline = GsdEmulationPipeline(GsdIluz())

    pipeline.safe_run(simulated_real_world_walking[0]).gs_list_

.. raw:: html
start end
gs_id
0 600 1201
1 4350 5251
2 7800 9001
3 9300 10201
4 10950 11551
5 13050 13651


.. GENERATED FROM PYTHON SOURCE LINES 204-211

Note that this only ran the pipeline on a single datapoint.
If we want to run it on all datapoints and evaluate the performance of the algorithm, we can use the
:func:`~tpcp.validate.validate` function.
For this, we need to provide a score function that runs and evaluates the pipeline on a single datapoint.
We provide a default score function for GSD that calculates all the metrics shown above per datapoint and combined
across all gait sequences (i.e. all gait sequences across all datapoints are pooled before metrics are calculated).

.. GENERATED FROM PYTHON SOURCE LINES 211-221

.. code-block:: Python

    from mobgap.gait_sequences.evaluation import gsd_score
    from tpcp.validate import validate

    evaluation_results = pd.DataFrame(
        validate(pipeline, simulated_real_world_walking, scoring=gsd_score)
    )

    evaluation_results.drop(
        ["single__raw__reference", "single__raw__detected"], axis=1
    ).T

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Datapoints: 0%| | 0/3 [00:00
0
debug__score_time 0.711474
data_labels [(HA, 001, TimeMeasure1, Test11, Trial1), (HA,...
single__reference_gs_duration_s [40.44, 40.82, 65.52]
single__detected_gs_duration_s [48.12, 42.08, 97.64]
single__gs_duration_error_s [7.68, 1.259999999999998, 32.120000000000005]
... ...
agg__combined__f1_score 0.699236
agg__combined__tn_samples 30755
agg__combined__specificity 0.812507
agg__combined__accuracy 0.80828
agg__combined__npv 0.9118

76 rows × 1 columns
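
The difference between averaging per-datapoint metrics and the pooled ("combined") metrics can be seen in a small sketch.
The sample counts below are hypothetical and chosen only for illustration. The sketch also assumes that pooling is equivalent to summing the sample counts of all datapoints before the metric is calculated.

.. code-block:: Python

    # Hypothetical per-datapoint sample counts (NOT taken from the example data).
    per_datapoint_counts = [
        {"tp": 900, "fp": 100},  # long recording, high precision
        {"tp": 50, "fp": 50},  # short recording, low precision
    ]


    def precision(tp, fp):
        """Fraction of samples detected as gait that are gait in the reference."""
        return tp / (tp + fp)


    # Option 1: calculate the metric per datapoint, then average.
    per_datapoint_precision = [precision(**c) for c in per_datapoint_counts]
    mean_precision = sum(per_datapoint_precision) / len(per_datapoint_precision)

    # Option 2: pool the counts of all datapoints, then calculate the metric once.
    pooled_tp = sum(c["tp"] for c in per_datapoint_counts)
    pooled_fp = sum(c["fp"] for c in per_datapoint_counts)
    combined_precision = precision(pooled_tp, pooled_fp)

    print(mean_precision, combined_precision)

In the pooled variant, long recordings contribute more samples and therefore carry more weight than short ones, which is why the two values can differ noticeably.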



.. GENERATED FROM PYTHON SOURCE LINES 222-224

In addition to the metrics, the method also returns the raw reference and detected gait sequences.
These can be used for further custom analysis.

.. GENERATED FROM PYTHON SOURCE LINES 224-227

.. code-block:: Python

    evaluation_results["single__raw__reference"][0]

.. raw:: html
start end
cohort participant_id time_measure test trial wb_id
HA 001 TimeMeasure1 Test11 Trial1 0 632 988
1 2864 3325
2 3853 5085
3 7641 8621
4 9451 9932
5 11989 12517
002 TimeMeasure1 Test11 Trial1 0 485 1131
1 1746 3554
2 6083 7708
MS 001 TimeMeasure1 Test11 Trial1 0 1019 1768
1 4534 5549
2 9665 10569
3 12337 14633
4 20151 20982
5 21378 22129


.. GENERATED FROM PYTHON SOURCE LINES 228-230

.. code-block:: Python

    evaluation_results["single__raw__detected"][0]

.. raw:: html
start end
cohort participant_id time_measure test trial gs_id
HA 001 TimeMeasure1 Test11 Trial1 0 600 1201
1 4350 5251
2 7800 9001
3 9300 10201
4 10950 11551
5 13050 13651
002 TimeMeasure1 Test11 Trial1 0 450 1201
1 2700 3301
2 5700 7951
3 15000 15601
MS 001 TimeMeasure1 Test11 Trial1 0 900 2101
1 4650 6151
2 9600 10801
3 11250 12151
4 12300 14851
5 19950 21151
6 21300 22501
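
As one possible custom analysis on such raw results, the sketch below counts the detected gait sequences per datapoint and calculates their mean length.
It rebuilds a reduced copy of the table above by hand, keeping only the ``cohort``, ``participant_id``, and ``gs_id`` index levels to stay short. With the real results, the same ``groupby`` calls should work on the DataFrame returned by ``validate``, though this is not verified here.

.. code-block:: Python

    import pandas as pd

    # (cohort, participant_id, gs_id, start, end) rows copied from the table above.
    rows = [
        ("HA", "001", 0, 600, 1201), ("HA", "001", 1, 4350, 5251),
        ("HA", "001", 2, 7800, 9001), ("HA", "001", 3, 9300, 10201),
        ("HA", "001", 4, 10950, 11551), ("HA", "001", 5, 13050, 13651),
        ("HA", "002", 0, 450, 1201), ("HA", "002", 1, 2700, 3301),
        ("HA", "002", 2, 5700, 7951), ("HA", "002", 3, 15000, 15601),
        ("MS", "001", 0, 900, 2101), ("MS", "001", 1, 4650, 6151),
        ("MS", "001", 2, 9600, 10801), ("MS", "001", 3, 11250, 12151),
        ("MS", "001", 4, 12300, 14851), ("MS", "001", 5, 19950, 21151),
        ("MS", "001", 6, 21300, 22501),
    ]
    raw_detected = pd.DataFrame(
        rows, columns=["cohort", "participant_id", "gs_id", "start", "end"]
    ).set_index(["cohort", "participant_id", "gs_id"])

    datapoint_levels = ["cohort", "participant_id"]

    # Number of detected gait sequences per datapoint.
    n_gs_per_datapoint = raw_detected.groupby(level=datapoint_levels).size()

    # Mean gait sequence length in samples per datapoint.
    length_samples = raw_detected["end"] - raw_detected["start"]
    mean_length_samples = length_samples.groupby(level=datapoint_levels).mean()

    print(n_gs_per_datapoint)
    print(mean_length_samples)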


.. GENERATED FROM PYTHON SOURCE LINES 231-251

If you want to calculate additional metrics, you can either create a custom score function or subclass the pipeline
and overwrite the score function.

Parameter Optimization
----------------------
Simply applying an algorithm to the data for evaluation is often not enough.
In the case of machine learning algorithms or algorithms with tunable parameters, we might want to optimize these
parameters to get the best possible performance.
To avoid overfitting, we can use cross-validation to evaluate the performance of the algorithm on multiple splits of
the data.

Below, we demonstrate this procedure by using a simple grid search to optimize the window length of the GSD Iluz
algorithm and by evaluating this approach within a 3-fold cross-validation.
Per fold, we select the window length that leads to the highest precision on the "train set" and evaluate the
performance on the "test set".

Note that on a real-world dataset, you would likely need to perform a group-wise stratified cross-validation to avoid
data leakage between multiple trials from the same participant and to ensure an equal distribution of patient cohorts
across the folds.
See the detailed ``tpcp`` examples on these topics.

.. GENERATED FROM PYTHON SOURCE LINES 251-282

.. code-block:: Python

    from sklearn.model_selection import ParameterGrid
    from tpcp.optimize import GridSearch
    from tpcp.validate import cross_validate

    para_grid = ParameterGrid({"algo__window_length_s": [2, 3, 4]})

    cross_validate_results = pd.DataFrame(
        cross_validate(
            GridSearch(
                GsdEmulationPipeline(GsdIluz()),
                para_grid,
                return_optimized="precision",
                scoring=gsd_score,
            ),
            simulated_real_world_walking,
            scoring=gsd_score,
            cv=3,
            return_train_score=True,
        )
    )

    cross_validate_results.drop(
        [
            "test__single__raw__reference",
            "test__single__raw__detected",
            "train__single__raw__reference",
            "train__single__raw__detected",
        ],
        axis=1,
    ).T

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    CV Folds: 0%| | 0/3 [00:00
0 1 2
debug__score_time 0.267713 0.257381 0.279818
debug__optimize_time 1.486507 1.483054 1.433278
train__data_labels [(HA, 002, TimeMeasure1, Test11, Trial1), (MS,... [(HA, 001, TimeMeasure1, Test11, Trial1), (MS,... [(HA, 001, TimeMeasure1, Test11, Trial1), (HA,...
test__data_labels [(HA, 001, TimeMeasure1, Test11, Trial1)] [(HA, 002, TimeMeasure1, Test11, Trial1)] [(MS, 001, TimeMeasure1, Test11, Trial1)]
test__single__reference_gs_duration_s [40.44] [40.82] [65.52]
... ... ... ...
train__agg__combined__f1_score 0.756254 0.74228 0.680912
train__agg__combined__tn_samples 23439 20410 18330
train__agg__combined__specificity 0.833683 0.786876 0.846612
train__agg__combined__accuracy 0.845118 0.819097 0.813975
train__agg__combined__npv 0.946457 0.949656 0.892014

152 rows × 3 columns
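
The group-wise stratified splitting mentioned above can be illustrated with scikit-learn's ``StratifiedGroupKFold``.
This is only a sketch on a hypothetical dataset index with two trials per participant. How such a splitter is passed to ``cross_validate`` depends on the ``tpcp`` version and is not shown here.

.. code-block:: Python

    from sklearn.model_selection import StratifiedGroupKFold

    # Hypothetical dataset index: two trials per participant, two cohorts.
    participants = [
        "HA_001", "HA_001", "HA_002", "HA_002",
        "MS_001", "MS_001", "MS_002", "MS_002",
    ]
    cohorts = ["HA", "HA", "HA", "HA", "MS", "MS", "MS", "MS"]
    datapoints = list(range(len(participants)))

    # Groups (participants) are never split across train and test, while the
    # splitter tries to keep the cohort distribution similar in each fold.
    splitter = StratifiedGroupKFold(n_splits=2)

    folds = []
    for train_idx, test_idx in splitter.split(
        datapoints, y=cohorts, groups=participants
    ):
        train_groups = {participants[i] for i in train_idx}
        test_groups = {participants[i] for i in test_idx}
        folds.append((train_groups, test_groups))

    for train_groups, test_groups in folds:
        print(sorted(train_groups), sorted(test_groups))

Because all trials of a participant end up on the same side of a split, information from a participant's training trials cannot leak into the evaluation on that participant's test trials.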



.. GENERATED FROM PYTHON SOURCE LINES 283-286

In general, it is a good idea to use ``cross_validate`` also for algorithms that do not have tunable parameters.
This way, you can ensure that the performance of the algorithm is stable across different splits of the data, and it
allows a direct comparison between tunable and non-tunable algorithms.

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 10.678 seconds)

**Estimated memory usage:** 81 MB

.. _sphx_glr_download_auto_examples_gait_sequences__03_gsd_evaluation.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: _03_gsd_evaluation.ipynb <_03_gsd_evaluation.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: _03_gsd_evaluation.py <_03_gsd_evaluation.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: _03_gsd_evaluation.zip <_03_gsd_evaluation.zip>`