.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/gait_sequences/_03_gsd_evaluation.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end `
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_gait_sequences__03_gsd_evaluation.py:

.. _gsd_evaluation:

GSD Evaluation
==============

This example shows how to evaluate the output of a gait sequence detection (GSD) algorithm against a reference
and thus how to rate the performance of a GSD algorithm.

.. GENERATED FROM PYTHON SOURCE LINES 9-14

.. code-block:: default

    import pandas as pd

    from mobgap.data import LabExampleDataset
    from mobgap.gait_sequences import GsdIluz

.. GENERATED FROM PYTHON SOURCE LINES 15-21

Loading some example data
-------------------------
First, we load example data and apply the GSD Iluz algorithm to it.
However, you can use any other GSD algorithm as well.
To have a reference to compare the results to, we also load the corresponding ground truth data.
These steps are explained in more detail in the :ref:`GSD Iluz example `.

.. GENERATED FROM PYTHON SOURCE LINES 21-55

.. code-block:: default

    from mobgap.utils.conversions import to_body_frame


    def load_data():
        lab_example_data = LabExampleDataset(reference_system="INDIP")
        single_test = lab_example_data.get_subset(
            cohort="MS", participant_id="001", test="Test11", trial="Trial1"
        )
        return single_test


    def calculate_gsd_iluz_output(single_test_data):
        """Calculate the GSD Iluz output for one sensor from the test data."""
        det_gsd = (
            GsdIluz()
            .detect(
                to_body_frame(single_test_data.data_ss),
                sampling_rate_hz=single_test_data.sampling_rate_hz,
            )
            .gs_list_
        )
        return det_gsd


    def load_reference(single_test_data):
        """Load the reference gait sequences from the test data."""
        ref_gsd = single_test_data.reference_parameters_.wb_list
        return ref_gsd


    test_data = load_data()
    detected_gsd_list = calculate_gsd_iluz_output(test_data)
    reference_gsd_list = load_reference(test_data)

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    /home/docs/checkouts/readthedocs.org/user_builds/mobgap/checkouts/v0.9.0/mobgap/data/_mobilised_matlab_loader.py:1082: UserWarning: There were multiple ICs with the same index value, but different LR labels. This is likely an issue with the reference system you should further investigate. For now, we set the `lr_label` of the stride corresponding to this IC to Nan. However, both values still remain in the IC list.
      return parse_reference_parameters(

.. GENERATED FROM PYTHON SOURCE LINES 56-57

Each row of the resulting DataFrame is a gait sequence that is characterized by its start and end index in samples.

.. GENERATED FROM PYTHON SOURCE LINES 57-59

.. code-block:: default

    detected_gsd_list

.. raw:: html
start end
gs_id
0 750 1651
1 4650 6151
2 12900 14851
3 20100 21151
4 21300 22501


.. GENERATED FROM PYTHON SOURCE LINES 60-62

.. code-block:: default

    reference_gsd_list

.. raw:: html
start end n_strides duration_s length_m avg_walking_speed_mps avg_cadence_spm avg_stride_length_m termination_reason
wb_id
0 1019 1768 9 7.48 4.468932 0.847668 107.795850 0.942678 Pause
1 4534 5549 11 10.14 2.900453 0.365176 93.396106 0.483923 Pause
2 9665 10569 9 9.03 2.140232 0.294058 75.981133 0.506458 Pause
3 12337 14633 28 22.95 11.201110 0.634425 92.337768 0.803933 Pause
4 20151 20982 11 8.30 2.390709 0.371746 87.915774 0.507484 Pause
5 21378 22129 9 7.50 2.517558 0.492965 95.365740 0.599360 Pause


.. GENERATED FROM PYTHON SOURCE LINES 63-94

Validation of algorithm output against a reference
--------------------------------------------------
Let's quantify how the algorithm output compares to the reference labels.
To gain a detailed insight into the performance of the algorithm, we can look into the individual matches between the
detected and reference gait sequences.
Note that there are two different ways to approach this:

1. We can determine for each sample whether it is correctly detected as gait or not.
2. We can check on the level of gait sequences whether a detected gait sequence matches a reference gait
   sequence (by a certain overlap threshold).

In mobgap, we provide functions to calculate both types of performance metrics.
Let's start with the first one.

Sample-wise performance evaluation
----------------------------------
To do this, we use the :func:`~mobgap.gait_sequences.evaluation.categorize_intervals_per_sample` function to identify
overlapping regions between the detected gait sequences and the reference gait sequences.
These overlapping regions can then be converted into sample-wise classifications of true positives, false positives,
and false negatives.
Besides the mandatory detected and reference gait sequences, the total number of samples in the recording can be
passed as an optional argument.
If provided, the intervals where no gait sequences are present in either the reference or the detected list are
also reported.
Later on, we can use these categorized intervals to calculate a set of higher-level performance metrics.

The result is a DataFrame containing the `start` and `end` index of each categorized interval together with a
`match_type` column that contains the type of match for each interval, i.e. `tp` for true positive, `fp` for false
positive, and `fn` for false negative.
These intervals cannot be interpreted as gait sequences. They are subsequences that categorize
correctly detected samples (`tp`), falsely detected samples (`fp`), samples from the reference GSD list that were not
detected (`fn`), and (optionally) samples where no gait sequence is present in either the reference or the detected
gait sequences (`tn`).
Note that the `tn` intervals are not explicitly calculated, but are inferred from the total length of the recording
(if provided) and from the other intervals, as everything between them is considered a true negative.

.. GENERATED FROM PYTHON SOURCE LINES 94-104

.. code-block:: default

    from mobgap.gait_sequences.evaluation import categorize_intervals_per_sample

    categorized_intervals = categorize_intervals_per_sample(
        gsd_list_detected=detected_gsd_list,
        gsd_list_reference=reference_gsd_list,
        n_overall_samples=len(test_data.data_ss),
    )

    categorized_intervals

.. raw:: html
start end match_type
0 0 750 tn
1 750 1019 fp
2 1019 1651 tp
3 1651 1768 fn
4 1768 4534 tn
5 4534 4650 fn
6 4650 5549 tp
7 5549 6151 fp
8 6151 9665 tn
9 9665 10569 fn
10 10569 12337 tn
11 12337 12900 fn
12 12900 14633 tp
13 14633 14851 fp
14 14851 20100 tn
15 20100 20151 fp
16 20151 20982 tp
17 20982 21151 fp
18 21151 21300 tn
19 21300 21378 fp
20 21378 22129 tp
21 22129 22501 fp
22 22501 22727 tn
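
To make the link between this interval table and sample-level metrics concrete, here is a minimal sketch in plain
Python. It copies the intervals from the table above and is not the mobgap implementation. The helper name
``count_samples`` is hypothetical, and the choice to count every interval as ``end - start + 1`` samples is an
assumption inferred from the sample counts printed in the metrics output further down, not taken from the library
documentation.

.. code-block:: python

    # Categorized intervals copied from the table above: (start, end, match_type).
    categorized = [
        (0, 750, "tn"), (750, 1019, "fp"), (1019, 1651, "tp"),
        (1651, 1768, "fn"), (1768, 4534, "tn"), (4534, 4650, "fn"),
        (4650, 5549, "tp"), (5549, 6151, "fp"), (6151, 9665, "tn"),
        (9665, 10569, "fn"), (10569, 12337, "tn"), (12337, 12900, "fn"),
        (12900, 14633, "tp"), (14633, 14851, "fp"), (14851, 20100, "tn"),
        (20100, 20151, "fp"), (20151, 20982, "tp"), (20982, 21151, "fp"),
        (21151, 21300, "tn"), (21300, 21378, "fp"), (21378, 22129, "tp"),
        (22129, 22501, "fp"), (22501, 22727, "tn"),
    ]


    def count_samples(intervals):
        """Sum the number of samples per match type (hypothetical helper)."""
        counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
        for start, end, match_type in intervals:
            # Assumption: both interval borders are counted as samples.
            counts[match_type] += end - start + 1
        return counts


    counts = count_samples(categorized)

    # Standard definitions of the sample-wise metrics.
    precision = counts["tp"] / (counts["tp"] + counts["fp"])
    recall = counts["tp"] / (counts["tp"] + counts["fn"])
    f1_score = 2 * precision * recall / (precision + recall)

    print(counts, precision, recall, f1_score)

Under this counting convention, the sketch reproduces the sample counts and the precision, recall, and F1 score
reported for this recording below.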


.. GENERATED FROM PYTHON SOURCE LINES 105-117

Based on the individually categorized tp, fp, fn, and tn intervals, common performance metrics, e.g., F1 score,
precision, or recall can be calculated.
For this purpose, the :func:`~mobgap.gait_sequences.evaluation.calculate_matched_gsd_performance_metrics` function
can be used.
It calculates the metrics based on the "matched" GSD intervals, i.e., the categorized interval list where every
entry has a match type (tp, fp, fn, tn) assigned.
Therefore, :func:`~mobgap.gait_sequences.evaluation.categorize_intervals_per_sample` must be called first.
The categorized intervals can then be passed as an argument to
:func:`~mobgap.gait_sequences.evaluation.calculate_matched_gsd_performance_metrics`.
It returns a dictionary containing the metrics for the specified categorized intervals DataFrame.
The total number of samples per match type, precision, recall, and F1 score are always calculated.
If true negatives are present in the categorized intervals, specificity, negative predictive value, and accuracy
are reported as well.

.. GENERATED FROM PYTHON SOURCE LINES 117-127

.. code-block:: default

    from mobgap.gait_sequences.evaluation import (
        calculate_matched_gsd_performance_metrics,
    )

    matched_metrics_dict = calculate_matched_gsd_performance_metrics(
        categorized_intervals
    )

    matched_metrics_dict

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    {'tp_samples': 4851, 'fp_samples': 1766, 'fn_samples': 1704, 'precision': 0.733111682031132, 'recall': 0.740045766590389, 'f1_score': 0.736562405101731, 'tn_samples': 14429, 'specificity': 0.8909539981475764, 'accuracy': 0.8474725274725274, 'npv': 0.894377983016178}

.. GENERATED FROM PYTHON SOURCE LINES 128-135

Furthermore, there is a range of high-level performance metrics that are calculated purely from the overall
number and duration of gait sequences in the reference and detected data.
Thus, they can be inferred from the reference and detected gait sequences directly without any intermediate steps
using the :func:`~mobgap.gait_sequences.evaluation.calculate_unmatched_gsd_performance_metrics` function.
As some of the unmatched metrics are reported in seconds, the function requires the sampling frequency of the
recorded data as an additional argument.
It returns a dictionary containing all metrics for the specified detected and reference gait sequences.

.. GENERATED FROM PYTHON SOURCE LINES 135-146

.. code-block:: default

    from mobgap.gait_sequences.evaluation import (
        calculate_unmatched_gsd_performance_metrics,
    )

    unmatched_metrics_dict = calculate_unmatched_gsd_performance_metrics(
        gsd_list_detected=detected_gsd_list,
        gsd_list_reference=reference_gsd_list,
        sampling_rate_hz=test_data.sampling_rate_hz,
    )

    unmatched_metrics_dict

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    {'reference_gs_duration_s': 65.52, 'detected_gs_duration_s': 66.1, 'gs_duration_error_s': 0.5799999999999983, 'gs_relative_duration_error': 0.008852258852258826, 'gs_absolute_duration_error_s': 0.5799999999999983, 'gs_absolute_relative_duration_error': 0.008852258852258826, 'gs_absolute_relative_duration_error_log': 0.008813307312826587, 'detected_num_gs': 5, 'reference_num_gs': 6, 'num_gs_error': -1, 'num_gs_relative_error': -0.16666666666666666, 'num_gs_absolute_error': 1, 'num_gs_absolute_relative_error': 0.16666666666666666, 'num_gs_absolute_relative_error_log': 0.15415067982725836}

.. GENERATED FROM PYTHON SOURCE LINES 147-168

Direct Gait Sequence Matching
------------------------------
Apart from the performance evaluation methods mentioned above, it might be useful in some cases to identify how many
and which detected gait sequences reliably match the ground truth.
This is primarily useful when further parameters are associated with each gait sequence, e.g., the gait speed.
In this case, matching gait sequences that cover the same gait regions allows proper comparison of these parameters.
For more information on this, see the example on the overall parameter evaluation on Walking-Bout level (TODO).

For this purpose, the :func:`~mobgap.gait_sequences.evaluation.categorize_intervals` function can be used.
It identifies the detected gait sequences that overlap with a reference gait sequence by at least a given amount.
As the output below shows, each row of the result refers to a detected gait sequence (`gs_id_detected`), a reference
gait sequence (`gs_id_reference`), or both, together with a `match_type`:
matched pairs are reported as `tp`, unmatched detected gait sequences as `fp`, and unmatched reference gait
sequences as `fn`.
We can see that with an overlap threshold of 0.7 (70%), three of the five detected gait sequences are considered
matches with the reference gait sequences for our example recording.
Note that this threshold is enforced in both directions, i.e., the detected gait sequence must overlap with the
reference gait sequence by at least 70% and vice versa.
This means that only one-to-one matches are possible.
If multiple detected gait sequences overlap with the same reference gait sequence, only the one with the highest
overlap is considered a match.
If one gait sequence is covered by multiple smaller ones, possibly none of them is considered a match.

.. GENERATED FROM PYTHON SOURCE LINES 168-178

.. code-block:: default

    from mobgap.gait_sequences.evaluation import categorize_intervals

    matches = categorize_intervals(
        gsd_list_detected=detected_gsd_list,
        gsd_list_reference=reference_gsd_list,
        overlap_threshold=0.7,
    )

    matches

.. raw:: html
gs_id_detected gs_id_reference match_type
match_id
0 0 0 tp
1 1 NaN fp
2 2 3 tp
3 3 4 tp
4 4 NaN fp
5 NaN 1 fn
6 NaN 2 fn
7 NaN 5 fn
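
The bidirectional overlap rule described above can be illustrated with a short sketch in plain Python.
It is not the mobgap implementation: the function name ``match_intervals`` is hypothetical, the start and end values
are copied from the detected and reference tables shown earlier, and the interval length is taken as ``end - start``.
The sketch also assumes a threshold above 0.5 and non-overlapping sequences within each list, so that no tie-breaking
between competing candidates is needed.

.. code-block:: python

    # Start/end values copied from the detected and reference tables above.
    detected = {
        0: (750, 1651), 1: (4650, 6151), 2: (12900, 14851),
        3: (20100, 21151), 4: (21300, 22501),
    }
    reference = {
        0: (1019, 1768), 1: (4534, 5549), 2: (9665, 10569),
        3: (12337, 14633), 4: (20151, 20982), 5: (21378, 22129),
    }


    def match_intervals(detected, reference, overlap_threshold=0.7):
        """Return (detected_id, reference_id) pairs that satisfy the
        overlap threshold in both directions (hypothetical helper)."""
        matches = []
        for det_id, (det_start, det_end) in detected.items():
            for ref_id, (ref_start, ref_end) in reference.items():
                overlap = min(det_end, ref_end) - max(det_start, ref_start)
                if overlap <= 0:
                    continue  # The two intervals do not overlap at all.
                # The overlap must cover enough of BOTH intervals.
                covers_detected = overlap / (det_end - det_start)
                covers_reference = overlap / (ref_end - ref_start)
                if min(covers_detected, covers_reference) >= overlap_threshold:
                    matches.append((det_id, ref_id))
        return matches


    matched_pairs = match_intervals(detected, reference)
    print(matched_pairs)

For this recording, the sketch yields the same three matched pairs as the ``tp`` rows in the table above.
Detected sequences 1 and 4 overlap with a reference sequence, but the overlap covers less than 70% of the detected
sequence, so they are not matched.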


.. GENERATED FROM PYTHON SOURCE LINES 179-188

Running a full evaluation pipeline
----------------------------------
Instead of manually evaluating and investigating the performance of a GSD algorithm on a single piece of data, we
often want to run a full evaluation on an entire dataset.
This can be done using the :class:`~mobgap.gait_sequences.pipeline.GsdEmulationPipeline` class and some ``tpcp``
functions.

But let's start with selecting some data.
We want to use all the simulated real-world walking data from the INDIP reference system (Test11).

.. GENERATED FROM PYTHON SOURCE LINES 188-193

.. code-block:: default

    simulated_real_world_walking = LabExampleDataset(
        reference_system="INDIP"
    ).get_subset(test="Test11")

    simulated_real_world_walking

.. raw:: html

LabExampleDataset [3 groups/rows]

cohort participant_id time_measure test trial
0 HA 001 TimeMeasure1 Test11 Trial1
1 HA 002 TimeMeasure1 Test11 Trial1
2 MS 001 TimeMeasure1 Test11 Trial1


.. GENERATED FROM PYTHON SOURCE LINES 194-197

Now we can use the :class:`~mobgap.gait_sequences.pipeline.GsdEmulationPipeline` class to directly run a GSD
algorithm on a datapoint.
The pipeline takes care of extracting the required data.

.. GENERATED FROM PYTHON SOURCE LINES 197-203

.. code-block:: default

    from mobgap.gait_sequences.pipeline import GsdEmulationPipeline

    pipeline = GsdEmulationPipeline(GsdIluz())

    pipeline.safe_run(simulated_real_world_walking[0]).gs_list_

.. raw:: html
start end
gs_id
0 600 1201
1 2700 4201
2 4350 5251
3 7800 8851
4 9450 10201
5 10950 11551
6 13050 13651


.. GENERATED FROM PYTHON SOURCE LINES 204-211

Note that this only ran the pipeline on a single datapoint.
If we want to run it on all datapoints and evaluate the performance of the algorithm, we can use the
:func:`~tpcp.validate.validate` function.

It uses the built-in ``score`` method of the pipeline to calculate the performance of the algorithm on each datapoint
and then takes the mean of the results.
All mean and individual results are returned in a large dictionary that can be easily converted to a pandas DataFrame.

.. GENERATED FROM PYTHON SOURCE LINES 211-218

.. code-block:: default

    from tpcp.validate import validate

    evaluation_results = pd.DataFrame(
        validate(pipeline, simulated_real_world_walking)
    )

    evaluation_results.drop(["single__reference", "single__detected"], axis=1).T

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Datapoints: 0%| | 0/3 [00:00
0
debug__score_time 0.907791
data_labels [(HA, 001, TimeMeasure1, Test11, Trial1), (HA,...
single__reference_gs_duration_s [40.44, 40.82, 65.52]
single__detected_gs_duration_s [60.14, 49.62, 66.1]
single__gs_duration_error_s [19.700000000000003, 8.799999999999997, 0.5799...
single__gs_relative_duration_error [0.487141444114738, 0.21558059774620278, 0.008...
single__gs_absolute_duration_error_s [19.700000000000003, 8.799999999999997, 0.5799...
single__gs_absolute_relative_duration_error [0.487141444114738, 0.21558059774620278, 0.008...
single__gs_absolute_relative_duration_error_log [0.3968557834081124, 0.19522182088195622, 0.00...
single__detected_num_gs [7, 6, 5]
single__reference_num_gs [6, 3, 6]
single__num_gs_error [1, 3, -1]
single__num_gs_relative_error [0.16666666666666666, 1.0, -0.16666666666666666]
single__num_gs_absolute_error [1, 3, 1]
single__num_gs_absolute_relative_error [0.16666666666666666, 1.0, 0.16666666666666666]
single__num_gs_absolute_relative_error_log [0.15415067982725836, 0.6931471805599453, 0.15...
single__tp_samples [3208, 3132, 4851]
single__fp_samples [2815, 1835, 1766]
single__fn_samples [839, 955, 1704]
single__precision [0.5326249377386685, 0.6305617072679686, 0.733...
single__recall [0.7926859402026192, 0.7663322730609249, 0.740...
single__f1_score [0.6371400198609732, 0.6918489065606361, 0.736...
single__tn_samples [6923, 10080, 14429]
single__specificity [0.7109262682275621, 0.8459924464960135, 0.890...
single__accuracy [0.7349292709466811, 0.8256467941507312, 0.847...
single__npv [0.8919093017263592, 0.9134571816946081, 0.894...
agg__reference_gs_duration_s 48.926667
agg__detected_gs_duration_s 58.62
agg__gs_duration_error_s 9.693333
agg__gs_relative_duration_error 0.237191
agg__gs_absolute_duration_error_s 9.693333
agg__gs_absolute_relative_duration_error 0.237191
agg__gs_absolute_relative_duration_error_log 0.200297
agg__detected_num_gs 6.0
agg__reference_num_gs 5.0
agg__num_gs_error 1.0
agg__num_gs_relative_error 0.333333
agg__num_gs_absolute_error 1.666667
agg__num_gs_absolute_relative_error 0.444444
agg__num_gs_absolute_relative_error_log 0.333816
agg__tp_samples 3730.333333
agg__fp_samples 2138.666667
agg__fn_samples 1166.0
agg__precision 0.632099
agg__recall 0.766355
agg__f1_score 0.688517
agg__tn_samples 10477.333333
agg__specificity 0.815958
agg__accuracy 0.802683
agg__npv 0.899915
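
To illustrate what "taking the mean of the results" implies for the ``agg__`` rows, the following sketch recomputes
two of them in plain Python from the ``single__`` values printed above.
The variable names are hypothetical and only the per-datapoint values are taken from the table.
It also contrasts the mean of the per-datapoint precision values with a precision computed from pooled sample
counts, which is a different quantity.

.. code-block:: python

    from statistics import mean

    # Per-datapoint values copied from the ``single__`` rows above.
    tp_samples = [3208, 3132, 4851]
    fp_samples = [2815, 1835, 1766]
    num_gs_absolute_error = [1, 3, 1]

    # Precision is calculated per datapoint first and averaged afterwards.
    per_datapoint_precision = [
        tp / (tp + fp) for tp, fp in zip(tp_samples, fp_samples)
    ]
    mean_precision = mean(per_datapoint_precision)

    # Pooling all samples before dividing weights long recordings more.
    pooled_precision = sum(tp_samples) / (sum(tp_samples) + sum(fp_samples))

    mean_num_gs_absolute_error = mean(num_gs_absolute_error)

    print(mean_precision, pooled_precision, mean_num_gs_absolute_error)

The mean of the per-datapoint precision values matches the ``agg__precision`` row, while the pooled value is
slightly higher for this dataset.
Keep this difference in mind when comparing aggregated results with metrics reported elsewhere.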


.. GENERATED FROM PYTHON SOURCE LINES 219-221

In addition to the metrics, the method also returns the raw reference and detected gait sequences.
These can be used for further custom analysis.

.. GENERATED FROM PYTHON SOURCE LINES 221-224

.. code-block:: default

    evaluation_results["single__reference"][0][0]

.. raw:: html
start end
wb_id
0 632 988
1 2864 3325
2 3853 5085
3 7641 8621
4 9451 9932
5 11989 12517


.. GENERATED FROM PYTHON SOURCE LINES 225-227

.. code-block:: default

    evaluation_results["single__detected"][0][0]

.. raw:: html
start end
gs_id
0 600 1201
1 2700 4201
2 4350 5251
3 7800 8851
4 9450 10201
5 10950 11551
6 13050 13651


.. GENERATED FROM PYTHON SOURCE LINES 228-248

If you want to calculate additional metrics, you can either create a custom score function or subclass the pipeline
and overwrite the score function.

Parameter Optimization
----------------------
Simply applying an algorithm to the data for evaluation is often not enough.
In the case of machine learning algorithms or algorithms with tunable parameters, we might want to optimize these
parameters to get the best possible performance.
To avoid overfitting, we can use cross-validation to evaluate the performance of the algorithm on multiple splits
of the data.

Below we show that procedure by using a simple grid search to optimize the window length of the GSD Iluz algorithm
and evaluate this approach within a 3-fold cross-validation.
Per fold, we select the window length leading to the highest precision on the "train set" and evaluate the
performance on the "test set".

Note that on a real-world dataset, you would likely need to perform a group-wise stratified cross-validation to
avoid data leakage between multiple trials from the same participant and to ensure an equal distribution of patient
cohorts across the folds.
See the detailed ``tpcp`` examples on these topics.

.. GENERATED FROM PYTHON SOURCE LINES 248-277

.. code-block:: default

    from sklearn.model_selection import ParameterGrid
    from tpcp.optimize import GridSearch
    from tpcp.validate import cross_validate

    para_grid = ParameterGrid({"algo__window_length_s": [2, 3, 4]})

    cross_validate_results = pd.DataFrame(
        cross_validate(
            GridSearch(
                GsdEmulationPipeline(GsdIluz()),
                para_grid,
                return_optimized="precision",
            ),
            simulated_real_world_walking,
            cv=3,
            return_train_score=True,
        )
    )

    cross_validate_results.drop(
        [
            "test__single__reference",
            "test__single__detected",
            "train__single__reference",
            "train__single__detected",
        ],
        axis=1,
    ).T

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    CV Folds: 0%| | 0/3 [00:00
0 1 2
debug__score_time 0.302269 0.296322 0.322179
debug__optimize_time 1.87483 1.871321 1.799603
train__data_labels [(HA, 002, TimeMeasure1, Test11, Trial1), (MS,... [(HA, 001, TimeMeasure1, Test11, Trial1), (MS,... [(HA, 001, TimeMeasure1, Test11, Trial1), (HA,...
test__data_labels [(HA, 001, TimeMeasure1, Test11, Trial1)] [(HA, 002, TimeMeasure1, Test11, Trial1)] [(MS, 001, TimeMeasure1, Test11, Trial1)]
test__single__reference_gs_duration_s [40.44] [40.82] [65.52]
... ... ... ...
train__agg__f1_score 0.714206 0.739854 0.698675
train__agg__tn_samples 12254.5 10993.0 8886.0
train__agg__specificity 0.868473 0.84001 0.819709
train__agg__accuracy 0.83656 0.832687 0.812025
train__agg__npv 0.903918 0.916703 0.911692

100 rows × 3 columns



.. GENERATED FROM PYTHON SOURCE LINES 278-281

In general, it is a good idea to use ``cross_validate`` also for algorithms that do not have tunable parameters.
This way you can ensure that the performance of the algorithm is stable across different splits of the data, and it
allows the direct comparison between tunable and non-tunable algorithms.

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 12.054 seconds)

**Estimated memory usage:** 9 MB

.. _sphx_glr_download_auto_examples_gait_sequences__03_gsd_evaluation.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: _03_gsd_evaluation.py <_03_gsd_evaluation.py>`

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: _03_gsd_evaluation.ipynb <_03_gsd_evaluation.ipynb>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_