.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/gait_sequences/_03_gsd_evaluation.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_gait_sequences__03_gsd_evaluation.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_gait_sequences__03_gsd_evaluation.py:

.. _gsd_evaluation:

GSD Evaluation
==============

This example shows how to evaluate the output of a gait sequence detection (GSD) algorithm against a reference
and thus how to rate the performance of a GSD algorithm.

.. GENERATED FROM PYTHON SOURCE LINES 9-14

.. code-block:: Python

    import pandas as pd
    from mobgap.data import LabExampleDataset
    from mobgap.gait_sequences import GsdIluz

.. GENERATED FROM PYTHON SOURCE LINES 15-21

Loading some example data
-------------------------
First, we load example data and apply the GSD Iluz algorithm to it.
However, you can use any other GSD algorithm as well.
To have a reference to compare the results to, we also load the corresponding ground truth data.
These steps are explained in more detail in the :ref:`GSD Iluz example `.

.. GENERATED FROM PYTHON SOURCE LINES 21-55

.. code-block:: Python

    from mobgap.utils.conversions import to_body_frame


    def load_data():
        lab_example_data = LabExampleDataset(reference_system="INDIP")
        single_test = lab_example_data.get_subset(
            cohort="MS", participant_id="001", test="Test11", trial="Trial1"
        )
        return single_test


    def calculate_gsd_iluz_output(single_test_data):
        """Calculate the GSD Iluz output for one sensor from the test data."""
        det_gsd = (
            GsdIluz()
            .detect(
                to_body_frame(single_test_data.data_ss),
                sampling_rate_hz=single_test_data.sampling_rate_hz,
            )
            .gs_list_
        )
        return det_gsd


    def load_reference(single_test_data):
        """Load the reference gait sequences from the test data."""
        ref_gsd = single_test_data.reference_parameters_.wb_list
        return ref_gsd


    test_data = load_data()
    detected_gsd_list = calculate_gsd_iluz_output(test_data)
    reference_gsd_list = load_reference(test_data)

.. GENERATED FROM PYTHON SOURCE LINES 56-57

Each detected gait sequence is characterized by its start and end index in samples.

.. GENERATED FROM PYTHON SOURCE LINES 57-59

.. code-block:: Python

    detected_gsd_list

.. raw:: html

	start	end
gs_id
0	900	2101
1	4650	6151
2	9600	10801
3	11250	12151
4	12300	14851
5	19950	21151
6	21300	22501

.. GENERATED FROM PYTHON SOURCE LINES 60-62

.. code-block:: Python

    reference_gsd_list

.. raw:: html

	start	end	n_strides	duration_s	length_m	avg_walking_speed_mps	avg_cadence_spm	avg_stride_length_m	termination_reason
wb_id
0	1019	1768	9	7.48	4.468932	0.847668	107.795850	0.942678	Pause
1	4534	5549	11	10.14	2.900453	0.365176	93.396106	0.483923	Pause
2	9665	10569	9	9.03	2.140232	0.294058	75.981133	0.506458	Pause
3	12337	14633	28	22.95	11.201110	0.634425	92.337768	0.803933	Pause
4	20151	20982	11	8.30	2.390709	0.371746	87.915774	0.507484	Pause
5	21378	22129	9	7.50	2.517558	0.492965	95.365740	0.599360	Pause

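Before matching anything, the two lists can already be compared on a coarse level by their total gait duration and their number of sequences.
The following is a minimal sketch in plain Python, not the mobgap implementation.
It assumes a sampling rate of 100 Hz and counts both the start and the end sample of each interval.
Both are assumptions, chosen because they reproduce the durations of 65.52 s and 97.64 s reported by the unmatched metrics later in this example.

```python
# Start/end indices (in samples) copied from the two tables above.
detected = [(900, 2101), (4650, 6151), (9600, 10801), (11250, 12151),
            (12300, 14851), (19950, 21151), (21300, 22501)]
reference = [(1019, 1768), (4534, 5549), (9665, 10569), (12337, 14633),
             (20151, 20982), (21378, 22129)]

SAMPLING_RATE_HZ = 100.0  # assumption: not stated explicitly in the text


def total_duration_s(intervals, sampling_rate_hz):
    """Sum the interval lengths, counting both the start and the end sample."""
    n_samples = sum(end - start + 1 for start, end in intervals)
    return n_samples / sampling_rate_hz


reference_duration_s = total_duration_s(reference, SAMPLING_RATE_HZ)
detected_duration_s = total_duration_s(detected, SAMPLING_RATE_HZ)

# Signed and relative errors, always expressed relative to the reference.
duration_error_s = detected_duration_s - reference_duration_s
relative_duration_error = duration_error_s / reference_duration_s
num_gs_error = len(detected) - len(reference)
relative_num_gs_error = num_gs_error / len(reference)

print(reference_duration_s, detected_duration_s, num_gs_error)
```

Such totals say nothing about whether the detected and reference sequences actually coincide in time, which is why the matched evaluations below are needed as well.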
.. GENERATED FROM PYTHON SOURCE LINES 63-94

Validation of algorithm output against a reference
--------------------------------------------------
Let's quantify how the algorithm output compares to the reference labels.
To gain a detailed insight into the performance of the algorithm, we can look into the individual matches between the
detected and reference gait sequences.
Note that there are two different ways to approach this:

1. We can calculate for each sample whether it is correctly detected as gait or not.
2. We can check on the level of gait sequences whether a detected gait sequence matches a reference gait sequence
   (by a certain overlap threshold).

In mobgap, we provide functions to calculate both types of performance metrics.
Let's start with the first one.

Sample-wise performance evaluation
----------------------------------
To do this, we use the :func:`~mobgap.gait_sequences.evaluation.categorize_intervals_per_sample` function to identify
overlapping regions between the detected gait sequences and the reference gait sequences.
These overlapping regions can then be converted into sample-wise classifications of true positives, false positives,
and false negatives.
Besides the mandatory detected and reference gait sequences, the total number of samples in the recording can be
passed as an optional parameter.
If provided, the intervals where no gait sequences are present in either the reference or the detected list are also
reported.
Later on, we can use these categorized intervals to calculate a set of higher-level performance metrics.

The result is a DataFrame containing the `start` and `end` index of each categorized interval together with a
`match_type` column that contains the type of match for each interval, i.e. `tp` for true positive, `fp` for false
positive, and `fn` for false negative.
These intervals cannot be interpreted as gait sequences.
Rather, they are sub-intervals that mark correctly detected samples (`tp`), falsely detected samples (`fp`),
samples from the reference gsd list that were not detected (`fn`), and (optionally) samples where no gait sequence is
present in either the reference or the detected gait sequences (`tn`).
Note that the `tn` intervals are not explicitly calculated, but are inferred from the total length of the recording
(if provided) and from the other intervals, as everything between them is considered as true negative.

.. GENERATED FROM PYTHON SOURCE LINES 94-104

.. code-block:: Python

    from mobgap.gait_sequences.evaluation import categorize_intervals_per_sample

    categorized_intervals = categorize_intervals_per_sample(
        gsd_list_detected=detected_gsd_list,
        gsd_list_reference=reference_gsd_list,
        n_overall_samples=len(test_data.data_ss),
    )

    categorized_intervals

.. raw:: html

	start	end	match_type
0	0	900	tn
1	900	1019	fp
2	1019	1768	tp
3	1768	2101	fp
4	2101	4534	tn
5	4534	4650	fn
6	4650	5549	tp
7	5549	6151	fp
8	6151	9600	tn
9	9600	9665	fp
10	9665	10569	tp
11	10569	10801	fp
12	10801	11250	tn
13	11250	12151	fp
14	12151	12300	tn
15	12300	12337	fp
16	12337	14633	tp
17	14633	14851	fp
18	14851	19950	tn
19	19950	20151	fp
20	20151	20982	tp
21	20982	21151	fp
22	21151	21300	tn
23	21300	21378	fp
24	21378	22129	tp
25	22129	22501	fp
26	22501	22727	tn

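To make the step from categorized intervals to performance metrics concrete, the table above can be reduced to one sample count per match type, from which the usual confusion-matrix metrics follow.
The sketch below uses plain Python and is not the mobgap implementation.
It assumes that both the start and the end sample of each interval are counted, a convention chosen because it reproduces the sample totals printed in this example.

```python
# (start, end, match_type) rows copied from the table above.
rows = [
    (0, 900, "tn"), (900, 1019, "fp"), (1019, 1768, "tp"),
    (1768, 2101, "fp"), (2101, 4534, "tn"), (4534, 4650, "fn"),
    (4650, 5549, "tp"), (5549, 6151, "fp"), (6151, 9600, "tn"),
    (9600, 9665, "fp"), (9665, 10569, "tp"), (10569, 10801, "fp"),
    (10801, 11250, "tn"), (11250, 12151, "fp"), (12151, 12300, "tn"),
    (12300, 12337, "fp"), (12337, 14633, "tp"), (14633, 14851, "fp"),
    (14851, 19950, "tn"), (19950, 20151, "fp"), (20151, 20982, "tp"),
    (20982, 21151, "fp"), (21151, 21300, "tn"), (21300, 21378, "fp"),
    (21378, 22129, "tp"), (22129, 22501, "fp"), (22501, 22727, "tn"),
]

# Accumulate the number of samples per match type.
# Assumption: both endpoints of an interval are counted (end - start + 1).
counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
for start, end, match_type in rows:
    counts[match_type] += end - start + 1

tp, fp, fn, tn = (counts[key] for key in ("tp", "fp", "fn", "tn"))

# Metrics that only need tp, fp, and fn.
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_score = 2 * precision * recall / (precision + recall)

# Metrics that additionally need the true negatives.
specificity = tn / (tn + fp)
accuracy = (tp + tn) / (tp + fp + fn + tn)
npv = tn / (tn + fn)

print(counts)
print(precision, recall, f1_score)
```

The split into two groups of metrics mirrors the fact that specificity, accuracy, and negative predictive value are only defined when true negatives are available.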
.. GENERATED FROM PYTHON SOURCE LINES 105-117

Based on the individually categorized tp, fp, fn, and tn intervals, common performance metrics, e.g., F1 score,
precision, or recall can be calculated.
For this purpose, the :func:`~mobgap.gait_sequences.evaluation.calculate_matched_gsd_performance_metrics` function can
be used.
It calculates the metrics based on the "matched" gsd intervals, i.e., the categorized interval list where every entry
has a match type (tp, fp, fn, tn) assigned.
Therefore, :func:`~mobgap.gait_sequences.evaluation.categorize_intervals_per_sample` must be called first.
The categorized intervals can then be passed as an argument to
:func:`~mobgap.gait_sequences.evaluation.calculate_matched_gsd_performance_metrics`.
It returns a dictionary containing the metrics for the specified categorized intervals DataFrame.
The total number of samples per match type, precision, recall, and F1 score are always calculated.
If true negatives are present in the categorized intervals, specificity, negative predictive value, and accuracy
are reported in addition.

.. GENERATED FROM PYTHON SOURCE LINES 117-127

.. code-block:: Python

    from mobgap.gait_sequences.evaluation import (
        calculate_matched_gsd_performance_metrics,
    )

    matched_metrics_dict = calculate_matched_gsd_performance_metrics(
        categorized_intervals
    )

    matched_metrics_dict

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    {'tp_samples': 6436, 'fp_samples': 3339, 'fn_samples': 117, 'precision': 0.6584143222506393, 'recall': 0.9821455821761026, 'f1_score': 0.7883390494855462, 'tn_samples': 12862, 'specificity': 0.7939016110116659, 'accuracy': 0.8481146172101609, 'npv': 0.9909854380152554}

.. GENERATED FROM PYTHON SOURCE LINES 128-135

Furthermore, there is a range of high-level performance metrics that are calculated simply from the overall number
of gait sequences and the overall gait duration in the reference and detected data.
Thus, they can be inferred from the reference and detected gait sequences directly without any intermediate steps
using the :func:`~mobgap.gait_sequences.evaluation.calculate_unmatched_gsd_performance_metrics` function.
As some of the unmatched metrics are reported in seconds, the function requires the sampling frequency of the
recorded data as an additional argument.
It returns a dictionary containing all metrics for the specified detected and reference gait sequences.

.. GENERATED FROM PYTHON SOURCE LINES 135-146

.. code-block:: Python

    from mobgap.gait_sequences.evaluation import (
        calculate_unmatched_gsd_performance_metrics,
    )

    unmatched_metrics_dict = calculate_unmatched_gsd_performance_metrics(
        gsd_list_detected=detected_gsd_list,
        gsd_list_reference=reference_gsd_list,
        sampling_rate_hz=test_data.sampling_rate_hz,
    )

    unmatched_metrics_dict

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    {'reference_gs_duration_s': 65.52, 'detected_gs_duration_s': 97.64, 'gs_duration_error_s': 32.120000000000005, 'gs_relative_duration_error': 0.4902319902319903, 'gs_absolute_duration_error_s': 32.120000000000005, 'gs_absolute_relative_duration_error': 0.4902319902319903, 'gs_absolute_relative_duration_error_log': 0.3989318059799454, 'detected_num_gs': 7, 'reference_num_gs': 6, 'num_gs_error': 1, 'num_gs_relative_error': 0.16666666666666666, 'num_gs_absolute_error': 1, 'num_gs_absolute_relative_error': 0.16666666666666666, 'num_gs_absolute_relative_error_log': 0.15415067982725836}

.. GENERATED FROM PYTHON SOURCE LINES 147-168

Direct Gait Sequence Matching
-----------------------------
Apart from the performance evaluation methods mentioned above, it might be useful in some cases to identify how many
and which detected gait sequences reliably match with the ground truth.
This is primarily useful when further parameters are associated with each gait sequence, e.g., the gait speed.
In this case, matching gait sequences that cover the same gait regions allows proper comparison of these parameters.
For more information on this, see the example on the overall parameter evaluation on Walking-Bout level (TODO).

For this purpose, the :func:`~mobgap.gait_sequences.evaluation.categorize_intervals` function can be used.
It identifies the detected gait sequences that overlap with a reference gait sequence by at least a given amount.
In the resulting DataFrame, the `gs_id_detected` and `gs_id_reference` columns hold the ids of the detected and
reference gait sequences, and the `match_type` column marks each row as `tp`, `fp`, or `fn`.
Unmatched gait sequences have `NaN` in the column of the missing counterpart.
We can see that with an overlap threshold of 0.7 (70%), two of the seven detected gait sequences are considered as
matches with the reference gait sequences for our example recording.

Note that this threshold is enforced in both directions, i.e., the detected gait sequence must overlap with the
reference gait sequence by at least 70% and vice versa.
This means that only 1-to-1 matches are possible.
If multiple detected gait sequences overlap with the same reference gait sequence, only the one with the highest
overlap is considered as a match.
If one gait sequence is covered by multiple smaller ones, possibly none of them is considered as a match.

.. GENERATED FROM PYTHON SOURCE LINES 168-178

.. code-block:: Python

    from mobgap.gait_sequences.evaluation import categorize_intervals

    matches = categorize_intervals(
        gsd_list_detected=detected_gsd_list,
        gsd_list_reference=reference_gsd_list,
        overlap_threshold=0.7,
    )

    matches

.. raw:: html

	gs_id_detected	gs_id_reference	match_type
match_id
0	0	NaN	fp
1	1	NaN	fp
2	2	2	tp
3	3	NaN	fp
4	4	3	tp
5	5	NaN	fp
6	6	NaN	fp
7	NaN	0	fn
8	NaN	1	fn
9	NaN	4	fn
10	NaN	5	fn

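The bidirectional overlap criterion described above can be sketched in a few lines of plain Python.
This is an illustration of the idea, not the mobgap implementation: the helper names are hypothetical, and interval lengths are taken as ``end - start``, which is an assumption about how endpoints are counted.
Applied to the example recording, it finds the same two matches as the table above.

```python
# Start/end indices (in samples) of the example recording used above.
detected = [(900, 2101), (4650, 6151), (9600, 10801), (11250, 12151),
            (12300, 14851), (19950, 21151), (21300, 22501)]
reference = [(1019, 1768), (4534, 5549), (9665, 10569), (12337, 14633),
             (20151, 20982), (21378, 22129)]


def overlap_fraction(a, b):
    """Return the fraction of interval `a` that is covered by interval `b`."""
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    return max(overlap, 0) / (a[1] - a[0])


def match_intervals(detected, reference, threshold):
    """Return (detected_id, reference_id) pairs whose overlap reaches
    `threshold` in both directions."""
    matches = []
    for det_id, det in enumerate(detected):
        for ref_id, ref in enumerate(reference):
            covers_detected = overlap_fraction(det, ref) >= threshold
            covers_reference = overlap_fraction(ref, det) >= threshold
            if covers_detected and covers_reference:
                matches.append((det_id, ref_id))
    return matches


matches = match_intervals(detected, reference, threshold=0.7)
print(matches)
```

For thresholds above 0.5 and sequences that do not overlap within their own list, each sequence can satisfy the criterion with at most one partner, so no tie-breaking is needed in this sketch.
Detected sequence 5, for example, covers its reference sequence completely but is itself only covered to about 69%, so it is not counted as a match.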
.. GENERATED FROM PYTHON SOURCE LINES 179-188

Running a full evaluation pipeline
----------------------------------
Instead of manually evaluating and investigating the performance of a GSD algorithm on a single piece of data, we
often want to run a full evaluation on an entire dataset.
This can be done using the :class:`~mobgap.gait_sequences.pipeline.GsdEmulationPipeline` class and some ``tpcp``
functions.

But let's start with selecting some data.
We want to use all the simulated real-world walking data from the INDIP reference system (Test11).

.. GENERATED FROM PYTHON SOURCE LINES 188-193

.. code-block:: Python

    simulated_real_world_walking = LabExampleDataset(
        reference_system="INDIP"
    ).get_subset(test="Test11")

    simulated_real_world_walking

.. raw:: html

LabExampleDataset [3 groups/rows]

	cohort	participant_id	time_measure	test	trial
0	HA	001	TimeMeasure1	Test11	Trial1
1	HA	002	TimeMeasure1	Test11	Trial1
2	MS	001	TimeMeasure1	Test11	Trial1

.. GENERATED FROM PYTHON SOURCE LINES 194-197

Now we can use the :class:`~mobgap.gait_sequences.pipeline.GsdEmulationPipeline` class to directly run a GSD
algorithm on a datapoint.
The pipeline takes care of extracting the required data.

.. GENERATED FROM PYTHON SOURCE LINES 197-203

.. code-block:: Python

    from mobgap.gait_sequences.pipeline import GsdEmulationPipeline

    pipeline = GsdEmulationPipeline(GsdIluz())

    pipeline.safe_run(simulated_real_world_walking[0]).gs_list_

.. raw:: html

	start	end
gs_id
0	600	1201
1	4350	5251
2	7800	9001
3	9300	10201
4	10950	11551
5	13050	13651

.. GENERATED FROM PYTHON SOURCE LINES 204-211

Note that this only ran the pipeline on a single datapoint.
If we want to run it on all datapoints and evaluate the performance of the algorithm, we can use the
:func:`~tpcp.validate.validate` function.

For this, we need to provide a score function that runs and evaluates the pipeline on a single datapoint.
We provide a default score function for GSD that calculates all the metrics shown above per datapoint and combined
across all gait sequences (i.e. all gait sequences across all datapoints are pooled before metrics are calculated).

.. GENERATED FROM PYTHON SOURCE LINES 211-221

.. code-block:: Python

    from mobgap.gait_sequences.evaluation import gsd_score
    from tpcp.validate import validate

    evaluation_results = pd.DataFrame(
        validate(pipeline, simulated_real_world_walking, scoring=gsd_score)
    )

    evaluation_results.drop(
        ["single__raw__reference", "single__raw__detected"], axis=1
    ).T

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Datapoints: 0%| | 0/3 [00:00

	0
debug__score_time	0.711474
data_labels	[(HA, 001, TimeMeasure1, Test11, Trial1), (HA,...
single__reference_gs_duration_s	[40.44, 40.82, 65.52]
single__detected_gs_duration_s	[48.12, 42.08, 97.64]
single__gs_duration_error_s	[7.68, 1.259999999999998, 32.120000000000005]
...	...
agg__combined__f1_score	0.699236
agg__combined__tn_samples	30755
agg__combined__specificity	0.812507
agg__combined__accuracy	0.80828
agg__combined__npv	0.9118

76 rows × 1 columns

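The difference between per-datapoint metrics and metrics that are combined by pooling can be illustrated with a small sketch.
The sample counts below are hypothetical and chosen only to make the effect visible; they are not taken from the recordings above, and the code does not call mobgap.

```python
# Hypothetical per-datapoint sample counts, for illustration only.
per_datapoint_counts = [
    {"tp": 90, "fp": 10},
    {"tp": 40, "fp": 40},
    {"tp": 600, "fp": 300},
]


def precision(tp, fp):
    """Fraction of samples detected as gait that are gait in the reference."""
    return tp / (tp + fp)


# Per-datapoint view: one value per recording, optionally averaged.
single_precision = [precision(**counts) for counts in per_datapoint_counts]
mean_single_precision = sum(single_precision) / len(single_precision)

# Pooled view: sum the counts over all recordings first, then compute once.
pooled_tp = sum(counts["tp"] for counts in per_datapoint_counts)
pooled_fp = sum(counts["fp"] for counts in per_datapoint_counts)
pooled_precision = precision(pooled_tp, pooled_fp)

print(single_precision)
print(mean_single_precision, pooled_precision)
```

Pooling weights every sample equally, so long recordings dominate the result, whereas averaging the per-datapoint values weights every recording equally.
Which view is more appropriate depends on the question being asked.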
.. GENERATED FROM PYTHON SOURCE LINES 222-224

In addition to the metrics, the method also returns the raw reference and detected gait sequences.
These can be used for further custom analysis.

.. GENERATED FROM PYTHON SOURCE LINES 224-227

.. code-block:: Python

    evaluation_results["single__raw__reference"][0]

.. raw:: html

						start	end
cohort	participant_id	time_measure	test	trial	wb_id
HA	001	TimeMeasure1	Test11	Trial1	0	632	988
					1	2864	3325
					2	3853	5085
					3	7641	8621
					4	9451	9932
					5	11989	12517
	002	TimeMeasure1	Test11	Trial1	0	485	1131
					1	1746	3554
					2	6083	7708
MS	001	TimeMeasure1	Test11	Trial1	0	1019	1768
					1	4534	5549
					2	9665	10569
					3	12337	14633
					4	20151	20982
					5	21378	22129

.. GENERATED FROM PYTHON SOURCE LINES 228-230

.. code-block:: Python

    evaluation_results["single__raw__detected"][0]

.. raw:: html

						start	end
cohort	participant_id	time_measure	test	trial	gs_id
HA	001	TimeMeasure1	Test11	Trial1	0	600	1201
					1	4350	5251
					2	7800	9001
					3	9300	10201
					4	10950	11551
					5	13050	13651
	002	TimeMeasure1	Test11	Trial1	0	450	1201
					1	2700	3301
					2	5700	7951
					3	15000	15601
MS	001	TimeMeasure1	Test11	Trial1	0	900	2101
					1	4650	6151
					2	9600	10801
					3	11250	12151
					4	12300	14851
					5	19950	21151
					6	21300	22501

.. GENERATED FROM PYTHON SOURCE LINES 231-251

If you want to calculate additional metrics, you can either create a custom score function or subclass the pipeline
and override the score function.

Parameter Optimization
----------------------
Simply applying an algorithm to the data for evaluation is often not enough.
In the case of machine learning algorithms or algorithms with tunable parameters, we might want to optimize these
parameters to get the best possible performance.
To avoid overfitting, we can use cross-validation to evaluate the performance of the algorithm on multiple splits of
the data.

Below, we show that procedure by using a simple grid search to optimize the window length of the GSD Iluz algorithm
and evaluate this approach within a 3-fold cross-validation.
Per fold, we select the window length leading to the highest precision on the "train set" and evaluate the
performance on the "test set".

Note that on a real-world dataset, you would likely need to perform a group-wise stratified cross-validation to avoid
data leakage between multiple trials from the same participant and to ensure an equal distribution of patient cohorts
across the folds.
See the detailed ``tpcp`` examples on these topics.

.. GENERATED FROM PYTHON SOURCE LINES 251-282

.. code-block:: Python

    from sklearn.model_selection import ParameterGrid
    from tpcp.optimize import GridSearch
    from tpcp.validate import cross_validate

    para_grid = ParameterGrid({"algo__window_length_s": [2, 3, 4]})

    cross_validate_results = pd.DataFrame(
        cross_validate(
            GridSearch(
                GsdEmulationPipeline(GsdIluz()),
                para_grid,
                return_optimized="precision",
                scoring=gsd_score,
            ),
            simulated_real_world_walking,
            scoring=gsd_score,
            cv=3,
            return_train_score=True,
        )
    )

    cross_validate_results.drop(
        [
            "test__single__raw__reference",
            "test__single__raw__detected",
            "train__single__raw__reference",
            "train__single__raw__detected",
        ],
        axis=1,
    ).T

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    CV Folds: 0%| | 0/3 [00:00

	0	1	2
debug__score_time	0.267713	0.257381	0.279818
debug__optimize_time	1.486507	1.483054	1.433278
train__data_labels	[(HA, 002, TimeMeasure1, Test11, Trial1), (MS,...	[(HA, 001, TimeMeasure1, Test11, Trial1), (MS,...	[(HA, 001, TimeMeasure1, Test11, Trial1), (HA,...
test__data_labels	[(HA, 001, TimeMeasure1, Test11, Trial1)]	[(HA, 002, TimeMeasure1, Test11, Trial1)]	[(MS, 001, TimeMeasure1, Test11, Trial1)]
test__single__reference_gs_duration_s	[40.44]	[40.82]	[65.52]
...	...	...	...
train__agg__combined__f1_score	0.756254	0.74228	0.680912
train__agg__combined__tn_samples	23439	20410	18330
train__agg__combined__specificity	0.833683	0.786876	0.846612
train__agg__combined__accuracy	0.845118	0.819097	0.813975
train__agg__combined__npv	0.946457	0.949656	0.892014

152 rows × 3 columns

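The logic of a grid search nested inside a cross-validation can be spelled out without any library.
The sketch below uses hypothetical precision values per window length and datapoint (not real results) and, as a simplification, selects the window length by the mean per-datapoint precision on the train set.
It is meant to show the control flow only and does not reproduce what ``tpcp`` does internally.

```python
# Hypothetical precision per window length (in seconds) and datapoint.
precision_table = {
    2: {"HA-001": 0.70, "HA-002": 0.60, "MS-001": 0.65},
    3: {"HA-001": 0.75, "HA-002": 0.55, "MS-001": 0.66},
    4: {"HA-001": 0.72, "HA-002": 0.68, "MS-001": 0.60},
}
datapoints = ["HA-001", "HA-002", "MS-001"]


def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


def train_score(window_s, train_points):
    """Mean precision of one window length over the train datapoints."""
    return mean(precision_table[window_s][point] for point in train_points)


fold_results = []
for test_point in datapoints:  # 3 folds, one held-out datapoint per fold
    train_points = [point for point in datapoints if point != test_point]

    # "Optimization": pick the window length that scores best on the train set.
    best_window_s = max(
        precision_table, key=lambda window_s: train_score(window_s, train_points)
    )

    # "Evaluation": report the score of that choice on the unseen test point.
    fold_results.append(
        {
            "test_point": test_point,
            "best_window_s": best_window_s,
            "test_precision": precision_table[best_window_s][test_point],
        }
    )

for result in fold_results:
    print(result)
```

Note that the selected window length can differ between folds, and that the test score of a fold never influences which parameter is selected for it.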
.. GENERATED FROM PYTHON SOURCE LINES 283-286

In general, it is a good idea to use ``cross_validate`` also for algorithms that do not have tunable parameters.
This way, you can ensure that the performance of the algorithm is stable across different splits of the data, and it
allows a direct comparison between tunable and non-tunable algorithms.

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 10.678 seconds)

**Estimated memory usage:** 81 MB

.. _sphx_glr_download_auto_examples_gait_sequences__03_gsd_evaluation.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: _03_gsd_evaluation.ipynb <_03_gsd_evaluation.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: _03_gsd_evaluation.py <_03_gsd_evaluation.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: _03_gsd_evaluation.zip <_03_gsd_evaluation.zip>`

.. only:: html

  .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_