Performance of the laterality classification algorithms on the TVS dataset#

Warning

On this page you will find preliminary results for a standardized revalidation of the pipeline and all of its algorithms. The current state is TECHNICAL EXPERIMENTATION. Do not use these results or make any assumptions based on them. We will update this page incrementally and provide further information as soon as the state of any of the validation steps changes.

The following provides an analysis and comparison of the laterality classification algorithms on the TVS dataset (lab and free-living). We look into the actual performance of the algorithms compared to the reference data.

Compared to the other revalidation scripts, this one does not load the old “matlab” results, as there are none. The laterality algorithm by Ullrich et al. was validated independently and was already written in Python. The version implemented here follows the old version very closely. The goal of this revalidation is to validate the re-trained model (trained with the updated training code) on the TVS dataset. We compare it against the old model and the McCamley algorithm.

Note

If you are interested in how these results are calculated, head over to the processing page.

Below is the list of algorithms that we will compare. Note that we use the label “MobGap” to refer to the newly trained model, while “Original Implementation” refers to the models trained as part of previous work. We compare all the available models. For context, the “MS_ALL” models are used by default in the pipelines. For the McCamley algorithm, only a single version exists.

algorithms = {
    "McCamley": ("McCamley", "-"),
    "UllrichOld__ms_all": ("Ullrich - MS-ALL", "Original Implementation"),
    "UllrichOld__ms_ms": ("Ullrich - MS-MS", "Original Implementation"),
    "UllrichNew__ms_all": ("Ullrich - MS-ALL", "MobGap"),
}
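The (algorithm, version) tuples above are later used as dictionary keys when the results are concatenated. As a minimal sketch of what that does, the example below uses two made-up result tables (the values and participant IDs are placeholders, not real results) and shows how `pd.concat` turns tuple keys into the leading index levels.

```python
import pandas as pd

# Two tiny stand-in result tables with made-up values (not real results).
dummy_a = pd.DataFrame(
    {"accuracy": [0.8, 0.9]},
    index=pd.Index(["p1", "p2"], name="participant_id"),
)
dummy_b = pd.DataFrame(
    {"accuracy": [0.7, 0.6]},
    index=pd.Index(["p1", "p2"], name="participant_id"),
)

# Tuple keys shaped like the values of the ``algorithms`` dict.
values = {
    ("Ullrich - MS-ALL", "MobGap"): dummy_a,
    ("McCamley", "-"): dummy_b,
}

# Each tuple element becomes its own index level in front of the original index.
combined = pd.concat(values, names=["algo", "version", "participant_id"])
print(list(combined.index.names))

# After reset_index, "algo" and "version" are ordinary columns for grouping.
flat = combined.reset_index()
print(flat)
```

After `reset_index`, the `algo` and `version` columns can be used directly in `groupby` calls and as plot labels.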

The code below loads the data and prepares it for the analysis. By default, the data will be downloaded from an online repository (and cached locally). If you want to use a local copy of the data, set the MOBGAP_VALIDATION_DATA_PATH environment variable to its location and MOBGAP_VALIDATION_USE_LOCAL_DATA to 1. The results are then read from the results subfolder of that path.

The file download will print some log output, which can usually be ignored. You can also change the version parameter to load a different version of the data.

from pathlib import Path

import pandas as pd
from mobgap.data.validation_results import ValidationResultLoader
from mobgap.utils.misc import get_env_var


def format_loaded_results(
    values: dict[tuple[str, str], pd.DataFrame],
    index_cols: list[str],
) -> pd.DataFrame:
    formatted = (
        pd.concat(values, names=["algo", "version", *index_cols])
        .reset_index()
        .assign(
            algo_with_version=lambda df: df["algo"]
            + " ("
            + df["version"]
            + ")",
            _combined="combined",
        )
    )
    return formatted


local_data_path = (
    Path(get_env_var("MOBGAP_VALIDATION_DATA_PATH")) / "results"
    if int(get_env_var("MOBGAP_VALIDATION_USE_LOCAL_DATA", 0))
    else None
)
__RESULT_VERSION = "v0.11.0"
loader = ValidationResultLoader(
    "lrc", result_path=local_data_path, version=__RESULT_VERSION
)


free_living_index_cols = [
    "cohort",
    "participant_id",
    "time_measure",
    "recording",
    "recording_name",
    "recording_name_pretty",
]

free_living_results = format_loaded_results(
    {
        v: loader.load_single_results(k, "free_living")
        for k, v in algorithms.items()
    },
    free_living_index_cols,
)

lab_index_cols = [
    "cohort",
    "participant_id",
    "time_measure",
    "test",
    "trial",
    "test_name",
    "test_name_pretty",
]

lab_results = format_loaded_results(
    {
        v: loader.load_single_results(k, "laboratory")
        for k, v in algorithms.items()
    },
    lab_index_cols,
)

cohort_order = ["HA", "CHF", "COPD", "MS", "PD", "PFF"]
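The switch between downloaded and local data depends only on the two environment variables described above. The sketch below mirrors that decision with a plain mapping instead of the real environment. `resolve_local_results_path` is a hypothetical helper written for this illustration, the example path is a placeholder, and how `get_env_var` itself resolves values is not covered here.

```python
from pathlib import Path
from typing import Mapping, Optional


def resolve_local_results_path(env: Mapping[str, str]) -> Optional[Path]:
    """Hypothetical helper mirroring the path logic of the loading code."""
    # The flag defaults to 0, which means "download the data".
    use_local = int(env.get("MOBGAP_VALIDATION_USE_LOCAL_DATA", 0))
    if not use_local:
        return None
    # With the flag set, results are expected in "<data path>/results".
    return Path(env["MOBGAP_VALIDATION_DATA_PATH"]) / "results"


# No variables set: None, i.e. the loader falls back to the online repository.
print(resolve_local_results_path({}))

# Both variables set (placeholder path): the local results folder is used.
local_env = {
    "MOBGAP_VALIDATION_USE_LOCAL_DATA": "1",
    "MOBGAP_VALIDATION_DATA_PATH": "/data/mobgap_validation",
}
print(resolve_local_results_path(local_env))
```

Passing `None` as `result_path` is what makes the loader in the code above fall back to the download.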

Performance metrics#

Below you can find the setup for all performance metrics that we will calculate. For laterality, this is simple: we calculate the accuracy of the binary classification and the “pairwise accuracy”, which checks whether consecutive initial contacts (ICs) have been assigned the same or different laterality. A high “pairwise accuracy” is a better indicator of whether steps and strides would be correctly defined based on the laterality information. This metric explicitly ignores the actual laterality label, as a consistent swap of left and right would not affect the main gait metrics.

from functools import partial

from mobgap.pipeline.evaluation import CustomErrorAggregations as A
from mobgap.utils.df_operations import (
    CustomOperation,
    apply_aggregations,
    apply_transformations,
)
from mobgap.utils.tables import FormatTransformer as F
from mobgap.utils.tables import RevalidationInfo, revalidation_table_styles

custom_aggs = [
    CustomOperation(
        identifier=None,
        function=A.n_datapoints,
        column_name=[("n_datapoints", "all")],
    ),
    ("accuracy", ["mean", A.conf_intervals]),
    ("accuracy_pairwise", ["mean", A.conf_intervals]),
]

format_transforms = [
    CustomOperation(
        identifier=None,
        function=lambda df_: df_[("n_datapoints", "all")].astype(int),
        column_name="n_datapoints",
    ),
    CustomOperation(
        identifier=None,
        function=partial(
            F.value_with_metadata,
            value_col=("mean", "accuracy"),
            other_columns={"range": ("conf_intervals", "accuracy")},
        ),
        column_name="accuracy",
    ),
    CustomOperation(
        identifier=None,
        function=partial(
            F.value_with_metadata,
            value_col=("mean", "accuracy_pairwise"),
            other_columns={"range": ("conf_intervals", "accuracy_pairwise")},
        ),
        column_name="accuracy_pairwise",
    ),
]


final_names = {
    "n_datapoints": "# participants",
    "accuracy": "Accuracy",
    "accuracy_pairwise": "Accuracy IC-pairs",
}

validation_thresholds = {
    "Accuracy": RevalidationInfo(threshold=0.7, higher_is_better=True),
}


def format_tables(df: pd.DataFrame) -> pd.DataFrame:
    return (
        df.pipe(apply_transformations, format_transforms)
        .rename(columns=final_names)
        .loc[:, list(final_names.values())]
    )
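To make the difference between the two metrics concrete, the following sketch computes both for a short sequence of ICs. It follows the verbal description of the pairwise accuracy given at the start of this section. The function names are illustrative and the exact definition used in the mobgap evaluation code may differ.

```python
import numpy as np


def accuracy(ref, pred):
    """Fraction of ICs whose predicted laterality matches the reference."""
    ref, pred = np.asarray(ref), np.asarray(pred)
    return float(np.mean(ref == pred))


def pairwise_accuracy(ref, pred):
    """Fraction of consecutive IC pairs with matching same/different relation."""
    ref, pred = np.asarray(ref), np.asarray(pred)
    # True where two consecutive ICs share the same label.
    ref_same = ref[1:] == ref[:-1]
    pred_same = pred[1:] == pred[:-1]
    return float(np.mean(ref_same == pred_same))


reference = ["left", "right", "left", "right", "left", "right"]
# All labels swapped consistently: every label is wrong, every pair is right.
swapped = ["right", "left", "right", "left", "right", "left"]
# A single wrong label disturbs the two pairs it is part of.
one_error = ["left", "right", "right", "right", "left", "right"]

for name, prediction in [("swapped", swapped), ("one_error", one_error)]:
    print(
        name,
        round(accuracy(reference, prediction), 2),
        round(pairwise_accuracy(reference, prediction), 2),
    )
```

The consistently swapped sequence is the case the pairwise metric is designed for: the plain accuracy drops to zero, although the step and stride definitions derived from it would be unaffected.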

Free-Living Comparison#

We focus on the free-living data for the comparison as this is the expected use case for the algorithms.

All results across all cohorts#

The results below represent the average performance across all participants independent of the cohort.

import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots()
sns.boxplot(
    data=free_living_results, x="algo_with_version", y="accuracy", ax=ax
)
fig.show()
fig, ax = plt.subplots()
sns.boxplot(
    data=free_living_results,
    x="algo_with_version",
    y="accuracy_pairwise",
    ax=ax,
)
fig.show()

perf_metrics_all = (
    free_living_results.groupby(["algo", "version"])
    .apply(apply_aggregations, custom_aggs, include_groups=False)
    .pipe(format_tables)
)
perf_metrics_all.style.pipe(
    revalidation_table_styles, validation_thresholds, ["algo"]
)

algo	version	# participants	Accuracy	Accuracy IC-pairs
McCamley	-	101	0.78 [0.75, 0.82]	0.77 [0.75, 0.79]
Ullrich - MS-ALL	MobGap	101	0.80 [0.77, 0.84]	0.81 [0.79, 0.83]
Ullrich - MS-ALL	Original Implementation	101	0.77 [0.74, 0.81]	0.74 [0.71, 0.77]
Ullrich - MS-MS	Original Implementation	101	0.76 [0.72, 0.80]	0.76 [0.73, 0.78]
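Each table entry of the form "mean [lower, upper]" combines the mean over all datapoints with an interval from `A.conf_intervals`. How mobgap computes that interval is not shown on this page. The sketch below is therefore only an assumption: it uses a 95% normal-approximation interval on made-up per-participant values to illustrate how such an entry can be read. Both helper functions are hypothetical.

```python
import numpy as np


def mean_with_ci(values, z=1.96):
    """Mean and normal-approximation confidence interval (assumed method)."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    # Standard error of the mean, using the sample standard deviation.
    sem = values.std(ddof=1) / np.sqrt(len(values))
    return mean, (mean - z * sem, mean + z * sem)


def format_entry(mean, ci):
    """Format as 'mean [lower, upper]', matching the table layout."""
    return f"{mean:.2f} [{ci[0]:.2f}, {ci[1]:.2f}]"


# Made-up per-participant accuracies, not taken from the real results.
per_participant_accuracy = [0.9, 0.8, 0.7, 0.85, 0.75]
mean, ci = mean_with_ci(per_participant_accuracy)
print(format_entry(mean, ci))
```

With only a handful of datapoints the interval is wide. The same effect is visible in the per-cohort tables, where the smaller cohorts show wider intervals.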

Per Cohort#

The results below represent the average performance across all participants within a cohort.

fig, ax = plt.subplots()
sns.boxplot(
    data=free_living_results,
    x="cohort",
    y="accuracy",
    hue="algo_with_version",
    order=cohort_order,
    ax=ax,
)
ax.set_title("Accuracy")
fig.show()
fig, ax = plt.subplots()
sns.boxplot(
    data=free_living_results,
    x="cohort",
    y="accuracy_pairwise",
    hue="algo_with_version",
    order=cohort_order,
    ax=ax,
)
ax.set_title("Accuracy IC-pairs")
fig.show()
perf_metrics_cohort = (
    free_living_results.groupby(["cohort", "algo", "version"])
    .apply(apply_aggregations, custom_aggs, include_groups=False)
    .pipe(format_tables)
    .loc[cohort_order]
)
perf_metrics_cohort.style.pipe(
    revalidation_table_styles, validation_thresholds, ["cohort", "algo"]
)

cohort	algo	version	# participants	Accuracy	Accuracy IC-pairs
HA	McCamley	-	20	0.85 [0.80, 0.90]	0.81 [0.76, 0.86]
	Ullrich - MS-ALL	MobGap	20	0.86 [0.81, 0.90]	0.84 [0.80, 0.88]
	Ullrich - MS-ALL	Original Implementation	20	0.84 [0.79, 0.88]	0.78 [0.73, 0.84]
	Ullrich - MS-MS	Original Implementation	20	0.82 [0.76, 0.87]	0.78 [0.73, 0.83]
CHF	McCamley	-	10	0.82 [0.73, 0.91]	0.80 [0.72, 0.88]
	Ullrich - MS-ALL	MobGap	10	0.83 [0.73, 0.93]	0.85 [0.80, 0.90]
	Ullrich - MS-ALL	Original Implementation	10	0.80 [0.71, 0.90]	0.79 [0.72, 0.85]
	Ullrich - MS-MS	Original Implementation	10	0.72 [0.61, 0.84]	0.72 [0.63, 0.82]
COPD	McCamley	-	17	0.70 [0.62, 0.78]	0.71 [0.66, 0.76]
	Ullrich - MS-ALL	MobGap	17	0.76 [0.68, 0.84]	0.78 [0.74, 0.82]
	Ullrich - MS-ALL	Original Implementation	17	0.70 [0.62, 0.78]	0.69 [0.64, 0.73]
	Ullrich - MS-MS	Original Implementation	17	0.76 [0.70, 0.83]	0.74 [0.70, 0.78]
MS	McCamley	-	18	0.78 [0.68, 0.87]	0.79 [0.74, 0.84]
	Ullrich - MS-ALL	MobGap	18	0.79 [0.69, 0.88]	0.82 [0.77, 0.86]
	Ullrich - MS-ALL	Original Implementation	18	0.75 [0.66, 0.84]	0.73 [0.67, 0.79]
	Ullrich - MS-MS	Original Implementation	18	0.75 [0.65, 0.86]	0.80 [0.75, 0.85]
PD	McCamley	-	19	0.79 [0.71, 0.87]	0.78 [0.72, 0.84]
	Ullrich - MS-ALL	MobGap	19	0.81 [0.71, 0.91]	0.83 [0.76, 0.90]
	Ullrich - MS-ALL	Original Implementation	19	0.79 [0.70, 0.88]	0.78 [0.70, 0.85]
	Ullrich - MS-MS	Original Implementation	19	0.78 [0.68, 0.87]	0.78 [0.71, 0.84]
PFF	McCamley	-	17	0.75 [0.67, 0.84]	0.73 [0.68, 0.79]
	Ullrich - MS-ALL	MobGap	17	0.76 [0.65, 0.86]	0.76 [0.69, 0.83]
	Ullrich - MS-ALL	Original Implementation	17	0.75 [0.65, 0.84]	0.70 [0.61, 0.79]
	Ullrich - MS-MS	Original Implementation	17	0.70 [0.60, 0.81]	0.70 [0.62, 0.77]

Deep Dive Analysis of Main Algorithms#

Below, we show the direct correlation between the results from the old and the new implementation. Each datapoint represents one participant.

from mobgap.plotting import (
    calc_min_max_with_margin,
    make_square,
    move_legend_outside,
    plot_regline,
)


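def compare_scatter_plot(data: pd.DataFrame, name: str) -> None:
    """Plot per-participant accuracy of the original vs. the MobGap model."""
    # constrained_layout already handles the spacing, so tight_layout is not
    # called additionally (mixing both triggers a layout warning).
    fig, ax = plt.subplots(figsize=(8, 8), constrained_layout=True)
    # Reshape to one row per participant and one column per model version.
    reformatted_data = (
        data.pivot_table(
            values="accuracy",
            index=("cohort", "participant_id"),
            columns="version",
        )
        .reset_index()
        .dropna(how="any")
    )

    min_max = calc_min_max_with_margin(
        reformatted_data["Original Implementation"], reformatted_data["MobGap"]
    )
    sns.scatterplot(
        reformatted_data,
        x="Original Implementation",
        y="MobGap",
        hue="cohort",
        ax=ax,
    )
    plot_regline(
        reformatted_data["Original Implementation"],
        reformatted_data["MobGap"],
        ax=ax,
    )
    # Equal axis limits and a diagonal make deviations between versions visible.
    make_square(ax, min_max, draw_diagonal=True)
    move_legend_outside(fig, ax)
    ax.set_title(name)
    ax.set_xlabel("Original Implementation")
    ax.set_ylabel("MobGap")
    plt.show()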


free_living_results.query("algo == 'Ullrich - MS-ALL'").pipe(
    compare_scatter_plot, "Ullrich - MS-ALL"
)
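The scatter plot relies on reshaping the long results table so that each participant has one column per model version. The sketch below shows that reshaping step in isolation on made-up data (cohorts, IDs and values are placeholders).

```python
import pandas as pd

# Long format: one row per participant and model version (made-up values).
long_df = pd.DataFrame(
    {
        "cohort": ["HA", "HA", "HA", "PD", "PD"],
        "participant_id": ["001", "001", "002", "003", "003"],
        "version": [
            "MobGap",
            "Original Implementation",
            "MobGap",
            "MobGap",
            "Original Implementation",
        ],
        "accuracy": [0.9, 0.8, 0.7, 0.6, 0.65],
    }
)

# Wide format: one row per participant, one accuracy column per version.
wide_df = (
    long_df.pivot_table(
        values="accuracy",
        index=["cohort", "participant_id"],
        columns="version",
    )
    .reset_index()
    # Participant "002" has no result for the original model and is dropped.
    .dropna(how="any")
)
print(wide_df)
```

Note that `pivot_table` averages duplicate entries by default, so multiple rows for the same participant and version would be reduced to their mean before plotting.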

Conclusion Free-Living#

It is good to see that the new version of the algorithm performs slightly better than the old version. However, it is unclear why the new model differs, as we used almost the same pipeline and the same data. The non-ML algorithm (McCamley) performs surprisingly well, and much better than in the tests we did as part of Mobilise-D. Overall, the performance is not as good as we would like it to be, in particular for a couple of participants where the performance is as low as 0.1.

Laboratory Comparison#

Every datapoint below is one trial of a test. Note that each datapoint is weighted equally in the calculation of the performance metrics. This is a limitation of this simple approach, as the number of strides per trial and the complexity of the context can vary significantly. For a full picture, different groups of tests should be analyzed separately. The approach below should still provide a good overview to compare the algorithms.

fig, ax = plt.subplots()
sns.boxplot(data=lab_results, x="algo_with_version", y="accuracy", ax=ax)
fig.show()
fig, ax = plt.subplots()
sns.boxplot(
    data=lab_results, x="algo_with_version", y="accuracy_pairwise", ax=ax
)
fig.show()

perf_metrics_all = (
    lab_results.groupby(["algo", "version"])
    .apply(apply_aggregations, custom_aggs, include_groups=False)
    .pipe(format_tables)
)
perf_metrics_all.style.pipe(
    revalidation_table_styles, validation_thresholds, ["algo"]
)

algo	version	# participants	Accuracy	Accuracy IC-pairs
McCamley	-	1169	0.80 [0.79, 0.81]	0.77 [0.76, 0.78]
Ullrich - MS-ALL	MobGap	1169	0.85 [0.84, 0.86]	0.83 [0.82, 0.84]
Ullrich - MS-ALL	Original Implementation	1169	0.79 [0.78, 0.80]	0.74 [0.73, 0.75]
Ullrich - MS-MS	Original Implementation	1169	0.78 [0.77, 0.79]	0.75 [0.74, 0.76]
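The equal weighting of trials mentioned at the start of this section can shift the reported mean noticeably. The sketch below compares the unweighted mean used here with a mean weighted by the number of ICs per trial. All numbers are made up, and the weighted variant is not part of the analysis on this page.

```python
import numpy as np

# Made-up per-trial accuracies and number of ICs in each trial.
trial_accuracy = np.array([1.0, 0.5, 0.9])
n_ics_per_trial = np.array([4, 100, 20])

# Every trial counts the same, regardless of its length.
unweighted = trial_accuracy.mean()

# Long trials dominate when weighting by the number of ICs.
weighted = np.average(trial_accuracy, weights=n_ics_per_trial)

print(f"unweighted: {unweighted:.2f}, weighted: {weighted:.2f}")
```

In this constructed case, a very short trial with perfect accuracy pulls the unweighted mean well above the value obtained when every IC counts equally.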

Per Cohort#

The results below represent the average performance across all trials of all participants within a cohort.

fig, ax = plt.subplots()
sns.boxplot(
    data=lab_results,
    x="cohort",
    y="accuracy",
    hue="algo_with_version",
    order=cohort_order,
    ax=ax,
)
fig.show()
fig, ax = plt.subplots()
sns.boxplot(
    data=lab_results,
    x="cohort",
    y="accuracy_pairwise",
    hue="algo_with_version",
    order=cohort_order,
    ax=ax,
)
fig.show()
perf_metrics_cohort = (
    lab_results.groupby(["cohort", "algo", "version"])
    .apply(apply_aggregations, custom_aggs, include_groups=False)
    .pipe(format_tables)
    .loc[cohort_order]
)
perf_metrics_cohort.style.pipe(
    revalidation_table_styles, validation_thresholds, ["cohort", "algo"]
)

cohort	algo	version	# participants	Accuracy	Accuracy IC-pairs
HA	McCamley	-	227	0.80 [0.77, 0.83]	0.78 [0.75, 0.80]
	Ullrich - MS-ALL	MobGap	227	0.82 [0.79, 0.84]	0.82 [0.80, 0.84]
	Ullrich - MS-ALL	Original Implementation	227	0.76 [0.73, 0.79]	0.74 [0.72, 0.77]
	Ullrich - MS-MS	Original Implementation	227	0.81 [0.79, 0.83]	0.78 [0.75, 0.80]
CHF	McCamley	-	106	0.85 [0.82, 0.87]	0.80 [0.77, 0.83]
	Ullrich - MS-ALL	MobGap	106	0.88 [0.86, 0.90]	0.86 [0.83, 0.88]
	Ullrich - MS-ALL	Original Implementation	106	0.81 [0.79, 0.84]	0.75 [0.71, 0.79]
	Ullrich - MS-MS	Original Implementation	106	0.73 [0.69, 0.77]	0.69 [0.65, 0.73]
COPD	McCamley	-	214	0.80 [0.77, 0.83]	0.77 [0.74, 0.80]
	Ullrich - MS-ALL	MobGap	214	0.87 [0.85, 0.90]	0.85 [0.83, 0.88]
	Ullrich - MS-ALL	Original Implementation	214	0.79 [0.76, 0.82]	0.76 [0.73, 0.78]
	Ullrich - MS-MS	Original Implementation	214	0.83 [0.81, 0.85]	0.78 [0.75, 0.80]
MS	McCamley	-	228	0.79 [0.76, 0.82]	0.77 [0.74, 0.79]
	Ullrich - MS-ALL	MobGap	228	0.86 [0.84, 0.88]	0.83 [0.80, 0.85]
	Ullrich - MS-ALL	Original Implementation	228	0.81 [0.79, 0.83]	0.74 [0.71, 0.77]
	Ullrich - MS-MS	Original Implementation	228	0.76 [0.73, 0.79]	0.74 [0.71, 0.77]
PD	McCamley	-	225	0.77 [0.74, 0.79]	0.75 [0.73, 0.78]
	Ullrich - MS-ALL	MobGap	225	0.81 [0.79, 0.84]	0.82 [0.80, 0.84]
	Ullrich - MS-ALL	Original Implementation	225	0.77 [0.74, 0.79]	0.72 [0.69, 0.75]
	Ullrich - MS-MS	Original Implementation	225	0.75 [0.73, 0.78]	0.73 [0.70, 0.75]
PFF	McCamley	-	169	0.83 [0.81, 0.85]	0.77 [0.74, 0.80]
	Ullrich - MS-ALL	MobGap	169	0.86 [0.84, 0.88]	0.81 [0.79, 0.84]
	Ullrich - MS-ALL	Original Implementation	169	0.81 [0.79, 0.83]	0.74 [0.72, 0.76]
	Ullrich - MS-MS	Original Implementation	169	0.81 [0.79, 0.84]	0.76 [0.73, 0.79]
