mobgap.pipeline.evaluation.categorize_intervals

mobgap.pipeline.evaluation.categorize_intervals(
*,
gsd_list_detected: DataFrame,
gsd_list_reference: DataFrame,
overlap_threshold: float = 0.8,
multiindex_warning: bool = True,
) -> DataFrame

Evaluate a gait sequence list against a reference sequence-by-sequence with a minimum overlap threshold.

This compares a list of detected gait sequences against a reference list and classifies each sequence as a true positive, false positive, or false negative. A detected gait sequence is classified as a true positive if it overlaps a reference sequence by at least overlap_threshold. A detected sequence without a sufficiently overlapping reference sequence is classified as a false positive. A reference sequence without a sufficiently overlapping detected sequence is classified as a false negative.

Note that the threshold is enforced in both directions: the overlap must be at least overlap_threshold relative to the length of the detected gait sequence AND relative to the length of the matched reference interval.
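The two-sided check can be sketched as follows. This is a minimal illustration, not the mobgap implementation: the helper names relative_overlaps and is_match are hypothetical, and interval length is assumed to be end - start (how mobgap treats the end index is not stated here).

```python
def relative_overlaps(detected, reference):
    """Return the overlap relative to the detected and to the reference interval.

    Assumption: intervals are (start, end) tuples with length end - start.
    """
    det_start, det_end = detected
    ref_start, ref_end = reference
    # Length of the intersection; zero when the intervals do not overlap.
    overlap = max(0, min(det_end, ref_end) - max(det_start, ref_start))
    return overlap / (det_end - det_start), overlap / (ref_end - ref_start)


def is_match(detected, reference, overlap_threshold=0.8):
    """Return True only if both relative overlaps reach the threshold."""
    rel_detected, rel_reference = relative_overlaps(detected, reference)
    return rel_detected >= overlap_threshold and rel_reference >= overlap_threshold


# A short detection fully inside a longer reference: all of the detection
# overlaps, but only half of the reference is covered, so it is not a match.
print(relative_overlaps((0, 10), (0, 20)))  # (1.0, 0.5)
print(is_match((0, 10), (0, 20)))  # False
print(is_match((0, 10), (1, 10)))  # True (overlaps are 0.9 and 1.0)
```

Checking only one direction would accept the first case, which is why both ratios are compared against the threshold.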

The detected and reference dataframes are expected to have columns named “start” and “end” containing the start and end indices of the respective gait sequences. As index, we support either a single index or a multiindex without duplicates (i.e., the index must identify each gait sequence uniquely). If a multiindex is provided, the individual index levels will be ignored for the comparison, and matches across different index groups will be possible. If this is not the intended use case, consider grouping your input data before calling this function (see create_multi_groupby).
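As a rough illustration of the index requirements, the sketch below builds a multiindexed input, checks that the index is unique, and splits the data by its outer level. It uses plain pandas groupby rather than create_multi_groupby, whose signature is not shown here. The level names "recording" and "gs_id" and the data are made up for this example.

```python
import pandas as pd

# Hypothetical data: two recordings, each with its own gait sequence ids.
detected = pd.DataFrame(
    {"start": [0, 20, 5], "end": [10, 30, 15]},
    index=pd.MultiIndex.from_tuples(
        [("rec_1", 0), ("rec_1", 1), ("rec_2", 0)],
        names=["recording", "gs_id"],
    ),
)

# Every gait sequence must be identified uniquely by its full index.
assert detected.index.is_unique

# Split by the outer level so that sequences from different recordings
# are never compared with each other.
per_recording = {name: group for name, group in detected.groupby(level="recording")}
print(list(per_recording))  # ['rec_1', 'rec_2']
```

Each group could then be evaluated separately against the matching group of the reference list.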

Note that we assume gsd_list_detected contains no overlapping intervals, but this is not enforced! Additionally, note that this method does not return any new intervals (as done in categorize_intervals_per_sample). Instead, the comparison is done on a sequence-by-sequence level based on the provided intervals.

Parameters:
gsd_list_detected: pd.DataFrame

Each row contains a detected gait sequence interval as output by the GSD algorithms. The respective start index is stored in a column named start and the end index in a column named end.

gsd_list_reference: pd.DataFrame

Gold standard to validate the detected gait sequences against. Should have the same format as gsd_list_detected.

overlap_threshold: float

The minimum relative overlap between a detected sequence and its reference with respect to the length of both intervals. Must be larger than 0.5 and smaller than or equal to 1.

multiindex_warning: bool

If True, a warning will be raised if the index of the input data is a MultiIndex, explaining that the index levels will be ignored for the matching process. This exists because it is a common source of error when this function is used together with a typical pipeline that iterates over individual gait sequences during processing using GsIterator. Only set this to False once you understand the two different use cases.

Returns:
matches: pandas.DataFrame

A three-column dataframe with the column names gsd_id_detected, gsd_id_reference, and match_type. Each row is a match containing the index values of the detected and the reference sequence that belong together, or tuples of index values in case of a multiindex input. The match_type column indicates the type of match. For all detected gait sequences that have a match in the reference list, this will be “tp” (true positive). Detected gait sequences that do not have a match are paired with NaN in gsd_id_reference, and the match_type will be “fp” (false positive). All reference gait sequences that do not have a counterpart in the detected list are paired with NaN in gsd_id_detected and marked as “fn” (false negative).
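One way to summarize such a result is to count the match types. The sketch below does this on a hand-built table in the documented format (it is not produced by calling mobgap). The precision and recall lines are standard definitions added for illustration, not something this function returns.

```python
import numpy as np
import pandas as pd

# Hand-built table following the documented column names (illustrative data).
matches = pd.DataFrame(
    {
        "gsd_id_detected": [0, 1, np.nan],
        "gsd_id_reference": [0, np.nan, 1],
        "match_type": ["tp", "fp", "fn"],
    }
)

# Count each match type; reindex so that missing types count as zero.
counts = matches["match_type"].value_counts().reindex(["tp", "fp", "fn"], fill_value=0)
n_tp, n_fp, n_fn = (int(counts[key]) for key in ["tp", "fp", "fn"])

# Standard definitions; both denominators are non-zero for this data.
precision = n_tp / (n_tp + n_fp)
recall = n_tp / (n_tp + n_fn)
print(n_tp, n_fp, n_fn)  # 1 1 1
print(precision, recall)  # 0.5 0.5
```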

Examples

>>> import pandas as pd
>>> from mobgap.pipeline.evaluation import categorize_intervals
>>> detected = pd.DataFrame(
...     [[0, 10, 0], [20, 30, 1]], columns=["start", "end", "id"]
... ).set_index("id")
>>> reference = pd.DataFrame(
...     [[0, 10, 0], [15, 25, 1]], columns=["start", "end", "id"]
... ).set_index("id")
>>> result = categorize_intervals(
...     gsd_list_detected=detected, gsd_list_reference=reference
... )
>>> result
   gsd_id_detected  gsd_id_reference match_type
0                0                 0         tp
1                1               NaN         fp
2              NaN                 1         fn