.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/aggregation/_99_cvs_agg_pipeline_no_exc.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_aggregation__99_cvs_agg_pipeline_no_exc.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_aggregation__99_cvs_agg_pipeline_no_exc.py:


CVS Official Aggregation Script
===============================

This example shows how the raw CVS per-walking-bout results are combined with the weartime reports and aggregated to
daily and weekly levels.

For more details on the individual steps see the other aggregation examples.
This example primarily serves as an easy-to-use script for people who want to run or replicate the CVS aggregation.

Before you start, you need three things:

- DMO data: A CSV file obtained from the Mobilise-D data warehouse that contains the data of one CVS measurement time
  point (e.g. T1). Example file name: `cvs-T3-wb-dmo-14-05-2024.csv`
- PID-Map: A mapping between patient ids and measurement sites (pid-map for short). Example file name:
  `study-instances-Cohort Site-2023-08-08h22m09s48.csv`
- Weartime: The minute-by-minute weartime reports obtained from McRoberts. This is a folder with a large number of CSV
  files. In addition to the CSV files, we also expect a file matching the pattern `CVS-wear-compliance-*.xlsx` in the
  same folder. This file is used to map the measurement ids to the correct weartime reports.

Use the full path to these files/folders in the `path_config` dictionary below.
The "outpath" key in the `path_config` dictionary specifies the folder where the aggregated data should be saved.

.. GENERATED FROM PYTHON SOURCE LINES 24-34

.. code-block:: default


    from pathlib import Path

    path_config = {
        "dmo": ...,
        "pid_map": ...,
        "weartime_reports": ...,
        "outpath": ...,
    }
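
Before running the pipeline, it can help to verify that the configured paths look as described above.
The following is a minimal sketch, not part of the official script: ``check_path_config`` is a hypothetical helper,
and the demo uses empty placeholder files in a temporary folder instead of real data.

.. code-block:: python

    import tempfile
    from pathlib import Path


    def check_path_config(path_config):
        """Return a list of human-readable problems (hypothetical helper)."""
        problems = []
        for key in ("dmo", "pid_map", "weartime_reports", "outpath"):
            if key not in path_config:
                problems.append(f"missing key: {key}")
        # The DMO export and the pid-map are expected to be single files.
        for key in ("dmo", "pid_map"):
            if key in path_config and not Path(path_config[key]).is_file():
                problems.append(f"{key} is not an existing file")
        # The weartime folder must also contain the compliance workbook.
        if "weartime_reports" in path_config:
            weartime = Path(path_config["weartime_reports"])
            if not weartime.is_dir():
                problems.append("weartime_reports is not an existing folder")
            elif not any(weartime.glob("CVS-wear-compliance-*.xlsx")):
                problems.append("no CVS-wear-compliance-*.xlsx file found")
        return problems


    # Demo with placeholder files (assumption: not real study data).
    with tempfile.TemporaryDirectory() as tmp_dir:
        base = Path(tmp_dir)
        (base / "dmo.csv").touch()
        (base / "pid_map.csv").touch()
        (base / "weartime").mkdir()
        demo_config = {
            "dmo": base / "dmo.csv",
            "pid_map": base / "pid_map.csv",
            "weartime_reports": base / "weartime",
            "outpath": base / "out",
        }
        # The compliance workbook does not exist yet.
        problems_before = check_path_config(demo_config)
        (base / "weartime" / "CVS-wear-compliance-demo.xlsx").touch()
        problems_after = check_path_config(demo_config)

The helper only inspects the file system; it does not open or validate the content of any file.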


.. GENERATED FROM PYTHON SOURCE LINES 35-38

Load the data
-------------
We create a new dataset instance that contains all the data and allows us to load and query it efficiently.

.. GENERATED FROM PYTHON SOURCE LINES 38-49

.. code-block:: default

    from joblib import Memory
    from mobgap.data import MobilisedCvsDmoDataset

    cache = Memory(".cache")
    ds = MobilisedCvsDmoDataset(
        path_config["dmo"],
        path_config["pid_map"],
        memory=cache,
        weartime_reports_base_path=path_config["weartime_reports"],
    )


.. GENERATED FROM PYTHON SOURCE LINES 50-56

QA: Check for duplicated DMOs
-----------------------------
We check if there are any duplicated DMOs in the data.
If so, we print a warning and save the duplicated index entries to a file for further investigation.

.. note:: The initial loading of the data might take some time

.. GENERATED FROM PYTHON SOURCE LINES 56-62

.. code-block:: default

    data = ds.data
    duplicated_dmos = data.index.to_frame()[data.index.to_frame().duplicated()]
    if not duplicated_dmos.empty:
        print("Warning: Duplicated DMOs found. Saving to 'duplicated_dmos.csv'")
        duplicated_dmos.to_csv(Path(path_config["outpath"]) / "duplicated_dmos.csv")
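
To see what the check above flags, here is a small sketch on a toy frame.
The index levels and the ``duration_s`` column are assumptions made for illustration; the index of the real dataset
may look different.

.. code-block:: python

    import pandas as pd

    # Hypothetical index levels, chosen only for this illustration.
    index = pd.MultiIndex.from_tuples(
        [
            ("P1", "2024-01-01", 0),
            ("P1", "2024-01-01", 1),
            ("P1", "2024-01-01", 1),  # repeats the previous index entry
        ],
        names=["participant_id", "measurement_date", "wb_id"],
    )
    toy = pd.DataFrame({"duration_s": [12.0, 30.5, 30.5]}, index=index)

    # Same pattern as in the script: compare the index entries row by row.
    index_frame = toy.index.to_frame()
    duplicated = index_frame[index_frame.duplicated()]

    # Equivalent boolean mask computed directly on the index.
    mask = toy.index.duplicated()

With the default ``keep="first"``, only the second and later occurrences of an index entry are marked, so the saved
file lists the repeats and not the first occurrence.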


.. GENERATED FROM PYTHON SOURCE LINES 63-67

Daily Aggregation
-----------------
We aggregate the data per day for each participant.
We then merge the results with the weartime reports so that we can later filter out days that do not have enough weartime.

.. GENERATED FROM PYTHON SOURCE LINES 67-74

.. code-block:: default

    from mobgap.aggregation import MobilisedAggregator

    daily_agg = MobilisedAggregator(
        **MobilisedAggregator.PredefinedParameters.cvs_dmo_data
    )
    daily_agg.aggregate(data, data_mask=ds.data_mask)


.. GENERATED FROM PYTHON SOURCE LINES 75-77

To exactly match the output format of the original Mobilise-D R-Script, we round the output to 3 decimal places and
convert the stride length values to cm.

.. GENERATED FROM PYTHON SOURCE LINES 77-81

.. code-block:: default

    agg_values = daily_agg.aggregated_data_
    agg_values[["strlen_1030_avg", "strlen_30_avg"]] *= 100
    agg_values = agg_values.round(3)


.. GENERATED FROM PYTHON SOURCE LINES 82-83

Further, we express the variance parameters in "%".

.. GENERATED FROM PYTHON SOURCE LINES 83-94

.. code-block:: default

    agg_values[
        [
            "wbdur_all_var",
            "cadence_all_var",
            "strdur_all_var",
            "ws_30_var",
            "strlen_30_var",
        ]
    ] *= 100
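
The order of the unit conversion, the rounding, and the percent scaling can be traced on a toy frame.
This is a sketch with made-up values, assuming (as the conversion above implies) that stride length is stored in
metres and that the variability value is a unitless fraction.

.. code-block:: python

    import pandas as pd

    # Made-up values for illustration only.
    toy = pd.DataFrame({"strlen_30_avg": [1.23456], "strlen_30_var": [0.12345]})

    toy[["strlen_30_avg"]] *= 100  # m -> cm
    toy = toy.round(3)  # round every column to 3 decimal places
    toy[["strlen_30_var"]] *= 100  # fraction -> %

    strlen_cm = toy.loc[0, "strlen_30_avg"]  # approximately 123.456
    var_percent = toy.loc[0, "strlen_30_var"]  # approximately 12.3

Because the rounding happens before the percent scaling, the percent values effectively carry one decimal place.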


.. GENERATED FROM PYTHON SOURCE LINES 95-99

We then merge the aggregated values with the daily weartime.

.. note:: The initial loading of the weartime data will take some time.
          After it is loaded once, a new file `daily_weartime_pre_computed.csv` will be created in the weartime
          folder and used as cache for subsequent loads.

.. GENERATED FROM PYTHON SOURCE LINES 99-104

.. code-block:: default

    daily_weartime = ds.weartime_daily
    daily_aggregated = agg_values.merge(
        daily_weartime, left_index=True, right_index=True
    )
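
``DataFrame.merge`` performs an inner join by default, so only index entries present in both frames end up in the
result.
The sketch below shows this on toy data; the index levels and values are assumptions, and whether any days are
actually dropped depends on the content of the real weartime table.

.. code-block:: python

    import pandas as pd

    names = ["participant_id", "measurement_date"]  # hypothetical levels
    agg = pd.DataFrame(
        {"steps_all_sum": [5000, 7000]},
        index=pd.MultiIndex.from_tuples(
            [("P1", "2024-01-01"), ("P1", "2024-01-02")], names=names
        ),
    )
    weartime = pd.DataFrame(
        {"total_worn_during_waking_h": [14.5, 9.0]},
        index=pd.MultiIndex.from_tuples(
            [("P1", "2024-01-01"), ("P1", "2024-01-03")], names=names
        ),
    )

    # Default (inner) join: only 2024-01-01 exists in both frames.
    merged = agg.merge(weartime, left_index=True, right_index=True)

    # A left join keeps every aggregated day and fills missing weartime with NaN.
    merged_left = agg.merge(
        weartime, left_index=True, right_index=True, how="left"
    )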


.. GENERATED FROM PYTHON SOURCE LINES 105-106

We drop the "wb_1030_sum" column, because it was not officially verified as a DMO.

.. GENERATED FROM PYTHON SOURCE LINES 106-108

.. code-block:: default

    daily_aggregated = daily_aggregated.drop(columns=["wb_1030_sum"])


.. GENERATED FROM PYTHON SOURCE LINES 109-112

Weartime Filtering
------------------
We keep only days with more than 12 hours of weartime during waking hours.
Days with less weartime (or no weartime information at all) are filtered out.

.. GENERATED FROM PYTHON SOURCE LINES 112-117

.. code-block:: default

    daily_aggregated_filtered = daily_aggregated.query(
        "total_worn_during_waking_h > 12"
    )
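
The following sketch illustrates how the query above treats boundary and missing values.
The day labels and weartime values are made up for illustration.

.. code-block:: python

    import numpy as np
    import pandas as pd

    toy = pd.DataFrame(
        {"total_worn_during_waking_h": [14.5, 12.0, 9.0, np.nan]},
        index=pd.Index(["day_1", "day_2", "day_3", "day_4"], name="day"),
    )

    # Strict comparison: exactly 12 h is dropped, and so is the NaN entry,
    # because comparisons with NaN evaluate to False.
    kept = toy.query("total_worn_during_waking_h > 12")
    kept_days = kept.index.tolist()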


.. GENERATED FROM PYTHON SOURCE LINES 118-121

QA: Number of recording days per participant
--------------------------------------------
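We check the number of valid recording days per participant and save the result for inspection.
We print a warning if a participant has more than 7 valid recording days or more than 6 days between the first and
the last valid recording day.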

.. GENERATED FROM PYTHON SOURCE LINES 121-150

.. code-block:: default

    import pandas as pd

    date_range = (
        daily_aggregated_filtered.index.to_frame()
        .reset_index(drop=True)
        .groupby("participant_id")["measurement_date"]
        .agg(["min", "max", "count"])
        .rename(columns={"count": "n_days", "min": "first", "max": "last"})
        .assign(
            day_diff=lambda df_: df_[["first", "last"]]
            .map(pd.to_datetime)
            .eval("last - first")
        )
    )

    date_range.to_csv(Path(path_config["outpath"]) / "measurement_ranges.csv")

    if not (over_7 := date_range.query("n_days > 7")).empty:
        print(
            f"Warning: {len(over_7)} Participants with more than 7 valid recording days found."
        )

    if not (over_6_diff := date_range.query("day_diff > '6 days'")).empty:
        print(
            f"Warning: {len(over_6_diff)} Participants with more than 6 days between first and last valid recording day "
            "found."
        )
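
The same kind of summary can be reproduced on a small toy table.
This is a simplified sketch rather than the script's exact code: it converts the dates up front and uses named
aggregation, and all participant ids and dates are made up.

.. code-block:: python

    import pandas as pd

    days = pd.DataFrame(
        {
            "participant_id": ["P1", "P1", "P1", "P2"],
            "measurement_date": [
                "2024-01-01",
                "2024-01-03",
                "2024-01-09",
                "2024-02-01",
            ],
        }
    )
    days["measurement_date"] = pd.to_datetime(days["measurement_date"])

    toy_range = (
        days.groupby("participant_id")["measurement_date"]
        .agg(first="min", last="max", n_days="count")
        .assign(day_diff=lambda df_: df_["last"] - df_["first"])
    )

    # Participants whose first and last valid day are more than 6 days apart.
    flagged = toy_range[toy_range["day_diff"] > pd.Timedelta(days=6)]

Here ``P1`` has three valid days spread over 8 days and is flagged, while ``P2`` has a single day and is not.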


.. GENERATED FROM PYTHON SOURCE LINES 151-154

Weekly Aggregation
------------------
We aggregate the daily data to a weekly level and remove weeks with insufficient data.

.. GENERATED FROM PYTHON SOURCE LINES 154-167

.. code-block:: default

    weekly_aggregated = (
        daily_aggregated_filtered.drop(
            columns=["total_worn_h", "total_worn_during_waking_h"]
        )
        .groupby(["visit_type", "participant_id"])
        .mean(numeric_only=True)
        .assign(
            n_days=daily_aggregated_filtered["walkdur_all_sum"]
            .groupby(["visit_type", "participant_id"])
            .count()
        )
    )


.. GENERATED FROM PYTHON SOURCE LINES 168-169

Formatting: Count-like parameters are rounded to whole numbers, all other columns to 3 decimal places.

.. GENERATED FROM PYTHON SOURCE LINES 169-185

.. code-block:: default

    round_to_int = [
        "steps_all_sum",
        "turns_all_sum",
        "wb_all_sum",
        "wb_10_sum",
        "wb_30_sum",
        "wb_60_sum",
    ]
    round_to_three_decimals = weekly_aggregated.columns[
        ~weekly_aggregated.columns.isin(round_to_int)
    ]
    weekly_aggregated[round_to_int] = weekly_aggregated[round_to_int].round()
    weekly_aggregated[round_to_three_decimals] = weekly_aggregated[
        round_to_three_decimals
    ].round(3)
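
Two properties of ``round`` are worth keeping in mind when comparing the output with other tools: the dtype stays
floating point, and exact halves are rounded to the nearest even number.
A minimal sketch with made-up values:

.. code-block:: python

    import pandas as pd

    # Made-up values; the column names are reused from the lists above.
    toy = pd.DataFrame(
        {"steps_all_sum": [2.5, 3.5], "ws_30_var": [1.23456, 0.98765]}
    )
    round_to_int = ["steps_all_sum"]
    other = toy.columns[~toy.columns.isin(round_to_int)]

    toy[round_to_int] = toy[round_to_int].round()
    toy[other] = toy[other].round(3)

    rounded_steps = toy["steps_all_sum"].tolist()  # halves go to the even neighbour
    steps_dtype = str(toy["steps_all_sum"].dtype)  # still a float column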


.. GENERATED FROM PYTHON SOURCE LINES 186-187

Filtering: We keep only the weekly values that are based on at least 3 valid days.

.. GENERATED FROM PYTHON SOURCE LINES 187-190

.. code-block:: default

    weekly_aggregated_filtered = weekly_aggregated[weekly_aggregated["n_days"] >= 3]
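
The weekly mean, the day count, and the 3-day threshold can be traced on a toy frame.
This is a sketch under assumptions: the index levels mirror the names used in the ``groupby`` calls above, and all
values are made up.

.. code-block:: python

    import pandas as pd

    index = pd.MultiIndex.from_tuples(
        [
            ("T1", "P1", "2024-01-01"),
            ("T1", "P1", "2024-01-02"),
            ("T1", "P1", "2024-01-03"),
            ("T1", "P2", "2024-01-01"),
        ],
        names=["visit_type", "participant_id", "measurement_date"],
    )
    daily = pd.DataFrame(
        {
            "walkdur_all_sum": [1.0, 2.0, 3.0, 4.0],
            "steps_all_sum": [4000, 5000, 6000, 3000],
        },
        index=index,
    )

    groups = ["visit_type", "participant_id"]
    weekly = (
        daily.groupby(groups)
        .mean(numeric_only=True)
        # count() only counts non-missing values of the selected column.
        .assign(n_days=daily["walkdur_all_sum"].groupby(groups).count())
    )

    # P1 has 3 valid days and is kept; P2 has only 1 and is removed.
    weekly_valid = weekly[weekly["n_days"] >= 3]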


.. GENERATED FROM PYTHON SOURCE LINES 191-193

Export
------

.. GENERATED FROM PYTHON SOURCE LINES 193-199

.. code-block:: default

    from datetime import datetime

    current_date = datetime.now().strftime("%Y_%m_%d")

    outdir = Path(path_config["outpath"]) / f"export_{current_date}"
    outdir.mkdir(exist_ok=True, parents=True)

.. GENERATED FROM PYTHON SOURCE LINES 200-214

.. code-block:: default

    daily_aggregated.add_suffix("_d").to_csv(outdir / "daily_agg_all.csv")
    daily_aggregated_filtered.add_suffix("_d").to_csv(
        outdir / "daily_agg_filtered.csv"
    )
    weekly_aggregated.add_suffix("_w").to_csv(outdir / "weekly_agg_all.csv")
    weekly_aggregated_filtered.add_suffix("_w").to_csv(
        outdir / "weekly_agg_filtered.csv"
    )
    daily_weartime.to_csv(outdir / "daily_weartime.csv")
    daily_weartime[
        daily_weartime["total_worn_h"].isna()
    ].index.to_frame().reset_index(drop=True).to_csv(
        outdir / "missing_weartime.csv"
    )
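
``add_suffix`` only renames the columns; the index names are written to the CSV header unchanged.
A small sketch with made-up data that renders the CSV into a string instead of a file:

.. code-block:: python

    import pandas as pd

    toy = pd.DataFrame(
        {"steps_all_sum": [5000.0]},
        index=pd.Index(["P1"], name="participant_id"),
    )

    daily_columns = toy.add_suffix("_d").columns.tolist()

    # Without a path, to_csv returns the CSV content as a string.
    csv_header = toy.add_suffix("_w").to_csv().splitlines()[0]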


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 0.000 seconds)

**Estimated memory usage:**  0 MB


.. _sphx_glr_download_auto_examples_aggregation__99_cvs_agg_pipeline_no_exc.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example


    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: _99_cvs_agg_pipeline_no_exc.py <_99_cvs_agg_pipeline_no_exc.py>`

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: _99_cvs_agg_pipeline_no_exc.ipynb <_99_cvs_agg_pipeline_no_exc.ipynb>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_