create_multi_groupby
- mobgap.utils.df_operations.create_multi_groupby(
- primary_df: DataFrame,
- secondary_dfs: DataFrame | list[DataFrame],
- groupby: Hashable | list[str],
- **kwargs: Unpack[dict[str, Any]],
- )
Group multiple dataframes by the same index levels to apply a function to each group across all dataframes.
This function returns an object similar to a DataFrameGroupBy object, but with only the apply and __iter__ methods implemented. This special groupby object applies a groupby to the primary dataframe, but when iterating over the groups or applying a function, it also provides the matching groups of the secondary dataframes by using loc with the group name of the primary dataframe.
This also means that this function is much more limited than the standard groupby object: it only supports grouping by existing named index levels and requires all dataframes to have the same index columns.
Warning
It is important to understand that we only group by the index of the primary dataframe! This means that if an index value only exists in one of the secondary dataframes, it will be ignored. We do this to be able to “just” use the normal pandas groupby API under the hood. We simply group the primary dataframe, get the corresponding groups from the secondary dataframes (if available) and inject them into all operations.
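The behaviour described in this warning can be sketched with plain pandas. This is only an illustration under stated assumptions, not the mobgap implementation: iter_groups is a hypothetical helper, the data is made up, and a boolean mask passed to loc stands in for the group-name lookup so that a missing key gives an empty dataframe.

```python
import pandas as pd

primary = pd.DataFrame(
    {"group1": [1, 1, 2, 3], "group2": [1, 2, 1, 1], "value": [1, 2, 3, 4]}
).set_index(["group1", "group2"])

# Group 4 exists only in the secondary dataframe, group 3 only in the primary one.
secondary = pd.DataFrame(
    {
        "group1": [1, 1, 1, 2, 4],
        "group2": [1, 2, 3, 1, 1],
        "value": [11, 12, 13, 14, 15],
    }
).set_index(["group1", "group2"])


def iter_groups(primary_df, secondary_df, level):
    """Hypothetical stand-in (not mobgap code): group the primary, look up the secondary."""
    secondary_keys = secondary_df.index.get_level_values(level)
    for name, primary_group in primary_df.groupby(level=level):
        # A boolean mask keeps all index levels and yields an empty frame
        # when the group name is missing from the secondary dataframe.
        yield name, primary_group, secondary_df.loc[secondary_keys == name]


# Map each group name to (rows in primary group, rows in secondary group).
summary = {
    int(name): (len(p), len(s))
    for name, p, s in iter_groups(primary, secondary, "group1")
}
print(summary)
```

Only the group names found in the primary dataframe appear as keys: group 3 is paired with an empty secondary group, and group 4 is never visited.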
- Parameters:
- primary_df
The primary dataframe to group by. Its index will be used to perform the actual grouping.
- secondary_dfs
The secondary dataframes to group by.
- groupby
The names of the index levels to group by.
- kwargs
All further arguments will be passed to .groupby of all dataframes.
Examples
>>> df = pd.DataFrame(
...     {
...         "group1": [1, 1, 2, 3],
...         "group2": [1, 2, 1, 1],
...         "value": [1, 2, 3, 4],
...     }
... ).set_index(["group1", "group2"])
>>> df_2 = pd.DataFrame(
...     {
...         "group1": [1, 1, 1, 2],
...         "group2": [1, 2, 3, 1],
...         "value": [11, 12, 13, 14],
...     }
... ).set_index(["group1", "group2"])
>>> multi_groupby = create_multi_groupby(df, df_2, ["group1"])
>>> for group, (df1, df2) in multi_groupby:
...     print(group)
...     print(df1)
...     print(df2)
1
               value
group1 group2
1      1           1
       2           2
               value
group1 group2
1      1          11
       2          12
       3          13
2
               value
group1 group2
2      1           3
               value
group1 group2
2      1          14
3
               value
group1 group2
3      1           4
Empty DataFrame
Columns: [value]
Index: []