Utils

`pydts.utils.get_expanded_df(df, event_type_col='J', duration_col='X', pid_col='pid')`

Expands a discrete-time survival dataset into a long-format dataframe suitable for modeling. This function receives a dataframe where each row corresponds to a subject with observed event type and duration. It returns an expanded dataframe where each subject is represented by multiple rows, one for each time point up to their observed time. Right censoring is allowed and should be indicated by event type 0.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `df` | `DataFrame` | Original input dataframe containing one row per subject. | required |
| `event_type_col` | `str` | Name of the column indicating event type. Censoring is marked by 0. | `'J'` |
| `duration_col` | `str` | Name of the column indicating event or censoring time. | `'X'` |
| `pid_col` | `str` | Name of the column indicating subject/patient ID. | `'pid'` |

Returns:

| Type | Description |
| --- | --- |
| `DataFrame` | Expanded dataframe in long format, with one row per subject-time pair. |
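
To make the long format concrete, here is a minimal plain-Python sketch of the rows one subject contributes. It is an illustration only, not the library's implementation: `expand_subject` is a hypothetical helper, and the column names `pid`, `J`, `X` simply mirror the defaults above. The `j_<e>` indicator columns follow the pattern visible in the source listing for this function: the indicator of the observed event is 1 only on the subject's last row, and `j_0` is 1 on every row where no event occurred.

```python
def expand_subject(pid, duration, event_type, event_types):
    """Hypothetical helper: long-format rows for a single subject."""
    rows = []
    for t in range(1, duration + 1):
        is_last = (t == duration)
        # Event indicators are 1 only on the last row, for the observed event.
        indicators = {
            f"j_{e}": int(is_last and e == event_type)
            for e in event_types
            if e > 0
        }
        # j_0 marks "no event at this time point" (includes censoring).
        j_0 = 1 - sum(indicators.values())
        rows.append({"pid": pid, "J": event_type, "X": t, "j_0": j_0, **indicators})
    return rows


# Subject 1 has event type 2 at time 3; subject 2 is censored at time 2.
event_rows = expand_subject(pid=1, duration=3, event_type=2, event_types=[0, 1, 2])
censored_rows = expand_subject(pid=2, duration=2, event_type=0, event_types=[0, 1, 2])

for row in event_rows + censored_rows:
    print(row)
```

Subject 1 yields three rows (`X` = 1, 2, 3) with `j_2 = 1` only on the last one, while the censored subject yields two rows with `j_0 = 1` throughout.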

Source code in src/pydts/utils.py

import pandas as pd


def get_expanded_df(
        df: pd.DataFrame,
        event_type_col: str = 'J',
        duration_col: str = 'X',
        pid_col: str = 'pid') -> pd.DataFrame:
    """
    Expands a discrete-time survival dataset into a long-format dataframe suitable for modeling. This function receives a dataframe where each row corresponds to a subject with observed event type and duration. It returns an expanded dataframe where each subject is represented by multiple rows, one for each time point up to their observed time. Right censoring is allowed and should be indicated by event type 0.

    Args:
        df (pd.DataFrame): Original input dataframe containing one row per subject.
        event_type_col (str): Name of the column indicating event type. Censoring is marked by 0.
        duration_col (str): Name of the column indicating event or censoring time.
        pid_col (str): Name of the column indicating subject/patient ID.

    Returns:
        pd.DataFrame: Expanded dataframe in long format, with one row per subject-time pair.
    """
    unique_times = df[duration_col].sort_values().unique()
    result_df = df.reindex(df.index.repeat(df[duration_col]))
    result_df[duration_col] = result_df.groupby(pid_col).cumcount() + 1
    # drop times that didn't happen
    result_df.drop(index=result_df.loc[~result_df[duration_col].isin(unique_times)].index, inplace=True)
    result_df.reset_index(drop=True, inplace=True)
    last_idx = result_df.drop_duplicates(subset=[pid_col], keep='last').index
    events = sorted(df[event_type_col].unique())
    result_df.loc[last_idx, [f'j_{e}' for e in events]] = pd.get_dummies(
        result_df.loc[last_idx, event_type_col]).values
    result_df[[f'j_{e}' for e in events]] = result_df[[f'j_{e}' for e in events]].fillna(0)
    result_df['j_0'] = 1 - result_df[[f'j_{e}' for e in events if e > 0]].sum(axis=1)
    return result_df
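
One detail in the listing above is easy to miss: after repeating each row, the function drops every time point that never appears as an observed duration in the input (the "drop times that didn't happen" step). The following plain-Python sketch reproduces only that filtering logic, as read from the listing; the `durations` mapping is made-up illustration data.

```python
# Hypothetical input: subject ID -> observed duration.
durations = {1: 2, 2: 5}

# Times that actually occur as a duration somewhere in the data.
observed_times = sorted(set(durations.values()))

# Each subject keeps only the time points 1..duration that were observed.
long_times = {
    pid: [t for t in range(1, d + 1) if t in observed_times]
    for pid, d in durations.items()
}

print(long_times)  # {1: [2], 2: [2, 5]}
```

So with durations 2 and 5, subject 2 gets rows for times 2 and 5 only, not five rows, because times 1, 3, and 4 are not observed durations in this data.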
