Utils

pydts.utils.get_expanded_df(df, event_type_col='J', duration_col='X', pid_col='pid')

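Expands a discrete-time survival dataset into a long-format dataframe suitable for modeling. The function receives a dataframe in which each row corresponds to one subject with an observed event type and duration. It returns an expanded dataframe in which each subject is represented by multiple rows, one for each time point up to the subject's observed time. The result also carries one indicator column per event type (j_0, j_1, and so on): an event indicator is 1 only on the subject's last row, and j_0 is 1 on every row where no event of type greater than 0 occurs. Right censoring is allowed and should be indicated by event type 0.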

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Original input dataframe containing one row per subject. | required |
| event_type_col | str | Name of the column indicating event type. Censoring is marked by 0. | 'J' |
| duration_col | str | Name of the column indicating event or censoring time. | 'X' |
| pid_col | str | Name of the column indicating subject/patient ID. | 'pid' |

Returns:

| Type | Description |
|---|---|
| DataFrame | Expanded dataframe in long format, with one row per subject-time pair. |
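
To make the long format concrete, the following is a minimal, dependency-free sketch of the same person-period expansion over a list of dicts. It is not the library's implementation: `expand_records` and the toy `subjects` data are hypothetical names introduced only for illustration, and the sketch keeps every time point from 1 to the observed duration, without filtering out time values that no subject was observed at.

```python
def expand_records(records, event_type_col="J", duration_col="X", pid_col="pid"):
    """Hypothetical helper: person-period expansion over a list of dicts."""
    # Event types greater than 0 are actual events; 0 denotes right censoring.
    event_types = sorted(
        {rec[event_type_col] for rec in records if rec[event_type_col] > 0}
    )
    expanded = []
    for rec in records:
        observed_time = rec[duration_col]
        # One output row per time point 1, 2, ..., observed_time.
        for t in range(1, observed_time + 1):
            row = dict(rec)
            row[duration_col] = t
            is_last = t == observed_time
            # An event indicator is 1 only on the subject's last row,
            # and only for the event type that was actually observed.
            for e in event_types:
                row[f"j_{e}"] = int(is_last and rec[event_type_col] == e)
            # j_0 flags "no event at this time point", including censoring.
            row["j_0"] = 1 - sum(row[f"j_{e}"] for e in event_types)
            expanded.append(row)
    return expanded


# Toy data (assumed for illustration): pid 2 is censored at time 3.
subjects = [
    {"pid": 1, "J": 1, "X": 2},
    {"pid": 2, "J": 0, "X": 3},
    {"pid": 3, "J": 2, "X": 1},
]
long_rows = expand_records(subjects)
for row in long_rows:
    print(row)
```

With this toy input, subject 1 contributes two rows, subject 2 three rows, and subject 3 one row, so the expanded data has six rows in total. Only the last row of subjects 1 and 3 has an event indicator set, while every row of the censored subject 2 has `j_0` equal to 1.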

Source code in src/pydts/utils.py
import pandas as pd


def get_expanded_df(
        df: pd.DataFrame,
        event_type_col: str = 'J',
        duration_col: str = 'X',
        pid_col: str = 'pid') -> pd.DataFrame:
    """
    Expands a discrete-time survival dataset into a long-format dataframe suitable for modeling.
    This function receives a dataframe where each row corresponds to a subject with observed
    event type and duration. It returns an expanded dataframe where each subject is represented
    by multiple rows, one for each time point up to their observed time.
    Right censoring is allowed and should be indicated by event type 0.

    Args:
        df (pd.DataFrame): Original input dataframe containing one row per subject.
        event_type_col (str): Name of the column indicating event type. Censoring is marked by 0.
        duration_col (str): Name of the column indicating event or censoring time.
        pid_col (str): Name of the column indicating subject/patient ID.

    Returns:
        pd.DataFrame: Expanded dataframe in long format, with one row per subject-time pair.
    """
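    unique_times = df[duration_col].sort_values().unique()
    # Repeat each subject's row once per time point up to the observed duration
    result_df = df.reindex(df.index.repeat(df[duration_col]))
    # Overwrite the duration with the running time index 1, 2, ..., X within each subject
    result_df[duration_col] = result_df.groupby(pid_col).cumcount() + 1
    # Drop times that never occur as an observed duration. A boolean mask is used because
    # the repeated index labels would make a label-based drop remove whole subjects.
    result_df = result_df.loc[result_df[duration_col].isin(unique_times)]
    result_df = result_df.reset_index(drop=True)
    # The last row of each subject carries the observed event type
    last_idx = result_df.drop_duplicates(subset=[pid_col], keep='last').index
    events = sorted(df[event_type_col].unique())
    event_cols = [f'j_{e}' for e in events]
    # One-hot encode the event type on the last rows; all other rows are filled with 0
    result_df.loc[last_idx, event_cols] = pd.get_dummies(
        result_df.loc[last_idx, event_type_col]).values
    result_df[event_cols] = result_df[event_cols].fillna(0)
    # j_0 marks rows without an event of type > 0, including censored last rows
    result_df['j_0'] = 1 - result_df[[f'j_{e}' for e in events if e > 0]].sum(axis=1)
    return result_df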