Data Regrouping Example
As previously discussed, both estimators require enough observed failures for each (j, t). Sometimes, the data do not comply with this requirement. For example, when dealing with hospitalization length of stay, patients are more likely to be released after a few days rather than after a month, and releases can be less frequent on weekends.
In this example we demonstrate data regrouping as a preprocessing step that allows successful estimation.
import warnings
import sys
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from pydts.examples_utils.generate_simulations_data import generate_quick_start_df
from pydts.examples_utils.plots import plot_events_occurrence, add_panel_text, plot_example_estimated_params
from pydts.fitters import TwoStagesFitter
pd.set_option("display.max_rows", 500)
pd.set_option("display.max_columns", 25)
warnings.filterwarnings('ignore')
%matplotlib inline
Not enough observed failures in later times
We consider a setting in which the observed events become less frequent in later times by simply reducing the sample size to \(n=1000\).
real_coef_dict = {
    "alpha": {
        1: lambda t: -1 - 0.3 * np.log(t),
        2: lambda t: -1.75 - 0.15 * np.log(t)
    },
    "beta": {
        1: -np.log([0.8, 3, 3, 2.5, 2]),
        2: -np.log([1, 3, 4, 3, 2])
    }
}
df = generate_quick_start_df(n_patients=1000, n_cov=5, d_times=30, j_events=2, pid_col='pid', seed=0,
                             real_coef_dict=real_coef_dict)
Evidently, we do not observe enough events at later times. For example, \(n_{j=1,t=25} = 1\) and \(n_{j=2,t=25} = 0\).
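Counts such as \(n_{j,t}\) can be tabulated directly from the data. The following is a minimal sketch on a toy DataFrame, not the notebook's own code: the helper `count_events` and the frame `toy_df` are hypothetical names, and it assumes that `J` holds the event type with `0` marking censored rows and that `X` holds the observed discrete time.

```python
import pandas as pd

# Toy stand-in for the simulated data (assumption: J == 0 means censored).
toy_df = pd.DataFrame({
    "J": [1, 1, 2, 1, 2, 0, 1, 1],
    "X": [1, 1, 1, 2, 2, 2, 3, 3],
})


def count_events(df, event_col="J", time_col="X"):
    """Return a table of observed event counts: one row per time, one column per event type."""
    events = df[df[event_col] != 0]  # drop censored rows
    counts = (
        events.groupby([time_col, event_col])
        .size()
        .unstack(event_col, fill_value=0)
    )
    # Include times at which no event of any type was observed.
    all_times = range(int(df[time_col].min()), int(df[time_col].max()) + 1)
    return counts.reindex(all_times, fill_value=0)


counts = count_events(toy_df)
# Times at which at least one event type has no observed events.
sparse = counts[(counts == 0).any(axis=1)]
print(counts)
print(sparse)
```

Rows of `sparse` are candidate time points for regrouping before fitting.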
Trying to fit the model with such data will result in the following error message:
To handle the zero-event times in the tail of the distribution, events that occurred later than the 21st day (either \(J=1\) or \(J=2\)) are regrouped into a single 21+ event time.
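One way to perform such tail regrouping is to clip the observed time at the last retained day. This is only a sketch under stated assumptions: the helper `regroup_tail` and the frame `toy_df` are hypothetical, and the sketch clips every row (including censored ones), which may differ from the preprocessing actually used here.

```python
import pandas as pd

# Toy data: X is the observed discrete time, J the event type (assumed 0 = censored).
toy_df = pd.DataFrame({
    "pid": [0, 1, 2, 3, 4],
    "J": [1, 2, 0, 1, 2],
    "X": [3, 21, 22, 25, 30],
})


def regroup_tail(df, last_time=21, time_col="X"):
    """Return a copy in which all times later than last_time are mapped to last_time."""
    out = df.copy()
    out[time_col] = out[time_col].clip(upper=last_time)
    return out


toy_regrouped = regroup_tail(toy_df, last_time=21)
print(toy_regrouped)
```

After clipping, the value 21 stands for the "21+" group, so the baseline parameter estimated for that time refers to the whole tail rather than to day 21 alone.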
Now, we can successfully estimate the parameters:
fig, axes = plt.subplots(2,1, figsize=(10,8))
ax = axes[0]
ax = plot_events_occurrence(df, ax=ax)
add_panel_text(ax, 'a')
ax = axes[1]
ax = plot_events_occurrence(regrouped_df, ax=ax)
labels = [item.get_text() for item in ax.get_xticklabels()]
labels[-1] = '21+'
ax.set_xticklabels(labels)
add_panel_text(ax, 'b')
fig.tight_layout()
Not enough observed events at specific times
Consider the case of almost no discharge events during the weekends. In the following we resample the data to reflect this setting:
from random import random

def map_days(row):
    # Move J=1 events from days 7, 14 and 21 to the preceding day:
    # about 90% of those on days 7 and 14, and all of those on day 21.
    if row['X'] in [7, 14, 21] and row['J'] in [1]:
        if (random() > 0.1) or (row['X'] == 21):
            row['X'] -= 1
    return row

regrouped_df = regrouped_df.apply(map_days, axis=1)
# Restore integer dtypes after the row-wise apply.
regrouped_df[['J', 'T', 'C', 'X']] = regrouped_df[['J', 'T', 'C', 'X']].astype('int64')
(regrouped_df.groupby(['J'])['X'].value_counts()).to_frame().unstack()
Trying to fit the model with such data will result in the following error message:
We suggest regrouping the empty times with the preceding days:
def map_days_second_try(row):
    # Merge days 7, 14 and 21 into the preceding day for every row.
    if row['X'] in [7, 14, 21]:
        row['X'] -= 1
    return row

regrouped_df = regrouped_df.apply(map_days_second_try, axis=1)
# Restore integer dtypes after the row-wise apply.
regrouped_df[['J', 'T', 'C', 'X']] = regrouped_df[['J', 'T', 'C', 'X']].astype('int64')
(regrouped_df.groupby(['J'])['X'].value_counts()).to_frame().unstack()
fig, axes = plt.subplots(2,1, figsize=(10,8))
ax = axes[0]
ax = plot_events_occurrence(df, ax=ax)
add_panel_text(ax, 'a')
ax = axes[1]
ax = plot_events_occurrence(regrouped_df, ax=ax)
labels = [item.get_text() for item in ax.get_xticklabels()]
labels[5] = '6-7'
labels[11] = '13-14'
labels[17] = '20-21'
ax.set_xticklabels(labels)
add_panel_text(ax, 'b')
fig.tight_layout()
Now we can estimate the parameters, although the parameters associated with the grouped time points should be interpreted with care.
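The row-wise regrouping of days 7, 14 and 21 shown above can also be expressed as a vectorized mapping, which avoids `DataFrame.apply` and keeps the integer dtype. The sketch below runs on a toy DataFrame. The names `merge_times`, `merge_map` and `toy_df` are hypothetical and are not part of the library.

```python
import pandas as pd

# Toy data: X is the observed discrete time, J the event type (assumed 0 = censored).
toy_df = pd.DataFrame({
    "J": [1, 2, 1, 0, 2, 1],
    "X": [6, 7, 13, 14, 21, 5],
})

# Each sparse time point is merged into the preceding day.
merge_map = {7: 6, 14: 13, 21: 20}


def merge_times(df, mapping, time_col="X"):
    """Return a copy in which the times listed in mapping are replaced by their targets."""
    out = df.copy()
    out[time_col] = out[time_col].replace(mapping)
    return out


toy_merged = merge_times(toy_df, merge_map)
print(toy_merged)
```

With this mapping, the value 6 represents days 6-7, 13 represents days 13-14, and 20 represents days 20-21, which is why the corresponding parameters describe a two-day interval rather than a single day.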