Data Regrouping Example

As previously discussed, both estimators require enough observed failures for each (j, t). Sometimes, the data do not comply with this requirement. For example, when dealing with hospitalization length of stay, patients are more likely to be released after a few days rather than after a month, and releases can be less frequent on weekends.

In this example, we demonstrate data regrouping as a preprocessing step that allows the estimation to succeed.

import warnings
import sys 

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from pydts.examples_utils.generate_simulations_data import generate_quick_start_df
from pydts.examples_utils.plots import plot_events_occurrence, add_panel_text, plot_example_estimated_params
from pydts.fitters import TwoStagesFitter

pd.set_option("display.max_rows", 500)
pd.set_option("display.max_columns", 25)
warnings.filterwarnings('ignore')
%matplotlib inline

Not enough observed failures in later times

We consider a setting in which the observed events become less frequent in later times by simply reducing the sample size to \(n=1000\).

real_coef_dict = {
    "alpha": {
        1: lambda t: -1 - 0.3 * np.log(t),
        2: lambda t: -1.75 - 0.15 * np.log(t)
    },
    "beta": {
        1: -np.log([0.8, 3, 3, 2.5, 2]),
        2: -np.log([1, 3, 4, 3, 2])
    }
}

df = generate_quick_start_df(n_patients=1000, n_cov=5, d_times=30, j_events=2, pid_col='pid', seed=0, 
                             real_coef_dict=real_coef_dict)

Evidently, we do not observe enough events at later times. For example, \(n_{j=1,t=25} = 1\) and \(n_{j=2,t=25} = 0\).

ax = plot_events_occurrence(df)
add_panel_text(ax, 'a')

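The counts \(n_{j,t}\) quoted above can also be inspected numerically rather than read off the plot. The following is a minimal sketch, assuming the event type is stored in column `J` (with 0 marking censoring) and the observed time in column `X`, as in the data frames used here. `toy_df` and `count_events` are hypothetical names introduced for illustration and are not part of PyDTS.

```python
import pandas as pd

def count_events(df, event_col="J", time_col="X"):
    """Return an (event type x time) table of observed event counts n_jt."""
    observed = df[df[event_col] != 0]  # drop censored rows (J == 0)
    counts = pd.crosstab(observed[event_col], observed[time_col])
    # Keep every time point present in the data, even those with zero events.
    all_times = sorted(df[time_col].unique())
    return counts.reindex(columns=all_times, fill_value=0)

# Hypothetical toy data: two event types, four time points.
toy_df = pd.DataFrame({
    "J": [1, 1, 2, 0, 1, 2, 0, 1],
    "X": [1, 1, 1, 2, 2, 3, 4, 4],
})

n_jt = count_events(toy_df)
print(n_jt)

# (event type, time) pairs with no observed events at all.
zero_pairs = [(j, t) for j in n_jt.index for t in n_jt.columns if n_jt.loc[j, t] == 0]
print(zero_pairs)
```

Applying the same helper to `df` should list the late time points at which one of the event types is never observed.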

Trying to fit the model with such data will result in the following error message:

m2 = TwoStagesFitter()
try:
    m2.fit(df.drop(columns=['C', 'T']), verbose=0)
except RuntimeError as e:
    raise e.with_traceback(None)

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Input In [4], in <cell line: 2>()
      3     m2.fit(df.drop(columns=['C', 'T']), verbose=0)
      4 except RuntimeError as e:
----> 5     raise e.with_traceback(None)

RuntimeError: Number of observed events at some time points are too small. Consider collapsing neighbor time points.
 See https://tomer1812.github.io/pydts/UsageExample-RegroupingData/ for more details.

To fix the time points with zero events in the tail of the distribution, events that occurred later than the 21st day (either \(J=1\) or \(J=2\)) are grouped into a single 21+ event time.

regrouped_df = df.copy()
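# Assign the clipped column back; an in-place clip on a column selection may not update the frame in recent pandas versions
regrouped_df['X'] = regrouped_df['X'].clip(upper=21)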
ax = plot_events_occurrence(regrouped_df)
add_panel_text(ax, 'b')
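The cutoff of 21 days was chosen here by inspecting the plot. As a rough illustration of how such a cutoff could be proposed from the counts, the sketch below returns the largest time up to which every event type has at least `min_events` observed events at every time point. This is only one possible heuristic and not the rule used in this example. The helper `suggest_tail_cutoff`, the threshold `min_events`, and `toy_df` are hypothetical and not part of PyDTS.

```python
import pandas as pd

def suggest_tail_cutoff(df, min_events=1, event_col="J", time_col="X"):
    """Largest time t such that each event type has at least `min_events`
    observed events at every time point up to and including t."""
    observed = df[df[event_col] != 0]  # drop censored rows (J == 0)
    counts = pd.crosstab(observed[event_col], observed[time_col])
    times = range(int(df[time_col].min()), int(df[time_col].max()) + 1)
    counts = counts.reindex(columns=times, fill_value=0)
    enough = (counts >= min_events).all(axis=0)  # per time point, over event types
    cutoff = None
    for t in times:
        if not enough[t]:
            break
        cutoff = t
    return cutoff

# Hypothetical toy data: no event is observed at time 4.
toy_df = pd.DataFrame({
    "J": [1, 2, 1, 2, 1, 2, 0, 1],
    "X": [1, 1, 2, 2, 3, 3, 4, 5],
})

cutoff = suggest_tail_cutoff(toy_df)
clipped = toy_df["X"].clip(upper=cutoff)  # pool all later times into the cutoff
print(cutoff, clipped.tolist())
```

A suggested cutoff is only a starting point: the pooled last time point should still be checked against the event occurrence plot.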

Now, we can successfully estimate the parameters:

fig, axes = plt.subplots(2,1, figsize=(10,8))
ax = axes[0]
ax = plot_events_occurrence(df, ax=ax)
add_panel_text(ax, 'a')
ax = axes[1]
ax = plot_events_occurrence(regrouped_df, ax=ax)
labels = [item.get_text() for item in ax.get_xticklabels()]
labels[-1] = '21+'
ax.set_xticklabels(labels)
add_panel_text(ax, 'b')
fig.tight_layout()

m2 = TwoStagesFitter()
m2.fit(regrouped_df.drop(columns=['C', 'T']))
plot_example_estimated_params(m2)

INFO: Pandarallel will run on 4 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.

m2.print_summary()

	j1_params	j1_SE	j2_params	j2_SE
covariate
Z1	0.059593	0.187268	0.433584	0.286376
Z2	-0.977635	0.192579	-0.651536	0.285630
Z3	-1.012224	0.191226	-1.147630	0.290300
Z4	-1.048233	0.180891	-0.221056	0.272420
Z5	-0.817598	0.180479	-0.475688	0.272666



Model summary for event: 1

		n_jt	success	alpha_jt
J	X
1	1	63	True	-0.947684
	2	49	True	-1.051086
	3	34	True	-1.295866
	4	34	True	-1.178560
	5	15	True	-1.903136
	6	25	True	-1.298924
	7	20	True	-1.401847
	8	17	True	-1.451159
	9	11	True	-1.788478
	10	12	True	-1.564731
	11	15	True	-1.247966
	12	12	True	-1.373390
	13	7	True	-1.834668
	14	5	True	-2.060652
	15	14	True	-0.884086
	16	5	True	-1.761146
	17	5	True	-1.645269
	18	4	True	-1.729781
	19	1	True	-2.928615
	20	3	True	-1.769298
	21	8	True	-0.566276



Model summary for event: 2

		n_jt	success	alpha_jt
J	X
2	1	24	True	-2.770174
	2	24	True	-2.619309
	3	13	True	-3.105049
	4	11	True	-3.164241
	5	9	True	-3.269706
	6	12	True	-2.900518
	7	14	True	-2.616379
	8	6	True	-3.361561
	9	13	True	-2.468053
	10	3	True	-3.827000
	11	1	True	-4.497481
	12	3	True	-3.627228
	13	4	True	-3.288580
	14	3	True	-3.455462
	15	3	True	-3.369194
	16	3	True	-3.193872
	17	1	True	-4.094674
	18	2	True	-3.325382
	19	2	True	-3.218786
	20	1	True	-3.768195
	21	4	True	-2.222326

Not enough observed events at specific times

Consider the case of almost no discharge events during the weekends. In the following we resample the data to reflect this setting:

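from random import random

def map_days(row):
    # Simulate rare weekend discharges: move event-1 times on days 7, 14 and 21 one day back
    if row['X'] in [7, 14, 21] and row['J'] in [1]:
        # About 90% of the events on days 7 and 14, and all events on day 21, are moved
        if (random() > 0.1) or (row['X'] == 21):
            row['X'] -= 1
    return row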

regrouped_df = regrouped_df.apply(map_days, axis=1)
regrouped_df[['J', 'T', 'C', 'X']] = regrouped_df[['J', 'T', 'C', 'X']].astype('int64')
(regrouped_df.groupby(['J'])['X'].value_counts()).to_frame().unstack()

	X
X	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21
J
0	30.0	20.0	28.0	21.0	22.0	22.0	23.0	21.0	20.0	11.0	18.0	15.0	15.0	13.0	21.0	16.0	19.0	14.0	14.0	14.0	108.0
1	63.0	49.0	34.0	34.0	15.0	43.0	2.0	17.0	11.0	12.0	15.0	12.0	11.0	1.0	14.0	5.0	5.0	4.0	1.0	11.0	NaN
2	24.0	24.0	13.0	11.0	9.0	12.0	14.0	6.0	13.0	3.0	1.0	3.0	4.0	3.0	3.0	3.0	1.0	2.0	2.0	1.0	4.0
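Because `random()` is called without a seed, the counts above change between runs. A seeded, vectorized variant of the same resampling idea is sketched below. It is an illustration under assumptions, not the code that produced the table above: the seed, the name `shifted_df`, and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen arbitrarily for illustration

# Hypothetical toy data with the same J (event type) and X (time) columns.
toy_df = pd.DataFrame({
    "J": [1, 1, 1, 2, 0, 1],
    "X": [7, 14, 21, 7, 14, 5],
})

# Event-1 observations that fall on days 7, 14 or 21.
is_weekend_discharge = toy_df["X"].isin([7, 14, 21]) & (toy_df["J"] == 1)

# Move roughly 90% of them, and all of those on day 21, one day earlier.
move = is_weekend_discharge & ((rng.random(len(toy_df)) > 0.1) | (toy_df["X"] == 21))

shifted_df = toy_df.copy()
shifted_df.loc[move, "X"] -= 1
print(shifted_df)
```

Since only integers are subtracted from an integer column, no cast back to `int64` is needed in this variant.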

df = regrouped_df.copy()
plot_events_occurrence(regrouped_df)

<AxesSubplot:xlabel='Time', ylabel='Number of Observations'>

Trying to fit the model with such data will result in the following error message:

m2 = TwoStagesFitter()
try: 
    m2.fit(regrouped_df.drop(columns=['C', 'T']), verbose=0)
except RuntimeError as e:
    raise e.with_traceback(None)

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Input In [11], in <cell line: 2>()
      3     m2.fit(regrouped_df.drop(columns=['C', 'T']), verbose=0)
      4 except RuntimeError as e:
----> 5     raise e.with_traceback(None)

RuntimeError: Number of observed events at some time points are too small. Consider collapsing neighbor time points.
 See https://tomer1812.github.io/pydts/UsageExample-RegroupingData/ for more details.

We suggest regrouping the sparse time points with the preceding day:

def map_days_second_try(row):
    # Merge days 7, 14 and 21 into the preceding day, for all event types and censoring
    if row['X'] in [7, 14, 21]:
        row['X'] -= 1
    return row

regrouped_df = regrouped_df.apply(map_days_second_try, axis=1)
regrouped_df[['J', 'T', 'C', 'X']] = regrouped_df[['J', 'T', 'C', 'X']].astype('int64')
(regrouped_df.groupby(['J'])['X'].value_counts()).to_frame().unstack()

	X
X	1	2	3	4	5	6	8	9	10	11	12	13	15	16	17	18	19	20
J
0	30	20	28	21	22	45	21	20	11	18	15	28	21	16	19	14	14	122
1	63	49	34	34	15	45	17	11	12	15	12	12	14	5	5	4	1	11
2	24	24	13	11	9	26	6	13	3	1	3	7	3	3	1	2	2	5
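The same merge can be written without a row-wise `apply`, which may be more convenient for larger data sets. A minimal sketch, assuming the time column is named `X` as above. The names `merge_map`, `toy_df`, and `merged_df` are hypothetical and used only for illustration.

```python
import pandas as pd

# Each sparse day is mapped to the preceding day it is merged into.
merge_map = {7: 6, 14: 13, 21: 20}

# Hypothetical toy data with the same J (event type) and X (time) columns.
toy_df = pd.DataFrame({
    "J": [1, 2, 0, 1, 2, 1],
    "X": [6, 7, 13, 14, 20, 21],
})

merged_df = toy_df.copy()
merged_df["X"] = merged_df["X"].replace(merge_map)
print(merged_df["X"].tolist())
```

Keeping the mapping in a dictionary also documents which time points were merged, which helps when relabeling plot ticks or interpreting the estimated parameters later.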

plot_events_occurrence(regrouped_df)

<AxesSubplot:xlabel='Time', ylabel='Number of Observations'>

fig, axes = plt.subplots(2,1, figsize=(10,8))
ax = axes[0]
ax = plot_events_occurrence(df, ax=ax)
add_panel_text(ax, 'a')
ax = axes[1]
ax = plot_events_occurrence(regrouped_df, ax=ax)
labels = [item.get_text() for item in ax.get_xticklabels()]
labels[5] = '6-7'
labels[11] = '13-14'
labels[17] = '20-21'
ax.set_xticklabels(labels)
add_panel_text(ax, 'b')
fig.tight_layout()

Now we can estimate the parameters. Note that the parameters related to the grouped time points should be interpreted with care.

m2 = TwoStagesFitter()
m2.fit(regrouped_df.drop(columns=['C', 'T']), verbose=0)
plot_example_estimated_params(m2)

m2.print_summary()

	j1_params	j1_SE	j2_params	j2_SE
covariate
Z1	0.125497	0.226571	0.559011	0.344835
Z2	-0.590350	0.230030	-0.146264	0.338669
Z3	-0.577012	0.230431	-0.601869	0.343531
Z4	-0.813756	0.221557	0.119530	0.328842
Z5	-0.586253	0.231495	-0.254760	0.345008



Model summary for event: 1

		n_jt	success	alpha_jt
J	X
1	1	63	True	-1.544779
	2	49	True	-1.658131
	3	34	True	-1.911357
	4	34	True	-1.795520
	5	15	True	-2.530238
	6	45	True	-1.296992
	8	17	True	-2.090432
	9	11	True	-2.431846
	10	12	True	-2.211441
	11	15	True	-1.904531
	12	12	True	-2.032937
	13	12	True	-1.938316
	15	14	True	-1.565972
	16	5	True	-2.453997
	17	5	True	-2.335441
	18	4	True	-2.424637
	19	1	True	-3.647561
	20	11	True	-1.103130



Model summary for event: 2

		n_jt	success	alpha_jt
J	X
2	1	24	True	-3.574037
	2	24	True	-3.442172
	3	13	True	-3.946532
	4	11	True	-4.010704
	5	9	True	-4.104630
	6	26	True	-2.947698
	8	6	True	-4.205070
	9	13	True	-3.331808
	10	3	True	-4.678181
	11	1	True	-5.489276
	12	3	True	-4.531819
	13	7	True	-3.595795
	15	3	True	-4.254731
	16	3	True	-4.099612
	17	1	True	-4.963138
	18	2	True	-4.249112
	19	2	True	-4.146457
	20	5	True	-3.099039
