Data Preparation

Data Generation

For simplicity of presentation, we consider \(M=2\) competing events, though PyDTS can handle any number of competing events, provided that enough failures of each type are observed at each discrete time point.

Here, we use \(d=30\) discrete time points, \(n=50,000\) observations, and a covariate vector \(Z\) with 5 covariates. Failure times were generated based on the model:

\[ \lambda_{j}(t|Z) = \frac{\exp(\alpha_{jt}+Z^{T}\beta_{j})}{1+\exp(\alpha_{jt}+Z^{T}\beta_{j})} \]

with

\(\alpha_{1t} = -1 -0.3 \log(t)\),

\(\alpha_{2t} = -1.75 -0.15\log(t)\), \(t=1,\ldots,d\),

\(\beta_1 = -(\log 0.8, \log 3, \log 3, \log 2.5, \log 2)\),

\(\beta_{2} = -(\log 1, \log 3, \log 4, \log 3, \log 2)\).
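As a minimal sketch (not the PyDTS data generator itself), the hazard formula above can be evaluated directly with NumPy. The helper name `hazard` and the example covariate vector are illustrative assumptions; the coefficient values are taken as they are written in `real_coef_dict` in the code of this section.

```python
import numpy as np

# Coefficient values as written in this section's `real_coef_dict`.
ALPHA = {
    1: lambda t: -1.0 - 0.3 * np.log(t),
    2: lambda t: -1.75 - 0.15 * np.log(t),
}
BETA = {
    1: -np.log([0.8, 3, 3, 2.5, 2]),
    2: -np.log([1, 3, 4, 3, 2]),
}


def hazard(j, t, z):
    """Evaluate lambda_j(t | Z) = expit(alpha_jt + Z^T beta_j). Illustrative helper."""
    eta = ALPHA[j](t) + np.dot(z, BETA[j])
    return 1.0 / (1.0 + np.exp(-eta))


z_example = np.full(5, 0.5)  # hypothetical covariate vector, chosen only for illustration
for j in (1, 2):
    # Print the hazard of event type j at a few discrete time points.
    print(j, [round(float(hazard(j, t, z_example)), 4) for t in (1, 15, 30)])
```

With these intercepts, each hazard decreases in \(t\) for a fixed \(Z\), because both \(\alpha_{1t}\) and \(\alpha_{2t}\) decrease with \(\log(t)\).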

The censoring time of each observation was sampled from a discrete uniform distribution, i.e., \(C_i \sim \mbox{Uniform}\{1,\ldots,d+1\}\).
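For illustration only, a discrete uniform draw on \(\{1,\ldots,d+1\}\) could be sketched with NumPy as follows. This is not the sampling code used inside `generate_quick_start_df`; the variable names and the seed are assumptions made for this sketch.

```python
import numpy as np

d = 30      # number of discrete time points, as in this section
n = 50_000  # number of observations, as in this section

rng = np.random.default_rng(0)  # arbitrary seed chosen for this sketch
# Generator.integers excludes the upper bound, so high=d + 2 yields values in {1, ..., d + 1}.
censoring_times = rng.integers(low=1, high=d + 2, size=n)

print(censoring_times.min(), censoring_times.max())
```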

Our goal is to estimate \(\{\alpha_{11},\ldots,\alpha_{1d},\beta_1^T,\alpha_{21},\ldots,\alpha_{2d},\beta_2^T\}\) (70 parameters in total) along with the standard errors of the estimators.
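The count of 70 follows from \(M(d+p)\) with \(M=2\), \(d=30\) and \(p=5\) covariates. A short sketch that enumerates the parameters, using naming that is purely illustrative:

```python
d = 30  # discrete time points
p = 5   # covariates
M = 2   # competing events

param_names = []
for j in range(1, M + 1):
    # d intercepts alpha_j1, ..., alpha_jd for event type j
    param_names += [f"alpha_{j}_{t}" for t in range(1, d + 1)]
    # p covariate coefficients forming beta_j
    param_names += [f"beta_{j}_{k}" for k in range(1, p + 1)]

print(len(param_names))  # 70
```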

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from pydts.examples_utils.generate_simulations_data import generate_quick_start_df
import warnings
pd.set_option("display.max_rows", 500)
warnings.filterwarnings('ignore')
%matplotlib inline

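# True coefficients used to generate the data, keyed by event type j
real_coef_dict = {
    # Time-dependent intercepts alpha_jt, given as functions of t
    "alpha": {
        1: lambda t: -1 - 0.3 * np.log(t),
        2: lambda t: -1.75 - 0.15 * np.log(t)
    },
    # Covariate coefficient vectors beta_j (one entry per covariate)
    "beta": {
        1: -np.log([0.8, 3, 3, 2.5, 2]),
        2: -np.log([1, 3, 4, 3, 2])
    }
}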

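n_patients = 50000  # number of simulated observations (n)
n_cov = 5           # number of covariates in Z

# Simulate the data set with 30 discrete time points and 2 competing events
patients_df = generate_quick_start_df(n_patients=n_patients, n_cov=n_cov, d_times=30, j_events=2,
                                      pid_col='pid', seed=0, censoring_prob=0.8,
                                      real_coef_dict=real_coef_dict)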

patients_df.head()

	pid	Z1	Z2	Z3	Z4	Z5	J	T	C	X
0	0	0.548814	0.715189	0.602763	0.544883	0.423655	0	30	10	10
1	1	0.645894	0.437587	0.891773	0.963663	0.383442	0	30	24	24
2	2	0.791725	0.528895	0.568045	0.925597	0.071036	0	17	11	11
3	3	0.087129	0.020218	0.832620	0.778157	0.870012	1	1	30	1
4	4	0.978618	0.799159	0.461479	0.780529	0.118274	0	15	14	14
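The columns are not described in this section. The five rows printed above are consistent with the reading that `T` is the event time, `C` the censoring time, `X` the observed time \(\min(T, C)\), and `J` the observed event type, with `J = 0` when censoring comes first. That reading is an assumption inferred from these rows, not a documented definition, and the sketch below only checks it against the values shown.

```python
import pandas as pd

# The five rows printed above (covariate columns omitted for brevity).
head_df = pd.DataFrame({
    "pid": [0, 1, 2, 3, 4],
    "J": [0, 0, 0, 1, 0],
    "T": [30, 30, 17, 1, 15],
    "C": [10, 24, 11, 30, 14],
    "X": [10, 24, 11, 1, 14],
})

# Assumed relationship: the observed time is the earlier of the event and censoring times.
x_matches = bool((head_df["X"] == head_df[["T", "C"]].min(axis=1)).all())
# Assumed relationship: J == 0 marks rows where censoring precedes the event.
censored_matches = bool(((head_df["J"] == 0) == (head_df["C"] < head_df["T"])).all())

print(x_matches, censored_matches)
```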

Checking the Data

Both estimation methods require enough observed failures of each failure type at each discrete time point. Therefore, the first step is to verify that the data at hand satisfy this requirement.

As shown below, the data in our example comply with this requirement.

Preprocessing suggestions for cases in which the data do not comply with this requirement are given in the Data Regrouping Example.

patients_df.groupby(['J', 'X'])['pid'].count().unstack('J')

J	0	1	2
X
1	1236	3374	1250
2	1124	2328	839
3	1029	1805	805
4	972	1524	644
5	939	1214	570
6	889	1114	483
7	830	916	416
8	832	830	409
9	797	683	323
10	685	626	306
11	703	569	240
12	648	516	246
13	679	419	226
14	647	410	198
15	603	326	170
16	601	320	162
17	585	280	147
18	564	240	115
19	505	243	125
20	465	204	118
21	488	176	83
22	465	167	89
23	497	166	65
24	457	118	59
25	440	114	58
26	427	109	53
27	430	89	43
28	396	70	38
29	398	67	43
30	3245	47	37
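The visual inspection above can also be automated. The following is a minimal sketch, assuming the same `pid`, `J` and `X` column names and that `J == 0` marks censored rows. The helper `sparse_event_times` and the threshold `min_events` are hypothetical: the text does not specify how many failures count as "enough", so the threshold has to be chosen by the analyst.

```python
import pandas as pd


def sparse_event_times(df, min_events, event_col="J", time_col="X", id_col="pid"):
    """Return, per event type, the observed times with fewer than `min_events` failures."""
    counts = (
        df[df[event_col] != 0]  # drop rows assumed to be censored (J == 0)
        .groupby([event_col, time_col])[id_col]
        .count()
        .unstack(event_col)
        # Include time points at which no failure of any type was observed.
        .reindex(range(1, int(df[time_col].max()) + 1))
        .fillna(0)
    )
    return {int(j): counts.index[counts[j] < min_events].tolist() for j in counts.columns}


# Small toy data set, made up for this sketch.
toy_df = pd.DataFrame({
    "pid": range(8),
    "J": [1, 1, 2, 0, 1, 2, 2, 0],
    "X": [1, 1, 1, 1, 2, 3, 3, 3],
})

print(sparse_event_times(toy_df, min_events=2))
```

An empty list for every event type would indicate that each time point has at least `min_events` observed failures of that type. Applied to `patients_df`, the same call would flag any sparse time points before estimation.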