Data Preparation

Data Generation

For simplicity of presentation, we consider \(M=2\) competing events, though PyDTS can handle any number of competing events as long as there are enough observed failures of each type at each discrete time point.

Here, we use \(d=30\) discrete time points, \(n=50,000\) observations, and a covariate vector \(Z\) with 5 covariates. Failure times were generated based on the model:

\[ \lambda_{j}(t|Z) = \frac{\exp(\alpha_{jt}+Z^{T}\beta_{j})}{1+\exp(\alpha_{jt}+Z^{T}\beta_{j})} \]

with

\(\alpha_{1t} = -1 -0.3 \log(t)\),

\(\alpha_{2t} = -1.75 -0.15\log(t)\), \(t=1,\ldots,d\),

\(\beta_1 = -(\log 0.8, \log 3, \log 3, \log 2.5, \log 2)\),

\(\beta_{2} = -(\log 1, \log 3, \log 4, \log 3, \log 2)\).
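To make the model concrete, the following minimal sketch evaluates the hazard \(\lambda_j(t|Z)\) for a single covariate vector. The helper names `alpha` and `hazard` are illustrative and not part of PyDTS; the \(\beta_j\) values are taken as in the `real_coef_dict` of the data generation code.

```python
import numpy as np

# Coefficients as defined in real_coef_dict of the data generation code.
beta = {
    1: -np.log([0.8, 3, 3, 2.5, 2]),
    2: -np.log([1, 3, 4, 3, 2]),
}

def alpha(j, t):
    """Time-dependent intercept alpha_{jt} for event type j (hypothetical helper)."""
    t = np.asarray(t, dtype=float)
    if j == 1:
        return -1.0 - 0.3 * np.log(t)
    if j == 2:
        return -1.75 - 0.15 * np.log(t)
    raise ValueError("j must be 1 or 2")

def hazard(j, t, z, beta_j):
    """Logistic hazard: exp(eta) / (1 + exp(eta)), with eta = alpha_jt + Z^T beta_j."""
    eta = alpha(j, t) + np.dot(z, beta_j)
    return 1.0 / (1.0 + np.exp(-eta))

# With Z = 0 the hazard reduces to the logistic transform of alpha_{jt}.
z0 = np.zeros(5)
lam1 = hazard(1, 1, z0, beta[1])  # alpha_{11} = -1, so lam1 = 1 / (1 + e), about 0.269
lam2 = hazard(2, 1, z0, beta[2])  # alpha_{21} = -1.75
```

Because the hazard is a logistic transform of the linear predictor, each \(\lambda_j(t|Z)\) lies strictly between 0 and 1 for any finite covariate values.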

Censoring time for each observation was sampled from a discrete uniform distribution, i.e. \(C_i \sim \mbox{Uniform}\{1,...,d+1\}\).
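A discrete uniform draw on \(\{1,...,d+1\}\) can be sketched with NumPy as follows. This only illustrates the distribution stated above; the function name `sample_censoring_times` is hypothetical, and the PyDTS generator may implement censoring differently.

```python
import numpy as np

def sample_censoring_times(n, d, seed=None):
    """Draw n censoring times uniformly from {1, ..., d + 1} (illustrative helper)."""
    rng = np.random.default_rng(seed)
    # Generator.integers excludes the upper bound, so high = d + 2 includes d + 1.
    return rng.integers(low=1, high=d + 2, size=n)

d = 30
C = sample_censoring_times(n=1000, d=d, seed=0)
```

A censoring time of \(d+1\) falls after the last discrete time point, so such an observation is not censored by \(C_i\) within the follow-up window.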

Our goal is to estimate \(\{\alpha_{11},\ldots,\alpha_{1d},\beta_1^T,\alpha_{21},\ldots,\alpha_{2d},\beta_2^T\}\) (70 parameters in total: \(d=30\) intercepts and 5 regression coefficients for each of the two event types), along with the standard errors of the estimators.

import warnings

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from pydts.examples_utils.generate_simulations_data import generate_quick_start_df

pd.set_option("display.max_rows", 500)
warnings.filterwarnings('ignore')
%matplotlib inline

# True coefficients used to generate the simulated data
real_coef_dict = {
    "alpha": {
        1: lambda t: -1 - 0.3 * np.log(t),
        2: lambda t: -1.75 - 0.15 * np.log(t)
    },
    "beta": {
        1: -np.log([0.8, 3, 3, 2.5, 2]),
        2: -np.log([1, 3, 4, 3, 2])
    }
}

n_patients = 50000
n_cov = 5
patients_df = generate_quick_start_df(n_patients=n_patients, n_cov=n_cov, d_times=30, j_events=2, 
                                      pid_col='pid', seed=0, censoring_prob=0.8, 
                                      real_coef_dict=real_coef_dict)

patients_df.head()
   pid        Z1        Z2        Z3        Z4        Z5  J   T   C   X
0    0  0.548814  0.715189  0.602763  0.544883  0.423655  0  30  10  10
1    1  0.645894  0.437587  0.891773  0.963663  0.383442  0  30  24  24
2    2  0.791725  0.528895  0.568045  0.925597  0.071036  0  17  11  11
3    3  0.087129  0.020218  0.832620  0.778157  0.870012  1   1  30   1
4    4  0.978618  0.799159  0.461479  0.780529  0.118274  0  15  14  14
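The rows shown are consistent with reading `T` as the event time, `C` as the censoring time, `X` as the observed time \(\min(T, C)\), and `J` as the observed event type, with 0 marking a censored observation. This reading is inferred from the displayed rows rather than stated here, so treat it as an assumption; the sketch below checks it on those five rows only.

```python
import pandas as pd

# The J, T, C, X columns of the five rows displayed by patients_df.head().
rows = pd.DataFrame({
    "J": [0, 0, 0, 1, 0],
    "T": [30, 30, 17, 1, 15],
    "C": [10, 24, 11, 30, 14],
    "X": [10, 24, 11, 1, 14],
})

# Assumed relation: the observed time is the earlier of event and censoring time.
x_matches_min = (rows["X"] == rows[["T", "C"]].min(axis=1)).all()

# Assumed relation: J == 0 exactly when censoring precedes the event.
censored = rows["C"] < rows["T"]
j_matches_censoring = ((rows["J"] == 0) == censored).all()
```

On the full data, the same two comparisons can be run on `patients_df` directly to confirm or refute the assumed column semantics.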

Checking the Data

Both estimation methods require enough observed failures of each failure type at each discrete time point. Therefore, the first step is to verify that the data at hand meet this requirement.

As shown below, in our example, the data comply with this requirement.

Preprocessing suggestions for data that do not comply with this requirement are shown in the Data Regrouping Example.

patients_df.groupby(['J', 'X'])['pid'].count().unstack('J')
J      0     1     2
X
1   1236  3374  1250
2   1124  2328   839
3   1029  1805   805
4    972  1524   644
5    939  1214   570
6    889  1114   483
7    830   916   416
8    832   830   409
9    797   683   323
10   685   626   306
11   703   569   240
12   648   516   246
13   679   419   226
14   647   410   198
15   603   326   170
16   601   320   162
17   585   280   147
18   564   240   115
19   505   243   125
20   465   204   118
21   488   176    83
22   465   167    89
23   497   166    65
24   457   118    59
25   440   114    58
26   427   109    53
27   430    89    43
28   396    70    38
29   398    67    43
30  3245    47    37
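Rather than inspecting the table by eye, the check can be automated. The sketch below is one possible approach, not a PyDTS function: the helper name `sparse_cells` and the threshold `min_count` are hypothetical, since no specific minimum number of failures is stated here. It assumes `J == 0` marks censored observations, and it is demonstrated on a small toy frame.

```python
import pandas as pd

def sparse_cells(df, event_col="J", time_col="X", min_count=10):
    """Return the (event type, time) cells with fewer than min_count observed failures."""
    events = df.loc[df[event_col] != 0]  # keep observed failures only
    counts = events.groupby([event_col, time_col]).size()
    # Reindex over every event-type/time combination so empty cells count as 0.
    full_index = pd.MultiIndex.from_product(
        [sorted(events[event_col].unique()), sorted(df[time_col].unique())],
        names=[event_col, time_col],
    )
    counts = counts.reindex(full_index, fill_value=0)
    return counts[counts < min_count]

# Toy data: event type 1 occurs twice at time 1; every other cell has one failure.
toy = pd.DataFrame({"J": [1, 1, 2, 0, 1, 2], "X": [1, 1, 1, 2, 2, 2]})
sparse = sparse_cells(toy, min_count=2)
```

An empty result means every cell meets the chosen threshold; calling `sparse_cells(patients_df)` would apply the same check to the simulated data.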