Skip to content

Regularization¤

Regularized regression can be easily accommodated only with TwoStagesFitter where we first estimate \(\beta_j\) and then \(\alpha_{jt}\). Regularization is introduced by CoxPHFitter of lifelines with event-specific tuning parameters, \(\eta_j \geq 0\), and l1_ratio argument.

For each \(j\), usually, a path of models in \(\eta_j\) are fitted, and the value of l1_ratio defines the type of prediction model. In particular, ridge regression is performed by setting l1_ratio=0, lasso by l1_ratio=1, and elastic net by 0 < l1_ratio < 1.

In the following, we present how to use PyDTS to fit a lasso regularized model, and how to tune the regularization parameters \(\eta_j\).

We start by generating data, as discussed in previous sections:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from pydts.examples_utils.generate_simulations_data import generate_quick_start_df
import warnings
pd.set_option("display.max_rows", 500)
warnings.filterwarnings('ignore')
%matplotlib inline

real_coef_dict = {
    "alpha": {
        1: lambda t: -1 - 0.3 * np.log(t),
        2: lambda t: -1.75 - 0.15 * np.log(t)
    },
    "beta": {
        1: -np.log([0.8, 3, 3, 2.5, 2]),
        2: -np.log([1, 3, 4, 3, 2])
    }
}

n_patients = 50000
n_cov = 5

patients_df = generate_quick_start_df(n_patients=n_patients, n_cov=n_cov, d_times=30, j_events=2, 
                                      pid_col='pid', seed=0, censoring_prob=0.8, 
                                      real_coef_dict=real_coef_dict)

train_df, test_df = train_test_split(patients_df, test_size=0.2)

patients_df.head()
pid Z1 Z2 Z3 Z4 Z5 J T C X
0 0 0.548814 0.715189 0.602763 0.544883 0.423655 0 31 10 10
1 1 0.645894 0.437587 0.891773 0.963663 0.383442 0 31 24 24
2 2 0.791725 0.528895 0.568045 0.925597 0.071036 0 17 11 11
3 3 0.087129 0.020218 0.832620 0.778157 0.870012 1 1 31 1
4 4 0.978618 0.799159 0.461479 0.780529 0.118274 0 15 14 14

Predefined Regularization Parameters¤

Lasso with \(\eta_1=0.003\) and \(\eta_2=0.005\), can be applied by

from pydts.fitters import TwoStagesFitter

L1_regularized_fitter = TwoStagesFitter()
fit_beta_kwargs = {
    'model_kwargs': {
        1: {'penalizer': 0.003, 'l1_ratio': 1},
        2: {'penalizer': 0.005, 'l1_ratio': 1}
}}
L1_regularized_fitter.fit(df = patients_df.drop(['C', 'T'], axis = 1),
                          fit_beta_kwargs = fit_beta_kwargs)

L1_regularized_fitter.print_summary()
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.

j1_params j1_SE j2_params j2_SE
covariate
Z1 0.000002 0.000101 1.981085e-08 0.000024
Z2 -0.772797 0.025400 -7.976064e-07 0.000071
Z3 -0.761229 0.025532 -1.720702e-01 0.038499
Z4 -0.550481 0.025318 -8.073968e-07 0.000072
Z5 -0.338471 0.025211 -3.409610e-07 0.000031


Model summary for event: 1

n_jt success alpha_jt
J X
1 1 3374 True -1.471644
2 2328 True -1.714213
3 1805 True -1.859723
4 1524 True -1.920774
5 1214 True -2.050566
6 1114 True -2.038532
7 916 True -2.142666
8 830 True -2.151764
9 683 True -2.257665
10 626 True -2.258146
11 569 True -2.268550
12 516 True -2.281249
13 419 True -2.406485
14 410 True -2.340125
15 326 True -2.482320
16 320 True -2.415531
17 280 True -2.460601
18 240 True -2.526029
19 243 True -2.422234
20 204 True -2.506867
21 176 True -2.564875
22 167 True -2.524524
23 166 True -2.431232
24 118 True -2.667097
25 114 True -2.596554
26 109 True -2.527812
27 89 True -2.614859
28 70 True -2.731941
29 67 True -2.645782
30 47 True -2.856479


Model summary for event: 2

n_jt success alpha_jt
J X
2 1 1250 True -3.578498
2 839 True -3.857174
3 805 True -3.793806
4 644 True -3.922340
5 570 True -3.953290
6 483 True -4.033435
7 416 True -4.097484
8 409 True -4.031967
9 323 True -4.185559
10 306 True -4.159369
11 240 True -4.326672
12 246 True -4.222570
13 226 True -4.228361
14 198 True -4.280986
15 170 True -4.351662
16 162 True -4.321422
17 147 True -4.335179
18 115 True -4.497948
19 125 True -4.329126
20 118 True -4.299702
21 83 True -4.569291
22 89 True -4.409830
23 65 True -4.633946
24 59 True -4.627577
25 58 True -4.544617
26 53 True -4.528494
27 43 True -4.624557
28 38 True -4.626009
29 43 True -4.376003
30 37 True -4.384266

Tuning Regularization Parameters¤

In penalized regression, one should fit a path of models in each \(\eta_j\), \(j=1,\ldots,M\). The final set of values of \(\eta_1,\ldots,\eta_M\) corresponds to the values yielding the best results in terms of pre-specified criteria, such as maximizing \(\widehat{\mbox{AUC}}_j\) and \(\widehat{\mbox{AUC}}\), or minimizing \(\widehat{\mbox{BS}}_j\) and \(\widehat{\mbox{BS}}\). The default criteria in PyDTS is maximizing the global AUC, \(\widehat{\mbox{AUC}}\). Two \(M\)-dimensional grid search options are implemented, PenaltyGridSearch when the user provides train and test datasets, and PenaltyGridSearchCV for applying a K-fold cross validation (CV) approach.

PenaltyGridSearch¤

When train and test sets are available, by excecuting the following code, all the four optimization criteria are calculated over the \(M\)-dimensional grid and optimal_set includes the optimal values of \(\eta_1,\ldots,\eta_M\) based on \(\widehat{\mbox{AUC}}\). Here, the optimal set based on \(\widehat{\mbox{AUC}}\) is \(\log\eta_1 = -6\) and \(\log\eta_2 = -6\).

It is noted, that we estimate the parameters of each \(\eta_j\) once. However, since our performance measures requires the evaluation of the overall survival function, we must check each possible combination of \(\eta_j\) seperately. This can be time consuming, especially when we would like to choose between a large number of possible penalizers.

from pydts.model_selection import PenaltyGridSearch


penalizers = np.exp([-2, -3, -4, -5, -6])
grid_search = PenaltyGridSearch()
optimal_set = grid_search.evaluate(train_df, test_df, l1_ratio = 1, 
                                   penalizers = penalizers,
                                   metrics = ['IBS', 'GBS', 'IAUC', 'GAUC'])

print(optimal_set)
Started estimating the coefficients for penalizer 0.1353352832366127 (1/5)
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
Finished estimating the coefficients for penalizer 0.1353352832366127 (1/5), 199 seconds
Started estimating the coefficients for penalizer 0.049787068367863944 (2/5)
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
Finished estimating the coefficients for penalizer 0.049787068367863944 (2/5), 204 seconds
Started estimating the coefficients for penalizer 0.01831563888873418 (3/5)
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
Finished estimating the coefficients for penalizer 0.01831563888873418 (3/5), 206 seconds
Started estimating the coefficients for penalizer 0.006737946999085467 (4/5)
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
Finished estimating the coefficients for penalizer 0.006737946999085467 (4/5), 207 seconds
Started estimating the coefficients for penalizer 0.0024787521766663585 (5/5)
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
Finished estimating the coefficients for penalizer 0.0024787521766663585 (5/5), 213 seconds
(0.0024787521766663585, 0.0024787521766663585)

The user can choose the set of \(\eta_j\), \(j=1,\ldots,M\), values that optimizes other desired criteria. For example, the set that minimizes \(\widehat{\mbox{BS}}\) can be selected as follows

res = grid_search.convert_results_dict_to_df(grid_search.global_bs)
res.columns = ['BS']
res.index.set_names(['eta_1', 'eta_2'], inplace=True)
res
BS
eta_1 eta_2
0.135335 0.135335 0.038662
0.049787 0.038662
0.018316 0.038633
0.006738 0.038468
0.002479 0.038137
0.049787 0.135335 0.038430
0.049787 0.038430
0.018316 0.038403
0.006738 0.038245
0.002479 0.037919
0.018316 0.135335 0.035197
0.049787 0.035197
0.018316 0.035199
0.006738 0.035164
0.002479 0.034911
0.006738 0.135335 0.031541
0.049787 0.031541
0.018316 0.031592
0.006738 0.031685
0.002479 0.031431
0.002479 0.135335 0.028503
0.049787 0.028503
0.018316 0.028563
0.006738 0.028698
0.002479 0.028441
grid_search.convert_results_dict_to_df(grid_search.global_bs).idxmin()
0    (0.0024787521766663585, 0.0024787521766663585)
dtype: object

the final model can be retrieved by

optimal_two_stages_fitter = grid_search.get_mixed_two_stages_fitter(optimal_set)

PenaltyGridSearchCV¤

Alternatively, 5-fold CV is performed by

from pydts.cross_validation import PenaltyGridSearchCV


penalizers = np.exp([-2, -3, -4, -5, -6])
grid_search_cv = PenaltyGridSearchCV()
results_df = grid_search_cv.cross_validate(patients_df, l1_ratio=1, 
                                           penalizers=penalizers, n_splits=5, 
                                           metrics=['IBS', 'GBS', 'IAUC', 'GAUC'])
Starting fold 1/5
Started estimating the coefficients for penalizer 0.1353352832366127 (1/5)
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
Finished estimating the coefficients for penalizer 0.1353352832366127 (1/5), 174 seconds
Started estimating the coefficients for penalizer 0.049787068367863944 (2/5)
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
Finished estimating the coefficients for penalizer 0.049787068367863944 (2/5), 186 seconds
Started estimating the coefficients for penalizer 0.01831563888873418 (3/5)
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
Finished estimating the coefficients for penalizer 0.01831563888873418 (3/5), 200 seconds
Started estimating the coefficients for penalizer 0.006737946999085467 (4/5)
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
Finished estimating the coefficients for penalizer 0.006737946999085467 (4/5), 209 seconds
Started estimating the coefficients for penalizer 0.0024787521766663585 (5/5)
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
Finished estimating the coefficients for penalizer 0.0024787521766663585 (5/5), 155 seconds
Finished fold 1/5, 1003 seconds
Starting fold 2/5
Started estimating the coefficients for penalizer 0.1353352832366127 (1/5)
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
Finished estimating the coefficients for penalizer 0.1353352832366127 (1/5), 175 seconds
Started estimating the coefficients for penalizer 0.049787068367863944 (2/5)
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
Finished estimating the coefficients for penalizer 0.049787068367863944 (2/5), 187 seconds
Started estimating the coefficients for penalizer 0.01831563888873418 (3/5)
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
Finished estimating the coefficients for penalizer 0.01831563888873418 (3/5), 200 seconds
Started estimating the coefficients for penalizer 0.006737946999085467 (4/5)
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
Finished estimating the coefficients for penalizer 0.006737946999085467 (4/5), 212 seconds
Started estimating the coefficients for penalizer 0.0024787521766663585 (5/5)
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
Finished estimating the coefficients for penalizer 0.0024787521766663585 (5/5), 178 seconds
Finished fold 2/5, 1031 seconds
Starting fold 3/5
Started estimating the coefficients for penalizer 0.1353352832366127 (1/5)
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
Finished estimating the coefficients for penalizer 0.1353352832366127 (1/5), 176 seconds
Started estimating the coefficients for penalizer 0.049787068367863944 (2/5)
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
Finished estimating the coefficients for penalizer 0.049787068367863944 (2/5), 188 seconds
Started estimating the coefficients for penalizer 0.01831563888873418 (3/5)
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
Finished estimating the coefficients for penalizer 0.01831563888873418 (3/5), 200 seconds
Started estimating the coefficients for penalizer 0.006737946999085467 (4/5)
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
Finished estimating the coefficients for penalizer 0.006737946999085467 (4/5), 210 seconds
Started estimating the coefficients for penalizer 0.0024787521766663585 (5/5)
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
Finished estimating the coefficients for penalizer 0.0024787521766663585 (5/5), 186 seconds
Finished fold 3/5, 1039 seconds
Starting fold 4/5
Started estimating the coefficients for penalizer 0.1353352832366127 (1/5)
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
Finished estimating the coefficients for penalizer 0.1353352832366127 (1/5), 177 seconds
Started estimating the coefficients for penalizer 0.049787068367863944 (2/5)
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
Finished estimating the coefficients for penalizer 0.049787068367863944 (2/5), 188 seconds
Started estimating the coefficients for penalizer 0.01831563888873418 (3/5)
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
Finished estimating the coefficients for penalizer 0.01831563888873418 (3/5), 202 seconds
Started estimating the coefficients for penalizer 0.006737946999085467 (4/5)
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
Finished estimating the coefficients for penalizer 0.006737946999085467 (4/5), 211 seconds
Started estimating the coefficients for penalizer 0.0024787521766663585 (5/5)
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
Finished estimating the coefficients for penalizer 0.0024787521766663585 (5/5), 179 seconds
Finished fold 4/5, 1037 seconds
Starting fold 5/5
Started estimating the coefficients for penalizer 0.1353352832366127 (1/5)
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
Finished estimating the coefficients for penalizer 0.1353352832366127 (1/5), 176 seconds
Started estimating the coefficients for penalizer 0.049787068367863944 (2/5)
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
Finished estimating the coefficients for penalizer 0.049787068367863944 (2/5), 188 seconds
Started estimating the coefficients for penalizer 0.01831563888873418 (3/5)
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
Finished estimating the coefficients for penalizer 0.01831563888873418 (3/5), 201 seconds
Started estimating the coefficients for penalizer 0.006737946999085467 (4/5)
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
Finished estimating the coefficients for penalizer 0.006737946999085467 (4/5), 213 seconds
Started estimating the coefficients for penalizer 0.0024787521766663585 (5/5)
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
Finished estimating the coefficients for penalizer 0.0024787521766663585 (5/5), 182 seconds
Finished fold 5/5, 1040 seconds

results_df
Mean SE
0.135335 0.135335 0.638475 0.002592
0.049787 0.639094 0.003754
0.018316 0.639103 0.003627
0.006738 0.637383 0.002958
0.002479 0.488831 0.003291
0.049787 0.135335 0.638764 0.002806
0.049787 0.638899 0.003862
0.018316 0.639014 0.003780
0.006738 0.637522 0.003173
0.002479 0.488832 0.003291
0.018316 0.135335 0.638666 0.002676
0.049787 0.638953 0.003742
0.018316 0.639031 0.003669
0.006738 0.637576 0.003055
0.002479 0.488833 0.003298
0.006738 0.135335 0.563873 0.001147
0.049787 0.563873 0.001147
0.018316 0.563872 0.001150
0.006738 0.563827 0.001226
0.002479 0.615068 0.002719
0.002479 0.135335 0.568879 0.001269
0.049787 0.568879 0.001269
0.018316 0.568879 0.001268
0.006738 0.568834 0.001290
0.002479 0.630541 0.004160
optimal_set = results_df['Mean'].idxmax()
optimal_set
(0.1353352832366127, 0.01831563888873418)

References¤

[1] Meir, Tomer*, Gutman, Rom*, and Gorfine, Malka, "PyDTS: A Python Package for Discrete-Time Survival Analysis with Competing Risks" (2022)