Regularization¤

Regularized regression can be easily accommodated only with TwoStagesFitter where we first estimate \(\beta_j\) and then \(\alpha_{jt}\). Regularization is introduced by CoxPHFitter of lifelines with event-specific tuning parameters, \(\eta_j \geq 0\), and l1_ratio argument.

For each \(j\), usually, a path of models in \(\eta_j\) are fitted, and the value of l1_ratio defines the type of prediction model. In particular, ridge regression is performed by setting l1_ratio=0, lasso by l1_ratio=1, and elastic net by 0 < l1_ratio < 1.

In the following, we present how to use PyDTS to fit a lasso regularized model, and how to tune the regularization parameters \(\eta_j\).

We start by generating data, as discussed in previous sections:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from pydts.examples_utils.generate_simulations_data import generate_quick_start_df
import warnings
pd.set_option("display.max_rows", 500)
warnings.filterwarnings('ignore')
%matplotlib inline

real_coef_dict = {
    "alpha": {
        1: lambda t: -1 - 0.3 * np.log(t),
        2: lambda t: -1.75 - 0.15 * np.log(t)
    },
    "beta": {
        1: -np.log([0.8, 3, 3, 2.5, 2]),
        2: -np.log([1, 3, 4, 3, 2])
    }
}

n_patients = 50000
n_cov = 5

patients_df = generate_quick_start_df(n_patients=n_patients, n_cov=n_cov, d_times=30, j_events=2, 
                                      pid_col='pid', seed=0, censoring_prob=0.8, 
                                      real_coef_dict=real_coef_dict)

train_df, test_df = train_test_split(patients_df, test_size=0.2)

patients_df.head()

	pid	Z1	Z2	Z3	Z4	Z5	J	T	C	X
0	0	0.548814	0.715189	0.602763	0.544883	0.423655	0	31	10	10
1	1	0.645894	0.437587	0.891773	0.963663	0.383442	0	31	24	24
2	2	0.791725	0.528895	0.568045	0.925597	0.071036	0	17	11	11
3	3	0.087129	0.020218	0.832620	0.778157	0.870012	1	1	31	1
4	4	0.978618	0.799159	0.461479	0.780529	0.118274	0	15	14	14

Predefined Regularization Parameters¤

Lasso with \(\eta_1=0.003\) and \(\eta_2=0.005\), can be applied by

from pydts.fitters import TwoStagesFitter

L1_regularized_fitter = TwoStagesFitter()
fit_beta_kwargs = {
    'model_kwargs': {
        1: {'penalizer': 0.003, 'l1_ratio': 1},
        2: {'penalizer': 0.005, 'l1_ratio': 1}
}}
L1_regularized_fitter.fit(df = patients_df.drop(['C', 'T'], axis = 1),
                          fit_beta_kwargs = fit_beta_kwargs)

L1_regularized_fitter.print_summary()

INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.

	j1_params	j1_SE	j2_params	j2_SE
covariate
Z1	0.000002	0.000101	1.981085e-08	0.000024
Z2	-0.772797	0.025400	-7.976064e-07	0.000071
Z3	-0.761229	0.025532	-1.720702e-01	0.038499
Z4	-0.550481	0.025318	-8.073968e-07	0.000072
Z5	-0.338471	0.025211	-3.409610e-07	0.000031



Model summary for event: 1

		n_jt	success	alpha_jt
J	X
1	1	3374	True	-1.471644
	2	2328	True	-1.714213
	3	1805	True	-1.859723
	4	1524	True	-1.920774
	5	1214	True	-2.050566
	6	1114	True	-2.038532
	7	916	True	-2.142666
	8	830	True	-2.151764
	9	683	True	-2.257665
	10	626	True	-2.258146
	11	569	True	-2.268550
	12	516	True	-2.281249
	13	419	True	-2.406485
	14	410	True	-2.340125
	15	326	True	-2.482320
	16	320	True	-2.415531
	17	280	True	-2.460601
	18	240	True	-2.526029
	19	243	True	-2.422234
	20	204	True	-2.506867
	21	176	True	-2.564875
	22	167	True	-2.524524
	23	166	True	-2.431232
	24	118	True	-2.667097
	25	114	True	-2.596554
	26	109	True	-2.527812
	27	89	True	-2.614859
	28	70	True	-2.731941
	29	67	True	-2.645782
	30	47	True	-2.856479



Model summary for event: 2

		n_jt	success	alpha_jt
J	X
2	1	1250	True	-3.578498
	2	839	True	-3.857174
	3	805	True	-3.793806
	4	644	True	-3.922340
	5	570	True	-3.953290
	6	483	True	-4.033435
	7	416	True	-4.097484
	8	409	True	-4.031967
	9	323	True	-4.185559
	10	306	True	-4.159369
	11	240	True	-4.326672
	12	246	True	-4.222570
	13	226	True	-4.228361
	14	198	True	-4.280986
	15	170	True	-4.351662
	16	162	True	-4.321422
	17	147	True	-4.335179
	18	115	True	-4.497948
	19	125	True	-4.329126
	20	118	True	-4.299702
	21	83	True	-4.569291
	22	89	True	-4.409830
	23	65	True	-4.633946
	24	59	True	-4.627577
	25	58	True	-4.544617
	26	53	True	-4.528494
	27	43	True	-4.624557
	28	38	True	-4.626009
	29	43	True	-4.376003
	30	37	True	-4.384266

Tuning Regularization Parameters¤

In penalized regression, one should fit a path of models in each \(\eta_j\), \(j=1,\ldots,M\). The final set of values of \(\eta_1,\ldots,\eta_M\) corresponds to the values yielding the best results in terms of pre-specified criteria, such as maximizing \(\widehat{\mbox{AUC}}_j\) and \(\widehat{\mbox{AUC}}\), or minimizing \(\widehat{\mbox{BS}}_j\) and \(\widehat{\mbox{BS}}\). The default criteria in PyDTS is maximizing the global AUC, \(\widehat{\mbox{AUC}}\). Two \(M\)-dimensional grid search options are implemented, PenaltyGridSearch when the user provides train and test datasets, and PenaltyGridSearchCV for applying a K-fold cross validation (CV) approach.

PenaltyGridSearch¤

When train and test sets are available, by excecuting the following code, all the four optimization criteria are calculated over the \(M\)-dimensional grid and optimal_set includes the optimal values of \(\eta_1,\ldots,\eta_M\) based on \(\widehat{\mbox{AUC}}\). Here, the optimal set based on \(\widehat{\mbox{AUC}}\) is \(\log\eta_1 = -6\) and \(\log\eta_2 = -6\).

It is noted, that we estimate the parameters of each \(\eta_j\) once. However, since our performance measures requires the evaluation of the overall survival function, we must check each possible combination of \(\eta_j\) seperately. This can be time consuming, especially when we would like to choose between a large number of possible penalizers.

from pydts.model_selection import PenaltyGridSearch


penalizers = np.exp([-2, -3, -4, -5, -6])
grid_search = PenaltyGridSearch()
optimal_set = grid_search.evaluate(train_df, test_df, l1_ratio = 1, 
                                   penalizers = penalizers,
                                   metrics = ['IBS', 'GBS', 'IAUC', 'GAUC'])

print(optimal_set)

Started estimating the coefficients for penalizer 0.1353352832366127 (1/5)
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
Finished estimating the coefficients for penalizer 0.1353352832366127 (1/5), 199 seconds
Started estimating the coefficients for penalizer 0.049787068367863944 (2/5)
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
Finished estimating the coefficients for penalizer 0.049787068367863944 (2/5), 204 seconds
Started estimating the coefficients for penalizer 0.01831563888873418 (3/5)
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
Finished estimating the coefficients for penalizer 0.01831563888873418 (3/5), 206 seconds
Started estimating the coefficients for penalizer 0.006737946999085467 (4/5)
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
Finished estimating the coefficients for penalizer 0.006737946999085467 (4/5), 207 seconds
Started estimating the coefficients for penalizer 0.0024787521766663585 (5/5)
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
Finished estimating the coefficients for penalizer 0.0024787521766663585 (5/5), 213 seconds
(0.0024787521766663585, 0.0024787521766663585)

The user can choose the set of \(\eta_j\), \(j=1,\ldots,M\), values that optimizes other desired criteria. For example, the set that minimizes \(\widehat{\mbox{BS}}\) can be selected as follows

res = grid_search.convert_results_dict_to_df(grid_search.global_bs)
res.columns = ['BS']
res.index.set_names(['eta_1', 'eta_2'], inplace=True)
res

		BS
eta_1	eta_2
0.135335	0.135335	0.038662
	0.049787	0.038662
	0.018316	0.038633
	0.006738	0.038468
	0.002479	0.038137
0.049787	0.135335	0.038430
	0.049787	0.038430
	0.018316	0.038403
	0.006738	0.038245
	0.002479	0.037919
0.018316	0.135335	0.035197
	0.049787	0.035197
	0.018316	0.035199
	0.006738	0.035164
	0.002479	0.034911
0.006738	0.135335	0.031541
	0.049787	0.031541
	0.018316	0.031592
	0.006738	0.031685
	0.002479	0.031431
0.002479	0.135335	0.028503
	0.049787	0.028503
	0.018316	0.028563
	0.006738	0.028698
	0.002479	0.028441

grid_search.convert_results_dict_to_df(grid_search.global_bs).idxmin()

0    (0.0024787521766663585, 0.0024787521766663585)
dtype: object

the final model can be retrieved by

optimal_two_stages_fitter = grid_search.get_mixed_two_stages_fitter(optimal_set)

PenaltyGridSearchCV¤

Alternatively, 5-fold CV is performed by

from pydts.cross_validation import PenaltyGridSearchCV


penalizers = np.exp([-2, -3, -4, -5, -6])
grid_search_cv = PenaltyGridSearchCV()
results_df = grid_search_cv.cross_validate(patients_df, l1_ratio=1, 
                                           penalizers=penalizers, n_splits=5, 
                                           metrics=['IBS', 'GBS', 'IAUC', 'GAUC'])

Starting fold 1/5
Started estimating the coefficients for penalizer 0.1353352832366127 (1/5)
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
Finished estimating the coefficients for penalizer 0.1353352832366127 (1/5), 174 seconds
Started estimating the coefficients for penalizer 0.049787068367863944 (2/5)
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
Finished estimating the coefficients for penalizer 0.049787068367863944 (2/5), 186 seconds
Started estimating the coefficients for penalizer 0.01831563888873418 (3/5)
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
Finished estimating the coefficients for penalizer 0.01831563888873418 (3/5), 200 seconds
Started estimating the coefficients for penalizer 0.006737946999085467 (4/5)
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
Finished estimating the coefficients for penalizer 0.006737946999085467 (4/5), 209 seconds
Started estimating the coefficients for penalizer 0.0024787521766663585 (5/5)
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
Finished estimating the coefficients for penalizer 0.0024787521766663585 (5/5), 155 seconds
Finished fold 1/5, 1003 seconds
Starting fold 2/5
Started estimating the coefficients for penalizer 0.1353352832366127 (1/5)
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
Finished estimating the coefficients for penalizer 0.1353352832366127 (1/5), 175 seconds
Started estimating the coefficients for penalizer 0.049787068367863944 (2/5)
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
Finished estimating the coefficients for penalizer 0.049787068367863944 (2/5), 187 seconds
Started estimating the coefficients for penalizer 0.01831563888873418 (3/5)
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
Finished estimating the coefficients for penalizer 0.01831563888873418 (3/5), 200 seconds
Started estimating the coefficients for penalizer 0.006737946999085467 (4/5)
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
Finished estimating the coefficients for penalizer 0.006737946999085467 (4/5), 212 seconds
Started estimating the coefficients for penalizer 0.0024787521766663585 (5/5)
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
Finished estimating the coefficients for penalizer 0.0024787521766663585 (5/5), 178 seconds
Finished fold 2/5, 1031 seconds
Starting fold 3/5
Started estimating the coefficients for penalizer 0.1353352832366127 (1/5)
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
Finished estimating the coefficients for penalizer 0.1353352832366127 (1/5), 176 seconds
Started estimating the coefficients for penalizer 0.049787068367863944 (2/5)
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
Finished estimating the coefficients for penalizer 0.049787068367863944 (2/5), 188 seconds
Started estimating the coefficients for penalizer 0.01831563888873418 (3/5)
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
Finished estimating the coefficients for penalizer 0.01831563888873418 (3/5), 200 seconds
Started estimating the coefficients for penalizer 0.006737946999085467 (4/5)
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
Finished estimating the coefficients for penalizer 0.006737946999085467 (4/5), 210 seconds
Started estimating the coefficients for penalizer 0.0024787521766663585 (5/5)
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
Finished estimating the coefficients for penalizer 0.0024787521766663585 (5/5), 186 seconds
Finished fold 3/5, 1039 seconds
Starting fold 4/5
Started estimating the coefficients for penalizer 0.1353352832366127 (1/5)
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
Finished estimating the coefficients for penalizer 0.1353352832366127 (1/5), 177 seconds
Started estimating the coefficients for penalizer 0.049787068367863944 (2/5)
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
Finished estimating the coefficients for penalizer 0.049787068367863944 (2/5), 188 seconds
Started estimating the coefficients for penalizer 0.01831563888873418 (3/5)
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
Finished estimating the coefficients for penalizer 0.01831563888873418 (3/5), 202 seconds
Started estimating the coefficients for penalizer 0.006737946999085467 (4/5)
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
Finished estimating the coefficients for penalizer 0.006737946999085467 (4/5), 211 seconds
Started estimating the coefficients for penalizer 0.0024787521766663585 (5/5)
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
Finished estimating the coefficients for penalizer 0.0024787521766663585 (5/5), 179 seconds
Finished fold 4/5, 1037 seconds
Starting fold 5/5
Started estimating the coefficients for penalizer 0.1353352832366127 (1/5)
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
Finished estimating the coefficients for penalizer 0.1353352832366127 (1/5), 176 seconds
Started estimating the coefficients for penalizer 0.049787068367863944 (2/5)
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
Finished estimating the coefficients for penalizer 0.049787068367863944 (2/5), 188 seconds
Started estimating the coefficients for penalizer 0.01831563888873418 (3/5)
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
Finished estimating the coefficients for penalizer 0.01831563888873418 (3/5), 201 seconds
Started estimating the coefficients for penalizer 0.006737946999085467 (4/5)
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.

Finished estimating the coefficients for penalizer 0.006737946999085467 (4/5), 213 seconds
Started estimating the coefficients for penalizer 0.0024787521766663585 (5/5)
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
Finished estimating the coefficients for penalizer 0.0024787521766663585 (5/5), 182 seconds
Finished fold 5/5, 1040 seconds

results_df

		Mean	SE
0.135335	0.135335	0.638475	0.002592
	0.049787	0.639094	0.003754
	0.018316	0.639103	0.003627
	0.006738	0.637383	0.002958
	0.002479	0.488831	0.003291
0.049787	0.135335	0.638764	0.002806
	0.049787	0.638899	0.003862
	0.018316	0.639014	0.003780
	0.006738	0.637522	0.003173
	0.002479	0.488832	0.003291
0.018316	0.135335	0.638666	0.002676
	0.049787	0.638953	0.003742
	0.018316	0.639031	0.003669
	0.006738	0.637576	0.003055
	0.002479	0.488833	0.003298
0.006738	0.135335	0.563873	0.001147
	0.049787	0.563873	0.001147
	0.018316	0.563872	0.001150
	0.006738	0.563827	0.001226
	0.002479	0.615068	0.002719
0.002479	0.135335	0.568879	0.001269
	0.049787	0.568879	0.001269
	0.018316	0.568879	0.001268
	0.006738	0.568834	0.001290
	0.002479	0.630541	0.004160

optimal_set = results_df['Mean'].idxmax()
optimal_set

(0.1353352832366127, 0.01831563888873418)

References¤

[1] Meir, Tomer*, Gutman, Rom*, and Gorfine, Malka, "PyDTS: A Python Package for Discrete-Time Survival Analysis with Competing Risks" (2022)