Regularization¤
Regularized regression can be accommodated only with TwoStagesFitter, where we first estimate \(\beta_j\) and then \(\alpha_{jt}\). Regularization is introduced through CoxPHFitter of lifelines, with event-specific tuning parameters \(\eta_j \geq 0\) and the l1_ratio argument.
For each \(j\), a path of models in \(\eta_j\) is usually fitted, and the value of l1_ratio defines the type of penalty. In particular, ridge regression is performed by setting l1_ratio=0, lasso by l1_ratio=1, and elastic net by 0 < l1_ratio < 1.
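For example, the three penalty types correspond to the following event-specific model arguments, which are passed through to CoxPHFitter (a minimal sketch; the penalizer value 0.01 is an arbitrary placeholder):

# Arbitrary placeholder penalizer value, for illustration only.
ridge_kwargs = {'penalizer': 0.01, 'l1_ratio': 0}          # ridge regression
lasso_kwargs = {'penalizer': 0.01, 'l1_ratio': 1}          # lasso
elastic_net_kwargs = {'penalizer': 0.01, 'l1_ratio': 0.5}  # elastic net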
In the following, we present how to use PyDTS to fit a lasso-regularized model and how to tune the regularization parameters \(\eta_j\).
We start by generating data, as discussed in previous sections:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from pydts.examples_utils.generate_simulations_data import generate_quick_start_df
import warnings
pd.set_option("display.max_rows", 500)
warnings.filterwarnings('ignore')
%matplotlib inline
real_coef_dict = {
    "alpha": {
        1: lambda t: -1 - 0.3 * np.log(t),
        2: lambda t: -1.75 - 0.15 * np.log(t)
    },
    "beta": {
        1: -np.log([0.8, 3, 3, 2.5, 2]),
        2: -np.log([1, 3, 4, 3, 2])
    }
}

n_patients = 50000
n_cov = 5

patients_df = generate_quick_start_df(n_patients=n_patients, n_cov=n_cov, d_times=30, j_events=2,
                                      pid_col='pid', seed=0, censoring_prob=0.8,
                                      real_coef_dict=real_coef_dict)
train_df, test_df = train_test_split(patients_df, test_size=0.2)
patients_df.head()
Predefined Regularization Parameters¤
Lasso with \(\eta_1=0.003\) and \(\eta_2=0.005\) can be applied by:
from pydts.fitters import TwoStagesFitter

L1_regularized_fitter = TwoStagesFitter()
fit_beta_kwargs = {
    'model_kwargs': {
        1: {'penalizer': 0.003, 'l1_ratio': 1},
        2: {'penalizer': 0.005, 'l1_ratio': 1}
    }
}
L1_regularized_fitter.fit(df=patients_df.drop(['C', 'T'], axis=1),
                          fit_beta_kwargs=fit_beta_kwargs)
L1_regularized_fitter.print_summary()
Tuning Regularization Parameters¤
In penalized regression, one should fit a path of models in each \(\eta_j\), \(j=1,\ldots,M\). The final set of values \(\eta_1,\ldots,\eta_M\) corresponds to the values yielding the best results in terms of pre-specified criteria, such as maximizing \(\widehat{\mbox{AUC}}_j\) and \(\widehat{\mbox{AUC}}\), or minimizing \(\widehat{\mbox{BS}}_j\) and \(\widehat{\mbox{BS}}\). The default criterion in PyDTS is maximizing the global AUC, \(\widehat{\mbox{AUC}}\). Two \(M\)-dimensional grid-search options are implemented: PenaltyGridSearch, when the user provides train and test datasets, and PenaltyGridSearchCV, for applying a K-fold cross-validation (CV) approach.
PenaltyGridSearch¤
When train and test sets are available, executing the following code calculates all four optimization criteria over the \(M\)-dimensional grid, and optimal_set includes the optimal values of \(\eta_1,\ldots,\eta_M\) based on \(\widehat{\mbox{AUC}}\). Here, the optimal set based on \(\widehat{\mbox{AUC}}\) is \(\log\eta_1 = -6\) and \(\log\eta_2 = -6\).
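A minimal sketch follows. It assumes PenaltyGridSearch is importable from pydts.model_selection and exposes an evaluate method taking the train and test data, the l1_ratio, and the grid of penalizers; the grid below is illustrative and includes \(\log\eta = -6\) to match the result quoted above. Consult the PyDTS API reference for the exact signature.

from pydts.model_selection import PenaltyGridSearch

# Illustrative grid of penalizers on the log scale; includes log(eta) = -6.
penalizers = np.exp([-7, -6, -5, -4, -3])

penalty_grid_search = PenaltyGridSearch()

# Assumed API: evaluate() fits each event-specific model once per penalizer,
# then evaluates every combination of (eta_1, eta_2) on the test set and
# returns the combination maximizing the global AUC.
optimal_set = penalty_grid_search.evaluate(train_df=train_df.drop(['C', 'T'], axis=1),
                                           test_df=test_df.drop(['C', 'T'], axis=1),
                                           l1_ratio=1,
                                           penalizers=penalizers)
print(optimal_set)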
Note that the parameters are estimated only once for each value of \(\eta_j\). However, since the performance measures require evaluating the overall survival function, each possible combination of \(\eta_1,\ldots,\eta_M\) must be checked separately. For example, with two competing events and five candidate penalizers, only \(2 \times 5 = 10\) event-specific fits are required, but \(5^2 = 25\) combinations must be evaluated. This can be time-consuming, especially when choosing among a large number of possible penalizers.
The user can choose the set of \(\eta_j\), \(j=1,\ldots,M\), values that optimizes other desired criteria. For example, the set that minimizes \(\widehat{\mbox{BS}}\) can be selected as follows:
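A sketch, assuming the fitted grid-search object stores the global Brier score results in a global_bs attribute and provides a convert_results_dict_to_df helper (both names are assumptions; consult the PyDTS API reference):

# Assumed names: global_bs holds the global Brier score for every combination
# of penalizers, and convert_results_dict_to_df turns that dict into a
# DataFrame indexed by the penalizer combinations.
gbs_results = penalty_grid_search.convert_results_dict_to_df(penalty_grid_search.global_bs)
optimal_set_bs = gbs_results.idxmin()  # combination minimizing the global Brier score
print(optimal_set_bs)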
The final model can then be retrieved by:
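For example, assuming the grid-search object provides a get_mixed_two_stages_fitter method that assembles a single TwoStagesFitter from the event-specific models fitted with the chosen penalizers (an assumed method name; check the PyDTS API reference):

# Assumed method: builds a TwoStagesFitter that combines, per event,
# the model fitted with the selected penalizer.
final_fitter = penalty_grid_search.get_mixed_two_stages_fitter(optimal_set)
final_fitter.print_summary()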
PenaltyGridSearchCV¤
Alternatively, 5-fold CV is performed by:
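A sketch, assuming PenaltyGridSearchCV is importable from pydts.cross_validation and exposes a cross_validate method accepting the full dataset, the l1_ratio, the penalizer grid, and the number of folds (the import path and signature are assumptions; check the PyDTS API reference):

from pydts.cross_validation import PenaltyGridSearchCV

penalty_cv_search = PenaltyGridSearchCV()

# Assumed API: cross_validate() repeats the grid search within each of the
# 5 folds and aggregates the per-fold performance measures.
# Reuses the penalizers grid defined in the PenaltyGridSearch sketch above.
cv_results = penalty_cv_search.cross_validate(full_df=patients_df.drop(['C', 'T'], axis=1),
                                              l1_ratio=1,
                                              penalizers=penalizers,
                                              n_splits=5)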