Sure Independence Screening
Sure independence screening (SIS) has been shown to effectively filter out many uninformative variables in ultra-high-dimensional settings, where the number of covariates greatly exceeds the number of observations (as is common in genetic datasets, for example). Penalized variable selection methods are typically applied to the remaining covariates after screening.
In the following example, we demonstrate how such an analysis can be performed on discrete-time survival data with competing events using the SISTwoStagesFitter of PyDTS. An "Exact" version for small sample sizes is also available via SISTwoStagesFitterExact.
Data Generation
To demonstrate the screening process, we first sample a dataset with \(p = 1000\) covariates and \(n = 500\) observations. Clearly, \(p \gg n\), placing us in an ultra-high-dimensional setting.
import numpy as np
import pandas as pd
from pydts.data_generation import EventTimesSampler

# Only the first five covariates are informative for each event type
n_cov = 1000
beta1 = np.zeros(n_cov)
beta1[:5] = 1.5 * np.array([-0.6, 0.5, -0.5, 0.6, -0.6])
beta2 = np.zeros(n_cov)
beta2[:5] = 1.5 * np.array([0.5, -0.7, 0.7, -0.5, -0.7])
real_coef_dict = {
    "alpha": {
        1: lambda t: -3.1 + 0.15 * np.log(t),
        2: lambda t: -3.2 + 0.15 * np.log(t)
    },
    "beta": {
        1: beta1,
        2: beta2
    }
}
n_patients = 500
d_times = 6   # number of discrete time points
j_events = 2  # number of competing event types
ets = EventTimesSampler(d_times=d_times, j_event_types=j_events)
seed = 97
means_vector = np.zeros(n_cov)
covariance_matrix = np.identity(n_cov)
clip_value = 2.5  # covariates are clipped to [-2.5, 2.5]
covariates = [f'Z{i + 1}' for i in range(n_cov)]
patients_df = pd.DataFrame(data=np.random.multivariate_normal(means_vector, covariance_matrix,
                                                               size=n_patients),
                           columns=covariates)
patients_df.clip(lower=-clip_value, upper=clip_value, inplace=True)
# Sample event times, add independent loss-of-follow-up (LOF) censoring,
# and derive the observed event type and time
patients_df = ets.sample_event_times(patients_df, hazard_coefs=real_coef_dict, seed=seed)
patients_df = ets.sample_independent_lof_censoring(patients_df, prob_lof_at_t=0.01 * np.ones(d_times),
                                                   seed=seed + 1)
patients_df = ets.update_event_or_lof(patients_df)
patients_df.index.name = 'pid'
patients_df = patients_df.reset_index()
The resulting dataset contains the following observed event-types and event-times:
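For instance, the counts can be inspected with a cross-tabulation of the event-type column J against the event-time column X (assuming PyDTS's convention that J = 0 marks censored observations):
# Observed event-type (J) by event-time (X) counts; J = 0 is assumed to denote censoring
print(pd.crosstab(patients_df['J'], patients_df['X']))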
SIS
SISTwoStagesFitter implements the screening process. As described in the Methods section, we fit marginal estimates for the \(\beta_j\) coefficients using both the original data and permuted data that follow the null model. The maximum absolute value of the marginal coefficients from the null model (fitted using the permuted data) is selected as a data-driven threshold. We then retain only those variables whose marginal coefficients from the original data exceed this threshold.
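A minimal sketch of this step is given below; the import path and the column-name keywords are assumptions (mirroring the TwoStagesFitter interface) and should be checked against the PyDTS API reference.
from pydts.screening import SISTwoStagesFitter  # assumed import path

fitter = SISTwoStagesFitter()
# Fit marginal models to the original data and to permuted (null) data;
# the column-name keywords mirror the TwoStagesFitter defaults
fitter.fit(df=patients_df, pid_col='pid', duration_col='X', event_type_col='J')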
The marginal estimates under the null model (based on the permuted data), the marginal estimates based on the original data, and the data-driven threshold are shown below.
The informative coefficients that exceed the threshold are selected separately for each event type:
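The per-event selections are available via the fitter's chosen_covariates_j attribute, which is reused below when refitting:
# Expected to map each event type to its list of selected covariates
print(fitter.chosen_covariates_j)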
Evidently, we successfully identified all informative variables (\(Z_1\)–\(Z_5\)) and filtered out almost all non-informative ones, except for one false positive in event-type 1 (\(Z_{95}\)) and two false positives (\(Z_{198}\), \(Z_{355}\)) in event-type 2. We can now use the selected variables to train a TwoStagesFitter, as sketched below.
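In this sketch, the helper variable selected_covariates (the union of the per-event selections, reused in the SIS-L step below) is introduced for illustration.
from pydts.fitters import TwoStagesFitter

# Union of the covariates selected for either event type, ordered by index
# (assumes chosen_covariates_j is keyed by the integer event types 1 and 2)
selected_covariates = sorted(set(fitter.chosen_covariates_j[1]) | set(fitter.chosen_covariates_j[2]),
                             key=lambda c: int(c[1:]))

sis_fitter = TwoStagesFitter()
sis_fitter.fit(df=patients_df[['pid', 'X', 'J'] + selected_covariates],
               covariates=fitter.chosen_covariates_j)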
Adding LASSO (SIS-L)
As an additional variable selection step, LASSO regression can be applied to the set of covariates retained by the screening process.
To select the optimal penalization parameter, we perform a penalty grid search using cross-validation and the evaluation metrics described in the Methods section. By default, model selection is guided by the global-AUC metric.
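A hypothetical sketch of such a search is given below; the class name PenaltyGridSearchCV, its module path, and the cross_validate signature are assumptions, so consult the PyDTS documentation for the exact interface.
# Hypothetical sketch: class name, module path, and signature are assumptions
from pydts.cross_validation import PenaltyGridSearchCV

log_etas = np.arange(-9, -3, 1.0)  # grid of log(eta) penalizer values
grid_search = PenaltyGridSearchCV()
cv_results = grid_search.cross_validate(full_df=patients_df[['pid', 'X', 'J'] + selected_covariates],
                                        penalizers=np.exp(log_etas), l1_ratio=1,
                                        n_splits=4, seed=seed)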
The mean and standard error (SE) of the global-AUC, calculated via cross-validation for all possible combinations of penalization parameters, are as follows:
We choose the optimal penalizers to be the ones that maximize the global-AUC; the selected values of \(\log(\eta_j)\), \(j=1,2\), are shown below.
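Programmatically, this amounts to taking the argmax of the mean global-AUC over the grid; the structure of cv_results assumed below is illustrative only.
# Assumed structure: mean global-AUC values indexed by (log eta_1, log eta_2) pairs;
# chosen_eta is reused in the final fit below
chosen_eta = cv_results['Mean'].idxmax()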
Lastly, we train a regularized TwoStagesFitter using the entire dataset and the chosen optimal penalizers:
L1_regularized_fitter = TwoStagesFitter()
# L1 (LASSO) penalty per event type; chosen_eta holds the selected log-penalizers
fit_beta_kwargs = {
    'model_kwargs': {
        1: {'penalizer': np.exp(chosen_eta[0]), 'l1_ratio': 1},
        2: {'penalizer': np.exp(chosen_eta[1]), 'l1_ratio': 1}
    }
}
L1_regularized_fitter.fit(df=patients_df[['pid', 'X', 'J'] + selected_covariates],
                          covariates=fitter.chosen_covariates_j,
                          fit_beta_kwargs=fit_beta_kwargs)
lasso_beta = L1_regularized_fitter.get_beta_SE()
Thus, the final SIS-L model is
Recall that the true non-zero \(\beta_j\) values were:
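These can be printed directly from the data-generation setup above:
# True non-zero coefficients from the simulation
print('beta_1:', beta1[:5])
print('beta_2:', beta2[:5])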
Thus, using SIS, we identified the informative variables and filtered out most of the non-informative ones. Adding LASSO as a subsequent step further reduced the false positive coefficient in event-type 1 to nearly zero. In this specific example, SIS-L did not eliminate additional false positives for event-type 2; however, given the large number of initial covariates, this still represents a substantial improvement.