Estimating with TwoStagesFitterExact
Introduction
The CoxPHFitter from the Python lifelines package, which is used in the first stage of TwoStagesFitter, employs Efron's approximation of the partial likelihood function when ties are present. While Efron's method is computationally efficient for large sample sizes, it may yield biased coefficient estimates when the sample size is small.
Therefore, for datasets with up to approximately 500 observations, it is recommended to use the exact method, i.e., TwoStagesFitterExact, as illustrated below. This method employs ConditionalLogit models from statsmodels to estimate the \(\beta_j\) coefficients using the exact likelihood. However, due to its computational complexity, it is suitable only for small sample sizes. Additional tools for model selection and screening available in PyDTS for use with TwoStagesFitter also have corresponding "Exact" versions for small sample sizes, which rely on TwoStagesFitterExact.
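To make the contrast between the two treatments of ties concrete, the sketch below evaluates the contribution of a single risk set under each. It is an illustration only, written in plain Python rather than with lifelines or statsmodels; the function names exact_term and efron_term and the toy risk scores (standing in for exp(beta * z)) are assumptions made for the example. The exact term sums over every subset of the risk set that could have been the tied events, which is why its cost grows combinatorially with the sample size.

```python
from itertools import combinations
from math import factorial, prod


def exact_term(risk_scores, event_idx):
    """Exact (conditional-logit style) likelihood contribution of one risk set."""
    d = len(event_idx)
    numerator = prod(risk_scores[i] for i in event_idx)
    # Sum over every subset of size d that could have been the set of events.
    denominator = sum(prod(subset) for subset in combinations(risk_scores, d))
    return numerator / denominator


def efron_term(risk_scores, event_idx):
    """Efron's approximate likelihood contribution of one risk set."""
    d = len(event_idx)
    numerator = prod(risk_scores[i] for i in event_idx)
    total = sum(risk_scores)
    tied = sum(risk_scores[i] for i in event_idx)
    # Each successive factor removes a further 1/d share of the tied risk.
    denominator = prod(total - (k / d) * tied for k in range(d))
    return numerator / denominator


# Toy risk set with three subjects (hypothetical risk scores).
scores = [3.0, 1.0, 1.0]

# One event, no ties: the two terms coincide.
single_exact = exact_term(scores, [0])
single_efron = efron_term(scores, [0])

# Two tied events: the terms differ. Efron's term is multiplied by d!
# (a constant that does not depend on beta) so that both are on the
# same scale before comparing them.
tied_exact = exact_term(scores, [0, 1])
tied_efron_scaled = factorial(2) * efron_term(scores, [0, 1])
print(single_exact, single_efron, tied_exact, tied_efron_scaled)
```

With only three subjects in the risk set the two tied-event values are visibly different; the gap narrows as the risk set grows relative to the number of ties.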
Data Preparation
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from pydts.examples_utils.generate_simulations_data import generate_quick_start_df
from pydts.examples_utils.plots import plot_example_pred_output
import warnings
pd.set_option("display.max_rows", 500)
warnings.filterwarnings('ignore')
%matplotlib inline
Estimation
In the following, we apply the estimation method of Meir et al. (2022). Note that the input DataFrame must not contain a column named 'C'.
Standard Error of the Regression Coefficients
Regularization
The Exact version supports regularization when estimating the \(\beta_j\) coefficients. Regularization is enabled by passing the fit_beta_kwargs argument to the fit() method. The added regularization term is of the form:
$$
\mbox{Penalizer} \cdot \Bigg( \frac{1-\mbox{L1_wt}}{2}||\beta||_{2}^{2} + \mbox{L1_wt} ||\beta||_1 \Bigg)
$$
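As a quick numerical check of the formula above, the following sketch evaluates the penalty term for a given coefficient vector. It is plain Python and does not call PyDTS or statsmodels; the function name elastic_net_penalty and the example coefficient values are assumptions made for illustration.

```python
def elastic_net_penalty(beta, penalizer, l1_wt):
    """Evaluate Penalizer * ((1 - L1_wt) / 2 * ||beta||_2^2 + L1_wt * ||beta||_1)."""
    l2_squared = sum(b * b for b in beta)   # squared L2 norm
    l1_norm = sum(abs(b) for b in beta)     # L1 norm
    return penalizer * ((1.0 - l1_wt) / 2.0 * l2_squared + l1_wt * l1_norm)


beta = [0.5, -1.0, 0.0]  # hypothetical coefficient vector

lasso_penalty = elastic_net_penalty(beta, penalizer=0.1, l1_wt=1.0)   # pure L1
ridge_penalty = elastic_net_penalty(beta, penalizer=0.1, l1_wt=0.0)   # pure L2
mixed_penalty = elastic_net_penalty(beta, penalizer=0.1, l1_wt=0.5)   # Elastic Net
print(lasso_penalty, ridge_penalty, mixed_penalty)
```

Setting L1_wt to 1 leaves only the L1 term, setting it to 0 leaves only the L2 term, and any value in between mixes the two.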
In statsmodels, the penalization parameter is denoted as alpha. Thus, adding L1, L2, or Elastic Net regularization can be done as follows:
L1
L2
Elastic Net
Prediction
Full predictions are produced by the predict_cumulative_incident_function() method.
The input is a pandas.DataFrame containing, for each observation, the covariate columns that were used in the fit() method (Z1-Z5 in our example).
The following columns are added to the output:
- The overall survival at each time point t
- The hazard for each failure type \(j\) at each time point t
- The probability of event type \(j\) at time t
- The Cumulative Incidence Function (CIF) of event type \(j\) at time t
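The relationship between these four quantities can be sketched in plain Python. This is an illustration of the standard discrete-time competing-risks identities, not the PyDTS implementation; the function name predict_from_hazards and the toy hazard values are assumptions made for the example. It assumes that the overall survival at time t is the product over k ≤ t of one minus the sum of the hazards at k, that the probability of event type j at time t is the hazard for j at t multiplied by the overall survival at t-1, and that the CIF is the running sum of those probabilities.

```python
def predict_from_hazards(hazards):
    """Derive overall survival, event probabilities and CIFs from hazards.

    hazards maps each event type j to a list of hazards at times t = 1, 2, ...
    """
    event_types = sorted(hazards)
    n_times = len(hazards[event_types[0]])

    overall_survival = []
    prob = {j: [] for j in event_types}
    cif = {j: [] for j in event_types}

    survival_prev = 1.0                       # overall survival at t - 1
    running = {j: 0.0 for j in event_types}   # cumulative probability per event type
    for t in range(n_times):
        for j in event_types:
            p = hazards[j][t] * survival_prev  # P(T = t, J = j)
            running[j] += p
            prob[j].append(p)
            cif[j].append(running[j])
        # Surviving time t means experiencing none of the event types at t.
        survival_prev *= 1.0 - sum(hazards[j][t] for j in event_types)
        overall_survival.append(survival_prev)
    return overall_survival, prob, cif


# Hypothetical hazards for two event types over two time points.
toy_hazards = {1: [0.1, 0.2], 2: [0.2, 0.1]}
surv, prob, cif = predict_from_hazards(toy_hazards)
print(surv, prob, cif)
```

At every time point, the overall survival and the CIFs of all event types sum to one, which is a convenient sanity check on any prediction output.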
In the following, we provide predictions for the individuals with ID values (pid) 0, 1 and 2. The output is transposed for easier viewing.