Sure Independence Screening
pydts.screening.SISTwoStagesFitter()
Bases: BaseSISTwoStages
Source code in src/pydts/screening.py
TwoStagesFitter_type = 'CoxPHFitter' (instance attribute)
chosen_covariates = None (instance attribute)
chosen_covariates_j = None (instance attribute)
covariates = None (instance attribute)
df = pd.DataFrame() (instance attribute)
duration_col = None (instance attribute)
event_type_col = None (instance attribute)
events = None (instance attribute)
expanded_df = pd.DataFrame() (instance attribute)
final_model = None (instance attribute)
marginal_estimates_df = pd.DataFrame() (instance attribute)
null_model_df = None (instance attribute)
permuted_df = pd.DataFrame() (instance attribute)
permuted_expanded_df = pd.DataFrame() (instance attribute)
pid_col = None (instance attribute)
threshold = None (instance attribute)
times = None (instance attribute)
_get_params_cols_from_res_df(res_df)
fit(df, threshold=None, quantile=1, covariates=None, event_type_col='J', duration_col='X', pid_col='pid', x0=0, fit_beta_kwargs={}, verbose=2, nb_workers=WORKERS, seed=None, fit_final_model=True)
This method performs the principled sure independence screening (PSIS) procedure of Zhao et al. (2012) for discrete-time data, using either a user-defined or a data-driven threshold.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | Training data for fitting the model. | required |
threshold | float | A user-defined threshold. Defaults to None, in which case a data-driven threshold is used. | None |
quantile | float | The quantile of the absolute values of the null-model coefficients that determines the data-driven threshold. Used only when threshold = None. Defaults to 1, which corresponds to the maximum absolute value of the null model's coefficients. | 1 |
covariates | list | List of covariates for which to estimate the marginal regression coefficients. | None |
event_type_col | str | The event type column name (must be a column in df). A right-censored sample i is indicated by event value 0, i.e. df.loc[i, event_type_col] = 0. | 'J' |
duration_col | str | Last follow-up time column name (must be a column in df). | 'X' |
pid_col | str | Sample ID column name (must be a column in df). | 'pid' |
x0 | (Union[array, int], Optional) | Initial guess to pass to the scipy.optimize.minimize function. | 0 |
fit_beta_kwargs | (dict, Optional) | Keyword arguments to pass on to the estimation procedure. | {} |
verbose | (int, Optional) | The verbosity level of pandarallel. | 2 |
nb_workers | (int, Optional) | The number of workers to pass to pandarallel. If not specified, defaults to the number of workers available. | WORKERS |
seed | int | Pseudo-random state. | None |
fit_final_model | bool | If True, fit and return the TwoStagesFitter with the selected covariates. | True |
Returns:
Name | Type | Description |
---|---|---|
final_model | TwoStagesFitter | Estimated model with the chosen covariates after PSIS. |
Source code in src/pydts/screening.py
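The selection step of the screening described above can be sketched in plain Python. This is only an illustration of the thresholding idea, not the PyDTS implementation: the function `select_covariates`, the dictionary layout, and the toy numbers are hypothetical, and taking the union of the per-event selections is an assumption.

```python
def select_covariates(marginal_betas, thresholds):
    """Keep covariates whose |marginal coefficient| exceeds the threshold.

    marginal_betas: {event_type: {covariate: estimated coefficient}}
    thresholds:     {event_type: threshold for that event type}
    """
    chosen_per_event = {}
    for event, betas in marginal_betas.items():
        chosen_per_event[event] = sorted(
            name for name, beta in betas.items()
            if abs(beta) > thresholds[event]
        )
    # Assumption: the final covariate set is the union over event types.
    chosen = sorted(set().union(*chosen_per_event.values()))
    return chosen_per_event, chosen


# Toy marginal estimates for two competing events (hypothetical numbers).
marginal_betas = {
    1: {"Z1": 0.80, "Z2": -0.05, "Z3": 0.10},
    2: {"Z1": 0.02, "Z2": -0.60, "Z3": 0.07},
}
thresholds = {1: 0.25, 2: 0.30}

chosen_per_event, chosen = select_covariates(marginal_betas, thresholds)
print(chosen_per_event)
print(chosen)
```

In this toy input only Z1 passes the threshold for event 1 and only Z2 passes it for event 2, so Z3 is screened out.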
fit_marginal_model(expanded_df, covariate, event_type_col='J', duration_col='X', pid_col='pid', x0=0, fit_beta_kwargs={}, verbose=2, nb_workers=1)
This method fits a marginal model to data using a single covariate. Note that the expanded discrete-time data is expected as an input (see the Methods section of PyDTS documentation and pydts.utils.get_expanded_df).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
expanded_df | DataFrame | Expanded training data for fitting the model. | required |
covariate | str | A single covariate to be used in estimating the regression coefficients. | required |
event_type_col | str | The event type column name (must be a column in df). A right-censored sample i is indicated by event value 0, i.e. df.loc[i, event_type_col] = 0. | 'J' |
duration_col | str | Last follow-up time column name (must be a column in df). | 'X' |
pid_col | str | Sample ID column name (must be a column in df). | 'pid' |
x0 | (Union[array, int], Optional) | Initial guess to pass to the scipy.optimize.minimize function. | 0 |
fit_beta_kwargs | (dict, Optional) | Keyword arguments to pass on to the estimation procedure. | {} |
verbose | (int, Optional) | The verbosity level of pandarallel. | 2 |
nb_workers | (int, Optional) | The number of workers to pass to pandarallel. Defaults to 1. | 1 |
Returns:
Name | Type | Description |
---|---|---|
result | DataFrame | Estimated parameter and standard errors. TwoStagesFitter.get_beta_SE() output. |
Source code in src/pydts/screening.py
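To make the notion of expanded discrete-time data more concrete, here is a minimal stdlib sketch of a person-period expansion. It is not the output format of pydts.utils.get_expanded_df: the function `expand_discrete_time`, the column names `t` and `event_<j>`, and the assumption that follow-up times are integers starting at 1 are all illustrative.

```python
def expand_discrete_time(rows, events, duration_col="X", event_type_col="J"):
    """Turn one row per subject into one row per subject and time point.

    Each subject contributes rows for t = 1, ..., X. The indicator
    event_<j> is 1 only on the last row, and only if the subject
    experienced event j (J == j); censored subjects (J == 0) get all zeros.
    """
    expanded = []
    for row in rows:
        last_time = row[duration_col]
        for t in range(1, last_time + 1):
            # Copy identifiers and covariates, drop the original outcome columns.
            new_row = {
                key: value for key, value in row.items()
                if key not in (duration_col, event_type_col)
            }
            new_row["t"] = t
            for j in events:
                new_row[f"event_{j}"] = int(
                    t == last_time and row[event_type_col] == j
                )
            expanded.append(new_row)
    return expanded


# Two hypothetical subjects: one with event 1 at time 3, one censored at time 2.
rows = [
    {"pid": 1, "Z1": 0.5, "X": 3, "J": 1},
    {"pid": 2, "Z1": -1.2, "X": 2, "J": 0},
]
expanded = expand_discrete_time(rows, events=[1, 2])
for r in expanded:
    print(r)
```

The first subject yields three rows and the second yields two, so a marginal model fitted on such data sees one observation per subject and time point at risk.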
get_data_driven_threshold(df, covariates=None, quantile=1, event_type_col='J', duration_col='X', pid_col='pid', x0=0, fit_beta_kwargs={}, verbose=2, nb_workers=WORKERS, seed=None)
This method calculates a data-driven threshold for each risk. It fits marginal models to the permuted data and returns the required quantile of the absolute values of the coefficients estimated from the null model.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | Training data for fitting the model. | required |
covariates | list | List of covariates for which to estimate the marginal regression coefficients. | None |
quantile | float | The quantile of the absolute values of the null-model coefficients that determines the data-driven threshold. Defaults to 1, which corresponds to the maximum absolute value of the null model's coefficients. | 1 |
event_type_col | str | The event type column name (must be a column in df). A right-censored sample i is indicated by event value 0, i.e. df.loc[i, event_type_col] = 0. | 'J' |
duration_col | str | Last follow-up time column name (must be a column in df). | 'X' |
pid_col | str | Sample ID column name (must be a column in df). | 'pid' |
x0 | (Union[array, int], Optional) | Initial guess to pass to the scipy.optimize.minimize function. | 0 |
fit_beta_kwargs | (dict, Optional) | Keyword arguments to pass on to the estimation procedure. | {} |
verbose | (int, Optional) | The verbosity level of pandarallel. | 2 |
nb_workers | (int, Optional) | The number of workers to pass to pandarallel. If not specified, defaults to the number of workers available. | WORKERS |
seed | int | Pseudo-random state. | None |
Returns:
Name | Type | Description |
---|---|---|
threshold | Series | Estimated thresholds. |
Source code in src/pydts/screening.py
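The role of the quantile argument can be illustrated with a small stdlib sketch: for each risk, take the chosen quantile of the absolute null-model coefficients. The helper names, the dictionary layout, and the linear-interpolation rule are assumptions made for this example and may differ from what PyDTS does internally.

```python
def quantile(values, q):
    """Linear-interpolation quantile of a non-empty list, with 0 <= q <= 1."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])


def data_driven_threshold(null_betas, q=1):
    """Per-risk threshold: quantile q of |coefficient| under the null model.

    null_betas: {event_type: {covariate: coefficient fitted on permuted data}}
    """
    return {
        event: quantile([abs(beta) for beta in betas.values()], q)
        for event, betas in null_betas.items()
    }


# Hypothetical coefficients estimated from permuted (null) data.
null_betas = {
    1: {"Z1": 0.04, "Z2": -0.11, "Z3": 0.07},
    2: {"Z1": -0.02, "Z2": 0.05, "Z3": -0.09},
}

thr_max = data_driven_threshold(null_betas, q=1)    # maximum |coefficient|
thr_med = data_driven_threshold(null_betas, q=0.5)  # median |coefficient|
print(thr_max)
print(thr_med)
```

With q=1 the threshold is the largest absolute null coefficient for each risk, which is the most conservative choice; smaller quantiles lower the threshold and let more covariates through.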
get_marginal_estimates(expanded_df, covariates=None, event_type_col='J', duration_col='X', pid_col='pid', verbose=2, x0=0, fit_beta_kwargs={}, nb_workers=WORKERS)
This method fits a marginal model to the data for each of the covariates, one covariate at a time. Note that the expanded discrete-time data is expected as input (see the Methods section of the PyDTS documentation and pydts.utils.get_expanded_df).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
expanded_df | DataFrame | Expanded training data for fitting the model. | required |
covariates | list | List of covariates for which to estimate the marginal regression coefficients. | None |
event_type_col | str | The event type column name (must be a column in df). A right-censored sample i is indicated by event value 0, i.e. df.loc[i, event_type_col] = 0. | 'J' |
duration_col | str | Last follow-up time column name (must be a column in df). | 'X' |
pid_col | str | Sample ID column name (must be a column in df). | 'pid' |
verbose | (int, Optional) | The verbosity level of pandarallel. | 2 |
x0 | (Union[array, int], Optional) | Initial guess to pass to the scipy.optimize.minimize function. | 0 |
fit_beta_kwargs | (dict, Optional) | Keyword arguments to pass on to the estimation procedure. | {} |
nb_workers | (int, Optional) | The number of workers to pass to pandarallel. If not specified, defaults to the number of workers available. | WORKERS |
Returns:
Name | Type | Description |
---|---|---|
results_df | DataFrame | Estimated parameters and standard errors of the marginal models. A concatenation of all the TwoStagesFitter.get_beta_SE() outputs. |
Source code in src/pydts/screening.py
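The "one model per covariate" pattern can be sketched as follows. To keep the example self-contained, an ordinary least-squares slope is used as a stand-in estimator; it is not the estimation procedure used by TwoStagesFitter, and the names `slope`, `marginal_estimates`, and the toy data are hypothetical.

```python
def slope(x, y):
    """Ordinary least-squares slope of y on x (stand-in marginal estimator)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def marginal_estimates(data, outcome, covariates):
    """Fit a separate single-covariate model for each covariate.

    Every fit ignores all other covariates, so the fits are independent
    of each other and could be run in parallel.
    """
    return {cov: slope(data[cov], data[outcome]) for cov in covariates}


# Hypothetical column-oriented data with a binary outcome.
data = {
    "Z1": [0.0, 1.0, 2.0, 3.0],
    "Z2": [1.0, 1.0, 0.0, 0.0],
    "Z3": [1.0, 0.0, 0.0, 1.0],
    "event": [0, 0, 1, 1],
}

estimates = marginal_estimates(data, "event", ["Z1", "Z2", "Z3"])
print(estimates)
```

Because the per-covariate fits do not depend on each other, distributing them across workers is a natural design choice, which is presumably why the method exposes nb_workers.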
permute_df(df, event_type_col='J', duration_col='X', pid_col='pid', seed=None)
This method applies a random permutation to the event-time and event-type columns of the training data, so that the covariates are decoupled from the outcome; the permuted data follow the null model.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | Training data for fitting the model. | required |
event_type_col | str | The event type column name (must be a column in df). A right-censored sample i is indicated by event value 0, i.e. df.loc[i, event_type_col] = 0. | 'J' |
duration_col | str | Last follow-up time column name (must be a column in df). | 'X' |
pid_col | str | Sample ID column name (must be a column in df). | 'pid' |
seed | (int, Optional) | Pseudo-random state. | None |
Returns:
Name | Type | Description |
---|---|---|
permuted_df | DataFrame | Null model data. |
Source code in src/pydts/screening.py
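A minimal stdlib sketch of such a permutation is shown below. It assumes that the event time and event type of a subject are shuffled together as a pair, which the description above does not state explicitly; the function name `permute_outcomes` and the toy rows are hypothetical.

```python
import random


def permute_outcomes(rows, event_type_col="J", duration_col="X", seed=None):
    """Shuffle (duration, event type) pairs across subjects.

    Covariates stay with their original subject, so any association
    between covariates and outcome is broken. The input is not modified.
    """
    rng = random.Random(seed)
    # Assumption: each (X, J) pair is kept intact while being reassigned.
    outcomes = [(row[duration_col], row[event_type_col]) for row in rows]
    rng.shuffle(outcomes)
    permuted = []
    for row, (duration, event_type) in zip(rows, outcomes):
        new_row = dict(row)
        new_row[duration_col] = duration
        new_row[event_type_col] = event_type
        permuted.append(new_row)
    return permuted


rows = [
    {"pid": 1, "Z1": 0.5, "X": 3, "J": 1},
    {"pid": 2, "Z1": -1.2, "X": 2, "J": 0},
    {"pid": 3, "Z1": 0.1, "X": 5, "J": 2},
    {"pid": 4, "Z1": 2.3, "X": 1, "J": 1},
]
permuted = permute_outcomes(rows, seed=0)
for r in permuted:
    print(r)
```

Passing the same seed reproduces the same permutation, which makes a threshold derived from the permuted data repeatable.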
pydts.screening.SISTwoStagesFitterExact()
Bases: BaseSISTwoStages
Source code in src/pydts/screening.py
TwoStagesFitter_type = 'Exact' (instance attribute)
chosen_covariates = None (instance attribute)
chosen_covariates_j = None (instance attribute)
covariates = None (instance attribute)
df = pd.DataFrame() (instance attribute)
duration_col = None (instance attribute)
event_type_col = None (instance attribute)
events = None (instance attribute)
expanded_df = pd.DataFrame() (instance attribute)
final_model = None (instance attribute)
marginal_estimates_df = pd.DataFrame() (instance attribute)
null_model_df = None (instance attribute)
permuted_df = pd.DataFrame() (instance attribute)
permuted_expanded_df = pd.DataFrame() (instance attribute)
pid_col = None (instance attribute)
threshold = None (instance attribute)
times = None (instance attribute)
_get_params_cols_from_res_df(res_df)
fit(df, threshold=None, quantile=1, covariates=None, event_type_col='J', duration_col='X', pid_col='pid', x0=0, fit_beta_kwargs={}, verbose=2, nb_workers=WORKERS, seed=None, fit_final_model=True)
This method performs the principled sure independence screening (PSIS) procedure of Zhao et al. (2012) for discrete-time data, using either a user-defined or a data-driven threshold.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | Training data for fitting the model. | required |
threshold | float | A user-defined threshold. Defaults to None, in which case a data-driven threshold is used. | None |
quantile | float | The quantile of the absolute values of the null-model coefficients that determines the data-driven threshold. Used only when threshold = None. Defaults to 1, which corresponds to the maximum absolute value of the null model's coefficients. | 1 |
covariates | list | List of covariates for which to estimate the marginal regression coefficients. | None |
event_type_col | str | The event type column name (must be a column in df). A right-censored sample i is indicated by event value 0, i.e. df.loc[i, event_type_col] = 0. | 'J' |
duration_col | str | Last follow-up time column name (must be a column in df). | 'X' |
pid_col | str | Sample ID column name (must be a column in df). | 'pid' |
x0 | (Union[array, int], Optional) | Initial guess to pass to the scipy.optimize.minimize function. | 0 |
fit_beta_kwargs | (dict, Optional) | Keyword arguments to pass on to the estimation procedure. | {} |
verbose | (int, Optional) | The verbosity level of pandarallel. | 2 |
nb_workers | (int, Optional) | The number of workers to pass to pandarallel. If not specified, defaults to the number of workers available. | WORKERS |
seed | int | Pseudo-random state. | None |
fit_final_model | bool | If True, fit and return the TwoStagesFitter with the selected covariates. | True |
Returns:
Name | Type | Description |
---|---|---|
final_model | TwoStagesFitter | Estimated model with the chosen covariates after PSIS. |
Source code in src/pydts/screening.py
fit_marginal_model(expanded_df, covariate, event_type_col='J', duration_col='X', pid_col='pid', x0=0, fit_beta_kwargs={}, verbose=2, nb_workers=1)
This method fits a marginal model to data using a single covariate. Note that the expanded discrete-time data is expected as an input (see the Methods section of PyDTS documentation and pydts.utils.get_expanded_df).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
expanded_df | DataFrame | Expanded training data for fitting the model. | required |
covariate | str | A single covariate to be used in estimating the regression coefficients. | required |
event_type_col | str | The event type column name (must be a column in df). A right-censored sample i is indicated by event value 0, i.e. df.loc[i, event_type_col] = 0. | 'J' |
duration_col | str | Last follow-up time column name (must be a column in df). | 'X' |
pid_col | str | Sample ID column name (must be a column in df). | 'pid' |
x0 | (Union[array, int], Optional) | Initial guess to pass to the scipy.optimize.minimize function. | 0 |
fit_beta_kwargs | (dict, Optional) | Keyword arguments to pass on to the estimation procedure. | {} |
verbose | (int, Optional) | The verbosity level of pandarallel. | 2 |
nb_workers | (int, Optional) | The number of workers to pass to pandarallel. Defaults to 1. | 1 |
Returns:
Name | Type | Description |
---|---|---|
result | DataFrame | Estimated parameter and standard errors. TwoStagesFitter.get_beta_SE() output. |
Source code in src/pydts/screening.py
get_data_driven_threshold(df, covariates=None, quantile=1, event_type_col='J', duration_col='X', pid_col='pid', x0=0, fit_beta_kwargs={}, verbose=2, nb_workers=WORKERS, seed=None)
This method calculates a data-driven threshold for each risk. It fits marginal models to the permuted data and returns the required quantile of the absolute values of the coefficients estimated from the null model.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | Training data for fitting the model. | required |
covariates | list | List of covariates for which to estimate the marginal regression coefficients. | None |
quantile | float | The quantile of the absolute values of the null-model coefficients that determines the data-driven threshold. Defaults to 1, which corresponds to the maximum absolute value of the null model's coefficients. | 1 |
event_type_col | str | The event type column name (must be a column in df). A right-censored sample i is indicated by event value 0, i.e. df.loc[i, event_type_col] = 0. | 'J' |
duration_col | str | Last follow-up time column name (must be a column in df). | 'X' |
pid_col | str | Sample ID column name (must be a column in df). | 'pid' |
x0 | (Union[array, int], Optional) | Initial guess to pass to the scipy.optimize.minimize function. | 0 |
fit_beta_kwargs | (dict, Optional) | Keyword arguments to pass on to the estimation procedure. | {} |
verbose | (int, Optional) | The verbosity level of pandarallel. | 2 |
nb_workers | (int, Optional) | The number of workers to pass to pandarallel. If not specified, defaults to the number of workers available. | WORKERS |
seed | int | Pseudo-random state. | None |
Returns:
Name | Type | Description |
---|---|---|
threshold | Series | Estimated thresholds. |
Source code in src/pydts/screening.py
get_marginal_estimates(expanded_df, covariates=None, event_type_col='J', duration_col='X', pid_col='pid', verbose=2, x0=0, fit_beta_kwargs={}, nb_workers=WORKERS)
This method fits a marginal model to the data for each of the covariates, one covariate at a time. Note that the expanded discrete-time data is expected as input (see the Methods section of the PyDTS documentation and pydts.utils.get_expanded_df).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
expanded_df | DataFrame | Expanded training data for fitting the model. | required |
covariates | list | List of covariates for which to estimate the marginal regression coefficients. | None |
event_type_col | str | The event type column name (must be a column in df). A right-censored sample i is indicated by event value 0, i.e. df.loc[i, event_type_col] = 0. | 'J' |
duration_col | str | Last follow-up time column name (must be a column in df). | 'X' |
pid_col | str | Sample ID column name (must be a column in df). | 'pid' |
verbose | (int, Optional) | The verbosity level of pandarallel. | 2 |
x0 | (Union[array, int], Optional) | Initial guess to pass to the scipy.optimize.minimize function. | 0 |
fit_beta_kwargs | (dict, Optional) | Keyword arguments to pass on to the estimation procedure. | {} |
nb_workers | (int, Optional) | The number of workers to pass to pandarallel. If not specified, defaults to the number of workers available. | WORKERS |
Returns:
Name | Type | Description |
---|---|---|
results_df | DataFrame | Estimated parameters and standard errors of the marginal models. A concatenation of all the TwoStagesFitter.get_beta_SE() outputs. |
Source code in src/pydts/screening.py
permute_df(df, event_type_col='J', duration_col='X', pid_col='pid', seed=None)
This method applies a random permutation to the event-time and event-type columns of the training data, so that the covariates are decoupled from the outcome; the permuted data follow the null model.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | Training data for fitting the model. | required |
event_type_col | str | The event type column name (must be a column in df). A right-censored sample i is indicated by event value 0, i.e. df.loc[i, event_type_col] = 0. | 'J' |
duration_col | str | Last follow-up time column name (must be a column in df). | 'X' |
pid_col | str | Sample ID column name (must be a column in df). | 'pid' |
seed | (int, Optional) | Pseudo-random state. | None |
Returns:
Name | Type | Description |
---|---|---|
permuted_df | DataFrame | Null model data. |