Hospitalization length of stay¤
In this example, we simulate a dataset of 10,000 hospitalized patietns as follows:
We sample the covariates such that:
\(\mbox{Admission Year} \sim \mbox{Uniform}(2000,2014)\)
\(\mbox{Gender} \sim \mbox{Bernoullie}(0.5), \qquad\) (1 is Female, 0 is Male)
\(\mbox{Age} \sim \mbox{Normal}(72+5*\mbox{gender}\;,\;12)\) (years)
\(\mbox{Height} \sim \mbox{Normal}(175-5*\mbox{gender}\;,\;7)\) (cm)
\(\mbox{Weight} \sim \mbox{Normal}(\frac{\mbox{height}}{175}*80 - 5 * \mbox{gender} + \frac{\mbox{age}}{20}\;,\;8)\) (kg)
\(\mbox{BMI} \: (\mbox{Body} \: \mbox{Mass} \: \mbox{Index}) = \frac{\mbox{Weight}}{(\frac{\mbox{Height}}{100})^2}\) (kg/m^2)
\(\mbox{Admission Serial} \sim \mbox{LogNormal}(0, 0.75)\)
\(\mbox{Smoking Status} \sim \mbox{Multinomial(No, Previously, Currently)} \quad p=[0.5, 0.3, 0.2]\)
Precondition features:
\(\mbox{General_p} = 0.003 * \mbox{bmi} - 0.15 * \mbox{gender} + 0.002 * \mbox{age} + 0.1 * \mbox{smoking}\)
\(\mbox{Preconditions_p} = max( min(\mbox{General_p}, 0.65), 0.05)\)
\(\mbox{Hypertension} \sim \mbox{Bernoulli}(\mbox{Preconditions_p})\)
\(\mbox{Diabetes} \sim \mbox{Bernoulli}(\mbox{Preconditions_p} + 0.003*\mbox{BMI})\)
\(\mbox{Arterial Fibrillation} \sim \mbox{Bernoulli}(\mbox{Preconditions_p})\)
\(\mbox{Chronic Obstructive Pulmonary Disease} \sim \mbox{Bernoulli}(\mbox{Preconditions_p} + 0.1*\mbox{smoking})\)
\(\mbox{Chronic Renal Failure} \sim \mbox{Bernoulli}(\mbox{Preconditions_p})\)
Finally, based on the above covariates, we sample LOS and the event type: discharged or in-hospital death.
After sampling the LOS, for some patients we remove weight (and BMI) information based on year of admission, to reflect missingness which can occur in real world data.
Loading Simulation Dataset¤
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
pd.set_option('display.max_rows', 500)
from pydts.examples_utils.simulations_data_config import *
from pydts.examples_utils.datasets import load_LOS_simulated_data
import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')
Data Description¤
First, let's look at some descriptive statistics of the columns:
paper_table = data_df.describe().T.round(2)
idx = [r for r in paper_table.index if r not in ['ID', 'Returning_patient', 'Death_date_in_hosp_missing',
'In_hospital_death',]]
paper_table = paper_table.loc[idx, :]
paper_table.rename({
'Age': 'Age (years)',
'Gender': 'Gender (1 - female)',
'Admyear': 'Admyear (year)',
'Firstadm': 'Firstadm (1 - yes)',
'Admserial': 'Admserial',
'Weight': 'Weight (kg)',
'Height': 'Height (cm)',
'BMI': 'Height (kg/m^2)',
'Smoking': f"Smoking (0 - never, \\ \qquad \qquad \, 1 - previously, \\ \qquad \qquad \, 2 - currently)",
'Discharge_relative_date': 'LOS (days)',
'Hypertension': 'Hypertension (1 - yes)',
'Diabities': 'Diabetes (1 - yes)',
'AF': 'AF (1 - yes)',
'COPD': 'COPD (1 - yes)',
'CRF': 'CRF (1 - yes)',
'Death_relative_date_in_hosp': 'In-Hospital Death (days)'}, inplace=True)
paper_table['count'] = paper_table['count'].astype(int)
When dealing with healthcare data, changes of policy can lead to biased data, so it is also a good idea to see how the data looks like with stratification by year of admission:
Let's visualize the data. With the following figures we can see:
(a) How many patients were hospitalized in total, how many were discharged\died, stratified by year of admission.
(b) Age distributions by sex, males (0) and females (1).
(c) Number of patients at each number of admissions, with a separation to 4 groups.
(d) Kaplan-Meier curves for LOS with and without death as censoring
Next, with the following figures we can further visualize the possible outcomes:
(a) Description of the events (death and release).
(b) Distribution of age, by sex, among the patients who died.
(c-d) Number of observed event, by event type.
and a visualization of the missingness of the weight variable by year of admission:
Data Preprocessing¤
Missing Values Imputation¤
We search for missing data and use median imputation:
Standardization¤
In some applications it is customize to standardize the covariates, such that each will be with the mean 0 and standard deviation of 1.
For Height, Weight, Age and BMI columns we use Standard scaling, and for Returning Patient and Smoking we use Min-Max scaling:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
to_normalize = [HEIGHT_COL, WEIGHT_COL, AGE_COL, BMI_COL]
to_minmax = [RETURNING_PATIENT_COL, SMOKING_COL]
std_scaler = StandardScaler()
X[to_normalize] = std_scaler.fit_transform(X[to_normalize])
minmax_scaler = MinMaxScaler()
X[to_minmax] = minmax_scaler.fit_transform(X[to_minmax])
X.head()
Creating event type and event time:
In_hospital_death = 1 means in hospital death (J=1)
In_hospital_death = 0 with Discharge_relative_date <= 30 means a discharge event (J=2)
Discharge_relative_date = 31 means right censored example, i.e. (J=0 at T=30)
Estimation¤
Now we can estimate the parameters of the model using a TwoStagesFitter:
fig, axes = plt.subplots(1,2, figsize=(18,8))
ax = axes[0]
fitter.plot_all_events_alpha(ax=ax, show=False)
ax.grid()
ax.legend(fontsize=16, loc='center right')
add_panel_text(ax, 'a')
ax = axes[1]
fitter.plot_all_events_beta(ax=ax, show=False, xlabel='Value')
ax.legend(fontsize=16, loc='center right')
add_panel_text(ax, 'b')
fig.tight_layout()