Predicting 30-Day Readmission After Cardiac Surgery Using MIMIC-IV Dataset

Biostat 203B and 212A Final Project

Author

Fauzan Budi Prasetya and UID 306779006

1. Title Page

  • Project title: “Predicting 30-Day Readmission After Cardiac Surgery Using MIMIC-IV Dataset”

2. Abstract

Background: Hospital readmission within 30 days of discharge is common among patients undergoing cardiac surgery. By identifying patients at high risk for readmission, healthcare providers can implement targeted interventions to improve post-discharge care and reduce preventable readmissions.

Objectives: The goal of this project is to develop and validate predictive models for 30-day hospital readmission among cardiac-surgery patients who underwent an EKG during their hospital stay.

Methods: This study used MIMIC-IV electronic health record data from cardiac-surgery patients with an EKG examination. The data include patient demographics, hospital stay details, comorbidities, EKG features, and laboratory results. The final cohort comprised 8,167 admissions from 7,640 cardiac-surgery patients, of which 15.6% were followed by readmission within 30 days. Three predictive classification models were developed: a logistic regression, a random forest, and an XGBoost model. Performance was evaluated using the area under the receiver operating characteristic curve (ROC AUC), with feature importance assessed to identify key predictors.

Results: The models achieved similar test-set ROC AUC values, ranging from 0.66 (random forest) to 0.71 (logistic regression). This moderate performance suggests that, while readmission risk among cardiac-surgery patients is complex, structured medical record data from EKGs and the associated hospital stay contain a measurable signal. Preliminary feature importance analysis indicated that length of stay, specific EKG abnormalities, demographics such as race, and the blood urea nitrogen (BUN) laboratory result were among the strongest predictors.

Conclusion: This study demonstrates that machine learning models using readily available clinical data can moderately predict 30-day readmission in high-risk cardiac-surgery patients with an EKG. A ROC AUC of roughly 0.66 to 0.71 is a starting point for identifying patients who may benefit from enhanced discharge planning. Future work to improve predictive performance will focus on integrating hospital discharge notes, physicians' EKG result notes, and a more comprehensive set of social determinants of health.

3. Introduction

3.1 Background

Hospital readmission within 30 days of discharge occurs in approximately 15-20% of patients undergoing cardiac surgery in the US. It is associated with increased healthcare costs, patient morbidity, and mortality. The Centers for Medicare and Medicaid Services Hospital Readmissions Reduction Program took effect on October 1, 2012, as part of the Affordable Care Act. This program reduces Medicare payments to hospitals with excess 30-day readmission rates for conditions including heart failure and myocardial infarction.

Identifying patients at high risk for readmission can enable healthcare providers to implement targeted interventions, such as enhanced discharge planning, post-discharge follow-up, and patient education, to improve outcomes and reduce preventable readmissions. Despite the importance of this issue, accurately predicting which patients are at risk for readmission remains a challenge. Previous studies have explored various clinical and demographic factors associated with readmission risk, but there is a need for more comprehensive models that integrate structured data from electronic health records (EHRs), including EKG features and hospital stay details, to improve predictive performance. This project aims to fill this gap by developing and validating machine learning models that utilize structured EHR data to predict 30-day readmission after cardiac surgery.

The hypothesis of this project is that machine learning models can predict 30-day readmission after cardiac surgery using structured data from EKGs and associated hospital stay details. We expect the models to achieve a ROC AUC of at least 0.65, which is typical of such readmission prediction models.

4. Methods

4.1 Data Source or Materials

This study used MIMIC-IV electronic health record data from cardiac-surgery patients with an inpatient EKG. The datasets were accessed through BigQuery, and the analysis was performed using R. The study cohort included adult patients who underwent cardiac surgery and had at least one EKG examination during their hospital stay. We excluded patients who died during the index hospitalization, those with missing discharge timestamps, and those who were discharged to hospice or palliative care. The final cohort comprised 8,167 admissions from 7,640 cardiac-surgery patients, with 15.6% of admissions followed by readmission within 30 days. Data access and analysis complied with the MIMIC-IV data use agreement, ensuring patient privacy and appropriate use of the data. All analyses were performed on de-identified data, and the study was approved by the relevant institutional review board. The dataset can be accessed through BigQuery after submitting a data use request.

4.2 Variables / Features

A total of 70 features were included in the final model, covering patient demographics, hospital stay details, comorbidities, EKG features, and laboratory results.

Demographics were taken from the patient records. They describe the socioeconomic and health background of the patients, which can influence readmission risk.

Hospital stay details were taken from the admissions table and include length of stay, admission type, admission location, discharge location, and insurance. ICU length of stay was obtained from the ICU stays table. These details matter because they reflect the severity of illness and the complexity of care, which are often associated with readmission risk.

Comorbidity features were derived using the Elixhauser comorbidity index, a widely used method for quantifying patient comorbidities based on ICD diagnosis codes. Comorbidities are important predictors of readmission risk, as patients with multiple or severe comorbid conditions are more likely to experience complications and require readmission.

EKG features were extracted from the machine measurements table, using the last EKG performed during the hospital stay, between admission and discharge. After excluding raw EKG columns with a high rate of missing data, we calculated derived features such as heart rate, QRS duration, QT interval, and T wave abnormalities. These features are relevant because they can indicate underlying cardiac conditions and abnormalities that may contribute to readmission risk.

Laboratory results included key biomarkers related to cardiac issues, such as potassium, sodium, calcium, creatinine, and hemoglobin levels. They are important predictors of readmission risk, as abnormal values can indicate ongoing health issues or complications that may lead to readmission.
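As an illustration of the derived EKG features, the sketch below computes them from the raw measurement columns that appear in the cohort table (rr_interval, qrs_onset, qrs_end, t_end). The study itself was carried out in R; this is a Python sketch only. The formulas and the millisecond units are assumptions inferred from the column names, not taken from the project code, and the function name derive_ekg_features is hypothetical.

```python
def derive_ekg_features(rr_interval, qrs_onset, qrs_end, t_end):
    """Derive interval features from raw EKG fiducial points.

    Assumption: all inputs are in milliseconds, and each derived feature is a
    simple difference of fiducial points. A missing input is passed as None.
    """
    def diff(later, earlier):
        # Propagate missingness instead of raising on None.
        if later is None or earlier is None:
            return None
        return later - earlier

    # 60,000 ms per minute divided by the RR interval gives beats per minute.
    heart_rate = None if not rr_interval else 60000.0 / rr_interval

    return {
        "ecg_heart_rate": heart_rate,
        "qrs_dur": diff(qrs_end, qrs_onset),   # QRS duration
        "qt_proxy": diff(t_end, qrs_onset),    # QRS onset to T end, a QT proxy
        "t_dur": diff(t_end, qrs_end),         # QRS end to T end
    }


# Illustrative values only, not taken from the cohort.
features = derive_ekg_features(rr_interval=800, qrs_onset=200, qrs_end=300, t_end=600)
```

Returning None for a derived feature whenever one of its inputs is missing leaves the gap visible to the later imputation step instead of silently filling it.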

The features were processed using a combination of imputation for missing values, normalization for numerical variables, and encoding for categorical variables. Some categorical variables, for example race, were grouped and collapsed into broader categories. All categorical variables were converted to factors, with low-frequency levels grouped into an "Other" category to reduce sparsity and improve model performance, and then encoded as dummy variables. Numerical variables were normalized so that they were on a comparable scale for modeling. Missing values in numerical features were imputed using k-nearest neighbors (KNN) imputation, which estimates missing values based on the similarity of other observations. The processed features were then used as input for the predictive models. The processing steps were implemented using the recipes package in R, which allowed for a systematic and reproducible approach to data preprocessing.
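The four preprocessing operations described above can be sketched in plain Python as follows. This is not the recipes implementation used in the study: the function names, the frequency threshold, the value of k, and the plain Euclidean distance are all assumptions chosen for illustration, and the KNN step in particular is a simplification of what a library imputer does.

```python
import math
from collections import Counter


def collapse_rare(levels, min_frac=0.05):
    """Replace levels rarer than min_frac (hypothetical threshold) with 'Other'."""
    counts = Counter(levels)
    n = len(levels)
    return [lv if counts[lv] / n >= min_frac else "Other" for lv in levels]


def one_hot(levels):
    """Dummy-encode a categorical column, dropping the first level as reference."""
    cats = sorted(set(levels))[1:]
    return [[1 if lv == c else 0 for c in cats] for lv in levels]


def zscore(column):
    """Center and scale observed values; missing values (None) stay missing."""
    obs = [x for x in column if x is not None]
    mean = sum(obs) / len(obs)
    sd = math.sqrt(sum((x - mean) ** 2 for x in obs) / (len(obs) - 1))
    return [None if x is None else (x - mean) / sd for x in column]


def knn_impute(rows, k=2):
    """Fill each None with the mean of that column among the k nearest rows.

    Distance is Euclidean over the columns observed in both rows (a
    simplification). Assumes every column has at least one observed donor.
    """
    def dist(a, b):
        shared = [(x - y) ** 2 for x, y in zip(a, b) if x is not None and y is not None]
        return math.sqrt(sum(shared) / len(shared)) if shared else math.inf

    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is None:
                donors = sorted(
                    (dist(row, other), other[j])
                    for m, other in enumerate(rows)
                    if m != i and other[j] is not None
                )[:k]
                filled[i][j] = sum(v for _, v in donors) / len(donors)
    return filled


# Toy inputs for illustration only.
collapsed = collapse_rare(["WHITE"] * 8 + ["BLACK", "ASIAN"], min_frac=0.2)
dummies = one_hot(collapsed)
scaled = zscore([1.0, 2.0, 3.0, None])
imputed = knn_impute([[1.0, 10.0], [2.0, 20.0], [3.0, None], [10.0, 100.0]], k=2)
```

In the toy example the two single-occurrence race levels fall below the threshold and become "Other", and the missing second value in the third row is filled with the average of its two closest rows.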

4.3 Analytical Approach / Modeling

The index time was defined as the discharge time of the cardiac surgery admission, with a prediction window of 30 days for readmission. The outcome variable was a binary indicator of whether the patient was readmitted to the hospital within 30 days of discharge from the index cardiac surgery admission. Subsequent admissions in which the patient was discharged in under 24 hours were not counted as readmissions.
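The outcome definition can be made concrete with a short sketch. It assumes each admission is a record with subject_id, admittime, and dischtime fields, and it reads the 24-hour rule as "a later stay shorter than 24 hours does not count as a readmission"; both the field names and that reading are assumptions, and the function label_readmissions is hypothetical rather than the project's code. For simplicity the sketch labels every admission it receives, whereas the study labels only index cardiac surgery admissions.

```python
from datetime import datetime, timedelta


def label_readmissions(admissions, window_days=30, min_stay_hours=24):
    """Return a 0/1 label per admission, aligned with the input order.

    An admission is labeled 1 if the same patient has a later admission that
    starts within window_days of its discharge and lasts at least
    min_stay_hours (assumed reading of the 24-hour rule).
    """
    labels = [0] * len(admissions)
    by_patient = {}
    for idx, adm in enumerate(admissions):
        by_patient.setdefault(adm["subject_id"], []).append(idx)

    for indices in by_patient.values():
        for i in indices:
            index_discharge = admissions[i]["dischtime"]
            for j in indices:
                if j == i:
                    continue
                later = admissions[j]
                gap = later["admittime"] - index_discharge
                stay = later["dischtime"] - later["admittime"]
                within_window = timedelta(0) <= gap <= timedelta(days=window_days)
                if within_window and stay >= timedelta(hours=min_stay_hours):
                    labels[i] = 1
                    break
    return labels


# Toy records for illustration only.
admissions = [
    {"subject_id": 1, "admittime": datetime(2020, 1, 1), "dischtime": datetime(2020, 1, 8)},
    {"subject_id": 1, "admittime": datetime(2020, 1, 20), "dischtime": datetime(2020, 1, 23)},
    {"subject_id": 2, "admittime": datetime(2020, 2, 1), "dischtime": datetime(2020, 2, 5)},
    {"subject_id": 2, "admittime": datetime(2020, 2, 10, 8), "dischtime": datetime(2020, 2, 10, 20)},
]
labels = label_readmissions(admissions)
```

In the toy data, patient 1 returns 12 days after discharge for a three-day stay, so the first admission is labeled 1. Patient 2 returns for only 12 hours, so under the assumed rule the index admission stays labeled 0.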

Three predictive classification models were developed: a logistic regression model, a random forest, and an XGBoost model. The models were developed in R, mainly using the tidyverse and tidymodels packages. Performance was then evaluated using the area under the receiver operating characteristic curve (ROC AUC), with feature importance assessed to identify key predictors.

All hyperparameters were tuned using grid search with 5-fold cross-validation, and the best configuration for each model was selected based on the highest ROC AUC. For logistic regression, we tuned the regularization strength (penalty, or lambda), which controls how strongly the coefficients are shrunk to prevent overfitting, and the elastic net mixing parameter (mixture, or alpha). The mixture is a number between 0 and 1, where 0 corresponds to pure ridge regression and 1 corresponds to pure lasso regression, so tuning it balances the two types of regularization to achieve better predictive performance.

For random forest, we tuned the feature randomness (mtry), the minimum node size (min_n), and the ensemble size, that is, the number of trees (trees). These three parameters were chosen because, according to the official documentation of the randomForest package and a research paper by Ishwaran, they are the core parameters that control the randomness, complexity, and size of the random forest model, which are crucial for optimizing its performance and preventing overfitting.

For XGBoost, we tuned the learning rate, maximum depth, and number of trees. These are the core parameters that control the learning process, model complexity, and ensemble size of the XGBoost model, which are essential for optimizing its performance and preventing overfitting. They were chosen based on the official documentation of the xgboost package and the research paper by Chen and Guestrin.
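To make the roles of penalty and mixture concrete, the sketch below evaluates an elastic net penalty term for a given coefficient vector. It assumes the glmnet-style parameterization, in which the ridge part carries a factor of one half; the report does not name the fitting engine, so this form is an assumption, and the coefficient values are made up.

```python
def elastic_net_penalty(coefs, penalty, mixture):
    """Penalty added to the negative log-likelihood (assumed glmnet-style form).

    penalty * (mixture * sum|b| + (1 - mixture) / 2 * sum b^2)
    mixture = 1 gives pure lasso; mixture = 0 gives pure ridge.
    """
    l1 = sum(abs(b) for b in coefs)
    l2 = sum(b * b for b in coefs)
    return penalty * (mixture * l1 + (1.0 - mixture) / 2.0 * l2)


# Hypothetical coefficients, for illustration only.
coefs = [1.0, -2.0]
lasso_term = elastic_net_penalty(coefs, penalty=0.1, mixture=1.0)
ridge_term = elastic_net_penalty(coefs, penalty=0.1, mixture=0.0)
blend_term = elastic_net_penalty(coefs, penalty=0.1, mixture=0.5)
```

Raising penalty scales the whole term and shrinks all coefficients, while mixture only shifts weight between the absolute-value part, which can set coefficients exactly to zero, and the squared part, which shrinks them smoothly.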
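The tuning procedure shared by the three models, a regular grid of candidate values scored by 5-fold cross-validation, can be outlined as below. This is a generic Python sketch, not the tidymodels code used in the study: the grid values are hypothetical, the folds are not stratified, and score_fn stands in for fitting a model on the training rows and computing ROC AUC on the held-out rows.

```python
import itertools
import random


def regular_grid(**param_values):
    """Every combination of the supplied candidate values, as a list of dicts."""
    names = list(param_values)
    return [dict(zip(names, combo)) for combo in itertools.product(*param_values.values())]


def kfold_indices(n_rows, k=5, seed=203):
    """Shuffle row indices and deal them into k folds of near-equal size."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def grid_search(grid, folds, score_fn):
    """Return the candidate with the highest mean held-out score.

    score_fn(params, train_idx, test_idx) is supplied by the caller.
    Ties keep the earlier candidate in the grid.
    """
    best, best_score = None, float("-inf")
    for params in grid:
        scores = []
        for i, held_out in enumerate(folds):
            train = [r for j, fold in enumerate(folds) if j != i for r in fold]
            scores.append(score_fn(params, train, held_out))
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best, best_score = params, mean_score
    return best, best_score


# Hypothetical candidate values; the report does not list the actual grid.
grid = regular_grid(learn_rate=[0.01, 0.1], tree_depth=[3, 6], trees=[100, 500])
folds = kfold_indices(n_rows=103, k=5)


def toy_score(params, train_idx, test_idx):
    # Stand-in for "fit on train_idx, return ROC AUC on test_idx".
    return -abs(params["tree_depth"] - 3) - params["learn_rate"]


best, best_score = grid_search(grid, folds, toy_score)
```

Each candidate is scored only on rows it was not trained on, so the selected configuration reflects out-of-sample rather than in-sample discrimination.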

4.4 Model Evaluation

After training the three models with 5-fold cross-validation and hyperparameter tuning, we evaluated their performance on the test set. The primary performance metric was the area under the receiver operating characteristic curve (ROC AUC), which measures the model's ability to discriminate between patients who were readmitted and those who were not. Additionally, we calculated accuracy, positive predictive value (PPV), and negative predictive value (NPV) to provide a more comprehensive assessment of model performance. The final models were compared based on their ROC AUC values, with the best-performing model selected for further analysis and interpretation.

In logistic regression, ROC AUC drops significantly when the penalty is too high, which indicates that the model is underfitting. The drop is most pronounced with a mixture of 1, which is pure lasso regression, as it can be too aggressive in feature selection and may exclude important predictors. On the other hand, when the penalty is too low, the model may overfit the training data, resulting in lower performance on held-out data. The best performance was observed at a moderate penalty value, with a mixture value around 0.25 or 0.5, which balances ridge and lasso regression.

The performance of the random forest model is not very sensitive to the number of trees, but it tends to improve with more trees. Forests with 275 or more trees generally have higher ROC AUC values, especially at 500 trees, likely because averaging over more trees reduces the variance of the ensemble's predictions.

For XGBoost, we observed that a lower learning rate generally leads to better performance, and the optimal tree depth was around 3. This is likely because a lower learning rate makes each boosting step smaller, which helps avoid overfitting, while a tree depth of 3 provides enough complexity to capture important interactions without being too complex.
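The evaluation metrics named above can be computed from predicted probabilities and true labels as in the following sketch. It treats class 1 (readmitted) as the positive class and uses a 0.5 classification threshold; both choices are assumptions, since the report does not state which class the metric functions treated as the event. The data are toy values, not study results.

```python
def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. Pairwise comparison is quadratic, which is
    fine for a sketch but slow for large test sets.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def confusion_metrics(y_true, scores, threshold=0.5):
    """Accuracy, PPV, and NPV at a fixed threshold (class 1 assumed positive)."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        # PPV and NPV are undefined when no case is predicted in that class.
        "ppv": tp / (tp + fp) if (tp + fp) else float("nan"),
        "npv": tn / (tn + fn) if (tn + fn) else float("nan"),
    }


# Toy labels and predicted probabilities, for illustration only.
y_true = [1, 1, 0, 0, 0, 1]
scores = [0.9, 0.4, 0.5, 0.2, 0.1, 0.8]
auc = roc_auc(y_true, scores)
metrics = confusion_metrics(y_true, scores)
```

ROC AUC depends only on how the scores rank patients, whereas accuracy, PPV, and NPV change with the threshold and with which class is labeled positive, so they should be read together with the class balance.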

5. Results

5.1 Cohort / Sample Characteristics

Characteristic | N | Overall (N = 8,167)¹ | Not readmitted, 0 (N = 6,896)¹ | Readmitted, 1 (N = 1,271)¹
discharge_location | 8,167 | | |
    CHRONIC/LONG TERM ACUTE CARE | | 349 (4.3%) | 242 (3.5%) | 107 (8.4%)
    HOME | | 366 (4.5%) | 288 (4.2%) | 78 (6.1%)
    HOME HEALTH CARE | | 4,567 (56%) | 4,043 (59%) | 524 (41%)
    REHAB | | 545 (6.7%) | 423 (6.1%) | 122 (9.6%)
    SKILLED NURSING FACILITY | | 2,291 (28%) | 1,856 (27%) | 435 (34%)
    Other | | 49 (0.6%) | 44 (0.6%) | 5 (0.4%)
admission_type | 8,167 | | |
    ELECTIVE | | 1,158 (14%) | 984 (14%) | 174 (14%)
    EW EMER. | | 1,271 (16%) | 1,033 (15%) | 238 (19%)
    OBSERVATION ADMIT | | 567 (6.9%) | 474 (6.9%) | 93 (7.3%)
    SURGICAL SAME DAY ADMISSION | | 2,476 (30%) | 2,141 (31%) | 335 (26%)
    URGENT | | 2,215 (27%) | 1,877 (27%) | 338 (27%)
    Other | | 480 (5.9%) | 387 (5.6%) | 93 (7.3%)
admission_location | 8,167 | | |
    CLINIC REFERRAL | | 69 (0.8%) | 54 (0.8%) | 15 (1.2%)
    EMERGENCY ROOM | | 1,104 (14%) | 889 (13%) | 215 (17%)
    PHYSICIAN REFERRAL | | 4,421 (54%) | 3,777 (55%) | 644 (51%)
    PROCEDURE SITE | | 287 (3.5%) | 251 (3.6%) | 36 (2.8%)
    TRANSFER FROM HOSPITAL | | 2,201 (27%) | 1,856 (27%) | 345 (27%)
    Other | | 85 (1.0%) | 69 (1.0%) | 16 (1.3%)
insurance | 8,124 | | |
    Medicaid | | 756 (9.3%) | 608 (8.9%) | 148 (12%)
    Medicare | | 4,525 (56%) | 3,750 (55%) | 775 (61%)
    No charge | | 2 (<0.1%) | 2 (<0.1%) | 0 (0%)
    Other | | 175 (2.2%) | 139 (2.0%) | 36 (2.9%)
    Private | | 2,666 (33%) | 2,362 (34%) | 304 (24%)
    Unknown | | 43 | 35 | 8
language | 8,120 | | |
    Chinese | | 58 (0.7%) | 50 (0.7%) | 8 (0.6%)
    English | | 7,479 (92%) | 6,339 (93%) | 1,140 (90%)
    Portuguese | | 48 (0.6%) | 38 (0.6%) | 10 (0.8%)
    Russian | | 78 (1.0%) | 62 (0.9%) | 16 (1.3%)
    Spanish | | 246 (3.0%) | 190 (2.8%) | 56 (4.4%)
    Other | | 211 (2.6%) | 173 (2.5%) | 38 (3.0%)
    Unknown | | 47 | 44 | 3
marital_status | 7,848 | | |
    DIVORCED | | 621 (7.9%) | 505 (7.7%) | 116 (9.3%)
    MARRIED | | 4,884 (62%) | 4,178 (63%) | 706 (57%)
    SINGLE | | 1,475 (19%) | 1,211 (18%) | 264 (21%)
    WIDOWED | | 868 (11%) | 705 (11%) | 163 (13%)
    Unknown | | 319 | 297 | 22
race | 8,167 | | |
    Other | | 1,244 (15%) | 1,132 (16%) | 112 (8.8%)
    ASIAN | | 172 (2.1%) | 152 (2.2%) | 20 (1.6%)
    BLACK | | 365 (4.5%) | 278 (4.0%) | 87 (6.8%)
    HISPANIC | | 260 (3.2%) | 193 (2.8%) | 67 (5.3%)
    WHITE | | 6,126 (75%) | 5,141 (75%) | 985 (77%)
hosp_los | 8,167 | 7.0 (5.0, 10.0) | 7.0 (5.0, 10.0) | 8.0 (5.0, 13.0)
rr_interval | 8,151 | 789 (689, 909) | 789 (689, 909) | 769 (674, 895)
    Unknown | | 16 | 12 | 4
qrs_onset | 8,153 | 200 (184, 224) | 200 (184, 224) | 200 (186, 222)
    Unknown | | 14 | 10 | 4
qrs_end | 8,152 | 302 (278, 338) | 300 (278, 338) | 304 (280, 340)
    Unknown | | 15 | 11 | 4
t_end | 8,153 | 612 (570, 658) | 612 (570, 658) | 612 (570, 656)
    Unknown | | 14 | 10 | 4
qrs_axis | 8,133 | 8 (-18, 39) | 7 (-18, 38) | 8 (-20, 44)
    Unknown | | 34 | 23 | 11
t_axis | 8,127 | 39 (2, 78) | 38 (1, 76) | 43 (5, 87)
    Unknown | | 40 | 29 | 11
ecg_heart_rate | 8,151 | 76 (66, 87) | 76 (66, 87) | 78 (67, 89)
    Unknown | | 16 | 12 | 4
qrs_dur | 8,152 | 96 (86, 114) | 96 (86, 112) | 98 (88, 118)
    Unknown | | 15 | 11 | 4
qt_proxy | 8,153 | 406 (376, 438) | 406 (375, 438) | 408 (376, 438)
    Unknown | | 14 | 10 | 4
t_dur | 8,152 | 304 (274, 334) | 304 (276, 334) | 302 (270, 334)
    Unknown | | 15 | 11 | 4
gender | 8,167 | | |
    F | | 2,600 (32%) | 2,063 (30%) | 537 (42%)
    M | | 5,567 (68%) | 4,833 (70%) | 734 (58%)
age | 8,167 | 68 (60, 76) | 68 (60, 76) | 69 (60, 78)
los_icu | 8,167 | 1.6 (1.2, 3.3) | 1.5 (1.2, 3.2) | 2.2 (1.1, 4.8)
potassium | 8,156 | 4.10 (3.80, 4.40) | 4.10 (3.80, 4.40) | 4.10 (3.80, 4.50)
    Unknown | | 11 | 8 | 3
sodium | 8,151 | 138.0 (137.0, 140.0) | 139.0 (137.0, 140.0) | 138.0 (136.0, 140.0)
    Unknown | | 16 | 11 | 5
calcium_total | 7,139 | 8.60 (8.20, 9.10) | 8.60 (8.20, 9.10) | 8.60 (8.20, 9.10)
    Unknown | | 1,028 | 902 | 126
calcium_free | 7,617 | 1.15 (1.11, 1.18) | 1.15 (1.11, 1.18) | 1.14 (1.10, 1.18)
    Unknown | | 550 | 406 | 144
magnesium | 8,119 | 2.10 (1.90, 2.30) | 2.10 (1.90, 2.30) | 2.10 (1.90, 2.40)
    Unknown | | 48 | 35 | 13
creatinine | 8,155 | 0.90 (0.80, 1.20) | 0.90 (0.80, 1.10) | 1.00 (0.80, 1.30)
    Unknown | | 12 | 8 | 4
bun | 8,155 | 17 (14, 23) | 17 (14, 22) | 19 (14, 28)
    Unknown | | 12 | 8 | 4
hemoglobin | 8,149 | 11.00 (9.30, 12.90) | 11.10 (9.40, 13.00) | 10.60 (8.90, 12.30)
    Unknown | | 18 | 12 | 6
hematocrit | 8,149 | 33 (28, 38) | 33 (28, 39) | 32 (27, 37)
    Unknown | | 18 | 12 | 6
wbc | 8,149 | 9.1 (6.9, 12.7) | 9.1 (7.0, 12.8) | 9.0 (6.7, 12.4)
    Unknown | | 18 | 12 | 6
platelets | 8,149 | 183 (140, 236) | 182 (140, 234) | 188 (141, 246)
    Unknown | | 18 | 12 | 6
bicarbonate | 8,151 | 25.0 (23.0, 27.0) | 25.0 (23.0, 27.0) | 24.0 (22.0, 27.0)
    Unknown | | 16 | 11 | 5
anion_gap | 8,151 | 13.00 (11.00, 15.00) | 13.00 (11.00, 15.00) | 13.00 (11.00, 15.00)
    Unknown | | 16 | 11 | 5
lactate | 7,646 | 1.30 (1.00, 1.70) | 1.30 (1.00, 1.70) | 1.30 (1.00, 1.70)
    Unknown | | 521 | 386 | 135
inrpt | 8,094 | 1.20 (1.10, 1.40) | 1.20 (1.10, 1.40) | 1.30 (1.10, 1.50)
    Unknown | | 73 | 57 | 16
pt | 8,094 | 13.8 (12.2, 15.6) | 13.7 (12.2, 15.5) | 14.0 (12.3, 16.0)
    Unknown | | 73 | 57 | 16
glucose | 8,150 | 111 (97, 139) | 110 (97, 138) | 113 (98, 146)
    Unknown | | 17 | 12 | 5
chloride | 8,151 | 103.0 (101.0, 105.0) | 103.0 (101.0, 105.0) | 103.0 (100.0, 105.0)
    Unknown | | 16 | 11 | 5
chf | 8,167 | 2,422 (30%) | 1,955 (28%) | 467 (37%)
carit | 8,167 | 4,372 (54%) | 3,631 (53%) | 741 (58%)
valv | 8,167 | 4,149 (51%) | 3,455 (50%) | 694 (55%)
pcd | 8,167 | 855 (10%) | 671 (9.7%) | 184 (14%)
pvd | 8,167 | 1,327 (16%) | 1,065 (15%) | 262 (21%)
hypunc | 8,167 | 4,701 (58%) | 4,026 (58%) | 675 (53%)
hypc | 8,167 | 1,495 (18%) | 1,186 (17%) | 309 (24%)
para | 8,167 | 80 (1.0%) | 60 (0.9%) | 20 (1.6%)
ond | 8,167 | 313 (3.8%) | 247 (3.6%) | 66 (5.2%)
cpd | 8,167 | 1,867 (23%) | 1,506 (22%) | 361 (28%)
diabunc | 8,167 | 1,837 (22%) | 1,576 (23%) | 261 (21%)
diabc | 8,167 | 945 (12%) | 740 (11%) | 205 (16%)
hypothy | 8,167 | 995 (12%) | 796 (12%) | 199 (16%)
rf | 8,167 | 1,345 (16%) | 1,038 (15%) | 307 (24%)
ld | 8,167 | 350 (4.3%) | 272 (3.9%) | 78 (6.1%)
pud | 8,167 | 53 (0.6%) | 45 (0.7%) | 8 (0.6%)
aids | 8,167 | 11 (0.1%) | 9 (0.1%) | 2 (0.2%)
lymph | 8,167 | 77 (0.9%) | 59 (0.9%) | 18 (1.4%)
metacanc | 8,167 | 49 (0.6%) | 33 (0.5%) | 16 (1.3%)
solidtum | 8,167 | 122 (1.5%) | 110 (1.6%) | 12 (0.9%)
rheumd | 8,167 | 315 (3.9%) | 248 (3.6%) | 67 (5.3%)
coag | 8,167 | 1,274 (16%) | 1,024 (15%) | 250 (20%)
obes | 8,167 | 1,332 (16%) | 1,104 (16%) | 228 (18%)
wloss | 8,167 | 134 (1.6%) | 93 (1.3%) | 41 (3.2%)
fed | 8,167 | 1,621 (20%) | 1,272 (18%) | 349 (27%)
blane | 8,167 | 125 (1.5%) | 99 (1.4%) | 26 (2.0%)
dane | 8,167 | 185 (2.3%) | 150 (2.2%) | 35 (2.8%)
alcohol | 8,167 | 282 (3.5%) | 245 (3.6%) | 37 (2.9%)
drug | 8,167 | 213 (2.6%) | 174 (2.5%) | 39 (3.1%)
psycho | 8,167 | 59 (0.7%) | 49 (0.7%) | 10 (0.8%)
depre | 8,167 | 974 (12%) | 781 (11%) | 193 (15%)
¹ n (%); Median (Q1, Q3)

From the descriptive statistics, the cohort consisted of 8,167 admissions from 7,640 patients who underwent cardiac surgery. Of these admissions, 1,271 (15.6%) were followed by readmission within 30 days. Because the cohort is defined at the admission level, some patients contribute multiple admissions, and the descriptive summary reflects the distribution of characteristics across all admissions. The median age at admission was 68 years. Most admissions were for male patients (68%), and the most common race was White (75%). The most common admission type was surgical same-day admission (30%), and the most common discharge location was home health care (56%). The median length of stay for the index cardiac surgery admission was 7 days, with a median ICU stay of 1.6 days.

5.2 Model Performance

All three models achieved similar test-set ROC AUC values, indicating moderate predictive performance in identifying patients at risk for 30-day readmission after cardiac surgery. Logistic regression had the highest ROC AUC (0.705), followed by XGBoost (0.686) and random forest (0.662).

[1] "Final Logistic Regression Metrics:"
# A tibble: 5 × 4
  .metric  .estimator .estimate .config        
  <chr>    <chr>          <dbl> <chr>          
1 accuracy binary         0.845 pre0_mod0_post0
2 ppv      binary         0.847 pre0_mod0_post0
3 npv      binary         0.538 pre0_mod0_post0
4 roc_auc  binary         0.705 pre0_mod0_post0
5 pr_auc   binary         0.922 pre0_mod0_post0
[1] "Final Random Forest Metrics:"
# A tibble: 5 × 4
  .metric  .estimator .estimate .config        
  <chr>    <chr>          <dbl> <chr>          
1 accuracy binary         0.843 pre0_mod0_post0
2 ppv      binary         0.846 pre0_mod0_post0
3 npv      binary         0.455 pre0_mod0_post0
4 roc_auc  binary         0.662 pre0_mod0_post0
5 pr_auc   binary         0.907 pre0_mod0_post0
[1] "Final XGBoost Metrics:"
# A tibble: 5 × 4
  .metric  .estimator .estimate .config        
  <chr>    <chr>          <dbl> <chr>          
1 accuracy binary         0.845 pre0_mod0_post0
2 ppv      binary         0.845 pre0_mod0_post0
3 npv      binary         1     pre0_mod0_post0
4 roc_auc  binary         0.686 pre0_mod0_post0
5 pr_auc   binary         0.914 pre0_mod0_post0

The ROC curves for all three models are shown in the figure below, with the AUC values indicated in the legend.

5.3 Additional Analyses

The feature importance analysis for the logistic regression model revealed that certain features were associated with increased odds of readmission, while others were associated with decreased odds. The odds ratios for the top features are visualized in the figure below, with a reference line at 1.0 indicating no effect. Top features that increased the odds of readmission included White race, other race, and hospital length of stay. Conversely, male gender, private insurance, and higher hemoglobin were associated with decreased odds of readmission.

The feature importance analysis for the random forest and XGBoost models, visualized using variable importance plots, indicated that hospital and ICU length of stay were among the most important predictors of 30-day readmission risk. The variable importance plots for both models are shown below.
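An odds ratio of this kind is the exponential of the corresponding logistic regression coefficient, which is why 1.0 marks no effect. The sketch below shows the conversion using hypothetical coefficients. They are not the fitted values from this study, and because the numerical predictors were normalized, a one-unit change is assumed here to mean one standard deviation.

```python
import math


def odds_ratio(coef, unit_change=1.0):
    """Multiplicative change in the odds for a unit_change increase in a predictor."""
    return math.exp(coef * unit_change)


# Hypothetical coefficients on the log-odds scale, for illustration only.
example_coefs = {"hosp_los": 0.10, "hemoglobin": -0.20}
example_ors = {name: odds_ratio(b) for name, b in example_coefs.items()}
```

A positive coefficient gives an odds ratio above 1 (higher odds of readmission) and a negative coefficient gives one below 1, matching the direction of the reference line in the plot.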

6. Discussion

6.1 Summary of Findings

The results of this study indicate that machine learning models using structured EHR data can moderately predict 30-day readmission in high-risk cardiac-surgery patients with an EKG. The models achieved similar test-set ROC AUC values of roughly 0.66 to 0.71, suggesting that while readmission risk is complex, there is a measurable signal in the structured medical record data from EKGs and associated hospital stay details. Key predictors of readmission included hospital and ICU length of stay, certain demographic factors such as race, and laboratory results such as blood urea nitrogen.

6.2 Comparison with Previous Work

The performance of the predictive models in this study, with ROC AUC values of roughly 0.66 to 0.71, is consistent with many existing readmission prediction models in the literature, which often report moderate performance in the range of 0.60 to 0.70. This suggests that while there is a measurable signal in the structured EHR data used for modeling, predicting readmission risk remains a complex task that may require additional data sources or more advanced modeling techniques to achieve better discrimination.

6.3 Strengths and Limitations

Strengths:

  • Use of a large, real-world dataset (MIMIC-IV).
  • Inclusion of a wide range of features, including demographics, hospital stay details, comorbidities, EKG features, and laboratory results, to capture the multifaceted nature of readmission risk.
  • Moderate predictive performance, with ROC AUC values of roughly 0.66 to 0.71, which is comparable to many existing readmission prediction models in the literature.

Limitations:

  • The moderate performance of the models (ROC AUC of roughly 0.66 to 0.71) leaves significant room for improvement in predicting 30-day readmission risk and suggests that important predictors may be missing from the structured EHR data used in this study.
  • The negative predictive value (NPV) of the models is relatively low, which may limit their utility in clinical practice for identifying patients at low risk of readmission. Techniques that address class imbalance, such as SMOTE, might improve it, but they carry trade-offs: generating synthetic clinical records can introduce noise and potentially lead to overfitting.
  • The study is based on data from a single institution (MIMIC-IV), which may limit the generalizability of the findings to other settings or populations.
  • The reliance on structured EHR data excludes unstructured data such as clinical notes, which could contain valuable information about patient status and social determinants of health that is not captured in structured fields.
  • Residual confounding and bias due to unmeasured variables or inaccuracies in the EHR data could affect the validity of the predictive models and their interpretation.
  • The analysis table is at the admission level, so some patients contribute multiple admissions. The descriptive summary therefore reflects the distribution of characteristics across all admissions rather than unique patients, which could influence the interpretation of the results.

6.4 Implications

The findings of this study have important implications for clinical practice and healthcare management. The ability to moderately predict 30-day readmission risk using structured EHR data can help healthcare providers identify patients who are at higher risk for readmission after cardiac surgery. This information can be used to implement targeted interventions, such as enhanced discharge planning, post-discharge follow-up, and patient education, to improve patient outcomes and reduce preventable readmissions. Additionally, the identification of key predictors of readmission risk, such as length of stay, can inform clinical decision-making and resource allocation to better support high-risk patients during their hospital stay and after discharge. Reducing the number of readmissions can in turn improve patient outcomes, lower healthcare costs, including those borne by Medicare and Medicaid, and improve resource utilization within the healthcare system.

7. Conclusion

In conclusion, this study demonstrates that machine learning models using structured EHR data can moderately predict 30-day readmission in high-risk cardiac-surgery patients with an EKG. The models achieved test-set ROC AUC values of roughly 0.66 to 0.71, indicating that while readmission risk is complex, there is a measurable signal in the structured medical record data. Key predictors of readmission included hospital and ICU length of stay. Sociodemographic factors, especially race, also played a role in predicting readmission risk, as did laboratory results, especially blood urea nitrogen. These findings suggest that predictive modeling can be a valuable tool for identifying patients at risk for readmission and guiding targeted interventions to improve post-discharge care.

Future work to improve predictive performance will focus on integrating hospital discharge notes, physicians' EKG result notes, and a more comprehensive set of social determinants of health. Additionally, exploring advanced modeling techniques, such as deep learning or ensemble methods that combine multiple models, may further enhance predictive accuracy.

8. References

Centers for Medicare & Medicaid Services. (n.d.). Hospital Readmissions Reduction Program (HRRP). Retrieved from https://www.cms.gov/medicare/payment/prospective-payment-systems/acute-inpatient-pps/hospital-readmissions-reduction-program-hrrp

Khera, R., Dharmarajan, K., Wang, Y., Lin, Z., Bernheim, S. M., Horwitz, L. I., … Krumholz, H. M. (2020). Association of the Hospital Readmissions Reduction Program with mortality among Medicare beneficiaries hospitalized for heart failure, acute myocardial infarction, and pneumonia. JAMA Network Open, 3(12), e2020045. https://pmc.ncbi.nlm.nih.gov/articles/PMC7382395

Zuckerman, R. B., Sheingold, S. H., Orav, E. J., Ruhter, J., & Epstein, A. M. (2016). Readmissions, observation, and the Hospital Readmissions Reduction Program. New England Journal of Medicine, 374(16), 1543–1551. https://pmc.ncbi.nlm.nih.gov/articles/PMC4186890/

XGBoost Developers. (n.d.). XGBoost parameter tuning. Retrieved from https://xgboost.readthedocs.io/en/latest/tutorials/param_tuning.html

Kuhn, M., & Wickham, H. (n.d.). rand_forest: Random forest model specification (parsnip). Retrieved from https://parsnip.tidymodels.org/reference/rand_forest.html

Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32. https://zbmath.org/?q=an:07260271

(Author(s) unknown). (n.d.). [Article related to water research / IWAPublishing figure]. Retrieved from https://iwaponline.com/view-large/3703360

Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. https://arxiv.org/abs/1603.02754

9. AI usage acknowledgement

This project used AI tools (ChatGPT, Gemini, and DeepSeek), through GitHub Copilot and a web browser, for code generation, debugging, and writing assistance. The AI was used to brainstorm challenges and generate initial drafts. All outputs from the AI were reviewed and validated by the researcher to ensure accuracy and relevance to the project objectives.