Albuquerque, Carina Isabel Andrade
Rashid, Kauser Al
2024-10-30
2025-10-24
2024-10-24
http://hdl.handle.net/10362/174337
Dissertation presented as partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Data Science
Cardiovascular disease (CVD) is the leading cause of death globally and a major driver of mortality and morbidity across demographic groups. The aim of this study is to use attention-based Natural Language Processing (NLP) models to predict severe forms of CVD from unstructured clinical notes, using the discharge summaries of patients in the MIMIC-IV dataset. The study compares LSTM, BERT, ClinicalBERT and Clinical-Longformer, as well as modified versions of BERT and ClinicalBERT. It finds that attention-based models outperform traditional deep learning models in handling long and complex unstructured clinical notes, and therefore make better predictions. The best-performing model identified in this study is BERT (sliding window): it was the most accurate (Accuracy: 0.73), well balanced in its predictions (F1-Micro: 0.80) and excelled at correctly predicting specific CVD (AUC: 0.83). Although it has some limitations, this study demonstrates the predictive power of advanced attention-based models in healthcare, which could enable better disease prediction and timely interventions to reduce mortality and morbidity due to CVD.
eng
Electronic Health Records (EMRs); Clinical Notes; Natural Language Processing; Transformer-based Methods; Cardiovascular Diseases; SDG 3 - Good health and well-being
Predicting Cardiovascular Disease from Unstructured Clinical Notes: Application of Advanced Natural Language Processing on MIMIC-IV database
master thesis
203782305