Predictive Modelling of Cardiovascular Disease Survival Using Mutual Information and Machine Learning Across Varying Sample Sizes
https://doi.org/10.14419/xn4vbz19
Received date: June 27, 2025
Accepted date: August 3, 2025
Published date: August 10, 2025
Keywords: Cardiovascular Disease (CVD); Support Vector Machine (SVM); Logistic Regression (LR); Sample Size.
Abstract
In clinical data analytics, predicting survival outcomes for cardiovascular disease (CVD) is a challenging task with practical implications. Using three different datasets, this study investigates how sample size affects machine learning performance and generalizability. The methodology combines statistical sample-size analysis with mutual information gain, a filter-based, scalable, and domain-agnostic feature selection strategy, to identify clinically essential features. Mutual information gain measures the dependency between each predictor and the target variable, which keeps the selection step computationally efficient and applicable to large-scale data. Two machine learning classifiers, support vector machines (SVMs) and logistic regression (LR), are used to assess predictive performance across varying sample sizes. Experimental results show that increasing the sample size improves model accuracy by up to 10% and recall by 5–8%, while specificity remains consistent. To enhance clinical reliability, the models are also evaluated using the Area Under the Receiver Operating Characteristic Curve (AUC-ROC): SVM achieved an AUC of 0.965 and LR achieved 0.937, indicating strong discriminatory power. In addition, SHAP-based feature attribution is used to improve interpretability, showing that larger sample sizes provide more stable and clinically meaningful explanations.
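The filter-based ranking described in the abstract can be illustrated with a minimal sketch. The abstract does not state which mutual information estimator the authors used, so the plug-in estimate for discrete features below is an assumption, and the function names (`mutual_information`, `rank_features`), the feature names, and the toy data are hypothetical and not taken from the study's datasets.

```python
import math
from collections import Counter


def mutual_information(feature, target):
    """Plug-in estimate of I(X; Y) in nats for paired discrete observations."""
    n = len(feature)
    if n == 0 or n != len(target):
        raise ValueError("feature and target must be non-empty and of equal length")
    joint = Counter(zip(feature, target))  # counts of each (x, y) pair
    px = Counter(feature)                  # marginal counts of x
    py = Counter(target)                   # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), written in terms of counts
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi


def rank_features(columns, target):
    """Score every feature against the target; return (name, score) pairs, best first."""
    scores = {name: mutual_information(values, target) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: a binary outcome and three binary predictors.
target = [1, 1, 0, 0, 1, 0, 1, 0]
columns = {
    "chest_pain": [1, 1, 0, 0, 1, 0, 1, 0],  # identical to the target
    "smoker":     [1, 0, 1, 0, 1, 0, 1, 0],  # partially related to the target
    "sex":        [1, 0, 1, 0, 0, 1, 1, 0],  # statistically independent of the target
}
ranking = rank_features(columns, target)
```

Because each feature is scored on its own, without training a classifier, the cost grows linearly with the number of features and samples. Continuous predictors would need binning or a nearest-neighbour estimator, which this sketch does not cover.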
How to Cite
Sarraju, V., Pal, J., & Kamilya, S. (2025). Predictive Modelling of Cardiovascular Disease Survival Using Mutual Information and Machine Learning Across Varying Sample Sizes. International Journal of Basic and Applied Sciences, 14(4), 269–278. https://doi.org/10.14419/xn4vbz19
