Predictive Modelling of Cardiovascular Disease Survival Using Mutual Information and Machine Learning Across Varying Sample Sizes

  • Authors

    • Vijayalakshmi Sarraju, Department of Computer Science and Engineering, Birla Institute of Technology, Mesra, Lalpur, Ranchi, India
    • Jaya Pal, Department of Computer Science and Engineering, Birla Institute of Technology, Mesra, Lalpur, Ranchi, India
    • Supreeti Kamilya, Department of Computer Science and Engineering, Birla Institute of Technology, Mesra, Ranchi, India
    https://doi.org/10.14419/xn4vbz19

    Received date: June 27, 2025

    Accepted date: August 3, 2025

    Published date: August 10, 2025

  • Keywords

    Cardiovascular Disease (CVD); Support Vector Machine (SVM); Logistic Regression (LR); Sample Size.
  • Abstract

    In clinical data analytics, predicting survival outcomes for cardiovascular disease (CVD) is a challenging task with practical implications. Using three different datasets, this study investigates how sample size affects machine learning performance and generalizability. The methodology combines statistical sample-size analysis with mutual information gain, a filter-based, scalable, and domain-agnostic feature selection strategy, to identify clinically essential features. Mutual information gain measures the dependency between each predictor and the target variable while remaining computationally efficient and applicable to large-scale data. Two machine learning classifiers, support vector machines (SVMs) and logistic regression (LR), are employed to assess predictive performance across varying population sizes. Experimental results demonstrate that increasing the sample size improves model accuracy by up to 10% and recall by 5–8% while maintaining consistent specificity. Furthermore, to enhance clinical reliability, the models are evaluated using the Area Under the Receiver Operating Characteristic Curve (AUC-ROC): SVM achieved an AUC of 0.965 and LR achieved 0.937, indicating strong discriminatory power. SHAP-based feature attribution is also used to improve interpretability, showing that larger sample sizes provide more stable and clinically meaningful explanations.
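As a rough illustration of the filter-based selection step described in the abstract, the sketch below scores each predictor by its mutual information with the target, I(X; Y) = Σ p(x, y) log2[p(x, y) / (p(x) p(y))], and ranks the predictors by that score. It is not the authors' implementation: it assumes every feature is discrete (continuous clinical variables would need binning or a dedicated estimator such as scikit-learn's `mutual_info_classif`), and the names `mutual_information`, `rank_features`, `toy_features`, and `toy_target` are hypothetical, as is the toy data.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information I(X; Y) in bits for two equal-length discrete sequences."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("xs and ys must be non-empty and of equal length")
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x, y) * log2(p(x, y) / (p(x) * p(y))), expressed with raw counts
        mi += (c / n) * log2(c * n / (count_x[x] * count_y[y]))
    return mi


def rank_features(features, target):
    """Return (name, score) pairs sorted by decreasing mutual information."""
    scores = {name: mutual_information(col, target) for name, col in features.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data, not taken from the study's datasets.
toy_target = [1, 1, 0, 0, 1, 0, 1, 0]
toy_features = {
    "high_bp": [1, 1, 0, 0, 1, 0, 1, 0],  # identical to the target
    "smoker": [1, 0, 1, 0, 1, 0, 0, 1],   # statistically independent of the target
}

ranking = rank_features(toy_features, toy_target)
for name, score in ranking:
    print(f"{name}: {score:.3f} bits")
```

In this toy case the feature that mirrors the target receives the full one bit of information, while the independent feature scores zero, so a filter that keeps the top-ranked predictors would retain the first and drop the second. Because each feature is scored on its own, the cost grows linearly with the number of predictors, which is consistent with the scalability the abstract attributes to filter methods.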

  • How to Cite

    Sarraju, V., Pal, J., & Kamilya, S. (2025). Predictive Modelling of Cardiovascular Disease Survival Using Mutual Information and Machine Learning Across Varying Sample Sizes. International Journal of Basic and Applied Sciences, 14(4), 269-278. https://doi.org/10.14419/xn4vbz19