A Big Data-Driven Information System for Disease Prediction in Public Health: A Comparative Study of Machine Learning Approaches

Abdelhay HADJ KOUIDER, Benameur ZIANI, Younes GUELLOUMA

Abstract

The rapid growth of health data from electronic records, medical devices, and monitoring systems creates substantial opportunities for data-driven decision making. However, handling this volume of information remains a challenge for standard analysis techniques. In this paper, we present a comparative study of machine learning models within a Big Data framework for disease prediction in public health. We evaluated six classification techniques: Naive Bayes, Support Vector Machine (SVM), Random Forest, Gradient Boosting, XGBoost, and Multilayer Perceptron (MLP). To obtain reliable estimates, the experiments were conducted on two well-known medical datasets (UCI Heart Disease and Pima Indians Diabetes) using 10-fold stratified cross-validation. The results show that Naive Bayes performs best on the Heart Disease dataset (83.78% accuracy), while Gradient Boosting performs best on the Diabetes dataset (77.72% accuracy). These findings offer practical guidance on model selection and on the trade-offs that the choice of model involves.
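The evaluation protocol named in the abstract, 10-fold stratified cross-validation, can be illustrated with a minimal standard-library sketch. This is not the authors' implementation: the helper names (`stratified_folds`, `cross_val_accuracy`, `GaussianNaiveBayes`) are hypothetical, the Gaussian form of Naive Bayes is an assumption, and the data are synthetic, so the printed accuracy says nothing about the figures reported above.

```python
import math
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Assign sample indices to k folds so each fold keeps the class ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal indices round-robin, continuing where the previous class stopped.
        for j, idx in enumerate(indices):
            folds[(offset + j) % k].append(idx)
        offset += len(indices)
    return folds


class GaussianNaiveBayes:
    """Toy Gaussian Naive Bayes (assumed variant; the abstract does not specify one)."""

    def fit(self, X, y):
        self.stats = {}
        for label in set(y):
            rows = [x for x, t in zip(X, y) if t == label]
            columns = list(zip(*rows))
            means = [sum(col) / len(rows) for col in columns]
            # Small constant avoids division by zero for constant features.
            variances = [
                sum((v - m) ** 2 for v in col) / len(rows) + 1e-9
                for col, m in zip(columns, means)
            ]
            self.stats[label] = (math.log(len(rows) / len(y)), means, variances)
        return self

    def _log_posterior(self, x, label):
        log_prior, means, variances = self.stats[label]
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
            for v, m, var in zip(x, means, variances)
        )

    def predict(self, X):
        return [max(self.stats, key=lambda c: self._log_posterior(x, c)) for x in X]


def cross_val_accuracy(make_model, X, y, k=10, seed=0):
    """Mean accuracy over k stratified folds (assumes at least k samples)."""
    scores = []
    for test_idx in stratified_folds(y, k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        model = make_model().fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        preds = model.predict([X[i] for i in test_idx])
        scores.append(sum(p == y[i] for p, i in zip(preds, test_idx)) / len(test_idx))
    return sum(scores) / k


# Synthetic two-class data for illustration only; NOT the UCI or Pima datasets.
rng = random.Random(42)
X = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(100)]
X += [[rng.gauss(3.0, 1.0), rng.gauss(3.0, 1.0)] for _ in range(100)]
y = [0] * 100 + [1] * 100

mean_accuracy = cross_val_accuracy(GaussianNaiveBayes, X, y)
print(f"mean 10-fold accuracy: {mean_accuracy:.4f}")
```

Stratifying the folds keeps the proportion of positive and negative cases in each held-out fold close to that of the full dataset, which matters for medical data where classes are often imbalanced. Comparing several classifiers under this scheme would amount to calling `cross_val_accuracy` once per model with the same `seed`, so that every model is scored on identical splits.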
