Machine learning for diabetes diagnosis: insights from the Erbil Diabetes Dataset and algorithmic performance

Authors

  • Salar Ameen Raheem, Department of Information and Communication Technology Engineering, Erbil Polytechnic University, Erbil, Kurdistan Region, Iraq
  • Amal Taha Mawlood, Department of Computer Science, Knowledge University, Erbil, Kurdistan Region, Iraq
  • Ibrahim Ismael Hamarash, Department of Electrical Engineering, College of Engineering, Salahaddin University-Erbil, Erbil, Kurdistan Region, Iraq

DOI:

https://doi.org/10.21271/ZJPAS.37.6.7

Keywords:

Diabetes, Dataset, Machine Learning, Healthcare Systems, Diseases Diagnosis

Abstract

Machine learning has brought significant operational improvements to disease diagnosis in healthcare. Among various conditions, diabetes is particularly suited to prediction from historical and personalized data, which serve as a cornerstone of many machine learning applications. In this study, we present, for the first time, a newly developed, domain-specific diabetes research dataset called the Erbil Diabetes Dataset. The data were collected under the supervision of a medical professional at a laboratory in Erbil, Kurdistan Region of Iraq. The dataset contains twelve key features recorded for 662 people who visited the laboratory for check-ups. A standard procedure was used to preprocess the features before making them available to the diabetes research community. Five algorithms were evaluated on the dataset: Random Forest (RF), K-Nearest Neighbor (KNN), Support Vector Machine (SVM), Extreme Gradient Boosting (XGBoost), and Decision Tree (DT). Performance was measured using accuracy, recall, precision, and F1-score. The results show that KNN and XGBoost achieved outstanding predictive accuracy, at 99.25% and 98.80%, respectively. The accuracy of SVM dropped to 73.68%, which is attributed to its sensitivity to hyperparameter tuning. A statistical analysis using a one-way ANOVA test on F1-scores revealed a significant outcome (F = 558.51, p < 0.001), confirming that the differences in model performance were meaningful rather than due to random variation.
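As an illustration of the evaluation described in the abstract, the following minimal Python sketch shows how accuracy, precision, recall, and F1-score can be computed for a binary classifier, and how a one-way ANOVA F statistic can be computed over per-model F1-scores. It is not the authors' code: the function names `binary_metrics` and `one_way_anova_f`, the label vectors, and the F1 values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from statistics import mean


def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of samples.

    F is the between-group mean square divided by the
    within-group mean square.
    """
    k = len(groups)                      # number of groups (models)
    n = sum(len(g) for g in groups)      # total number of observations
    grand_mean = mean(x for g in groups for x in g)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


if __name__ == "__main__":
    # Hypothetical labels (1 = diabetic, 0 = non-diabetic), not study data.
    y_true = [1, 1, 1, 0, 0, 0, 1, 0]
    y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
    print(binary_metrics(y_true, y_pred))

    # Hypothetical per-fold F1-scores for two models, not study data.
    f1_by_model = [[0.90, 0.92, 0.94], [0.70, 0.72, 0.74]]
    print(one_way_anova_f(f1_by_model))
```

A large F value indicates that the variation in F1-score between models is large relative to the variation within each model's repeated runs; the p-value reported in the abstract would additionally require the F distribution with (k - 1, n - k) degrees of freedom, which this sketch does not compute.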

Published

2025-12-31

How to Cite

Salar Ameen Raheem, Amal Taha Mawlood, & Ibrahim Ismael Hamarash. (2025). Machine learning for diabetes diagnosis: insights from the Erbil Diabetes Dataset and algorithmic performance. Zanco Journal of Pure and Applied Sciences, 37(6), 66–77. https://doi.org/10.21271/ZJPAS.37.6.7

Section

Engineering and Computer Sciences