Exploring feature selection techniques on Classification Algorithms for
Predicting Type 2 Diabetes at Early Stage
Dublin Core
Title
Exploring feature selection techniques on Classification Algorithms for
Predicting Type 2 Diabetes at Early Stage
Subject
Type 2 diabetes, machine learning, feature selection, feature importance
Description
Predicting Type 2 diabetes (T2D) at an early stage is critical for improved care and better T2D outcomes. Accurate and efficient
T2D prediction relies on unbiased, relevant features. In this study, we searched for important features to predict T2D by integrating
ML-based models for feature selection and classification, using data from 520 individuals who were newly diagnosed with diabetes
or who will develop it. We used standard machine learning classifiers, such as logistic regression (LR), Gaussian naive Bayes (NB),
decision tree (DT), random forest (RF), support vector machine (SVM) with linear basis function, and k-nearest neighbors
(KNN). We set out to systematically explore the viability of the main feature selection methods representing each technique, such
as a statistical filter method (F-score), an entropy-based filter method (mutual information), an ensemble-based filter method
(random forest importance), and a stochastic optimization method (simultaneous perturbation feature selection and ranking (SpFSR)).
We used stratified 10-fold cross-validation and assessed performance in terms of discrimination, calibration, and
clinical utility. We attained the highest accuracy of 98% using RF with the full set of 16 features, then used RF as a
classifier wrapper to select the important features. We observed that the combination of SpFSR and RF was the best model, with a
P-value above 0.05 (P-value = 0.26), statistically attaining the same accuracy as the full feature set. The study's findings support
the efficiency and usefulness of the suggested method for choosing the most important features of diabetic data: polyuria,
gender, polydipsia, age, itching, sudden weight loss, delayed healing, and alopecia.
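As a rough illustration of two steps named in the description, the sketch below builds stratified 10-fold splits and ranks features with a statistical filter (a one-way ANOVA F-score). It is not the authors' code: the function names, the toy data, and the two features `noise` and `informative` are hypothetical, and only the Python standard library is used, whereas a real study would more likely rely on an established machine learning library.

```python
import random
from collections import defaultdict
from statistics import mean


def stratified_folds(labels, k=10, seed=0):
    """Assign every sample index to one of k folds with similar class ratios."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal indices round-robin so each fold gets a near-equal share of the class.
        for pos, idx in enumerate(indices):
            folds[(pos + offset) % k].append(idx)
        offset += len(indices)
    return folds


def f_score(values, labels):
    """One-way ANOVA F-statistic of one feature against the class labels."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)
    grand_mean = mean(values)
    n, g = len(values), len(groups)
    between = sum(len(vs) * (mean(vs) - grand_mean) ** 2 for vs in groups.values()) / (g - 1)
    within = sum((v - mean(vs)) ** 2 for vs in groups.values() for v in vs) / (n - g)
    return between / within if within > 0 else float("inf")


def rank_features(rows, labels, names):
    """Return feature names sorted from highest to lowest F-score."""
    scores = {name: f_score([row[j] for row in rows], labels) for j, name in enumerate(names)}
    return sorted(names, key=lambda name: scores[name], reverse=True)


def make_toy_data(n_pos=60, n_neg=40, seed=1):
    """Hypothetical binary data: one feature tracks the label, one is random."""
    rng = random.Random(seed)
    labels = [1] * n_pos + [0] * n_neg
    rows = []
    for label in labels:
        informative = 1 if rng.random() < (0.9 if label else 0.1) else 0
        noise = rng.randint(0, 1)
        rows.append([noise, informative])
    return rows, labels


names = ["noise", "informative"]
rows, labels = make_toy_data()
folds = stratified_folds(labels, k=10)
ranking = rank_features(rows, labels, names)
print("fold sizes:", [len(fold) for fold in folds])
print("features ranked by F-score:", ranking)
```

In a full evaluation, each fold would serve once as the test set while a classifier is trained on the other nine, and the feature ranking would be computed on the training folds only so that the held-out fold does not influence the selection.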
Creator
Mila Desi Anasanti, Khairunisa Hilyati, Annisa Novtariany
Publisher
University College London, London, United Kingdom
Date
31-10-2022
Contributor
Fajar bagus W
Format
PDF
Language
Indonesia
Type
Text
Citation
Mila Desi Anasanti, Khairunisa Hilyati, Annisa Novtariany, “Exploring feature selection techniques on Classification Algorithms for
Predicting Type 2 Diabetes at Early Stage,” Repository Horizon University Indonesia, accessed June 7, 2025, https://repository.horizon.ac.id/items/show/9262.