Critical Factor Analysis for prediction of Diabetes Mellitus using an Inclusive Feature Selection Strategy


ABSTRACT

Diabetes mellitus is a metabolic disorder with serious consequences for many parts of the human body, including the eyes, heart, kidneys, nerves, and feet. Identifying consistent features helps to assess the disease's impact on these organs and, when it is detected at an early stage, to prevent further damage. Selecting appropriate features in a data set improves accuracy, reduces storage and computational complexity, and supports sound decision-making. However, the features that are left out may still contain information that is useful for analysis. For an effective analysis, all features should therefore be studied in several plausible ways, for example by applying multiple feature selection (FS) methods with and without standardization. This article analyzes the critical factors of diabetes using univariate, wrapper, and brute-force FS techniques. To identify critical features, we applied information gain, chi-square, RFE, and correlation to the NIDDK data. The study was carried out in two phases to evaluate the efficacy of the techniques employed, and distinct machine learning models were then applied to the feature sets from both phases. Performance was assessed using accuracy, F1-score, and recall.

Introduction

Globally, diabetes is recognized as a worsening disease and is ranked as the 9th leading cause of death. According to the World Health Organization (WHO), diabetes directly caused approximately 1.5 million deaths in 2019. The International Diabetes Federation (IDF) reported in 2021 that more than 463 million people aged 20–79 years were affected by diabetes mellitus worldwide, and estimated that the number may reach an alarming 700 million by 2045. In addition, the IDF stated that India has the second-highest number of diabetic patients after China, approximately 77 million people (Statista Citation2023). Insulin regulates blood sugar levels so that the organs of the human body function normally. When the pancreas does not secrete enough insulin, the result is hyperglycemia (raised blood sugar). The uncontrolled presence of excess blood glucose over a long period is referred to as diabetes (Gromova, Fetissov, and Gruzdkov Citation2021).

Long-standing diabetes mellitus leads to serious complications such as kidney failure, blindness, heart attacks and strokes, foot ulcer infections, hearing impairment, slow wound healing, and Alzheimer's disease (Alam et al. Citation2021; Papatheodorou et al. Citation2015; Tomic, Shaw, and Magliano Citation2022; Unnikrishnan, Anjana, and Mohan Citation2016). Glucose levels rise for various reasons, including stress, poor dietary habits, lack of sleep, genetic mutations, and improper functioning of the pancreas (CDC Citation2023).

A study by ICMR and MDRF reveals that 25 million people are in the prediabetes stage. In India, approximately 10% of the population, or 74 million people, were reported to have confirmed diabetes in 2021. According to the IDF Diabetes Atlas 2021, global diabetes prevalence exceeded half a billion confirmed cases in 2021, as shown in Figure 1 (IDF Diabetes Atlas Citation2023). Figure 1 displays an exponential trend that reflects the enormous growth in the number of people affected by diabetes over the past two decades. Hayden E. Klein reported that the numbers are projected to double by 2050 (Diabetes Prevalence Expected to Double Globally by 2050 Citation2023). To address this major concern effectively, it is imperative to identify and analyze the crucial features of diabetes. Controlling diabetes therefore requires both sufficient awareness and the identification and monitoring of the crucial factors that cause it, which helps people keep their organs free from its impact.

Figure 1. Diabetes mellitus impact statistics according to IDF report (source: (IDF Diabetes Atlas Citation2023)).

In simple terms, preventing and managing diabetes helps to reduce its complications and improve people's quality of life. Affected patients need an effective analysis to keep diabetes under control and to receive better treatment. Analyzing the critical features of diabetes using multi-criteria feature selection on both standardized and non-standardized data helps to address this issue. Feature selection has a wide range of applications in decision making across distinct fields such as e-commerce, business, government services, graph theory, metal industries, medical applications, disease-oriented prediction and treatment, gene selection, and microarray data. As shown in Table 1, the feature selection process involves a series of steps: search direction, search strategy, evaluation strategy, stopping criteria, and validation (Sahu, Dehuri, and Jagadev Citation2018). For effective feature selection, features are generally classified as redundant, noisy, weakly relevant, strongly relevant, or irrelevant (Yu and Liu Citation2004), that is, according to their noise, relevancy, redundancy, and inconsistency properties. The right feature selection helps the model run more efficiently, reduces computational and storage costs, and speeds up processing (Gutkin, Shamir, and Dror Citation2009). Measures such as distance-based, correlation-based (Buyrukoğlu and Akbaş Citation2022), consistency, and threshold methods are used for feature analysis (Aha et al. Citation1991; Kira and Rendell Citation1992). To ensure safety and reliability and to accelerate the battery development cycle, Zi Cheng Fei et al. (Fei et al. Citation2021) proposed early lifetime prediction for batteries by combining a Machine Learning (ML) model with a wrapper feature selection approach. They manually crafted 42 features from the raw charge-discharge data of the first 100 cycles. Numerical experiments and paired t-tests were conducted to statistically evaluate the performance of the proposed framework. The support vector machine (SVM) model combined with wrapper feature selection gave the best result for battery lifetime prediction. In industrial informatics, privacy preservation is a crucial issue. To address it, Tao Zhang et al. (Zhang et al. Citation2020) introduced a correlation reduction scheme with differentially private feature selection, motivated by the observation that correlated data in ML tasks can lead to greater privacy leakage. Experiments showed that the proposed scheme obtained better prediction results on ML tasks and lower mean squared errors for data queries than existing schemes.

Table 1. The core emphasizing relative concepts required for feature selection exploration.


Kushan De Silva et al. (De Silva, Jönsson, and Demmer Citation2020) demonstrated the value of combining feature selection with ML to identify predictors that could enhance prediabetes prediction and clinical decision-making. The authors analyzed a sample of 6346 men and women enrolled in the National Health and Nutrition Examination Survey 2013–2014. Four machine learning algorithms were applied to 46 exposure variables in the original training dataset and in resampled training datasets built using four resampling methods. A range of predictors of prediabetes was identified, and the prediabetes prevalence was 23.43%. Shivani Jain et al. (Jain and Saha Citation2022) developed a rank-based univariate parameter selection strategy using ML classifiers to detect code smells in large and sophisticated software. Mutual information (MI), Fisher score, and univariate ROC-AUC feature selection techniques were used with brute-force and random forest correlation strategies.

The authors compared and analyzed the classifiers' performance with and without feature selection. Rung Ching Chen et al. (Chen et al. Citation2020) introduced a new feature selection strategy for data classification based on ML methods. The authors experimented with three popular datasets: bank marketing, the car evaluation database, and human activity recognition using smartphones. Tadist et al. (Tadist et al. Citation2019) highlight four main reasons why feature selection is essential for reducing the complexity of genomic data analysis and obtaining useful information quickly. To overcome the lower performance of filter methods, Yosef Masoudi et al. (Masoudi-Sobhanzadeh, Motieghader, and Masoudi-Nejad Citation2019) developed the Featureselect software application, which uses ten optimization algorithms along with filter methods and three learners. The application was tested on the Carcinoma, Drive, Air, Drug, Social, and Energy data sets. Of all the methods, World Competitive Contests (WCC), League Championship Algorithm (LCA), Forest Optimization Algorithm (FOA), and Learning Automata (LA) performed well, and the results showed that wrapper methods outperform filter methods. Yue Liu et al. (Liu et al. Citation2020) introduced a feature selection weighted score model based on domain experts for identifying target properties in material manufacturing. The Data-driven Multi-Layer Feature Selection (DMLFS) method was implemented with seven material experts for feature consideration, min-max normalization, and ML models for identifying suitable material descriptors. The authors tested the methodology on ten material-related data sets and showed that it identifies the targeted material properties effectively with good accuracy.

Early prediction of diabetes is imperative for those affected by it. Analyzing the critical factors of diabetes helps affected patients keep the disease under control, receive better treatment, maintain their health, and avoid complications. Achieving this requires a thorough study of the critical features of diabetes. This study is important because it analyzes the impact of the global rise in diabetes mellitus and underlines the need to work toward diagnosing, controlling, and preventing it. We therefore assessed the vital features of DM in two phases, with and without standardization. Feature selection and balancing are the key strategies in this analysis, and machine learning algorithms are applied at the end of both phases.

The study pursued the following objectives, and its results are reported in the corresponding sections of the article.

To identify the common root causes of diabetes mellitus and its implications for other parts of the body in affected people, and to explore diabetes statistics from 2000 to 2021 to show the scale of its global impact over the years.
To identify the critical factors of diabetes on standardized and non-standardized data using multi-criteria feature selection methods.
To compute and analyze the consistency and critical behavior of the feature selection methods on both forms of the data by analyzing the feature sets obtained in each phase with different classification models.
To compare and analyze the results of the ML procedures applied to the diabetes data using distinct metrics such as accuracy, precision, recall, F1-score, sensitivity, and specificity.

The article presents an outline in the following manner: In Section 2, we have discussed the various authors’ works on diabetes mellitus analysis and feature selection concepts, including their limitations. In Section 3, we have talked about the system architecture, feature selection methods applied, the data set, and the coding organization process as well. We discussed the experimental results in Section 4. In subsequent sections 5 and 6, we described the conclusion and future work, respectively.

Literature Survey

The preliminary study identified that diabetes is primarily caused by risk factors such as lifestyle, hereditary factors, psychosocial elements, demographic features, family history, ethnicity, and certain medical conditions. This section presents a comprehensive review of diabetes analysis strategies published from 2019 to 2023. We identified gaps and sought to understand the depth of knowledge on diabetes mellitus by reviewing pertinent works from this period. The works were studied, analyzed, and discussed with respect to two primary requirements: feature selection methods and critical analysis of diabetes. Subhash and Goel (Gupta and Goel Citation2023) presented an ML model for predicting diabetes in patients. Their study compared different classification algorithms and improved their performance through preprocessing and hyperparameter tuning. The RF model achieved an F1 score of 75.68% and an accuracy of 88.61% on the PIMA dataset. Jahan and Hoque (Kakoly, Hoque, and Hasan Citation2023) proposed a two-fold parameter selection approach based on PCA and IG for predicting diabetes risk factors using ML algorithms (DT, RF, SVM, KNN, LR). The primary data were collected in accordance with the Helsinki Declaration, 2013, and 738 records were included in the final analysis. The study achieved an accuracy of over 82.2%, with an AUC of 87.2%. Saxena et al. (Saxena et al. Citation2022) carried out a comparative study of classifiers and feature selection methods for accurate prediction of diabetes. The study used four classifiers (DT, KNN, RF, MLP) and three FS techniques (IG, PCA, SU) on the PIMA Indians diabetes dataset. The RF classifier achieved the best accuracy of 79.8%. Gupta et al. (Gupta et al. Citation2022) implemented an ML model for early and more accurate diagnosis of diabetes. The hybrid model used NSGA-II and ensemble learning. The dataset comprised 23 features and 1288 patient instances. The NSGA-II-XGB approach obtained the best result, with an accuracy of 98.86%. Ayse Dogru et al. (Doğru, Buyrukoğlu, and Arı Citation2023) introduced a super-ensemble learning model for the early detection of diabetes mellitus. They developed the ensemble model using Grid Search for hyperparameter tuning and Chi-square for feature selection. The authors experimented with four base learners (LR, DT, RF, and gradient boosting) and an SVM meta-learner on three distinct data sets (diabetes risk prediction, PIMA, and 130 US hospital diabetes datasets). The proposed model diagnosed diabetes with high accuracy. Sabitha and Devi (Sabitha and Durgadevi Citation2022) introduced a framework that combined SMOTE augmentation, RFE-based FS, and preprocessing approaches to predict diabetes accurately. They experimented on the PIMA data set using the SVC, KNN, RF, LR, and NB methods. The proposed method using RFE with RF regression FS achieved accuracy scores of 81.25%, 81.16%, and 82.5% for RF, DT, and SVC, respectively.

To detect diabetes early, Mehmet et al. (Gürsoy and Alkan Citation2022) investigated diabetes data collected from the Diabetes Specialization laboratories of Medical City Hospital and Al Kindy Training Hospital. The collected data were classified into three classes: normal, pre-diabetes, and diabetes. The implementation used DL methods, namely Long Short-Term Memory (LSTM), Convolutional Neural Network (CNN), and Gated Recurrent Unit (GRU), with permutation feature importance for diabetes data analysis and diagnosis. Their model demonstrated that the results could give medical practitioners a predictive tool for efficient decision-making and early diagnosis of the condition. Tiwari et al. (Tiwari and Singh Citation2021) proposed a methodology that predicted diabetes using attribute selection and classification methods. The technique combined RFE, RF, and the Apriori methodology to identify the significant diabetes variables. The XGBoost approach obtained the best accuracy. Amit and Chinmay (Kishor and Chakraborty Citation2021) implemented an early and precise diabetes diagnosis approach using the correlation methodology and five ML classifiers (SVM, KNN, RF, LR, NB). They demonstrated that the RF classifier achieved the highest scores for accuracy, sensitivity, specificity, and AUC (97.81, 99.32, 98.86, and 99.35, respectively). Nagaraj et al. (Nagaraj et al. Citation2021) developed a system to classify diabetes-type diagnoses by considering multiple patient features. Artificial flora-based FS and gradient-boosted tree (GBT) classification were applied after data format conversion and transformation. Their methodology obtained better scores in terms of accuracy, recall, precision, and F-score. Oyebisi et al. (Oladimeji, Oladimeji, and Oladimeji Citation2021) employed ranking-based feature selection techniques and ML algorithms (KNN, J48, NB, and RF) to detect diabetes at an early stage. Their model was found to be more effective than earlier research. Their research highlights the value of ML in healthcare and its potential applications for diagnosing and treating illnesses. Anuradha et al. (Madurapperumage, Wang, and Michael Citation2021) presented a thorough review of research on developing precise models for anticipating complications of diabetes using ML approaches. The review was intended to aid model developers, researchers, and physicians in further diabetes research.

Bharath and Udaya Kumar (Chowdary and Kumar Citation2021) proposed a Neuro-Fuzzy model that used feature extraction to improve diabetes classification. The model was tested on the PIMA diabetic dataset and outperformed existing ML models. Jayroop et al. (Ramesh, Aburukba, and Sagahyroon Citation2021) established the remote-based monitoring framework system for effective diabetes control employing handheld, intelligent accessories, and individual medical devices. Using an SVM the authors developed a diabetes risk forecasting model. The suggested method achieved greater flexibility and vendor connectivity while enabling educated decisions based on recent diabetes risk projections and lifestyle insights. Authors Juneja et al. (Juneja et al., Citation2021) described the use of ML approaches and a multi-criteria decision-making framework for diabetes analysis. The Pima diabetes data were used in the research to investigate several predictive algorithms. The results show that supervised learning algorithms performed better than unsupervised ones and that the maintenance of an active lifestyle could be supported by a diabetes early diagnosis.

Jiaqi Hou et al. (Hou et al. Citation2020) introduced a method to predict the prevalence of diabetes from physical examination parameters. The parameter selection methods RFE and F-score were employed, and DT, RF, LR, SVM, and MLP were then applied to the selected features for diabetes prediction. The system achieved the highest accuracy with the combination of RF and F-score. The hybrid classification model developed by Mishra et al. (Mishra et al. Citation2020) used an adaptive enhanced genetic algorithm with MLP for diagnosing diabetes. Their work obtained an accuracy of approximately 97.8% with an execution time of 1.12 seconds using an attribute optimization technique. The model could assist medical experts in determining risk factors for type 2 diabetes. A Mendelian randomization study that identified risk factors for type 2 diabetes was also presented. The study found evidence of causal relationships with 34 exposures, including systolic blood pressure, depression, sleeplessness, and smoking.

Sneha and Gangil (Sneha and Gangil Citation2019) used ML for the early detection of diabetes by selecting significant features from the dataset. The DT and RF algorithms achieved high specificity, and NB had the best accuracy. The research also aimed to improve classification accuracy by selecting optimal features. Choubey et al. (Choubey et al. Citation2020) developed a two-phase system for efficient diagnosis of diabetes. Phase 1 involved dataset collection and analysis with KNN, LR, and DT methods. Phase 2 involved applying PCA and PSO algorithms to the data. The approach was more efficient in terms of computation time and accuracy and had potential for the early diagnosis of other diseases. Using adult demographic datasets, Diwan (Alalwan Citation2019) formulated a technique for identifying type 2 diabetes with Self-Organizing Map (SOM) and RF algorithms. The SOM method outperformed the other algorithms in terms of accuracy. The author recommended combining the system with diabetes detection equipment for quicker diagnosis. Simon Fong et al. (Li and Fong Citation2019) constructed a Coefficient of Variation (CV) parameter selection method to improve the accuracy of diabetes prediction. On the Pima diabetes dataset, the CV technique performed better than conventional FS methods because it excluded attributes with low data dispersion. The method falls into the category of filter-based schemes, which take little time for analysis but do not guarantee greater accuracy.

Sheik and Selva (Sheik Abdullah and Selvakumar Citation2019) devised an approach based on Fisher's linear discriminant analysis (LDA) to identify risk factors for type II diabetes with the help of DT and PSO. The technique showed increased precision in detecting risk variables and could be applied to various chronic diseases. The system recognized the strong relationship among the MBG-PPG-A1c-FPG (Mean Blood Glucose – Postprandial Plasma Glucose – Glycosylated Hemoglobin – Fasting Plasma Glucose) factors. Together, these works illustrate the potential benefits of feature selection algorithms and provide insight into the current state of the art in this field. They were helpful for expanding the scope of research on feature selection algorithms. In addition, an improved taxonomy for emerging developments in multidisciplinary fields yielded further benefits.

The aforementioned works (Table 2) were implemented with a limited number of selected features and only one feature selection method, which cannot be regarded as a sound approach. In addition, none of the works discussed the consistency of feature selection before and after standardization. Replacing missing data with mean or median values, removing the corresponding records or columns, and other ways of handling missing data may introduce bias into the results. After extensively analyzing the literature on diabetes mellitus, we determined that identifying and ranking the critical factors is crucial to understanding their importance. Critical factor impact analysis is essential for handling data effectively both before and after standardization. This article aims to improve the critical factor analysis of diabetes by analyzing the data both before and after standardization, in order to address key limitations such as inconsistencies in the quality and quantity of available data, which hinder comparison of results across studies that lack standardization.

Table 2. The relevant works of literature concerning diabetes were studied.


Working Methodology

System Architecture

To explore critical factor analysis in detail, we introduce a methodology called the Inclusive Feature Selection Strategy (IFSS) to investigate and analyze the diabetes mellitus data, as represented in Figure 2. The architecture estimates the critical factors of diabetes mellitus with standardized and non-standardized processes using feature selection methods. It consists of the following major components: the University of California Irvine (UCI) repository data set, cleaning and reduction components, a balancing component, standardized and non-standardized modules, model analysis and evaluation on each obtained feature set individually, and comparison of the feature selection methods based on classification results. The proposed system is organized into four phases.

Figure 2. The proposed inclusive feature selection strategy (IFSS) working methodology.

Phase 1 is the preliminary phase, which involves data collection, preparation, and determination of the basic characteristics of the data using dispersion measures. Phase 2 is the feature selection exploration phase. Phases 3 and 4 are the model analysis phase and the comparison phase, respectively. Data collection, cleaning, reduction, and balancing of the label data were implemented in Phase 1, whose key operations include dealing with missing data, applying normalization, and balancing. Preprocessing yields coherent data on which models can be run without unnecessary complexity. It should be treated as more important than the choice of a good model when dealing with unstructured, ambiguous, missing, or error-filled data. The basic statistical characteristics of the data were explored using dispersion measures such as the interquartile range (IQR), standard deviation, skewness, and kurtosis. The results are shown in Figure 3. The data set does not contain explicitly missing values, and in the null positions that were identified we did not substitute values using missing-data handling techniques such as the mean, median, nearest value, or most frequent value, because filling in values with these techniques could distort the actual data and lead to falsified or biased results. Phase 2 is divided into two modules, standardized and non-standardized. The feature selection methods (RFE, information gain, correlation, chi-square) are applied in both modules, but the standardized module uses the balanced data, whereas the non-standardized module is run without balancing. A sampling method (Bach, Werner, and Palt Citation2019; Zhu et al. Citation2020) was used to balance the dataset. In Phase 3, the KNN model (Ali, Neagu, and Trundle Citation2019; Guo et al. Citation2003; Mucherino, Papajorgji, and Pardalos Citation2009) was applied and analyzed individually on the feature sets obtained from the standardized and non-standardized modules of Phase 2. Finally, in Phase 4, the classification results were compared using accuracy, precision, recall, and F1-score.

Figure 3. Descriptive statistics of diabetes.
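
The dispersion measures named for Phase 1 (IQR, standard deviation, skewness, and kurtosis) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function name `dispersion_summary`, the use of the population standard deviation, the inclusive quartile method, and the sample glucose values are all assumptions made for the example.

```python
import statistics


def dispersion_summary(values):
    """Return IQR, standard deviation, skewness and excess kurtosis of a numeric list."""
    n = len(values)
    mean = statistics.fmean(values)
    # Population standard deviation (divide by n); an assumption for this sketch.
    std = statistics.pstdev(values)
    # With n=4, quantiles() returns the three quartile cut points Q1, Q2, Q3.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    # Standardized third and fourth central moments.
    skewness = sum((v - mean) ** 3 for v in values) / (n * std ** 3)
    kurtosis = sum((v - mean) ** 4 for v in values) / (n * std ** 4) - 3.0
    return {"iqr": q3 - q1, "std": std, "skewness": skewness, "kurtosis": kurtosis}


# Hypothetical glucose readings, for illustration only.
glucose = [85, 90, 110, 115, 140, 150, 180, 195]
summary = dispersion_summary(glucose)
print(summary)
```

A skewness near zero suggests a roughly symmetric feature, while a large positive or negative value signals a long tail that standardization alone does not remove.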

Implemented FS Methods

To understand the concept of consistency in the feature selection methods, we worked with four methods by using the dataset found in the NIDDK data repository. The applied methods are Information Gain (IG) (Di Franco Citation2019), chi-square (Baker and Cousins Citation1984; Ottenbacher Citation1995), RFE (Bahl et al. Citation2019; Huerta et al. Citation2013), and correlation (Freund, Wilson, and Mohr Citation2010) strategies. To show the consistency of feature selection methods and to explore diabetes critical factors characteristics the methodology was implemented by employing the feature selection methods and ML techniques over the non-standardized and standardized data.

Information Gain

It is necessary to know how much information the features carry about the classes. Information gain quantifies how much knowing a feature reduces the uncertainty (surprise) about the target variable. For feature selection, IG (Di Franco Citation2019) evaluates each variable by its gain. Mutual information estimates this gain between two variables.

Information gain can be computed using the following formula.

$$\text{Information Gain} = \text{Entropy}(\text{parent}) - \text{Average}\big(\text{Entropy}(\text{child})\big) \tag{1}$$

Information gain measures the reduction in entropy obtained by splitting a dataset on the value of a given variable. Entropy indicates the uncertainty or impurity present in the data set. The behavior of IG is summarized in Table 3.

$$\text{Entropy} = -\sum_{i=1}^{n} \text{probability}(\text{class}_{i}) \, \log_{2}\big(\text{probability}(\text{class}_{i})\big) \tag{2}$$

Table 3. Representation of information gain behavior based on value.


where n represents the number of distinct class values in Equation (2).
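
Equations (1) and (2) can be illustrated with a short Python sketch. It assumes that the "Average" in Equation (1) is weighted by the size of each child partition, which is the usual convention but is not stated explicitly in the text; the function names and the example labels are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels, as in Equation (2)."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the child partitions."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical labels (1 = diabetic, 0 = non-diabetic) split on an illustrative threshold.
parent = [1, 1, 1, 1, 0, 0, 0, 0]
left, right = [1, 1, 1, 0], [1, 0, 0, 0]
gain = information_gain(parent, [left, right])
print(round(gain, 4))
```

A split that separates the classes perfectly yields a gain equal to the parent entropy, while a split that leaves the class proportions unchanged yields a gain of zero.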

Chi-Square

Chi-square (Baker and Cousins Citation1984; Ottenbacher Citation1995) is a simple method of feature selection for classification tasks. Feature selection aims to select the features on which the target depends most strongly. The chi-square test checks whether two variables are related or independent of one another. In simple terms, it determines the relationship between the dependent and independent variables.

If two features are dependent, the observed values are far from the expected values and the chi-square value is high. If they are independent, the observed values are close to the expected values and the chi-square value is low. A feature with a high score depends on the target and can be selected for model training; a feature with a low score is independent of the target and is not selected. The interpretation of chi-square values is summarized in Table 4.

$$\chi^{2} = \sum_{i=1}^{n} \frac{(OV_{i} - EV_{i})^{2}}{EV_{i}} \tag{3}$$

Table 4. Representation of chi square behavior based on value.


where OV and EV represent observed and expected values and n indicates the number of instances.
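
As a worked illustration of Equation (3), the sketch below computes the statistic for a small contingency table. The helper names and the 2x2 counts are hypothetical, and the expected values are derived under the independence assumption (row total times column total divided by the grand total), which is the standard construction rather than something spelled out in the text.

```python
def expected_counts(table):
    """Expected cell counts under independence for a 2-D contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[rt * ct / grand_total for ct in col_totals] for rt in row_totals]


def chi_square(observed, expected):
    """Sum of (OV - EV)^2 / EV over all cells, as in Equation (3)."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts: rows = feature high / low, columns = diabetic / non-diabetic.
table = [[30, 10], [20, 40]]
expected = expected_counts(table)
observed_flat = [v for row in table for v in row]
expected_flat = [v for row in expected for v in row]
stat = chi_square(observed_flat, expected_flat)
print(round(stat, 4))
```

When the rows of the table have identical class proportions, the observed and expected counts coincide and the statistic is zero, which matches the "independent, low score" case described above.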

Pearson Correlation Coefficient

The concept of correlation was introduced by Francis Galton and developed by Karl Pearson to establish the relationship between variables in data sets. Correlation values always lie between −1 and +1, inclusive. The meaning of the relationship between variables is determined by the coefficient value, as represented in Table 5.

Table 5. Representation of coefficient strength features.


The Pearson correlation (Pearson’s Correlation Coefficient Citation2008; Ratner Citation2009) measures the strength of the linear relationship between two features. Its value ranges from −1 to +1: −1 means a perfect negative correlation, 0 indicates no correlation, and +1 means a perfect positive correlation.

$$\mathrm{Reg\_coef}(\rho) = \frac{\sum (x_{i} - \bar{x})(y_{i} - \bar{y})}{\sqrt{\sum (x_{i} - \bar{x})^{2} \sum (y_{i} - \bar{y})^{2}}} \tag{4}$$
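Equation (4) translates almost line for line into code. The sketch below uses only the Python standard library; the glucose readings and class labels are made-up values for illustration, and `pearson_correlation` is a hypothetical helper name.

```python
import math


def pearson_correlation(x, y):
    """Pearson coefficient of two equally long numeric sequences, as in Equation (4)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Numerator: sum of products of deviations from the means.
    numerator = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # Denominator: square root of the product of the summed squared deviations.
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    denominator = math.sqrt(ss_x * ss_y)
    if denominator == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return numerator / denominator


# Hypothetical values: a glucose-like feature against a binary class label.
glucose = [85, 90, 120, 150, 170]
outcome = [0, 0, 1, 1, 1]
rho = pearson_correlation(glucose, outcome)
print(round(rho, 3))
```

A coefficient close to +1 or −1 would rank the feature as strongly related to the class variable, while a value near 0 would rank it low.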

Exhaustive Feature Selection Based RFE

RFE follows the wrapper mechanism, in which a separate learner is used to select the features. In general, the choice of learner is left to the developer. As applied here, RFE (Bahl et al. 2019; Huerta et al. 2013) works on the brute-force evaluation concept: it tries all possible feature combinations and then returns the best subset. This method is expensive and needs more time than forward selection or backward elimination, because it considers every possible combination of features.
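The brute-force evaluation described above can be sketched as an enumeration of every non-empty feature subset. Note that recursive feature elimination as commonly implemented removes features greedily rather than enumerating all subsets, so this sketch illustrates only the exhaustive variant. The scoring function and relevance values are hypothetical stand-ins for training the chosen learner and measuring its validation score.

```python
from itertools import combinations


def exhaustive_feature_search(feature_names, score_subset):
    """Score every non-empty subset of features and return the best one."""
    best_subset, best_score = None, float("-inf")
    evaluated = 0
    for size in range(1, len(feature_names) + 1):
        for subset in combinations(feature_names, size):
            evaluated += 1
            score = score_subset(subset)
            if score > best_score:
                best_subset, best_score = subset, score
    return best_subset, best_score, evaluated


# Hypothetical relevance values; a real wrapper would train and validate a learner.
TOY_RELEVANCE = {"PGC": 0.30, "BMI": 0.20, "age": 0.15, "DBP": 0.01}


def toy_score(subset):
    """Reward relevant features and charge a small penalty per feature used."""
    return sum(TOY_RELEVANCE[name] for name in subset) - 0.05 * len(subset)


features = list(TOY_RELEVANCE)
best_subset, best_score, evaluated = exhaustive_feature_search(features, toy_score)
print(best_subset, evaluated)
```

For m features the loop evaluates 2^m − 1 subsets (15 in this toy case, 255 for eight predictors), which is why the cost grows so quickly compared with forward selection or backward elimination.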

Description of the Data Set Used for Analysis

The data set contains diagnostic measurements and other pertinent information, shown in Table 6, which help us predict whether a person has the disease. The data set used is Pima (Kaggle 2022), which contains 768 records and 9 features. The target feature has two classes: diabetic or not diabetic. The Pima Indians data set was developed by the NIDDK and made available by Vincent Sigillito of Johns Hopkins University. The features are plasma glucose concentration (PGC), Body Mass Index (BMI), Diabetic Pedigree Function (DPF), Diastolic Blood Pressure (DBP), age, 2-hour serum insulin, Triceps Skinfold Thickness (TSFT), Number of Times Pregnant (NTP), and the class variable.

Table 6. Data set description.

Coding Design Explanation

To estimate the critical factors of diabetes and to observe the consistency of the feature selection methods, the coding was carried out in two phases. Phase 1 applies the four feature selection methods to the unbalanced data. Phase 2 applies the same feature selection methods to the standardized data. The ML models were then applied to the features selected in both phases, and the performance of the feature selection methods was examined based on the results obtained from the models.
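The standardization step that separates the two phases can be sketched as follows. The article does not state which scaler was used, so z-score scaling (subtract the mean, divide by the population standard deviation) is an assumption here, and the sample values are hypothetical.

```python
import math


def standardize(values):
    """Rescale a feature column to zero mean and unit (population) variance."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance)
    if std == 0:
        # A constant column carries no spread; map it to zeros.
        return [0.0] * len(values)
    return [(v - mean) / std for v in values]


# Hypothetical diastolic blood pressure readings.
dbp = [50.0, 60.0, 70.0, 80.0]
dbp_standardized = standardize(dbp)
print([round(v, 3) for v in dbp_standardized])
```

In phase 1 the feature selection methods would see the raw column, and in phase 2 they would see the rescaled one.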

Results Discussion

The critical factors for diabetes mellitus were estimated, with and without standardization, using info gain, chi-square, RFE, and the correlation coefficient. Data frames were constructed from the features selected by each method before and after standardization, based on the feature sets shown in Table 7. The following models were applied to the generated frames: AdaBoost, RF, DT, GPC, SVM, and SGD. Accuracy, recall, and F1 score were used to evaluate the implemented models.

Table 7. Representation of features considered for data frames construction for phase 1 and phase 2.

The summary shown above quickly conveys the general characteristics of the diabetes data. It presents the dispersion of the diabetes features, including measures of central tendency such as the mean, together with the standard deviation and IQR.

Table 8 presents the scores computed by the MI, chi-square, RFE, and correlation coefficient methods. These scores help in selecting the optimal features for analysis.

Table 8. Feature selection methods computed coefficient scores.

FS Methods Based Obtained Model Results with Respect to Prior and Post Standardization

Based on the results shown in Table 9, we created Figure 4 to visually represent our findings. The figure illustrates the results of the ML models before and after standardization for Mutual Information (MI), Correlation, Chi-square, and Recursive Feature Elimination (RFE). Panels 4.a and 4.e show the results of the implemented models for mutual information, panels 4.b and 4.f for correlation, panels 4.c and 4.g for chi-square, and panels 4.d and 4.h for RFE, in each case before and after standardization, respectively.

Figure 4. Panels a–d and e–h show the results obtained by the different ML models with respect to MI, correlation, chi-square, and RFE before and after standardization, respectively.

Table 9. The obtained classification results before and after standardization.
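The three metrics reported in this section can be derived from the counts of true and false positives and negatives. The sketch below assumes the binary, positive-class definitions of recall and F1; the article does not say whether its scores are averaged across classes, and the label vectors are hypothetical.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, recall, and F1 score for a binary classification result."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    accuracy = (tp + tn) / len(pairs)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "recall": recall, "f1": f1}


# Hypothetical labels: 1 = diabetic, 0 = not diabetic.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)  # accuracy, recall, and F1 are all 0.75 for these labels
```

Recall is the metric that penalizes missed diabetic cases, which is why it is reported alongside accuracy on an unbalanced data set.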

Mutual Information

Based on Figures 4.a and 4.e, as well as Table 9, most of the models achieve an accuracy of 74%, with the exception of SGD. SGD has the lowest accuracy among all listed models: 73% before standardization and, surprisingly, only 71% after it. Nonetheless, SGD achieved a recall of 80% and an F1 score of 70% after standardization. Before standardization, only two models, RF and GNB, stood out in terms of recall and F1 score; after standardization, four models (RF, GNB, KNN, and SGD) showed improved results of 83%, 83%, 82%, and 80%, respectively.

Correlation Results

Based on Figures 4.b and 4.f and Table 9, both the RF and GNB models have the highest accuracy of 0.74, recall of 0.83, and F1 score of 0.73 after standardization. The KNN model has the highest recall (0.63) and F1 score (0.63) before standardization, but after standardization its recall drops to 0.49 and its F1 score to 0.42. The SGD model has the lowest performance among all listed models after standardization, with accuracy, recall, and F1 score of 0.72, 0.81, and 0.70, respectively. After standardization, all models except SGD achieve an accuracy of 74%.

Chi Square

Figures 4.c and 4.g and Table 9 show that, with chi-square-based selection, all the models performed better, with an accuracy above 0.74. Notably, the RF, GNB, and KNN models obtained good recall (0.83) and F1 scores (0.73) after standardization. The DT model has the lowest performance among all listed models before standardization, with an accuracy of 0.62, a recall of 0.57, and an F1 score of 0.57.

RFE Results

Based on the RFE results in Figures 4.d and 4.h and Table 9, a few models, namely AdaBoost, GNB, RF, KNN, and SVM, performed better before standardization, in terms of accuracy only. The RF and GNB models yield the highest recall of 0.83 and F1 score of 0.73 after standardization. The SGD model has the lowest accuracy, recall, and F1 score among all listed models, both before and after standardization.

Overall, the major characteristics identified for the models are as follows.

The RF, GNB, and AdaBoost models demonstrated strong performance both before and after standardization.
The results of the ML models based on the correlation, MI, and chi-square feature selection methods generally improved significantly after standardization, whereas the RFE-based results did not show the same level of improvement.
Both the SGD and GPC models yielded unsatisfactory results in both scenarios, i.e., before and after standardization.
After standardization, the same accuracy was obtained by all models for MI, correlation, RFE, and chi-square.

SMOTE Based Obtained Results

Figure 5 displays the accuracy, recall, and F1 score of the different machine learning models after applying SMOTE. The RF model has the highest accuracy (0.83), recall (0.82), and F1 score (0.81). The DT model performed moderately, with an accuracy, recall, and F1 score of 0.70 each. The GPC model, with an accuracy of 0.78, recall of 0.78, and F1 score of 0.77, outperforms DT but falls short of RF. The AdaBoost model has an accuracy, recall, and F1 score of 0.76 each; it performs better than both DT and SVM, but not as well as RF or GPC. The GNB model's metrics are close to those of AdaBoost, with an accuracy of 0.73, recall of 0.72, and F1 score of 0.72. The KNN model, with an accuracy of 0.76, recall of 0.76, and F1 score of 0.75, is comparable to AdaBoost. Finally, the SGD model has the lowest performance among all listed models, with an accuracy of only 0.53.

Figure 5. The obtained SMOTE based classification results with standardization.
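The core idea behind SMOTE is to create synthetic minority-class samples by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below is a simplified illustration of that idea, not the configuration used in the study: the neighbour count `k`, the seed, and the sample points are all hypothetical.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority samples to the chosen base point (excluding itself).
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Pick a random position on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical minority-class samples with two features each.
minority = [(1.0, 2.0), (2.0, 3.0), (3.0, 3.5), (4.0, 5.0)]
synthetic = smote_like_oversample(minority, n_new=4)
print(len(synthetic))  # 4
```

Because every synthetic point lies on a segment between two real minority samples, the new points stay inside the region already occupied by the minority class.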

Comparison Results

Table 10 shows the accuracy of various machine learning models applied by different authors. The proposed model has the highest accuracy, 83%, among all the models listed. Saxena et al. (2022) use KNN, RF, DT, and Multilayer Perceptron (MLP) algorithms, with RF reaching the best accuracy, 79.80%. Sabitha and Durgadevi (2022) use LR, RF, DT, SVC, GNB, and KNN algorithms, with LR reaching an accuracy of 80%. Kumari, Kumar, and Mittal (2021) use a soft voting classifier (SVC), LR, NB, RF, XGB, and DT algorithms, with SVC reaching an accuracy of 79%. Joshi and Dhakal (2021) employ the LR algorithm and obtain an accuracy of 78.26%. Finally, Chatrati et al. (2022) obtain an accuracy of 72% using the LDA algorithm.

Table 10. The comparative results with other state-of-the-art methods.

The proposed IFSS methodology has evolved through various activities such as sampling, assessing statistical characteristics, utilizing FS methods, ML models, and evaluating metrics to perform critical factor analysis for diabetes mellitus. The critical factor analysis revealed that the determined primary influential factors are NTP, PGC, BMI, and age. In addition, the recognized secondary features are TSFT, insulin, and DPF. Further, DBP was identified as the least influential feature overall.

Time Complexity Analysis

The time complexity of the PCC algorithm is $O(n)$, where n is the number of elements in the input. The time complexity of the chi-square and mutual information algorithms is $O(n^{2})$. The time complexity of the RFE algorithm is $O(k \cdot n^{2})$, where k is the number of features to select and n is the number of samples. Figure 6 depicts the computation to make the complexity analysis easier to follow.

Figure 6. The computational process of feature selection methods with dependent variable and independent variables for coefficients determination.

The overall complexity is the sum of the correlation, MI, chi-square, and RFE complexities:

$$O(n) + O(n^{2}) + O(n^{2}) + O(k \cdot n^{2}) = O(k \cdot n^{2})$$

Therefore, the overall time complexity of calculating the coefficients with the different methods is $O(k \cdot n^{2})$, where k is the number of features to select and n is the number of samples.
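The simplification to $O(k \cdot n^{2})$ can be checked term by term. The bound below is a sketch that treats each stated complexity as an exact operation count, which is an assumption made only for illustration; it holds for $k \ge 1$ and $n \ge 1$.

$$n + n^{2} + n^{2} + k\,n^{2} = n + (k + 2)\,n^{2} \le (k + 3)\,n^{2} \le 4k\,n^{2}$$

For example, with the $n = 768$ records of the Pima data and a hypothetical $k = 4$, the four terms are 768, 589,824, 589,824, and 2,359,296, so the RFE term alone accounts for about two thirds of the total.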

Conclusion

Despite substantial progress in the treatment and prevention of diabetes mellitus, the disease continues to be a major public health concern worldwide. This article experimented with various mechanisms, including sampling, feature selection, different ML models, and evaluation metrics, on the Pima data before and after standardization. We applied several feature selection techniques to estimate the critical factors of diabetes mellitus on both standardized and non-standardized data. A small, crucial set of features facilitates easy interpretation of the model, yields better prediction, reduces training time and computational cost, and mitigates the risk of overfitting. To isolate the critical factors linked to diabetes mellitus, we used info gain, chi-square, Recursive Feature Elimination (RFE), and the correlation coefficient. The RF, GNB, and AdaBoost models demonstrated strong performance both before and after standardization. Info gain, chi-square, and the correlation coefficient all improved greatly in their ability to pinpoint influential factors after standardization. The RFE approach, by contrast, did not show the same level of improvement, with performance remaining largely constant before and after standardization. Both the SGD and GPC models yielded unsatisfactory results in both scenarios. These results highlight the need for suitable feature selection methods and for considering standardization in order to enhance the precision of critical factor identification in diabetes mellitus. The study’s findings may help researchers and healthcare providers create more effective techniques for managing and preventing diabetes mellitus, which may in turn lead to better public health outcomes.

Future Work

This work points to several promising directions for further diabetes mellitus research. To uncover further performance improvements across all feature selection methods, future work could investigate alternative data transformation strategies or fine-tune the existing standardization procedures. Identifying the critical factors of diabetes mellitus may also benefit from future research into Recursive Feature Addition (RFA) or evolutionary algorithms. In addition, the findings need to be validated in clinical settings: the clinical relevance and applicability of the identified critical factors may be assessed in future studies through collaboration with healthcare practitioners and specialists. This verification will help close the gap between theoretical considerations and practical applications, and should yield better results when using ML in diabetes research. Subsequently, work can be carried out with real-time data to develop an automatic recommendation system that helps people with diabetes decide on their diet.

Author Contributions

The first author carried out the methodology, data curation, experiments, and draft preparation. In addition to providing supervision, the coauthor also helped with conceptualization, validation, review, and editing.

Availability of Data And Materials

Upon a reasonable request, the sources and other pertinent information will be provided.

Consent for Publication

The manuscript does not contain any individual person’s data in any form.

Ethics Approval And Consent To Participate

This manuscript does not contain any studies with human participants or animals performed by the author.

Disclosure Statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

Funding for this research was granted by the Vellore Institute of Technology, Vellore, Tamil Nadu, India.

References

10 Surprising Things That Can Spike Your Blood Sugar | CDC. Accessed May 24, 2023. [Online]. Available: https://www.cdc.gov/diabetes/library/spotlights/blood-sugar.html
7th edition | IDF Diabetes Atlas. Accessed Dec 11, 2023. [Online]. Available: https://diabetesatlas.org/atlas/seventh-edition/
Aha, D. W., D. Kibler, M. K. Albert, and J. R. Quinian. Jan, 1991. Instance-based learning algorithms. Machine Learning 6 (1):37–31. doi:10.1007/BF00153759.
Alalwan, S. A. D. Apr, 2019. Diabetic analytics: Proposed conceptual data mining approaches in type 2 diabetes dataset. Indonesian Journal of Electrical Engineering and Computer Science 14 (1):88–95. doi:10.11591/IJEECS.V14.I1.PP88-95.
Alam, S., M. K. Hasan, S. Neaz, N. Hussain, M. F. Hossain, and T. Rahman. Apr, 2021. Diabetes Mellitus: Insights from Epidemiology, Biochemistry, Risk Factors, Diagnosis, Complications and Comprehensive Management. Diabetology 2021 2(2):36–50. doi:10.3390/DIABETOLOGY2020004.
Ali, N., D. Neagu, and P. Trundle. Dec, 2019. Evaluation of k-nearest neighbour classifier performance for heterogeneous data sets. SN Applied Sciences 1(12):1–15. doi:10.1007/s42452-019-1356-9.
Alyasiri, O. M., Y. N. Cheah, A. K. Abasi, and O. M. Al-Janabi. 2022. Wrapper and hybrid feature selection methods using metaheuristic algorithms for English Text Classification: A systematic review. IEEE Access 10:39833–52. doi:10.1109/ACCESS.2022.3165814.
Ang, J. C., A. Mirzal, H. Haron, and H. N. A. Hamed. Sep, 2016. Supervised, unsupervised, and semi-supervised feature selection: A review on gene selection. IEEE/ACM Transactions on Computational Biology & Bioinformatics / IEEE, ACM 13 (5):971–89. doi:10.1109/TCBB.2015.2478454.
Bach, M., A. Werner, and M. Palt. 2019. The proposal of undersampling method for learning from imbalanced datasets. Procedia Computer Science 159 (Jan):125–34. doi:10.1016/J.PROCS.2019.09.167.
Bahl, A., Hellack, B., Balas, M., Dinischiotu, A., Wiemann, M., Brinkmann, J., Luch, A., Renard, B. Y., Haase, A. Mar, 2019. Recursive feature elimination in random forest classification supports nanomaterial grouping. NanoImpact 15:100179. doi: 10.1016/J.IMPACT.2019.100179.
Baker, S., and R. D. Cousins. Apr 1984. Clarification of the use of CHI-square and likelihood functions in fits to histograms. Nuclear Instruments and Methods in Physics Research 221 (2):437–42. doi:10.1016/0167-5087(84)90016-4.
Bommert, A., X. Sun, B. Bischl, J. Rahnenführer, and M. Lang. 2020. Benchmark for filter methods for feature selection in high-dimensional classification data. Computational Statistics & Data Analysis 143 (Mar):106839. doi:10.1016/J.CSDA.2019.106839.
Browne, M. W. Mar 2000. Cross-validation methods. Journal of Mathematical Psychology 44 (1):108–32. doi:10.1006/JMPS.1999.1279.
Buyrukoğlu, S., and A. Akbaş. Apr 2022. Machine learning based early prediction of type 2 diabetes: A new hybrid feature selection approach using correlation matrix with heatmap and SFS. Balkan Journal of Electrical and Computer Engineering 10 (2):110–17. doi:10.17694/BAJECE.973129.
Chang, V., J. Bailey, Q. A. Xu, and Z. Sun. Aug, 2023. Pima Indians diabetes mellitus classification based on machine learning (ML) algorithms. Neural Computing & Applications 35(22):16157–73. doi: 10.1007/s00521-022-07049-z.
Chart: Where Diabetes Burdens Are Rising | Statista. Accessed May 24, 2023. [Online]. Available: https://www.statista.com/chart/23491/share-of-adults-with-diabetes-world-region/
Chatrati, S. P., G. Hossain, A. Goyal, A. Bhan, S. Bhattacharya, D. Gaurav, and S. M. Tiwari. Mar, 2022. Smart home health monitoring system for predicting type 2 diabetes and hypertension. Journal of King Saud University - Computer and Information Sciences 34 (3):862–70. doi:10.1016/J.JKSUCI.2020.01.010.
Chen, R. C., C. Dewi, S. W. Huang, and R. E. Caraka. Dec, 2020. Selecting critical features for data classification based on machine learning methods. Journal of Big Data 7(1):1–26. doi:10.1186/s40537-020-00327-4.
Choubey, D. K., P. Kumar, S. Tripathi, and S. Kumar. Dec, 2020. Performance evaluation of classification methods with PCA and PSO for diabetes. Network Modeling Analysis in Health Informatics and Bioinformatics 9(1):1–30. doi:10.1007/s13721-019-0210-8.
Chowdary, P. B. K., and R. U. Kumar. 2021. Diabetes Classification using an Expert Neuro-fuzzy Feature Extraction Model. International Journal of Advanced Computer Science and Applications 12 (8):368–74. doi:10.14569/IJACSA.2021.0120842.
Dalianis, H. 2018. Evaluation Metrics and Evaluation. Clinical Text Mining 45–53. doi:10.1007/978-3-319-78503-5_6.
De Silva, K., D. Jönsson, and R. T. Demmer. Mar, 2020. A combined strategy of feature selection and machine learning to identify predictors of prediabetes. Journal of the American Medical Informatics Association: JAMIA 27 (3):396–406. doi:10.1093/JAMIA/OCZ204.
Diabetes Prevalence Expected to Double Globally by 2050. 2023. Accessed Dec 13, 2023. [Online]. Available: https://www.ajmc.com/view/diabetes-prevalence-expected-to-double-globally-by-2050
Di Franco, A. 2019. Information-gain computation in the fifth system. International Journal of Approximate Reasoning 105 (Feb):386–95. doi:10.1016/J.IJAR.2018.11.013.
Doğru, A., S. Buyrukoğlu, and M. Arı. Mar, 2023. A hybrid super ensemble learning model for the early-stage prediction of diabetes risk. Medical & Biological Engineering & Computing 61(3):785–97. doi:10.1007/s11517-022-02749-z.
Fei, Z., F. Yang, K. L. Tsui, L. Li, and Z. Zhang. 2021. Early prediction of battery lifetime via a machine learning based framework. Energy 225 (Jun):120205. doi:10.1016/J.ENERGY.2021.120205.
Freund, R. J., W. J. Wilson, and D. L. Mohr. 2010. Nonparametric methods. Statistical Methods 689–719. doi:10.1016/B978-0-12-374970-3.00014-7.
Gromova, L. V., S. O. Fetissov, and A. A. Gruzdkov. Jul, 2021. Mechanisms of glucose absorption in the small intestine in health and metabolic diseases and their role in appetite regulation. Nutrients 13(7). doi:10.3390/NU13072474.
Guo, G., H. Wang, D. Bell, Y. Bi, and K. Greer. 2003. KNN model-based approach in classification. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 2888:986–96. doi:10.1007/978-3-540-39964-3_62/COVER/.
Gupta, S. C., and N. Goel. 2023. Predictive modeling and analytics for diabetes using hyperparameter tuned machine learning techniques. Procedia Computer Science 218:1257–69. doi:10.1016/J.PROCS.2023.01.104.
Gupta, A., I. S. Rajput, Gunjan, V. Jain, and S. Chaurasia. Sep, 2022. NSGA-II-XGB: Meta-heuristic feature selection with XGBoost framework for diabetes prediction. Concurrency & Computation: Practice & Experience 34 (21):e7123. doi:10.1002/CPE.7123.
Gürsoy, M. İ., and A. Alkan. Dec, 2022. Investigation of diabetes data with permutation feature importance based deep learning methods. Karadeniz Fen Bilimleri Dergisi 12 (2):916–30. doi:10.31466/KFBD.1174591.
Gutkin, M., R. Shamir, and G. Dror. Jul, 2009. SlimPLS: A method for feature selection in gene expression-based disease classification. PLoS One 4(7):e6416. doi:10.1371/JOURNAL.PONE.0006416.
Hou, J., Y. Sang, Y. Liu, and L. Lu. Oct, 2020. Feature selection and prediction model for type 2 diabetes in the Chinese population with machine learning. ACM International Conference Proceeding Series. doi:10.1145/3424978.3425085.
Hsu, H. H., C. W. Hsieh, and M. Da Lu. Jul, 2011. Hybrid feature selection by combining filters and wrappers. Expert Systems with Applications 38 (7):8144–50. doi:10.1016/J.ESWA.2010.12.156.
Huerta, E. B., R. M. Caporal, M. A. Arjona, and J. C. H. Hernández. 2013. Recursive feature elimination based on linear discriminant analysis for molecular selection and classification of diseases. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 7996:244–51. doi:10.1007/978-3-642-39482-9_28/COVER/.
Hu, M., and F. Wu. 2010. Filter-wrapper hybrid method on feature selection. Proceedings - 2010 2nd WRI Global Congress on Intelligent Systems, GCIS 2010, 3:98–101. doi:10.1109/GCIS.2010.235.
Jain, S., and A. Saha. Mar, 2022. Rank-based univariate feature selection methods on machine learning classifiers for code smell detection. Evolutionary Intelligence 15(1):609–38. doi:10.1007/s12065-020-00536-z.
Joshi, R. D., and C. K. Dhakal. Jul, 2021. Predicting type 2 diabetes using logistic regression and machine learning approaches. International Journal of Environmental Research and Public Health 18(14). doi:10.3390/IJERPH18147346.
Juneja, A., S. Juneja, S. Kaur, and V. Kumar. 2021. Predicting Diabetes Mellitus With Machine Learning Techniques Using Multi-Criteria Decision Making. International Journal of Information Retrieval Research 11 (2):38–52. doi:10.4018/IJIRR.2021040103.
Kakoly, I. J., M. R. Hoque, and N. Hasan. 2023. Data-driven diabetes risk factor prediction using machine learning algorithms with feature selection technique. Sustainability 15 (6):4930. doi:10.3390/SU15064930.
Kira, K., and L. A. Rendell. 1992. A practical approach to feature selection. Machine Learning Proceedings 1992 (Jan):249–56. doi:10.1016/B978-1-55860-247-2.50037-1.
Kishor, A., and C. Chakraborty. Jun, 2021. Early and accurate prediction of diabetics based on FCBF feature selection and SMOTE. International Journal of Systems Assurance Engineering and Management 1–9. doi:10.1007/s13198-021-01174-z.
Kulkarni, A., D. Chong, and F. A. Batarseh. Jan, 2020. Foundations of data imbalance and solutions for a data democracy. Data Democracy: At the Nexus of Artificial Intelligence, Software Development, and Knowledge Engineering 83–106. doi:10.1016/B978-0-12-818366-3.00005-8.
Kumari, S., D. Kumar, and M. Mittal. 2021. An ensemble approach for classification and prediction of diabetes mellitus using soft voting classifier. International Journal of Cognitive Computing in Engineering 2 (Jun):40–46. doi:10.1016/J.IJCCE.2021.01.001.
Li, T., and S. Fong. Nov, 2019. A fast feature selection method based on coefficient of variation for diabetics prediction using machine learning. International Journal of Extreme Automation and Connectivity in Healthcare (IJEACH) 1(1):55–65. doi:10.4018/IJEACH.2019010106.
Liu, Y., J. M. Wu, M. Avdeev, and S. Q. Shi. Feb, 2020. Multi-layer feature selection incorporating weighted score-based expert knowledge toward modeling materials with targeted properties. Advanced Theory and Simulations 3(2):1900215. doi:10.1002/ADTS.201900215.
Madurapperumage, A., W. Y. C. Wang, and M. Michael. 2021. A systematic review on extracting predictors for forecasting complications of diabetes mellitus. In ACM International Conference Proceeding Series, May, 327–30. doi:10.1145/3472813.3473211.
Masoudi-Sobhanzadeh, Y., H. Motieghader, and A. Masoudi-Nejad. Apr, 2019. FeatureSelect: A software for feature selection based on machine learning approaches. BMC Bioinformatics 20(1):1–17. doi:10.1186/s12859-019-2754-0.
Mishra, S., H. K. Tripathy, P. K. Mallick, A. K. Bhoi, and P. Barsocchi. 2020. EAGA-MLP—an enhanced and adaptive hybrid classification Model for diabetes diagnosis. Sensors 20 (14):4036. doi:10.3390/S20144036.
Mucherino, A., P. J. Papajorgji, and P. M. Pardalos. 2009. Nearest neighbor classification. 83–106. doi:10.1007/978-0-387-88615-2_4.
Nagaraj, P., P. Deepalakshmi, R. F. Mansour, and A. Almazroa. 2021. Artificial flora algorithm-based feature selection with gradient boosted tree Model for diabetes classification. Diabetes, Metabolic Syndrome and Obesity: Targets and Therapy 14:2789–806. doi:10.2147/DMSO.S312787.
Oladimeji, O. O., A. Oladimeji, and O. Oladimeji. May, 2021. Classification models for likelihood prediction of diabetes at early stage using feature selection. Applied Computing & Informatics ahead-of-print. doi:10.1108/ACI-01-2021-0022.
Ottenbacher, K. J. Jul, 1995. The chi-square test: Its use in rehabilitation research. Archives of Physical Medicine and Rehabilitation 76 (7):678–81. doi:10.1016/S0003-9993(95)80639-3.
Papatheodorou, K., M. Banach, M. Edmonds, N. Papanas, and D. Papazoglou. 2015. Complications of diabetes. Journal of Diabetes Research 2015:1–5. doi:10.1155/2015/189525.
Pearson’s Correlation Coefficient. 2008. Encyclopedia of Public Health. 1090–91. doi:10.1007/978-1-4020-5614-7_2569.
Pima Indians Diabetes Database | Kaggle. Accessed Jun 23, 2022. [Online]. Available: https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database
Pirgazi, J., M. Alimoradi, T. Esmaeili Abharian, and M. H. Olyaee. Dec, 2019. An efficient hybrid filter-wrapper metaheuristic-based gene selection method for high dimensional datasets. Scientific Reports 2019 9 (1):1–15. doi:10.1038/s41598-019-54987-1.
Ramesh, J., R. Aburukba, and A. Sagahyroon. Jun, 2021. A remote healthcare monitoring framework for diabetes prediction using machine learning. Healthcare Technology Letters 8 (3):45–57. doi:10.1049/HTL2.12010.
Ratner, B. Jun, 2009. The correlation coefficient: Its values range between 1/1, or do they. Journal of Targeting Measurement & Analysis for Marketing 17 (2):139–42. doi:10.1057/jt.2009.5.
Sabitha, E., and M. Durgadevi. 2022. Improving the diabetes diagnosis prediction rate using data preprocessing, data augmentation and recursive feature elimination method. International Journal of Advanced Computer Science and Applications 13 (9). doi:10.14569/IJACSA.2022.01309107.
Sahu, B., S. Dehuri, and A. Jagadev. Aug, 2018. A study on the relevance of feature selection methods in microarray data. The Open Bioinformatics Journal 11 (1):117–39. doi:10.2174/1875036201811010117.
Saxena, R., S. K. Sharma, M. Gupta, and G. C. Sampada. 2022. A novel approach for feature selection and classification of diabetes mellitus: Machine learning methods. Computational Intelligence and Neuroscience 2022:1–11. doi:10.1155/2022/3820360.
Sheik Abdullah, A., and S. Selvakumar. Oct, 2019. Assessment of the risk factors for type II diabetes using an improved combination of particle swarm optimization and decision trees by evaluation with Fisher’s linear discriminant analysis. Soft Computing 23 (20):9995–10017. doi:10.1007/s00500-018-3555-5.
Sneha, N., and T. Gangil. 2019. Analysis of diabetes mellitus for early prediction using optimal features selection. Journal of Big Data. doi:10.1186/s40537-019-0175-6.
Tadist, K., S. Najah, N. S. Nikolov, F. Mrabti, and A. Zahi. Dec, 2019. Feature selection methods and genomic big data: A systematic review. Journal of Big Data 6 (1):1–24. doi:10.1186/s40537-019-0241-0.
Tiwari, P., and V. Singh. Jan, 2021. Diabetes disease prediction using significant attribute selection and classification approach. Journal of Physics Conference Series 1714 (1):012013. doi:10.1088/1742-6596/1714/1/012013.
Tomic, D., J. E. Shaw, and D. J. Magliano. Sep, 2022. The burden and risks of emerging complications of diabetes mellitus. Nature Reviews Endocrinology 18 (9):525–39. doi:10.1038/S41574-022-00690-7.
Unnikrishnan, R., R. M. Anjana, and V. Mohan. 2016. Diabetes mellitus and its complications in India. Nature Reviews Endocrinology 12 (6):357–70. doi:10.1038/nrendo.2016.53.
Venkatesh, B., and J. Anuradha. 2019. A review of feature selection and its methods. Cybernetics and Information Technologies 19 (1):3–26. doi:10.2478/CAIT-2019-0001.
Yu, L., and H. Liu. Oct, 2004. Efficient feature selection via analysis of relevance and redundancy. Journal of Machine Learning Research 5:1205–24.
Zhang, T., T. Zhu, P. Xiong, H. Huo, Z. Tari, and W. Zhou. Mar, 2020. Correlated differential privacy: Feature selection in machine learning. IEEE Transactions on Industrial Informatics 16 (3):2115–24. doi:10.1109/TII.2019.2936825.
Zhu, H., G. Liu, M. Zhou, Y. Xie, and Q. Kang. Jan, 2020. A noisy-sample-removed under-sampling scheme for imbalanced classification of public datasets. IFAC-Papersonline 53 (5):624–29. doi:10.1016/J.IFACOL.2021.04.202.

