Explainable data mining model for hyperinsulinemia diagnostics

ABSTRACT

In our research, we present a data mining model for the early diagnosis of hyperinsulinemia, potentially reducing the risk of diabetes, heart disease, and other chronic conditions. The dataset, gathered from 2019 to 2022 by Serbia's Healthcare Center through an observational cross-sectional study, includes 1008 adolescents. Medical datasets are often highly imbalanced and may contain irrelevant features that hinder predictive performance. To address these challenges in the medical data analysis, we propose a model employing Functional Principal Component Analysis (FPCA), which also accounts for outliers that could otherwise lead to the inclusion of irrelevant features. Unlike standard Principal Component Analysis (PCA), which is sensitive to the initial positions of cluster centers influencing the final outcome, our model integrates FPCA with K-Means clustering to improve the preprocessing stage. Additionally, we have incorporated the post-hoc explanatory method SHAP (SHapley Additive exPlanations) alongside algorithms such as Random Forest, XGBoost, and LightGBM to provide deeper insights into our model, identifying the most contributory features for the development of hyperinsulinemia. Experimental results showed that combining FPCA with K-Means clustering enhances the accuracy of the XGBoost classifier, with this model achieving an accuracy score of 0.99.

1. Introduction

Hyperinsulinemia represents a state of pre-type 2 diabetes mellitus and is characterised by significantly elevated insulin levels in the blood. As this pathological entity can persist for years without pronounced symptoms, proper identification and verification of its presence, along with potential risk factors, during early adolescence can hold exceptional significance in preventing various conditions stemming directly from such a state. Hyperinsulinemia is a disorder that can emerge at any age, including adolescence. It is imperative during these formative years to be vigilant about the potential onset of this disorder and to take proactive steps to prevent it. The adolescent phase is characterised by significant hormonal fluctuations, which can adversely affect pancreatic functionality and insulin regulation in the bloodstream. Moreover, this is a time when unhealthy lifestyle habits are prone to take root, including suboptimal dietary patterns and a lack of physical activity, both of which can substantially heighten the risk of various metabolic syndromes such as hyperinsulinemia. Adolescents at higher risk for hyperinsulinemia often include those with a family history of type 2 diabetes, elevated body mass, insulin resistance, inadequate physical activity, and unhealthy dietary habits (IDF Diabetes Atlas, Citation2021; Thomas et al., Citation2019).

Figure 1. Hyperinsulinemia process.

According to the International Diabetes Federation (IDF), hyperinsulinemia is diagnosed if insulin values are greater than 15 µU/ml at 0 min and greater than 75 µU/ml at 120 min of the oral glucose tolerance test (OGTT), or if the cumulative insulin value exceeds 300 µU/ml (Sun et al., Citation2022). As the prevalence of hyperinsulinemia during this life period accelerates globally and in our region, the results of this research can hold considerable scientific and practical importance for pediatricians (Ryder et al., Citation2020). They can aid in planning preventive and timely corrective measures to avert the onset of this pathological entity and the development of potential complications, primarily type 2 diabetes and cardiovascular diseases, extending into adulthood (Andes et al., Citation2020; Horta & de Lima, Citation2019).
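The diagnostic thresholds quoted above can be expressed as a small rule. The following is a minimal sketch only: the function and parameter names are hypothetical, and reading the criteria as "(0 min AND 120 min) OR cumulative" is an assumption about how the sentence groups its conditions.

```python
def meets_hyperinsulinemia_criteria(insulin_0min, insulin_120min, insulin_cumulative):
    """Apply the thresholds quoted in the text (all values in µU/ml).

    Assumption: the 0 min and 120 min conditions must hold together,
    while the cumulative condition is sufficient on its own.
    """
    ogtt_rule = insulin_0min > 15 and insulin_120min > 75
    cumulative_rule = insulin_cumulative > 300
    return ogtt_rule or cumulative_rule


# Hypothetical readings, used only to exercise the rule.
print(meets_hyperinsulinemia_criteria(16, 80, 200))
print(meets_hyperinsulinemia_criteria(10, 80, 200))
```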

Enhancing the diagnosis of hyperinsulinemia necessitates the adoption of innovative techniques. Data mining, with its capacity to uncover vital yet hidden patterns within vast databases, holds the promise of transforming the landscape of medical diagnostics (Savić et al., Citation2023; Vrbančič et al., Citation2022). Numerous data mining techniques and algorithms have been tailored to distil knowledge from medical databases for disease diagnosis (Chen et al., Citation2023; Sun et al., Citation2022). PCA, a simple yet potent non-parametric method, provides a pathway to extract pertinent insights from intricate datasets (Thenappan et al., Citation2022). In scenarios requiring the categorisation of vast datasets into user-defined clusters, the K-Means algorithm (Edeh et al., Citation2022) facilitates this by minimising the squared error function. However, susceptibility to outliers and elevated time complexity hinder its efficacy. Therefore, PCA plays a pivotal role in reducing dataset dimensions while conserving paramount information, thereby refining cluster centroids for improved accuracy. Importantly, the effectiveness of K-Means clustering is rooted in its capacity to group similar entities within a dataset. Any clusters that significantly deviate from the norm, indicating outliers, are identified as anomalies and then removed. To address these challenges, Functional-based Principal Component Analysis (FPCA) (Gecili et al., Citation2021) emerges as advantageous. FPCA identifies and eliminates irrelevant features, a key factor for unbiased outcomes and expedited training. Remarkably, FPCA's strength lies in minimal data loss despite dimension reduction, resulting in enhanced classification accuracy and computational efficiency compared to classical PCA (Pan et al., Citation2023). Within this landscape, the integration of machine learning algorithms like Random Forest, XGBoost, and LightGBM stands as the logical next step. 
Enriched by the insights from FPCA and K-Means clustering, these algorithms adeptly navigate the complexities of hyperinsulinemia diagnostics. Finally, for a deeper understanding and augmented transparency of the proposed model, the SHAP method is employed. It serves to elucidate the outcomes of intricate ensemble models such as Random Forest, XGBoost, and LightGBM, thus providing enriched insights. This multidimensional approach not only refines accuracy but also transforms the medical diagnosis landscape by harnessing the latent power of advanced technologies. To highlight the contribution of this study, the following main research objective with two sub-questions is posed:

RO: To what extent can unsupervised techniques such as PCA and FPCA improve K-Means clustering and provide a foundation for the input values of Random Forest, XGBoost, and LightGBM, contributing to higher classifier accuracy?

RQ1: Among supervised models such as Random Forest, XGBoost, and LightGBM, which performs better compared to the baseline model, in this case, Logistic Regression, in terms of Recall, Precision, Accuracy, F1 score, and MCC?

RQ2: Using the SHAP method, what are the most informative features that the best-performing model among the three supervised models - Random Forest, XGBoost, and LightGBM - considers when predicting hyperinsulinemia?

As the prevalence of hyperinsulinemia in adolescence rapidly increases, and the significance and widespread application of a combination of unsupervised techniques and supervised models such as Random Forest, XGBoost, and LightGBM become evident, this study's findings hold exceptional scientific and practical value for pediatricians in crafting strategies for preventive and timely corrective measures. These strategies aim to prevent the onset of this pathophysiological entity and the development of potential complications, primarily type 2 diabetes mellitus and cardiovascular diseases, in later adult life.

The results of this research can significantly contribute to a better understanding of the risk factors that influence the occurrence of hyperinsulinemia with elevated glycemia in adolescents, especially those with excessive body weight, poor dietary habits, insufficient physical activity, positive family history, and psychoactive substance use. Timely identification of adolescents at risk for hyperinsulinemia using the latest artificial intelligence tools should become a daily and universally accessible practice because it is of great importance for their future and the health of society as a whole. Finally, the proposed model can be applied in detecting and identifying risk factors for the development of other chronic non-infectious diseases.

The rest of the paper is organised as follows: Section 2 gives an overview of current research on improving medical diagnostics and on different health-care predictions using statistical methodologies and ML algorithms. Section 3 describes the proposed data mining model and ML algorithms, along with the post-hoc, model-agnostic SHAP method that will be employed. Section 4 presents the obtained results. Section 5 discusses the results. The concluding remarks and future directions are given in Section 6.

2. Related work

The domains of medical diagnostics and comprehensive prevention are experiencing swift and dynamic growth, driven by the application of advanced technologies that function as the foundational predictive models (Hassan et al., Citation2021; Rasha et al., Citation2023; Zhou et al., Citation2023). Exploring hyperinsulinemia holds significant importance due to its intricate connection with various health aspects beyond just type 2 diabetes (Halloun et al., Citation2022; Koch et al., Citation2021). Hyperinsulinemia, marked by an overproduction of insulin, is not merely a harbinger of type 2 diabetes; it is also implicated in a spectrum of other medical issues including obesity, metabolic syndrome, cardiovascular disease, and some forms of cancer. A deeper comprehension of hyperinsulinemia can shed light on the incipient phases of metabolic disorders, offering a window for early interventions that could halt the escalation into more debilitating diseases such as diabetes (Calcaterra et al., Citation2022). Given the prevailing emphasis in prior literature on predictive model development for type 2 diabetes mellitus rather than hyperinsulinemia, our objective is to provide an overview and conduct a comparative analysis of the outcomes of those studies. This will be achieved through an exploration of the distinct techniques and models employed in these studies.

In (Bansal & Singhrova, Citation2022), the authors used a diabetes dataset, with PCA for dimensionality reduction and the K-Means clustering technique to feed a voting classifier that employs soft voting, achieving 98.70% accuracy. Moreover, using the well-known, publicly available PIMA Indian diabetic dataset, the study (Khairunnisa et al., Citation2022) showed through a 5-fold cross-validation-based evaluation that each proposed procedure enhances the K-Nearest Neighbors (KNN) algorithm. Additionally, they concluded that K-Means clustering is capable of increasing the accuracy of KNN from 81.6% to 86.7%. Combining K-Means and PCA improves the KNN accuracy to 90.9%. Another study (Arora et al., Citation2022) used the same PIMA dataset and proposed a novel architecture for predicting diabetes patients using the K-means clustering technique and Support Vector Machine (SVM). The features extracted from K-means are then classified using an SVM classifier, achieving an accuracy of 98.7%. In the pursuit of diabetes prediction, the authors in (Ganguly & Singh, Citation2023) employed diverse machine learning assessments on a cluster-based dataset. They utilised the K-Means clustering algorithm for early detection of diabetes, analysing data from 165 diabetic patients. The most noteworthy precision, recall, and F1-score were achieved through the combination of K-Means, while the highest accuracy of 0.79 was attained using the random forest model.

Moreover, the authors in (Choubey et al., Citation2020) engaged in the processing and analysis of obtained datasets through two distinctive methodologies. On one hand, the initial approach involves the application of classification techniques including Logistic Regression, K-Nearest Neighbor, ID3 Decision Tree, C4.5 Decision Tree, and Naive Bayes. On the other hand, the second approach integrates PCA and Particle Swarm Optimization (PSO) algorithms for feature reduction prior to implementing the classification methods of the first approach. A comparative evaluation is conducted to juxtapose the diverse methodologies employed in this research. The yielded outcomes demonstrate the superiority of the proposed technique over conventional classification approaches in terms of reduced computational time and heightened accuracy. Additionally, one study revealed that the classification performance of the k-nearest neighbour (KNN) method hinges on the selection of the k neighbours for a query. Optimizing KNN is challenging due to the difficulty in choosing suitable neighbours and the optimal k value. Furthermore, KNN's effectiveness is hampered by the simplistic majority voting approach. To overcome these issues, the authors introduced a novel adaptive k-nearest centroid neighbour classifier, termed AD-LAKNCN, which utilises average distance measures. Tests on 24 real-world datasets demonstrate that AD-LAKNCN outperforms nine leading KNN variants in classification accuracy (Wang & Zhang, Citation2022).

In an extensive examination detailed in (Chauhan et al., Citation2021) an inclusive survey of diabetes research papers spanning from 2018 to 2021 is undertaken. The accurate anticipation of diabetes has been achieved through the application of decision tree-based algorithms such as C4.5 AdaBoost and XGBoost. In dealing with vast datasets, individual machine learning techniques like PCA and K-Means emerge as valuable tools for feature selection and anomaly detection. This review also highlights the potent accuracy of diabetes prediction attained by combining supervised and unsupervised AI methodologies, notably employing K-Means and Support Vector Machine.

Furthermore, the analysis by (Muhammad et al., Citation2019) employed K-means clustering, utilising the first two principal components. Additionally, through correlation analysis between diabetes mellitus and specific illnesses, a significant association was unveiled – indicating heightened prevalence of kidney and hypertension issues among diabetes patients.

While K-means is lauded for its straightforwardness and adaptability to various data types, the algorithm's dependence on the initial placement of cluster centres can significantly affect the results. Properly executed, it can produce well-clustered datasets that enhance logistic regression analyses. Conversely, if clustering is not accurate, it can adversely affect the performance of the model. In (Yadav et al., Citation2021) and (Ye et al., Citation2020), the focus was precise diabetes prediction through PCA, K-Means, Random Forest, Multilayer Perceptron and Naïve Bayes. Steps encompass data preprocessing, PCA-driven feature extraction, and voting classifier-based classification in the diabetes prediction model.

Exploring the potential of utilising tongue images to achieve nuanced diabetic population classification, K-means emerges as a viable option. By employing K-means, the diabetic population can be effectively segregated into four distinct clusters defined by Vector Quantized Variational Autoencoder features. Remarkably, this approach yields a notable classification accuracy of approximately 88% (Li et al., Citation2022).

Ultimately, incorporating an explainable element when introducing novel predictive models holds significance, especially when these models are intended for use by experts who may not necessarily possess technical expertise in the domain. A SHAP-TreeExplainer model that does not require reference data was proposed in (Lama et al., Citation2021) for predicting a prediabetes or T2D diagnosis 10 years after the data collection. Furthermore, SHAP can provide an interpretative framework for prediction models targeting specific medical conditions, such as heart failure stemming from coronary heart disease, by considering contributing risk factors associated with adverse outcomes (Wang et al., Citation2021).

Although the showcased studies demonstrate potential, they do not delve into the preliminary diabetes condition, hyperinsulinemia, through a synergistic integration of unsupervised methods like FPCA and K-Means, coupled with supervised models such as XGBoost, Random Forest, and LightGBM. Additionally, the studies that do investigate diabetic conditions lack an exploration of explainability, a gap we address using the SHAP methodology.

3. Methodology

This section consists of the following sub-sections: Section 3.1 provides an overview of Exploratory Data Analysis (EDA) for the selected dataset. Section 3.2 focuses on data preprocessing. Section 3.3 describes the proposed data mining model, including both supervised models and unsupervised techniques. The final sub-section, 3.4, outlines the evaluation metrics used for assessing the presented model.

The constructed model is devised by synergistically integrating PCA, K-Means, and Logistic Regression as the baseline components, along with advanced models such as Random Forest, XGBoost, and LightGBM. A novel methodology is subsequently introduced that leverages PCA to transform the initial feature set, effectively mitigating the issue of correlation. This alleviates the challenge posed to the classification algorithm in identifying relationships within the data (Chang et al., Citation2023; Razvi et al., Citation2019). The integration of PCA serves to eliminate extraneous features, subsequently reducing training time and costs, while simultaneously enhancing model performance (Wang et al., Citation2022). Following the execution of PCA and FPCA analyses, the outcomes are then channelled into an unsupervised clustering phase facilitated by K-Means. This selection is influenced by K-Means’ inherent capacity to handle outliers. The resulting K-Means clusters are refined and subsequently employed as the foundation for various supervised models, including Random Forest, XGBoost, and LightGBM. These models collectively contribute to the construction of our dataset's classification framework. The flowchart of the proposed model is shown in Figure 2.

Figure 2. Experimental setup (proposed data mining model).

3.1 Dataset description

The data on adolescents were collected by the Healthcare Center in Serbia from 2019 to 2022 through an observational cross-sectional study. To clarify, our dataset consists of 1008 subjects aged 12–17 years, all of whom underwent regular systematic examinations and were subsequently instructed to undergo the OGTT. This indicates that each subject presented at least one indicator suggestive of potential hyperinsulinemia during their examination; examples include elevated glucose levels, high cholesterol, increased BMI, and genetic predispositions for type 2 diabetes mellitus. The dataset captures a total of 8 features representing medical diagnostic criteria, alongside a target class that indicates each individual's test status. Specifically, the dataset includes 336 instances with a positive test result and 672 with a negative result. The size of the dataset is notable because all 1008 participants presented with at least one indicator for potential hyperinsulinemia. The set of features that significantly impact the overall clinical profile of hyperinsulinemia was meticulously selected by the pediatricians conducting the regular systematic examinations. The Exploratory Data Analysis (EDA) pertaining to the selected features is presented in Table 1.

Table 1. EDA for the dataset features.

3.2 Data preprocessing

Preprocessing medical data plays a pivotal role in the data mining process for disease prediction and diagnosis. This is because the presence of low-quality data can significantly compromise the accuracy and reliability of prediction outcomes. To enhance the utility and suitability of our initial dataset for hyperinsulinemia prediction, we systematically employed a range of preprocessing techniques. Initially, in dealing with missing data, a prevalent strategy is to impute absent values with the mean value of the corresponding attribute (Austin et al., Citation2021). In our machine learning framework, meticulous analysis identified 37 instances with missing data, each lacking a value for at least one feature. After thorough consideration, the decision was made to eliminate these instances, as their removal wouldn't detrimentally impact the credibility of the final results. These specific data points were also intentionally not substituted with the value of 0. This decision aligns with the guidance of medical experts who assert that, in the context of this attribute type, 0 is not a valid or appropriate value. Considering that, for instance, laboratory analysis parameters are provided in diverse parameter units for all input features, it becomes imperative to employ the min–max normalisation scaling technique within a specific narrow range [0, 1] (Borkin et al., Citation2019). This ensures the creation of a new scaled dataset based on the original, effectively harmonising the varied parameter units.
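The two preprocessing steps described above, removing incomplete records and min–max scaling to [0, 1], can be sketched in plain Python. This is an illustration under stated assumptions rather than the authors' pipeline: the helper names, the use of `None` to mark a missing value, and the sample numbers are all hypothetical.

```python
def drop_incomplete(rows):
    """Remove records that contain at least one missing value (None)."""
    return [row for row in rows if all(value is not None for value in row)]


def min_max_scale(rows):
    """Rescale every column to [0, 1] using (x - min) / (max - min)."""
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    scaled = []
    for row in rows:
        scaled.append([
            # A constant column has no spread, so it is mapped to 0.0.
            (value - low) / (high - low) if high > low else 0.0
            for value, low, high in zip(row, lows, highs)
        ])
    return scaled


# Hypothetical records with two features; the second record is incomplete.
raw = [[12.0, 4.5], [30.0, None], [18.0, 6.5], [24.0, 5.5]]
clean = drop_incomplete(raw)
scaled = min_max_scale(clean)
print(scaled)
```

Scaling is computed after the incomplete rows are dropped, so the column minima and maxima reflect only the records that are actually used downstream.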

3.3 Model design

Our experimental process for the proposed model is structured into three well-defined phases to ensure a comprehensive evaluation. In the initial phase, we focus on enhancing data quality by performing dimensionality reduction on the already processed dataset. This reduction is carried out using two techniques: FPCA and PCA. These methods aim to distil the most crucial information from the data while preserving its essential characteristics. In the second phase, we employ the K-Means clustering algorithm on the subsets obtained from the previous phase's FPCA and PCA outputs. This clustering step is pivotal, as it not only aids in categorising the data but also serves to effectively eliminate outliers and rectify any misclassifications that may have occurred earlier. Finally, in the third phase, the data that has been accurately clustered and classified is primed for utilisation as input within a suite of supervised classifiers. These classifiers encompass advanced models such as Random Forest, XGBoost, and LightGBM. Additionally, a logistic regression model acts as our baseline for comparison. Following a thorough assessment of the model's efficacy, encompassing a comprehensive range of performance metrics including Recall, Precision, Accuracy, ROC, and MCC, the validation process attains a higher level of substantiation through factorial and multivariate analyses.
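The threshold-based metrics named above can all be derived from the four cells of a binary confusion matrix. The sketch below assumes the standard textbook definitions; the function name and the example counts are hypothetical and are not results from this study.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall, F1 and MCC from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # Matthews correlation coefficient; defined as 0 when a margin is empty.
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}


# Hypothetical counts for an imbalanced test split.
metrics = classification_metrics(tp=90, fp=10, tn=180, fn=20)
print(metrics)
```

MCC uses all four cells, which is one reason it is often reported alongside accuracy when the classes are imbalanced, as they are in this dataset.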

3.3.1 Unsupervised learning: FPCA and PCA

The process of Principal Component Analysis revolves around a series of distinct steps aimed at transforming the feature space by exploiting attribute relationships (Hasan & Abdulazeez, Citation2021). This transformation entails mapping the original feature space onto a lower-dimensional plane to achieve the desired dimension reduction, subsequently leading to an in-depth analysis of this newly formed feature space. PCA, being an unsupervised technique for dimensionality reduction, achieves this reduction by capitalising on the correlations between input features. The optimisation of the transformation matrix is grounded in identifying the most substantial disparities within the original space (Velliangiri & Alagumuthukrishnan, Citation2019). A pivotal principle governing the selection of Principal Components (PCs) derived from PCA is that directions associated with the greatest variances inherently encode the most salient information about classes. PCA often originates from a linear projection that maximises variance within the projected space (Anowar et al., Citation2021). A commonly employed strategy for PC selection involves setting a variance-explained threshold, often around 80%, and subsequently determining the number of components that generate a cumulative variance sum close to this established threshold. The integration of PCA into a dataset proves invaluable, especially in scenarios demanding unsupervised learning. This utility lies in its ability to efficiently initialise centroids for clustering algorithms, thus significantly enhancing the efficacy of the clustering process. Given that PCA yields a feature subspace optimizing variance along axes, a vital preprocessing step involves standardising the dataset to a unit scale (with mean = 0 and variance = 1). This standardisation step is pivotal to amplify the effectiveness of PCA results, which in turn, stands as a prerequisite for the optimal performance of numerous machine learning algorithms. 
The PCA analysis transforms the original values of variables into principal components using formulas (1) and (2) (Sun et al., Citation2019):

$Y = X \cdot C$ (1)

where X is a set of n vectors, $(x_{1}, \dots, x_{n})$, in which each $x_{i}$ element represents an instance of our dataset, and C is a square matrix of order n:

$C = [c_{ij}] = \begin{bmatrix} c_{11} & \dots & c_{1n} \\ \vdots & \ddots & \vdots \\ c_{n1} & \dots & c_{nn} \end{bmatrix}$

and

$Y = \begin{bmatrix} x_{1} & \dots & x_{n} \end{bmatrix} \cdot \begin{bmatrix} c_{11} & \dots & c_{1n} \\ \vdots & \ddots & \vdots \\ c_{n1} & \dots & c_{nn} \end{bmatrix} = \begin{bmatrix} y_{1} & \dots & y_{n} \end{bmatrix}, \quad i = \overline{1, n}, \ j = \overline{1, n}$ (2)

with dimensions $(1, n) \times (n, n) = (1, n)$.

When interpreting principal components, it is often useful to know the correlations of the original variables with the principal components. The correlation between variable $X_{i}$ and principal component $Y_{j}$ is given by formula (3):

$K_{ij} = \sqrt{c_{ij}^{2} \cdot Var(y_{j}) / \sigma_{ii}}$ (3)

where $\sigma_{ii}^{2} = \frac{1}{n - 1} \cdot \sum_{i = 1}^{n} (x_{i} - \bar{x})^{2}$ and the mean value $\bar{x}$ is calculated as $\bar{x} = \frac{1}{n} \sum_{i = 1}^{n} x_{i}$.

PCA relies on the eigenvectors and eigenvalues of the covariance matrix. The eigenvectors determine the directions of the new feature space, while the eigenvalues determine their magnitudes.
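Since each eigenvalue measures the variance captured along its eigenvector, the variance-explained rule mentioned earlier (a threshold of around 80%) reduces to a cumulative sum over sorted eigenvalues. The sketch below is illustrative only: the function name and the eigenvalues are hypothetical, and it assumes the rule "keep the smallest number of components whose cumulative share reaches the threshold", which is one common reading of "close to the threshold".

```python
def components_for_threshold(eigenvalues, threshold=0.80):
    """Return how many leading components reach the cumulative variance threshold."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for count, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= threshold:
            return count
    return len(ordered)


# Hypothetical eigenvalues of a covariance matrix with five features.
eigenvalues = [4.0, 2.0, 1.0, 0.5, 0.5]
n_components = components_for_threshold(eigenvalues)
print(n_components)
```

With these illustrative values the cumulative shares are 0.50, 0.75 and 0.875, so three components are retained.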

From $(\lambda I - C) X = 0$, in other words $CX = \lambda X$, we define the set S of all vectors that satisfy this equation as formula (4):

$S = \{ X \mid (\lambda I - C) X = 0 \}$ (4)

FPCA is utilised to analyse and further reduce the dimensionality of data, specifically focusing on functional instances. Its primary goal is to determine the ideal number of functional instances to retain in the process. Once the central data within clusters is identified, a matrix of variations, denoted as C, is generated. Its elements are calculated as in formula (5):

$c_{ij} = x_{i} (r_{j}) - \bar{x}_{i} (r_{j})$ (5)

where $r_{j}$ are the values in the observed dataset.

After that, the functional variance is calculated for each r, defined as $Var (r) = \frac{1}{n} \cdot \sum_{i = 1}^{n} c_{ij}^{2}$, where n is the number of instances in each cluster. In this way, less significant instances are eliminated, and only the main ones will be used as input for the K-Means clustering in the next phase of the experiment.

3.3.2 Unsupervised learning: K-Means clustering

The K-means clustering algorithm groups data into two clusters: a set of adolescents with hyperinsulinemia and a set of adolescents without hyperinsulinemia (Ripan et al., Citation2021). First, the k cluster centres are assigned initial values, and then the remaining values are grouped around them based on Euclidean distance. In other words, each point $x_{i}$ is assigned to a specific cluster j (Alam & Muqeem, Citation2022), as in formula (6):

$c_{i} = \arg\min_{j} \lVert x_{i} - k_{j} \rVert^{2}$ (6)

where $c_{i}$ is the index of the cluster to which the value $x_{i}$ is assigned, and $k_{j}$ is the centre of cluster j.

Then, the cluster centres are updated using formula (7):

$k_{j} = \frac{1}{n_{j}} \sum_{i : c_{i} = j} x_{i}$ (7)

where $n_{j}$ is the number of points in cluster j and the sum runs over the points assigned to that cluster.

For k = 2, the target variable contains two possible outcomes (positive – has hyperinsulinemia, and negative – does not have hyperinsulinemia).
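The assignment step of formula (6) and the update step of formula (7) can be sketched in plain Python for k = 2. The function names, the toy points, and the choice of starting centres are illustrative assumptions and do not reproduce the authors' implementation.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, centres, iterations=10):
    """Alternate the assignment (6) and update (7) steps a fixed number of times."""
    centres = [list(centre) for centre in centres]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centre.
        labels = [
            min(range(len(centres)), key=lambda j: squared_distance(p, centres[j]))
            for p in points
        ]
        # Update step: each centre becomes the mean of its assigned points.
        for j in range(len(centres)):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centres[j] = [sum(column) / len(members) for column in zip(*members)]
    return labels, centres


# Hypothetical scaled observations forming two visibly separate groups.
points = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9)]
labels, centres = k_means(points, centres=[points[0], points[2]])
print(labels, centres)
```

Because the result depends on the starting centres, a practical run would typically repeat the procedure from several initialisations and keep the best one.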

3.3.3 Supervised learning: logistic regression, Random Forest, XGBoost, and LightGBM

In this research, logistic regression will serve as a baseline for evaluating the performance of Random Forest, XGBoost, and LightGBM. In contrast to more intricate models, logistic regression doesn't aim to uncover data patterns; instead, it predicts based on the most common label in the training set. Logistic regression offers several strategies, including most frequent, stratified, uniform, and constant. While being a straightforward classifier, the anticipation is that Random Forest, XGBoost, and LightGBM will surpass the baseline performance of logistic regression.

The Random Forest Classifier is a prominent ensemble model in machine learning classification. Comprising specialised decision tree configurations, it yields categorical outcomes. This classifier uses bootstrap aggregation to create n trees, averaging their predictions for the final outcome. While selecting split points, the algorithm prioritises effective features from a randomised subset (Alam et al., Citation2019; Xu et al., Citation2020). Adjustable parameters govern model size and foundational learning processes. Despite susceptibility to overfitting, its documented strengths include robustness, adaptability, and solid performance with outliers and missing values. In medical data analysis, Random Forest excels, effectively handling complex relationships and diverse variables in healthcare's intricate scenarios. In this research, the following parameters are used:

n_estimators: the number of trees in the forest;
max_features: the number of features considered when looking for the best split;
max_depth: the maximum depth of a tree;
criterion: the function used to measure the quality of a split.

XGBoost, an acronym for “Extreme Gradient Boosting”, is a pivotal algorithm in this study. It leverages gradient tree boosting, excelling in medical data analysis (Hassan et al., Citation2022). With rapid computation, high accuracy, and built-in regularisation against overfitting, it's well-suited for medical data. Handling intricate relationships, outliers, and imbalanced data, XGBoost's strength lies in healthcare complexities. Guided by an objective loss function, it optimises predictions, enhanced by boosting. This algorithm adeptly uncovers complex medical patterns and dependencies (Li et al., Citation2020; Zhang et al., Citation2020), formula (8):

$L^{(t)} = \sum_{i = 1}^{n} l \left( y_{i}, \hat{y}_{i}^{(t - 1)} + f_{t} (x_{i}) \right) + \Omega (f_{t})$ (8)

Formula (8) amalgamates gradient descent with Taylor approximation to optimise the loss function. With each iteration within the XGBoost framework, a novel model is crafted, focusing on refining the outcome using the error gradient derived from the preceding model (Zhang & Gong, Citation2020). The rationale for selecting XGBoost rests on its demonstrated superiority compared to alternative models. XGBoost has many parameters that can be tuned. For our research, we tuned the following parameters:

subsample: denotes the fraction of observations to be randomly sampled for each tree;
colsample_bytree: the subsample ratio of columns when constructing each tree;
max_depth: the maximum depth of a tree;
min_child_weight: defines the minimum sum of weights of all observations required in a child;
learning_rate: the shrinkage applied to each new tree's contribution at every boosting step.
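Returning to formula (8), the Taylor approximation it relies on is commonly written as a second-order expansion of the loss around the previous prediction. The symbols $g_{i}$ and $h_{i}$ below are introduced here for illustration and do not appear in the original text; this is the standard form of the expansion, offered as a sketch rather than as the authors' own derivation.

```latex
L^{(t)} \approx \sum_{i=1}^{n} \left[ l\left(y_{i}, \hat{y}_{i}^{(t-1)}\right)
  + g_{i}\, f_{t}(x_{i}) + \tfrac{1}{2}\, h_{i}\, f_{t}(x_{i})^{2} \right] + \Omega(f_{t}),
\qquad
g_{i} = \frac{\partial\, l\left(y_{i}, \hat{y}_{i}^{(t-1)}\right)}{\partial\, \hat{y}_{i}^{(t-1)}},
\quad
h_{i} = \frac{\partial^{2} l\left(y_{i}, \hat{y}_{i}^{(t-1)}\right)}{\partial \left(\hat{y}_{i}^{(t-1)}\right)^{2}}
```

The first term inside the sum does not depend on $f_{t}$, so choosing the new tree only requires the per-instance first and second derivatives of the loss.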

LightGBM is a gradient boosting algorithm commonly applied to medical data. Its fundamental steps and concepts gain particular significance in the context of medical information. When dealing with medical data, LightGBM becomes a vital tool for analysis and prediction due to its ability to efficiently handle large datasets and achieve high predictive accuracy. The algorithm employs gradient boosting to iteratively enhance prediction models, combining weak models to form a robust predictor (Rufo et al., Citation2021). The gradient of the target function $L (y, F (x))$ with respect to the function $F (x)$ is computed as in formula (9) (Ghourabi, Citation2022):

$\frac{\partial L (y, F (x))}{\partial F (x)}$ (9)

where these gradients indicate the direction and rate of change of the objective function with respect to the prediction.
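As a concrete instance of the gradient in formula (9), consider binary log loss with $F(x)$ taken as the raw (pre-sigmoid) score. The choice of this loss is an assumption made for illustration, since the text does not state which objective was used, and the function names are hypothetical. Under that assumption the gradient simplifies to the predicted probability minus the label.

```python
import math


def sigmoid(score):
    """Map a raw score F(x) to a probability."""
    return 1.0 / (1.0 + math.exp(-score))


def log_loss(y_true, raw_score):
    """Binary log loss L(y, F(x)) for a single instance (assumed objective)."""
    p = sigmoid(raw_score)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


def log_loss_gradient(y_true, raw_score):
    """Derivative of the log loss with respect to F(x): sigmoid(F(x)) - y."""
    return sigmoid(raw_score) - y_true


# For a positive instance with raw score 0 the gradient is negative,
# so lowering the loss means increasing the score.
print(log_loss_gradient(1, 0.0))
```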

For each candidate split, the Gain is calculated as the difference between the value of the objective function before and after splitting the node, formula (10): $Gain = L (node, F (node)) - [L (left, F (left)) + L (right, F (right))]$ (10) where left and right denote the child nodes created to the left and right of the selected node, and “node” represents the node before splitting. The split that yields the highest gain is selected. For our research, we tuned the following parameters:

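num_leaves: the maximum number of leaves per tree, which controls model complexity, speed, and performance;
n_estimators: the number of boosting rounds, which affects model performance;
max_bin: the maximum number of bins used to discretise feature values; larger values can improve accuracy at the cost of training speed;
min_data_in_leaf: the minimum number of data points in a leaf, which helps prevent overfitting and improves robustness against noise;
reg_alpha and reg_lambda: L1 and L2 regularisation terms, which prevent overfitting and improve generalisation by penalising model complexity.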
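The gain-based choice of a split described above can be sketched as follows. This is a simplified illustration, not LightGBM's implementation: it uses the sum of squared errors as the node loss and tries every midpoint between distinct feature values, whereas LightGBM evaluates candidate splits on binned feature values (cf. max_bin) using gradient statistics. The data and function names are hypothetical.

```python
def squared_error(values):
    """Sum of squared deviations from the mean (loss of a constant-prediction node)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def split_gain(feature, target, threshold):
    """Loss of the node before splitting minus the summed loss of its two children."""
    left = [t for f, t in zip(feature, target) if f <= threshold]
    right = [t for f, t in zip(feature, target) if f > threshold]
    return squared_error(target) - (squared_error(left) + squared_error(right))


def best_split(feature, target):
    """Return the midpoint threshold with the highest gain."""
    distinct = sorted(set(feature))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return max(candidates, key=lambda thr: split_gain(feature, target, thr))


# Hypothetical toy data: one feature and one numeric target.
feature = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
target = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]

threshold = best_split(feature, target)
gain = split_gain(feature, target, threshold)
```

In this toy case the threshold that separates the two target values removes all within-node error, so it has the largest gain among the candidates.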

3.3.4 Post-hoc model-agnostic method: SHAP

Model-agnostic approaches such as SHAP treat machine learning models as opaque entities, focusing exclusively on their outputs and requiring no knowledge of model internals. This makes SHAP and similar methods applicable across a diverse range of model architectures, which enhances their utility in the field of explainable AI (Nohara et al., Citation2022). Applied post hoc, SHAP provides insight into feature contributions, and in medical data it helps to unveil complex relationships and to clarify the influence of individual features (Knapič et al., Citation2021). SHAP is adaptable across medical models, traditional or advanced, empowering practitioners to interpret and validate outcomes, leading to informed medical decisions. The importance of variable $i$ is assessed by analysing how the inclusion of variable $i$ in a set $S$ changes the value of the function $e_{S}$. This contribution of variable $i$ is denoted $\phi (i)$ and is calculated as a weighted average across all possible subsets $S$, formula (11): $\phi (i) = \sum_{S \subseteq \{1, \dots, p\} \setminus \{i\}} \frac{| S |! (p - 1 - | S |)!}{p!} (e_{S \cup \{i\}} - e_{S}) .$ (11)

Formula (11) is equivalent to the following formula (12): $\phi (i) = \frac{1}{| \Pi |} \sum_{\pi \in \Pi} (e_{\mathrm{before} (\pi, i) \cup \{i\}} - e_{\mathrm{before} (\pi, i)}) ,$ (12) where $\Pi$ represents the set of all orderings of the $p$ variables, and $\mathrm{before} (\pi, i)$ refers to the subset of variables that precede variable $i$ in the ordering $\pi$. Each ordering in $\Pi$ corresponds to a sequence of values $e_{S}$ that transition from $e_{\emptyset}$ to the prediction for the instance $x^{*}$. To summarise, each specific ordering reveals the impact of adding consecutive variables on the function $e_{S}$, and $\phi (i)$ is derived by averaging these contributions across all possible orderings (Slack et al., Citation2021).
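The equivalence of the subset form and the ordering form of the Shapley value can be checked numerically on a small example. The sketch below assumes a hypothetical value function `e` over three variables with one interaction term; it is illustrative only and does not reproduce the study's model or the SHAP library's algorithms.

```python
from itertools import combinations, permutations
from math import factorial

FEATURES = (0, 1, 2)  # p = 3 hypothetical variables


def e(subset):
    """Toy value function e_S: additive effects plus one interaction (illustrative only)."""
    s = set(subset)
    value = 10.0  # value for the empty set
    value += 4.0 if 0 in s else 0.0
    value += 2.0 if 1 in s else 0.0
    value += 1.0 if 2 in s else 0.0
    value += 3.0 if {0, 1} <= s else 0.0  # interaction between variables 0 and 1
    return value


def shapley_subsets(i):
    """Subset form: weighted average over all subsets S that do not contain i."""
    p = len(FEATURES)
    others = [j for j in FEATURES if j != i]
    total = 0.0
    for size in range(p):
        weight = factorial(size) * factorial(p - 1 - size) / factorial(p)
        for S in combinations(others, size):
            total += weight * (e(S + (i,)) - e(S))
    return total


def shapley_orderings(i):
    """Ordering form: average marginal contribution of i over all orderings."""
    orderings = list(permutations(FEATURES))
    total = 0.0
    for order in orderings:
        before = order[:order.index(i)]
        total += e(before + (i,)) - e(before)
    return total / len(orderings)


phi_subsets = [shapley_subsets(i) for i in FEATURES]
phi_orderings = [shapley_orderings(i) for i in FEATURES]
```

Both lists agree up to floating-point error, and the contributions sum to the difference between the value for all variables and the value for the empty set. The interaction term is shared equally between the two variables involved.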

3.4 Performance metrics

Recall measures the fraction of actual positives that are correctly identified, reflecting how successfully true anomalies are recognised. Precision is the number of true positives divided by the total number of positive predictions, indicating how reliable the identified anomalies are. The assessment also includes accuracy and the Matthews Correlation Coefficient (MCC). To assess overall performance, the F1 score, the harmonic mean of precision and recall, is reported for each model. Furthermore, confusion matrices give insight into how the models misclassify both classes. Collectively, these metrics provide a thorough evaluation of model efficacy in medical data analysis (Thabtah et al., Citation2020).
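As a worked illustration, the metrics named above can be computed directly from binary confusion-matrix counts. The sketch uses the standard definitions and, as example input, the counts reported in Section 4.2 (TP = 222, TN = 441, FP = 2, FN = 7). The function name is hypothetical, and the sketch assumes that no denominator is zero.

```python
from math import sqrt


def classification_metrics(tp, tn, fp, fn):
    """Compute precision, recall, accuracy, F1 and MCC from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    mcc_denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denominator
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f1": f1,
        "mcc": mcc,
    }


# Counts taken from the confusion matrix reported in Section 4.2.
metrics = classification_metrics(tp=222, tn=441, fp=2, fn=7)
```

MCC uses all four cells of the confusion matrix, which makes it informative for imbalanced classes where accuracy alone can be misleading.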

4. Results

In this section, we present the results obtained in this study. Section 4.1 presents the results obtained through traditional statistical analysis, Section 4.2 showcases the outcomes achieved using the proposed data mining model, and Section 4.3 presents the results obtained through factorial analysis.

4.1 Statistical analysis

Table 2 shows the average glucose and insulin values along with the corresponding standard deviations for participants in both the first (experimental) group and the second (control) group. These parameters were measured at the 0th, 30th, 60th, 90th, and 120th minute. Subsequently, mean values and standard deviations were computed for adolescents within each group, in addition to the values of the HOMA-IR index (insulin resistance index). Student's t-test was used to confirm the presence of statistically significant differences between the observed groups in the values obtained during the OGTT test, as well as in the calculated HOMA-IR index values.

Table 2. OGTT values along with HOMA-IR index values.


Based on the statistical outcomes, insulin values exceeding 45 µU/ml were identified at the 0th minute in 18 participants (1.8%), while at the 120th minute, 318 participants (31.5%) exhibited insulin values surpassing 75 µU/ml. Notably, a cumulative insulin value surpassing 300 µU/ml was evident in 336 participants (33.3%), who constituted the first (experimental) group. The second (control) group comprised 672 participants (66.7%), characterised by a cumulative insulin value below 300 µU/ml. For both participant groups, most hematological and biochemical parameters exhibited elevated values in comparison to reference standards. However, disparities emerged when analysing the results by gender and group (Table 3).
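The group assignment and the HOMA-IR index can be sketched in a few lines. Two assumptions are made here that this section does not state explicitly: that the cumulative insulin value is the sum of the five OGTT readings (0, 30, 60, 90, and 120 minutes), and that glucose is expressed in mmol/L, for which the standard HOMA-IR formula divides by 22.5. The readings and function names below are hypothetical and are not study data.

```python
INSULIN_SUM_THRESHOLD = 300.0  # µU/ml, cumulative threshold reported in the text


def cumulative_insulin(ogtt_insulin):
    """Sum of the insulin readings at 0, 30, 60, 90 and 120 minutes (assumed definition)."""
    if len(ogtt_insulin) != 5:
        raise ValueError("expected five OGTT insulin readings")
    return sum(ogtt_insulin)


def assign_group(ogtt_insulin):
    """Experimental group if the cumulative insulin value exceeds the threshold."""
    total = cumulative_insulin(ogtt_insulin)
    return "experimental" if total > INSULIN_SUM_THRESHOLD else "control"


def homa_ir(fasting_glucose_mmol_l, fasting_insulin_uu_ml):
    """Standard HOMA-IR: glucose (mmol/L) x insulin (µU/ml) / 22.5."""
    return fasting_glucose_mmol_l * fasting_insulin_uu_ml / 22.5


# Hypothetical readings in µU/ml, for illustration only.
participant_a = [12.0, 80.0, 95.0, 85.0, 60.0]
participant_b = [8.0, 45.0, 50.0, 40.0, 30.0]

group_a = assign_group(participant_a)
group_b = assign_group(participant_b)
example_homa_ir = homa_ir(5.0, 12.0)
```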

Table 3. Results of hematological and biochemical analyses.


4.2 Results obtained using proposed data mining model

The original dataset used in this study consists of 1008 instances, divided into two subsets: the first subset is for training and contains 672 instances, while the second subset is for testing and contains 336 instances. The identical steps of the experimental setup, as depicted in are performed during both the training and testing phases.

PCA was used to reduce the number of variables in the original dataset and so facilitate the subsequent stages involving a range of ML algorithms. To determine which technique offers better dimensionality reduction while preserving crucial instance-specific details, FPCA was compared with PCA. In both analyses, essential variables were distilled, the data were streamlined, and dimensionality was reduced without significant information loss. FPCA proved particularly effective, making it the preferred option for improving the precision of the K-Means algorithm in cluster identification. Detailed outcomes are available in Table 4. Before the proposed model is implemented in practice, its efficiency must be assessed. Initially, performance metrics were examined, and the results are presented in Table 5. Subsequently, cross-validation was employed to estimate the model's behaviour on new, unseen data, thereby addressing bias and substantially reducing variability. The confusion matrix (Figure 3) comprises true positives (TP = 222), true negatives (TN = 441), false positives (FP = 2), and false negatives (FN = 7).

Figure 3. Confusion matrix.

Table 4. FPCA and PCA technique performed on original dataset.


Table 5. Performance metrics.


Subsequently, we refined our K-Means clustering outcomes by eliminating inaccurately grouped data points. If the refined dataset exceeds 80% of the original size, we proceed to supervised classification. If this threshold is not met, the K-Means procedure is repeated in pursuit of an optimal dataset size, while ensuring that the first eight principal components (8 PCs) maintain a correlation of at least 70%. Figure 4 illustrates two discernible clusters denoting the outcomes: cluster 1 corresponds to the negative outcome and cluster 2 to the positive outcome.
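The filtering step and the 80% check can be sketched as below. The sketch assumes that a point counts as "inaccurately grouped" when its known outcome differs from the majority outcome of its cluster; the section does not spell out the exact criterion, so this rule, the function names, and the toy assignments are all hypothetical. The clustering itself is taken as given, so the cluster identifiers would come from a separate K-Means run.

```python
from collections import Counter

RETAIN_THRESHOLD = 0.80  # share of the original data that must survive filtering


def filter_misclustered(cluster_ids, labels):
    """Return indices of points whose label matches the majority label of their cluster."""
    majority = {}
    for cluster in set(cluster_ids):
        members = [lab for cid, lab in zip(cluster_ids, labels) if cid == cluster]
        majority[cluster] = Counter(members).most_common(1)[0][0]
    return [
        idx
        for idx, (cid, lab) in enumerate(zip(cluster_ids, labels))
        if majority[cid] == lab
    ]


def ready_for_classification(cluster_ids, labels):
    """Check whether enough data remain to proceed to supervised classification."""
    kept = filter_misclustered(cluster_ids, labels)
    retained = len(kept) / len(labels)
    return retained > RETAIN_THRESHOLD, retained, kept


# Hypothetical cluster assignments and outcomes for ten instances.
cluster_ids = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
labels = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

proceed, retained, kept = ready_for_classification(cluster_ids, labels)
```

When `proceed` is false, the procedure described above would rerun K-Means and apply the same check again.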

Figure 4. K-Means clustering for the hyperinsulinemia outcome.

To evaluate the proposed experimental model more comprehensively, we extended our analysis by employing three additional machine learning algorithms. This was done across several data variants: the original dataset, data processed with PCA, data transformed through FPCA, and data subjected to both FPCA and K-Means processing. Tables 6 and 7 show that, in both the training and testing phases, the FPCA + K-Means combination improved the accuracy of the algorithms considered. FPCA alone also improved the accuracy of each algorithm, while PCA applied alone slightly reduced it.

Table 6. Accuracy on processed dataset compared on different ML models (training).


Table 7. Accuracy on processed dataset compared on different ML models (testing).


Using the SHAP feature importance method, it was possible to determine the influence of seven input features and their values in comparison with mean(SHAP value). From Figure 5, it can be concluded that the features that played a crucial role in determining whether an adolescent was classified with a positive or negative outcome were BMI (Body Mass Index) and High cholesterol. Other notable contributors were Insufficient physical activity and an Unhealthy diet. The remaining risk factors, namely the adolescents’ residential demographic area and their age, ranked lowest among the influencing factors.
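A global ranking of this kind is commonly obtained by averaging the absolute SHAP values of each feature across all instances. The sketch below assumes that convention; the feature names and SHAP values are hypothetical placeholders and are not results from the study.

```python
def mean_abs_shap(shap_rows, feature_names):
    """Rank features by the mean absolute SHAP value across instances."""
    n = len(shap_rows)
    importance = {
        name: sum(abs(row[j]) for row in shap_rows) / n
        for j, name in enumerate(feature_names)
    }
    return sorted(importance.items(), key=lambda item: item[1], reverse=True)


# Hypothetical SHAP values (rows = instances, columns = features); not study results.
feature_names = ["feature_a", "feature_b", "feature_c"]
shap_rows = [
    [0.40, -0.10, 0.02],
    [-0.30, 0.20, -0.01],
    [0.50, -0.15, 0.03],
]

ranking = mean_abs_shap(shap_rows, feature_names)
```

Taking absolute values before averaging prevents positive and negative contributions from cancelling, so a feature that pushes predictions strongly in both directions still ranks as important.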

Figure 5. SHAP feature importance.

4.3 Results obtained using factorial analysis

Exploratory factorial analysis again validated the most influential risk factors. Within the experimental group exhibiting hyperinsulinemia, the pivotal risk factors were BMI and elevated cholesterol levels, which served as the primary factors responsible for the overall risk of hyperinsulinemia development. Subsequently, a similar approach using confirmatory factor analysis revealed four additional significant factors: Insufficient physical activity, Unhealthy diet, Genetic diseases, and the consumption of psychoactive substances. Together, all other factors contributed to the cumulative risk linked to hyperinsulinemia with elevated glycemia. These outcomes are presented in Figure 6.

Figure 6. Risk factor importance – factorial analysis.

5. Discussion

The outcomes of the presented research demonstrate that FPCA enhances the K-Means clustering algorithm. Achieving this has been a challenge in prior investigations, and the proposed data mining model addresses it. A dataset containing information from 1008 adolescents was divided into two subsets: a training set with 672 adolescents and a testing set with 336 adolescents. The proposed approach was applied to both sets.

In previous studies, the best result was achieved in (Zhu et al., Citation2019), where an accuracy of 97.4% on a sample of 678 subjects for diabetes detection and identification was attained. Employing PCA and K-Means to enhance logistic regression accuracy, the authors attained peak performance through the Pima Indian diabetes dataset. In (Wu et al., Citation2018), the amalgamation of K-Means and Logistic Regression produced a model accuracy of 95.4% among 589 subjects. Comparable research undertakings have revealed diminished precision and model reliability, constituting a primary hurdle in the domain of medical algorithm applications.

The FPCA technique proposed in our study effectively addresses the fundamental issue of prediction model accuracy. When combined with K-Means, it precisely classified instances. On the medical dataset containing 1008 instances of adolescent data, the XGBoost algorithm achieved an accuracy and reliability of 98.67%. Hyperinsulinemia was identified in 33.3% of adolescents (the experimental group), while 66.7% showed no presence of hyperinsulinemia. Using the post-hoc SHAP method, we identified BMI as the most influential risk factor present in all adolescents, followed by six other significant factors. A comparison with factor analysis revealed an identical prioritisation of risk factors as in the FPCA technique, though the model's accuracy was slightly lower at 95%. The results obtained from this research indicate greater reliability and precision of the proposed model compared to previously introduced models and techniques. These insights are intended to support the establishment of frameworks and the tailoring of healthcare policies in Serbia, yielding practical applications in medical practice.

6. Conclusion

The main objective of our paper was to propose a data mining model and improve the accuracy of the supervised models that are frequently used in medical diagnostics. The experimental segment of this study, which involved the implementation of unsupervised techniques for dimensionality reduction such as PCA and FPCA, along with K-Means clustering, aimed to enhance supervised models and address our primary research question:

“To what extent can unsupervised techniques such as PCA and FPCA improve K-Means clustering and provide foundation for input values of Random Forest, XGBoost and LightGBM contributing to higher classifiers accuracies?”.

It was found that the XGBoost model outperforms the baseline model along with Random Forest and LightGBM. While PCA is a widely recognised technique, its potential for enhancing K-Means clustering and, consequently, the performance of the classifiers, has not garnered sufficient attention. Our experiment demonstrates that logistic regression and LightGBM models, even with slightly lower accuracy (around 0.02%), could be promising predictors for hyperinsulinemia through the integration of FPCA and K-Means. The study's contribution lies in its capacity to attain significantly improved K-Means clustering outcomes, surpassing what other researchers have accomplished in similar investigations. Additionally, the XGBoost model exhibits enhanced predictive capabilities for hyperinsulinemia onset when contrasted with the outcomes of employing other algorithms, both within our study and in comparison to similar studies conducted by others. Another notable advantage is the model's proficiency in handling new datasets. Post-hoc, model-agnostic methods such as SHAP establish interpretability within the proposed models. This advancement helps medical professionals and healthcare experts to gain a clearer comprehension of the outcomes when using such a diagnostic tool.

Finally, the outcomes of this research hold significance for the domain of hyperinsulinemia prediction across diverse industries. Detecting hyperinsulinemia early on can prove advantageous when the model is effectively integrated with healthcare policies. This integration facilitates the automatic initiation of precautionary measures in the event of an anticipated hyperinsulinemia risk. Leveraging this integration, the timely prediction of hyperinsulinemia can be ensured, thereby mitigating the adverse effects associated with its onset.

More broadly, the insights gained from our experiment transcend their initial application in diagnosing hyperinsulinemia, offering valuable implications for a broader range of medical diagnostics. These insights stem from an extensive experimental setup that integrates various unsupervised and supervised learning techniques, alongside a selected hold-out approach for validation. Our dataset comprises authentic values linked to a wide array of indicators from significant laboratory analyses and specialist physician reviews. Furthermore, the interpretability of these models has been achieved by incorporating SHAP with XGBoost, which allows for an understanding of both the overarching patterns and the specifics underpinning individual predictions. Hence, we contend that an additional comparison would not contribute further value, given our thorough global and local analytical coverage.

6.1 Limitations and future directions

When exploring the datasets gathered on a national scale to predict hyperinsulinemia, it is important to consider the limitations associated with missing data. Although this study didn't encounter significant issues due to a small number of missing data points, the introduction of absent values can potentially introduce biases and hinder the precision of predictive models, particularly when dealing with a larger number of instances. Incomplete records arising from missing data can impede the thorough analysis and prediction process, given that these models rely on complete data to achieve accurate estimations. Moreover, the absence of data can result in diminished sample sizes, thereby compromising the statistical power of predictive models. When dealing with a smaller sample size, models may face challenges in detecting significant associations and patterns related to hyperinsulinemia diagnostics, consequently limiting the degree to which their findings can be extrapolated to the broader population. Furthermore, the task of imputing missing values presents challenges, as the chosen imputation methods can introduce assumptions and uncertainties that impact the accuracy and reliability of the projected outcomes.

The application of Shapley values to hyperinsulinemia diagnostics encounters a significant challenge owing to computational complexity, particularly in modern models like deep neural networks with high-dimensional inputs. This intricacy renders the exact computation of Shapley values impractical. Nonetheless, specialised implementations of SHAP are available for tree-based methods (e.g. Random Forest, XGBoost) and additive models, offering viable alternatives. Caution must be exercised, especially when considering particular versions of SHAP for estimation, such as KernelSHAP, which, while effective, can be computationally slow. Additionally, when features are interdependent, using Shapley values might lead to extrapolation into areas with low data density. To mitigate this, conditional versions have been developed, primarily tailored for tree-based models, but necessitating careful interpretation to avoid common pitfalls. Importantly, SHAP explanations do not yield sparse outcomes, as a non-zero Shapley value is attributed to each feature influencing the prediction, irrespective of its magnitude. If sparse explanations are preferred, alternative approaches like counterfactual explanations might prove more suitable (Su et al., Citation2023).
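One common way around the cost of exact computation is to estimate Shapley values by sampling random orderings of the features and averaging the marginal contributions. The sketch below illustrates this Monte Carlo permutation sampling; it is not KernelSHAP and not the tree-specific algorithm, and the value function, function names, and sample size are hypothetical choices made for illustration.

```python
import random


def sampled_shapley(value_fn, n_features, n_samples=2000, seed=0):
    """Estimate Shapley values by averaging marginal contributions over random orderings."""
    rng = random.Random(seed)
    totals = [0.0] * n_features
    features = list(range(n_features))
    for _ in range(n_samples):
        rng.shuffle(features)
        included = set()
        previous = value_fn(included)
        for i in features:
            included.add(i)
            current = value_fn(included)
            totals[i] += current - previous
            previous = current
    return [t / n_samples for t in totals]


def toy_value(subset):
    """Hypothetical value function with one interaction term (illustrative only)."""
    value = 10.0
    value += 4.0 if 0 in subset else 0.0
    value += 2.0 if 1 in subset else 0.0
    value += 1.0 if 2 in subset else 0.0
    value += 3.0 if 0 in subset and 1 in subset else 0.0
    return value


estimates = sampled_shapley(toy_value, n_features=3)
```

The cost grows with the number of sampled orderings rather than with the number of feature subsets, at the price of sampling noise in the estimates. Within each sampled ordering the contributions still sum to the difference between the full and empty value, so the estimates preserve that property.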

Given these limitations, future research in the realm of predicting hyperinsulinemia diagnostics using diverse datasets should delve into exploring deep learning models. Deep learning models, such as recurrent neural networks (RNNs), have demonstrated promise in handling missing data and extracting intricate patterns from extensive datasets. Particularly, the use of RNNs, like Fuzzy Cognitive Maps, could unearth hidden patterns within the data and facilitate various WHAT-IF simulations to determine important concepts. Harnessing the potential of deep learning models (Abunadi, Citation2022; Hu et al., Citation2022), researchers can potentially address the challenges posed by missing data and enhance the precision, dependability, and general applicability of predictive models for hyperinsulinemia diagnostics.

Institutional review board statement

The study was conducted in accordance with the Declaration of Helsinki, and approved with Protocol code: DZ-01-2294 issued on 02/07/2019 by the Healthcare Center in Serbia, and Protocol code: DZ-01-2646/1 issued on 11/08/2021.

Informed consent statement

Informed consent was obtained from all subjects involved in the study.

Acknowledgements

The authors would like to thank freepik.com website (accessible through the following link: https://www.freepik.com/) for providing free license for parts of the . Hyperinsulinemia process. Muscle picture as a part of . was designed by brgfx/Freepik. Pancreas picture as a part of . was designed by macrovector/Freepik. Liver picture as a part of . was designed by Freepik. Circulation or blood stream picture as a part of . was designed by Freepik. Adipose tissue picture as a part of was designed by Freepik.

Data and code availability statement

The raw dataset used for this study is under a Non-Disclosure Agreement (NDA) and is therefore not available to the public. The code for the presented data mining model could be available upon reasonable request.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

The author(s) reported there is no funding associated with the work featured in this article.

References

Abunadi, I. (2022). Deep and hybrid learning of MRI diagnosis for early detection of the progression stages in Alzheimer’s disease. Connection Science, 34(1), 2395–2430. https://doi.org/10.1080/09540091.2022.2123450
Alam, A., & Muqeem, M. (2022, March). Integrated k-means clustering with nature inspired optimization algorithm for the prediction of disease on high dimensional data. In 2022 international conference on electronics and renewable systems (ICEARS) (pp. 1556–1561), Organized by: St. Mother Theresa Engineering College. IEEE.
Alam, M. Z., Rahman, M. S., & Rahman, M. S. (2019). A random forest based predictor for medical data classification using feature ranking. Informatics in Medicine Unlocked, 15, 100180. https://doi.org/10.1016/j.imu.2019.100180
Andes, L. J., Cheng, Y. J., Rolka, D. B., Gregg, E. W., & Imperatore, G. (2020). Prevalence of prediabetes among adolescents and young adults in the United States, 2005-2016. JAMA pediatrics, 174(2), e194498–e194498. https://doi.org/10.1001/jamapediatrics.2019.4498
Anowar, F., Sadaoui, S., & Selim, B. (2021). Conceptual and empirical comparison of dimensionality reduction algorithms (pca, kpca, lda, mds, svd, lle, isomap, le, ica, t-sne). Computer Science Review, 40, 100378. https://doi.org/10.1016/j.cosrev.2021.100378
Arora, N., Singh, A., Al-Dabagh, M. Z. N., & Maitra, S. K. (2022). A novel architecture for diabetes patients’ prediction using K-means clustering and SVM. Mathematical Problems in Engineering, 2022, 4815521.
Austin, P. C., White, I. R., Lee, D. S., & van Buuren, S. (2021). Missing data in clinical research: A tutorial on multiple imputation. Canadian Journal of Cardiology, 37(9), 1322–1331. https://doi.org/10.1016/j.cjca.2020.11.010
Bansal, A., & Singhrova, A. (2022). An improved machine learning prediction model for diabetes. In ICDSMLA 2020: Proceedings of the 2nd international conference on data science, machine learning and applications (pp. 131–144). Springer Singapore.
Borkin, D., Némethová, A., Michaľčonok, G., & Maiorov, K. (2019). Impact of data normalization on classification model accuracy. Research Papers Faculty of Materials Science and Technology Slovak University of Technology, 27(45), 79–84. https://doi.org/10.2478/rput-2019-0029
Calcaterra, V., Biganzoli, G., Ferraro, S., Verduci, E., Rossi, V., Vizzuso, S., Bosetti, A., Borsani, B., Biganzoli, E., & Zuccotti, G. (2022). A multivariate analysis of “metabolic phenotype” patterns in children and adolescents with obesity for the early stratification of patients at risk of metabolic syndrome. Journal of Clinical Medicine, 11(7), 1856. https://doi.org/10.3390/jcm11071856
Chang, V., Bailey, J., Xu, Q. A., & Sun, Z. (2023). Pima Indians diabetes mellitus classification based on machine learning (ML) algorithms. Neural Computing and Applications, 35(22), 16157–16173. https://doi.org/10.1007/s00521-022-07049-z
Chauhan, T., Rawat, S., Malik, S., & Singh, P. (2021). Supervised and unsupervised machine learning based review on diabetes care. 2021 7th International Conference on Advanced Computing and Communication Systems (ICACCS), Coimbatore, India, 2021, pp. 581–585. https://doi.org/10.1109/ICACCS51430.2021.9442021.
Chen, Z., Ma, M., Li, T., Wang, H., & Li, C. (2023). Long sequence time-series forecasting with deep learning: A survey. Information Fusion, 97, 101819. https://doi.org/10.1016/j.inffus.2023.101819
Choubey, D. K., Kumar, P., Tripathi, S., & Kumar, S. (2020). Performance evaluation of classification methods with PCA and PSO for diabetes. Network Modeling Analysis in Health Informatics and Bioinformatics, 9(1), 1–30. https://doi.org/10.1007/s13721-019-0210-8
Edeh, M. O., Khalaf, O. I., Tavera, C. A., Tayeb, S., Ghouali, S., Abdulsahib, G. M., Richard-Nnabu, N. E., & Louni, A. (2022). A classification algorithm-based hybrid diabetes prediction model. Frontiers in Public Health, 10, 829519. https://doi.org/10.3389/fpubh.2022.829519
Ganguly, R., & Singh, D. (2023). An approach to predict early diabetes mellitus with an unsupervised clustering technique. International Journal of Intelligent Systems and Applications in Engineering, 11(3), 45–55.
Gecili, E., Huang, R., Khoury, J. C., King, E., Altaye, M., Bowers, K., & Szczesniak, R. D. (2021). Functional data analysis and prediction tools for continuous glucose-monitoring studies. Journal of Clinical and Translational Science, 5(1), e51. https://doi.org/10.1017/cts.2020.545
Ghourabi, A. (2022). A security model based on lightgbm and transformer to protect healthcare systems from cyberattacks. IEEE Access, 10, 48890–48903. https://doi.org/10.1109/ACCESS.2022.3172432
Halloun, R., Galderisi, A., Caprio, S., & Weiss, R. (2022). Lack of evidence for a causal role of hyperinsulinemia in the progression of obesity in children and adolescents: A longitudinal study. Diabetes Care, 45(6), 1400–1407. https://doi.org/10.2337/dc21-2210
Hasan, B. M. S., & Abdulazeez, A. M. (2021). A review of principal component analysis algorithm for dimensionality reduction. Journal of Soft Computing and Data Mining, 2(1), 20–30.
Hassan, M. M., Mollick, S., & Yasmin, F. (2022). An unsupervised cluster-based feature grouping model for early diabetes detection. Healthcare Analytics, 2, 01–05, 100112. https://doi.org/10.1016/j.health.2022.100112
Hassan, M. M., Peya, Z. J., Mollick, S., Billah, M. A. M., Shakil, M. M. H., & Dulla, A. U. (2021, July). Diabetes prediction in healthcare at early stage using machine learning approach. In 2021 12th International conference on computing communication and networking technologies (ICCCNT) (pp. 01-05). IEEE.
Horta, B. L., & de Lima, N. P. (2019). Breastfeeding and type 2 diabetes: Systematic review and meta-analysis. Current Diabetes Reports, 19(1), 1–6. https://doi.org/10.1007/s11892-019-1121-x
Hu, N., Zhang, D., Xie, K., Liang, W., & Hsieh, M.-Y. (2022). Graph learning-based spatial-temporal graph convolutional neural networks for traffic forecasting. Connection Science, 34(1), 429–448. https://doi.org/10.1080/09540091.2021.2006607
IDF Diabetes Atlas. (2021). Online version of IDF diabetes atlas: 10th edition. www.diabetesatlas.org, ISBN: 978-2-930229-98-0
Khairunnisa, S., Suyanto, S., & Yunanto, P. E. (2020, December). Removing noise, reducing dimension, and weighting distance to enhance k-nearest neighbors for diabetes classification. In 2020 3rd International Seminar on Research of Information Technology and Intelligent Systems (ISRITI) (pp. 471-475). IEEE.
Knapič, S., Malhi, A., Saluja, R., & Främling, K. (2021). Explainable artificial intelligence for human decision support system in the medical domain. Machine Learning and Knowledge Extraction, 3(3), 740–770. https://doi.org/10.3390/make3030037
Koch, C. A., Bartel, M. J., & Weinberg, D. S. (2021). Possible mechanisms: Hyperinsulinemia and endocrine disrupting chemicals. Deutsches Ärzteblatt international, 118(15), 271.
Lama, L., Wilhelmsson, O., Norlander, E., Gustafsson, L., Lager, A., Tynelius, P., Wärvik, L., & Östenson, C. G. (2021). Machine learning for prediction of diabetes risk in middle-aged Swedish people. Heliyon, 7(7), e07419. https://doi.org/10.1016/j.heliyon.2021.e07419
Li, J., Huang, J., Jiang, T., Tu, L., Cui, L., Cui, J., Ma, X., Yao, X., Shi, Y., Wang, S., Wang, Y., Liu, J., Li, Y., Zhou, C., Hu, X., & Xu, J. (2022). A multi-step approach for tongue image classification in patients with diabetes. Computers in Biology and Medicine, 149, 105935. https://doi.org/10.1016/j.compbiomed.2022.105935
Li, M., Fu, X., & Li, D. (2020, March). Diabetes prediction based on XGBoost algorithm. In IOP conference series: Materials science and engineering (Vol. 768 (7), p. 072093). IOP Publishing.
Muhammad, M. U., Jiadong, R., Muhammad, N. S., Hussain, M., & Muhammad, I. (2019). Principal component analysis of categorized polytomous variable-based classification of diabetes and other chronic diseases. International Journal of Environmental Research and Public Health, 16(19), 3593. https://doi.org/10.3390/ijerph16193593
Nohara, Y., Matsumoto, K., Soejima, H., & Nakashima, N. (2022). Explanation of machine learning models using shapley additive explanation and application for real data in hospital. Computer Methods and Programs in Biomedicine, 214, 106584. https://doi.org/10.1016/j.cmpb.2021.106584
Pan, Y., Laber, E. B., Smith, M. A., & Zhao, Y. Q. (2023). Reinforced risk prediction with budget constraint using irregularly measured data from electronic health records. Journal of the American Statistical Association, 118(542), 1090–1101. https://doi.org/10.1080/01621459.2021.1978467
Rasha, A.-H., Li, T., Huang, W., Gu, J., & Li, C. (2023). Federated learning in smart cities: Privacy and security survey. Information Sciences, 632, 833–857.
Razvi, S. S., Feng, S., Narayanan, A., Lee, Y. T. T., & Witherell, P. (2019, August). A review of machine learning applications in additive manufacturing. In International design engineering technical conferences and computers and information in engineering conference (Vol. 59179, p. V001T02A040). American Society of Mechanical Engineers.
Ripan, R. C., Sarker, I. H., Hossain, S. M. M., Anwar, M. M., Nowrozy, R., Hoque, M. M., & Furhad, M. H. (2021). A data-driven heart disease prediction model through K-means clustering-based anomaly detection. SN Computer Science, 2(2), 1–12. https://doi.org/10.1007/s42979-021-00518-7
Rufo, D. D., Debelee, T. G., Ibenthal, A., & Negera, W. G. (2021). Diagnosis of diabetes mellitus using gradient boosting machine (LightGBM). Diagnostics, 11(9), 1714. https://doi.org/10.3390/diagnostics11091714
Ryder, J. R., Northrop, E., Rudser, K. D., Kelly, A. S., Gao, Z., Khoury, P. R., Kimball, T. R., Dolan, L. M., & Urbina, E. M. (2020). Accelerated early vascular aging among adolescents with obesity and/or type 2 diabetes mellitus. Journal of the American Heart Association, 9(10), e014891. https://doi.org/10.1161/JAHA.119.014891
Savić, M., Kurbalija, V., Ilić, M., Ivanović, M., Jakovetić, D., Valachis, A., Autexier, S., Rust, J., & Kosmidis, T. (2023). The application of machine learning techniques in prediction of quality of life features for cancer patients. Computer Science and Information Systems, 20(1), 381–404. https://doi.org/10.2298/CSIS220227061S
Slack, D., Hilgard, S., Jia, E., Singh, S., & Lakkaraju, H. (2020, February). Fooling LIME and SHAP: Adversarial attacks on post hoc explanation methods. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society (pp. 180–186). https://doi.org/10.1145/3375627.3375830
Su, Y., Dai, Y., Zeng, Y., Wei, C., Chen, Y., Ge, F., Zheng, P., Zhou, D., Dral, P. O., & Wang, C. (2023). Interpretable machine learning of two-photon absorption. Advanced Science, 10(8), 2204902. https://doi.org/10.1002/advs.202204902
Sun, H., Saeedi, P., Karuranga, S., Pinkepank, M., Ogurtsova, K., Duncan, B. B., Stein, C., Basit, A., Chan, J. C. N., Mbanya, J. C., Pavkov, M. E., Ramachandaran, A., Wild, S. H., James, S., Herman, W. H., Zhang, P., Bommer, C., Kuo, S., Boyko, E. J., & Magliano, D. J. (2022a). IDF diabetes atlas: Global, regional and country-level diabetes prevalence estimates for 2021 and projections for 2045. Diabetes Research and Clinical Practice, 183, 109119. https://doi.org/10.1016/j.diabres.2021.109119
Sun, K., Li, W., Saikrishna, V., Chadhar, M., & Xia, F. (2022b). COVID-19 datasets: A brief overview. Computer Science and Information Systems, 19(3), 1115–1132. https://doi.org/10.2298/CSIS210822014S
Sun, X., Bright, J. M., Gueymard, C. A., Acord, B., Wang, P., & Engerer, N. A. (2019). Worldwide performance assessment of 75 global clear-sky irradiance models using principal component analysis. Renewable and Sustainable Energy Reviews, 111, 550–570. https://doi.org/10.1016/j.rser.2019.04.006
Thabtah, F., Hammoud, S., Kamalov, F., & Gonsalves, A. (2020). Data imbalance in classification: Experimental evaluation. Information Sciences, 513, 429–441. https://doi.org/10.1016/j.ins.2019.11.004
Thenappan, S., Valan Rajkumar, M., & Manoharan, P. S. (2022). Predicting diabetes mellitus using modified support vector machine with cloud security. IETE Journal of Research, 68(6), 3940–3950. https://doi.org/10.1080/03772063.2020.1782781
Thomas, D. D., Corkey, B. E., Istfan, N. W., & Apovian, C. M. (2019). Hyperinsulinemia: An early indicator of metabolic dysfunction. Journal of the Endocrine Society, 3(9), 1727–1747. https://doi.org/10.1210/js.2019-00065
Velliangiri, S., & Alagumuthukrishnan, S. J. P. C. S. (2019). A review of dimensionality reduction techniques for efficient computation. Procedia Computer Science, 165, 104–111. https://doi.org/10.1016/j.procs.2020.01.079
Vrbančič, G., Pečnik, Š., & Podgorelec, V. (2022). Hyper-parameter optimization of convolutional neural networks for classifying COVID-19 X-ray images. Computer Science and Information Systems, 19(1), 327–352. https://doi.org/10.2298/CSIS210209056V
Wang, B., & Zhang, S. (2022). A new locally adaptive K-nearest centroid neighbor classification based on the average distance. Connection Science, 34(1), 2084–2107. https://doi.org/10.1080/09540091.2022.2088695
Wang, K., Tian, J., Zheng, C., Yang, H., Ren, J., Liu, Y., Han, Q., & Zhang, Y. (2021). Interpretable prediction of 3-year all-cause mortality in patients with heart failure caused by coronary heart disease based on machine learning and SHAP. Computers in Biology and Medicine, 137, 104813. https://doi.org/10.1016/j.compbiomed.2021.104813
Wang, Z., Han, D., Li, M., Liu, H., & Cui, M. (2022). The abnormal traffic detection scheme based on PCA and SSH. Connection Science, 34(1), 1201–1220. https://doi.org/10.1080/09540091.2021.1936455
Wu, H., Yang, S., Huang, Z., He, J., & Wang, X. (2018). Type 2 diabetes mellitus prediction model based on data mining. Informatics in Medicine Unlocked, 10, 100–107. https://doi.org/10.1016/j.imu.2017.12.006
Xu, Z., Shen, D., Nie, T., & Kou, Y. (2020). A hybrid sampling algorithm combining M-SMOTE and ENN based on Random forest for medical imbalanced data. Journal of Biomedical Informatics, 107, 103465. https://doi.org/10.1016/j.jbi.2020.103465
Yadav, A., Verma, H. K., & Awasthi, L. K. (2021). Voting classification method with PCA and K-means for diabetic prediction. In Innovations in Computer Science and Engineering: Proceedings of 8th ICICSE (pp. 651–656). Springer Singapore.
Ye, Y., Xiong, Y., Zhou, Q., Wu, J., Li, X., & Xiao, X. (2020). Comparison of machine learning methods and conventional logistic regressions for predicting gestational diabetes using routine clinical data: A retrospective cohort study. Journal of Diabetes Research, 2020.
Zhang, D., & Gong, Y. (2020). The comparison of LightGBM and XGBoost coupling factor analysis and prediagnosis of acute liver failure. IEEE Access, 8, 220990–221003. https://doi.org/10.1109/ACCESS.2020.3042848
Zhang, X., Yan, C., Gao, C., Malin, B. A., & Chen, Y. (2020). Predicting missing values in medical data via XGBoost regression. Journal of Healthcare Informatics Research, 4(4), 383–394. https://doi.org/10.1007/s41666-020-00077-1
Zhou, F., Du, X., Li, W., Lu, Z., & Huang, S. C. (2023). Fidan: A predictive service demand model for assisting nursing home health-care robots. Connection Science, 35(1), 2267791.
Zhu, C., Idemudia, C. U., & Feng, W. (2019). Improved logistic regression model for diabetes prediction by integrating PCA and K-means techniques. Informatics in Medicine Unlocked, 17, 100179. https://doi.org/10.1016/j.imu.2019.100179
