Clinical data-driven approach to identifying COVID-19 and influenza from a gradient-boosting model

Abstract

Coronavirus disease 2019 (COVID-19) and influenza are both caused by viruses, seriously affect human health, and are highly infectious. However, because the two diseases have almost identical clinical manifestations, separate polymerase chain reaction (PCR) tests must be used for patients in each disease group. This study proposes an automatic data-processing model based on artificial intelligence and gradient boosting to identify COVID-19 and influenza. The model can learn directly from raw data without the need for human input to delete empty data. The method operates in two stages: in the first stage, it evaluates and processes the data to reduce the dataset’s complexity using the light gradient-boosting machine (LightGBM); in the second stage, it builds a classification model for each disease group based on the extreme gradient boosting (XGBoost) method. The experiments showed that combining the two gradient-boosting models, LightGBM and XGBoost, to generate automatic COVID-19 and influenza classifiers from clinical data produced strong results and superior performance compared with either model alone, with an overall accuracy of over 99.96%. In the future, the developed model could enable patients to be diagnosed simply and accurately and thereby reduce countries’ testing costs for COVID-19 and similar pandemics that may arise.

1. Introduction

COVID-19, a disease caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) virus (Víctor J. Costela-Ruiz et al., 2020; Francesca Coperchini et al., 2020), was first detected in December 2019 in China. So far, the disease has spread to 223 countries and territories, paralyzing the global economy (Olagnier & Mogensen, 2020; Bakhiet & Taurin, 2021). Since the virus has mutated into several variants in a short time, there are still many cases of infection, despite extensive vaccination programs. When infected, many patients have characteristic symptoms of the disease, but others may be asymptomatic, making it difficult to determine exact groups of COVID-19 patients and distinguish them from influenza patients (Alzubi et al., 2021; Ben et al., 2021; Francesca Colavita et al., 2020). For patients treated in hospitals who may be infected with either COVID-19 or influenza, the similarity of the disease manifestations complicates the evaluation process, potentially leading to inaccurate diagnoses and inappropriate treatment regimens. Moreover, cross-infection between different disease groups and the increased disease burden on hospitals adversely affect patients’ health, spread disease among communities, and create the conditions for new variants of the viruses to emerge (Carlo Basile et al., 2020). Currently, in most countries, PCR testing is used to distinguish between influenza and COVID-19 (Md Manjurul Ahsan et al., 2020; Shuai Shuai Wang et al., 2021; Xueyan Mei et al., 2020; Feng Shi et al., 2021; Xianlong Zhou et al., 2021), but this method is only used if a patient shows symptoms of infection, requires many biological resources, and takes time to return results. An automatic diagnostic tool to analyze patients’ test results is therefore urgently needed.
Data analysis based on machine-learning methods plays an important role in medical diagnoses. Many studies aiming to predict COVID-19 from patients’ clinical data have yielded positive results. Yet, because of the severity of the pandemic and the resulting lockdown conditions, many studies have used small patient samples and based their work on limited data collection. Most studies have focused only on analyzing and identifying COVID-19 based on the results of blood tests; they have often selected parameters from the blood test indices of patients with the disease as input data for prediction models (Davide Brinati et al., 2020; AlJame et al., 2020; Abhirup Banerjee et al., 2020; Forrest Sheng Bao et al., 2020; Aurelle Tchagna Kouanou et al., 2021; Krishnaraj Chadaga et al., 2022). As a result, the accuracy of these models has been inconsistent, ranging between 81% and 99.4%. In the few studies that used clinical datasets for their predictive models, the accuracy was only in the range of 84–92% (Wei Tse Li et al., 2020; Patrick Schwab et al., 2020), although the datasets contained extensive information that could be used to screen disease groups.

Along with broader advances in science and technology, artificial intelligence is developing rapidly and has many applications in daily life. Many studies using artificial intelligence in medicine have obtained new results (Movassagh et al., 2021; Jafar et al., 2019; Omar A. Alzubi et al., 2020; Krishnaraj Chadaga et al., 2021; Matjaž Kukar et al., 2021; Aditya Pradhan et al., 2022). Research that employed manual methods to select data attributes for disease-prediction classification models (Shahla Faisal et al., 2022; Abhirup Banerjee et al., 2020; Forrest Sheng Bao et al., 2020) largely ignored features that were vital for disease treatment, which is a shortcoming of most studies that aim to classify patient groups. Using a dataset compiled from 51 studies on COVID-19 and seasonal influenza, Li (Wei Tse Li et al., 2020) applied a self-organizing map (SOM) as the key method to cluster attributes and select features from a clinical dataset in order to build a disease prediction model. Li selected 19 attributes from a total of 27 variables based on the SOM results and then used the XGBoost method (Chen & Guestrin, 2016) to build a classification model for COVID-19 and seasonal influenza. However, the method for selecting the attributes needed by the classification model was manual, making it extremely difficult to develop a multi-data model or continuously update the data. The data also contained many errors, reducing the accuracy of Li’s model and its effectiveness for current patient classification. Moreover, some important properties were omitted because the filter had shortcomings. Experiments on various platforms also found that they did not support the implementation of the model in practice.
Some other studies, as shown in Table 1, were case studies that used machine learning and clinical data on COVID-19 patients to classify and diagnose the disease. Table 1 describes these studies and provides an overview of the datasets and attributes, and the best sets of attributes selected for the models. Schwab (Patrick Schwab et al., 2020) used all the attributes to build a classification model, which reduced the accuracy of the model because the data were unbalanced and sparse. To address this shortcoming, Li (Wei Tse Li et al., 2020) increased the accuracy of the model by selecting important sets of attributes from the input clinical data using several data-filtering steps, but did not use an automatic attribute-selection process. Hence, neither Schwab’s nor Li’s model achieved high efficiency.

Table 1. Relevant works using machine learning based on clinical data to diagnose COVID-19

Table 2 lists some studies that used routine blood tests as key data for predictive models for COVID-19 patients. These studies not only processed the attributes for the classification models but also selected the attributes based on clinical blood-sample data. According to Table 2, the usual process for building a predictive model for COVID-19 (Davide Brinati et al., 2020; Abhirup Banerjee et al., 2020; Forrest Sheng Bao et al., 2020) is as follows:

Select the attributes from blood test indices (the common data type associated with routine blood tests) of published clinical trials in laboratories/hospitals;
Evaluate different attributes according to characteristics such as means, standard deviations, and correlation coefficients;
Conduct data preprocessing to impute or otherwise handle missing data;
Apply machine-learning algorithms to build classification models.

The results of these studies had accuracies of 84–86%. The prediction models (AlJame et al., 2020; Aurelle Tchagna Kouanou et al., 2021; Krishnaraj Chadaga et al., 2022) applied the synthetic minority oversampling technique (SMOTE) to counter imbalance in the data distribution, since the datasets were originally extremely unbalanced. These studies reported accuracies of 91–99.4%; the accuracy improved when the feature set for a model was small. Yet, the prediction results of these studies did not cover all the disease characteristics of COVID-19 or influenza.

Table 2. Related works on the detection of COVID-19 infection from routine blood tests

In practice, clinical data for disease diagnosis contain a multitude of attributes in categorical and continuous numeric forms, and there will almost always be missing and faulty data. Most machine-learning algorithms require complete, numeric input data; when dealing with missing data, practitioners often impute mean values for quantitative attributes and may delete data with a high error rate. This preprocessing solves the problem of empty data and converts the data into a numeric format suitable for machine-learning algorithms. However, it can affect the prediction results by discarding faulty data that contain valuable attributes specifically included to support diagnostic and therapeutic work. Proper evaluation of the original dataset is therefore essential to preserve the specificity of COVID-19 and/or influenza datasets.

Based on a review of previous studies, this study proposes an automated model for the analysis and prediction of COVID-19 and seasonal influenza based on the light gradient-boosting machine (LightGBM) method (Guolin et al., 2017; Zhong et al., 2018; Dehua Wang et al., 2017), as shown in Figure 1. In this research, we extracted and used 1485 samples of clinical test results from patients infected with seasonal influenza or COVID-19. The dataset contained 51 of the features listed in the COVID-19 and flu clinical data (Wei Tse Li et al., 2020). The development process for this study comprised four main stages:

Figure 1. Current study from this paper.

Preprocessing the features of the data, which included identifying missing data values, conducting a correlation analysis of the features, and filtering noisy data;
Using the LightGBM model to analyze outstanding features that could affect the prediction results;
Investigating modern prediction methods that may be suitable for clinical datasets and able to reduce feature datasets (by removing missing or erroneous data). In this way, we built a viable model to optimize patient classification and prediction results.

Finally, we evaluated and refined the model to ensure that the obtained results were appropriate and accurate. The experimental results demonstrated the superiority of this method over ones reported previously. In the future, the developed model will contribute to the accurate screening, classification, and prediction of patients with seasonal influenza or COVID-19. It will also minimize the costs of medical examinations and treatment processes, and it will reduce the time taken to obtain test results for COVID-19 patients. In the context of the current COVID-19 pandemic, the model provides an optimal, safe, and accurate solution to support the triage process for treating patients.

2. Materials and methods

This study was carried out in two phases, as shown in Figure 2:

Figure 2. Overview of the method.

Phase 1: Automatic reduction of the features of the clinical dataset, which included:

Investigating the relationships between the attributes, then removing erroneous and duplicate data items and unreliable amounts and figures;
Calculating the percentage of data loss from the input dataset and automatically deleting data according to the specified threshold;
Eliminating highly correlated attributes that could be duplicate data components;
Removing unimportant attributes using the LightGBM classification model.

Phase 2: The classification of patients with COVID-19 and influenza by:

Choosing the best-performing gradient-boosting algorithm;
Adjusting the prediction parameters in the selected algorithm, and choosing the appropriate result.

Our procedure is shown in Figure 2.

2.1. Visualization of missing data

Data errors are common in the medical field for many reasons, such as errors in data collection or data samples drawn from different clinical trials, laboratories, or sampling mechanisms (G. Madhu et al., 2019). Usually, when erroneous data are prepared for machine-learning models, entries with missing values are removed. However, this is not recommended for medical data because a feature can have many missing values yet still contain a great deal of information critical for diagnostic processes. Visualizing the relationships between faulty data columns can help researchers evaluate data before processing them, thus preventing the deletion of vital data from faulty datasets. Accordingly, we used the Python Missingno library (Aleksey Bilogur, 2018) to visualize erroneous data and minimize the mistaken deletion of necessary information. A full view of erroneous data supports feature selection for machine-learning models without reducing their accuracy. In this study, we also used the Missingno library (Japkowicz & Stephen, 2002) to visualize and examine the interdependence of features with error values. Figure 3 depicts a correlation matrix showing in detail the error levels for each feature on a scale ranging from low to high. The matrix reveals that the features with the highest error levels were cancer, smoking status, monocytes, thrombocytes, days to death, vaping status, eosinophils, hematocrit, activated partial thromboplastin time, red blood cells, basophils, and fibrinogen. Despite the high error rates, we needed to carefully consider other factors and the constraints of the features before deleting them because they could have been valuable for diagnosing COVID-19 or influenza.

Figure 3. Error values in data columns and their relationships to each other, other columns, and the overall data.

We generalized the relationships between faulty features using a dendrogram, as shown in Figure 4, which represents the relationships between the data columns based on their error values. The hierarchical relationships made it possible to group columns with the same error levels, helping us avoid discarding important values from the original dataset. The displayed results show that the risk factor feature had an error rate of 84.78%, but it was related to two other attribute groups (sections 2.1 and 2.2), as shown in Figure 4. Therefore, this attribute was required and could not be removed during data preprocessing. Similarly, the ground-glass opacity feature had an error rate of 93.6%, but it was related to all the attributes shown in section 2.1. This suggested that it was important to retain such raw clinical data on COVID-19 patients during preprocessing because of the information they could provide during the evaluation process. This made sense in terms of classifying and identifying each disease group because many patients have no symptoms. At the end of step 1, the evaluation of the erroneous data in the COVID-19 and flu dataset had helped us to avoid removing necessary information. In the next step, we automated the data processing, model building, and evaluation of the proposed classifier.

Figure 4. Error values in columns and relationships, displayed as a dendrogram.

2.2. Calculating the data missing from the dataset

To obtain an appropriate dataset, we needed to develop a method for detecting and removing all features of the input data with error rates greater than 99.9% to ensure a minimum amount of data remained for each feature. Assume we have a general dataset $D = \{(x_{i}, y_{i}) \in \mathbb{R}^{m} \times \{0, 1\}, \forall i = \overline{1, n}\}$, where $n$ is the number of samples, $m$ is the number of features, and $x_{i}$ and $y_{i}$ denote the features and the target variable of the $i$-th sample. The dataset under review, D_org (Wei Tse Li et al., 2020), described 1072 patients with seasonal or H1N1 influenza and 413 patients with COVID-19. The dataset contained 50 features, stored as 50 columns, and a labeled diagnosis variable was used as the categorical target variable. Processing with Algorithm 1 resulted in the removal of 12 attributes, as shown in Table 3.

Table 3. Summary of deleted data from the COVID-19 and flu set

Algorithm 1: Proposed drop-missing feature-selection algorithm

Input: D_org: the original dataset; thresh: the proportion of missing values at or above which a feature is dropped

Output: D_learn: the dataset without the dropped features

Begin

1: missing_total ← number of null values in each column of D_org

2: missing_percent ← missing_total / len(D_org)

3: data_missing ← D_org.columns[missing_percent ≥ thresh]

4: D_learn ← D_org.drop(columns = data_missing)

End
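The steps of Algorithm 1 can be sketched in a few lines of Python. This is only an illustrative, standard-library version under stated assumptions: the dataset is represented as a dictionary of columns, `None` marks a missing value, and the function name `drop_missing_features` and the toy values are hypothetical rather than taken from the study.

```python
from typing import Dict, List, Optional

Column = List[Optional[float]]  # None marks a missing value (assumption)


def drop_missing_features(data: Dict[str, Column], thresh: float) -> Dict[str, Column]:
    """Return the columns whose proportion of missing values is below thresh."""
    kept: Dict[str, Column] = {}
    for name, values in data.items():
        # Step 1: count the null values in this column.
        missing_total = sum(v is None for v in values)
        # Step 2: convert the count to a proportion of the rows.
        missing_percent = missing_total / len(values) if values else 1.0
        # Steps 3-4: drop the column when the proportion reaches the threshold.
        if missing_percent < thresh:
            kept[name] = values
    return kept


# Hypothetical toy data; the values are invented for illustration.
toy = {
    "Age": [34.0, 51.0, None, 67.0],
    "Fibrinogen": [None, None, None, 3.1],
    "VapingStatus": [None, None, None, None],
}
d_learn = drop_missing_features(toy, thresh=0.99)
print(list(d_learn))
```

With a pandas DataFrame `df`, the same idea is usually written with `df.isnull().mean()` to obtain the per-column proportions and `df.drop(columns=...)` to remove the selected columns.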

The results displayed in Table 3 show that data errors are very common in medical diagnosis data, although the deletion of these data must be considered carefully because important information may be lost. Missing values can be eliminated only if they are unimportant or have no relationships with other features (Hu et al., 2017; Emmanuel et al., 2021). In this research, we deleted features with error rates greater than 99%, as well as features that had few relationships with other features. The deleted features included “VapingStatus”, which had an error rate of 100%.

2.3. Calculation of correlations in the dataset

The relationships between features are important for the evaluation and classification of disease groups since correlations can directly affect predictions. Highly correlated features are not particularly informative and increase the complexity of algorithms (Japkowicz & Stephen, 2002; Cohen et al., 2002). To remove these features, we calculated the correlations for the entire set of quantitative features and then removed the features with high correlations. The measure applied to assess the relationship between two continuous variables was Pearson’s correlation. The resulting correlation values provided information about both the direction and the strength of the relationships between features. These correlation values always ranged from −1 to +1. The threshold chosen for the removal of correlated features was 0.95, and features with correlations greater than the threshold were discarded. This stage of our method resulted in the removal of the duration of illness feature.
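The correlation filter described above can be illustrated with a small standard-library sketch. Several details are assumptions rather than facts from the study: the absolute value of the coefficient is compared with the threshold, the later column of each highly correlated pair is the one dropped, and the names `pearson` and `drop_correlated` and the toy columns are hypothetical.

```python
import math
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # constant column: correlation is undefined, treated as 0 here
    return cov / (sd_x * sd_y)


def drop_correlated(data: Dict[str, List[float]], thresh: float = 0.95) -> List[str]:
    """Names of columns to drop: the later column of each pair with |r| > thresh."""
    names = list(data)
    to_drop: List[str] = []
    for j in range(1, len(names)):
        for i in range(j):
            if names[i] in to_drop:
                continue  # do not compare against a column that is already dropped
            if abs(pearson(data[names[i]], data[names[j]])) > thresh:
                to_drop.append(names[j])
                break
    return to_drop


# Hypothetical toy columns, invented for illustration.
toy = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.1],  # almost a multiple of "a"
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
}
print(drop_correlated(toy))
```

Which member of a correlated pair to keep is a design choice; a study-specific rule (for example, keeping the feature with fewer missing values) could be substituted in `drop_correlated`.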

2.4. Building a model based on LightGBM for feature engineering

Gradient-boosted decision trees (GBDTs; Friedman, 2001) are a method of using groups of decision trees to predict target labels. To build a GBDT model with K trees from a dataset of n samples, we used the process described in Figure 5, where:

Figure 5. Diagram of a GBDT.

K is the number of decision trees
$f_{k}(x_{i}), \forall k = \overline{1, K}$ is the learning function of the $k$-th decision tree

A GBDT is built by subdividing observations based on the attribute values of the input data. Finding the best split of the data is the most time-consuming part of this partitioning process. LightGBM, created by Microsoft Research Asia (Guolin et al., 2017), combines several optimization techniques to speed up this step. Like other GBDT methods, it combines the predictions of multiple weak learners (decision trees) to output a better final prediction. To build a LightGBM model with K trees and n samples, the prediction process according to the GBDT method is generally as follows:

$$\begin{aligned} \hat{y}_{i}^{(0)} &= 0 \\ \hat{y}_{i}^{(1)} &= f_{1}(x_{i}) = \hat{y}_{i}^{(0)} + f_{1}(x_{i}) \\ \hat{y}_{i}^{(2)} &= f_{1}(x_{i}) + f_{2}(x_{i}) = \hat{y}_{i}^{(1)} + f_{2}(x_{i}) \\ &\;\;\vdots \\ \hat{y}_{i}^{(K)} &= \sum_{k = 1}^{K} f_{k}(x_{i}) = \hat{y}_{i}^{(K - 1)} + f_{K}(x_{i}) \end{aligned} \tag{1}$$

where $\hat{y}_{i}^{(k)}$ is the predicted value of the $i$-th sample at the $k$-th iteration.
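The additive recursion in Eq. (1) can be made concrete with a short sketch. The three "trees" below are hypothetical one-split stumps with invented thresholds and scores; they are not fitted to any data and serve only to show how each stage adds one tree's output to the running prediction.

```python
from typing import Callable, List, Sequence

Tree = Callable[[Sequence[float]], float]  # maps a feature vector to a leaf score


def staged_predictions(trees: List[Tree], x: Sequence[float]) -> List[float]:
    """Return [y_hat^(0), y_hat^(1), ..., y_hat^(K)] for one sample x."""
    y_hat = 0.0            # y_hat^(0) = 0
    stages = [y_hat]
    for f_k in trees:
        y_hat = y_hat + f_k(x)   # y_hat^(k) = y_hat^(k-1) + f_k(x)
        stages.append(y_hat)
    return stages


# Hypothetical stumps (thresholds and leaf scores are invented).
trees: List[Tree] = [
    lambda x: 0.5 if x[0] > 37.5 else -0.5,
    lambda x: 0.25 if x[1] > 10.0 else -0.25,
    lambda x: 0.125 if x[0] > 39.0 else -0.125,
]

stages = staged_predictions(trees, [38.2, 8.0])
print(stages)
```

The last element of `stages` equals the sum of all tree outputs, which is the final line of Eq. (1).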

The cost function of LightGBM has two parts: a training error and regularization, as follows:

$$Cost = \sum_{i = 1}^{n} l(y_{i}, \hat{y}_{i}) + \sum_{k = 1}^{K} \Omega(f_{k}) \tag{2}$$

where $\Omega(f_{k}) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^{2}, \forall k = \overline{1, K}$. $T$ is the number of leaf nodes, $w$ is the vector of leaf-node scores, $\gamma$ is the leaf penalty coefficient, and $\lambda$ ensures that the leaf nodes’ scores are not too large.
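A small numeric sketch may help in reading the two parts of the cost in Eq. (2). The text does not specify the training-error function l, so the logistic (log) loss on predicted probabilities is used here purely as an assumption; the function names and numbers are likewise hypothetical.

```python
import math
from typing import List, Sequence


def tree_penalty(leaf_scores: Sequence[float], gamma: float, lam: float) -> float:
    """Omega(f) = gamma * T + 0.5 * lambda * ||w||^2 for one tree."""
    n_leaves = len(leaf_scores)                      # T
    squared_norm = sum(w * w for w in leaf_scores)   # ||w||^2
    return gamma * n_leaves + 0.5 * lam * squared_norm


def log_loss(y: int, p: float) -> float:
    """Logistic loss for one sample (assumed choice of l)."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def total_cost(y_true: Sequence[int], y_prob: Sequence[float],
               trees: List[Sequence[float]], gamma: float, lam: float) -> float:
    """Training error plus the regularization summed over all trees."""
    training_error = sum(log_loss(y, p) for y, p in zip(y_true, y_prob))
    regularization = sum(tree_penalty(ws, gamma, lam) for ws in trees)
    return training_error + regularization


# One tree with three leaves: 1.0 * 3 + 0.5 * 2.0 * (0.25 + 0.25 + 1.0) = 4.5
print(tree_penalty([0.5, -0.5, 1.0], gamma=1.0, lam=2.0))
```

Larger values of gamma penalize trees with many leaves, while larger values of lambda shrink the leaf scores, which is the role the text assigns to each coefficient.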

This is an ensemble approach that combines many decision trees. LightGBM trains multiple tree models in an additive manner, with each new tree model trained to predict the residuals (i.e., errors) of the previous models. LightGBM uses a gradient-based one-side sampling (GOSS) technique: GOSS retains all data instances with large gradients and randomly samples from the instances with small gradients. This gives a more accurate information-gain estimate than uniform random sampling at the same sampling rate. The exclusive feature bundling (EFB) function of LightGBM is effective for high-dimensional and sparse data. Features that never take non-zero values simultaneously are called exclusive features (EFs). The EFB function bundles EFs into a single feature, assigning the values of each original feature to separate bins so that they remain distinguishable. For this study, the datasets for COVID-19 and influenza patients were high-dimensional and sparse, so we used LightGBM to select important features from the clinical COVID-19 and flu dataset. The tree model then used the information-gain (IG) method to compute feature-importance values. The process for extracting important features, as performed by Algorithm 2, was as follows:

Algorithm 2: Extracting important features using the LightGBM classifier

Input: D_learn; iter = number of iterations, hyperparameters = {objective = “binary”, boosting_type = “goss”, n_estimators = 1000, class_weight = “imbalanced”}

Output: $D_{small}$ : The dataset

Begin

1: Initialize: FeatureImportances = {}

2: Model ← LGBM classifier (hyperparameters)

3: Repeat iter times

•Divide the $D_{learn}$ dataset into $D_{Train}$ and $D_{Test}$

•Train the model on $D_{Train}$ with the early-stopping hyperparameters

•Append model.FeatureImportances to FeatureImportances

End repeat

4: Calculate the average FeatureImportances over the iterations

5: Select ZeroFeatureImportances: the features whose average importance is zero

6: Return $D_{small}$ ← $D_{learn}$ − ZeroFeatureImportances

End
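Steps 4 to 6 of Algorithm 2 (averaging the importances and removing zero-importance features) can be sketched without any machine-learning dependency. The training loop itself is omitted: the per-iteration importances below are invented numbers standing in for what a fitted classifier would report (for LightGBM's scikit-learn wrapper, the `feature_importances_` attribute). The function names and feature names are hypothetical.

```python
from typing import Dict, List


def mean_importances(runs: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each feature's importance over all training iterations."""
    names = list(runs[0])
    return {name: sum(run[name] for run in runs) / len(runs) for name in names}


def zero_importance_features(runs: List[Dict[str, float]]) -> List[str]:
    """Features whose average importance over the iterations is exactly zero."""
    averaged = mean_importances(runs)
    return [name for name, value in averaged.items() if value == 0.0]


def reduce_features(columns: List[str], runs: List[Dict[str, float]]) -> List[str]:
    """Columns of D_small: the original columns minus the zero-importance ones."""
    zero = set(zero_importance_features(runs))
    return [c for c in columns if c not in zero]


# Invented importances from two hypothetical training iterations.
runs = [
    {"Age": 12.0, "Temperature": 30.0, "Sex": 0.0, "Basophils": 0.0},
    {"Age": 10.0, "Temperature": 28.0, "Sex": 2.0, "Basophils": 0.0},
]
print(zero_importance_features(runs))
print(reduce_features(list(runs[0]), runs))
```

A feature is removed only if it is unused in every iteration, which is why "Sex" survives here while "Basophils" does not.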

2.5. Gradient-boosting classification model

XGBoost was developed based on Friedman’s original gradient-boosting machine (GBM) model (Chen et al., 2015; Friedman, 2001). XGBoost is an improved implementation of gradient boosting with many advantages (Friedman, 2001; Ren et al., 2017; Jiang et al., 2019; Zhong et al., 2018): it supports parallel computation, which is reported to increase the processing speed by up to 10 times compared to GBM; it greatly reduces overfitting through a regularization mechanism; and it includes an automatic mechanism for handling missing values. Based on these advantages, we applied XGBoost to build a predictive model for COVID-19 and influenza.

In a GBM with dataset D, the prediction score after the k-th tree is calculated according to Eq. (1), and the cost is given by Eq. (2). In XGBoost, the optimization of the objective function is converted into the problem of finding the minimum of a quadratic function. Thanks to the regularization term, XGBoost has good resistance to overfitting. During training, the model repeatedly evaluates candidate splits and selects the one with the largest gain. XGBoost adds new trees by repeatedly splitting on features, and it carries out node splitting using its objective function. Adding a tree corresponds to learning a new function $f_{k}(X, \theta_{k})$ to fit the residual of the previous prediction. After training, K trees are obtained; each sample falls into one leaf node in each tree, and each leaf node corresponds to a score. Finally, the scores from all the trees are added together to obtain the prediction value of the sample. A summarizing flow chart for XGBoost is shown in Figure 6.

Figure 6. Detailed model of phase 2 of the proposed method.

The XGBoost method does not fit each new tree directly to the target variable y but to the residual of the previous model. The residuals are updated gradually according to a shrinkage coefficient, so that the sequence of models comprises more diverse decision trees. Models cease to be added when the maximum number of predictive models is reached or when all observations are correctly predicted. To fine-tune the hyperparameters of the boosting model, most attention should be paid to three main parameters: the number of trees K (n_estimators), the fraction of features sampled for each tree (colsample_bytree), and the shrinkage coefficient (learning_rate). In our work, we used the grid search method to fine-tune the hyperparameters of the model, performing the grid search on the training set with cross-validation (k = 5) to choose the best classification model. A new dataset $D_{small}$ was obtained from the results of stage 1, where n was the number of data samples and m was the number of features. The optimized parameters were used as the input for Algorithm 3 to classify and evaluate the whole Phase 2 process, as described in Figure 6.

Algorithm 3: Building the best XGBoost classifier

Input: $D_{small} = \{(x_{i}, y_{i}) \in \mathbb{R}^{m} \times \{0, 1\}, \forall i = \overline{1, n}\}$; hyperparameters Θ = {“n_estimators”, “colsample_bytree”, “learning_rate”, “eval_metric”: “auc”}

Output: Best_Model

Begin

1: Initialize: FeatureImportances = {}

2: Model ← XGBoostClassifier (Ɵ)

3: KFold← StratifiedKFold (n_splits = 5, shuffle = True, random_state = 2020)

4: For each fold i = 1 to 5 in KFold

•Divide the $D_{small}$ dataset into $D_{Train}$ and $D_{Test}$

•Train the model on $D_{Train}$ with the early-stopping hyperparameters

5: Calculate the roc_auc_score, accuracy_score, precision_score, recall_score, and f1_score over the iterations

6: Select the best model based on the scores from step 5

7: Visualize the mean values from step 5

8: Return the Best_Model from step 6

End
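The stratified five-fold split in step 3 of Algorithm 3 can be illustrated with a standard-library sketch. Algorithm 3 names scikit-learn's StratifiedKFold; the function below is only a simplified stand-in that preserves class proportions per fold and will not reproduce that library's exact shuffling. The function name is hypothetical, the label coding (0 for influenza, 1 for COVID-19) is an assumption, and the class counts are those reported for the dataset (1072 and 413).

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def stratified_kfold(labels: List[int], n_splits: int = 5,
                     seed: int = 2020) -> List[Tuple[List[int], List[int]]]:
    """Return (train_indices, test_indices) pairs with similar class ratios."""
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    # Deal the shuffled indices of each class round-robin into the folds.
    folds: List[List[int]] = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % n_splits].append(idx)

    splits = []
    for k in range(n_splits):
        test = sorted(folds[k])
        train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        splits.append((train, test))
    return splits


labels = [0] * 1072 + [1] * 413   # assumed coding: 0 = influenza, 1 = COVID-19
for train, test in stratified_kfold(labels):
    positives = sum(labels[i] for i in test)
    print(len(train), len(test), positives)
```

Each test fold then holds roughly 20% of the samples with about the same share of COVID-19 cases as the full dataset, which matters because the classes are unbalanced.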

Figure 6 describes the implementation of Phase 2, the first task of which was the construction and performance evaluation of the classification models. The second task involved predicting patient outcomes for COVID-19 and influenza using the optimal model, which was evaluated for accuracy. For this purpose, the new, reduced dataset was split in an 80:20 ratio, i.e., with 80% used for training and 20% for testing.

2.6. Performance metrics

We used the k-fold cross-validation (CV) method to systematically evaluate the performance of the COVID-19 and seasonal influenza classification model. K-fold CV is a statistical method that has been widely used by researchers to evaluate the performance of machine-learning classifiers (Cohen et al., 2002). In this study, we used five-fold CV to evaluate the performance of the proposed method. The COVID-19 and flu dataset had unbalanced classes; specifically, the COVID-19 patient samples represented only 27.8% of the total dataset. Thus, to obtain more reliable results, we used cross-validation to train and test the model on each subset of the original dataset and reported the mean of all the indices. More specifically, we divided the dataset randomly into five distinct subsets of equal size. At each execution step, one subset (20% of the dataset) was chosen as the validation dataset to test the performance of the proposed method, and the remaining four subsets (80% of the dataset) were used as the training dataset. This process was repeated automatically five times so that each subset served once as the validation dataset. Our application of various machine-learning models to the same dataset aimed to find optimal solutions for patient screening and classification decisions during the assessment of COVID-19 and seasonal influenza. In clinical medicine, “true” and “false” diagnoses may stem from a misclassification, a misdiagnosis, a mistaken exclusion, or further concepts applied to diagnostic goals. A correct COVID-19 diagnosis was designated a true positive (TP), a correct COVID-19 exclusion (i.e., a correctly identified influenza case) was designated a true negative (TN), an influenza case mistakenly diagnosed as COVID-19 was designated a false positive (FP), and a COVID-19 case that was omitted or mistakenly excluded was designated a false negative (FN). These quantities were also used as stopping conditions for initial data training.
To evaluate the performance of the proposed model, we applied several standard machine-learning metrics (Chen & Guestrin, 2016; Shahla Faisal et al., 2022; Aleksey Bilogur, 2018), including the following:

Accuracy: The proportion of correctly predicted cases (TP and TN) across all predictions of the dataset, calculated by the formula:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Sensitivity: Also called recall, the hit rate, or the true positive rate (TPR), this was the ratio of correct positive classifications to the total number of actual positive cases:

$$TPR = \text{Sensitivity} = \frac{TP}{TP + FN}$$

Specificity: The true negative rate (or specificity in clinical medicine) was the correct exclusion rate out of the total number of negative cases:

$$\text{Specificity} = \frac{TN}{TN + FP}$$

False positive rate (FPR): Also called fall-out, this was the rate at which negative samples were mislabeled as positive across all negative samples, calculated by the following formula:

$$FPR = 1 - \text{Specificity} = \frac{FP}{FP + TN}$$

Precision: Since the dataset had a larger sample of influenza than COVID-19 patients, this led to an imbalanced input dataset for the prediction model. Therefore, we used precision to determine the ratio of actually positive cases to the total number of cases labeled “positive” by the model. Precision is a term that refers to the “deterministic” or accurate positive classification of a model:

Precision = \frac{TP}{TP + FP}

F1 score: This was defined as the harmonic mean between precision and recall:

F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall}
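As a worked illustration of these metrics, the following minimal Python sketch computes them from the four confusion-matrix counts, using the standard definitions (sensitivity = TP/(TP + FN), specificity = TN/(TN + FP), FPR = FP/(FP + TN)). The function name and the example counts are hypothetical and are not results from this study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute standard binary-classification metrics from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # recall, true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    fpr = fp / (fp + tn)           # equals 1 - specificity
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "fpr": fpr,
        "precision": precision,
        "f1": f1,
    }

# Hypothetical counts, with COVID-19 taken as the positive class.
metrics = classification_metrics(tp=80, tn=200, fp=10, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The sketch assumes every denominator is nonzero; a production version would need to guard against folds that contain no positive or no negative predictions.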

Receiver operating characteristic (ROC) curves were used to assess the classification performance of the model. A ROC curve was produced from the (TPR, FPR) pairs obtained at different thresholds, with each point on the curve representing the pair for one threshold. We then calculated the area under the curve (AUC), a single figure summarizing the classification performance of the model.
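The construction of the ROC curve and its AUC can be illustrated with a short Python sketch. It is a simplified version under stated assumptions: labels are 0/1 with 1 as the positive class, a sample is predicted positive when its score is at or above the threshold, and the area is computed with the trapezoidal rule. The labels and scores are made up for the example.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, one per distinct threshold, starting at (0, 0)."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the piecewise-linear ROC curve (trapezoidal rule)."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )

# Hypothetical labels (1 = COVID-19, 0 = influenza) and predicted scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(y_true, scores)
area = auc(curve)
print(curve)
print(round(area, 4))
```

In this toy example, eight of the nine positive/negative pairs are ranked correctly, so the AUC is 8/9.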

3. Experimental results and discussion

We implemented and tested the experimental model in Python with the software packages Sklearn, Missingno, LightGBM, and XGBoost. We evaluated its performance through experiments on the clinical COVID-19 and flu dataset; the following sections present the experimental results and compare them with those of related studies.

3.1. Reducing the attribute set using LightGBM

The dataset for COVID-19 and seasonal influenza patients contained two types of attributes: categorical and quantitative. Both types are important for the classification and prediction of patients’ disease groups when conducting clinical tests. In the model, we encoded categorical values using a label-encoding technique; that is, we converted each value in a categorical column into a number, with the numerical labels always in the range 0 to n_categories − 1. We then used LightGBM’s categorical_feature parameter to determine the important attribute set for patients’ clinical data. The attribute-selection process was performed according to Algorithm 2, the parameters of which are described in Table 4.
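To make the label-encoding step concrete, here is a minimal pure-Python sketch. It assumes, as is common for label encoders, that codes are assigned in sorted order and run from 0 to n_categories − 1; the function name and the example column values are hypothetical.

```python
def label_encode(column):
    """Map each distinct category to an integer code in 0..n_categories-1."""
    categories = sorted(set(column))
    mapping = {category: code for code, category in enumerate(categories)}
    return [mapping[value] for value in column], mapping

# Hypothetical categorical column from a clinical record.
codes, mapping = label_encode(["negative", "positive", "unknown", "positive"])
print(codes)
print(mapping)
```

Columns encoded this way carry no real ordering, which is why they would be declared to LightGBM through categorical_feature rather than treated as ordinary numeric inputs.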

Table 4. Parameters for the model built and trained using the LightGBM classifier, along with those for the fit model, to select important attributes

As discussed in Section 1, using the COVID-19 and flu dataset, we automatically performed the processing, removal, and compensation of erroneous data, and we used a machine-learning model to select important features for disease identification, building a standard dataset that represented different attribute sets. Through this processing, we reduced the input dataset of 50 attributes to 24 important attributes, which we used to classify patients with COVID-19 or influenza based on clinical testing. To evaluate the accuracy and efficiency of the results obtained from the model, we compared the results at this stage with those of Li (Wei Tse Li et al., Citation2020). Details of the comparison are presented in Table 5.

Table 5. Results of a comparison between the feature selection of our model and Li’s model (Wei Tse Li et al., Citation2020)

As shown in Table 5, our model selected 24 attributes, and the results replicated almost the entire set of attributes proposed by Li (Wei Tse Li et al., Citation2020). However, there was one additional property: X-ray results. An advantage of our model was that it selected more than five additional attributes compared with (Wei Tse Li et al., Citation2020). Among these was ground-glass opacity, which Batista AFM’s team (Andrew P.Bradley, Citation1997) identified as a high-frequency finding that is particularly important for the diagnosis and treatment of patients with severe COVID-19, reporting that “49 out of 55 hospitalized COVID-19 patients” exhibited it. Therefore, we included this feature in the classification model in Section 2 to predict COVID-19 and influenza. The results demonstrated that the feature selection for our classification model was suitable, selecting important features for the predictive model.

3.2. Building a classification model

We built a partial model based on a small set of experimental data from clinical testing for influenza and COVID-19, comprising 1484 samples with 24 clinical attributes and 1 class label. The small dataset was divided in two, with 80% used as training data and 20% used as testing data. When building the predictive model and identifying the optimal process, we used five-fold cross-validation with the following rules:

Randomize the number of times n_round = 30;
Test the model with the classifier on the training set and list the loss function values;
Choose the lowest loss function value;
Experimentally adjust n_round to the smallest value found.
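The selection rule in the steps above can be sketched as follows. This is only an illustration: it assumes the loss function is the binary log loss (the text does not name it), and the per-round loss values are invented placeholders rather than measurements from this study.

```python
import math

def log_loss(y_true, probs, eps=1e-15):
    """Mean binary log loss; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def best_n_round(loss_by_round):
    """Return the round count with the lowest loss (ties go to fewer rounds)."""
    return min(loss_by_round, key=lambda n: (loss_by_round[n], n))

# Hypothetical mean validation losses recorded for several candidate n_round values.
loss_by_round = {10: 0.21, 20: 0.12, 30: 0.09, 40: 0.09}
chosen = best_n_round(loss_by_round)
print(chosen)

# Sanity check of the loss itself: an uninformative 0.5 prediction costs ln(2).
baseline = log_loss([1, 0], [0.5, 0.5])
print(round(baseline, 6))
```

Breaking ties in favor of the smaller round count mirrors the last step of the list, which adjusts n_round to the smallest value found.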

We also used several other algorithms to build our classifier based on the small dataset, such as XGBoost, LightGBM, and random forest. The area under the receiver operating characteristic curve (AUROC) for each algorithm is illustrated in Figure 7. We found that XGBoost produced the best AUROC values, so we chose it to build the classifier.

Figure 7. AUROC results for comparisons of algorithms in the gradient-boosting group.

To identify the optimal hyperparameters and inputs for the model’s classification algorithm (Algorithm 3), we used the Grid Search technique to search over the following parameters: “colsample_bytree,” “n_estimators,” and “learning_rate.” The selected result for hyperparameters was {“colsample_bytree”: −1, “n_estimators”: 400: 0.8, “learning_rate”: 0.7}. Details of the optimization of parameters are shown in Figures 8–10.
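Grid search simply evaluates every combination of candidate values and keeps the best-scoring one. The sketch below shows the idea in plain Python. The candidate values and the toy scoring function are placeholders chosen for illustration; in practice the score would be the cross-validated performance of an XGBoost classifier trained with each parameter combination.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Exhaustively score every parameter combination and return the best."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical candidate values for the three parameters named in the text.
param_grid = {
    "colsample_bytree": [0.6, 0.8, 1.0],
    "n_estimators": [100, 200, 400],
    "learning_rate": [0.1, 0.3, 0.7],
}

def toy_score(params):
    """Placeholder score that peaks at an arbitrary point of the grid."""
    return -(
        abs(params["colsample_bytree"] - 0.8)
        + abs(params["n_estimators"] - 200) / 1000
        + abs(params["learning_rate"] - 0.3)
    )

best_params, best_score = grid_search(param_grid, toy_score)
print(best_params)
```

The cost grows with the product of the grid sizes (27 combinations here), which is why the search is usually restricted to a few parameters at a time.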

Figure 8. Results for the hyperparameter colsample_bytree for the XGBoost classifier.

Figure 9. Results for the hyperparameter n_estimators for the XGBoost classifier.

Figure 10. Results for the hyperparameter log_loss for the XGBoost classifier.

We applied the best hyperparameters found to the testing dataset and used the receiver operating characteristic (ROC) curve, the precision-recall curve, and a confusion matrix to evaluate the trained model. Following the training process, we tested the classification process five times for the XGBoost algorithm. The average results across the cross-validation steps are shown in Figures 11–13, along with the results of the performance evaluation of the five-fold CV method on the small dataset.
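The confusion matrix used in this evaluation is just a tally of the four outcome types. A minimal sketch, assuming labels are coded as 1 for COVID-19 (the positive class) and 0 for influenza, with made-up labels and predictions:

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive and truth != positive:
            fp += 1
        elif pred != positive and truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn

# Hypothetical ground truth and model predictions for six patients.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn)
```

In a five-fold setting, these counts would be computed once per validation fold and then averaged or summed across folds.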

Figure 11. ROC curve for prediction.

Figure 12. Precision recall curve for prediction.

Figure 13. Confusion matrix for prediction.

Figure 14. Important parameters for the predictive model.

The set of important features obtained from the classification model was also compared with the feature set of the model built by Li (Wei Tse Li et al., Citation2020). Unlike (Wei Tse Li et al., Citation2020), we used one-hot encoding to encode the identity dataset with an input feature set of 20, which increased the performance somewhat compared to Li’s model. The XGBoost classifier reduced the 24 important attributes to 14. Ground-glass opacity was retained among the 14 important features used to predict COVID-19 or influenza. To check the correctness of our model, we compared its performance with that of Li’s model (Wei Tse Li et al., Citation2020). Li used several machine-learning techniques to build the classification model, including RIDGE regression, random forest, LASSO regression, and XGBoost. The first three techniques were tested on an R model, and using the AUC assessment method, the results were 96.6%, 95.3%, and 96.3%, respectively. The advantage of our approach was the use of the AUC, ROC, sensitivity, and specificity model evaluation methods. The details of the evaluation results are presented in Table 6.

Table 6. Performance comparison of the two models

4. Conclusion

From the synthetic COVID-19 and flu dataset, this manuscript built a classifier and predictor for COVID-19 and influenza based on clinical testing. The study demonstrated the use of machine learning through a model that combined two methods: LightGBM and XGBoost. The results from this manuscript show the following salient features:

In particular, the LightGBM model reduced the features while retaining those that were important and necessary for patient evaluation. Building on this, XGBoost allowed us to develop a superior classification model, which delivered an improved performance compared to many similar studies.
The model worked well at discovering important variables for the predictive model from the clinical test results. This increased the efficacy of the input data for the evaluation process and reduced the number of error samples. The results show that the LightGBM method can control and automatically process categorical variables from a raw dataset, thereby maximizing the value of the input variable set for the classification model. The model results are stable and can thus serve as a basis for deployment on larger sets of actual clinical data in hospitals, to make the diagnosis and detection of disease, along with treatment, more effective.
In the future, if a large enough amount of clinical data can be collected, the model will quickly form a digital map for the classification and diagnosis of COVID-19 and seasonal influenza, thereby helping to curb the spread of the COVID-19 pandemic. This could bring substantial benefits to patients suspected of COVID-19 infection, including high efficiency, safety for users, and savings in testing costs.

Data availability

All data generated or analysed during this study are included in this published article (and its supplementary information files).

Correction

This article has been corrected with minor changes. These changes do not impact the academic content of the article.

Acknowledgements

This research is funded by Thu Dau Mot University, Binh Duong Province, Vietnam under grant number DT.22.1-008.

Disclosure statement

This is to certify that, to the best of the authors’ knowledge, the content of this manuscript is original. The paper has not been submitted elsewhere, nor has it been published anywhere.

The authors confirm that the intellectual content of this paper is the original product of their work and that all assistance or funds from other sources have been acknowledged.

Additional information

Funding

The authors received no direct funding for this research.

References

Ahsan, M. M., Alam, T. E., Trafalis, T., & Huebner, P. (2020). Deep MLP-CNN model using mixed-data to distinguish between COVID-19 and Non-COVID-19 patients. Symmetry, 12(9), 1526. https://doi.org/10.3390/sym12091526
AlJame, M., Ahmad, I., Imtiaz, A., & Mohammed, A. (2020). Ensemble learning model for diagnosing COVID-19 from routine blood tests. Informatics in Medicine Unlocked, 21, 100449. https://doi.org/10.1016/j.imu.2020.100449
Alzubi, O. A., Alzubi, J. A., Alweshah, M., Qiqieh, I., Al-Shami, S., & Ramachandran, M. (2020). An optimal pruning algorithm of classifier ensembles: Dynamic programming approach. Neural Computing and Applications, 32(20), 16091–25. https://doi.org/10.1007/s00521-020-04761-6
Alzubi, J. A., Jain, R., Singh, A., Parwekar, P., & Gupta, M. (2021). COBERT: COVID-19 Question Answering System Using BERT. Arabian Journal for Science and Engineering, 1–11. https://doi.org/10.1007/s13369-021-05810-5
Bakhiet, M., & Taurin, S. (2021). SARS-CoV-2: Targeted managements and vaccine development. Cytokine & Growth Factor Reviews, 58, 16–29. https://doi.org/10.1016/j.cytogfr.2020.11.001
Banerjee, A., Ray, S., Vorselaars, B., Kitson, J., Mamalakis, M., Weeks, S., Baker, M., & Mackenzie, L. S. (2020). Use of machine learning and artificial intelligence to predict SARS-CoV-2 infection from full blood counts in a population. International Immunopharmacology, 86, 106705. https://doi.org/10.1016/j.intimp.2020.106705
Bao, F. S., Youbiao, H., Liu, J., Chen, Y., Qian, L., Zhang, C. R., Han, L., Zhu, B., Yaorong, G., Chen, S., Ming, X., & Ouyang, L. (2020). Triaging moderate COVID-19 and other viral pneumonias from routine blood tests. arXiv, 2005, 06546. https://doi.org/10.48550/arXiv.2005.06546
Basile, C., Combe, C., Pizzarelli, F., Covic, A., Davenport, A., Kanbay, M., Kirmizis, D., Schneditz, D., van der Sande, F., & Mitra, S. (2020). Recommendations for the prevention, mitigation and containment of the emerging SARS-CoV-2 (COVID-19) pandemic in haemodialysis centres. Nephrology Dialysis Transplantation, 35(5), 737–741. https://doi.org/10.1093/ndt/gfaa069
Ben, H., Guo, H., Zhou, P., & Shi, Z.-L. (2021). Characteristics of SARS-CoV-2 and COVID-19. Nature Reviews Microbiology, 19(3), 141–154. https://doi.org/10.1038/s41579-020-00459-7
Bilogur, A. (2018). Missingno: A missing data visualization suite. Journal of Open Source Software, 3(22), 547. https://doi.org/10.21105/joss.00547
Bradley, A. (1997). The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7), 1145–1159. https://doi.org/10.1016/S0031-3203(96)00142-2
Brinati, D., Campagner, A., Ferrari, D., Locatelli, M., Banfi, G., & Cabitza, F. (2020). Detection of COVID-19 infection from routine blood exams with machine learning: A feasibility study. Journal of Medical Systems, 44(8), 135. https://doi.org/10.1007/s10916-020-01597-4
Chadaga, K., Chakraborty, C., Prabhu, S., Umakanth, S., Bhat, V., & Sampathila, N. (2022). Clinical and laboratory approach to diagnose COVID-19 using machine learning. In Interdisciplinary sciences: Computational life sciences (Vol. 14, No.2, pp. 452–470). Springer.
Chadaga, K., Prabhu, S., Vivekananda, B. K., Niranjana, S., & Umakanth, S. (2021). Battling COVID-19 using machine learning: A review. Cogent Engineering, 9(1), 1958666.
Chen, T., Benesty, T. H., Khotilovich, V., & Tang, Y. (2015). Xgboost: Extreme gradient boosting. R Package Version 0.4-2, 1(4), 1–4.
Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. s.n.
Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2002). Applied multiple regression/correlation analysis for the behavioral sciences (3rd Edition). New York: Routledge. https://doi.org/10.4324/9780203774441
Colavita, F., Lapa, D., Carletti, F., Lalle, E., Bordi, L., Marsella, P., Nicastri, E., Bevilacqua, N., Giancola, M. L., Corpolongo, A., Ippolito, G., Capobianchi, M. R., & Castilletti, C. (2020). SARS-CoV-2 isolation from ocular secretions of a patient with COVID-19 in Italy with prolonged viral RNA detection. Annals of Internal Medicine, 173(3), 242–243. https://doi.org/10.7326/M20-1176
Coperchini, F., Chiovato, L., Croce, L., Magri, F., & Rotondi, M. (2020). The cytokine storm in COVID-19: An overview of the involvement of the chemokine/chemokine-receptor system. Cytokine & Growth Factor Reviews, 53, 25–32. https://doi.org/10.1016/j.cytogfr.2020.05.003
Costela-Ruiz, V. J., Illescas-Montes, R., Puerta-Puerta, J. M., Ruiz, C., & Melguizo-Rodríguez, L. (2020). SARS-CoV-2 infection: The role of cytokines in COVID-19 disease. Cytokine & Growth Factor Reviews, 54, 62–75. https://doi.org/10.1016/j.cytogfr.2020.06.001
Emmanuel, T., Maupong, T., Mpoeleng, D., Semong, T., Mphago, B., & Tabona, O. (2021). A survey on missing data in machine learning. Journal of Big Data, 8(1), 140. https://doi.org/10.1186/s40537-021-00516-9
Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29(5), 1189–1232. https://doi.org/10.1214/aos/1013203451
Guolin, K., Meng, Q., Finley, T., Wang, T., Chen, W., Weidong, M., Qiwei, Y., & Liu, T.Y. (2017). Lightgbm: A highly efficient gradient boosting decision tree. Advances in Neural Information Processing Systems, 30.
Jafar, A., Bharathikannan, ALzubi, B., Tanwar, S., Manikandan, R., Khanna, A., & Thaventhiran, C. (2019). Boosted neural network ensemble classification for lung cancer disease diagnosis. Applied Soft Computing, 80, 579–591. https://doi.org/10.1016/j.asoc.2019.04.031
Japkowicz, N., & Stephen, S. (2002). The class imbalance problem: A systematic study. Intelligent Data Analysis, 6(5), 429–449. https://doi.org/10.3233/IDA-2002-6504
Jiang, Y., Tong, G., Yin, H., & Xiong, N. (2019). A Pedestrian Detection Method Based on Genetic Algorithm for Optimize XGBoost Training Parameters. IEEE Access, 7, 118310–118321. https://doi.org/10.1109/ACCESS.2019.2936454
Kouanou, A. T., Attia, T. M., Feudjio, C., Djeumo, A. F., Mouelas, A. N., Nzogang, M. P., & Christian. (2021). An overview of supervised machine learning methods and data analysis for COVID-19 detection. Journal of Healthcare Engineering, 2021. https://doi.org/10.1155/2021/4733167
Kukar, M., Gunčar, G., Vovko, T., Podnar, S., Černelč, P., Brvar, M., Zalaznik, M., Notar, M., Moškon, S., & Notar, M. (2021). COVID-19 diagnosis by routine blood tests using machine learning. Scientific Reports, 11(1), 10738.
Li, W. T., Jiayan, M., Shende, N., Castaneda, G., Chakladar, J., Tsai, J. C., Apostol, L., Honda, C. O., Xu, J., Wong, L. M., Zhang, T., Lee, A., Gnanasekar, A., Honda, T. K., Kuo, S. Z., Yu, M. A., Chang, E. Y., Rajasekaran, Ongkeko, Ongkeko, W. M. (2020). Using machine learning of clinical data to diagnose COVID-19: A systematic review and meta-analysis. BMC Medical Informatics and Decision Making, 20(1), 247. https://doi.org/10.1186/s12911-020-01266-z
Madhu, G., Lalith, B., Bharadwaj, G., Nagachandrika, & Vardhan, K. S. (2019). A novel algorithm for missing data imputation on machine learning. s.n.
Mei, X., Lee, H.-C., Diao, K.-Y., Huang, M., Lin, B., Liu, C., Xie, Z., Ma, Y., Robson, P. M., Chung, M., Bernheim, A., Mani, V., Calcagno, C., Li, K., Li, S., Shan, H., Lv, J., Zhao, T., Xia, J., … Yang, Y. (2020). Artificial intelligence–enabled rapid diagnosis of patients with COVID-19. Nature Medicine, 26(8), 1224–1228. https://doi.org/10.1038/s41591-020-0931-3
Movassagh, A. A. et al.(27 March 2021). Artificial neural networks training algorithm integrating invasive weed optimization with differential evolutionary model. Journal of Ambient Intelligence and Humanized Computing, Volume, 2020. https://doi.org/10.1007/s12652-020-02623-6
Olagnier, D., & Mogensen, T. H. (2020). The covid-19 pandemic in Denmark: Big lessons from a small country. Cytokine & Growth Factor Reviews, 53, 10–12. https://doi.org/10.1016/j.cytogfr.2020.05.005
Pradhan, A., Prabhu, S., Chadaga, K., Sengupta, S., & Nath, G. (2022). Supervised Learning Models for the Preliminary Detection of COVID-19 in Patients Using Demographic and Epidemiological Parameters. Information, 13(7), 330.
Ren, X., Guo, H., Shenghong, L., Wang, S., & Jianhua, L. (2017). A novel image classification method with CNN-XGBoost model. s.n.
Schwab, P., Schütte, A. D., Dietz, B., & Bauer, S. (2020). Clinical predictive models for COVID-19: Systematic study. Journal of Medical Internet Research, 22(10), e21439. https://doi.org/10.2196/21439
ShahlaFaisal, Tutz, G., & Faisal, S. (2022). Nearest neighbor imputation for categorical data by weighting of attributes. Information Sciences, 592, 306–319. https://doi.org/10.1016/j.ins.2022.01.056
Shi, F., Xia, L., Shan, F., Song, B., Wu, D., Wei, Y., Yuan, H., Jiang, H., He, Y., Gao, Y., Sui, H., & Shen, D. (2021). Large-scale screening to distinguish between COVID-19 and community-acquired pneumonia using infection size-aware classification. Physics in Medicine & Biology, 66(6), 065031. https://doi.org/10.1088/1361-6560/abe838
Tchagna Kouanou, A., Mih Attia, T., Feudjio, C., Djeumo, A. F., Mouelas, A. N., Nzogang, M. P., Tchapga, C. T., Tchiotso, D., & Srinivas, S. (2021). An overview of supervised machine learning methods and data analysis for COVID-19 detection. Journal of Healthcare Engineering, 2021, 1–18. https://doi.org/10.1155/2021/4733167
Wang, S., Kang, B., Ma, J., Zeng, X., Xiao, M., Guo, J., Cai, M., Yang, J., Li, Y., Meng, X., & Xu, B. (2021). A deep learning algorithm using CT images to screen for Corona virus disease (COVID-19). European Radiology, 31(8), 6096–6104. https://doi.org/10.1007/s00330-021-07715-1
Wang, D., Zhang, Y., & Zhao, Y. (2017). LightGBM: An effective miRNA classification method in breast cancer patients. s.n.
Zhen, H., Melton, G. B., Arsoniadis, E. G., Wang, Y., Kwaan, M. R., & Simon, G. J. (2017). Strategies for handling missing clinical data for automated surgical site infection detection from the electronic health record. Journal of Biomedical Informatics, 68, 112–120. https://doi.org/10.1016/j.jbi.2017.03.009
Zhong, J., Sun, Y., Peng, W., Xie, M., Yang, J., & Tang, X. (2018). XGBFEMF: An XGBoost-based framework for essential protein prediction. IEEE Transactions on NanoBioscience, 17(3), 243–250. https://doi.org/10.1109/TNB.2018.2842219
Zhou, X., Wang, Z., Li, S., Liu, T., Wang, X., Xia, J., & Zhao, Y. (2021). Machine learning-based decision model to distinguish between COVID-19 and Influenza: A retrospective, two-centered, diagnostic study. Risk Management and Healthcare Policy, 14, 595–604. https://doi.org/10.2147/RMHP.S291498

Table 1. Relevant works using machine learning based on clinical data to diagnose COVID-19

Table 2. Related works on the detection of COVID-19 infection from routine blood tests