Interpretable machine learning for mortality modeling on patients with chronic diseases considering the COVID-19 pandemic in a region of Chile: A Shapley value based approach


Abstract

The main objective of this study was to evaluate and interpret different machine learning models to predict the probability of mortality from chronic oncological diseases (COD), chronic non-oncological diseases (CNOD), and COVID-19 in the Biobío Region, Chile, from 2016 to 2022. The causes of death attributed to COD and CNOD were recognized as conditions that would have necessitated palliative care. A retrospective cohort study was conducted using mortality data from the Chilean Ministry of Health. A total of 57,623 mortality records due to chronic diseases in the Biobío Region were considered during the study years. Data characteristics included sociodemographic factors (age, gender, residence, place and date of death) and causes of death. Seven classification models were trained to predict the probability of mortality from COD, CNOD and COVID-19: Multinomial Regression, Random Forest, Decision Tree, Support Vector Machine, Naive Bayes, XGBoost and Neural Networks. The calibration, discrimination and accuracy of these models were then evaluated. Additionally, Shapley Additive Explanations (SHAP) values were used to assess the interpretability of the models, and the BorutaShap algorithm was used for variable selection. The XGBoost, Random Forest and Multinomial Regression models had the best prediction performances. In all prediction cases, XGBoost had a slight advantage over the Random Forest and Multinomial Regression models, with an average global AUROC for repeated cross-validation of 0.624, 0.600, and 0.627, respectively. In addition, the global average accuracy favored XGBoost, at 0.642 compared to 0.641 and 0.633 for the other two models. The variables selected by the BorutaShap method, and therefore the relevant variables for the prediction, were age, place of death and date of death. XGBoost can be used to predict the probability of death from COD, CNOD and COVID-19 in mortality data from the Biobío Region. This model can be useful for allocating palliative care resources more effectively to the people who require them.


1 Introduction

Worldwide, the population is progressively aging, and chronic non-communicable diseases (CNDs) are the main cause of mortality (World Health Organization 2020). Each year 41 million people die from CNDs, representing 71% of mortality in the world. These diseases greatly affect people living in low- and middle-income countries, where more than three quarters of CND deaths (32 million) occur. They are often associated with older age groups, but evidence shows that 15 million of all deaths attributed to CNDs occur between the ages of 30 and 69 (World Health Organization 2021a).

The World Health Organization (WHO) World Health Statistics Report (World Health Organization 2020) noted that CNDs account for 7 of the top 10 causes of death globally, revealing an exponential increase in their impact over the period from 2000 to 2019. The leading causes of death worldwide from CNDs in 2019 were cardiovascular diseases (17.9 million deaths, representing 44% of all CND deaths), cancers (9.3 million, 23% of all CND deaths), respiratory diseases (4.1 million, 10% of all CND deaths) and diabetes (1.5 million deaths, 4% of all CND deaths) (World Health Organization 2021b).

In the Region of the Americas, CNDs are also the main cause of death, deeply affecting the well-being of people and their families while generating a great social and economic burden (Sánchez-Herrera et al. 2013; de la Salud 2017). The latest epidemiological data in Chile indicate that the total number of deaths registered in the country increased by 19.2% in 11 years, from 91,965 in 2009 to 109,658 in 2019. The main groups of causes of death were tumors or neoplasms with 28,492 deaths (26%), followed by diseases of the circulatory system with 28,079 deaths (25.6%), diseases of the respiratory system with 13,864 deaths (12.6%), other external causes of morbidity and mortality with 8,065 deaths (7.4%), and diseases of the digestive system with 7,996 deaths (7.3%) (INE 2018). Based on these statistics, we inferred that chronic non-oncological diseases (CNOD) represent a large percentage of deaths compared with deaths from chronic oncological diseases (COD) in that period. This results in greater projected requirements for general and palliative care (PC henceforth), because chronic diseases in an advanced state present different degrees of complexity and needs, are associated with a limited life expectancy, and evolve progressively in the medium term toward the end of life (Batiste-Alentorn et al. 2017; Meléndez and Limón 2018).

The provision of care for COD in Chile is incorporated in the Pain Relief and Palliative Care Program (Pastrana 2012, 2021). The recent Law 21375, published in October 2021 (Ministerio de Salud 2021), is expected to allow the gradual incorporation of PC provision for CNOD. The health context generated by the COVID-19 pandemic cannot be ignored, because it pushed the provision of health care for people with COD and CNOD into the background. It is unknown how this completely unforeseen health situation affected the detection of chronic diseases that would have required PC, and their mortality; therefore, it is important to analyze the predictive behavior of these groups (COD and CNOD) in the pre-pandemic and pandemic periods of COVID-19.

Chow et al. (2001) and Han et al. (2012) pointed out that mortality prediction based on health professional judgment and/or experience is inaccurate and often optimistic, meaning that there is a tendency to overestimate patient survival in daily clinical practice. To address this situation, various authors have proposed comparative studies between different machine learning (ML) statistical techniques and logistic regression techniques (Cao et al. 2019; Desai 2020). In particular, Cary Jr et al. (2021) used logistic regression and a multilayer perceptron (MLP) to predict the probability of mortality at 30 days and 1 year for Medicare program beneficiaries older than 65 years treated for hip fracture in rehabilitation centers. These authors concluded that MLP models can reduce cognitive load and increase the ability to calibrate to local data, leading to clinical specificity in mortality prediction so that PC resources can be allocated more effectively. A similar conclusion was reached by Avati et al. (2018), who validated a hybrid model of machine learning versus deep learning to classify patients into two outcome categories, Intensive Care Unit (ICU) and non-ICU.

To ensure the interpretability of the models, we considered the proposal of Lundberg and Lee (2017), who introduced Shapley Additive Explanations (SHAP) values, which are calculated using the idea of the Shapley value (Hart 1989). Finally, BorutaShap (Keany 2020) was applied to reduce the number of explanatory variables. This methodology has recently been used by different authors. Chieregato et al. (2022) propose a hybrid machine learning/deep learning model for COVID-19 forecasting, using a CatBoost model and the Boruta algorithm with SHAP feature importance as the metric. Smith and Alvarez (2021) propose interpreting an ML model with Shapley values to predict the mortality of hospitalized patients with COVID-19 in a hospital in the region of Wuhan, China. Unlike these studies, this research evaluates ML models in the classification of COD, CNOD, and COVID-19 in order to identify the indirect need for PC and, in turn, guide the planning of the necessary resources for this type of care, which has recently been implemented in our country.

The purpose of this study was to evaluate and interpret different ML models designed to predict mortality from COD, CNOD and COVID-19, using chronic disease mortality data for people of all ages provided by the Department of Statistics and Information of Health (DEIS) of the Ministry of Health of Chile (Departamento de Estadísticas e Información de Salud (DEIS) 2022). This is the first study in Chile to propose the ML methodology for predicting mortality in people who died from causes in the ICD-10 classification (International Statistical Classification of Diseases and Related Health Problems, 10th Revision) that are used to indicate mortality in the population that would have required PC in the pre-pandemic and pandemic periods (Batiste-Alentorn et al. 2017).

2 Methodology

A retrospective cohort study (Acar et al. 2021) was conducted to identify the ML models that exhibited the highest predictive power. The study used mortality data, which are open-access and anonymous, obtained from the web page https://deis.minsal.cl//#datosabiertos (Departamento de Estadísticas e Información de Salud (DEIS) 2022). The data were deaths due to chronic diseases that would have required PC according to their ICD-10 codes, from 2016 to 2022. This database consists of 57,623 chronic disease mortality records with 7 variables. The variable “Age group” was created from the variable “Age in years” according to the classification used by Ministerio de Salud GdC (2019). The dependent variable was the basic cause of death according to the ICD-10 classification (Table 2), and the other 6 variables were the independent variables defined in Table 1.

Table 1 Independent Variable Records


Table 2 ICD-10 codes for conditions identified as potentially having palliative care needs.


The dependent variable used in this study was the basic cause of death, for which the records with chronic diagnoses in the Biobío Region were considered. For this classification, the proposal of Calvache et al. (2020) was followed, using the ICD-10 codes to identify deaths from COD, CNOD and COVID-19, respectively. Table 2 shows a description of the causes of death and the ICD-10 codes used to classify deceased persons who would have required PC. From this classification, the dependent variable is defined as follows: $Y = \begin{cases} 0, & \text{if the person dies from COD (ICD-10 codes C00-C96)} \\ 1, & \text{if the person dies from CNOD (all other codes, excluding C00-C96 and U07)} \\ 2, & \text{if the person dies from COVID-19 (ICD-10 code U07)} \end{cases}$
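The coding of the dependent variable can be illustrated with a short Python sketch. It is not the authors' code: the function name `classify_cause` and the assumed input format (a letter followed by at least two digits, with an optional dot) are hypothetical, and the sketch assumes the records have already been restricted to the chronic causes listed in Table 2, so that any code outside C00-C96 and U07 is treated as CNOD.

```python
import re


def classify_cause(icd10_code: str) -> int:
    """Map an ICD-10 basic-cause code to the class label Y.

    0 = COD (C00-C96), 2 = COVID-19 (U07), 1 = CNOD (any other code).
    Assumes the record already belongs to the chronic causes of Table 2.
    """
    # Normalize: drop spaces and dots, use upper case ("c34.1" -> "C341").
    code = icd10_code.strip().upper().replace(".", "")
    match = re.fullmatch(r"([A-Z])(\d{2})[0-9A-Z]*", code)
    if match is None:
        raise ValueError(f"Unrecognized ICD-10 code: {icd10_code!r}")
    letter, number = match.group(1), int(match.group(2))
    if letter == "C" and 0 <= number <= 96:
        return 0  # chronic oncological disease
    if letter == "U" and number == 7:
        return 2  # COVID-19
    return 1  # chronic non-oncological disease


labels = [classify_cause(c) for c in ["C34.1", "I21", "U07.1"]]
print(labels)
```

Checking COVID-19 before falling through to the CNOD label keeps the three classes mutually exclusive.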

The study population was randomly divided into a training cohort, in which the mortality risk algorithms were developed, and a validation cohort, in which the algorithms were applied and tested. The training cohort consisted of 70% of the data, and the validation cohort consisted of the remaining 30%. The records of deceased people were divided at random so that no record appeared in both the training and validation sets. In this study, we compared the performance of the models based on accuracy, the area under the ROC curve (AUC), and balanced accuracy, as well as sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). Readers can refer to the definitions of each of these performance measures in the Cary Jr et al. (2021) study. A Monte Carlo study consisting of 100 iterations was conducted to assess the precision of each model using the configuration outlined at the beginning of this paragraph.
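The repeated random 70/30 partition described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' implementation (the article reports using R for most models); the function name `monte_carlo_splits`, the seed, and the toy record count are assumptions.

```python
import random


def monte_carlo_splits(n_records, n_iterations=100, train_fraction=0.7, seed=2022):
    """Yield (train_idx, valid_idx) index lists for repeated random splits."""
    rng = random.Random(seed)  # fixed seed so the splits are reproducible
    n_train = round(train_fraction * n_records)
    indices = list(range(n_records))
    for _ in range(n_iterations):
        shuffled = indices[:]
        rng.shuffle(shuffled)
        # Slicing one shuffled list guarantees the two sets never overlap.
        yield shuffled[:n_train], shuffled[n_train:]


splits = list(monte_carlo_splits(n_records=1000))
train_idx, valid_idx = splits[0]
print(len(splits), len(train_idx), len(valid_idx))
```

In each iteration a model would be fitted on `train_idx` and scored on `valid_idx`, and the metrics would then be averaged over the iterations.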

2.1 Models

ML is the study of algorithms capable of improving or learning automatically through experience and the use of data (Mitchell 1997). The most common approaches to machine learning are supervised and unsupervised. In supervised learning, the algorithm can generalize or learn based on labeled data, for either classification or regression problems (Hastie et al. 2001). On the other hand, unsupervised learning attempts to extract features and patterns from unlabeled data (Hastie et al. 2001). ML techniques have been noted for their computational efficiency and for achieving better prediction results than those obtained with commonly used statistical techniques (Shin et al. 2021; Lang et al. 2022). Currently, ML is used in a variety of applications, such as finance, robotics, pattern recognition, natural language processing, medical diagnostics, recommendation algorithms for social networks, and prediction of people’s mode of travel, among many others (Ray 2019). In this study, ML models were used to predict mortality and the need for PC in people who died of chronic diseases in the Biobío Region from 2016 to 2022. In particular, the ML techniques used were Artificial Neural Networks (ANN), Support Vector Machine (SVM), Decision Trees (DT), Random Forest (RF), Naive Bayes and XGBoost, and the predictions of these models were compared with those of the Multinomial Regression model. The models used are described below.

Multinomial Regression Model: This model is also known as the logistic regression model with polytomous response, which is a generalization of the binomial logistic regression model (McCullagh and Nelder 1989), in which we want to estimate the probability that an individual experiences a specific event, given a set of variables that describe the specific characteristics of individuals. In the multinomial model, the endogenous variable has more than two alternatives as possible responses (Fernández and Fernández 2004).
Decision tree: It is a prediction model used in the field of artificial intelligence, whose objective is to predict the value or class of an observation from decision rules obtained from previous data. This technique can therefore be used to predict categorical and/or continuous variables (classification and regression trees) (Breiman et al. 1994). The decision tree is built from the narrative description of a problem and provides a graphic view of decision making. The structure begins with a node called the “root node”; at this node and at the internal nodes, decisions are made based on different attributes, and the branches indicate the decisions made. At the end of the decision tree are the terminal nodes, which represent the result of following a combination of decisions. Alternatively, a terminal node may be associated with the probability that the final result will take on a certain value.
Random Forest: It is a supervised ML model for classification that works by fitting different models and combining their results into a single model for the entire set (Hidalgo Ruiz-Capillas 2014). The method used in Random Forest to generate different models from the data is known as bootstrap aggregating, or bagging. This method generates a series of training datasets from the original data using bootstrap sampling, and each of these datasets is then used to train a single model.
Support Vector Machine: It is a supervised classification method whose objective is to determine the optimal boundary between two groups, and it can be extended to a larger number of groups. This method is used for classification or regression problems (depending on the response variable), but it is most often used for classification. Given 2 or more labeled data classes, it acts as a discriminative classifier, formally defined by an optimal hyperplane that separates the classes (Vargas et al. 2012).
Naive Bayes: It is a statistical classifier that predicts class membership based on probabilities. It assumes that, given that an observation belongs to a particular class, the effect of one attribute is independent of the values of the other attributes (Chen et al. 2020). This classifier has been identified as suitable for the medical field, given its high accuracy when applied to large databases, and it is especially useful, for example, in medical diagnoses. Other advantages include lower computational cost, ease of understanding as a probability-based classifier, and the need for only a single processing step if the data are discrete (Hariz 2012).
XGBoost (eXtreme Gradient Boosting): This model emerged as an efficient implementation of the Gradient Boosting method, developed by Chen et al. (2016). Its main features include penalty terms to avoid overfitting, proportional shrinking of leaf nodes, the use of the Newton-Raphson method for the loss function, and an efficient implementation for multiprocessor training.
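To make the multinomial regression model described at the start of this list more concrete, the sketch below shows how class probabilities can be obtained from linear scores with a softmax transformation, taking COD as the reference class. The coefficients, intercepts and the two-feature input (age in years and a numeric place-of-death code) are invented for illustration and are not estimates from the study.

```python
import math


def multinomial_probabilities(x, coefficients, intercepts):
    """Return class probabilities from a multinomial logit (softmax) model."""
    # One linear score per class: intercept + sum of coefficient * feature.
    scores = [
        b0 + sum(b * xi for b, xi in zip(beta, x))
        for beta, b0 in zip(coefficients, intercepts)
    ]
    max_score = max(scores)  # subtract the maximum for numerical stability
    exp_scores = [math.exp(s - max_score) for s in scores]
    total = sum(exp_scores)
    return [e / total for e in exp_scores]


# Hypothetical parameters for classes 0 (COD, reference), 1 (CNOD), 2 (COVID-19).
coefficients = [[0.0, 0.0], [0.03, 0.5], [0.01, 0.2]]
intercepts = [0.0, -2.0, -3.0]

probs = multinomial_probabilities([75.0, 1.0], coefficients, intercepts)
print([round(p, 3) for p in probs])
```

The three probabilities sum to one, and the predicted class is the one with the largest probability.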

2.2 BorutaShap through XGBoost

When ML techniques are used in decision-making processes, the interpretability of the models becomes important. In this context, Lundberg and Lee (2017) propose the Shapley Additive Explanations (SHAP) methodology, which is based on cooperative game theory. The main idea is to calculate the Shapley values for the model, which give a fair allocation of gains among the variables according to their contribution to the prediction of the target variable. In this way, SHAP values can be used as a measure of variable importance. Once SHAP is defined, we propose using BorutaShap, introduced by Effrosynidis and Arampatzis (2021), which is a variable selection method that seeks a reduced subset of variables by combining SHAP values as the importance measure with the Boruta algorithm (Kursa and Rudnicki 2010). With the subset of important variables, an interpretability analysis of the XGBoost model is carried out using SHAP, in order to determine consistently in which direction and with what magnitude the variables contribute to predicting the target variable.
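The "fair allocation of gains" behind Shapley values can be illustrated with an exact computation on a toy cooperative game. The sketch below is not the SHAP library and not the authors' code: the two "players" stand for features, and the coalition payoffs are invented numbers chosen only to show the mechanics.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values for a cooperative game.

    players: list of hashable player names (here, features).
    value: function mapping a frozenset of players to the coalition payoff.
    """
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for size in range(n):
            # Weight of coalitions of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of p when joining coalition s.
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


# Hypothetical payoffs: the "gain" achieved by each subset of features.
payoffs = {
    frozenset(): 0.0,
    frozenset({"age"}): 0.3,
    frozenset({"place"}): 0.2,
    frozenset({"age", "place"}): 0.6,
}
phi = shapley_values(["age", "place"], payoffs.__getitem__)
print(phi)
```

The values add up to the payoff of the full coalition, which is the additivity property that SHAP relies on. Exact enumeration grows exponentially with the number of features, which is why practical tools use model-specific algorithms or approximations.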

2.3 SHAP summary plot

To represent the impact of the variables under study, a SHAP summary plot (Lundberg et al. 2018) was used, which shows the importance of the features together with a summary of the SHAP dependence plots. In the chart, the features are ordered by their importance. Each row is a summary SHAP dependence plot of one variable. Each point represents the SHAP value of an individual. The color of each point denotes the value of the feature, from low (blue) to high (red). Black dots represent missing values. If the red points lie at higher SHAP values and the blue points at lower SHAP values, then the risk increases as the value of the feature increases. Because it combines feature importance with a summary of the dependence plots, the SHAP summary plot is useful for getting an overview of the SHAP analysis.
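The ordering of features in a summary plot can be sketched as follows, assuming the common convention of ranking features by their mean absolute SHAP value across individuals (the article does not state which ranking its plots use). The SHAP matrix below is invented for illustration, with one row per individual and one column per feature.

```python
def rank_features_by_mean_abs_shap(shap_rows, feature_names):
    """Order features by mean absolute SHAP value, largest first."""
    n = len(shap_rows)
    importance = {
        name: sum(abs(row[j]) for row in shap_rows) / n
        for j, name in enumerate(feature_names)
    }
    return sorted(importance.items(), key=lambda item: item[1], reverse=True)


feature_names = ["Age", "Place of death", "Date of death"]
# Hypothetical SHAP values for three individuals and one class.
shap_rows = [
    [0.10, -0.30, 0.05],
    [-0.20, 0.40, 0.00],
    [0.30, -0.20, -0.10],
]

ranking = rank_features_by_mean_abs_shap(shap_rows, feature_names)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Taking absolute values before averaging prevents positive and negative contributions from cancelling out, so a feature that pushes predictions strongly in both directions still ranks as important.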

2.4 Preprocessing

Data preprocessing initially consisted of coding the response variable and categorizing the Age group variable. Resampling methods were not used; therefore, the models were applied to the original imbalanced data. We used R version 4.2.2 (2022-10-31) as the statistical analysis tool, with R packages and functions such as multinom, randomForest, ctree, svm, naivebayes, and nnet to build the multinomial, RF, DT, SVM, Naive Bayes, and ANN models, respectively. For RF, the number of variables randomly sampled as candidates at each split was set to $\sqrt{6}$, and the number of trees to grow was set to 40. For SVM, “C-classification” and a radial kernel were used. For DT, the maximum depth of the tree was left at its default, i.e., maxdepth = Inf. Furthermore, we implemented XGBoost using Python and used SHAP values to compute the feature importance. The source codes for the ANN, XGBoost and Naive Bayes algorithms contain comprehensive details, and they can be obtained by contacting the corresponding author or by referring to the data availability statement.

3 Descriptive analysis of the results

3.1 Sample characteristics

The sample consisted of 57,623 mortality records due to chronic diseases for people residing in the Biobío Region between 2016 and 2022. Overall, the average age was 75 years, 51% of the records were men, and 50% of the deaths occurred at home. Of the people who died from COD, 66% died at home and 73% were over 65 years old. Of the people who died from CNOD, 51% died in hospitals or clinics, and these patients were significantly older. See Table 3.

Table 3 Characteristics of the mortality data in the Biobío region for the period 2016–2022.


3.2 Model performance

This section summarizes the classification efficiency metrics of the previously defined models for the COD and CNOD classes in the context of the COVID-19 pandemic. To obtain an overview of the predictive performance of the models, a Monte Carlo study was conducted with a total of 100 iterations. In each iteration, the original sample was randomly divided into a training sample, which consisted of approximately 70% of the original sample size, and a test sample containing the remaining observations. Consequently, the performance of each method was assessed using Monte Carlo cross-validation, based on the confusion matrix. Table 4 displays the average and standard deviation of the cross-validation metrics obtained from the confusion matrices. These metrics were calculated from 100 iterations for each of the models proposed in this study.

Table 4 Classification performance table (average and standard deviation of the confusion matrix computed across 100 iterations).


From Table 4 it can be seen that favorable metrics were obtained with the XGBoost, Random Forest and Multinomial Regression algorithms, the most favorable being XGBoost, which has a slightly higher predictive power than the other algorithms. In particular, the accuracy, AUC and balanced accuracy metrics favor this algorithm. Therefore, with the data used, XGBoost shows an advantage for modeling COD and CNOD mortality in the context of the COVID-19 pandemic. Moreover, the presence of NaN values in the SVM model’s PPV indicates the model’s difficulty in classifying instances as positive for the overlapping class Y = 2. This observation supports the findings of multiple authors, particularly Vuttipittayamongkol et al. (2021), who emphasize the detrimental effects of class imbalance and class overlap on SVM performance. Based on the classification results for the Y = 2 class, it is evident that the performance of the ML models degrades significantly under such circumstances.
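One plausible mechanism for a NaN in the PPV is that a classifier never predicts the minority class, so the denominator TP + FP is zero. The sketch below computes one-vs-rest metrics from a three-class confusion matrix to show this; the matrix is invented for illustration and does not reproduce the counts behind Table 4, and the function name `one_vs_rest_metrics` is hypothetical.

```python
import math


def one_vs_rest_metrics(confusion, k):
    """Sensitivity, specificity, PPV, NPV and balanced accuracy for class k.

    confusion[i][j] counts records of true class i predicted as class j.
    """
    n_classes = len(confusion)
    tp = confusion[k][k]
    fn = sum(confusion[k][j] for j in range(n_classes)) - tp
    fp = sum(confusion[i][k] for i in range(n_classes)) - tp
    tn = sum(sum(row) for row in confusion) - tp - fn - fp

    def ratio(num, den):
        # An empty denominator makes the metric undefined (NaN).
        return num / den if den > 0 else math.nan

    sensitivity = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Hypothetical confusion matrix in which class 2 is never predicted.
confusion = [
    [50, 30, 0],
    [20, 90, 0],
    [3, 7, 0],
]
metrics_class2 = one_vs_rest_metrics(confusion, 2)
print(metrics_class2)
```

In this toy case the sensitivity for class 2 is zero while the specificity is one, so the balanced accuracy falls to 0.5 even though overall accuracy can remain moderate.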

3.3 Variable selection using the BorutaShap algorithm

The most relevant variables for the classification of COD, CNOD and COVID-19 were analyzed using the BorutaShap algorithm. From the analysis, it was observed that the selected variables were 3 of the 6 under study, namely Date of death, Place of death and Age.

Figure 1 represents the average of the SHAP values for the variables selected by the algorithm, for each class (colors). The interpretation is that the higher the average SHAP value, the greater the contribution of that variable to predicting each class. We also observed that, for the COVID-19 class, the variable that contributed the most on average was Date of death, which is explained by the fact that the pandemic has developed in the last 2 years. On the other hand, for the COD class, the most influential variables were Place of death and Age. Finally, for CNOD, the most influential variable was Age.

Fig. 1. Feature Importance Plot: the 3 most important variables are listed and ordered by the gain method. The Date of death variable is the most important characteristic of the predictor using the gain method.

Figure 2 represents the SHAP values of each variable for the COD class. From this figure, it is observed that the older the age, the lower the contribution of Age to predicting the COD class. This is due to the higher prevalence of comorbidities in older people, which makes it more difficult for the model to distinguish between the COD and CNOD classes. On the other hand, for the variable Place of death, the lower the values of the variable, the greater the contribution to predicting the COD class; this is because, in general, the majority of people with oncological diagnoses die at home. Conversely, the higher the values of the variable Place of death, the lower its contribution to predicting the COD class, since admissions to hospitals or clinics are not necessarily due to this cause, but to its complications. Figure 3 shows the SHAP values of each variable, considering only the CNOD class. It can be seen that the older the person, the greater the SHAP value and therefore the contribution to the model. At the same time, the younger the person, the smaller the contribution to the model’s prediction. In the variable Place of death, values greater than 0 (1 and 2) have a greater contribution to predicting the CNOD class.

Fig. 2. SHAP Summary Plot: The graph shows the 3 most important variables evaluated by the SHAP method and the effects of each characteristic on the COD classification.

Fig. 3. SHAP Summary Plot: The graph shows the 3 most important variables evaluated by the SHAP method and the effects of each characteristic on the CNOD classification.

Finally, Figure 4 represents the SHAP values of each variable, considering only the COVID-19 class. From the figure, it is interesting to highlight the variable Date of death, which has a very strong contribution but in a negative direction. For low values (first years after 2016), the contribution of this variable is low, while recent years (high values of the variable) contribute positively to predicting death from COVID-19. On the other hand, for the variable Age, the greater its value, the greater the contribution of the variable to predicting the COVID-19 class.

Fig. 4. SHAP Summary Plot: The graph shows the 3 most important variables evaluated by the SHAP method and the effects of each characteristic on the classification of COVID-19.

4 Discussion and conclusion

In this study, different models were proposed for the classification of mortality from chronic diseases in the pre-pandemic and pandemic periods: Multinomial Regression, Random Forest, Decision Trees, Support Vector Machine, Naive Bayes, XGBoost and Neural Networks. The models were compared using various performance measures to capture their predictive and classification capabilities, with XGBoost obtaining the best result. In addition, the SHAP method was used to interpret the XGBoost model built with mortality data from chronic diseases in people of all ages in the Biobío Region from 2016 to 2022, and SHAP summary plots were obtained along with a variable importance plot. For the selection of variables, the BorutaShap method was used, and from this analysis the most significant variables for predicting or classifying mortality are Date of death, Place of death and Age.

With respect to the SHAP value of each class, we concluded that in the case of COD, the variable Age has a smaller effect on the classification if people die at advanced ages, the Place of death does not have a significant effect on the power of classification, and the variable Date of death reflects that the higher the values, the lower the power of classification.

For the CNOD case, the higher the value of Age, the higher the SHAP value and the greater the contribution to the model; this means that Age contributes more significantly to the prediction for older people. The variable Place of death has a greater impact on predicting the CNOD class when the place of death is a hospital or clinic. Finally, for the classification of mortality by COVID-19, we found that the variable Date of death has a very strong contribution only in the years of the pandemic (2020–2021). The variable Age has a greater effect in predicting the COVID-19 class only for deaths at older ages. It can be concluded that XGBoost techniques are useful for interpreting machine learning models and can describe the underlying relationships between features and results.

In the available empirical evidence, there are no studies that have considered the pandemic context to predict the probability of mortality in this specific population of people with chronic diseases. Therefore, this study is novel and a first approximation to discovering the characteristics of this study population using ML models. Other authors have used ML models for other population groups and without the pandemic context, including Hsu et al. (2021) and Cary Jr et al. (2021).

These results indicate that the importance of the need for PC is determined by the place of death and age. The analysis of these results shows that almost half of the people with chronic diseases die at home, and therefore it is up to primary health care in Chile to provide the necessary PC. Considering that the new legislation aims to improve care at the end of life by expanding the spectrum of care to all chronic diseases, it is relevant that public policies in this area allocate greater resources to primary health care. Currently in Chile, PC in primary care is delivered by general teams with a low level of specialization (Departamento de Estadísticas e Información de Salud (DEIS) 2022). Therefore, a greater focus on the preparation of dedicated and specialized PC teams is required.

The prediction of mortality from COVID-19 shows that the most relevant variable is age, which implies that older people are the most vulnerable to this disease. This is also explained by the fact that older people live with more than one chronic disease at the same time, making them more susceptible to serious illness (Han et al. 2012). The foregoing further reinforces that the focus of PC resources should be especially on the elderly and on home care, thus ensuring a better quality of life at the end of life.

During the pandemic, Chile implemented strict quarantine measures, and health services focused mainly on the care of patients with COVID-19 and very urgent cases related to other diseases. This context may have had an impact on the results related to place of death in different years. In addition, considering that, in Chile, the Palliative Care Program for patients in advanced stages of oncological diseases provides home care, the place of death becomes particularly significant for class 1 (person dies from CNOD). It is important to note that some geographic areas in the Biobío Region face challenges in accessing health centers, which may affect mortality rates from certain diseases, including COD and CNOD, and may lead to a higher number of deaths occurring at home. In Chile, Law 21375 was recently enacted, which includes PC for non-oncological chronic diseases. However, it is still in the early stages of implementation, and its impact on mortality and end-of-life care outcomes is still unknown.

4.1 Study limitations

Our study had some limitations. First, the data were obtained from an open database provided by the DEIS, which is the public institution in charge of health statistics. The mortality data obtained from https://deis.minsal.cl//#datosabiertos come from the official and unified system for collecting and analyzing health data in Chile, which has been in effect for over a decade. However, it is important to mention that the analyzed database has a reduced number of variables compared to the number that could be used to train ML models. This is due to a limitation in the sociodemographic variables that the authorized health authority records in the database. Second, in light of the results obtained, it is evident that all the models perform weakly when classifying deaths due to COVID-19. This can be attributed to the imbalanced nature of the original dataset, where the Y = 2 class is a minority, as well as to the overlap between this class and classes 0 and 1. To address this issue, following the recommendations of previous authors (Vuttipittayamongkol et al. 2021), a possible solution is to employ resampling techniques to balance the data. Additionally, overlap-based methods, such as the k-Nearest Neighbor (KNN) rule (although it has been criticized by other researchers (Sun and Wang 2011)) or the Support Vector Data Description (SVDD) algorithm (Tax and Duin 2004), can be considered. It is worth noting that class overlap has not been mathematically well characterized (Sun and Wang 2011), and a standardized measure to quantify the degree of overlap has not yet been established. We take this as a challenge for further research.

Disclosure statement

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability statement

The data that support the findings of this study are available in the Open Science Framework (OSF) at https://doi.org/10.17605/OSF.IO/2UPBW. These data were derived from the following resource available in the public domain: https://deis.minsal.cl//#datosabiertos.

References

Acar HC, Can G, Karaali R, Börekçi Ş, Balkan İİ, Gemicioğlu B, Konukoğlu D, Erginöz E, Erdoğan MS, Tabak F. 2021. An easy-to-use nomogram for predicting in-hospital mortality risk in covid-19: a retrospective cohort study in a university hospital. BMC Infect Dis. 21(1):1–12.
Avati A, Jung K, Harman S, Downing L, Ng A, Shah NH. 2018. Improving palliative care with deep learning. BMC Med Inform Decis Mak. 18(4):55–64.
Batiste-Alentorn XG, Novellas JA, Martínez CL, Calsina-Berna À. 2017. Manual de atención integral de personas con enfermedades crónicas avanzadas: aspectos clínicos. Elsevier Health Sciences.
Breiman LJ, Charles F, Stone J, Olshen RA. 1994. Classification and regression trees. New York: Wadsworth.
Calvache JA, Gil F, De Vries E. 2020. How many people need palliative care for cancer and non-cancer diseases in a middle-income country? Analysis of mortality data. Colombian J Anestesiol. 48(4):e201.
Cao Y, Fang X, Ottosson J, Näslund E, Stenberg E. 2019. A comparative study of machine learning algorithms in predicting severe complications after bariatric surgery. J Clin Med. 8(5):668.
Cary Jr MP, Zhuang F, Draelos RL, Pan W, Amarasekara S, Douthit BJ, Kang Y, Colón-Emeric CS. 2021. Machine learning algorithms to predict mortality and allocate palliative care for older patients with hip fracture. J Am Med Dir Assoc. 22(2):291–296.
Chen S, Webb GI, Liu L, Ma X. 2020. A novel selective naïve Bayes algorithm. Knowl Based Syst. 192:105361.
Chen T, Guestrin C. 2016. Xgboost: A scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, Association for Computing Machinery, New York, NY, USA, p. 785–794.
Chieregato M, Frangiamore F, Morassi M, Baresi C, Nici S, Bassetti C, Bnà C, Galelli M. 2022. A hybrid machine learning/deep learning covid-19 severity predictive model from ct images and clinical data. Sci Rep. 12(1):1–15.
Chow E, Harth T, Hruby G, Finkelstein J, Wu J, Danjoux C. 2001. How accurate are physicians’ clinical predictions of survival and the available prognostic tools in estimating survival times in terminally ill cancer patients? A systematic review. Clin Oncol. 13(3):209–218.
de la Salud OP. 2017. Las dimensiones económicas de las enfermedades no transmisibles en américa latina y el caribe. https://iris.paho.org/bitstream/handle/10665.2/33994/9789275319055-spa.pdf?sequence=1&isAllowed=y
Departamento de Estadísticas e Información de Salud (DEIS) C. 2022. Defunciones por causa. https://deis.minsal.cl//#datosabiertos
Desai RJ, Wang SV, Vaduganathan M, Evers T, Schneeweiss S. 2020. Comparison of machine learning methods with traditional models for use of administrative claims with electronic medical records to predict heart failure outcomes. JAMA Network Open 3(1):e1918962.
Effrosynidis D, Arampatzis A. 2021. An evaluation of feature selection methods for environmental data. Ecol Inform. 61:101224.
Fernández VP, Fernández RSM. 2004. Regresión logística multinomial. Cuadernos de la Sociedad Española de Ciencias Forestales 18:323–327.
Han PK, Lee M, Reeve BB, Mariotto AB, Wang Z, Hays RD, Yabroff KR, Topor M, Feuer EJ. 2012. Development of a prognostic model for six-month mortality in older adults with declining health. J Pain Symptom Manage. 43(3):527–539.
Hariz WHM, Adnan M, Rashid A. 2012. Hybrid approaches using decision tree, naive bayes, means and Euclidean distances for childhood obesity prediction. Int J Softw Eng Appl. 6(3):99–106.
Hart S. 1989. Shapley value. In: Eatwell J, Milgate M, Newman P, editors. Game theory. London: Springer. p. 210–216.
Hastie T, Tibshirani R, Friedman J. 2001. The elements of statistical learning. Springer series in statistics. New York: Springer.
Hidalgo Ruiz-Capillas S. 2014. Random forests para detección de fraude en medios de pago. Master’s thesis.
Hsu J-F, Yang C, Lin C-Y, Chu S-M, Huang H-R, Chiang M-C, Wang H-C, Liao W-C, Fu R-H, Tsai M-H. 2021. Machine learning algorithms to predict mortality of neonates on mechanical intubation for respiratory failure. Biomedicines 9(10):1377.
INE. 2018. Anuario de estadísticas vitales. https://www.ine.cl/estadisticas/sociales/demografia-y-vitales/nacimientos-matrimonios-y-defunciones
Keany E. 2020. Borutashap: A wrapper feature selection method which combines the boruta feature selection algorithm with Shapley values, Zenodo. [accessed 2021 October 25] https://zenodo.org/record/4247618.
Kursa MB, Rudnicki WR. 2010. Feature selection with the boruta package. J Stat Softw 36(11):1–13. https://www.jstatsoft.org/index.php/jss/article/view/v036i11
Lang X, Wu D, Mao W. 2022. Comparison of supervised machine learning methods to predict ship propulsion power at sea. Ocean Eng. 245:110387. https://www.sciencedirect.com/science/article/pii/S0029801821016802
Lundberg SM, Lee SI. 2017. A unified approach to interpreting model predictions. In: Advances in Neural Information Processing Systems. Vol. 30.
Lundberg SM, Erion GG, Lee S-I. 2018. Consistent individualized feature attribution for tree ensembles. arXiv preprint arXiv:1802.03888
McCullagh P, Nelder J. 1989. Generalized linear models ii. London: Chapman and Hall.
Meléndez A, Limón E. 2018. Del “paliativo no oncológico” al “paciente crónico avanzado”. más allá de las palabras. el reto de los cuidados paliativos del siglo xxi, Monografía SECPAL sobre Cronicidad Avanzada. p. 13–15.
Ministerio de Salud, C. 2021. Ley 21375 consagra los cuidados paliativos y los derechos de las personas que padecen enfermedades terminales o graves. http://bcn.cl/2s7z9
Ministerio de Salud GdC. 2019. Matriz de cuidados a lo largo del curso de vida. https://www.minsal.cl/wp-content/uploads/2018/09/Matriz-de-cuidados-a-lo-largo-del-curso-de-vida-2019.pdf
Mitchell TM. 1997. Machine learning. New York: McGraw-Hill.
Pastrana T, Lima L, Centeno C, Wenk R, Eisenchlas J, Monti C, Rocafort J. 2012. Atlas de cuidados paliativos en latinoamérica.
Pastrana T, Lima L, Sánchez-Cárdenas M, Steijn D, Garralda E, Pons-Izquierdo JJ, Centeno C. 2021. Atlas de cuidados paliativos de latinoamérica 2020.
Ray S. 2019. A quick review of machine learning algorithms. In: 2019 International Conference on Machine Learning, Big Data, Cloud and Parallel Computing (COMITCon), p. 35–39.
Sánchez-Herrera B, Carrillo-González GM, Barrera-Ortiz L, Chaparro-Díaz L. 2013. Carga del cuidado de la enfermedad crónica no transmisible. Aquichan 13(2):247–260.
Shin S, Austin PC, Ross HJ, Abdel-Qadir H, Freitas C, Tomlinson G, Chicco D, Mahendiran M, Lawler PR, Billia F, Gramolini A, Epelman S, Wang B, Lee DS. 2021. Machine learning vs. conventional statistical models for predicting heart failure readmission and mortality. ESC Heart Failure 8(1):106–115. doi:10.1002/ehf2.13073.
Smith M, Alvarez F. 2021. Identifying mortality factors from machine learning using Shapley values–a case of covid19. Expert Syst Appl. 176:114832.
Sun H, Wang S. 2011. Measuring the component overlapping in the Gaussian mixture model. Data Min Knowl Discov 23:479–502.
Tax DM, Duin RP. 2004. Support vector data description. Mach Learn. 54:45–66.
Vargas J, Conde MB, Paccapelo MV, Zingaretti ML. 2012. Máquinas de soporte vectorial: metodología y aplicación en R. In: Décimo Congreso Latinoamericano de Sociedades de Estadística.
Vuttipittayamongkol P, Elyan E, Petrovski A. 2021. On the class overlap problem in imbalanced data classification. Knowl Based Syst. 212:106631.
World Health Organization. 2020. World health statistics 2020 monitoring health for the sdgs sustainable development goals. https://apps.who.int/iris/bitstream/handle/10665/332070/9789240005105-eng.pdf
World Health Organization. 2021a. Fact sheets. noncommunicable diseases. https://www.who.int/news-room/fact-sheets/detail/noncommunicable-diseases
World Health Organization. 2021b. The global health observatory explore a world of health data. https://www.who.int/data/gho/data/themes/topics/topic-details/GHO/ncd-mortality
