
Comparing lifetime estimates of probability of default for refinancing operations with survival analysis and ensemble methods


Abstract

We compare different models for estimating the probability of default over a time horizon, considering censored data. Models were fitted from a survival analysis perspective, considering both classic models (Cox Proportional Hazards, with penalized variations) and ensemble machine learning methods (boosting and bagging). Using a dataset of credit card refinancing operations, we assess accuracy in both out-of-sample and out-of-time observations. We assess well-established metrics, such as the concordance index, and measures that indicate calibration power (integrated Brier score and time-dependent dynamic AUC). Results show that a boosting approach with component-wise regression as the base learner outperforms the other models for short-term operations (36 months), in contrast to longer-term transactions (60 months), where the Cox Proportional Hazards model (with and without penalization) yielded better results.

1 Introduction

The recent spread of available data and the increase in computational processing power allow the development of more complex and robust predictive and prescriptive quantitative models in many fields. In the financial industry, models that demand more granular data and enhanced computing capability are becoming increasingly common. These models support decision-making in various tasks such as customer segmentation, portfolio selection, derivatives pricing, and risk management. However, banks and regulators still have concerns about the applicability of these quantitative and computational models since, although useful, they may conceal potential and unknown systemic risks.

One common use of quantitative models in financial institutions relates to credit risk management, since credit risk is one of the major risks that financial institutions face. Credit risk is mainly associated with potential losses due to the possibility of a borrower defaulting. In this context, credit scoring is one of the approaches to risk management that has been widely used over the past years (Thomas, Crook, and Edelman 2017), involving an estimate of the probability of an outcome of interest, mainly default.

Traditionally, banks make use of statistical models to measure risk and make decisions about credit transactions. For instance, quantitative models are already used in decisions about (i) granting or rejecting a credit loan, (ii) defining limits of exposure to default risk, (iii) establishing interest rates of loans, (iv) calculating provisions necessary to cover expected losses, and (v) setting equity capital to comply with regulatory requirements.

More specifically, due to the frequent changes in banking regulations, there are several techniques to model the Probability of Default (PD) of a loan or a borrower. Despite the overall guidance given by the Basel Committee and by the Central Banks of many countries, there is some flexibility for financial institutions to decide which methodology they will use to estimate PD for managerial and regulatory purposes.

Taking into account the broad academic literature on PD and the current regulation, one important research topic involves the study of credit risk in the context of the new guidelines from the International Financial Reporting Standard 9 (IFRS 9) issued by the International Accounting Standards Board (IASB). IFRS 9 requires relevant changes in modeling PD; therefore, the topic of our paper is relevant both for academics and practitioners. In particular, the new accounting standard calls for an estimate of lifetime expected credit losses instead of a sole estimate within a specific time period (i.e., 12 months) or when an impairment occurs.

In this study, we investigate an alternative approach to analyze the probability of default until the maturity of credit card refinancing transactions, exploring Ensemble Methods and Survival Analysis with tree base learners. This paper compares (i) the baseline Cox Proportional Hazards (Cox PH) model, (ii) Cox regression with elastic net regularization, (iii) Survival Trees, (iv) Random Survival Forest (RSF), (v) Gradient Boosted Survival Trees, and (vi) Component-Wise Gradient Boosted Survival Trees. The dataset consists of refinancing credit card loans. Differently from other studies that focus on the traditional analysis of credit applications, we investigate a different type of borrower: one who has already defaulted on the original loan and aims at refinancing the non-performing past loan.

This paper is structured as follows. In the next section, we discuss the context and review some of the relevant related literature. We then describe the material and methods used in the study. Finally, we present our results and conclusions, highlighting the contribution of the paper to the academic literature and to the practice of credit risk management, discussing limitations and challenges to study PD from a survival modeling perspective.

2 Theoretical background

2.1 Credit risk

The Basel Committee on Banking Supervision (BCBS) provided an approach for financial institutions to measure expected credit risk losses by estimating specific parameters. From these parameters, financial institutions can calculate the value of the Expected Credit Loss (ECL), defined according to Equation (1) (BCBS 2006):

$$ECL = PD \cdot LGD \cdot EAD \quad (1)$$

where the risk parameters are: (i) Probability of Default (PD), i.e. the probability that a borrower will default on the agreed contract, (ii) Loss Given Default (LGD), i.e. the percentage of the value of a loan that is lost when the borrower defaults, and (iii) Exposure at Default (EAD), i.e. the credit exposure, in monetary values, at the time of default.
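As a quick illustration of Equation (1), the minimal sketch below computes the ECL for a hypothetical exposure; all figures are made up for the example and are not taken from the paper.

```python
# Minimal sketch of Equation (1) with hypothetical parameter values.
pd_12m = 0.04        # probability of default over the relevant horizon
lgd = 0.45           # loss given default: 45% of the exposure is lost
ead = 10_000.00      # exposure at default, in monetary units

ecl = pd_12m * lgd * ead
print(f"Expected Credit Loss: {ecl:.2f}")   # 180.00
```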

Several techniques have been used to estimate credit risk parameters. However, the BCBS emphasizes the need for banks to monitor the effectiveness of their models in calculating the credit parameters PD, LGD, and EAD (BCBS 2005). Although PD models have been, in comparison with LGD and EAD models, more explored by academics and practitioners, recent regulation, e.g. IFRS 9, imposed the need for significant enhancements in addressing the probability of a loan becoming delinquent.

2.2 IFRS 9

Since 2018, a new accounting standard published by the International Accounting Standards Board (IASB), the IFRS 9, has been in effect in a large number of countries (International Accounting Standards Board 2014). This new standard changes the classification and measurement of financial assets and liabilities, impacting many elements of companies, such as the income statement, credit risk calculation, and data management.

Specifically for credit risk, the main change brought by the new standard involves the Expected Credit Loss (ECL) measurement, which is based on the definition of three risk stages as a criterion for its calculation (International Accounting Standards Board 2014), as summarized below (a schematic sketch follows the list):

  1. For performing credit positions that do not show a significant increase in risk, the expected 12-month loss must be calculated;

  2. For underperforming credit positions that are classified, based on criteria defined by the institution, as having a significant increase in risk, the expected loss must be calculated for the lifetime of the operation; and

  3. For non-performing or defaulted assets, the expected loss is calculated for the entire lifetime of the credit transaction.
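The staging logic above boils down to a simple rule: only Stage 1 exposures use a 12-month PD horizon, while Stages 2 and 3 use the lifetime of the operation. The sketch below is purely schematic, with a hypothetical `stage` input, and is not part of the paper's methodology.

```python
def ecl_horizon_months(stage: int, remaining_maturity_months: int) -> int:
    """Return the PD horizon (in months) implied by the IFRS 9 stage. Schematic only."""
    if stage == 1:                      # performing, no significant increase in risk
        return min(12, remaining_maturity_months)
    if stage in (2, 3):                 # significant increase in risk / credit-impaired
        return remaining_maturity_months
    raise ValueError("stage must be 1, 2 or 3")

print(ecl_horizon_months(stage=1, remaining_maturity_months=36))  # 12
print(ecl_horizon_months(stage=2, remaining_maturity_months=36))  # 36
```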

The inclusion of risk aggravation stages changes the estimation of the ECL risk parameters, including the PD. Rather than presenting an estimate for a limited time horizon, i.e. a 12-month period, financial institutions should generate the lifetime probability of default of the credit exposure. A lifetime estimate of the probability of default is a challenging issue for financial institutions, due to the characteristics of their credit exposures, such as high volume and long-term maturity.

2.3 Probability of default

Although PD is used in the calculation of the ECL, banks can take advantage of default models for a variety of other purposes, such as (i) definition of credit limits, (ii) decision-making on a borrower’s eligibility, (iii) definition of interest rates, (iv) calculation of credit provisions, and (v) calculation of regulatory and economic capital to cope with credit losses. Specific problems related to credit risk and PD modeling represent relevant challenges in the banking industry and can be investigated with the use of machine learning and data mining methods, which are associated with a process for discovering patterns in data (Lefebvre-Ulrikson et al. 2016).

PD models aim at identifying the probability that a counterparty in a given transaction will not meet the clauses of a credit agreement. Banks build PD models using historical data on credit operations and on the personal characteristics and behavior of their customers. In some models, macroeconomic variables are also used for credit risk modeling and can have a significant impact on the models (Djeundje and Crook 2018).

Traditional statistical methods, such as logistic regression, discriminant analysis, and decision trees, have been used in credit risk applications (Hand and Henley 1997). More recently, however, advances in computing power and artificial intelligence algorithms have fostered other approaches to evaluating credit risk. For instance, Yeh and Lien (2009) compare the predictive accuracy of different classification models on a Taiwan credit card dataset. The authors find that a higher coefficient of determination is produced by artificial neural networks.

In contrast, Niloy and Navid (2018) conclude that Naive Bayes outperforms Logistic Regression using credit card data to model the probability of default. Lessmann et al. (2015) compare 41 classification models on 8 datasets and also find that ANNs outperform several other individual classifiers, but recommend Random Forest (RF) as the benchmark for comparing new classification algorithms. The authors also suggest that outperforming Logistic Regression can no longer be interpreted as a signal of methodological advancement.

Survival Analysis (SA) applications in the financial context have been growing (Andreeva 2006; Andreeva, Ansell, and Crook 2007; Dirick, Claeskens, and Baesens 2017), since SA, which models the time until an event occurs, is appealing for estimating lifetime PD. In the context of credit operations, the event of interest can be the default. Therefore, SA is useful for analyzing PD throughout the total term of the credit exposure. In the context of survival analysis applied to credit risk, the literature has shifted from traditional models toward machine learning algorithms.

Narain (1992) applied survival analysis to credit risk management, fitting an accelerated life exponential model to 24-month loan data. In addition, the author built a scorecard using multiple regression and concluded that supporting the score with estimated survival times could lead to better credit-granting decisions. Chopra and Bhilare (2018) found that logistic regression outperforms Random Survival Forests (RSF) in out-of-sample evaluation, considering default for Small and Medium Enterprises (SMEs).

Fantazzini and Figini (2008) compare base decision tree classifiers with ensemble methods. The study reveals that ensemble methods outperformed classical methodologies, highlighting the usefulness of gradient boosting. More recently, Xia et al. (2021) propose the SurvXGBoost algorithm and indicate that it outperforms other benchmark models in terms of predictability and misclassification cost, considering the probability of default of a loan application. The dataset used was obtained from a major P2P lending platform in the US, covering January 2009 to December 2013.

Exemplifying the broad scope of studies using machine learning techniques, Bai, Zheng, and Shen (2021) show that the gradient boosting survival tree outperforms other existing methods in terms of C-index, KS, and AUC. The study was conducted using the Lending Club loan dataset retrieved from Kaggle, with operations between 2007 and 2015.

Mixing Survival Analysis and Machine Learning techniques allows the classification of samples into different groups, taking into account the time of occurrence of an event of interest. In the case of credit risk, the main event of interest is default, and within the context of survival analysis it is possible to assess the probability of default at a given time. Although SA and ML methods have been used mainly in medicine (Belle et al. 2011; Parizadeh et al. 2017; Balazy et al. 2019; Deepa and Gunavathi 2022; Chen et al. 2022), their applications in credit risk are growing at a fast pace, since the results can provide useful information in the field (Breiman 1984).

3 Materials and methods

3.1 Survival analysis

Survival Analysis methods explore the time to an event in a given population and, in comparison with other traditional classification models, such as Discriminant Analysis and Logistic Regression, add the feature of assessing probability over time (Bellotti and Crook 2008).

In a credit risk context, the probability that the borrower does not manifest the event of interest (default) beyond time t is quantified by the survival function:

$$S(t) = P(T > t) \quad (2)$$

Therefore, Survival Analysis can be used to investigate whether the borrower will default or not, as well as the change in the rate of occurrence of default up to a given time t:

$$h(t) = \lim_{\delta t \to 0} \frac{P(t \le T < t + \delta t \mid T \ge t)}{\delta t} \quad (3)$$

where T is the random variable associated with the survival function specified in Eq. (2), and h(t) is the hazard function that quantifies the event rate at time t conditional on survival up to t.
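A non-parametric estimate of S(t) in Eq. (2) can be obtained with the Kaplan-Meier estimator, which the paper later uses as a baseline. The sketch below uses scikit-survival (the library adopted in Section 3.9) on simulated loan data; the data-generating choices are illustrative assumptions only.

```python
import numpy as np
from sksurv.nonparametric import kaplan_meier_estimator

rng = np.random.default_rng(42)
n = 1_000
time = np.minimum(rng.exponential(scale=30, size=n), 36.0)  # months observed, capped at maturity
event = (time < 36.0) & (rng.random(n) < 0.5)               # True = default, False = right-censored

# S(t): estimated probability of not defaulting beyond each observed time
t, surv_prob = kaplan_meier_estimator(event, time)
for ti, si in list(zip(t, surv_prob))[:5]:
    print(f"S({ti:5.2f}) = {si:.3f}")
```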

Bellotti and Crook (2008) applied the semi-parametric Cox PH survival regression, showing that it has competitive performance compared to Logistic Regression and can provide lifetime probabilities of default. Durović (2019) investigated PD modeling under the IFRS 9 regulatory framework, highlighting the convenience of a time-to-event modeling approach.

3.2 Cox proportional-hazards model

The Cox Proportional-Hazards (Cox PH) model (Cox 1972) is one of the most traditional approaches based on time-to-event techniques. It assumes that the ratio of the hazard rates of any two groups is constant over time, i.e. the hazard functions are proportional.

Considering p covariates and a vector x = (x1, …, xp), the general form of the Cox model is given by:

$$\lambda(t \mid x) = \lambda_0(t)\, g(x'\beta) \quad (4)$$

where g(.) is a nonnegative function. The model comprises two components: a nonparametric baseline hazard λ0(t), which is left unspecified, and a parametric component, usually taken as:

$$g(x'\beta) = \exp(x'\beta) = \exp(\beta_1 x_1 + \dots + \beta_p x_p) \quad (5)$$

where β is a p×1 vector of parameters, one for each covariate. Parameter estimates are obtained via Breslow’s approximation to the partial log-likelihood (Kalbfleisch and Prentice 1973), using the Newton-Raphson method.

For penalized Cox models, a penalty parameter λ is considered in the partial log-likelihood (Verweij and Van Houwelingen 1994):

$$l_{\lambda}(\beta) = l(\beta) - \tfrac{1}{2}\lambda P(\beta) \quad (6)$$

where λ is a non-negative weight parameter and P(β) is the penalty function. Zou and Hastie (2005) propose the elastic net penalty, defined as:

$$P_{\lambda,\alpha}(\beta) = \sum_{j=1}^{p} \lambda\left(\alpha\,|\beta_j| + \tfrac{1}{2}(1-\alpha)\,\beta_j^2\right) \quad (7)$$

with λ > 0 and 0 < α ≤ 1, combining the l1 and l2 norms. The use of such penalties leads to well-known regression models (such as Lasso, Ridge, and Elastic Net) applied to the survival analysis context.
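In scikit-survival, this family of models can be fitted roughly as sketched below: CoxPHSurvivalAnalysis covers the unpenalized and ridge-penalized cases, and CoxnetSurvivalAnalysis covers the elastic-net case. The simulated features and the penalty values are illustrative assumptions, not the paper's configuration.

```python
import numpy as np
from sksurv.util import Surv
from sksurv.linear_model import CoxPHSurvivalAnalysis, CoxnetSurvivalAnalysis

rng = np.random.default_rng(0)
n, p = 500, 10
X = rng.normal(size=(n, p))                                   # hypothetical borrower covariates
time = np.minimum(rng.exponential(scale=30, size=n), 36.0)
event = (time < 36.0) & (rng.random(n) < 0.5)
y = Surv.from_arrays(event=event, time=time)                  # structured (event, time) target

cox_ph = CoxPHSurvivalAnalysis().fit(X, y)                    # classic Cox PH (Eqs. 4-5)
cox_ridge = CoxPHSurvivalAnalysis(alpha=1.0).fit(X, y)        # l2-penalized partial likelihood
cox_enet = CoxnetSurvivalAnalysis(l1_ratio=0.5).fit(X, y)     # elastic net (Eq. 7), mixing = 0.5

# Coxnet fits a whole regularization path, hence a coefficient matrix
print(cox_ph.coef_[:3], cox_ridge.coef_[:3], cox_enet.coef_.shape)
```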

3.3 Tree-based models

Tree-based models involve segmenting the covariate space into a number of simpler regions, providing an alternative to linear and additive models. In this context, tree-based models are suitable for classification and regression problems in which there is a set of explanatory variables and a single response variable.

Decision Trees (DT) have many advantages compared to other traditional classification and regression models, such as (i) the ease of explaining the relationship between independent and dependent variables, (ii) a logical process for creating the nodes that is more similar to the human decision-making process, and (iii) the intuitive classification rules conveyed in the figures that depict the decision trees.

An important disadvantage of Decision Tree models is that their accuracy tends to be lower than that of other regression and classification models. Despite this disadvantage, there are different methods to improve the predictive power of tree-based models by aggregating many decision trees, through bagging, random forests, and boosting. The intuition behind Decision Trees can be applied in a Survival Analysis context, allowing the investigation of PD models through the lifetime of the credit facility, as required by the regulation.

In this study, we investigate Survival Trees and Random Survival Forests, which are data mining techniques based on machine learning that extend Decision Trees.

3.4 Survival tree

Survival Tree (ST) is a tree-based method in which a splitting rule is used for grouping individuals or observations from their covariates. Each group is selected based on its survival behavior (Bou-Hamad et al. 2009).

ST has been applied in various areas. For instance, Cohn et al. (2009) used the survival tree technique to identify risk factors in the diagnosis of children with nephroblastoma. In the study, characteristics that influence the time to relapse, malignancy, or death of the patient were identified, and a risk hierarchy was created that allows the indication of different treatments given certain characteristics.

A survival tree is built using the log-rank statistic as the splitting rule (LeBlanc and Crowley 1993). The best split at a specific node is found by searching for the predictor x and split value c that maximize the measure of node separation |L(x, c)| (Ishwaran and Kogalur 2007), given by:

$$L(x,c) = \frac{\sum_{t=1}^{T}\left(d_{t,l} - y_{t,l}\,\frac{d_t}{y_t}\right)}{\sqrt{\sum_{t=1}^{T} \frac{y_{t,l}}{y_t}\left(1-\frac{y_{t,l}}{y_t}\right)\left(\frac{y_t-d_t}{y_t-1}\right)d_t}} \quad (8)$$

where dt is the number of events at time t, and yt,l is the number of individuals at risk at time t in daughter node l = 1, 2. More specifically, the best split is given by the predictor variable x* and its split value c* such that |L(x*, c*)| ≥ |L(x, c)| for all x and c, as described in Ishwaran and Kogalur (2007).
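A single survival tree with a log-rank-based splitting rule, as described above, can be grown with scikit-survival's SurvivalTree. The sketch below uses simulated data, and the depth and leaf-size constraints are illustrative choices rather than the paper's settings.

```python
import numpy as np
from sksurv.util import Surv
from sksurv.tree import SurvivalTree

rng = np.random.default_rng(1)
n, p = 500, 6
X = rng.normal(size=(n, p))
time = np.minimum(rng.exponential(scale=30, size=n), 36.0)
event = (time < 36.0) & (rng.random(n) < 0.5)
y = Surv.from_arrays(event=event, time=time)

# Recursive partitioning with a log-rank-type splitting criterion;
# constraints below are illustrative, not tuned values.
tree = SurvivalTree(max_depth=4, min_samples_leaf=30).fit(X, y)
risk_scores = tree.predict(X[:5])     # higher score = higher estimated default risk
print(risk_scores)
```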

3.5 Random survival forest

Random Survival Forest (RSF), proposed by Ishwaran et al. (2008), applies ensemble methods to survival trees to obtain an ensemble cumulative hazard function (CHF). RSF takes advantage of randomization in two ways: (i) by randomly drawing B bootstrap samples, one for each tree, and (ii) by randomly selecting, at each node of a tree, a subset of candidate variables for splitting. The split at a node is then chosen using the candidate variable and cutoff that maximize the survival difference between the child nodes.

Bellini (2019) considered the Random Survival Forest an alternative for modeling lifetime PD, along with other survival analysis and machine learning techniques. Ishwaran et al. (2008) give an overview of the RSF framework, described in the following steps (a fitting sketch follows the list):

  1. Draw B bootstrap samples

  2. Build a survival tree for each sample. Randomly select p variables at each node. The cutoff point maximizes the survival difference between the child nodes.

  3. Grow each tree as long as the stopping constraints are not violated.

  4. Calculate the CHF for each single tree and average them to obtain the ensemble CHF.

  5. With the out-of-sample data, compute the prediction error for the ensemble CHF.
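The procedure above maps roughly onto scikit-survival's RandomSurvivalForest as sketched below: n_estimators corresponds to B, max_features to the number of candidate variables per node, and min_samples_leaf to the growth constraint. The values shown are illustrative, not the paper's tuned configuration.

```python
import numpy as np
from sksurv.util import Surv
from sksurv.ensemble import RandomSurvivalForest

rng = np.random.default_rng(2)
n, p = 800, 10
X = rng.normal(size=(n, p))
time = np.minimum(rng.exponential(scale=30, size=n), 60.0)
event = (time < 60.0) & (rng.random(n) < 0.5)
y = Surv.from_arrays(event=event, time=time)

rsf = RandomSurvivalForest(
    n_estimators=200,        # B bootstrap samples / trees (step 1)
    max_features="sqrt",     # random subset of candidate variables per node (step 2)
    min_samples_leaf=15,     # growth constraint on terminal nodes (step 3)
    n_jobs=-1,
    random_state=0,
).fit(X, y)

# Ensemble cumulative hazard function (step 4), evaluated for the first borrower
chf = rsf.predict_cumulative_hazard_function(X[:1])[0]
print([round(chf(t), 3) for t in (12, 24, 36)])
```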

3.6 Gradient boosting models

Gradient boosting (Friedman 2001) provides an alternative approach to the optimization problem. More specifically, it works in an additive manner, where models are sequentially fitted to the residuals of the previous iterations. These models are usually referred to as base learners or weak learners, as they are often simple models. Hence, the overall form of the final model can be described as:

$$f(x) = \sum_{m=1}^{M} \beta_m\, g(x; \theta_m) \quad (9)$$

where M is the number of base learners g(.) with parameters θm, and the overall model is given by a βm-weighted sum. It follows that different base learners, as well as different loss functions, lead to different models. Within this framework, this study applies two different base learners (GBSA and CWGB), resulting in two different models.

Gradient Boosting Survival Analysis (GBSA) implements gradient boosting with a regression tree base learner. In each step, a regression tree is fit on the negative gradient of the Cox PH loss function, with the additive model f(x) replacing the linear predictor in the partial log-likelihood:

$$\sum_{i=1}^{n} \delta_i \left[ f(x_i) - \log\left( \sum_{j \in R_i} \exp\big(f(x_j)\big) \right) \right] \quad (10)$$

where Ri denotes the risk set at the observed time of observation i.

Component-Wise Gradient Boosting (Buehlmann 2006) uses component-wise least squares as the base learner. These weak learners perform a least-squares regression on a vector of pseudo-responses, defined here as the negative gradient of the Cox Proportional Hazards (Cox PH) loss function described in Eq. (10). The selected predictor variable is the one that satisfies:

$$\hat{j} = \underset{1 \le j \le p}{\arg\min} \; \sum_{i=1}^{n} \left( \tilde{Y}_i - \hat{\beta}_j X_i^{(j)} \right)^2 \quad (11)$$

where Ỹ1, …, Ỹn denote the pseudo-response vector and Xi(j) the j-th predictor among the p variables. In this way, the overall model is also a linear model, as it results from a linear combination of M simple linear models.
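Both boosting variants are available in scikit-survival, as sketched below: GradientBoostingSurvivalAnalysis uses regression trees as base learners (GBSA), and ComponentwiseGradientBoostingSurvivalAnalysis uses component-wise least squares (CWGBSA). The hyperparameter values are placeholders, not the tuned ones reported in Tables 1 and 2.

```python
import numpy as np
from sksurv.util import Surv
from sksurv.ensemble import (
    GradientBoostingSurvivalAnalysis,
    ComponentwiseGradientBoostingSurvivalAnalysis,
)

rng = np.random.default_rng(3)
n, p = 800, 10
X = rng.normal(size=(n, p))
time = np.minimum(rng.exponential(scale=30, size=n), 36.0)
event = (time < 36.0) & (rng.random(n) < 0.5)
y = Surv.from_arrays(event=event, time=time)

# GBSA: boosted regression trees fit on the negative gradient of the Cox PH loss (Eq. 10)
gbsa = GradientBoostingSurvivalAnalysis(
    loss="coxph", n_estimators=200, learning_rate=0.1, max_depth=3, random_state=0
).fit(X, y)

# CWGBSA: component-wise least-squares base learner (Eq. 11); the resulting model stays linear in x
cwgbsa = ComponentwiseGradientBoostingSurvivalAnalysis(
    loss="coxph", n_estimators=200, learning_rate=0.1, random_state=0
).fit(X, y)

print(gbsa.predict(X[:3]))      # risk scores from the tree-based ensemble
print(cwgbsa.coef_[:4])         # linear coefficients accumulated by component-wise boosting
```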

3.7 Hyperparameters

We conducted a comprehensive grid search to optimize the hyperparameters of the machine learning models, aiming to achieve superior performance. Table 1 presents a detailed depiction of these hyperparameters along with their corresponding grid values and best values. For the remaining parameters, default values from the scikit-learn library were used (Pedregosa et al. 2011). The final configuration is displayed in Table 2. This approach allowed us to efficiently explore the hyperparameter space and identify the optimal configuration for our models.
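Since scikit-survival estimators expose score() as Harrell's concordance index, such a grid search can be run with scikit-learn's GridSearchCV. The sketch below illustrates the idea for RSF with a hypothetical grid, not the exact grid of Table 1.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sksurv.util import Surv
from sksurv.ensemble import RandomSurvivalForest

rng = np.random.default_rng(4)
n, p = 600, 8
X = rng.normal(size=(n, p))
time = np.minimum(rng.exponential(scale=30, size=n), 36.0)
event = (time < 36.0) & (rng.random(n) < 0.5)
y = Surv.from_arrays(event=event, time=time)

param_grid = {                      # hypothetical grid, for illustration only
    "n_estimators": [100, 300],
    "min_samples_leaf": [10, 30],
    "max_features": ["sqrt", 0.5],
}
search = GridSearchCV(
    RandomSurvivalForest(random_state=0),
    param_grid,
    cv=3,                           # each fold is scored with the concordance index
    n_jobs=-1,
).fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```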

Table 1 Hyperparameters grid values.

Table 2 Hyperparameters final values.

3.8 Evaluation metrics

The performance of the fitted models was evaluated according to the Concordance Index (C-index) and the Integrated Brier Score (IBS). The analysis of these two metrics provides a reasonable assessment of model performance: the first (C-index) indicates how good the model's discrimination is, and the second (IBS) complements the assessment, bringing insights on calibration, as it is evaluated over time periods.

The Concordance Index (Harrell et al. 1982) performs a rank correlation between estimated risks and observed times, comparing pairs of observations. A pair of observations i, j is comparable if the sample with the lower observed time has experienced the event. A comparable pair is concordant if the observation with the lower survival time has the higher predicted risk score, i.e. the lower predicted survival probability. Over all comparable pairs, the c-index is computed as:

$$c\text{-index} = \frac{\sum_{i<j} I(y_i < y_j)\, I\big(\hat{S}_i(t) < \hat{S}_j(t)\big) + I(y_j < y_i)\, I\big(\hat{S}_j(t) < \hat{S}_i(t)\big)}{\sum_{i<j} I(y_i < y_j) + I(y_j < y_i)} \quad (12)$$

On the other hand, it is also desirable that the model shows good calibration performance. In this sense, we expect the risk predictions over time to follow the observed behavior of the events. The Integrated Brier Score provides an assessment of this aspect, in addition to also being a measure of discrimination. Over an evaluation window up to tmax, the IBS can be defined as:

$$IBS = \frac{1}{n}\sum_{i=1}^{n} \int_{t_1}^{t_{\max}} \left[ \frac{I(y_i \le t,\ \delta_i = 1)\,\big(0 - \hat{\pi}(t \mid x_i)\big)^2}{\hat{G}(y_i)} + \frac{I(y_i > t)\,\big(1 - \hat{\pi}(t \mid x_i)\big)^2}{\hat{G}(t)} \right] dw(t) \quad (13)$$

where w(t) = t/tmax, π̂(t|xi) is the predicted survival probability at time t for observation i, and Ĝ is the estimate of the censoring distribution used for inverse probability of censoring weighting. In this way, the IBS can be interpreted as how well a model is calibrated across the periods of time considered.

The models that present the best out-of-sample results are also compared on an out-of-time set of observations. In this case, evaluations are made considering a dynamic time-dependent AUC. The measure is extended to the survival context by treating sensitivity (true positive rate) and specificity (true negative rate) as time-dependent measures. Considering f̂(xi) as the estimated risk score of the i-th observation and ωi as the inverse probability of censoring weights (IPCW), the dynamic time-dependent AUC at time t can be defined as:

$$\widehat{AUC}(t) = \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} I(y_j > t)\, I(y_i \le t)\, \omega_i\, I\big(\hat{f}(x_j) \le \hat{f}(x_i)\big)}{\left(\sum_{i=1}^{n} I(y_i > t)\right)\left(\sum_{i=1}^{n} I(y_i \le t)\, \omega_i\right)} \quad (14)$$

Therefore, the measure distinguishes observations that fail by time t (ti ≤ t) from those that fail after time t (ti > t), giving an intuition on calibration power by analyzing discrimination performance over discrete periods of time.
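The three metrics above are available in scikit-survival, as sketched below. The model, the simulated data, and the grid of evaluation times are illustrative assumptions; note that the IBS and the dynamic AUC require the training data in order to estimate the censoring distribution Ĝ.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sksurv.util import Surv
from sksurv.linear_model import CoxPHSurvivalAnalysis
from sksurv.metrics import (
    concordance_index_censored,
    integrated_brier_score,
    cumulative_dynamic_auc,
)

rng = np.random.default_rng(5)
n, p = 1_000, 8
X = rng.normal(size=(n, p))
time = np.minimum(rng.exponential(scale=30, size=n), 36.0)
event = (time < 36.0) & (rng.random(n) < 0.5)
y = Surv.from_arrays(event=event, time=time)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = CoxPHSurvivalAnalysis().fit(X_tr, y_tr)
risk = model.predict(X_te)                                   # higher = riskier

# C-index (Eq. 12): rank correlation between risk scores and observed times
cindex = concordance_index_censored(y_te["event"], y_te["time"], risk)[0]

# IBS (Eq. 13): needs survival probabilities on a grid of evaluation times
times = np.arange(6, 36, 3)
surv_fns = model.predict_survival_function(X_te)
surv_probs = np.asarray([[fn(t) for t in times] for fn in surv_fns])
ibs = integrated_brier_score(y_tr, y_te, surv_probs, times)

# Dynamic time-dependent AUC (Eq. 14), one value per time point plus its mean
auc_t, mean_auc = cumulative_dynamic_auc(y_tr, y_te, risk, times)
print(round(cindex, 3), round(ibs, 3), round(mean_auc, 3))
```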

3.9 Data and variables

We used data from credit card refinancing operations of a US financial institution. Differently from studies that use traditional datasets for analysis of application scoring or loans already granted but still solvent, our research explores a different profile of borrowers. More specifically, we investigate potential default in refinancing borrowers who had already been delinquent in their credit card debt. Therefore, our study adds to the understanding of the credit risk phenomenon in the context of a different borrower profile.

Data span from January 2012 to December 2018 and cover 17,200 operations. Each observation in the database is a credit card refinancing contract of a borrower, together with financial information. We developed separate models for each maturity: 36-month and 60-month operations. For training and testing, we used operations that started up to 2014 for 36-month maturities, and operations that started up to 2013 for 60-month maturities. The out-of-time performance was then evaluated on operations beginning in the following year for each model. The same split rule was applied to both models: the dataset was divided into 70% for training and 30% for testing.

The observed proportion of default was 12% in 36-month and 22% in 60-month operations. We validated the models on an out-of-sample and an out-of-time dataset. The out-of-time data consist of operations contracted in 2015 (36-month) and 2014 (60-month), from January to December.

The dataset comprises observations with the following attributes:

  1. Loan amount: the listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.

  2. Interest rate: interest rate of the loan.

  3. Installment: the monthly payment owed by the borrower.

  4. Employment length: employment length in years, ranging from zero to ten, where zero means less than one year and ten means ten or more years.

  5. Home ownership: the home ownership status provided by the borrower during registration or obtained from the credit report (rent, own, mortgage or other).

  6. Annual income: the self-reported annual income provided by the borrower during registration.

  7. Verification status: indicates whether the income was verified or not.

  8. Dti: A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.

  9. Total acc: The total number of credit lines currently in the borrower’s credit file.

  10. Public record of bankruptcies: number of public record bankruptcies.

  11. Tax liens: number of tax liens.

  12. Earliest credit line: time since borrower’s earliest reported credit line was opened.

  13. Time to default (in months).

  14. Status: binary variable with 1 (default) or 0 (non default).

Unlike traditional PD models, in which the relevant dependent variable is a binary variable related to the occurrence of default, survival methods require the observation of the time to an event of interest, which in this case is the default. This variable was built as the difference, in months, between the date of the credit card refinancing contract and the date of default.

For the survival analysis techniques, we relied on the following assumptions: (i) the event of interest in the survival analysis is the default, (ii) non-default operations are considered right-censored, and (iii) all trees were pruned based on the total number of operations in the terminal nodes.

After cleaning up the original database, we built a dataset containing the relevant information to allow the study of the probability of default, over time, of the credit operations. The analysis was conducted using the scikit-survival framework (Pölsterl 2020).
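A minimal sketch of how the survival target and the 70/30 split described above might be assembled with scikit-survival is shown below. The column names and values are hypothetical, since the actual dataset is confidential and not public.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sksurv.util import Surv

# Hypothetical frame mimicking the described attributes (real data are confidential)
df = pd.DataFrame({
    "loan_amnt": [11262, 20000, 8000],
    "int_rate": [11.55, 17.14, 13.20],
    "dti": [14.2, 21.7, 9.8],
    "status": [1, 0, 1],             # 1 = default, 0 = non-default (right-censored)
    "time_to_default": [14, 36, 7],  # months from refinancing contract to default/censoring
})

X = df.drop(columns=["status", "time_to_default"])
y = Surv.from_arrays(event=df["status"].astype(bool), time=df["time_to_default"])

# 70% training / 30% testing, as in the paper's split rule
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
```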

3.10 Exploratory analysis

As the dataset consists of a specific type of operation (refinancing), we carried out an exploratory data analysis in order to understand some common characteristics among facilities. Table 3 presents the descriptive statistics of the features for 36-month operations, while Table 4 provides the corresponding statistics for 60-month operations. We observe that higher loan amounts tend to be associated with longer durations, evidenced by a median loan amount of $20,000 for 60-month operations, contrasting with $11,262 for 36-month operations. Interest rates follow a similar trend, with medians of 17.14% for 60 months and 11.55% for 36 months. It is worth noting that the variables “Public record of bankruptcies” and “Tax liens” exhibit very low variation, leading to their exclusion from further analysis.

Table 3 36-month operations.

Table 4 60-month operations.

We then compare the feature distributions between the two maturities described above. Some features show clear differences that are inherent to the operations' maturities, such as installment and loan amount. On the other hand, some features remain very similar, as they reflect common characteristics among borrowers, e.g. the total number of accounts and the earliest credit line. In addition, public record of bankruptcies and tax liens were dropped because of their low variance.

Figure 1 shows that the ratio of debt to income has similar behavior, with most observations falling between 10 and 25 for both types of loans. The time since the earliest credit line reported by the borrower (Figure 2) also shows a homogeneous distribution among facilities, but slightly more right-skewed for 36-month operations. Since the number of 36-month operations is significantly larger, it is expected that some points present higher values.

Fig. 1 Debt to income distribution per time maturity.

Fig. 2 Earliest credit line distribution per time maturity.

The distribution of the total number of credit lines in the borrower's credit file (Figure 3) also presents similar values for the majority of observations, apart from some extreme values (greater than 65) that also shape the 36-month distribution as right-skewed. This indicates that most customers have been carrying out credit operations for some time.

Fig. 3 Total number of accounts distribution per time maturity.

When contrasting the loan amount, 5-year operations seem to yield higher amounts. Their peak is around $20,000, greater than the peak observed in 3-year operations, which is slightly shifted to the left, around $10,000 (Figure 4). This suggests that, in general, higher-value operations are contracted with a longer term for payment of installments, illustrating an expected behavior.

The distribution of installment (Figure 5), which stands for the monthly payment owed by the borrower, seems to present a clear separation, as the maximum value observed in 60-month operations is 0.029, and only 114 observations (around 0.15%) have a value below 0.03 among 36-month operations. In line with the loan amount (Figure 4), this behavior is driven by the time horizon, since the total amount can be diluted over more installments.

Fig. 4 Loan amount distribution per time maturity.

Fig. 5 Installment distribution per time maturity.

The annual income presents some extreme values, especially in 3-year operations, resulting in a highly right-skewed distribution. In order to adjust the scale, we chose to work with the log of annual income for both terms, which yields a distribution closer to a normal form (Figures 6 and 7).

Fig. 6 Percentage of loan by income per time maturity.

Fig. 7 Annual income distribution per time maturity.

Therefore, these distributions present a natural behavior considering their time horizon. It is expected that 36-month operations show more extreme values, since they represent a larger share of contracts. The observed relative percentages of the categorical variables, and their cross frequency with loan status, are displayed in Table 5 (36-month) and Table 6 (60-month). Longer operations show a default rate of 22.7%, higher than shorter ones, which show a rate of 10.6%. The rates among the possible classes of Verification Status and Home Ownership do not present large divergences. Since none of the borrowers met the commitments of the original transaction, the higher share of mortgage and rent can be related to a greater difficulty in meeting all monthly commitments when part of the income is already compromised (Tables 5 and 6).

Table 5 36 months operations.

Table 6 60 months operations.

4 Results

In this section we analyze the evaluation metrics of all models. Different survival analysis techniques were used to study the behavior of default in refinancing credit card operations over the agreed period. The operations were separated according to the two possible durations (36 and 60 months), with all algorithms considering the features described above, resulting in 16 (8 × 2) model results. The performance comparisons were made using the Concordance Index, the Integrated Brier Score, and the dynamic time-dependent AUC on both out-of-sample and out-of-time datasets. We also added Kaplan-Meier results (which only consider the target variable in their calculation) and results based on a completely random state, for baseline comparison purposes (Tables 7 and 8). The best values are marked in bold.

Table 7 Out of sample results.

Table 8 Out of time results.

Furthermore, we present feature importance plots (Figures 8 and 9) for the ensemble methods, which facilitate a deeper comprehension of their predictive mechanisms. For 36-month predictions using CWGBSA, a subset of features is employed, with particular emphasis on the debt-to-income ratio (dti), while for 60-month predictions, only the interest rate is considered. Notably, the installment variable exhibits significant importance in both GBSA models. In the case of RSF, the feature weights remain similar across both models, with the highest importance attributed to installment, followed by interest rate. These insights shed light on the driving factors behind the ensemble methods' predictions and offer valuable interpretability to their outcomes.
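Because scikit-survival estimators score with the concordance index, importance figures of this kind can be obtained, for example, via permutation importance; the paper does not state which importance measure was used, so the sketch below is only one possible approach, shown for an RSF fitted on simulated data with hypothetical feature names.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sksurv.util import Surv
from sksurv.ensemble import RandomSurvivalForest

rng = np.random.default_rng(6)
n = 600
feature_names = ["installment", "int_rate", "dti", "loan_amnt", "annual_inc_log"]
X = rng.normal(size=(n, len(feature_names)))
time = np.minimum(rng.exponential(scale=30, size=n), 60.0)
event = (time < 60.0) & (rng.random(n) < 0.5)
y = Surv.from_arrays(event=event, time=time)

rsf = RandomSurvivalForest(n_estimators=100, random_state=0).fit(X, y)

# Drop in concordance index when each feature is shuffled = that feature's importance
result = permutation_importance(rsf, X, y, n_repeats=10, random_state=0)
for name, imp in sorted(zip(feature_names, result.importances_mean), key=lambda x: -x[1]):
    print(f"{name:15s} {imp:+.4f}")
```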

Fig. 8 Feature importance for 36 months.

Fig. 9 Feature importance for 60 months.

For the 36-month time horizon, the C-index of all Cox models showed the best results (with little variation among them). The highest value was achieved by CoxRidge (0.652), followed by a tied value of 0.648 for CoxPH, CoxLasso, and CoxNet, and 0.647 for CWGBSA. Calibration power (represented by the Integrated Brier Score) over time is quite close for all models, as they all present values close to 0.069, apart from the Survival Tree model, which is ranked with the worst metric values (0.547 C-index and 0.086 IBS). However, when analyzing the dynamic time-dependent AUC, CWGBSA outperforms the other models, closely followed by CoxRidge. Figure 10 shows that CWGBSA increases its performance from the 25th month onward, demonstrating good efficiency at later stages of the agreed period.

Regarding the models for longer time-horizon operations, CoxPH, CoxLasso, and CoxRidge achieve the best C-index scores, with values of 0.637, 0.636, and 0.625, respectively. However, a larger gap was observed, as the remaining values were 0.615 (RSF), 0.613 (CWGBSA), 0.612 (CoxNet), 0.597 (GBSA), and 0.537 (Survival Tree). CoxPH, CoxRidge, CoxLasso, and RSF presented good calibration values, with an IBS of 0.136, followed by 0.138 (CWGBSA, GBSA), 0.141 (CoxNet), and 0.174 (Survival Tree). Considering the dynamic time-dependent AUC, CoxPH and CoxLasso outperform the other models over the whole period (Figure 11).

Fig. 10 Out-of-sample cumulative Dynamic AUC for 36-month operations.

Fig. 11 Out-of-sample cumulative Dynamic AUC for 60-month operations.

Fig. 12 Out-of-time cumulative Dynamic AUC for 36-month operations.

The out-of-time evaluation for 36-month operations evidenced the dominance of CWGBSA over the other models, with the best performance both in C-index (Table 8) and in time-dependent AUC. In this sense, this result indicates that CWGBSA has good generalization power and consistency over time, i.e. the capacity to keep good results over periods of time with little variation. The general behavior seems to be different from the out-of-sample case, given that the AUC discrimination only decreases at later periods (Figure 12). A possible explanation arises from the importance of model calibration with updated information, given that external factors can have a high impact on the behavior of financial operations.

For 60-month operations, the out-of-time evaluation corroborates the previous results, with CoxPH and CoxLasso outperforming the other models. However, time-dependent discrimination also shows a different behavior compared to the out-of-sample metrics. Figure 13 shows a decrease in discrimination power after the 50th month, in contrast to Figure 11, which shows an increase in the same period.

Fig. 13 Out-of-time cumulative Dynamic AUC for 60-month operations.

5 Conclusion

In this study we analyzed the potential benefits of an approach combining Survival Analysis with statistical learning and machine learning methods, in order to adhere to the requirements proposed by IFRS 9. This approach can generate point estimates over discrete periods of time, providing dynamic estimates of the probability of default during the period of the loan. Therefore, such estimates can contribute to addressing the regulatory requirements and to a better understanding of credit behavior over a lifetime horizon.

The analyzed dataset consists of refinancing credit operations with financial information regarding the borrower's history and the characteristics of the current operation, such as interest rate, loan amount, and number of installments. In line with the survival methods, the time to default was considered as the target variable. Several models with different frameworks were fitted and compared considering suitable metrics.

Overall, four models were consistently top ranked: Cox Proportional-Hazards (CoxPH), Cox with Lasso penalty (CoxLasso), Cox with Ridge penalty (CoxRidge), and Component-Wise Gradient Boosting Survival Analysis (CWGBSA). However, their ranks varied according to the time horizon and the different sets of test observations, with CWGBSA displaying better results on shorter-term operations and the Cox-family models on longer-term ones.

The out-of-time evaluation suggests that the models maintain their performance on near-future operations, but that this performance decreases over time. Moreover, structural changes driven by market conditions, or intrinsic factors related to feature interactions, can change the expected results, which requires extra care when considering models at scale and in production.

Considering the current dataset, the results indicate that a boosting framework with a component-wise weak learner provides better predictability for shorter-term operations. On the other hand, when a longer time horizon is considered, the Cox models (CoxPH, CoxLasso, CoxRidge) led to better predictive power. Since the operations studied in our paper are refinancing operations from customers who have already defaulted on previous loans, this difference in the best-performing models across time horizons may arise from intrinsic factors of this specific setting. Future studies could address factors that might impact overall probabilities of default, such as macroeconomic factors, or take into account events that establish a competing-risk relation with default, e.g. prepayment of the loan.

Data availability statement

Data not available due to legal restrictions and confidentiality issues.

Disclosure statement

No potential competing interest was reported by the authors.

References

  • Andreeva, G. 2006. “European Generic Scoring Models Using Survival Analysis.” Journal of the Operational Research Society 57 (10):1180–87. https://doi.org/10.1057/palgrave.jors.2602091
  • Andreeva, G., J. Ansell, and J. Crook. 2007. “Modelling Profitability Using Survival Combination Scores.” European Journal of Operational Research 183 (3):1537–49. https://doi.org/10.1016/j.ejor.2006.10.064
  • Bai, M., Y. Zheng, and Y. Shen. 2021. “Gradient Boosting Survival Tree with Applications in Credit Scoring.” Journal of the Operational Research Society 73:39–55. https://doi.org/10.1080/01605682.2021.1919035
  • Balazy, K., S. Dudley, Y. Qian, N. Sandhu, D. Chang, R. V. Eyben, and E. Kidd. 2019. “Prognostic Model Using a Simple Survival Tree Algorithm for Patients Undergoing Palliative Radiation.” International Journal of Radiation Oncology*Biology*Physics 105 (1):E581.
  • BCBS. 2005. “Studies on the Validation of Internal Rating Systems.” Working Paper No. 14. https://www.bis.org/publ/bcbs_wp14.pd
  • BCBS. 2006. Basel II: International Convergence of Capital Measurement and Capital Standards: A Revised Framework - Comprehensive Version. Number June.
  • Belle, V. V., K. Pelckmans, S. V. Huffel, and J. A. Suykens. 2011. “Support Vector Methods for Survival Analysis: A Comparison between Ranking and Regression Approaches.” Artificial Intelligence in Medicine 53 (2):107–118. https://doi.org/10.1016/j.artmed.2011.06.006
  • Bellini, T. 2019. “Lifetime PD.” In IFRS 9 and CECL Credit Risk Modelling and Validation, edited by T. Bellini, 91–153. London: Academic Press. http://www.sciencedirect.com/science/article/pii/B9780128149409000116
  • Bellotti, T., and J. Crook. 2008. “Credit Scoring with Macroeconomic Variables Using Survival Analysis.” Journal of the Operational Research Society 60 (12):1699–1707. https://doi.org/10.1057/jors.2008.130
  • Bou-Hamad, I., D. Larocque, H. Ben-Ameur, L. C. Mâsse, F. Vitaro, and R. E. Tremblay. 2009. “Discrete-Time Survival Trees.” Canadian Journal of Statistics 37 (1):17–32. https://doi.org/10.1002/cjs.10007
  • Breiman, L. 1984. Algorithm Cart. Classification and Regression Trees. Belmont, CA: California Wadsworth International Group.
  • Buehlmann, P. 2006. “Boosting for High-Dimensional Linear Models.” The Annals of Statistics 34 (2):559–83.
  • Chen, S., T. Guo, E. Zhang, T. Wang, G. Jiang, Y. Wu, X. Wang, R. Na, and N. Zhang. 2022. “Machine Learning-based Prognosis Signature for Survival Prediction of Patients with Clear Cell Renal Cell Carcinoma.” Heliyon 8 (9):e10578. https://doi.org/10.1016/j.heliyon.2022.e10578
  • Chopra, A., and P. Bhilare. 2018. “Application of Ensemble Models in Credit Scoring Models.” Business Perspectives and Research 6 (2):129–41. https://doi.org/10.1177/2278533718765531
  • Cohn, S. L., A. D. Pearson, W. B. London, T. Monclair, P. F. Ambros, G. M. Brodeur, A. Faldum, B. Hero, T. Iehara, D. Machin, et al. 2009. “The International Neuroblastoma Risk Group (INRG) Classification System: An INRG Task Force Report.” Journal of Clinical Oncology 27 (2):289–97. https://doi.org/10.1200/JCO.2008.16.6785
  • Cox, D. R. 1972. “Regression Models and Life-Tables.” Journal of the Royal Statistical Society: Series B (Methodological) 34 (2):187–202. https://doi.org/10.1111/j.2517-6161.1972.tb00899.x
  • Dirick, L., G. Claeskens, and B. Baesens. 2017. “Time to Default in Credit Scoring Using Survival Analysis: A Benchmark Study.” Journal of the Operational Research Society 68 (6):652–65. https://doi.org/10.1057/s41274-016-0128-9
  • Djeundje, V. B., and J. Crook. 2018. “Incorporating Heterogeneity and Macroeconomic Variables Into Multi-State Delinquency Models for Credit Cards.” European Journal of Operational Research 271 (2):697–709. https://doi.org/10.1016/j.ejor.2018.05.040
  • Durović, A. 2019. “Macroeconomic Approach to Point in Time Probability of Default Modeling - IFRS 9 Challenges.” Journal of Central Banking Theory and Practice 8 (1):209–23.
  • Fantazzini, D., and S. Figini. 2008. “Random Survival Forests Models for SME Credit Risk Measurement.” Methodology and Computing in Applied Probability 11 (1):29–45. https://doi.org/10.1007/s11009-008-9078-2
  • Friedman, J. H. 2001. “Greedy Function Approximation: A Gradient Boosting Machine.” Annals of Statistics 29:1189–1232.
  • Hand, D. J., and W. E. Henley. 1997. “Statistical Classification Methods in Consumer Credit Scoring: A Review.” Journal of the Royal Statistical Society: Series A (Statistics in Society) 160 (3):523–41. https://doi.org/10.1111/j.1467-985X.1997.00078.x
  • Harrell, F. E., R. M. Califf, D. B. Pryor, K. L. Lee, and R. A. Rosati. 1982. “Evaluating the Yield of Medical Tests.” JAMA 247 (18):2543–46. https://doi.org/10.1001/jama.1982.03320430047030
  • International Accounting Standards Board. 2014. International Financial Reporting Standard 9 Financial instruments. International Accounting Standards Board.
  • Ishwaran, H., and U. B. Kogalur. 2007. “Random Survival Forests for r.” R News 7 (2):25–31.
  • Ishwaran, H., U. B. Kogalur, E. H. Blackstone, M. S. Lauer. 2008. “Random Survival Forests.” The Annals of Applied Statistics 2 (3):841–60. https://doi.org/10.1214/08-AOAS169
  • Kalbfleisch, J. D., and R. L. Prentice. 1973. “Marginal Likelihoods based on Cox’s Regression and Life Model.” Biometrika 60 (2):267–278. https://doi.org/10.1093/biomet/60.2.267
  • LeBlanc, M., and J. Crowley. 1993. “Survival Trees by Goodness of Split.” Journal of the American Statistical Association 88 (422):457–67. https://doi.org/10.1080/01621459.1993.10476296
  • Lefebvre-Ulrikson, W., G. Da Costa, L. Rigutti, and I. Blum. 2016. “Data Mining.” In: Atom Probe Tomography, edited by W. Lefebvre-Ulrikson, F. Vurpillot, and X. Sauvage, 279–317. New York: Elsevier.
  • Lessmann, S., B. Baesens, H.-V. Seow, and L. C. Thomas. 2015. “Benchmarking State-of-the-Art Classification Algorithms for Credit Scoring: An Update of Research.” European Journal of Operational Research 247 (1):124–36. https://doi.org/10.1016/j.ejor.2015.05.030
  • Narain, B. 1992. “Survival Analysis and the Credit Granting Decision.” In Credit Scoring and Credit Control, edited by L. C. Thomas, J. N. Crook, and D. B. Edelman, 109–121. Oxford: Oxford University Press.
  • Niloy, N., and M. Navid. 2018. “Naïve Bayesian Classifier and Classification Trees for the Predictive Accuracy of Probability of Default Credit Card Clients.” American Journal of Data Mining and Knowledge Discovery 3 (1):1. https://doi.org/10.11648/j.ajdmkd.20180301.11
  • Deepa, P., and C. Gunavathi. 2022. “A Systematic Review on Machine Learning and Deep Learning Techniques in Cancer Survival Prediction.” Progress in Biophysics and Molecular Biology 174:62–71. Accessed https://doi.org/10.1016/j.pbiomolbio.2022.07.004
  • Parizadeh, D., A. Ramezankhani, A. A. Momenan, F. Azizi, and F. Hadaegh. 2017. “Exploring Risk Patterns for Incident Ischemic Stroke During More Than a Decade of Follow-Up: A Survival Tree Analysis.” Computer Methods and Programs in Biomedicine 147:29–36. https://doi.org/10.1016/j.cmpb.2017.06.006
  • Pedregosa, F., G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al. 2011. “Scikit-learn: Machine Learning in Python.” Journal of Machine Learning Research 12:2825–30.
  • Pölsterl, S. 2020. “scikit-Survival: A Library for Time-to-Event Analysis Built on Top of scikit-learn.” Journal of Machine Learning Research 21 (212):1–6. Accessed http://jmlr.org/papers/v21/20-729.html
  • Thomas, L., J. Crook, and D. Edelman. 2017. Credit Scoring and its Applications. Philadelphia, PA: SIAM.
  • Verweij, P. J., and H. C. Van Houwelingen. 1994. “Penalized Likelihood in Cox Regression.” Statistics in Medicine 13 (23–24):2427–36. https://doi.org/10.1002/sim.4780132307
  • Xia, Y., L. He, Y. Li, Y. Fu, and Y. Xu. 2021. “A Dynamic Credit Scoring Model based on Survival Gradient Boosting Decision Tree Approach.” Technological and Economic Development of Economy 27 (1):96–119. https://doi.org/10.3846/tede.2020.13997
  • Yeh, I. C., and C. hui Lien. 2009. “The Comparisons of Data Mining Techniques for the Predictive Accuracy of Probability of Default of Credit Card Clients.” Expert Systems with Applications 36 (2 PART 1):2473–80. https://doi.org/10.1016/j.eswa.2007.12.020
  • Zou, H., and T. Hastie. 2005. “Regularization and Variable Selection via the Elastic Net.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67 (2):301–320. https://doi.org/10.1111/j.1467-9868.2005.00503.x
