Machine learning interpretability for a stress scenario generation in credit scoring based on counterfactuals

To boost the application of machine learning (ML) techniques for credit scoring models, the blackbox problem should be addressed. The primary aim of this paper is to propose a measure based on counterfactuals to evaluate the interpretability of an ML credit scoring technique. Counterfactuals assist with understanding the model with regard to the classification decision boundaries and with evaluating model robustness. The second contribution is the development of a data perturbation technique to generate a stress scenario. We apply these two proposals to a dataset on UK unsecured personal loans to compare logistic regression and stochastic gradient boosting (SGB). We show that training a blackbox model (SGB) conditioned on our data perturbation technique can provide insight into model performance under stressed scenarios. The empirical results show that our interpretability measure is able to capture the classification decision boundary, unlike the AUC and classification accuracy measures widely used in the banking sector.


Introduction
With the growing prevalence of machine learning (ML) usage in the financial sector, there has been a growing interest in its applications in credit scoring. However, financial institutions are still reluctant to use ML models due to the 'blackbox problem', i.e. these techniques are so complex that the impact of their predictions is often difficult to explain and validate (Rudin, 2019). This problem is also known as the accuracy-explainability trade-off. The Bank of England (BoE) and the Financial Conduct Authority (FCA) conducted a survey of 106 UK financial institutions and determined that two thirds of respondents are already using ML in their business, even if with a limited number of use cases (Bank of England, 2019).
This paper is particularly focused on the use of ML techniques for credit scoring models, the quantitative method used by financial institutions to classify potential customers as good or bad borrowers (Thomas, Crook, & Edelman, 2019). This approach is based on a quantitative score that uses an application scorecard at the loan origination stage. The scorecard is used to record all data items that have predictive power related to the applicant's default risk. This data consists of characteristics about the borrower and information regarding previous relationships with the bank. Banks can develop their own internal quantitative models to translate this information into an individual score which is associated with a default probability of the borrower.
When the last financial crisis unfolded more than a decade ago, financial institutions discovered that most of their blackbox algorithms used to estimate this score were based on flawed assumptions. Therefore, financial regulators decided that additional controls were needed and introduced regulatory requirements for model risk management in the banking sector. For example, in April 2011 the Board of Governors of the Federal Reserve System in the US published a document (the Board of Governors of the Federal Reserve System, 2011) that states that banks have to prove that they understand the models they are using. The Financial Stability Board (FSB) in 2017 pointed out that ML is likely to bring many challenges to financial stability due to the lack of model explainability (Financial Stability Board, 2017). The European General Data Protection Regulation (GDPR) (Voigt & Bussche, 2017) provides the right to an explanation of algorithmic decisions. To guarantee this right, financial regulators require that the models used by financial institutions for computing capital requirements should be easily understood. This guideline has been followed by the European Banking Authority, which defined a model as explainable ''when its internal behaviour can be directly understood by humans (interpretability) or when explanations (justifications) can be provided for the main factors that led to its output'' (European Banking Authority (EBA), 2020). Both these regulations create a demand for explainability of ML models in the banking sector.
The two primary aims of this paper are to provide a proposal for making ML models easier to explain and to propose a method to generate stress scenarios for cross-sectional data. First, we suggest the use of counterfactuals (Johansson, Shalit, & Sontag, 2016) for understanding the internal workings of the model so that banks can be compliant with financial regulations. In particular, we analyse the robustness of credit scoring models based on ML techniques to understand the minimum data perturbation needed to misclassify a borrower. Second, we suggest a data perturbation technique that can be used for generating stress scenarios in a cross-sectional context.
Traditional credit scoring models based on logistic regression typically perform well in an economic downturn in terms of their ability to rank the worst borrowers below the best borrowers, i.e. the model's discrimination or 'rank ordering' is not materially impacted. However, these models can deteriorate from an accuracy perspective, and the probability of default at a given score is no longer as accurate as it should be. As ML models have not been used for the purpose of credit scoring in financial services for long, there is no deep understanding of how they will perform under the same conditions. The purpose of this paper is to explore this dynamic.
We apply our two methodological proposals to a dataset on 61,239 UK unsecured personal loans issued from June 2014 to July 2015. We generate different stressed scenarios from this dataset and apply both the logistic regression and the stochastic gradient boosting (SGB) techniques as scoring models. Our first set of experiments demonstrates the impact on model performance when trained on augmented data and shows that model performance is in fact negatively impacted when the default ratio is increased. Our second set of experiments uses our counterfactual distance metric to compare logistic regression with SGB across each dataset feature, followed by a comparison of the impact of feature constraints when applied to counterfactual generation. Results here indicate that the decision boundary is smaller with some blackbox models, and that the application of constraints provides insight into feature value ranges which correlate with model misclassification. The last set of experiments combines the two approaches, bringing together the study of stressed scenarios via augmentation of the data and the isolation of weaknesses in the classification decision boundary using counterfactuals.
The remainder of this paper is structured as follows. Section 2 discusses related literature on the explainability of ML techniques in the banking sector. Section 3 provides an overview of counterfactuals and presents our proposal. A novel approach to generate synthetic data for stress scenarios is presented in Section 4. Section 5.1 describes the data used for the experiments and explains the setting of the problem. Finally, Section 5 shows the results of applying the proposed methods to the previously stated problems.

Literature review
The usage of ML techniques by financial institutions and Fintech companies has gained significant popularity over recent years (Bank of England, 2019). To the best of our knowledge, one of the first uses of ML methods in credit scoring dates back to 1996, when Henley and Hand (1996) found that a K-Nearest Neighbour classifier is more accurate than linear and logistic regression, decision trees and graphs. Since then, a large variety of machine learning techniques has been applied to credit scoring, showing that these methods are often more accurate than traditional regression models (Brown & Mues, 2012; Kraus, Feuerriegel, & Oztekin, 2020; Lessmann, Baesens, Seow, & Thomas, 2015; Paleologo, Elisseeff, & Antonini, 2010).
Analogously to the analysis conducted by the Bank of England (Bracke, Datta, Jung, & Sen, 2019), in this paper we chose stochastic gradient boosting (SGB) (Friedman, 2002) as the machine learning technique, as it provides the highest predictive accuracy when compared to other methods such as random forests and support vector machines. However, it should be mentioned that SGB is opaque in its decision making, rendering it a non-interpretable ML technique. Chang, Chang, and Wu (2018) apply SGB to a highly imbalanced dataset (10% of defaulted loans) provided by a financial institution in Taiwan from 2009 to 2016, showing that SGB achieves the highest discriminant accuracy compared to logistic regression, support vector machines, and neural networks. Addo, Guegan, and Hassani (2018) also find that SGB achieves the highest performance among logistic regression, random forests, and neural networks with various numbers of hidden layers. As their dataset is highly imbalanced, with 1% of defaulted loans, the authors apply an oversampling technique called SMOTE. Addo et al. (2018) point out the importance of transparency and interpretability of credit scoring models; however, they do not explore it further. Moreover, Letham, Rudin, Mccormick, and Madigan (2015) and Paleologo et al. (2010) emphasize that not only accuracy but also interpretability is a relevant characteristic for credit scorecards.
Although some authors have used the terms interpretability and explainability interchangeably, they are two different concepts. The former is mainly focused on describing the prediction in a way that is understandable by humans, while the latter aims to explain the underlying mechanics of a particular decision obtained by the algorithm (Gilpin et al., 2018). As the main aim of this paper is to suggest a technique that financial institutions can use to explain their internal scoring models to regulators, we focus our analysis on interpretability. Furthermore, analogously to a regulatory approach (Bracke et al., 2019), we are interested in credit scoring prediction and not in explaining the causes of loan defaults.
Existing techniques on interpretability can be divided into two main categories: local and global approaches. The latter analyse the model as a whole, for example to detect possible bias. On the other hand, local interpretability tries to identify how different features influenced a particular prediction of the model, e.g. a lending decision for a particular applicant. Analogously to a regulatory approach, in this paper we choose to focus on global interpretability. Classically, global interpretability seeks to understand how a model came to a decision and is based on a general perspective of the input features and the learned coefficients such as weights, parameters, and network architecture. Identifying the importance of features and their interactions are explanations that can be considered global. Our contribution provides global interpretability via analysis of the decision boundary based on the model input and counterfactual generation (Molnar, 2019). Alternatively, our proposal is also able to provide an instrument for local interpretability, for example why a particular loan application was rejected, as Grath et al. (2018) showed in their analysis.
Counterfactuals are example-based explanations that describe causal situations in the form of 'if A had not happened, B would not have happened either' (Johansson et al., 2016). Exemplar usage of applying counterfactuals to tree-based models can be found in Tolomei, Silvestri, Haines, and Lalmas (2017). Where financial institutions are concerned, counterfactuals can be used on an individual level to assess what would have to happen to change the prediction from default to non-default or vice versa. This not only provides a better understanding of model predictions, but can also be used as a suggestion to the client on what they could do to get the loan the next time they apply for it.
Similar interpretable methods include LIME and SHAP, which can be used for both local and global interpretability via feature importance. LIME builds local surrogate linear models for the purpose of model interpretability, where the feature weight coefficients of the local linear model can be interpreted as feature importance (Ribeiro, Singh, & Guestrin, 2016). While our proposed approach uses counterfactuals to model the decision boundary, our focus is on robustness and not necessarily feature importance. SHAP (SHapley Additive exPlanations) is a model-agnostic value estimation method for explaining individual predictions, where SHAP learns local explanations by utilizing Shapley values from game theory (Lundberg & Lee, 2017). In a recent effort, Giudici and Raffinetti (2021) have extended SHAP to further address global explainability goals by incorporating Lorenz decompositions. The authors integrate the Lorenz Zonoid model accuracy tool, which is related to the AUROC measure, and as such provide a metric that incorporates both model performance and model explainability. Our contribution also proposes a new metric; however, our focus is on the use of counterfactuals to calculate the decision boundary conditioned on perturbations to the dataset, to gauge model robustness under various scenarios.
There are few studies in the literature that analyse interpretability in credit risk management. Chen, Janizek, Lundberg, and Lee (2020) ranked the features of a credit scoring model available in the LendingClub dataset based on their Shapley values. The authors then computed the change in the scoring model's predicted log odds of default after imputing each feature (up to 10 features) to the mean. Interventions on the features, ranked according to the Shapley values computed on the imputed features, lead to a substantial decrease in the predicted likelihood of default. Bussmann, Giudici, Marinelli, and Papenbrock (2020) proposed to apply correlation networks to Shapley values to cluster the predictors in ML models used for credit scoring. They applied this approach to data on small and medium enterprises, showing that good and bad borrowers can be clustered based on these predictors.
The Bank of England (Bracke et al., 2019) applied a global approach to explain SGB using the Shapley value, a game-theoretic concept that tells how much each feature contributes to the final prediction. They compare logistic regression and SGB on UK mortgage data over the 2015-2017 period. Bracke et al. conclude that the most important features found by the Shapley value (loan-to-value ratio and current interest rate) are in line with the relevant literature. Another main finding is that explanations for ML models depend on the input region considered, and that they should be tested extensively for several potential states of the world.
The conclusion by the Bank of England naturally leads to the notion of stress testing, a simulation technique used to test the resilience of scoring models against a possible future economic downturn. Stress testing is becoming very important in the risk evaluation of banks and represents a key technique for risk management and capital decisions of financial institutions, as recognized by the Financial Services Authority (Financial Services Authority, 2008). Most of the approaches available in the literature for stress testing use dynamic models, such as survival models (Bellotti & Crook, 2013; Wang, Crook, & Andreeva, 2020).
To apply a dynamic approach, we would need data spanning multiple years. For this paper we have access to a cross-sectional dataset extracted by Nationwide in 2014-2015; for this reason we cannot apply a dynamic model. We instead propose a method to perturb the data based on the k-Nearest Neighbours algorithm (Henley & Hand, 1996) and thereby generate stress scenarios to test the scorecard. We highlight that this proposal is not a stress testing approach in the strict sense, as it does not capture the dynamic behaviour of the economic cycle. Since default rates tend to increase under downturn economic conditions (Hall, 2010), we can manipulate the default ratio as a proxy for different states of the world. It is crucial to note that we did not perform the perturbation using sampling techniques such as SMOTE (Baesens, Rösch, & Scheule, 2016; Chawla, Bowyer, Hall, & Kegelmeyer, 2002), as that method incorporates randomness when generating synthetic instances, which we aim to avoid since we sought to maintain a degree of consistency with the original data. Ultimately, it was a design choice to reject a stochastic approach, such as SMOTE, for the stress scenario exercise to avoid adding an additional source of complexity to the problem. While the generation of balanced synthetic data with SMOTE has demonstrated success in the past (Patil, Framewala, & Kazi, 2020), SMOTE can generate unseen feature values in a stochastic manner, so we sought alternative methods of data generation. We therefore settled on using k-Nearest Neighbours for the perturbation procedure, as original data values from the population of applicants can be reused and changes to the outcome can be made to meet the stress criteria. The exact details of our k-Nearest Neighbours approach and the counterfactual methodology can be found below.
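As an illustrative sketch of a deterministic kNN-style perturbation (the relabelling rule, function name, and target-ratio interface below are our assumptions, not the exact procedure of Section 4), one can repeatedly relabel the non-defaulted loan that lies closest to an existing default until a stressed default ratio is reached, without creating any synthetic feature values:

```python
import numpy as np

def stress_perturb(X, y, target_ratio):
    """Deterministically raise the default ratio: flip the good loan (label 0)
    nearest in Euclidean distance to an existing default (label 1) until the
    default ratio reaches target_ratio. Assumes at least one default exists.
    No synthetic feature values are created; only labels change."""
    y = y.copy()
    while y.mean() < target_ratio:
        goods = np.where(y == 0)[0]
        bads = np.where(y == 1)[0]
        # distance from each good loan to its nearest defaulted loan
        d = np.linalg.norm(
            X[goods][:, None, :] - X[bads][None, :, :], axis=2
        ).min(axis=1)
        y[goods[np.argmin(d)]] = 1  # flip the good loan closest to a default
    return y
```

Because the rule is deterministic, repeated runs on the same data yield the same stressed labels, in the spirit of avoiding the randomness of SMOTE.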

Counterfactuals and robustness
It bears mentioning that robustness in statistics refers to the study of learning from corrupted data, such that the data statistics can still be recovered. We acknowledge the statistical definition; however, we use the ML definition of robustness, which is defined by the minimum perturbation needed to craft an adversarial example that causes an ML model to misclassify a datapoint. In essence, robustness in ML refers to adversarial examples for a trained model (potentially in the test set), while robustness in statistics refers to the handling of adversarial examples prior to training. As we associate global interpretability with comprehension of model robustness, we clarify that robustness is a reflection of a model's decision boundary, which, when analysed conditioned on the model input, provides global interpretability via an understanding of the relationship between model performance and input features. Whether a model is sensitive or resilient to adversarial examples, such information provides a global perspective on model performance.
Counterfactuals generated from a blackbox machine learning model can be harnessed to create a counterfactual distance scoring metric that pinpoints areas of weakness in the model. These areas pertain to values within the feature space. Counterfactuals have provided means of interpreting machine learning models in the past (Guidotti et al., 2018, 2019; Sharma, Henderson, & Ghosh, 2019), and have been useful in explaining the classification of individuals. Recent advances in counterfactual research have used genetic algorithms to create synthetic populations of instances in the neighbourhood of a datapoint. Taken as a whole, counterfactuals generated with genetic algorithms and machine learning models can express the beliefs of the model conditioned on the data. The score we propose is the metric distance between the counterfactuals and the original datapoint, and so provides a numerical understanding of how well a model performs in certain neighbourhoods of the data.
Applying constraints to counterfactual generation allows a modeller to analyse subpopulations of the data constrained by specific data values. To the best of our knowledge, the proposed method is the first systematic analytical framework using counterfactuals to derive a new evaluation metric for model confidence. While Sharma et al. (2019) propose a similar counterfactual-based metric with optional constraints, our approach expands on the use of the metric by specifying constraints so that we can analyse the robustness of individual features, as well as of specified continuous ranges of individual features, on the model. Importantly, Sharma et al. utilize constraints to prevent the generation of counterfactuals that are infeasible or statistically unlikely, whereas we apply constraints in order to interpret model performance more succinctly through inspection of constrained inputs. The granularity of our constraints provides a means of broader interpretation of the model, and as we later generate synthetic data, we can better infer model performance under specified conditions.
As we want to observe the performance of the model given augmented feature values with known classification, we found that counterfactuals can assist in generating datasets contingent on the model and constraints, which can lead to misclassification (i.e. expose feature values which potentially lead the model to make errors). The following methods use counterfactuals generated on subsets of the test population, which fall under a particular feature constraint, to determine the robustness of these subsets with respect to the model. The goal of this method is to complement sampling methods which provide a given blackbox model with biased data, such that we can observe the performance of a machine learning model trained on multiple augmented datasets. The following section provides an overview of the blackbox approach to generating counterfactuals, how they are utilized in calculating the counterfactual distance metric (referred to from now on as the robustness score), and our algorithm incorporating the robustness score in the analysis of model weak points using constraints.

Counterfactuals in machine learning
A counterfactual is defined as a generated datapoint that is as close to the input datapoint as possible but for which the model gives a different outcome (Sharma et al., 2019). To illustrate, had a user been denied a loan, a counterfactual statement could take the form: ''Had your income been $5000 greater per year and your credit score 30 points higher, your loan would have been approved''. Counterfactuals explain model results to a user by providing them actionable ways of changing their behaviour to obtain favourable predictions. Simply put, counterfactuals explain which data features should be changed (and by what amount) to change a specific classification predicted by a model to a different predicted classification.
In past works, counterfactuals have been used for local explanations (Guidotti et al., 2018, 2019; Sharma et al., 2019), but they also have the capacity to be used to identify model decision boundaries in particular data neighbourhoods. By neighbourhood, we are referring to points in the dataset where feature values are broadly similar. Counterfactuals for a datapoint can explain the decision boundary in that neighbourhood of datapoints. By taking counterfactuals for multiple datapoints, say with similar feature values, we can explore the decision boundary for those datapoints conditioned on their feature values. For example, a grouping of users denied loans with incomes lower than $2000 per year, keeping the counterfactual constrained to this feature value, may receive a counterfactual of the form: ''On average, had their credit score been 70 points higher, they would have been approved for the loan''.
Counterfactuals can thus define decision boundaries by providing generated datapoints close to the input datapoint. The distance of the counterfactuals from the original point reflects how robust the model is to adversarial examples. If the counterfactuals across classes are, on average, further away from the input instances for one model A than for another model B, we can say that model A is less likely to make a false prediction. Taken another way, if we find groupings of datapoints with similar feature values, we can take the average distance to the decision boundary to see how easy it is to fool the model on datapoints within those feature value ranges.

Genetic algorithm
Genetic algorithms (GAs) offer several benefits for counterfactual generation, including the ability to generate points for both linear and non-linear models. They also allow constraints to be added, which permits exploration of the decision boundary of particular sub-groups in the data, as well as restricting feature changes and/or generating counterfactuals on specific feature ranges. To understand how a GA can generate counterfactuals for an instance x, we need to define the distance function and the fitness function. In this body of work, the primary means of generating counterfactuals via a GA follows the LORE pipeline introduced by Guidotti et al. (2019).

Distance function
The distance function determines the metric difference between an instance x and a counterfactual c. The LORE model presented by Guidotti et al. (2019) provides a distance function d(x, c) for mixed data (categorical and continuous). As we normalize the data in our experiments, we treat all features as continuous; our distance function is thus the normalized Euclidean distance, which maps the feature space of m attributes to a value between 0 and 1. In Eq. (1) we normalize the Euclidean distance by the total sum of the variances of the two instances in question, as we are comparing the distance between our original instance and the counterfactual, and we normalize each variance by the size m of the instance vector. The mean of the instance vector x is denoted by x̄:

NormEuclid(x, c) = (1/2) · Var(x − c) / (Var(x) + Var(c)),  where Var(x) = (1/m) Σ_{i=1}^{m} (x_i − x̄)².   (1)

While we operate with only continuous data, the discrete case is a possibility for future research. As many datasets are mixed discrete-continuous, the distance function in Guidotti et al. (2019) also provides a method for handling discrete features: for a dataset containing h categorical features and m − h continuous features, the categorical component is the weighted sum of simple matching coefficients.
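Under this continuous-only formulation, the normalized Euclidean distance of Eq. (1) can be sketched in a few lines (a minimal sketch, assuming already-normalized numeric feature vectors):

```python
import numpy as np

def norm_euclid(x, c):
    """Normalized Euclidean distance between instance x and counterfactual c,
    bounded in [0, 1]: half the variance of the difference vector divided by
    the sum of the two instances' variances (each Var is normalized by m)."""
    num = np.var(x - c)           # Var(x - c)
    den = np.var(x) + np.var(c)   # Var(x) + Var(c)
    return 0.0 if den == 0 else 0.5 * num / den
```

The distance is 0 when x and c coincide and reaches 1 only for maximally anti-correlated difference vectors, which keeps the similarity term (1 − d) in the fitness functions well scaled.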

Fitness function
Two fitness functions are used in the GA. One fitness function looks for instances c′ similar to x, but not equal to x, for which the model predicts the same classification. The second fitness function assists in generating instances similar to x, again not equal to x, for which the model predicts a different classification.

fitnessEqual(x, c′) = I[f(x) = f(c′)] + (1 − d(x, c′)) − I[x = c′]
fitnessNotEqual(x, c′) = I[f(x) ≠ f(c′)] + (1 − d(x, c′)) − I[x = c′]

In these equations, I is the indicator function for a conditional statement: I[true] = 1 and I[false] = 0. The term (1 − d(x, c′)) determines the similarity of x and c′. I[f(x) = f(c′)] = 0 means that the model f predicts different classes for x and c′, while I[f(x) = f(c′)] = 1 indicates the predicted classifications are the same. I[x = c′] determines whether x and c′ are identical, as this would hurt the overall fitness of both fitness functions. Conversely, I[f(x) ≠ f(c′)] = 1 indicates that the model classification of the two instances differs, and for the sake of calculating robustness we are concerned with this case. Ideally, the first indicator term I[f(x) ≠ f(c′)] returns 1 and we add the similarity measure, which approaches 1 the closer the counterfactual is to x (the distance function being scaled between 0 and 1), with the final term I[x = c′] subtracting 1 from the fitness value if the potential counterfactual is identical to the original instance.
While the GA has the capacity to generate two populations for an instance x using fitnessEqual and fitnessNotEqual, in the context of developing our robustness metric we only analyse the counterfactual populations generated by fitnessNotEqual, so the concern is to identify potential instances such that I[f(x) ≠ f(c′)] = 1. Ideally we seek potential counterfactuals where fitnessNotEqual(x, c′) ≥ 1, and so the GA seeks to maximize this result.
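The two fitness functions, as reconstructed above, translate directly into code (a minimal sketch: `f` is any classifier returning a label and `d` any distance scaled to [0, 1], both illustrative stand-ins):

```python
import numpy as np

def fitness_equal(x, c, f, d):
    """I[f(x)=f(c)] + (1 - d(x,c)) - I[x=c]: rewards close candidates
    that the model f classifies the same as x."""
    return int(f(x) == f(c)) + (1.0 - d(x, c)) - int(np.array_equal(x, c))

def fitness_not_equal(x, c, f, d):
    """I[f(x)!=f(c)] + (1 - d(x,c)) - I[x=c]: rewards close candidates
    that flip the model's predicted class."""
    return int(f(x) != f(c)) + (1.0 - d(x, c)) - int(np.array_equal(x, c))
```

A fitnessNotEqual value of at least 1 can only occur when the class indicator fires, which is why the GA treats that threshold as the mark of a valid counterfactual.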

Genetic algorithm pipeline
The GA relies on the fitness functions, and the fitness functions rely on the distance function. The fitness function evaluates which counterfactuals are similar to x using the distance function; the classification of the counterfactuals predicted by the model f is used as well. The fitness function then assigns a metric determining how fit a given counterfactual is. This process is reflected in the evaluation function. A selection function selects the individuals produced by the evaluation function with the highest fitness scores.
A crossover function is then applied to the selected fittest population of counterfactuals. Crossover of counterfactual feature values occurs with a user-defined probability. Here, two-point crossover is used, which selects two parents and crosses over features at random by swapping the parents' feature values between the two crossover points.
The next method for counterfactual generation with a GA is the mutation function. A subset of the counterfactuals generated by the crossover function is mutated, based on a user-defined probability. The selected subset of counterfactuals for mutation is transformed by replacing feature values at random according to the distribution of feature values.
The GA loops for a given number of generations, where (1) the fittest counterfactuals are selected, (2) the fittest counterfactuals are augmented to return a new population using the crossover function, and (3) a subset of the new population is selected and its feature values are manipulated via the mutation function. (4) The final step passes the new counterfactual population into the evaluation function to determine the fitness of individual counterfactuals, and the loop continues. In Algorithm 1, we can see the pipeline with the relevant functions used in generating a genetic neighbourhood for a specific instance.
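The loop above can be condensed into a compact sketch (an illustrative toy version of Algorithm 1, not the paper's exact implementation: the population size, probabilities, and helper names are our choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_neighbourhood(x, fitness, X_ref, n_gen=20, pop_size=40,
                           p_cross=0.5, p_mut=0.2):
    """Toy GA loop: selection, two-point crossover, and mutation that
    resamples feature values from their empirical distributions in X_ref.
    `fitness` is assumed to be fitnessNotEqual(x, c)."""
    m = len(x)
    # initial population: copies of x with one feature resampled from the data
    pop = np.tile(x.astype(float), (pop_size, 1))
    for row in pop:
        row[rng.integers(m)] = rng.choice(X_ref[:, rng.integers(m)])
    for _ in range(n_gen):
        # (1) selection: keep the fittest half of the population
        scores = np.array([fitness(x, c) for c in pop])
        pop = pop[np.argsort(scores)[::-1][:pop_size // 2]]
        # (2) two-point crossover between randomly chosen parents
        children = []
        while len(pop) + len(children) < pop_size:
            a, b = pop[rng.integers(len(pop), size=2)]
            child = a.copy()
            if rng.random() < p_cross and m >= 2:
                i, j = sorted(rng.choice(m, size=2, replace=False))
                child[i:j + 1] = b[i:j + 1]
            children.append(child)
        pop = np.vstack([pop, np.array(children)])
        # (3) mutation: resample a random feature from its distribution
        for row in pop:
            if rng.random() < p_mut:
                j = rng.integers(m)
                row[j] = rng.choice(X_ref[:, j])
    # (4) final evaluation: keep candidates whose predicted class flipped
    scores = np.array([fitness(x, c) for c in pop])
    return pop[scores >= 1.0]
```

With a toy classifier and distance plugged into `fitness`, the returned rows are candidate counterfactuals whose predicted class differs from that of x.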

Framework and robustness metric
As stated, the idea of our method is to use counterfactuals to determine the robustness score on datapoints whose feature values fall in particular intervals which lead to misclassification. We can isolate datapoints with particular feature value ranges and use the robustness score to indicate whether the model is weak in predicting their classification. The CERTIFAI model of Sharma et al. (2019) introduced the use of counterfactuals as a means to generate a robustness metric, as well as the concept of applying constraints; here we take a similar approach to calculating the robustness metric with constraints, the difference being how we apply the scoring metric with constraints for broader interpretability.
The output of this pipeline would include feature ranges where the model has an increased likelihood of misclassification.This information can then be passed to a sampling method to generate a synthetic dataset to observe the prediction ability of the model given perturbed feature values.Inversely, this method could be used to observe performance of a model trained on biased data (eg.increased presence of defaults) and use constraints in the generation of counterfactuals to observe the robustness of specific neighbourhoods in the data that are defined by feature value intervals.The pipeline calculating the robustness of subpopulations in the data is as follows: 1.As we are interested in looking at particular feature ranges, we create a subpopulation derived from the test set with a specified interval range.Assuming we have a feature   which has feature values in the range  ≤  ≤  where () is a given value in the range.We could specify an interval   ∈ [, ] where  ≤  and  ≤ .In the context here, we would perform equal-width binning on the features and identify each subpopulation that falls into a given interval.As some values may have a higher frequency we would need to limit the constrained subsets from the test set in order for their size to be generally equal and avoid bias.2. 
Constraints: First we would need to apply the feature range constraints to the GA, as this will return counterfactuals with the explicit feature range we wish to analyse.So given a particular feature with a specified feature range   ∈ [, ] that we wish to constrain in the generation of counterfactuals, after selecting real datapoints from the test set with our specified feature range, we apply our specified feature range as a constraint   into the GA pipeline.This constraint would be applied to the mutation function, as this method selects values from feature range distributions but would now be limited to our constraint.The current counterfactual and neighbourhood population generated for  is defined by   .
3. Robustness: We generate counterfactuals using only the fitnessNotEqual function, as we want to identify the decision boundary for our constrained subset, and then calculate the robustness score, which is the expected distance (the average normalized Euclidean distance) between input instances and their corresponding counterfactuals. Because the score is computed across all instances in the subset and their constrained counterfactuals, it reflects the size of the decision boundary. The smaller the score, the closer the decision boundary lies to the instances in the subset, thus identifying which feature values lead to model misclassification. To compute the robustness score for a single instance x, we normalize the sum of the Euclidean distances over all generated counterfactuals by the final number of counterfactuals generated for x. As our proposal seeks the average robustness over multiple instances, we then normalize the sum of the per-instance scores by the number of instances in the subset.
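The robustness score described above can be sketched as follows (the per-feature min-max scaling is an assumption, as the paper does not spell out its normalization):

```python
import numpy as np

def robustness_score(instances, counterfactuals_per_instance, feature_ranges):
    """Average normalized Euclidean distance between each instance and its
    counterfactuals, averaged again over all instances. Distances are scaled
    by each feature's (min, max) span before the Euclidean norm; this
    min-max normalization is an assumption."""
    span = np.array([hi - lo for lo, hi in feature_ranges], dtype=float)
    total = 0.0
    for x, cfs in zip(instances, counterfactuals_per_instance):
        d = np.linalg.norm((np.asarray(cfs, float) - np.asarray(x, float)) / span, axis=1)
        total += d.sum() / len(cfs)   # normalize by the number of counterfactuals for x
    return total / len(instances)     # normalize by the number of instances
```

A smaller score means counterfactuals lie closer to their instances, i.e. a tighter decision boundary.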
Essentially, the pipeline loops through all features and isolates a feature range to constrain on: (1) create a new subset of the test set based on the constrained feature range, (2) apply the feature range constraint in the GA for generating counterfactuals, and (3) return the average robustness score of these constrained counterfactuals.

Robustness of machine learning models and areas of weakness
Putting all the methods together, we arrive at a system for analysing specific neighbourhoods in the data, given in Algorithm 2. Two binning methods can be implemented: the primary method is equal-frequency binning, where the number of instances is roughly equal for each bin interval, and the alternative is equal-width binning, which divides the feature value range into equal partitions. To analyse a model with regard to binned feature regions, we iterate over all features and use a binning function to identify the constraints. Each constrained feature range [a, b] for a given feature x_j is then passed to the subpopulation function to obtain the subset of instances falling within the constraint. This subset of constrained instances is then used by Algorithm 1 to generate counterfactuals, which are in turn used to calculate the robustness score for the given feature range.
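The loop of Algorithm 2 can be sketched as follows, with `generate_cfs` and `score` standing in for the GA of Algorithm 1 and the robustness metric (both hypothetical placeholders):

```python
import numpy as np

def equal_width_bins(values, n_bins):
    """Equal-width partition of a feature's value range into n_bins intervals
    (bin edges are shared between adjacent bins in this sketch)."""
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    return [(edges[i], edges[i + 1]) for i in range(n_bins)]

def feature_range_robustness(X, n_bins, generate_cfs, score):
    """Sketch of Algorithm 2: for every feature and bin, take the test-set
    subpopulation inside the bin, generate range-constrained counterfactuals,
    and record the robustness score keyed by (feature, bin)."""
    results = {}
    for j in range(X.shape[1]):
        for b, (lo, hi) in enumerate(equal_width_bins(X[:, j], n_bins)):
            subset = X[(X[:, j] >= lo) & (X[:, j] <= hi)]
            if len(subset) == 0:
                continue
            cfs = generate_cfs(subset, j, (lo, hi))   # GA of Algorithm 1 (placeholder)
            results[(j, b)] = score(subset, cfs)      # robustness metric (placeholder)
    return results
```

The returned dictionary maps each (feature, bin) pair to its average robustness score, matching the per-range tables reported later in the paper.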

Primary advantages of the robustness metric
The robustness scores for feature ranges provide an interpretable metric for analysing the strength of a given model. Concretely, features with larger robustness scores are performing better in the sense that the generated counterfactuals are 'further away', so it can be inferred that those value ranges, in general, produce a more reliable prediction by the model. For example, given a feature x with 3 bins, thus providing 3 constraint ranges, with bin robustness values of [6.7, 18.3, 4.9] corresponding to the ranges (x ≤ 1000), (1000 < x ≤ 50000), and (50000 < x), we see that Bin 2 is more robust than Bins 1 and 3. The distance metric yields counterfactuals with an average distance of 18.3 for Bin 2 whereas the other bins have lower scores, so instances within the value range (1000 < x ≤ 50000) generate counterfactuals further from the original on average. This signifies that the model is more confident about classifications for feature values in that range (not necessarily what the classification is, just that the model is confident in its decision). These values also tell us that in cases where the model is classifying instances with x = 100 or x = 100000 we should be cautious, as the decision boundary in these ranges is less robust. The robustness score tied with constrained counterfactuals provides a means to pinpoint feature vulnerabilities in machine learning. The robustness score conditioned on discrete ranges for all features provides an additional analysis metric that helps modellers identify which features, and more explicitly which ranges, are associated with increased likelihoods of misclassification with regard to a model.
Robustness has been investigated in conjunction with financial applications in the past. Petropoulos et al. proposed a robustness metric based on model likelihood on discretized portions of data, with particular groupings pertaining to actual classifications (i.e. loan defaults) (Petropoulos, Siakoulis, Stavroulakis, & Klamargias, 2019). The approach is similar with regard to measuring confidence of the model, especially with respect to comparisons with other models; however, our framework for robustness is far more granular in that (Petropoulos et al., 2019) are limited to observing groupings of so-called score values, while we can analyse performance on individual feature ranges. Other works which investigated model robustness and credit scoring often focused on the construction of a more robust model (Abdou, 2009; Luo, 2020) or used previously defined metrics unrelated to counterfactuals (Vieira, Barboza, Sobreiro, & Kimura, 2019).

Data perturbation for stress scenario generation
In the following section, we explain our proposal for generating stressed scenarios from a cross-sectional dataset. We consider the particular restriction where only a dataset for a given year is available, so we cannot apply a dynamic model, as explained in Section 2. We suggest the use of the k-Nearest Neighbours classifier to generate synthetic data with different default rates or client feature values that represent an economic downturn. The following subsections describe the proposal in detail.

k-Nearest Neighbours
K-Nearest Neighbours (kNN) is a machine learning algorithm that classifies a new datapoint based on a similarity measure (Henley & Hand, 1996). As the name suggests, it classifies a datapoint based on the classifications of its nearest neighbours. The general idea of this procedure for a two-label classification can be seen in Fig. 1. This is exactly the setting for our case: in credit default prediction problems, the clients are usually divided into two classes, defaulted and non-defaulted. The figure simplifies this problem to two dimensions. When a new point is added, we first need to choose the number of neighbours that will be considered for classification (usually denoted k), as well as a distance metric to be used as a similarity measure. A popular choice is the Euclidean distance metric in Eq. (9).

d(p⃗, q⃗) = √( ∑_{i=1}^{n} (p_i − q_i)² )    (9)

where p⃗ and q⃗ are observations and n is the number of features in the data. Under the chosen distance metric, we check which points are closest to the new datapoint and count them. Then, based on this count, the point is assigned to a certain class. For example, suppose that we are interested in checking the 5 nearest neighbours, and we see that 2 out of these 5 neighbours defaulted (class 1), whereas 3 did not (class 0). The kNN algorithm would then conclude that the new datapoint should be classified as non-defaulted.
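The vote described above can be sketched in a few lines (a minimal illustration in NumPy, not the paper's implementation):

```python
import numpy as np

def knn_classify(X_train, y_train, x_new, k=5):
    """Plain kNN: Euclidean distance to every training point, then a
    majority vote among the k nearest (1 = default, 0 = non-default)."""
    X = np.asarray(X_train, float)
    y = np.asarray(y_train)
    d = np.linalg.norm(X - np.asarray(x_new, float), axis=1)
    nearest = y[np.argsort(d)[:k]]
    return int(nearest.sum() > k / 2)

# Mirroring the example in the text: 2 of the 5 nearest neighbours defaulted,
# so the new point is classified as non-defaulted.
X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [9, 9]]
y = [0, 0, 0, 1, 1, 1]
print(knn_classify(X, y, [0.5, 0.5], k=5))  # → 0
```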
An extension of this procedure takes into account the distance between each neighbour and the point under consideration, multiplying each neighbour's value (1 for default and 0 for non-default) by the inverse of its exact distance. This ensures that closer neighbours weigh more than ones that are further away. In particular, the resulting default probability for a point q⃗ is

p(q⃗) = ( ∑_{i=1}^{k} y_i / d(q⃗, x⃗_i) ) / ( ∑_{i=1}^{k} 1 / d(q⃗, x⃗_i) )    (10)

where k is the number of nearest neighbours used, x⃗_i is the i-th nearest neighbour, and y_i ∈ {0, 1} is its class label.
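A sketch of this weighted variant, following the description of Eq. (10) (the `eps` guard against zero distances is our addition):

```python
import numpy as np

def knn_default_probability(X_train, y_train, x_new, k=5, eps=1e-12):
    """Inverse-distance-weighted kNN probability of default: each
    neighbour's label y_i (1 default, 0 non-default) is weighted by the
    inverse of its distance; `eps` guards against zero distances."""
    X = np.asarray(X_train, float)
    y = np.asarray(y_train, float)
    d = np.linalg.norm(X - np.asarray(x_new, float), axis=1)
    idx = np.argsort(d)[:k]
    w = 1.0 / (d[idx] + eps)
    return float(np.sum(w * y[idx]) / np.sum(w))
```

For instance, a defaulted neighbour at distance 1 and a non-defaulted one at distance 2 give p = (1/1)/(1/1 + 1/2) = 2/3.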
As we describe below, we use the kNN algorithm for two main tasks: varying the default rates and the applicant's features in the original sample to create synthetic datasets that represent different stressed scenarios.

Framework for generating default rate
Default rates tend to be higher during economic downturns, as people often have less money to cover their ongoing loans (Mian & Sufi, 2009). Therefore, in order to assess how the model would perform during such events, we change the classification of certain loans to the opposite class, in this case from non-defaulted to defaulted. The pipeline for changing the classification is as follows:
1. Obtain default probabilities. Apply kNN with the chosen number of neighbours to every datapoint, using Eq. (10), in order to get a probability p, 0 ≤ p ≤ 1, of each point belonging to a given class based on its neighbours. This helps establish whether a point is surrounded by other points from the same class or not. It is all done in n-dimensional space, where n is the number of features in the out-of-time set.
2. Find questionable observations. Determine which observations lie close to the decision boundary according to the kNN algorithm. In principle, the closer the observation is to p = 0.5, the more questionable it is, because at this point we are least sure (provided we did not have the label in the first place) to which class the observation would belong. Depending on the final default rate that we are aiming for, we can increase the range, for example to 0.4 ≤ p ≤ 0.6, 0.3 ≤ p ≤ 0.7, etc.
3. Change the classification. From the observations obtained above, choose the ones that were initially classified as non-defaulted and classify them as defaulted, which in effect increases the default rate relative to the initial out-of-time set.
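Steps 2 and 3 of the pipeline can be sketched as follows (assuming the probabilities p have already been obtained from the weighted kNN of Eq. (10)):

```python
import numpy as np

def stress_default_rate(y, p, band=(0.4, 0.6)):
    """Flip non-defaulted observations whose kNN default probability p
    falls inside the questionable band around 0.5 (steps 2-3 above)."""
    lo, hi = band
    y_new = np.asarray(y).copy()
    flip = (y_new == 0) & (p >= lo) & (p <= hi)
    y_new[flip] = 1
    return y_new
```

Widening the band (e.g. to [0.3, 0.7]) flips more observations and so raises the resulting default rate.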

Framework for applicant's features
A similar logic can be applied to changing the independent variables in the original sample. The key in this case is to find observations that would be least likely to change their classification due to changes in the feature values. Combined with the approach for increasing default rates, this leads to the creation of completely new synthetic out-of-time sets, depending on the hyperparameters used. The procedure for finding such observations consists of 2 steps and is very similar to the one described above. It mainly differs in step (2), in which we aim to find observations whose calculated probability of default p is either as close to 0 as possible (for the non-defaulted case) or as close to 1 as possible (for the defaulted case). Having found such observations, we follow the steps outlined below. In the end, the newly created set has observations with perturbed features and unchanged classifications. Optionally, the classification could be perturbed as well, using the procedure from the previous subsection.

Data description
We apply the proposed methods to a dataset provided by Nationwide Building Society on 61,239 UK unsecured personal loans issued from June 2014 to July 2015. It contains applicants' characteristics at the time at which they applied for the loans, together with sample weights to be applied to each observation. The reported percentage of default is 6.8%, which represents 4168 loans. The dependent variable is coded as 1 if the borrower is considered to be in default and 0 otherwise. For confidentiality reasons, we cannot share the description of the independent variables for this dataset.
The logistic regression and the SGB are estimated on 50 explanatory variables. We train the models on the data observed between June 2014 and May 2015 (in-sample) and assess the quality of the models on the out-of-time sample from June to July 2015. We obtain 41,711 loans in the training set, 10,101 in the out-of-sample set and 9427 in the out-of-time set.
In this section we summarize the results obtained by applying data perturbation and counterfactuals to the described dataset. We follow the analysis structure used by the Bank of England (Bracke et al., 2019) on explainability for scoring models by comparing the logistic regression (logit) and SGB (Friedman, 2002) methods. Hyperparameters were not tuned for the models; we rely on scikit-learn (Pedregosa et al., 2011) and XGBoost (Chen, He, Benesty, Khotilovich, & Tang, 2015) default parameter values (version 0.90).

Data perturbation
Perturbing the data can help financial institutions perform a stress scenario analysis, instead of stress testing (Bellotti & Crook, 2013), when data is collected over a short period of time and, therefore, macroeconomic variables cannot be used. The percentage of defaulted loans in the Nationwide dataset is 6.8%.
Higher probabilities of default for the perturbed out-of-time sets are obtained by changing the classification of good observations inside the following probability intervals. These probabilities are obtained using kNN, as explained in Section 4, and take an approximate runtime of 87 s to generate. The resulting default rates are as follows (out-of-sample and out-of-time, respectively):
• [0.25, 0.75] - the probability of default increases to 9.9% and 9.4%
• [0.15, 0.85] - the probability of default increases to 13.8% and 12.6%
• [0.10, 0.90] - the probability of default increases to 16.1% and 14.8%
• [0.06, 0.94] - the probability of default increases to 20.0% and 19.0%
Fig. 2 shows how AUC decreases as the probability of default in the perturbed out-of-time set increases, for both logistic regression and SGB. The same can be said about the perturbed out-of-sample set; however, the drop for the out-of-sample set is much smaller for both models. One reason for this might be the higher initial probability of default of the out-of-sample set, which is closer to the probability of default of the in-sample set (7%). Since the out-of-sample set has characteristics more similar to the in-sample set than the out-of-time set does, the results on the perturbed data are also better for it.
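The interval-widening step above can be sketched as a search that widens the band around p = 0.5 until a target default rate is reached (a hypothetical helper; the paper selects the intervals directly, and the default widths below are chosen to reproduce intervals such as [0.25, 0.75] and [0.06, 0.94]):

```python
import numpy as np

def band_for_target_rate(y, p, target_rate, widths=(0.1, 0.2, 0.3, 0.44)):
    """Widen the questionable interval around p = 0.5 until flipping the
    non-defaulted observations inside it pushes the default rate to at
    least target_rate. Returns the chosen band and the achieved rate."""
    for w in widths:
        lo, hi = 0.5 - w, 0.5 + w
        flipped = np.where((y == 0) & (p >= lo) & (p <= hi), 1, y)
        if flipped.mean() >= target_rate:
            return (lo, hi), float(flipped.mean())
    return None, float(y.mean())  # target not reachable with these widths
```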
The drop in performance for both models is most significant for the initial increase in the probability of default. This is due to the overlap of datapoints now found in both classes (non-default and default). As we perturbed the data to have a higher ratio of defaulters, the distributions of the two classes now have a degree of overlap. Essentially, ML techniques have a harder time classifying datapoints when feature values have an ambiguous class association (Prati, Batista, & Monard, 2004). Testing on perturbed data (with higher default ratios) results in more false positives and false negatives, thus reducing the AUC value. Overall, the drop is larger for logistic regression, showing that it is less robust than SGB. The initial datapoints in the figure reflect AUC values for logistic regression and SGB prior to perturbation of the Nationwide dataset (baseline performance is also stated in Table 1).
Note that, as we use kNNs in the perturbation process, the value of k significantly determines the appearance of the new decision boundary. At higher values of k we can expect the decision boundary to smooth out, while at lower values of k we can expect a more flexible decision boundary (James, Witten, Hastie, & Tibshirani, 2013). As increasing k also impacts the ratio of added defaulters, we can expect a decrease in model performance as the rates of false positives and false negatives increase. Importantly, as our method relies on the returned probabilities p, which depend on the setting of k, to determine which points to set to default, alternative settings of k would yield different probabilities as the decision boundary smooths out at higher k values. In our experiments we selected only a small value of k (k = 5) to observe an increase in defaulters, as our primary concern was experimenting with the various probability ranges. As exploring higher k values has a notable impact on the decision boundary, we propose this as an avenue for future work.
Apart from investigating how the performance of the model is affected, we also checked whether the explanations of decisions made by the model are the same for the original and perturbed data. For this task, we used the data obtained by changing applicant features as explained in Section 4.3. In particular, we perturb the features of non-defaulted applicants for which the value obtained in Eq. (10) is smaller than 0.05. We used Shapley values, in particular the variation called Shapley Additive Explanations (SHAP) (Lundberg & Lee, 2017), which measures the average impact of each variable on model output. The comparison between the original out-of-time set and the augmented one is presented in Fig. 3, where these values represent the magnitude of each variable's contribution to the outcome. It can be seen that the predictive power of the respective variables does not change significantly after perturbing the original data.

Counterfactual robustness
We evaluate the robustness score by looking at the average robustness of individual features and comparing the scores between models; we then look at the robustness of a model where counterfactuals are generated with individual feature ranges acting as constraints. We train blackbox models on the Nationwide credit application dataset with the goal of providing another performance metric for comparing models, analysing model robustness on individual features, and further analysing model robustness on particular feature ranges, which we see as an important contribution to the global interpretability analysis of ML models in the financial sector.
In Table 1 we can see the predictive power of the respective blackbox classifiers. In general, common metrics of model performance are model accuracy and the AUC. Initial observations indicate that SGB is the superior model for credit scoring with respect to both metrics.

Model robustness on individual features
To assess the robustness of individual features in a dataset, we generate counterfactuals on individual instances using only the value range of the selected feature in the genetic algorithm. In essence, any counterfactual of an alternate classification for an instance x_i will be generated based only on the possible values of the selected feature x_j, leaving all other values of x_i unchanged. For each feature, K-means clustering is utilized to select a subpopulation that best represents the data. This provides a general population of datapoints while constraining counterfactual generation to the range of values of our selected feature. Counterfactuals are generated on this subset, and from the resulting population of counterfactuals we calculate the robustness score. The average robustness scores for each feature, for models trained with logistic regression and SGB, can be seen in Table 2.
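The K-means subsampling step can be sketched with a small Lloyd's-algorithm implementation (a minimal stand-in; the paper's choice of k and its implementation are not specified):

```python
import numpy as np

def representative_subpopulation(X, k=10, iters=20, seed=0):
    """Pick up to k representative rows of X via a small k-means
    (Lloyd's algorithm), returning the actual datapoint nearest to each
    centroid rather than the centroid itself, so only real observations
    enter the subpopulation."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        # assign each row to its nearest centroid, then recompute centroids
        labels = np.argmin(np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2), axis=1)
        for c in range(k):
            if np.any(labels == c):
                centroids[c] = X[labels == c].mean(axis=0)
    # map each centroid back to the closest real datapoint
    idx = np.argmin(np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2), axis=0)
    return X[np.unique(idx)]
```

Returning real rows (not centroids) matters here, because counterfactuals must be generated from genuine instances in the test set.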
The average feature robustness scores for the logistic regression and SGB models trained on the Nationwide dataset can be seen in Table 2. We see a pattern where individual feature robustness is better under SGB than under logistic regression: the dataset demonstrates, on average, a larger overall robustness score among individual features for SGB, while the logistic regression model demonstrates less certainty across the various features. We find the logistic regression model's most robust feature is variable49, while the least robust is variable02. The SGB model is more consistent in its feature robustness scores than the logistic regression model. Its most robust features are variable22, variable46, and variable33. The less robust features are variable18 and variable23, which are nonetheless comparatively more robust under the SGB model than under logistic regression. As an example, if a modeller had two new entries, customers with similar values for variable33 but wildly different values for variable23, and the two entries received different classifications from the SGB model, then further analysis of the entries would be required to assess the predicted classifications.

Robustness with binned feature ranges as constraints
In the following evaluation we take the SGB model and observe the robustness score when constraints are applied to the individual feature ranges, as in Algorithm 2. Here we compare the difference in robustness scores given 2 and 3 bins across all features in the Nationwide dataset. By increasing the bin count we can isolate particular feature ranges that may be less robust to perturbations, helping the modeller understand and isolate areas of weakness in the model. The primary method of binning is equal-frequency binning; however, we encountered features which failed to create 3 bins with equal frequency. In these cases, we defaulted to equal-width binning.
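The binning fallback can be sketched as follows (a minimal version using quantile edges; the paper's exact failure criterion is not stated, so duplicate quantile edges are used as the trigger here):

```python
import numpy as np

def bin_edges(values, n_bins):
    """Equal-frequency bin edges via quantiles; if duplicate edges leave
    fewer than n_bins usable intervals (e.g. many repeated values), fall
    back to equal-width binning, mirroring the fallback described above."""
    values = np.asarray(values, dtype=float)
    edges = np.unique(np.quantile(values, np.linspace(0, 1, n_bins + 1)))
    if len(edges) < n_bins + 1:   # equal-frequency failed
        edges = np.linspace(values.min(), values.max(), n_bins + 1)
    return edges
```

For a heavily skewed feature such as `[0, 0, 0, 0, 0, 9]` with 3 bins, the quantile edges collapse and the function falls back to the equal-width edges `[0, 3, 6, 9]`.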
From Tables 3 and 4 we again see the benefits of increased binning of individual features when training the SGB model on the Nationwide dataset. Looking at the 2 bin constraint, we see that feature variable01 appears robust across both bin ranges 1 and 2. In the case of 3 bins, the corresponding feature has average robustness scores of 11.3, 15.8, and 5.28 for bins 1, 2, and 3, respectively. The implication is that the feature ranges seen with 2 bins do not capture the areas of lower robustness that can be observed with increased binning. As we see, bin 3 under the 3 bin partitioning is less robust for feature variable01. This increased partitioning of the feature ranges exposes potential problem areas not seen with 2 bins. The range corresponding to bin 3 for feature variable01 can be identified in Table 5; it tells the modeller that this range is an area of caution for model prediction.
We expect different measures of robustness for differing bin sizes, as increasing the bin count allows for more granular measures of particular feature ranges. In essence, if a specific feature value correlates with adversarial attacks, more discrete binning could capture that fact with a correspondingly low robustness score. The presence of outliers can also impact the robustness score, as the constraints for the GA will include outlier values if they happen to be the highest or lowest values in the feature range. This is a consequence of binning being performed by equal-frequency/equal-width binning. The use of K-means clustering does prevent the selection of outliers for the subpopulations used in calculating the robustness score for a given feature range, and the fitness functions by design avoid using outlier values in generating counterfactuals, as such values would be significantly different from the instances in the subpopulation.

Counterfactual robustness and stress scenario testing
The following experiments seek to combine the counterfactual robustness score with the data perturbation methodology using kNNs. The purpose is to generate possible stress scenarios on the data by increasing the ratio of bads (default rate) and perturbing features, in order to observe model performance during what could be a financial crisis. The counterfactual robustness score provides insight into which features are most at risk and which value ranges modellers should treat with caution as the bad rate is increased to 12% and 18%. KNNs are again used on the biased datasets to perturb features, based on extracting the most robust ranges of the variables for which an observation should retain a certain classification. We use these robust ranges to randomize feature values for these observations and hence arrive at a new, perturbed out-of-time sample.

Baseline
Table 5 provides the ranges for each feature; the two values in each bin indicate the lower and upper bound of that bin interval. In the initial baseline, the bad rate in the data is 6%. The average robustness scores for the 3 binned intervals in this baseline can be observed in Table 3.

Stress scenarios
Nationwide 12% bad rate. In Table 6 we see the impact that an increased bad rate has on the Nationwide dataset. We note that features variable12 and variable48 become more robust when more observations are labelled as bad. We also see certain features, such as variable15 and variable31, become less robust in model prediction.
Nationwide 12% bad rate with perturbed features. In Table 7 we see the result of feature perturbation on the Nationwide data with a 12% bad rate. The most notable change is a higher robustness for feature variable36 at bin interval 1. Beyond that, the overall impact of feature perturbation is minimal at this bias ratio for credit default.
Nationwide 18% bad rate. In Table 8 it can be observed that features variable04 (increased robustness) and variable31 (decreased robustness) change significantly from the baseline Nationwide dataset (see Table 3). The remaining features see little fluctuation in their average robustness scores.
Nationwide 18% bad rate with perturbed features. The impact of the perturbed feature values on the Nationwide dataset with an 18% bad rate is minimal, with few actual changes in average robustness. A case of increased robustness can be seen for feature variable18, as shown in Table 9.
Overall we notice that the robustness of certain features fluctuates with increased bad rate as well as with perturbations of the data, in some cases significantly; however, the SGB model retains robustness across the majority of features.

Conclusion
The implementation of ML techniques in the financial sector is still limited due to their lack of interpretability. To address this challenge, our paper merges two distinct approaches, counterfactuals generated from blackbox models and targeted data perturbation using kNNs, into an approach for performing stress scenario analysis for scoring models. The two approaches can be used separately or in conjunction.
The data perturbation methodology uses kNNs to identify entries which are in close proximity to entries of another classification and reassigns their class. This method can be used to generate synthetic data representing an economic downturn when only a limited time series is available. The generation of counterfactuals provides an added perspective on which data regions are robust to change and which are weak to feature value perturbations. Combined, the two approaches provide a key stress scenario mechanism for blackbox model analysis.
We test these proposals on a UK dataset of unsecured personal loans. We show that our data perturbation method decreases AUC as the default ratio increases. We also demonstrate the insight provided by the robustness score derived from counterfactuals, where using counterfactual constraints enables us to compare the robustness of logistic regression and SGB across each feature in the Nationwide dataset. Our results demonstrate that SGB remains more robust to data perturbations than the industry benchmark of logistic regression. We also investigated the stressed scenario datasets produced by our data perturbation method and were able to identify the robustness of discrete feature ranges across all features. Our results indicate that, with increasing bad rates, SGB remains robust to perturbations that signify stress scenarios.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Fig. 1 .
Fig. 1. kNN example for different numbers of neighbours in a two-dimensional space.
Counterfactuals have rarely been utilized with respect to credit scoring. McGrath et al. implemented counterfactuals for use in local interpretations of a model's decision (Grath et al., 2018), and instead of using a genetic algorithm for generating counterfactuals they made use of optimization of a loss function with the Nelder-Mead algorithm. Overall, past works have used counterfactuals to assist with financial applications primarily with regard to interpretation of model decisions at the local level (Chen et al., 2018; Grath et al., 2018; Wachter, Mittelstadt, & Russell, 2017; White & d'Avila Garcez, 2019), or for an analysis of model fairness (Coston, Mishler, Kennedy, & Chouldechova, 2020; Kusner, Loftus, Russell, & Silva, 2017; Wang, Sridhar, & Blei, 2019; Zhao, Coston, Adel, & Gordon, 2019), but none, as far as we have observed, use counterfactuals as adversarial examples to analyse model performance on feature values. While some works have used counterfactuals to analyse predictive weaknesses of a model (Sokol & Flach, 2019), none take advantage of counterfactuals generated for constrained regions in the data to analyse local robustness.

Fig. 3 .
Fig. 3. Shapley values for the original out-of-time set.
1. Calculate the ranges for each variable. Find the minimum min_j and maximum max_j of each feature x_j, j = 1, 2, …, n, from the set of all observations x_i, i = 1, 2, …, N.
2. Generate new observations. For each feature x_j of every observation x_i, generate a new value from the range [min_j, max_j] and replace the original value with it.
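The two steps can be sketched as follows (uniform resampling within [min_j, max_j] is an assumption; the text does not state the sampling distribution used):

```python
import numpy as np

def perturb_features(X, seed=0):
    """Steps 1-2 above: compute min_j and max_j for each feature over all
    observations, then redraw every feature value from [min_j, max_j]
    (uniform sampling is one plausible choice, assumed here)."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)   # per-feature ranges
    return rng.uniform(lo, hi, size=X.shape)
```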

Table 1
AUC for Logit and SGB models on the Nationwide dataset.

Table 2
Average robustness scores for selected features on the Nationwide dataset. Approximate running time for Logit was 319 s, for SGB 365 s.

Table 3
Average robustness scores for the Nationwide dataset on 3 bin feature constrained counterfactuals. Approximate running time for 3 binned SGB was 565 s.

Table 4
Average robustness scores for the Nationwide dataset on 2 bin feature constrained counterfactuals. Approximate running time for 2 binned SGB was 503 s.

SGB models are evaluated both on perturbed Nationwide datasets with increased default rate only and on datasets with increased default rates plus perturbed feature values. For stress scenario testing, the GA generates counterfactuals based on the reasoning of the trained model and takes the test set as input. To analyse the robustness of the model, we first train the model on the perturbed datasets with increased bad rates and/or perturbed features, followed by testing on the unperturbed (original Nationwide) dataset. By following this pipeline, we can observe model behaviour given data representing stress scenarios, and the counterfactuals generated are based on the stressed model's reasoning on the test set. The robustness score in this context reflects stressed model reasoning on non-stressed data. The following evaluation is done on 4 separate versions of the Nationwide dataset: using the kNNs, we increase the positive bad rate to 12% and 18%.

Table 5
Feature ranges for the Nationwide dataset.

Table 6
Average robustness scores for the Nationwide dataset with 12% bad rate. Approximate running time for 3 binned SGB was 559 s.

Table 7
Average robustness scores for the Nationwide dataset with 12% bad rate and feature perturbations. Approximate running time for 3 binned SGB was 571 s.

Table 8
Average robustness scores for the Nationwide dataset with 18% bad rate. Approximate running time for 3 binned SGB was 581 s.

Table 9
Average robustness scores for the Nationwide dataset with 18% bad rate and feature perturbations. Approximate running time for 3 binned SGB was 573 s.