Edinburgh Research Explorer A two-stage Bayesian network model for corporate bankruptcy prediction

We develop a Bayesian network (LASSO-BN) model for firm bankruptcy prediction. We select financial ratios via the Least Absolute Shrinkage and Selection Operator (LASSO), establish the BN topology, and estimate model parameters. Our empirical results, based on 32,344 US firms from 1961 – 2018, show that the LASSO-BN model outperforms most alternative methods except the deep neural network. Crucially, the model provides a clear interpretation of its internal functionality by describing the logic of how conditional default probabilities are obtained from selected variables. Thus our model represents a major step towards interpretable machine learning models with strong performance and is relevant to investors and policymakers.


| INTRODUCTION
Corporate bankruptcy is a serious issue in financial markets due to its damaging economic and social consequences. As a result, the academic community, the financial industry, and regulators are keen to explore the reasons behind it and ways to predict and prevent it. In the literature, early studies such as Altman (1968), Ohlson (1980), and Zmijewski (1984) document that accounting ratios and stock market data contain valuable information for assessing firm financial health.
More recently, forecasting firm default probability has attracted a lot of attention in the financial technology literature, as state-of-the-art computational methods allow us to develop models that predict default with great precision (see Chen et al., 2019; Goldstein et al., 2019, for example). These include the logit model (Tian et al., 2015), the support vector machine (Liang et al., 2016), the random forest (Chandra et al., 2009), and the deep neural network (Cerchiello et al., 2017). Empirical evidence suggests that default forecasting performance can be improved by selecting the most relevant variables via the least absolute shrinkage and selection operator (LASSO) (Tian et al., 2015); by including new heterogeneous features such as textual information (Mai et al., 2019); or by employing complicated deep neural network models (Cerchiello et al., 2017), which consist of a number of layers, each armed with numerous hidden neurons, and exhibit a strong capability to capture the relationship between input variables and output bankruptcy forecasts.
Our paper is motivated by this strand of literature, but its contribution lies in developing an interpretable machine learning model that not only performs well empirically but also reveals the mechanism through which bankruptcy forecasts are obtained from input variables; that is, it paints a clear picture of the model's internal functionality. Our paper thus addresses a growing call for model interpretability in an age when increasingly sophisticated machine learning models and big data make the decision-making process obscure. Doshi-Velez and Kim (2017) indicate that opening the black box is not about understanding all bits and bytes of the model but instead knowing the logic of the internal functionality behind the downstream conclusions; Mittelstadt et al. (2018) acknowledge the need for this but express concern that, with complicated internal states and millions of interdependent values, the black box is difficult to open up.
In this paper, we adopt the Bayesian network model, a powerful machine learning tool for handling uncertainty and multi-faceted relationships with a combination of domain knowledge and data-driven modelling (Liu et al., 2018). It has enjoyed great success in healthcare diagnosis, predicting survival for Alzheimer's disease, heart disease, breast cancer, and so forth (Liu et al., 2018; Lu et al., 2016; Seixas et al., 2014). To the best of our knowledge, the Bayesian network has not been implemented in default probability prediction, an area similar in nature to healthcare diagnosis, which makes the Bayesian network an appropriate method for our purpose.
Methodologically, we perform the LASSO in the first stage to select the most relevant accounting and financial variables (Tian et al., 2015). In the second stage, we construct the Bayesian network structure from the selected variables and estimate parameters for the conditional probability via the expectation-maximization (EM) algorithm. The same selected variables are also used in the alternative models in the empirical analyses, including the logistic regression, the decision tree, the support vector machine, and the deep neural network. Our data contain quarterly COMPUSTAT accounting and financial information from January 1961 to August 2018, with 31 variables for 32,344 firms and more than 1.5 million firm-quarter observations in total.
Our empirical analyses reveal that the Bayesian network model achieves the second most accurate forecasts and is only outperformed by the complex deep neural network model with three hidden layers. More importantly, once we identify the dependence structure of the Bayesian network, we are able to explain clearly how the model arrives at the conditional probability of default, and how the default probability varies with changes in input variables. In this way, the Bayesian network is able to address what-if questions in ad-hoc scenarios, such as what a firm could do differently to achieve a better health status. This allows us to construct a bankruptcy probability surface by changing input variables in company financial statements. In other words, we are able to gauge the sensitivity of conditional default probabilities with respect to variations in input variables.
Hence, our paper makes three contributions to the literature. First, as far as we are aware, this is the first study that balances the performance and interpretability of a machine learning model in predicting firm bankruptcy probability, as the existing data science literature is yet to embrace the interpretability issue. Second, given the clear internal functionality of the Bayesian network model, we are able to draw probability surfaces for variables of interest and perform sensitivity and scenario analyses to address what-if questions such as how bankruptcy probabilities change with regard to a particular input variable. We believe that this is the first ad-hoc scenario analysis in bankruptcy prediction. Finally, we offer solid empirical evidence that the Bayesian network model is a promising tool for predicting conditional default probability with precision. Our paper showcases a meaningful application of this powerful method and is relevant to investors, portfolio managers, and regulators. It also points to a promising avenue to which the Bayesian network can make a substantial contribution.
The rest of this paper proceeds as follows. We review relevant strands of the literature in Section 2. The two-stage Bayesian network model and other methodological issues are discussed in Section 3. In Section 4, we introduce the data, perform empirical analyses, and undertake a robustness check. Finally, Section 5 concludes.

| LITERATURE REVIEW
In this section, we review relevant studies that focus on the use of machine learning models in predicting corporate default, on the Bayesian network model, and on the LASSO method.

| Machine learning models
Since the seminal work of Altman (1968), Ohlson (1980), and Zmijewski (1984), predicting corporate bankruptcy has long been a topical issue in the literature. Mai et al. (2019) conduct a recent review of this area and note that, methodologically, many studies focus on machine learning models due to their estimation precision. In Table 1, we provide a partial summary of studies in the past 3 years, all of which feature a model in the machine learning family, including the logistic regression, decision tree, random forest, support vector machine, and deep neural network. We further classify them into two groups, the interpretable and the non-interpretable ones, according to Mittelstadt et al. (2018). Only the simple logistic regression and tree-based models, serving as benchmarks, can be considered interpretable models.
It is worth noting that Gogas et al. (2018) develop a geometric interpretable model and use it as a stress testing tool to visualize the classification space with two variables as well as the linear decision boundary.By calculating the distance of the data to the decision boundary and simulating certain scenarios, the tool provides an effective interpretation for the model results and partially answers the what-if question of changing the critical variable values.However, this study assumes the fail and alive cases are linearly separable by two selected variables, which is contrary to findings in most papers in Table 1.

| Bayesian network model
The Bayesian network is able to capture relationships and probability distributions to enhance ontology inference capability in the diagnosis of a variety of diseases. Hence, it is often implemented in healthcare diagnosis for medical ontology probabilistic inference, typically achieved via the K2 greedy algorithm. Delen et al. (2019) perform the Bayesian network with the elastic net variable selection method to understand and predict the prominent variables that determine student attrition and achieve an accuracy as high as 84%. However, no comparison is conducted between the Bayesian network and alternative approaches. Dag et al. (2016) use the Bayesian network to predict heart transplant survival. They adopt different selection methods to generate a set of potential predictors with medically relevant variables and construct the Bayesian network from the selected predictors. The Bayesian network not only achieves predictive performance similar to the best-performing approaches in the literature but also provides the interactive relations among the predictors and the conditional survival probability. Meanwhile, the Bayesian network has been implemented in project management (Hu et al., 2013; Yet et al., 2016), cyber-security assessment (Zhang et al., 2018), and stock index forecasting (Malagrino et al., 2018). In a pioneering study, Sun and Shenoy (2007) use a Naïve Bayesian network to assess bankruptcy. The bankruptcy predictors are selected by a heuristic method and a Naïve Bayesian network is constructed based on these predictors. However, the Naïve Bayesian network does not contain any topology or hierarchical logic among the predictors, as it considers parallel impacts of all predictors on the output.

| Lasso
Introduced by Tibshirani (1996), the LASSO is a powerful method for variable selection widely adopted in economics and finance. It has been successfully implemented in the literature for predicting stock returns using intraday NYSE data (Chinco et al., 2019), corporate bankruptcy (Amendola et al., 2011; Tian et al., 2015), corporate bond recovery rates (Nazemi and Fabozzi, 2018), and macroeconomic time series (Bai and Ng, 2008; Kim and Swanson, 2014). Tian et al. (2015) apply the LASSO to forecast corporate bankruptcy in the US and achieve strong out-of-sample performance, whereas Rapach et al. (2013) show that the LASSO outperforms a backward or forward stepwise regression.

| BAYESIAN NETWORK WITH LASSO
In this section, we first outline the Bayesian network with a simple illustration. We then introduce the LASSO selection method in our two-stage Bayesian network model. Alternative machine learning models are also discussed.

| The Bayesian network
A Bayesian network is a directed graph that encodes the latent probabilistic relationships between variables of interest in a reasoning representation problem (Heckerman et al., 1995; Lauritzen, 1995). The representation usually starts from domain knowledge, constructs a prior network, and combines it with the observed data to learn a new Bayesian network (Heckerman et al., 1995). In this framework, a variable is termed a node, vertex, or point. The nodes are connected by directed arrows indicating probabilistic dependencies.
To illustrate, we assume that bank failure is caused by two variables: Total Capital (C) and Risk-adjusted Capital Ratio (R), shown in Figure 1. The two arrows pointing from Total Capital (C) and Risk-adjusted Capital Ratio (R), respectively, to bank failure (F) suggest that F depends on R and C. Meanwhile, the arrow pointing from C to R indicates that the Risk-adjusted Capital Ratio also depends on Total Capital. In this example, Total Capital (C) and Risk-adjusted Capital Ratio (R) are parent variables of bank failure (F), and Total Capital (C) is also a parent of Risk-adjusted Capital Ratio (R).
The joint probability of bank failure (F), Total Capital (C), and Risk-adjusted Capital Ratio (R) can be expressed as follows:

P(F, C, R) = P(F | C, R) P(R | C) P(C).

We are usually interested in addressing the following question: given an observed Risk-adjusted Capital Ratio (R), what is the probability of bank failure (F)? The answer can be evaluated by the conditional probability as follows:

P(F | R) = P(F, R) / P(R) = Σ_C P(F, C, R) / Σ_C Σ_F P(F, C, R).

[Figure 1. A simple Bayesian network example in which bank failure depends on two variables, the Risk-adjusted Capital Ratio (R) and the Total Capital (C), and the former variable also directly depends on the latter.]

In general, a Bayesian network model can be summarized as follows. Suppose we have a domain of discrete variables U = {x_1, ..., x_n} and a set of cases D = {C_1, ..., C_m}. Our main interest is in determining the probability distribution p(C_{m+1} | D, ξ), which is the probability of a new case C_{m+1} given the set of past observations D and current information ξ. In Bayesian network models, we do not intend to recover the complete distribution. Instead, we assume that the distribution of the data is generated from a latent structural network B_s with a number of unknown parameters. Hence, the probability p(C_{m+1} | D, ξ) with the structural network B_s can be expressed as follows:

p(C_{m+1} | D, ξ) = Σ_{B_s} p(C_{m+1} | B_s, D, ξ) p(B_s | D, ξ).

The structural network B_s reflects our belief about the variables and the relationships between them based on domain knowledge. In most cases, however, B_s is unknown even though the variables U and case observations D are available. Two methods are usually adopted to construct B_s: the domain expert heuristic method employed by Chakraborty et al. (2016), and the statistical structure learning algorithm used by Heckerman et al. (1995) and Liu et al. (2018). In this paper, we implement the structural learning method due to the lack of domain knowledge, as in most practical cases.
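To make the factorization concrete, here is a minimal numerical sketch of the Figure 1 network in Python. All probability values are invented purely for illustration and do not come from the paper.

```python
# Toy three-node network from the Figure 1 example: C -> R, C -> F, R -> F.
# Every probability value below is made up for illustration only.

p_C = {"high": 0.7, "low": 0.3}                      # P(C)
p_R_given_C = {                                      # P(R | C)
    "high": {"high": 0.8, "low": 0.2},
    "low":  {"high": 0.3, "low": 0.7},
}
p_F_given_CR = {                                     # P(F = fail | C, R)
    ("high", "high"): 0.01,
    ("high", "low"):  0.10,
    ("low",  "high"): 0.05,
    ("low",  "low"):  0.40,
}

def joint(c, r, f):
    """Chain rule: P(F, C, R) = P(C) * P(R | C) * P(F | C, R)."""
    p_fail = p_F_given_CR[(c, r)]
    return p_C[c] * p_R_given_C[c][r] * (p_fail if f else 1 - p_fail)

def p_fail_given_r(r):
    """P(F = fail | R = r): sum the joint over C, then normalize by P(R = r)."""
    num = sum(joint(c, r, True) for c in p_C)
    den = sum(joint(c, r, f) for c in p_C for f in (True, False))
    return num / den

print(round(p_fail_given_r("low"), 4))   # prints 0.28
```

Conditioning on a low capital ratio raises the failure probability relative to a high one, which is exactly the kind of query the network is built to answer.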
To implement the statistical structure learning algorithm, our first step is structure learning; that is, we identify the interactive relations between variables and specify the topology of the framework in order to construct a Bayesian network. Once we obtain the network, in parameter learning we determine the parameters of the network and define the joint probability representing the statistical behaviour of the observed data (Heckerman et al., 1995; Lauritzen, 1995; Liu et al., 2018).

| Structure learning
Structure learning can be performed primarily in three ways: the search-score method, the constraint-learning method, and the dynamic programming based method. Among them, the search-score method is suitable for problems with large volumes of data (Daly et al., 2011; Heckerman et al., 1995). In our paper, we construct our Bayesian network model with the popular K2 search-score algorithm (Cooper and Herskovits, 1992; Feng et al., 2014; Garvey et al., 2015).
Specifically, we assume a domain of n discrete variables U = {x_1, ..., x_n} with an ordering of the variables, a set of cases D = {C_1, ..., C_m}, and an upper limit u on the number of parents a variable may have. The algorithm heuristically searches for the most appropriate belief-network structure based on D. In the initial stage, an empty set π_i is created as the set of parents of variable x_i, i = 1, ..., n. A function pred(x_i) is defined to represent the set of variables preceding x_i. For each variable x_i in U, we calculate the score P_old = f(i, π_i) as follows:

f(i, π_i) = ∏_{j=1}^{q_i} [ (r_i − 1)! / (N_ij + r_i − 1)! ] ∏_{k=1}^{r_i} α_ijk!,

where q_i = |φ_i|, and φ_i is the list of all possible instantiations of the parents of x_i in D; r_i = |V_i|, and V_i lists all possible values of the variable x_i; α_ijk is the number of cases in D in which the variable x_i is instantiated with its kth value and the parents of x_i in π_i are instantiated with the jth instantiation in φ_i; and N_ij = Σ_{k=1}^{r_i} α_ijk is the number of cases in D in which the parents of x_i in π_i are instantiated with the jth instantiation in φ_i.
The score f(i, π_i) can be interpreted as the probability of the case set D given that the parents of x_i are π_i. When the number of variables in π_i is less than u, the variables x_m in pred(x_i) are iteratively added to π_i, and the probabilistic score is updated whenever adding x_m to the set π_i improves it. In this way, the K2 algorithm finds a network structure B_s over the variables in U such that each node in B_s has at most u parents and the achieved probabilistic score metric is larger than a pre-defined real value p.
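The scoring step can be sketched as follows. This is a log-space version of the Cooper-Herskovits metric f(i, π_i) on a toy dataset; the variable names and data are invented for illustration and this is not the paper's implementation.

```python
from itertools import product
from math import lgamma

def log_factorial(n):
    # log(n!) computed stably via the log-gamma function
    return lgamma(n + 1)

def ch_score(data, child, parents, values):
    """Log of the Cooper-Herskovits metric f(i, pi_i) for one node.

    data    : list of dicts mapping variable name -> observed value
    child   : the variable x_i being scored
    parents : tuple of parent variable names (the candidate pi_i)
    values  : dict mapping each variable name -> list of possible values
    """
    r_i = len(values[child])
    parent_configs = list(product(*(values[p] for p in parents))) or [()]
    score = 0.0
    for config in parent_configs:          # j-th instantiation of the parents
        rows = [row for row in data
                if all(row[p] == v for p, v in zip(parents, config))]
        n_ij = len(rows)                   # N_ij = sum_k alpha_ijk
        score += log_factorial(r_i - 1) - log_factorial(n_ij + r_i - 1)
        for k in values[child]:
            alpha_ijk = sum(1 for row in rows if row[child] == k)
            score += log_factorial(alpha_ijk)
    return score

# Toy data: b copies a perfectly, so adding a as a parent raises the score.
data = [{"a": 0, "b": 0}] * 3 + [{"a": 1, "b": 1}] * 3
values = {"a": [0, 1], "b": [0, 1]}
print(ch_score(data, "b", ("a",), values) > ch_score(data, "b", (), values))  # prints True
```

K2 wraps this score in a greedy loop: for each node, it keeps adding the predecessor that most improves the score, stopping when no addition helps or the parent limit u is reached.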

| Parameter learning
The parameters of the Bayesian network, θ_ijk, are the conditional probabilities of the node x_i in U taking its kth value given that its parent set π_i takes its jth instantiation:

θ_ijk = p(x_i = k | π_i = j).

The parameters can be determined by the expectation-maximization (EM) algorithm, a well-known approach for maximum likelihood estimation in models with latent structure (Dempster et al., 1977; Green, 1990; Lauritzen, 1995). Green (1990) introduces an EM algorithm for estimating the penalized likelihood, which exhibits a more efficient convergence rate than the traditional EM algorithm. Following Green (1990) and Lauritzen (1995), we consider the log-likelihood log f(X | θ) given the observed data, where X is the learned variable based on the complete data with density function f, and y is the observed data. The EM algorithm features a recursive process of two steps. First, the expectation step (E-step) fixes θ and calculates the expected value

Q(θ′ | θ) = E_θ[ log f(X | θ′) | y ].

Second, the maximization step (M-step) finds the θ values that maximize Q(θ′ | θ). At the E-step, following Green (1990), we add a penalty J(θ′) to the log-likelihood function:

Q_p(θ′ | θ) = E_θ[ log f(X | θ′) | y ] − J(θ′),

where J(θ′) is a function proportional to a prior density. The M-step then maximizes the penalized log-likelihood function.
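The E-step / M-step loop can be illustrated with a classic toy problem: two biased coins where the identity of the coin behind each block of flips is latent. All numbers here are invented, and the sketch uses the plain (unpenalized) likelihood rather than the penalized version of Green (1990).

```python
import numpy as np

# Toy EM: two biased coins; we observe heads counts per block of flips but
# not which coin produced each block (the latent variable).

rng = np.random.default_rng(0)
n_flips = 20
true_theta = (0.8, 0.3)
blocks = np.array([rng.binomial(n_flips, true_theta[rng.integers(2)])
                   for _ in range(200)])

theta = np.array([0.6, 0.4])        # initial guess
for _ in range(100):
    # E-step: with theta fixed, compute each coin's posterior
    # responsibility for each block.
    like = np.array([
        theta[c] ** blocks * (1 - theta[c]) ** (n_flips - blocks)
        for c in (0, 1)
    ])
    resp = like / like.sum(axis=0)
    # M-step: re-estimate theta to maximize the expected log-likelihood,
    # i.e. responsibility-weighted heads over responsibility-weighted flips.
    theta = (resp * blocks).sum(axis=1) / (resp * n_flips).sum(axis=1)

print(np.round(np.sort(theta), 2))  # typically close to the true (0.3, 0.8)
```

In the Bayesian network setting, the same two steps alternate between imputing expected counts for the CPT entries θ_ijk and re-normalizing them, with the penalty J added at the E-step.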

| LASSO selection method
Following Tian et al. (2015), we estimate the LASSO parameters by minimizing the negative log-likelihood of a discrete hazard function with a penalty on the sum of the absolute values of the covariate parameters. The discrete hazard function is given as follows:

P(Y_{i,t+N} = 1 | X_{i,t}) = 1 / (1 + exp(−α − β′ X_{i,t})),

where X_{i,t} is a vector of time-varying predictive variables observed for quarter t, and i is the firm index. The variable Y_{i,t+N} is the default label, which equals one if firm i files for bankruptcy protection at t + N given that it survives the N − 1 quarters from time t to t + N − 1. The negative log-likelihood function with a penalty on the sum of the absolute values of the covariate parameters is specified as follows:

min_{α, β} −(1/n) Σ_i log L_i(α, β) + λ Σ_{j=1}^{p} |β_j|,

where L_i(α, β) is the likelihood contribution of firm i under the hazard model, n is the number of firms, p is the number of predictive variables in the hazard model, and λ is the penalty parameter. Following Tibshirani (1996), we employ ten-fold cross-validation for parameter estimation.
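The selection idea can be sketched with an L1-penalized logistic regression on synthetic data, standing in for the discrete-hazard LASSO that the paper fits with glmnet. The data-generating process, variable counts, and penalty value below are all invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the first-stage selection: 10 candidate "ratios",
# of which only the first 3 truly drive the binary default outcome.
rng = np.random.default_rng(1)
n, p = 2000, 10
X = rng.normal(size=(n, p))
true_beta = np.zeros(p)
true_beta[:3] = [1.5, -1.0, 0.8]          # informative variables
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ true_beta))))

# C is the inverse penalty strength; in practice it would be chosen by
# 10-fold cross-validation, as in Tibshirani (1996).
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
lasso.fit(X, y)

# Variables surviving the L1 penalty with non-zero coefficients.
selected = np.flatnonzero(np.abs(lasso.coef_[0]) > 1e-8)
print(selected)
```

The L1 penalty zeroes out most of the uninformative columns while keeping the three true predictors, mirroring how the paper reduces 31 candidate ratios to 16.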

| DATA AND EMPIRICAL ANALYSES
In this section, we first introduce our data. In the empirical analyses, we begin by discussing the accounting and financial variables selected by the LASSO. We then compare the bankruptcy prediction accuracy of alternative models and highlight the interpretability of the models. Finally, we perform a subsample analysis to check the robustness of the baseline results with respect to the sample period.

| Data
We use quarterly COMPUSTAT data from January 1961 to August 2018 for 31 accounting variables as candidate variables for bankruptcy prediction, following the existing literature (see Amendola et al., 2011; Bharath and Shumway, 2008; Campbell et al., 2008; Ding et al., 2012; Liang et al., 2016; Mai et al., 2019, for example). When constructing the accounting-based predictors, we align a firm's fiscal year with the calendar year to ensure that the accounting information is observable to investors at the time of prediction. Because we use quarterly data, we lag all variables by a quarter. Furthermore, we remove observations in the top and bottom one percentile of each variable following Tian et al. (2015). Our final dataset contains 1,563,010 firm-quarter observations for 32,344 firms. The descriptive statistics of the accounting variables are shown in Table 2. As some variables are scaled (such as current assets over current liabilities, ACTLCT) whereas others are in monetary terms (such as total assets, TASSET), the descriptive statistics vary to a large extent. Our bankruptcy indicator is based on the Reason for Deletion variable dlrsn in COMPUSTAT. A firm is classified as defaulted, with its default indicator set to one, if it is de-listed from the stock exchange due to liquidation or bankruptcy; otherwise the indicator is zero. In total, we identify 16,924 bankruptcy and liquidation filings over the sample period. In Figure 2 we show the occurrence of defaults by year. We observe two clear peaks, one from 1982 to 1991 and one during the Great Recession of 2007 to 2008.

| Alternative models
Among the most popular bankruptcy prediction models reviewed in Liang et al. (2016) and Lin et al. (2012), we employ the logistic regression (LR) and the decision tree (Tree), which are simple and interpretable models. We also implement the support vector machine (SVM), which is of modest complexity. The deep neural network (DNN) is selected as a complex model, and we follow a standard specification, DNN(50,30,20), with three hidden layers of 50, 30, and 20 hidden neurons, respectively (Goodfellow et al., 2016). Thus we include four models in addition to the Bayesian network model to assess their interpretability and prediction accuracy. They are applied to the same variables selected by the LASSO model.
All models are implemented via R packages. The LR is based on the ISLR package; the classification and regression trees (CART) model of Huang (2014) is based on the tree package; the SVM with a radial kernel is based on the e1071 package; and the DNN is based on the h2o package with the H2O cloud computing backend. To train the CART and SVM models, 10-fold cross-validation is applied to achieve a stable and optimal selection of parameters.
Furthermore, the DNN is trained by the stochastic gradient descent algorithm with 200 epochs. To avoid overfitting, the traditional L2 regularization is applied to penalize the weights. The dropout method of Srivastava et al. (2014) is performed to randomly omit a subset of hidden neurons at each iteration of the training process. Early stopping, as suggested in Bengio (2012), is also implemented: the performance on the training and validation sets is monitored, and training stops early when training-set performance keeps improving but validation-set performance does not. The Bayesian network model is implemented with the bnlearn package, and the LASSO is performed with the glmnet package.
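A rough stand-in for the DNN(50,30,20) baseline can be sketched with scikit-learn (the paper itself uses the R h2o package). Here `alpha` supplies the L2 penalty and `early_stopping` monitors a held-out validation split; dropout, which scikit-learn's MLP does not support, is omitted. The dataset is synthetic and all hyperparameter values are illustrative.

```python
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification

# Synthetic imbalanced data loosely mimicking 16 predictors and ~10% defaults.
X, y = make_classification(n_samples=3000, n_features=16,
                           n_informative=8, weights=[0.9, 0.1],
                           random_state=0)

dnn = MLPClassifier(hidden_layer_sizes=(50, 30, 20),  # DNN(50,30,20)
                    solver="sgd", learning_rate_init=0.01,
                    alpha=1e-4,              # L2 weight penalty
                    early_stopping=True,     # stop on validation plateau
                    n_iter_no_change=10, max_iter=200,
                    random_state=0)
dnn.fit(X, y)
print(round(dnn.score(X, y), 2))
```

The same three-layer shape, L2 penalty, and validation-based stopping rule carry over to the paper's h2o specification.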

| LASSO selection results
We identify the most relevant predictors via a 10-fold cross-validation LASSO that optimizes the penalty parameter λ. The selected variables are summarized in Table 3, Panel A. As we can see, 16 out of the 31 potential predictive variables exhibit non-zero coefficients. This is more than the number of variables identified in Tian et al. (2015).
The selected variables mainly concern a firm's leverage, liquidity, profitability, and market based variables.First of all, the leverage ratio of total liability over total assets (LTAT) is the most influential with the largest coefficient.This is in line with Campbell et al. (2008), Ohlson (1980), and Zmijewski (1984).The other leverage ratio chosen by the LASSO is a book leverage measure of total debts over total assets (FAT), also chosen in Tian et al. (2015), which contains information about future default risk.We notice that the market leverage measure of total liabilities over the sum of market equity and total liabilities (LTMTA), which is heavily influenced by stock prices, is not selected.Hence, the book leverage ratio may convey more information than the market leverage measure.
The relevance of market based variables is eloquently argued in Campbell et al. (2008), which suggests that the logarithmic market capitalization (RSIZE) and the logarithmic stock price (PRICE) are important. As Tian et al. (2015) point out, the information conveyed in PRICE is forward looking, whereas RSIZE reflects the true value of a firm. These two variables exhibit the second and third largest coefficients in the LASSO selection results.
Six liquidity ratios are chosen, including inventory change over inventories (INVCHINVT), working capital over total assets (WCAPAT), current liabilities over total liabilities (LCTLT), current assets over current liabilities (ACTLCT), current liabilities over total assets (LCTAT), and cash and short-term investments over total assets (CASHAT). A lack of liquidity is more likely to increase default risk than to cause bankruptcy directly. The selection of ACTLCT (the current ratio) and WCAPAT (working capital over total assets) is consistent with previous research (see Chava and Jarrow, 2004; Ohlson, 1980; Shumway, 2001, for example). Furthermore, the inventory variable (INVCHINVT), the percentage of current liabilities (LCTLT), the current liability coverage (LCTAT), and the cash and short-term investment variable (CASHAT) all capture different aspects of a firm's liquidity.
Also essential in predicting bankruptcy are profitability ratios.Retained earnings over total assets (REAT) receives the most attention.This choice is consistent with Altman (1968), Chava and Jarrow (2004), and Shumway (2001), all of which show the impact of cumulative profitability on reducing the bankruptcy probability.The other four profitability measures, retained earnings over current liabilities (RELCT), net income over total assets (NIAT), net income over the total of market equity and total liabilities (NIMTA), and operating income over sales (OIADPSALE), imply that cumulative and current period profitability help reduce bankruptcy risk, but to a lesser degree.
In addition to these major accounting and financial ratios, the LASSO also picks sales over total assets (SALEAT): the higher the sales turnover, the lower the bankruptcy risk (Altman, 1968;Shumway, 2001).

| Prediction accuracy
To provide a comprehensive analysis, our bankruptcy prediction horizon ranges from one to 12 quarters ahead. We use two popular measures, the accuracy ratio (ACCU) and the area under the Receiver Operating Characteristic (ROC) curve (AUC), to evaluate the performance of the alternative models, following Liang et al. (2016), Mai et al. (2019), and Tian et al. (2015). The ACCU is based on the cumulative accuracy profile (CAP), which measures the percentage of truly bankrupt firms captured when a given percentage of observations is selected according to the forecast probabilities, sorted from highest to lowest, generated by a model (Engelmann et al., 2003; Mai et al., 2019). The baseline model assigns class labels randomly. The accuracy ratio of a forecasting model is the difference in area between the CAP of the model and that of the baseline model. The AUC is an equally popular measure of the overall performance of a model. It is calculated from the ROC curve, which shows the capability of a model to balance the false positive rate and the true positive rate. The area under the ROC curve provides a measure of the overall performance and the corresponding robustness of the model.
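The two measures are closely linked: for a binary outcome, Engelmann et al. (2003) show that the accuracy ratio equals 2 × AUC − 1, which gives a quick consistency check. The scores and labels below are simulated for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
y = rng.binomial(1, 0.1, size=5000)          # ~10% defaults
# Toy scores: defaulters tend to receive higher predicted risk.
scores = rng.normal(loc=y * 1.5, scale=1.0)

auc = roc_auc_score(y, scores)               # area under the ROC curve
accuracy_ratio = 2 * auc - 1                 # CAP-based Gini / accuracy ratio
print(round(auc, 3), round(accuracy_ratio, 3))
```

A random classifier gives AUC = 0.5 and an accuracy ratio of 0, while a perfect one gives AUC = 1 and an accuracy ratio of 1, matching the CAP baseline described above.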
In the empirical analyses, we examine the forecasting performance of the alternative models based on two groups of prediction variables. In Group 1, we follow the LASSO result reported in Table 3 and use all 16 selected variables, whereas in Group 2 we only use the 10 top-ranked variables. This is because the absolute values of the LASSO coefficients for the variables ranked 11th to 16th are lower than 1E-5, much closer to zero than those of the top 10 ranked variables. Furthermore, a model with fewer variables affords better interpretation, as only the most relevant variables are included.
Panels A and B in Table 4 contain the ACCU and AUC for bankruptcy predictions over one to 12 quarters ahead generated by the Bayesian network model and the alternative ones via, respectively, Group 1 and Group 2 variables. In Panel A, we note that the AUC values of all models are above 0.76 over forecasting horizons of up to 1 year. The DNN(50,30,20) and the Bayesian network model perform the best at 0.9003 and 0.8951, respectively, over the one-quarter horizon. Over longer forecasting horizons, the prediction accuracies gradually decrease, whereas across the models the prediction accuracy increases from the more traditional models such as the LR to the state-of-the-art large-scale neural network. The DNN(50,30,20) exhibits the best performance in terms of AUC and ACCU, while the Bayesian network model comes second. It is interesting to note that the Bayesian network model easily beats the other three less sophisticated ones. The LR and the decision tree model exhibit the lowest accuracy and the smallest ACCU and AUC.
In Panel B, with fewer variables and less information to draw upon, the prediction accuracy drops for all models. However, we still find that, similar to the results in Panel A, increasing model complexity leads to more accurate bankruptcy prediction. Interestingly, over longer forecasting horizons the Bayesian network seems to make better use of the information content of the variables and tends to outperform the DNN(50,30,20). For example, over 10 to 12 quarters ahead, the AUC for the Bayesian network model is 0.7416, 0.7289, and 0.7259, respectively, whereas the corresponding AUC for the DNN(50,30,20) is marginally lower at 0.7414, 0.7268, and 0.7257. Similar patterns hold for the ACCU over the longest forecasting horizons.

| Model interpretability
Interpretability refers to the transparency of a model's internal function and its degree of human comprehensibility (Doshi-Velez and Kim, 2017; Mittelstadt et al., 2018). Recently, Bastani et al. (2017) and Ribeiro et al. (2016) pursue two classes of approximate models, linear models and decision tree type models, precisely because of model interpretability. In Table 4, approximate models such as the LR and the decision tree, whose internal functions can be easily interpreted from their structures and corresponding parameters, fare poorly empirically. Meanwhile, more complex models such as the SVM and the DNN do not offer easily observable or comprehensible structures but perform well empirically. The Bayesian network model, for its part, exhibits a smaller scale and simpler structure than the DNN model with comparable forecasting accuracy. More importantly, it sheds light on the internal reasoning logic through its structural network. Below we scrutinize model interpretability in detail based on the models' 1-year ahead forecasting performance.

| Logistic regression
It is fairly easy to interpret the LR model by looking at the variable coefficients and their statistical significance in Table 5. For example, the ratio of firm net income over market equity plus total liabilities (NIMTA) and the scaled market capitalization (RSIZE) exhibit the largest coefficients at −0.283 and −0.171, respectively, and both are highly significant at the 1% level. Hence, these two variables are the most influential in determining future bankruptcy; they suggest that the larger the relative net income and market capitalization of a firm, the lower its probability of default.

| Decision tree
Figure 3 shows the decision tree for forecasting bankruptcy over the next year using all 16 LASSO-identified variables. We observe that the model constructs the decision tree with only 12 variables: OIADPSALE, INVTSALE, QALCT, RELCT, LTAT, PRICE, FAT, CASHMTA, REAT, NIAT, TSALE, and APSALE. The decision tree provides a graphical and self-explanatory interpretation of the internal function. However, this simplistic structure yields an average AUC of 0.7581 and an accuracy ratio of 0.7431, which are among the lowest across the alternative models.
Although the structure and coefficients of the above two models offer transparent functionality and can be comprehended relatively easily in their entirety, the models lack accuracy in handling the prediction problem, making them local approximations that capture only a partial view, or slice, of the entire problem (Mittelstadt et al., 2018).

| Bayesian network
Figure 4 illustrates two structures of the Bayesian network for forecasting bankruptcy four quarters ahead, based on the Group 1 and Group 2 variables. In Figure 4(a), the complex interleaved arrows show the interdependence of the 16 variables and dlrsn, the state of bankruptcy. We see that bankruptcy is directly determined by only eight crucial variables: ACTLCT, CASHAT, FAT, NIMTA, PRICE, RSIZE, WCAPAT, and LTAT. They cover key aspects of a firm's current assets, income, cash flow, liabilities, and market capitalization. Figure 4(b), meanwhile, shows a simpler structure with a clearer relation for the decision-making process, determined by fewer variables: ACTLCT, FAT, PRICE, RSIZE, WCAPAT, and LTAT. It thus provides a simpler reasoning logic for the internal functionality. However, as the results in Table 4 Panel B suggest, the simpler structure compromises forecasting accuracy. Hence, a natural trade-off exists between the simplicity and interpretability of the model and its forecasting performance.
Furthermore, the structure in Figure 4(a) can be interpreted as the conditional probability Pr(dlrsn | ACTLCT, CASHAT, FAT, NIMTA, PRICE, RSIZE, WCAPAT, LTAT), which not only provides a binary true-or-false answer to the future bankruptcy problem but also yields its probability. Each variable is affected by others through a conditional probability, such as Pr(FAT | CASHAT, LTAT, REAT, WCAPAT, SALEAT), where LTAT, the ratio of total liabilities to total assets, indirectly influences the probability of bankruptcy via its impact on FAT. The conditional probability Pr(dlrsn | Scenario) is well suited to address what-if questions via scenario analysis. In Table 6, the Bayesian network model generates a default probability of 0.6867 based on real data, suggesting that the firm under scrutiny is highly likely to default in the future. Hence, we perform a scenario analysis to see what happens when key variable values change. We consider two different scenarios: in the first, a firm's financial health deteriorates with decreasing net income, cash flow, and current assets, and increasing debt and liabilities; in the second, its financial status improves with higher income, cash flow, and current assets and lower debt and liabilities. Specifically, in the first scenario, ACTLCT, CASHAT, and NIMTA are reduced below 1, 0.01, and 0.0001, respectively, while FAT and LTAT are increased above 0.8 and 1, respectively. As we expect, the bankruptcy probability shoots up to 0.9789, an extremely risky situation: with fewer assets and less income but more debt, the firm is almost surely going to default. In the second scenario, a financially healthy firm with more assets, more income, and less debt exhibits an extremely low bankruptcy probability of 0.0465. The scenario analysis not only points to the direction of firm survival but also quantifies the default probability given specific variable values, making this analytical tool very helpful for stakeholders inside and outside the firm.
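The scenario logic above, conditioning Pr(dlrsn | Scenario) on assumed variable values, can be sketched with a toy discrete network. The two parent variables, their discretization into high/low states, and all probability tables below are hypothetical stand-ins for illustration, not the paper's fitted parameters.

```python
# Toy discrete Bayesian network: default (dlrsn) with two parents.
# All probabilities below are assumed values, not estimates from the paper.
p_ltat_high = 0.3                     # Pr(LTAT = high): heavy leverage
p_cashat_high = 0.6                   # Pr(CASHAT = high): ample cash
# Pr(dlrsn = 1 | LTAT, CASHAT): leverage up / cash down raises default risk.
p_default = {("high", "high"): 0.40, ("high", "low"): 0.85,
             ("low",  "high"): 0.03, ("low",  "low"): 0.20}

def pr_default(ltat=None, cashat=None):
    """Pr(dlrsn = 1 | evidence) by enumerating any unobserved parents."""
    num = den = 0.0
    for l in ("high", "low"):
        if ltat is not None and l != ltat:
            continue
        for c in ("high", "low"):
            if cashat is not None and c != cashat:
                continue
            w = (p_ltat_high if l == "high" else 1 - p_ltat_high) * \
                (p_cashat_high if c == "high" else 1 - p_cashat_high)
            num += w * p_default[(l, c)]
            den += w
    return num / den

print(pr_default())                           # unconditional default probability
print(pr_default(ltat="high", cashat="low"))  # stressed what-if scenario
print(pr_default(ltat="low",  cashat="high")) # healthy what-if scenario
```

Fixing evidence and re-querying the same conditional-probability structure is precisely how the what-if scenarios in Table 6 are answered.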
We are also able to perform a sensitivity analysis of the bankruptcy prediction with respect to particular variables of interest. Based on the information in Table 6, we select three pairs of influential variables and generate probability surfaces in Figure 5 to capture the impact of these variables on the bankruptcy probability. Figure 5(a) illustrates the bankruptcy probability conditional on the ratio of total liabilities to total assets (LTAT) and the ratio of current assets to current liabilities (ACTLCT). With zero current assets (ACTLCT = 0) and a high total liability ratio (LTAT = 7 × 10^4), the bankruptcy probability increases to as high as 0.8. If the total liability ratio remains at LTAT = 7 × 10^4 but the current asset ratio (ACTLCT) increases from 0 to the third quartile at 2.32, the bankruptcy probability decreases only slowly and by a tiny magnitude. This shows that a massive liability is hugely detrimental to firm solvency even with large current assets.
Figure 5(b) exhibits the bankruptcy probability conditional on the ratio of debts to total assets (FAT) and the ratio of cash to total assets (CASHAT). The pattern is similar to that in Figure 5(a) and indicates that when FAT is as high as 2 × 10^4, the bankruptcy probability remains high at 0.7 even when CASHAT reaches the third quartile at 0.69. The probability drops only to 0.5 even when CASHAT reaches an incredibly high level of 10,000. Meanwhile, if FAT drops to the third quartile at 0.268, the probability decreases almost linearly as CASHAT increases. This quantifies and highlights the importance of the debt ratio for the financial health of a firm and reveals that maintaining an appropriate level of debt is an effective way of avoiding future bankruptcy.
In Figure 5(c), we note a similar pattern: when total liabilities over total assets (LTAT) is at a high level of 7 × 10^4, the bankruptcy probability reaches 0.8 and decreases only very slowly even when the ratio of net income to total liabilities (NIMTA) grows to 15, an extremely high level for net income. However, if LTAT stays at the third quartile at 0.75, the probability decreases to 0.5 and falls further below 0.4 as NIMTA increases.
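A probability surface like those in Figure 5 is built by evaluating the conditional default probability over a grid of two inputs while holding everything else fixed. The sketch below uses a hypothetical logistic response as a stand-in for the fitted Bayesian network, so the surface values are illustrative only; the grid ranges echo the LTAT and ACTLCT ranges discussed above.

```python
import numpy as np

# Hypothetical response: a logistic stand-in for the network's conditional
# probability, used only to show how a sensitivity surface is assembled.
def bankruptcy_prob(ltat, actlct):
    return 1.0 / (1.0 + np.exp(-(-3.0 + 1.5 * np.log1p(ltat) - 0.8 * actlct)))

ltat_grid = np.linspace(0.0, 7e4, 50)     # leverage axis, up to 7 x 10^4
actlct_grid = np.linspace(0.0, 2.32, 50)  # current-ratio axis, up to 3rd quartile
L, A = np.meshgrid(ltat_grid, actlct_grid)
surface = bankruptcy_prob(L, A)           # 50 x 50 grid of default probabilities
print(surface.min(), surface.max())
```

Plotting `surface` against the two grids (e.g. with `matplotlib`'s `plot_surface`) yields a picture of the same kind as Figure 5(a): risk rising in leverage and falling in liquidity.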
To summarize, Figure 5 clearly reflects a post hoc interpretation of the model. It addresses the link between the conditional probabilities and the final outcome suggested by the model, and captures inferences from the structural conditional probabilities. The scenario and sensitivity analyses described above are of great use to investors and policymakers as they offer a detailed explanation of the theoretical and empirical functionality of the model.

| Robustness check
The empirical analyses so far are based on a long sample period from January 1961 to August 2018, which spans different business cycles and a number of financial crises. As a robustness check, we evaluate the LASSO selection and the Bayesian network model again using a shorter and more recent sample period, starting just before the Great Recession in March 2007 and ending with the sample in August 2018. This is motivated by Figure 2, which shows a recent wave of firm defaults; it is therefore interesting to see which variables the LASSO identifies and how well they predict bankruptcy. In total we have 423,012 firm-quarter observations for the robustness test.
Table 3 Panel B shows the selected variables based on the shorter sample period. We notice a large overlap between the variables selected from the whole sample and from the subsample. For example, the log market capitalization (RSIZE) and the log stock price (PRICE) remain the second and third most important variables. Among all 16 variables with non-zero coefficients, nine overlap with those in Panel A. The new list of selected variables covers quick assets (QALCT), cash (CASHMTA), inventory (INVTSALE), liability (LCTSALE), and sales (TSALE), which are very similar to the whole-sample variables of current assets (ACTLCT), cash (CASHAT), and inventory (INVCHINVT).

| CONCLUSION
There is a new call for formulating interpretable machine learning models that are not only powerful in performance but also comprehensible to investors, allowing them to make informed investment decisions. In this paper, motivated by successful applications of the Bayesian network in healthcare diagnosis, we modify the Bayesian network and implement it for the purpose of predicting firm bankruptcy probability. We first select relevant variables by the LASSO, construct the Bayesian network with the selected variables, and estimate model parameters via the EM algorithm. We use quarterly COMPUSTAT data from January 1961 to August 2018 for the empirical analyses and show that the Bayesian network model performs very well and is outperformed only by the complicated DNN with three hidden layers. Furthermore, the topology of the Bayesian network exhibits a clear representation of its internal functionality based on conditional probability inferences. It offers scenario and sensitivity analyses of individual variables' effects on the bankruptcy probability, making it easily understandable to the general investment community. This underlines the contribution of our paper to the literature.
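The LASSO selection step can be sketched as an L1-penalized logistic regression whose zeroed coefficients drop uninformative ratios. The data, generic variable names, and penalty strength `C` below are all assumptions for illustration, not the paper's specification.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n, p = 3000, 8
names = [f"x{j}" for j in range(p)]  # generic stand-ins for candidate ratios
X = rng.normal(size=(n, p))
# Only the first two predictors truly drive default in this synthetic setup.
logits = -2.0 + 1.0 * X[:, 0] - 0.8 * X[:, 1]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logits)))

# L1 penalty (LASSO-style); small C means strong shrinkage toward zero.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.02).fit(X, y)
selected = [nm for nm, c in zip(names, lasso.coef_[0]) if abs(c) > 1e-8]
print(selected)  # variables surviving the penalty form the model's input set
```

Variables with non-zero coefficients play the role of the 16 selected ratios in Table 3; re-running the fit on a subsample is the analogue of the Panel B robustness check.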
While we consider our study an important step towards integrating superior forecasting performance with model interpretability, we recognize that the model can be improved along several dimensions in the future. For example, the topology of the Bayesian network could be updated dynamically over time for more flexibility. Furthermore, most machine learning models forecast the bankruptcy probability at time t + 1 based on data at time t, whereas traditional hazard models include data from all healthy periods to predict the bankruptcy probability. Thus a machine learning model could be enhanced by a hazard model to deliver more promising results within an interpretable structure.

ORCID
Yi Cao https://orcid.org/0000-0002-5087-8861

FIGURE 3 The decision tree model. This figure shows the decision tree model for forecasting bankruptcy four quarters ahead constructed from all 16 variables. The sample period is from January 1961 to August 2018.
FIGURE 4 The Bayesian network model. This figure shows the structure of the Bayesian network for forecasting bankruptcy four quarters ahead constructed by (a) 16 variables in Group 1 and (b) 10 variables in Group 2. The sample period is from January 1961 to August 2018.
FIGURE 5 Sensitivity analysis of the Bayesian network on changes of input variables. This figure shows the bankruptcy probability surface generated by the Bayesian network for changes in (a) LTAT and ACTLCT; (b) FAT and CASHAT; (c) LTAT and NIMTA.
TABLE 2 Summary statistics of input accounting variables. This table summarizes the minimum (Min), first quartile (1st Qu), median, mean, third quartile (3rd Qu), maximum (Max), and SD of the input accounting variables used for predicting corporate bankruptcy probability. Variable descriptions are provided in the last column. The sample period is from January 1961 to August 2018.
TABLE 3 Explanatory variables selected by the LASSO. This table summarizes the explanatory variables selected by the LASSO. In Panel A the full sample period is from January 1961 to August 2018; in Panel B the subsample is from March 2007 to August 2018. ***, **, and * denote significance at the 1%, 5%, and 10% levels, respectively. The z-statistics are reported in parentheses.
TABLE 5 Logistic regression estimates. This table summarizes the estimated coefficients and their statistical significance obtained from the logistic regression in predicting 1-year-ahead firm bankruptcy. The sample period is from January 1961 to August 2018.
TABLE 6 Scenario analysis of bankruptcy probability.