A dynamic performance evaluation of distress prediction models

So far, comparative studies of competing distress prediction models (DPMs) have been restricted to static evaluation frameworks and have thus overlooked performance over time. This study fills this gap by proposing a Malmquist Data Envelopment Analysis (DEA)-based multi-period performance evaluation framework for assessing competing static and dynamic statistical DPMs, and uses it to address a variety of research questions. Our findings suggest that (1) dynamic models developed under duration-dependent frameworks outperform both dynamic models developed under duration-independent frameworks and static models; (2) models fed with financial accounting (FA), market variable (MV), and macroeconomic information (MI) features outperform those fed with either MVMI or FA, regardless of the frameworks under which they are developed; and (3) shorter training horizons seem to enhance the aggregate performance of both static and dynamic models.

Financial distress refers to the failure of a company to pay its financial obligations when they fall due (Beaver, 1966). The financial situation of distressed firms differs from that of healthy ones (Cleary & Hebb, 2016; Lau, 1987), suggesting that as a firm's financial profile weakens, its features shift toward the characteristics of bankrupt firms. In practice, managers could use a distress prediction model (hereafter, DPM) as an early warning system to take proper preventive actions and avoid bankruptcy.
These comparative studies typically evaluate the predictive performance of DPMs using several performance criteria (e.g., correctness of categorical prediction, discriminatory power, information content, and calibration accuracy) and their measures; for a comprehensive list of evaluation criteria and their measures, the reader is referred to Mousavi et al. (2015).
The prevailing comparative studies are criticized for using arbitrary performance measures and evaluation criteria (Balcaen & Ooghe, 2006) and for being mono-criterion (Bauer & Agarwal, 2014; Mousavi et al., 2015), because in each round of evaluation a single measure (e.g., Type I error) of a single criterion (e.g., the correctness of categorical prediction) is used to evaluate the performance of DPMs. The rankings of models under different criteria are therefore typically different, and one cannot make an informed decision about which model performs better under multiple criteria. This problem of inconsistent rankings across criteria is clearly illustrated in the results of comparative studies by, for example, Theodossiou (1991), Bandyopadhyay (2006), and Hernandez Tinoco and Wilson (2013), which show mixed rankings, as well as in our conceptual illustration of the inconsistency-of-rankings problem in Table 1. To deal with this shortcoming, Mousavi et al. (2015) and Mousavi and Ouenniche (2018) proposed applying multi-criteria evaluation frameworks (i.e., super-efficiency and context-dependent Data Envelopment Analysis, DEA, models) to assess the relative performance of prediction models under multiple criteria and their measures, so as to obtain a single multi-criteria ranking of models and thus avoid dealing with several inconsistent rankings.
Zavgren (1983) argued that most traditional failure prediction models are based on the underlying assumption that the nexus between the model's dependent variable (i.e., the probability of failure) and its features (e.g., accounting and market information) is constant over time. Empirical studies, however, indicate that this constancy is highly questionable (Charitou et al., 2004; Du Jardin & Séverin, 2012) and that a model's performance is sensitive to fluctuations in macroeconomic circumstances (Mensah, 1984; Platt et al., 1994). For example, the logit model of Ohlson (1980) performed better in the 1980s, while the discrete-time logit prediction model of Shumway (2001) outperformed other models in the 2000s. The changes in the patterns of accounting and market-based information over time suggest that prediction models should be re-engineered regularly to encompass the most recent patterns of information (Grice & Ingram, 2001). Consequently, the static performance of prediction models could be far different from their dynamic performance.
In this study, we contend with an overlooked feature of comparative studies of prediction models: the use of what we refer to as a static performance evaluation framework, or static out-of-sample analysis framework, to compare the performance of models, whereby historical data is first split into a fit-period dataset and a test-period dataset, and out-of-sample testing is then conducted to assess the predictive performance of a DPM. In this paper, we shall refer to a dynamic performance evaluation framework, or dynamic out-of-sample analysis framework, as a framework that implements a static out-of-sample analysis framework in a dynamic fashion; that is, using a rolling horizon technique. To be more specific, the static out-of-sample analysis framework described above is run several times, each time with a different fit period, where fit periods overlap so as to comply with a rolling horizon technique such as the fixed-origin rolling horizon technique or the variable-origin rolling horizon technique. The fixed-origin rolling horizon technique uses the same historical time period, referred to as the fixed origin, as the starting date of the fit period but increases the length of the fit period by one period in each run; thus, the time horizon is being rolled. By contrast, the variable-origin rolling horizon technique keeps the length of the fit period constant but, in each run, drops the first period of the fit period and adds the next period of the historical time horizon; thus, again, the time horizon is being rolled.
Given the above definitions of static and dynamic performance evaluation frameworks, a conceptual comparison of their designs suggests that the static framework ignores the time dynamics in the data, which could bias the performance outlook of the predictive models, and neglects the potential need to re-engineer the models as time goes by, whereas the dynamic framework takes account of time dynamics and the corresponding changes in patterns within the data, and allows for the re-engineering of the models (i.e., computing revised estimates of the parameters of the models) from one period to the next. Furthermore, the static framework produces a single performance outlook, whereas the dynamic framework produces as many performance outlooks as the number of runs or time horizons considered in the analysis.
In this paper, we address the abovementioned issues with static performance evaluation frameworks by using the variable-origin rolling horizon technique to implement the dynamic out-of-sample analysis framework. In addition, we use Malmquist DEA as a dynamic multi-criteria framework for assessing the multi-period performance of DPMs, which addresses the issues with mono-criterion performance evaluation frameworks mentioned in the previous paragraphs. In sum, our proposal overcomes both the drawbacks of the static out-of-sample analysis framework and the drawbacks of static multi-criteria performance evaluation frameworks for prediction models. Note that, by design, the proposed Malmquist-DEA framework for the performance evaluation of prediction models under multiple criteria and in a dynamic fashion provides multi-criteria rankings of competing DPMs over multiple periods. In practice, such a framework will not only allow one to highlight the models' performance dynamics over time but also reveal sources of inefficiencies in these models. As far as we know, no previous study has investigated the performance of DPMs using a dynamic multi-criteria assessment framework. However, these types of multi-criteria evaluation frameworks have been used to assess other types of units in a variety of application areas, such as renewable energy (e.g., Zeng et al., 2019, 2020), kiwifruit production (e.g., Mohammadi et al., 2011), electricity generation (Alizadeh et al., 2020), insurance companies (e.g., Beiragh et al., 2020), supply chains (e.g., Shafiei Kaleibari et al., 2016), and energy efficiency (Houshyar et al., 2010; Mohammadi et al., 2011; Zhou et al., 2008).
Furthermore, this study uses the proposed dynamic multi-criteria evaluation framework to address the following research questions:
a. What is the effect of the modeling framework design on the performance of models?
b. What is the effect of the type of information with which models are fed on their performance?
c. What is the effect of the length of the training sample on the models' performance?
d. Which DPMs perform better in predicting distress over the years with a high distress rate (HDR)?
The rest of the paper is structured as follows. In Section 2, we review comparative studies of failure prediction models. In Section 3, we describe the proposed dynamic evaluation framework under multiple criteria, namely, Malmquist DEA, and how to operationalize it for our application. In Section 4, we provide details on our experimental design, including data, sample selection, and prediction models. In Section 5, we discuss the empirical findings and provide answers to our research questions. Finally, Section 6 concludes the paper.

| LITERATURE REVIEW
In this section, we provide a brief survey and classification of the literature on bankruptcy and financial distress prediction along with references to the most cited models and tools and, where appropriate, we critically assess the literature and highlight the gaps.
Statistical classifiers such as discriminant analysis and logit analysis have been very popular in the field of failure and distress prediction and remain widely used as benchmarks. The first generation of statistical models is based on discriminant analysis techniques. Beaver (1966, 1968) is the ground-breaking study that proposed univariate discriminant analysis to predict bankruptcy. Later, Altman (1968) applied multivariate discriminant analysis (MDA) to estimate the renowned "Z-score," which is used as a proxy for the financial situation of a company. The MDA technique has been frequently used in later studies (e.g., Altman, 1982; Altman et al., 1977; Blum, 1974; Deakin, 1972; Serrano-Cinca & Gutiérrez-Nieto, 2013).
The majority of subsequent studies applied the second generation of statistical techniques; that is, the linear probability model (LPM) (Meyer & Pifer, 1970), logit analysis (LA) (Martin, 1977; Ohlson, 1980), and probit analysis (PA) (Zmijewski, 1984). The first and second generations of models could be viewed as empirical models in that they are driven by practical considerations, such as an accurate prediction of the risk class or an exact estimate of the probability of belonging to a risk class; in sum, the selection of the explanatory variables is driven by the predictive performance of the models. These models and their application in some previous studies are not without limitations. Some of the assumptions underlying the modeling frameworks may not be reasonably satisfied for some data sets, and the earliest studies restricted the type of information to accounting-based information. Also, these models have a static structure and therefore cannot explicitly account for changes over time in the profiles of companies.
The third generation of statistical models comprises survival analysis (SA) and contingent claims analysis (CCA) models, which overcome some of the limitations of the first- and second-generation models. The underlying modeling frameworks of both SA and CCA models are dynamic by design. To be more specific, SA models are used to estimate time-varying probabilities of failure. Although the application of SA models in failure prediction dates back to the mid-1980s (e.g., Crapp & Stevenson, 1987; Lane et al., 1986; Luoma & Laitinen, 1991), Shumway (2001) was the pioneering study that made their use famous by providing an attractive estimation methodology based on an equivalence between multi-period logit models and a discrete-time hazard model. Thereafter, the suggested discrete-time hazard model, also referred to as a discrete-time logit model, was frequently used in later studies (e.g., Bauer & Agarwal, 2014; Chava et al., 2004; Hernandez Tinoco & Wilson, 2013; Mousavi et al., 2015; Wu et al., 2010) to estimate the coefficients of time-varying accounting and market-based covariates of SA models. Unlike the first-generation, second-generation, and SA models, which are empirical, CCA models, also referred to as Black-Scholes-Merton (BSM)-based models, are theoretically grounded. CCA models are rooted in option-pricing theory, as introduced in Black and Scholes (1973) and Merton (1974), whereby the position of an equity holder of a firm is assumed to be a long position in a call option. Therefore, as suggested by McDonald (2006), the probability of failure could be interpreted as the likelihood that the value of the assets is less than the face value of the liabilities of the firm at maturity; that is, the call option expires worthless. Like any modeling framework, CCA models are not without limitations. For example, CCA models implicitly assume the same maturities for the liabilities of the firm, which in practice is a limitation (Allen & Saunders, 2004). Also, one might argue that the approximation process for unobservable variables (e.g., expected return, the volatility of return, and the market value of assets) is not free of potential measurement errors (Aktug, 2014). Table 2 provides a summary of the applied techniques in distress prediction.
Furthermore, several studies have introduced a variety of strategies and methods for identifying the most representative group of features to feed failure prediction models (Balcaen & Ooghe, 2006). On one hand, feature selection strategies could be theoretically grounded, such as the features used in CCA models, which are based on option pricing theory (e.g., Bharath & Shumway, 2008; Hillegeist et al., 2004), empirically grounded (e.g., Barboza et al., 2017; Neves & Vieira, 2006; Sun et al., 2011; Unler & Murat, 2010; Zhou et al., 2015), or both (e.g., Laitinen & Suvas, 2016). On the other hand, feature selection methods could be objective or subjective. Objective feature selection methods could be statistical (e.g., Tsai, 2009; Zhou et al., 2012) or non-statistical (e.g., Pacheco et al., 2007, 2009), but adopt a common approach, that is, optimizing an effectiveness criterion, whereas subjective feature selection methods often make use of a subjective decision rule, such as reviewing the literature and selecting the most commonly used features (e.g., Cleary & Hebb, 2016; Zhou, 2013; Zhou et al., 2015). As employing different techniques often results in different sets of selected features, these strategies and methods are summarized in Tables 4 and 5, respectively. Apart from investigating the effect of the classification model or method, the type of information with which models are fed, and the type of feature selection technique, the literature on comparative studies suggests that the type of performance criteria and measures, and the chosen evaluation framework, could also have a significant impact on the performance outlook of models (Balcaen & Ooghe, 2006; Mousavi et al., 2015; Zhou, 2013). Conventional comparative analyses have used four categories of evaluation criteria: the correctness of categorical prediction, discriminatory power, information content, and calibration accuracy. In practice, the majority of these studies have applied a restricted number of performance measures and criteria (Bauer & Agarwal, 2014): for example, Type I and Type II errors as measures of the correctness of categorical prediction (e.g., Bauer & Agarwal, 2014; Ben & Youssef, 2018; Collins & Green, 1982; Lennox, 1999; Lo, 1986; Luoma & Laitinen, 1991; Press & Wilson, 1978; Sartori et al., 2016; Theodossiou, 1991), the ROC or Gini index as a measure of discriminatory power (e.g., Hajek & Henriques, 2017; Hernandez Tinoco & Wilson, 2013; Hillegeist et al., 2004; Hosaka, 2019; Theodossiou, 1991; Wu et al., 2010), pseudo-R2 and log-likelihood as measures of information content (e.g., Agarwal & Taffler, 2008; Bandyopadhyay, 2006; Bauer & Agarwal, 2014; Li et al., 2010; Theodossiou, 1991), and the Brier score (BS) as a measure of calibration accuracy (e.g., Theodossiou, 1991), which

Feature selection grounds | Explanation | Example
Theoretically grounded | Option pricing theory estimates the value of an options contract by allocating a price, known as a premium, based on the calculated likelihood that the contract will finish in the money at maturity. | Hillegeist et al. (2004); Bharath and Shumway (2008)
Empirically grounded | Selecting the best features using empirical techniques. | Neves and Vieira (2006)

results in an incomplete assessment of DPMs. Also, conventional studies are criticized for adopting a mono-criterion assessment framework (Mousavi et al., 2015), where a single measure of a single criterion is used at a time to evaluate the relative performance of models and provide mono-criterion rankings. Typically, the rankings associated with different measures and criteria are mostly inconsistent, which results in a situation where users cannot make an informed decision as to which DPM is superior in performance. Bauer and Agarwal (2014) were the first to use several measures under three performance criteria (discriminatory power, information content, and correctness of categorical prediction) to provide a comprehensive mono-criterion assessment comparing the performance of the SA model of Shumway (2001), the CCA model of Bharath and Shumway (2008), and the MDA model of Altman (1968). Mousavi et al.
(2015) was the pioneering study that suggested a multi-criteria assessment framework for prediction models of categorical variables (i.e., classifiers), namely, super-efficiency DEA, to rank the performance of competing statistical models. However, the super-efficiency DEA framework has been criticized for unfair benchmarking, since the reference benchmark changes from one distress prediction model evaluation to another (Xu & Ouenniche, 2011). To overcome this methodological issue, Mousavi and Ouenniche (2018) proposed a slacks-based context-dependent DEA model to assess and rank competing DPMs. Further, Mousavi and Lin (2020) applied the PROMETHEE multi-criteria decision aid (MCDA) framework to compare the performance of machine learning and artificial intelligence techniques fed with different types of information (i.e., accounting, market, and corporate governance) in distress prediction.
One existing gap in the literature on comparative studies of DPMs is that both the conventional mono-criterion frameworks (i.e., one measure under one criterion) and the recently introduced multi-criteria frameworks are restricted by their static nature; therefore, the dynamic performance of models over time has been overlooked, a gap that we address in this paper. To be more specific, our main contribution is to propose a dynamic, or multi-period, multi-criteria performance evaluation framework (see the next section), which by design provides multi-criteria rankings of competing DPMs over multiple periods. In practice, such a framework will not only allow one to highlight the models' performance dynamics over time but also reveal sources of inefficiencies in these models. As far as we know, no previous study has investigated the performance of DPMs using a dynamic multi-criteria assessment framework.

| A DYNAMIC FRAMEWORK FOR ASSESSING DPMS
This paper proposes a dynamic multi-criteria assessment framework, namely, Malmquist-DEA, which by design can measure and rank the relative performance of competing DPMs over time. Hereafter, we first present the Malmquist productivity index (MPI) and its estimation using DEA to measure the efficiency of decision-making units (DMUs) (see Section 3.1). Then, we present how one might customize a Malmquist-DEA framework to estimate the relative performance of financial distress prediction models (see Section 3.2).

| Malmquist productivity index
First introduced by Caves et al. (1982a, 1982b) and continuously improved by other scholars (e.g., Coelli, 1997; Färe et al., 1994), the MPI represents the growth of total factor productivity for a DMU over time; for a review of the applications of the MPI, the reader is referred to Färe et al. (2012).
The concepts underlying the MPI are illustrated in Figure 1, where a given DMU, say DMU_0, has changed its mix of inputs and outputs, or production, from P^t(x_0^t, y_0^t) in period t to P^{t+1}(x_0^{t+1}, y_0^{t+1}) in period t+1, and the efficient frontier with respect to which DMU_0 is assessed has shifted from the efficient frontier in period t, say F^t, to the efficient frontier in period t+1, say F^{t+1}, where x_0^t and x_0^{t+1} (respectively, y_0^t and y_0^{t+1}) represent the input (respectively, output) of DMU_0 at times t and t+1, respectively.

FIGURE 1 Efficiency change and efficient frontier shift

Note that, from a given period t to the next period t+1, the efficiency of DMU_0 could change as a result of a change in its own efficiency, as measured by the ratio of the efficiency of DMU_0^{t+1} with respect to the efficient frontier F^{t+1} in period t+1 to the efficiency of DMU_0^t with respect to the efficient frontier F^t in period t. We denote such efficiency change from period t to period t+1 by EC^{t,t+1}; it is commonly referred to as the efficient frontier catch-up, or recovery, effect of DMU_0 from period t to period t+1 and reflects the degree to which a DMU improves or worsens its efficiency. In sum, EC^{t,t+1} > 1 indicates progress in relative efficiency from period t to t+1, EC^{t,t+1} = 1 indicates no change or stability, and EC^{t,t+1} < 1 indicates a regress in efficiency.
On the other hand, from a given period t to the next period t+1, the efficiency of DMU_0 could also change as a result of a shift in the efficient frontier from period t to period t+1 and the relative positions of DMU_0^t (respectively, DMU_0^{t+1}) with respect to the frontiers F^t and F^{t+1}. This change in efficiency is referred to as the efficient frontier shift effect and is computed as the square root of the product of the efficient frontier shift for DMU_0^t, say EFS^t, and the efficient frontier shift for DMU_0^{t+1}, say EFS^{t+1}, where EFS^t is measured by the ratio of the efficiency of DMU_0^t with respect to the efficient frontier F^t in period t to the efficiency of DMU_0^t with respect to the efficient frontier F^{t+1} in period t+1, and EFS^{t+1} is measured by the ratio of the efficiency of DMU_0^{t+1} with respect to the efficient frontier F^t in period t to the efficiency of DMU_0^{t+1} with respect to the efficient frontier F^{t+1} in period t+1. For more specifics about the MPI, the reader is referred to Tone (2004) and Färe et al. (1992, 1994).
Hereafter, we outline the procedure for computing the MPI.

Step 1: Estimating the efficiency change (EC) of each DMU. In Figure 1, AC/AE represents the efficiency of DMU_0 with input x^{t+1} and output y^{t+1} with respect to the efficient frontier F^{t+1}, and AD/AF represents the efficiency of DMU_0 with input x^t and output y^t with respect to the efficient frontier F^t. The efficiency change of DMU_0 is measured as follows:

EC^{t,t+1} = (AC/AE) / (AD/AF),

where EC > 1, EC = 1, and EC < 1 refer to progress, stability, and regress in relative efficiency, respectively.
Step 2: Estimating the efficient frontier shift (EFS). In Figure 1, AD/AF (respectively, AB/AF) represents the efficiency of DMU_0 with input x^t and output y^t with respect to frontier F^t (respectively, frontier F^{t+1}); therefore, the efficient frontier shift at time t, say EFS^t, is measured as follows:

EFS^t = (AD/AF) / (AB/AF).

Equivalently, the efficient frontier shift at time t+1, say EFS^{t+1}, is

EFS^{t+1} = (AG/AE) / (AC/AE),

where the ratio AC/AE (respectively, AG/AE) denotes the efficiency of DMU_0 at time t+1 with input x^{t+1} and output y^{t+1} with respect to the frontier F^{t+1} (respectively, F^t).
EFS^{t,t+1} is the efficient frontier shift between times t and t+1, calculated as the geometric mean of EFS^t and EFS^{t+1}:

EFS^{t,t+1} = sqrt(EFS^t × EFS^{t+1}).

Step 3: Estimating the MPI. The MPI is the product of EC and EFS:

MPI^{t,t+1} = EC^{t,t+1} × EFS^{t,t+1}.
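Given the four efficiency scores that appear in these ratios (the DMU observed in periods t and t+1, each measured against the frontiers of periods t and t+1), the three steps above reduce to simple arithmetic. The following sketch uses illustrative numbers rather than values from our study:

```python
import math

def malmquist(d_t_t, d_t1_t1, d_t_t1, d_t1_t):
    """Compute EC, EFS, and MPI from four efficiency scores, where
    d_a_b is the efficiency of the DMU observed in period a measured
    against the frontier of period b."""
    ec = d_t1_t1 / d_t_t                 # Step 1: catch-up (recovery) effect
    efs_t = d_t_t / d_t_t1               # frontier shift evaluated at (x^t, y^t)
    efs_t1 = d_t1_t / d_t1_t1            # frontier shift evaluated at (x^{t+1}, y^{t+1})
    efs = math.sqrt(efs_t * efs_t1)      # Step 2: geometric mean of the two shifts
    mpi = ec * efs                       # Step 3: MPI = EC x EFS
    return ec, efs, mpi
```

A value of MPI above 1 indicates productivity growth of the DMU between the two periods, combining both the catch-up and frontier-shift effects.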

| Adaptation of MPI for our purpose
The MPI is a generic framework and, as such, its application for our purpose, that is, a dynamic relative performance evaluation of competing DPMs under multiple criteria, requires several key specifications. First, the choice of decision-making units (DMUs): in this study, we are concerned with the performance of distress prediction models; thus, the DMUs are the DPMs (see Section 4.4).
Second, the choice of inputs and outputs: in this study, the inputs and outputs are the measures of the performance criteria under consideration, namely, discriminatory power, correctness of categorical prediction, calibration accuracy, and information content (for details about the performance measures of the different criteria, the reader is referred to Mousavi et al., 2015). The outputs (respectively, inputs) are designated based on the principle that the more (respectively, the less), the better; thus, performance measures to be maximized (respectively, minimized) are set as outputs (respectively, inputs).
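As a concrete illustration of this designation principle, one might tag each performance measure with the direction in which it improves; measures to be minimized become DEA inputs and measures to be maximized become DEA outputs. The measure names below are illustrative, not the exact set used in the study:

```python
# Hypothetical performance measures tagged by improvement direction.
MEASURES = {
    "type_I_error": "min",    # correctness of categorical prediction
    "type_II_error": "min",   # correctness of categorical prediction
    "brier_score": "min",     # calibration accuracy
    "auc": "max",             # discriminatory power
    "log_likelihood": "max",  # information content
}

def designate(measures):
    """Split measures into DEA inputs (less is better) and outputs
    (more is better)."""
    inputs = [m for m, sense in measures.items() if sense == "min"]
    outputs = [m for m, sense in measures.items() if sense == "max"]
    return inputs, outputs
```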
Third, the choice of the model to estimate efficiency scores or distances from the frontiers: in this study, we opted for the slacks-based measure (SBM) model of Tone (2001) to estimate EC^{t,t+1}, EFS^t, and EFS^{t+1}, because of its advantages over radial models such as CCR and BCC. For more details on the SBM model, we refer the reader to the eminent work of Tone (2002). Table 6 presents the application of SBM-DEA to estimate the MPI.
Fourth, the choice of reference index: we aim to measure and compare the relative performance of DPMs over a period; however, the contemporaneous MPI^{t,t+1} is sensitive to linear programming (LP) infeasibility and represents the performance of DMU_0 from time t to t+1 only; thus, it should be adjusted for our purpose. To deal with this issue, we followed Pastor and Lovell (2005) in estimating the global MPI, whose reference frontier consists of the best DMUs over the concerned period (i.e., t_1, t_2, …, t_n). Also, in a situation where the efficient frontiers of different time periods cross (e.g., F^{t_1} and F^{t_2} in Figure 2), the global frontier can be used as the benchmark reference frontier for all DMUs over the concerned period. For illustration, as Figure 2 indicates, the relative efficiency of DMU_0 can be estimated considering the efficient frontier of period 1 (a combination of DMU_1, DMU_2, DMU_3, DMU_4, and DMU_5) or the efficient frontier of period 2 (a combination of DMU_6, DMU_7, DMU_8, DMU_9, and DMU_10). Alternatively, the performance of DMU_0 can be measured considering the global frontier, which consists of the best DMUs across both periods, that is, DMU_3, DMU_4, DMU_5, DMU_6, and DMU_7.
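Under the global-frontier convention just described, each DMU's efficiency in every period is measured against one pooled frontier, so the global MPI between two periods is simply the ratio of the two global efficiency scores, and per-period indices chain together (the index is circular). A minimal sketch, with illustrative scores:

```python
def global_mpi(d_global_t, d_global_t1):
    """Global MPI of a DMU between two periods, where each score is
    measured against the single frontier enveloping all periods."""
    return d_global_t1 / d_global_t

def chained_global_mpi(scores):
    """Per-period global MPIs; their product telescopes to
    scores[-1] / scores[0], illustrating circularity."""
    return [scores[i + 1] / scores[i] for i in range(len(scores) - 1)]
```

Because every score refers to the same pooled frontier, the index never suffers from the cross-period LP infeasibility that can affect the contemporaneous MPI.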

| EMPIRICAL INVESTIGATION
This section provides details on the data (see Section 4.1), sampling (see Section 4.2), the features and feature selection procedure (see Section 4.3), and the choice of DPMs (see Section 4.4) for our comparative study, which illustrates the use of the proposed dynamic multi-criteria methodology for the relative performance evaluation of classifiers.

| Data
We took the following steps to select the data set for our empirical analysis. First, we collected the financial accounting and market-based information of all companies (except utilities and financials) listed on the London Stock Exchange (LSE) between 1995 and 2014. Second, since developing some models requires minimum historical data, we excluded companies that had been listed for less than 2 years. Third, to minimize any bias related to excluding companies with missing data (Platt & Platt, 2012; Zavgren, 1983; Zmijewski, 1984), we only discarded those companies with missing values for

TABLE 6 Applying SBM-DEA to estimate MPI

Description: The SBM-DEA model of Tone (2001, 2011) measures the efficiency of a DMU in a static or single-period context; in sum, it measures the distance between a DMU and the efficient frontier that envelops all other DMUs under evaluation. The following SBM model is our proposal for measuring the distance D^{[1,T]}(x_0^{t+1}, y_0^{t+1}) between DMU_0^{t+1} (DMU_0 observed in period t+1) and the global frontier, say F^{[1,T]}, where the global frontier F^{[1,T]} envelops all the data points or DMUs observed over the whole horizon of analysis T.

Formula:

Minimize D^{[1,T]}(x_0^{t+1}, y_0^{t+1}) = [1 − (1/m) Σ_{i=1}^{m} s_i^− / x_{i0}^{t+1}] / [1 + (1/q) Σ_{r=1}^{q} s_r^+ / y_{r0}^{t+1}]

subject to

x_0^{t+1} = Xλ + s^−,
y_0^{t+1} = Yλ − s^+,
λ ≥ 0, s^− ≥ 0, s^+ ≥ 0,

where the columns of X and Y are the input and output vectors of all DMUs observed over [1, T], m (respectively, q) is the number of inputs (respectively, outputs), and s^− (respectively, s^+) are the input (respectively, output) slacks. Through a change of variables similar to the one proposed by Tone (2001), this nonlinear program can be transformed into an equivalent linear program for solution purposes.

Note: The table explains the SBM-DEA method used to estimate the MPI.
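A compact way to compute such distances numerically is to solve the linearized SBM program with an off-the-shelf LP solver. The sketch below implements the standard Tone (2001) linearization (auxiliary scalar t, scaled intensities and slacks) against an arbitrary reference set, which could be the pooled, global set of DMU observations; it is an illustration of the model class, not the exact program used in the paper:

```python
import numpy as np
from scipy.optimize import linprog

def sbm_efficiency(x0, y0, X, Y):
    """SBM efficiency of the DMU (x0, y0) against a reference set.
    X has shape (m, n): inputs of the n reference DMUs (one per column);
    Y has shape (q, n): their outputs. Returns rho in (0, 1]."""
    X = np.asarray(X, float); Y = np.asarray(Y, float)
    x0 = np.asarray(x0, float); y0 = np.asarray(y0, float)
    m, n = X.shape
    q = Y.shape[0]
    # Variables: [t, Lambda_1..n, S_minus_1..m, S_plus_1..q]
    nv = 1 + n + m + q
    c = np.zeros(nv)
    c[0] = 1.0
    c[1 + n:1 + n + m] = -1.0 / (m * x0)      # minimize t - (1/m) sum S_i^-/x_i0
    A_eq, b_eq = [], []
    row = np.zeros(nv); row[0] = 1.0
    row[1 + n + m:] = 1.0 / (q * y0)          # t + (1/q) sum S_r^+/y_r0 = 1
    A_eq.append(row); b_eq.append(1.0)
    for i in range(m):                        # x_i0 * t = X_i . Lambda + S_i^-
        row = np.zeros(nv); row[0] = x0[i]
        row[1:1 + n] = -X[i]; row[1 + n + i] = -1.0
        A_eq.append(row); b_eq.append(0.0)
    for r in range(q):                        # y_r0 * t = Y_r . Lambda - S_r^+
        row = np.zeros(nv); row[0] = y0[r]
        row[1:1 + n] = -Y[r]; row[1 + n + m + r] = 1.0
        A_eq.append(row); b_eq.append(0.0)
    res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=[(0, None)] * nv, method="highs")
    return res.fun                            # optimal value equals rho
```

For example, with two single-input, single-output DMUs A = (1, 1) and B = (2, 1), A lies on the frontier (score 1) while B wastes one unit of input and scores 0.5.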
FIGURE 2 Global frontier

the main accounting variables (e.g., total assets and sales) and market-based variables (e.g., price), which are essential for computing a variety of accounting and market-based ratios (Lyandres & Zhdanov, 2013; Mousavi & Ouenniche, 2018). The residual missing values of each variable were replaced by the most recently observed value for each company (Mousavi & Ouenniche, 2018; Zhou et al., 2012). Fourth, to deal with extreme values, each variable is winsorized at its 1st and 99th percentiles (Shumway, 2001). Fifth, we lagged the data set to make sure that the essential accounting variables exist in the year in which distress is observed (Bauer & Agarwal, 2014; Mohammadi et al., 2011).
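The missing-value replacement and winsorization steps described above can be sketched as follows; the column and identifier names (firm, year) are illustrative assumptions about the panel layout:

```python
import numpy as np
import pandas as pd

def clean_panel(df, cols):
    """Forward-fill residual missing values within each firm, then
    winsorize each variable at its 1st and 99th percentiles."""
    out = df.sort_values(["firm", "year"]).copy()
    for c in cols:
        # replace residual missing values with the firm's most recent observation
        out[c] = out.groupby("firm")[c].ffill()
        # winsorize extreme values at the 1st and 99th percentiles
        lo, hi = out[c].quantile([0.01, 0.99])
        out[c] = out[c].clip(lo, hi)
    return out
```

Note that forward-filling is done per firm, so a value is never carried across companies, and the winsorization bounds are computed per variable over the whole pooled sample.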
To identify distressed firms, we defined a binary variable, say D, that equals 1 for financially distressed firms and 0 otherwise. We followed Pindado et al. (2008) in classifying a firm as financially distressed if (1) for two succeeding years, the company's interest expense exceeds its earnings before interest, taxes, depreciation, and amortization (EBITDA), and (2) for two succeeding years, the company suffers negative growth in market value. The final sample consists of 3389 firms and 36,984 firm-year observations, of which 1414 firm-year observations qualified as distressed, resulting in an average distress rate of 3.82% per year. Table 7 summarizes the number and proportion of healthy and distressed firms in the sample.
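Operationally, the Pindado et al. (2008) rule can be sketched on a firm-year panel as below; the column names (ebitda, interest, mv) are our own illustrative assumptions:

```python
import pandas as pd

def label_distress(df):
    """Flag firm-years where, for two succeeding years, interest expense
    exceeds EBITDA and market-value growth is negative."""
    out = df.sort_values(["firm", "year"]).copy()
    cover = out["ebitda"] < out["interest"]          # interest exceeds EBITDA
    mv_fall = out.groupby("firm")["mv"].diff() < 0   # negative market-value growth
    # both conditions must hold in the current and the preceding year
    two_cover = cover & cover.groupby(out["firm"]).shift(1, fill_value=False)
    two_fall = mv_fall & mv_fall.groupby(out["firm"]).shift(1, fill_value=False)
    out["D"] = (two_cover & two_fall).astype(int)
    return out
```

Shifting within each firm ensures that the "two succeeding years" condition never spans two different companies.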

| Sampling
In this study, we test the performance of DPMs out-of-sample. To investigate the effect of the length of the training period on the performance of DPMs, we developed models using two different sample period lengths, namely, 3 and 5 years. In addition, we applied the variable-origin (moving-origin) rolling horizon sampling technique (Mousavi & Ouenniche, 2018) to update the parameters of the models as time goes by, and thus drop the less relevant historical data and include the most recent data, so as to improve the predictive ability of the models and reduce any bias due to events that are no longer relevant.
To be clearer, we used firm-year observations for n years from t − n + 1 to t (where n = 3, 5 and 1999 ≤ t ≤ 2013) as training samples to develop models, which are then used to predict distress in year t + 1, the holdout sample year. For example, we used firm-year observations for 3 years from 1997 to 1999 as a training sample to predict the distress probability of firms in the year 2000. Considering the 15-year period of our data set and the two sample period lengths, we end up with 30 training samples. Table 8 presents details on the proportion of distressed firms for the 30 training samples and 15 holdout samples.
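The sampling scheme above can be sketched as a generator of (training years, holdout year) pairs; the year bounds follow the description in the text:

```python
def rolling_splits(first_holdout=2000, last_holdout=2014, lengths=(3, 5)):
    """Variable-origin rolling-horizon splits: for each sample-period
    length n, train on years t-n+1..t and hold out year t+1."""
    splits = []
    for n in lengths:
        for holdout in range(first_holdout, last_holdout + 1):
            t = holdout - 1
            splits.append({"n": n,
                           "train": list(range(t - n + 1, t + 1)),
                           "holdout": holdout})
    return splits
```

With 15 holdout years and two training lengths, this yields the 30 training samples mentioned above.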

| Features and feature selection procedure
There is a variety of strategies and methods for identifying the most effective group of features to feed failure prediction models (Balcaen & Ooghe, 2006; Neves & Vieira, 2006; Sun et al., 2013; Unler & Murat, 2010). Feature selection strategies could be theoretically grounded, empirically grounded, or both (e.g., Laitinen & Suvas, 2016). On the other hand, feature selection methods could be objective or subjective. Objective feature selection methods could be statistical (e.g., Tsai, 2009; Zhou et al., 2012) or non-statistical (e.g., Pacheco et al., 2007, 2009), but adopt a common approach, that is, optimizing an effectiveness criterion, whereas subjective feature selection methods often make use of a subjective decision rule, such as reviewing the literature and selecting the most used features (Mousavi et al., 2015; Ravi Kumar & Ravi, 2007; Zhou et al., 2015). In this research, we used a statistical objective feature selection method.
To be more specific, we reduced our very large initial set of accounting-based ratios (i.e., 83 accounting-based ratios) using factor analysis, where factors are selected based on two criteria: (1) absolute values of loadings are more than 0.5 and (2) communalities are more than 0.8. This process continued until either no improvement was seen in the total explained variance or no more variables were excluded. Principal component analysis with VARIMAX rotation was used to run this factor analysis (Chen, 2011; Mousavi et al., 2015). Finally, 31 accounting-based ratios with high communality values and high factor loadings, five frequently used market-based variables, and two mixed ratios (interaction effects of macroeconomic indicators and accounting-based information) were retained as explanatory variables to be used as inputs into the stepwise procedure of each statistical model. Table 9 presents the final explanatory variables selected for our analysis. Note: The table presents the primary features that are used in developing distress prediction models.
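The loading and communality screen described above can be sketched as follows. The synthetic data and the unrotated principal component solution are our own simplification (the paper additionally applies VARIMAX rotation and iterates the screen until no further variables are dropped):

```python
import numpy as np

# Sketch of the screen above: factor a ratio correlation matrix by principal
# components and keep ratios whose largest absolute loading exceeds 0.5 and
# whose communality exceeds 0.8. Synthetic data stand in for the 83 ratios.

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
X[:, 1] = X[:, 0] + 0.1 * rng.standard_normal(200)  # two highly related ratios

R = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)          # ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = int(np.sum(eigvals > 1.0))                # Kaiser criterion for factor count
loadings = eigvecs[:, :k] * np.sqrt(eigvals[:k])
communality = np.sum(loadings**2, axis=1)     # variance explained per ratio

keep = (np.abs(loadings).max(axis=1) > 0.5) & (communality > 0.8)
print("retained ratios:", np.where(keep)[0])
```

Here the two deliberately correlated ratios load heavily on a common factor and survive the screen, while the purely noisy ratios are dropped or borderline.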
To examine the impact of the type of information on the performance of DPMs, we trained models with different combinations of available information sources, that is, financial accounting (FA), market variables (MV), and macroeconomic indicators (MI). Also, note that, for a better model fit, we considered the interaction effect of macroeconomic indicators on financial accounting items (Mousavi et al., 2019).

| The choice of DPMs for evaluation
To evaluate the dynamic performance of the competing DPMs in predicting distress, we considered two general categories of models, namely, static and dynamic, and implemented the most cited models of each category. In terms of static models, we developed two models, MDA and LA. In terms of dynamic models, we followed Nam et al. (2008) in considering two subcategories of models, namely, duration-dependent (DD) and duration-independent (DI) models.
Note that duration-dependent DPMs comprise a time-dependent baseline rate, which could be estimated using historical information of the firm or be represented by a time dummy or by a time-varying feature of the firm. To this end, we used Kim and Partington's (2015) approach in estimating the time-dependent baseline hazard rate for each firm using firms' historical information in a Cox hazard duration model. We refer to this model as duration-dependent with the firm's specific baseline rate (DDWFSB). Also, since employing an indirect measure of the baseline rate such as time dummies (Beck et al., 1998) is less efficient, we followed Gupta et al. (2015), Nam et al. (2008), and Shumway (2001) in using the time-varying feature of a firm's age to proxy the time-dependent baseline rate in a discrete-time duration model. We refer to this model as duration-dependent with a time-dependent baseline rate (DDWTDB).
Also, note that, depending on whether they contain a constant (time-independent) baseline rate, the duration-independent models are classified into two subgroups, namely, duration-independent without baseline (DIWOTIB) and duration-independent with time-independent baseline (DIWTIB). In this study, we use the natural logarithm of firm age, ln(age), as the time-independent baseline rate.
Considering two static and four dynamic modeling frameworks, two lengths of the training samples, 15 holdout samples, and three combinations of features, we ended up with 540 developed models. A set of newly developed models using 3-year (from 2011 to 2013) FAMVMI information is presented in Table 10.
Appendix A provides more details about the models developed in this study.

| DYNAMIC AND MULTI-CRITERIA ASSESSMENT OF DISTRESS PREDICTION MODELS
This section summarizes our implementation decisions for the proposed dynamic multi-criteria framework for assessing the performance of DPMs, which we used to address our research questions (see Section 1).
To illustrate the use of the proposed framework and highlight its advantages, we developed six statistical frameworks, namely, MDA, LA, DIWOTIB, DIWTIB-ln(age), DDWTDB-ln(age), and DDWFSB, using three combinations of information, that is, FA, MVMI, and FAMVMI, and two lengths of training periods, that is, 3- and 5-year training samples.
We tested the prediction accuracy of the DPMs on 15 consecutive holdout samples (from the year 2000 to 2014) using measures under four criteria, that is, the correctness of categorical prediction, discriminatory power, information content, and calibration accuracy. For a detailed presentation of performance criteria and their measures, the reader is referred to Mousavi et al. (2015).
The estimated performance measures are used as inputs and outputs in Malmquist DEA. The Malmquist DEA framework provides global efficiency scores for each model each year. We used the average scores over the 15 holdout samples as the overall efficiency score of each model. Also, to answer the research question "Which DPMs perform better in predicting distress over the years with a high distress rate (HDR)?," we used the average scores of the HDR years, that is, 2003, 2008, and 2013. As mentioned above, the advantage of the multi-criteria framework is that it facilitates taking multiple performance criteria into account, which results in a comprehensive performance evaluation, provides a single multi-criterion ranking in each period, and facilitates the presentation and monitoring of the performance of models over time. Depending on practitioners' preferences, alternative measures could be selected for each criterion. Section 5.1 presents the yearly and total 15-year rankings of prediction models extracted from the first round (i.e., one combination of measures of the criteria under consideration) of assessment; the other 11 rounds of rankings are not presented here for the sake of saving space. Section 5.2 presents the overall rankings of prediction models considering 12 rounds of assessments.
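Each yearly efficiency score underlying the Malmquist framework can be obtained from a standard DEA linear program. The sketch below solves an input-oriented CCR model with scipy, treating each model as a decision-making unit with error-type measures as inputs and goodness measures as outputs. The data and function name are illustrative assumptions; the paper's Malmquist scores additionally compare adjacent periods:

```python
import numpy as np
from scipy.optimize import linprog

# Input-oriented CCR DEA sketch: efficiency of DMU j0 is the smallest theta
# such that a convex-cone combination of all DMUs uses at most theta times
# j0's inputs while producing at least j0's outputs.

def ccr_efficiency(X, Y, j0):
    """Efficiency of DMU j0 given inputs X (m x n) and outputs Y (s x n)."""
    m, n = X.shape
    s = Y.shape[0]
    c = np.zeros(1 + n)                          # decision vector: [theta, lambdas]
    c[0] = 1.0                                   # minimise theta
    A_in = np.hstack([-X[:, [j0]], X])           # X @ lam <= theta * x_j0
    A_out = np.hstack([np.zeros((s, 1)), -Y])    # Y @ lam >= y_j0
    A_ub = np.vstack([A_in, A_out])
    b_ub = np.concatenate([np.zeros(m), -Y[:, j0]])
    bounds = [(0, None)] * (1 + n)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[0]

# Two inputs (e.g., T1 error, BS) and two outputs (e.g., Pseudo-R2, ROC)
# for three hypothetical models in one year.
X = np.array([[0.10, 0.20, 0.15],
              [0.05, 0.08, 0.08]])
Y = np.array([[0.30, 0.25, 0.30],
              [0.85, 0.80, 0.85]])
scores = [ccr_efficiency(X, Y, j) for j in range(3)]
```

Model 0 dominates and is efficient (score 1); models 1 and 2 receive scores below 1, which is the per-period building block that the Malmquist index then tracks over time.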
5.1 | Round I of rankings (inputs: T1, BS; outputs: Pseudo-R2, ROC)

In the first round of dynamic multi-criteria assessment, we use the Type 1 (T1) error (to measure the accuracy of categorical prediction) and BS (to measure calibration accuracy) as inputs, and Pseudo-R2 (to measure information content) and ROC (to measure discriminatory power) as outputs. Table 11 presents the first-round rankings of DPMs based on the estimated efficiency scores for each year (i.e., 2000 to 2014). Using these dynamic rankings or the corresponding scores, one could analyze the performance of a DPM over time. To compare the overall or aggregate performance of each model over the 15-year period, we calculate an aggregate ranking based on the average scores over 15 years. Also, to compare the performance of DPMs over years with high distress rates, we calculate an aggregate HDR ranking based on the average scores over the years 2003, 2008, and 2013. In sum, Table 11 provides the rankings of DPMs for each year, referred to as yearly rankings; the aggregate rankings over 5-year periods (i.e., 2000-2004, 2005-2009, and 2010-2014) and over the 15-year period (i.e., 2000-2014); and the aggregate rankings over the HDR years (i.e., 2003, 2008, and 2013) using the dynamic multi-criteria evaluation framework, where the aggregate rankings are based on averages of the Malmquist DEA scores of the models.
Figure 3 illustrates the difference in overall performance between static and dynamic DPMs, as well as the effect on the performance of these models of using different categories of information or features, different rolling horizon lengths, and combinations of these. Notice that Panel 1 of Figure 3 shows the superiority of the dynamic models' performance over the static ones. The results also suggest that the models developed in duration-dependent frameworks, that is, DDWFSB and DDWTDB-ln(age), outperform duration-independent and static models, and that, among static models, LA outperforms MDA. On the other hand, Panel 2 of Figure 3 indicates that models fed with FAMVMI features outperform those fed with MVMI and FA, respectively. These results are consistent across all models. In addition, Panel 3 of Figure 3 shows that the dynamic (respectively, static) models developed using 5-year (respectively, 3-year) information outperform the dynamic (respectively, static) ones using 3-year (respectively, 5-year) information. Finally, Panel 4 of Figure 3 indicates that most of the dynamic models fed with 5-year features of FAMVMI and FA outperform those fed with 3-year features. On the contrary, most of the dynamic models fed with 3-year MVMI outperform those fed with 5-year MVMI. To check the robustness of our findings, we performed 11 other experiments or rounds using a variety of combinations of inputs and outputs, or, equivalently, measures of the four performance criteria under consideration. The findings based on Panel 3 of Figure 3 for this combination of inputs and outputs are rather an outlier compared to the findings for the remaining combinations, which are summarized in the next subsection.

Note (Table 11): The table presents the rankings of DPMs yearly, aggregately over 5- and 15-year periods, and aggregately over HDR years using the dynamic multi-criteria evaluation framework for one combination of inputs and outputs (inputs: T1 and BS; outputs: ROC and Pseudo-R2).

5.2 | The average of rankings for all rounds (combinations of different inputs and outputs)
We expanded our dynamic multi-criteria assessment to 12 different rounds (i.e., combinations of measures of the criteria under consideration), where a single measure is used for each criterion to avoid any implicit assignment of a higher weight to any of the criteria. Table 12 summarizes the inputs and outputs used in each round and provides the aggregate rankings over the 15-year period for each combination of inputs and outputs, referred to as AIR-15; the aggregate ranking across these individual combinations of inputs and outputs, referred to as AR-15; and the aggregate ranking across individual combinations of inputs and outputs over the HDR years, referred to as AR-HDR, of DPMs using the proposed dynamic multi-criteria evaluation framework, where the aggregate rankings are based on averages of the Malmquist DEA scores of the models. In addition, Figure 4 illustrates the difference in overall performance between static and dynamic DPMs based on the AR-15 rankings of Table 12, as well as the effect on the performance of these models of using different categories of information or features, different rolling horizon lengths, and combinations of these. Note (Table 12): The table presents the DPMs' aggregate ranking across individual combinations of inputs and outputs (AR-15) and the aggregate ranking across individual combinations of inputs and outputs over the HDR years (AR-HDR) using the dynamic multi-criteria evaluation framework. Our main findings based on the analysis of this information could be summarized as follows.
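The aggregation step can be sketched as follows: average each model's efficiency scores across rounds (or years) and rank the averages. The scores and model subset below are made up for illustration; the paper averages Malmquist DEA scores over 12 rounds and 15 years:

```python
# Sketch of the AR-15-style aggregation: per-round average efficiency scores
# (hypothetical values) are averaged per model, then sorted best-first.
scores = {                       # model -> per-round average efficiency
    "DDWFSB": [0.95, 0.92, 0.97],
    "LA":     [0.80, 0.85, 0.78],
    "MDA":    [0.70, 0.72, 0.69],
}
avg = {m: sum(v) / len(v) for m, v in scores.items()}
ranking = sorted(avg, key=avg.get, reverse=True)   # rank 1 = best model
```

With these made-up numbers the ordering is DDWFSB, then LA, then MDA, mirroring the qualitative pattern reported in the text.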
First, Panel 1 shows the superiority of the dynamic models' performance over the static ones, where the performance of the models is aggregated across all categories of information and time horizons. The results also suggest that the models developed in duration-dependent frameworks, that is, DDWFSB and DDWTDB-ln(age), outperform duration-independent and static models, and that, among static models, LA outperforms MDA. These findings answer the question: What is the effect of the modeling framework design on the performance of models?
Our findings are consistent with other studies (Bauer & Agarwal, 2014; Mousavi & Ouenniche, 2018; Nam et al., 2008; Shumway, 2001), which indicate that duration models utilize information more effectively in DPMs. Further, the results suggest that models developed in the duration-dependent framework of DDWFSB, followed by DDWTDB-ln(age), outperform other dynamic and static frameworks. Our finding is consistent with Nam et al. (2008) and Kim and Partington (2015), who proposed that using the baseline rate in distress models improves their performance. Further, compatible with other research such as Bauer and Agarwal (2014) and Wu et al. (2010), our results suggest that the LA model outperforms MDA.

Second, Panel 2 indicates that the models fed with FAMVMI features outperform those fed with either MVMI or FA, regardless of the frameworks under which they are developed. In addition, models fed with MVMI features outperform those fed with FA. These findings answer the question: What is the effect of the type of information with which models are fed on their performance? Our findings support Shumway (2001), Agarwal and Taffler (2008), and Bauer and Agarwal (2014), who suggest that using more information enhances the performance of failure prediction models.
Third, Panel 3 shows that both static and dynamic models developed using 3-year information outperform the ones using 5-year information, where the performance of the models is aggregated across all categories of information, except for DIWTIB-ln(age), for which the pattern is reversed because it does not take account of time dynamics. These findings answer the question: What is the effect of the length of the training sample on the models' performance?
Fourth, Panel 4 indicates that static models fed with 3-year features, whether they belong to category FA, category MVMI, or category FAMVMI, outperform those fed with 5-year features; that is, shorter horizons seem to enhance their aggregate performance, and this is consistent across all categories of information with which the static models are fed. As to the dynamic models, when fed with features belonging to category MVMI or category FAMVMI, shorter horizons seem to enhance their aggregate performance. However, when dynamic models are fed with features belonging to the FA category, longer horizons seem to enhance their aggregate performance, except for model DDWFSB. These findings answer the question: What is the effect of the length of the training sample on the models' performance?
Finally, regarding which DPMs perform better in predicting distress over the years with a high distress rate (HDR), by examining the rankings of models in HDR years, that is, 2003, 2008, and 2013, year by year and over all these years, we found that DDWTDB-ln(age) fed with FAMVMI and FAMV, respectively, are the best models for predicting distress during HDR years. Furthermore, LA fed with FAMVMI shows very good performance during HDR years despite its static nature.

| CONCLUSION
Prediction of corporate financial distress is crucial for many stakeholders and decision-makers in finance and investment. Although many models have been developed to forecast bankruptcy and financial distress, the relative performance assessment of competing DPMs has remained, in practice, an exercise that is both mono-criterion and static. The mono-criterion nature of comparative studies is criticized because of conflicts in the rankings of models from one performance criterion or measure to another. The static framework of comparative studies does not support any monitoring of the performance of models over time.
In this study, we proposed a dynamic multi-criteria framework, based on Malmquist DEA, to evaluate the performance of DPMs. This framework provides a multi-criteria ranking per period that allows the monitoring of DPMs' performance over time. The multi-period ranking takes account of the variations of inputs and outputs over time (e.g., trends) and as such provides a fair evaluation of models. Also, a multi-period ranking of DPMs allows one to assess the robustness of models' predictive power and accuracy during different periods and business cycles. In sum, the proposed performance evaluation framework is a powerful tool for monitoring the performance of DPMs over time, highlighting any impact of specific events (e.g., economic recessions) on their performance, and suggesting how DPMs, as specified by particular sets of explanatory variables, could be revised once they become outdated or less relevant.
Also, we performed a comparative analysis of the most cited static and dynamic DPMs, developed using different combinations of information. Different rounds of evaluation were conducted using several combinations of measures of the four categories of performance criteria, that is, discriminatory power, information content, calibration accuracy, and correctness of categorical predictions. Our main findings could be summarized as follows.
First, the aggregate performance, across all categories of information and time horizons, shows that (a) dynamic models are superior to static ones and (b) dynamic models developed under duration-dependent frameworks outperform dynamic models developed under duration-independent frameworks and static models. These findings answer one of our research questions, namely, what is the effect of the modeling framework design on the performance of models?
Second, the aggregate performance, across all time horizons, shows that models fed with FAMVMI features outperform those fed with either MVMI or FA, regardless of the frameworks under which they are developed. In addition, models fed with MVMI features outperform those fed with FA. These findings answer another one of our research questions, namely, what is the effect of the type of information with which models are fed on their performance?
Third, the aggregate performance of models, across all categories of information, shows that shorter horizons seem to enhance the aggregate performance of both static and dynamic models, except for DIWTIB-ln(age), whose excessively poor performance when fed with FA information, compared with when fed with either MVMI or FAMVMI information, explains its poor aggregate performance over a shorter horizon. These findings answer another one of our research questions, namely, what is the effect of the length of the training sample on the models' performance?
Fourth, the aggregate performance of models, across all combinations of inputs and outputs, shows that (a) shorter horizons seem to enhance the aggregate performance of static models, and this behavior is consistent across all categories of information with which the static models are fed, and (b) shorter horizons seem to enhance the aggregate performance of dynamic models fed with features belonging to category MVMI or category FAMVMI; however, when fed with features belonging to the FA category, longer horizons seem to enhance their aggregate performance, with the exception of model DDWFSB. These findings also answer our research question: what is the effect of the length of the training sample on the models' performance?
Finally, regarding which DPMs perform better in predicting distress over the years with a high distress rate (HDR), we found that DDWTDB-ln(age) fed with FAMVMI and FAMV, respectively, are the best models for predicting distress during HDR years. Furthermore, LA fed with FAMVMI shows very good performance during HDR years despite its static nature.
One limitation of this research is the space constraint, which led us to restrict this study to financial distress as the event of interest. Future research could investigate other definitions of failure, such as bankruptcy and debt restructuring. Moreover, this study is restricted in terms of data (i.e., companies listed on the LSE) and types of models (i.e., statistical models). Future studies could incorporate machine learning and artificial intelligence techniques. Also, future studies could analyze the extent to which failure prediction models are generalizable by considering data from other countries or stock exchanges.

APPENDIX A

The table below provides details of the statistical models developed in this study.

Multiple discriminant analysis (MDA): The common form of the discriminant analysis (DA) model for group $k$ (assuming $n$ groups) could be written as follows:

$$z_k = f\left(\beta_k^{\top} x\right) \quad \text{(Equation 1)}$$
where $x_j$ denotes discriminant feature $j$, $\beta_{kj}$ denotes the discriminant coefficient of feature $j$ for group $k$, $z_k$ denotes the score of group $k$, and $f$ is the classifier (linear or nonlinear) that maps the estimated scores of $\beta^{\top} x$ onto a set of real numbers. We followed Hillegeist et al. (2004) in converting estimated scores to the probability of distress using a logit link as follows:

$$P(\text{distress})_i = \frac{e^{z}}{1 + e^{z}} \quad \text{(Equation 2)}$$

Logit analysis (LA): The common model of logit analysis for binary variables could be defined as follows:

$$P(Y = 1 \mid X) = G\left(\beta^{\top} X\right) \quad \text{(Equation 3)}$$

where $Y$ represents the binary dependent variable, $X$ represents the vector of features, $\beta$ is the vector of coefficients of $X$ in the model, and $G(\cdot)$ represents a link function that maps the estimated scores of $\beta^{\top} x$ onto a probability. In practice, the link function determines the type of probability model. For example, the link function for a logit model (respectively, probit model) is the cumulative logistic (respectively, cumulative standard normal) distribution function.
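As a numeric illustration of the logistic link in Equations 2 and 3, the following sketch maps a linear score to a distress probability; the coefficients and feature values are made up, not estimates from the paper:

```python
import math

# Minimal sketch of the logit link: a linear score z = beta' x is mapped
# to a probability in (0, 1) via the cumulative logistic distribution.

def logistic(z):
    return math.exp(z) / (1.0 + math.exp(z))

beta = [-2.0, 1.5, -0.8]            # intercept + two hypothetical features
x = [1.0, 0.4, 1.2]                 # firm's feature vector (with constant)
z = sum(b * xi for b, xi in zip(beta, x))
p_distress = logistic(z)            # e^z / (1 + e^z), Equation 2
```

A probit model would differ only in replacing `logistic` with the standard normal CDF.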
Discrete-time hazard framework, duration-dependent hazard model (DD): Shumway (2001) applied an estimation procedure like the one used in estimating the parameters of a multi-period logit model and proposed a discrete-time hazard model as follows:

$$P(y_{i,t} = 1 \mid x_{i,t}) = h(t \mid x_{i,t}) = \frac{e^{(\alpha_t + x_{i,t}\beta)}}{1 + e^{(\alpha_t + x_{i,t}\beta)}} \quad \text{(Equation 4)}$$

where $h(t \mid x_{i,t})$ denotes the hazard rate of firm $i$ at time $t$, and $x_{i,t}$ represents the vector of features of firm $i$ at time $t$. $\alpha_t$ denotes the time-varying baseline hazard function, which could be associated with the firm, for example, ln(age), or with macroeconomic conditions, for example, exchange rate volatility (Nam et al., 2008). Note that Shumway used ln(age) as the time-varying baseline rate. The notation of the duration-dependent hazard model is as follows:

$$h(t \mid x_{i,t}) = h_0(t)\, e^{x_{i,t} \beta} \quad \text{(Equation 5)}$$

Discrete-time hazard framework, duration-independent model with time-invariant baseline (DIWTIB) and duration-independent model without baseline hazard rate (DIWOTIB): The coefficients of the features for the duration-independent hazard models are estimated using the multi-period logit framework. However, contrary to the duration-dependent models, the baseline hazard rate of DIWTIB is a time-invariant term, which could be represented by firm-related features such as ln(age) or 1/ln(age), or by macroeconomic features such as exchange rate volatility. The notation of the duration-independent hazard model is as follows:

$$h(t \mid x_{i,t}) = h_0\, e^{x_{i,t} \beta} \quad \text{(Equation 7)}$$

DIWOTIB also uses the multi-period logit framework to estimate the coefficients of the features; however, contrary to DD and DIWTIB, it does not use any baseline hazard rate:

$$P(y_{i,t} = 1) = \frac{1}{1 + e^{-x_{i,t} \beta}} \quad \text{(Equation 8)}$$

Cox hazard framework: The Cox hazard model (Cox, 1972) is another duration-dependent model that can take account of time-varying covariates of a firm. This model could be presented as follows:

$$h(t \mid x_{i,t}) = h_0(t)\, e^{x_{i,t} \beta} \quad \text{(Equation 9)}$$

The vector of coefficients $\beta$ is estimated using a partial likelihood function on the training sample, as follows:

$$PL(\beta) = \prod_{i} \frac{\exp\left(\sum_{j=1}^{p} \beta_j x_{ij}\right)}{\sum_{k \in R(t_i)} \exp\left(\sum_{j=1}^{p} \beta_j x_{kj}\right)} \quad \text{(Equation 10)}$$

where $i$ denotes a firm in the event of distress, $p$ is the number of features, and $k$ indexes the firms in the risk set $R(t_i)$ at time $t_i$. Note that this equation estimates the vector $\beta$ without estimating the baseline hazard rate (Hosmer & Lemeshow, 1999). However, to apply the developed model to estimate the probability of distress, the baseline hazard rate term is required. Therefore, we followed Chen et al. (2005) in estimating the integrated baseline hazard function with time-varying covariates, based on Andersen (1992), as follows:

$$\hat{H}_0(t) = \sum_{i:\, \hat{T}_i \leq t} \frac{D_i}{\sum_{k \in R(\hat{T}_i)} e^{x_k \hat{\beta}}} \quad \text{(Equation 11)}$$

where $D_i$ denotes a dummy variable equal to 1 if firm $i$ faces distress, and 0 otherwise; $\hat{T}_i$ is the distress time for the $i$th firm; and $\hat{\beta}$ is the vector of estimated coefficients. Using Equations 10 and 11, we estimate the probability of distress for individual firms via Equation 9.
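The integrated baseline hazard in Equation 11 can be sketched numerically with a Breslow-type estimator, consistent with the definitions in the text; the event times, statuses, and risk scores below are made up, and ties in event times are assumed away:

```python
import math

# Hedged numeric sketch of a Breslow-type integrated baseline hazard for a
# fitted Cox model: at each distress time, add 1 divided by the summed risk
# scores exp(x_k * beta_hat) of the firms still at risk.

def breslow_cumulative_hazard(times, events, risk_scores):
    """Return {t: H0(t)} at each observed distress time (ascending)."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    H0, cum = {}, 0.0
    for pos, i in enumerate(order):
        if events[i]:
            # risk set: firms whose observed time is >= t_i
            at_risk = [risk_scores[j] for j in order[pos:]]
            cum += 1.0 / sum(at_risk)
            H0[times[i]] = cum
    return H0

times = [2, 3, 5, 7]                 # years until distress or censoring
events = [1, 0, 1, 1]                # 1 = distressed, 0 = censored
risk = [math.exp(0.4), math.exp(-0.1), math.exp(0.2), math.exp(0.0)]
H0 = breslow_cumulative_hazard(times, events, risk)
```

The resulting step function is nondecreasing and, combined with the risk score of a given firm, yields its estimated hazard via Equation 9.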

TABLE 11: Ranking of models using one combination of inputs and outputs.

FIGURE: Aggregate rankings over the 15-year period and HDR years for each individual combination of inputs and outputs.