Advanced methods and implementations for the meta-analyses of animal models: Current practices and future recommendations

Meta-analytic techniques have been widely used to synthesize data from animal models of human diseases and conditions, but these analyses often face two statistical challenges due to complex nature of animal data (e.g., multiple effect sizes and multiple species): statistical dependency and confounding heterogeneity. These chal- lenges can lead to unreliable and less informative evidence, which hinders the translation of findings from animal to human studies. We present a literature survey of meta-analysis using animal models (animal meta-analysis), showing that these issues are not adequately addressed in current practice. To address these challenges, we propose a meta-analytic framework based on multilevel (linear mixed-effects) models. Through conceptualiza- tion, formulations, and worked examples, we illustrate how this framework can appropriately address these issues while allowing for testing new questions. Additionally, we introduce other advanced techniques such as multivariate models, robust variance estimation, and meta-analysis of emergent effect sizes, which can deliver robust inferences and novel biological insights. We also provide a tutorial with annotated R code to demonstrate the implementation of these techniques.


Introduction
Meta-analysis has made a significant contribution to many disciplines including behavioral and neurobiological sciences, playing a crucial role in quantitatively summarizing existing findings and informing evidence-based decision-making (Bannach-Brown et al., 2021;Gurevitch et al., 2018;Nakagawa et al., 2020;Vesterinen et al., 2014). Traditionally, meta-analyses have been used to synthesize data from human randomized controlled trials (RCT), describing the efficacy of treatments and interventions (Chalmers and Haynes, 1994;Schmid et al., 2020). More recently, there has been a surge in meta-analyses using data from animal studies modelling human diseases, physiology, and behaviour (hereafter, "animal meta-analyses"; see Greek and Menache, 2013;Hooijmans et al., 2014;Hunniford et al., 2021).
The aims of animal meta-analyses are diverse with examples including predicting the effectiveness of therapeutic strategies for a neurological disorder (Baldez et al., 2021), identifying study characteristics mediating the effectiveness of a therapy (Figueiredo et al., 2020), and explaining replication failures of preclinical studies (Usui et al., 2021). Perhaps most importantly, high-quality animal meta-analyses can generate reliable and useful appraisals of preclinical experimental evidence, which can be used to direct human trials and inform decisions to progress to clinical applications (Bahadoran et al., 2020;de Vries et al., 2014;Soliman et al., 2020). Therefore, animal meta-analyses can serve as a useful complement to traditional medical meta-analyses (e.g., clinical meta-analysis on RCT data). Although animal meta-analyses are important and performed with increasing frequency, we know very little about their methodological rigor (Hunniford et al., 2021;Mueller et al., 2014). Indeed, as far as we know there is no systematic profiling of methodological and reporting practice of animal meta-analyses (but see Hooijmans et al., 2022).
We have predicted that two major statistical issues had not been adequately addressed in the current practice of animal meta-analyses because the meta-analytic methods (i.e., fixed-effect and random-effects models) inherited from human medical meta-analyses, are technically incapable of handling the two problems. The first issue is the violation of the assumption of statistical independence between effect sizes (e.g., multiple effect sizes per study/paper). Indeed, the true effects and sampling errors are inevitably correlated in animal studies that involve multiple measurements, multiple animal cohorts, multiple species, and multiple treatments ( Fig. 1; Aarts et al., 2014;Pound and Bracken, 2014;Vesterinen et al., 2014). Therefore, an animal meta-analysis often has a hierarchical/multilevel/nested data structure (sometimes a multivariate structure) which might lead to the dependency among the true effect sizes (i.e., the correlation/covariance of the true effects within the same study). Importantly, the traditional statistical models used in animal meta-analyses often fail to deal with this statistical non-independence (but see Bonapersona et al., 2018;Lagisz et al., 2020), inflating type I error, and thus distorting statistical inference, leading to spurious conclusions (Cheung, 2019;Nakagawa et al., 2017;Thomas et al., 2003). The second issue relates to the first point. The hierarchical nature of the animal datasets means that animal meta-analyses could be improved, in terms of explanatory power, from decomposing variance in effect sizes across different levels, e.g., within-and between-study level heterogeneity (a more complex level: species-specific heterogeneity; Konstantopoulos, 2011;Senior et al., 2016). However, the traditional meta-analytic models treat between-experiment variability as the only source of heterogeneity, confounding between-study heterogeneity with other sources of heterogeneity (e.g., within-study between-experiment heterogeneity), and subsequently leading to less informative quantification.
Methodological innovations and advances in meta-analysis offer us pragmatic and effective strategies to deal with these two issues. The use of multilevel meta-analytic models has been developed explicitly to account for non-independent and heterogeneous effect sizes and has successfully been deployed, for example, in ecology, evolution, psychology, and education (Cheung, 2014(Cheung, , 2019Nakagawa et al., 2017;Nakagawa and Santos, 2012). The multivariate models and robust variance estimation (RVE) are also effective approaches to handling dependent effect sizes in the context of meta-analysis (Cheung, 2013;Fisher and Tipton, 2015;Hedges et al., 2010;Jackson et al., 2011;Pustejovsky and Tipton, 2022;Welz et al., 2023). Furthermore, some underappreciated effect size statistics can inform methodological choices and provide new neurobiological insights for animal meta-analyses. For example, meta-analysing variability-based effect sizes which compare differences in variance (rather than its mean) between two groups (i.e., meta-analysis of variance; Fig. 2) enables us to scale out research questions, such as investigating the individual variability in treatment response to drugs in neurological and behavioral disorders (Maslej et al., 2021;Nakagawa et al., 2015;Senior et al., 2020).
Here, we first conduct a systematic literature survey to characterize the current practice of animal meta-analyses by mapping their issues and statistical approaches. While presenting our survey results, we also summarize the traditional and emerging meta-analytic statistical procedures. Second, we illustrate the concepts and rational of the multilevel models (only a small number of researchers have already applied these models to animal meta-analyses; see Bonapersona et al., 2018;Lagisz et al., 2020); how they can deal with non-independence among effect Fig. 1. Schematic of nested experimental designs and clustered data structures in animal studies. Neurobiology and behavioral sciences often involve nested experimental designs, in which multiple traits are measured from a single experimental unit. For example, in an animal experiment where independent mice are randomly allocated to drug-treatment groups with different drug doses and a control group (placebo), multiple traits can be measured from one mouse for each group. Further, measurements can be taken repeatedly over time (longitudinal measures) or from the same body parts. All these can lead to multiple effect sizes per study/paper (effect sizes are correlated with each other within the same studies), which violates the critical assumption of statistical independence between effect sizes. sizes and how multilevel meta-regression can account for multiple levels of heterogeneity. Finally, we touch upon advanced techniques, including the multivariate model and RVE robust variance estimation. Additionally, we provide recommended practices and a hands-on workflow (with annotated R code) that can serve as a template to deal with the statistical issues commonly encountered in animal meta-analyses. Future meta-analysts can adapt our example code to undertake their own animal meta-analyses to draw more robust model inferences, generate new neurobiological insights, and better facilitate animal-to-human translation (Bahadoran et al., 2020).

Survey procedures
The main aim of the literature survey was to capture the state of current practice in conducting meta-analyses on non-human animal data. As such, we performed a systematic literature search of neurobiology and behavioral journals to identify meta-analytic papers published in the last 10 years (2011)(2012)(2013)(2014)(2015)(2016)(2017)(2018)(2019)(2020)(2021). The details of methodological procedures of the search strategy are detailed in Supplementary Materials (file 1), reported according to PRISMA guidelines (Moher et al., 2009). Briefly, we used the ISI InCites Journal Citation Reports to collect the ISSN of journals classified under 'Neurology' and/or 'Behaviour' (including 115 journals; see details in Supplementary Materials file 1). We then searched within these included journals using the online database PubMed (last updated in October 2021) for meta-analytic papers published within the last 10 years. We restricted our searches to studies labelled as non-human studies and as 'meta-analysis' in the database. This search yielded a total of 188 bibliographic references. Two authors (YY and ML) then performed two-stage screening to identify animal meta-analyses using the following inclusion criteria: (1) the paper addressed a question in the fields of neurology or behavioral sciences; (2) claimed to conduct a meta-analysis; and (3) used data from the animal models (i.e., in vivo animal studies for preclinical research) rather than exclusively human studies. There were 2% conflicting decisions, which were resolved via discussions. Finally, we included 78 papers claiming to be meta-analyses, of which 62 papers were eligible for data extraction of methodological approach.

Data coding
Two authors (YY and ML) independently assessed the full text of the eligible papers to retrieve the following information: (1) whether the authors conducted a formal meta-analysis of animal data (e.g., fixedeffect or random-effects statistical models, or related model, to aggregate effect sizes that were collected from animal studies), (2) the type of effect size used, (3) the type of statistical models employed, (4) whether and how the study accounted for heterogeneity, (5) the type of heterogeneity indices reported, (6) whether the authors reported the number of primary studies (N) and effect sizes (k), (7) whether the data used was potentially at risk of issues of statistical non-independence (e.g., if the ratio of N to k is larger than 1, or if one primary study contributes more than one effect size), (8) whether the authors of meta-analysis acknowledged the presence of statistical non-independence, (9) whether the authors used any procedure to deal with non-independence among effect sizes (i.e., correlated effect sizes), (10) whether the authors addressed the issue of sampling errors non-independence (i.e., correlated sampling errors), (11) whether the authors used any approaches to test for publication bias, and (12) whether the authors reported the software used to perform the animal meta-analyses. In addition, we extracted the bibliometric information of each included animal metaanalysis (e.g., title, authors, the journal published, publish year). We describe the survey questions and associated options in Supplementary Materials (file 1).

Fig. 2.
A diagram showing the computation and interpretation of three important but underappreciated effect sizes in the meta-analysis on animal data. Imagine we aim to screen potential anti-dementia drugs using a fearconditioning test. To do so, we need to answer two questions. The first is "what is the average treatment effect of a drug?" The log-transformed response ratio (lnRR) can quantify the average treatment effect by comparing the ratio of means between the drug group and the placebo group. The second is "do all animals respond in a similar way to the drug?" Here, the log-transformed variability ratio (lnVR) can be used to detect heterogeneous treatment effects. The log-transformed coefficient of variation ratio (lnCVR) is a mean-adjusted version of lnVR, in which the indirect effect from the mean is controlled for. lnVR > 0 or lnCVR > 0 indicate that the tested drug shows a selective efficacy (i.e., high inter-individual variability).

Methodological and reporting practice in animal meta-analyses
For the included 78 self-proclaimed animal meta-analyses, 21% (16) did not conduct formal meta-analyses of animal data (Supplementary Materials file 1). These 'non-formal' meta-analyses used other than formal 'effect-size-based' meta-analysis modelling approaches, which included three categories: (1) meta-analyses of genetics, genomics or bioinformatics (e.g., transcriptional, genome-wide association studies), (2) meta-analyses of brain imaging (e.g., fMRI), and (3) other selfproclaiming as containing meta-analyses but not fitting into any traditional ("formal") meta-analysis framework. For the 62 "formal" metaanalyses, we coded their methodological and reporting practice. We reported the main results in Fig. 3 (for the full survey results, see Supplementary Materials file 1). In the later section, we will discuss the details of each methodological and reporting characteristic of the surveyed animal meta-analyses, in combination with the explanations of the advanced meta-analytic methods.

Common effect sizes in the meta-analysis of animal data
Effect size is a statistic that can measure the size and direction of an effect (e.g., the effectiveness of an antidepressant or the prevalence of depression occurring; referred to as 'effect statistic'; see Nakagawa and Cuthill, 2007). It also can refer to the estimate (value) of a given effect Fig. 3. The summary of the main results of a survey of 62 "formal" meta-analyses using animal models mimicking human diseases, physiology, and behaviour (2011 -2021). Summarized methodological and reporting practice regarding (A) effect size used, (B) statistical models employed, (C) heterogeneity indices, (D) sign of statistical non-independence (multiple effect sizes per animal study; the medium number of effect size per animal study k = 56, the medium number of animal studies per meta-analysis N = 25, k/N ratio = 2.2), (E) addressing non-independent effect sizes, and (F) publication bias test methods. Barplots indicate numbers of papers with a given methodological and reporting characteristic. Plural answers were allowed (i.e. one paper could fit into more than one category/option, for example, both the random-effects model and meta-regression were used in Leffa et al., 2019, such papers were used more than once when tallying counts). We also surveyed other questions, such as whether and how the study accounted for heterogeneity and whether the authors of the meta-analysis acknowledged the presence of statistical non-independence. For the detailed survey methods, questions, and complete results, see Supplementary Materials file 1.
statistic (e.g., 'effect estimate'; see Hentschke and Stüttgen, 2011). In this paper, we use the two definitions interchangeably. Our survey indicates that the effect sizes used in animal meta-analyses can be categorized roughly into four types: (1) standardized mean difference between two (experimentally created or intrinsically occurring) arms, SMD (68%, see Fig. 3; e.g., 'd' familycommon estimators of SMD are Cohen's d and Hedges' g;Hedges, 1982), (2) the incidence of two events (10%; e.g., 2-by-2 contingency table: odds ratio, relative risk, and risk difference), (3) strength of the association/correlation between two variables (1%; correlation coefficient, r, or its Fisher's z transformation, Zr), (4) others (18%; e.g., unstandardized raw mean difference, and determination coefficient, R 2 ). Our survey clearly shows that the most popular effect size is the standardized mean difference, SMD (the 'd' family -Cohen's d or Hedges' g). The formulas used to calculate point estimates of these effect sizes and their sampling variances can be found elsewhere (Hentschke and Stüttgen, 2011;Nakagawa and Cuthill, 2007;Noble et al., 2017).

Emerging effect sizes in the meta-analysis of animal data
Our literature survey revealed the use of three important but underappreciated effect sizes in animal meta-analyses. The first is the normalized mean difference, NMD (9%; Fig. 3), which divides the raw mean difference between two arms (the experimental group vs. control/ reference group, e.g., placebo, sham, or wild-type group) by the mean of the control group (Vesterinen et al., 2014). The second is the log-transformed response ratio, lnRR (1%; Fig. 3), which uses the natural logarithm of the ratio of means between two arms to measure mean difference ( Fig. 2; estimating the average treatment effect; Hedges et al., 1999;Lajeunesse, 2011). The third is the log-transformed variability ratio, lnVR (1%; Fig. 3), which can quantify the difference in variance (standard deviation) around the mean between two arms ( Fig. 2; i.e., estimating inter-individual variability between two arms or heterogeneity of treatment effect; Nakagawa et al., 2015;Senior et al., 2020). The log-transformed coefficient of variation ratio (lnCVR) is a mean-adjusted version of lnVR, where the indirect impact of mean on its variability is controlled for (i.e., accounting for the mean-variance relationship; Nakagawa et al., 2015;Volkmann et al., 2020).
The use of lnVR and lnCVR provides the opportunity to reveal new or neglected neurobiological insights to the field. For example, metaanalysis of traditional effect sizes (e.g., SMD) mainly focuses on how therapy can mitigate neurobiological-and behavioral-disorders (i.e., detecting average treatment effect; Mills et al., 2021). In contrast, meta-analysis of variation (e.g., lnVR and lnCVR) can examine inter-individual variability in response to treatment in mental symptoms (i.e., detecting heterogeneous treatment effect; Hieronymus et al., 2020). This means meta-analysis of variation can be used to examine whether a treatment shows a consistent (in all animals: low inter-subject variability) or selective (in some animals: high inter-subject variability) efficacy (Usui et al., 2021). Examining such inter-subjective variability brings important implications to precision and personalized medicine and behavioral therapies (Lorenzo-Luaces et al., 2021;Luedtke and Kessler, 2021). For example, if a treatment response manifests high inter-subjective variability, it indicates the treatment effect may be subjective-specific and warrants further research on the sources of variability and personalized prescriptions in the clinic practice (Haggarty et al., 2021;Schork, 2015).

The choice of effect sizes in the meta-analysis of animal data
In general, the choice of effect sizes is straightforward because the types of effect sizes should be aligned with the scientific question or hypothesis of a meta-analysis . To put it another way, the goal of a given animal meta-analysis will generally determine which types of effect sizes should be used. For example, if interested in the effectiveness of a given intervention, researchers might consider effect sizes such as SMD (e.g., mean differences between antidepressants and placebo group expressed in the units of standard deviation) or lnRR (e.g., % increase in mean in antidepressants compared with placebo). If the goal is to examine the external validity of animal behavioral assays (e.g., forced swimming or Morris water maze), lnVR or lnCVR (when a mean-variance relationship exists) is preferable because an optimal animal assay should have less inter-individual variability so that it is more reproducible and generalizable.
Recent empirical and simulation work from ecology and animal sciences indicated lnRR and NMD are in general more statistically powerful than SMD and less vulnerable to overestimation (Wang et al., 2018;Yang et al., 2022aYang et al., , 2022b. In conjunction with other merits (e.g., better interpretability, heteroscedasticity-robustness, scale-independence), lnRR and NMD provide 'animal' meta-analysts with a complement to SMD when mean difference is the focal interest (Sánchez-Tójar et al., 2020;Spake et al., 2021). But note that NMD and lnRR are only applicable for ratio-scale data (i.e., bounded at zero, for example, brain size; Houle et al., 2011;Nakagawa et al., 2015). Also, SMD represents additive effects while NMD and lnRR indicate multiplicative effects. Therefore, the dual use of these two types of effect statistics may be advisable whenever possible (e.g., SMD as the main effect size; lnRR for sensitivity analysis; see Supplementary Materials file 2 for a real example).

The prevalence of statistical non-independence in the metaanalysis of animal data
As mentioned, multiple effect size estimates from the same study can lead to non-independence, which violates the assumption of statistical independence of data in traditional meta-analytic models (also see Section 6.1). Non-independent effect sizes mean that they contribute similar information to the fitted model, which requires each effect size should contribute unique information. Our survey has shown that the issue of non-independence was near ubiquitous in animal meta-analyses: 89% (55/62) of animal meta-analyses used more than one effect size from the same study (Fig. 3). Multiple effect sizes originating from the same study might be statistically dependent (i.e., correlated) because they share the same subjects, methodologies, instruments, measurement procedures or other contexts that might induce correlations. Multiple effect size estimates from the same study introduce two types of nonindependencies: correlation in the true underlying effects and the observed effect size estimates (i.e., sampling errors). We summarize the common scenarios of non-independence in animal meta-analyses in Fig. 4. The general causes are: (1) shared study identitymultiple effect sizes are derived from a single study (e.g., traits measured repeatedly at follow-up times or separately for males and females; Abbott et al., 2019;Bird et al., 2016), (2) shared animal identitymultiple traits, outcomes, or endpoints measured on a common set of animals (e.g., measuring depression using forced swim test and tail suspension test for the same cohort; Barha et al., 2017;Burgueno et al., 2020), (3) shared controlmultiple treatments compared to a common control group (e.g., multiple trial arms; England et al., 2015;Neville et al., 2020), and (4) shared species (or strains) or evolutionary historyspecies similarities due to shared evolutionary history (e.g., genetic similarities and phylogenetic relatedness; Khorshidi et al., 2021;Lagisz et al., 2020;Zoerle et al., 2012). Scenarios 1 -4 can result in correlations of the true underlying effects. At the same time, scenarios 2 and 3 also lead to correlations of the correlated sampling errors because of the 'overlapping' animals or repeated use of the same animals when computing effect size estimates (see Fig. 4 and Section 9.2).

Approaches to handle statistical non-independence in the meta-analysis of animal data
We have found that three broad strategies were used when nonindependence was encountered in animal meta-analyses ( Fig. 5): (1) (caption on next page) Y. Yang et al. ignoring non-independence (Lages et al., 2021;Mancini et al., 2020), (2) eliminating non-independence (e.g., averaging or sampling; Currie et al., 2019;Frantzias et al., 2011), and (3) modelling dependence explicitly (Bonapersona et al., 2018;Creutzberg et al., 2021). The first approach was the most used method employed in animal meta-analyses (48%; Fig. 3), which is concerning as it ignores the dependency among effect sizes and treats them as if they were statistically independent (i.e., via the use of simple fixed-effect and random-effects models for analysing non-independent effect sizes). For example, Lages et al., 2021 included 17 effect sizes from six papers but the review authors used a random-effects model in their analyses without accounting for any non-independence.
The second common strategy in the current practice of animal metaanalyses (29%; Fig. 3) also uses a standard meta-analytic model but after aggregating effect sizes or sampling a single effect size from each study to eliminate statistical non-independence of data (Fig. 5). For example, Egan et al. (2014) used a fixed-effect meta-analysis for studies having multiple effect sizes to obtain a 'synthetic' effect size per study (also see Frantzias et al., 2011). The third method, the least common (11%, seven animal meta-analyses), was to explicitly account for dependence among effect sizes or sampling variances using a multilevel model with a suitable random-effects structure. We elaborate on this technique in Section 6.
Using the first and second approaches to account for nonindependence in data is not necessarily incorrect, but they have obvious pitfalls (Cheung, 2014;Nakagawa et al., 2021c;Song et al., 2020). In brief, ignoring non-independence (the first approach) does not necessarily overestimate the model coefficients (e.g., pooled effect size or overall/mean effect or grand mean) but could bias the estimate of model coefficients and underestimated standard errors and, distort the subsequent hypothesis tests with the non-nominal level of Type I error rates and coverage rates of confidence intervals (Cinar et al., 2021a; Table 1). Eliminating non-independence (the second approach) could reduce the statistical power and precision of the model coefficients due to the loss of "sample size" (in this case, the number of effect sizes), and limit the capacity to ask new questions, for example, investigating the drivers of effect size heterogeneity because of the loss of information (explanatory variables, predictors or moderators; Cheung, 2014;Nakagawa et al., 2021c;Song et al., 2020;see Fig. 5). Importantly, neither ignoring nor eliminating non-independence can deal with a more complex non-independence issue such as phylogenetic relatedness (or phylogenetic correlation), which arises when incorporating data from multiple species (Nakagawa and Santos, 2012;Noble et al., 2017). Because species with a shared evolutionary history appear to be more similar to each other, leading to the effect sizes derived within the same species are not independent ( Fig. 4D; Hadfield and Nakagawa, 2010). We suppose that the widespread use of 'ignoring' and 'eliminating' strategies is caused in part by the low uptake of the suitable easy-to-implement methods that can directly model non-independence and practitioners might prefer simple methods (e.g., random-effects model) or software (e.g., Review Manager) they are familiar with, although the lack of awareness of the importance of accounting for non-independence may make a major contribution. For example, Stukalin et al. (2020) treated one experiment with different antidepressant doses as two or more independent experiments in their meta-analytic models.

Fixed-effect and random-effects meta-analytic model
In animal meta-analytic practice, the random-effects model was the most commonly employed (50%; Fig. 3). As noted above, however, the random-effects model is not capable of accounting for non-independent data. In this section, we re-formulate the traditional meta-analytic model as a multilevel model to better capture the hierarchical data structures in animal studies (Fig. 6), such that we can explicitly model the non-independence. Following the conventions in animal metaanalytic practice, the random-effects model can be written as (Vesterinen et al., 2014): where ES j = an estimated/observed effect size for study j (j = 1,…,N; N is the number of animal studies included in a meta-analysis; per animal study contributes one ES j ); ES j can be any effect size type commonly used in animal meta-analyses (e.g., SMD; see Section 3); β 0 = an intercept denoting average true effect/pooled effect size (also known as the global estimate, overall mean or meta-analytic mean); u j = a random effect corresponding to the between-study effect for study j, whose mean E[u j ] = 0 and variance Var[u j ] = τ 2 ((i.e., assumed to be normally distributed with mean 0 and variance τ 2 ); e j = an error term correseponding to the sampling error effect for study j, where Var[e j ] = v j is the sampling variance, which is assumed to be known and estimated, usually, from formulas (e.g., 1/(n − 3) for Zr with n = the number of subjects; see Nakagawa and Cuthill, 2007;Noble et al., 2017, for formulas to estimate v j for the common effect sizes). The fixed-effect (common-effect or equal-effect) model (10% of the surveyed studies; Fig. 3) can be considered to be a special case of the random-effects model; that is, the random-effects model assumes heterogeneous between-study effects (τ 2 > 0) and the fixed-effect model assumes homogeneous between-study effects (τ 2 = 0; ES 1 = … = ES N ≡ β 0 ). Notably, although the fixed-effect model is appropriate on some occasions, most meta-analyses assume non-zero heterogeneity in the underlying effects (across animal cohorts, species, and settings; IntHout et al., 2016;Senior et al., 2016).

Fig. 4.
Four common drivers of statistical non-dependence in meta-analyses on animal data (A) -(D). Statistical non-independence means that effect sizes have a multilevel/nested structure and are correlated within a 'cluster' variable, where effect sizes may be more similar than those across 'cluster' variables. Ignoring nonindependence might lead to a biased estimate of model coefficients, and underestimated standard errors and, consequently, hypothesis testing with incorrect Type I error rates and confidence intervals with incorrect coverage levels. ES = effect size (e.g., lnRR; Fig. 2). Double-headed arrows = correlations in effect sizes. (A) Shared study identity where a single construct of interest (i.e., single outcome, response variable) is measured using different assays, instruments, or scales within the same study. For example, when three assays are simultaneously performed on animals from an antidementiaplacebo comparison to quantify the magnitude of dementia, correlations occur among ES 1 to ES 3. It should be noted that multiple constructs within the same study (e.g., dementia and anxiety) are also possible. In such a case, (multivariate) effect sizes are better modelled by a multivariate than a multilevel model (see Section 9.2). (B) Shared animal identity where multiple effect sizes can be derived from a single animal cohort. For example, brain morphology can be measured for different parts of the same brain. Another example, fear condition test can be conducted on an animal at follow-up times (1, 3, and 5 days). Note that shared animal identity also leads to correlated sampling errors because the same cohort animal will be used multiple times when computing effect sizes. (C) Shared control where multiple treatment groups are compared to the same control group. For example, when comparing a single placebo (control) to two doses of the antidementia drug, correlations in effect sizes occur because data from the placebo group is repeatedly used to calculate effect sizes. Note that shared control also contributes to correlated sampling errors. (D) Shared species or evolutionary historyeffect sizes taken from the same species may be more similar. For example, if three effect sizes were calculated from rabbits, they would be correlated with one another because of the similarity of individuals within the species. Moreover, ES 1 to ES 3 shown in panel D are not independent because of the phylogenetic relatedness (where mice and rabbits are more similar to each other than to dogs). See the main text for real examples for each scenario.

Fig. 5.
Common approaches used to handle statistical non-independence among effect sizes in meta-analyses on animal data. ES = effect size (SMD or lnRR; ES [i] means ith effect size in jth study). v = sampling variance (v [i] means sampling variance of ith effect size in jth study). (A) Following the hypothetical data set with six non-independent effect sizes from three studies (panel (C) in Fig. 2), effect sizes are not independent in Studies 1 and 3 due to shared control (i.e., multiple doses). (B) Researchers completely ignore non-independence by using a fixed-effect and random-effects model to fit these non-independent effect sizes, treating them as if they were statistically independent. (C) Researchers first compute the simple average (weighted average) of multiple effect sizes for Studies 1 and 3 and subsequently use it in a fixed-effect or random-effects model. (D) Researchers first select one random effect size from Studies 1 and 3 and subsequently use it in a fixed-effect or random-effects model. As clearly shown in panels C and D, eliminating non-independence could lead to the loss of information (in this case, within-study moderator: doses of the antidepressant). See the main text for real examples for each approach.

Multilevel meta-analytic models
Importantly, both fixed-effect and random-effects models assume N = k (the number of studies is the same as the number of effect sizes included in a meta-analysis). In animal meta-analytic practice, k is usually larger than N. In our survey results, the medium N was 25 (range = 2 -414, mean = 50; Supplementary Materials file 1), while the medium k was 56 (range = 5 -3288, mean = 203). The medium k to N ratio was 2 (range = 1 -28, mean = 3), indicating that most surveyed metaanalyses extracted multiple effect sizes per study and thus have the issue of statistical non-independence.
A traditional meta-analytic model can be reformulated as a multilevel model to handle the issue of statistical non-independence (Cheung, 2014;Nakagawa and Santos, 2012;Van den Noortgate et al., 2013). In our survey, only seven animal meta-analyses employed the multilevel meta-analytic model. Such a model can be formulated as (i.e., three-level meta-analysis): where ES i = an effect size estimate; note that for a given animal study j (j = 1,…,N), it often contains multiple effect sizes i (i = 1,…,k; k > N); , an intercept, representing the overall / pooled mean, (i.e., average true effectthe true effect sizes across N studies follow a distribution with mean β 0 ); u b[j] = a random-effects term corresponding to between-study effect for study j applied to effect size i (the subscript 'b ′ denotes 'between-study'), which captures study-specific heterogeneity (whose magnitude is determined by the degree of inconsistency between true effect sizes in study 1,…,N); u w[i] = a random-effects term corresponding to within-study effect for effect size i in study j (the subscript 'w ′ denotes 'within-study'), which captures effect-size-specific heterogeneity Var[u w[i] ] = σ 2 w (whose value is defined by the degree of inconsistency between multiple effect sizes in study j; also referred to as residual heterogeneity); e [i] = sampling error in effect size i in study j, which captures sampling variance Var[e i ] = v i (if sampling errors are correlated (where the same cohort animal is repeatedly used for effect size computation; scenarios 2 and 3 in Fig. 4), v i will become a variance and covariance matrix; details see Section 9.3). Because the only model coefficient in Eq. (2) is the intercept, β 0 (the estimate of the overall effect/pooled effect size/grand mean), we usually call it an intercept-only multilevel meta-analytic model. As shown in Fig. 6, when estimating β 0 , Eq. (2) can exactly capture the multilevel/ hierarchical data structure (e.g., multiple effect sizes are clustered/ correlated within a study) that arises the statistical non-independence among effect sizes.

The degrees of dependency
Compared to the random-effects model (Eq. 1), Eq. (2) explicitly models the non-independence due to multiple effect sizes within one study (i.e., the random-effect term u w[i] assumes heterogeneous effect sizes within studies). In the random-effects model, sampling variance (Var[e i ] = v i ) is treated as the only source of within-study variance, whereas it is not the case for the multilevel model (u w[i] lead to effect-size specific variance, which belongs to within-study variance). Therefore, the multilevel model can distinguish between sampling variance and within-study variance (see Section 6.3 for details). Moreover, the degrees of dependency among effect sizes can be quantified via the estimated correlation between the true underlying effects: intra-class In case of no dependency or effect sizes not correlated within clusters/studies (ICC = 0), all effect sizes derived from the same study are independent, meaning that they contribute fully unique information to the fitted model, which is an implicit assumption of fitting dependent effect sizes to traditional meta-analytic models. If ICC = 1, all effect sizes derived from the same study are nonindependent, meaning that they contribute the very same information to the fitted model, which is an implicit assumption of the 'averaging' method dealing with dependent effect sizes (see Section 5).

Parameter estimation and statistical inference
The parameters of the two random-effects terms (i.e., σ 2 b and σ 2 w ) can be estimated from the data along with E[ES [i] ] = β 0 (note that ES [i] and ν [i] can be directly computed from the data). There are various estimators to approximate σ 2 b and σ 2 w , such as maximum likelihood (ML), restricted maximum likelihood (REML), DerSimonian-Laird (DL), and Empirical Bayes (EB). Although DL is a common method in many meta-analyses (a default estimator in Review Manager, RevMan), simulation studies indicate REML and Empirical Bayes outperform over DL in different simulated data Tanriver-Ayder et al., 2021;see Langan et al., 2019;Viechtbauer et al., 2015). Note that only ML and REML estimators are implementable in multilevel models in the current main R packages for conducting meta-analyses, for example, metafor package (Viechtbauer, 2010). Together with the estimated variance components (e.g., σ 2 b and σ 2 w ) and sampling variance (i.e., v i ), the variance (and covariance if involving correlated v i ) matrix can be constructed and model coefficients can be estimated under the inverse-variance weighting scheme (Marin-Martinez and Sánchez-Meca, 2010). Finally, statistical inferences (e.g., statistical tests of model intercept β 0 and CIs construction) can be made based on null-hypothesis tests (e.g., Wald-type tests with standard normal Table 1 Comparison of different methods dealing with non-independence of effect sizes in terms of model coefficient estimates, confidence intervals (CIs), corresponding hypothesis tests and variance component estimates. ICC = intra-class correlation denoting the degree of dependence/correlation within clustering groups (i.e., study). ρ s = correlation of sampling errors (sampling level). RE = fit a random-effects meta-analytic model ignoring the dependency among effect sizes and treating them as if they were statistically independent (assuming ICC = 0). Average-FE = fit a random-effects model but after aggregating effect sizes and sampling variance (assuming sampling correlation ρ s = 0.5 in this example) for each study (assuming ICC = 1, namely, homogeneous effect sizes within studies). Sample-FE = fit a random-effects model but after randomly sampling a single effect size from each study. ML = fit a multilevel meta-analytic model directly modelling the dependency among effect sizes. ML-VCV = fit a multilevel meta-analytic model with a VCV matrix accounting for correlation ρ s (assumed to be 0.5) of sampling errors or observed effect size estimates (sampling level). ML-VCV-RVE = use robust variance estimation (RVE) to guide against model misspecification of ML-VCV. SE = standard error of the pooled SMD or clusterrobust standard error if applying RVE. LCI = 95% lower confidence interval ( distribution or t-distribution), likelihood ratio tests, resampling methods (e.g., a permutation test and bootstrapping) or cluster-robust inference (sandwich-type estimator; see Section 9.3). The use of methods based on t-distribution (with adjusted degrees of freedom), permutation test and cluster-robust inference are preferable in the case of a meta-analysis with a small number of studies (Joshi et al., 2022;Nakagawa et al., 2021c;Sánchez-Meca and Marín-Martínez, 2008;Viechtbauer et al., 2015). We illustrate how to implement multilevel models with recommended estimators (REML) and improved inference methods (t-distribution with adjusted degrees of freedoms) and interpret corresponding model results using rma.mv() function in metafor package in Section 11 (Supplementary Materials file 2; Viechtbauer, 2010).

Flexible random-effects structures
One of the multilevel model's advantages is the capacity to incorporate a flexible random-effects structure, and therefore, to account for various types of non-independence and heterogeneity due to different levels of clustering variables (see Section 7). For example, animal metaanalyses often encounter stratified data structure, by multiple strains or species (e.g., more than 10 antidepressants included in Kara et al., 2018; 13 strains of rodents included in Bird et al., 2016;8 species in Lagisz et al., 2020). In this regard, we can add a corresponding random-effects term, for example, strain-specific effect, to Eq. 2 to account for this: where u s[k] = a random-effects term corresponding to strain-specific effect for strain k (it also can be species-, drug-, dose-specific effectsdepending on which clustering variable is incorporated into the model), wherein Var[u s[k] ] = σ 2 s denotes the strain-specific variance. As a rule of thumb, a proper random-effects term needs at least five levels to make the estimation of the respective variance components feasible and stable (Bolker et al., 2009;Gomes, 2021). Therefore, the strain-specific effect is Fig. 6. Schematic illustrations of the fixedeffect model (A), random-effects model (B), and hierarchical structure of a multilevel model (C). Imagine we aim to assess the efficacy of a new antidementia drug. To properly estimate the generalizability of the efficacy, we employ a multilevel meta-analytic model to aggregate effect sizes derived from different animal models. In the sampling level, we derived mean, standard deviations, and sample size to calculate effect sizes ES [i] . In this level, the only deviation between the estimate of ES [i] and true effect is the sampling error effect e [i] . In the within-study/effect size level, effect sizes are not independent (see Fig. 1). The multilevel model uses a random-effects termeffect-size specific effect u w [i] preferable to the species-specific effect in practice (if there are fewer than 5 species includedin the case of biomedical studies usually only rats and mice). In a more complicated spectrum, we can further extend Eq. (3) to account for additional sources of non-independence by adding random-effects termsfor example, authorship dependence (labs, research groups), non-phylogenetic species similarities, phylogeny relatedness, and temporal and spatial correlations (Hadfield and Nakagawa, 2010;Moulin and Amaral, 2020;Nakagawa et al., 2019;Maire et al., 2019). We only found one paper that accounted for these complex sources of non-independence, for example, Lagisz et al. (2020) employed a phylogenetic multilevel meta-analytic model to address the phylogeny relatedness among 22 non-human species. Given the complexities, we do not elaborate on the methodological details of these complicated models (for interested researchers, see Chamberlain et al., 2012;Cinar et al., 2021a;Nakagawa and Santos, 2012).
Theoretically, any cluster or grouping variable can serve as a random-effects candidate (e.g., study identities, drug types, or species). However, the levels of one cluster variable might be strongly overlapping with another (e.g., study vs. animal cohort). Therefore, a practical problem to consider is to select the best random-effects structure when conducting a meta-analysis. The rationale of testing the randomeffects structure is to investigate whether the examined random-effects terms are (1) of neurobiologically interest, and (2) true sources of heterogeneity. For a random-effects term, we often use informationtheoretic approaches alongside likelihood methods to examine whether it is a suitable random-effects term that should be included in the meta-analytic model, such as Akaike Information Criterion (AIC) and likelihood ratio tests. It is worth noting that when comparing models with different candidate random-effects structures, the ML method should be used rather than REML, because the log-likelihoods ratio (the index of information criteria) is not estimable for models incorporated with different fixed-effects (Gurka, 2006). The methodological details of calculating AIC and log-likelihood ratio are not the focal interest in this paper (but see Cinar et al., 2021b for details). We provide an example showing how to use information-theoretic approaches to decide the best random-effects structure in Supplementary Materials file 2.

Multilevel version of I 2 statistic and variance components
It is common for an animal meta-analysis to combine studies with experimental designs with multiple species/strains, multiple outcomes, and multiple trials, each with multiple arms (Sandercock and Roberts, 2002;Hunniford et al., 2021). All these neurobiological and methodological differences are likely to lead to inconsistency among effect sizes. In the meta-analytic context, this 'inconsistency' is typically referred to as "heterogeneity" of the true underlying effects, and animal studies have a high amount of heterogeneity indeed (Kafkafi et al., 2018;Richter et al., 2009;Voelkl et al., 2020). As with the traditional meta-analyses (fixed-effect and random-effects meta-analyses), the multilevel model also can measure the amount of heterogeneity in the true underlying effects. In animal meta-analytic practice, as revealed by our survey, there are three widely used statistics for determining the amount of heterogeneity: I 2 (43%; Fig. 3), Cochran' Q (20%), and between-study variance τ 2 (19%). In general, Cochran' Q could be useful because it facilitates dichotomous decisions regarding whether the effect sizes are homogeneous (Cochran' Q is a test statistic for testing the null hypothesis of homogeneous effect sizes). It also can be used to assess the uncertainty of between-study variance on some occasions (but see Van Aert et al., 2019a). But it is not as informative as I 2 and τ 2 . Specifically, I 2 and τ 2 can measure the amount of heterogeneity among effect sizes (the former is the relative heterogeneity, and the latter is the absolute heterogeneity; Borenstein et al., 2017). Importantly, the multilevel meta-analytic can partition the two statistics across different levels corresponding to different random-effects terms. The random-effects terms in Eq. (2) indicate that total variance components can be decomposed into between-and within-study-specific variances in the multilevel model (σ 2 b , and σ 2 w , respectively; Fig. 7). The strain-specific variance σ 2 s also can be separated from the total variance if fitting Eq.
Applying a random-effects model to non-independent data leads to model misspecification because a random-effects model treats the between-study variance as the total variance in (τ 2 = τ 2 total ; Eq. 1), while a multilevel model treats between-study variance as one of the components of the total variance (τ 2 total ≥ σ 2 b in Eq. 2; Fig. 7). Therefore, the true total variance (σ 2 total = σ 2 b + σ 2 w ) is incorrectly attributed to the between-

Fig. 7.
A comparison of variance components of a random-effects model and a multilevel model. In a random-effects model, the total variance (σ 2 total ) can be partitioned into two components: "typical" sampling variance (σ 2 sampling ; see Eq. (7) for formula) and betweenstudy variance (τ 2 ). In the multilevel model, the σ 2 total can be decomposed into three components: σ 2 sampling , within-study variance (σ 2 w ) and between-study variance (σ 2 b ). Using a randomeffects model to fit non-independent data could misassign σ 2 w to σ 2 b (see Supplementary Materials file 2 for a real example). Importantly, a random-effects model can only use between-study moderator variables to explain the heterogeneity in true effect, while a multilevel model can use both within-and betweenstudy moderator variable to explain the heterogeneity in true effect. study variance τ 2 in a random-effects model (where τ 2 should be equal to σ 2 b rather than σ 2 total ; see Supplementary Materials file 2 for a real example). I 2 statistic is defined as the relative amount of variance between effect sizes after taking out sampling error effects (Higgins and Thompson, 2002). We formulate I 2 statistics in the context of a multilevel meta-analytic model as follows (Cheung, 2014;Nakagawa and Santos, 2012): within-study specific I 2 between-study specific I 2 total level I 2 where Var[ES i ] = σ 2 total +σ 2 sampling signifies the total variance of the observed/estimated effect size (i.e., ES i ); σ 2 total denotes the total variance in the true effects (true heterogeneity; σ 2 total = σ 2 b +σ 2 w in Eq. 2), which is caused by neurobiological-, and methodological relevant variability (and can be explained by corresponding moderator variables; see Fig. 7 and Section 8); σ 2 sampling = a "typical" sampling error variance, which is driven by the finite 'sampling' of the population; σ 2 sampling can be estimated using (independent) sampling variance v [i] (Higgins and Thompson, 2002; but see section 10.2 for non-independent v [i] ): where σ 2 sampling is also called a "typical" within-study variance v, since sampling variance is the only source of within-study variance in the framework of the random-effects model (Eq. 1: N = k). σ 2 sampling or v can be conceptually treated as a surrogate of the average value of sampling variances v [i] .
The multilevel versions of I 2 statistic have three merits: (1) intuitive (range from 0% to 100% enabling us to have a clear sense of the amount of heterogeneity in a given meta-analysis), (2) with commonly used guidelines (I 2 = 25, 50, 75% can be interpreted as low, moderate and high levels of heterogeneity; Higgins et al., 2003), and (3) interpretable (I 2 w , and I 2 b are the proportions of the effect size variation attributed to within-and between-study inconsistencies, respectively). We show the calculations of the multilevel version of I 2 index using i2_ml function in orchard package in Section 11 (Nakagawa et al., 2021b).

Prediction intervals
Our survey also found one useful heterogeneity index used in animal meta-analyses -prediction interval (PI; Fig. 3; Mancini et al., 2020). 95% PI is defined as the estimate of an interval (a plausible value range) where 95% of the future measurements (i.e., true effect sizes of new studies) would fall when no sampling errors exist van Aert et al., 2021). For example, assume an antidepressant with a mean SMD = − 0.4% and 95% PI = − 0.1 to − 0.7 -this means 95% of new trials using this antidepressant will decrease the manifestation of depressive behaviours by between 0.1 and 0.7 standard deviations over different experimental contexts. We note that PI is distinct from confidence interval (CI). A CI quantifies the precision of the mean effect size (i.e., β 0 ), which is dependent on the standard error (i.e., SE[β 0 ]): 95% , where t df;0.975 = 97.5th percentile of a Student's t-distribution with df degrees of freedom. In contrast, PI captures the dispersion of the mean effect size, which accommodates heterogeneity in the true underlying effects, namely, neurobiologically and methodologically relevant uncertainties (i.e., variance in the true effects σ 2 total ; Section 7.1): where the exact value of df of the Student's t-distribution is controversial; Some common approximations include df = k -1, k -2, or k -4, where k is the number of effect size estimates (a detailed discussion see Knapp and Hartung, 2003;Riley et al., 2011;van Aert et al., 2021;Viechtbauer, 2010). The metafor package uses k -1 as the default.
With the same scale as its mean effect size, the neurobiological interpretation of PI is straightforward. This merit makes it a good complementary statistic to the I 2 index since I 2 index provides no information on the absolute amount of heterogeneity among effect sizes (although it directly measures the percentage of heterogeneity due to 'true' neurobiological-relevant variation as opposed to chance (i.e., sampling variance) -Eq. 7; Borenstein et al., 2017). Specifically, for a given animal meta-analysis with small σ 2 sampling , even a tiny true effect heterogeneity σ 2 total can lead to a high value of I 2 total (see Eq. 6). However, this large I 2 does not necessarily mean a high amount of true heterogeneity of neurobiological-relevant variation in the effect sizes. In contrast, σ 2 total can directly reflect the genuine differences underlying the true effects because the square root of σ 2 total can be interpreted as the standard deviation of the true effect sizes. Moreover, in the framework of the multilevel model, σ 2 total also can be partitioned at different levels (e.g., within-and between-study levels:

Multilevel meta-regression to explain heterogeneity in effect sizes
When heterogeneity is detected (which is indicated by a large I 2 total and significantly meaningful σ 2 total ; see Supplementary Materials file 2), it is necessary to find the drivers of such variability and try to explain (at least part of) this heterogeneity using variables extracted from primary studies as explanatory variables or predictors (known as moderator variables in the meta-analytic context because they moderate the strength of the effect on effect size ES [i] ). There are three common drivers of heterogeneity: neurobiological, methodological and sociological (or meta-scientific) moderators (Nakagawa and Santos, 2012). The neurobiological and methodological moderators can be used to account for heterogeneity due to neurobiological processes (e.g., different doses of drug, or sex tested) and methodological differences (e.g., different drugs or dosages), respectively. The sociological moderators are the drivers of publication bias (see Section 8.2 for details). In other words, including moderator variable (s) in a multilevel model (Eq. 5) allows us to test moderator effectsexamining whether the effect size estimates systematically change in response to different levels of moderator variables (e.g., whether the overall efficacy of an antidepressant drug depends on sex). We found 89% (55/62) of animal meta-analyses accounted for heterogeneity using either meta-regression or subgroup analysis (Supplementary Materials file 1), among which 33 employed meta-regression.
The above multilevel meta-analytic model involving a moderator, in essence, is a meta-analytic regression model via multilevel (linear) mixed-effects models with both fixed effects and random effects, also well-known as (mixed-effects) meta-regression models. It can be expressed as the following mathematical notations: where β ′ 0 = an intercept (which is different from the overall mean, β 0 , in Eq. (2), and this has an important implication in certain circumstances; see Section 9.1.2); β 1 = a slope representing effect size changes for (oneunit increase in) x b [j] , a moderator corresponding to between-study characteristics (e.g., x b[j] = antidepressant types: fluoxetine vs. sertraline) or strain-specific characteristics (e.g., x s[k] = animal taxonomy: rats vs. mice). Eq. 9 is known as a three-level mixed-effects model. Importantly, Eq. (9) also enables us to examine the relationship between effect size changes and a moderator variable that varies within studies (e.g., within-study level moderator variable x w[i] = a series of doses; Figs. 5 and 7). Categorical moderator variables also can be incorporated into Eq. (9) using a dummy-coding strategy (Schielzeth, 2010). For example, animal sex can be dummy coded as D sex : 0 (male) and 1 (female). This is equivalent to subgroup analysis. In the framework of meta-regression, subgroup analysis is achieved by incorporating D sex into Eq. (9): where β male = an intercept indicates the estimate of the mean effect size of the subgroup male; β Δ = a slope equals to β female − β male . Eq. (11) is intuitive: when D sex = 0 (subgroup = male), we can obtain the estimate of β male ; when D sex = 1 (subgroup = female), we obtain the estimate of β female = β male + β Δ . Importantly, β Δ . enables us to examine the difference of a categorical moderator variable at different levels (e.g., the differences between females and males in responses to a given antidepressant; see Supplementary Materials file 2). Eq. 9 explicitly indicates that the strategies of aggregating or selecting effect sizes will prevent the meta-regression from providing information about effect-size level moderator variables (see Fig. 5 and Section 5). Importantly, Eq. (9) can further partition the total variance (i.e., σ 2 total ) into two parts: (1) the variance of fixed-effects, σ 2 f = Var [β 1 x b [j] ], which denotes the heterogeneity explained by the included moderator variable x between [j] (Fig. 7); and (2) the variance of randomeffects terms (σ 2 b , and σ 2 w ), which now becomes "residual" heterogeneity that is not explained by the associated moderator variable. The goodness-of-fit index R 2 is also applicable to quantify the percentage of variance explained by the included moderator variable (Aloe et al., 2010). A general and widespread measure of R 2 is the marginal R 2 (Nakagawa and Schielzeth, 2013), which can be calculated by: In contrast to I 2 total (Eq. 6), R 2 marginal does not contain the sampling error variance (σ 2 sampling ; Eq. 7) in the denominator because this variance component is assumed to be known before including moderator variables to explain the heterogeneity. The calculation of R 2 marginal is readily available, for example, using orchaRd packages (see Section 11).
Two points are worth noting here. First, various methods can be used to deal with covariates with missingness under the assumption of a random missing mechanism (e.g., data are missing completely at random and unrelated to any other variable), including simple deletion (filtering the incomplete cases prior to model fitting; embedded in metafor; Viechtbauer, 2010) and advanced imputation methods (i.e., full information maximum likelihood embedded in metaSEM; Cheung, 2015;Jak and Cheung, 2020). Second, a random meta-regression requires each of its moderators to have at least five studies (Hedges and Pigott, 2004). The minimal number of studies or effect sizes required by a multilevel meta-regression remains unknown, albeit some simulation studies suggest that the estimates of model coefficients of a multilevel meta-regression are generally stable under various simulated situations (Jamshidi et al., 2020;López-López et al., 2017).

Extended Egger's regression to test publication bias
Identifying publication bias is a crucial and mandatory procedure of a meta-analysis because the validity of meta-analytic evidence would be undermined if publication bias occurs (Augusteijn et al., 2019;Nakagawa et al., 2017;Van Aert et al., 2019b). The most common testable form of publication bias is the small-study effect where small studies (i. e., small sample size) often tend to report large effect sizes (Sterne et al., 2000). In our survey, we found that 86% of animal meta-analyses dealt with publication bias in their analyses in some way (Fig. 3). The most common method used to examine publication bias was: funnel plots (35%), (simple) regression-based methods (e.g., Egger's regression; 30%), and trim-and-fill tests (14%). However, all these procedures are not appropriate if effect sizes are statistically dependent or heterogeneous or both (Rodgers and Pustejovsky, 2021;Sterne et al., 2001a).
Recently, a multilevel version of Egger's regression has been proposed to tackle these limitations (Fernández-Castilla et al., 2021;Nakagawa et al., 2021a). Briefly, we need to add sampling error (se

as a moderator variable into a multilevel model (equivalent to set
x within [i] as se [i] in Eq. (9); a potentially better approach is to use 'effective sample size' to replace se [i] ; see Nakagawa et al., 2021a). Then, a statistically significant β 1 (i.e., the slope of se [i] ) means that studies with large se [i] (i.e., small sample size) have large effect size. This indicates that a small study effect exists in the meta-analytic dataset. Likewise, if including publication year as a moderator variable, Eq. (9) can detect another form of publication bias, the decline effect (i.e., time-lag bias, which is defined as the temporal instability of the magnitude of effect sizes), the implication of which is underappreciated (Grainger et al., 2020;Koricheva and Kulinskaya, 2019;Nakagawa et al., 2021a; see Supplementary Materials file 2).

Multi-moderator multilevel meta-regression
In practice, it is common to test one moderator variable at a time to explain the heterogeneity among effect sizes (i.e., Eq. 9; known as the single-moderator meta-regression model or univariable metaregression). Theoretically, multiple moderators (x w[i] and x b [j] ) can be examined simultaneously. This leads to a multi-moderator multilevel meta-regression (i.e., multivariable meta-regression). In contrast to the univariable meta-regression (i.e., Eq. 9), a multi-moderator multilevel meta-regression can provide more neurobiological and meta-scientific insights. For example, it enables us to ask (1) whether there exists an interactive effect between two moderator variables, and (2) what is the adjusted effect size of an animal meta-analysis after correcting for publication bias (Kvarven et al., 2020). Yet, the complexity of parameterization of such a meta-regression requires a large dataset to make optimization algorithms free of convergence issues (Bates et al., 2015;Cinar et al., 2021a). Given that (at least some) datasets in our surveyed animal meta-analyses are not small (k: medium = 56, range = = 5 -3288, mean = 203; Supplementary Materials file 1), it is feasible to introduce this more complex model to the field.

Investigating the effects of multiple moderators and interactions
For the sake of illustration, we use the simplest form of the multimoderator multilevel meta-regression model as an illustration: Eq. (12) builds upon Eq. (9) and contains two moderator variables, whose slopes β 1 and β 2 can be used to quantify the (average) effects of x w[i] (e.g., dose) and x b[j] (e.g., sex) on effect size changes, separately. In practice, the effect of x w[i] (e.g., dose) might be confounded by x b[j] (e.g., sex). For example, high doses of antidepressants are more likely to mitigate the depression symptoms of females, while antidepressants are less effective on males ( Fig. 8; Mauvais-Jarvis et al., 2020;Tannenbaum et al., 2019). This example requires us to control for the impact of x b[j] (e. g., sex) when quantifying the effect of x w[i] (e.g., dose) on effect sizes. To do so, an interaction term needs to be added to Eq. (12): where

x w[i] x b[j] = interaction between x w[i] and x b[j]
; β 3 = slope of the interaction term, which captures the magnitude of the interactive effect.
If the significance test of the model coefficients shows a statistically significant β 3 , we conclude that the two moderator variables can interact with each other. Eqs. (12 and 13) can be extended to a more general form: where the variable x m can be any moderator variable denoting withinand between-study level characteristics; β mod = the moderator variable x m 's slope, which is interpreted as the magnitude of the moderator effect for x m (e.g., the effect of antidepressants on depression symptoms); ∑ β mod x m = the sum of all moderator effects. Though Eq. (14) is not commonly used in the discipline (but see Vendl et al., 2021), it has versatile functionality. It can be used to predict the combined effects of two moderator variables, for example, examining how the effects of an antidepressant drug on females at a series of doses even if these doses have not been tested by empirical studies (e.g., conditional [marginal] estimates). Given a large enough number of effect sizes, Eq. 14 also can be used to construct a linear and non-linear relationship (e.g., quadratic, cubic polynomial, and spline) between a continuous moderator variable and effect size estimates ES [i] (Gasparrini et al., 2012;Orsini et al., 2012). It is worth noting that multi-moderator multilevel meta-regression models share limitations with other types of linear models: for example, they are susceptible to overfitting and multi-collinearity. For a detailed illustration of these complex applications, see Supplementary Materials file 2.

Correcting meta-analysis for publication bias
When replacing x w[i] and x b [j] by sampling error (se [i] ) and publication year of a paper (year [j] ), Eq. 13 enables us to test for small-study effect and time-lag bias simultaneously. More importantly, such a model can correct for publication bias in animal studies: where β ′ 0 is an intercept, which could serve as the bias-corrected overall effect. Imagine that if the meta-analytic data does not have publication bias (e.g., no small-study effect and time-lag bias exist), we are more likely to obtain the bias-corrected effect. Theoretically, se [i] = 0 and year [j] = 0 indicates that no small-study effect and time-lag bias occur. β ′ 0 is the estimate which is explicitly conditional on se [i] = 0 and year [j] = 0. There are two notable issues if our interest is to estimate a bias-corrected overall effect. First, we need to centre year j (i.e., c(year [j] ) at its mean value (set mean c(year [j] as 0), such that β ′ 0 is meaningful to be interpreted as a bias-corrected overall effect: Second, if β ′ 0 in Eq. (16) is significantly different from zero (i.e., a true effect), some researchers recommend replacing sampling error se [i] by its sampling variance se 2 [i] or v [i] Stanley et al., 2017): Eq. (17) assumes a quadratic association between sampling error (se [i] ) and effect sizes (ES [i] ) to avoid a downwardly biased estimate of the bias-corrected overall effect (β ′ 0 ). The combination of Eqs. (16 and 17) is a so-called two-step procedure (Stanley et al., 2017), which provides us with an unprecedented opportunity for testing and correcting for publication bias (see Supplementary Materials file 2 for implementation).
Given that high heterogeneity may invalidate publication bias test (Macaskill et al., 2001;Moreno et al., 2009;Sterne et al., 2001b), it is best to account for the potential heterogeneity when testing publication bias: The significance tests of β 1 and β 2 can be used to indicate the presence of small-study effect and time-lag bias, separately (see Section 8.2); ∑ β mod x m in Eq. (18) is used to accommodate the potential heterogeneity in the animal dataset. Moreover, we can use Eq. (18) to correct for publication bias for the moderator effects, for example, to estimate the efficiency of different antidepressant drugs on depression symptoms after adjusting for publication bias. Following the similar rationale of estimating the bias-corrected overall effect (see above), we can estimate the bias-corrected effect for any moderator variable x m by estimating β mod conditional upon se [i] = 0 (no small-study effect) and year [j] = 0 (no Fig. 8. A diagram showing a typical example of an interactive effect between two moderator variables in meta-analyses on animal data. (A) No interactive effect between the dose of an antidementia drug and animal sex. (B) Interactive effect between the dose of an antidementia drug and animal sex. When an interactive effect exists, the dose-response to an antidementia drug is dependent on animal sex (i.e., the effect of the dose is confounded by animal sex). The steep slope of the regression line for females indicates a strong dose-response relationship of the antidementia drug. The near-flat slope for males indicates that there is no dose-response relationship.

Multivariate models for modelling multiple outcomes simultaneously
Some behavioral and neurobiological studies measure more than a single outcome (e.g., two different endpoints: anxiety and depression within the same studies), therefore, resulting in more than one type of effect size or multivariate effect sizes. When computing multivariate effect sizes, at least two types of non-independence arise due to repeated use of the sample cohort of animals: correlations/covariances between different outcomes and sampling errors (scenarios 2 and 3 in Fig. 4). Multivariate meta-analytic models have been proposed to account for these dependent effect sizes by modelling the multiple types of outcomes simultaneously (also referred to as multi-response and multi-outcome meta-analytic models). Extending univariate meta-analytic models (Eq. 1) to multivariate versions is similar to extending ANOVA to MANOVA. In contrast to a (univariate) random-effects meta-analytic model (Eq. 1), a multivariate meta-analytic model allows both betweenstudy random effects (u i ) and sampling errors (e j ; see next section) to follow a multivariate normal distribution (i.e., u i ∼ N ( 0, Τ 2 ) and e i ∼ N (0, V)).
The principle is that two types of dependence can be directly captured by (co)variance structures of the multivariate models. Specifically, correlations (ρ [p] ) between different outcomes are defined in the variance-covariance matrix of the between-study variances Cov[u i ] = Τ 2 (which are distinct from the one-dimension between-study variances τ 2 in the univariate random-effects model, Eq. 1). Likewise, the correlations (ρ [s] ) between different sampling errors are defined in the variance-covariance matrix of the sampling errors Cov[e i ] = V (which are distinct from the one-dimension sampling variances in the randomeffects model; details see next section). Assume a simple example of Τ 2 involving three outcomes: where τ 2 i = between-study variance for outcomes i; ρ [p]21 = ρ [p]12 = population level correlation (also known as between-study correlation; Gasparrini et al., 2012;Jackson et al., 2011), which is the correlation between the first and second population outcomes; covariance of the first and second outcomes Cov[τ 2 1 , τ 2 2 ] = ρ [p]21 τ 2 τ 1 = ρ [p]21 τ 2 τ 1 . The univariate I 2 statistic based on the proportion of between-study variance (τ 2 ) in the total variance (τ 2 + v) can be easily extended to multivariate I 2 statistics (Higgins and Thompson, 2002). An alternative definition of I 2 is based on the variance-covariance matrix of the model coefficients under the multivariate random-effects and fixed-effect models (Jackson et al., 2012). The matrix Τ 2 , which carries information about the extent to which two pairwise outcomes are correlated, can be used to account for dependent effect sizes while also improving the accuracy of model estimates and enabling the investigation of new questions, such as whether there is a strong correlation between two outcomes at the population level. Multivariate models can also leverage the "borrowing of strength" among correlated outcomes, allowing the estimation of missing effect sizes feasible even when some outcomes are only partially reported in some studies. Nonetheless, three points should be noted here. First, there might be only a few studies reporting complete paired outcomes. As such, no information (from the paired effect sizes from the same study) will be borrowed to increase the precision of model estimates (Jackson et al., 2011). Second, when the included studies have a large number of outcomes, the multivariate models are highly parameterized and likely to be overparameterized. In this regard, the advantages of multivariate models might be compromised by the increasing the number of parameters that need to be estimated (Boca et al., 2017). For example, 15 neural-behavioral traits in Yang et al. (2021) need to estimate 105 correlations ρ [p] . To mitigate the estimation and computational challenges posed by high dimensional parameters, some researchers proposed to enforce a simplified structure on the between-study covariance matrix Τ 2 , such as a compound-symmetry or a diagonal structure (Gasparrini and Armstrong, 2011;Gasparrini et al., 2012;Ritz et al., 2008). Third, it is challenging to construct a sampling variance-covariance matrix V due to the lack of sampling covariances (but see next section for solutions).

Within-study variance-covariance matrix for accounting for correlated sampling errors
Typically, sampling errors are assumed to be independent in the framework of the multilevel model outlined early. However, as mentioned above, repeated use of the same cohort animal or "overlapping" animals (when calculating effect sizes) leads to dependency among sampling errors (see correlated sampling errors due to shared animal and control in Fig. 4). Indeed, the nested random-effects structure in a multilevel model fails to capture this type of non-independence (i.e., mis-specified variance structure). As with the multivariate models, constructing a variance-covariance matrix for sampling errors (withinstudy VCV matrix) is the most straightforward way to account for correlated sampling errors. Theoretically, a multilevel model with a within-study VCV matrix can capture all types of dependence structure of effect sizes, wherein an appropriate specification of random-effects structure (u b[j] and u w [i] ) can account for correlations/covariances in the true effects and a within-study VCV matrix can account for the correlations/covariances in the sampling errors. The sampling error effect (e i ) follows a multivariate normal distribution with mean zero and variances of V (e i ∼ N (0, V)). A simple example V having two studies with three effect sizes (where the first study contributes two effect sizes) can be expressed as: where se i = the sampling error (i.e., standard error) of effect size i (square root of sampling variance v i ); ρ [s]12 = ρ [s]21 = sampling level correlation (also known as within-study correlation or sampling correlation), which is the correlation between the first two observed effect size estimates, or more precisely se 1 and se 2 ; the sampling covariance Yet, 63% of surveyed animal meta-analyses ignored correlated sampling errors in their analyses (but see Lagisz et al., 2020; Supplementary Materials file 1), where a within-study VCV matrix is incorrectly treated as a diagonal matrix with the sampling variance se 2 i (or v i ) along the diagonal and zero along the off-diagonal (i.e., Cov[v i , v k ] = 0, i ∕ = k; model misspecification). The challenge of constructing a within-study VCV matrix is that ρ [s] or individual data used to compute ρ [s] are rarely reported in the primary studies. We propose three approaches to construct a within-study VCV matrix to account for correlated sampling errors: (1) "empirical sampling correlation", (2) "general sampling correlation", and (3) "partially empirical sampling correlation" solutions. First, for studies reporting ρ [s] or sampling covariances, we can directly extract them (contacting authors if missing). For studies involving multiple-treatment and multiple-outcome (Fig. 4), we can use specific formula based on summary statistics to compute (asymptotic) sampling covariances for common effect size statistics ("empirical sampling correlation" solution; see Gleser and Olkin, 2009;Lajeunesse, 2011 for formula). Second, for studies involving other types of correlated sampling errors (e.g., repeated measurements of the same outcome), we can use a "general" formula to compute sampling covariances, that is, "guesstimate" a constant sampling correlation ρ [s] to calculate sampling covariances (e.g., ρ [s] = 0.5 or 0.8; "general sampling correlation" solution; Fisher and Tipton, 2015;Noble et al., 2017;Pustejovsky and Tipton, 2022). The constant ρ [s] can be rough, arbitrary, or educated guess. The validity of the guessed ρ [s] can be ensured by robust variance estimation (see next section) or a sensitivity analysis, through which the robustness of the model parameter estimates (e.g., β 0 ) to the choice of ρ [s] values would be tested. Third, use the "empirical sampling correlation" solution where possible, and supplement the "general sampling correlation" solution to take care of the remaining covariances ("partially empirical sampling correlation" solutions; Pustejovsky and Tipton, 2022).

Robust variance estimation for counteracting model misspecification
Our survey found that two papers used the robust variance estimation, RVE, methods to deal with statistical non-independence (Supplementary Materials file 1; Shields et al., 2015;Zajitschek et al., 2020). The superiority of RVE methods is that they can handle statistical non-independence even without knowing the exact dependence structure of effect sizes (Hedges et al., 2010;Tipton, 2013). RVE methods estimate sampling covariances from the data if the meta-analysis includes a larger number of studies (Hedges et al., 2010). Subsequently, standard errors of model coefficients are adjusted (robust errors) to avoid inflated Type I error rates and incorrect hypothesis tests (e.g., p-value). Existing R packages can easily implement RVE methods (Fisher and Tipton, 2015;Pustejovsky and Tipton, 2018). If our primary interest is in parameter estimations and hypothesis tests of model coefficients, then RVE method is an appealing way to deal with non-independence among effect sizes.
While the RVE method has its benefits, it may not be as effective when employed as a standalone rather than a complementary approach to multilevel or multivariate models. For example, RVE cannot decompose variance components into between-and within-study variances as the multilevel models do. Nor can RVE allow variance components to vary for different outcomes within studies as the multivariate models do. Therefore, it is promising to combine the multilevel or multivariate models with RVE (Nakagawa et al., 2021c;Pustejovsky and Tipton, 2022) where the multilevel or multivariate models can provide new insight regarding heterogeneity-or variance-related parameters while RVE can properly deal with all types of non-independence (especially mis-specified VCV matrix with a guesstimate within-study correlation). Fortunately, this hybrid strategy can be implemented by the combination of metafor and clubSandwich packages (see Supplementary Materials file 2). It should be noted that when the number of studies is small and the moderators are imbalanced, RVE cannot perform nominally (Tanner-Smith et al., 2016;Tipton, 2015). As such, applying RVE to a small-sample size meta-analyses might provide a biased estimate of the sampling variance and covariances matrix, leading to inaccurate robust errors and inflated Type I error rates. Several small sample-size adjustment methods can be used to address these issues (Pustejovsky and Tipton, 2018;Tipton and Pustejovsky, 2015;Welz et al., 2023). Recently, the robust-wild-bootstrapping method has been introduced to control Type I error rates while improving the statistical power of hypothesis tests in RVE (Joshi et al., 2022).

Recommended practices for animal multilevel meta-analysis
Given the data-generating processes and mechanisms in the field (e. g., a single construct of interest and nested effect sizes), we outline our practice recommendations for conducting meta-analysis using animal data. We strongly recommend using the multilevel model as the framework for conducting animal meta-analyses. Fitting multilevel models is feasible and conceptually simple (see Supplementary file 2). As such, it provides a good starting point for researchers in this field to properly account for non-independence, while producing reliable parameter estimates and hypothesis tests (as demonstrated in an example in Section 11).
More specifically, we recommend: (I) Using the (three-level) multilevel models (e.g., Eq. 2) by default, rather than fixed-effect and random-effects models (conventional practices; Eq. 1), with the option to add additional levels for random-effects when necessary (e.g., when analysing data from multiple species, as described in Section 6).
(II) Performing all necessary meta-analytic procedures within the context of the multilevel model, such as using meta-regression to explain heterogeneity or conducting subgroup analyses and testing publication bias (Section 7).
While more technically challenging, we still recommend researchers to account for additional non-independence arising from correlated sampling errors if the dataset includes "shared control" and "shared animals" (i.e., the same animal cohort is used repeatedly to compute effect sizes): This can be achieved through the following steps: (I) Using a multilevel model with a within-study VCV matrix that reflects the extent to which the sampling errors derived from the same animal are correlated (Section 9.3). Such a multilevel model with a within-study VCV matrix is technically a special case of a multivariate meta-analysis model without moderators for different outcomes, but the inclusion of random effects at multiple levels can provide additional insights into heterogeneity among effect sizes (Section 11).
(II) Using RVE methods to defend against the (potential) model misspecification if the sampling correlation ρ [s] is not based on empirical data but assumed to construct the within-study VCV matrix (Section 9.4).
In the case where the dataset involves a multivariate structure (e.g., more than one type of outcome reported in many studies in a metaanalytic dataset) and multiple outcomes are of primary interest, we suggest the following procedures: (I) Employing the multivariate meta-analysis models, which allow for the explicit consideration of non-independence due to both effect sizes and sampling errors through the use of between-and within-study variance-covariance matrices (Section 9.2).
(II) Using RVE methods as necessary to guard against the (potential) model misspecification.
The dataset with more complexity, including nested and correlated structures, may necessitate the use of more sophisticated approaches. A potential solution is to combine the multilevel and multivariate models, referred to as 'multivariate multilevel meta-analytic models', along with RVE methods. A simpler version of this combined approach is the multilevel model with a within-study VCV matrix, as outlined above. Further investigation, through simulation studies, is necessary to determine the empirical performance of this proposed methodology.

Worked example
In this section, we briefly compare the results of our re-analyses of a real dataset, following our recommended practices with the results based on conventional practices. Our re-analyses explicitly illustrate how failure to account for non-independence using traditional metaanalytic techniques might lead to the incorrect inference, underestimated standard errors and distorted hypothesis testing, and ultimately result in spurious conclusions. This worked example comes from one of our surveyed papers - Ramsteijn et al. (2020), on the relationship between the use of selective serotonin re-uptake inhibitor (SSRI) in animals during pregnancy and their offspring's neurobehavioural phenotype. Effect sizes in Ramsteijn et al. (2020)'s, dataset were statistically dependent. For example, within a given study included in Ramsteijn et al. (2020), male and female animals were compared separately. Likewise, animals were exposed to different types of SSRI antidepressants, the authors analysed these comparisons (within the same study) as if they were independent studies. Using the random-effects model to fit these non-independent effect sizes runs the risk of getting spurious results. We randomly selected a subset from Ramsteijn et al. (2020) and we dealt with the non-independence (as shown above) using the advanced methods with the practices outlined in Section 10 and the traditional methods outlined in Section 5. It should be noted that these re-analyses are only for illustrative purposes. Readers interested in the biological questions should refer to the original paper.
For the selected subset, Ramsteijn et al. (2020) used a random-effects model (implicitly ignoring non-independence) and found that the use of SSRI in animals during pregnancy significantly decreased offspring's sensory processing function (pooled SMD = − 0.37, 95CI% = [− 0.69 to − 0.06], p-value < 0.05). Respective to our re-analysis procedures, first we conducted a random-effects analysis via rma() function in metafor package to reproduce their original analysis ("RE" in Table 1). Second, we (inappropriately) assumed effects within studies are homogeneous and computed an "averaged" effect size and sampling variance for each study (assuming sampling correlation ρ s = 0.5) and used them for subsequent random-effects models ("Average"). Third, we randomly selected one effect size from each study and used it for the subsequent random-effects model ("Sample"). Fourth, we used a multilevel meta-analytic model specifying between-study and within-study random effects to directly model the dependency among effect sizes ("ML"). Sixth, building upon "ML" method, we accounted for the (potential) correlated sampling errors by approximating a VCV matrix with a constant ρ s = 0.5 ("ML-VCV"). Seventh, we used the RVE to defend the model misspecification (e.g., the arbitrarily assumed ρ s = 0.5) and make robust model inferences ("ML-VCV-RVE"). It should be noted that because we did not find any dataset suitable for multivariate models in our survey (e.g., meta-analyses with multiple outcomes within studies), Table 1 does not report results corresponding to multivariate models (but see Supplementary Material 3 for an illustration with a fictitious dataset).
Re-analyses based on the multilevel model ("ML") found that the overall effect of SSRIs exposure has a similar magnitude compared to that obtained from the random-effects model ("RE"), but the overall effect was not statistically significant (SMD = − 0.39, 95CI% = [− 0.80 to 0.03], p-value = 0.06). This clearly demonstrates that the standard error has been underestimated and the subsequent hypothesis tests (p-value) were incorrect. Likewise, the between-study variances (σ 2 b ) derived from RE model (conventional practice) also have been overestimated. For example, the value of σ 2 b in the multilevel model was almost half of that in the random-effects model. Using the random-effects model to fit this dataset might lead to a wrong conclusion that the study level has a high amount of heterogeneity (I 2 b = 69%). However, the study level only explains 27% of the total heterogeneity. The remaining 49% of the total heterogeneity is due to the within-study level variability. The results of the multilevel model with a VCV matrix are very similar to those of the multilevel model without a VCV matrix ("ML" vs. "ML-VCV"). The results of the multilevel model are also robust after defending against model misspecification ("ML-VCV" vs. "ML-VCV-RVE"). The multilevel model can quantify the degrees of dependency, that is, ICC (correlation of effect sizes within the same study) suggests that the true underlying effects are weakly correlated with each (0.136).

A hands-on tutorial of the advanced meta-analytic techniques
We provide an easy-to-implement tutorial to help researchers apply these sophisticated techniques outlined above. Broadly, our tutorial contains two parts. In Part I, we use the dataset of the above worked example to illustrate meta-analysis in the framework of the multilevel model (standard practices; Sections 6 to 8; see also Assink and Wibbelink, 2016). We recommend every researcher employ these procedures in an animal meta-analysis, such that potentially misleading conclusions can be avoided. In Part II, we use a more complex dataset to show the implementation of the extended methods (recommended practices) outlined in Section 9. This dataset comes from our published neuroscience meta-analysis , which examined cognition bias across 22 animal species using 71 studies with 459 effect sizes. Given that R language and environment are the most widely used software in animal meta-analyses (Supplementary Materials file 1), we use R code to demonstrate the implementations of these advanced methods. The dataset and R code are freely accessible at an archived repository (See Open Science section). We use R markdown to annotate R code, which allows researchers to easily understand and reproduce our examples, and also easily modify the sample R code to suit their own analyses. The sample R code is based on existing R packages, for example, metafor, orchaRd and clubSandwich. We provide guidance to show the R syntax of these packages implementing the advanced methods outlined in Sections 6-9. The complete R coding-based tutorial can be found at Supplementary Materials file 2.

Conclusions and future perspectives
We have profiled the meta-analytic practice in the field of neurobiology and behavioral research on animal models. Researchers in this field mainly rely on traditional meta-analytic techniques (i.e., fixedeffect and random-effects models) for quantitative evidence synthesis. However, the traditional meta-analytic techniques are very limited in handling complex animal datasets (e.g., hierarchical/correlated data structure), which are more prone to statistical issues (e.g., nonindependent effect sizes and errors). Researchers should go beyond traditional meta-analytic techniques, embracing the multilevel model and other advanced methods (e.g., multivariate models and robust variance estimation). Currently, these advanced methods are rarely used in animal meta-analyses. We have illustrated the concepts, rationale, examples, and implementations of these advanced methods. We expect their applications to continue to increase in future quantitative evidence synthesis in animal studies, delivering more robust/reliable model estimates and new neurobiological insights.
Furthermore, these advanced methods can be further extended to more sophisticated meta-analytic techniques. For example, A network meta-analysis can rank the effectiveness of multiple treatments by incorporating indirect evidence across separate animal studies (for example, using evidence of T A vs. T C and T B vs. T C to infer T A vs. T B ; Riley et al., 2017). We can employ a meta-analytic mediation (causal) model, path analysis or structural equation modelling to identify how a focal variable mediates the response variable of interest (Cheung, 2022;Shadish and Sweeney, 1991). We can also take advantage of an individual participant data (IPD) meta-analysis to circumvent Simpson's paradox (i.e., "aggregation" bias or ecological fallacy) and test hypotheses at the individual animal level rather than the study level (Kaufmann et al., 2016;Riley et al., 2010;van Aert, 2022). Besides meta-analysing data across studies, we can also conduct a meta-analysis within a single study to enhance statistical power (i.e., internal meta-analysis; Goh et al., 2016;Nakagawa and Santos, 2012) and across different meta-analyses to ask high-order questions (i.e., second-order-meta-analysis; Fanelli et al., 2017;Nakagawa et al., 2019). Importantly, researchers in animal meta-analyses should review methodological developments and applications of meta-analytic techniques in other fields. In this regard, researchers can harness more appropriate and powerful meta-analytic techniques to gain not only new neurobiological insights, but methodological and meta-scientific insights. Ultimately, the use of these advancing meta-analyses can lead to better animal-to-human translation of this new knowledge.