Risk of bias assessment in preclinical literature using natural language processing

Abstract We sought to apply natural language processing to the task of automatic risk of bias assessment in preclinical literature, which could speed the process of systematic review, provide information to guide research improvement activity, and support translation from preclinical to clinical research. We use 7840 full-text publications describing animal experiments with yes/no annotations for five risk of bias items. We implement a series of models including baselines (support vector machine, logistic regression, random forest), neural models (convolutional neural network, recurrent neural network with attention, hierarchical attention network) and models using BERT with two strategies (document chunk pooling and sentence extraction). We tune hyperparameters to obtain the highest F1 scores for each risk of bias item on the validation set and compare evaluation results on the test set to our previous regular expression approach. The F1 scores of the best models on the test set are 82.0% for random allocation, 81.6% for blinded assessment of outcome, 82.6% for conflict of interests, 91.4% for compliance with animal welfare regulations and 46.6% for reporting of animals excluded from analysis. Our models significantly outperform regular expressions for four risk of bias items. For random allocation, blinded assessment of outcome, conflict of interests and animal exclusions, neural models achieve good performance; for animal welfare regulations, a BERT model with a sentence extraction strategy works better. Convolutional neural networks are the overall best models. The tool is publicly available and may contribute to future monitoring of risk of bias reporting for research improvement activities.

• Animal intervention studies differ from clinical trials, and automated tools for risk of bias assessment in preclinical literature remain to be developed.
• We present and implement a series of natural language processing models for the classification of reporting of five preclinical risk of bias items.
• The open-source tool provides the possibility for future research improvement activities.

| BACKGROUND
Systematic review is a type of literature review that attempts to collate all empirical evidence relevant to a pre-specified research question. It uses explicit and systematic methods to minimise bias and provide more reliable findings than narrative review. 1 After the collection of research publications which meet pre-specified inclusion criteria, a critical step is the assessment of the reporting of strategies designed to reduce risks of bias in the included publications, which is central to judging the reliability of the research findings. 2 Risk of bias assessment is usually performed separately by two independent investigators, working with an adjudicator to resolve any disagreements. This is both time-consuming and prone to error. As the number of publications describing experimental studies increases rapidly, it has become increasingly difficult for researchers to keep up to date with progress in their field, and the findings of systematic reviews are weakened. Automation tools would therefore accelerate this process and increase reliability. Such tools would also be useful in evaluating the impact of measures designed to improve the quality and completeness of research reporting, for instance in the NPG Quality in Publication (NPQIP) study, 3 the Intervention to Improve Compliance with the ARRIVE guidelines (IICARus) study, 4 in future evaluation of reporting standards such as the Materials-Design-Analysis-Reporting Minimum Standards Framework 5 and in measuring the impact of institutional research improvement activities. 6 Systematic reviewers have advocated the use of automated approaches to assist risk of bias assessment, using human effort and machine automation in mutually reinforcing ways. 7 The development of machine learning and natural language processing (NLP), including neural models and transfer learning, provides opportunities to create robust tools for risk of bias assessment.
For clinical trials, RobotReviewer trains support vector machines on 6610 full texts with pseudo labels derived from 1400 unique strings of bias domains from the Cochrane Database of Systematic Reviews, achieving overall accuracy of around 71%. 8 Zhang et al. consider the supporting sentence annotations of bias domains as 'rationales' and use them to train convolutional neural networks (CNNs), 9 which improves performance by 5% compared to baseline models. 10 Millard et al. apply logistic regression to 1467 full-text clinical reports for sentence and document classification separately and achieve an area under the ROC curve greater than 72% for randomisation sequence generation, allocation concealment and blinding. 11 Menke et al. have reported the performance of a proprietary tool, SciScore, 12 which trains conditional random fields 13 on 250 research articles with manually labelled entity mentions for random allocation and blinding. The training corpus was randomly selected from PubMed Open Access articles, and the proportion of clinical versus preclinical publications is not clear.
Compared with clinical trials, animal studies are conducted in relatively small teams, are reported in a different style, have been shown to have lower reporting of strategies to reduce risks of bias 14 and are susceptible to different risks of bias. 15 Hence, separate tools for risk of bias assessment in preclinical literature are necessary. Bahor et al. have previously reported the use of regular expressions with rule-based string matching to recognise phrases related to risk of bias reporting in experimental animal studies, which requires many hand-crafted term selections. 16 NLP approaches may achieve more robust results in the preclinical literature than such non-learning algorithms.
Several reporting standards relevant to the design, conduct, analysis and reporting of animal studies have been suggested, including the ARRIVE guidelines 17 and the Materials-Design-Analysis-Reporting Minimum Standards Framework, 5 and these each contain multiple domains relating to potential risk of bias. In 2012, a stakeholder group convened by the U.S. National Institute of Neurological Disorders and Stroke prioritised the importance of reporting randomisation and blinding, sample size estimation and data handling (including the reporting of data excluded from analysis). A 2020 systematic review identified 60 publications containing 58 recommendations, the most frequently recommended being sample size calculation, blinded assessment of outcome, choice of statistical methods and randomised allocation to treatment group. 18 In systematic reviews conducted by our group (see 'Dataset' below for description), the prevalence of reporting of allocation concealment and of sample size calculations is so low that we do not think there are sufficient positive instances to provide adequate training, and we believe that a judgement of the appropriateness of the statistical methods chosen is highly subjective. To the three remaining risks of bias (blinding, randomisation and reporting of data exclusions), we add two further items. We know that the reporting of conflicts of interests is substantially higher than of other risks of bias, and we want to test the performance of NLP models across a range of reporting prevalences; and regulatory agencies and others are often concerned with whether studies report compliance with animal welfare regulations, so we include this item.

| MATERIALS AND METHODS
We consider risk of bias assessment a typical text classification task. A classification model cannot be trained from plain text directly, so we need to convert text information to analysable data. The core concept is to map each document to a matrix consisting of fixed-dimension word vectors or embeddings, 19 then train a classification model to map these numeric text representations to a binary risk of bias label (yes/no). For representation methods, we explore bag-of-words, word2vec, 20 doc2vec 21 and embeddings from BERT. 22 For classification models, we implement baseline models (support vector machine, logistic regression, random forest), neural models (CNN, recurrent neural network (RNN) with attention, hierarchical attention network) and BERT models using two strategies, which are described in greater detail below. The different approaches are summarised in Figure 1, and training details are given in Supporting Information.

| Dataset
We use a collection of full-text publications which our group has annotated for risk of bias 23 from two sources. Firstly, we use in-house data from systematic reviews in (1) psychotic disorders 24 (2386 publications); (2) chemotherapy-induced peripheral neuropathy 25 (1602 publications); and (3) several individually smaller systematic reviews in animal models of stroke, depression, hypertension, myocardial infarction and pain 22 (2439 publications). Secondly, we have collected data in the context of observational (NPQIP, 3 751 publications) and experimental (IICARus, 4 662 publications) studies of interventions to improve the reporting of in vivo research. For the psychosis dataset, each manuscript had been evaluated by a single trained human reviewer and a regular expression approach, 16 with disagreements reconciled by a second independent human reviewer. For all other datasets, each manuscript had undergone risk of bias annotation from two trained reviewers working independently, with differences reconciled by a third reviewer.
The risk of bias labels are at the document level (1 for reported, 0 for not reported) and we consider five risk of bias domains: (1) Random Allocation: animals are randomly allocated to treatment or control groups; (2) Blinded Assessment of Outcome: group identity is concealed from the scientist measuring the outcome; (3) Compliance with Animal Welfare Regulations: researchers report that they complied with relevant animal welfare regulations; (4) Conflict of Interests: authors report any relationship which might be perceived to introduce a potential conflict of interests, or the absence of such a relationship; and (5) Animal Exclusions: a statement of whether or not all animals, all data and all outcomes measured are accounted for and presented in the final analysis. The prevalence of reporting of each of these items, and some example sentences indicating the reporting for each risk of bias item are displayed in Table 1.
Publications were available in PDF format and we converted them to plain text using Xpdf (https://www.xpdfreader.com). We converted all text to lower case and used regular expressions to remove references, citations, URLs, digits, non-ASCII characters and text which precedes the 'Introduction' section, because they are irrelevant to the risk of bias reporting. We used Stanford CoreNLP 26 for word and sentence tokenization. After removing invalid records (for instance where text conversion failed), 7840 full-text publications had annotations for random allocation, blinded assessment of outcome and animal exclusions, and 7089 had annotations for conflict of interests and animal welfare regulations. We combined publications from different source projects and randomly allocated them to training (80%), validation (10%) and test (10%) sets. Summary statistics of the dataset are shown in Table 2.
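The cleaning steps above can be sketched in a few lines. This is a minimal illustration only; the regular expressions and example text below are ours and do not reproduce the exact patterns of the published pipeline (reference and citation removal, for instance, is more involved):

```python
import re

def clean_fulltext(text: str) -> str:
    """Sketch of the cleaning steps: lower-case the text, keep only the body
    from 'Introduction' onwards, and strip URLs, digits and non-ASCII
    characters before collapsing whitespace."""
    text = text.lower()
    idx = text.find("introduction")
    if idx != -1:
        text = text[idx:]                          # drop front matter
    text = re.sub(r"https?://\S+", " ", text)      # URLs
    text = re.sub(r"\d+", " ", text)               # digits
    text = text.encode("ascii", errors="ignore").decode()  # non-ASCII
    return re.sub(r"\s+", " ", text).strip()

raw = "Title Page 2021\nIntroduction\nWe used ± 25 rats; see https://example.org."
print(clean_fulltext(raw))  # -> "introduction we used rats; see"
```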

| Baselines
We explore three text representation methods in baseline models: (1) bag-of-words, (2) word2vec and (3) doc2vec. Bag-of-words (bow) uses a word's frequency within the document to represent its importance. To discount less informative words with high frequency, such as 'the' and 'a', TF-IDF (term frequency-inverse document frequency) weighting is applied, which normalises the word frequency in a document by multiplying it by a log-scale of the inverse of the frequency of documents in which the word occurs. 27 Word2vec is a neural language model which learns to map words to continuous vectors. It can preserve the semantic relationships among words, and embeddings can either be learned jointly within the classification model or fine-tuned from pre-trained word vectors from other language tasks. As the preclinical literature belongs to the biomedical domain, we use the 200-dimensional word vectors induced on a combination of PubMed and PubMed Central texts with texts extracted from a recent English Wikipedia dump, using the skip-gram model with a window size of 5. 28 Doc2vec is an unsupervised method which learns to represent a document by a dense vector. There are two approaches for training the dense vector, Distributed Memory and Distributed Bag-of-Words, which are suggested to yield better performance when used together. 21 We explore three baseline classifiers: support vector machine, logistic regression and random forest. Support vector machine and logistic regression are linear classifiers, trained to map the word embeddings to the target risk of bias label by minimising a hinge loss function and a log loss function respectively. 29 Random forest is an ensemble-based non-parametric method which combines a number of decision trees trained on various subsamples. 30
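The TF-IDF weighting described above can be illustrated with a minimal sketch. This uses one common variant of the formula on a toy corpus; library implementations add smoothing and length normalisation:

```python
import math
from collections import Counter

def tfidf(docs):
    """For each document, weight each term's within-document frequency by
    the log of the inverse fraction of documents containing that term."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))                      # document frequency
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        weighted.append({t: (c / len(doc)) * math.log(n / df[t])
                         for t, c in tf.items()})
    return weighted

docs = [["rats", "were", "randomly", "allocated"],
        ["the", "rats", "were", "blinded"],
        ["the", "protocol", "was", "approved"]]
w = tfidf(docs)
# 'randomly' (in 1 of 3 documents) outweighs 'rats' (in 2 of 3 documents)
```

Note that with this plain formulation a term occurring in every document receives weight zero, which is exactly the intended discounting of uninformative words.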

| Neural models
We explore three neural models: CNN, a powerful model for text classification; 9 RNN, which is good at modelling sequential text data; 31 and the hierarchical attention network (HAN), 32 which takes the hierarchical structure among words, sentences and documents into consideration. The critical elements of the model architectures are described below and shown in Figures S1-S4.

| Convolutional Neural Network
We use the classic one-layer CNN 9 for document classification. The main characteristic of the CNN is the convolutional layer, where multiple filter windows (2D matrices) with different sizes are applied to filter out information. Let x[i:j] denote the matrix extracted from row i to row j of the document matrix. For one document matrix x ∈ R^(s×d) and one filter f ∈ R^(h×d) (where s is the document length, d is the embedding dimension and h is the filter size), the convolutional layer sequentially extracts a submatrix which has the same dimensions as filter f and takes the sum of the element-wise product between x[i:i+h−1] and f. This generates a summarised feature vector w ∈ R^(s−h+1) of the document matrix x by filter f with filter size h. For each filter size h, multiple filters are used to capture different features.
The output vectors from the convolutional layer are then passed through an activation function such as ReLU to add non-linearity, and a pooling layer, which extracts the maximum value of each vector. A dropout layer, which randomly sets some values in the vectors to zero, is applied to prevent over-fitting. A final linear transformation is applied to map the vector concatenated from the pooling layer into two numeric values, representing separately whether or not the document reported the risk of bias item.
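The convolution, activation and max-pooling steps can be traced on a toy document matrix. This is a hand-rolled sketch of the arithmetic only; the actual models are trained with a deep learning framework:

```python
def conv_feature(x, f):
    """Slide filter f (h x d) over document matrix x (s x d): each output
    value is the sum of the element-wise product of f with the submatrix
    x[i : i + h - 1], giving a feature vector of length s - h + 1."""
    s, h, d = len(x), len(f), len(x[0])
    return [sum(f[a][b] * x[i + a][b] for a in range(h) for b in range(d))
            for i in range(s - h + 1)]

def relu_max_pool(w):
    """ReLU activation followed by max-over-time pooling."""
    return max(max(v, 0.0) for v in w)

# Toy document of s=4 words with d=2-dimensional embeddings, one filter of size h=2
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
f = [[1.0, -1.0], [0.5, 0.5]]
w = conv_feature(x, f)        # feature vector of length 4 - 2 + 1 = 3
pooled = relu_max_pool(w)     # one summary value per filter
```

In the full model, the pooled values from many filters are concatenated before the dropout layer and the final linear transformation.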

| RNN with attention
A recurrent neural network is a type of neural network which builds connections over time steps. 33 In the hidden layer, by combining the weighted hidden representations from the previous word and the next word (if applied bidirectionally) through a Tanh operation, a basic recurrent neural structure can retain information in the text from both directions. RNNs can handle texts of any length, but if the sequence is very long it is difficult to carry information from early steps through to later steps because of the exploding or vanishing gradient problem. 34 Two variants of RNN, long short-term memory (LSTM) 31 and the gated recurrent unit (GRU), 35 are designed to solve this long-term dependency problem, using multiple gates (forget, input and output gates in LSTM; reset, update and output gates in GRU) for each word embedding to control which information flows straight through, and which is forgotten, stored or updated at the next step.
In the general recurrent structure, the output from the hidden layer is obtained either by simply taking the hidden state of the last recurrent cell, which loses some information from the other recurrent cells, or by averaging the hidden states of all recurrent cells, which treats words at different positions equally. However, the same word may play a different role in the classification decision when it occurs in different sentences or contexts. A global context matrix (∈ R^(s×h)) is created to learn the importance of each word in the document (similar to the attention mechanism described below for HANs). The attention module is then added to learn and emphasise each word's contribution to the entire document sequence. 36
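The attention pooling described above can be sketched as follows, with a hypothetical context vector standing in for the trained one: each hidden state is scored against the context vector, the scores are softmax-normalised into weights, and the weighted sum of hidden states forms the document representation:

```python
import math

def attention_pool(hidden, context):
    """Score each hidden state against a context vector, normalise the
    scores with a softmax, and return the weighted sum of hidden states."""
    scores = [sum(hi * ci for hi, ci in zip(h, context)) for h in hidden]
    m = max(scores)                                # for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(hidden[0])
    pooled = [sum(w * h[j] for w, h in zip(weights, hidden)) for j in range(dim)]
    return pooled, weights

hidden = [[0.0, 1.0], [2.0, 0.0], [0.0, 0.0]]  # one hidden state per word
context = [1.0, 0.0]                           # hypothetical trained context vector
pooled, weights = attention_pool(hidden, context)
# the second word scores highest, so it dominates the pooled representation
```

The weights themselves are the attention scores used later to extract the most important words for each classification decision.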

| Hierarchical attention network
Words contribute differently to an individual sentence and sentences contribute differently to the whole document. The hierarchical attention network is proposed to imitate this hierarchical structure of documents, with two levels of attention modules applied at word level and sentence level. 32 After the recurrent hidden layer, in the word-level attention module, the hidden representations of each word in a sentence are multiplied by a local word context vector, which is trained to learn the importance of each word in the sentence. The representation vector of each sentence is then summarised from these weighted word representations. Similarly, in sentence-level attention, the hidden representations of each sentence in the document are multiplied by a global sentence context vector, which is trained to learn the importance of each sentence in the document. A document representation vector is then obtained from these weighted sentence representations. After an activation function and a linear transformation, we output the probability for each risk of bias item. With the hierarchical structure, HAN can generate ranking scores for sentences, which can be used to extract the most relevant sentences and provide them to users, allowing them to make a judgement on the veracity of the machine decision.

| BERT models
One limitation of word embeddings like word2vec is that the representation vector of a given word is fixed, regardless of context. Contextualised representation models like BERT 22 (Bidirectional Encoder Representations from Transformers) address this issue. BERT extracts contextualised embeddings by training a deep bidirectional encoder from transformers 37 on the BooksCorpus and English Wikipedia. The transformer structure mainly consists of identical blocks, and each block contains sub-modules based on multi-head self-attention and a feed-forward neural network. It dispenses with recurrence and convolutions, and achieves state-of-the-art performance on many NLP tasks. 37 The pre-trained BERT can be fine-tuned with a simple additional output layer for downstream tasks. BERT uses WordPiece with a 30,000-token vocabulary for tokenization, which handles rare words better than 'pure' word embeddings and more efficiently than character embeddings. 38 Previous work shows that the domain of the corpus used for pre-training affects performance on the downstream task. 39 Since our task is conducted on preclinical texts, we initialise the model with the pre-trained weights from BioBERT, which applies the same architecture as BERT and is pre-trained on combinations of text corpora including the BooksCorpus, English Wikipedia, PubMed abstracts and PubMed Central full-text articles. 40 One drawback of BERT is that it can only accept a maximum of 512 tokens as input, which limits its use for tasks with long documents. Other transformer models are designed for long documents, such as Longformer, 41 which can process a maximum of 4096 tokens. However, this is still computationally expensive, and our full-text publications contain 5000 tokens on average. To solve this issue, we propose two strategies.

| BERT with document chunk pooling
We split documents into text chunks, apply BioBERT to each chunk, and pool the hidden states from the different chunks using multiple strategies. This is similar to the structure applied in the classification of clinical notes for patient smoking status, 42 with some modifications as shown in Figure S4. After WordPiece tokenization, a document with s tokens is split into m = ⌈s/510⌉ chunks (510 rather than 512 because each chunk also carries the leading classification token and the separation token for sentence segmentation). The input representation of the document is X ∈ R^(m×512×h), where h is the hidden dimension throughout the embedding layer and encoder layers in BioBERT. Instead of taking the hidden states from the last encoder layer, we perform an average pooling operation over several encoder layers to obtain the output. We summarise across tokens within each chunk with five different options: (1) max pooling, (2) average pooling, (3) concatenating the output from max pooling and average pooling, (4) using the hidden states of the first token, (5) concatenating the hidden states of all tokens. After the two pooling layers, we explore three head layers (linear, convolution or LSTM) for the downstream classification task. The convolution and LSTM heads use the same architecture as described previously. The linear head cannot handle sequences of different lengths, so we add another pooling layer to obtain a fixed-dimension output. This pooling layer uses the same options as the second pooling layer, excluding 'concatenate hidden states of all tokens', which does not generate a fixed-dimension output.
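The chunking arithmetic can be illustrated as follows. This is a sketch under the assumption of ceiling division; the real implementation operates on WordPiece token IDs and attention masks rather than strings:

```python
import math

def chunk_tokens(tokens, size=510):
    """Split a token sequence into m = ceil(s / 510) chunks, reserving two
    positions per chunk for the classification and separation tokens."""
    m = math.ceil(len(tokens) / size)
    return [["[CLS]"] + tokens[i * size:(i + 1) * size] + ["[SEP]"]
            for i in range(m)]

tokens = [f"tok{i}" for i in range(1200)]
chunks = chunk_tokens(tokens)
# 1200 tokens -> ceil(1200/510) = 3 chunks, each at most 512 tokens long
```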

| BERT with sentence extraction
Instead of using the full-text document as input, we extract the sentences most relevant to the risk of bias description. We first use scispaCy 43 to split a document into sentences, and then apply SentenceTransformers 44 to obtain a vector for each individual sentence. We also feed a description sentence for each risk of bias item (see the descriptions in 'Dataset' above) to Sentence-BERT 44 and obtain the corresponding representative vectors. For each individual document, we calculate the cosine similarity score between each sentence vector and the vector of the risk of bias description sentence. We take the first k sentences with the highest similarity scores, that is, the k most relevant sentences, to form a new, shorter passage. We then fine-tune the DistilBERT 45 model (a lighter version of BERT), with a linear, convolution or LSTM head, on the new passage to generate the probabilities of risk of bias reporting. The sentence extraction process is unsupervised and is independent of the actual training process.
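The unsupervised extraction step can be sketched as below. For simplicity, the sketch scores sentences with bag-of-words vectors; the actual pipeline uses Sentence-BERT embeddings, and the example sentences are invented:

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def top_k_sentences(sentences, description, k=2):
    """Rank sentences by cosine similarity to the risk of bias description
    and keep the k most similar to form a shorter passage."""
    desc = Counter(description.lower().split())
    scored = [(cosine(Counter(s.lower().split()), desc), s) for s in sentences]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in scored[:k]]

sentences = ["Animals were randomly allocated to groups at baseline.",
             "Tissue was fixed in formalin.",
             "Mice were allocated to treatment groups at random."]
passage = top_k_sentences(
    sentences, "animals are randomly allocated to treatment or control groups")
# the formalin sentence is least similar and is dropped from the passage
```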

| Evaluation metrics
To evaluate model performance, we define 'True Positive' as the number of records which report the risk of bias item and are predicted as reported; 'True Negative' as the number of records which do not report the risk of bias item and are predicted as unreported; 'False Positive' as the number of records which do not report the risk of bias item but are predicted as reported; and 'False Negative' as the number of records which report the risk of bias item but are predicted as unreported. For all classification models described above, we calculate precision, recall and the F1 score (the harmonic mean of precision and recall).
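From these counts, precision, recall and the F1 score reported throughout can be computed; a minimal sketch:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision = TP/(TP+FP); recall = TP/(TP+FN); F1 is their
    harmonic mean, 2*P*R/(P+R). Zero denominators return 0.0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=20)
# precision = recall = 0.8, so F1 = 0.8
```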

| RESULTS
The performance of eight models from three categories (baselines, neural models and models using BERT with two strategies) on the validation set is shown in Table 3. For baseline models, all items achieve F1 scores over 48%; in particular, models for compliance with animal welfare regulations show good performance, with F1 around 90%. For the selection of text representation methods, in our experiments bag-of-words is not robust and is prone to over-fitting. Doc2vec gives the best results across all items, perhaps because the training sample texts for doc2vec are closer to the preclinical domain, while the pre-trained word2vec vectors are induced from a more general biomedical corpus. For model selection, logistic regression achieves the best performance for random allocation to treatment or control, blinded assessment of outcome and conflict of interests, while for compliance with animal welfare regulations and animal exclusions, support vector machine performs better.
We find that neural models are more robust to hyperparameter tuning than baseline models. For random allocation, blinded assessment of outcome and conflict of interests, neural models improve F1 by 14%-30% over baseline models, with little difference among the three neural models. For compliance with animal welfare regulations, neural models have no advantage over baselines, with F1 4%-14% lower. For animal exclusions, weight balancing and under-sampling strategies do not effectively address the data imbalance issue, and the training process is prone to over-fitting.
Models using BERT with the two strategies described do not outperform neural models, except for compliance with animal welfare regulations, where there is a 3% improvement. This is expected: in the document chunk pooling (DCP) strategy, we freeze all the encoder layers and so do not take full advantage of the BERT structure, and the multiple pooling strategies do not overcome this limitation; in the sentence extraction strategy, although we can fine-tune DistilBERT, we still lose some information by using shorter texts extracted from the full publications. We have not been able to evaluate the performance of the sentence extraction modules themselves, which would require further sentence-level annotations.

TABLE 3 Performance of the best model in three categories (baseline, neural models and BERT models with two strategies) for risk of bias items on the validation set. Note: 'SVM' represents support vector machine; 'LogReg' represents logistic regression; 'RF' represents random forest; 'CNN' represents convolutional neural network; 'RNN + Attn' represents recurrent neural network with attention; 'HAN' represents hierarchical attention network; 'BERT-DCP' represents BERT model with document chunk pooling; 'BERT-SE' represents BERT model with sentence extraction. For each risk of bias item the best performing approach (by F1 score) is given in bold.

Using the best model in its optimal setting for each of four risk of bias items, we evaluate and compare performance with our previous regular expression approach on the test set (no regular expression approach has been developed for animal exclusions). For blinded assessment of outcome, we select the RNN with attention as the optimal model instead of BERT with document chunk pooling (BERT-DCP), because we do not consider the very small advantage to justify the complexity of preprocessing in the latter approach. Table 4 shows that our models improve performance by between 13% and 36% for four risk of bias items compared with regular expressions (each significant at p < 0.05 by McNemar's test 46 ).

TABLE 4 Performance of the best natural language processing model and the regular expression approach for each risk of bias item on the test set.

Table 5 shows the prediction and sentence extraction function of our models on an example paper which reports random allocation, blinded assessment of outcome and animal exclusions, but does not report conflict of interests or animal welfare regulations. Unlike previous rule-based approaches, which output yes/no labels only, our models can be used to extract the most relevant sentences from full text, which can support the judgement from the prediction probabilities or signal whether users need to re-check the full texts.
In Table 5, the sentences extracted for random allocation, blinded assessment of outcome and animal exclusions are clearly related to the items and provide positive evidence supporting the prediction probabilities, while the sentences extracted for conflict of interests and animal welfare regulations bear little relation to the target concepts, supporting the negative predictions.

| DISCUSSION
We have shown that different models are optimal for the detection of reporting of different risks of bias. The convolutional neural network is best for conflict of interests, and RNN with attention is best for random allocation and blinded assessment of outcome. For compliance with animal welfare regulations, models using BERT with the sentence extraction (BERT-SE) strategy achieve the best performance. For animal exclusions, CNN achieves the best performance on the validation set, but no approach provides reliable performance on the test set. Compared with the previous regular expression approach, the F1 scores for four risk of bias items are between 13% and 36% higher, indicating a substantial improvement. The sentence extraction function can provide potentially relevant sentences as clues for users making the judgement. We can analyse all positive samples and use the RNN with the attention module to output attention scores for the tokens in each individual paper, and thus extract the words most important to the classification decision (Figure 2), which may help the development of future rule-based approaches. Among the incorrect records, our models are more likely to conclude that papers report random allocation, blinded assessment of outcome and animal exclusions, and less likely to predict that papers report conflict of interests and animal welfare regulations (Figure 3). To analyse sources of error, we randomly selected 10 incorrect records for each item from the test set. Our models did not recognise phrases like 'unaware' for blinded assessment, but considered that 'animals are randomly selected for testing' indicated random allocation to the experimental group. It may be that most records in our training set describe random allocation based on the presence of the word 'random' and blinded assessment based on the word 'blind', and that our training corpus did not have sufficient examples of alternative valid descriptions for these to be learned.
We also found two records where a conflict of interests was given before the 'Introduction' section or after the 'Reference' section, where we had removed the relevant text at the text processing stage.
The tool and code for predicting probabilities of risk of bias reporting in preclinical full texts are available at https://github.com/qianyingw/pre-rob. The levels of performance achieved make these tools suitable for research improvement activity where several hundred publications are to be evaluated. For instance, for random allocation in a corpus of 1000 manuscripts, this approach would estimate prevalence within 3% of the true value, and for 100 publications, within 10% of the true value (see calculations at https://github.com/camaradesuk/confidence_intervals_simulation). Given that the changes sought in research improvement activities are at least of this magnitude, we consider the performance of these tools in determining the reporting of risk of bias items to be such that they are suitable for deployment in a research improvement context. Similarly, they are suitable for the evaluation of risk of bias in large corpora such as large preclinical systematic reviews. However, they are not yet at the level where they are appropriate for the evaluation of individual publications.

FIGURE 2 Most important words in the decision of classification for each risk of bias item, based on the average attention scores from RNN output over all positive samples.
Our work has several limitations. First, our training dataset includes publications drawn from three datasets focusing largely on the neurosciences, as well as two datasets of unselected preclinical studies published in PLoS One and Nature. This may influence the generalizability of our findings. Second, PDF-to-text conversion loses document structure and we cannot identify the main sections of publications. This introduces some noise (for instance text from figures and tables) into our training corpus. Tools like GROBID (https://github.com/kermitt2/grobid) can convert PDFs to structured XML, but this depends heavily on the quality of the PDF, and in our experience it does not work well for some preclinical publications. However, enhanced approaches to PDF conversion, and the increased availability of publications in XML format, mean that this approach may become feasible in the future. Finally, following our own systematic review practice, we considered all publications which did not report measures to reduce risks of bias to be at high risk of bias, with no 'unclear' category. Because our models output continuous probability scores ranging from 0 to 1, which can then be thresholded to provide a binary score, it might be possible to identify an 'unclear' category as those with intermediate scores; but since these are not labelled in our datasets we have not been able to do so.
In future work, we will seek to improve performance further using datasets involving more journals and a wider range of preclinical experiments, and will exploit disease information and texts from structured PubMed XML records, which may yield better performance. We will continue improving the detection of animal exclusions to achieve more reliable performance, and we will develop approaches for other risk of bias items, including sample size calculation and allocation concealment. We will also develop a user-friendly function embedded in the preclinical systematic review facility SyRF 47 and a standalone API, enabling use by others. 48

| CONCLUSIONS
We explore multiple text classification models, from baselines to recent NLP techniques, and demonstrate the advantages of neural models and BERT models for risk of bias assessment in the preclinical literature. BERT models work well for animal welfare regulations, while convolutional or recurrent neural networks achieve better performance for the other items. We encourage the use of NLP techniques to assist risk of bias assessment and reduce workload in preclinical systematic reviews. If computational limitations require the implementation of a single tool, we recommend using CNNs, which achieve overall good performance across our five risk of bias items. The performance of these tools is such that they could be deployed in automated approaches to monitor risk of bias reporting as part of institutional research improvement activities.