Using rapid prototyping to choose a bioinformatics workflow management system

Workflow management systems represent, manage, and execute multi-step computational analyses and offer many benefits to bioinformaticians. They provide a common language for describing analysis workflows, contributing to reproducibility and to building libraries of reusable components. They can support both incremental build and re-entrancy – the ability to selectively re-execute parts of a workflow in the presence of additional inputs or changes in configuration and to resume execution from where a workflow previously stopped. Many workflow management systems enhance portability by supporting the use of containers, high-performance computing systems and clouds. Most importantly, workflow management systems allow bioinformaticians to delegate how their workflows are run to the workflow management system and its developers. This frees the bioinformaticians to focus on the content of these workflows, their data analyses, and their science. RiboViz is a package to extract biological insight from ribosome profiling data to help advance understanding of protein synthesis. At the heart of RiboViz is an analysis workflow, implemented in a Python script. To conform to best practices for scientific computing which recommend the use of build tools to automate workflows and to re-use code instead of rewriting it, the authors reimplemented this workflow within a workflow management system. To select a workflow management system, a rapid survey of available systems was undertaken, and candidates were shortlisted: Snakemake, cwltool and Toil (implementations of the Common Workflow Language) and Nextflow. An evaluation of each candidate, via rapid prototyping of a subset of the RiboViz workflow, was performed and Nextflow was chosen. The selection process took 10 person-days, a small cost for the assurance that Nextflow best satisfied the authors’ requirements. 
This use of rapid prototyping can offer a low-cost way of making a more informed selection of software to use within projects, rather than relying solely upon reviews and recommendations by others.

Author summary

Data analysis involves many steps, as data are wrangled, processed, and analysed using a succession of unrelated software packages. Running all the right steps, in the right order, with the right outputs in the right places is a major source of frustration. Workflow management systems require that each data analysis step be “wrapped” in a structured way, describing its inputs, parameters, and outputs. By writing these wrappers the scientist can focus on the meaning of each step, which is the interesting part. The system uses these wrappers to decide what steps to run and how to run these, and takes charge of running the steps, including reporting on errors. This makes it much easier to repeatedly run the analysis and to run it transparently upon different computers. To select a workflow management system, we surveyed available tools and selected three for “rapid prototype” implementations to evaluate their suitability for our project. We advocate this rapid prototyping as a low-cost (both time and effort) way of making an informed selection of a system for use within a project. We conclude that many similar multi-step data analysis workflows can be rewritten in a workflow management system.

Introduction

Bioinformatics data analysis takes many steps, and a crucial but frustrating part of bioinformatics work is to run the right processing steps, in the right order, on the right data, reliably [1]. Usually these steps will involve disparate pieces of software from different sources, all run from the command line. For example, high-throughput sequencing data analysis may involve demultiplexing, trimming, cleaning, alignment, de-duplication, base quality score recalibration, and quantification. Phylogenetic analysis may involve selecting sequences, multiple sequence alignment, alignment trimming, and tree inference. Image analysis can also involve many steps applied to large numbers of images. Success in these multi-step data analyses generally requires writing a script to automate the steps. However, traditional shell scripts and even Makefiles have limited error reporting, are hard to debug, can be hard to restart after they go wrong, and can be challenging to move from one computer architecture to another.
For example, bash scripts do not support re-entrancy or incremental build unless these functionalities are explicitly implemented by their authors, which can be a non-trivial development activity.

Workflow management systems (systems to represent, manage and execute analyses) address these problems [2][3][4]. They can provide a common language for describing analysis workflows, contributing to reproducibility and the building of libraries of reusable components. They can support both incremental build and re-entrancy, providing the ability to selectively re-execute parts of a workflow in the presence of additional inputs or changes in configuration and the ability to resume execution from where a workflow previously stopped. Many workflow management systems provide support to exploit software containers and package managers, high-performance computing systems and clouds. Most importantly, workflow management systems allow bioinformaticians to delegate how their workflows are run to the workflow management system, and its developers, freeing the bioinformaticians to focus on their science.

In this article, we describe the process that we used for selecting a workflow management system for our ribosome profiling software, RiboViz [5]. While Leipzig [4] offers advice on choosing a workflow management system based on the qualities of classes of workflow management system, we used an approach to selection focused on both the popularity of the candidate tools within the bioinformatics community and on the specific merits of the candidate tools in the context of our project's specific requirements.

To select a workflow management system, a rapid survey of available workflow management systems was undertaken and candidates were shortlisted: Snakemake, cwltool and Toil (implementations of the Common Workflow Language) and Nextflow.

Our evaluation used rapid prototyping for three reasons.
Firstly, using the candidate systems, and their documentation, would provide more insight into their ease of use, their capabilities, and the quality of their supporting documentation than could be ascertained by solely reading their documentation. Secondly, focusing on implementing our workflow would give us more insight into these qualities than solely working through tutorial examples specifically designed by the developers of the systems to demonstrate their software. And, thirdly, whatever system we adopted, we would have the corresponding prototype to build upon.

Though our focus was on selecting a workflow management system, the use of rapid prototyping offers a low-cost way of making a more informed selection of software to use within projects, rather than relying solely upon reviews and recommendations by others.

The intent of this article is not to make a recommendation as to the use of a specific workflow management system for all bioinformatics projects. Nor is this article intended to claim that using rapid prototyping is suitable for the selection of all software or for all projects. Rather, it is to demonstrate how we used rapid prototyping to select a workflow management system that met the specific requirements of our project, and to discuss our experiences with the workflow management systems that we considered.

RiboViz and the requirement for a workflow management system

RiboViz is a high-throughput sequencing analysis pipeline specialised for ribosome profiling data. RiboViz takes raw data from sequencing machines; estimates how much each part of RNA is translated into protein and how the amount of translation is controlled by the code of that RNA; and produces analysis data, tables, and graphs.
At the heart of RiboViz is an analysis workflow to process ribosome profiling data across several samples, whose information along with all parameters for processing is described in a single input YAML file. Sample-specific read data can be provided as separate (fastq) input files or within a multiplexed input file. This workflow invokes a series of steps per sample (for example, adapter trimming, rRNA and ORF alignment, trimming 5' mismatches). In addition, there are some initial, sample-independent, steps (for example, creating rRNA and [...]

The RiboViz analysis workflow was implemented in a Python script. Each time a command-line tool is invoked, a log file is created in which standard output and standard error are captured. A log file for the execution of the Python script itself is also created. Sample-specific data and log files are written to sample-specific directories. The Python script also logs all the commands executed via bash to a script which can be run standalone and which allows a specific analysis to be rerun outwith the Python script. The RiboViz Python script can be configured to run in a "dry run" mode whereby it will validate its configuration, check that input files exist and output this complete bash script without executing the steps.

However, as our Python script evolved, we were aware that we were adding more features related to managing the invocation of the analysis steps, rather than the nature of these steps themselves: we were implementing a custom workflow management system for RiboViz. This was problematic for several reasons. Our code was becoming more difficult to maintain as it evolved to accommodate additional requirements which were not envisaged when its implementation began in 2016. Our code did not support re-entrancy or incremental build, both of which we viewed as essential for implementing workflows to process large datasets.
Nor did our code support parallel execution of the workflow, which would be necessary to support the future execution of RiboViz on large-scale datasets. Implementing these would have incurred significant development effort, effort which would be better spent implementing the steps within the workflow, the science itself.

It was time for us to adopt two more of the Wilson et al. best practices, to "Use a build tool to automate workflows" and to "Re-use code instead of rewriting it", that is, to use an off-the-shelf workflow management system, the adoption of which, we estimated, would incur significantly less effort than implementing re-entrancy, incremental build and support for parallel processing ourselves.

A survey of available workflow management systems to shortlist candidates

We first conducted a rapid survey of available workflow management systems to shortlist candidates for rapid prototyping. The criteria we used to select candidates for shortlisting are summarised in Table 2.

Table 2 Shortlisting criteria

Criteria | Description
Popularity | The system seems to be in common use and is well-regarded within the bioinformatics community. The system is likely to be practically usable.
Free and open source licence | The system is free and has an open source licence, as RiboViz itself is free and open source.
Well-established, stable and with a future | The system has been around for at least a year, has regular releases and evidence that it is actively maintained, developed, and supported. Development of the system is unlikely to stop after we migrate to it.

We started by conducting web searches to find out what systems, and existing surveys of systems, were available using combinations of the terms "workflow management system" and "bioinformatics," "survey" and "list". In keeping with our pragmatic, low-cost approach, a systematic literature review was not undertaken, as our goal was not to produce a comprehensive survey of every workflow management system available but to rapidly identify which systems are in common use, and are well-regarded, within the bioinformatics community.

We consulted existing surveys of workflow management systems. Leipzig [...]

[...] However, its reliance on Groovy (https://groovy-lang.org/), a Python-style scripting language that can be run on the Java platform, was perceived to be daunting and Nextflow's error messages were deemed to be quite cryptic. Snakemake was viewed positively for its ease of use, the concision of its workflows, and for being based on Python. However, its documentation was felt to be lacking and it was not considered to be as flexible as Nextflow.

[...] SevenBridges, a commercial system). Our shortlisted candidates were Snakemake and Nextflow. CWL had also been frequently and positively mentioned so we chose both its reference implementation, cwltool, and one of its production implementations, the aforementioned Toil. We chose Toil over Cromwell as Cromwell was listed as a "partial", not "production", implementation on the CWL web site.

It was also important that we adopt a workflow management system that was well-established, stable and with a future [24]. We did not want to migrate to a system only for development around that system to stop. To assess the stability of and development activity around each tool, we reviewed statistics from their open source repositories and the number of web search results for them (Table 3).
[...] us confidence that the candidate systems were being widely and actively used, developed, and supported, and will continue to be so for the foreseeable future.

Once a shortlist had been drawn up, we carried out an evaluation of each candidate system via rapid prototyping. This allowed a more detailed evaluation of whether each candidate system met our requirements, and an assessment of how easy each system is to use and of the perceived quality and utility of its supporting documentation.

Our evaluation focused on rapidly prototyping a subset of the RiboViz workflow in each system. There were three reasons for this. Firstly, using each system, and its documentation, would provide more insight into its ease of use, its capabilities, and the quality of its supporting documentation than could be ascertained by solely reading that documentation. Secondly, focusing on implementing our workflow would give us more insight into these qualities than solely working through tutorial examples specifically designed by the developers of the systems to demonstrate their software. And, thirdly, whatever system we adopted, we would have the corresponding prototype to build upon.

Two to three person-days were allotted to each system. If nothing productive could be implemented within that period, then the system would be left and the next considered.

Our evaluation criteria are shown in Table 4.

It will be noted that the first three criteria are subjective, and necessarily so [25]. Ease of use, readability of documentation and ease of implementation are very much dependent upon the skills, knowledge, and experience of those who will use a system and its supporting resources. For RiboViz, users are expected to be familiar with bash command-line tools and developers familiar with development of bash, Python and R scripts under Linux. We sought a system that would enable us, and our user community, to implement, maintain and extend our workflow in a way that is easier than at present.
Table 5 summarises how each tool met our evaluation criteria.

Summary of how each workflow management system met the RiboViz project's objective evaluation criteria.
1 The time taken relates to writing CWL workflows, not cwltool or Toil-specific workflows.
2 These criteria were not explored as the decision had been made to not consider CWL further considering its lack of support for conditional execution of steps.
3 The lack of support for conditional execution is a restriction of CWL, not Toil.

Snakemake
Snakemake was easy to download and install, via the conda (https://docs.conda.io/en/latest/) package manager, and had a comprehensive tutorial.

Snakemake adopts the same model of operation as the GNU Make (https://www.gnu.org/software/make/) automated build tool: users specify the output files they want to build, Snakemake looks for rules to create these output files and runs the commands (in Snakemake, bash commands or Python scripts) specified in these rules to create the output files. Rules can specify dependencies, that is, files used by the commands to create the output files. If these files do not exist then Snakemake looks for rules to create these, and so on.

Snakemake is implemented in Python. Python code can also be embedded within a Snakefile, for example, to create file paths or validate configuration parameters.

Implementing steps from the RiboViz workflow was straightforward and a functional version of the complete RiboViz workflow (everything bar steps specifically to handle multiplexed files) was implemented in less than a person-day.

Snakemake provided all the required and useful functionality listed in our evaluation criteria. Snakemake provides a "keep going" configuration parameter which can be used to continue processing other samples if processing of one sample fails. Like Make, Snakemake supports incremental build and re-entrancy. Conditional behaviour can be implemented via the use of Python conditions. Step-specific log files can be implemented, but Snakemake does not automatically capture these: the bash commands executed by each step explicitly need to redirect standard output and standard error streams into these log files. Like Make, Snakemake supports a "dry run" option that can check that input files exist and that displays the commands that would be run, without running these.
As for Make, the ability to specify exactly the files to build can be useful for debugging.

While Snakemake does not output a bash script that can be run standalone, it can output a summary file with the commands submitted to bash for execution. This file could be parsed, and the commands extracted and constructed into a bash script.

Snakemake has support for running its jobs within containers, HPC systems and clouds.
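
The Make-style model that Snakemake follows (request a target file, find the rule that creates it, build missing or out-of-date dependencies first, and skip anything already up to date) can be illustrated with a minimal Python sketch. This is not Snakemake code and does not use Snakemake's API: the `Rule` class, the `is_stale` and `build` functions, and the timestamp-based staleness check are assumptions made purely for illustration.

```python
import os
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Rule:
    """A hypothetical rule: how to create one output file from input files."""
    output: str
    inputs: List[str]
    action: Callable[[List[str], str], None]


def is_stale(rule: Rule) -> bool:
    """An output is stale if it is missing or older than any of its inputs."""
    if not os.path.exists(rule.output):
        return True
    out_time = os.path.getmtime(rule.output)
    return any(os.path.getmtime(path) > out_time for path in rule.inputs)


def build(target: str, rules: Dict[str, Rule], dry_run: bool = False,
          _planned: Optional[List[str]] = None) -> List[str]:
    """Build `target`, returning the outputs that were (or would be) created.

    With dry_run=True nothing is executed; only the plan is returned.
    """
    planned = [] if _planned is None else _planned
    rule = rules.get(target)
    if rule is None:
        # A file with no rule must already exist (for example, raw input data).
        if not os.path.exists(target):
            raise FileNotFoundError(f"No rule to create missing file: {target}")
        return planned
    # Resolve dependencies first, exactly as Make does.
    for dependency in rule.inputs:
        build(dependency, rules, dry_run, planned)
    dependency_rebuilt = any(dep in planned for dep in rule.inputs)
    if rule.output not in planned and (dependency_rebuilt or is_stale(rule)):
        if not dry_run:
            rule.action(rule.inputs, rule.output)
        planned.append(rule.output)
    return planned
```

Calling `build` a second time with unchanged inputs returns an empty list, which is the incremental build behaviour described above; calling it with `dry_run=True` returns the plan without running any action, loosely mirroring the "dry run" option.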

Common Workflow Language, cwltool and Toil
Both cwltool and Toil were easy to install, via the Python pip (https://pip.pypa.io/) package manager. It was easy to run a CWL "hello world!" example via both. A comprehensive, step-by-step tutorial to the language is available (https://www.commonwl.org/user_guide) [26].

CWL tool wrappers, which describe the inputs and outputs of command-line tools, and job configuration files, which describe workflows, are written as YAML or JSON documents. JavaScript can be embedded for any additional computation that is required, for example to create file paths or validate configuration parameters.

Implementing three steps of the RiboViz workflow took a person-day. The "edit-run-debug" development cycle felt slow and painful, due to the richness of CWL and the occasionally cryptic error messages that arose during execution.

Conditional behaviour is not yet supported within CWL: a "Collecting use cases for workflow level conditionals" issue [27] was added in February 2020 to their 1.2 milestone, but, at time of writing (August 2020), this has no due date. The lack of conditional invocation means that CWL is not currently suitable for RiboViz, or for other projects that require input-dependent control of workflow structure. (A colleague had evaluated CWL about a year and a half ago and, while they felt that simple workflows showed promise, the lack of conditionals meant that they could not adopt CWL for their project. Similarly, we felt that CWL would not be suitable for RiboViz at this time.) This limitation could have been identified at the shortlisting stage, but we had to achieve a balance between how many criteria to consider during shortlisting and how many during our rapid prototyping. We (incorrectly as it turned out) assumed that support for conditional execution would be a fundamental feature of any workflow management system, or of any language, such as CWL, executed by one.
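
To make "input-dependent control of workflow structure" concrete, the following minimal Python sketch shows the kind of decision a workflow needs to express: a demultiplexing step is only included when the read data arrive in a single multiplexed file. The configuration keys (`multiplexed_file`, `sample_files`), the step names and their ordering are hypothetical and are not taken from the RiboViz configuration format.

```python
from typing import Any, Dict, List


def plan_steps(config: Dict[str, Any]) -> List[str]:
    """Return the ordered step names for a run; the structure depends on the inputs.

    The configuration keys and step names are hypothetical.
    """
    has_multiplexed = bool(config.get("multiplexed_file"))
    has_samples = bool(config.get("sample_files"))
    if has_multiplexed == has_samples:
        raise ValueError(
            "Provide exactly one of 'multiplexed_file' or 'sample_files'")
    steps: List[str] = []
    if has_multiplexed:
        # Only needed when all samples arrive within one multiplexed file.
        steps.append("demultiplex")
    # Per-sample steps that run whichever form the input takes.
    steps += ["trim_adapters", "align_rrna", "align_orf", "trim_5p_mismatches"]
    return steps
```

A workflow language without conditionals cannot express the `if has_multiplexed` branch directly, so the two input forms would need two separate workflow definitions.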

Nextflow
Nextflow was easy to download and install, via the conda package manager, and had a simple tutorial.

A Nextflow workflow has a structure analogous to a Makefile or Snakefile: it consists of a set of processes which define inputs, outputs and commands describing how to create the outputs from the inputs. However, Nextflow adopts a dataflow programming model whereby the processes are connected via their outputs and inputs to other processes, and processes run as soon as they receive an input. Unlike Snakemake, a user does not specify the files they want to create; rather, they declare their input files and related configuration and Nextflow continues to invoke processes until no process has any outstanding inputs.

[...] processing of one sample fails, or to adjust process resource parameters if a reported error arose from a lack of memory or a time limit that was too low. Nextflow, like Snakemake, supports both incremental build and re-entrancy, via a "resume" option. Conditional execution of steps is supported via a "when" declaration. Unlike Snakemake it is not possible to specify the exact files to build, which can make debugging more challenging. However, every invocation of a step takes place in its own isolated subdirectory which includes a bash script with the command that was invoked, symbolic links to input files, output files, and files with the contents of the standard output and error streams. The step-specific bash scripts can be run within their step-specific directories which is useful for debugging the implementation of individual steps. These directories have auto-generated names but Nextflow allows the contents of these directories to be written into known locations with more readable names.
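
The dataflow model described in this section, in which processes are connected through their inputs and outputs and the run ends when no process has an outstanding input, can be sketched in a few lines of Python. This is an illustration only: it is not Nextflow code, the `Process` class, the `run` function and the channel names are hypothetical, and processes here fire one at a time whereas a real dataflow engine can run them in parallel.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, Dict, List, Optional, Tuple


@dataclass
class Process:
    """A hypothetical process: reads items from one channel, writes to another."""
    name: str
    source: str
    sink: Optional[str]
    command: Callable[[str], str]


def run(processes: List[Process],
        channels: Dict[str, Deque[str]]) -> List[Tuple[str, str]]:
    """Fire processes while any has an outstanding input; return the trace."""
    trace: List[Tuple[str, str]] = []
    progressed = True
    while progressed:
        progressed = False
        for process in processes:
            queue = channels.setdefault(process.source, deque())
            if not queue:
                continue  # nothing waiting for this process yet
            item = queue.popleft()
            result = process.command(item)
            trace.append((process.name, item))
            if process.sink is not None:
                # The output becomes an input of whichever process reads this channel.
                channels.setdefault(process.sink, deque()).append(result)
            progressed = True
    return trace
```

Note that the caller never names a target file: only the initial channel contents are declared, and everything downstream follows from the connections between processes.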
A "dry run", analogous to that supported by Snakemake and Make, has been suggested in a Nextflow issue [29], but has not progressed due to challenges in implementing such a feature within a dataflow model. The Nextflow authors instead recommend using small datasets to validate scripts. It should be noted that it may be challenging to identify a small dataset that would allow adequate replication of the workflow's behaviour in the presence of a full dataset.

Nextflow has support for running its jobs within containers, HPC systems and clouds.

Selecting a workflow management system

We decided to adopt Nextflow for the following reasons. It was our subjective impression that Nextflow felt far richer than Snakemake both in terms of features and expressivity, and it was felt that these outweighed its lack of a dry-run feature. The execution of each step within isolated subdirectories is useful for debugging. While writing Nextflow workflows does require knowledge of Groovy, the authors, familiar with Python and R, did not find learning Groovy challenging. The fact that Nextflow is based on Java incurs no additional installation overhead for either users or developers compared to Snakemake: each can be installed using the conda package manager using a single command. Based on our impressions of their documentation, Nextflow's built-in support for, and documentation around, containers, HPC systems and clouds seemed more thorough than that of Snakemake (though we appreciate that this may change as both tools evolve).

It took approximately five person-days to complete an implementation of the RiboViz workflow (including support for multiplexed files) within Nextflow. Our existing regression test framework for our Python script was used to validate the implementation of our Nextflow script.

The Nextflow implementation has been tested by the RiboViz development team on their own development platforms and also on EDDIE, The University of Edinburgh's high performance computing cluster (https://www.ed.ac.uk/information-services/research-support/research-computing/ecdf/high-performance-computing).

Release 2.0 of RiboViz [30] includes the Nextflow implementation of the RiboViz workflow. The Python implementation of the RiboViz workflow will be deprecated in a future release.

Nextflow has the nf-core collection of bioinformatics pipelines, a resource of open-source, reviewed, and validated Nextflow scripts implementing common data analyses [31]. The associated nf-core developer community (136 members as of 8 July 2020, https://nf-co.re/community) has some overlap with the Nextflow developers, but is primarily composed of bioinformaticians. Again, these provide strong evidence for a well-established system with a future and we will consider contributing RiboViz to nf-core in the future.

However, no choice of software should be permanently binding. Our positive experiences with Snakemake, and the small effort that would be required to complete the implementation of RiboViz into Snakemake, give us confidence that if we need to migrate from Nextflow to Snakemake in future, then this would be a relatively straightforward migration to undertake.

Rapid prototyping may not be suitable for the selection of all software or for all projects. For example, it would not be suitable for selecting software for large-scale IT projects or critical infrastructure.
However, the use of rapid prototyping does offer a low-cost way of making a more informed selection of software to use within projects, rather than relying solely upon reviews and recommendations by others.

In conclusion, we agree that workflow management systems are a technology that "bioinformaticians need to be using right now" [3], and that they can implement right now using well-engineered open source tools.