Beyond the whole genome consensus: Unravelling of PRRSV phylogenomics using next generation sequencing technologies

gives an overview, including a proposed workﬂow, on the use of NGS to explore the genetic diversity of PRRSV at both macro-and micro-evolutionary levels.


Introduction
Porcine reproductive and respiratory syndrome (PRRS) is a devastating disease resulting in huge economic losses to the swine industry worldwide (Neumann et al., 2005;Nieuwenhuis et al., 2012).The disease is characterised by late term reproductive failure in sows and respiratory distress in young growing pigs (Wensvoort et al., 1991;Zimmerman et al., 1997).Its viral etiologic agent, PRRS virus (PRRSV) of the family Arteriviridae and order Nidovirales, has been found to exist ubiquitously in almost all affected swine populations (Brockmeier et al., 2012;Shi et al., 2010).The positivesense RNA viral genome is approximately 15 kb and it encodes 14 non-structural and eight structural proteins in eight open reading frames (Meulenberg et al., 1993;Snijder and Meulenberg, 1998).The virus has a narrow cell tropism with a preference for cells of the monocyte/macrophage lineage infecting specific subsets of differentiated macrophages in lungs (i.e.alveolar macrophages), lymphoid tissues and placenta (Snijder et al., 2013).The evidence on PRRSV binding, internalisation and genome release has been reviewed elsewhere and a model of viral entry proposed (Van Breedam et al., 2010).Briefly, attachment of PRRSV to target cells is mediated by the interaction between viral glycoprotein 5 (GP5) and porcine sialoadhesin while internalisation and decapsidation of the virus require the trimer formed by viral GP2, GP3 and GP4 to interact with the host CD163.
PRRSV is a rapidly evolving pathogen (Murtaugh et al., 2010) with two predominant genotypes, the European Type 1 (Prototype strain Lelystad) and the North American-like Type 2 (Prototype strain VR-2332), which share approximately 60% nucleotide identity (Allende et al., 1999;Nelsen et al., 1999).However, current understanding from the findings of recent whole genome sequencing reveals the existence of considerably higher genetic variability making up a far more complex viral phylogenomics (Brar et al., 2014;Kvisgaard et al., 2013c;Zhou et al., 2014) (Fig. 1).Not only is genetic divergence evident between the two genotypes, individual PRRSV isolates and subtypes within each of the genotypes may vary up to approximately 20% at the nucleotide level (Han et al., 2006).For instance, variants of PRRSV Type 1 (subtypes 2, 3, and 4) identified in Eastern Europe (Stadejek et al., 2006(Stadejek et al., , 2002) ) seem to be more virulent than the prototypical LV strain of PRRSV Type 1 subtype 1 in the pig model due probably to enhanced host cytokine activities (Morgan et al., 2013;Weesendorp et al., 2013  faster replication in the nasal mucosa have also emerged (Frydas et al., 2013).Similar to PRRSV Type 1, ongoing evolution within the PRRSV Type 2 genotype has been rather striking as well.In 1987, atypical PRRSV Type 2 strains were first identified in North America (Collins et al., 1992;Keffaber, 1989) but variant isolates/subtypes of these highly pathogenic PRRSV Type 2 strains have since spread to Asia (Li et al., 2007;Tian et al., 2007).Furthermore, microevolution within a coexisting quasispecies population has been found to further increase sequence heterogeneity in PRRSV (Goldberg et al., 2003).The dynamics of such mixed viral population is of increasing clinical importance with studies showing them affecting both virulence and pathogenesis (Domingo et al., 2012;Lauring and Andino, 2010).
An important manifestation of the PRRSV genomic complexity, especially within pig populations harbouring multiple circulating genotypes, is the resulting antigenic diversity (Liao et al., 2013).Therefore, control of PRRSV through the use of vaccines that offer cross protection among genetically variable genotypes remains challenging (Kimman et al., 2009).The phylogenomic insights revealed through the use of next generation sequencing (NGS) technologies may thus offer a better understanding of the relationship between the evolutionary dynamics and pathogenesis of PRRSV which in turn may aid in the design of more effective strategies to control the virus.
Here, we provide an introductory overview of the application of NGS to study the phylogenomics of PRRSV and we also suggest Fig. 2. Proposed workflow for NGS analysis of PRRSV.The workflow includes a wet laboratory section where samples are prepared and sequenced; and a bioinformatics analysis section where raw sequencing data are pre-processed before undergoing various downstream analyses.a general workflow (Fig. 2) to accomplish this.Although this workflow mainly focuses upon the use of the Illumina sequencing platforms, it should also be applicable to other NGS platforms with appropriate modifications.

Sequencing technologies
Conventional Sanger sequencing (Sanger et al., 1977) has over the last four decades revolutionised the entire field of biological research.The completion of the first human genome is one among the many important milestones achieved using this technology.The dideoxy or chain-termination process when coupled with automated fluorescent-labelling and capillary electrophoresis allows up to approximately 900 bp of nucleotides to be sequenced with high confidence.Although 96 samples can be sequenced in parallel, automated capillary sequencing remains relatively labour intensive, costly and time consuming; especially with regards to large-scale genome-wide studies.
With the introduction of NGS technologies about ten years ago, a new era of cost-effective and truly high-throughput genomic sciences was ushered in.The various NGS technologies that have been developed or are under development have been reviewed in detail elsewhere (Mardis, 2013;Metzker, 2010).Despite the differences in their underlying chemistries, sequencing protocols and throughputs, all current NGS workflows typically involve nucleic acid library preparation, DNA capturing follows by clonal amplification of individual molecules within the library and finally parallelised sequencing of the amplified library to yield, in the case of ultra-deep NGS, up to billions of sequencing reads.Depending on the platforms used, the 36 to ∼700 bp read lengths are shorter than those obtained with the conventional Sanger sequencing method.However, third generation sequencing technologies, exemplified by Pacific Biosciences and, potentially, Oxford Nanopore, have opened the door to read lengths of >10 kbp which go well beyond that obtained from Sanger sequencing (Morey et al., 2013).
Table 1 highlights and compares various features of Sanger sequencing against five of the most widely used high-throughput sequencing platforms that are currently available.In term of error rate, read length, yield, running cost and the intended applications, each of these NGS technologies has its own merits and weaknesses (Glenn, 2011;Quail et al., 2012;Ross et al., 2013).Nevertheless, all the leading NGS platforms are capable of running whole genome or transcriptome sequencing to a high degree of accuracy.On the other hand, for de novo assembly of either genomes/metagenomes or transcriptomes/metatranscriptomes, the longer read lengths from platforms such as 454, Ion Torrent, Illumina's Miseq and PacBio enable better resolution of low complexity regions and repeats.However, the numbers of reads produced by 454, Ion Torrent and PacBio may be insufficient for low cost full coverage and assembly of larger genomes.Where higher coverage is essential for cases such as detection of rare variants and haplotype reconstruction of viral quasispecies, the ultra-high throughput capacities of Illumina stands up among the other platforms.It is perhaps worth pointing out that for validation of variants detected using the NGS technologies, Sanger sequencing remains the preferred method since it is still regarded as the gold standard for its very low error rate (Ewing et al., 1998).

Application of NGS in PRRSV research
Although PRRSV can be detected using antibody-based immunological methods, molecular methodologies such as PCR and sequencing remain the tools of choice for accurate identification of the virus at the genotypic or subgenotypic level.However, most previous studies on the genetic diversity of PPRSV have been restricted to the Sanger sequencing of the ORF5 and

Entropy (hx) Position (bp)
Genotype 1 Genotype 2 Fig. 3. Sequence Entropy of whole genome PRRSV strains.Multiple whole genome sequence alignment of PRRSV genotype 1 and 2 was generated with MUSCLE and their respective entropy calculated with the molecular evolutionary program, Hyphy.The plots were constructed with 100 nt sliding window.In addition to common reported regions of variability, strains from PRRSV genotype 1 also exhibit higher degree of heterogeneity at the 3 end of the genome.
ORF7 which harbour the main immunogenic epitopes (Díaz et al., 2009;Martin-Valls et al., 2014;Oleksiewicz et al., 2002;Stadejek et al., 2013) and key diagnostic size polymorphisms resulted from insertion/deletion or recombination events (Stadejek et al., 2013).Although more than 12,000 sequences of the two ORFs have been deposited in the GenBank database to date, the drawback of such studies becomes immediately apparent when comparison of more than 300 whole genomes of PRRSV shows that there are clearly other variable regions in addition to just these two (Fig. 3).Indeed, variable regions unique to each of the two genotypes can also be identified.Therefore, phylogenomics studies based solely on partial PRRSV genomes may be misleading and the ability to sequence the whole PRRSV genomes at great depth makes NGS a powerful tool of choice.Indeed, in addition to genotyping, both conventional Sanger sequencing and NGS have been applied extensively to study various aspects of PRRSV biology.They include single gene/ORF discovery, whole genome transcriptomes, identification of subgenomic and heteroclite RNAs (Yuan et al., 2004), characterisation of viral diversity in terms of single nucleotide variants and recombination events, detection of viral quasispecies that evolve in response to immune pressure, and following of persistent infection (Barzon et al., 2013;Beerenwinkel and Zagordi, 2011;Brar et al., 2014;Capobianchi et al., 2013;Chang et al., 2002;McElroy et al., 2014;Radford et al., 2012).
The hallmark brute-force sequencing capacity of NGS when used on a small genome, such as that of PRRSV, can easily result in the genome being sequenced hundreds to tens of thousands times over.This so-called ultra-deep sequencing offers a powerful tool to detect rare occurring variants.In comparison to some of the previous quasispecies studies involving laborious RT-PCRcloning-sequencing works focusing on only a particular subset of PRRSV's structural and non-structural genes (Goldberg et al., 2003;Schommer and Kleiboeker, 2006), the advantage of studying the whole genome using ultra-deep NGS is obvious.The additional heterogeneity brought about by the viral quasispecies can indeed provide valuable insights into the microevolutionary events within a mixed PRRSV population.For instance, these events allow the selection pressure to be followed during a vaccination trial.
Similar to all other modern genomic studies, application of NGS to PRRSV involves a workflow that combines both 'wet' (laboratory) and 'dry' (bioinformatics) activities (Fig. 2).Briefly, RNAs isolated from either tissue or culture-enriched materials, are reverse transcribed before being sequenced for further downstream bioinformatics analysis.

Sample preparation and sequencing
Sample assessment, enrichment of nucleic acids, and construction of sequencing library form the three integral preparatory steps prior to running any NGS application.While these procedures have been comprehensively reviewed (Head et al., 2014), the sample preparation required for the application of NGS to PRRSV depends on whether or not there is prior knowledge of the viral sequence.Table 2 describes the advantages and limits between the two approaches in preparing PRRSV samples for the generation of Illumina sequencing libraries.
Using known genomic sequences of PRRSV Type 1 and Type 2 available in public databases, Guo et al. (2011), Kvisgaard et al. (2013b) and Brar et al. (2014) have independently described a fast and robust method to generate by PCR sets of large overlapping amplicons of the whole PRRSV genome using degenerate primers targeting the two main PRRSV genotypes.Three different NGS platforms, namely Illumina, 454 and Ion Torrent, were used in these studies.The PCR-based enrichment of starting RNA materials enabled successful sequencing of the virus in a wide range of infected clinical tissue materials, including serum, lung and lymph node, even when the viral load was limited (Kvisgaard et al., 2013a).On the other hand, this approach usually requires time-consuming optimisation of the PCR procedures in order to amplify the longer amplicons.In addition, the highly heterogeneous PRRSV clearly poses a challenging obstacle for locating known conserved regions for which PCR primers can be designed.And the prerequisite of prior knowledge of conserved regions also rules out the ability of this methodology to pick up novel strains with mutated primer binding sites.Furthermore, complete coverage of the full genome with no ambiguity may not always be achievable.Of the three platforms tested, 454 with its longer read length allowed a near complete PRRSV genome to be obtained while Illumina gave a much higher read depth.The poorer performance of Ion Torrent was probably the result of its higher error rates in sequencing.It is perhaps worth pointing out that gap closure or extending into novel regions with methods such as nested-PCR, cloning and 5 -RACE may not be trivial.While PCR-and prior knowledge-based techniques allow for rapid diagnosis and genotyping of some PRRSV strains/isolates, heterogeneity of the virus may be seriously under-represented when variabilities and gaps outwith the known conserved regions and potential quasispecies within the same population are not included.
More recently, by bypassing the initial in vitro RT-PCR step and taking the advantage offered by the 225 bp longer read length from Illumina's Miseq machine, a complete PRRSV genome was successfully constructed without any prior knowledge of the viral sequence (Lu et al., 2014).This approach clearly offers the potential advantage of obtaining complete novel PRRSV sequences.However, it is important to note that a higher load of viral RNA was needed before the desired sequencing depth could be achieved.In addition, the "contaminating" host transcriptome, which was in fact the major species of nucleic acids in the sample, sequenced concurrently has to be removed.There is, of course, value in this "contaminating" host transcriptome sequences.When analysed, it provides an insight into potential host responses towards the PRRSV infection.

Pre-processing
Although not an essential step in the NGS analysis of eukaryotic or bacterial genomes, quality check or pre-processing of raw reads used in the mapping of fast evolving viral genomes is highly recommended.In spite of the much improved accuracy and precision of various NGS platforms, errors in raw data are not uncommon.They include remnant and/or read-through adaptors on both the 5 and 3 ends of sequencing reads; low quality (<Q phred 20) and ambiguous bases; chimeric reads; platform-specific artefacts; long homopolymer stretches; and contaminating host nucleic acids (Meacham et al., 2011;Nakamura et al., 2011;Shao et al., 2013;Wang et al., 2012a).Additionally, the infidelity of polymerases, though at a very low rate, used during the amplification of the sample library is potentially another source of error.Ignoring or failing to sufficiently clean up these erroneous data may seriously compromise subsequent downstream biological inferences.Suboptimal mapping of these error containing reads against the reference genome may well lead to false identification of single nucleotide variants (SNVs) and/or indels and skew the estimation of the population complexity.Furthermore, any errors in the raw data can also unnecessarily complicate the de novo assembly process resulting in additional erroneous contigs being assembled.Not only is this going to increase the computational demands but more importantly, the resulting scaffolds may be too fragmented to allow the full construction of the novel genome.The multiplicity effects of any false positive bases or substitutions can indeed have a very profound impact on the accuracy of phylogenomic typing of small viral genomes.Not only can new strains be wrongly identified, the complexity of the quasispecies population may also be wrongly inflated.
Although different algorithms haven been applied to develop various pre-processing tools with diverse features, most of them deal specifically with issues related to the Illumina NGS reads; owing simply to the current dominance of the Illumina platform in the field (Hadfield and Loman, 2014).Many of these tools have been compared and reviewed elsewhere (Del Fabbro et al., 2013;Trivedi et al., 2014;Yang et al., 2013).They include fastx toolkits (Hannon, 2009), fastqc (Andrews, 2010), Cutadapt (Martin, 2011), Sickle (Joshi and Fass, 2011) and Scythe (Buffalo, 2011).While the types of the dataset, requirements of downstream analysis and various parameter-dependent trade-offs ultimately decide which programme is to be used, the quality-control/preprocessing procedure usually starts with initial quality plots of the raw data, followed by different trimming/filtering steps and finally the removal of contaminants.In the case with PRRSV, the most likely "contaminant" is from either the host's (i.e.pig) DNA or RNA which can subsequently be removed with mapping programs such as BWA (Li and Durbin, 2010), Bowtie (Langmead and Salzberg, 2012) or BMTagger (Rotmistrovsky and Agarwala, 2011).

Mapping vs de novo assembly
To date, bioinformatics analyses of most of the recent PRRSV whole-genome sequencing using NGS technologies are limited to mapping the NGS reads to a one of the two prototypic reference PRRSV genomes.Among the many mapping tools developed, BWA and Bowtie are perhaps two of the most mature and widely used ones.The latter is known to run at a faster speed for shorter reads but BWA has a slight advantage in term of mapping accuracy (Li, 2013).While mapping is a fast and efficient way to identify new variants in closely related genomes, the default mapping stringency of these tools may prevent a significant number of real but less similar reads from being successfully mapped to the reference genome.The end result of such a mapping exercise is usually an incomplete consensus genome with variants of lower confidence.A de novo assembly should then be carried out to either close the gaps or identify potential novel recombination and mutation events.
De novo assembly is essentially the piecing together of a jigsaw puzzle involving millions or even billions of short NGS reads.Over the last few years, many draft genomes have been constructed to varying degrees of completeness using as many protocols and tools (Bradnam et al., 2013;Earl et al., 2011;Howison et al., 2013).Depending on the lengths of the NGS reads, these assembly tools are based broadly on either the long read overlapping or short read k-mer graph construction algorithm.It is important to note that regardless of the algorithms used, cleaned raw data of the highest quality is of absolute essential.Very often a hybrid approach using long and short NGS reads is necessary when highly divergence genomes are analysed (Bashir et al., 2012;Wang et al., 2012b).The quality of genomes assembled using the hybrid approach can then be checked by mapping the short reads back to the assembly.

Downstream analysis
Consensus sequence of the mapped genome are routinely constructed by replacing the reference bases with SNVs/Indels identified using tools such as Samtools (Li et al., 2009) and GATK (McKenna et al., 2010).This sequence can then be compared against that derived from the de novo assembly to confirm the quality and accuracy of the novel genome derived from both mapping and assembly.
This final consensus sequence representative of the major strain in the viral population can then be used as a reference genome and fed into the mapping workflow using the same cleaned dataset.A more relaxed variant calling regime in Samtools/GATK or other low-frequency variant detecting tools, such as SNVer (Wei et al., 2011) and LoFreq (Wilm et al., 2012), can next be applied to search for the presence of variants arising from inter-or intra-population quasispecies at frequencies as low as 1% (Lu et al., 2014).At this low frequency, the ultra-deep sequence coverage can provide at least a few reads to confidently support the detected variants; lending more weight to the existence of the quasispecies.However, this approach provides only a combined spectrum of all variants in the total quasipecies population.Separation of co-occurring variants or local haplotype clustering and full construction of global haplotypes are necessary to reveal not only the true diversity of the virus but more importantly to allow a better phenotypic interpretation of the viral evolution in response to stresses such as vaccine or antiviral drug treatment.Although a few algorithms have been proposed, the disentanglement of the huge number of short NGS reads to reconstruct the true viral haplotypes remains an imperfect task; especially when analysing haplotypes present at low-frequency (Schirmer et al., 2014).Nevertheless, high quality longer reads from platform such as 454 and PacBio do in general provide better resolution.Indeed, complete genomes of a number of microorganisms with varying repeat contents have recently been assembled to their whole entirety using reads generated by the PacBio platform (Chin et al., 2013).
In addition to the detection of new variants and construction of novel genomes, read depth or coverage across the analysed genome can reveal other transcriptional features of PRRSV which may be difficult to uncover with conventional molecular biology methods.The higher than average coverage at the 5 leader and 3 UTR is indicative of active transcription of the genomic and subgenomic RNAs.The changes of coverage for each of the subgenomic RNAs at different time points can also help to further the understanding of the roles each of the transcribed genes play during the infection cycle.
The information gained from bioinformatics analysis of PRRSV's NGS studies is therefore multi-dimensional and indeed unprecedentedly insightful.

Future challenges and conclusion
Over the past years, NGS technologies have brought about a dramatic change in the genomic landscape of biology.Such technologies have become an essential tool in the field of viral infectious diseases and it is likely that PRRSV research and diagnosis will benefit greatly from the advancement of NGS in the years to come.For example, such technology could help shed light on the genetic evolution of PRRSV during immune selection (Evans et al., 2014).However, further developments are required for the field to gain even more from this NGS revolution.One of such development would be the ability to acquire, with minimal or no viral enrichment and without prior knowledge of any conserved regions, the full genomic sequence of PRRSV directly from clinical samples such as blood, serum, nasopharyngeal swabs or tissue biopsies.In fact, works are now underway in different laboratories to attempt to sequence full viral genomes directly from primary infected tissues (personal communication).It is anticipated that genomic data gained from such a direct tissue study can provide an unprecedented insight into PRRSV variants distribution within host-tissues and potentially unravelling novel PRRSV's subgenomic, heteroclite and other probable RNA species.However, past studies have also highlighted the need to develop PCR-and culture-free methods to enrich the viral RNA in order to increase the sequencing depth to a level where rare variants from quasispecies can be detected confidently.Perhaps, methods involving sequence-independent amplification such as capture-by-hybridisation (Bent et al., 2013;Kwok et al., 2014) could take full advantage of the ca ∼370 currently available full genomes of PRRSV.In addition, although ultra-deep NGS has enabled the calling of low frequency variants using sophisticated bioinformatics tools, challenges remain to be overcome for accurate haplotype reconstruction of viral variants present in either very low number or with structural variations.Promisingly, however, it is likely that the advent of third generation sequencing technologies capable of long-read single-molecule sequencing without prior amplification, such as the those offered by PacBio and Oxford Nanapore, may enable not only assembly of genomes with high accuracy but also the construction of whole genome quasispecies (Giallonardo et al., 2014).
Genomics data obtained using NGS can provide valuable insights into features of viral populations, such as viral attenuation, response to host immune pressure, disease severity and transmission, which are currently masked by other measures.Taken together it is likely that the use of NGS on routine basis will allow the development of rapid and tailored responses to combat the elusive PRRS disease.

Fig. 1 .
Fig. 1.Phylogenetic analysis of whole genome PRRSV strains.Whole genome sequences from 336 PRRSV isolates currently available in the Genbanks were aligned with the program MUSCLE.The unrooted maximum likelihood phylogenetic tree with 500 bootstrap replications was constructed using MEGA5 and the tree drawn with iTOL.Accession number of each sequence precedes the isolate's name.The two main clades separate the virus into the European genotype 1 and North-American-like/Asian genotype 2.

Table 1
Comparison of conventional Sanger and NGS technologies.

Table 2
Sample preparations and aims.