Genome-Wide Analysis of Human Long Noncoding RNAs



INTRODUCTION
Imagine that your task today is to investigate a single human long noncoding RNA (lncRNA). This locus is unexceptional, even average, in all respects: Its sequence contains on average nearly three exons (GENCODE version 38), but there is otherwise no specific feature that illuminates whether it is functional or what its molecular mechanisms might be. In the few tissues in which this lncRNA has been identified, it is expressed at low levels, and it is most often absent from other mammals, including closely related species (8,10,22) (Figure 1). If, instead, your task had been to investigate the average protein-coding gene, then this would have been considerably easier, because you would be drawing upon a wealth of highly curated annotations and extensive experimental observations, using sequence features that accurately predict its molecular mechanism and expression data across many tissues, individuals, and species. The absence of such information for lncRNAs leaves you adrift, without a specific hypothesis to investigate. Unfortunately, you cannot make use of model organisms such as mouse because this lncRNA is absent from it, so you decide to adopt the trusted methods of reverse genetics and disrupt this lncRNA in human cells (31). Yet lack of annotation and a mechanistic understanding for this locus causes you to be uncertain what strategy to apply: whether to delete the entire locus or a small portion of it, or to ablate its transcription entirely. Even then, questions remain regarding what cellular phenotypes you should measure, how much these might change, and how you should interpret such observations (see the sidebar titled DNA- Versus RNA-Mediated Function). This is the average experience of a lncRNA biologist. Our aim in this review is to draw conclusions from genome-scale investigations of lncRNAs in the hope that they help biologists either to improve their experimental designs or to choose a different locus to study, one that is more likely to yield robust experimental observations. Other perspectives, which dwell more on molecular mechanisms and functions of individual lncRNAs, have been reviewed extensively elsewhere (88,99).
Taking a gene-centric perspective on lncRNAs raises the problem that a lesson learned from one locus is rarely relevant to others. Our deep functional understanding of Xist, the master regulator of X chromosome inactivation (reviewed in 88), for example, has not aided investigation of tens of thousands of annotated lncRNAs (Table 1). Whenever researchers propose a lncRNA's mechanism, or its involvement in a pathology such as cancer, almost inevitably they herald this as revealing a new paradigm, one that possibly explains the mode of action of many other lncRNAs. Hundreds of publications state that lncRNAs are emerging as important regulators, elements, or components, and 30% of published reviews on lncRNAs since 2012 employed the term emerging. The implication is that lncRNAs are now being revealed almost as bright-colored butterflies, rather than plain-colored chrysalises. Nevertheless, very few lncRNAs have high-quality evidence for such colorful claims. Instead, low-quality evidence abounds, in part because the lncRNA literature has been contaminated by hundreds of paper-mill publications (106) but also because molecular and cellular observations, such as RNA-molecule interactions and gene expression changes, are often deemed important without sufficient evidence.
As well as acclaiming hard-won advances in human lncRNA biology, it is critical that we recognize the field's substantial knowledge gaps. The ubiquity of lncRNAs within and across eukaryotic species has led some to describe lncRNAs as major actors that contribute substantially to most cellular processes and whose RNA sequence variation will ultimately be recognized as greatly altering human traits and disease susceptibility (70). Faced with the same evidence, others view the vast majority of lncRNAs as nonfunctional, spurious by-products of transcription (82). The truth lies between these two extremes: Some transcripts will lack RNA sequence-dependent function, whereas others will harbor variants that predispose individuals to disease.
The existing review literature has focused mostly on experimental and computational methods that elucidate function for individual lncRNAs. This review focuses on approaches applied across the human genome or lncRNA transcriptome, yet it will also have broader relevance for other animals' lncRNAs. Our purpose was to traverse this spectrum of opinion and ultimately rest where accumulated evidence provides greatest support. We hope that this assists researchers in their design of experiments that definitively test the molecular mechanism of selected lncRNA loci. We take a precautionary approach, covering how some computational and experimental observations that are interpretable as indicating lncRNA functionality have alternative and often more mundane explanations.

NUMBER AND SEQUENCE
The lncRNA biologist has any number of lncRNAs to investigate. GENCODE version 38 lists 17,944 lncRNA loci, whereas other catalogs contain vastly more, up to 270,044 lncRNA transcripts (65). These numbers have been compared favorably with the stable number (approximately 20,000) of human protein-coding genes. To elevate the importance of lncRNA loci even further, some researchers claim that "the large majority of the human genome is transcribed into nonprotein-coding RNAs," whereas "only ∼1.2% of the human genome encodes for protein-coding genes" (e.g., 49, p. 1063). The truth, however, is less impressive: Human lncRNA exons span at most 2.3% of the human genome (82), and most intergenic RNA arises from transcription that is initiated within protein-coding genes (1). Moreover, most lncRNAs are expressed at low levels (61,87) (Figure 1). These low levels mean that even if, very optimistically, the number of lncRNA loci is 10-fold greater than the number of protein-coding genes, their molecular output is considerably smaller. A claim that "around 98% of all transcriptional output in humans is non-coding RNA" (69, p. 986) is plausible only when this includes intronic nucleotides of protein-coding gene transcripts.
The size and complexity of the human noncoding transcriptome have been proposed to explain human evolution, development, and cognition (15,71). This anthropocentric argument is undermined by observations that there are many species whose genomes, and thus transcriptomes, are more extensive than humans' (81). Furthermore, the human transcriptome currently appears to be more complex than other species' only because it has been more extensively sequenced.
An additional misapprehension is that lncRNAs are biased toward containing two exons (22), whereas the median exon count is only one when transcriptomes are assembled from short reads. Human lncRNAs are not completely devoid of informative sequence features, however. Short sequence motifs (k-mers) within a lncRNA show modest power to predict its subcellular localization and protein-binding capability (37,55). Longer sequence patterns are generally attributable to transposable elements (TEs), which account for approximately 30% of lncRNA sequence, the majority of Xist exons (26), and approximately half of the human genome overall (53). TE sequence has been proposed to contribute RNA domains that are essential for lncRNA function (48). Support for this proposal comes from transcripts of one TE, human endogenous retrovirus subfamily H, being required for human embryonic stem cell identity (48) and from a fragment of another TE enhancing lncRNA localization to the nucleus (63). Further support was proposed from enrichments of particular TE subtypes in exons versus introns (12), but these results were not adjusted to account for multiple tests. It has also been pointed out that purifying selection of TE insertions would be more consistent with a depletion of TEs than with their enrichment (52). Finally, there is no distinction between the evolution of TE sequence within lncRNA loci and the evolution in sequence adjacent to them (84), which again signifies that most TE insertions are nearly neutral with respect to selective pressure and are not strong predictors of lncRNA functionality. It is difficult to disagree with others' view that "functionality should not be lightly attributed to biochemical activities on the genome, including transposable elements, without proper experimental evidence" (21, p. 1248).
lncRNA annotations are not always perfect. A relatively small number (∼100) of lncRNA annotations are misclassifications, being instead protein-coding genes that contain functional small open reading frames (2,13,66,68,79,80). Such annotations are continually being corrected, and hence many fewer lncRNA misclassifications are expected to remain in current databases.

DNA- VERSUS RNA-MEDIATED FUNCTION
Observation of a phenotype resulting from the ablation or disruption of a lncRNA locus does not necessarily provide information on its mechanism of action. Such an observation can prosaically be the consequence of deleting functional DNA elements overlapping the annotated locus (including promoter and enhancer regions) and other conserved noncoding sequences. It is often helpful to adopt the null hypothesis that phenotypic effects arising from disrupting a lncRNA locus result from DNA-dependent rather than RNA-dependent function. Experiments whose results might lead to rejection of this hypothesis include attempted rescue of phenotypes following reintroduction of the lncRNA transcript or measurement of phenotypes resulting from this transcript's knockdown in a sequence-specific manner.

EXPRESSION
The properties of lncRNAs are generally less pronounced than those of protein-coding mRNAs. Their transcripts tend to be shorter (22), and their promoters are weaker (73) and contain fewer complex transcription factor motifs (73). Cotranscriptional splicing is less efficient (97), and transcription often terminates prematurely (91). Their transcripts tend to be less stable (16), and their abundance is more often tissue and cell type specific (8,61). Overall, these features result in a level of expression that is typically 10-fold lower than mRNAs' (8,22,61,87). These general trends, however, vary widely and cannot reliably predict an individual lncRNA's molecular mechanism (77), with one exception: Transcripts that are rarely transcribed and quickly degraded are less likely to possess RNA sequence-dependent function (91).
These insights have been gleaned by measuring lncRNA expression in bulk samples containing many cells. To observe lncRNA expression from individual promoters, it was necessary to measure allele-specific transcription in single cells (50). Doing so revealed that not only is a lncRNA's expression level lower than mRNAs', but its variability among cells is higher (50). Measuring transcription dynamics showed that lncRNAs have a burst frequency that is fourfold lower than that of mRNAs, and a burst size that is twofold lower (50). For approximately one-third of lncRNA loci, an allele failed to produce a transcriptional burst over a 24-hour period.
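The burst figures above can be combined under the simplest view of bursting, in which mean transcript output is the product of burst frequency and burst size. The following is a minimal sketch under that assumption; the function name and the mRNA reference values are illustrative and are not taken from reference 50.

```python
def mean_output(burst_frequency: float, burst_size: float) -> float:
    """Mean transcripts produced per unit time under a simple bursting model
    (assumption: output = frequency x size)."""
    return burst_frequency * burst_size


# Hypothetical mRNA reference values, in arbitrary units.
mrna_frequency = 1.0
mrna_size = 1.0

# Fold differences stated in the text: fourfold lower burst frequency
# and twofold lower burst size for lncRNAs.
lncrna_frequency = mrna_frequency / 4
lncrna_size = mrna_size / 2

fold_lower = (mean_output(mrna_frequency, mrna_size)
              / mean_output(lncrna_frequency, lncrna_size))
print(fold_lower)  # 8.0
```

If this multiplicative model holds, the two differences together would imply roughly eightfold lower mean output, which is of the same order as the typical 10-fold lower expression reported from bulk samples.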
Expression of a lncRNA does not guarantee that its RNA sequence is functional. Many regions of the human genome are transcribed yet are rapidly degraded by the RNA exosome (92). RNA polymerase II needs to have low DNA sequence specificity to transcribe many genes from diverse promoters. Transcription thus often initiates from nucleosome-depleted regions before terminating prematurely at cryptic polyadenylation sites, yielding unstable RNA by-products. Such transient RNAs can contribute RNA-sequencing reads at a sufficient abundance to surpass arbitrarily set thresholds. For example, one study predicted 53,864 human intergenic lncRNAs, each expressed at approximately one copy per cell or higher in at least one of 127 data sets (44). Although these lncRNAs are quite modestly enriched in functional features (such as histone modifications and evolutionary conservation; see below), a large fraction could represent rarely expressed and unstable RNAs. The inclusion of rare unstable RNAs in lncRNA sets inevitably overestimates the number of stable lncRNA loci. As studies increasingly investigate diverse tissues and cells, in particular by using deeper RNA-sequencing coverage, the overall tally of proposed human lncRNA loci will inevitably rise yet further.
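To illustrate why an "at least one data set" criterion inflates tallies, consider a rare, unstable transcript that has some small chance of surpassing the threshold in any single data set. The sketch below assumes that data sets are independent and that this per-data-set probability is constant; both are simplifying assumptions, and the probabilities shown are hypothetical rather than estimates from reference 44.

```python
def prob_detected_at_least_once(p_per_dataset: float, n_datasets: int) -> float:
    """Probability that a transcript passes the threshold in at least one of
    n independent data sets, given a per-data-set probability p."""
    return 1.0 - (1.0 - p_per_dataset) ** n_datasets


n = 127  # number of data sets cited in the text
for p in (0.001, 0.01, 0.05):  # hypothetical per-data-set probabilities
    print(p, round(prob_detected_at_least_once(p, n), 3))
```

Under these assumptions, even a 1% chance per data set yields a probability of roughly 0.7 of passing in at least one of 127 data sets, so adding samples steadily admits more transient RNAs.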
One strategy to separate high- from low-confidence lncRNAs exploits the principle of replication, a cornerstone of the scientific method. lncRNAs with the highest confidence are those observed as expressed in multiple different samples. The authors of one study, for example, proposed that only 25% of lncRNA loci expressed in granulocytes are robust, on the basis that only these showed expression across all 21 granulocyte samples acquired at three time points from seven donors (56). They further concluded that lncRNAs display significantly greater interindividual expression variability compared with mRNAs.
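The replication principle described above can be sketched as a simple filter that retains only loci expressed in every sample. This is an illustrative sketch, not the procedure of reference 56: the threshold, the locus names, and the expression values are all hypothetical.

```python
from typing import Dict, List


def robust_loci(expression: Dict[str, List[float]],
                threshold: float = 1.0) -> List[str]:
    """Return loci whose expression meets the threshold in every sample."""
    return sorted(
        locus for locus, values in expression.items()
        if values and all(v >= threshold for v in values)
    )


# Hypothetical expression values for three loci across three samples.
example = {
    "lnc_A": [2.1, 3.0, 1.5],
    "lnc_B": [0.0, 4.2, 0.3],  # fails in two samples
    "lnc_C": [1.0, 1.2, 1.1],
}
print(robust_loci(example))  # ['lnc_A', 'lnc_C']
```

Requiring expression in all samples is the strictest form of this filter; a looser variant could require a minimum fraction of samples instead.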
Even though low-confidence lncRNA transcripts arise from rare transcriptional events, they may still be products of interesting cellular processes. DNA damage repair, for example, is facilitated by recruitment of repair factors by RNA to the damaged site (3). The origin of this RNA is debated, but it could result from transcription of the site of damage just prior to, or soon after, the DNA double-strand break. Similarly, opportunities for rapid RNA polymerase II-mediated transcription at sites of open chromatin will arise as chromatin compartments, loops, and domains are more slowly lost in mitosis and re-formed in the G1 phase.

RNA STRUCTURE
It is sometimes claimed that lncRNAs commonly fold into thermodynamically stable tertiary structures. If so, then these might represent functional domains, akin to the structural and functional units of proteins. Indeed, in light of lncRNAs' very modest primary sequence conservation, some suggest that secondary and tertiary structure conservation instead is critical for their function (86). It has not been possible to prove or disprove this conjecture, because although high-resolution structures can be determined for short (<30 bases) sequence segments or some longer sequences whose structures are stabilized by their association with RNA-binding proteins, doing so for full-length lncRNAs is not currently technically feasible. This lack of structural data results from the fact that lncRNAs, in common with other RNAs, do not adopt a single conformation in isolation (30). Rather, each lncRNA samples from a very large number of conformations, ranging from fully unfolded states to more compact structures. Structures that form more rapidly are more common, whereas slowly forming folds are rarer. Rather than adopting a single structure, therefore, lncRNAs form an ensemble of structures, defined as the population-weighted distribution of all their conformations (30). As a lncRNA encounters other molecules, its ensemble shifts its distribution, altering the time-averaged accessibility of binding sites.
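The ensemble concept above can be made concrete with a toy calculation in which the accessibility of each site is the population-weighted average over conformations. This is a sketch only: the three conformations, their populations, and the binary accessibility values are hypothetical and do not come from reference 30.

```python
from typing import List, Sequence


def ensemble_accessibility(populations: Sequence[float],
                           accessible: Sequence[Sequence[int]]) -> List[float]:
    """Population-weighted accessibility per site.

    populations: relative population of each conformation.
    accessible:  one row per conformation; 1 if the site is exposed, else 0.
    """
    total = sum(populations)
    if total <= 0:
        raise ValueError("populations must sum to a positive value")
    weights = [p / total for p in populations]
    n_sites = len(accessible[0])
    return [
        sum(w * conf[i] for w, conf in zip(weights, accessible))
        for i in range(n_sites)
    ]


# Three hypothetical conformations over four sites.
conformations = [
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 1],
]
free = ensemble_accessibility([0.5, 0.3, 0.2], conformations)
bound = ensemble_accessibility([0.2, 0.3, 0.5], conformations)
print(free)
print(bound)
```

Changing only the populations, as might happen when a binding partner stabilizes one conformation, changes the averaged accessibility of every site even though no individual conformation has changed.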
Lacking experimental tertiary structure data, studies began to predict RNA secondary structure content using sequence information only. One such study predicted 35,985 structured RNA elements across the human genome, with an expected false-positive rate of 19.2% (102); another predicted more than 4 million conserved structures at a false-discovery rate of 5-22% (95). Two issues, however, cast doubt on such studies' conclusions. The first is the high proportion (55% in 102) of predicted RNA structures falling outside of transcribed sequence. The second is uncertainty in the reliability of these studies' false-discovery rates (25).
When lncRNA secondary structures form, they are likely stabilized by incorporation within larger molecular complexes. Many sequencing-based methods have been developed to probe RNA interactions and structure (reviewed in 101). These either use small molecules to modify solvent-accessible bases or particular base pairs or use cross-linking and proximity ligation to infer intra- and intermolecular RNA-RNA interactions. As with all methods exploiting high-throughput sequencing as a last step, rare RNA species are underrepresented among observations. RNA in situ conformation sequencing (RIC-seq), for example, predicts 10-fold fewer ncRNA-ncRNA interactions than mRNA-mRNA interactions, and only 5% of 642 hub RNAs (those with relatively high fractions of RNA-RNA interactions) originate from lncRNAs or pseudogenes (9).
Even so, these methods are increasingly being used to predict intramolecular interactions and secondary structures for human lncRNAs such as Xist, HOTAIR, and SRA (101). Such predictions may suffer from unknown technical biases. Also, their predicted structures, even if they occur in vivo, may not confer functionality. Rivas et al. (89) recently investigated whether pairwise covariation in multiple sequence alignments, a reliable indicator of RNA secondary structure, supports the evolutionary validity of these predicted structures. They expected interacting bases within a functional RNA structure to have accumulated compensatory base-pair substitutions over long evolutionary time. They found no evidence for such paired substitutions within proposed structures in Xist, HOTAIR, and SRA lncRNAs, despite their method and data having sufficient power to do so. They cautioned that the "lack of covariation signal in high-power RNA sequence alignments for these lncRNAs suggests that whatever structure they adopt is not detectably constraining their evolution, and thus may not be relevant for their function" (89, p. 3074). Experimentally defined RNA structures, therefore, should not be considered conclusive until experimental evidence of their functional validity is available.
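The idea of compensatory base-pair substitutions can be illustrated with a deliberately simplified check on two alignment columns. This sketch is not the statistical test used by Rivas et al. (89), which assesses significance against a null model; it merely asks whether two canonical pairs that differ at both positions are observed. The toy alignment is hypothetical, and all sequences are assumed to have equal length.

```python
from typing import List, Set, Tuple

# Watson-Crick and G-U wobble pairs.
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"),
             ("C", "G"), ("G", "U"), ("U", "G")}


def pair_types(alignment: List[str], i: int, j: int) -> Set[Tuple[str, str]]:
    """Distinct canonical base pairs observed at columns i and j."""
    return {(s[i], s[j]) for s in alignment if (s[i], s[j]) in CANONICAL}


def shows_covariation(alignment: List[str], i: int, j: int) -> bool:
    """True if two observed canonical pairs differ at both positions,
    as expected after a compensatory double substitution."""
    types = sorted(pair_types(alignment, i, j))
    return any(
        a[0] != b[0] and a[1] != b[1]
        for k, a in enumerate(types)
        for b in types[k + 1:]
    )


# Hypothetical toy alignment of four sequences.
alignment = [
    "GAAAC",
    "CAAAG",
    "GAAAC",
    "AAAAU",
]
print(shows_covariation(alignment, 0, 4))  # True
print(shows_covariation(alignment, 1, 3))  # False
```

A real analysis would also need to account for phylogenetic relatedness and alignment depth, which is why statistical power matters when interpreting the absence of covariation.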

Subcellular Localization
To prioritize mechanistic hypotheses for a specific lncRNA, we soon wish to know its subcellular localization. lncRNAs located only in the chromatin fraction may regulate gene transcription or be by-products of transcriptional noise, whereas cytoplasmic lncRNAs are more likely to act posttranscriptionally (11,99). Unfortunately, large-scale studies of lncRNA subcellular localization (e.g., 7,22,77) have not always agreed on relative nuclear versus cytoplasmic localization, perhaps because of contamination across subcellular fractions or the absence of the nuclear envelope during mitosis. Furthermore, recently developed methods that localize RNAs at subcellular resolution and at a transcriptome scale have been informative of only the most highly abundant lncRNAs, such as MALAT1 (14,27,104).
For larger numbers of lncRNAs, help is at hand from APEX-RIP. This method, which combines engineered ascorbate peroxidase (APEX)-catalyzed proximity biotinylation of endogenous proteins with RNA immunoprecipitation (RIP), identified 81 and 618 intergenic lncRNAs as being enriched in the cytosol and nucleus, respectively, of HEK293T cells (51). In addition, 11 and 28 intergenic lncRNAs were associated with the nuclear lamina and endoplasmic reticulum, respectively.
Large numbers of lncRNAs reside in the nucleus, among which will be by-products of transcriptional noise (see the sidebar titled Transcriptional Noise), newly transcribed RNAs awaiting export from the nucleus, and RNAs regulating transcriptional bursts from proximal protein-coding genes (50). CoT-1 RNAs are surprisingly abundant RNAs that are highly enriched in TE sequences, mostly LINE-1 elements (17). In interphase, these single-stranded RNAs remain tightly associated with their parental chromosome of origin specifically within euchromatin, but not heterochromatin, and appear to promote more open chromatin packaging (42). CoT-1 RNAs represent only one type of a larger class of chromatin-associated RNAs (caRNAs) proposed to form an RNA mesh that helps to assemble large-scale chromatin structure and to regulate chromosome function (76). These caRNAs are derived mostly from pre-mRNAs rather than lncRNAs. To explain caRNA function, it is tempting to invoke the known ability of Xist to spread in cis from its site of synthesis. However, an individual caRNA's chromosomal location and its amount are unlikely to predict its function there (43). An alternative proposal of caRNA function is that "thousands of transcriptional events that simultaneously occur in each cell" (74, p. 662) may organize a cell's nuclear architecture. Observations, however, argue against lncRNAs having such a role, including the rarity of lncRNA transcriptional bursts (50), their very low abundance, and the lack of evidence that lncRNAs are transcribed coordinately.
Active enhancers are often transcribed, yielding mostly short-lived RNA species that are short or long and poly- and/or unpolyadenylated (20,54) and thus, in part, can be defined as lncRNAs (58). Some of these RNAs will be inconsequential, resulting from RNA polymerase II-mediated transcription from regions of transiently open chromatin. As reviewed elsewhere (58), however, enhancer activity can be mediated by the resultant enhancer RNAs. Individual enhancer RNAs have been proposed to promote looping between enhancer and promoter, to bind and regulate transcription factors and coregulators, to promote histone acetylation, and to facilitate transcription elongation (58). Independent confirmation of these observations, however, is often lacking, which limits their generalizability and confidence that they are correct. Furthermore, these different mechanisms are not predictable a priori from, for example, sequence- or chromatin-based signatures. General principles of enhancer RNA mechanisms might be revealed by investigating the molecular consequences of guiding large numbers of these RNAs to their cognate enhancers and/or promoters.

TRANSCRIPTIONAL NOISE
The cellular transcriptional machinery does not perfectly discriminate cryptic promoters from functional gene promoters. This machinery is abundant and so can engage sites momentarily depleted of nucleosomes and rapidly initiate transcription. The chance occurrence of splice sites can then facilitate the capping, splicing, and polyadenylation of long transcripts. A very large number of such rare RNA species are detectable in RNA-sequencing experiments whose properties are virtually indistinguishable from those of bona fide lncRNAs. Consequently, "a sensible [null] hypothesis is that most of the currently annotated long (typically >200 nt) noncoding RNAs are not functional, i.e., most impart no fitness advantage, however slight" (99, p. 26).

Enhancer: a short stretch of DNA that acts to increase transcription of one or more nearby genes; it is often transcribed into an enhancer RNA

Histone Modification
Enhancer RNAs tend to be short (<150 nucleotides), rapidly turned over by the RNA exosome complex, and capped but not polyadenylated or spliced (90). Their transcribed loci, however, can also yield longer transcripts [>200 nucleotides, i.e., enhancer lncRNAs (elncRNAs)] that are polyadenylated, spliced, and more stable. Those elncRNAs with longer half-lives have greater opportunity to have RNA-dependent function, and most will enact this function locally, in cis, rather than in trans. Trans-acting lncRNAs will need to have even greater stability, and thus few will be transcribed from enhancer regions.
RNA stability, function, and subcellular localization are poorly predicted by sequence features. Instead, Marques et al. (67) defined two lncRNA classes by their relative levels of histone H3K4 mono- and trimethylation at transcriptional initiation regions. Those with higher levels of monomethylation (H3K4me1), a canonical marker of enhancer regions, were classified as elncRNAs; those with higher levels of trimethylation (H3K4me3), a canonical marker of promoters, were classified as promoter lncRNAs (plncRNAs) (33,67). The two lncRNA subtypes are indistinguishable with respect to their length, number of exons, and transcriptional orientation relative to their closest neighboring gene (67). Distinguishing elncRNAs from plncRNAs based on chromatin marks is necessarily specific to each tissue or cell type, yet is relatively robust because elncRNAs are infrequently categorized as plncRNAs (or vice versa) in a second tissue or cell type (6,67).
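The classification rule described above can be sketched as a comparison of the two histone-mark signals at a locus's transcriptional initiation region. This is a minimal illustration and not the published procedure of Marques et al. (67), which may apply additional thresholds or normalization; the function name and the signal values are hypothetical.

```python
def classify_lncrna(h3k4me1: float, h3k4me3: float) -> str:
    """Classify a locus by the relative levels of H3K4me1 and H3K4me3
    at its transcriptional initiation region (same units assumed)."""
    if h3k4me1 > h3k4me3:
        return "elncRNA"   # enhancer-like signature
    if h3k4me3 > h3k4me1:
        return "plncRNA"   # promoter-like signature
    return "unclassified"  # equal signal gives no basis for a call


# Hypothetical signal values for three loci in one cell type.
print(classify_lncrna(h3k4me1=8.0, h3k4me3=2.5))  # elncRNA
print(classify_lncrna(h3k4me1=1.2, h3k4me3=9.4))  # plncRNA
print(classify_lncrna(h3k4me1=3.0, h3k4me3=3.0))  # unclassified
```

Because chromatin marks differ among tissues, such a call would need to be repeated for each tissue or cell type of interest.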
What separates elncRNAs from plncRNAs is their lower and more tissue-specific expression and a strong depletion of CpG islands at their transcriptional initiation regions (67). Furthermore, altered expression of elncRNAs, but not plncRNAs, correlates with expression levels of neighboring protein-coding genes (6,67), indicating that the elncRNA locus and/or its RNA enhances this gene's activity. Because elncRNAs tend to lack sequence conservation, however, it is more likely their act of transcription, rather than their RNA transcripts, that mediates enhancer activity in cis (67). By contrast, plncRNAs show modest sequence conservation, implying that some act in trans. In summary, elncRNAs and plncRNAs are distinguished by their H3K4me1 and H3K4me3 marks, respectively, at their transcriptional initiation regions and tend to be involved in transcriptional and posttranscriptional regulation, respectively.

EVOLUTION: CONSERVATION AND CONSTRAINT
After the discovery of lncRNAs, some investigators claimed that they lack conservation (83) whereas others saw them as being highly conserved (38). Both could be true, of course, should each lncRNA contain mostly poorly conserved, yet also some richly conserved, sequence. Resolution of this evolutionary question was important: Mutation of conserved lncRNA sequence would be expected to bring functional and phenotypic consequences, including disease; conversely, mutation within nonconserved lncRNA sequence could have no functional or phenotypic effect.
On one side of this argument are lncRNA enthusiasts who propose that all lncRNAs are functional and that evolutionary arguments opposing this view are unreliable. In 2013, Mattick & Dinger (72) wrote, "[N]oncoding RNAs usually show evidence of biological function in different developmental and disease contexts, with, by our estimate, hundreds of validated cases already published and many more en route, which is a big enough subset to draw broader conclusions about the likely functionality of the rest" (p. 2). Arguing against this conclusion, however, are observations that lncRNAs are mostly dispensable for viable vertebrate development (31) (discussed further below).
On the other side of this debate are evolutionary biologists who hold that a century-old theoretical evolutionary framework can be trusted to provide deep insight into molecular structure, function, and disease. With a neutral model of evolution, lncRNAs were estimated to contain only a small fraction (4.1-5.5%) of functional sequence, implying that mutations in the remaining sequence would not alter reproductive fitness (84). Mattick & Dinger (72) responded that this model's notion of selective neutrality was highly questionable. This was despite the model being founded on only one assumption: specifically, that mutations (in this case insertions or deletions) occur randomly within neutrally evolving sequence (64). Rather than assuming selective neutrality within ancient TE sequence, as Mattick & Dinger claimed, the model predicted that more than 99% of such sequence evolved neutrally (64).
These arguments focus on species-level sequence conservation that becomes evident after the removal of many deleterious variants over millions of years since these species' last common ancestor. Functionality is not always conserved among species, however, because lineage-specific biology can emerge and ancestral biology can be lost (see the sidebar titled Rapid Turnover). To infer the functionality of sequence lacking between-species conservation requires a complementary approach using within-species (i.e., population) data. This approach's signature of functionality is the shift to low population frequency of newly emergent alleles. This shift indicates evolutionary constraint and a tendency for deleterious variants to be purged from this population.

RAPID TURNOVER
When compared with lncRNAs from other species, human lncRNAs are unexceptional in their length, exon count, tissue specificity, and expression level (46). Like lncRNAs from other species, human lncRNAs have not evolutionarily persisted over many tens of millions of years (46,78). They thus arose ("were born") and were lost ("died") rapidly (57); indeed, faster than any other functional element type (85). The rapid evolutionary turnover of sequence and transcription leaves few transcribed lncRNA loci in positionally equivalent (i.e., orthologous) genomic locations (46). Rapid evolution of lncRNA loci is consistent with a small contribution to reproductive fitness and thus with absent or relatively minor organismal functions. Thus, the wider the phyletic range of a human lncRNA is (i.e., its evolutionary spread across divergent animal species), the greater the likelihood is that it plays a greater role in human biology.

Constraint: the evolutionary property of reduced reproductive fitness when a sequence is mutated
Molecular function: a property of a molecule that when changed alters fitness

One of two polar opposite outcomes was expected from applying this constraint approach to the human population. In one, lncRNA sequence would be highly constrained even if it was poorly conserved in other species, indicative of important human-specific functions; in the other, human lncRNA sequence would be poorly constrained, consistent with its weak conservation over longer evolutionary intervals. Population data provided compelling evidence for this second outcome: specifically, that newly arising mutations in human lncRNAs are seldom deleterious (24,40). Recent evidence shows that strong selection is almost entirely absent in human lncRNAs whose sequence is not conserved in other species (24).
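The population-based signature of constraint, a shift of variants toward low allele frequency, can be sketched by comparing the fraction of rare variants in a test sequence class against a putatively neutral reference. This is an illustrative sketch rather than the method of references 24 or 40: the rare-allele cutoff and all allele frequencies below are hypothetical.

```python
from typing import Sequence


def rare_fraction(allele_frequencies: Sequence[float],
                  rare_cutoff: float = 0.01) -> float:
    """Fraction of variants whose allele frequency is below the cutoff."""
    if not allele_frequencies:
        raise ValueError("no variants supplied")
    rare = sum(1 for f in allele_frequencies if f < rare_cutoff)
    return rare / len(allele_frequencies)


# Hypothetical allele frequencies for three sequence classes.
neutral_reference = [0.003, 0.25, 0.06, 0.005, 0.4, 0.1]
constrained_like = [0.001, 0.002, 0.004, 0.008, 0.03, 0.2]
unconstrained_like = [0.002, 0.2, 0.05, 0.004, 0.3, 0.12]

baseline = rare_fraction(neutral_reference)
print(rare_fraction(constrained_like) - baseline)    # positive: excess of rare variants
print(rare_fraction(unconstrained_like) - baseline)  # near zero: no shift
```

An excess of rare variants relative to the reference would be consistent with purging of deleterious alleles, whereas the outcome reported for most human lncRNA sequence corresponds to the second, near-zero case.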
Although not definitive, evolutionary methods can suggest whether conservation and constraint are due to either DNA- or RNA-dependent function. Conservation tends to be strongest close to the lncRNAs' 5′ ends (46) and their promoters (84). Because elncRNA exons tend to lack conservation (67), such loci with conserved promoters could act in cis via the process of transcription rather than having RNA sequence-dependent function. By contrast, plncRNAs, whose exons also exhibit modest conservation (67) and whose CpG-associated promoters are well conserved (84), are more likely to act in trans in an RNA sequence-dependent manner.

SPLICING
Splicing patterns also evolve rapidly. Fewer than one-third of human lncRNA splicing events are conserved in rodents, for example, much less than the fraction for human mRNAs (∼90%) (103). The human-rodent conservation of GT-AG dinucleotides, which are necessary for efficient splicing, is modest, which implies that lncRNAs' intron location and/or splicing contribute functionally (84). Other sequence features of a spliced transcript, such as exonic splice enhancers (ESEs), also facilitate efficient expression and splicing. ESEs are purine-rich hexamers that bind splicing regulator proteins to aid recognition of splice sites. Unexpectedly, ESEs are unusually frequent near lncRNA splice junctions, occurring at a density comparable to that of ESEs at human mRNAs' splice junctions (41,93).
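ESE density near splice junctions can be estimated by counting hexamer matches within a fixed window of each exon end. The sketch below is a simplified illustration, not the procedure of references 41 or 93: the window size, the example exon, and the single placeholder hexamer are hypothetical, and a real analysis would use a curated ESE hexamer set.

```python
from typing import Set


def ese_density(exon: str, hexamers: Set[str], window: int = 50) -> float:
    """ESE hexamer matches per candidate position within `window`
    nucleotides of either exon end."""
    exon = exon.upper()
    k = 6
    n_positions = len(exon) - k + 1  # number of possible hexamer starts
    if n_positions <= 0:
        return 0.0
    # Keep hexamers lying fully within the window at the 5' or 3' exon end.
    near = [
        p for p in range(n_positions)
        if p + k <= window or p >= len(exon) - window
    ]
    if not near:
        return 0.0
    hits = sum(1 for p in near if exon[p:p + k] in hexamers)
    return hits / len(near)


# Placeholder purine-rich hexamer and a hypothetical 32-nucleotide exon.
placeholder_eses = {"GAAGAA"}
toy_exon = "GAAGAA" + "C" * 20 + "GAAGAA"
print(ese_density(toy_exon, placeholder_eses, window=10))
```

Comparing this density between lncRNA and mRNA exons, or between exon ends and exon interiors, is one way to express the enrichment described above.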
ESEs have evolved unusually slowly under purifying selection, with splicing motifs accounting for virtually all selection on human lncRNA sequence (41,93).This implies that splicing of multiexon lncRNAs is critical to their molecular function.Furthermore, multiexon elncRNAs are more likely than single-exon elncRNAs to be conserved over mammalian evolution (96).Exonic sequence in multiexon lncRNAs also tends to have a higher GC nucleotide content, relative to their introns, which could reflect selection on G or C alleles to improve the efficiency or robustness of splicing and/or transcription (41).
These evolutionary observations begin to explain why efficient elncRNA splicing is associated with increased enhancer activity for nearby protein-coding genes (28,33,96). However, how lncRNA processing strengthens enhancer activity for this neighboring gene remains unclear. Models include recruitment, during elncRNA splicing, of transcription, spliceosome, and/or chromatin-regulatory factors to the protein-coding gene via chromosomal looping or chromatin remodeling or via the short-lived elncRNA transcript itself (33,96). Much investigation is ongoing to determine the various mechanisms underlying enhancer activity, including those involving RNAs. These mechanisms are likely diverse, involving multiple protein or RNA factors, enhancers, and chromatin states, and lie beyond the scope of this review.

MOLECULAR INTERACTIONS
To address a mechanistic hypothesis for a particular lncRNA, an investigator usually employs experiments that target its transcript rather than experiments that survey whole transcriptomes. A single lncRNA can be tested for its interaction with protein or DNA or another RNA class, such as microRNA (miRNA). Such a small-scale study has advantages of greater feasibility, lower cost, and reduced statistical testing burden over higher-throughput approaches. By contrast, a transcriptome-wide approach, interrogating protein, DNA, or RNA interactions for all lncRNAs, contextualizes each interaction among them all. Despite their higher cost and statistical burden, such methods can open up previously unanticipated lines of inquiry that better explain existing observations.
Experiments need well-chosen controls to account for nonphysiological interactions. Mammalian lncRNA studies have used bacterial RNA controls to account for nonspecific binding. These revealed, for example, that Polycomb repressive complex 2 (PRC2) binds bacterial RNA promiscuously (18), as do many mouse genomic loci (100). Investigators might be wise to explicitly distinguish terms: Colocalization may not involve direct contact, interaction may be fleeting and/or inconsequential, and binding implies mechanistic function.
Once there is sufficient evidence of a lncRNA's molecular interaction, the subsequent challenge is to determine whether and how it contributes to cellular and organismal biology. We present four counterexamples, illustrative of nonfunctional interactions. First, engagement of a lncRNA by the ribosome does not result in its translation (39). Second, despite an interaction between the lncRNA Hotair and PRC2 protein, there is little evidence that this interaction modifies PRC2's set of Polycomb target genes (94). Third, some lncRNA-bound chromatin regions fall outside of regulatory elements (including promoters or enhancers), and the activity of genes near to these regions is not always altered (e.g., 100). Finally, of the ∼10^6 cataloged miRNA-lncRNA interactions (59), which were mostly identified in cell lines, many will not be relevant to human biology.
Experiments are more likely to uncover molecular mechanisms if they carefully employ genetic deletions rather than relying only on knockdown approaches, because the latter suffer from off-target effects (23). Results will also be more compelling if the molecular, cellular, or organismal effects of perturbing a lncRNA correlate with its dosage (100) or are reproduced in experiments applied to different species.

PHENOTYPES
Review articles commonly state that many lncRNAs are "associated with," "linked to," or "involved in" human diseases. Experimental support for associations, links, or involvements has been collated into databases (75,105). These observations, however, can be interpreted differently, yielding alternative explanations, for example:
lncRNA abundance: A lncRNA is differentially expressed between normal and cancer cells, and its expression predicts poor overall patient survival. Nevertheless, if its expression changes only as a consequence of cellular transformation, then the effect on the lncRNA is a by-product, not a causal driver, of oncogenesis.
Genetic association: A lncRNA locus contains single-nucleotide polymorphisms (SNPs) significantly associated with disease risk. Nevertheless, if these SNPs fail to correlate with this lncRNA's abundance or molecular mechanism in disease-relevant cells, then it does not causally alter disease risk. Even when these SNPs correlate with lncRNA expression, they may not be causal if they also correlate with altered expression or function of other genes nearby, including in other cell types or developmental stages.
Molecular interaction: A lncRNA interacts with a protein that is mutated in human disease. Nevertheless, this interaction may have no cellular or organismal consequences or have no effect on disease processes, and so it is not conclusive that the lncRNA's interaction modulates disease risk.
Disruption of predicted binding site: A somatic or inherited disease-linked mutation alters a lncRNA's predicted binding affinity for DNA, miRNA, or protein. Such predictions, however, typically suffer from high false-positive rates (type 1 errors). Moreover, even if binding occurs, this need not contribute to disease.
Presently, the syntax relating lncRNA variants to function or structure is unknown. We do not know how to glean the few variants that alter lncRNA function from among the vast majority that either are nonfunctional or alter the function of other molecules. Current approaches rely on chancing on a functional variant and investigating a narrow set of all possible mechanistic hypotheses. For lncRNA biology to advance, more principled and higher-throughput experiments will be required.
Priority locations for deciphering this syntax are lncRNA splice sites and/or exons, owing to their concentrated evolutionary conservation (see above). A large-scale CRISPR-Cas9 screen used paired single guide RNAs (sgRNAs) targeting splice sites to excise exons from 10,996 human lncRNA loci. Four percent of these lncRNAs were initially proposed as being essential for cellular growth in three cell lines (62), although at least one-third of these are likely false-positive observations (47). A CRISPR interference (CRISPRi) screen, which represses transcription rather than removing DNA, predicted that 3% of 16,401 lncRNA loci are required for cellular growth (60). Nevertheless, these findings need to be treated with caution because most phenotypic observations did not reproduce across multiple cell lines. Predictions that 3-4% of lncRNAs show a growth phenotype in at least one cell line may actually be overestimates because CRISPR and other technologies are not immune to off-target effects (31,34,98) and because CRISPRi targeting of lncRNA loci can inadvertently repress DNA elements that regulate expression of nearby protein-coding genes.
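To make the scale of these screen results concrete, the short calculation below converts the quoted percentages into approximate numbers of loci. It is only a back-of-the-envelope sketch: the published percentages are rounded, so the implied counts are approximate, and the variable names are illustrative.

```python
def implied_hits(n_loci, hit_fraction):
    """Approximate number of loci implied by a rounded hit fraction."""
    return round(n_loci * hit_fraction)


# Exon-excision CRISPR-Cas9 screen: 4% of 10,996 loci (62).
deletion_hits = implied_hits(10_996, 0.04)

# CRISPRi screen: 3% of 16,401 loci (60).
crispri_hits = implied_hits(16_401, 0.03)

# If at least one-third of the deletion hits are false positives (47),
# then at most two-thirds of them remain as candidates.
deletion_upper = deletion_hits * 2 / 3
deletion_upper_fraction = deletion_upper / 10_996

print(f"Deletion screen: ~{deletion_hits} loci initially proposed")
print(f"After false-positive correction: at most ~{deletion_upper:.0f} loci "
      f"({deletion_upper_fraction:.1%} of those screened)")
print(f"CRISPRi screen: ~{crispri_hits} loci predicted")
```

Under these assumptions, the deletion screen supports growth phenotypes for fewer than 3% of screened loci, before accounting for off-target effects or poor reproducibility across cell lines.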
Studying lncRNA biology using human cell lines is pragmatic. Nevertheless, the range of these lines' phenotypes is limited mostly to growth, which is not directly relevant to most human traits and disease phenotypes. Instead, a model organism is required to study lncRNAs' contributions over a broad spectrum of physiological and behavioral phenotypes, across a wide range of conditions, developmental stages, and experimental stimuli. A frequent choice of model organism, because of its phyletic proximity to human, ease of maintenance, and genetic homozygosity, is the laboratory mouse. Nevertheless, this choice immediately limits investigation to only approximately 10% of human lncRNAs, because only these possess a single ortholog in mouse (78).
Phenotypes are known for mouse protein-coding genes at the genome scale. Thousands have been individually disrupted and phenotyped using standardized protocols, with a large majority yielding discernible phenotypes (5). By contrast, a genome-wide investigation of mouse lncRNA loci has not been attempted, and so their phenotypes have not been elucidated using standardized approaches. The central question of whether disruption of a mammalian lncRNA locus commonly results in an overt phenotype thus remains unanswered.
Smaller-scale studies have reported mouse phenotypes for several dozen in vivo lncRNA knockouts. Altered phenotypes range across various physiologies and behaviors, yielding conclusions that mammalian lncRNAs contribute to diverse cellular and physiological processes. Nevertheless, these conclusions are often controversial even for well-known lncRNAs (31). Alternative mechanistic explanations of loss-of-function phenotypes are possible. These include removal of a functional DNA element irrelevant to the lncRNA; unintended effects arising from read-through transcripts; and introduced reporter genes, sites, or transcriptional terminators (4,31). Optimal evidence for RNA-dependent lncRNA function derives from loss of function, followed by complementation approaches (e.g., 35).
Other studies report a lack of phenotypic change following disruption of a lncRNA locus (reviewed in 31). These loci may contribute functions whose disruption causes subtle phenotypes that are unobserved in experimental conditions, or are evident only under particular environmental conditions or after stimulus. From discussions with other lncRNA biologists, we believe that when disruption of a lncRNA locus fails to yield a phenotype, this important observation is often not reported in the published literature. If so, then this file-drawer effect introduces a publication bias that creates a false expectation of the likely success of future experiments.
In summary, among mouse lncRNA loci that have been targeted for disruption and phenotypic scrutiny, many have yielded either no in vivo phenotypes or effects that are not always replicated when different strategies to disrupt the locus are adopted. In the absence of strong evidence to the contrary, therefore, the expectation should be that natural mutations within human lncRNAs only rarely cause overt phenotypes.

TRAITS AND DISEASES
A transcriptome-wide association study (TWAS) can yield evidence that a lncRNA contributes to a human disease or trait. In this approach, a genetic association signal for a lncRNA's abundance in a particular tissue is first estimated. Subsequently, this signal is compared with the genetic association signal for a disease or trait. If the two association signals (one for lncRNA expression, the other for the disease or trait) are concordant across a chromosomal region, then they are colocalized, and both are explicable by a single causal DNA variant. Colocalization provides evidence of the lncRNA's role in causing the disease or trait.
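The logic of colocalization can be sketched with one widely used formulation, which compares hypotheses using approximate Bayes factors under the assumption of at most one causal variant per signal. This is an illustrative sketch, not the specific pipeline of any study cited here: the function names are hypothetical, and the prior effect-size standard deviation (0.15) and prior probabilities (p1, p2, p12) are assumed conventional values that a real analysis would need to justify.

```python
import math


def log_abf(z, se, prior_sd):
    """Wakefield's log approximate Bayes factor for association at one variant."""
    shrinkage = prior_sd ** 2 / (prior_sd ** 2 + se ** 2)
    return 0.5 * (math.log(1.0 - shrinkage) + shrinkage * z * z)


def log_sum_exp(values):
    """Numerically stable log of a sum of exponentials."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def coloc_posteriors(z_expr, se_expr, z_trait, se_trait,
                     prior_sd=0.15, p1=1e-4, p2=1e-4, p12=1e-5):
    """Posterior probabilities for five hypotheses across one region:
    H0 no association; H1 expression only; H2 trait only;
    H3 both, with distinct causal variants; H4 both, with one shared variant."""
    n = len(z_expr)
    if n < 2 or not (len(se_expr) == len(z_trait) == len(se_trait) == n):
        raise ValueError("need at least two variants and equal-length inputs")
    l1 = [log_abf(z, s, prior_sd) for z, s in zip(z_expr, se_expr)]
    l2 = [log_abf(z, s, prior_sd) for z, s in zip(z_trait, se_trait)]
    # H3 sums over all pairs of different variants (quadratic, but explicit).
    distinct_pairs = [l1[i] + l2[j] for i in range(n) for j in range(n) if i != j]
    shared = [a + b for a, b in zip(l1, l2)]
    log_h = {
        "H0": 0.0,
        "H1": math.log(p1) + log_sum_exp(l1),
        "H2": math.log(p2) + log_sum_exp(l2),
        "H3": math.log(p1) + math.log(p2) + log_sum_exp(distinct_pairs),
        "H4": math.log(p12) + log_sum_exp(shared),
    }
    total = log_sum_exp(list(log_h.values()))
    return {name: math.exp(value - total) for name, value in log_h.items()}


# Toy region of five variants with equal standard errors (made-up values).
se = [0.05] * 5
z_expr = [0.5, 1.0, 6.0, 1.2, 0.3]
shared = coloc_posteriors(z_expr, se, [0.2, 0.8, 5.5, 1.0, 0.1], se)
distinct = coloc_posteriors(z_expr, se, [5.5, 0.8, 0.2, 1.0, 0.1], se)
print({k: round(v, 3) for k, v in shared.items()})
print({k: round(v, 3) for k, v in distinct.items()})
```

In the first toy case both signals peak at the same variant and H4 receives most of the posterior weight; in the second the peaks differ and H3 is favored instead. A high posterior for H4 is what is meant above by two signals being colocalized.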
De Goede et al. (19) recently used TWAS and colocalization analysis to determine whether genetically determined expression of 14,100 lncRNA loci across 49 tissues might contribute to 101 distinct complex genetic traits. They identified 83 lncRNA and disease or trait pairs for which colocalization evidence indicated that the lncRNA was more likely to causally alter disease risk or trait than any nearby protein-coding gene. First in this list of 83 pairs, for example, was CYLD-AS1, whose genetic signal of expression in testis or esophageal mucosa was colocalized with the genetic association signal for Crohn's disease; nearby protein-coding genes failed to be colocalized in this manner. Their overall conclusion was that "these colocalization events represent robust connections between genetic variation, lncRNA gene expression, and complex traits" (19, p. 2644).
Nevertheless, these authors acknowledged that their large expression data set is not comprehensive over all cell types, developmental stages, and environmental stimuli. This means that their study cannot be considered complete over all association signals for protein-coding gene or lncRNA expression and thus that a missing association signal for a protein-coding gene may explain a disease association signal better than the available lncRNA expression signal. For the CYLD-AS1 prediction discussed above, for example, other expression data analyzed by Open Targets Genetics (32) deprioritized CYLD-AS1 as causally altering risk of Crohn's disease and prioritized one or more of six neighboring protein-coding genes (in this example, BRD7, ADCY7, CYLD, NKD1, TENT4B, and SNX20). Use of many expression data sets is recommended, therefore, when prioritizing lncRNAs. Results from such analyses are provided by Open Targets Genetics (32) and are facilitated by platforms such as MR-Base (45).
Figure 1 Genomic and transcriptomic characteristics of lncRNAs in human. (a) Distribution of exon counts for lincRNAs. (b–e) Comparisons of exonic GC content (panel b), nucleotide conservation across vertebrates (phastCons, University of California, Santa Cruz) (panel c), maximum expression across GTEx tissues (36) (panel d), and expression specificity (τ, where 0 indicates broad expression and 1 indicates tissue-specific expression) (panel e) for lincRNAs, pseudogenes, and protein-coding genes. (f) RNA-seq coverage across lncRNAs and RefSeq models. GTEx RNA-seq experiments were carried out in a nonstranded manner, leading to imprecise expression estimates for overlapping genes. Transcribed pseudogenes are a class of noncoding RNAs whose evolutionary origin is more recent than those of other noncoding RNAs. Abbreviations: FPKM, fragments per kilobase of transcript per million mapped reads; GTEx, Genotype-Tissue Expression; lincRNA, long intergenic noncoding RNA; lncRNA, long noncoding RNA; RefSeq, Reference Sequence; RNA-seq, RNA sequencing; TPM, transcripts per million. Panel f adapted with permission from Reference 25; annotations for lncRNAs, pseudogenes, and protein-coding genes were extracted from GENCODE version 38 (29).
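The expression specificity index τ shown in Figure 1e can be illustrated with a brief sketch. The caption does not state how τ was computed; the code below assumes the commonly used definition, τ = Σ(1 − x_i/x_max)/(n − 1) over n tissues, and the function name and toy expression values are hypothetical.

```python
def tau_specificity(expression):
    """Tissue-specificity index tau for one gene (assumed standard definition).

    expression: non-negative expression values, one per tissue.
    Returns 0 for uniform expression and 1 for expression in a single tissue.
    """
    values = [float(x) for x in expression]
    if len(values) < 2:
        raise ValueError("tau needs expression values from at least two tissues")
    if any(v < 0 for v in values):
        raise ValueError("expression values must be non-negative")
    peak = max(values)
    if peak == 0:
        raise ValueError("tau is undefined for a gene with no expression")
    # Average shortfall from the peak tissue, normalized by (n - 1).
    return sum(1.0 - v / peak for v in values) / (len(values) - 1)


print(tau_specificity([5, 5, 5, 5]))   # uniformly expressed gene
print(tau_specificity([0, 0, 8, 0]))   # gene expressed in one tissue only
print(tau_specificity([10, 5, 0]))     # intermediate case
```

Whether values are log-transformed or thresholded before computing τ varies between studies and affects the result, so the values plotted in the figure may not match this sketch exactly.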