The genomic basis of host and vector specificity in non-pathogenic trypanosomatids

ABSTRACT Trypanosoma theileri, a non-pathogenic parasite of bovines, has a predicted surface protein architecture that likely aids survival in its mammalian host. Their surface proteins are encoded by genes which account for ∼10% of their genome. A non-pathogenic parasite of sheep, Trypanosoma melophagium, is transmitted by the sheep ked and is closely related to T. theileri. To explore host and vector specificity between these species, we sequenced the T. melophagium genome and transcriptome and an annotated draft genome was assembled. T. melophagium was compared to 43 kinetoplastid genomes, including T. theileri. T. melophagium and T. theileri have an AT biased genome, the greatest bias of publicly available trypanosomatids. This trend may result from selection acting to decrease the genomic nucleotide cost. The T. melophagium genome is 6.3Mb smaller than T. theileri and large families of proteins, characteristic of the predicted surface of T. theileri, were found to be absent or greatly reduced in T. melophagium. Instead, T. melophagium has modestly expanded protein families associated with the avoidance of complement-mediated lysis. We propose that the contrasting genomic features of these species is linked to their mode of transmission from their insect vector to their mammalian host. This article has an associated First Person interview with the first author of the paper.


INTRODUCTION
Trypanosomatidae are a family of single-celled eukaryotes of the class Kinetoplastea, which are characterised by a specialised mitochondrial genome, the kinetoplast. Trypanosomatidae are monoxenous (single host) or dixenous (two host) species. Dixenous trypanosomatids are obligate parasites of a broad diversity of animals and plants whilst monoxenous species are largely restricted to insects (Podlipaev, 2001). However, the taxonomy of trypanosomatids cannot be distilled into these two broad categories, as many monoxenous species opportunistically infect vertebrates (Kaufer et al., 2017) and some dixenous species have subsequently reverted to a monoxenous lifecycle (Schnaufer et al., 2002).
Expansion into vertebrate hosts gave rise to clades of trypanosomatids which represent medical and veterinary threats. Notably, Trypanosoma cruzi and Leishmania spp. cause important diseases in humans, Chagas disease and Leishmaniasis, respectively. Also, Trypanosoma brucei has been subjected to intense molecular and cytological study as two of its subspecies, T. b. gambiense and T. b. rhodesiense, are causative agents of Human African Trypanosomiases (HAT) (Cayla et al., 2019). In addition to the medical impact of HAT, other African trypanosome species such as T. b. brucei, Trypanosoma vivax and Trypanosoma congolense cause morbidity and mortality in livestock, constraining agricultural development (Mehlitz and Molyneux, 2019).
The divergence in trypanosomatid morphology, transmission, and lifecycle development has facilitated an adaption to a diverse range of hosts and vectors. The different surface protein adaptions of T. brucei, T. cruzi and Leishmania spp. exemplify the variant strategies adopted by these parasites to evade the immune systems of their hosts and vectors. For example, T. brucei is extracellular and proliferates in the blood and tissue of mammals. Its cell surface is encoded by a very extensive repertoire of variant surface glycoproteins (VSGs) (Berriman et al., 2005;Wickstead et al., 2004). VSG variation is essential to sustain long-term infections, during which antigenically distinct VSG types dominate at each peak, facilitating host immune evasion (Borst, 2002;Pays et al., 2004). T. cruzi is an intracellular parasite of wild and domestic mammals. The nuclear genome contains an expanded family of mucin genes that represent up to 6% of the genome (Buscaglia et al., 2006) and, along with trans-sialidases (Nardy et al., 2016), these genes enable sustained infections. Leishmania spp. are intracellular trypanosomatids whose cell surface is covered by a thick layer of glycoconjugates, including families of GP63 major surface proteases (MSPs), also known as leishmanolysin (Yao, 2010).
Studies have largely focused on pathogenic dixenous trypanosomatids, which can hold broad host niches, capable of infecting multiple mammalian species (Funk et al., 2013). In contrast, non-pathogenic dixenous trypanosomes, such as Trypanosoma theileri and Trypanosoma melophagium, can be highly specific to their host and vector. These species represent an attractive model to study the basis of host and vector specificity. For the remainder of this manuscript, we refer to species that cause no overt pathogenicity in immunocompetent hosts as non-pathogenic. We recognise that 'nonpathogenic' species may still cause slight detriment to their hosts, which is often difficult to detect, especially when prevalence is high in the host population.
Trypanosoma theileri is a non-pathogenic bovine parasite which has a reported prevalence of 80% in cattle in the US and Europe when screened via culture-based methods (Farrar and Klei, 1990;Matthews et al., 1979;Mott et al., 2011;Schlafer, 1979). T. theileri is transmitted by tabanid flies (Böse and Heister, 1993). It can cause lifelong infections but remains at an extremely low parasitaemia in immunocompetent animals, indicating the presence of an effective host immune evasion mechanism or strict self-imposed population control, which prevents overt disease (Doherty et al., 1993;Seifi, 1995). Experimentally, T. theileri can sustain an infection for at least 12 weeks (Mott et al., 2011) which, combined with their nonpathogenic nature, has stimulated development of T. theileri as a potential vaccine delivery vehicle (Mott et al., 2011). The genome of T. theileri encodes five predicted surface protein families. These are four unique protein families, T. theileri putative surface proteins (TTPSPs) and one MSP family . Together these genes represent ∼9% of the genome of T. theileri, comparable to the representation of the VSG gene family in T. brucei brucei (TREU927/4) (Berriman et al., 2005). Lastly, the trans-sialidases that characterise T. cruzi were also found to be highly expressed in T. theileri. These findings led to the suggestion of a novel immune evasion mechanism in T. theileri, contrasting with the well-known system in African trypanosomes and distinct from that in T. cruzi or Leishmania . T. theileri is at the base of a clade that comprises Trypanosoma rangeli, Trypanosoma cruzi and Trypanosoma grayi (Kelly et al., 2014, distinct from African trypanosomes, such as T. brucei. T. grayi is transmitted via the tsetse fly between African crocodilians via ingestion of tsetse faeces containing infective metacyclic forms (Hoare, 1929(Hoare, , 1931. Trypanosoma melophagium is a non-pathogenic trypanosome of the subgenus megatrypanum, transmitted between sheep via the sheep ked. This flightless insect vector has been eradicated from much of its original geographic distribution due to widespread pesticide use. However, where the sheep ked persists, it often carries T. melophagium (Gibson et al., 2010;Martinkovićet al., 2012). In a study of organic sheep farms, T. melophagium was found to be present in 86% of keds, however, blood smears from sheep on the same farms did not detect trypanosomes (Martinkovićet al., 2012). Other surveys via blood culture found 7.8% of sheep to be infected with T. melophagium (Serpil and Zafer, 2008). It has historically been argued that T. melophagium is a monoxenous parasite of the sheep ked and that the mammalian host is obsolete for transmission (Flu, 1908;Porter;Swingle, 1911). However, extensive studies demonstrated a mammalian host is required (Hoare, 1923). Experimental infections of sheep with T. melophagium suggested the longest infection lasts 3 months and there is no lasting immunity as sheep can be reinfected with T. melophagium after several months of isolation (Gibson et al., 2010;Hoare, 1972) Molecular markers place T. melophagium as a close relative to T. theileri (Martinkovićet al., 2012). SSU rRNA shares ∼98% identity between T. theileri and T. melophagium isolates (Gibson et al., 2010). Presumably the divergence of T. theileri and T. melophagium is associated with their discrete host niches (Gibson et al., 2010;Martinkovićet al., 2012). Nonetheless, T. theileri and T. melophagium undergo a similar transmission cycle where metacyclic forms are produced in the insect hindgut and the infective forms are then believed to be transmitted to their mammalian host via the mouth, by ingestion of insect faeces or the whole insect body. Trypanosomes then invade their mammalian hosts and proliferate in the blood and, potentially, tissues before being taken up as a bloodmeal by their insect vector (Böse and Heister, 1993;Hoare, 1923).
A notable contrast in the biology of these parasites is the divergence in the life history of their vectors. Sheep keds, which transmit T. melophagium, spend their entire life attached to either the skin or wool and hair of sheep. Both male and female keds feed on their mammalian host (Underwood et al., 2015). Tabanids, which transmit T. theileri, breed and lay their eggs in soil, water, or trees. The larvae and pupae stages live on vegetation and soil. Only female adults feed on mammalian blood, which is essential for egg production. Although adult tabanids show considerable adaptation to blood feeding, they also feed on the sugars of plants (Chainey, 1993).
Here we derive the T. melophagium genome using a combination of long and short read technologies. The genome, and its protein encoding genes, was compared to T. theileri, to provide insight into the biological specificity exhibited by each parasite in the context of their close phylogenetic relationship.

T. melophagium genome assembly
An initial assessment of the T. melophagium genome, via k-mer counting, predicted that it was smaller than that of T. theileri (22.3 Mb and 27.6 Mb, respectively), this variation being observed in its repeat and unique sequence (Table 1). Notably, both genomes are predicted to have extremely low heterozygosity (0.3 and 0.4 for T. melophagium and T. theileri, respectively) in comparison to other Trypanosoma isolates (Oldrieve et al., 2021). Large gene families, such as the TTPSPs, only account for ∼10% of the T. theileri genome and so are predicted to have a minor effect in the heterozygosity calculation. The k-mer counting prediction was similar in size to the final assembly (Table 1). The T. melophagium assembly consisted of 64 contigs in comparison to 253 for T. theileri. BUSCO assessments predict that both assemblies are 100% complete, although T. theileri is slightly fragmented ( Table 1).
The T. melophagium and T. theileri genomes were aligned to highlight the conservation of collinearity. The conservation is more similar to the conservation between the species T. brucei TREU927/4 and T. congolense IL3000 2019 than between isolates of the same species, T. brucei TREU927/4 and T. brucei Lister 427 2018 (Fig. 1,  Fig. S3). The T. melophagium genome was annotated with 10,057 protein encoding genes, in comparison to 11,312 in T. theileri (Fig. 2C, Table 1) .
Kinetoplastid genomes, proteomes and transcriptomes were downloaded from TriTrypDB (Aslett et al., 2010). Proteome completeness was assessed with BUSCO. Those above 85% complete were retained, leaving 44 isolates from 9 genus (File S1). T. theileri and T. melophagium have some of the smallest genomes in this study and have the lowest GC content (40.1% and 41.2%, respectively), contrasting with the kinetoplastid mean of 48.4% ( Fig. 2A).
Environmental temperature (Lao and Forsdyke, 2000;Paz et al., 2004), generation time (Shah and Gilchrist, 2011), neutral drift (Eyrewalker, 1991;Rao et al., 2011), tRNAs (Plotkin et al., 2004), translational accuracy/efficiency (Akashi, 1994;Hu et al., 2013;Shah and Gilchrist, 2011;Sorensen et al., 1989), gene splicing and protein folding (Novoa and de Pouplana, 2012) have all been hypothesised to cause codon bias. However, codon bias can be influenced by nutrient availability (Seward and Kelly, 2016). Species with low nitrogen availability, such as the plant trypanosomatid, Phytomonas, have an AT rich genome, potentially to mitigate the lack of nitrogen in their plant hosts (Seward and Kelly, 2016). Selection acting on genome nucleotide cost (Sc) is in competition with selection acting to alter the translational efficiency of the genome (St) Kelly, 2016, 2018). Alternative hypotheses for the cause of codon bias can be excluded by analysing closely related species with similar lifestyles (Seward and Kelly, 2016). Based on universal single copy orthologues, selection pressure acting on the translational efficiency is minimal for both T. theileri and T. melophagium (Fig. 2B). In contrast, they show the greatest predicted selection pressure acting to reduce the nucleotide cost of any kinetoplastid genome, including the closely related T. grayi. The AT biased content can be interpreted as a remodelling of the T. theileri and T. melophagium genomes to reduce the cost of the genome ( Fig. 2A,B). This pattern was consistent when every coding sequence (CDS) was compared (Fig. S4A,B) and in universal single copy orthologs which are essential for every life-cycle stage of The assemblies used in this assessment represent the final draft of the T. melophagium assembly produced during this study and the T. theileri assembly available from TriTrypDB (Aslett et al., 2010;Kelly et al., 2017). The k-mer spectra plots associated with the first section of the table are found in Fig. S1. T. brucei (Fig. S4C,D). T. rangeli displays the highest level of selection pressure acting to increase translational efficiency ( Fig. 2B), potentially linked to its reduced genome length (Fig. 2C).

Orthologous protein clustering and phylogenetic inference
Orthologous clustering identified genes that descended from a gene in the last common ancestor of the 44 kinetoplastid proteomes used in this study (Fig. 2, File S1). From the 44 proteomes, 18,274 orthogroups were identified (96.5% of the proteins used in this study were included in one of these orthogroups), 992 orthogroups were single copy and contained all isolates.
A species tree was generated as part of the orthologous protein clustering. Using genetic markers, previous studies have noted the similarity between T. melophagium and T. theileri (Gibson et al., 2010;Martinkovićet al., 2012). Based on 2,312 gene trees, T. theileri is the closest isolate to T. melophagium and groups with the stercorarian trypanosomes, which include T. cruzi, T. grayi and T. rangeli, rather than with salivarian trypanosomes such as T. brucei (Fig. 3). T. grayi is the closest isolate to the T. melophagium and T. theileri clade and is closer in size to the genome length of T. melophagium than to T. theileri (Fig. 2C). T. melophagium and T. theileri are closely related species with relatively short branch lengths (0.093 and 0.088 substitutions per site, respectively). In comparison, T. congolense is more divergent from T. brucei (0.249) whilst the isolates within T. brucei (T. brucei brucei Lister 427 2018:0.0009, T. brucei evansi STIB805: 0.003, T. brucei brucei TREU927/4: 0.003 and T. brucei gambiense DAL972:0.004) show less divergence than seen between T. theileri and T. melophagium.
T. theileri has a greater number of species-specific orthogroups than T. melophagium and 12.9% of its genes are assigned to one of these orthogroups, while T. melophagium has only 2.7% of its genes in a species specific orthogroup (Table 2). Terminal branch length is correlated with specific orthogroup counts, which could account for the discrepancy. However, T. melophagium and T. theileri are each other's most recent common ancestor and have been evolving at a roughly similar rate since this time, with similar terminal branch lengths (Fig. 3). To visualise these differences, the number of genes in each orthogroup was compared between T. melophagium and T. theileri. Orthogroups associated with host interaction protein families were highlighted based on their identification as a putative cell surface protein by Kelly et al. (2017). Many of the T. theileri species specific orthogroups expansions belong to a cell surface family (Fig. 4A).
Interaction with the mammalian host and predicted cell surface proteins Firstly, and as expected, T. melophagium was found to lack any genes in orthogroups which contained VSGs, characteristic of African trypanosomes (File S4). To validate this, a blast search was performed using a relaxed cut off (1e-5) using the T. melophagium genome as the database and T. brucei TREU927/4 VSGs as the query. No hits were identified.
To enable comparison to the T. theileri genome analysis, the genes and orthogroups from this study were annotated with the host interaction genes from Kelly et al. (2017). Across the entire genome, T. theileri is predicted to contain 1,265 more genes than T. melophagium ( Table 2). Examination of orthogroups which were associated with a T. theileri putative surface protein (TTPSP) revealed a large expansion in T. theileri (1,251 genes) compared to T. melophagium (10 genes) which could equate to much of the disparity in genome size (Fig. 4). To confirm the difference in TTPSPs, the T. theileri transcripts were subjected to a blastn search against a database consisting of the T. melophagium transcripts (1e-25 cut-off ). The T. melophagium transcripts were derived from the genome annotation analysis. Only nine T. theileri TTPSPs aligned to T. melophagium.
TTPSPs were split between four orthogroups in the original T. theileri genome analysis but were split between 66 orthogroups in this analysis (Table 3) . TTPSPs share conserved  The full species-specific clustering summary can be found in File S3.
C terminal GPI addition and N terminal signal sequences and contain regions of high divergence in the remainder of the sequence. TTPSPs are highly expressed as a family at the population level and are largely contained within tandem arrays, highlighting a similarity to the VSGs of T. brucei . Excluding T. melophagium, TTPSPs are absent from all other kinetoplastid species analysed (File S4) revealing their specific innovation in these related trypanosomatids. Following the TTPSPs, the largest expanded gene family in T. theileri are the MSPs. This protein family belongs to the peptidase M8 family of metalloproteinases. MSPs are likely to have contrasting roles in different life stages but their best understood function is to bind and cleave members of the complement system and for evasion of other cellular and antimicrobial immune defences (Yao, 2010). This role presumably allows for evasion of complement-mediated proteolysis and therefore assists survival in the insect and mammalian hosts (Yao, 2010). Orthogroups which annotated as MSPs were expanded in T. melophagium, such as OG0008865. T. melophagium also has a species specific orthogroup (OG0009903) consisting of 11 proteins and annotated as MSP. Combined, these results suggest that the surface protein environment of T. melophagium and T. theileri are distinct and that genes encoding these proteins account for most of the discrepancy in the genome sizes.
Clade and species specific orthogroups T. grayi, T. melophagium and T. theileri contain 42 clade-specific orthogroups including the MSP orthogroups OG0000590 and  OG0008095. Both MSP orthogroups were expanded in T. theileri in comparison to T. grayi and T. melophagium. T. melophagium and T. theileri share 81 orthogroups specific to the two species. Twelve of the orthogroups are putative cell surface protein families, including five representatives from TTPSP and seven MSP orthogroups. Of the nine remaining annotated orthogroups, OG0008096 is the largest and contains trans-sialidsases. The orthogroup is expanded in T. theileri (n=17) compared to T. melophagium (n=5).
T. theileri has 121 species specific orthogroups. Sixty-four of these were putative host interaction genes . Two annotated orthogroups remained, which included a leucine-rich repeat family associated with protein binding (OG0011868) and a calpain cysteine protease family (OG0018204). T. melophagium has 76 species-specific orthogroups, 68 of these were unannotated. MSP families were annotated in three of the remaining orthogroups (OG0009903, OG0013746 and OG0013747). Other annotated species-specific orthogroups consisted of an actin family (OG0018163) along with several families associated with protein binding, WD domain G-beta repeat (OG0018131) and a leucine rich repeat family (OG0018098).

Cell surface modifying enzymes
Trans-sialidases are differentially expanded in T. theileri and T. melophagium, with 38 and eight genes, respectively. Of the four orthogroups containing trans-sialidase, two do not contain proteins from T. melophagium (OG0011818 and OG0015016). These two orthogroups contain proteins from T. cruzi and so are likely to have been lost by T. melophagium and T. grayi. In T. cruzi, transsialidases are involved in host immune evasion (Nardy et al., 2016).
Invertases are typical of the cell surface of Leishmania and are thought to transform sucrose into hexose in the gut of the vector. The orthogroups (OG0000150 and OG0000409) that include 20 T. theileri invertase genes only have three members in T. melophagium and three members in T. grayi. This expansion might indicate an adaptation of T. theileri to its vector, the tabanid fly, which can feed on sugary flower nectar (Kniepert, 1980). In contrast, the T. melophagium vector, the sheep ked, exclusively feeds on mammalian blood. The orthogroups which contain the T. theileri invertase genes (OG0000150 and OG0000409) contain only one gene from the plant parasite Phytomonas (Jaskowska et al., 2015;Sanchez-Moreno et al., 1992), suggesting a different mechanism for sucrose metabolism in these parasites.

Glycolysis
Kelly et al. (2017) compared T. brucei and T. theileri transcriptomes which revealed differences in the abundance of glycosomal enzyme mRNAs. Particularly, pyruvate orthophosphate dikinase, phosphoenolpyruvate carboxykinase and malate dehydrogenase were found to be >10 fold more abundant in T. theileri than in T. brucei . We confirm that enzymes associated with the glycolytic pathway are present in T. melophagium (File S4) and found that T. melophagium has expanded the orthogroups associated with three glycolytic enzymes. These included pyruvate orthophosphate dikinase (OG0000570), phosphoenolpyruvate carboxykinase (OG0000120) and malate dehydrogenase (OG0000332), whilst T. melophagium has a reduced number of genes in the fumarate reductase orthogroup (OG0000078).
Orthogroups associated with peroxisome targeting were found to be in equal numbers (OG0004108 PEX5 and OG0003998 -PEX7). All other glycolytic enzymes are present in equal numbers in the two species. Therefore, it is likely that the glycolysis pathway is conserved in T. melophagium.

Life cycle
Extensive studies of T. brucei have identified genes which are associated with key stages of the T. brucei life cycle. These studies tracked genes associated with stumpy formation in the blood stream form (Cayla et al., 2020;Ling et al., 2011;Liu et al., 2020;Mony et al., 2014) and regulators of metacyclogenesis (Toh et al., 2021). These genes were combined with a list of validated development associated genes such as the RNA binding proteins RBP6, RBP7, RBP10 and ZFP2 and ZFP3 along with developmental regulators NRK A, NRK B, RDK1, RDK2, MAPK2 and phosphatases such as PTP1 and PIP39 (Domingo-Sananes et al., 2015;Gale et al., 1994;Gale and Parsons, 1993;Jones et al., 2014 a;Müller et al., 2002;Szöor et al., 2010Szöor et al., , 2006Walrad et al., 2009). Most of the orthogroups containing these genes were represented with a similar number of genes in each species, indicating the presence of an environmental sensing ability and developmental competence (Fig. 5). However, there were notable differences. There is an expansion in the orthogroups containing KRIPP14, which is a mitochondrial SSU component (Mony et al., 2014), in T. theileri. T. melophagium has expanded its orthogroups containing the kinases NRK (Domingo-Sananes et al., 2015;Gale et al., 1994), NEK and (Gale and Parsons, 1993) ADKF (Mony et al., 2014) along with a dual specificity phosphatase (DsPho) and protein phosphatases 1 (PP1) (Mony et al., 2014;Mony and Matthews, 2015). Both T. theileri and T. melophagium are missing metacaspase (MCA1) which is associated with the later stages of progression towards metacyclic forms in T. brucei (Toh et al., 2021) and Hyp12, which upregulates bound mRNAs during development based on tethering assays in T. brucei (Erben et al., 2014;Lueong et al., 2016;Mony et al., 2014;Mony and Matthews, 2015). Puf11, an effector molecule required for kinetoplast repositioning in epimastigotes (Toh et al., 2021), is also absent in T. melophagium.

RNA interference and transposable elements
All five core genes that represent the trypanosome RNAi machinery (AGO1, DCL1, DCL2, RIF4 and RIF5) were present in T. melophagium, with an extra gene in the orthogroup containing DLC2 (File S4). Therefore, a functional gene silencing pathway is likely to be present in T. melophagium, matching the prediction in T. theileri.
Retrotransposon counts highlighted an expansion in T. cruzi and T. vivax isolates, along with T. brucei Lister 427 2018 (Fig. S5A, File S1). T. melophagium and T. theileri have not expanded their retrotransposon repertoire. A similar pattern was observed for long terminal repeat (LTR) retrotransposon counts, which show a positive correlation with genome size (Fig. S5B).

DISCUSSION
T. melophagium and T. theileri are closely related trypanosomes that have distinct hosts and vectors (Gibson et al., 2010;Martinkovicé t al., 2012). Here, the genome of T. melophagium was sequenced and a draft assembly was produced and annotated. The annotated proteome was incorporated into a comparison with T. theileri, and other publicly available kinetoplastid proteomes, to determine their phylogenetic relationship and to explore the genomic basis of the host and vector specificity of these non-pathogenic trypanosomatids.
Although the two genomes compared in this analysis are predicted to be complete (Table 1), their assembly and annotation were performed 5 years apart using different assembly pipelines and sequencing technologies. Comparing assemblies obtained using the same sequencing technologies and assembly pipelines would allow for greater confidence in the observed variations in their genome content as updated methods and long-read sequencing continue to improve the quality of genome assemblies and, therefore, completeness. This was not possible since many of the assembly tools used are specific for the sequencing technologies whilst there has been substantial development in assembly methods between the two studies. However, based on the convergence of the k-mer counting based prediction with the assembly sizes, along with 100% complete BUSCO scores, we were reassured by the quality of the draft assemblies and the subsequent comparisons of their genome content.
Using k-mer counting based predictions, T. melophagium was anticipated to have a smaller genome than T. theileri. This held true when the data were assembled ( Table 1). The T. melophagium genome is more similar in size to T. grayi, the closest relative to T. melophagium and T. theileri, than to T. theileri (Figs 2 and 3). T. theileri has likely expanded its genome size since speciation occurred. A peculiarity of the T. theileri and T. melophagium isolates analysed in this study is their highly reduced heterozygosity, in contrast to African trypanosomes (Oldrieve et al., 2021). Whilst only one genome is available for both species, should these isolates represent the species as a whole, the reduced heterozygosity could be linked to a founder effect (Pool and Nielsen, 2007). As T. theileri and T. melophagium have specific host and vector niches, the small population that initially expanded into the niches possibly underwent a significant population bottleneck, especially as host domestication caused eradication of wild progenitors and wild relatives, which could have facilitated a reduction in heterozygosity, induced by genetic drift. Alternatively, the absence of a sexual cycle could contribute to the reduced heterozygosity. Although T. melophagium and T. theileri contain genes required for meiosis, this does not confirm the species undergo sexual reproduction.
Selection appears to be acting to reduce the genome wide nucleotide biosynthesis cost in both T. theileri and T. melophagium (Fig. 2B) which has remodelled their genomes toward an AT bias, contrasting with all other kinetoplastid genomes analysed in this study ( Fig. 2A,C). The predicted selection pressure acting to reduce nucleotide cost is at the expense of translational efficiency (Fig. 2C) and is greater than for the free-living Bodo saltans, or monoxenous insect parasites such as the early-branching Paratrypanosoma confusum and Phytomonas EM1. Phytomonas has limited access to nitrogen as it infects nitrogen deficient plants and has been highlighted as an example where diet can cause selection to reduce the species genome nucleotide cost, through a reduction in GC content (Seward and Kelly, 2016). We propose that the reduction in the selection cost of T. theileri and T. melophagium may be related to their non-pathogenic nature. By remodelling their genome to an AT bias, they may have reduced their cost to their host, facilitating reduced pathogenicity. Closely related species of bacteria exist on a spectrum from pathogen to symbiont, highlighting how a selective advantage can arise from a parasite reducing the cost to its host (Toft and Andersson, 2010). It is possible that T. melophagium and T. theileri are part of a similar spectrum amongst trypanosomatids. We acknowledge that alternative hypotheses exist for the reduced nucleotide cost associated with the AT rich genome, such as tissue niche adaptation in their mammalian host or selection primarily operating within the arthropod vector rather than mammalian host.
This clade-specific genome remodelling provides an example of the similarity between T. theileri and T. melophagium. However, the species have contrasting hosts, vectors, and genome sizes. Genome annotation and orthology inference identified candidates for their discrepancy in genome size. When species specific orthogroups were compared, the greatest contrast was between orthogroups associated with the putative cell surface, with the largest expansions detected in T. theileri being of TTPSP and MSP surface protein families (Fig. 4). Although both species undergo a cyclical transmission cycle, which includes mammalian and insect stages, we hypothesise the respective prevalence in their mammalian hosts, and the contrasting life history of their respective vectors could explain the genome expansion in T. theileri. T. theileri, spread by tabanids, are found in over 80% of livestock (Farrar and Klei, 1990;Matthews et al., 1979;Mott et al., 2011;Schlafer, 1979). In comparison, T. melophagium exhibits lower detected prevalence, being rarely identified in its mammalian host via blood smears (Martinkovićet al., 2012) or after blood culture (Serpil and Zafer, 2008). Moreover, sheep keds, which transmit T. melophagium, are intimately associated with their mammalian host, spending their entire life either on the sheep's skin or wool. Here, males and females feed solely on mammalian blood providing many opportunities for transmission of T. melophagium from the sheep to the sheep ked (Hoare, 1923). Therefore, there is potentially less advantage for T. melophagium to invest in mammalian immune evasion mechanisms required to extend the length of its infection in sheep, since it has many transmission opportunities. The limited investment in T. melophagium is emphasised by their relatively unsophisticated putative TTPSP-related repertoire alongside modestly expanded species-specific MSP families (Fig. 4). Instead, T. melophagium could rely on its ancestral ability to sustain infections in invertebrate hosts, which, although able to be primed to defend against a specific pathogen, rely upon an innate immune response (Cooper and Eleftherianos, 2017).
In contrast, T. theileri has a transient host-vector interaction. Tabanid flies of either gender survive on plant sugars, while adult females occasionally feed on mammalian blood (Chainey, 1993). Therefore, potentially T. theileri requires extended survival in its mammalian host to sustain transmission between cattle, compared to the intimate long-term association of sheep keds with T. melophagium. The investment from T. theileri in an expanded surface protein repertoire is likely to support adaptive immune evasion and prolonged survival in the mammalian stage of its life cycle. Alternatively, or additionally, differences between the bovine and ovine immune responses could contribute (Wang et al., 2013). It should be noted that both T. melophagium and T. theileri prevalence was surveyed via blood smear or blood culture. Although this is a standard approach, studies have highlighted the prevalence of T. brucei in adipose tissue (Trindade et al., 2016) and we cannot exclude one species preferentially infecting these tissues, rather than the bloodstream.
Many of the gene families identified in T. theileri  were divided into multiple orthogroups in this study. The discrepancy is likely to be explained by evolution of the methods used by OrthoFinder. At the time of publication of the T. theileri study, OrthoFinder v.1 was available, while our analysis used version v.2.5. For instance, OrthoFinder v.2.5 uses updated sequence alignment tools, such as DIAMOND ultra-sensitive. For this reason, we can speculate that the clustering in this study is more refined, such that the TTPSP families should be divided into smaller protein families. However, large paralogous orthogroups remain the toughest challenge for orthogroup clustering software and so the relationships between this set of proteins will likely continue to evolve alongside the software (Emms and Kelly, 2020).
Genes involved in the trypanosome life cycle, cellular quiescence and meiosis were all detected in T. melophagium, suggesting a competent developmental cycle along with the machinery for sexual recombination. There is an expansion of the T. theileri invertase orthogroup which was not present in T. melophagium. This is potentially associated with the use of sucrose in the tabanid fly's diet. Although the glycolysis pathway is present in T. melophagium, there was an expansion in the pyruvate orthophosphate dikinase, phosphoenolpyruvate carboxykinase and malate dehydrogenase orthogroups. These genes are associated with the branch of the glycolysis pathway that converts pyruvate to succinate to facilitate the recovery of NAD + . This branch of the glycolytic pathway was upregulated in T. theileri in contrast to T. brucei . Interestingly, the core RNAi genes were detected in T. melophagium, consistent with T. theileri but distinct from T. cruzi which is also a stercorarian trypanosome, but which lacks the requisite molecular machinery (Ullu et al., 2004).
In summary, we have found that T. theileri and T. melophagium are closely related species that display substantial remodelling of their genomes to facilitate a reduction in their nucleotide costs, which might reduce the costs they impose on their hosts. T. theileri displays a considerable genome expansion, which is associated with a large repertoire of unique proteins that characterise its cell-surface and host-interaction gene repertoire. These genes could facilitate a lifelong infection in its mammalian host. In contrast, the comparatively unsophisticated immune evasion repertoire displayed by T. melophagium suggests more limited adaptation to its mammalian host.

Trypanosome culture, DNA/RNA extraction and sequencing
The full list of tools used in this study, and the options used to run those tools, can be found in Table S1.
T. melophagium was isolated from sheep blood collected on the island of St Kilda, Scotland, UK (kindly provided by Professor Josephine Pemberton, University of Edinburgh). Whole blood was collected into heparinized vacutainers and used within 2 days. 1 ml of blood was diluted with 5 volumes of a 50% mix of HMI9 supplemented with 20% fetal bovine serum (FBS) and Madin-Darby bovine kidney (MDBK) conditioned medium. All cultures were kept at 37°C and were examined microscopically every 3 days for 6 weeks. After propagation of T. melophagium by culturing of the blood sample, the specimens were transferred and co-cultured with fibroblast-like primary cells as feeder cells, isolated from the same blood sample. Due to the short lifespan of these primary cultured cells, T. melophagium was subsequently co-cultivated with MDBK cells and then progressively adapted to axenic conditions with a 50% mix of HMI9 and MDBK conditioned medium.
DNA was extracted from cultured T. melophagium using a MagAttract high molecular weight DNA kit, following the manufacturer's instructions (Qiagen) and cleaned via ethanol precipitation. The DNA was sequenced with Oxford Nanopore Technology's (ONT) MinION (R9.4.1), following the ONT Rapid Sequencing protocol. Base-calling was performed in high accuracy mode using Guppy (available at https://community.nanoporetech. com/) which produced 1.059 gigabases (Gb) of data. PycoQC was used to visualise the data (Leger and Leonardi, 2019). The same DNA was sequenced with BGI's DNBseq (4.201Gb, 150 base pair reads). RNA was extracted with the RNeasy mini kit (Qiagen) including a DNAse step, following the manufacturer's instructions and sequenced with BGI's DNBseq (5.019Gb, 100 bp reads). Raw DNA and RNA DNBseq reads were trimmed with Trimmomatic (Bolger et al., 2014).
T. theileri sequencing data was downloaded from NCBI. 170 bp genomic reads were used (SRR13482812).

T. melophagium genome assembly and annotation
Jellyfish and GenomeScope were used to provide a k-mer based estimate of the genome size and heterozygosity using the short DNA reads described above (Marçais and Kingsford, 2011;Vurture et al., 2017).
ONT long reads were assembled using Wtdbg2 (Ruan and Li, 2020). For the polishing steps, BWA-MEM (Li, 2013 preprint) was used to align short reads and Minimap2 (Li, 2018) was used to align ONT reads. ONT reads were aligned to the Wtdbg2 draft assembly and three iterations of Racon (Vaser et al., 2017) followed by one round of Medaka (available at https:// github.com/nanoporetech/medaka) were performed. DNBseq reads were mapped to the Medaka polished assembly, and two iterations of Racon were performed. Short and long reads were aligned to the Racon polished assembly to complete two final iterations of polishing with Pilon (Walker et al., 2014). At each stage of polishing, the quality of the draft genome was assessed using scaffold_stats.pl (available at https://github.com/sujaikumar/ assemblage) and BUSCO (Seppey et al., 2019). BUSCO provides a metric for genome assembly and annotation completeness, based on the presence of near-universal single-copy orthologues in a genome assembly or the corresponding annotated proteins.
Both sets of reads were mapped to the draft assembly. Each contig was subjected to a DIAMOND blastx search against the InterProScan database (Buchfink et al., 2015;Jones et al., 2014b). The resulting alignments and DIAMOND hits were visualised with BlobTools (Laetsch and Blaxter, 2017b). Every contig had a DIAMOND hit to sequences from the Trypanosoma genus. Therefore, the assembly was confirmed to be free from contamination. However, two contigs were outliers in comparison to the rest of the assembly at under 100x coverage and consisting of only 3,975 and 1,892 base pairs (Fig. S2A). These contigs were removed from the assembly. BUSCO and BlobTools were re-run on this trimmed assembly to assess the final assembly's completeness and coverage (Fig. S2B).
Publicly available genomes, transcriptomes and proteomes were accessed from TriTrypDB (Aslett et al., 2010) along with the Phytomonas EM1 assembly which was accessed from NCBI (Leinonen et al., 2011). The quality of the proteomes were assessed using BUSCO; only isolates which had >85% complete proteomes were included, which left 43 isolates along with T. melophagium. A list of all the isolates used in this study can be found in File S1. The assembly statistics of the 44 genomes were assessed using scaffold_stats.pl (available at https://github.com/sujaikumar/assemblage).
The genomes from each of these kinetoplastid isolates were screened for transfer RNA genes using tRNAscan-SE (Chan et al., 2021). The outputs of these results were used to infer the strength of selection acting on translational efficiency and nucleotide cost for each isolate, along with the background mutation bias, using CodonMuSe Kelly, 2016, 2018). Each isolate was assessed using (1) every CDS, (2) single-copy universal orthologs (identified in the orthologous protein clustering steps below) (n=992) and 3) single-copy universal orthologs, which are essential for every stage of the T. brucei life cycle (n=158). The last list of genes were identified by screening the universal single-copy orthologs for genes that had a significant reduction in transcript levels in every library of an RNAi phenotype screen (>1.5 log fold decrease) (Alsford et al., 2011).

Orthology inference
Orthologous proteins from 44 kinetoplastid proteomes were identified with OrthoFinder (Emms and Kelly, 2019) and protein clusters were summarised with KinFin (Laetsch and Blaxter, 2017a), using InterProScan annotations based on the Pfam and signalP databases (Jones et al., 2014b;Mistry et al., 2021;Nielsen, 2017). A minimal cut-off threshold was not applied to the orthogroup annotation. The orthogroup annotation summary can be found in File S2. A species tree was produced by STAG and STRIDE, as part of the OrthoFinder analysis Kelly, 2017, 2018), which was visualised with iTOL (Letunic and Bork, 2007). STRIDE identified Bodo saltans as the best root for the consensus species tree.
To confirm the absence of VSGs in T. melophagium, all CDS sequences labelled as 'VSG' in the T. brucei TREU927/4 reference genome were downloaded from TriTrypDB. A blastn search was performed using these VSG sequences as the query and the T. melophagium genome as the database using a loose cut-off (e-value=1e-5). A similar search was performed to confirm the reduced TTPSP counts in T. melophagium. For this, the T. theileri transcripts were used to query a database made from the T. melophagium transcripts (e-value=1e-25).
The cell surface orthogroups in this study were annotated with the orthogroups from Kelly et al. (2017) based on the orthogroup membership of genes in the two analyses. Unless stated otherwise, all of the figures in this study were plotted in R (Team, 2019) using ggplot2 (available at http:// ggplot2.tidyverse.org) and ggrepel (available at http://github.com/slowkow/ ggrepel).