Whole-chromosome hitchhiking driven by a male-killing endosymbiont

Neo-sex chromosomes are found in many taxa, but the forces driving their emergence and spread are poorly understood. The female-specific neo-W chromosome of the African monarch (or queen) butterfly Danaus chrysippus presents an intriguing case study because it is restricted to a single ‘contact zone’ population, involves a putative colour patterning supergene, and co-occurs with infection by the male-killing endosymbiont Spiroplasma. We investigated the origin and evolution of this system using whole genome sequencing. We first identify the ‘BC supergene’, a broad region of suppressed recombination across nearly half a chromosome, which links two colour patterning loci. Association analysis suggests that the genes yellow and arrow in this region control the forewing colour pattern differences between D. chrysippus subspecies. We then show that the same chromosome has recently formed a neo-W that has spread through the contact zone within approximately 2,200 years. We also assembled the genome of the male-killing Spiroplasma, and find that it shows perfect genealogical congruence with the neo-W, suggesting that the neo-W has hitchhiked to high frequency as the male-killer has spread through the population. The complete absence of female crossing-over in the Lepidoptera causes whole-chromosome hitchhiking of a single neo-W haplotype, carrying a single allele of the BC supergene and dragging multiple non-synonymous mutations to high frequency. This has created a population of infected females that all carry the same recessive colour patterning allele, making the phenotypes of each successive generation highly dependent on uninfected male immigrants. Our findings show how hitchhiking can occur between the physically unlinked genomes of host and endosymbiont, with dramatic consequences.


Introduction
Structural changes to the genome play an important role in evolution by altering the extent of recombination among loci. This is best studied in the context of chromosomal inversions that cause localised recombination suppression, and can be favoured by selection if they help to maintain clusters of co-adapted alleles (or 'supergenes') in the face of genetic mixing [1][2][3][4]. A greater extent of recombination suppression occurs in the formation of heteromorphic sex chromosomes, which can link sex-specific alleles similarly to supergenes [5]. However, suppressed recombination can also have costs. In particular, male-specific Y and female-specific W chromosomes can be entirely devoid of recombination, making them vulnerable to genetic hitchhiking and the accumulation of deleterious mutations through 'Muller's ratchet', which may explain their deterioration over time [6][7][8]. These contrasting benefits and costs of recombination suppression are of particular interest in the evolution of neo-sex chromosomes, which can form through fusion of autosomes to existing sex chromosomes. There is accumulating evidence that neo-sex chromosomes are common in animals [9][10][11][12][13][14][15], but the processes underlying their emergence, spread and subsequent evolution have not been widely studied. In particular, there are few studied examples of recently-formed neo-sex chromosomes that are not yet fixed in a species.
The African monarch (or queen) butterfly Danaus chrysippus, provides a unique test case for the causes and consequences of changes in genome architecture and recombination suppression. Like its American cousin (D. plexippus), it feeds on milkweeds and has bright colour patterns that warn predators of its distastefulness. However, within Africa D. chrysippus is divided into four subspecies with distinct colour patterns and largely distinct ranges (Fig. 1A).
Predator learning should favour the maintenance of a single monomorphic warning in any single area. For this reason, researchers have long been puzzled by the large polymorphic contact zone in East and Central Africa where all four D. chrysippus subspecies meet and interbreed [16][17][18] (Fig. 1A). Crosses have shown that colour pattern differences between the subspecies are controlled by Mendelian autosomal loci, including the tightly linked 'B' and 'C' loci (putatively a 'BC supergene' [19]) that define three common forewing patterns [20,21] (Fig. 1A). However, crosses with females from the contact zone revealed that the BC chromosome has become sex linked, forming a neo-W that is unique to this population [19,22]. Since female meiosis is achiasmatic (it lacks crossing-over) in the Lepidoptera, the formation of a neo-W would instantaneously cause perfect linkage, not just of the B and C loci, but of an entire nonrecombining chromosome, along with other maternally-inherited DNA.
What is particularly striking is that the presence of the neo-W coincides with infection by a maternally-inherited 'male killer' endosymbiont related to Spiroplasma ixodetis, which kills male offspring and leads to highly female-biased sex-ratios where infection is common [22][23][24]. The combination of neo-W and male killing is expected to dramatically alter the inheritance and evolution of the BC chromosome [22,25]: Infected females typically give rise to all-female broods who should always inherit the same colour patterning allele on their neo-W, along with the male-killer, while the other maternal allele is systematically eliminated in the dead sons ( Fig.   1B), forming a genetic sink for all colour pattern alleles not on the neo-W. It has been suggested that the restriction of male killing to females with the neo-W, and only in the region in which hybridisation occurs between subspecies, may not be a coincidence [19,22,[25][26][27]. However, the genomic underpinnings of this system -the genetic controllers of colour pattern, the source and spread of the neo-W, and its relationship with the male killer -have until now remained a mystery. We generated a reference genome for D. chrysippus and used whole genome sequencing of population samples to uncover the interconnected evolution of the BC supergene, neo-W and Spiroplasma. Our findings reveal a recent whole-chromosome selective sweep caused by hitchhiking between the host and endosymbiont genomes.  [20]. Note the linkage of B and C, putatively forming a 'BC supergene' [19]. Two examples of heterozygotes that can be found in the contact zone are shown. Note that Cc heterozygotes can exhibit the transiens phenotype with white markings on the forewing with ~50% penetrance. (B) Model showing how fusion of the BC autosome to the W chromosome has produced a neo-W (purple) in contact zone females (top left), while males have two autosomal copies of the BC chromosome (top right). Daughters inherit the neo-W, while sons inherit the other BC chromosome haplotype from their mother. The latter allele is then lost due to male killing by Spiroplasma.

Identification of the BC supergene
We assembled a high quality draft genome for D. chrysippus, with a total length of 322 Megabases (Mb), a scaffold N50 length of 0.63 Mb, and a BUSCO [28] completeness score of 94% (S1 Table-S8 Table). We then further scaffolded the genome into a pseudo-chromosomal assembly based on homology with the Heliconius melpomene genome [29-31] accounting for known fusions that differentiate these species [9,30,32] (Fig. S1). We also re-sequenced 42 individuals representing monomorphic populations of each of the four subspecies and a polymorphic population from a known male-killing hotspot near Nairobi, in the contact zone ( Fig 1A, S9 Table).
To identify the putative BC supergene, we scanned for genomic regions showing high differentiation between the subspecies and an association with colour pattern. Genetic differentiation (F ST ) and excessive divergence (d XY ) is largely restricted to a handful of broad peaks, with a background level of approximately zero ( Fig. 2A, S2 Fig, S3 Fig). This low background level implies a nearly panmictic population across the continent. The effective population size appears to be very large, as average genome-wide diversity at putatively neutral 4-fold degenerate 3 rd codon positions is 0.042, which is among the highest values reported for animals [33,34]. The islands of differentiation that stand out from this background imply selection for local adaptation maintaining particular differences between the subspecies, similar to patterns seen between geographic races of Heliconius butterflies [35]. However, here the peaks of differentiation are broad, covering several Mb, implying some mechanism of recombination suppression such as inversions that differentiate the subspecies.
The inclusion of the polymorphic contact-zone samples, and the fact that three of the subspecies each carry a unique colour pattern allele (Fig 1A), allowed us to identify particular differentiated regions associated with the three major colour pattern traits. A ~3 Mb region on chromosome 4 is associated with the white hindwing patch (A locus) and a ~5 Mb region on chromosome 15 (hereafter chr15) is associated with both background orange/brown (B locus) and the forewing black tip (C locus) (Fig 2A, S2 Fig). Below, we refer to this region on chr15, which spans over 200 protein-coding genes, as the BC supergene [19], although we note that additional associated SNPs on chromosome 22 suggest that background wing melanism may also be influenced by other loci.
Clustering analysis based on genetic distances reveals three clearly distinct alleles at the BC supergene ( Fig 2C). This further supports the hypothesis of recombination suppression, although a number of individuals show mosaic ancestry consistent with occasional recombination (S4 Fig). The three main alleles correspond to the three common forewing phenotypes, so we term these BC chrysippus (orange background with black forewing tip, formerly bbcc), BC dorippus (orange without black tip, formerly bbCC), and BC orientis (brown background with black forewing tip, formerly BBcc) (Fig 2C).  Coloured blocks indicate 20 kb windows in which sequence haplotypes could be assigned to one of three clusters based on pairwise genetic distances (see Methods for details). Windows in grey indicate insufficient relative divergence to be assigned to a cluster and white indicates missing data. Allele names reflect the subspecies from which they come. In heterozygotes, the name of the dominant allele is in bold.
Although it can be challenging to identify particular functional mutations in regions of suppressed recombination, the presence of some recombinant individuals allowed us to narrow down candidate regions for the B and C loci. A cluster of SNPs most strongly associated with background colour (B locus) is found just upstream of the gene yellow, and a phylogenetic network for a 30 kb region around yellow groups individuals nearly perfectly by phenotype, although some individuals classed as heterozygous were intermingled with homozygotes (S5 In Drosophila, Arrow is essential for Wnt signalling in wing development [38]. Wnt signalling is known to underlie variation in colour pattern in Heliconius butterflies [39] and knock-out mutants for the Wnt ligand gene WntA in D. plexippus show a loss of pigmentation [40]. This makes arrow a promising candidate for the C locus gene. While these genes represent our best candidates, numerous strongly-associated SNPs occurred closer to other genes in this region (S10 Table). Future studies will aim to narrow down the associated regions and performing functional validation.
Irrespective of their precise mode of action, the patterns of association imply that the B and C loci are ~1.6 Mb apart (S5 Fig), and would therefore be fairly loosely linked under normal recombination. This physical distance translates to around 7.6 cM, assuming crossover rates similar to those in Heliconius [31,41], whereas the estimated recombination distance between B and C based on crosses is 1.9 cM [42]. Theory predicts that recombination suppression can be favoured if it maintains linkage disequilibrium (LD) between co-adapted alleles in the face of gene flow [1][2][3][4]. Our study is one of only a few cases in which it can be shown that alleles at distinct loci that each influence a component of a complex trait are maintained in LD by suppressed recombination [43,44].
It is likely that chromosomal rearrangements contribute to recombination suppression at the BC supergene. Although our short-read data do not allow us to test directly for inversions, Either way, we suggest that this large structural change, which increases the length of the chromosome by nearly a third, contributes to recombination suppression between the BC dorippus allele and other supergene alleles by interfering with chromosome pairing in heterozygotes.

A neo-W chromosome traps a single haplotype of chromosome 15 in contact zone females
Previous crossing experiments indicated that the BC chromosome has become sex-linked in contact zone females [22]. To confirm this hypothesis using genetic tools, we created a 'cured line' by treating a female from an all-female brood with tetracycline to eliminate Spiroplasma and allow the survival of male offspring [23]. A cross using this female confirms perfect sex- Although we were unable to definitively identify any scaffolds from the ancestral W chromosome, which is likely to be highly repetitive, we can test whether chr15 shows the expected hallmarks of a young neo-W, hypothesised to have formed through fusion to the ancestral W [22]. Due to the complete absence of recombination in females, we expect that a single fused haplotype of chr15 would be spreading in the population. Any unique mutations specific to this haplotype should therefore occur at high frequency in females and be absent in males. We scanned for such high-frequency female-specific mutations, and found them to be abundant across the entire length of chr15 and nearly absent throughout the rest of the genome  (B) Distance-based phylogenetic network for the distal colinear region of chr15 outside of the BC supergene reveals that most contact zone females carry the conserved neo-W haplotype. Cartoons show how the colinear region of chr15 is outside of the BC supergene but would still capture reduced divergence among individuals carrying a shared non-recombining neo-W. (C) Boxplot comparing the density of heterozygous sites in 100 kb windows in the colinear region of chr15 between wild type individuals from the contact zone (CZ) and those carrying the neo-W. Cartoon chromosomes above the plot match those shown in panel B. A relative lack of elevated heterozygosity in the neo-W lineage indicates a lack of divergence of the fused neo-W haplotype, consistent with the fusion being recent. (D) Boxplot of nucleotide diversity (π) within each population for the same colinear region of chr15. On the far right, π is shown for the haploid neo-W haplotype specifically, based on partial sequences isolated from this haplotype (see Methods and S10 Fig for details). The near absence of genetic diversity implies a very rapid spread of the neo-W through the population. The neo-W formed recently and spread rapidly Genetic variation accumulated in the neo-W lineage since its formation can tell us about its age. Sequence divergence between the neo-W and autosomal copies of chr15 (inferred from the density of heterozygous sites in the colinear region of chr15 in females carrying the neo-W) is not significantly different from that between the autosomal copies in 'wild type' individuals that lack the fusion (Fig. 3C, Wilcoxon signed rank test, p=0.36, n=48 100 kb windows). This implies that insufficient time has passed since the fusion event for significant accumulation of new mutations. The limited divergence of the neo-W haplotype from the autosomal copy of chr15 in each female makes it challenging to isolate. Nonetheless, by using diagnostic mutations that are unique to, and fixed in the neo-W lineage, we were able to identify sequencing reads from the shared haplotype and reconstruct a partial neo-W sequence for each female (S10 Fig). A dated genealogy based on these sequences places the root of the neo-W lineage at ~2200 years (26,400 generations) ago (posterior mean = 2201, std. dev. = 318).
The neo-W is present in all but one of the contact zone females, implying a rapid spread since its formation. This process is similar to a selective sweep of a beneficial mutation, except that complete recombination suppression in females means that the sweep affects the entire chromosome equally. Unlike a conventional sweep, it is not expected to eliminate genetic diversity from the population as these females will also carry an autosomal copy of chr15 inherited from their father (Fig 1B). Indeed, we see a 20% reduction in overall nucleotide diversity (π) on chr15 in females of the neo-W lineage ( Fig 3D). However, when we consider only the neo-W haplotype in each of these females we see a nearly complete absence of genetic variation, with a π of 0.00007, two orders of magnitude lower than for autosomal copies of chr15 (0.0228) (Fig. 3D). These results further support a very recent and rapid spread of the neo-W.
The neo-W haplotype carries the recessive BC chrysippus allele at the BC supergene ( S4 Fig). However, previous work [22] shows that at the the focal sampling site in the contact zone, most males are immigrants homozygous for the dominant BC dorippus allele, and the vast majority of females (84%) are heterozygous BC dorippus /BC chrysippus , as expected if most inherit BC dorippus from their father and BC chrysippus (on the neo-W) from their mother. The dominant dorippus phenotype is therefore by far the most abundant in this population. Since aposematic colouration should be under positive frequency dependent selection, it is highly unlikely that the spread of the neo-W can be explained by selection on colour pattern, highlighting the question of what else might have driven its spread.

Hitchhiking between the neo-W and Spiroplama
We hypothesised that the neo-W has spread as a result of co-inheritance with the malekilling Spiroplasma, which is itself spreading through the population as a selfish element.
Experiments have suggested that infected female offspring have increased survival relative to those from uninfected broods, possibly due to reduced competition for resources, although other factors such as improved immunity have not been tested [45]. For this to drive the spread of the neo-W, there would also need to be strict vertical inheritance of Spiroplasma down the female line, such that it is always co-inherited with the neo-W.  (Fig 4A), consistent with matrilineal inheritance. To confirm that the Spiroplasma is strictly vertically inherited and always associated with a single female lineage, we used PCR assays for Spiroplasma and mitochondrial haplotype and expanded our sample size to 158 individuals, including samples used in previous studies going back two decades [19,23] (S12 Table,  Like the neo-W, the Spiroplasma genomes carry limited variation among individuals (π = 0.0005), consistent with a single and recent outbreak of the endosymbiont. Although the lack of variation makes it challenging to infer genealogies, our inferred maximum likelihood genealogies for the neo-W and Spiroplasma are strikingly congruent (Fig 4B). The low bootstrap support for multiple nodes is unsurprising given that these sequences descend from a recent common ancestor, such that most nodes will be defined by only a few informative sites. This does not weaken the support for congruence however, as the probability of two incorrectly inferred topologies matching by chance is infinitesimally small. In a permutation test for congruence between the two distance matrices [46], the observed level of congruence exceeds all 100,000 random permutations. There is therefore strong support for co-inheritance of the neo-W and Spiroplasma [47].

Fig 4. Matrilineal inheritance causes coupling between neo-W and Spiroplasma. (A)
Maximum likelihood phylogeny for the whole mitochondrial genome. Individuals are coloured according to population of origin (see Fig 1A), and those carrying the neo-W ('W') and Spiroplasma ('S') are indicated (including one cured individual in which Spiroplasma was eliminated). Females are indicated by circles and males by squares. (B) Maximum likelihood phylogenies for the neo-W haplotype and Spiroplasma genome isolated from infected females. Corresponding clades are shaded to indicate congruence. Note that two samples are excluded in panel B: the cured sample, which lacked Spiroplasma due to tetracycline treatment, and one infected female found to lack the neo-W. Whether the latter represents an ancestral state or secondary loss requires further investigation. In all trees, nodes supported by more than 70 of 100 bootstrap replicates are indicated by circles.
The combined spread of three physically unlinked DNA molecules -the mitochondrial genome, neo-W and Spiroplasma genome -constitutes a form of genetic hitchhiking, but is facilitated by their strict matrilineal inheritance rather than physical linkage. We cannot entirely rule out the possibility that the neo-W is contributing to this spread, or even driving it entirely, through direct selection or meiotic drive. In theory this is testable by examining broods that carry the neo-W but lack Spiroplasma. We raised 11 such broods in our cured line, and Smith  [48] reported 10 natural broods that showed sex-linked colour pattern and no male killing.
Across these 21 broods, totalling 528 adult offspring, 51% were female. This is far from significantly different from the null expectation of 50% (binomial test p=0.7). However, we note that in order to detect meiotic drive causing a 1% female bias with good power would require a far larger samples size of >15,000. Given the evidence from other work showing that male killers can spread rapidly as selfish elements [49], and the endosymbionts can convey other fitness benefits [50], this remains the most parsimonious explanation in this case. Selfish elements have been shown to drive hitchhiking of the mitochondrial genome or a portion of a chromosome through a population and even across species boundaries [51][52][53]. Our findings show how an entire chromosome can be captured in the same way. Hitchhiking may therefore be of general importance in driving the spread of neo-sex chromosomes.
In D. chrysippus, it is currently unclear whether the neo-W or male-killer emerged first. It is also unclear whether their co-occurrence in a single ancestor was simply a coincidence or instead reflects some functional connection, such as the suggestion that the neo-W might confer susceptibility to the male killer [22]. It is important to note that this is not the first time a neo-sex chromosome has formed in this taxon. A fusion of chromosome 21 to the ancestral Z chromosome occurred in an ancestor of all Danaus species, producing a neo-Z [9,32,54]. It is speculated that a complementary fusion of chromosome 21 to the ancestral W also occurred [9,54], but this is difficult to conclusively verify due to degradation of the W chromosome over longer timescales. If this hypothesis of an ancient neo-W is correct, then the neo-W we describe (W-chr15) might in fact be better described as a neo-neo-W (W-chr21-chr15). It is possible that the spread of the original W-chr21 was also driven by hitchhiking with a selfish endosymbiont.

Genetic and phenotypic consequences of recombination suppression
Sex chromosome evolution in many other taxa involves the progressive spread of recombination suppression outward from the a sex-determining locus [55]. By contrast, the absence of crossing over in female meiosis means that a lepidopteran neo-W experiences complete and immediate recombination suppression over its entire length. Butterfly W chromosomes are therefore thought to be highly degenerated and repetitive, and to our knowledge none have been successfully assembled to date. The young age of the D. chrysippus neo-W therefore provides a rare opportunity to study the early evolutionary consequences of recombination suppression across an entire chromosome. Two related processes could shape its evolution: hitchhiking of pre-existing deleterious mutations that were initially rare in the population [6], and accumulation of novel deleterious mutations due to reduced purging through recombination and selection (i.e. Muller's Ratchet) [7].
As a proxy for the 'genetic load' of deleterious mutations in the population, we considered P n /P s , the normalised ratio of non-synonymous to synonymous polymorphisms. Due to purifying selection, non-synonymous polymorphisms are typically rare, and where they do occur the mutant allele typically occurs at low frequency in the population [56]. When considering all polymorphisms in the neo-W lineage, P n /P s for chr15 (excluding the BC supergene, to avoid bias) is very slightly (~5%) higher than for other autosomes (Fig S13). Of 1000 bootstrap replicates, 916 reproduced this bias, corresponding to a p-value of 0.084. However, when we partition polymorphisms by allele frequency, we see that chr15 carries a large excess of nonsynonymous polymorphisms in the highest frequency class (i.e. minor allele at 50%), with a P n /P s ratio >3 times larger than on other autosomes (S13 Fig). This holds across all 1000 bootstrap replicates (i.e. p<0.001). A change in the frequency distribution of non-synonymous variants, without a significant change in their abundance, is best explained by hitchhiking of preexisting mildly-deleterious alleles that were initially rare in the population but were inadvertently carried to high frequency along with the neo-W haplotype, and are therefore now found in all females in this lineage. In fact, P n /P s for high-frequency polymorphisms on chr15 is higher than that for singletons on autosomes (p=0.044), suggesting that accumulation of additional mildlydeleterious alleles on the neo-W might have occurred early during its spread through the population.
At the phenotypic level, perhaps counter-intuitively, the spread of a single supergene allele on the neo-W has not caused homogenization of warning pattern among contact zone females, and might in fact have the opposite effect. In locations where the neo-W and Spiroplasma are nearly fixed, such as our sampling site near Nairobi, the high incidence of male killing implies that the population is strongly shaped by immigrant males. Since the BC chrysippus allele on the neo-W is universally recessive, daughters will tend to match the phenotype of their immigrant father.
However, because the neo-W is always transmitted to daughters, the paternal chr15 copy will be lost to male killing after one generation, creating a genetic sink for immigrant male genes [22] (S14 Fig). This combination of processes results in a female population that is highly sensitive to the source of immigrants, which is known to fluctuate seasonally with monsoon winds [16,57] (S14 Fig). This model leads to the testable prediction that seasonal fluctuations in female phenotypes should be most dramatic where male killing is most abundant.

Future evolutionary trajectories
The future of the neo-W and Spiroplasma outbreak is uncertain. A lack of males could lead to local extinctions [27], but extinction of the entire infected lineage is unlikely given the high dispersal ability and seasonal influxes of males in the contact zone. Indeed, it is notable that Spiroplasma infection has only been recorded within the contact zone population (with the exception of a single South African female reported here, S12 Table), especially given theory showing that male killers should spread very rapidly across the geographical range of a panmictic population if they provide even a very weak selective advantage [49]. Future work will investigate whether its spread might be curtailed by environmental factors, for example if oviposition behaviour or host plant availability only leads to sibling competition (and consequent benefits for all-female broods) under certain conditions [45]. An alternative and non-mutuallyexclusive hypothesis is that dispersal rates of infected females are strongly reduced. In other systems, sex-ratio distortion has driven adaptive responses by the host, including changes to the mating system [58] and the evolution of resistance to male-killing [59,60]. The absence of evidence for these phenomena in D. chrysippus might simply reflect the recency of the male killing outbreak. Eventually, we also expect the non-recombining neo-W to begin to degenerate through further hitchhiking, gene loss and the spread of repetitive elements [8,55]. This young system provides a rare opportunity to study how these phenomena unfold through time and space.

Reference genome sequencing, assembly and annotation
Detailed methods for generation of the D. chrysippus reference genome are provided in S1 Text. Briefly, a draft assembly was generated using SPAdes [61] from a combination of pairedend and mate-pair libraries of various insert sizes. Scaffolding and resolution of haplotypes was performed using Redundans [62] and Haplomerger2 [63]. The assembly was annotated using a combination of de-novo gene predictors yielding 16,654 protein coding genes. Mitochondrial genomes were assembled using NOVOplasty [64].
Although these genomes are diverged by approximately 90 million years, this homology-based approach has been shown previously to be successful for reconstruct chromosomes in a fragmented D. plexippus genome [9]. In total, 282 Mb (87% of the genome) could be confidently assigned to chromosomes (S1 Fig). Scaffolds representing the Spiroplasma genome were identified based on read depth of remapped reads (S11 Fig) and homology to other available Spirolasma genomes. Annotation was performed using the RAST server pipeline [66,67].

Population sample resequencing and genotyping
This study made use of 42 newly sequenced D. chrysippus individuals, as well as previously sequenced individuals of the sister species, D. petilia (n=1), and next closest outgroup D. gilippus (n=2) [68] (S9 Table). Details of DNA extraction, sequencing and genotyping are provided in S1 Text. Briefly, DNA was extracted from thorax tissue and sequenced (paired-end, 150 bp) to a mean depth of coverage 20x or greater. Reads were mapped to the D. chrysippus reference assembly using Stampy [69] v1.0.31 and genotyping was performed using GATK version 3 [70,71]. Genotype calls were required to have an individual read depth ≥ 8, and heterozygous and alternate allele calls were further required to have an individual genotype quality (GQ) ≥ 20 for downstream analyses.

Genomic differentiation and associations with wing pattern
We used the fixation index (F ST ) to examine genetic differentiation across the genome among the three subspecies for which we had six or more individuals sequenced. F ST was computed using the script popgenWindows.py (github.com/simonhmartin/genomics_general release 0.2) with a sliding windows of 100 kb, stepping in increments of 20 kb. Windows with fewer than 20,000 genotyped sites after filtering (see above) were ignored.
To identify SNPs associated with the three Mendelian colour pattern traits (i.e. the A B and C loci) (Fig 1A), we used PLINK v1.9 [72] with the '--assoc' option, and provided quantitative phenotypes of 0, 1 or 0.5 for assumed heterozygotes, which causes PLINK to use the Wald test for quantitative traits. In addition to the quality and depth filters above, SNPs used for this analysis were required to have genotypes for at least 40 individuals, a minor allele count of at least 2, and to be heterozygous in no more than 75% of individuals. SNPs were also thinned to a minimum distance of 100 bp.
To examine relationships among diploid individuals in specific regions of interest, we constructed phylogenetic networks using the Neighbor-Net [73] algorithm, implemented in SplitsTree [74]. Pairwise distances used for input were computed using the script distMat.py (github.com/simonhmartin/genomics_general release 0.2).

Haplotype cluster assignment
To assign haplotypes to clusters in the BC supergene region, we first phased genotypes using SHAPEIT2 [75,76] using SNPs filtered as for association analysis above, except with a minor allele count of at least 4, and no thinning. Default parameters were used for phasing except that the effective population size was set to 3x10 6 . To minimise phasing switch errors, we analysed each 20 kb window separately. Cluster assignment for both haplotypes from each individual was based on average genetic distance to all haplotypes from each of three reference groups: D. c. dorippus, D. c. orientis, or D. c. alcippus (which is representative of chrysippus as they share the same alleles at the BC supergene). A haplotype was assigned to one of the three groups if its average genetic distance to members of that group was less than 80% the average distance to the other two groups, otherwise it was left as unassigned. Genetic distances were computed using the script popgenWindows.py (github.com/simonhmartin/genomics_general release 0.2).

Identification of neo-W specific sequencing reads
To identify females carrying the neo-W chromosome, we visualised the distribution of female-specific derived mutations that occur at high-frequency. Allele frequencies were computed using the script freq.py (github.com/simonhmartin/genomics_general release 0.2). Due to the absence of female meiotic crossing over in Lepidoptera, all females carrying the neo-W fusion should share a conserved chromosomal haplotype for the entire fused chromosome. To isolate this shared fused haplotype from the autosomal copy, we first identified diagnostic mutations as those that are present in a single copy in each member of the 'neo-W lineage' and absent from all other individuals and outgroups. We then isolated the sequencing read pairs from each of these females that carry the derived mutation (S10 Fig). This resulted in a patchy alignment file, with a stack of read pairs over each diagnostic mutation. Based on these aligned reads, we genotyped each individuals as described above, except here setting the ploidy level to 1, and requiring a minimum read depth of 3.

Diversity and divergence of the neo-W
The lack of recombination across the neo-W makes it possible to gain insights into its age.
Over time, mutations will arise that differentiate the neo-W from the recombining autosomal copies of the chromosome. We estimated this divergence based on average heterozygosity in females carrying the neo-W, and compared it to heterozygosity from contact-zone individuals not carrying the neo-W. Heterozygosity was computed using the Python script popgenWindows.py (github.com/simonhmartin/genomics_general release 0.2) focusing only on the colinear portion of the chromosome (i.e. the distal portion from 11 Mb onwards), which is outside of the BC supergene. Heterozygosity was computed in 100 kb windows, and windows were discarded if they contained fewer than 20,000 sites genotyped in at least two individuals from each population.
A recent spread of the neo-W through the population should also be detectable in the form of strong conservation of the neo-W haplotype in all females that carry it (i.e. reduced genetic diversity). We therefore computed nucleotide diversity (π) in 100 kb windows as above.
Reported values of π and heterozygosity represent the mean +-standard deviation across 100 kb windows.

Genealogical analyses
We produced maximum likelihood trees for the mitochondrial genome, neo-W and Spiroplasma genome, using PhyML v3 [77] with the GTR substitution model. Given the small number of SNPs in both the neo-W and Spiroplasma genome, regions with inconsistent coverage across individuals were excluded manually. Only sites with no missing genotypes were included.
We estimated the root node age for the neo-W using BEAST2 [78,79]  We tested for congruence between the neo-W and Spiroplasma trees using PACo [46], which assesses the goodness-of-fit between host and parasite distance matrices, with 100,000 permutations. Distance matrices were computed using the script distMat.py (github.com/simonhmartin/genomics_general release 0.2).

Analysis of synonymous and non-synonymous polymorphism
We computed P n /P s as as the ratio of non-synonymous polymorphisms per nonsynonymous site to synonymous polymorphisms per synonymous site. Synonymous and nonsynonymous sites were defined conservatively as 4-fold degenerate and 0-fold degenerate codon positions, respectively, with the requirement that the other two codon positions are invariant across the entire dataset. Only sites genotyped in all 15 females in the neo-W lineage were considered, and counts were stratified by minor allele frequency using the script sfs.py (github.com/simonhmartin/genomics_general release 0.1).

Butterfly rearing and molecular diagnostics
To generate a stock line that is cured of Spiroplasma infection, we treated caterpillars from all-female broods with Tetracycline, following Jiggins et al. [23]. A 'cured line' was initiated from a single treated female that had the heterozygous Cc 'transiens' phenotype ( Fig 1A). This female was crossed to a wild male (homozygous cc) to test for sex linkage of phenotype. The cured line was maintained through sibling crosses for six generations and the persistence of males indicated that Spiroplasma had been eliminated.
We then applied a molecular test for sex linkage of chr15 using the F5 brood from the cured line. We designed two separate PCR diagnostics based on SNPs segregating on chr15 to distinguish between the two chromosomes of the male and the female parents (S11 Table). PCR was performed using the Phusion HF Master Mix and HF Buffer (New England Biosciences).
To screen for Spiroplasma infection, we designed a PCR assay targeting the glycerophosphoryl diester phosphodiesterase (GDP) gene (S11 Table). PCR was performed as above. We confirmed the the sensitivity of this diagnostic by analyzing individuals of known infection status based on whole genome sequencing (12 infected and 11 uninfected).
To investigate whether Spiroplasma infection was always associated with a single mitochondrial haplotype, we designed a PCR RFLP for the Cytochrome Oxidase Subunit I (COI) that differentiates the infected 'K' lineage from uninfected (S10 Table). PCR was performed as

Data Availability
Raw genomic data and assemblies are available from GenBank (accession numbers to be provided upon acceptance), all processed data files underlying all figures are available from the Dryad Repository (accession number available upon acceptance). Scripts used for data analysis are available from https://github.com/simonhmartin/genomics_general.