Deciphering colorectal cancer genetics through multi-omic analysis of 100,204 cases and 154,587 controls of European and east Asian ancestries

Colorectal cancer (CRC) is a leading cause of mortality worldwide. We conducted a genome-wide association study meta-analysis of 100,204 CRC cases and 154,587 controls of European and east Asian ancestry, identifying 205 independent risk associations, of which 50 were unreported. We performed integrative genomic, transcriptomic and methylomic analyses across large bowel mucosa and other tissues. Transcriptome- and methylome-wide association studies revealed an additional 53 risk associations. We identified 155 high-confidence effector genes functionally linked to CRC risk, many of which had no previously established role in CRC. These have multiple different functions and specifically indicate that variation in normal colorectal homeostasis, proliferation, cell adhesion, migration, immunity and microbial interactions determines CRC risk. Crosstissue analyses indicated that over a third of effector genes most probably act outside the colonic mucosa. Our findings provide insights into colorectal oncogenesis and highlight potential targets across tissues for new CRC treatment and chemoprevention strategies. CRC, which affects approximately 1.9 million people worldwide annu-ally 1 , has a strong heritable basis 2 . Our understanding of CRC genetics has been informed by genome-wide association studies (GWASs), which have so far identified 150 statistically independent risk variants 3,4 . To provide a comprehensive description of CRC genetics, we brought together the great majority of GWASs performed to date. We complemented GWASs with transcriptome- and methylome-wide association analyses (TWASs and MWASs; Fig. 1). Through integration of these data, we inves-tigated the genes and mechanisms underlying established and new CRC risk loci. We identified credible effector genes and the tissues in which they act, informing our understanding of colorectal tumorigenesis.

Colorectal cancer (CRC) is a leading cause of mortality worldwide. We conducted a genome-wide association study meta-analysis of 100,204 CRC cases and 154,587 controls of European and east Asian ancestry, identifying 205 independent risk associations, of which 50 were unreported. We performed integrative genomic, transcriptomic and methylomic analyses across large bowel mucosa and other tissues. Transcriptome-and methylome-wide association studies revealed an additional 53 risk associations. We identified 155 high-confidence effector genes functionally linked to CRC risk, many of which had no previously established role in CRC. These have multiple different functions and specifically indicate that variation in normal colorectal homeostasis, proliferation, cell adhesion, migration, immunity and microbial interactions determines CRC risk. Crosstissue analyses indicated that over a third of effector genes most probably act outside the colonic mucosa. Our findings provide insights into colorectal oncogenesis and highlight potential targets across tissues for new CRC treatment and chemoprevention strategies. CRC, which affects approximately 1.9 million people worldwide annually 1 , has a strong heritable basis 2 . Our understanding of CRC genetics has been informed by genome-wide association studies (GWASs), which have so far identified 150 statistically independent risk variants 3,4 . To provide a comprehensive description of CRC genetics, we brought together the great majority of GWASs performed to date. We complemented GWASs with transcriptome-and methylome-wide association analyses (TWASs and MWASs; Fig. 1). Through integration of these data, we investigated the genes and mechanisms underlying established and new CRC risk loci. We identified credible effector genes and the tissues in which they act, informing our understanding of colorectal tumorigenesis.

Genetic architecture of CRC
We performed a meta-analysis of CRC GWAS datasets, comprising 100,204 CRC cases and 154,587 controls (73% European and 27% east Asian ancestry; Supplementary Tables 1 and 2). We identified 205 associations, including 37 SNPs at new loci (sentinel risk SNPs > 1 megabase (Mb) from another significant SNP), 13 independent new risk SNPs in conditional analysis (Table 1) and 155 previously reported SNPs or proxies (Table 1, Supplementary Tables 3 and 4 and Supplementary Figs. 1 and 2). There was limited heterogeneity ascribable to population effects (Supplementary Table 2 and Supplementary Fig.  3), although four risk variants (rs12078075, rs57939401, rs151127921 and rs5751474) were monomorphic in east Asian participants (Table 1).
Using linkage disequilibrium (LD) score regression (LD hub), we estimated the heritability of CRC attributable to all common genetic variants to be similar in Europeans (h 2 = 0.11, s.d. = 0.008) and east Asians (h 2 = 0.09, s.d. = 0.006), which translates to 73% of familial CRC risk. Restricting estimates to the 205 GWAS-significant SNPs explained 19.7% of this familial risk. We evaluated the performance of a polygenic risk score (PRS) based on these SNPs in two cohorts independent of the GWAS discovery samples 5,6 . For Europeans and east Asians, individuals in the top PRS decile exhibited odds ratios (ORs) of 2.22 (95% Article https://doi.org/10.1038/s41588-022-01222-9 (P Bonferroni ; Table 2, Supplementary Tables 5 and 6 and Supplementary  Fig. 4). We extended the main TWAS to a transcript isoform-wide association study (TIsWAS), both to ascertain whether specific transcripts could account for TWAS associations and to identify previously unreported risk associations (Supplementary Tables 7 and 8). For a third of TWAS genes, a significant association with CRC risk was found for a single mRNA isoform (Supplementary Table 7). The TIsWAS also identified eight loci associated with CRC risk (Table 3). To improve power for discovery, and as some CRC-risk SNPs may not exert their effects in colorectal mucosa, we also conducted a crosstissue TWAS using our in-house RNA sequencing (RNA-seq) data and the full GTEx and Depression Genes and Networks (DGN) project data (49 tissues) 11 . confidence interval (CI): 1.92-2.57; P = 1.80 × 10 −26 ) and 1.96 (1.64-2.34; P = 8.9 × 10 −14 ) compared with the remaining individuals. Corresponding areas under the receiver operating characteristic curve (AUC) were 0.62 (95% CI: 0.60-0.63) and 0.60 (0.59-0.62).
To complement the TWASs, identify further CRC risk loci and gain mechanistic insights, we extended the PredictDB pipeline to perform MWASs based on quantitative methylation data from histologically normal colorectal mucosa (Supplementary Methods). We found significant associations between CRC risk and methylation of individual CpGs at 69 loci (Supplementary Tables 14 and 15). This included seven new independent risk loci ( Table 5). Risk SNPs may influence CRC risk through changes in the CpG methylation status of regulatory elements, leading to changes in gene expression. We therefore explored the interrelationship of gene expression, CpG methylation and CRC risk in colorectal mucosa for 6,722 genes with both TWAS and MWAS predictions. There was a strong tendency for genes to be represented in both TWASs and MWASs (P < 10 −7 , Fisher's exact test). Subsequently, we conditioned TWAS associations on the top MWAS-significant CpG within 1 Mb, finding that 67/91 (75%) genes did not retain a significant TWAS association (P Bonferroni > 5.50 × 10 −4 ; Supplementary Table 16). Our data are consistent with a model in which many CRC-risk SNPs act through changes in DNA methylation, although formal causality analysis could not be performed to exclude reverse causation or possible confounders.

Effector genes and biological pathways of CRC oncogenesis
A major, largely unfulfilled aim of cancer GWASs is to identify genes and functional mechanisms that may ultimately be clinically useful targets, for example, in chemoprevention. The large GWAS and TWAS datasets in the present study address this aim by enabling a detailed functional analysis of the molecular mechanisms contributing to CRC risk. As TWAS approaches do not identify causal genes directly, we used our  S-MultiXcan uses a two-sided F-test to quantify the significance of the joint fit of the linear regression of the phenotype on predicted expression from multiple tissue models jointly. All associations shown were transcriptome-wide significant after Bonferroni's correction for 12,017 genes with an S-MultiXcan model (that is, P = 0.05/12,017 = 4.16 × 10 −6 for the P S-MultiXcan ). Genes with boundaries <1 Mb apart were considered to be in the same cluster. This resulted in 13 CRC associations, for which all TWAS-significant genes were >1 Mb away from and independent of any GWAS-significant SNP (P GWAS < 5 × 10 −8 ). As expected SNPs close to genome-wide significance were found in all cases. a Two further gene associations were <1 Mb from a GWAS-significant SNP, but analysis conditional on the SNP showed a minimally changed association (Supplementary Table 6) and remained significant at P = 4.16 × 10 −6 . No. indicates the number of new TWAS loci. The z-score and effect size are calculated as the mean across S-PrediXcan models from the TWAS reference datasets. The n models show the number of reference datasets for which the S-PrediXcan elastic nets produced genetically predicted expression models, with the n indep. (independent) showing the number of those models that were statistically independent. The SNP with the lowest CRC GWAS P value within 1 Mb of the gene is also shown.  Table 18). This analysis was complemented with DEPICT based on the GWAS SNPs (https://data.broadinstitute.org/mpg/depict) (Supplementary Table  19). CRC effectors were principally enriched in genes regulating transforming growth factor (TGF)-β/bone morphogenetic protein (BMP), Wnt WNT and Hippo pathways. A number of the credible effector genes that map to these pathways have no established role in CRC, including the intestinal stem cell regulator ZNRF3 (ref. 12 ), the TGF repressor LEMD3 (ref. 13 ) and the epithelilal-to-mesenchymal transition (EMT) regulator RREB1 (ref. 14 ).
To complement the pathway analysis, we performed gene-level functional annotation based on the principal cellular function of each effector gene as reported in the literature (Fig. 2 and Supplementary  Table 20); 36 genes (mostly Wnt and BMP family members) were annotated to colorectal homeostasis (that is, cellular stemness/differentiation). Intriguingly, 16 genes (including ARHGEF19, ARHGEF4, GNA12, RHOG, TAGLN, TSPAN8, STARD13 and LLGL1) were linked to cell migration through RhoA/ROCK signaling. We found eight genes (SPSB1, PIK3C2B, DUSP1, LRIG1, GAB1, RREB1, MAPKAPK5-AS1 and PDGFB) to act within the Ras/Raf growth factor-signaling pathway. In addition to the previously reported association at FUT2, the new fucosyltransferase effector genes FUT3 and FUT6 supported a relationship between the gut microbiome and CRC risk 15 . Inflammation is important in CRC 16 and the TWAS association at the FADS gene cluster and PTGES3 specifically highlighted the role of prostaglandin metabolism in CRC risk. Finally, our data also indicated several effector genes with roles in ion transport and cytoskeletal components ( Fig. 2 and Supplementary Table 20).
Although our pathway analysis and functional annotation indicated that the colorectum was the probable target tissue of many effector genes (Supplementary Tables 19 and 20), some genes were associated with principal roles in other tissue types, for example, neuronal cells (LINGO4, TULP1 and CNIH2) and leukocytes (TOX, TOX4 and MAF, plus many candidate genes within the major histocompatibility complex (MHC) region; Supplementary Table 20). We therefore performed a systematic analysis of effector gene tissue specificity, based on the premise that TWAS associations tend to be present in tissues in which a gene functionally affects CRC risk. Crosstissue analysis showed that all but one effector gene exhibited a TWAS association (false discovery rate (FDR) TWAS < 0.05) in at least one tissue and 52 (34%) genes showed an association in multiple tissues ( Supplementary Fig. 5). For 26 (17%) genes, associations were confined to the colorectal mucosa (P TWAS Bonferroni significant in mucosa, P TWAS > FDR elsewhere). In contrast, 67 genes (43%) showed no evidence of a TWAS association in colorectal mucosa (FDR TWAS > 0.05). Notably, 12 (8%) gene associations were present only in immune cells ( Supplementary Fig. 5 and Supplementary Table 11) and 4 (3%) were restricted to mesenchymal cells (Supplementary Fig. 5 and Supplementary Table 12).

Linking CRC risk to other traits
To gain insight into the role of potentially modifiable risk factors in CRC genetics, we performed crosstrait LD score regression analyses 17 using publicly available GWAS summary statistics for 171 phenotypes. Twelve genetic correlations remained significant (two-sided z-test, Bonferroni-corrected P < 2.93 × 10 −4 ). Notably, positive associations with CRC risk (Supplementary Table 21) included insulin resistance (raised fasting insulin and glucose), smoking and obesity (body mass index, waist:hip ratio and waist circumference), traits that have previously been reported in observational epidemiological studies to be associated with CRC risk 3,18,19 . These associations not only highlight shared biology, but also suggest that public health interventions to reduce cardiometabolic disease will additionally lower CRC burden.

Discussion
We report a comprehensive genetic analysis of CRC risk in the general population. To identify the most credible effector genes for each risk variant, we performed detailed annotation using tissue-specific gene expression and other relevant data types. Our study is twice as large as previous CRC GWASs and also includes participants of both European and east Asian ancestries, demonstrating that most loci are shared across these ancestral groups. This increased power for GWASs, coupled with complementary analyses, including TWASs and MWASs, identified 103 previously unreported risk associations and identified 155 effector genes. These data substantially expand our existing knowledge about the impact of common genetic variation on the heritable risk of CRC.
The availability of large, multi-omic datasets has allowed us to assign the most likely target/effector genes of GWAS and TWAS associations (Fig. 3) and confidence in these assignments will increase as additional functional data are reported in the literature. It is clear that pathways (for example, Wnt, BMP and Hippo) involved in normal intestinal homeostasis play important roles in CRC risk, suggesting that modulation of normal mucosal dynamics has the potential to As per Table 2, S-MultiXcan uses a two-sided F-test to quantify the significance of the joint fit of the linear regression of the phenotype on predicted expression from multiple tissue models jointly. All associations shown were transcriptome-wide significant after Bonferroni's correction for 27,941 transcripts with an S-MultiXcan model (that is, P = 0.05/27,941 = 1.79 × 10 −6 for the P S-MultiXcan ). New associations were called when >1 Mb from both a GWAS-significant SNP and a TWAS locus. As expected, all these loci showed evidence of a risk association in the full TWAS (FDR < 0.05, P < 2.86 × 10 −3 ). Transcripts with boundaries <1 Mb apart were considered to be in the same cluster. This resulted in seven CRC associations. a One further association was identified based on conditional TIsWAS analysis (Supplementary Table 8). Other annotations are as per Table 2.
Article https://doi.org/10.1038/s41588-022-01222-9 prevent colorectal neoplasia. The gut flora is intimately involved in normal bowel homeostasis and effector genes are likely to be involved in microbial interactions. By contrast, Ras pathway activity is thought to be more important during repair or tumorigenesis and the Ras effector genes that we have found may act after tumor initiation. Our finding of multiple risk genes involved in cell adhesion and migration naturally suggests roles in malignant progression, although effects earlier in tumorigenesis also remain plausible. Similarly, immune pathway effector genes could, in principle, have their effects on normal cell function or at any stage of tumorigenesis, from mediating day-to-day microbial interactions to killing of cells in early neoplastic transformation or established tumors. Crosstissue analyses indicated that the colorectal mucosa was the most probable site of action of many effector genes, but some genes are more likely to act in different tissue types. For example, it is highly likely that genes such as HIVEP1, LIF, SH2B3, TOX and TOX4 (and probably genes in the MHC region) influence the development of CRC through immune cell variation, and that EDNRB influences risk through effects on blood vessels. An unexpected finding was that several credible effector genes have primary roles in neurogenesis, raising the intriguing possibility that the enteric nervous system is involved in CRC risk.
Although germline genetics has guided the development of drugs to prevent cardiovascular disease (for example, statins and PCSK9 inhibitors), such a paradigm has yet to be realized for cancer. As almost all CRCs develop from colonic polyps, and up to 40% of the screened population will be diagnosed with one or more polyps, CRC is particularly well suited to evaluate new chemopreventive agents. Our findings highlight candidate targets for chemoprevention, such as gut microbiota, prostaglandin metabolism and signaling through the Wnt, BMP and Hippo pathways. Specific potential targets in the near term include CDK6, which is targeted by drugs in clinical use for cancer therapy, such as palbociclib and ribociclib. Similarly, Wnt pathway activity can be targeted indirectly using Porcupine inhibitors (for example, LGK974, ETC159, CGX-1321 and RXC004) that prevent Wnt ligand palmitoylation 20 , although future approaches may more specifically target effector genes such as WNT4 and ZNRF3. Hence, adapted forms S-MultiXcan uses a two-sided F-test to quantify the significance of the joint fit of the linear regression of the phenotype on predicted expression from multiple tissue models jointly. TWAS tests were performed separately for the following tissue categories: 'Colon_sigmoid': GTEx (n = 318 samples; P Bonferroni = 8.12 × 10 −6 for the P S-PrediXcan ); 'Immune': DGN + GTEx Cells_EBV-transformed_ lymphocytes + GTEx Whole_Blood + GTEx_Spleen (n = 1,966 samples; P Bonferroni = 3.34 × 10 −6 for the P S-MultiXcan ); 'Mesenchymal': GTEx Adipose_Subcutaneous + GTEx Adipose_Visceral_ Omentum + GTEx Cells_Cultured_fibroblasts (n = 1,533 samples; P Bonferroni = 3.96 × 10 −6 for the P S-MultiXcan ); 'Gastrointestinal': the six in-house colorectal mucosa datasets + GTEx Pancreas + GTEx Liver + GTEx Stomach + GTEx Terminal_Ileum + GTEx Oesophageal_Mucosa + GTEx Colon_Transverse (n = 2,615 samples; P Bonferroni = 3.34 × 10 −6 for the P S-MultiXcan ); 'All': the six in-house colorectal mucosa datasets + all GTEx 49 tissues + DGN (n = 16,832 samples; P Bonferroni = 2.31 × 10 −6 for the P S-MultiXcan ). Other annotations are as per Table 2.
Article https://doi.org/10.1038/s41588-022-01222-9 of these drugs or modified dosing regimens could be repurposed for chemoprevention, possibly initially for high-risk groups, such as those with in the top PRS percentiles or cases of Lynch syndrome. Based on our data, we speculate that, in the longer term, targeted approaches based on demethylation of specific CpG sites from MWASs could be an effective means of prevention with minimal toxicity. The identification of additional risk associations has the potential to provide further biological insights into CRC. However, cohort S-MultiXcan uses a two-sided F-test to quantify the significance of the joint fit of the linear regression of the phenotype on predicted expression from multiple tissue models jointly. All associations shown were methylome-wide significant after Bonferroni's correction for 88,888 CpGs with an S-PrediXcan model (P = 0.05/88,888 = 5.62 × 10 −7 for the P S-MultiXcan ). Pairs of CpGs or strings of adjacent CpGs within 1 Mb of each other were considered to lie within the same cluster. Five CRC associations were found for which all CpGs were >1 Mb away from GWAS-significant SNP (P GWAS < 5 × 10 −8 ), although near a SNP close to genome-wide significance. a Two further associations for four CpGs were identified based on conditional MWAS analysis (Supplementary Table  15). New CpG hits were all independent of each other and of GWAS SNPs and TWAS genes. Other annotations are as per Table 2. numbers required in European and east Asian populations to identify additional risk SNPs through GWAS are likely to be prohibitive. Indeed, to identify SNPs explaining 80% of the heritable risk of CRC risk loci, thus providing comprehensive biological insights, will require sample sizes in excess of 500,000 cases and at least that number of controls ( Supplementary Fig. 6). This is far higher than a previous estimate 21 , which was based on a small subset of the GWASs included herein. Extending GWASs to African and other populations may detect further risk SNPs, including population-specific ones. Complementary approaches such as TWASs and MWASs are demonstrably useful for the discovery of further risk loci, especially if, and when, reference datasets from multiple populations are made available.
Overall, our findings demonstrate the power of multi-omics to provide new insights into the biological basis of CRC, including both the identification of candidate effector genes and the support for previously unsuspected functional mechanisms. Importantly, several of the genes and pathways we have identified are potential targets for CRC treatment or chemoprevention.

Online content
Any methods, additional references, Nature Portfolio reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/s41588-022-01222-9.

Methods
The research presented in the present study complies with all relevant ethical regulations and has been approved by the South Central Ethics Committee (UK) (reference no. 17/SC/0079).

Criteria for declaring new CRC risk associations
Multi-omic studies present inherent difficulties for deciding what constitutes a new GWAS, TWAS or MWAS association. To declare statistically significant associations for GWASs, we have used the established threshold of P = 5 × 10 −8 . We applied this to both loci >1 Mbp from a previously known SNP and analyses conditioned on the most significant SNP within the 1-Mb region. For TWASs or MWASs we also followed convention and used Bonferroni's correction P = 0.05/N, where N is the number of gene models successfully derived from the reference tissue. Furthermore, for TIsWASs and crosstissue TWASs, we used Bonferroni-corrected P-value thresholds for significance in each of the reference tissue datasets separately, owing to the overlap between tissue groups and the fact that many expression quantitative trait loci (eQTLs) are present across tissues. A further common practice is that a new association should be located >1 Mb from another association (from the present study or previously reported), whether a genome-wide significant GWAS SNP, a TWAS gene or an MWAS CpG. However, use of the 1-Mb distance convention introduces a further problem in that, although the location of a GWAS SNP and MWAS CpG can be defined precisely, the location of a gene cannot. We therefore defined a gene's boundaries by the canonical transcript and new associations must lie 1 Mb from both those boundaries. As TWAS and MWAS associations can affect multiple nearby genes or CpGs (for example, owing to co-regulation or LD between eQTLs and methylation (m) QTLs), we have conservatively assigned each TWAS and MWAS association to a single locus (defined as a group of genes or CpGs that are significantly associated with CRC risk and lie <1 Mb apart). Locus boundaries must be >1 Mb from another association to be declared an independent risk association.
We also performed conditional analyses across GWASs, TWASs and MWASs. This is standard practice in GWASs (see below) 22 , whereby nearby SNPs with no or limited correlation can be independently associated with CRC risk. Conditioning TWASs, TIsWASs and MWASs on GWASs using a set-based mixed effects score test (sMIST) also allowed us to identify risk associations that were independent of the GWAS associations within 1 Mb, based on a P Conditional that (1) remained Bonferroni significant at the unconditional analysis threshold and (2) was within one order of magnitude of P Unconditional . A much larger number of TWAS and MWAS associations fulfilled the only criterion: after conditioning on a GWAS association within 1 Mb (Supplementary Tables 6, 8 and  15). Although we could not exclude the possibility that some of these associations resulted from additional SNPs independent of a nearby GWAS SNP, for example, we conservatively did not declare these to be new risk associations.

GWAS data analysis
Meta-analysis. Within each of the 31 analytical units, we conducted logistic regression under a log(additive) model to examine the association between allelic dosage for each genetic variant and the risk of CRC, adjusted for unit-specific covariates. Meta-analysis under a fixed-effects, inverse-variance weighted model was performed using META v.1.7 (ref. 23 ). Variants in the meta-analysis included only those with an imputation quality score (info/R 2 ) > 0.4, minor allele frequency (MAF) > 0.005 and seen in at least 15 analytical units. The I 2 (fraction of variance attributable to between study heterogeneity) statistic was calculated to quantify between-study heterogeneities, and variants with I 2 > 65% were excluded. A total of 8,782,440 variants was taken forward in the meta-analysis. Meta-analysis of risk estimates was conducted under an inverse-variance-weighted, fixed-effects model 3 . None of the analytical units showed strong evidence of genomic inflation ( ranged from 0.95 to 1.28), and the value for the meta-analysis was 1.30 ( 1,000 = 1.01; Supplementary Fig. 3). To account for any ancestral differences between analytical units, we implemented MR-MEGA v.0.1.5 (ref. 24 ) including ten principal components (PCs) in the analysis. To measure the probability of associations being false positives, the Bayesian false discovery probability 3 was calculated based on a plausible OR of 1.2 (based on the 95th percentile of the meta-analysis OR values) and a prior probability of association of 10 −5 .

Definition of known and new GWAS SNP risk associations.
We identified all previously reported CRC associations at P < 5 × 10 −8 by referencing the NHGRI-EBI GWAS Catalog of human GWASs and searching PubMed (performed June 2021) 3 . Additional articles were ascertained through references cited in primary publications (Supplementary Table 4). Where multiple studies reported associations in the same region (r 2 > 0.1 and within 500 kb to 1 Mb of the index SNP), we considered all variants with genome-wide significant associations. Given the improved power and coverage of our study over previous works, we identified the most strongly associated variant at each known signal and used lead variants for further analyses, rather than the previously reported index variants (Supplementary Table 3). A genome-wide significant risk variant was considered to be new if >1 Mb from a known risk variant.

GWAS conditional analysis.
To identify independent association signals at the discovered CRC risk associations, we performed conditional analyses using Genome-wide Complex Trait Analysis Conditional and Joint Analysis (GCTA-COJO) 22 on the meta-analysis summary statistics. Analyses were performed separately for European and east Asian ancestry populations, to account for LD structural differences. The conditioned data were meta-analyzed together as described above and associations with P Conditional < 5 × 10 −8 were considered to be new secondary associations. As a reference for LD estimation, we made use of genotyping data from 6,684 unrelated samples of east Asian ancestry and 4,284 samples from combined UK10K and European samples in 1000 Genomes.

Heritability analysis
We used the LD score (LDSC) regression package with default parameters as implemented in LD Hub 25 to estimate the SNP heritability from the GWAS meta-analysis summary statistics data 3 . SNPs were filtered to HapMap3 SNPS using 1000 Genomes EUR MAF >5%. SNPs with imputation information score <0.9, MAF < 0.01 and within the MHC region (that is, SNPs between 26 Mb and 34 Mb on chromosome 6) were excluded. Precalculated LD score files computed using 1000 Genome European data were used.
The contribution of risk SNPs to the familial risk of CRC was calculated as ∑ k log λ k log λ 0 , where λ 0 is the familial risk to first-degree relatives of CRC cases, assumed to be 2.2 (ref. 26 ) and λ k is the familial relative risk associated with SNP k, calculated as λ k = p k r 2 k +q k (p k r k +q k ) 2 , where p k is the risk allele frequency (RAF) for SNP k, q k = 1 − p k and r k is the estimated per-allele OR from the meta-analysis 3,27 .

Pleiotropy analysis
We explored crosstrait pleiotropic effects using the LDSC regression package with default parameters 28 as implemented in the LD Hub. The summary statistics for 252 phenotypes were extracted from the LD Hub. For comparability of results across the traits, we limited our analysis to the CRC GWASs of European ancestry. After excluding GWASs performed on non-European cohorts, traits where the LD Hub output came with the following warning messages-'Caution: using this data may yield results outside bounds due to relative low z-score of the SNP heritability of the trait' and 'Caution: using this data may yield less robust results due to minor departure of the LD structure', as well as highly correlated traits-171 phenotypes were included in the Article https://doi.org/10.1038/s41588-022-01222-9 analysis. The departure of the LD structure means departure from the assumption of equal LD structure between two datasets, for example, due to differences in population structure between the study populations. SNPs from the MHC (chr6: 26,000,000 to ~34,000,000) region were removed for all traits before analysis.

Sample size prediction
To estimate the sample size required to detect a given proportion of the GWAS heritability, we made use of GENESIS software (GENetic Effect-Size distribution Inference from Summary-level data) 29 , which implements a likelihood-based approach to model the effect-size distribution in conjunction with LD information, using the three-component model (mixture of two normal distributions). The percentage of GWAS heritability explained for a projected sample size was based on power calculations for the discovery of genome-wide significant SNPs 3 . The genetic variance explained was calculated as the proportion of total GWAS heritability explained by SNPs reaching genome-wide significance at a given sample size.

TWAS analysis
Gene expression models for the six in-house expression datasets were generated using the PredictDB v.7 pipeline for a total of 1,077 participants 7,8 . Elastic net model building with tenfold crossvalidation was performed independently for each dataset. The elastic net models for GTEx v.8 Colon Transverse were obtained from the PredictDB data repository (http://predictdb.org) and had been generated using the same pipeline. Models were computed using HapMap2 SNPs ±1 Mb from each gene, together with covariate factors estimated using PEER 30 , clinical covariates when appropriate (age, sex and, where appropriate, case-control status, type of polyp and anatomical location in the colorectum), and three PCs from the individual dataset's SNP genotype data. Transcriptome-wide association tests were then performed for each dataset with the S-PrediXcan feature using summary statistics from the GWAS meta-analysis. We used individual-level GWAS data from GECCO (n = 8,725) to derive the LD reference covariance matrix. S-MultiXcan analysis was then undertaken across datasets. Significant associations were declared using Bonferroni's correction (0.05 per number of gene models from S-MultiXcan). As recommended 31 , an additional filter of a TWAS association statistic, P S-PrediXcan ≤ 10 −4 , in at least one individual reference dataset was implemented to minimize potential errors due to LD mismatches. Genes localizing to the human leukocyte antigen/ MHC region (chr6: 28,477,797-33,448,354 bp) were excluded. TIsWAS analyses were likewise performed using transcript-level data from the SOCCS, BarcUVa-Seq and GTEx Colon Transverse datasets.
The predictive performance of the models for TWASs and TlsWASs across the datasets was similar. For the TWAS models the number of genes successfully predicted with R 2 > 0.01 (equivalent of R > 0.1) varied between 3,308 for the BarcUVa dataset and 5,092 for SOCCS rectum, whereas GTEx Colon Transverse models were available for 6,295 genes. The mean coefficient of variation (CV)-based prediction R 2 for all genes varied between 0.09 (25th to 75th percentiles 0.04-0.12) for BarcUVa and 0.19 for INTERMPHEN (0.07-0.24), compared with 0.12 (0.04-0.16) for the GTEx Colon Transverse model. The numbers were slightly higher when comparing the overlapping 736 genes only. The in-house TlsWAS models were constructed for a lesser number of transcripts (n = 4,632 for BarcUVa dataset and n = 11,262 for SOCCS rectum dataset) compared with GTEx Colon Transverse (n = 15,500), owing to greater read depth and larger sample size for GTEx. The mean R 2 for all genes varied from 0.07 (0.03-0.09) for BarcUVa to 0.16 for SOCCS colon (0.07-0.21). GTEx Colon Transverse had a mean R 2 of 0.10 (0.03-0.12).

MWAS analysis
Methylation β values were calculated based on the manufacturer's standard, ranging from 0 to 1. Quality control and data normalization were performed in R using the ChAMP software pipeline for the EPIC and 450,000 arrays 32 . Briefly, we filtered out failed probes with detection P > 0.02 in >5% of samples, probes with more than three reads in >5% of samples per probe and all non-CpG probes. Samples with failed probes >0.1 were also excluded from downstream analyses. We discarded all probes with SNPs within 10 bp of the interrogated CpG (from the 1000 Genomes project, CEU population) 33 and probes that ambiguously mapped to multiple locations in the human genome with up to two mismatches 31 . We considered only probes mapping to autosomes and those overlapping between the EPIC and the 450,000 arrays. Normalization was achieved using the Beta MIxture Quantile method. Per-probe methylation models were created using the Pre-dictDB pipeline on the normalized methylation matrix and the genotypes as per TWAS eQTL analysis. To optimize power, we restricted our analysis to 263,341-238,443 (for the 450,000 array) and 377,678 (for the EPIC array) probes annotated to Islands, Shores and Shelves, and discarded 'Open Sea' regions. Further analysis was performed as per the TWAS. CpGs were annotated to a known GWAS signal if within 1 Mb of a genome-wide significant GWAS-risk SNP and otherwise considered to be new. For the MWAS models, the number of CpG probes successfully predicted with R 2 > 0.01 (equivalent of R > 0.1) varied from 24,325 for INTERMPHEN rectum to 30,385 for COLONOMICS. The mean CV-based prediction R 2 for all genes varied from 0.14 (25th to 7th percentiles 0.07-0.16) for INTERMPHEN proximal dataset to 0.19 for SOCCS (0.07-0.25).

Conditional analysis using sMiST for TWAS and MWAS findings
S-MultiXcan is a powerful method for assessing predicted gene expression across multiple tissues and samples, but cannot readily undertake conditional analysis to determine independence of a TWAS or MWAS association from other GWAS, TWAS or MWAS associations. We therefore used the sMiST 34 method to perform conditional analysis of TWAS, TIsWAS and MWAS data adjusting for GWAS-risk SNPs. The sMiST can assess the total effect, including both predicted molecular features (gene expression or methylation) and the residual direct effects of SNPs that are not explained by predicted molecular features, on CRC risk. To be consistent with S-MultiXcan, we assessed only the association of predicted molecular features. We first confirmed that there was a strong correlation between the sMiST and S-MultiXcan results, with minimal discordance (Supplementary Fig. 4). In view of this, we used sMiST to perform conditional TWAS and MWAS analysis for each of the significantly associated genes or CpGs, respectively, conditioning on the lead GWAS-significant SNP (if present) within 1 Mb (Supplementary  Tables 6, 8 and 15). We also conditioned TWAS on TWAS, TIsWAS on TIsWAS and MWAS on MWAS and conducted TWAS conditioned on MWAS analyses for the genes for which both significant genetically predicted expression and methylation models were produced by the PredictDB pipeline. Where multiple CpGs were annotated to the same gene, we selected the association with the lowest MWAS P value. We determined the number of genes associated (at Bonferroni-corrected P = 0.05/6,722 = 7.44 × 10 −6 ) with a CRC risk in TWAS and MWAS (n = 43), or TWAS only (n = 54), MWAS only (n = 91) or neither (n = 6,534). Article https://doi.org/10.1038/s41588-022-01222-9

Effector gene identification
To identify the most credible target or 'effector' genes at each CRC risk locus, a pragmatic approach was utilized. After excluding the MHC region, pseudogenes and transcripts of uncertain significance (generally RPNNNN or ACNNN), the following hierarchical inclusion criteria were used: 1. For significant (Bonferroni-corrected P TWAS < 0.05) TWAS genes at a locus, the gene most strongly associated with CRC risk in any tissue, as long as its P TWAS was at least an order of magnitude lower than any other gene at the locus (n = 112); 2. For loci included under (1), additional genes that remained significant (FDR < 0.05) in conditional TWAS-TWAS analysis including the lead gene (n = 9); 3. At GWAS loci not included under (1), the most significant (FDR < 0.05) TWAS gene, as long as its P TWAS was at least an order of magnitude lower than any other gene at the locus (n = 17); 4. TIsWAS analysis consistent with the approach used for TWAS as described in (1-3) above (n = 16); 5. Genes harboring missense or truncating variants in LD (r 2 > 0. 9) with sentinel GWAS SNPs (n = 1).
A set of 155 genes was identified, which corresponds to about two-thirds of the CRC risk loci from GWASs, TWASs and MWASs (Supplementary Table 17).

The AUC
We calculated the confounder-adjusted AUC of PRS in discriminating individuals with and without CRC by using the propensity score weighting to account for potentially different distribution of confounders between cases and controls 35 . We adjusted for age, sex and four PCs as confounders. We obtained the 95% CIs by bootstrapping and a total of 500 bootstrap samples was generated. We calculated adjusted AUCs using the R package ROCt.

Statistics and reproducibility
No statistical method was used to predetermine sample size. The experiments were not randomized. Data exclusion from each analysis is explained in the corresponding sections. Informed consent was obtained for all participants in the study. A description of the different datasets and cohorts used is included in Supplementary Note.

Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability
Summary-level data for the full set of Asian and European GWASs are available through the GWAS catalog (accession no. GCST90129505). For individual-level data, CCFR, CORECT, CORSA_2 and GECCO are deposited in dbGaP (accession nos. phs001415.v1.p1, phs001315. v1.p1, phs001078.v1.p1, phs001903.v1.p1, phs001856.v1.p1 and phs001045.v1.p1). NSCCG and COIN are available in the European Genome-Phenome Archive under accession nos. EGAS00001005412 (NSCCG) and EGAS00001005421 (COIN). UK Biobank data are available through http://www.ukbiobank.ac.uk and Finnish data through THL Biobank. Access to individual-level data for the remaining studies is controlled through oversight committees. CCFR 1 and CCFR 2 data can be requested by submitting an application for collaboration to the CCFR (forms, instructions and contact information can be located at www.coloncfr/collaboration.org). Applications for individual-level data from the QUASAR2 and SCOT clinical trials will be assessed by the translational research steering committees that oversee those studies. Individual-level data from the CORGI (UK1) study will be made available subject to standard institutional agreements. Application forms for these three studies, and for Scotland Phase 1, Scotland Phase 2, SOCCS, DACHS4 and Croatia, will be provided by emailing a request to access.crc.gwas.data@outlook.com. For access to CORSA_1, please contact gecco@fredhutch.org. For Generation Scotland (GS) access is through the GS Access Committee (access@generationscotland.org). Applications for the Lothian Birth Cohort data should be made through https://www. ed.ac.uk/lothian-birth-cohorts/data-access-collaboration. For details of the application process for Aichi1, Aichi2, BBJ, Guanzhou1, HCES, HCES2, Korea and Shanghai cohorts, please go to https://swhs-smhs. app.vumc.org or contact W.Z. at wei.zheng@vanderbilt.edu. CRC-relevant epigenome data were obtained from the National Center for Biotechnology Information's Gene Expression Omnibus database under accession nos. GSE77737 and GSE36401. Genetically predicted models of gene expression and methylation have been deposited in the Zenodo repository (https://zenodo.org/deposit/6472285).

Code availability
All bioinformatics and statistical analysis tools used in the present study are open source, details of which are available in Methods and Nature Portfolio Reporting Summary. No customized code was used to process or analyze data. Details on URLs used can be found in Supplementary Note.