Translation Comes First: Ancient and Convergent Selection of Codon Usage Bias Across Prokaryotic Genomes

Codon usage is the outcome of different evolutionary processes and can inform us about the conditions in which organisms live and evolve. Here, we present R_ENC’, which is an improvement to the original S index developed by dos Reis et al. (2004). Our index is less sensitive to G+C content, which greatly affects synonymous codon usage in prokaryotes, making it better suited to detect selection acting on codon usage. We used R_ENC’ to estimate the extent of selected codon usage bias in 1800 genomes representing 26 prokaryotic phyla. We found that Gammaproteobacteria, Betaproteobacteria, Actinobacteria, and Firmicutes are the phyla/subphyla showing more genomes with selected codon usage bias. In particular, we found that several lineages within Gammaproteobacteria and Firmicutes show a similar set of functional terms enriched in genes under selected codon usage bias, indicating convergent evolution. We also show that selected codon usage bias tends to evolve in genes coding for the translation machinery before other functional GO terms. Finally, we discuss the possibility to use R_ENC’ to predict whether lineages evolved in copiotrophic or oligotrophic environments.


Introduction
The genetic code consists of a set of molecular interactions used by all cells to translate information encoded in nucleotide sequences into proteins. The principle of these interactions lies in the assignment of 61 codons to the main 20 amino acids that make up proteins. The code is degenerate, meaning that the same amino acid can be specified by more than one synonymous codon. The relative frequency of these synonymous codons varies between genes and genomes. This uneven use of codons is known as codon usage bias (Grantham et al. 1981;Ikemura 1981aIkemura , b, 1985Plotkin and Kudla 2011;Parvathy 2022).
The codon usage of a gene is the result of the combined action of mutational bias, genetic drift, and natural selection (Bulmer 1991). When natural selection is not strong enough, the codon usage of a gene is mostly the result of genetic drift sampling on codons originated by mutation. This is particularly notable in the genomes of endosymbiotic bacteria of insects that tend to become A+T rich by nonadaptive processes (Martínez-Cano et al. 2015). In these cases, it is typical to find the majority (if not all) of the genes in the genome showing a bias towards A+T rich codons.
There are other occasions when natural selection is strong enough to modify the codon usage of genes. For instance, in highly expressed genes (like those encoding ribosomal proteins and translation factors), selection favors a codon usage that facilitates translation. This is, in highly expressed genes, selection favors codons that best match the abundance of tRNA species in the cytoplasm. This effect is more pronounced in species that have short-generation times (Rocha 2004;Sharp et al. 2005; Vieira-Silva and Rocha 2010). Genes showing this kind of selected codon usage bias, which is also called translational selection, are translated faster and released earlier from the ribosome, making their translation more efficient (Quax et al. 2015;Gustafsson et al. 2004;Ran and Higgs 2012;Frumkin et al. 2018). In genomes where selection affects codon usage bias, highly expressed genes typically show codon usage that correlates with tRNA abundance; while for genes with lower levels of expression, codon usage is determined by mutational bias.
Selected codon usage bias has been related to the lifestyle of bacteria and archaea. Prokaryotes showing selected codon usage bias tend to live in a wider range of habitats (like pathogenic bacteria that can live outside or within their host); while prokaryotes with a more specialized lifestyle (like thermophiles, mutualistic endosymbiotic bacteria, or obligate intracellular parasites) tend to show no selected codon usage bias (Botzman and Margalit 2011). According to these observations, selected codon usage bias allows bacteria to grow faster and to face metabolic diverse environments where competition is strongest.
Organisms that are able to grow in environments where nutrients are scarce are known as oligotrophs. By contrast, organisms that grow in environments with larger nutrient availability are known as copiotrophs (Koch 2001). Oligotrophs tend to grow slowly and make efficient use of resources, while copiotrophs are capable of growing faster when nutrients are available (Roller and Schmidt 2015). An association between selected codon usage bias and nutrient availability (oligotrophic/copiotrophic) has been recently demonstrated. Copiotrophic environments favor bacteria showing translational selection in their ribosomal proteincoding genes, while genomes of bacteria in oligotrophic environments suggest codon usage bias selection playing a relaxed role (Okie et al. 2020).
There have been several attempts to measure the strength of selected codon usage bias. One of the first was developed by Wright (1990) and then modified by Fuglsang (2006). In this work, we propose an improvement to a previous method developed by dos Reis et al. (2004). We then use this improved method to revisit (a) the correlation between selected codon usage bias and minimal generation time taking into account the phylogenetic structure of the data; (b) the pattern of selected codon usage bias in different functional GO terms across diverse lineages of bacteria and archaea; and (c) the evolutionary precedence of selected codon usage bias of the genes coding for the translation apparatus relative to other functional GO terms.

Improving Selected Codon Usage Bias Estimation
To estimate the strength of selected codon usage bias we relied on the method proposed by dos Reis et al. (2004). In summary, this method works as follows. For a given genome: first (1) measure the tRNA adaptation index (tAI); then (2) measure the difference between the expected and observed number of codons (dNc); and (3) calculate a Pearson correlation coefficient between tAI and dNc. This correlation is named S by dos Reis et al. (2004). The higher the correlation coefficient, the stronger natural selection is predicted to have shaped codon usage bias on a whole-genome basis.
There are two important things to note about S. The first one is that S measures the coadaptation between the codon usage and the tRNA gene pool (this pool is taken as a proxy of the abundance of cytoplasmic tRNA). The second one is that S measures the effects of selection on codon usage bias irrespectively of a specific set of protein-coding genes. Therefore, S does not assume a priori that there is a set of genes (like those coding for ribosomal proteins) on which selected codon usage will be strongest. In this last respect, S differs from other methods that rely on ribosomal proteincoding genes as gold standards to measure the strength of selected codon bias.
We improved S by making dNc less sensitive to G+C content as follows. First, dNc is the difference between the expected number of codons due to G+C content at third codon positions (Nc e ) and the observed number of codons (Nc) in a given gene [Eq. 1]. It measures how far the effective number of codons depart from the neutral expectancy due to the G+C content of 3rd codon positions. where x g is the GC content of a gene g, and a, b, and c are constants of which values were estimated empirically.
To calculate the observed number of codons (Nc), dos Reis et al. (2004) used the original formula developed by Wright (1990) where And n a is the observed number of codons for the a amino acid; p i is the frequency of the ith codon; and k is the number of synonymous codons for the amino acid of interest.
Unfortunately, Nc is affected by G+C content since it assumes equal usage of all codons. Here we used instead, an improved statistic (Nc') to calculate the expected effective number of codons that better accounts for background nucleotide composition (Novembre 2002). In this new formulation, Novembre (2002) used the Pearson X 2 statistic "to (1) dNc = Nc e − Nc. (2) quantify the departure of each codon's usage (p i ) from some expected usage (e i ) for each amino acid" [Eq. 5]: And using the X 2 a values to calculate F' a : And then the F' a are used to calculate Nc' with Eq. 7 which is equivalent to Eq. 3 (see Novembre 2002 for details).
Here we estimate selected codon usage bias by measuring the correlation between tAI and Nc' and multiplying Pearson's correlation coefficient by -1 (to make it comparable to S). We call this statistic R_ENC' to differentiate it from the S of dos Reis et al. (2004).

Estimation of Selected Codon Usage Bias from Prokaryotic Genomes
Nc, dNc, Nc,' and tAI were calculated on all proteincoding regions encoding a minimum of 100 amino acids from 1800 genomes downloaded from Genome list NCBI browser searching only for complete prokaryotic genomes (https:// www. ncbi. nlm. nih. gov/ genome/ brows e/# !/ overv iew/). These genomes are representative of 26 bacterial and archaeal phyla (Supplementary Table S1). The strategy for these calculations was as follows. Nc was calculated using CodonW (J Peden, version 1.4.2 http:// codonw. sourc eforge. net/) taking as input the CDS nucleotide sequences from each genome. dNc was calculated by using an in-house R script on the CodonW output. CodonM was used to calculate the codon appearance in CDS (dos Reis et al. 2004). For the calculation of tAI, all tRNAs were annotated de novo with tRNAscan-SE (Lowe & Eddy, version 2.0, http:// lowel ab. ucsc. edu/ tRNAs can-SE/). And then, tAI was calculated in R using the tAI function from dos Reis et al. (2004) and the CodonM output. Nc' was calculated by the ENCprime software (Novembre, 2002).

Estimating tAI from RNA-seq Data
We also introduced a modified version of tAI. As mentioned above, tAI indicates how well adapted a given gene is to the tRNA pool based on the diversity and abundance of coded tRNA and taking into account wobble base pair interactions (see dos Reis et al. 2004 for details). Instead of using tRNAcoding genes, when data were available, we estimated tAI from tRNA abundances as measured by RNA-seq experiments. We call this value tAI'. For tAI', we calculated transcripts per million (tpm) from RNA-seq experiments conducted during log phase or exponential growth that were fetched from GEO DataSets (https:// www. ncbi. nlm. nih. gov/ gds/) and processed by (Wei et al. 2019). We estimated tAI' on the following experiments (genomes): SRX020805 (Bacteroides thetaiotaomicron); SRX515181 (Bacillus subtilis); SRX515174 (Escherichia coli); SRX2448246 (Leptospira interrogans); SRX1372108 (Mycobacterium tuberculosis); SRX1638989 (Salmonella enterica); and SRX347145 (Synechocystis sp. PCC 6803).

Other Measures of Selected Codon Usage Bias
For the sake of comparison, we also estimated selected codon usage bias by using the approach followed by Vieira-Silva and Rocha (2010). These authors proposed to estimate translational selection by an index named ΔENC'. To estimate ΔENC, we first estimated translational selection by averaging Nc' values from all ribosomal protein-coding genes (ENC'r) [Eq. 8]: k = number of ribosomal protein-coding genes, Nc ′ i = Nc' from the ith ribosomal protein-coding gene.
And we used the above value to calculate ΔENC', which is a standardization of the mean Nc' from all coding genes in the genome with respect to the mean of the ribosomal protein-coding genes ENC'r values [Eq. 9].
We also calculated the average tAI for ribosomal proteincoding genes (tAIr) [Eq. 10] and fitted it to a Z distribution with mean 0 and standard distribution 1 named here as tAIrz. k = number of ribosomal protein-coding genes, tAI i = tAI from the ith ribosomal protein-coding gene.
It was necessary to normalize tAIr to a Z distribution to be able to compare this value between genomes. Additionally, we calculated ΔtAI [Eq. 11] Phylogenomic Inference, Phylogenetic Signal, and Phylogenetic Contrasts Data on minimal generation times (d) from diverse prokaryotes were gathered and published by Vieira-Silva and Rocha (2010) (Supplementary Table S2). A phylogenomic tree of these 210 organisms was inferred using a concatenation of five protein sequences: valine-tRNA ligase (valS); elongation factor G (fusA); ATP-dependent zinc metalloprotease FtsH (ftsH); DNA-directed RNA polymerase subunit beta (rpoC); and elongation factor Tu (tufA). These phylogenetic markers were detected and aligned using PhyloPhlAn (Segata et al. 2013). For phylogenetic inference, the best model (LG;+G) was predicted by SMS (Lefort et al. 2017) and the tree was reconstructed by PhyML (Guindon et al. 2010) with non-parametric branch support based on the Shimodaira-Hasegawa-like procedure (SH-like).
Phylogenetic signals for selected codon usage bias and minimal generation times were estimated with Pagel's lambda (λ) as implemented in the Phytools R package (Revell 2012).
Correlations between minimal generation time and R_ ENC', ENCr', ΔENC', tRNA copy number, tAIrz, and ΔtAI were evaluated with a Spearman's rank-order correlation and taking into account phylogenetic inertia by performing phylogenetic contrasts with the PGLS method as implemented in the Caper R package (Orme 2018).

GO Terms Enriched in Genes Showing Selected Codon Usage Bias
We also wanted to know if there were GO terms enriched in genes showing selected codon usage bias. First, for each one of the genomes in Table S2, we normalized Nc' and tAI values to mean 0 and standard deviation 1. Next, we performed two different enrichment analyses. For the first one, we plotted all genes from each genome from Table S2 in a tAI versus Nc' chart (see Fig. 4). We then asked whether the set of ribosomal protein-coding genes is enriched in the upper left quadrant by using a Kolmogorov-Smirnov test (p value < 0.05). The results of these analysis are shown in Fig. 4 and Supplementary Table S7. We also tested the above association by using a logistic regression.
For the second enrichment analysis, we grouped genes according to their Gene Ontology terms (Ashburner et al. 2000). Gene ontology annotations were downloaded from the UniProt database (UniProt Consortium 2019; https:// www. unipr ot. org/). In order to assess statistical significance, a Kolmogorov-Smirnov test was implemented for each metric (Nc' and tAI) by using the topGO R package (Alexa and Rahnenfuhrer 2021). We obtained a total of 85 non-redundant terms automatically by using REVIGO (Supek et al. 2011) and manually inspecting the GO hierarchy. We only consider a term as statistically significant if it has an FDR adjusted p value < 0.05 in Nc' and tAI. The results of these analyses are shown in Fig. 5.
We also wanted to know if minimal generation times were significantly different between organisms differing in having (or not) a given GO term enriched in genes showing selected codon usage bias. For this, Wilcoxon oneside signed-rank tests were performed for GO terms under inspection (Supplementary Table S3). The same procedure was done using tAI' for the model organisms (Bacteroides thetaiotaomicron, Bacillus subtilis, Escherichia coli, Leptospira interrogans, Mycobacterium tuberculosis, Salmonella entérica, and Synechocystis sp. PCC 6803). And as above, a given GO term is considered enriched only if it is significantly enriched at the same time in the metrics tAI and Nc'.

Ancestral State Reconstruction
To study the evolutionary precedence of selected codon usage bias in the translation machinery with respect to other GO terms, we applied the following procedure on the set of 210 genomes. First, we defined two kinds of events: "T" and "G." The "T" event was defined to occur when the "Translation" term (GO:0,006,412) appeared as significantly enriched in the two metrics (Nc' and tAI, FDR < 0.05), and the "G" event was defined to occur when a given GO term, other than "Translation," appeared as significantly enriched in the same two metrics. The frequency of these events per genome and GO term is shown in Supplementary Table S4. GO terms without "T" or "G" events were discarded from further analysis. Other GO terms directly linked to "Translation" were removed manually by looking at the GO hierarchy.
Next, events "T" and "G" were converted to binary [0,1] resulting in two vectors: One vector for "T" and another for "G" events. In these vectors, the number "1" represents that the event ("T" or "G") occurred. In the case of having more than one "G" event in different GO terms, we simply assigned a number "1" to the vector of "G" events. Next, we used these vectors as character states at the tips of the phylogenetic tree to infer the ancestral character states at the internal nodes. For this, we applied a maximum likelihood method from the phytools R package that was developed by Yang et al. (1995). This method uses the re-rooting function with a symmetric Q matrix (Revell 2012) and a parsimony method by using the function asr_max_parsimony from the castor R package (Louca and Doebeli 2018).

R_ENC' is Less Influenced by G+C Content Than S
We revisited the unified framework proposed by dos Reis et al. (2004) to estimate selected codon usage bias. In particular, we replaced dNc by Nc' in the Pearson correlation calculations (see Materials and Methods). We used Nc' because it is less biased by G+C content and performs better when evaluating codon usage bias (Liu et al. 2018). As expected, (Fig. 1A and B), our index R_ENC' is less influenced by G+C content than the index (S) proposed by dos Reis et al. (2004). Since G+C content is a major determinant of codon usage in the absence of selection (Sharp et al. 2010), we consider the improvement presented here to be significant. For larger values, R_ENC' and S tend to converge to similar values (Fig. 1C).
As proposed by dos Reis et al. (2004), the genomic landscape is defined by the tRNA gene copy number and genome size (dos Reis et al. 2004) and can be used to study the effect of the above variables on selected codon usage bias. To gain insights on the behavior of the two metrics (R_ENC' and S) in the context of the prokaryotic genomic landscape, we built simple additive models (corrected and log scaled). As described in Supplementary Figure S1, both models show similar fit to the genomic landscape (lm R 2 = 0.246 p value < 2.2e-16 for S; and lm R 2 = 0.264 p value < 2.2e-16 for R_ENC'). However, a more sophisticated analysis like that of dos Reis et al. (2004) and the inclusion of eucaryotes which are out of the scope of this work, would be required to properly account for the correlation between tRNA gene copy number and genome size.

Prokaryotic Lineages Showing Strongest Selected Codon Usage Bias
We then explored the distribution of R_ENC' values across 1800 genomes from 26 bacterial and archaeal phyla. Four lineages: Gammaproteobacteria, Firmicutes, Actinobacteria, and Betaproteobacteria, showed significantly stronger genome selected codon usage bias than the rest of the taxa (Fig. 2, Table S5a, FDR < = 0.05). It is important to mention that these four phyla (Gammaproteobacteria, Firmicutes, Actinobacteria, Betaprotebacteria) together with Alphaproteobacteria, Bacteroidetes/Chlorobi, and Delta/epsilon are the most represented in the sample. To investigate whether the above result is due to sampling bias, we repeated the analysis 10 times by randomly selecting 30 genomes from each phylum. The results show that in most repetitions, the same four phyla show significantly stronger genome selected codon usage bias than the rest of the taxa (Table S5b, FDR < = 0.05).
It is worth mentioning that Escherichia coli stands as the species showing the strongest R_ENC' value, closely followed by the Gammaproteobacteria: Tolumonas auensis, Ferrimonas balearica, and Citrobacter freundii. The first two species are facultative anaerobes while C. freundii is an opportunistic pathogen. Within Firmicutes, the species showing the strongest selected codon usage bias is the heterofermentative lactic acid bacterium Weissella cibaria; for Betaproteobacteria, it is Neisseria sicca, an opportunistic pathogen, and Glutamicibacter arilaitensis for Actinobacteria, which grows on the surface of Reblochon cheese from France.

R_ENC' is Correlated to Minimal Generation Time
Minimal generation time (d) was shown to correlate with selected codon usage bias in prokaryotes (Rocha 2004;Vieira-Silva and Rocha 2010;Thiele et al. 2011). Here, we revisited this result by exploring if R_ENC' also correlates with the minimal generation time. We found that R_ENC' shows a moderate but statistically significant correlation, even when considering the phylogenetic structure of the data (PGLS' r = 0.32 p value = 1.9e-6) (Fig. 3A).
In addition to R_ENC', we also studied the correlation between (d) and other metrics used to estimate selected codon usage bias. These were tAIrz, ΔtAI ΔENC', ENCr', S dos Reis and tRNA gene copy number (Table S6). We found that ENCr' (Fig. 3B) and ΔENC' had the highest correlation with (d) (PGLS' r = 0.4, p value = 6.8e-10; and PGLS' r = 0.4, p value = 1.2e-9, respectively, while the S dos Reis has a value of PGLS' r = 0.048, p value = 1.38e-3. This is in agreement with Vieira-Silva and Rocha (2010) who reported this by using conventional Spearman correlation. These results confirm previous observations indicating that selected codon usage bias is correlated to minimal generation time. This correlation is stronger when selected codon bias is measured with indexes that use ribosomal proteincoding genes as the reference to estimate selected codon usage bias, like ENCr' and ΔENC'.

Ribosomal Protein-Coding Genes Often Show Signals of Translational Selection When R_ENC' Indicates Genome-Wide Selected Codon Usage Bias
Ribosomal protein-coding genes tend to be highly expressed and their codon usage usually shows selected codon usage bias. Because of this, they are used by several metrics as gold standards to measure the effects of selection on codon usage. We first asked how often ribosomal protein-coding genes show selected codon usage bias when R_ENC' indicates genome-wide selected codon usage bias. This can be done by measuring the frequency with which ribosomal protein-coding genes have a codon usage that is highly adapted to the tRNA gene pool (show large tAI values) and using a few different synonymous codons for each amino acid (low Nc' values). A paradigmatic example is that of Escherichia coli where selection has clearly shaped the codon usage of ribosomal protein-coding genes and a few others like the gene coding for the chaperon GroEL (Fig. 4A). There are other unusual cases where ribosomal protein-coding genes show the opposite effect, appearing towards the bottom right of the graph, such as in Nitrosomonas europaea and Syntrophus aciditrophicus (Kolmogorov Smirnov Nc' lower tail p value < 0.05; tAI upper tail p value < 0.05) (Fig. 4B). When selection has not optimized codon usage bias of ribosomal protein-coding genes, these tend to appear scattered around the graph. This is the case of Buchnera aphidicola genomes (Fig. 4C). We found that among the set of 210 genomes, 147 had their ribosomal protein-coding genes in the upper left quadrant (Kolmogorov Smirnov Nc' upper tail p value < 0.05; tAI lower tail p value < 0.05, Supplementary  Table S7).
We also tested for the above association by using logistic regression. The logistic regression showed a moderate association between having the ribosomal protein-coding genes in the upper left quadrant and a strong R_ENC' (Mc Fadden's R squared = 0.12, Supplementary Figure S2). In addition, we also found a statistically significant association by using a Pearson correlation between translational selection (ENC'r) and R_ENC' (r 2 = 0.58, p value < 2.2e-16) and the same result by using GroEL protein-coding genes and R_ENC' (r 2 = 0.43, p value < 2.2e-16). These results show that in general terms, ribosomal protein-coding genes show selected codon usage bias when R_ENC' indicates selected codon usage bias at the genome level.

Gammaproteobacteria and Firmicutes Convergently Evolved Similar GO Terms with Selected Codon Usage Bias
We then asked which functional terms were enriched in genes showing selected codon usage bias. For this, we considered a term to be enriched if a statistically significant portion of its genes mapped to the upper left quadrant defined by the tAI versus Nc' plot. Not surprisingly, translation (GO:0,006,412) was the term enriched in most genomes (Fig. 5). This term is followed by others that are functionally related to translation (like tRNA aminoacylation or gene expression) or by terms related to ATP metabolism. This is not unexpected, since terms related to translation are crucial for fast-growing prokaryotes (Karlin et al. 2001;Klumpp et al. 2013).
Interestingly, we found that in general, Gammaproteobacteria and Firmicutes had more enriched terms than the rest of the taxa, many of which are shared (Fig. 5). We interpret the above pattern as an adaptive convergence due to natural selection. Since the same GO terms exist in other phyla but show less enrichment (orange, yellow, or gray colors in Fig. 5), the alternative explanation (divergence from A B

Fig. 3 Spearman correlation between codon usage bias and growth rates. A R_ENC' versus minimal generation time d(h); and B ENC'r versus d(h)
common ancestry followed by secondary losses of enrichment) seems less likely. In fact, a heatmap analysis of the matrix shown in Fig. 5 shows that several species of Gammaproteobacteria and Firmicutes cluster together in three different but related branches in the dendrogram, thus, supporting our interpretation (Fig. S4). Of course, not all species from Gammaproteobacteria and Firmicutes showing GO terms enriched in genes showing selected codon usage bias, cluster together. Note that the clade containing Escherichia coli, Shewanella oneidensis MR-1, Salmonella enterica, and Yersinia pestis cluster with the clades at the bottom of the figure.
On the side of the GO terms, there is a clear cluster formed by GO:0,006,412 translation, GO:0,046,034 ATP metabolic process, GO:0,006,520 cellular amino acid metabolic process, GO:0,010,467 gene expression, GO:0,006,418 tRNA aminoacylation for protein translation, GO:0,006,096 glycolytic process, and GO:0,046,031 ADP metabolic processes. These terms tend to be enriched together and tend to be present in the three clusters described above. However, this association should be interpreted carefully since genes can have more than one GO term, including sharing the ones mentioned above. For instance, in E. coli, there are 150 genes annotated with both GO:0,006,412 translation and GO:0,010,467 gene expression; 28 sharing translation and GO: 0,006,520 cellular amino acid metabolic process; and 26 with translation as well as GO:0,006,418 tRNA aminoacylation for protein translation.
It is also interesting to see that there is another set of procaryotes clustering together. This includes those species not showing GO terms enriched in genes showing selected codon usage bias. Species within this cluster includes epibionts like Nanoarchaeum equitans Kin4-M, ensosymbionts like Buchnera aphidicola and Wigglesworthia glossinidia, parasites such as Mycoplasma genitalium, as well as free living bacteria like Prochlorococcus marinus among many others.
We also found anecdotal associations between enriched GO terms and the lifestyle of organisms. For instance, photosynthesis and methanogenesis were enriched in cyanobacteria and methanogenic archaea, respectively (Fig. 5). And in agreement with previous studies (Martínez-Cano et al. 2015), endosymbiotic bacteria like Candidatus Blochmannia floridanus, Buchnera aphidicola, Wigglesworthia glossinidia, showed a general lack of gene terms enriched for selected codon usage bias.

Selected Codon Usage Bias Tends to Evolve First in Ribosomal Protein-Coding Genes and Then in Other Processes
One of the patterns shown in Fig. 5 is that the translation GO:0,006,412 term is almost always enriched in genes showing selected codon usage bias whenever other GO terms are enriched. Examples of these other GO terms include those related to obtaining energy (glycolytic process GO:0,006,096; tricarboxylic acid cycle GO:0,006,099; citrate metabolic process GO:0,006,101; and ATP metabolic process GO:0,046,034), as well as niche-specific processes like methanogenesis GO:0,015,948 in methanogenic archaea and photosynthesis GO:0,015,979 in cyanobacteria. This suggests that selected codon usage bias evolved first in genes coding for the translation machinery and subsequently in genes involved in other cellular processes.
To test the above hypothesis, we inferred when selected codon usage bias evolved along the phylogenetic tree for diverse GO terms. For this, we defined two kinds of events: "T" and "G." The "T" event was defined to occur when the translation GO term (GO:0,006,412) appeared as significantly enriched in the two metrics (Nc' and tAI, FDR < 0.05), and the "G" event was defined to occur when any GO term other than translation, appeared as significantly enriched in the same two metrics as before (see Methods). By this, we ended up with two vectors, one for the "T" event and the other for the "G" event. We then inferred by maximum likelihood and parsimony when in the phylogeny these events   In general, we found that "T" events are more widespread than "G" events. This implies that translation evolved selected codon usage bias earlier than other GO terms ("G" events) (Fig. 6). The strong phylogenetic signal of selected codon usage bias (Pagel's λ = 0.99, p value < 0.001) supports the above analysis. More interestingly, we found that other GO terms (represented by "G" events) appear to have evolved independently in different lineages. This is particularly conspicuous for Gammaproteobacteria and Firmicutes.
In addition, we compared the R_ENC' values, between genomes showing enrichment in the translation GO term and those not showing enrichment in this term (Supplementary Figure S3A). The Wilcoxon test showed us that the distributions are not the same (p value 6.21e-13) and the Fig. 6 Translation evolved selected codon usage bias earlier than other GO terms. A Selected codon usage bias of internal nodes was inferred by maximum parsimony. Nodes where selected codon usage bias was inferred to be present are shown in red and those were not, in gray. Left-side tree: GO term of Translation; and right side for any other GO term. For clarity, pie charts in internal nodes are larger than those of tips; B Frequency of selected codon usage bias versus root to node distance. The distance is measured in number of nodes to the root, and the probability of selected codon usage bias is averaged over all nodes having the same distance to the root; C Number of internal nodes showing selected codon usage bias (red) and not showing in (gray) inferred with maximum parsimony and with the likelihood rerooting method (Color figure online) logistic regression showed a considerable association (Mc Fadden's R squared = 0.34) (Supplementary Figure S3B). This suggests that in genomes where the selected codon bias is intense, the efficiency of the translation machinery is improved.

GO Terms Associated with Short-Generation Times
Microorganisms showing short-generation times typically exhibit upregulation of proteins involved in translation, gene expression, and protein synthesis (Scott et al. 2010;Molenaar et al. 2009;Peebo et al. 2015;Zavřel et al. 2019;Mori et al. 2017). As shown above, the GO term of "Translation" is enriched in genes showing codon usage bias in organisms showing short-generation times. We, therefore, asked if there are significant differences in the distribution of minimal generation times between microorganisms showing enrichment of selected codon usage bias in the "Translation" and in other GO terms.
GO Terms related to translation, protein folding, gene expression, and processes related to tricarboxylic acid and photosynthesis show significant differences (Supplementary File S3 and Table S3). Minimal generation times tend to be short in organisms having these enriched terms except in the case of photosynthesis where the opposite occurs (p value < 0.01). This suggests that in some species, terms other than "Translation" contribute to fast cell division.

Using tRNA Expression Data
As mentioned above, the metrics we used only measure approximately the selected codon usage bias because the abundance of cytoplasmic tRNA is estimated from tRNA gene copy number. To overcome this, Wei used tpm (transcripts per million) from RNA-seq experiments instead of tRNA gene copy number in seven organisms (Wei et al. 2019). We used their data to attempt a better estimation of selected codon usage bias for these organisms. On the whole, our analysis showed similar results between R_ ENC' estimated from tAI or tAI'. Enrichment of GO terms showed similar patterns using tAI and tAI' (Fig. 7). Yet there were some exceptions: Synechocystis and L. interrogans showed enrichment in translation related GO terms that were not previously identified using tAI. Fig. 7 Comparison between tAI and tAI'. Selected codon usage bias was estimated with tAI and tAI' for five organisms having RNA-seq data. Brown color indicates that both (tAI and tAI') are significant (p value < 0.05); yellow that only tAI' is significant; red that only tAI is significant; pink none are significant; and gray indicates that there are not genes associated to the corresponding GO term (Color figure online) no tAI' tAI both absent

Discussion
Here, we present R_ENC', which is an improvement to the original S index developed by dos Reis (2004). Our index is less biased by G+C content than S. Similar to other measures of selected codon usage bias, R_ENC' correlates with minimal generation time and with selected codon usage bias of ribosomal protein-coding genes.
We used R_ENC' to explore the frequency of selected codon usage in 1800 complete genomes from diverse Archaea and Bacteria. We found that genomes from four phyla/subphyla: Gammaproteobacteria, Firmicutes, Actinobacteria, and Betaproteobacteria showed stronger selected codon usage bias than the rest of the lineages. Nonetheless, it is important to consider ascertainment bias, since there are not enough representative genomes from many other phylogenetic lineages to discard that this same effect might be present elsewhere. In addition, we found that several species within Gammaproteobacteria and Firmicutes show a similar set of GO terms enriched in genes under selected codon usage bias, indicating convergent evolution.

Conditions for the Evolution of Translational Efficiency
It has been argued that highly adapted codons are selected because they improve the efficiency and the accuracy of translation (Plotkin and Kudla 2011). Some authors suggest that improving the efficiency of translation is more relevant than improving its accuracy, particularly among fast-growing bacteria (Ran and Higgs 2012). The correlation between minimal duplication time and selected codon usage bias of highly expressed genes coding for ribosomal proteins supports this view (Sharp et al. 2010). Accordingly, highly adapted codons contribute to faster ribosome recycling which in turn increases cellular fitness (Kudla et al. 2009). As mentioned before, this is particularly relevant among fast-growing bacteria.
Yet under which environmental conditions will selection favor short-generation times and its associated trait: translation efficiency? A recent study showed that copiotrophic environments are associated with species showing selected codon usage bias as well as other genomic traits like abundant tRNA and rRNA genes, larger genomes, and higher G+C content (Okie et al. 2020). In contrast, organisms growing in oligotrophic environments tend to show the opposite features. Accordingly, selection favors codon usage bias (and the other associated genomic traits) as an adaptation for growing fast when resources are abundant (at least periodically), and favors slow growth and efficient use of resources when resources are constantly scarce. The correlation between minimal generation time and codon usage bias as measured by R_ENC' (and other indexes) can be attributed to the opposing effects selection has on genomic traits depending on the growth strategy of cells. And the growth strategy of cells is adaptations to the different kinds of environments that prokaryotes evolve in.
Therefore, R_ENC' (and other measures of selected codon usage bias) can serve as gross indicators of the environment on which different lineages of prokaryotes have evolved, an idea that was already proposed by Carbone and Madden (2004) and explored more recently by Botzman and Margalit (2011) and Arella et al. (2021).
There are several lines of evidence correlating the lifestyle of organisms to selected codon usage bias. Initially, Rocha (2004) discovered the correlation between selected codon usage bias of highly expressed genes and growth rate. Later on, Sharp et al. (2005) suggested that bacteria living in variable environments would tend to show more often selected codon usage in highly expressed genes than parasitic bacteria. A more in-depth statistical analysis between environment and selected codon usage bias was provided by Botzman and Margalit (2011). These authors showed that variable "environments" favor selected codon usage of highly expressed genes. Other statistical analyses also showed a correlation between growth rate and the ability of bacteria to survive in different environments (Freilich et al. 2009). This is, fast growers tend to inhabit diverse environments while slow growers inhabit more specialized niches.
Based on the degree of coadaptation between codon usage of highly expressed genes and tRNA genes, as well as other features of the genome like the number of rRNA genes and genome size (components of the genomic landscape), we should be able to predict, with some degree of accuracy, minimal duplication times, and the kind of environment on which prokaryotes thrive. This is an idea that has been explored for small genomic sequences derived from metagenomic samples (Vieira-Silva and Rocha 2010). The correlation between selected codon usage bias and growth rate should inform us about the kind of environment (regarding nutrient availability) on which lineages have evolved (Sharp et al. 2010). Because of that, we infer that most lineages of Gammaproteobacteria and Firmicutes studied here adapted convergently to growth in copiotrophic environments.

Evolutionary Precedence of Translation Efficiency
Here, we provide phylogenetic evidence indicating that selected codon usage bias tends to evolve first in the genes coding for the translation machinery and afterward in genes coding for other cellular functions. To our knowledge, this is the first time that such phenomena has been reported in the literature. This result is consistent with the hypothesis that the main advantage of highly adapted codons is to free the ribosome from highly expressed genes (under rapid growth conditions) so they can be used to translate more mRNAs. Theoretically, improving the translation efficiency of genes involved in the translation machinery itself would have a major effect on the translation efficiency of any other gene in the genome.

Concluding Remarks
Darwinian theory predicts that there should be a correlation between heritable phenotypes and environments. However, it is clear that more studies are required to better understand the association between genome features (like codon usage bias) and the physiology, lifestyle, and environment of prokaryotes (Iriarte et al. 2021). Here, we provide another step towards better understanding the association between genomic traits and environments. In particular, the potential of selected codon usage bias to inform us about life-history traits such as growth rate and feeding strategy.