Abstract
Mutations can be beneficial by bringing innovation to their bearer, allowing them to adapt to environmental change. These mutations are typically unpredictable since they respond to an unforeseen change in the environment. However, mutations can also be beneficial because they are simply restoring a state of higher fitness that was lost due to genetic drift in a stable environment. In contrast to adaptive mutations, these beneficial non-adaptive mutations can be predicted if the underlying fitness landscape is stable and known. The contribution of such non-adaptive mutations to molecular evolution has been widely neglected mainly because their detection is very challenging. We have here reconstructed protein-coding gene fitness landscapes shared between mammals, using mutation-selection models and multi-species alignments across 87 mammals. These fitness landscapes have allowed us to predict the fitness effect of polymorphisms found in 28 mammalian populations. Using methods that quantify selection at the population level, we have confirmed that beneficial non-adaptive mutations are indeed positively selected in extant populations. Our work confirms that deleterious substitutions are accumulating in mammals and are being reverted, generating a balance in which genomes are damaged and restored simultaneously at different loci. We observe that beneficial non-adaptive mutations represent between 15% and 45% of all beneficial mutations in 24 of 28 populations analyzed, suggesting that a substantial part of ongoing positive selection is not driven solely by adaptation to environmental change in mammals.
Author summary
The extent to which adaptation to changing environments is shaping genomes is a central question in molecular evolution. To quantify the rate of adaptation, population geneticists have typically used signatures of positive selection. However, mutations restoring an ancestral state of higher fitness lost by genetic drift are also positively selected, even though they do not respond to a change in the environment. In this study, we have managed to distinguish beneficial mutations that are due to changing environments from those that restore pre-existing functions in mammals. We show that a substantial proportion of beneficial mutations cannot be interpreted as adaptive.
Citation: Latrille T, Joseph J, Hartasánchez DA, Salamin N (2024) Estimating the proportion of beneficial mutations that are not adaptive in mammals. PLoS Genet 20(12): e1011536. https://doi.org/10.1371/journal.pgen.1011536
Editor: Kirk E. Lohmueller, University of California Los Angeles, UNITED STATES OF AMERICA
Received: April 29, 2024; Accepted: December 10, 2024; Published: December 26, 2024
Copyright: © 2024 Latrille et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The data underlying this article are available at https://doi.org/10.5281/zenodo.7878953. Snakemake pipeline, analysis scripts and documentation are available at https://github.com/ThibaultLatrille/SelCoeff.
Funding: This work was funded by Faculté de Biologie et de Médecine, Université de Lausanne (https://www.unil.ch; to TL, DAH and NS), Swiss National Science Fund (https://www.snf.ch; grant 310030-185223 to NS) and Agence Nationale de la Recherche (https://anr.fr/; grant ANR-19-CE12-0019 / HotRec to JJ). The funders did not play any role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
1 Introduction
Adaptation is one of the main processes shaping the diversity of forms and functions across the tree of life [1]. Evolutionary adaptation is tightly linked to environmental change and species responding to this change [2, 3]. Such environmental changes are either abiotic (e.g. temperature, humidity) or biotic (e.g. pressure from predators or viruses [4]). For adaptation to occur, there must be variation within populations, which mostly appears via mutations in the DNA sequence. While neutral mutations do not impact an individual's fitness, deleterious mutations have a negative effect, and beneficial mutations improve their bearer's fitness. A beneficial mutation is thus more likely than a neutral mutation to invade the population and reach fixation, resulting in a substitution at the species level.
Upon environmental change, because adaptive beneficial mutations toward new fitness optima are more likely, the number of substitutions also increases (Fig 1A). An increased substitution rate is thus commonly interpreted as a sign of adaptation [5–7]. The availability of large-scale genomic data and the development of theoretical models have enabled the detection and quantification of substitution rate changes across genes and lineages [8–10]. These approaches, now common practice in evolutionary biology, have helped better understand the processes underpinning the rates of molecular evolution, contributing to disentangling the effects of mutation, selection and drift in evolution [11]. However, a collateral effect has been conflating beneficial mutations with adaptive evolution when adaptive evolution is not the only process that can lead to beneficial mutations [12–14].
(A & B) For a given codon position of a protein-coding DNA sequence, amino acids (x-axis) have different fitness values (y-axis). Under a changing fitness landscape (A), these fitnesses fluctuate with time. The protein sequence follows the moving target defined by the amino-acid fitnesses. Since substitutions are preferentially accepted if they are in the direction of this target, substitutions are, on average, adaptive. At the phylogenetic scale (C), beneficial substitutions are common (positive signs), promoting phenotype diversification across species. Under a stable fitness landscape (B), most mutations reaching fixation are either slightly deleterious reaching fixation due to drift or are beneficial non-adaptive mutations restoring a more optimal amino acid. At the phylogenetic scale (D), deleterious substitutions (negative signs) are often reverted via beneficial non-adaptive mutations (positive signs), promoting phenotype stability and preserving well-established biological systems. Even though, individually, any beneficial non-adaptive mutation might have a weak effect on its bearer, we expect them to be scattered across the genome and the genome-wide signature of beneficial non-adaptive mutations to be detectable and quantifiable.
1.1 Beneficial yet non-adaptive mutations
In a constant environment, a deleterious mutation can reach fixation by genetic drift [15]. A new mutation restoring the ancestral fitness will thus be beneficial (Fig 1B), even though the environment has not changed [13, 16–19]. We refer to mutations that restore the ancestral fitness, under the assumption that the fitness landscape has not changed, as beneficial non-adaptive mutations [12, 20]. Such a mutation can occur at a different locus from the initial mutation, in which case it is called a compensatory mutation [13, 17]. Compensatory mutations change the sequence and thus induce genetic diversification. In contrast, beneficial non-adaptive mutations at the locus of the initial mutation, which are the focus of this manuscript, reduce genetic diversity and do not contribute to genetic innovation. Although Tomoko Ohta considered beneficial non-adaptive mutations negligible in her nearly-neutral theory [15], their importance has now been acknowledged for expanding populations [12]. However, differentiating between an adaptive mutation and a beneficial non-adaptive mutation remains challenging [21]. Indeed, an adaptive mutation responding to a change in the environment and a beneficial non-adaptive mutation have equivalent fitness consequences for their bearer [12]. Similarly, at the population level, both types of mutations will result in a positive transmission bias of the beneficial allele. However, at the macro-evolutionary scale, the consequences of these two types of mutations are fundamentally different. While adaptive mutations promote phenotype diversification (Fig 1C), beneficial non-adaptive mutations promote phenotype stability and may help preserve well-established biological systems (Fig 1D). Additionally, the direction of adaptive evolution is unpredictable because it is caused by an unforeseen change in the environment and, hence, in the underlying fitness landscape [22].
On the other hand, beneficial non-adaptive mutations are predictable because, under a stable fitness landscape, any change from a non-optimal to an optimal amino acid will move the site back toward the equilibrium expected under the fitness landscape [23–25]. They can then be distinguished from truly novel beneficial mutations because the latter are not expected to mutate toward the amino acids of higher fitness defined by the stable fitness landscape but rather to amino acids showing a diversified pattern (Fig 1).
1.2 Fitness landscape reconstruction
The mutation-selection framework makes it possible to link the patterns of substitution along a phylogenetic tree with the underlying fitness landscape [26, 27]. Such mutation-selection models applied to protein-coding DNA sequence alignments at the codon level allow us to estimate relative fitnesses for all amino acids for each site of the sequence, explicitly assuming that the underlying fitness landscape is stable along the phylogenetic tree [28–30]. Moreover, effective population size (Ne) is considered constant along the phylogenetic tree precisely because of the fixed fitness landscape assumption, the consequences of which are detailed in the Discussion. Importantly, because mutation-selection codon models at the phylogenetic scale are based on population-genetics equations, their estimates of selection coefficients are directly interpretable as fitness effects at the population scale; and because they work at the DNA level, we are able to account for mutational bias in DNA and the structure of the genetic code. The model further integrates the shared evolutionary history between samples and their divergence, which, together, allow us to estimate fitness effects in sequence alignments even though sequences are not independent samples and might not represent the equilibrium distribution of amino acids (see section 4.2 in Materials & methods). The detailed model implementation is available in S1 File, described as a Bayesian hierarchical model (Fig A in S1 File).
Accordingly, fitting the mutation-selection model to a multi-species sequence alignment allows us to obtain relative fitnesses for all amino acids (Fig 2A). The difference in fitness between a pair of amino acids allows us to predict whether any mutation would be a deleterious mutation toward a less fit amino acid, a nearly-neutral mutation, or a mutation toward a known fitter amino acid, thus constituting a beneficial non-adaptive mutation (Fig 2B). We can hence use large-scale genomic data to test whether such fitnesses estimated at the phylogenetic scale predict the fitness effects at the population scale. The placental mammals represent an excellent study system to perform such an analysis. Having originated ∼102 million years ago, they diversified quickly [31]. Additionally, polymorphism data are available for many species [32], as are high-quality protein-coding DNA alignments across the genome [33, 34]. By performing our analysis on 14,509 orthologous protein-coding genes across 87 species, we focus on genes shared across all mammals in our dataset and not on newly functionalized genes in a lineage.
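The classification step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names and the fitness values are hypothetical, S0 is taken as the difference in scaled fitness between the amino acid after and before the mutation, and the lower threshold of −1 is an assumption made by symmetry with the S0 > 1 threshold used in the text for beneficial mutations.

```python
def scaled_selection_coefficient(fitness: dict, before: str, after: str) -> float:
    """S0 as the difference in scaled fitness between the new and the current amino acid."""
    return fitness[after] - fitness[before]


def classify(s0: float, threshold: float = 1.0) -> str:
    """Assign a mutation to a selection class from its S0 value.

    The upper threshold (S0 > 1) follows the text; the symmetric lower
    threshold (S0 < -1) is an assumption of this sketch.
    """
    if s0 > threshold:
        return "beneficial"
    if s0 < -threshold:
        return "deleterious"
    return "nearly-neutral"


# Hypothetical scaled fitnesses of three amino acids at a single site.
site_fitness = {"Ala": 0.0, "Gly": -0.4, "Val": -2.3}

for before, after in [("Ala", "Val"), ("Ala", "Gly"), ("Val", "Ala")]:
    s0 = scaled_selection_coefficient(site_fitness, before, after)
    print(f"{before}->{after}: S0 = {s0:+.1f} ({classify(s0)})")
```

In this toy site, the mutation from the least fit amino acid back to the fittest one is the kind of change the text calls a beneficial non-adaptive mutation.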
At the phylogenetic scale (A), we estimated the amino-acid fitness for each site from protein-coding DNA alignments using mutation-selection codon models. For every possible mutation, the difference in amino-acid fitness before and after the mutation allows us to compute the selection coefficient at the phylogenetic scale (S0). Depending on S0 (B), mutations can be predicted as deleterious, nearly-neutral, or beneficial non-adaptive mutations toward a fitter amino acid and repairing existing functions. At the population scale, each observed single nucleotide polymorphism (SNP) segregating in the population can also be classified according to its S0 value (C). Occurrence and frequency in the population of non-synonymous polymorphisms, contrasted to synonymous polymorphisms (deemed neutral), are used to estimate selection coefficients (D-E) at the population scale (S) for each of the three classes of selection. We can thus assess whether S0 predicts S and compute precision (F) and recall (G) for each class. The recall value for the class of predicted beneficial non-adaptive mutations is the probability for beneficial mutations to be non-adaptive (G). Icons are adapted from https://phylopic.org under a Creative Commons license.
Having identified which potential DNA changes represent beneficial non-adaptive mutations (Fig 2A and 2B), we retrieved polymorphism data from 28 wild and domesticated populations belonging to six genera (Equus, Bos, Capra, Ovis, Chlorocebus, and Homo) to assess the presence of beneficial non-adaptive mutations at the population scale. We focused on both mutations currently segregating within populations and on substitutions in the terminal branches, and checked if any of these observed changes were indeed beneficial (Fig 2C and 2E). A similar approach demonstrated the presence of beneficial non-adaptive mutations in humans [23, 24] and in plants [25]. However, the model used to reconstruct the static fitness landscape in these studies can only be applied to deeply conserved protein domains in the tree of life, which corresponds to a subpart of the proteome that evolves slowly. The mutation-selection model used in the present work integrates phylogenetic relationships, and thus allows us to estimate the fitness landscape in shallower phylogenetic trees, and therefore can be applied almost exome-wide [35].
We first quantified the likelihood of any DNA mutation to be a beneficial non-adaptive mutation, that is, whenever a DNA mutation increases fitness under a stable fitness landscape. Subsequently, by quantifying the total amount of beneficial mutations in the current population across all types of DNA mutations, we could tease apart beneficial non-adaptive from adaptive mutations resulting from a change in the fitness landscape. Altogether, in this study, by integrating large-scale genomic datasets at both phylogenetic and population scales, we propose a way to explicitly quantify the contribution of beneficial non-adaptive mutations to positive selection across the entire exome of the six genera (Fig 2F and 2G).
2 Results
2.1 Selection along the terminal branches
First, we assessed whether fitness effects derived from the mutation-selection model at the phylogenetic scale predict selection occurring in terminal branches. We recovered the mutations that reached fixation in the terminal branches of the six genera. We considered only mutations fixed in a population as substitutions in the corresponding branch, discarding mutations still segregating in our population samples. For each substitution identified in the terminal branches, we obtained its S0 value as predicted at the mammalian scale (Fig 2A and 2B). We could then classify each substitution as deleterious, nearly-neutral, or beneficial. Because S0 values were based on the assumption that the fitness landscape is stable across mammals, mutations predicted as beneficial (i.e., with S0 > 1) bring their bearer toward an amino acid predicted to be fitter across mammals. Importantly, the mammalian alignment used to estimate the amino-acid fitness landscape did not include the six focal genera and their sister species. This ensures independence between, on the one hand, the estimated fitness landscape and, on the other hand, both the substitutions that occurred in the terminal branches and the segregating polymorphisms of the focal populations. Example substitutions in the terminal lineage of Chlorocebus sabaeus that are classified as beneficial are shown in S2 File (section 1.1). For instance, in the mammalian protein-coding DNA alignment of gene SELE, the nucleotide at site 1722 has mutated (from T to C) at the base of Simiiformes (monkeys and apes), modifying the corresponding amino acid from Serine to Proline, but has been subsequently reverted in the branch of Chlorocebus sabaeus (Fig A in S2 File). However, other substitutions classified as beneficial in the terminal branch of Chlorocebus sabaeus cannot be clearly interpreted as reversions along the terminal branch, and show several transitions to this amino acid across the mammalian phylogeny, as for instance site 3145 of gene THSD7A (Fig B in S2 File).
Among all the substitutions found in each terminal branch, between 10 and 13% were predicted as beneficial, while mutations predicted as beneficial represent only between 0.9 and 1.2% of all non-synonymous mutations (Fig 3A and 3B for humans, Table A in S2 File for all datasets). Of note, if we were to assume a stationary mutation-selection-drift equilibrium in the terminal lineage, we would expect symmetric proportions of positively selected (predicted beneficial) and negatively selected (predicted deleterious) substitutions if there were no adaptation. The lack of symmetry along the terminal branches then provides a means to estimate the frequency of non-adaptive beneficial substitutions. Mathematically, twice the fraction of substitutions predicted as beneficial is an estimate of this rate. This rate is highly consistent across lineages (Table A in S2 File) and suggests an overall frequency of nearly neutral substitutions due to consistent long-term selection pressures, mutation and drift of approximately 20% of all substitutions (20%-26% across species).
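The arithmetic behind this estimate can be spelled out in a few lines. The sketch below only restates the symmetry argument of the paragraph above: if every substitution restoring a fitter amino acid is matched, at equilibrium, by one deleterious substitution, then the share of substitutions attributable to the stable landscape is twice the share predicted as beneficial. The function name is hypothetical.

```python
def stable_landscape_fraction(beneficial_fraction: float) -> float:
    """Twice the predicted-beneficial share of substitutions.

    Assumes, as in the symmetry argument of the text, that each restoring
    substitution is balanced by one deleterious substitution.
    """
    if not 0.0 <= beneficial_fraction <= 0.5:
        raise ValueError("fraction must lie between 0 and 0.5")
    return 2.0 * beneficial_fraction


# Bounds reported in the text for the predicted-beneficial share of substitutions.
for fraction in (0.10, 0.13):
    print(f"{fraction:.0%} predicted beneficial -> {stable_landscape_fraction(fraction):.0%}")
```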
(A) Distribution of scaled selection coefficients (S0), predicted for all possible non-synonymous DNA mutations away from the ancestral human exome (section 4.4). Mutations are divided into three classes of selection: deleterious, nearly-neutral, and beneficial (supposedly beneficial non-adaptive mutations). (B) Distribution of scaled selection coefficients (S0) for all observed substitutions along the Homo branch after the Homo-Pan split (section 4.5). A class with fewer substitutions than expected is undergoing purifying selection, as is the case for the deleterious class. (C) The site-frequency spectrum (SFS) in humans of African descent for a random sample of 16 alleles (means in solid lines and standard deviations in color shades) for each class of selection and for synonymous mutations, supposedly neutral (black). The SFS represents the proportion of mutations (y-axis) with a given number of derived alleles in the population (x-axis). At high frequencies, deleterious mutations are underrepresented. (D) Proportion of beneficial, nearly-neutral, and deleterious mutations estimated at the population scale for each class of selection at the phylogenetic scale (section 4.6). Proportions depicted here are not weighted by their mutational opportunities.
Furthermore, since mutations predicted as beneficial should, in principle, reach fixation more often than neutral mutations, we calculated the dN/dS ratio of non-synonymous over synonymous divergence for all terminal lineages, restricting the non-synonymous changes to those predicted as beneficial. We obtained values between 1.17 and 1.75 in the different lineages (Table B in S2 File), meaning that these mutations reach fixation slightly more frequently than synonymous mutations, which are assumed to be neutral. This observation is consistent with these mutations being weakly beneficial, and translates to a scaled selection coefficient between 0.32 and 1.24 (section 1.2 in S2 File). Finding dN/dS > 1 for these sites confirms that they are closer to optimality at the end of the branch than at the beginning. Even though the beneficial effect of these mutations does not come from an environmental change, they have still contributed positively to the population’s fitness. At mutation-selection-drift equilibrium, however, the increase in fitness at these sites is offset by deleterious substitutions elsewhere in the genome, so that there is no net adaptation.
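One way to see how a dN/dS value maps onto a scaled selection coefficient is sketched below. It assumes the standard population-genetics result that a mutation with scaled selection coefficient S fixes S / (1 − e^(−S)) times as often as a neutral one, and that all mutations in the class share the same S. The derivation actually used by the authors is in section 1.2 of S2 File and may differ; the function names are hypothetical.

```python
import math


def omega_from_s(s: float) -> float:
    """Expected dN/dS for mutations sharing a scaled selection coefficient s > 0."""
    return s / -math.expm1(-s)


def s_from_omega(omega: float, upper: float = 100.0, tol: float = 1e-10) -> float:
    """Invert omega_from_s by bisection (omega_from_s increases with s)."""
    if not 1.0 < omega < omega_from_s(upper):
        raise ValueError("omega must lie between 1 and omega_from_s(upper)")
    low, high = 1e-12, upper
    while high - low > tol:
        mid = (low + high) / 2.0
        if omega_from_s(mid) < omega:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0


# Bounds of the dN/dS range reported in the text for predicted-beneficial mutations.
for omega in (1.17, 1.75):
    print(f"dN/dS = {omega:.2f} -> S = {s_from_omega(omega):.2f}")
```

Under these assumptions, the two bounds give values close to the range of scaled selection coefficients quoted in the text, with small differences attributable to rounding of the reported dN/dS values.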
This result further indicates that using dN/dS as an estimate of purifying selection is biased (overestimated) due to the presence of beneficial non-adaptive mutations among the non-synonymous substitutions. By discarding all beneficial non-adaptive mutations we can obtain an estimate of dN/dS which is not inflated. By comparing these two ways of calculating dN/dS (see section 4.5 in Materials & methods), we calculated that beneficial non-adaptive mutations inflate dN/dS values by between 9 and 12% across genera (Table C in S2 File). This represents a substantial increase when considering that beneficial non-adaptive mutations only represent between 0.9 and 1.2% of non-synonymous mutational opportunities (Table A in S2 File).
2.2 Selection in populations
Second, we assessed whether our calculated S0 values predicted at the phylogenetic scale were also indicative of the selective forces exerted at the population level. We retrieved single nucleotide polymorphisms (SNPs) segregating in 28 mammalian populations. To determine if SNPs were ancestral or derived, we reconstructed the ancestral exome of each population. We then classified every non-synonymous SNP as deleterious, nearly-neutral, or beneficial according to its S0 value (Fig 2B and 2C).
First, SNPs classified as beneficial are spread across the genomes and are not strongly associated with the ontology terms of their respective genes (Table D and Fig C in S2 File). In humans, some SNPs have been associated with specific clinical prognosis terms obtained by clinical evaluation of the impact of variants on human Mendelian disorders [36]. Although this classification also relies on deep protein alignments and therefore cannot be considered an independent result from our own, it does provide a consistency check of whether the effect of a mutation on human health is in line with its fitness effect predicted by our method [37]. Therefore, we investigated whether the non-synonymous SNPs classified as deleterious or beneficial showed enrichment in specific clinical terms compared to SNPs classified as nearly-neutral. Our results show that SNPs predicted as deleterious are associated with clinical terms such as Likely Pathogenic and Pathogenic, implying that, in general, the selective pressure of a mutation exerted across mammals is also predictive of its clinical effect in humans (Table E in S2 File) [38]. Conversely, mutations predicted as beneficial are associated with clinical terms such as Benign and Likely Benign, which shows that these mutations are less likely to be functionally damaging (Table F in S2 File).
In addition to clinical prognosis, frequencies at which SNPs are segregating within populations provide information on their selective effects. For instance, deleterious SNPs usually segregate at lower frequencies because of purifying selection, which tends to remove them from the population (Fig 3C for humans). By gathering information across many SNPs, it is possible to estimate the distribution of fitness effects at the population scale, taking synonymous SNPs as a neutral expectation [39–42]. From these estimated fitness effects, we can derive the proportions of deleterious, nearly-neutral, and beneficial mutations at the population scale (see section 4.6 in Materials & methods, Fig A-C in S3 File). These approaches offer a unique opportunity to contrast selection coefficients estimated at the phylogenetic scale (S0) and at the population scale (S) in different datasets (Fig D in S3 File).
Across our three selection classes (deleterious, nearly-neutral, and beneficial), one can ultimately estimate the proportion of correct and incorrect predictions, leading to an estimation of precision and recall (Fig 2F and 2G and section 4.7 in Materials & methods). Across 28 populations of different mammal species, mutations predicted to be deleterious at the phylogenetic scale were indeed purged at the population scale, with a precision in the range of 90–97% (Table 1 and Fig 3D for humans). Conversely, a recall in the range of 96–100% implied that mutations found to be deleterious at the population scale were most likely also predicted to be deleterious at the phylogenetic scale (Table 1). Altogether, purifying selection is largely predictable, and amino acids with negative fitness across mammals have been effectively purged in each population.
Mutations predicted as nearly-neutral were effectively composed of a mix of neutral and selected mutations, with varying precision (36–63%) and recall (32–45%) across the different populations (Table 1, Fig 3D for humans). The variable proportions between populations can be explained by the effective number of individuals in the population (Ne), a major driver of selection efficacy. Estimates of the mutation rate per generation (u), from Bergeron et al. [43] and Orlando et al. [44], and Watterson’s θ obtained from the synonymous SFS as in Achaz [45], allow us to obtain Ne through Ne = θ/(4u). Using correlation analyses that accounted for phylogenetic relationships (see section 4.8 in Materials & methods, Fig E in S3 File), we found that higher Ne was associated with a smaller proportion of nearly-neutral mutations (r² = 0.31, p = 0.001, Fig 4A). This result follows the prediction of the nearly-neutral theory and suggests that in populations with higher diversity (e.g., Bos or Ovis), discrimination between beneficial and deleterious mutations is more likely to occur (Fig F-H in S3 File). Conversely, many more mutations are effectively neutral in populations with lower diversity (e.g., Homo).
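The conversion from diversity to effective population size can be sketched as follows. The snippet uses the classic Watterson estimator computed from an unfolded site-frequency spectrum and then applies Ne = θ/(4u); it may differ in detail from the estimator of Achaz [45] used in the study. All numerical inputs (the spectrum, the number of sites, and the mutation rate) are hypothetical placeholders, as are the function names.

```python
def watterson_theta(sfs: list, sample_size: int, n_sites: int) -> float:
    """Watterson's theta per site from an unfolded SFS.

    sfs[i - 1] is the number of sites carrying i derived alleles,
    for i = 1 .. sample_size - 1.
    """
    if len(sfs) != sample_size - 1:
        raise ValueError("the SFS must have sample_size - 1 entries")
    segregating_sites = sum(sfs)
    harmonic = sum(1.0 / i for i in range(1, sample_size))
    return segregating_sites / (harmonic * n_sites)


def effective_population_size(theta: float, mutation_rate: float) -> float:
    """Ne = theta / (4u), with u the mutation rate per site per generation."""
    return theta / (4.0 * mutation_rate)


# Hypothetical synonymous SFS for a sample of 16 alleles (15 frequency classes).
synonymous_sfs = [900, 420, 260, 190, 150, 120, 100, 90, 80, 70, 65, 60, 55, 50, 45]
theta = watterson_theta(synonymous_sfs, sample_size=16, n_sites=2_000_000)
ne = effective_population_size(theta, mutation_rate=1.2e-8)  # hypothetical u
print(f"theta = {theta:.2e} per site, Ne = {ne:,.0f}")
```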
Populations are shown as circles, and species means across populations as squares. (A) Proportion of nearly-neutral mutations at the population scale (y-axis), shown as a function of estimated effective population size (Ne, x-axis). (B) Proportion of beneficial non-adaptive mutations among all beneficial mutations (y-axis), shown as a function of Ne (x-axis). Correlations account for phylogenetic relationship and non-independence of samples, through the fit of a Phylogenetic Generalized Linear Model (see section 4.6 in Materials & methods).
Finally, mutations predicted to be beneficial were indeed beneficial for individuals bearing them, with a precision (Fig 2F) in the range of 19–87% (Table 1 and Fig 3D for humans). This result confirms that selection toward amino acids restoring existing functions is ongoing in these populations. Importantly, the recall value in this case, computed as the proportion of mutations beneficial at the population scale that were also predicted as beneficial at the phylogenetic scale, is the probability for a beneficial mutation at the population scale to be non-adaptive, i.e., going toward a fitter amino acid given a stable fitness landscape (Fig 2G, Table A in S4 File). In other words, the recall value quantifies the proportion of beneficial mutations restoring damaged genomes instead of creating adaptive innovations. Across the 28 populations, this proportion is in the range of 11–82% (Table 1), with a mean of 30%. Accounting for phylogenetic relationships, we found no correlation between the proportion of beneficial non-adaptive mutations and estimates of Ne based on genetic diversity (r² = 0.00, p = 0.772, Fig 4B).
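The relation between precision and recall can be illustrated with a short sketch based on Bayes' rule. It is not the authors' implementation (described in section 4.7 of Materials & methods): the function names, the table of conditional proportions, and the class weights are all hypothetical, and the weights are assumed to be the share of mutations falling in each predicted class.

```python
CLASSES = ("deleterious", "nearly-neutral", "beneficial")


def precision(conditional: dict, cls: str) -> float:
    """P(class at the population scale = cls | predicted class = cls)."""
    return conditional[cls][cls]


def recall(conditional: dict, weights: dict, cls: str) -> float:
    """P(predicted class = cls | class at the population scale = cls), by Bayes' rule."""
    total = sum(weights[c] * conditional[c][cls] for c in CLASSES)
    return weights[cls] * conditional[cls][cls] / total


# Hypothetical values: conditional[predicted][observed] are proportions estimated
# at the population scale within each predicted class (each row sums to 1).
conditional = {
    "deleterious": {"deleterious": 0.950, "nearly-neutral": 0.049, "beneficial": 0.001},
    "nearly-neutral": {"deleterious": 0.450, "nearly-neutral": 0.500, "beneficial": 0.050},
    "beneficial": {"deleterious": 0.100, "nearly-neutral": 0.400, "beneficial": 0.500},
}
# Hypothetical share of mutations in each predicted class (sums to 1).
weights = {"deleterious": 0.80, "nearly-neutral": 0.19, "beneficial": 0.01}

print(f"precision (beneficial): {precision(conditional, 'beneficial'):.2f}")
print(f"recall (beneficial):    {recall(conditional, weights, 'beneficial'):.2f}")
```

The example shows why the recall for the beneficial class depends strongly on the weights: even a rare predicted class can account for a sizeable share of all beneficial mutations when the other classes contain very few of them.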
We additionally performed controls and simulations to ensure that our results were robust. First, we verified that these estimations were not affected by SNP mispolarization (Fig A-B in S4 File). Second, we performed simulations at the population-genetic level and confirmed that our method was able to recover the proportion of beneficial mutations that are non-adaptive in synthetic polymorphism datasets (Fig C in S4 File). Third, we ran our analysis filtering out CpG mutations and obtained values of this proportion in the range of 5–27%, with a mean of 14% (Table B-C in S4 File, Fig D in S4 File), providing more conservative estimates. Finally, because the phylogenetic mutation-selection codon model should fit better for genes with uniformly conserved functions, we filtered out genes under pervasive adaptation [46] as a control. In this subset of the exome, containing genes with a more stable fitness landscape, we found an increase in the proportion of beneficial mutations that are non-adaptive (Wilcoxon signed-rank, s = 80, p = 0.002, Table D in S4 File), consistent with our expectation that beneficial mutations occur more frequently in genes under changing fitness landscapes.
2.3 Selection in the terminal lineage and in populations
As an alternative to relying solely on currently segregating mutations to quantify selection, one can leverage both polymorphism within a population and substitutions in the terminal lineage to estimate the distribution of fitness effects (DFE). Hence, we estimated precision and recall as done previously, but now including the number of substitutions per site as input for the DFE estimators (see section 4.6 in Materials & methods and Fig E in S4 File). When including substitutions in the terminal lineage, estimates of the proportion of beneficial mutations that are non-adaptive are in the 10–78% range with a mean of 36%, and 19 out of 28 estimates fall between 15% and 45% (Table E-F in S4 File).
Additionally, we verified that these estimations were not affected by SNP mispolarization (Fig F in S4 File). We also filtered out genes under pervasive adaptation, and again found an increase in the proportion of beneficial mutations that are non-adaptive, consistent with our expectation (Wilcoxon signed-rank, s = 120, p = 0.027, Table G in S4 File). We assessed the impact of fitting the same functional form of DFE to the three different categories of changes (deleterious, nearly-neutral and beneficial). To this aim, we computed the total amount of current selection either by fitting a single DFE on the whole dataset or by summing the three independent DFEs. These disjoint estimates are well correlated, with goodness-of-fit values of r² = 0.95, 0.89 and 0.82 for the three population-scale proportions (Fig G, Fig H and Fig I in S4 File, respectively). Finally, we evaluated the effect of fitting a parametric functional form for the DFE. As implemented in Tataru et al. [42], the DFE is a mixture between a reflected gamma distribution and an exponential distribution (Eq 8, section 4.7 in Materials & methods). Instead of using such a continuous DFE, we also tested our prediction with a non-parametric functional form for the DFE, obtaining estimates of the proportion of beneficial mutations that are non-adaptive in the 8–94% range, with a mean of 43% (Table H-I in S4 File).
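To make the parametric DFE more concrete, the sketch below draws scaled selection coefficients from a mixture of a reflected gamma distribution (deleterious side) and an exponential distribution (beneficial side), then tallies the share of draws in each selection class. It is a Monte Carlo illustration only: the parameter names and values are hypothetical, the exact parameterization of Eq 8 is not reproduced here, and the class boundaries at S = −1 and S = 1 are an assumption of this sketch.

```python
import math
import random


def sample_dfe(rng, p_beneficial, mean_beneficial, mean_deleterious, shape):
    """Draw one scaled selection coefficient S from the mixture DFE."""
    if rng.random() < p_beneficial:
        # Exponential component: S > 0 with the given mean.
        return rng.expovariate(1.0 / mean_beneficial)
    # Reflected gamma component: S < 0, with |S| of the given mean and shape.
    return -rng.gammavariate(shape, mean_deleterious / shape)


def class_proportions(n_draws, seed=0, threshold=1.0, **dfe_params):
    """Monte Carlo share of draws that are deleterious, nearly-neutral or beneficial."""
    rng = random.Random(seed)
    counts = {"deleterious": 0, "nearly-neutral": 0, "beneficial": 0}
    for _ in range(n_draws):
        s = sample_dfe(rng, **dfe_params)
        if s > threshold:
            counts["beneficial"] += 1
        elif s < -threshold:
            counts["deleterious"] += 1
        else:
            counts["nearly-neutral"] += 1
    return {name: count / n_draws for name, count in counts.items()}


# Hypothetical DFE parameters.
params = dict(p_beneficial=0.1, mean_beneficial=2.0, mean_deleterious=100.0, shape=0.3)
proportions = class_proportions(100_000, seed=1, **params)
for name, value in proportions.items():
    print(f"{name}: {value:.3f}")
```

Because only the exponential component produces positive coefficients, the beneficial share in this sketch should approach p_beneficial × exp(−1/mean_beneficial), which gives a simple check on the sampler.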
3 Discussion
3.1 Beneficial mutations are not necessarily adaptive
This study represents an essential step toward integrating the different evolutionary scales necessary to understand the combined effects of mutation, selection, and drift on genome evolution. In particular, we have been able to quantify the proportion of beneficial mutations that are non-adaptive (i.e., not a response to a change in fitness landscape), which has only been achievable by combining exome-wide data from both phylogenetic and population scales. At the phylogenetic scale, codon diversity at each site of a protein-coding DNA alignment allows for reconstructing an amino-acid fitness landscape, assuming that this landscape is stable along the phylogenetic tree. These amino-acid fitness landscapes allow us to predict any mutation’s selection coefficient (S0) along a protein-coding sequence. We have compared these selective effects to observations at the population level, and by doing so, we have confirmed that mutations predicted to be deleterious are generally purged in extant populations. Our results concur with previous studies showing that SIFT scores [47, 48], based on amino acid alignments across species, also inform on the deleterious fitness effects exerted at the population scale [25]. However, contrary to SIFT scores, our mutation-selection model is parameterized by a fitness function such that changes are directly interpretable as fitness effects (see also section 1 in S3 File). In this regard, an interesting prediction of our model is that some deleterious mutations reach fixation due to genetic drift, while beneficial non-adaptive mutations restore states of higher fitness. We have tested this hypothesis and have found that a substantial part of these predicted non-adaptive mutations are indeed beneficial in extant populations. We estimate that between 11 and 82% of all beneficial mutations in mammalian populations are not adaptive. More specifically, in 24 out of 28 populations analyzed, the percentage of beneficial mutations estimated to be non-adaptive falls between 15 and 45%. These results suggest that many beneficial mutations are not adaptive, but rather restore states of higher fitness. Hence, we can correctly estimate the extent of adaptive evolution only if we account for the number of beneficial non-adaptive mutations [49, 50]. Here instead, we argue that we should dissociate positive selection from adaptive evolution and limit the use of adaptive mutations to those that are associated with adaptation to environmental change as such [12, 13, 51].
3.2 Assumptions and methodological limitations
The exact estimation of the contribution of beneficial non-adaptive mutations to positive selection relies on some hypotheses at both the phylogenetic and population scales and is sensitive to methodological limitations. Indeed, data quality and potentially inadequate modeling choices of both the fitness landscape (at the phylogenetic scale) and fitness effects (at the population scale) might also lead to missed predictions [10]. In practice, we obtained different values of the proportion of non-adaptive beneficial mutations depending on i) the filtering or not of CpG mutations [52], ii) whether we included substitutions in the terminal lineage along with within-population polymorphisms to estimate fitness effects [42], and iii) the model used to infer the fitness effects. It appears that our estimation can be sensitive to model misspecification and overall, while we provide an order of magnitude for the contribution of beneficial non-adaptive mutations to positive selection, methodological improvement on the estimation of the DFE is needed to increase the precision of this value.
To be conservative, we considered mutations as adaptive if they were detected as being under positive selection at the population scale despite them being either incorrectly predicted as deleterious (S0 < −1) or nearly-neutral (−1 < S0 < 1) from the amino-acid mammalian fitnesses. An example of an incorrectly predicted deleterious mutation (S0 < −1) from its fitness landscape could be an amino acid having always been deleterious across mammals, but being advantageous (S > 1) in the current species due to environmental changes or a major shift in their fitness landscape (e.g. domestication). To visualize an example of a wrongly predicted nearly-neutral mutation, we can first imagine a site where only hydrophilic amino acids are accepted because of the protein properties (e.g. a surface site of a globular protein). Let us then assume that such a site is also a target for viruses, hence promoting amino-acid changes which modify the site’s viral affinity [4]. Given the selective pressure favoring amino-acid change, but restricting the possibilities to hydrophilic amino acids, most hydrophilic amino acids will likely be visited along the phylogenetic tree and the mutation-selection model will give high and similar fitnesses to all of them. In such a case, any mutation between hydrophilic amino acids will be wrongly predicted as nearly-neutral (−1 < S0 < 1), while it is in fact adaptive. In summary, under changing fitness landscapes [53], our phylogenetic mutation-selection model takes an average over fitness changes observed along the phylogeny, causing beneficial mutations (S > 1) to be predicted as either deleterious (S0 < −1) or nearly-neutral (−1 < S0 < 1), therefore mechanically reducing the estimated proportion of beneficial mutations that are non-adaptive, and making our estimate conservative.
3.3 Convergent adaptation
If there are several substitutions toward the same amino acid along the mammalian tree (section 1.1 in S2 File), our mutation-selection model cannot formally distinguish a scenario where mammals have fixed deleterious mutations that are reverted in several lineages from more complex scenarios involving convergent adaptation across mammals. In a first scenario, repeated changes of fitness landscapes in the same direction could occur along several lineages, leading to repeated substitutions in multiple lineages (parallel or convergent adaptation). In a second scenario, an environmental change that occurred near the root of placental mammals (∼100 Mys ago), to which extant populations are currently responding independently through weakly adaptive mutations, could also lead to repeated substitutions toward the same amino acids. Importantly, we would usually expect adaptive convergent mutations to be linked to particular converging phenotypes across mammals, and hence, they should not massively affect the whole genome as we find (Fig C and Table D in S2 File). Moreover, after filtering out genes usually associated with recurrent adaptation (e.g. immune genes), we recover an even higher proportion of beneficial non-adaptive mutations (Table D in S4 File). For these reasons, we argue that the signal of predictable positive selection we recover in extant populations is indeed mainly driven by non-adaptive evolution.
3.4 The influence of effective population size
Across the genome, beneficial non-adaptive mutations and deleterious mutations reaching fixation create a balance in which genomes are constantly damaged and restored simultaneously at different loci due to drift. Since the probability of fixation of mutations depends on the effective population size (Ne), the history of Ne plays a crucial role in determining the number of beneficial non-adaptive mutations compensating for deleterious mutations [54]. For example, a population size expansion will increase the efficacy of selection, and a larger proportion of mutations will be beneficial (otherwise effectively neutral), thus increasing the number of beneficial non-adaptive mutations. On the other hand, a population that has experienced a high Ne throughout its history should be closer to an optimal state under a stable fitness landscape, having suffered fewer fixations of deleterious mutations and therefore decreasing the probability of beneficial non-adaptive mutations [55]. Overall, we expect the proportion of beneficial non-adaptive mutations to be more dependent on Ne’s long-term expansions and contractions than on the short-term ones [12, 55].
Moreover, because our model assumes a fixed fitness landscape, it implicitly assumes that Ne is constant along the phylogenetic tree. Fluctuations due to changes in the fitness landscape or in Ne will be averaged out by the assumption of the current model that Ne is constant across lineages. It was recently shown [54], using computationally intensive mutation-selection models with fluctuating Ne, that relaxing the assumption of a constant Ne results in more extreme estimates of amino-acid fitnesses than with the standard model used in this study. In other words, by assuming a constant Ne, we are underpowered to detect beneficial non-adaptive mutations since amino acids will have more similar fitnesses. As a consequence, some of the beneficial non-adaptive mutations currently segregating in populations will be incorrectly classified as nearly-neutral by the mutation-selection model, and thus be wrongly interpreted as adaptive (see previous section). This ultimately results in lower estimates of the proportion of beneficial non-adaptive mutations. Given this inflation of missed predictions due to change in population sizes [14, 56, 57], our estimated proportion of beneficial non-adaptive mutations among beneficial ones is likely to be an underestimation.
3.5 The role of epistasis and compensatory mutations
Our model assumes that amino-acid fitness landscapes are site-specific and also independent of one another, whereas under pervasive epistasis, the fitness effect of any mutation at a particular site would depend on the amino acids present at other sites. Epistasis is common for mutations that influence the protein’s physical properties (e.g. conformation, stability, or affinity for ligands) or might arise due to nonlinear relationship between the protein’s physical properties and fitness [58]. Regardless of its origin, epistasis has been shown to play a role in the evolution of protein-coding genes, with amino-acid residues in contact within a protein or between proteins tending to co-evolve [58–60]. Particularly, the residues in contact co-evolve to become more compatible with each other generating an entrenchment [61–63]. Epistasis therefore allows for compensatory mutations, which restore fitness through mutations at loci different from where deleterious mutations took place, representing another case of non-adaptive beneficial mutations, but one which is not accounted for by our method. Hence, the beneficial mutations that we classify as putatively adaptive might in fact be compensatory mutations, making our estimation of the rate of non-adaptive beneficial mutations conservative.
Despite epistasis being an important factor in protein evolution, several deep-mutational scanning experiments have revealed that a site-specific fitness landscape predicts the evolution of sequences in nature with considerable accuracy [64–66]. Additionally, the fact that we observe such a high proportion of beneficial non-adaptive mutations suggests that the underlying assumptions of our model, namely site-independence, implying no epistasis, and a static fitness landscape, are a reasonable approximation for the underlying fitness landscape of proteins. Our results imply that the fitness effects of new mutations are mostly conserved across mammalian orthologs, in agreement with other studies showing that for conserved orthologs with similar structures and functions, models without epistasis provide a reasonable estimate of fitness effects in protein-coding genes [67, 68]. Conceptually, the framework presented here, with the addition of a more complex protein fitness landscape at the phylogenetic scale, could be used to infer the relative contribution of compensatory mutations to non-adaptive and adaptive evolution.
3.6 Detecting adaptation above the nearly-neutral background
A long-standing debate in molecular evolution is whether the variation we observe between species in protein-coding genes is primarily due to nearly-neutral mutations reaching fixation by drift or primarily due to adaptation [15, 69–71]. Measuring the “rate of adaptation” in proteins, as pioneered by McDonald & Kreitman [5], has been central to inform this debate [72]. However, the McDonald & Kreitman test detects signatures of accelerated evolution in a given terminal branch compared to an expectation based on polymorphism present in the population. It considers the fraction of substitutions that fix too quickly as “adaptive” [5–7] despite there being other processes that can lead to their fixation [73–75] and some of these substitutions being beneficial but non-adaptive [12–14, 17]. Here, the expectation is built on the pattern of substitutions across a phylogeny compared to the fitness effects that can be estimated from both substitutions in a terminal lineage and polymorphism in populations. Moreover, the goal is not to detect a fraction of beneficial substitution (i.e. “adaptive” substitutions for McDonald & Kreitman [5]), but to estimate the proportion of non-adaptive mutations among beneficial ones.
We provide evidence that in mammalian orthologs, many substitutions occur through fixation of both deleterious mutations and beneficial non-adaptive mutations. Detecting adaptation above this background of substitutions remains a challenge [69, 76]. Mathematically, the surplus of positive selection due to an externally-driven changing fitness landscape is called fitness flux, and requires experimentally measuring the selection coefficient of each mutation in each genetic background. The fitness flux can be estimated if either the substitution history is known [13] or the changes in frequency of currently segregating variants are known [51]. Without experimentally measured selection coefficients, another strategy is precisely to use a nearly-neutral substitution model as a null model of evolution. Under a strictly neutral evolution of protein-coding sequence, we expect the ratio of non-synonymous over synonymous substitutions (dN/dS) to be equal to one. Deviations from this neutral expectation, such as dN/dS > 1, which can be generated by an excess of non-synonymous substitutions, are generally interpreted as a sign of adaptation. However, as shown in this study, a dN/dS > 1 is not necessarily a signature of adaptation but can be due to beneficial non-adaptive mutations. So, by relaxing the strict neutrality and assuming a stable fitness landscape instead, one can predict the expected rate of evolution, called ω0 [77, 78]. Adaptation can thus be considered as evolution under a changing fitness landscape and tested as such by searching for the signature of dN/dS > ω0 [19, 30, 79]. Using a stable fitness landscape as a null model of evolution, thus accounting for selective constraints exerted on the different amino acids, increased the statistical power in testing for adaptation [46].
Instead of relying solely on summary statistics (such as dN/dS or ω0), another strategy to detect adaptation is to include changes in the fitness landscapes inherently within the mutation-selection framework, either with small changes along the phylogeny [80] or by allowing fitness to change on subsets of branches [81, 82]. Such mechanistic models could be more general than site-specific fitness landscapes, including epistasis and changing fitness landscapes [62, 82].
3.7 Conclusions
We have provided empirical evidence that an evolutionary model assuming a stable fitness landscape at the mammalian scale allows us to predict the fitness effects of mutations in extant populations and individuals, acknowledging the balance between deleterious and beneficial non-adaptive mutations. We argue that such a model would represent a null expectation for the evolution of protein-coding genes in the absence of adaptation. Altogether, because a substantial part of positive selection can be explained by beneficial non-adaptive mutations, but not its entirety, we argue that the mammalian exome is shaped by both adaptive and non-adaptive processes, and that none of them alone is sufficient to explain the observed patterns of changes. In that sense, to avoid conflating beneficial mutations with adaptive evolution, the term “adaptation” should retain its original meaning associated with a change in the underlying fitness landscape and be modelled as such [13, 51].
4 Materials & methods
4.1 Phylogenetic dataset
Protein-coding DNA sequence alignments in placental mammals and their corresponding gene trees come from the OrthoMaM database (https://www.orthomam.univ-montp2.fr) and were processed as in Latrille et al. [46]. OrthoMaM contains a total of 116 mammalian reference sequences in v10c [33, 34, 83].
Genes located on the X and Y chromosomes and on the mitochondrial genome were discarded from the analysis because the level of polymorphism, which is necessary for population-based analyses, is expected to be different in these three regions compared to the autosomal genome. Sequences of species for which we used population-level polymorphism (see section 4.3), and of their sister species, were removed from the analysis to ensure independence between the data used in the phylogenetic and population scales. Sites in the alignment containing more than 10% of gaps across the species were discarded. Altogether, our genome-wide dataset contains 14,509 protein-coding DNA sequences in 87 placental mammals.
4.2 Selection coefficient (S0) in a phylogeny-based method
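We analyzed the phylogenetic-level data using mutation-selection models. These models assume the protein-coding sequences are at mutation-selection balance under a fixed fitness landscape characterized by a fitness vector over the 20 amino acids at each site [26, 28, 84]. Mathematically, the rate of non-synonymous substitution from codon a to codon b ($q^{(i)}_{a \mapsto b}$) at site i of the sequence is equal to the rate of mutation of the underlying nucleotide change (μa↦b) multiplied by the scaled probability of mutation fixation ($\mathbb{P}^{(i)}_{a \mapsto b}$). The probability of fixation depends on the difference between the scaled fitness of the amino acid encoded by the mutated codon ($F^{(i)}_b$) and the amino acid encoded by the original codon ($F^{(i)}_a$) at site i [85, 86].
The rate of substitution from codon a to b at a site i is thus:
$$
q^{(i)}_{a \mapsto b} =
\begin{cases}
0 & \text{if codons } a \text{ and } b \text{ differ by more than one nucleotide},\\
\mu_{a \mapsto b} & \text{if codons } a \text{ and } b \text{ are synonymous},\\
\mu_{a \mapsto b} \, \dfrac{F^{(i)}_b - F^{(i)}_a}{1 - e^{F^{(i)}_a - F^{(i)}_b}} & \text{if codons } a \text{ and } b \text{ are non-synonymous}.
\end{cases}
\tag{1}
$$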
Fitting the mutation-selection model on a multi-species sequence alignment leads to an estimation of the gene-wide 4 × 4 nucleotide mutation rate matrix (μ) as well as the 20 amino-acid fitness landscape (F(i)) at each site i. The priors and full configuration of the model are given in S1 File (section 1). From a technical perspective, the Bayesian estimation is a two-step procedure [87]. The first step is a data augmentation of the alignment, consisting in sampling a detailed substitution history along the phylogenetic tree for each site, given the current value of the model parameters. In the second step, the parameters of the model can then be directly updated by a Gibbs sampling procedure, conditional on the current substitution history. Alternating between these two sampling steps yields a Markov chain Monte-Carlo (MCMC) procedure whose equilibrium distribution is the posterior probability density of interest [87, 88]. Additionally, across-site heterogeneities in amino-acid fitness profiles are captured by a Dirichlet process. More precisely, the number of amino-acid fitness profiles estimated is lower than the number of sites in the alignment. Consequently each profile has several sites assigned to it, resulting in a particular configuration of the Dirichlet process. Conversely, sites with similar signatures are assigned to the same fitness profile. This configuration of the Dirichlet process is resampled through the MCMC to estimate a posterior distribution of amino acid profiles for each site specifically [35, 89]. From a more mechanistic perspective, even though not all amino acids occur at every single codon site of the DNA alignment, we can nevertheless estimate the distribution of amino-acid fitnesses by generalizing the information recovered across sites and across amino acids based on the phylogenetic relationship among samples. 
In particular, synonymous substitutions along the tree contain the signal to estimate branch lengths and the nucleotide transition matrix, while non-synonymous substitutions contain information on fitness difference between codons connected by single nucleotide changes [35].
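As an illustration of how a substitution rate follows from a mutation rate and two scaled fitnesses, the following minimal sketch assumes the standard scaled fixation probability of mutation-selection models, ΔF/(1 − e^(−ΔF)), with ΔF the scaled fitness of the target amino acid minus that of the source. The function names and the numerical values are hypothetical and are not taken from BayesCode.

```python
import math


def scaled_fixation_probability(delta_f: float) -> float:
    """Fixation probability relative to a neutral mutation: dF / (1 - exp(-dF))."""
    if abs(delta_f) < 1e-8:
        # First-order expansion around neutrality avoids a 0/0 division.
        return 1.0 + delta_f / 2.0
    if delta_f < -700.0:
        # Strongly deleterious: the ratio is numerically zero (and exp would overflow).
        return 0.0
    return delta_f / (1.0 - math.exp(-delta_f))


def substitution_rate(mu_ab: float, f_a: float, f_b: float, synonymous: bool) -> float:
    """Rate from codon a to codon b, assumed to differ by a single nucleotide."""
    if synonymous:
        return mu_ab  # synonymous changes are treated as neutral
    return mu_ab * scaled_fixation_probability(f_b - f_a)


# Hypothetical values: unit mutation rate, scaled fitness difference of 2.
neutral = substitution_rate(1.0, 0.0, 0.0, synonymous=False)
beneficial = substitution_rate(1.0, 0.0, 2.0, synonymous=False)
deleterious = substitution_rate(1.0, 2.0, 0.0, synonymous=False)
```

With these values, the mutation toward the fitter amino acid is fixed faster than a neutral one, and the reverse mutation more slowly, which is the asymmetry the model exploits to infer fitness differences from substitution patterns.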
The selection coefficient for a mutation from codon a to codon b at site i is defined as:
$$
S_0^{(i)}(a \mapsto b) = \Delta F^{(i)}_{a \mapsto b} = F^{(i)}_b - F^{(i)}_a.
\tag{2}
$$
In our subsequent derivation the source (a) and target (b) codons as well as the site (i) are implicit and thus never explicitly written.
The scaled selection coefficient (S0 = ΔF) is formally the product of the selection coefficient at the individual level (s) and the effective population size (Ne), as S0 = 4Ne × s. The value of S0 informs us on the strength of selection exerted on amino acids changes. Thus, according to its S0 value, we can classify any mutation as either a deleterious mutation toward a less fit amino acid (S0 < −1), a nearly-neutral mutation (−1 < S0 < 1), or a mutation toward a known fitter amino acid, constituting thus a beneficial non-adaptive mutation (S0 > 1).
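The classification can be sketched as follows, assuming thresholds of −1 and 1 on S0 (consistent with the use of dN(S0 < 1) in section 4.5) and a per-site dictionary of scaled amino-acid fitnesses. The fitness values and function names are hypothetical and serve only as an example.

```python
def selection_coefficient(site_fitness: dict[str, float], aa_from: str, aa_to: str) -> float:
    """S0 as the scaled fitness of the target amino acid minus that of the source."""
    return site_fitness[aa_to] - site_fitness[aa_from]


def classify_s0(s0: float, threshold: float = 1.0) -> str:
    """Assign a mutation to a selection class from its phylogeny-based S0."""
    if s0 < -threshold:
        return "deleterious"
    if s0 > threshold:
        return "beneficial non-adaptive"
    return "nearly-neutral"


# Hypothetical scaled fitnesses at one site (lysine is the fittest state).
site_fitness = {"K": 0.0, "R": -0.4, "E": -3.2}

away_from_optimum = classify_s0(selection_coefficient(site_fitness, "K", "E"))
back_to_optimum = classify_s0(selection_coefficient(site_fitness, "E", "K"))
conservative_change = classify_s0(selection_coefficient(site_fitness, "K", "R"))
```

In this example, a site that has drifted to glutamate can return to lysine through a mutation classified as beneficial non-adaptive, since the fitness landscape itself has not changed.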
We used the Bayesian software BayesCode (https://github.com/ThibaultLatrille/bayescode, v1.3.1) to estimate the selection coefficients for each protein-coding gene in the mammalian dataset. We ran the MCMC algorithm implemented in BayesCode for 2,000 generations as described in Latrille et al. [46]. For each gene, after discarding a burn-in period of 1,000 generations of MCMC, we obtained posterior mean estimates (over the 1,000 generations left of MCMC) of the mutation rate matrix (μ) as well as the 20 amino-acid fitness landscape (F(i)) at each site i.
4.3 Polymorphism dataset
The genetic variants representing the population level polymorphisms were obtained from the following species and their available datasets: Equus caballus (EquCab2 assembly in the EVA study PRJEB9799 [90]), Bos taurus (UMD3.1 assembly in the NextGen project: https://projects.ensembl.org/nextgen/), Ovis aries (Oar_v3.1 assembly in the NextGen project), Capra hircus (CHIR1 assembly in the NextGen project, converted to ARS1 assembly with dbSNP identifiers [91]), Chlorocebus sabaeus (ChlSab1.1 assembly in the EVA project PRJEB22989 [92]), Homo sapiens (GRCh38 assembly in the 1000 Genomes Project [93]). In total, we analyzed 28 populations across the 6 different species with polymorphism data. The data was processed as described in Latrille et al. [46].
Only bi-allelic single nucleotide polymorphisms (SNPs) found within a gene were retained in our polymorphism dataset, while nonsense variants and indels were discarded. To construct the dataset, we first recovered the location of each SNP (represented by its chromosome, position, and strand) in the focal species and matched it to its corresponding position in the coding sequence (CDS) using gene annotation files (GTF format) downloaded from Ensembl (ensembl.org). We then verified that the SNP downloaded from Ensembl matched the reference in the CDS in FASTA format. Next, the position in the CDS was converted to the corresponding position in the multi-species sequence alignment (containing gaps) from the OrthoMaM database (see section 4.1) for the corresponding gene by doing a global pairwise alignment (Biopython module pairwise2). This conversion from genomic position to alignment position was only possible when the assembly used for SNP-calling was the same as the one used in the OrthoMaM alignment, the GTF annotations, and the FASTA sequences. SNPs were polarized using the three closest outgroups found in the OrthoMaM alignment with est-usfs v2.04 [94], and alleles with a probability of being derived lower than 0.99 were discarded.
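The conversion from a CDS position to a column of the gapped alignment can be illustrated with a simplified sketch. It assumes that the CDS is identical to the ungapped alignment row of the focal species, which sidesteps the global pairwise alignment step used in the actual pipeline; the function name and the toy sequence are hypothetical.

```python
def cds_to_alignment_columns(aligned_row: str, gap: str = "-") -> list[int]:
    """Return, for each ungapped CDS position, its 0-based column in the gapped row."""
    return [column for column, char in enumerate(aligned_row) if char != gap]


# Toy alignment row of the focal species: one codon-sized gap after the start codon.
aligned_row = "ATG---GCCAAA"
mapping = cds_to_alignment_columns(aligned_row)

# CDS position 3 (0-based, first base of the second codon) falls in alignment column 6.
snp_cds_position = 3
snp_alignment_column = mapping[snp_cds_position]
```

When the CDS and the alignment row differ (for example because of a different annotation version), a pairwise alignment between the two is needed before building such a mapping, which is why the pipeline requires matching assemblies.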
4.4 Mutational opportunities
The mutational opportunities of any new mutation refer to its likelihood of falling into a specific category (synonymous, deleterious, nearly-neutral, or beneficial). Deriving such opportunities is necessary to estimate the strength of selection exerted at the population scale since different categories might have different mutational opportunities, and thus polymorphism and divergence need to be corrected accordingly (see sections 4.5, 4.6, and 4.7). To calculate mutational opportunities, we reconstructed the ancestral exome of each of the 28 populations by using the most likely ancestral state from est-usfs (see section 4.3), which differs from the corresponding species reference exome since it accounts for the variability present in the specific population.
From the reconstructed ancestral exome, all possible mutations were computed, weighted by the instantaneous rate of change between nucleotides obtained from the mutation rate matrix (μ, see section 4.2), summing to μtot across the whole exome, and to μsyn when restricted to synonymous mutations. Finally, the mutational opportunities for synonymous mutations were computed as the total number of sites across the exome (Ltot) weighted by the proportion of synonymous mutations among all possible mutations as:
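$$
L_{\mathrm{syn}} = L_{\mathrm{tot}} \, \frac{\mu_{\mathrm{syn}}}{\mu_{\mathrm{tot}}}.
\tag{3}
$$
Similarly, for non-synonymous mutations, the total mutation rate for each class of selection $x \in \{\mathcal{D}_0, \mathcal{N}_0, \mathcal{B}_0\}$ (deleterious, nearly-neutral, and beneficial non-adaptive mutations at the phylogenetic scale), called μ(x), was estimated as the sum across all non-synonymous mutations if their selection coefficient at the phylogenetic scale is in the class S0 ∈ x. Accordingly, the mutational opportunities (L(x)) for each class of selection coefficient (x) were finally computed as the total number of sites across the exome (Ltot) weighted by the ratio of the aggregated mutation rates falling in the class μ(x):
$$
L(x) = L_{\mathrm{tot}} \, \frac{\mu(x)}{\mu_{\mathrm{tot}}}.
\tag{4}
$$
Finally, $\mathbb{P}[x]$ is the probability for a non-synonymous mutation to be in the class x, thus computed as:
$$
\mathbb{P}[x] = \frac{\mu(x)}{\mu(\mathcal{D}_0) + \mu(\mathcal{N}_0) + \mu(\mathcal{B}_0)}.
\tag{5}
$$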
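The computation of mutational opportunities can be sketched as below, assuming that the aggregated mutation rates (total, synonymous, and per selection class) have already been summed over the reconstructed ancestral exome. The class labels "D0", "N0" and "B0", the function name, and all numbers are hypothetical placeholders.

```python
def mutational_opportunities(
    l_tot: float, mu_tot: float, mu_syn: float, mu_class: dict[str, float]
) -> tuple[float, dict[str, float], dict[str, float]]:
    """Return (L_syn, L(x) per class, P[x] per class) from aggregated mutation rates."""
    l_syn = l_tot * mu_syn / mu_tot
    l_class = {x: l_tot * mu / mu_tot for x, mu in mu_class.items()}
    # P[x] is normalised over the non-synonymous classes only.
    mu_nonsyn = sum(mu_class.values())
    p_class = {x: mu / mu_nonsyn for x, mu in mu_class.items()}
    return l_syn, l_class, p_class


# Hypothetical aggregated rates for a toy exome of 1,000 sites.
l_syn, l_class, p_class = mutational_opportunities(
    l_tot=1000.0,
    mu_tot=10.0,
    mu_syn=2.5,
    mu_class={"D0": 6.0, "N0": 1.0, "B0": 0.5},
)
```

In this toy case most non-synonymous opportunities fall in the deleterious class, so raw counts of SNPs or substitutions per class would not be comparable without this correction.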
4.5 Substitution mapping and dN/dS in the terminal branch
We inferred the protein-coding DNA sequences for each node of the 4-taxa tree containing the focal species and the three closest outgroups species found in the OrthoMaM alignment by applying the M5 codon model (gamma site rate variation) as implemented in FastML.v3.11 [95]. Consequently, for each focal species we reconstructed the protein coding DNA sequence of the whole exome at the base of the terminal branch before the split from the sister species. We considered Ceratotherium simum simum as Equus caballus’ sister species; Bison bison bison as Bos taurus’ sister species; Pantholops hodgsonii as Ovis aries’ sister species; Pantholops hodgsonii as Capra hircus’ sister species; Macaca mulatta as Chlorocebus sabaeus’ sister species and finally, we considered Pan troglodytes as Homo sapiens’ sister species. From this reconstructed exome, we determined the direction of the substitution occurring along the terminal branch of the phylogenetic tree toward each extant population. SNPs segregating in the population were discarded, and the most likely ancestral state from est-usfs (see section 4.3) was used as the reference for each extant population. For each substitution, we recovered its S0 value as calculated through the phylogeny-based method (see section 4.2). Finally, the rate of non-synonymous over synonymous substitutions for a given class of selection coefficient (dN(x)/dS) was computed as:
$$
\frac{d_N(x)}{d_S} = \frac{D(x) / L(x)}{D_{\mathrm{syn}} / L_{\mathrm{syn}}},
\tag{6}
$$
where D(x) was the number of non-synonymous substitutions in class x, Dsyn was the number of synonymous substitutions across the exome, while L(x) and Lsyn were the numbers of non-synonymous and synonymous mutational opportunities, respectively, as defined in section 4.4. δ(dN/dS) was computed as the difference between dN/dS computed over all substitutions and dN/dS when we removed beneficial non-adaptive mutations (dN(S0 < 1)/dS), normalized by dN/dS. Note that the quantities δ(dN/dS) and δ(dN) are equivalent due to the simplification of the factor dS:
$$
\delta(d_N/d_S) = \frac{d_N/d_S - d_N(S_0 < 1)/d_S}{d_N/d_S} = \frac{d_N - d_N(S_0 < 1)}{d_N} = \delta(d_N).
\tag{7}
$$
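A minimal numerical sketch of these two quantities is given below. It assumes that dN for the class S0 < 1 uses the substitutions and the mutational opportunities of the deleterious and nearly-neutral classes together; the class labels "D0", "N0" and "B0" and all counts are hypothetical.

```python
def dn_ds(d_nonsyn: float, l_nonsyn: float, d_syn: float, l_syn: float) -> float:
    """Non-synonymous over synonymous substitution rate, each per mutational opportunity."""
    return (d_nonsyn / l_nonsyn) / (d_syn / l_syn)


def delta_dn_ds(
    d_class: dict[str, float], l_class: dict[str, float], d_syn: float, l_syn: float
) -> float:
    """Relative drop in dN/dS once substitutions predicted as beneficial (B0) are removed."""
    full = dn_ds(sum(d_class.values()), sum(l_class.values()), d_syn, l_syn)
    d_below = d_class["D0"] + d_class["N0"]
    l_below = l_class["D0"] + l_class["N0"]
    without_b0 = dn_ds(d_below, l_below, d_syn, l_syn)
    return (full - without_b0) / full


# Hypothetical substitution counts and opportunities in a terminal branch.
d_class = {"D0": 60.0, "N0": 100.0, "B0": 40.0}
l_class = {"D0": 600.0, "N0": 100.0, "B0": 50.0}
delta = delta_dn_ds(d_class, l_class, d_syn=250.0, l_syn=250.0)
```

Because dS appears in both terms of the numerator and in the denominator, the result does not depend on the synonymous counts, which is the simplification stated in the text.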
4.6 Scaled selection coefficients (S) in a population-based method
To obtain a quantitative estimate of the distribution of selection coefficients for each category of SNPs, we used the polyDFE model [42, 96]. This model uses the count of derived alleles to infer the distribution of fitness effects (DFE). The probability of sampling an allele at a given frequency (before fixation or extinction) is informative of its scaled selection coefficient at the population scale (S). Therefore, pooled across many sites, the site-frequency spectrum (SFS) provides information on the underlying S of mutations. However, estimating a single S for all sampled mutations is biologically unrealistic, and a DFE of mutations is usually assumed [39, 40]. The polyDFE [42, 96] software implements a mixture of a Γ and exponential distributions to model the DFE of non-synonymous mutations, while synonymous mutations are considered neutral. The model estimates the parameters βd, b, pb and βb for non-synonymous mutations as:
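$$
\phi(S; \beta_d, b, p_b, \beta_b) =
\begin{cases}
(1 - p_b) \, f_\Gamma(-S; -\beta_d, b) & \text{if } S \leq 0,\\
p_b \, f_e(S; \beta_b) & \text{if } S > 0,
\end{cases}
\tag{8}
$$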
where βd ≤ −1 is the estimated mean of the DFE for S ≤ 0; b ≥ 0.2 is the estimated shape of the Γ distribution; 0 ≤ pb ≤ 1 is the estimated probability that S > 0; βb ≥ 1 is the estimated mean of the DFE for S > 0; and fΓ(S;m, b) is the density of the Γ distribution with mean m and shape b, while fe(S;m) is the density of the exponential distribution with mean m.
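PolyDFE requires one SFS for non-synonymous mutations and one for synonymous mutations (neutral expectation), as well as the number of sites on which each SFS was sampled. For populations containing more than 8 individuals, the SFS was subsampled down to 16 chromosomes (8 diploid individuals) without replacement (hyper-geometric distribution) to alleviate the effect of different sampling depths in the 28 populations. Altogether, for each class of selection ($x \in \{\mathcal{D}_0, \mathcal{N}_0, \mathcal{B}_0\}$, i.e. predicted deleterious, nearly-neutral, or beneficial non-adaptive) of non-synonymous SNPs, we aggregated all the SNPs in the selection class x as an SFS. The number of sites on which each SFS was sampled is given by L(x) for the non-synonymous SFS and Lsyn for the synonymous SFS respectively. For each class of selection x, once fitted to the data using maximum likelihood with polyDFE, the parameters of the DFE (βd, b, pb, βb) were used to compute the probabilities for a mutation of class x to be deleterious ($\mathbb{P}[\mathcal{D} \mid x]$), nearly-neutral ($\mathbb{P}[\mathcal{N} \mid x]$), or beneficial ($\mathbb{P}[\mathcal{B} \mid x]$) at the population scale as:
$$
\mathbb{P}[\mathcal{D} \mid x] = \mathbb{P}[S < -1 \mid x] = \int_{-\infty}^{-1} \phi(S; \beta_d, b, p_b, \beta_b) \, \mathrm{d}S,
\tag{9}
$$
$$
\mathbb{P}[\mathcal{N} \mid x] = \mathbb{P}[-1 < S < 1 \mid x] = \int_{-1}^{1} \phi(S; \beta_d, b, p_b, \beta_b) \, \mathrm{d}S,
\tag{10}
$$
$$
\mathbb{P}[\mathcal{B} \mid x] = \mathbb{P}[S > 1 \mid x] = \int_{1}^{+\infty} \phi(S; \beta_d, b, p_b, \beta_b) \, \mathrm{d}S.
\tag{11}
$$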
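To make the link between fitted DFE parameters and class probabilities concrete, the sketch below integrates a DFE made of a reflected gamma distribution for S ≤ 0 and an exponential distribution for S > 0, assuming thresholds of −1 and 1 on S. It is not polyDFE code: the gamma tail is obtained by a simple trapezoidal rule on a log-scaled grid, the exponential tail in closed form, and the parameter values are hypothetical.

```python
import math


def gamma_pdf(t: float, mean: float, shape: float) -> float:
    """Density of a gamma distribution parameterised by its mean and shape (t > 0)."""
    scale = mean / shape
    log_pdf = ((shape - 1.0) * math.log(t) - t / scale
               - math.lgamma(shape) - shape * math.log(scale))
    return math.exp(log_pdf)


def gamma_tail(lower: float, mean: float, shape: float, steps: int = 20000) -> float:
    """Approximate P[T > lower] with a trapezoidal rule in u = log(t)."""
    upper = lower + 60.0 * mean / shape  # beyond this point the density is negligible
    u0, u1 = math.log(lower), math.log(upper)
    h = (u1 - u0) / steps
    total = 0.0
    for k in range(steps + 1):
        t = math.exp(u0 + k * h)
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * gamma_pdf(t, mean, shape) * t  # dt = t du
    return total * h


def class_probabilities(beta_d: float, b: float, p_b: float, beta_b: float):
    """Return (P[S < -1], P[-1 < S < 1], P[S > 1]) under the mixture DFE."""
    p_deleterious = (1.0 - p_b) * gamma_tail(1.0, -beta_d, b)
    p_beneficial = p_b * math.exp(-1.0 / beta_b)  # exponential survival function
    p_neutral = 1.0 - p_deleterious - p_beneficial
    return p_deleterious, p_neutral, p_beneficial


# Hypothetical parameters; with shape b = 1 the gamma reduces to an exponential.
p_del, p_neu, p_ben = class_probabilities(beta_d=-2.0, b=1.0, p_b=0.1, beta_b=2.0)
```

The nearly-neutral probability is obtained by complement rather than by direct integration, which avoids evaluating the gamma density near zero, where it diverges for shapes below one.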
Rather than relying solely on currently segregating mutations to quantify selection, polyDFE can leverage both divergence and polymorphism to estimate the parameters of the DFE. We can thus add four more inputs to polyDFE: D(x), L(x), Dsyn and Lsyn such as defined in the previous section. Because the estimates of DFE are different with this method, we naturally obtained different values of $\mathbb{P}[\mathcal{D} \mid x]$, $\mathbb{P}[\mathcal{N} \mid x]$, and $\mathbb{P}[\mathcal{B} \mid x]$.
4.7 Precision and recall
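For readability, we give here precision and recall for beneficial mutations ($\mathbb{P}[\mathcal{B} \mid \mathcal{B}_0]$ and $\mathbb{P}[\mathcal{B}_0 \mid \mathcal{B}]$), where $\mathcal{B}_0$ denotes mutations predicted to be beneficial non-adaptive at the phylogenetic scale (S0 > 1) and $\mathcal{B}$ denotes mutations that are beneficial at the population scale (S > 1). They can be obtained using the same derivation for the deleterious mutations ($\mathbb{P}[\mathcal{D} \mid \mathcal{D}_0]$ and $\mathbb{P}[\mathcal{D}_0 \mid \mathcal{D}]$) and nearly-neutral mutations ($\mathbb{P}[\mathcal{N} \mid \mathcal{N}_0]$ and $\mathbb{P}[\mathcal{N}_0 \mid \mathcal{N}]$).
Precision is the proportion of mutations correctly predicted as beneficial ($\mathcal{B} \cap \mathcal{B}_0$) out of all predicted as beneficial non-adaptive mutations ($\mathcal{B}_0$), which can be written as a conditional probability:
$$
\mathbb{P}[\mathcal{B} \mid \mathcal{B}_0] = \frac{\mathbb{P}[\mathcal{B} \cap \mathcal{B}_0]}{\mathbb{P}[\mathcal{B}_0]}.
\tag{12}
$$
Namely, precision corresponds to the probability for a $\mathcal{B}_0$ mutation to be effectively beneficial at the population level ($\mathcal{B}$). This probability, computed from Eq 11, is obtained by restricting our analysis to SNPs that are predicted to be beneficial non-adaptive mutations (yellow fill for the category $\mathcal{B}_0$ in Fig 3D).
Recall is the proportion of mutations correctly predicted as beneficial ($\mathcal{B} \cap \mathcal{B}_0$) out of all beneficial mutations ($\mathcal{B}$), which can be written as a conditional probability:
$$
\mathbb{P}[\mathcal{B}_0 \mid \mathcal{B}] = \frac{\mathbb{P}[\mathcal{B} \cap \mathcal{B}_0]}{\mathbb{P}[\mathcal{B}]}.
\tag{13}
$$
Namely, recall corresponds to the probability for a beneficial mutation at the population level ($\mathcal{B}$) to be a beneficial non-adaptive mutation ($\mathcal{B}_0$). Using Bayes theorem, recall can be re-written as:
$$
\mathbb{P}[\mathcal{B}_0 \mid \mathcal{B}] = \frac{\mathbb{P}[\mathcal{B} \mid \mathcal{B}_0] \, \mathbb{P}[\mathcal{B}_0]}{\mathbb{P}[\mathcal{B}]},
\tag{14}
$$
where $\mathbb{P}[\mathcal{B} \mid \mathcal{B}_0]$ and $\mathbb{P}[\mathcal{B}_0]$ can be calculated using Eqs 12 and 5, respectively, and $\mathbb{P}[\mathcal{B}]$ is the probability of a mutation to be beneficial at the level of the population, which can be computed from the law of total probabilities as:
$$
\mathbb{P}[\mathcal{B}] = \sum_{x \in \{\mathcal{D}_0, \mathcal{N}_0, \mathcal{B}_0\}} \mathbb{P}[\mathcal{B} \mid x] \, \mathbb{P}[x].
\tag{15}
$$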
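The step from precision to recall through Bayes theorem can be illustrated numerically. In the sketch below, the per-class probabilities of being beneficial at the population scale and the class frequencies are made-up placeholders, not estimates from this study, and the labels "D0", "N0" and "B0" stand for the three phylogeny-based classes.

```python
def recall_from_precision(
    p_beneficial_given_class: dict[str, float],
    p_class: dict[str, float],
    focal: str = "B0",
) -> tuple[float, float]:
    """Return (P[B], P[focal | B]) using the law of total probabilities and Bayes theorem."""
    p_beneficial = sum(p_beneficial_given_class[x] * p_class[x] for x in p_class)
    recall = p_beneficial_given_class[focal] * p_class[focal] / p_beneficial
    return p_beneficial, recall


# Made-up inputs: P[x] for each predicted class and P[B | x] fitted at the population scale.
p_class = {"D0": 0.80, "N0": 0.15, "B0": 0.05}
p_beneficial_given_class = {"D0": 0.01, "N0": 0.05, "B0": 0.40}

p_beneficial, recall = recall_from_precision(p_beneficial_given_class, p_class)
```

In this made-up case the precision for the B0 class is 0.40, yet recall is higher because predicted deleterious and nearly-neutral mutations, although far more numerous, are rarely beneficial at the population scale.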
4.8 Correlation with effective population size (Ne)
Genetic diversity estimator Watterson’s θS was obtained for each population from the synonymous SFS as in Achaz [45]. For each population, Ne was estimated from the equation Ne = θS/(4 × u), where u is the mutation rate per generation. Estimates for u were averaged per species across the pedigree-based estimation in Bergeron et al. [43] for Homo, Bos, Capra and Chlorocebus. For Ovis we used the estimated u of Capra. For Equus, we used u as estimated in Orlando et al. [44] (u = 7.24 × 10^−9). Because a correlation must account for phylogenetic relationship and non-independence of samples, we fitted a Phylogenetic Generalized Linear Model in R with the method pgls with default settings from the package caper [97]. The mammalian dated tree was obtained from TimeTree [98] and pruned to include only the species analysed in this study, with multi-furcation of the different populations from each species placed at the same divergence time as the species (section 2.1 in S3 File).
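The estimation of Ne from the synonymous SFS can be sketched as follows, assuming the standard form of Watterson’s estimator (number of segregating sites divided by the harmonic number of the sample size minus one, then scaled per site). The SFS, the number of sites, and the mutation rate are hypothetical values chosen for easy checking.

```python
def watterson_theta(sfs: list[int], n_sites: float) -> float:
    """Per-site Watterson's theta; sfs[k] counts SNPs with k + 1 derived copies among n chromosomes."""
    n_chromosomes = len(sfs) + 1
    harmonic = sum(1.0 / i for i in range(1, n_chromosomes))
    return sum(sfs) / harmonic / n_sites


def effective_population_size(theta: float, mutation_rate: float) -> float:
    """Ne from theta = 4 * Ne * u."""
    return theta / (4.0 * mutation_rate)


# Hypothetical synonymous SFS for 4 chromosomes sampled over 10,000 synonymous sites.
theta_s = watterson_theta([6, 3, 2], n_sites=10_000)
ne = effective_population_size(theta_s, mutation_rate=1e-8)
```

With 11 segregating sites and a harmonic number of 11/6, the toy example gives θS = 6 × 10^−4 per site and Ne = 15,000.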
Supporting information
S1 File. Supplementary appendix on the parameterization of Mutation-selection codon models.
Contains 5 pages of supplementary information including 1 figure (Fig A).
https://doi.org/10.1371/journal.pgen.1011536.s001
(PDF)
S2 File. Supplementary appendix on the characterization of non-adaptive beneficial mutations.
Contains 12 pages of supplementary information including 3 figures (Fig A to C) and 6 tables (Table A to F).
https://doi.org/10.1371/journal.pgen.1011536.s002
(PDF)
S3 File. Supplementary appendix on the contrast of selection at the phylogenetic and population-genetic scales.
Contains 7 pages of supplementary information including 8 figures (Fig A to H).
https://doi.org/10.1371/journal.pgen.1011536.s003
(PDF)
S4 File. Supplementary appendix on controls for estimating the proportion of beneficial mutations that are not adaptive.
Contains 21 pages of supplementary information including 9 figures (Fig A to I) and 9 tables (Table A to I).
https://doi.org/10.1371/journal.pgen.1011536.s004
(PDF)
Acknowledgments
We gratefully acknowledge the help of Mélodie Bastian, Nicolas Lartillot, Carina Farah Mugal, Laurent Duret, Alexandre Reymond, Daniele Silvestro and Nicolas Gambardella for their advice and reviews concerning this manuscript. This work was performed using the computing facilities of the CC LBBE/PRABI. This study makes use of data generated by the NextGen Consortium.
References
- 1. Darwin C. On the Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life. vol. 220. John Murray; 1859.
- 2. Merrell DJ. The Adaptive Seascape: The Mechanism of Evolution. U of Minnesota Press; 1994.
- 3. Gavrilets S, Losos JB. Adaptive Radiation: Contrasting Theory with Data. Science. 2009;323(5915):732–737. pmid:19197052
- 4. Enard D, Cai L, Gwennap C, Petrov DA. Viruses Are a Dominant Driver of Protein Adaptation in Mammals. eLife. 2016;5:e12469. pmid:27187613
- 5. McDonald JH, Kreitman M. Adaptive Protein Evolution at the Adh Locus in Drosophila. Nature. 1991;351(6328):652–654. pmid:1904993
- 6. Smith NGC, Eyre-Walker A. Adaptive Protein Evolution in Drosophila. Nature. 2002;415(6875):1022–1024. pmid:11875568
- 7. Welch JJ. Estimating the Genomewide Rate of Adaptive Protein Evolution in Drosophila. Genetics. 2006;173(2):821–837. pmid:16582427
- 8. Yang Z, Bielawski JP. Statistical Methods for Detecting Molecular Adaptation. Trends in Ecology and Evolution. 2000;15(12):496–503. pmid:11114436
- 9. Eyre-Walker A. The Genomic Rate of Adaptive Evolution. Trends in Ecology & Evolution. 2006;21(10):569–575.
- 10. Moutinho AF, Bataillon T, Dutheil JY. Variation of the Adaptive Substitution Rate between Species and within Genomes. Evolutionary Ecology. 2019;34(3):315–338.
- 11. Lynch M. Mutation Pressure, Drift, and the Pace of Molecular Coevolution. Proceedings of the National Academy of Sciences. 2023;120(27):e2306741120. pmid:37364099
- 12. Charlesworth J, Eyre-Walker A. The Other Side of the Nearly Neutral Theory, Evidence of Slightly Advantageous Back-Mutations. Proceedings of the National Academy of Sciences. 2007;104(43):16992–16997. pmid:17940029
- 13. Mustonen V, Lässig M. From Fitness Landscapes to Seascapes: Non-Equilibrium Dynamics of Selection and Adaptation. Trends in Genetics. 2009;25(3):111–119. pmid:19232770
- 14. Jones CT, Youssef N, Susko E, Bielawski JP. Shifting Balance on a Static Mutation–Selection Landscape: A Novel Scenario of Positive Selection. Molecular Biology and Evolution. 2017;34(2):391–407. pmid:28110273
- 15. Ohta T. The Nearly Neutral Theory of Molecular Evolution. Annual Review of Ecology and Systematics. 1992;23(1992):263–286.
- 16. Gillespie JH. On Ohta’s Hypothesis: Most Amino Acid Substitutions Are Deleterious. Journal of Molecular Evolution. 1995;40(1):64–69.
- 17. Hartl DL, Taubes CH. Compensatory Nearly Neutral Mutations: Selection without Adaptation. Journal of Theoretical Biology. 1996;182(3):303–309. pmid:8944162
- 18. Sella G, Hirsh AE. The Application of Statistical Physics to Evolutionary Biology. Proceedings of the National Academy of Sciences of the United States of America. 2005;102(27):9541–9546. pmid:15980155
- 19. Cvijović I, Good BH, Jerison ER, Desai MM. Fate of a Mutation in a Fluctuating Environment. Proceedings of the National Academy of Sciences. 2015;112(36):E5021–E5028.
- 20. Piganeau G, Eyre-Walker A. Estimating the Distribution of Fitness Effects from DNA Sequence Data: Implications for the Molecular Clock. Proceedings of the National Academy of Sciences of the United States of America. 2003;100(18):10335–10340. pmid:12925735
- 21. Chi PB, Kosater WM, Liberles DA. Detecting Signatures of Positive Selection against a Backdrop of Compensatory Processes. Molecular Biology and Evolution. 2020;37(11):3353–3362. pmid:32895716
- 22. Bazykin GA. Changing Preferences: Deformation of Single Position Amino Acid Fitness Landscapes and Evolution of Proteins. Biology Letters. 2015;11(10):20150315. pmid:26445980
- 23. Moses AM, Durbin R. Inferring Selection on Amino Acid Preference in Protein Domains. Molecular Biology and Evolution. 2009;26(3):527–536. pmid:19095755
- 24. Fischer A, Greenman C, Mustonen V. Germline Fitness-Based Scoring of Cancer Mutations. Genetics. 2011;188(2):383–393. pmid:21441214
- 25. Chen J, Bataillon T, Glémin S, Lascoux M. Hunting for Beneficial Mutations: Conditioning on SIFT Scores When Estimating the Distribution of Fitness Effect of New Mutations. Genome Biology and Evolution. 2021.
- 26. Halpern AL, Bruno WJ. Evolutionary Distances for Protein-Coding Sequences: Modeling Site-Specific Residue Frequencies. Molecular Biology and Evolution. 1998;15(7):910–917. pmid:9656490
- 27. McCandlish DM, Stoltzfus A. Modeling Evolution Using the Probability of Fixation: History and Implications. Quarterly Review of Biology. 2014;89(3):225–252. pmid:25195318
- 28. Rodrigue N, Philippe H. Mechanistic Revisions of Phenomenological Modeling Strategies in Molecular Evolution. Trends in Genetics. 2010;26(6):248–252. pmid:20452086
- 29. Tamuri AU, Goldstein RA. Estimating the Distribution of Selection Coefficients from Phylogenetic Data Using Sitewise Mutation-Selection Models. Genetics. 2012;190(3):1101–1115. pmid:22209901
- 30. Rodrigue N, Lartillot N. Detecting Adaptation in Protein-Coding Genes Using a Bayesian Site-Heterogeneous Mutation-Selection Codon Substitution Model. Molecular Biology and Evolution. 2017;34(1):204–214. pmid:27744408
- 31. Foley NM, Mason VC, Harris AJ, Bredemeyer KR, Damas J, Lewin HA, et al. A Genomic Timescale for Placental Mammal Evolution. Science. 2023;380(6643):eabl8189. pmid:37104581
- 32. Howe KL, Achuthan P, Allen J, Allen J, Alvarez-Jarreta J, Amode MR, et al. Ensembl 2021. Nucleic Acids Research. 2021;49(D1):D884–D891. pmid:33137190
- 33. Ranwez V, Delsuc F, Ranwez S, Belkhir K, Tilak MK, Douzery EJ. OrthoMaM: A Database of Orthologous Genomic Markers for Placental Mammal Phylogenetics. BMC Evolutionary Biology. 2007;7(1):1–12. pmid:18053139
- 34. Scornavacca C, Belkhir K, Lopez J, Dernat R, Delsuc F, Douzery EJP, et al. OrthoMaM V10: Scaling-up Orthologous Coding Sequence and Exon Alignments with More than One Hundred Mammalian Genomes. Molecular Biology and Evolution. 2019;36(4):861–862. pmid:30698751
- 35. Rodrigue N, Philippe H, Lartillot N. Mutation-Selection Models of Coding Sequence Evolution with Site-Heterogeneous Amino Acid Fitness Profiles. Proceedings of the National Academy of Sciences of the United States of America. 2010;107(10):4629–34. pmid:20176949
- 36. Landrum MJ, Lee JM, Benson M, Brown GR, Chao C, Chitipiralla S, et al. ClinVar: Improving Access to Variant Interpretations and Supporting Evidence. Nucleic Acids Research. 2018;46(D1):D1062–D1067. pmid:29165669
- 37. Grimm DG, Azencott CA, Aicheler F, Gieraths U, MacArthur DG, Samocha KE, et al. The Evaluation of Tools Used to Predict the Impact of Missense Variants Is Hindered by Two Types of Circularity. Human Mutation. 2015;36(5):513–523. pmid:25684150
- 38. Sullivan PF, Meadows JRS, Gazal S, Phan BN, Li X, Genereux DP, et al. Leveraging Base-Pair Mammalian Constraint to Understand Genetic Variation and Human Disease. Science. 2023;380(6643):eabn2937. pmid:37104612
- 39. Eyre-Walker A, Woolfit M, Phelps T. The Distribution of Fitness Effects of New Deleterious Amino Acid Mutations in Humans. Genetics. 2006;173(2):891–900. pmid:16547091
- 40. Eyre-Walker A, Keightley PD. Estimating the Rate of Adaptive Molecular Evolution in the Presence of Slightly Deleterious Mutations and Population Size Change. Molecular Biology and Evolution. 2009;26(9):2097–2108. pmid:19535738
- 41. Galtier N. Adaptive Protein Evolution in Animals and the Effective Population Size Hypothesis. PLoS Genetics. 2016;12(1):e1005774. pmid:26752180
- 42. Tataru P, Mollion M, Glémin S, Bataillon T. Inference of Distribution of Fitness Effects and Proportion of Adaptive Substitutions from Polymorphism Data. Genetics. 2017;207(3):1103–1119. pmid:28951530
- 43. Bergeron LA, Besenbacher S, Zheng J, Li P, Bertelsen MF, Quintard B, et al. Evolution of the Germline Mutation Rate across Vertebrates. Nature. 2023; p. 1–7. pmid:36859541
- 44. Orlando L, Ginolhac A, Zhang G, Froese D, Albrechtsen A, Stiller M, et al. Recalibrating Equus Evolution Using the Genome Sequence of an Early Middle Pleistocene Horse. Nature. 2013;499(7456):74–78. pmid:23803765
- 45. Achaz G. Frequency Spectrum Neutrality Tests: One for All and All for One. Genetics. 2009;183(1):249–258. pmid:19546320
- 46. Latrille T, Rodrigue N, Lartillot N. Genes and Sites under Adaptation at the Phylogenetic Scale Also Exhibit Adaptation at the Population-Genetic Scale. Proceedings of the National Academy of Sciences of the United States of America. 2023;120(11):e2214977120. pmid:36897968
- 47. Ng PC, Henikoff S. SIFT: Predicting Amino Acid Changes That Affect Protein Function. Nucleic Acids Research. 2003;31(13):3812–3814. pmid:12824425
- 48. Vaser R, Adusumalli S, Leng SN, Sikic M, Ng PC. SIFT Missense Predictions for Genomes. Nature Protocols. 2016;11(1):1–9. pmid:26633127
- 49. Keightley PD, Eyre-Walker A. What Can We Learn about the Distribution of Fitness Effects of New Mutations from DNA Sequence Data? Philosophical Transactions of the Royal Society B: Biological Sciences. 2010;365(1544):1187–1193. pmid:20308093
- 50. Rice DP, Good BH, Desai MM. The Evolutionarily Stable Distribution of Fitness Effects. Genetics. 2015;200(1):321–329. pmid:25762525
- 51. Mustonen V, Lässig M. Fitness Flux and Ubiquity of Adaptive Evolution. Proceedings of the National Academy of Sciences. 2010;107(9):4248–4253. pmid:20145113
- 52. Eyre-Walker A, Eyre-Walker YC. How Much of the Variation in the Mutation Rate along the Human Genome Can Be Explained? G3: Genes, Genomes, Genetics. 2014;4(9):1667–1670. pmid:24996580
- 53. Mustonen V, Lässig M. Molecular Evolution under Fitness Fluctuations. Physical Review Letters. 2008;100(10):108101. pmid:18352233
- 54. Latrille T, Lanore V, Lartillot N. Inferring Long-Term Effective Population Size with Mutation–Selection Models. Molecular Biology and Evolution. 2021;38(10):4573–4587. pmid:34191010
- 55. Huber CD, Kim BY, Marsden CD, Lohmueller KE. Determining the Factors Driving Selective Effects of New Nonsynonymous Mutations. Proceedings of the National Academy of Sciences. 2017;114(17):4465–4470. pmid:28400513
- 56. Lanfear R, Kokko H, Eyre-Walker A. Population Size and the Rate of Evolution. Trends in Ecology and Evolution. 2014;29(1):33–41. pmid:24148292
- 57. Platt A, Weber CC, Liberles DA. Protein Evolution Depends on Multiple Distinct Population Size Parameters. BMC Evolutionary Biology. 2018;18(1):1–9. pmid:29422024
- 58. Starr TN, Thornton JW. Epistasis in Protein Evolution. Protein Science. 2016;25(7):1204–1218. pmid:26833806
- 59. Morcos F, Pagnani A, Lunt B, Bertolino A, Marks DS, Sander C, et al. Direct-Coupling Analysis of Residue Coevolution Captures Native Contacts across Many Protein Families. Proceedings of the National Academy of Sciences. 2011;108(49):E1293–E1301. pmid:22106262
- 60. Marks DS, Hopf TA, Sander C. Protein Structure Prediction from Sequence Variation. Nature Biotechnology. 2012;30(11):1072–1080. pmid:23138306
- 61. Goldstein RA, Pollard ST, Shah SD, Pollock DD. Nonadaptive Amino Acid Convergence Rates Decrease over Time. Molecular Biology and Evolution. 2015;32(6):1373–1381. pmid:25737491
- 62. Goldstein RA, Pollock DD. Sequence Entropy of Folding and the Absolute Rate of Amino Acid Substitutions. Nature Ecology & Evolution. 2017;1(12):1923–1930. pmid:29062121
- 63. Park Y, Metzger BPH, Thornton JW. Epistatic Drift Causes Gradual Decay of Predictability in Protein Evolution. Science. 2022;376(6595):823–830. pmid:35587978
- 64. Ashenberg O, Gong LI, Bloom JD. Mutational Effects on Stability Are Largely Conserved during Protein Evolution. Proceedings of the National Academy of Sciences of the United States of America. 2013;110(52):21071–21076. pmid:24324165
- 65. Doud MB, Ashenberg O, Bloom JD. Site-Specific Amino Acid Preferences Are Mostly Conserved in Two Closely Related Protein Homologs. Molecular Biology and Evolution. 2015;32(11):2944–2960. pmid:26226986
- 66. Bloom JD. Identification of Positive Selection in Genes Is Greatly Improved by Using Experimentally Informed Site-Specific Models. Biology Direct. 2017;12(1):1–24. pmid:28095902
- 67. Youssef N, Susko E, Bielawski JP. Consequences of Stability-Induced Epistasis for Substitution Rates. Molecular Biology and Evolution. 2020. pmid:32897316
- 68. Vigué L, Croce G, Petitjean M, Ruppé E, Tenaillon O, Weigt M. Deciphering Polymorphism in 61,157 Escherichia Coli Genomes via Epistatic Sequence Landscapes. Nature Communications. 2022;13(1):4030. pmid:35821377
- 69. Kimura M. Evolutionary Rate at the Molecular Level. Nature. 1968;217(5129):624–626. pmid:5637732
- 70. Gillespie JH. Substitution Processes in Molecular Evolution. III. Deleterious Alleles. Genetics. 1994;138(3):943–952. pmid:7851786
- 71. Jensen JD, Payseur BA, Stephan W, Aquadro CF, Lynch M, Charlesworth D, et al. The Importance of the Neutral Theory in 1968 and 50 Years on: A Response to Kern and Hahn 2018. Evolution. 2019;73(1):111–114. pmid:30460993
- 72. Galtier N. Half a Century of Controversy: The Neutralist/Selectionist Debate in Molecular Evolution. Genome Biology and Evolution. 2024;16(2):evae003. pmid:38311843
- 73. Galtier N, Duret L. Adaptation or Biased Gene Conversion? Extending the Null Hypothesis of Molecular Evolution. Trends in Genetics. 2007;23(6):273–277. pmid:17418442
- 74. Rousselle M, Laverré A, Figuet E, Nabholz B, Galtier N. Influence of Recombination and GC-biased Gene Conversion on the Adaptive and Nonadaptive Substitution Rate in Mammals versus Birds. Molecular Biology and Evolution. 2019;36(3):458–471. pmid:30590692
- 75. Joseph J. Increased Positive Selection in Highly Recombining Genes Does Not Necessarily Reflect an Evolutionary Advantage of Recombination. Molecular Biology and Evolution. 2024; p. msae107. pmid:38829800
- 76. Ohta T, Gillespie JH. Development of Neutral and Nearly Neutral Theories. Theoretical Population Biology. 1996;49(2):128–142. pmid:8813019
- 77. Spielman SJ, Wilke CO. The Relationship between dN/dS and Scaled Selection Coefficients. Molecular Biology and Evolution. 2015;32(4):1097–1108. pmid:25576365
- 78. Dos Reis M. How to Calculate the Non-Synonymous to Synonymous Rate Ratio of Protein-Coding Genes under the Fisher-Wright Mutation-Selection Framework. Biology Letters. 2015;11(4):20141031. pmid:25854546
- 79. Rodrigue N, Latrille T, Lartillot N. A Bayesian Mutation-Selection Framework for Detecting Site-Specific Adaptive Evolution in Protein-Coding Genes. Molecular Biology and Evolution. 2021;38(3):1199–1208. pmid:33045094
- 80. Tamuri AU, dos Reis M. A Mutation-Selection Model of Protein Evolution under Persistent Positive Selection. Molecular Biology and Evolution. 2021.
- 81. Kazmi SO, Rodrigue N. Detecting Amino Acid Preference Shifts with Codon-Level Mutation-Selection Mixture Models. BMC Evolutionary Biology. 2019;19(1):62. pmid:30808289
- 82. Stolyarova AV, Nabieva E, Ptushenko VV, Favorov AV, Popova AV, Neverov AD, et al. Senescence and Entrenchment in Evolution of Amino Acid Sites. Nature Communications. 2020;11(1):4603. pmid:32929079
- 83. Douzery EJP, Scornavacca C, Romiguier J, Belkhir K, Galtier N, Delsuc F, et al. OrthoMaM v8: A Database of Orthologous Exons and Coding Sequences for Comparative Genomics in Mammals. Molecular Biology and Evolution. 2014;31(7):1923–1928. pmid:24723423
- 84. Yang Z, Nielsen R. Mutation-Selection Models of Codon Substitution and Their Use to Estimate Selective Strengths on Codon Usage. Molecular Biology and Evolution. 2008;25(3):568–579. pmid:18178545
- 85. Wright S. Evolution in Mendelian Populations. Genetics. 1931;16(2):97–159. pmid:17246615
- 86. Fisher RA. The Genetical Theory of Natural Selection. The Clarendon Press; 1930.
- 87. Rodrigue N, Lartillot N, Philippe H. Bayesian Comparisons of Codon Substitution Models. Genetics. 2008;180(3):1579–1591. pmid:18791235
- 88. Lartillot N. A Bayesian Mixture Model for Across-Site Heterogeneities in the Amino-Acid Replacement Process. Molecular Biology and Evolution. 2004;21(6):1095–1109. pmid:15014145
- 89. Lartillot N. Inférence Probabiliste Pour La Phylogénie, La Génomique Comparative et Les Sciences de La Macro-Évolution; 2013.
- 90. Al Abri MA, Holl HM, Kalla SE, Sutter NB, Brooks SA. Whole Genome Detection of Sequence and Structural Polymorphism in Six Diverse Horses. PLoS ONE. 2020;15(4):e0230899. pmid:32271776
- 91. Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, et al. dbSNP: The NCBI Database of Genetic Variation. Nucleic Acids Research. 2001;29(1):308–311. pmid:11125122
- 92. Svardal H, Jasinska AJ, Apetrei C, Coppola G, Huang Y, Schmitt CA, et al. Ancient Hybridization and Strong Adaptation to Viruses across African Vervet Monkey Populations. Nature Genetics. 2017;49(12):1705–1713. pmid:29083404
- 93. Zheng-Bradley X, Streeter I, Fairley S, Richardson D, Clarke L, Flicek P, et al. Alignment of 1000 Genomes Project Reads to Reference Assembly GRCh38. GigaScience. 2017;6(7):gix038. pmid:28531267
- 94. Keightley PD, Jackson BC. Inferring the Probability of the Derived vs the Ancestral Allelic State at a Polymorphic Site. Genetics. 2018;209(3):897–906. pmid:29769282
- 95. Ashkenazy H, Penn O, Doron-Faigenboim A, Cohen O, Cannarozzi G, Zomer O, et al. FastML: A Web Server for Probabilistic Reconstruction of Ancestral Sequences. Nucleic Acids Research. 2012;40(W1):W580–W584. pmid:22661579
- 96. Tataru P, Bataillon T. polyDFE: Inferring the Distribution of Fitness Effects and Properties of Beneficial Mutations from Polymorphism Data. In: Methods in Molecular Biology. vol. 2090. Humana Press Inc.; 2020. p. 125–146.
- 97. Orme D, Freckleton R, Thomas G, Petzoldt T, Fritz S, Isaac N, et al. The Caper Package: Comparative Analysis of Phylogenetics and Evolution in R. R package version. 2013;5(2):1–36.
- 98. Kumar S, Stecher G, Suleski M, Hedges SB. TimeTree: A Resource for Timelines, Timetrees, and Divergence Times. Molecular Biology and Evolution. 2017;34(7):1812–1819. pmid:28387841