
Estimating evolutionary and demographic parameters via ARG-derived IBD

  • Zhendong Huang,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    ¤ Current address: School of Science, RMIT University, Melbourne, Victoria, Australia

    Affiliation Melbourne Integrative Genomics, School of Mathematics & Statistics, University of Melbourne, Victoria, Australia

  • Jerome Kelleher,

    Roles Funding acquisition, Methodology, Writing – review & editing

    Affiliation Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, United Kingdom

  • Yao-ban Chan,

    Roles Conceptualization, Funding acquisition, Investigation, Methodology, Project administration, Supervision, Writing – review & editing

    yaoban@unimelb.edu.au

    Affiliation Melbourne Integrative Genomics, School of Mathematics & Statistics, University of Melbourne, Victoria, Australia

  • David Balding

    Roles Conceptualization, Funding acquisition, Investigation, Methodology, Project administration, Supervision, Writing – original draft, Writing – review & editing

    Affiliation Melbourne Integrative Genomics, School of Mathematics & Statistics, University of Melbourne, Victoria, Australia

Abstract

Inference of evolutionary and demographic parameters from a sample of genome sequences often proceeds by first inferring identical-by-descent (IBD) genome segments. By exploiting efficient data encoding based on the ancestral recombination graph (ARG), we obtain three major advantages over current approaches: (i) no need to impose a length threshold on IBD segments, (ii) IBD can be defined without the hard-to-verify requirement of no recombination, and (iii) computation time can be reduced with little loss of statistical efficiency using only the IBD segments from a set of sequence pairs that scales linearly with sample size. We first demonstrate powerful inferences when true IBD information is available from simulated data. For IBD inferred from real data, we propose an approximate Bayesian computation inference algorithm and use it to show that even poorly-inferred short IBD segments can improve estimation. Our mutation-rate estimator achieves precision similar to a previously-published method despite a 4 000-fold reduction in data used for inference, and we identify significant differences between human populations. Computational cost limits model complexity in our approach, but we are able to incorporate unknown nuisance parameters and model misspecification, still finding improved parameter inference.

Author summary

Samples of genome sequences can be informative about the history of the population from which they were drawn, and about mutation and other processes that led to the observed sequences. However, obtaining reliable inferences is challenging, because of the complexity of the underlying processes and the large amounts of sequence data that are often now available. A common approach to simplifying the data is to use only genome segments that are very similar between two sequences, called identical-by-descent (IBD). The longer the IBD segment the more informative it is about recent shared ancestry, and current approaches restrict attention to IBD segments above a length threshold. We instead are able to use IBD segments of any length, allowing us to extract much more information from the sequence data. To reduce the computational burden we identify subsets of the available sequence pairs that lead to little information loss. Our approach exploits recent advances in inferring the genealogical history underlying the sample of sequences. Computational cost still limits the size and complexity of problems our method can handle, but where feasible we obtain dramatic improvements in the power of inferences.

Introduction

Multiple techniques have been developed for inference of evolutionary and demographic parameters such as the mutation rate or effective population sizes. Methods based on the sequential Markov coalescent (SMC) model [1–5] are typically likelihood-based, and hence statistically efficient but computationally demanding, which restricts the sample sizes that can be handled. Other approaches can handle large sample sizes by using the allele frequency spectrum (AFS) [6–9], which reduces the genome sequences to counts of sites with each allele frequency. However, this data reduction loses statistical power, particularly for small sample sizes.

Another common approach to the analysis of genome sequence data is to first extract the lengths of genome segments that are inferred to be identical-by-descent (IBD) [10–14]. In practice, IBD is often identified by searching for regions with no evidence for recombination along two sequences since their most recent common ancestor (MRCA), which is problematic because recombinations can be hard to detect or even unobservable. Further, only IBD segments (IBDs) above a given length threshold, often 2 to 4 cM, are retained. This practice wastes valuable information, but has been necessary because the inference of short IBDs is too noisy to be useful for downstream analyses.

The ancestral recombination graph (ARG) is widely used to represent the genealogical history of a sample [15–17] and recent developments in inferring aspects of the ARG [18–22] now permit us to rapidly extract IBD directly from inferred shared ancestors [23], without requiring zero recombination. Further, ARG inference and IBD extraction are now fast enough to be implemented within an approximate Bayesian computation (ABC) algorithm [24]. Because ABC is simulation-based, requiring no knowledge of the likelihood function, even short inferred IBDs can contribute to inference, removing the need for an information-wasteful length threshold. Instead, we reduce computational cost by using an efficient subset of IBDs that scales linearly with sample size, resulting in little information loss relative to using all IBDs.

Our approach relies on a data structure encoding features of an ARG underlying a sample of genome sequences, called the succinct tree sequence (TS) [25, 26]. The TS minimises redundant storage of subsequences that are similar due to shared ancestry. It has led to spectacular improvements in storage and simulation of large genome datasets [27], and has recently been applied to IBD-based inferences about demographic history and evolutionary parameters [28].

We first demonstrate powerful inferences of mutation and sequencing error rates, and past and present population sizes, given true IBD information in simulation studies. For real datasets, we propose TSABC: ABC with statistics computed from IBDs extracted from an inferred TS. Fig 1 summarises key features of TSABC and other methods. We demonstrate the performance of TSABC with inferences of the mutation rate and population size in simulation studies and real data, and we compare mutation rate estimates with previously-published results and with analyses using a range of IBD length thresholds.

Fig 1. A sketch of different approaches to inference of evolutionary and demographic parameters from samples of genome sequences.

https://doi.org/10.1371/journal.pgen.1011537.g001

We find that using IBDs extracted from an inferred ARG leads to a surprisingly small loss of precision in TSABC relative to use of true IBDs. Further, TSABC performs best with no IBD length threshold: even a low threshold on IBD length reduces the quality of inferences, despite the fact that short IBDs are poorly inferred. TSABC is computationally demanding, which limits the size and complexity of inference problems that can be tackled. However, TSABC can achieve comparable results to previous estimators while using much smaller data sets: we show similar precision to a previously-published mutation-rate estimator despite a 4 000-fold reduction in data available for inference (400-fold smaller sample size and 10-fold smaller genome length).

Description of the method

Definition and notations

The TS encodes genome sequence data efficiently by storing subsequences that are similar as variations of an inferred ancestral sequence. It comprises four components: the set {1, …, m} of leaf (or tip) nodes corresponding to m observed sequences each of length ℓ; the set P = {m+1, …, n} of internal (ancestral) nodes of the TS, ordered backwards in time from the present; a set E of edges; and a set of sequence differences. The jth edge (cj, pj, lj, rj) in E represents inheritance of sites in the segment [lj, rj], with 1 ≤ lj ≤ rj, from internal node pj ∈ P to its child cj ∈ {1, …, pj−1}, while the jth element of the set of sequence differences is a pair (cj, sj) recording the set of sites sj at which there is a sequence difference between node cj and its parent, due either to a mutation or, if cj is a leaf node, sequencing error. The TS has the “succinct” property that any tree component conserved over a genome segment is stored only once, which greatly reduces data storage requirements compared with retaining all distinct marginal trees.

Identity by descent and efficient subsets

We denote the ith IBD segment, i = 1, …, I, in the TS by IBDi = (ci1, ci2, li, ri, pi, Mi), ordered such that ci1 is non-decreasing in i. Here ci1 and ci2 are the leaf nodes of the two sequences, [li, ri] is the IBD genome segment, pi is the MRCA node of ci1 and ci2 for this segment, and Mi denotes the set of sites in [li, ri] at which ci1 and ci2 differ. We write gi for the TMRCA (time since MRCA) of ci1 and ci2, which is the age of pi, in generations. As there is no length threshold, the IBDs of any sequence pair partition the genome: every sequence site is included in exactly one of the IBD segments.

Each IBDi has the same MRCA at each site in [li, ri], and a different MRCA at adjacent sites. Imposing a no-recombination requirement as part of the definition of IBD would be more restrictive, since the absence of recombination implies a common MRCA but the reverse does not hold (see Fig 2, left, for examples of recombinations that do not change the MRCA).
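The MRCA-based definition can be illustrated with a minimal sketch in plain Python. The edge list below is a hypothetical toy tree sequence, loosely modelled on the description of Fig 2 (the exact topology is an assumption), and the helper names `mrca_at` and `ibd_segments` are illustrative rather than part of any library. For one sequence pair, the code finds the MRCA in each local tree and merges adjacent intervals that share the same MRCA, so the resulting segments partition the genome.

```python
# Hypothetical toy tree sequence (topology assumed, loosely following Fig 2).
# Each edge is (child, parent, left, right) with inclusive site coordinates;
# ancestral nodes 5..8 are numbered from youngest to oldest.
EDGES = [
    (3, 5, 1, 100), (4, 5, 1, 100),
    (1, 6, 1, 42), (2, 6, 1, 42), (5, 8, 1, 42), (6, 8, 1, 42),
    (2, 7, 43, 100), (5, 7, 43, 100), (1, 8, 43, 100), (7, 8, 43, 100),
]


def mrca_at(edges, a, b, site):
    """Return the MRCA node of leaves a and b in the local tree at `site`."""
    parent = {c: p for c, p, left, right in edges if left <= site <= right}
    ancestors = {a}
    node = a
    while node in parent:          # climb from a to the root
        node = parent[node]
        ancestors.add(node)
    node = b
    while node not in ancestors:   # climb from b until the paths meet
        node = parent[node]
    return node


def ibd_segments(edges, a, b):
    """Partition the genome into maximal segments with a single MRCA for (a, b)."""
    points = sorted({e[2] for e in edges} | {e[3] + 1 for e in edges})
    segments = []
    for left, next_left in zip(points, points[1:]):
        right = next_left - 1
        node = mrca_at(edges, a, b, left)
        if segments and segments[-1][2] == node:
            # Same MRCA as the previous interval: extend that segment.
            segments[-1] = (segments[-1][0], right, node)
        else:
            segments.append((left, right, node))
    return segments


print(ibd_segments(EDGES, 1, 2))  # two segments, split at the recombination
print(ibd_segments(EDGES, 3, 4))  # one segment spanning the whole sequence
```

In this toy example the local tree changes at site 43 for every pair, but only pairs whose MRCA changes there are split into two segments, which is the distinction drawn above between a no-recombination definition and a common-MRCA definition.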

Fig 2. An ancestral recombination graph (ARG) spanning a genome sequence of length ℓ = 100 (left), the corresponding sequence of local trees (middle) and efficient IBD subset (right).

The ARG has leaf nodes {1, 2, 3, 4}, named ancestral nodes {5, 6, 7, 8} = P, and a recombination at site 42 (red node). The two dashed lines in the ARG represent inheritance paths due to two unobservable recombination events, which are not represented in the TS. The efficient IBD subset includes two IBD segments for the node pair (1, 2), corresponding to intervals [1, 42] and [43, 100] which have MRCA 6 and 8, respectively, and one IBD segment spanning the whole sequence for pairs (3, 4) and (4, 1). The sequence pairs (1, 3) and (2, 4) are not included in the efficient subset.

https://doi.org/10.1371/journal.pgen.1011537.g002

To reduce computational effort, we use for inference only an “efficient” subset of IBDs. After fixing an arbitrary order for the sequences, we include in the subset only the IBDs of the sequence pairs (1, m) and (c, c+1) for c = 1, …, m−1 (see Fig 2, right, and S1 Text). An efficient subset has the property that each edge of the TS is included in a descent path from the MRCA for at least one IBD segment in the subset, which ensures that information is retained in the subset about every mutation.
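As a small illustration of the pair selection just described, the following sketch lists the sequence pairs of an efficient subset and compares their number with the number of all pairs. The function names are hypothetical, and sequences are labelled 1, …, m as in the text.

```python
def efficient_pairs(m):
    """Pairs (c, c+1) for c = 1..m-1, plus the closing pair (1, m)."""
    if m < 2:
        return []
    pairs = [(c, c + 1) for c in range(1, m)]
    if m > 2:  # for m = 2 the closing pair would duplicate (1, 2)
        pairs.append((1, m))
    return pairs


def all_pairs_count(m):
    """Number of unordered sequence pairs, m(m-1)/2."""
    return m * (m - 1) // 2


print(efficient_pairs(4))
print(len(efficient_pairs(160)), all_pairs_count(160))
```

The subset grows linearly with m while the full set of pairs grows quadratically, which is the source of the computational saving.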

Imposing a length threshold on IBDs is also a form of data reduction but we show below that it can lead to high information loss, because mutations are ignored if they occur at sites not contained in a sufficiently long IBD segment.

Estimation

Let μ and ϵ be the per-site per-generation mutation rate and the per-site sequencing error rate, both assumed constant over sites. Let N(g), g = 0, 1, 2, …, be the haploid population size g generations in the past. In S2 Text we derive method-of-moments estimators for μ and ϵ, and non-parametric estimators of N(g), g ≥ 0, based on statistics computed from IBDs. We investigate the performance of these estimators when true IBD information is available in simulation studies.
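To give a flavour of moment-based estimation from IBDs, the sketch below uses a deliberately simplified estimator that is not the one derived in S2 Text. It assumes that the TMRCA gi of every IBD segment is known, that there is no sequencing error, and that no site mutates more than once. Under those assumptions the expected number of differences in a segment of length w with TMRCA g is 2μgw, because two lineages each accumulate mutations over g generations. The function name and the input layout are hypothetical.

```python
def simple_moment_mu(ibd_segments):
    """Simplified moment estimate of the mutation rate.

    ibd_segments: iterable of (left, right, tmrca_generations, n_differences),
    with inclusive site coordinates. Assumes known TMRCAs and no sequencing
    error; this is an illustration, not the estimator of S2 Text.
    """
    total_differences = 0
    total_exposure = 0.0
    for left, right, tmrca, n_differences in ibd_segments:
        length = right - left + 1
        total_differences += n_differences
        # Two lineages, each `tmrca` generations long, over `length` sites.
        total_exposure += 2.0 * tmrca * length
    return total_differences / total_exposure


segments = [
    (1, 1_000_000, 500, 13),
    (1_000_001, 3_000_000, 250, 13),
]
print(simple_moment_mu(segments))
```

Pooling counts and exposure across segments before dividing gives long, old segments more weight than short, recent ones, which is one reasonable choice among several.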

For observed sequence data, true IBD information is not available and we extract IBDs from an inferred TS. TSABC uses summary statistics derived from these IBDs and related to the method-of-moments estimators. In brief, the usual TSABC workflow is:

  (i) from the observed m sequences, infer a TS;
  (ii) for each of the m sequence pairs in the efficient subset, use the inferred TS to partition the sequence into segments with the same MRCA (IBD segments);
  (iii) compute summary statistics from the IBD segments;
  (iv) perform η times:
    (a) simulate a parameter value from a specified prior distribution,
    (b) use it to simulate a dataset of m sequences,
    (c) apply steps (i) through (iii) to the simulated dataset,
    (d) compute a distance d between the summary statistics of the observed dataset and those of the simulated dataset;
  (v) the parameter values for the simulated datasets that have the smallest d values are retained as an approximate random sample from the posterior distribution.

See S1 Text for implementation details of the TSABC algorithm, including the linear adjustment [29] used to improve the posterior approximation in step (v). Unless otherwise stated, we used η = 2500 in step (iv) of which η/20 = 125 values are retained in step (v). Here we report only the mean of the retained parameter values, which estimates the posterior mean. Other properties of the posterior distribution can be approximated if desired.
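The rejection logic of steps (iv) and (v) can be sketched in a few lines of plain Python. This is only a schematic under strong assumptions: the function `simulate_statistic` is a hypothetical stand-in for the real pipeline (sequence simulation, TS inference and IBD summary statistics), the statistic is one-dimensional, and the linear adjustment mentioned above is omitted.

```python
import math
import random


def simulate_statistic(mu, rng, exposure=1e11):
    """Stand-in simulator: a noisy mutation count with mean mu * exposure.

    In TSABC this step would simulate sequences, infer a TS and compute
    IBD-based summary statistics; here it is replaced by a toy model.
    """
    mean = mu * exposure
    return rng.gauss(mean, math.sqrt(mean))


def abc_rejection(observed_stat, eta=2500, keep_divisor=20, seed=1):
    """Keep the eta / keep_divisor prior draws whose statistic is closest."""
    rng = random.Random(seed)
    draws = []
    for _ in range(eta):
        mu = rng.uniform(1e-8, 2e-8)             # (a) draw from the prior
        stat = simulate_statistic(mu, rng)       # (b)-(c) simulate, summarise
        draws.append((abs(stat - observed_stat), mu))  # (d) distance
    draws.sort()
    return [mu for _, mu in draws[: eta // keep_divisor]]  # (v) retain closest


retained = abc_rejection(observed_stat=1300.0)
posterior_mean = sum(retained) / len(retained)
print(len(retained), posterior_mean)
```

With η = 2500 and one draw in 20 retained, 125 values are kept, and their mean serves as the point estimate in the same way as the posterior mean reported in the text.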

We use tsinfer [19] in step (i); speed is critical for an ABC algorithm, and tsinfer is the fastest of the current methods, while retaining high accuracy [22, 30]. For inference of μ and ϵ, we use a pair of statistics that includes C1 (S2 Text); both are constructed from the method-of-moments estimators of μ and ϵ. Nonparametric estimation of N(g) is not feasible, but we can estimate the parameters of a demographic model, which allows powerful inference provided that the model is adequate. We use as statistics in the ABC algorithm the mean and standard deviation (SD) of IBD lengths ri − li, i = 1, …, I.
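The length-based statistics are simple to compute once the IBD segments are available. The sketch below is illustrative only: the function name is hypothetical, the sample SD (n − 1 denominator) is an assumption, and the optional `min_length` argument is included purely to show where a length threshold would enter; TSABC as described uses no threshold.

```python
import statistics


def ibd_length_stats(segments, min_length=0):
    """Mean and SD of IBD lengths r - l for segments given as (left, right).

    min_length = 0 means that no length threshold is applied.
    """
    lengths = [right - left for left, right in segments
               if right - left >= min_length]
    return statistics.mean(lengths), statistics.stdev(lengths)


segments = [(1, 11), (11, 31), (31, 61)]
print(ibd_length_stats(segments))
```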

The recombination rate r is assumed constant over sites and known for analyses of simulated data, whereas a previously-published human recombination map is assumed for the real-data analysis. A recombination at site s means between sites s and s + 1.

Verification and comparison

Simulation study: True IBD available

We used msprime [32] to simulate TS under the coalescent with recombination [33, 34], assuming demographic models C, Ga and S (Table 1). Sequencing error was simulated by adding elements to the set of sequence differences at the TS leaf nodes. At the largest error rate (ϵ = 10⁻³), when μ = 1.3×10⁻⁸ any singleton variant is a few times more likely to arise from sequencing error than from a mutation. For each simulated dataset we estimated μ, ϵ, and N(g), g ≥ 0, using our novel estimators (S2 Text).

Table 1. Parameter values, sample properties and demographic models for the simulation study.

Unless otherwise stated, 25 simulation replicates were generated in each scenario. Model Ga is used for inferences given true IBD and Model Gb is used for inferences from inferred IBD. The value of r is assumed known for all inferences, whereas μ, ϵ and N(g), g ≥ 0, are targets of inference.

https://doi.org/10.1371/journal.pgen.1011537.t001

Both μ and ϵ are well estimated in all demographic models, with no indication of bias (Fig 3). Increasing m has only a modest effect on the SD of estimators, whereas ℓ has a larger effect (SD scales with 1/√ℓ). Sequencing errors only inflate the number of singleton variants, so the estimate of μ is little affected by increasing ϵ (Fig 3, bottom left).

Fig 3. Inference of mutation rate μ and sequencing error rate ϵ with two sequence lengths (columns), when true IBD was available for inference.

Line segments show indicative 95% CIs computed from the average estimate (indicated by a symbol, see legend box) and the empirical SD of the estimates from 25 simulated datasets in each scenario. Bottom left panel shows the impact of ϵ on the estimate of μ when m = 10; see S2 Fig (right) for corresponding results when ℓ = 10⁸. For the other three panels ϵ = 10⁻³.

https://doi.org/10.1371/journal.pgen.1011537.g003

While use of the efficient subset of IBDs reduces computational cost in proportion to the reduction in sequence pairs from m(m−1)/2 to m, the average estimated SD of the estimate of μ in our Model C simulations increased only slightly, from 0.017 to 0.019 units of 10⁻⁸ (see S2 Fig, left, for confidence intervals (CI)). This gain in computation time is typically worth the small loss of statistical efficiency.

The population size estimator is accurate under all models, at least for g ≤ 5 × 10⁴ (Fig 4). These inferences depend on the empirical densities derived from the estimates of gi, i = 1, …, I, which are close to the theoretical densities (S4 Fig) despite imprecision of the individual estimates and the input TS only including information about the order of the coalescence events, and not their times.

Fig 4. Estimates of the population size N(g), g ≥ 0, from each of 25 simulation replicates under Model C, Model Ga and Model S, when true IBD was available for inference.

Sequence length is ℓ = 10⁸, sequencing error rate is ϵ = 10⁻³ and sample size is m = 80. See S3 Fig for corresponding results when ℓ = 10⁷.

https://doi.org/10.1371/journal.pgen.1011537.g004

Simulation study: Inferred IBD

We used msprime to simulate sequences, recoded them as binary strings with 0 denoting the ancestral allele, and added sequencing errors by assigning 1 to randomly selected sites at rate ϵ (see S3 Text and [35] for alternative models of sequencing error).
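The error model described above is easy to state in code. The following sketch assumes sequences are held as Python strings of "0" and "1" characters; the function name and the seeding convention are hypothetical. Each site is independently set to 1 with probability ϵ, so an error can turn an ancestral allele into an apparent derived allele but never the reverse.

```python
import random


def add_sequencing_errors(sequences, eps, seed=0):
    """Set each site to '1' independently with probability eps."""
    rng = random.Random(seed)
    noisy = []
    for seq in sequences:
        noisy.append("".join("1" if rng.random() < eps else allele
                             for allele in seq))
    return noisy


clean = ["0101", "0000", "1100"]
print(add_sequencing_errors(clean, eps=0.25))
```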

We first use simulations to confirm a previous report [36] that the quality of IBD inference is often poor, particularly for short IBDs. We compared the number of true and inferred IBDs for datasets simulated under Model C with μ ranging from 1 to 20 units of 10−8 per site per generation and m = 10, 20 and 160. We also compared the length distribution of true and inferred IBDs for m = 160 and μ = 1.3×10−8.

The number of inferred IBDs tends to increase with both μ and m, but except for very high μ (over 10 times the average human value when m = 160) it remains well below the true number of IBDs (Fig 5, left). Correspondingly, the length distribution of inferred IBDs is highly skewed towards larger values relative to the true distribution (Fig 5, right), as previously reported [36].

Fig 5. Comparison of true and inferred IBDs.

Left: each symbol and vertical line segment show the mean and 95% CI of the mean ratio of IBD counts over 25 Model C simulations with sample sizes m = 10, 20 and 160. The human mutation rate is close to the left endpoints of the curves. Right: histograms of true and inferred IBD length distributions for a Model C simulated dataset with m = 160 and sequence length ℓ = 10⁶.

https://doi.org/10.1371/journal.pgen.1011537.g005

This poor inference of IBDs has motivated the widespread use of a length threshold to exclude short IBDs. We investigated its effect for the Model C simulations with m = 10 and μ = 1.3×10⁻⁸, modifying TSABC to include only IBDs longer than a threshold of 1, 2 or 4 units of 10⁴ bp. These thresholds are two orders of magnitude shorter than those typically used in practice, so our results are likely to greatly understate any actual information loss from thresholding. When a threshold was applied, we included all IBDs satisfying the threshold, rather than using only the efficient subset of IBDs. For comparison, we also applied TSABC to estimate μ using IBD extracted from the true TS generated in simulations. Here, the TSABC workflow uses true rather than inferred IBD in step (ii) above, and hence also in (c), so that the data simulation within ABC mimics the generation process for the data treated as observed.

Table 2 shows that each decrease in the length threshold of IBDs used for inference of μ increased the resulting precision, both for inferred and true IBD, so that even poorly-inferred short IBDs improve TSABC inference. We also see in Table 2 (final column) further evidence that use of the efficient subset of IBDs leads to only a small loss of statistical efficiency. As expected, the use of true IBD improves TSABC compared with using inferred IBD, but the magnitude of the improvement is modest for low or zero threshold. For higher thresholds, bias can be high due to low precision of inference and the prior boundary at 10−8.

Table 2. Comparison of TSABC inference for μ using different IBD length thresholds.

Each result is an average over 25 Model C simulation replicates with m = 10, ϵ = 0, ℓ = 10⁷ and μ = 1.3×10⁻⁸. In the last column, values based only on IBDs in the efficient subset are given in parentheses. See S5 Fig for a plot of CIs.

https://doi.org/10.1371/journal.pgen.1011537.t002

We next investigated TSABC estimation of μ under Model C and Model Gb with μ set to 0.25, 1 and 4 times the human value of 1.3×10⁻⁸, approximately spanning the range of μ in vertebrates [37], and ℓ set to 4, 1 and 0.25 times 10⁷. The expected number of mutations is the same in all scenarios, but when μ is higher more of the mutations arise as multiple hits at the same site, while lower ℓ means fewer recombinations. The N(g) values and ϵ = 0 were assumed known, and the prior distribution for μ when the true value was 1.3×10⁻⁸ was Uniform(10⁻⁸, 2×10⁻⁸), with the endpoints of the uniform prior changing in proportion when μ was fourfold lower or fourfold higher.

Bias in TSABC estimation appears to be negligible for all μ values (Table 3). The loss of precision (increased SD) as μ increases and ℓ decreases is due both to fewer recombinations, which reduces precision through less independence along the genome, and more sites with multiple mutations, which are less informative for inference than if the mutations had occurred at distinct sites.

Table 3. TSABC estimation of mutation rate μ.

The expected number of mutations is the same in each scenario (μ × ℓ is constant). Values are averages over 25 simulations with no sequencing error (ϵ = 0).

https://doi.org/10.1371/journal.pgen.1011537.t003

To study TSABC estimation of the population size N(g), we used m = 200 and ℓ = 10⁶ under each of Model C and Model Gb. For both data simulation models, the TSABC inference used Model G but with different prior distributions. When the simulation used Model C, we fitted Model G with independent priors Uniform(10⁴, 3×10⁴) for N(0) and Uniform(−2×10⁻⁵, 2×10⁻⁵) for τ. Whenever τ < 0, we impose a population size limit N(g) ≤ 2N(0). When the simulation used Model Gb, the independent priors were Uniform(10⁵, 3×10⁵) for N(0), and Uniform(0, 0.002) for τ. All parameters were treated as known except the targets of inference N(0) and τ.
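The fitted model can be sketched as follows. Table 1, which defines Model G, is not reproduced in this text, so the exponential form N(g) = N(0)·exp(−τg) used here is an assumption, chosen to be consistent with the growth rate τ, the exponential curves of Fig 6 and the cap applied when τ < 0. The function name is hypothetical.

```python
import math


def model_g_size(g, n0, tau):
    """Population size g generations ago under assumed exponential growth.

    Assumes N(g) = N(0) * exp(-tau * g). When tau < 0 the population is
    larger in the past, and the size is capped at 2 * N(0) as in the text.
    """
    size = n0 * math.exp(-tau * g)
    if tau < 0:
        size = min(size, 2 * n0)
    return size


print(model_g_size(1000, 200_000, 0.001))    # growth: smaller in the past
print(model_g_size(10**6, 20_000, -2e-5))    # decline: capped at 2 * N(0)
```

Under this form, a positive τ corresponds to a population that has grown towards the present, which matches the Model Gb priors that restrict τ to positive values.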

Results for parametric estimation of N(g) are shown in Fig 6. For Model C data simulations, the average estimate of N(0) (true value 20 000) over the 25 replicates is 20 931 with standard error (SE) , while for the growth rate τ (true value 0) the average estimate is 2.10 with SE , both in units of 10−6. With Model Gb data simulations, for N(0) (true value 200 000) we obtained 202 534 (SE 2173) while for τ (true value 1) we obtained 1.08 (SE 0.07) in units of 10−3.

Fig 6. Fitted exponential curves for the population size N(g) obtained using TSABC.

Each of the 25 curves corresponds to a dataset simulated under Model C (left) and Model Gb (right) with no sequencing error (ϵ = 0), sample size m = 200 and sequence length ℓ = 10⁶.

https://doi.org/10.1371/journal.pgen.1011537.g006

We performed additional simulations to allow the inferences of μ using the 3-way IBD method [31, 38] to be compared with Relate [20] and TSABC. Genomes consisting of 30 chromosomes were simulated under Model EA, which aims to capture key features of the demographic history of European-Americans (Table 1), and Model C.

The Model C simulations of [38] used ϵ = 10⁻⁴ but no gene conversion, while for Model EA [31] also included gene conversion with rate 2×10⁻⁸ per base pair per generation and mean tract length 300 bp. The data sets simulated for TSABC used the same sequencing error and gene conversion settings as [38] for Model C and [31] for Model EA. However, TSABC was challenged by not including gene conversion in the inference model and by treating both ϵ and N(g) as unknown when inferring μ, with a misspecified model for N(g). The simulated datasets analysed by TSABC and Relate had a 400-fold smaller sample size than [38] (m = 10 versus m = 4×10³) and 10-fold smaller genome length (ℓ = 10⁷ per chromosome, versus ℓ = 10⁸).

When the data were simulated under Model C, TSABC used independent priors Uniform(10⁻⁸, 2×10⁻⁸) for μ and Uniform(0.6×10⁻⁴, 1.6×10⁻⁴) for ϵ. For N(g), we adopted Model G with independent priors N(0) ∼ Uniform(1.4×10⁴, 3×10⁴) and τ ∼ Uniform(−2×10⁻⁵, 10⁻⁵). When the data were simulated under Model EA, TSABC used the same priors for μ and ϵ as in Model C. For inference of N(g), we adopted Model S with independent priors N(0) ∼ Uniform(1.1×10⁴, 1.5×10⁴), g* ∼ Uniform(4500, 6500) and N(g*) ∼ Uniform(4.6×10⁴, 5.0×10⁴).

For Relate, we only report results from datasets simulated with neither sequencing error nor gene conversion, because it performed poorly on datasets with these features. We re-formatted the simulated sequences as .vcf files for input to Relate. The software RelateMutationRate (see https://myersgroup.github.io/relate) was then used to find the average mutation rates, with mode parameter “AVG”, bins parameter (4, 7, 107), true values used for other evolutionary parameters, and default settings for other tuning parameters.

Table 4 shows that TSABC performs similarly to the 3-way IBD results reported by [31, 38] despite a 400-fold smaller sample size and a 10-fold reduction in sequence length, and despite the challenges we imposed on TSABC: sequencing error was treated as unknown and gene conversion was incorporated in data simulation but not the ABC inference model, which also misspecified the model for N(g). TSABC also provides more accurate inference for μ than Relate when analysing the same data, despite gene conversions and sequencing errors challenging TSABC but not Relate.

Table 4. Comparison of inference of μ (in units of 10−8, true value 1.3).

3-way IBD results are from [38] for Model C and [31] for Model EA. Relate and TSABC results are obtained from 25 simulated datasets under each model, with genomes consisting of 30 chromosomes each of length ℓ. The TSABC simulations included sequencing error and gene conversion with the same settings as [38] for Model C and [31] for Model EA. Relate performed poorly on those datasets and the reported results are for datasets simulated without sequencing error or gene conversion. For TSABC, (SE 0.03) units of 10⁻⁴ for Model C, and 1.03 (SE 0.03) for Model EA (true value 1).

https://doi.org/10.1371/journal.pgen.1011537.t004

Application: Mutation and growth rates in the 1000 Genomes Project

We analyse chromosomes 20 and 21 from 1 538 individuals in eight of the 26 human populations of the 1000 Genomes Project (1KGP) [39]. S4 Text gives details of the data analysis. Separately for each chromosome, we use TSABC to infer μ assuming the prior Uniform(10−8, 2×10−8), ϵ = 0 and the demographic model of [40], which we refer to as the 1KGP model (see S1 Fig for plots). The 16 sets of 125 retained values were analysed in a two-way ANOVA to assess differences in μ across chromosomes and over populations.

The global mean over both chromosomes is 1.27 (SE 0.03) units of 10−8 per site and per generation, which is close to the whole-genome estimate 1.24 (SE 0.04) based on three samples, of European (2) and African (1) ancestries, totalling about 8.7K individuals [31]. These authors reported no significant difference across their three populations whereas our two-way ANOVA (Table 5) revealed highly significant differences, with population-specific estimates ranging from 1.22 (BEB, CHB) to 1.36 (GBR). These differences may be due to differences in heritable factors, average age at reproduction [41] or environmental exposures such as mutagenic solar radiation [42]. These population-based estimates are higher than current pedigree-based estimates, which are typically between 1.1 and 1.2 units of 10−8 [43, 44], but lower than estimates based on inter-species comparisons [45], consistent with a decrease in μ over time. We found no significant difference overall between the two chromosomes, but there is a highly-significant between-chromosome difference in ITU.

Table 5. Estimation of the mutation rate μ per site per generation (in units of 10⁻⁸) on human chromosomes 20 and 21 for populations MSL (Mende in Sierra Leone), LWK (Luhya in Webuye, Kenya), BEB (Bengali from Bangladesh), ITU (Indian Telugu from the UK), FIN (Finnish in Finland), GBR (British in England and Scotland), JPT (Japanese in Tokyo, Japan), and CHB (Han Chinese in Beijing, China).

The TSABC analysis assumes the 1KGP demographic model in each population.

https://doi.org/10.1371/journal.pgen.1011537.t005

To estimate population size N(g), 0 ≤ g ≤ 1000 (approximately 27 000 years [46]), we assume demographic Model G constrained such that N(1000) matches the 1KGP model value. The constrained Model G has one free parameter N(0), for which we adopt a Uniform(10⁴, 24×10⁴) prior. To reduce computational effort with little loss of information, in both the observed dataset and TSABC simulations we removed SNPs with minor allele count > 40, which typically correspond to mutations at g ≫ 1000. We estimate N(g) from each chromosome separately and average the results.
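The SNP filter described above can be sketched as follows. The data layout is an assumption made for illustration: each SNP is a (position, alleles) pair, where alleles is a string of "0" and "1" characters with one character per sampled sequence. The function name is hypothetical.

```python
def filter_by_minor_allele_count(snps, max_mac=40):
    """Keep SNPs whose minor allele count is at most max_mac.

    snps: list of (position, alleles), alleles being a 0/1 string with one
    character per sampled sequence (an assumed layout for this sketch).
    """
    kept = []
    for position, alleles in snps:
        derived = alleles.count("1")
        minor_allele_count = min(derived, len(alleles) - derived)
        if minor_allele_count <= max_mac:
            kept.append((position, alleles))
    return kept


# Toy usage with six sequences and a small threshold for readability.
toy_snps = [(10, "100000"), (20, "111000"), (30, "111110")]
print(filter_by_minor_allele_count(toy_snps, max_mac=1))
```

The same filter has to be applied to the observed data and to every simulated dataset, so that the summary statistics remain comparable between them.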

Fig 7 shows positive growth in the past 1000 generations for all eight populations. CHB and BEB have the highest N(0), while MSL and LWK have the lowest N(0) despite having the highest values of N(1000). Previous studies have reported similar rapid growth in non-African populations for 0 ≤ g ≤ 1000 [2, 5, 20]. Estimates of N(g) can be sensitive to modelling assumptions and vary widely over studies, particularly for g < 400. However, ordering across populations is more stable. Our N(200) estimates are in the same order as [2] for the three populations (CHB, JPT, LWK) investigated in both papers, and in the same order as [20] for the four populations in common (CHB, JPT, FIN, GBR).

Fig 7. Estimates of recent population sizes for eight populations sampled in the 1000 Genomes Project.

In the legend box, populations are listed in order of decreasing N(0). See Table 5 caption for explanation of the population labels.

https://doi.org/10.1371/journal.pgen.1011537.g007

Discussion

We have shown that ARG-derived IBD combined with ABC can deliver advantages over previous IBD-based methods for inferring mutation rate and population sizes from a sample of genome-wide sequences. Despite verifying that IBD extracted from an inferred TS is often inaccurate, TSABC showed only a modest loss of efficiency relative to the corresponding inference based on true IBD. Note again that the data simulation within ABC mimics the data generation process, and so differs between inferred and true IBD. Simply inserting the true IBDs into TSABC, which assumes inferred IBDs, would be inappropriate and would generate poor results.

For simulated data, we found similar precision in estimating μ to previous results [31, 38] that used 4 000 times more data for inference. Similarly for real human data, we report better precision than [31] despite a five-fold smaller sample size and only chromosomes 20 and 21 rather than the whole genome. Further, we showed highly-significant differences in μ across human populations, but no overall significant difference across chromosomes, and our inferences of historic population sizes showed concordance with previous studies.

These advantages arise because from an ARG we can extract IBD defined in terms of a common MRCA, avoiding the problem of detecting recombinations, and we require only IBDs from m sequence pairs, rather than all m(m−1)/2 pairs, which reduces computational effort with little loss of statistical efficiency. The efficiency of ARG-based IBD inference is a key factor, allowing analyses of many simulated datasets, as required for ABC. Perhaps the most important advantage of TSABC derives from dispensing with the need for an IBD length threshold, allowing information to be extracted from the large number of short IBDs even though they are poorly inferred. A key to understanding why this is possible is that ABC inference does not require the summary statistics to be accurate estimators of anything, nor do we need to know any of their distributional properties: we need only that the distribution of the summary statistics is sensitive to the target parameters.
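The last point can be illustrated with a minimal rejection-ABC sketch on a toy model. This is not TSABC and does not use tsinfer or any ARG: the function names, the exponential "segment length" model, the distortion applied to the lengths and the uniform prior are all assumptions made purely for illustration. The summary statistic is deliberately a biased estimator of the true mean length, but because the same distortion is applied to the observed data and to every simulated dataset, the accepted parameter values should still concentrate near the true value.

```python
import random
import statistics

def distorted_summary(theta, n, rng):
    """Mean of n simulated 'segment lengths' after a deliberate distortion.

    True lengths are Exponential(rate=theta). The recorded lengths are shrunk,
    shifted and noised, standing in for a hypothetical imperfect inference
    step, so the summary is a biased estimator of the true mean 1/theta.
    """
    recorded = [0.5 * rng.expovariate(theta) + 0.1 + rng.gauss(0.0, 0.05)
                for _ in range(n)]
    return statistics.fmean(recorded)

def abc_rejection(s_obs, n, n_sims, keep, rng, prior=(0.5, 5.0)):
    """Return the `keep` prior draws whose simulated summary is closest to s_obs."""
    scored = []
    for _ in range(n_sims):
        theta = rng.uniform(*prior)               # draw from a uniform prior (assumed)
        s_sim = distorted_summary(theta, n, rng)  # simulation mimics the data pipeline
        scored.append((abs(s_sim - s_obs), theta))
    scored.sort()
    return [theta for _, theta in scored[:keep]]

rng = random.Random(1)
true_theta = 2.0
s_obs = distorted_summary(true_theta, 500, rng)   # "observed" data, also distorted
accepted = abc_rejection(s_obs, n=500, n_sims=2000, keep=40, rng=rng)
post_mean = statistics.fmean(accepted)
print(f"observed summary {s_obs:.3f} (true mean length is {1 / true_theta:.3f})")
print(f"ABC posterior mean of theta: {post_mean:.2f}")
```

In this toy setting the observed summary is far from the true mean length, yet it changes monotonically with theta, which is all that the rejection step relies on. Applying the distortion only to the simulations and not to the observed data would break this correspondence, which parallels the remark above about inserting true IBDs into an analysis that assumes inferred IBDs.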

Limitations of TSABC include those that apply to any ABC algorithm: the results are approximate and it can be difficult to estimate the approximation error. Another source of error comes from the use of tsinfer for TS inference: more accurate inference is possible, but alternative methods lack the computational efficiency of tsinfer. TSABC can be computationally demanding for complex demographic models, and the results presented here are limited to inferring one or two parameters. Although joint estimation of multiple parameters will be computationally challenging, they can be estimated iteratively, fixing some parameters while estimating others. TSABC is able to handle much larger datasets, both in terms of sample size and sequence length, than our previous likelihood-based approach [21]. Further, we were able to challenge TSABC inference with unknown nuisance parameters, such as the sequencing error rate, and with misspecification of the demographic model, without substantial detriment to inference quality.

In addition to ABC-based inference, we also derived method-of-moments estimators for ARG-derived IBD, and showed in a simulation study based on true IBD that these estimators could provide very powerful inferences given an accurately inferred ARG. As ARG inference methods improve in accuracy, these estimators may become directly useful on real data, but here we used them indirectly to develop the TSABC summary statistics.

Overall, our results open the way for more powerful evolutionary and demographic inferences from samples of genome sequences than have previously been available. Summary statistics based on IBD lengths represent one approach to extracting information from an ARG inferred from observed sequences. Future developments may be based on better summaries, or directly on the ARG, which presents challenges due to its complexity.

Supporting information

S1 Text. The efficient IBD subsets and ABC algorithms.

https://doi.org/10.1371/journal.pgen.1011537.s001

(PDF)

S2 Text. Derivation of estimators when TS is known.

https://doi.org/10.1371/journal.pgen.1011537.s002

(PDF)

S3 Text. Different models for sequencing error.

https://doi.org/10.1371/journal.pgen.1011537.s003

(PDF)

S4 Text. Further details for 1KGP data analysis.

https://doi.org/10.1371/journal.pgen.1011537.s004

(PDF)

S1 Fig. The 1KGP N(g) models.

Natural logarithms of g are shown on the x-axis, with the models starting at g = exp(6) ≈ 400 generations in the past. The values of N(1000) which form the right endpoints of Fig 7 correspond to x = log(1000) ≈ 6.9.

https://doi.org/10.1371/journal.pgen.1011537.s005

(TIF)

S2 Fig.

Left: Estimated 95% CIs for the estimation of μ when an efficient subset of IBD segments was extracted from the TS and when all IBDs were used. At each sample size, 25 replicate datasets were simulated under Model C, with sequence length 10⁷. Right: Impact of sequencing error rate ϵ when the sequence length is 10⁸ (other details are the same as for bottom left panel of Fig 3).

https://doi.org/10.1371/journal.pgen.1011537.s006

(TIF)

S3 Fig. Estimates of the population size N(g) when the sequence length is 10⁷ (other details are the same as for Fig 4).

https://doi.org/10.1371/journal.pgen.1011537.s007

(TIF)

S4 Fig. Histogram of the , i = 1, …, I, obtained from one sample simulated under each of Model C (left) and Model Ga (right), with sample size m = 80, sequence length 10⁸ and sequencing error rate ϵ = 10⁻³.

Also shown is a probability density obtained by kernel smoothing of the together with the true density. True IBD was available for inference but no time information.

https://doi.org/10.1371/journal.pgen.1011537.s008

(TIF)

S5 Fig. 95% CIs computed from the estimates and SE shown in Table 2.

https://doi.org/10.1371/journal.pgen.1011537.s009

(TIF)

References

  1. Li H, Durbin R. Inference of human population history from individual whole-genome sequences. Nature. 2011;475(7357):493–496. pmid:21753753
  2. Schiffels S, Durbin R. Inferring human population size and separation history from multiple genome sequences. Nature Genetics. 2014;46(8):919–925. pmid:24952747
  3. Druet T, Macleod I, Hayes B. Toward genomic prediction from whole-genome sequence data: impact of sequencing design on genotype imputation and accuracy of predictions. Heredity. 2014;112(1):39–47. pmid:23549338
  4. Terhorst J, Kamm J, Song Y. Robust and scalable inference of population history from hundreds of unphased whole genomes. Nature Genetics. 2017;49(2):303–309. pmid:28024154
  5. Upadhya G, Steinrücken M. Robust inference of population size histories from genomic sequencing data. PLOS Computational Biology. 2022;18(9):e1010419. pmid:36112715
  6. Gutenkunst R, Hernandez R, Williamson S, Bustamante C. Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data. PLOS Genetics. 2009;5(10):e1000695. pmid:19851460
  7. Excoffier L, Dupanloup I, Huerta-Sánchez E, Sousa V, Foll M. Robust demographic inference from genomic and SNP data. PLOS Genetics. 2013;9(10):e1003905. pmid:24204310
  8. Bhaskar A, Wang Y, Song Y. Efficient inference of population size histories and locus-specific mutation rates from large-sample genomic variation data. Genome Research. 2015;25(2):268–279. pmid:25564017
  9. Kamm J, Terhorst J, Durbin R, Song Y. Efficiently inferring the demographic history of many populations with allele count data. Journal of the American Statistical Association. 2020;115(531):1472–1487. pmid:33012903
  10. Browning S, Browning B. Identity by descent between distant relatives: detection and applications. Annual Review of Genetics. 2012;46:617–633. pmid:22994355
  11. Palamara P, Pe’er I. Inference of historical migration rates via haplotype sharing. Bioinformatics. 2013;29(13):i180–i188. pmid:23812983
  12. Sticca E, Belbin G, Gignoux C. Current developments in detection of identity-by-descent methods and applications. Frontiers in Genetics. 2021;12:1725. pmid:34567074
  13. Tang K, Naseri A, Wei Y, Zhang S, Zhi D. Open-source benchmarking of IBD segment detection methods for biobank-scale cohorts. GigaScience. 2022;11:giac111. pmid:36472573
  14. Chen H, Naseri A, Zhi D. FiMAP: A fast identity-by-descent mapping test for biobank-scale cohorts. PLOS Genetics. 2023;19(12):e1011057. pmid:38039339
  15. Griffiths R, Marjoram P. An ancestral recombination graph. In: Donnelly P, Tavare S, editors. IMA volume on Mathematical Population Genetics. New York: Springer-Verlag; 1997. p. 257–270.
  16. Lewanski A, Grundler M, Bradburd G. The era of the ARG: An introduction to ancestral recombination graphs and their significance in empirical evolutionary genomics. PLOS Genetics. 2024;20(1):e1011110. pmid:38236805
  17. Brandt D, Huber C, Chiang C, Ortega-Del Vecchyo D. The promise of inferring the past using the ancestral recombination graph. Genome Biology and Evolution. 2024;16(2):evae005. pmid:38242694
  18. Rasmussen M, Hubisz M, Gronau I, Siepel A. Genome-wide inference of ancestral recombination graphs. PLOS Genetics. 2014;10(5):e1004342. pmid:24831947
  19. Kelleher J, Wong Y, Wohns A, Fadil C, Albers P, McVean G. Inferring whole-genome histories in large population datasets. Nature Genetics. 2019;51:1330–1338. pmid:31477934
  20. Speidel L, Forest M, Shi S, Myers S. A method for genome-wide genealogy estimation for thousands of samples. Nature Genetics. 2019;51:1321–1329. pmid:31477933
  21. Mahmoudi A, Koskela J, Kelleher J, Chan Y, Balding D. Bayesian inference of ancestral recombination graphs. PLOS Computational Biology. 2022;18(3):e1009960. pmid:35263345
  22. Zhang B, Biddanda A, Gunnarsson Á, Cooper F, Palamara P. Biobank-scale inference of ancestral recombination graphs enables genealogical analysis of complex traits. Nature Genetics. 2023;55:768–776. pmid:37127670
  23. Yang S, Carmi S, Pe’er I. Rapidly registering identity-by-descent across ancestral recombination graphs. Journal of Computational Biology. 2016;23(6):495–507. pmid:27104872
  24. Sisson S, Fan Y, Beaumont M (Eds). Handbook of approximate Bayesian computation. Chapman and Hall/CRC; 2018.
  25. Kelleher J, Lohse K. Coalescent simulation with msprime. Statistical Population Genomics. 2020;986:191–230. pmid:31975169
  26. Wong Y, Ignatieva A, Koskela J, Gorjanc G, Wohns A, Kelleher J. A general and efficient representation of ancestral recombination graphs. Genetics. 2024;228. pmid:39013109
  27. Kelleher J, Etheridge A, McVean G. Efficient coalescent simulation and genealogical analysis for large sample sizes. PLOS Computational Biology. 2016;12(5):e1004842. pmid:27145223
  28. Silcocks M, Farlow A, Hermes A, Tsambos G, Patel H, Huebner S, et al. Indigenous Australian genomes show deep structure and rich novel variation. Nature. 2023;624(7992):593–601. pmid:38093005
  29. Beaumont M, Zhang W, Balding D. Approximate Bayesian computation in population genetics. Genetics. 2002;162(4):2025–2035. pmid:12524368
  30. Brandt D, Wei X, Deng Y, Vaughn A, Nielsen R. Evaluation of methods for estimating coalescence times using ancestral recombination graphs. Genetics. 2022;221(1):iyac044.
  31. Tian X, Cai R, Browning S. Estimating the genome-wide mutation rate from thousands of unrelated individuals. The American Journal of Human Genetics. 2022;109(12):2178–2184. pmid:36370709
  32. Baumdicker F, Bisschop G, Goldstein D, Gower G, Ragsdale A, Tsambos G, et al. Efficient ancestry and mutation simulation with msprime 1.0. Genetics. 2022;220(3):iyab229. pmid:34897427
  33. Griffiths R. Neutral two-locus multiple allele models with recombination. Theoretical Population Biology. 1981;19(2):169–186.
  34. Hudson R. Properties of a neutral allele model with intragenic recombination. Theoretical Population Biology. 1983;23(2):183–201. pmid:6612631
  35. Albers P, McVean G. Dating genomic variants and shared ancestry in population-scale sequencing data. PLOS Biology. 2020;18(1):e3000586. pmid:31951611
  36. Chiang C, Ralph P, Novembre J. Conflation of short identity-by-descent segments bias their inferred length distribution. G3: Genes, Genomes, Genetics. 2016;6(5):1287–1296. pmid:26935417
  37. Bergeron L, Besenbacher S, Zheng J, Li P, Frost Bertelsen M, Quintard B, et al. Evolution of the germline mutation rate across vertebrates. Nature. 2023;615:285–91. pmid:36859541
  38. Tian X, Browning B, Browning S. Estimating the genome-wide mutation rate with three-way identity by descent. The American Journal of Human Genetics. 2019;105(5):883–893. pmid:31587867
  39. Fairley S, Lowy-Gallego E, Perry E, Flicek P. The International Genome Sample Resource (IGSR) collection of open human genomic variation resources. Nucleic Acids Research. 2020;48(D1):D941–D947. pmid:31584097
  40. 1000 Genomes Project Consortium and others. A global reference for human genetic variation. Nature. 2015;526(7571):68.
  41. Kong A, Frigge M, Masson G, Besenbacher S, Sulem P, Magnusson G, et al. Rate of de novo mutations and the importance of father’s age to disease risk. Nature. 2012;488(7412):471–475. pmid:22914163
  42. Harris K. Evidence for recent, population-specific evolution of the human mutation rate. Proceedings of the National Academy of Sciences USA. 2015;112(11):3439–3444. pmid:25733855
  43. Roach J, Glusman G, Smit A, Huff C, Hubley R, Shannon P, et al. Analysis of genetic inheritance in a family quartet by whole-genome sequencing. Science. 2010;328(5978):636–639. pmid:20220176
  44. Conrad D, Keebler J, DePristo M, Lindsay S, Zhang Y, Cassals F, et al. Variation in genome-wide mutation rates within and between human families. Nature Genetics. 2011;43(7):712–714. pmid:21666693
  45. Scally A, Durbin R. Revising the human mutation rate: implications for understanding human evolution. Nature Reviews Genetics. 2012;13:745–753. pmid:22965354
  46. Wang R, Al-Saffar S, Rogers J, Hahn M. Human generation times across the past 250,000 years. Science Advances. 2023;9:eabm7047. pmid:36608127