GOOGA: A platform to synthesize mapping experiments and identify genomic structural diversity

Lex E. Flagel; Benjamin K. Blackman; Lila Fishman; Patrick J. Monnahan; Andrea Sweigart; John K. Kelly

doi:10.1371/journal.pcbi.1006949

Abstract

Understanding genomic structural variation such as inversions and translocations is a key challenge in evolutionary genetics. We develop a novel statistical approach to comparative genetic mapping to detect large-scale structural mutations from low-level sequencing data. The procedure, called Genome Order Optimization by Genetic Algorithm (GOOGA), couples a Hidden Markov Model with a Genetic Algorithm to analyze data from genetic mapping populations. We demonstrate the method using both simulated data (calibrated from experiments on Drosophila melanogaster) and real data from five distinct crosses within the flowering plant genus Mimulus. Application of GOOGA to the Mimulus data corrects numerous errors (misplaced sequences) in the M. guttatus reference genome and confirms or detects eight large inversions polymorphic within the species complex. Finally, we show how this method can be applied in genomic scans to improve the accuracy and resolution of Quantitative Trait Locus (QTL) mapping.

Author summary

Genome sequences are an essential resource for genetic research in many species. However, most species exhibit considerable variation in genomic organization, making a single reference sequence inadequate. This variation complicates quantitative trait mapping and population genomics. We introduce a new statistical method and computational tools that use linkage information to improve genome assembly and identify structural differences between individuals or populations. We first use the method to correct many assembly errors in the reference genome of Mimulus guttatus. Analyzing five crosses from the M. guttatus species complex, we detect eight large chromosomal inversions and improve the resolution of a trait mapping study. This work illustrates how genetic mapping can be applied to a greater diversity of species to address genetic and evolutionary questions.

Citation: Flagel LE, Blackman BK, Fishman L, Monnahan PJ, Sweigart A, Kelly JK (2019) GOOGA: A platform to synthesize mapping experiments and identify genomic structural diversity. PLoS Comput Biol 15(4): e1006949. https://doi.org/10.1371/journal.pcbi.1006949

Editor: Frank Alexander Feltus, Clemson University, UNITED STATES

Received: June 7, 2018; Accepted: March 15, 2019; Published: April 15, 2019

Copyright: © 2019 Flagel et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: Data are within the paper and its Supporting Information files. Sequence data are available from the NCBI SRA. The SRA BioProject IDs are PRJNA529262, PRJNA528882, and PRJNA528756. Code and processed sequence data are on Github (https://github.com/flag0010/GOOGA).

Funding: Funding support was received from the National Institutes of Health (nih.gov; R01 GM073990-02 to J.K.K.), the University of Virginia, the University of California, Berkeley, an NSF (nsf.gov) Postdoctoral Fellowship in Biology (DBI-0905958 to BKB), and NSF grants to J. Willis (IOS-1024966) and BKB (IOS-1558035). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Over the last decade, genetic research has been revolutionized by the availability of whole genome sequences for many of the world’s medically, ecologically, and agriculturally important species. It has also become clear that a single reference genome sequence is insufficient for many species. For example, a comparison of two maize accessions found that over 2,500 genes were present in only one of the two genomes [1]. Even in humans, which have much less genetic diversity than maize, structural and gene content polymorphisms are abundant [2]. Differences in gene copy number [3–6], variation in gene order [1, 7, 8] and chromosomal inversions [9–13] are not captured by a single reference genome, nor can they be annotated succinctly in relation to a single reference as is possible for Single Nucleotide Polymorphisms (SNPs). These structural and gene content variants have important phenotypic consequences in many species, highlighting the need for intensive study [14–18].

Recognizing structural variation is important for many of the experimental applications of genomic science. For example, trait-mapping approaches, such as bulked segregant analysis, rely on accumulating signals from adjacent genomic regions (windows) to establish significance. If gene order in the population under study differs from the reference genome, the proximity of polymorphisms will be incorrectly inferred, which in turn, undermines inference of both the location and significance of QTLs [19]. Similar issues can arise in population genomic analyses, such as scans for selection or introgression [20, 21] or of response in “Evolve and Resequence” experiments [22–26]. One solution is to make reference genomes for every divergent accession under study [4]. An alternative approach is to construct ‘pseudo-chromosome’ assemblies to better match structural variation in the focal accessions. Regardless, accounting for structural variation is an important challenge for the continued development of evolutionary genomics.

In this paper, we develop an approach to pseudo-chromosome construction based on comparative genetic mapping. In species with repeat rich genomes, whole genome shotgun sequencing and assembly typically yields many thousands of scaffolds. These scaffolds can be stitched together to form pseudo-chromosomes. There are various techniques for making pseudo-chromosomes, such as following a BAC-tiling path [27], optical mapping using nanochannel arrays [28], or by localizing the scaffolds to markers on a genetic map [29, 30]. Genetic mapping has proved to be effective for initial genome construction and pseudo-chromosome assembly, especially for large genomes [31–33]. We extend this approach using comparative genetic maps from five distinct crosses, allowing us to simultaneously improve the pseudo-chromosome representation of the reference genome and also identify large-scale variation in gene order, including chromosomal inversions and translocations.

Our approach utilizes data from low-coverage sequencing. Restriction site associated DNA-sequencing (RAD-seq) [34, 35] and related reduced-representation methods [36–38] allow cost-effective genotyping of hundreds of recombinant individuals in species with limited genetic markers. While RAD-seq data is often used to create de novo markers [39, 40], RAD-seq reads can also be directly mapped to genomic scaffolds. Recombinant genotypes located to genomic scaffolds can then be used to assemble pseudo-chromosomes. Unfortunately, there are substantial challenges in constructing genetic maps from low-coverage sequencing data and in inferring map differences (i.e., the evidence for structural variation). New approaches are needed to address the following methodological questions: What is the optimal means to convert sequence data into markers? How should we accommodate genotyping error in these markers given that the error rate is often high and variable among samples? After locating markers to genomic scaffolds, how do we obtain the optimal order and orientation of scaffolds into pseudo-chromosomes? Finally, how do we determine if putative differences between maps are real?

We develop a statistical procedure called Genome Order Optimization by Genetic Algorithm (GOOGA) that detects large structural mutations, such as inversions or translocations, using marker data from multiple genetic mapping populations. Importantly for error-prone low-coverage genotyping, GOOGA propagates genotype uncertainty throughout the model, thus accommodating this source of uncertainty directly into the inference of structural variation. GOOGA couples a Hidden Markov Model (HMM) with a Genetic Algorithm (GA). The HMM yields the likelihood of a given ‘map’ (hereafter used to denote the ordering and orientation of scaffolds along a chromosome) conditional on the genotype data. The GA searches map space by creating new candidate orders which are recurrently fed to the HMM to diagnose their likelihood. Inference of recombination rate parameters and/or tests for differences in gene order are enabled by the fact that all calculations are conducted in the currency of likelihood.

As proof of concept, we first apply GOOGA to mapping data simulated from a known genome sequence. After demonstrating its effectiveness in this context, the pipeline is applied to real RAD-seq data (multiplexed shotgun genotyping (MSG) type [36, 41]) from five different mapping populations, each synthesized from a cross between lines within the M. guttatus species complex. The complex is a highly diverse clade of inter-fertile North American wildflowers [42–45]. The five mapping populations include both intra- and inter-specific crosses as well as multiple cross types (F₂s, F₃s, and recombinant inbred lines (RILs)). Recombinant individuals from each mapping population were scored genome-wide for SNPs and then input to GOOGA. Starting from a rough-draft scaffold order [46], GOOGA produces an optimized order and orientation of genomic scaffolds for each population. The M. guttatus reference genome derives from an inbred line used as a parents in two of our five crosses, which allows us to correct many errors (misplaced scaffolds) in the reference genome. Improved estimates for recombination rates indicate the effects of gene density and transposable elements on chromosomal variation in recombination. Comparisons among maps identify eight distinct structural polymorphisms, five of which were suggested by previous mapping studies [12, 13, 47–50]. Finally, we demonstrate GOOGA clarifies the results of a QTL mapping study by correcting errors in the reference genome.

Methods

A simulation based on Drosophila

We used the "A4" genome build [51] of D. melanogaster to serve as the reference in the simulation study (downloaded from https://www.ncbi.nlm.nih.gov/assembly/GCA_002300595.1/). This is a high quality build; likely close to the true genome sequence of the line. To estimate multiplexed shotgun genotyping (MSG) [36, 41] of a mapping population using this genome, we extracted MSG reads from a QTL mapping study in D. melanogaster [52]. We mapped reads, downloaded from https://datadryad.org/resource/doi:10.5061/dryad.gc182, to the A4 sequence using the methods described below for Mimulus MSG reads. The mapping locations of restriction site associated DNA-tags (RAD-tags) were thinned to at most one tag per kb, and then used for subsequent simulation of MSG data in a population of F₂ individuals. To create the mapping population, we consider a cross between isogenic lines that differ at one SNP per RAD-tag. For simplicity, we focused on chromosome 2 with F₁ flies heterozygous at each RAD-tag along this chromosome. We synthesize each F₂ by crossing a male that produces gametes without recombination, to a female that produces gametes based on the recombination frequency map of D. melanogaster [53]. Given the locations and true genotypes at each RAD-tag, we simulate genomic data by sampling a specified number of reads per individual per tag. After forming the entire F₂ population, we fractured the reference genome into 31 scaffolds by randomly choosing 30 break-locations across the chromosome. The genomic data were then input to the GOOGA pipeline. The genomic coordinates of RAD-tags, the recombination map, and the code to simulate F₂ genotype and break the reference genome into scaffolds are provided in S1 Data.

The GOOGA pipeline

Fig 1 is an overview of the data analysis pipeline.

Download:

Fig 1. The GOOGA pipeline is illustrated with inputs and outputs (rectangles) and programs (rounded shapes) at each stage.

Purple denotes main outputs.

https://doi.org/10.1371/journal.pcbi.1006949.g001

The first programs take short-read data from recombinant individuals within the mapping population and make “putative” genotype calls. From the parental genomes and offspring of each mapping population we identified SNPs with GATK [54]. We retained only diagnostic SNPs, where bases could be unambiguously assigned to one of the parents (A or B). Window based calls are based on the aggregate of sequence data from all diagnostic SNP sites located between a lower and upper coordinate (all reads that map between these coordinates). The size of windows is a user defined variable chosen based on the density of SNPs and the average number of reads per SNP per individual. The count of reads matching each parent (A or B) is the basis for making putative calls although thresholds for these decisions are also user defined. In our application to both simulated Drosophila data and real Mimulus data, we delineated markers as windows 100 kb in length within each scaffold. The last marker on each scaffold included all remaining sequence beyond the last complete 100 kb segment. Given the number of reads scored as A and B (across SNPs) within a window, the individual was called AA if the fraction of reads matching parent A exceeded 95%. They were called BB if the fraction was less than 5%, and AB if the fraction was between 25% and 75%. We default to ignorance (NN = No Call) if none of these conditions are met or if the number of reads within the window for that individual is less than a minimum depth threshold (set to 6 for our applications). Ambiguity can result either from mis-mapping of reads (in which case the read counts are misleading) or if recombination occurs within the marker (in which case the true genotype is a combination of two different genotypes, e.g. AA-AB). Scoring either scenario as NN is suitable for downstream analysis by the HMM–data from neighboring markers will strongly inform inference of the underlying genotype at the NN marker. The detailed Methods for making putative genotype calls in for the Mimulus data is reported in S1 Appendix.

We use a window-based approach instead of SNP-specific genotyping because the latter can be highly error prone, especially with low coverage sequencing data and mapping to preliminary (error filled) genome builds. Even after numerous filtering steps (see S1 Appendix), SNP-level calls remain problematic (see S1 Fig for an illustration with data from three Mimulus recombinants). Under-calling of heterozygotes is often problematic with MSG data; both alleles appear less frequently than the binomial distribution predicts [55, 56]. Aggregating signal from closely linked SNPs addresses both low coverage and allele dropout in heterozygotes. However, SNP level genotype calling may be preferable if both the reference scaffolds and recombinant sequencing are sufficiently high in quality. The downstream components of the GOOGA pipeline are fully compatible with SNP-level calls in this circumstance.

The likelihood

The putative calls are the observed or “emitted” states for the HMM and the hidden states are the true underlying genotypes (AA, AB, or BB) [36, 57, 58]. The structure of the model and transition probabilities depends on the breeding design (e.g. F₂ or F₃ or RILs) and on recombination rates. Fig 2 illustrates the HMM for an F₂ population.

Download:

Fig 2. The structure of the Hidden Markov Model is illustrated for F2 individuals.

Transitions between genotype states (AA, AB, or BB) for markers m₁ through m₄ in the latent state layer are determined by recombination fractions (r) between pairs of markers. The probabilities of each latent state estimate are then propagated into the emitted state layer with a genotyping error rate term that is specific to each individual (subscript i).

https://doi.org/10.1371/journal.pcbi.1006949.g002

The HMM is non-homogeneous [59] and we report the transitions for F₂, F₃, and RIL designs in S2 Appendix. The emission probabilities are determined by individual-specific genotyping error rates (Fig 2). Three distinct error rates are estimated for each individual (i): the probability that a true homozygote yields a putative call to heterozygote (e_0i) or to the opposite homozygote (e_1i), and the probability that a heterozygote yields a call to one of the two homozygotes (e_2i). Regarding the last rate, we assume that errors to either of the alternative homozygotes are equally likely.

The HMM is applied in different ways at four stages of the pipeline (Fig 1). The first objective is to estimate individual-specific genotyping error rates from transitions between genotypes within scaffolds. The transition probabilities depend on the true recombination rate per base pair, which is unknown and can vary across the genome. However, to estimate e_0i, e_1i, and e_2i, we fit a simple ‘homogeneous’ model assuming that recombination rate is proportional to physical distance between markers within scaffolds. Based on data from prior Mimulus studies [47, 50], we set rates to the genomic average of 5.0 cM/Mb for F₂ populations and 10.0 cM/Mb for F₃ and RIL populations. For the simulation study, we use the specified (known) recombination rates. The likelihood of data from each scaffold of an individual plant is then a function of e_0i, e_1i, and e_2i, and the likelihood for the entire plant is a product across scaffolds. For each plant we obtain the maximum likelihood estimation (MLE) of e_0i, e_1i, and e_2i via application of the forward-backward algorithm [60] coupled with the bfgs bounded optimization routine [61] of scipy.optimize (https://docs.scipy.org/doc/scipy/reference/optimize.html). We also used the bfgs optimizer to obtain MLE for recombination rates as described below. The individual-specific error rates (reported as S1 Table) can be used to cull highly error prone individuals from mapping populations (S2 Appendix).

The next step is to obtain preliminary recombination rate estimates between markers within scaffolds (Fig 1). While error rates were estimated by taking a likelihood across all scaffolds within a single individual, we here maximize the likelihood within each individual scaffold (with respect to r values between all adjacent markers) across all individuals within a mapping population. These intra-scaffold rate estimates are used in the GA search (described below), but not in the final maps. In our application to the five Mimulus datasets, the intra-scaffold r estimates were nearly always small, consistent with close linkage. However, we noticed one very high rate between two adjacent markers on scaffold 13 of the DUNTIL cross. The same interval exhibits normal (r < 0.01) recombination rates in other crosses. Given evidence for an inversion breakpoint (further evidence below), we split scaffold 13 into 13a and 13b.

The third application of the HMM is within the GA as it searches for the optimal map (ordering and orientation of scaffolds within a chromosome). Here, the log-likelihood is for the entire chromosome of an individual, summed across all individuals in the mapping population. This requires an assignment of scaffolds to linkage groups. In the absence of other information, this can be obtained by applying a standard map making program such as R/QTL [57] to the putative genotype calls. We wrote a simple end matching program (make.ends.meet.py) to accomplish this task for the simulation study. This program, and all others used in the pipeline, are available at (https://github.com/flag0010/GOOGA).

For a chromosome, there are L– 1 + K recombination rates to be estimated, where L is the number of scaffolds on the chromosome and K is the number of intra-scaffold rates within these scaffolds. The likelihood of a particular map is determined mainly by the data at the “joins” (where two scaffolds meet). The genetic algorithm searches map space with intra-scaffold rates held at their estimates from the second application of the HMM (Fig 1). Evaluation is based on the likelihood of the map after optimizing the inter-scaffold rates. However, we apply the full model, with all intra- and inter-scaffold recombination rates re-estimated, to our final map for each chromosome of each mapping population. In both cases, we obtain the maximum likelihood via application of the forward-backward algorithm.

The genetic algorithm

GOOGA maximizes the likelihood of the HMM using a genetic algorithm (GA) applied to each chromosome of each mapping population. A GA is an algorithmic optimization scheme inspired by sexual reproduction and natural selection [62]. Below we use terms such as “individual”, “mutation”, “recombination”, and “fitness” as they are frequently used in the GA literature; not to the biological processes. To build the GA, we first coded unique scaffold orders, including scaffold orientations (i.e. forward strand vs. reverse complement), to make an "individual." Each individual represents a candidate solution among a population of size N competing in a given generation. The likelihood of the map provides the fitness of individuals. To preserve the best scaffold orders, we used a strategy called elitism (E), which allows a predetermined number (E = 4 in our case) of the best individuals (i.e., highest likelihood scaffold orders) to go on to the next generation unchanged. To fill the remaining N-E spots in the next generation, we applied rank-based selection to select pairs of individuals to “mutate” and “recombine” into new individuals for the next generation. The details of the mating and mutation scheme, and well as the strategy for optimizing computation, are reported in S3 Appendix. After extracting the optimal map from the GA, we run the HMM with both inter- and intra-scaffold rates as free parameters. This yields the final MLE for recombination rates across each chromosomes and also the posterior probabilities of genotypes for each individual in the mapping population.

Comparing recombination rates to genomic features

We extracted MLE recombination rates from each Mimulus mapping population to compare to DNA level features such as amount of coding DNA, number of transposable elements, and the presence of centromeric DNA. For these analyses, we defined a 200 kb interval around each 100 kb-long marker, starting at the midpoint of the preceding marker and ending at the mid-point of the next marker. The analysis is defined at this scale to absorb recombination events that occur mid-marker. Consider the first three markers on a scaffold defined on the position ranges 0–100 kb, 100 kb-200 kb, and 200 kb-300 kb, respectively. We related the sequence interval from 50 kb-250 kb to the sum of the two rates (r_1,2 + r_2,3 of Fig 2). This analysis neglected very small scaffolds, the sequence at the ends of longer scaffolds, and the estimated rates between scaffolds (the amount and the features of the interceding DNA are unknown). In this analysis, we also excluded regions where recombination is suppressed due to inversions (S2 Table). We obtained a single rate for each interval by first standardizing map specific rates by the total length of each map and then calculated a weighted mean across populations. The weight given to estimates from each cross is proportional to the reciprocal of the genome-wide recombination rate variance: IMPR = 1.0, IMSWC = 0.899, IMNAS = 0.790, and DUNTIL = 0.677, and IMF3 = 0.380 (see Table 1 for mapping population abbreviations).

Download:

Table 1. Details about the five Mimulus crosses.

For each cross we list the parental lines used, their species, and the cross and mapping population type.

https://doi.org/10.1371/journal.pcbi.1006949.t001

QTL mapping in the IMSWC population

We compared quantitative trait locus (QTL) mapping results for the IMSWC cross using either the optimized scaffold order generated by our pipeline or the scaffold order of the M. guttatus V2 reference genome. For each of the 873 F₂s used for genetic map construction, genotype posterior probabilities were emitted for each of the 111 markers defined for chromosome 11. This chromosome harbors a major QTL that contributes to variation in the ability to flower under 13/11 hours of light/dark (Kooyers et al. 2018, in prep.). Genotypes were then assigned at each marker for each individual based on the genotype with a posterior probability > 0.95; otherwise, the genotype was called as missing. For each marker order, we calculated recombination frequencies (Haldane map function), imputed genotype probabilities at 1 cM steps (error probability = 0.001), and performed interval mapping using the binary model in R/QTL [57].

Results

The efficiency and limitations of GOOGA are demonstrated by simulated data

We simulated data with Drosophila chromosome 2, fragmented in different ways into 31 scaffolds, which were then input to the GOOGA pipeline. Fig 3 is a typical result.

Download:

Fig 3. A typical GOOGA run on simulated data for D. melanogaster Chromosome 2 (200 F2 individuals genotyped for 31 scaffolds).

The input map is given on top and the final optimized map at the bottom. The grey shaded region in between are the 26 steps GOOGA took to transform the input order into the final optimized map. The scaffolds in the grey shaded region are all drawn to a uniform length, while the scaffold lengths for input and final maps are drawn to their genetic map length. Green indicates forward orientation, red reverse. Grey lines connect the same scaffold between steps. The log likelihood (lnLk) value for each step is given to the left of the map order.

https://doi.org/10.1371/journal.pcbi.1006949.g003

The initial map, constructed from putative genotype calls at scaffolds ends, is misassembled in several places, mainly where recombination rates are low. GOOGA converged in 26 steps for this simulation replicate, increasing the log-likelihood (lnLk) by over 500 and reducing the map length by ca. 40%. Genetic map shrinkage occurs because maximum likelihood adjustments of the rate parameters will compensate for bad joins by increasing recombination fraction (r) values. Over the course of a run, bad joins are corrected. Importantly, the final GOOGA order is not completely correct–two small scaffolds near the centromere are inverted. The likelihood of the data under the GOOGA final map and the correct map are exactly equivalent. There are simply no recombinant individuals in the mapping population to resolve the orientation of these scaffolds because it is a low recombination region. In all of simulation replicates, GOOGA converged on the correct map or on a map with an equivalent likelihood to the correct map.

GOOGA optimization rebuilds the M. guttatus reference genome

We used GOOGA to optimize chromosome-scale genetic maps for all 14 chromosomes among five Mimulus mapping populations (Table 1, S2–S15 Figs, and S3 Table). In each case the current M. guttatus genome assembly was used as the starting point with 100 kb windows as genetic markers. For each chromosome, GOOGA produced an MLE pseudo-chromosome construction. The overall map lengths for the F₂ crosses are 1278 cM for the DUNTIL cross, 1523 cM for the IMNAS cross, and 1258 cM for the IMSWC cross. The average map length, 1353 cM, is shorter than previous F₂ maps generated in Mimulus through PCR-based genotyping methods [12, 47, 50]. The map from the IMF3 cross is ≈50% longer than the F₂ average (2043 cM), as expected given the extra generation of recombination between F₂ and F₃ generations. Finally the map from the IMPR cross (1489 cM) is only slightly longer than the F₂ average. There is additional recombination in the formation of these RILs, but the recombination parameter is specified differently (as related to crossover events) in the RIL HMM (see S2 Appendix).

Comparison of the five GOOGA optimized maps to the M. guttatus V2 reference genome order (hereafter V2 map; [46]) indicates that a large number of updates to the reference genome are necessary. Although the five mapping populations differ importantly from each other, changes in scaffold order and orientation from the V2 map are nearly always shared–the five estimated maps differ from the reference in the same way. These regions likely reflect errors in genome assembly given that the reference genome was sequenced from an inbred line (IM62) that is used in two of the crosses that we analyze here (IMF3 and IMSWC). Fig 4 illustrates this point, comparing the maximum likelihood map of the IMF3 data from GOOGA to the V2 map for three chromosomes.

Download:

Fig 4. The improvement of the chromosome map between the M. guttatus V2 reference genome and the GOOGA optimized order for three chromosomes (IMF3 cross).

Within each panel, the V2 order is shown above the IMF3 optimized order. Each genomic scaffold is drawn to its genetic map length (in Morgans) and denoted in green if it maps on the forward strand or red for the reverse strand. Grey lines connect the same scaffold in the V2 and optimized order.

https://doi.org/10.1371/journal.pcbi.1006949.g004

This intra-population cross is likely to be most congruent with the true order of the IM62 reference genome. The difference in log-likelihood (ΔlnLk) provides a measure of improvement in fit of the GOOGA relative to V2. ΔlnLk is computed by fitting the genotype data to both the GOOGA optimized and V2 maps, and then subtracting the former from the latter. In each case, recombination rates are estimated independently, and so ΔlnLk is determined entirely by differences in scaffold order and orientation. The maps for chromosome 14 (Fig 4A) are largely similar. However, differences such as the changes in location and orientation of scaffolds 127, 211, 178, and 140b (inconsistency near the center of Fig 4A, S3 Table), are sufficient for a large improvement in likelihood. The ΔlnLk of 77.1 (S4 Table) corresponds to a likelihood improvement of 10³³, suggesting this new order fits the segregation data in the IMF3 population far better than the V2 map. Importantly, the updated ordering of scaffolds 127, 211, 178, and 140b is shared by the IMSWC, DUNTIL, and IMPR maps (S14 Fig and S3 Table). The IMNAS map retains the [211,127] ordering of the V2 map, albeit with a flip of 127. However, this inconsistency is not compelling because genome assembly by genetic mapping will fail when there is no recombination to provide signal, and there is no evidence of recombination in this region among the 91 F₂s genotyped in the IMNAS population.

Chromosome 2 (Fig 4B) is typical of most IMF3 contrasts. Numerous scaffolds are rearranged (e.g. the IMF3 sequence [44a, 212, 249] is inverted in the V2 map) and several are flipped in place (including scaffold 81 that flanks [44a, 212, 249]). There is an increase in likelihood from the V2 map to the GOOGA optimized order (ΔlnLk = 255.8), and the genetic length of chromosome 2 shrinks by ≈15% from V2 to GOOGA. This effect is even more pronounced for chromosome 10 (Fig 4C), where there is a large increase in likelihood (ΔlnLk = 710.1). Among 70 chromosomes (5 crosses x 14 chromosomes), 23 (33%) chromosomes had ΔlnLK improvements greater than 500, while only 5 (7.1%) improved less than 100 (S4 Table). The most significant alterations to the V2 map are in genomic regions harboring inversions, particularly on chromosomes 5, 8, and 10 (Fig 4C, S6 Fig, S9 Fig, and S11 Fig). The V2 map is based partly on genetic mapping data from approximately 70 recombinant inbred lines from a cross between IM62 and DUN10 (J. Willis pers. comm.). DUN10 is a parent in our DUNTIL cross [35] and the aggregate of evidence (see below) suggests that DUN10 has chromosomal inversions (relative to IM62) on each of these chromosomes.

Comparison of GOOGA maps confirms extensive structural diversity in Mimulus

Our five mapping populations (Table 1) contain one intra-population cross (IMF3), two inter-population crosses (IMPR, IMSWC), a close interspecific-cross (IMNAS), and a more distant interspecific cross (DUNTIL). We observe structural polymorphisms by aligning the five maps to each other by chromosome. Fig 5 compares the maps for chromosome 10 which had previously been shown to harbor an inversion in IMPR [48].

Download:

Fig 5. A reversal of genomic scaffolds due to an inversion is illustrated by the comparison of chromosome 10 maps for all five crosses.

Each genomic scaffold is drawn to its genetic map length and denoted in green if it maps on the forward strand or red for the reverse strand. Grey lines connect the same scaffold between maps.

https://doi.org/10.1371/journal.pcbi.1006949.g005

As described in METHODS, we broke scaffold 13 into 13a and 13b based on a preliminary analysis of the DUNTIL data. GOOGA reassembled 13a and 13b into a continuous sequence for the IMF3 cross, but not the other crosses (Fig 5). There is minimal recombination between 13a and 13b in IMPR, IMNAS, and IMSWC because 13b is flanked by a large block of markers with nearly complete recombination suppression. This suppressed region, which represents at least 4.5 Mb of DNA, is freely recombining in IMF3 and DUNTIL but with a perfect reversal of marker order/orientation between those two crosses. From this, we infer that the IMF3 parents (IM62 and IM767; Table 1) each have inversion karyotype “A”, the DUNTIL parents (LVR and DUN) each have karyotype “B”, and the other three crosses are heterokaryotypic (one parent A and one B) for this inversion (S3 Fig). Noting that IM767 is orientation A, recombination suppression in the IMPR suggests the other parent in this cross (Point Reyes) has orientation B. By similar reasoning, we can conclude that SF5 and SWC also have orientation B (S3 Fig). The right half of chromosome 10 is largely collinear among all five crosses, indicating the inversion is the primary influence on chromosome-wide likelihood. The GOOGA lnLk of the IMF3 data is -3784. If the IMF3 data is forced into the optimized DUNTIL order, the lnLk drops to -4074 (ΔlnLk = 290; S5 Table). This gives strong statistical support of the inversion between the A and B homokarytypic crosses. The effect is less pronounced in the heterokaryotypic crosses. For example, the ΔlnLks of the IMNAS data when forced into the DUNTIL and IMF3 maps are 59 and 124, respectively (S5 Table). Thus, as expected, the recombination suppression in this heterokaryotypic cross results in relatively weak support for either a pure A or B inversion orientation.

The inversion on chromosome 10 is the only case among these crosses where we see free recombination in both homokaryotypes and suppression in the heterokaryotypes. In all other cases, one or more crosses reveal recombination suppression, with at least one homokaryotypic cross also present among our five populations (S11 Fig and S3 Table). Lowry and Willis [13] showed a reversal of marker order (as in Fig 5) for the inversion on chromosome 8. This feature is associated with annual versus perennial life-history within M. guttatus. Here, we see free recombination over the inverted region on chromosome 8 in IMF3, IMSWC, and IMNAS (annual x annual crosses), and suppression in IMPR and DUNTIL (annual x perennial M. guttatus and perennial M. guttatus x perennial M. tilingii, respectively). A similar pattern is noted for previously hypothesized inversions on chromosomes 5 (suppression in DUNTIL and IMPR) and 13 (suppression in DUNTIL) [49] and for the meiotic drive locus on chromosome 11 (suppression in IMF3) [63]. Given comparable evidence, we also identify three novel putative inversions (S2 Table). A region of at least 1.2 Mb spanning scaffolds 19, 73 and 65 on chromosome 2 is completely suppressed in the IMPR (S3 Fig). There is substantial recombination across this region in other crosses: 8 cM in DUNTIL, 10 cM in IMF3, 7 cM in IMNAS, and 2 cM in IMSWC. A larger physical region (≈5 Mb on chromosome 7) is fully suppressed in IMPR but not the other crosses (20–45 cM; S8 Fig). Finally, a stretch of ≈4 Mb on chromosome 14 is suppressed in the IMSWC cross but not in other crosses (about 30 cM, S15 Fig).

To provide a more general comparison of the extent of gene order differences among the crosses, we imposed the optimal map in every cross onto the genotypic data from every other cross and computed the ΔlnLk. Then we summed these for each chromosome. The larger this sum, the greater degree of structural discrepancy between maps for each cross (Table 2). As quantified by this metric, chromosome 10 has the greatest degree of structural discrepancy, an unsurprising result given the large polymorphic inversion (Fig 5). Chromosomes 5 and 11 are next, both of which show large tracts of reordered scaffolds and shared regions of recombination suppression among several crosses (Table 2). Surprisingly, the lowest value is for chromosome 8, which has a pairwise sum of ΔlnLk of only 75.4. The large inversion on chromosome 8 suppresses recombination in annual x perennial mapping populations, and as a consequence, the ordering of scaffolds within the inverted region is fairly arbitrary in those heterotypic crosses. There are few map changes for chromosome 8. This result arises because the strongly supported, co-linear maps from the homokaryotypic crosses (IMF3, IMSWC, and IMNAS) are also largely reiterated in the suppressed crosses (DUNTIL and IMPR). However, all GOOGA maps of chromosome 8 represent a vast improvement over the V2 map (mean ΔlnLk vs V2 = 985, S4 Table), suggesting the shared order that emerges from the inversion region is a large improvement over the reference genome.

Download:

Table 2. Summed pairwise changes in log-likelihood values by chromosome given in descending order.

For each chromosome we imposed the final scaffold order from every cross onto every other cross and calculated the difference in the natural log likelihoods (ΔlnLk) between the crosses own order versus this imposed order. Sums are given on the set of ΔlnLks for each chromosome. A large positive ΔlnLk indicates a pair of crosses with highly incompatible orders.

https://doi.org/10.1371/journal.pcbi.1006949.t002

DNA composition predicts recombination rate variation in Mimulus

We tested whether recombination rate is associated with the proportions of DNA annotated as coding sequence, transposable elements (TEs), and putative M. guttatus cent728 centromeric repeats [55] within a genomic window (Fig 6). Recombination rate is positively correlated with coding sequence density (Pearson’s r = 0.218) and is negatively correlated with TE density (Pearson’s r = -0.478). To test the impact of centromeric repeats [63] on recombination rate, we binned our 200 kb windows into those with < 5% centromeric repeat sequence vs. those with > 5%. Centromeres are expected to suppress recombination, and consistent with this prediction, we see a significant drop in recombination in windows with > 5% centromeric repeats (t-test p-value = 0.0003; Fig 6C). These results indicate that the recombination rates estimated by GOOGA fit well with biological expectations.

Download:

Fig 6. The effect of genomic features on recombination rate is illustrated.

Panel A shows the positive correlation between the proportion of coding sequence and recombination rate. Panel B shows the negative correlation between the proportion annotated as a transposable element (TE) and recombination rate. For both panel A and B the red line markers the least squares best fit. Panel C shows the recombination rate distributions for 100 kb regions with >5% centromeric repeat content, or <5% centromeric repeat content.

https://doi.org/10.1371/journal.pcbi.1006949.g006

Discussion

There has been a resurgence of interest among evolutionary biologists in structural variation, particularly in the contribution of chromosomal inversions to phenotypic variation, adaptive divergence among populations, and speciation [15–18, 64–68]. Inversions are routinely discovered from recombination suppression in genetic maps. They can be verified cytologically [11] or by reversal in marker order when comparing different genetic maps (e.g. [13], Fig 5). Our capacity to generate genetic maps has significantly advanced with RAD-seq and related genotyping platforms [34–38], particularly in non-model organisms. We developed GOOGA in response to these data, with an eye toward detecting structural variation. Accommodating error in map construction is particularly important with low-coverage sequencing and unpolished genome builds. Propagating genotype uncertainty throughout the process to the assignment of map likelihoods provides a means to determine how strongly the underlying genotype data support apparent differences in scaffold order and orientation. The application of GOOGA to detect previously identified structural polymorphism in Mimulus illustrates how evidence of map differences manifest as differences in likelihood (Figs 4 & 5).

Andolfatto et al. [36] developed an HMM for use with RAD-Seq like data in genetic mapping. We have adopted the HMM approach here, although with numerous updates to both the model and implementation. First, we define markers within genomic windows, each inclusive of many SNPs, e.g. [48]. SNP calls from closely linked sites are aggregated to make a putative call for the ancestry of each genomic region (e.g. AA, AB, or BB) of each recombinant individual. Given that the quality of individual DNA samples can vary greatly, we fit a model with individual specific genotyping error rates. The intervals between markers are treated differently depending on whether markers are within a genomic scaffold or between distinct scaffolds. The former are subject to tests for genotyping consistency given implied close linkage. The latter are explored using a genetic algorithm to order and orient scaffolds into chromosomes based on the likelihood of the data. Window-based genotype calling sacrifices resolution to obtain robust markers. In this application, we used 100 kb windows, which is about 0.3% of the average Mimulus chromosome. The resulting marker density is high relative to the number of recombination breakpoints per chromosome in F₂, F₃, and RIL individuals (typically 1–3; S3 Table) and sufficient for testing alternative maps. However, scoring recombination events at the scale of 100 kb does limit inference of genomic features that determine recombination. Factors defined effectively at the 100 kb scale, such as the density of genes or transposable elements exhibit clear correlations with recombination (Fig 6; [69]).

On the other hand, it is too coarse to evaluate finer scale determinants of recombination events. For example, population LD patterns suggest that recombination is strikingly elevated near the start site of genes in M. guttatus [46]. Fig 6A is fully consistent with this result: Mean recombination rate is correlated with the proportion of coding sequence, which is strongly correlated with number of gene start sites per window. However, at the resolution we selected, we cannot distinguish the effect of start sites from other features of gene rich regions.

The emission probabilities in the HMM portion of GOOGA are based on individualized genotyping error rates. This model feature stems from our observation that the quality of genotyping data varies quite substantially among samples, even of the same batch. This variability likely reflects stochastic factors, such as variation in the amount/quality of input DNA per individual. Regardless of cause, we find important differences in error rates among samples in all mapping populations (S1 Table), justifying the need to account for these error rates explicitly and individually. While rates are typically low in absolute terms (medians around 0.01 for e_0i and e_2i, much smaller for e_1i; see METHODS), they are not negligible relative to actual recombination rates between adjacent markers. Differences in genotyping error rates provide key weights on the contributions of different individuals to the overall likelihood of a map and its associated collection of recombination rate estimates. A practical example of the utility of genotype filtering and genotyping error estimation is that it enabled the discovery of two novel putative inversions in the IMPR cross (on chromosomes 2 and 7). These were not identified in the original paper [48] because those authors imposed conservative thresholds both on marker inclusion and on whether individuals were included in the final map construction. Twice as many of the plants from the IMPR population are included in the present analysis and 25% more of the genome is included in the map.

A diversity of factors may complicate marker construction from low-coverage sequencing data [55, 70, 71]. We implement filtering steps in GOOGA to mitigate these factors including SNP quality and allele frequency, SNP-level neighbor consistency tests, read-depth thresholds, marker-level (100 kb interval) neighbor consistency and heterozygosity tests, exclusion of individuals based on a high proportion of missing genotype calls, and/or high genotyping error rates. The goal of this filtering is to produce a marker set that is consistent with Mendelian segregation, but the filters will not always succeed. For example, a region on chromosome 5 involving only 7 markers in the IMF3 cross (corresponding to the small scaffolds 226, 252, 94, 368 and 358) contributes 97 cM to the map length (over 40% of the total for this chromosome). It is possible that this a high recombination region (an unknown amount of DNA resides between these scaffolds), but it seems more likely a spurious inflation. These scaffolds map inconsistently in the other crosses–if they appear at all, they are not always adjacent. Of course, incorrectly locating good markers can produce the same “map inflation” effect as properly locating misleading markers.

The simulation study (Fig 3) reinforces the limitations of using genetic map data to guide genome assembly. Even with fully accurate genotyping, recombination based methods cannot resolve non-recombining regions. Sequencing-based methods [72] may be required to identify structural variants in these situations. Additionally, as GOOGA requires genetic segregation data, it cannot be used in species where segregating families cannot be created experimentally or sampled from nature. We here analyzed standard designs (e.g. F₂, F₃, and RIL), but the HMM could be adapted to other situations such as outbred families or pedigrees. Finally, because GOOGA was designed to work with low-coverage sequence data, it requires several informative reads per windows, which may not be possible in species with extremely low genetic variation. However, outside of these exceptions, the GOOGA methodology is highly flexible and can be readily adapted to work in many species.

Mimulus results

A tangible product of the application of GOOGA to Mimulus guttatus is that we substantially revise the reference genome of this species. The reference line (IM62) is used in two of our crosses, including in a cross to another line from the same population (IM767). Excepting the meiotic drive locus on chromosome 11 [63], IM767 appears to be largely collinear with IM62 (the two lines have the same orientation at other putative inversions). Despite this, the GOOGA realignment of scaffolds yields a dramatic increase in IMF3 likelihood over the V2 assembly: ΔlnLk is 5464 when summed over all chromosomes (S4 Table). Map revisions are found on each chromosome, and supported not only by ΔlnLk within IMF3, but also the maps from the other crosses. Incomplete assembly is a common and important problem in genomics, particularly in species with complex patterns of repeats. The promising implication of Fig 4 is that the rough genome assembly of many species can be dramatically improved with a low-coverage sequencing of a mapping population. Moreover, GOOGA quantifies the magnitude of improvements in terms of increase in likelihood.

The alignment of maps for Mimulus chromosome 10 illustrates the effect of an inversion. A 5 Mb region (left portion of Fig 5) was previously identified as a putative inversion from recombination suppression in the IMPR [48]. This is clearly confirmed here by the inclusion of both homokaryotypic crosses (AxA and BxB) and the ‘map flip’ (top two panels in Fig 5) effectively ascertains the scaffolds within the inversion and their ordering. This example also highlights the importance of marker construction. While it is possible to construct markers de novo with RAD-seq data [40], here we delineate markers on a previously assembled set of reference DNA sequences (the genomic scaffolds from the M. guttatus reference genome) obtained from sequencing of the IM62 inbred line. There are clear advantages to defining markers in this fashion, but care must be taken with this approach, especially in distantly related populations. For example, we initially assumed that the IM62 reference genome scaffolds correctly reflect the gene order for the other mapping populations. However, in our analysis of chromosome 10, we found it necessary to break scaffold 13 into two parts (13a and 13b), though it is continuous in IM62. GOOGA reannealed 13a and 13b in the IMF3 cross but inserted other scaffolds between them in other crosses due to a segregating inversion with a breakpoint contained within scaffold 13 (particularly DUNTIL; Fig 5). This implies that scaffold 13 was correctly assembled for IM62, but not for DUNTIL.

In the five Mimulus crosses, chromosome 10 is the only of the eight putative inversions where both homokaryotypic crosses are included. Reversal of marker ordering between homokaryotypic crosses was previously demonstrated for chromosome 8 [13], and could be demonstrated for others with appropriate selection of parental lines. However, the sort of reversal of genetic maps evident in Fig 5 requires polymorphism within both karyotypes. The derived karyotype is essentially homogeneous (few or no SNPs) for several structural polymorphisms in M. guttatus [12, 73]. Polymorphism should be evident within both karyotypes for inversions that have had time to accumulate variants through mutation or gene flux [74, 75], but perhaps not for very young inversions.

The inclusion of phenotype data with the IMSWC cross illustrates the importance of scaffold ordering for downstream genetic analyses (Fig 7).

Download:

Fig 7.

The results of QTL estimation using the V2 map (top panel) versus the IMSWC optimized map from GOOGA (bottom panel). The maps are aligned in middle panel. Each genomic scaffold is drawn to its genetic map length and denoted in green if it maps on the forward strand or red for the reverse strand. Grey lines connect the same scaffold between maps. QTL estimates are plotted for both panels as the log of the odds (LOD) score on the y-axis, and genetic map location on the x-axis.

https://doi.org/10.1371/journal.pcbi.1006949.g007

Here, each F₂ plant was scored for progression to flowering, a dichotomous trait in the experimental photoperiod, which was restrictive to floral induction for one parent. Application of the same QTL mapping procedure to the data produces radically different outcomes if markers are placed according to the current M. guttatus reference genome (V2 map is the top panel of Fig 7) or by the GOOGA optimized map (bottom panel of Fig 7). Using the latter, which has a ΔlnLk improvement of 1089 over the V2 map (S4 Table), the data suggest a single large-effect QTL localized to a map position 44–45 cM into chromosome 11. QTL mapping to the V2 orientation yields three distinct peaks, each with a high LOD score. The specific markers near the QTL peak in the GOOGA map are jumbled in the V2 map which splits the genotype-phenotype association into three distinct parts. The V2 map is also “stretched”–expanded in recombination length by over 20 cM–likely to compensate for bad scaffold joins. Both of these effects impede QTL inference in places where the reference genome is misassembled.

Application

The GOOGA pipeline is a set of modules (Fig 1): (A) procedures to make genetic markers from low-coverage sequencing data in conjunction with a collection of genomic scaffolds, (B) a method to estimate genotyping error rates specific to each individual, (C) an HMM to estimate recombination rates and obtain a likelihood for a specific ordering and orientation of genomic scaffolds, and (D) a genetic algorithm (GA) to search map space to obtain pseudo-chromosomes that maximize the likelihood of the data. While GOOGA was developed as an integrated series of steps, one or more of the components might be used apart from the rest. Below, we outline a few options that could be appropriate for different species or scenarios.

Most basically, one might use (A) to create markers and then apply standard map making software [76, 77] to the resulting genotype matrix. If this is done with the Mimulus mapping populations (or comparable datasets), the resulting matrices contain a great excess of missing data. As a consequence, extensive culling (both of individuals scored for too few markers and markers scored for too few individuals) is then required to apply standard map making programs. An alternative is to apply (A)-(B)-(C) to generate genetic markers. Assuming that the genomic scaffolds are generally reliable, the HMM will leverage data from neighboring windows to inform genotype calls. After obtaining the MLE on rates, one can extract posterior probabilities on genotypes and then impose ‘hard calls’, e.g. [41], to create a genotype matrix. We found this approach effective with as few as 10 informative reads per windows (see METHODS). Another possibility is to replace the front end of the pipeline. If one has high certainty in the validity of individual SNPs (their location and scoring), it is natural to replace windows (A) with individual SNPs as the observed states of the HMM ([36], Fig 2). Finally, while we found the GA effective for searching map space (D), other map optimization methods may prove useful in searching order/orientation space given that the HMM provides a likelihood for each (e.g., [78, 79]).

Like Mimulus, many species and species complexes harbor significant segregating inversions and other gene order polymorphisms including Drosophila, Zea, Anopheles and Helianthus [80–83]. The simulation study suggests a way that GOOGA could test whether sequences in a novel population under study are collinear with the reference genome. One could start by breaking the reference genome into ‘pseudo-scaffolds’ to be reassembled using data from the novel population. This is essentially what we did in the simulation experiment on D. melanogaster chromosome 2 (Fig 3). Sequences from recombinant individuals of the novel population would be mapped to the pseudo-scaffolds and the data input to GOOGA. Metrics like ΔlnLk can evaluate the reference genome map and identify necessary changes (as we have done in IMSWC in Fig 7). In this way, a necessary solution for species with incomplete genome assemblies (e.g. Mimulus) could be used in species with high-quality reference assemblies (e.g. D. melanogaster) but in populations that are structurally diverged from the reference genome. This application would be particularly convenient in cases where trait mapping is being performed via RAD-Seq genotyping, as no additional data would need to be generated.

Supporting information

S1 Data. Input data and software needed to simulate GOOGA inputs from Drosophila.

This compressed .zip file includes genomic locations and recombination map used to simulate MSG like markers on simulated scaffolds (made by randomly breaking the genome assembly), and also Python programs use to perform these simulations. Running this code will produce inputs appropriate for GOOGA.

https://doi.org/10.1371/journal.pcbi.1006949.s001

(ZIP)

S1 Fig. The SNP level data is reported for each polymorphism within a particular scaffold in each of three plants of the IMSWC mapping panel.

At each SNP, the number of reads that match parent A and parent B are reported. Each plant is heterozygous for their respective scaffold–there is extensive representation of both parents across SNPs. However, a number of specific SNPs (color coded) exhibit strong bias towards one parent or the other (one allele is absent). These regions exhibit the usual tendency (typical of this mapping population and all the others) for allele counts to be over-dispersed relative to the binomial distribution. In other words, the fraction of SNPs that appear homozygous are much greater than expected if reads were sampled independently from heterozygous individual in library construction and sequencing. The graphs illustrate the mixed signal at the SNP level along each scaffold of each plant.

https://doi.org/10.1371/journal.pcbi.1006949.s002

(XLSX)