DPGP2 - African Survey

*** UPDATE - 18.January.2013 - updated (corrected) DPGP2 v3 .***

*** UPDATE - 24.December.2012 - DPGP2 v3 Updated assemblies.***

*** Earlier UPDATE - 31.January.2012 - Filtering hets and annotating admixture.***


ATTENTION: Release 1.0 of the DPGP2 FastQ data is no longer supported. Please download the Release 2.0 FastQ data below. Updated release statement follows:

Phase 2 of the Drosophila Population Genomics Project involved the sequencing of over 100 D. melanogaster genomes from more than 20 population samples - largely from sub-Saharan Africa, which contains the ancestral range of the species. Most of this data was collected from haploid embryos (see Langley et al . 2011 Genetics for experimental methodology), each obtained from an independent inbred or isofemale line. These genomes are are considered haploid / homozygous unless otherwise noted. Genome amplification of the DNA from a single haploid embryo was performed prior to library construction and sequencing (Illumina GA IIx), leading to a moderately increased variance in sequencing depth along the genome. A minority of the data set consists of directly sequenced inbred lines (ZK sample only) or chromosome extraction lines. These lines were sequenced during a transitional phase in our project, they often have lower sequencing depth, and they will not be a major focus of our own analyses.

Release 2.0 of these data (DPGP2) set is now being made public, including sequence reads from the NIH short read archive (unchanged from release 1.0) and FASTQ files based on BWA reference alignments.

Further details and discussion of this data set can be found at the Drosophila Population Genomics discussion group: http://groups.google.com/group/drosophila-population-genomics

The following documents will be critical for appropriate use of this data set, and are available here on this page, www.dpgp.org/dpgp2/DPGP2.html: 

DPGP2_Pop_release2.xls . This file contains basic information about each population sample represented in the DPGP2 data set, including the number of fully sequenced genomes (from haploid embryo sequencing) and the number of partial genomes (chromosome extraction lines, or other samples containing some heterozygous chromosome arms).

DPGP2_Ind_release2.xls . This file lists each individual genome represented in the data set. Importantly, a majority of the haploid embryo sequences (n=118) comprise the "core genomes" of this data set, while the "addendum" category includes chromosome extractions (n=17), (incompletely) inbred lines (n=3), and haploid embryo genomes with specific quality concerns (n=1). The addendum genomes are also quite variable in terms of depth, and this subset of the data may be inappropriate for many analyses.

Our data validation has focused exclusively on euchromatin from chromosomes X, 2, and 3. Data are also provided for other regions (chromosomes 4 and Y, mitochondria, and heterochromatin), but we note that heterochromatin regions (including 4 and Y) could have very poor assemblies, and urge great caution in analyzing these data. Chromosome arms that are not listed under an individual's "focal" and "non-focal" columns in this spreadsheet may not come from the sample of interest (as with non-target chromosomes in balancer extraction lines) or may contain heterozygosity. Hence, only arms listed here for a given sample should normally be analyzed.

This spreadsheet also gives some basic information about sequencing depth and coverage for each sample, along with important genome-specific comments in some cases.

DPGP2_IBD_release2_v2.xls . Based on a simple analysis of identity-by-descent, this file lists broad sequence intervals that are "identical" (below 0.05% pairwise divergence) for specified pairs of genomes. This is primarily intended to help users identify data that may depart from typical population genetic assumptions due to relatedness between genomes. Identity was assessed for overlapping 500 kb windows with 100 kb increments. All pairwise intervals meeting this threshold are reported in this spreadsheet. However, many of these regions are shared between populations, suggesting processes other than close relatedness. We identified a smaller subset of extreme within-population IBD (at scales beyond that observed between populations) and listed these intervals in the spreadsheet (along with methodological details). To mask these within-population IBD regions one uses dpgp2_recommended_ibd_masks_v1.txt with the script fastq2fasta.pl provided below.

The data itself is available from the following:

Sequence read data can be accessed through NCBI by searching at www.ncbi.nlm.nih.gov/sra using the Accession number found in the column title is "SRA Accession" of the spreadsheet DPGP2_Ind_release2.xls (mention above).

FastQ files: Generated by BWA using default parameters (aside from activating the -I flag for Illumina data), then using SAMtools to generate a consensus sequence. Sites within 5bp of a consensus indel and sites called as heterozygous were masked to 'N'. Reads with a mapping quality <20 were excluded. This data object is intended for analyses of substitutional (SNP) variation. However, no guarantee is made concerning the accuracy of the assemblies, especially for repetitive genomic regions. FastQ files are released in two parts, based on the "core" vs. "addendum" partition described above.


dpgp2_v2_rg.ID5.nohets.fastq.bz2 (0.99 Gb) FASTQ file for the RG population sample (a subset of the core genomes). Checksum: "MD5 (dpgp2_v2_rg.ID5.nohets.fastq.bz2) = 768b74de4370e6b79f02907a79234f38". Note that important variation in sequencing depth is still present here.

dpgp2_v2_core.ID5.nohets.fastq.bz2 (3.07 Gb) dpgp2_v2_core.ID5.homozygous.fastq.bz2 (3.07 Gb) FASTQ file for the remainder of the core genomes. Checksum: "MD5 (dpgp2_v2_core.ID5.nohets.fastq.bz2) = 8a84ecdccc5584b25229da99e23a0352".

dpgp2_v2_addendum.ID5.nohets.fastq.bz2 (0.94 Gb) FASTQ file for "addendum" genomes. Checksum: "MD5 (dpgp2_v2_addendum.ID5.fastq.tar.bz2) = f856907349cb78d49d389149744dd96b". Highly heterogenous data set (see above) - use with great caution.

If you downloaded the "addendum" FastQ file on or before Friday 9/16, please download it again if you intend to analyze the ZK186 genome (the sourcing was not corrected until 9/17). The core FastQ data does not include ZK186, and there is no need to re-download the core data if you have done so since release 2.0 was publicized.

dpgp_fastq_2_fasta.pl dpgp_fastq_2_fasta.pl An updated version of the DPGP1 (aka 50 genomes) script dpgp_fastq2fasta.pl is now available. dpgp_fastq_2_fasta.pl is a utility script to generate a fasta file from a fastq data object. It will include only bp meeting the specified minimum quality value (we recommend Q31), in specific target elements of the genome specified in a tab delimited file such as dpgp2_focal_target_genome.txt , and masking specific regions in specific genomes found in another tab delimited file such as dpgp2_recommended_ibd_masks_v1.txt . We also plan to provide a page where various data resources including fasta versions of dpgp2 can be download. That page will appear at the same time as the validated version of fastq2fasta.pl.

IMPORTANT: Even within the core data set, genomes have >2.5 fold differences in sequencing depth. Higher depth genomes tend to cover certain highly diverse and difficult-to-assemble regions that are not covered in low depth genomes. Hence, summaries of pairwise differences will typically show high depth genomes with higher divergence to any other melanogaster genome, versus comparisons involving a low depth genome. For any population genetic analysis, users must decide how to address this potential source of bias (e.g. by excluding sites not covered in all or most individuals, by discarding some sequence reads from high depth genomes, or devising correction factors).

An update to this release will appear soon, including annotation of regions with putative cosmopolitan admixture (recent gene flow from outside Africa), which appears to be a very important consideration for many of these African population samples.

Following a roughly three month analysis window, a preliminary publication will be prepared. This paper will be more limited in scope than recent efforts, serving as a basic description of the data. For studies that go beyond this paper's focus, researchers have the option of collaborating with members of DPGP or pursuing independent analyses. Please contact John Pool (jpool [at] wisc.edu) if you have any questions or if you want to become involved in coordinated analyses. 


UPDATE - 9.November.2011

This update summarizes the effect of sequencing depth on pairwise distances and redefines the group of genomes the most analyses will focus on.  It also describes the identification of chromosome intervals with cosmopolitan (non-sub-Saharan) ancestry, and includes a new FastA data object with admixture masked. 

Depth-divergence bias, and dividing the core data set into “primary” and “secondary” core genomes:

The DPGP2 data set includes genomes with a wide range of sequencing depths (here, defined as the mean number of reads covering a base at Q31 in a BWA assembly).  Even within the “core” data, genomes range from 18X to 47X depth.  The DPGP2 release 2 statement cautioned that depth was observed to have a strong influence on summaries of pairwise distances, thus potentially biasing most population genetic inferences. 

Figure 1.  Influence of sequencing depth on pairwise divergence (to the reference sequence, or the average divergence to the Zambia-Siavonga population sample) for African core genomes.

A non-linear relationship between depth and divergence is apparent for both series.  Thus, depth has a strong influence on comparisons of a genome to any given sequence, not only the reference genome or other non-African genomes.  This pattern would make sense if lower depth genomes failed to cover certain highly diverse but difficult-to-assemble portions of the genome.  That explanation is supported by the correlation of depth with genomic coverage (Figure 2). 

Figure 2.  Influence of sequencing depth on genomic coverage for African core genomes.

Users of the DPGP2 data will need to evaluate the potential impact of depth-divergence bias on their analyses, and pursue strategies to minimize this bias.  We initially investigated the impact of sample coverage thresholds – restricting analysis to sites where all or nearly all genomes in the full data set (or a subset) to have a Q31 base called.  We found that stringent thresholds did succeed in minimizing depth-divergence bias, but they also introduced a powerful reference sequence bias (in the most extreme case, the non-African reference genome became the closest relative of nearly every African genome).  This result suggested a potentially strong effect of non-reference polymorphisms on the probability of a read mapping, highlighting the need for improved assembly algorithms. 

Figure 1 shows that within the core data, the strongest effect of depth on pairwise divergence is for genomes with <25X depth.  We have now chosen to exclude these genomes from most analyses, defining these 18-25X genomes as the “secondary core” data set.  Core genomes with >25X are referred to as “primary core”, and these designations are now given in the spreadsheet DPGP2_Ind_release2.xls .  An updated target txt file (for use with dpgp_fastq_2_fasta.pl ) is located at dpgp2_primarycore_target.txt .

Even among the >25X primary core genomes, a subtle effect of depth on pairwise distance remains (Figure 1).  One potential remedy for the remaining bias (which we have not yet pursued) would be to discard reads from primary core genomes until all have the same depth (~25X).  Altneratively, parametric or non-parametric correction factors could be developed to compensate for the expected bias (as done, at least in a crude way, for the admixture analysis described below).

Inference of cosmopolitan admixture in sub-Saharan genomes:

Initial summaries of the data suggested that cosmopolitan admixture (gene flow from outside sub-Saharan Africa) might be a powerful determinant of genetic variation in many of our sub-Saharan population samples.  Lacking a published method geared to perform the specific analysis of interest – testing for one-way admixture with a single (European) source population but no guaranteed pure African population – we chose to develop our own admixture detection approach.

Admixture detection begins with two reference panels, a “non-African panel” consisting of 8 France primary core genomes, and an “African panel” consisting of 22 Rwanda (RG) primary core genomes.  Simulations indicated that this method should be quite robust to ~10% admixture in the African panel.  Data was evaluated in windows of 1000 non-singleton SNPs from the African panel, using a Hidden Markov Model (HMM) framework.

State 1 (African ancestry) is defined by comparisons of a single African panel genome to the full non-African panel.  This “average divergence to France” is measured in terms of standard deviations (SDs) from the window mean, based on comparisons of the other African panel genomes to the European panel.  These values form the emissions distribution for state 1, which is centered around 0.

State 2 (non-African ancestry) is defined by comparisons of a single non-African panel genome to the remainder of the non-African panel.  These values are also scaled by the Africa-Europe mean and SD mentioned above.  Since non-African genomes are more similar to each other than to African genomes, these smaller Europe-Europe divergences lead state 2’s emissions distribution to be centered on negative numbers. 

Likelihoods were evaluated for each African genome in each window by comparing its “average divergence to France” to the mean and SD from the remainder of the African panel, and comparing its “SD’s from the mean” to the emissions distributions.  Admixture probabilities were obtained using the forward-backward algorithm, with transition probabilities of 0.01, and a minimum value of 0.005 for each emissions distribution bin (of width 0.1 SDs).  Windows with state 2 probability >0.5 were defined as admixed, and one window was added on each side of an admixed interval (due to uncertainty in precise admixture boundaries).

Even after restricting the admixture analysis to genomes with >25X depth, we still noted a modest effect of depth on admixture inferences.  We therefore implemented a crude empirical “correction factor” for each genome to account for observed differences in pairwise divergence values.  Ultimately, we wish to know the effect of a genome’s depth (and other data quality factors) on divergence-to-France.  However, genomes also vary in their divergence-to-France based on admixture levels.  Therefore, divergence-to-Rwanda was used as a proxy.  A genome’s average divergence to (the rest of) the RG sample was computed for each chromosome arm, and scaled by the average divergence-to-Rwanda across all (other) RG genomes.  All divergence-to-France calculations (for emissions distributions and likelihoods) involving the genome in question were then multiplied by the reciprocal of this number, to correct for the average difference in divergence properties between this genome and the African panel.  After implementing this approach, no obvious effect of depth on admixture inferences was observed.

Simulations in which admixture was added to published demographic models suggested very good power to detect cosmopolitan admixture in sub-Saharan populations, even for rather small window sizes.  By comparison, the empirical data had somewhat more overlap between the state 1 and state 2 emissions distributions, which led us to increase window size to that given above (which corresponds to roughly 60kb windows on average). 

The method was applied iteratively for the African panel, in order to remove admixed regions from it and refine admixture estimates.   The first round of emissions distributions used the full RG data (22 primary core genomes), with only identical-by-descent regions filtered.  RG intervals estimated to be admixed in this round were then filtered out from the second round of emissions distributions (thus removing non-African comparisons from our state 1 distribution).  Likelihoods were always run on the unfiltered data, and admixture calls updated after the second round.  A third round of emissions distributions was then created, which served as the basis for final admixture calls in the RG sample and for other African samples.

Our use of the Rwandan RG sample as an African panel, even when evaluating admixture in other sub-Saharan samples, was out of necessity – we lack other African samples of reasonable size.  Fortunately, the RG sample has a genetically intermediate position among sub-Saharan populations.  And aside from the effects of admixture, other African samples are unlikely to have much closer relationships to cosmopolitan samples than RG does.  The modest structure that does exist between RG and other samples may render some admixture estimates less accurate.  In general, though, the method seemed to peform reasonably well on all primary core data. 

Results of the admixture analysis suggest that admixture proportions vary dramatically both among populations and within populations. 

Figure 3.  Admixture proportions vary dramatically from one population sample to the next, even within small geographic regions, as illustrated by results from Uganda (UG vs. UM) and Zambia (ZI vs. ZL/ZO)

Figure 4.  Admixture proportions vary dramatically within populations, as illustrated by the 22 primary core genomes from the RG sample. 

Most admixture intevals were on the megabase scale, suggesting a very recent onset of large-scale admixture.  A recent onset of admixture might partially explain the profoundly non-equilibrium patterns shown in Figures 3 and 4.  However, non-neutral explanations (such as an advantage of cosmopolitan genotypes in urban locations) may be required to account for the rapid rise of admixture in remote African locations.  Additional mechanisms may be required to account for the persistance of admixture-free and majority-admixture genomes in the same towns (Figure 4).  Processes that could slow down recombination between these individuals include inversions, mating preferences, and differences in microhabitat preference or fitness. 

DPGP2 UPDATE - 31.Jan.2012

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 

Filtering of heterozygous calls: DNA ambiguity characters were previously present at putatively heterozygous sites in FastQ and FastA files. These sites have now been masked to 'N' in the new files, which are otherwise unchanged. They are unlikely to reflect true heterozygosity, but could result from copy number polymorphism.

Annotation of cosmopolitan admixture in sub-Saharan genomes: Admixture intervals were defined as described above for all African genomes in the primary core data set.  For many population genetic analyses, it may be critical to minimize the admixture present, so the data conforms more closely to assumptions concerning random mating in an unstructured population.  We therefore provide two data objects to facilitate the analysis of admixture-free portions of the primary core data set:

A list of admixture intervals for all African primary core genomes, formatted for use with the  dpgp_fastq_2_fasta.pl conversion script, to generate an admixture-filtered FastA file from the unfiltered FastQ data.  Admixture intervals are listed in one based closed interval format. For example, 1 10 indicates mask the first 10 bp.

dpgp2_v2.ID5.primarycore.nohets.masked.fasta.bz2  (2.5 Gb)
A Q31 FastA file for the primary core data set, with putatively admixed regions of African genomes filtered to “N”.  Other previously described filters, such as long-range identity-by-descent (IBD) remain in place as well.  Checksum: "MD5 (dpgp2_v2.ID5.primarycore.nohets.masked.fasta.bz2) = b697cfb1a28d27bb1d9fcb0275f29fff".

We do not currently plan to offer admixture estimates for the secondary core and addendum data sets.  Analysis of these genomes is “at your own risk” regarding admixture and depth-divergence bias.

DPGP2 UPDATE - 24.Dec.2012

Updated assemblies (DPGP2 v3) were created by replacing the deprecated samtools pileup consensus sequences with the more current samtools mpileup consensus sequences. We recommend using these assemblies for their reduced ascertainment bias.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 

                  all 139 genomes of DPGP2 listed in DPGP2_Ind_release2.xls

each assembly can be download from here
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 

Updated assemblies: As described in Pool et. al (2012) paired end reads were mapped to the BDGP Release 5 reference sequence using bwa version 0.6.1. The best alignment for each input read was kept, including those with low mapping quality. The resulting paired end bam were then run through the mpileup program from samtools version 0.1.18 to create an intermediate vcf file. After filtering, each vcf file was converted to a fastq file using the method implemented in the vcf2fq routine of the vcftools.pl program provided with the samtools package. Our filtering consisted of removing sites with heterozygous alleles, sites with an RMS mapping quality below 10, consensus indels, as well as sites within three base pairs of these consensus indels. We found that these steps plus minimum read depth of 7 were helpful for increasing data quality and reducing ascertainment bias.

DPGP2 UPDATE - 18.January.2013

Update (correction) of assemblies (DPGP2 v3)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 

Three corrupt fastq files from DPGP2 v3 have been replaced:




Replace previous versions of these files (dated April 4 2012) with the files dated Jan 18 2013, available here .