50 D. melanogaster Genomes Project

 

 

Goals


The specific goals of the first phase of the 50 Genomes Project were twofold:



  1. Sequence 7 Mbp of 50 Drosophila melanogaster genomes using custom Affymetrix high-density oligonucleotide arrays.


  2. Establish the necessary infrastructure to complete the sequencing of the entire unique portion of all 50 genomes in the next funding cycle.


  * The initial stages of the DPGP were supported under NIH:HG 02942-10A1.



  

Background

In early 2007 it was clear that the emerging short-read sequencing technologies, specifically Illumina's Solexa Genome Analyzers, offered the most economical, efficient, and robust route to completing our short-term goals. By early 2008 we were well on our way to completing the initial long-term goal of the Drosophila Population Genomics Project (DPGP), 50 entire genomes, within the last year of the original funding cycle.

Total Coverage as of Feb 2008 = 320X

This success reflects the technological advances in ultra-high-throughput resequencing. More importantly, our recent experience and the growing potential of new sequencing platforms have inspired another ambitious proposal: resequencing hundreds of Drosophila melanogaster genomes. By providing the research community with deep population sampling based on the high-throughput platforms, our project will foster the development of new theoretical ideas, talent, and tools. These can be combined with the talent and creativity of the Drosophila research community to advance ideas and applications with potential impact on human population genomics.

An isogenic (or inbred) Drosophila melanogaster genome sequenced to 10X with the Illumina GA has a low rate of missing data. With a single run of the instrument we routinely achieve 98% or higher coverage of the non-repetitive genome at Q40 or higher for such isogenic genomes (see figure).
One major goal is to use public database resources as much as possible to disseminate our sequences in a timely manner. We are placing the raw data in the NCBI Short Read Trace Archive as quickly as our resources allow.

Along with the large human population genomics sequencing community, we are working on creating solid and serviceable genomic assemblies and associated publications. As these assemblies emerge, we will post them on this website and submit them to public databases.

  

50 Genomes - Release 0.5



After completing the data collection phase of the 50 genomes project at the end of 2008, we released a "PREVIEW" version of the assemblies in April 2009.

README - DPGP D. melanogaster Solexa Assemblies (Release 0.5)

This is the README file for the preview release of the initial sample of Drosophila melanogaster genomes sequenced by the DPGP using first-generation (single-end, 36 bp) Solexa/Illumina technology (Bentley, et al., 2008 Nature 456:53-59) and assembled using maq 0.6.8 (Li, Ruan and Durbin, 2008 Genome Res. 18: 1851-1858). This data preview is intended to show clearly the scope and quality of the data. Release 1.0 will be a reference dataset. The sample consists of 39 inbred genomes from Trudy Mackay's set of inbred lines sampled in Raleigh, NC (Jordan, et al., 2007 Genome Biology 8: R172. doi:10.1186/gb-2007-8-8-r172) and a set of sequenced chromosomes (8 chrXs, 6 chr2s and 5 chr3s) from a sample of Malawi isofemale lines (Begun and Lindfors, 2005 Mol. Biol. Evol. 22: 2010-2021) that were inbred using balancers. The "raw data" are available in the NCBI Short Read Trace Archive. The Release 0.5 data are in the form of fasta files, one for each of the major chromosome arms of each sampled genome. The average coverage of the unique portions of all these genomes is >10X. The called bases are those with a consensus (Solexa) nominal quality score >= 30. Bases in repetitive sequences or in regions of residual heterozygosity (in the inbred lines) are filtered, i.e. not called and set to "N".
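
Because uncalled bases are written as "N", the fraction of called bases per chromosome arm can be estimated directly from the fasta files. The following is a minimal Python sketch, not part of the release; the function name `called_fraction` and the toy record are hypothetical, and it assumes ordinary (possibly line-wrapped) fasta records.

```python
from typing import Dict, Iterable


def called_fraction(fasta_lines: Iterable[str]) -> Dict[str, float]:
    """Return, for each fasta record, the fraction of bases that are not 'N'."""
    totals: Dict[str, int] = {}
    called: Dict[str, int] = {}
    name = None
    for raw in fasta_lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Use the first word of the header as the record name.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            totals[name] = 0
            called[name] = 0
        elif name is not None:
            seq = line.upper()
            totals[name] += len(seq)
            called[name] += len(seq) - seq.count("N")
    return {n: (called[n] / totals[n] if totals[n] else 0.0) for n in totals}


if __name__ == "__main__":
    # Toy record (hypothetical): 12 bases, 4 of them masked.
    toy = [">toy_arm", "ACGTNNNN", "ACGT"]
    print(called_fraction(toy))
```

For a real file, the same function can be applied to an open file handle, since it only iterates over lines.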

Basic statistics and examples are available HERE.

The download tarball can be found here: dpgp_solexa_preview.tar.gz.

Release 1.0, which will include calibrated quality values for each called base (fasta and qfasta files), annotation of indels where possible, and additional filtering of low quality basecalls, will appear shortly. That version is a snapshot that will form the basis of a paper describing the collection, assembly, and initial analyses (another genome paper!). We anticipate that many researchers will use Release 1.0 data for a wide diversity of purposes in a timely fashion. In deference to the academic careers of the junior colleagues who invested great time and effort in this project and with an interest in a coherent and efficient presentation in the literature of all analyses, we ask that users of these data (both Release 0.5 and Release 1.0) defer publication for six (6) months after the appearance of Release 1.0. Redundant effort, excessive overlap and publication difficulties must be balanced against independent and creative analyses that happen to coincide. The DPGP participants are ready to discuss the emerging content of the "genome paper" and to facilitate coordination of efforts.

  

50 Genomes - Release 1.0



At this point (September 2009) we have carefully identified and isolated a number of quality problems (e.g., mislabelling, residual heterozygosity and identity by descent). The error in the base calls of the assemblies has been extensively modeled using data from the reference genome, and a more accurate determination of the error in the maq assemblies has been applied.

README - DPGP D. melanogaster Solexa Assemblies (Release 1.0)

This is the README file for Release 1.0 of the initial sample of Drosophila melanogaster genomes sequenced by the DPGP using first-generation (single-end, 36 bp) Solexa/Illumina technology (Bentley, et al., 2008 Nature 456:53-59) and assembled using maq 0.6.8 (Li, Ruan and Durbin, 2008 Genome Res. 18: 1851-1858). The sample consists of 37 inbred genomes from Trudy Mackay's set of inbred lines sampled in Raleigh, NC (Jordan, et al., 2007 Genome Biology 8: R172. doi:10.1186/gb-2007-8-8-r172) and a set of sequenced chromosomes (7 chrXs, 6 chr2s and 5 chr3s) from a sample of Malawi isofemale lines (Begun and Lindfors, 2005 Mol. Biol. Evol. 22: 2010-2021) that were inbred using balancers. Regions of repeated sequence are filtered (set to "N"). The "raw data" are available in the NCBI Short Read Trace Archive.

Release 1.0 is in the form of FASTQ files, one for each of the major chromosome arms (inbred and sequenced) from each sampled genome. The average coverage of the unique portions of all these genomes is over 10X.

This release adds the following enhancements to Release 0.5:

  • Releasing the data in Sanger's FASTQ format provides quality scores. Note that Sanger's format differs from Illumina's in that it allows for the representation of an extended range of quality scores. Quality score 0 is represented as ASCII 33.
  • Initial validation results suggested that the raw consensus quality scores were overly optimistic. We used a more extensive validation to 1) estimate improved quality scores and 2) lower the quality scores of bases affected by systemic errors. The median real quality score of Release 1.0 is over Phred 50 for almost all chromosome arms. Details of the methods used to derive these quality scores will be provided in the paper describing these data.
  • Regions of residual heterozygosity were present in many of the lines. A Hidden Markov Model, which will be described later, was used to classify and delineate these regions for optional removal. The list of these regions is provided.
  • Similarly, large regions of apparent identity by descent were found in three lines. This annotation is also provided as a separate file.
  • The following Release 0.5 lines were removed for QC purposes: RAL-208, RAL-712, MW25-1.
  • Basic statistics and examples from Release 0.5 are still quite representative and are available HERE.
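
To make the quality encoding concrete: in Sanger FASTQ each quality character encodes a Phred score as its ASCII code minus 33, and a Phred score Q corresponds to a nominal error probability of 10^(-Q/10). The short Python sketch below illustrates only this standard conversion; the function names are hypothetical, and it does not reproduce the calibration used for Release 1.0.

```python
from typing import List


def decode_sanger_quality(qual: str) -> List[int]:
    """Convert a Sanger FASTQ quality string to Phred scores (ASCII code - 33)."""
    return [ord(ch) - 33 for ch in qual]


def error_probability(phred: int) -> float:
    """Nominal probability that a base call with this Phred score is wrong."""
    return 10 ** (-phred / 10)


if __name__ == "__main__":
    # '!' is ASCII 33 (Q0), '?' is Q30, 'I' is Q40, 'S' is Q50.
    scores = decode_sanger_quality("!?IS")
    print(scores)
    print([error_probability(q) for q in scores])
```

Under this scale the median of Phred 50 quoted above corresponds to a nominal error rate of about one in 100,000 called bases.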

    The download tarball can be found here: dpgp_solexa_r1.0.tar. Checksum: dpgp_solexa_r1.0.tar.md5.
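
One way to check a download against the checksum is sketched below in Python. It assumes, without confirmation from this README, that the .md5 file follows the common md5sum layout, with the hexadecimal digest as the first whitespace-separated token; the function names are hypothetical.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5_file(data_path: str, md5_path: str) -> bool:
    """Compare a file's digest with the first token of an md5sum-style file.

    Assumption: the checksum file starts with the hex digest.
    """
    with open(md5_path) as handle:
        expected = handle.read().split()[0].lower()
    return md5_of_file(data_path) == expected


# Usage, once both files have been downloaded to the working directory:
#   matches_md5_file("dpgp_solexa_r1.0.tar", "dpgp_solexa_r1.0.tar.md5")
```

Reading in chunks keeps memory use constant regardless of the size of the tarball.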

    List of regions of residual heterozygosity: dpgp_r1.0_reshet.txt.

    List of large regions of identity by descent: dpgp_r1.0_ibd.txt.
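
Since removal of these regions is optional, users who want to exclude them can mask the listed intervals themselves. The Python sketch below shows the idea on a plain sequence string. The layout of the two region files is not described here, so the sketch takes already-parsed intervals as input and assumes 1-based, inclusive coordinates; both the function name `mask_regions` and that coordinate convention are assumptions to be checked against the actual files.

```python
from typing import Iterable, Tuple


def mask_regions(seq: str, regions: Iterable[Tuple[int, int]]) -> str:
    """Return seq with every (start, end) interval replaced by 'N'.

    Assumption: coordinates are 1-based and inclusive. Intervals that run
    past the ends of the sequence are clipped.
    """
    bases = list(seq)
    for start, end in regions:
        lo = max(start, 1) - 1       # convert to 0-based index
        hi = min(end, len(bases))    # exclusive upper bound after clipping
        for i in range(lo, hi):
            bases[i] = "N"
    return "".join(bases)


if __name__ == "__main__":
    # Mask positions 3 through 5 of a toy sequence.
    print(mask_regions("ACGTACGTAC", [(3, 5)]))
```

The same intervals would also need to be applied to the quality string if the masked sequence is written back out as FASTQ.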

    We are providing an unsupported FASTQ to FASTA converter: fastq_2_fasta.pl.
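
For readers who prefer Python, the conversion can be sketched as follows. This is not the fastq_2_fasta.pl script and makes no claim to match its behavior; the function names and the optional `min_q` threshold are hypothetical. The reader accepts sequence and quality strings wrapped over several lines, on the assumption that consensus FASTQ records for whole chromosome arms may be written that way.

```python
from typing import Iterable, Iterator, Tuple


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) records, allowing wrapped lines."""
    it = iter(lines)
    for raw in it:
        header = raw.strip()
        if not header:
            continue
        if not header.startswith("@"):
            raise ValueError("expected '@' header, got: " + header[:20])
        name = header[1:]
        seq_parts = []
        for raw_seq in it:  # sequence lines run until the '+' separator
            line = raw_seq.strip()
            if line.startswith("+"):
                break
            seq_parts.append(line)
        seq = "".join(seq_parts)
        qual_parts = []
        qual_len = 0
        while qual_len < len(seq):  # quality is as long as the sequence
            try:
                line = next(it).strip()
            except StopIteration:
                raise ValueError("truncated record: " + name) from None
            qual_parts.append(line)
            qual_len += len(line)
        qual = "".join(qual_parts)
        if len(qual) != len(seq):
            raise ValueError("sequence/quality length mismatch: " + name)
        yield name, seq, qual


def fastq_to_fasta(lines: Iterable[str], min_q: int = 0, width: int = 60) -> str:
    """Convert Sanger FASTQ text to fasta, masking bases below min_q as 'N'."""
    out = []
    for name, seq, qual in read_fastq(lines):
        masked = "".join(
            base if ord(q) - 33 >= min_q else "N" for base, q in zip(seq, qual)
        )
        out.append(">" + name)
        out.extend(masked[i:i + width] for i in range(0, len(masked), width))
    return "".join(line + "\n" for line in out)


if __name__ == "__main__":
    # Toy record with wrapped sequence and quality lines.
    toy = ["@toy", "ACGT", "AC", "+", "II5I", "!I"]
    print(fastq_to_fasta(toy, min_q=30), end="")
```

With `min_q=0` the sequence is copied unchanged; a threshold such as 30 is only an illustrative choice, echoing the cutoff used for Release 0.5.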

    Release 1.0 will be the basis of a paper that describes the collection, assembly, and initial population genetics analyses of these genomes (another genome paper!). We anticipate that many researchers will use Release 1.0 data for a wide diversity of purposes in a timely fashion. In deference to the academic careers of the junior colleagues who invested great time and effort in this project and with an interest in a coherent and efficient presentation in the literature of all analyses, we ask that users of these data (both Release 0.5 and Release 1.0) defer publication for six (6) months after the appearance of Release 1.0. Redundant effort, excessive overlap and publication difficulties must be balanced against independent and creative analyses that happen to coincide. The DPGP participants are ready to discuss the emerging content of the "genome paper" and to facilitate coordination of efforts.