The first assembly has now been upgraded, the following enhancements were made:
- Syntenic read alignment was updated to use a species specific 6 parameter substitution model.
- Additional read filters were added to reduce the internal discrepancy rate from aligned paralogous sequence and improve the overall confidence in the assembled sequence.
- Both ends of a clone were required to be uniquely mapped to the genome.
- Clone mate pair distance was strictly observed. Mate pairs were required to be between 1500 and 6000 bp apart.
- Reads were also compared to a database of Drosophila repeats and transposons to ensure placement was based on all available sequence not simply the repetitive content.
- A kernel method was used to detect regions of the assembly that likely contained multiple copies of repeats or paralogs. These were removed.
- Assemblies with and without indels are provided.
The Drosophila simulans genome project has two central goals: high quality "consensus" sequences for each line and substantial data on population genomic variation. Because of its evolutionary proximity (low divergence) to D.melanogaster, D.simulans shotgun reads can be efficiently assembled on a scaffold, the high quality melanogaster reference sequence. Plasmid libraries from 7 inbred lines of D. simulans were sequenced (approximately 1-4 X coverage each) at the Washington University Genomic Sciences Center.
The primary goal of this assembly is to provide a resource for the study of sequence variation. Collectively referred to as the D.simulans syntenic assembly, the assembly is in fact not one but several. Each component line is individually assembled to a D.melanogaster backbone. Currently the accompanying FASTA file utility pads the individual lines to be consistent with their syntenic D.melanogaster coordinates. This makes creating multiple alignments like the one below trivial.
Please communicate comments and questions to the simyak email list (firstname.lastname@example.org).
Processing the TracesThe sequence data for this assembly were retrieved from the NCBI trace archive on 4/1/2005. The reads from the Trace DB are quality trimmed prior to assembly based on their phred-score derived error probability. These error probabilities are used to trim the read back to the longest contiguous interval with an average probability of error less than 0.005. Each end is then examined and trimmed until its terminal 10 bp has an average probability of error less than 0.005. If the read is less than 50bp after this process it is discarded.Line Reads Total Trimmed Length W501 481,632 354,848,599 W501+fosmids 580,276 419,041,009 C1674 273,460 171,870,034 MD106 266,561 184,083,646 MD199 282,268 193,921,963 NEWC 269,264 197,296,557 SIM4 269,770 187,906,486 SIM6 318,062 235,344,950
Processing the D.melanogaster GenomeThe Release 4 Drosophila melanogaster sequence and it's corresponding flybase annotation was used as the anchor genome in this assembly. The genome is preprocessed to identify all regions which would be misleading or otherwise non-discriminating in determining the best location for a read pair. This preprocessing involves identifying the following regions:The remaining sequence constitutes the primary template for the assembly. The following table includes the original size of the chromosome as well as the regions removed during processing.
- Transposons [BDGP]
- Simple Repeats [BDGP]
- Duplicate 24mers [Determined by hashing the euchromatic genome]
- Heterochromatic Regions These regions were filtered based on manual examination of the density of annotated repetitive sequence in the centromere and telomere proximal regions of the five large arms. The transition from the "typical" euchromatic density of large repeats to the typical "beta heterochromatic" pattern is obvious. The "euchromatic/heterochromatic boundaries" were drawn roughly at the edges of the first annotated gene within each euchromatic arm.4 Full chromosome considered heterochromatic X 1 to 12748 AND 19740624 to END 2L 21307129 to END 2R 1 to 2014072 AND 20762874 to END 3L 22436835 to END 3R 1 to 478547Transposons+ Trimmed Size Repeats Heterochromatic Seq. 2R 20766785 1678324 2017983 2L 22407834 1748693 1100705 3R 27905053 1428700 478547 3L 23771897 1703271 1335062 X 22224390 2215960 2496514 TOTAL 117075959 8774948 7428811
Aligning ReadsTrimmed read pairs from the trace archive are mapped to the processed release 4 euchromatic sequence of Drosophila melanogaster using a two phase approach. The first phase is a statistical hash based approach, currently NCBI MEGABLAST, to find all significant HSPs for a read pair. A critical aspect of the hash based approach is that it leverages soft masking eliminating non-discriminating genomic regions identified above from the seeding phase of HSP determination. The resulting HSPs are then clustered and chained before being processed by a variant of the Smith Waterman algorithm. This critical post processing does a number of things:
A single optimal colinear path and corresponding score is derived from multiple HSPs Match, mismatch, and affine gap parameters are optimized to mel-sim divergence Error probabilities given by the phred scores can be utilized at this phase Transposons, vector, and repeats in simulans are not forced to align
Reads that map to a unique unambiguous location consistent with their clone mate pair are identified. Following this a precise multiple alignment of the reads in each contig is computed. At this point the derived multiple alignment is optimal with respect to the pairwise D.melanogaster alignments. To create a multiple alignment optimal with respect to a D.simulans line the melanogaster sequence is removed and the alignment is "re-aligned". Each component sequence is realigned to an updated consensus in a round robin manner until consensus convergence. The consensus sequence and quality scores are then determined.
Consensus and Quality ScoresThe consensus sequence (for a line) is derived from the multiple alignment. For each column corresponding to a melanogaster base a simple likelihood model is used to determine the quality score for each of the four bases, the highest score determines the consensus base. The quality scores are calculated as-10*log( 1 - Probability base is correct )To compute the probability a base call is correct we assume each read is an observation of a random variable with equal likelihoods for all four bases with some probability of error. From the definition of a phred score the probability of error for a particular observed call will be10^( phred score / -10 ).We also assumed that if a base is in error it is equally likely to be any one of the three other bases. Then, for a given position, if we call an A, Bayes theorem implies the probability that the call is correct is:Pr[ A is correct ] = [ Pr[A] * Pr[ Observations | A is correct ] ] / Pr[ Observations]Where:Pr[A] = 1/4Pr[ Observations | A is correct ] = likelihood of A observations being correct and non-A observations being incorrectPr[Observations] = likelihood of seeing observed values given frequency and error ratesQuality scores are truncated at 90. High quality discrepancies, principally from residual heterozygosity and erroneous sequence, are also reported at this phase. The sequences for all lines were analyzed for windows with significant amounts of high quality discrepancies, e.g. much more than expected by residual heterozygosity. These windows were found to contain multiple genomes due to closely homologous or repetative sequence and were removed from the assembly.
Some Assembly Coverage, Quality, and Annotation
Line Final Reads Assembled BP W501+Fosmids 455,971 92,981,820 C1674 227,416 66,813,132 MD106 218,754 68,010,291 MD199 245,536 72,260,928 NEWC 161,057 51,713,231 SIM4+6 411,015 95,159,421Combined Coverage All Lines Chromosome BP 2L 20159116 2R 17882594 3L 21374937 3R 26272369 X 18374371 TOTAL 104,060,000 SNPs TOTAL SITES 1,223,978 D.melanogaster Divergences TOTAL SITES 5,206,817 INDELs D.simulans insertions 2394246 bp D.simulans deletions 1913727 bpCoverage statistics for the syntenically assembled D. simulans genomes. Both cumulative and combined coverage are presented illustrating the enhanced information gathering power of light-shotgun sequencing.
D.melanogaster Shadowed Annotation and Conserved Gene Models All transcripts have greater than 50% coverage. Splicing, initiation, termination, and reading frame were verified to be considered conserved. GENES WITH FULL LENGTH CDNA From the BDGP Gold Collection Line Genes Conserved c1674 2688 2622 98% md106 2644 2589 98% md199 3191 3125 98% newc 1634 1592 97% sim4+6 4108 3997 97% w501 3938 3825 97% ALL GENES Line Genes Conserved c1674 6945 6534 94% md106 6891 6499 94% md199 8166 7688 94% newc 4246 4004 94% sim4+6 10920 10155 93% w501 10433 9654 93% Note: For technical reasons Sim4 and Sim6 were combined during the alignment and assembly process and a single Sim4+6 consensus was called. Early investigation into heterozygosity indicated the lines were likely mixtures of eachother. Note: Simulans indels are being made available on request.
The primary goal of this assembly is to provide a resource for the study of sequence variation, to that end multiple alignments of the reads in each line are given in addition to their corresponding quality scores and consensus sequence. From this the user can create specific FASTA files with and without indels at the appropriate quality score threshold for their desired application. Since the resulting sequences are aligned to D.melanogaster is also possible to shadow D.melanogaster annotation as well as construct multiple alignments. The above multiple alignment and gene annotations examples of this.
A Note on File Format
The multiple alignments of individual lines were distilled to a format we call VMA, for vertical multiple alignment. The format consists of the following tab delimited columns:
- Position in D.melanogaster genome.
- D.melanogaster BP
- D.simulans BP
- D.simulans consensus Q-score
- Assembly depth
- # of discrepant alleles
- Vector of D.simulans BPs
- Space delimited vector of corresponding quality scores
We are also providing a program VMA2FASTA2 to convert formatted multiple alignments to FASTA files given a quality threshold.VMA2FASTA -i input vertical multiple alignment .vma -o prefix for output files -t [default 15] score threshold