Click here to go directly to Archive

Update

The first assembly has now been upgraded, the following enhancements were made:

Introduction

The Drosophila simulans genome project has two central goals: high quality "consensus" sequences for each line and substantial data on population genomic variation. Because of its evolutionary proximity (low divergence) to D.melanogaster, D.simulans shotgun reads can be efficiently assembled on a scaffold, the high quality melanogaster reference sequence. Plasmid libraries from 7 inbred lines of D. simulans were sequenced (approximately 1-4 X coverage each) at the Washington University Genomic Sciences Center.

The primary goal of this assembly is to provide a resource for the study of sequence variation. Collectively referred to as the D.simulans syntenic assembly, the assembly is in fact not one but several. Each component line is individually assembled to a D.melanogaster backbone. Currently the accompanying FASTA file utility pads the individual lines to be consistent with their syntenic D.melanogaster coordinates. This makes creating multiple alignments like the one below trivial.

Please communicate comments and questions to the simyak email list (simyak@ucdavis.edu).

Assembly

Processing the Traces

The sequence data for this assembly were retrieved from the NCBI trace archive on 4/1/2005. The reads from the Trace DB are quality trimmed prior to assembly based on their phred-score derived error probability. These error probabilities are used to trim the read back to the longest contiguous interval with an average probability of error less than 0.005. Each end is then examined and trimmed until its terminal 10 bp has an average probability of error less than 0.005. If the read is less than 50bp after this process it is discarded.
Line           Reads         Total Trimmed Length
W501           481,632       354,848,599
W501+fosmids   580,276       419,041,009
C1674          273,460       171,870,034
MD106          266,561       184,083,646
MD199          282,268       193,921,963
NEWC           269,264       197,296,557
SIM4           269,770       187,906,486
SIM6           318,062       235,344,950

Processing the D.melanogaster Genome

The Release 4 Drosophila melanogaster sequence and it's corresponding flybase annotation was used as the anchor genome in this assembly. The genome is preprocessed to identify all regions which would be misleading or otherwise non-discriminating in determining the best location for a read pair. This preprocessing involves identifying the following regions:
  • Transposons [BDGP]
  • Simple Repeats [BDGP]
  • Duplicate 24mers [Determined by hashing the euchromatic genome]
  • Heterochromatic Regions These regions were filtered based on manual examination of the density of annotated repetitive sequence in the centromere and telomere proximal regions of the five large arms. The transition from the "typical" euchromatic density of large repeats to the typical "beta heterochromatic" pattern is obvious. The "euchromatic/heterochromatic boundaries" were drawn roughly at the edges of the first annotated gene within each euchromatic arm.
     4  Full chromosome considered heterochromatic
     X  1 to 12748 AND  19740624 to END
     2L 21307129 to END
     2R 1 to 2014072 AND 20762874 to END
     3L 22436835 to END
     3R 1 to 478547
    
The remaining sequence constitutes the primary template for the assembly. The following table includes the original size of the chromosome as well as the regions removed during processing.
                   Transposons+    Trimmed
       Size        Repeats         Heterochromatic Seq.
2R     20766785	    1678324         2017983
2L     22407834     1748693         1100705
3R     27905053	    1428700          478547
3L     23771897	    1703271         1335062
X      22224390	    2215960         2496514
TOTAL  117075959    8774948         7428811

Aligning Reads

Trimmed read pairs from the trace archive are mapped to the processed release 4 euchromatic sequence of Drosophila melanogaster using a two phase approach. The first phase is a statistical hash based approach, currently NCBI MEGABLAST, to find all significant HSPs for a read pair. A critical aspect of the hash based approach is that it leverages soft masking eliminating non-discriminating genomic regions identified above from the seeding phase of HSP determination. The resulting HSPs are then clustered and chained before being processed by a variant of the Smith Waterman algorithm. This critical post processing does a number of things:
  • A single optimal colinear path and corresponding score is derived from multiple HSPs
  • Match, mismatch, and affine gap parameters are optimized to mel-sim divergence
  • Error probabilities given by the phred scores can be utilized at this phase
  • Transposons, vector, and repeats in simulans are not forced to align
  • Reads that map to a unique unambiguous location consistent with their clone mate pair are identified. Following this a precise multiple alignment of the reads in each contig is computed. At this point the derived multiple alignment is optimal with respect to the pairwise D.melanogaster alignments. To create a multiple alignment optimal with respect to a D.simulans line the melanogaster sequence is removed and the alignment is "re-aligned". Each component sequence is realigned to an updated consensus in a round robin manner until consensus convergence. The consensus sequence and quality scores are then determined.

    Consensus and Quality Scores

    The consensus sequence (for a line) is derived from the multiple alignment. For each column corresponding to a melanogaster base a simple likelihood model is used to determine the quality score for each of the four bases, the highest score determines the consensus base. The quality scores are calculated as
       -10*log( 1 - Probability base is correct )
    
    To compute the probability a base call is correct we assume each read is an observation of a random variable with equal likelihoods for all four bases with some probability of error. From the definition of a phred score the probability of error for a particular observed call will be
        10^( phred score / -10 ).
    
    We also assumed that if a base is in error it is equally likely to be any one of the three other bases. Then, for a given position, if we call an A, Bayes theorem implies the probability that the call is correct is:
        Pr[ A is correct ] =
        [ Pr[A] * Pr[ Observations | A is correct ] ] / Pr[ Observations]
    
    Where:
        Pr[A] = 1/4
    
        Pr[ Observations | A is correct ] =
        likelihood of A observations being correct and non-A
        observations being incorrect
    
        Pr[Observations] =
        likelihood of seeing observed values given frequency
        and error rates
    
    Quality scores are truncated at 90. High quality discrepancies, principally from residual heterozygosity and erroneous sequence, are also reported at this phase. The sequences for all lines were analyzed for windows with significant amounts of high quality discrepancies, e.g. much more than expected by residual heterozygosity. These windows were found to contain multiple genomes due to closely homologous or repetative sequence and were removed from the assembly.

    Some Assembly Coverage, Quality, and Annotation

    Line           Final Reads        Assembled BP
    W501+Fosmids   455,971            92,981,820
    C1674          227,416            66,813,132
    MD106          218,754            68,010,291
    MD199          245,536            72,260,928
    NEWC           161,057            51,713,231
    SIM4+6         411,015            95,159,421
    
    Combined Coverage All Lines
    Chromosome      BP
    2L              20159116
    2R              17882594
    3L              21374937
    3R              26272369
    X               18374371
    TOTAL          104,060,000 
    
    SNPs
    TOTAL SITES 1,223,978
    
    D.melanogaster Divergences
    TOTAL SITES 5,206,817
    
    INDELs
    D.simulans insertions  2394246 bp
    D.simulans deletions   1913727 bp
    
    Coverage statistics for the syntenically assembled D. simulans genomes. Both cumulative and combined coverage are presented illustrating the enhanced information gathering power of light-shotgun sequencing.

    D.melanogaster Shadowed Annotation and Conserved Gene Models
    
    All transcripts have greater than 50% coverage. Splicing, initiation, 
    termination, and reading frame were verified to be considered conserved.  
    
    GENES WITH FULL LENGTH CDNA
    From the BDGP Gold Collection 
    Line   Genes Conserved 
    c1674  2688  2622 98%
    md106  2644  2589 98%
    md199  3191  3125 98%
    newc   1634  1592 97%
    sim4+6 4108  3997 97%
    w501   3938  3825 97%
    
    ALL GENES
    Line   Genes Conserved
    c1674  6945  6534  94%
    md106  6891  6499  94%
    md199  8166  7688  94%
    newc   4246  4004  94%
    sim4+6 10920 10155 93%
    w501   10433 9654  93%
    
    Note: For technical reasons Sim4 and Sim6 were combined during
    the alignment and assembly process and a single Sim4+6 consensus 
    was called. Early investigation into heterozygosity indicated the
    lines were likely mixtures of eachother.
    
    Note: Simulans indels are being made available on request.
    

    Downloads

    The primary goal of this assembly is to provide a resource for the study of sequence variation, to that end multiple alignments of the reads in each line are given in addition to their corresponding quality scores and consensus sequence. From this the user can create specific FASTA files with and without indels at the appropriate quality score threshold for their desired application. Since the resulting sequences are aligned to D.melanogaster is also possible to shadow D.melanogaster annotation as well as construct multiple alignments. The above multiple alignment and gene annotations examples of this.

    Click here to go to Archive

    A Note on File Format

    The multiple alignments of individual lines were distilled to a format we call VMA, for vertical multiple alignment. The format consists of the following tab delimited columns:

    We are also providing a program VMA2FASTA2 to convert formatted multiple alignments to FASTA files given a quality threshold.

    Link to VMA2FASTA2

          VMA2FASTA    -i input vertical multiple alignment .vma
                       -o prefix for output files
                       -t [default 15] score threshold