Introduction
The following dataflow describes the procedure by which as many as 17,000 microarrays covering 50 D. melanogaster genomes will be processed.
A good description of the technology as it was used by Perlegen on Mouse can be seen here:
Details on the current state of the LIMS project can be found here
Diagram
Process Details
Re-sequencing Array Design
An array consists of 300MB of "reference" sequence that is 99% identical to the genome of the organism being sequenced. All possible single nucleotide differences are queried. The technology is reliable as long as the differences are not adjacent, and adjacent differences are relatively rare. Because hybridization uses 25bp oligos, repetitive regions 25bp or longer cannot be reliably determined, so they are filtered from the input.
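As an illustration of how every single nucleotide difference at one position can be queried, the sketch below builds the probes for a single reference position. It assumes, since this page does not spell it out, that the 8 probes per query bp are the four possible center bases on each of the two strands of a 25bp window. The function names are hypothetical.

```python
# Sketch only: assumes 8 probes = 4 center bases x 2 strands (not stated on this page).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def probes_for_position(reference, pos, probe_len=25):
    """Return (strand, center_base, probe) tuples querying reference[pos]."""
    half = probe_len // 2
    if pos < half or pos + half >= len(reference):
        raise ValueError("position too close to the end of the reference")
    window = reference[pos - half : pos + half + 1]
    probes = []
    for base in "ACGT":
        # Substitute the center base; one of the four matches the reference.
        forward = window[:half] + base + window[half + 1 :]
        probes.append(("forward", base, forward))
        probes.append(("reverse", base, reverse_complement(forward)))
    return probes
```

One of the four forward probes always equals the reference window itself; the other three each represent one possible single nucleotide difference at the center position.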
Input:
A target sequence.
Process:
The target sequence is hashed into 25bp fragments, and duplicates are identified and removed.
Additionally, known annotated repeats are removed from the target sequence (RepeatMasker).
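The duplicate-removal step can be sketched roughly as follows. This is a minimal in-memory illustration, not the production design tool: it counts every 25bp fragment and masks any region covered by a fragment seen more than once. The function names and the use of "N" as a mask character are assumptions, and reverse-complement matches are not considered here.

```python
from collections import Counter


def repeated_kmers(sequence, k=25):
    """Return the set of k-length fragments that occur more than once."""
    counts = Counter(sequence[i : i + k] for i in range(len(sequence) - k + 1))
    return {kmer for kmer, n in counts.items() if n > 1}


def mask_repeats(sequence, k=25, mask_char="N"):
    """Replace every base covered by a repeated k-mer with mask_char."""
    repeats = repeated_kmers(sequence, k)
    masked = list(sequence)
    for i in range(len(sequence) - k + 1):
        if sequence[i : i + k] in repeats:
            masked[i : i + k] = mask_char * k
    return "".join(masked)
```

For a real 300MB target, a dictionary of all fragments would be memory hungry; a sorted or disk-backed index would likely be needed instead.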
Output:
A wafer design file consisting of multiple array designs.
A library file for each array design. These files are required later during the base calling process.
The target sequence is divided into 300MB segments for each array design.
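Dividing the target into per-array segments might look like the sketch below. The helper name is hypothetical, and keeping the start offset with each segment is an assumption made here so that later base calls can be mapped back to target coordinates.

```python
def split_into_segments(sequence, segment_size):
    """Split sequence into consecutive (start_offset, segment) pairs."""
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    return [
        (start, sequence[start : start + segment_size])
        for start in range(0, len(sequence), segment_size)
    ]
```

The last segment may be shorter than the others when the target length is not a multiple of the segment size.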
Target Preparation + QC
A target sample of DNA to be sequenced is prepared for hybridization to the chip. There are multiple possible sources for this DNA; however, there is a single preparation protocol.
Array Hybridization
The target sample is placed on the appropriate array design and hybridized overnight. When handling multiple samples at this stage, it is important to make sure that each target is placed on the matching array design. The arrays are barcoded to prevent mix-ups. A fluorescent label specific to the target DNA is then applied to the array (staining).
Data Acquisition
The image of the fluorescently labeled, hybridized array is acquired using a flying objective scanning laser. The scanning is somewhat destructive due to photobleaching.
Input:
A labeled microarray
Scanning resolution and laser intensity settings. These are also used for subsequent analysis.
Output:
A very large image (1GB).
Feature Extraction + QC
Note: Once the data is acquired, all further processing is non-destructive and can and will be repeated as algorithms and analysis techniques improve.
Feature extraction is the process by which the array image is segmented into 8 probes per query bp and the signal intensity is measured.
Input:
An array image
A library file corresponding to the design
Output:
A segmented array image with each probe addressed.
A statistical description (currently mean and variance) of each probe's signal intensity.
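As a rough illustration of the statistical description, the sketch below computes the mean and variance of the pixels belonging to one probe. It assumes the image is a list of pixel rows and that each probe occupies a square block whose position comes from the library file. The function name, the square-block layout, and the choice of population variance are all assumptions.

```python
from statistics import fmean, pvariance


def probe_statistics(image, row, col, size):
    """Return (mean, variance) of the size x size pixel block at (row, col)."""
    pixels = [
        image[r][c]
        for r in range(row, row + size)
        for c in range(col, col + size)
    ]
    # Population variance is an assumption; the pipeline may use another estimator.
    return fmean(pixels), pvariance(pixels)
```

For a 1GB image, an array library would be a more practical container than nested lists; the arithmetic would be the same.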
Hybridization Model / Base Calling
This is the process by which signal intensities are converted into sequenced bases. The statistical model is described in High-throughput variation detection and genotyping using microarrays. Cutler DJ, Zwick ME, et al. Genome Res. 2001. The hybridization model is somewhat complex and is frequently updated to incorporate additional parameters, so it is desirable for the data provenance to be known and for the analysis to be re-runnable.
Input:
A statistical description (currently mean and variance) of each probe's signal intensity.
A library file corresponding to the design
Output:
The sequence and corresponding error probabilities of the target.
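To make the shape of the input and output concrete, here is a deliberately naive stand-in for a base caller. It is not the statistical model from the cited paper: it simply picks the brightest of the four probes for a position and reports the remaining fraction of the signal as a crude, uncalibrated error score. The function name and the dictionary input format are assumptions.

```python
def call_base(mean_intensity):
    """Toy caller: mean_intensity maps each of A, C, G, T to a probe mean.

    Returns (base, error_score). The score is a heuristic, not the
    error probability produced by the real hybridization model.
    """
    total = sum(mean_intensity.values())
    if total <= 0:
        return "N", 1.0  # no usable signal, so no call
    base = max(mean_intensity, key=mean_intensity.get)
    return base, 1.0 - mean_intensity[base] / total
```

A real caller would also use the probe variances and both strands, which this sketch ignores.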
Genetic Analysis + QC
This is the stage in which the genotype of the organism is stored, analyzed for contamination, and made available for public consumption.
Note: An interesting open data representation problem is storing many highly identical genome sequences in a way that is reasonable for downstream analysis and viewing. While we only have 50 individuals for this project, related projects and proposals for smaller target regions have thousands of individuals.
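One possible representation, sketched here only to illustrate the problem and not as a chosen design, is to store each individual as its differences from the shared reference. Because the arrays query single nucleotide differences, the sketch handles substitutions only; the function names are hypothetical.

```python
def encode_differences(reference, genome):
    """Return {position: base} for every site where genome differs."""
    if len(reference) != len(genome):
        raise ValueError("sequences must be the same length")
    return {
        i: g for i, (r, g) in enumerate(zip(reference, genome)) if r != g
    }


def decode_differences(reference, differences):
    """Rebuild the full genome from the reference and its differences."""
    bases = list(reference)
    for position, base in differences.items():
        bases[position] = base
    return "".join(bases)
```

With sequences that are about 99% identical to the reference, each individual would need roughly one stored entry per hundred bases instead of a full copy.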
