Resequencing Process Flow

Introduction

The following dataflow describes the procedure by which as many as 17,000 microarrays covering 50 D. melanogaster genomes will be processed.

A good description of the technology as it was used by Perlegen on Mouse can be seen here:

Details on the current state of the LIMS project can be found here

Managing_Genotyping_Data

Diagram

process_diagram_r1.png

Process Details

Re-sequencing Array Design

An array consists of 300MB of "reference" sequence that is 99% identical to the genome of the organism being sequenced. All possible single-nucleotide differences are queried. The technology is reliable as long as the differences are not adjacent to one another; adjacent differences are relatively rare. Because hybridization uses 25bp oligos, repetitive regions of 25bp or longer cannot be reliably determined, so they are filtered from the input.
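As an illustration of the design constraints above, here is a minimal Python sketch. It assumes that the 8 probes per query bp mentioned under Feature Extraction are the 4 possible bases at the center of a 25bp oligo on each of the 2 strands; that breakdown, and all function names, are assumptions made for illustration and not the actual design software. The repeat filter is deliberately simplistic: it flags exact duplicate 25-mers on one strand only.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
BASES = "ACGT"
PROBE_LEN = 25            # oligo length stated in the text
CENTER = PROBE_LEN // 2   # index of the queried base within the oligo


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))


def probes_for_position(reference, pos):
    """Return 8 hypothetical probes (4 bases x 2 strands) querying reference[pos]."""
    start = pos - CENTER
    end = start + PROBE_LEN
    if start < 0 or end > len(reference):
        raise ValueError("position too close to the end of the reference")
    window = reference[start:end]
    probes = []
    for base in BASES:
        # Substitute each possible base at the center of the 25-mer.
        forward = window[:CENTER] + base + window[CENTER + 1:]
        probes.append(("forward", base, forward))
        probes.append(("reverse", base, reverse_complement(forward)))
    return probes


def repeated_kmer_starts(reference, k=PROBE_LEN):
    """Return start positions of k-mers that occur more than once in reference."""
    n = len(reference) - k + 1
    counts = Counter(reference[i:i + k] for i in range(n))
    return [i for i in range(n) if counts[reference[i:i + k]] > 1]
```

Positions returned by repeated_kmer_starts would be candidates for removal from the input before probes are laid out.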

Inputs:

Process:

Output:

Target Preparation + QC

A target sample of DNA to be sequenced is prepared for hybridization to the chip. There are multiple possible sources for this DNA; however, there is a single preparation protocol.

Array Hybridization

The target sample is placed on the appropriate array design and hybridized overnight. At this stage, when handling multiple samples, it is important to make sure that each target is placed on its matching array design. The arrays are barcoded to prevent mix-ups. A fluorescent label that is specific to the target DNA is then applied to the array (staining).
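The barcode check could be automated along the following lines. This is only a sketch: the manifest format (a mapping from array barcode to the sample intended for that array) and the function name are hypothetical, and nothing here reflects the actual LIMS.

```python
def check_assignments(expected, scanned):
    """Compare scanned (barcode, sample) pairs against a hypothetical manifest.

    expected: dict mapping array barcode -> sample id intended for that array
    scanned:  list of (array barcode, sample id) pairs recorded at the bench
    Returns a list of (barcode, sample, reason) tuples, empty if all match.
    """
    problems = []
    for barcode, sample in scanned:
        if barcode not in expected:
            problems.append((barcode, sample, "unknown array barcode"))
        elif expected[barcode] != sample:
            reason = "expected sample " + expected[barcode]
            problems.append((barcode, sample, reason))
    return problems
```

An empty result means every scanned array carried the sample the manifest assigned to it.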

Data Acquisition

The fluorescently labeled, hybridized array image is acquired using a flying-objective scanning laser. The scanning is somewhat destructive due to photobleaching.

Input:

Output:

A very large image (1GB).

Feature Extraction + QC

Note: From this point on, once the data is acquired, the processing is non-destructive and can and will be repeated as algorithms and analysis techniques improve.

Feature extraction is the process by which the array image is segmented into 8 probes per query bp and the signal intensity is measured.
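To make the segmentation step concrete, the sketch below averages pixel intensities over a regular grid of square features. It assumes the image is already aligned to the grid and that every feature is feature_size pixels on a side; real feature extraction also has to handle grid alignment and outlier pixels, which are omitted here. The function name and the input layout are hypothetical.

```python
def extract_features(image, feature_size):
    """Return the mean intensity of each feature_size x feature_size cell.

    image: list of rows, each row a list of pixel intensities
    """
    n_rows = len(image) // feature_size
    n_cols = len(image[0]) // feature_size
    features = []
    for r in range(n_rows):
        row_means = []
        for c in range(n_cols):
            # Collect every pixel that falls inside this feature's cell.
            pixels = [
                image[y][x]
                for y in range(r * feature_size, (r + 1) * feature_size)
                for x in range(c * feature_size, (c + 1) * feature_size)
            ]
            row_means.append(sum(pixels) / len(pixels))
        features.append(row_means)
    return features
```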

Input:

Output:

Hybridization Model / Base Calling

This is the process by which signal intensities are converted into sequenced bases. The statistical model is described in High-throughput variation detection and genotyping using microarrays. Cutler DJ, Zwick ME, et al. Genome Res. 2001. The hybridization model is somewhat complex and is frequently updated to incorporate additional parameters, so it is desirable for the data provenance to be known and for re-run capability to be available.
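For orientation only, the sketch below shows the general shape of a base call: 8 intensities in, one base out. It is NOT the statistical model of Cutler et al.; it is a naive stand-in that sums the two strands and calls the brightest base, returning "N" when the runner-up is too close. The input layout, the function name, and the min_ratio threshold are all assumptions.

```python
def call_base(intensities, min_ratio=1.5):
    """Naive base call from 8 probe intensities (not the published model).

    intensities: dict mapping (strand, base) -> signal, where strand is
    "forward" or "reverse" and base is one of "A", "C", "G", "T"
    """
    # Combine the evidence from both strands for each candidate base.
    totals = {
        base: intensities[("forward", base)] + intensities[("reverse", base)]
        for base in "ACGT"
    }
    ranked = sorted(totals, key=totals.get, reverse=True)
    best, second = ranked[0], ranked[1]
    if totals[best] <= 0:
        return "N"  # no signal at all
    if totals[second] > 0 and totals[best] / totals[second] < min_ratio:
        return "N"  # the top two bases are too close to call
    return best
```

Because the real model changes over time, keeping the calling step behind a single function like this, and recording which version produced each call, is one way to support the re-run capability described above.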

Input:

Output:

Genetic Analysis + QC

This is the stage in which the genotype of the organism is stored, analyzed for contamination, and made available for public consumption.

Note: An interesting open data representation problem is storing many highly identical genome sequences in a way that is reasonable for downstream analysis and viewing. While we have only 50 individuals for this project, related projects and proposals for smaller target regions have thousands of individuals.
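One possible representation, sketched here as an assumption and not as the project's chosen design, is to store each individual as its differences from the shared reference. Since the sequences are highly identical, each individual then costs only as much as its variant sites. The sketch handles substitutions only, which matches what the arrays query; the function names are hypothetical.

```python
def encode(reference, sample):
    """Return {position: base} for every site where sample differs from reference."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be the same length")
    return {
        i: s for i, (r, s) in enumerate(zip(reference, sample)) if r != s
    }


def decode(reference, diffs):
    """Rebuild the full sample sequence from the reference and its differences."""
    bases = list(reference)
    for pos, base in diffs.items():
        bases[pos] = base
    return "".join(bases)
```

Sites where the base call failed could be stored the same way, for example as "N" entries, so that missing data is not mistaken for agreement with the reference.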

last edited 2006-05-11 04:24:56 by KristianStevens