More Effectively Using Scientific Data Management Tools
Specific Emphasis on Association Studies
Introduction
In its widest possible definition, a genotype is the specific genome of an individual. Since the genomes of individuals of the same species are enormously similar, a genotype is most often characterized by cataloging small variations from a reference sequence. The most common form of this variation is the single nucleotide polymorphism (SNP).
The scientific importance of SNPs is enormous. The genotype of an individual, together with environmental effects, determines the phenotype of that individual. The phenotype, broadly defined, is the organism's total physical appearance and constitution. The term is also used to describe a specific manifestation of a trait, such as hair color or eye color. Genotypes provide the symbology or semiotics, while phenotypes describe the underlying semantics.
The medical importance of SNPs arises because disease phenotypes often have a genetic component. See the OMIM database for a catalog of the current state of genetics in medicine. The process of determining the genotype-phenotype association is aptly named the Association Study. Successful whole-genome-scale association studies require genomes from a large number of individuals (500+) to gain statistical inference power. When drawing conclusions about a particular individual, it is important to understand the underlying acquisition biases and error probability.
Because of the scientific and medical benefits of understanding genotype/phenotype interaction, most large-scale proposals currently being funded by NIH/NHGRI are directed towards answering this question. Most of them will generate large amounts of genotype data from diverse methodologies.
This project will motivate and demonstrate the use of aspects of scientific data management as a tool to acquire, process, and store the large quantities of genotype data being produced, in such a way that the data can be better utilized by the scientific community. There is a great need to store and analyze genotyping data from diverse acquisition techniques in a common shared semantic framework. Because of the nature of many of the conclusions that will be drawn from this data, there is also a great need for the end user to access and understand the underlying provenance of each datum and its associated acquisition biases and error probabilities.
Outline and Scope
(A bit ambitious at this point)
The utility of semantic types and semantic type propagation will be demonstrated in the context of genotyping data. They may be particularly useful here because there are many known ways of acquiring genotype data, with the end result most often being a vector of probabilities over the four nucleotides at a given location. For example, traditional sequencing uses a chromatogram as input, while oligonucleotide hybridization has a statistical description of signal intensities. Each has its own likelihood model for "calling the base", but the end result in both cases is a probability vector of outcomes. Semantic algebra looks like a useful tool for connecting and extending components.
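To make the "common end result" idea concrete, here is a minimal Python sketch of one shared type that two different acquisition methods could both produce. All names (BaseCall, call_from_peak_heights, call_from_intensities) are hypothetical, and the simple normalization is only a stand-in for each platform's real likelihood model, which is not specified here.

```python
from dataclasses import dataclass
from typing import Dict

BASES = ("A", "C", "G", "T")


@dataclass(frozen=True)
class BaseCall:
    """Shared output type: a probability for each nucleotide at one position."""
    position: int
    probs: Dict[str, float]
    source: str  # which acquisition method produced this call

    def most_likely(self) -> str:
        # Return the nucleotide with the highest probability.
        return max(BASES, key=lambda b: self.probs[b])


def normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Placeholder 'likelihood model': rescale non-negative scores to sum to 1."""
    total = sum(scores[b] for b in BASES)
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    return {b: scores[b] / total for b in BASES}


def call_from_peak_heights(position: int, peaks: Dict[str, float]) -> BaseCall:
    # Assumption: chromatogram input reduced to one peak height per base.
    return BaseCall(position, normalize(peaks), source="chromatogram")


def call_from_intensities(position: int, intensities: Dict[str, float]) -> BaseCall:
    # Assumption: hybridization input reduced to one signal intensity per base.
    return BaseCall(position, normalize(intensities), source="hybridization")
```

Because both functions return the same type, downstream components only need to understand BaseCall, regardless of which platform the data came from.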
The utility of a scientific workflow will be demonstrated. A Kepler dataflow will be developed for the production of genotyping data from diverse acquisition sources. I could possibly implement three algorithms for determining a genotype, since I have access to data from three of the current platforms and some experience implementing the algorithms.
The workflow would also provide the data provenance needed to correctly analyze the genotyping data. After all, if you're going to tell someone they are predisposed to prostate cancer or that their child has a genetic disease, you want the underlying data used to draw that conclusion to be readily available.
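As a rough illustration of what carrying provenance with each datum could look like, the Python sketch below attaches a history of processing steps to a genotype call. This is not Kepler's API; the class names, field names, and step names are all hypothetical, and a real workflow system would record this information itself.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProvenanceStep:
    """One processing step applied on the way to a genotype call."""
    actor: str          # hypothetical name of the workflow step
    inputs: List[str]   # identifiers of the data the step consumed
    note: str = ""


@dataclass
class GenotypeDatum:
    """A genotype call that remembers how it was produced."""
    position: int
    call: str
    error_probability: float
    history: List[ProvenanceStep] = field(default_factory=list)

    def record(self, actor: str, inputs: List[str], note: str = "") -> None:
        # Append a step so the call can later be traced back to its raw data.
        self.history.append(ProvenanceStep(actor, list(inputs), note))

    def lineage(self) -> List[str]:
        # Names of the steps, in the order they were applied.
        return [step.actor for step in self.history]
```

An end user questioning a particular call could then inspect its lineage and error probability instead of having to trust the final value on its own.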
An ontology and data model will be developed for the data. I would like to spend some time motivating how the data model could be leveraged in creating and exploiting a physical database.
Three Ways to Determine Genotype
Traditional Sequencing
Vendor: Applied Biosystems (ABI)
Oligonucleotide Hybridization
Vendor: Affymetrix
Flow-Based Pyrosequencing
Vendor: 454
Note that there are many more methods, both currently available and in development. It's a big field.
