Predicting Traits from Genomic Data
Demonstrates the use of FaST-LMM for trait prediction from genomic data on the Microsoft Linux Data Science Virtual Machine
Genome-Wide Association Studies (GWAS) attempt to find genome variants associated with a trait or disorder of interest (phenotype). The identity of these variants may shed light on the phenotype's causes, suggesting new targets for treatment or subjects for future research. Medical professionals may also use these variants to predict, from a genome sequence, whether a patient is likely to develop the phenotype later in life.
A core challenge of GWAS is that there are millions of common variants in the human genome, with novel mutations occurring every day, while most research studies have fewer than 100,000 participants. Because there are far more variant effects to estimate than participants to estimate them from, the problem of identifying the phenotypic effect associated with each variant is underdetermined. Population structure, co-inheritance of adjacent variants in the genome, selection, environmental contributions to phenotype, interactions between variants, variants not considered in the study, and many other effects can all confound a GWAS. Computational requirements, once another limiting factor, have fortunately been reduced dramatically thanks to recent developments in the approximate solution of linear mixed models (Loh et al., 2015; Widmer et al., 2014).
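To make "underdetermined" concrete, here is a minimal NumPy sketch. It is not part of the tutorial and does not use FaST-LMM. The sample sizes, allele frequency, and effect sizes are made-up toy values, and the genotypes are simulated rather than real. It shows that when variants outnumber participants, two very different sets of effect estimates can fit the same phenotypes equally well.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumed for illustration): far more variants than participants.
n_samples, n_variants = 50, 500

# Simulated genotypes: 0, 1, or 2 copies of the minor allele per variant.
G = rng.binomial(2, 0.3, size=(n_samples, n_variants)).astype(float)

# Simulated phenotype: only the first five variants have a true effect.
true_effects = np.zeros(n_variants)
true_effects[:5] = 1.0
y = G @ true_effects + rng.normal(scale=0.5, size=n_samples)

# Minimum-norm least-squares estimate of the per-variant effects.
beta_a, *_ = np.linalg.lstsq(G, y, rcond=None)

# The last right-singular vector lies in the null space of G, so adding
# any multiple of it changes the estimate without changing the fit.
_, _, Vt = np.linalg.svd(G, full_matrices=True)
null_vector = Vt[-1]
beta_b = beta_a + 10.0 * null_vector

residual_a = np.linalg.norm(G @ beta_a - y)
residual_b = np.linalg.norm(G @ beta_b - y)
rank = np.linalg.matrix_rank(G)

print("rank of genotype matrix:", rank, "of", n_variants, "variants")
print("residual of estimate A:", residual_a)
print("residual of estimate B:", residual_b)
print("largest difference between estimates:", np.abs(beta_a - beta_b).max())
```

Both estimates reproduce the simulated phenotypes almost exactly, so the data alone cannot choose between them. Methods such as linear mixed models add modeling assumptions that make the estimation problem well posed.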
In this tutorial, we demonstrate how to apply [FaST-LMM](https://www.microsoft.com/en-us/research/project/fastlmm/), a linear mixed model implementation designed for genomics applications, to real genomic data with simulated phenotypes. The tutorial and sample data files can be found in [our GitHub repository](https://github.com/Azure/Cortana-Intelligence-Gallery-Content/tree/master/Resources/Phenotype-Prediction). A supplementary experiment demonstrating how the phenotypes were simulated is also [available on the Cortana Intelligence Gallery](https://gallery.cortanaintelligence.com/Experiment/Simulating-phenotypes-from-genomic-data-2).