Simulating phenotypes from genomic data

May 12, 2016

Simulate phenotypes from genomic data for use in a tutorial on genome-wide association studies (GWAS) and phenotype prediction.

A phenotype is an observable quality that reflects genetic and environmental influences. Phenotypes can be categorical, like the presence or absence of a disease, or take on a continuous range of values, as with height and weight. Until recent years, phenotype prediction was limited by the unavailability of genetic information, first for technical and later for financial reasons. The advent of low-cost genome sequencing removed that limitation, enabling new applications of phenotype prediction in medicine, forensics, and agriculture.

This experiment is part of a larger [tutorial on phenotype prediction]( using Microsoft Research's [FaST-LMM]( In this experiment, we use real genome sequence data from the [International HapMap Project]( to simulate phenotypes, which are then used in the tutorial to illustrate phenotype prediction. Using simulated phenotypes protects patient confidentiality and allows us to tune the relative contributions of genetics, environment, and noise. The use of true genetic data ensures that our prediction problem includes realistic confounding factors like population structure (differences in genetic makeup caused by non-random mating, e.g. due to physical separation of one subgroup from another).

## Input data ingest and preparation

This experiment begins by loading a subset of the [International HapMap Project]( data from [Azure Blob Storage]( into Azure Machine Learning Studio. The input data are two plaintext files in [PLINK]( format.
The first, ``, describes each of the genomic variants that will appear in the dataset, including:

- The chromosome on which the variant is found
- The variant's [RefSNP]( (rs) identifier, a unique ID which functions as the variant's name
- A placeholder where information on linkage between variants would be included if known
- The position of the variant along the chromosome

For example:

    22 rs2236639 0 15452483
    22 rs5746664 0 15454622
    22 rs16984366 0 15476864

The second file, `chr22.ped`, contains information on each individual's genotype for each variant listed in the map file. The file begins with six columns describing the individual: family ID, individual ID, ID of father if present, ID of mother if present, sex (male: 1, female: 2) if known, and a placeholder for phenotype information (not applicable here). The remaining 12,968 columns describe the two alleles that each individual possesses for each of the 6,484 variants, in the same order given in the map file. Each variant has two possible alleles marked A, T, G, or C. For example (truncated to four variants):

    2427 NA19919 NA19908 NA19909 1 -9 G G C C T T A A
    2431 NA19916 0 0 1 -9 G G C C T T A G
    2424 NA19835 0 0 2 -9 G G C C T T A G

The **Enter Data Manually** module at the upper left of the experiment diagram (marked 1) contains access credentials to download these files from our blob storage. If you uploaded data to your own storage account as part of the [tutorial](, you can edit the module to input your own login credentials. The first two Python modules (marked 2-3) will use these credentials and the `azure-storage` Python package to access the files and export them in dataframe format.

![First half of the phenotype simulation experiment](

The text allele information included in the `.ped` file is converted to a numerical representation before phenotype simulation. The third Python module (marked 4 in the diagram) determines, for each variant, which allele is less common: this is known as the "minor allele".
The module then counts the number of minor alleles (0, 1, or 2) that each individual has for each variant. These counts are then normalized across individuals for each variant, giving a numerical representation of the genotypes from the `.ped` file. Finally, the variant RefSNP IDs from the `.map` file are used to name the corresponding columns in the input data for easy reference later (5).

## Phenotype simulation

A second **Enter Data Manually** module (6) allows the user to specify how much variance in overall phenotype is attributable to:

- Randomly-selected "causal" variants with equal effect on phenotype (the number of such variants is also specified)
- Random genetic effects (heritable differences in phenotype not attributable to genotyped variants, or any environmental effects that correlate with genetics, such as population-specific lifestyle)
- Randomly-generated, binary covariates of equal effect (recorded factors that may impact phenotype, such as sex or environmental conditions)
- Random environmental effects (simulated as noise)

![Second half of the phenotype simulation experiment](

The experiment workflow splits into three branches that simulate the phenotypic contributions of causal variants (7), covariates (8), and random effects/noise (9), respectively. These contributions to phenotype are then summed to get overall phenotype values. The causal variants, covariates, and phenotypes will be recorded to blob storage if the account information was updated in the first **Enter Data Manually** module. (The default account information in the gallery experiment allows reading from, but not writing to, our blob storage account.)

## Next steps

Please see our [tutorial]( to learn how these simulated phenotypes can be used to demonstrate GWAS and phenotype prediction using [FaST-LMM](
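
For readers who want to experiment outside Azure Machine Learning Studio, the minor-allele encoding and phenotype simulation steps described above can be approximated with NumPy. The following is a minimal sketch, not the code used in the experiment. The function names, the zero-mean/unit-variance normalization, the way the random genetic effect is drawn, and the rescaling of each component to its target variance are all assumptions made for illustration. It also assumes the allele columns contain no missing values.

```python
import numpy as np


def encode_minor_allele_counts(alleles):
    """Turn (n_individuals, 2 * n_variants) allele letters into
    normalized minor-allele counts of shape (n_individuals, n_variants)."""
    alleles = np.asarray(alleles)
    n_individuals = alleles.shape[0]
    # Group the two allele columns that belong to each variant.
    pairs = alleles.reshape(n_individuals, -1, 2)
    counts = np.zeros(pairs.shape[:2])
    for j in range(pairs.shape[1]):
        letters, freqs = np.unique(pairs[:, j, :], return_counts=True)
        if len(letters) < 2:
            continue  # only one allele observed: counts stay at 0
        minor = letters[np.argmin(freqs)]  # the less common allele
        counts[:, j] = (pairs[:, j, :] == minor).sum(axis=1)
    # Assumed normalization: zero mean, unit variance per variant.
    std = counts.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant columns
    return (counts - counts.mean(axis=0)) / std


def scale_to_variance(x, variance):
    """Center a vector and rescale it to the requested sample variance."""
    x = x - x.mean()
    sd = x.std()
    if sd == 0 or variance == 0:
        return np.zeros_like(x)
    return x * np.sqrt(variance) / sd


def simulate_phenotype(genotypes, n_causal, n_covariates, var_causal,
                       var_genetic, var_covariates, var_noise, seed=0):
    """Sum four simulated contributions into one phenotype per individual."""
    rng = np.random.default_rng(seed)
    n_individuals, n_variants = genotypes.shape

    # Causal variants: picked at random, all with the same effect size.
    causal_idx = rng.choice(n_variants, size=n_causal, replace=False)
    causal = genotypes[:, causal_idx].sum(axis=1)

    # Random genetic effect (assumed form): small random effects on every
    # variant, so individuals with similar genotypes get similar values.
    genetic = genotypes @ rng.normal(size=n_variants)

    # Binary covariates, all with the same effect size.
    covariates = rng.integers(0, 2, size=(n_individuals, n_covariates))
    covariate_effect = covariates.sum(axis=1).astype(float)

    # Environmental effects, simulated as independent noise.
    noise = rng.normal(size=n_individuals)

    phenotype = (scale_to_variance(causal, var_causal)
                 + scale_to_variance(genetic, var_genetic)
                 + scale_to_variance(covariate_effect, var_covariates)
                 + scale_to_variance(noise, var_noise))
    return phenotype, causal_idx, covariates
```

A call such as `simulate_phenotype(genotypes, n_causal=10, n_covariates=2, var_causal=0.3, var_genetic=0.2, var_covariates=0.1, var_noise=0.4)` returns the phenotype vector together with the indices of the causal variants and the covariate matrix. Because the simulated components can be correlated in a finite sample, the variance of the summed phenotype will only approximately equal the sum of the four requested variances.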