Post on 14-Dec-2015

BST 775 Lecture

PLINK – A Popular Toolset for GWAS

Guodong Wu

SSG, Department of Biostatistics

University of Alabama at Birmingham

September 24, 2013

Overview

• Designed for GWAS and population-based linkage analysis.

• Developed by Shaun Purcell*; current version v1.07.

• http://pngu.mgh.harvard.edu/~purcell/plink/

• Why is the toolset so popular?

– Stores GWAS data sets, which are often too large for SAS, R, or other statistical packages.

– Well-developed guidelines and tools for data set management and quality control.

– A platform for various association methods.

* Purcell et al. 2007, AJHG

Overview

• Data management

• Summary statistics

• Quality Control

• Association Test

PLINK in GWAS workflow

[Workflow diagram. Box labels: Experimental Design & Sample Collection; GeneChip Scanner; Cell Intensity Files for each chip; Phenotype, sex and other covariates; Whole genome SNP-based association; Summary statistics and quality control; Assessment of population stratification; Further exploration of ‘hits’; Visualization and follow-up]

Data Format

PED and MAP format (people in rows, SNPs in columns)

PED file, genotype columns:
P1 A A A C C G T T A A T T
P2 A C A A C G G T A C T T
P3 C C A C G G T T A A T T
P4 C C A A G G G T A A T T

MAP file:
1 snp1 0 1000
X snp2 0 1000
Y snp3 0 1000
XY snp4 0 1000
MT snp5 0 1000

Transposed format (SNPs in rows, people in columns)

S1 A A A C C C C C
S2 A C A A A C A A
S3 C G C G G G G G
S4 T T C G T T G T
S5 A A G T A A A A
S6 T T A C T T T T

Compact binary format

People information: P1 … P2 … P3 … P4 … P5 …
SNP information
Binary genotype data: 01010100101010101011010011101010101010110111010100101010111010010111011010101101010101010111010
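To make the relation between the person-major and SNP-major layouts concrete, here is a minimal sketch that transposes the four-person PED example from this slide. It is an illustration only: the helper name `transpose_genotypes` is hypothetical, and the rows hold just an ID plus genotype columns, as on the slide, rather than a complete PED record.

```python
# Genotype part of the PED example from the slide: one row per person,
# two allele columns per SNP.
ped_rows = [
    "P1 A A A C C G T T A A T T",
    "P2 A C A A C G G T A C T T",
    "P3 C C A C G G T T A A T T",
    "P4 C C A A G G G T A A T T",
]


def transpose_genotypes(rows):
    """Turn person-major rows into SNP-major rows (hypothetical helper)."""
    people = [row.split() for row in rows]
    n_alleles = len(people[0]) - 1  # the first field is the person ID
    if n_alleles % 2 != 0:
        raise ValueError("each SNP needs two allele columns")
    snp_rows = []
    for s in range(n_alleles // 2):
        alleles = []
        for fields in people:
            # Pick the two allele columns of SNP number s for this person.
            alleles += fields[1 + 2 * s: 3 + 2 * s]
        snp_rows.append("S%d %s" % (s + 1, " ".join(alleles)))
    return snp_rows


snp_major = transpose_genotypes(ped_rows)
for line in snp_major:
    print(line)
```

Each output row lists one SNP followed by the allele pairs of P1 to P4, which is the layout of the transposed format.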

Data management

• Recode dataset (A,C,G,T → 1,2)

• Reorder, reformat dataset

• Flip DNA strand

• Extract/remove individuals/SNPs

• New phenotypes, covariates as extra file

• Merge 2 or more data sets
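Two of the operations above, strand flipping and recoding letters to 1/2, can be sketched in a few lines. This is a rough illustration, not PLINK's implementation: the function names are hypothetical, "0" is assumed to be the missing-allele code, and mapping the less frequent allele to 1 and the more frequent to 2 is an assumption of this sketch.

```python
# Complement table for strand flipping; "0" (assumed missing code) is unchanged.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C", "0": "0"}


def flip_strand(genotype):
    """Complement both alleles of a genotype given as a 2-tuple."""
    return tuple(COMPLEMENT[a] for a in genotype)


def recode_12(genotypes):
    """Recode one SNP's letter alleles: rarer allele -> '1', commoner -> '2'."""
    counts = {}
    for g in genotypes:
        for a in g:
            if a != "0":
                counts[a] = counts.get(a, 0) + 1
    if len(counts) > 2:
        raise ValueError("more than two alleles observed at this SNP")
    # Rarest allele first; ties are broken alphabetically.
    ordered = sorted(counts, key=lambda a: (counts[a], a))
    code = {"0": "0"}
    for a in ordered:
        code[a] = "1"
    if ordered:
        code[ordered[-1]] = "2"  # the most frequent allele
    return [tuple(code[a] for a in g) for g in genotypes]


snp1 = [("A", "A"), ("A", "C"), ("C", "C"), ("C", "C")]
print(recode_12(snp1))
print(flip_strand(("A", "C")))
```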

Summary and QC

• Hardy-Weinberg test

• Mendel errors

• Missing genotypes

• Allele frequencies

• Tests of non-random missingness

– by phenotype and by (unobserved) genotype

• Sex Check

• Pairwise IBD estimates

Hardy-Weinberg test

plink --file data --hardy

An exact test is used by default.

In a case-control study, the control group typically needs a more lenient threshold (e.g., P-value < 1e-3).
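For readers who want to see what an exact Hardy-Weinberg test computes, here is a minimal sketch. It conditions on the observed allele counts and sums the probabilities of all heterozygote counts that are no more likely than the observed one. The slide does not spell out PLINK's algorithm, so treat this as an unoptimized illustration of the general idea; the function name is hypothetical and exact fractions are used for clarity, not speed.

```python
from fractions import Fraction
from math import factorial


def hwe_exact_pvalue(n_aa, n_ab, n_bb):
    """Exact HWE p-value from genotype counts (illustrative, not optimized)."""
    n = n_aa + n_ab + n_bb
    n_a = 2 * n_aa + n_ab  # copies of allele A
    n_b = 2 * n - n_a      # copies of allele B

    def prob(hets):
        # Probability of `hets` heterozygotes given n, n_a and n_b.
        homs_a = (n_a - hets) // 2
        homs_b = (n_b - hets) // 2
        return Fraction(
            factorial(n) * 2 ** hets * factorial(n_a) * factorial(n_b),
            factorial(homs_a) * factorial(hets) * factorial(homs_b)
            * factorial(2 * n),
        )

    # Heterozygote counts must have the same parity as n_a.
    candidates = range(n_a % 2, min(n_a, n_b) + 1, 2)
    p_obs = prob(n_ab)
    return float(sum(p for p in map(prob, candidates) if p <= p_obs))


print(hwe_exact_pvalue(25, 50, 25))  # genotype counts at HWE proportions
print(hwe_exact_pvalue(50, 0, 50))   # no heterozygotes at all
```

A SNP would then be flagged when the p-value falls below the chosen threshold.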

Mendel errors

plink --file data --mendel

A Mendel error occurs when a child’s genotype cannot have been inherited from the parents under Mendel’s laws; it usually points to a genotyping error.

Reports Mendel errors for each SNP and each individual, using the following error codes:

Code  Pat , Mat -> Offspring
1     AA , AA -> AB
2     BB , BB -> AB
3     BB , ** -> AA
4     ** , BB -> AA
5     BB , BB -> AA
6     AA , ** -> BB
7     ** , AA -> BB
8     AA , AA -> BB
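The code table above can be read as a small decision rule. The sketch below is one way to express it, assuming an autosomal SNP, a complete trio, and genotypes written as 'AA', 'AB' or 'BB'; the function name is hypothetical and missing parental genotypes are not handled.

```python
def mendel_error_code(pat, mat, child):
    """Return the error code from the table, or None if the trio is consistent."""
    if child == "AB":
        if pat == "AA" and mat == "AA":
            return 1
        if pat == "BB" and mat == "BB":
            return 2
        return None
    # For a homozygous child, the opposite homozygote in a parent is impossible.
    impossible = "BB" if child == "AA" else "AA"
    base = 3 if child == "AA" else 6
    if pat == impossible and mat == impossible:
        return base + 2  # codes 5 and 8: both parents incompatible
    if pat == impossible:
        return base      # codes 3 and 6: father incompatible
    if mat == impossible:
        return base + 1  # codes 4 and 7: mother incompatible
    return None


print(mendel_error_code("AA", "AA", "AB"))
print(mendel_error_code("AB", "AB", "AA"))
```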

Missingness and Allele Frequency

plink --file data --missing

Output the missing rate per SNP and per individual.

plink --file data --freq

Output each SNP’s allele frequency
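The two summaries are simple counts. As a minimal sketch, assuming genotypes are stored as allele pairs and "0" marks a missing allele (function names are hypothetical):

```python
def missing_rate(genotypes, missing="0"):
    """Fraction of genotypes that contain a missing allele code.

    Works for one SNP across people or for one person across SNPs.
    """
    n_missing = sum(1 for g in genotypes if missing in g)
    return n_missing / len(genotypes)


def allele_frequencies(genotypes, missing="0"):
    """Allele frequencies at one SNP, computed from non-missing genotypes."""
    counts = {}
    for g in genotypes:
        if missing in g:
            continue
        for allele in g:
            counts[allele] = counts.get(allele, 0) + 1
    total = sum(counts.values())
    return {allele: c / total for allele, c in counts.items()}


snp = [("A", "A"), ("A", "C"), ("0", "0"), ("C", "C")]
print(missing_rate(snp), allele_frequencies(snp))
```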

Is the missingness random?

plink --file data --test-missing

Tests whether missingness at each SNP differs between cases and controls.
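One way to picture this test is as a 2x2 table per SNP: missing versus called genotypes, in cases versus controls. The sketch below applies a two-sided Fisher's exact test to such a table. The slide does not say which statistic PLINK reports, so the choice of Fisher's test here is an assumption, and the function name is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]].

    Rows: cases, controls. Columns: missing, called.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x missing genotypes among cases.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Small relative tolerance guards against floating-point ties.
    total = sum(
        p for p in (prob(x) for x in range(lo, hi + 1))
        if p <= p_obs * (1 + 1e-9)
    )
    return min(1.0, total)


# Hypothetical counts: 30 of 1000 cases missing, 10 of 1000 controls missing.
print(fisher_exact_two_sided(30, 970, 10, 990))
```

A small p-value suggests that missingness is related to case/control status, which can produce spurious associations.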

plink --file data --test-mishap

• Tests whether a SNP is missing at random with respect to the observed genotypes at nearby SNPs.

• Assumes dense SNP genotyping.

• Uses haplotype and LD information in the tests.

Sex Check

plink --file data --check-sex

Uses X-chromosome heterozygosity rates to infer sex, then compares the inferred sex with the reported sex.
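The idea can be sketched with an inbreeding-style coefficient F on the X chromosome: males carry one X, so they look fully homozygous (F near 1), while females show the expected heterozygosity (F near 0). The code below is a simplified illustration; the function names are hypothetical, the expected homozygosity ignores small-sample corrections, and the 0.8 and 0.2 cut-offs are assumptions for this example.

```python
def x_inbreeding_f(genotypes, freqs):
    """F = (observed - expected homozygotes) / (N - expected homozygotes).

    genotypes: non-missing allele pairs at X-chromosome SNPs for one person.
    freqs: frequency of one allele at each of those SNPs, in the same order.
    Undefined (division by zero) if every SNP is monomorphic.
    """
    n = len(genotypes)
    observed_hom = sum(1 for a, b in genotypes if a == b)
    expected_hom = sum(1 - 2 * p * (1 - p) for p in freqs)
    return (observed_hom - expected_hom) / (n - expected_hom)


def infer_sex(f, male_min=0.8, female_max=0.2):
    """Call sex from F; the thresholds are assumptions of this sketch."""
    if f > male_min:
        return "male"
    if f < female_max:
        return "female"
    return "ambiguous"


freqs = [0.5, 0.5, 0.5, 0.5]
person_a = [("A", "A"), ("C", "C"), ("G", "G"), ("T", "T")]
person_b = [("A", "C"), ("C", "C"), ("G", "T"), ("T", "T")]
print(infer_sex(x_inbreeding_f(person_a, freqs)))
print(infer_sex(x_inbreeding_f(person_b, freqs)))
```

An individual whose inferred sex disagrees with the recorded sex would be flagged for follow-up.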

Pairwise IBD sharing (relatedness)

[Figure: pedigree with genotypes AB and AC, illustrating IBS = 1 but IBD = 0. Labels: Parents; Most recent common ancestor from homogeneous random mating population.]

PLINK tutorial, October 2006; Shaun Purcell, shaun@pngu.mgh.harvard.edu

plink --file data --genome

Relatedness Check

• Uses genome-wide information, but typically does not need every SNP in the genome.

• Typically 100K independent SNPs are enough.
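IBD itself has to be estimated, but the observable quantity it is built on, IBS sharing, is easy to compute. The sketch below counts alleles shared identical-by-state at each SNP and averages them into a similarity score, (IBS2 + 0.5 × IBS1) / number of SNPs. It covers only the IBS step and does not estimate IBD; function names are hypothetical, and the genotypes are those of P1 and P2 from the PED example.

```python
def ibs_state(g1, g2):
    """Number of alleles (0, 1 or 2) two genotypes share identical-by-state."""
    remaining = list(g2)
    shared = 0
    for allele in g1:
        if allele in remaining:
            remaining.remove(allele)  # each allele can be matched only once
            shared += 1
    return shared


def ibs_similarity(person1, person2):
    """(IBS2 + 0.5 * IBS1) / N over the SNPs typed in both people."""
    states = [ibs_state(a, b) for a, b in zip(person1, person2)]
    return sum(states) / (2 * len(states))


p1 = ["AA", "AC", "CG", "TT", "AA", "TT"]
p2 = ["AC", "AA", "CG", "GT", "AC", "TT"]
print(ibs_state("AB", "AC"))   # the pair of genotypes shown on the slide
print(ibs_similarity(p1, p2))
```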

Association methods in PLINK

• Population-based

– Allelic, trend, genotypic, Fisher’s exact

– Stratified tests (Cochran-Mantel-Haenszel, Breslow-Day)

– Linear & logistic regression models

• multiple covariates, interactions, joint tests, etc.

• Family-based

– Disease traits: TDT / sib-TDT

– Continuous traits: QFAM (between/within model, QTDT)

• Permutation procedures

– “adaptive”, max(T), gene-dropping, between/within, rank-based, within-cluster

• Multilocus tests

– Haplotype estimation, set-based tests, Hotelling’s T2, epistasis
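As a concrete instance of the simplest method in the list, the basic allelic test, here is a minimal sketch of a 1-df chi-square test on a 2x2 table of allele counts in cases and controls, together with the odds ratio. The function name and the counts are hypothetical; the p-value uses the identity that the upper tail of a 1-df chi-square at x equals erfc(sqrt(x / 2)).

```python
from math import erfc, sqrt


def allelic_test(case_a, case_b, ctrl_a, ctrl_b):
    """Chi-square (1 df), p-value and odds ratio for allele A, cases vs controls."""
    n = case_a + case_b + ctrl_a + ctrl_b
    rows = (case_a + case_b, ctrl_a + ctrl_b)
    cols = (case_a + ctrl_a, case_b + ctrl_b)
    observed = ((case_a, case_b), (ctrl_a, ctrl_b))
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    p_value = erfc(sqrt(chi2 / 2))  # upper tail of chi-square with 1 df
    odds_ratio = (case_a * ctrl_b) / (case_b * ctrl_a)
    return chi2, p_value, odds_ratio


# Hypothetical allele counts: 60 A / 40 B in cases, 40 A / 60 B in controls.
chi2, p_value, odds_ratio = allelic_test(60, 40, 40, 60)
print(chi2, p_value, odds_ratio)
```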

An Example: logistic Regression

plink --maf 0.05 --exclude nonautosomalSNPs.txt --out AllAssoc --bfile bdata --remove exclusions.txt --logistic --hide-covar --pheno IChipCovs.txt --pheno-name cas_con --covar IChipCovs.txt --covar-name Sex,EurAdmix

An Example: logistic Regression Result

Cardinal rules in PLINK

• Always consult the log file, console output

• Also consult the web documentation

– regularly

• PLINK has no memory

– each run loads data anew, previous filters lost

• Exact syntax and spelling is important

– “minus minus” …
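The “minus minus” rule is easy to break when commands are copied from slides or word processors, which often replace "--" with a typographic dash. A small check like the one below can catch this before a command is run. It is a hypothetical helper, not part of PLINK.

```python
def find_bad_dashes(command):
    """Return (position, character) pairs for dash look-alikes that are not ASCII '-'."""
    lookalikes = {
        "\u2010",  # hyphen
        "\u2011",  # non-breaking hyphen
        "\u2012",  # figure dash
        "\u2013",  # en dash
        "\u2014",  # em dash
        "\u2212",  # minus sign
    }
    return [(i, ch) for i, ch in enumerate(command) if ch in lookalikes]


good = "plink --file data --genome"
bad = "plink --file data \u2013-genome"  # an en dash in place of the first '-'
print(find_bad_dashes(good))
print(find_bad_dashes(bad))
```

An empty list means the command contains only plain ASCII hyphens.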

PLINK tutorial, October 2006; Shaun Purcell, shaun@pngu.mgh.harvard.edu