OpenCL applications in genomics


Using OpenCL to accelerate genomic analysis

Gary K. Chen

June 16, 2011

An outline

OpenCL Introduction

Copy number inference in tumors

Data considerations

Hidden Relatedness

Variable Selection

Scientific Programming on GPGPU devices

- nVidia and ATI are currently market leaders
- Very competitive in performance and price
- Impressive double-precision performance, though still about 4 times slower than 32-bit FP
- ATI 9370 chipset: 528 GFLOPS (FP64), 4 GB GDDR5, $2,399
- nVidia Tesla C2050: 520 GFLOPS (FP64), 3 GB GDDR5, $2,199
- Source: www.sabrepc.com

Future multi-core CPUs

I Intel’s 48 core SCC chipI Potentially a more powerful solution when

considering data intenstive computing. Notconstrained by PCI bus

An open-standards-based development platform

Same idea as CUDA, different terms

Data parallel coding

OpenCL Concepts
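The concept slides above appear as figures in the original deck. As a stand-in, here is a minimal OpenCL host-plus-kernel sketch (not from the talk) of the data-parallel idea: one work-item per element, with work-groups playing the role of CUDA threadblocks. Error handling is omitted and the kernel itself is purely illustrative.

/* Minimal data-parallel OpenCL example (illustrative only): each work-item
 * handles one array element, the way a CUDA thread would. */
#include <stdio.h>
#include <CL/cl.h>

static const char *src =
    "__kernel void scale(__global float *x, const float a, const int n) {\n"
    "    int i = get_global_id(0);   /* global work-item index, like a CUDA thread id */\n"
    "    if (i < n) x[i] *= a;\n"
    "}\n";

int main(void) {
    enum { N = 1024 };
    float host[N];
    for (int i = 0; i < N; ++i) host[i] = (float)i;

    cl_platform_id platform; cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, device, 0, NULL);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel kern = clCreateKernel(prog, "scale", NULL);

    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                sizeof(host), host, NULL);
    float a = 2.0f; int n = N;
    clSetKernelArg(kern, 0, sizeof(buf), &buf);
    clSetKernelArg(kern, 1, sizeof(a), &a);
    clSetKernelArg(kern, 2, sizeof(n), &n);

    size_t global = N, local = 128;   /* work-group size ~ CUDA threadblock size */
    clEnqueueNDRangeKernel(q, kern, 1, NULL, &global, &local, 0, NULL, NULL);
    clEnqueueReadBuffer(q, buf, CL_TRUE, 0, sizeof(host), host, 0, NULL, NULL);
    printf("x[10] = %f\n", host[10]);   /* expect 20.0 */

    clReleaseMemObject(buf); clReleaseKernel(kern); clReleaseProgram(prog);
    clReleaseCommandQueue(q); clReleaseContext(ctx);
    return 0;
}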

An outline

OpenCL Introduction

Copy number inference in tumors

Data considerations

Hidden Relatedness

Variable Selection

Biology background

- DNA
  - A string with a four-letter alphabet: A, C, G, T
  - Humans have two copies: one from mom, one from dad
  - Most of the sequence between two strands is the same, except for a small proportion
- Example sequence: ATATTGC. We could have:
  - A single nucleotide polymorphism (common point mutation): ATATAGC
  - Copy number variants/aberrations (deletions, amplifications, translocations):
    - AT–GC
    - ATATTATTATTGC

SNP microarrays

What is observed

- Microarray output
  - Probes are dyed, and microarrays are scanned with CCD cameras
  - X, Y: intensities of the A and B alleles (the two possible variants)
  - R = X + Y: overall intensity
  - LRR (log2 R ratio): intensity relative to a standard intensity
  - BAF (B allele frequency): ratio of allelic intensity between A and B
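As a rough illustration only (the vendor pipelines normalize these quantities more carefully, so this is not the actual Illumina calculation), the observables can be thought of as simple functions of the two allele intensities; the reference intensity r_expected below is an assumed input.

/* Rough sketch of how the observed quantities relate: R is total intensity,
 * LRR compares R to an expected reference intensity, and BAF is the B
 * allele's share of the signal (a simplification of the real normalization). */
#include <math.h>

typedef struct { double lrr, baf; } probe_summary;

probe_summary summarize_probe(double x, double y, double r_expected) {
    probe_summary s;
    double r = x + y;                   /* R = X + Y: overall intensity      */
    s.lrr = log2(r / r_expected);       /* LRR: intensity vs. a standard     */
    s.baf = y / r;                      /* BAF: here simply Y / (X + Y)      */
    return s;
}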

Inferring CNVs from microarray output

Hidden Markov Model
- A formalized statistical model
- We want to use information from observables (LRR, BAF) to infer the true state of nature (copy number, genotype)

Table: Example hidden states from PennCNV software

State  CN  Possible genotypes
1      0   Null
2      1   A, B
3      2   AA, AB, BB
4      2   AA, BB
5      3   AAA, AAB, ABB, BBB
6      4   AAAA, AAAB, AABB, ABBB, BBBB

Copy number inference in tumors

- Inference is harder!
- 1. When dissecting breast tissue, for example, stromal (normal cell) contamination is almost inevitable. Hence you are modeling a mixture of two or more cell populations
  - Suppose you have a state assuming normal CN = 2, tumor CN = 4, α = 0.2
  - e.g. r_i = α r_{i,n} + (1 − α) r_{i,t}
  - Expected mean intensity: 0.2(1) + 0.8(1.68) = 1.544
- 2. Amplification events can be wilder than germline (e.g. blood) events, leading to greater copy number/genotype possibilities
- Combine issues 1) and 2) and you can get a huge search space

Expanded state space

ID  BACn  CNt  BACt  α    r̄     b̄
0   0     2    0     0.3  1     0
1   0     2    0     0.6  1     0
2   1     2    1     0.3  1     0.5
3   1     2    1     0.6  1     0.5
4   2     2    2     0.3  1     1
5   2     2    2     0.6  1     1
6   0     1    0     0.3  0.65  0
7   0     1    0     0.6  0.8   0
8   0     1    1     0.3  0.65  0.538462
9   0     1    1     0.6  0.8   0.25
10  1     1    0     0.3  0.65  0.230769
11  1     1    0     0.6  0.8   0.375
12  1     1    1     0.3  0.65  0.769231
13  1     1    1     0.6  0.8   0.625
14  2     1    0     0.3  0.65  0.461538
15  2     1    0     0.6  0.8   0.75
16  2     1    1     0.3  0.65  1
17  2     1    1     0.6  0.8   1
18  0     3    0     0.3  1.35  0
19  0     3    0     0.6  1.2   0
20  1     3    1     0.3  1.35  0.37037
21  1     3    1     0.6  1.2   0.416667
22  1     3    2     0.3  1.35  0.62963
23  1     3    2     0.6  1.2   0.583333
24  2     3    3     0.3  1.35  1
25  2     3    3     0.6  1.2   1
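The r̄ and b̄ columns are consistent with linear mixing of a diploid normal component and the tumor component. The sketch below is my own reconstruction, not the talk's code; note the earlier CN = 4 example used 1.68 rather than 2, so the real model attenuates high copy numbers and this should be treated as a simplification.

/* Reconstruction (not from the talk) of the expected signals for a mixed
 * normal/tumor state, assuming linear mixing and a diploid normal component.
 * Reproduces the r-bar and b-bar columns, e.g. state 8:
 * ba_n=0, cn_t=1, ba_t=1, alpha=0.3 -> r=0.65, b=0.538462. */
#include <stdio.h>

void expected_signal(int ba_n, int cn_t, int ba_t, double alpha,
                     double *r_bar, double *b_bar) {
    const int cn_n = 2;                                    /* normal cells are diploid   */
    double cn_mix = alpha * cn_n + (1 - alpha) * cn_t;     /* average copy number        */
    double ba_mix = alpha * ba_n + (1 - alpha) * ba_t;     /* average B allele count     */
    *r_bar = cn_mix / 2.0;                                 /* intensity relative to CN=2 */
    *b_bar = (cn_mix > 0) ? ba_mix / cn_mix : 0.0;         /* B allele fraction          */
}

int main(void) {
    double r, b;
    expected_signal(0, 1, 1, 0.3, &r, &b);
    printf("r=%.2f b=%.6f\n", r, b);   /* prints r=0.65 b=0.538462 */
    return 0;
}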

Algorithm

- Initialize
  - Empirically estimate σ of BAF and LRR
  - Compute the emission matrix O for each state/observation from a Gaussian pdf
- Train: Expectation Maximization
  - Forward-backward: computes posterior probabilities and the overall likelihood
  - Baum-Welch: compute the MLE of the transition probabilities in matrix T
- Traverse the state path
  - Viterbi (dynamic programming): walk the state path based on max-product

- Parallel Forward Algorithm
  - We compute the probability vector at observation t: f_{0:t} = f_{0:t−1} T O_t
  - Each state (element of the m-state vector) can independently compute a sum-product
  - Threadblocks map to states
  - Threads calculate products in parallel, followed by a log2(m) addition reduction

Technical issue: Underflow

- Tiny probabilities often have to be represented in log space (even for FP64)
- How do we deal with adding log probabilities?
  - We usually exponentiate, add, then log
- Remedy (see the sketch below)
  - Add an offset to the logs before exponentiating
  - Subtract the offset from the log-space answer
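A small serial sketch of the offset trick (the standard log-sum-exp device; this is not the talk's kernel):

/* Exponentiating log-probabilities near -700 would underflow a double, so
 * shift by the maximum first and shift back afterwards. */
#include <math.h>

double log_sum_exp(const double *logp, int n) {
    double offset = logp[0];
    for (int i = 1; i < n; ++i)           /* offset = max log-probability  */
        if (logp[i] > offset) offset = logp[i];
    double sum = 0.0;
    for (int i = 0; i < n; ++i)           /* exponentiate shifted values   */
        sum += exp(logp[i] - offset);
    return offset + log(sum);             /* undo the shift in log space   */
}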

Gridblocks: Forward Backward Calculation

Code: Computing products in parallel

Code: 2 Reductions: computing offset, sum-product
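The code listings themselves are not reproduced in this transcript. Below is a hedged OpenCL C sketch of the mapping just described (one work-group per destination state, one thread per source state, then a log2(m) tree reduction). The kernel name, buffer layout, and the use of plain rather than log-space probabilities are my assumptions, not the talk's actual code.

/* Sketch only: one work-group per destination state j computes
 * f_t[j] = sum_k f_{t-1}[k] * T[k][j] * O_t[j] with a log2(m) tree reduction.
 * Assumes the work-group size is a power of 2 and at least m. */
__kernel void forward_step(__global const float *f_prev,   /* f_{t-1}, length m        */
                           __global const float *T,        /* m x m transition matrix  */
                           __global const float *O_t,      /* emission probs at obs t  */
                           __global float *f_curr,         /* output f_t, length m     */
                           const int m,
                           __local float *scratch)
{
    int j = get_group_id(0);     /* destination state handled by this work-group */
    int k = get_local_id(0);     /* source state handled by this thread          */

    /* each thread forms one product of the sum */
    scratch[k] = (k < m) ? f_prev[k] * T[k * m + j] : 0.0f;
    barrier(CLK_LOCAL_MEM_FENCE);

    /* log2(m) tree reduction to a single sum */
    for (int stride = get_local_size(0) / 2; stride > 0; stride >>= 1) {
        if (k < stride) scratch[k] += scratch[k + stride];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (k == 0) f_curr[j] = scratch[0] * O_t[j];
}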

Algorithm Improvements

- Examples:
  - Re-scaling the transition matrix (accounting for SNP spacing): Serial O(2nm²); Parallel O(n)
  - Forward-backward: Serial O(2nm²); Parallel O(n log2(m))
  - Viterbi: Serial O(nm²); Parallel O(n log2(m))
  - Normalizing constant (Baum-Welch): Serial O(nm); Parallel O(log2(n))
  - MLE of the transition matrix (Baum-Welch): Serial O(nm²); Parallel O(n)

Performance

Table: One EM iteration on Chr 1 (41,263 SNPs)

States  CPU       GPU       Fold speedup
128     9.5 m     37 s      15x
512     2 h 35 m  1 m 44 s  108x

An outline

OpenCL Introduction

Copy number inference in tumors

Data considerations

Hidden Relatedness

Variable Selection

Storing data

- Global memory
  - Relatively abundant, but slow
  - However, even 4 GB may be insufficient for modern datasets
- Genotype data
  - Highly compressible
  - We only care if a position differs from the canonical sequence
  - Thus AA, AB, BB, NULL are the 4 possible genotypes
  - Should be able to encode this in two bits, so 4 genotypes per byte

Possible approaches
- Store as a float array
  - +: Easy to implement
  - −: Uses 16 times as much memory as needed!
- Store as an int array
  - Allocate a local-memory array of 256 rows, 4 cols, mapping all possible genotype 4-tuples
  - +: Uses global memory efficiently, maximizes bandwidth
  - −: You might not even have enough local memory, much less for real work
- Store as a char array
  - Right bit-shift pairs of bits, then AND mask with 3
  - +: Uses global memory efficiently, saves on local memory
  - −: Threads load a minimum of 4 bytes per word, so you use 25% of available bandwidth

One solution: custom container

- Idea:
  - Designate each threadblock to handle 512 genotypes
  - First 32 threads: each loads a packedgeno_t element
- For each of the 32 threads:
  - Loop four times, extracting each char
  - Sub-loop four times, extracting each genotype via bit-shift/mask (see the sketch below)
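A sketch of that unpacking loop (my own reconstruction, not the talk's code; the packedgeno_t layout of four chars, i.e. 16 two-bit genotypes per element, is assumed from the description above):

/* Sketch (assumed layout): each packedgeno_t packs 16 genotypes as 2-bit
 * codes in 4 chars, so 32 of them cover the 512 genotypes one work-group is
 * responsible for.  The first 32 threads each load one element and unpack it
 * with right shifts and an AND mask of 3. */
typedef struct { uchar c[4]; } packedgeno_t;

__kernel void unpack_genotypes(__global const packedgeno_t *packed,
                               __local uchar *geno)        /* 512 unpacked genotypes */
{
    int t = get_local_id(0);
    if (t < 32) {
        packedgeno_t p = packed[get_group_id(0) * 32 + t];  /* one load per thread  */
        for (int c = 0; c < 4; ++c) {                       /* each of the 4 chars  */
            for (int g = 0; g < 4; ++g) {                   /* 4 genotypes per char */
                /* shift the pair of bits down, then mask with 3 (binary 11) */
                geno[t * 16 + c * 4 + g] = (p.c[c] >> (2 * g)) & 3;
            }
        }
    }
    barrier(CLK_LOCAL_MEM_FENCE);   /* genotypes now visible to all 512 threads */
}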

Illustration

An outline

OpenCL Introduction

Copy number inference in tumors

Data considerations

Hidden Relatedness

Variable Selection

Inferring Relatedness

- Inferring relatedness
  - The human race is one large pedigree
  - Individuals of the same ethnicity are expected to share more SNP alleles
  - We can summarize this relationship through a correlation matrix called 'K'

Uses for the ’K’ matrix

- Principal Components Analysis
  - A singular value decomposition on K: K = V D V′
  - V contains orthogonal axes, facilitating population structure inference
- Estimating heritability
  - In random effects models: Y = µ + βX + γ²K + σ²I
  - h² = γ² / (γ² + σ²)

Example: Latino samples in LA

Computing K

- Essentially a matrix multiplication
  - K̂_jk = (1/m) Σ_{i=1..m} (x_ij − 2 f_i)(x_ik − 2 f_i) / (4 f_i (1 − f_i))
  - Or in other words: K = ZZ′
  - Including more SNPs adds more precise, subtle information
- Parallel code
  - Carrying out a matrix multiplication is straightforward on the GPU
  - Matrix multiplication is ideal for the GPU: approx. 240x speedup
  - Because K is summed over SNPs, we can split the genotype matrix into subsets of SNPs and run each K slice in parallel
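A plain-C sketch of the standardization and accumulation (not the talk's kernel; genotype coding 0/1/2 and precomputed allele frequencies f are assumed). On the GPU the same computation is a matrix multiplication, and the SNP loop is what gets sliced across devices.

/* Standardize genotypes and accumulate K = Z Z' / m.  Because K is a sum
 * over SNPs, the SNP dimension can be split into slices, one partial K per
 * GPU, and the slices summed at the end. */
#include <math.h>

void compute_K(const double *x,   /* m SNPs x n subjects, coded 0/1/2        */
               const double *f,   /* allele frequency per SNP, length m      */
               int m, int n,
               double *K)         /* n x n output, zero-initialized          */
{
    for (int i = 0; i < m; ++i) {
        double denom = sqrt(4.0 * f[i] * (1.0 - f[i]));
        for (int j = 0; j < n; ++j) {
            double zij = (x[i * n + j] - 2.0 * f[i]) / denom;       /* Z entry */
            for (int k = 0; k <= j; ++k) {
                double zik = (x[i * n + k] - 2.0 * f[i]) / denom;
                K[j * n + k] += zij * zik / m;   /* accumulate (ZZ')/m, lower triangle */
            }
        }
    }
    for (int j = 0; j < n; ++j)                  /* mirror to the upper triangle */
        for (int k = 0; k < j; ++k)
            K[k * n + j] = K[j * n + k];
}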

An outline

OpenCL Introduction

Copy number inference in tumors

Data considerations

Hidden Relatedness

Variable Selection

Variable Selection

- One goal in biomedical research is correlating DNA variation with disease phenotypes
- Genomics technology
  - The number of subjects n remains about the same (cost of recruiting, sample preps, etc.), while the number of features p is exploding
  - The rate at which data is generated per dollar surpasses Moore's Law

Regression

- Standard logistic regression
  - The usual method for hypothesis testing of candidate predictors
  - log(p / (1 − p)) = βX, p being the probability of affection
  - We apply Newton-Raphson scoring until f(β) is maximized
  - Logistic regression simply fails when p > n
- L1-penalized regression, aka the LASSO
  - Idea: fit the logistic regression model, but subject to a penalty parameter λ
  - g(β) = f(β) − λ Σ_{j=1..p} |β_j|

Algorithms for fitting the LASSO
- One-dimensional Newton-Raphson at variable j: Cyclic Coordinate Descent
  - ∆β_j = β_j(new) − β_j = −g′(β_j) / g″(β_j)
  - g′(β_j) = Σ_{i=1..n} x_ij y_i / (1 + exp(x_ij β_j y_i)) − sgn(β_j) λ
  - g″(β_j) = Σ_{i=1..n} x_ij² exp(x_ij β_j y_i) / (1 + exp(x_ij β_j y_i))²
- We cycle through each j until the likelihood stops increasing within some tolerance
- Performs great, but only allows parallelization across samples

ref: Genkin, Lewis, Madigan: Technometrics 2007, Vol 49, No. 3
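A serial sketch of one CCD sweep using the updates above (my code, not the authors'). Here y is coded −1/+1, xb caches the full linear predictor, and g″ is treated as the positive observed information, so the maximizing Newton step is +g′/g″; step-size safeguards and the handling of β_j crossing zero are omitted.

/* One cyclic coordinate descent sweep over all p variables (sketch only). */
#include <math.h>

void ccd_sweep(const double *X, const double *y, double *beta, double *xb,
               int n, int p, double lambda)
{
    for (int j = 0; j < p; ++j) {
        double g1 = 0.0, g2 = 0.0;
        for (int i = 0; i < n; ++i) {
            double e = exp(xb[i] * y[i]);                  /* uses the full predictor   */
            double xij = X[i * p + j];
            g1 += xij * y[i] / (1.0 + e);                  /* gradient contribution     */
            g2 += xij * xij * e / ((1.0 + e) * (1.0 + e)); /* information contribution  */
        }
        if (beta[j] > 0)      g1 -= lambda;                /* -sgn(beta_j) * lambda     */
        else if (beta[j] < 0) g1 += lambda;
        double delta = g1 / g2;                            /* one-dimensional Newton    */
        beta[j] += delta;
        for (int i = 0; i < n; ++i)                        /* keep predictor up to date */
            xb[i] += X[i * p + j] * delta;
    }
}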

Distributed GPU implementation

- If it is possible to parallelize across variables, it is worth splitting up the design matrix
- For really large dimensions, we can link up an arbitrary number of GPUs
- The Message Passing Interface (MPI) lets us be agnostic to the physical location of GPU devices

Distributed GPU implementation

- Approach (see the sketch below):
  - The MPI master node delegates the heavy lifting to slaves across the network
  - The master node performs fast serial code, such as sampling a new λ, comparing logLs, broadcasting gradients, etc.
  - Network traffic is kept to a minimum
  - Implemented for Greedy Coordinate Descent and Gradient Descent
  - Developed on a server at the USC Epigenome Center: 2 Tesla C2050s
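A sketch of the MPI coordination pattern described above; compute_partial_loglik() and all variable names are hypothetical stand-ins for the GPU work on each slave's slice of the design matrix.

/* Each slave owns a slice of the design matrix on its GPU; the master
 * broadcasts lambda and the latest coefficient vector, then reduces the
 * slaves' partial log-likelihoods and does the cheap serial bookkeeping. */
#include <mpi.h>

/* stand-in for the GPU-backed evaluation of this slave's slice */
static double compute_partial_loglik(double lambda, const double *beta, int len)
{
    (void)lambda; (void)beta; (void)len;
    return 0.0;
}

static void one_iteration(double lambda, double *beta, int len, int rank)
{
    /* master -> slaves: current penalty and coefficient vector */
    MPI_Bcast(&lambda, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(beta, len, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* slaves evaluate their slice; the master contributes nothing here */
    double local = (rank == 0) ? 0.0 : compute_partial_loglik(lambda, beta, len);
    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        /* compare log-likelihoods, sample a new lambda, decide whether to
         * keep iterating: all fast serial work, so network traffic stays low */
    }
}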

Parallel algorithms for fitting the LASSO

- Greedy coordinate descent (ref)
  - Same algorithm as CCD, except that for each variable sweep we update only the j that gives the greatest increase in logL
  - No dependencies between subjects and variables, so massive parallelization across subjects AND variables
  - Ideal if you have a huge dataset and you want a stringent type 1 error rate (you only care about a few variables)
  - Ayers and Cordell, Gen Epi 2010: permute, and pick the largest λ that allows the first "false" variable to enter

ref: Wu, Lange: Annals Appl Stat 2008, Vol 2, No. 1

Layout for greedy coordinate descent implementation

Overview of Greedy CD algorithm
- Newton-Raphson kernel
  - Each threadblock maps to a block of 512 subjects (threads) for 1 variable
  - Each thread calculates its subject's contribution to the gradient and Hessian
  - Sum (reduction) across the 512 subjects
  - Sum (reduction) across subject blocks in a new kernel
- Compute the log-likelihood change for each variable (as above)
- Apply a max operator (log2 reduction) to select the variable with the greatest contribution to the likelihood (sketch below)
- Iterate repeatedly until the likelihood increase is less than epsilon

Evaluation on large dataset

- GWAS data
  - 6,806 subjects in a case-control study of prostate cancer
  - 1,047,986 SNPs typed
- Invokes approx. 7 billion threads per iteration
- Total walltime for 1 GCD iteration (a sweep across all variables)
  - 15 minutes for the optimized serial implementation split across 2 slave CPUs
  - 5.8 seconds for the parallel implementation across 2 nVidia Tesla C2050 GPU devices
  - 155x speedup

Parallel algorithms for fitting the LASSO
- (Stochastic Mirror) Gradient Descent (ref)
  - Sometimes we are interested in tuning λ for, say, the best cross-validation error
  - Greedy descent seems awfully wasteful in that only one β_j is updated
  - However, we can update all variables in parallel while cycling through subjects
- Algorithm
  - Extremely simple
  - For subject i: gradient g_i = −y_i / (1 + exp(x_i β y_i))
  - Update the β vector: β_j ← β_j − η g_i x_ij
  - η is a learning parameter, set sufficiently small (e.g. 0.0001)

ref: Shalev-Shwartz, Tewari: Proc. 26th Intern. Conf. Machine Learning 2009
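A serial sketch of the per-subject update (mine, not the talk's kernel); the ℓ1/mirror-descent correction from Shalev-Shwartz and Tewari is omitted, leaving just the plain gradient step described above. The inner loop over j is what maps onto thousands of GPU threads.

/* One pass over subjects: each contributes a scalar gradient that updates
 * every beta_j. */
#include <math.h>

void sgd_epoch(const double *X, const double *y, double *beta,
               int n, int p, double eta)
{
    for (int i = 0; i < n; ++i) {
        double xb = 0.0;
        for (int j = 0; j < p; ++j)                 /* linear predictor x_i * beta   */
            xb += X[i * p + j] * beta[j];
        double g = -y[i] / (1.0 + exp(xb * y[i]));  /* subject i's scalar gradient   */
        for (int j = 0; j < p; ++j)                 /* all beta_j updated in parallel on the GPU */
            beta[j] -= eta * g * X[i * p + j];
    }
}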

Gradient descent
- Performance
  - Slow convergence compared to serial cyclic coordinate descent, but far more scalable
  - For large lambdas, slower than greedy coordinate descent
  - The computation-to-bandwidth ratio is not great
  - For 1 million SNPs, only about a 15x speedup; far more SNPs are needed
- Technical issues
  - Must store genotypes in subject-major order to enable coalesced memory loads/stores
  - Makes SNP-level summaries like means and SDs difficult to compute
  - Heterogeneous data types: floats (E, ExG), compressed chars (G, GxG)
  - Memory constrained: can perform interactions on the fly with SNP-major order

Potential for robust variable selection:

- Subsampling:
  - Applying the LASSO once overfits the data; model selection is inconsistent
  - Subsampling is preferable: bootstrapping, stability selection, x-fold cross-validation
  - Number of replicates << number of samples << number of features
- Bayesian variable selection:
  - If we assume the β_LASSO are conditionally independent
  - The master node can (quickly) sample hyperparameters (e.g. λ) from a prior distribution