Using OpenCL to accelerate genomic analysis
Gary K. Chen
June 16, 2011
An outline
OpenCL Introduction
Copy number inference in tumors
Data considerations
Hidden Relatedness
Variable Selection
Scientific Programming on GPGPU devices
- nVidia and ATI are currently market leaders
  - Very competitive in performance and price
  - Impressive double-precision performance, though still about 4 times slower than 32-bit FP
  - ATI 9370 chipset: 528 GFLOPS (64-bit), 4GB GDDR5, $2,399
  - nVidia Tesla C2050: 520 GFLOPS (64-bit), 3GB GDDR5, $2,199
  - Source: www.sabrepc.com
Future multi-core CPUs
- Intel's 48 core SCC chip
  - Potentially a more powerful solution when considering data-intensive computing. Not constrained by the PCI bus
An open-standards based development platform
Same idea as CUDA, different terms
Data parallel coding
OpenCL Concepts
Biology background
- DNA
  - A string with a four letter alphabet: A,C,G,T
  - Humans have two copies: one from mom, one from dad
  - Most of the sequence between two strands is the same, except for a small proportion
- Example sequence: ATATTGC. We could have:
  - A single nucleotide polymorphism (common point mutation): ATATAGC
  - Copy number variants/aberrations (deletions, amplifications, translocations):
    - AT–GC
    - ATATTATTATTGC
SNP microarrays
What is observed
- Microarray output
  - Probes are dyed, and microarrays scanned with CCD cameras
  - X,Y: Intensities of A and B alleles (two possible variants)
  - R = X+Y: Overall intensity
  - LRR (log2 R ratio): Intensity relative to a standard intensity
  - BAF (B allele frequency): Ratio of allelic intensity between A and B
Inferring CNVs from microarray output
- Hidden Markov Model
  - A formalized statistical model
  - We want to use information from observables (LRR, BAF) to infer true state of nature (copy number, genotype)
Table: Example hidden states from PennCNV software
State  CN  Possible genotypes
1      0   Null
2      1   A,B
3      2   AA,AB,BB
4      2   AA,BB
5      3   AAA,AAB,ABB,BBB
6      4   AAAA,AAAB,AABB,ABBB,BBBB
Copy number inference in tumors
- Inference is harder!
  - 1. When dissecting breast tissue for example, stromal (normal cell) contamination is almost inevitable. Hence you are modeling a mixture of two or more cell populations
    - Suppose you have a state assuming normal CN=2, tumor CN=4, α = .2
    - e.g. $r_i = \alpha r_{i,n} + (1 - \alpha) r_{i,t}$
    - expected mean intensity: .2(1) + .8(1.68) = 1.544
  - 2. Amplification events can be wilder than germline (e.g. blood) events, leading to greater copy number/genotype possibilities
- Combine issues 1) and 2) and you can get a huge search space
Expanded state space
ID  BACn  CNt  BACt  α    r̄     b̄
0   0     2    0     0.3  1     0
1   0     2    0     0.6  1     0
2   1     2    1     0.3  1     0.5
3   1     2    1     0.6  1     0.5
4   2     2    2     0.3  1     1
5   2     2    2     0.6  1     1
6   0     1    0     0.3  0.65  0
7   0     1    0     0.6  0.8   0
8   0     1    1     0.3  0.65  0.538462
9   0     1    1     0.6  0.8   0.25
10  1     1    0     0.3  0.65  0.230769
11  1     1    0     0.6  0.8   0.375
12  1     1    1     0.3  0.65  0.769231
13  1     1    1     0.6  0.8   0.625
14  2     1    0     0.3  0.65  0.461538
15  2     1    0     0.6  0.8   0.75
16  2     1    1     0.3  0.65  1
17  2     1    1     0.6  0.8   1
18  0     3    0     0.3  1.35  0
19  0     3    0     0.6  1.2   0
20  1     3    1     0.3  1.35  0.37037
21  1     3    1     0.6  1.2   0.416667
22  1     3    2     0.3  1.35  0.62963
23  1     3    2     0.6  1.2   0.583333
24  2     3    3     0.3  1.35  1
25  2     3    3     0.6  1.2   1
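The r̄ and b̄ columns of the expanded state table can be reproduced with a short calculation. The sketch below is an inference from the table values, not a formula stated on the slides. It assumes BACn and BACt are the B allele counts in normal and tumor cells, CNt is the tumor copy number, the normal copy number is 2, and relative intensity is proportional to copy number. The earlier CN=4 example uses 1.68 rather than 2, so real intensities may saturate. The function name `expected_signals` is hypothetical.

```python
def expected_signals(bac_n, cn_t, bac_t, alpha, cn_n=2):
    """Expected relative intensity and BAF for a normal/tumor mixture.

    alpha is the fraction of normal (stromal) cells in the sample.
    Assumes intensity scales linearly with copy number (an assumption).
    """
    total_copies = alpha * cn_n + (1 - alpha) * cn_t
    b_copies = alpha * bac_n + (1 - alpha) * bac_t
    r_bar = total_copies / cn_n        # intensity relative to a normal diploid
    b_bar = b_copies / total_copies    # share of copies carrying the B allele
    return r_bar, b_bar


# State 8 in the table: BACn=0, CNt=1, BACt=1, alpha=0.3
r_bar, b_bar = expected_signals(bac_n=0, cn_t=1, bac_t=1, alpha=0.3)
print(round(r_bar, 6), round(b_bar, 6))
```

Under these assumptions the call above matches row 8 of the table (0.65 and 0.538462 after rounding).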
Algorithm
- Initialize
  - Empirically estimate σ of BAF and LRR
  - Compute emission matrix O for each state/obs from a Gaussian pdf
- Train: Expectation Maximization
  - Forward backward: computes posterior probs and overall likelihood
  - Baum Welch: Compute MLE of transition probabilities in matrix T
- Traverse state path
  - Viterbi (dynamic programming): walk the state path based on max-product
- Parallel Forward Algorithm
  - We compute the probability vector at observation t: $f_{0:t} = f_{0:t-1} T O_t$
  - Each state (element of the m-state vector) can independently compute a sum-product
  - Threadblocks map to states
  - Threads calculate products in parallel, followed by a log2(m) addition reduction
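The forward update can be sketched serially to show which parts map to threadblocks and threads. This is a minimal illustration in plain probability space, not the OpenCL kernel from the talk. It assumes O_t is diagonal, so it is passed as a vector of per-state emission probabilities. The name `forward_step` and the toy numbers are hypothetical.

```python
def forward_step(f_prev, T, o_t):
    """One forward update: f_t[j] = (sum_i f_prev[i] * T[i][j]) * o_t[j]."""
    m = len(f_prev)
    f_t = []
    for j in range(m):  # on the GPU: one threadblock per state j
        # on the GPU: one thread per product
        products = [f_prev[i] * T[i][j] for i in range(m)]
        # sum() stands in for the log2(m)-step parallel addition reduction
        f_t.append(sum(products) * o_t[j])
    return f_t


# Toy 2-state model (made-up numbers for illustration only)
T = [[0.9, 0.1], [0.2, 0.8]]
f = [0.5, 0.5]
emissions = [[0.7, 0.1], [0.6, 0.3]]
for o_t in emissions:
    f = forward_step(f, T, o_t)
likelihood = sum(f)
```

Each entry of `f_t` depends only on `f_prev`, one column of `T` and one emission value, which is why the states can be computed independently.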
Technical issue: Underflow
- Tiny probabilities often have to be represented in log space (even for FP64)
- How do we deal with adding log probabilities?
  - We usually exponentiate, add, then log
- Remedy
  - Add an offset to log before exponentiating
  - Subtract the offset from the log space answer
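The offset remedy can be sketched as follows. The slides do not say which offset is used. This sketch assumes the common choice of minus the largest log probability, so the largest term exponentiates to exactly 1. The function name `log_sum_exp` is illustrative.

```python
import math


def log_sum_exp(log_probs):
    """Return log(sum(exp(lp))) without underflowing for very negative inputs."""
    largest = max(log_probs)
    if largest == float("-inf"):
        return largest  # every probability is zero
    offset = -largest  # assumed choice of offset
    # Add the offset before exponentiating so the terms stay representable
    total = sum(math.exp(lp + offset) for lp in log_probs)
    # Subtract the offset from the log space answer
    return math.log(total) - offset


# exp(-1000) underflows to 0.0 in FP64, so the naive route would fail here
result = log_sum_exp([-1000.0, -1000.0])
```

On a GPU this takes two reductions per state: one to find the offset (a max) and one for the sum of the shifted terms.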
Gridblocks: Forward Backward Calculation
Code: Computing products in parallel
Code: 2 Reductions: computing offset,sum-product
Algorithm Improvements
- Examples:
  - Re-scaling transition matrix (accounting for SNP spacing)
    - Serial: $O(2nm^2)$; Parallel: $O(n)$
  - Forward backward
    - Serial: $O(2nm^2)$; Parallel: $O(n \log_2(m))$
  - Viterbi
    - Serial: $O(nm^2)$; Parallel: $O(n \log_2(m))$
  - Normalizing constant (Baum-Welch)
    - Serial: $O(nm)$; Parallel: $O(\log_2(n))$
  - MLE of transition matrix (Baum-Welch)
    - Serial: $O(nm^2)$; Parallel: $O(n)$
Performance
Table: One EM iteration on Chr 1 (41,263 SNPs)
States  CPU     GPU     Fold-speedup
128     9.5m    37s     15x
512     2h 35m  1m 44s  108x
Storing data
- Global memory
  - Relatively abundant, but slow
  - However, even 4GB may be insufficient for modern datasets
- Genotype data
  - Highly compressible
  - We only care if a position differs from the canonical sequence
  - Thus: AA,AB,BB,NULL are 4 possible genotypes
  - Should be able to encode this into two bits, so 4 genotypes per byte
Possible approaches
- Store as a float array
  - +: Easy to implement
  - -: Uses 16 times as much memory as needed!
- Store as an int array
  - Allocate a local memory array of 256 rows, 4 cols for mapping all possible genotype 4-tuples
  - +: Uses global memory efficiently, maximizes bandwidth
  - -: You might not even have enough local memory, much less for real work
- Store as a char array
  - Right bitshift pairs of bits, then AND mask with 3
  - +: Uses global memory efficiently, saves on local memory
  - -: Threads load a minimum of 4 bytes per word, you use 25% of available bandwidth
One solution: custom container
- Idea:
  - Designate each threadblock to handle 512 genotypes
  - First 32 threads: each loads a packedgeno_t element
- For each of the 32 threads:
  - Loop four times, extracting each char
  - Subloop four times, extracting each genotype via bitshift/mask
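The two-bit encoding and the bitshift/mask extraction can be sketched on the host side. The code-to-genotype mapping and the bit order within a byte are assumptions, since the slides do not specify them. The helper names `pack_genotypes` and `unpack_genotypes` are hypothetical.

```python
# Assumed 2-bit codes; the slides only say that four genotypes fit in two bits
GENOTYPE_CODES = {"AA": 0, "AB": 1, "BB": 2, "NULL": 3}
CODE_TO_GENOTYPE = {code: name for name, code in GENOTYPE_CODES.items()}


def pack_genotypes(genotypes):
    """Pack genotype calls four to a byte, lowest bit pair first."""
    packed = bytearray((len(genotypes) + 3) // 4)
    for idx, call in enumerate(genotypes):
        packed[idx // 4] |= GENOTYPE_CODES[call] << (2 * (idx % 4))
    return bytes(packed)


def unpack_genotypes(packed, count):
    """Recover `count` genotype calls via right shift and AND mask with 3."""
    calls = []
    for idx in range(count):
        code = (packed[idx // 4] >> (2 * (idx % 4))) & 3
        calls.append(CODE_TO_GENOTYPE[code])
    return calls


block = ["AA", "AB", "BB", "NULL"] * 128   # 512 genotypes, one threadblock's share
packed_block = pack_genotypes(block)
```

Here 512 genotypes occupy 128 bytes. That is 32 four-byte words, one per loading thread, which is consistent with the 32 threads and the two nested four-way loops in the custom container.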
Illustration
Inferring Relatedness
- Inferring relatedness
- The human race is one large pedigree
- Individuals of the same ethnicity are expected to share more SNP alleles
- We can summarize this relationship through a correlation matrix called 'K'
Uses for the ’K’ matrix
- Principal Components Analysis
  - A singular value decomposition on 'K'
  - $K = VDV'$
  - V contains orthogonal axes, facilitating population structure inference
- Estimating heritability
  - In random effects models
  - $Y = \mu + \beta X + \gamma^2 K + \sigma^2 I$
  - $h^2 = \frac{\gamma^2}{\gamma^2 + \sigma^2}$
Example: Latino samples in LA
Computing K
- Essentially a matrix multiplication
  - $\hat{K}_{jk} = \frac{1}{m} \sum_{i=1}^{m} \frac{(x_{ij} - 2f_i)(x_{ik} - 2f_i)}{4 f_i (1 - f_i)}$
  - Or in other words: K = ZZ'
  - Including more SNPs adds more precise, subtle information
- Parallel code
  - Carrying out matrix multiplication is straightforward on GPU
  - Matrix multiplication is ideal for GPU: Approx. 240x speedup.
  - Because K is summed over SNPs, we can split genotype matrix by subsets of SNPs and run each K slice in parallel
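The estimator for K and the SNP-slicing idea can be sketched serially. The sketch assumes x_ij is the B allele count (0, 1 or 2) of subject j at SNP i, f_i is the B allele frequency with 0 < f_i < 1, and slices are formed by striding over SNPs. The function names and the toy genotypes are hypothetical.

```python
def kinship_partial(genotypes, freqs, n_subjects):
    """Unnormalized contribution of one slice of SNPs to K."""
    partial = [[0.0] * n_subjects for _ in range(n_subjects)]
    for x_i, f_i in zip(genotypes, freqs):
        scale = 4.0 * f_i * (1.0 - f_i)  # requires 0 < f_i < 1
        centered = [x - 2.0 * f_i for x in x_i]
        for j in range(n_subjects):
            for k in range(n_subjects):
                partial[j][k] += centered[j] * centered[k] / scale
    return partial


def kinship(genotypes, freqs, n_slices=1):
    """Sum the per-slice contributions, then divide by the number of SNPs m."""
    m, n = len(genotypes), len(genotypes[0])
    total = [[0.0] * n for _ in range(n)]
    for s in range(n_slices):  # each slice could run on its own device
        rows = range(s, m, n_slices)
        part = kinship_partial([genotypes[i] for i in rows],
                               [freqs[i] for i in rows], n)
        for j in range(n):
            for k in range(n):
                total[j][k] += part[j][k]
    return [[value / m for value in row] for row in total]


# Toy data: 3 SNPs (rows) by 2 subjects (columns), made up for illustration
G = [[0, 2], [1, 1], [2, 0]]
f = [0.5, 0.5, 0.5]
K_whole = kinship(G, f)
K_sliced = kinship(G, f, n_slices=2)
```

Because the sum over SNPs is associative, `K_sliced` agrees with `K_whole` up to floating point rounding.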
Variable Selection
- One goal in biomedical research is correlating DNA variation to disease phenotypes
- Genomics technology
  - The number of subjects n remains about the same (cost of recruiting, sample preps, etc), while number of features p is exploding
  - Rate that data is being generated per dollar surpasses Moore's Law
Regression
- Standard logistic regression
  - The usual method for hypothesis testing of candidate predictors
  - $\log\left(\frac{p}{1-p}\right) = \beta X$, p being the probability of affection
  - We apply Newton-Raphson scoring until $f(\beta)$ is maximized.
  - Logistic regression simply fails when p > n
- L1 penalized regression, aka LASSO
  - Idea: Fit the logistic regression model, but subject to a penalty parameter λ
  - $g(\beta) = f(\beta) - \lambda \sum_{j=1}^{p} |\beta_j|$
Algorithms for fitting the LASSO
- One dimensional Newton Raphson at variable j:
  - Cyclic Coordinate Descent
  - $\Delta\beta_j = \beta_j^{(new)} - \beta_j = -\frac{g'(\beta_j)}{g''(\beta_j)}$
  - $g'(\beta_j) = \sum_{i=1}^{n} x_{i,j} y_i \frac{1}{1 + \exp(x_{i,j}\beta_j y_i)} - \mathrm{sgn}(\beta_j)\lambda$
  - $g''(\beta_j) = -\sum_{i=1}^{n} x_{i,j}^2 \frac{\exp(x_{i,j}\beta_j y_i)}{(1 + \exp(x_{i,j}\beta_j y_i))^2}$
- We cycle through each j until likelihood stops increasing within some tolerance
- Performs great, but only allows parallelization across samples
ref: Genkin, Lewis, Madigan: Am Stat Assoc 2007 Vol 49, No. 3
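Cyclic coordinate descent with a one-dimensional Newton-Raphson step can be sketched as below. Several choices are assumptions rather than details from the slides: labels are coded as y in {-1, +1}, the exponent uses the full linear predictor, a coefficient sitting at zero moves only if the gradient exceeds λ in magnitude, a step is not allowed to cross zero, and convergence is judged by step size instead of the likelihood change. There is no trust region or line search, so this is for illustration only.

```python
import math


def newton_step(X, y, beta, j, lam):
    """Penalized Newton-Raphson change for beta[j], with labels y in {-1, +1}."""
    grad = 0.0  # derivative of the log-likelihood with respect to beta[j]
    curv = 0.0  # minus the second derivative, so curv >= 0
    for x_i, y_i in zip(X, y):
        margin = y_i * sum(b * x for b, x in zip(beta, x_i))
        p = 0.5 * (1.0 - math.tanh(margin / 2.0))  # equals 1 / (1 + exp(margin))
        grad += x_i[j] * y_i * p
        curv += x_i[j] ** 2 * p * (1.0 - p)
    # The penalty is not differentiable at zero, so pick the direction first
    if beta[j] > 0 or (beta[j] == 0 and grad > lam):
        sign = 1.0
    elif beta[j] < 0 or (beta[j] == 0 and grad < -lam):
        sign = -1.0
    else:
        return 0.0
    if curv == 0.0:
        return 0.0
    step = (grad - sign * lam) / curv
    if (beta[j] + step) * sign < 0:  # do not cross zero within one step
        step = -beta[j]
    return step


def cyclic_coordinate_descent(X, y, lam, sweeps=100, tol=1e-8):
    beta = [0.0] * len(X[0])
    for _ in range(sweeps):
        biggest = 0.0
        for j in range(len(beta)):  # variables are visited one after another
            step = newton_step(X, y, beta, j, lam)
            beta[j] += step
            biggest = max(biggest, abs(step))
        if biggest < tol:
            break
    return beta


# Toy data: column 0 is associated with y, column 1 is not (made up)
X = [[1, 1], [1, -1], [-1, 1], [-1, -1], [1, 1], [-1, 1]]
y = [1, 1, -1, -1, -1, 1]
beta_hat = cyclic_coordinate_descent(X, y, lam=0.1)
```

The inner loop over subjects is the only part that parallelizes, because each update of `beta[j]` changes the margins used by the next variable.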
Distributed GPU implementation
- If possible to parallelize across variables, it is worth splitting up design matrix
- For really large dimensions, we can link up an arbitrary number of GPUs
- Message Passing Interface allows us to be agnostic to physical location of GPU devices
Distributed GPU implementation
- Approach:
  - MPI master node delegates heavy lifting to slaves across network
  - Master node performs fast serial code, such as sampling new λ, comparing logLs, broadcasting gradients, etc.
  - Network traffic is kept to a minimum
  - Implemented for Greedy Coordinate Descent and Gradient Descent
  - Developed on server at USC Epigenome Center: 2 Tesla C2050s
Parallel algorithms for fitting the LASSO
- Greedy coordinate descent (ref)
  - Same algorithm as CCD, except for each variable sweep, update only j that gives greatest increase in logL
  - No dependencies between subjects and variables, massive parallelization across subjects AND variables
  - Ideal if you have a huge dataset, and you want a stringent type 1 error rate (only care about a few variables)
  - Ayers and Cordell, Gen Epi 2010: Permute, and pick largest λ that allows first “false” variable to enter
ref: Wu, Lange: Annals Appl Stat 2008 Vol 2, No. 1
Layout for greedy coordinate descent implementation
Overview of Greedy CD algorithm
- Newton-Raphson kernel
  - Each threadblock maps to a block of 512 subjects (threads) for 1 variable
  - Each thread calculates subject's contribution to gradient and hessian
  - Sum (reduction) across 512 subjects
  - Sum (reduction) across subject blocks in new kernel
- Compute log-likelihood change for each variable (like above).
- Apply a max operator (log2 reduction) to select variable with greatest contribution to likelihood.
- Iterate repeatedly until likelihood increase less than epsilon
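The greedy variant can be sketched serially to make the three stages explicit: a Newton-Raphson step per variable, the objective change per variable, and a max over variables. This is not the OpenCL implementation from the talk. It assumes labels y in {-1, +1} and uses the penalized log-likelihood as the selection criterion. All function names and the toy data are hypothetical.

```python
import math


def log1pexp(z):
    """Stable log(1 + exp(z))."""
    return z + math.log1p(math.exp(-z)) if z > 0 else math.log1p(math.exp(z))


def objective(X, y, beta, lam):
    """Penalized log-likelihood g(beta) for labels y in {-1, +1}."""
    loglik = -sum(log1pexp(-y_i * sum(b * x for b, x in zip(beta, x_i)))
                  for x_i, y_i in zip(X, y))
    return loglik - lam * sum(abs(b) for b in beta)


def candidate_step(X, y, beta, j, lam):
    grad = curv = 0.0
    for x_i, y_i in zip(X, y):  # GPU: one thread per subject, then a sum reduction
        margin = y_i * sum(b * x for b, x in zip(beta, x_i))
        p = 0.5 * (1.0 - math.tanh(margin / 2.0))  # equals 1 / (1 + exp(margin))
        grad += x_i[j] * y_i * p
        curv += x_i[j] ** 2 * p * (1.0 - p)
    if beta[j] > 0 or (beta[j] == 0 and grad > lam):
        sign = 1.0
    elif beta[j] < 0 or (beta[j] == 0 and grad < -lam):
        sign = -1.0
    else:
        return 0.0
    step = (grad - sign * lam) / curv if curv > 0 else 0.0
    if (beta[j] + step) * sign < 0:  # do not cross zero within one step
        step = -beta[j]
    return step


def greedy_coordinate_descent(X, y, lam, max_iter=200, eps=1e-8):
    beta = [0.0] * len(X[0])
    current = objective(X, y, beta, lam)
    for _ in range(max_iter):
        best_gain, best_j, best_step = 0.0, None, 0.0
        for j in range(len(beta)):  # GPU: all variables evaluated in parallel
            step = candidate_step(X, y, beta, j, lam)
            trial = beta[:]
            trial[j] += step
            gain = objective(X, y, trial, lam) - current
            if gain > best_gain:  # GPU: max reduction over variables
                best_gain, best_j, best_step = gain, j, step
        if best_j is None or best_gain < eps:
            break
        beta[best_j] += best_step  # only the winning variable is updated
        current += best_gain
    return beta


X = [[1, 1], [1, -1], [-1, 1], [-1, -1], [1, 1], [-1, 1]]
y = [1, 1, -1, -1, -1, 1]
beta_hat = greedy_coordinate_descent(X, y, lam=0.1)
```

Within one iteration no candidate step depends on another, which is what allows parallelization across both subjects and variables.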
Evaluation on large dataset
- GWAS data
  - 6,806 subjects in a case control study of prostate cancer
  - 1,047,986 SNPs typed
- Invoke approx. 7 billion threads per iteration
- Total walltime for 1 GCD iteration (sweep across all variables)
  - 15 minutes on optimized serial implementation split across 2 slave CPUs
  - 5.8 seconds on parallel implementation across 2 nVidia Tesla C2050 GPU devices
  - 155x speed up
Parallel algorithms for fitting the LASSO
- (Stochastic Mirror) Gradient Descent (ref)
  - Sometimes, we are interested in tuning λ for, say, the best cross validation errors
  - Greedy descent seems awfully wasteful in that only one $\beta_j$ is updated
  - However, we can update all variables in parallel, cycling through subjects
- Algorithm
  - Extremely simple:
  - For subject i: gradient $g_i = \frac{-y_i}{1 + \exp(x_i \beta y_i)}$
  - Update the $\beta_i$ vector, where $\beta_{i,j} = \beta_{i,j} - \eta g_i x_{i,j}$
  - η is a learning parameter, set sufficiently small (e.g. .0001)
ref: Shwartz, Tewari: Proc. 26th Intern. Conf Machine Learning 2009
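The per-subject update can be sketched as below. It implements only the plain gradient step written on the slide, with labels assumed to be y in {-1, +1}. The mirror step and the L1 penalty handling of the cited method are not reproduced. The learning rate here is larger than the slide's example so the toy run converges quickly. The function name `sgd_epoch` and the data are hypothetical.

```python
import math


def sgd_epoch(X, y, beta, eta):
    """One pass over the subjects, updating every coefficient per subject."""
    for x_i, y_i in zip(X, y):
        margin = y_i * sum(b * x for b, x in zip(beta, x_i))
        # g_i = -y_i / (1 + exp(margin)), written in a stable form via tanh
        g_i = -y_i * 0.5 * (1.0 - math.tanh(margin / 2.0))
        for j in range(len(beta)):  # GPU: one thread per variable j
            beta[j] -= eta * g_i * x_i[j]
    return beta


# Toy data: column 0 is associated with y, column 1 is not (made up)
X = [[1, 1], [1, -1], [-1, 1], [-1, -1], [1, 1], [-1, 1]]
y = [1, 1, -1, -1, -1, 1]
beta = [0.0, 0.0]
for _ in range(2000):
    sgd_epoch(X, y, beta, eta=0.01)
```

The loop over j has no dependencies between variables, which is the property that lets all coefficients be updated in parallel for each subject.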
Gradient descent
- Performance
  - Slow convergence compared to serial cyclic coordinate descent, but far more scalable
  - For large lambdas, slower than greedy coordinate descent
  - Computation:bandwidth ratio not great
  - For 1 million SNPs, only about 15x speedup. Far more SNPs are needed
- Technical issues
  - Must store genotypes in subject major order to enable coalesced memory loads/stores
  - Makes SNP level summaries like means and SDs difficult to compute.
  - Heterogeneous data types: floats: (E,ExG), compressed chars: (G,GxG)
  - Memory constrained: can perform interactions on the fly with SNP major
Potential for robust variable selection:
- Subsampling:
  - Applying LASSO once overfits data. Model selection inconsistent
  - Subsampling is preferable: Bootstrapping, stability selection, x-fold cross validation
  - Number of replicates << number of samples << number of features
- Bayesian variable selection:
  - If we assume $\beta_{LASSO}$ conditionally independent
  - Master node can (quickly) sample hyperparameters (e.g. λ) from a prior distribution