Research Statement

My research focuses on studying the fundamental tradeoffs between cache-obliviousness, cache-optimality, and parallelism of algorithms and data structures on modern multi-core and many-core¹ architectures with hierarchical caches. My approach combines both theory and experiments. I have mainly been working on stencil computation, general dynamic programming computation, and numerical algorithms.

Since 2009, I have been working with Prof. Charles E. Leiserson at MIT on stencil

computation. The project ``The Pochoir Stencil Compiler’’ [6, 7] was funded both by

NSF at total amount USD $983,017 and Intel Corp at amount RMB (Chinese Yuan)

904,627.72. In this project, we achieved following results: 1) improved the parallelism

asymptotically with the same cache-efficiency of a cache-oblivious parallel algorithm

for multi-dimensional simple stencil computation2 by inventing ``hyperspace cut’’; 2)

handles periodic and aperiodic boundary condition in one unified algorithm; 3)

designed domain-specific language (DSL) embedded in C++ for stencil computation; 4)

designed and developed a novel two-phase compilation strategy that the first phase

call any common C++ tool chain to verify the correctness and will invoke the Pochoir

compiler only afterwards to do a source-to-source transformation for a highly

optimized code. The two-phase compilation strategy saves massive cost of parsing and

type-checking of C++ language. In this project, I contributed to the algorithm, code

generation, benchmarking and core compiler software for the Pochoir system.
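The two-phase strategy depends on the embedded DSL being, by itself, legal C++ with a straightforward (if slow) reference meaning, so an ordinary tool chain can type-check and debug a program before the optimizing compiler ever runs. The following is a minimal sketch of that design idea only; the type `StencilSpec1D` and its interface are hypothetical illustrations, not Pochoir's actual API.

```cpp
#include <functional>
#include <vector>

// Hypothetical sketch (not Pochoir's real API): an embedded DSL whose
// constructs are plain C++. In phase 1, any standard C++ tool chain
// compiles this file unchanged and runs the kernel through a slow but
// correct reference loop, so the user can debug with ordinary tools.
// In phase 2 (not shown), a source-to-source compiler would recognize
// the same constructs and replace run() with optimized generated code.
struct StencilSpec1D {
    // A kernel maps (previous grid, cell index) to the cell's new value.
    std::function<double(const std::vector<double>&, int)> kernel;

    // Reference semantics: double-buffered sweep over interior cells,
    // with boundary cells held fixed.
    std::vector<double> run(std::vector<double> grid, int steps) const {
        std::vector<double> next = grid;
        for (int t = 0; t < steps; ++t) {
            for (int i = 1; i + 1 < static_cast<int>(grid.size()); ++i)
                next[i] = kernel(grid, i);
            grid.swap(next);
        }
        return grid;
    }
};
```

Because the functional phase and the optimizing phase share one source text, correctness debugging never has to happen on generated code.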

After the Pochoir project, I continued working on a joint project on general dynamic programming problems. Note that stencil computation can be viewed as a special case of dynamic programming with constant but non-orthogonal dependencies. In the research on general dynamic programming problems, my focus lies on the fundamental tradeoff between time and cache complexity.
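The observation above can be made concrete with a small sketch: a 1D Jacobi-style stencil written explicitly as a dynamic program whose cell (t, i) depends on the constant, non-orthogonal offsets (t-1, i-1), (t-1, i), and (t-1, i+1). The averaging kernel and fixed boundary are illustrative choices, not taken from the text.

```cpp
#include <cstddef>
#include <vector>

// A 1D averaging stencil as a dynamic program: X[t][i] depends on the
// constant offsets (t-1, i-1), (t-1, i), (t-1, i+1) -- fixed but not
// axis-aligned, which is the sense in which a stencil is a special case
// of dynamic programming. Boundary cells are held fixed (an aperiodic
// boundary condition).
std::vector<double> stencil_as_dp(const std::vector<double>& u0, int T) {
    const std::size_t n = u0.size();
    std::vector<std::vector<double>> X(T + 1, std::vector<double>(n, 0.0));
    X[0] = u0;
    for (int t = 1; t <= T; ++t) {
        X[t][0] = X[t - 1][0];
        X[t][n - 1] = X[t - 1][n - 1];
        for (std::size_t i = 1; i + 1 < n; ++i)
            X[t][i] = (X[t-1][i-1] + X[t-1][i] + X[t-1][i+1]) / 3.0;
    }
    return X[T];
}
```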

Modern multicore systems with hierarchical caches demand both parallelism and cache-locality from software to yield good performance. In the analysis of parallel computations, theory usually considers two metrics: time complexity and cache complexity. The traditional objective in scheduling a parallel computation is to minimize the time complexity; i.e., if we represent the parallel computation as a DAG (Directed Acyclic Graph), the time complexity is the length of the critical path in the DAG. Alternatively, one can focus on minimizing the cache complexity, i.e., the number of cache misses incurred during the execution of the program. Theoretical analyses often consider these metrics separately; in reality, the actual completion time of a program depends on both, since the number of cache misses has a direct impact on the running time, and the time complexity bound often serves as a good indicator of scalability, load balance, and scheduling overheads.

¹ For example, the Intel MIC (Many Integrated Core) coprocessor.
² A simple stencil is a stencil computation without heterogeneity in space or time.


Tuning algorithms for time and/or cache complexity is usually not preferred. It has several disadvantages: the code structure becomes more complicated; the parameter space to explore is usually exponential in size; and the tuned code is non-portable, i.e., different hardware systems require separate tunings. Moreover, since the tuning environment can never be exactly the same as the running environment, e.g., different numbers of background daemon processes, different loads of network traffic, etc., the long-tuned code is almost always sub-optimal. Classic cache-oblivious algorithms largely eliminate the need to tune for optimality on hierarchical caches. Can we further eliminate the need to tune between time and cache complexity while still remaining cache-oblivious? What is the fundamental tradeoff between time and cache complexity? What can obliviousness buy us, and what does it cost us? These are the questions at the center of my research.

For generic parallel computation, there is a tension between the objectives of minimizing time and cache complexity. Take LCS (Longest Common Subsequence) as an example: given two sequences L = <l_1 l_2 ... l_n> and R = <r_1 r_2 ... r_n>, we find their longest common subsequence by filling out a 2D table using the recurrence

X[i, j] = 0                                if i = 0 ∨ j = 0
X[i, j] = X[i-1, j-1] + 1                  if i, j > 0 ∧ l_i = r_j
X[i, j] = max{ X[i-1, j], X[i, j-1] }      otherwise
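A direct tabulation of this recurrence might look as follows; this is a standard O(nm) sketch, independent of any particular scheduling:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Straightforward tabulation of the LCS recurrence: X[i][j] is the length
// of the longest common subsequence of l_1..l_i and r_1..r_j, with row 0
// and column 0 serving as the i = 0 / j = 0 base case.
int lcs_length(const std::string& L, const std::string& R) {
    const int n = static_cast<int>(L.size());
    const int m = static_cast<int>(R.size());
    std::vector<std::vector<int>> X(n + 1, std::vector<int>(m + 1, 0));
    for (int i = 1; i <= n; ++i)
        for (int j = 1; j <= m; ++j)
            X[i][j] = (L[i - 1] == R[j - 1])
                          ? X[i - 1][j - 1] + 1                 // l_i = r_j
                          : std::max(X[i - 1][j], X[i][j - 1]); // otherwise
    return X[n][m];
}
```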

Figure 1. 2-way to n-way divide-and-conquer algorithms for LCS.

In the literature, there are two classes of algorithms to solve the problem: one is the divide-and-conquer based cache-oblivious algorithm; the other is based on looping, possibly with tiling. Suppose we adopt a 2-way divide-and-conquer parallel algorithm as shown on the left-hand side of Figure 1, i.e., at each recursion level we cut each dimension into two halves. Since at each recursion level three out of the four sub-quadrants sit on the critical path, the time recurrence T∞(n) = 3 T∞(n/2) solves to T∞(n) = O(n^(log₂ 3)), where n is the input problem size.

We usually count only the serial cache complexity in the ideal-cache model (a.k.a. the cache-oblivious model), because the parallel cache complexity is determined by fitting both the time complexity and the serial cache complexity into a formula determined by the underlying run-time scheduler. In the ideal-cache model, there are two levels of memory: the upper level is a fully associative cache of size M, and the lower level is an infinitely sized main memory. When there is a cache miss in the upper level, the system employs an omniscient cache-replacement policy to exchange a cache line of size B between the two levels. The parameters M and B are related by a tall-cache assumption, i.e., M = Ω(B^(1+ε)), where ε > 0 is a constant. In the ideal-cache model, the serial cache complexity is calculated by summing up the cache misses caused by the four individual sub-quadrants at each recursion level, i.e., Q(n) = 4 Q(n/2). The recursive summation stops at the level where the problem size of a sub-quadrant just fits into the cache while its parent's does not, i.e., ∃ n₀ s.t. n₀ = ε₀ M ∧ 2 n₀ > M, where ε₀ ∈ (0, 1] is a constant, because below this level further recursive divide-and-conquer causes no more cache misses than Q(n₀) = O(n₀ / B). Solving the recurrence, we have Q(n) = O(n² / (ε₀ B M)).

If we keep increasing the number of ways of divide-and-conquer, the algorithm eventually reduces to the n-way divide-and-conquer algorithm shown on the right-hand side of Figure 1, which is essentially a parallel looping algorithm without tiling. For the n-way divide-and-conquer algorithm, the time complexity (span) reduces to T∞(n) = O(n), while the serial cache complexity increases to Q(n) = O(n² / B). From the above analyses, we can see that the 2-way divide-and-conquer algorithm has the worst time complexity but the best cache complexity, while the parallel looping algorithm (the n-way divide-and-conquer in Figure 1), on the contrary, has the best time complexity but the worst cache complexity. Traditional wisdom may suggest tuning for a balanced point between these two extremes to get good performance on a real machine. Apparently, the intuition behind balancing is that we cannot achieve both optima at the same time.
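As a sketch of the 2-way divide-and-conquer schedule described above, the following fills the LCS table by recursing on quadrants in the order NW; then NE and SW (mutually independent, hence parallelizable); then SE. Three of the four quadrants lying on the critical path is exactly where the span recurrence T∞(n) = 3 T∞(n/2) comes from. The code is shown serially for clarity, and the names are mine, not from [1].

```cpp
#include <algorithm>
#include <string>
#include <vector>

// 2-way divide-and-conquer schedule for the LCS table. Each sub-rectangle
// is split into quadrants processed as NW; then NE and SW (independent of
// each other, so they could run in parallel); then SE. Three of the four
// quadrants lie on the critical path, yielding T(n) = 3 T(n/2), i.e.
// T(n) = O(n^{log2 3}).
struct LcsDC {
    std::string L, R;
    std::vector<std::vector<int>> X;  // row 0 / column 0 are the base case

    LcsDC(std::string l, std::string r)
        : L(std::move(l)), R(std::move(r)),
          X(L.size() + 1, std::vector<int>(R.size() + 1, 0)) {}

    void cell(int i, int j) {
        X[i][j] = (L[i - 1] == R[j - 1])
                      ? X[i - 1][j - 1] + 1
                      : std::max(X[i - 1][j], X[i][j - 1]);
    }

    // Fill the sub-rectangle of cells [i0, i1) x [j0, j1).
    void solve(int i0, int i1, int j0, int j1) {
        if (i0 >= i1 || j0 >= j1) return;                // empty
        if (i1 - i0 == 1 && j1 - j0 == 1) { cell(i0, j0); return; }
        const int im = i0 + (i1 - i0) / 2;
        const int jm = j0 + (j1 - j0) / 2;
        solve(i0, im, j0, jm);                           // NW
        solve(i0, im, jm, j1);                           // NE: independent
        solve(im, i1, j0, jm);                           // SW: of each other
        solve(im, i1, jm, j1);                           // SE
    }

    int run() {
        solve(1, static_cast<int>(L.size()) + 1,
              1, static_cast<int>(R.size()) + 1);
        return X[L.size()][R.size()];
    }
};
```

Any cell's dependencies lie above or to its left, so processing NW before {NE, SW} before SE satisfies every dependency at every recursion level.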

Figure 2. Scheduling of the classic 2-way divide-and-conquer algorithm for LCS on the left, and the cache-oblivious wavefront algorithm on the right. Solid arrows indicate true dependencies derived from the defining recurrence equations, while dashed arrows indicate false dependencies introduced by the scheduling of the algorithm.


Figure 3. Performance comparison of four algorithms for LCS: Parallel Loop (without tiling), Blocked Loop (Parallel Loop with tiling), the classic 2-way divide-and-conquer based cache-oblivious parallel (2-way COP) algorithm, and the cache-oblivious wavefront (COW) algorithm. In the performance plot, we fix the same base-case size for all algorithms except Parallel Loop (without tiling) and use exactly the same non-inlined kernel function to compute the base case in all algorithms, so that the only difference is how the algorithms schedule the base cases.

In [1] we have shown that optimal time and cache complexity are both achievable at the same time via a more compact scheduling, as shown on the right-hand side of Figure 2. The new scheduling policy eliminates all false dependencies introduced by the prior divide-and-conquer based cache-oblivious parallel algorithm and retains only the true dependencies from the defining recurrence equations. From a high-level point of view, the new algorithm proceeds by dynamically unfolding sub-quadrants on the divide-and-conquer tree, with the progress of the unfolded sub-quadrants aligned to a wavefront. In other words, the wavefront sweeping through the divide-and-conquer tree is generated from a 2-way divide-and-conquer algorithm, so we name the technique the "cache-oblivious wavefront" (COW for short)³. From Figure 3, we can see that by combining the best of both the cache-oblivious and the looping worlds, the cache-oblivious wavefront algorithm beats both the classic 2-way divide-and-conquer based cache-oblivious parallel algorithm and the parallel loop with tiling (the Blocked Loop algorithm in Figure 3). Some natural questions following this direction of research are: Does this or a similar technique apply to all divide-and-conquer based cache-oblivious parallel algorithms? What can "cache-obliviousness" buy us, and what does it cost us? What is the fundamental tradeoff between time and cache complexity? These are all fundamental problems I am working on. Recent progress in this direction includes a successful application of the cache-oblivious wavefront technique to some numerical algorithms, such as Cholesky factorization and LU factorization without pivoting.
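The wavefront idea can be illustrated in its simplest, loop-based form (not the actual cache-oblivious algorithm of [1], which additionally walks the divide-and-conquer tree to stay cache-oblivious) by processing the LCS table along anti-diagonals, where only the true dependencies constrain the order and all cells on one diagonal are mutually independent:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Loop-based wavefront schedule for the LCS table: cells on the same
// anti-diagonal i + j = d depend only on the two previous diagonals,
// never on each other, so each diagonal's cells could all be computed in
// parallel (shown serially here). Only the true dependencies (i-1, j),
// (i, j-1), (i-1, j-1) constrain the schedule.
int lcs_wavefront(const std::string& L, const std::string& R) {
    const int n = static_cast<int>(L.size());
    const int m = static_cast<int>(R.size());
    std::vector<std::vector<int>> X(n + 1, std::vector<int>(m + 1, 0));
    for (int d = 2; d <= n + m; ++d) {               // anti-diagonal index
        const int lo = std::max(1, d - m);
        const int hi = std::min(n, d - 1);
        for (int i = lo; i <= hi; ++i) {             // independent cells
            const int j = d - i;
            X[i][j] = (L[i - 1] == R[j - 1])
                          ? X[i - 1][j - 1] + 1
                          : std::max(X[i - 1][j], X[i][j - 1]);
        }
    }
    return X[n][m];
}
```

This plain wavefront has span O(n) like the looping algorithm; the contribution of COW is obtaining such a schedule while retaining the cache complexity of the 2-way divide-and-conquer algorithm.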

³ Thanks to Prof. Charles E. Leiserson at MIT CSAIL for coining the name.

[Figure 3 plot. Title: LCS: Performance (bsize=64); y-axis: updated points / second (×10⁹, linear scale); x-axis: side length (n); series: Parallel Loop, Blocked Loop, 2-way COP, COW.]

Besides the continuing study of the fundamental tradeoff between time and cache complexity, I also have research interests in the wider area of parallel algorithms and data structures. For example, my recent study of Range 1 Query (R1Q) algorithms, a special case of the Range Partial Sum Query problem in which the discrete grid cells hold only values 0 or 1, was published in COCOON'14 [3]. Another recent study of mine, on weight balancing on boundaries and skeletons [4], is an inverse of the barycenter problem. The barycenter problem is: given a set of n weights W = {w_1, w_2, ..., w_n} and n arbitrary locations X = {x_1, x_2, ..., x_n} on the boundary of an arbitrary multi-dimensional polygon, it is easy to calculate the barycenter in O(n) time by the formula x = Σ_{i=1}^{n} w_i · x_i (assuming the weights are normalized to sum to 1). The inverse problem is: given an arbitrary point inside or outside a convex or concave polygon and the weight set W, how fast can we identify n locations X = {x_1, x_2, ..., x_n} on the boundary of the polygon at which to place the n weights so that their barycenter is the given point? The results were published in SoCG'14 [4].
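The easy forward direction might be sketched as follows, assuming (as the O(n) formula above implicitly does) that the weights are normalized to sum to 1; planar points are an illustrative choice:

```cpp
#include <cstddef>
#include <vector>

// Forward barycenter computation: given normalized weights w_i and
// locations x_i, the barycenter is x = sum_i w_i * x_i, computable in
// O(n) time. The formula is dimension-independent; 2D is used here for
// concreteness.
struct Point { double x, y; };

Point barycenter(const std::vector<double>& w, const std::vector<Point>& p) {
    Point c{0.0, 0.0};
    for (std::size_t i = 0; i < w.size(); ++i) {
        c.x += w[i] * p[i].x;
        c.y += w[i] * p[i].y;
    }
    return c;
}
```

The inverse problem studied in [4] goes the other way: the barycenter and weights are fixed, and the locations on the polygon's boundary must be found.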

References:

1) Yuan Tang, Ronghui You, Haibin Kan, Jesmin Jahan Tithi, Pramod Ganapathi, and Rezaul A. Chowdhury. Cache-Oblivious Wavefront: Improving Parallelism of Recursive Dynamic Programming Algorithms without Losing Cache-Efficiency. 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'15), Feb. 9-11, 2015, San Francisco, CA, USA.

2) Rezaul A. Chowdhury, Pramod Ganapathi, Yuan Tang, and Jesmin Jahan Tithi. Improving Parallelism of Recursive Stencil Algorithms without Sacrificing Cache Performance. 2nd Annual Workshop on Stencil Computations (WOSC'14), held in conjunction with SPLASH'14, Oct. 20-24, 2014, Portland, Oregon, USA; published in the ACM Digital Library.

3) Michael A. Bender, Rezaul A. Chowdhury, Pramod Ganapathi, Samuel McCauley, and Yuan Tang. The Range 1 Query (R1Q) Problem. 20th International Computing and Combinatorics Conference (COCOON'14), August 4-6, 2014, Atlanta, Georgia, USA.

4) Luis Barba, Otfried Cheong, Jean-Lou De Carufel, Michael Gene Dobbins, Rudolf Fleischer, Akitoshi Kawamura, Matias Korman, Yoshio Okamoto, János Pach, Yuan Tang, Takeshi Tokuyama, Sander Verdonschot, and Tianhao Wang. Weight Balancing on Boundaries and Skeletons. 30th Annual ACM Symposium on Computational Geometry (SoCG'14), June 8-11, 2014, Kyoto, Japan.

5) Pramod Ganapathi, Rezaul Chowdhury, and Yuan Tang. The R1Q Problem. 22nd Annual Fall Workshop on Computational Geometry (FWCG'12), Nov. 9, 2012, College Park, Maryland, USA.

6) Yuan Tang, Rezaul Alam Chowdhury, Bradley C. Kuszmaul, Chi-Keung Luk, and Charles E. Leiserson. The Pochoir Stencil Compiler. 23rd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA'11), 2011.

7) Yuan Tang, Rezaul Alam Chowdhury, Bradley C. Kuszmaul, Chi-Keung Luk, and Charles E. Leiserson. Coding Stencil Computations Using the Pochoir Stencil-Specification Language. 3rd USENIX Workshop on Hot Topics in Parallelism (HotPar'11), 2011.

