Supporting Applications Involving Irregular Accesses and Recursive Control Flow on Emerging Parallel Environments

Xin Huo
Department of Computer Science and Engineering
The Ohio State University
Advisor: Prof. Gagan Agrawal
Motivation - Parallel Architectures
• Many-core Architecture: GPU vs. Intel Xeon Phi
- Programming model: CUDA/OpenCL vs. all x86-based programming models
- Running method: offload only vs. offload or native execution
- Memory configuration: configurable shared memory + L2 cache vs. coherent L2 cache
- SIMT vs. SIMD: SIMT supports non-continuous and non-aligned accesses and control dependencies automatically; SIMD has no direct support for non-continuous, non-aligned accesses or control/data dependencies
Motivation - Parallel Architectures
• Heterogeneous Architecture
- CPU - Decoupled GPU: x86 CPU cores and GPU engine arrays connected over the PCIe bus, with separate host memory and GPU memory
- CPU - Coupled GPU: Fusion APU, with CPU and GPU sharing one memory controller and one physical memory
Motivation - Application Patterns
• Irregular / Unstructured Reduction
- A dwarf in the Berkeley view on parallel computing (Molecular Dynamics and Euler)
- Challenge in parallelism: heavy data dependencies
- Challenge in memory performance: indirect memory accesses result in poor data locality
• Recursive Control Flow
- Conflicts between SIMD architecture and control dependencies
- Recursion support: SSE (no), OpenCL (no), CUDA (yes)
• Generalized Reduction and Stencil Computation
- Need to be reconsidered when harnessing thousands of threads
Thesis Work

[Overview diagram: application patterns (generalized reduction, irregular reduction, stencil computation, recursive applications) mapped onto parallel architectures (GPU, CPU + GPU, APU, Xeon Phi)]
Thesis Work
• Approaches for Parallelizing Reductions on GPUs (HiPC 2010)
• An Execution Strategy and Optimized Runtime Support for Parallelizing Irregular Reductions on Modern GPUs (ICS 2011)
• Porting Irregular Reductions on Heterogeneous CPU-GPU Configurations (HiPC 2011)
• Runtime Support for Accelerating Applications on an Integrated CPU-GPU Architecture (SC 2012)
• Efficient Scheduling of Recursive Control Flow on GPUs (ICS 2013)
• A Programming System for Xeon Phis with Runtime SIMD Parallelization (ICS 2014)
Outline
• Latest work
- A Programming System for Xeon Phis with Runtime SIMD Parallelization
• Other work
- Strategy and runtime support for irregular reductions on GPUs
- Improved parallelism of SIMD for recursive applications on GPUs
- Different strategies for generalized reduction on GPUs
- Task scheduling frameworks
  • Decoupled GPU + CPU
  • Coupled GPU + CPU
• Conclusion
Intel Xeon Phi Architecture
SIMD - Single Instruction Multiple Data

Scalar Loop:
for(i = 0; i < n; ++i)
    A[i] = A[i] + B[i];

SIMD Loop:
for(i = 0; i < n; i += 4)
    A[i:(i+4)] = A[i:(i+4)] + B[i:(i+4)];

The MMX, SSE, AVX, and IMCI instruction sets provide such vector operations.
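As a concrete sketch, the two loops above can be written in portable C++; the fixed-length inner loop below stands in for one 4-wide vector instruction (a real SSE/AVX/IMCI version would use intrinsics instead). Function names are illustrative, not from the thesis.

```cpp
#include <cstddef>

// Scalar loop: one addition per iteration.
void add_scalar(float* A, const float* B, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        A[i] = A[i] + B[i];
}

// "SIMD" loop emulated in portable C++: the fixed 4-iteration inner loop
// mimics A[i:(i+4)] += B[i:(i+4)]; a vectorizing compiler maps such a
// loop onto a single vector instruction.
void add_simd4(float* A, const float* B, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
        for (std::size_t lane = 0; lane < 4; ++lane)  // one vector op
            A[i + lane] += B[i + lane];
    for (; i < n; ++i)                                // scalar remainder
        A[i] += B[i];
}
```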
GPU vs. Xeon Phi

Device           GFlops  Memory size  Memory speed  Power  Core frequency  # of cores
Tesla C2070      1030    6 GB         1.5 GHz       238 W  1.15 GHz        448
Xeon Phi SE10P   1070    8 GB         5.5 GHz       300 W  1.1 GHz         61
Contributions
• MIMD and SIMD Programming System
- MIMD Parallelism
  • Pattern-knowledge-based parallel framework
  • Task partitioning & scheduling strategies
  • Data reorganization to help SIMD parallelism
  • Automatic MIMD API
- SIMD Parallelism
  • Automatic SIMD API
  • Control flow dependency resolution
  • Data dependency resolution
MIMD and SIMD Challenges for Different Patterns

Application Pattern    MIMD Challenge                 SIMD Challenge
Generalized Reduction  Job partitioning               Unaligned/non-unit-stride access; control flow dependency; data dependency/conflicts
Stencil Computation    Job partitioning               Unaligned/non-unit-stride access; control flow dependency
Irregular Reduction    Job partitioning; load balance Unaligned/random memory access; control flow dependency; data dependency/conflicts
MIMD Parallel System

[Diagram: a Task (configuration: task type, parameter type, kernel function) passes through Data Reorganization, is split by the Task Partitioner into jobs, and the jobs are dispatched by the Job Scheduler]
MIMD Parallel System
• Task Partitioner
- Computation Space Partitioning
  • Achieves good load balance
  • Needs data replication or locking to resolve data dependencies
  • Used for generalized reduction and stencil computation
- Reduction Space Partitioning
  • No data dependencies between partitions
  • Introduces computation redundancy and load imbalance, which can be reduced by appropriate partitioning and reordering methods
  • Used for irregular reduction
• Job Scheduler
- Static, dynamic, or user-defined strategies
• API Interface
- Simple interface
  • run(Task): register a Task with the MIMD framework and start running
  • join(): block and wait for execution to finish
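A minimal sketch of what a run()/join() pair of this shape might look like, using std::thread and a static partitioning of the iteration space into jobs. The Task layout and names here are illustrative assumptions, not the system's actual interface.

```cpp
#include <algorithm>
#include <functional>
#include <thread>
#include <vector>

// Illustrative Task: an iteration space of size n and a kernel that
// processes one job, i.e. a half-open range [begin, end).
struct Task {
    int n;
    std::function<void(int, int)> kernel;
};

struct Framework {
    std::vector<std::thread> workers;

    // run(): partition [0, n) into num_threads jobs (static schedule)
    // and hand each job to a worker thread.
    void run(const Task& task, int num_threads) {
        int chunk = (task.n + num_threads - 1) / num_threads;
        for (int t = 0; t < num_threads; ++t) {
            int b = t * chunk, e = std::min(task.n, b + chunk);
            if (b < e) workers.emplace_back(task.kernel, b, e);
        }
    }

    // join(): block until every job has finished.
    void join() {
        for (auto& w : workers) w.join();
        workers.clear();
    }
};
```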
SIMD Data Types

Data Type    Name                       Description
Scalar Type  int, float, double, ...    Data shared by all SIMD lanes; all basic data types and temporary variables are scalar types.
Vector Type  vint, vfloat, vdouble, ... Holds multiple data elements, one per SIMD lane.
Mask Type    mask                       Helps handle control flow in vectorization.
SIMD Data Types

int s;            s holds a single value
vint v;           v holds one value per SIMD lane (value0 ... value3)
int s = s1;       scalar assignment
vint v = v1;      lane-wise assignment (v1-0 ... v1-3)
vint v = s;       broadcast: s is copied into every lane
vint array[2];    vector array: lanes v0-0 ... v0-3, v1-0 ... v1-3
SIMD API

(v1, v2: vint; s: int; m: mask; op: any supported arithmetic or logic operation)

Assignment API - assignment between vector types, and scalar to vector:
  v1 = v2; v1 = s; v1 op= v2; v1 op= s;
Arithmetic API - most arithmetic operations (+, -, *, /, %) between vector and scalar types:
  v1 = v2 op v1; v1 = v2 op s;
Logic API - most comparison operations (==, !=, <, >, <=, >=) between vector and scalar types; the return type is mask:
  m = v1 op v2; m = v1 op s;
Load/Store API - continuous memory access:
  void load(void *src);  v1.load(addr);
  void store(void *dst); v1.store(addr);
Gather/Scatter - non-continuous memory access:
  void load(void *src, const vint &index, int scale);  // load src[index[i]*scale] into v[i]
  void store(void *dst, const vint &index, int scale); // store v[i] to dst[index[i]*scale]
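To make the API concrete, here is a small portable sketch of how such overloaded vector types could be emulated: a 4-lane vint backed by a plain array rather than real IMCI registers. The names vint and mask follow the slides; everything else (lane width, member layout) is an assumption for illustration.

```cpp
// Emulated SIMD width; real IMCI holds 16 ints per 512-bit register.
const int W = 4;

struct vmask { bool m[W]; };   // one flag per lane

struct vint {
    int v[W];
    vint() {}
    vint(int s) { for (int i = 0; i < W; ++i) v[i] = s; }  // broadcast

    // Arithmetic API: lane-wise add.
    vint operator+(const vint& o) const {
        vint r;
        for (int i = 0; i < W; ++i) r.v[i] = v[i] + o.v[i];
        return r;
    }
    // Logic API: lane-wise compare against a scalar, returning a mask.
    vmask operator<(int s) const {
        vmask r;
        for (int i = 0; i < W; ++i) r.m[i] = v[i] < s;
        return r;
    }
    // Gather: load src[index[i]*scale] into lane i.
    void load(const int* src, const vint& index, int scale) {
        for (int i = 0; i < W; ++i) v[i] = src[index.v[i] * scale];
    }
    // Continuous store of all lanes.
    void store(int* dst) const {
        for (int i = 0; i < W; ++i) dst[i] = v[i];
    }
};
```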
Sample Kernels - Kmeans

Sequential code:
Input: float *data; int i; // the index of one node
// Iterate over k clusters to compute the distance to each cluster
for(int j = 0; j < k; ++j){
  float dis = 0.0;
  for(int m = 0; m < 3; ++m) {
    dis += (data[i*3+m]-cluster[j*3+m])*(data[i*3+m]-cluster[j*3+m]);
  }
  dis = sqrt(dis);
}

SIMD API code:
Input: vfloat *data; int i; // the index of one node
// Iterate over k clusters to compute the distance to each cluster
for(int j = 0; j < k; ++j){
  vfloat dis = 0.0;
  for(int m = 0; m < 3; ++m) {
    dis += (data[i+m*n]-cluster[j*3+m])*(data[i+m*n]-cluster[j*3+m]);
  }
  dis = sqrt(dis);
}
Sample Kernels - Sobel

Sequential code:
Input: (i, j) the index of one node, float b[N][N]
float Dx = 0.0;
float Dy = 0.0;
// Compute the weight for a node in a 3x3 area
for(int p = -1; p <= 1; p++){
  for(int q = -1; q <= 1; q++){
    Dx += wH[p+1][q+1]*b[i+p][j+q];
    Dy += wV[p+1][q+1]*b[i+p][j+q];
  }
}
float z = sqrt(Dx*Dx + Dy*Dy);
a[i][j] = z;

SIMD API code:
Input: (i, j) the index of one node, vfloat b[N][N]
vfloat Dx = 0.0;
vfloat Dy = 0.0;
// Compute the weight for a node in a 3x3 area
for(int p = -1; p <= 1; p++){
  for(int q = -1; q <= 1; q++){
    Dx += wH[p+1][q+1]*b[X(i,p,q)][Y(j,p,q)];
    Dy += wV[p+1][q+1]*b[X(i,p,q)][Y(j,p,q)];
  }
}
vfloat z = sqrt(Dx*Dx + Dy*Dy);
z.store(&a[i][j]);
Sample Kernels - Sobel

Sequential code: same as on the previous slide.

Manual vectorization code:
Input: (i, j) the index of one node, float b[N][N]
__m512 Dx = _mm512_set1_ps(0.0);
__m512 Dy = _mm512_set1_ps(0.0);
// Compute the weight for a node in a 3x3 area
for(int p = -1; p <= 1; ++p){
  for(int q = -1; q <= 1; ++q){
    __m512 *tmp = (__m512*)&b[i+q][j+p*vec_width];
    __m512 tmpx = _mm512_mul_ps(*tmp, wH[p+1][q+1]);
    Dx = _mm512_add_ps(Dx, tmpx);
    __m512 tmpy = _mm512_mul_ps(*tmp, wV[p+1][q+1]);
    Dy = _mm512_add_ps(Dy, tmpy);
  }
}
__m512 sqDx = _mm512_mul_ps(Dx, Dx);
__m512 sqDy = _mm512_mul_ps(Dy, Dy);
__m512 ret = _mm512_add_ps(sqDx, sqDy);
ret = _mm512_sqrt_ps(ret);
_mm512_store_ps(&a[i][j], ret);
Data Layout Reorganization
• Why data reorganization?
- Vectorization only supports continuous and aligned memory accesses
- Gather/Scatter APIs can load and store non-continuous and unaligned data, but they add more than 60% overhead and introduce programming effort for users

[Chart: Kmeans (k = 100) execution time (s) vs. thread count (1-256), Gather/Scatter vs. Data reorganization]
Data Layout Reorganization
• Non-unit-stride memory access
- AoS (Array of Structures) to SoA (Structure of Arrays)
- Structure{x, y, z}: [x0, y0, z0, x1, y1, z1, ...] => [x0, x1, ..., y0, y1, ..., z0, z1, ...]
• Non-aligned memory access
- Non-linear data layout transformation proposed in "Data Layout Transformation for Stencil Computations on Short-Vector SIMD Architectures"
- Provide functions to obtain the indices of the adjacent nodes after the transformation
• Irregular/indirect memory access
- Edge reordering method proposed by Bin Ren
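The AoS-to-SoA transformation above can be sketched in a few lines; after it, each field of the structure becomes a unit-stride stream suitable for vector loads (this matches the data[i+m*n] indexing in the Kmeans SIMD kernel). The function name is illustrative.

```cpp
#include <vector>

// Convert [x0,y0,z0, x1,y1,z1, ...] (AoS) into
// [x0,x1,..., y0,y1,..., z0,z1,...] (SoA) so each field is contiguous.
std::vector<float> aos_to_soa(const std::vector<float>& aos, int fields) {
    int n = (int)aos.size() / fields;       // number of records
    std::vector<float> soa(aos.size());
    for (int i = 0; i < n; ++i)             // record index
        for (int f = 0; f < fields; ++f)    // field index
            soa[f * n + i] = aos[i * fields + f];
    return soa;
}
```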
Control Flow Resolution
• IMCI first introduces operations with mask parameters
- mask: a bitset (1: active lanes, 0: inactive lanes)
- old value: the default value for inactive lanes
• Example:
  if(v < 3) v = 0; else v = 1;

  vint v = [0, 2, 4, 6];
  mask = v < 3;     // mask = [1, 1, 0, 0]: first two lanes active
  old = 1;
  v = 0;            // v = [0, 0, 1, 1]
• How do we pass the mask and old value to our operator overloading API?
Mask Operation Implementation
• IMCI implementation
- Mask operations have a different API from unmasked operations
  • _mm512_add_ps(v1, v2) vs. _mm512_mask_add_ps(old, mask, v1, v2)
• Our considerations
- Operator overloading functions do not support extra parameters
- Minimize the complexity of the API
• The mask status is part of the execution state
- It exists during the whole thread life cycle - static
- It is private to each thread - thread local

Mask Status Class (MS):
Member                                  Description
static __thread mask m                  mask variable
static __thread type old_val            the default value for inactive lanes
static void set_mask(mask m, type val)  set or update the mask and old value
static void clear_mask()                clear the old value and set all lanes to active
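A portable sketch of this idea: the mask and old value live in thread-local statics, so a masked assignment can consult them without taking extra parameters. Lane count and the masked_assign helper are illustrative assumptions; the real system applies the mask inside its operator overloads via _mm512_mask_* intrinsics.

```cpp
const int W = 4;  // emulated SIMD width

// Thread-local mask status, following the MS class on the slide.
struct MS {
    static thread_local bool m[W];    // true: active lane, false: inactive
    static thread_local int old_val;  // value for inactive lanes
    static void set_mask(const bool* mask, int val) {
        for (int i = 0; i < W; ++i) m[i] = mask[i];
        old_val = val;
    }
    static void clear_mask() {
        for (int i = 0; i < W; ++i) m[i] = true;
    }
};
thread_local bool MS::m[W] = {true, true, true, true};
thread_local int MS::old_val = 0;

// Masked assignment: active lanes take the new value, inactive lanes take
// old_val -- the semantics of a _mm512_mask_* operation with an "old" operand.
void masked_assign(int* v, int s) {
    for (int i = 0; i < W; ++i) v[i] = MS::m[i] ? s : MS::old_val;
}
```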
Mask Operation Implementation
• Example revisited with the MS class:
  if(v < 3) v = 0; else v = 1;

  // explicit mask and old value:
  vint v = [0, 2, 4, 6]; mask = v < 3; old = 1; v = 0;

  // with the MS class:
  vint v = [0, 2, 4, 6]; MS::set_mask(v < 3, 1); v = 0;
Mask Operation Implementation
• When should our API use mask operations?
- Always use mask operations
  • No effort for users
  • Introduces extra performance overhead
- Let the user decide when to use mask operations
  • Two vector types: unmasked vector type and masked vector type
  • Change from unmasked to masked type by calling the mask() function

[Chart: Sobel (16384 x 16384) execution time (s) vs. thread count (1-256), API vs. API-Mask]
Mask Operation Implementation
• Example with the masked vector type:
  if(v < 3) v = 0; else v = 1;

  // explicit mask and old value:
  vint v = [0, 2, 4, 6]; mask = v < 3; old = 1; v = 0;

  // always-masked API:
  vint v = [0, 2, 4, 6]; MS::set_mask(v < 3, 1); v = 0;

  // user-controlled masking via mask():
  vint v = [0, 2, 4, 6]; MS::set_mask(v < 3, 1); v.mask() = 0;
Data Dependency Resolution
• Write conflicts in reduction
- Undefined behavior when different SIMD lanes update the same location
• Serialized reduction:

template<class ReducFunc = func>
void reduction(int *update, int scale, int offset, vint *index, type value){
  for(int i = 0; i < simd_width; ++i){
    ReducFunc()(update[index[i]*scale+offset], value[i]);
  }
}

[Diagram: lanes targeting elements A, C, A, B and B, D applied one lane at a time]
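A runnable version of the serialized reduction template above, in portable C++ (vint replaced by a plain index array, simd_width fixed at 4, and Add standing in for func; these substitutions are assumptions for illustration). Applying the lanes one at a time makes two lanes that hit the same location update it correctly, instead of racing as a vector scatter would.

```cpp
const int W = 4;  // emulated simd_width

// Example reduction functor standing in for the slide's "func".
struct Add {
    void operator()(int& dst, int v) const { dst += v; }
};

// Serialized scatter-reduction: lane i updates
// update[index[i]*scale + offset] with value[i], one lane at a time,
// so duplicate indices are accumulated rather than lost.
template <class ReducFunc>
void reduction(int* update, int scale, int offset,
               const int* index, const int* value) {
    for (int i = 0; i < W; ++i)
        ReducFunc()(update[index[i] * scale + offset], value[i]);
}
```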
Data Dependency Resolution
• Reorder reduction: reorder updates so that lanes targeting the same element are adjacent (A C A B B D => A A B B C D)
- Still needs gather and scatter for non-continuous memory access
• Group reduction: group the updates targeting the same element (A A B B C D => A B A B C D) so each group can be reduced without conflicts
Experiment Evaluation
• Environment configuration
- Xeon Phi SE10P with 61 cores at 1.1 GHz
- 8 GB GDDR5 memory
- Running in native mode
- Intel ICC 13.1.0 with the -O3 option
• Benchmarks
- Generalized reduction: Kmeans, NBC
- Stencil computation: Sobel, Heat3D
• Configurations compared
- SIMD-API
- SIMD-Manual
- Pthread-vec: -vec option
- Pthread-novec: -no-vec option
- OpenMP-vec: -vec option with #pragma vector always
- OpenMP-novec: -no-vec option
Speedup - Generalized Reduction
• SIMD-API achieves better performance than Pthread-novec and Pthread-vec
• SIMD-API introduces very little overhead compared to SIMD-Manual
• Kmeans: the performance of compiler vectorization depends on K
• NBC: a large number of divergences limits the performance of compiler vectorization
Speedup - Stencil Computation
• Sobel: compiler vectorization fails due to the complicated nested loops
• Heat3D: compiler vectorization achieves the best performance, very similar to our API version
• Unaligned access is the major challenge for vectorization
• The compiler can also auto-vectorize stencil computations under some conditions
Scalability
Comparison with OpenMP
• MIMD + SIMD parallelism
- More than 3x speedup for most applications
• MIMD parallelism
- Gains better performance due to the pattern-knowledge-based task partitioning and scheduling strategies
Irregular Reduction
• Irregular applications
- Unstructured grid pattern
- Random and indirect memory accesses
• Molecular Dynamics
- Indirection array -> edges (interactions)
- Reduction objects -> molecules (attributes)
- Computation space -> interactions between molecules
- Reduction space -> attributes of molecules
Main Issues
• Traditional strategies are not effective
- Full Replication (private copy per thread)
  • Large memory overhead
  • Requires both intra-block and inter-block combination
  • Shared memory usage is unlikely
- Locking Scheme (private copy per block)
  • Heavy conflicts within a block
  • Avoids intra-block combination, but not inter-block combination
  • Shared memory is only available for small data sets
• Need to choose a partitioning strategy
- Make sure the data fits in shared memory
- Choice of partitioning space (computation vs. reduction)
- Tradeoff: partitioning overhead vs. execution efficiency
Contributions
• A novel Partitioning-Based Locking strategy
- Efficient shared memory utilization
- Eliminates both intra- and inter-block combination
• Optimized runtime support
- Multi-dimensional partitioning method
- Reordering & updating components for correctness and memory performance
• Significant performance improvements
- Exhaustive evaluation
- Up to 3.3x improvement over traditional strategies
Choice of Partitioning Space
• Two partitioning choices:
- Computation space: partition on edges
- Reduction space: partition on nodes
Computation Space Partitioning
• Partitioning on the iterations of the computation loop
• Pros:
- Load balance on computation
• Cons:
- Unequal reduction size in each partition
- Replicated reduction elements (4 out of 16 nodes are replicated)
- Combination cost
- Shared memory is infeasible

[Diagram: a 16-node grid split into partitions 1-4 along edges, with replicated nodes on the partition boundaries]
Reduction Space Partitioning
• Partitioning on the reduction elements
• Pros:
- Balanced reduction space
- Independence between any two partitions
- Avoids combination cost
- Shared memory is feasible
• Cons:
- Imbalance in the computation space
- Replicated work caused by the crossing edges

[Diagram: the 16-node grid split into partitions 1-4 on nodes, with crossing edges between partitions]
Reduction Space Partitioning - Challenges
• Unbalanced & replicated computation
- The partitioning method must balance cost and efficiency
  • Cost: execution time of the partitioning method
  • Efficiency: reducing the number of crossing edges (replicated work)
• Maintaining correctness on the GPU
- Reorder the reduction space
- Update/reorder the computation space
Runtime Partitioning Approaches
• Metis partitioning (multi-level k-way partitioning)
- Executes sequentially on the CPU
- Minimizes crossing edges
- Cons: large overhead for data initialization (high cost)
• GPU-based (trivial) partitioning
- Parallel execution on the GPU
- Minimizes execution time
- Cons: large number of crossing edges among partitions (low efficiency)
• Multi-dimensional partitioning (coordinate information)
- Executes sequentially on the CPU
- Balances cost and efficiency
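The core mechanics of reduction-space partitioning can be sketched as follows: nodes are split evenly into partitions, each edge is assigned to the partition(s) owning its endpoints, and a crossing edge is replicated into both partitions (the replicated work the slides describe), so no partition ever writes another partition's nodes. The even node split is a simplifying assumption standing in for the Metis/multi-dimensional methods.

```cpp
#include <utility>
#include <vector>

// Assign each edge (u, v) to the partitions owning u and v.
// Nodes [0, num_nodes) are split into num_parts contiguous blocks.
std::vector<std::vector<std::pair<int,int>>>
partition_edges(const std::vector<std::pair<int,int>>& edges,
                int num_nodes, int num_parts) {
    int per = (num_nodes + num_parts - 1) / num_parts;  // nodes per partition
    std::vector<std::vector<std::pair<int,int>>> parts(num_parts);
    for (auto e : edges) {
        int pu = e.first / per, pv = e.second / per;
        parts[pu].push_back(e);
        if (pv != pu) parts[pv].push_back(e);  // crossing edge: replicate
    }
    return parts;
}
```

Each partition can then run its reduction entirely in its own (shared-memory-resident) slice of the reduction space, at the price of re-executing crossing edges.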
Performance Gains

Molecular Dynamics: comparison between Partitioning-Based Locking (PBL), Locking, Full Replication, and sequential CPU time

Euler: comparison between Partitioning-Based Locking (PBL), Locking, Full Replication, and sequential CPU time
Current Recursion Support on Modern GPUs
• Problem definition: intra-warp thread scheduling
- Each thread in a warp executes an independent recursive function call
• SIMT (Single Instruction Multiple Thread): different from traditional SSE extensions
- Each thread owns a stack for function calls
- Branch control: different branches execute serially
• Recursion support on current GPUs
- AMD GPUs and the OpenCL programming model
  • Do not support recursion
- NVIDIA GPUs
  • Support recursion from compute capability 2.0 and SDK 3.1
  • Performance is limited by the reconvergence method - immediate post-dominator reconvergence
Immediate Post-Dominator Reconvergence in Recursion

/* General Branch */
Fib(n){
  if(n < 2) {
    return 1;
  } else {
    x = Fib(n-1);
    y = Fib(n-2);
    return x+y;
  }
}

• Reconvergence can only happen at the same recursion level
• Threads with a shorter branch cannot return until the threads with a longer branch come back to the reconvergence point
Contributions
• Dynamic reconvergence method for recursion
- Reconvergence can happen before or after the immediate post-dominator
• Two implementations
- Dynamic greedy reconvergence
- Dynamic majority reconvergence
• Significant performance improvements
- Exhaustive evaluation on six recursive benchmarks with different characteristics
- Up to 6x improvement over the immediate post-dominator method
Dynamic Reconvergence Mechanisms

[Diagram: threads T0 and T1 diverging and reconverging across recursion levels]
Dynamic Reconvergence Mechanisms
• Remove the static reconvergence at the immediate post-dominator
• Dynamic reconvergence implementations
- Greedy reconvergence
  • Convergence group: the group of threads that have the same PC (program counter) address as the currently active threads
  • Greedily reconverge threads in the same convergence group
  • Reconvergence can cross recursion levels
- Majority-based reconvergence
  • Majority group: the largest group of threads with the same PC
  • Tends to maximize IPC (instructions per cycle)
  • Threads in the minority have a chance to join the majority later
- Other dynamic reconvergence mechanisms
  • Greedy-return: schedule threads in the return branch first
  • Majority-threshold: prevent starvation of threads in the minority group
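The scheduling decision at the heart of majority-based reconvergence can be sketched as: among the PCs the warp's threads want to execute next, pick the PC shared by the largest group, so the most lanes are active this cycle. This is an illustrative host-side sketch (fixed 4-thread warp), not the hardware scheduler itself.

```cpp
#include <map>

const int W = 4;  // warp size in this sketch (real warps have 32 threads)

// Given each thread's next PC, return the PC of the majority group --
// the PC that, if scheduled, activates the most lanes this cycle.
int majority_pc(const int* pc) {
    std::map<int, int> count;           // PC -> number of threads at that PC
    for (int i = 0; i < W; ++i) ++count[pc[i]];
    int best = pc[0], best_n = 0;
    for (const auto& kv : count)
        if (kv.second > best_n) { best_n = kv.second; best = kv.first; }
    return best;
}
```

Threads not at the chosen PC are masked off for the cycle; because the choice is re-made every scheduling step, minority threads can later join a new majority, which is how reconvergence can cross recursion levels.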
Different Strategies for Generalized Reductions on GPUs
• Tradeoffs between Full Replication and the Locking Scheme
• Hybrid Scheme
- Balances Full Replication and the Locking Scheme
- Introduces an intermediate scheduling layer, the "group", under the thread block
  • Intra-group: Locking Scheme
  • Inter-group: Full Replication
- The benefits of the Hybrid Scheme vary with group size
  • Reduced memory overhead
  • Better use of shared memory
  • Reduced combination cost & conflicts
• Extensive evaluation of different applications with different parameters
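The hybrid scheme's structure can be sketched sequentially: threads are split into groups, each group shares one private copy of the reduction array (the intra-group locking side, applied serially here instead of with atomics), and the group copies are combined at the end (the inter-group replication side). The histogram kernel and all names are illustrative assumptions, not the thesis code.

```cpp
#include <vector>

// Hybrid reduction sketch: num_threads threads in groups of group_size,
// computing a histogram of data[i] % bins.
// - priv[g] is the single copy shared by group g (intra-group "locking",
//   serialized here so the sketch stays race-free).
// - The final loop combines the group copies (inter-group "replication").
std::vector<int> hybrid_reduce(const std::vector<int>& data,
                               int num_threads, int group_size, int bins) {
    int groups = (num_threads + group_size - 1) / group_size;
    std::vector<std::vector<int>> priv(groups, std::vector<int>(bins, 0));
    for (int t = 0; t < num_threads; ++t)            // each thread's share
        for (std::size_t i = t; i < data.size(); i += num_threads)
            priv[t / group_size][data[i] % bins] += 1;  // group-shared copy
    std::vector<int> out(bins, 0);                   // inter-group combine
    for (int g = 0; g < groups; ++g)
        for (int b = 0; b < bins; ++b)
            out[b] += priv[g][b];
    return out;
}
```

Group size 1 degenerates to Full Replication (one copy per thread, no conflicts, maximal memory); one group per block degenerates to the Locking Scheme (one copy, maximal conflicts), which is exactly the tradeoff the slide describes.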
Porting Irregular Reductions onto Heterogeneous CPU-GPU Architectures
• A multi-level partitioning framework
- Parallelizes irregular reductions on heterogeneous architectures
  • Coarse-grained level: tasks between CPU and GPU
  • Fine-grained level: tasks between thread blocks or threads
  • Both use reduction space partitioning
- Eliminates the device memory limitation on the GPU
• Runtime support scheme
- A pipelining scheme overlaps partitioning and computation on the GPU
- A work-stealing-based scheduling strategy provides load balance and increases the pipeline length
• Significant performance improvements
- 11% and 22% improvement for Euler and Molecular Dynamics
Accelerating Applications on an Integrated CPU-GPU
• Thread-block-level scheduling framework on the Fusion APU
- Work scheduling targets thread blocks (not devices)
- Only one kernel launch at the beginning -> small command-launch overhead
- No synchronization between devices or thread blocks
- Inter- and intra-device load balance achieved by fine-grained and factoring scheduling policies
• Lock-free implementations based on shared memory
- Master-worker scheduling
- Token scheduling
• Applications with different communication patterns
- Stencil computation (Jacobi): 1.6x
- Generalized reduction (Kmeans): 1.92x
- Irregular reduction (Molecular Dynamics): 1.15x
Future Work
• Extension of the current SIMD programming system
- Support SIMD vectors with different lengths
  • Hide the different implementations for different vector lengths
  • Simulate new IMCI functions on older SIMD instruction sets
  • Support varied vector lengths at runtime
- Support the CPU-MIC heterogeneous environment
  • Shared memory between CPU and MIC
  • Global synchronization on the MIC without involving the host
Conclusions
• Different strategies for generalized reductions on GPUs
- Approaches for Parallelizing Reductions on Modern GPUs (HiPC 2010, Best Student Paper)
• Strategy and runtime support for irregular reductions
- An Execution Strategy and Optimized Runtime Support for Parallelizing Irregular Reductions on Modern GPUs (ICS 2011, Best Paper Nomination)
• Task scheduling frameworks for heterogeneous architectures
- Porting Irregular Reductions on Heterogeneous CPU-GPU Configurations (HiPC 2011)
- Accelerating MapReduce on a Coupled CPU-GPU Architecture (SC 2012)
• Improved parallelism of SIMD for recursion on GPUs
- Efficient Scheduling of Recursive Control Flow on GPUs (ICS 2013)
• MIMD and SIMD parallel framework and API for Xeon Phi
- A Programming System for Xeon Phis with Runtime SIMD Parallelization (ICS 2014)
Acknowledgements

Advisor: Dr. Gagan Agrawal
Committee members: Dr. P. Sadayappan, Dr. Feng Qin, Dr. Radu Teodorescu
Mentor at PNNL: Dr. Sriram Krishnamoorthy
Current lab members: Dr. Bin Ren, Dr. Tekin Bicer, Linchuan Chen, Yu Su, Yi Wang, Mehmet Can Kurt, Mucahid Kutlu, Jiaqi Liu, Sameh Shohdy
Previous lab members: Dr. Wenjing Ma, Dr. Vignesh Ravi, Dr. David Chiu, Dr. Qian Zhu, Dr. Fan Wang, Dr. Wei Jiang, Dr. Tantan Liu, Jiedan Zhu

Thanks for your attention! Q & A