University of Illinois at Urbana-Champaign
Memory Architectures for Protein Folding:
MD on a Million PIM Processors
Fort Lauderdale, May 03
L.V. Kale: MD on very Large PIM machines
Overview
EIA-0081307: “ITR: Intelligent Memory Architectures and Algorithms to Crack the Protein Folding Problem”
PIs:
– Josep Torrellas and Laxmikant Kale (University of Illinois)
– Mark Tuckerman (New York University)
– Michael Klein (University of Pennsylvania)
– Also associated: Glenn Martyna (IBM)
Period: 8/00 - 7/03
Project Description
Multidisciplinary project spanning computer architecture, software, and computational biology
Goals:
– Design improved algorithms to help solve the protein folding problem
– Design the architecture and software of general-purpose parallel machines that speed up the solution of the problem
Some Recent Progress: Ideas
Developed REPSWA (Reference Potential Spatial Warping Algorithm), a novel algorithm for accelerating conformational sampling in molecular dynamics, a key element in protein folding
– Based on a "spatial warping" variable transformation. This transformation is designed to shrink barrier regions on the energy landscape and grow attractive basins without altering the equilibrium properties of the system
– Result: large gains in sampling efficiency
– Z. Zhu, M. E. Tuckerman, S. O. Samuelson, and G. J. Martyna, "Using novel variable transformations to enhance conformational sampling in molecular dynamics," Phys. Rev. Lett. 88, 100201 (2002)
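The general mechanism behind such a transformation can be sketched as follows. This is a generic sketch of Jacobian-corrected variable-transformation sampling, not the specific REPSWA construction from the paper:

```latex
% Change of variables x = f(u) with Jacobian J_f(u).
% Sampling u under the effective potential
%   \tilde{V}(u) = V(f(u)) - k_B T \ln \left| \det J_f(u) \right|
% leaves canonical averages unchanged, since
\langle A \rangle
  = \frac{\int A(x)\, e^{-\beta V(x)}\, dx}{\int e^{-\beta V(x)}\, dx}
  = \frac{\int A(f(u))\, e^{-\beta \tilde{V}(u)}\, du}
         {\int e^{-\beta \tilde{V}(u)}\, du}.
```

Choosing f to compress barrier regions and dilate attractive basins flattens the landscape the sampler actually sees, while the Jacobian term guarantees that equilibrium averages are preserved.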
Some Recent Progress: Tools
Developed LeanMD, a parallel molecular dynamics program that targets very large-scale parallel machines
– Research-quality program based on the Charm++ parallel object-oriented language
– Descendant of NAMD (another parallel molecular dynamics application), which achieved unprecedented speedup on thousands of processors
– LeanMD is designed to run on next-generation parallel machines with tens of thousands or even millions of processors, such as Blue Gene/L or Blue Gene/C
– Requires a new parallelization strategy that breaks the simulation problem up in a more fine-grained manner, generating enough parallelism to effectively distribute work across a million processors
Some Recent Progress: Tools
Developed a high-performance communication library
– For the collective communication operations all-to-all personalized communication, all-to-all multicast, and all-reduce
These operations can be complex and time-consuming on large parallel machines
Especially costly for applications that involve all-to-all patterns, such as 3-D FFT and sorting
– The library optimizes collective communication operations by combining messages along an imposed virtual topology
– The overhead of all-to-all communication for 76-byte message exchanges among 2058 processors is in the low tens of milliseconds
Some Recent Progress: People
The following graduate student researchers have been supported:
– Sameer Kumar (University of Illinois)
– Gengbin Zheng (University of Illinois)
– Jun Nakano (University of Illinois)
– Zhongwei Zhu (New York University)
Overview
Rest of the talk:
– Objective: develop a molecular dynamics program that will run effectively on a million processors, each with a low memory-to-processor ratio
– Method: use the parallel objects methodology; develop an emulator/simulator that allows one to run full-fledged programs on a simulated architecture
– Presenting today: simulator details; LeanMD simulation on BG/L and BG/C
Performance Prediction on Large Machines
Problem:
– How to predict performance of applications on future machines?
– How to do performance tuning without continuous access to a large machine?
Solution:
– Leverage virtualization
– Develop a machine emulator
– Simulator: accurate time modeling
– Run a program on “100,000 processors” using only hundreds of processors
Blue Gene Emulator: functional view
[Figure: per-node structure of the emulator. Each emulated node contains worker threads, communication threads, an inBuff, affinity message queues, non-affinity message queues, and a CorrectionQ; the nodes sit on top of the Converse scheduler and Converse Q.]
Emulator to Simulator
Emulator:
– Study the programming model and application development
Simulator:
– Performance prediction capability
– Models communication latency based on a network model
– Does not model on-chip memory access or network contention
Parallel performance is hard to model
– Communication subsystem: out-of-order messages, communication/computation overlap
– Event dependencies
Parallel Discrete Event Simulation
– The emulation program executes in parallel, with event time-stamp correction
– Exploits the inherent determinacy of the application
How to simulate?
Time-stamping events
– Per-thread timer (sharing one physical timer)
– Time-stamp messages; calculate communication latency based on a network model
Parallel event simulation
– When a message is sent out, calculate its predicted arrival time at the destination BlueGene processor
– When a message is received, update the current time as: currTime = max(currTime, recvTime)
– Time-stamp correction
Parallel correction algorithm
Sort message executions by receive time
Adjust time stamps when needed
Use correction messages to announce a change in an event's startTime
Send correction messages along the path the original message was sent
Events already in the timeline may have to move
Timestamps Correction
[Figure sequence: an execution timeline holds messages M1 through M7 ordered by receive time. A late message M8 arrives and is inserted into the timeline at its receive time; correction messages (e.g., for M4) are then sent to shift the start times of affected events, and events already in the timeline move accordingly.]
LeanMD
LeanMD is a molecular dynamics simulation application written in Charm++
Next generation of NAMD
– The Gordon Bell Award winner at SC2002
Requires a new parallelization strategy
– Break the problem up in a more fine-grained manner to effectively distribute work across an extremely large number of processors
LeanMD Performance Analysis
Need readable graphs:
– One graph per page is fine, but use larger fonts and thicker lines