Post on 29-Jan-2016
Introduction to Research 2007
Ashok Srinivasan
Florida State University
www.cs.fsu.edu/~asriniva
Recent collaborators
V. Aggarwal, J. Kolhe, L. Ji, M. Mascagni, H. Nymeyer, and Y. Yu
Florida State University
S. Kapoor
IBM Austin
S. Namilae
Oak Ridge National Lab
M. Krishna, A. Kumar, N. Jayam, G. Senthilkumar, P. K. Baruah, and R. Sharma
Sri Sathya Sai University, India
N. Chandra
University of Nebraska at Lincoln
Research support
Funding
DoD, FSU, NSF
Computer time
IBM, NCSA, NERSC, ORNL
Outline
Research Areas
Computational Nanotechnology
Computational Biology
High Performance Computing on Multicore Processors
Potential Research Topics
Graduate Courses
Research Areas
High Performance Computing, Applications in Computational Sciences, Scalable Algorithms, Mathematical Software
Current topics: Computational Nanotechnology, Computational Biology, HPC on Multicore Processors
New Topics: Dynamic Data Driven Applications
Old Topics: Computational Finance, Parallel Random Number Generation, Monte Carlo Linear Algebra, Computational Fluid Dynamics, Image Compression
Importance of Parallel Computing
Makes feasible products based on a more fundamental understanding of science
  Examples: Nanotechnology, Medicine
Increasing relevance to industry
  In 1993, fewer than 30% of the top 500 supercomputers were commercial
  Now, over 50% are commercial
  Finance and insurance
  Medicine
  Aerospace and automobiles
  Telecom
  Oil exploration
  Shoes! (Nike)
  Potato chips!
  Toys!
Architectural Trends
Massive parallelism
  10K processor systems will be commonplace
  Large end already has over 100K processors
Single chip multiprocessing
  All processors will be multicore
  Heterogeneous multicore processors
    Cell, used in the PS3
  80-core processor from Intel
  Processors with hundreds of cores are already commercially available
Distributed environments, such as the Grid
But it is hard to get good performance on these systems
Computational Nanotechnology
Example application: Carbon nanotube (CNT)
  Can span 23,000 miles without failing due to its own weight
  100 times stronger than steel
  Lighter than a feather
  Conducts heat better than diamond
Computations are used to understand materials at the atomic scale, so that better materials can be designed
  Easier than experimentation at the nanometer scale
CNT Tensile Test
Pull the CNT at constant speed
  Determine material properties from the force-displacement response
Computational difficulties
  Time step size ~ $10^{-15}$ s
  Desired time range is much larger
    A million time steps are required to reach $10^{-9}$ s
    ~ 500 hours of computing for ~ 40K atoms using GROMACS
  MD uses an unrealistically large pulling speed
    1 to 10 m/s instead of $10^{-7}$ to $10^{-5}$ m/s
  Results at unrealistic speeds are unrealistic!
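The time-scale gap above can be checked with a few lines of arithmetic. This is only an illustrative sketch: the function names are made up for this example, and the inputs are the round figures quoted on the slide.

```python
TIME_STEP_S = 1e-15  # MD time step quoted on the slide (about 1 fs)


def steps_needed(simulated_time_s, time_step_s=TIME_STEP_S):
    """Number of MD time steps needed to cover the given simulated time."""
    return round(simulated_time_s / time_step_s)


def displacement_m(pull_speed_m_per_s, simulated_time_s):
    """Distance the CNT end is pulled during the simulated time."""
    return pull_speed_m_per_s * simulated_time_s


one_ns = 1e-9
print("steps to reach 1 ns:", steps_needed(one_ns))
print("stretch at 1 m/s over 1 ns (m):", displacement_m(1.0, one_ns))
print("stretch at 1e-7 m/s over 1 ns (m):", displacement_m(1e-7, one_ns))

# 500 hours for a million steps is the figure quoted for ~40K atoms
print("seconds of computing per step:", 500 * 3600 / steps_needed(one_ns))
```

In a nanosecond of simulated time, a pulling speed of 1 m/s stretches the tube by about a nanometer, while a realistic speed of $10^{-7}$ m/s would move it by only about $10^{-16}$ m, far less than an atomic spacing. That is why MD runs resort to unrealistically large speeds.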
Difficulty with Parallelization
Results on scalable code
  Does not scale efficiently beyond 10 ms/iteration
If we want to simulate to a ms
  Time step 1 fs, so $10^{12}$ iterations
  $10^{10}$ s ≈ 300 years
If we scaled to 10 µs per iteration
  4 months of computing time
NAMD, 327K atom ATPase PME, Blue Gene, IPDPS 2006
NAMD, 92K atom ApoA1 PME, Blue Gene, IPDPS 2006
IBM Blue Matter, 43K Rhodopsin, Blue Gene, Tech Report 2005
Desmond, 92K atom ApoA1, SC 2006
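The wall-clock estimates on this slide follow from multiplying the iteration count by the time per iteration. The sketch below reproduces them. It assumes the second scenario means 10 microseconds per iteration, which is the value consistent with the quoted 4 months; the helper name is hypothetical.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600
SECONDS_PER_MONTH = 30 * 24 * 3600


def wall_clock_seconds(simulated_time_s, time_step_s, seconds_per_iteration):
    """Total computing time = number of iterations x time per iteration."""
    iterations = simulated_time_s / time_step_s
    return iterations * seconds_per_iteration


# Simulate 1 ms with a 1 fs time step: 1e12 iterations
slow = wall_clock_seconds(1e-3, 1e-15, 10e-3)  # 10 ms per iteration
fast = wall_clock_seconds(1e-3, 1e-15, 10e-6)  # assumed 10 microseconds per iteration

print("at 10 ms/iteration: %.0f years" % (slow / SECONDS_PER_YEAR))
print("at 10 us/iteration: %.1f months" % (fast / SECONDS_PER_MONTH))
```

Even a thousandfold improvement in time per iteration leaves a millisecond-scale simulation at months of computing, which motivates parallelizing along the time axis instead of only across atoms.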
Data Driven Time Parallelization
Each processor simulates a different time interval
Initial state is obtained by prediction, using prior data (except for processor 0)
Verify if prediction for end state is close to that computed by MD
Prediction is based on dynamically determining a relationship between the current simulation and those in a database of prior results
If time interval is sufficiently large, then communication overhead is small
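The predict-then-verify idea can be sketched on a toy problem. Everything below is an assumption made for illustration, not the method used in the actual CNT or protein runs: the "MD" is explicit Euler on dx/dt = -x, the state is a single number, the predictor is an arbitrary callable standing in for the database of prior results, and the "processors" are simulated by a serial loop.

```python
def simulate(state, steps, dt=0.01):
    """Toy stand-in for an MD integrator: explicit Euler on dx/dt = -x."""
    for _ in range(steps):
        state = state + dt * (-state)
    return state


def time_parallel_pass(start, predict, n_intervals, steps, tol):
    """One predict-and-verify pass over n_intervals time intervals.

    predict(i) returns a guessed initial state for interval i (i >= 1).
    Returns (last verified state, number of intervals accepted).
    """
    # Interval 0 starts from the known state; the rest start from predictions.
    starts = [start] + [predict(i) for i in range(1, n_intervals)]
    # Each interval is independent here, so a real code would run these
    # on different processors at the same time.
    ends = [simulate(s, steps) for s in starts]

    state, accepted = start, 0
    for i in range(n_intervals):
        # Interval i began from a verified state, so its end state is trusted.
        state = ends[i]
        accepted += 1
        # Interval i+1 is usable only if its predicted start matches
        # the state actually computed at the end of interval i.
        if i + 1 < n_intervals and abs(ends[i] - starts[i + 1]) > tol:
            break
    return state, accepted


# A perfect predictor (here: the exact answer) lets all intervals be accepted.
perfect = lambda i: simulate(1.0, i * 100)
print(time_parallel_pass(1.0, perfect, 4, 100, 1e-9)[1])

# A poor predictor fails verification after the first interval.
print(time_parallel_pass(1.0, lambda i: 0.0, 4, 100, 1e-9)[1])
```

When verification fails, the computed end state becomes the new verified starting point and the remaining intervals are redone, so the speedup depends directly on how good the predictor is.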
Results
Speedup result
  Red line: Ideal speedup
  Blue: v = 0.1 m/s
  Green: A different predictor
Experimental parameters
  v = 1 m/s, using v = 10 m/s
  CNT with 1000 atoms
  Xeon/Myrinet cluster
Validation
  Compare stress-strain response
  Blue: Exact results
  Red: Time parallel results
  Green: Direct prediction
Computational Biology
Data driven time parallelization in the AFM simulation of proteins
  An order of magnitude improvement in performance by combining conventional and data driven time parallelization with the protein Titin
High Performance Computing on Multicore Processors
A PowerPC core, with 8 co-processors (SPEs) with 256 KB local store each
Shared main memory of 512 MB to 2 GB, which the SPEs can access through DMA
Peak speeds of 204.8 Gflops in single precision and 14.64 Gflops in double precision for SPEs
204.8 GB/s EIB bandwidth, 25.6 GB/s for memory
Two Cell processors can be combined to form a Cell blade with global shared memory
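The single-precision peak quoted for the SPEs can be reproduced from a few hardware parameters. The slides do not give these parameters, so the 3.2 GHz clock and the rate of one 4-wide fused multiply-add per cycle per SPE are assumptions here (they are the commonly cited figures for the Cell).

```python
CLOCK_GHZ = 3.2      # assumption: clock rate is not stated in the slides
NUM_SPES = 8         # from the slide
SIMD_LANES_SP = 4    # assumption: four 32-bit floats per 128-bit SIMD register
FLOPS_PER_FMA = 2    # a fused multiply-add counts as two floating point operations

# Peak single-precision rate across all SPEs, in Gflops
peak_sp_gflops = CLOCK_GHZ * SIMD_LANES_SP * FLOPS_PER_FMA * NUM_SPES
print("peak single precision (Gflops):", peak_sp_gflops)

# Ratio of the two peak figures quoted on the slide
QUOTED_SP_GFLOPS = 204.8
QUOTED_DP_GFLOPS = 14.64
print("single/double precision ratio:", QUOTED_SP_GFLOPS / QUOTED_DP_GFLOPS)

# Total local store across the SPEs, in KB
print("total local store (KB):", NUM_SPES * 256)
```

The roughly 14x gap between the quoted single and double precision peaks is one reason algorithms for this processor often favor single precision where accuracy allows.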
DMA put times
Memory to Memory Copy using:
• SPE local store
• memcpy by PPE
Cell Architecture
Cell MPI Results
PE: Consider SPUs to be a logical hypercube; in each step, each SPU exchanges messages with its neighbor along one dimension
DIS: In step i, SPU j sends to SPU $j + 2^i$ and receives from SPU $j - 2^i$
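The two communication patterns can be written down as partner schedules and checked in plain Python, without any Cell or MPI code. In this sketch, PE is read as a pairwise exchange along hypercube dimensions and DIS as a dissemination pattern; the wrap-around modulo the number of SPUs is an assumption (the slide does not state it), and all function names are hypothetical.

```python
def pairwise_exchange_schedule(p):
    """PE: in step i, SPU j exchanges with its hypercube neighbor j XOR 2**i.

    Assumes p is a power of two.
    """
    assert p > 0 and p & (p - 1) == 0, "p must be a power of two"
    steps = p.bit_length() - 1
    return [[(j, j ^ (1 << i)) for j in range(p)] for i in range(steps)]


def dissemination_schedule(p):
    """DIS: in step i, SPU j sends to SPU (j + 2**i) mod p (wrap-around assumed)."""
    steps = (p - 1).bit_length()  # ceil(log2(p)) for p >= 1
    return [[(j, (j + 2**i) % p) for j in range(p)] for i in range(steps)]


def all_informed(schedule, p):
    """Check that, after all steps, every SPU has heard (transitively) from every other."""
    known = [{j} for j in range(p)]
    for step in schedule:
        updated = [set(k) for k in known]
        for src, dst in step:
            updated[dst] |= known[src]  # dst learns everything src knew before this step
        known = updated
    return all(k == set(range(p)) for k in known)


for p in (8, 16):
    print(p, all_informed(pairwise_exchange_schedule(p), p),
          all_informed(dissemination_schedule(p), p))
```

Both patterns finish in about log2(P) steps, which is what makes them suitable for a barrier: no SPU can complete until it has transitively heard from all the others.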
Comparison of MPI_Barrier on different hardware (times in µs)
P     Cell (PE)   Xeon/Myrinet   NEC SX-8   SGI Altix BX2
8     0.4         10             13         3
16    1.0         14             5          5
MPI_Barrier timing
Broadcast bandwidth
Potential Research Topics
Computational Biology
  Data Driven Time Parallelization
  Markov State Modeling
  Other topics
Dynamic Data Driven Applications
  Combining simulations and experiments in superplastic forming
High Performance Computing on Multicore Processors
  Algorithms and libraries on the Cell processor
    Examples: Sorting, linear algebra, etc.
  Good software cache/code overlaying implementations
Other possible new directions
  Applications in history, linguistics, medicine, etc.
Graduate Courses
Parallel Computing, Spring 2008
  MPI and OpenMP programming on traditional parallel machines
  Threaded programming on multicore processors
  Parallel algorithms
Advanced Algorithms, Fall 2008
  Approximation algorithms for NP-hard problems
  Randomized algorithms
  Cache-aware algorithms