SECTION 3: MODERN COMPUTING: CLOUD, DISTRIBUTED & HIGH PERFORMANCE
DR. ÜMİT V. ÇATALYÜREK, PROFESSOR AND ASSOCIATE CHAIR, Georgia Institute of Technology
JANUARY 27, 2017
The Big Data to Knowledge (BD2K) Guide to the Fundamentals of Data Science
ÜMİT V. ÇATALYÜREK
• A Professor in the School of Computational Science & Engineering in the College of Computing at the Georgia Institute of Technology.
• A recipient of an NSF CAREER award.
• The principal investigator of several awards from the Department of Energy, the National Institutes of Health, & the National Science Foundation.
• An Associate Editor for Parallel Computing, & an editorial board member for IEEE Transactions on Parallel & Distributed Systems & the Journal of Parallel & Distributed Computing.
• A Fellow of IEEE, a member of ACM & SIAM, the Chair of IEEE TCPP for the 2016-2017 term, & the Vice-Chair of ACM SIGBio for the 2015-2018 term.
• Main research areas: parallel computing, combinatorial scientific computing & biomedical informatics.
• More information about Dr. Ümit V. Çatalyürek can be found at http://cc.gatech.edu/~umit.
MODERN COMPUTING: CLOUD, DISTRIBUTED & HIGH PERFORMANCE COMPUTING
Ümit V. Çatalyürek, Professor and Associate Chair
School of Computational Science and Engineering, Georgia Institute of Technology
The BD2K Guide to the Fundamentals of Data Science Series, 27 January 2017
Outline
• HPC: What is it? Why?
• A Crash Course on (HPC) Computer Architecture
  • History of Single “Processor” Performance
  • Taxonomy of Processors, Memory Topology of Parallel Computers
  • Supercomputers
• How to speed up your application?
  • Focus on the common case
  • Pay attention to locality
  • Take advantage of parallelism
• An Example Application: Whole-Slide Histopathology Image Analysis for Neuroblastoma
• Summary
What does High Performance Computing (HPC) mean?
• There is no such thing as “Low Performance Computing”.
• “HPC most generally refers to the practice of aggregating computing power in a way that delivers much higher performance than one could get out of a typical desktop computer or workstation in order to solve large problems in science, engineering, or business.” (insideHPC)
• “HPC allows scientists and engineers to solve complex science, engineering, and business problems using applications that require high bandwidth, enhanced networking, and very high compute capabilities.” (Amazon AWS)
• “HPC is the use of parallel processing for running advanced application programs efficiently, reliably and quickly… The term HPC is occasionally used as a synonym for supercomputing.” (SearchEnterpriseLinux/WhatIs.com)
My Definition of High Performance Computing (HPC)
• Efficient use of computing platforms for running application programs quickly.
• Why do we care about speed?
  • We do not want science to wait for computing.
• Why do we care about efficiency?
  • Efficient use of resources means more resources available to all of us :)
  • Somebody has to pay the bills!
  • When you have an efficient program, it will also be very fast!
• Supercomputing is HPC, but HPC does not mean just supercomputing.
  • For supercomputers, check top500.org (more later).
Computing Today
• Computing = Parallel Computing = HPC
• Any “computer” you touch has parallel processing power:
  • Your laptop’s CPU has at least 2 cores.
  • Your cell phone has 4-8 cores!
• This is a BD2K seminar: data (and hence computational need) is BIG!
  • So big that it does not fit into your computer.
  • So big that it takes too long to compute on your computer.
[Chart: growth of GenBank in megabases on a log scale (1 to 10,000,000), from Dec-1982 to Jul-2015, for GenBank bases and GenBank WGS; annotated with the Oxford Nanopore MinION MkI. Source: http://www.genome.gov/sequencingcosts/]
Outline
• HPC: What is it? Why?
• A Crash Course on (HPC) Computer Architecture
  • History of Single “Processor” Performance
  • Taxonomy of Processors, Memory Topology of Parallel Computers
  • Supercomputers
• How to speed up your application?
  • Focus on the common case
  • Pay attention to locality
  • Take advantage of parallelism
• An Example Application: Whole-Slide Histopathology Image Analysis for Neuroblastoma
• Summary
History of Single “Processor” Performance
[Chart: growth of single-“processor” performance over time, showing the rapid gains of the RISC era followed by the move to multi-cores]
Bandwidth and Latency
• Bandwidth or throughput
  • Total work done in a given time
  • 10,000-25,000X improvement for processors
  • 300-1200X improvement for memory and disks
• Latency or response time
  • Time between start and completion of an event
  • 30-80X improvement for processors
  • 6-8X improvement for memory and disks
Bandwidth and Latency
[Log-log plot of bandwidth and latency milestones]
Flynn’s Taxonomy

Architectures are classified by their instruction and data streams:

                     Single Instruction (SI)          Multiple Instruction (MI)
Single Data (SD)     SISD: single-threaded process    MISD: pipeline architecture
Multiple Data (MD)   SIMD: vector processing          MIMD: shared-/distributed-memory computing
SISD
[Diagram: one processor executing one instruction stream over one data stream, element by element]
SIMD
[Diagram: one instruction stream applied in lockstep across many data lanes (D0, D1, …, Dn)]
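Not from the deck, but as a concrete illustration of the SIMD idea: array libraries such as NumPy express one operation over many data elements at once, which lets the library use the CPU's vector units internally. A minimal sketch, assuming NumPy is installed:

```python
import time
import numpy as np

n = 10_000_000
a = np.random.rand(n)
b = np.random.rand(n)

# SISD-style: one scalar operation per element, via an explicit Python loop.
t0 = time.perf_counter()
c_loop = [a[i] + b[i] for i in range(n)]
t_loop = time.perf_counter() - t0

# SIMD-style: the same addition expressed once over whole arrays.
t0 = time.perf_counter()
c_vec = a + b
t_vec = time.perf_counter() - t0

print(f"loop: {t_loop:.2f}s  vectorized: {t_vec:.3f}s")
```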
GPU (SIMD) Advantage
[Images from W. Dally’s SC10 keynote talk]
MIMD
[Diagram: multiple processors, each running its own instruction stream over its own data]
Memory Topology: Shared
[Diagram: several processors attached to a single shared memory]
a.k.a. SMPs (symmetric multiprocessors)
Memory Topology: Distributed
[Diagram: processor-memory pairs connected by a network]
Memory Topology: Hybrid
[Diagram: multi-processor shared-memory nodes connected by a network]
Memory Topology: Hybrid + Heterogeneous
[Diagram: multi-processor shared-memory nodes, each with a GPU accelerator, connected by a network]
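A toy contrast between the two memory topologies (my sketch, not from the slides), using only the Python standard library: threads share one address space (shared memory), while processes have separate memories and must exchange messages (distributed memory).

```python
import threading
import multiprocessing as mp

def shared_memory_sum(data, out, lock):
    s = sum(data)
    with lock:                 # threads update a shared accumulator directly
        out[0] += s

def distributed_sum(data, queue):
    queue.put(sum(data))       # processes communicate by sending messages

if __name__ == "__main__":
    data = list(range(1_000_000))
    halves = [data[:500_000], data[500_000:]]

    # Shared-memory version: two threads, one shared accumulator.
    out, lock = [0], threading.Lock()
    threads = [threading.Thread(target=shared_memory_sum, args=(h, out, lock))
               for h in halves]
    for t in threads: t.start()
    for t in threads: t.join()

    # Distributed-memory version: two processes, results gathered via a queue.
    q = mp.Queue()
    procs = [mp.Process(target=distributed_sum, args=(h, q)) for h in halves]
    for p in procs: p.start()
    total = sum(q.get() for _ in procs)
    for p in procs: p.join()

    assert out[0] == total == sum(data)
    print(total)
```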
Outline
• HPC: What is it? Why?
• A Crash Course on (HPC) Computer Architecture
  • History of Single “Processor” Performance
  • Taxonomy of Processors, Memory Topology of Parallel Computers
  • Supercomputers
• How to speed up your application?
  • Focus on the common case
  • Pay attention to locality
  • Take advantage of parallelism
• An Example Application: Whole-Slide Histopathology Image Analysis for Neuroblastoma
• Summary
Oxen or Chicken Dilemma
• “If you were plowing a field, which would you rather use? Two strong oxen or 1024 chickens?” (Seymour Cray)
Highlights from Top500
[Charts: highlights from the Top500 list of the world’s fastest supercomputers; see top500.org]
Outline
• HPC: What is it? Why?
• A Crash Course on (HPC) Computer Architecture
  • History of Single “Processor” Performance
  • Taxonomy of Processors, Memory Topology of Parallel Computers
  • Supercomputers
• How to speed up your application?
  • Focus on the common case
  • Pay attention to locality
  • Take advantage of parallelism
• An Example Application: Whole-Slide Histopathology Image Analysis for Neuroblastoma
• Summary
Amdahl’s Law

$$\mathrm{Speedup}_{\mathrm{overall}} = \frac{\mathrm{ExTime}_{\mathrm{old}}}{\mathrm{ExTime}_{\mathrm{new}}} = \frac{1}{\left(1-\mathrm{Fraction}_{\mathrm{enhanced}}\right) + \dfrac{\mathrm{Fraction}_{\mathrm{enhanced}}}{\mathrm{Speedup}_{\mathrm{enhanced}}}}$$

since

$$\mathrm{ExTime}_{\mathrm{new}} = \mathrm{ExTime}_{\mathrm{old}} \times \left[\left(1-\mathrm{Fraction}_{\mathrm{enhanced}}\right) + \frac{\mathrm{Fraction}_{\mathrm{enhanced}}}{\mathrm{Speedup}_{\mathrm{enhanced}}}\right]$$

Best you could ever hope to do:

$$\mathrm{Speedup}_{\mathrm{maximum}} = \frac{1}{1-\mathrm{Fraction}_{\mathrm{enhanced}}}$$
Amdahl’s Law Example:

$$\mathrm{Speedup}_{\mathrm{overall}} = \frac{1}{\left(1-\mathrm{Fraction}_{\mathrm{enhanced}}\right) + \dfrac{\mathrm{Fraction}_{\mathrm{enhanced}}}{\mathrm{Speedup}_{\mathrm{enhanced}}}} = \frac{1}{(1-0.4) + \dfrac{0.4}{10}} = \frac{1}{0.64} = 1.56$$
• A sequence analysis pipeline has a “slow” step which does error correction of the input reads.
• The new CPU is 10X faster, but the server is I/O bound, so 60% of the time is spent waiting for I/O.
• Apparently, it’s human nature to be attracted by “10X faster” rather than keeping in perspective that the overall result is just ~1.6X faster.
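A quick sanity check of that arithmetic (a sketch of mine, not part of the deck):

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when only a fraction of execution time is enhanced."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# 10x-faster CPU, but only 40% of the time is compute (60% is I/O wait):
print(amdahl_speedup(0.4, 10.0))   # 1.5625 -> ~1.6x overall, not 10x
```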
Multiple Sequence Alignment

VTISCTGSSSNIG-AGNHVKWYQQLPG
VTISCTGTSSNIG--SITVNWYQQLPG
LRLSCSSSGFIFS--SYAMYWVRQAPG
LSLTCTVSGTSFD--DYYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNW--YVDG
ATLVCLISDFYPG--AVTVAW--KADS
AALGCLVKDYFPE--PVTVSW--NS-G
VSLTCLVKGFYPS--DIAVEW--ESNG

or

VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWESNG--

• Optimal alignment (dynamic programming): O(2^n ∏|l_i|)
• For 6 sequences of length 100, if the constant is 10^-9 seconds:
  • running time 6.4 × 10^4 seconds (~17.7 hours)
• Add 2 sequences:
  • running time 2.6 × 10^9 seconds (~82.4 years!)
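The running-time estimates above follow directly from that bound; a small check (mine, not the deck’s):

```python
def optimal_msa_seconds(n: int, length: int, constant: float = 1e-9) -> float:
    """Optimal MSA cost model: constant * 2^n * (product of the n sequence lengths)."""
    return constant * (2 ** n) * (length ** n)

print(optimal_msa_seconds(6, 100) / 3600)               # ~17.8 hours
print(optimal_msa_seconds(8, 100) / (3600 * 24 * 365))  # ~81 years
```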
CLUSTAL W
• Based on Higgins & Sharp’s CLUSTAL [Gene88]; a progressive-alignment strategy.
• Pairwise alignment, O(n²l²): a distance matrix is computed using either an approximate method (fast) or dynamic programming (more accurate, slower).
• Computation of the guide tree, O(n³): a phylogenetic tree computed from the distance matrix by iteratively selecting aligned pairs and linking them.
• Progressive alignment, O(nl²): a series of pairwise alignments computed using full dynamic programming to align larger and larger groups of sequences.
  • The order in the guide tree determines the ordering of sequence alignments.
  • At each step, either two sequences are aligned, a new sequence is aligned with a group, or two groups are aligned.
• n: number of sequences in the query; l: average sequence length.
Speeding up CLUSTAL W
[Chart: breakdown of CLUSTAL W execution time on a PIII-650MHz for 25 to 1000 GPCR sequences, split into pairwise alignment, guide tree, and progressive alignment; the pairwise stage dominates as the number of sequences grows]
• Speedup achieved by parallelizing the most time-consuming part: pairwise alignment.
[Chart: speedup of the parallelized CLUSTAL W on 1-8 processors; the pairwise-alignment stage tracks the ideal (linear) speedup closely, while total speedup lags behind]
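Why does this stage parallelize so well? The n(n-1)/2 pairwise comparisons are independent, so they can be farmed out to a pool of workers. A sketch of mine, not the original code; `pair_distance` is a made-up stand-in for the real scoring function:

```python
from itertools import combinations
from multiprocessing import Pool

def pair_distance(pair):
    """Placeholder pairwise score: fraction of mismatched positions."""
    a, b = pair
    n = min(len(a), len(b))
    return sum(x != y for x, y in zip(a, b)) / n

def distance_matrix(seqs, workers=4):
    # Every pair is independent, so the map below scales with the worker count.
    pairs = list(combinations(range(len(seqs)), 2))
    with Pool(workers) as pool:
        dists = pool.map(pair_distance, [(seqs[i], seqs[j]) for i, j in pairs])
    return dict(zip(pairs, dists))

if __name__ == "__main__":
    seqs = ["VTISCTGSSSNIG", "VTISCTGTSSNIG", "LRLSCSSSGFIFS"]
    print(distance_matrix(seqs))
```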
More on Amdahl’s Law
[Chart: speedup vs. number of processors (up to ~9000) for serial fractions of 10%, 5%, 2%, 1%, 0.5%, and 0.1%; each curve flattens at 1/(serial fraction)]
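Those curves come from the multiprocessor form of Amdahl’s law; a short sketch (mine) of how they saturate:

```python
def amdahl_parallel(serial_fraction: float, processors: int) -> float:
    """Speedup on N processors when a fraction of the work stays serial."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)

for s in (0.10, 0.01, 0.001):
    print(f"s={s}: speedup on 8192 procs = {amdahl_parallel(s, 8192):.1f}, "
          f"limit = {1.0 / s:.0f}")
# Even with 8192 processors, a 1% serial fraction caps speedup near 100.
```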
Outline
• HPC: What is it? Why?
• A Crash Course on (HPC) Computer Architecture
  • History of Single “Processor” Performance
  • Taxonomy of Processors, Memory Topology of Parallel Computers
  • Supercomputers
• How to speed up your application?
  • Focus on the common case
  • Pay attention to locality
  • Take advantage of parallelism
• An Example Application: Whole-Slide Histopathology Image Analysis for Neuroblastoma
• Summary
Levels of the Memory Hierarchy (capacity / access time / cost)
• CPU registers: 100s of bytes; 300-500 ps (0.3-0.5 ns)
• L1 and L2 cache: 10s-100s of KB; ~1 ns to ~10 ns; ~$1000s/GB
• Main memory: GBs; 80-200 ns; ~$100/GB
• Disk: 10s of TB; ~10 ms (10,000,000 ns); ~$1/GB
• Tape: effectively infinite; seconds to minutes; ~$1/GB

Data moves between adjacent levels in units of instruction operands (registers), blocks (caches), pages (memory to disk), and files (disk to tape). Upper levels are faster; lower levels are larger.
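This hierarchy is why locality pays off: touching data in the order it is laid out in memory keeps accesses in cache. A small sketch of mine (assuming NumPy) of the classic row-major vs. column-major traversal effect:

```python
import time
import numpy as np

a = np.random.rand(4000, 4000)   # NumPy arrays are row-major (C order)

t0 = time.perf_counter()
s1 = sum(a[i, :].sum() for i in range(a.shape[0]))   # row by row: contiguous reads
t_rows = time.perf_counter() - t0

t0 = time.perf_counter()
s2 = sum(a[:, j].sum() for j in range(a.shape[1]))   # column by column: strided reads
t_cols = time.perf_counter() - t0

# The column-order pass is typically slower, since each access jumps
# a full row-length in memory and misses the cache more often.
print(f"rows: {t_rows:.3f}s  cols: {t_cols:.3f}s")
```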
Locality Aware Remote Visualization
• Scientific and clinical research generate multi-GB to multi-TB of spatially and temporally correlated data.
  • Different spatial and temporal resolutions.
  • Different acquisition modalities, from CT to light microscopy to electron micrography.
  • Example applications: Visible Human, mouse BIRN.
• DataCutter streams data to the MPI-based OSC parallel renderer.
• Setup:
  • Full-color Visible Woman dataset.
  • Super-sampled at 2x for the entire dataset, 4x and 8x for regions of the dataset.
  • Data stored on 20 nodes.
  • 8 rendering nodes and 1 compositing node with texture VR.
  • Remote thin client connected over the internet.
[Slides: system overview, query execution, and the implementation of the OSC parallel renderer (figures only)]
Outline
• HPC: What is it? Why?
• A Crash Course on (HPC) Computer Architecture
  • History of Single “Processor” Performance
  • Taxonomy of Processors, Memory Topology of Parallel Computers
  • Supercomputers
• How to speed up your application?
  • Focus on the common case
  • Pay attention to locality
  • Take advantage of parallelism
• An Example Application: Whole-Slide Histopathology Image Analysis for Neuroblastoma
• Summary
Current and Emerging Scientific Applications

Processing Remotely-Sensed Data (NOAA TIROS-N w/ AVHRR sensor)
• As the TIROS-N satellite orbits, the Advanced Very High Resolution Radiometer (AVHRR) sensor scans perpendicular to the satellite’s track.
• At regular intervals along a scan line, measurements are gathered to form an instantaneous field of view (IFOV). One scan line is 409 IFOVs.
• Scan lines are aggregated into Level 1 data sets.
• A single file of Global Area Coverage (GAC) data represents:
  • ~one full earth orbit (~110 minutes)
  • ~40 megabytes
  • ~15,000 scan lines
Other examples: satellite data processing, DCE-MRI analysis, short sequence mapping, quantum chemistry, image processing, multimedia video surveillance, and Montage.
Application Patterns
• Complex and diverse processing structures
Data analysis applications fall into bag-of-tasks applications and workflows, which may be non-streaming or streaming:
• Bag-of-tasks applications: independent (sequential) tasks that read and write files, farmed out across processors P1…P4 (a minimal sketch follows below).
• Non-streaming workflows: sequential or parallel tasks connected through files.
• Streaming workflows: sequential or parallel tasks connected by data streams.

Taxonomy of Parallelism
• Complex and diverse processing structures; varied parallelism:
  • Bag-of-tasks applications: task-parallelism.
  • Non-streaming workflows: task- and data-parallelism.
  • Streaming workflows: task-, data-, and pipelined-parallelism.
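To make the first category concrete, here is a minimal bag-of-tasks sketch (mine, not from the deck): independent tasks pulled from a pool by whichever worker is free. The per-task computation is a placeholder for real work such as analyzing one file.

```python
from multiprocessing import Pool

def process_task(task_id):
    """Placeholder for an independent unit of work (e.g., one file's analysis)."""
    return task_id, sum(i * i for i in range(100_000))  # stand-in computation

if __name__ == "__main__":
    tasks = range(100)                      # a "bag" of independent tasks
    with Pool(processes=8) as pool:         # workers grab tasks as they free up
        for task_id, result in pool.imap_unordered(process_task, tasks):
            print(task_id, result)
```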
Outline
• HPC: What is it? Why?
• A Crash Course on (HPC) Computer Architecture
  • History of Single “Processor” Performance
  • Taxonomy of Processors, Memory Topology of Parallel Computers
  • Supercomputers
• How to speed up your application?
  • Focus on the common case
  • Pay attention to locality
  • Take advantage of parallelism
• An Example Application: Whole-Slide Histopathology Image Analysis for Neuroblastoma
• Summary
An Example Application: Whole-Slide Histopathology Image Analysis for Neuroblastoma
• Classify biopsy tissue images into different subtypes of prognostic significance.
• Very high resolution slides
  • Divided into smaller tiles.
• Multi-resolution image analysis
  • Mimics the way pathologists perform their analysis.
  • If classification at a lower resolution is not satisfactory, the analysis algorithm is executed at higher resolution(s); hence the dynamic workload.
Why do we need HPC?
• Due to the large sizes of whole-slide images
  • A 120K x 120K image digitized at 40x occupies more than 40 GB.
• The processing time on a single CPU
  • For an image tile of 1K x 1K: ~6 secs w/ Matlab, 850 msecs w/ C++.
  • For a “small” 50K x 50K slide (assuming 50% background): ~20 min.
• In algorithm development
  • Algorithm development is done in Matlab.
  • Requires evaluation of many different techniques, parameters, etc.
• In clinical practice, 8-9 biopsy samples are collected per patient. For an average of 500 neuroblastoma patients treated annually, our biomedical image analysis consumes:
  • On a CPU: 24 months using Matlab and 3.4 months using C++.
  • Can we reduce this to a couple of days or even hours?
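A back-of-the-envelope check of those numbers (my sketch; the per-tile timings and the 50%-background assumption come from the slide, and 8.5 samples per patient is my midpoint of the stated 8-9):

```python
MATLAB_PER_TILE_S = 6.0     # ~6 s per 1K x 1K tile (Matlab, from the slide)
CPP_PER_TILE_S = 0.85       # 850 ms per tile (C++, from the slide)

tiles_per_slide = (50_000 // 1_000) ** 2        # 2500 tiles in a 50K x 50K slide

# One "small" slide, skipping the ~50% background tiles:
print(tiles_per_slide * 0.5 * CPP_PER_TILE_S / 60)   # ~17.7 min (slide: ~20 min)

# Annual load: ~500 patients x ~8.5 samples; all tiles processed:
slides_per_year = 500 * 8.5
secs_matlab = slides_per_year * tiles_per_slide * MATLAB_PER_TILE_S
secs_cpp = slides_per_year * tiles_per_slide * CPP_PER_TILE_S
print(secs_matlab / 86_400 / 30)   # ~24.6 months (slide: 24)
print(secs_cpp / 86_400 / 30)      # ~3.5 months (slide: 3.4)
```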
Computational Infrastructure

[Diagram: a whole-slide image is divided into image tiles (40X magnification); classification labels (Label 1, Label 2, background, undetermined) are assigned per tile to build a classification map; computation units include CPUs (C/C++, SSE), GPUs, and Intel Xeon Phi]
Characterizing the GPU/CPU speed-up

                    Color conversion   Co-occur. matrices   LBP operator   Histogram
Color channels      Three              Three                One            One
Output results      1Kx1K tile         4x4 matrix           1Kx1K tile     256 bins
Comput. weight      Heavy              Average              Heavy          Low
Operator type       Streaming          Iterative            Streaming      Iterative
Data reuse          None               Strong               Little         Strong
Locality access     None               High                 Little         High
Arithm. intensity   Heavy              Low                  Average        Low
Memory access       Low                High                 Average        High
GPU speed-up        166.09x            16.75x               85.86x         8.32x
Effect of Runtime Optimizations

[Charts: homogeneous and heterogeneous base cases; tile recalculation rate = % of tiles recalculated at higher resolution]
• ODDS improves performance even in the base case.
• Using an additional CPU-only machine is more than 3x faster than the GPU-only version.
(Excerpt from Cluster Computing (2012) 15:125-144.)

Table 6: Demand-driven scheduling policies used in Sect. 6

Scheduling policy   Area of effect   Sender queue        Receiver queue      Request size for data buffers
DDFCFS              Intra-filter     Unsorted            Unsorted            Static
DDWRR               Intra-filter     Unsorted            Sorted by speedup   Static
ODDS                Inter-filter     Sorted by speedup   Sorted by speedup   Dynamic
In Table 6 we present three demand-driven policies (where consumer filters only get as much data as they request) used in our evaluation. All these scheduling policies maintain some minimal queue at the receiver side, such that processor idle time is avoided. Simpler policies like round-robin or random do not fit into the demand-driven paradigm, as they simply push data buffers down to the consumer filters without any knowledge of whether the data buffers are being processed efficiently. As such, we do not consider these to be good scheduling methods, and we exclude them from our evaluation.

The First-Come, First-Served (DDFCFS) policy simply maintains FIFO queues of data buffers on both ends of the stream, and a filter instance requesting data will get whatever data buffer is next out of the queue. The DDWRR policy uses the same technique as DDFCFS on the sender side, but sorts its receiver-side queue of data buffers by the relative speedup to give the highest-performing data buffers to each processor. Both DDFCFS and DDWRR have a static value for requests for data buffers during execution, which is chosen by the programmer. For ODDS, discussed in Sect. 5.3, the sender and receiver queues are sorted by speedup and the receiver’s number of requests for data buffers is dynamically calculated at run-time.
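A toy rendering of the difference between these queue disciplines (my sketch, not the paper’s code; buffer names and speedup values are made up): DDFCFS hands out buffers FIFO, while DDWRR/ODDS-style policies sort by each buffer’s relative speedup on the requesting device.

```python
import heapq
from collections import deque

# Each pending data buffer carries its expected relative speedup per device.
buffers = [("tile_a", {"gpu": 80.0,  "cpu": 1.0}),
           ("tile_b", {"gpu": 8.0,   "cpu": 1.0}),
           ("tile_c", {"gpu": 160.0, "cpu": 1.0})]

# DDFCFS-style: plain FIFO, ignores which device is asking.
fifo = deque(buffers)
print("fifo ->", fifo.popleft()[0])          # tile_a, regardless of requester

# DDWRR/ODDS-style: queue sorted by speedup on the requesting device,
# so a GPU request receives the buffer it accelerates the most.
def sorted_pop(queue, device):
    heap = [(-spd[device], name) for name, spd in queue]
    heapq.heapify(heap)
    _, name = heapq.heappop(heap)
    return name

print("by-speedup (gpu) ->", sorted_pop(buffers, "gpu"))   # tile_c (160x)
```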
6.5.1 Homogeneous cluster base case

This section presents the results of experiments run in the homogeneous cluster base case, which consists of a single CPU/GPU-equipped machine. In these experiments, we compared ODDS to DDWRR. DDWRR is the only one used for comparison because it achieved the best performance among the intra-filter task assignment policies (see Sect. 6.3). These experiments used NBIA with asynchronous copy, and 26,742 image tiles with two resolution levels, as in Sect. 6.3, and the tile recalculation rate is varied.

The results, presented in Fig. 17, surprisingly show that even for one processing node ODDS could surpass the performance allowed by DDWRR. The gain due to asynchronous transfers between ODDS and DDWRR at a 20% tile recalculation rate, for instance, is around 23%. The improvements obtained by ODDS are directly related to its ability to better select data buffers that maximize the performance of the target processing units. This occurs even for one processing machine because the data buffers are queued at the sender side for both policies, but ODDS selects the data buffers that maximize the performance of all processors of the receiver, improving the ability of the receiver filter to better assign tasks locally.

[Fig. 17: Homogeneous base case evaluation]
[Fig. 18: Tiles processed by CPU for each communication policy as the recalculation rate is varied]

Figure 18 presents the percentage of tasks processed by the CPU according to the communication policy and tile recalculation rate. As shown, DDFCFS is only able to process a reasonable number of tiles when the recalculation rate is 0%; its contribution to the overall execution is minimal in the other experiments. When analyzing DDWRR and ODDS, on the other hand, both allow the CPU to compute a significant number of tiles for all values of the recalculation rate, which directly explains the performance gap between these policies.
Outline
• HPC: What is it? Why?
• A Crash Course on (HPC) Computer Architecture
  • History of Single “Processor” Performance
  • Taxonomy of Processors, Memory Topology of Parallel Computers
  • Supercomputers
• How to speed up your application?
  • Focus on the common case
  • Pay attention to locality
  • Take advantage of parallelism
• An Example Application: Whole-Slide Histopathology Image Analysis for Neuroblastoma
• Summary
How about Cloud Computing?
• Cloud computing
  • It is not really a “cloud”; it is someone else’s computer!
  • Rent instead of buy.
  • Pay for compute, data storage, and transfer.
  • Our current best bet to enable sharing of large data, workflows, and computational resources.
  • For “most of us,” our best bet to achieve scalability and speed.
• Sample reading:
  • Eric E. Schadt, Michael D. Linderman, Jon Sorenson, Lawrence Lee, and Garry P. Nolan, “Computational solutions to large-scale data management and analysis,” Nature Reviews Genetics 11, 647-657 (September 2010). doi:10.1038/nrg2857
  • http://www.nature.com/nrg/multimedia/compsolutions/slideshow.html
  • See also: Correspondence by Trelles et al. | Correspondence by Schadt et al.
Summary
• How to speed up your application?
• Focus on the common case
  • If only 50% of the execution can be “improved,” the best you can get is a 2x speedup!
• Pay attention to locality
  • Reduce data movement.
  • Move computation to data.
• Take advantage of parallelism
  • Multiple types of parallelism: task-, data-, and pipelined-parallelism.
  • The fastest processor does not mean your application will run fast; find the most suitable architecture.
  • GPUs are good for “regular” computations.
  • GPUs can be up to 10x faster than a multi-core CPU, but in many real-life applications it is usually 3-5x.
QUESTIONS?