Applications and Runtime for multicore/manycore
March 21, 2007
Geoffrey Fox
Community Grids Laboratory, Indiana University
505 N Morton, Suite 224, Bloomington IN
[email protected]
http://grids.ucs.indiana.edu/ptliupages/presentations/
Pradeep K. Dubey, [email protected]
RMS: Recognition Mining Synthesis (Intel)
Tomorrow:
• Recognition ("What is …?"): model-based multimodal recognition
• Mining ("Is it …?"): find a model instance; real-time analytics on dynamic, unstructured, multimodal datasets
• Synthesis ("What if …?"): create a model instance; photo-realism and physics-based animation
Today: model-less; real-time streaming and transactions on static, structured datasets; very limited realism
Discussed in seminars at http://grids.ucs.indiana.edu/ptliupages/presentations/PC2007/; the rest is mainly classic parallel computing.
Intel’s Application Stack
Some Bioinformatics Datamining
• 1. Multiple Sequence Alignment (MSA) – Kernel algorithms: HMM (Hidden Markov Model); pairwise alignments (dynamic programming) with heuristics (e.g. progressive, iterative methods) – see the sketch after this list
• 2. Motif Discovery – Kernel algorithms: MEME (Multiple Expectation Maximization for Motif Elicitation); Gibbs sampler
• 3. Gene Finding (Prediction) – Hidden Markov Methods
• 4. Sequence Database Search – Kernel algorithms: BLAST (Basic Local Alignment Search Tool); PatternHunter; FASTA
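As a concrete illustration of the pairwise-alignment kernel named above, here is a minimal dynamic-programming sketch in Python; the scoring scheme and test sequences are illustrative assumptions, not values from the talk.

    # Minimal sketch (not from the talk): global pairwise alignment scoring by
    # dynamic programming, the kernel underlying the MSA heuristics above.
    # The match/mismatch/gap scores are illustrative assumptions.

    def align_score(a, b, match=1, mismatch=-1, gap=-2):
        """Needleman-Wunsch style DP: returns the best global alignment score."""
        rows, cols = len(a) + 1, len(b) + 1
        score = [[0] * cols for _ in range(rows)]
        for i in range(1, rows):            # gap penalties along the borders
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
        for i in range(1, rows):
            for j in range(1, cols):
                diag = score[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
                score[i][j] = max(diag, score[i-1][j] + gap, score[i][j-1] + gap)
        return score[-1][-1]

    if __name__ == "__main__":
        print(align_score("GATTACA", "GCATGCU"))   # small illustrative pair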
Berkeley Dwarfs
• Dense Linear Algebra
• Sparse Linear Algebra
• Spectral Methods
• N-Body Methods
• Structured Grids
• Unstructured Grids
• Pleasingly Parallel
• Combinatorial Logic
• Graph Traversal
• Dynamic Programming
• Branch & Bound
• Graphical Models (HMM)
• Finite State Machine
Consistent in spirit with the Intel analysis. I prefer to develop a few key applications rather than debate their classification!
Client-side Multicore Applications
• "Lots of not very parallel applications"
• Gaming; graphics; codec conversion for multiple-user conferencing …
• Complex data querying and data manipulation/optimization/regression; database and datamining (including computer vision) (Recognition and Mining in the Intel analysis)
  – Statistical packages as in Excel and R
• Scenario and model simulations (Synthesis for Intel)
• Multiple users give several server-side multicore applications
• There are important architecture issues, including memory bandwidth, not discussed here!
Approach I
• Integrate Intel, Berkeley and other sources, including database applications (successful on current parallel machines, like scientific applications), and define parallel approaches in a "white paper"
• Develop some key examples testing 3 parallel programming paradigms:
  – Coarse-grain functional parallelism (as in workflow), including pleasingly parallel instances with different data – see the sketch after this list
  – Fine-grain functional parallelism (as in integer programming)
  – Data parallel (loosely synchronous, as in science)
• Construct them so they can use different run times, including perhaps CCR/DSS, MPI, Data Parallel .NET
• Maybe these will become libraries used as in MapReduce, workflow coordination languages, …
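A minimal sketch of the first paradigm, coarse-grain (pleasingly parallel) functional parallelism, in Python; the work function and datasets are placeholders, not examples from the talk.

    # Illustrative sketch only: the same function applied independently to
    # different data chunks, one worker per core, with no inter-task messages.
    from concurrent.futures import ProcessPoolExecutor

    def analyze(chunk):
        # stand-in for an independent task (e.g. one datamining run per dataset)
        return sum(x * x for x in chunk)

    if __name__ == "__main__":
        datasets = [list(range(i, i + 10000)) for i in range(0, 80000, 10000)]
        with ProcessPoolExecutor(max_workers=4) as pool:   # one worker per core
            results = list(pool.map(analyze, datasets))
        print(results)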
Approach II
• Have looked at CCR in MPI-style applications
  – It seems to work quite well and supports more general messaging models
• NAS benchmarks using CCR to confirm its utility
• Developing 4 exemplar multicore parallel applications:
  – Support Vector Machines (linear algebra) – data parallel
  – Deterministic Annealing (statistical physics) – data parallel (see the sketch after this list)
  – Computer chess or mixed integer programming – fine-grain parallelism
  – Hidden Markov Method (genetic algorithms) – loosely coupled functional parallelism
• Test high-level coordination of such parallel applications in libraries
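To make the deterministic annealing exemplar concrete, here is a minimal Python sketch of one annealing sweep for clustering; the data, centres and cooling schedule are illustrative assumptions and this is not the talk's code.

    # One deterministic-annealing sweep: soft assignments p(k|x) ~ exp(-d(x,c_k)/T)
    # followed by a weighted centre update; lowering T anneals from fuzzy to hard
    # clustering. The per-point soft assignments are the data-parallel part.
    import math

    def da_sweep(points, centres, T):
        """One expectation/maximisation-style sweep at temperature T (1-D data)."""
        weights = [[0.0] * len(points) for _ in centres]
        for j, x in enumerate(points):
            e = [math.exp(-((x - c) ** 2) / T) for c in centres]
            z = sum(e)
            for k in range(len(centres)):
                weights[k][j] = e[k] / z               # soft assignment p(k|x)
        return [sum(w * x for w, x in zip(weights[k], points)) / sum(weights[k])
                for k in range(len(centres))]          # new weighted centres

    points = [0.1, 0.2, 0.15, 5.0, 5.2, 4.9]            # illustrative data
    centres = [0.0, 1.0]
    for T in [10.0, 1.0, 0.1, 0.01]:                    # cooling schedule
        for _ in range(20):
            centres = da_sweep(points, centres, T)
    print(centres)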
CCR for Data Parallel (Loosely Synchronous) Applications
• CCR supports general coordination of messages queued in ports, in Handler or Rendezvous mode
• DSS builds a service model on CCR and supports coarse-grain functional parallelism
• Basic CCR supports fine-grain parallelism as in computer chess (and use of STM-enabled primitives?)
• MPI has well-known collective communication operations that supply scalable global synchronization etc.
• Look at the performance of MPI_Sendrecv – see the sketch after this list
• What model encompasses the best shared- and distributed-memory approaches for "data parallel" problems? This could be put on top of CCR
• Much faster internal versions of CCR
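For reference, the MPI_Sendrecv pattern mentioned above looks like this in a Python (mpi4py) sketch; the talk's measurements used CCR and native MPI, so mpi4py, the ring topology and the payload here are assumptions for illustration only.

    # Each rank sends to its right neighbour and receives from its left neighbour
    # in one combined, deadlock-free call — the 1-D ring/torus exchange.
    # Run with e.g.:  mpiexec -n 4 python sendrecv_ring.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
    right = (rank + 1) % size
    left = (rank - 1) % size

    payload = {"from": rank}                       # tiny illustrative message
    received = comm.sendrecv(payload, dest=right, sendtag=0,
                             source=left, recvtag=0)
    print(f"rank {rank} received {received} from rank {left}")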
[Diagram: four thread/port message patterns among Thread0–Thread3 and Port0–Port3 – (a) Pipeline, (b) Shift, (c) Two Shifts, (d) Exchange]
Four Communication Patterns used in CCR Tests. (a) and (b) use CCR Receive while (c) and (d) use CCR Multiple Item Receive
Used on AMD 4-core, Xeon 4-core and Xeon 8-core systems; the latter supports up to 8-way parallelism. An exchange-pattern sketch follows.
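The following is an analogy sketch only: Python queues stand in for CCR ports (this is not the CCR API). It shows the Exchange pattern from the diagram above: each of 4 threads posts a message to every other thread's port, then waits until it has received one message from every peer for the current stage.

    import threading, queue

    N, STAGES = 4, 3
    ports = [queue.Queue() for _ in range(N)]        # one inbound port per thread

    def worker(me):
        pending = []                                 # messages that arrived early
        for stage in range(STAGES):
            for other in range(N):                   # "write exchanged messages"
                if other != me:
                    ports[other].put((stage, me))
            received, inbox = set(), pending
            pending = []
            while len(received) < N - 1:             # "read messages" rendezvous
                s, sender = inbox.pop(0) if inbox else ports[me].get()
                if s == stage:
                    received.add(sender)
                else:
                    pending.append((s, sender))      # belongs to a later stage
        print(f"thread {me} completed {STAGES} exchange stages")

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(N)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()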
[Diagram: exchanging messages with a 1D Torus Exchange topology for loosely synchronous execution in CCR – each of four threads writes exchanged messages to its neighbours' ports and then reads the messages addressed to it]
Break a single computation into a varying number of stages, with per-stage computation ranging from 1.4 microseconds to 14 seconds on the AMD (1.6 microseconds to 16 seconds on the Xeon quad core).
[Plot: time (seconds) versus stages (millions)]
Fixed amount of computation (4×10^7 units) divided across 4 cores and from 1 to 10^7 stages on the HP Opteron multicore; each stage is separated by reading and writing CCR ports in Pipeline mode. 4-way Pipeline pattern, 4 dispatcher threads, HP Opteron, with 1.4 microseconds of computation per stage; a companion measurement used 14 microseconds of computation per stage.
8.04 microseconds of overhead per stage, averaged from 1 to 10 million stages. Overhead = measured time minus the computation component expected if there were no overhead.
Stage Overhead versus Thread Computation Time
Overhead per stage is constant up to about a million stages and then increases; the measurement idea is sketched below.
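The following Python sketch illustrates the measurement idea (it is not the CCR harness): a fixed amount of work is split into a chosen number of stages per thread, with a barrier between stages standing in for the CCR port write/read, and the overhead per stage is (measured time minus single-stage time) divided by the number of stages. CPython threads share the GIL, so this only demonstrates the methodology, not real 4-core speedup.

    import threading, time

    CORES, TOTAL_UNITS = 4, 4_000_000                # illustrative sizes

    def spin(units):
        s = 0
        for i in range(units):                       # fixed computation "units"
            s += i
        return s

    def run(stages):
        barrier = threading.Barrier(CORES)
        per_stage = TOTAL_UNITS // (CORES * stages)
        def worker():
            for _ in range(stages):
                spin(per_stage)
                barrier.wait()                       # per-stage synchronization
        threads = [threading.Thread(target=worker) for _ in range(CORES)]
        start = time.perf_counter()
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return time.perf_counter() - start

    baseline = run(1)                                # essentially pure compute
    for stages in (10, 100, 1000):
        overhead = (run(stages) - baseline) / stages
        print(f"{stages:5d} stages: {overhead * 1e6:8.1f} microseconds per stage")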
[Plot, repeated: average run time (millions of microseconds) versus Maxstage (thousands, 0–1200) – partial result]
4.63 microseconds per stage, averaged from 1 to 1 million stages. Time (seconds) versus stages (millions, 0–1.0). 4-way Pipeline pattern, 4 dispatcher threads, HP Opteron.
Stage computation ranging from 14 seconds down to 14 microseconds.
[Plot: time (seconds) versus stages (millions)]
Fixed amount of computation (4×10^7 units) divided across 4 cores and from 1 to 10^7 stages on a Dell 2-processor, 2-core-per-processor Xeon multicore; each stage is separated by reading and writing CCR ports in Pipeline mode. Time (seconds) versus stages (millions). 12.40 microseconds per stage, averaged from 1 to 10 million stages. 4-way Pipeline pattern, 4 dispatcher threads, Dell Xeon. Overhead = measured time minus the computation component expected if there were no overhead.
Summary of Stage Overheads for the AMD 2-core, 2-processor Machine
These are stage-switching overheads for a set of runs with different levels of parallelism and different message patterns; each stage takes about 28 microseconds (500,000 stages).

Stage Overhead (microseconds)      Number of Parallel Computations
                                    1       2       3       4       8
Straight Pipeline  match          0.77     2.4     3.6     5.0     8.9
                   default         3.6     4.7     4.4     4.5     8.9
Shift              match           N/A     3.3     3.4     4.7    11.0
                   default         N/A     5.1     4.2     4.5     8.6
Two Shifts         match           N/A     4.8     7.7     9.5    26.0
                   default         N/A     8.3     9.0     9.7    24.0
Exchange           match           N/A    11.0    15.8    18.3   Error
                   default         N/A    16.8    18.2    18.6   Error
Summary of Stage Overheads for the Intel 2-core, 2-processor Machine
These are stage-switching overheads for a set of runs with different levels of parallelism and different message patterns; each stage takes about 30 microseconds. AMD overheads are in parentheses. These measurements are equivalent to MPI latencies.

Stage Overhead (microseconds)      Number of Parallel Computations
                                    1            2            3            4            8
Straight Pipeline  match          1.7 (0.77)   3.3 (2.4)    4.0 (3.6)    9.1 (5.0)    25.9 (8.9)
                   default        6.9 (3.6)    9.5 (4.7)    7.0 (4.4)    9.1 (4.5)    16.9 (8.9)
Shift              match          N/A          3.4 (3.3)    5.1 (3.4)    9.4 (4.7)    25.0 (11.0)
                   default        N/A          9.8 (5.1)    8.9 (4.2)    9.4 (4.5)    11.2 (8.6)
Two Shifts         match          N/A          6.8 (4.8)    13.8 (7.7)   13.4 (9.5)   52.7 (26.0)
                   default        N/A          23.1 (8.3)   24.9 (9.0)   13.4 (9.7)   31.5 (24.0)
Exchange           match          N/A          28.0 (11.0)  32.7 (15.8)  41.0 (18.3)  Error
                   default        N/A          34.6 (16.8)  36.1 (18.2)  41.0 (18.6)  Error
Summary of Stage Overheads for the Intel 4-core, 2-processor Machine
These are stage-switching overheads for a set of runs with different levels of parallelism and different message patterns; each stage takes about 30 microseconds. 2-core, 2-processor Xeon overheads are in parentheses. These measurements are equivalent to MPI latencies.

Stage Overhead (microseconds)      Number of Parallel Computations
                                    1            2            3            4            8
Straight Pipeline  match          1.33 (1.7)   4.2 (3.3)    4.3 (4.0)    4.7 (9.1)    6.5 (25.9)
                   default        6.3 (6.9)    8.4 (9.5)    9.8 (7.0)    9.5 (9.1)    6.5 (16.9)
Shift              match          N/A          4.3 (3.4)    4.5 (5.1)    5.1 (9.4)    7.2 (25.0)
                   default        N/A          8.3 (9.8)    10.2 (8.9)   10.0 (9.4)   7.2 (11.2)
Two Shifts         match          N/A          7.5 (6.8)    6.8 (13.8)   8.4 (13.4)   22.8 (52.7)
                   default        N/A          20.3 (23.1)  30.4 (24.9)  27.3 (13.4)  23.0 (31.5)
Exchange           match          N/A          26.6 (28.0)  23.6 (32.7)  21.4 (41.0)  33.1 (error)
                   default        N/A          31.3 (34.6)  38.7 (36.1)  46.0 (41.0)  33.5 (error)
[Histogram: frequency versus overhead in microseconds (averaged over 500,000 consecutive synchronizations), XP-Pro]
8-way Parallel Pipeline on two 4-core Xeons
• Histogram of 100 runs (see the sketch below); each run has 500,000 synchronizations following a thread execution that takes 33.92 microseconds, so the overhead of 6.1 microseconds is modest
• Message size is just one integer
• Choose a computation unit that is appropriate for a few microseconds of stage overhead
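A short Python sketch of how such a histogram is built (not the talk's code): collect one average-overhead value per run and bin the per-run averages. The simulated overhead values are placeholders standing in for real timed synchronizations.

    import random
    from collections import Counter

    RUNS, BIN = 100, 0.2                          # 100 runs, 0.2 µs histogram bins

    def measure_run():
        # placeholder for one run's average overhead (µs) over 500,000 real
        # synchronizations; a real harness would time the run and divide
        return random.gauss(6.1, 0.3)

    averages = [measure_run() for _ in range(RUNS)]
    histogram = Counter(round(avg / BIN) * BIN for avg in averages)
    for bucket in sorted(histogram):
        print(f"{bucket:4.1f} µs: {'*' * histogram[bucket]}")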
[Histogram: frequency versus overhead in microseconds (averaged over 500,000 consecutive synchronizations), AMD 4-way, 27.94 microsecond computation unit, XP Pro]
8-way Parallel Shift on two 4-core Xeons
• Histogram of 100 runs; each run has 500,000 synchronizations following a thread execution that takes 33.92 microseconds, so the overhead of 8.2 microseconds is modest
• Shift versus Pipeline adds about a microsecond to the cost
• It is unclear what causes the second peak
[Histograms: frequency versus overhead in microseconds (averaged over 500,000 consecutive synchronizations) on XP-Pro, Vista, and AMD 4-way with a 27.94 microsecond computation unit (XP Pro)]
8-way Parallel Double Shift on two 4-core Xeons
• Histogram of 100 runs; each run has 500,000 synchronizations following a thread execution that takes 33.92 microseconds, so the overhead of 22.3 microseconds is significant; it is unclear why the double shift is slow compared to the shift
• Exchange performance partly reflects the number of messages
• Opteron overheads are significantly lower than Intel's
[Histograms: frequency versus overhead in microseconds (averaged over 500,000 consecutive synchronizations), XP-Pro and AMD 4-way with a 27.94 microsecond computation unit (XP Pro)]
AMD 2-core, 2-processor Bandwidth Measurements
Previously we measured latency, since those measurements used small messages. We made a further set of bandwidth measurements by exchanging larger messages of different sizes between threads (a timing sketch follows the table below).
We used three types of data structures for receiving data:
• Array in thread equal to message size
• Array outside thread equal to message size
• Data stored sequentially in a large array ("stepped" array)
For AMD and Intel, total bandwidth is 1 to 2 gigabytes/second.
Bandwidths in gigabytes/second summed over 4 cores
Columns: Number of stages; Array Inside Thread (Small, Large); Array Outside Threads (Small, Large); Stepped Array Outside Thread (Small, Large); Approx. Compute Time per stage (µs)

250000 stages:  0.90, 0.96 | 1.08, 1.09 | 1.14, 1.10 | 56.0 µs
2500 stages:    0.89, 0.99 | 1.16, 1.11 | 1.14, 1.13; 1.13 up to 10^7 words | 56.0
5000 stages:    1.19, 1.15 | 2800 µs
200000 stages:  1.15, 1.13; 1.13 up to 10^7 words | 70 µs
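The following Python sketch illustrates the bandwidth-test idea (it is not the CCR code): time copying a payload of double-precision words into a pre-allocated destination and report gigabytes/second. The sizes and the single-thread setup are assumptions; the talk's runs copied between threads on 4 cores.

    import array, time

    def copy_bandwidth(words, repeats=50):
        src = array.array("d", range(words))          # message payload (8 B/word)
        dst = array.array("d", bytes(8 * words))      # "array outside thread"
        t0 = time.perf_counter()
        for _ in range(repeats):
            dst[:] = src                              # the timed copy per stage
        elapsed = time.perf_counter() - t0
        return 8.0 * words * repeats / elapsed / 1e9  # GB/s

    for n in (100_000, 1_000_000):                    # "small" and "large" payloads
        print(f"{n:>9} words: {copy_bandwidth(n):.2f} GB/s")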
Intel 2-core, 2-processor Bandwidth Measurements
For bandwidth, the Intel did better than the AMD, especially when one exploited the on-chip cache with small transfers.
For both AMD and Intel, each stage executed a computational task after copying data arrays of size 10^5 (labeled small), 10^6 (labeled large) or 10^7 double words. The last column is an approximate value in microseconds of the compute time for each stage. Note that copying 100,000 double-precision words per core at a gigabyte/second bandwidth takes 3200 µs. The data to be copied (the message payload in CCR) is fixed, and its creation time is outside the timed process.
Bandwidths in gigabytes/second summed over 4 cores
Columns: Number of stages; Array Inside Thread (Small, Large); Array Outside Threads (Small, Large); Stepped Array Outside Thread (Small, Large); Approx. Compute Time per stage (µs)

250000 stages: 0.84, 0.75 | 1.92, 0.90 | 1.18, 0.90 | 59.5 µs
200000 stages: 1.21, 0.91 | 74.4 µs
5000 stages:   1.75, 1.0 | 2970 µs
2500 stages:   0.83, 0.76 | 1.89, 0.89 | 1.16, 0.89; 59.5
2500 stages:   1.74, 0.9 | 2.0, 1.07 | 1.78, 1.06 | 5950 µs
Typical bandwidth measurements showing the effect of cache as a slope change: 5,000 stages, with run time plotted against the size of the double array copied in each stage from the thread to stepped locations in a large array, on the Dell Xeon multicore. Time (seconds) versus array size (millions of double words); the slope change marks the cache effect. 4-way Pipeline pattern, 4 dispatcher threads, Dell Xeon. Total bandwidth is 1.0 gigabytes/sec up to one million double words and 1.75 gigabytes/sec up to 100,000 double words.
[Plot: average run time (microseconds, 0–350) versus number of round trips (1 to 10000)]
DSS Service Measurements
Timing of the HP Opteron multicore as a function of the number of simultaneous two-way service messages processed (November 2006 DSS release).
CGL measurements of Axis 2 show about 500 microseconds; DSS is 10 times better.
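As a generic analogue of timing two-way service messages (this is not DSS or Axis 2), the Python sketch below runs a local TCP echo service and measures the average round-trip time over many requests; the host, port and message contents are arbitrary assumptions.

    import socket, threading, time

    HOST, PORT, ROUND_TRIPS = "127.0.0.1", 50007, 1000

    def echo_server(ready):
        with socket.socket() as srv:
            srv.bind((HOST, PORT))
            srv.listen(1)
            ready.set()
            conn, _ = srv.accept()
            with conn:
                while data := conn.recv(64):
                    conn.sendall(data)              # two-way: request then response

    ready = threading.Event()
    threading.Thread(target=echo_server, args=(ready,), daemon=True).start()
    ready.wait()

    with socket.socket() as cli:
        cli.connect((HOST, PORT))
        t0 = time.perf_counter()
        for _ in range(ROUND_TRIPS):
            cli.sendall(b"ping")
            cli.recv(64)                            # wait for the echoed response
        avg = (time.perf_counter() - t0) / ROUND_TRIPS
    print(f"average round trip: {avg * 1e6:.1f} microseconds")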