1
Multithreaded Programming Concepts
Sugwon Hong
Myongji University
2010. 3. 12
2
Why Multi-Core?
Until recently, increasing the clock frequency was the holy grail for processor designers seeking to boost performance.
But raising clock speed has hit a dead end because of power consumption and overheating.
So designers realized that it is much more efficient to run several cores at a lower frequency than a single core at a much higher frequency.
3
Power and Frequency
(source : Intel Academy program)
4
A little bit of history
In the past, performance scaling in single-core processors was achieved by increasing the clock frequency.
As processors shrank and clock frequencies rose:
Excess power consumption and overheating
Memory access time failed to keep pace with increasing clock frequencies.
5
Instruction/data-level parallelism
Since 1993, processor designers have supported parallel execution at the instruction and data level.
Instruction-level parallelism
Out-of-order execution pipelines and multiple functional units to execute instructions in parallel
Data-level parallelism
Multimedia Extension (MMX) in 1997
Streaming SIMD Extension (SSE)
6
Hyper-Threading
In 2002, Intel used additional copies of execution resources to run two separate threads simultaneously on the same processor core.
This multi-threading idea eventually led to the introduction of dual-core processors in 2005.
7
Evolution of Multi-Core Technology
(source : Intel Academy program)
8
Multi-processor Architecture
Shared memory multiprocessor (SMP)
Non-shared memory architecture
Massively Parallel Processor (MPP)
Cluster

[Diagram: SMP — several CPUs attached to one shared memory; MPP — CPUs each with their own memory, connected by an interconnect]
9
Multi-processors vs. Multi-cores
Shared memory multi-processors (SMP)
Multiple threads on a single core (SMT)
Multiple threads on multi-cores (CMT)
Tricky acronyms:
CMP (Chip Multi-Processor)
SMT (Simultaneous MultiThreading)
CMT (Chip-level MultiThreading)
10
CMT processor products
1st generation: Sun Microsystems (late 2005)
Intel Dual-Core Xeon (2005)
Intel Quad-Core Xeon (late 2006)
AMD Quad-Core Opteron (2007)
8-Core (??)
11
Thread
A thread is a sequential flow of instructions executed within a program.
Thread vs. Process
A single process always has one main thread, which initializes the process and begins executing the instructions.
Any thread can create other threads within a process; the threads share the code and data segments, but each thread has its own stack.
12
Thread in a Process
13
Why use threads?
Threads are intended to improve the performance and responsiveness of a program.
Quick turnaround time
Completing a single job in the smallest amount of time possible
High throughput
Finishing the most tasks in a fixed amount of time
14
Risks of using Threads
But if threads are not used properly, they can degrade performance and sometimes lead to unpredictable behavior and error conditions:
Data races (race conditions)
Deadlock
And other extra burdens:
Code complexity
Portability issues
Testing and debugging difficulty
15
Race condition
It happens when two or more threads access a shared variable at the same time and at least one of them writes to it.
"It is nondeterministic!"
For example, when Thread A and Thread B are both executing the statement:
area = area + 4.0 / (1.0 + x*x)
16
(source : Intel Academy program)
17
How to deal with race conditions
Synchronization
Critical region
Mutual exclusion
18
Concurrency vs. Parallelism
Generally the two terms are used interchangeably, but conventional wisdom draws the following distinction.
Concurrency
Two or more threads are in progress at the same time, normally interleaved on a single processor.
Parallelism
Two or more threads are executed simultaneously on multiple cores.
19
Performance criteria
Speedup
Efficiency
Granularity
Load balance
20
Speedup
The most common quantitative measure compares the execution time of the best serial algorithm with that of the parallel algorithm.
Speedup = Ts/Tp
Ts = Serial Time, Tp = Parallel Time
Amdahl’s Law
Speedup = 1/[S+(1-S)/n + H(n)]
S: fraction of time spent executing the serial portion
H(n) : parallel overhead
n: the number of cores
21
Example
Consider painting a fence. Suppose it takes 30 min to get ready to paint and 30 min for cleanup after painting. Assume that it takes 1 min to paint one single picket and there are 300 pickets. What are the speedups when 1, 2, 10, 100 painters do this job respectively? What is the maximum speedup?
What if you use a spray gun to paint the fence? What happens if the fence owner uses a spray gun to paint the 300 pickets in 1 hour?
22
Parallel Efficiency
A measure of how efficiently core resources are used during parallel computations
In the previous example, suppose you learn that the 100 painters were busy for less than 6% of the total job time on average but were still paid for the whole time. Do you think you got your money's worth from them?
Efficiency = (Speedup / Number of Threads) * 100%
23
Granularity
The ratio of computation to synchronization
Coarse-grained
Concurrent threads have a large amount of computation between synchronization events.
Fine-grained
Concurrent threads have very little computation between synchronization events.
24
Load Balance
Balancing the workloads among multiple threads
If more work is assigned to some threads, the other threads will sit idle until the heavily loaded threads finish.
All the cores must be busy to get max. performance.
For load balancing, which size of task will be better? Large-sized or small-sized?
26
Computer Memory Hierarchy
CPU (registers)
L1 cache: a few cycles
L2 cache: a few to ~10 cycles
Main memory: ~100s of cycles
Disk: ~1000s of cycles
27
Architecture considerations (1)
To obtain better performance, we need to understand how the work is done inside.
Cache
Cache line (cache block, e.g. 64 bytes): data moves between memory and caches one cache line at a time.
Shared caches or separate caches between cores
A cache miss is very costly.
Cache coherency is needed when caches are separate.
Replacement policies such as LRU
28
Architecture considerations (2)
Memory management
Paging
Translation look-aside buffer (TLB)
Inside the CPU
Registers
29
False sharing
Assume the cache line is 64 bytes and the shared arrays below are laid out back to back in memory. What happens if the two threads execute at the same time?

int a[1000];   /* shared */
int b[1000];   /* shared, placed right after a */

Thread 1:
for (i = 0; i < N; ++i)
    a[998] = i * 1000;

Thread 2:
for (i = 0; i < N; ++i)
    b[0] = i;
30
Poor cache utilization
What is the difference between the following two code fragments?

int a[1000][1000];
for (i = 0; i < 1000; ++i)
    for (j = 0; j < 1000; ++j)
        a[i][j] = i*j;

int b[1000][1000];
for (i = 0; i < 1000; ++i)
    for (j = 0; j < 1000; ++j)
        b[j][i] = i*j;
31
Poor Cache Utilization - with eggs
(source : Intel Academy program)
32
Good Cache Utilization – with eggs
(source : Intel Academy program)