Post on 13-Mar-2020
transcript
Multicore DSPs – Karol Desnos (kdesnos@insa-rennes.fr)
MULTICORE DIGITAL
SIGNAL PROCESSING
Karol Desnos – kdesnos@insa-rennes.fr
Slides from M. Pelcat, K. Desnos,
J.-F. Nezan, D. Ménard,
M. Raulet, J. Gorin
Previously
on MDSPs
Design Challenges for MPSoC-Based Systems
• Exploit architecture parallelism
• Express application parallelism
• Balance computational load on PEs
• Hardware/Software co-design process
• Complex design-space exploration
• Respect constraints
• Predict/guarantee application performances
• Reuse legacy code
Grail of Heterogeneous MPSoCs Programming
[Diagram: a multicore compiler, with simulator, debugger, and profiler, takes an algorithm and an architecture description and produces a portable multicore program; a multicore runtime executes it on an architecture of main processors, PEs, peripherals, and main memory]
Properties
• Synchronous Dataflow (SDF)
• Data-driven execution: An actor is fired when its input FIFOs contain
enough data-tokens.
[Diagram: an SDF graph with actors A–E and production/consumption rates on each FIFO; a single-core schedule fires A B C C D E]
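The data-driven firing rule can be sketched in a few lines of Python. The two-actor graph, its rates, and the function names below are made-up illustrations, not the slide's exact example:

```python
# Data-driven firing sketch: an actor fires as soon as every input FIFO
# holds at least its consumption rate of data-tokens. Hypothetical
# two-actor example: A produces 1 token per firing, B consumes 2.

from collections import deque

fifo = deque()          # FIFO between producer A and consumer B
CONS_RATE = 2           # B consumes 2 tokens per firing

def fire_A(fifo):
    fifo.append("tok")  # A produces 1 token per firing

def try_fire_B(fifo):
    """Fire B only if enough data-tokens are available."""
    if len(fifo) < CONS_RATE:
        return False
    for _ in range(CONS_RATE):
        fifo.popleft()
    return True

fired = []
for step in range(4):   # data-driven execution: fire A, then try B
    fire_A(fifo)
    fired.append(try_fire_B(fifo))
print(fired)  # -> [False, True, False, True]
```

B fires only every second step, once its input FIFO has accumulated two tokens.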
Source: E. Lee and D. Messerschmitt, “Synchronous data flow”, Proceedings of the IEEE, 1987.
Properties
• Synchronous Dataflow (SDF)
• Parallelisms: task / data / pipeline / internal
[Diagram: the SDF graph scheduled on three cores (Core1–Core3), illustrating task parallelism, data parallelism, pipeline parallelism, and internal parallelism]
• Lecture 1 – Maxime Pelcat
• Introduction to the course
• Applications for MDSPs
• Lecture 2 – Karol Desnos
• Languages and MoCs
• Programming MPSoCs
• Dataflow MoCs
• Lecture 3 – Maxime Pelcat
• Hardware Architectures
Course Outline
• Lecture 4 – Karol Desnos
• Theoretical Bounds
• Mapping/Scheduling Strategies
• Lecture 5 – Karol Desnos
• Lab Session
Theoretical Bounds
Amdahl’s Law
• Developed in 1967 by Gene Amdahl
• A generic performance metric for applications
• Notations
  • x: ratio of the code that is perfectly parallel; the rest is sequential
  • N: number of processing elements
  • Speedup S: the acceleration brought by adding cores
• Formulation
  • Ideal speedup for N PEs:
    S = 1 / ((1 − x) + x/N)
  • Maximum achievable speedup:
    S_max = lim (N→∞) S = 1 / (1 − x)
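A minimal Python sketch of Amdahl's law, reproducing the numbers used on the following slides (the function names are mine):

```python
# Amdahl's law: speedup of a program whose fraction x is perfectly
# parallel, executed on N processing elements.

def amdahl_speedup(x: float, n: int) -> float:
    """S = 1 / ((1 - x) + x / N)."""
    return 1.0 / ((1.0 - x) + x / n)

def amdahl_max_speedup(x: float) -> float:
    """Limit of S when N -> infinity: 1 / (1 - x)."""
    return 1.0 / (1.0 - x)

# With 70 % of parallel code (x = 0.7):
for n in (1, 2, 4, 8):
    print(n, round(amdahl_speedup(0.7, n), 1))
# -> 1 1.0, 2 1.5, 4 2.1, 8 2.6
print(round(amdahl_max_speedup(0.7), 2))  # -> 3.33
```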
Amdahl’s Law
• Example: with 70% of parallel code
[Diagram: as many threads as we want for 70% of the code; a single thread for 30% of the code]
Amdahl’s Law
• Example: with 70% of parallel code
• Speedup is limited to 1.0 on 1 core (no kidding!)
• Speedup is limited to 1.5 on 2 cores
• Speedup is limited to 2.1 on 4 cores
• Speedup is limited to 2.6 on 8 cores
• …
[Chart: speedup vs. number of cores for 70% of parallel code, saturating at max. speedup = 3.33]
Amdahl’s Law
• Example:
• Max speedup of 5.0 for 80%
• Max speedup of 3.3 for 70%
• Max speedup of 2.5 for 60%
[Chart: Amdahl speedup curves vs. number of cores for 60%, 70%, and 80% of parallel code]
Amdahl’s Law
• Limitations:
  • Inter-process communications are ignored
  • No computation is perfectly parallel
• Amdahl’s law cast many doubts on multicores
  • Why add more cores if the parallelism of applications limits speedups so much?
Gustafson’s Law
• Developed in 1988 by John Gustafson
• Hypothesis: more cores imply more parallelism
  • Sequential code latency remains constant regardless of the number of PEs
  • Parallel code is increased (by the developer) to fit the number of PEs
• Notations
  • S + P: sequential time + parallel time with 1 PE
  • S + N·P: sequential time + parallel work executed in the same time with N PEs
  • x = S / (S + P): ratio of sequential time over total time (/!\ ≠ Amdahl’s x /!\)
• Formulation
  • Ideal speedup for N PEs:
    Speedup = (S + N·P) / (S + P) = … = N − x·(N − 1)
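The same kind of sketch for Gustafson's law; note that x is now the sequential fraction (the function name is mine):

```python
# Gustafson's law: the problem size grows with the number of PEs, so the
# parallel part does N times more work while wall-clock time stays S + P.
# Here x = S / (S + P) is the SEQUENTIAL fraction (not Amdahl's x).

def gustafson_speedup(x: float, n: int) -> float:
    """Speedup = N - x * (N - 1)."""
    return n - x * (n - 1)

# With 70 % of parallel code, the sequential fraction is x = 0.3:
for n in (2, 4, 8):
    print(n, round(gustafson_speedup(0.3, n), 1))
# -> 2 1.7, 4 3.1, 8 5.9
```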
Gustafson’s Law
• Example: With 70% of parallel code
• Speedup is limited to 1.7 on 2 cores (Amdahl: 1.5)
• Speedup is limited to 3.1 on 4 cores (Amdahl: 2.1)
• Speedup is limited to 5.9 on 8 cores (Amdahl: 2.6)
[Chart: Gustafson speedup grows almost linearly with the number of cores]
Dataflow Speedup
• Maximum speedup is given by finding the critical path
  • The data path whose sum of actor execution times is the largest
• Example (communications not considered):
  • Critical path length = 1 + 6 + 3 + 1 = 11 ms
  • Total work = 23 ms
  • Max speedup = 23 / 11 ≈ 2.09
Graph actor times: A = 1 ms, B = 4 ms, C = 6 ms, D = 3 ms, E = 3 ms, F = 5 ms, G = 1 ms
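The critical-path bound can be computed by a longest-path traversal of the DAG. The execution times below are from the slide, but the edge set is an assumption chosen to reproduce the 11 ms critical path (the exact topology is not recoverable from the transcript):

```python
# Maximum dataflow speedup = total work / critical path length.

from functools import lru_cache

times = {"A": 1, "B": 4, "C": 6, "D": 3, "E": 3, "F": 5, "G": 1}  # ms
successors = {  # hypothetical topology reproducing the 11 ms critical path
    "A": ["B", "C"], "B": ["F"], "C": ["D", "E"],
    "D": ["G"], "E": ["G"], "F": ["G"], "G": [],
}

@lru_cache(maxsize=None)
def longest_from(actor: str) -> int:
    """Length of the longest path starting at `actor` (inclusive)."""
    succ = successors[actor]
    return times[actor] + (max(map(longest_from, succ)) if succ else 0)

critical_path = max(longest_from(a) for a in times)   # 11 ms
total_work = sum(times.values())                      # 23 ms
print(round(total_work / critical_path, 2))           # max speedup -> 2.09
```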
Dataflow Speedup
• PREESM Speedup Assessment Chart
• Evaluates the quality of a schedule
[Chart: speedup vs. number of PEs for the graph above, showing the bound given by the critical path length, the architecture limit, and the speedup of a dummy scheduling (fast but far from optimal)]
Dataflow Speedup
• Limitations of PREESM Speedup Assessment Chart
• Only latency is considered
• Software pipelining is not considered
– Example: New critical path: 1+6 / New max speedup = 3.8
• All cores are identical
• All communications have the same speed
[Graph actor times as above, with the graph split into pipeline stage 1 and pipeline stage 2]
Mapping/Scheduling
Strategies
• Heterogeneous Mapping/Scheduling Problem
• Heuristic Algorithms
• Load Balancing
• Runtime Systems
Schedu-what ?
• Reminder
[Diagram: a task graph (Task1–Task7) and an architecture (Core1, Core2, PE1). Mapping assigns each task to a PE; Scheduling gives an execution order per PE; Timing gives each task a start time.]
Different Strategies
• Choices can be made at compile time or at run time.
Source: E. Lee, “Scheduling Strategies for Multiprocessor real-time DSP”
Strategy            Mapping   Scheduling  Timing
fully dynamic       run       run         run
static-assignment   compile   run         run
self-timed          compile   compile     run
fully static        compile   compile     compile
(from top to bottom: adaptivity decreases, performance increases)
About Mapping, Scheduling, and Timing
• Part of “Operational Research”
  • How to organize a company
  • How to organize a project (Gantt chart, …)
  • How to make decisions in general
• NP-hard problem
  • Verifying the validity of a solution can be done in polynomial time (e.g. verifying that a schedule is valid).
  • No polynomial-time algorithm for solving NP-complete problems is known (and it is likely that none exists).
  • As the problem grows (e.g. in number of cores or actors), solving it becomes exponentially more complex.
About Mapping, Scheduling, and Timing
• Multicore scheduling is equivalent to the quadratic assignment problem, which is NP-hard:
  • N facilities, each pair of facilities (f, g) associated with a flow of communication
  • N locations for the facilities, each pair of locations (l, m) associated with a distance
  • Objective: assign each facility to a location so as to minimize traffic (i.e. the sum of the distances multiplied by the corresponding flows)
[Diagram: four facilities f1–f4 with communication flows between pairs, and four locations l1–l4 with distances between pairs]
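A brute-force solver makes the quadratic assignment objective concrete. The flow and distance values below are made up for illustration (the slide's figure is only partially readable), and the O(N!) search is exactly why heuristics are needed:

```python
# Brute-force quadratic assignment: place each facility on a location and
# minimise sum(flow(f, g) * distance(loc(f), loc(g))). Hypothetical
# 4-facility instance.

from itertools import permutations

N = 4
flow = {(0, 1): 5, (0, 3): 5, (1, 2): 5, (2, 3): 5, (0, 2): 1, (1, 3): 0}
dist = {(0, 1): 5, (0, 3): 8, (1, 2): 6, (2, 3): 5, (0, 2): 3, (1, 3): 4}

def pair(a, b):
    return (a, b) if a < b else (b, a)

def cost(placement):
    """placement[f] = location of facility f."""
    return sum(flow[pair(f, g)] * dist[pair(placement[f], placement[g])]
               for f in range(N) for g in range(f + 1, N))

best = min(permutations(range(N)), key=cost)   # O(N!) exhaustive search
print(best, cost(best))
```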
About Mapping, Scheduling, and Timing
• The real problem is even more complex:
  • M facilities (i.e. actors)
  • N < M locations (i.e. cores) for the facilities
  • Heterogeneity: actors have different costs on different cores
  • The objective is not only communication minimization but also latency, throughput, memory, power, …
Heterogeneous Mapping/Scheduling
• Exact vs heuristic algorithms
  • Exact algorithms find the optimal solution (in exponential time)
  • Heuristics explore only part of the solution space.
• Many heuristics exist
  • List scheduling, greedy scheduling
  • FAST scheduling (Y.-K. Kwok)
  • Hybrid flow-shop scheduling (J. Boutellier)
  • Meta-heuristics (genetic algorithms, ant colonies, …)
  • …
• The quality of heuristic results cannot be predicted
  • But models should contain enough information to make decisions
Heterogeneous Mapping/Scheduling
• Several classes of heuristic algorithms
[Diagram: heuristics classified along two axes, problem-specific vs. generic and constructive vs. iterative: list scheduling, greedy scheduling, hybrid flow-shop, divide and conquer, branch and bound, Integer Linear Programming, genetic algorithms, simulated annealing, ant colony, FAST scheduling]
Source: Z. Peng, lecture notes of “Computer aided design of electronics”, LiU
List Scheduling Algorithm
1. Create a list of actors sorted in:
  • Topological order (i.e. data-dependency order)
  • When actors are equivalent, a secondary sorting criterion is used: longest execution time, critical path before last task, …
Graph actor times: A = 1 ms, B = 4 ms, C = 6 ms, D = 3 ms, E = 3 ms, F = 5 ms, G = 1 ms
Sorted with longest-execution-time tiebreak: A, E, C, B, D, F, G
Sorted with longest-critical-path tiebreak: A, D, C, B, F, E, G
List Scheduling Algorithm
2. Map and schedule actors to the first available PE:
[Gantt charts: both sorted lists scheduled on Core1/Core2, each actor mapped to the first available PE; with the longest-execution-time order the schedule completes in 14 ms]
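The mapping step can be sketched as follows for a homogeneous two-core target. Data dependencies are deliberately ignored here, so this shows only the "first available PE wins" rule, not a full dependency-aware scheduler (which is why it reaches 12 ms instead of the slide's 14 ms):

```python
# Step 2 of list scheduling, simplified: walk the sorted actor list and
# put each actor on the core that becomes available first. Dependencies
# and communications are ignored for brevity.

times = {"A": 1, "B": 4, "C": 6, "D": 3, "E": 3, "F": 5, "G": 1}  # ms

def list_schedule(order, n_cores=2):
    finish = [0] * n_cores                 # time at which each core is free
    mapping = {}
    for actor in order:
        core = finish.index(min(finish))   # first available core wins
        mapping[actor] = core
        finish[core] += times[actor]
    return mapping, max(finish)            # mapping + makespan

mapping, makespan = list_schedule(["A", "E", "C", "B", "D", "F", "G"])
print(mapping, makespan)  # makespan: 12 ms
```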
Heterogeneous List Scheduling Algorithm
• The core that can finish the actor’s execution first wins
  • For heterogeneous targets
Actor times (tCPU / tDSP / tAcc): A = 2/3 ms, B = 4/8 ms, C = 5/2 ms, D = 2/1 ms, E = 3/4 ms, F = 8/4/0.5 ms
[Gantt chart: schedule on CPU, DSP, and Acc. using the order A, C, F, B, E, D (longest shortest-execution-time first): latency 8 ms]
Heterogeneous List Scheduling Algorithm
• The scheduling order is important
• An optimal order always exists
• Try all orders (exhaustive search) to find the optimal one
[Gantt chart: the same graph scheduled with the topological order A, B, C, D, E, F: latency 9 ms]
FAST Iterative Heuristic
1. Create an initial solution with list scheduling
2. Iteratively:
   1. Select a random actor from the critical path
   2. Change its mapping
   3. Reschedule and evaluate the resulting latency
3. Keep the best result
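A rough sketch of the FAST idea in Python, under strong simplifications: the latency model is just the maximum per-core load (no dependencies or communications), and the critical path is approximated by the most-loaded PE. The per-actor times are the CPU/DSP/Acc values from the slide:

```python
# FAST-style iterative refinement: start from an initial mapping, remap
# one actor of the most-loaded PE, keep the change only if it improves.

import random

exec_time = {  # per-actor time on each PE (ms), from the slide
    "A": {"CPU": 2, "DSP": 3}, "B": {"CPU": 4, "DSP": 8},
    "C": {"CPU": 5, "DSP": 2}, "D": {"CPU": 2, "DSP": 1},
    "E": {"CPU": 3, "DSP": 4}, "F": {"CPU": 8, "DSP": 4, "Acc": 0.5},
}

def loads(mapping):
    load = {}
    for actor, pe in mapping.items():
        load[pe] = load.get(pe, 0) + exec_time[actor][pe]
    return load

def latency(mapping):
    return max(loads(mapping).values())

def fast_refine(mapping, iterations=200, seed=0):
    rng = random.Random(seed)
    best = dict(mapping)
    for _ in range(iterations):
        load = loads(best)
        worst_pe = max(load, key=load.get)          # "critical path" proxy
        actor = rng.choice([a for a, pe in best.items() if pe == worst_pe])
        trial = dict(best)
        trial[actor] = rng.choice(list(exec_time[actor]))  # remap it
        if latency(trial) < latency(best):          # keep improvements only
            best = trial
    return best

initial = {a: "CPU" for a in exec_time}
refined = fast_refine(initial)
print(latency(initial), "->", latency(refined))
```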
[Gantt charts: an initial 9 ms list schedule; after iteratively remapping critical-path actors, intermediate 9 ms schedules and a final 8 ms schedule]
Genetic Iterative Heuristic
1. Create a pool of solutions with list/FAST scheduling
   • Each solution is represented by its ordered list
2. Iteratively:
   1. Discard the worst solutions
   2. Produce new solutions using cross-over and mutation
   3. Reschedule and evaluate the resulting latency
3. Keep the best result
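Mutation and cross-over on ordered actor lists must keep each list a valid permutation. A generic sketch of the two operators (not PREESM's exact implementation):

```python
# Genetic operators on ordered actor lists: each solution is a
# permutation of the actors, so both operators must preserve that.

import random

def mutate(order, rng):
    """Swap two adjacent actors, e.g. A D C B F E -> A D C B E F."""
    i = rng.randrange(len(order) - 1)
    child = list(order)
    child[i], child[i + 1] = child[i + 1], child[i]
    return child

def crossover(parent1, parent2, rng):
    """Keep a prefix of parent1, then append the remaining actors in the
    order they appear in parent2 (keeps the result a permutation)."""
    cut = rng.randrange(1, len(parent1))
    head = parent1[:cut]
    tail = [a for a in parent2 if a not in head]
    return head + tail

rng = random.Random(42)
p1 = ["A", "D", "C", "B", "F", "E"]
p2 = ["A", "D", "B", "C", "E", "F"]
child = crossover(p1, p2, rng)
print(child, sorted(child) == sorted(p1))  # still a valid permutation
print(mutate(p1, rng))
```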
[Diagram: mutation swaps actors in an ordered list (e.g. A D C B F E → A D C B E F); cross-over combines two parent orders to produce new valid orders]
Scheduling under multiple constraints
• Example with latency and power
Actor costs (time, energy) per PE (CPU / DSP / Acc.):
A: 2 ms, 2 J / 3 ms, 1.5 J
B: 4 ms, 4 J / 8 ms, 4 J
C: 5 ms, 5 J / 2 ms, 1 J
D: 2 ms, 2 J / 1 ms, 0.5 J
E: 3 ms, 3 J / 4 ms, 2 J
F: 8 ms, 10 J / 4 ms, 2.5 J / 0.5 ms, 0.1 J
[Gantt charts: a latency-optimized schedule at 8 ms and 11.1 J vs. a schedule at 10 ms and 9.1 J]
Load Balancing
[Gantt charts: the 8 ms / 11.1 J schedule has unbalanced power consumption (CPU: 8 ms, 8 J; DSP: 6 ms, 3 J; Acc.: 0.5 ms, 0.1 J); the 10 ms / 9.1 J schedule has unbalanced computational load (CPU: 4 ms, 4 J; DSP: 10 ms, 5 J; Acc.: 0.5 ms, 0.1 J)]
Load Balancing Strategies
• With total predictability (i.e. known number of tasks, SDF-like)
  • Decentralized static decision
  • No adaptivity to algorithm modifications
  • No decision overhead
  • Self-timed execution
[Diagram: decentralized execution (PREESM, SynDEx): each core runs its statically assigned actors in a fixed order over time]
Load Balancing Strategies
• With high predictability (i.e. reconfigurable tasks)
  • Master/Slave
  • Adaptivity to algorithm variations
  • The master core can become a bottleneck
[Diagram: master/slave execution (Spider runtime): a master core running the multicore runtime assigns actors to slave cores; each slave dequeues an actor, processes it, and signals when finished]
Load Balancing Strategies
• Without predictability (i.e. highly dynamic number of tasks)
  • Work queueing
  • Implemented over multi-threading
  • Great freedom in thread creation
  • The shared task queue becomes the bottleneck
[Diagram: work-queueing (Apple Grand Central Dispatch, OpenMP): threads pop tasks from a shared queue, process them, and push new tasks back]
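A minimal work-queueing sketch with Python threads and one shared queue; the squaring "task" is a placeholder for real work:

```python
# Work-queueing sketch: worker threads pop tasks from one shared queue;
# any task may push new tasks back. The shared queue is the single point
# of contention, as noted above.

import queue
import threading

tasks = queue.Queue()
results = []
lock = threading.Lock()

def worker():
    while True:
        item = tasks.get()          # pop (blocks while the queue is empty)
        if item is None:            # poison pill: stop this worker
            tasks.task_done()
            return
        with lock:
            results.append(item * item)   # "process task"
        tasks.task_done()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for n in range(10):                 # push tasks
    tasks.put(n)
for _ in threads:                   # one poison pill per worker
    tasks.put(None)
for t in threads:
    t.join()
print(sorted(results))  # -> [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```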
Load Balancing Strategies
• Without predictability (i.e. highly dynamic number of tasks)
  • Job stealing
  • One task queue per core: no more bottleneck
  • Hard to predict performance
[Diagram: job stealing (Cilk, Intel Threading Building Blocks): each core pushes to and pops from its own task queue, and steals tasks from other cores’ queues when idle]
Runtime Systems
• Role of runtime systems for dynamic adaptation
Runtime Systems
• Role of runtime systems for dynamic adaptation
  • A runtime system is the extension of an operating system for distributed hardware
• 2 types of multiprocessing management systems:
  • AMP (Asymmetric Multiprocessing): each core has its own OS (e.g. SYS/BIOS); there is no runtime and the complexity is not masked
  • SMP (Symmetric Multiprocessing): one OS controls the whole architecture (e.g. SMP Linux), simulating a unique core
• These notions are limited; compilers and runtime systems are now combined
  • For instance, OpenMP is based on a runtime system to dispatch threads
Runtime Systems
• The industry develops new runtime systems
  • Apple Grand Central Dispatch
  • Intel Threading Building Blocks
  • Texas Instruments Open Event Machine
• They are based on task and data synchronization descriptions
  • Their semantics are getting close to dataflow
• Several runtime systems are experimented with at IETR
  • Based on dataflow algorithm descriptions
  • Spider: Synchronous Parameterized and Interfaced Dataflow Embedded Runtime
Runtime Systems
[Diagram: the Spider runtime combines a symmetric phase and a master/slave phase: dataflow management expands a template graph with parameter values into uC-OS/II tasks, which a uC-OS/II based RTOS schedules on the cores]
General Conclusion
• Applications and architectures are increasingly complex
• Model-based system design helps at several design stages
• To evaluate languages/models, focus on the MoC
  • MoCs offer “pure” semantics, free of syntax
• There is no one-size-fits-all solution for designing multicore DSP systems
  • Many solutions exist; complex choices have to be made