Leveling the Field for Multicore Open Systems Architectures
Markus Levy
President, EEMBC
President, Multicore Association
Analyzing the Multicore Ecosystem
• Embedded Microprocessor Benchmark Consortium® (EEMBC)
• Industry benchmarks since 1997
• Tool for evaluating embedded processors, compilers, systems
• MultiBench for shredding multicore processors
Enabling the Multicore Ecosystem
• Initial engagement began in May 2005
• Industry-wide participation
• Current efforts
– Communications APIs
– Hypervisors
– Multicore Programming Practices
– Resource Management
Multicore Issues to Solve
• Concurrent programming
• Communications, synchronization, resource management between/among cores
• Performance analysis
• Debugging
• Distributed power management
• OS virtualization
• Modeling and simulation
• Load balancing
• Algorithm partitioning
Multicore Benchmarking Rules
• Do not rely on a single answer
• Match your application requirements
– Small or large data sets
– Few or many threads
– Dependencies
– OS overhead
Benchmarking Multicore – What’s Important?
• Measuring scalability
• Memory and I/O bandwidth
• Inter-core communications
• OS scheduling support
• Efficiency of synchronization
• System-level functionality
EEMBC Multicore Strategy
• Evaluation and future development of scalable SMP architectures
– Includes multicore and manycore
• Measure impact of parallelization and scalability across both data processing and computationally-intensive tasks
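As a rough illustration of what "measuring scalability" can mean in practice, the sketch below computes speedup and per-worker efficiency from throughput scores. The function names and the sample scores are hypothetical and are not taken from MultiBench; they only show the arithmetic.

```python
def speedup(baseline_score, parallel_score):
    """Throughput of a parallel run relative to the single-worker baseline."""
    if baseline_score <= 0:
        raise ValueError("baseline score must be positive")
    return parallel_score / baseline_score


def efficiency(baseline_score, parallel_score, workers):
    """Speedup per worker; 1.0 would mean perfect linear scaling."""
    return speedup(baseline_score, parallel_score) / workers


# Hypothetical scores (iterations per second) keyed by worker count.
# These numbers are made up for illustration, not measured results.
scores = {1: 100.0, 2: 190.0, 4: 340.0, 8: 300.0}

for workers, score in scores.items():
    s = speedup(scores[1], score)
    e = efficiency(scores[1], score, workers)
    print(f"{workers} workers: speedup {s:.2f}, efficiency {e:.2f}")
```

Looking at efficiency alongside raw speedup makes it easier to spot the point where adding workers stops paying off.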
Some Results
[Figure: Quad Core. Speedup (y-axis, 0 to 4.5) vs. number of concurrent streams (x-axis: 1, 2, 4, 8) for workloads 64M-check-reassembly, 64M-cmykw2, 64M-rotatew2, and 64M-tcp-mixed.]
• Huge drop in performance when oversubscribed
• Nice scaling on networking-only workloads
• Some benchmarks plateau earlier than expected
MultiBench Results
Two Processor System Utilizing Single Memory Controller
[Diagram: Quad Core Processor 1 and Quad Core Processor 2 connect over the Intel Front Side Bus to the North Bridge, which provides the DDR2 interface.]
Processors 1 and 2 must always arbitrate for memory via their front side bus connection through the North Bridge.
Dual Quads vs. Single Quad
1. Find max values for each workload
2. Ratio of all scores
[Figure: bar chart of the dual-quad to single-quad score ratio (y-axis, 0 to 2.5) for each workload: 64M-check-reassembly, 64M-check-reassembly-tcp, 64M-check-reassembly-tcp-cmykw2-rotatew2, 64M-check-reassembly-tcp-h264w2, 64M-cmykw2, 64M-cmykw2-rotatew2, 64M-rotatew2, 64M-tcp-mixed, 64M-x264-1worker, 64M-x264-2workers, 64M-x264-4workers, 64M-x264-8workers, ippktcheck-64M-1Worker, ipres-72M1worker, ipres-72M2worker, md5-32M1worker, md5-32M2worker, md5-32M4worker, rgbcmyk-5x12M1workers, rgbcmyk-5x12M2workers, rgbcmyk-5x12M4workers, rgbcmyk-5x12M8workers, rotate-16x4Ms1w1, rotate-16x4Ms1w2, rotate-16x4Ms1w4, rotate-16x4Ms1w8, rotate-16x4Ms32w1, rotate-16x4Ms32w2, rotate-16x4Ms32w4, rotate-16x4Ms32w8, rotate-16x4Ms4w1, rotate-16x4Ms4w2, rotate-16x4Ms4w4, rotate-16x4Ms4w8, rotate-34kX512-90deg, rotate-color-4M-90deg.]
No Workloads Yield 2x Scaling
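The two-step comparison on this slide (find the max value for each workload, then take the ratio of scores) can be sketched as below. The workload names, the score layout (score per number of contexts), and all numbers are hypothetical placeholders, not MultiBench data.

```python
def best_score(scores_by_contexts):
    """Step 1: the maximum score a workload reaches over all tested settings."""
    return max(scores_by_contexts.values())


def scaling_ratios(dual, single):
    """Step 2: best dual-processor score divided by best single-processor score."""
    return {
        name: best_score(dual[name]) / best_score(single[name])
        for name in single
        if name in dual
    }


# Hypothetical scores keyed by number of contexts (illustrative only).
single_quad = {
    "workload-a": {1: 10.0, 4: 32.0, 8: 30.0},
    "workload-b": {1: 5.0, 4: 12.0, 8: 11.0},
}
dual_quad = {
    "workload-a": {1: 10.0, 4: 33.0, 8: 48.0},
    "workload-b": {1: 5.0, 4: 12.0, 8: 15.0},
}

ratios = scaling_ratios(dual_quad, single_quad)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

Comparing best scores rather than scores at a fixed thread count lets each system be judged at its own sweet spot.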
Two Processor System Utilizing Dual Memory Controllers
[Diagram: Quad Core Processor 1 and Quad Core Processor 2 joined by a link, each with its own DDR2 interface. Memory access paths are labeled Direct Access, Shared Access, and Doubly Shared Access.]
Two Processor System Utilizing Single Memory Controller
[Diagram: Quad Core Processor 1 connects over a link to Quad Core Processor 2, which holds the only DDR2 interface.]
• Processor 1 must always access memory by traversing link to Processor 2
• Requires arbitration to access Processor 2’s memory
• Processor 2 always has prioritized access to this memory since it is directly attached.
• Affinity can help performance
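As a minimal sketch of what setting affinity looks like, the snippet below pins the current process to a chosen set of CPU ids using Python's os.sched_setaffinity, which is available on Linux only. The helper name pin_current_process is hypothetical, and which CPU ids belong to the processor that is directly attached to memory is system specific and must be looked up on the target machine.

```python
import os


def pin_current_process(cpus):
    """Restrict the current process to the given CPU ids.

    Returns the resulting affinity set, or None on platforms that do not
    provide os.sched_setaffinity (for example macOS and Windows).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    os.sched_setaffinity(0, cpus)   # pid 0 means "the current process"
    return os.sched_getaffinity(0)


# Example: pin to the first CPU this process is currently allowed to use.
# In a real system you would pick the CPUs closest to the memory in use.
if hasattr(os, "sched_getaffinity"):
    first_cpu = min(os.sched_getaffinity(0))
    print(pin_current_process({first_cpu}))
```

Whether pinning helps depends on the workload; it is worth measuring with and without affinity rather than assuming a gain.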
[Figure: Dual Quad Cores – Single Memory Controller. Y-axis 0 to 4.5; x-axis 1, 2, 3, 4, 6, 8, 12, 16, 20; series 64M-cmykw2, 64M-cmykw2-rotatew2, 64M-rotatew2.]
[Figure: Dual Quad Cores – Dual Memory Controllers. Y-axis 0 to 4.5; x-axis 1, 2, 3, 4, 6, 8, 12, 16, 20; series 64M-cmykw2, 64M-cmykw2-rotatew2, 64M-rotatew2.]
[Figure: rotate-color-4M-90deg (DS) vs. rotate-color-4M-90deg (DD). Y-axis 0 to 4; x-axis 1 to 9.]
• Rotation by multiple workers cooperating to process a single image.
• Each worker thread acquires slices and writes them to the output buffer.
• Potential bottlenecks related to memory interfaces and synchronization between worker threads.
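The worker pattern described above can be sketched as follows: several threads repeatedly acquire the next slice of rows from a shared counter and write the rotated pixels into a shared output buffer. This is an illustrative sketch, not the MultiBench implementation; the function name, slice size, and 90-degree clockwise direction are assumptions, and CPython's global interpreter lock means it shows the structure rather than real parallel speedup.

```python
import threading


def rotate90_parallel(image, workers=4, slice_rows=2):
    """Rotate a 2-D list 90 degrees clockwise using cooperating worker threads."""
    height, width = len(image), len(image[0])
    output = [[None] * height for _ in range(width)]
    next_row = 0
    lock = threading.Lock()

    def worker():
        nonlocal next_row
        while True:
            with lock:  # synchronization point: acquire the next slice
                start = next_row
                next_row += slice_rows
            if start >= height:
                return
            for r in range(start, min(start + slice_rows, height)):
                for c in range(width):
                    # Clockwise rotation maps (r, c) to (c, height - 1 - r).
                    output[c][height - 1 - r] = image[r][c]

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return output


print(rotate90_parallel([[1, 2, 3], [4, 5, 6]]))
```

The lock around slice acquisition and the writes into the shared output buffer correspond to the two potential bottlenecks named on the slide: synchronization between workers and pressure on the memory interfaces.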
[Figure: Dual Quad Cores – Single Memory Controller. Y-axis 0 to 8; x-axis 1 to 9; series rotate-16x4Ms32w1, rotate-16x4Ms32w2, rotate-16x4Ms32w4, rotate-16x4Ms32w8.]
[Figure: Dual Quad Cores – Dual Memory Controllers. Y-axis 0 to 8; x-axis 1 to 9; series rotate-16x4Ms32w1, rotate-16x4Ms32w2, rotate-16x4Ms32w4, rotate-16x4Ms32w8.]
The Multicore Association Roadmap
Communications
– MCAPI: ultra-lightweight
Resource Management
– Memory management
– Basic synchronization
– Resource registration
– Resource partitioning
Task Management
– Task scheduling
The Four MCA Pillars
[Diagram labels]
Multicore System
Adopted stds
MCA Foundation APIs
• Virtualization (or OS)
• Communication
• Resource Management
• Task Management
• Debug
Value Added Functions
• Languages
• Programming Models
• Design Environments
• Application Generators
• Benchmarks
Services
• Load Balancing
• System Mgt.
• Power Mgt.
• Reliability
• Quality of Service
Why Virtualize?
• Hardware consolidation
• Migration and hosting of legacy applications
• Resource management and balancing
• Faster provisioning
• Fault tolerance
Virtualized Multicore System
Fractional CPU assignment for background tasks such as firmware update/system health monitor/power management
Benchmarking involves many system-level elements
[Diagram: hardware peripherals beneath four ARM cores (ARM CORE #1 to #4), each paired with TRANGO #1 to #4. Virtual processing units VPU A to VPU G are distributed across CORE #1 to CORE #4 and host a firmware update task, applications running on SMP and SMP RTOS guests, and a proprietary application.]
EEMBC Hypermark
Important Metrics
• Overall performance overhead of a hypervisor, i.e., CPU loading
• Static footprint (code and data size)
• Jitter
• Interrupt latency
• Comparison to native performance
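To make two of these metrics concrete, the sketch below computes hypervisor overhead relative to native performance and a simple latency summary. It is not the Hypermark methodology: the function names, the definition of jitter as maximum minus minimum latency, and all sample values are assumptions chosen for illustration.

```python
import statistics


def overhead_percent(native_score, virtualized_score):
    """Performance lost under the hypervisor, as a percentage of native.

    Assumes a higher score means better performance.
    """
    return 100.0 * (native_score - virtualized_score) / native_score


def latency_summary(latencies_us):
    """Mean, worst case, and jitter (assumed here to be max minus min)."""
    return {
        "mean": statistics.mean(latencies_us),
        "worst": max(latencies_us),
        "jitter": max(latencies_us) - min(latencies_us),
    }


# Hypothetical measurements, not real benchmark results.
print(overhead_percent(200.0, 190.0))
print(latency_summary([4.0, 5.0, 9.0, 6.0]))
```

Reporting the worst case and jitter alongside the mean matters for real-time guests, where a rare long interrupt latency can be more damaging than a slightly higher average.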
Multicore Programming Practices (MPP)
• Long term
– Continue research into languages, methodologies, etc.
• Short term
– How today’s embedded C/C++ code may be written to be “multicore ready”
• Influence of a group of like-minded methodology experts to ensure completeness, usefulness and industry-wide compatibility
• Creation of a standard “best practices” guide through a recognized, neutral industry body
– Based on capturing current best practices
MPP Scope & Approach
[Diagram labels]
Sequential C/C++
Architecture 1, e.g. Shared Memory
Architecture 2, e.g. Programmer Managed Memory
Architecture 3, e.g. Message Passing
• Focus on existing C/C++ without extensions, targeting current architectures
• Draw up framework of common pitfalls when transitioning from serial to parallel
• Consider solutions or avoidance tactics
• Analyze performance of solution
Summary
• Performance analysis highlights benefits and pitfalls
• EEMBC licensing and membership
• Portability prevents entrapment
• Multicore Association standards and working groups