Page 1: A Survey on Multi-Core Processors and Systems CSC8210 Fall 2010 D.M.Rasanjalee Himali Ruku Roychowdhury.

A Survey on Multi-Core Processors and Systems

CSC8210

Fall 2010

D.M.Rasanjalee Himali

Ruku Roychowdhury

Page 2:

Overview

1. Introduction

2. The Conceptual Architecture of Multicores

3. Resource Management of Multicore Systems

4. Parallelization with Multicores

5. Measuring Multicore Performance

6. Challenges of Multicores

7. Conclusion

Page 3:

1. INTRODUCTION

What is Multi-Core?


Pentium Processor

Dual Core

Quad Core

Page 4:

Multi-core Architecture

Single Core Architecture vs. Multi-Core Architecture

• Replicate multiple processor cores on a single die

[Figure: a multi-core CPU chip containing core 1, core 2, core 3 and core 4]

Page 5:

Multi-core architectures

• Multi-core processors are MIMD:
– Different cores execute different threads (Multiple Instructions), operating on different parts of memory (Multiple Data).

• L1 caches are private; L2 caches are private in some architectures and shared in others

• Memory is always shared

[Figures: three dual-core cache organizations, described below; one design also shows hyper-threads within each core]

Both L1 and L2 are private

Examples: AMD Opteron, AMD Athlon, Intel Pentium D

A design with L3 caches

Example: Intel Itanium 2

Private L1 cache and Shared L2 cache

Example: Dual-core Intel Xeon processors

Page 6:

Interaction with the Operating System

• The cores run in parallel
– Within each core, threads are time-sliced (just like on a uniprocessor)

• OS perceives each core as a separate processor

• OS scheduler maps threads/processes to different cores

• Most major OSes support multi-core today: Windows, Linux, Mac OS X, …

[Figure: four cores, each running several time-sliced threads]

Page 7:

Why Multi-Core?

• Problem: Can no longer significantly increase processor performance by frequency scaling

– Power Wall

– Instruction Level Parallelism (ILP) Wall

– Memory Wall

• Solution: Performance Scaling through Parallel Processing

– Reduce power with voltage scaling

– Reduce DRAM access with large caches

– Place multiple cores on a single chip, with each core processing multiple threads simultaneously

• Advantages of Multi-Cores:

– Low Power Consumption, Lower Heating, Smaller Devices

• Challenges of Multicores:
– Programming, Resource Contention, Benchmarks

Page 8:

Popular Multi-Core Architectures

• AMD
– Athlon II: dual-, triple-, and quad-core desktop processors
– Phenom: dual-, triple-, quad-, and hex-core desktop processors

• IBM
– POWER5: a dual-core processor, released in 2004
– POWER7: a 4-, 6-, or 8-core processor, released in 2010

• Intel
– Core Duo: a dual-core processor
– Core 2 Quad: two dual-core dies packaged in a multi-chip module

• Nvidia
– GeForce 9: multi-core GPU (8 cores, 16 scalar stream processors per core)
– GeForce 200: multi-core GPU (10 cores, 24 scalar stream processors per core)

• Sun Microsystems
– UltraSPARC IV and UltraSPARC IV+: dual-core processors
– UltraSPARC T1: an eight-core, 32-thread processor

Page 9:

2. THE CONCEPTUAL ARCHITECTURE OF MULTICORES

Formal Models of Computation for Multicores

• Multicore systems are composed of several different processing elements, e.g. processors, memories, buses, and various other communication interfaces.

• Therefore, design of such systems is time consuming and complex.

• In order to automate the design process, system specification has to be clear and unambiguous.

• A specification model is helpful to enable validation and analyzability.

Page 10:

Models of Computation

• A Model of Computation is a general way of describing system behavior in an abstract, conceptual form; it enables representation of system requirements and constraints at a high level

Process-oriented Models

Describes a system's behavior as a set of concurrent processes communicating with each other through message passing.

State-based Models

State-based models are given in the form of a state machine consisting of a set of states and transitions. State-based models focus on control flow.

Page 11:

Types of Process-based Models

• Process Network
Consists of independent processes which can run in parallel. Communication is unidirectional and point-to-point.

• Process Calculi
Provides a high-level formal description of interactions and synchronization mechanisms among concurrent processes.

Page 12:

• Synchronous Data Flow
This is a kind of data flow model with an actor as the basic unit of computation. It imposes the additional constraint that the amount of data flow must be predetermined.

Page 13:

Types of State based Models

• Finite State Machines

• Program State Machines
Can be a combination of a process-oriented model and a state-based model.

However, a large number of states is required to efficiently model system behavior. Concepts of hierarchy and concurrency are necessary in state-based models to deal with complexity.

Page 14:

Comparison between Models

Page 15:

3. RESOURCE MANAGEMENT OF MULTICORE SYSTEMS

• Shared resources mostly have to do with the memory hierarchy

• Threads running on cores in the same memory domain may compete for the shared resources
– Ex: cache contention occurs when two or more threads are assigned to run on Core 0 and Core 1 with a shared cache

• This contention can significantly degrade their performance.

Page 16:

Understanding Resource Contention

• It has been documented in previous studies [1,4] that the execution time of a thread can vary greatly depending on which threads run on the other cores of the same chip.

• This is especially true if several cores share the same last-level cache (LLC).

• When a thread requests a line that is not in the cache (a miss), a new cache line must be allocated. Issue: if the cache is full, some data must be evicted to free up a line for the new data.

• The evicted line might belong to a different thread from the one that issued the cache miss (modern CPUs do not assure any fairness in that regard).

Page 17:

Mitigating Resource Contention

• Researchers have proposed mechanisms for mitigating LLC contention
– Hardware-based solutions:
• Static cache partitioning (Ex: Optimal Cache Partitioning, MQ)
• Dynamic cache partitioning (Ex: UCP, AMQ)

• Limitations: increases the cost and complexity of the hardware, has limited flexibility and a long time-to-market

• Better solution: cache-aware scheduling

• How to address contention using scheduling:
– Contention prediction
– Contention-aware scheduling (Ex: symbiotic jobscheduling)

Page 18:

Contention Prediction

• Method 1: Consider the LLC miss rate of the threads:
• If a thread issues lots of cache misses, it must have a large cache working set, since each miss results in the allocation of a new cache line
• Any thread that has a large cache working set must suffer from contention while inflicting contention on others.

• Previous work has shown that an application may take up to 65% longer to complete when it runs with a high-miss-rate co-runner than with a low-miss-rate co-runner

• Method 2: Consider the memory reuse pattern of threads:
• If a thread hardly ever reuses its cached data (ex: video streaming), it will not suffer from contention even if it brings lots of data into the cache.

• That is because such a thread needs very little space to keep in cache the data that it actively uses.

• Researchers demonstrated in [2] that this approach predicts the extent of contention quite accurately.

Page 19:

Contention Prediction Using Memory Reuse Pattern

• Memory reuse patterns have been used in the past to effectively model the contention between threads that share a cache

• Approaches using memory-reuse patterns are highly successful at modeling contention because they capture two important qualities related to contention, sensitivity and intensity:

– Sensitivity (S) measures how much a thread suffers whenever it shares the cache with other threads.

– Intensity (Z), on the other hand, measures how much a thread hurts other threads whenever it shares a cache with them.

Page 20:

• Using the metrics S and Z, a new metric called Pain is created

Pain of thread A due to sharing a cache with thread B = sensitivity of A * intensity of B:
Pain(A|B) = S_A * Z_B

Pain of thread B due to sharing a cache with thread A = sensitivity of B * intensity of A:
Pain(B|A) = S_B * Z_A

The combined Pain for the thread pair = Pain of A due to B + Pain of B due to A:
Pain(A,B) = Pain(A|B) + Pain(B|A)

• Intuitively, Pain(A|B) approximates the performance degradation of A when A runs with B relative to running solo.
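As a concrete illustration, the Pain formulas above can be computed directly from per-thread sensitivity and intensity values. The S and Z numbers here are hypothetical; in practice they would come from measured memory-reuse profiles:

```python
# Sketch of the Pain metric; the S (sensitivity) and Z (intensity)
# values below are made up for illustration.

def pain(s_victim, z_aggressor):
    """Pain(A|B) = S_A * Z_B: degradation of the victim caused by the aggressor."""
    return s_victim * z_aggressor

def combined_pain(thread_a, thread_b, S, Z):
    """Pain(A,B) = Pain(A|B) + Pain(B|A) for a co-scheduled pair."""
    return pain(S[thread_a], Z[thread_b]) + pain(S[thread_b], Z[thread_a])

S = {"A": 0.8, "B": 0.3}   # how much each thread suffers from sharing
Z = {"A": 0.5, "B": 0.9}   # how much each thread hurts its co-runners

print(combined_pain("A", "B", S, Z))  # 0.8*0.9 + 0.3*0.5 = 0.87
```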


Page 21:

Using Contention Models in a Scheduler

• A contention model helps us find the best schedule and avoid the worst one

• Ex:
– We have a system with two pairs of cores, each pair sharing a cache
– We want to find the best schedule for four threads
– The scheduler would construct all the possible permutations of threads on this system

Page 22:

Using Contention Models in a Scheduler

• If we have four threads A, B, C, and D there will be three unique schedules:

(1) {(A,B), (C,D)}
(2) {(A,C), (B,D)}
(3) {(A,D), (B,C)}

Notation (A,B) means that threads A and B are co-scheduled in the same memory domain.

• For each schedule, the scheduler estimates the Pain for each pair:

• Then it averages the Pain values of the pairs to estimate the Pain for the schedule as a whole.

• The schedule with the lowest Pain is deemed to be the estimated best schedule.
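The whole selection procedure above (enumerate the three unique schedules, score each by its average pair Pain, pick the lowest) can be sketched as follows; the S and Z profiles are hypothetical:

```python
def combined_pain(a, b, S, Z):
    """Pain(A,B) = S_A*Z_B + S_B*Z_A for a co-scheduled pair."""
    return S[a] * Z[b] + S[b] * Z[a]

def best_schedule(threads, S, Z):
    """Enumerate all ways to split four threads into two cache-sharing
    pairs, score each schedule by its average pair Pain, and return the
    schedule with the lowest estimated Pain."""
    first = threads[0]
    schedules = []
    for partner in threads[1:]:
        pair1 = (first, partner)
        pair2 = tuple(t for t in threads if t not in pair1)
        schedules.append((pair1, pair2))

    def schedule_pain(sched):
        return sum(combined_pain(x, y, S, Z) for x, y in sched) / len(sched)

    return min(schedules, key=schedule_pain)

# Hypothetical sensitivity/intensity profiles for four threads.
S = {"A": 0.9, "B": 0.1, "C": 0.8, "D": 0.2}
Z = {"A": 0.9, "B": 0.1, "C": 0.7, "D": 0.2}

print(best_schedule(["A", "B", "C", "D"], S, Z))  # (('A', 'B'), ('C', 'D'))
```

With these profiles the estimated best schedule pairs each high-pain thread with a low-pain one, which matches the intuition behind the Pain model.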

Page 23:

Popular Resource Contention Management Strategies

• Cache Partitioning Strategies:
– Optimal Cache Partitioning
• Develops a model for studying the optimal allocation of cache memory among two or more competing processes. It describes a cache replacement policy better than LRU replacement.
– UCP
• A custom hardware solution that estimates each application's number of hits and misses in the cache. The cache is then partitioned to minimize the number of cache misses for the co-running applications.

• Contention-Aware Scheduling Strategies:
– Symbiotic Jobscheduling
• Demonstrates an SMT jobscheduler called SOS. SOS combines an overhead-free sample phase, which collects information about various possible schedules, and a symbiosis phase, which uses that information to predict which schedule will provide the best performance.
– Operating System Scheduler
• Introduces the cache-fair algorithm, which reduces co-runner-dependent variability in an application's performance by ensuring that the application always runs as quickly as it would under fair cache allocation, regardless of how the cache is actually allocated.

Page 24:

4. PARALLELIZATION WITH MULTICORES

Multithreading

• All multithreaded cores keep multiple hardware threads "on chip", ready for execution.

• Each thread needs its own copy of state components such as instruction pointers and other control registers.

• There exist a variety of approaches to switching between threads per core. The two most common types are TMT (temporal multithreading) and SMT (simultaneous multithreading).

• Most recent CPUs employ Hyper-Threading technology, introduced by Intel, which is an example of SMT.

Page 25:

Integration of Components on Chip

• The number and selection of integrated components on chip is an important design decision.

• Possible components to include on-chip are memory controllers, communication interfaces, and memory.

• Placing the memory controller on-chip increases bandwidth and decreases latency.

• Some designs support multiple integrated memory controllers to make memory-access bandwidth scalable with the number of cores.

• For example, IBM's Blue Gene/P system relies on a highly integrated system-on-a-chip design which features four cores, five network interfaces, two memory controllers and 8MB of L3 cache.

Page 26:

Shared vs. Private Caches

• Advantages of private:
– They are closer to the core, so access is faster
– They reduce contention

• Advantages of shared:

Threads on different cores can share the same cache data.

Shared caches are important if threads of the same application are executing on multiple cores and share a significant amount of data.

• However, shared caches can impose high demands on the interconnect.

Page 27:

Interconnect

• Communication among different on-chip components (cores, caches, memory controllers and network controllers) can have a huge impact on performance.

• The trend now is to use a crossbar or other advanced mechanisms for on-chip communication to reduce latency and contention.

• However, crossbars can become expensive.

• Off-chip interconnect has also become significant, as data processing increases with thread-level parallelism, which eventually increases off-chip communication.

• The trade-off between on-chip interconnect and off-chip interconnect is an important design decision.

Page 28:

Types of Parallelism

• Task Parallelism
This is the simplest form of parallelization. The work is broken down into several subtasks. Independent subtasks can run concurrently on different threads.
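A minimal sketch of task parallelism: three independent subtasks of one job run concurrently on different threads (Python threads are used purely for illustration; the subtasks and their names are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

# Three independent subtasks of a larger job; since they share no
# mutable data, they can run concurrently on different threads.
def count_words(text):
    return len(text.split())

def count_lines(text):
    return text.count("\n") + 1

def count_chars(text):
    return len(text)

text = "multi-core processors\nrun tasks in parallel"
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(f, text) for f in (count_words, count_lines, count_chars)]
    results = [f.result() for f in futures]

print(results)  # [6, 2, 43] -- word, line and character counts
```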

Page 29:

Data Parallelism

• Without Data Parallelism

• With Data Parallelism
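Data parallelism can be sketched as splitting one data set into chunks and applying the same operation to each chunk on a different worker (an illustrative example, not from the slides):

```python
from concurrent.futures import ThreadPoolExecutor

def square_chunk(chunk):
    # The same operation is applied to every element of the chunk.
    return [x * x for x in chunk]

data = list(range(8))
n_workers = 4
# Split the data into one strided chunk per worker.
chunks = [data[i::n_workers] for i in range(n_workers)]

with ThreadPoolExecutor(max_workers=n_workers) as pool:
    partials = list(pool.map(square_chunk, chunks))

# Merge partial results back into their original positions.
result = [0] * len(data)
for start, partial in enumerate(partials):
    result[start::n_workers] = partial

print(result)  # [0, 1, 4, 9, 16, 25, 36, 49]
```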

Page 30:

Pipeline Parallelism

• Allows for parallelization of a single task. This is helpful when a partial or total ordering exists in the data set, preventing data parallelism.
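Pipeline parallelism can be sketched as a chain of stages connected by queues: each stage runs on its own thread, and the ordering requirement is respected because items flow through the stages in sequence (illustrative sketch; the stage functions are hypothetical):

```python
import queue
import threading

def stage(process, q_in, q_out):
    # Each pipeline stage runs on its own thread, consuming items in
    # order from q_in and passing processed items to q_out.
    while True:
        item = q_in.get()
        if item is None:          # sentinel: shut the stage down
            q_out.put(None)
            return
        q_out.put(process(item))

q1, q2, q3 = queue.Queue(), queue.Queue(), queue.Queue()
t1 = threading.Thread(target=stage, args=(lambda x: x + 1, q1, q2))
t2 = threading.Thread(target=stage, args=(lambda x: x * 10, q2, q3))
t1.start()
t2.start()

for item in [1, 2, 3]:
    q1.put(item)
q1.put(None)

results = []
while (out := q3.get()) is not None:
    results.append(out)
t1.join()
t2.join()

print(results)  # [20, 30, 40] -- ordering is preserved through the pipeline
```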

Page 31:

Structured Grid Parallelism

• Data is arranged in a regular multidimensional grid

• Computation proceeds as a sequence of grid update steps.

• At each step, all points are updated using values from a small neighborhood around each point.

• In the parallel version, each grid is divided into subgrids and each subgrid is computed independently
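A sequential sketch of one structured-grid update step (a 1-D averaging stencil, chosen for brevity); in a parallel version the loop's index range would be split into subgrids, one per core, with only subgrid-boundary points needing neighbors' values:

```python
def grid_step(grid):
    """One update step: each interior point becomes the average of the
    small neighborhood around it (its two neighbors and itself)."""
    new = grid[:]
    for i in range(1, len(grid) - 1):
        new[i] = (grid[i - 1] + grid[i] + grid[i + 1]) / 3.0
    return new

# In a parallel version, the index range 1..len-2 would be divided into
# subgrids and each subgrid updated independently.
g = [0.0, 0.0, 3.0, 0.0, 0.0]
print(grid_step(g))  # [0.0, 1.0, 1.0, 1.0, 0.0]
```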

Page 32:

Summary of Current languages

Page 33:

An example Parallel program with CILK

cilk int fib (int n)
{
    if (n < 2) return n;
    else
    {
        int x, y;
        x = spawn fib (n-1);
        y = spawn fib (n-2);
        sync;
        return (x+y);
    }
}

Cilk employs work stealing as part of its parallel scheduling

Page 34:

5. MEASURING MULTICORE PERFORMANCE

• Developers face quandaries when making trade-offs with multicore processors.

– Even if a dual-core processor appears to be better than a single-core processor, how much better is it? Twice as good?

– Are more cores worth the additional cost, design complexity, power consumption, and programming difficulty?

• Multicore technologies are highly differentiated, so multicore benchmarks need to be highly differentiating as well.
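One standard way to answer "how much better?" quantitatively is Amdahl's law (a well-known formula, added here as a worked illustration, not from the slides): with parallel fraction p and n cores, speedup = 1 / ((1 - p) + p / n), so a dual-core is twice as good only if the workload is entirely parallel:

```python
def amdahl_speedup(p, n):
    """Amdahl's law: speedup of a workload whose fraction p is
    parallelizable, run on n cores."""
    return 1.0 / ((1.0 - p) + p / n)

print(amdahl_speedup(1.0, 2))            # 2.0  -- perfectly parallel: twice as good
print(round(amdahl_speedup(0.9, 2), 2))  # 1.82 -- 90% parallel: less than 2x
```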

Page 35:

Multicore Performance Evaluation Benchmarks

• One of the most important potential benefits of multicore:
– using parallelism to improve the performance of individual tasks rather than improving overall throughput.

• Accomplishing such parallelism would require benchmarks that utilize task decomposition, functional decomposition, or data decomposition.

• These methods could more comprehensively exercise all of the major multicore benchmark criteria.

Page 36:

Multicore Performance Evaluation Benchmarks

• Intel® VTune™ Performance Analyzer
– http://www.intel.com/software/products/vtune
– Evaluates applications on all sizes of systems based on Intel® processors, from embedded systems through supercomputers
– Helps users deliver fast software on the latest 64-bit multicore systems

• OProfile
– http://oprofile.sourceforge.net/
– Profiles all processes, shared libraries, the kernel and device drivers, via the hardware performance monitoring units built into the processor

• VampirTrace
– Provides programmers with tracing facilities for parallel programs.
– It consists of a profiling library for MPI programs that generates traces during the runtime, and a visualization tool, called Vampir, for displaying the generated traces

• Performance Debugging Tool (PDT)
– A parallel tracing tool developed to record specific events during program executions, e.g., communication events.

Page 37:

MultiBench

• An extensive suite of multicore benchmarks that utilizes an API abstraction to more easily support SMP architectures [3].

• The multicore benchmarks are delivered as a set of workloads, each comprising one or more work items.

• Although it’s easier said than done, users can select from this list the workloads that most closely resemble their application.

Page 38:

MultiBench

• EEMBC's (Embedded Microprocessor Benchmark Consortium) MultiBench [3] is a new benchmark suite for measuring the throughput of multiprocessor systems, including those built with multicore processors

Page 39:

Performance Scaling With MultiBench

Ex: the results of running MultiBench on two different dual-core processors clocked at 2.0 GHz

Ex: Performance Scaling on a quad-core processor

Page 40:

6. CHALLENGES OF MULTICORES

• Software challenges

– Reentrancy: all code shared between more than one task must be reentrant.

– Deadlocks: deadlocks happen when two or more tasks become blocked because each is waiting on a resource that another holds.

– Livelocks:
• Livelocks are similar to deadlocks, with one exception.
• Livelocks don't block the threads.
• The threads fail to do useful work because they require something from another task.

Page 41:

Multicore Challenges

– Race conditions
• Occur when two threads try to access the same variable at the same time.
• The outcome of a computation may vary depending on which task gets access first.

– Synchronization
• Additional resources are needed to maintain synchronization.
• Many new techniques have been proposed, e.g. the Dynamic Buffer Allocation technique.

– Load Balancing
• Different load-balancing techniques result in quite different power, performance and thermal behavior of the processor.
• Current techniques include Round Robin (RR), Lower Index First (LIF) and Waiting Idle First (WIF).
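The race-condition and synchronization points above can be sketched in a few lines: without a lock, two or more threads incrementing a shared counter may interleave their read-modify-write steps and lose updates; guarding the update with a lock makes the outcome deterministic (an illustrative sketch, not from the slides):

```python
import threading

counter = 0
lock = threading.Lock()

def add_many(n):
    global counter
    for _ in range(n):
        with lock:            # remove the lock and the read-modify-write
            counter += 1      # below may interleave, losing updates

threads = [threading.Thread(target=add_many, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000 -- deterministic because the update is locked
```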

Page 42:

Debugging Challenges

• Non-deterministic behavior makes problems "less visible".

• One breakpoint can be hit several times consecutively by the same task on different threads.

• The data being processed can change from break to break.

Page 43:

Programmer’s challenges

• Until now, parallel programming has largely been limited to scientists for research purposes.

• The rest of the regular software community is not ready for the shift.

• It is difficult to identify the sequential and parallel parts of a code.

• Parallel code should also be portable and cross-platform.

Page 44:

Hardware Challenges

• Cache Coherence
– Coherence must be maintained across private caches
– A simple solution: an invalidation-based protocol with snooping
– Additional hardware: an inter-core bus
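A toy model of the invalidation-based snooping idea, assuming a simple valid/invalid state per private cache line (hypothetical and far simpler than a real MSI/MESI protocol): on a write, the writing core broadcasts an invalidate over the inter-core bus, and every other core drops its stale copy.

```python
class Core:
    """A core with a private cache; 'bus' is the shared inter-core bus
    (modeled here as the list of all attached cores) used for snooping."""
    def __init__(self, bus):
        self.cache = {}     # address -> value (valid lines only)
        self.bus = bus
        bus.append(self)

    def write(self, addr, value, memory):
        # Broadcast an invalidate: every other core snoops the bus and
        # drops its (now stale) copy of the line.
        for core in self.bus:
            if core is not self:
                core.cache.pop(addr, None)
        self.cache[addr] = value
        memory[addr] = value    # write-through, to keep the toy simple

    def read(self, addr, memory):
        if addr not in self.cache:          # miss: fetch from memory
            self.cache[addr] = memory[addr]
        return self.cache[addr]

memory = {0x10: 1}
bus = []
c0, c1 = Core(bus), Core(bus)
c1.read(0x10, memory)        # c1 caches the old value
c0.write(0x10, 99, memory)   # c0's write invalidates c1's copy
print(c1.read(0x10, memory)) # 99 -- c1 re-fetches the fresh value
```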

Page 45:

Hardware Challenges

• The "best" interconnection architecture depends on many design constraints:
– power/area budget
– bandwidth requirements
– technology
– cache and core configuration

• Interconnect, cores and caches interact more deeply in an on-chip network.

• The interconnect fabric itself is large and power-hungry.

• The interconnect, even without the sharing of L2 caches, can take the area of three cores and the power of one.

Page 46:

Hardware Challenges

• A Hierarchical Interconnection Model
– Shared Bus Fabric (SBF)
• Provides a shared connection between different modules.
• Provides coherent communication when L1 and L2 caches are private.
– P2P Connections
• Connect two SBFs
– Components of an SBF:
• Address Bus
• Snoop Bus
• Data Bus
• Response Bus

Page 47:

Interconnect

• First, the requester (a core) signals that it has a request.

• It sends the request over an address bus (AB).

• Requests are taken off the address bus and placed in a snoop queue, awaiting access to the snoop bus (SB).

• Transactions placed on the snoop bus cause each snooping node to place a response on the response bus (RB).

• Queues at the end of the response bus collect these responses and generate a broadcast message to each involved party.

• Finally, the data is sent over a bidirectional data bus (DB) to the original requester.

• If there are multiple SBFs (e.g., connected by a P2P link), the address request will be broadcast to the other SBFs via that link, and a combined response returned.
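The request flow above is a fixed sequence of bus phases; a trivial trace function makes the ordering explicit (the bus names follow the slide's labels, and the core names are hypothetical, not from any real hardware specification):

```python
def sbf_transaction(requester, snoopers):
    """Trace one request through the Shared Bus Fabric phases described
    above: address bus -> snoop queue/bus -> response bus -> data bus."""
    trace = [f"{requester}: request on address bus (AB)"]
    trace.append("request queued, then placed on snoop bus (SB)")
    for node in snoopers:                 # every snooping node answers
        trace.append(f"{node}: response on response bus (RB)")
    trace.append("responses collected and broadcast to involved parties")
    trace.append(f"data returned to {requester} on data bus (DB)")
    return trace

for line in sbf_transaction("core0", ["core1", "core2"]):
    print(line)
```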

Page 48:

Conclusion

• Multi-core chips are an important new trend in computer architecture

• Several new multi-core chips are in design phases

• Parallel programming techniques likely to gain importance

• The general trend in processor development is moving from dual-, tri-, quad-, hexa- and octo-core chips toward ones with tens or even hundreds of cores.

• Also, multi-core chips mixed with simultaneous multithreading, memory-on-chip, and special-purpose "heterogeneous" cores promise further performance and efficiency gains, especially in processing multimedia, recognition and networking applications.

• There is also a trend of improving energy efficiency by focusing on performance-per-watt, with advanced fine-grain or ultra-fine-grain power management and dynamic voltage and frequency scaling.

Page 49:

References

1. Alexandra Fedorova, Sergey Blagodurov, Sergey Zhuravlev. Managing Contention for Shared Resources on Multicore Processors. 2010.

2. Chandra, D., Guo, F., Kim, S. and Solihin, Y. Predicting Inter-Thread Cache Contention on a Multiprocessor Architecture. In Proceedings of the 11th International Symposium on High-Performance Computer Architecture, 2005.

3. www.eembc.org/press/pressrelease/223001_M30_EEMBC.pdf

4. S. Cho and L. Jin. Managing Distributed, Shared L2 Caches through OS-Level Page Allocation. In MICRO 39: Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, pages 455–468, 2006.

5. Xiaowen Chen, Zhonghai Lu, Axel Jantsch and Shuming Chen. Supporting Efficient Synchronization in Multi-Core NoCs Using Dynamic Buffer Allocation Technique.

6. Enric Musoll. A Thermal-Friendly Load-Balancing Technique for Multi-Core Processors.

7. Rakesh Kumar, Victor Zyuban, Dean M. Tullsen. Interconnections in Multi-Core Architectures: Understanding Mechanisms, Overheads and Scaling.

8. Ronald Goodman, Scott Black. Design Challenges for Realization of the Advantages of Embedded Multi-Core Processors.

9. Anna Youssefi. Synchronization and Pipelining on Multicore: Shaping Parallelism for a New Generation of Processors.

