FRUCT Seminar 28-30 May 2009, St.Petersburg 1
Embedded Computing Challenges and Trends
Yuriy Sheynin, Director, Doctor of Science
190 000 St. Petersburg, Bolshaya Morskaya, No 67
Tel/Fax: +7 812 710 6234
E-mail: [email protected]
St. Petersburg State University of Aerospace Instrumentation
Institute of High-Performance Computer and Network Technologies
Natural parallelism
Real-world applications are naturally parallel.
Hardware is naturally parallel.
Software should be naturally parallel as well: the programming model, system software, and architecture.
Seven critical questions for 21st Century parallel computing.
// “The Landscape of Parallel Computing Research: A View from Berkeley” //
Embedded and High-Performance Computing
have more in common looking forward than they did in the past.
Both are concerned with power.
Both are concerned with hardware utilization and are sensitive to cost.
The size of embedded software increases, so hand tuning must be limited and the importance of software reuse increases.
Both embedded systems and high-end servers now connect to networks; both need to prevent unwanted accesses and viruses.
Future embedded computing applications map closely to problems in scientific computing.
Biggest difference: the emphasis on real-time computing in embedded systems.
Moore's Law:
The number of transistors on a chip doubles every 2 years.
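As a back-of-the-envelope illustration of how this doubling rule compounds (the starting transistor count and time horizon below are illustrative assumptions, not data from the talk):

```python
# Moore's Law sketch: transistor count doubles every 2 years.
def transistors(initial: int, years: float, doubling_period: float = 2.0) -> float:
    """Project a transistor count under a fixed doubling period."""
    return initial * 2 ** (years / doubling_period)

# A hypothetical 100M-transistor chip, projected 10 years out:
projected = transistors(100_000_000, 10)
print(f"{projected:.0f}")  # 100M * 2^5 = 3200000000 (3.2 billion)
```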
From multicore to “manycore”
Conventional wisdom is to double the number of cores on a chip with each silicon generation.
Manycore architectures and supporting software technologies would "reset" microprocessor hardware and software roadmaps for the next 30 years.
An evolutionary approach to parallel hardware and software may work for 2- or 8-processor systems, but is unlikely to scale to manycore.
Scaling cores – scaling performance?
Adding cores can slow data-intensive applications
Intel's experimental chip has 80 cores.
Chips are built from processing elements that are the most efficient in
MIPS (Million Instructions per Second) per watt,
MIPS per area of silicon,
MIPS per development dollar.
The target -- 1000s of cores per chip
Energy and Power
Distinguish between energy (Joules) and power (Joules/second or Watts), which is the rate of consuming energy.
The energy used by a computation affects the battery life of a mobile device.
Energy per task is usually a metric to be minimized in a design.
Peak power consumption is usually treated as a design constraint.
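A minimal sketch of the energy/power distinction, using the slide's definitions; the two designs and their numbers are illustrative assumptions:

```python
# Energy (joules) vs. power (watts = joules/second).
def energy_joules(power_watts: float, seconds: float) -> float:
    """Energy consumed at constant power over a time interval."""
    return power_watts * seconds

# Two hypothetical designs running the same task:
# A is fast but power-hungry; B is slower but frugal.
task_a = energy_joules(power_watts=2.0, seconds=5.0)   # 10.0 J per task
task_b = energy_joules(power_watts=0.5, seconds=15.0)  # 7.5 J per task
print(task_a, task_b)
```

B wins on energy per task (and thus battery life) even though it is slower, while A's 2 W peak might violate a hypothetical 1 W peak-power design constraint: the metric to minimize and the constraint to respect are different quantities.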
Memory
The old Amdahl rule of thumb: a balanced computer system needs about 1 MB of main memory capacity per MIPS of processor performance.
The DRAM industry has dramatically lowered the price per gigabyte: from $10,000,000 per gigabyte in 1980 to $100 per gigabyte in 2007.
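Applying these two numbers from the slide (the 2000-MIPS processor is an illustrative assumption):

```python
# Amdahl's rule of thumb: ~1 MB of main memory per MIPS of performance.
def balanced_memory_mb(mips: float, mb_per_mips: float = 1.0) -> float:
    """Memory capacity a balanced system needs for a given MIPS rating."""
    return mips * mb_per_mips

print(balanced_memory_mb(2000))  # 2000 MB for a hypothetical 2000-MIPS processor

# DRAM price drop quoted on the slide: $10,000,000/GB (1980) -> $100/GB (2007).
price_drop_factor = 10_000_000 / 100
print(price_drop_factor)  # a factor of 100,000 in 27 years
```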
Challenge #4: Memory Latency
Latency Lags Bandwidth
Over the last 25 years, latency has lagged bandwidth:
bandwidth improved 120X to 2200X;
latency improved only 4X to 20X.
Rule of Thumb for Latency Lagging BW
Bandwidth improves by more than the square of the improvement in Latency
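The rule of thumb can be checked directly against the improvement ranges quoted above:

```python
# Rule of thumb: bandwidth improves by more than the square of the
# improvement in latency.
def rule_holds(bw_improvement: float, latency_improvement: float) -> bool:
    return bw_improvement > latency_improvement ** 2

# Low end and high end of the 25-year ranges from the slide:
print(rule_holds(120, 4))    # True: 120 > 4^2 = 16
print(rule_holds(2200, 20))  # True: 2200 > 20^2 = 400
```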
Regarding the hardware and architecture
Use simple processors.
Innovate in memory as well as in processor design.
Consider separate latency-oriented and bandwidth-oriented networks.
Consider a hybrid interconnect design that uses circuit switches to tailor the interconnect topology to application requirements.
Provide richer hardware support for fine-grained synchronization and communication constructs.
Do not include features that significantly affect performance or energy unless you also provide performance counters and energy counters that let programmers accurately measure their impact.
How to evaluate parallel computing?
Conventional way to evaluate architecture innovation is to study a benchmark suite based on existing programs, such as EEMBC (Embedded Microprocessor Benchmark Consortium)
It seems unwise to let a set of existing source code drive an investigation into parallel computing.
One obstacle to innovation in parallel computing is that it is currently unclear how best to express a parallel computation.
Find a higher level of abstraction for reasoning about parallel application requirements.
The approach is to define a number of characterizing cases that capture a pattern of computation and communication common to a class of important applications.
Recognition, Mining, and Synthesis (RMS)
"RMS is multimodal recognition and synthesis over large and complex data sets."
Recognition is a form of machine learning, where computers examine data and construct mathematical models of that data.
Mining searches the web to find instances of that model.
Synthesis refers to the creation of new models, such as in graphics.
Berkeley’s “Dwarfs”
1. Dense Linear Algebra
2. Sparse Linear Algebra
3. Spectral Methods
4. N-Body Methods
5. Structured Grids
6. Unstructured Grids
7. Monte Carlo
8. Combinational Logic (e.g., encryption)
9. Graph Traversal
10. Dynamic Programming
11. Backtrack and Branch+Bound
12. Construct Graphical Models
13. Finite State Machine
"Dwarfs" constitute classes where membership in a class is defined by similarity in computation and data movement.
Dwarfs are specified at a high level of abstraction that can group related but quite different computational methods.
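As a concrete instance of one dwarf, here is a minimal 1-D structured-grid (stencil) sweep; the grid values are illustrative, and real codes iterate such sweeps over 2-D or 3-D grids:

```python
# Dwarf #5, Structured Grids: a 1-D three-point averaging stencil.
def stencil_step(grid):
    """One Jacobi-style sweep: each interior point becomes the mean of
    itself and its two neighbours. The neighbour access pattern is regular
    and known statically -- the hallmark of the structured-grid dwarf."""
    return [grid[0]] + [
        (grid[i - 1] + grid[i] + grid[i + 1]) / 3
        for i in range(1, len(grid) - 1)
    ] + [grid[-1]]

print(stencil_step([0.0, 0.0, 3.0, 0.0, 0.0]))
# -> [0.0, 1.0, 1.0, 1.0, 0.0]: the spike diffuses to its neighbours
```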
Performance Limit: Memory Bandwidth, Memory Latency, or Computation?
The memory wall limits performance for almost half the dwarfs.
Memory latency is a bigger problem than memory bandwidth.
Memory bandwidth limited: 2. Sparse Matrix, 5. Structured Grid
Memory latency limited: 3. Spectral (FFT), 6. Unstructured Grid, 9. Graph Traversal, 10. Dynamic Programming
Computationally limited: 1. Dense Matrix, 4. N-Body, 8. Combinational Logic
Communication patterns and interconnections
Communication patterns are observed to be sparse.
A non-blocking crossbar will be grossly over-designed for most application requirements.
Communication patterns are not isomorphic to a fixed-topology interconnect such as a torus, mesh, or hypercube.
Assigning a dedicated path to each point-to-point message transfer is not solved trivially by any given fixed-degree interconnect topology.
Either carefully place jobs so that they match the static topology of the interconnect fabric, or employ an interconnect fabric that can be reconfigured to conform to the application's communication topology.
Dependability
The next generation of microprocessors will face higher soft and hard error rates.
Redundancy in space or in time is the way to make dependable systems from undependable components.
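A minimal sketch of redundancy in space, as triple modular redundancy (TMR): three replicas compute the same result and a majority vote masks a single fault. The faulty replica below is an illustrative assumption.

```python
# Triple modular redundancy: a dependable result from three
# possibly-faulty copies, via majority vote.
def tmr_vote(a, b, c):
    """Return the majority value of three replicated computations."""
    return a if a == b or a == c else b

correct = lambda x: x * x
faulty = lambda x: x * x + 1  # models a soft error in one replica

print(tmr_vote(correct(5), faulty(5), correct(5)))  # 25: the error is masked
```

Redundancy in time works analogously: re-run the computation and compare, trading throughput for dependability instead of silicon area.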
Programming models
A programming model is a bridge between a system developer's natural model of an application and an implementation of that application on available hardware.
A programming model should:
allow the programmer to balance the competing goals of productivity and implementation efficiency;
be independent of the number of processors;
support a wide range of data types;
support successful models of parallelism: task-level parallelism, word-level parallelism, and bit-level parallelism.
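The three models of parallelism can be sketched in miniature (all data below is illustrative; a word-level example in Python only mimics what SIMD hardware does per lane):

```python
from concurrent.futures import ThreadPoolExecutor

# Task-level parallelism: independent tasks run concurrently.
def task_level(tasks):
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(lambda f: f(), tasks))

# Word-level (SIMD-style) parallelism: one operation applied across
# a vector of words; each lane is independent of the others.
def word_level(xs, ys):
    return [x + y for x, y in zip(xs, ys)]

# Bit-level parallelism: one machine word operates on many bits at once,
# e.g. 64 AND gates "in parallel" in a single operation.
def bit_level(a: int, b: int) -> int:
    return a & b

print(task_level([lambda: 1 + 1, lambda: 2 * 2]))  # [2, 4]
print(word_level([1, 2, 3], [10, 20, 30]))         # [11, 22, 33]
print(bin(bit_level(0b1100, 0b1010)))              # 0b1000
```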
Metrics for Success
Maximizing programmer productivity
The ability to productively program the high-performance multiprocessors of the future is at least as important as providing high-performance silicon implementations of these architectures.
Maximizing application performance
Radical ideas are required to make manycore architectures a secure and robust base for productive software development.
New efficiency metrics for new parallel architectures
Minimizing remote accesses. When data is accessed by computational tasks spread over different processing elements, its placement must be optimized to minimize communication.
Load balance. The mapping of computational tasks to processing elements must be performed so as to minimize the time processing elements sit idle waiting for data or synchronization.
Granularity of data movement and synchronization. Most modern networks perform best for large data transfers; the latency of synchronization is high, so it is advantageous to synchronize as little as possible.
Adaptive Libraries and Autotuners vs. Traditional Compilers
The performance of parallel applications depends on the quality of the generated code, traditionally the responsibility of the compiler.
It is considered difficult to add new optimizations to compilers, yet new optimizations are needed in the transition from instruction-level parallelism to task- and data-level parallelism.
Peak performance may still require handcrafting the program in languages like C, FORTRAN, or even assembly code.
Adaptive libraries automatically adapt ready-made library components to the specific features of the particular application and the particular computing platform.
Autotuners optimize a set of library kernels by generating many variants of a given kernel and benchmarking each variant by running it on the target platform.
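An autotuner in miniature: generate variants of a kernel, time each on the target platform, and keep the fastest. The "sum of squares" kernel and its three variants are hypothetical stand-ins for real tuned kernels such as FFTs or matrix multiplies.

```python
import timeit

def sum_loop(data):
    total = 0
    for x in data:
        total += x * x
    return total

def make_variants():
    """Hypothetical variants of a 'sum of squares' kernel."""
    data = list(range(10_000))
    return {
        "loop":      lambda: sum_loop(data),
        "generator": lambda: sum(x * x for x in data),
        "map":       lambda: sum(map(lambda x: x * x, data)),
    }

def autotune(variants, repeats=5):
    """Benchmark each variant and return the name of the fastest."""
    timings = {name: min(timeit.repeat(fn, number=10, repeat=repeats))
               for name, fn in variants.items()}
    return min(timings, key=timings.get)

best = autotune(make_variants())
print(best)  # which variant wins depends on the platform
```

The key property is that the winner is chosen empirically on the target machine, not predicted by a compiler's static cost model.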
Operating systems - Composable Primitives not Pre-Packaged Solutions
Embedded systems have historically had very minimal application-specific run-time systems
As embedded systems increase in functionality, protection and reliability concerns require much more sophisticated and stable operating systems and hardware support.
Operating systems will have more in common for embedded and server computing.
Operating systems will be “deconstructed”
The operating system could essentially be a set of libraries where only the functions needed are linked into the application, on top of a thin virtual-machine layer providing protection and sharing of hardware resources.
Everything is changing in embedded computing…
"Power wall": Power is expensive, but transistors are "free"; we can put more transistors on a chip than we have the power to turn on. The concern isn't only dynamic power; static power due to leakage can be 40% of total power.
"Memory wall": Load and store are slow, but multiply is fast.
“ILP wall”: There are diminishing returns on finding more ILP
Monolithic uniprocessors in silicon were reliable internally, with errors occurring at the pins. Below 65 nm feature sizes, chips will have high soft and hard error rates.
Hard to scale a successful chip project. Wire delay, noise, cross coupling (capacitive and inductive), manufacturing variability, reliability (see above), clock jitter, design validation, and so on conspire to stretch the development time and cost of large designs at 65 nm or smaller feature sizes.
The cost of masks at 65 nm feature size, the cost of Electronic Computer-Aided Design software to design such chips, and the cost of design for GHz clock rates mean researchers can no longer build believable prototypes to demonstrate new architecture ideas.
Bandwidth improves by at least the square of the improvement in latency.
Uniprocessor performance doubled every 18 months. Now doubling of uniprocessor performance may take 5 years. Increasing parallelism is the primary method of improving processor performance.
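Translating those two doubling periods into annual growth rates shows the size of the slowdown:

```python
# Annual growth factor implied by a given doubling period (in years).
def annual_growth(doubling_years: float) -> float:
    return 2 ** (1 / doubling_years)

old = annual_growth(1.5)  # doubling every 18 months
new = annual_growth(5.0)  # doubling every 5 years

print(f"{(old - 1) * 100:.0f}% vs {(new - 1) * 100:.0f}% per year")
# roughly 59% vs 15% per year
```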
Thank you