FRUCT Seminar 28-30 May 2009, St.Petersburg 1
Embedded Computing Challenges and Trends
Yuriy Sheynin, Director, Doctor of Science
190 000 St. Petersburg, Bolshaya Morskaya, No 67
Tel/Fax: +7 812 710 6234
E-mail: [email protected]
St. Petersburg State University of Aerospace Instrumentation
Institute of High-Performance Computer and Network Technologies
Natural parallelism
Real-world applications are naturally parallel.
Hardware is naturally parallel.
Software should be naturally parallel as well: the programming model, system software, and architecture.
Seven critical questions for 21st Century parallel computing.
// “The Landscape of Parallel Computing Research: A View from Berkeley” //
Embedded and High-Performance Computing
have more in common looking forward than they did in the past.
Both are concerned with power.
Both are concerned with hardware utilization and are sensitive to cost.
The size of embedded software increases, so hand tuning must be limited and the importance of software reuse increases.
Both embedded systems and high-end servers now connect to networks; both need to prevent unwanted accesses and viruses.
Future embedded computing applications map closely to problems in scientific computing.
Biggest difference: the emphasis on real-time computing in embedded systems.
Moore's Law:
The number of transistors on a chip doubles every 2 years.
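As a back-of-the-envelope illustration of how this doubling rule compounds (the starting transistor count and time horizon below are illustrative assumptions, not data from the talk):

```python
# Moore's Law sketch: transistor count doubles every 2 years.
def transistors(initial: int, years: float, doubling_period: float = 2.0) -> float:
    """Project a transistor count under a fixed doubling period."""
    return initial * 2 ** (years / doubling_period)

# A hypothetical 100M-transistor chip, projected 10 years out:
projected = transistors(100_000_000, 10)
print(f"{projected:.0f}")  # 100M * 2^5 = 3200000000 (3.2 billion)
```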
From multicore to “manycore”
Conventional wisdom is to double the number of cores on a chip with each silicon generation.
Manycore architectures and supporting software technologies would "reset" microprocessor hardware and software roadmaps for the next 30 years.
An evolutionary approach to parallel hardware and software may work for 2- or 8-processor systems, but is unlikely to scale to manycore.
Scaling cores – scaling performance?
Adding cores can slow data-intensive applications
Intel's experimental chip has 80 cores.
Chips are built from processing elements that are the most efficient in
MIPS (Million Instructions per Second) per watt,
MIPS per area of silicon,
MIPS per development dollar.
The target -- 1000s of cores per chip
Energy and Power
Distinguish between energy (Joules) and power (Joules/second or Watts), which is the rate of consuming energy.
The energy used by a computation affects the battery life of a mobile device.
Energy per task is usually a metric to be minimized in a design.
Peak power consumption is usually treated as a design constraint.
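A minimal sketch of the energy/power distinction, using the slide's definitions; the two designs and their numbers are illustrative assumptions:

```python
# Energy (joules) vs. power (watts = joules/second).
def energy_joules(power_watts: float, seconds: float) -> float:
    """Energy consumed at constant power over a time interval."""
    return power_watts * seconds

# Two hypothetical designs running the same task:
# A is fast but power-hungry; B is slower but frugal.
task_a = energy_joules(power_watts=2.0, seconds=5.0)   # 10.0 J per task
task_b = energy_joules(power_watts=0.5, seconds=15.0)  # 7.5 J per task
print(task_a, task_b)
```

B wins on energy per task (and thus battery life) even though it is slower, while A's 2 W peak might violate a hypothetical 1 W peak-power design constraint: the metric to minimize and the constraint to respect are different quantities.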
Memory
The old Amdahl rule of thumb: a balanced computer system needs about 1 MB of main memory capacity per MIPS of processor performance.
The DRAM industry has dramatically lowered the price per gigabyte: from $10,000,000 per gigabyte in 1980 to $100 per gigabyte in 2007.
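Applying these two numbers from the slide (the 2000-MIPS processor is an illustrative assumption):

```python
# Amdahl's rule of thumb: ~1 MB of main memory per MIPS of performance.
def balanced_memory_mb(mips: float, mb_per_mips: float = 1.0) -> float:
    """Memory capacity a balanced system needs for a given MIPS rating."""
    return mips * mb_per_mips

print(balanced_memory_mb(2000))  # 2000 MB for a hypothetical 2000-MIPS processor

# DRAM price drop quoted on the slide: $10,000,000/GB (1980) -> $100/GB (2007).
price_drop_factor = 10_000_000 / 100
print(price_drop_factor)  # a factor of 100,000 in 27 years
```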
Challenge #4: Memory Latency
Latency Lags Bandwidth
Over the last 25 years, latency has lagged bandwidth:
bandwidth improved 120X to 2200X;
latency improved only 4X to 20X.
Rule of Thumb for Latency Lagging BW
Bandwidth improves by more than the square of the improvement in Latency
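The rule of thumb can be checked directly against the improvement ranges quoted above:

```python
# Rule of thumb: bandwidth improves by more than the square of the
# improvement in latency.
def rule_holds(bw_improvement: float, latency_improvement: float) -> bool:
    return bw_improvement > latency_improvement ** 2

# Low end and high end of the 25-year ranges from the slide:
print(rule_holds(120, 4))    # True: 120 > 4^2 = 16
print(rule_holds(2200, 20))  # True: 2200 > 20^2 = 400
```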
Regarding the hardware and architecture
Use simple processors.
Innovate in memory as well as in processor design.
Consider separate latency-oriented and bandwidth-oriented networks.
Consider a hybrid interconnect design that uses circuit switches to tailor the interconnect topology to application requirements.
Provide richer hardware support for fine-grained synchronization and communication constructs.
Do not include features that significantly affect performance or energy unless you also provide performance counters and energy counters that let programmers accurately measure their impact.
How to evaluate parallel computing?
Conventional way to evaluate architecture innovation is to study a benchmark suite based on existing programs, such as EEMBC (Embedded Microprocessor Benchmark Consortium)
It seems unwise to let a set of existing source code drive an investigation into parallel computing.
One obstacle to innovation in parallel computing is that it is currently unclear how best to express a parallel computation.
Find a higher level of abstraction for reasoning about parallel application requirements.
The approach is to define a number of characterizing cases that capture a pattern of computation and communication common to a class of important applications.
Recognition, Mining, and Synthesis (RMS)
"RMS is multimodal recognition and synthesis over large and complex data sets."
Recognition is a form of machine learning, where computers examine data and construct mathematical models of that data.
Mining searches the web to find instances of that model.
Synthesis refers to the creation of new models, such as in graphics.
Berkeley’s “Dwarfs”
1. Dense Linear Algebra
2. Sparse Linear Algebra
3. Spectral Methods
4. N-Body Methods
5. Structured Grids
6. Unstructured Grids
7. Monte Carlo
8. Combinational Logic (e.g., encryption)
9. Graph Traversal
10. Dynamic Programming
11. Backtrack and Branch+Bound
12. Construct Graphical Models
13. Finite State Machine
"Dwarfs" constitute classes where membership in a class is defined by similarity in computation and data movement.
Dwarfs are specified at a high level of abstraction that can group related but quite different computational methods.
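As a concrete instance of one dwarf, here is a minimal 1-D structured-grid (stencil) sweep; the grid values are illustrative, and real codes iterate such sweeps over 2-D or 3-D grids:

```python
# Dwarf #5, Structured Grids: a 1-D three-point averaging stencil.
def stencil_step(grid):
    """One Jacobi-style sweep: each interior point becomes the mean of
    itself and its two neighbours. The neighbour access pattern is regular
    and known statically -- the hallmark of the structured-grid dwarf."""
    return [grid[0]] + [
        (grid[i - 1] + grid[i] + grid[i + 1]) / 3
        for i in range(1, len(grid) - 1)
    ] + [grid[-1]]

print(stencil_step([0.0, 0.0, 3.0, 0.0, 0.0]))
# -> [0.0, 1.0, 1.0, 1.0, 0.0]: the spike diffuses to its neighbours
```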
Performance Limit: Memory Bandwidth, Memory Latency, or Computation?
The memory wall limits performance for almost half the dwarfs.
Memory latency is a bigger problem than memory bandwidth.
Memory bandwidth limited: 2. Sparse Matrix, 5. Structured Grid
Memory latency limited: 3. Spectral (FFT), 6. Unstructured Grid, 9. Graph Traversal, 10. Dynamic Programming
Computationally limited: 1. Dense Matrix, 4. N-Body, 8. Combinational Logic
Communication patterns and interconnections
Communication patterns are observed to be sparse.
A non-blocking crossbar will be grossly over-designed for most application requirements.
Communication patterns are not isomorphic to a fixed-topology interconnect such as a torus, mesh, or hypercube.
Assigning a dedicated path to each point-to-point message transfer is not solved trivially by any given fixed-degree interconnect topology.
Either carefully place jobs so that they match the static topology of the interconnect fabric, or employ an interconnect fabric that can be reconfigured to conform to the application's communication topology.
Dependability
The next generation of microprocessors will face higher soft and hard error rates.
Redundancy in space or in time is the way to make dependable systems from undependable components.
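A minimal sketch of redundancy in space, as triple modular redundancy (TMR): three replicas compute the same result and a majority vote masks a single fault. The faulty replica below is an illustrative assumption.

```python
# Triple modular redundancy: a dependable result from three
# possibly-faulty copies, via majority vote.
def tmr_vote(a, b, c):
    """Return the majority value of three replicated computations."""
    return a if a == b or a == c else b

correct = lambda x: x * x
faulty = lambda x: x * x + 1  # models a soft error in one replica

print(tmr_vote(correct(5), faulty(5), correct(5)))  # 25: the error is masked
```

Redundancy in time works analogously: re-run the computation and compare, trading throughput for dependability instead of silicon area.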
Programming models
A programming model is a bridge between a system developer's natural model of an application and an implementation of that application on available hardware.
A programming model should:
allow the programmer to balance the competing goals of productivity and implementation efficiency;
be independent of the number of processors;
support a wide range of data types;
support successful models of parallelism: task-level parallelism, word-level parallelism, and bit-level parallelism.
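The three models of parallelism can be sketched in miniature (all data below is illustrative; a word-level example in Python only mimics what SIMD hardware does per lane):

```python
from concurrent.futures import ThreadPoolExecutor

# Task-level parallelism: independent tasks run concurrently.
def task_level(tasks):
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(lambda f: f(), tasks))

# Word-level (SIMD-style) parallelism: one operation applied across
# a vector of words; each lane is independent of the others.
def word_level(xs, ys):
    return [x + y for x, y in zip(xs, ys)]

# Bit-level parallelism: one machine word operates on many bits at once,
# e.g. 64 AND gates "in parallel" in a single operation.
def bit_level(a: int, b: int) -> int:
    return a & b

print(task_level([lambda: 1 + 1, lambda: 2 * 2]))  # [2, 4]
print(word_level([1, 2, 3], [10, 20, 30]))         # [11, 22, 33]
print(bin(bit_level(0b1100, 0b1010)))              # 0b1000
```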
Metrics for Success
Maximizing programmer productivity
The ability to productively program the high-performance multiprocessors of the future is at least as important as providing high-performance silicon implementations of these architectures.
Maximizing application performance
Radical ideas are required to make manycore architectures a secure and robust base for productive software development.
New efficiency metrics for new parallel architectures
Minimizing remote accesses. When data is accessed by computational tasks spread over different processing elements, its placement must be optimized to minimize communication.
Load balance. The mapping of computational tasks to processing elements must be performed so as to minimize the time processing elements sit idle waiting for data or synchronization.
Granularity of data movement and synchronization. Most modern networks perform best for large data transfers; the latency of synchronization is high, so it is advantageous to synchronize as little as possible.
Adaptive Libraries and Autotuners vs. Traditional Compilers
The performance of parallel applications depends on the quality of the generated code, traditionally the responsibility of the compiler.
It is considered difficult to add new optimizations to compilers, yet new optimizations are needed in the transition from instruction-level parallelism to task- and data-level parallelism.
Peak performance may still require handcrafting the program in languages like C, FORTRAN, or even assembly code.
Adaptive libraries automatically adapt ready-made library components to the specific features of the particular application and the particular computing platform.
Autotuners optimize a set of library kernels by generating many variants of a given kernel and benchmarking each variant by running it on the target platform.
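An autotuner in miniature: generate variants of a kernel, time each on the target platform, and keep the fastest. The "sum of squares" kernel and its three variants are hypothetical stand-ins for real tuned kernels such as FFTs or matrix multiplies.

```python
import timeit

def sum_loop(data):
    total = 0
    for x in data:
        total += x * x
    return total

def make_variants():
    """Hypothetical variants of a 'sum of squares' kernel."""
    data = list(range(10_000))
    return {
        "loop":      lambda: sum_loop(data),
        "generator": lambda: sum(x * x for x in data),
        "map":       lambda: sum(map(lambda x: x * x, data)),
    }

def autotune(variants, repeats=5):
    """Benchmark each variant and return the name of the fastest."""
    timings = {name: min(timeit.repeat(fn, number=10, repeat=repeats))
               for name, fn in variants.items()}
    return min(timings, key=timings.get)

best = autotune(make_variants())
print(best)  # which variant wins depends on the platform
```

The key property is that the winner is chosen empirically on the target machine, not predicted by a compiler's static cost model.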
Operating systems - Composable Primitives not Pre-Packaged Solutions
Embedded systems have historically had very minimal application-specific run-time systems
As embedded systems increase in functionality, protection and reliability concerns require much more sophisticated and stable operating systems and hardware support.
Operating systems will have more in common for embedded and server computing.
Operating systems will be “deconstructed”
The operating system could essentially be a set of libraries where only the functions needed are linked into the application, on top of a thin virtual-machine layer providing protection and sharing of hardware resources.
Everything is changing in embedded computing…
"Power wall": Power is expensive, but transistors are "free"; we can put more transistors on a chip than we have the power to turn on. The concern isn't only dynamic power; static power due to leakage can be 40% of total power.
"Memory wall": Load and store are slow, but multiply is fast.
“ILP wall”: There are diminishing returns on finding more ILP
Monolithic uniprocessors in silicon were reliable internally, with errors occurring at the pins. Below 65 nm feature sizes, chips will have high soft and hard error rates.
Hard to scale a successful chip project. Wire delay, noise, cross coupling (capacitive and inductive), manufacturing variability, reliability (see above), clock jitter, design validation, and so on conspire to stretch the development time and cost of large designs at 65 nm or smaller feature sizes.
The cost of masks at 65 nm feature size, the cost of Electronic Computer-Aided Design software to design such chips, and the cost of design for GHz clock rates mean researchers can no longer build believable prototypes to demonstrate new architecture ideas.
Bandwidth improves by at least the square of the improvement in latency.
Uniprocessor performance doubled every 18 months. Now doubling of uniprocessor performance may take 5 years. Increasing parallelism is the primary method of improving processor performance.
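Translating those two doubling periods into annual growth rates shows the size of the slowdown:

```python
# Annual growth factor implied by a given doubling period (in years).
def annual_growth(doubling_years: float) -> float:
    return 2 ** (1 / doubling_years)

old = annual_growth(1.5)  # doubling every 18 months
new = annual_growth(5.0)  # doubling every 5 years

print(f"{(old - 1) * 100:.0f}% vs {(new - 1) * 100:.0f}% per year")
# roughly 59% vs 15% per year
```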
Thank you