8/3/2019 Accelerated Processing Unit
1/15
Accelerated Processing Unit
CHAPTER 1
INTRODUCTION
Imagine a PC that:
Recognizes your gestures without a remote
Responds to your touch or voice to do your bidding
Supports bi-directional hi-definition video chat over links with limited bandwidth
Finds and tags the photos and videos in your library that contain particular faces, places or objects
Helps you sort through your photo libraries to eliminate duplicates saved with different file names
Enhances the videos you've created with regard to color, focus and image stability
Up-scales even low-quality content to seamlessly match the capabilities of your HD display
Adds stereoscopic 3D realism to 2D content
Supports immersive, multi-monitor 3D gaming experiences
Department of Electronics and Communication,College of Engineering , Adoor 1
Sells at price points well within reach of the mainstream consumer.
Many of these capabilities exist today piecemeal in labs, running on expensive,
workstation-class computers that cost as much as tens of thousands of dollars. Why
haven't we progressed further, faster in delivering these capabilities to the mainstream?
The semiconductor industry prides itself on rapid improvements in system performance, but hardware that runs fast enough to enable these advanced capabilities still costs far too
much to enable high-volume deployment. Software developers, always tuned to market
realities as well as technology, have focused their efforts on applications that run well on
the dual- and quad-core x86 processors that comprise the bulk of today's mainstream
system offerings. But change is in the air; in 2011, affordable mainstream systems that
can support these advanced capabilities are set to enter the market. You've probably
heard this story before. Every two years, advances in semiconductor technology allow
chip architects to double the number of transistors they can fit in a given area of silicon.
Over the past decade, these extra transistors have been used to increase the size of on-
chip caches and add more x86 processor cores to designs, making today's CPUs the
fastest processors ever. Even the slowest contemporary CPUs have more than enough
performance to handle traditional office productivity, Internet browsing and e-mail
applications, which long ago ceased to be limited by CPU speed. But as fast as they are,
today's CPUs lack the performance to deliver a vivid, modern computing experience on
their own. The latest applications require CPUs that can deal with vast amounts of data
and require hundreds, if not thousands of individual threads to manipulate the massive
databases needed to recognize an object in a scene, the meaning in a sentence, or an
anomaly in an x-ray image. Not surprisingly, traditional CPU architectures and
application programming tools optimized for scalar data structures and serial algorithms fit poorly with these new vector-oriented, multi-threaded, data-parallel models.
Fortunately, innovative architectures and tools better suited for these new workloads have
emerged. Graphics processing units (GPUs), originally intended to enhance 3D
visualization, have evolved into powerful, programmable vector processors that can
accelerate a wide variety of software applications. Software tools like DirectCompute and
OpenCL permit developers to create standards-based applications that combine the power
of CPU cores and programmable GPU cores, and run on a wide variety of hardware
platforms. A few ambitious independent software vendors (ISVs) have already added
support for these new vector capabilities into their most advanced products, even if they
had to structure their code around proprietary hardware and software interfaces to get the
job done.
Advanced Micro Devices' (AMD's) forthcoming Accelerated Processing Units
(APUs) build upon this momentum and take PC computing to the next level. These new
processors are being designed to accelerate multimedia and vector processing
applications, enhance the end-user's PC experience, reduce power consumption, and
offer a superior visual graphics experience at mainstream system price points. More
importantly, these APUs will enable ISVs to create new generations of applications and
user interfaces limited, perhaps, only by the inventiveness of their developers, rather than
by the constraints of the traditional CPU architectures that have dominated the computer
industry for decades.
CHAPTER 2
ACCELERATED PROCESSING UNIT
At the most basic level, Accelerated Processing Units combine general-purpose x86 CPU cores with programmable vector processing engines on a single silicon die. APUs also include a variety of critical system elements, including memory controllers, I/O controllers, specialized video decoders, display outputs, and bus interfaces, but the real appeal of these chips stems from the inclusion of both scalar and vector hardware as full-fledged processing elements. CPUs and basic graphics units have been lashed together in a single package before (as in VIA's CoreFusion), but never with truly programmable GPUs like those in AMD's Fusion, let alone GPUs that can be programmed using high-level industry-standard tools like DirectCompute and OpenCL. AMD is best situated to address this engineering challenge, as it is currently the only company with access to extensive IP resources (e.g. patents and engineering expertise) in both x86 processor technology and industry-leading GPU technology. In
fact, AMD's recognition that it needed proven GPU technology for future converged products drove its 2006 acquisition of ATI Technologies. APUs are set to arrive in a variety of shapes and sizes adapted to the requirements of their target markets. AMD has disclosed that its first APUs, code-named Llano and Ontario, are designed for mainstream desktop and notebook platforms, and for thin-and-light notebooks, netbooks and slates, respectively. Both of these APUs will combine multiple superscalar x86 processor cores with an array of programmable SIMD engines leveraged from AMD's discrete graphics portfolio. The key aspect to note is that all the major system elements, the x86 cores, the vector (SIMD) engines, and a Unified Video Decoder (UVD) for HD decoding tasks, attach directly to the same high-speed bus, and thus to the main system memory. This design concept eliminates one of the fundamental constraints that limit the performance of traditional integrated graphics processors (IGPs).
Until now, transistor budget constraints typically mandated a two-chip solution for such systems, forcing system architects to use a chip-to-chip crossing between the memory controller and either the CPU or GPU. These transfers affect memory latency, consume system power and thus impact battery life. The APU's scalar x86 cores and SIMD engines share a common path to system memory to help avoid these constraints. Total system performance can be further enhanced through the addition of a discrete GPU. The common architectures of the APU and GPU allow for a multi-GPU configuration where the system can scale to harness all available resources for exceptional graphics and enable truly breathtaking
overall performance. Although the APU's scalar x86 cores and SIMD engines share a common path to system memory, first-generation APU implementations divide that memory into regions managed by the operating system running on the x86 cores and other regions managed by software running on the SIMD engines. The APU provides high-speed block transfer engines that move data between the x86 and SIMD memory partitions. Unlike transfers between an external frame buffer and system memory, these transfers never hit the system's external bus. Clever software developers can overlap the loading and unloading of blocks in the SIMD memory with execution involving data in
other blocks. Insight 64 anticipates that future APU architectures will evolve towards a
more seamless memory management model that allows even higher levels of balanced
performance scaling. Just as AMD's architects have woven x86 cores and GPU cores
into a single hardware fabric, astute software developers can now begin to weave high
performance vector algorithms into programs previously constrained by the limited
computational capabilities of conventional scalar processors, even when arranged in
multi-core configurations. In just a few years, machines equipped with programmable
GPUs are expected to comprise a meaningful portion of the installed base of PCs.
Software coming from ISVs who take advantage of these enhanced capabilities will have
the ability to execute well beyond the capability of packages that lack support for these
features.
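The overlap of block loading and unloading with SIMD execution described above is classic double buffering. The sketch below (a hypothetical Python simulation, not an actual APU API; the function and parameter names are invented for illustration) shows the ping-pong pattern: while the block in one buffer is consumed, the next block is staged into the other.

```python
def process_blocks(data, block_size, compute):
    """Ping-pong (double-buffered) processing: while the block in one
    buffer is being computed, the next block is staged into the other.
    On real hardware the staging transfer would overlap the compute."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    results = []
    if not blocks:
        return results
    buffers = [blocks[0], None]          # prime the first staging buffer
    for i in range(len(blocks)):
        if i + 1 < len(blocks):
            buffers[(i + 1) % 2] = blocks[i + 1]   # stage the next block
        results.extend(compute(x) for x in buffers[i % 2])  # "SIMD" work
    return results

print(process_blocks(list(range(6)), 2, lambda x: x * x))  # [0, 1, 4, 9, 16, 25]
```

In a real APU program the staging step would be an asynchronous block transfer issued before the compute kernel is launched, so transfer latency hides behind useful work.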
CHAPTER 3
REASONS FOR MERGING
The CPU and the GPU have been on this collision course for quite some time;
although we often refer to the CPU as a general purpose processor and the GPU as a
graphics processor, the reality is that they are both general purpose. The GPU is merely a highly parallel general-purpose processor, particularly well suited for applications such as 3D gaming. As the GPU became more programmable and thus more general purpose, its highly parallel nature became interesting to new classes of applications: things like scientific computing are now within the realm of possibility for
execution on a GPU.
Today's GPUs are vastly superior to what we currently call desktop CPUs when it comes to things like 3D gaming, video decoding and a lot of HPC applications. The problem is that a GPU is fairly worthless at sequential tasks, meaning that it relies on having a fast host CPU to handle everything other than what it's good at.
Figure 3: Amdahl's Law
ATI discovered that, in the long term, as the GPU grows in power it will eventually be bottlenecked by its ability to do high-speed sequential processing. In the same vein, the CPU will eventually be bottlenecked by its ability to do highly parallel processing. In other words, GPUs need CPUs and CPUs need GPUs for the workloads going forward. Neither approach alone will solve every problem or run every program optimally, but the combination of the two is what is necessary.
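The point Figure 3 makes can be stated concretely with Amdahl's Law: the overall speedup from parallel hardware is capped by the sequential fraction of the workload. A short illustrative calculation (the workload fractions below are hypothetical):

```python
def amdahl_speedup(parallel_fraction, n_units):
    """Overall speedup when parallel_fraction of the work runs on n_units
    parallel engines and the rest stays sequential (Amdahl's Law)."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_units)

# A workload that is 90% parallel is capped at 10x speedup no matter
# how many GPU-like engines are thrown at it:
print(round(amdahl_speedup(0.90, 100), 2))     # 9.17
print(round(amdahl_speedup(0.90, 10_000), 2))  # 9.99
```

This is exactly why each side needs the other: a fast sequential CPU shrinks the serial term, while the parallel engines shrink the parallel term.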
To understand the point of combining a highly sequential processor like modern-day desktop CPUs with a highly parallel GPU, you have to look above and beyond the gaming market, into what AMD is calling stream computing. AMD perceives a number of potential applications that will require a very GPU-like architecture to solve, things that we already see today. Simply watching an HD-DVD can eat up almost 100% of some of the fastest dual-core processors today, while a GPU can perform the same decoding task with much better power efficiency. H.264 encoding and decoding are perfect examples of tasks better suited for highly parallel processor architectures than those desktop CPUs are currently built on. But just as video processing is important, so are general productivity tasks, which is where we need the strengths of present-day out-of-order superscalar CPUs. A combined architecture that can excel at both types of applications is clearly a direction desktop CPUs need to target in order to remain
relevant in future applications, for consumers as well as in research.
Future applications will easily combine stream computing with more sequential tasks, and we already see some of that now with web browsers. Imagine browsing a site like YouTube, except where all of the content is much higher quality and requires far more CPU (or GPU) power to play. You need the strengths of a high-powered sequential processor to deal with everything other than the video playback, but you need the strengths of a GPU to actually handle the video. Examples like this one are overly simple, as it is very difficult to predict the direction software will take when given even more processing power; the point is that CPUs will inevitably have to merge with GPUs in order to handle these types of applications.
CHAPTER 4
MERGING CPUS AND GPUS
AMD views the APU progression as three discrete steps. Today we have a CPU and a GPU separated by an external bus, with the two being quite independent. The CPU does what it does best, and the GPU helps out wherever it can.
Step 1 is what AMD is calling integration, and it is what we can expect in the first Fusion product. The CPU and GPU are simply placed next to one another and there's minor leverage of that relationship, mostly from a cost and power efficiency standpoint.
Step 2, which AMD calls optimization, gets a bit more interesting. Parts of the CPU can be shared by the GPU and vice versa. There's not a deep level of integration, but it begins the transition to the most important step: exploitation.
The final step in the evolution of the APU is where the CPU and GPU are truly integrated, and the GPU is accessed by user-mode instructions just like the CPU. You can expect to talk to the GPU via extensions to the x86 ISA, and the GPU will have its own register file (much like the FP and integer units each have their own register files). Elements of the architecture will be shared, especially things like the cache hierarchy, which will prove useful when running applications that require both CPU and GPU power.
The GPU could easily be integrated onto a single die as a separate core behind a shared L3 cache. For example, if you look at the current Barcelona architecture you have four homogeneous cores behind a shared L3 cache and memory controller; simply swap one of those cores with a GPU core and you've got an idea of what one of these chips could look like. Instructions that can only be processed by the specialized core will be dispatched directly to it, while instructions better suited for other cores will be sent to them. There would have to be a bit of front-end logic to manage all of this, but it's easily done.
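The front-end logic described above can be caricatured as a routing table from instruction class to core type. This is purely an illustrative sketch (the class and core names are invented, not AMD's actual dispatch design):

```python
# Hypothetical instruction classes and core types; illustrative only,
# not AMD's actual dispatch design.
DISPATCH_TABLE = {
    "scalar_int": "x86_core",   # branchy, sequential work stays on the CPU
    "branch":     "x86_core",
    "vector_fp":  "gpu_core",   # wide data-parallel work goes to the SIMD core
    "texture":    "gpu_core",
}

def dispatch(instruction_class):
    """Route an instruction to the core best suited to execute it;
    anything unrecognized defaults to the general-purpose x86 core."""
    return DISPATCH_TABLE.get(instruction_class, "x86_core")

print(dispatch("vector_fp"))   # gpu_core
print(dispatch("scalar_int"))  # x86_core
```

The interesting engineering lies in the default path: everything the specialized core cannot handle must still run correctly on the x86 cores, which is why the table falls back to them.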
CHAPTER 5
APU IN CONSUMER ELECTRONICS
The potential of Fusion extends far beyond the PC space and into the embedded
space. If you can imagine a very low power, low profile Fusion CPU, you can easily see
it being used in not only PCs but consumer electronics devices as well. The benefit is that your CE devices could run the same applications as your PC devices, truly encouraging and enabling convergence and cohabitation between CE and PC devices.
Despite both sides attempting to point out how they are different, AMD and Intel
actually have very similar views on where the microprocessor industry is headed. Both
companies have stated to us that they have no desire to engage in the "core wars"; we won't see a race to keep adding cores. The explanation for why not is the same one that applied to the GHz race: if you scale exclusively in one direction (clock speed or number of cores), you will eventually run into the same power wall. The true path to performance is a combination of increasing instruction-level parallelism, clock speed, and number of cores in line with the demands of the software you're trying to run.
AMD has been a bit more forthcoming than Intel in this respect by indicating that
it doesn't believe that there's a clear sweet spot, at least for desktop CPUs. AMD doesn't
believe there's enough data to conclude whether 3, 4, 6 or 8 cores are the ideal number for
desktop processors. From our testing with Intel's V8 platform, an 8-core platform targeted at the high-end desktop, it is extremely difficult to find high-end desktop applications that can even benefit from 8 cores over 4. Our instincts tell us that for mainstream desktops, 3 - 4 general-purpose x86 cores appear to be the near-term target that makes sense. You could potentially lower the number of cores needed if you combine them with other specialized hardware (e.g. an H.264 encode/decode core).
What's particularly interesting is that many of the same goals Intel has for the
future of its x86 processors are in line with what AMD has planned. For the past couple
of IDFs Intel has been talking about bringing to market a < 0.5W x86 core that can be
used for devices that are somewhere in size and complexity between a cell phone and a UMPC (e.g. the iPhone). Intel has committed to delivering such a core, called Silverthorne, in 2008, based around a new micro-architecture designed for these ultra-low-power environments.
AMD confirmed that it too envisions ultra low power x86 cores for use in
consumer electronics devices, areas where ARM or other specialized cores are commonly used. AMD also recognizes that it can't address this market by simply reducing the clock speed of its current processors, and thus AMD mentioned that it is working on a separate micro-architecture to address these ultra-low-power markets. AMD didn't attribute any timeframe or roadmap to its plans, but knowing what we know about Fusion's debut, we'd expect a lower-power version targeted at the UMPC and CE markets to follow as early as possible.
Why even think about bringing x86 cores to CE devices like digital TVs or
smartphones? AMD offered one clear motivation: the software stack that will run on
these devices is going to get more complex. Applications on TVs, cell phones and other CE devices will get more complex to the point where they will require faster processors. Combine that with the fact that software developers don't want to target multiple processor architectures when they deliver software for these CE devices, and by using x86 as the common platform between CE and PC software you end up creating an entire environment where the same applications and content can be available across any device. The goal of PC/CE convergence is to allow users access to any content, on any device, anywhere; if all the devices on which you're trying to access content and programs happen to be x86, the process becomes much easier.
Why is a new core necessary? Although x86 can be applied to virtually any market segment, the range of usefulness of a particular core extends through only about an order of magnitude of power. For example, AMD's current desktop cores can easily be scaled up or down to hit TDPs in the 10W - 100W range, but they would not be good for hitting something in the sub-1W range. AMD can address the sub-1W market, but it will require a different core from the one it uses to address the rest of the market. This
philosophy is akin to what Intel discovered with Centrino; in order to succeed in the
mobile market, you need a mobile specific design. To succeed in the ultra mobile and
handtop markets, you need an ultra mobile/handtop specific processor design as well.
Both AMD and Intel realize this, and both companies have now publicly stated that they plan to address these new consumer requirements.
CHAPTER 6
NEW ERA OF SOFTWARE DEVELOPMENT
The GPU is ushering in a new age for software developers. That's because the GPU is no longer just about visualization or high-end graphics. Sure, those are important functions, but new software and applications will more fully leverage the latent capabilities of the GPU as it takes its place alongside the CPU as a powerful computational engine. This merging of CPU and GPU processing power, combined with the changing face of the Internet, promises to drive software to the next level of innovation.
As Wired Magazine boldly declared recently, the Internet isn't just about web browsing anymore; it's about instant communication and the applications and data to deliver video, photos and audio. The changing dynamics of the Internet are putting mobility at a premium and driving consumers increasingly into the market for the broadening range of mobile devices: smartphones, tablets, netbooks, and notebooks.
What better time for the emergence of the APU, a processor that will combine the power of the CPU and GPU onto a single chip in a small, power-saving format.
Software developers have already started to ask, "How do I embrace the new age of GPU and APU computing?" Luckily, AMD is in the trenches working with industry leaders on the tools and standards needed to help smooth the transition. As we've touched on in previous blog posts, AMD supports:
OpenCL: OpenCL is an open standard framework for writing parallel programs that execute across heterogeneous platforms consisting of CPUs, GPUs, and other processors. Notably, the standard enables applications to access the GPU for non-graphical computing and to balance computation between the CPU and GPU, making it the perfect development environment for the APU. We're seeing a lot of exciting innovation happening around OpenCL, such as MainConcept's new OpenCL H.264/AVC encoder. MainConcept offers a flexible and powerful software development kit so other software developers can easily add OpenCL-accelerated encoding to their own solutions. OpenCL is also helping to drive developments around more natural user interfaces, like touch, gesture, and object and facial recognition, as well as allowing developers to harness the power of the GPU for productivity in HD video conferencing and virus scanning.
Microsoft's DirectX: DirectX, Microsoft's Windows graphics technology, provides a collection of APIs that developers can use for handling tasks related to multimedia. It has been widely used by Windows developers for games and video applications, and is catching the attention of a larger group of developers by enabling code to be offloaded to the GPU. DirectX APIs include:
o D2D: a hardware-accelerated 2-D graphics API that provides high performance and high-quality rendering for 2-D geometry, bitmaps, and text. D2D drives your day-to-day software experience to a new level, particularly when it comes to online gaming and productivity applications. The next generation of web browsers is making use of D2D technology, including Microsoft's IE9 beta and Mozilla's Firefox 4 beta.
o DirectCompute: another DirectX API, DirectCompute provides programmers with a more flexible way to access the computational capability of GPUs that support DirectX 10 and DirectX 11. CyberLink's MediaShow 5 FaceMe technology, which is designed to quickly identify faces in photos, is optimized for Microsoft DirectX 11 DirectCompute.
OpenGL: OpenGL is another standard specification defining a cross-language, cross-platform API for writing applications that produce 2D and 3D computer graphics, such as content design software and high-end games. While OpenGL
isn't new, it does have noteworthy new functionality that simplifies porting between mobile and desktop platforms and increases interoperability with OpenCL. The recently released OpenGL 4.0 specification also includes updates to the OpenGL Shading Language that let developers better utilize GPU acceleration.
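Balancing computation between the CPU and GPU, as the OpenCL entry above describes, often comes down to splitting a data-parallel workload in proportion to each device's throughput. A minimal sketch of that idea (hypothetical throughput numbers, not a real OpenCL call):

```python
def split_work(n_items, cpu_rate, gpu_rate):
    """Partition a data-parallel job so both devices finish at about the
    same time: each device gets work in proportion to its throughput."""
    gpu_share = gpu_rate / (cpu_rate + gpu_rate)
    n_gpu = round(n_items * gpu_share)
    return n_items - n_gpu, n_gpu   # (cpu_items, gpu_items)

# A GPU with 4x the CPU's throughput should take 80% of the items:
print(split_work(1000, cpu_rate=1.0, gpu_rate=4.0))   # (200, 800)
```

Real schedulers measure device throughput at runtime rather than assuming it, but the proportional split is the core of the load-balancing decision.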
CHAPTER 7
PRACTICALITY
Although it's exciting to look at the new applications that will finally become practical in the Fusion era, the fact remains that most users will want their new APU-based systems to handle a mix of traditional applications for office productivity and Internet access, along with those exciting new apps. Fortunately, the changes AMD made to enable new APU-accelerated applications can also help existing applications run better as well.
Many of these improvements stem from AMD's ability to fit the CPU cores, GPU cores and north bridge (the part of the chip where the memory controller and PCI Express interfaces reside) onto a single piece of silicon. As noted earlier, this eliminates a chip-to-chip linkage that adds latency to memory operations and consumes power. It
takes less energy to move electrons across a chip than to move those same electrons
between two chips, and the power saved by this small change alone can help significantly increase system battery life. The co-location of all key elements on one chip also allows AMD to take a holistic approach to power management on these APUs. They can power various parts of the chip up and down depending on workloads, squeezing out a few milliwatts here and another few milliwatts there, which in the aggregate can amount to significant power savings.
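How a few milliwatts at a time add up can be sketched with a back-of-the-envelope battery-life model (all numbers below are hypothetical, chosen only to illustrate the aggregation):

```python
def battery_life_hours(battery_wh, base_power_w, gated_blocks):
    """Estimate battery life given per-block power gating.
    gated_blocks: list of (block_power_w, fraction_of_time_gated)."""
    saved_w = sum(power * gated for power, gated in gated_blocks)
    return battery_wh / (base_power_w - saved_w)

# 50 Wh battery, 10 W average platform draw; gating a 0.5 W block 80% of
# the time and a 0.3 W block 50% of the time saves only 0.55 W, yet it
# still buys roughly 17 extra minutes of battery life:
print(round(battery_life_hours(50.0, 10.0, []), 2))                        # 5.0
print(round(battery_life_hours(50.0, 10.0, [(0.5, 0.8), (0.3, 0.5)]), 2))  # 5.29
```

The smaller the platform's base power, the larger the relative payoff of each gated block, which is why this approach matters most for notebooks.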
Finally, some of the improvements can be attributed to the advanced GPU technology AMD embeds in its APU offerings. Although the company has yet to reveal the technical specs of these GPUs, it has disclosed that they will be DirectX 11-compliant. These will be the first APU-based systems that can support DirectX 11's enhanced visual experience without a discrete GPU, and thus will represent a cost-effective solution for systems developers.
CHAPTER 8
CONCLUSION
Since the days of the earliest personal computers, each major advance in system
capability has enabled innovative software developers to create new products that opened
new markets. The Apple II gave us VisiCalc, the first spreadsheet. The original IBM PC
led to Lotus 1-2-3, the first spreadsheet with graphics. The Macintosh ushered in an era of desktop publishing that has forever changed the way the world creates and distributes
information.
The dramatic increase in performance enabled by AMD Fusion technology can
create new opportunities for entrepreneurial developers to innovate and make the world a better and richer place. Along the way, they may enrich themselves as well. That's the
way the system is supposed to work.
More importantly, compared to today's mainstream offerings, APU-based platforms will possess prodigious amounts of computational horsepower. This processing power will allow developers to tackle problems that lie beyond the capabilities of today's mainstream systems, and will enable innovative developers to step up and update existing applications or invent new ones that take advantage of GPU acceleration. These features will be a standard part of every APU. Over time, even the most affordable PCs can be expected to have the computational performance of yesterday's million-dollar mainframes, with all-day battery life.
Of course, few users will want to run the same applications on tomorrow's notebooks that they ran on yesterday's mainframes and supercomputers. They will likely want to run applications that help them in their everyday lives, doing tasks they cannot accomplish on the systems they own today. They may want to use facial recognition software to sort their photos and videos, or even to help them identify people they meet on the street or actors they see in movies. They may want the on-screen appearance of the videos they stream to approach that of the HD content on their TVs, even when bandwidth constrains that content to a low-resolution format.
For the hardware developer, ODM or PC manufacturer, it's time to start thinking about how to incorporate these new APUs into product lines in order to enhance the consumer experience. Software developers should look to this new power to help their software run even better. All developers are encouraged to upgrade their skills and learn about OpenCL and DirectCompute, and to examine current software projects to see how they can be improved in a world where systems have dramatically more power. Because pretty soon, they will.
REFERENCES
Nathan Brookwood, "The Industry-Changing Impact of Accelerated Computing", fusion.amd.com
http://www.anandtech.com/show/2229
http://sites.amd.com/us/fusion/APU/Pages/fusion.aspx
http://www.dailytech.com/article.aspx?newsid=4696
http://http.developer.nvidia.com/GPUGems2/gpugems2_chapter34.html