
HSA-4138, HSAemu – A Full System Emulator for HSA Platform, by Yeh Ching Chung and Jiun-Hung Ding

Date post: 07-Jul-2015
Upload: amd-developer-central
Description:
Presentation HSA-4138, HSAemu – A Full System Emulator for HSA Platform, by Yeh Ching Chung and Jiun-Hung Ding at the AMD Developer Summit (APU13) November 11-13, 2013
Transcript
Page 1: HSA-4138, HSAemu – A Full System Emulator for HSA Platform, by Yeh Ching Chung and Jiun-Hung Ding

HSAemu – A Full System Emulator for HSA Platform

Prof. Yeh-Ching Chung

System Software Laboratory
Department of Computer Science
National Tsing Hua University

National Tsing Hua University ® copyright OIA National Tsing Hua University

Page 2: HSA-4138, HSAemu – A Full System Emulator for HSA Platform, by Yeh Ching Chung and Jiun-Hung Ding

Outline

– Introduction to HSA
– Design of HSAemu
– Performance Evaluation
– Conclusions and Future Work

Page 3: HSA-4138, HSAemu – A Full System Emulator for HSA Platform, by Yeh Ching Chung and Jiun-Hung Ding

Introduction to HSA 

HSA is an industry standard to define next-generation hardware/software architecture for heterogeneous computing

Page 4: HSA-4138, HSAemu – A Full System Emulator for HSA Platform, by Yeh Ching Chung and Jiun-Hung Ding

Hardware Platform of HSA


Page 5: HSA-4138, HSAemu – A Full System Emulator for HSA Platform, by Yeh Ching Chung and Jiun-Hung Ding

Simplified HSA Software Stack

[Figure: layered diagram of the simplified HSA software stack. Layer labels: Application SW, HSA Software, Drivers, Differentiated HW. Box labels: Application; Domain Specific Libs (Bolt, OpenCV™, … many others); Renderscript/OpenCL Runtime; OpenGL-ES Runtime; Other Runtime; HSA Runtime; HSAIL; HSA Finalizer; GPU ISA; Legacy Driver; Kernel Driver; CPU(s); GPU(s); Other Accelerators.]

Page 6: HSA-4138, HSAemu – A Full System Emulator for HSA Platform, by Yeh Ching Chung and Jiun-Hung Ding

Specification of Simple HSA Platform

Hardware
– Memory
  • Shared Virtual Memory (hUMA)
  • Cache Coherency Domains
  • Memory-Based Signaling and Synchronization for CPU and GPU
– Task Control
  • Architected Queuing Language (AQL)
  • Efficient Syscall Infrastructure
  • Preemptive Context Switching
– Debugging Infrastructure
  • Allow system software to set Instruction/Memory/Conditional, etc., breakpoints
– Exception Handling
  • GPU trap handler to trigger GPU interrupt for GPU exception

Software
– HSA Runtime APIs
  • Initialization of HSA components
  • Topology discovery
  • Manage AQL packets
  • Dispatch application tasks
  • Signal HW and wait for result
  • Recycle available resources
– User Mode Queue
  • Store AQL packets
– Virtual ISA - HSAIL
  • A low level instruction set designed for parallel computing

Page 7: HSA-4138, HSAemu – A Full System Emulator for HSA Platform, by Yeh Ching Chung and Jiun-Hung Ding

What Is HSAemu

HSAemu is a full system emulator that supports the following HSA features
– Shared virtual memory between CPU and GPU
– Memory based signaling and synchronization
– Multiple user level command queues
– Preemptive GPU context switching
– Concurrent execution of CPU threads and GPU threads
– HSA runtime
– Finalizer

A Project Sponsored by MediaTek (MTK)

Currently, it supports simple HSA platform simulation
– Functional-accurate simulation
– Cycle-accurate simulation

Page 8: HSA-4138, HSAemu – A Full System Emulator for HSA Platform, by Yeh Ching Chung and Jiun-Hung Ding

Architecture of HSAemu

HSAemu consists of 6 components
– HSA Runtime
– CPU Simulation Module
– GPU Task Dispatcher
– Functional-Accurate GPU Simulator (Fast-GPU Simulator)
– Cycle-Accurate GPU Simulator (Multi2Sim)
– GPU Helper Functions

Page 9: HSA-4138, HSAemu – A Full System Emulator for HSA Platform, by Yeh Ching Chung and Jiun-Hung Ding

HSAemu Runtime

User Mode Queue
– Store AQL packets

AQL Queue Manager
– Manage AQL packets in User Mode Queue

AQL Command Dispatcher
– Launch the execution of kernel jobs on HSAemu

Support OpenCL runtime
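The relationship between the User Mode Queue and its two users can be sketched as a small ring buffer. This is only an illustration under stated assumptions: the `AqlPacket` fields, the `UserModeQueue` class, and the index scheme are hypothetical and do not reproduce the real AQL packet layout or the HSAemu runtime code.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AqlPacket:
    """Hypothetical, simplified dispatch packet (not the real AQL layout)."""
    kernel_name: str
    grid_size: int
    workgroup_size: int


@dataclass
class UserModeQueue:
    """Fixed-size ring buffer that stores AQL packets (assumed design)."""
    capacity: int
    write_index: int = 0  # total number of packets ever enqueued
    read_index: int = 0   # total number of packets ever dequeued
    slots: List[Optional[AqlPacket]] = field(default_factory=list)

    def __post_init__(self) -> None:
        self.slots = [None] * self.capacity

    def enqueue(self, packet: AqlPacket) -> bool:
        """Queue-manager side: store a packet, or report that the queue is full."""
        if self.write_index - self.read_index == self.capacity:
            return False
        self.slots[self.write_index % self.capacity] = packet
        self.write_index += 1
        return True

    def dequeue(self) -> Optional[AqlPacket]:
        """Dispatcher side: take the oldest packet, or None if the queue is empty."""
        if self.read_index == self.write_index:
            return None
        slot = self.read_index % self.capacity
        packet = self.slots[slot]
        self.slots[slot] = None
        self.read_index += 1
        return packet


queue = UserModeQueue(capacity=2)
queue.enqueue(AqlPacket("vector_add", grid_size=1024, workgroup_size=64))
queue.enqueue(AqlPacket("reduce", grid_size=256, workgroup_size=64))
queue_was_full = not queue.enqueue(AqlPacket("fft", grid_size=512, workgroup_size=64))
first = queue.dequeue()
```

Keeping monotonically increasing read and write indices, and taking them modulo the capacity only when addressing a slot, makes the full and empty cases easy to tell apart.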

Page 10: HSA-4138, HSAemu – A Full System Emulator for HSA Platform, by Yeh Ching Chung and Jiun-Hung Ding

CPU Simulation Module (1)

PQEMU
– Perform multicore CPU simulation

HSA Signal Handler
– Receive AQL command from HSA Runtime and launch GPU simulation

Page 11: HSA-4138, HSAemu – A Full System Emulator for HSA Platform, by Yeh Ching Chung and Jiun-Hung Ding

CPU Simulation Module (2)

PQEMU
– A parallel system emulator based on QEMU
– Two efficient synchronization models (UCC/SCC)
– Dynamic binary translation (DBT) technique
– A project sponsored by MTK

Agent code, HSA runtime, and operating system are run on PQEMU

[Figure: Unified Code Cache (UCC) Model. Each CPU has its own DBT; all DBTs share one Code Cache.]

“PQEMU: A Parallel System Emulator Based on QEMU” (ICPADS 2011)

Page 12: HSA-4138, HSAemu – A Full System Emulator for HSA Platform, by Yeh Ching Chung and Jiun-Hung Ding

GPU Task Dispatcher (1)

AQL Command Monitor
– Receive signal from HSA Signal Handler
– Copy AQL packets from User Mode Queue to HW AQL Queue
– Launch AQL Packet Worker

AQL Packet Worker
– Dequeue AQL packets from HW AQL Queue
– Parse AQL packet
– Dispatch kernel jobs to Fast-GPU Simulator or M2S-GPU Simulator according to the kernel information
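The two roles above can be sketched as a pair of functions. This is a minimal sketch, not HSAemu code: the packet is a plain dictionary, the `cycle_accurate` flag is a hypothetical stand-in for the "kernel information" that selects a simulator, and both simulators are stubs that only return a label.

```python
from collections import deque
from typing import Deque, Dict, List


def fast_gpu_simulator(packet: Dict) -> str:
    """Stub for the functional-accurate simulator."""
    return f"fast:{packet['kernel']}"


def m2s_gpu_simulator(packet: Dict) -> str:
    """Stub for the cycle-accurate simulator."""
    return f"m2s:{packet['kernel']}"


def aql_command_monitor(user_mode_queue: Deque[Dict], hw_aql_queue: Deque[Dict]) -> None:
    """Copy every pending packet from the user mode queue to the HW AQL queue."""
    while user_mode_queue:
        hw_aql_queue.append(user_mode_queue.popleft())


def aql_packet_worker(hw_aql_queue: Deque[Dict]) -> List[str]:
    """Dequeue packets, parse them, and dispatch each kernel job to one simulator."""
    results = []
    while hw_aql_queue:
        packet = hw_aql_queue.popleft()
        # Hypothetical field: the slides only say "according to the kernel information".
        simulate = m2s_gpu_simulator if packet["cycle_accurate"] else fast_gpu_simulator
        results.append(simulate(packet))
    return results


user_mode_queue = deque([
    {"kernel": "nbody", "cycle_accurate": False},
    {"kernel": "reduction", "cycle_accurate": True},
])
hw_aql_queue: Deque[Dict] = deque()

aql_command_monitor(user_mode_queue, hw_aql_queue)  # triggered by the signal handler
dispatched = aql_packet_worker(hw_aql_queue)
```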

Page 13: HSA-4138, HSAemu – A Full System Emulator for HSA Platform, by Yeh Ching Chung and Jiun-Hung Ding

GPU Task Dispatcher (2)

Execution Flow 


Page 14: HSA-4138, HSAemu – A Full System Emulator for HSA Platform, by Yeh Ching Chung and Jiun-Hung Ding

GPU Task Dispatcher (3)

Signal from HSA Signal Handler

Page 15: HSA-4138, HSAemu – A Full System Emulator for HSA Platform, by Yeh Ching Chung and Jiun-Hung Ding

GPU Task Dispatcher (4)

Copy AQL packets from User Mode Queue

Page 16: HSA-4138, HSAemu – A Full System Emulator for HSA Platform, by Yeh Ching Chung and Jiun-Hung Ding

GPU Task Dispatcher (5)

Ask AQL Packet Worker to parse AQL Packet

Page 17: HSA-4138, HSAemu – A Full System Emulator for HSA Platform, by Yeh Ching Chung and Jiun-Hung Ding

GPU Task Dispatcher (6)

Launch Fast-GPU Simulator

Page 18: HSA-4138, HSAemu – A Full System Emulator for HSA Platform, by Yeh Ching Chung and Jiun-Hung Ding

GPU Task Dispatcher (7)

Launch M2S-GPU Simulation

Page 19: HSA-4138, HSAemu – A Full System Emulator for HSA Platform, by Yeh Ching Chung and Jiun-Hung Ding

Fast‐GPU Simulator

A functional-accurate simulator for generic GPU model simulation
– HSAIL Translator
  • Act as a Finalizer
  • Use static binary translation technique to translate BRIG file to host executable binary file (x86) based on LLVM
  • Host SSE instruction optimization
– GPU Thread Scheduler
  • Simulate a generic GPU model

Page 20: HSA-4138, HSAemu – A Full System Emulator for HSA Platform, by Yeh Ching Chung and Jiun-Hung Ding

HSAIL Translator (1)

Architecture


Page 21: HSA-4138, HSAemu – A Full System Emulator for HSA Platform, by Yeh Ching Chung and Jiun-Hung Ding

HSAIL Translator (2)

Launch LLVM HSAIL Translator

Page 22: HSA-4138, HSAemu – A Full System Emulator for HSA Platform, by Yeh Ching Chung and Jiun-Hung Ding

HSAIL Translator (3)

Construct Control Flow Graph of HSAIL

Page 23: HSA-4138, HSAemu – A Full System Emulator for HSA Platform, by Yeh Ching Chung and Jiun-Hung Ding

HSAIL Translator (4)

Translate HSAIL to LLVM IR


Page 24: HSA-4138, HSAemu – A Full System Emulator for HSA Platform, by Yeh Ching Chung and Jiun-Hung Ding

HSAIL Translator (5)

Translate LLVM IR to Host Executable Object File

Page 25: HSA-4138, HSAemu – A Full System Emulator for HSA Platform, by Yeh Ching Chung and Jiun-Hung Ding

HSAIL Translator (6)

Load Host Executable Object File to memory

Page 26: HSA-4138, HSAemu – A Full System Emulator for HSA Platform, by Yeh Ching Chung and Jiun-Hung Ding

HSAIL Translator (7)

Link to GPU Helper Functions


Page 27: HSA-4138, HSAemu – A Full System Emulator for HSA Platform, by Yeh Ching Chung and Jiun-Hung Ding

HSAIL Translator (8)

Store the translation result to GPU Code Cache
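The translator steps on pages 21 to 27 form a pipeline that ends in the GPU Code Cache. The sketch below strings hypothetical stand-ins for those steps together and caches the result by kernel name, on the assumption that a cached kernel does not need to be translated again. Every function here is a stub that tags a string; none of them calls LLVM or reflects the actual HSAemu interfaces.

```python
from typing import Callable, Dict


def build_control_flow_graph(brig: bytes) -> str:
    """Stand-in for constructing the control flow graph of the HSAIL kernel."""
    return f"cfg[{brig.decode()}]"


def translate_to_llvm_ir(cfg: str) -> str:
    """Stand-in for translating HSAIL to LLVM IR."""
    return f"ir[{cfg}]"


def compile_to_host_object(llvm_ir: str) -> str:
    """Stand-in for translating LLVM IR to a host executable object file."""
    return f"obj[{llvm_ir}]"


def load_and_link(host_object: str) -> Callable[[], str]:
    """Stand-in for loading the object file and linking GPU helper functions."""
    return lambda: f"ran {host_object}"


class GpuCodeCache:
    """Maps a kernel name to its translated, callable form (assumed keying)."""

    def __init__(self) -> None:
        self.entries: Dict[str, Callable[[], str]] = {}
        self.translations = 0  # counts how often the full pipeline ran

    def get_kernel(self, name: str, brig: bytes) -> Callable[[], str]:
        if name not in self.entries:
            cfg = build_control_flow_graph(brig)
            llvm_ir = translate_to_llvm_ir(cfg)
            host_object = compile_to_host_object(llvm_ir)
            self.entries[name] = load_and_link(host_object)
            self.translations += 1
        return self.entries[name]


cache = GpuCodeCache()
first_lookup = cache.get_kernel("foo", b"brig-foo")
second_lookup = cache.get_kernel("foo", b"brig-foo")  # served from the cache
```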

Page 28: HSA-4138, HSAemu – A Full System Emulator for HSA Platform, by Yeh Ching Chung and Jiun-Hung Ding

HSAIL Translator (2)

Host SSE instruction Optimization
– Reconstruct the control flow graph of kernel function
– Use bitmap masking and packing/unpacking algorithms to generate host SSE instructions

Page 29: HSA-4138, HSAemu – A Full System Emulator for HSA Platform, by Yeh Ching Chung and Jiun-Hung Ding

HSAIL Translator (3)

Example : The control flow graph for kernel function $foo


Page 30: HSA-4138, HSAemu – A Full System Emulator for HSA Platform, by Yeh Ching Chung and Jiun-Hung Ding

HSAIL Translator (4)

Reconstruct the control flow graph by depth-first traversal

Perform bitmap masking and packing & unpacking algorithms
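The slides do not spell out the masking and packing algorithms, so the following is one common interpretation rather than the HSAemu implementation. Work-item values are packed into fixed-width vectors, both sides of a branch are computed for every lane, and a bitmap selects the result per lane. The lane width of 4 (four 32-bit values in a 128-bit SSE register), the example kernel, and all function names are assumptions.

```python
from typing import List

LANES = 4  # assumed: four 32-bit values per 128-bit SSE register


def pack(values: List[int]) -> List[List[int]]:
    """Group per-work-item values into LANES-wide vectors."""
    return [values[i:i + LANES] for i in range(0, len(values), LANES)]


def unpack(vectors: List[List[int]]) -> List[int]:
    """Flatten vectors back into one value per work item."""
    return [value for vector in vectors for value in vector]


def masked_select(mask: int, if_true: List[int], if_false: List[int]) -> List[int]:
    """Pick if_true[lane] where bit `lane` of the bitmap is set, else if_false[lane]."""
    return [t if (mask >> lane) & 1 else f
            for lane, (t, f) in enumerate(zip(if_true, if_false))]


def kernel_scalar(x: int) -> int:
    """Reference kernel with a divergent branch, one work item at a time."""
    if x % 2 == 0:
        return x * 2
    return x + 100


def kernel_vector(x: List[int]) -> List[int]:
    """Same kernel on one vector: run both branch sides, then blend by bitmap."""
    mask = 0
    for lane, value in enumerate(x):
        if value % 2 == 0:
            mask |= 1 << lane  # lane takes the "then" side
    then_side = [value * 2 for value in x]
    else_side = [value + 100 for value in x]
    return masked_select(mask, then_side, else_side)


inputs = list(range(8))
outputs = unpack([kernel_vector(vector) for vector in pack(inputs)])
```

The vectorized result matches running `kernel_scalar` on each input, which is the property a translator relying on masking has to preserve.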

Page 31: HSA-4138, HSAemu – A Full System Emulator for HSA Platform, by Yeh Ching Chung and Jiun-Hung Ding

GPU Thread Scheduler

Simulate a generic GPU model
– GPU Thread Scheduler assigns work groups to free CU threads in the GPU Thread Pool
– Each CU thread executes all work items in a work group
– The maximum number of CU threads is limited by host operating system
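The scheduling scheme can be approximated with a standard thread pool: each submitted task is one work group, and the worker that picks it up runs all of its work items in sequence. This is a sketch under assumptions, not the HSAemu scheduler. `launch` and `run_work_group` are hypothetical names, and the grid size is assumed to be a multiple of the work-group size.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional


def run_work_group(kernel: Callable[[int], int], group_id: int,
                   group_size: int, out: List[Optional[int]]) -> None:
    """One CU thread executes every work item of its work group in order."""
    for local_id in range(group_size):
        global_id = group_id * group_size + local_id
        out[global_id] = kernel(global_id)


def launch(kernel: Callable[[int], int], grid_size: int,
           group_size: int, cu_threads: int) -> List[Optional[int]]:
    """Assign work groups to free CU threads in a pool of `cu_threads` workers."""
    out: List[Optional[int]] = [None] * grid_size
    num_groups = grid_size // group_size  # assumes grid_size % group_size == 0
    with ThreadPoolExecutor(max_workers=cu_threads) as pool:
        futures = [pool.submit(run_work_group, kernel, group_id, group_size, out)
                   for group_id in range(num_groups)]
        for future in futures:
            future.result()  # re-raise any exception from a CU thread
    return out


squares = launch(lambda i: i * i, grid_size=16, group_size=4, cu_threads=2)
```

Each work group writes to a disjoint slice of `out`, so the workers need no locking in this sketch.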

Page 32: HSA-4138, HSAemu – A Full System Emulator for HSA Platform, by Yeh Ching Chung and Jiun-Hung Ding

M2S‐GPU Simulator (1)

A cycle-accurate simulator for AMD Southern Islands GPU model simulation
– HSAIL Translator
  • Translate BRIG file to GPU binary
– M2S Bridge
  • Bridge Multi2Sim GPU Model to HSAemu
– M2S GPU Module
  • Simulate a cycle-accurate GPU model

Page 33: HSA-4138, HSAemu – A Full System Emulator for HSA Platform, by Yeh Ching Chung and Jiun-Hung Ding

M2S‐GPU Simulator (2)

HSAIL Translator
– Act as a Finalizer
– Translate HSAIL to AMD Southern Islands GPU binary
– Use static binary translation technique based on LLVM

Page 34: HSA-4138, HSAemu – A Full System Emulator for HSA Platform, by Yeh Ching Chung and Jiun-Hung Ding

M2S‐GPU Simulator (3)

M2S Bridge: An interface to launch M2S GPU Module
– Initialize the data structures used by AMD Southern Islands GPU, including a memory register for AMD Southern Islands GPU to access the shared system memory in HSAemu
– Invoke M2S GPU Module (the AMD Southern Islands GPU module in Multi2Sim)

Page 35: HSA-4138, HSAemu – A Full System Emulator for HSA Platform, by Yeh Ching Chung and Jiun-Hung Ding

M2S‐GPU Simulator (4)

M2S GPU Module
– A cycle-accurate AMD Southern Islands GPU simulator in Multi2Sim

Memory access is performed by HSAemu memory helper function to comply with the hUMA model

Page 36: HSA-4138, HSAemu – A Full System Emulator for HSA Platform, by Yeh Ching Chung and Jiun-Hung Ding

GPU Helper Functions (1)

Memory Helper Function
– A soft-mmu of GPU with a page table walker and a TLB to enable hUMA model
– Support the redirect access of a local segment memory to a non-shared private memory in GPU

Kernel Information Helper Function
– Collect and return information of GPU simulation and current execution state
– Retrieve kernel information such as work item ID, work group size, etc., from AQL packet

Page 37: HSA-4138, HSAemu – A Full System Emulator for HSA Platform, by Yeh Ching Chung and Jiun-Hung Ding

GPU Helper Functions (2)

Mathematical Helper Function
– Simulate special mathematical instructions such as trigonometric instructions by calling the corresponding mathematical functions in standard library

Synchronization Helper Function
– Barrier synchronization implementation for generic GPU model simulation
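The mathematical helper idea amounts to a lookup from a GPU instruction to a host library function. Below is a minimal sketch of that idea; the opcode strings are hypothetical placeholders rather than actual HSAIL mnemonics, and Python's `math` module stands in for the host standard library.

```python
import math
from typing import Callable, Dict

# Hypothetical opcode names mapped to host standard-library functions.
MATH_HELPERS: Dict[str, Callable[[float], float]] = {
    "sin": math.sin,
    "cos": math.cos,
    "sqrt": math.sqrt,
}


def math_helper(opcode: str, operand: float) -> float:
    """Simulate one special math instruction by calling the host function."""
    try:
        host_function = MATH_HELPERS[opcode]
    except KeyError:
        raise ValueError(f"no helper registered for opcode {opcode!r}") from None
    return host_function(operand)


cos_of_zero = math_helper("cos", 0.0)
root_of_nine = math_helper("sqrt", 9.0)
```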

Page 38: HSA-4138, HSAemu – A Full System Emulator for HSA Platform, by Yeh Ching Chung and Jiun-Hung Ding

hUMA Model in HSAemu

Unified coherent address space
– GPU can access a virtual memory page allocated by CPU

Soft-mmu is simulated for GPU
– TLB hit/miss events can be traced

Memory segment access
– Global memory segment access is handled by memory helper function
– Group memory segment access is handled by host ld/st instructions
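A software MMU with traceable TLB events can be sketched in a few lines. This is an illustration under stated assumptions, not the HSAemu memory helper: the page size of 4096 bytes, the single-level dictionary page table, the FIFO replacement policy, and the `SoftMmu` name are all hypothetical choices.

```python
from typing import Dict

PAGE_SIZE = 4096  # assumed page size


class SoftMmu:
    """Translate GPU virtual addresses using a small TLB in front of a page table."""

    def __init__(self, page_table: Dict[int, int], tlb_entries: int = 2) -> None:
        self.page_table = page_table   # virtual page number -> physical frame number
        self.tlb: Dict[int, int] = {}  # dict keeps insertion order, used for FIFO
        self.tlb_entries = tlb_entries
        self.hits = 0
        self.misses = 0

    def translate(self, virtual_address: int) -> int:
        page, offset = divmod(virtual_address, PAGE_SIZE)
        if page in self.tlb:
            self.hits += 1
        else:
            self.misses += 1
            frame = self.page_table[page]  # page table walk; KeyError models a fault
            if len(self.tlb) >= self.tlb_entries:
                self.tlb.pop(next(iter(self.tlb)))  # evict the oldest entry
            self.tlb[page] = frame
        return self.tlb[page] * PAGE_SIZE + offset


# The CPU side is assumed to have mapped virtual pages 0 and 1 already.
mmu = SoftMmu(page_table={0: 5, 1: 9})
physical = [mmu.translate(address) for address in (0x10, 0x20, 0x1004)]
```

The `hits` and `misses` counters are where a tracing hook would attach in this sketch.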

Page 39: HSA-4138, HSAemu – A Full System Emulator for HSA Platform, by Yeh Ching Chung and Jiun-Hung Ding

Recall: Hardware Simulation of HSAemu

HSA hardware components simulated
– Multicore CPU: A parallel multicore CPU model simulation
– Functional-Accurate GPU: A generic GPU model simulation
– Cycle-Accurate GPU: AMD Southern Islands GPU model simulation
– hUMA: A unified address space between CPU and GPU simulation
– Synchronization Primitive: Barrier instruction simulation
– Hardware AQL Queue: A HW dispatch queue for GPU simulation

Page 40: HSA-4138, HSAemu – A Full System Emulator for HSA Platform, by Yeh Ching Chung and Jiun-Hung Ding

Recall: Software Utilities of HSAemu

HSA software utilities designed
– HSA Runtime: HSA runtime library (OpenCL runtime)
– Topology Discovery: Discover the current platform topology
– User Mode Queue: A queue for each user application
– Signal Event: Notify GPU to work
– HSAIL Generator: A PTX to HSAIL source level translator
– BRIG Generator: Generate a binary format from a Kernel file
– HSAIL Translator: Translate HSAIL to host executable binary
– GPU Code Cache: Store translated host binaries

Page 41: HSA-4138, HSAemu – A Full System Emulator for HSA Platform, by Yeh Ching Chung and Jiun-Hung Ding

Performance Evaluation

Experimental Environment

Benchmarks:
– Nearest Neighbor (NN), K-Means, FFT, FWT, N-Body
– Binary Search, Bitonic Sort, Reduction, FWT

Page 42: HSA-4138, HSAemu – A Full System Emulator for HSA Platform, by Yeh Ching Chung and Jiun-Hung Ding

Scalability of Fast‐GPU Simulator

Comparison of NN, K-means and FWT benchmarks on 32 physical cores

The speedup is scalable when # of CU threads < # of host physical cores

Page 43: HSA-4138, HSAemu – A Full System Emulator for HSA Platform, by Yeh Ching Chung and Jiun-Hung Ding

SSE Optimization of Fast‐GPU Simulator

Performance comparison of FFT with SSE optimization turned on and off

Page 44: HSA-4138, HSAemu – A Full System Emulator for HSA Platform, by Yeh Ching Chung and Jiun-Hung Ding

N‐Body Simulation by Fast‐GPU Simulator

N‐Body Simulation 

All of host physical CPUs are running


Page 45: HSA-4138, HSAemu – A Full System Emulator for HSA Platform, by Yeh Ching Chung and Jiun-Hung Ding

Comparison of HSAemu and Multi2Sim

Fast-GPU Sim > M2S-GPU sim > Multi2Sim

[Figure: bar chart of benchmark speedup; series: multi2sim, HSAemu, Hybrid. Data labels from the chart:]

            BinarySearch  BitonicSort  FastWalshTransform  Reduction
multi2sim   1             1            1                   1
HSAemu      2.931317      18.88827     8.645516            6.294213
Hybrid      2.873768      0.921835     2.407809            2.105663

Page 46: HSA-4138, HSAemu – A Full System Emulator for HSA Platform, by Yeh Ching Chung and Jiun-Hung Ding

Conclusions 

An HSA-compliant full system emulator has been implemented
– A functional-accurate simulator for generic GPU model
– A cycle-accurate simulator for AMD Southern Islands GPU model (from Multi2Sim)

The HSAIL Translator acts as a finalizer that enables the integration of HSAemu with existing simulators, for example, Multi2Sim

Open source – Nov. 12, 2013
– http://hsaemu.org/

Page 47: HSA-4138, HSAemu – A Full System Emulator for HSA Platform, by Yeh Ching Chung and Jiun-Hung Ding

Future work

Enhance HSAemu by implementing more HSA features

Integrate HSAemu with some existing cycle-accurate GPU simulators

Design a cycle-accurate simulator based on PQEMU for generic CPU model

Design a cycle-accurate simulator based on PQEMU for big.LITTLE CPU model

Page 48: HSA-4138, HSAemu – A Full System Emulator for HSA Platform, by Yeh Ching Chung and Jiun-Hung Ding

Q & A

