Post on 14-May-2020
transcript
Morpheus: Creating Application Objects Efficiently for Heterogeneous Computing
Hung-Wei Tseng, Qianchen Zhao, Yuxiao Zhou, Mark Gahagan, Steven Swanson
Department of Computer Science and Engineering, University of California, San Diego
Non-volatile Systems Laboratory (NVSL)
Applications interact with files
How we process files today
[Figure: SSD, DRAM, CPU, and GPU. A file stores the text “12345678”; the application object holds it as the binary value 0xBC614E.]
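The conversion in this slide, from the text “12345678” to the binary value 0xBC614E, is the kind of work that object creation involves. The following is a minimal sketch of that single step on the host, assuming a decimal text field and a 32-bit target; the function name `parse_field` is hypothetical and is not part of Morpheus.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: convert a decimal text field into a 32-bit value.
 * Returns 0 on success, -1 if the text is not a plain number in range. */
int parse_field(const char *text, uint32_t *out)
{
    char *end = NULL;

    assert(text != NULL && out != NULL);
    /* Reject empty input, signs, and leading whitespace up front. */
    if (text[0] < '0' || text[0] > '9')
        return -1;

    errno = 0;
    unsigned long v = strtoul(text, &end, 10);
    /* Fail on trailing characters or on values that do not fit. */
    if (*end != '\0' || errno == ERANGE || v > UINT32_MAX)
        return -1;

    *out = (uint32_t)v;
    return 0;
}
```

Calling `parse_field("12345678", &v)` stores 0xBC614E in `v`. A real input file repeats this step for every field, which is why the cost grows with the size of the file.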
The conventional model
[Figure: timeline across CPU/APU, DRAM, SSD, and GPU. Steps: retrieve file, parse data and create objects, compute kernel.]
Creating objects generates traffic on the CPU-memory bus and incurs system overhead
Overhead of creating objects
[Figure: percentage of execution time (axis 0 to 1.0) for GPU-accelerated applications (PageRank, CC, bfs, gaussian, hybridsort, kmeans, lud, nn, srad, JASPA, average), broken down into object creation, other CPU computation, GPU, and moving data to GPU. Annotated value: 64%.]
Creating objects is now the bottleneck in applications
High-speed storage doesn’t help
[Figure: throughput of parsing input data (MB/sec, axis 0 to 90) for GPU-accelerated applications (PageRank, CC, bfs, gaussian, hybridsort, kmeans, lud, nn, srad, JASPA, average) on SSD, RamDrive, and HDD.]
Very little difference among different storage technologies
Preventing P2P communication between peripherals
[Figure: SSD, DRAM, CPU, and GPU with application objects. Two paths are marked: the desired data path and the real data path in the current model.]
P2P is useless since we need the CPU to create application objects
We need to rethink the processing model!
Morpheus
Outline
• The Morpheus model
• The system architecture
• Experimental results
• Conclusion
Morpheus: Creating application objects in SSDs
[Figure: GPU, SSD with an SSD processor, DRAM, and CPU, with application objects shown at several points in the system.]
The Morpheus model
[Figure: timeline across CPU/APU, DRAM, SSD, and GPU. Steps: StorageApp, retrieve objects, compute kernel.]
Benefits of Morpheus
• Bypass system overheads
• Allow applications to take advantage of P2P data communication
• Reduce traffic over system interconnects
• Lower energy consumption
Implementing the Morpheus model
[Figure: system stack with layers labeled Operating System and Hardware. Components: Application, Morpheus runtime, GPU runtime, Morpheus-NVMe driver, NVMe-P2P, PCIe interconnect, Morpheus-SSD, and GPU.]
Morpheus-NVMe
• NVMe: an interface that defines how the host computer should interact with non-volatile memory devices
• Morpheus-NVMe extensions
  • MInit: installs and prepares the execution of a StorageApp
  • MRead: reads data and applies the StorageApp to the data being read
  • MWrite: writes data and applies the StorageApp to the data being written
  • MDeinit: completes and releases the StorageApp
Morpheus-SSD
[Figure: Morpheus-SSD architecture. Components: PCIe/NVMe interface, embedded cores, accelerators, in-storage interconnect, DRAM controller, SSD DRAM (DDR3/DDR4), flash interface, DMA engine, and flash memory. Annotations: managing Morpheus-NVMe commands; executing StorageApps.]
NVMe-P2P
• Map GPU device memory to a PCIe BAR using AMD DirectGMA or NVIDIA GPUDirect
• Generate Morpheus-NVMe commands that use GPU memory addresses as the DMA targets
• Morpheus directly pulls/pushes data from/to GPU addresses, without going through main memory
[Figure: GPU and application objects.]
Creating a StorageApp
• Use C to compose a StorageApp
• Use the Morpheus-SSD library to access SSD resources
• The compiler generates machine code that the embedded processors can execute
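StorageApp
int inputApplet(ms_stream ssd_input_stream, void *edge_array)
{
    Edge ssd_edge_array[4096];  /* staging buffer for parsed edges */
    int i = 0;

    /* Parse one pair of integers per iteration from the input stream. */
    while (ms_scanf(ssd_input_stream, "%d %d",
                    &ssd_edge_array[i % 4096].first,
                    &ssd_edge_array[i % 4096].second) == 2) {
        i++;
        if (i % 4096 == 0) {
            /* Buffer full: copy 4096 edges to the destination. */
            ms_memcpy(edge_array, ssd_edge_array, sizeof(Edge) * 4096);
            /* Cast before advancing: arithmetic on void * is not standard C. */
            edge_array = (char *)edge_array + sizeof(Edge) * 4096;
        }
    }
    /* Copy the remaining partial buffer. */
    ms_memcpy(edge_array, ssd_edge_array, sizeof(Edge) * (i % 4096));
    return i;
}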
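The StorageApp above parses into a fixed-size buffer and copies it out whenever it fills. To try that batching pattern without Morpheus hardware, the sketch below reproduces it on the host with standard C calls standing in for the Morpheus-SSD library: `sscanf` replaces `ms_scanf` and `memcpy` replaces `ms_memcpy`. The `Edge` layout (two `int` fields) is an assumption inferred from the `"%d %d"` format, and `BATCH` is shrunk from 4096 to 4 so the flush path is easy to exercise.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define BATCH 4  /* small stand-in for the 4096-entry buffer */

/* Assumed layout: the slide only shows the .first and .second fields. */
typedef struct { int first; int second; } Edge;

/* Parse integer pairs from `text` into `dst`, copying BATCH edges at a time.
 * Returns the number of edges parsed; `dst` must have room for all of them. */
int parse_edges(const char *text, Edge *dst)
{
    Edge buf[BATCH];
    int i = 0, used = 0;

    assert(text != NULL && dst != NULL);
    /* %n records how many characters were consumed so we can advance. */
    while (sscanf(text, "%d %d%n", &buf[i % BATCH].first,
                  &buf[i % BATCH].second, &used) == 2) {
        text += used;
        i++;
        if (i % BATCH == 0) {
            /* Buffer full: flush one batch to the destination. */
            memcpy(dst, buf, sizeof(Edge) * BATCH);
            dst += BATCH;
        }
    }
    /* Flush whatever is left in the partial batch (possibly nothing). */
    memcpy(dst, buf, sizeof(Edge) * (size_t)(i % BATCH));
    return i;
}
```

With five input pairs, the first four are flushed as one batch and the fifth by the final copy, mirroring the two `ms_memcpy` calls in the StorageApp.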
Invoking a StorageApp in host applications
• Like calling a function
• Prepare arguments using the Morpheus runtime library
• The runtime library interacts with the driver to utilize the SSD facilities
void test_distributed_page_rank(char *graphfilename, int num_ofVertex,
                                int num_ofEdges, int iterations)
{
    FILE *fin;
    ms_stream ssd_input_stream;
    void **arg_list;

    fin = fopen(graphfilename, "r");
    /* Wrap the open file in a stream that the StorageApp can read. */
    ssd_input_stream = ms_stream_create(fin);
    /* Destination buffer for the objects the StorageApp creates. */
    Edge *edge_array = (Edge *)malloc(sizeof(Edge) * num_ofEdges);
    /* Invoke the StorageApp like an ordinary function. */
    inputApplet(ssd_input_stream, edge_array);
    ms_stream_destroy(ssd_input_stream);
    // The rest of code ...
}
Experimental setup
• Intel Xeon E5-2609 v2 processor
• NVIDIA K20 GPU
• Morpheus-SSD: a 512GB SSD with a PMCS (now Microsemi) NVMe controller
[Figure: the test system, with the Morpheus-SSD and the K20 GPU labeled.]
Morpheus improves performance
[Figure: speedup (axis 0 to 1.8) for GPU-accelerated applications (PageRank, CC, bfs, gaussian, hybridsort, kmeans, lud, nn, srad, JASPA, average) with Morpheus-SSD and with Morpheus+NVMe-P2P. Annotated values: 1.32x and 1.39x.]
Morpheus saves power/energy
[Figure: normalized power and energy (axis 0 to 1.0) for GPU-accelerated applications (PageRank, CC, bfs, gaussian, hybridsort, kmeans, lud, nn, srad, JASPA, average). Annotated values: -7% (power) and -42% (energy).]
Morpheus makes wimpy servers more competitive
[Figure: speedup over 2.5G CPUs (axis 0 to 1.60) for GPU-accelerated applications (PageRank, CC, bfs, gaussian, hybridsort, kmeans, lud, nn, srad, JASPA, average) in three configurations: 1.2G CPU, Morpheus-SSD on 1.2G CPU, and Morpheus-SSD on 1.2G CPU + NVMe-P2P. Annotated values: 0.53x, 1.08x, and 1.12x.]
Morpheus-SSD + wimpy CPUs can compete with high-end servers
Conclusion
• Object creation/deserialization/serialization becomes a new bottleneck for high-performance heterogeneous computers
• The Morpheus model leverages under-utilized computing resources in storage devices to
  • bypass system overheads
  • enable efficient data communication mechanisms
• Morpheus-SSD improves application performance by 1.39x and allows wimpy servers to compete with high-end servers
Thank you!
Hung-Wei Tseng will be an assistant professor in
starting from this August