Page 1:

System Software for Big Data and Post Petascale Computing

Osamu Tatebe, University of Tsukuba

The Japanese Extreme Big Data Workshop, February 26, 2014

Page 2:

I/O performance requirement for exascale applications

• Computational Science (Climate, CFD, …)
  – Read initial data (100 TB~PB)
  – Write snapshot data (100 TB~PB) periodically

• Data Intensive Science (Particle Physics, Astrophysics, Life Science, …)
  – Data analysis of 10 PB~EB experiment data

Page 3:

Scalable performance requirement for Parallel File System

Year   FLOPS  #cores  IO BW      IOPS      Systems
2008   1P     100K    100 GB/s   O(1K)     Jaguar, BG/P
2011   10P    1M      1 TB/s     O(10K)    K, BG/Q
2016   100P   10M     10 TB/s    O(100K)
2020   1E     100M    100 TB/s   O(1M)

IO bandwidth and IOPS are expected to scale out in proportion to the number of cores or nodes.

Performance target
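Read as a scaling rule, the table keeps the ratios fixed at roughly 1e-4 bytes/s of I/O bandwidth per FLOP/s and 0.01 IOPS per core. A minimal sketch of that extrapolation; the two constants are assumptions read off the table above, not stated on the slide:

    # Extrapolate the I/O targets from the 2008 baseline (1 PFLOPS,
    # 100 GB/s, O(1K) IOPS).  The per-FLOP and per-core ratios are
    # assumptions inferred from the table, not given on the slide.
    BYTES_PER_FLOP = 100e9 / 1e15   # ~1e-4 bytes/s of IO BW per FLOP/s
    IOPS_PER_CORE = 1e3 / 100e3     # ~0.01 IOPS per core

    for year, flops, cores in [(2008, 1e15, 100e3), (2011, 10e15, 1e6),
                               (2016, 100e15, 10e6), (2020, 1e18, 100e6)]:
        print(year, f"{flops * BYTES_PER_FLOP / 1e12:5.1f} TB/s",
              f"{cores * IOPS_PER_CORE:9,.0f} IOPS")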

Page 4:

Technology trend

• HDD: performance will not improve much
  – 300 MB/s, 5 W per drive in 2020
  – 100 TB/s would require O(2M) W

• Flash, storage class memory
  – 1 GB/s, 0.1 W in 2020
  – Cost; limited number of updates (write/erase cycles)

• Interconnects
  – 62 GB/s (InfiniBand 4x HDR)
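The O(2M) W figure follows directly from the per-drive numbers above; a quick back-of-the-envelope check:

    # Back-of-the-envelope check of the HDD power estimate on this slide.
    target_bw = 100e12                 # 100 TB/s aggregate I/O bandwidth
    hdd_bw, hdd_power = 300e6, 5.0     # 300 MB/s and 5 W per HDD (2020 projection)
    drives = target_bw / hdd_bw        # drives needed just for bandwidth
    print(f"{drives:,.0f} drives, {drives * hdd_power / 1e6:.1f} MW")
    # -> ~333,333 drives drawing ~1.7 MW, i.e. O(2M) W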

Page 5:

Current parallel file system

• Central storage array
• Separate installation of compute nodes and storage
• Network BW between compute nodes and storage needs to be scaled up to scale out the I/O performance

(Figure: compute nodes (clients), MDS, and a central storage array; the network BW between compute nodes and storage is the limitation)

Page 6:

Remember memory architecture

(Figure: CPUs and memory arranged as shared memory vs. distributed memory)

Page 7:

Scaled-out parallel file system

• Distributed storage in compute nodes
• I/O performance would scale out by accessing nearby storage, unless metadata performance is the bottleneck
  – Access to nearby storage mitigates the network BW requirement
  – Performance may be non-uniform

(Figure: an MDS cluster with storage distributed across the compute nodes (clients))

Page 8:

Example of Scale-out Storage Architecture

• Snapshot of a system roughly 3 years out
• Non-uniform but scale-out storage
• R&D of system software stacks is required to achieve maximum I/O performance for data-intensive science

(Figure: each I/O node has a 2-socket CPU (2.0 GHz × 16 cores × 32 FPUs), memory, chipset, and 1 TB drives on 12 Gbps SAS × 16, giving 19.2 GB/s and 16 TB per node; × 500 nodes gives 9.6 TB/s and 8 PB; × 10 groups, each with a metadata server, gives 96 TB/s and 80 PB; InfiniBand HDR at 62 GB/s)

• 5,000 IO nodes
• 10 MDSs
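Assuming the grouping implied by the multipliers in the figure (16 drives per I/O node, 500 nodes per group, 10 groups), the aggregate numbers are consistent:

    # Aggregate bandwidth/capacity implied by the figure's multipliers.
    # The per-lane rate of 1.2 GB/s (12 Gbps SAS after 8b/10b encoding)
    # and the 16 x 500 x 10 grouping are assumptions inferred from the labels.
    lane_bw_gbs, drive_tb = 1.2, 1
    node_bw, node_cap = 16 * lane_bw_gbs, 16 * drive_tb     # 19.2 GB/s, 16 TB
    group_bw, group_cap = 500 * node_bw, 500 * node_cap     # 9.6 TB/s, 8 PB
    total_bw, total_cap = 10 * group_bw, 10 * group_cap     # 96 TB/s, 80 PB
    print(f"{total_bw/1e3:.0f} TB/s, {total_cap/1e3:.0f} PB, {500 * 10} IO nodes")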

Page 9:

Challenges

• File system (object store)
  – From a central storage cluster to a distributed storage cluster
  – Scaled-out parallel file system up to O(1M) clients
    • Scaled-out MDS performance
• Compute node OS
  – Reduction of OS noise
  – Cooperative cache
• Runtime system
  – Optimization for non-uniform storage access ("NUSA")
• Global storage for sharing exabyte-scale data among machines

Page 10:

Scaled out parallel file system

• Federate local storage in compute nodes
  – Special purpose
    • Google File System [SOSP'03]
    • Hadoop file system (HDFS)
  – POSIX(-like)
    • Gfarm file system [CCGrid'02, NGC'10]

Page 11:

Scaled-out MDS

• GIGA+ [Swapnil Patil et al., FAST'11]
  – Incremental directory partitioning
  – Independent locking in each partition
• skyFS [Jing Xing et al., SC'09]
  – Performance improvement during directory partitioning in GIGA+
• Lustre
  – MT scalability in 2.x
  – Proposed clustered MDS
• PPMDS [our JST CREST R&D]
  – Shared-nothing KV stores
  – Nonblocking software transactional memory (no locks)

System      IOPS (file creates/sec)   #MDS (#cores)
GIGA+       98K                       32 (256)
skyFS       100K                      32 (512)
Lustre 2.4  80K                       1 (16)
PPMDS       270K                      15 (240)
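Roughly, PPMDS spreads directory entries and inodes over shared-nothing key-value servers, so a single file create can touch keys on two different servers and must be applied atomically without locks. A minimal sketch of the partitioning side only; the server count, key naming, and hash are illustrative, not the actual PPMDS design:

    import hashlib

    N_SERVERS = 15   # illustrative; the table above uses 15 MDSs for PPMDS

    def server_for(key: str) -> int:
        """Map a metadata key (e.g. 'dirent:/home/a') to a KV server."""
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % N_SERVERS

    # Creating /home/a/f updates the parent directory entry and the new
    # file's inode, which in general live on different servers; PPMDS
    # applies such multi-server updates with nonblocking software
    # transactional memory instead of locks.
    print(server_for("dirent:/home/a"), server_for("inode:/home/a/f"))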

Page 12:

Development of Pwrake: Data-intensive Workflow System

IO-aware task scheduling:
• Locality-aware scheduling
  – Selection of compute nodes by input files
• Buffer cache-aware scheduling
  – Modified LIFO to ease the trailing-task problem
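A minimal sketch of locality-aware node selection by input files; the data structures and function names are hypothetical (Pwrake actually obtains replica locations from the Gfarm metadata server):

    from collections import Counter

    def pick_node(task_inputs, replica_locations, candidate_nodes):
        """Choose the candidate node that holds the most input data locally.

        replica_locations: dict mapping file name -> set of nodes holding
        a replica (a stand-in for what Gfarm's metadata server reports)."""
        score = Counter()
        for f in task_inputs:
            for node in replica_locations.get(f, ()):
                if node in candidate_nodes:
                    score[node] += 1
        # Fall back to any candidate when no input file is local.
        return score.most_common(1)[0][0] if score else next(iter(candidate_nodes))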

(Chart: workflow elapsed time in seconds with 900 GB of I/O files on 10 nodes, comparing naïve, locality-aware, and locality-aware + cache-aware scheduling; locality-aware scheduling gives a 42% speedup and cache-aware scheduling a further 23% speedup)

(Figure: Pwrake launches processes over SSH on nodes holding file1–file3 in the Gfarm file system)

Pwrake = Workflow System based on Rake (Ruby make)

Page 13:

Maximize Locality using Multi-Constraint Graph Partitioning

[Tanaka, CCGrid 2012]
• Task scheduling based on MCGP can minimize data movement
• Applied to the Pwrake workflow system and evaluated on the Montage workflow
  – Data movement reduced by 86%
  – Execution time improved by 31%

(Figure: simple graph partitioning vs. multi-constraint graph partitioning; with simple partitioning, parallel tasks are unbalanced among nodes)
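The point of the multi-constraint formulation is that each task vertex carries one weight per workflow stage, so a balanced partition balances every stage across the nodes (not just total work) while minimizing the edge cut, i.e. inter-node file transfers. A schematic of how such vertex weights would be built; the stage names are illustrative, not the actual Montage stages:

    # One weight entry per workflow stage: partitioning with these vectors
    # spreads each stage's tasks evenly across nodes, while the minimized
    # edge cut corresponds to files shared between tasks on different nodes.
    stages = ["project", "diff", "background", "add"]   # illustrative stages

    def vertex_weights(task_stage: str) -> list:
        return [1 if s == task_stage else 0 for s in stages]

    print(vertex_weights("diff"))   # -> [0, 1, 0, 0]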

Page 14:

HPCI Shared Storage

• HPCI – High Performance Computing Infrastructure
  – "K", Hokkaido, Tohoku, Tsukuba, Tokyo, Titech, Nagoya, Kyoto, Osaka, Kyushu, RIKEN, JAMSTEC, AIST
• A 20 PB Gfarm distributed file system consisting of East and West sites
• Grid Security Infrastructure (GSI) for user identity
• Parallel file replication among sites
• Parallel file staging to/from each center (see the staging sketch below)
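Parallel staging simply moves many files concurrently between a center's local file system and the shared storage; a generic sketch, with placeholder paths and a plain file copy standing in for the actual HPCI staging tools:

    from concurrent.futures import ThreadPoolExecutor
    import pathlib, shutil

    def stage(src_dir: str, dst_dir: str, workers: int = 16) -> None:
        """Copy every file under src_dir to dst_dir using `workers`
        parallel streams (a stand-in for parallel file staging)."""
        src, dst = pathlib.Path(src_dir), pathlib.Path(dst_dir)
        files = [p for p in src.rglob("*") if p.is_file()]
        with ThreadPoolExecutor(max_workers=workers) as pool:
            for f in files:
                target = dst / f.relative_to(src)
                target.parent.mkdir(parents=True, exist_ok=True)
                pool.submit(shutil.copy2, f, target)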

(Figure: East and West sites – 11.5 PB on 60 servers and 10 PB on 40 servers – each with MDSs, connected by a 10 (~40) Gbps link; West site at AICS, East site at U Tokyo. Picture courtesy of Hiroshi Harada, U Tokyo)

Page 15:

Storage structure of HPCI Shared Storage

Objective / File system / How to use:

• Local file system (Lustre, Panasas, GPFS, …)
  – Objective: temporal space – I/O performance; no backup
  – How to use: mv/cp, file staging
• Global file system
  – Objective: persistent storage – capacity and reliability; backup copy kept on tape or disk
  – How to use: mv/cp, file staging
• Wide-area distributed file system (Gfarm file system) – HPCI Shared Storage
  – Objective: data sharing – capacity and reliability; secured communication; fair share and easy to use; no backup, but files can be replicated
  – How to use: Web I/F, remote clients

Page 16:

Initial Performance Result

(Chart: I/O bandwidth [MB/sec] for copying 300 × 1 GB files to the HPCI Shared Storage from Hokkaido, Kyoto, Tokyo, and AICS; measured values 898, 847, 1,107, and 1,073 MB/sec)

Page 17:

Related Systems

• XSEDE-Wide File System (GPFS)
  – Planned, but not yet in operation
• DEISA Global File System
  – Multicluster GPFS
    • RZG, LRZ, BSC, JSC, EPSS, HLRS, …
  – Site name is included in the path name
    • No location transparency
    • Files cannot be replicated across sites
  – PRACE does not provide a global file system
    • Limited set of operating systems can mount it
    • PRACE does not assume the use of multiple sites

Page 18:

Summary

• Application I/O requirements
  – Computational Science
    • Scaled-out IO performance up to O(1M) nodes (100 TB to 1 PB per hour)
  – Data Intensive Science
    • Data processing over 10 PB to 1 EB of data (>100 TB/sec)
• File system, object store, OS, and runtime R&D for a scale-out storage architecture
  – Central storage cluster to distributed storage cluster
    • Network-wide RAID
    • Scaled-out MDS
  – Runtime system for non-uniform storage access ("NUSA")
    • Locality-aware process scheduling
• Global file system
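For a sense of scale behind these targets, two quick unit conversions (plain arithmetic on the numbers above):

    # 1 PB of snapshot data written per hour, and time to scan 1 EB at 100 TB/s.
    print(f"{1e15 / 3600 / 1e9:.0f} GB/s sustained for 1 PB/hour")        # ~278 GB/s
    print(f"{1e18 / 100e12 / 3600:.1f} hours to process 1 EB at 100 TB/s")  # ~2.8 h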

