FX10 Advanced Software

Date post: 07-May-2018
Upload: phamkhue
Copyright 2011 FUJITSU LIMITED

Advanced Software for the Supercomputer PRIMEHPC FX10

System Configuration of PRIMEHPC FX10

Figure: system configuration diagram. Labels:

• Nodes: file management nodes, job management nodes, control nodes, system integration node, login nodes, management nodes, compute nodes
• People: user, administrator
• File systems: global file system (data storage area), local file system (temporary area occupied by jobs)
• Networks: 6D mesh/torus interconnect, IO network (IB), management network (GbE)

Functions listed in the figure:

• Login
• Compilation
• Job submission
• Data transfer to/from global file system
• Data communication for system and job operations management
• System operations management
• Job operations management

System Software Stack

User/ISV Applications

HPC Portal / System Management Portal

File system, operations management

• High-performance file system: Lustre-based distributed file system, high scalability, IO bandwidth guarantee, high reliability & availability
• System operations management: system configuration management, system control, system monitoring, system installation & operation
• Job operations management: job manager, job scheduler, resource management, parallel execution environment

Application development environment

• VISIMPACT™: shared L2 cache on a chip, hardware intra-processor synchronization
• Compilers: hybrid parallel programming, sector cache support, SIMD / register file extensions
• MPI Library: scalability of High-Func. Barrier Comm.
• Support Tools: IDE, profiler & tuning tools, interactive debugger

Linux-based enhanced Operating System

• Enhanced hardware support
• System noise reduction
• Error detection / low power

PRIMEHPC FX10

OS (Linux-based enhanced Operating System)

Easy existing application porting

• POSIX API: Linux kernel 2.6.x, glibc 2.x

High performance / High scalability

• Enhanced hardware support: CPU registers, large memory pages, high-speed interconnect
• Reduced system noise in highly parallel programs: inter-node OS scheduling

High availability / low power consumption

• Hardware error detection / isolation: memory patrol, IO driver enhancements
• CPU suspend during system idle state

Figure: per-node timelines (Node 1 to Node 4) for synchronous daemon scheduling and CPU suspend. Labels: application running, job running, idle wait, daemon services = system noise, idle → CPU suspend.
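Why synchronizing daemon activity across nodes helps can be illustrated with a toy model. The sketch below is a simulation under stated assumptions, not the FX10 scheduler: the application is bulk-synchronous, so each step lasts as long as its slowest node, and each node runs one daemon burst per period. The function name, its parameters, and all numbers are hypothetical.

```python
def total_time(num_nodes, num_steps, work, noise, period, synchronized):
    """Toy model of a bulk-synchronous job disturbed by periodic daemon noise.

    Every step ends with a barrier, so a step takes as long as its slowest
    node. Each node runs one daemon burst of length `noise` every `period`
    steps. All names and numbers here are illustrative assumptions.
    """
    elapsed = 0.0
    for step in range(num_steps):
        slowest = work
        for node in range(num_nodes):
            # Synchronized: every node runs its daemon in the same step.
            # Unsynchronized: nodes run their daemons in staggered steps.
            offset = 0 if synchronized else node % period
            hit = (step % period) == offset
            slowest = max(slowest, work + noise if hit else work)
        elapsed += slowest
    return elapsed


unsync = total_time(num_nodes=4, num_steps=8, work=1.0, noise=0.5,
                    period=4, synchronized=False)
sync = total_time(num_nodes=4, num_steps=8, work=1.0, noise=0.5,
                  period=4, synchronized=True)
print(unsync, sync)
```

In this model the staggered case delays every step because some node is always running a daemon, while the synchronized case delays only the steps in which all daemons run together.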

System Operations Management

• Hierarchical structure for efficient system operation and adaptability to large-scale systems: the load is distributed by using job management sub nodes.
• Easy to operate with a single system image: the system is operated efficiently by dividing it into logical resource partitions called "Resource Units".

Figure: hierarchical structure for efficient operation. Labels: system administrator (easy to operate with a single system image); cluster control node (power control, hardware monitoring, software service monitoring); job management node (job operations management); two job management sub nodes; Nodegroup #1 and Nodegroup #2, each with compute nodes and IO nodes; Tofu interconnect; Resource Unit #1 and Resource Unit #2 (logical resource partitions).

High Availability System

The important nodes have redundancy:

• Control node
• Job management node
• Job management sub node
• File servers

Example (figure): job execution continues even if the job management node fails.

• Job data is always synchronized between the active node and the stand-by node.
• Operation switches to the stand-by node if the active node goes down.

Figure: job management nodes. Before the failure, the active node and the stand-by node keep job data in sync; after the failure, the stand-by node becomes active. Other labels: user, jobs.
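The active/stand-by behaviour described above can be sketched in a few lines. This is a minimal illustration under assumptions, not the FX10 implementation: the class names `JobManager` and `HaPair`, the in-memory job table, and the node names are all hypothetical.

```python
class JobManager:
    """Hypothetical job management node holding job state in memory."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.jobs = {}


class HaPair:
    """Active/stand-by pair that mirrors every job update to the stand-by."""

    def __init__(self, active, standby):
        self.active = active
        self.standby = standby

    def submit(self, job_id, state):
        self._fail_over_if_needed()
        self.active.jobs[job_id] = state
        # Synchronize job data to the stand-by on every change.
        if self.standby is not None and self.standby.alive:
            self.standby.jobs[job_id] = state

    def status(self, job_id):
        self._fail_over_if_needed()
        return self.active.jobs[job_id]

    def _fail_over_if_needed(self):
        # Promote the stand-by when the active node is down.
        if not self.active.alive:
            if self.standby is None or not self.standby.alive:
                raise RuntimeError("no job management node available")
            self.active, self.standby = self.standby, None


pair = HaPair(JobManager("jm-a"), JobManager("jm-b"))
pair.submit("job-1", "RUNNING")
pair.active.alive = False  # simulate a failure of the active node
state_after_failure = pair.status("job-1")
active_name = pair.active.name
```

Because every update is mirrored before the failure, the promoted node already holds the job state, which is what lets job execution continue.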

Job Operations Environment

Efficient resource usage

• Flexible job scheduling based on prioritized resource assignment
• Interconnect topology-aware resource assignment
• Backfill scheduling for keeping the resources busy
• Asynchronous file staging

High availability

• Avoids assigning faulty resources to jobs
• Disconnects faulty nodes from job operations
• Management nodes with failover support

Figure: job allocation over time with backfilling disabled and with backfilling enabled. Labels: running job, Job A, Job B, Job C; time axis marks T0, T1 and now, t1, t2, t3.
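Backfill scheduling can be sketched as follows. This is a simplified EASY-style backfill chosen for illustration; the slides do not describe the actual FX10 algorithm, and the function name, tuple layouts, and example numbers are assumptions.

```python
def backfill(total_nodes, running, queue, now=0):
    """Return names of queued jobs that may start at `now`.

    running: list of (end_time, nodes) for jobs already executing.
    queue:   list of (name, nodes, runtime) in priority order.
    Simplified EASY-style backfill (an assumption, not the FX10 algorithm):
    the first job that cannot start gets a reservation, and later jobs may
    start only if they fit in the free nodes and finish before that
    reservation begins.
    """
    free = total_nodes - sum(nodes for _, nodes in running)
    started = []
    reservation = None  # start time reserved for the first blocked job
    active = list(running)  # running jobs plus jobs started in this pass
    for name, nodes, runtime in queue:
        if reservation is None:
            if nodes <= free:
                free -= nodes
                active.append((now + runtime, nodes))
                started.append(name)
            else:
                # Find the earliest time at which enough nodes are released.
                available = free
                for end_time, released in sorted(active):
                    available += released
                    if available >= nodes:
                        reservation = end_time
                        break
        elif nodes <= free and now + runtime <= reservation:
            free -= nodes
            active.append((now + runtime, nodes))
            started.append(name)
    return started


# 8 nodes in total; one running job holds 6 nodes until t=10.
started = backfill(8, [(10, 6)],
                   [("A", 8, 5), ("B", 2, 20), ("C", 2, 5)])
```

In this example Job A must wait for all 8 nodes, Job B would overrun A's reserved start, and Job C fits into the idle nodes without delaying A, so only C starts now.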

Resource Assignment

Interconnect topology-aware resource assignment

• Treats 12 compute nodes as one interconnect unit
• Assigns cubic-shaped interconnect unit(s) to a job
• Using adjacent interconnect units suits contiguous communication and also avoids interfering with other jobs.

Optimizes the alignment of resources

• Rotating the cubic-shaped interconnect units improves total system utilization.

Figure: in-use and unoccupied interconnect units on x, y, z axes, with blocks whose edges are labeled 4, 6 and 8.
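The rotation idea can be illustrated with a small helper that checks which orientations of a job's cuboid fit into a free cuboid. This is a sketch under simplifying assumptions: the free region is treated as a single cuboid, shapes are counted in interconnect units, and the function name and example shapes are hypothetical.

```python
from itertools import permutations


def fitting_orientations(job_shape, free_shape):
    """Return the rotations of a cuboid job shape that fit a free cuboid.

    Shapes are (x, y, z) extents counted in interconnect units. Treating the
    free region as one cuboid is a simplifying assumption for illustration.
    """
    fits = []
    for rotated in sorted(set(permutations(job_shape))):
        if all(j <= f for j, f in zip(rotated, free_shape)):
            fits.append(rotated)
    return fits


job = (8, 6, 4)
free = (4, 8, 6)
unrotated_fits = all(j <= f for j, f in zip(job, free))
orientations = fitting_orientations(job, free)
```

Here the job does not fit as requested, but one axis-aligned rotation does, so the free region can still be used.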

High Scalability

Achieves highly scalable IO performance with multiple OSSes (OSS: object storage server).

Adapts to various IO models:

• Parallel IO (MPI-IO): figure shows four compute nodes, one shared file, and four OSSes
• Single stream IO: figure shows four compute nodes, files, and four OSSes
• Master IO: figure shows four compute nodes, files, and four OSSes

Figure: throughput versus number of servers, labeled "Add Server & Storage".

IO Bandwidth Guarantee

Fair Share QoS: sharing IO bandwidth among all users.

Figure: IO bandwidth between a login node and the file servers for User A and User B, without Fair Share QoS (not fair) and with Fair Share QoS (fair).

Best Effort QoS: utilizes all IO bandwidth exhaustively.

Figure: Client(s) A, Client(s) B and the file servers, with the bandwidth occupied by one client or shared by all clients.
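One way to picture both policies is a max-min fair split of a fixed bandwidth. Max-min fairness is a stand-in chosen for this sketch; the slides do not describe the QoS algorithm FX10 actually uses, and the function name, units, and demand figures are assumptions.

```python
def fair_share(capacity, demands):
    """Max-min fair split of `capacity` among users with the given demands.

    Max-min fairness is used here as a stand-in policy (an assumption); the
    slides do not describe the actual FX10 QoS algorithm.
    """
    allocation = {user: 0.0 for user in demands}
    remaining = float(capacity)
    unsatisfied = {user for user, d in demands.items() if d > 0}
    while unsatisfied and remaining > 1e-12:
        # Offer every unsatisfied user an equal share of what is left.
        share = remaining / len(unsatisfied)
        for user in sorted(unsatisfied):
            grant = min(share, demands[user] - allocation[user])
            allocation[user] += grant
            remaining -= grant
        unsatisfied = {u for u in unsatisfied
                       if demands[u] - allocation[u] > 1e-12}
    return allocation


two_users = fair_share(10, {"A": 100, "B": 3})
one_user = fair_share(10, {"A": 100})
```

With two users, the heavy user A cannot starve B, and the bandwidth B leaves unused goes to A. With a single user, that user may consume the full bandwidth, which matches the "utilize all IO bandwidth" goal.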

High Reliability and High Availability

Avoids single points of failure through redundant hardware and failover mechanisms.

Figure: redundant file system configuration. Labels: compute node; IB switches (network path); MDS (active) and MDS (standby) with failover; dual OSS servers, both active, with failover; RAID storage (disk path); file management server with monitoring and managing software.

VISIMPACT™ (Virtual Single Processor by Integrated Multi-core Parallel Architecture)

Mechanism that treats multiple cores as one high-speed CPU

• Easy and efficient execution of inter-core thread parallel processing with a multi-core CPU
• Supports the realization of a highly efficient hybrid model (automatic parallelization + MPI)

CPU technologies

• Large-capacity shared L2 cache memory: decreases the influence of false sharing
• Inter-core hardware barrier facilities: 6-10 times faster than a conventional software barrier

Figure: one process per core compared with one process spanning several cores that share an L2$ (inter-core thread parallel processing). Labels: memory, CPU, core, L2$, process.

Figure: barrier synchronization of Thread 1 to Thread N on separate cores over time. Caption: hardware barrier synchronization is 10 times faster than a conventional system.
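The barrier semantics in the figure (no thread proceeds until all threads arrive) can be shown with ordinary threads. Python's `threading.Barrier` is a software barrier, so this sketch only illustrates the behaviour and says nothing about the speed of the hardware facility; the function name and the thread and phase counts are assumptions.

```python
import threading


def run_phases(num_threads, num_phases):
    """Run worker threads that synchronize at a barrier after every phase."""
    barrier = threading.Barrier(num_threads)
    log = []  # (phase, thread_id) in the order the work was recorded
    lock = threading.Lock()

    def worker(thread_id):
        for phase in range(num_phases):
            with lock:
                log.append((phase, thread_id))
            # No thread starts phase + 1 until all threads reach this point.
            barrier.wait()

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


log = run_phases(num_threads=4, num_phases=3)
phases = [phase for phase, _ in log]
```

The recorded phase numbers never decrease, because every thread must finish phase p before any thread begins phase p + 1.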

Programming Model for High Scalability

Hybrid parallelism by VISIMPACT and MPI library

VISIMPACT

• Automated multi-thread parallelization
• High-performance thread barrier using the inter-core hardware barrier facility

MPI library

• High-performance collective communications using the Tofu barrier facility

Figure: Scalability of Himeno benchmark (XL size). x-axis: number of cores (1,000 to 100,000); y-axis: performance ratio (0 to 20); series: Hybrid + Tofu barrier, Hybrid, Flat MPI. Quotation from K computer performance data.

Figure: Himeno benchmark detail (65536 cores). y-axis: time ratio (0 to 1.2); bars: Hybrid + Tofu barrier, Hybrid, Flat MPI; components: collective communication, neighbor communication and calculation. Quotation from K computer performance data.

Customized MPI Library for High Scalability

Point-to-Point communication

• Uses a special type of low-latency path that bypasses the software layer
• Optimizes the transfer method according to the data length, process location and number of hops

Collective communication

• High-performance Barrier, Allreduce, Bcast and Reduce using the Tofu barrier facility
• Scalable Bcast, Allgather, Allgatherv, Allreduce and Alltoall algorithms optimized for the Tofu network

Figure: Barrier / Allreduce performance. x-axis: number of nodes (48x6xZ), 0 to 10,000; y-axis: elapsed time (microseconds), 0 to 100; series: Barrier (Tofu barrier), Allreduce (Tofu barrier), Barrier (software), Allreduce (software). Quotation from K computer performance data.
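For intuition about the software curves, the sketch below simulates a recursive-doubling Allreduce, a common software algorithm that needs log2(P) exchange steps for P ranks. The slides do not say which software algorithm was measured, so this choice is an assumption, as are the function name and the example values.

```python
def allreduce_sum(values):
    """Simulate a recursive-doubling Allreduce (sum) over len(values) ranks.

    The number of ranks must be a power of two. Returns the per-rank results
    and the number of communication steps, which is log2(ranks).
    """
    ranks = len(values)
    if ranks == 0 or ranks & (ranks - 1):
        raise ValueError("number of ranks must be a power of two")
    current = list(values)
    steps = 0
    distance = 1
    while distance < ranks:
        # Each rank exchanges its partial sum with the partner `distance` away.
        current = [current[r] + current[r ^ distance] for r in range(ranks)]
        distance *= 2
        steps += 1
    return current, steps


result, steps = allreduce_sum([1, 2, 3, 4, 5, 6, 7, 8])
```

Every simulated rank ends with the full sum, and the step count grows with the logarithm of the rank count, whereas a hardware barrier facility aims to avoid this per-step software overhead.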

Compiler Optimization for High Performance

• Instruction-level parallelism with SIMD instructions
• Improved computing efficiency using expanded registers
• Improved cache efficiency using the sector cache

Figure: NPB3.3 LU execution time comparison (relative values, 0 to 1.2), FX1 versus PRIMEHPC FX10. Breakdown: memory wait, cache misses, operation wait, instructions committed. Note: efficient use of expanded registers reduces operation wait.

Figure: NPB3.3 MG execution time comparison (relative values, 0 to 1.2), FX1 versus PRIMEHPC FX10. Breakdown: memory wait, cache misses, operation wait, instructions committed. Note: SIMD implementation reduces instructions committed.

Application Tuning Cycle and Tools

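Figure: application tuning cycle. Stages: execution, job information, overall tuning, MPI tuning, CPU tuning. Tools named, grouped in the figure as open source tools and FX10-specific tools: Profiler, RMATT, Tofu-PA, PAPI, Vampir-trace. The figure also includes a profiler snapshot.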
