2006-09-29 Emin Gabrielyan, Three Topics in Parallel Communications

Three Topics in Parallel Communications

Thesis presentation by Emin Gabrielyan


Parallel communications: bandwidth enhancement or fault-tolerance?

We do not know whether parallel communications were first used for fault-tolerance or for bandwidth enhancement

In 1964 Paul Baran proposed parallel communications for fault-tolerance (inspiring the design of ARPANET and the Internet)

In 1981 IBM introduced the 8-bit parallel port for faster communication


Bandwidth enhancement by parallelizing the sources and sinks

Bandwidth enhancement can be achieved by adding parallel paths

But a greater capacity enhancement is achieved if we can replace the senders and destinations with parallel sources and sinks

This is possible in parallel I/O (first topic of the thesis)


Parallel transmissions in coarse-grained networks cause congestion

In coarse-grained circuit-switched HPC networks uncoordinated parallel transmissions cause congestion

The overall throughput degrades due to access conflicts on shared resources

Coordination of parallel transmissions is covered by the second topic of my thesis (liquid scheduling)


Classical backup parallel circuits for fault-tolerance

Typically the redundant resource remains idle

As soon as the primary resource fails, the backup resource replaces it


Parallelism in living organisms

Parallelism is observed in almost every living organism

Duplication of organs primarily serves fault-tolerance

And, as a secondary purpose, capacity enhancement


Simultaneous parallelism for fault-tolerance in fine-grained networks

A challenging bio-inspired solution is to use all available paths simultaneously to achieve fault-tolerance

This topic is addressed in the last part of my presentation (capillary routing)

Fine Granularity Parallel I/O for Cluster Computers

SFIO, a Striped File parallel I/O

Why is parallel I/O required

A single I/O gateway for a cluster computer saturates

It does not scale with the size of the cluster

What is Parallel I/O for Cluster Computers

Some or all of the cluster computers can be used for parallel I/O


Objectives of parallel I/O

Resistance to concurrent access

Scalability as the number of I/O nodes increases

High level of parallelism and load balance for all application patterns and all types of I/O requests

Concurrent Access by Multiple Compute Nodes

No concurrent access overheads

No performance degradation when the number of compute nodes increases

Figure: Parallel I/O Subsystem

Scalable throughput of the parallel I/O subsystem

The overall parallel I/O throughput should increase linearly as the number of I/O nodes increases

Figure: throughput of the parallel I/O subsystem versus number of I/O nodes

Concurrency and Scalability = Scalable All-to-All Communication

Concurrency and scalability (as the number of I/O nodes increases) can be represented by a scalable overall throughput as the number of compute and I/O nodes increases

Figure: all-to-all throughput between compute nodes and I/O nodes versus number of I/O and compute nodes

High level of parallelism and load balance

Balanced distribution across parallel disks must be ensured for all types of application patterns:

Using small or large I/O requests

Continuous or fragmented I/O request patterns

How is parallelism achieved?

Split the logical file into stripes

Distribute the stripes cyclically across the subfiles

Figure: a logical file striped across subfiles file1 to file6

The POSIX-like Interface of Striped File I/O

Using SFIO from MPI

Simple POSIX-like interface

    #include <mpi.h>
    #include "/usr/local/sfio/mio.h"

    int _main(int argc, char *argv[])
    {
        MFILE *f;
        int r = rank();   // rank of this compute process

        // Collective open operation (two subfiles; 5 is taken here to be the stripe unit size in bytes)
        f = mopen("p1/tmp/a.dat;p2/tmp/a.dat;", 5);

        // Each process writes 8 to 14 characters at its own position
        if (r == 0) mwritec(f, 0, "Good*morning!", 13);
        if (r == 1) mwritec(f, 13, "Bonjour!", 8);
        if (r == 2) mwritec(f, 21, "Buona*mattina!", 14);

        mclose(f);        // Collective close operation
        return 0;
    }

Distribution of the global file data across the subfiles

Example with three compute nodes and two I/O nodes

Global file (the three writes start at offsets 0, 13 and 21): Good* morni ng!Bo njour !Buon a*mat tina!

First subfile: Good* ng!Bo !Buon tina!

Second subfile: morni njour a*mat
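The cyclic distribution in this example can be sketched in a few lines of C. This is an illustration only, not the SFIO implementation: the names `stripe_loc`, `locate` and `scatter` are hypothetical, and the stripe unit of 5 bytes over 2 subfiles is read off the example above.

```c
#include <stddef.h>
#include <string.h>

/* Position of one byte of the logical file inside the striped layout. */
typedef struct {
    size_t subfile;  /* index of the subfile holding the byte */
    size_t offset;   /* byte offset inside that subfile */
} stripe_loc;

/* Cyclic mapping of a logical offset to a subfile position. */
stripe_loc locate(size_t logical, size_t unit, size_t n_subfiles)
{
    size_t stripe = logical / unit;            /* global stripe index */
    stripe_loc loc;
    loc.subfile = stripe % n_subfiles;         /* stripes rotate over subfiles */
    loc.offset  = (stripe / n_subfiles) * unit + logical % unit;
    return loc;
}

/* Copy a logical buffer into per-subfile buffers, byte by byte.
   The caller provides zero-initialized buffers that are large enough. */
void scatter(const char *data, size_t len, size_t unit,
             size_t n_subfiles, char **out)
{
    for (size_t i = 0; i < len; i++) {
        stripe_loc loc = locate(i, unit, n_subfiles);
        out[loc.subfile][loc.offset] = data[i];
    }
}
```

Scattering the 35-byte string "Good*morning!Bonjour!Buona*mattina!" with a unit of 5 over 2 buffers reproduces the two subfile contents listed above. A real library would move whole stripes rather than single bytes; the byte loop only keeps the mapping easy to read.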


Impact of the stripe unit size on the load balance

When the stripe unit size is large there is no guarantee that an I/O request will be well parallelized

Figure: an I/O request on the logical file and the subfiles it touches

Fine granularity striping with good load balance

Fine granularity ensures good load balance and a high level of parallelism

But it results in high network communication and disk access costs

Figure: an I/O request on the logical file and the subfiles it touches

Fine granularity striping is to be maintained

Most HPC parallel I/O solutions are optimized only for large I/O blocks (on the order of megabytes)

But we focus on maintaining fine granularity

The problems of network communication and disk access costs are addressed by dedicated optimizations

Overview of the implemented optimizations

Disk access request aggregation (sorting, cleaning overlaps and merging)

Network communication aggregation

Zero-copy streaming between network and fragmented memory patterns (MPI derived datatypes)

Support of the multi-block interface, which efficiently optimizes application-related file and memory fragmentation (MPI-I/O)

Overlapping of network communication with disk access in time (at the moment for write operations only)

Disk access optimizations

Sorting

Cleaning the overlaps

Merging

Input: striped user I/O requests

Output: optimized set of I/O requests

No data copy

Figure: a multi-block I/O request (block 1, block 2, block 3) mapped onto the local subfile; 6 I/O access requests are merged into 2 (access1, access2)
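The three steps (sorting, cleaning the overlaps, merging) can be sketched as a classic interval merge over request descriptors. This is a minimal sketch under assumptions, not SFIO's bkmerge: the `io_req` type and `merge_requests` function are hypothetical, and only the (offset, length) descriptors are touched, never the data.

```c
#include <stdlib.h>

/* One disk access request on the local subfile (hypothetical type). */
typedef struct {
    long offset;
    long length;
} io_req;

/* qsort comparator: ascending by offset. */
static int by_offset(const void *a, const void *b)
{
    const io_req *x = a;
    const io_req *y = b;
    return (x->offset > y->offset) - (x->offset < y->offset);
}

/* Sorts the requests in place, then folds overlapping or adjacent
   requests together. Returns how many requests remain at the front
   of the array. */
size_t merge_requests(io_req *reqs, size_t n)
{
    if (n == 0)
        return 0;
    qsort(reqs, n, sizeof reqs[0], by_offset);
    size_t out = 0;                       /* index of the request being grown */
    for (size_t i = 1; i < n; i++) {
        long end = reqs[out].offset + reqs[out].length;
        if (reqs[i].offset <= end) {      /* overlaps or touches: extend */
            long new_end = reqs[i].offset + reqs[i].length;
            if (new_end > end)
                reqs[out].length = new_end - reqs[out].offset;
        } else {                          /* gap: start a new request */
            reqs[++out] = reqs[i];
        }
    }
    return out + 1;
}
```

For example, six requests covering bytes 0 to 15 and 40 to 50 in arbitrary order, some overlapping, collapse into two accesses, in the spirit of the "6 merged into 2" figure.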


Network Communication Aggregation without Copying

Striping across 2 subfiles

Derived datatypes on the fly

Contiguous streaming

Figure: stripes of the logical file streamed from application memory to remote I/O node 1 and remote I/O node 2

Functional Architecture

Blue: Interface functions

Green: Striping functionality

Red: I/O request optimizations

Orange: Network communication and relevant optimizations

bkmerge: overlapping and aggregation

mkbset: creates MPI derived datatypes on the fly

Figure: SFIO library on the compute node. Interface functions mread, mwrite, mreadc, mreadb, mwritec, mwriteb; striping in mrw (cyclic distribution); sfp_rflush, sfp_wflush, sfp_readc, sfp_writec, sfp_rdwrc (request caching), flushcache, sortcache, sfp_read, sfp_write, sfp_readb, sfp_writeb, bkmerge, mkbset, sfp_waitall; the commands SFP_CMD_READ, SFP_CMD_WRITE, SFP_CMD_BREAD and SFP_CMD_BWRITE travel over MPI to the I/O listener on the I/O node

Optimized throughput as a function of the stripe unit size

3 I/O nodes

1 compute node

Global file size: 660 Mbytes

TNET

About 10 MB/s per disk

Figure: write throughput (MB/s, 0 to 30) versus stripe unit size (bytes, 50 to 50000); series: non-optimized, optimized

All-to-all stress test on Swiss-Tx cluster supercomputer

The stress test is carried out on the Swiss-Tx machine

8 full crossbar 12-port TNet switches

64 processors

Link throughput is about 86 MB/s

SFIO on the Swiss-Tx cluster supercomputer

MPI-FCI

Global file size: up to 32 GB

Mean of 53 measurements for each number of nodes

Nearly linear scaling with a 200-byte stripe unit!

The network is a bottleneck above 12 nodes

Figure: overall all-to-all throughput (MB/s, 0 to 400) versus number of compute and I/O nodes (1 to 31); series: write maximum, write average, read maximum, read average

Liquid scheduling for low-latency circuit-switched networks

Reaching liquid throughput in HPC wormhole switching and in Optical lightpath routing networks


Upper limit of the network capacity

Given a set of parallel transmissions and a routing scheme

The upper limit of the network's aggregate capacity is its liquid throughput

Distinction: Packet Switching versus Circuit Switching

Packet switching has been replacing circuit switching since 1970 (more flexible, manageable, scalable)

New circuit-switching networks are emerging (HPC clusters, optical switching)

In HPC, wormhole routing targets extremely low latency requirements

In optical networks, packet switching is not possible due to lack of technology

Coarse-Grained Networks

In circuit switching the large messages are transmitted entirely (coarse-grained switching)

Low latency

The sink starts receiving the message as soon as the sender starts transmission

Figure: message source and message sink under fine-grained packet switching and under coarse-grained circuit switching

Parallel transmissions in coarse-grained networks

When the nodes transmit in parallel across a coarse-grained network in an uncoordinated fashion, congestion may occur

The resulting throughput can be far below the expected liquid throughput

Congestion and blocked paths in wormhole routing

When the message encounters a busy outgoing port it waits

The previous portion of the path remains occupied

Figure: Source1, Source2, Source3 and Sink1, Sink2, Sink3

Hardware solution in Virtual Cut-Through routing

In VCT, when the port is busy the switch buffers the entire message

Much more expensive hardware than in wormhole switching

Figure: Source1, Source2, Source3 and Sink1, Sink2, Sink3, with buffering at the switch

Other hardware solutions

In optical networks OEO conversion can be used

Significant impact on the cost (vs. memory-less wormhole switch and MEMS optical switches)

Affecting the properties of the network (e.g. latency)


Application level coordinated liquid scheduling

Liquid scheduling is a software solution

Implemented at the application level

No investments in network hardware

Coordination between the edge nodes is required

Network topology knowledge is assumed

Example of a simple traffic pattern

5 sending nodes (above)

5 receiving nodes (below)

2 switches

12 links of equal capacity

Traffic consists of 25 transfers

Round robin schedule of all-to-all traffic pattern

First, all nodes simultaneously send the message to the node in front

Then, simultaneously, to the next node

etc


Throughput of round-robin schedule

The 3rd and 4th phases each require two timeframes

7 timeframes are needed in total

Link throughput = 1 Gbps

Overall throughput = 25/7 x 1 Gbps = 3.57 Gbps

A liquid schedule and its throughput

6 timeframes of non-congesting transfers

Overall throughput = 25/6 x 1 Gbps = 4.17 Gbps

Problem of liquid scheduling

Building a liquid schedule for an arbitrary traffic of transfers

The problem of partitioning the traffic into a minimal number of subsets consisting of non-congesting transfers

Timeframe = a subset of non-congesting transfers

Definitions of our mathematical model

Transfer is a set of links lying on the path of the transmission

Load of a link is the number of transfers in the traffic using that link

Most loaded links are called bottlenecks

Duration of the traffic is the load of its bottlenecks
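These definitions can be checked on the example with 5 senders, 5 receivers, 2 switches and 12 links. The sketch below rests on an assumed topology, since the slide figure is not reproduced here: switch A hosts senders 0 to 2 and receivers 0 to 2, switch B hosts the rest, and the two remaining links join the switches, one per direction. All function names are hypothetical.

```c
#define N_SEND  5
#define N_RECV  5
#define N_LINKS 12

/* Assumed link numbering:
   0..4  sender i to its switch
   5..9  switch to receiver j
   10    switch A to switch B
   11    switch B to switch A */
static int on_switch_a(int node) { return node < 3; }

/* A transfer is the set of links on its path, stored as a bitmask. */
static unsigned transfer_links(int s, int r)
{
    unsigned links = (1u << s) | (1u << (5 + r));
    if (on_switch_a(s) && !on_switch_a(r)) links |= 1u << 10;
    if (!on_switch_a(s) && on_switch_a(r)) links |= 1u << 11;
    return links;
}

/* Load of the most loaded link over a set of transfers. */
static int max_load(const unsigned *transfers, int n)
{
    int best = 0;
    for (int l = 0; l < N_LINKS; l++) {
        int load = 0;
        for (int i = 0; i < n; i++)
            if (transfers[i] & (1u << l)) load++;
        if (load > best) best = load;
    }
    return best;
}

/* Duration of the all-to-all traffic = load of its bottlenecks. */
int liquid_duration(void)
{
    unsigned t[N_SEND * N_RECV];
    int n = 0;
    for (int s = 0; s < N_SEND; s++)
        for (int r = 0; r < N_RECV; r++)
            t[n++] = transfer_links(s, r);
    return max_load(t, n);
}

/* Round robin: in phase k sender s sends to receiver (s + k) % 5.
   In this topology a phase needs as many timeframes as the load
   of its most loaded link. */
int round_robin_timeframes(void)
{
    int total = 0;
    for (int k = 0; k < N_RECV; k++) {
        unsigned t[N_SEND];
        for (int s = 0; s < N_SEND; s++)
            t[s] = transfer_links(s, (s + k) % N_RECV);
        total += max_load(t, N_SEND);
    }
    return total;
}
```

Under this assumed topology `liquid_duration()` returns 6 and `round_robin_timeframes()` returns 7, consistent with the 25/6 and 25/7 throughput ratios quoted for the example.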


Teams = non-congesting transfers using all bottleneck links

The shortest possible time to carry out the traffic is the active time of the bottleneck links

Then the schedule must keep the bottleneck links busy all the time

Therefore the timeframes of a liquid schedule must consist of transfers using all bottlenecks

Figure: the bottleneck links, with one set of transfers that is a team and one that is not a team

Retrieval of teams without repetitions by subdivisions

Teams can be retrieved without repetitions by recursive partitioning

By choosing a transfer, all teams are divided into teams using that transfer and teams not using it

Each half can be similarly subdivided until individual teams are retrieved
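The subdivision can be sketched as a plain include/exclude recursion. This is a minimal illustration without the skeleton and other optimizations discussed in the talk; transfers and the bottleneck set are bitmasks of links, and the names `count_from` and `count_teams` are hypothetical.

```c
/* Transfers and link sets are bitmasks: bit l set means link l is used. */

/* Counts the teams that can be formed from transfers[from..n-1],
   given the links already occupied by the transfers chosen so far. */
static int count_from(const unsigned *transfers, int n, int from,
                      unsigned used, unsigned bottlenecks)
{
    if (from == n)  /* all choices made: a team must cover every bottleneck */
        return (used & bottlenecks) == bottlenecks;

    int count = 0;
    /* Half 1: teams using transfers[from], allowed only if it does not
       congest with the transfers already chosen. */
    if ((transfers[from] & used) == 0)
        count += count_from(transfers, n, from + 1,
                            used | transfers[from], bottlenecks);
    /* Half 2: teams not using transfers[from]. */
    count += count_from(transfers, n, from + 1, used, bottlenecks);
    return count;
}

/* Number of teams in the traffic; each team is reached exactly once. */
int count_teams(const unsigned *transfers, int n, unsigned bottlenecks)
{
    return count_from(transfers, n, 0, 0u, bottlenecks);
}
```

As a small made-up example, take three links that are all bottlenecks and four transfers using links {0,1}, {1,2}, {0} and {2}. The only non-congesting subsets covering all three links are {0,1} with {2} and {1,2} with {0}, so the count is 2.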


Teams use all bottlenecks: retrieving teams of traffic skeleton

Since teams must use transfers using the bottleneck links

We can first create teams using only such transfers (traffic skeleton)

Chart: fraction of the traffic skeleton

Figure: fraction of transfers using bottlenecks (0% to 100%) versus number of transfers (and number of contributing nodes), from 0 (00) to 900 (30), for 362 different traffic patterns across the Swiss-Tx cluster

Optimization by first retrieving the teams of the skeleton

Speedup by skeleton optimization

Reducing the search space 9.5 times

Figure: search space reduction (%, 0% to 35%) versus number of possible full teams (and number of transfers) for 23 different traffic patterns across the Swiss-Tx cluster; legend: idle+skeleton+blank, idle+blank, blank

Liquid schedule assembling from retrieved teams

By relying on efficient retrieval of full teams (subsets of non-congesting transfers using all bottlenecks)

We assemble a liquid schedule by trying different combinations of teams together

Until all transfers of the traffic are used

Liquid schedule assembling optimizations (reduced traffic)

Proved. If we remove a team from a traffic, new bottlenecks can emerge

New bottlenecks add additional constraints on the teams of the reduced traffic

Proved. A liquid schedule can be assembled if we use teams of the reduced traffic (instead of constructing teams of the initial traffic from the remaining transfers)

Proved. A liquid schedule can be assembled by considering only saturated full teams


Liquid schedule construction speed with our algorithm

360 traffic patterns across Swiss-Tx network

Up to 32 nodes

Up to 1024 transfers

Comparison of our optimized construction algorithm with MILP method (optimized for discrete optimization problems)

Figure: CPU time in seconds (log scale, 0.001 to 100000) for 362 sample topologies; series: MILP Cplex method, liquid schedule construction algorithm

Carrying real traffic patterns according to liquid schedules

The Swiss-Tx supercomputer cluster network is used for testing aggregate throughputs

Traffic patterns are carried out according to liquid schedules

Compare with topology-unaware round robin or random schedules

Theoretical liquid and round-robin throughputs of 362 traffic samples

362 traffic samples across Swiss-Tx network

Up to 32 nodes

Traffic carried out according to round robin schedule reaches only 1/2 of the potential network capacity

Figure: overall throughput (MB/s, 0 to 1800) versus number of transfers (and number of nodes), from 0 (00) to 900 (30); series: liquid throughput, round-robin schedule

Throughput of traffic carried out according to liquid schedules

Traffic carried out according to a liquid schedule practically reaches the theoretical throughput

Figure: overall throughput (MB/s, 200 to 1800) versus number of transfers (and number of nodes), from 1 (01) to 961 (31); series: theoretical liquid throughput, measured throughput of a topology-unaware schedule, measured throughput of a liquid schedule

Liquid scheduling conclusions: application, optimization, speedup

In HPC networks, large messages are “copied” across the network, causing congestion

Arbitrarily scheduled transfers yield throughput below the theoretical capacity

Liquid scheduling relies on network topology and reaches the theoretical liquid throughput of the network

Liquid schedules can be constructed in less than 0.1 sec for traffic patterns with 1000 transmissions (about 100 nodes)

Future work: dynamic traffic patterns and application in OBS

Fault-tolerant streaming with Capillary-routing

Path diversity and Forward Error Correction codes at the packet level


Structure of my talk

The advantages of packet level FEC in off-line streaming

Solving the difficulties of real-time streaming by multi-path routing

Generating multi-path routing patterns of various path diversity

Level of the path diversity and the efficiency of the routing pattern for real-time streaming

Decoding a file with Digital Fountain Codes

A file is divided into packets

A digital fountain code generates numerous checksum packets

A sufficient quantity of any checksum packets recovers the file

As when filling a cup, only collecting a sufficient amount of drops matters

Transmitting large files without feedback across lossy networks using digital fountain codes

Sender transmits the checksum packets instead of the source packets

Interruptions cause no problems

The file is recovered once a sufficient number of packets is delivered

FEC in off-line streaming relies on time stretching


In real-time streaming the receiver play-back buffering time is limited

While in off-line streaming the data can be held in the receiver buffer …

In real-time streaming the receiver is not permitted to keep data too long in the playback buffer

Long failures on a single path route

If the failures are short, then by transmitting a large number of FEC packets the sender can ensure that the receiver always has a sufficient number of checksum packets in time

If the failure lasts longer than the playback buffering limit, no FEC can protect the real-time communication

Applicability of FEC in Real-Time streaming by using path diversity

Losses can be recovered by extra packets:

received later (in off-line streaming)

received via another path (in real-time streaming)

Path diversity replaces time-stretching

Figure: axes time stretching and path diversity; labels: reliable off-line streaming, reliable real-time streaming, real-time streaming, playback buffer limit

Creating an axis of multi-path patterns

Intuitively we imagine the path diversity axis as shown

High diversity decreases the impact of individual link failures, but uses many more links, increasing the overall failure probability

We must study many multi-path routing patterns of different diversity in order to determine whether diversity pays off

Figure: path diversity axis, from single path routing through multi-path routing patterns of increasing diversity

Capillary routing creates solutions with different levels of path diversity

As a method for obtaining multi-path routing patterns of various path diversity we rely on the capillary routing algorithm

For any given network and pair of nodes, capillary routing produces, layer by layer, routing patterns of increasing path diversity

Path diversity = Layer of Capillary Routing

Capillary routing - introduction

Capillary routing first offers a simple multi-path routing pattern

At each successive layer it recursively spreads out individual sub-flows of previous layers

The path diversity develops as the layer number increases

The construction relies on LP


Capillary routing – first layer

Figure: reduce the maximal load of all links

First take the shortest path flow and minimize the maximal load of all links

This will split the flow over a few parallel routes
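One common way to write the first-layer objective as a linear program is given below. This is a hedged sketch, not the formulation used in the thesis: the symbols are assumptions introduced here, with $s$ the source, $d$ the sink, $f_{ij} \ge 0$ the flow on link $(i,j)$, a unit flow from $s$ to $d$, and $z$ the maximal link load being minimized.

```latex
\begin{aligned}
\text{minimize}\quad & z \\
\text{subject to}\quad
& \sum_{j} f_{ij} - \sum_{j} f_{ji} =
  \begin{cases}
     1 & \text{if } i = s \\
    -1 & \text{if } i = d \\
     0 & \text{otherwise}
  \end{cases}
  \quad \text{for every node } i, \\
& 0 \le f_{ij} \le z \quad \text{for every link } (i,j).
\end{aligned}
```

The flow conservation constraints keep one unit of traffic travelling from source to sink, while the bound $f_{ij} \le z$ forces the solver to spread that unit over parallel routes in order to lower $z$.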


Capillary routing – second layer

Then identify the bottleneck links of the first layer

And minimize the flow of the remaining links

Continue similarly, until the full routing pattern is discovered layer by layer

Figure: reduce the load of the remaining links

Capillary Routing Layers

Single network

4 routing patterns

Increasing path diversity


Application model: evaluating the efficiency of path diversity

To evaluate the efficiencies of patterns with different path diversities we rely on an application model where:

The sender uses a constant amount of FEC checksum packets to combat weak losses and

The sender dynamically increases the number of FEC packets in case of serious failures

Figure: an FEC block made of source packets and redundant packets

Strong FEC codes are used in case of serious failures

When the packet loss rate observed at the receiver is below the tolerable limit, the sender transmits at its usual rate

But when the packet loss rate exceeds the tolerable limit, the sender adaptively increases the FEC block size by adding more redundant packets

Figure: FEC blocks at packet loss rate = 3% and at packet loss rate = 30%

Redundancy Overall Requirement

The overall amount of dynamically transmitted redundant packets during the whole communication time is proportional:

to the duration of communication and the usual transmission rate

to a single link failure frequency and its average duration

and to a coefficient characterizing the given multi-path routing pattern

Equation for ROR: it depends only on the routing pattern r(l)

$$\mathrm{ROR} = \sum_{l \in L \,\mid\, t \le r(l) < 1} \left( \frac{FEC_{r(l)}}{FEC_t} - 1 \right)$$

Where:

$FEC_{r(l)}$ is the FEC transmission block size in case of the complete failure of link l

r(l) is the load of link l for a given routing pattern

$FEC_t$ is the FEC block size at default streaming (tolerating loss rate t)
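A numeric sketch of the ROR coefficient follows, assuming ROR sums FEC_r(l) / FEC_t - 1 over the links whose load r(l) is at least t and below 1. The FEC model is an assumption introduced here, not taken from the slides: with an ideal erasure code a block carrying k source packets needs k / (1 - p) packets to survive loss rate p, so the ratio FEC_r / FEC_t becomes (1 - t) / (1 - r). The function name `ror` is hypothetical.

```c
#include <stddef.h>

/* ROR of a routing pattern given the load r(l) of each link (the
   fraction of the flow it carries) and the default tolerance t.
   Assumes an ideal erasure code, so FEC_r / FEC_t = (1 - t) / (1 - r). */
double ror(const double *load, size_t n_links, double t)
{
    double sum = 0.0;
    for (size_t i = 0; i < n_links; i++) {
        double r = load[i];
        /* Links carrying less than t are already covered by the default
           FEC; a link carrying the whole flow (r = 1) cannot be
           compensated by any amount of redundancy. */
        if (r >= t && r < 1.0)
            sum += (1.0 - t) / (1.0 - r) - 1.0;
    }
    return sum;
}
```

With t = 0.05, two links each carrying half of the flow contribute 0.95 / 0.5 - 1 = 0.9 each, so the pattern scores 1.8, while a link carrying 2% of the flow contributes nothing.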


ROR coefficient

The smaller the ROR coefficient of the multi-path routing pattern, the better the choice of multi-path routing for real-time streaming

By measuring the ROR coefficient of multi-path routing patterns of different path diversity, we can evaluate the advantages (or disadvantages) of diversification

Multi-path routing patterns of different diversity are created by the capillary routing algorithm

ROR as a function of diversity

Here is ROR as a function of the capillarization level

It is an average function over 25 different network samples (obtained from MANET)

The constant tolerance of the streaming is 5.1%

Here is ROR function for a stream with a static tolerance of 4.5%

Here are ROR functions for static tolerances from 3.3% to 7.5%

Figure: average ROR rating (0 to 60) versus capillarization (layer1 to layer10); curves for tolerances 3.3%, 3.9%, 4.5%, 5.1%, 6.3% and 7.5%

ROR rating over 200 network samples

ROR coefficients for 200 network samples

Each section is the average for 25 network samples

Network samples are obtained from random walk MANET

Path diversity obtained by capillary routing reduces the overall amount of FEC packets

Figure: average ROR rating (0 to 60) for eight different sets of 25 network samples (Set1 to Set8), layers 1 to 10 within each set; curves for tolerances 3.3%, 3.9%, 4.5%, 5.1%, … 7.5%

Conclusions

Although strong path diversity increases the overall failure rate, it is beneficial for real-time streaming (except in a few pathological cases)

Capillary routing patterns reduce the overall number of redundant packets required from the sender

In single-path real-time streaming, application of FEC at the packet level is almost useless

With multi-path routing patterns, real-time applications can gain great advantages from the application of FEC

Future work: using an overlay network to achieve a multi-path communication flow

Considering coding also inside the network, not only at the edges; aiming also at energy saving in MANET


Thank you!

Presented topics:

Fine-grained parallel I/O for cluster computers

Liquid scheduling of parallel transmissions in coarse-grained networks

Capillary routing: fault-tolerance in fine-grained networks
