Page 1: Intel Research Adaptive Load Sharing for Multiprocessor Network Nodes Lukas Kencl Intel Research Cambridge UCL, November 12, 2003.

Adaptive Load Sharing for Multiprocessor Network Nodes

Lukas Kencl

Intel Research Cambridge

UCL, November 12, 2003

Page 2:

Outline

• Adaptive load sharing method
  • Flow-to-processor mapping
  • Adaptation with minimal disruption
  • Method validation and application

• Research: Further methods
  • Adaptive data structures
  • Dynamic code reconfiguration

• Outlook: Adaptive methods in networking

Page 3:

Adaptive Load Sharing for Multiprocessor Network Nodes

Ph.D. work: IBM Zurich Research & EPFL, Lausanne

2 parts:
• Flow-to-processor mapping
• Adaptation with minimal disruption

Page 4:

Multiprocessor Network Node as a Load Sharing System

Assumptions:
• Data arrives in packetized flows.
• Any processor can process any packet.
• Heterogeneous processor capacity μj.

[Figure: incoming packets from multiple (N) inputs 1..N are distributed across multiple (M) processors 1..M.]

Task:
• Keep the load on the processors within some measure of balance.
• Map the same flow to the same processor (avoids reordering, preserves context).

Advantage: system optimization. Drawback: complexity, overhead.

Page 5:

Acceptable Load Sharing as the Measure of Balance

Processing load on processor j: λj(t)
Capacity of processor j: μj
Workload intensity on processor j: ρj(t) = λj(t) / μj
Total system workload intensity: ρ(t) = Σj λj(t) / Σj μj

“No single processor is overutilized if the system in total is not overutilized, and vice versa.”

Acceptable load sharing:

if ρ(t) ≤ 1 then ∀j, ρj(t) ≤ 1,
if ρ(t) > 1 then ∀j, ρj(t) > 1.

Acceptable load sharing minimizes packet loss probability!
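To make the definition concrete, here is a minimal sketch (not from the slides) of the acceptable-load-sharing predicate; the inputs lam and mu are hypothetical per-processor loads λj and capacities μj.

```python
# A minimal sketch of the "acceptable load sharing" predicate defined above;
# lam[j] is the load and mu[j] the capacity of processor j (made-up inputs).

def acceptable(lam, mu):
    """True iff no processor is overutilized unless the whole system is."""
    rho_sys = sum(lam) / sum(mu)               # total workload intensity
    rho = [l / m for l, m in zip(lam, mu)]     # per-processor intensities
    if rho_sys <= 1:
        return all(r <= 1 for r in rho)        # nobody overloaded
    return all(r > 1 for r in rho)             # everybody overloaded

print(acceptable([0.5, 0.7], [1.0, 1.0]))  # True: system and nodes under 1
print(acceptable([1.3, 0.2], [1.0, 1.0]))  # False: one node over, system under
```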

Page 6:

Minimizing Disruption

Goal: acceptable load sharing without maintaining flow state information, and yet minimizing the probability of mapping disruption (flow remapping).

NP-complete problem (Integer Linear Programming):

maximize  Σv εv(t) · Σj ( 1{f(t−Δt)(v) = j} · 1{f(t)(v) = j} ),

subject to  Σv av(t) · l(v) · 1{f(t)(v) = j} = λj(t) ≤ μj,  ∀j,

where
v – flow identifier vector in the packet header,
f(t)(v) – function mapping flows to processors, changing over time,
εv(t) ∈ {0,1} – indicator whether v has appeared in both intervals (t−2Δt, t−Δt) and (t−Δt, t),
av(t) – how many times v has appeared in the interval (t−Δt, t),
l(v) – load per packet carrying v,
Δt – iteration interval.

Even if we knew all the flow state information, this would remain an NP-complete problem, so heuristics are needed.

Page 7:

Flow-to-Processor Mapping

[Figure: an incoming packet carrying the flow identifier vector v arrives at one of the multiple (N) inputs and is dispatched to one of the multiple (M) processors.]

Weights vector x = (x1, ..., xM).

Upon packet arrival, a decision is made where to process the packet, based on the flow identifier and a set of weights. A flow-to-processor mapping f is thus established; in the figure, f(v) = 3.

Page 8:

Flow-to-Processor Mapping

Def.: Flow-to-processor mapping function f, f(v): V → M:

f(v) = j  ⟺  xj · g(v, j) = maxk { xk · g(v, k) },

where v is the flow identifier vector, x = (x1, ..., xm) is a weights' vector and g(v, j) ∈ (0, 1) is a pseudorandom function of uniform distribution.

Highest Random Weight (HRW) Mapping: Thaler & Ravishankar 1997; Ross 1998; CARP protocol; Windows NT load balancing.

[Figure: example with 3 processors of homogeneous processing capacity (weight xi = 1, ∀i); the values g(v, 1), g(v, 2), g(v, 3) are compared and the packet maps to the maximum, here processor 2.]
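As an illustration of the definition above, a minimal HRW sketch; SHA-256 is an assumption (the slides only require a fast, uniform pseudorandom g(v, j)), scaled into (0, 1).

```python
# A minimal HRW sketch, assuming SHA-256 (not specified on the slides) as
# the pseudorandom g(v, j) with output scaled into the open interval (0, 1).
import hashlib

def g(v: bytes, j: int) -> float:
    """Pseudorandom, uniform in (0,1), keyed on flow id v and processor j."""
    h = hashlib.sha256(v + j.to_bytes(4, "big")).digest()
    return (int.from_bytes(h[:8], "big") + 1) / (2**64 + 2)

def hrw_map(v: bytes, x: list[float]) -> int:
    """f(v) = argmax_j x_j * g(v, j); processors indexed from 0 here."""
    return max(range(len(x)), key=lambda j: x[j] * g(v, j))

# Usage: three homogeneous processors (x_i = 1); same flow -> same processor.
x = [1.0, 1.0, 1.0]
print(hrw_map(b"10.0.0.1->10.0.0.2:80", x))
```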

Page 9:

HRW Mapping Favourable Properties

Minimal disruption of the mapping in case of processor addition. Example: add processor no. 4; vectors are mapped either (i) as before the addition or (ii) to the newly added processor, so a minimal number of vectors change mapping.

[Figure: g(v, j) compared for processors 1..3, then for 1..4 after adding processor 4; only flows whose maximum becomes g(v, 4) move.]

Load balancing over heterogeneous processors: the weights' vector x is in a 1-to-1 correspondence to p = (p1, ..., pm), the vector of traffic fractions received at each processor. The pseudorandom function g(v, j) ∈ (0, 1) can be implemented as a fast-computable hash function.
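A small experiment (illustrative only; the flow identifiers and counts are made up, and the SHA-256 stand-in for g is reused) that checks the minimal disruption property empirically when a fourth processor is added:

```python
# Checks: after adding processor 4, every flow either keeps its old
# processor or moves to the new one; with equal weights, roughly 1/4 move.
import hashlib

def g(v, j):
    h = hashlib.sha256(v + j.to_bytes(4, "big")).digest()
    return (int.from_bytes(h[:8], "big") + 1) / (2**64 + 2)

def hrw_map(v, x):
    return max(range(len(x)), key=lambda j: x[j] * g(v, j))

flows = [f"flow-{i}".encode() for i in range(10_000)]
before = [hrw_map(v, [1.0] * 3) for v in flows]          # 3 processors
after  = [hrw_map(v, [1.0] * 4) for v in flows]          # add processor 4
moved = [(b, a) for b, a in zip(before, after) if b != a]
assert all(a == 3 for _, a in moved)   # remapped flows go only to the new one
print(f"{len(moved) / len(flows):.1%} of flows remapped")  # ~25% expected
```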

Page 10:

Adaptation through Feedback

[Figure: a Control Point (CP) connected to multiple (N) input cards and multiple (M) processors runs the feedback loop:]

1. Filter the workload intensity ρj(t), ∀j.
2. Evaluate ρ(t) = (ρ1(t), ρ2(t), ..., ρm(t)) (compare against the threshold).
3. Compute new x(t) = (x1(t), ..., xm(t)).
4. Download new x := x(t).

The trigger definition targets preventing overload if the system in total is not overloaded, and vice versa.

A threshold triggers adaptation when close to the load sharing bounds.

The flow-to-processor mapping f becomes a function of time, f(t)(v).

Adaptation may cause flow remapping! How to minimize the amount remapped?

Problem: incoming requests are packets, not flows! Packets are not evenly distributed over flows, hence not evenly distributed over the request object space, so the HRW mapping alone is not sufficient for acceptable load sharing bounds: we need to adapt!

Page 11:

Adaptation Algorithm

[Flowchart: Start → compute filtered processor workload intensity ρ(t) → trigger adaptation? If No, wait time Δt and repeat (Triggering Policy); if Yes, adapt the weights' vector x and upload it (Adaptation Policy), then repeat.]

Page 12:

Triggering Policy

Dynamic workload intensity threshold:

θ'(t) = 1/2 · (1 + ρ(t))

Triggering policy:
(i) if ρ(t) ≤ 1 and maxj ρj(t) > θ(t), then adapt;
(ii) if ρ(t) > 1 and minj ρj(t) < θ(t), then adapt.

Example: (ρ1(t), ρ2(t), ρ3(t)) = (0.8, 0.2, 0.2), ρ(t) = 0.4, θ(t) = 0.7; ρ1(t) > θ(t) ⇒ adapt.

Triggering threshold:

θ(t) = max(θ'(t), upper) (or, symmetrically, with the lower bound)

Hysteresis bounds:
upper: (1 + εH(t)) · ρ(t)
lower: (1 − εH(t)) · ρ(t)
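A hedged sketch of this triggering policy; equal capacities are assumed for brevity, and eps_h stands for the hysteresis width εH(t) as a plain parameter:

```python
# A minimal sketch of the triggering policy above; rho is the vector of
# filtered per-processor intensities (a hypothetical input).

def should_adapt(rho: list[float], eps_h: float = 0.0) -> bool:
    rho_sys = sum(rho) / len(rho)                  # system workload intensity
    theta = 0.5 * (1 + rho_sys)                    # dynamic threshold theta'
    if rho_sys <= 1:
        theta = max(theta, (1 + eps_h) * rho_sys)  # upper hysteresis bound
        return max(rho) > theta                    # rule (i)
    theta = min(theta, (1 - eps_h) * rho_sys)      # lower hysteresis bound
    return min(rho) < theta                        # rule (ii)

# The slides' example: rho = (0.8, 0.2, 0.2) -> rho_sys = 0.4, theta = 0.7.
print(should_adapt([0.8, 0.2, 0.2]))  # True: 0.8 > 0.7, so adapt
```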

Page 13:

Adaptation Policy: Minimal Disruption Example (3 Proc.)

• If the reduction is by a single, invariable multiplier, the processors whose weights are reduced receive less and the unaltered one receives more;
• minimal disruption of the mapping.

[Figure: with weights (x1, x2, x3), the values xj · g(v, j) are compared; after scaling x1 and x2 by 2/3 while x3 stays unchanged, only the flows whose maximum becomes x3 · g(v, 3) are remapped.]

Page 14:

Adaptation Policy: Minimal Disruption

• A, B: mutually exclusive subsets of M = {1, ..., m}, M = A ∪ B.
• c ∈ (0, 1).
• f, f': two HRW mappings with the weights' vectors x, x':

  x'j = c · xj,  j ∈ A,
  x'j = xj,  j ∈ B.

• pj, p'j: fraction of objects mapped to node j using f, f'.

Then:
1) p'j ≤ pj, j ∈ A,
   p'j ≥ pj, j ∈ B.
2) The number of objects mapped to a different node by each mapping is MINIMAL, that is, equal to |p'j − pj| · |V| at every node j.

Page 15:

Adaptation Policy

Let ρ(t) ≤ 1. Then:

xj(t) := c(t) · xj(t−Δt),  if ρj(t) > θ(t)  (j exceeds the threshold θ(t)),
xj(t) := xj(t−Δt),  if ρj(t) ≤ θ(t)  (j does not exceed the threshold θ(t)).

If ρ(t) > 1, the adaptation is carried out in a symmetrical manner.

The weights' multiplier coefficient c(t):

c(t) = ( θ(t) / min{ ρj(t) | ρj(t) > θ(t) } )^(1/m)

Factor c(t) is proportional to the minimal error and to the number of nodes.
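A minimal sketch of this adaptation step for the underloaded case (ρ(t) ≤ 1); rho, x and theta are the per-processor intensities, HRW weights and triggering threshold in the slides' notation, passed in as plain lists and floats:

```python
# A sketch of the weight adaptation above (underloaded case only).

def adapt_weights(x: list[float], rho: list[float], theta: float) -> list[float]:
    over = [r for r in rho if r > theta]       # processors above threshold
    if not over:
        return x                               # nothing to do
    m = len(x)
    c = (theta / min(over)) ** (1 / m)         # multiplier c(t), in (0, 1)
    return [c * xj if rj > theta else xj for xj, rj in zip(x, rho)]

# Usage with the earlier example: processor 1 exceeds theta = 0.7.
print(adapt_weights([1.0, 1.0, 1.0], [0.8, 0.2, 0.2], 0.7))
```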

Page 16:

Validation

Page 17:

Expectations

• Workload intensity on individual processors close to that of the system in total (acceptable load sharing);
• Packet loss probability lowered (acceptable load sharing);
• Persistent flows (appearing in two consecutive iterations) seldom remapped (minimize disruption).

Page 18:

“Realistic” Generated Traffic and Router System Model

[Figure: measured flow length cumulative distribution.]

Traffic characterization (approximated from various published OC-3 statistics):
• Number of packets per time interval;
• Number of flows per time interval;
• Measured flow length distribution, complemented by a Pareto distribution to generate the heavy tail;
• Identifier vector distribution;
• Per-packet processing load distribution;
• Maximal per-flow fraction of the interface rate f.

Router system model:
• 8 processors, 13 interfaces;
• System workload intensity close to 1;
• 3 alternatives to load sharing (LS):
  • Naive (no LS);
  • Static (LS with static weights);
  • Adaptive LS.

Page 19:

Adaptation Keeps Per-processor Workload Intensity Close to Ideal

[Figures: per-processor workload intensity over time for Naive (no LS), Static LS and Adaptive LS; max and min of all processors shown.]

Page 20:

Packet Loss Significantly Reduced with the Adaptive Control Loop

[Figures: packet loss for Naive, Static, Adaptive and Ideal; packet loss in excess of Ideal for Static and Adaptive.]

Adaptive load sharing saves on average 60% of the packets dropped in excess by the static load sharing.

Page 21:

Minimal Disruption Property Ensures Few Flow Remappings

[Figure: flows per iteration: appearing, persistent and remapped.]

The adaptive control loop leads on average to:
• less than 0.05% of the appearing flows remapped per iteration;
• less than 0.2% of the persistent flows remapped per iteration.

Page 22:

Applications and Implementation

Page 23:

Extension to Prevent Remapping

The minimal disruption property means only a small number of flows require special treatment. Which ones? Keep state? What treatment?

Which ones: during the transient period between the two mappings after an adaptation:
• Compute both mappings (OLD and NEW) for each packet.
• If the mappings differ, apply special treatment.

Treatment (sketched in code below):
• If a new flow (SYN packet), insert a classifier rule that maps to the new mapping.
• If an existing flow, insert a rule that maps to the old mapping.
• Monitor and delete terminated flows.
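An illustrative sketch of this transient-period logic, under assumptions the slides do not fix: flows keyed by an opaque identifier, the SYN flag as the new-flow signal, and a plain dict standing in for the hardware classifier. hrw_old and hrw_new stand for the OLD and NEW HRW mappings.

```python
# A sketch, not the PowerNP implementation: dispatch one packet during the
# transient period between the OLD and NEW mappings.
from typing import Callable

def dispatch(flow_id: bytes, is_syn: bool,
             hrw_old: Callable[[bytes], int],
             hrw_new: Callable[[bytes], int],
             rules: dict[bytes, int]) -> int:
    """Return the processor for a packet; 'rules' mimics the classifier."""
    if flow_id in rules:                  # classifier rule hit
        return rules[flow_id]
    old, new = hrw_old(flow_id), hrw_new(flow_id)
    if old == new:                        # mappings agree: no state needed
        return new
    target = new if is_syn else old       # new flows follow the new mapping,
    rules[flow_id] = target               # existing flows keep the old one
    return target
```

A monitor would delete entries from rules as flows terminate (FIN/RST or timeout), as the last bullet above notes.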

Page 24:

Server Load Balancer on the IBM PowerNP with Zero Remappings

[Figure: the IBM PowerNP sits in front of multiple (M) servers S1..SM, with a Control Point (CP). An incoming packet first hits the classifier: on a rule hit it follows the rule; on a rule miss the hash function is evaluated with both weight vectors. If HNEW = HOLD the packet is forwarded directly; if HNEW ≠ HOLD it is redirected to the CP, which determines whether it is a new or old flow, installs the matching rule (new flows via xNEW, old flows via xOLD), and sends collision flows directly.]

Page 25:

Adaptive Load Sharing: Summary

• hash-based;
• minimum state information;
• adaptive, yet minimum flow disruptions;
• the multiprocessor network node is transformed into a parallel computer;
• wide scope of applications (server LB, NP dispatcher, distributed router, etc.).

Page 26:

Research: Further methods

Page 27:

Adaptive lookup/classification on a multiprocessor system

• Splitting a large lookup table / rule base into several smaller consecutive sub-tables / rule bases;
• Each sub-table / rule base has a dedicated processor (microengine);
• Boundaries adaptively tuned according to the load on the processors;
• Sub-data structures adapting to the enforced traffic locality.

[Figures: an example lookup table and an example rule base, each spanning the key space 0 .. 2^32 − 1, partitioned among microengines ME0..ME3.]

Page 28:

Adaptive lookup table

• Table entries typically organized in a tree structure;
• Adapting to the traffic patterns: rebalancing the tree according to the hit distribution when updating the table;
• Problem: worst case vs. optimization.

[Figure: an example binary prefix tree (prefixes such as 00, 10, 100, 101, 10010, 10011) before and after rebalancing, annotated with hit counts and memory accesses (MA) per level.

Before: MemAccesses = 2·20 + 2·70 + 3·50 + 3·500 = 1830.
After promoting the 500-hit entry to depth 1: MemAccesses = 1·500 + 3·20 + 3·70 + 4·50 = 970.]
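The slide's arithmetic, restated as a tiny script: the cost of a lookup tree is the hit-weighted sum of the entry depths (memory accesses per lookup), so promoting the hot 500-hit entry pays off. The (hits, depth) pairs below are taken from the figure.

```python
# A toy illustration of why rebalancing by hit counts reduces lookup cost.

def mem_accesses(entries: list[tuple[int, int]]) -> int:
    """entries: (hits, depth) pairs; returns total memory accesses."""
    return sum(hits * depth for hits, depth in entries)

before = [(20, 2), (70, 2), (50, 3), (500, 3)]   # hot 500-hit entry is deep
after  = [(500, 1), (20, 3), (70, 3), (50, 4)]   # hot entry promoted
print(mem_accesses(before), mem_accesses(after))  # 1830 970
```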

Page 29:

Dynamic code reconfiguration

• Modular code, modules interconnected via virtual queues;
• Counters periodically accounting for virtual-queue occupancy;
• If there is imbalance on the codepaths:
  • restructure code;
  • remap resources.

[Figure: example of the IP vs. MPLS balance across modules (MAC, IP, AQM, MPLS), with the critical codepath highlighted.]

Page 30:

Outlook: Adaptive methods in networking

• Improved performance, reduced power consumption;
• Adaptive instead of programmable networks!
  • Feedback control rather than programming/debugging;
  • Execution self-adjusts based on monitored knowledge:
    • data path: program code, data structures, resource assignment;
    • control path: routing, transport mechanisms;
  • A set of primitives out of which to compose functionality, rather than a program.
• Issues:
  • distributed feedback control;
  • ensuring control algorithms on different layers do not interfere.

Page 31:

Q & A

The End - Thank You!

Page 32:

Backup

Page 33:

Why Load Sharing at All? System Optimization!

Advantages:
• Maximize total load, while respecting a packet loss constraint;
• An M/M/m queue outperforms m separate M/M/1 queues;
• Fault tolerance;
• Scalability.

[Figure: a node without load sharing vs. one with load sharing.]

Drawbacks:
• Increased system complexity due to:
  • state information maintenance;
  • computing overhead.

Page 34:

Proof of Concept - Simple Simulator

• 8 outgoing links with various capacities, preceded by per-link queues;
• simple generated traffic: random (uniform) identifier vector, uniform packet burst probability;
• HRW weights initially set to 0.

Results:
• weight values asymptotically tend to the correct ones;
• queue utilization soon close to the total system utilization;
• a decrease in the standard deviation of queue occupancy shows the influence of the feedback control.

Page 35:

Maximal Per-Flow Fraction of the Interface Rate f Significantly Influences Performance

[Figure: packets dropped and flows remapped, as functions of f.]

The more a single flow may consume of the interface rate, the worse the adaptive load sharing method performs, in both the packet-loss and flow-remapping response variables.

Page 36:

Data Path in a Distributed Router

[Figure: N line cards and M NPUs, plus a Control Point (CP), connected by an input & output switch / shared memory. An incoming packet carries packet fields in the identifier vector v and additional packet fields in the information vector w. Steps:
1. Parse;
2. f(v) = 3;
3. StorePayload(1, v);
4. Request(3, w);
5. NextHop(w) = N;
6. GetPayload(1, v);
7. Switch Packet(N).]

Page 37:

Scalable HRW Weights Data Structure

• max(A, B, C, D) = max(max(A, B), max(C, D));
• adaptation on multiple tree levels;
• minimal flow disruption holds on the lowest level only!
• balance the tree: avoid nodes with few child nodes;
• avoid adaptation on higher levels: looser threshold, wider hysteresis;
• correlations among levels of the hierarchy when computing g(v, j): use an offset.

A sketch of the hierarchical evaluation follows below.
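A speculative sketch of the hierarchical idea: since max(A, B, C, D) = max(max(A, B), max(C, D)), the HRW argmax can be evaluated over a tree of processor groups, with a weight on a group node scaling (adapting) its whole subtree. The node shapes and the SHA-256 stand-in for g are assumptions, not the slides' design.

```python
# Hierarchical HRW evaluation sketch: group weights multiply subtree scores.
import hashlib

def g(v: bytes, j: int) -> float:
    h = hashlib.sha256(v + j.to_bytes(4, "big")).digest()
    return (int.from_bytes(h[:8], "big") + 1) / (2**64 + 2)

def hrw_tree(v: bytes, node) -> tuple[int, float]:
    """node: ('leaf', j, x_j) or ('group', w, [children])."""
    if node[0] == "leaf":
        _, j, xj = node
        return j, xj * g(v, j)
    _, w, children = node
    best = max((hrw_tree(v, ch) for ch in children), key=lambda p: p[1])
    return best[0], w * best[1]    # group weight scales the whole subtree

tree = ("group", 1.0, [
    ("group", 1.0, [("leaf", 0, 1.0), ("leaf", 1, 1.0)]),
    ("group", 0.8, [("leaf", 2, 1.0), ("leaf", 3, 1.0)]),  # de-weighted group
])
print(hrw_tree(b"some-flow-id", tree)[0])  # chosen processor index
```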

Page 38:

Flow-to-Processor Mapping Bkp.

Def.: Flow-to-processor mapping function f, f(v): V → M:

f(v) = j  ⟺  xj · g(v, j) = maxk { xk · g(v, k) },

where v is the flow identifier vector, x = (x1, ..., xm) is a weights' vector and g(v, j) ∈ (0, 1) is a pseudorandom function of uniform distribution. The weights' vector x is in a 1-to-1 relationship to p = (p1, ..., pm), the vector of traffic fractions received at each processor.

Highest Random Weight (HRW) Mapping, Thaler, Ravishankar, 1997, Ross, 1998, CARP Protocol.

[Figure: example with 3 processors; x1 · g(v, 1), x2 · g(v, 2), x3 · g(v, 3) are compared and the packet maps to the maximum.]

Page 39:

HRW Mapping - How and Why Does It Work? – BKP

[Figure: bar charts comparing xj · g(v, j) for three processors, and again after adding a fourth processor with weight x4; in each case the flow maps to the maximum, with xMAX marking the largest weight.]

Page 40:

HRW Mapping Properties Examples – Bkp.

Minimal disruption of mapping in case of processor addition (adding a 4th processor):

[Figure: xj · g(v, j) compared before and after the addition; only flows whose maximum becomes g(v, 4) change mapping.]

Load balancing over heterogeneous processors: the weights' vector x is in a 1-to-1 correspondence to p = (p1, ..., pm), the vector of traffic fractions received at each processor.

