Broadcast Federation
An architecture for scalable inter-domain multicast/broadcast.
http://www.cs.berkeley.edu/~mukunds/bfed/
Mukund Seshadri [email protected]
with
Yatin Chawathe [email protected]
Spring 2002
Motivation
One-to-many or many-to-many applications, e.g. Internet live audio/video broadcast.
No universally deployed multicast protocol.
IP Multicast: limited scalability (due to router state or flooding nature), address scarcity, need for administrative boundaries.
SSM: better semantics and business model, but still requires a smart network.
Overlays: application-level "routers" form an overlay network and perform multicast forwarding.
Less efficient, but easier deployability.
May be used in CDNs (Content Distribution Networks) for pushing data to the edge; heavy-duty edge servers replicate content.
Goals
Design an architecture for the composition of different, non-interoperable multicast/broadcast domains to provide an end-to-end one-to-many packet delivery service.
Design and implement a high-performance (clustered) broadcast gateway for the above architecture.
Requirements
Intra-domain protocol independence (both app-layer and IP-layer)
Should be easily customizable for each specific multicast protocol.
Scalable (throughput, number of sessions).
Should not distribute info about sessions to entities not interested in those sessions.
Should use available multicast capability wherever possible.
Basic Design
Broadcast Network (BN) – any multicast-capable network/domain/CDN.
Broadcast Gateway (BG): bridges between 2 BNs; explicit BG peering; overlay of BGs; analogous to BGP routers.
App-level, for both app-layer and IP-layer protocols:
Commodity hardware, easier customizability and deployability
X Less efficient link usage, and more delay
X Inefficient hardware
[Diagram: source and clients in BNs, connected by BGs over peering links carrying data.]
Naming
Each session has an Owner BN.
Facilitates shared-tree protocols.
Address space limited only by individual BNs' naming protocols.
Session description: owner BN, session name in owner BN, and options.
Options: metrics (hop-count, latency, bandwidth, etc.), transport (best-effort, reliable, etc.), number of sources (single, multiple).
URL-style: bin://Owner_BN/native_session_name?pmtr=value&pmtr2=value2…
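A minimal sketch (not from the bfed code) of parsing such a session name into its owner BN, native name, and options; the SessionName struct and ParseSessionName function are illustrative assumptions only.

// Illustrative sketch only: parses a "bin://Owner_BN/native_session_name?pmtr=value&..."
// style session name. Struct and function names are assumptions, not the bfed API.
#include <map>
#include <optional>
#include <string>

struct SessionName {
    std::string owner_bn;                        // owning Broadcast Network
    std::string native_name;                     // session name inside the owner BN
    std::map<std::string, std::string> options;  // e.g. metric=latency, transport=reliable
};

std::optional<SessionName> ParseSessionName(const std::string& url) {
    const std::string scheme = "bin://";
    if (url.compare(0, scheme.size(), scheme) != 0) return std::nullopt;

    std::string rest = url.substr(scheme.size());
    std::string::size_type slash = rest.find('/');
    if (slash == std::string::npos) return std::nullopt;

    SessionName s;
    s.owner_bn = rest.substr(0, slash);

    std::string path = rest.substr(slash + 1);
    std::string::size_type qmark = path.find('?');
    s.native_name = path.substr(0, qmark);

    if (qmark != std::string::npos) {
        std::string query = path.substr(qmark + 1);
        while (!query.empty()) {
            std::string::size_type amp = query.find('&');
            std::string pair = query.substr(0, amp);
            std::string::size_type eq = pair.find('=');
            if (eq != std::string::npos)
                s.options[pair.substr(0, eq)] = pair.substr(eq + 1);
            query = (amp == std::string::npos) ? "" : query.substr(amp + 1);
        }
    }
    return s;
}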
B-Gateway components
3 loosely coupled components:
Routing – for "shortest" unicast routes towards sources.
Tree building – for the "shortest"-path distribution tree.
Data forwarding – to send data efficiently across tree edges.
NativeCast interface – interacts with the local broadcast capability.
Routing
Peer BGs exchange BN/BG-level reachability info.
Path-vector algorithm.
Different routes for different metrics/options, e.g. BN-hop-count + best-effort + multi-source, latency + reliable, etc.
Session-agnostic: avoids all BNs knowing about all sessions; BG-level selectivity available using SROUTEs.
Policy hooks can be applied to such a protocol.
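A minimal sketch, under assumed names, of what session-agnostic path-vector routing state at a BG could look like (one route per destination BN and metric/option class, with BN-path loop avoidance); it is illustrative, not the implemented protocol.

// Illustrative sketch of session-agnostic, path-vector routing state at a BG.
// Routes are kept per (destination BN, metric/option class); the BN path vector
// allows loop detection, as in BGP. All names are assumptions for illustration.
#include <algorithm>
#include <map>
#include <string>
#include <tuple>
#include <vector>

struct RouteKey {
    std::string dest_bn;       // BN the route leads towards
    std::string option_class;  // e.g. "hopcount+besteffort" or "latency+reliable"
    bool operator<(const RouteKey& o) const {
        return std::tie(dest_bn, option_class) < std::tie(o.dest_bn, o.option_class);
    }
};

struct Route {
    std::string next_hop_bg;           // peer BG to forward JOINs towards
    std::vector<std::string> bn_path;  // BN-level path vector
    double cost;                       // value of the chosen metric
};

// Accept an advertised route only if it does not already contain our own BN
// (loop avoidance) and improves on the current cost for that key.
bool MaybeInstall(std::map<RouteKey, Route>& table, const RouteKey& key,
                  const Route& candidate, const std::string& local_bn) {
    if (std::find(candidate.bn_path.begin(), candidate.bn_path.end(), local_bn) !=
        candidate.bn_path.end())
        return false;  // our BN is already on the path: would form a loop
    auto it = table.find(key);
    if (it == table.end() || candidate.cost < it->second.cost) {
        table[key] = candidate;
        return true;
    }
    return false;
}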
Tree Building
One reverse shortest-path tree per session, rooted at the owner BN.
"Soft" BG tree state: (Session : upstream node : list of downstream nodes).
Can be bi-directional.
Fine-grained selectivity using SROUTE messages before the JOIN phase.
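A minimal sketch, with assumed names and timeout, of the per-session soft tree state and its refresh/expiry; illustrative only, not the bfed implementation.

// Illustrative sketch of the per-session "soft" tree state kept at a BG:
// (Session : upstream node : list of downstream nodes), refreshed by JOINs
// and expired otherwise. Names and the timeout handling are assumptions.
#include <algorithm>
#include <ctime>
#include <map>
#include <string>
#include <vector>

struct TreeEntry {
    std::string upstream;                 // parent BG (or NativeCast/local BN)
    std::vector<std::string> downstream;  // child BGs / NativeCast
    std::time_t last_refresh;             // soft state: dropped if not refreshed
};

using TreeTable = std::map<std::string, TreeEntry>;  // keyed by session name

// Refresh (or create) tree state when a JOIN for `session` arrives from `child`.
void OnJoin(TreeTable& table, const std::string& session,
            const std::string& child, const std::string& upstream) {
    TreeEntry& e = table[session];
    e.upstream = upstream;
    if (std::find(e.downstream.begin(), e.downstream.end(), child) == e.downstream.end())
        e.downstream.push_back(child);
    e.last_refresh = std::time(nullptr);
}

// Expire entries whose state has not been refreshed recently.
void ExpireStale(TreeTable& table, int timeout_seconds) {
    std::time_t now = std::time(nullptr);
    for (auto it = table.begin(); it != table.end();) {
        if (now - it->second.last_refresh > timeout_seconds) it = table.erase(it);
        else ++it;
    }
}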
Distribution Trees
[Diagram: evolution of per-BG tree state, recorded as (Session : Parent : Child1, Child2, ...), for session S1 as clients C1, C2, and C3 send JOINs via NativeCast and the Client/Mediator; BGs B1–B5 are connected by peering links towards the source, e.g. B1 holds (S1:B1:N,B3).]
Mediator
How does a client tell a BG that it wants to join a session?
Client in the owner BN: no interaction with the federation.
Client not in the owner BN: needs to send a JOIN to a BG in its BN.
BNs are required to implement the Mediator abstraction, for sending JOINs for sessions to BGs.
Possible implementations: modified clients which send JOINs to BGs; a well-known Mediator IP Multicast group; routers or other BN-specific aggregators.
Can be part of the NativeCast interface.
Data Forwarding
Decouples control from data … and control nodes from data nodes.
TRANSLATION messages carry data-path addresses per session, e.g. a TCP/UDP/IP Multicast address + port.
E.g. a transit SSM network might require 2+ channels to be set up for one session.
Label negotiation, for fast forwarding.
Can be piggy-backed on JOINs.
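A minimal sketch of the information a TRANSLATION (and a piggy-backed JOIN) could carry; the field names and types are assumptions, not the wire format.

// Illustrative sketch of the information a TRANSLATION message could carry:
// the per-session data-path address of the next hop (TCP/UDP/IP Multicast
// address + port) and an optional label for fast forwarding. Field names are
// assumptions; the on-the-wire format is not specified here.
#include <cstdint>
#include <string>

enum class DataTransport { kUdp, kTcp, kIpMulticast };

struct Translation {
    std::string session;      // e.g. "bin://BN1/channel5?transport=besteffort"
    DataTransport transport;  // which data path the downstream node should use
    std::string address;      // unicast or multicast IP address of the data path
    uint16_t port;            // transport port
    uint32_t label;           // negotiated label for fast forwarding (0 = none)
};

// A TRANSLATION can be piggy-backed on the JOIN travelling in the opposite
// direction, so both control messages cross the peering link together.
struct Join {
    std::string session;
    Translation reverse_path_translation;  // data address of the joining side
};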
[Diagram: a JOIN from client C1 propagates through BGs P1, P2, P3 towards the source while TRANSLATIONs flow back, installing per-hop data-path state such as UDP:IP1,Port1, UDP:IP2,Port2, and IPM:IPm1,Portm1 alongside tree state (S1:L:P2), (S1:P1:L).]
Clustered BG design
1 control node + n data nodes.
Control node performs routing + tree building.
Independent data paths flow directly through data nodes.
TRANSLATION messages contain IP addresses of data nodes in the cluster.
Throughput bottlenecked only by the IP router/NIC.
“Soft” data-forwarding state at data nodes.
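A minimal sketch of one possible way a control node could assign sessions to Dnodes (simple round-robin, remembered per session so later TRANSLATIONs stay consistent); the policy and names are assumptions, not the implemented behaviour.

// Illustrative sketch: the control node spreads sessions across its data
// nodes (Dnodes) round-robin; the chosen Dnode's address is what gets
// advertised in the TRANSLATION for that session. The real BG may choose
// Dnodes differently (e.g. by load).
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct DataNode {
    std::string ip;  // address placed in TRANSLATION messages
    uint16_t port;   // data-path port on that Dnode
};

class DnodeAssigner {
 public:
    explicit DnodeAssigner(std::vector<DataNode> dnodes)
        : dnodes_(std::move(dnodes)), next_(0) {}

    // Return the Dnode that will carry this session's data path, creating the
    // assignment on first use so all later TRANSLATIONs stay consistent.
    const DataNode& ForSession(const std::string& session) {
        auto it = assignment_.find(session);
        if (it == assignment_.end()) {
            it = assignment_.emplace(session, next_ % dnodes_.size()).first;
            ++next_;
        }
        return dnodes_[it->second];
    }

 private:
    std::vector<DataNode> dnodes_;
    std::map<std::string, std::size_t> assignment_;
    std::size_t next_;
};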
[Diagram: clustered BGs BG1 and BG2 between BN1 and BN2; each BG has a control node (Cx) exchanging control messages and data nodes (Dxx, or Dnodes: D11, D12, D21, D22) carrying IP Multicast or CDN data streams between sources and receivers.]
NativeCast
Encapsulates all BN-specific customization.
Interface to the local broadcast capability:
Send and receive broadcast data
Allocate and reclaim local broadcast addresses
Subscribe to and unsubscribe from local broadcast sessions
Implement "Mediator" functionality – intercept and reply to local JOINs
Get SROUTE values
Exists on control and data nodes.
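A minimal sketch of the NativeCast interface as a C++ abstract class; the responsibilities follow the list above, but the signatures are assumptions, not the implemented API.

// Illustrative sketch of the NativeCast interface. The exact signatures are
// assumptions; only the responsibilities come from the slide above.
#include <string>
#include <vector>

class NativeCast {
 public:
    virtual ~NativeCast() = default;

    // Send and receive broadcast data inside the local BN.
    virtual void Send(const std::string& local_addr,
                      const std::vector<char>& data) = 0;
    virtual std::vector<char> Receive(const std::string& local_addr) = 0;

    // Allocate and reclaim local broadcast addresses for federated sessions.
    virtual std::string AllocateAddress(const std::string& session) = 0;
    virtual void ReclaimAddress(const std::string& local_addr) = 0;

    // Subscribe to / unsubscribe from local broadcast sessions.
    virtual void Subscribe(const std::string& local_addr) = 0;
    virtual void Unsubscribe(const std::string& local_addr) = 0;

    // "Mediator" functionality: intercept local JOINs and reply to them.
    virtual void OnLocalJoin(const std::string& session) = 0;

    // Provide SROUTE values for sessions owned by this BN.
    virtual std::vector<std::string> GetSroutes(const std::string& session) = 0;
};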
Implementation
Linux/C++ event-driven program.
Best-effort forwarding.
NativeCast implemented for IP Multicast, a simple HTTP-based CDN, and SSM.
Each NativeCast implementation is ~700 lines of code.
Tested scalability of the clustered BG (throughput, sessions) using the HTTP-based NativeCast.
Used the Millennium cluster.
Experimental Setup
No. of sources = no. of sinks = no. of Dnodes (so that sources/sinks don't become the bottleneck).
440 Mbps raw TCP throughput.
500 MHz PIIIs; 1 Gbps NICs.
>50 Gbps switch.
Sources of two types: rate-limited and unlimited.
Note: IPMul is based on UDP; the CDN is based on HTTP (over TCP).
[Diagram: experimental topology, same clustered-BG setup as above (BG1 and BG2 between BN1 and BN2, each with a control node and Dnodes).]
Results
Vary number of data nodes, use one session per data node.
Near-linear throughput scaling.
Gigabit speed achieved.
Better with larger message size.
Note: maximum (TCP-based) throughput achievable using different data message (framing) sizes is shown above.
Multiple Sessions
Variation of total throughput as the no. of sessions is increased to several sessions per Dnode is shown; the sources are rate-unlimited.
High throughput is sustained when the no. of sessions is large.
[Plots: with 1 Dnode and with 5 Dnodes.]
Multiple Sessions …
Rate-limited sources (<103Kbps).
5 Dnodes, 1 KB message size.
No significant reduction in throughput.
Achieves a large number of sessions + high throughput for large message sizes.
Future Work
Transport-layer modules (e.g. SRM local recovery).
Wide-area deployment?
Links
“Broadcast Federation: An Application Layer Broadcast Internetwork” – Yatin Chawathe, Mukund Seshadri (NOSSDAV’02) http://www.cs.berkeley.edu/~mukunds/bfed/nossdav02.ps.gz
This presentation: http://www.cs.berkeley.edu/~mukunds/bfed/bfed-retreat.ppt
Extra Slides…
SROUTEs…
…are session-specific routes to the source in the owner BN.
All BGs in the owner BN know all SROUTEs for owned sessions.
An SROUTE-Response gives all SROUTEs.
Downstream BGs can cache this value to reduce SROUTE traffic.
Downstream BG(s) compute the best target BG in the owner BN and send JOINs towards that BG.
JOINs contain the SROUTEs received earlier.
Session info is sent only to interested BNs.
X Increases initial setup latency
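A minimal sketch of the two-phase join using SROUTEs (fetch and cache SROUTEs, pick the best target BG, then send a JOIN carrying the SROUTE); the helper functions and types are hypothetical stand-ins for the real control channel.

// Illustrative sketch of the two-phase join using SROUTEs. All types and
// helpers are assumptions used only to show the control flow.
#include <map>
#include <string>
#include <vector>

struct Sroute {
    std::string target_bg;  // BG in the owner BN that can reach the source
    double cost;            // cost of reaching the source via that BG
};

// Hypothetical stand-ins for the real control channel.
std::vector<Sroute> SendSrouteRequest(const std::string& /*session*/) { return {}; }
void SendJoin(const std::string& /*session*/, const Sroute& /*route*/) {}

void JoinSession(const std::string& session,
                 std::map<std::string, std::vector<Sroute>>& sroute_cache) {
    // Phase 1: obtain SROUTEs (from the cache if possible, to cut SROUTE traffic).
    auto it = sroute_cache.find(session);
    if (it == sroute_cache.end())
        it = sroute_cache.emplace(session, SendSrouteRequest(session)).first;
    const std::vector<Sroute>& sroutes = it->second;
    if (sroutes.empty()) return;  // no route to the source

    // Pick the best target BG in the owner BN.
    const Sroute* best = &sroutes.front();
    for (const Sroute& s : sroutes)
        if (s.cost < best->cost) best = &s;

    // Phase 2: send the JOIN towards that BG, carrying the SROUTE received earlier.
    SendJoin(session, *best);
}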
[Diagram: two-phase join across BGs, BNs, and peering links between client and source; Phase 1 – SROUTE-Request, SROUTE-Response, and REDIRECT; Phase 2 – JOIN and TRANSLATION.]
More Results
Varied data message size from 64 bytes to 64 KB.
1 Dnode.
Clearly, larger message sizes are better, due to per-message forwarding overhead (memcpys, syscalls, etc.).
Some More Results
Used 796 MHz PIIIs as Dnodes.
Varied the no. of Dnodes, with a single session per Dnode.
Achieved Gigabit-plus speeds with 4 Dnodes.