Page 1:

Computer Networks
Datacenter Networks
Lin Gu

Page 2:

Rack-Mounted Servers: Sun Fire x4150 1U server

Page 3:

Scale Up vs. Scale Out

[Figure: the server spectrum, from scale-up machines (SMP super server, departmental server, personal system) to scale-out machines (cluster of PCs, MPP)]

Page 4:

Data center networks

• 10's to 100's of thousands of hosts, often closely coupled, in close proximity:
  – e-business (e.g., Amazon)
  – content servers (e.g., YouTube, Akamai, Apple, Microsoft)
  – search engines, data mining (e.g., Google)
• challenges:
  – multiple applications, each serving massive numbers of clients
  – managing/balancing load, avoiding processing, networking, and data bottlenecks

[Photo: inside a 40-ft Microsoft container, Chicago data center]

Page 5:

Data center networks

[Figure: hierarchical topology — server racks 1-8 with TOR switches, connected through tier-2 switches, tier-1 switches, and load balancers (A, B, C) to an access router, a border router, and the Internet]

• load balancer: application-layer routing
  – receives external client requests
  – directs workload within data center
  – returns results to external client (hiding data center internals from client; see the sketch below)
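A minimal sketch of that request path in Python; the backend addresses and the handle_request/forward helpers are illustrative stand-ins, not any real appliance's API:

```python
import itertools

# Hypothetical internal server pool; clients only ever see the load
# balancer's address, never these.
BACKENDS = ["10.0.1.1", "10.0.1.2", "10.0.1.3"]
_next_backend = itertools.cycle(BACKENDS)

def forward(backend: str, request: bytes) -> bytes:
    # Placeholder for the internal hop (RPC/HTTP) to a server rack.
    return b"response via " + backend.encode()

def handle_request(request: bytes) -> bytes:
    backend = next(_next_backend)     # direct workload within the DC
    return forward(backend, request)  # return the result to the client

print(handle_request(b"GET /"))
```

Round-robin is the simplest policy; production balancers also weigh server load and health.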

Page 6:

Data center networks

[Figure: the same server racks 1-8 and TOR switches, now richly interconnected through tier-1 and tier-2 switches]

• rich interconnection among switches, racks:
  – increased throughput between racks (multiple routing paths possible)
  – increased reliability via redundancy

Page 7:

Low Earth Orbit networks

• Some unsuccessful earlier attempts (Iridium, …); such systems may come back in the future
• It does not have to be a satellite (Google Loon, …)
• Wireless communication is convenient and can be high-bandwidth; satellites can be an effective solution

Page 8:

Deep space communication

• Extremely long latency
• What protocols work?
• How to build the transceivers?
• Befriend physicists
• How to communicate in the Solar System, in the Galaxy, or in deeper space?

Page 9:

"The network is the computer" – Sun Microsystems

Page 10:

Appendix

Page 11:

Motivations for using Clusters over Specialized Parallel Computers

• Individual PCs are becoming increasingly powerful
• Communication bandwidth between PCs is increasing and latency is decreasing (Gigabit Ethernet, Myrinet)
• PC clusters are easier to integrate into existing networks
• Typical user utilization of PCs is low (<10%)
• Development tools for workstations and PCs are mature
• PC clusters are cheap and readily available
• Clusters can be easily grown

Page 12:

Cluster Architecture

[Figure: sequential and parallel applications run on a parallel programming environment over cluster middleware (single system image and availability infrastructure); the middleware spans multiple PCs/workstations, each with communications software and network interface hardware, joined by a cluster interconnection network/switch]

Page 13:

Major Components of a Datacenter

• Computing hardware (equipment racks)

• Power supply and distribution hardware

• Cooling hardware and cooling fluid distribution hardware

• Network infrastructure

• IT Personnel and office equipment

Datacenter Networking

Page 14:

Growth Trends in Datacenters

• Load on network & servers continues to grow rapidly
  – Rough estimate of annual growth rate: enterprise datacenters ~35%, Internet datacenters 50%-100%
  – Information access anywhere, anytime, from many devices: desktops, laptops, PDAs & smart phones, sensor networks, proliferation of broadband
• Mainstream servers moving toward higher-speed links
  – 1 GbE to 10 GbE in 2008-2009
  – 10 GbE to 40 GbE in 2010-2012
• High-speed datacenter-MAN/WAN connectivity
  – High-speed datacenter syncing for disaster recovery

Datacenter Networking

Page 15:

• Networking is a large part of the total cost of the DC hardware
  – Large routers and high-bandwidth switches are very expensive
• Relatively unreliable – many components may fail
• Many major operators and companies design their own datacenter networking to save money and improve reliability/scalability/performance
  – The topology is often known
  – The number of nodes is limited
  – The protocols used in the DC are known
• Security is simpler inside the data center, but challenging at the border
• We can distribute applications to servers to distribute load and minimize hot spots

Datacenter Networking

Page 16:

Networking components (examples)

• High-performance & high-density switches & routers
  – Scaling to 512 10GbE ports per chassis
  – No need for proprietary protocols to scale
• Highly scalable DC border routers
  – 3.2 Tbps capacity in a single chassis
  – 10 million routes, 1 million in hardware
  – 2,000 BGP peers
  – 2K L3 VPNs, 16K L2 VPNs
  – High port density for GE and 10GE application connectivity
  – Security

[Figure: 768 1-GE ports downstream, 64 10-GE ports upstream]

Datacenter Networking

Page 17:

Common data center topology

[Figure: the Internet at the top; inside the datacenter, layer-3 routers (core), layer-2/3 switches (aggregation), layer-2 switches (access), and servers at the bottom]

Datacenter Networking

Page 18:

Data center network design goals
• High network bandwidth, low latency
• Reduce the need for large switches in the core
• Simplify the software; push complexity to the edge of the network
• Improve reliability
• Reduce capital and operating cost

Datacenter Networking

Page 19:

[Two photographs of datacenter cabling: "Avoid this…" and "and simplify this…"]

Datacenter Networking

Page 20:

Can we avoid using high-end switches?

• Expensive high-end switches are needed to scale up
• Single point of failure and bandwidth bottleneck
  – Experiences from real systems
• One answer: DCell

Interconnect

Page 21:

DCell Ideas
• #1: Use mini-switches to scale out
• #2: Leverage servers to be part of the routing infrastructure
  – Servers have multiple ports and need to forward packets
• #3: Use recursion to scale, and build a complete graph to increase capacity (see the sketch below)

Interconnect
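A small sketch of how fast idea #3 grows, following the recursive construction described in the DCell paper (DCell_0 is n servers on one mini-switch; DCell_k joins t_{k-1}+1 copies of DCell_{k-1} into a complete graph); the function name is ours:

```python
def dcell_servers(n: int, k: int) -> int:
    """Server count t_k of a DCell_k built from n-port mini-switches.

    t_0 = n; t_k = t_{k-1} * (t_{k-1} + 1), since DCell_k connects
    t_{k-1} + 1 copies of DCell_{k-1} into a complete graph.
    """
    t = n
    for _ in range(k):
        t = t * (t + 1)
    return t

# Doubly exponential growth: with 4-port mini-switches,
# DCell_3 already holds 176,820 servers.
for k in range(4):
    print(k, dcell_servers(4, k))   # 4, 20, 420, 176820
```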

Page 22:

One approach: a switched network with a hypercube interconnect

• Leaf switch: 40 1-Gbps ports + 2 10-Gbps ports
  – One switch per rack
  – Not replicated (if a switch fails, we lose one rack of capacity)
• Core switch: 10 10-Gbps ports
  – The core switches form a hypercube
• Hypercube: the n-dimensional analogue of a cube (2^n nodes, each linked to n neighbors)

Data Center Networking

Page 23:

Hypercube properties
• Minimum hop count
• Even load distribution for all-to-all communication
• Can route around switch/link failures
• Simple routing (see the sketch below):
  – Outport = f(Dest xor NodeNum)
  – No routing tables

Interconnect
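A sketch of the XOR rule in Python; choosing the lowest differing bit as f is one common instance (our illustration), and the hop count equals the number of differing bits, hence minimal:

```python
def route(node: int, dest: int) -> list[int]:
    """Route in a hypercube by repeatedly crossing the lowest
    dimension in which the current node differs from the destination.
    Outport = position of a set bit in (dest XOR node); no tables."""
    path = [node]
    while node != dest:
        diff = node ^ dest
        outport = (diff & -diff).bit_length() - 1  # lowest differing dim
        node ^= 1 << outport                       # cross that dimension
        path.append(node)
    return path

# 0b0000 -> 0b1011 differs in 3 bits, so 3 hops: 0 -> 1 -> 3 -> 11.
print(route(0b0000, 0b1011))
# On a failure, any other differing dimension also makes progress,
# which is what lets the topology route around broken switches/links.
```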

Page 24:

A 16-node (dimension-4) hypercube

[Figure: 16 nodes numbered 0-15 arranged as a 4-D hypercube; each link is labeled with its dimension (0-3)]

Interconnect

Page 25:

64-switch Hypercube

[Figure: one container — Level 0: 32 40-port 1 Gb/sec switches (1280 Gb/sec of server links); Level 1: 8 10-port 10 Gb/sec switches forming four 4×4 sub-cubes (64 10 Gb/sec links, 16 links per sub-cube); Level 2: 2 10-port 10 Gb/sec switches (16 10 Gb/sec links); 63 × 4 links to other containers, 4 links per container]

Interconnect

How many servers can be connected in this system?
• 81,920 servers with 1 Gbps bandwidth each
• Core switch: 10-Gbps port × 10
• Leaf switch: 1-Gbps port × 40 + 10-Gbps port × 2
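One way to arrive at the 81,920 figure, assuming (as the slide suggests) 64 containers, each holding 32 leaf switches with 40 1-Gbps server ports:

```python
# Back-of-envelope check of the slide's numbers (assumed layout:
# one container per node region of the 64-switch hypercube fabric).
leaf_switches_per_container = 32     # Level 0: 40-port 1 Gb/s switches
server_ports_per_leaf = 40           # 1 Gb/s each
containers = 64

servers_per_container = leaf_switches_per_container * server_ports_per_leaf
print(servers_per_container)              # 1280 (matches "1280 Gb/sec links")
print(servers_per_container * containers) # 81920 servers at 1 Gbps
```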

Page 26:

The Black Box

Data Center Networking

Page 27:

Typical Layer 2 & Layer 3 in existing systems

• Layer 2
  – One spanning tree for the entire network
    • Prevents looping
    • Ignores alternate paths
• Layer 3
  – Shortest-path routing between source and destination
  – Best-effort delivery

Interconnect

Page 28:

Problems with the common DC topology
• Single point of failure
• Oversubscription of links higher up in the topology
  – Trade-off between cost and provisioning
• Layer 3 will only use one of the existing equal-cost paths
• Packet reordering occurs if layer 3 blindly takes advantage of path diversity

Interconnect

Page 29:

Fat-tree based solution

Connect hosts together using a fat-tree topology:
– Infrastructure consists of cheap devices
  • Each port supports the same speed as the end host
– All devices can transmit at line speed if packets are distributed along existing paths
– A k-ary fat-tree is composed of switches with k ports
  • How many switches? … 5k²/4
  • How many connected hosts? … k³/4
(See the sketch below.)

Interconnect
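The counts follow directly from the pod structure; a minimal sketch (function name ours):

```python
def fat_tree(k: int) -> dict[str, int]:
    """Component counts for a k-ary fat-tree (k even): 5k^2/4
    identical k-port switches connect k^3/4 hosts at line rate."""
    return {
        "pods": k,
        "edge_switches": k * (k // 2),         # k/2 per pod
        "aggregation_switches": k * (k // 2),  # k/2 per pod
        "core_switches": (k // 2) ** 2,        # k^2/4
        "total_switches": 5 * k * k // 4,
        "hosts": k ** 3 // 4,                  # k/2 hosts per edge switch
    }

print(fat_tree(4))   # 20 switches, 16 hosts
print(fat_tree(48))  # 2880 commodity switches, 27648 hosts
```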

Page 30:

k-ary Fat-Tree (k=4)

Use the same type of switch in the core, aggregation, and edge layers, each switch having k ports.

[Figure: k²/4 core switches; k pods, each containing k/2 aggregation switches and k/2 edge switches; k/2 hosts per edge switch]

Interconnect

Page 31:

Fat-tree modified
• Enforce a special addressing scheme in the DC
  – Allows hosts attached to the same switch to route only through that switch
  – Allows intra-pod traffic to stay within the pod
  – Addresses have the form unused.PodNumber.SwitchNumber.EndHost
• Use two-level look-ups to distribute traffic and maintain packet ordering

Interconnect

Page 32:

2-Level look-ups

• First level is a prefix lookup
  – Used to route down the topology to the end host
• Second level is a suffix lookup
  – Used to route up towards the core
  – Diffuses and spreads out traffic
  – Maintains packet ordering by using the same ports for the same end host
(A toy illustration follows.)

Interconnect
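A toy illustration of the two-level table, assuming the unused.pod.switch.host addressing from the previous slide; the table contents and port numbers are hypothetical (real switches implement this with prefix/suffix TCAM entries):

```python
# Two-level lookup at one hypothetical edge switch (pod 4, switch 1).
PREFIX_TABLE = {             # level 1: route down to a local subnet
    (10, 4, 0): 0,           # 10.4.0.0/24 -> downlink port 0
    (10, 4, 1): 1,           # 10.4.1.0/24 -> downlink port 1
}
SUFFIX_TABLE = {0: 2, 1: 3}  # level 2: host byte picks an uplink port

def lookup(dst: tuple[int, int, int, int]) -> int:
    unused, pod, switch, host = dst
    port = PREFIX_TABLE.get((unused, pod, switch))
    if port is not None:
        return port                # prefix hit: route down
    return SUFFIX_TABLE[host % 2]  # suffix: spread flows across the
                                   # core; same host -> same port, so
                                   # packet order is preserved

print(lookup((10, 4, 1, 7)))   # local subnet -> downlink port 1
print(lookup((10, 7, 0, 3)))   # remote pod, host byte 3 -> uplink 3
```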

Page 33:

Comparison of several schemes
– Hypercube: high-degree interconnect for a large net; difficult to scale incrementally
– Butterfly and fat-tree: cannot scale as fast as DCell
– De Bruijn: cannot incrementally expand
– DCell: low bandwidth between two clusters (sub-DCells)

Interconnect

Page 34:

Distributed Systems

Sun Fire x4150 1U server

Page 35:

Global Data Center Deployment

[Figure: a DNS load-balancing system directing geographically distributed users to one of several datacenters]

Users are geographically distributed, and computation is globally optimized.

Load Balancing
• The load-balancing system regulates global data center traffic
• Incorporates site health, load, user proximity, and service response into user site selection
• Provides transparent site failover in case of disaster or service outage
• Provides site selection for users
• Harnesses the benefits and intricacies of geo-distribution
• Leverages both DNS and non-DNS methods for multi-site redundancy

Cloud and Globalization of Computation

Page 36:

Google's Search System

Computing in an LSDS:
• The browser issues a query
• DNS lookup
• HTTP handling
• GWS (Google Web Server)
• Backend
• HTTP response

[Figure: a user's HTTP request to Google.com is directed via DNS to one of several sites (San Jose, London, Hong Kong); inside each data center, GWS front ends fan the query out to the backend]

Page 37:

Google's Cluster Architecture

Goals: a high-performance distributed system for search
• Thousands of machines collaborate to handle the workload
• Price-performance ratio
• Scalability
• Energy efficiency and cooling
• High availability

Luiz André Barroso, Jeffrey Dean, Urs Hölzle. Web Search for a Planet: The Google Cluster Architecture. IEEE Micro, vol. 23, no. 2, pp. 22-28, Mar./Apr. 2003.

Page 38:

How to compute in a network?

Multiple computers on a perfect network may work like one larger traditional computer. However, computing becomes more complex when:
• messages can be lost/tampered/duplicated,
• bandwidth is limited,
• operations incur long latencies, non-uniform latencies, or both, and
• events are asynchronous.

Hence, computation in an LSDS over imperfect networks may have to be organized differently from traditional computer systems.

How to correctly compute on an imperfect network?

Page 39:

Two Generals' Problem

Two generals want to agree on a time to attack an enemy between them ("Attack at 5am."). If the attack is synchronized, the generals can defeat the enemy; otherwise, they will be defeated by the enemy one by one. The generals can send messengers to each other, but a messenger may be caught by the enemy. Can the two generals reach an agreement?

• Send a messenger, then expect the messenger to come back with one acknowledgment?
• Send 100 messengers?
• How to prove it is possible or impossible to reach an agreement? (The classic argument: in any finite protocol, the last messenger's loss must not change the outcome, so by induction no messenger matters — hence no protocol can guarantee agreement.)

Page 40:

Three Generals' Problem in Paxos

• Who decides the attack time?
• When is the decision made and agreed on?
• What if one general betrays? (The Byzantine Generals Problem.)

Paxos: reach global consensus in a distributed system with packet loss.

Further reading: [Lamport98] Leslie Lamport. The Part-Time Parliament. ACM Trans. Comput. Syst. 16, 2 (May 1998), 133-169.

Page 41:

How to maintain state in a network? The Part-Time Parliament

Priests (legislators) in a parliament communicate with each other through messengers. Both the priests and the messengers can leave the parliament Chamber at any time and may never come back. Can the parliament pass laws (decrees) and ensure consistency (no ambiguity/discrepancy about the content of a decree)?

• Priests can leave the Chamber (a server crashes or is isolated from the system) and may never come back (server failure).
• Messengers can leave the Chamber (packets are delayed or reordered) and may never come back (packet loss).

Page 42:

Paxos

Each priest keeps a record of the passed decrees (and some additional information) in his/her ledger (nonvolatile storage). Messengers deliver the candidate decree and the votes.

Protocol 1: Suppose we know there are n priests. A priest constructs a decree, sends it to the other n−1 priests, and collects their votes supporting the decree. A vote against the decree is equivalent to not voting. If there are n−1 votes for the decree, the decree is passed.

[Figure: a priest proposes "Tax=0"; every other priest replies "OK"]

Problem?

Page 43:

Paxos

• Resilient to server failures and packet loss.
• The state of a decree (passed or not passed) is defined unambiguously.
• What is a "majority"? The proposing priest may contact a "quorum" consisting of a majority of the priests.

Protocol 2: … A decree is passed when a majority votes for it.

[Figure: "Tax=0" is proposed; a majority of the priests reply "OK"]

Problem? (A quick check of the quorum-intersection property follows.)
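Why "majority" is the right quorum rule: any two majorities of the n priests share at least one member, so two conflicting decrees cannot both collect quorums that never met. An illustrative check:

```python
from itertools import combinations

n = 5
priests = range(n)
# All quorums of size > n/2.
majorities = [set(q)
              for size in range(n // 2 + 1, n + 1)
              for q in combinations(priests, size)]

# Every pair of majority quorums intersects.
assert all(a & b for a in majorities for b in majorities)
print(f"all {len(majorities)} majorities of {n} priests pairwise intersect")
```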

Page 44:

Paxos

Clients can query any priest, and the priest may know the decree. What if that particular priest does not know the decree?

Protocol 3: … Inform all priests about the passing of a decree.

[Figure: after a majority replies "OK" to "Tax=0", the proposer sends "Tax=0 done" to all priests]

Problem?

Page 45:

Paxos

If all replies agree, the decree is (perhaps) unambiguous. What if a priest in the majority set does not know?

Protocol 4: … Read from a majority set.

[Figure: a client asks "Tax = ?"; one priest answers "Tax==0", another replies "don't know"]

Problem?

Page 46:

Paxos

Will there be a majority? Can the majority be wrong? Answers to a query should be consistent (identical, or at least compatible).

Protocol 5: … Read following a majority.

[Figure: a client asks "Tax = ?"; one priest answers "Tax==0", another answers "Tax=100"]

Page 47:

Paxos

Protocol 6: Consider a single decree (e.g., tax = 0) –

1. One priest serves as the president and proposes a decree with a unique ballot number b. The president sends the ballot with the proposal to a set of priests.
2. A priest responds to the receipt of the proposal message by replying with its latest vote (LastVote) and a promise that it will not respond to any ballot whose ballot number lies between the LastVote's ballot number and b. LastVote can be null.

Page 48:

Paxos

Protocol 6 (continued):

3. After receiving promises from a majority set, the president selects the value for the decree based on the LastVotes of this set (the quorum), and sends the acceptance of the decree to the quorum.
4. The members of the quorum reply with a confirmation (vote) to the president; receiving the confirmations (votes) of all quorum members means the decree is passed.

Page 49:

Paxos

Protocol 6 (continued):

5. After receiving votes from the whole quorum, the president records the decree in its ledger and sends a success message to all the priests.
6. Upon receiving the success message, a priest records the decree in its ledger.

How do we know the protocol works correctly? (A runnable sketch follows.)
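A compact, single-process sketch of Protocol 6 (priest = acceptor, president = proposer); the class and function names are ours, message loss is modeled simply by which priests get called, and a real implementation would add stable storage, timeouts, and retries:

```python
class Priest:
    def __init__(self):
        self.promised = -1       # highest ballot number promised
        self.last_vote = None    # (ballot, value) of the latest vote
        self.ledger = None       # decree recorded once passed

    def prepare(self, b):        # step 2: promise and report LastVote
        if b > self.promised:
            self.promised = b
            return self.last_vote
        return "nack"

    def accept(self, b, value):  # step 4: vote unless a promise forbids
        if b >= self.promised:
            self.promised = b
            self.last_vote = (b, value)
            return "vote"
        return "nack"

def propose(priests, b, my_value):
    # Step 1: send ballot b to the priests (here: all reachable ones).
    promises = [(p, p.prepare(b)) for p in priests]
    quorum = [(p, lv) for p, lv in promises if lv != "nack"]
    if len(quorum) <= len(priests) // 2:
        return None                           # no majority promised
    # Step 3: adopt the value of the highest-ballot LastVote, if any.
    voted = [lv for _, lv in quorum if lv is not None]
    value = max(voted)[1] if voted else my_value
    votes = [p for p, _ in quorum if p.accept(b, value) == "vote"]
    if len(votes) <= len(priests) // 2:
        return None                           # not enough votes
    for p in priests:                         # steps 5-6: success message
        p.ledger = value
    return value

priests = [Priest() for _ in range(5)]
print(propose(priests, b=1, my_value="tax=0"))    # passes 'tax=0'
print(propose(priests, b=2, my_value="tax=100"))  # still 'tax=0'
```

The second, higher-ballot proposal returns the already-passed decree: the LastVotes force the new president to re-propose the old value, which is the property the next slides state.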

Page 50:

Paxos

Page 51:

Paxos

This leads to a system where every passed decree is the same as the first passed one.

Page 52:

Paxos

All passed decrees are identical.

Page 53:

Three Generals' Problem in Paxos

• Who decides the attack time? How to agree?
• What if one general betrays? (The Byzantine Generals Problem.)
• Can we use Paxos to solve the Three Generals' problem?

"Attack at 5am."

Page 54:

Beyond Single-Decree Paxos

Can we use Paxos to pass more than one decree?
• Multiple Paxos instances
• A sequence of instances (see the sketch below)

Further reading: [Lamport98] Leslie Lamport. The Part-Time Parliament. ACM Trans. Comput. Syst. 16, 2 (May 1998), 133-169.
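A sketch of the multi-instance idea: run one independent consensus instance per slot of a shared log and apply the results in slot order; run_paxos_instance is a stand-in for the single-decree protocol above:

```python
def run_paxos_instance(slot: int, proposal: str) -> str:
    # Stand-in: assume the single-decree protocol reaches consensus.
    return proposal

log: dict[int, str] = {}
for slot, decree in enumerate(["tax=0", "open granary", "tax=2"]):
    log[slot] = run_paxos_instance(slot, decree)

# Every replica applies the same decrees in the same (slot) order,
# which turns single-decree Paxos into a replicated state machine.
for slot in sorted(log):
    print(slot, log[slot])
```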

