Hardware Lifecycle at Scale - Open Compute Project

Date post: 24-Nov-2021
Page 1: Hardware Lifecycle at Scale - Open Compute Project
Page 2: Hardware Lifecycle at Scale - Open Compute Project

Hardware Lifecycle at Scale

Brian Dodds, Craig Ross

Facebook

Page 3: Hardware Lifecycle at Scale - Open Compute Project

Agenda

1. Facebook’s Infrastructure Evolution

2. Hardware Lifecycle

3. Learnings

4. Wrap Up

Page 4: Hardware Lifecycle at Scale - Open Compute Project

Facebook's Infrastructure Evolution

Page 5: Hardware Lifecycle at Scale - Open Compute Project

Facebook’s Growth

(one value per product line; the product labels are not in the transcript)

• 2010: 600M

• 2012: 1B; Intro; Acquisition

• 2014: 1.3B; 200M; 200M; Acquisition

• 2016: 1.65B; 900M; 500M; 1B

Page 6: Hardware Lifecycle at Scale - Open Compute Project

Facebook’s Scale Today

Each day:

• Billions of photo and video uploads

• Trillions of user requests

• Tens of trillions of database queries

• 100s of trillions of cache queries

Huge demands on servers, storage, network, and power

Page 7: Hardware Lifecycle at Scale - Open Compute Project

Why Build Our Own Hardware?

Advantages

• Faster response to growth demands

• Optimize end-to-end (Application->Power->Thermal)

• Highest Operational Efficiency

• Commodity components

Be Open

Page 8: Hardware Lifecycle at Scale - Open Compute Project

The Facebook Datacenter

Page 9: Hardware Lifecycle at Scale - Open Compute Project

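Infrastructure Evolution

Timeline, 2010–2016 (placement of each item on the timeline is not recoverable from the transcript):

• Open Compute Project Launch

• Hardware: Compute

• Hardware: Storage

• Hardware: Network

• Fabric

• Data center sites: PRN, LLA, FRC, ATN, FTW, CLN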

Page 10: Hardware Lifecycle at Scale - Open Compute Project

Hardware Evolution

Timeline, 2010–2016:

• Rack & Power: Freedom triplet, Open Rack V1, Open Rack V2

• Compute: Freedom, Windmill, Winterfell, Leopard, Yosemite

• Storage: Knox, Honey Badger, BluRay, Lightning

• Network Switch: Wedge

• Network: Back Pack

• GPU: Big Sur

Page 11: Hardware Lifecycle at Scale - Open Compute Project

Facebook Datacenters

Page 12: Hardware Lifecycle at Scale - Open Compute Project

Hardware Lifecycle

Infrastructure @ Scale

Page 13: Hardware Lifecycle at Scale - Open Compute Project

Hack → Design → Build → Deploy → Sustain → Decom

Page 14: Hardware Lifecycle at Scale - Open Compute Project

Page 15: Hardware Lifecycle at Scale - Open Compute Project

Page 16: Hardware Lifecycle at Scale - Open Compute Project

Page 17: Hardware Lifecycle at Scale - Open Compute Project
Page 18: Hardware Lifecycle at Scale - Open Compute Project

Chassis Level Assembly

Rack Assembly (in Region)

Data Centers

Page 19: Hardware Lifecycle at Scale - Open Compute Project

Component Level Manufacturing

Chassis + Rack Level Assembly (in Region)

Data Centers

Page 20: Hardware Lifecycle at Scale - Open Compute Project

Page 21: Hardware Lifecycle at Scale - Open Compute Project
Page 22: Hardware Lifecycle at Scale - Open Compute Project

Page 23: Hardware Lifecycle at Scale - Open Compute Project

Page 24: Hardware Lifecycle at Scale - Open Compute Project

Learnings

Page 25: Hardware Lifecycle at Scale - Open Compute Project

Hardware Evolution

Timeline, 2010–2016:

• Rack & Power: Freedom triplet, Open Rack V1, Open Rack V2

• Compute: Freedom, Windmill, Winterfell, Leopard, Yosemite

• Storage: Knox, Honey Badger, BluRay, Lightning

• Network Switch: Wedge

• Network: Six Pack

• GPU: Big Sur

Page 26: Hardware Lifecycle at Scale - Open Compute Project

Learnings – Sensors

Issues: BMC and PSU monitoring woes

Learnings: Improve monitoring of critical sensors.

Page 27: Hardware Lifecycle at Scale - Open Compute Project

Learnings – Supply Chain/Application

Issues: Single-sourced epidemic failure. App performance issues. Row Hammer.

Learnings: Multi-source components, robust app testing @ scale, improve component monitoring.

Page 28: Hardware Lifecycle at Scale - Open Compute Project

Learnings – DC Tooling

Issues: Shipped hardware before all tooling was finished – Idle HW.

Learnings: Make tooling a first-class citizen for phase exit.

Page 29: Hardware Lifecycle at Scale - Open Compute Project

Hardware Eventually Fails

Page 30: Hardware Lifecycle at Scale - Open Compute Project

Robust Infrastructure

Monitor → Alarm → Remediate → Design Feedback

Page 31: Hardware Lifecycle at Scale - Open Compute Project

Page 32: Hardware Lifecycle at Scale - Open Compute Project

Monitoring

Many servers, components, services, and regions

Page 33: Hardware Lifecycle at Scale - Open Compute Project

Monitoring

Failure Rate

Page 34: Hardware Lifecycle at Scale - Open Compute Project

Monitoring

Error Types

Page 35: Hardware Lifecycle at Scale - Open Compute Project

Monitoring

Filters
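
As a rough illustration of the three monitoring views above (failure rate, error types, filters), here is a minimal Python sketch. The event fields (`server`, `error_type`, `region`) and the function names are hypothetical, chosen only for this example; the slides do not describe the actual tooling or its data schema.

```python
from collections import Counter


def matches(event, filters):
    """True if the event has the requested value for every filter key."""
    return all(event.get(key) == value for key, value in filters.items())


def failure_rate(events, fleet_size, **filters):
    """Fraction of the fleet with at least one matching failure event.

    `fleet_size` must describe the same scope as the filters
    (for example, the number of servers in the selected region).
    """
    if fleet_size <= 0:
        raise ValueError("fleet_size must be positive")
    failed_servers = {e["server"] for e in events if matches(e, filters)}
    return len(failed_servers) / fleet_size


def error_type_counts(events, **filters):
    """Count matching events per error type."""
    return Counter(e["error_type"] for e in events if matches(e, filters))
```

A call such as `failure_rate(events, fleet_size=500, region="r1")` narrows the view to one region, while `error_type_counts(events)` gives the breakdown across the whole event log.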

Page 36: Hardware Lifecycle at Scale - Open Compute Project

Page 37: Hardware Lifecycle at Scale - Open Compute Project

Alarms

Anomaly Detection

Anomaly Within Cohorts

Gradual Increases and Sudden Spikes
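
One way to picture the alarm shapes named above (sudden spikes, gradual increases, and anomalies within cohorts) is the Python sketch below. It is not the method used in the talk: the window sizes, thresholds, and function names are assumptions, and the median/MAD comparison across cohorts is a generic technique swapped in for illustration.

```python
from statistics import mean, median


def sudden_spike(series, window=7, factor=3.0, floor=1e-6):
    """True if the latest value exceeds `factor` times the median of the
    `window` values immediately before it."""
    if len(series) <= window:
        return False  # not enough history to form a baseline
    baseline = median(series[-window - 1:-1])
    return series[-1] > factor * max(baseline, floor)


def gradual_increase(series, window=7, growth=1.5, floor=1e-6):
    """True if the mean of the last `window` values exceeds `growth` times
    the mean of the first `window` values."""
    if len(series) < 2 * window:
        return False
    early = mean(series[:window])
    late = mean(series[-window:])
    return late > growth * max(early, floor)


def cohort_outliers(rates, k=3.0, floor=1e-6):
    """Names of cohorts whose rate is more than `k` median absolute
    deviations (MADs) above the median rate of all cohorts."""
    if not rates:
        return []
    values = list(rates.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    spread = max(mad, floor)  # avoid a zero threshold when cohorts agree
    return sorted(name for name, rate in rates.items() if rate - med > k * spread)
```

A slow ramp can pass the spike check on every individual day and still trip `gradual_increase`, which is why the two checks are kept separate in this sketch.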

Page 38: Hardware Lifecycle at Scale - Open Compute Project

Page 39: Hardware Lifecycle at Scale - Open Compute Project

Remediation

• Phase 1: Root Cause Analysis

• Phase 2: Review Remediation Plan

• Phase 3: Implement Remediation

The Journey is 1% Finished

[Chart: vertical axis 0% to 40%; horizontal axis Mar 2014 to Now]

Page 40: Hardware Lifecycle at Scale - Open Compute Project

Page 41: Hardware Lifecycle at Scale - Open Compute Project

Design Improvements

HDD Slot Temperature vs. Swap Rate

Higher temps. More swaps.
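
A relationship like the one above could be checked with a per-slot summary and a correlation coefficient. The Python sketch below is only an illustration under assumed inputs: the record layout `(slot, temperature_c, swapped)` and the function names are hypothetical, and no data from the talk is reproduced here.

```python
from collections import defaultdict
from math import sqrt


def slot_stats(records):
    """Map each slot to (mean temperature, swap rate).

    `records` is an iterable of (slot, temperature_c, swapped) tuples,
    one per drive observation; `swapped` is True if the drive was replaced.
    """
    temps = defaultdict(list)
    swaps = defaultdict(int)
    for slot, temperature_c, swapped in records:
        temps[slot].append(temperature_c)
        swaps[slot] += int(swapped)
    return {s: (sum(t) / len(t), swaps[s] / len(t)) for s, t in temps.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(vx * vy)
```

Feeding the mean temperatures and swap rates from `slot_stats` into `pearson` gives a value near +1 when hotter slots see more swaps. A correlation alone does not establish cause, so a result like this would be a prompt for a design review rather than a conclusion.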

Page 42: Hardware Lifecycle at Scale - Open Compute Project

Wrap Up

Page 43: Hardware Lifecycle at Scale - Open Compute Project

Key takeaways

• FB scale is growing. Infrastructure needs to innovate

• Move fast and adapt with robust HW lifecycle

• Everything fails – minimize impact with tooling

Page 44: Hardware Lifecycle at Scale - Open Compute Project
