Hardware Lifecycle at Scale
Brian Dodds, Craig Ross
Agenda
1 Facebook’s Infrastructure Evolution
2 Hardware Lifecycle
3 Learnings
4 Wrap Up
Facebook's Infrastructure Evolution
Facebook’s Growth
2010: 600M
2012: 1B, Intro, Acquisition
2014: 1.3B, 200M, 200M, Acquisition
2016: 1.65B, 900M, 500M, 1B
Facebook’s Scale Today
Each Day:
• Billions of photo and video uploads
• Trillions of user requests
• Tens of trillions of database queries
• 100s of trillions of cache queries
Huge demands on servers, storage, network, and power
Why Build Our Own Hardware?
Advantages
• Faster response to growth demands
• Optimize end-to-end (Application->Power->Thermal)
• Highest Operational Efficiency
• Commodity components
Be Open
The Facebook Datacenter
Infrastructure Evolution
[Timeline, 2010 to 2016: Open Compute Project Launch; Hardware: Compute; Hardware: Storage; Hardware: Network; Fabric]
Data center sites: PRN, LLA, FRC, ATN, FTW, CLN
Hardware Evolution
[Timeline, 2010 to 2016]
Rack & Power: Freedom triplet, Open Rack V1, Open Rack V2
Compute: Freedom, Windmill, Winterfell, Leopard, Yosemite
Storage: Knox, Honey Badger, BluRay, Lightning
Network Switch: Wedge
Network: Back Pack
GPU: Big Sur
Facebook Datacenters
Hardware Lifecycle
Infrastructure @ Scale
Hack, Design, Build, Deploy, Sustain, Decom
Chassis Level Assembly
Rack Assembly (in Region)
Data Centers
Component Level Manufacturing
Chassis + Rack Level Assembly (in Region)
Data Centers
Learnings
Hardware Evolution
[Timeline, 2010 to 2016]
Rack & Power: Freedom triplet, Open Rack V1, Open Rack V2
Compute: Freedom, Windmill, Winterfell, Leopard, Yosemite
Storage: Knox, Honey Badger, BluRay, Lightning
Network Switch: Wedge
Network: Six Pack
GPU: Big Sur
Learnings – Sensors
Issues: BMC and PSU monitoring woes
Learnings: Improve monitoring of critical sensors.
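A minimal sketch of what a check on critical sensors might look like, assuming readings are already collected somewhere. The sensor names, limits, and the `SensorReading` type are hypothetical and are not taken from the talk. The check treats a missing or stale reading as a problem in its own right, not only an out-of-range value.

```python
from dataclasses import dataclass


@dataclass
class SensorReading:
    name: str           # hypothetical sensor name, e.g. "psu1_temp_c"
    value: float        # latest reported value
    age_seconds: float  # time since the reading was last refreshed


# Hypothetical (low, high) limits per critical sensor; not from the talk.
LIMITS = {"psu1_temp_c": (0.0, 70.0), "psu1_vout": (11.4, 12.6)}
MAX_AGE_SECONDS = 60.0


def check_sensors(readings):
    """Return (sensor, problem) pairs for missing, stale, or out-of-range sensors."""
    seen = {reading.name: reading for reading in readings}
    problems = []
    for name, (low, high) in LIMITS.items():
        reading = seen.get(name)
        if reading is None:
            problems.append((name, "missing"))
        elif reading.age_seconds > MAX_AGE_SECONDS:
            problems.append((name, "stale"))
        elif not (low <= reading.value <= high):
            problems.append((name, "out_of_range"))
    return problems


# Made-up readings, for illustration only.
example_problems = check_sensors([SensorReading("psu1_temp_c", 85.0, 5.0)])
```

In this example the PSU temperature is flagged as out of range and the PSU voltage, which never reported, is flagged as missing.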
Learnings – Supply Chain/Application
Issues: Single-sourced epidemic failure. App performance issues. Row Hammer.
Learnings: Multi-source components, robust app testing @ scale, improve component monitoring.
Learnings – DC Tooling
Issues: Shipped hardware before all tooling was finished – Idle HW.
Learnings: Make tooling a first-class citizen for phase exit.
Hardware Eventually Fails
Robust Infrastructure
Monitor, Alarm, Remediate, Design Feedback
Monitoring
Many servers, components, services, and regions
Monitoring
Failure Rate
Monitoring
Error Types
Monitoring
Filters
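The monitoring views above (failure rate, error types, filters) can be sketched as simple aggregations. This is an illustration under assumptions, not the tooling shown in the talk: the record fields (`id`, `region`, `model`, `server_id`, `error_type`) and all values are hypothetical.

```python
from collections import Counter


def failure_rates(servers, failures, key):
    """Fraction of servers with at least one failure, grouped by `key`."""
    totals = Counter(server[key] for server in servers)
    failed_ids = {failure["server_id"] for failure in failures}
    failed = Counter(server[key] for server in servers if server["id"] in failed_ids)
    return {group: failed[group] / total for group, total in totals.items()}


def error_type_counts(failures):
    """Number of failure records per error type."""
    return Counter(failure["error_type"] for failure in failures)


# Made-up inventory and failure records, for illustration only.
servers = [
    {"id": 1, "region": "region-a", "model": "model-x"},
    {"id": 2, "region": "region-a", "model": "model-y"},
    {"id": 3, "region": "region-b", "model": "model-x"},
]
failures = [
    {"server_id": 1, "error_type": "disk"},
    {"server_id": 1, "error_type": "memory"},
]

rates_by_region = failure_rates(servers, failures, "region")  # filter by region
rates_by_model = failure_rates(servers, failures, "model")    # filter by model
errors = error_type_counts(failures)
```

Passing a different `key` plays the role of a filter: the same failure data can be sliced by region, model, or any other attribute present on the server records.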
Alarms
Anomaly Detection
Anomaly Within Cohorts
Gradual Increases and Sudden Spikes
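One possible way to express the three alarm ideas above, as a rough sketch. The talk does not describe the actual detection method; the median-based cohort comparison, the window sizes, and the factors below are all assumptions chosen for illustration.

```python
from statistics import mean, median


def cohort_outliers(rates, factor=3.0, floor=0.01):
    """Names whose failure rate exceeds max(factor x cohort median, floor)."""
    threshold = max(factor * median(rates.values()), floor)
    return sorted(name for name, rate in rates.items() if rate > threshold)


def sudden_spike(series, window=7, factor=3.0):
    """True if the latest value exceeds factor x the mean of the previous `window` values."""
    if len(series) <= window:
        return False
    baseline = mean(series[-window - 1:-1])
    return series[-1] > factor * baseline


def gradual_increase(series, window=7, factor=1.5):
    """True if the mean of the latest `window` values exceeds factor x the mean of the window before it."""
    if len(series) < 2 * window:
        return False
    recent = mean(series[-window:])
    earlier = mean(series[-2 * window:-window])
    return recent > factor * earlier


# Made-up failure rates for racks in one cohort, for illustration only.
cohort = {"rack1": 0.010, "rack2": 0.012, "rack3": 0.011, "rack4": 0.090}
flagged = cohort_outliers(cohort)

# Made-up daily failure counts, for illustration only.
spiky = [1, 1, 1, 1, 1, 1, 1, 10]
creeping = [1] * 7 + [2] * 7
```

Comparing each member only against its own cohort keeps hardware that normally runs at different failure rates from masking or triggering alarms for one another. The two time-series checks are separate because a slow climb like `creeping` never trips the spike test.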
Remediation
• Phase 1: Root Cause Analysis
• Phase 2: Review Remediation Plan
• Phase 3: Implement Remediation
The Journey is 1% Finished
[Chart: y-axis 0% to 40%; x-axis Mar 2014, Jul 2014, Nov 2014, Mar 2015, Jul 2015, Now]
Design Improvements
HDD Slot Temperature vs. Swap Rate
Higher temps.
More swaps.
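A relationship like "higher temps, more swaps" could be quantified with a correlation coefficient across drive slots. The sketch below computes a plain Pearson correlation. The per-slot numbers are made up for illustration and are not data from the talk.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


# Made-up per-slot values, for illustration only (not measurements from the talk).
slot_temps_c = [30.0, 35.0, 40.0, 45.0]
slot_swap_rates = [1.0, 2.0, 3.0, 4.0]

r = pearson(slot_temps_c, slot_swap_rates)
```

A value of `r` close to 1 would indicate that hotter slots see more swaps. Correlation alone does not establish that temperature causes the failures, so it is best read as a pointer toward a design question, such as airflow around the hottest slots.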
Wrap Up
Key takeaways
• FB scale is growing. Infrastructure needs to innovate
• Move fast and adapt with robust HW lifecycle
• Everything fails – minimize impact with tooling