Date posted: 23-Jan-2018
Category: Technology
Uploaded by: vinu-charanya
Improving efficiency of Twitter Infrastructure using Chargeback
@vinucharanya @micheal
AGENDA
• Brief History
• Problem
• Chargeback
• Engineering Challenges
• The product
• Impact
• Future
© Getty Images from http://www.fifa.com/worldcup/news/y=2010/m=7/news=pride-for-africa-spain-strike-gold-2247372.html
2010
3283 Tweets Per Sec (TPS)
5X increase on avg. TPS
©The Simpsons
MONOLITH SERVICES
(Micro) Services Oriented Model
FENCING & OWNERSHIP
Clear isolation of services and their ownership.
RELIABILITY
Failure isolation and graceful degradation
SCALABILITY & EFFICIENCY
Scale independently, ensuring efficient use of infrastructure
DEVELOPER PRODUCTIVITY
Make it simple for engineers to build and launch services quickly and easily
2013
August 2 at 7:21:50 PDT
143,199 Tweets Per Sec (TPS)
28X increase on avg. TPS
Hundreds and thousands of #events at any given instant
Most Retweeted Tweet in History
RELIABILITY • DEVELOPER AGILITY • SCALABILITY • EFFICIENCY
“Do More with Less”
Fast forward to 2016
INFRASTRUCTURE AND DATACENTER MANAGEMENT
CORE APPLICATION SERVICES
• TWEETS
• USERS
• SOCIAL GRAPH
PLATFORM SERVICES
• SEARCH
• MESSAGING & QUEUES
• CACHE
• MONITORING AND ALERTING
• REVERSE PROXY
FRAMEWORK/LIBRARIES
• FINAGLE (RPC)
• SCALDING (Map Reduce in Scala)
• HERON (Streaming Compute)
• JVM
MANAGEMENT TOOLS
• SELF SERVE
• SERVICE DIRECTORY
• CHARGEBACK
• DEPLOY (Workflows)
• CONFIG
DATA & ANALYTICS PLATFORM
• INTERACTIVE QUERY
• DATA DISCOVERY
• WORKFLOW MANAGEMENT
INFRASTRUCTURE SERVICES
• STORAGE: MANHATTAN (Key-Val Store), HDFS (File System), BLOBSTORE, GRAPH STORE
• COMPUTE: AURORA (Scheduler), HADOOP (Map-Reduce), MESOS (Cluster Manager)
THOUSANDS OF SERVICES
HUNDREDS OF TEAMS
What is the overall use of infrastructure & platform resources across Twitter’s services?
How to attribute resource consumption to teams/organization?
How do you incentivize the right behavior to improve efficiency of resource usage?
CHARGEBACK
The ability to meter the allocation and utilization of resources per service and per engineering team, and to charge each team accordingly.
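The definition above can be illustrated with a minimal sketch. The slides do not say whether teams are charged on allocation or on utilization, so billing on the allocated quantity is an assumption made here, and every name (`UsageRecord`, `team_bills`, `utilization`) and number is hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageRecord:
    """One metered sample for a service (all field names are hypothetical)."""
    team: str
    service: str
    resource: str      # e.g. "core-days"
    allocated: float   # quantity reserved for the service
    utilized: float    # quantity the service actually consumed


def team_bills(records, unit_prices):
    """Sum charges per team, billing on allocation (an assumed policy)."""
    bills = defaultdict(float)
    for r in records:
        bills[r.team] += r.allocated * unit_prices[r.resource]
    return dict(bills)


def utilization(records):
    """Return utilized / allocated per service, skipping zero allocations."""
    return {r.service: r.utilized / r.allocated
            for r in records if r.allocated > 0}


records = [
    UsageRecord("ads-serving", "adshard", "core-days",
                allocated=100.0, utilized=40.0),
    UsageRecord("ads-prediction", "ads-prediction", "core-days",
                allocated=50.0, utilized=45.0),
]
prices = {"core-days": 2.0}  # made-up price, not a figure from the talk

print(team_bills(records, prices))
print(utilization(records))
```

Reporting utilization next to the bill is what makes over-reservation visible: a team paying for 100 core-days while using 40 has an obvious incentive to hand quota back.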
UNIFIED CLOUD PLATFORM
• SERVICE IDENTITY
• RESOURCE CATALOG
• METERING AND CHARGEBACK
• SERVICE METADATA
COMPUTE • STORAGE • PLATFORM AND OTHER SERVICES
SERVICE: Tweet Service
SERVICE: Ads Shard
SERVICE: Who To Follow
RESOURCE: unit of abstraction
MULTI-TENANCY: tenant management using canonical identifiers
SERVICE IDENTITY
A canonical way of identifying a service that consumes resources on various platform infrastructure.
PROBLEM
• Disparate identifiers across infrastructure and platform services
• Multiple provisioning workflows (Self-Serve, Tickets)
• Disparate ownership trackers (Email, LDAP)
• Lack of support for public cloud Identity and Access Management (IAM) systems
Project: chargeback; Team: Cloud Infra Mgmt; Source code: /cim
COMPUTE: role: cim-service; env: prod; job_name: ui; id: <role>.<env>.<job_name>
STORAGE: app_id: cost_reporting; id: <app_id>
BATCH COMPUTE: role: cim-service; pool: etl_pipe_prod; job_name: compute_cost; id: <role>.<pool>.<job_name>
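As a rough illustration of how the three identifier formats above differ, the sketch below builds each id from its fields. The templates are taken from the slide; the dictionary keys (`compute`, `storage`, `batch_compute`) and the function name `canonical_id` are hypothetical and not part of any Twitter API.

```python
# Identifier templates copied from the slide; the dictionary keys are
# hypothetical labels for the three kinds of infrastructure.
ID_SCHEMES = {
    "compute": ("role", "env", "job_name"),
    "storage": ("app_id",),
    "batch_compute": ("role", "pool", "job_name"),
}


def canonical_id(infra: str, **fields: str) -> str:
    """Join the fields required by the infra's scheme with dots."""
    scheme = ID_SCHEMES[infra]
    missing = [name for name in scheme if name not in fields]
    if missing:
        raise ValueError(f"{infra} identifier needs fields: {missing}")
    return ".".join(fields[name] for name in scheme)


print(canonical_id("compute", role="cim-service", env="prod", job_name="ui"))
print(canonical_id("storage", app_id="cost_reporting"))
print(canonical_id("batch_compute", role="cim-service",
                   pool="etl_pipe_prod", job_name="compute_cost"))
```

The same project ends up with three unrelated-looking ids, which is the problem the identity work sets out to solve.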
OUR APPROACH
• Designed an Entity Model that
  • Defines a canonical identifier scheme across infrastructure and platform services
  • Defines an ownership structure with the org
• Single pane of glass for every developer to manage their project IDs (including abstracting out public cloud IAM systems)
• Provider APIs for infrastructure services to provision and manage identity
[Diagram labels: DASHBOARD, IDENTITY MANAGER, API, PROVISION, CONSUMPTION, INFRASTRUCTURE SERVICE (x5)]
IMPACT
• Source of truth for the identifier-to-org-structure mapping, improving service ownership within the org
• Enables service-to-service authentication/authorization
ENTITY MODEL FOR SERVICE IDENTITY
A model that provides a canonical identifier across infrastructure and platform services and ties it to an org structure.
BUSINESS OWNER 1:N TEAM 1:N PROJECT 1:N SERVICE/SYSTEM ACCOUNT 1:N <INFRA, CLIENTID>
ENTITY MODEL: EXAMPLE of services running on Aurora/Mesos
• BUSINESS OWNER: REVENUE
  • TEAM: ADS SERVING
    • PROJECT: adshard
      • SERVICE/SYSTEM ACCOUNT: adshard
        • <INFRA, CLIENTID>: <Aurora, adshard.prod.adshard>
  • TEAM: ADS PREDICTION
    • PROJECT: prediction
      • SERVICE/SYSTEM ACCOUNT: ads-prediction
        • <INFRA, CLIENTID>: <Aurora, ads-prediction.prod.campaign-x>
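One way to picture the entity model is as nested records with a reverse index from `<infra, client id>` back to the owning org. This is only a sketch: the class and function names are hypothetical, and the data is the ads example from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceAccount:
    name: str
    client_ids: list = field(default_factory=list)  # (infra, client_id) pairs


@dataclass
class Project:
    name: str
    accounts: list = field(default_factory=list)


@dataclass
class Team:
    name: str
    projects: list = field(default_factory=list)


@dataclass
class BusinessOwner:
    name: str
    teams: list = field(default_factory=list)


def ownership_index(owner):
    """Map each (infra, client_id) to (owner, team, project, account)."""
    index = {}
    for team in owner.teams:
        for project in team.projects:
            for account in project.accounts:
                for key in account.client_ids:
                    index[key] = (owner.name, team.name,
                                  project.name, account.name)
    return index


revenue = BusinessOwner("REVENUE", [
    Team("ADS SERVING", [Project("adshard", [
        ServiceAccount("adshard", [("Aurora", "adshard.prod.adshard")])])]),
    Team("ADS PREDICTION", [Project("prediction", [
        ServiceAccount("ads-prediction",
                       [("Aurora", "ads-prediction.prod.campaign-x")])])]),
])

index = ownership_index(revenue)
print(index[("Aurora", "adshard.prod.adshard")])
```

With such an index, any usage record tagged with an infrastructure-level id can be attributed to a team and business owner.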
RESOURCE CATALOG
A consistent way of identifying and inventorying the resources of various platform infrastructure.
PROBLEM
• Lack of clarity on what is available & how many resources are consumed
• Need to capture resource fluidity across infrastructure and platform services
• Better support to model abstract resources (e.g., QPS, Tweets per Second)
• Need to define TCO (Total Cost of Ownership) of a resource per unit time
COMPUTE: CPU, MEMORY, DISK
STORAGE: STORAGE IN GB, WPS, RPS
BATCH COMPUTE: CPU, FILES ACCESSED, STORAGE IN GB
CORES • MEMORY • DISK
    # Aurora task definition: the resources block is what a service reserves.
    application = Task(
      name = 'application',
      resources = Resources(cpu = 1.0, ram = 512 * MB, disk = 1024 * MB),
      processes = [stage_application, run_application],
      constraints = order(stage_application, run_application))
GPU • NETWORK: Need for Fluidity!
• Defining unit price for a resource
• Framework to price resources.
• Ensure Total Cost of Ownership, e.g. license cost, chargeback cost from other services, human cost, etc.
• Support for time granularity, e.g. machines/VMs used per day, cores used per day
Twitter Compute Platform: Total Cost of Ownership, priced at $X per core-day
• Used Cores
• Non-Prod Used Cores
• Operational Overhead
• Headroom
• Disaster Recovery & Event Spikes
• Excess Quota and Reservation
  • Underutilized Quota Allocation
  • Container Size Buffer (Underutilized Reservation)
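The slide does not give the pricing formula, so the following is only a sketch of one plausible approach: spread the full cost of the platform over the core-days that can actually be sold, so that overhead, headroom and similar capacity is folded into the unit price. The function name, the `overhead_fraction` parameter and all numbers are assumptions.

```python
def unit_price_per_core_day(total_cost, total_cores, days, overhead_fraction):
    """Spread the full platform cost over the cores that can be sold.

    overhead_fraction is the share of cores held back for operational
    overhead, headroom, disaster recovery and so on (a hypothetical input).
    """
    if not 0 <= overhead_fraction < 1:
        raise ValueError("overhead_fraction must be in [0, 1)")
    billable_core_days = total_cores * days * (1 - overhead_fraction)
    return total_cost / billable_core_days


# Made-up numbers, purely to show the arithmetic.
price = unit_price_per_core_day(
    total_cost=300_000.0, total_cores=10_000, days=30, overhead_fraction=0.25)
print(round(price, 4))
```

Under this assumption, the more capacity is held back as overhead, the higher the price per billable core-day, which keeps the total recovered cost equal to the total cost of ownership.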
ENTITY MODEL FOR RESOURCE CATALOG
A model that supports resource fluidity and captures and manages the unit price of a resource over time.
PROVIDER 1:N INFRASTRUCTURE SERVICE 1:N OFFERINGS 1:N OFFER MEASURES 1:N OFFER MEASURE COST
ENTITY MODEL: EXAMPLE of Resource Catalog
• PROVIDER: TWITTER DC/PUBLIC CLOUD
  • INFRASTRUCTURE SERVICE: AURORA
    • OFFERING: COMPUTE
      • OFFER MEASURE: CORE-DAYS
        • OFFER MEASURE COST: $X
• PROVIDER: TWITTER DC
  • INFRASTRUCTURE SERVICE: HADOOP
    • OFFERING: STORAGE
      • OFFER MEASURES: GB-RAM, FILE ACCESSES, …
      • OFFER MEASURE COST: $X, $Y, …
    • OFFERING: PROCESSING CLUSTER
      • OFFER MEASURES: GB-RAM, FILE ACCESSES, …
      • OFFER MEASURE COST: $M, $N, …
    • …
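To show what "unit price of a resource over time" could look like in practice, here is a minimal sketch in which each offer measure keeps a dated price history. The class name, method names, nesting as plain dictionaries and the prices are all hypothetical; only the Provider, Service, Offering and Measure labels come from the slide.

```python
from bisect import bisect_right
from dataclasses import dataclass, field
from datetime import date


@dataclass
class OfferMeasure:
    """A billable unit such as core-days, with its price history."""
    name: str
    costs: list = field(default_factory=list)  # sorted (effective_from, price)

    def set_cost(self, effective_from, price):
        self.costs.append((effective_from, price))
        self.costs.sort()

    def cost_on(self, day):
        """Return the price in effect on `day` (latest entry not after it)."""
        i = bisect_right(self.costs, (day, float("inf")))
        if i == 0:
            raise LookupError(f"no price for {self.name} on {day}")
        return self.costs[i - 1][1]


# Catalog keyed as provider -> infrastructure service -> offering -> measure.
core_days = OfferMeasure("core-days")
core_days.set_cost(date(2015, 6, 1), 1.00)   # placeholder prices, not real
core_days.set_cost(date(2016, 1, 1), 0.90)
catalog = {"Twitter DC": {"Aurora": {"Compute": {"core-days": core_days}}}}

measure = catalog["Twitter DC"]["Aurora"]["Compute"]["core-days"]
print(measure.cost_on(date(2015, 9, 1)))
```

Keeping the history rather than a single current price lets past usage be billed at the price that applied when the resource was consumed.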
METERING PIPELINE
HIGH LEVEL ARCHITECTURE
The Product
TEAM/ORG BILL
INFRASTRUCTURE PNL
ORG/TEAM BUDGET
CUSTOM REPORTS
• Infrastructure & Platform Owners
  • Overall Cluster Growth
  • Allocation vs. Utilization of resources by Customer Team
• Service Owners
  • Allocation vs. Utilization of resources across each Infrastructure & Platform
• Finance
  • Budget Management (Budget vs. Spend)
• Execs
  • Efficiency
  • Trends
What has been the Impact?
[Chart: Twitter Compute Platform (Aurora/Mesos), Allocated Quota vs. Utilized Cores, 3 months (Jun 1 to Sep 1, 2015)]
[Chart: Twitter Compute Platform (Aurora/Mesos), Allocated Quota vs. Utilized Cores, 4 months (Sep 1, 2015 to Jan 1, 2016)]
33% more core usage against reservation compared to May 2015
IMPACT
• Ensures unit price computation is true to cost
• Input for capacity planning and budgeting
• Visibility into organizational spend, enabling accountability
• Improved utilization of infrastructure service resources
• Enables comparison with Public Cloud Offerings
• Improved Service Ownership
Kite - Unified Cloud Platform
A cloud-agnostic service lifecycle manager
• SERVICE IDENTITY MANAGER
• RESOURCE PROVISIONING MANAGER
• DASHBOARD (SINGLE PANE OF GLASS)
• REPORTING
SERVICE LIFECYCLE WORKFLOWS
IDENTITY • METADATA • RESOURCE QUOTA MANAGEMENT • DEPLOY • METERING & CHARGEBACK
PROVIDER APIS & ADAPTERS
INFRASTRUCTURE & PLATFORM SERVICES
@vinucharanya
@dpkagrawal
@pragashjj
@fvrojas
@micheal
@igb
@imjessicayuen
@_jordanly
@xcv58