[Kubecon 2017 Austin, TX] How We Built a Framework at Twitter to Solve Service Ownership & Improve...

Post on 23-Jan-2018

How we built a framework to improve infrastructure resource utilization at scale

Hello!

I am @VinuCharanya

★ Sr. Systems Engineer @Twitter
★ Proud to be a member of @TwitterWomen, @Techwomen and @WomenWhoCode

Agenda

1. History & Context
2. Chargeback @Twitter
3. Kite - Service Lifecycle Manager
4. Impact & Future Work

History & Context

Thousands of MicroServices

INFRASTRUCTURE & DATACENTER MANAGEMENT
CORE APPLICATION SERVICES: TWEETS, USERS, SOCIAL GRAPH
PLATFORM SERVICES: SEARCH, MESSAGING & QUEUES, CACHE, MONITORING AND ALERTING, INGRESS & PROXY
FRAMEWORK/LIBRARIES: FINAGLE (RPC), SCALDING (Map Reduce in Scala), HERON (Streaming Compute), JVM
MANAGEMENT TOOLS: SELF SERVE, SERVICE DIRECTORY, CHARGEBACK, CONFIG MGMT
DATA & ANALYTICS PLATFORM: INTERACTIVE QUERY, DATA DISCOVERY, WORKFLOW MANAGEMENT
INFRASTRUCTURE SERVICES:
  STORAGE: MANHATTAN, BLOBSTORE, GRAPHSTORE, TIMESERIESDB
  COMPUTE: MESOS/AURORA, HADOOP
  DB/DW: MYSQL, VERTICA, POSTGRES
  DEPLOY (Workflows)

Number of Servers (chart): MESOS/AURORA, HADOOP, MANHATTAN: 67%

• How to get visibility into resources used by individual jobs & datasets?
• How to attribute resource consumption to teams/organization?
• How do you incentivize the right behavior to improve efficiency of resource usage?

Chargeback @Twitter

Ability to meter allocation & utilization of resources per service, per project, per engineering team to improve visibility & enable accountability

Features

• Supports diverse Infra/Platform Services
• Meters abstract resources at daily granularity
• Detailed Reports

Support diverse Infrastructure and Platform Services

1. Resource Catalog: Consistent way to inventory infrastructure resources
• Resource Fluidity: Support primitive (CPU) and abstract resource (“Tweets / second”). Extend existing resource
2. Resource <> Client Identifier Ownership: Map of client identifier to an owner to enable accountability

RESOURCE CATALOG ENTITY MODEL

PROVIDER (1:N) INFRASTRUCTURE SERVICE (1:N) OFFERINGS (1:N) OFFER MEASURES (1:N) OFFER MEASURE COST

Example labels shown on the diagram:
• TWITTER DC/PUBLIC CLOUD, COMPUTE, CORE-DAYS, $X
• TWITTER DC; STORAGE, PROCESSING CLUSTER; GB-RAM, FILE ACCESSES, …; $X, $Y, …, $M, $N, …

/api/1/measures

{ measures: [
  {
    "measure_id": 1,
    "measure_label": "core-days",
    "measure_unit_label": "per 1 core-day",
    "offering_id": 1,
    "offering_label": "Compute",
    "infrastructure_id": 1,
    "infrastructure_name": "Aurora"
  },
  {
    "measure_id": 2,
    "measure_label": "machine-days",
    "measure_unit_label": "per 1 machine-day",
    "offering_id": 2,
    "offering_label": "zone:tweety",
    "infrastructure_id": 8,
    "infrastructure_name": "Physical Infrastructure"
  },
  {
  …
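To make the catalog hierarchy and the flat measures payload concrete, here is a minimal Python sketch. The class names and the `flatten_measures` helper are hypothetical, invented for illustration; only the record keys and the Aurora example values come from the slide. It is not the actual service implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Measure:
    measure_id: int
    label: str       # e.g. "core-days"
    unit_label: str  # e.g. "per 1 core-day"


@dataclass
class Offering:
    offering_id: int
    label: str       # e.g. "Compute"
    measures: List[Measure] = field(default_factory=list)


@dataclass
class InfrastructureService:
    infrastructure_id: int
    name: str        # e.g. "Aurora"
    offerings: List[Offering] = field(default_factory=list)


def flatten_measures(services: List[InfrastructureService]) -> List[Dict]:
    """Walk service -> offering -> measure and emit one flat record per measure."""
    rows = []
    for svc in services:
        for off in svc.offerings:
            for m in off.measures:
                rows.append({
                    "measure_id": m.measure_id,
                    "measure_label": m.label,
                    "measure_unit_label": m.unit_label,
                    "offering_id": off.offering_id,
                    "offering_label": off.label,
                    "infrastructure_id": svc.infrastructure_id,
                    "infrastructure_name": svc.name,
                })
    return rows


# Values taken from the first record on the slide.
aurora = InfrastructureService(1, "Aurora", [
    Offering(1, "Compute", [Measure(1, "core-days", "per 1 core-day")]),
])
records = flatten_measures([aurora])
```

Keeping the hierarchy nested and flattening only at the API boundary is one possible design; the slides do not say how the catalog is stored.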

So, how do you incentivize the right behavior to improve efficiency of resource usage?

Pricing is one way…

Total Cost of Ownership for Aurora: $X core-day

Cost of Physical Server ($X / day)
Total available Cores, broken down into:
• Production Used Cores
• Non-Prod Used Cores
• Container Size Buffer (Underutilized Reservation)
• Quota Buffer (Underutilized Quota)
• Headroom
• Operational Overhead

Annotations on the breakdown:
• Total used Cores
• Excess Cores (incl. DR, Spikes, Overallocation)
• Cores used by platform for operations & maintenance
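The slides do not give the pricing formula, so the following Python sketch shows only one plausible way a core-day unit price could fold unused capacity into the price. The formula (total daily cost divided by used cores) and all numbers are assumptions for illustration.

```python
def unit_price_per_core_day(server_cost_per_day, servers, used_cores):
    """Spread the whole fleet's daily cost over the cores tenants actually use.

    Assumption (not stated on the slides): headroom, buffers and operational
    overhead are recovered by dividing total cost by used cores rather than
    by total available cores.
    """
    if used_cores <= 0:
        raise ValueError("used_cores must be positive")
    total_daily_cost = server_cost_per_day * servers
    return total_daily_cost / used_cores


# Hypothetical numbers, for illustration only.
servers = 100
cores_per_server = 32
server_cost_per_day = 16.0
total_cores = servers * cores_per_server  # 3200 available cores
used_cores = 1600                         # prod + non-prod usage

# Price if every available core were billable (ignores idle capacity).
naive_price = server_cost_per_day / cores_per_server
# Price once idle capacity is amortized over the used cores.
loaded_price = unit_price_per_core_day(server_cost_per_day, servers, used_cores)
```

With these made-up inputs the loaded price is twice the naive one, which is the kind of signal that could make underutilized quota and oversized containers visible to service owners.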

Metering Pipeline (ETL Job)

Flow: INFRASTRUCTURE SERVICE 1, INFRASTRUCTURE SERVICE 2 → INGEST METRICS → RAW FACT → TRANSFORMER → RESOLVED FACT → REPORT
Inputs and checks: RESOURCE CATALOG, IDENTIFIER OWNERSHIP MAPPING, DATA FIDELITY

Metrics Ingestor
• Schema(client_identifier, offering_measure, volume, metadata, timestamp)

Transformer
1. Resolve Ownership
2. Cost Computation

Data Fidelity & Reporting
1. Verify Data Integrity & Fidelity
2. Alert when things don’t seem the way they should be
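The transformer stage of the metering pipeline can be sketched roughly as follows. The raw-fact fields follow the schema on the slide; the class names, the `transform` and `bill_by_owner` helpers, the "UNOWNED" bucket, the second client identifier and the unit price are hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class RawFact:
    # Field names follow the schema shown on the slide.
    client_identifier: str
    offering_measure: str
    volume: float
    metadata: Dict[str, str]
    timestamp: str  # metering is daily, so a day such as "2017-12-06"


@dataclass
class ResolvedFact:
    raw: RawFact
    owner: Optional[str]  # None when no ownership mapping exists
    cost: float


def transform(facts: List[RawFact],
              ownership: Dict[str, str],
              unit_prices: Dict[str, float]) -> List[ResolvedFact]:
    resolved = []
    for f in facts:
        owner = ownership.get(f.client_identifier)         # 1. resolve ownership
        cost = f.volume * unit_prices[f.offering_measure]  # 2. cost computation
        resolved.append(ResolvedFact(f, owner, cost))
    return resolved


def bill_by_owner(resolved: List[ResolvedFact]) -> Dict[str, float]:
    """Aggregate cost per owner; unmapped identifiers land in a visible bucket."""
    totals: Dict[str, float] = {}
    for r in resolved:
        key = r.owner or "UNOWNED"
        totals[key] = totals.get(key, 0.0) + r.cost
    return totals


# Hypothetical inputs, for illustration only.
ownership = {"tweetypie.prod.tweetypie": "tweetypie"}
unit_prices = {"core-days": 2.0}
facts = [
    RawFact("tweetypie.prod.tweetypie", "core-days", 10.0, {}, "2017-12-06"),
    RawFact("unknown.prod.job", "core-days", 3.0, {}, "2017-12-06"),
]
bill = bill_by_owner(transform(facts, ownership, unit_prices))
```

Reporting unmapped identifiers under an explicit bucket, instead of dropping them, keeps ownership gaps visible in the reports.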

Reports

Customers
• Infrastructure & Platform Operators: Overall Cluster Growth; Allocation v/s Utilization of resources by Client/Tenant
• Finance & Execs: Budget v/s Spend per Org; Infrastructure PnL; Overall Efficiency & Trends
• Service Owners & Developers: Team Bill; Per Service Allocation vs. Utilization of Resources

INFRASTRUCTURE PNL

CHARGEBACK BILL FOR A TEAM

CHARGEBACK DRILLDOWN FOR A TEAM

Learnings

1. Invest in data fidelity
• Trust in data is most important.
• Invest in monitoring & alerting for data inconsistencies
• Leverage this for detecting abnormal increase/decrease and notify users

2. Accurate Ownership Mapping
• Static mappings go out of date quickly
• Invest in systems (e.g., Kite) for users to manage it themselves

3. Logical grouping of resources
• Identifiers were too granular and teams were too broad.
• Find a good middle ground and invest in a system (e.g., Kite) to track, understand and maintain

4. Track historical data
• Unit prices change over time
• Orgs / Teams change over time
• Resources get added / removed
• Change history is essential for consistency, which is used for capacity planning
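As one illustration of detecting abnormal increases or decreases in metered data, the Python sketch below flags day-over-day changes above a relative threshold. The function name, the 50% default and the sample series are assumptions; the slides do not describe the actual checks.

```python
def abnormal_changes(daily_volume, threshold=0.5):
    """Return (day, previous, current) for day-over-day changes above threshold.

    daily_volume: list of (day, volume) tuples sorted by day.
    threshold: relative change; 0.5 means +/-50% (an arbitrary default).
    """
    alerts = []
    for (_, prev), (day, cur) in zip(daily_volume, daily_volume[1:]):
        if prev == 0:
            # No baseline to compare against: flag any non-zero usage.
            if cur != 0:
                alerts.append((day, prev, cur))
            continue
        if abs(cur - prev) / prev > threshold:
            alerts.append((day, prev, cur))
    return alerts


# Hypothetical core-day volumes for one client identifier.
series = [("2017-12-01", 100.0), ("2017-12-02", 104.0),
          ("2017-12-03", 210.0), ("2017-12-04", 205.0)]
alerts = abnormal_changes(series)
```

Here only the jump on 2017-12-03 is flagged. The same check could serve both purposes named above: catching pipeline inconsistencies and notifying users about real usage spikes.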

Kite @Twitter

Components shown on the architecture diagram:
• DASHBOARD (SINGLE PANE OF GLASS)
• SERVICE IDENTITY MANAGER, RESOURCE PROVISIONING MANAGER, REPORTING
• SERVICE LIFECYCLE WORKFLOWS
• METADATA, RESOURCE QUOTA MANAGEMENT, METERING & CHARGEBACK, CLIENT IDENTITY
• PROVIDER APIS & ADAPTERS
• INFRASTRUCTURE SERVICE, INFRASTRUCTURE SERVICE, INFRASTRUCTURE SERVICE, INFRASTRUCTURE & PLATFORM SERVICE

10,000+ Client Identifiers, 1,000+ Projects, 100+ Teams, 8 Infrastructure Services

Client Identifier Management

Identity System: Built a consistent way to group client identifiers of different infrastructure services into a project and enabled ownership

• Capture Org Structure: Support org structure changes, project transfer workflows to ensure up-to-date ownership of identifiers

• Unify client identifier provisioning workflow: Enables single source of truth and reduces operator pain around provisioning and managing client identifiers.

IDENTITY ENTITY MODEL

BUSINESS OWNER (1:N) TEAM (1:N) PROJECT (1:N) SERVICE/SYSTEM ACCOUNT (1:N) <INFRA, CLIENTID>

Examples:
• INFRASTRUCTURE → TWEETYPIE → tweetypie → tweetypie → <Aurora, tweetypie.prod.tweetypie>
• REVENUE → ADS PREDICTION → prediction → ads-prediction → <Aurora, ads-prediction.prod.campaign-x>

Entities are time varying dimensions
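Because ownership changes over time, a lookup has to answer "who owned this identifier on a given day". The Python sketch below shows one simple way to do that with effective-dated records. The `OwnershipHistory` class, the team names and the dates are hypothetical; the slides do not describe Kite's storage model.

```python
from bisect import bisect_right
from datetime import date


class OwnershipHistory:
    """Effective-dated owner for one <infra, client_id> pair."""

    def __init__(self):
        self._dates = []  # sorted effective dates
        self._teams = []  # team owning the identifier from that date on

    def assign(self, effective, team):
        """Record that `team` owns the identifier starting on `effective`."""
        i = bisect_right(self._dates, effective)
        self._dates.insert(i, effective)
        self._teams.insert(i, team)

    def owner_on(self, day):
        """Return the owning team on `day`, or None before the first record."""
        i = bisect_right(self._dates, day)
        return self._teams[i - 1] if i else None


# Hypothetical transfer of an identifier between two made-up teams.
history = OwnershipHistory()
history.assign(date(2017, 1, 1), "team-a")
history.assign(date(2017, 7, 1), "team-b")

march_owner = history.owner_on(date(2017, 3, 15))
august_owner = history.owner_on(date(2017, 8, 1))
```

Resolving ownership against the date of each metered fact, rather than against the current mapping, keeps historical bills stable when a project is transferred.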

Impact

10,000+ Client Identifiers

Screens shown:
• CLAIM OWNERSHIP
• PROJECT DISCOVERY
• TEAM OVERVIEW (annotations: Released unused Resources; Q2 unit price update; New project launch)
• PROJECT METADATA
• AURORA QUOTA MANAGER

Future Work

1. Resource provisioning
• Extend Quota Manager and unify the experience into Kite
• Onboard Hadoop, Storage and other systems

2. Enable project deprecation
• Detect unused resources, notify users, trigger deprecation process based on policy

3. Capacity Planning
• Provide historic trends and help with forecast of capacity

@VinuCharanya