Focus, Governance, and Innovation: How LinkedIn Scaled to 3M Jira Issues and 500M Members

Posted on 21-Jan-2018


LinkedIn scales to 3M issues and 500M members

A tale of process, people, and technology

ARNIE MATZ | DIRECTOR, SOFTWARE ENGINEERING | LINKEDIN

DAN HATA | SENIOR ENGINEERING MANAGER | LINKEDIN

500,000,000+ registered members

138+M UNITED STATES OF AMERICA

29+M Brazil

42+M India

8+M Indonesia

4+M Philippines

3+M Malaysia

1+M Singapore

1+M Japan

1+M Korea

32+M China

1+M New Zealand

8+M Australia

23+M UK

14+M France

10+M Germany

10+M Italy

9+M Spain

7+M Netherlands

3+M Belgium

2+M Denmark

2+M Sweden

1+M Ireland

Development Tools: IDE, SCM, Review, CI Pipeline, Artifact Repository, Jira

Jira Business Usage at LinkedIn: Development, Operations, Marketing, Facilities

What did LinkedIn think of Jira in 2015?

Why does my dashboard freeze at 10am?

Jira was fast at my last company.

Why don’t we just build our own Jira?

Why does Jira always crash?

Have we looked into alternatives?

Why is Jira always slow?

Will Jira ever be stable?

One last request……

Please fix Jira as soon as possible, Arnie

Issues Count Growth: 2015-2022 (chart; vertical axis 0 to 70, horizontal axis 2015 to 2021)

2015 Scary Stuff

Stability and Performance: No understanding of Jira stability and performance issues.

No change control process.

Unlimited admins: Many people have admin access, making change control and standardization impossible.

Lucene index corruption: 25% of Jira restarts resulted in index corruption, and recovery takes hours.

Rapid custom field growth: Contributes to index growth. There was no governance. Admins said yes to all custom field requests.

No governance

No change control

No metrics

2015 ASSESSMENT SUMMARY

Unplanned outages almost every day

Sometimes all day

Thousands of custom fields

Growing by 150% year over year

...and way out of control

Three investment areas: People, Process, Technology

CHECK POINT

1.2 million issues 300+ million members

6,000+ employees

2015

People: Process:

Technology:

CRITICAL

REVIEW

CRITICAL

Roles For Supporting Jira

App Admins: Focus on customer service (external consultants)

Operations: Focus on deterministic change and mitigating risk

Developers: Ensuring performance and scale are built into all solutions

Managers: Focus on governance, strategy, Atlassian partnership

2015 Team Staffing: Before

App Admins: 4

Developers: 0

Operations: 0

Managers: 0

2015 Team Staffing: During

App Admins: 4

Developers: 0

Operations: 0

Managers: 0

2015 Team Staffing: After

App Admins: 4

Developers: 1

Operations: 1

Managers: 0

Monitoring and SLOs

Atlassian relationship

Governance CRITICAL

CRITICAL

Change control

2015 Process: Before

CRITICAL

WARNING

Issues Count Growth: 2015-2022 (chart; vertical axis 0 to 70, horizontal axis 2015 to 2021)

2015 Issue Count Projections (chart; vertical axis 0 to 70, horizontal axis 2015 to 2021; series: Original 2015 Issue Projection, Issue Projection with Governance in Place)

Under Control

• Removed unused Custom Fields for a 40% overall reduction!

• Limited Custom Field growth through Governance
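A minimal sketch of how unused custom fields might be identified before removal, assuming a per-field usage count has already been exported from Jira. The function names, field names, and counts below are hypothetical and chosen only for illustration; they are not LinkedIn's data or tooling.

```python
from typing import Dict, List


def find_unused_custom_fields(usage_counts: Dict[str, int]) -> List[str]:
    """Return the custom fields that no issue populates, sorted by name."""
    return sorted(name for name, count in usage_counts.items() if count == 0)


def reduction_percent(total_fields: int, removed_fields: int) -> float:
    """Percentage of the field inventory that removal would eliminate."""
    if total_fields == 0:
        return 0.0
    return 100.0 * removed_fields / total_fields


# Hypothetical inventory: field name -> number of issues with a value set.
usage = {
    "Team": 120_000,
    "Legacy Build ID": 0,
    "Release Train": 4_500,
    "Old Vendor Ref": 0,
    "Customer Tier": 900,
}

unused = find_unused_custom_fields(usage)
print("Candidates for removal:", unused)
print(f"Reduction: {reduction_percent(len(usage), len(unused)):.0f}%")
```

Candidates would still need review with their owners before deletion, in line with the governance process described on these slides.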

2015: Custom Fields

Governance: Unbounded columns

• Unbounded columns in Jira:

• Comments

• Versions

2015 Process Improvements

Misuse: What can I do?

• Document and communicate what is acceptable use

• Work with users to find the right solution

• Through technology, make it impossible for misuse to reoccur

2015 Process Improvements

Change Control

• Configuration as code

• All changes are tested, reviewed, communicated, with rollback plans

Service Level Objectives

• Tracked and investigated violations

• Example: <2 seconds issue creation time
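A minimal sketch of tracking the issue-creation objective above, assuming timing samples are already being collected somewhere. Only the 2-second threshold comes from the slide; the `Sample` type, the function names, and the sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import List

# From the slide: issue creation should take less than 2 seconds.
ISSUE_CREATE_SLO_SECONDS = 2.0


@dataclass
class Sample:
    request_id: str   # hypothetical identifier for one issue-creation request
    seconds: float    # measured duration of that request


def slo_violations(samples: List[Sample],
                   threshold: float = ISSUE_CREATE_SLO_SECONDS) -> List[Sample]:
    """Return the samples that did not finish under the threshold."""
    return [s for s in samples if s.seconds >= threshold]


def compliance_ratio(samples: List[Sample],
                     threshold: float = ISSUE_CREATE_SLO_SECONDS) -> float:
    """Fraction of samples that met the objective (1.0 when there are none)."""
    if not samples:
        return 1.0
    return 1.0 - len(slo_violations(samples, threshold)) / len(samples)


# Hypothetical measurements for illustration only.
samples = [
    Sample("req-1", 0.8),
    Sample("req-2", 1.9),
    Sample("req-3", 3.4),
    Sample("req-4", 1.2),
]

for violation in slo_violations(samples):
    print(f"investigate {violation.request_id}: {violation.seconds:.1f}s")
print(f"compliance: {compliance_ratio(samples):.0%}")
```

Each violation is listed individually because the slide calls for violations to be investigated, not only counted.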

2015 Process Improvements

Atlassian Relationship

• Introduced a Technical Account Manager (TAM)

• Added Premier Support

• Partnered with the TAM and Premier Support (PS) at Atlassian to target a performance upgrade

• Extended licensing for an end-of-life plugin

2015 Process Improvements

Monitoring and SLOs

Atlassian relationship

Governance

SUCCESS

WARNING

Change control ALMOST

2015 Process: After

ALMOST

Availability

Hardware

Monitoring and Alerting CRITICAL

CRITICAL

WARNING

2015 Technology: Before

Hardware upgrade

• Lucene index on SSD

• Currently 75 GB

• 6 hours rebuild time

2015 Technology Improvements

Adding Software-Driven Governance

• Python-Jira client enables innovations

• Replica databases provide read access to applications needing real-time Jira data

2015 Technology Improvements

Leveraging inGraphs

• All application and system resources displayed on a single page

2015 Technology Improvements

Jira Data Center

2017 Technology Improvements

CHECK POINT

2 million issues 400+ million members

9,000+ employees

2016

1.2 million issues 300+ million members

6,000+ employees

2015

People: Process:

Technology:

CRITICAL

REVIEW

CRITICAL

ALMOSTPeople: Process:

Technology: REVIEW

REVIEW

2016 Team Staffing: Before

App Admins: 4

Developers: 1

Operations: 1

Managers: 0

2016 Team Staffing: After

App Admins: 2

Developers: 1

Operations: 1

Managers: 0.5

Operational Excellence

Atlassian Relationship

Governance

WARNING

2016 Process: Before

ALMOST

WARNING

2016 Process Improvements

Governance

• Documented and communicated

• All requests lead with business requirement

• Scale is the most important requirement

• Automated Governance

2016 Process Improvements

Operational Excellence Culture

• Code and config reviews

• Intelligent risk decisions

• Change control and communication

• Monitoring and metrics

• Automated remediation

• Service level objectives

• Awesome alerts and response

• Business continuity plan

• Relentless pursuit of the cause of exceptions

• Blameless postmortems

2016 Process Improvements

Partnering with Atlassian

• TAM relationship evolved from tactical to strategic in 2016

• Partnering with the TAM for all major upgrades

• Atlassian Premier Support provided a critical bug fix over the holidays to address a bug in a widely used gadget

Operational Excellence

Atlassian Relationship

Governance

2016 Process: After

SUCCESS

ALMOST

SUCCESS

Availability

Hardware

Monitoring and Alerting

CRITICAL

2016 Technology: Before

WARNING

ALMOST

User blacklisting and throttling

• Implemented blacklisting based on username

• Throttling based on requests/minute per host

2016 Technology Improvements

# Jira.conf

# Blacklist a user by adding an entry with a value of 1.
map $remote_user $user_blacklisted {
    default 0;
    "johnnynumberfive" 1;
}
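The two controls on this slide, a username blacklist and a requests-per-minute limit per host, can be modeled in a few lines of Python. This is an illustrative sketch of the logic only: the slides show the blacklist living in the proxy configuration above, and the `RequestGate` class, its parameters, and the host addresses here are hypothetical.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Set


class RequestGate:
    """Reject blacklisted users and hosts that exceed a per-minute budget."""

    def __init__(self, blacklisted_users: Set[str], max_per_minute: int) -> None:
        self.blacklisted_users = blacklisted_users
        self.max_per_minute = max_per_minute
        # host -> timestamps (seconds) of requests accepted in the last minute
        self._recent: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, user: str, host: str, now: float) -> bool:
        """Return True if the request should be passed through to Jira."""
        if user in self.blacklisted_users:
            return False
        window = self._recent[host]
        # Drop timestamps that have aged out of the 60-second window.
        while window and now - window[0] >= 60.0:
            window.popleft()
        if len(window) >= self.max_per_minute:
            return False
        window.append(now)
        return True


# Hypothetical limit of 2 requests per minute, kept small for the example.
gate = RequestGate(blacklisted_users={"johnnynumberfive"}, max_per_minute=2)
for second in (0.0, 1.0, 2.0, 61.0):
    print(second, gate.allow("alice", "10.0.0.1", now=second))
print(gate.allow("johnnynumberfive", "10.0.0.2", now=0.0))
```

Passing `now` explicitly keeps the sketch deterministic; a real deployment would take the time from the proxy or from a clock such as `time.monotonic()`.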

Larger Hardware, Tuned Instance

• 64GB upgraded to 256GB

• JVM increased

2016 Technology Improvements

Leveraging inGraphs

• Monitoring and alerting on all bottlenecks

2016 Technology Improvements

Adding in ELK

Elasticsearch: Horizontally scalable data storage

Logstash: Parsing logs to make useful data

Kibana: Create dashboards showing insightful data

CHECK POINT

2 million issues 400+ million members

9,000+ employees

2016

3 million issues 500+ million members

10,000+ employees

1.2 million issues 300+ million members

6,000+ employees

People: Process:

Technology:

CRITICAL

REVIEW

CRITICAL

2015

ALMOST

SUCCESS

People: Process:

Technology: ALMOST

2017

ALMOSTPeople: Process:

Technology: ALMOSTREVIEW

REVIEW

Operational Excellence

Atlassian Relationship

Product Vision

SUCCESS

2017 Process: Before

ALMOST

SUCCESS

Roles For Supporting Jira

App Admins, Operations, Developers, Manager

Product Owner: Understands customer requirements and prioritizes work.

Roles For Supporting Jira

App Admins: 2

Operations: 1

Developers: 1

Manager: 0.5

Product Owner: 0 (understands customer requirements and prioritizes work)

2017 Process Improvements

Partnering with Atlassian

• Networking with the Jira community

• Providing feedback and requesting features

2017 Technology: Before

Availability: ALMOST

Hardware: ALMOST

Monitoring and Alerting: ALMOST

Real User Monitoring

• Performance regression reports emailed daily

• Response times include rendering

• Global statistics give us insight into latency
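A minimal sketch of what a daily performance regression check over real-user timings could look like. The slides do not describe how the reports were produced, so the median comparison, the 20% tolerance, the page names, and the timings below are all assumptions made for illustration.

```python
from statistics import median
from typing import Dict, List, Tuple

Timings = Dict[str, List[float]]  # page name -> response times in seconds


def find_regressions(baseline: Timings, today: Timings,
                     tolerance: float = 0.20) -> Dict[str, Tuple[float, float]]:
    """Return pages whose median time grew by more than `tolerance`.

    The result maps page name to (baseline median, today's median).
    Pages without samples on either side are skipped.
    """
    regressions: Dict[str, Tuple[float, float]] = {}
    for page, samples in today.items():
        base = baseline.get(page)
        if not base or not samples:
            continue
        before, after = median(base), median(samples)
        if after > before * (1 + tolerance):
            regressions[page] = (before, after)
    return regressions


# Hypothetical timings, including browser rendering time as on the slide.
baseline = {"dashboard": [1.0, 1.2, 1.1], "create_issue": [0.8, 0.9, 1.0]}
today = {"dashboard": [1.6, 1.5, 1.7], "create_issue": [0.9, 0.8, 1.0]}

for page, (before, after) in find_regressions(baseline, today).items():
    print(f"{page}: median {before:.2f}s -> {after:.2f}s")
```

The report body printed here is what would be emailed; sending it is left out so the sketch stays self-contained.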

2017 Technology Improvements

Jira Data Center

• 4 nodes improve our MTTR by avoiding lengthy index rebuilds

• Resilient to the "single click of death"

2017 Technology Improvements

Getting to Scale

Always ask why!

Invest in the team

Build vendor relationship

Lather, rinse, repeat

Thank you!

ARNIE MATZ | DIRECTOR, SOFTWARE ENGINEERING | LINKEDIN

DAN HATA | SENIOR ENGINEERING MANAGER | LINKEDIN