Scaling Etsy: What Went Wrong, What Went Right

Post on 14-May-2015

Slides for the talk given at Surge 2011.

Scaling Etsy: What Went Wrong, What Went Right

Ross Snyder, ross@etsy.com, @beamrider9, Sept. 30, 2011

1

Etsy is the world’s handmade marketplace.

(vintage and supplies, too)

2

Etsy was founded in mid-2005 and is constantly growing.

Gross Merchandise Sales ($MM)

3

From humble beginnings...

June 2005: Four employees, one web*, one db, founder’s apartment

* until getting slashdotted by a link from Boing Boing in Aug. 2005

4

... to today’s handmade juggernaut.

Sept. 2011: 250+ employees, multiple offices, billions of pageviews

(NYC Mayor Mike Bloomberg visited Etsy in June 2011)

5

How’d we get here?

6

Answer: with some difficulty.

“There is no education like adversity.” - Benjamin Disraeli

7

A few disclaimers

8

Hindsight is 20/20

9

“History is written by the victors”

10

Etsy thrives today because of what

its early employees accomplished

11

Your narrator wasn’t present for most of the events covered in this talk

12

Etsy Architecture: 2007

13

Etsy Architecture: 2007

Operating System:

Database:

Webserver:

Languages:

14

Etsy Architecture: 2007

Most business logic in Postgres stored procedures

15

Etsy Architecture: 2007

Front end / database interaction = stored procedure calls wrapped with PHP functions

16

Etsy Architecture: 2007

Some database partitioning by feature, but still with a large central DB

17

Etsy Architecture: 2007

Site uptime = not great

18

Etsy Architecture: 2007

“How do we scale?”

19

Etsy Architecture: 2007

“Let’s write some middleware!”

(runners up: “Let’s rewrite the site in Java!” and “Let’s rewrite the site in Python!”)

20

Conway’s Law:

“Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure.”

- Melvin Conway, 1968

21

Etsy Engineering: 2007

Dev DBA Ops

22

Etsy Engineering: 2007

Dev DBA Ops

Devs write code

23

Etsy Engineering: 2007

Dev DBA Ops

DBAs write SQL

24

Etsy Engineering: 2007

Dev DBA Ops

Ops deploys code & touches prod

25

SILOS

26

Etsy’s big bet: “Sprouter” (the Stored Procedure Router)

27

Sprouter

[Diagram: Web (PHP) -> Sprouter (Python) -> DB (Postgres)]

Runs on each webserver, listens on port 8010

28

Sprouter

Maps name/arguments to a Postgres stored procedure, calls it, returns results
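
The talk does not show Sprouter's code, so the following is only a minimal sketch of the mapping step, assuming a simple in-memory registry. The registry contents, the procedure names, and the `build_call` helper are hypothetical and chosen for illustration; the `%s` placeholders follow the parameter style used by common Python Postgres drivers.

```python
from typing import Any, Dict, List, Sequence, Tuple

# Hypothetical registry: logical call name -> (stored procedure, argument count).
# These names are invented for illustration; they do not come from the talk.
PROCEDURES: Dict[str, Tuple[str, int]] = {
    "get_listing": ("listings_get_by_id", 1),
    "get_shop_listings": ("listings_get_by_shop", 2),
}


def build_call(name: str, args: Sequence[Any]) -> Tuple[str, List[Any]]:
    """Map a logical name and its arguments to a parameterized procedure call."""
    if name not in PROCEDURES:
        raise KeyError(f"unknown procedure name: {name}")
    proc, arity = PROCEDURES[name]
    if len(args) != arity:
        raise ValueError(f"{name} expects {arity} argument(s), got {len(args)}")
    placeholders = ", ".join(["%s"] * arity)
    # The procedure name comes only from the registry, never from the caller;
    # argument values travel separately so a driver can bind them safely.
    return f"SELECT * FROM {proc}({placeholders})", list(args)
```

A router built this way would hand the returned SQL string and parameter list to a database cursor and relay the rows to the PHP caller; that part is omitted here because it needs a live database.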

29

Sprouter

Caches things

30

Sprouter

Supports sharding (in theory)

31

Sprouter

Devs write PHP, DBAs write SQL, meet somewhere in the middle

32

SILOS

33

Sprouter

The hope: easier to scale Sprouter than to scale the database itself

34

Sprouter

(scaling the db when everything’s in stored procedures = somewhere between hard and impossible)

35

Sprouter: Timeline

Fall ’07: Idea first discussed

Spring ’08: Alpha version debuts

Fall ’08: Released in production

36

Sprouter: Timeline

Fall ’07: Idea first discussed

Spring ’08: Alpha version debuts

Fall ’08: Released in production

Spring ’09: Sprouter deprecated

37

What happened?

38

Sprouter: “Good” Parts

Forcibly centralizes database access

39

Sprouter: “Good” Parts

Hides data store implementation from caller

40

Sprouter: “Good” Parts

Opens the door for “clever” automatic caching

41

Sprouter: “Good” Parts

Prevents developers from writing SQL (?)

42

43

Sprouter: Not-As-Good Parts

Creates substantial developer friction

44

Sprouter: Not-As-Good Parts

Homegrown daemon + dependencies for Ops to maintain

45

Sprouter: Not-As-Good Parts

Lack of community support / provability

46

Sprouter: Not-As-Good Parts

Complex synchronization required to deploy (due to tight coupling with Postgres)

47

Sprouter: Not-As-Good Parts

Database remains single point of failure (sharding features never fully formed)

48

Sprouter: Summary

Extra barriers to development

49

Sprouter: Summary

Extra barriers to development

+ Negligible (negative?) effect on site reliability

50

Sprouter: Summary

Extra barriers to development

+ Deploys even more painful

+ Negligible (negative?) effect on site reliability

51

Sprouter: Summary

Extra barriers to development

+ Deploys even more painful

+ Requires extra Ops/Dev resources

+ Negligible (negative?) effect on site reliability

52

Sprouter: Summary

Extra barriers to development

+ Deploys even more painful

+ Requires extra Ops/Dev resources

+ Negligible (negative?) effect on site reliability

=

53

How did attitudes change so quickly?

54

Sprouter: Timeline

Fall ’07: Idea first discussed

Spring ’08: Alpha version debuts

Fall ’08: Released in production

Spring ’09: Sprouter deprecated

55

The Great Etsy Culture Shift

56

The Great Etsy Culture Shift

Just as Sprouter went live, many of its strongest proponents departed Etsy

57

The Great Etsy Culture Shift

Taking with them...

58

The Great Etsy Culture Shift

Devotion to Postgres stored procedures / types

59

The Great Etsy Culture Shift

Fear of developers writing SQL

60

The Great Etsy Culture Shift

Fear of developers touching prod

61

The Great Etsy Culture Shift

Infrequent / large deploys to production

62

The Great Etsy Culture Shift

“Not developed here”

63

The Great Etsy Culture Shift

[Diagram: timeline split at Fall ’08 into “Then” and “Now”]

64

DevOps

65

DevOps

Silos = bad

66

DevOps

Trust, cooperation, transparency, shared responsibility = good

67

DevOps

“We’re all in this together”

68

The Way Forward: Part 1

Stabilize the site

69

The Way Forward: Part 1

Improve metrics & monitoring

Stabilize the site

70

The Way Forward: Part 1

StatsD: http://github.com/etsy/statsd

Stabilize the site
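
For context on why StatsD lowers the cost of adding metrics: the daemon accepts plain-text datagrams over UDP in the form `name:value|type`, with an optional `|@rate` sampling suffix. The sketch below shows that wire format. The helper names `format_metric` and `send_metric` and the metric names are hypothetical, and the host and port are assumptions (8125 is the port used in StatsD's example configuration).

```python
import socket
from typing import Optional


def format_metric(name: str, value: int, kind: str,
                  sample_rate: Optional[float] = None) -> bytes:
    """Build one StatsD datagram, e.g. b"logins:1|c" for a counter."""
    line = f"{name}:{value}|{kind}"
    if sample_rate is not None:
        # Tells the daemon this metric was sampled, so it can scale the count.
        line += f"|@{sample_rate}"
    return line.encode("ascii")


def send_metric(payload: bytes, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send one datagram over UDP; the sender does not wait for any reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
```

Because UDP is connectionless, instrumented application code does not block on the metrics daemon, which is part of what makes it practical to measure many things.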

71

The Way Forward: Part 1

Upgrade database hardware vertically as far as possible

Stabilize the site

72

The Way Forward: Part 1

Give developers production access to help troubleshoot problems

Stabilize the site

73

The Way Forward: Part 2

Continuous Deployment

74

The Way Forward: Part 2

Any engineer can deploy to prod (generally happens 25+ times per day)

Continuous Deployment

75

The Way Forward: Part 2

Deployinator: http://github.com/etsy/deployinator

Continuous Deployment

76

The Way Forward: Part 2

One button that deploys the site

Continuous Deployment

77

The Way Forward: Part 2

Small changesets, deployed frequently

Continuous Deployment

78

The Way Forward: Part 2

Requires solid tests, good communication

Continuous Deployment

79

The Way Forward: Part 2

Distributed developer-driven QA

Continuous Deployment

80

The Way Forward: Part 3

Circumvent Sprouter

81

The Way Forward: Part 3

Object-Relational Mapping (ORM)

Circumvent Sprouter

82

The Way Forward: Part 3

aka “The Vietnam of Computer Science” (Google it)

Circumvent Sprouter

83

The Way Forward: Part 3

Front-end PHP talks directly to database via ORM (also written in PHP)

Circumvent Sprouter

84

The Way Forward: Part 3

ORM can cache where appropriate (as can front end)

Circumvent Sprouter
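
Etsy's ORM was written in PHP and is not shown in the talk. As a rough illustration of what "cache where appropriate" can mean at the ORM layer, here is a minimal read-through cache sketch in Python. The `CachedFinder` class and its `load_row` callback are hypothetical, and an in-process dictionary stands in for whatever cache a real deployment would use.

```python
from typing import Callable, Dict, Optional


class CachedFinder:
    """Read-through cache in front of a row loader (a stand-in for a DB query)."""

    def __init__(self, load_row: Callable[[int], Optional[dict]]) -> None:
        self._load_row = load_row
        self._cache: Dict[int, dict] = {}

    def find(self, row_id: int) -> Optional[dict]:
        """Return the cached row if present; otherwise load it and remember it."""
        if row_id in self._cache:
            return self._cache[row_id]
        row = self._load_row(row_id)
        if row is not None:
            self._cache[row_id] = row
        return row

    def invalidate(self, row_id: int) -> None:
        """Drop a cached row, e.g. after the application writes to it."""
        self._cache.pop(row_id, None)
```

The point of the sketch is that caching here is explicit and lives in the same codebase as the callers, in contrast to the "clever" automatic caching in a separate daemon that the talk lists among Sprouter's questionable benefits.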

85

The Way Forward: Part 4

Database Sharding

86

The Way Forward: Part 4

Etsy has a lot of DNA from Flickr, including their DB sharding scheme

Database Sharding

87

The Way Forward: Part 4

Based on MySQL

Database Sharding

88

The Way Forward: Part 4

Battle-tested, well-known

Database Sharding

89

The Way Forward: Part 4

Scales horizontally to infinity (or close enough)

Database Sharding

90

The Way Forward: Part 4

No single points of failure (master-master replication)

Database Sharding
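
The slides do not spell out the sharding scheme itself. One common way to shard horizontally is a directory (lookup table) that pins each owner to a shard, and the sketch below assumes that approach; it is not taken from the talk. The `ShardDirectory` class, the round-robin placement, and the integer shard IDs are all hypothetical. In a setup like the one the slides describe, each shard ID would refer to a master-master MySQL pair.

```python
from typing import Dict


class ShardDirectory:
    """Lookup table mapping an owner ID to the shard that holds its rows."""

    def __init__(self, shard_count: int) -> None:
        if shard_count < 1:
            raise ValueError("shard_count must be at least 1")
        self.shard_count = shard_count
        self._owner_to_shard: Dict[int, int] = {}

    def assign(self, owner_id: int) -> int:
        """Pin a new owner to a shard once; later calls return the same shard."""
        if owner_id not in self._owner_to_shard:
            # Round-robin by assignment order: an assumption made for this sketch.
            shard = len(self._owner_to_shard) % self.shard_count
            self._owner_to_shard[owner_id] = shard
        return self._owner_to_shard[owner_id]

    def move(self, owner_id: int, shard: int) -> None:
        """Repoint an owner at another shard (copying the data is not shown)."""
        if not 0 <= shard < self.shard_count:
            raise ValueError(f"no such shard: {shard}")
        self._owner_to_shard[owner_id] = shard
```

A directory costs one extra lookup per request compared with hashing the ID, but it lets individual owners be moved between shards without rehashing everyone, which is what makes gradual growth and rebalancing manageable.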

91

The Way Forward: Part 4

Gradually phase out Sprouter, phase in ORM / sharded data

Database Sharding

92

Sprouter: Timeline

Fall ’07: Idea first discussed

Spring ’08: Alpha version debuts

Fall ’08: Released in production

Spring ’09: Sprouter deprecated

93

Sprouter: Timeline

Fall ’07: Idea first discussed

Spring ’08: Alpha version debuts

Fall ’08: Released in production

Spring ’09: Sprouter deprecated

Spring ’11: Sprouter turned off

94

95

Lessons Learned

96

Etsy Architecture: 2007

Operating System:

Database:

Webserver:

Languages:

97

Etsy Architecture: 2011

Operating System:

Database:

Webserver:

Languages:

98

Open & trusting > closed & afraid (DevOps DevOps DevOps)

99

Front end/database interaction is too critical to take chances on novel/untested solutions

100

Side corollary: If you’re doing something “clever”, you’re probably doing it wrong

101

The architectural decisions you make today will have large impact long after you’re gone

102

No architectural hole is so deep that proven scaling strategies don’t exist for digging out

103

We are probably making decisions today that will be the subject of a similar talk in 2015

Acknowledgement

104

Learn More: http://codeascraft.etsy.com/ @codeascraft

105

Etsy is hiring! http://www.etsy.com/careers @etsy

106