Scaling Etsy: What Went Wrong, What Went Right

Date post: 14-May-2015
Upload: ross-snyder
Description:
Slides for the talk given at Surge 2011.
Scaling Etsy: What Went Wrong, What Went Right. Ross Snyder, @beamrider9, Sept. 30, 2011
Transcript
Page 1: Scaling Etsy: What Went Wrong, What Went Right

Scaling Etsy: What Went Wrong, What Went Right

Ross Snyder
@beamrider9
Sept. 30, 2011

1

Page 2: Scaling Etsy: What Went Wrong, What Went Right

Etsy is the world’s handmade marketplace.

(vintage and supplies, too)

2

Page 3: Scaling Etsy: What Went Wrong, What Went Right

Etsy was founded in mid-2005 and is constantly growing.

Gross Merchandise Sales ($MM)

3

Page 4: Scaling Etsy: What Went Wrong, What Went Right

From humble beginnings...

June 2005: Four employees, one web*, one db, founder’s apartment

* until getting slashdotted by a link from Boing Boing in Aug. 2005

4

Page 5: Scaling Etsy: What Went Wrong, What Went Right

... to today’s handmade juggernaut.

Sept. 2011: 250+ employees, multiple offices, billions of pageviews

(NYC Mayor Mike Bloomberg visited Etsy in June 2011)

5

Page 6: Scaling Etsy: What Went Wrong, What Went Right

How’d we get here?

6

Page 7: Scaling Etsy: What Went Wrong, What Went Right

Answer: with some difficulty.

“There is no education like adversity.” - Benjamin Disraeli

7

Page 8: Scaling Etsy: What Went Wrong, What Went Right

A few disclaimers

8

Page 9: Scaling Etsy: What Went Wrong, What Went Right

Hindsight is 20/20

9

Page 10: Scaling Etsy: What Went Wrong, What Went Right

“History is written by the victors”

10

Page 11: Scaling Etsy: What Went Wrong, What Went Right

Etsy thrives today because of what

its early employees accomplished

11

Page 12: Scaling Etsy: What Went Wrong, What Went Right

Your narrator wasn’t present for most of the events covered in this talk

12

Page 13: Scaling Etsy: What Went Wrong, What Went Right

Etsy Architecture: 2007

13

Page 14: Scaling Etsy: What Went Wrong, What Went Right

Etsy Architecture: 2007

Operating System:

Database:

Webserver:

Languages:

14

Page 15: Scaling Etsy: What Went Wrong, What Went Right

Etsy Architecture: 2007

Most business logic in Postgres stored procedures

15

Page 16: Scaling Etsy: What Went Wrong, What Went Right

Etsy Architecture: 2007

Front end / database interaction = stored procedure calls wrapped with PHP functions

16

Page 17: Scaling Etsy: What Went Wrong, What Went Right

Etsy Architecture: 2007

Some database partitioning by feature, but still with a large central DB

17

Page 18: Scaling Etsy: What Went Wrong, What Went Right

Etsy Architecture: 2007

Site uptime = not great

18

Page 19: Scaling Etsy: What Went Wrong, What Went Right

Etsy Architecture: 2007

“How do we scale?”

19

Page 20: Scaling Etsy: What Went Wrong, What Went Right

Etsy Architecture: 2007

“Let’s write some middleware!”

(runners up: “Let’s rewrite the site in Java!” and “Let’s rewrite the site in Python!”)

20

Page 21: Scaling Etsy: What Went Wrong, What Went Right

Conway’s Law:

“Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure.”

- Melvin Conway, 1968

21

Page 22: Scaling Etsy: What Went Wrong, What Went Right

Etsy Engineering: 2007

Dev DBA Ops

22

Page 23: Scaling Etsy: What Went Wrong, What Went Right

Etsy Engineering: 2007

Dev DBA Ops

Devs write code

23

Page 24: Scaling Etsy: What Went Wrong, What Went Right

Etsy Engineering: 2007

Dev DBA Ops

DBAs write SQL

24

Page 25: Scaling Etsy: What Went Wrong, What Went Right

Etsy Engineering: 2007

Dev DBA Ops

Ops deploys code & touches prod

25

Page 26: Scaling Etsy: What Went Wrong, What Went Right

SILOS

26

Page 27: Scaling Etsy: What Went Wrong, What Went Right

Etsy’s big bet: “Sprouter” (the Stored Procedure Router)

27

Page 28: Scaling Etsy: What Went Wrong, What Went Right

Web(PHP)

Sprouter(Python)

DB(Postgres)

Sprouter

Runs on each webserver, listens on port 8010

28

Page 29: Scaling Etsy: What Went Wrong, What Went Right

Web(PHP)

Sprouter(Python)

DB(Postgres)

Sprouter

Maps name/arguments to a Postgres stored procedure, calls it, returns results

29

Page 30: Scaling Etsy: What Went Wrong, What Went Right

Web(PHP)

Sprouter(Python)

DB(Postgres)

Sprouter

Caches things

30
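
The routing and caching behaviour described on the last two slides can be sketched roughly as follows. This is not Sprouter's actual code, which the talk does not show. The names `ProcedureRouter`, `register`, `call`, the `cacheable` flag, and the procedure name `listing_get_by_id` are all hypothetical, and `fake_db` stands in for a real Postgres connection.

```python
from typing import Any, Callable, Dict, Tuple


class ProcedureRouter:
    """Toy stored-procedure router: maps a public name to a procedure and calls it."""

    def __init__(self, call_procedure: Callable[[str, Tuple[Any, ...]], Any]) -> None:
        # call_procedure stands in for "run this stored procedure on the database".
        self._call_procedure = call_procedure
        self._routes: Dict[str, Tuple[str, bool]] = {}
        self._cache: Dict[Tuple[str, Tuple[Any, ...]], Any] = {}

    def register(self, name: str, procedure: str, cacheable: bool = False) -> None:
        """Expose a stored procedure under a caller-facing name (hypothetical API)."""
        self._routes[name] = (procedure, cacheable)

    def call(self, name: str, *args: Any) -> Any:
        procedure, cacheable = self._routes[name]  # KeyError for unknown names
        key = (procedure, args)
        if cacheable and key in self._cache:
            return self._cache[key]
        result = self._call_procedure(procedure, args)
        if cacheable:
            self._cache[key] = result
        return result


# Fake database: records each call so the caching behaviour is visible.
db_calls = []


def fake_db(procedure: str, args: Tuple[Any, ...]) -> Any:
    db_calls.append((procedure, args))
    return {"procedure": procedure, "args": args}


router = ProcedureRouter(fake_db)
router.register("get_listing", "listing_get_by_id", cacheable=True)
first = router.call("get_listing", 42)
second = router.call("get_listing", 42)  # served from the cache, no second DB call
```

Even in this toy form, every new query needs a matching stored procedure and a route before the front end can use it.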

Page 31: Scaling Etsy: What Went Wrong, What Went Right

Web(PHP)

Sprouter(Python)

DB(Postgres)

Sprouter

Supports sharding (in theory)

31

Page 32: Scaling Etsy: What Went Wrong, What Went Right

Web(PHP)

Sprouter(Python)

DB(Postgres)

Sprouter

Devs write PHP, DBAs write SQL, meet somewhere in the middle

32

Page 33: Scaling Etsy: What Went Wrong, What Went Right

SILOS

33

Page 34: Scaling Etsy: What Went Wrong, What Went Right

Web(PHP)

Sprouter(Python)

DB(Postgres)

Sprouter

The hope: easier to scale Sprouter than to scale the database itself

34

Page 35: Scaling Etsy: What Went Wrong, What Went Right

Web(PHP)

Sprouter(Python)

DB(Postgres)

Sprouter

(scaling the db when everything’s in stored procedures = somewhere between hard and impossible)

35

Page 36: Scaling Etsy: What Went Wrong, What Went Right

Sprouter: Timeline

Fall ’07: Idea first discussed
Spring ’08: Alpha version debuts
Fall ’08: Released in production

36

Page 37: Scaling Etsy: What Went Wrong, What Went Right

Sprouter: Timeline

Fall ’07: Idea first discussed
Spring ’08: Alpha version debuts
Fall ’08: Released in production
Spring ’09: Sprouter deprecated

37

Page 38: Scaling Etsy: What Went Wrong, What Went Right

What happened?

38

Page 39: Scaling Etsy: What Went Wrong, What Went Right

Web(PHP)

Sprouter(Python)

DB(Postgres)

Sprouter: “Good” Parts

Forcibly centralizes database access

39

Page 40: Scaling Etsy: What Went Wrong, What Went Right

Web(PHP)

Sprouter(Python)

DB(Postgres)

Sprouter: “Good” Parts

Hides data store implementation from caller

40

Page 41: Scaling Etsy: What Went Wrong, What Went Right

Web(PHP)

Sprouter(Python)

DB(Postgres)

Sprouter: “Good” Parts

Opens the door for “clever” automatic caching

41

Page 42: Scaling Etsy: What Went Wrong, What Went Right

Web(PHP)

Sprouter(Python)

DB(Postgres)

Sprouter: “Good” Parts

Prevents developers from writing SQL (?)

42

Page 43: Scaling Etsy: What Went Wrong, What Went Right

43

Page 44: Scaling Etsy: What Went Wrong, What Went Right

Web(PHP)

Sprouter(Python)

DB(Postgres)

Sprouter: Not-As-Good Parts

Creates substantial developer friction

44

Page 45: Scaling Etsy: What Went Wrong, What Went Right

Web(PHP)

Sprouter(Python)

DB(Postgres)

Sprouter: Not-As-Good Parts

Homegrown daemon + dependencies for Ops to maintain

45

Page 46: Scaling Etsy: What Went Wrong, What Went Right

Web(PHP)

Sprouter(Python)

DB(Postgres)

Sprouter: Not-As-Good Parts

Lack of community support / provability

46

Page 47: Scaling Etsy: What Went Wrong, What Went Right

Web(PHP)

Sprouter(Python)

DB(Postgres)

Sprouter: Not-As-Good Parts

Complex synchronization required to deploy (due to tight coupling with Postgres)

47

Page 48: Scaling Etsy: What Went Wrong, What Went Right

Web(PHP)

Sprouter(Python)

DB(Postgres)

Sprouter: Not-As-Good Parts

Database remains single point of failure (sharding features never fully formed)

48

Page 49: Scaling Etsy: What Went Wrong, What Went Right

Sprouter: Summary

Extra barriers to development

49

Page 50: Scaling Etsy: What Went Wrong, What Went Right

Sprouter: Summary

Extra barriers to development
+ Negligible (negative?) effect on site reliability

50

Page 51: Scaling Etsy: What Went Wrong, What Went Right

Sprouter: Summary

Extra barriers to development
+ Deploys even more painful
+ Negligible (negative?) effect on site reliability

51

Page 52: Scaling Etsy: What Went Wrong, What Went Right

Sprouter: Summary

Extra barriers to development
+ Deploys even more painful
+ Requires extra Ops/Dev resources
+ Negligible (negative?) effect on site reliability

52

Page 53: Scaling Etsy: What Went Wrong, What Went Right

Sprouter: Summary

Extra barriers to development
+ Deploys even more painful
+ Requires extra Ops/Dev resources
+ Negligible (negative?) effect on site reliability
=

53

Page 54: Scaling Etsy: What Went Wrong, What Went Right

How did attitudes change so quickly?

54

Page 55: Scaling Etsy: What Went Wrong, What Went Right

Sprouter: Timeline

Fall ’07: Idea first discussed
Spring ’08: Alpha version debuts
Fall ’08: Released in production
Spring ’09: Sprouter deprecated

55

Page 56: Scaling Etsy: What Went Wrong, What Went Right

The Great Etsy Culture Shift

56

Page 57: Scaling Etsy: What Went Wrong, What Went Right

The Great Etsy Culture Shift

Just as Sprouter went live, many of its strongest proponents departed Etsy

57

Page 58: Scaling Etsy: What Went Wrong, What Went Right

The Great Etsy Culture Shift

Taking with them...

58

Page 59: Scaling Etsy: What Went Wrong, What Went Right

The Great Etsy Culture Shift

Devotion to Postgres stored procedures / types

59

Page 60: Scaling Etsy: What Went Wrong, What Went Right

The Great Etsy Culture Shift

Fear of developers writing SQL

60

Page 61: Scaling Etsy: What Went Wrong, What Went Right

The Great Etsy Culture Shift

Fear of developers touching prod

61

Page 62: Scaling Etsy: What Went Wrong, What Went Right

The Great Etsy Culture Shift

Infrequent / large deploys to production

62

Page 63: Scaling Etsy: What Went Wrong, What Went Right

The Great Etsy Culture Shift

“Not developed here”

63

Page 64: Scaling Etsy: What Went Wrong, What Went Right

The Great Etsy Culture Shift

[Timeline: Then | Fall ’08 | Now]

64

Page 65: Scaling Etsy: What Went Wrong, What Went Right

DevOps

65

Page 66: Scaling Etsy: What Went Wrong, What Went Right

DevOps

Silos = bad

66

Page 67: Scaling Etsy: What Went Wrong, What Went Right

DevOps

Trust, cooperation, transparency, shared responsibility = good

67

Page 68: Scaling Etsy: What Went Wrong, What Went Right

DevOps

“We’re all in this together”

68

Page 69: Scaling Etsy: What Went Wrong, What Went Right

The Way Forward: Part 1

Stabilize the site

69

Page 70: Scaling Etsy: What Went Wrong, What Went Right

The Way Forward: Part 1

Improve metrics & monitoring

Stabilize the site

70

Page 71: Scaling Etsy: What Went Wrong, What Went Right

The Way Forward: Part 1

StatsD
http://github.com/etsy/statsd

Stabilize the site

71
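
For readers unfamiliar with StatsD, here is a minimal client sketch. StatsD accepts plain-text datagrams of the form `name:value|type` over UDP, by default on port 8125. The metric names `site.logins` and `site.search_time` and the helper names are invented for illustration; this is not Etsy's client code.

```python
import socket


def format_metric(name: str, value: int, metric_type: str) -> str:
    """Build one StatsD datagram payload, e.g. 'site.logins:1|c'."""
    return f"{name}:{value}|{metric_type}"


def send_metric(payload: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget send over UDP.

    Errors are swallowed on purpose: in this sketch, a metrics outage
    should never break the code path being measured.
    """
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.sendto(payload.encode("ascii"), (host, port))
    except OSError:
        pass


counter = format_metric("site.logins", 1, "c")        # counter increment
timer = format_metric("site.search_time", 320, "ms")  # timing in milliseconds
send_metric(counter)
send_metric(timer)
```

Because UDP needs no connection or acknowledgement, instrumenting a code path this way adds very little overhead.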

Page 72: Scaling Etsy: What Went Wrong, What Went Right

The Way Forward: Part 1

Upgrade database hardware vertically as far as possible

Stabilize the site

72

Page 73: Scaling Etsy: What Went Wrong, What Went Right

The Way Forward: Part 1

Give developers production access to help troubleshoot problems

Stabilize the site

73

Page 74: Scaling Etsy: What Went Wrong, What Went Right

The Way Forward: Part 2

Continuous Deployment

74

Page 75: Scaling Etsy: What Went Wrong, What Went Right

The Way Forward: Part 2

Any engineer can deploy to prod (generally happens 25+ times per day)

Continuous Deployment

75

Page 76: Scaling Etsy: What Went Wrong, What Went Right

The Way Forward: Part 2

Deployinator
http://github.com/etsy/deployinator

Continuous Deployment

76

Page 77: Scaling Etsy: What Went Wrong, What Went Right

The Way Forward: Part 2

One button that deploys the site

Continuous Deployment

77

Page 78: Scaling Etsy: What Went Wrong, What Went Right

The Way Forward: Part 2

Small changesets, deployed frequently

Continuous Deployment

78

Page 79: Scaling Etsy: What Went Wrong, What Went Right

The Way Forward: Part 2

Requires solid tests, good communication

Continuous Deployment

79

Page 80: Scaling Etsy: What Went Wrong, What Went Right

The Way Forward: Part 2

Distributed developer-driven QA

Continuous Deployment

80

Page 81: Scaling Etsy: What Went Wrong, What Went Right

The Way Forward: Part 3

Circumvent Sprouter

81

Page 82: Scaling Etsy: What Went Wrong, What Went Right

The Way Forward: Part 3

Object-Relational Mapping (ORM)

Circumvent Sprouter

82

Page 83: Scaling Etsy: What Went Wrong, What Went Right

The Way Forward: Part 3

aka “The Vietnam of Computer Science” (Google it)

Circumvent Sprouter

83

Page 84: Scaling Etsy: What Went Wrong, What Went Right

The Way Forward: Part 3

Front-end PHP talks directly to database via ORM (also written in PHP)

Circumvent Sprouter

84

Page 85: Scaling Etsy: What Went Wrong, What Went Right

The Way Forward: Part 3

ORM can cache where appropriate (as can front end)

Circumvent Sprouter

85
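
To make the contrast with Sprouter concrete, here is a toy data-access layer that talks to the database directly and caches reads where appropriate. It is only a sketch. Etsy's ORM was written in PHP and is not shown in the talk. Python and an in-memory SQLite database are stand-ins here, and the `ListingStore` class, the `listings` table, and the `db_reads` counter are hypothetical.

```python
import sqlite3
from typing import Dict, Optional


class ListingStore:
    """Toy data-access layer: plain SQL behind methods, with a read-through cache."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn
        self._cache: Dict[int, dict] = {}
        self.db_reads = 0  # counts real queries, for demonstration only

    def save(self, listing_id: int, title: str) -> None:
        self._conn.execute(
            "INSERT OR REPLACE INTO listings (id, title) VALUES (?, ?)",
            (listing_id, title),
        )
        self._conn.commit()
        self._cache.pop(listing_id, None)  # invalidate the cached copy on write

    def find(self, listing_id: int) -> Optional[dict]:
        if listing_id in self._cache:
            return self._cache[listing_id]
        self.db_reads += 1
        row = self._conn.execute(
            "SELECT id, title FROM listings WHERE id = ?", (listing_id,)
        ).fetchone()
        if row is None:
            return None
        record = {"id": row[0], "title": row[1]}
        self._cache[listing_id] = record
        return record


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE listings (id INTEGER PRIMARY KEY, title TEXT)")
store = ListingStore(conn)
store.save(1, "Hand-knit scarf")
first = store.find(1)   # reads from the database
second = store.find(1)  # served from the cache
store.save(1, "Hand-knit wool scarf")
third = store.find(1)   # cache was invalidated, so this reads again
```

The query, the cache policy, and the calling code all live in one codebase and deploy together, with no separate daemon to synchronize.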

Page 86: Scaling Etsy: What Went Wrong, What Went Right

The Way Forward: Part 4

Database Sharding

86

Page 87: Scaling Etsy: What Went Wrong, What Went Right

The Way Forward: Part 4

Etsy has a lot of DNA from Flickr, including their DB sharding scheme

Database Sharding

87

Page 88: Scaling Etsy: What Went Wrong, What Went Right

The Way Forward: Part 4

Based on MySQL

Database Sharding

88

Page 89: Scaling Etsy: What Went Wrong, What Went Right

The Way Forward: Part 4

Battle-tested, well-known

Database Sharding

89

Page 90: Scaling Etsy: What Went Wrong, What Went Right

The Way Forward: Part 4

Scales horizontally to infinity (or close enough)

Database Sharding

90

Page 91: Scaling Etsy: What Went Wrong, What Went Right

The Way Forward: Part 4

No single points of failure (master-master replication)

Database Sharding

91
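
The slides do not describe the mechanics of the sharding scheme, so the following is only one plausible shape for it: a directory (lookup index) that maps each owner to a shard, where each shard is a pair of masters. The class name, host names, round-robin placement, and odd/even master preference are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple


class ShardDirectory:
    """Toy directory-based sharding: an index maps each owner to a shard,
    and each shard is a pair of masters that replicate to each other."""

    def __init__(self, shards: List[Tuple[str, str]]) -> None:
        self._shards = shards             # shard number -> (master_a, master_b)
        self._index: Dict[int, int] = {}  # owner id -> shard number
        self._down: Set[str] = set()      # hosts currently marked unavailable

    def assign(self, owner_id: int) -> int:
        """Place a new owner on a shard (round-robin by arrival order here)."""
        if owner_id not in self._index:
            self._index[owner_id] = len(self._index) % len(self._shards)
        return self._index[owner_id]

    def mark_down(self, host: str) -> None:
        self._down.add(host)

    def host_for(self, owner_id: int) -> str:
        """Pick a live master for the owner's shard."""
        pair = self._shards[self._index[owner_id]]
        # Prefer one side deterministically so an owner's reads and writes
        # normally land on the same master; fall back to the other side.
        preferred = pair[owner_id % 2]
        fallback = pair[(owner_id + 1) % 2]
        for host in (preferred, fallback):
            if host not in self._down:
                return host
        raise RuntimeError("both masters for this shard are down")


directory = ShardDirectory([("db1a", "db1b"), ("db2a", "db2b")])
directory.assign(100)  # first owner  -> shard 0
directory.assign(101)  # second owner -> shard 1
before = directory.host_for(101)
directory.mark_down(before)
after = directory.host_for(101)  # the surviving master of the same pair
```

Because placement is recorded in the index rather than computed from the id, adding shards later does not move existing owners. Losing one master leaves its shard reachable through the other.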

Page 92: Scaling Etsy: What Went Wrong, What Went Right

The Way Forward: Part 4

Gradually phase out Sprouter, phase in ORM / sharded data

Database Sharding

92

Page 93: Scaling Etsy: What Went Wrong, What Went Right

Sprouter: Timeline

Fall ’07: Idea first discussed
Spring ’08: Alpha version debuts
Fall ’08: Released in production
Spring ’09: Sprouter deprecated

93

Page 94: Scaling Etsy: What Went Wrong, What Went Right

Sprouter: Timeline

Fall ’07: Idea first discussed
Spring ’08: Alpha version debuts
Fall ’08: Released in production
Spring ’09: Sprouter deprecated
Spring ’11: Sprouter turned off

94

Page 95: Scaling Etsy: What Went Wrong, What Went Right

95

Page 96: Scaling Etsy: What Went Wrong, What Went Right

Lessons Learned

96

Page 97: Scaling Etsy: What Went Wrong, What Went Right

Etsy Architecture: 2007

Operating System:

Database:

Webserver:

Languages:

97

Page 98: Scaling Etsy: What Went Wrong, What Went Right

Etsy Architecture: 2011

Operating System:

Database:

Webserver:

Languages:

98

Page 99: Scaling Etsy: What Went Wrong, What Went Right

Open & trusting > closed & afraid (DevOps DevOps DevOps)

99

Page 100: Scaling Etsy: What Went Wrong, What Went Right

Front end/database interaction is too critical to take chances on novel/untested solutions

100

Page 101: Scaling Etsy: What Went Wrong, What Went Right

Side corollary: If you’re doing something “clever”, you’re probably doing it wrong

101

Page 102: Scaling Etsy: What Went Wrong, What Went Right

The architectural decisions you make today will have large impact long after you’re gone

102

Page 103: Scaling Etsy: What Went Wrong, What Went Right

No architectural hole is so deep that proven scaling strategies don’t exist for digging out

103

Page 104: Scaling Etsy: What Went Wrong, What Went Right

We are probably making decisions today that will be the subject of a similar talk in 2015

Acknowledgement

104

Page 105: Scaling Etsy: What Went Wrong, What Went Right

Learn More:
http://codeascraft.etsy.com/
@codeascraft

105

Page 106: Scaling Etsy: What Went Wrong, What Went Right

Etsy is hiring!
http://www.etsy.com/careers
@etsy

106

