Page 1: Cataldo DataFlowInTheDataCenter 2013PtolemyMiniconferencechess.eecs.berkeley.edu/...DataFlowInTheDataCenter...Hadoop at a Glance • Scales well for large data sets • Industry standard

DATA FLOW IN THE DATA CENTER

wealthfront.com

Adam Cataldo @djscrooge November 7, 2013


Wealthfront & Me

•  Wealthfront is the largest and fastest-growing software-based financial advisor

•  We manage the first $10,000 for free and the rest for only 0.25% a year

•  Our automated trading system continuously rebalances a portfolio of low-cost ETFs, with continuous tax-loss harvesting for accounts over $100,000

•  I built the data platform we use for website optimization, investment research, business analytics, and operations


Why the Ptolemy conference?

•  This is not a talk about modeling, simulation, and design of concurrent, real-time embedded systems

•  This is a talk about the design of a data analytics system

•  It turns out many of the patterns are the same in both fields


MapReduce & Hadoop


Hadoop at a Glance

•  Scales well for large data sets

•  Industry standard for data processing

•  Optimized for high-throughput batch processing

•  Long latency

•  Overkill for small data sets


Cascading


Why Cascading?

•  Most real problems require multiple MapReduce jobs

•  Provides a data-flow abstraction to specify data transformations

•  Builds on standard database concepts: joins, groups, and so on

•  Provides decent testing capabilities, which we’ve extended


From SQL to Cascading

select name from users join mails on users.email = mails.to

Pipe joined = new CoGroup(users, new Fields("email"), mails, new Fields("to"));

Pipe name = new Retain(joined, new Fields("name"));


Cascading to Hadoop

[Diagram: the mails and users inputs feed the mails mappers and users mappers, which feed the join reducers, which produce the result]
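
The flow on this slide (one set of mappers per input, then join reducers) is the shape of a reduce-side join. Below is a minimal plain-Java sketch of that technique, simulated in memory with no Hadoop or Cascading dependency. The class name, the record layout (each row is a {joinKey, payload} pair), and the sample data are assumptions made for illustration; this is not necessarily the exact plan Cascading generates.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Simulates a reduce-side join in memory. Record shapes are hypothetical.
final class ReduceSideJoinSketch {

    // A record tagged with the input it came from ("users" or "mails").
    static final class Tagged {
        final String source;
        final String payload;

        Tagged(String source, String payload) {
            this.source = source;
            this.payload = payload;
        }
    }

    // Map phase: key each record by its join field and tag it with its source.
    // Each input row is {joinKey, payload}.
    static void map(String source, List<String[]> rows, Map<String, List<Tagged>> shuffle) {
        for (String[] row : rows) {
            shuffle.computeIfAbsent(row[0], k -> new ArrayList<>())
                   .add(new Tagged(source, row[1]));
        }
    }

    // Reduce phase: for each key, emit the user's name once per matching mail.
    static List<String> reduce(Map<String, List<Tagged>> shuffle) {
        List<String> names = new ArrayList<>();
        for (List<Tagged> group : shuffle.values()) {
            List<String> userNames = new ArrayList<>();
            int mailCount = 0;
            for (Tagged t : group) {
                if (t.source.equals("users")) {
                    userNames.add(t.payload);
                } else {
                    mailCount++;
                }
            }
            for (String name : userNames) {
                for (int i = 0; i < mailCount; i++) {
                    names.add(name);
                }
            }
        }
        return names;
    }

    // Runs both phases for: select name from users join mails on users.email = mails.to
    static List<String> join(List<String[]> users, List<String[]> mails) {
        Map<String, List<Tagged>> shuffle = new TreeMap<>(); // grouped and sorted by key
        map("users", users, shuffle);
        map("mails", mails, shuffle);
        return reduce(shuffle);
    }
}
```

Here the users rows are assumed to be {email, name} and the mails rows {to, subject}. A user with no mail, or a mail with no matching user, produces no output, which matches inner-join semantics.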


Getting data ready for Cascading

[Diagram: an extract, transform, load pipeline. Labels: Production MySQL DB, Avro files, Amazon Simple Storage Service]


Why Avro?

•  A compact data format, capable of storing large data sets

•  We compress with Google Snappy

•  Compressed files remain splittable into 128 MB chunks

•  De facto file format for Hadoop


Running Cascading Jobs

[Diagram. Labels: Production MySQL DB, Amazon Simple Storage Service, Elastic MapReduce, Redshift data warehouse, Online Systems]


What do we do with the data?

•  We use it to track how well the investment product is performing

•  We use it to track how well the business is performing

•  We use it to monitor our production systems

•  We use it to test how well new features perform on the website


Bandit Testing

•  When rolling new features out, we expose the new version to some users and the old version to the rest

•  We monitor what percent of users “convert”: sign up, fund account, etc.

•  We gradually send more traffic to the winning variant of the experiment

•  Similar to A/B testing, but way faster
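
The routing and monitoring steps in these bullets can be sketched as a weighted choice plus a conversion-rate calculation. This is a minimal sketch, assuming the traffic weights are already known and sum to 1. The class and method names are hypothetical, and how the weights get updated over time is left out.

```java
import java.util.Random;

// Weighted routing of users to experiment variants. Names are hypothetical.
final class VariantRouter {

    // Picks a variant index given traffic weights that sum to 1 and a
    // uniform draw u in [0, 1). Variant i is chosen with probability weights[i].
    static int pickVariant(double[] weights, double u) {
        double cumulative = 0.0;
        for (int i = 0; i < weights.length; i++) {
            cumulative += weights[i];
            if (u < cumulative) {
                return i;
            }
        }
        return weights.length - 1; // guards against floating-point rounding
    }

    // Fraction of exposed users who converted (signed up, funded an account, etc.).
    static double conversionRate(long conversions, long exposures) {
        return exposures == 0 ? 0.0 : (double) conversions / exposures;
    }

    public static void main(String[] args) {
        double[] weights = {0.8, 0.2}; // assumed split: 80% of traffic to variant 0
        Random random = new Random();
        int[] exposures = new int[weights.length];
        for (int user = 0; user < 10_000; user++) {
            exposures[pickVariant(weights, random.nextDouble())]++;
        }
        System.out.println("variant 0: " + exposures[0] + ", variant 1: " + exposures[1]);
    }
}
```

Passing the uniform draw in as a parameter keeps pickVariant deterministic, which makes it straightforward to test.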


Does anyone know where the name bandit testing comes from?


Thompson Sampling

1.  Estimate the probability for each variant of the experiment that it performs best, using Bayesian inference

2.  Weight the percentage of traffic sent to each variant according to this probability

3.  End the experiment when one variant has a 95% chance of winning, or when the losing arms have no more than a 5% chance of beating the winner by more than 1%

4.  In 2012, Kaufmann et al. proved the asymptotic optimality of Thompson sampling
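
Steps 1 to 3 can be sketched for the common case of a binary conversion outcome. This is a minimal sketch under stated assumptions, not the production implementation. It assumes a uniform Beta(1, 1) prior on each variant's conversion rate, estimates the win probabilities by Monte Carlo sampling from the Beta posteriors, and implements only the first stopping condition (the 95% threshold). The class and method names are hypothetical, and the Beta sampler works only for integer parameters.

```java
import java.util.Random;

// Beta-Bernoulli Thompson sampling sketch. Names and priors are assumptions.
final class ThompsonSamplingSketch {

    // Draws from Gamma(shape, 1) for a positive integer shape, as the sum of
    // `shape` independent Exp(1) draws.
    static double sampleGamma(int shape, Random random) {
        double sum = 0.0;
        for (int i = 0; i < shape; i++) {
            sum -= Math.log(1.0 - random.nextDouble()); // 1 - u lies in (0, 1]
        }
        return sum;
    }

    // Draws from Beta(a, b) for positive integers a and b via two Gamma draws.
    static double sampleBeta(int a, int b, Random random) {
        double x = sampleGamma(a, random);
        double y = sampleGamma(b, random);
        return x / (x + y);
    }

    // Step 1: Monte Carlo estimate of the probability that each variant has
    // the highest conversion rate, under a uniform Beta(1, 1) prior.
    static double[] probabilityBest(int[] conversions, int[] exposures, int draws, Random random) {
        int[] wins = new int[conversions.length];
        for (int d = 0; d < draws; d++) {
            int best = 0;
            double bestRate = -1.0;
            for (int v = 0; v < conversions.length; v++) {
                double rate = sampleBeta(1 + conversions[v],
                                         1 + exposures[v] - conversions[v], random);
                if (rate > bestRate) {
                    bestRate = rate;
                    best = v;
                }
            }
            wins[best]++;
        }
        double[] probabilities = new double[wins.length];
        for (int v = 0; v < wins.length; v++) {
            probabilities[v] = (double) wins[v] / draws;
        }
        return probabilities; // Step 2: use these as the traffic weights.
    }

    // Step 3, first condition only: index of a variant whose win probability
    // is at least the threshold (e.g. 0.95), or -1 to keep the experiment running.
    static int winner(double[] probabilities, double threshold) {
        for (int v = 0; v < probabilities.length; v++) {
            if (probabilities[v] >= threshold) {
                return v;
            }
        }
        return -1;
    }
}
```

For example, with 200 conversions out of 1,000 exposures for one variant and 100 out of 1,000 for another, probabilityBest should put nearly all of the weight on the first variant, and winner with a 0.95 threshold should return its index.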


What’s Redshift?

•  Amazon’s cloud-based data warehouse database

•  To support ad-hoc analysis, we copy all raw and computed data into Redshift

•  It’s a column-oriented database, optimized for aggregate queries and joins over large batch sizes


What are the technical challenges?

•  Testing complicated analytics computations is non-trivial

-  We ended up writing a small library to make testing Cascading jobs simpler

•  Running multiple Hadoop jobs on large datasets takes a long time

-  We use Spark for prototyping, to get a speedup

•  Your assumptions about the constraints on the data are always wrong


Where’s this heading?

•  We have a unique collection of consumer web data and financial data

•  There are many ways we can combine this data to make our product better

•  Hypothetical example: suggest portfolio risk adjustments based on a client’s withdrawal patterns


How is this relevant?

•  We use data flow as the primary model of computation

•  While the time scales are much slower, we have timing constraints, called SLAs, imposed by production use cases

•  We have to make sure all code can safely execute concurrently on multiple machines, cores, and threads


Disclosure

Nothing in this presentation should be construed as a solicitation or offer, or recommendation, to buy or sell any security. Financial advisory services are only provided to investors who become Wealthfront clients pursuant to a written agreement, which investors are urged to read and carefully consider in determining whether such agreement is suitable for their individual facts and circumstances. Past performance is no guarantee of future results, and any hypothetical returns, expected returns, or probability projections may not reflect actual future performance. Investors should review Wealthfront’s website for additional information about advisory services.
