Post on 01-Aug-2020

Feature Engineering: The Dark Art of Data Science

Josh Wills // Senior Director of Data Science

2 © 2014 Cloudera, Inc. All rights reserved.

About Me

The Two Kinds of Data Scientists

• The Lab
  • Statisticians who got really good at programming
  • Neuroscientists, geneticists, social scientists, etc.

• The Factory
  • Software engineers who were in the wrong place at the wrong time

A Brief History of Data Products

Scorecards

Spell Correction

Virtual Personal Assistants

• Pipeline of not-so-loosely coupled ML systems
  • Speech recognition
  • Semantic decoding
  • Intent Model
  • Dialogue Rules
  • Language Generation
  • Speech Synthesis

Machine Learning vs. Feature Extraction

Talking About Feature Engineering

Brainwash

Good: More Features == Better Performance

Good: Feature Development Scales

Bad: Lots of Grunt Work

Deployment: Even More Grunt Work

Analytic Data Model: Giant Spreadsheet

Operational Data Model: 3NF

The Impedance Mismatch

What Do We Need?

One Solution I Thought Might Work

Inventing on Principle

A Data Model For Feature Engineering

Bridging the Gap

Spell Correction Revisited

A Simple Star Schema for Search

A Supernova Schema for Search

Beyond Analytic SQL: Nested SQL Sessions

Exhibit (http://github.com/jwills/exhibit)

Operational Supernovas

ln(supernova)

Data Science and the Holy Grail

Feature Engineering: Within / Across

1.  Generate features for normalization/segmentation.

2.  Generate normalization constants and segments using the result of Step 1.

3.  Generate input features using the original data and the result of Step 2.

4.  Generate a model using the result of Step 3.
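
The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy `users` records, the field names (`purchases`, `churned`), the choice of z-score normalization, and the nearest-centroid stand-in for "a model" are all hypothetical and do not come from the slides.

```python
from statistics import mean, pstdev

# Hypothetical toy data: one nested record per entity (assumption, not from the talk).
users = {
    "a": {"purchases": [10.0, 20.0, 30.0], "churned": 0},
    "b": {"purchases": [5.0], "churned": 1},
    "c": {"purchases": [40.0, 60.0], "churned": 0},
    "d": {"purchases": [2.0, 4.0], "churned": 1},
}
FEATURES = ("n", "avg")

def within_features(record):
    """Step 1: features computed within a single entity's record."""
    p = record["purchases"]
    return {"n": float(len(p)), "avg": mean(p)}

def across_constants(rows):
    """Step 2: normalization constants (mean, std) computed across all entities."""
    consts = {}
    for name in FEATURES:
        values = [r[name] for r in rows]
        consts[name] = (mean(values), pstdev(values) or 1.0)  # avoid dividing by 0
    return consts

def input_features(record, consts):
    """Step 3: combine the original record with the Step 2 constants (z-scores)."""
    raw = within_features(record)
    return [(raw[k] - consts[k][0]) / consts[k][1] for k in FEATURES]

def fit_centroids(X, y):
    """Step 4: a stand-in model, one centroid per label."""
    centroids = {}
    for label in set(y):
        pts = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [mean(col) for col in zip(*pts)]
    return centroids

def predict(centroids, x):
    """Return the label whose centroid is closest to x (squared distance)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda lab: dist(centroids[lab]))

consts = across_constants([within_features(r) for r in users.values()])
X = [input_features(r, consts) for r in users.values()]
y = [r["churned"] for r in users.values()]
model = fit_centroids(X, y)
predictions = [predict(model, x) for x in X]
```

The point of the sketch is the shape of the workflow: Step 1 looks inside one entity at a time, Step 2 needs a pass over every entity, and Step 3 cannot run until Step 2 has finished, which is why the pipeline alternates between within-entity and across-entity computation.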

Feature Engineering IDE

Data Science as ETL Workflow

MetLife’s Wall

• A 360-degree view of all customer data and interactions

• Backed by a NoSQL, supernova-style data model (MongoDB, in this instance)

• Multiple Use Cases
  • Customer Support
  • Exploratory Analytics/Metrics Definitions
  • Profiles/Segmentation

Queryable Wall

Integrated Model and Feature Store

Feature Application and Evaluation

Exhibitor (http://github.com/jhlch/exhibitor)

Thank you.