Post on 01-Aug-2020
Feature Engineering: The Dark Art of Data Science
Josh Wills // Senior Director of Data Science
2 © 2014 Cloudera, Inc. All rights reserved.
About Me
The Two Kinds of Data Scientists
• The Lab
  • Statisticians who got really good at programming
  • Neuroscientists, geneticists, social scientists, etc.
• The Factory
  • Software engineers who were in the wrong place at the wrong time
A Brief History of Data Products
Scorecards
Spell Correction
Virtual Personal Assistants
• Pipeline of not-so-loosely coupled ML systems
  • Speech recognition
  • Semantic decoding
  • Intent model
  • Dialogue rules
  • Language generation
  • Speech synthesis
Machine Learning vs. Feature Extraction
Talking About Feature Engineering
Brainwash
Good: More Features == Better Performance
Good: Feature Development Scales
Bad: Lots of Grunt Work
Deployment: Even More Grunt Work
Analytic Data Model: Giant Spreadsheet
Operational Data Model: 3NF
The Impedance Mismatch
What Do We Need?
One Solution I Thought Might Work
Inventing on Principle
A Data Model For Feature Engineering
Bridging the Gap
Spell Correction Revisited
A Simple Star Schema for Search
A Supernova Schema for Search
Beyond Analytic SQL: Nested SQL Sessions
Exhibit (http://github.com/jwills/exhibit)
Operational Supernovas
ln(supernova)
Data Science and the Holy Grail
Feature Engineering: Within / Across
1. Generate features for normalization/segmentation.
2. Generate normalization constants and segments using the result of Step 1.
3. Generate input features using the original data and the result of Step 2.
4. Generate a model using the result of Step 3.
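The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the workflow from the talk: the event records, the choice of a per-user mean as the "within" feature, the population mean and standard deviation as the "across" constants, and the one-variable least-squares fit standing in for the model are all hypothetical.

```python
from collections import defaultdict
from math import isfinite
from statistics import mean, pstdev

# Hypothetical raw data: (user_id, query_length, clicked) per search event.
events = [
    ("u1", 2.0, 0), ("u1", 4.0, 0), ("u1", 6.0, 0),
    ("u2", 10.0, 1), ("u2", 30.0, 1),
    ("u3", 12.0, 1),
]

def within_features(rows):
    """Step 1: per-user (within-entity) aggregate used for normalization."""
    grouped = defaultdict(list)
    for user, value, _ in rows:
        grouped[user].append(value)
    return {user: mean(values) for user, values in grouped.items()}

def across_constants(per_user):
    """Step 2: normalization constants computed across all users."""
    values = list(per_user.values())
    return mean(values), pstdev(values)

def input_features(rows, mu, sigma):
    """Step 3: rescale the original values with the Step 2 constants."""
    scale = sigma if sigma > 0 else 1.0  # guard against a constant column
    return [((value - mu) / scale, label) for _, value, label in rows]

def fit_line(pairs):
    """Step 4: a stand-in model, ordinary least squares on one feature."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        return 0.0, my
    slope = sum((x - mx) * (y - my) for x, y in pairs) / sxx
    return slope, my - slope * mx

per_user = within_features(events)        # Step 1
mu, sigma = across_constants(per_user)    # Step 2
features = input_features(events, mu, sigma)  # Step 3
slope, intercept = fit_line(features)     # Step 4
```

Note that Steps 1 and 2 each require a full pass over the data before Step 3 can start, which is why the whole procedure resembles a multi-stage ETL job more than a single model-fitting call.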
Feature Engineering IDE
Data Science as ETL Workflow
MetLife’s Wall
• A 360-degree view of all customer data and interactions
• Backed by a NoSQL, supernova-style data model (MongoDB, in this instance)
• Multiple use cases
  • Customer support
  • Exploratory analytics/metrics definitions
  • Profiles/segmentation
Queryable Wall
Integrated Model and Feature Store
Feature Application and Evaluation
Exhibitor (http://github.com/jhlch/exhibitor)
Thank you.