Date post: 01-Jan-2021
AUTO FEATURE ENGINEERING RAPID FEATURE HARVESTING USING DFS AND DATA ENGINEERING TECHNIQUES Ananth Gundabattula Senior Architect @CBA YOW DATA 2019
Transcript
Page 1

AUTO FEATURE ENGINEERING
RAPID FEATURE HARVESTING USING DFS AND DATA ENGINEERING TECHNIQUES

Ananth Gundabattula
Senior Architect @CBA

YOW DATA 2019

Page 2

FEATURE ENGINEERING – TWO PARADIGMS

Classical machine learning

• Feature engineering: the art of creating features from raw data using domain knowledge
• Explainable
• Relational and human behavioral use cases (ex: reporting)
• Expensive: easily one of the top two cost contributors in building a model, as it is driven by human intuition

Deep learning

• Feature engineering is intrinsic to the model
• Not explainable (yet!)
• Problem/cost shifts towards building the right parameters of the neural net model

Page 3

LIFETIME OF A FEATURE

Initiate
• Understand the requirement
• Collect raw data
• Cleanse

Design
• Explore hypothesis
• Preprocessing (ex: impute)
• Domain driven
• Model specific (Ex: One hot

Build
• Real time / Batch
• Persisted / Run time
• Exactly

Govern
• Lineage
• Impact analysis
• Monitor decay
• Retire

Page 4

CAN WE AUTOMATE PARTS OF THIS LIFECYCLE?

• Reduce cost
• Possibly govern better
• Risk assess better
• Maintain it better

Page 5

TOY EXAMPLE SCHEMA

Page 6

TOY EXAMPLE DEMO – DFS ALGORITHM
GENERATE FEATURES FOR THE CUSTOMER ENTITY

Page 7

SAMPLE FEATURE ANALYSIS

Customer – MAX(sessions.MEAN(transactions.amount))

The maximum of the average spend across each session of a customer

Customer – MIN(sessions.NUM_UNIQUE(transactions.products.MODE(transactions.session_id)))

The minimum of the number of unique products that were purchased in sessions with the most frequent number of transactions.

Transactions – sessions.customers.NUM_UNIQUE(sessions.WEEKDAY(session_start))

The number of unique weekdays that a customer typically shops on. Is he/she a weekend customer?

Transactions – sessions.customers.SUM(sessions.MIN(transactions.amount))

Sum of the minimum transaction amounts per session until now
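To make the first feature on this slide concrete, here is a minimal plain-Python sketch of MAX(sessions.MEAN(transactions.amount)). It does not use the Featuretools API; the rows, the column names and the helper `max_of_session_means` are hypothetical and only illustrate how the inner and outer aggregations nest.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy rows; column names follow the feature names on the slide.
sessions = [
    {"session_id": 1, "customer_id": 1},
    {"session_id": 2, "customer_id": 1},
    {"session_id": 3, "customer_id": 2},
]
transactions = [
    {"session_id": 1, "amount": 10.0},
    {"session_id": 1, "amount": 30.0},
    {"session_id": 2, "amount": 50.0},
    {"session_id": 3, "amount": 5.0},
    {"session_id": 3, "amount": 15.0},
]


def max_of_session_means(sessions, transactions):
    # Inner aggregation: MEAN(transactions.amount) for every session.
    amounts = defaultdict(list)
    for t in transactions:
        amounts[t["session_id"]].append(t["amount"])
    session_mean = {sid: mean(vals) for sid, vals in amounts.items()}

    # Outer aggregation: MAX of the session means for every customer.
    result = {}
    for s in sessions:
        if s["session_id"] not in session_mean:
            continue  # session without transactions contributes nothing
        m = session_mean[s["session_id"]]
        cid = s["customer_id"]
        result[cid] = max(result.get(cid, m), m)
    return result


feature = max_of_session_means(sessions, transactions)
```

Customer 1 has session means of 20 and 50, so the feature value is 50; customer 2 has a single session with mean 10.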

Page 8

DFS – DEEP FEATURE SYNTHESIS ALGORITHM

Page 9

DFS ALGORITHM – BUILDING BLOCKS

Direct Feature (JOIN and INHERIT)

Tables: (product_id, product_price) and (order_id, product_id, order_date)

JOIN → (order_id, product_id, order_date, product_price)

As it is a forward relationship

Aggregation Feature (JOIN + GROUP BY)

Table: (order_id, cust_id, order_date, product_price)

GROUP BY (Product price) + Aggregation Primitive (ex: SUM) → (order_id, cust_id, order_date, SUM(product_price))

As it is a backward relationship
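The two building blocks above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the Featuretools implementation: the sample rows are invented, and grouping the aggregation by `cust_id` is an assumption about what the slide's diagram intends.

```python
from collections import defaultdict

# Hypothetical sample data using the slide's column names.
products = {"p1": 20.0, "p2": 5.0}  # product_id -> product_price
orders = [
    {"order_id": 1, "cust_id": "c1", "product_id": "p1"},
    {"order_id": 2, "cust_id": "c1", "product_id": "p2"},
    {"order_id": 3, "cust_id": "c2", "product_id": "p2"},
]

# Direct feature (forward relationship): each order JOINs to its product
# and inherits product_price as a new column.
orders_with_price = [
    {**o, "product_price": products[o["product_id"]]} for o in orders
]

# Aggregation feature (backward relationship): GROUP BY the parent key and
# apply an aggregation primitive, here SUM(product_price).
sum_price_by_customer = defaultdict(float)
for o in orders_with_price:
    sum_price_by_customer[o["cust_id"]] += o["product_price"]
```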

Page 10

DFS ALGORITHM – BUILDING BLOCKS CONTD.

Transform Feature (Primitive on same row)

Table: (order_id, order_date, product_id)

MONTH function on datetime type → (order_id, order_date, product_id, MONTH(order_date))
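A transform feature only looks at values in the same row. The sketch below applies a MONTH primitive to a datetime column; the rows and the helper `month_transform` are hypothetical.

```python
from datetime import date

# Hypothetical order rows.
orders = [
    {"order_id": 1, "order_date": date(2019, 5, 6), "product_id": "p1"},
    {"order_id": 2, "order_date": date(2019, 11, 20), "product_id": "p2"},
]


def month_transform(rows, column="order_date"):
    # Row-wise transform: derive MONTH(<column>) without touching other rows.
    name = f"MONTH({column})"
    return [{**row, name: row[column].month} for row in rows]


with_month = month_transform(orders)
```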

Page 11

DFS ALGORITHM – STACK TO HARVEST

R-FEATURES (Backward relationship)

D-FEATURES (Forward relationship)

E-FEATURES (Entity Transforms)

[Diagram: R-FEAT, E-FEAT and D-FEAT blocks stacked repeatedly at each level. Further levels ignored for brevity…]
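The stacking idea can be sketched as a small recursive name generator. To stay short, this sketch stacks only R-features (aggregations over backward relationships); D-features and E-features are left out. The schema dictionaries and the function `synthesize` are hypothetical and are not part of any library.

```python
# Hypothetical schema: parent entity -> child entities (backward relationships).
CHILDREN = {
    "customers": ["sessions"],
    "sessions": ["transactions"],
    "transactions": [],
}
# Hypothetical numeric base columns per entity.
NUMERIC_COLUMNS = {"customers": [], "sessions": [], "transactions": ["amount"]}
AGG_PRIMITIVES = ["MEAN", "MAX"]


def synthesize(entity, max_depth):
    """Return feature names for `entity`, stacking aggregations up to max_depth."""
    features = list(NUMERIC_COLUMNS[entity])
    if max_depth == 0:
        return features
    for child in CHILDREN[entity]:
        # Every feature of the child (built one level shallower) can be
        # aggregated again on the parent: this is the stacking step.
        for inner in synthesize(child, max_depth - 1):
            for agg in AGG_PRIMITIVES:
                features.append(f"{agg}({child}.{inner})")
    return features


customer_features = synthesize("customers", 2)
```

With two aggregation primitives and one numeric column, depth 2 already yields four customer features; each additional primitive, column or level multiplies that count.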

Page 12

DESIGN PHASE CONSTRUCTS

Page 13

DOMAIN - UNDERSTANDING OF THE BASE DATA TYPES

Not a permutation and combination library. Data types are taken into account when fed as metadata:

• Datetime – time functions apply
• Categorical types – ex: NUM_UNIQUE on categorical but not on amounts
• Time deltas
• IDs, Lat/Long
• ZIP code, country code
• IP address, phone number
• Email addresses
• File path
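One way to picture type-aware generation is a lookup from a column's declared type to the primitives that make sense for it. The mapping below is a hypothetical illustration, not the catalogue of any real library.

```python
# Hypothetical mapping from a declared column type to applicable primitives.
APPLICABLE = {
    "numeric": ["SUM", "MEAN", "MAX"],
    "categorical": ["NUM_UNIQUE", "MODE"],
    "datetime": ["MONTH", "WEEKDAY"],
}

# Hypothetical column metadata fed to the generator.
columns = {
    "amount": "numeric",
    "device": "categorical",
    "session_start": "datetime",
}


def candidate_features(columns):
    # Only pair a primitive with a column when the column type allows it,
    # rather than enumerating every primitive/column combination.
    return [
        f"{primitive}({name})"
        for name, col_type in columns.items()
        for primitive in APPLICABLE.get(col_type, [])
    ]


candidates = candidate_features(columns)
```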

Page 14

DOMAIN FEATURES – SEEDING BASE FEATURES

Columns: transaction_id, transaction_amount, expensive_purchase, inexpensive_purchase, multiple_of_10s

• Expensive purchase – Amount > 125
• Inexpensive purchase – Amount < 25
• Rounded 10 purchase – Amount % 10 == 0

[Chart: Number of features vs seed features introduced. X-axis: Depth 2, Depth 3, Depth 4, Depth 5. Series: No Seeds, 1 seed, 2 seed, 3 seed. Data labels as extracted, mapping to bars not recoverable: 749, 561, 14494, 705, 517, 13787, 661, 473, 13080, 617, 429, 12373]

SKEW(sessions.PERCENT_TRUE(transactions.amount > 125))
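The seed features on this slide are simple boolean columns that later primitives can build on. The sketch below derives the three seeds using the slide's thresholds and then computes PERCENT_TRUE per session, the inner part of the feature shown above. The rows and helper names are hypothetical.

```python
from collections import defaultdict

# Hypothetical transaction rows.
transactions = [
    {"transaction_id": 1, "session_id": 1, "transaction_amount": 130.0},
    {"transaction_id": 2, "session_id": 1, "transaction_amount": 20.0},
    {"transaction_id": 3, "session_id": 2, "transaction_amount": 50.0},
    {"transaction_id": 4, "session_id": 2, "transaction_amount": 200.0},
    {"transaction_id": 5, "session_id": 2, "transaction_amount": 150.0},
    {"transaction_id": 6, "session_id": 2, "transaction_amount": 300.0},
]


def add_seed_features(row):
    # Thresholds taken from the slide: > 125, < 25, multiple of 10.
    amount = row["transaction_amount"]
    return {
        **row,
        "expensive_purchase": amount > 125,
        "inexpensive_purchase": amount < 25,
        "multiple_of_10s": amount % 10 == 0,
    }


def percent_true_by_session(rows, flag):
    # PERCENT_TRUE: fraction of rows in each session where the flag is True.
    flags = defaultdict(list)
    for r in rows:
        flags[r["session_id"]].append(r[flag])
    return {sid: sum(vals) / len(vals) for sid, vals in flags.items()}


seeded = [add_seed_features(t) for t in transactions]
percent_expensive = percent_true_by_session(seeded, "expensive_purchase")
```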

Page 15

DOMAIN – INTERESTING VALUES / FOCUS ON SUBSET OF ROWS

session_id  customer_id  device
1           1            desktop
2           3            iphone
3           1            iphone
4           2            galaxy
5           4            desktop
6           4            desktop

COUNT(sessions WHERE device = desktop)

COUNT(sessions WHERE device = iphone)
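Using the six session rows from this slide, the conditional counts can be sketched in plain Python. The helper `count_where` is hypothetical; it only illustrates restricting an aggregation to rows that match an interesting value.

```python
from collections import Counter

# (session_id, customer_id, device) rows from the slide.
sessions = [
    (1, 1, "desktop"),
    (2, 3, "iphone"),
    (3, 1, "iphone"),
    (4, 2, "galaxy"),
    (5, 4, "desktop"),
    (6, 4, "desktop"),
]
customer_ids = [1, 2, 3, 4]


def count_where(sessions, device, customer_ids):
    # COUNT(sessions WHERE device = <device>) for every customer;
    # customers with no matching session get 0.
    counts = Counter(cid for _, cid, dev in sessions if dev == device)
    return {cid: counts[cid] for cid in customer_ids}


desktop_counts = count_where(sessions, "desktop", customer_ids)
iphone_counts = count_where(sessions, "iphone", customer_ids)
```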

Page 16

BRING YOUR OWN DOMAIN FUNCTIONS

Input tables: (case_id, customer_id, case type) and (case_id, comments)

(case_id, customer_id, case type) → SQL Aggregation catalogue, SQL Transform catalogue → Features based on SQL function primitives

(case_id, comments) → NLTK Aggregation catalogue, NLTK Transform catalogue → Features based on NLP function primitives

Page 17

FEATURE EXPLOSION

Number of features is a function of:
• Entities
• Rfeat
• Dfeat
• Depth
• EFeat

[Chart: Mock Customer, Airlines and Retail Datasets. X-axis: Mock Customers, Mock Products, Flights, Trip Logs, Retail Customers, Retail Products. Series: Depth 2, Depth 3, Depth 4, Depth 5. Y-axis: 0 to 5000. Data labels not recoverable from the extraction.]

Page 18

DISTRIBUTED FEATURE GENERATION

Ensure each distributed partition contains all the forward and backward related rows of the primary entity in the same partition

Worker 1 Worker 2
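A minimal sketch of this co-partitioning rule, assuming customers are the primary entity and using a modulo on the customer key as a hypothetical placement function. Only backward-related rows (sessions, transactions) are handled here; forward-related tables such as products would additionally have to be made available on each worker, which this sketch does not show.

```python
def partition_of(customer_id, n_workers):
    # Hypothetical placement rule: modulo on the primary entity key.
    return customer_id % n_workers


def co_partition(customers, sessions, transactions, n_workers):
    # Map each session to its owning customer so transactions can follow it.
    owner = {s["session_id"]: s["customer_id"] for s in sessions}
    parts = [
        {"customers": [], "sessions": [], "transactions": []}
        for _ in range(n_workers)
    ]
    for c in customers:
        parts[partition_of(c["customer_id"], n_workers)]["customers"].append(c)
    for s in sessions:
        parts[partition_of(s["customer_id"], n_workers)]["sessions"].append(s)
    for t in transactions:
        cid = owner[t["session_id"]]
        parts[partition_of(cid, n_workers)]["transactions"].append(t)
    return parts


# Hypothetical toy data.
customers = [{"customer_id": i} for i in (1, 2, 3, 4)]
sessions = [
    {"session_id": sid, "customer_id": cid}
    for sid, cid in [(1, 1), (2, 3), (3, 1), (4, 2), (5, 4), (6, 4)]
]
transactions = [
    {"transaction_id": i, "session_id": sid}
    for i, sid in enumerate([1, 1, 2, 3, 4, 5, 6, 6], start=1)
]
partitions = co_partition(customers, sessions, transactions, n_workers=2)
```

Because every session and transaction lands with its customer, each worker can compute that customer's aggregations without reading rows from another partition.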

Page 19

FEATURE SELECTION – A COMPUTE INTENSIVE CHALLENGE

Auto Feature Engineering → Feature Selection → Features of interest → Ready for the build pipelines

• Cost model shifts from human cost to compute cost
• Compute costs are much cheaper than human costs, hence the overall reduced cost
• Compute is spent in materializing a large number of feature columns so that feature selection can select the best-fit features

Feature selection methods:
• Filter
• Wrapper
• Embedded

AutoML could be a logical extension for an end-to-end automation approach. AutoML is itself a very compute-intensive approach.
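As an illustration of the simplest of the three families, a filter method, the sketch below drops feature columns whose variance is not above a threshold. The feature matrix, the column names and `variance_filter` are hypothetical; a real pipeline would likely combine several filters or move on to wrapper and embedded methods.

```python
from statistics import pvariance

# Hypothetical materialized feature matrix: feature name -> column values.
feature_matrix = {
    "COUNT(sessions)": [2, 1, 1, 2],
    "MAX(sessions.MEAN(transactions.amount))": [50.0, 10.0, 22.5, 40.0],
    "NUM_UNIQUE(sessions.currency)": [1, 1, 1, 1],  # constant, carries no signal
}


def variance_filter(feature_matrix, threshold=0.0):
    # Filter method: keep only features whose population variance
    # exceeds the threshold; no model is trained at this stage.
    return [
        name
        for name, values in feature_matrix.items()
        if pvariance(values) > threshold
    ]


selected = variance_filter(feature_matrix)
```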

Page 20

SEMANTIC INTERPRETATION

Page 21

SEMANTIC TREE STRUCTURES

Customer Entity – Aggregate features only

Customer Entity - Mix of aggregate and transform features

Transactions Entity - Direct feature

Page 22

BUILD PHASE CONSTRUCTS

Page 23

RISK & SECURITY - GOVERNANCE

Path suppressions
• Disallow paths to prevent features from using certain stacking paths
• Can be used to align with the data office regulations of the enterprise

Provenance and controls
• Codified feature engineering paths are a better model for provenance
• Simplified data access security is a by-product of this approach
• Better CI/CD controls: break a build if a GDPR violation is part of a feature definition
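A path suppression check of this kind could run as a CI step over the codified feature definitions. The sketch below is a hypothetical illustration: the suppressed paths, the string-based feature names and both helpers are assumptions, not an existing tool.

```python
# Hypothetical list of column paths the data office disallows in features.
SUPPRESSED_PATHS = ("customers.date_of_birth", "customers.zip_code")


def find_violations(feature_definitions, suppressed_paths=SUPPRESSED_PATHS):
    # A feature violates policy if its definition walks through a suppressed path.
    return [
        feature
        for feature in feature_definitions
        if any(path in feature for path in suppressed_paths)
    ]


def check_build(feature_definitions):
    # Intended for a CI step: raising here fails the build.
    violations = find_violations(feature_definitions)
    if violations:
        raise ValueError(f"Suppressed paths used by: {violations}")
    return True


clean = ["MAX(sessions.MEAN(transactions.amount))"]
tainted = clean + ["sessions.customers.zip_code"]
```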

Page 24

LINEAGE ANALYSIS AND METADATA SYSTEMS INTEGRATION

Metadata
• Data types
• Relational model

Lineage
• Feature types (how was a column derived)
• SQL operations

Page 25

FEATURE ENGINEERING – TWO PATHS TO PRODUCTION

SQL driven pipelines

• SQL is the major pattern for data processing and transformation logic
• Supported by both streaming and batch processing engines

Low latency pipelines

• Engines that need absolute low latency might opt for NoSQL-style pipelines
• Featuretools is easier to port into this world as it supports Ser-De for feature definitions

Page 26

BATCH USE CASES - REPRESENTATIVE SQL COMPLEXITY

Depth 1 – COUNT(transactions)

Page 27

BATCH USE CASES - REPRESENTATIVE SQL COMPLEXITY

Depth 2 – MAX(sessions.MEAN(transactions.amount))
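The SQL shown on the original slide did not survive extraction. As a stand-in, here is a hedged sketch of what a depth 2 query for this feature might look like, run against an in-memory SQLite database. The table layout and sample rows are assumptions, and the query is one possible hand-written form rather than the generated SQL from the talk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and rows for the toy example.
conn.executescript("""
CREATE TABLE sessions (session_id INTEGER PRIMARY KEY, customer_id INTEGER);
CREATE TABLE transactions (
    transaction_id INTEGER PRIMARY KEY, session_id INTEGER, amount REAL
);
INSERT INTO sessions VALUES (1, 1), (2, 1), (3, 2);
INSERT INTO transactions VALUES
    (1, 1, 10.0), (2, 1, 30.0), (3, 2, 50.0), (4, 3, 5.0), (5, 3, 15.0);
""")

# Each level of feature depth becomes one more nested GROUP BY.
DEPTH_2_SQL = """
SELECT s.customer_id, MAX(m.mean_amount) AS max_session_mean
FROM sessions AS s
JOIN (
    SELECT session_id, AVG(amount) AS mean_amount
    FROM transactions
    GROUP BY session_id
) AS m ON m.session_id = s.session_id
GROUP BY s.customer_id
ORDER BY s.customer_id
"""

rows = conn.execute(DEPTH_2_SQL).fetchall()
```

A depth 3 feature would wrap this pattern in one more subquery, which is why the SQL grows quickly with depth.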

Page 28

BATCH USE CASES - REPRESENTATIVE SQL COMPLEXITY

Depth 3 – STD(transactions.products.STD(transactions.amount))

Page 29

SQL GENERATION

Customer feature - SKEW(transactions.products.SKEW(transactions.amount))

Page 30

SEMANTIC FOREST STRUCTURES – PARTIAL/SUBSET VIEW

Page 31

SQL OPTIMIZATION

Apache Calcite

Optimizer rule

Cost model

Graph walk

SQL for build pipelines

Page 32

PIPELINE DEPLOYMENT PATTERNS

Featuretools

Flink/Spark/Dask runtimes

Distributed store

Save feature definitions

Boot feature definitions

SQL (Flink/Spark) runtimes

DSL Tools (Ex: SQL)
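The save and boot steps rely on feature definitions being serializable. The sketch below shows the idea with a nested dictionary and JSON; the dictionary layout and the `render` helper are hypothetical and are not the Featuretools serialization format.

```python
import json

# Hypothetical tree form of MAX(sessions.MEAN(transactions.amount)).
definition = {
    "primitive": "MAX",
    "over": "sessions",
    "input": {"primitive": "MEAN", "over": "transactions", "input": "amount"},
}


def render(node):
    # Walk the tree and rebuild the human-readable feature name.
    if isinstance(node, str):
        return node
    return f'{node["primitive"]}({node["over"]}.{render(node["input"])})'


# "Save feature definitions": serialize at design time.
saved = json.dumps(definition)

# "Boot feature definitions": deserialize inside the runtime.
booted = json.loads(saved)
feature_name = render(booted)
```

Because the definition travels as data, the same tree can be handed to a different runtime or translated to another target such as SQL.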

Page 33

SEMANTIC STRUCTURES TO VARIED DATA SINKS

Semantic structures

NO SQL

GRAPH

SQL

Page 34

REVISIT LIFECYCLE – AUTOMATION CAN HELP

Initiate
• Understand the requirement
• Collect raw data
• Cleanse

Design
• Explore hypothesis
• Preprocessing (ex: impute)
• Domain driven
• Model specific (Ex: One hot

Build
• Real time / Batch
• Persisted / Run time
• Exactly

Govern
• Lineage
• Impact analysis
• Monitor decay
• Retire

Page 35

Q&A

THANK YOU

Page 36

AUTOML

Optimizer

Worker 1

Worker 2

Worker n

Worker A-1 Worker A-2 Worker A-3

Worker B-1 Worker B-2 Worker B-3

Worker C-1 Worker C-2 Worker C-3

