H2O World - Munging, modeling, and pipelines using Python - Hank Roark

MUNGING, MODELING, AND PIPELINES USING PYTHON

Hank Roark

COMMUNITY FEEDBACK

Pythonic Interface to H2O, R interface parity

Rapid learning and iteration

Leverage existing knowledge and skills

Interface cleanly with PyData ecosystem

More Environments, esp. PySpark

Python Pipelines to Production

EXAMPLE FROM THE IOT

Domain: Prognostics and Health Management

Machine: Turbofan Jet Engines

Data Set: A. Saxena and K. Goebel (2008). "Turbofan Engine Degradation Simulation Data Set", NASA Ames Prognostics Data Repository

Predict Remaining Useful Life from Partial Life Runs

Six operating modes, two failure modes, manufacturing variability

Training: 249 jet engines run to failure

Test: 248 jet engines

WHY THIS EXAMPLE?

GETTING READY FOR BRONTOBYTES

LOADING DATA

SUMMARY STATISTICS

FEATURE ENGINEERING

Calculate Total Cycles For Each Unit
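
The step above is a group-by with a max aggregate. A minimal plain-Python sketch of the idea follows; it is not the H2OFrame code from the talk, and the row layout and values are made up for illustration.

```python
# Hypothetical rows: (unit id, cycle number). Real data also carries sensor columns.
rows = [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2)]

# Group by unit and keep the largest cycle seen. For run-to-failure data,
# that maximum is the total number of cycles the engine lasted.
total_cycles = {}
for unit, cycle in rows:
    total_cycles[unit] = max(cycle, total_cycles.get(unit, 0))

print(total_cycles)  # {1: 3, 2: 2}
```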

FEATURE ENGINEERING

Append To Original Frame

FEATURE ENGINEERING

Create New Feature of Cycles Remaining
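
Appending the per-unit total and deriving the target can be sketched as below. This is plain Python rather than the H2OFrame calls used in the talk; the column names (unit, cycle, total_cycles, cycles_remaining) and the values are assumptions for illustration.

```python
# Hypothetical rows, one per engine cycle.
rows = [
    {"unit": 1, "cycle": 1},
    {"unit": 1, "cycle": 2},
    {"unit": 1, "cycle": 3},
    {"unit": 2, "cycle": 1},
    {"unit": 2, "cycle": 2},
]
# Per-unit totals, as produced by a group-by max over the cycle column.
total_cycles = {1: 3, 2: 2}

for row in rows:
    # Append the per-unit total to every row of that unit (a join on unit).
    row["total_cycles"] = total_cycles[row["unit"]]
    # New feature: cycles left before failure. It is 0 on an engine's last cycle.
    row["cycles_remaining"] = row["total_cycles"] - row["cycle"]

print([r["cycles_remaining"] for r in rows])  # [2, 1, 0, 1, 0]
```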

EXPLORATORY DATA ANALYSIS

Boolean Indexing
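
Boolean indexing means building a true/false mask with one entry per row and keeping the rows where it is true. A minimal sketch with plain lists, using made-up sensor values; data frame libraries usually spell the same idea as frame[mask].

```python
# Hypothetical columns of equal length.
cycles_remaining = [2, 1, 0, 1, 0]
sensor = [641.8, 642.1, 643.5, 641.9, 643.2]

# Build the boolean mask: True where the engine is on its final cycle.
mask = [r == 0 for r in cycles_remaining]

# Keep only the sensor readings where the mask is True.
at_failure = [s for s, keep in zip(sensor, mask) if keep]

print(at_failure)  # [643.5, 643.2]
```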

EXPLORATORY DATA ANALYSIS

Sample the data to local memory
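
Sampling before plotting keeps the local copy small even when the full data set is not. A rough sketch of row-wise random sampling in plain Python; the row count, fraction, and seed are arbitrary assumptions, and the talk itself samples from an H2OFrame.

```python
import random

rng = random.Random(42)   # fixed seed so the sample is repeatable
n_rows = 100_000          # stand-in for the size of the remote data set
sample_fraction = 0.01    # fraction of rows to pull into local memory

# Keep each row independently with probability sample_fraction.
sampled_ids = [i for i in range(n_rows) if rng.random() < sample_fraction]

# The sample size is close to n_rows * sample_fraction, but not exact.
print(len(sampled_ids))
```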

EXPLORATORY DATA ANALYSIS

Use your favorite visualization tools (Seaborn!)

Ugh, where are the trends over time?

Zero Remaining Useful Life

MODEL BASED DATA ENRICHMENT

Sensor measurements appear in clusters corresponding to operating modes

MODEL BASED DATA ENRICHMENT

Use H2O k-means to find cluster centers
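
To show what finding cluster centers involves, here is a small k-means (Lloyd's algorithm) in plain Python. It is a sketch of the technique only, not H2O's implementation; the points, k, and seed are made up.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm: alternate nearest-center assignment and center update."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct data points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        centers = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centers[j]
            for j, c in enumerate(clusters)
        ]
    return centers

# Two well-separated hypothetical groups of readings.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
          (10.0, 10.1), (10.2, 9.9), (9.9, 10.0)]
centers = sorted(kmeans(points, k=2))
print(centers)  # one center near (0.1, 0.1), the other near (10.03, 10.0)
```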

MODEL BASED DATA ENRICHMENT

Enrich existing data with operating mode membership
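
Once cluster centers are known, enrichment amounts to labeling each row with its nearest center. A minimal plain-Python sketch; the centers, the column names (setting1, setting2, operating_mode), and the values are hypothetical.

```python
# Hypothetical cluster centers, e.g. from a k-means fit.
centers = [(0.1, 0.1), (10.0, 10.0)]

rows = [
    {"setting1": 0.0, "setting2": 0.2},
    {"setting1": 9.8, "setting2": 10.1},
    {"setting1": 0.3, "setting2": 0.1},
]

def nearest_center(point, centers):
    """Index of the center with the smallest squared distance to point."""
    return min(
        range(len(centers)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(point, centers[j])),
    )

# Add the cluster index as a new column on every row.
for row in rows:
    row["operating_mode"] = nearest_center((row["setting1"], row["setting2"]), centers)

print([r["operating_mode"] for r in rows])  # [0, 1, 0]
```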

MORE FEATURE ENGINEERING

For non-constant sensor measurements within an operating mode, standardize each sensor measurement by operating mode, based on the training data
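
The key detail is that the mean and standard deviation come from the training data only and are then reused on the test data. A plain-Python sketch for a single sensor, with made-up values; it assumes the sensor is non-constant within each mode, so the standard deviation is never zero.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical (operating_mode, sensor_value) pairs.
train = [(0, 10.0), (0, 12.0), (0, 14.0), (1, 100.0), (1, 110.0), (1, 120.0)]
test = [(0, 13.0), (1, 95.0)]

# Per-mode mean and standard deviation, computed on the training data only.
by_mode = defaultdict(list)
for mode, value in train:
    by_mode[mode].append(value)
stats = {mode: (mean(values), pstdev(values)) for mode, values in by_mode.items()}

def standardize(rows):
    """Rescale each value with the training statistics of its operating mode."""
    out = []
    for mode, value in rows:
        mu, sigma = stats[mode]  # sigma > 0 because constant sensors are excluded
        out.append((mode, (value - mu) / sigma))
    return out

train_std = standardize(train)
test_std = standardize(test)  # test rows reuse the training statistics
print(test_std)
```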

TRENDS OVER TIME!

[Figure: two panels, "Before H2O Munging" and "Ready for H2O Learning", each plotted against Time]

MODELING

Configure an Estimator

MODELING

Train an Estimator

MODEL EVALUATION

Evaluate Performance

at a glance in Python

at a glance in H2O Flow

at a glance graphically in Python
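
For a regression target such as remaining useful life, an at-a-glance summary usually includes error metrics like the ones below. This sketch computes them by hand on made-up predictions; it does not reproduce the exact report H2O prints.

```python
from math import sqrt

# Hypothetical actual and predicted remaining-useful-life values.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]

errors = [p - a for p, a in zip(predicted, actual)]

mse = sum(e * e for e in errors) / len(errors)    # mean squared error
rmse = sqrt(mse)                                  # same units as the target
mae = sum(abs(e) for e in errors) / len(errors)   # mean absolute error

# R^2: share of the target's variance explained by the predictions.
mean_actual = sum(actual) / len(actual)
ss_res = sum(e * e for e in errors)
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
r2 = 1 - ss_res / ss_tot

print(mse, mae, r2)
```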

CROSS VALIDATION

Set Up Hyperparameter Search Options

CROSS VALIDATION

Configure full grid search
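
A full grid search tries every combination of the listed options. The sketch below only enumerates that grid in plain Python; the parameter names and values are illustrative assumptions, not the settings used in the talk.

```python
from itertools import product

# Hypothetical search options: each key maps to the values to try.
hyper_parameters = {
    "ntrees": [50, 100],
    "max_depth": [3, 5, 7],
    "learn_rate": [0.1],
}

# Full (Cartesian) grid: one candidate configuration per combination.
names = sorted(hyper_parameters)
grid = [
    dict(zip(names, values))
    for values in product(*(hyper_parameters[n] for n in names))
]

print(len(grid))  # 6 = 2 * 3 * 1
```

The grid size is the product of the option counts, so it grows quickly as options are added.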

CROSS VALIDATION

Execute grid search

CROSS VALIDATION

Evaluate results & model selection
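
Model selection after a grid search comes down to ranking the candidates by a cross-validated metric. A minimal sketch with invented scores; the metric name cv_mse and all numbers are placeholders, not results from the talk.

```python
# Hypothetical grid results: one entry per candidate model.
results = [
    {"params": {"max_depth": 3}, "cv_mse": 410.2},
    {"params": {"max_depth": 5}, "cv_mse": 385.7},
    {"params": {"max_depth": 7}, "cv_mse": 392.4},
]

# Rank candidates from best (lowest error) to worst.
ranked = sorted(results, key=lambda r: r["cv_mse"])
best = ranked[0]

print(best["params"])  # {'max_depth': 5}
```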

MORE CONTROL – SCIKIT PIPELINES

Create Pipelines

Hyperparameter Options

Cross validation strategy
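
The slide does not say which strategy was chosen. One common choice for run-to-failure data, assumed here, is to keep all rows of an engine in the same fold so that no engine appears in both training and validation. A plain-Python sketch with a hypothetical helper, group_k_fold:

```python
def group_k_fold(groups, n_folds):
    """Yield (train_idx, test_idx) pairs that never split a group across both sides."""
    unique = sorted(set(groups))
    # Deal the groups out to folds in round-robin order.
    fold_of_group = {g: i % n_folds for i, g in enumerate(unique)}
    for fold in range(n_folds):
        test_idx = [i for i, g in enumerate(groups) if fold_of_group[g] == fold]
        train_idx = [i for i, g in enumerate(groups) if fold_of_group[g] != fold]
        yield train_idx, test_idx

# Hypothetical unit id for each row of the training frame.
units = [1, 1, 1, 2, 2, 3, 3, 3, 4, 4]
folds = list(group_k_fold(units, n_folds=2))

print(folds[0])  # ([3, 4, 8, 9], [0, 1, 2, 5, 6, 7])
```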

Hyperparameter Search Strategy

DATA PIPELINES USING H2OASSEMBLY

Typical Data Preparation

Add some structure
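
Adding structure here means turning ad hoc preparation code into a named, ordered list of steps that can be rerun on new data. The sketch below shows that idea with a hypothetical Pipeline class in plain Python; it is not H2OAssembly's API, and the step and column names are invented.

```python
class Pipeline:
    """Apply a fixed, named sequence of row transformations in order."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, function) pairs

    def run(self, rows):
        for _name, step in self.steps:
            # Copy each row so the caller's input is left untouched.
            rows = [step(dict(row)) for row in rows]
        return rows

def drop_id(row):
    """Remove an identifier column that should not reach the model."""
    row.pop("id", None)
    return row

def add_ratio(row):
    """Derive a new feature from two existing columns."""
    row["ratio"] = row["a"] / row["b"]
    return row

prep = Pipeline([("drop_id", drop_id), ("add_ratio", add_ratio)])
raw = [{"id": 7, "a": 6.0, "b": 3.0}]
clean = prep.run(raw)

print(clean)  # [{'a': 6.0, 'b': 3.0, 'ratio': 2.0}]
```

Because the steps are data rather than loose code, the same list can be applied unchanged to training and scoring inputs.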

H2OASSEMBLY TO PRODUCTION

Java for Production Scoring

Python

MORE ENVIRONMENTS

PySparkling Water = Python + Spark + H2O

Python + Sparkling Water

COMMUNITY FEEDBACK

Pythonic Interface to H2O, R interface parity

Rapid learning and iteration

Leverage existing knowledge and skills

Interface cleanly with PyData ecosystem

More Environments, esp. PySpark

Python Pipelines to Production

RESULTS

H2O Python Framework:

H2OFrame & H2OEstimators

H2OAssembly for Data Prep Pipelines

Python, Jupyter Notebooks, Pandas, Scikit-Learn Integration

PySparkling Water

RESOURCES

• Python booklet
• Tibshirani release
• Python documentation
• Github examples
• Jupyter Notebook of Example

THANK YOU
