Date post: 08-Oct-2020

© 2017 KNIME AG. All Rights Reserved.

Integrating high-performance machine learning: H2O and KNIME

Mark Landry (H2O), Christian Dietz (KNIME)

H2O: in-memory machine learning platform designed for speed on distributed systems

Speed

Accuracy

High Level Architecture

[Diagram labels]

• Data sources: HDFS, S3, NFS, Local, SQL

• Load Data: Distributed In-Memory, Loss-less Compression

• H2O Compute Engine: Exploratory & Descriptive Analysis; Feature Engineering & Selection; Supervised & Unsupervised Modeling; Model Evaluation & Selection; Predict; Data & Model Storage

• Production Scoring Environment: Model Export (Plain Old Java Object); Data Prep Export (Plain Old Java Object); Your Imagination

Distributed Algorithms

Advantageous Foundation

• Foundation for in-memory distributed algorithm calculation: distributed data frames and columnar compression

• All algorithms in H2O are distributed: GBM, GLM, DRF, Deep Learning and more, built on fine-grained map-reduce iterations

• Only enterprise-grade, open-source distributed algorithms on the market

User Benefits

• “Out-of-the-box” functionality for all algorithms (NO MORE SCRIPTING) and a uniform interface across all languages: R, Python, Java

• Designed for data sets of all sizes, especially large data

• Highly optimized Java code for model exports

• In-house expertise for all algorithms

Parallel Parse into Distributed Rows

Fine-Grained Map-Reduce Illustration: Scalable Distributed Histogram Calculation for GBM
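
The original figure is not recoverable from the transcript, so the idea behind a distributed histogram can only be sketched here. The following is a minimal plain-Python illustration, not H2O's implementation: each chunk of a column (standing in for the rows held by one node) is histogrammed locally in a map step, and the partial histograms are summed in a reduce step. The function names, the sample data and the fixed bin edges are all assumptions made for the example; a real GBM histogram would also accumulate gradient statistics per bin, which is omitted.

```python
from functools import reduce


def chunk_histogram(values, lo, hi, n_bins):
    """Map step: count how many values of one chunk fall into each bin."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        # Clamp the index so that v == hi lands in the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


def merge_histograms(a, b):
    """Reduce step: histograms with identical bin edges add element-wise."""
    return [x + y for x, y in zip(a, b)]


# Hypothetical column split into three chunks, as if held by three nodes.
chunks = [[0.1, 0.4, 2.5], [1.2, 1.9, 3.0], [0.0, 2.2, 2.9, 1.5]]
lo, hi, n_bins = 0.0, 3.0, 3

partials = [chunk_histogram(c, lo, hi, n_bins) for c in chunks]
total = reduce(merge_histograms, partials)
print(total)  # [3, 3, 4]
```

Because the merge is an element-wise sum, the partial results can be combined in any order, which is what makes the calculation easy to spread across nodes.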

Foundation for Distributed Algorithms


Scientific Advisory Council

Machine Learning Benchmarks (https://github.com/szilard/benchm-ml)

Gradient Boosting Machine Benchmark (also available for GLM and Random Forest)

[Figure: two panels, Time (s) and AUC, each plotted against X = millions of samples]

Supervised Learning

Statistical Analysis

• Generalized Linear Models: Binomial, Gaussian, Gamma, Poisson and Tweedie

• Naïve Bayes

Ensembles

• Distributed Random Forest: Classification or regression models

• Gradient Boosting Machine: Produces an ensemble of decision trees with increasingly refined approximations
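
To make the phrase about refined approximations concrete, here is a minimal sketch of gradient boosting for squared error in plain Python. It is an illustration under simplifying assumptions, not H2O's GBM: the trees are one-split stumps on a single feature, and `fit_stump`, `boost`, the learning rate and the toy data are all hypothetical names and values chosen for the example.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree (stump) to the residuals."""
    best = None
    # Candidate thresholds: every distinct x except the largest,
    # so that both sides of the split are non-empty.
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm


def boost(xs, ys, n_trees=20, learning_rate=0.5):
    """Add stumps one at a time, each fitted to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    trees, history = [], []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        preds = [p + learning_rate * tree(x) for x, p in zip(xs, preds)]
        # Track training mean squared error after each added tree.
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return base, trees, history


# Toy data (made up for the example): a noisy step function.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]
base, trees, history = boost(xs, ys)
print(history[0], history[-1])
```

Each new stump is fitted to what the current ensemble still gets wrong, so the training error recorded in `history` does not increase from one tree to the next.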

Deep Neural Networks

• Deep Learning: Creates multi-layer feedforward neural networks starting with an input layer followed by multiple layers of nonlinear transformations

Unsupervised Learning

Clustering

• K-means: Partitions observations into k clusters/groups of the same spatial size. Can automatically detect the optimal k
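
As a rough illustration of the partitioning step in K-means, the sketch below runs Lloyd's algorithm in plain Python on a handful of made-up 2-D points. It is not H2O's implementation and does not attempt the automatic choice of k mentioned above; the naive "first k points" initialization and all names are assumptions made to keep the example deterministic.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))


def kmeans(points, k, n_iter=10):
    # Naive initialization (an assumption for this sketch): first k points.
    centroids = [points[i] for i in range(k)]
    clusters = []
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [mean_point(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return centroids, clusters


points = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
```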

Dimensionality Reduction

• Principal Component Analysis: Linearly transforms correlated variables into uncorrelated components

• Generalized Low Rank Models: Extend the idea of PCA to handle arbitrary data consisting of numerical, Boolean, categorical, and missing data

Anomaly Detection

• Autoencoders: Find outliers using nonlinear dimensionality reduction with deep learning


H2O Algorithms


H2O in KNIME

Live Demo


H2O in KNIME

• Offer our users high-performance machine learning algorithms from H2O in KNIME

• Allow users to mix & match with other KNIME functionality

– Data wrangling with KNIME Analytics Platform

– KNIME Big-Data Connectors

– Text Mining, Image Processing, Cheminformatics, …

– and more!


H2O in KNIME

Live Demo


H2O in KNIME – Cross Validation


H2O in KNIME – Cross Validation


H2O in KNIME – Cross Validation
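
The cross-validation slides consisted of workflow screenshots that did not survive in the transcript. As a general reminder of what k-fold cross validation does, here is a minimal plain-Python sketch; it does not reproduce the KNIME workflow or any H2O node. The interleaved fold assignment, the "training mean" stand-in model and all names are assumptions made for the example.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_idx, test_idx) pairs; every row is held out exactly once."""
    # Interleaved folds: row i goes to fold i % k.
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test


def cross_validate(ys, k):
    """Score a trivial 'predict the training mean' model on each fold."""
    scores = []
    for train, test in k_fold_indices(len(ys), k):
        mean = sum(ys[i] for i in train) / len(train)  # stand-in for model training
        mse = sum((ys[i] - mean) ** 2 for i in test) / len(test)
        scores.append(mse)
    return scores


ys = [2.0, 2.1, 1.9, 2.2, 2.0, 1.8, 2.3, 2.1, 1.9, 2.0]  # made-up target values
scores = cross_validate(ys, k=3)
print(sum(scores) / len(scores))  # average held-out error across folds
```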


H2O in KNIME – Parameter Optimization


H2O in KNIME – Parameter Optimization
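
The parameter optimization slides were also screenshots. One common strategy for this kind of tuning is an exhaustive grid search, sketched below in plain Python; whether the demo used grid search or another strategy is not stated in the transcript. The parameter names, the grid values and `toy_validation_error` (a stand-in for "train a model and score it on held-out data") are all hypothetical.

```python
from itertools import product


def grid_search(grid, evaluate):
    """Try every parameter combination and keep the one with the lowest score."""
    names = sorted(grid)
    best_params, best_score = None, float("inf")
    for values in product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_validation_error(params):
    # Hypothetical error surface with its minimum at
    # learning_rate=0.1, max_depth=5; a real run would train and score a model.
    return (params["learning_rate"] - 0.1) ** 2 + (params["max_depth"] - 5) ** 2 / 100


grid = {"learning_rate": [0.01, 0.1, 0.3], "max_depth": [3, 5, 7]}
best_params, best_score = grid_search(grid, toy_validation_error)
print(best_params, best_score)
```

The number of evaluations grows multiplicatively with each added parameter, so in practice the scoring function is often the cross-validated error of the model being tuned.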


H2O in KNIME – Nodes in KNIME 3.4


H2O in KNIME – What’s cooking?


H2O in KNIME – What’s cooking?


Thank you!

