Date posted: 26-Jan-2015 · Category: Technology · Uploaded by: srisatish-ambati
GBM: Distributed Tree Algorithms on H2O
Cliff Click, CTO · [email protected] · http://0xdata.com · http://cliffc.org/blog
0xdata.com 2
H2O is...
● Pure Java, Open Source: 0xdata.com
● https://github.com/0xdata/h2o/
● A Platform for doing Math
● Parallel Distributed Math
● In-memory analytics: GLM, GBM, RF, Logistic Reg
● Accessible via REST & JSON
● A K/V Store: ~150ns per get or put
● Distributed Fork/Join + Map/Reduce + K/V
Agenda
● Building Blocks For Big Data:
  ● Vecs & Frames & Chunks
● Distributed Tree Algorithms
  ● Access Patterns & Execution
● GBM on H2O
  ● Performance
A Collection of Distributed Vectors
// A Distributed Vector
// much more than 2 billion elements
class Vec {
  long length();        // more than an int's worth

  // fast random access
  double at(long idx);  // Get the idx'th elem
  boolean isNA(long idx);

  void set(long idx, double d); // writable
  void append(double d);        // variable sized
}
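The interface above can be illustrated with a single-JVM sketch (illustrative only, not H2O's implementation): a long-indexed vector backed by fixed-size chunks, standing in locally for chunks that in H2O would live on different nodes. The class name and chunk size are made up for the example.

```java
// Single-JVM sketch of a chunked, long-indexed Vec (illustrative, not H2O code).
import java.util.ArrayList;

class ChunkedVec {
  static final int CHUNK = 1 << 16;            // elements per chunk
  private final ArrayList<double[]> chunks = new ArrayList<>();
  private long len;

  long length() { return len; }

  double at(long idx) {                        // random access by long index
    return chunks.get((int)(idx / CHUNK))[(int)(idx % CHUNK)];
  }

  boolean isNA(long idx) { return Double.isNaN(at(idx)); }

  void set(long idx, double d) {
    chunks.get((int)(idx / CHUNK))[(int)(idx % CHUNK)] = d;
  }

  void append(double d) {                      // variable sized
    if (len % CHUNK == 0) chunks.add(new double[CHUNK]);
    set(len++, d);
  }
}
```

In H2O the chunks are spread across JVM heaps and stored compressed; here they are plain double arrays in one heap.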
[Diagram: four JVM heaps, each holding a slice of the distributed Vecs]
Frames
A Frame: a Vec[] of columns (age, sex, zip, ID, car)
● Vecs aligned in heaps
● Optimized for concurrent access
● Random access any row, any JVM
● But faster if local... more on that later
[Diagram: five Vecs, their Chunks aligned across four JVM heaps]
Distributed Data Taxonomy
A Chunk: the Unit of Parallel Access
● Typically 1e3 to 1e6 elements
● Stored compressed
● In byte arrays
● Get/put is a few clock cycles including compression
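One way to picture "stored compressed, get/put in a few clock cycles" is scale-and-offset compression: if a column holds, say, prices with two-decimal precision in a narrow range, each element fits in a single byte. This is a sketch of the idea with an invented encoding, not H2O's actual compression formats:

```java
// Sketch of a scale/offset-compressed chunk: elem = base + byte * scale.
class CompressedChunk {
  private final byte[] mem;      // one byte per element
  private final double base, scale;

  CompressedChunk(double[] raw, double base, double scale) {
    this.base = base; this.scale = scale;
    mem = new byte[raw.length];
    for (int i = 0; i < raw.length; i++)
      mem[i] = (byte)Math.round((raw[i] - base) / scale);
  }

  int len() { return mem.length; }

  // Decompression is one multiply and one add: a few clock cycles.
  double at(int i) { return base + mem[i] * scale; }
}
```

A real system picks the encoding per chunk from the observed min, max, and precision of the data; the point is only that decompression is cheap enough to do on every element access.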
[Diagram: four JVM heaps; each heap's CPUs work on the local Chunks of five Vecs]
Distributed Parallel Execution
● All CPUs grab Chunks in parallel
● F/J load balances
● Code moves to Data
● Map/Reduce & F/J handle all sync
● H2O handles all comm, data management
Distributed Data Taxonomy
Frame – a collection of Vecs
Vec – a collection of Chunks
Chunk – a collection of 1e3 to 1e6 elems
elem – a Java double
Row i – i'th elements of all the Vecs in a Frame
Distributed Coding Taxonomy
● No Distribution Coding:
  ● Whole Algorithms, Whole Vector-Math
  ● REST + JSON: e.g. load data, GLM, get results
● Simple Data-Parallel Coding:
  ● Per-Row (or neighbor-row) Math
  ● Map/Reduce-style: e.g. any dense linear algebra
● Complex Data-Parallel Coding:
  ● K/V Store, Graph Algos, e.g. PageRank
Distributed Coding Taxonomy (revisited)

● No Distribution Coding: Read the docs!
● Simple Data-Parallel Coding: This talk!
● Complex Data-Parallel Coding: Join our GIT!
Simple Data-Parallel Coding
● Map/Reduce Per-Row: Stateless
● Example from Linear Regression, Σ y²
● Auto-parallel, auto-distributed
● Near Fortran speed, Java ease

double sumY2 = new MRTask() {
  double map( double d ) { return d*d; }
  double reduce( double d1, double d2 ) { return d1+d2; }
}.doAll( vecY );
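The same stateless map/reduce can be written in plain Java with a parallel stream, which is roughly what the MRTask amounts to on a single node (here vecY is just a double[]):

```java
import java.util.Arrays;

class SumY2 {
  // map: d -> d*d; reduce: + ; auto-parallel on one JVM via the fork/join pool
  static double sumY2(double[] vecY) {
    return Arrays.stream(vecY).parallel().map(d -> d * d).sum();
  }
}
```

The distributed version differs only in that the chunks being mapped over live on other machines.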
Simple Data-Parallel Coding
● Map/Reduce Per-Row: State-full
● Linear Regression Pass 1: Σ x, Σ y, Σ y²

class LRPass1 extends MRTask {
  double sumX, sumY, sumY2;   // I Can Haz State?
  void map( double X, double Y ) {
    sumX += X; sumY += Y; sumY2 += Y*Y;
  }
  void reduce( LRPass1 that ) {
    sumX  += that.sumX;
    sumY  += that.sumY;
    sumY2 += that.sumY2;
  }
}
Simple Data-Parallel Coding
● Map/Reduce Per-Row: Batch State-full

class LRPass1 extends MRTask {
  double sumX, sumY, sumY2;
  void map( Chunk CX, Chunk CY ) {    // Whole Chunks
    for( int i=0; i<CX.len; i++ ) {   // Batch!
      double X = CX.at(i), Y = CY.at(i);
      sumX += X; sumY += Y; sumY2 += Y*Y;
    }
  }
  void reduce( LRPass1 that ) {
    sumX  += that.sumX;
    sumY  += that.sumY;
    sumY2 += that.sumY2;
  }
}
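A runnable single-JVM sketch of this state-full batch pattern: each map() call fills private sums for one batch of rows, and reduce() merges two partials. The class name is invented and chunks are plain arrays:

```java
// Sketch of the state-full batch map/reduce pattern (not H2O's MRTask).
class LRPass1Sketch {
  double sumX, sumY, sumY2;                // per-task state

  void map(double[] cx, double[] cy) {     // one "chunk" batch of rows
    for (int i = 0; i < cx.length; i++) {
      sumX += cx[i]; sumY += cy[i]; sumY2 += cy[i] * cy[i];
    }
  }

  void reduce(LRPass1Sketch that) {        // merge two partial results
    sumX += that.sumX; sumY += that.sumY; sumY2 += that.sumY2;
  }
}
```

In the real framework, one such task instance runs per chunk in parallel, and the framework calls reduce() to fold the partials into a single answer.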
Distributed Trees
● Overlay a Tree over the data
  ● Really: assign a Tree Node to each Row
  ● Number the Nodes
  ● Store "Node_ID" per row in a temp Vec
● Make a pass over all Rows
  ● Nodes not visited in order...
  ● but all rows, all Nodes efficiently visited
● Do work (e.g. histogram) per Row/Node

Vec nids = v.makeZero();
… nids.set(row,nid) ...
Distributed Trees
● An initial Tree
  ● All rows at nid==0
  ● MRTask: compute stats
● Use the stats to make a decision...
  ● (varies by algorithm)!

  X    Y    nids
  A    1.2  0
  B    3.1  0
  C   -2.   0
  D    1.1  0

[Diagram: a one-node Tree, nid=0; MRTask computes sum=3.4]
Distributed Trees
● Next layer in the Tree (and MRTask across rows)
● Each row: decide!
  – If 1 < Y < 1.5, go left; else go right
● Compute stats per new leaf
● Each pass across all rows builds an entire layer
  X    Y    nids
  A    1.2  1
  B    3.1  2
  C   -2.   2
  D    1.1  1

[Diagram: root nid=0 splits on 1 < Y < 1.5; left child nid=1 (rows A, D) has sum=2.3, right child nid=2 (rows B, C) has sum=1.1]
Distributed Trees
● Another MRTask, another layer...
● i.e., a 5-deep tree takes 5 passes
  X    Y    nids
  A    1.2  3
  B    3.1  2
  C   -2.   2
  D    1.1  4

[Diagram: root splits on 1 < Y < 1.5; nid=2 (sum=1.1) stays a leaf; the other node splits on Y==1.1, producing new leaves nid=3 and nid=4; the diagram also shows sums -2. and 3.1]
Distributed Trees
● Each pass is over one layer in the tree
● Builds per-node histogram in map+reduce calls

class Pass extends MRTask2<Pass> {
  void map( Chunk chks[] ) {
    Chunk nids = chks[...];           // Node-IDs per row
    for( int r=0; r<nids.len; r++ ) { // All rows
      int nid = nids.at80(r);         // Node-ID of THIS row
      // Lazy: not all Chunks see all Nodes
      if( dHisto[nid]==null ) dHisto[nid]=...
      // Accumulate histogram stats per node
      dHisto[nid].accum(chks,r);
    }
  }
}.doAll(myDataFrame,nids);
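Stripped of the H2O types, the per-level pass boils down to this: an array of node IDs parallel to the rows, per-node stats accumulated row by row, and a reduce that rolls partials up. A sketch with invented names (count and sum stand in for the real histogram):

```java
// Sketch of one tree-level pass: per-node stats keyed by the row's node ID.
class LevelPass {
  double[] sum; int[] cnt;                   // per-node stats for one level

  LevelPass(int maxNodes) { sum = new double[maxNodes]; cnt = new int[maxNodes]; }

  void map(int[] nids, double[] y) {         // one chunk's rows
    for (int r = 0; r < nids.length; r++) {  // all rows
      int nid = nids[r];                     // node ID of THIS row
      sum[nid] += y[r]; cnt[nid]++;          // accumulate stats per node
    }
  }

  void reduce(LevelPass that) {              // roll partials up
    for (int n = 0; n < sum.length; n++) {
      sum[n] += that.sum[n]; cnt[n] += that.cnt[n];
    }
  }
}
```

Note that rows for different nodes are interleaved; one pass still visits every (row, node) pair exactly once, which is why a whole layer costs a single pass.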
Distributed Trees
● Each pass analyzes one Tree level
● Then decide how to build the next level
● Reassign Rows to new levels in another pass
  – (actually merge the two passes)
● Builds a Histogram-per-Node
  ● Which requires a reduce() call to roll up
● All Histograms for one level done in parallel
Distributed Trees: utilities
● “score+build” in one pass:
  ● Test each row against the decision from the prior pass
  ● Assign to a new leaf
  ● Build histogram on that leaf
● “score”: just walk the tree, and get results
● “compress”: Tree from POJO to byte[]
  ● Easily 10x smaller, can still walk, score, print
● Plus utilities to walk, print, display
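A toy version of "compress but still walk": encode the tree as flat arrays instead of an object graph, with negative child values marking leaves by index. Purely illustrative layout; H2O packs into a byte[] rather than parallel arrays:

```java
// Toy "compressed" tree: flat arrays instead of a POJO node graph (illustrative).
class FlatTree {
  int[] col; double[] thr;   // split column and threshold per internal node
  int[] left, right;         // child node index, or ~leafIdx if a leaf
  double[] leaf;             // leaf predictions

  double score(double[] row) {   // walk from the root, no decompression step
    int n = 0;
    while (true) {
      int next = row[col[n]] < thr[n] ? left[n] : right[n];
      if (next < 0) return leaf[~next];   // negative value => leaf
      n = next;
    }
  }
}
```

The compressed form stays directly walkable, which is what makes scoring on it cheap.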
GBM on Distributed Trees
● GBM builds 1 Tree, 1 level at a time, but...
  ● We run the entire level in parallel & distributed
● Built breadth-first because it's "free"
  ● More data offset by more CPUs
● Classic GBM otherwise
  ● Build residuals tree-by-tree
  ● Tuning knobs: trees, depth, shrinkage, min_rows
● Pure Java
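The "classic GBM otherwise" loop, sketched in runnable form. To keep it self-contained, the "tree" here is the simplest possible learner, the mean of the current residuals; a real level-parallel tree builder plugs into the same slot. The class and method names are invented:

```java
// GBM skeleton: each round fits a learner to the residuals, scaled by shrinkage.
// The "tree" is just the residual mean here, so the loop stays runnable;
// a real regression tree slots in at the marked line.
class GbmSketch {
  static double[] fit(double[] y, int nTrees, double shrinkage) {
    double[] pred = new double[y.length];   // current ensemble prediction
    for (int t = 0; t < nTrees; t++) {
      double mean = 0;                      // <-- build a (trivial) tree on residuals
      for (int i = 0; i < y.length; i++) mean += (y[i] - pred[i]);
      mean /= y.length;
      for (int i = 0; i < y.length; i++)    // add the shrunken tree to the model
        pred[i] += shrinkage * mean;
    }
    return pred;
  }
}
```

With shrinkage below 1, each tree corrects only part of the remaining error, which is why GBM needs many trees but overfits more slowly.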
GBM on Distributed Trees
● Limiting factor: latency in turning over a level
● About 4x faster than single-node R on covtype
● Does the per-level compute in parallel
● Requires sending histograms over the network
  – Can get big for very deep trees
Summary: Write (parallel) Java
● Most simple Java "just works"
● Fast: parallel distributed reads, writes, appends
  ● Reads: same speed as plain Java array loads
  ● Writes, appends: slightly slower (compression)
  ● Typically memory-bandwidth limited
    – (may be CPU limited in a few cases)
● Slower: conflicting writes (but follows strict JMM)
  ● Also supports transactional updates
Summary: Writing Analytics
● We're writing Big Data Analytics
  ● Generalized Linear Modeling (ADMM, GLMNET)
    – Logistic Regression, Poisson, Gamma
  ● Random Forest, GBM, KMeans++, KNN
● State-of-the-art Algorithms, running Distributed
● Solidly working on 100G datasets
● Heading for Tera Scale
● Paying customers (in production!)
● Come write your own (distributed) algorithm!!!
Cool Systems Stuff...
● … that I ran out of space for
● Reliable UDP, integrated w/ RPC
  ● TCP is reliably UNReliable
  ● Already have a reliable UDP framework, so no prob
● Fork/Join Goodies:
  ● Priority Queues
  ● Distributed F/J
  ● Surviving fork bombs & lost threads
● K/V does JMM via hardware-like MESI protocol
H2O is...
● Pure Java, Open Source: 0xdata.com
● https://github.com/0xdata/h2o/
● A Platform for doing Math
● Parallel Distributed Math
● In-memory analytics: GLM, GBM, RF, Logistic Reg
● Accessible via REST & JSON
● A K/V Store: ~150ns per get or put
● Distributed Fork/Join + Map/Reduce + K/V
The Platform
[Diagram: two JVMs, each layered NFS/HDFS → byte[] → extends Iced → extends DTask → AutoBuffer → RPC → extends DRemoteTask (D/F/J) → extends MRTask (user code); the JVMs exchange K/V get/put over UDP / TCP]
Other Simple Examples
● Filter & Count (underage males):
● (can pass in any number of Vecs or a Frame)

long sumY2 = new MRTask() {
  long map( long age, long sex ) {
    return (age<=17 && sex==MALE) ? 1 : 0;
  }
  long reduce( long d1, long d2 ) { return d1+d2; }
}.doAll( vecAge, vecSex );
Other Simple Examples
● Filter into new set (underage males):
● Can write or append subset of rows
  – (append order is preserved)

class Filter extends MRTask {
  void map( Chunk CRisk, Chunk CAge, Chunk CSex ) {
    for( int i=0; i<CAge.len; i++ )
      if( CAge.at(i)<=17 && CSex.at(i)==MALE )
        CRisk.append(CAge.at(i));   // build a set
  }
};
Vec risk = new AppendableVec();
new Filter().doAll( risk, vecAge, vecSex );
...risk...   // all the underage males
Other Simple Examples
● Group-by: count of car-types by age

class AgeHisto extends MRTask {
  long carAges[][];                 // count of cars by age
  void map( Chunk CAge, Chunk CCar ) {
    carAges = new long[numAges][numCars];
    for( int i=0; i<CAge.len; i++ )
      carAges[(int)CAge.at(i)][(int)CCar.at(i)]++;
  }
  void reduce( AgeHisto that ) {
    for( int i=0; i<carAges.length; i++ )
      for( int j=0; j<carAges[i].length; j++ )
        carAges[i][j] += that.carAges[i][j];
  }
}
0xdata.com 32
Other Simple Examples

● Group-by: count of car-types by age (same code as the previous slide)
  ● Setting carAges in map() makes it an output field: private per map() call, single-threaded write access
  ● It must be rolled up in the reduce() call
0xdata.com 33
Other Simple Examples
● Uniques
● Uses distributed hash set

class Uniques extends MRTask {
  DNonBlockingHashSet<Long> dnbhs = new ...;
  void map( long id ) { dnbhs.add(id); }
  void reduce( Uniques that ) { dnbhs.putAll(that.dnbhs); }
};
long uniques = new Uniques().doAll( vecVistors ).dnbhs.size();
0xdata.com 34
Other Simple Examples

● Uniques (same code as the previous slide)
  ● Setting dnbhs in <init> makes it an input field: shared across all map() calls, often read-only
  ● This one is written, so it needs a reduce()
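A single-JVM stand-in for the distributed hash set pattern, using a concurrent set and a union in reduce(). The class name is invented; on one node, java.util.concurrent already provides the thread-safe set:

```java
// Sketch of the Uniques pattern: concurrent set per task, union in reduce().
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class UniquesSketch {
  final Set<Long> ids = ConcurrentHashMap.newKeySet();   // shared, thread-safe

  void map(long id) { ids.add(id); }                     // each map adds its ids

  void reduce(UniquesSketch that) { ids.addAll(that.ids); } // union the partials
}
```

The distributed version replaces the local set with one spread over the K/V store, but the add-then-union shape is the same.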