ACM Bay Area Data Mining Workshop: Pattern, PMML, Hadoop

Date post: 20-Aug-2015
Upload: paco-nathan
ACM Big Data Mining Camp, 2013-10-12: Cascading, Pattern, and PMML
Paco Nathan (@pacoid), Chief Scientist, Mesosphere
Transcript
Page 3

• established XML standard for predictive model markup

• organized by Data Mining Group (DMG), since 1997 http://dmg.org/

• members: IBM, SAS, Visa, NASA, Equifax, Microstrategy, Microsoft, etc.

• PMML concepts for metadata, ensembles, etc., translate directly into Cascading tuple flows

“PMML is the leading standard for statistical and data mining models and supported by over 20 vendors and organizations. With PMML, it is easy to develop a model on one system using one application and deploy the model on another system using another application.”

PMML – an industry standard

wikipedia.org/wiki/Predictive_Model_Markup_Language

Page 4

PMML – vendor coverage

Page 5

• Association Rules: AssociationModel element

• Cluster Models: ClusteringModel element

• Decision Trees: TreeModel element

• Naïve Bayes Classifiers: NaiveBayesModel element

• Neural Networks: NeuralNetwork element

• Regression: RegressionModel and GeneralRegressionModel elements

• Rulesets: RuleSetModel element

• Sequences: SequenceModel element

• Support Vector Machines: SupportVectorMachineModel element

• Text Models: TextModel element

• Time Series: TimeSeriesModel element

PMML – model coverage

ibm.com/developerworks/industry/library/ind-PMML2/

Page 6

    ## train a RandomForest model
    f <- as.formula("as.factor(label) ~ .")
    fit <- randomForest(f, data_train, ntree=50)

    ## test the model on the holdout test set
    print(fit$importance)
    print(fit)

    predicted <- predict(fit, data)
    data$predicted <- predicted
    confuse <- table(pred = predicted, true = data[,1])
    print(confuse)

    ## export predicted labels to TSV
    write.table(data, file=paste(dat_folder, "sample.tsv", sep="/"), quote=FALSE, sep="\t", row.names=FALSE)

    ## export RF model to PMML
    saveXML(pmml(fit), file=paste(dat_folder, "sample.rf.xml", sep="/"))

PMML – create a model in R

Page 7

    <?xml version="1.0"?>
    <PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://www.dmg.org/PMML-4_0 http://www.dmg.org/v4-0/pmml-4-0.xsd">
     <Header copyright="Copyright (c)2012 Concurrent, Inc." description="Random Forest Tree Model">
      <Extension name="user" value="ceteri" extender="Rattle/PMML"/>
      <Application name="Rattle/PMML" version="1.2.30"/>
      <Timestamp>2012-10-22 19:39:28</Timestamp>
     </Header>
     <DataDictionary numberOfFields="4">
      <DataField name="label" optype="categorical" dataType="string">
       <Value value="0"/>
       <Value value="1"/>
      </DataField>
      <DataField name="var0" optype="continuous" dataType="double"/>
      <DataField name="var1" optype="continuous" dataType="double"/>
      <DataField name="var2" optype="continuous" dataType="double"/>
     </DataDictionary>
     <MiningModel modelName="randomForest_Model" functionName="classification">
      <MiningSchema>
       <MiningField name="label" usageType="predicted"/>
       <MiningField name="var0" usageType="active"/>
       <MiningField name="var1" usageType="active"/>
       <MiningField name="var2" usageType="active"/>
      </MiningSchema>
      <Segmentation multipleModelMethod="majorityVote">
       <Segment id="1">
        <True/>
        <TreeModel modelName="randomForest_Model" functionName="classification" algorithmName="randomForest" splitCharacteristic="binarySplit">
         <MiningSchema>
          <MiningField name="label" usageType="predicted"/>
          <MiningField name="var0" usageType="active"/>
          <MiningField name="var1" usageType="active"/>
          <MiningField name="var2" usageType="active"/>
         </MiningSchema>
    ...

PMML – capture business logic of analytics workflows
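Because PMML is plain XML, any XML parser can read the model metadata. The following is a minimal sketch, not part of Pattern or any PMML library: it uses only the JDK's DOM parser to list the DataDictionary fields from an abbreviated, well-formed subset of the document shown on this slide. The class name `PmmlFieldsSketch` and the method `dataFields` are hypothetical names chosen for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration; not part of Cascading or Pattern.
class PmmlFieldsSketch {
    // Abbreviated, well-formed subset of the PMML document on the slide.
    static final String PMML =
          "<PMML version=\"4.0\" xmlns=\"http://www.dmg.org/PMML-4_0\">"
        + "<DataDictionary numberOfFields=\"4\">"
        + "<DataField name=\"label\" optype=\"categorical\" dataType=\"string\"/>"
        + "<DataField name=\"var0\" optype=\"continuous\" dataType=\"double\"/>"
        + "<DataField name=\"var1\" optype=\"continuous\" dataType=\"double\"/>"
        + "<DataField name=\"var2\" optype=\"continuous\" dataType=\"double\"/>"
        + "</DataDictionary>"
        + "</PMML>";

    // Returns "name:optype" for every DataField element, in document order.
    static List<String> dataFields(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("DataField");
            List<String> fields = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element field = (Element) nodes.item(i);
                fields.add(field.getAttribute("name") + ":" + field.getAttribute("optype"));
            }
            return fields;
        } catch (Exception e) {
            // Wrap parser and I/O errors so callers need no checked exceptions.
            throw new IllegalStateException("could not parse PMML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(dataFields(PMML));
    }
}
```

The same approach extends to MiningField elements, where the `usageType` attribute separates the predicted field from the active inputs.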

Page 8

PMML in Action
Alex Guazzelli, Wen-Ching Lin, Tridivesh Jena
amazon.com/dp/1470003244

See also excellent resources at:

zementis.com/pmml.htm

PMML – further study

Page 9

Lab: RStudio and PMML in R

set up RStudio…

rstudio.com/ide/

use the Iris data to build predictive models…

• github.com/Cascading/pattern pattern-examples/examples/r/rattle_pmml.R

• test/train hold-outs

• evaluating predictive power

• export as PMML

Page 10

    library(pmml)
    library(randomForest)
    library(nnet)
    library(XML)
    library(kernlab)

    ## split data into test and train sets
    data(iris)
    iris_full <- iris
    colnames(iris_full) <- c("sepal_length", "sepal_width", "petal_length", "petal_width", "species")

    idx <- sample(150, 100)
    iris_train <- iris_full[idx,]
    iris_test <- iris_full[-idx,]

Model: data prep based on “Iris”

Page 11

    ## http://mkseo.pe.kr/stats/?p=220
    f <- as.formula("as.factor(species) ~ .")
    fit <- randomForest(f, data=iris_train, proximity=TRUE, ntree=50)

    print(fit$importance)
    print(fit)
    print(table(iris_test$species, predict(fit, iris_test, type="class")))

    plot(fit, log="y", main="Random Forest")
    varImpPlot(fit)
    MDSplot(fit, iris_full$species)

    out <- iris_full
    out$predict <- predict(fit, out, type="class")

    write.table(out, file=paste(dat_folder, "iris.rf.tsv", sep="/"), quote=FALSE, sep="\t", row.names=FALSE)
    saveXML(pmml(fit), file=paste(dat_folder, "iris.rf.xml", sep="/"))

Model: Random Forest

Page 12

    ## http://www2.warwick.ac.uk/fac/sci/moac/people/students/peter_cock/r/iris_lm/
    f <- as.formula("sepal_length ~ .")
    fit <- lm(f, data=iris_train)

    print(summary(fit))
    print(table(round(iris_test$sepal_length), round(predict(fit, iris_test))))

    op <- par(mfrow = c(3, 2))
    plot(predict(fit), main="Linear Regression")
    plot(iris_full$petal_length, iris_full$petal_width, pch=21, bg=c("red", "green3", "blue")[unclass(iris_full$species)], main="Edgar Anderson's Iris Data", xlab="petal length", ylab="petal width")
    plot(fit)
    par(op)

    out <- iris_full
    out$predict <- predict(fit, out)

    write.table(out, file=paste(dat_folder, "iris.lm_p.tsv", sep="/"), quote=FALSE, sep="\t", row.names=FALSE)
    saveXML(pmml(fit), file=paste(dat_folder, "iris.lm_p.xml", sep="/"))

Model: Linear Regression

Page 13

    ## http://statisticsr.blogspot.com/2008/10/notes-for-nnet.html
    samp <- c(sample(1:50,25), sample(51:100,25), sample(101:150,25))

    ird <- data.frame(rbind(iris3[,,1], iris3[,,2], iris3[,,3]), species=factor(c(rep("setosa",50), rep("versicolor", 50), rep("virginica", 50))))

    f <- as.formula("species ~ .")
    fit <- nnet(f, data=ird, subset=samp, size=2, rang=0.1, decay=5e-4, maxit=200)

    print(fit)
    print(summary(fit))
    print(table(ird$species[-samp], predict(fit, ird[-samp,], type = "class")))

    out <- ird
    out$predict <- predict(fit, ird, type="class")

    write.table(out, file=paste(dat_folder, "iris.nn.tsv", sep="/"), quote=FALSE, sep="\t", row.names=FALSE)
    saveXML(pmml(fit), file=paste(dat_folder, "iris.nn.xml", sep="/"))

Model: Neural Network

Page 14

    ## http://mkseo.pe.kr/stats/?p=15
    ds <- iris_full[,-5]
    fit <- kmeans(ds, 3)

    print(fit)
    print(summary(fit))
    print(table(fit$cluster, iris_full$species))

    op <- par(mfrow = c(1, 1))
    plot(iris_full$sepal_length, iris_full$sepal_width, pch = 23, bg = c("blue", "red", "green")[fit$cluster], main="K-Means Clustering")
    points(fit$centers[,c(1, 2)], col=1:3, pch=8, cex=2)
    par(op)

    out <- iris_full
    out$predict <- fit$cluster

    write.table(out, file=paste(dat_folder, "iris.kmeans.tsv", sep="/"), quote=FALSE, sep="\t", row.names=FALSE)
    saveXML(pmml(fit), file=paste(dat_folder, "iris.kmeans.xml", sep="/"))

Model: K-Means Clustering

Page 15

    ## http://mkseo.pe.kr/stats/?p=15
    i <- as.matrix(iris_full[,-5])
    fit <- hclust(dist(i), method = "average")

    initial <- tapply(i, list(rep(cutree(fit, 3), ncol(i)), col(i)), mean)
    dimnames(initial) <- list(NULL, dimnames(i)[[2]])
    kls <- cutree(fit, 3)

    print(fit)
    print(table(iris_full$species, kls))

    op <- par(mfrow = c(1, 1))
    ## plclust() is defunct in recent R releases; plot() draws the hclust dendrogram
    plot(fit, main="Hierarchical Clustering")
    par(op)

    out <- iris_full
    out$predict <- kls

    write.table(out, file=paste(dat_folder, "iris.hc.tsv", sep="/"), quote=FALSE, sep="\t", row.names=FALSE)
    saveXML(pmml(fit, data=iris, centers=initial), file=paste(dat_folder, "iris.hc.xml", sep="/"))

Model: Hierarchical Clustering

Page 16

    ## https://support.zementis.com/entries/21176632-what-types-of-svm-models-built-in-r-can-i-export-to-pmml
    f <- as.formula("species ~ .")
    fit <- ksvm(f, data=iris_train, kernel="rbfdot", prob.model=TRUE)

    print(fit)
    print(table(iris_test$species, predict(fit, iris_test)))

    out <- iris_full
    out$predict <- predict(fit, out)

    write.table(out, file=paste(dat_folder, "iris.svm.tsv", sep="/"), quote=FALSE, sep="\t", row.names=FALSE)
    saveXML(pmml(fit, dataset=iris_train), file=paste(dat_folder, "iris.svm.xml", sep="/"))

Model: Support Vector Machine

Page 18

Anatomy of an Enterprise app

definition of a typical Enterprise workflow which crosses through multiple departments, languages, and technologies…

[Figure: workflow diagram with stages labeled data sources, ETL, data prep, predictive model, and end uses]

Page 19

Anatomy of an Enterprise app

ANSI SQL for ETL

Page 20

Anatomy of an Enterprise app

Java, Pig for business logic

Page 21

Anatomy of an Enterprise app

SAS for predictive models

Page 22

Anatomy of an Enterprise app

SAS for predictive models, ANSI SQL for ETL: most of the licensing costs…

Page 23

Anatomy of an Enterprise app

Java, Pig for business logic: most of the project costs…

Page 24

Anatomy of an Enterprise app

[Figure: workflow diagram with stages labeled data sources, ETL, data prep, predictive model, and end uses; the annotations follow]

• Lingual: DW → ANSI SQL

• Pattern: SAS, R, etc. → PMML

• business logic in Java, Clojure, Scala, etc.

• sink taps for Memcached, HBase, MongoDB, etc.

• source taps for Cassandra, JDBC, Splunk, etc.

Cascading allows multiple departments to combine their workflow components into an integrated app – one among many, typically – based on 100% open source

a compiler sees it all… one connected DAG:

• optimization

• troubleshooting

• exception handling

• notifications

cascading.org

Page 25

Anatomy of an Enterprise app

a compiler sees it all…

    FlowDef flowDef = FlowDef.flowDef()
      .setName( "etl" )
      .addSource( "example.employee", emplTap )
      .addSource( "example.sales", salesTap )
      .addSink( "results", resultsTap );

    SQLPlanner sqlPlanner = new SQLPlanner()
      .setSql( sqlStatement );

    flowDef.addAssemblyPlanner( sqlPlanner );

cascading.org

Page 26

Anatomy of an Enterprise app

a compiler sees it all…

    FlowDef flowDef = FlowDef.flowDef()
      .setName( "classifier" )
      .addSource( "input", inputTap )
      .addSink( "classify", classifyTap );

    PMMLPlanner pmmlPlanner = new PMMLPlanner()
      .setPMMLInput( new File( pmmlModel ) )
      .retainOnlyActiveIncomingFields();

    flowDef.addAssemblyPlanner( pmmlPlanner );

Page 27

Cascading – functional programming

Key insight: MapReduce is based on functional programming, going back to LISP in the 1970s. Apache Hadoop use cases are mostly about data pipelines, which are functional in nature.

to ease staffing problems as “Main Street” Enterprise firms began to embrace Hadoop, Cascading was introduced in late 2007, as a new Java API to implement functional programming for large-scale data workflows:

• leverages JVM and Java-based tools without any need to create new languages

• allows programmers who have Java expertise to leverage the economics of Hadoop clusters

Edgar Codd alluded to this (DSLs for structuring data) in his original paper about the relational model

Page 28

Cascading – functional programming

• Twitter, eBay, LinkedIn, Nokia, YieldBot, uSwitch, etc., have invested in open source projects atop Cascading – used for their large-scale production deployments

• new case studies for Cascading apps are mostly based on domain-specific languages (DSLs) in JVM languages which emphasize functional programming:

Cascalog in Clojure (2010)
github.com/nathanmarz/cascalog/wiki

Scalding in Scala (2012)
github.com/twitter/scalding/wiki

Why Adopting the Declarative Programming Practices Will Improve Your Return from Technology
Dan Woods, 2013-04-17, Forbes
forbes.com/sites/danwoods/2013/04/17/why-adopting-the-declarative-programming-practices-will-improve-your-return-from-technology/

Page 29

Functional Programming for Big Data

WordCount with token scrubbing…

Apache Hive: 52 lines HQL + 8 lines Python (UDF)

compared to

Scalding: 18 lines Scala/Cascading

functional programming languages help reduce software engineering costs at scale, over time

Page 30

Cascading – deployments

• case studies: Climate Corp, Twitter, Etsy, Williams-Sonoma, uSwitch, Airbnb, Nokia, YieldBot, Square, Harvard, Factual, etc.

• use cases: ETL, marketing funnel, anti-fraud, social media, retail pricing, search analytics, recommenders, eCRM, utility grids, telecom, genomics, climatology, agronomics, etc.

Page 31

Workflow Abstraction – pattern language

Cascading uses a “plumbing” metaphor in Java to define workflows out of familiar elements: Pipes, Taps, Tuple Flows, Filters, Joins, Traps, etc.

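[Figure: flow diagram with elements labeled Document Collection, Tokenize, Scrub token, Regex token, Stop Word List, HashJoin Left (with RHS), GroupBy token, Count, and Word Count; map (M) and reduce (R) phases marked]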

data is represented as flows of tuples

operations in the flows bring functional programming aspects into Java
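The plumbing idea can be approximated with plain Java streams. The sketch below is not the Cascading API: it is a standard-library stand-in that chains the same kinds of steps named in the diagram (tokenize, scrub, stop-word filtering, group by token, count). The class name `TokenFlowSketch`, the method `countTokens`, and the sample input are assumptions made for illustration; the split regex is the one used in the WordCount examples in this deck.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical stand-in for a tuple flow; uses java.util.stream, not Cascading.
class TokenFlowSketch {
    static Map<String, Long> countTokens(List<String> lines, Set<String> stopWords) {
        return lines.stream()
            // Tokenize: split each text line on spaces, brackets, commas, periods
            .flatMap(line -> Arrays.stream(line.split("[ \\[\\]\\(\\),.]")))
            // Scrub: trim and lowercase each token, then drop empty tokens
            .map(token -> token.trim().toLowerCase())
            .filter(token -> !token.isEmpty())
            // Stop word list: discard tokens found in the list
            .filter(token -> !stopWords.contains(token))
            // GroupBy token, then Count; TreeMap keeps the output sorted
            .collect(Collectors.groupingBy(token -> token, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "A rain shadow is a dry area.",
            "The rain (shadow) area");
        System.out.println(countTokens(lines, Set.of("a", "is", "the")));
    }
}
```

Each stream stage maps to one element of the diagram, which is the property the slide highlights: the code reads as a description of the flow rather than of the execution.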

A Pattern Language
Christopher Alexander, et al.
amazon.com/dp/0195019199

Page 32

Workflow Abstraction – literate programming

Cascading workflows generate their own visual documentation: flow diagrams

in formal terms, flow diagrams leverage a methodology called literate programming

provides intuitive, visual representations for apps – great for cross-team collaboration

Literate Programming
Don Knuth
literateprogramming.com

Page 33

Workflow Abstraction – business process

following the essence of literate programming, Cascading workflows provide statements of business process

this recalls a sense of business process management for Enterprise apps (think BPM/BPEL for Big Data)

Cascading creates a separation of concerns between business process and implementation details (Hadoop, etc.)

this is especially apparent in large-scale Cascalog apps:

“Specify what you require, not how to achieve it.”

by virtue of the pattern language, the flow planner then determines how to translate business process into efficient, parallel jobs at scale

Page 34

The Ubiquitous Word Count

Definition: count how often each word appears in a collection of text documents

    void map (String doc_id, String text):
      for each word w in segment(text):
        emit(w, "1");

    void reduce (String word, Iterator group):
      int count = 0;

      for each pc in group:
        count += Int(pc);

      emit(word, String(count));

this simple program provides an excellent test case for parallel processing:

• requires a minimal amount of code

• demonstrates use of both symbolic and numeric values

• shows a dependency graph of tuples as an abstraction

• is not many steps away from useful search indexing

• serves as a “Hello World” for Hadoop apps

a distributed computing framework that runs Word Count efficiently in parallel at scale can handle much larger and more interesting compute problems
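The map and reduce pseudocode on this slide can be run on a single machine. The sketch below is a plain-Java illustration, not Hadoop code: `map` emits a (word, 1) pair per word, `run` plays the role of the framework by grouping pairs by key, and `reduce` sums each group. The class name `MapReduceWordCountSketch`, the whitespace-based `segment` step, and the document ids are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process illustration of the map/reduce pseudocode.
class MapReduceWordCountSketch {
    // map: emit a (word, 1) pair for every word in the document text
    static List<Map.Entry<String, Integer>> map(String docId, String text) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : text.split("\\s+")) { // assumed segment(): split on whitespace
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // reduce: sum the partial counts collected for one word
    static int reduce(String word, List<Integer> group) {
        int count = 0;
        for (int pc : group) {
            count += pc;
        }
        return count;
    }

    // Stand-in for the framework: map every document, group by key, reduce.
    static Map<String, Integer> run(Map<String, String> docs) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (Map.Entry<String, Integer> pair : map(doc.getKey(), doc.getValue())) {
                groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : groups.entrySet()) {
            counts.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(run(Map.of("doc01", "rain shadow rain", "doc02", "shadow")));
    }
}
```

In a real cluster the grouping step is the shuffle between mappers and reducers; here it is an in-memory map, which is enough to show the dependency between the two functions.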

Page 35

WordCount – conceptual flow diagram

[Figure: flow diagram with elements labeled Document Collection, Tokenize, GroupBy token, Count, and Word Count; map (M) and reduce (R) phases marked]

1 map, 1 reduce; 18 lines of code: gist.github.com/3900702

cascading.org/category/impatient

Page 36

WordCount – Cascading app in Java

    String docPath = args[ 0 ];
    String wcPath = args[ 1 ];
    Properties properties = new Properties();
    AppProps.setApplicationJarClass( properties, Main.class );
    HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

    // create source and sink taps
    Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
    Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

    // specify a regex to split "document" text lines into token stream
    Fields token = new Fields( "token" );
    Fields text = new Fields( "text" );
    RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
    // only returns "token"
    Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

    // determine the word counts
    Pipe wcPipe = new Pipe( "wc", docPipe );
    wcPipe = new GroupBy( wcPipe, token );
    wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

    // connect the taps, pipes, etc., into a flow
    FlowDef flowDef = FlowDef.flowDef()
      .setName( "wc" )
      .addSource( docPipe, docTap )
      .addTailSink( wcPipe, wcTap );

    // write a DOT file and run the flow
    Flow wcFlow = flowConnector.connect( flowDef );
    wcFlow.writeDOT( "dot/wc.dot" );
    wcFlow.complete();

Page 37

WordCount – generated flow diagram

[Figure: flow diagram generated from the app, with map and reduce phases marked; nodes from [head] to [tail]:]

• Hfs['TextDelimited[['doc_id', 'text']->[ALL]]']['data/rain.txt']']

• Each('token')[RegexSplitGenerator[decl:'token'][args:1]]

• GroupBy('wc')[by:['token']]

• Every('wc')[Count[decl:'count']]

• Hfs['TextDelimited[[UNKNOWN]->['token', 'count']]']['output/wc']']

tuple fields on the edges: ['doc_id', 'text'], then ['token'], then ['token', 'count']

Page 38

    (ns impatient.core
      (:use [cascalog.api]
            [cascalog.more-taps :only (hfs-delimited)])
      (:require [clojure.string :as s]
                [cascalog.ops :as c])
      (:gen-class))

    (defmapcatop split [line]
      "reads in a line of string and splits it by regex"
      (s/split line #"[\[\]\\\(\),.)\s]+"))

    (defn -main [in out & args]
      (?<- (hfs-delimited out)
           [?word ?count]
           ((hfs-delimited in :skip-header? true) _ ?line)
           (split ?line :> ?word)
           (c/count ?count)))

    ; Paul Lam
    ; github.com/Quantisan/Impatient

WordCount – Cascalog / Clojure

Page 39

github.com/nathanmarz/cascalog/wiki

• implements Datalog in Clojure, with predicates backed by Cascading – for a highly declarative language

• run ad-hoc queries from the Clojure REPL – approx. 10:1 code reduction compared with SQL

• composable subqueries, used for test-driven development (TDD) practices at scale

• Leiningen build: simple, no surprises, in Clojure itself

• more new deployments than other Cascading DSLs – Climate Corp is largest use case: 90% Clojure/Cascalog

• has a learning curve, limited number of Clojure developers

• aggregators are the magic, and those take effort to learn

WordCount – Cascalog / Clojure

Page 40

    import com.twitter.scalding._

    class WordCount(args : Args) extends Job(args) {
      Tsv(args("doc"), ('doc_id, 'text), skipHeader = true)
        .read
        .flatMap('text -> 'token) { text : String => text.split("[ \\[\\]\\(\\),.]") }
        .groupBy('token) { _.size('count) }
        .write(Tsv(args("wc"), writeHeader = true))
    }

WordCount – Scalding / Scala

Page 41

github.com/twitter/scalding/wiki

• extends the Scala collections API so that distributed lists become “pipes” backed by Cascading

• code is compact, easy to understand

• nearly 1:1 between elements of conceptual flow diagram and function calls

• extensive libraries are available for linear algebra, abstract algebra, machine learning – e.g., Matrix API, Algebird, etc.

• significant investments by Twitter, Etsy, eBay, etc.

• great for data services at scale

• less learning curve than Cascalog

WordCount – Scalding / Scala

Page 42

    CREATE TABLE text_docs (line STRING);

    LOAD DATA LOCAL INPATH 'data/rain.txt'
    OVERWRITE INTO TABLE text_docs;

    SELECT word, COUNT(*)
    FROM
      (SELECT split(line, '\t')[1] AS text FROM text_docs) t
    LATERAL VIEW explode(split(text, '[ ,\.\(\)]')) lTable AS word
    GROUP BY word;

WordCount – Apache Hive

Page 43

WordCount – Apache Hive

hive.apache.org

pro:

‣ most popular abstraction atop Apache Hadoop

‣ SQL-like language is syntactically familiar to most analysts

‣ simple to load large-scale unstructured data and run ad-hoc queries

con:

‣ not a relational engine, many surprises at scale

‣ difficult to represent complex workflows, ML algorithms, etc.

‣ one poorly-trained analyst can bottleneck an entire cluster

‣ app-level integration requires other coding, outside of script language

‣ logical planner mixed with physical planner; cannot collect app stats

‣ non-deterministic exec: number of maps+reduces may change unexpectedly

‣ business logic must cross multiple language boundaries: difficult to troubleshoot, optimize, audit, handle exceptions, set notifications, etc.

Page 44

    docPipe = LOAD '$docPath' USING PigStorage('\t', 'tagsource') AS (doc_id, text);
    docPipe = FILTER docPipe BY doc_id != 'doc_id';

    -- specify regex to split "document" text lines into token stream
    tokenPipe = FOREACH docPipe GENERATE doc_id, FLATTEN(TOKENIZE(text, ' [](),.')) AS token;
    tokenPipe = FILTER tokenPipe BY token MATCHES '\\w.*';

    -- determine the word counts
    tokenGroups = GROUP tokenPipe BY token;
    wcPipe = FOREACH tokenGroups GENERATE group AS token, COUNT(tokenPipe) AS count;

    -- output
    STORE wcPipe INTO '$wcPath' USING PigStorage('\t', 'tagsource');
    EXPLAIN -out dot/wc_pig.dot -dot wcPipe;

WordCount – Apache Pig

Page 45

WordCount – Apache Pig

pig.apache.org

pro:

‣ easy to learn data manipulation language (DML)

‣ interactive prompt (Grunt) makes it simple to prototype apps

‣ extensibility through UDFs

con:

‣ not a full programming language; must extend via UDFs outside of language

‣ app-level integration requires other coding, outside of script language

‣ simple problems are simple to do; hard problems become quite complex

‣ difficult to parameterize scripts externally; must rewrite to change taps!

‣ logical planner mixed with physical planner; cannot collect app stats

‣ non-deterministic exec: number of maps+reduces may change unexpectedly

‣ business logic must cross multiple language boundaries: difficult to troubleshoot, optimize, audit, handle exceptions, set notifications, etc.

Page 46

Two Avenues to the App Layer…

[Figure: chart with axes labeled scale ➞ and complexity ➞]

Enterprise: must contend with complexity at scale every day…

incumbents extend current practices and infrastructure investments – using JVM, ANSI SQL, SAS, etc. – to migrate workflows onto Apache Hadoop while leveraging existing staff

Start-ups: crave complexity and scale to become viable…

new ventures move into Enterprise space to compete using relatively lean staff, while leveraging sophisticated engineering practices, e.g., Cascalog and Scalding

Page 48

Pattern – model scoring

[Figure: architecture diagram of a Data Workflow running on a Hadoop Cluster; labels include source taps (customer profile DBs, Customer Prefs, logs), a trap tap, sink taps (Cache, Reporting, Analytics Cubes), Customers, Support, Web App, and Modeling with PMML]

• migrate workloads: SAS, Teradata, etc., exporting predictive models as PMML

• great open source tools – R, Weka, KNIME, Matlab, RapidMiner, etc.

• integrate with other libraries – Matrix API, etc.

• leverage PMML as another kind of DSL

cascading.org/pattern

Page 49

Pattern – score a model, using pre-defined Cascading app

[Figure: flow diagram with elements labeled Customer Orders, PMML Model, Classify, Scored Orders, Assert, Failure Traps, GroupBy token, Count, and Confusion Matrix; map (M) and reduce (R) phases marked]

cascading.org/pattern

Page 50

    public static void main( String[] args ) throws RuntimeException {
      String inputPath = args[ 0 ];
      String classifyPath = args[ 1 ];

      // set up the config properties
      Properties properties = new Properties();
      AppProps.setApplicationJarClass( properties, Main.class );
      HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

      // create source and sink taps
      Tap inputTap = new Hfs( new TextDelimited( true, "\t" ), inputPath );
      Tap classifyTap = new Hfs( new TextDelimited( true, "\t" ), classifyPath );

      // handle command line options
      OptionParser optParser = new OptionParser();
      optParser.accepts( "pmml" ).withRequiredArg();

      OptionSet options = optParser.parse( args );

      // connect the taps, pipes, etc., into a flow
      FlowDef flowDef = FlowDef.flowDef()
        .setName( "classify" )
        .addSource( "input", inputTap )
        .addSink( "classify", classifyTap );

      if( options.hasArgument( "pmml" ) ) {
        String pmmlPath = (String) options.valuesOf( "pmml" ).get( 0 );

        PMMLPlanner pmmlPlanner = new PMMLPlanner()
          .setPMMLInput( new File( pmmlPath ) )
          .retainOnlyActiveIncomingFields()
          .setDefaultPredictedField( new Fields( "predict", Double.class ) ); // default value if missing from the model

        flowDef.addAssemblyPlanner( pmmlPlanner );
      }

      // write a DOT file and run the flow
      Flow classifyFlow = flowConnector.connect( flowDef );
      classifyFlow.writeDOT( "dot/classify.dot" );
      classifyFlow.complete();
    }

Pattern – score a model, within an app

Page 51

Approach 1: Vagrant Cluster for Cascading and Hadoop

set up Vagrant (use v1.3.3 only!) and VirtualBox to run Cascading…

PS: we can share USB thumb drives to speed up box downloads!

github.com/Cascading/vagrant-cascading-hadoop-cluster

NB: when running Gradle builds, you must run as “root”…then when running Hadoop, you must run as “mapred” and use HDFS commands.

Page 52

Approach 2: Laptop Setup for Java, Hadoop, Gradle, Cascading

set up a build environment locally and run Apache Hadoop in “standalone” mode… works fine for Linux or MacOSX; however, please no “cdh”, “hdp”, “homebrew”, or “cygwin”

liber118.com/pxn/course/itds/install.html

download as a ZIP file, or use Git to clone the repo…

github.com/Cascading/Impatient

NB: when running Hadoop, you will run in local mode – no HDFS

Page 53

Approach 3: Login to a pre-configured EC2 Node

assuming you are familiar with using SSH on Linux or MacOSX, or using PuTTY on Windows…

we will give instructions during the workshop

NB: when running Hadoop, you will run in local mode – no HDFS

Page 56

Experiments – comparing models

• much customer interest in leveraging Cascading and Apache Hadoop to run customer experiments at scale

• run multiple variants, then measure relative “lift”

• Concurrent runtime – tag and track models

the following example compares two models trained with different machine learning algorithms

this is exaggerated: one model has an important variable intentionally omitted to help illustrate the experiment

Page 57

Experiments – Random Forest model

    ## train a Random Forest model
    ## example: http://mkseo.pe.kr/stats/?p=220
    f <- as.formula("as.factor(label) ~ var0 + var1 + var2")
    fit <- randomForest(f, data=data, proximity=TRUE, ntree=25)
    print(fit)
    saveXML(pmml(fit), file=paste(out_folder, "sample.rf.xml", sep="/"))

    OOB estimate of error rate: 14%
    Confusion matrix:
        0   1 class.error
    0  69  16   0.1882353
    1  12 103   0.1043478
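The error figures in this output follow directly from the four cell counts. As a worked check, the sketch below recomputes them in plain Java, assuming rows are true classes and columns are predicted classes (the reading under which the printed class.error values reproduce). The class name `ConfusionMatrixSketch` and its methods are hypothetical names chosen for illustration.

```java
// Hypothetical helper that recomputes the figures printed for the Random Forest model.
class ConfusionMatrixSketch {
    // Assumed layout: rows = true class, columns = predicted class.
    static final int[][] RF = { { 69, 16 }, { 12, 103 } };

    // Share of one true class that was predicted as some other class.
    static double classError(int[][] m, int trueClass) {
        int rowTotal = 0;
        for (int n : m[trueClass]) {
            rowTotal += n;
        }
        return (rowTotal - m[trueClass][trueClass]) / (double) rowTotal;
    }

    // Share of all cases that fall off the diagonal.
    static double errorRate(int[][] m) {
        int total = 0;
        int correct = 0;
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < m[i].length; j++) {
                total += m[i][j];
                if (i == j) {
                    correct += m[i][j];
                }
            }
        }
        return (total - correct) / (double) total;
    }

    public static void main(String[] args) {
        System.out.printf("error rate: %.2f%n", errorRate(RF));        // (16 + 12) / 200
        System.out.printf("class 0 error: %.7f%n", classError(RF, 0)); // 16 / 85
        System.out.printf("class 1 error: %.7f%n", classError(RF, 1)); // 12 / 115
    }
}
```

So 28 of the 200 cases are misclassified, which gives the 14% overall rate, and the per-class errors are 16/85 and 12/115.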

Page 58

Experiments – Logistic Regression model

    ## train a Logistic Regression model (special case of GLM)
    ## example: http://www.stat.cmu.edu/~cshalizi/490/clustering/clustering01.r
    f <- as.formula("as.factor(label) ~ var0 + var2")
    fit <- glm(f, family=binomial, data=data)
    print(summary(fit))
    saveXML(pmml(fit), file=paste(out_folder, "sample.lr.xml", sep="/"))

    Coefficients:
                Estimate Std. Error z value Pr(>|z|)
    (Intercept)   1.8524     0.3803   4.871 1.11e-06 ***
    var0         -1.3755     0.4355  -3.159  0.00159 **
    var2         -3.7742     0.5794  -6.514 7.30e-11 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

NB: this model has “var1” intentionally omitted

Page 59: ACM Bay Area Data Mining Workshop: Pattern, PMML, Hadoop

Experiments – comparing results

• use a confusion matrix to compare results for the classifiers

• Logistic Regression has a lower “false negative” rate (5% vs. 11%); however, it has a much higher “false positive” rate (52% vs. 14%)

• assign a cost model to select a winner – for example, in an ecommerce anti-fraud classifier:

FN ∼ chargeback risk
FP ∼ customer support costs
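A minimal sketch of how such a cost model could be applied, written in Python. It uses the out-of-bag confusion matrix printed for the Random Forest model earlier, so the rates it yields need not match the percentages quoted in the bullets, which may come from a separate scoring run. Treating label 1 as the "positive" (fraud) class is an assumption, and the two cost weights are hypothetical placeholders rather than values from the workshop.

```python
# Confusion matrix for the Random Forest model, copied from the OOB output:
# rows are true labels, columns are predicted labels.
# ASSUMPTION: label 1 is the "positive" class (e.g., fraud).
rf_confusion = [[69, 16],    # true 0: predicted 0, predicted 1
                [12, 103]]   # true 1: predicted 0, predicted 1


def error_rates(confusion):
    """Return (false_positive_rate, false_negative_rate) for a 2x2 matrix."""
    (tn, fp), (fn, tp) = confusion
    return fp / (fp + tn), fn / (fn + tp)


def expected_cost(confusion, cost_fp, cost_fn):
    """Average cost per scored case, given a cost for each kind of error."""
    (tn, fp), (fn, tp) = confusion
    total = tn + fp + fn + tp
    return (fp * cost_fp + fn * cost_fn) / total


# HYPOTHETICAL costs in arbitrary units: a missed fraud (FN, chargeback) is
# assumed to cost ten times as much as a false alarm (FP, support contact).
COST_FP = 1.0
COST_FN = 10.0

fpr, fnr = error_rates(rf_confusion)
cost = expected_cost(rf_confusion, COST_FP, COST_FN)
print(f"FP rate {fpr:.3f}, FN rate {fnr:.3f}, cost per case {cost:.2f}")
```

Running the same two functions on the confusion matrix of each candidate model, with costs agreed by the business, gives a single number per model; the model with the lower cost per case would be the winner under that cost model.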

Page 60: ACM Bay Area Data Mining Workshop: Pattern, PMML, Hadoop

Why Do Ensembles Matter?

The World…per Data Modeling

The World…

Page 61: ACM Bay Area Data Mining Workshop: Pattern, PMML, Hadoop

Two Cultures

“A new research community using these tools sprang up. Their goal was predictive accuracy. The community consisted of young computer scientists, physicists and engineers plus a few aging statisticians. They began using the new tools in working on complex prediction problems where it was obvious that data models were not applicable: speech recognition, image recognition, nonlinear time series prediction, handwriting recognition, prediction in financial markets.”

Statistical Modeling: The Two Cultures
Leo Breiman, 2001
bit.ly/eUTh9L

chronicled a sea change from data modeling (silos, manual process) to the rising use of algorithmic modeling (machine data for automation/optimization) which led in turn to the practice of leveraging inter-disciplinary teams

Page 62: ACM Bay Area Data Mining Workshop: Pattern, PMML, Hadoop

Ensemble Models

Breiman: “a multiplicity of data models”

BellKor team: 100+ individual models in 2007 Progress Prize

while the process of combining models adds complexity (making it more difficult to anticipate or explain predictions) accuracy may increase substantially

Ensemble Learning: Better Predictions Through Diversity
Todd Holloway
ETech (2008)
abeautifulwww.com/EnsembleLearningETech.pdf

The Story of the Netflix Prize: An Ensembler’s Tale
Lester Mackey
National Academies Seminar, Washington, DC (2011)
stanford.edu/~lmackey/papers/
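To make the accuracy claim on the Ensemble Models slide concrete, here is a small Python sketch under strong assumptions: every model is right with the same probability, errors are independent, and the models are combined by simple majority vote. Majority voting is a swapped-in, simplified rule for illustration only; it is not the blending method used by the BellKor team, and real ensembles only approximate independence.

```python
from math import comb


def majority_vote_accuracy(n_models, p_correct):
    """Probability that a strict majority of n independent voters is right.

    ASSUMPTION: each model is correct with the same probability and the
    models err independently of one another.
    """
    if n_models % 2 == 0:
        raise ValueError("use an odd number of models to avoid ties")
    needed = n_models // 2 + 1
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct)**(n_models - k)
        for k in range(needed, n_models + 1)
    )


# HYPOTHETICAL individual accuracy of 0.6, for illustration only.
for n in (1, 5, 25):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

Under these assumptions the combined accuracy rises with the number of models, while explaining any single prediction requires reasoning about all of the votes, which is the complexity trade-off the slide describes.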

Page 63: ACM Bay Area Data Mining Workshop: Pattern, PMML, Hadoop

KDD 2013 PMML Workshop

Pattern: PMML for Cascading and Hadoop
Paco Nathan, Girish Kathalagiri
Chicago (2013-08-11)

19th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
kdd13pmml.wordpress.com

Page 64: ACM Bay Area Data Mining Workshop: Pattern, PMML, Hadoop

Pattern: Example App

• example integration of PMML and Cascading, using a sample app based on the crime dataset from the City of Chicago Open Data

• sample app implements a predictive model for expected crime rates based on location, hour of day, and month

• modeling performed in R, using the pmml package

• multiple models are captured as PMML, then integrated via Pattern to implement the entire workflow as a single app

• PMML provides a vector for migrating workloads off of SAS, SPSS, etc., onto Hadoop clusters for more cost-effective scaling

Page 65: ACM Bay Area Data Mining Workshop: Pattern, PMML, Hadoop

Pattern: Example App

City of Chicago Open Data portal
cityofchicago.org/city/en/narr/foia/CityData.html

Pattern open source project
github.com/Cascading/pattern

Observed benefits include greatly reduced development costs and fewer licensing issues at scale, while leveraging the scalability of Apache Hadoop clusters, existing intellectual property in predictive models, and the core competencies of analytics staff.

Analysts can train predictive models in popular analytics frameworks, such as SAS, Microstrategy, R, Weka, SQL Server, etc., then run those models at scale on Apache Hadoop with little or no coding required.

Page 66: ACM Bay Area Data Mining Workshop: Pattern, PMML, Hadoop

Pattern API: Support for Model Chaining, Transforms, etc.

workflow used for data preparation:

Page 67: ACM Bay Area Data Mining Workshop: Pattern, PMML, Hadoop

Pattern API: Support for Model Chaining, Transforms, etc.

workflow used for model scoring:

Page 69: ACM Bay Area Data Mining Workshop: Pattern, PMML, Hadoop

employing a mode of thought which includes both logical and analytical reasoning: evaluating the whole of a problem, as well as its component parts; attempting to assess the effects of changing one or more variables

this approach attempts to understand not just problems and solutions, but also the processes involved and their variances

particularly valuable in Big Data work when combined with hands-on experience in physics – roughly 50% of my peers come from physics or physical engineering…

programmers typically don’t think this way… however, both systems engineers and data scientists must

Process Variation Data Tools

Statistical Thinking

Page 70: ACM Bay Area Data Mining Workshop: Pattern, PMML, Hadoop

approximately 80% of the costs for data-related projects get spent on data preparation – mostly on cleaning up data quality issues: ETL, log files, etc., generally by socializing the problem

unfortunately, data-related budgets tend to go into frameworks which can only be used after cleanup

most valuable skills:

‣ learn to use programmable tools that prepare data

‣ learn to understand the audience and their priorities

‣ learn to generate compelling data visualizations

‣ learn to estimate the confidence for reported results

‣ learn to automate work, making analysis repeatable

d3js.org

What is needed most?

Page 71: ACM Bay Area Data Mining Workshop: Pattern, PMML, Hadoop

Team Process = Needs

discovery: help people ask the right questions
modeling: allow automation to place informed bets
integration: deliver data products at scale to LOB end uses
apps: build smarts into product features
systems: keep infrastructure running, cost-effective

analysts, engineers, inter-disciplinary leadership

Page 72: ACM Bay Area Data Mining Workshop: Pattern, PMML, Hadoop

Team Composition = Roles

Domain Expert: business process, stakeholder
Data Scientist: data prep, discovery, modeling, etc.
App Dev: software engineering, automation
Ops: systems engineering, availability

diagram labels: data science; introduced capability

leverage non-traditional pairing among roles, to complement skills and tear down silos

Page 73: ACM Bay Area Data Mining Workshop: Pattern, PMML, Hadoop

Team Composition = Needs × Roles

[matrix diagram: the needs (discovery, modeling, integration, apps, systems) plotted against the roles]

Domain Expert: business process, stakeholder
Data Scientist: data prep, discovery, modeling, etc.
App Dev: software engineering, automation
Ops: systems engineering, availability

diagram labels: data science; introduced capability

Page 75: ACM Bay Area Data Mining Workshop: Pattern, PMML, Hadoop

Cluster Computing’s Dirty Little Secret

many of us make a good living by leveraging high ROI apps based on clusters, and so execs agree to build out more data centers…

clusters for Hadoop/HBase, for Storm, for MySQL, for Memcached, for Cassandra, for Nginx, etc.

this becomes expensive!

a single class of workloads on a given cluster is simpler to manage, but terrible for utilization… various notions of “cloud” help…

Cloudera, Hortonworks, probably EMC soon: sell a notion of “Hadoop as OS” ⇒ All your workloads are belong to us

Google Data Center, Fox News

~2002

Page 76: ACM Bay Area Data Mining Workshop: Pattern, PMML, Hadoop

Beyond Hadoop

Hadoop – an open source solution for fault-tolerant parallel processing of batch jobs at scale, based on commodity hardware… however, other priorities have emerged for the analytics lifecycle:

• apps require integration beyond Hadoop

• multiple topologies, mixed workloads, multi-tenancy

• higher utilization

• lower latency

• highly-available, long running services

• more than “Just JVM” – e.g., Python growth

keep in mind the priority for multi-disciplinary efforts, to break down even more silos – well beyond the de facto “priesthood” of data engineering

Page 77: ACM Bay Area Data Mining Workshop: Pattern, PMML, Hadoop

Beyond Hadoop

Google has been doing data center computing for years, to address the complexities of large-scale data workflows:

• leveraging the modern kernel: isolation in lieu of VMs

• “most (>80%) jobs are batch jobs, but the majority of resources (55–80%) are allocated to service jobs”

• mixed workloads, multi-tenancy

• relatively high utilization rates

• JVM? not so much…

• reality: scheduling batch is simple; scheduling services is hard/expensive

Page 79: ACM Bay Area Data Mining Workshop: Pattern, PMML, Hadoop

“Return of the Borg”

Omega: flexible, scalable schedulers for large compute clusters
Malte Schwarzkopf, Andy Konwinski, Michael Abd-El-Malek, John Wilkes
eurosys2013.tudos.org/wp-content/uploads/2013/paper/Schwarzkopf.pdf

Page 80: ACM Bay Area Data Mining Workshop: Pattern, PMML, Hadoop

Mesos – definitions

a common substrate for cluster computing

http://mesos.apache.org/

heterogeneous assets in your data center or cloud made available as a homogeneous set of resources

• top-level Apache project

• scalability to 10,000s of nodes

• obviates the need for virtual machines

• isolation (pluggable) for CPU, RAM, I/O, FS, etc.

• fault-tolerant replicated master using ZooKeeper

• multi-resource scheduling (memory and CPU aware)

• APIs in C++, Java, Python

• web UI for inspecting cluster state

• available for Linux, OpenSolaris, Mac OS X
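The bullet on multi-resource scheduling can be illustrated with Dominant Resource Fairness (DRF), a policy described by the Mesos authors for sharing several resource types at once. What follows is a toy Python sketch of that idea, not Mesos code and not its API: the function name, the cluster size, and the two workloads are all hypothetical, and real allocators also deal with offers, weights, and task completion.

```python
from fractions import Fraction


def drf_allocate(capacity, demands):
    """Toy Dominant Resource Fairness loop.

    capacity: dict resource -> total amount in the cluster
    demands:  dict user -> dict resource -> amount needed per task
    Returns dict user -> number of tasks launched.
    """
    used = {r: 0 for r in capacity}
    tasks = {u: 0 for u in demands}
    active = set(demands)

    def dominant_share(user):
        # Largest fraction of any single resource held by this user.
        return max(
            Fraction(tasks[user] * demands[user][r], capacity[r])
            for r in capacity
        )

    while active:
        # Offer the next task to the user with the smallest dominant share.
        user = min(sorted(active), key=dominant_share)
        need = demands[user]
        if all(used[r] + need[r] <= capacity[r] for r in capacity):
            for r in capacity:
                used[r] += need[r]
            tasks[user] += 1
        else:
            # This user's next task no longer fits; stop offering to them.
            active.remove(user)
    return tasks


# HYPOTHETICAL cluster and workloads, for illustration only.
capacity = {"cpu": 9, "mem_gb": 18}
demands = {
    "batch": {"cpu": 1, "mem_gb": 4},     # memory-heavy tasks
    "service": {"cpu": 3, "mem_gb": 1},   # CPU-heavy tasks
}
print(drf_allocate(capacity, demands))
```

The point of the sketch is that the scheduler compares users by whichever resource each one consumes most heavily, so a memory-heavy workload and a CPU-heavy workload can share one pool of machines without either starving the other.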

Page 81: ACM Bay Area Data Mining Workshop: Pattern, PMML, Hadoop

Mesos – architecture

given the use of Mesos as a Data Center OS kernel…

• Chronos provides complex scheduling capabilities, much like a distributed Unix “cron”

• Marathon provides highly-available long-running services, much like a distributed Unix “init.d”

• next time you need to build a distributed app, consider using these as building blocks

a major lesson learned from Spark:

• leveraging these kinds of building blocks, one can rebuild Hadoop 100x faster, in much less code

Page 82: ACM Bay Area Data Mining Workshop: Pattern, PMML, Hadoop

Mesos – architecture

[architecture diagram; labels as extracted]

layers: Workloads, Apps, Frameworks, Kernel, DFS (distributed file system), Cluster (distributed resources: CPU, RAM, I/O, FS, rack locality, etc.)

runtimes: Ruby, Python, JVM, C++

components: Storm, Chronos, Marathon, Spark, Hadoop, MPI, MySQL, Kafka, JBoss, Django, Rails, Shark, Impala, Scalding

job types: services, batch

Page 83: ACM Bay Area Data Mining Workshop: Pattern, PMML, Hadoop

Deployments

Page 84: ACM Bay Area Data Mining Workshop: Pattern, PMML, Hadoop

Case Study: Twitter (bare metal / on premise)

“Mesos is the cornerstone of our elastic compute infrastructure – it’s how we build all our new services and is critical for Twitter’s continued success at scale. It’s one of the primary keys to our data center efficiency.”

Chris Fry, SVP Engineering
blog.twitter.com/2013/mesos-graduates-from-apache-incubation

• key services run in production: analytics, typeahead, ads

• Twitter engineers rely on Mesos to build all new services

• instead of thinking about static machines, engineers think about resources like CPU, memory and disk

• allows services to scale and leverage a shared pool of servers across data centers efficiently

• reduces the time between prototyping and launching

Page 85: ACM Bay Area Data Mining Workshop: Pattern, PMML, Hadoop

Case Study: Airbnb (fungible cloud infrastructure)

“We think we might be pushing data science in the field of travel more so than anyone has ever done before… a smaller number of engineers can have higher impact through automation on Mesos.”

Mike Curtis, VP Engineering
gigaom.com/2013/07/29/airbnb-is-engineering-itself-into-a-data-driven...

• improves resource management and efficiency

• helps advance engineering strategy of building small teams that can move fast

• key to letting engineers make the most of AWS-based infrastructure beyond just Hadoop

• allowed company to migrate off Elastic MapReduce

• enables use of Hadoop along with Chronos, Spark, Storm, etc.

Page 86: ACM Bay Area Data Mining Workshop: Pattern, PMML, Hadoop

Arguments for Data Center Computing

rather than running several specialized clusters, each at relatively low utilization rates, instead run many mixed workloads

obvious benefits are realized in terms of:

• scalability, elasticity, fault tolerance, performance, utilization

• reduced equipment cap-ex, Ops overhead, etc.

• reduced licensing, eliminating need for VMs or potential vendor lock-in

subtle benefits – arguably, more important for Enterprise IT:

• reduced time for engineers to ramp-up new services at scale

• reduced latency between batch and services, enabling new high-ROI use cases

• enables Dev/Test apps to run safely on a Production cluster

Page 87: ACM Bay Area Data Mining Workshop: Pattern, PMML, Hadoop

Media Coverage

Mesosphere Adds Docker Support To Its Mesos-Based Operating System For The Data Center
Frederic Lardinois
TechCrunch (2013-09-26)
techcrunch.com/2013/09/26/mesosphere...

Play Framework Grid Deployment with Mesos
James Ward, Flo Leibert, et al.
Typesafe blog (2013-09-19)
typesafe.com/blog/play-framework-grid...

Mesosphere Launches Marathon Framework
Adrian Bridgwater
Dr. Dobbs (2013-09-18)
drdobbs.com/open-source/mesosphere...

New open source tech Marathon wants to make your data center run like Google’s
Derrick Harris
GigaOM (2013-09-04)
gigaom.com/2013/09/04/new-open-source...

Running batch and long-running, highly available service jobs on the same cluster
Ben Lorica
O’Reilly (2013-09-01)
strata.oreilly.com/2013/09/running-batch...

