Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Date post: 16-Apr-2017
Upload: spark-summit
Reactive Feature Generation with Spark and MLlib Jeff Smith x.ai
Transcript
Page 1: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Reactive Feature Generation with Spark and MLlib

Jeff Smith x.ai

Page 2: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

x.ai is a personal assistant who schedules meetings for you

Page 3: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

MANNING

Jeff Smith

Page 4: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

REACTIVE MACHINE LEARNING

Page 5: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Reactive Machine Learning

Page 6: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Machine Learning Systems

Page 7: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Traits of Reactive Systems

Page 8: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Responsive

Page 9: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Resilient

Page 10: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Elastic

Page 11: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Message-Driven

Page 12: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Reactive Strategies

Page 13: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Machine Learning Data

Page 14: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Machine Learning Data

Page 15: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Machine Learning Systems

Page 16: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

INTRODUCING FEATURES

Page 17: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Microblogging Data

Squawks Squawkers Super

Page 18: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Feature Generation

Raw Data → Feature Generation Pipeline → Features

Page 19: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

FEATURE TRANSFORMS

Page 20: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Feature Generation

Raw Data → Feature Generation Pipeline → Features

Extract Transform

Page 21: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Features

trait FeatureType[V] {
  val name: String
}

trait Feature[V] extends FeatureType[V] {
  val value: V
}

Page 22: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Features

case class IntFeature(name: String, value: Int) extends Feature[Int]

case class BooleanFeature(name: String, value: Boolean) extends Feature[Boolean]

Page 23: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Named Transforms

def binarize(feature: IntFeature, threshold: Double) = {
  BooleanFeature("binarized-" + feature.name, feature.value > threshold)
}
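The transform above can be tried without Spark. The following is a minimal sketch that repeats the trait and case class definitions from the previous slides so that it runs on its own; the feature name "squawk-length" and the threshold of 100.0 are made-up values for illustration.

```scala
// Feature abstractions as defined on the earlier slides.
trait FeatureType[V] { val name: String }
trait Feature[V] extends FeatureType[V] { val value: V }

case class IntFeature(name: String, value: Int) extends Feature[Int]
case class BooleanFeature(name: String, value: Boolean) extends Feature[Boolean]

// The named transform from the slide: the output name records which transform was applied.
def binarize(feature: IntFeature, threshold: Double): BooleanFeature =
  BooleanFeature("binarized-" + feature.name, feature.value > threshold)

// Hypothetical input: a squawk that is 120 characters long.
val squawkLength = IntFeature("squawk-length", 120)
val isLong = binarize(squawkLength, threshold = 100.0)
```

Because the output name is built from the input name, the resulting feature carries a record of how it was produced.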

Page 24: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Non-Trivial Transforms

def categorize(thresholds: List[Double]) = {
  (rawFeature: DoubleFeature) => {
    IntFeature("categorized-" + rawFeature.name,
      thresholds.sorted
        .zipWithIndex
        .find { case (threshold, i) => rawFeature.value < threshold }
        .getOrElse((None, -1))
        ._2)
  }
}
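To make the bucketing behavior concrete, here is a sketch of the same idea with a few sample values. `DoubleFeature` is used on the slide but never defined in the deck, so its definition below is an assumption, as are the thresholds and the "interactions" feature name. The lookup is written with `map`/`getOrElse(-1)` rather than the tuple fallback; the result should be the same.

```scala
trait FeatureType[V] { val name: String }
trait Feature[V] extends FeatureType[V] { val value: V }
case class IntFeature(name: String, value: Int) extends Feature[Int]
// Assumed definition: the slides use DoubleFeature without showing it.
case class DoubleFeature(name: String, value: Double) extends Feature[Double]

// Returns a function that maps a raw value to the index of the first
// (sorted) threshold it falls below, or -1 if it is below none of them.
def categorize(thresholds: List[Double]): DoubleFeature => IntFeature = {
  val indexedThresholds = thresholds.sorted.zipWithIndex
  (rawFeature: DoubleFeature) => {
    val category = indexedThresholds
      .find { case (threshold, _) => rawFeature.value < threshold }
      .map { case (_, index) => index }
      .getOrElse(-1)
    IntFeature("categorized-" + rawFeature.name, category)
  }
}

// Hypothetical thresholds, deliberately unsorted to show that order does not matter.
val categorizeInteractions = categorize(List(10.0, 1.0, 5.0))

val low = categorizeInteractions(DoubleFeature("interactions", 0.5))
val middle = categorizeInteractions(DoubleFeature("interactions", 3.0))
val high = categorizeInteractions(DoubleFeature("interactions", 20.0))
```

With these inputs, a value below every threshold lands in category 0, and a value above every threshold gets the sentinel -1.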

Page 25: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Standardizing Naming

trait Named {
  def name(inputFeature: Feature[Any]) : String = {
    inputFeature.name + "-" +
      Thread.currentThread.getStackTrace()(3).getMethodName
  }
}

object Binarizer extends Named {
  def binarize(feature: IntFeature, threshold: Double) = {
    BooleanFeature(name(feature), feature.value > threshold)
  }
}
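Reading the method name from the stack trace ties the naming scheme to a fixed frame index, which may depend on how the compiler emits trait methods. As one possible alternative (not the approach shown on the slide), the sketch below has each transform declare its name explicitly. `NamedTransform` and `transformName` are hypothetical names introduced here, and `Feature[_]` is used for the parameter so that any feature type is accepted.

```scala
trait FeatureType[V] { val name: String }
trait Feature[V] extends FeatureType[V] { val value: V }
case class IntFeature(name: String, value: Int) extends Feature[Int]
case class BooleanFeature(name: String, value: Boolean) extends Feature[Boolean]

// Hypothetical alternative to the stack-trace lookup:
// each transform object states its own name once.
trait NamedTransform {
  def transformName: String

  def name(inputFeature: Feature[_]): String =
    inputFeature.name + "-" + transformName
}

object Binarizer extends NamedTransform {
  val transformName = "binarize"

  def binarize(feature: IntFeature, threshold: Double): BooleanFeature =
    BooleanFeature(name(feature), feature.value > threshold)
}

// Made-up input feature for illustration.
val binarized = Binarizer.binarize(IntFeature("squawk-length", 120), 100.0)
```

The trade-off is a little more boilerplate per transform in exchange for names that do not change with the call depth.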

Page 26: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Lineages

"cleaned-normalized-categorized-interactions"

interactions → categorized → normalized → cleaned
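A name built this way can be read back as a lineage. The following sketch assumes that neither the raw feature name nor any transform name contains a hyphen; the `lineage` helper is hypothetical and not part of the deck.

```scala
// Recover the ordered steps from a hyphen-delimited feature name.
// The most recent transform is the leftmost segment, so the list is reversed
// to put the raw feature first.
def lineage(featureName: String): List[String] =
  featureName.split("-").toList.reverse

val steps = lineage("cleaned-normalized-categorized-interactions")
```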

Page 27: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

PIPELINES

Page 28: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Feature Generation

Raw Data → Feature Generation Pipeline → Features

Page 29: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Multi-Stage Generation

Raw Data → Extract → Extracted → Transform → Features

Page 30: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Pipeline Composition

def extract(rawSquawks: RDD[JsonDocument]): RDD[IntFeature] = {
  ???
}

def transform(inputFeatures: RDD[Feature[Int]]): RDD[BooleanFeature] = {
  ???
}

val trainableFeatures = transform(extract(rawSquawks))
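The composition idea can be sketched without a cluster by letting `Seq` stand in for `RDD`. Everything specific below is an assumption made for illustration: the `RawSquawk` stand-in for a JSON document, the length-based extraction, and the threshold of 10.

```scala
case class IntFeature(name: String, value: Int)
case class BooleanFeature(name: String, value: Boolean)

// Stand-in for a raw JSON document; hypothetical, for illustration only.
case class RawSquawk(text: String)

// Extraction stage: raw data in, typed features out.
val extract: Seq[RawSquawk] => Seq[IntFeature] =
  _.map(squawk => IntFeature("squawk-length", squawk.text.length))

// Transformation stage: features in, derived features out.
val transform: Seq[IntFeature] => Seq[BooleanFeature] =
  _.map(f => BooleanFeature("binarized-" + f.name, f.value > 10))

// The pipeline is just the composition of its stages.
val pipeline: Seq[RawSquawk] => Seq[BooleanFeature] = extract andThen transform

val trainableFeatures =
  pipeline(Seq(RawSquawk("hi"), RawSquawk("a much longer squawk")))
```

Because each stage is an ordinary function, adding a stage is another `andThen` rather than a change to an orchestration layer.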

Page 31: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Feature Generation

Raw Data → Feature Generation Pipeline → Features

Page 32: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Pipelines

Don’t orchestrate when you can compose

Page 33: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Pipeline Failure

Raw Data → Feature Generation Pipeline → Features

Raw Data → Feature Generation Pipeline → Features

Page 34: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Supervising Feature Generation

Raw Data → Feature Generation Pipeline → Features

Supervision

Page 35: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Supervising Feature Generation

Raw Data → Feature Generation Pipeline → Features

Reactive DB Drivers Cluster Managers Feature Validation

Page 36: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Reactive Database Drivers

Couchbase MongoDB Cassandra

Page 37: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Reactive Database Drivers

val rawSquawks: RDD[JsonDocument] = sc.couchbaseView(
    ViewQuery.from("squawks", "by_squawk_id"))
  .map(_.id)
  .couchbaseGet[JsonDocument]()

Page 38: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Cluster Managers

Spark Standalone Mesos YARN

Page 39: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Supervising Feature Generation

Raw Data → Feature Generation Pipeline → Features

Reactive DB Drivers Cluster Managers Feature Validation

Page 40: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

FEATURE COLLECTIONS

Page 41: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Original Features

object SquawkLength extends FeatureType[Int]

object Super extends LabelType[Boolean]

val originalFeatures: Set[FeatureType[_]] = Set(SquawkLength)
val label = Super

Page 42: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Basic Features

object PastSquawks extends FeatureType[Int]

val basicFeatures = originalFeatures + PastSquawks

Page 43: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

More Features

object MobileSquawker extends FeatureType[Boolean]

val moreFeatures = basicFeatures + MobileSquawker

Page 44: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Feature Collections

case class FeatureCollection(id: Int,
                             createdAt: DateTime,
                             features: Set[_ <: FeatureType[_]],
                             label: LabelType[_])

Page 45: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Feature Collections

val earlierCollection = FeatureCollection(101, earlier, basicFeatures, label)

val latestCollection = FeatureCollection(202, now, moreFeatures, label)

val featureCollections = sc.parallelize( Seq(earlierCollection, latestCollection))

Page 46: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Feature Generation

Raw Data → Feature Generation Pipeline → Features

Page 47: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Fallback Collections

val FallbackCollection = FeatureCollection(404, beginningOfTime, originalFeatures, label)

Page 48: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Fallback Collections

def validCollection(collections: RDD[FeatureCollection],
                    invalidFeatures: Set[FeatureType[_]]) = {
  val validCollections = collections.filter(
      fc => !fc.features.exists(invalidFeatures.contains))
    .sortBy(collection => collection.id)
  if (validCollections.count() > 0) {
    validCollections.first()
  } else FallbackCollection
}
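The fallback logic can be exercised on plain Scala collections. The sketch below makes several simplifying assumptions: `Seq` replaces `RDD`, the `createdAt` and `label` fields are dropped from `FeatureCollection`, and the feature objects are given made-up `name` values so that they compile on their own.

```scala
trait FeatureType[V] { val name: String }

// Feature types from the earlier slides; the name strings are assumptions.
object SquawkLength extends FeatureType[Int] { val name = "squawk-length" }
object PastSquawks extends FeatureType[Int] { val name = "past-squawks" }
object MobileSquawker extends FeatureType[Boolean] { val name = "mobile-squawker" }

// Simplified collection: createdAt and label are omitted to keep the sketch short.
case class FeatureCollection(id: Int, features: Set[FeatureType[_]])

val originalFeatures: Set[FeatureType[_]] = Set(SquawkLength)
val basicFeatures = originalFeatures + PastSquawks
val moreFeatures = basicFeatures + MobileSquawker

val fallbackCollection = FeatureCollection(404, originalFeatures)

// Keep only collections that contain no invalid feature, take the one with
// the lowest id (mirroring the slide's ascending sort), else fall back.
def validCollection(collections: Seq[FeatureCollection],
                    invalidFeatures: Set[FeatureType[_]]): FeatureCollection =
  collections
    .filter(fc => !fc.features.exists(f => invalidFeatures.contains(f)))
    .sortBy(_.id)
    .headOption
    .getOrElse(fallbackCollection)

val collections = Seq(
  FeatureCollection(202, moreFeatures),
  FeatureCollection(101, basicFeatures))
```

If `MobileSquawker` is flagged as invalid, only collection 101 survives the filter; if `PastSquawks` is flagged, both collections are rejected and the fallback with id 404 is returned.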

Page 49: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

VALIDATING FEATURES

Page 50: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Supervising Feature Generation

Raw Data → Feature Generation Pipeline → Features

Reactive DB Drivers Cluster Managers Feature Validation

Page 51: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Predicting Super Squawkers

Page 52: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Training Instances

val instances = Seq(
  (123, Vectors.dense(0.2, 0.3, 16.2, 1.1), 0.0),
  (456, Vectors.dense(0.1, 1.3, 11.3, 1.2), 1.0),
  (789, Vectors.dense(1.2, 0.8, 14.5, 0.5), 0.0))

Page 53: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Selection Setup

val featuresName = "features"
val labelName = "isSuper"

val instancesDF = sqlContext.createDataFrame(instances)
  .toDF("id", featuresName, labelName)

val K = 2

Page 54: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Feature Selection

val selector = new ChiSqSelector()
  .setNumTopFeatures(K)
  .setFeaturesCol(featuresName)
  .setLabelCol(labelName)
  .setOutputCol("selectedFeatures")

Page 55: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Feature Selection

val selector = new ChiSqSelector()
  .setNumTopFeatures(K)
  .setFeaturesCol(featuresName)
  .setLabelCol(labelName)
  .setOutputCol("selectedFeatures")

val selectedFeatures = selector.fit(instancesDF)
  .transform(instancesDF)

Page 56: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Back to RDDs

val labeledPoints = sc.parallelize(instances.map({
  case (id, features, label) =>
    LabeledPoint(label = label, features = features)
}))

Page 57: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Validating Features

def validateSelection(labeledPoints: RDD[LabeledPoint],
                      topK: Int,
                      cutoff: Double) = {
  val pValues = Statistics.chiSqTest(labeledPoints)
    .map(result => result.pValue)
    .sorted
  pValues(topK) < cutoff
}
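The decision rule at the end of `validateSelection` can be looked at in isolation. The sketch below applies it to a plain sequence of p-values instead of chi-squared test results from Spark; `passesCutoff` is a hypothetical name and the p-values are made up. Note that the sorted values are zero-indexed, so `sorted(topK)` reads the (topK + 1)-th smallest p-value.

```scala
// Decision rule from the slide, applied to plain p-values.
def passesCutoff(pValues: Seq[Double], topK: Int, cutoff: Double): Boolean = {
  val sorted = pValues.sorted
  // Mirrors the slide's zero-based lookup pValues(topK);
  // requires topK < pValues.length.
  sorted(topK) < cutoff
}

// Made-up p-values for four features.
val examplePValues = Seq(0.2, 0.01, 0.03, 0.5)

val strictResult = passesCutoff(examplePValues, topK = 2, cutoff = 0.05)
val lenientResult = passesCutoff(examplePValues, topK = 2, cutoff = 0.25)
```

With these values the sorted list is 0.01, 0.03, 0.2, 0.5, so the value at index 2 is 0.2: the check fails at a cutoff of 0.05 and passes at 0.25.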

Page 58: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Persisting Validations

case class ValidatedFeatureCollection(id: Int,
                                      createdAt: DateTime,
                                      features: Set[_ <: FeatureType[_]],
                                      label: LabelType[_],
                                      passedValidation: Boolean,
                                      cutoff: Double)

Page 59: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

SUMMARY

Page 60: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Machine Learning Systems

Page 61: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Traits of Reactive Systems

Page 62: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Reactive Strategies

Page 63: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Machine Learning Data

Page 64: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Feature Generation

Raw Data → Feature Generation Pipeline → Features

Page 65: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Reactive Feature Generation with Spark and MLlib

reactivemachinelearning.com @jeffksmithjr @xdotai

