USF Seminar Series: Apache Spark, Machine Learning, Recommendations Feb 05 2016

Date post: 15-Apr-2017
Upload: chris-fregly
Transcript
Page 1: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc

Spark and Recommendations

Spark, Streaming, Machine Learning, Graph Processing, Approximations, Probabilistic Data Structures, NLP

USF Seminar Series Thanks, USF!!

Feb 5th, 2016

Chris Fregly Principal Data Solutions Engineer

We’re Hiring! (Only Nice People) advancedspark.com

Page 2: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

Who Am I?

Streaming Data Engineer Netflix OSS Committer

Data Solutions Engineer

Apache Contributor

Principal Data Solutions Engineer IBM Technology Center

Meetup Organizer Advanced Apache Spark Meetup

Book Author Advanced .

Due 2016

Page 3: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

Advanced Apache Spark Meetup
http://advancedspark.com

Meetup Metrics
Top 5 Most-active Spark Meetup!
2400+ Members in just 6 mos!!
2500+ Docker image downloads

Meetup Mission
Deep-dive into Spark and related open source projects
Surface key patterns and idioms
Focus on distributed systems, scale, and performance

Page 4: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

Live, Interactive Demo!! Audience Participation Required

(cell phone or laptop)

Page 5: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

demo.advancedspark.com

[Diagram: demo pipeline with labeled components End User, Kafka, ElasticSearch, Spark Streaming, Spark ML, Cassandra, Redis, Data Scientist, Zeppelin, iPython]

Page 6: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

Presentation Outline
Scaling with Parallelism and Composability
Similarity and Recommendations
When to Approximate
Common Algorithms and Data Structures
Common Libraries and Tools

Page 7: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

Scaling with Parallelism

Peter O(log n)

O(log n)

Page 8: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

Scaling with Composability

Max (a max b max c max d) == (a max b) max (c max d)

Set Union (a U b U c U d) == (a U b) U (c U d)

Addition (a + b + c + d) == (a + b) + (c + d)

Multiply (a * b * c * d) == (a * b) * (c * d)

Division??

Page 9: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

What about Division?
Division
(a / b / c / d) != (a / b) / (c / d)
(3 / 4 / 7 / 8) != (3 / 4) / (7 / 8)
(((3 / 4) / 7) / 8) != ((3 * 8) / (4 * 7))
0.0134 != 0.857

What were the Egyptians thinking?! Not Composable

“Divide like an Egyptian”

Page 10: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

What about Average?

AVG (3, 5, 5, 7) == 5

Overall AVG over [value, count] pairs [3, 1], [5, 1], [5, 1], [7, 1]:
((3 + 5) + (5 + 7)) / ((1 + 1) + (1 + 1)) == 20 / 4 == 5
Add, Add, Add? Composable!
Single Divide at the End? Doesn’t need to be Composable!

Pairwise AVG:
(3 + 5) / 2 + (5 + 7) / 2 == 8 / 2 + 12 / 2 == 20 / 2 == 10 != 5
Divide, Add, Divide? Not Composable
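
The composability argument on these slides can be sketched in a few lines of Python. This is only an illustration under assumptions: the helper names `to_pair` and `combine` are hypothetical and not from the talk. Each value is carried as a (sum, count) pair, pairs are combined in any grouping, and the division happens once at the end.

```python
from functools import reduce

def to_pair(value):
    # Represent each value as (running sum, running count).
    return (value, 1)

def combine(a, b):
    # Addition is associative, so pairs can be combined in any grouping.
    return (a[0] + b[0], a[1] + b[1])

values = [3, 5, 5, 7]

# Combine two partitions independently, then combine the partial results.
left = reduce(combine, map(to_pair, values[:2]))
right = reduce(combine, map(to_pair, values[2:]))
total, count = combine(left, right)
composed_avg = total / count  # single divide at the end

# Averaging each partition first and then adding does not compose.
pairwise = sum(values[:2]) / 2 + sum(values[2:]) / 2

# Division itself is not associative: regrouping changes the answer.
left_to_right = ((3 / 4) / 7) / 8
regrouped = (3 / 4) / (7 / 8)
```

Because `combine` only adds, the two partitions could be reduced on different machines and merged later, which is the property the slides call composable.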

Page 11: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

Presentation Outline
Scaling with Parallelism and Composability
Similarity and Recommendations
When to Approximate
Common Algorithms and Data Structures
Common Libraries and Tools

Page 12: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

Similarity

Page 13: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

Euclidean Similarity
Exists in Euclidean, flat space
Based on Euclidean distance
Linear measure
Bias towards magnitude

Page 14: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

Cosine Similarity
Angular measure
Adjusts for Euclidean magnitude bias
Normalizes to unit vectors
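
To make the magnitude bias concrete, here is a small sketch comparing the two measures. The rating vectors and function names are made up for illustration. Two users with the same taste but different rating scales are far apart in Euclidean distance, while their cosine similarity is essentially 1.

```python
import math

def euclidean_distance(a, b):
    # Straight-line distance; grows with the magnitude of the vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    # Dot product divided by both vector lengths, so only the angle matters.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical users: same preferences, very different magnitudes.
light_rater = [1.0, 2.0, 1.0]
heavy_rater = [10.0, 20.0, 10.0]

distance = euclidean_distance(light_rater, heavy_rater)
similarity = cosine_similarity(light_rater, heavy_rater)
```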

Page 15: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

Jaccard Similarity
Set similarity measurement
Set intersection / set union
Based on Jaccard distance
Bias towards popularity
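
As a quick sketch of the intersection-over-union definition, assuming each user is represented by the set of movies they watched (the users and sets below are invented for illustration):

```python
def jaccard_similarity(a, b):
    # Size of the intersection divided by size of the union.
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets are identical
    return len(a & b) / len(a | b)

def jaccard_distance(a, b):
    return 1.0 - jaccard_similarity(a, b)

alice = {"Top Gun", "Taps", "A Few Good Men"}
bob = {"Top Gun", "A Few Good Men", "Cocktail"}

similarity = jaccard_similarity(alice, bob)  # 2 shared out of 4 total
```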

Page 16: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

Log Likelihood Similarity
Adjusts for popularity bias
Netflix “Shawshank” problem

Page 17: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

Word Similarity
Edit Distance
Calculate char differences between words
Deletes, transposes, replaces, inserts
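
One way to count those four operations is the optimal string alignment variant of Damerau-Levenshtein distance. The sketch below is an assumption about which variant fits the slide, not necessarily the one used in the talk.

```python
def edit_distance(a, b):
    # d[i][j] = edits needed to turn the first i chars of a into the first j chars of b.
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # delete
                d[i][j - 1] + 1,         # insert
                d[i - 1][j - 1] + cost,  # replace (or match)
            )
            # Transpose two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]
```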

Page 18: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

Document Similarity
TF/IDF
Term Freq / Inverse Document Freq
Used by most search engines

Word2Vec
Words embedded in vector space, near similar words
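
A minimal TF/IDF sketch follows. Many weighting variants exist; this one assumes term frequency is count divided by document length and inverse document frequency is log(N / document frequency). The tiny corpus is invented for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    # docs: list of token lists. Returns one {term: weight} dict per document.
    n = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

docs = [
    ["top", "gun", "pilot"],
    ["top", "chef", "cooking"],
    ["top", "gear", "cars"],
]
vectors = tf_idf(docs)
```

A term that appears in every document ("top") gets weight 0, while a term unique to one document ("gun") gets a positive weight.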

Page 19: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

Similarity Pathway
i.e. closest recommendations between 2 people

Page 20: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

Calculating Similarity
Exact Brute-Force
“All-pairs similarity” aka “Pair-wise similarity”, “Similarity join”
Cartesian O(n^2) shuffle and comparison

Approximate
Sampling
Bucketing (aka “Partitioning”, “Clustering”)
Remove data with low probability of similarity
Reduce shuffle and comparisons

Page 21: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

Bonus: Document Summary
Text Rank aka “Sentence Rank”
TF/IDF + Similarity Graph + PageRank

Intuition
Surface summary sentences (abstract)
Most similar to all others (TF/IDF + Similarity Graph)
Most influential sentences (PageRank)

Page 22: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

Similarity Graph
Vertex is movie, tag, actor, plot summary, etc.
Edges are relationships and weights

Page 23: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

Topic-Sensitive PageRank
Graph diffusion algorithm
Pre-process graph, add vector of probabilities to each vertex
Probability of ending up at this vertex from every other vertex

Page 24: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

Recommendations

Page 25: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

Basic Terminology
User: User seeking recommendations
Item: Item being recommended
Explicit User Feedback: like or rating
Implicit User Feedback: search, click, hover, view, scroll
Instances: Rows of user feedback/input data
Overfitting: Training a model too closely to the training data & hyperparameters
Hold Out Split: Holding out some of the instances to avoid overfitting
Features: Columns of instance rows (of feedback/input data)
Cold Start Problem: Not enough data to personalize (new)
Hyperparameter: Model-specific config knobs for tuning (tree depth, iterations)
Model Evaluation: Compare predictions to actual values of hold out split
Feature Engineering: Modify, reduce, combine features

Page 26: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

Feature Engineering
Dimension Reduction
Reduce number of features (aka “feature space”)

Principal Component Analysis (PCA)
Find principal features that describe the data in terms of variance
Peel the dimensional layers back until you describe the data

Example: One-Hot Encoding
Convert categorical feature values to 0’s, 1’s
Remove any hint of a relationship between the categories
1 binary column per category

Bears    -> 1         Bears    -> [1,0,0]
49’ers   -> 2   -->   49’ers   -> [0,1,0]
Steelers -> 3         Steelers -> [0,0,1]
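
The one-hot example can be sketched without any library. The function name `one_hot_encode` is hypothetical, and the column order is assumed to follow first appearance in the input.

```python
def one_hot_encode(values):
    # Assign one column per distinct category, in order of first appearance.
    categories = []
    for value in values:
        if value not in categories:
            categories.append(value)
    index = {category: i for i, category in enumerate(categories)}

    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # exactly one binary column is set per row
        encoded.append(row)
    return categories, encoded

teams = ["Bears", "49ers", "Steelers", "Bears"]
categories, encoded = one_hot_encode(teams)
```

Unlike the integer codes 1, 2, 3, the encoded rows carry no implied ordering or distance between teams.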

Page 27: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

Features

Binary Features: True or False

Numeric Discrete Features: Integers

Numeric Features: Real values

Ordinal Features: Maintains order (S -> M -> L -> XL -> XXL)

Temporal Features: Time-based (Time of Day, Binge Watching)

Categorical Features: Finite, unique set of categories (NFL teams)

Page 28: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

Non-Personalized Recommendations

Page 29: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

Cold Start Problem
“Cold Start” problem
New user, don’t know their preferences, must show them something!

Movies with highest-rated actors
Top K Aggregations

Most desirable singles
PageRank of like activity

Facebook social graph
Recommend friend activity

Page 30: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

Personalized Recommendations

Page 31: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

Clustering (aka Nearest Neighbors)
User-to-User Clustering
Similar movies watched or rated
Similar viewing pattern (i.e. binge or casual)

Item-to-Item Clustering
Similar tags/genres on movies
Similar textual description (TF/IDF, Word2Vec, NLP, Image)

http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html
My OKCupid Profile
My Hinge Profile

Page 32: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

User-to-Item Collaborative Filtering
Matrix Factorization
① Factor the large matrix (left) into 2 smaller matrices (right)
② Fill in the missing values within the large matrix
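
The two steps can be illustrated with a tiny pure-Python factorization. Note the assumptions: this sketch uses plain stochastic gradient descent rather than the ALS solver that Spark ML provides, and the ratings, function names, and parameter values are all invented for illustration.

```python
import random

def factorize(ratings, num_factors=2, steps=2000, learning_rate=0.01, reg=0.02, seed=42):
    # ratings: {(user, item): rating} holding only the observed cells of the large matrix.
    rng = random.Random(seed)
    users = sorted({u for u, _ in ratings})
    items = sorted({i for _, i in ratings})
    user_f = {u: [rng.random() * 0.1 for _ in range(num_factors)] for u in users}
    item_f = {i: [rng.random() * 0.1 for _ in range(num_factors)] for i in items}
    for _ in range(steps):
        for (u, i), rating in ratings.items():
            err = rating - predict(user_f, item_f, u, i)
            for k in range(num_factors):
                uk, ik = user_f[u][k], item_f[i][k]
                # Step 1: nudge both small matrices toward the observed rating.
                user_f[u][k] += learning_rate * (err * ik - reg * uk)
                item_f[i][k] += learning_rate * (err * uk - reg * ik)
    return user_f, item_f

def predict(user_f, item_f, user, item):
    # Step 2: any cell, observed or missing, is the dot product of its two factor rows.
    return sum(a * b for a, b in zip(user_f[user], item_f[item]))

def squared_error(ratings, user_f, item_f):
    return sum((r - predict(user_f, item_f, u, i)) ** 2 for (u, i), r in ratings.items())

ratings = {
    ("alice", "Top Gun"): 5.0, ("alice", "Taps"): 4.0,
    ("bob", "Top Gun"): 4.0, ("bob", "A Few Good Men"): 5.0,
    ("carol", "Taps"): 1.0, ("carol", "A Few Good Men"): 2.0,
}
untrained = factorize(ratings, steps=0)
trained = factorize(ratings)
missing_cell = predict(trained[0], trained[1], "alice", "A Few Good Men")
```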

Page 33: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

Item-to-Item Collaborative Filtering
Made famous by Amazon Paper ~2003

Problem
As # of users grew, Matrix Factorization couldn’t scale

Solution
Offline/Batch
Generate itemId -> List[customerId] vectors

Online/Real-time
For each item in cart, recommend similar items from vector space
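
The offline and online halves can be sketched as follows. This is a simplified illustration, not the published Amazon algorithm: it assumes cosine similarity over binary customer sets, and the purchase data and function names are hypothetical.

```python
import math
from collections import defaultdict

def build_item_vectors(purchases):
    # Offline/batch: itemId -> set of customerIds who bought it.
    item_to_customers = defaultdict(set)
    for customer_id, item_id in purchases:
        item_to_customers[item_id].add(customer_id)
    return item_to_customers

def item_similarity(customers_a, customers_b):
    # Cosine similarity between two binary customer vectors.
    if not customers_a or not customers_b:
        return 0.0
    shared = len(customers_a & customers_b)
    return shared / math.sqrt(len(customers_a) * len(customers_b))

def recommend(cart, item_to_customers, top_n=2):
    # Online/real-time: score every non-cart item against each cart item.
    scores = defaultdict(float)
    for cart_item in cart:
        cart_customers = item_to_customers.get(cart_item, set())
        for other, customers in item_to_customers.items():
            if other not in cart:
                scores[other] += item_similarity(cart_customers, customers)
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, score in ranked[:top_n] if score > 0]

purchases = [
    ("c1", "Top Gun"), ("c1", "Taps"),
    ("c2", "Top Gun"), ("c2", "Taps"), ("c2", "A Few Good Men"),
    ("c3", "A Few Good Men"), ("c3", "Cocktail"),
]
vectors = build_item_vectors(purchases)
suggestions = recommend(["Top Gun"], vectors)
```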

Page 34: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

Presentation Outline
Scaling with Parallelism and Composability
Similarity and Recommendations
When to Approximate
Common Algorithms and Data Structures
Common Libraries and Tools

Page 35: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

When to Approximate?
Memory or time constrained queries
Relative vs. exact counts are OK (# errors between then and now)

Using machine learning or graph algos
Inherently probabilistic and approximate
Finding topics in documents (LDA)
Finding similar pairs of users, items, words at scale (LSH)
Finding top influencers (PageRank)

Streaming aggregations (distinct count or top k)
Inherently sloppy means of collecting (at least once delivery)

Approximate as much as you can get away with!
Ask for forgiveness later !!

Page 36: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

When NOT to Approximate?
If you’ve ever heard the term…

“Sarbanes-Oxley”

…in-that-order, at the office, after 2002.

Page 37: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

Presentation Outline
Scaling with Parallelism and Composability
Similarity and Recommendations
When to Approximate
Common Algorithms and Data Structures
Common Libraries and Tools

Page 38: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

A Few Good Algorithms

You can’t handle the approximate!

Page 39: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

Common to These Algos & Data Structs
Low, fixed size in memory
Known error bounds
Store large amount of data
Less memory than Java/Scala collections
Tunable tradeoff between size and error
Rely on multiple hash functions or operations
Size of hash range defines error

Page 40: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

Bloom Filter
Set.contains(key): Boolean

“Hash Multiple Times and Flip the Bits Wherever You Land”

Page 41: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

Bloom Filter
Approximate set membership for key
False positive: filter reports contains(), actual !contains()
True negative: filter reports !contains(), actual !contains()

Elements only added, never removed

Page 42: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

Bloom Filter in Action

set(key)
contains(key): Boolean

TRUE -> maybe contains
FALSE -> definitely does not contain

Images by @avibryant
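
"Hash multiple times and flip the bits wherever you land" can be sketched as below. This is an illustrative toy, not the Algebird implementation: it assumes salted SHA-256 stands in for independent hash functions, and the default sizes are arbitrary.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits  # fixed size, regardless of how many keys are added

    def _positions(self, key):
        # Simulate several hash functions by salting one hash with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for position in self._positions(key):
            self.bits[position] = True

    def contains(self, key):
        # True means "maybe contains"; False means "definitely does not contain".
        return all(self.bits[position] for position in self._positions(key))

viewers = BloomFilter()
viewers.add("user1001")
viewers.add("user2009")
```

Bits are only ever set, never cleared, which is why elements can be added but not removed.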

Page 43: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

CountMin Sketch
Frequency Count and TopK

“Hash Multiple Times and Add 1 Wherever You Land”

Page 44: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

CountMin Sketch (CMS)
Approximate frequency count and TopK for key
i.e. “Heavy Hitters” on Twitter

Johnny Hallyday, Martin Odersky, Donald Trump

Page 45: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

CountMin Sketch In Action
Use Case: TopK movies using total views

add(Top Gun, 2)
add(A Few Good Men, 1)
add(Taps, 1)
getCount(Top Gun): Long

Multiple hash functions (1 hash function per row)
Binary hash output (1 element per column)
2 occurrences of “Top Gun” for slightly additional complexity
Overlap: Top Gun, A Few Good Men
Find minimum of all rows
Can overestimate, but never underestimate

Images derived from @avibryant
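
The add and getCount operations from the slide can be sketched like this. It is a toy under assumptions: salted SHA-256 stands in for one hash function per row, and the `depth` and `width` defaults are arbitrary rather than derived from a target error bound.

```python
import hashlib

class CountMinSketch:
    def __init__(self, depth=4, width=64):
        self.depth = depth  # number of rows, one hash function per row
        self.width = width  # number of counters per row
        self.table = [[0] * width for _ in range(depth)]

    def _column(self, row, key):
        digest = hashlib.sha256(f"{row}:{key}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, key, count=1):
        # Hash once per row and add the count wherever the key lands.
        for row in range(self.depth):
            self.table[row][self._column(row, key)] += count

    def get_count(self, key):
        # Collisions only inflate counters, so the minimum across rows is the best estimate.
        return min(self.table[row][self._column(row, key)] for row in range(self.depth))

views = CountMinSketch()
views.add("Top Gun", 2)
views.add("A Few Good Men", 1)
views.add("Taps", 1)
```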

Page 46: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

HyperLogLog
Count Distinct

“Hash Multiple Times and Uniformly Distribute Where You Land”

Page 47: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

HyperLogLog (HLL)
Approximate count distinct
Slight twist
Special hash function creates uniform distribution

Error estimate
14 bits for size of range
m = 2^14 = 16,384 slots
error = 1.04 / sqrt(16,384) = 0.81%

Not many of these

Page 48: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

HyperLogLog In Action
Use Case: Distinct number of views per movie

[Figure: user IDs (user1001 ... user8002) hashed onto the ranges 0 to 16 and 0 to 32 for “Top Gun: Hour 1” and “Top Gun: Hour 2”]

Uniform Distribution: Estimate distinct # of users by inspecting just the beginning
Composable: Hour 1 + 2 (lose a bit of precision)
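
The error formula and the Hour 1 + Hour 2 merge can be sketched with a simplified HyperLogLog. This is an illustrative toy, not the Redis or Algebird implementation: it assumes a 64-bit slice of SHA-256 as the uniform hash and includes only the basic small-range correction.

```python
import hashlib
import math

class HyperLogLog:
    def __init__(self, precision=10):
        self.precision = precision
        self.m = 1 << precision  # number of registers (slots)
        self.registers = [0] * self.m

    def add(self, key):
        digest = hashlib.sha256(str(key).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # 64-bit hash value
        remaining_bits = 64 - self.precision
        index = h >> remaining_bits            # first bits pick the register
        rest = h & ((1 << remaining_bits) - 1)
        rank = remaining_bits - rest.bit_length() + 1  # position of the first 1-bit
        self.registers[index] = max(self.registers[index], rank)

    def merge(self, other):
        # Composable: the union of two sketches is the register-wise maximum.
        merged = HyperLogLog(self.precision)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)  # small-range correction
        return raw

def standard_error(precision):
    # 1.04 / sqrt(m), the estimate quoted on the slide for m = 2^14.
    return 1.04 / math.sqrt(1 << precision)

hour1, hour2, both_hours = HyperLogLog(), HyperLogLog(), HyperLogLog()
for i in range(500):
    hour1.add(f"user{i}")
for i in range(300, 1000):
    hour2.add(f"user{i}")
for i in range(1000):
    both_hours.add(f"user{i}")
merged = hour1.merge(hour2)
```

Merging the two hourly sketches gives exactly the same registers as sketching all users at once, so the overlap between hours is not double counted.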

Page 49: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

Locality Sensitive Hashing
Set Similarity

“Pre-process Items into Buckets, Compare Within Buckets”

Page 50: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

Locality Sensitive Hashing (LSH)
Approximate set similarity
Hash designed to cluster similar items
Avoids cartesian all-pairs comparison
Pre-process m rows into b buckets
b << m

Hash items multiple times
Similar items hash to overlapping buckets
Compare just contents of buckets
Much smaller cartesian … and parallel !!
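
One common way to realize this bucketing for set similarity is MinHash signatures with banding. The sketch below assumes that technique (the slide does not name a specific LSH family), and the function names, parameters, and sample data are hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

def _hash(seed, value):
    digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def minhash_signature(items, num_hashes=12):
    # Hash items multiple times; keep the minimum per hash function.
    # Assumes items is a non-empty set.
    return [min(_hash(seed, item) for item in items) for seed in range(num_hashes)]

def candidate_pairs(named_sets, num_hashes=12, bands=4):
    # Split each signature into bands; sets sharing any band land in the same bucket.
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for name, items in named_sets.items():
        signature = minhash_signature(items, num_hashes)
        for band in range(bands):
            key = (band, tuple(signature[band * rows:(band + 1) * rows]))
            buckets[key].add(name)
    # Compare only within buckets instead of across all pairs.
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs

viewers = {
    "alice": {"Top Gun", "Taps", "A Few Good Men"},
    "alice_twin": {"Top Gun", "Taps", "A Few Good Men"},
    "carol": {"Cocktail", "Risky Business"},
}
pairs = candidate_pairs(viewers)
```

Only the candidate pairs need an exact Jaccard comparison afterwards, which is where the smaller cartesian product comes from.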

Page 51: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

DIMSUM
Set Similarity

“Pre-process Items into Buckets, Compare Within Buckets”

Page 52: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

DIMSUM
“Dimension Independent Matrix Square Using MR”
Remove vectors with low probability of similarity
RowMatrix.columnSimilarities(threshold)

Twitter DIMSUM Case Study
40% efficiency gain over brute-force cosine sim

Page 53: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

Presentation Outline
Scaling with Parallelism and Composability
Similarity and Recommendations
When to Approximate
Common Algorithms and Data Structures
Common Libraries and Tools

Page 54: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

Common Tools to Approximate

Twitter Algebird: Composable Library
Redis: Distributed Cache
Apache Spark: Big Data Processing

Page 55: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

Twitter Algebird
Rooted in Algebraic Fundamentals!
Parallel
Associative
Composable

Examples
Min, Max, Avg
BloomFilter (Set.contains(key))
HyperLogLog (Count Distinct)
CountMin Sketch (TopK Count)

Page 56: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

Redis
Implementation of HyperLogLog (Count Distinct)
12KB per item count
2^64 max # of items
0.81% error (Tunable)

Add user views for given movie (ignore duplicates)
PFADD TopGun_HLL user1001 user2009 user3005
PFADD TopGun_HLL user3003 user1001

Get distinct count (cardinality) of set
PFCOUNT TopGun_HLL
Returns: 4 (distinct users viewed this movie)

Union 2 HyperLogLog Data Structures
PFMERGE TopGun_HLL Taps_HLL

Page 57: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

Spark Approximations
Spark Core
RDD.count*Approx() (PartialResult, HyperLogLogPlus)

Spark SQL
approxCountDistinct(column)

Spark ML
Stratified sampling
PairRDD.sampleByKey(withReplacement, fractions: Map[K, Double])
DIMSUM sampling
Probabilistic sampling reduces amount of comparison shuffle
RowMatrix.columnSimilarities(threshold)

Spark Streaming
A/B testing
StreamingTest.setTestMethod(“welch”).registerStream(dstream)

Page 58: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

Demos!

Page 59: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

Counting
Exact Count vs. Approx HyperLogLog, CountMin Sketch

Page 60: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

HashSet vs. HyperLogLog

Page 61: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

HashSet vs. CountMin Sketch

Page 62: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

Set Similarity
Exact Jaccard Similarity vs. Approx Locality Sensitive Hashing

Page 63: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

Brute Force Cartesian All Pair Similarity

90 mins!

Page 64: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

All Pairs & Locality Sensitive Hashing

<< 90 mins!

Page 65: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

Many More Demos Available! http://advancedspark.com

Download Docker or Clone Github

Page 66: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

Bonus: Netflix Recommendations
From Offline DVD Ratings to Real-time Trending Now

Page 67: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

$1 Million Netflix Prize (2006-2009)
Goal
Improve movie predictions by 10% (RMSE)

Dataset
(userId, movieId, rating, timestamp)
Test data withheld to calculate RMSE upon submission

Winning algorithm
10.06% improvement (RMSE)
Ensemble of 500+ ML models
Combined using GBDTs
Computationally impractical
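
For readers unfamiliar with the metric, here is a small sketch of RMSE and of how a relative improvement such as 10% would be computed. The function names and sample numbers are illustrative only and are not Netflix Prize data.

```python
import math

def rmse(predicted, actual):
    # Root mean squared error between predicted and actual ratings.
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))

def improvement(baseline_rmse, new_rmse):
    # Fractional reduction in RMSE relative to the baseline system.
    return (baseline_rmse - new_rmse) / baseline_rmse
```

Under this definition, moving from an RMSE of 1.0 to 0.9 is a 10% improvement.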

Page 68: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

Secret to the Winning Algorithms
Adjust for the following…
Human bias
“Alice effect”: Alice tends to rate lower than average user
“Inception effect”: Inception is rated higher than average
“Alice-Inception effect”: Combo of Alice and Inception
Time-based bias
Number of days since a user’s first rating
Number of days since a movie’s first rating
Number of people who have rated a movie
A movie’s overall mean rating
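
The user and movie effects can be illustrated with a simple baseline predictor: global mean plus a per-user offset plus a per-movie offset. This is a deliberately simplified sketch, not the winning team's model: the offsets are estimated independently as mean deviations, and the ratings are invented.

```python
from collections import defaultdict

def fit_baseline(ratings):
    # ratings: list of (user, movie, rating) tuples.
    mu = sum(r for _, _, r in ratings) / len(ratings)
    user_dev, movie_dev = defaultdict(list), defaultdict(list)
    for user, movie, rating in ratings:
        user_dev[user].append(rating - mu)
        movie_dev[movie].append(rating - mu)
    # Each bias is the average deviation from the global mean.
    user_bias = {u: sum(d) / len(d) for u, d in user_dev.items()}
    movie_bias = {m: sum(d) / len(d) for m, d in movie_dev.items()}
    return mu, user_bias, movie_bias

def predict(mu, user_bias, movie_bias, user, movie):
    # Unknown users or movies fall back to a bias of 0.
    return mu + user_bias.get(user, 0.0) + movie_bias.get(movie, 0.0)

ratings = [
    ("alice", "Inception", 4), ("alice", "Taps", 2),
    ("bob", "Inception", 5), ("bob", "Taps", 5),
]
mu, user_bias, movie_bias = fit_baseline(ratings)
alice_inception = predict(mu, user_bias, movie_bias, "alice", "Inception")
```

Here Alice's negative offset and Inception's positive offset are both applied to the global mean, which is the "Alice-Inception" combination in its simplest form.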

Page 69: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

Current Netflix Recommendations

Throw away offline-generated user factors (U)

Page 70: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

Netflix Common ML Algorithms
Logistic Regression
Linear Regression
Gradient Boosted Decision Trees
Random Forest
Matrix Factorization
SVD
Restricted Boltzmann Machines
Deep Neural Nets
Markov Models
LDA
Clustering
…

Page 71: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

Bonus: Netflix Search
No results? No problem… Show similar results!
Used as implicit feedback for future decision making

Page 72: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

Netflix and Data
Netflix has a lot of data about a lot of users and a lot of movies.
Netflix can use this data to buy new movies.
Netflix is global.
Netflix can use this data to choose original programming.
Netflix knows that a lot of people like Politics and Kevin Spacey.

My favorite movie, “Harold and Kumar Go to White Castle”
The UK doesn’t have any White Castles. So they renamed my favourite movie, “Harold and Kumar Get the Munchies”
(This broke all of my unit tests.)

Summary: Buy NFLX Stock!

Page 73: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

Thank You!!
Chris Fregly @cfregly
IBM Spark Tech Center http://spark.tc
San Francisco, California, USA

http://advancedspark.com
Sign up for the Meetup and Book
Contribute to Github Repo
Run all Demos using Docker

Find me: LinkedIn, Twitter, Github, Email, Fax

Image derived from http://www.duchess-france.org/

Page 74: USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016

advancedspark.com @cfregly

