Boston Spark Meetup May 24, 2016

Date post: 22-Jan-2018
Upload: chris-fregly
Flux Capacitor AI: Bringing AI Back to the Future! advancedspark.com
Transcript
Page 1: Boston Spark Meetup May 24, 2016

Flux Capacitor AI: Bringing AI Back to the Future!

advancedspark.com

Page 2: Boston Spark Meetup May 24, 2016

Who Am I?

Streaming Data Engineer, Netflix OSS Committer
Data Solutions Engineer, Apache Contributor
Principal Data Solutions Engineer, IBM Technology Center
Meetup Organizer, Advanced Apache Spark Meetup
Book Author, Advanced . (Due 2016)

Page 3: Boston Spark Meetup May 24, 2016

Advanced Apache Spark Meetup
http://advancedspark.com

Meetup Metrics
  Top 10 most-active Spark Meetup!
  3200+ members in just 9 months!!
  3700+ Docker downloads (demos)

Meetup Mission
  Code deep-dive into Spark and related open source projects
  Surface key patterns and idioms
  Focus on distributed systems, scale, and performance

Page 4: Boston Spark Meetup May 24, 2016

Live, Interactive Demo!
Audience Participation Required!!
Cell Phone Compatible!!!

demo.advancedspark.com

Page 5: Boston Spark Meetup May 24, 2016

http://demo.advancedspark.com

[Diagram labels: End User, ElasticSearch, Spark ML, Data Scientist; Kafka, Spark Streaming, Cassandra, Redis, Zeppelin, iPython]

Page 6: Boston Spark Meetup May 24, 2016

Presentation Outline
① Scaling
② Similarities
③ Recommendations
④ Approximations
⑤ Netflix Recommendations

Page 7: Boston Spark Meetup May 24, 2016

Scaling with Parallelism

[Figure labels: Peter; O(log n); Worker Nodes]

Page 8: Boston Spark Meetup May 24, 2016

Parallelism with Composability

Max:       (a max b max c max d) == (a max b) max (c max d)
Set Union: (a U b U c U d) == (a U b) U (c U d)
Addition:  (a + b + c + d) == (a + b) + (c + d)
Multiply:  (a * b * c * d) == (a * b) * (c * d)

[Figure labels: Worker 1; Worker 2; Collect at Driver]

What about Division and Average?

Page 9: Boston Spark Meetup May 24, 2016

What about Division?

Division: (a / b / c / d) != (a / b) / (c / d)

(3 / 4 / 7 / 8) != (3 / 4) / (7 / 8)
(((3 / 4) / 7) / 8) != ((3 * 8) / (4 * 7))
0.0134 != 0.857

Not Composable
What were the Egyptians thinking?!
“Divide like an Egyptian”

Page 10: Boston Spark Meetup May 24, 2016

What about Average?

Overall AVG, carrying (value, count) pairs:

  (3, 1) + (5, 1) + (5, 1) + (7, 1) == (3 + 5 + 5 + 7) / (1 + 1 + 1 + 1) == 20 / 4 == 5

  Add, Add, Add? Composable!
  Single-Node Divide at the End? Doesn’t need to be Composable!

Pairwise AVG:

  (3 + 5) / 2 + (5 + 7) / 2 == 8/2 + 12/2 == 20/2 == 10 != 5

  Divide, Add, Divide? Not Composable

AVG (3, 5, 5, 7) == 5
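A minimal Python sketch of the slide's idea: each worker keeps a (sum, count) pair, pairs are merged by addition, and the single division happens at the end. The function names and the two-worker split are illustrative assumptions, not code from the talk.

```python
from functools import reduce

def to_pair(value):
    """Lift a value into a (sum, count) pair."""
    return (value, 1)

def merge(a, b):
    """Associative merge: add the sums and add the counts."""
    return (a[0] + b[0], a[1] + b[1])

def finish(pair):
    """Single divide at the end; this step does not need to compose."""
    total, count = pair
    return total / count

values = [3, 5, 5, 7]

# Simulate two workers, each reducing its own partition independently.
worker1 = reduce(merge, map(to_pair, values[:2]))
worker2 = reduce(merge, map(to_pair, values[2:]))
overall = finish(merge(worker1, worker2))

# Averaging per-partition averages breaks as soon as partitions differ in size.
uneven_parts = [[3], [5, 5, 7]]
avg_of_avgs = sum(sum(p) / len(p) for p in uneven_parts) / len(uneven_parts)
```

Because `merge` is associative, the partitions can be combined in any grouping and still give the same (sum, count) pair.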

Page 11: Boston Spark Meetup May 24, 2016

Presentation Outline
① Scaling
② Similarities
③ Recommendations
④ Approximations
⑤ Netflix Recommendations

Page 12: Boston Spark Meetup May 24, 2016

Similarities

Page 13: Boston Spark Meetup May 24, 2016

Euclidean Similarity
  Exists in Euclidean, flat space
  Based on Euclidean distance
  Linear measure
  Bias towards magnitude

Page 14: Boston Spark Meetup May 24, 2016

Cosine Similarity
  Angular measure
  Adjusts for Euclidean magnitude bias
  Normalize to unit vectors in all dimensions
  Used with real-valued vectors (versus binary)

org.jblas.DoubleMatrix
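To make the magnitude point concrete, here is a small plain-Python sketch (not the jblas code from the demo) comparing Euclidean distance with cosine similarity. The example vectors are made up for illustration.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance; sensitive to vector magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; ignores magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

# Same direction, different magnitude (e.g. a light rater vs. a heavy rater).
light = [1.0, 2.0, 3.0]
heavy = [2.0, 4.0, 6.0]

distance = euclidean_distance(light, heavy)   # non-zero: magnitudes differ
similarity = cosine_similarity(light, heavy)  # close to 1: same direction
```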

Page 15: Boston Spark Meetup May 24, 2016

Jaccard Similarity
  Set similarity measurement
  Set intersection / set union
  Bias towards popularity
  Works with binary vectors
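A minimal sketch of Jaccard similarity over tag sets. The movie tags below are hypothetical examples, not data from the demo.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)

# Hypothetical tag sets for two movies.
movie_a_tags = {"action", "jets", "1980s"}
movie_b_tags = {"action", "jets", "navy", "romance"}

score = jaccard(movie_a_tags, movie_b_tags)  # 2 shared tags out of 5 distinct tags
```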

Page 16: Boston Spark Meetup May 24, 2016

Log Likelihood Similarity
  Adjusts for popularity bias
  Netflix “Shawshank” problem

Page 17: Boston Spark Meetup May 24, 2016

Word Similarity

Edit Distance
  Misspellings and autocorrect

Word2Vec
  Similar words are defined by similar contexts in vector space

[Figure labels: English, Spanish]
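The slide does not say which edit distance is meant; the sketch below assumes the common Levenshtein variant (insert, delete, substitute, each costing 1).

```python
def edit_distance(source, target):
    """Levenshtein distance computed one row at a time."""
    previous = list(range(len(target) + 1))  # distance from "" to each prefix of target
    for i, s_char in enumerate(source, start=1):
        current = [i]  # distance from source[:i] to ""
        for j, t_char in enumerate(target, start=1):
            substitution_cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,                      # delete s_char
                current[j - 1] + 1,                   # insert t_char
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]

# A misspelling one substitution away from the intended word.
typo_distance = edit_distance("sparc", "spark")
```

For autocorrect, candidates within a small distance of the typed word would be the natural suggestions.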

Page 18: Boston Spark Meetup May 24, 2016

Demo!
Find Synonyms with Word2Vec

Page 19: Boston Spark Meetup May 24, 2016

Find Synonyms using Word2Vec

Page 20: Boston Spark Meetup May 24, 2016

Document Similarity

TF/IDF
  Term Freq / Inverse Document Freq
  Used by most search engines

Doc2Vec
  Similar documents are determined by similar contexts

Page 21: Boston Spark Meetup May 24, 2016

Bonus! Text Rank Document Summary

Text Rank (aka Sentence Rank)
  Surface summary sentences
  TF/IDF + Similarity Graph + PageRank

Most similar sentence to all other sentences
  TF/IDF + Similarity Graph

Most influential sentences
  PageRank

Page 22: Boston Spark Meetup May 24, 2016

Similarity Pathways (Recommendations)

Best recommendations for 2 (or more) people
  “You like Mad Max. I like Message in a Bottle. We might like a movie similar to both.”

Item-to-Item Similarity Graph + Dijkstra Heaviest Path

Page 23: Boston Spark Meetup May 24, 2016

Demo!
Similarity Pathway for Movie Recommendations

Page 24: Boston Spark Meetup May 24, 2016

Load Movies with Tags into DataFrame

[Figure labels: My Choice; Their Choice]

Page 25: Boston Spark Meetup May 24, 2016

Item-to-Item Tag Jaccard Similarity
Based on Tags

[Figure labels: Calculate Jaccard Similarity (Tag Set Similarity); Must be Above the Given Jaccard Similarity Threshold]

Page 26: Boston Spark Meetup May 24, 2016

Item-to-Item Tag Similarity Graph

Edge Weights == Jaccard Similarity (Based on Tag Sets)

Page 27: Boston Spark Meetup May 24, 2016

Use Dijkstra to Find Heaviest Pathway

Page 28: Boston Spark Meetup May 24, 2016

Calculating Exact Similarity

Brute-Force Similarity
  Cartesian Product
  O(n^2) shuffle and compute
  aka All-pairs, Pair-wise, Similarity Join

Page 29: Boston Spark Meetup May 24, 2016

Calculating Approximate Similarity
Goal: Reduce Shuffle

Approximate Similarity
  Sampling
  Bucketing or Clustering
  Ignore low-similarity probability

Locality Sensitive Hashing
  Twitter Algebird MinHash

[Figure label: Bucket By Genre]

Page 30: Boston Spark Meetup May 24, 2016

Presentation Outline
① Scaling
② Similarities
③ Recommendations
④ Approximations
⑤ Netflix Recommendations

Page 31: Boston Spark Meetup May 24, 2016

Recommendations

Page 32: Boston Spark Meetup May 24, 2016

Basic Terminology
  User: User seeking recommendations
  Item: Item being recommended
  Explicit User Feedback: user knows they are rating or liking, can choose to dislike
  Implicit User Feedback: user not explicitly aware, cannot dislike (click, hover, etc)
  Instances: Rows of user feedback/input data
  Overfitting: Training a model too closely to the training data & hyperparameters
  Hold Out Split: Holding out some of the instances to avoid overfitting
  Features: Columns of instance rows (of feedback/input data)
  Cold Start Problem: Not enough data to personalize (new)
  Hyperparameter: Model-specific config knobs for tuning (tree depth, iterations)
  Model Evaluation: Compare predictions to actual values of hold out split
  Feature Engineering: Modify, reduce, combine features
  Loss Function: Function we’re trying to minimize such as least-squared error for Linear Regression
  Cross Entropy: Loss function used for classification algorithms such as Logistic Regression
  Optimizer: Technique to optimize loss function such as Stochastic Gradient Descent (SGD)

Page 33: Boston Spark Meetup May 24, 2016

Stochastic Gradient Descent (SGD)

Optimizes Loss Function
  Least Squared Error b/w predicted and actual value
  Cross Entropy Log Likelihood b/w predicted and actual probability

[Figure labels: 2-Dimensional; 3-Dimensional]

Page 34: Boston Spark Meetup May 24, 2016

Features
  Binary: True or False
  Numeric Discrete: Integers
  Numeric: Real Values
  Binning: Convert Continuous into Discrete (Time of Day->Morning, Afternoon)
  Categorical Ordinal: Size (Small->Medium->Large), Ratings (1->5)
  Categorical Nominal: Independent, Favorite Sports Teams, Dating Spots
  Temporal: Time-based, Time of Day, Binge Viewing
  Text: Movie Titles, Genres, Tags, Reviews (Tokenize, Stop Words, Stemming)
  Media: Images, Audio, Video
  Geographic: (Longitude, Latitude), Geohash
  Latent: Hidden Features within Data (Collaborative Filtering)
  Derived: Age of Movie, Duration of User Subscription

Page 35: Boston Spark Meetup May 24, 2016

Feature Engineering

Dimension Reduction
  Reduce number of features in feature space
  Principal Component Analysis (PCA)
    Find principal features that best describe data variance
    Peel dimensional layers back

One-Hot Encoding
  Convert nominal categorical feature values into 0’s and 1’s
  Remove any numerical relationship between categories

  Bears    -> 1         Bears    -> [1.0, 0.0, 0.0]
  49’ers   -> 2   -->   49’ers   -> [0.0, 1.0, 0.0]
  Steelers -> 3         Steelers -> [0.0, 0.0, 1.0]

  Convert Each Item to Binary Vector with Single 1.0 Column
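A plain-Python sketch of the one-hot mapping shown above. The helper name `one_hot_encoder` is an illustrative assumption; the demo itself uses Spark ML, which is not reproduced here.

```python
def one_hot_encoder(categories):
    """Build an encoder that maps each category to a vector with a single 1.0."""
    index = {category: position for position, category in enumerate(categories)}

    def encode(value):
        vector = [0.0] * len(index)
        vector[index[value]] = 1.0  # raises KeyError for unseen categories
        return vector

    return encode

teams = ["Bears", "49'ers", "Steelers"]
encode = one_hot_encoder(teams)
encoded = [encode(team) for team in teams]
```

Every pair of encoded teams is equally far apart, so the ordering implied by the integer codes 1, 2, 3 is gone.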

Page 36: Boston Spark Meetup May 24, 2016

Feature Normalization & Standardization

Goal
  Scale features to standard size
  Prevent boundless features
  Helps avoid overfitting
  Required by many ML algos

Normalize Features
  Calculate L1 (or L2, etc) norm, then divide each element by it

Standardize Features
  Apply standard normal transformation (mean->0, stddev->1)

org.apache.spark.ml.feature.[Normalizer, StandardScaler]

http://www.mathsisfun.com/data/standard-normal-distribution.html
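The two transformations can be sketched in plain Python as follows. This is not the Spark `Normalizer` or `StandardScaler` implementation; in particular, using the population standard deviation here is an assumption made to keep the example short.

```python
import math

def l1_normalize(vector):
    """Divide each element by the L1 norm (sum of absolute values)."""
    norm = sum(abs(x) for x in vector)
    return [x / norm for x in vector] if norm else list(vector)

def l2_normalize(vector):
    """Divide each element by the L2 norm (Euclidean length)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)

def standardize(column):
    """Shift a feature column to mean 0 and scale it to stddev 1."""
    mean = sum(column) / len(column)
    variance = sum((x - mean) ** 2 for x in column) / len(column)  # population variance
    stddev = math.sqrt(variance)
    if stddev == 0:
        return [0.0] * len(column)
    return [(x - mean) / stddev for x in column]

unit_vector = l2_normalize([3.0, 4.0])
z_scores = standardize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

Normalization works per row (one vector at a time), while standardization works per column (one feature across all rows).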

Page 37: Boston Spark Meetup May 24, 2016

Non-Personalized Recommendations

Page 38: Boston Spark Meetup May 24, 2016

Cold Start Problem

“Cold Start” problem
  New user, don’t know their preferences, must show something!

Movies with highest-rated actors
  Top K aggregations

Facebook social graph
  Friend-based recommendations

Most desirable singles
  PageRank of likes and dislikes

Page 39: Boston Spark Meetup May 24, 2016

Demo!
GraphFrame PageRank

Page 40: Boston Spark Meetup May 24, 2016

Example: Dating Site “Like” Graph

Page 41: Boston Spark Meetup May 24, 2016

PageRank of Top Influencers

Page 42: Boston Spark Meetup May 24, 2016

Personalized Recommendations

Page 43: Boston Spark Meetup May 24, 2016

Demo!
Personalized PageRank

Page 44: Boston Spark Meetup May 24, 2016

Personalized PageRank: Outbound Links

0.15 = (1 - 0.85 “Damping Factor”)
  85% Probability: Choose Among Outbound Network
  15% Probability: Choose Self or Random

[Figure label: 85% Among Outbound Network]
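A small power-iteration sketch of the 85% / 15% split described above. It is not the GraphFrames implementation used in the demo; the graph, the names, and the choice to send a node's rank to the teleport target when it has no outbound links are assumptions for illustration.

```python
def pagerank(graph, damping=0.85, source=None, iterations=50):
    """Power iteration. With source=None the 15% jump is uniformly random;
    with a source node the jump always returns to that node (personalized)."""
    nodes = sorted(set(graph) | {dst for dsts in graph.values() for dst in dsts})
    if source is None:
        teleport = {node: 1.0 / len(nodes) for node in nodes}
    else:
        teleport = {node: 1.0 if node == source else 0.0 for node in nodes}

    rank = dict(teleport)
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) * teleport[node] for node in nodes}
        for node in nodes:
            outbound = graph.get(node, [])
            if outbound:
                # 85%: spread this node's rank evenly over its outbound links.
                share = damping * rank[node] / len(outbound)
                for dst in outbound:
                    new_rank[dst] += share
            else:
                # No outbound links: hand the 85% to the teleport target(s).
                for other in nodes:
                    new_rank[other] += damping * rank[node] * teleport[other]
        rank = new_rank
    return rank

# Hypothetical "like" graph: who likes whom.
likes = {
    "alice": ["bob", "carol"],
    "bob": ["carol"],
    "carol": ["alice"],
    "dave": ["carol"],
}
global_rank = pagerank(likes)
dave_rank = pagerank(likes, source="dave")
```

`global_rank` answers "who is most liked overall", while `dave_rank` scores everyone from dave's point of view.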

Page 45: Boston Spark Meetup May 24, 2016

Personalized PageRank: No Outbound

0.15 = (1 - 0.85 “Damping Factor”)
  85% Probability: Choose Among Outbound Network
  15% Probability: Choose Self or Random

[Figure label: 85% Among No Outbound Network!!]

Page 46: Boston Spark Meetup May 24, 2016

User-to-User Clustering

User Similarity
  Time-based
    Pattern of viewing (binge or casual)
    Time of viewing (am or pm)
  Ratings-based
    Content ratings or number of views
    Average rating relative to others (critical or lenient)
  Search-based
    Search terms

Page 47: Boston Spark Meetup May 24, 2016

Item-to-Item Clustering

Item Similarity
  Profile text (TF/IDF, Word2Vec, NLP)
  Categories, tags, interests (Jaccard Similarity, LSH)
  Images, facial structures (Neural Nets, Eigenfaces)

Dating Site Example…
  Cluster Similar Eigen-faces
  Cluster Similar Profiles
  Cluster Similar Categories

Page 48: Boston Spark Meetup May 24, 2016

Bonus: NLP Conversation Starter Bot

“If your responses to my generic opening lines are positive, I may read your profile.”

Spark ML, Stanford CoreNLP, TF/IDF, DecisionTrees, Sentiment

http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html

Page 49: Boston Spark Meetup May 24, 2016

Bonus: Demo!
Spark + Stanford CoreNLP Sentiment Analysis

Page 50: Boston Spark Meetup May 24, 2016

Bonus: Top 100 Country Song Sentiment

Page 51: Boston Spark Meetup May 24, 2016

Bonus: Surprising Results…?!

Page 52: Boston Spark Meetup May 24, 2016

Item-to-Item Based Recommendations
Based on Metadata: Genre, Description, Cast, City

Page 53: Boston Spark Meetup May 24, 2016

Demo!
Item-to-Item-based Recommendations

One-Hot Encoding + K-Means Clustering

Page 54: Boston Spark Meetup May 24, 2016

One-Hot Encode Tag Feature Vectors

Page 55: Boston Spark Meetup May 24, 2016

Cluster Movie Tag Feature Vectors

[Figure label: Hyperparameter Tuning (K Clusters?)]

Page 56: Boston Spark Meetup May 24, 2016

Analyze Movie Tag Clusters

Page 57: Boston Spark Meetup May 24, 2016

User-to-Item Collaborative Filtering

Matrix Factorization
① Factor the large matrix (left) into 2 smaller matrices (right)
② Lower-rank matrices approximate original when multiplied
③ Fill in the missing values of the large matrix
④ Surface k (rank) latent features from user-item interactions

Page 58: Boston Spark Meetup May 24, 2016

Item-to-Item Collaborative Filtering
Famous Amazon Paper circa 2003

Problem
  As users grew, user-to-item collaborative filtering didn’t scale

Solution
  Item-to-item similarity, nearest neighbors
  Offline (Batch)
    Generate itemId->List[userId] vectors
  Online (Real-time)
    From cart, recommend nearest-neighbors in vector space

Page 59: Boston Spark Meetup May 24, 2016

Demo!
Collaborative Filtering-based Recommendations

Page 60: Boston Spark Meetup May 24, 2016

Fitting the Matrix Factorization Model

Page 61: Boston Spark Meetup May 24, 2016

Show ItemFactors Matrix from ALS

Page 62: Boston Spark Meetup May 24, 2016

Show UserFactors Matrix from ALS

Page 63: Boston Spark Meetup May 24, 2016

Generating Individual Recommendations

Page 64: Boston Spark Meetup May 24, 2016

Generating Batch Recommendations

Page 65: Boston Spark Meetup May 24, 2016

Clustering + Collaborative Filtering Recs

Cluster matrix output from Matrix Factorization
  Latent features derived from user-item interaction

Item-to-Item Similarity
  Cluster item-factor matrix

User-to-User Similarity
  Cluster user-factor matrix

Page 66: Boston Spark Meetup May 24, 2016

Demo!
Clustering + Collaborative Filtering-based Recommendations

Page 67: Boston Spark Meetup May 24, 2016

Show ItemFactors Matrix from ALS

Page 68: Boston Spark Meetup May 24, 2016

Convert to Item Factors -> mllib.Vector
Required by K-Means Clustering Algorithm

Page 69: Boston Spark Meetup May 24, 2016

Fit and Evaluate K-Means Cluster Model

[Figure labels: Measures Closeness Of Points Within Clusters; K = 5 Clusters]

Page 70: Boston Spark Meetup May 24, 2016

Netflix Genres and Clusters

Typical Genres
  Documentary, Romance, Comedy, Horror, Action, Adventure

Latent (Hidden) Clusters
  Emotionally-Independent Dramas for Hopeless Romantics
  Witty Dysfunctional-Family TV Animated Comedies
  Romantic Crime Movies based on Classic Literature
  Latin American Forbidden-Love Movies
  Critically-acclaimed Emotional Drug Movie
  Cerebral Military Movie based on Real Life
  Sentimental Movies about Horses for Ages 11-12
  Gory Canadian Revenge Movies
  Raunchy Mad Scientist Comedy

Page 71: Boston Spark Meetup May 24, 2016

Presentation Outline
① Scaling
② Similarities
③ Recommendations
④ Approximations
⑤ Netflix Recommendations

Page 72: Boston Spark Meetup May 24, 2016

When to Approximate?

Memory or time constrained queries
  Relative vs. exact counts are OK (approx # errors after a release)

Using machine learning or graph algos
  Inherently probabilistic and approximate

Streaming aggregations
  Inherently sloppy collection (exactly once?)

Approximate as much as you can get away with!
Ask for forgiveness later !!

Page 73: Boston Spark Meetup May 24, 2016

When NOT to Approximate?

If you’ve ever heard the term…
“Sarbanes-Oxley”
…at the office.

Page 74: Boston Spark Meetup May 24, 2016

A Few Good Algorithms

You can’t handle the approximate!

Page 75: Boston Spark Meetup May 24, 2016

Common to These Algos & Data Structs
  Low, fixed size in memory
  Store large amount of data
  Known error bounds
  Tunable tradeoff between size and error
  Less memory than Java/Scala collections
  Rely on multiple hash functions or operations
  Size of hash range defines error

Page 76: Boston Spark Meetup May 24, 2016

Bloom Filter
Set.contains(key): Boolean

“Hash Multiple Times and Flip the Bits Wherever You Land”

Page 77: Boston Spark Meetup May 24, 2016

Bloom Filter
  Approximate Set.contains(key)
  No means No, Yes means Maybe
  Elements can only be added
    Never updated or removed

Page 78: Boston Spark Meetup May 24, 2016

Bloom Filter in Action

set(key)
contains(key): Boolean

Images by @avibryant

Set.contains(key): TRUE -> maybe contains (other key hashes may overlap)
Set.contains(key): FALSE -> definitely does not contain (no key flipped all bits)
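A minimal Bloom filter sketch to go with the TRUE / FALSE semantics above. The class name, the default sizes, and the use of salted SHA-256 as the family of hash functions are assumptions for illustration; it is not the Algebird implementation.

```python
import hashlib

class BloomFilter:
    """Fixed-size bit array; keys can be added but never removed."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        # Derive num_hashes bit positions by salting one hash with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        """Hash multiple times and flip the bits wherever you land."""
        for position in self._positions(key):
            self.bits[position] = True

    def might_contain(self, key):
        """False means definitely absent; True means maybe present."""
        return all(self.bits[position] for position in self._positions(key))

viewers = BloomFilter()
for user in ["user1001", "user2009", "user3005"]:
    viewers.add(user)
```

Growing `num_bits` lowers the chance that unrelated keys overlap on every position, which is the size-versus-error tradeoff these structures share.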

Page 79: Boston Spark Meetup May 24, 2016

CountMin Sketch
Frequency Count and TopK

“Hash Multiple Times and Add 1 Wherever You Land”

Page 80: Boston Spark Meetup May 24, 2016

CountMin Sketch (CMS)
  Approximate frequency count and TopK for key
  i.e. “Heavy Hitters” on Twitter

[Figure labels: Matei Zaharia; Martin Odersky; Donald Trump]

Page 81: Boston Spark Meetup May 24, 2016

CountMin Sketch In Action (TopK Count)

Use Case: TopK movies using total views

add(Top Gun, 2)
add(A Few Good Men, 1)
add(Taps, 1)
getCount(Top Gun): Long

[Figure: sketch table. Labels: Multiple hash functions (1 hash function per row); Binary hash output (1 element per column); Top Gun (x 2); A Few Good Men; Taps; Overlap Top Gun; Overlap A Few Good Men; x 2 occurrences of “Top Gun” for slightly additional complexity; Find minimum of all rows]

Can overestimate, but never underestimate

Images derived from @avibryant
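The add / getCount calls above can be sketched as follows. The class, its default depth and width, and the salted SHA-256 hashing are illustrative assumptions rather than the Algebird API.

```python
import hashlib

class CountMinSketch:
    """depth rows (one hash function each) by width counters per row."""

    def __init__(self, depth=4, width=256):
        self.depth = depth
        self.width = width
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, key):
        # One (row, column) cell per hash function.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{key}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, key, count=1):
        """Hash multiple times and add the count wherever you land."""
        for row, column in self._cells(key):
            self.table[row][column] += count

    def get_count(self, key):
        """Minimum over all rows: may overestimate, never underestimates."""
        return min(self.table[row][column] for row, column in self._cells(key))

views = CountMinSketch()
views.add("Top Gun", 2)
views.add("A Few Good Men", 1)
views.add("Taps", 1)
top_gun_estimate = views.get_count("Top Gun")
```

Taking the minimum picks the row least polluted by other keys that landed in the same column.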

Page 82: Boston Spark Meetup May 24, 2016

HyperLogLog
Count Distinct

“Hash Multiple Times and Uniformly Distribute Where You Land”

Page 83: Boston Spark Meetup May 24, 2016

HyperLogLog (HLL)
  Approximate count distinct
  Slight twist
    Special hash function creates uniform distribution
    Hash subsets of data with single, special hash func
  Error estimate
    14 bits for size of range
    m = 2^14 = 16,384 hash slots
    error = 1.04/(sqrt(16,384)) = .81%

[Figure label: Not many of these]
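The error estimate on this slide is plain arithmetic, sketched here so the size-versus-error tradeoff can be tried with other bit widths. The function name is an illustrative assumption.

```python
import math

def hll_standard_error(index_bits):
    """Relative standard error 1.04 / sqrt(m) for m = 2^index_bits hash slots."""
    slots = 2 ** index_bits
    return 1.04 / math.sqrt(slots)

error_14_bits = hll_standard_error(14)  # 1.04 / 128
error_10_bits = hll_standard_error(10)  # fewer slots, larger error
```

Each extra 2 bits quadruples the number of slots and halves the error.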

Page 84: Boston Spark Meetup May 24, 2016

HyperLogLog In Action (Count Distinct)
Use Case: Number of distinct users who view a movie

[Figure: hashed users placed on uniform ranges. Panels: Top Gun: Hour 1; Top Gun: Hour 2; Top Gun: Hour 1 + 2. Axis ranges: 0 to 16 and 0 to 32. Users shown in the combined panel: user2001, user4009, user3002, user7002, user1005, user6001, User8001, User8002, user7009, user1001, user2009, user3005, user3003, user3001]

Uniform Distribution: Estimate distinct # of users by inspecting just the beginning
Combine across different scales

Page 85: Boston Spark Meetup May 24, 2016

Locality Sensitive Hashing
Set Similarity

“Pre-process Items into Buckets, Compare Within Buckets”

Page 86: Boston Spark Meetup May 24, 2016

Locality Sensitive Hashing (LSH)
  Approximate set similarity
  Pre-process m rows into b buckets
    b << m; b = buckets, m = rows
  Hash items multiple times
    ** Similar items hash to overlapping buckets
    ** Hash designed to cluster similar items
  Compare just contents of buckets
    Much smaller cartesian compare
    ** Compare in parallel !!
  Avoids huge cartesian all-pairs compare

[Figure label: Chapter 3: LSH]
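One common way to realize "hash into buckets, compare within buckets" for sets is MinHash signatures split into bands; the sketch below assumes that scheme. The function names, the parameter defaults, and the movie tags are hypothetical, and each set must be non-empty.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

def _hash(seed, item):
    digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def minhash_signature(items, num_hashes=20):
    """For each seeded hash function, keep the minimum hash over the set."""
    return [min(_hash(seed, item) for item in items) for seed in range(num_hashes)]

def lsh_candidate_pairs(sets_by_id, num_hashes=20, bands=10):
    """Bucket each signature band by band; items sharing any bucket become candidates."""
    rows_per_band = num_hashes // bands
    buckets = defaultdict(set)
    for item_id, items in sets_by_id.items():
        signature = minhash_signature(items, num_hashes)
        for band in range(bands):
            chunk = tuple(signature[band * rows_per_band:(band + 1) * rows_per_band])
            buckets[(band, chunk)].add(item_id)

    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs

# Hypothetical tag sets; only candidate pairs need an exact similarity check afterwards.
movie_tags = {
    "movie_a": {"action", "jets", "navy"},
    "movie_b": {"action", "jets", "navy"},
    "movie_c": {"courtroom", "drama", "military"},
}
candidates = lsh_candidate_pairs(movie_tags)
```

More bands with fewer rows each make the scheme more permissive (more candidates, fewer misses); fewer bands with more rows make it stricter.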

Page 87: Boston Spark Meetup May 24, 2016

DIMSUM
Set Similarity

“Pre-process and ignore data that is unlikely to be similar.”

Page 88: Boston Spark Meetup May 24, 2016

DIMSUM
“Dimension Independent Matrix Square Using MR”
  Remove vectors with low probability of similarity
  RowMatrix.columnSimilarities(threshold)

Twitter DIMSUM Case Study
  40% efficiency gain over brute-force Cosine Sim

Page 89: Boston Spark Meetup May 24, 2016

Common Tools to Approximate
  Twitter Algebird: Composable Library
  Redis: Distributed Cache
  Apache Spark: Big Data Processing

Page 90: Boston Spark Meetup May 24, 2016

Twitter Algebird

Algebraic Fundamentals
  Parallel
  Associative
  Composable

Examples
  Min, Max, Avg
  BloomFilter (Set.contains(key))
  HyperLogLog (Count Distinct)
  CountMin Sketch (TopK Count)

Page 91: Boston Spark Meetup May 24, 2016

Redis
Implementation of HyperLogLog (Count Distinct)
  12KB per item count
  2^64 max # of items
  0.81% error

Add user views for given movie
  PFADD TopGun_Hour1_HLL user1001 user2009 user3005
  PFADD TopGun_Hour1_HLL user3003 user1001

Get distinct count (cardinality) of set
  PFCOUNT TopGun_Hour1_HLL
  Returns: 4 (distinct users viewed this movie)

Union 2 HyperLogLog Data Structures
  PFMERGE TopGun_Hour1_HLL TopGun_Hour2_HLL

[Callouts: ignore duplicates; Tunable]

Page 92: Boston Spark Meetup May 24, 2016

Approximations in Spark Libraries

Spark Core
  countByKeyApprox(timeout: Long, confidence: Double)
  PartialResult

Spark SQL
  approxCountDistinct(column: Column, targetResidual: Float)
  approxQuantile(column: Column, quantiles: Seq[Float], targetResidual: Float)

Spark ML
  Stratified sampling
    sampleByKey(fractions: Map[K, Double])
  DIMSUM sampling
    Probabilistic sampling reduces amount of shuffle
    RowMatrix.columnSimilarities(threshold: Double)

Page 93: Boston Spark Meetup May 24, 2016

Demo!
Exact Count vs. Approximate HLL and CMS Count

Page 94: Boston Spark Meetup May 24, 2016

HashSet vs. HyperLogLog (Memory)

Page 95: Boston Spark Meetup May 24, 2016

HashSet vs. CountMin Sketch (Memory)

Page 96: Boston Spark Meetup May 24, 2016

Demo!
Exact Similarity vs. Approximate LSH Similarity

Page 97: Boston Spark Meetup May 24, 2016

Brute Force Cartesian All Pair Similarity

47 seconds

Page 98: Boston Spark Meetup May 24, 2016

Locality Sensitive Hash All Pair Similarity

6 seconds

Page 99: Boston Spark Meetup May 24, 2016

Many More Demos!

Download Docker or Clone on Github

http://advancedspark.com

Page 100: Boston Spark Meetup May 24, 2016

Presentation Outline
① Scaling
② Similarities
③ Recommendations
④ Approximations
⑤ Netflix Recommendations

Page 101: Boston Spark Meetup May 24, 2016

Netflix Recommendations
From Ratings to Real-time

Page 102: Boston Spark Meetup May 24, 2016

Netflix Has a Lot of Data

Netflix has a lot of data about a lot of users and a lot of movies.
Netflix can use this data to buy new movies.
Netflix is global.
Netflix can use this data to choose original programming.
Netflix knows that a lot of people like politics and Kevin Spacey.

Summary: Buy NFLX Stock!

My favorite movie: “Harold and Kumar Go to White Castle”
The UK doesn’t have White Castle. Renamed my favourite movie to: “Harold and Kumar Get the Munchies”
This broke my unit tests!

Page 103: Boston Spark Meetup May 24, 2016

Netflix Data Pipeline - Then

v1.0

v2.0

Page 104: Boston Spark Meetup May 24, 2016

Netflix Data Pipeline – Now (Keystone)

v3.0

9 million events per second
22 GB per second!!

EC2 D2XL
Disk: 6 TB, 475 MB/s
RAM: 30 GB
Network: 700 Mbps

Auto-scaling, fault tolerance

A/B Tests, Trending Now

SAMZA

Splits high and normal priority

Page 105: Boston Spark Meetup May 24, 2016

Netflix Recommendation Data Pipeline

Throw away batch user factors (U)

Keep batch video factors (V)
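The slide does not say how user factors are rebuilt once the batch U is discarded. One common approach, assumed here purely for illustration, is "fold-in": hold the batch video factors V fixed and solve a small regularized least-squares problem for a single user from that user's recent ratings. The function names, the `reg` parameter, and the toy factors in the usage note are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def fold_in_user(video_factors, ratings, reg=0.1):
    """Estimate one user's factor vector while keeping video factors fixed.

    video_factors: dict of video id -> list of k floats (the kept batch V)
    ratings: dict of video id -> this user's rating
    Solves (V_r^T V_r + reg * I) u = V_r^T r over the rated videos only.
    """
    k = len(next(iter(video_factors.values())))
    gram = [[reg if i == j else 0.0 for j in range(k)] for i in range(k)]
    rhs = [0.0] * k
    for video_id, rating in ratings.items():
        v = video_factors[video_id]
        for i in range(k):
            rhs[i] += v[i] * rating
            for j in range(k):
                gram[i][j] += v[i] * v[j]
    return solve_linear_system(gram, rhs)


def predict(user_factors, video_vector):
    """Predicted rating is the dot product of user and video factors."""
    return sum(u * v for u, v in zip(user_factors, video_vector))
```

For example, with hypothetical factors `{"m1": [1.0, 0.0], "m2": [0.0, 1.0], "m3": [1.0, 1.0]}` and ratings for `m1` and `m2` only, `fold_in_user` yields a user vector that `predict` can score against the unseen `m3`. The solve is tiny (k by k), which is what makes it plausible to run per user at request or stream time.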

Page 106: Boston Spark Meetup May 24, 2016

Netflix Trending Now (Time-based Recs)

Uses Spark Streaming
Personalized to user (viewing history, past ratings)
Learns and adapts to events (Valentine’s Day)

“VHS”

Number of Plays

Number of Impressions

Calculate Take Rate
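The labels on this slide suggest that take rate is the number of plays divided by the number of impressions for a video. Under that assumption, a minimal sketch looks like this; the event tuple layout and the `take_rates` name are hypothetical, and a streaming job would compute the same ratio per time window.

```python
from collections import Counter


def take_rates(events):
    """Compute plays / impressions per video.

    events: iterable of (video_id, event_type), where event_type is
    "impression" or "play" (an assumed schema for this example).
    """
    plays = Counter()
    impressions = Counter()
    for video_id, event_type in events:
        if event_type == "play":
            plays[video_id] += 1
        elif event_type == "impression":
            impressions[video_id] += 1
    # Videos with no impressions are skipped to avoid dividing by zero.
    return {
        video_id: plays[video_id] / shown
        for video_id, shown in impressions.items()
        if shown > 0
    }
```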

Page 107: Boston Spark Meetup May 24, 2016

Bonus: Pandora Time-based Recs

Work Days
Play familiar music
User is less likely to accept new music

Evenings and Weekends
Play new music
User is more likely to accept new music

Page 108: Boston Spark Meetup May 24, 2016

$1 Million Netflix Prize (2006-2009)

Goal
Improve movie predictions by 10% (Root Mean Squared Error)
Test data withheld to calculate RMSE upon submission

5-star Ratings Dataset
(userId, movieId, rating, timestamp)

Winning algorithm(s)
10.06% improvement (RMSE)
Ensemble of 500+ ML models combined with GBDTs
Computationally impractical
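For reference, the metric and the percentage improvement can be written in a few lines of Python. This is a generic sketch of RMSE and relative improvement, not the contest's scoring code; the function names are assumptions.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating lists."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


def improvement(baseline_rmse, new_rmse):
    """Relative RMSE reduction, e.g. 0.10 for a 10% improvement."""
    return (baseline_rmse - new_rmse) / baseline_rmse
```

A lower RMSE is better, so an improvement of 10% means the new model's RMSE is 90% of the baseline's.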

Page 109: Boston Spark Meetup May 24, 2016

Secrets to the Winning Algorithms

Adjust for the following human biases…
① Alice effect: user rates lower than avg
② Inception effect: movie rated higher than avg
③ Overall mean rating of a movie
④ Number of people who have rated a movie
⑤ Number of days since user’s first rating
⑥ Number of days since movie’s first rating
⑦ Mood, time of day, day of week, season, weather
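Effects ① and ② are often modeled as a baseline predictor: global mean plus a per-user offset plus a per-movie offset. The sketch below is a simplified illustration under that assumption. It uses plain averages with no regularization and ignores the time-based effects on the list, so it is far simpler than the winning models. The function names and the toy ratings in the usage note are hypothetical.

```python
from collections import defaultdict


def fit_baseline(ratings):
    """Fit global mean, user bias, and movie bias from (user, movie, rating) rows."""
    mu = sum(r for _, _, r in ratings) / len(ratings)

    # User bias: how far this user's ratings sit from the global mean.
    user_residuals = defaultdict(list)
    for user, _, r in ratings:
        user_residuals[user].append(r - mu)
    user_bias = {u: sum(v) / len(v) for u, v in user_residuals.items()}

    # Movie bias: what is left after removing the mean and the user's bias.
    movie_residuals = defaultdict(list)
    for user, movie, r in ratings:
        movie_residuals[movie].append(r - mu - user_bias[user])
    movie_bias = {m: sum(v) / len(v) for m, v in movie_residuals.items()}

    return mu, user_bias, movie_bias


def predict_baseline(mu, user_bias, movie_bias, user, movie):
    """Unknown users or movies fall back to a bias of zero."""
    return mu + user_bias.get(user, 0.0) + movie_bias.get(movie, 0.0)
```

With toy ratings where "alice" rates everything two stars below "bob", the fit gives alice a negative bias and bob a positive one, and a movie both rated highly gets a positive bias. The residual left after subtracting this baseline is what a factorization model would then try to explain.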

Page 110: Boston Spark Meetup May 24, 2016

Netflix Common ML Algorithms

Logistic Regression
Linear Regression
Gradient Boosted Decision Trees
Random Forest
Matrix Factorization
SVD
Restricted Boltzmann Machines
Deep Neural Nets
Markov Models
LDA
Clustering

Ensembles!

Page 111: Boston Spark Meetup May 24, 2016

Netflix Genres and Clusters

Typical Genres
Documentaries, Romantic Comedies, Horror, Action, Adventure

Latent (Hidden) Clusters
Emotionally-Independent Dramas for Hopeless Romantics
Witty Dysfunctional-Family TV Animated Comedies
Romantic Crime Movies based on Classic Literature
Latin American Forbidden-Love Movies
Critically-acclaimed Emotional Drug Movie
Cerebral Military Movie based on Real Life
Sentimental Movies about Horses for Ages 11-12
Gory Canadian Revenge Movies
Raunchy Mad Scientist Comedy

Page 112: Boston Spark Meetup May 24, 2016

Netflix Social Integration

Post to Facebook after movie start (5 mins)
Recommend to new users based on friends
Helps with Cold Start problem

Page 113: Boston Spark Meetup May 24, 2016

Netflix Search

No results? No problem… Show similar results!
Utilize extensive DVD Catalog
Metadata search (ElasticSearch)
Named entity recognition (NLP)

Empty searches are an opportunity!
Explicit feedback for future recommendations
Content to buy and produce!

Page 114: Boston Spark Meetup May 24, 2016

Netflix A/B Tests

Users tend to click on images featuring…
Faces with strong emotional expressions
Villains over heroes
Small number of cast members

Page 115: Boston Spark Meetup May 24, 2016

Netflix Recommendation Serving Layer

Use Case: Recommendation service depends on EVCache
Problem: EVCache cluster goes down or becomes latent!?
Answer: github.com/Netflix/Hystrix Circuit Breaker!

Circuit States
Closed: Service OK
Open: Service DOWN
Fallback to Static
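Hystrix is a Java library, and the sketch below does not use its API. It is a small Python illustration of the circuit breaker pattern the slide describes: calls pass through while the circuit is closed, repeated failures open it, and while it is open the caller gets a static fallback without touching the failing dependency. The class name, thresholds, and the injectable `clock` are assumptions for this example.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker with a static fallback (illustrative only)."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    @property
    def is_open(self):
        if self.opened_at is None:
            return False
        # After the timeout, report closed so one trial call can go through.
        return self.clock() - self.opened_at < self.reset_timeout

    def call(self, primary, fallback):
        if self.is_open:
            return fallback()          # open: skip the dependency entirely
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

In the slide's scenario, `primary` would be the EVCache-backed personalized lookup and `fallback` would return a static list such as popular titles, so users still see recommendations while the cache recovers.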

Page 116: Boston Spark Meetup May 24, 2016

Why Higher Average Ratings 2004+?

In 2004, Netflix noticed higher ratings on average
Some possible reasons why…

① Significant UI improvements deployed
② New recommendation engine deployed
③

Page 117: Boston Spark Meetup May 24, 2016

Thank You, Everyone!!

Chris Fregly @cfregly
Research Scientist @ Flux Capacitor AI
San Francisco, California, USA

http://fluxcapacitor.com
Sign up for the Meetup and Book
Contribute to Github Repo
Run all Demos using Docker

Find me on LinkedIn, Twitter, Github, Email, Fax

Image derived from http://www.duchess-france.org/

Page 118: Boston Spark Meetup May 24, 2016
