
MasterSearch_Meetup_AdvancedAnalytics

Date post: 24-Jan-2018
Upload: longhow-lam
Page 1: MasterSearch_Meetup_AdvancedAnalytics

Data science @ RTL Nederland

Longhow Lam

@longhowlam

1st Master Search Advanced Analytics meetup

5 July 2017

Page 2: MasterSearch_Meetup_AdvancedAnalytics

Agenda

RTL Data science set up

Some data science topics @RTL

Text mining

Computer vision

Association rules

Power BI

Page 3: MasterSearch_Meetup_AdvancedAnalytics

RTL Data science set up

Source data

Click data

Heartbeat data

Account data

Location

Metadata

Campaign data

Etc.

Data science team

4 data engineers

4 data scientists

The main businesses at RTL we work for

ETL processes

Find out for yourself: https://www.rtl.nl/werkenbij/

Use cases

Churn modeling

Response modeling

Customer segmentation

Look-alikes for advertisers

Recommendation engines

Page 4: MasterSearch_Meetup_AdvancedAnalytics

Content similarity

Which movies on Videoland are close to each other?

Which news articles on RTL Nieuws are close to each other?

For movies, we can look at the movie summaries or video captures

For news articles, we can look at the text of the articles or the corresponding news image

Hence text mining and computer vision

Page 5: MasterSearch_Meetup_AdvancedAnalytics

Text mining

Page 6: MasterSearch_Meetup_AdvancedAnalytics

Text mining

2000 movie plots / summaries on Videoland

For each movie plot: count the words / terms

Put the counts in a so-called term document matrix

There are around 50,000 terms in the 2000 movie plots

Usually this matrix is very sparse

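Term document matrix

Rows: Film1, Film2, Film3, …, Film2000

Columns (terms): Aap (monkey), Film (movie), Auto (car), …, Leven (life), …, Zwaar (heavy)

Non-empty cells hold word counts such as 1, 4, 8 and 10; most cells are empty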
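As a rough sketch of the counting step described above, the snippet below builds a tiny term document matrix with the Python standard library. The three plots are made-up stand-ins for the Videoland summaries, and the whitespace tokenisation is an assumption; the slides do not say how the text was tokenised.

```python
from collections import Counter

# Hypothetical toy plots; the real input would be the ~2000 Videoland summaries.
plots = {
    "Film1": "een aap rijdt in een auto",
    "Film2": "het leven is zwaar het leven is mooi",
    "Film3": "een film over een auto",
}

# Count the terms per document (simple whitespace tokenisation, an assumption).
counts = {name: Counter(text.lower().split()) for name, text in plots.items()}

# Vocabulary: all distinct terms, in a fixed (sorted) column order.
vocabulary = sorted({term for c in counts.values() for term in c})

# Dense term document matrix: one row per film, one column per term.
matrix = {name: [c[term] for term in vocabulary] for name, c in counts.items()}

# Share of zero cells, to illustrate how sparse the matrix is.
cells = [value for row in matrix.values() for value in row]
sparsity = cells.count(0) / len(cells)
```

With 50,000 terms a dense list per film wastes memory, so a real implementation would more likely keep only the non-zero counts (the `counts` dictionaries above already do that).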

Page 7: MasterSearch_Meetup_AdvancedAnalytics

Similarity: cosine similarity

Between movies (so between rows of the matrix) we can now calculate similarities

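A measure that is often used is cosine similarity

Visually we can see this measure in the following figure:

Suppose we only have two terms:

1. Leven (life)

2. Spannend (exciting)

[Figure: Film 1 and Film 2 drawn as vectors, with axes "# leven" and "# spannend"; the cosine similarity is the cosine of the angle between the two vectors]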
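A minimal sketch of the two-term example, assuming made-up counts for the two films. The function name `cosine_similarity` and the zero-vector convention are choices made here for illustration, not something taken from the slides.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two count vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_a * norm_b)

# Hypothetical counts of ("leven", "spannend") in two movie plots.
film_1 = [8, 1]
film_2 = [2, 6]
similarity = cosine_similarity(film_1, film_2)
```

Because only the angle matters, a long and a short plot with the same word proportions get a similarity of 1.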

Page 8: MasterSearch_Meetup_AdvancedAnalytics

Videoland: to get a feeling for movie similarities we created a small Shiny app

Page 9: MasterSearch_Meetup_AdvancedAnalytics

RTL Nieuws: Power BI dashboard to see article similarities

Page 10: MasterSearch_Meetup_AdvancedAnalytics

Computer vision

Page 11: MasterSearch_Meetup_AdvancedAnalytics

Two approaches used @RTL

Computer Vision API from vendors (Microsoft, Clarifai, Google, …)

+ Ready and easy to use (just send your image to them)

+ Not too expensive ($0.84 / 1000 images)

- No control over what is returned

Tweak things ourselves with Keras/TensorFlow

+ More control over what you are doing

- Takes more effort to set up

- Needs more knowledge

Page 12: MasterSearch_Meetup_AdvancedAnalytics

RTL Nieuws image: API examples

Feature Name Value

Description { "type": 0, "captions": [ { "text": "a group of people sitting on a table", "confidence": 0.4894670976127814 } ] }

Tags [ { "name": "person", "confidence": 0.996391236782074 }, { "name": "indoor", "confidence": 0.9104063510894775 }, { "name": "people", "confidence": 0.7057779431343079 } ]

Image Format Jpeg

Image Dimensions 4096 x 3078

Clip Art Type 0 Non-clipart

Line Drawing Type 0 Non-LineDrawing

Black & White Image False

Is Adult Content False

Adult Score 0.042066238820552826

Is Racy Content False

Racy Score 0.061784882098436356

Categories [ { "name": "people_many", "score": 0.9296875 } ]

Faces [ { "age": 52, "gender": "Male", "faceRectangle": { "width": 298, "height": 298, "left": 433, "top": 1370 } }, { "age": 78, "gender": "Male", "faceRectangle": { "width": 269, "height": 269, "left": 3212, "top": 1410 } }, { "age": 64, "gender": "Male", "faceRectangle": { "width": 241, "height": 241, "left": 2108, "top": 1534 } } ]
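To give an idea of how such a response could be used downstream, here is a small sketch that keeps only the tags the API is fairly sure about. The tag values are copied from the example above; the helper `confident_tags` and the 0.9 threshold are hypothetical.

```python
import json

# "Tags" field copied from the example response above.
tags_json = """[
  {"name": "person", "confidence": 0.996391236782074},
  {"name": "indoor", "confidence": 0.9104063510894775},
  {"name": "people", "confidence": 0.7057779431343079}
]"""

def confident_tags(raw, threshold=0.9):
    """Return the tag names whose confidence is at least `threshold`."""
    return [tag["name"] for tag in json.loads(raw)
            if tag["confidence"] >= threshold]

kept = confident_tags(tags_json)
```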

Page 13: MasterSearch_Meetup_AdvancedAnalytics

Feature Name Value

Description { "type": 0, "captions": [ { "text": "Linda de Mol talking on a cell phone", "confidence": 0.46178352459016536 } ] }

Tags [ { "name": "person", "confidence": 0.9999904632568359 }, { "name": "outdoor", "confidence": 0.9974232912063599 }, { "name": "woman", "confidence": 0.9967917799949646 }, { "name": "lady", "confidence": 0.7658315896987915 } ]

Image Format Jpeg

Image Dimensions 1024 x 421

Clip Art Type 0 Non-clipart

Line Drawing Type 0 Non-LineDrawing

Black & White Image False

Is Adult Content False

Adult Score 0.009753250516951084

Is Racy Content False

Racy Score 0.014254707843065262

Categories [ { "name": "people_portrait", "score": 0.96875 } ]

Faces [ { "age": 28, "gender": "Female", "faceRectangle": { "width": 282, "height": 282, "left": 286, "top": 35 } } ]

RTL Nieuws image: API examples

Page 15: MasterSearch_Meetup_AdvancedAnalytics

Tweak things ourselves with Keras

Keras is a high-level neural networks API running on top of either TensorFlow or Theano, and now also CNTK.

Developed for fast experimentation.

Easier to use than TensorFlow, but you still have lots of options

There is now also an R interface (of course created by RStudio…)

Page 16: MasterSearch_Meetup_AdvancedAnalytics

Keras: simple set-up “architecture”

TensorFlow installed on a (Linux) machine, ideally with lots of GPUs

pip install keras

You’re good to go in Python (Jupyter notebooks)

devtools::install_github("rstudio/keras")

You’re good to go in R / RStudio

Page 17: MasterSearch_Meetup_AdvancedAnalytics

Example in R: Neural network with two hidden layers

[Figure: network diagram with 784 input nodes (Pixel 1 … Pixel 784), two hidden layers and 10 output nodes (Label 0 … Label 9)]
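The slide shows this network in R with Keras. As a dependency-free illustration of what such a network computes, the sketch below runs one forward pass through 784 inputs, two hidden layers and 10 outputs in plain Python. The hidden layer sizes (64 and 32), the ReLU activations and the random, untrained weights are all assumptions made here; the slides do not give them.

```python
import math
import random

def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def relu(values):
    return [max(0.0, v) for v in values]

def softmax(values):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (untrained)."""
    weights = [[rng.gauss(0, 0.05) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

rng = random.Random(0)
# 784 pixels -> 64 -> 32 -> 10 labels; the hidden sizes are assumptions.
layers = [init_layer(784, 64, rng), init_layer(64, 32, rng), init_layer(32, 10, rng)]

def predict(pixels):
    hidden_1 = relu(dense(pixels, *layers[0]))
    hidden_2 = relu(dense(hidden_1, *layers[1]))
    return softmax(dense(hidden_2, *layers[2]))

# One hypothetical image with 784 pixel intensities between 0 and 1.
probabilities = predict([rng.random() for _ in range(784)])
```

Training (finding good weights) is what Keras adds on top of this; with random weights the ten probabilities carry no meaning yet.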

Page 18: MasterSearch_Meetup_AdvancedAnalytics

Using pre-trained models

Image classifiers have been trained on big GPU machines for weeks with millions of pictures on very large networks

Not many people do that from scratch. Instead, one can use pre-trained networks and start from there.

VGG19 deep learning model: 143 million weights!!!

Page 19: MasterSearch_Meetup_AdvancedAnalytics

Predict image class using pre-trained models

Page 20: MasterSearch_Meetup_AdvancedAnalytics

RTL Nieuws images labeled with ResNet and VGG16

Link to trellisJS app

Page 21: MasterSearch_Meetup_AdvancedAnalytics

Extract features using pre-trained models

Remove top layers for feature extraction

We have a 7*7*512 ‘feature’ tensor = 25,088 values
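Where the 7*7*512 comes from can be checked with a few lines of arithmetic. This assumes the standard VGG set-up of a 224 x 224 input image and five 2x2 max-pooling steps that each halve the width and height; the slides do not state the input size.

```python
# Assumption: 224 x 224 input and five 2x2 max-pooling steps (standard VGG16/VGG19).
input_size = 224
pooling_steps = 5
channels = 512  # number of filters in the last convolutional block

spatial = input_size // (2 ** pooling_steps)  # 224 / 32 = 7
n_features = spatial * spatial * channels     # 7 * 7 * 512
```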

Page 22: MasterSearch_Meetup_AdvancedAnalytics

Only a few lines of R code

Page 23: MasterSearch_Meetup_AdvancedAnalytics

RTL NIEUWS Image similarity

1024 RTL Nieuws sample pictures. Compute for each image the 25,088 feature values.

Calculate for each image the top 10 closest images, based on cosine similarity.
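The "top 10 closest" step can be sketched as a brute-force search over all images. The random 16-value vectors below are stand-ins for the real feature vectors, and `top_k_closest` is a hypothetical helper name.

```python
import math
import random

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0

def top_k_closest(features, query, k=10):
    """Names of the k images most similar to `query`, excluding the query itself."""
    scores = [(cosine_similarity(features[query], vector), name)
              for name, vector in features.items() if name != query]
    scores.sort(reverse=True)  # highest similarity first
    return [name for _, name in scores[:k]]

rng = random.Random(1)
# Hypothetical stand-in: 50 images with 16 features each.
features = {f"img_{i}": [rng.random() for _ in range(16)] for i in range(50)}
features["img_copy"] = list(features["img_0"])  # exact duplicate of img_0

closest = top_k_closest(features, "img_0")
```

For about a thousand images this all-pairs comparison is cheap; the duplicate image ends up first in the list, as expected.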

Little Shiny app

Page 24: MasterSearch_Meetup_AdvancedAnalytics

Examples RTL Nieuws image similarities

Page 25: MasterSearch_Meetup_AdvancedAnalytics

Examples RTL Nieuws image similarities

Page 26: MasterSearch_Meetup_AdvancedAnalytics

The Brad Pitt Similarity index

Page 27: MasterSearch_Meetup_AdvancedAnalytics

Take five Brad Pitt pictures

Run them through the pre-trained VGG16 and extract feature vectors. This is a 5 by 25088 matrix

The Brad Pitt index

Take other images, run them through VGG16

Calculate the distances with the five Brad Pitt pictures and average:

0.771195 0.802654 0.714752 0.792587 0.8291976 0.8096944 0.665990 0.9737212
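A sketch of the averaging step, under the assumption that the score is the mean cosine similarity to the five reference pictures (the slides say "distances" without naming the measure). The 3-value vectors are made up; the real ones have 25088 values each.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0

def similarity_index(reference_vectors, candidate):
    """Mean cosine similarity between `candidate` and each reference vector."""
    scores = [cosine_similarity(ref, candidate) for ref in reference_vectors]
    return sum(scores) / len(scores)

# Hypothetical feature vectors for the five reference pictures.
references = [
    [1.0, 0.0, 1.0],
    [1.0, 0.2, 0.8],
    [0.9, 0.1, 1.0],
    [1.0, 0.0, 0.9],
    [0.8, 0.1, 1.0],
]

look_alike = similarity_index(references, [1.0, 0.1, 0.9])
unrelated = similarity_index(references, [0.0, 1.0, 0.0])
```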

Page 28: MasterSearch_Meetup_AdvancedAnalytics

0.6273 0.5908 0.8231 0.7711 0.8839 0.8975 0.6934 0.9659

Focusing on only the face!!

Page 29: MasterSearch_Meetup_AdvancedAnalytics

Can you shake hands with your neighbor?

A little Statistical Experiment

Page 30: MasterSearch_Meetup_AdvancedAnalytics

Can you shake hands with your neighbor?

A little Statistical Experiment

50.1% of people don’t wash their hands after visiting the toilet

Page 31: MasterSearch_Meetup_AdvancedAnalytics

A little Statistical Experiment

Page 32: MasterSearch_Meetup_AdvancedAnalytics

Can you shake hands with your neighbor?

A little Statistical Experiment

84.6% of all statistics are just made up on the spot !!

Page 33: MasterSearch_Meetup_AdvancedAnalytics

Association Rules Mining

Page 34: MasterSearch_Meetup_AdvancedAnalytics

Association Rules Mining

Market basket analysis

Association rules mining (arm)

Mixture of different methods

Ensemble

ARM is one of several so-called collaborative filtering algorithms

Collaborative filtering is a method of making recommendations about the interests of one user (filter) by collecting preferences or behavior from many users (collaborating).

Memory-based algorithms: Slope One (slope1), k-nearest neighbors (knn)

Model-based algorithms: matrix factorization methods

Page 35: MasterSearch_Meetup_AdvancedAnalytics

Association rule mining

The basics

Identify frequent item sets (or rules) in the customer transaction data:

IF item X THEN item Y

IF item A and B THEN very likely item C

Not all rules are interesting, use ‘support’ and ‘lift’ to judge the importance of a rule

Support(X,Y) = (# trxs. containing both X and Y) / (total # trxs.)

Lift(X,Y) = Support(X,Y) / (Support(X) * Support(Y))

Support & Lift

GTST, Nieuwe Tijden: 10.8%

Star Trek, GTST: 0.018%

For example, a lift of 2.5 means: if people have watched movie X, they are 2.5 times as likely to watch movie Y as viewers in general
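Support and lift are easy to compute by hand on a small example. The sketch below uses eight made-up viewing histories; the function names `support` and `lift` are chosen here for illustration.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in transactions) / len(transactions)

def lift(transactions, x, y):
    """Support(X,Y) / (Support(X) * Support(Y)); assumes both items occur."""
    return support(transactions, {x, y}) / (
        support(transactions, {x}) * support(transactions, {y}))

# Hypothetical viewing histories, one set of titles per user.
transactions = [
    {"X", "Y"}, {"X", "Y"}, {"X"}, {"Y"},
    {"Z"}, {"Z"}, {"Z"}, {"Z"},
]

support_xy = support(transactions, {"X", "Y"})  # 2 of 8 users watched both
lift_xy = lift(transactions, "X", "Y")
```

Here X and Y are each watched by 3 of 8 users, so independence would predict 9/64 of users watching both; the observed 2/8 gives a lift above 1.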

Page 36: MasterSearch_Meetup_AdvancedAnalytics

Association rules: virtual items

User Movie

1 Blacklist

1 Startrek

1 James bond

2 Kill Bill

2 Pulp fiction

3 Stargate

3 Men in Black

An old trick with association rules mining is to add ‘virtual’ items

User Virtual item

1 Blacklist

1 Startrek

1 James bond

1 Male

1 [25-30) Y

2 Kill Bill

2 Pulp fiction

2 Female

2 [40-45) Y

3 Stargate

3 Men in Black

3 Male

3 [50-55) Y

Rules that now might appear are for example:

IF Male, [40-45), Startrek THEN James Bond

IF Female, [20-25), Kill Bill THEN Pulp Fiction
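The virtual-item trick amounts to appending user attributes to each basket before mining. A minimal sketch using the example users above, assuming the last Male / [50-55) pair belongs to user 3; `add_virtual_items` is a hypothetical helper name.

```python
# Viewing history per user, from the example table above.
views = {
    1: ["Blacklist", "Startrek", "James bond"],
    2: ["Kill Bill", "Pulp fiction"],
    3: ["Stargate", "Men in Black"],
}

# User attributes (gender, age band); user 3's values are an assumption.
profiles = {
    1: ("Male", "[25-30)"),
    2: ("Female", "[40-45)"),
    3: ("Male", "[50-55)"),
}

def add_virtual_items(views, profiles):
    """Return one basket per user: watched titles plus attribute 'items'."""
    baskets = {}
    for user, movies in views.items():
        gender, age_band = profiles[user]
        baskets[user] = set(movies) | {gender, age_band}
    return baskets

baskets = add_virtual_items(views, profiles)
```

The rule miner then treats "Male" or "[25-30)" like any other item, which is how rules mixing attributes and titles can appear.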

Page 37: MasterSearch_Meetup_AdvancedAnalytics

Association rules with R and Gephi

Page 38: MasterSearch_Meetup_AdvancedAnalytics

Power BI

Survival curves

Page 39: MasterSearch_Meetup_AdvancedAnalytics

Survival curve

At which moment in an episode do people stop watching?

Can we compare different episodes and series?

Survival Curves!!

For a specific episode of a specific series:

Take all Videoland streams: Starts / Stops from

Determine completion rate, and rank all streams on completion rate

Calculate empirical distribution F

Survival: S = 1 - F

Do this for all episodes and series
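The steps above can be sketched as follows, assuming the completion rate of a stream is the fraction of the episode that was watched. The eight rates and the evaluation grid are made up for illustration.

```python
def survival_curve(completion_rates, grid):
    """S(t) = 1 - F(t), with F the empirical distribution of the completion rates."""
    n = len(completion_rates)
    curve = []
    for t in grid:
        f_t = sum(rate <= t for rate in completion_rates) / n  # empirical F(t)
        curve.append((t, 1 - f_t))
    return curve

# Hypothetical completion rates for 8 streams of one episode.
rates = [0.05, 0.10, 0.40, 0.55, 0.90, 1.0, 1.0, 1.0]
curve = survival_curve(rates, [0.0, 0.25, 0.5, 0.75, 1.0])
```

Each point S(t) is the share of streams that were still being watched beyond fraction t of the episode, so curves of different episodes and series can be put in one chart.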
