MasterSearch_Meetup_AdvancedAnalytics

Data science @ RTL Nederland

Longhow Lam

@longhowlam

1st Master Search Advanced Analytics meetup

5 July 2017

https://www.linkedin.com/today/author/longhowlam

Agenda

RTL Data science set up

Some data science topics @RTL

Text mining

Computer vision

Association rules

Power BI

RTL Data science set up

Source data

Click data

Heartbeat data

Account data

Location

Metadata

Campaign data

Etc..

Data science team

4 data engineers

4 data scientists

The main businesses at RTL we work for

ETL processes

Find out for yourself: https://www.rtl.nl/werkenbij/

Use cases

Churn modeling

Response modeling

Customer segmentation

Look-alikes for advertisers

Recommendation engines

Content similarity

Which movies on Videoland are close to each other?

Which news articles on RTL Nieuws are close to each other?

For movies, we can look at the movie summaries or video captures

For news articles, we can look at the text of the articles or the corresponding news image

Hence text mining and computer vision

Text mining

2000 movie plots / summaries on VideoLand

For each movie plot: count the words / terms

Put the counts in a so-called term document matrix

There are around 50,000 terms in the 2000 movie plots

Usually this matrix is very sparse

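[Table: term document matrix. Rows are the movies (Film1, Film2, Film3, …, Film2000), columns are the terms (Aap, Film, Auto, …, Leven, …, Zwaar). Each cell holds the count of that term in that movie plot; most cells are empty.]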
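
The counting step can be sketched in a few lines of Python. This is only an illustration: the three plots are made-up stand-ins, not Videoland data, and splitting on whitespace is an assumed, very simple tokenizer.

```python
from collections import Counter

# Hypothetical mini-corpus standing in for the movie plots (not RTL data).
plots = {
    "Film1": "aap film film auto",
    "Film2": "leven leven zwaar",
    "Film3": "film leven",
}

# Count the terms in each plot (whitespace tokenizer, an assumption).
counts = {title: Counter(text.lower().split()) for title, text in plots.items()}

# The vocabulary defines the matrix columns.
vocabulary = sorted({term for c in counts.values() for term in c})

# Dense term document matrix: one row per movie, one column per term.
matrix = {title: [c[term] for term in vocabulary] for title, c in counts.items()}

# Sparsity: share of cells that are zero.
cells = len(matrix) * len(vocabulary)
zeros = sum(value == 0 for row in matrix.values() for value in row)
sparsity = zeros / cells
```

With 2000 plots and around 50,000 terms a dense layout like this wastes memory, which is why a sparse matrix representation is normally used in practice.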

Similarity: cosine similarity

Between movies (so between rows of the matrix) we can now calculate similarities

A measure that is often used is cosine similarity

Visually we can see this similarity in the following figure:

Suppose we only have two terms:

1. Leven

2. Spannend

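[Figure: Film 1 and Film 2 drawn as vectors in a plane with axes "# leven" and "# spannend"; the cosine similarity is the cosine of the angle between the two vectors.]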
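
For the two-term case, cosine similarity can be computed as below. This is a minimal sketch; the counts for the two films are invented for illustration, and returning 0.0 for an all-zero vector is a convention chosen here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_a * norm_b)

# Hypothetical counts of (leven, spannend) in two movie plots.
film1 = [3, 1]
film2 = [1, 3]
sim = cosine_similarity(film1, film2)
```

Because only the angle matters, a plot that uses the same words twice as often gets the same similarity score.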

Videoland: to get a feeling for movie similarities we created a small Shiny app

http://145.131.21.163:3838/sample-apps/ContentBasedSimilarityApp/

RTL Nieuws: Power BI dashboard to see article similarities

Computer vision

Two approaches used @RTL

Computer Vision API from vendors (Microsoft, Clarifai, Google, …)

+ Ready and easy to use (just send your image to them)

+ Not too expensive ($0.84 / 1000 images)

- No control over what is returned

Tweak things ourselves with Keras/TensorFlow

- Takes more effort to set up

- Needs more knowledge

+ More control over what you are doing

RTL Nieuws image: API examples

Feature Name Value

Description { "type": 0, "captions": [ { "text": "a group of people sitting on a table", "confidence": 0.4894670976127814 } ] }

Tags [ { "name": "person", "confidence": 0.996391236782074 }, { "name": "indoor", "confidence": 0.9104063510894775 }, { "name": "people", "confidence": 0.7057779431343079 } ]

Image Format Jpeg

Image Dimensions 4096 x 3078

Clip Art Type 0 Non-clipart

Line Drawing Type 0 Non-LineDrawing

Black & White Image False

Is Adult Content False

Adult Score 0.042066238820552826

Is Racy Content False

Racy Score 0.061784882098436356

Categories [ { "name": "people_many", "score": 0.9296875 } ]

Faces [ { "age": 52, "gender": "Male", "faceRectangle": { "width": 298, "height": 298, "left": 433, "top": 1370 } }, { "age": 78, "gender": "Male", "faceRectangle": { "width": 269, "height": 269, "left": 3212, "top": 1410 } }, { "age": 64, "gender": "Male", "faceRectangle": { "width": 241, "height": 241, "left": 2108, "top": 1534 } } ]

Feature Name Value

Description { "type": 0, "captions": [ { "text": "Linda de Mol talking on a cell phone", "confidence": 0.46178352459016536 } ] }

Tags [ { "name": "person", "confidence": 0.9999904632568359 }, { "name": "outdoor", "confidence": 0.9974232912063599 }, { "name": "woman", "confidence": 0.9967917799949646 }, { "name": "lady", "confidence": 0.7658315896987915 } ]

Image Format Jpeg

Image Dimensions 1024 x 421

Clip Art Type 0 Non-clipart

Line Drawing Type 0 Non-LineDrawing

Black & White Image False

Is Adult Content False

Adult Score 0.009753250516951084

Is Racy Content False

Racy Score 0.014254707843065262

Categories [ { "name": "people_portrait", "score": 0.96875 } ]

Faces [ { "age": 28, "gender": "Female", "faceRectangle": { "width": 282, "height": 282, "left": 286, "top": 35 } } ]
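
Since the API returns JSON, the values are easy to post-process. The sketch below keeps only the tags above a confidence threshold, using the Tags value from the first example; the function name and the 0.9 threshold are assumptions made for illustration.

```python
import json

# Tags value copied from the first example above.
tags_json = """[ { "name": "person", "confidence": 0.996391236782074 },
 { "name": "indoor", "confidence": 0.9104063510894775 },
 { "name": "people", "confidence": 0.7057779431343079 } ]"""

def confident_tags(raw, threshold=0.9):
    """Return tag names whose confidence is at least `threshold`."""
    return [tag["name"] for tag in json.loads(raw) if tag["confidence"] >= threshold]

kept = confident_tags(tags_json)
```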

RTL Nieuws image: API examples

See TrelliscopeJS app

https://longhowlam.github.io/computervision/

Tweak things ourselves with Keras

Keras is a high-level neural networks API running on top of either TensorFlow or Theano, and now also CNTK.

Developed for fast experimentation.

Easier to use than TensorFlow, but you still have lots of options

There is now also an R interface (of course created by RStudio…)

https://github.com/tensorflow/tensorflow

https://github.com/Theano/Theano

https://github.com/Microsoft/CNTK

Keras: Simple set-up “Architecture”

TensorFlow installed on a (Linux) machine, ideally with lots of GPUs

pip install keras

You’re good to go in Python (Jupyter notebooks)

devtools::install_github("rstudio/keras")

You’re good to go in R / RStudio

Example in R: Neural network with two hidden layers

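[Figure: network diagram with 784 input nodes (Pixel 1, Pixel 2, Pixel 3, …, Pixel 783, Pixel 784), two hidden layers, and 10 output nodes (Label 0, …, Label 9).]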
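
To make the diagram concrete, here is what a forward pass through such a network computes. This is a plain Python sketch, not the R/Keras code from the slide: the hidden layer sizes (32 and 16), the ReLU activations, and the random untrained weights are all assumptions chosen to keep it small and runnable.

```python
import math
import random

random.seed(0)

def relu(values):
    return [max(0.0, v) for v in values]

def softmax(values):
    m = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def dense(inputs, weights, biases, activation):
    """Fully connected layer: activation(W x + b)."""
    outputs = [sum(w * x for w, x in zip(row, inputs)) + b
               for row, b in zip(weights, biases)]
    return activation(outputs)

def random_layer(n_in, n_out):
    """Untrained weights, only to make the sketch runnable."""
    weights = [[random.uniform(-0.05, 0.05) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

# 784 pixels -> hidden (32) -> hidden (16) -> 10 labels.
layers = [random_layer(784, 32), random_layer(32, 16), random_layer(16, 10)]

pixels = [random.random() for _ in range(784)]  # stand-in for one 28x28 image
h1 = dense(pixels, *layers[0], relu)
h2 = dense(h1, *layers[1], relu)
probabilities = dense(h2, *layers[2], softmax)
predicted_label = probabilities.index(max(probabilities))
```

Training would adjust the weights so that the highest probability lands on the correct label; Keras handles that part, along with running the computation on a GPU.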

Using pre-trained models

Image classifiers have been trained on big GPU machines for weeks with millions of pictures on very large networks

Not many people do that from scratch. Instead, one can use pre-trained networks and start from there.

VGG19 deep learning model: 143 million weights!!!

Predict image class using pre-trained models

RTL Nieuws images labeled with ResNet and VGG16

Link to TrelliscopeJS app

https://longhowlam.github.io/RTLNieuws_pretrainedVGG16/

Extract features using pre-trained models

Remove top layers for feature extraction

We have a 7*7*512 ‘feature’ tensor = 25,088 values

Only a few lines of R code

RTL NIEUWS Image similarity

1024 RTL Nieuws sample pictures. Compute for each image the 25,088 feature values.

Calculate for each image the top 10 closest images, based on cosine similarity.
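
The top 10 lookup can be sketched as follows. The feature vectors here are random stand-ins with 64 values each, not real network features, and the helper names are hypothetical.

```python
import heapq
import math
import random

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0

def top_k_similar(features, query_id, k=10):
    """Return the k image ids most similar to `query_id`, best first."""
    query = features[query_id]
    scores = ((cosine_similarity(query, vec), other)
              for other, vec in features.items() if other != query_id)
    return [other for _, other in heapq.nlargest(k, scores)]

# Hypothetical stand-in: 50 images with 64 random features each
# (the slides use 1024 images with 25,088 features each).
random.seed(1)
features = {f"img{i}": [random.random() for _ in range(64)] for i in range(50)}
features["img_copy"] = list(features["img0"])  # exact duplicate of img0

neighbours = top_k_similar(features, "img0")
```

The duplicate image comes out first, as expected. For 1024 images this brute-force comparison is cheap; for much larger collections an approximate nearest-neighbour index would be the usual choice.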

Little Shiny app

http://5.100.228.219:3838/sample-apps/rtlnieuws_image_sim

Examples RTL Nieuws image similarities

The Brad Pitt Similarity index

Take five Brad Pitt pictures

Run them through the pre-trained VGG16 and extract feature vectors. This is a 5 by 25088 matrix

The Brad Pitt index

Take other images, run them through the VGG16

Calculate the distances with the five Brad Pitt pictures and average:

0.771195 0.802654 0.714752 0.792587 0.8291976 0.8096944 0.665990 0.9737212

0.6273 0.5908 0.8231 0.7711 0.8839 0.8975 0.6934 0.9659

Focusing on only the face!!
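
The averaging step can be sketched like this. The slides do not say which distance was used, so cosine distance (1 minus cosine similarity) is an assumption here, and the tiny 3-dimensional vectors are stand-ins for the 25088-dimensional features.

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms

def similarity_index(reference_vectors, candidate):
    """Average distance from `candidate` to each reference feature vector."""
    distances = [cosine_distance(ref, candidate) for ref in reference_vectors]
    return sum(distances) / len(distances)

# Hypothetical 3-dimensional stand-ins for the real feature vectors.
references = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
same_direction = similarity_index(references, [2.0, 0.0, 0.0])
orthogonal = similarity_index(references, [0.0, 1.0, 0.0])
```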

Can you shake hands with your neighbor?

A little Statistical Experiment



50.1% of people don’t wash their hands after visiting the toilet




84.6% of all statistics are just made up on the spot !!

Association Rules Mining

Market basket analysis

Association rules mining (arm)

Mixture of different methods

Ensemble

ARM is one of several so-called collaborative filtering algorithms

Collaborative filtering is a method of making recommendations about the interests of one user (filter) by collecting preferences or behavior from many users (collaborating).

Memory-based algorithms: Slope One (slope1), k-nearest neighbors (knn)

Model-based algorithms: matrix factorization methods

Association rule mining

The basics

Identify frequent item sets (or rules) in the customer transaction data:

IF item X THEN item Y

IF items A and B THEN very likely item C

Not all rules are interesting; use ‘support’ and ‘lift’ to judge the importance of a rule

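$$\mathrm{Support}(X,Y) = \frac{\#\,\text{trxs. containing } X \text{ and } Y}{\text{Total } \#\,\text{trxs.}}$$

$$\mathrm{Lift}(X,Y) = \frac{\mathrm{Support}(X,Y)}{\mathrm{Support}(X)\cdot\mathrm{Support}(Y)}$$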

Support & Lift

GTST, Nieuwe Tijden: 10.8%

Star Trek, GTST: 0.018%

For example, a lift of 2.5 means: people who have watched movie X are 2.5 times as likely to watch movie Y as viewers in general
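
Support and lift are straightforward to compute from raw transactions. The sketch below uses eight invented viewing histories; the titles X, Y and Z are placeholders.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for basket in transactions if items <= basket)
    return hits / len(transactions)

def lift(transactions, x, y):
    """Support(X,Y) / (Support(X) * Support(Y))."""
    return support(transactions, [x, y]) / (
        support(transactions, [x]) * support(transactions, [y]))

# Hypothetical viewing histories, one set of titles per user.
transactions = [
    {"X", "Y"},
    {"X", "Y"},
    {"X"},
    {"Y"},
    {"Z"},
    {"Z"},
    {"Z"},
    {"Z"},
]

support_xy = support(transactions, ["X", "Y"])  # 2 of 8 users watched both
lift_xy = lift(transactions, "X", "Y")          # 0.25 / (0.375 * 0.375)
```

A lift above 1 means X and Y are watched together more often than would be expected if they were independent.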

Association rules virtual items

User Movie

1 Blacklist

1 Startrek

1 James bond

2 Kill Bill

2 Pulp fiction

3 Stargate

3 Men in Black

An old trick with association rules mining is to add ‘virtual’ items

User Virtual item

1 Blacklist

1 Startrek

1 James bond

1 Male

1 [25-30) Y

2 Kill Bill

2 Pulp fiction

2 Female

2 [40-45) Y

3 Stargate

3 Men in Black

3 Male

3 [50-55) Y

Rules that now might appear are for example:

Male, [40-45), Startrek → James Bond

Female, [20-25), Kill Bill → Pulp Fiction
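
Adding the virtual items amounts to appending each user's attributes to their basket before mining. A minimal sketch using the users from the tables above; the function name is hypothetical, and it is assumed that the "Male" and "[50-55)" rows belong to user 3.

```python
# Movies per user, as in the first table.
watched = {
    1: ["Blacklist", "Startrek", "James bond"],
    2: ["Kill Bill", "Pulp fiction"],
    3: ["Stargate", "Men in Black"],
}

# Gender and age bucket per user (user 3's attributes are an assumption).
profiles = {
    1: ("Male", "[25-30)"),
    2: ("Female", "[40-45)"),
    3: ("Male", "[50-55)"),
}

def add_virtual_items(watched, profiles):
    """Append each user's attributes to their basket as extra items."""
    return {user: list(movies) + list(profiles.get(user, ()))
            for user, movies in watched.items()}

baskets = add_virtual_items(watched, profiles)
```

The rule miner then treats "Male" or "[25-30)" like any other item, so they can appear on the left-hand side of a rule.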

Association rules with R and Gephi

Power BI

Survival curves

Survival curve

At which moment in an episode do people stop watching?

Can we compare different episodes and series?

Survival Curves!!

For a specific episode of a specific series:

Take all Videoland streams: Starts / Stops from

Determine completion rate, and rank all streams on completion rate

Calculate empirical distribution F

Survival: S = 1 - F

Do this for all episodes and series
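
The survival curve for one episode can be sketched as follows. The completion rates are invented, and evaluating the curve on a fixed grid of time points is a choice made for this illustration.

```python
def survival_curve(completion_rates, grid):
    """Empirical survival S(t) = 1 - F(t) at each point of `grid`.

    F(t) is the share of streams with completion rate <= t.
    """
    n = len(completion_rates)
    curve = []
    for t in grid:
        f_t = sum(1 for rate in completion_rates if rate <= t) / n
        curve.append(1.0 - f_t)
    return curve

# Hypothetical completion rates (fraction of the episode watched) for five streams.
rates = [0.1, 0.5, 0.5, 0.9, 1.0]
grid = [0.0, 0.25, 0.5, 0.75, 1.0]
curve = survival_curve(rates, grid)
```

Each value is the share of viewers still watching beyond that point in the episode, so curves for different episodes and series can be compared directly.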

Date post:	24-Jan-2018
Category:	Data & Analytics
Upload:	longhow-lam