Date post: | 24-Jan-2018 |
Category: | Data & Analytics |
Upload: | longhow-lam |
Data science @ RTL Nederland
Longhow Lam
@longhowlam
1st Master Search Advanced Analytics meetup
5 July 2017
Agenda
RTL Data science set up
Some data science topics @RTL
Text mining
Computer vision
Association rules
Power BI
RTL Data science set up
Source data
Click data
Heartbeat data
Account data
Location
Metadata
Campaign data
Etc.
Data science team
4 data engineers
4 data scientists
The main businesses at RTL we work for
ETL processes
Find out for yourself: https://www.rtl.nl/werkenbij/
Use cases
Churn modeling
Response modeling
Customer segmentation
Look-alikes for advertisers
Recommendation engines
Content similarity
Which movies on Videoland are close to each other?
Which news articles on RTL Nieuws are close to each other?
For movies, we can look at the movie summaries or video captures
For news articles, we can look at the text of the articles or the corresponding news image
Hence text mining and computer vision
Text mining
2000 movie plots / summaries on Videoland
For each movie plot: count the words / terms
Put the counts in a so-called term document matrix
There are around 50,000 terms in the 2000 movie plots
Usually this matrix is very sparse
Aap Film Auto …. Leven …. …. Zwaar
Film1 1 4 8
Film2 10
Film3 5
… 1
…
… 6 6
Film2000 1 8
Term document matrix
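The counting step above can be sketched in a few lines of Python. This is a minimal illustration, not the code used at RTL: the three toy plots, the simple `\w+` tokenizer, and the helper names `term_document_matrix` and `dense_row` are all assumptions made for the example. Rows are stored as counters so that only non-zero counts are kept, which mirrors the sparsity of the real matrix.

```python
import re
from collections import Counter


def term_document_matrix(plots):
    """Count terms per plot; return the sorted vocabulary and one sparse row per plot."""
    # Each row is a Counter holding only the terms that occur (sparse storage).
    rows = [Counter(re.findall(r"\w+", plot.lower())) for plot in plots]
    vocabulary = sorted(set().union(*rows))
    return vocabulary, rows


def dense_row(row, vocabulary):
    """Expand one sparse row into a full list of counts, zeros included."""
    return [row[term] for term in vocabulary]


# Hypothetical toy plots, not real Videoland summaries.
plots = [
    "de aap rijdt auto",
    "het leven is zwaar",
    "aap en auto en aap",
]
vocabulary, rows = term_document_matrix(plots)
print(vocabulary)
for name, row in zip(["Film1", "Film2", "Film3"], rows):
    print(name, dense_row(row, vocabulary))
```

With 2000 plots and around 50,000 terms, the dense form would hold 100 million cells, most of them zero, which is why the sparse rows are the practical representation.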
Similarity: cosine similarity
Between movies (so between rows of the matrix) we can now calculate similarities
A measure that is often used is cosine similarity
Visually we can see this measure in the following figure:
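Suppose we only have two terms:
1. Leven (life)
2. Spannend (exciting)
[Figure: Film 1 and Film 2 drawn as vectors in a plane with axes "# Leven" and "# Spannend"; the cosine similarity is the cosine of the angle between the two vectors]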
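For the two-term case, the calculation can be sketched as follows. The term counts for the two films are made-up numbers for illustration, and the function name `cosine_similarity` is just a label chosen here; the formula itself is the standard one, the dot product divided by the product of the vector lengths.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two count vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_a * norm_b)


# Hypothetical counts: (# Leven, # Spannend)
film1 = [4, 1]
film2 = [1, 3]
print(round(cosine_similarity(film1, film2), 4))
```

Because only the angle matters, a long plot and a short plot with the same mix of terms still get a similarity of 1.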
Videoland: to get a feeling for movie similarities, we created a small Shiny app
RTL Nieuws: Power BI dashboard to see article similarities
Computer vision
Two approaches used @RTL
1. Computer vision API from vendors (Microsoft, Clarifai, Google, …)
+ Ready and easy to use (just send your image to them)
+ Not too expensive ($0.84 / 1000 images)
- No control over what is returned
2. Tweak things ourselves with Keras/TensorFlow
- Takes more effort to set up
- Needs more knowledge
+ More control over what you are doing
RTL Nieuws image: API examples
Feature Name Value
Description { "type": 0, "captions": [ { "text": "a group of people sitting on a table", "confidence": 0.4894670976127814 } ] }
Tags [ { "name": "person", "confidence": 0.996391236782074 }, { "name": "indoor", "confidence": 0.9104063510894775 }, { "name": "people", "confidence": 0.7057779431343079 } ]
Image Format Jpeg
Image Dimensions 4096 x 3078
Clip Art Type 0 Non-clipart
Line Drawing Type 0 Non-LineDrawing
Black & White Image False
Is Adult Content False
Adult Score 0.042066238820552826
Is Racy Content False
Racy Score 0.061784882098436356
Categories [ { "name": "people_many", "score": 0.9296875 } ]
Faces [ { "age": 52, "gender": "Male", "faceRectangle": { "width": 298, "height": 298, "left": 433, "top": 1370 } }, { "age": 78, "gender": "Male", "faceRectangle": { "width": 269, "height": 269, "left": 3212, "top": 1410 } }, { "age": 64, "gender": "Male", "faceRectangle": { "width": 241, "height": 241, "left": 2108, "top": 1534 } } ]
Feature Name Value
Description { "type": 0, "captions": [ { "text": "Linda de Mol talking on a cell phone", "confidence": 0.46178352459016536 } ] }
Tags [ { "name": "person", "confidence": 0.9999904632568359 }, { "name": "outdoor", "confidence": 0.9974232912063599 }, { "name": "woman", "confidence": 0.9967917799949646 }, { "name": "lady", "confidence": 0.7658315896987915 } ]
Image Format Jpeg
Image Dimensions 1024 x 421
Clip Art Type 0 Non-clipart
Line Drawing Type 0 Non-LineDrawing
Black & White Image False
Is Adult Content False
Adult Score 0.009753250516951084
Is Racy Content False
Racy Score 0.014254707843065262
Categories [ { "name": "people_portrait", "score": 0.96875 } ]
Faces [ { "age": 28, "gender": "Female", "faceRectangle": { "width": 282, "height": 282, "left": 286, "top": 35 } } ]
Tweak things ourselves with Keras
Keras is a high-level neural networks API running on top of either TensorFlow or Theano, and now also CNTK.
Developed for fast experimentation.
Easier to use than TensorFlow, but you still have lots of options
There is now also an R interface (of course created by RStudio…)
Keras: Simple set-up “Architecture”
TensorFlow installed on a (Linux) machine, ideally with lots of GPUs
pip install keras
You’re good to go in Python (Jupyter notebooks)
devtools::install_github("rstudio/keras")
You’re good to go in R / RStudio
Example in R: Neural network with two hidden layers
[Figure: network diagram with inputs Pixel 1 … Pixel 784, two hidden layers, and outputs Label 0 … Label 9]
Using pre-trained models
Image classifiers have been trained on big GPU machines for weeks with millions of pictures on very large networks
Not many people do that from scratch. Instead, one can use pre-trained networks and start from there.
VGG19 deep learning model: 143 million weights!!!
Predict image class using pre-trained models
RTL Nieuws images labeled with ResNet and VGG16
Link to trellisJS app
Extract features using pre-trained models
Remove top layers for feature extraction
We have a 7*7*512 ‘feature’ tensor = 25,088 values
Only a few lines of R code
RTL NIEUWS Image similarity
1024 RTL Nieuws sample pictures. Compute for each image the 25,088 feature values.
Calculate for each image the top 10 closest images, based on cosine similarity.
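The top-10 lookup can be sketched as a brute-force comparison of every image with every other image. This is an illustration only: the four short vectors stand in for the 25,088-value feature vectors, the image names are invented, and `top_k_similar` is a hypothetical helper, not the code behind the Shiny app. For 1024 images a brute-force pass like this is still cheap; a vectorised matrix product would be the usual next step.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(features, k=10):
    """For each image, return its k most similar other images, best first."""
    result = {}
    for name, vector in features.items():
        scored = [
            (other, cosine_similarity(vector, other_vector))
            for other, other_vector in features.items()
            if other != name  # an image is not its own neighbour
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        result[name] = scored[:k]
    return result


# Hypothetical 3-value vectors standing in for the real feature vectors.
features = {
    "img_a": [1.0, 0.0, 0.0],
    "img_b": [0.9, 0.1, 0.0],
    "img_c": [0.0, 1.0, 0.0],
    "img_d": [0.0, 0.9, 0.4],
}
neighbours = top_k_similar(features, k=2)
for name, closest in neighbours.items():
    print(name, [(other, round(score, 3)) for other, score in closest])
```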
Little Shiny app
Examples RTL Nieuws image similarities
The Brad Pitt Similarity index
Take five Brad Pitt pictures
Run them through the pre-trained VGG16 and extract feature vectors. This is a 5 by 25,088 matrix
The Brad Pitt index
Take other images, run them through the VGG16
Calculate the distances with the five Brad Pitt pictures and average:
0.771195 0.802654 0.714752 0.792587 0.8291976 0.8096944 0.665990 0.9737212
0.6273 0.5908 0.8231 0.7711 0.8839 0.8975 0.6934 0.9659
Focusing on only the face!!
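The averaging step behind such an index can be sketched as below. The slides do not say which measure was used for the index, so cosine similarity is assumed here because it is the measure used for the image similarities earlier; the two-value vectors and the name `similarity_index` are placeholders for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_index(candidate, references):
    """Average similarity between one image and a set of reference images."""
    scores = [cosine_similarity(candidate, reference) for reference in references]
    return sum(scores) / len(scores)


# Hypothetical stand-ins for the reference feature vectors (assumed values).
references = [[1.0, 0.0], [0.0, 1.0]]
print(similarity_index([1.0, 0.0], references))
print(round(similarity_index([1.0, 1.0], references), 4))
```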
Can you shake hands with your neighbor?
A little Statistical Experiment
50.1% of people don’t wash their hands after visiting the toilet
84.6% of all statistics are just made up on the spot!!
Association Rules Mining
Market basket analysis
Association rules mining (ARM)
Mixture of different methods
Ensemble
ARM is one of several so-called collaborative filtering algorithms
Collaborative filtering is a method of making recommendations about the interests of one user (filter) by collecting preferences or behavior from many users (collaborating).
Memory-based algorithms: Slope One (slope1), k-nearest neighbors (knn)
Model-based algorithms: matrix factorization methods
Association rule mining
The basics
Identify frequent item sets (or rules) in the customer transaction data:
IF item X THEN item Y
IF item A and B THEN very likely item C
Not all rules are interesting; use ‘support’ and ‘lift’ to judge the importance of a rule
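$$\mathrm{Support}(X,Y) = \frac{\#\ \text{trxs. containing } \{X\} \text{ and } \{Y\}}{\text{Total } \#\ \text{trxs.}}$$
$$\mathrm{Lift}(X,Y) = \frac{\mathrm{Support}(X,Y)}{\mathrm{Support}(X) \cdot \mathrm{Support}(Y)}$$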
Support & Lift
GTST Nieuwe Tijden 10.8%
Star Trek GTST 0.018%
For example, a lift of 2.5 means: people who have watched movie X are 2.5 times more likely to watch movie Y than viewers in general
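Support and lift can be computed directly from their definitions. The sketch below uses a handful of invented viewing baskets with placeholder titles X, Y and Z; the function names `support` and `lift` are chosen for the example, and a real analysis would more likely use a dedicated package than this hand-rolled version.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    wanted = set(items)
    hits = sum(1 for basket in transactions if wanted <= set(basket))
    return hits / len(transactions)


def lift(transactions, x, y):
    """Support(X,Y) divided by Support(X) * Support(Y)."""
    denominator = support(transactions, [x]) * support(transactions, [y])
    if denominator == 0:
        return 0.0  # convention chosen here when X or Y never occurs
    return support(transactions, [x, y]) / denominator


# Hypothetical viewing baskets, one set of titles per user.
transactions = [
    {"X", "Y"},
    {"X", "Y"},
    {"X"},
    {"Z"},
    {"Y", "Z"},
]
print(support(transactions, ["X", "Y"]))
print(round(lift(transactions, "X", "Y"), 3))
```

A lift above 1 means the two titles are watched together more often than independence would predict; a lift below 1 means less often.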
Association rules: virtual items
User Movie
1 Blacklist
1 Startrek
1 James bond
2 Kill Bill
2 Pulp fiction
3 Stargate
3 Men in Black
An old trick with association rules mining is to add ‘virtual’ items
User Virtual item
1 Blacklist
1 Startrek
1 James bond
1 Male
1 [25-30) Y
2 Kill Bill
2 Pulp fiction
2 Female
2 [40-45) Y
3 Stargate
3 Men in Black
3 Male
3 [50-55) Y
Rules that now might appear are for example:
Male, [40-45), Star Trek → James Bond
Female, [20-25), Kill Bill → Pulp Fiction
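Adding virtual items amounts to appending extra entries to each user's basket before the rules are mined. The sketch below assumes a user profile with a gender and an age; the ages, the 5-year bracket width, and the helper names `age_bracket` and `add_virtual_items` are illustrative assumptions, not taken from the RTL pipeline.

```python
def age_bracket(age, width=5):
    """Label such as '[25-30)' for the half-open bracket containing `age`."""
    lower = (age // width) * width
    return f"[{lower}-{lower + width})"


def add_virtual_items(baskets, profiles):
    """Return new baskets with gender and age bracket added as extra items."""
    extended = {}
    for user, movies in baskets.items():
        gender, age = profiles[user]
        extended[user] = set(movies) | {gender, age_bracket(age)}
    return extended


baskets = {
    1: {"Blacklist", "Startrek", "James bond"},
    2: {"Kill Bill", "Pulp fiction"},
    3: {"Stargate", "Men in Black"},
}
# Hypothetical profiles: (gender, age in years).
profiles = {1: ("Male", 27), 2: ("Female", 42), 3: ("Male", 52)}

for user, items in add_virtual_items(baskets, profiles).items():
    print(user, sorted(items))
```

The rule miner then treats "Male" or "[40-45)" like any other item, so they can show up on the left-hand side of a rule.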
Association rules with R and Gephi
Power BI
Survival curves
At which moment in an episode do people stop watching?
Can we compare different episodes and series?
Survival Curves!!
For a specific episode of a specific series:
Take all Videoland streams: Starts / Stops from
Determine completion rate, and rank all streams on completion rate
Calculate empirical distribution F
Survival: S = 1 - F
Do this for all episodes and series
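The steps for one episode can be sketched as follows. It assumes that each stream has already been reduced to a completion rate between 0 and 1 (the fraction of the episode watched); the five rates and the function name `survival_curve` are made up for the example.

```python
from bisect import bisect_right


def survival_curve(completion_rates, grid):
    """Return (x, S(x)) pairs, where S(x) = 1 - F(x) and F is the empirical distribution."""
    ordered = sorted(completion_rates)  # rank the streams on completion rate
    n = len(ordered)
    curve = []
    for x in grid:
        f_x = bisect_right(ordered, x) / n  # share of streams that stopped at or before x
        curve.append((x, 1.0 - f_x))
    return curve


# Hypothetical completion rates for five streams of one episode.
rates = [0.1, 0.25, 0.5, 0.9, 1.0]
grid = [0.0, 0.25, 0.5, 0.75, 1.0]
for x, s in survival_curve(rates, grid):
    print(f"{x:.2f} {s:.2f}")
```

Because every episode is put on the same 0 to 1 completion scale, curves for episodes of different lengths can be drawn in one chart and compared.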