Cloudera Movies Data Science Project On Big Data
-Abhishek M Shivalingaiah
Contents
• Project Introduction
• Data Exploration
• Data Cleaning
• Data Classification
• Data Clustering
• Predicting User Ratings
Project Introduction
Background
Cloudera Movies is an internet on-demand streaming video service whose viewership is rising. To cater to the rising demand for the content they provide, they plan to expand their hardware infrastructure and improve their software stack. They also want to identify their consumer base and build a recommendation system.
Objective
As the data science team, we are tasked to:
• understand which user accounts are used most often by younger viewers
• segment sessions based on customer behavior to improve the site's usability
• build a recommendation engine for Cloudera users to increase time on site and reduce their churn
Data Exploration
Data sources and infrastructure:
• Cloudera Node: 17 GB of JSON log files (68 files)
• Heckle: 103 MB of JSON log files (10 files)
• Jeckle: 99 MB of JSON log files (10 files)
• ~600K lines of log in total
• HDFS (rather than the local disk file storage system) serves as the data lake for the streaming log data
• Map-Reduce jobs for processing
Data Exploration contd..
Log record fields: auth, createdAt, payload, refId, sessionId, type, user, userAgent
[Chart: distribution of the 19 event type values, N = 612,873; dominant shares 91.14% and 3.20%]
• Majority of events are content playback events
• 1000000 < USER ID < 100000000
• Presence of non-numeric ITEM IDs
[Charts: Avg. User Rating, Avg. Write Review]
• Timestamps in different time zones (UTC, UTC-8 and others)
• Account events: Update Password, Update Payment Info, Parental Controls
Data Exploration contd..
Exploration and reviewing — Unix commands in the Hadoop shell:
• grep -Eo with regular expressions
• cat, awk for reading and extracting unique values
Aggregation, summarization and debugging — Map-Reduce architecture:
• Python libraries: json, sys (an event-type counting sketch follows below)
• uniq -c, user-defined reducers, bash -c
• Job Detail Page (Pass/Killed in the logs column)
• Job Failure Page (identifying the problem in the stack trace of the error column)
• Task Logs (to spot the error in the log)
• Common failure encountered: Broken Pipe Error
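As a concrete illustration of the json/sys exploration described above, here is a minimal sketch that tallies event types from newline-delimited JSON logs, in the spirit of a Hadoop Streaming mapper piped into uniq -c. The field name `type` comes from the log schema listed earlier; the input file name in the usage note is hypothetical.

#!/usr/bin/env python
# Minimal sketch: count event "type" values in newline-delimited JSON logs.
# Assumes one JSON object per line with a top-level "type" field (per the
# schema on the previous slide).
import json
import sys
from collections import Counter

def count_event_types(lines):
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except ValueError:
            continue  # skip malformed lines rather than failing the stream
        counts[record.get("type", "UNKNOWN")] += 1
    return counts

if __name__ == "__main__":
    counts = count_event_types(sys.stdin)
    total = sum(counts.values())
    for event_type, n in counts.most_common():
        print("%s\t%d\t%.2f%%" % (event_type, n, 100.0 * n / total))

Usage (hypothetical file name): cat heckle-0001.log | python count_event_types.py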
Data Cleaning
Objective
• Summarize the data into a less verbose format, so that the volume of data is reduced for further analysis
• Fix the issues identified in the exploration phase
Issues to address:
• camelCase, snake_case and spelling variances
• Broken Pipe errors and Key errors during processing
• Many non-meaningful variables with low variance
• Handling missing values
• Same account, different behavior
Data Cleaning contd..
Cleaned record format (KEY : VALUE): UserID as the key, with Timestamp, SessionID, Flags and Parameters as the value (comma/tab-separated)
Python tools used: dateutil.parser, json, sys, with tzinfo, timedelta and datetime for timestamp handling (see the sketch below)
Action flag encoding:
- C -> Recommendation
- L -> Login
- l -> Logout
- P -> Popular
- R -> Recommended
- S -> Searched
- a -> Queued
- c -> Subactions of an account
- h -> Hover
- i -> Browsed
- p -> Marker
- q -> ReviewedQueue
- r -> Recent
- t -> Rating
- v -> Verified Password
- w -> Review
- x -> Parental Control
Summarized output size: ~4 MB
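The mixed-offset timestamps flagged during exploration (UTC, UTC-8 and others) can be normalized with dateutil as listed above. A minimal sketch, assuming the timestamp strings carry their own offset; the field name `createdAt` comes from the exploration slide, and the sample values are hypothetical.

# Minimal sketch: normalize mixed-offset timestamps to UTC with dateutil.
# Sample values are hypothetical; naive timestamps are treated as already UTC.
from datetime import timezone
from dateutil import parser

def to_utc_iso(timestamp_string):
    """Parse a timestamp in any offset and return it as a UTC ISO string."""
    dt = parser.parse(timestamp_string)
    if dt.tzinfo is None:                 # assume naive values are already UTC
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc).isoformat()

if __name__ == "__main__":
    for raw in ["2014-05-06T07:00:00-08:00", "2014-05-06T15:00:00Z"]:
        print(raw, "->", to_utc_iso(raw))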
Classifying Users into Children and Adults
The purpose of this section is to help the Cloudera Movies legal team understand which user accounts are used most often by younger viewers.
Information used in this classification problem:
• Parental control events label an account as a child account.
• Only adult accounts are able to perform account control operations such as changing the password.
• We use that information to label some of the accounts, and then use those known labels to infer the unknown ones.
Approach to solve the classification problem
We could use a logit or probit model to classify users as children or adults (a baseline sketch follows below).
Challenges in this approach:
• Learning labels from a behavior profile is hard: having parental controls enabled does not guarantee anything about the user's behavior, so the signal is difficult to tease out.
• Creating features from the content viewed: we can count each item of content as a boolean feature, which fits well with logistic regression.
• Scale: 2,000 users, each of whom viewed zero or more content items, and over 8,500 content items, each of which was viewed by zero or more users.
• There is a strict separation between the items viewed by accounts with parental controls enabled and those viewed by accounts without.
• This suggests propagating labels from known labeled users to unknown users based on the content viewed.
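For the baseline logit idea above, here is a minimal sketch of boolean content-view features fed into logistic regression. scikit-learn and the toy view lists are assumptions made purely for illustration; the deck ultimately uses the SimRank label-propagation approach described next.

# Minimal sketch: boolean "viewed item" features + a logistic regression baseline.
# scikit-learn and the toy data below are assumptions for illustration only.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical labeled accounts: items each user viewed, plus a known label.
views = {
    "user_a": ["item_1", "item_2"],   # adult (changed password)
    "user_b": ["item_7", "item_8"],   # child (parental controls enabled)
    "user_c": ["item_1", "item_3"],   # adult
}
labels = {"user_a": "adult", "user_b": "child", "user_c": "adult"}

# One boolean feature per content item viewed.
vectorizer = DictVectorizer(sparse=True)
X = vectorizer.fit_transform([{item: True for item in views[u]} for u in views])
y = [labels[u] for u in views]

model = LogisticRegression()
model.fit(X, y)

# Score an unlabeled account by the items it viewed.
unknown = vectorizer.transform([{"item_2": True, "item_3": True}])
print(model.predict(unknown))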
Approach to solve the classification problem (continued)
SimRank approach: propagate influence from labeled to unlabeled accounts based on distances and similarities between users and content items.
The process
Extracting the content items played (a mapper/reducer sketch follows below):
• Write a mapper program that emits a compound key (userid, start, end) and a compound value ('kid' label, item string)
• Write a reducer program to identify the adults, the kids and the content they viewed
Preparing for the SimRank algorithm:
• Build the teleport sets for adults and kids
• Create training and test sets for both (80/20 split)
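A minimal Hadoop-Streaming-style sketch of the extraction step above, simplified to a per-user key. The field names `user`, `type` and `refId` come from the log schema shown earlier, but the concrete event-type strings ("Play", "ParentalControls", "UpdatePassword", "UpdatePaymentInfo") and the use of refId as the content item ID are assumptions.

#!/usr/bin/env python
# Minimal Hadoop Streaming sketch:
#   mapper  -> emits "userid <TAB> ITEM|LABEL <TAB> value" per relevant event
#   reducer -> collects the items each user played plus any known label
# Event-type names and the meaning of refId are assumptions for illustration.
import json
import sys

def mapper(stdin=sys.stdin, stdout=sys.stdout):
    for line in stdin:
        try:
            event = json.loads(line)
        except ValueError:
            continue
        user, etype = event.get("user"), event.get("type")
        if etype == "Play":                                  # assumed playback event
            stdout.write("%s\tITEM\t%s\n" % (user, event.get("refId")))
        elif etype == "ParentalControls":                    # marks a known child account
            stdout.write("%s\tLABEL\tkid\n" % user)
        elif etype in ("UpdatePassword", "UpdatePaymentInfo"):  # adult-only actions
            stdout.write("%s\tLABEL\tadult\n" % user)

def reducer(stdin=sys.stdin, stdout=sys.stdout):
    # Hadoop Streaming delivers mapper output sorted by key (userid).
    current, label, items = None, "unknown", set()
    def flush():
        if current is not None:
            stdout.write("%s\t%s\t%s\n" % (current, label, ",".join(sorted(items))))
    for line in stdin:
        user, kind, value = line.rstrip("\n").split("\t")
        if user != current:
            flush()
            current, label, items = user, "unknown", set()
        if kind == "LABEL":
            label = value
        else:
            items.add(value)
    flush()

if __name__ == "__main__":
    (mapper if sys.argv[1:] == ["map"] else reducer)()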
The process (continued)
Adjacency matrix:
• Write a mapper program that outputs each content item together with a user who viewed it, then a reducer that aggregates all the users who viewed that item, e.g. (item1, (user1, user2, user3))
Implementing SimRank (a sketch of one iteration follows below):
• The mapper reads in the current SimRank vector and computes the matrix product with the adjacency matrix: for each non-zero entry in a column, multiply by the corresponding entry in the SimRank vector and emit the result with the row label as the key.
• The reducer sums up the intermediate values for each row, adds in the teleport contribution, and emits the final sum as the value for that row in the new SimRank vector.
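A minimal in-memory sketch of one such iteration; the map/reduce version distributes exactly this matrix-vector product and teleport addition. The tiny adjacency matrix, teleport set and damping value below are illustrative assumptions.

# Minimal in-memory sketch of one propagation iteration:
# new_v = beta * A @ v + (1 - beta) * teleport, with A column-stochastic.
# All values below are illustrative assumptions.
import numpy as np

def simrank_iteration(adjacency, vector, teleport, beta=0.85):
    return beta * adjacency.dot(vector) + (1.0 - beta) * teleport

if __name__ == "__main__":
    # Tiny 3-node example: column j lists where node j's mass flows.
    A = np.array([[0.0, 0.5, 1.0],
                  [0.5, 0.0, 0.0],
                  [0.5, 0.5, 0.0]])
    teleport = np.array([1.0, 0.0, 0.0])     # e.g. the known-adult teleport set
    v = np.full(3, 1.0 / 3.0)
    for _ in range(20):                      # iterate until the vector settles
        v = simrank_iteration(A, v, teleport)
    print(v)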
The process (continued)
Interpreting and comparing the SimRank vectors (see the sketch below):
• Normalize the vectors and assign each a sign
• Assign the label based on the larger absolute value
• Classify the observation as adult or child
• Testing on the validation set, we obtain 99.64% accuracy with 9 mislabeled records
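A minimal sketch of that comparison, assuming each user ends up with two SimRank scores (one from the adult-seeded run, one from the kid-seeded run); the L1 normalization, the sign convention and the toy numbers are assumptions.

# Minimal sketch: compare the adult- and kid-seeded SimRank scores per user
# and label each user by the larger normalized score.  The normalization and
# toy numbers are assumptions for illustration.
import numpy as np

def classify(adult_scores, kid_scores):
    a = adult_scores / np.abs(adult_scores).sum()   # normalize each vector
    k = kid_scores / np.abs(kid_scores).sum()
    combined = a - k                                # positive -> adult, negative -> child
    return np.where(combined >= 0, "adult", "child")

if __name__ == "__main__":
    adult = np.array([0.40, 0.05, 0.30])
    kid = np.array([0.02, 0.50, 0.01])
    print(classify(adult, kid))                     # ['adult' 'child' 'adult']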
Data Clustering
We look to profile consumer behavior.
Clustering user sessions reveals:
• the number of natural groupings
• which behavioral group a session belongs to
The end goals of clustering are to:
• look for notable behavior groupings, such as a large group of sessions where the user searches unsuccessfully several times and then watches a video from the home page
• flag sessions that are outliers within a grouping, as these may represent anomalies of interest such as bots, fraud or system errors
• identify patterns reflective of the groups (for system optimization)
Steps involved in clustering
• Create a list of candidate features and try them out, using the statistics from Cloudera ML to evaluate feature quality
• Arrive at the set of features that gives the best statistics
• Cloudera ML also suggests the optimal number of clusters
Step 1: Determining features for testing
• Directly pull some features from the cleaned data (actions, number of items hovered over, session duration, number of items played, number of items browsed, number of items reviewed, number of items rated, number of items searched, etc.)
• Ignore features that are unlikely to relate to the user's behavior during the session (e.g. whether the user logged in and out)
• Extract some less direct features as well (mean play time, shortest play time, total play time, etc.)
Steps 2 & 3: Merging the data and generating feature vectors
• Merge sessions with parental controls that were previously split during cleaning
• Aggregate the records by session ID and merge records with the same ID
• Use the merged data to generate the features
Step 4: Normalize numeric values using z-scores (see the sketch below)
• Dimensions measured on large scales, such as total play time, would otherwise have a much larger effect than others (e.g. number of recommended items played)
• Z-score standardization: subtract the mean from each observation and divide by the sample standard deviation
• The resulting data has a mean of zero and a standard deviation of one (if you divide by s)

z = (x − x̄) / s    OR    z = (x − x̄) / MAD
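A minimal numpy sketch of the two standardization options above: the classic z-score using the sample standard deviation, and the robust variant that divides by the median absolute deviation (MAD), matching the slide's mean-centred form. The toy play-time values are an assumption.

# Minimal sketch of the two standardization options above; toy values assumed.
import numpy as np

def z_score(x):
    return (x - x.mean()) / x.std(ddof=1)          # ddof=1 -> sample std dev s

def robust_z_score(x):
    mad = np.median(np.abs(x - np.median(x)))      # median absolute deviation
    return (x - x.mean()) / mad                    # mean-centred, as on the slide

if __name__ == "__main__":
    total_play_time = np.array([120.0, 4500.0, 300.0, 60.0, 900.0])
    print(z_score(total_play_time))
    print(robust_z_score(total_play_time))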
Step 5: Clustering with k-means
• Partitioning goal: partition the log files into k clusters by session ID
• Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
• Global optimum: exhaustively enumerate all partitions (infeasible in practice)
• Heuristic methods:
  - k-means (MacQueen 1967): each cluster is represented by the center of the cluster
  - k-medoids (Kaufman and Rousseeuw 1987): each cluster is represented by one of the objects in the cluster; also known as Partitioning Around Medoids (PAM)
• Since we'd like to be able to compare results, we specify the seed
• We use a k-means++ sketch for choosing the initial values ("seeds") for the k-means clustering algorithm
• Feature identification: add all candidate features in one chunk, then back them out incrementally when the clusters start to become less distinct
Code
$ ml kmeans --input-file part2sketch.avro \
    --centers-file part2centers.avro \
    --clusters 40,60,80,100,120,140,160,180,200 \
    --best-of 3 --seed 1729 --num-threads 1 \
    --eval-details-file part2evaldetails.csv \
    --eval-stats-file part2evalstats.csv
Step 5 : Clustering with k-means
With maximum number of
features yielding cluster
statistics - 23 Cluster selection
Threshold
Predictive strength - 0.8
Stable Clusters - 0.8
With the final results cluster 22
seems to be the best choice for
clustering
Number of clusters - 300
Predictive strength - 1.0
Stable Clusters - 0.92
Predicting User Ratings (Recommendation Engine)
Approaches: User–User Similarity and Item–Item Similarity
Preparation of data:
• Mahout requires that all IDs, both user and item, be numeric
• Explicit data (user, item, rating) and implicit data (user, played, reviewed)
• Training data (May 5th – May 10th) and testing data (May 10th – May 12th)
Similarity measures (a sketch of what they compute follows below):
• TanimotoCoefficientSimilarity
• LogLikelihoodSimilarity
Evaluation:
• RMSE (Root Mean Square Error)
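To make the measures concrete, here is a minimal Python sketch of what the Tanimoto coefficient and RMSE compute. This is a plain re-implementation for illustration, not the Mahout API, and the toy item lists and ratings are assumptions.

# Minimal sketch of the two metrics named above (not the Mahout API).
import math

def tanimoto_coefficient(items_a, items_b):
    """|A ∩ B| / |A ∪ B| over the sets of items two users interacted with."""
    a, b = set(items_a), set(items_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def rmse(predicted, actual):
    """Root mean square error between predicted and actual ratings."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

if __name__ == "__main__":
    print(tanimoto_coefficient(["item1", "item2", "item3"], ["item2", "item3", "item4"]))  # 0.5
    print(rmse([3.5, 4.0, 2.0], [4.0, 4.0, 2.5]))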