Cloudera Data Science Challenge 3 Solution by Doug Needham

Cloudera Data Science Challenge Presentation by @dougneedham

Cloudera: Certified Data Scientist

This is the goal. What are the requirements?

Requirement 1. DS-200 Test.

Requirement 2. Data Science Challenge.

http://www.cloudera.com/content/cloudera/en/training/certification/ccp-ds/challenge/challenge3.html

DS-200 Test

Here are the sections of the exam:

Data Acquisition

Data Evaluation

Data Transformation

Machine Learning Basics

Clustering

Classification

Collaborative Filtering

Model/Feature Selection

Probability

Visualization

Optimization

Data Science Challenge Itself

The resources needed for the challenge.

A Cluster:

7 nodes (1 NameNode, 6 DataNodes)

1 node for Cloudera Manager

1 node for Cloudera Director

Cloudera Director requires a particular AMI for the AWS US East region (ami-3218595b, RHEL 6.4 x86_64). It took a bit of time to get this right; the version I used is highly dependent on RHEL 6.4.

Cluster Management

Cloudera Director made it easy to create the cluster once I used the proper AMI. As noted previously, it only really works well with RHEL 6.4.

Cloudera Manager allowed for management and monitoring.

Demo here:

Restarting the cluster is non-trivial.

Clusters are meant to stay up; a bit of verification work was needed when the cluster was shut down over the holidays.

The problems

The challenge is made up of 3 problems. Each one could be a larger effort in and of itself to solve.

The data had to be transformed in order to process it.

The sophisticated portion of the challenge was the actual processing of the data: machine learning, graph analysis, statistical confidence, etc.

Then the output needed to be tweaked a bit in order to conform to the deliverable specification.

Problem 1

Flight Delays.

SmartFly’s business is providing its customers with timely travel information and notifications about flights, hotels, destination weather, traffic getting to the airport, and anything else that can help make the travel experience smoother. Their product team has come up with the idea of using the flight data that they have been collecting to predict whether customers’ flights will be delayed so that they can respond proactively. They’ve now contacted you to help them test out the viability of the idea.

From a given set of historical flights, create a list of future scheduled flights ordered by the probability that each flight will be delayed.

Problem 2

Website log analytics.

Congratulations! You have just published your first book on data science, advanced analytics, and predictive modeling. You’ve decided to use your skills as a data scientist to create and optimize a website promoting your book, and you have started several ad campaigns on a popular search engine in order to drive traffic to your site.

Provide statistics about a website where your new book is featured.

Problem 3

Who should follow whom?

Winklr is a curiously popular social network for fans of the sitcom Happy Days. Users can post photos, write messages, and most importantly, follow each other’s posts and content. This helps users keep up with new content from their favorite users on the site.

Rules

Individual Contributions Only

You must participate in this challenge only on an individual basis; teams are not permitted.

Sharing

Any sharing of code or solutions or collaboration with another person or entity is strictly forbidden.

Tools

You may use any tools or software you desire to complete the challenge.

Prerequisites

You must have successfully passed Data Science Essentials (DS-200).

Deliverables

Problem 1 – Ordered list of flights.

Problem 2 – JSON file with a populated Python dictionary containing specific answers to questions.

Problem 3 – Top 70,000 connections that should be recommended.

In addition to the deliverables stated above for each problem, you must provide a solution abstract and the complete set of source code used to solve the challenge problems, as described below.

Solution Abstract

The solution abstract should be a brief write-up in PDF format that addresses the following points:

For each part, you needed to do the following:

Explain your methodology including approach, assumptions, software and algorithms used, testing and validation techniques applied, model selection criteria, and total time spent.

Please include in your solution abstract any information that can be used to understand the logic behind your approach and all steps taken, including data preparation, modeling, validation, analysis, visualization, etc. The solution abstract should typically be 3 to 5 pages and no more than 6 pages.

Complete Source Code

Tarball or zip file of all source code used to complete the challenge, including programs, scripts, and other artifacts.

My GitHub repository is linked at the end of this presentation.

Scoring

Submission Scoring

Submissions will be scored as follows. Each problem part will be scored independently. The score for each part will be a composite of the percentage correct for all submitted solutions for that part and the score assigned to the corresponding section of the solution abstract. The scores for the three parts will be weighted and combined into a final composite score.

The percentage correct for each part will be scored against a golden master of known correct answers. Note that some questions may have more than one correct answer, and partial credit may be awarded.

The solution abstract will be scored according to objective criteria about your approach and general mastery of the tools and techniques. Writing quality and formatting will not contribute to the score, except in cases where the writing is so poor as to impact understanding.

Did anyone notice anything?

Each of the 3 problems highlights a very different aspect of Data Science:

Machine Learning

Statistical Analysis

Graph Analysis

Each of these individually is an area that people specialize in. I, for one, intend to dive deep into Graph Analysis. It is quite interesting from what I have seen so far.

The code has to be straightforward: while it is not clear whether they will do so for this particular challenge, in at least one prior challenge the code was run independently as part of the Cloudera grading process.

Bringing us to the question: What is a Data Scientist?

http://nirvacana.com/thoughts/becoming-a-data-scientist/

If you do the challenge, are you a purple squirrel?

Who is a Data Scientist ?

Who here has seen the Indiana Jones movies?

Marcus Brody and Henry “Indiana” Jones were both Archeologists.

Both lectured and taught archeology.

Both understood the tenets of archeology.

Both knew what finding an artifact means.

Both could speak intelligently about the significance of any finds associated with the search.

Only Indiana went on the “quest” – Why?

Those “intangible” skills of being able not only to survive, but to thrive in a chaotic environment played to “Indy’s” strengths.

https://www.youtube.com/watch?v=PgfpIV29Ccc

There are many types of data scientist: http://www.datasciencecentral.com/profiles/blogs/six-categories-of-data-scientists

I think the environment affects the success of a data scientist.

From Data Science Central

Opinion time

This challenge has reminded me of some of my first efforts building a data warehouse, where I had to build the pipeline of data from our source systems into our Operational Data Store, and out to our Star Schema for daily reports.

It has many of the earmarks of a real-world project.

The biggest difference between this challenge and a “real world problem” is that there are known solutions, and someone knows those solutions.

In “the real world”, we have things like user acceptance testing and such.

This is a more opinionated way of judging success or failure, rather than the objective measure of: is the data in the answer set?

Doug’s Problem Solving approach

This is the approach I took; it may or may not be useful for others to apply.

Analysis. I started with some basic numbers, and just browsing through the data with the “Data Science at the Command Line” toolkit. This is very handy for getting a feel for things.

Based on the general understanding this analysis provided, create a “pipeline”.

Generally the data has to be transformed to a usable structure for the particular method of solving the problem.

Do some basics with the problem-solving method: stats, ML, graph, etc.

Get some data back out of that tool, then format output to specification.

Iterate.

I did this for problem 1, moved to problem 2, then finally problem 3. Then went back to 1, back to 2, back to 3.

This method allowed me to give myself some “space”, and actually look at each problem with fresh eyes on more than one occasion.

Breaking each problem down into the basics of Input, Process, Output allowed me to have “working” code for each problem really quickly; then, through tuning, analysis, research, and some time to think about the problem, I was able to come up with each unique solution.

It also allowed me to refactor the code, having given each problem time to “rest”.

Very much like a painting, broad strokes first, details emerge as the painting progresses.

Another benefit is that once the data makes it all the way through the pipeline, it becomes obvious where the performance bottlenecks are.

This method does take a bit of time.

Solution 1

Type of problem: Machine Learning

Use Python to format the data.

Create an individual set of files based on point of origin (departing airport).

Use these individual files to create a model per airport using Spark MLlib.

Run the scheduled flights through the model, then use the score of the model (area under the ROC curve, denoting accuracy) multiplied by the output of the prediction (which is either 1 or 0).

This allows us to rank flights by how confident we are that each one will be delayed.

Code: problem1.sh and PredictFlights.scala
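As a rough illustration of that recipe, here is a minimal Spark MLlib sketch for a single airport. It is not the actual PredictFlights.scala: the file paths, feature layout, and the choice of logistic regression are assumptions made for illustration.

```scala
// Hypothetical sketch of the per-airport approach described above.
// Paths, schema, and algorithm choice are assumptions, not challenge code.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

object PredictFlightsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PredictFlightsSketch"))

    // One file of historical flights per departure airport:
    // label,feature1,feature2,... where label is 1.0 (delayed) or 0.0.
    val history = sc.textFile("hdfs:///flights/history/ORD.csv").map { line =>
      val cols = line.split(',').map(_.toDouble)
      LabeledPoint(cols.head, Vectors.dense(cols.tail))
    }.cache()

    // Hold out a test set so the model can be scored on known data.
    val Array(train, test) = history.randomSplit(Array(0.8, 0.2), seed = 42L)
    val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(train)

    // Area under the ROC curve: the per-airport accuracy weight.
    val scoreAndLabel = test.map(p => (model.predict(p.features), p.label))
    val auc = new BinaryClassificationMetrics(scoreAndLabel).areaUnderROC()

    // Weight each 0/1 prediction for a scheduled flight by the model's AUC,
    // then sort descending to get the ordered list the deliverable asks for.
    val scheduled = sc.textFile("hdfs:///flights/scheduled/ORD.csv")
      .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
    scheduled.map(f => (f, model.predict(f) * auc))
      .sortBy(-_._2)
      .saveAsTextFile("hdfs:///flights/ranked/ORD")
  }
}
```

Training one model per departure airport keeps each model's feature space small, at the cost of managing many models.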

What the heck is a ROC?

This comes from: http://gim.unmc.edu/dxtests/roc3.htm

There are four basic metrics for evaluating a predictive model:

True Positive

True Negative

False Positive

False Negative.

The higher the area under the ROC curve (the yellow line in the source figure), the better.

This is used for model validation to ensure that the model is making accurate predictions against known data.

http://en.wikipedia.org/wiki/Receiver_operating_characteristic
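To make those four outcomes concrete, here is a toy Scala calculation, with made-up predictions, of the two rates a ROC curve is built from:

```scala
// Toy illustration of the four outcomes above (invented data, not
// challenge code). A ROC curve plots the true-positive rate against the
// false-positive rate as the decision threshold varies.
object RocBasics {
  def main(args: Array[String]): Unit = {
    // (predicted, actual) pairs for a hypothetical classifier.
    val results = Seq((1, 1), (1, 0), (0, 0), (1, 1), (0, 1), (0, 0))

    val tp = results.count { case (p, a) => p == 1 && a == 1 } // true positives
    val fp = results.count { case (p, a) => p == 1 && a == 0 } // false positives
    val tn = results.count { case (p, a) => p == 0 && a == 0 } // true negatives
    val fn = results.count { case (p, a) => p == 0 && a == 1 } // false negatives

    val tpr = tp.toDouble / (tp + fn) // sensitivity: a point's y-coordinate
    val fpr = fp.toDouble / (fp + tn) // 1 - specificity: its x-coordinate
    println(s"TPR = $tpr, FPR = $fpr")
  }
}
```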

Solution 2

Type of problem: Statistical reporting

Python streaming, so lots of Map-Reduce code. It could probably be replaced with Spark; at the time, this path seemed to be the most straightforward (a Spark sketch follows this list).

A bit of analysis.

Collect some numbers – statistically significant numbers, that is.

Format the data as JSON.

Code: Problem2.sh
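For comparison, here is a minimal sketch of the Spark alternative mentioned above. The log layout (tab-separated fields with the page URL in the third column) is an assumption for illustration, not the actual challenge format.

```scala
// Hypothetical Spark stand-in for one of the streaming Map-Reduce jobs:
// count hits per page. The input layout is assumed, not the real one.
import org.apache.spark.{SparkConf, SparkContext}

object SiteStatsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SiteStatsSketch"))

    val hits = sc.textFile("hdfs:///weblogs/access.log")
      .map(_.split('\t'))
      .filter(_.length >= 3) // drop malformed lines

    // Equivalent of a streaming mapper emitting (page, 1) plus a summing reducer.
    val pageCounts = hits.map(fields => (fields(2), 1L)).reduceByKey(_ + _)
    pageCounts.sortBy(-_._2).take(10).foreach(println)
  }
}
```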

What makes this data science?

Isn’t this the same thing as business analysis?

What makes the difference is the latter part of questions 4 and 5.

Here they are:

Question 4: “How many full days of data, starting from the first day, are required to determine that the newsletter signup rate for experiment one is better than experiment two at the 99% confidence level?”

Question 5: “Using a z-test, determine how many full days of data, starting from the first full day, are needed to confirm that experiment four earns more revenue than experiment three at the 99% confidence level.”
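To illustrate the kind of computation these questions require, here is a sketch of a one-tailed two-proportion z-test in Scala. The visitor and signup counts below are invented:

```scala
// Hedged sketch of the test question 4 calls for; all counts are made up.
object ZTestSketch {
  // z statistic for comparing two proportions (e.g., signup rates),
  // using the pooled estimate under the null hypothesis of no difference.
  def zTwoProportions(x1: Long, n1: Long, x2: Long, n2: Long): Double = {
    val p1 = x1.toDouble / n1
    val p2 = x2.toDouble / n2
    val pooled = (x1 + x2).toDouble / (n1 + n2)
    (p1 - p2) / math.sqrt(pooled * (1 - pooled) * (1.0 / n1 + 1.0 / n2))
  }

  def main(args: Array[String]): Unit = {
    // Cumulative signups and visitors after some number of full days.
    val z = zTwoProportions(x1 = 480, n1 = 9000, x2 = 390, n2 = 9100)
    // One-tailed critical value at the 99% confidence level is about 2.326.
    println(s"z = $z, better at 99%? ${z > 2.326}")
    // To answer the question itself, accumulate day by day and report the
    // first day count for which the test is significant.
  }
}
```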

The accurate measurement of confidence is what makes this different. I have built a number of data warehouse environments in retail, finance, and health care. Even with chain-of-custody information built in to provide for data traceability, I have seen very few decisions based on the output of the data warehouse. Why is this?

No one has discussed confidence levels in the data. Having a rational conversation about the “confidence” and statistical significance of the data allows for more rational decision making.

<Opinion>This is one of the key differentiators of data science versus business analytics. </Opinion>

Solution 3

Type of problem: Graph Analysis

Create a Master Graph.

Run PageRank to identify centrality.

Create many small graphs for individual users.

Mask the master graph and the PageRank graph.

Multiply out centrality, the number of in-degrees for a possible follower, and the inverse of the length of the path from this particular user to a candidate vertex to be followed.

This code takes over 48 hours to run.

Code: Problem3.sh and AnalyzeGraph.scala
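Here is a simplified GraphX sketch of that recipe for a single hypothetical user. It is not the actual AnalyzeGraph.scala: the paths, the user id, and the exact form of the scoring product are assumptions based on the description above.

```scala
// Hypothetical GraphX sketch of the centrality * in-degree * 1/distance
// scoring described above, for one user. Not the real challenge code.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{GraphLoader, VertexId}
import org.apache.spark.graphx.lib.ShortestPaths

object FollowRecSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("FollowRecSketch"))

    // Edge list with one "follower followee" pair per line (assumed layout).
    val graph = GraphLoader.edgeListFile(sc, "hdfs:///winklr/edges.txt")

    val ranks = graph.pageRank(tol = 0.001).vertices // centrality
    val inDeg = graph.inDegrees                      // popularity
    val user: VertexId = 42L                         // hypothetical user

    // Hop distance from each vertex to the chosen user (single landmark).
    val dists = ShortestPaths.run(graph, Seq(user)).vertices
      .mapValues(spMap => spMap.getOrElse(user, Int.MaxValue))

    // Score candidates: more central, more followed, and closer is better.
    val scores = ranks.join(inDeg).join(dists).map {
      case (id, ((rank, deg), d)) if id != user && d > 0 && d < Int.MaxValue =>
        (id, rank * deg / d.toDouble)
      case (id, _) => (id, 0.0) // the user itself, or unreachable vertices
    }
    scores.sortBy(-_._2).take(10).foreach(println)
  }
}
```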

Graph Analysis

As graphs get really large, it becomes difficult to visualize them. However, I was able to “subset” the master graph based on the recommendation output of my process.
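One plausible way to do that subsetting with GraphX (an assumption about the mechanics, not necessarily how the original export was produced) is to keep only the vertices named in the recommendation pairs and write the surviving edges to a file Gephi can import:

```scala
// Hypothetical subsetting step: restrict the master graph to recommended
// vertices and export an edge CSV for Gephi. Paths and layout are assumed.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.GraphLoader

object SubsetForGephi {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SubsetForGephi"))
    val graph = GraphLoader.edgeListFile(sc, "hdfs:///winklr/edges.txt")

    // "follower,followee" pairs produced by the recommender; small enough
    // (70,000 rows) to collect to the driver and broadcast as a set.
    val keep = sc.textFile("hdfs:///winklr/recommendations.csv")
      .flatMap(_.split(',').map(_.toLong)).collect().toSet
    val keepB = sc.broadcast(keep)

    // Keep only vertices (and thus edges) that involve recommended users.
    val sub = graph.subgraph(vpred = (id, _) => keepB.value.contains(id))
    sub.edges.map(e => s"${e.srcId},${e.dstId}")
      .saveAsTextFile("hdfs:///winklr/gephi-edges")
  }
}
```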

I was expecting to see one big clump of tightly connected nodes. This would be the “Target” to follow.

I was also expecting to see two smaller clumps of nodes, loosely connected to the larger clump. These are the “followers”: as we recommend that they follow the more popular node, they become more closely connected to this user.

Here is the output from Gephi, which shows whether the code worked or not.

Gephi output

Where to go from here?

Spark.

Scala.

Learn these topics.

Teach these topics.

Especially for folks planning on sitting for Data Science Challenge 4:

Learn Scala. Learn Spark.

Oh, and keep studying about Graphs…

Code located here: https://github.com/dougneedham/Cloudera-Data-Scientist-Challenge-3

