Copyright © Think Big Analytics and Neustar Inc.
Asking the Right Questions of your Data
Mike Peterson, VP of Platforms and Data Architecture, Neustar
Jun 26, 2013
We have come a long way!!!
» But where/when is the GOLD?
» Unintended Consequence of Big Data
» We need to ask the right Questions
» Oh, and let's remember religion
» And not forget GOVERNANCE
Big Data Evolution Status
» New data platform is built (3-tier)
» Collected many PBs of data
» Hadoop infrastructure in place for 2 years
» Established Data Science teams
» Machine Learning is in place
» Increased technology skills
» Focused data teams
» Active in the community
Our Partners are still a part of our process
» Expertise in Technologies
» Trusted partner
» Collaborative Teams
» Open source leader
» Invested in client success
» Price/performance
Some Unintended Consequences
» More Customer Reporting Requests
  » Because we suddenly have lots of customer data available
  » Meaning more work for the DW team!!!
» A DR Site is needed more than ever
  » More data means more critical data to protect
  » Network stress to support DR and other additional access
» Data Governance is overwhelmed with requests
  » Retention Policies need to be rethought
Questions
» Customer Driven Questions
  » Easy to understand
» Subject Questions
  » Discover the pivot and you have a good start
» Exploratory Questions
  » Thinking of the unformed questions
  » Working from the top down
  » Narrowing the answer before you test all the data
Questions - Approaches
• Understand which manual process you want to automate: identify what is currently predicted by hand, and determine whether there is any way to get training data made up of <input, output> pairs.
• Consider methods to augment existing data with a “pivot” column that can be used as a join key. For example, geo-locating an IP address could allow a join with Census data based on ZIP+4.
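The “pivot” column idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the lookup tables `IP_TO_ZIP4` and `CENSUS_BY_ZIP4`, the field names, and every value in them are made up for the example; a real pipeline would get the ZIP+4 from a geo-IP service and the demographics from actual Census files.

```python
# Hypothetical lookup tables with made-up values (illustration only).
IP_TO_ZIP4 = {
    "203.0.113.7": "00000-0001",
    "198.51.100.23": "00000-0002",
}
CENSUS_BY_ZIP4 = {
    "00000-0001": {"median_income": 98000, "households": 412},
    "00000-0002": {"median_income": 121000, "households": 655},
}


def add_pivot(events):
    """Attach a zip4 pivot column to each event (None when the IP is unknown)."""
    return [{**e, "zip4": IP_TO_ZIP4.get(e["ip"])} for e in events]


def join_census(events):
    """Left join: keep every event, add census fields when the pivot matches."""
    joined = []
    for e in events:
        census = CENSUS_BY_ZIP4.get(e["zip4"], {})
        joined.append({**e, **census})
    return joined


events = [
    {"ip": "203.0.113.7", "clicks": 3},
    {"ip": "192.0.2.99", "clicks": 1},  # not in the lookup table
]
enriched = join_census(add_pivot(events))
for row in enriched:
    print(row)
```

A left join is used here so that events without a match are kept rather than silently dropped; whether that is the right choice depends on the question being asked.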
• Determine whether your problem is one of prediction or one of grouping (clustering). Grouping is more likely to lead to a better understanding of the data than to solve a direct business problem.
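A small sketch may make the distinction concrete. It assumes one numeric feature and uses two deliberately simple stand-ins, nearest-neighbour lookup for prediction and a few k-means style iterations for grouping; the function names and data are hypothetical, not taken from the talk.

```python
def predict_1nn(train, x):
    """Prediction: labeled <input, output> pairs exist; reuse the nearest input's label."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]


def kmeans_1d(values, centers, rounds=10):
    """Grouping: no labels; alternate between assigning points and moving centers."""
    groups = [[] for _ in centers]
    for _ in range(rounds):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Keep the old center when a group ends up empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


# Prediction needs an output column to learn from.
labeled = [(1.0, "low"), (2.0, "low"), (9.0, "high"), (10.0, "high")]
label = predict_1nn(labeled, 8.2)

# Grouping works on inputs alone; naming the groups is left to the analyst.
unlabeled = [1.0, 2.0, 1.5, 9.0, 10.0, 9.5]
centers, groups = kmeans_1d(unlabeled, centers=[0.0, 5.0])
print(label, centers, groups)
```

The prediction call returns an answer a business process can act on directly, while the grouping call returns structure that still has to be interpreted.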
• Determine whether you are more interested in finding “interesting” relationships among data columns than in predicting a column you already know. I’d call this task “discovery” rather than prediction, but the idea is the same: express one column, treated as the output, in terms of the other columns as inputs.
• Repeating this with each column in turn as the output can lead to “discovery” of the strongest correlations (e.g., a customer who buys beer at 5 PM is also likely to buy diapers). This is more of a fishing expedition, but it can lead to unusual insights.
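One way to picture this kind of discovery pass is below. It is a simplified stand-in, not the method from the talk: instead of fitting a model per output column, it scores every pair of columns with a Pearson correlation and ranks the pairs. Because the score is symmetric, each pair is checked once. The table, column names, and values are invented for the example.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no relationship
    return cov / sqrt(vx * vy)


def strongest_relationships(table, top=3):
    """Rank column pairs by absolute correlation, strongest first."""
    scored = [
        (abs(pearson(table[a], table[b])), a, b)
        for a, b in combinations(table, 2)
    ]
    return sorted(scored, reverse=True)[:top]


# Made-up basket data: one row per customer visit, 1 = yes, 0 = no.
table = {
    "beer_at_5pm": [1, 1, 0, 1, 0, 0, 1, 0],
    "diapers":     [1, 1, 0, 1, 0, 0, 1, 1],
    "milk":        [0, 1, 1, 0, 1, 0, 1, 0],
}
ranked = strongest_relationships(table)
for score, a, b in ranked:
    print(f"{a} ~ {b}: {score:.2f}")
```

With many columns the number of pairs grows quickly, and some strong-looking pairs will be coincidence, which is one reason this remains a fishing expedition rather than a test of a hypothesis.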
Impetus Approach to Questioning Data
[Diagram boxes, in slide order]
» EXISTING DATA PROPERTY
» BUSINESS STRATEGY
» CUSTOMER PROBLEM STATEMENTS
» ANALYSIS OF DATA PROPERTY
» DISCUSSION WITH STAKEHOLDERS
» ANALYSIS OF PROBLEM STATEMENT
» DATA NEEDS STATEMENT
» REFINED PROBLEM STATEMENT
» DATA ANALYTICS PLAN
Who knew there was religion in Analytics
» Statistical Analysis vs. Machine Learning
  » Stats people think “truth”
  » Machine Learning people think “near truth”
» Truth is easy to bound
  » Cost models make sense to org
» Near Truth is hard to explain and bound
  » It is where the real exploration happens
  » But it can consume the Data Scientist
» Both can net real returns, and they need to co-exist
GOVERNANCE
» Don’t forget about Governance
  » Contracts
  » PII
  » Brand
» CPO & CISO are your friends - honestly
  » Protect your CUSTOMER DATA
» It will slow you down in the beginning
  » But you want your results to be reputable
» At some point we need to get to an automated policy framework
About Impetus
» Accelerated consulting and services leader for Big Data; Headquartered in San Jose since 1996; 1400+; Presences in Silicon Valley, Atlanta, NYC; offices in India; Expertise through Architects
» Pioneers in distributed software engineering with vertical and functional expertise; Dedicated innovation labs; 200+ Big Data practitioners; 80+ dedicated to R&D
Data Science Approach
Drill
* Incoming Question
* Problem Landscape
* Underlying Constraints
* Specific Goals
Assess
* Goal Driven Hypotheses
* Data Requirement
* Resource Requirements
* Analysis Plan
Target
* Data Collection
* Quality Assessment
* Cross Validation
* Restructuring
Analyze
* Test Previous Hypotheses
* Explore New Hypotheses
* Test
* Quantify Results
Recommend
* Summary of Results
* Key Novel Insights
* Impact Analysis
* Action Items
Data Science Focus Areas
» Recommender Systems
» Sentiment Analysis
» Topic Identification
» Predictive Analytics
» Data Stream Analytics
Contact us at [email protected]
Thank you
Questions?