Date post: 31-Mar-2015
Page 1: Data Mining with Analysis Services Carlos Bossy Principal Consultant MCTS, MCITP BI Aabcom Solutions  .

Data Mining with Analysis Services

Carlos Bossy
Principal Consultant
MCTS, MCITP BI

Aabcom Solutions
www.aabcomsolutions.com
www.carlosbossy.com


My Background

Experience
o 25 years Software Development
o 15 years at software companies as Programmer, Architect, Manager, Director, VP, CTO
o 5+ years Business Intelligence Consultant

Current Projects
o Data Warehouse deployment for State of WA Child Welfare
o Data Integration/Warehouse architecture for Solar Energy company
o Data Warehouse model for State of OR
o Data Mining model for Houston sports league


Overview

Make Data Mining an integral component of your Data Architecture.

[Diagram labels: Applications, Reports, Data Integration]


Session Overview

o What is Data Mining
o Data Mining Algorithms in SQL Server
o Creating a Data Mining Model
o Using the Model in an Application
o Using the Model in SSIS
o Awareness of Data Mining Architecture and Process


Application Architecture

[Diagram: Application with Service, Business and Data layers; Relational Database]


Application Architecture with BI

[Diagram: Application with Service, Business and Data layers; Relational Database; Data Integration; Data Warehouse, Cube and Text Warehouse; Performance Management, Reports, Analysis, Ad-hoc, Data Mining, Text Mining]


Why Mine Data?

If you want to…
o Predict the future
o Get rich buying stocks
o Win the lottery
o Win your Fantasy Football League

Data Mining is for you!


Really - Why Mine Data?

o Uncover Hidden Relationships
o Find something unusual or unexpected
o Improve upon domain expert's knowledge
o Manage large datasets
o Create Predictive Analytics platform

Maximize value of Data

Competitive Edge


Data Mining Defined

Data Mining is the process of sorting through large amounts of data and picking out relevant information (Wikipedia).

Data Mining is the extraction of hidden predictive information from large databases.

o The Process of Knowledge Discovery
o Sophisticated Statistical Model
o Discovery of Patterns and Relationships

It is not Dredging, Snooping or an Invasion of Privacy


Today vs. Yesterday

Future
o Explosion of Data: volume doubles every 3 years (a Moore's Law-like growth rate)
o Data Volumes can't be comprehended by humans
o Uncover complex and difficult to find patterns for competitive edge
o Improve professional judgment of Domain Expert (small but valuable)
o Knowledge Discovery
o Converting Data to Information
o N → Infinity

Legacy
o Manageable Volumes of Data
o The power of SQL
o Domain Experts could grasp and analyze a complete Database
o Limited CPU Horsepower
o N is Finite


Outcomes

Predict the number of runs the Colorado Rockies would score in their next game using a Neural Network.

Input
o Previous 12 games' runs scored
o Home/Away

Predict
o Number of runs next game

Model
Home Runs = (G7*.142)+(G1*.118)+…
Away Runs = (G7*.129)+(G1*.091)+…


Non-linear Results

Statistical Methods: fitting a line to a set of data points to measure the effect of a single independent variable.

y = mx + b
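As a rough illustration of the line-fitting approach, the sketch below estimates m and b by ordinary least squares. The function name fit_line and the sample points are hypothetical and not taken from the talk.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = m*x + b for one independent variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    m = num / den
    # Intercept: the fitted line passes through the point of means.
    b = mean_y - m * mean_x
    return m, b


# Hypothetical sample points that lie exactly on y = 2x + 1.
m, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(m, b)
```

A single straight line like this captures only one linear effect, which is the limitation the slide contrasts with data mining models.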

Data Mining


Outcomes – Decision Tree

Annual Income = 108,491.880+6.394*(Savings Balance-27,898.703)-2,443.789*(Avg Check Size-59.555)-0.247*(Credit Card Limit-8,250.000)+33.430*(Age-38.500)-1,032.285*(Over Drafts-1.000)
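The regression formula above can be evaluated directly. The following is a minimal sketch: the function and parameter names are hypothetical, and the constants subtracted in each term are assumed to be the node's centering values (presumably averages), which the slide does not state.

```python
def predict_annual_income(savings_balance, avg_check_size,
                          credit_card_limit, age, over_drafts):
    """Evaluate the decision tree node's regression formula from the slide."""
    return (108_491.880
            + 6.394 * (savings_balance - 27_898.703)
            - 2_443.789 * (avg_check_size - 59.555)
            - 0.247 * (credit_card_limit - 8_250.000)
            + 33.430 * (age - 38.500)
            - 1_032.285 * (over_drafts - 1.000))


# At the centering values every term vanishes, leaving only the intercept.
baseline = predict_annual_income(27_898.703, 59.555, 8_250.0, 38.5, 1.0)

# Hypothetical customer: one year older, otherwise identical to the baseline.
older = predict_annual_income(27_898.703, 59.555, 8_250.0, 39.5, 1.0)
print(baseline, older)
```

Read this way, each coefficient is the change in predicted income per unit change in that input, holding the others fixed.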


Outcomes – Time Series

Total Gen-1 < 36276.430 and Flow Date < 3/2/2007 4:58:07 AM and Flow Date < 8/22/2006 5:54:22 PM and Flow Date < 8/2/2006 1:35:37 PM

Total Gen = 31452.534 + 0.126 * Total Gen(-1)

Total Gen-1 < 36276.430 and Flow Date >= 3/2/2007 4:58:07 AM and Flow Date < 5/16/2007 7:46:52 PM and Flow Date >= 3/15/2007 12:56:15 PM

Total Gen = 6186.992 + 0.361 * Total Gen(-2) + 0.602 * Total Gen(-1)
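To show how a node formula like the second one is applied, here is a small sketch that forecasts from the two most recent observations. The function names and history values are hypothetical, and the node's conditions on Total Gen(-1) and Flow Date are not re-checked; the sketch simply assumes the case falls in that node.

```python
def forecast_total_gen(history):
    """Apply the second node's formula to the two most recent observations.

    history is ordered oldest first, so history[-1] is Total Gen(-1)
    and history[-2] is Total Gen(-2).
    """
    lag1, lag2 = history[-1], history[-2]
    return 6186.992 + 0.361 * lag2 + 0.602 * lag1


def forecast_steps(history, steps):
    """Forecast several steps ahead by feeding each prediction back in."""
    values = list(history)
    for _ in range(steps):
        values.append(forecast_total_gen(values))
    return values[len(history):]


history = [31000.0, 30000.0]  # hypothetical readings, oldest first
next_value = forecast_total_gen(history)
three_ahead = forecast_steps(history, 3)
print(next_value, three_ahead)
```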


Applications

o Credit Risk Analysis
o Churn Analysis
o Customer Retention
o Targeted Marketing
o Market Basket Analysis
o Sales Forecasting
o Stock Predictions
o Medical Diagnosis
o Bioscience Research
o Surveys
o Insurance Rate Quotes
o Credit Card Fraud
o Web Site Events
o Loan Applications
o Hiring and Recruiting
o Cross-Marketing
o Social Science
o Economics


Data Mining Sources

[Diagram: sources feeding Data Mining: SSAS Cube, DW, OLTP Database, Table (Col 1, Col 2, Col 3); platforms: SQL Server, Oracle, MySQL, etc.]


Training the Model

o Cleanse
o Massage
o Select Training Set
o Apply DM Algorithm
o Train
o Test


Using the Model

[Diagram]
o Customer table: CustomerID, CustName
o MonthlyBankAccount table: BankAccountID, CheckingBalance, SavingsBalance, NumTransactions, CustomerID
o Mining Model inputs: CheckingBalance, SavingsBalance, NumTransactions
o Prediction: Income
o Steps shown: SQL Join, DMX Prediction Join, Result


Prediction Join (DMX)

-- Predict the top 3 movie purchases for a single ad-hoc case
SELECT Predict([Movie Purchases], 3) AS Movies
FROM [Customer Movie Association]
NATURAL PREDICTION JOIN
(SELECT 44 AS [Age],
        'Female' AS [Gender],
        'Rent' AS [Home Ownership],
        'Divorced' AS [Marital Status],
        -- Nested table of movies the customer already purchased
        (SELECT 'Mrs. Doubtfire' AS [Movie Title]
         UNION SELECT 'My Big Fat Greek Wedding' AS [Movie Title]
         UNION SELECT 'Patriot Games' AS [Movie Title]) AS [Movie Purchases]) AS t


Approaches

o Clustering
o Classification
o Regression
o Market Basket Analysis


Mining Algorithms

o Time Series
o Naïve Bayes
o Association
o Clustering
o Decision Trees
o Logistic Regression
o Sequence Clustering
o Neural Networks


Data Mining Algorithms

Analytical problem | Examples | Algorithms
Classification: Assign cases to predefined classes | Credit risk analysis; Churn analysis; Customer retention | Decision Trees; Naive Bayes; Neural Nets
Segmentation: Taxonomy for grouping similar cases | Customer profile analysis; Mailing campaign | Clustering; Sequence Clustering
Association: Advanced counting for correlations | Market basket analysis; Advanced data exploration | Decision Trees; Association
Time Series Forecasting: Predict the future | Forecast sales; Predict stock prices | Time Series
Prediction: Predict a value for a new case based on values for similar cases | Quote insurance rates; Predict customer income | All
Deviation analysis: Discover how a case or segment differs from others | Credit card fraud detection; Network intrusion analysis | All

* Andy Cheung, Microsoft


Time Series

Analyzes how a variable changes over time.

o Uses Autoregression + Decision Tree to build model
o Each time series is a single case
o No prediction join with test or actual cases
o Prediction is always the same for given time slots


Time Series Tree

[Tree diagram: root node "All"; splits on Stock(t-5) <= 22.83 vs. Stock(t-5) > 22.83 and on Stock(t-1) <= 30.15 vs. Stock(t-1) > 30.15]

Node Regression Formula: Stock = 6.85 + .62*Stock(t-1) + .21*Stock(t-2)


Naïve Bayes

Probabilistic classifier based on Bayes’ theorem with strong (naive) independence assumptions.

o Simple Classification Algorithm
o Good starting point for better understanding of your data
o Uses only discrete data


Naïve Bayes Example

Cell Phone Service

Gender | Premium Service | Custom Ring Tones | International Calls
Female (53%) | 19% | 56% | 27%
Male (47%) | 41% | 14% | 38%

New case: Premium Svc = Yes, Ring Tones = Yes, International = No

Likelihood of Female = .53 * .19 * .56 * .73 = .0412
Likelihood of Male = .47 * .41 * .14 * .62 = .0167

P(Female) = .0412/(.0412 + .0167) = 71.2%
P(Male) = .0167/(.0412 + .0167) = 28.8%
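The arithmetic above can be reproduced in a few lines. This is a minimal sketch using the percentages from the slide; the dictionary layout and the names priors, p_yes and posterior are hypothetical.

```python
# Share of each gender, and P(feature = Yes | gender), from the slide's table.
priors = {"Female": 0.53, "Male": 0.47}
p_yes = {
    "Female": {"premium": 0.19, "ring_tones": 0.56, "international": 0.27},
    "Male": {"premium": 0.41, "ring_tones": 0.14, "international": 0.38},
}


def posterior(case):
    """Return P(gender | case), treating features as independent given gender."""
    likelihood = {}
    for gender, prior in priors.items():
        score = prior
        for feature, is_yes in case.items():
            p = p_yes[gender][feature]
            # Use P(Yes) for a Yes answer and 1 - P(Yes) for a No answer.
            score *= p if is_yes else (1.0 - p)
        likelihood[gender] = score
    total = sum(likelihood.values())
    # Normalize so the posteriors sum to 1.
    return {gender: score / total for gender, score in likelihood.items()}


result = posterior({"premium": True, "ring_tones": True, "international": False})
print(result)
```

Working from unrounded likelihoods gives roughly 71.1% and 28.9%; the slide's 71.2% and 28.8% come from rounding the likelihoods to .0412 and .0167 before normalizing.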


Association Rules

Detect relationships or associations between specific values of categorical variables in large data sets.

Uses only Discrete Data

Rule - Attribute value conditions that occur frequently together in a given dataset, e.g. {Male, IT, Star Wars} → {Star Trek}

Itemset - A set of attribute values.

Support - Number of transactions that contain the itemset.

Confidence - Probability that {X} → {Y} holds, i.e. that a transaction containing X also contains Y.

Importance (Lift) - Interestingness. Measure of whether the correlation is positive, negative or none.
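The measures above can be computed on a toy dataset as follows. This is a sketch under stated assumptions: the transactions are invented, and lift is calculated with the textbook definition (confidence divided by the baseline frequency of Y); the importance score reported by the Microsoft algorithm may be computed differently.

```python
# Hypothetical transactions; each is the set of items in one purchase.
transactions = [
    {"Star Wars", "Star Trek"},
    {"Star Wars", "Star Trek"},
    {"Star Wars"},
    {"Star Trek"},
    {"Alien"},
]


def support(itemset):
    """Count the transactions that contain every item in the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def confidence(x, y):
    """Estimate P(Y | X) as support(X and Y) / support(X)."""
    return support(x | y) / support(x)


def lift(x, y):
    """Confidence relative to how often Y occurs overall.

    Above 1 suggests a positive correlation, below 1 a negative one,
    and about 1 suggests none.
    """
    return confidence(x, y) / (support(y) / len(transactions))


x, y = {"Star Wars"}, {"Star Trek"}
print(support(x | y), confidence(x, y), lift(x, y))
```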


Logistic Regression

Predict the probability of a discrete outcome from a set of variables that are continuous, discrete or both.

o Non-linear regression model that produces results between 0 and 1.
o Popular in health science for disease prediction.
o Used in marketing for dichotomous predictions (buy or not buy, renew or cancel).
o Same as Neural Network without the hidden layer.

Probability = 1 / (1 + e^(-z)) where z = c + y1*x1 + y2*x2 + …

If x1 = Weight and x2 = Age then
Risk of heart attack = 1 / (1 + e^(-z)) where z = -2.4 + 1.3*x1 - .7*x2
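A short sketch of the logistic formula follows, using the coefficients from the heart attack example. The function names are hypothetical, and because the slide does not give units or scaling for Weight and Age, the input values are arbitrary illustrative numbers.

```python
import math


def logistic(z):
    """Map any real z to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def heart_attack_risk(x1, x2):
    """Slide's example: z = -2.4 + 1.3*x1 - 0.7*x2 (x1 = Weight, x2 = Age)."""
    z = -2.4 + 1.3 * x1 - 0.7 * x2
    return logistic(z)


# Hypothetical, already-scaled inputs; here z = -0.5.
risk = heart_attack_risk(2.0, 1.0)
print(risk)
```

Whatever the inputs, the output stays strictly between 0 and 1, and z = 0 corresponds to a probability of exactly 0.5.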


Decision Tree

Graphical representation displaying options, risks and the decision-making sequence.

o Most popular data mining model.
o Easy to visualize because of its graphical representation.
o Branches represent choices with associated risks, costs, results, or probabilities.
o Each test examines the value of a single column in the data and uses it to determine the next test to apply.
o The results of all tests determine which label to predict.
o Similar to human thought process when making a decision.
o Finds non-linear relationships.
o Supports classification, regression and association within the model.


Neural Network

Classifies large and complex data sets by grouping cases together in a way loosely based on the brain.

o Most sophisticated algorithm but difficult to interpret.
o Works well with non-linear data and finds smooth non-linear relationships.
o Modeled as a group of interconnected nodes.
o No agreed-upon definition. Microsoft algorithm is one of many techniques.
o Can build multiple models based on discrete inputs.

[Diagram: three input nodes (I), four hidden nodes (H), two output nodes (O)]

Back to Input layer after weights adjusted for error


Clustering

Places data elements into related groups without advance knowledge of the group definitions.

o Good starting point for better understanding of your data.
o Finds the hidden variable that accurately classifies data.
o Data grouped into clusters have a high similarity based on the attribute values.
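To make the grouping idea concrete, here is a very small k-means sketch on a single numeric attribute. K-means is used only as a familiar stand-in; the talk does not say which clustering method was used, and the function name, starting centers and data values are all hypothetical.

```python
def kmeans_1d(values, centers, iterations=10):
    """Tiny k-means for one numeric attribute."""
    centers = list(centers)
    groups = []
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest center.
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its previous center).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


balances = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]  # hypothetical attribute values
centers, groups = kmeans_1d(balances, centers=[1.0, 12.0])
print(centers, groups)
```

No group labels are supplied in advance; the two groups emerge only from how close the values are to each other.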


Sequence Clustering

Discovers the properties of sequences by grouping them into clusters and assigning them to one of the clusters.

o Hybrid of sequence and clustering techniques.
o Typically used with web and event logs as data sources.


Demo


Conclusion

o Slow Adoption
o Where do you start?
o Science + Art
o Not quite A.I. … yet!

More Info and References
o TDWI - The Data Warehousing Institute
o ACM, IEEE
o Books:
  Data Mining with SQL Server 2005/8 (Wiley)
  Mining the Talk (IBM Press)
  Data Mining: Know It All (Morgan Kaufmann)

