8/8/2019 Data Mining- IMT Nagpur-Manish
Data Mining & Its Business Applications
MANISH GUPTA
Principal Analytics Consultant
Innovation Labs, 24/7 Customer Pvt. Ltd.
Bangalore-560071
(Email: [email protected])
Why Data Mining?

The data explosion problem: automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories.

We are drowning in data, but starving for knowledge! The secret of success in business is knowing what nobody else knows.

Solution: data warehousing and data mining.
What is Data Mining? (Knowledge Discovery in Databases)

Definition: extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases.
Data Mining vs. DBMS-SQL

DBMS-SQL: queries based on the data held. Examples:
- Last month's sales for each product
- Sales grouped by customer age, etc.
- List of customers whose policies lapsed

Data mining: infers knowledge from the data to answer queries. Examples:
- What characteristics do customers whose policies have lapsed share?
- Are the sales of this product dependent on the sales of some other product?
Data Mining: Confluence of Multiple Disciplines

Data mining draws on database technology, statistics, machine learning, information science, visualization and other disciplines.
Data Mining and Business Intelligence

The layers of a business intelligence stack, with increasing potential to support business decisions from bottom to top (typical users range from the DBA at the base, through data analysts and business analysts, to the end user at the top):
- Making decisions
- Data presentation: visualization techniques
- Data mining: information discovery
- Data exploration: OLAP, statistical analysis, querying and reporting
- Data warehouses
- Data sources: paper, files, information providers, database systems
Data Mining: A KDD Process

Data mining is the core of the knowledge discovery process:
Databases → data cleaning and data integration → data warehouse → selection of task-relevant data → data mining → pattern evaluation.
Architecture of a Typical Data Mining System

Databases and a data warehouse (populated through data cleaning, data integration and filtering) feed a database or data warehouse server. On top of it sit the data mining engine and the pattern evaluation module, both consulting a knowledge base, with a graphical user interface at the top.
Applications

Business domain:
- Market-basket databases
- Financial databases
- Insurance databases
- Telecommunication databases
- Business analytics
- CRM

Defence domain:
- MSDF, ELINT data analysis
- Emitter classification
- Intrusion detection
Business Applications

- Database analysis and decision support
- Market analysis and management: target marketing, customer relationship management, market basket analysis, cross selling, market segmentation
- Fraud detection and management
- Other applications: text mining (news groups, email, documents) and web analysis
Market Analysis & Management

Where are the data sources for analysis? Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies.

Target marketing:
- Find clusters of model customers who share the same characteristics: interests, income level, spending habits, etc.
- Determine customer purchasing patterns over time (e.g. conversion of a single to a joint bank account on marriage)

Cross-market analysis:
- Associations/correlations between product sales
- Prediction based on the association information
Customer profiling: data mining can tell you what types of customers buy what products (clustering or classification).

Identifying customer requirements:
- Identifying the best products for different customers
- Using prediction to find what factors will attract new customers

Providing summary information:
- Various multidimensional summary reports
- Statistical summary information (data central tendency and variation)
Data Mining Techniques

- Clustering
- Classification
- Association rules mining (market basket analysis)
Clustering
Clustering: Basic Idea

Clustering groups a set of data objects into clusters:
- Similar objects within the same cluster
- Dissimilar objects in different clusters

Clustering is unsupervised: no previous categorization is known; it is totally data driven.
Clustering: Example

A good clustering method will produce high quality clusters with:
- High intra-class similarity
- Low inter-class similarity

(The slide shows a scatter plot of points grouped into clusters, with one isolated point marked as an outlier.)
Similarity Computation

The distance between objects is used as the metric. The definition of the distance function is usually different for different types of attributes. It must satisfy the following properties:
- d(i, j) ≥ 0
- d(i, j) = d(j, i)
- d(i, j) ≤ d(i, k) + d(k, j)
Distance Calculation: objects x_i and x_j (p attributes)

Minkowski: d(i, j) = (|x_i1 − x_j1|^q + |x_i2 − x_j2|^q + … + |x_ip − x_jp|^q)^(1/q)
Euclidean:  d(i, j) = sqrt(|x_i1 − x_j1|^2 + |x_i2 − x_j2|^2 + … + |x_ip − x_jp|^2)
Manhattan:  d(i, j) = |x_i1 − x_j1| + |x_i2 − x_j2| + … + |x_ip − x_jp|
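As an illustration, the three distance functions can be sketched in Python (a minimal sketch; objects are plain tuples of p numeric attributes):

```python
from math import sqrt

def minkowski(xi, xj, q):
    # d(i, j) = (sum over the p attributes of |x_ik - x_jk|^q)^(1/q)
    return sum(abs(a - b) ** q for a, b in zip(xi, xj)) ** (1 / q)

def euclidean(xi, xj):
    # Minkowski distance with q = 2
    return sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

def manhattan(xi, xj):
    # Minkowski distance with q = 1
    return sum(abs(a - b) for a, b in zip(xi, xj))
```

For example, for the points (0, 0) and (3, 4) the Euclidean distance is 5 and the Manhattan distance is 7.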
Methods

Partitioning methods:
- Iterative methods
- Convergence criterion specified by the user

Hierarchical methods:
- Agglomerative / divisive
- Use a dendrogram representation
Partitioning Methods: K-Means Clustering

- Decide k, the number of clusters
- Randomly pick k seeds to use as centroids
- Repeat until the stopping condition is met:
  - Scan the database and assign each object to a cluster
  - Recompute the centroids
  - Evaluate the quality of the clustering
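The K-means loop can be sketched in Python (a minimal sketch using squared Euclidean distance; empty clusters are not handled, and the toy points and seeds below are illustrative only, not from the slides):

```python
def kmeans(points, seeds):
    # Basic K-means: assign each object to its nearest centroid,
    # recompute the centroids, repeat until the centroids stop changing.
    centroids = list(seeds)
    while True:
        clusters = [[] for _ in centroids]
        for p in points:
            i = min(range(len(centroids)),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[i].append(p)
        # New centroid of each cluster = attribute-wise mean of its members
        new = [tuple(sum(xs) / len(c) for xs in zip(*c)) for c in clusters]
        if new == centroids:      # stopping condition: no change
            return clusters
        centroids = new

clusters = kmeans([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)],
                  seeds=[(0, 0), (10, 10)])
```

With the two well-separated toy blobs above, the loop converges after one centroid update, returning one cluster per blob.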
Example:

Record  Feature1  Feature2  Feature3  Feature4
L1      3         10        23        36
L2      12        6         12        41
L3      5         12        17        24
L4      4         8         7         13
L5      1         16        1         28
L6      18        0         22        51
L7      6         8         6         12
L8      15        5         2         6
L9      0         10        15        18
L10     9         2         24        15
Initialization

We take the number of cluster centers as 3, i.e. K = 3. Let us take the initial cluster centers as:
- L1 (3, 10, 23, 36)
- L5 (1, 16, 1, 28)
- L8 (15, 5, 2, 6)
Pictorial View of the Clusters After the First Iteration

(The slide shows the ten records grouped into three clusters, with centroids (9.4, 6, 19.6, 33.4), (0.5, 13, 8, 23) and (8.3, 7, 5, 10.3).)
Pictorial View of the Clusters After the Second Iteration

(Centroids: (10.5, 4.5, 20.3, 35.8), (2, 12.7, 11, 23.3) and (8.33, 7, 5, 10.3).)
Pictorial View of the Clusters After the Third Iteration

(Centroids: (10.5, 4.5, 20.3, 35.8), (2, 12.7, 11, 23.3) and (8.33, 7, 5, 10.3).)

The cluster centers remain the same as in the second iteration, so we stop here.
Hierarchical Methods

- Agglomerative methods: bottom-up approach
- Divisive methods: top-down approach
Dendrogram: Agglomerative Approach

Starting from a database of objects A–F, a distance matrix of the pairwise distances d_ab, d_ac, …, d_ef is computed; the closest objects and clusters are then merged step by step (e.g. {A, B, C} and {E, F}, then {A, B, C, D}), producing a dendrogram.
Divisive Approach

Starting from the whole database of objects A–F, clusters are split step by step (e.g. into {A, B, C, D} and {E, F}, then {A, B, C}) until each object stands alone.
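The bottom-up (agglomerative) idea can be sketched in Python. This is a minimal sketch that assumes single-linkage merging and Euclidean distance (the slides do not specify a linkage), uses a hypothetical set of labelled points, and merges until k clusters remain rather than recording the full dendrogram:

```python
def agglomerative(points, k):
    # Start with every object in its own cluster, then repeatedly merge the
    # two closest clusters (single linkage) until only k clusters remain.
    clusters = [[name] for name in points]

    def d(a, b):
        return sum((x - y) ** 2 for x, y in zip(points[a], points[b])) ** 0.5

    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members
                link = min(d(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or link < best[0]:
                    best = (link, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)
    return clusters

points = {"A": (0, 0), "B": (0, 1), "C": (1, 0),
          "D": (10, 10), "E": (10, 11), "F": (11, 10)}
clusters = agglomerative(points, 2)
```

For the two toy groups above, the merges recover {A, B, C} and {D, E, F}.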
Clustering: Applications

- Marketing management: discover distinct groups in customer bases, and then use this knowledge to develop targeted marketing programs
- Banking: ATM location identification
- Text mining: grouping documents with similar characteristics
Clustering Companies using the Dow Jones Index
Trading System Development
Clustering for Customer Profiling
New Product Line Development
Crime Hot Spot Analysis
Clustering for Medical Diagnostics

- Human Genome Project: finding relationships between diseases, cellular functions and drugs
- Wisconsin Breast Cancer Study: cancer diagnosis and prediction
Classification
It is easy to agree that these are sunset pictures, and that these are all letter As (handwritten characters from the NIST database). In most cases it is easy for experts to attach class labels, but difficult to explain why!
Classification

- A supervised learning method
- Uses historical data to construct a model (hypothesis formulation)
- Discovers the relationship between input attributes and the target
- Uses the model for prediction

Major classification methods:
- Decision trees (ID3, CART, C4.5, SLIQ)
- Neural networks (MLP)
- Support vector machines
- Bayesian classifiers (NBC, BBN)
- K-nearest neighbor (KNN)
The Classification Task

- Input: a training set of tuples, each labelled with one class label
- Output: a model (classifier) which assigns a class label to each tuple based on the other attributes
- The model can be used to predict the class of new tuples for which the class label is missing or unknown
Training Step

Training data:

NAME   AGE      INCOME  CREDIT
Mary   20 - 30  low     poor
James  30 - 40  low     fair
Bill   30 - 40  high    good
John   20 - 30  med     fair
Marc   40 - 50  high    good
Annie  40 - 50  high    good

A classification algorithm builds the classifier (model), e.g.:
IF age = 30 - 40 OR income = high THEN credit = good
Test Step

Test data are fed to the classifier (model), and the predicted labels are compared with the actual ones:

NAME   AGE      INCOME  CREDIT  PREDICTED
Paul   20 - 30  high    good    good
Jenny  40 - 50  low     fair    fair
Rick   30 - 40  high    fair    good
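The test step can be illustrated by applying the learned rule to the test tuples (a minimal sketch; treating "fair" as the default class for tuples the rule does not cover is an assumption, not stated on the slide):

```python
def predict_credit(age, income):
    # The model learned in the training step:
    # IF age = 30 - 40 OR income = high THEN credit = good
    if age == "30 - 40" or income == "high":
        return "good"
    return "fair"  # assumed default class for uncovered tuples

test_data = [("Paul", "20 - 30", "high"),
             ("Jenny", "40 - 50", "low"),
             ("Rick", "30 - 40", "high")]
predictions = [predict_credit(age, income) for _, age, income in test_data]
```

The predictions match the slide: Rick's actual label is "fair", so he is the one misclassified tuple.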
Prediction

The classifier (model) assigns labels to unseen data:

NAME  AGE      INCOME  PREDICTED CREDIT
Doc   20 - 30  high    good
Phil  30 - 40  low     good
Kat   40 - 50  med     fair
Classification: Approaches

- Decision tree induction
- Neural networks
- Support vector machines
- Bayesian approaches
- Rule induction
Decision Tree Induction

Recursive partitioning of the training set T until a stopping criterion is satisfied (purity of partition, depth of tree, etc.):
- Decide the split criterion
- Select the splitting attribute
- Partition the data according to the selected attribute
- Apply the induction method recursively to each partition
Decision Tree Inducers

- ID3 (R. J. Quinlan, 1986): simple, uses information gain, no pruning
- C4.5 (R. J. Quinlan, 1993): uses gain ratio, handles numeric attributes and missing values, error-based pruning
- SLIQ (Mehta et al., 1996): scalable, one scan of the database, uses the Gini index
- CART (Breiman et al., 1984): constructs a binary tree, cost-complexity pruning, can generate regression trees
Attribute Selection Criteria

- Information gain: Entropy(C, S) = − Σᵢ pᵢ log(pᵢ); gain = Entropy(before split) − Entropy(after split)
- Gain ratio: information gain / Entropy(before split)
- Gini index (measures divergence): Gini(C, S) = 1 − Σᵢ pᵢ²
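These criteria can be sketched in Python (a minimal sketch; base-2 logarithms are used, a common convention the slide leaves unspecified):

```python
from math import log2

def entropy(labels):
    # Entropy(C, S) = -sum p_i log2(p_i) over the class proportions p_i
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in probs)

def gini(labels):
    # Gini(C, S) = 1 - sum p_i^2
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def information_gain(parent, partitions):
    # Entropy before the split minus the weighted entropy after the split
    n = len(parent)
    after = sum(len(part) / n * entropy(part) for part in partitions)
    return entropy(parent) - after
```

On the play-tennis data (9 P, 5 N), splitting on outlook gives partitions [P,P,N,N,N] (sunny), [P,P,P,P] (overcast) and [P,P,P,N,N] (rain), with a gain of about 0.25 bits.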
Classical example: play tennis?

Training set from Quinlan's book:

outlook   temperature  humidity  windy  class
sunny     hot          high      false  N
sunny     hot          high      true   N
overcast  hot          high      false  P
rain      mild         high      false  P
rain      cool         normal    false  P
rain      cool         normal    true   N
overcast  cool         normal    true   P
sunny     mild         high      false  N
sunny     cool         normal    false  P
rain      mild         normal    false  P
sunny     mild         normal    true   P
overcast  mild         high      true   P
overcast  hot          normal    false  P
rain      mild         high      true   N
Decision tree obtained with ID3 (Quinlan 86):

outlook?
- sunny → humidity? (high → N, normal → P)
- overcast → P
- rain → windy? (true → N, false → P)
From Decision Trees to Classification Rules

One rule is generated for each path in the tree from the root to a leaf. Rules are generally simpler to understand than trees. For the play-tennis tree:

outlook?
- sunny → humidity? (high → N, normal → P)
- overcast → P
- rain → windy? (true → N, false → P)

Example rule: IF outlook = sunny AND humidity = normal THEN play tennis
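The path-to-rule conversion can be sketched in Python (a minimal sketch; the nested-dict encoding of the play-tennis tree is a hypothetical representation chosen for illustration, not a standard format):

```python
# Hypothetical nested-dict encoding of the ID3 play-tennis tree:
# an internal node is {attribute: {value: subtree}}, a leaf is a class label.
tree = {"outlook": {
    "sunny": {"humidity": {"high": "N", "normal": "P"}},
    "overcast": "P",
    "rain": {"windy": {"true": "N", "false": "P"}},
}}

def tree_to_rules(node, conditions=()):
    # One rule per root-to-leaf path
    if not isinstance(node, dict):
        cond = " AND ".join(f"{a}={v}" for a, v in conditions)
        return [f"IF {cond} THEN class={node}"]
    (attr, branches), = node.items()
    rules = []
    for value, child in branches.items():
        rules += tree_to_rules(child, conditions + ((attr, value),))
    return rules

rules = tree_to_rules(tree)
```

The five leaves of the tree yield five rules, including the one on the slide (with P standing for "play tennis").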
Advantages & Limitations

Advantages:
- Self-explanatory
- Handles both numeric and categorical data
- Non-parametric method

Limitations:
- Most algorithms predict only categorical attributes
- Overtraining: need for pruning
- Trees can grow large
Bayesian Classification

The classification problem may be formalized using a-posteriori probabilities: P(C|X) is the probability that the sample tuple X = <x1, …, xk> is of class C. E.g. P(class = N | outlook = sunny, windy = true, …).

Idea: assign to sample X the class label C such that P(C|X) is maximal.
Estimating A-Posteriori Probabilities

Bayes theorem: P(C|X) = P(X|C) P(C) / P(X)
- P(X) is constant for all classes
- P(C) = relative frequency of class C samples
- Choosing C such that P(C|X) is maximum amounts to choosing C such that P(X|C) P(C) is maximum

Problem: computing P(X|C) directly is unfeasible!
Naive Bayesian Classification

Naive assumption: attribute independence, so P(x1, …, xk|C) = P(x1|C) · … · P(xk|C).
- If the i-th attribute is categorical: P(xi|C) is estimated as the relative frequency of samples having value xi as the i-th attribute in class C
- If the i-th attribute is continuous: P(xi|C) is estimated through a Gaussian density function

Computationally easy in both cases.
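A minimal naive Bayes sketch on the play-tennis training set (relative-frequency estimates only, as on the slide; no smoothing for unseen values):

```python
from collections import Counter, defaultdict

# Play-tennis training set: (outlook, temperature, humidity, windy) -> class
data = [
    ("sunny", "hot", "high", "false", "N"), ("sunny", "hot", "high", "true", "N"),
    ("overcast", "hot", "high", "false", "P"), ("rain", "mild", "high", "false", "P"),
    ("rain", "cool", "normal", "false", "P"), ("rain", "cool", "normal", "true", "N"),
    ("overcast", "cool", "normal", "true", "P"), ("sunny", "mild", "high", "false", "N"),
    ("sunny", "cool", "normal", "false", "P"), ("rain", "mild", "normal", "false", "P"),
    ("sunny", "mild", "normal", "true", "P"), ("overcast", "mild", "high", "true", "P"),
    ("overcast", "hot", "normal", "false", "P"), ("rain", "mild", "high", "true", "N"),
]

class_counts = Counter(row[-1] for row in data)
cond = defaultdict(Counter)          # (attribute index, class) -> value counts
for *attrs, c in data:
    for i, v in enumerate(attrs):
        cond[(i, c)][v] += 1

def predict(x):
    # argmax over C of P(C) * prod_i P(x_i | C)
    def score(c):
        p = class_counts[c] / len(data)
        for i, v in enumerate(x):
            p *= cond[(i, c)][v] / class_counts[c]
        return p
    return max(class_counts, key=score)
```

For the classic query (sunny, cool, high, true) the N score (about 0.021) beats the P score (about 0.005), so the classifier predicts "don't play".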
Cross-Validation (if the data set is not so large)

Repeat 10 times: split the available examples into a training set (90%) and a test set (10%), develop a tree from the training set, and tabulate its accuracy on the test set. Generalization is reported as the mean and standard deviation of the 10 accuracies.
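The 10-fold split behind this procedure can be sketched in Python (a minimal sketch; the examples are assumed to be already shuffled):

```python
def ten_fold_splits(examples):
    # Partition the examples into 10 folds; each fold serves once as the
    # 10% test set while the remaining 9 folds form the 90% training set.
    folds = [examples[i::10] for i in range(10)]
    for i, test in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

splits = list(ten_fold_splits(list(range(20))))
```

A model would be trained and scored once per split, and the 10 accuracies averaged.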
Classification: Applications

- Bank loan granting systems: decision trees constructed from bank-loan histories to produce algorithms that decide whether to grant a loan or not
- Anti money laundering systems: KYC status
- Email classification systems: spam or not
Stock Research Applications

- Efficient prediction of option prices using machine learning techniques: prediction of both European and American option prices using general regression neural networks and support vector regression
- Stock portfolio management, prediction of risk using text classification: prediction or classification of the risk of investing in a particular company by text classification using Naive Bayes (NB) and K-nearest neighbor (KNN)
- Prediction of financial data series using the MATLAB GARCH Toolbox
Pattern Recognition: Artificial Neural Network Applications

- Letter recognition systems
- Zip code identification systems (Apple's Newton uses a neural net)
- Speech recognition systems: voice dialing
- Image processing
- Bioinformatics
Emitter Classification

- ELINT data analysis
- Identification of radars and platforms
- DAPR software successfully delivered to the Indian Navy (INTEG)
Although the spoken words are the same (the same text read by speaker 1 and speaker 2), the recorded digital signals are very different!
Pattern Recognition Example

A noisy image is mapped to the recognized pattern.
Association Rule Mining (Market Basket Analysis)
Association Rules

Intra-record links: finding associations among sets of objects in transaction databases and relational databases.

Rule form: Antecedent ⇒ Consequent [support, confidence].

Examples:
- shirt, tie, socks ⇒ shoes [0.5%, 60%]
- white bread, butter ⇒ egg [2.3%, 80%]
Preliminaries

Given: (1) a database of transactions; (2) each transaction is a list of items.
Find: all rules that correlate the presence of one set of items with that of another set of items. E.g., 95% of people who purchase a PC and a color printer also purchase a computer table.

Business questions:
- * ⇒ electronic items (what should the store do to boost the sale of electronic items?)
- Herbal health products ⇒ * (what other products should the store stock up?)
Formal Definition

If X and Y are two itemsets such that X ∩ Y = ∅, then for an association rule X ⇒ Y:
- Support is the probability that X and Y occur together: P(X ∪ Y)
- Confidence is the conditional probability that Y occurs in a transaction, given that X is present in the same transaction: P(Y|X)
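Support and confidence can be computed directly from a transaction list (a minimal sketch):

```python
def support(transactions, itemset):
    # P(X ∪ Y): fraction of transactions containing every item of the itemset
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    # P(Y | X) = support(X ∪ Y) / support(X)
    return (support(transactions, set(antecedent) | set(consequent))
            / support(transactions, antecedent))

# The five transactions used on the following slides
transactions = [["A", "B", "C"], ["A", "C"], ["A", "C", "D"],
                ["B", "C", "E"], ["A", "C", "E"]]
```

For these transactions, Sup(A) = 80%, confidence of A ⇒ C is 4/4 and of C ⇒ A is 4/5, as computed on the next slides.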
Itemset and Support

Trans ID  Items
10        A, B, C
20        A, C
30        A, C, D
40        B, C, E
50        A, C, E

Sup(A): 4 (80%), Sup(C): 5 (100%), Sup(A, C): 4 (80%), Sup(A, B): 1 (20%), Sup(A, B, C): 1 (20%), Sup(A, B, C, D): 0, Sup(A, B, C, D, E): 0
Confidence

The strength of a discovered rule X ⇒ Y, computed as P(X, Y) / P(X). With Sup(A) = 4, Sup(C) = 5 and Sup(A, C) = 4:
- A ⇒ C: 4/4
- C ⇒ A: 4/5
Interestingness

Minimum support: a user-specified parameter defining the frequent itemsets.
- For a minsup of 50%: F = {A, C, AC}
- For a minsup of 30%: F = {A, B, C, AC, E}

Minimum confidence: report rules that satisfy a minimum confidence level. With a minconf of 50%, some of the discovered rules are A ⇒ C [75%], AB ⇒ C [100%], E ⇒ C [100%], etc.

Trans ID  Items
10        A, B, C
20        A, C
30        A, C, D
40        B, C, E
50        A, C, E
The Apriori Algorithm

The best known algorithm; two steps:
1. Find all itemsets that have minimum support (frequent itemsets, also called large itemsets).
2. Use the frequent itemsets to generate rules.

E.g., a frequent itemset {Chicken, Clothes, Milk} [sup = 3/7], and one rule from the frequent itemset: Clothes ⇒ Milk, Chicken [sup = 3/7, conf = 3/3].
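Step 1 (finding the frequent itemsets level by level) can be sketched in Python. This is a minimal sketch of candidate generation, subset pruning and support counting, not an optimized implementation:

```python
from itertools import combinations

def apriori(transactions, minsup):
    # Level-wise search for all itemsets with support >= minsup
    transactions = [set(t) for t in transactions]
    n = len(transactions)

    def sup(itemset):
        return sum(itemset <= t for t in transactions) / n

    items = sorted({i for t in transactions for i in t})
    frequent = {}
    level = [frozenset([i]) for i in items if sup(frozenset([i])) >= minsup]
    k = 1
    while level:
        frequent.update({s: sup(s) for s in level})
        k += 1
        # Join frequent (k-1)-itemsets, prune candidates with an infrequent
        # (k-1)-subset, then keep those meeting minimum support
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        level = [c for c in candidates
                 if all(frozenset(s) in frequent for s in combinations(c, k - 1))
                 and sup(c) >= minsup]
    return frequent

# Dataset T from the worked example a few slides below (minsup = 50%)
freq = apriori([[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]], 0.5)
```

On this dataset the result matches the worked example: F1 = {1}, {2}, {3}, {5}; F2 = {1,3}, {2,3}, {2,5}, {3,5}; F3 = {2,3,5}.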
Apriori Example

Transactions:

TID  Items
1    Bread, Milk
2    Beer, Diaper, Bread, Eggs
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Bread, Diaper, Milk

1-itemsets:
Itemset  Count
Bread    4
Coke     2
Milk     4
Beer     3
Diaper   4
Eggs     1

2-itemsets:
Itemset          Count
{Bread, Milk}    3
{Bread, Beer}    2
{Bread, Diaper}  3
{Milk, Beer}     2
{Milk, Diaper}   3
{Coke, Diaper}   2
{Milk, Coke}     2
{Beer, Coke}     1
{Bread, Coke}    1
{Beer, Diaper}   3

3-itemsets:
Itemset                Count
{Milk, Coke, Diaper}   2
{Milk, Coke, Beer}     1
{Beer, Milk, Diaper}   2
{Bread, Beer, Diaper}  2
{Bread, Beer, Milk}    1
{Bread, Milk, Diaper}  2
Example: Finding Frequent Itemsets (minsup = 0.5)

Dataset T:
TID   Items
T100  1, 3, 4
T200  2, 3, 5
T300  1, 2, 3, 5
T400  2, 5

(itemset : count)
1. Scan T → C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3 → F1: {1}:2, {2}:3, {3}:3, {5}:3 → C2: {1,2}, {1,3}, {1,5}, {2,3}, {2,5}, {3,5}
2. Scan T → C2: {1,2}:1, {1,3}:2, {1,5}:1, {2,3}:2, {2,5}:3, {3,5}:2 → F2: {1,3}:2, {2,3}:2, {2,5}:3, {3,5}:2 → C3: {2,3,5}
3. Scan T → C3: {2,3,5}:2 → F3: {2,3,5}
Candidate Generation: An Example

F3 = {{1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {1, 3, 5}, {2, 3, 4}}

After the join step: C4 = {{1, 2, 3, 4}, {1, 3, 4, 5}}
After pruning: C4 = {{1, 2, 3, 4}}, because {1, 4, 5} is not in F3 ({1, 3, 4, 5} is removed).
Generating Rules: An Example

Suppose {2, 3, 4} is frequent, with sup = 50%. Its proper nonempty subsets are {2,3}, {2,4}, {3,4}, {2}, {3}, {4}, with sup = 50%, 50%, 75%, 75%, 75%, 75% respectively. These generate the association rules:
- 2,3 ⇒ 4, confidence = 100%
- 2,4 ⇒ 3, confidence = 100%
- 3,4 ⇒ 2, confidence = 67%
- 2 ⇒ 3,4, confidence = 67%
- 3 ⇒ 2,4, confidence = 67%
- 4 ⇒ 2,3, confidence = 67%

All rules have support = 50%.
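Rule generation from a single frequent itemset can be sketched in Python (a minimal sketch; `sup` is assumed to map frozensets to their supports, as produced by step 1 of Apriori):

```python
from itertools import combinations

def rules_from_itemset(itemset, sup):
    # One candidate rule per proper nonempty subset used as antecedent;
    # confidence(X => Y) = sup(X ∪ Y) / sup(X)
    items = frozenset(itemset)
    out = []
    for r in range(1, len(items)):
        for antecedent in map(frozenset, combinations(items, r)):
            conf = sup[items] / sup[antecedent]
            out.append((antecedent, items - antecedent, conf))
    return out

# Supports from the example above
sup = {frozenset({2, 3, 4}): 0.5,
       frozenset({2, 3}): 0.5, frozenset({2, 4}): 0.5, frozenset({3, 4}): 0.75,
       frozenset({2}): 0.75, frozenset({3}): 0.75, frozenset({4}): 0.75}
rules = rules_from_itemset({2, 3, 4}, sup)
```

This reproduces the six rules above; in practice only the rules meeting the minimum confidence would be reported.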
On the Apriori Algorithm

It seems to be very expensive, but:
- It is a level-wise search: with K the size of the largest itemset, it makes at most K passes over the data, and in practice K is bounded (~10)
- The algorithm is very fast; under some conditions, all rules can be found in linear time
- It scales up to large data sets
Association Rules: Applications

- Retail marketing: floor planning, discounting, catalogue design
- Medical diagnosis: comparison of the genotypes of people with/without a condition allowed the discovery of a set of genes that together account for many cases of diabetes
- Geographical information systems
- Link analysis
Walmart Study
Typical Business Decisions (for Walmart)

- What to put on sale?
- How to design coupons?
- How to place merchandise etc. on shelves to maximise profit?
Defence Applications

- Finding associations in terrorist activities (e.g. the 9/11 attack)
- Finding associations in studying the behavior of the enemy during war
- Finding associations which may lead to intrusion threats at strategic locations
List of Papers: Published

1. Robust Approach for Estimating Probabilities in Naive-Bayes Classifier for Gene Expression Data, Elsevier: Expert Systems with Applications, 2010, doi:10.1016/j.eswa.2010.06.076.
2. (Best Paper Award) Ranking Police Administration Units on the Basis of Crime Prevention Measures using Data Envelopment Analysis and Clustering, 6th International Conference on E-Governance (ICEG 2008), 40-53.
3. Towards Situation Awareness in Integrated Air Defence Using Clustering and Case Based Reasoning, Springer: Lecture Notes in Computer Science, 5909, 579-584, 2009.
4. Adaptive Query Interface for Mining Crime Data, Springer: Lecture Notes in Computer Science (LNCS), 4777, 2007, 285-296.
5. Robust Approach for Estimating Probabilities in Naive-Bayes Classifier, Springer: Lecture Notes in Computer Science (LNCS), 4815, 2007, 11-16.
6. A Multivariate Time Series Clustering Approach for Crime Trends Prediction, Proc. of IEEE Systems, Man & Cybernetics, 2008, 892-896.
7. Crime Data Mining for Indian Police Information System, Proc. 5th International Conference on E-Governance (ICEG 2007), 388-397.
8. Clustering with Varying Weights on Types of Crime, ORSI Conference, 2008.
List of Papers (Contd.): Communicated
1. An Efficient Statistical Feature Selection Approach for Classification of Gene Expression Data, Journal of Biomedical Informatics, July 2010, resubmission with minor modification.
2. Towards a Framework of Intelligent Decision Support System for Indian Police, Elsevier: Decision Support Systems, May 2010.
3. A Statistical Approach for Feature Selection and Ranking, Elsevier: Pattern Recognition, June 2010.
4. A Novel Approach for Distance-Based Semi-Supervised Clustering using Functional Link Neural Network, Springer: Soft Computing, June 2010.
5. An Efficient Similarity Measure based Multivariate Time Series Clustering Approach for Performance Analysis, IEEE Systems, Man and Cybernetics, May 2010.
6. Issues and Challenges for Emitter Classification in the Context of Electronic Warfare, Defence Science Journal.
7. A Novel Approach for Weighted Clustering using Hyperlink-Induced Topic Search (HITS) Algorithm, Defence Science Journal.
References
Han and Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann.
Arun Pujari, Data Mining Techniques, University Press.
Hand et al., Principles of Data Mining, PHI.
J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.
J. R. Quinlan, Induction of decision trees, Machine Learning, 1:81-106, 1986.
M. Mehta, R. Agrawal, and J. Rissanen, SLIQ: A fast scalable classifier for data mining, Proc. 1996 Int. Conf. Extending Database Technology (EDBT'96), Avignon, France, March 1996.
R. Agrawal, T. Imielinski, and A. Swami, Mining association rules between sets of items in large databases, SIGMOD'93, 207-216, Washington, D.C.
R. Agrawal and R. Srikant, Fast algorithms for mining association rules, VLDB'94, 487-499, Santiago, Chile.
J. MacQueen, Some methods for classification and analysis of multivariate observations, Proc. 5th Berkeley Symp. on Math. Statist. and Probability, 1, 1967, 281-298.
A. K. Jain, M. N. Murty, P. J. Flynn, Data clustering: a review, ACM Computing Surveys, 31(3), 1999, 264-323.
Questions?