8/8/2019 Data Mining- IMT Nagpur-Manish
Data Mining & Its Business Applications
MANISH GUPTA
Principal Analytics Consultant
Innovation Labs, 24/7 Customer Pvt. Ltd.
Bangalore-560071
(Email: [email protected])
Why Data Mining?

The data explosion problem: automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories.

We are drowning in data, but starving for knowledge! The secret of success in business is knowing what nobody else knows.

Solution: data warehousing and data mining.
What is Data Mining? (Knowledge Discovery in Databases)

Definition: extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases.
Data Mining vs. DBMS-SQL

DBMS-SQL: queries based on the data held. Examples:
- Last month's sales for each product
- Sales grouped by customer age, etc.
- List of customers whose policies lapsed

Data mining: infers knowledge from the data to answer queries. Examples:
- What characteristics do customers whose policies have lapsed share?
- Are the sales of this product dependent on the sales of some other product?
Data Mining: Confluence of Multiple Disciplines

Data mining draws on database technology, statistics, machine learning, information science, visualization and other disciplines.
Data Mining and Business Intelligence

The layers of a business intelligence stack, with increasing potential to support business decisions from bottom to top (typical users range from the DBA at the base, through data analysts and business analysts, to the end user at the top):
- Making decisions
- Data presentation: visualization techniques
- Data mining: information discovery
- Data exploration: OLAP, statistical analysis, querying and reporting
- Data warehouses
- Data sources: paper, files, information providers, database systems
Data Mining: A KDD Process

Data mining is the core of the knowledge discovery process:
Databases → data cleaning and data integration → data warehouse → selection of task-relevant data → data mining → pattern evaluation.
Architecture of a Typical Data Mining System

Databases and a data warehouse (populated through data cleaning, data integration and filtering) feed a database or data warehouse server. On top of it sit the data mining engine and the pattern evaluation module, both consulting a knowledge base, with a graphical user interface at the top.
Applications

Business domain:
- Market-basket databases
- Financial databases
- Insurance databases
- Telecommunication databases
- Business analytics
- CRM

Defence domain:
- MSDF, ELINT data analysis
- Emitter classification
- Intrusion detection
Business Applications

- Database analysis and decision support
- Market analysis and management: target marketing, customer relationship management, market basket analysis, cross selling, market segmentation
- Fraud detection and management
- Other applications: text mining (news groups, email, documents) and web analysis
Market Analysis & Management

Where are the data sources for analysis? Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies.

Target marketing:
- Find clusters of model customers who share the same characteristics: interests, income level, spending habits, etc.
- Determine customer purchasing patterns over time (e.g. conversion of a single to a joint bank account on marriage)

Cross-market analysis:
- Associations/correlations between product sales
- Prediction based on the association information
Customer profiling: data mining can tell you what types of customers buy what products (clustering or classification).

Identifying customer requirements:
- Identifying the best products for different customers
- Using prediction to find what factors will attract new customers

Providing summary information:
- Various multidimensional summary reports
- Statistical summary information (data central tendency and variation)
Data Mining Techniques

- Clustering
- Classification
- Association rules mining (market basket analysis)
Clustering
Clustering: Basic Idea

Clustering groups a set of data objects into clusters:
- Similar objects within the same cluster
- Dissimilar objects in different clusters

Clustering is unsupervised: no previous categorization is known; it is totally data driven.
Clustering: Example

A good clustering method will produce high quality clusters with:
- High intra-class similarity
- Low inter-class similarity

(The slide shows a scatter plot of points grouped into clusters, with one isolated point marked as an outlier.)
Similarity Computation

The distance between objects is used as the metric. The definition of the distance function is usually different for different types of attributes. It must satisfy the following properties:
- d(i, j) ≥ 0
- d(i, j) = d(j, i)
- d(i, j) ≤ d(i, k) + d(k, j)
Distance Calculation: objects x_i and x_j (p attributes)

Minkowski: d(i, j) = (|x_i1 − x_j1|^q + |x_i2 − x_j2|^q + … + |x_ip − x_jp|^q)^(1/q)
Euclidean:  d(i, j) = sqrt(|x_i1 − x_j1|^2 + |x_i2 − x_j2|^2 + … + |x_ip − x_jp|^2)
Manhattan:  d(i, j) = |x_i1 − x_j1| + |x_i2 − x_j2| + … + |x_ip − x_jp|
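As an illustration, the three distance functions can be sketched in Python (a minimal sketch; objects are plain tuples of p numeric attributes):

```python
from math import sqrt

def minkowski(xi, xj, q):
    # d(i, j) = (sum over the p attributes of |x_ik - x_jk|^q)^(1/q)
    return sum(abs(a - b) ** q for a, b in zip(xi, xj)) ** (1 / q)

def euclidean(xi, xj):
    # Minkowski distance with q = 2
    return sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

def manhattan(xi, xj):
    # Minkowski distance with q = 1
    return sum(abs(a - b) for a, b in zip(xi, xj))
```

For example, for the points (0, 0) and (3, 4) the Euclidean distance is 5 and the Manhattan distance is 7.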
Methods

Partitioning methods:
- Iterative methods
- Convergence criterion specified by the user

Hierarchical methods:
- Agglomerative / divisive
- Use a dendrogram representation
Partitioning Methods: K-Means Clustering

- Decide k, the number of clusters
- Randomly pick k seeds to use as centroids
- Repeat until the stopping condition is met:
  - Scan the database and assign each object to a cluster
  - Recompute the centroids
  - Evaluate the quality of the clustering
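The K-means loop can be sketched in Python (a minimal sketch using squared Euclidean distance; empty clusters are not handled, and the toy points and seeds below are illustrative only, not from the slides):

```python
def kmeans(points, seeds):
    # Basic K-means: assign each object to its nearest centroid,
    # recompute the centroids, repeat until the centroids stop changing.
    centroids = list(seeds)
    while True:
        clusters = [[] for _ in centroids]
        for p in points:
            i = min(range(len(centroids)),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[i].append(p)
        # New centroid of each cluster = attribute-wise mean of its members
        new = [tuple(sum(xs) / len(c) for xs in zip(*c)) for c in clusters]
        if new == centroids:      # stopping condition: no change
            return clusters
        centroids = new

clusters = kmeans([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)],
                  seeds=[(0, 0), (10, 10)])
```

With the two well-separated toy blobs above, the loop converges after one centroid update, returning one cluster per blob.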
Example:

Record  Feature1  Feature2  Feature3  Feature4
L1      3         10        23        36
L2      12        6         12        41
L3      5         12        17        24
L4      4         8         7         13
L5      1         16        1         28
L6      18        0         22        51
L7      6         8         6         12
L8      15        5         2         6
L9      0         10        15        18
L10     9         2         24        15
Initialization

We take the number of cluster centers as 3, i.e. K = 3. Let us take the initial cluster centers as:
- L1 (3, 10, 23, 36)
- L5 (1, 16, 1, 28)
- L8 (15, 5, 2, 6)
Pictorial View of the Clusters After the First Iteration

(The slide shows the ten records grouped into three clusters, with centroids (9.4, 6, 19.6, 33.4), (0.5, 13, 8, 23) and (8.3, 7, 5, 10.3).)
Pictorial View of the Clusters After the Second Iteration

(Centroids: (10.5, 4.5, 20.3, 35.8), (2, 12.7, 11, 23.3) and (8.33, 7, 5, 10.3).)
Pictorial View of the Clusters After the Third Iteration

(Centroids: (10.5, 4.5, 20.3, 35.8), (2, 12.7, 11, 23.3) and (8.33, 7, 5, 10.3).)

The cluster centers remain the same as in the second iteration, so we stop here.
Hierarchical Methods

- Agglomerative methods: bottom-up approach
- Divisive methods: top-down approach
Dendrogram: Agglomerative Approach

Starting from a database of objects A–F, a distance matrix of the pairwise distances d_ab, d_ac, …, d_ef is computed; the closest objects and clusters are then merged step by step (e.g. {A, B, C} and {E, F}, then {A, B, C, D}), producing a dendrogram.
Divisive Approach

Starting from the whole database of objects A–F, clusters are split step by step (e.g. into {A, B, C, D} and {E, F}, then {A, B, C}) until each object stands alone.
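The bottom-up (agglomerative) idea can be sketched in Python. This is a minimal sketch that assumes single-linkage merging and Euclidean distance (the slides do not specify a linkage), uses a hypothetical set of labelled points, and merges until k clusters remain rather than recording the full dendrogram:

```python
def agglomerative(points, k):
    # Start with every object in its own cluster, then repeatedly merge the
    # two closest clusters (single linkage) until only k clusters remain.
    clusters = [[name] for name in points]

    def d(a, b):
        return sum((x - y) ** 2 for x, y in zip(points[a], points[b])) ** 0.5

    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members
                link = min(d(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or link < best[0]:
                    best = (link, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)
    return clusters

points = {"A": (0, 0), "B": (0, 1), "C": (1, 0),
          "D": (10, 10), "E": (10, 11), "F": (11, 10)}
clusters = agglomerative(points, 2)
```

For the two toy groups above, the merges recover {A, B, C} and {D, E, F}.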
Clustering: Applications

- Marketing management: discover distinct groups in customer bases, and then use this knowledge to develop targeted marketing programs
- Banking: ATM location identification
- Text mining: grouping documents with similar characteristics
Clustering Companies using the Dow Jones Index
Trading System Development
Clustering for Customer Profiling
New Product Line Development
Crime Hot Spot Analysis
Clustering for Medical Diagnostics

- Human Genome Project: finding relationships between diseases, cellular functions and drugs
- Wisconsin Breast Cancer Study: cancer diagnosis and prediction
Classification
It is easy to agree that these are sunset pictures, and that these are all letter As (handwritten characters from the NIST database). In most cases it is easy for experts to attach class labels, but difficult to explain why!
Classification

- A supervised learning method
- Uses historical data to construct a model (hypothesis formulation)
- Discovers the relationship between input attributes and the target
- Uses the model for prediction

Major classification methods:
- Decision trees (ID3, CART, C4.5, SLIQ)
- Neural networks (MLP)
- Support vector machines
- Bayesian classifiers (NBC, BBN)
- K-nearest neighbor (KNN)
The Classification Task

- Input: a training set of tuples, each labelled with one class label
- Output: a model (classifier) which assigns a class label to each tuple based on the other attributes
- The model can be used to predict the class of new tuples for which the class label is missing or unknown
Training Step

Training data:

NAME   AGE      INCOME  CREDIT
Mary   20 - 30  low     poor
James  30 - 40  low     fair
Bill   30 - 40  high    good
John   20 - 30  med     fair
Marc   40 - 50  high    good
Annie  40 - 50  high    good

A classification algorithm builds the classifier (model), e.g.:
IF age = 30 - 40 OR income = high THEN credit = good
Test Step

Test data are fed to the classifier (model), and the predicted labels are compared with the actual ones:

NAME   AGE      INCOME  CREDIT  PREDICTED
Paul   20 - 30  high    good    good
Jenny  40 - 50  low     fair    fair
Rick   30 - 40  high    fair    good
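The test step can be illustrated by applying the learned rule to the test tuples (a minimal sketch; treating "fair" as the default class for tuples the rule does not cover is an assumption, not stated on the slide):

```python
def predict_credit(age, income):
    # The model learned in the training step:
    # IF age = 30 - 40 OR income = high THEN credit = good
    if age == "30 - 40" or income == "high":
        return "good"
    return "fair"  # assumed default class for uncovered tuples

test_data = [("Paul", "20 - 30", "high"),
             ("Jenny", "40 - 50", "low"),
             ("Rick", "30 - 40", "high")]
predictions = [predict_credit(age, income) for _, age, income in test_data]
```

The predictions match the slide: Rick's actual label is "fair", so he is the one misclassified tuple.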
Prediction

The classifier (model) assigns labels to unseen data:

NAME  AGE      INCOME  PREDICTED CREDIT
Doc   20 - 30  high    good
Phil  30 - 40  low     good
Kat   40 - 50  med     fair
Classification: Approaches

- Decision tree induction
- Neural networks
- Support vector machines
- Bayesian approaches
- Rule induction
Decision Tree Induction

Recursive partitioning of the training set T until a stopping criterion is satisfied (purity of partition, depth of tree, etc.):
- Decide the split criterion
- Select the splitting attribute
- Partition the data according to the selected attribute
- Apply the induction method recursively to each partition
Decision Tree Inducers

- ID3 (R. J. Quinlan, 1986): simple, uses information gain, no pruning
- C4.5 (R. J. Quinlan, 1993): uses gain ratio, handles numeric attributes and missing values, error-based pruning
- SLIQ (Mehta et al., 1996): scalable, one scan of the database, uses the Gini index
- CART (Breiman et al., 1984): constructs a binary tree, cost-complexity pruning, can generate regression trees
Attribute Selection Criteria

- Information gain: Entropy(C, S) = − Σᵢ pᵢ log(pᵢ); gain = Entropy(before split) − Entropy(after split)
- Gain ratio: information gain / Entropy(before split)
- Gini index (measures divergence): Gini(C, S) = 1 − Σᵢ pᵢ²
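These criteria can be sketched in Python (a minimal sketch; base-2 logarithms are used, a common convention the slide leaves unspecified):

```python
from math import log2

def entropy(labels):
    # Entropy(C, S) = -sum p_i log2(p_i) over the class proportions p_i
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in probs)

def gini(labels):
    # Gini(C, S) = 1 - sum p_i^2
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def information_gain(parent, partitions):
    # Entropy before the split minus the weighted entropy after the split
    n = len(parent)
    after = sum(len(part) / n * entropy(part) for part in partitions)
    return entropy(parent) - after
```

On the play-tennis data (9 P, 5 N), splitting on outlook gives partitions [P,P,N,N,N] (sunny), [P,P,P,P] (overcast) and [P,P,P,N,N] (rain), with a gain of about 0.25 bits.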
Classical example: play tennis?

Training set from Quinlan's book:

outlook   temperature  humidity  windy  class
sunny     hot          high      false  N
sunny     hot          high      true   N
overcast  hot          high      false  P
rain      mild         high      false  P
rain      cool         normal    false  P
rain      cool         normal    true   N
overcast  cool         normal    true   P
sunny     mild         high      false  N
sunny     cool         normal    false  P
rain      mild         normal    false  P
sunny     mild         normal    true   P
overcast  mild         high      true   P
overcast  hot          normal    false  P
rain      mild         high      true   N
Decision tree obtained with ID3 (Quinlan 86):

outlook?
- sunny → humidity? (high → N, normal → P)
- overcast → P
- rain → windy? (true → N, false → P)
From Decision Trees to Classification Rules

One rule is generated for each path in the tree from the root to a leaf. Rules are generally simpler to understand than trees. For the play-tennis tree:

outlook?
- sunny → humidity? (high → N, normal → P)
- overcast → P
- rain → windy? (true → N, false → P)

Example rule: IF outlook = sunny AND humidity = normal THEN play tennis
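The path-to-rule conversion can be sketched in Python (a minimal sketch; the nested-dict encoding of the play-tennis tree is a hypothetical representation chosen for illustration, not a standard format):

```python
# Hypothetical nested-dict encoding of the ID3 play-tennis tree:
# an internal node is {attribute: {value: subtree}}, a leaf is a class label.
tree = {"outlook": {
    "sunny": {"humidity": {"high": "N", "normal": "P"}},
    "overcast": "P",
    "rain": {"windy": {"true": "N", "false": "P"}},
}}

def tree_to_rules(node, conditions=()):
    # One rule per root-to-leaf path
    if not isinstance(node, dict):
        cond = " AND ".join(f"{a}={v}" for a, v in conditions)
        return [f"IF {cond} THEN class={node}"]
    (attr, branches), = node.items()
    rules = []
    for value, child in branches.items():
        rules += tree_to_rules(child, conditions + ((attr, value),))
    return rules

rules = tree_to_rules(tree)
```

The five leaves of the tree yield five rules, including the one on the slide (with P standing for "play tennis").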
Advantages & Limitations

Advantages:
- Self-explanatory
- Handles both numeric and categorical data
- Non-parametric method

Limitations:
- Most algorithms predict only categorical attributes
- Overtraining: need for pruning
- Trees can grow large
Bayesian Classification

The classification problem may be formalized using a-posteriori probabilities: P(C|X) is the probability that the sample tuple X = <x1, …, xk> is of class C. E.g. P(class = N | outlook = sunny, windy = true, …).

Idea: assign to sample X the class label C such that P(C|X) is maximal.
Estimating A-Posteriori Probabilities

Bayes theorem: P(C|X) = P(X|C) P(C) / P(X)
- P(X) is constant for all classes
- P(C) = relative frequency of class C samples
- Choosing C such that P(C|X) is maximum amounts to choosing C such that P(X|C) P(C) is maximum

Problem: computing P(X|C) directly is unfeasible!
Naive Bayesian Classification

Naive assumption: attribute independence, so P(x1, …, xk|C) = P(x1|C) · … · P(xk|C).
- If the i-th attribute is categorical: P(xi|C) is estimated as the relative frequency of samples having value xi as the i-th attribute in class C
- If the i-th attribute is continuous: P(xi|C) is estimated through a Gaussian density function

Computationally easy in both cases.
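A minimal naive Bayes sketch on the play-tennis training set (relative-frequency estimates only, as on the slide; no smoothing for unseen values):

```python
from collections import Counter, defaultdict

# Play-tennis training set: (outlook, temperature, humidity, windy) -> class
data = [
    ("sunny", "hot", "high", "false", "N"), ("sunny", "hot", "high", "true", "N"),
    ("overcast", "hot", "high", "false", "P"), ("rain", "mild", "high", "false", "P"),
    ("rain", "cool", "normal", "false", "P"), ("rain", "cool", "normal", "true", "N"),
    ("overcast", "cool", "normal", "true", "P"), ("sunny", "mild", "high", "false", "N"),
    ("sunny", "cool", "normal", "false", "P"), ("rain", "mild", "normal", "false", "P"),
    ("sunny", "mild", "normal", "true", "P"), ("overcast", "mild", "high", "true", "P"),
    ("overcast", "hot", "normal", "false", "P"), ("rain", "mild", "high", "true", "N"),
]

class_counts = Counter(row[-1] for row in data)
cond = defaultdict(Counter)          # (attribute index, class) -> value counts
for *attrs, c in data:
    for i, v in enumerate(attrs):
        cond[(i, c)][v] += 1

def predict(x):
    # argmax over C of P(C) * prod_i P(x_i | C)
    def score(c):
        p = class_counts[c] / len(data)
        for i, v in enumerate(x):
            p *= cond[(i, c)][v] / class_counts[c]
        return p
    return max(class_counts, key=score)
```

For the classic query (sunny, cool, high, true) the N score (about 0.021) beats the P score (about 0.005), so the classifier predicts "don't play".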
Cross-Validation (if the data set is not so large)

Repeat 10 times: split the available examples into a training set (90%) and a test set (10%), develop a tree from the training set, and tabulate its accuracy on the test set. Generalization is reported as the mean and standard deviation of the 10 accuracies.
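The 10-fold split behind this procedure can be sketched in Python (a minimal sketch; the examples are assumed to be already shuffled):

```python
def ten_fold_splits(examples):
    # Partition the examples into 10 folds; each fold serves once as the
    # 10% test set while the remaining 9 folds form the 90% training set.
    folds = [examples[i::10] for i in range(10)]
    for i, test in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

splits = list(ten_fold_splits(list(range(20))))
```

A model would be trained and scored once per split, and the 10 accuracies averaged.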
Classification: Applications

- Bank loan granting systems: decision trees constructed from bank-loan histories to produce algorithms that decide whether to grant a loan or not
- Anti money laundering systems: KYC status
- Email classification systems: spam or not
Stock Research Applications

- Efficient prediction of option prices using machine learning techniques: prediction of both European and American option prices using general regression neural networks and support vector regression
- Stock portfolio management, prediction of risk using text classification: prediction or classification of the risk of investing in a particular company by text classification using Naive Bayes (NB) and K-nearest neighbor (KNN)
- Prediction of financial data series using the MATLAB GARCH Toolbox
Pattern Recognition: Artificial Neural Network Applications

- Letter recognition systems
- Zip code identification systems (Apple's Newton uses a neural net)
- Speech recognition systems: voice dialing
- Image processing
- Bioinformatics
Emitter Classification

- ELINT data analysis
- Identification of radars and platforms
- DAPR software successfully delivered to the Indian Navy (INTEG)
Although the spoken words are the same (the same text read by speaker 1 and speaker 2), the recorded digital signals are very different!
Pattern Recognition Example

A noisy image is mapped to the recognized pattern.
Association Rule Mining (Market Basket Analysis)
Association Rules

Intra-record links: finding associations among sets of objects in transaction databases and relational databases.

Rule form: Antecedent ⇒ Consequent [support, confidence].

Examples:
- shirt, tie, socks ⇒ shoes [0.5%, 60%]
- white bread, butter ⇒ egg [2.3%, 80%]
Preliminaries

Given: (1) a database of transactions; (2) each transaction is a list of items.
Find: all rules that correlate the presence of one set of items with that of another set of items. E.g., 95% of people who purchase a PC and a color printer also purchase a computer table.

Business questions:
- * ⇒ electronic items (what should the store do to boost the sale of electronic items?)
- Herbal health products ⇒ * (what other products should the store stock up?)
Formal Definition

If X and Y are two itemsets such that X ∩ Y = ∅, then for an association rule X ⇒ Y:
- Support is the probability that X and Y occur together: P(X ∪ Y)
- Confidence is the conditional probability that Y occurs in a transaction, given that X is present in the same transaction: P(Y|X)
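Support and confidence can be computed directly from a transaction list (a minimal sketch):

```python
def support(transactions, itemset):
    # P(X ∪ Y): fraction of transactions containing every item of the itemset
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    # P(Y | X) = support(X ∪ Y) / support(X)
    return (support(transactions, set(antecedent) | set(consequent))
            / support(transactions, antecedent))

# The five transactions used on the following slides
transactions = [["A", "B", "C"], ["A", "C"], ["A", "C", "D"],
                ["B", "C", "E"], ["A", "C", "E"]]
```

For these transactions, Sup(A) = 80%, confidence of A ⇒ C is 4/4 and of C ⇒ A is 4/5, as computed on the next slides.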
Itemset and Support

Trans ID  Items
10        A, B, C
20        A, C
30        A, C, D
40        B, C, E
50        A, C, E

Sup(A): 4 (80%), Sup(C): 5 (100%), Sup(A, C): 4 (80%), Sup(A, B): 1 (20%), Sup(A, B, C): 1 (20%), Sup(A, B, C, D): 0, Sup(A, B, C, D, E): 0
Confidence

The strength of a discovered rule X ⇒ Y, computed as P(X, Y) / P(X). With Sup(A) = 4, Sup(C) = 5 and Sup(A, C) = 4:
- A ⇒ C: 4/4
- C ⇒ A: 4/5
Interestingness

Minimum support: a user-specified parameter defining the frequent itemsets.
- For a minsup of 50%: F = {A, C, AC}
- For a minsup of 30%: F = {A, B, C, AC, E}

Minimum confidence: report rules that satisfy a minimum confidence level. With a minconf of 50%, some of the discovered rules are A ⇒ C [75%], AB ⇒ C [100%], E ⇒ C [100%], etc.

Trans ID  Items
10        A, B, C
20        A, C
30        A, C, D
40        B, C, E
50        A, C, E
The Apriori Algorithm

The best known algorithm; two steps:
1. Find all itemsets that have minimum support (frequent itemsets, also called large itemsets).
2. Use the frequent itemsets to generate rules.

E.g., a frequent itemset {Chicken, Clothes, Milk} [sup = 3/7], and one rule from the frequent itemset: Clothes ⇒ Milk, Chicken [sup = 3/7, conf = 3/3].
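Step 1 (finding the frequent itemsets level by level) can be sketched in Python. This is a minimal sketch of candidate generation, subset pruning and support counting, not an optimized implementation:

```python
from itertools import combinations

def apriori(transactions, minsup):
    # Level-wise search for all itemsets with support >= minsup
    transactions = [set(t) for t in transactions]
    n = len(transactions)

    def sup(itemset):
        return sum(itemset <= t for t in transactions) / n

    items = sorted({i for t in transactions for i in t})
    frequent = {}
    level = [frozenset([i]) for i in items if sup(frozenset([i])) >= minsup]
    k = 1
    while level:
        frequent.update({s: sup(s) for s in level})
        k += 1
        # Join frequent (k-1)-itemsets, prune candidates with an infrequent
        # (k-1)-subset, then keep those meeting minimum support
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        level = [c for c in candidates
                 if all(frozenset(s) in frequent for s in combinations(c, k - 1))
                 and sup(c) >= minsup]
    return frequent

# Dataset T from the worked example a few slides below (minsup = 50%)
freq = apriori([[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]], 0.5)
```

On this dataset the result matches the worked example: F1 = {1}, {2}, {3}, {5}; F2 = {1,3}, {2,3}, {2,5}, {3,5}; F3 = {2,3,5}.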
Apriori Example

Transactions:

TID  Items
1    Bread, Milk
2    Beer, Diaper, Bread, Eggs
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Bread, Diaper, Milk

1-itemsets:
Itemset  Count
Bread    4
Coke     2
Milk     4
Beer     3
Diaper   4
Eggs     1

2-itemsets:
Itemset          Count
{Bread, Milk}    3
{Bread, Beer}    2
{Bread, Diaper}  3
{Milk, Beer}     2
{Milk, Diaper}   3
{Coke, Diaper}   2
{Milk, Coke}     2
{Beer, Coke}     1
{Bread, Coke}    1
{Beer, Diaper}   3

3-itemsets:
Itemset                Count
{Milk, Coke, Diaper}   2
{Milk, Coke, Beer}     1
{Beer, Milk, Diaper}   2
{Bread, Beer, Diaper}  2
{Bread, Beer, Milk}    1
{Bread, Milk, Diaper}  2
Example: Finding Frequent Itemsets (minsup = 0.5)

Dataset T:
TID   Items
T100  1, 3, 4
T200  2, 3, 5
T300  1, 2, 3, 5
T400  2, 5

(itemset : count)
1. Scan T → C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3 → F1: {1}:2, {2}:3, {3}:3, {5}:3 → C2: {1,2}, {1,3}, {1,5}, {2,3}, {2,5}, {3,5}
2. Scan T → C2: {1,2}:1, {1,3}:2, {1,5}:1, {2,3}:2, {2,5}:3, {3,5}:2 → F2: {1,3}:2, {2,3}:2, {2,5}:3, {3,5}:2 → C3: {2,3,5}
3. Scan T → C3: {2,3,5}:2 → F3: {2,3,5}
Candidate Generation: An Example

F3 = {{1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {1, 3, 5}, {2, 3, 4}}

After the join step: C4 = {{1, 2, 3, 4}, {1, 3, 4, 5}}
After pruning: C4 = {{1, 2, 3, 4}}, because {1, 4, 5} is not in F3 ({1, 3, 4, 5} is removed).
Generating Rules: An Example

Suppose {2, 3, 4} is frequent, with sup = 50%. Its proper nonempty subsets are {2,3}, {2,4}, {3,4}, {2}, {3}, {4}, with sup = 50%, 50%, 75%, 75%, 75%, 75% respectively. These generate the association rules:
- 2,3 ⇒ 4, confidence = 100%
- 2,4 ⇒ 3, confidence = 100%
- 3,4 ⇒ 2, confidence = 67%
- 2 ⇒ 3,4, confidence = 67%
- 3 ⇒ 2,4, confidence = 67%
- 4 ⇒ 2,3, confidence = 67%

All rules have support = 50%.
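Rule generation from a single frequent itemset can be sketched in Python (a minimal sketch; `sup` is assumed to map frozensets to their supports, as produced by step 1 of Apriori):

```python
from itertools import combinations

def rules_from_itemset(itemset, sup):
    # One candidate rule per proper nonempty subset used as antecedent;
    # confidence(X => Y) = sup(X ∪ Y) / sup(X)
    items = frozenset(itemset)
    out = []
    for r in range(1, len(items)):
        for antecedent in map(frozenset, combinations(items, r)):
            conf = sup[items] / sup[antecedent]
            out.append((antecedent, items - antecedent, conf))
    return out

# Supports from the example above
sup = {frozenset({2, 3, 4}): 0.5,
       frozenset({2, 3}): 0.5, frozenset({2, 4}): 0.5, frozenset({3, 4}): 0.75,
       frozenset({2}): 0.75, frozenset({3}): 0.75, frozenset({4}): 0.75}
rules = rules_from_itemset({2, 3, 4}, sup)
```

This reproduces the six rules above; in practice only the rules meeting the minimum confidence would be reported.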
On the Apriori Algorithm

It seems to be very expensive, but:
- It is a level-wise search: with K the size of the largest itemset, it makes at most K passes over the data, and in practice K is bounded (~10)
- The algorithm is very fast; under some conditions, all rules can be found in linear time
- It scales up to large data sets
Association Rules: Applications

- Retail marketing: floor planning, discounting, catalogue design
- Medical diagnosis: comparison of the genotypes of people with/without a condition allowed the discovery of a set of genes that together account for many cases of diabetes
- Geographical information systems
- Link analysis
Walmart Study
Typical Business Decisions (for Walmart)

- What to put on sale?
- How to design coupons?
- How to place merchandise etc. on shelves to maximise profit?
Defence Applications

- Finding associations in terrorist activities (e.g. the 9/11 attack)
- Finding associations in studying the behavior of the enemy during war
- Finding associations which may lead to intrusion threats at strategic locations
List of Papers: Published

1. Robust Approach for Estimating Probabilities in Naive-Bayes Classifier for Gene Expression Data, Elsevier: Expert Systems with Applications, 2010, doi:10.1016/j.eswa.2010.06.076.
2. (Best Paper Award) Ranking Police Administration Units on the Basis of Crime Prevention Measures using Data Envelopment Analysis and Clustering, 6th International Conference on E-Governance (ICEG 2008), 40-53.
3. Towards Situation Awareness in Integrated Air Defence Using Clustering and Case Based Reasoning, Springer: Lecture Notes in Computer Science, 5909, 579-584, 2009.
4. Adaptive Query Interface for Mining Crime Data, Springer: Lecture Notes in Computer Science (LNCS), 4777, 2007, 285-296.
5. Robust Approach for Estimating Probabilities in Naive-Bayes Classifier, Springer: Lecture Notes in Computer Science (LNCS), 4815, 2007, 11-16.
6. A Multivariate Time Series Clustering Approach for Crime Trends Prediction, Proc. of IEEE Systems, Man & Cybernetics, 2008, 892-896.
7. Crime Data Mining for Indian Police Information System, Proc. 5th International Conference on E-Governance (ICEG 2007), 388-397.
8. Clustering with Varying Weights on Types of Crime, ORSI Conference, 2008.
List of Papers (Contd.): Communicated
1. An Efficient Statistical Feature Selection Approach for Classification of Gene Expression Data, Journal of Biomedical Informatics, July 2010, resubmission with minor modification.
2. Towards a Framework of Intelligent Decision Support System for Indian Police, Elsevier: Decision Support Systems, May 2010.
3. A Statistical Approach for Feature Selection and Ranking, Elsevier: Pattern Recognition, June 2010.
4. A Novel Approach for Distance-Based Semi-Supervised Clustering using Functional Link Neural Network, Springer: Soft Computing, June 2010.
5. An Efficient Similarity Measure based Multivariate Time Series Clustering Approach for Performance Analysis, IEEE Systems, Man and Cybernetics, May 2010.
6. Issues and Challenges for Emitter Classification in the Context of Electronic Warfare, Defence Science Journal.
7. A Novel Approach for Weighted Clustering using Hyperlink-Induced Topic Search (HITS) Algorithm, Defence Science Journal.
References
Han and Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann.
Arun Pujari, Data Mining Techniques, University Press.
Hand et al., Principles of Data Mining, PHI.
J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.
J. R. Quinlan, Induction of decision trees, Machine Learning, 1:81-106, 1986.
M. Mehta, R. Agrawal, and J. Rissanen, SLIQ: A fast scalable classifier for data mining, Proc. 1996 Int. Conf. Extending Database Technology (EDBT'96), Avignon, France, March 1996.
R. Agrawal, T. Imielinski, and A. Swami, Mining association rules between sets of items in large databases, SIGMOD'93, 207-216, Washington, D.C.
R. Agrawal and R. Srikant, Fast algorithms for mining association rules, VLDB'94, 487-499, Santiago, Chile.
J. MacQueen, Some methods for classification and analysis of multivariate observations, Proc. 5th Berkeley Symp. on Math. Statist. and Probability, 1, 1967, 281-298.
A. K. Jain, M. N. Murty, P. J. Flynn, Data clustering: a review, ACM Computing Surveys, 31(3), 1999, 264-323.
Questions?