CSE300
Data Mining and its Application and Usage in Medicine
By Radhika
Data Mining and Medicine: History
Past 20 years with relational databases; more dimensions to database queries
Medicine was one of the earliest and most successful areas of data mining
Mid-1800s: London hit by an infectious disease
Two theories:
- Miasma theory: bad air propagated the disease
- Germ theory: the disease was water-borne
Advantages:
- Discovers trends even when we don't understand the reasons
- Can also surface irrelevant patterns that confuse rather than enlighten
- Protects against unaided human inference: patterns provide quantifiable measures and aid human judgment
Data mining finds patterns; when the patterns are persistent and meaningful, we speak of Knowledge Discovery in Data
The Future of Data Mining
10 biggest killers in the US
Data mining = the process of discovery of interesting, meaningful and actionable patterns hidden in large amounts of data
Major Issues in Medical Data Mining
Heterogeneity of medical data:
- Volume and complexity
- Physician's interpretation
- Poor mathematical categorization; no canonical form
- Solution: standard vocabularies, interfaces between different sources of data, integration, design of electronic patient records
Ethical, legal and social issues:
- Data ownership
- Lawsuits
- Privacy and security of human data
- Expected benefits
- Administrative issues
Why Data Preprocessing?
Patient records consist of clinical and lab parameters, and results of particular investigations specific to tasks. The data are often:
- Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
- Noisy: containing errors or outliers
- Inconsistent: containing discrepancies in codes or names
- Temporal: chronic-disease parameters change over time
No quality data, no quality mining results! A data warehouse needs consistent integration of quality data. In the medical domain, handling incomplete, inconsistent or noisy data requires people with domain knowledge.
What is Data Mining? The KDD Process
Databases → Data Cleaning / Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation
From Tables and Spreadsheets to Data Cubes
A data warehouse is based on a multidimensional data model that views data in the form of a data cube
A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions
- Dimension tables, such as item (item_name, brand, type) or time (day, week, month, quarter, year)
- Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables
W. H. Inmon: "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process."
Data Warehouse vs. Heterogeneous DBMS
Data warehouse: update-driven, high performance
- Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis
- Does not contain the most current information
- Query processing does not interfere with processing at local sources
- Stores and integrates historical information
- Supports complex multidimensional queries
Data Warehouse vs. Operational DBMS
OLTP (on-line transaction processing):
- Major task of traditional relational DBMS
- Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
OLAP (on-line analytical processing):
- Major task of data warehouse systems
- Data analysis and decision making
Distinct features (OLTP vs. OLAP):
- User and system orientation: customer vs. market
- Data contents: current, detailed vs. historical, consolidated
- Database design: ER + application vs. star + subject
- View: current, local vs. evolutionary, integrated
- Access patterns: update vs. read-only but complex queries
Why a Separate Data Warehouse?
High performance for both systems:
- DBMS tuned for OLTP: access methods, indexing, concurrency control, recovery
- Warehouse tuned for OLAP: complex OLAP queries, multidimensional view, consolidation
Different functions and different data:
- Missing data: decision support requires historical data which operational DBs do not typically maintain
- Data consolidation: decision support requires consolidation (aggregation, summarization) of data from heterogeneous sources
- Data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled
Typical OLAP Operations
- Roll up (drill-up): summarize data by climbing up a hierarchy or by dimension reduction
- Drill down (roll down): reverse of roll-up; from a higher-level summary to a lower-level summary or detailed data, or introducing new dimensions
- Slice and dice: project and select
- Pivot (rotate): reorient the cube; visualization; 3D to a series of 2D planes
Other operations:
- Drill across: involving (across) more than one fact table
- Drill through: through the bottom level of the cube to its back-end relational tables (using SQL)
Multi-Tiered Architecture
[Figure: multi-tiered warehouse architecture — Data Sources (operational DBs, other sources) feed an Extract/Transform/Load/Refresh layer with a Monitor & Integrator and a Metadata repository; Data Storage holds the Data Warehouse and Data Marts; an OLAP Server/Engine serves Front-End Tools for analysis, query, reports and data mining]
Steps of a KDD Process
- Learning the application domain: relevant prior knowledge and goals of the application
- Creating a target data set: data selection
- Data cleaning and preprocessing (may take 60% of the effort!)
- Data reduction and transformation: find useful features; dimensionality/variable reduction; invariant representation
- Choosing the functions of data mining: summarization, classification, regression, association, clustering
- Choosing the mining algorithm(s)
- Data mining: search for patterns of interest
- Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.
- Use of discovered knowledge
Common Techniques in Data Mining
Predictive data mining (most important):
- Classification: relate one set of variables in the data to response variables
- Regression: estimate some continuous value
Descriptive data mining:
- Clustering: discovering groups of similar instances
- Association rule extraction
- Summarization of group descriptions (variables/observations)
Leukemia
Different types of cells look very similar. Given a number of samples (patients):
- Can we diagnose the disease accurately?
- Predict the outcome of treatment?
- Recommend the best treatment based on previous treatments?
Solution: data mining on micro-array data
- 38 training patients, 34 testing patients, ~7000 attributes per patient
- 2 classes: Acute Lymphoblastic Leukemia (ALL) vs. Acute Myeloid Leukemia (AML)
Clustering / Instance-Based Learning
Uses specific instances to perform classification rather than general IF-THEN rules
Nearest-neighbor classifier: among the most studied algorithms for medical purposes
Clustering: partitioning a data set into several groups (clusters) such that
- Homogeneity: objects belonging to the same cluster are similar to each other
- Separation: objects belonging to different clusters are dissimilar to each other
Three elements: the set of objects, the set of attributes, and the distance measure
Measuring the Dissimilarity of Objects
Find the best matching instance
Distance function: measures the dissimilarity between a pair of data objects
Things to consider:
- The function is usually very different for interval-scaled, boolean, nominal, ordinal and ratio-scaled variables
- Weights should be associated with different variables based on the application and data semantics
The quality of a clustering result depends on both the distance measure adopted and its implementation
Minkowski Distance
Minkowski distance: a generalization

d(i, j) = (|x_i1 - x_j1|^q + |x_i2 - x_j2|^q + ... + |x_ip - x_jp|^q)^(1/q),  q > 0

- If q = 2, d is the Euclidean distance
- If q = 1, d is the Manhattan distance
Example: for Xi = (1, 7) and Xj = (7, 1), the Manhattan distance (q = 1) is 6 + 6 = 12 and the Euclidean distance (q = 2) is ≈ 8.48
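As an illustration (not part of the original slides), the Minkowski distance can be sketched in a few lines of Python; the point pair Xi = (1, 7), Xj = (7, 1) is the example from the slide:

```python
def minkowski(x, y, q):
    """Minkowski distance between points x and y (q > 0)."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

xi, xj = (1, 7), (7, 1)
print(minkowski(xi, xj, 1))  # Manhattan distance: 12.0
print(minkowski(xi, xj, 2))  # Euclidean distance: ~8.485
```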
Binary Variables
A contingency table for binary data (object i vs. object j):

              Object j
              1    0    sum
Object i  1   a    b    a+b
          0   c    d    c+d
        sum  a+c  b+d    p

Simple matching coefficient:

d(i, j) = (b + c) / (a + b + c + d)
Dissimilarity between Binary Variables: Example

          A1  A2  A3  A4  A5  A6  A7
Object 1   1   0   1   1   1   0   0
Object 2   1   1   1   0   0   0   1

Contingency table (Object 1 vs. Object 2):

              Object 2
              1    0   sum
Object 1  1   2    2    4
          0   2    1    3
        sum   4    3    7

d(O1, O2) = (2 + 2) / (2 + 2 + 2 + 1) = 4/7 ≈ 0.57
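A minimal Python sketch (an addition for illustration) that builds the contingency counts and reproduces the 4/7 dissimilarity above:

```python
def binary_dissimilarity(o1, o2):
    """Simple matching dissimilarity d = (b + c) / (a + b + c + d)."""
    a = sum(1 for x, y in zip(o1, o2) if x == 1 and y == 1)  # both 1
    b = sum(1 for x, y in zip(o1, o2) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(o1, o2) if x == 0 and y == 1)
    d = sum(1 for x, y in zip(o1, o2) if x == 0 and y == 0)  # both 0
    return (b + c) / (a + b + c + d)

obj1 = [1, 0, 1, 1, 1, 0, 0]
obj2 = [1, 1, 1, 0, 0, 0, 1]
print(binary_dissimilarity(obj1, obj2))  # 4/7 ≈ 0.571
```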
k-Means Algorithm
Initialization:
- Arbitrarily choose k objects as the initial cluster centers (centroids)
Iterate until no change:
- For each object Oi, calculate the distances between Oi and the k centroids
- (Re)assign Oi to the cluster whose centroid is closest to Oi
- Update the cluster centroids based on the current assignment
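The steps above can be sketched in plain Python (an illustrative addition; the deterministic "first k points" initialization and the toy 2-D points are assumptions, standing in for an arbitrary choice):

```python
def kmeans(points, k, iters=100):
    """Plain k-means on tuples of coordinates."""
    centroids = list(points[:k])  # deterministic stand-in for arbitrary init
    for _ in range(iters):
        # (Re)assign each object to the cluster with the closest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # Update each centroid to the mean of its current cluster
        new = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:  # no change -> converged
            break
        centroids = new
    return centroids

pts = [(1, 1), (9, 9), (1, 2), (9, 8)]
print(sorted(kmeans(pts, 2)))  # [(1.0, 1.5), (9.0, 8.5)]
```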
k-Means Clustering Method
[Figure: four scatter plots showing k-means iterations — current clusters, cluster means computed, objects relocated, new clusters]
Dataset
Data set from the UCI repository: http://kdd.ics.uci.edu/
768 female Pima Indians evaluated for diabetes
After data cleaning: 392 data entries
Hierarchical Clustering
Groups observations based on dissimilarity
Compacts the database into "labels" that represent the observations
Measures of similarity/dissimilarity:
- Euclidean distance
- Manhattan distance
Types of clustering:
- Single link
- Average link
- Complete link
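A minimal sketch of single-link agglomerative clustering (an illustrative addition; the 1-D points are made up, and real implementations work on arbitrary distance matrices):

```python
def single_link(points, k):
    """Agglomerative clustering with single-link (minimum) distance, 1-D."""
    clusters = [[p] for p in points]

    def dist(c1, c2):
        # single link: distance between the two closest members
        return min(abs(a - b) for a in c1 for b in c2)

    while len(clusters) > k:
        # merge the two closest clusters
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)
    return clusters

print(single_link([1, 2, 3, 10, 11, 12], 2))  # [[1, 2, 3], [10, 11, 12]]
```

Swapping `min` for `max` in `dist` would give complete link, and the mean of pairwise distances would give average link.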
Hierarchical Clustering: Comparison
[Figure: the same six points clustered four ways — single-link, complete-link, average-link, and centroid distance — producing different groupings]
Compare Dendrograms
[Figure: dendrograms over observations 1-6 for single-link, complete-link, average-link and centroid distance — the leaf orders and merge heights differ across methods]
Which Distance Measure is Better?
Each method has both advantages and disadvantages; the choice is application-dependent
Single-link:
- Can find irregular-shaped clusters
- Sensitive to outliers
Complete-link, average-link, and centroid distance:
- Robust to outliers
- Tend to break large clusters
- Prefer spherical clusters
Dendrogram from the Dataset
Minimum spanning tree through the observations
The single observation that is last to join the cluster is a patient whose blood pressure is in the bottom quartile, skin thickness is in the bottom quartile and BMI is in the bottom half
Her insulin, however, was the largest, and she is a 59-year-old diabetic
Dendrogram from the Dataset
Maximum dissimilarity between observations in one cluster when compared to another
Dendrogram from the Dataset
Average dissimilarity between observations in one cluster when compared to another
Supervised versus Unsupervised Learning
Supervised learning (classification):
- Supervision: training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
- New data are classified based on the training set
Unsupervised learning (clustering):
- Class labels of the training data are unknown
- Given a set of measurements, observations, etc., we need to establish the existence of classes or clusters in the data
Classification and Prediction
Derive models that use patient-specific information to aid clinical decision making
- A priori decision on predictors and variables to predict
- No method to find predictors that are not present in the data
Numeric response:
- Least squares regression
Categorical response:
- Classification trees
- Neural networks
- Support vector machines
Decision models:
- Prognosis, diagnosis and treatment planning
- Embedded in clinical information systems
Least Squares Regression
Finds a linear function of the predictor variables that minimizes the sum of squared differences with the response
A supervised learning technique
In our dataset: predict insulin from glucose and BMI
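For a single predictor, the least-squares fit has a closed form; the sketch below (an illustrative addition — the glucose/insulin pairs are invented, not from the Pima dataset) fits y = a + b·x:

```python
def least_squares(xs, ys):
    """Fit y = a + b*x minimizing the sum of squared residuals."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b  # intercept, slope

# Hypothetical (glucose, insulin) pairs, just to illustrate the fit
a, b = least_squares([80, 100, 120, 140], [60, 100, 140, 180])
print(a, b)  # -100.0 2.0
```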
Decision Trees
Decision tree:
- Each internal node tests an attribute
- Each branch corresponds to an attribute value
- Each leaf node assigns a classification
ID3 algorithm:
- Uses training objects with known class labels to classify testing objects
- Ranks attributes with an information gain measure
- Builds a tree of minimal height: the least number of tests needed to classify an object
- Used in commercial tools, e.g. Clementine, ASSISTANT
To deal with medical datasets, such tools must handle incomplete data, discretize continuous variables, prune unreliable parts of the tree, and classify the data
Decision Trees
Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm):
- Attributes are categorical (if continuous-valued, they are discretized in advance)
- The tree is constructed in a top-down recursive divide-and-conquer manner
- At the start, all training examples are at the root
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
- Examples are partitioned recursively based on the selected attributes
Training Dataset

      Age      BMI     Hereditary  Vision     Risk of Condition X
P1    <=30     high    no          fair       no
P2    <=30     high    no          excellent  no
P3    >40      high    no          fair       yes
P4    31...40  medium  no          fair       yes
P5    31...40  low     yes         fair       yes
P6    31...40  low     yes         excellent  no
P7    >40      low     yes         excellent  yes
P8    <=30     medium  no          fair       no
P9    <=30     low     yes         fair       yes
P10   31...40  medium  yes         fair       yes
P11   <=30     medium  yes         excellent  yes
P12   >40      medium  no          excellent  yes
P13   >40      high    yes         fair       yes
P14   31...40  medium  no          excellent  no
Construction of a Decision Tree for "Condition X"

[P1..P14] Yes: 9, No: 5
Age?
├─ <=30     [P1,P2,P8,P9,P11] Yes: 2, No: 3 → split on Hereditary
│   ├─ no   [P1,P2,P8]  Yes: 0, No: 3 → NO
│   └─ yes  [P9,P11]    Yes: 2, No: 0 → YES
├─ 31...40  [P4,P5,P6,P10,P14] Yes: 3, No: 2 → split on Vision
│   ├─ excellent  [P6,P14]     Yes: 0, No: 2 → NO
│   └─ fair       [P4,P5,P10]  Yes: 3, No: 0 → YES
└─ >40      [P3,P7,P12,P13] Yes: 4, No: 0 → YES
Entropy and Information Gain
S contains s_i tuples of class C_i for i = {1, ..., m}
Information measure — the information required to classify any arbitrary tuple:

I(s_1, s_2, ..., s_m) = - Σ_{i=1..m} (s_i / s) log2(s_i / s)

Entropy of attribute A with values {a_1, a_2, ..., a_v}:

E(A) = Σ_{j=1..v} ((s_1j + ... + s_mj) / s) · I(s_1j, ..., s_mj)

Information gained by branching on attribute A:

Gain(A) = I(s_1, s_2, ..., s_m) - E(A)
Entropy and Information Gain
Select the attribute with the highest information gain (or greatest entropy reduction)
Such an attribute minimizes the information needed to classify the samples
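These formulas are easy to check numerically. The sketch below (an illustrative addition) computes the gain of the Age attribute on the "Condition X" training data, whose 9-yes/5-no root splits into the class counts [2,3], [3,2] and [4,0] across the three age ranges:

```python
from math import log2

def entropy(counts):
    """I(s_1, ..., s_m) over class counts, skipping empty classes."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def gain(class_totals, partitions):
    """Gain(A) = I(root) - E(A); partitions holds class counts per value of A."""
    n = sum(class_totals)
    e_a = sum(sum(p) / n * entropy(p) for p in partitions)
    return entropy(class_totals) - e_a

# Risk of Condition X: 9 yes / 5 no; Age splits into [2,3], [3,2], [4,0]
print(round(gain([9, 5], [[2, 3], [3, 2], [4, 0]]), 3))  # ≈ 0.247
```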
Rule Induction
IF conditions THEN conclusion (e.g., CN2)
Concept description:
- Characterization: provides a concise and succinct summarization of a given collection of data
- Comparison: provides descriptions comparing two or more collections of data
Uses a training set and a testing set; rules may be imprecise
Predictive accuracy: P / (P + N)
Example Used in a Clinic
A hip-arthroplasty trauma surgeon predicts a patient's long-term clinical status after surgery
Outcome evaluated during follow-ups for 2 years
Two modeling techniques:
- Naïve Bayesian classifier
- Decision trees
Bayesian classifier:
- P(outcome=good) = 11/20 = 0.55
- The probability gets updated as more attributes are considered
- P(timing=good | outcome=good) = 9/11 ≈ 0.818
- P(outcome=bad) = 9/20
- P(timing=good | outcome=bad) = 5/9
Nomogram
Bayesian Classification
Bayesian classifier vs. decision tree:
- Decision tree: predicts the class label
- Bayesian classifier: a statistical classifier; predicts class membership probabilities
Based on Bayes' theorem: estimates the posterior probability
Naïve Bayesian classifier:
- A simple classifier that assumes attribute independence
- High speed when applied to large databases
- Comparable in performance to decision trees
Bayes' Theorem
Let X be a data sample whose class label is unknown
Let H_i be the hypothesis that X belongs to a particular class C_i
P(H_i) is the class prior probability that X belongs to class C_i
- It can be estimated by n_i / n from the training data samples
- n is the total number of training data samples
- n_i is the number of training data samples of class C_i
Formula of Bayes' theorem:

P(H_i | X) = P(X | H_i) P(H_i) / P(X)
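Applying the formula to the numbers from the hip-arthroplasty slide (an illustrative addition; P(X) is obtained by total probability over the two outcomes):

```python
def posterior(prior, likelihood, evidence):
    """Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)."""
    return likelihood * prior / evidence

# Hip-arthroplasty slide numbers
p_good, p_bad = 11 / 20, 9 / 20
l_good, l_bad = 9 / 11, 5 / 9            # P(timing=good | outcome)
p_x = l_good * p_good + l_bad * p_bad    # P(timing=good), total probability
print(posterior(p_good, l_good, p_x))    # P(good | timing=good) = 9/14 ≈ 0.643
```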
More Classification Techniques
Neural networks:
- Similar to the pattern-recognition properties of biological systems
- Most frequently used: multi-layer perceptrons (input with bias, connected by weights to hidden and output layers)
- Backpropagation neural networks
Support vector machines:
- Separate the database into mutually exclusive regions
- Transform to another problem space via kernel functions (dot products)
- The output for new points is predicted by their position
Comparison with classification trees:
- It is not possible to know which features or combinations of features most influence a prediction
Multilayer Perceptrons
Apply non-linear transfer functions to weighted sums of inputs
Werbos algorithm:
- Start with random weights
- Use a training set and a testing set
Support Vector Machines
Three steps:
- Support vector creation
- The maximal distance between points is found
- A perpendicular decision boundary is drawn
Allows some points to be misclassified
Pima Indian data with X1 (glucose) and X2 (BMI)
What is Association Rule Mining?
Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories
Example of an association rule: {High LDL, Low HDL} → {Heart Failure}

PatientID  Conditions
1          High LDL, Low HDL, High BMI, Heart Failure
2          High LDL, Low HDL, Heart Failure, Diabetes
3          Diabetes
4          High LDL, Low HDL, Heart Failure
5          High BMI, High LDL, Low HDL, Heart Failure

People who have high LDL ("bad" cholesterol) and low HDL ("good" cholesterol) are at higher risk of heart failure.
Association Rule Mining
Market basket analysis:
- Groups of items bought together are placed together
Healthcare:
- Understanding associations among patients with demands for similar treatments and services
Goal: find items for which the joint probability of occurrence is high
- A basket is a set of binary-valued variables
- The results form association rules, augmented with support and confidence
Association Rule Mining
Association rule: an implication expression of the form X → Y, where X and Y are itemsets and X ∩ Y = ∅
Rule evaluation metrics:
- Support (s): the fraction of transactions that contain both X and Y

  s = P(X ∪ Y) = #trans containing (X ∪ Y) / #trans in D

- Confidence (c): measures how often items in Y appear in transactions that contain X

  c = P(Y | X) = #trans containing (X ∪ Y) / #trans containing X

[Figure: Venn diagram over transaction set D — transactions containing X, transactions containing Y, and their overlap containing both X and Y]
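Computing both metrics on the five-patient table from the previous slide (an illustrative addition; the item names are shortened identifiers):

```python
transactions = [
    {"high_ldl_low_hdl", "high_bmi", "heart_failure"},
    {"high_ldl_low_hdl", "heart_failure", "diabetes"},
    {"diabetes"},
    {"high_ldl_low_hdl", "heart_failure"},
    {"high_bmi", "high_ldl_low_hdl", "heart_failure"},
]

def support(X, Y, db):
    # fraction of transactions containing both X and Y
    return sum(1 for t in db if X | Y <= t) / len(db)

def confidence(X, Y, db):
    # of the transactions containing X, how many also contain Y
    return sum(1 for t in db if X | Y <= t) / sum(1 for t in db if X <= t)

X, Y = {"high_ldl_low_hdl"}, {"heart_failure"}
print(support(X, Y, transactions), confidence(X, Y, transactions))  # 0.8 1.0
```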
The Apriori Algorithm
- Starts with the most frequent 1-itemsets
- Includes only those "items" that pass the support threshold
- Uses the 1-itemsets to generate 2-itemsets, and so on
- Stops when the threshold is not satisfied by any itemset

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do
    // Candidate generation:
    Ck+1 = candidates generated from Lk;
    // Candidate counting:
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_sup;
return ∪k Lk;
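A runnable sketch of the same loop (an illustrative addition; it omits the explicit subset-pruning step — infrequent candidates are simply dropped during counting), run on the four-transaction database from the next slide:

```python
from itertools import combinations

def apriori(db, min_sup):
    """Level-wise frequent-itemset mining over a list of transaction sets."""
    candidates = [frozenset([i]) for i in sorted({i for t in db for i in t})]
    frequent = []
    while candidates:
        # Candidate counting: keep itemsets meeting the support threshold
        survivors = [c for c in candidates
                     if sum(1 for t in db if c <= t) / len(db) >= min_sup]
        frequent.extend(survivors)
        # Candidate generation: join k-itemsets into (k+1)-itemsets
        candidates = list({a | b for a, b in combinations(survivors, 2)
                           if len(a | b) == len(a) + 1})
    return frequent

db = [{"a", "c", "d"}, {"b", "c", "e"}, {"a", "b", "c", "e"}, {"b", "e"}]
for s in sorted(apriori(db, 0.5), key=lambda s: (len(s), sorted(s))):
    print(sorted(s))
```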
Apriori-Based Mining: Example (min_sup = 0.5)

Database D:
TID  Items
10   a, c, d
20   b, c, e
30   a, b, c, e
40   b, e

Scan D → 1-candidates with support counts: a:2, b:3, c:3, d:1, e:3
Frequent 1-itemsets (support ≥ 2): a, b, c, e

2-candidates: ab, ac, ae, bc, be, ce
Counting: ab:1, ac:2, ae:1, bc:2, be:3, ce:2
Frequent 2-itemsets: ac, bc, be, ce

3-candidates: bce
Scan D → bce:2
Frequent 3-itemsets: bce
Principal Component Analysis
Principal components:
- With a large number of variables, it is highly likely that some subsets of the variables are strongly correlated with each other
- Goal: reduce the number of variables while retaining the variability in the dataset
- PCs are linear combinations of the variables in the database
- The variance of each PC is maximized: display as much of the spread of the original data as possible
- PCs are orthogonal to each other: minimize the overlap between variables
- Each component is normalized so that its sum of squares is unity: easier for mathematical analysis
- Number of PCs < number of variables
- Associations are found; a small number of PCs can explain a large amount of the variance
Example: 768 female Pima Indians evaluated for diabetes. Variables: number of times pregnant, two-hour oral glucose tolerance test (OGTT) plasma glucose, diastolic blood pressure, triceps skin-fold thickness, two-hour serum insulin, BMI, diabetes pedigree function, age, diabetes onset within the last 5 years
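For intuition, the first principal component of 2-D data can be computed by hand from the covariance matrix (an illustrative addition with made-up points; real PCA on the 9-variable Pima data would use an eigendecomposition library):

```python
import math

def pca_2d(data):
    """First principal component (eigenvalue, unit eigenvector) of 2-D data."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    # sample covariance matrix [[sxx, sxy], [sxy, syy]]
    sxx = sum((x - mx) ** 2 for x, _ in data) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in data) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in data) / (n - 1)
    # largest eigenvalue of a symmetric 2x2 matrix
    tr, det = sxx + syy, sxx * syy - sxy ** 2
    lam = tr / 2 + math.sqrt(tr * tr / 4 - det)
    # corresponding eigenvector, normalized to unit length
    v = (sxy, lam - sxx) if sxy else (1.0, 0.0)
    norm = math.hypot(*v)
    return lam, (v[0] / norm, v[1] / norm)

# Points on the line y = x: the first PC points along (1/sqrt(2), 1/sqrt(2))
lam, vec = pca_2d([(1, 1), (2, 2), (3, 3), (4, 4)])
print(lam, vec)
```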
PCA Example
National Cancer Institute
CancerNet: http://www.nci.nih.gov
- CancerNet for Patients and the Public
- CancerNet for Health Professionals
- CancerNet for Basic Researchers
- CancerLit
Conclusion
- About three quarters of a billion people's medical records are electronically available
- Data mining in medicine is distinct from other fields due to the nature of the data: heterogeneous, with ethical, legal and social constraints
- The most commonly used technique is classification and prediction, with different techniques applied for different cases
- Association rules describe the data in the database
- Medical data mining can be the most rewarding, despite the difficulty
Thank you!!!