Challenges and Techniques for Mining Clinical Data
Wesley W. Chu, Laura Yu Chen
Outline
Introduction to SmartRule association rule mining
Case I: mining pregnancy data to discover drug exposure side effects
Case II: mining urology clinical data for operation decision making
SmartRule Features
Generate maximal frequent itemsets (MFIs) directly from tabular data
  Reduce the search space and the support-counting time by taking advantage of column structures
Users select MFIs for rule generation
  Users can select a subset of MFIs that includes certain attributes as targets in rule generation
Derive rules from targeted MFIs
  Efficient support counting by building inverted indices for the collection of itemsets
Hierarchically organize rules into trees and use a spreadsheet to present the rule trees
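The step from targeted MFIs to frequent itemsets (FIs) can be illustrated with a minimal sketch. It relies only on the standard property that every subset of a frequent itemset is itself frequent. The `attribute=value` item encoding, the function names and the sample MFIs are assumptions made for illustration; they are not taken from SmartRule itself.

```python
from itertools import combinations

def select_mfis(mfis, target_attr):
    """Keep only the MFIs that contain an item of the target attribute."""
    return [m for m in mfis
            if any(item.split("=")[0] == target_attr for item in m)]

def frequent_itemsets(mfis):
    """Enumerate every non-empty subset of the given MFIs.

    Each subset of a frequent itemset is frequent, so these are all FIs.
    """
    found = set()
    for mfi in mfis:
        items = sorted(mfi)
        for k in range(1, len(items) + 1):
            found.update(frozenset(c) for c in combinations(items, k))
    return found

# Hypothetical MFIs over "attribute=value" items (illustrative only)
mfis = [
    frozenset({"cita_T3=yes", "alcohol=yes", "preterm=yes"}),
    frozenset({"smoking=yes", "vitamin=yes"}),
]
targeted = select_mfis(mfis, "preterm")   # only the first MFI has the target
fis = frequent_itemsets(targeted)         # 2**3 - 1 = 7 non-empty subsets
print(len(targeted), len(fis))
```

Selecting MFIs before enumerating subsets keeps the number of candidate itemsets small when only a few target attributes are of interest.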
System overview of SmartRule
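[Figure: SmartRule system overview. Components shown: TMaxMiner (computes MFIs from tabular data); InvertCount (derives FIs from MFIs and counts their supports); RuleTree (generates rules and organizes them). Other labels: Data, MFI, FI supports, Rules, Config, domain experts, Excel book. Processing steps are numbered 1 to 6.]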
Computation Complexity
Efficient MFI mining:
  Does not require superset checking
  Gathers past tail information to determine the next node to explore during the mining process
Efficient rule generation:
  Reduces the computation for support counting by building inverted indices
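A minimal sketch of support counting with an inverted index over the collection of itemsets: each item points to the itemsets that contain it, so a single pass over the data updates only the itemsets a row can possibly support. The function name and the toy rows are assumptions for illustration; the actual InvertCount implementation may differ.

```python
from collections import defaultdict

def count_supports(rows, itemsets):
    """Count how many rows contain each (non-empty) itemset.

    rows:     iterable of sets of items
    itemsets: list of frozensets
    """
    # Inverted index: item -> ids of the itemsets containing that item
    index = defaultdict(list)
    for sid, itemset in enumerate(itemsets):
        for item in itemset:
            index[item].append(sid)

    supports = [0] * len(itemsets)
    for row in rows:
        hits = defaultdict(int)            # itemset id -> items matched in this row
        for item in set(row):
            for sid in index.get(item, ()):
                hits[sid] += 1
        for sid, matched in hits.items():
            if matched == len(itemsets[sid]):   # every item of the itemset is present
                supports[sid] += 1
    return supports

rows = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a"}]
itemsets = [frozenset("a"), frozenset("ab"), frozenset("abc"), frozenset("bc")]
print(count_supports(rows, itemsets))
```

Itemsets that share no item with a row are never touched for that row, which is where the saving over checking every itemset against every row comes from.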
Scalability
Limitation: a Microsoft Excel spreadsheet holds at most 65,536 rows
When the dataset exceeds the spreadsheet size limit:
  Partition the dataset into multiple groups, each no larger than the maximum spreadsheet size, and derive MFIs for each spreadsheet
  Then join these MFIs to generate association rules
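The partition-and-join idea can be sketched as follows. The slides do not specify how the per-spreadsheet MFIs are joined; this sketch assumes the join is a union that keeps only the maximal itemsets, and the function names are hypothetical. Itemsets that are frequent in one partition are only candidates for the whole dataset, so their supports would still need to be recounted over all rows.

```python
MAX_ROWS = 65536  # row limit of one spreadsheet, as stated on the slide

def partition(rows, size=MAX_ROWS):
    """Split the rows into consecutive groups of at most `size` rows."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]

def join_mfis(mfi_groups):
    """Union the MFIs of all groups and drop any itemset that is a
    proper subset of another one (assumed join strategy)."""
    all_sets = {frozenset(m) for group in mfi_groups for m in group}
    return {s for s in all_sets if not any(s < t for t in all_sets)}

groups = partition(list(range(10)), size=4)
print([len(g) for g in groups])

joined = join_mfis([
    [{"a", "b"}, {"c"}],          # MFIs mined from the first group
    [{"a", "b", "c"}, {"d"}],     # MFIs mined from the second group
])
print(sorted(sorted(s) for s in joined))
```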
Case I: Mining Pregnancy Data
Data set: Danish National Birth Cohort (DNBC)
Dimension: 4455 patients x 20 attributes
Each patient record contains:
  Exposure status: drug type, timing, and sequence of different drugs
  Possible confounders: vitamin intake, smoking, alcohol consumption, socio-economic status and psycho-social stress
  Endpoint: preterm birth, malformations and prenatal complications
Sample Pregnancy Data
Challenges
Problem: discover side effects of drug exposure during pregnancy
  E.g.: study how antidepressants and confounders influence preterm birth of the newborn
Difficulties in finding side effects:
  Only a small number of patients suffer side effects
  Side effects are sensitive to the drug exposure time
  Patients may be exposed to a sequence of multiple drugs
Derive Drug Side Effects via SmartRule (1): low-support, low-confidence rules
Low-support or low-confidence rules can still be significant because of their contrast with normal pregnant women
For example:
  If patients are exposed to cita (citalopram) in the 3rd trimester, then they have preterm birth with support=0.0011, confidence=0.1786
  If patients are not exposed to cita, then they have preterm birth with support=0.0433, confidence=0.0444
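To make the contrast concrete, the sketch below computes support and confidence with their usual definitions (support = fraction of all records matching both sides of the rule; confidence = fraction of the records matching the antecedent that also match the consequent). The toy records are invented for illustration and are not DNBC data; only the last line uses the two confidences quoted on the slide.

```python
def rule_metrics(records, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    antecedent, consequent = frozenset(antecedent), frozenset(consequent)
    n_all = len(records)
    n_ante = sum(1 for r in records if antecedent <= r)
    n_both = sum(1 for r in records if (antecedent | consequent) <= r)
    support = n_both / n_all if n_all else 0.0
    confidence = n_both / n_ante if n_ante else 0.0
    return support, confidence

# Toy cohort (invented): 4 exposed patients, 2 of them preterm;
# 16 unexposed patients, 1 of them preterm.
records = (
    [frozenset({"cita_T3", "preterm"})] * 2
    + [frozenset({"cita_T3"})] * 2
    + [frozenset({"no_cita", "preterm"})] * 1
    + [frozenset({"no_cita"})] * 15
)
exposed = rule_metrics(records, {"cita_T3"}, {"preterm"})
unexposed = rule_metrics(records, {"no_cita"}, {"preterm"})
print(exposed, unexposed)

# Contrast of the two confidences reported on the slide
print(round(0.1786 / 0.0444, 2))   # about 4 times the unexposed rate
```

A rule can therefore have very low support, because few patients are exposed, and still stand out once its confidence is compared with the confidence of the unexposed group.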
Derive Drug Side Effects via SmartRule (2): temporally sensitive rules
Divide the pregnancy period into time slots (e.g. trimesters) and combine drug exposure by time:
  If patients are exposed to cita in the 1st trimester and drink alcohol, then they have preterm birth with support=0.0011 and confidence=0.132
  If patients are exposed to cita in the 2nd trimester and drink alcohol, then they have preterm birth with support=0.0011 and confidence=0.417
  If patients are exposed to cita in the 3rd trimester and drink alcohol, then they have preterm birth with support=0.0009 and confidence=0.364
Time slot division is flexible; the domain user can control the granularity
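One way to combine drug exposure with time is to turn each exposure into an item that carries its time slot. The sketch below is an illustration only: the week boundaries (weeks 1 to 13, 14 to 27, 28 and later), the item naming and the function names are assumptions, not the encoding used in the study.

```python
from bisect import bisect_left

def time_slot(week, boundaries=(13, 27)):
    """Map a gestational week to a 1-based slot number.

    `boundaries` holds the last week of every slot except the final one,
    so the default gives three trimester-like slots.
    """
    return bisect_left(boundaries, week) + 1

def exposure_items(drug, weeks, boundaries=(13, 27)):
    """Turn the weeks of exposure to one drug into time-stamped items."""
    return {f"{drug}_T{time_slot(w, boundaries)}" for w in weeks}

print(sorted(exposure_items("cita", [5, 12, 30])))
# Finer granularity: the same exposure with hypothetical 4-week slots
print(sorted(exposure_items("cita", [5, 12, 30], boundaries=range(4, 40, 4))))
```

Changing `boundaries` is all that is needed to mine at a coarser or finer time granularity.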
Rule Presentation
Hierarchically organize rules into trees
  View general rules and then extend to specific rules
Use a spreadsheet to present the rule trees
  Easy to sort, filter or extend the rule trees to search for interesting rules
1) In general, patients have preterm birth (sup=0.0454, conf=0.0454)
  2) If exposed to cita in the 1st trimester, then preterm birth (sup=0.0016, conf=0.0761)
    6) If exposed to cita in the 1st trimester and drink alcohol, then preterm birth (sup=0.0011, conf=0.132)
  3) If exposed to cita in the 2nd trimester, then preterm birth (sup=0.0013, conf=0.1714)
    7) If exposed to cita in the 2nd trimester and drink alcohol, then preterm birth (sup=0.0011, conf=0.417)
  4) If exposed to cita in the 3rd trimester, then preterm birth (sup=0.0011, conf=0.1786)
    8) If exposed to cita in the 3rd trimester and drink alcohol, then preterm birth (sup=0.0009, conf=0.364)
  5) If no exposure to cita, then preterm birth (sup=0.0433, conf=0.0444)
A part of the rule hierarchy for exposure to the antidepressant citalopram and alcohol at different time periods of pregnancy, with preterm birth as the outcome
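A hierarchy like this one can be built by attaching each rule to the rule whose antecedent is its largest proper subset. The sketch below applies that idea to the eight rules listed above; the parent-selection criterion and the item names are assumptions for illustration, not necessarily how RuleTree organizes rules.

```python
# Antecedents of rules 1 to 8; the consequent is always preterm birth
RULES = {
    1: frozenset(),
    2: frozenset({"cita_T1"}),
    3: frozenset({"cita_T2"}),
    4: frozenset({"cita_T3"}),
    5: frozenset({"no_cita"}),
    6: frozenset({"cita_T1", "alcohol"}),
    7: frozenset({"cita_T2", "alcohol"}),
    8: frozenset({"cita_T3", "alcohol"}),
}

def build_rule_tree(rules):
    """Map each rule id to its parent id (None for the root).

    The parent is the rule with the largest antecedent that is a proper
    subset of the child's antecedent; ties go to the higher rule id.
    """
    parent = {}
    for rid, antecedent in rules.items():
        candidates = [(len(a), pid) for pid, a in rules.items() if a < antecedent]
        parent[rid] = max(candidates)[1] if candidates else None
    return parent

def show(parent, node=None, depth=0):
    """Print the tree with two spaces of indentation per level."""
    for rid in sorted(r for r, p in parent.items() if p == node):
        print("  " * depth + f"rule {rid}")
        show(parent, rid, depth + 1)

parent = build_rule_tree(RULES)
show(parent)
```

With this layout a reader starts at the general rule and only expands the branches whose confidence differs noticeably from the parent's.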
Knowledge Discovery from Data Mining Results
Challenges:
  Examining the vast number of rules manually is too labor-intensive
  Exploring knowledge (rules) without a specific goal
Existing approach: top-down in the rule hierarchy
  Association rules are represented as general rules, summaries and exception rules (GSE patterns). The GSE pattern presents the discovered rules in a hierarchical fashion. Users can browse the hierarchy top-down to find interesting exception rules.
  Because drug side effects occur rarely, the interesting rules are exception rules and reside at the lower levels of the hierarchy. Without user guidance, locating them requires exploring the entire GSE hierarchy.
Reference:
  B. Liu, M. Hu, and W. Hsu, "Multi-level organization and summarization of the discovered rules," Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Aug. 2000, Boston, USA.
  B. Liu, M. Hu, and W. Hsu, "Intuitive representation of decision trees using general rules and exceptions," Proceedings of the Seventeenth National Conference on Artificial Intelligence (AAAI-2000), July 30 - Aug 3, 2000, Austin, Texas, USA.
New effective bottom-up technique to find exception rules
Derive a set of seed attributes from high-confidence rules
  For example, given the high-confidence rule:
    If exposed to Anxio in the pre, in and post time and use tobacco and have symptoms of depression, then have preterm birth with confidence = 0.6
  List of seed attributes: Anxio_pre, Anxio_in, Anxio_post, tobacco and symptoms of depression
Use seed attributes to explore exception rules via the rule hierarchy
  Explore more rules based on these seed attributes in the rule hierarchies
    First look for rules that represent the effect of each single seed attribute on preterm birth
    Then further explore the combinations of multiple seed attributes
[Figure: a high-confidence rule yields seed attributes, which are then explored in the rule hierarchies]
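The bottom-up search can be sketched in two steps: collect seed attributes from rules above a confidence threshold, then revisit the rules built only from those seeds, single attributes first. In the sketch below only the first rule (confidence 0.6) comes from the slide; the other rules and their confidences are placeholder numbers invented for illustration, not results from the study, and the function names are hypothetical.

```python
def seed_attributes(rules, min_conf):
    """Collect the attributes that appear in any rule with conf >= min_conf."""
    seeds = set()
    for antecedent, conf in rules:
        if conf >= min_conf:
            seeds |= set(antecedent)
    return seeds

def explore(rules, seeds):
    """Group rules that use only seed attributes by antecedent size,
    so single-attribute rules are examined before combinations."""
    levels = {}
    for antecedent, conf in rules:
        if antecedent and set(antecedent) <= seeds:
            levels.setdefault(len(antecedent), []).append((antecedent, conf))
    return dict(sorted(levels.items()))

rules = [
    # From the slide: the high-confidence rule
    (("Anxio_pre", "Anxio_in", "Anxio_post", "tobacco", "depression"), 0.6),
    # Placeholder rules with invented confidences
    (("tobacco",), 0.08),
    (("Anxio_in",), 0.10),
    (("Anxio_in", "tobacco"), 0.25),
    (("vitamin",), 0.04),
]
seeds = seed_attributes(rules, min_conf=0.5)
levels = explore(rules, seeds)
for size, group in levels.items():
    print(size, group)
```

Rules on attributes outside the seed set, such as the `vitamin` placeholder, are never visited, which is what spares the user from browsing the whole hierarchy.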
New Findings from Data Mining
Finding: combined exposure to citalopram and alcohol in pregnancy is associated with an increased risk of preterm birth
Not initially discovered by epidemiology study due to the large number of combinations among all the attributes and their values
1) In general, patients have preterm birth (sup=0.0454, conf=0.0454)
  2) If exposed to cita in the 1st trimester, then preterm birth (sup=0.0016, conf=0.0761)
    6) If exposed to cita in the 1st trimester and drink alcohol, then preterm birth (sup=0.0011, conf=0.132)
  3) If exposed to cita in the 2nd trimester, then preterm birth (sup=0.0013, conf=0.1714)
    7) If exposed to cita in the 2nd trimester and drink alcohol, then preterm birth (sup=0.0011, conf=0.417)
  4) If exposed to cita in the 3rd trimester, then preterm birth (sup=0.0011, conf=0.1786)
    8) If exposed to cita in the 3rd trimester and drink alcohol, then preterm birth (sup=0.0009, conf=0.364)
  5) If no exposure to cita, then preterm birth (sup=0.0433, conf=0.0444)
Statistical Analysis vs. Data Mining
Statistical analysis
  Infeasible to test all potential hypotheses for a large number of attributes
  Testing hypotheses with a small sample size has limited statistical power
Data mining
  No hypothesis needed; mines associations in a large dataset with multiple temporal attributes
  Can generate association rules independent of the sample size
  Derives rules with temporal information of drug exposure
Case II: Mining Urology Clinical Data
Data set: urology surgeries performed from 1995 to 2002 at the UCLA Pediatric Urology Clinic
Dimension: 130 patients x 28 attributes
Bladder Body & Bladder Neck
Training Data Attributes
Each patient record contains:
  Pre-operative conditions:
    Demography data: age, gender, etc.
    Patient ambulatory status (A)
    Catheterizing skills (CS)
    Amount of creatinine in the blood (SerumCrPre)
    Leak point pressure (LPP)
    Urodynamics, such as the minimum volume of saline infused into a bladder when its pressure reached 20 cm of water (20%min)
  Type of surgery performed:
    Op-1 Bladder Neck Reconstruction with Augmentation
    Op-2 Bladder Neck Reconstruction without Augmentation
    Op-3 Bladder Neck Closure without Augmentation
    Op-4 Bladder Neck Closure with Augmentation
  Post-op complications: infection, complication, etc.
  Final outcome of the surgery: urine continence, wet or dry
Sample of Urology Clinical Data
Goals and Challenges
Goal:
  Derive a set of rules from the clinical data set (training set) that summarize the outcome based on patients’ pre-op data
  Predict the operation outcome based on a given patient’s pre-op data (test set), and recommend the best operation to perform
Challenge:
  Small sample size, large number of attributes
  Continuous-value attributes such as urodynamics measurements
Data Mining Steps
1. Separate the patients into four groups based on the type of surgery performed
2. In each group, partition the continuous-value attributes into discrete intervals, or cells. Since the sample size is very small, we use a hybrid technique to determine the optimal number of cells and cell sizes.
3. Generate association rules for each patient group based on the partitioned continuous-value attributes
4. For a given patient with a specific set of pre-op conditions, the rules generated from the training set can be used to predict the success or failure rate of a specific operation
Partitioning Continuous Value Attributes
Current approaches to partitioning continuous attributes:
  Domain expert guidance can be biased and inconsistent
  Statistical clustering techniques fail when the training set is small and the number of attributes is large
New hybrid approach:
  Use a data mining technique to select a small set of key attributes
  Use a statistical classification technique to perform the optimal partition (determine the cell sizes and the number of cells) on the small set of key attributes
Hybrid Clustering Technique
Select a small key attribute set (via data mining):
  Use the domain expert partition to perform mining on the training set
  Select a set of key attributes that contribute to rules with high confidence and support
Optimal partition (via statistical classification):
  Use statistical classification techniques (e.g. CART) to determine the optimal number of cells and their corresponding cell sizes for the attributes
Mining optimally partitioned attribute data yields better quality rules
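The partition step can be illustrated with a simplified, single-attribute version of a CART-style split: choose the threshold that minimizes the weighted Gini impurity of the surgery outcome. This is a sketch of the general idea, not the procedure used in the study; the LPP values and outcomes below are invented, and a full CART would apply the split recursively to obtain more than two cells.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 outcome labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(values, labels):
    """Return (threshold, weighted impurity) of the best binary split,
    or (None, impurity of the unsplit data) if no split improves it."""
    pairs = sorted(zip(values, labels))
    xs = [v for v, _ in pairs]
    ys = [y for _, y in pairs]
    n = len(ys)
    best_thr, best_imp = None, gini(ys)
    for i in range(1, n):
        if xs[i] == xs[i - 1]:
            continue                       # cannot split between equal values
        left, right = ys[:i], ys[i:]
        imp = (len(left) * gini(left) + len(right) * gini(right)) / n
        if imp < best_imp:
            best_thr, best_imp = (xs[i - 1] + xs[i]) / 2, imp
    return best_thr, best_imp

# Invented LPP measurements and outcomes (1 = success, 0 = failure)
lpp = [10, 15, 19, 25, 30, 38]
outcome = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_split(lpp, outcome)
print(threshold, impurity)   # cells would be (-inf, 22.0] and (22.0, +inf)
```

Because the search runs on one key attribute at a time, it remains usable when the training set is far too small to cluster all attributes jointly.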
Partition of continuous variables for operations
Partition of continuous variables into the optimal number of discrete intervals (cells) and cell sizes for the four types of operations.

Operation Type 1
Cell#  LPP          SerumCrPre
1      [0, 19]      [0, 0.75]
2      (19, 33.5]   [0.75, 2.2]
3      (33.5, 40]   n/a
4      normal       n/a

Operation Type 2
Cell#  20%min      20%mean     30%min      30%mean     LPP       SerumCrPre
1      [80, 118]   [50, 77]    [100, 170]  [51, 51]    [12, 20]  [0, 0.5]
2      [145, 178]  [88, 104]   [206, 241]  [94, 113]   [24, 36]  [0.7, 1.4]
3      [221, 264]  [135, 135]  n/a         [135, 135]  normal    n/a

Operation Type 3
Cell#  20%min      20%mean     30%min      30%mean     LPP       SerumCrPre
1      [103, 130]  [57, 75]    [129, 157]  [86, 93]    [6, 29]   [0.3, 0.7]
2      [156, 225]  [92, 105]   [188, 223]  [100, 121]  [30, 40]  [1.0, 1.5]

Operation Type 4
Cell#  LPP        20%mean
1      [0, 19]    [0, 33.37]
2      (19, 69]   (33.37, 37.5]
3      normal     (37.5, 52]
4      n/a        (52, 110]
Recommending operation based on rules derived from training set
Transform the patient’s continuous-value pre-op data using the optimal partitions for each operation
Find the set of rules (from the training set) that match the patient’s pre-op data
Compare the matched rules from each operation and recommend the type of surgery that provides the best match
Example: Prediction for Matt
AmbulatoryStatus (A)  CathSkills (CS)  SerumCrPre  20%min  20%mean (M)  30%min  30%mean  LPP  UPP
4                     1                0.5         31      20           50      33       27   unknown
Patient Matt’s pre-operative conditions
      AmbulatoryStatus (A)  CathSkills (CS)  SerumCrPre  20%min  20%mean (M)  30%min  30%mean  LPP
Op-1  4                     1                1           n/a     n/a          n/a     n/a      2
Op-2  4                     1                1           <1      <1           <1      <1       2
Op-3  4                     1                1           <1      <1           <1      <1       1
Op-4  4                     1                n/a         n/a     1            n/a     n/a      2
Patient Matt’s discretized pre-operative conditions. Attributes not used in rule generation are denoted as n/a.
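The discretization can be sketched for the two operations whose cells are contiguous intervals (Op-1 and Op-4). The upper bounds are copied from the partition tables; the function name, the handling of values above the last cell and the omission of the non-numeric "normal" cell are assumptions of this sketch. A value on a shared boundary is assigned to the lower cell.

```python
from bisect import bisect_left

# Upper bounds of the numeric cells for each operation and attribute
UPPER_BOUNDS = {
    "Op-1": {"LPP": [19, 33.5, 40], "SerumCrPre": [0.75, 2.2]},
    "Op-4": {"LPP": [19, 69], "20%mean": [33.37, 37.5, 52, 110]},
}

def cell_of(op, attr, value):
    """Return the 1-based cell number of `value`, "n/a" if the attribute
    is not used for this operation, or None if the value lies above the
    last cell. Lower bounds (0 in the tables) are not checked here."""
    bounds = UPPER_BOUNDS[op].get(attr)
    if bounds is None:
        return "n/a"
    idx = bisect_left(bounds, value)
    return idx + 1 if idx < len(bounds) else None

# Matt's raw pre-op values from the table above
matt = {"SerumCrPre": 0.5, "20%mean": 20, "LPP": 27}
profile = {op: {a: cell_of(op, a, v) for a, v in matt.items()}
           for op in UPPER_BOUNDS}
for op, cells in profile.items():
    print(op, cells)
```

The printed cells agree with the Op-1 and Op-4 rows of the discretized table: LPP falls in cell 2 for both operations, SerumCrPre in cell 1 for Op-1 and 20%mean in cell 1 for Op-4.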
Rule trees selected from the knowledge base that match patient Matt’s pre-op profile
Surgery  Conditions                     Outcome  Support  Support(%)  Confidence
Op-1     CS=1                           Success  10       41.67       0.77
         CS=1 and LPP=2                 Success  3        12.5        0.75
Op-2     CS=1 and LPP=2                 Fail     2        16.67       0.67
         20%min=1 and LPP=2             Fail     2        16.67       0.67
Op-3     CS=1 and SerumCrPre=1          Success  5        50          0.83
         CS=1, SerumCrPre=1 and LPP=1   Success  2        20          1
Op-4     A=4                            Success  14       32.55       0.78
         A=4 and CS=1                   Success  11       25.58       0.79
         A=4, CS=1 and LPP=2            Success  8        18.6        0.8
         A=4, CS=1 and M=1              Success  6        13.95       1
         A=4, CS=1, M=1 and LPP=2       Success  6        13.95       1
Based on the rule tree, we note that Operations 3 and 4 both match patient Matt’s pre-op conditions. However, Operation 4 matches more attributes in Matt’s pre-op conditions than Operation 3. Thus, Operation 4 is more desirable for patient Matt.
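The comparison can be sketched as a small matching routine over the rules in the table. The scoring used here (prefer the Success rule that matches the most attributes, then the higher confidence) is an assumption drawn from the explanation above, not a documented SmartRule procedure; under this simple scoring Op-1 also has matching Success rules, but they cover fewer attributes.

```python
# (surgery, conditions, outcome, confidence), copied from the rule table
RULES = [
    ("Op-1", {"CS": 1}, "Success", 0.77),
    ("Op-1", {"CS": 1, "LPP": 2}, "Success", 0.75),
    ("Op-2", {"CS": 1, "LPP": 2}, "Fail", 0.67),
    ("Op-2", {"20%min": 1, "LPP": 2}, "Fail", 0.67),
    ("Op-3", {"CS": 1, "SerumCrPre": 1}, "Success", 0.83),
    ("Op-3", {"CS": 1, "SerumCrPre": 1, "LPP": 1}, "Success", 1.0),
    ("Op-4", {"A": 4}, "Success", 0.78),
    ("Op-4", {"A": 4, "CS": 1}, "Success", 0.79),
    ("Op-4", {"A": 4, "CS": 1, "LPP": 2}, "Success", 0.8),
    ("Op-4", {"A": 4, "CS": 1, "M": 1}, "Success", 1.0),
    ("Op-4", {"A": 4, "CS": 1, "M": 1, "LPP": 2}, "Success", 1.0),
]

# Matt's discretized profile per operation (only the attributes used in rules)
PROFILE = {
    "Op-1": {"A": 4, "CS": 1, "SerumCrPre": 1, "LPP": 2},
    "Op-2": {"A": 4, "CS": 1, "SerumCrPre": 1, "20%min": "<1", "LPP": 2},
    "Op-3": {"A": 4, "CS": 1, "SerumCrPre": 1, "20%min": "<1", "LPP": 1},
    "Op-4": {"A": 4, "CS": 1, "M": 1, "LPP": 2},
}

def best_success_rules(rules, profile):
    """For each surgery, return (matched attribute count, confidence) of
    the best Success rule whose conditions all hold for the patient."""
    best = {}
    for op, cond, outcome, conf in rules:
        matches = all(profile[op].get(k) == v for k, v in cond.items())
        if outcome == "Success" and matches:
            best[op] = max(best.get(op, (0, 0.0)), (len(cond), conf))
    return best

scores = best_success_rules(RULES, PROFILE)
recommended = max(scores, key=scores.get)
print(scores)
print(recommended)
```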
Representing rules in a hierarchical structure
Rule tree for Op-4 (every rule predicts Success):
A=4 (sup=32.55%, conf=0.78)
  A=4, CS=1 (sup=25.58%, conf=0.79)
    A=4, CS=1, LPP=2 (sup=18.6%, conf=0.8)
    A=4, CS=1, M=1 (sup=13.95%, conf=1)
      A=4, CS=1, M=1, LPP=2 (sup=13.95%, conf=1)
Represent rule trees for Op-4 by spreadsheet
Favorable user feedback in using the spreadsheet interface because of its ease in rule searching and sorting
Lessons learned from mining data with small sample size
For small sample sizes, hybrid clustering yields better results than conventional unsupervised clustering techniques
Hybrid clustering enables us to generate useful rules for small sample sizes, which could not be done using data mining or statistical classification methods alone
Conclusion
Mining pregnancy data:
  Discover drug exposure side effects (association)
  Advantages over traditional statistical approaches:
    Independent of hypotheses
    Independent of the sample size
    Derive rules with temporal information
  Use the seed attribute approach to effectively discover exception rules via the rule hierarchy
Mining urology clinical data:
  Derive association rules based on patients’ pre-op conditions and their operation outcomes according to the different types of operations
  Hybrid clustering technique to derive the optimal partition for continuous-value attributes. This technique is critical for deriving high-quality rules from a small sample with a large number of attributes
Reference
Qinghua Zou, Yu Chen, Wesley W. Chu and Xinchun Lu. Mining association rules from tabular data guided by maximal frequent itemset. Book chapter in "Foundations and Advances in Data Mining", edited by Wesley W. Chu and T.Y. Lin, Springer, 2005.
Yu Chen, Lars Henning Pedersen, Wesley W. Chu and Jorn Olsen. "Drug Exposure Side Effects from Mining Pregnancy Data." In SIGKDD Explorations (Volume 9, Issue 1), June 2007, Special Issue on Data Mining for Health Informatics, Guest Editors: Raymond Ng and Jian Pei.
Q. Zou, W.W. Chu, and B. Lu. SmartMiner: A depth-first search algorithm guided by tail information for mining maximal frequent itemsets. In Proc. of the IEEE Intl. Conf. on Data Mining, 2002.
R. Agrawal and R. Srikant: Fast algorithms for mining association rules. In Proceedings of the 20th VLDB Conference, Santiago, Chile, 1994.
D. Burdick, M. Calimlim, and J. Gehrke: MAFIA: a maximal frequent itemset algorithm for transactional databases. In Intl. Conf. on Data Engineering, Apr. 2001.
K. Gouda and M.J. Zaki: Efficiently Mining Maximal Frequent Itemsets. Proc. of the IEEE Int. Conference on Data Mining, San Jose, 2001.
Reference
B. Liu, M. Hu, and W. Hsu, "Multi-level organization and summarization of the discovered rules," Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Aug. 2000, Boston, USA.
B. Liu, M. Hu, and W. Hsu, "Intuitive representation of decision trees using general rules and exceptions," Proceedings of the Seventeenth National Conference on Artificial Intelligence (AAAI-2000), July 30 - Aug 3, 2000, Austin, Texas, USA.
Frequent Itemset Mining Implementations Repository, http://fimi.cs.helsinki.fi/
http://www.ics.uci.edu/~mlearn/MLRepository.html