Introduction to Bioinformatics: Mining Your Data
Gerry Lushington, Lushington in Silico
modeling / informatics consultant
2/28/2012, Intro to Data Mining / GHL
What is Data Mining?
The use of computational methods to detect trends in data that can explain or predict important outcomes or properties.
Applicable across many disciplines:
Molecular bioinformatics
Medical informatics
Health informatics
Biodiversity informatics
Example Applications:
Find relationships between Convenient Observables and Important Outcomes.

Convenient Observables:
a) Relative gene expression data
b) Relative protein abundance data
c) Relative lipid & metabolite profiles
d) Glycosylation variants
e) SNPs, alleles
f) Cellular traits
g) Organism traits
h) Behavioral traits
i) Case history

Important Outcomes:
1. Disease susceptibility
2. Drug efficacy
3. Toxin susceptibility
4. Immunity
5. Genetic disorders
6. Microbial virulence
7. Species adaptive success
8. Species complementarity
Goals for this lecture:
Focus on Data Mining: how to approach your data and use it to understand biology
Overview of available techniques
Understanding model validation
Try to think about data you’ve seen: what techniques might be useful?
Don't worry about grasping everything: the K-INBRE Bioinformatics Core is here to help!
Basic Data Mining
Find relationships between:
a) Easy-to-measure properties vs.
b) Important (but harder to measure) outcomes or attributes
Use these relationships to understand the conceptual basis for the outcomes in b).
Use these relationships to predict outcomes in new cases where the outcome has not yet been measured.
Basic Data Mining: simple measurables
Basic Data Mining: general observation
[Figure: samples arranged along an axis from Unhappy to Happy]
Basic Data Mining: relationship (#1)
[Figure: samples along the Unhappy-to-Happy axis, colored blue or red]
Rule: Blue = happy; Red = unhappy. Accuracy = 12/20 = 60%
Basic Data Mining: relationship (#2)
[Figure: the same samples, now distinguishing big and little red symbols]
Rule: Blue + BIG Red = happy; little red = unhappy. Accuracy = 17/20 = 85%
Data Mining: procedure
1. Data Acquisition
2. Data Preprocessing
3. Feature Selection
4. Classification
5. Validation
6. Prediction & Iteration
Data Mining procedure, step 1: Data Acquisition
Key issues include:
a) Format conversion from the instrument's output
b) Any necessary mathematical manipulations (e.g., Density = M/V)
[Figure: example spectrum; do we record peak heights, peak positions, or both?]
Data Mining procedure, step 2: Data Preprocessing
Key issues include:
a) Normalization to account for experimental bias
b) Statistical detection of flagrant outliers
Step 2 continued: normalization
Use controls to scale the data:
[Figure: bar charts of a control (C) and samples 1-3 across four batches; the control level in each batch sets that batch's scaling factor]
Step 2 continued: outlier detection
Outlier removal is partly subjective: it requires experience and/or domain knowledge.
Data Mining procedure, step 3: Feature Selection
Which of the many measurable properties relate to the outcome of interest? Common criteria:
a) Intrinsic information content
b) Redundancy relative to other properties
c) Correlation with the target attribute
d) Iterative model training
Criterion (a), intrinsic information content:
[Figure: features 1-4 measured across four samples; features that barely vary (marked x) carry little information and can be dropped]
Criterion (b), redundancy:
[Figure: features 1-4 across four samples; a feature that duplicates another (marked x) is redundant and can be dropped]
Criterion (c), correlation with the target attribute:
[Figure: feature profiles across samples 1-4 compared against the target attribute's profile; the feature tracking the target is the informative one]
Criterion (d), iterative model training:
• Train preliminary models on random subsets of properties
• Evaluate the models by correlative or predictive performance
• Experiment with promising sets, adding or deleting descriptors to gauge the impact on performance
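The three filter criteria above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's procedure: the thresholds, feature names, and toy data are all made up.

```python
import statistics

def pearson(a, b):
    """Pearson correlation between two equal-length value lists."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = (sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b)) ** 0.5
    return num / den if den else 0.0

def select_features(features, target, var_min=1e-6, redun_max=0.95):
    """features: {name: [value per sample]}. Returns surviving feature
    names ranked by |correlation with target|."""
    kept = {}
    for name, vals in features.items():
        if statistics.pvariance(vals) <= var_min:
            continue                      # (a) no intrinsic information
        if any(abs(pearson(vals, v)) > redun_max for v in kept.values()):
            continue                      # (b) redundant with a kept feature
        kept[name] = vals
    # (c) rank survivors by correlation with the target attribute
    return sorted(kept, key=lambda n: -abs(pearson(kept[n], target)))

features = {
    "flat":  [1.0, 1.0, 1.0, 1.0],       # no variance -> dropped by (a)
    "g1":    [0.1, 0.9, 0.2, 0.8],
    "g1dup": [0.2, 1.8, 0.4, 1.6],       # 2x g1 -> dropped by (b)
    "noise": [0.5, 0.4, 0.1, 0.6],
}
target = [0, 1, 0, 1]
print(select_features(features, target))  # -> ['g1', 'noise']
```

Criterion (d), iterative training, would wrap this in a loop that re-scores candidate subsets by model performance rather than by simple correlation.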
Data Mining procedure, step 4: Classification
Goal: predict which sample will have which outcome. Common approaches:
a) Correlative methods
b) Distance-based clustering
c) Boundary detection
d) Rule learning
e) Weighted probability
Approach (a), correlative methods:
[Figure: outcome y plotted against feature x with a fitted trend line]
Approach (a) continued:
[Figure: y vs. x with the fitted trend; a ±n band around the decision value separates YES from NO classifications]
Approach (b), distance-based clustering:
[Figure: samples plotted in the (x1, x2) plane group into distinct clusters]
Approach (b) continued:
[Figure: four clusters y1-y4 in the (x1, x2) plane]
y1 = resistant to types I & II diabetes
y2 = susceptible to types I & II
y3 = susceptible only to type II
y4 = susceptible only to type I
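Distance-based clustering in the (x1, x2) plane can be sketched with a tiny k-means (Lloyd's algorithm). The points and starting centroids below are made-up illustrations, not the lecture's diabetes data.

```python
def kmeans(points, centroids, iters=10):
    """Assign each point to its nearest centroid by squared Euclidean
    distance, then move each centroid to the mean of its group."""
    for _ in range(iters):
        groups = [[] for _ in centroids]
        for p in points:
            d = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            groups[d.index(min(d))].append(p)
        centroids = [
            (sum(p[0] for p in g) / len(g), sum(p[1] for p in g) / len(g))
            if g else c                      # keep an empty cluster's centroid
            for g, c in zip(groups, centroids)
        ]
    return centroids, groups

# Two well-separated groups of samples in the (x1, x2) plane:
pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
       (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
centroids, groups = kmeans(pts, [(0.0, 0.0), (5.0, 5.0)])
print([len(g) for g in groups])   # -> [3, 3]
```

New samples would then be labeled (e.g., "susceptible only to type I") according to which centroid they fall nearest.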
Approach (c), boundary detection:
[Figure: a boundary in the (x1, x2) plane separates samples susceptible to type I from resistant samples]
Approach (d), rule learning:
[Figure: thresholds a, b, c on x1 and x2 carve the (x1, x2) plane into susceptible and resistant regions; E = 9 misclassified samples]
If x1 < c and x2 > a then resistant
Else if x1 > c and x2 > b then resistant
Else susceptible
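The learned rule above translates directly into code. The threshold values a, b, c and the labeled samples here are arbitrary stand-ins; a rule learner would choose the thresholds that minimize the error count E.

```python
def classify(x1, x2, a=2.0, b=4.0, c=3.0):
    """Axis-aligned rule from the slide: two 'resistant' regions,
    everything else susceptible. Thresholds are illustrative."""
    if x1 < c and x2 > a:
        return "resistant"
    if x1 > c and x2 > b:
        return "resistant"
    return "susceptible"

# A rule learner searches a, b, c to minimize E, the number of
# training samples the rule gets wrong (E = 9 in the slide's figure).
samples = [((1.0, 3.0), "resistant"), ((5.0, 5.0), "resistant"),
           ((5.0, 1.0), "susceptible"), ((1.0, 1.0), "susceptible")]
E = sum(classify(*x) != label for x, label in samples)
print(E)   # -> 0 on this toy data
```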
Approach (e), weighted probability:
[Figure: neither x1 alone (threshold a) nor x2 alone (threshold b) cleanly separates susceptible from resistant, but the weighted combination Fx1 - Gx2 with threshold c does]
If Fx1 - Gx2 < c then resistant
Else susceptible
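The weighted-combination rule is a one-liner in code. The weights F, G and cutoff c below are illustrative assumptions; in practice they would be fit to the training data, for example by a linear discriminant.

```python
def classify_linear(x1, x2, F=1.0, G=2.0, c=0.0):
    """Project a sample onto the single axis F*x1 - G*x2 and
    threshold it at c. F, G, c here are arbitrary stand-ins."""
    return "resistant" if F * x1 - G * x2 < c else "susceptible"

# The projection turns a 2-D separation problem into a 1-D threshold:
print(classify_linear(1.0, 1.0))   # 1 - 2 = -1 < 0 -> resistant
print(classify_linear(3.0, 1.0))   # 3 - 2 = +1     -> susceptible
```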
Data Mining procedure, step 5: Validation
Define criteria and tests to establish model validity:
a) Accuracy
b) Sensitivity vs. Specificity
c) Receiver Operating Characteristic (ROC) plot
d) Cross-validation
Criterion (a), accuracy:
[Figure: classifier boundary in the (x1, x2) plane; susceptible = positive, resistant = negative]
Accuracy = (TP + TN) / (TP + TN + FP + FN) = 142 / 154
Criterion (b), sensitivity vs. specificity:
[Figure: the same classification in the (x1, x2) plane; susceptible = positive, resistant = negative]
Sensitivity = TP / (TP + FN) = 67 / 72
False positive rate FPR = FP / (TN + FP) = 6 / 81
Note: Specificity = 1 - FPR
Varying the model's stringency trades sensitivity against the false positive rate:
[Figure: a less stringent decision boundary in the (x1, x2) plane]
Sensitivity = TP / (TP + FN) = 69 / 72
FPR = FP / (TN + FP) = 19 / 81
Note: Specificity = 1 - FPR
Criterion (c), the ROC plot:
[Figure: sensitivity (Sens) plotted against FPR as model stringency varies]
The area under the ROC curve (AUC) is an excellent measure of model performance:
1.0 = perfect model; 0.5 = random guessing.
[Figure: ROC curve of Sens vs. FPR]
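Tracing the ROC curve amounts to sweeping the stringency threshold from strict to lenient and recording (FPR, sensitivity) at each step; the AUC then follows by the trapezoid rule. A minimal sketch, with made-up scores and labels:

```python
def roc_auc(scores, labels):
    """labels: 1 = positive (susceptible), 0 = negative (resistant);
    higher score = more confidently positive. Returns the ROC curve
    as (FPR, sensitivity) points plus the area under it."""
    P = sum(labels)
    N = len(labels) - P
    # Sort by descending score: lowering the threshold admits
    # one more sample (one more TP or FP) at each step.
    pairs = sorted(zip(scores, labels), reverse=True)
    tp = fp = 0
    curve = [(0.0, 0.0)]
    for _, label in pairs:
        if label:
            tp += 1
        else:
            fp += 1
        curve.append((fp / N, tp / P))       # (FPR, sensitivity)
    auc = sum((x2 - x1) * (y1 + y2) / 2       # trapezoid rule
              for (x1, y1), (x2, y2) in zip(curve, curve[1:]))
    return curve, auc

curve, auc = roc_auc([0.9, 0.8, 0.7, 0.3, 0.2, 0.1],
                     [1,   1,   1,   0,   0,   0])
print(auc)   # perfectly separated scores -> 1.0
```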
Criterion (d), cross-validation:
Predictions are imperfect due to:
• Imperfect algorithms
• Imperfect data
Cross-validation: carefully monitor which features remain useful across different independent data subsets. This can be accomplished with N-fold cross-validation:
• The best feature-selection and classification algorithms yield consistently strong performance across independent trials
• The best features are consistently important across trials
[Figure: 5-fold scheme; in each of Trials 1-5 a different fold is held out for testing (Test) and the rest is used for training (Train)]
Model performance = mean predictive performance over the 5 trials.
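The 5-trial scheme can be sketched as follows. The round-robin fold split and the tiny mean-threshold classifier inside the loop are stand-in assumptions chosen to keep the example self-contained.

```python
def n_fold_cv(xs, ys, n=5):
    """Hold out each of n folds in turn, train on the rest, and
    return the mean test accuracy over the n trials."""
    folds = [list(range(i, len(xs), n)) for i in range(n)]   # round-robin split
    accuracies = []
    for test_idx in folds:
        train_idx = [i for i in range(len(xs)) if i not in test_idx]
        # "Train": threshold at the midpoint of the two class means
        # (a placeholder for a real classifier).
        pos = [xs[i] for i in train_idx if ys[i] == 1]
        neg = [xs[i] for i in train_idx if ys[i] == 0]
        thr = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
        # "Test": score the held-out fold only.
        correct = sum((xs[i] > thr) == (ys[i] == 1) for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / n    # mean performance over the n trials

xs = [0.1, 0.2, 0.3, 0.4, 0.9, 1.0, 1.1, 1.2, 0.15, 1.05]
ys = [0,   0,   0,   0,   1,   1,   1,   1,   0,    1]
print(n_fold_cv(xs, ys))   # cleanly separable toy data -> 1.0
```

Because every sample is tested exactly once on a model that never saw it, the mean accuracy estimates how the method will perform on genuinely new data.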
Data Mining procedure, step 6: Prediction & Iteration
Analysis is only useful if it is used, and it only improves if it is tested.
a) Good validation requires successful new predictions
b) Imperfect predictions can lead to method refinement and greater understanding
Questions?
Lushington in Silico
Geraldlushington3117 at aol.com
Geraldlushington.org