Data Mining in Bioinformatics
Peter Bajcsy, PhD, Automated Learning Group, National Center for Supercomputing Applications, University of Illinois, pbajcsy@ncsa.uiuc.edu
September 10, 2002
Page 1:

Peter Bajcsy, PhD
Automated Learning Group
National Center for Supercomputing Applications
University of Illinois
pbajcsy@ncsa.uiuc.edu

September 10, 2002

Data Mining in Bioinformatics

Page 2:

Outline

• Introduction
— Interdisciplinary Problem Statement
— Microarray Problem Overview

• Microarray Data Processing
— Image Analysis and Data Mining
— Prior Knowledge
— Data Mining Methods
— Database and Optimization Techniques
— Visualization

• Validation
• Summary

Page 3:

Introduction: Recommended Literature

1. Bioinformatics: The Machine Learning Approach by P. Baldi & S. Brunak, 2nd edition, The MIT Press, 2001

2. Data Mining: Concepts and Techniques by J. Han & M. Kamber, Morgan Kaufmann Publishers, 2001

3. Pattern Classification by R. Duda, P. Hart, and D. Stork, 2nd edition, John Wiley & Sons, 2001

Page 4:

Bioinformatics, Computational Biology, Data Mining

• Bioinformatics is an interdisciplinary field concerned with the information processing problems of computational biology and a unified treatment of the data mining methods for solving them.

• Computational Biology is about modeling real data and simulating unknown data of biological entities, e.g.:
— Genomes (viruses, bacteria, fungi, plants, insects, …)
— Proteins and Proteomes
— Biological Sequences
— Molecular Function and Structure

• Data Mining is searching for knowledge in data; related terms include:
— Knowledge mining from databases
— Knowledge extraction
— Data/pattern analysis
— Data dredging
— Knowledge Discovery in Databases (KDD)

Page 5:

Introduction: Problems in Bioinformatics Domain

• Problems in the Bioinformatics Domain
— Data production at the levels of molecules, cells, organs, organisms, and populations
— Integration of structure and function data, gene expression data, pathway data, phenotypic and clinical data, …
— Prediction of Molecular Function and Structure
— Computational biology: synthesis (simulations) and analysis (machine learning)

Page 6:

MICROARRAY PROBLEM

Page 7:

Microarray Problem: Major Objective

• Major Objective: Discover a comprehensive theory of life’s organization at the molecular level
— The major actors of molecular biology: the nucleic acids, DeoxyriboNucleic Acid (DNA) and RiboNucleic Acids (RNA)
— The central dogma of molecular biology
— Proteins are complex molecules built from 20 different amino acids.

Page 8:

Input and Output of Microarray Data Analysis

• Input: Laser image scans (data) and underlying experiment hypotheses or experiment designs (prior knowledge)

• Output:
— Conclusions about the input hypotheses or knowledge about the statistical behavior of measurements
— The theory of biological systems learnt automatically from data (machine learning perspective)
– Model fitting, inference process

Page 9:

Overview of Microarray Problem

[Diagram: overview of the microarray problem — experiment design and hypothesis in the biology application domain drives the microarray experiment; image analysis and data analysis / data mining follow, with validation; the analysis draws on statistics, artificial intelligence (AI), knowledge discovery in databases (KDD), and a data warehouse.]

Page 10:

Statistics Community

• Random Variables

• Statistical Measures

• Probability and Probability Distribution

• Confidence Interval Estimations

• Test of Hypotheses

• Goodness of Fit

• Regression and Correlation Analysis

Page 11:

Artificial Intelligence (AI) Community

• Issues:
— Prior knowledge (e.g., invariance)
— Model deviation from the true model
— Sampling distributions
— Computational complexity
— Model complexity (overfitting)

[Diagram: design cycle of predictive modeling — collect data, choose features, choose model, train classifier, evaluate classifier.]

Page 12:

Knowledge Discovery in Databases (KDD) Community

[Diagram: the knowledge discovery process operating over a database.]

Page 13:

Microarray Data Mining and Image Analysis Steps

• Image Analysis
— Normalization
— Grid Alignment
— Spot Quality Assurance Control
— Feature construction (selection and extraction)

• Data Mining
— Prior knowledge
— Statistics
— Machine learning
— Pattern recognition
— Database techniques
— Optimization techniques
— Visualization

• Validation
— Issues
— Cross validation techniques

Page 14:

MICROARRAY IMAGE ANALYSIS

Page 15:

Microarray Image Analysis

Page 16:

DATA MINING OF MICROARRAY DATA

Page 17:

Why Data Mining? A Sequence Example

• Biology: Language and Goals
— A gene can be defined as a region of DNA.
— A genome is one haploid set of chromosomes with the genes they contain.
— Perform competent comparisons of gene sequences across species, accounting for inherently noisy biological sequences whose random variability is amplified by evolution.
— Assumption: if a gene has high similarity to another gene, then they perform the same function.

• Analysis: Language and Goals
— A feature is an extractable attribute or measurement (e.g., gene expression, location).
— Pattern recognition tries to characterize a data pattern (e.g., similar gene expressions, equidistant gene locations).
— Data mining is about uncovering patterns, anomalies, and statistically significant structures in data (e.g., find two similar gene expressions with confidence > x).

Page 18:

Types of Expected Data Mining and Analysis Results

Hypothetical examples:

• Binary answers using tests of hypotheses
— Drug treatment is successful with a confidence level x.

• Statistical behavior (probability distribution functions)
— A class of genes with functionality X follows a Poisson distribution.

• Expected events
— As the amount of treatment increases, the gene expression level will decrease.

• Relationships
— The expression level of gene A is correlated with the expression level of gene B under varying treatment conditions (genes A and B are part of the same pathway).

• Decision trees
— Classification of a new gene sequence by a “domain expert”.

Page 19:

PRIOR KNOWLEDGE

Page 20:

Prior Knowledge: Experiment Design

• Microarray sources of systematic and random errors

• Feature selection and variability

• Expectations and Hypotheses

• Data cleaning and transformations

• Data mining method selection

• Interpretation

[Diagram: prior knowledge feeds every stage — collect data, choose features, data cleaning and transformations, choose model and data mining method.]

Page 21:

Prior Knowledge from Experiment Design

Complexity Levels of Microarray Experiments:

1. Compare a single gene in a control situation versus a treatment situation
• Example: Is the level of expression (up-regulated or down-regulated) significantly different in the two situations? (drug design application)
• Methods: t-test, Bayesian approach

2. Find multiple genes that share common functionalities
• Example: Find related genes that are dependent.
• Methods: Clustering (hierarchical, k-means, self-organizing maps, neural networks, support vector machines)

3. Infer the underlying gene and protein networks that are responsible for the observed patterns and functional pathways
• Example: What is the gene regulation at the system level?
• Directions: mining regulatory regions, modeling regulatory networks on a global scale

Goal of Future Experiment Designs: Understand biology at the system level, e.g., gene networks, protein networks, signaling networks, metabolic networks, the immune system, and neuronal networks.

Page 22:

Data Mining Techniques

[Diagram: data mining techniques draw from statistics, machine learning, pattern recognition, database techniques, optimization techniques, and visualization.]

Page 23:

STATISTICS

Page 24:

Statistics

[Diagram: statistics splits into descriptive statistics (describe data) and inductive statistics (make forecasts and inferences, e.g., are two sample sets identically distributed?).]

Page 25:

Statistical t-test

• Gene expression level in control and treatment situations
• Is the behavior of a single gene different in the control situation than in the treatment situation?

• m – sample mean
• s – sample variance

The normalized distance t follows a Student distribution with f degrees of freedom; in its standard two-sample form,

t = (m_control − m_treatment) / sqrt(s_control/n_control + s_treatment/n_treatment)

where n is the number of measurements in each situation. If t > thresh, the control and treatment data populations are considered to be different. A minimal code sketch follows.
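A minimal Python sketch of this test, assuming Welch's unequal-variance form and made-up expression values (not data from the original slides):

```python
# Minimal sketch of the two-sample t-test above, using SciPy's Welch
# (unequal-variance) form; the expression values are made up.
from scipy import stats

control = [5.1, 4.8, 5.3, 5.0, 4.9]     # single gene, control replicas
treatment = [6.2, 6.0, 5.8, 6.4, 6.1]   # same gene, treatment replicas

t, p = stats.ttest_ind(control, treatment, equal_var=False)
print(f"t = {t:.2f}, p = {p:.4f}")  # small p: the populations differ
```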

Page 26:

MACHINE LEARNING AND PATTERN RECOGNITION

Page 27:

Machine Learning

[Diagram: machine learning splits into supervised learning (from labeled examples), unsupervised learning (“natural groupings”), and reinforcement learning.]

Page 28:

Pattern Recognition

[Diagram: pattern recognition methods — linear correlation and regression; statistical models; decision trees; neural networks (NN representation with gradient-based or genetic-algorithm-based optimization); locally weighted learning (k-nearest neighbors, support vectors).]

Page 29:

Unsupervised Learning and Clustering

• A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters.

• Examples of data objects:
— gene expression levels, sets of co-regulated genes (pathways), protein structures

• Categories of Clustering Methods
— Partitioning Methods
— Hierarchical Methods
— Density-Based Methods

Page 30:

Unsupervised Clustering: Partitioning Methods

• The k-means algorithm partitions a set of n objects into k clusters so that the resulting intra-cluster similarity is high and the inter-cluster similarity is low.

• Input: number of desired clusters k
• Output: k labels assigned to n objects
• Steps:
1. Select k initial cluster centers
2. Compute the distance between an object and each cluster center
3. Assign a label to the object based on the minimum distance
4. Repeat for all objects
5. Re-compute each cluster center as the mean of all objects assigned to that cluster
6. Repeat from Step 2 until objects no longer change their labels

Example: Centroid-Based Technique (a code sketch follows)
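The steps above translate almost line-for-line into code. A minimal NumPy sketch, with illustrative data and k (an assumption, not the author's implementation):

```python
# Minimal k-means sketch following the steps above; data and k are
# illustrative assumptions.
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()  # Step 1
    labels = None
    for _ in range(max_iter):
        # Steps 2-4: distance from every object to every center; nearest wins
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # Step 6: no object changed its label
        labels = new_labels
        for j in range(k):  # Step 5: center = mean of its assigned objects
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9]])
print(kmeans(X, k=2))
```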

Page 31:

Unsupervised Clustering: Partitioning Methods

• The k-medoids algorithm partitions a set of n objects into k clusters so as to minimize the sum of the dissimilarities of all the objects to their nearest medoid.

• Input: number of desired clusters k
• Output: k labels assigned to n objects
• Steps:
1. Select k initial objects as the initial medoids
2. Compute the distance between an object and each cluster medoid
3. Assign a label to the object based on the minimum distance
4. Repeat for all objects
5. Randomly select a non-medoid object and swap it with the current medoid if the swap would decrease the intra-cluster squared error
6. Repeat from Step 2 until objects no longer change their labels

Example: Representative-Based Technique (a code sketch follows)
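A rough PAM-style sketch of the swap step described above, with illustrative data; here the cost is taken as the sum of distances to the nearest medoid:

```python
# Minimal k-medoids (PAM-style) sketch of the steps above; data are
# illustrative assumptions.
import numpy as np

def cost(X, medoids):
    d = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=2)
    return d.min(axis=1).sum()

def kmedoids(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), size=k, replace=False))  # Step 1
    for _ in range(max_iter):
        improved = False
        for m in range(k):              # Step 5: try swapping each medoid
            for o in range(len(X)):     # with every non-medoid object
                if o in medoids:
                    continue
                cand = medoids[:m] + [o] + medoids[m + 1:]
                if cost(X, cand) < cost(X, medoids):  # keep cost-reducing swaps
                    medoids, improved = cand, True
        if not improved:
            break                        # Step 6: no swap helps any more
    d = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=2)
    return d.argmin(axis=1), medoids     # Steps 2-4: label by nearest medoid

X = np.array([[1.0, 1.0], [1.1, 0.9], [5.0, 5.0], [5.2, 5.1]])
print(kmedoids(X, k=2))
```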

Page 32:

Unsupervised Clustering: Hierarchical Clustering

• Hierarchical Clustering partitions a set of n objects into a tree of clusters

• Types of Hierarchical Clustering

— Agglomerative hierarchical clustering: bottom-up strategy of building clusters
— Divisive hierarchical clustering: top-down strategy of building clusters

Page 33:

Unsupervised Agglomerative Hierarchical Clustering

• Agglomerative Hierarchical Clustering partitions a set of n objects into a tree of clusters with a bottom-up strategy.

• Steps:
1. Assign a unique label to each data object, forming n clusters
2. Find the nearest clusters and merge them
3. Repeat Step 2 until the desired number of clusters remains

• Types of Agglomerative Hierarchical Clustering (sketched below)
— The nearest neighbor algorithms (minimum or single-linkage algorithm, minimal spanning tree)
— The farthest neighbor algorithms (maximum or complete-linkage algorithm)
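A minimal SciPy sketch of the bottom-up strategy; `'single'` is the nearest-neighbor (single-linkage) variant named above and `'complete'` the farthest-neighbor one. The points are illustrative:

```python
# Minimal agglomerative clustering sketch with SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.8], [9.0, 0.0]])
Z = linkage(X, method='single')                  # build the tree of clusters
labels = fcluster(Z, t=3, criterion='maxclust')  # cut the tree into 3 clusters
print(labels)
```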

Page 34:

Unsupervised Clustering: Density-Based Clustering

• Density-Based Spatial Clustering of Applications with Noise (DBSCAN) aggregates objects into clusters if the objects are density connected.

• Density connected objects:
— Simplified explanation: P and Q are density connected if there is an object O such that both P and Q are density reachable from O.
— Aggregate P and Q if they are density connected with respect to an R-radius neighborhood and a Minimum Objects criterion (sketched below)
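A minimal sketch with scikit-learn's DBSCAN, where `eps` plays the role of the R-radius and `min_samples` of the Minimum Objects criterion; the data are illustrative:

```python
# Minimal density-based clustering sketch with scikit-learn's DBSCAN.
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [5.0, 5.0], [5.1, 5.2], [9.0, 9.0]])
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)  # -1 marks noise objects that are not density connected
```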

Page 35:

Supervised Learning or Classification

• Classification is a two-step process: first learn the classification rules, then assign a classification label.

Page 36:

Supervised Learning: Decision Tree

• The decision tree algorithm constructs a tree structure in a top-down, recursive, divide-and-conquer manner.

Car Insurance: Risk Assessment

[Tree: Age < 25? — yes: Risk High; no: Sports car? — yes: Risk High; no: Risk Low]

Training data (attributes Age and Car Type; answer Risk):

Age | Car Type | Risk
----|----------|-----
23  | family   | High
17  | sports   | High
43  | sports   | High
68  | family   | Low
32  | truck    | Low
20  | family   | High

[Figure: visualization of decision boundaries.]
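A minimal sketch of learning this tree from the table above with scikit-learn; the integer encoding of car types is an assumption for illustration:

```python
# Minimal decision-tree sketch on the car-insurance table above.
from sklearn.tree import DecisionTreeClassifier

car_type = {'family': 0, 'sports': 1, 'truck': 2}
X = [[23, car_type['family']], [17, car_type['sports']],
     [43, car_type['sports']], [68, car_type['family']],
     [32, car_type['truck']], [20, car_type['family']]]
y = ['High', 'High', 'High', 'Low', 'Low', 'High']

clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([[24, car_type['family']]]))  # -> ['High'] (age < 25 branch)
```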

Page 37:

Supervised Learning: Bayesian Classification

• Bayesian classification is based on Bayes’ theorem and can predict class membership probabilities.

• Bayes’ theorem (X – data sample, H – hypothesis of the data label):
P(H|X) = P(X|H) P(H) / P(X)
— P(H|X): posterior probability
— P(H): prior probability

• Classification selects the maximum a posteriori hypothesis (sketched below).
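A minimal sketch of the maximum a posteriori rule: pick the hypothesis H maximizing P(H|X) ∝ P(X|H) P(H). The priors and likelihoods here are made-up illustrative numbers:

```python
# Maximum a posteriori classification with made-up numbers.
priors = {'up-regulated': 0.3, 'down-regulated': 0.7}
likelihood = {'up-regulated': 0.8, 'down-regulated': 0.1}  # P(X|H)

posterior = {h: likelihood[h] * priors[h] for h in priors}  # unnormalized
print(max(posterior, key=posterior.get))  # 'up-regulated' (0.24 > 0.07)
```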

Page 38:

Statistical Models: Linear Discriminant

• Linear discriminant functions form boundaries between data classes.

• Finding a linear discriminant function is achieved by minimizing a criterion error function.

Linear discriminant function: g(x) = w·x + w0
Quadratic discriminant function: adds second-order terms, e.g., g(x) = x·W·x + w·x + w0

Finding the w coefficients (sketched below):
— Gradient descent procedures
— Newton’s algorithm
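A small sketch of finding w for g(x) = w·x + w0 by gradient descent on a squared-error criterion; the data and learning rate are illustrative assumptions:

```python
# Gradient descent for a linear discriminant on toy two-class data.
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 1.0], [4.0, 5.0], [5.0, 4.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])        # two classes
Xa = np.hstack([X, np.ones((len(X), 1))])   # augment inputs for the bias w0

w = np.zeros(3)
for _ in range(2000):
    grad = Xa.T @ (Xa @ w - y) / len(X)     # gradient of the squared error
    w -= 0.05 * grad                        # gradient descent step
print(w, np.sign(Xa @ w))                   # signs should match y
```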

Page 39:

Neural Networks

• A neural network is a set of connected input/output units where each connection has an associated weight.

• Phase I: learning – adjust the weights so that the network accurately predicts the class labels of the input samples
• Phase II: classification – assign labels by passing an unknown sample through the network

• Steps (sketched in code below):
1. Initialize weights from [-1, 1]
2. Propagate the inputs forward
3. Backpropagate the error
4. Terminate learning (training) if (a) the weight change delta w < thresh, (b) the percentage of misclassified samples < thresh, or (c) the maximum number of iterations has been exceeded
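A tiny NumPy sketch of this training loop, under an assumed architecture (one hidden sigmoid layer) on the XOR problem; convergence depends on the random initialization:

```python
# Minimal backpropagation sketch following Steps 1-4 above.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.uniform(-1, 1, (2, 4)); b1 = rng.uniform(-1, 1, 4)  # Step 1
W2 = rng.uniform(-1, 1, (4, 1)); b2 = rng.uniform(-1, 1, 1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(10000):
    h = sigmoid(X @ W1 + b1)                 # Step 2: forward pass
    out = sigmoid(h @ W2 + b2)
    d_out = (out - y) * out * (1 - out)      # Step 3: backpropagate error
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= 0.5 * h.T @ d_out; b2 -= 0.5 * d_out.sum(axis=0)
    W1 -= 0.5 * X.T @ d_h;   b1 -= 0.5 * d_h.sum(axis=0)
    if np.mean((out - y) ** 2) < 1e-3:       # Step 4: terminate on low error
        break

print(out.round(2).ravel())                  # should approach [0, 1, 1, 0]
```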

Page 40:

Support Vector Machines (SVM)

• The SVM algorithm finds a separating hyperplane with the largest margin and uses it to classify new samples (sketched below).
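A minimal scikit-learn sketch of this idea: fit a maximum-margin hyperplane and classify new samples. The data are illustrative:

```python
# Largest-margin classification with a linear SVM.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6]])
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel='linear', C=1.0).fit(X, y)  # largest-margin separator
print(clf.support_vectors_)                  # samples that pin the margin
print(clf.predict([[2, 2], [6, 6]]))         # -> [0 1]
```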

Page 41:

DATABASE TECHNIQUES AND OPTIMIZATION TECHNIQUES

Page 42:

Database Techniques

• Database Design and Modeling (tables, procedures, functions, constraints)

• Database Interface to Data Mining System

• Efficient Import and Export of Data

• Database Data Visualization

• Database Clustering for Access Efficiency

• Database Performance Tuning (memory usage, query encoding)

• Database Parallel Processing (multiple servers and CPUs)

• Distributed Information Repositories (data warehouse)


Page 43:

Optimization Techniques

• Highly nonlinear search space (global versus local maxima)

• Gradient based optimization

• Genetic algorithm based optimization

• Optimization with sampling

• Large search space

• Example: A genome with N genes can encode 2^N states (active or inactive only; degrees of regulation are not considered). Human genome: ~2^30,000 states; nematode genome: ~2^20,000 states.
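A toy sketch of genetic-algorithm-based search over a binary gene-state space too large to enumerate; the fitness function here is entirely made up for illustration:

```python
# Toy genetic algorithm over 2^40 binary states.
import random

random.seed(0)
N = 40                                        # 2^40 candidate states
TARGET = [random.randint(0, 1) for _ in range(N)]

def fitness(state):                           # illustrative objective
    return sum(s == t for s, t in zip(state, TARGET))

pop = [[random.randint(0, 1) for _ in range(N)] for _ in range(50)]
for _ in range(200):
    pop.sort(key=fitness, reverse=True)
    parents = pop[:10]                        # selection
    children = []
    while len(children) < 40:
        a, b = random.sample(parents, 2)
        cut = random.randrange(1, N)          # one-point crossover
        child = a[:cut] + b[cut:]
        i = random.randrange(N)               # point mutation
        child[i] ^= 1
        children.append(child)
    pop = parents + children

print(fitness(max(pop, key=fitness)), "of", N)
```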

Page 44:

VISUALIZATION

Page 45:

Visualization

• Data: 3D cubes, distribution charts, curves, surfaces, link graphs, image frames and movies, parallel coordinates

• Results: pie charts, scatter plots, box plots, association rules, parallel coordinates, dendrograms, temporal evolution

[Figures: pie chart, parallel coordinates, temporal evolution.]

Page 46:

Novel Visualization of Features

[Diagram: feature selection and visualization — selected features rendered as a mean feature image.]

Page 47:

Novel Visualization of Clustering Results

[Diagram: Isodata (k-means) clustering with class labeling and visualization — a mean feature image and the resulting label image.]

Page 48:

VALIDATION

Page 49:

Why Validation?

• Validation type:
— Within the existing data
— With newly collected data

• Errors and uncertainties:
— Systematic or random errors
— Unknown variables – number of classes
— Noise level – statistical confidence due to noise
— Model validity – error measure, model over-fit or under-fit
— Number of data points – measurement replicas

• Other issues
— Experimental support of general theories
— Exhaustive sampling is not feasible

Page 50:

Error Detection: Example of Spot Screening

[Figures: mask image with no screening, with location and size screening, and with SNR screening.]

Page 51:

Cross Validation: Example

• One-tier cross validation
— Train on different data than the test data

• Two-tier cross validation
— The score from one-tier cross validation is used by the bias optimizer to select the best learning algorithm parameters (e.g., the number of control points). The more you optimize, the more you over-fit; the second tier measures the level of over-fit (an unbiased measure of accuracy).
— Useful for comparing learning algorithms whose control parameters are optimized.
— The number of folds is not optimized.

• Computational complexity:
— (#folds of top tier) × (#folds of bottom tier) × (#control points) × (CPU time of the algorithm)
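A sketch of this two-tier idea using scikit-learn, where the inner tier (GridSearchCV) optimizes control parameters and the outer tier measures over-fit-free accuracy; the dataset and parameter grid are illustrative:

```python
# Two-tier (nested) cross validation sketch.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
inner = GridSearchCV(DecisionTreeClassifier(random_state=0),
                     {'max_depth': [2, 3, 4, 5]}, cv=5)  # bottom tier
scores = cross_val_score(inner, X, y, cv=5)              # top tier
print(scores.mean())        # unbiased estimate of accuracy
```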

Page 52:

Summary

• Bioinformatics and the Microarray problem
— Interdisciplinary Challenges: Terminology
— Understanding Biology and Computer Science

• Data mining and image analysis steps
— Image Analysis
— Experiment Design as Prior Knowledge
— Expected Results of Data Mining
— Which Data Mining Technique to Use?
— Data Mining Challenges: Complexity, Data Size, Search Space

• Validation
— Confidence in Obtained Results?
— Error Screening
— Cross validation techniques

Page 53:

Backup

