Artificial Intelligence Term Project #3 Kyu-Baek Hwang Biointelligence Lab School of Computer...

Date post: 21-Dec-2015
Page 1: Artificial Intelligence Term Project #3 Kyu-Baek Hwang Biointelligence Lab School of Computer Science and Engineering Seoul National University kbhwang@bi.snu.ac.kr.

Artificial Intelligence Term Project #3

Kyu-Baek Hwang

Biointelligence Lab

School of Computer Science and Engineering

Seoul National University

kbhwang@bi.snu.ac.kr

Copyright (c) 2004 by SNU CSE Biointelligence Lab

Outline

- Bayesian network – revisit I
  - Properties of Bayesian network
  - Structural learning of Bayesian network
- Project 3-1
  - Analysis of structural learning algorithms
  - ALARM dataset
- Bayesian network – revisit II
  - Bayesian network classifiers (probabilistic inference)
- Project 3-2
  - Classification of microarray gene expression data using Bayesian networks

Bayesian Network

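The joint probability distribution over all the variables in the Bayesian network:

$$P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathbf{Pa}_i)$$

[Figure: example network with nodes A, B, C, D, E (edges A → C, B → C, B → D, C → E, as implied by the factorization)]

$$\begin{aligned}
P(A, B, C, D, E) &= P(A)\,P(B \mid A)\,P(C \mid A, B)\,P(D \mid A, B, C)\,P(E \mid A, B, C, D) \\
&= P(A)\,P(B)\,P(C \mid A, B)\,P(D \mid B)\,P(E \mid C)
\end{aligned}$$

Local probability distribution for $X_i$:

- $\mathbf{Pa}_i$: the set of parents of $X_i$
- $\theta_i = (\theta_{i1}, \ldots, \theta_{iq_i})$: parameter for $P(X_i \mid \mathbf{Pa}_i)$
- $q_i$: # of configurations of $\mathbf{Pa}_i$
- $r_i$: # of states of $X_i$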
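The factorization for the five-node example can be checked numerically. The following is a minimal sketch, assuming binary variables and made-up conditional probability tables (none of these numbers come from the slides); it evaluates the product of local distributions and confirms that the joint sums to one.

```python
from itertools import product

# Hypothetical CPTs for the binary network A -> C <- B, B -> D, C -> E.
# All numbers are invented for illustration only.
p_a = {1: 0.3, 0: 0.7}
p_b = {1: 0.6, 0: 0.4}
p_c1_given_ab = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9}  # P(C=1 | A, B)
p_d1_given_b = {0: 0.2, 1: 0.7}                                       # P(D=1 | B)
p_e1_given_c = {0: 0.05, 1: 0.8}                                      # P(E=1 | C)

def bernoulli(p_one, value):
    """Return P(X = value) for a binary X with P(X = 1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one

def joint(a, b, c, d, e):
    """P(A,B,C,D,E) = P(A) P(B) P(C|A,B) P(D|B) P(E|C)."""
    return (p_a[a] * p_b[b]
            * bernoulli(p_c1_given_ab[(a, b)], c)
            * bernoulli(p_d1_given_b[b], d)
            * bernoulli(p_e1_given_c[c], e))

# Summing over all 2^5 configurations should give 1 (up to rounding).
total = sum(joint(*values) for values in product([0, 1], repeat=5))
print(total)
```

The network needs only 1 + 1 + 4 + 2 + 2 = 10 free parameters here, compared with 31 for an unrestricted joint table over five binary variables.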

Generative Model

From the underlying distribution, a set of data examples can be generated.

Any conditional probability of interest can be calculated from the joint probability distribution (JPD).

[Figure: Bayesian network over the nodes Class and Gene A through Gene H]

This Bayesian network can classify examples by calculating the appropriate conditional probability:

P(Class | other variables)
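Generating examples from a Bayesian network can be sketched with ancestral (forward) sampling: visit the nodes in topological order and draw each one given its already sampled parents. The three-node network, its probabilities, and the function name `sample_example` below are hypothetical and serve only as an illustration.

```python
import random

# Hypothetical binary network, listed in topological order:
# node -> (parents, P(node = 1 | parent values)). Numbers are invented.
network = {
    "GeneA": ((), {(): 0.4}),
    "GeneB": ((), {(): 0.5}),
    "Class": (("GeneA", "GeneB"),
              {(0, 0): 0.1, (0, 1): 0.6, (1, 0): 0.5, (1, 1): 0.9}),
}

def sample_example(net, rng):
    """Draw one example by sampling each node after its parents."""
    example = {}
    for node, (parents, cpt) in net.items():  # dicts preserve insertion order
        p_one = cpt[tuple(example[p] for p in parents)]
        example[node] = 1 if rng.random() < p_one else 0
    return example

rng = random.Random(0)
data = [sample_example(network, rng) for _ in range(1000)]
```

With enough examples, the relative frequencies in `data` approach the probabilities encoded in the network, which is what makes such data usable for testing structure learning.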

Classification by Bayesian Networks I

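Calculate the conditional probability of the 'Class' variable given the values of the other variables. This conditional probability is inferred from the joint probability distribution. For example,

$$P(\text{Class} \mid \text{Gene A}, \text{Gene B}, \text{Gene C}, \text{Gene D}, \text{Gene E}, \text{Gene F}, \text{Gene G}, \text{Gene H}) = \frac{P(\text{Class}, \text{Gene A}, \text{Gene B}, \text{Gene C}, \text{Gene D}, \text{Gene E}, \text{Gene F}, \text{Gene G}, \text{Gene H})}{\sum_{\text{Class}} P(\text{Class}, \text{Gene A}, \text{Gene B}, \text{Gene C}, \text{Gene D}, \text{Gene E}, \text{Gene F}, \text{Gene G}, \text{Gene H})}$$

where the summation is taken over all the possible class values.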
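The normalization in this formula can be sketched as a small helper that evaluates the joint once per class value and divides by the sum. The helper name `class_posterior` and the two-variable toy joint are assumptions made for illustration; they are not part of the project material.

```python
def class_posterior(joint, class_values, evidence):
    """P(Class | evidence) = P(Class, evidence) / sum over Class of P(Class, evidence)."""
    scores = {c: joint(c, evidence) for c in class_values}
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

# Hypothetical joint P(Class, GeneA) = P(Class) P(GeneA | Class); numbers are invented.
p_class = {0: 0.5, 1: 0.5}
p_gene_a1 = {0: 0.2, 1: 0.8}  # P(GeneA = 1 | Class)

def toy_joint(c, evidence):
    p1 = p_gene_a1[c]
    return p_class[c] * (p1 if evidence["GeneA"] == 1 else 1.0 - p1)

posterior = class_posterior(toy_joint, [0, 1], {"GeneA": 1})
```

The predicted class is then the value with the largest posterior, e.g. `max(posterior, key=posterior.get)`.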

Knowing the Causal Structure

[Figure: Bayesian network over the nodes Class and Gene A through Gene H]

Gene C regulates Gene E and F.

Gene D regulates Gene G and H.

Class has an effect on Gene F and G.

A set of comprehensible rules (or knowledge)

Learning Bayesian Networks

Metric approach
- Use a scoring metric to measure how well a particular structure fits an observed set of cases.
- A search algorithm is used.
- Find a canonical form of an equivalence class.

Independence approach
- An independence oracle (approximated by some statistical test) is queried to identify the equivalence class that captures the independencies in the distribution from which the data was generated.
- Search for a PDAG.

Scoring Metrics for Bayesian Networks

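Likelihood: $L(G, \Theta_G, C) = P(C \mid G^h, \Theta_G)$

$G^h$: the hypothesis that the given data (C) was generated by a distribution that can be factored according to G.

The maximum likelihood metric of G (the entropy metric with opposite sign):

$$\begin{aligned}
\log M_{ML}(G, C) &= \max_{\Theta_G} \log L(G, \Theta_G, C) \\
&= \sum_{j=1}^{N} \log \hat{P}_G(\mathbf{x}_j) \\
&= N \sum_{i} \sum_{x_i, \mathbf{pa}_i} \hat{P}(x_i, \mathbf{pa}_i) \log \hat{P}(x_i \mid \mathbf{pa}_i) \\
&= \sum_{i} \sum_{x_i, \mathbf{pa}_i} N(x_i, \mathbf{pa}_i) \log \hat{P}(x_i \mid \mathbf{pa}_i)
\end{aligned}$$

This metric prefers the complete graph structure.

- N: data size
- $\mathbf{x}_j$: the j-th example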
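The count-based form of the maximum likelihood metric translates directly into code. The sketch below assumes the data is a list of dictionaries and the structure is given as a mapping from each node to a tuple of parents; these representations and the name `log_ml_score` are illustrative choices, not something prescribed by the slides.

```python
import math
from collections import Counter

def log_ml_score(data, parents):
    """log M_ML = sum_i sum_{x_i, pa_i} N(x_i, pa_i) * log P_hat(x_i | pa_i)."""
    score = 0.0
    for node, pa in parents.items():
        joint_counts = Counter((row[node], tuple(row[p] for p in pa)) for row in data)
        parent_counts = Counter(tuple(row[p] for p in pa) for row in data)
        for (_, pa_cfg), n in joint_counts.items():
            # P_hat(x_i | pa_i) is the relative frequency N(x_i, pa_i) / N(pa_i).
            score += n * math.log(n / parent_counts[pa_cfg])
    return score

data = [{"A": 0, "B": 0}, {"A": 0, "B": 0}, {"A": 1, "B": 1}, {"A": 1, "B": 0}]
empty = log_ml_score(data, {"A": (), "B": ()})
with_edge = log_ml_score(data, {"A": (), "B": ("A",)})
```

On this toy data `with_edge` is larger than `empty`. Adding a parent can never lower the maximized likelihood, which is why this metric on its own favors the complete graph.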

Information Criterion Scoring Metrics

The Akaike information criterion (AIC) metric

$$M_{AIC}(G, C) = \log M_{ML}(G, C) - \mathrm{Dim}(G)$$

$$\mathrm{Dim}(G) = \sum_{i} (r_i - 1)\, q_i$$

Bayesian information criterion (BIC) metric

$$M_{BIC}(G, C) = \log M_{ML}(G, C) - \frac{1}{2}\, \mathrm{Dim}(G) \log N$$
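As a rough illustration of how the penalty terms are computed, the sketch below evaluates Dim(G) and the two criteria for a given log maximum likelihood value. The function names and the dictionary-based structure representation are assumptions for this example.

```python
import math

def dim(parents, n_states):
    """Dim(G) = sum_i (r_i - 1) * q_i, where q_i is the number of parent configurations."""
    total = 0
    for node, pa in parents.items():
        q = math.prod(n_states[p] for p in pa)  # empty product is 1 for root nodes
        total += (n_states[node] - 1) * q
    return total

def aic(log_ml, parents, n_states):
    """M_AIC = log M_ML - Dim(G)."""
    return log_ml - dim(parents, n_states)

def bic(log_ml, parents, n_states, n_examples):
    """M_BIC = log M_ML - 0.5 * Dim(G) * log N."""
    return log_ml - 0.5 * dim(parents, n_states) * math.log(n_examples)

# Hypothetical three-node structure: A -> B, A -> C, B -> C.
n_states = {"A": 2, "B": 2, "C": 3}
parents = {"A": (), "B": ("A",), "C": ("A", "B")}
dimension = dim(parents, n_states)
```

For N larger than about 7.4 (that is, log N > 2) the BIC penalty per parameter exceeds the AIC penalty, so BIC tends to select sparser structures on realistic data sizes.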

MDL Scoring Metrics

The minimum description length (MDL) metric 1

$$M_{MDL1}(G, C) = \log P(G) + M_{BIC}(G, C)$$

The minimum description length (MDL) metric 2

$$M_{MDL2}(G, C) = \log M_{ML}(G, C) - |E_G| \log N - c \cdot \mathrm{Dim}(G)$$

Bayesian Scoring Metrics

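A Bayesian metric

$$M(G, C, \xi) = \log P(G^h \mid \xi) + \log P(C \mid G^h, \xi) + c$$

The BDe (Bayesian Dirichlet & likelihood equivalence) metric

$$P(C \mid G^h, \xi) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(N'_{ij})}{\Gamma(N'_{ij} + N_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(N'_{ijk} + N_{ijk})}{\Gamma(N'_{ijk})}$$

Prior on the network structure

[Equation: structure prior P(G), expressed in terms of n and |Pa_i|; not recoverable from the transcript]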
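The Bayesian Dirichlet product is usually evaluated in log space with the log-gamma function. The sketch below is one possible implementation under a loudly stated assumption: the prior counts are set uniformly to N'_ijk = alpha / (r_i q_i), a common choice often called BDeu, which the slides do not specify. The structure prior term is omitted.

```python
import math
from collections import Counter

def log_bd_score(data, parents, n_states, alpha=1.0):
    """log P(C | G^h) with assumed uniform prior counts N'_ijk = alpha / (r_i * q_i)."""
    score = 0.0
    for node, pa in parents.items():
        r = n_states[node]
        q = math.prod(n_states[p] for p in pa)
        prior_ijk = alpha / (r * q)  # N'_ijk
        prior_ij = alpha / q         # N'_ij = sum over k of N'_ijk
        n_ijk = Counter((tuple(row[p] for p in pa), row[node]) for row in data)
        n_ij = Counter(tuple(row[p] for p in pa) for row in data)
        # Configurations with zero counts contribute a factor of 1, so they are skipped.
        for count in n_ij.values():
            score += math.lgamma(prior_ij) - math.lgamma(prior_ij + count)
        for count in n_ijk.values():
            score += math.lgamma(prior_ijk + count) - math.lgamma(prior_ijk)
    return score

# One binary variable observed as 0 then 1, with alpha = 2 (so N'_ijk = 1).
two_flips = log_bd_score([{"A": 0}, {"A": 1}], {"A": ()}, {"A": 2}, alpha=2.0)
```

For this toy case the marginal likelihood is (1/2)(1/3) = 1/6, matching sequential prediction with a uniform prior on the coin bias.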

Greedy Search Algorithm

- Generate an initial Bayesian network structure G_0.
- For m = 1, 2, 3, …, until convergence: among the possible local changes in G_{m-1} (insertion, reversal, or deletion of an edge), the one that leads to the largest improvement in the score is performed. The resulting graph is G_m.
- Stopping criterion: Score(G_{m-1}) == Score(G_m).

At each iteration (when learning a Bayesian network consisting of n variables), O(n^2) local changes must be evaluated to select the best one.

Random restarts are usually adopted to escape local maxima.
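The greedy procedure can be sketched as follows. This is a minimal illustration, not the implementation used by WEKA: it assumes an empty initial graph, uses a BIC-style score as a stand-in for any of the metrics, stops when no local change strictly improves the score, and leaves out random restarts. All function names are hypothetical.

```python
import math
from collections import Counter
from itertools import permutations

def bic_score(data, parents, n_states):
    """Decomposable score: log M_ML - 0.5 * Dim(G) * log N (one possible metric)."""
    score = 0.0
    for node, pa in parents.items():
        pa = sorted(pa)
        n_ijk = Counter((tuple(r[p] for p in pa), r[node]) for r in data)
        n_ij = Counter(tuple(r[p] for p in pa) for r in data)
        score += sum(n * math.log(n / n_ij[cfg]) for (cfg, _), n in n_ijk.items())
        q = math.prod(n_states[p] for p in pa)
        score -= 0.5 * (n_states[node] - 1) * q * math.log(len(data))
    return score

def is_acyclic(parents):
    """Peel off nodes whose parents are all already peeled; a cycle blocks progress."""
    remaining = {n: set(pa) for n, pa in parents.items()}
    while remaining:
        roots = [n for n, pa in remaining.items() if pa.isdisjoint(remaining)]
        if not roots:
            return False
        for n in roots:
            del remaining[n]
    return True

def neighbors(parents):
    """Yield each graph reachable by one edge insertion, deletion, or reversal."""
    for x, y in permutations(parents, 2):
        changed = {n: set(pa) for n, pa in parents.items()}
        if x in parents[y]:
            changed[y].discard(x)             # delete x -> y
            yield changed
            flipped = {n: set(pa) for n, pa in changed.items()}
            flipped[x].add(y)                 # reverse x -> y into y -> x
            yield flipped
        elif y not in parents[x]:
            changed[y].add(x)                 # insert x -> y
            yield changed

def greedy_search(data, n_states, score=bic_score):
    current = {n: set() for n in n_states}    # G_0: empty graph (an assumption)
    best = score(data, current, n_states)
    while True:
        scored = [(score(data, g, n_states), g)
                  for g in neighbors(current) if is_acyclic(g)]
        top_score, top_graph = max(scored, key=lambda t: t[0], default=(best, current))
        if top_score <= best:                 # no local change improves the score
            return current, best
        current, best = top_graph, top_score

# Toy data: B always equals A, while C is independent of both.
data = [{"A": a, "B": a, "C": c} for a in (0, 1) for c in (0, 1) for _ in range(25)]
graph, final_score = greedy_search(data, {"A": 2, "B": 2, "C": 2})
edges = {(p, child) for child, pa in graph.items() for p in pa}
```

On the toy data the search connects A and B and leaves C isolated. The direction of the A, B edge is arbitrary because both orientations belong to the same equivalence class and receive the same score.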

Project 3-1

Analysis of structural learning algorithms
- Data generation from the ALARM network
  - Various data set sizes, e.g., 1000, 3000, 5000, 10000.
- Structural learning of the Bayesian network by greedy search (hill-climbing) with several kinds of scoring metrics
- Compare the results w.r.t. edge errors according to various sample sizes and learning methods
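The slides do not define "edge errors" precisely. One common convention, assumed here, counts missing, extra, and reversed edges of the learned graph relative to the true graph. The function name `edge_errors` and the toy edge lists are hypothetical.

```python
def edge_errors(true_edges, learned_edges):
    """Count missing, extra, and reversed directed edges relative to the true graph."""
    true_edges, learned_edges = set(true_edges), set(learned_edges)
    # A true edge counts as reversed if only its opposite direction was learned.
    reversed_edges = {(a, b) for (a, b) in true_edges
                      if (b, a) in learned_edges and (a, b) not in learned_edges}
    missing = {e for e in true_edges - learned_edges if e not in reversed_edges}
    extra = {(a, b) for (a, b) in learned_edges - true_edges
             if (b, a) not in reversed_edges}
    return {"missing": len(missing), "extra": len(extra), "reversed": len(reversed_edges)}

true_graph = [("A", "C"), ("B", "C"), ("B", "D"), ("C", "E")]
learned_graph = [("A", "C"), ("C", "B"), ("D", "E")]
errors = edge_errors(true_graph, learned_graph)
```

Because score-based learning can only identify an equivalence class, reversed edges are often reported separately from missing and extra edges, as done here.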

ALARM Network

- # of nodes: 37
- # of edges: 46
- # of possible values of each variable: 2 ~ 4

Data Generation

Using Netica (http://www.norsys.com)

Structural Learning

WEKA (http://www.cs.waikato.ac.nz/ml/weka/)
http://www.cs.waikato.ac.nz/~remco/weka_bn/

Probabilistic Inference

- Calculate the conditional probability given the values of the observed variables.
  - Junction tree algorithm
  - Sampling methods
- General probabilistic inference is intractable. (It is known to be NP-hard.)
- However, calculating the conditional probability for classification is rather straightforward because of a property of the Bayesian network structure (d-separation).

Markov Blanket

Variables of interest: X = {X_1, X_2, …, X_n}

For a variable X_i, its Markov blanket MB(X_i) is the subset of X – X_i which satisfies the following:

$$P(X_i \mid \mathbf{X} - X_i) = P(X_i \mid \mathbf{MB}(X_i))$$

Markov boundary: the minimal Markov blanket

Markov Blanket in Bayesian Network

Given the Bayesian network structure, the Markov blanket of a variable can be determined directly from the conditional independence assertions encoded by the structure.

[Figure: Bayesian network over the nodes Class and Gene A through Gene H]

The Markov blanket of a node in the Bayesian network consists of all of its parents, spouses, and children.
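Reading the Markov blanket off a structure takes only a few lines. In the sketch below, the parent sets are taken from the factorization on the "Classification by Bayesian Networks II" slide; the dictionary representation and the function name `markov_blanket` are assumptions made for this example.

```python
def markov_blanket(parents, node):
    """Parents, children, and spouses (other parents of the children) of `node`."""
    children = {n for n, pa in parents.items() if node in pa}
    spouses = {p for c in children for p in parents[c]}
    return (set(parents[node]) | children | spouses) - {node}

# Parent sets of the example network (node -> list of parents).
parents = {
    "GeneA": [], "GeneB": [], "GeneC": [], "GeneD": [],
    "Class": ["GeneA", "GeneB"],
    "GeneE": ["GeneC"],
    "GeneF": ["GeneC", "Class"],
    "GeneG": ["Class", "GeneD"],
    "GeneH": ["GeneD"],
}
blanket = markov_blanket(parents, "Class")
```

For the Class node this gives its parents (Gene A, Gene B), children (Gene F, Gene G), and spouses (Gene C, Gene D); Gene E and Gene H lie outside the blanket.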

Classification by Bayesian Networks II

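In the factorized lines below, A, B, …, H abbreviate Gene A, Gene B, …, Gene H.

$$\begin{aligned}
&P(\text{Class} \mid \text{Gene A}, \text{Gene B}, \text{Gene C}, \text{Gene D}, \text{Gene E}, \text{Gene F}, \text{Gene G}, \text{Gene H}) \\
&= \frac{P(\text{Class}, \text{Gene A}, \text{Gene B}, \text{Gene C}, \text{Gene D}, \text{Gene E}, \text{Gene F}, \text{Gene G}, \text{Gene H})}{\sum_{\text{Class}} P(\text{Class}, \text{Gene A}, \text{Gene B}, \text{Gene C}, \text{Gene D}, \text{Gene E}, \text{Gene F}, \text{Gene G}, \text{Gene H})} \\
&= \frac{P(A)\,P(B)\,P(C)\,P(\text{Class} \mid A, B)\,P(D)\,P(E \mid C)\,P(F \mid C, \text{Class})\,P(G \mid \text{Class}, D)\,P(H \mid D)}{\sum_{\text{Class}} P(A)\,P(B)\,P(C)\,P(\text{Class} \mid A, B)\,P(D)\,P(E \mid C)\,P(F \mid C, \text{Class})\,P(G \mid \text{Class}, D)\,P(H \mid D)} \\
&= \frac{P(A)\,P(B)\,P(C)\,P(\text{Class} \mid A, B)\,P(D)\,P(F \mid C, \text{Class})\,P(G \mid \text{Class}, D)}{\sum_{\text{Class}} P(A)\,P(B)\,P(C)\,P(\text{Class} \mid A, B)\,P(D)\,P(F \mid C, \text{Class})\,P(G \mid \text{Class}, D)} \\
&\propto P(\text{Class} \mid A, B)\,P(F \mid C, \text{Class})\,P(G \mid \text{Class}, D)
\end{aligned}$$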
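The cancellation in this derivation can be verified numerically: normalizing the full nine-factor joint over Class gives the same posterior as normalizing only the three factors that mention Class. The sketch below assumes binary variables and randomly drawn conditional probability tables, which are purely illustrative.

```python
import random
from itertools import product

rng = random.Random(0)

# Structure of the example network (node -> parents), in topological order.
parents = {
    "GeneA": [], "GeneB": [], "GeneC": [], "GeneD": [],
    "Class": ["GeneA", "GeneB"],
    "GeneE": ["GeneC"],
    "GeneF": ["GeneC", "Class"],
    "GeneG": ["Class", "GeneD"],
    "GeneH": ["GeneD"],
}

# Hypothetical CPTs: P(node = 1 | parent values), drawn at random for illustration.
cpt = {n: {cfg: rng.uniform(0.1, 0.9) for cfg in product([0, 1], repeat=len(pa))}
       for n, pa in parents.items()}

def factor(node, x):
    """P(node = x[node] | parents of node as given in x)."""
    p1 = cpt[node][tuple(x[p] for p in parents[node])]
    return p1 if x[node] == 1 else 1.0 - p1

def posterior(evidence, nodes):
    """Normalize the product of the given nodes' factors over Class in {0, 1}."""
    scores = []
    for c in (0, 1):
        x = dict(evidence, Class=c)
        score = 1.0
        for n in nodes:
            score *= factor(n, x)
        scores.append(score)
    return [s / sum(scores) for s in scores]

evidence = {g: rng.randint(0, 1) for g in parents if g != "Class"}
full = posterior(evidence, parents)                         # all nine factors
reduced = posterior(evidence, ["Class", "GeneF", "GeneG"])  # factors that mention Class
```

The two results agree up to rounding, so a classifier only needs the local distributions of the Class node and its children.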

Project 3-2

Classification using Bayesian network
- Evaluate the performance of the Bayesian network classifier (classification accuracy).
- Various parameter settings, e.g., scoring metrics and learning methods
- If possible, compare with other learning methods such as neural networks and decision trees.
- Leave-one-out cross validation

Using WEKA
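Leave-one-out cross validation holds out each example once, trains on the rest, and averages the hits. The sketch below shows only the evaluation loop; the majority-class learner is a placeholder where a Bayesian network classifier (for instance one trained in WEKA) would be plugged in, and all names are hypothetical.

```python
from collections import Counter

def leave_one_out_accuracy(examples, labels, train, predict):
    """Train on all examples but one, test on the held-out one, and average."""
    correct = 0
    for i in range(len(examples)):
        train_x = examples[:i] + examples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = train(train_x, train_y)
        correct += predict(model, examples[i]) == labels[i]
    return correct / len(examples)

# Placeholder learner: always predicts the majority class of the training labels.
def train_majority(xs, ys):
    return Counter(ys).most_common(1)[0][0]

def predict_majority(model, x):
    return model

acc = leave_one_out_accuracy([[0], [1], [1], [0], [1]], [1, 1, 1, 0, 1],
                             train_majority, predict_majority)
```

With 120 examples this means 120 training runs, which is affordable for a dataset of this size and makes good use of the limited number of samples.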

Molecular Biology: Central Dogma

DNA microarray

DNA Microarrays

Monitor thousands of gene expression levels simultaneously, in contrast to traditional one-gene experiments.

Fabricated by high-speed robotics.

Known probes

Types of DNA Microarrays

Oligonucleotide chips
- An array of oligonucleotide (20 ~ 80-mer oligos) probes is synthesized.

cDNA microarrays
- Probe cDNA (500 ~ 5,000 bases long) is immobilized to a solid surface.

Study

Treatment-specific changes in gene expression discriminate in vivo drug response in human leukemia cells, MH Cheok et al., Nature Genetics 35, 2003.

60 leukemia patients

Bone marrow samples

Affymetrix GeneChip arrays

Gene expression data

Gene Expression Data

- # of data examples: 120 (60 before treatment, 60 after treatment)
- # of genes measured: 12600 (Affymetrix HG-U95A array)
- Task: classification between “before treatment” and “after treatment” based on gene expression pattern

Affymetrix GeneChip Arrays

- Use short oligos to detect gene expression level.
- Each gene is probed by a set of short oligos.
- Each gene expression level is summarized by
  - Signal: numerical value describing the abundance of mRNA
  - A/P call: denotes the statistical significance of the signal

Preprocessing

Remove the genes having more than 60 ‘A’ calls
- # of genes: 12600 → 3190

Discretization of gene expression level
- Criterion: median gene expression value of each sample
- 0 (low) and 1 (high)
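Both preprocessing steps can be sketched in a few lines. The code assumes the data is stored as one list per sample, that the A/P calls are the strings "A" and "P", and that a value strictly above the sample median maps to 1; the slides do not say how ties with the median are handled, so that detail is an assumption, as are the function names.

```python
from statistics import median

def filter_genes(calls, max_absent=60):
    """Return indices of genes with at most `max_absent` 'A' calls across samples."""
    n_genes = len(calls[0])
    return [g for g in range(n_genes)
            if sum(sample[g] == "A" for sample in calls) <= max_absent]

def discretize(signals):
    """Binarize each sample against its own median expression value."""
    result = []
    for sample in signals:
        m = median(sample)
        result.append([1 if v > m else 0 for v in sample])  # ties map to 0 (assumption)
    return result

# Tiny example: two samples, two genes; gene 0 is absent in both samples.
kept = filter_genes([["A", "P"], ["A", "P"]], max_absent=1)
binary = discretize([[1.0, 5.0, 3.0, 9.0]])
```

Using the per-sample median as the threshold makes the discretization robust to overall intensity differences between arrays.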

Gene Filtering

Using mutual information

$$I(G; C) = \sum_{G, C} P(G, C) \log \frac{P(G, C)}{P(G)\,P(C)}$$

- Estimated probabilities were used.
- # of genes: 3190 → 50

Final dataset
- # of attributes: 51 (one for the class)
- Class: 0 (after treatment), 1 (before treatment)
- # of data examples: 120
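The mutual information filter can be sketched with relative-frequency estimates of the probabilities. The column-wise data layout and the helper name `top_genes` are assumptions for this example; the default of 50 follows the number of genes kept on the slide.

```python
import math
from collections import Counter

def mutual_information(gene, labels):
    """I(G; C) = sum_{g,c} P(g,c) log[P(g,c) / (P(g) P(c))], probabilities from counts."""
    n = len(gene)
    joint = Counter(zip(gene, labels))
    n_g, n_c = Counter(gene), Counter(labels)
    # P(g,c) / (P(g) P(c)) simplifies to N(g,c) * N / (N(g) * N(c)).
    return sum((n_gc / n) * math.log(n_gc * n / (n_g[g] * n_c[c]))
               for (g, c), n_gc in joint.items())

def top_genes(columns, labels, k=50):
    """Indices of the k gene columns with the highest mutual information with the class."""
    ranked = sorted(range(len(columns)),
                    key=lambda i: mutual_information(columns[i], labels),
                    reverse=True)
    return ranked[:k]

labels = [0, 0, 1, 1]
informative = mutual_information([0, 0, 1, 1], labels)
uninformative = mutual_information([0, 1, 0, 1], labels)
```

A gene that perfectly tracks a balanced binary class reaches the maximum of log 2 nats, while a gene independent of the class scores zero.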

Final Dataset

[Figure: final data matrix, 120 examples by 51 attributes]

Submission

Deadline: 2004. 12. 2
Location: 301-419

