
Agenda

Date post: 09-Jan-2016
Upload: liesel
Page 1: Agenda

Agenda

1. Biological databases related to microarray
   1. Gene Ontology
   2. KEGG
   3. Biocarta
   4. Reactome
   5. MSigDB

2. Pathway enrichment analysis
   1. GSEA
   2. GSA
   3. Ingenuity Pathway Analysis (IPA)

3. Motif finding

Page 2: Agenda

1. Databases

Biological pathways and knowledge are very complex:

Is it possible to establish a database
• to systematically structure and manage this knowledge?
• to validate analysis results, or to be incorporated into the analysis?

Page 3: Agenda

1.1 Gene Ontology

• Ontologies: controlled vocabularies to describe the functions of genes.
• The database is structured as directed acyclic graphs (DAGs), which differ from hierarchical trees in that a 'child' (more specialized term) can have many 'parents' (less specialized terms).
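The difference between a DAG and a tree can be made concrete with a small sketch. The terms and edges below are a toy illustration (not the actual GO graph), and `PARENTS` and `ancestors` are hypothetical names chosen for this example.

```python
# Toy GO-like DAG: each term maps to its direct parents.
# The terms and edges are illustrative only, not copied from the real ontology.
PARENTS = {
    "biological_process": [],
    "cellular process": ["biological_process"],
    "biological regulation": ["biological_process"],
    "cell cycle": ["cellular process"],
    # A child with two parents: this is what makes the graph a DAG, not a tree.
    "regulation of cell cycle": ["cell cycle", "biological regulation"],
}


def ancestors(term, parents):
    """Return every less specialized term reachable from `term`."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


print(sorted(ancestors("regulation of cell cycle", PARENTS)))
```

Because a term can be reached through several parents, a gene annotated to a specialized term is implicitly annotated to all of its ancestors, which is one reason enrichment tests on different GO terms are dependent.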

Page 4: Agenda

1.1 Gene Ontology

Three major categories in Gene Ontology:

Molecular Function Ontology: the tasks performed by individual gene products; examples are carbohydrate binding and ATPase activity.

Biological Process Ontology: broad biological goals, such as mitosis or purine metabolism, that are accomplished by ordered assemblies of molecular functions.

Cellular Component Ontology: subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and origin recognition complex.

Current term counts as of April 2, 2005 at 18:00 Pacific time: 17708 terms, 93.8% with definitions.
• 9263 biological_process
• 1496 cellular_component
• 6949 molecular_function

Page 5: Agenda

1.1 Gene Ontology

Evidence code: how is the information collected?

• IC: inferred by curator
• IDA: inferred from direct assay
• IEA: inferred from electronic annotation
• IEP: inferred from expression pattern
• IGI: inferred from genetic interaction
• IMP: inferred from mutant phenotype
• IPI: inferred from physical interaction
• ISS: inferred from sequence or structural similarity
• NAS: non-traceable author statement
• ND: no biological data available
• RCA: inferred from reviewed computational analysis
• TAS: traceable author statement
• NR: not recorded

There may be (a lot of) errors in the database!!
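One way to act on this warning is to filter annotations by evidence code before an analysis. The sketch below uses made-up gene names and annotations, and the choice of which codes count as weak (`WEAK_CODES`) is an assumption for illustration, not a recommendation from the slides.

```python
# Hypothetical (gene, GO term, evidence code) annotations, for illustration only.
ANNOTATIONS = [
    ("geneA", "cell cycle", "IDA"),
    ("geneB", "cell cycle", "IEA"),
    ("geneC", "cell cycle", "TAS"),
    ("geneD", "cell cycle", "ND"),
]

# Assumption: codes treated as weaker evidence in this example.
WEAK_CODES = frozenset({"IEA", "NAS", "ND", "NR"})


def filter_by_evidence(annotations, excluded=WEAK_CODES):
    """Keep only annotations whose evidence code is not in `excluded`."""
    return [a for a in annotations if a[2] not in excluded]


for gene, term, code in filter_by_evidence(ANNOTATIONS):
    print(gene, term, code)
```

Stricter filtering gives cleaner gene sets but smaller ones, so the threshold is a trade-off that depends on the organism and the analysis.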

Page 6: Agenda

1.1 Gene Ontology

Demo:

• Go to GO: http://www.geneontology.org
• Go to "Tools" and click on "AmiGO".
• Click "Browse". Click on the boxes with "+" to expand any category and look at its subcategories. Click on "-" to collapse it again.
• Type the term "cell cycle" in the "Search GO" field and press "Submit". You will then see all GO categories containing this term.
• Click on a GO term, say "cell cycle arrest". Genes belonging to this GO term can be shown. Further filter genes by "Data source" or "Species".
• Type the name "cyclin" in AmiGO. Change to the "genes or proteins" selection button and press "Submit". You will then see a number of genes containing this name. Press some of the "Tree view" links.
• Note that in some cases, the same term category can exist in different places in the tree. This ontology is thus not strictly hierarchical, but shows complex "many-to-many" relationships between gene products, ontology terms and branches in the ontology tree.

Page 7: Agenda

http://www.genome.jp/kegg/pathway.html

1.2 KEGG

Page 8: Agenda

1.2 KEGG: Kyoto Encyclopedia of Genes and Genomes

KEGG is a suite of databases and associated software, integrating our current knowledge on molecular interaction networks in biological processes (PATHWAY database), the information about the universe of genes and proteins (GENES/SSDB/KO databases), and the information about the universe of chemical compounds and reactions (COMPOUND/GLYCAN/REACTION databases).

The current statistics of the KEGG databases are as follows:

Number of pathways: 23,574 (PATHWAY database)
Number of reference pathways: 265 (PATHWAY database)
Number of ortholog tables: 87 (PATHWAY database)
Number of organisms: 272 (GENOME database)
Number of genes: 911,584 (GENES database)
Number of ortholog clusters: 35,456 (SSDB database)
Number of KO assignments: 6,221 (KO database)
Number of chemical compounds: 12,737 (COMPOUND database)
Number of glycans: 11,017 (GLYCAN database)
Number of chemical reactions: 6,399 (REACTION database)
Number of reactant pairs: 5,953 (RPAIR database)

Page 9: Agenda

1.2 KEGG

RNA polymerase:

Page 10: Agenda

1.2 KEGG

Cell cycle:

Page 11: Agenda

1.2 KEGG

Parkinson’s disease:

Alzheimer’s disease, Huntington’s disease, Prion disease….

Page 12: Agenda

1.3 Biocarta

Page 13: Agenda

1.4 Reactome

• A manually curated and peer-reviewed (authors, reviewers and editors) pathway database.
• Now annotates 5849 proteins, 4555 complexes, 4827 reactions and 1192 pathways in Homo sapiens (Version 39, 2/21/2012).

Page 14: Agenda

Database | # of pathways (gene sets) | Accuracy (manually curated?) | Includes gene-gene interactions (network graphs)? | Note
Gene Ontology | 17708 gene sets (2005) | No (includes many computational predictions) | No |
KEGG | 415 pathways, 951 diseases | Yes | Yes |
Biocarta | 250 pathways, 4000 proteins, 800 complexes and 3000 interactions | Yes | Yes | Cancer focused
Reactome | 1192 pathways (human) | Yes | Yes |
NCI-Nature Pathway Interaction Database (PID) | 59 pathways | Yes | Yes | Curated by Nature editorial team

Page 15: Agenda

1.5 MSigDB

A comprehensive pathway database (mainly gene sets without graphical interaction model). Useful for conventional pathway (gene set) enrichment analysis.

C1: Positional gene sets (326)
C2: Curated gene sets (3272)
    Canonical pathways (880)
        Biocarta (217)
        KEGG (186)
        Reactome (430)
C3: Motif gene sets (836)
    miRNA targets gene sets (221)
    TF targets gene sets (615)
C4: Computational gene sets (881)
C5: GO gene sets (1454)

Page 16: Agenda

2. Enrichment analysis

After

1. Selecting differentially expressed (DE) genes, or
2. Classification, or
3. Clustering

we are usually given a gene list for further investigation.

How do we validate the information contained in the gene list against available biological knowledge?

Page 17: Agenda

Cell cycle data: Cells are synchronized and samples are taken at various time points (covering 2 cell cycles). 6162 genes are included.

[Figure: Getting a homogeneous population of cells. Cells at various stages of the cell cycle are synchronized (synchronization conditions: temperature shift to 37 C for the CDC15 yeast ts-strain, adding pheromone, or elutriation) and released back into the cell cycle. Samples are taken as the cells progress through the cycle simultaneously.]

From Fourier analysis, 800 genes with cyclic gene expression pattern are selected for further investigation.

Are these 800 genes really involved in cell cycle?

2. Enrichment analysis

Page 18: Agenda

http://db.yeastgenome.org/cgi-bin/GO/goTermMapper

2. Enrichment analysis

Page 19: Agenda

 | Related to cell cycle | Annotated but not related to cell cycle | Not annotated | Total
All genes | 385 | 5703 | 74 | 6162
Expression with cyclic pattern | 100 | 691 | 9 | 800

Is the selected set of genes enriched in the GO term "cell cycle"?

2. Enrichment analysis

Page 20: Agenda

 | Related to cell cycle | Annotated but not related to cell cycle | Total
Other genes | 285 | 5012 | 5297
Expression with cyclic pattern | 100 | 691 | 791
Total | 385 | 5703 | 6088

$$\frac{100}{791} = 12.64\% \;\overset{???}{\sim}\; \frac{285}{5297} = 5.38\%$$

2. Enrichment analysis

Page 21: Agenda

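 | Related to cell cycle | Annotated but not related to cell cycle | Total
Other genes | $N_{11}$ | $N_{12}$ | $N_{1\cdot}$
Expression with cyclic pattern | $N_{21}$ | $N_{22}$ | $N_{2\cdot}$
Total | $N_{\cdot 1}$ | $N_{\cdot 2}$ | $N$

Data sampled from $(p_{11}, p_{12}, p_{21}, p_{22})$.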

2. Enrichment analysis

Page 22: Agenda

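$$H_0: p_{ij} = p_{i\cdot}\, p_{\cdot j} \quad \text{(independence)}$$

$$H_A: p_{ij} \neq p_{i\cdot}\, p_{\cdot j}$$

Under the null hypothesis,

$$\hat{p}_{i\cdot} = \frac{N_{i\cdot}}{N}, \qquad \hat{p}_{\cdot j} = \frac{N_{\cdot j}}{N}, \qquad \hat{p}_{ij} = \hat{p}_{i\cdot}\, \hat{p}_{\cdot j} = \frac{N_{i\cdot} N_{\cdot j}}{N^2}$$

$$\chi^2 = \sum_{ij} \frac{(\text{observed} - \text{expected})^2}{\text{expected}} = \sum_{ij} \frac{(N_{ij} - N \hat{p}_{ij})^2}{N \hat{p}_{ij}} \sim \chi^2_1$$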

2. Enrichment analysis

Page 23: Agenda

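 | Related to cell cycle | Annotated but not related to cell cycle | Total
Other genes | 285 | 5012 | 5297
Expression with cyclic pattern | 100 | 691 | 791
Total | 385 | 5703 | 6088

$$\chi^2 = \frac{\left(285 - \frac{5297 \times 385}{6088}\right)^2}{\frac{5297 \times 385}{6088}} + \frac{\left(5012 - \frac{5297 \times 5703}{6088}\right)^2}{\frac{5297 \times 5703}{6088}} + \frac{\left(100 - \frac{791 \times 385}{6088}\right)^2}{\frac{791 \times 385}{6088}} + \frac{\left(691 - \frac{791 \times 5703}{6088}\right)^2}{\frac{791 \times 5703}{6088}} = 61.2644$$

R code for chi-square test without continuity correction:

> chisq.test(matrix(c(285, 5012, 100, 691), 2, 2), correct = FALSE)

        Pearson's Chi-squared test

data:  matrix(c(285, 5012, 100, 691), 2, 2)
X-squared = 61.2644, df = 1, p-value = 4.99e-15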
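The R result above can be cross-checked by computing the Pearson statistic directly from the observed and expected counts. This is a minimal sketch using only the Python standard library; `chi_square_2x2` is a name chosen for this example, and the p-value line relies on the fact that for 1 degree of freedom the upper tail of the chi-square distribution equals `erfc(sqrt(x / 2))`.

```python
from math import erfc, sqrt


def chi_square_2x2(table):
    """Pearson chi-square statistic (no continuity correction) and p-value for a 2x2 table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n  # N_i. * N_.j / N
            statistic += (table[i][j] - expected) ** 2 / expected
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value


# Rows: other genes, cyclic genes. Columns: cell cycle, not cell cycle.
statistic, p_value = chi_square_2x2([[285, 5012], [100, 691]])
print(statistic, p_value)
```

Transposing the table, as the column-wise fill in the R call does, leaves the statistic unchanged.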

2. Enrichment analysis

Page 24: Agenda

Fisher’s exact test: G genes in the genome (G = 1663) are analyzed. Suppose functional category "F" contains m genes and, in a cluster of size C, h genes are found to be in "F". The p-value (i.e. the probability of observing h or more annotated genes in the cluster) is calculated as (Tavazoie et al. 1999):

$$P[X \ge h] = 1 - \sum_{i=0}^{h-1} \frac{\binom{C}{i} \binom{G-C}{m-i}}{\binom{G}{m}}$$

The chi-squared test is an approximate test and may not perform well when the sample size is small. Fisher’s exact test is a better alternative.

2. Enrichment analysis

Page 25: Agenda

 | Inside cluster | Outside cluster | Total
Inside pathway F | h | m-h | m
Outside pathway F | C-h | G-m-C+h | G-m
Total | C | G-C | G

If genes are randomly assigned, the probability of having h intersection genes is

$$P[X = h] = \frac{\binom{C}{h} \binom{G-C}{m-h}}{\binom{G}{m}}$$

The p-value is the probability of observing h or more intersection genes by chance:

$$P[X \ge h] = 1 - \sum_{i=0}^{h-1} \frac{\binom{C}{i} \binom{G-C}{m-i}}{\binom{G}{m}}$$

2. Enrichment analysis

Fisher’s exact test

Page 26: Agenda

2. Enrichment analysis: Fisher’s exact test

Observation:

 | Inside cluster | Outside cluster | Total
Inside pathway F | 39 | 1 | 40
Outside pathway F | 161 | 1799 | 1960
Total | 200 | 1800 | 2000

• Only two tables are at least as extreme as the observation: the observed table itself and the following one.

 | Inside cluster | Outside cluster | Total
Inside pathway F | 40 | 0 | 40
Outside pathway F | 160 | 1800 | 1960
Total | 200 | 1800 | 2000

$$\text{p-value} = \frac{\binom{200}{39} \binom{1800}{1}}{\binom{2000}{40}} + \frac{\binom{200}{40} \binom{1800}{0}}{\binom{2000}{40}} \approx 0$$
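The tail probability for this example can be evaluated exactly with integer arithmetic. The sketch below follows the notation of the slides (G genes in total, m genes in pathway F, a cluster of size C, h intersection genes); the function names are chosen for this example.

```python
from fractions import Fraction
from math import comb


def hypergeom_pmf(i, G, m, C):
    """P[X = i]: exactly i of the m pathway genes fall inside a cluster of size C."""
    return Fraction(comb(C, i) * comb(G - C, m - i), comb(G, m))


def enrichment_p_value(h, G, m, C):
    """P[X >= h] = 1 - sum_{i=0}^{h-1} P[X = i], computed exactly."""
    return 1 - sum(hypergeom_pmf(i, G, m, C) for i in range(h))


# Worked example from the slides: 39 of the 40 pathway genes are in a cluster of 200.
p = enrichment_p_value(h=39, G=2000, m=40, C=200)
print(float(p))
```

Using `Fraction` avoids the cancellation that a floating-point `1 - sum(...)` would suffer when the p-value is this small; the price is slower arithmetic on large tables.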

Page 27: Agenda

2. Enrichment analysis

Kolmogorov-Smirnov test (KS test)

-- A major issue with Fisher’s exact test is that it requires an ad hoc threshold to generate the DE gene list.

-- The KS test is a better way to associate any gene ordering with pathway information.

Example: S1 = (1, 2, 3, 5), S2 = (4, 6, 8, 9, 10)

$$D = \max_x |F_1(x) - F_2(x)|$$

where $F_1$ and $F_2$ are the empirical distribution functions of S1 and S2.
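For the example above, D can be computed by evaluating both empirical distribution functions at every observed value, since step functions can only change there. This is a minimal sketch with hypothetical helper names `ecdf` and `ks_statistic`.

```python
def ecdf(sample):
    """Return the empirical distribution function of `sample`."""
    n = len(sample)
    return lambda x: sum(1 for s in sample if s <= x) / n


def ks_statistic(s1, s2):
    """Two-sample KS statistic D = max_x |F1(x) - F2(x)|."""
    f1, f2 = ecdf(s1), ecdf(s2)
    # Both functions are step functions, so the maximum is attained at an observed value.
    return max(abs(f1(x) - f2(x)) for x in sorted(set(s1) | set(s2)))


S1 = (1, 2, 3, 5)
S2 = (4, 6, 8, 9, 10)
D = ks_statistic(S1, S2)
print(D)
```

Here the maximum is reached at x = 5, where all four values of S1 but only one of the five values of S2 have been seen, giving D = 1 - 0.2 = 0.8. In the enrichment setting, S1 would be the ranks of the pathway genes and S2 the ranks of all other genes.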

Page 28: Agenda

In practice, we need to search through thousands of GO terms to determine which GO terms are enriched in the selected gene set.

Multiple comparison problem!!

Difficulties: Tests are highly dependent.

1. Hierarchical structure of the GO, e.g. "Cell Proliferation" is a parent GO term of "Cell Cycle".
2. Each gene can belong to multiple GO terms, e.g. the human HoxA7 gene belongs to four GO terms: "Development", "Nucleus", "DNA dependent regulation and transcription", "Transcription factor activity".

2. Enrichment analysis

Page 29: Agenda

2. Enrichment analysis

Simple and naïve way:
1. Get p-values from Fisher’s exact test for all pathways.
2. Correct by the Benjamini-Hochberg procedure to control FDR.
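The second step can be sketched as follows. The p-values are made up for illustration, and `benjamini_hochberg` is a name chosen for this example; it implements the standard step-up rule (reject the k smallest p-values, where k is the largest rank with p_(k) <= k * alpha / m).

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the indices of the hypotheses rejected at FDR level `alpha`."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            largest_rank = rank  # step-up: remember the largest rank that passes
    return sorted(order[:largest_rank])


# Hypothetical Fisher p-values for ten pathways.
pathway_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(pathway_p))
```

Note that the procedure was derived for independent (or positively dependent) tests, so with overlapping gene sets the FDR guarantee should be read with some caution.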

Problem:
1. Fisher’s test simplifies the DE statistics into a 0-1 biomarker list.
2. It does not consider the gene dependence structure or the hierarchical dependence structure among pathways.

Improved methods:
1. Use averaged t-statistics or Kolmogorov-Smirnov (KS) statistics as the pathway-specific enrichment score.
2. Apply a permutation test (either gene permutation or sample permutation) to perform FDR control.
3. Read the following papers if interested.

Goeman, J.J. and Bühlmann, P. (2007) Analyzing gene expression data in terms of gene sets: methodological issues, Bioinformatics, 23, 980-987.

Tian, L., Greenberg, S.A., Kong, S.W., Altschuler, J., Kohane, I.S. and Park, P.J. (2005) Discovering statistically significant pathways in expression profiling studies, Proceedings of the National Academy of Sciences of the United States of America, 102, 13544-13549.

Efron, B. and Tibshirani, R. (2007) On testing the significance of sets of genes, Annals of Applied Statistics, 1, 107-129.

Subramanian, A., Tamayo, P., Mootha, V.K., Mukherjee, S., Ebert, B.L., Gillette, M.A., Paulovich, A., Pomeroy, S.L., Golub, T.R., Lander, E.S. and Mesirov, J.P. (2005) Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles, Proceedings of the National Academy of Sciences of the United States of America, 102, 15545-15550.
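As a rough illustration of the gene-permutation idea, the sketch below uses the average gene-level statistic as the enrichment score and compares it with the scores of random gene sets of the same size. The gene names and scores are synthetic, and the function is a simplified stand-in, not the procedure of any of the papers listed above.

```python
import random
from statistics import mean


def gene_permutation_p_value(scores, gene_set, n_perm=999, seed=0):
    """One-sided p-value for the mean score of `gene_set` under gene permutation."""
    rng = random.Random(seed)
    genes = list(scores)
    observed = mean(scores[g] for g in gene_set)
    hits = 0
    for _ in range(n_perm):
        random_set = rng.sample(genes, len(gene_set))
        if mean(scores[g] for g in random_set) >= observed:
            hits += 1
    # Add one to numerator and denominator so the p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Synthetic gene-level statistics (e.g. t-statistics) for 200 hypothetical genes.
scores = {f"g{i}": float(i) for i in range(200)}
top_set = [f"g{i}" for i in range(190, 200)]
print(gene_permutation_p_value(scores, top_set))
```

Gene permutation ignores correlation between genes; sample permutation (relabeling the arrays and recomputing the gene-level statistics) preserves it, at a higher computational cost.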

Page 30: Agenda

Simple Fisher’s exact test:

• Ingenuity Pathway Analysis: a commercial package with a good interface and human-curated annotation. Can generate network figures.
• NIH DAVID: free and web-based. Performs enrichment analysis (Fisher’s exact test), adjusts for multiple comparisons and generates a table of results. Uses multiple databases.
• "GOstats" package in Bioconductor: a free R package. Performs enrichment analysis (Fisher’s exact test) and generates a table of results. Uses only the GO database.

More sophisticated and systematic methods:

• Gene set enrichment analysis (GSEA; MIT Mesirov’s group) http://www.broad.mit.edu/gsea/ (free)
• Gene set analysis (GSA; Stanford Tibshirani’s group) http://www-stat.stanford.edu/~tibs/GSA/ (free)
• Ingenuity Pathway Analysis (IPA) http://www.ingenuity.com/ (commercial; Pitt has purchased licenses)

2. Enrichment analysis

Page 31: Agenda

Things to note when using biological databases:

1. Biological pathways and gene functions are complex and difficult to quantify.

2. Data may not be accurate. The analysis should take the strength of evidence into account.

3. You may need to go to a specific database for a particular organism (e.g. SGD for yeast; FlyBase and BDGP for fly).

4. Systematically collecting and managing massive biological knowledge from publications and experiments is an important and active research topic in bioinformatics.

2. Enrichment analysis

Page 33: Agenda

3. Motif Finding

http://web.indstate.edu/thcme/mwking/gene-regulation.html

Page 34: Agenda

In the motifs below, (X/Y) denotes either base at that position and N denotes any base.

Factor | Sequence Motif | Comments
c-Myc and Max | CACGTG | c-Myc first identified as retroviral oncogene; Max specifically associates with c-Myc in cells
c-Fos and c-Jun | TGA(C/G)T(C/A)A | both first identified as retroviral oncogenes; associate in cells, also known as the factor AP-1
CREB | TGACG(C/T)(C/A)(G/A) | binds to the cAMP response element; family of at least 10 factors resulting from different genes or alternative splicing; can form dimers with c-Jun
c-ErbA; also TR (thyroid hormone receptor) | GTGTCAAAGGTCA | first identified as retroviral oncogene; member of the steroid/thyroid hormone receptor superfamily; binds thyroid hormone
c-Ets | (G/C)(A/C)GGA(A/T)G(T/C) | first identified as retroviral oncogene; predominates in B- and T-cells
GATA | (T/A)GATA | family of erythroid cell-specific factors, GATA-1 to -6
c-Myb | (T/C)AAC(G/T)G | first identified as retroviral oncogene; hematopoietic cell-specific factor
MyoD | CAACTGAC | controls muscle differentiation
NF-(kappa)B and c-Rel | GGGA(A/C)TN(T/C)CC (1) | both factors identified independently; c-Rel first identified as retroviral oncogene; predominate in B- and T-cells
RAR (retinoic acid receptor) | ACGTCATGACCT | binds to elements termed RAREs (retinoic acid response elements); also binds to c-Jun/c-Fos site
SRF (serum response factor) | GGATGTCCATATTAGGACATCT | exists in many genes that are inducible by the growth factors present in serum

3. Motif Finding

http://web.indstate.edu/thcme/mwking/gene-regulation.html

Page 35: Agenda

• Genes in a cluster have similar expression patterns.

• They might share common regulatory motifs so they are expressed simultaneously.

• It is of interest to find motifs from the gene clusters.
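Before any de novo motif search, a simple first check is to scan the upstream sequences of a cluster for a known motif. The sketch below looks for the c-Myc/Max site CACGTG; the gene names and sequences are made up for illustration, and scanning for a known consensus is a much simpler task than discovering an unknown motif.

```python
import re

# Known consensus site for c-Myc/Max (E-box).
E_BOX = re.compile("CACGTG")

# Hypothetical upstream sequences for the genes in one expression cluster.
CLUSTER = {
    "gene1": "TTGACACGTGAT",
    "gene2": "AAACCCGGGTTT",
    "gene3": "CACGTGCACGTG",
}


def motif_positions(sequence, pattern=E_BOX):
    """Return the 0-based start positions of the motif in `sequence`."""
    return [match.start() for match in pattern.finditer(sequence)]


def fraction_with_motif(sequences, pattern=E_BOX):
    """Fraction of sequences that contain the motif at least once."""
    hits = sum(1 for seq in sequences.values() if pattern.search(seq))
    return hits / len(sequences)


for gene, seq in CLUSTER.items():
    print(gene, motif_positions(seq))
print(fraction_with_motif(CLUSTER))
```

A degenerate position such as (C/G) can be written as a character class, e.g. `TGA[CG]T[CA]A`, and the fraction of cluster genes carrying the motif can then be compared with the genome-wide fraction using the enrichment tests from Section 2.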

3. Motif Finding

Page 36: Agenda

The following materials are obtained from Shirley Liu at Harvard.

3. Motif Finding


