CSCE555 Bioinformatics
Lecture 23: Integrative Genomics II
Meeting: MW 4:00PM-5:15PM, SWGN2A21
Instructor: Dr. Jianjun Hu
Course page:
University of South Carolina, Department of Computer Science and Engineering

Outline: Integrative Genomics II
  Integrative Microarray Analysis
  The Data Sources
  Graph mining of co-expression relationships
  Codense
  Second order relationship analysis

Information flow in cellular systems: DNA, RNA, Protein
  Gene expression datasets (the transcriptome): oligonucleotide array, cDNA array
  Protein abundance measurement (Mass Spec)
  Protein interactions (yeast 2-hybrid system, protein arrays)
  Protein complexes (Mass Spec)

  1995  Bacteria   ~1.6K genes
  1997  Eukaryote  ~6K genes
  1998  Animal     ~20K genes
  2001  Human      30~100K genes
  Objective: $1000 human genome

Rapid accumulation of microarray data in public repositories
  NCBI Gene Expression Omnibus: 137,231 experiments
  EBI Array Express: 55,228 experiments
  The amount of public microarray data increases 3-fold per year.

Multiple Microarray Technology Platforms
  [Figure: microarray platforms and their datasets]

Graph-based Approach for the Integrative Microarray Analysis
  [Figure: gene-by-experiment data, graph patterns over genes, functional annotation (Gene Ontology), transcriptional annotation (TF)]

Frequent Subgraph Mining: Problem is hard!
  Problem formulation: Given n graphs, identify subgraphs which occur in at least m graphs (m ≤ n).
  Our graphs are massive! (>10,000 nodes and >1 million edges)
  The traditional pattern growth approach (expanding a frequent subgraph of k edges to k+1 edges) would not work, since the time and memory requirements increase exponentially with the size of the patterns and the number of networks.

Novel Algorithms to identify diverse frequent network patterns
  CoDense (Hu et al., ISMB 2005): identify frequent coherent dense subgraphs across many massive graphs
  Network Biclustering (Huang et al., ISMB 2007): identify frequent subgraphs across many massive graphs
  Network Modules (NeMo) (Yan et al., ISMB 2007): identify frequent dense vertex sets across many massive graphs

CODENSE: identify frequent coherent dense subgraphs across massive graphs
  Identify frequent co-expression clusters across multiple microarray data sets
  [Figure: several expression matrices (genes g1, g2, ... by conditions c1, c2, ..., cm), each converted into a co-expression graph over genes a-k]

The solution: CODENSE
  A novel algorithm, called CODENSE, mines frequent coherent dense subgraphs. The target subgraphs have three characteristics:
  (1) Frequency: all edges occur in >= k graphs.
  (2) Coherency: all edges exhibit correlated occurrences in the given graph set.
  (3) Density: the subgraph is dense, i.e. its density d is higher than a threshold, where d = 2m/(n(n-1)), m: #edges, n: #nodes.

CODENSE: Mine coherent dense subgraph
  [Figure: graphs G1-G6 over nodes a-i, the summary graph built from them, and a dense subgraph Sub( ) of the summary graph]
  (1) Build a summary graph by eliminating infrequent edges.
  (2) Identify dense subgraphs of the summary graph (Step 2: MODES).
  Observation: If a frequent subgraph is dense, it must be a dense subgraph in the summary graph. However, the converse is not true.
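
Steps (1) and (2) above, together with the density formula d = 2m/(n(n-1)), can be illustrated with a minimal sketch. The function names (`build_summary_graph`, `density`), the toy graphs, and the choice of k = 2 are assumptions made for illustration only; this is not the CODENSE implementation, and the dense-subgraph search itself (MODES) is not reproduced here.

```python
from collections import Counter

def build_summary_graph(graphs, k):
    """Keep only the edges that occur in at least k of the input graphs."""
    counts = Counter()
    for edges in graphs:
        # Normalize undirected edges so (a, b) and (b, a) count once per graph.
        counts.update({tuple(sorted(e)) for e in edges})
    return {e for e, c in counts.items() if c >= k}

def density(nodes, edges):
    """d = 2m / (n(n-1)) for the subgraph induced on `nodes`."""
    nodes = set(nodes)
    n = len(nodes)
    if n < 2:
        return 0.0
    m = sum(1 for u, v in edges if u in nodes and v in nodes)
    return 2 * m / (n * (n - 1))

# Hypothetical toy input: three small co-expression graphs given as edge lists.
graphs = [
    [("c", "e"), ("c", "f"), ("e", "f"), ("a", "b")],
    [("e", "c"), ("c", "f"), ("e", "f"), ("g", "h")],
    [("c", "e"), ("c", "f"), ("b", "d")],
]

summary = build_summary_graph(graphs, k=2)
d_triangle = density({"c", "e", "f"}, summary)
d_with_a = density({"a", "c", "e", "f"}, summary)
print(sorted(summary), d_triangle, d_with_a)
```

In this toy input only the triangle c-e-f survives the frequency filter, so it has density 1, while adding the unconnected node a lowers the density to 0.5.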
CODENSE: Mine coherent dense subgraph
  (3) Construct the edge occurrence profiles for each dense summary subgraph.
  [Figure: edge occurrence profile matrix E, with one row per edge (c-e, c-f, c-h, c-i, e-f, ...) and one column per graph G1-G6]
  (4) Build a second-order graph S for each dense summary subgraph.
  [Figure: second-order graph S whose vertices are the edges c-f, c-h, c-e, e-h, e-f, f-h, c-i, e-i, e-g, g-i, h-i, g-h, f-i]
  (5) Identify dense subgraphs of the second-order graph.
  [Figure: Sub(S) with vertices c-f, c-h, c-e, e-h, e-f, f-h, e-i, e-g, g-i, h-i, g-h]
  Observation: if a subgraph is coherent (its edges show high correlation in their occurrences across a graph set), then its second-order graph must be dense.
  (6) Identify the coherent dense subgraphs.
  [Figure: Sub(G) with vertex sets {c, e, f, h} and {e, g, h, i}]

CODENSE Solution
  The identified subgraphs by definition satisfy the three criteria:
  (1) Frequency: all edges occur in >= k graphs.
  (2) Coherency: all edges exhibit correlated occurrences in the given graph set.
  (3) Density: the subgraph is dense, i.e. its density d is higher than a threshold, where d = 2m/(n(n-1)), m: #edges, n: #nodes.

CODENSE: scalability
  The design of CODENSE addresses the scalability issue. Instead of mining each biological network individually, CODENSE compresses the networks into two meta-graphs (the summary graph and the second-order graph) and performs clustering in these two graphs only. Thus, CODENSE can handle a large number of networks.
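
Steps (3) and (4) above can be sketched as follows. This is an illustrative sketch, not the published algorithm: the binary occurrence profiles, the use of Pearson correlation as the similarity measure, the 0.6 threshold, and all names (`edge_profiles`, `second_order_graph`) are assumptions chosen for the example.

```python
from itertools import combinations
from math import sqrt

def edge_profiles(edges, graphs):
    """Occurrence vector per edge: entry i is 1 if the edge is present in graph i."""
    normalized = [{tuple(sorted(e)) for e in g} for g in graphs]
    return {e: [1 if e in g else 0 for g in normalized] for e in edges}

def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def second_order_graph(profiles, threshold):
    """Vertices are edges of the original graphs; two vertices are linked
    when their occurrence profiles are correlated at or above `threshold`."""
    links = set()
    for e1, e2 in combinations(sorted(profiles), 2):
        if pearson(profiles[e1], profiles[e2]) >= threshold:
            links.add((e1, e2))
    return links

# Hypothetical toy input: six graphs G1..G6 given as edge lists.
graphs = [
    [("c", "e"), ("c", "f")],
    [("e", "c"), ("c", "f")],
    [("c", "e")],
    [("g", "i")],
    [("i", "g")],
    [("g", "i")],
]
edges = [("c", "e"), ("c", "f"), ("g", "i")]

profiles = edge_profiles(edges, graphs)
links = second_order_graph(profiles, threshold=0.6)
print(profiles)
print(links)
```

Here c-e and c-f tend to occur in the same graphs and are linked in the second-order graph, while g-i occurs in a disjoint set of graphs and stays unlinked. A dense region of the second-order graph therefore corresponds to a set of edges with coherent occurrences.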
Applying CoDense to 39 yeast microarray data sets
  [Figure: expression matrices (genes by conditions c1, c2, ..., cm), each converted into a co-expression graph over genes a-k]

Functional Annotation (Validation)
  Method: leave-one-out approach. Mask a known gene as unknown and assign its function based on the other genes in the subgraph pattern.
  Functional categories: 166 functional categories at GO level at least 6
  Results: 448 predictions with accuracy of 50%

Functional Annotation (Prediction)
  We made functional predictions for 169 genes, covering a wide range of functional categories, e.g. amino acid biosynthesis, ATP biosynthesis, ribosome biogenesis, vitamin biosynthesis, etc.
  A significant number of our predictions are supported by the literature.

Functional Annotation
  We made functional predictions for 779 known and 116 unknown genes by random forest classification with 71% accuracy.
  Variables for random forest classification:
    functional enrichment P-value
    network topology score
    network connectivity
    pattern recurrence numbers
    average node degree
    unknown gene ratio
    network size

Reconstruct transcriptional cascades by second-order analysis (Zhou et al., Nature Biotech 2005)
  Frequently occurring tight clusters; transcription factors
  [Figure: co-occurrence of tight clusters in the co-expression networks constructed from datasets 1-5]
  [Figure: coexpression networks, transcription factor set 1, transcription factor set 2, cooperativity, relevance networks]

Three types of transcription cascades
  [Figure: Type I, Type II and Type III cascades involving TF1, TF2, TF3 and genes 1-5; edge types: transcription regulation, protein interaction]

Applying to 39 yeast microarray data sets
  We identified 60 transcription modules. Among them, we found 34 pairs that showed high 2nd-order correlation. A significant portion (29%, p-value
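
The notion of 2nd-order correlation between modules can be illustrated with a small sketch. It assumes one common reading of second-order analysis: a module's first-order score in a dataset is the mean pairwise expression correlation of its genes, and two modules are second-order correlated when these scores rise and fall together across datasets. The function names, toy expression values, module names A, B, C, and the 0.8 threshold are all hypothetical, and this is not presented as the procedure used in the cited work.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def first_order(module, dataset):
    """Mean pairwise expression correlation of the module's genes in one dataset."""
    pairs = list(combinations(module, 2))
    return sum(pearson(dataset[g], dataset[h]) for g, h in pairs) / len(pairs)

def second_order_pairs(modules, datasets, threshold):
    """Module pairs whose first-order profiles correlate across datasets."""
    profiles = {name: [first_order(genes, d) for d in datasets]
                for name, genes in modules.items()}
    return [(p, q) for p, q in combinations(sorted(profiles), 2)
            if pearson(profiles[p], profiles[q]) >= threshold]

# Hypothetical toy data: four datasets with three conditions each.
up, down = [1.0, 2.0, 3.0], [3.0, 2.0, 1.0]
datasets = [
    {"a1": up, "a2": up,   "b1": up, "b2": up,   "c1": up, "c2": up},
    {"a1": up, "a2": down, "b1": up, "b2": down, "c1": up, "c2": up},
    {"a1": up, "a2": up,   "b1": up, "b2": up,   "c1": up, "c2": down},
    {"a1": up, "a2": down, "b1": up, "b2": down, "c1": up, "c2": down},
]
modules = {"A": ["a1", "a2"], "B": ["b1", "b2"], "C": ["c1", "c2"]}

high_pairs = second_order_pairs(modules, datasets, threshold=0.8)
print(high_pairs)
```

In the toy data, modules A and B are tightly co-expressed in exactly the same datasets and are reported as a pair, while module C switches on and off independently of them.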

Date post:	08-Jan-2018
Upload:	prudence-henry