Gene Analytics: Discovery and Contextualization of
Enriched Gene Groups
N. Lavrač, I. Mozetič, V. Podpečan, P. Kralj Novak (Jožef Stefan Institute, Ljubljana)
H. Motaln, M. Petek, K. Gruden
(National Institute of Biology, Ljubljana)
http://www.bisonet.eu BISON Bled WK, Aug. 2009
Talk outline
• Relational data mining and subgroup discovery
• Semantic data mining: Using ontologies in SEGS
• BISON challenge
• Experimental use case: Glioma cancer treatment
• BISON methodology: combining SEGS+Biomine
• Gene analytics services and future work
Data Mining
knowledge discovery from data
[Diagram: data → Data Mining → model, patterns, …]
Given: transaction data table, relational database, text documents, Web pages
Find: a classification model, a set of interesting patterns
Person   Age             Spect. presc.  Astigm.  Tear prod.  Lenses
O1       young           myope          no       reduced     NONE
O2       young           myope          no       normal      SOFT
O3       young           myope          yes      reduced     NONE
O4       young           myope          yes      normal      HARD
O5       young           hypermetrope   no       reduced     NONE
O6-O13   ...             ...            ...      ...         ...
O14      pre-presbyopic  hypermetrope   no       normal      SOFT
O15      pre-presbyopic  hypermetrope   yes      reduced     NONE
O16      pre-presbyopic  hypermetrope   yes      normal      NONE
O17      presbyopic      myope          no       reduced     NONE
O18      presbyopic      myope          no       normal      NONE
O19-O23  ...             ...            ...      ...         ...
O24      presbyopic      hypermetrope   yes      normal      NONE
Subgroup discovery task definition
(Kloesgen, Wrobel 1997)
– Given: a population of individuals and a property of interest (e.g. AML class, in the task of finding genes differentially expressed in AML leukemia as opposed to ALL leukemia)
– Find: 'most interesting' descriptions of population subgroups that
• are as large as possible (high target class coverage)
• have the most unusual distribution of the target property (high TP/FP ratio, high significance)
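The two criteria above can be made concrete with a small sketch. It is only an illustration: the function name `subgroup_stats` and the toy labels are assumptions, and the actual quality functions used in subgroup discovery systems may differ.

```python
def subgroup_stats(labels, covered):
    """Summarise a candidate subgroup.

    labels  : list of bool, True if the individual has the property of interest
    covered : list of bool, True if the subgroup description matches the individual
    """
    tp = sum(1 for y, c in zip(labels, covered) if c and y)       # covered, target class
    fp = sum(1 for y, c in zip(labels, covered) if c and not y)   # covered, other class
    positives = sum(labels)
    return {
        "tp": tp,
        "fp": fp,
        # share of the target class that the subgroup covers
        "target_coverage": tp / positives if positives else 0.0,
        # unusualness of the class distribution inside the subgroup
        "tp_fp_ratio": tp / fp if fp else float("inf"),
    }

# Toy data (made up): 8 individuals, 4 of them in the target class.
labels = [True, True, True, False, False, False, False, True]
covered = [True, True, False, False, True, False, False, True]
stats = subgroup_stats(labels, covered)
print(stats)
```

A subgroup that scores well on both numbers is large and has an unusual class distribution, which is the trade-off the task definition describes.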
Sample microarray analysis tasks
• Two-class diagnosis problem of distinguishing between acute lymphoblastic leukemia (ALL, 27 samples) and acute myeloid leukemia (AML, 11 samples), with 34 samples in the test set. Every sample is described with gene expression values for 7129 genes.
• Multi-class cancer diagnosis problem with 14 different cancer types, in total 144 samples in the training set and 54 samples in the test set. Every sample is described with gene expression values for 16063 genes.
• SD results in simple IF-THEN rules, interpretable by biologists
IF (KIAA0128_gene DIFF-EXPRESSED) AND (prostaglandin_d2_synthase_gene NOT-DIFF-EXP)
THEN Leukemia
Relational Data Mining (Inductive Logic Programming)
knowledge discovery from data
[Diagram: relational data → Relational Data Mining → model, patterns, …]
Given: a relational database, a set of tables, sets of logical facts, a graph, …
Find: a classification model, a set of interesting patterns
Relational Data Mining (ILP)
• Learning from multiple tables
• Complex relational problems:
– structured data: representation of molecules and their properties in protein engineering, biochemistry, ...
• Semantic relational data mining
– Using domain ontologies as background knowledge for relational data mining
Gene Ontology (GO)
• GO is a database of terms for genes:
– Function - What does the gene product do?
– Process - Why does it perform these activities?
– Component - Where does it act?
• Known genes are annotated to GO terms (www.ncbi.nlm.nih.gov)
• Terms are connected as a directed acyclic graph (is_a, part_of)
• Levels represent specificity of the terms
Term counts: 12093 biological processes, 1812 cellular components, 7459 molecular functions
Ontology encoded as relational background knowledge
Prolog facts: predicate(geneID, CONSTANT).
interaction(geneID, geneID).
component(2532,'GO:0016020').
component(2532,'GO:0005886').
component(2534,'GO:0008372').
function(2534,'GO:0030554').
function(2534,'GO:0005524').
process(2534,'GO:0007243').
interaction(2534,5155).
interaction(2534,4803).
Basic, plus generalized background knowledge using GO
zinc ion binding ->
metal ion binding, ion binding, binding
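The generalization step can be sketched as an ancestor closure over is_a links. This is a minimal illustration, not the SEGS implementation: the `IS_A` table encodes only the simplified chain shown on the slide, and the gene name `geneX` is hypothetical.

```python
# Simplified is_a chain from the slide (child term -> more general parent terms).
IS_A = {
    "zinc ion binding": ["metal ion binding"],
    "metal ion binding": ["ion binding"],
    "ion binding": ["binding"],
    "binding": [],
}

def ancestors(term, parents=IS_A):
    """Return all more general terms reachable from `term`, nearest first."""
    found, queue = [], list(parents.get(term, []))
    while queue:
        current = queue.pop(0)
        if current not in found:
            found.append(current)
            queue.extend(parents.get(current, []))
    return found

def generalize(annotations, parents=IS_A):
    """Extend each gene's basic annotations with every ancestor term."""
    return {
        gene: sorted(set(terms) | {a for t in terms for a in ancestors(t, parents)})
        for gene, terms in annotations.items()
    }

basic = {"geneX": ["zinc ion binding"]}   # hypothetical gene, basic annotation only
generalized = generalize(basic)
print(ancestors("zinc ion binding"))
print(generalized["geneX"])
```

With generalized annotations, a gene annotated only with a specific term also matches features built from the more general terms.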
Multi-Relational representation
[Diagram: GENE (main table, class labels) linked to GENE-GENE INTERACTION and, through GENE-FUNCTION, GENE-PROCESS and GENE-COMPONENT, to the FUNCTION, PROCESS and COMPONENT tables; each of these three tables carries is_a and part_of relations]
Propositionalization in RDM
f(7,A):-function(A,'GO:0046872').
f(8,A):-function(A,'GO:0004871').
f(11,A):-process(A,'GO:0007165').
f(14,A):-process(A,'GO:0044267').
f(15,A):-process(A,'GO:0050874').
f(20,A):-function(A,'GO:0004871'), process(A,'GO:0050874').
f(26,A):-component(A,'GO:0016021').
f(29,A):-function(A,'GO:0046872'), component(A,'GO:0016020').
f(122,A):-interaction(A,B), function(B,'GO:0004872').
f(223,A):-interaction(A,B), function(B,'GO:0004871'), process(B,'GO:0009613').
f(224,A):-interaction(A,B), function(B,'GO:0016787'), component(B,'GO:0043231').
Propositionalization through first-order feature construction (KARDIO 1994, LINUS 1991, RSD 2006)
Novelty of SEGS (2008): Feature construction from ontology information, features with support > min_support
Slide annotation: existential (in features such as f(122,A), the variable B is existentially quantified)
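A rough sketch of how such first-order features turn into a 0/1 table follows. The gene IDs `g1`..`g4`, their annotations and the helper names are all made up for illustration; only the two feature definitions are taken from the slide, and the support filter follows the "support > min_support" condition.

```python
# Toy fact base (hypothetical genes and annotations).
FUNCTION = {("g1", "GO:0046872"), ("g2", "GO:0004872"),
            ("g3", "GO:0046872"), ("g4", "GO:0004872")}
INTERACTION = {("g1", "g2"), ("g2", "g3"), ("g3", "g4")}
GENES = ["g1", "g2", "g3", "g4"]

def f7(a):
    # f(7,A) :- function(A,'GO:0046872').
    return (a, "GO:0046872") in FUNCTION

def f122(a):
    # f(122,A) :- interaction(A,B), function(B,'GO:0004872').
    # B is existential: a single interacting partner with the function suffices.
    return any((b, "GO:0004872") in FUNCTION for (x, b) in INTERACTION if x == a)

def propositionalize(genes, features, min_support):
    """Evaluate every feature on every gene; keep features with support > min_support."""
    table = {name: [int(feature(g)) for g in genes] for name, feature in features.items()}
    return {name: column for name, column in table.items() if sum(column) > min_support}

features = {"f7": f7, "f122": f122}
table = propositionalize(GENES, features, min_support=1)
for name, column in table.items():
    print(name, column)
```

Each kept feature becomes one column of the propositional table, so a standard attribute-value learner can be applied afterwards.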
Gene set enrichment analysis with SEGS
• A gene set is enriched if the genes that are members of that gene set are statistically significantly differentially expressed compared to the rest of the genes.
• New gene set enrichment method: SEGS - Searching for Enriched Gene Sets (JSI) (Trajkovski et al. JBI 2008)
• SEGS approach: Using GO, KEGG and ENTREZ ontologies as background knowledge for semantic subgroup discovery
Ontologies
• Gene Ontology (GO): standardized biological terms used to annotate gene products
– Molecular Function
– Biological Process
– Cellular Component
• Kyoto Encyclopedia of Genes and Genomes (KEGG): manually drawn pathway maps representing the knowledge on the molecular interaction and reaction networks
• ENTREZ: gene annotations with GO and KO terms and gene-gene interaction data
Identifying differentially expressed genes in data preprocessing
[Figure: gene expression matrix (gene i, sample j)]
To identify genes that display a large difference in gene expression between groups (class A and class B) and are homogeneous within groups, statistical tests (e.g. t-test) and p-values (e.g. permutation test) are computed.
The two-sample t-statistic is used to test the equality of the group means mA and mB.
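As an illustration of this preprocessing step, the sketch below computes a two-sample t-statistic for one gene and an exact permutation p-value. The slide does not say which variant of the t-statistic is used, so the Welch form (unequal variances) is an assumption, as are the function names and the expression values.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance

def t_statistic(a, b):
    """Two-sample t-statistic for group means mA and mB (Welch form, an assumption)."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))

def permutation_p_value(a, b):
    """Exact two-sided permutation p-value: share of all relabellings of the
    pooled samples whose |t| is at least as large as the observed |t|."""
    pooled = a + b
    observed = abs(t_statistic(a, b))
    hits = total = 0
    for idx in combinations(range(len(pooled)), len(a)):
        chosen = set(idx)
        group_a = [pooled[i] for i in idx]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        if abs(t_statistic(group_a, group_b)) >= observed - 1e-12:
            hits += 1
    return hits / total

# Made-up expression values of one gene in class A and class B.
class_a = [5.0, 6.0, 7.0]
class_b = [1.0, 2.0, 3.0]
t = t_statistic(class_a, class_b)
p = permutation_p_value(class_a, class_b)
print(round(t, 3), p)
```

Enumerating all relabellings is only feasible for small groups; with realistic sample counts a random subset of permutations would be drawn instead.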
Ranking of differentially expressed genes
The genes can be ordered in a ranked list L, according to their differential expression between the classes.
The challenge is to extract meaning from this list, i.e. to describe the genes in it.
The terms of the Gene Ontology were used as a vocabulary for the description of the genes.
Gene expression data: Positive and negative examples for data mining
fact(class, geneID, weight).
fact('diffexp',64499, 5.434).
fact('diffexp',2534, 4.423).
fact('diffexp',5199, 4.234).
fact('diffexp',1052, 2.990).
fact('diffexp',6036, 2.500).
……
fact('random',7443, 1.0).
fact('random',9221, 1.0).
fact('random',23395,1.0).
fact('random',9657, 1.0).
fact('random',19679, 1.0).
……
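One plausible way to derive such facts from a ranked gene list is sketched below. It is an assumption, not the documented SEGS procedure: the function `make_facts`, the seed, and the low scores given to the genes that appear as 'random' on the slide are all illustrative.

```python
import random

def make_facts(scores, top_n, n_random, seed=0):
    """Build (class, geneID, weight) triples from per-gene scores.

    The top_n genes by absolute score become 'diffexp' facts weighted by score;
    n_random genes sampled from the remainder become 'random' facts with weight 1.0.
    """
    ranked = sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)
    diffexp = [("diffexp", gene, abs(score)) for gene, score in ranked[:top_n]]
    remainder = [gene for gene, _ in ranked[top_n:]]
    rng = random.Random(seed)                      # fixed seed for repeatability
    sampled = rng.sample(remainder, min(n_random, len(remainder)))
    return diffexp + [("random", gene, 1.0) for gene in sampled]

scores = {
    64499: 5.434, 2534: 4.423, 5199: 4.234, 1052: 2.990, 6036: 2.500,  # from the slide
    7443: 0.31, 9221: 0.12, 23395: 0.25, 9657: 0.08, 19679: 0.19,      # made-up low scores
}
facts = make_facts(scores, top_n=5, n_random=3)
for cls, gene, weight in facts:
    print(f"fact('{cls}',{gene}, {weight}).")
```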
Ontology encoded as relational background knowledge + gene expression data
Prolog facts: predicate(geneID, CONSTANT).
interaction(geneID, geneID).
component(2532,'GO:0016020').
component(2532,'GO:0005886').
component(2534,'GO:0008372').
function(2534,'GO:0030554').
function(2534,'GO:0005524').
process(2534,'GO:0007243').
interaction(2534,5155).
interaction(2534,4803).
fact(class, geneID, weight).
fact('diffexp',64499, 5.434).
fact('diffexp',2534, 4.423).
fact('diffexp',5199, 4.234).
fact('diffexp',1052, 2.990).
fact('diffexp',6036, 2.500).
……
fact('random',7443, 1.0).
fact('random',9221, 1.0).
fact('random',23395,1.0).
fact('random',9657, 1.0).
fact('random',19679, 1.0).
……
Basic, plus generalized background knowledge using GO
zinc ion binding ->
metal ion binding, ion binding, binding
Ontology encoded as relational features + gene expression data
f(7,A):-function(A,'GO:0046872').
f(8,A):-function(A,'GO:0004871').
f(11,A):-process(A,'GO:0007165').
f(14,A):-process(A,'GO:0044267').
f(15,A):-process(A,'GO:0050874').
f(20,A):-function(A,'GO:0004871'), process(A,'GO:0050874').
f(26,A):-component(A,'GO:0016021').
f(29,A):-function(A,'GO:0046872'), component(A,'GO:0016020').
f(122,A):-interaction(A,B), function(B,'GO:0004872').
f(223,A):-interaction(A,B), function(B,'GO:0004871'), process(B,'GO:0009613').
f(224,A):-interaction(A,B), function(B,'GO:0016787'), component(B,'GO:0043231').
fact(class, geneID, weight).
fact('diffexp',64499, 5.434).
fact('diffexp',2534, 4.423).
fact('diffexp',5199, 4.234).
fact('diffexp',1052, 2.990).
fact('diffexp',6036, 2.500).
……
fact('random',7443, 1.0).
fact('random',9221, 1.0).
fact('random',23395,1.0).
fact('random',9657, 1.0).
fact('random',19679, 1.0).
……
Propositionalization
f1 f2 f3 f4 f5 f6 … … fn
g1 1 0 0 1 1 1 0 0 1 0 1 1
g2 0 1 1 0 1 1 0 0 0 1 1 0
g3 0 1 1 1 0 0 1 1 0 0 0 1
g4 1 1 1 0 1 1 0 0 1 1 1 0
g5 1 1 1 0 0 1 0 1 1 0 1 0
g1 0 0 1 1 0 0 0 1 0 0 0 1
g2 1 1 0 0 1 1 0 1 0 1 1 1
g3 0 0 0 0 1 0 0 1 1 1 0 0
g4 1 0 1 1 1 0 1 0 0 1 0 1
Propositional subgroup discovery
f1 f2 f3 f4 f5 f6 … … fn
g1 1 0 0 1 1 1 0 0 1 0 1 1
g2 0 1 1 0 1 1 0 0 0 1 1 0
g3 0 1 1 1 0 0 1 1 0 0 0 1
g4 1 1 1 0 1 1 0 0 1 1 1 0
g5 1 1 1 0 0 1 0 1 1 0 1 0
g1 0 0 1 1 0 0 0 1 0 0 0 1
g2 1 1 0 0 1 1 0 1 0 1 1 1
g3 0 0 0 0 1 0 0 1 1 1 0 0
g4 1 0 1 1 1 0 1 0 0 1 0 1
f2 and f3: [4,0] (the conjunction covers 4 of the first five genes and none of the last four)
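The [4,0] result can be reproduced with a short sketch over the first three feature columns of the table. Two things are assumptions here: that the first five rows are the differentially expressed genes and the last four the random genes, and that candidates are ranked by covered positives minus covered negatives, which is only a simple stand-in for the real quality measure.

```python
from itertools import combinations

NAMES = ["f1", "f2", "f3"]
# Columns f1, f2, f3 of the table; row grouping is an assumption (see lead-in).
POSITIVE = [(1, 0, 0), (0, 1, 1), (0, 1, 1), (1, 1, 1), (1, 1, 1)]
NEGATIVE = [(0, 0, 1), (1, 1, 0), (0, 0, 0), (1, 0, 1)]

def coverage(conjunction, rows):
    """Number of rows in which every feature of the conjunction is 1."""
    return sum(all(row[i] for i in conjunction) for row in rows)

def rank_conjunctions(max_len=2):
    """Score all conjunctions of up to max_len features."""
    results = []
    for size in range(1, max_len + 1):
        for conjunction in combinations(range(len(NAMES)), size):
            pos = coverage(conjunction, POSITIVE)
            neg = coverage(conjunction, NEGATIVE)
            label = " and ".join(NAMES[i] for i in conjunction)
            results.append((label, [pos, neg]))
    # Illustrative score: covered positives minus covered negatives.
    return sorted(results, key=lambda r: r[1][0] - r[1][1], reverse=True)

ranking = rank_conjunctions()
for label, counts in ranking:
    print(label, counts)
```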
Summary: SEGS Method and Results
• SEGS method:
– Through semantic subgroup discovery, SEGS generates candidate gene set descriptions as conjunctions of first-order features, combining individual GO, KEGG and ENTREZ terms
– SEGS combines Fisher, GSEA and PAGE enrichment tests to select the most interesting groups of differentially expressed genes
• SEGS results:
– Descriptions of subgroups of genes that are differentially expressed (e.g., belong to class DIFF-EXP of the top 300 most differentially expressed genes) in contrast with RANDOM genes (randomly selected genes with low differential expression).
• Sample subgroup description:
diffexp(A) :- interaction(A,B) & function(B,'GO:0004871') & process(B,'GO:0009613')
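Of the three enrichment tests named above, Fisher's exact test is the simplest to sketch. The version below is a generic one-sided test written from the hypergeometric distribution; how SEGS parameterises it and combines it with GSEA and PAGE is not shown here, and the function name is an assumption. The example numbers come from the nine-gene table on the propositional subgroup discovery slide (a gene set of 4 genes, all among the 5 differentially expressed ones).

```python
from math import comb

def fisher_enrichment_p(k, n_set, n_diff, n_total):
    """One-sided Fisher exact test for gene set enrichment.

    Probability of observing at least k differentially expressed genes in a
    gene set of size n_set, when n_diff of the n_total genes are
    differentially expressed (hypergeometric upper tail).
    """
    upper = min(n_set, n_diff)
    tail = sum(
        comb(n_diff, i) * comb(n_total - n_diff, n_set - i)
        for i in range(k, upper + 1)
    )
    return tail / comb(n_total, n_set)

p_value = fisher_enrichment_p(k=4, n_set=4, n_diff=5, n_total=9)
print(p_value)
```

A small p-value indicates that the gene set contains more differentially expressed genes than expected by chance.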
SEGS implementation
[Screenshots: Query; Results]
BISON project
• The challenge: Support humans in finding new, interesting links across domains, named bisociations
– across different contexts
– across different types of data and knowledge sources
• Open problems:
– Fusion of heterogeneous data/knowledge sources into a joint representation format - a large information network named BisoNet (consisting of nodes and relationships between nodes)
– Finding unexpected, previously unknown links between BisoNet nodes belonging to different contexts
Heterogeneous data sources (BISON, M. Berthold, 2008)
Use Case: Glioma Cancer (investigated at NIB)
• Glioma
– a type of brain cancer
– different types
– Glioblastoma: life expectancy less than 1 year
• Glioma treatment
– No efficient treatment available
– Testing new hypotheses for treatment: using stem cells for drug transport to the brain?
– New insights in stem cell behavior and brain cancer mechanisms?
Glioma treatment
• Biological questions:
– Are stem cells efficient/effective for drug transport?
– What are the risks associated?
• Regarding risks: Evaluation of BM-hMSC stem cell stability
– Biological experiments: 4 stem cell lines
– RNA was isolated and sent for transcriptome analysis
– hMSC growth curves reveal "fast" and "slow" growing clones
  – Slow: hMSC-1, hMSC-3: 5-6 passages/6 weeks
  – Fast: hMSC-2, hMSC-4: 8 passages/6 weeks
– Risk of malignant transformation:
  – Two lines (hMSC-1 and hMSC-2) transformed into cancer cells
– Microarray analysis is performed to find groups of differentially expressed genes in several experiments:
  – slow vs. fast growing cell lines
  – normal vs. cancerous cell lines
SEGS+Biomine Methodology
[Diagram labels: Microarray (e.g. slow-vs-fast cell growth); Gene sets; Contextualization, Exploratory link discovery]
Biomine
• In Biomine (UH) (DILS 2006), data from numerous public databases are merged into a large graph, currently consisting of 1,968,951 vertices and 7,008,607 edges.
– Vertices correspond to entities and concepts
– Edges represent known, annotated relationships between vertices.
– A link (a relation between two entities) is manifested as a path or a subgraph connecting the corresponding vertices.
– A bisociative link is a path traversing nodes belonging to different domains/contexts
• In Biomine, a method for link discovery between entities in queries was developed for graph exploration.
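The notion of a bisociative link as a path crossing contexts can be illustrated on a toy graph. Everything in this sketch is hypothetical: the node names, their contexts, the edges, and the use of plain breadth-first search, which stands in for Biomine's own link discovery method.

```python
from collections import deque

# Toy graph (hypothetical): each node belongs to one context.
CONTEXT = {"geneA": "gene", "termT": "ontology", "pathwayP": "pathway", "geneB": "gene"}
EDGES = {
    "geneA": ["termT", "pathwayP"],
    "termT": ["geneA", "geneB"],
    "pathwayP": ["geneA"],
    "geneB": ["termT"],
}

def shortest_path(start, goal, edges=EDGES):
    """Breadth-first search; returns the node list of a shortest path, or None."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbour in edges.get(path[-1], []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None

def is_bisociative(path, context=CONTEXT):
    """Treat a path as bisociative if its nodes span more than one context."""
    return len({context[node] for node in path}) > 1

path = shortest_path("geneA", "geneB")
print(path, is_bisociative(path))
```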
Biomine Information fusion
• Biomine graph integrates numerous databases
SEGS+Biomine Methodology
• Biomine information fusion into a BisoNet information network
• Interesting node discovery and contextualisation with SEGS
– Information fusion of GO, KEGG, ENTREZ
– Identify conjunctions of concepts from different domains (ontologies)
• Interesting cross-context link discovery with Biomine
– create bisociative links as paths in the Biomine subgraph connecting the concepts proposed by SEGS
• BisoNet Exploration/Explanation
– explore BisoNet paths ranked according to weights/probabilities (as currently implemented in Biomine)
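Ranking paths by probability could look roughly like the sketch below. It assumes that each edge carries a probability and that a path's score is the product of its edge probabilities (independent edges); the slides do not spell out Biomine's exact weighting, and the graph, the weights and the function names are made up.

```python
from math import prod

def path_probability(path, edge_prob):
    """Score a path as the product of its edge probabilities
    (assumes edges are independent)."""
    return prod(edge_prob[frozenset(pair)] for pair in zip(path, path[1:]))

def rank_paths(paths, edge_prob):
    """Order candidate paths from most to least probable."""
    return sorted(paths, key=lambda p: path_probability(p, edge_prob), reverse=True)

# Hypothetical undirected edges with probabilities.
edge_prob = {
    frozenset(("a", "b")): 0.9,
    frozenset(("b", "d")): 0.8,
    frozenset(("a", "c")): 0.5,
    frozenset(("c", "d")): 0.5,
    frozenset(("a", "d")): 0.6,
}
paths = [["a", "c", "d"], ["a", "d"], ["a", "b", "d"]]
ranked = rank_paths(paths, edge_prob)
for p in ranked:
    print(p, round(path_probability(p, edge_prob), 3))
```

Under this scoring a longer path through two strong edges can outrank a direct but weaker edge, which is one reason ranked exploration is more informative than shortest-path search alone.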
Biomine: Bisociative link discovery
[Screenshots: Query; Result]
SEGS+Biomine: Information fusion
SEGS merges GO, KEGG and ENTREZ; BisoNet is used for concept visualization
SEGS+Biomine: Creative knowledge discovery
Identify interesting concepts (BisoNet nodes) from different contexts (different databases)
SEGS+Biomine: Creative link discovery
Create bisociative cross-context links/paths linking BisoNet concepts from different contexts
SEGS+Biomine: Exploration and explanation
Explore and interpret the most interesting cross-context BisoNet links/paths between concepts
Summary
• SEGS discovers interesting descriptions of differentially expressed gene groups as conjunctions of concepts from different contexts
• Biomine finds cross-context links (paths) between concepts discovered by SEGS
• The SEGS+Biomine approach has the potential for creative knowledge and bisociative link discovery
• Preliminary results in stem cell microarray data analysis (EMBC 2009) indicate that the SEGS+Biomine methodology may lead to new insights
– in vitro experiments are being planned at NIB to verify and validate the preliminary insights