Functional Gene Clustering via Gene Annotation Sentences, MeSH and GO Keywords from Biomedical Literature
Dr. N. JEYAKUMAR, M.Sc., Ph.D.
Bioinformatics Centre, School of Biotechnology
Madurai Kamaraj University
Madurai – 625021, INDIA
Purpose & Goals
Extract gene-specific functional ‘keywords’ from biological literature
    From full abstracts
    From gene-specific sentences
Augment the extracted keywords with MeSH and GO keywords related to each gene
Compare the accuracy of the keyword extraction methods on a test data set
    Full abstracts
    Gene-specific sentences
    Gene-specific sentences + MeSH keywords
    Gene-specific sentences + MeSH and GO keywords
Use the keyword extraction method to cluster the differentially expressed genes from a microarray experiment
Outline
Two parts: I and II
Part I: Text mining and keyword extraction from literature
    Our text mining methodology
Part II: Applications to microarrays
    Functional keyword clustering of microarray data
Part I: Text Mining
Text Mining: Introduction and overview
Text mining
    aims to identify non-trivial, implicit, previously unknown, and potentially useful patterns in text (e.g. classification systems, association rules, hypotheses)
    includes more established research areas such as information retrieval (IR), natural language processing (NLP), information extraction (IE), and traditional data mining (DM)
    is relevant to bioinformatics because of
        the explosive growth of biomedical literature (e.g. MEDLINE, 15 million records)
        the availability of some information in textual form only, e.g. clinical records
Text Mining: System Architecture
[Figure: Experimental design of gene clustering with sentence-level, MeSH and GO keywords. Component labels recoverable from the diagram: Microarray Experiment, Gene List, Gene/Protein Dictionary, Filtering MEDLINE Abstracts, Set of Abstracts, Sentence Extraction, Keyword Extraction, MeSH/Gene Ontology, MeSH/GO Keyword Extraction, Feature Vector Generation, Clustering, Patterns Visualization Annotation.]
Text Mining: Keyword Extraction from Biomedical Literature
Steps to extract sentence-level keywords
Gene-synonym dictionary: A gene name synonym dictionary was created for human genes using Entrez Gene.
Gene-name normalization: Each gene name in an abstract is replaced with its unique canonical identifier (Entrez Gene ID) using the gene-synonym dictionary constructed for this study.
Sentence filtering: Corpus-specific regular expressions select the gene-specific sentences. For example, the expression
    ($gene @{0,6} $action (of|with) @{0,2} $gene)
extracts sentences that match the structure
    gene name, 0-6 words, action verb, ‘of’ or ‘with’, 0-2 words, gene name
where the notational construct ‘A B ...’ is interpreted as ‘A followed by B followed by ...’.
Keyword extraction: covered on the following slides.
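The normalization and sentence-filtering steps above can be sketched in Python as follows. This is a minimal illustration, not the scripts used in the study: the synonym dictionary, the `GENE_<id>` token format, the action-word list and the function names are all assumptions made for the example, and `@{m,n}` is translated as "m to n whitespace-delimited words".

```python
import re

# Hypothetical gene-synonym dictionary: surface form -> canonical token.
# The token format GENE_<Entrez Gene ID> is an assumption for this sketch.
SYNONYMS = {"BRCA1": "GENE_672", "TP53": "GENE_7157", "p53": "GENE_7157"}

# Hypothetical lexicon standing in for the $action class.
ACTIONS = ["interaction", "activation", "inhibition", "regulation"]


def normalize(sentence):
    """Replace every known gene name with its canonical token."""
    names = sorted(SYNONYMS, key=len, reverse=True)  # longest names first
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, names)) + r")\b")
    return pattern.sub(lambda m: SYNONYMS[m.group(1)], sentence)


def gap(lo, hi):
    """Regex fragment for between lo and hi arbitrary words (the @{lo,hi} construct)."""
    return rf"(?:\S+\s+){{{lo},{hi}}}"


GENE = r"GENE_\d+"
ACTION = "(?:" + "|".join(ACTIONS) + ")"

# ($gene @{0,6} $action (of|with) @{0,2} $gene)
PATTERN = re.compile(
    rf"{GENE}\s+{gap(0, 6)}{ACTION}\s+(?:of|with)\s+{gap(0, 2)}{GENE}"
)


def is_gene_specific(sentence):
    """True if the normalized sentence matches the filtering expression."""
    return PATTERN.search(normalize(sentence)) is not None


if __name__ == "__main__":
    print(is_gene_specific("BRCA1 domains required for interaction with p53"))
    print(is_gene_specific("BRCA1 was first cloned in the 1990s"))
```

Normalizing before matching means one pattern covers every synonym of a gene, which is presumably why the slide lists normalization ahead of filtering.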
Text Mining: Keyword Extraction from Biomedical Literature
Table 1. An example set of regular expressions for nouns describing agents and actions, and for passive and active verbs

Nouns describing agents
    Expression pattern: ($gene (is)? (the|an|a) @{0,2} $action of @{0,2} $gene)
    Sentence output: IL6, a known mediator of STAT3 response
Nouns describing actions
    Expression pattern: ($gene @{0,6} $action (of|with) @{0,1} $gene)
    Sentence output: abi5 domains required for interaction with abi3
Passive verbs
    Expression pattern: ($gene @{0,6} (is|was|be|are|were) @{0,1} $action (by|via|through) @{0,3} $gene)
    Sentence output: Protein kinase C (PKC) has been shown to be activated by parathyroid hormone
Active verbs
    Expression pattern: ($gene $sub-action @{0,1} $action @{0,2} $gene)
    Sentence output: Insulin-mediated inhibition of hormone-sensitive lipase activity
Text Mining: Keyword Extraction from Biomedical Literature
Keyword extraction example
Sentence: BRCA1 physically associates with p53 and stimulates its transcriptional activity.
Brill POS-tagged sentence: BRCA1/NNP physically/RB associates/VBZ with/IN p53/NN and/CC stimulates/VBZ its/PRP$ transcriptional/JJ activity/NN ./.
Sentence keywords: associates, stimulates, transcription activity
Sentence keywords after manual curation: transcription activity
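One way the tagged sentence could be turned into candidate keywords is sketched below in Python. The slides do not state the selection rule, so the rule used here (keep verbs, and keep adjective/noun runs that contain a noun, skipping gene names) is an assumption, as are the function and variable names. No stemming is applied, so the sketch returns "transcriptional activity" where the slide shows "transcription activity".

```python
# Tagged sentence copied from the example; tokens use the word/TAG format.
TAGGED = (
    "BRCA1/NNP physically/RB associates/VBZ with/IN p53/NN and/CC "
    "stimulates/VBZ its/PRP$ transcriptional/JJ activity/NN ./."
)
GENES = {"BRCA1", "p53"}  # gene names are excluded from the keywords


def extract_keywords(tagged, genes):
    """Collect verbs and adjective/noun phrases from a word/TAG string."""
    tokens = [tok.rsplit("/", 1) for tok in tagged.split()]
    keywords, phrase = [], []
    # A sentinel token flushes any phrase still open at the end.
    for word, tag in tokens + [["", "END"]]:
        is_gene = word in genes
        if not is_gene and tag.startswith(("JJ", "NN")):
            phrase.append((word, tag))
            continue
        # Phrase boundary: keep the run only if it contains a noun.
        if any(t.startswith("NN") for _, t in phrase):
            keywords.append(" ".join(w for w, _ in phrase))
        phrase = []
        if not is_gene and tag.startswith("VB"):
            keywords.append(word)
    return keywords


if __name__ == "__main__":
    print(extract_keywords(TAGGED, GENES))
```

The manual curation step shown on the slide (dropping generic verbs such as "associates") would follow this automatic pass.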
Text Mining: MeSH Keyword Extraction
MeSH keywords
    MeSH keywords are subject index terms assigned to each scientific article by the National Library of Medicine (NLM) for the purpose of subject indexing and searching journal articles via PubMed.
MeSH keyword extraction
    Extracted directly from gene-specific abstracts via Perl scripts.
MeSH keyword curation
    Using a dictionary of MeSH stop words (e.g., human, DNA, animal, Support U.S. Govt.).
For example, the MeSH keywords associated with the gene ‘FOS’ in our gene list are ‘oncogene, felypressin, transcription-factor, thermo-receptors, DNA-binding, antibiosis, inflammatory-response, zinc-fingers, gene-regulation, and neuronal-plasticity’.
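The extraction and curation steps can be illustrated with a short Python sketch (the study itself used Perl scripts, which are not shown in the slides). It assumes the abstracts are available in the PubMed MEDLINE text format, where each MeSH heading sits on a line beginning with `MH  - `. The sample record, the stop-word set and the function name are hypothetical.

```python
# Hypothetical stop-word dictionary of overly general MeSH headings.
MESH_STOP_WORDS = {"humans", "animals", "dna"}

# Hypothetical record in MEDLINE text format (only the fields needed here).
SAMPLE_RECORD = """\
PMID- 0000000
TI  - A made-up title used only for this example.
MH  - Humans
MH  - *Transcription Factors/metabolism
MH  - Zinc Fingers
MH  - DNA
"""


def mesh_keywords(record, stop_words=MESH_STOP_WORDS):
    """Return the MeSH headings of one record, minus stop words."""
    keywords = []
    for line in record.splitlines():
        if not line.startswith("MH  - "):
            continue
        # Drop the subheading after '/' and the major-topic asterisk.
        heading = line[6:].split("/")[0].lstrip("*").strip().lower()
        if heading and heading not in stop_words and heading not in keywords:
            keywords.append(heading)
    return keywords


if __name__ == "__main__":
    print(mesh_keywords(SAMPLE_RECORD))
```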
Text Mining: GO Keyword Extraction
GO keywords
    Gene Ontology (GO) is a hierarchical organization of gene and gene product terms from various databases, in which concepts at higher levels of the hierarchy are more general than those further down.
GO keyword extraction
    Of the three GO annotation categories, we included only molecular function and biological process and left out cellular component, as it is less important for characterizing gene functions.
    Further, because of the hierarchical nature of GO and the multiple inheritance in its structure, each annotated term is considered together with every ancestor up to level 2 of the GO tree.
For example, the GO keywords associated with the gene ‘FOS’ in our gene list are ‘protein-dimerization, DNA binding, RNA polymerase, transcription factor, DNA methylation, and inflammatory-response’.
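The ancestor expansion can be sketched in Python as below. Everything in this block is an assumption made for illustration: the small is-a hierarchy is a toy and not the real GO graph, the root is taken to be level 1 (so level 2 holds its direct children), and with multiple inheritance a term's level is taken as its shortest distance from the root.

```python
from functools import lru_cache

# Toy is-a hierarchy (child -> parents), invented for this example only.
PARENTS = {
    "molecular_function": [],
    "binding": ["molecular_function"],
    "nucleic acid binding": ["binding"],
    "DNA binding": ["nucleic acid binding"],
    "transcription regulator activity": ["molecular_function"],
    # A term with two parents, to exercise multiple inheritance.
    "DNA-binding transcription factor activity": [
        "transcription regulator activity",
        "DNA binding",
    ],
}


@lru_cache(maxsize=None)
def level(term):
    """Level of a term: 1 for the root, else 1 + the smallest parent level."""
    parents = PARENTS[term]
    return 1 if not parents else 1 + min(level(p) for p in parents)


def expand_to_level(term, min_level=2):
    """Return the term plus all of its ancestors at or below min_level."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return {t for t in seen if level(t) >= min_level}


if __name__ == "__main__":
    for t in sorted(expand_to_level("DNA-binding transcription factor activity")):
        print(level(t), t)
```

Stopping at level 2 keeps the broad functional classes while excluding the root, which every gene would otherwise share.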
Text Mining: Keyword Representation and Calculation of Numeric Vectors
This process is concerned with computing the numeric weight, wij, for each gene-keyword pair (gi, tj) (i = 1, 2, … n and j = 1, 2, … k) to represent the gene’s characteristics in terms of the associated keywords.
Common techniques for such numeric encoding include
Binary. The presence or absence of a keyword relative to a gene.
Term frequency. The frequency of occurrence of a keyword with a gene.
Term frequency / inverse document frequency (TF*IDF). The relative frequency of occurrence of a keyword with a gene compared to other genes
Text Mining: TF*IDF Weighting
The most widely used weighting scheme in information retrieval and text classification is TF*IDF (term frequency / inverse document frequency).
TF(w,d) (Term Frequency) is the number of times word w occurs in a document d.
DF(w) (Document Frequency) is the number of documents in which the word w occurs at least once.
The inverse document frequency is calculated as
$$\mathrm{IDF}(w) = \log\left(\frac{|D|}{\mathrm{DF}(w)}\right)$$
where |D| is the total number of documents in the corpus.
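As a worked illustration of these definitions, the Python sketch below computes TF(w,d) * IDF(w) for a tiny corpus. The two token lists are made up, the function name is hypothetical, and the natural logarithm is an assumption, since the slide does not give the base.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: TF * IDF} dict per document.

    documents: list of token lists; IDF(w) = log(|D| / DF(w)).
    """
    term_freqs = [Counter(doc) for doc in documents]
    # DF(w): number of documents in which w occurs at least once.
    doc_freq = Counter(w for tf in term_freqs for w in tf)
    n_docs = len(documents)
    return [
        {w: count * math.log(n_docs / doc_freq[w]) for w, count in tf.items()}
        for tf in term_freqs
    ]


if __name__ == "__main__":
    corpus = [["apoptosis", "kinase", "kinase"], ["kinase", "repair"]]
    for weights in tf_idf(corpus):
        print(weights)
```

A word that occurs in every document, such as "kinase" here, gets weight 0 because log(|D| / DF(w)) = log(1).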
Text Mining: Keyword Representation and Calculation of Numeric Vectors
In our study, because the keywords are extracted from gene-specific sentences rather than from full abstracts, the number of keywords associated with each gene is small.
Further, the frequency of occurrence of most keywords tended to be one.
Therefore, the binary encoding scheme was adopted, as illustrated in Table 2.

Genes / Terms   t1        t2        ...   tk
g1              w11 = 0   w12 = 1   ...   w1k = 1
g2              w21 = 1   w22 = 1   ...   w2k = 0
...             ...       ...       ...   ...
gn              wn1 = 0   wn2 = 0   ...   wnk = 1

Table 2. Binary representation of genes * keywords
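A binary gene-by-keyword matrix of this kind can be built with a few lines of Python. This is a sketch with placeholder data: the gene labels g1 to g3, their keyword sets and the function name are hypothetical.

```python
def binary_matrix(gene_keywords):
    """Encode {gene: set of keywords} as a 0/1 matrix.

    Returns (genes, terms, matrix), where matrix[i][j] is 1 if
    terms[j] is associated with genes[i] and 0 otherwise.
    """
    genes = sorted(gene_keywords)
    terms = sorted({t for kws in gene_keywords.values() for t in kws})
    matrix = [
        [1 if term in gene_keywords[gene] else 0 for term in terms]
        for gene in genes
    ]
    return genes, terms, matrix


if __name__ == "__main__":
    toy = {
        "g1": {"transcription factor", "zinc fingers"},
        "g2": {"angiogenesis", "transcription factor"},
        "g3": {"angiogenesis"},
    }
    genes, terms, matrix = binary_matrix(toy)
    print(terms)
    for gene, row in zip(genes, matrix):
        print(gene, row)
```

Each row is the vector for one gene and each column the vector for one keyword, which matches the row and column interpretation used for clustering.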
Text Mining: Gene Clustering
The binary encoding adopted in this study yields numeric row vectors representing genes (via the associated biological functional keywords) and numeric column vectors representing annotation terms (via the associated genes).
Clustering can produce useful and specific information about the biological characteristics of sets of genes.
Clustering: partition unlabeled examples into disjoint subsets (clusters) such that
    examples within a cluster are very similar
    examples in different clusters are very different
Discover new categories in an unsupervised manner.
Text Mining: Test Set and Evaluation
A test set of 20 genes with 10 abstracts per gene, giving a total of 200 abstracts in two cancer categories (Table 3), was used to evaluate the usefulness of our keyword extraction method.

Category        Genes
Brain Tumor     ADAM23, DKK1, IGF2, LRRC4, L3MBTL, MMP9, MSH2, PTPNS1, SFMBT1, ZIC1
Breast Cancer   AMPH, ATM, BRCA1, BRCA2, CHEK2, CDH1, PHB, TFF1, TSG101, XRCC3

Table 3. Test set of 20 human genes manually grouped into two cancer categories
Text Mining: Evaluation
1. Full abstract keywords (baseline). Extracts gene annotation terms based on term frequency * inverse document frequency (TF*IDF) within the entire abstract, without regard to sentence structure.
2. Sentence keywords. Extracts gene-specific keywords based on sentence-level processing.
3. Sentence + MeSH keywords. As in (2) above plus MeSH terms (see Text Mining: MeSH Keyword Extraction).
4. Sentence + MeSH + GO keywords. As in (2) above plus MeSH terms (see Text Mining: MeSH Keyword Extraction) and GO terms (see Text Mining: GO Keyword Extraction).
Text Mining: Evaluation
Results of various keyword extraction methods

Keyword Extraction Method        Precision   Recall   F-measure (%)
Abstract keywords (baseline)     0.31        0.24     27.05
Sentence keywords only           0.57        0.38     45.60
Sentence + MeSH keywords         0.64        0.47     54.19
Sentence + MeSH + GO keywords    0.78        0.72     74.88
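The F-measure column is consistent with the balanced F-measure, the harmonic mean of precision and recall, F = 2PR / (P + R). The slides do not state the formula, so this is an inference from the numbers; the short Python check below recomputes the column from the reported precision and recall and agrees with the table to within rounding of the two-decimal inputs.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-measure in %) taken from the results table.
RESULTS = {
    "Abstract keywords (baseline)": (0.31, 0.24, 27.05),
    "Sentence keywords only": (0.57, 0.38, 45.60),
    "Sentence + MeSH keywords": (0.64, 0.47, 54.19),
    "Sentence + MeSH + GO keywords": (0.78, 0.72, 74.88),
}

if __name__ == "__main__":
    for method, (p, r, reported) in RESULTS.items():
        computed = 100 * f_measure(p, r)
        print(f"{method}: computed {computed:.2f}, reported {reported:.2f}")
```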
Part II: Applications to Microarrays
Functional keyword Clustering of genes resulting from microarray experiment
Applications to Microarrays: Data and Analysis
As an illustrative example, our keyword extraction method was applied to the functional interpretation of clusters of genes found to be differentially expressed in a microarray experiment investigating the impact of two mitogens, epidermal growth factor (EGF) and sphingosine 1-phosphate (S1P), on glioblastoma cell lines.
Compared to the resting state, 19 genes were significantly differentially expressed in response to EGF, 35 genes in response to S1P, and 30 genes in response to COM, i.e., the combined stimuli of S1P and EGF. The three gene lists are referred to as G(EGF), G(S1P) and G(COM), respectively (Table 4).
Applications to Microarrays: Data and Analysis
Table 4. List of Differentially Expressed Genes

Gene List: G(EGF) (19 genes)
Name of Genes: HRY, KLF2, ID1, JUN, DUSP6, IMPDH2, GP1BB, PNUTL1, CGI-96, CALD1, TRIM15, FOS, SPRY4, CLU, SLC5A3, MRPS6, ABCA1, OLFM1, PHLDA1

Gene List: G(S1P) (35 genes)
Name of Genes: F3, NR4A1, KLF5, GADD45B, IL8, CITED2, CALD1, IL6, BCL6, LBH, HRB2, KIAA0992, NFKBIA, TNFAIP3, CCL2, DSCR1, TXNIP, NAB1, EHD1, GBP1, GLIPR1, MAP2K3, FZD7, RGS3, SOCS5, FOSL2, JAG1, DOC1, NRG1, BTG1, PDE4C, KIAA1718, KIAA0346, SFRS3, PLAU

Gene List: G(COM) (30 genes)
Name of Genes: MAFF, DUSP5, EGR3, SERPINE1, ZFP36, DUSP1, LIF, DTR, MYC, GADD45B, RTP801, ATF3, JUNB, SNARK, WEE1, EGR2, TIEG, SPRY2, CEBPD, SGK, GEM, NEDD9, LDLR, EGR1, C8FW, UGCG, MCL1, ZYX, FOSL1, DIPA
Applications to Microarrays: Data and Analysis
Querying MEDLINE with the three gene lists obtained from the microarray experiment (Table 4) returned three corresponding sets of abstracts, A(EGF), A(S1P) and A(COM), respectively (Table 5).
The abstracts were processed with the keyword extraction method that uses sentence-level keywords augmented with MeSH and GO keywords.
The resulting keywords were encoded with the binary weighting scheme.
The resulting representations were clustered using the average linkage hierarchical clustering algorithm.
Applications to Microarrays: Data and Analysis
Table 5. Three sets of abstracts, A(EGF), A(S1P), and A(COM), retrieved via MEDLINE for this study

Gene List   # of Genes in List   Retrieved Abstract Set   # of Abstracts in Set
G(EGF)      19                   A(EGF)                   28,913
G(S1P)      35                   A(S1P)                   19,705
G(COM)      30                   A(COM)                   39,890
Applications to Microarrays: Average Linkage Hierarchical Clustering Algorithm
Use average similarity across all pairs within the merged cluster to measure the similarity of two clusters.
$$\mathrm{sim}(c_i, c_j) = \frac{1}{|c_i \cup c_j|\,(|c_i \cup c_j| - 1)} \sum_{x \in (c_i \cup c_j)} \; \sum_{y \in (c_i \cup c_j):\, y \neq x} \mathrm{sim}(x, y)$$
Compromise between single and complete link.
Averaged across all ordered pairs in the merged cluster instead of unordered pairs between the two clusters.
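The merging rule described on this slide, averaging similarity over all ordered pairs within the merged cluster, can be sketched in plain Python as follows. The slides do not name the pairwise similarity measure, so the Jaccard coefficient on keyword sets is an assumption here, as are the toy genes g1 to g4, their keywords and the function names. The quadratic search for the best pair is kept simple for clarity and is not tuned for gene lists of realistic size.

```python
from itertools import combinations, permutations


def jaccard(a, b):
    """Pairwise similarity of two keyword sets (assumed measure)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def group_average_sim(ci, cj, keywords):
    """Average similarity over all ordered pairs x != y in the merged cluster."""
    merged = ci + cj
    n = len(merged)
    total = sum(jaccard(keywords[x], keywords[y]) for x, y in permutations(merged, 2))
    return total / (n * (n - 1))


def average_linkage(keywords, n_clusters):
    """Merge the most similar pair of clusters until n_clusters remain."""
    clusters = [[gene] for gene in sorted(keywords)]
    while len(clusters) > n_clusters:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: group_average_sim(clusters[p[0]], clusters[p[1]], keywords),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


if __name__ == "__main__":
    toy = {
        "g1": {"transcription factor", "DNA binding"},
        "g2": {"transcription factor", "zinc fingers"},
        "g3": {"angiogenesis", "inflammation"},
        "g4": {"angiogenesis", "cell adhesion"},
    }
    print(average_linkage(toy, 2))
```

Note that this group-average definition differs from implementations that average only over pairs drawn from the two different clusters, so results from an off-the-shelf "average" linkage routine may not match it exactly.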
Applications to Microarrays: Results
[Figure: Summary of analysis of EGF cluster. Shows the genes of G(EGF) together with their extracted functional keywords. Keyword labels recoverable from the figure include: neural tube defects, transcription factor, cell death, embryogenesis, ion binding, angiogenesis, inhibition, embryonic development, trans-activators, zinc fingers, mitogenesis, secretion, biosynthesis, regulation, glycoprotein, androgens, odontogenesis, calmodulin-binding, desaturases, shape-regulation, relaxation, tumorigenesis, intracellular, glutamine-transport, DNA-methylation, felypressin, transition, clustering, recombination, thermo-receptors, v-fos, fusion, sensation, immuno-reactivity, antibiosis, osteoblasts.]
Applications to Microarrays: Results
[Figure: Summary of analysis of S1P cluster. Shows the genes of G(S1P) together with their extracted functional keywords. Keyword labels recoverable from the figure include: atherogenesis, mitogenesis, inflammation, angiogenesis, endocytosis, lymphocytes, pathogenesis, DNA-dependent, focal-contact, DNA-damage, splicing, G1 phase, extracellular, motility, protein-binding, cos-cells, myosin, RNA localization, dose-response, anticodon, cytotoxicity, parasitophorous, G protein, demyelination, cytolysis, Ca release, locomotion, homeostasis, circulation, phosphorylation, synthesis, repair, protein kinase, endothelialization, organogenesis, cell-adhesion, mutagenesis, immune-response.]
Applications to Microarrays: Results
[Figure: Summary of analysis of COM cluster. Shows the genes of G(COM) together with their extracted functional keywords. Keyword labels recoverable from the figure include: DNA-binding, zinc fingers, repressor proteins, DNA-dependent, nucleus, transactivation, leucine zippers, transcription, gene expression regulation, oxidative stress, proto-oncogene, cell survival, signal transduction, maturation, endocytosis, differentiation, mitogenesis, mitosis, G2 phase, chemosensitivity, mutagenesis, lymph, angiogenesis, ion binding, RNA processing, G2-M transition, mRNA splicing, immortality, DNA recombination, microtubule, gene silencing, helix-loop-helix motifs, transcription factor, seizures, genome instability, DNA modification, DNA methylation, jun genes.]
Conclusions
An important topic in microarray data mining is to link transcriptionally modulated genes to functional pathways, or to determine how transcriptional modulation can be associated with specific biological events such as genetic disease phenotypes, cell differentiation, etc.
However, the amount of functional annotation available for each transcriptionally modulated gene is still a limiting factor, because not all genes are well annotated.
Further, Jenssen et al. (2001) earlier compiled a network of human gene relationships from MEDLINE abstracts. These compiled relationships were then compared to gene expression clustering results. This approach gives a very interesting result: functionally related genes can show totally different expression patterns, and hence belong to different clusters (Jenssen et al.: A literature network of human genes for high-throughput analysis of gene expression, Nat. Genet., 28, 21-28, 2001).
Conclusions
Our functional keyword clustering/grouping of genes will make it possible to select functionally informative genes from among the differentially expressed genes for further investigation.
Our evaluation suggests that this approach provides more specific and useful information than typical approaches using abstract-level information. This is particularly the case when the sentence-level terms are augmented by MeSH and GO keywords.
Current text mining is moving toward full-text articles. Because full text contains a larger number of irrelevant sentences than abstracts, this approach is well suited to full-text studies, as it filters out irrelevant sentences before clustering.
Acknowledgments
Eric G. Bremer, Brain Tumor Research Program, Children’s Memorial Research Center, Chicago, IL, USA, and James R. van Brocklyn, Division of Neuropathology, Department of Pathology, The Ohio State University, Columbus, Ohio, USA, for the microarray data set
Dr. Daniel Berrar, Bioinformatics Research Group, University of Ulster, UK
Members of the Bioinformatics Centre, Madurai Kamaraj University, India
Dept. of Biotechnology, Govt. of India, for Bioinformatics facilities
THANK YOU