Inferring Hidden Relationships from Biological Literature
with Multi-level Context Terms
Introduction
• Literature Based Discovery (LBD)
  • Swanson’s ABC model
  • Drug repositioning
[Figure: Alzheimer, Insulin, and intermediate terms PKC1, CATS, SOS2, with counts 35, 2, 89, 4]
Literature-based discovery (LBD): the very idea
1. It means deriving, from the public record of science, new solutions to scientific problems.
2. The possibility arises, for example, when two articles considered together for the first time suggest new information of scientific interest not apparent from either article alone.
Venn Diagram: ABC Model
[Figure: Venn diagram of sets A, B, and C. The AB overlap holds articles about an AB relationship; the BC overlap holds articles about a BC relationship.]
AB and BC are complementary but disjoint: they can reveal an implicit relationship between A and C in the absence of any explicit relation.
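The ABC idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `abc_candidates`, the pair-list inputs, and the optional `known_ac` set are hypothetical and do not come from the slides.

```python
from collections import defaultdict

def abc_candidates(ab_pairs, bc_pairs, known_ac=frozenset()):
    """Return {(a, c): set of linking b terms} for A-C pairs that are
    connected through some B but have no explicit A-C record."""
    b_to_c = defaultdict(set)
    for b, c in bc_pairs:
        b_to_c[b].add(c)

    candidates = defaultdict(set)
    for a, b in ab_pairs:
        for c in b_to_c.get(b, ()):
            if (a, c) not in known_ac:  # keep only implicit relations
                candidates[(a, c)].add(b)
    return dict(candidates)

# Toy input modeled on the magnesium / epilepsy / migraine example.
ab = [("magnesium", "epilepsy")]
bc = [("epilepsy", "migraine")]
links = abc_candidates(ab, bc)
```

Here `links` maps the pair (magnesium, migraine) to the linking term epilepsy; passing that pair in `known_ac` would suppress it as already explicit.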
An ABC example based on title words in Medline
• AB article: Magnesium-deficient rat as a model of epilepsy. Lab Animal Sci 28:680-5, 1978
• BC article: The relation of migraine and epilepsy. Brain 92:285-300, 1969
[Figure: Venn diagram of sets of Medline records. A = magnesium (8011), C = migraine (2756), B = epilepsy; overlap counts 22 and 45. A and C are disjoint; B is an unintended link between them.]
Related work
• CTD
  • A manually curated database.
  • Inferring chemical – gene – disease relations using ABC models
Text mining and manual curation of chemical-gene-disease networks for the Comparative Toxicogenomics Database (2008, NAR)
Related work
• CoPub Discovery
  • Co-occurrence score based ABC models
  • Inferring disease, gene, and drug relations
Literature mining for the discovery of hidden connections between drugs, genes and diseases (2010, PLoS Computational Biology)
MeSH Terms
Objective
• Objective
  • Infer hidden drug-disease relations accurately from the literature
• Limitations of previous models
  • They generate a large volume of false positive candidate relations
  • They are semi-automatic, labor-intensive techniques requiring human experts’ input
• Solution strategy
  • Incorporate context information into relation inference
Suggested approaches
• Our approach
  • Key idea
    • Infer drug-disease relations based on context term similarity between:
      • Drug - gene relations
      • Disease - gene relations
  • Our hypothesis
    • Similarity between the context terms of the Drug-Gene and Gene-Disease relations makes it possible to infer more meaningful Drug-Disease relations.
[Figure: Drug and Disease linked through a Gene; the contexts of the two links are compared by similarity]
Suggested approaches
• Context vector
  • Set of biomedical terms in paper abstracts
[Figure: biomedical terms from Abstract 1 and Abstract 2 that mention an interaction are averaged into a context vector of the interaction]
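A minimal sketch of the averaging step, assuming each abstract has already been reduced to a list of recognized biomedical terms. The function name `context_vector`, the toy vocabulary, and the use of raw term counts are assumptions made for illustration; the slides only state that the vector is an average over abstracts.

```python
from collections import Counter

def context_vector(abstract_terms, vocabulary):
    """Average per-abstract term counts over the abstracts that mention
    one interaction. `vocabulary` fixes the order of the dimensions."""
    if not abstract_terms:
        return [0.0] * len(vocabulary)
    vocab_set = set(vocabulary)
    totals = Counter()
    for terms in abstract_terms:
        # Count only terms that belong to the context vocabulary.
        totals.update(t for t in terms if t in vocab_set)
    n = len(abstract_terms)
    return [totals[t] / n for t in vocabulary]

# Hypothetical vocabulary and two abstracts mentioning the same interaction.
vocabulary = ["amyloid", "glucose", "memory"]
abstracts = [["amyloid", "memory", "amyloid"], ["glucose", "memory"]]
vector = context_vector(abstracts, vocabulary)  # [1.0, 0.5, 1.0]
```

In the experiment described later, the vocabulary would be the set of UMLS-mapped context terms rather than this three-word toy list.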
Suggested approaches
• Similarity measures and score comparison
  • Similarity measures
    • Cosine similarity
    • Spearman correlation
  • Score comparison
    • All frequencies vs. context similarity based filtered frequencies
[Figure: Alzheimer, PCK1, Insulin; two context vectors (1 1 1 1 1 0 … and 0 0 2 1 1 1 …) feed a similarity measure, then scoring]
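Both similarity measures can be written in plain Python. The sketch below is an illustration, not the authors' code: the function names are hypothetical, Spearman correlation is computed as the Pearson correlation of average ranks, and the two example vectors are only the first six entries shown on the slide (the rest is elided there).

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0

def _ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(u, v):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    ru, rv = _ranks(u), _ranks(v)
    n = len(ru)
    mu, mv = sum(ru) / n, sum(rv) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(ru, rv))
    su = math.sqrt(sum((a - mu) ** 2 for a in ru))
    sv = math.sqrt(sum((b - mv) ** 2 for b in rv))
    return cov / (su * sv) if su and sv else 0.0

u = [1, 1, 1, 1, 1, 0]
v = [0, 0, 2, 1, 1, 1]
cos_uv = cosine(u, v)      # equals 4 / sqrt(35)
rho_uv = spearman(u, v)
```

Cosine similarity depends on the magnitudes of the shared terms, while Spearman correlation only compares how the terms are ordered, so the two can rank the same pair of contexts differently.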
Suggested approaches
• Experiment overview
[Figure: pipeline. PubMed abstracts → entity tagging (with an entity dictionary) → interaction extraction → context vector extraction → drug-disease inference → scored result → evaluation (previous model vs. our model; performance analysis and literature analysis). Answer sets: known disease-drug interactions from CTD (336,693) and PharmGKB (1,992).]
CTD: Comparative Toxicogenomics Database
UMLS: Unified Medical Language System
Dataset
UMLS
• 96,031 disease, 45,527 gene, 6,132 symptom synonyms
PharmGKB
• 25,693 disease, 28,091 drug, 258,840 gene synonyms
CTD
• 68,211 disease, 384,141 chemical, 679,701 gene synonyms
PubMed
• 77,711 Alzheimer's disease related abstracts
Multi-level entity recognition
• Dictionary based entity recognition from the abstracts.
  • We import data from three external databases to generate the multi-level entity dictionaries: PharmGKB, CTD, and UMLS.
  • We divide the dictionary entities into four levels: gene, drug, disease, and symptom.
Multi-level entity recognition
• From the 77,711 Alzheimer’s disease related abstracts:
  • Split the text into sentences using the Conditional Random Field (CRF) based sentence detector
  • Extract biomedical entities using LingPipe
  • Match the extracted entities against the PharmGKB and CTD entity dictionaries to extract interaction data
  • Map the extracted entities to the UMLS entity dictionary to extract members of the context vector
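Dictionary matching of this kind can be sketched as a greedy longest-match lookup. This is a simplified stand-in, not LingPipe or the CRF sentence detector: the function `tag_entities`, the regex tokenizer, the `max_len` limit, and the four-entry dictionary are all assumptions made for illustration.

```python
import re

def tag_entities(sentence, dictionary, max_len=3):
    """Greedy longest-match lookup of lowercased token n-grams.
    `dictionary` maps a synonym to its level (gene, drug, disease, symptom)."""
    tokens = re.findall(r"[A-Za-z0-9'-]+", sentence.lower())
    tagged, i = [], 0
    while i < len(tokens):
        # Try the longest phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in dictionary:
                tagged.append((phrase, dictionary[phrase]))
                i += n
                break
        else:
            i += 1  # no dictionary entry starts at this token
    return tagged

# Hypothetical multi-level dictionary entries.
dictionary = {
    "alzheimer's disease": "disease",
    "insulin": "drug",
    "pck1": "gene",
    "memory loss": "symptom",
}
tags = tag_entities("Insulin modulates PCK1 in Alzheimer's disease.", dictionary)
```

With this toy dictionary, `tags` holds insulin as a drug, pck1 as a gene, and the two-word phrase alzheimer's disease as a disease.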
Interaction Extraction
• To extract biologically meaningful interactions, we restricted extraction to the ‘drug - gene’ and ‘gene - disease’ patterns among the recognized entities.
• We generated entity dictionaries from the PharmGKB and CTD databases. PharmGKB and CTD contain different numbers of terms, so their tagging results differ.
• We tagged biological entities in PubMed records, then extracted a candidate interaction whenever two entities of different types co-occurred within a sentence.
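The sentence-level co-occurrence rule can be sketched as follows, assuming each sentence has already been reduced to a list of (entity, level) tags. The names `PATTERNS` and `extract_interactions` and the toy sentences are hypothetical.

```python
from collections import Counter
from itertools import product

# Only these two entity-type patterns are kept, as stated above.
PATTERNS = [("drug", "gene"), ("gene", "disease")]

def extract_interactions(tagged_sentences):
    """Count candidate interactions: two entities of the allowed types
    that co-occur within the same sentence."""
    counts = Counter()
    for entities in tagged_sentences:
        by_level = {}
        for name, level in set(entities):  # ignore repeats within a sentence
            by_level.setdefault(level, set()).add(name)
        for left, right in PATTERNS:
            pairs = product(by_level.get(left, ()), by_level.get(right, ()))
            for a, b in pairs:
                counts[(left, a, right, b)] += 1
    return counts

sentences = [
    [("insulin", "drug"), ("pck1", "gene")],
    [("pck1", "gene"), ("alzheimer disease", "disease")],
    [("insulin", "drug"), ("alzheimer disease", "disease")],  # not an allowed pattern
]
interactions = extract_interactions(sentences)
```

The third sentence contributes nothing because a direct drug-disease co-occurrence is not one of the two extracted patterns; such relations are what the later inference step is meant to propose.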
Evaluation method
• We compare our method to the ABC model that is based on entity frequency in Alzheimer’s disease related abstracts.
• The comparison was made for the top 100 and top 500 results.
• Literature analysis was performed on the top 10 ranked interactions.
[Figure: ABC model vs. our model, evaluated against each answer set: PharmGKB (1,992) and CTD (336,693)]
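A top-k comparison against an answer set can be expressed as precision at k. The sketch below assumes that the evaluation counts how many of the top-ranked inferred pairs appear in the curated answer set; the function name and the toy data are hypothetical.

```python
def precision_at_k(ranked_pairs, answer_set, k):
    """Fraction of the top-k ranked pairs that appear in the answer set."""
    top = ranked_pairs[:k]
    if not top:
        return 0.0
    hits = sum(1 for pair in top if pair in answer_set)
    return hits / len(top)

# Hypothetical ranked output and curated answer set.
ranked = [("d1", "x"), ("d2", "x"), ("d3", "x"), ("d4", "x")]
answers = {("d1", "x"), ("d3", "x")}
p_at_2 = precision_at_k(ranked, answers, 2)  # 0.5
p_at_4 = precision_at_k(ranked, answers, 4)  # 0.5
```

Running the same function on the ranked output of the baseline and of the context based model, with k set to 100 and 500, gives directly comparable numbers.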
Results
Entity Tagging
• From 77,711 abstracts related to “Alzheimer”:
  • 1,640,761 biomedical entities
    • 295,419 were tagged by the PharmGKB entity dictionary
    • 438,987 were tagged by the CTD entity dictionary
    • 260,291 were tagged by the UMLS entity dictionary
Interaction Extraction
• PharmGKB tagged entities
  • From 60,415 interactions
  • We inferred 14,481 new disease-drug interactions
• CTD tagged entities
  • From 119,464 interactions
  • We inferred 136,570 interactions
• Size of context vector
  • 1,641 terms
Results
PharmGKB
• The PharmGKB case does not achieve outstanding performance (between 0% and 1%).
  • The weak performance is attributed to the fact that PharmGKB has only 1,992 drug-disease interactions.
  • Furthermore, our dataset was not all PubMed abstracts but only those related to Alzheimer’s disease.
Results
CTD
• The context based approach is superior to the baseline in all cases (top 100, top 500).
  • Filtering the inferred interactions by context term based similarity improved performance over using frequency alone.
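One way to read the filtering step is: drop candidates whose context similarity falls below a cutoff, then rank the survivors by frequency. The sketch below assumes exactly that. The slides do not spell out the scoring, so the function `hybrid_rank`, the dictionary keys, and the cutoff value are illustrative assumptions.

```python
def hybrid_rank(candidates, min_similarity):
    """Keep candidates whose context similarity reaches the cutoff,
    then rank the survivors by co-occurrence frequency (highest first)."""
    kept = [c for c in candidates if c["similarity"] >= min_similarity]
    return sorted(kept, key=lambda c: c["frequency"], reverse=True)

# Hypothetical candidates; frequencies and similarities are made up.
candidates = [
    {"pair": ("drug_a", "disease_x"), "frequency": 40, "similarity": 0.30},
    {"pair": ("drug_b", "disease_x"), "frequency": 12, "similarity": 0.97},
    {"pair": ("drug_c", "disease_x"), "frequency": 25, "similarity": 0.96},
]
ranked_pairs = [c["pair"] for c in hybrid_rank(candidates, min_similarity=0.95)]
```

A frequency-only baseline would put drug_a first; with the filter it is dropped, and drug_c ranks ahead of drug_b.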
Results
Top 10 ranked interactions (CTD based)
CTD-Hybrid 0.95

ID pair           Disease                   Chemical                      PMID       Only Frequency model
D058225:D016229   Plaque, Amyloid           Amyloid beta-Peptides         21575663   o
D000544:D020932   Alzheimer Disease         Nerve Growth Factor           20965859   o
D005182:D000544   Alzheimer Disease         Flavin-Adenine Dinucleotide   12127087   o
D015850:D000544   Alzheimer Disease         Interleukin-6                 20667498   o
D000544:D014409   Alzheimer Disease         Tumor Necrosis Factor-alpha   21327054   o
D000544:D015415   Alzheimer Disease         Biological Markers            -          o
D005182:D002311   Cardiomyopathy, Dilated   Flavin-Adenine Dinucleotide   -          o
D000544:D016229   Alzheimer Disease         Amyloid beta-Peptides         21726674   o
D002311:D016229   Cardiomyopathy, Dilated   Amyloid beta-Peptides         -          x
D000544:D007328   Alzheimer Disease         Insulin                       21525299   x
Results
• Alzheimer’s disease - Insulin
  • A low score case (0.28): Alzheimer - CATS - Insulin
  • A relatively high score case (0.95): Alzheimer - CYC-1 - Insulin
Conclusion
• We proposed context vectors, built from biologically meaningful terms, to infer unknown relationships.
• We constructed a multi-level entity dictionary to recognize multi-level entities from the literature.
• We used our context vectors to discover putative drug-disease relationships.
• We evaluated the results against drug-disease relations curated from the literature (PharmGKB, CTD).
• In the 77,711 Alzheimer’s disease papers, our context vector based hybrid approach achieved better precision than the previous frequency based ABC model.
Future Study: Different Approach to Context Terms
• Based on interaction words (verb terms), define possible direct interactions among entities, and assume that interactions among the remaining entities are context.
[Figure: three example sentences, each with one interaction verb (I-verb), two interacting entities (I-Ent1, I-Ent2), and several context entities (C-Ent)]
Questions?
• Thank you!