KDD-2008
Anticipating Annotations and Emerging Trends in Biomedical Literature
Fabian Mörchen, Mathäus Dejori, Dmitriy Fradkin, Julien Etienne, Bernd Wachmann
Integrated Data Systems, Siemens Corporate Research, Princeton, NJ
Markus Bundschus
Department of Computer Science, Ludwig-Maximilians-University
KDD-2008, Las Vegas, NV, 8/25/2008
From data to knowledge
Data sources for biomedical literature research:
PubMed: >15M abstracts from 1975-2007.
UniProt: >200k proteins.
GeneOntology: >22k processes and functions.
MeSH: >23k medical terms, >170k synonyms.
FDA Clinical Trials: >50k reports.
…
Proprietary data sources, such as patent information and news articles.
The large volume and complexity of information require automation of important tasks:
- Detect biomedical concepts
- Detect topics
- Satisfy search queries
- Track historical trends
- Predict new trends
BioJournalMonitor
Query: by keyword, by MeSH concept, by date, …
Sources: PubMed and other sources.
Processing: named-entity recognition, annotation, trend analysis, clustering, ranking.
Topics: group large search results into topics to facilitate analysis and drill-down.
Trends: show emerging trends related to the query.
Use cases
- Screening of biomedical literature
- Early detection of biomarkers and technologies related to a disease
- Tracking relevance of biomarkers over time
- Prediction of research trends
Annotation
Medical Subject Headings
MeSH annotation
- Each document in PubMed is manually indexed with a set of MeSH terms.
- Semi-automatic approaches assist indexers.
Our approach
- Model the generative process of document writing and document indexing.
Document writing
- The author chooses relevant topics.
- Based on the topic distribution, the author writes the paper.
Document indexing
- The indexer reads the paper and extracts the hidden topic structure.
- The indexer assigns index terms based on the topics.
Topic-Concept Model
Given a set of annotated documents D = {(w1, c1), …, (wD, cD)}, simultaneously model the process of document writing and document indexing.
Use a hierarchical Bayesian framework to model this generative process.
For each of the Md concepts in document d, draw a topic according to the topic assignments of the words in d.
θ, Φ, Γ provide information about the topic, word, and concept distributions.
Notation
w: word chosen from a vocabulary of size N
c: concept chosen from a set of MeSH concepts
z: topic responsible for generating a word
z_tilde: topic responsible for generating a concept
α, β, γ: Dirichlet prior parameters
θ, Φ, Γ: model parameters (to be learned)
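The generative process on this slide can be illustrated with a small forward sampler. This is a minimal sketch, not the authors' implementation: it assumes symmetric Dirichlet priors, a fixed number of words and concepts per document, and that z_tilde is drawn uniformly from the word-topic assignments z of the same document. The function names `dirichlet` and `generate_corpus` and all default values are hypothetical.

```python
import random


def dirichlet(rng, alpha, k):
    """Draw from a symmetric Dirichlet(alpha) of dimension k via normalized gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [x / total for x in draws]


def generate_corpus(n_docs, n_topics, vocab_size, n_concepts,
                    words_per_doc, concepts_per_doc,
                    alpha=0.5, beta=0.5, gamma=0.5, seed=0):
    """Sample annotated documents (w, c) from the assumed generative process."""
    rng = random.Random(seed)
    # Phi: one word distribution per topic; Gamma: one concept distribution per topic.
    phi = [dirichlet(rng, beta, vocab_size) for _ in range(n_topics)]
    gam = [dirichlet(rng, gamma, n_concepts) for _ in range(n_topics)]
    corpus = []
    for _ in range(n_docs):
        # Document writing: theta is the topic distribution of this document.
        theta = dirichlet(rng, alpha, n_topics)
        z = rng.choices(range(n_topics), weights=theta, k=words_per_doc)
        w = [rng.choices(range(vocab_size), weights=phi[t])[0] for t in z]
        # Document indexing: each concept reuses a topic already assigned to a word
        # (assumption: picked uniformly from z).
        z_tilde = [rng.choice(z) for _ in range(concepts_per_doc)]
        c = [rng.choices(range(n_concepts), weights=gam[t])[0] for t in z_tilde]
        corpus.append({"w": w, "c": c, "z": z, "z_tilde": z_tilde})
    return corpus
```

Learning θ, Φ, Γ from real data means inverting this process (for example with Gibbs sampling); the sketch only shows the direction in which the model assumes documents and annotations arise.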
Experiment
Use 2 benchmark datasets provided by NLM: Random 50K and Genetics.
Compare results with the NLM approach and Naïve Bayes (multi-label).
Prune MeSH concepts to the top layer (109 MeSH concepts).
Emerging Trend Detection
Problem
- New MeSH terms are selected by experts; this can happen long after the term becomes important and widely used!
- An early identification of potential MeSH terms would be very useful for technology scouting teams and biomedical researchers.
Challenges
- Automatically identify newly emerging important concepts.
- Prepare a collection that can be used for evaluation of emerging trend detection methods:
  - 1.5M PubMed abstracts from 01/1975 through 10/2007 with keywords: cancer, carcinoma, tumor, neopla, malignant.
  - 81 interesting cancer-related MeSH terms introduced during this period.
Representation and Scoring
Representation:
• Term frequency in sliding 12 month window.
• Divide by the total number of documents in that period.
Scoring function (Better than the one in the paper!)
• Consider a 24 month period ending with the current month t
• Count the number of times normalized frequency f reaches a new maximum in that period:
$$ s(w, t) = \sum_{i=0}^{24} \mathbb{1}\left[ f(w, t-i) > \max_{j = i+1, \ldots, 24} f(w, t-j) \right] $$
Excluded terms that
- have not yet occurred (impossible),
- have already been added to MeSH (truth known),
- are added to MeSH within the next year (too late).
Evaluation classes
- TP: terms that will be added after at least 1 year
- FP: terms that will never be added to MeSH
- FN: terms that are added to MeSH at time [t+1y, t+5y]
- TN: all other terms
Experimental setup
• 140K word stems, 81 true positives
• Top ranked 300 terms / month
• Time horizon 1 year / 5 years
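The representation and scoring bullets on this slide can be sketched in a few lines. This is an illustrative reading, not the authors' code: it assumes monthly counts are given as plain lists, that "term frequency" means the number of occurrences counted per month, and that the oldest month in the window (which has nothing earlier to compare against) never counts as a new maximum. The names `normalized_frequency` and `trend_score` are hypothetical.

```python
def normalized_frequency(term_counts, doc_counts, window=12):
    """f[t]: term count in the `window` months ending at t, divided by
    the total number of documents in the same months."""
    f = []
    for t in range(len(term_counts)):
        lo = max(0, t - window + 1)
        docs = sum(doc_counts[lo:t + 1])
        f.append(sum(term_counts[lo:t + 1]) / docs if docs else 0.0)
    return f


def trend_score(f, t, horizon=24):
    """Count the months t-i (i = 0..horizon) at which f exceeds every
    earlier value inside the period that ends at month t."""
    score = 0
    for i in range(horizon + 1):
        if t - i < 0:
            break
        earlier = [f[t - j] for j in range(i + 1, horizon + 1) if t - j >= 0]
        # Assumption: a month with no earlier month in the period is not counted.
        if earlier and f[t - i] > max(earlier):
            score += 1
    return score
```

Under these assumptions a steadily rising term collects a high score, while a term whose frequency is flat or falling scores zero, which is what makes the count usable for ranking terms each month.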
Results
Time difference between inclusion in MeSH and earliest detection in top 300. 48 out of 81 positive terms are detected.
Is top 300 too much?
300 terms * 12 months * 25 years = 90,000 terms.
However, only 6,290 unique terms occurred in top 300 in this period.
Addition of new MeSH terms describing cancer-related biomarkers. Since the 1st term is added in 1985, it only makes sense to start evaluation in 1980 (given our horizon parameters).
Precision and Recall Measures
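For reference, precision and recall for a single month's ranking follow from the TP / FP / FN classes defined on the Representation and Scoring slide. The sketch below uses the standard definitions and assumes the ranked list has already been filtered to exclude the terms listed there; `precision_recall` and its arguments are hypothetical names, and the default cutoff mirrors the top 300 used in the experimental setup.

```python
def precision_recall(ranked_terms, positives, k=300):
    """Precision and recall of the top-k ranked terms against the set of
    terms that are later added to MeSH within the evaluation horizon."""
    top = set(ranked_terms[:k])
    tp = len(top & positives)       # detected and later added
    fp = len(top) - tp              # detected but not added
    fn = len(positives - top)       # added but not detected
    precision = tp / (tp + fp) if top else 0.0
    recall = tp / (tp + fn) if positives else 0.0
    return precision, recall
```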
BioJournalMonitor
Trends of concepts and topics
Group abstracts into topics.
Summarize topics with keywords.
Conclusion
Described the BioJournalMonitor system for automated analysis of biomedical literature and other data sources.
Discussed in detail:
- automated categorization of articles using LDA models
- detection of important emerging trends
Future Work
- Extend the LDA approach to cover the entire MeSH hierarchy.
- Examine supervised approaches for identifying emerging trends.
- Evaluate on different data, e.g. heart disease instead of cancer-related biomarkers.