KDD-2008

Anticipating Annotations and Emerging Trends in Biomedical Literature

Fabian Mörchen, Mathäus Dejori, Dmitriy Fradkin, Julien Etienne, Bernd Wachmann

Integrated Data Systems, Siemens Corporate Research, Princeton, NJ

Markus Bundschus

Department of Computer Science, Ludwig-Maximilians-University

8/25/2008, Las Vegas, NV

From data to knowledge

Data sources for biomedical literature research

- PubMed: >15M abstracts from 1975-2007.
- UniProt: >200k proteins.
- Gene Ontology: >22k processes and functions.
- MeSH: >23k medical terms, >170k synonyms.
- FDA Clinical Trials: >50k reports.
- …
- Proprietary data sources, such as patent information and news articles.

The large volume and complexity of information require automation of important tasks:

- Detect biomedical concepts
- Detect topics
- Satisfy search queries
- Track historical trends
- Predict new trends


BioJournalMonitor

Query

PubMed and other sources

By keyword, by MeSH concept, by date, …

Topics: Group large search results into topics to facilitate analysis and drill-down.

Trends: Show emerging trends related to the query.

Named-entity recognition, annotation, trend analysis, clustering, ranking

Use cases

- Screening of biomedical literature
- Early detection of biomarkers and technologies related to a disease
- Tracking relevance of biomarkers over time
- Prediction of research trends


Annotation


Medical Subject Headings

MeSH annotation

- Each document in PubMed is manually indexed with a set of MeSH terms.
- Semi-automatic approaches assist indexers.

Our approach

- Model the generative process of document writing and document indexing.

Document writing

- Author chooses relevant topics.
- Based on the topic distribution, the author writes the paper.

Document indexing

- Indexer reads the paper and extracts the hidden topic structure.
- Indexer assigns index terms based on topics.


Topic-Concept Model

- Given a set of annotated documents D = {(w_1, c_1), …, (w_D, c_D)}, simultaneously model the process of document writing and document indexing.
- Use a hierarchical Bayesian framework to model this generative process.
- For each of the M_d concepts in document d, draw a topic according to the topic assignments of the words in d.
- θ, Φ, Γ provide information about the topic, word, and concept distributions.

- w: word chosen from a vocabulary of size N
- c: concept chosen from a set of MeSH concepts
- z: topic responsible for generating a word
- z_tilde: topic responsible for generating a concept
- α, β, γ: Dirichlet prior parameters
- θ, Φ, Γ: model parameters (to be learned)
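The generative story above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `sample_dirichlet` and `generate_document` are hypothetical, the word and concept distributions per topic (`phi`, `gamma`) are passed in rather than drawn from their Dirichlet priors, a single symmetric `alpha` is assumed, and picking each concept's topic uniformly from the word topic assignments is an assumption of this sketch.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw a Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, n_concepts, phi, gamma, alpha, rng):
    """Sample one annotated document.

    phi[t] is the word distribution of topic t, gamma[t] its concept
    distribution; alpha is a symmetric Dirichlet parameter (assumption).
    """
    n_topics = len(phi)
    theta = sample_dirichlet([alpha] * n_topics, rng)  # topic mixture of the document
    # Document writing: draw a topic z for each word, then the word itself.
    z = rng.choices(range(n_topics), weights=theta, k=n_words)
    words = [rng.choices(range(len(phi[t])), weights=phi[t])[0] for t in z]
    # Document indexing: each concept reuses the topic of one of the words
    # (chosen uniformly here, which is an assumption of this sketch).
    z_tilde = [rng.choice(z) for _ in range(n_concepts)]
    concepts = [rng.choices(range(len(gamma[t])), weights=gamma[t])[0] for t in z_tilde]
    return words, concepts, z, z_tilde
```

Because every concept topic is taken from the word topics, a concept can only be generated by a topic that actually occurs in the text of the document, which is what ties the annotation to the content.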


Experiment

- Use 2 benchmark datasets provided by NLM.
- Compare results with the NLM approach and Naïve Bayes (multi-label).
- Prune MeSH concepts to the top layer (109 MeSH concepts).

[Figure: results for Random 50K and Genetics]
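One way the pruning step could look is sketched below. It assumes that "top layer" means the first dot-separated component of a MeSH tree number (for example `C04` for a descendant of `C04`); the slides do not spell this out, and the function name `prune_to_top_layer` and the sample tree numbers are illustrative only.

```python
def prune_to_top_layer(tree_numbers):
    """Map MeSH tree numbers to their first dot-separated component.

    Assumption: the 'top layer' is identified by the tree-number prefix
    before the first dot. Duplicates are merged and the result is sorted.
    """
    return sorted({number.split(".", 1)[0] for number in tree_numbers})

# Illustrative tree numbers, not taken from the benchmark data.
labels = prune_to_top_layer(["C04.557.470", "C04.588", "G05.360"])
```

After pruning, each document carries a small set of coarse labels, which keeps the multi-label comparison with Naïve Bayes tractable.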


Emerging Trend Detection

Problem

- New MeSH terms are selected by experts. This can happen long after the term becomes important and widely used!
- An early identification of potential MeSH terms would be very useful for technology scouting teams and biomedical researchers.

Challenges

- Automatically identify newly emerging important concepts.
- Prepare a collection that can be used for evaluation of emerging trend detection methods:
  - 1.5M PubMed abstracts from 01/1975 through 10/2007 with keywords: cancer, carcinoma, tumor, neopla, malignant.
  - 81 interesting cancer-related MeSH terms introduced during this period.


Representation and Scoring

Representation:

• Term frequency in a sliding 12-month window.

• Divide by the total number of documents in that period.

Scoring function (better than the one in the paper!):

• Consider a 24-month period ending with the current month t.

• Count the number of times the normalized frequency f reaches a new maximum in that period:

\[ s(w,t) = \sum_{i=0}^{24} \mathbb{1}\left[ f(w,t-i) > \max_{j=i+1,\dots,24} f(w,t-j) \right] \]

Excluded terms that:

• have not yet occurred (impossible)

• have already been added to MeSH (truth known)

• are added to MeSH within the next year (too late)

• TP: terms that will be added after at least 1 year

• FP: terms that will never be added to MeSH

• FN: terms that are added to MeSH at time [t+1y, t+5y]

• TN: all other terms

Experimental setup

• 140K word stems, 81 true positives

• Top-ranked 300 terms / month

• Time horizon 1 year / 5 years
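The representation and scoring described on this slide can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: the function names `normalized_frequency` and `new_maximum_score` are hypothetical, inputs are plain monthly count lists, and the oldest month of the period is treated as a baseline that does not itself count as a new maximum.

```python
def normalized_frequency(term_counts, doc_counts, window=12):
    """f[t]: term occurrences in the last `window` months divided by
    the number of documents published in the same months."""
    freqs = []
    for t in range(len(term_counts)):
        lo = max(0, t - window + 1)
        docs = sum(doc_counts[lo:t + 1])
        freqs.append(sum(term_counts[lo:t + 1]) / docs if docs else 0.0)
    return freqs

def new_maximum_score(freqs, t, period=24):
    """Count the months in the period ending at t whose frequency is
    strictly higher than every earlier month in that period."""
    start = max(0, t - period)
    running_max = freqs[start]  # baseline month, not counted (assumption)
    score = 0
    for value in freqs[start + 1:t + 1]:
        if value > running_max:
            score += 1
            running_max = value
    return score
```

A term whose normalized frequency keeps setting new highs gets a high score, while a term with a single spike followed by a plateau scores only once, which favors sustained growth over one-off bursts.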


Results

Time difference between inclusion in MeSH and earliest detection in the top 300. 48 out of 81 positive terms are detected.

Is the top 300 too many?

300 terms × 12 months × 25 years = 90,000 terms.

However, only 6,290 unique terms occurred in the top 300 in this period.

Addition of new MeSH terms describing cancer-related biomarkers. Since the first term was added in 1985, it only makes sense to start evaluation in 1980 (given our horizon parameters).


Precision and Recall Measures
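As a rough sketch, precision and recall for one month's ranking could be computed from the TP/FP/FN definitions given under Representation and Scoring. The reading used here is an assumption: the top-ranked terms (after removing excluded terms) count as detections, and the terms added to MeSH within the evaluation horizon count as positives. The function name `precision_recall` and the term names in the usage line are hypothetical.

```python
def precision_recall(top_ranked, future_mesh_terms):
    """Precision and recall of one month's top-ranked terms.

    top_ranked: terms returned by the scoring function (excluded terms removed).
    future_mesh_terms: terms added to MeSH within the evaluation horizon.
    """
    detected = set(top_ranked)
    positives = set(future_mesh_terms)
    tp = len(detected & positives)   # detected and later added to MeSH
    fp = len(detected - positives)   # detected but not added
    fn = len(positives - detected)   # added but not detected
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical term names, for illustration only.
p, r = precision_recall(["a", "b", "c", "d"], ["a", "b", "e"])
```

With 300 terms ranked per month and only 81 positives overall, precision is bounded well below 1 by construction, so recall and earliness of detection are the more informative quantities.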


BioJournalMonitor

Trends of concepts and topics

Group abstracts into topics.

Summarize topics with keywords.


Conclusion

- Described the BioJournalMonitor system for automated analysis of biomedical literature and other data sources.
- Discussed in detail: automated categorization of articles using LDA models, and detection of important emerging trends.

Future Work

- Extend the LDA approach to cover the entire MeSH hierarchy.
- Examine supervised approaches for identifying emerging trends.
- Evaluate on different data, e.g. heart disease instead of cancer-related biomarkers.