
Posted on 20-Jan-2016


Lars Juhl Jensen

Biomedical text mining

exponential growth

~45 seconds per paper

information retrieval

named entity recognition

augmented browsing

text corpora

information extraction

information retrieval

find the relevant papers

ad hoc retrieval

user-specified query

“yeast AND cell cycle”

PubMed

indexing

fast lookup
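A minimal sketch of how indexing gives fast lookup for a Boolean query such as “yeast AND cell cycle”. The three toy documents and the function names are assumptions made for illustration, the phrase "cell cycle" is simply treated as two AND-ed terms, and this is not how PubMed itself is implemented.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lower-cased token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token.strip(".,;:()")].add(doc_id)
    return index

def search_and(index, terms):
    """Return ids of documents that contain every query term (Boolean AND)."""
    postings = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*postings) if postings else set()

# Toy corpus (invented for this example).
docs = {
    1: "Cell cycle regulation in yeast.",
    2: "Yeast metabolism under stress.",
    3: "The cell cycle in human cells.",
}
index = build_index(docs)
hits = search_and(index, ["yeast", "cell", "cycle"])
print(hits)
```

The index is built once; each query then only intersects a few precomputed sets instead of scanning every document.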

stemming

word endings

dynamic query expansion

MeSH terms
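Stemming and query expansion can be sketched in a few lines. The suffix rules below are deliberately crude (real systems use a proper stemmer), and the `SYNONYMS` table is a hypothetical stand-in for a controlled-vocabulary lookup, not an actual MeSH mapping.

```python
def stem(word):
    """Crude suffix stripping: drop a common word ending if enough stem remains."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        if (word.endswith(suffix) and not word.endswith("ss")
                and len(word) - len(suffix) >= 3):
            return word[: -len(suffix)]
    return word

# Hypothetical synonym table, keyed by stemmed term (illustrative only).
SYNONYMS = {
    "yeast": ["saccharomyces cerevisiae"],
}

def expand_query(terms):
    """Turn each query term into a group of alternatives to be OR-ed together."""
    groups = []
    for term in terms:
        base = stem(term)
        groups.append([base] + SYNONYMS.get(base, []))
    return groups

print(expand_query(["yeasts", "cycles"]))
```

Each inner list is meant to be combined with OR, and the groups with AND, so a query for "yeasts" would also retrieve documents that only mention the synonym.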

Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1

and this modification served as a priming step to promote subsequent

Cdc5-dependent Swe1 hyperphosphorylation and degradation

no tool will find that

named entity recognition

computer

as smart as a dog

teach it specific tricks

identify the concepts

Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1

and this modification served as a priming step to promote subsequent

Cdc5-dependent Swe1 hyperphosphorylation and degradation

comprehensive lexicon

proteins

chemicals

compartments

tissues

diseases

organisms

CDC2

cyclin dependent kinase 1

orthographic variation

upper- and lower-case

CDC2

Cdc2

spaces and hyphens

cyclin dependent kinase 1

cyclin-dependent kinase 1

prefixes and postfixes

CDC2

hCDC2

“black list”

SDS
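The matching rules above (case, spaces and hyphens, an "h" prefix, and a black list) can be sketched as a small dictionary tagger. The three-entry lexicon, the identifiers, and the function names are assumptions for illustration; a scalable implementation would not loop over one regular expression per name as this sketch does.

```python
import re

# Hypothetical mini-lexicon: name -> identifier (illustrative only).
LEXICON = {
    "CDC2": "CDK1",
    "cyclin dependent kinase 1": "CDK1",
    "SDS": "SDS",
}
# Names that are in the lexicon but are blocked from matching.
BLACKLIST = {"sds"}

def name_pattern(name):
    """Regex for a name that tolerates case, space/hyphen and 'h' prefix variation."""
    parts = [re.escape(p) for p in re.split(r"[\s\-]+", name)]
    body = r"[\s\-]?".join(parts)
    return re.compile(r"\b(?:h)?" + body + r"\b", re.IGNORECASE)

def tag(text):
    """Return (start offset, matched text, identifier) for every lexicon hit."""
    hits = []
    for name, ident in LEXICON.items():
        if name.lower() in BLACKLIST:
            continue
        for m in name_pattern(name).finditer(text):
            hits.append((m.start(), m.group(0), ident))
    return sorted(hits)

text = "hCDC2 (cyclin-dependent kinase 1) was run on an SDS gel; Cdc2 is conserved."
hits = tag(text)
for hit in hits:
    print(hit)
```

All three variants resolve to the same identifier, while the black-listed name is skipped even though it appears in the lexicon.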

scalable implementation

text corpora

>10 km

<10 hours

most use Medline

~22 million abstracts

few use full-text articles

no access

PDF files

layout-aware extraction

millions of full-text articles

information extraction

formalize the facts

Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1

and this modification served as a priming step to promote subsequent

Cdc5-dependent Swe1 hyperphosphorylation and degradation

two approaches

co-mentioning

counting

within documents

within paragraphs

within sentences

co-mentioning score
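One way to turn the counts into a score is a weighted sum in which closer co-mentions count more. The weights, the naive sentence splitter, and the substring-based entity matching below are all assumptions made for this sketch, not the scoring scheme used in the talk.

```python
import re
from collections import Counter
from itertools import combinations

# Assumed weights (illustrative): the closer the co-mention, the higher the weight.
WEIGHTS = {"document": 1.0, "paragraph": 2.0, "sentence": 4.0}

def mentioned(text, entities):
    """Entities found in the text (naive case-insensitive substring match)."""
    low = text.lower()
    return sorted(e for e in entities if e.lower() in low)

def comention_scores(documents, entities):
    """Weighted count of entity pairs co-mentioned at each level of granularity."""
    scores = Counter()
    for doc in documents:
        paragraphs = [p for p in doc.split("\n\n") if p.strip()]
        sentences = [s for p in paragraphs
                     for s in re.split(r"(?<=[.!?])\s+", p) if s.strip()]
        levels = (("document", [doc]),
                  ("paragraph", paragraphs),
                  ("sentence", sentences))
        for level, units in levels:
            for unit in units:
                for pair in combinations(mentioned(unit, entities), 2):
                    scores[pair] += WEIGHTS[level]
    return scores

doc = ("Cdc28 phosphorylates Swe1. Cdc5 is also required.\n\n"
       "Clb2 levels peak in mitosis.")
scores = comention_scores([doc], ["Cdc28", "Swe1", "Cdc5", "Clb2"])
print(scores.most_common())
```

In this toy document the pair sharing a sentence scores highest, pairs sharing only a paragraph score lower, and pairs sharing only the document score lowest.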

NLP

Natural Language Processing

grammatical analysis

part-of-speech tagging

multiword detection

semantic tagging

sentence parsing

Gene and protein names

Cue words for entity recognition

Verbs for relation extraction

[nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]

extract stated facts

high precision

poor recall
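The precision/recall trade-off is easy to see with a single hand-written pattern. The sketch below uses one regular expression as a simplified stand-in for the grammar-based parsing described above; the gene set is assumed to come from an earlier entity-recognition step, and all names are hypothetical.

```python
import re

# Assumed output of an entity tagger for this example.
GENES = {"CYC1", "CYC7", "HAP1"}

# One passive-voice pattern: "<targets> is controlled by <regulator>".
PASSIVE = re.compile(r"(?P<targets>.+?)\s+is controlled by\s+(?P<regulator>\w+)")

def extract_regulation(sentence):
    """Return (regulator, 'controls', target) triples stated by the pattern."""
    m = PASSIVE.search(sentence)
    if not m or m.group("regulator") not in GENES:
        return []
    targets = [t for t in re.findall(r"\w+", m.group("targets")) if t in GENES]
    return [(m.group("regulator"), "controls", t) for t in targets]

stated = extract_regulation(
    "The expression of the cytochrome genes CYC1 and CYC7 is controlled by HAP1"
)
missed = extract_regulation("HAP1 regulates CYC1 expression")
print(stated)
print(missed)
```

The pattern extracts both facts from the sentence it was written for, but returns nothing for a paraphrase of the same fact: what it finds is reliable, and much is missed.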

Exercise

Go to http://diseases.jensenlab.org

Find TYMS disease associations

Inspect the text-mining evidence

Look for examples of synonym usage

Find genes linked to colorectal cancer

thank you!