Text mining exercise

Lars Juhl Jensen

~5 m

the task

named entity recognition

human proteins

link proteins to diseases

what I have done

information retrieval

two diseases

prostate cancer

schizophrenia

two sets of documents

62,755 abstracts

65,588 abstracts

one directory with each set

one file with each abstract

dictionary

tab-delimited file

human proteins

22,523 entities

synonyms

from many databases

orthographic variation

prefixes and postfixes

automatically generated

2,726,495 names

tagdir program

flexible matching

upper- and lower-case

spaces and hyphens

tab-delimited output
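
The flexible matching described above can be illustrated with a minimal Python sketch. This is not the tagdir program itself. The column layout of the dictionary (entity identifier, tab, name), the function names, and the entity identifier `E1` are assumptions made for the example. Names are compared after lower-casing and removing spaces and hyphens, so that variants such as "FOLH-1" and "folh1" are treated as the same name.

```python
import re


def normalize(name):
    """Lower-case a name and drop spaces and hyphens so variants compare equal."""
    return re.sub(r"[\s-]+", "", name.lower())


def load_dictionary(lines):
    """Build {normalized name: set of entity ids}.

    Assumes each line is 'entity<TAB>name'; the real dictionary file
    may use a different column layout.
    """
    index = {}
    for line in lines:
        entity, name = line.rstrip("\n").split("\t")[:2]
        index.setdefault(normalize(name), set()).add(entity)
    return index


def tag(text, index, max_words=4):
    """Return (matched text, entity) pairs for runs of up to max_words tokens."""
    tokens = list(re.finditer(r"[A-Za-z0-9]+", text))
    hits = []
    for i in range(len(tokens)):
        for j in range(i, min(i + max_words, len(tokens))):
            # Slice the original text so the reported match keeps its spelling.
            span = text[tokens[i].start():tokens[j].end()]
            for entity in sorted(index.get(normalize(span), ())):
                hits.append((span, entity))
    return hits


# Hypothetical two-line dictionary; "E1" is a made-up identifier.
index = load_dictionary(["E1\tFOLH1", "E1\tglutamate carboxypeptidase II"])
hits = tag("Glutamate-carboxypeptidase ii and FOLH-1 levels", index)
for matched, entity in hits:
    print(matched + "\t" + entity)
```

Because only spaces and hyphens are removed, a candidate that spans other punctuation (for example a comma) does not match any dictionary name in this sketch.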

what you will do

named entity recognition

find unfortunate names

create “black list”

information extraction

co-mentioning

within documents

link proteins to diseases

link between the diseases
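
Co-mentioning within documents can be sketched as follows, assuming the tagger output has already been reduced to (document id, protein entity) pairs for each disease set. The function names, the input format, and the toy identifiers are hypothetical. A protein is linked to a disease by the number of abstracts in that set that mention it, and proteins found in both sets are candidates for a link between the diseases.

```python
from collections import defaultdict


def docs_per_protein(matches):
    """Count the distinct documents that mention each protein.

    matches: iterable of (doc_id, entity) pairs (an assumed format).
    """
    docs = defaultdict(set)
    for doc_id, entity in matches:
        docs[entity].add(doc_id)
    return {entity: len(ids) for entity, ids in docs.items()}


def shared_proteins(counts_a, counts_b, min_docs=1):
    """Proteins mentioned in at least min_docs documents of both sets.

    Sorted by the smaller of the two counts, largest first.
    """
    shared = [
        entity
        for entity in counts_a
        if counts_a[entity] >= min_docs and counts_b.get(entity, 0) >= min_docs
    ]
    return sorted(shared, key=lambda e: min(counts_a[e], counts_b[e]), reverse=True)


# Made-up toy data standing in for the two sets of abstracts.
prostate = docs_per_protein([("p1", "E1"), ("p1", "E1"), ("p2", "E1"), ("p2", "E2")])
schizophrenia = docs_per_protein([("s1", "E1"), ("s2", "E3")])
print(shared_proteins(prostate, schizophrenia))
```

Raw document counts favor proteins that are mentioned often in general, so a real analysis would likely need some form of normalization; the sketch leaves that out.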

a helping hand

“black list”

100+ matches

10+ matches
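
One way to find candidates for the “black list” is to count how often each matched name occurs and review the most frequent ones by hand. The sketch below assumes matches are available as (document id, matched text, entity) tuples; that format, the function names, and the toy data are hypothetical. The thresholds of 100 and 10 from the slides would be passed as `threshold`.

```python
from collections import Counter


def count_names(matches):
    """Count occurrences of each matched name, ignoring case."""
    return Counter(text.lower() for _, text, _ in matches)


def frequent_names(counts, threshold):
    """Names with at least `threshold` matches, most frequent first."""
    return [(name, n) for name, n in counts.most_common() if n >= threshold]


def apply_blacklist(matches, blacklist):
    """Drop every match whose text is on the black list (case-insensitive)."""
    blocked = {name.lower() for name in blacklist}
    return [m for m in matches if m[1].lower() not in blocked]


# Made-up toy matches; "E9" and "E1" are placeholder entity ids.
matches = [("d1", "ANOVA", "E9"), ("d2", "anova", "E9"), ("d2", "FOLH1", "E1")]
counts = count_names(matches)
print(frequent_names(counts, 2))
print(apply_blacklist(matches, ["ANOVA"]))
```

Frequency only flags names for inspection; whether a name is unfortunate still has to be judged by reading some of the matches in context.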

wrap up

prostate cancer

FOLH1

schizophrenia

Glutamate carboxypeptidase II

same protein

synonyms matter

“black list” is crucial

text mining is quite simple

diseases.jensenlab.org

Date post:	16-Jul-2015
Upload:	lars-juhl-jensen