Lars Juhl Jensen
Text mining exercise
~5 m
named entity recognition
link proteins to diseases
information retrieval
two sets of documents
one directory per set
one file per abstract
tab-delimited file
from many databases
orthographic variation
prefixes and suffixes
automatically generated
flexible matching
upper- and lower-case
spaces and hyphens
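The flexible matching described above (ignoring case, and treating spaces and hyphens as optional) could be sketched roughly as follows. This is a minimal illustration and not the tagger used in the exercise: the function name `flexible_pattern` is hypothetical, and the sketch assumes names consist of ASCII letters and digits only.

```python
import re

def flexible_pattern(name):
    """Compile a regex that matches `name` regardless of case and of
    whether its parts are joined by a space, a hyphen, or nothing.
    Assumption: only ASCII letters and digits carry meaning in a name."""
    # Split the name into runs of letters and runs of digits.
    tokens = re.findall(r"[A-Za-z]+|\d+", name)
    if not tokens:
        raise ValueError("name contains no letters or digits")
    # Allow at most one space or hyphen between consecutive tokens.
    body = r"[\s\-]?".join(re.escape(t) for t in tokens)
    # Require that the match is not embedded in a longer word or number.
    return re.compile(
        r"(?<![A-Za-z0-9])" + body + r"(?![A-Za-z0-9])",
        re.IGNORECASE,
    )
```

For example, `flexible_pattern("Glutamate carboxypeptidase II")` would also match the variant spelling `glutamate-carboxypeptidase ii`.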
tab-delimited output
named entity recognition
find unfortunate names
create “black list”
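One possible way to find unfortunate names and apply a black list is sketched below. It assumes the tagger output has already been loaded as a list of `(matched_text, identifier)` tuples. That representation and both function names are hypothetical; the exercise's actual tab-delimited columns are not specified here.

```python
from collections import Counter

def most_frequent_names(matches, top=10):
    """Return the most frequently matched names (lower-cased) so they
    can be inspected by hand for unfortunate, ambiguous names."""
    counts = Counter(text.lower() for text, _ in matches)
    return counts.most_common(top)

def apply_black_list(matches, black_list):
    """Drop every match whose text is on the black list (case-insensitive)."""
    blocked = {name.lower() for name in black_list}
    return [(text, ident) for text, ident in matches
            if text.lower() not in blocked]
```

The idea is to inspect the most frequent matches manually, add the ones that are clearly not proteins or diseases to the black list, and rerun the filtering.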
information extraction
link proteins to diseases
link between the diseases
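A simple way to link proteins to diseases is to count how often they are mentioned in the same abstract. The sketch below assumes co-occurrence counting, which is a common baseline and not necessarily the method intended in the exercise. The input format (one pair of identifier sets per abstract) and the function name are hypothetical.

```python
from collections import Counter
from itertools import product

def count_cooccurrences(doc_entities):
    """Count protein-disease pairs mentioned in the same abstract.

    doc_entities: iterable of (protein_ids, disease_ids), one per abstract.
    Each pair is counted at most once per abstract.
    """
    pair_counts = Counter()
    for proteins, diseases in doc_entities:
        pair_counts.update(product(set(proteins), set(diseases)))
    return pair_counts
```

Running this on each of the two document sets and looking for proteins that rank highly with both diseases would point to a candidate link between them.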
Glutamate carboxypeptidase II
“black list” is crucial
text mining is quite simple
diseases.jensenlab.org