Information Retrieval, Information Extraction, and Text Mining
Applications for Biology
Slides by Suleyman Cetintas & Luo Si
1
Outline
Introduction
Overview of Literature Data Sources
PubMed, HighWire Press, Google Scholar, Other Sources
Structure of Biomedical Language
Biological Terminology
Lexical and Semantic Sources for Biology
Biomedical Literature Processing Applications
Beyond BioCreative: Advanced Applications
Summary
References
2
Introduction
Life-science research
Large and heterogeneous biological data
in the form of protein and genomic sequence data, expression profiles,
protein structures
Yet, a significant amount of information is in natural language
Most discoveries communicated by natural language
via publications, patents, reports, and e-texts on the www
controlled vocabulary terms used for other biological sources:
gene product annotations (e.g., Gene Ontology [GO] terms)
Database records (e.g., UniProt), containing comments, keywords,
descriptions etc.
3
Introduction
Structured database entries
enable efficient data retrieval, exchange, and analysis
recent tendency to enrich annotation records
general annotation databases such as UniProt (with 134K citations as of
2008) are of great practical value
Yet, only capable of covering a small fraction of biological
context information
can’t capture the richness of scientific information and argumentation in
the literature
Hard to cope with the rapid accumulation of new publications
Text mining can help to link the database entries to the
evidence and argumentation in the literature
4
Introduction
Online literature collections
e.g., PubMed
70 million queries every month, >20 million publications (as of 2010)
crucial importance to experimental biologists, biomedical researchers,
database curators, etc.
Face double-exponential growth rates (due to new
journals & increasing number of journal articles)
Different needs
Scientific community needs efficient and effective information
retrieval for targeted literature searches
Pharmaceutical industry uses text-mining systems for their
competitive intelligence
Government institutions use such tools to have a global view of
the current research state
5
Overview of Literature Data Sources
Several efforts to make medical and life-science journal
information electronically accessible to the public through
the worldwide web
Efforts can be grouped under 3 categories:
I) Centralized institutional (PubMed) or academic (HighWire
Press & Hollis) repositories of peer-reviewed articles or
abstracts
II) Article collection repositories by publishers (e.g.,
BioMedCentral, EMBASE)
III) Access to indexed scholar articles (e.g., Google Scholar,
Scirus) via web-crawlers
6
PubMed
The most important resource for text mining applications
Includes citations (i.e., title, abstract, authors, and source
information) supplied by participating publishers
Developed by the National Center for Biotechnology Information (NCBI)
at the National Library of Medicine (NLM)
Basic Search:
can be accessed online via Entrez, a text-based search and
retrieval system
Entrez improves the basic keyword searches by translating the user
query to Medical Subject Heading (MeSH) terms
MeSH: controlled vocabulary terms of medical domain, chemicals,
genes, proteins, etc.
8
PubMed
Growth of PubMed citations between 1986-2010
9
PubMed
Technology development timeline for PubMed (in light green
color) and other biomedical literature search tools (in light
orange color)
10
PubMed
Programmatic Access:
PubMed also offers programmatic access to its content
through:
Entrez Programming Utilities
Open Source Projects
BioPerl, Biopython, BioJava, etc. for biologist programmers
The NCBI provides the My NCBI service, to periodically
retrieve new publications in PubMed matching a predefined
user query
The requester receives a corresponding notification via an e-mail alert
system
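A minimal sketch of programmatic access through the Entrez Programming Utilities, using only the Python standard library. The ESearch endpoint and the db, term, retmax and retmode parameters are part of the public E-utilities interface; the helper names, the query, and the sample response (with made-up IDs) are assumptions for illustration, so the example runs without a network connection.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Public base URL of the NCBI Entrez Programming Utilities (E-utilities).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(term, retmax=20):
    """Build an ESearch URL asking PubMed for the IDs of records matching `term`."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "xml"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def parse_pmids(esearch_xml):
    """Extract the PubMed IDs listed in an ESearch XML response."""
    root = ET.fromstring(esearch_xml)
    return [id_node.text for id_node in root.findall("./IdList/Id")]


# Hand-written sample response with hypothetical IDs (not real PubMed records).
SAMPLE_RESPONSE = """<eSearchResult>
  <Count>2</Count>
  <IdList><Id>11111111</Id><Id>22222222</Id></IdList>
</eSearchResult>"""

url = build_esearch_url("VRK1 AND c-Jun", retmax=5)
pmids = parse_pmids(SAMPLE_RESPONSE)
```

In real use the URL would be fetched (for example with urllib.request.urlopen) and the response passed to parse_pmids; NCBI's usage guidelines on request rates apply.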
11
PubMed
For a Local PubMed
it is possible to have a local relational database of all PubMed
citations
Obtain a licensed copy of the whole PubMed containing XML-
formatted citation records from NLM/NCBI
Mobile Access
Txt2MEDLINE: use SMS to access PubMed
PubMed Informer: Web-based PubMed monitoring tool,
facilitates PDA downloads and RSS feeds
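A minimal sketch of the local-database idea: XML-formatted citation records are parsed and loaded into a relational table. The element names (PubmedArticle, MedlineCitation, PMID, ArticleTitle, AbstractText) follow the PubMed XML layout; the sample record, the table schema, and the function name are hypothetical, and SQLite stands in for whatever database a real installation would use.

```python
import sqlite3
import xml.etree.ElementTree as ET

# One hand-written record in PubMed-style XML (hypothetical content and PMID).
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>11111111</PMID>
      <Article>
        <ArticleTitle>A hypothetical study of VRK1</ArticleTitle>
        <Abstract><AbstractText>c-Jun is activated by VRK1.</AbstractText></Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def load_citations(xml_text, conn):
    """Insert every citation found in `xml_text` into the `citation` table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS citation "
        "(pmid TEXT PRIMARY KEY, title TEXT, abstract TEXT)"
    )
    root = ET.fromstring(xml_text)
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID")
        title = article.findtext("MedlineCitation/Article/ArticleTitle", default="")
        abstract = article.findtext(
            "MedlineCitation/Article/Abstract/AbstractText", default=""
        )
        conn.execute(
            "INSERT OR REPLACE INTO citation VALUES (?, ?, ?)",
            (pmid, title, abstract),
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
load_citations(SAMPLE_XML, conn)
rows = conn.execute(
    "SELECT pmid, title FROM citation WHERE abstract LIKE ?", ("%VRK1%",)
).fetchall()
```

Real records carry many more fields (authors, journal, MeSH headings) and some abstracts have several AbstractText sections, which this sketch ignores.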
12
Google Scholar
alternative to PubMed
not only peer reviewed articles, but also other scholarly texts
such as theses, books, preprint repositories
often returns larger retrieval sets (yet with a substantial number
of link-outs to PubMed records)
does not offer the advanced search functions that PubMed
offers
13
HighWire Press
alternative to PubMed
an initiative of Stanford University
represents another complementary resource to PubMed
Access to peer-reviewed articles, providing search interface to
over 1160 journals, 4.8 million full-text articles (with over 1.9
million articles available free by HighWire partner publishers)
shares many search characteristics with PubMed (though the two
also differ)
HighWire, further:
has a graphical representation of articles’ citation map
allows the user to specify where to conduct the search (title, abstract, etc.)
14
Other resources
PubMed Central
Free access to full-text articles (not
only to abstracts)
contains articles published before 1966
publishers have also developed
platforms of searchable article
repositories such as EMBASE and
BioMed Central to improve the access
to their articles
15
Structure of Biomedical Language
Homologous protein sequences often share a common
structural fold and tend to exhibit a similar function
Analogously, in natural language a particular meaning may be expressed
using different but largely synonymous expressions
Natural language processing (NLP) is used to ‘decode’
human language
exploiting the regularities and constraints that occur at
multiple levels in human language
These 4 levels: words, syntax, semantics, pragmatics
16
Structure of Biomed. Lang.: Words
Tokenization and morphology: identification of words in
biology text
in English, word boundaries are marked by whitespace and sentence
boundaries by ‘.’ (period or full stop)
but there are many complications as well (e.g., periods inside
abbreviations and decimal numbers)
the JULIE (Jena University Language and Information
Engineering) laboratory provides tools for token and sentence
boundary detection
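A minimal sketch of why whitespace and periods are not enough. The regular expression and both function names are assumptions made for illustration (this is not the JULIE tooling): the tokenizer keeps hyphenated names such as c-Jun and decimals such as 2.5 together, while the naive sentence splitter breaks on every period followed by whitespace.

```python
import re

# A token is a run of letters/digits, optionally continued by "-" or "."
# plus more letters/digits (c-Jun, IL-2, 2.5); any other non-space
# character becomes a token of its own.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:[-.][A-Za-z0-9]+)*|[^\sA-Za-z0-9]")


def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)


def naive_sentences(text):
    """Split after every period that is followed by whitespace."""
    return [s for s in re.split(r"(?<=\.)\s+", text) if s]


text = "c-Jun is activated by VRK1 in approx. 2.5 h. IL-2 was unaffected."
tokens = tokenize("c-Jun is activated by VRK1.")
sentences = naive_sentences(text)
```

The text holds two sentences, yet the naive splitter returns three pieces because it also breaks after the abbreviation "approx.". Dedicated boundary detectors are trained to avoid exactly this kind of error.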
17
Structure of Biomed. Lang.: Words
Tokenization and morphology: identification of words in
biology text
very important stage
gene mention identification (BioCreative II – gene mention
task)
some teams explored the integration of publicly available gene
mention taggers, e.g. the ABNER application or the LingPipe system
linking these mentions to specific entries in biological
resources (gene normalization)
stemming – convert words to their roots, reduce variability
general stemmer – the Porter stemmer
specific biomedical stemmers
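A toy suffix-stripping stemmer, to show how stemming reduces variability. It is NOT the Porter algorithm: the suffix list, the minimum stem length, and the function name are assumptions chosen for this example only.

```python
# Longest suffixes first, so "activation" loses "ation" rather than nothing.
SUFFIXES = ("ation", "ating", "ated", "ates", "ing", "ed", "es", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters of stem."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


variants = ["activated", "activates", "activation", "activating"]
stems = {toy_stem(w) for w in variants}
```

All four variants collapse to a single stem, so a search for one form can retrieve documents containing the others. A real stemmer applies ordered rewrite rules with conditions on the remaining stem instead of a flat suffix list.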
18
Structure of Biomed. Lang.: Syntax
Syntax: syntax or grammar of a language controls how
words are grouped into meaningful phrases
words can be associated with parts of speech (POS) tags
POS taggers are based on machine learning algorithms (e.g.,
hidden Markov models) trained on a manually annotated corpus
the biomedical POS distribution differs slightly from that of general English
special taggers for biomedical domain: MedPost tagger, dTagger
POS tagging can be useful to
detect textual patterns expressing protein interaction
locate gene and protein mentions
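A minimal sketch of using POS tags to detect a textual pattern that expresses a protein interaction. The input is assumed to be already tagged (here by hand, with Penn Treebank-style tags); the verb list and the noun-verb-noun pattern are hypothetical simplifications, not the method of any particular system.

```python
# Hypothetical list of verbs that signal an interaction.
INTERACTION_VERBS = {"binds", "activates", "phosphorylates", "inhibits"}


def find_interactions(tagged):
    """Return (agent, verb, target) for each noun / interaction verb / noun run."""
    hits = []
    for (w1, t1), (w2, t2), (w3, t3) in zip(tagged, tagged[1:], tagged[2:]):
        is_pattern = (
            t1.startswith("NN") and t2.startswith("VB") and t3.startswith("NN")
        )
        if is_pattern and w2.lower() in INTERACTION_VERBS:
            hits.append((w1, w2, w3))
    return hits


# Hand-tagged example sentence: (token, POS tag) pairs.
tagged_sentence = [
    ("VRK1", "NNP"), ("phosphorylates", "VBZ"), ("c-Jun", "NNP"),
    ("in", "IN"), ("vitro", "NN"), (".", "."),
]
interactions = find_interactions(tagged_sentence)
```

Only adjacent noun-verb-noun runs are matched, so passive constructions and multi-word names would need additional patterns.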
19
Structure of Biomed. Lang.: Semantics &
Pragmatics
Semantics: capture the meaning
e.g., ‘c-Jun is activated by VRK1’ can be represented as an
operator ‘activate(VRK1,c-Jun)’
semantic representation abstracts away the syntax
Pragmatics: capture the larger context and its
contribution to meaning
text mining systems often rely on sentences as basic processing
unit for extracting associations between biological entities
descriptions of those relations often go beyond sentence
boundaries and make use of referring expressions
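A minimal sketch of how a semantic representation abstracts away the syntax: a passive and an active sentence are mapped to the same operator form, activate(VRK1,c-Jun). The two regular expressions and the small participle table are assumptions for illustration and cover only these simple sentence shapes.

```python
import re

# Hypothetical table mapping past participles to predicate names.
PARTICIPLE_TO_PREDICATE = {
    "activated": "activate",
    "inhibited": "inhibit",
    "phosphorylated": "phosphorylate",
}
PREDICATES = set(PARTICIPLE_TO_PREDICATE.values())

PASSIVE_RE = re.compile(r"^(\S+) is (\w+) by (\S+?)\.?$")  # "X is activated by Y"
ACTIVE_RE = re.compile(r"^(\S+) (\w+)s (\S+?)\.?$")        # "Y activates X"


def to_predicate(sentence):
    """Return 'predicate(agent,target)' for a supported sentence, else None."""
    match = PASSIVE_RE.match(sentence)
    if match and match.group(2) in PARTICIPLE_TO_PREDICATE:
        target, participle, agent = match.groups()
        return f"{PARTICIPLE_TO_PREDICATE[participle]}({agent},{target})"
    match = ACTIVE_RE.match(sentence)
    if match and match.group(2) in PREDICATES:
        agent, predicate, target = match.groups()
        return f"{predicate}({agent},{target})"
    return None


passive = to_predicate("c-Jun is activated by VRK1")
active = to_predicate("VRK1 activates c-Jun.")
```

Both calls yield the same string, which is the point of the representation: the roles of VRK1 and c-Jun are fixed regardless of word order.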
20
Structure of Biomedical Language
Main NLP levels, from word tokenization to semantics
21
Biological Terminology
Biological literature characterized by
heavy use of domain-specific terminology
~12% of all terms in biochemistry publications are technical terms
a need for recognizing medical terms & their variations
automatically
2 main challenges
constant formation of new terms and new short forms
ambiguity or polysemy (multiple meanings of the same word)
22
Biological Terminology
ambiguity or polysemy
text mining tools must select the correct sense of the word,
using the surrounding context (for disambiguation)
gene names are a problem, as they are often shared across species
general English => 0.57% ambiguity
medical terms => 1.01% ambiguity
gene names => 14.20% ambiguity
biomedical & life science literature heavily depends on short
forms => further ambiguity
online tools for acronym-full name pairs:
ADAM, the Abbreviation Server, and AcroMine
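A minimal sketch of detecting acronym-full name pairs of the form "long form (SF)". The heuristic (one preceding word per letter of the short form, matched by initial letters) is a deliberate simplification and is not the method used by ADAM, the Abbreviation Server, or AcroMine; the regular expression and function name are hypothetical.

```python
import re

# A parenthesised candidate short form: starts with a capital, 2-10 characters.
ABBREV_RE = re.compile(r"\(([A-Z][A-Za-z0-9]{1,9})\)")


def find_abbreviations(text):
    """Map each short form to the preceding words whose initials spell it."""
    pairs = {}
    for match in ABBREV_RE.finditer(text):
        short = match.group(1)
        words = text[: match.start()].split()
        candidate = words[-len(short):]
        initials = "".join(w[0] for w in candidate)
        if len(candidate) == len(short) and initials.lower() == short.lower():
            pairs[short] = " ".join(candidate)
    return pairs


sentence = "Gene Ontology (GO) terms are mined with natural language processing (NLP)."
pairs = find_abbreviations(sentence)
```

The heuristic finds GO and NLP here but misses short forms that take several letters from one word, such as "Medical Subject Heading (MeSH)", which is one reason dedicated tools use more flexible alignment.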
23
Lexical and Semantic Sources for Biology
domain-specific technical terms
used for expressing functional descriptions of bio-entities,
relevant biological processes, experimental techniques
terminological repositories & dictionaries
important resources to interpret scientific articles
many have been developed
ontologies
developed for various subfields of biology
Gene Ontology (GO)
widely used as controlled vocabulary to describe biologically relevant
aspects of gene products
24
Lexical and Semantic Sources for Biology
Ontologies
Gene Ontology (GO)
Although primarily designed for annotation purposes, can also be used
as a lexical resource for indexing via the GoPubMed application
GOAnnotator
allows extraction of text-based GO annotations for a given protein
identifier (Swiss-Prot accession number)
GO Annotation Task in BioCreative I
showed that automatic detection of GO terms is more efficient in
the case of short terms
25
Lexical and Semantic Sources for Biology
Word Level
SwissProt
biological annotation database
BioThesaurus
widely used resource combining gene and protein names from
multiple sources
TerMine
developed at the National Center for Text Mining (NaCTeM)
integrates automatic term recognition approach using linguistic and
statistical analysis of candidate terms
26
Biomedical Literature Processing Applications
Provide access to information in scientific articles at
various levels of granularity
Building blocks for biomedical text processing can be
grouped with respect to the BioCreative tasks:
Document retrieval: core of the ‘interaction article’ subtask, to
select articles about protein-protein interactions
Entity mention: identification of mentions of biological entities
Entity normalization: linking biological entities (e.g., genes,
proteins, etc.) to biological resources (e.g., SwissProt, Entrez
Gene, etc.)
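A minimal sketch of entity normalization as a dictionary lookup: surface variants of a name are reduced to a common key and mapped to a database identifier. The lexicon and the DEMO identifiers are made up for illustration and are not real SwissProt or Entrez Gene entries.

```python
import re

# Hypothetical lexicon: names mapped to made-up identifiers.
LEXICON = {"c-Jun": "DEMO:0001", "JUN": "DEMO:0001", "VRK1": "DEMO:0002"}


def normalize_key(name):
    """Lower-case the name and drop whitespace and hyphens."""
    return re.sub(r"[\s\-]+", "", name.lower())


# Index built once from the lexicon: normalized name -> identifier.
INDEX = {normalize_key(name): identifier for name, identifier in LEXICON.items()}


def normalize_mention(mention):
    """Return the identifier for a mention, or None if it is unknown."""
    return INDEX.get(normalize_key(mention))


ids = [normalize_mention(m) for m in ["C-JUN", "c Jun", "vrk-1", "p53"]]
```

Real normalization also has to resolve names shared across species or between genes, which a plain lookup cannot do without context.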
27
BioMed. Lit. Proc. Apps: Document Retrieval
Requires the ability to process and index massive volumes
of data (e.g., the entire MEDLINE collection)
robust, efficient wrt space and time
Look for keywords that characterize a collection of
papers, based on keyword frequency
basis of neighbor searches in MEDLINE (the predecessor of
eTBLAST)
still the most heavily used system
Statistical analysis of word occurrences
many current literature mining systems rely on it
calculated over the whole PubMed database, resulting in
weighted associations between biological entities
28
BioMed. Lit. Proc. Apps: Document Retrieval
Statistical analysis of word occurrences
underlying assumption is that if two biological entities
frequently co-occur, they should have some biological
relationship
can provide high recall
challenge in human interpretation
lacks semantic information on the type of biological association
CoPub Mapper system
provides online access to ranked co-occurrence associations extracted
from PubMed (between genes and biological terms)
PubGene system
Generates a graphical protein interaction network based on protein-
protein literature co-occurrences
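A minimal sketch of co-occurrence counting over a set of abstracts. The three abstracts and the entity list are toy data, and case-insensitive substring matching is a simplifying assumption; it is not how CoPub Mapper or PubGene identify entities.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(abstracts, entities):
    """Count, for each entity pair, the abstracts that mention both entities."""
    counts = Counter()
    for text in abstracts:
        lowered = text.lower()
        present = sorted(e for e in entities if e.lower() in lowered)
        counts.update(combinations(present, 2))
    return counts


# Toy abstracts written for this example.
abstracts = [
    "VRK1 phosphorylates c-Jun.",
    "c-Jun is activated by VRK1 in response to stress.",
    "p53 stabilisation requires VRK1.",
]
counts = cooccurrence_counts(abstracts, ["VRK1", "c-Jun", "p53"])
top_pair = counts.most_common(1)[0]
```

The counts say that VRK1 and c-Jun are associated, but not how (activation, inhibition, binding), which is the missing semantic information noted above.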
29
BioMed. Lit. Proc. Apps: Document Retrieval
Stemming
converts words into standardized forms (stems)
essential component of IR systems and search engines
one common shortcoming
two semantically different words can be collapsed to a common stem
used by systems such as eTBLAST to quantify the similarity between
documents
CoPub System
detects over-represented terms from multiple abstract
collections
eTBLAST
ranks retrieved PubMed records given an input article
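A minimal sketch of ranking documents by similarity to an input text, using cosine similarity over raw term counts. This illustrates the general idea only; the actual weighting and scoring used by eTBLAST are not reproduced here, and the documents are toy data.

```python
import math
from collections import Counter


def term_vector(text):
    """Represent a text as a bag of lower-cased, whitespace-separated terms."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank(query, documents):
    """Return (score, document) pairs, most similar first."""
    q = term_vector(query)
    scored = [(cosine(q, term_vector(d)), d) for d in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


documents = [
    "gene ontology annotation of proteins",
    "c-jun is activated by vrk1",
]
ranked = rank("vrk1 activates c-jun", documents)
```

Note that "activates" and "activated" do not match here; applying a stemmer before building the vectors would let them contribute to the score.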
30
BioMed. Lit. Proc. Apps: Document Retrieval
Clustering Algorithms
used to group genes according to their expression profiles in
microarray experiments
clustering based on document similarity calculation has been used
by PubClust, McSyBi
[Figure: list of systems for clustering and similarity ranking]
31
Biomedical Literature Processing Applications:
Gene Mention & Gene Normalization
Biologists search the annotation databases using
gene/protein names or symbols as queries
these names have been manually extracted from the literature
too time-consuming
unable to cover all synonyms or naming variants used by the
biologists
Automatic detection of protein & gene mentions
improves the coverage of annotation databases
enables semantically refined literature search
constitutes a crucial initial step for other text mining systems
focus of BioCreative gene mention task
performance of 90% F-measure
training data of 15K sentences & 5K test sentences
32
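A minimal sketch of how an F-measure is computed for a mention tagger, as the harmonic mean of precision and recall. Mentions are represented as (start, end) character spans and only exact matches count, which is a simplification; the spans below are toy values, not BioCreative data.

```python
def precision_recall_f1(gold, predicted):
    """Compute precision, recall and F-measure over sets of mention spans."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy spans: two of the three predictions match the gold annotation exactly.
gold_mentions = {(0, 4), (22, 27), (40, 44)}
predicted_mentions = {(0, 4), (22, 27), (50, 53)}
precision, recall, f1 = precision_recall_f1(gold_mentions, predicted_mentions)
```

Because the harmonic mean is dominated by the lower of the two values, a tagger cannot reach a high F-measure by maximizing recall alone or precision alone.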
Biomedical Literature Processing Applications:
Gene Mention & Gene Normalization
Most current bio-entity recognition systems
e.g., GAPSCORE, ABGENE
Can label text for protein or gene mentions
other systems such as ABNER
also identify cell lines or cell types
Chemical compound mentions
Another set of biological entities of interest
Oscar, open source system for chemical entity recognition
integrates dictionary of compound names
as well as using regular expressions, heuristics, and certain word
combinations to find chemical names in text
33
Biomedical Literature Processing Applications:
Gene Mention & Gene Normalization
Mentions of species and taxonomic names
important for the emerging field of biodiversity
crucial step to link gene mentions to corresponding organism
source
Detecting bio-entity mentions alone is often not enough
to retrieve informative sentences
BioIE system detects (for a given query keyword) only
sentences related to protein families, functions, etc.
Other applications such as iHOP
given a gene or protein, maps it to its corresponding db identifier, and
retrieves related sentences with definition info, etc.
34
Biomedical Literature Processing Applications:
Gene Mention & Gene Normalization
Detecting bio-entity mentions alone is often not enough
to retrieve informative sentences
EBIMed and FACTA systems
for a given query protein, present a summary table of co-occurring
concepts based on PubMed abstracts
FABLE
retrieves co-occurring gene and protein mentions for a query keyword
results can be downloaded in XML or Excel format
For searching functional information for gene products
search with protein sequences is possible through METIS and
MedBlast systems
query sequence is linked to corresponding db record, and the
associated literature is retrieved afterwards
35
Beyond BioCreative: Advanced Applications
iHOP and InfoPubMed
allow retrieval of protein interaction sentences from PubMed
Chilibot
to find supporting relationship evidence between two
predefined entities of interest (genes, proteins, keywords)
MutationFinder
to extract amino-acid mutation mentions from large text
collections
MarkerInfoFinder
to detect information related to sequence variants of human
genes
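A minimal sketch of extracting amino-acid point mutation mentions with a regular expression, covering the one-letter (A123T) and three-letter (Gly12Val) notations. The pattern and function name are assumptions written for this example; they are far simpler than the rule set of a dedicated tool and are not its actual patterns.

```python
import re

THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}
ONE = "".join(sorted(THREE_TO_ONE.values()))  # the 20 one-letter codes
THREE = "|".join(THREE_TO_ONE)                # alternation of three-letter codes

# wild-type residue, sequence position, mutant residue
MUTATION_RE = re.compile(
    rf"\b(?:([{ONE}])(\d+)([{ONE}])|({THREE})(\d+)({THREE}))\b"
)


def find_mutations(text):
    """Return (wild_type, position, mutant) tuples using one-letter codes."""
    found = []
    for m in MUTATION_RE.finditer(text):
        if m.group(1):
            wt, pos, mut = m.group(1), m.group(2), m.group(3)
        else:
            wt = THREE_TO_ONE[m.group(4)]
            pos = m.group(5)
            mut = THREE_TO_ONE[m.group(6)]
        found.append((wt, int(pos), mut))
    return found


mutations = find_mutations(
    "The A123T and Gly12Val substitutions abolished binding of VRK1."
)
```

A pattern this simple also matches strings of the same shape that are not mutations, so a practical extractor needs extra rules or context checks to filter them out.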
36
Beyond BioCreative: Advanced Applications
PepBank Database (of peptide sequences)
a text-mining system was used to automatically detect and
extract peptide sequences from abstracts and full-text papers
Phospho.ELM Database
integrated a text-mining system to detect S/T/Y
phosphorylation sites from the literature
MeInfoText & PubMeth
use text-mining to provide detailed information on gene
methylation and association with cancer
EpiLoc System
a text-based subcellular location prediction tool
(complementing alternative sequence-based localization algorithms)
37
Summary: Biological Text Mining Applications
from the Biology User Perspective
Protein relations
Function annotation & localization relations
Gene group & list analysis
38
Summary: Biological Text Mining Applications
from the Biology User Perspective
Acronym and term extraction
Gene-disease association
39
Summary: Biological Text Mining Applications
from the Biology User Perspective
Gene-disease association
Bio-entity tagging
Text retrieval, classification, clustering, similarity ranking
40
Summary: Biological Text Mining Applications
from the Biology User Perspective
Protein sequence
Gene group & list analysis
41
References
Main References:
M. Krallinger, A. Valencia, L. Hirschman. Linking genes to
literature: text mining, information extraction, and retrieval
applications for biology. Genome Biol. 2008; 9(Suppl 2):S8.
Z. Lu. PubMed and beyond: a survey of web tools for searching
biomedical literature. Database. 2011.
For original images & references to the mentioned tools, please
either conduct an online search with their names or refer to
the original articles above
42
Questions?
Please let us know in case of any
questions/issues!
Further info: {scetinta, lsi}@cs.purdue.edu
43