
Information Retrieval, Information Extraction, and Text Mining

Applications for Biology

Slides by Suleyman Cetintas & Luo Si

1

Outline

Introduction

Overview of Literature Data Sources

PubMed, HighWire Press, Google Scholar, Other Sources

Structure of Biomedical Language

Biological Terminology

Lexical and Semantic Sources for Biology

Biomedical Literature Processing Applications

Beyond BioCreative: Advanced Applications

Summary

References

2

Introduction

Life-science research

Large and heterogeneous biological data

in the form of protein and genomic sequence data, expression profiles,

protein structures

Yet a significant amount of information is in natural language

Most discoveries communicated by natural language

via publications, patents, reports, and e-texts on the www

controlled vocabulary terms used for other biological sources:

gene product annotations (e.g., Gene Ontology [GO] terms)

Database records (e.g., UniProt), containing comments, keywords,

descriptions etc.

3

Introduction

Structured database entries

enable efficient data retrieval, exchange, and analysis

recent tendency to enrich annotation records

general annotation databases such as UniProt (with 134K citations as of 2008) are of great practical value

Yet, only capable of covering a small fraction of biological

context information

cannot capture the richness of scientific information and argumentation in the literature

Hard to keep up with the rapid accumulation of new publications

Text mining can help to link the database entries to the

evidence and argumentation in the literature

4

Introduction

Online literature collections

e.g., PubMed

70 million queries every month, >20 million publications (as of 2010)

crucial importance to experimental biologists, biomedical researchers,

database curators, etc.

Face double-exponential growth rates (due to new

journals & increasing number of journal articles)

Different needs

Scientific community needs efficient and effective information

retrieval for targeted literature searches

Pharmaceutical industry uses text-mining systems for their

competitive intelligence

Government institutions use such tools to have a global view of

the current research state

5


Overview of Literature Data Sources

Several efforts to make medical and life-science journal information electronically accessible to the public through the World Wide Web

Efforts can be grouped under 3 categories:

I) Centralized institutional (PubMed) or academic (HighWire Press & HOLLIS) repositories of peer-reviewed articles or abstracts

II) Article collection repositories by publishers (e.g., BioMed Central, EMBASE)

III) Access to indexed scholarly articles (e.g., Google Scholar, Scirus) via web crawlers

6



PubMed

The most important resource for text mining applications

Includes citations (i.e., title, abstract, authors, and source

information) by participating publishers

by the National Center for Biotechnology Information (NCBI)

at the National Library of Medicine (NLM)

Basic Search:

can be accessed online through Entrez, a text-based search and retrieval system

Entrez improves the basic keyword searches by translating the user

query to Medical Subject Heading (MeSH) terms

MeSH: controlled vocabulary terms of medical domain, chemicals,

genes, proteins, etc.

8

PubMed

[Figure: growth of PubMed citations between 1986 and 2010]

9

PubMed

Technology development timeline for PubMed (in light green

color) and other biomedical literature search tools (in light

orange color)

10

PubMed

Programmatic Access:

PubMed also offers programmatic access to its content through:

Entrez Programming Utilities

Open Source Projects

BioPerl, Biopython, BioJava, etc. for biologist programmers

The NCBI provides the My NCBI service, to periodically

retrieve new publications in PubMed matching a predefined

user query

The requester receives a corresponding notification via an e-mail alert

system
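A minimal sketch of what programmatic access through the Entrez Programming Utilities can look like, assuming the public ESearch endpoint and its JSON output mode. The helper names `build_pubmed_search_url` and `extract_pmids` are illustrative, and the query term is only an example. No network request is made here; the URL would be fetched with any HTTP client.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Base URL of the NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_search_url(term, retmax=20):
    """Return an ESearch URL asking PubMed for up to `retmax` matching PMIDs."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return ESEARCH_URL + "?" + urlencode(params)


def extract_pmids(esearch_json):
    """Pull the PMID list out of an already decoded ESearch JSON response.

    Assumes the response keeps its ids under esearchresult -> idlist.
    """
    return esearch_json.get("esearchresult", {}).get("idlist", [])


url = build_pubmed_search_url("VRK1 AND c-Jun", retmax=5)

# Decode the query string again to show which parameters were sent.
sent = parse_qs(urlparse(url).query)
print(url)
print(sent["db"], sent["term"], sent["retmax"])
```

Running the same query on a schedule and comparing the returned identifiers with those already seen would approximate the kind of alerting that My NCBI provides.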

11

PubMed

For a Local PubMed

it is possible to have a local relational database of all PubMed

citations

Obtain a licensed copy of the whole PubMed containing XML-

formatted citation records from NLM/NCBI

Mobile Access

Txt2MEDLINE: use SMS to access PubMed

PubMed Informer: Web-based PubMed monitoring tool,

facilitates PDA downloads and RSS feeds

12

Google Scholar

alternative to PubMed

not only peer reviewed articles, but also other scholarly texts

such as theses, books, preprint repositories

often returns larger retrieval sets, (yet with substantial number

of link-outs to PubMed records)

does not offer the advanced search functions that PubMed

offers

13

HighWire Press

alternative to PubMed

an initiative of Stanford University

represents another complementary resource to PubMed

Access to peer-reviewed articles, providing search interface to

over 1160 journals, 4.8 million full-text articles (with over 1.9

million articles available free by HighWire partner publishers)

shares many search characteristics with PubMed (though the two also differ)

HighWire, further,

has a graphical representation of articles’ citation maps

allows the user to specify where to conduct the search (title, abstract, etc.)

14

Other resources

PubMed Central

Free access to full-text articles (not

only to abstracts)

contains articles published before 1966

publishers have also developed

platforms of searchable article

repositories such as EMBASE and

BioMed Central to improve the access

to their articles

15


Homologous protein sequences often share a common structural fold and tend to exhibit a similar function

In natural language, a particular meaning may be expressed

using different but largely synonymous expressions

Natural language processing (NLP) is used to ‘decode’

human language

exploiting the regularities and constraints that occur at

multiple levels in human language

These 4 levels: words, syntax, semantics, pragmatics

16

Structure of Biomed. Lang.: Words

Tokenization and morphology: identification of words in

biology text

in English, word boundaries are marked by whitespace and sentence boundaries by ‘.’ (period or full stop)

but there are many complications as well

the JULIE (Jena University Language and Information

Engineering) laboratory provides tools for token and sentence

boundary detection
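The complications mentioned above can be sketched with a short example. The regular expressions below are illustrative assumptions, not the rules used by the JULIE tools: a period ends a sentence only when followed by whitespace and an uppercase letter, and a token may contain internal hyphens or periods.

```python
import re

TEXT = "IL-2 levels rose 1.5-fold. The p53 protein was unaffected."


def naive_sentences(text):
    """Split on every period, as the simple rule suggests."""
    return [s.strip() for s in text.split(".") if s.strip()]


# Sentence boundary: whitespace preceded by a period and followed by an uppercase letter.
SENTENCE_END = re.compile(r"(?<=\.)\s+(?=[A-Z])")

# Token: alphanumeric runs joined by internal hyphens or periods, or one punctuation mark.
TOKEN = re.compile(r"[A-Za-z0-9]+(?:[-.][A-Za-z0-9]+)*|[^\sA-Za-z0-9]")


def split_sentences(text):
    """Split text into sentences using the boundary heuristic above."""
    return [s for s in SENTENCE_END.split(text) if s]


def tokenize(sentence):
    """Return tokens, keeping names such as IL-2 and values such as 1.5-fold whole."""
    return TOKEN.findall(sentence)


print(naive_sentences(TEXT))   # the decimal point in 1.5-fold is treated as a boundary
print(split_sentences(TEXT))
print([tokenize(s) for s in split_sentences(TEXT)])
```

The heuristic still fails on abbreviations followed by a capitalized word (for example "et al. We"), which is one reason dedicated boundary detectors are trained on annotated biomedical text.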

17

Structure of Biomed. Lang.: Words

Tokenization and morphology: identification of words in

biology text

very important stage

gene mention identification (BioCreative II – gene mention

task)

some teams explored the integration of publicly available gene

mention taggers, e.g. the ABNER application or the LingPipe system

linking these mentions to specific entries in biological

resources (gene normalization)

stemming – convert words to their roots, reduce variability

general stemmer – the Porter stemmer

specific biomedical stemmers

18

Structure of Biomed. Lang.: Syntax

Syntax: syntax or grammar of a language controls how

words are grouped into meaningful phrases

words can be associated with parts of speech (POS) tags

POS taggers are based on machine learning algorithms (e.g.,

hidden Markov models) trained on manually marked corpus

the biomedical POS distribution differs slightly from that of general English

special taggers for biomedical domain: MedPost tagger, dTagger

POS tagging can be useful to

detect textual patterns expressing protein interaction

locate gene and protein mentions

19

Structure of Biomed. Lang.: Semantics &

Pragmatics

Semantics: capture the meaning

e.g., ‘c-Jun is activated by VRK1’ can be represented as an

operator ‘activate(VRK1,c-Jun)’

semantic representation abstracts away the syntax

Pragmatics: capture the larger context and its

contribution to meaning

text mining systems often rely on sentences as basic processing

unit for extracting associations between biological entities

descriptions of those relations often go beyond sentence boundaries and make use of referring expressions
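The semantics example above, where ‘c-Jun is activated by VRK1’ becomes ‘activate(VRK1,c-Jun)’, can be sketched with two surface patterns. This is a toy illustration under strong assumptions: the verb list is hand-written, only simple active and passive sentences are handled, and the name `to_predicate` is hypothetical.

```python
import re

# Very loose entity pattern: letters, digits and internal hyphens.
ENTITY = r"[A-Za-z0-9][A-Za-z0-9-]*"

# Hand-written verb forms mapped to the operator name (an assumption for this sketch).
VERB_LEMMAS = {
    "activates": "activate",
    "activated": "activate",
    "inhibits": "inhibit",
    "inhibited": "inhibit",
}

PASSIVE = re.compile(rf"(?P<patient>{ENTITY}) is (?P<verb>\w+) by (?P<agent>{ENTITY})")
ACTIVE = re.compile(rf"(?P<agent>{ENTITY}) (?P<verb>\w+) (?P<patient>{ENTITY})")


def to_predicate(sentence):
    """Map a simple sentence to operator(agent,patient), or None if no pattern fits."""
    for pattern in (PASSIVE, ACTIVE):
        match = pattern.search(sentence)
        if match and match.group("verb") in VERB_LEMMAS:
            operator = VERB_LEMMAS[match.group("verb")]
            return f"{operator}({match.group('agent')},{match.group('patient')})"
    return None


print(to_predicate("c-Jun is activated by VRK1"))
print(to_predicate("VRK1 activates c-Jun"))
```

Both sentences yield the same operator, which is the sense in which a semantic representation abstracts away the syntax. Relations spread over several sentences or expressed through referring expressions are out of reach for patterns like these.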

20


Main NLP levels, from word tokenization to semantics

21


Biological literature characterized by

heavy use of domain-specific terminology

~12% of all terms in biochemistry pubs are technical terms

a need for recognizing medical terms & their variations

automatically

2 main challenges

constant formation of new terms and new short forms

ambiguity or polysemy (multiple meanings of the same word)

22


ambiguity or polysemy

text mining tools must select the correct sense of the word, using the surrounding context for disambiguation

gene names are a problem, as they are often shared across species

general English => 0.57% ambiguity

medical terms => 1.01% ambiguity

gene names => 14.20% ambiguity

biomedical & life science literature heavily depends on short

forms => further ambiguity

online tools for acronym-full name pairs:

ADAM, the Abbreviation Server, and AcroMine
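A rough sketch of how acronym and full-name pairs might be harvested from text, assuming the common "long form (SHORT)" writing convention and a simple first-letter match. This is not the method used by ADAM, the Abbreviation Server, or AcroMine; the function name `find_abbreviations` is illustrative.

```python
import re

# A short form in parentheses: starts with an uppercase letter, 2 to 10 characters.
SHORT_FORM = re.compile(r"\(([A-Z][A-Za-z0-9]{1,9})\)")


def find_abbreviations(text):
    """Return (short form, long form) pairs whose word initials spell the short form."""
    pairs = []
    for match in SHORT_FORM.finditer(text):
        short = match.group(1)
        # Words that precede the opening parenthesis.
        words = re.findall(r"[A-Za-z0-9]+", text[:match.start()])
        candidate = words[-len(short):]
        initials_match = len(candidate) == len(short) and all(
            word[0].lower() == letter.lower()
            for word, letter in zip(candidate, short)
        )
        if initials_match:
            pairs.append((short, " ".join(candidate)))
    return pairs


TEXT = "Text mining uses natural language processing (NLP) and Gene Ontology (GO) terms."
print(find_abbreviations(TEXT))
```

Short forms whose letters do not line up one-to-one with word initials, such as MeSH for Medical Subject Heading, are missed by this rule. A short form collected with several different long forms across a corpus is a direct sign of the ambiguity described above.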

23


domain-specific technical terms

used for expressing functional descriptions of bio-entities,

relevant biological processes, experimental techniques

terminological repositories & dictionaries

important resources to interpret scientific articles

many have been developed

ontologies

developed for various subfields of biology

Gene Ontology (GO)

widely used as controlled vocabulary to describe biologically relevant

aspects of gene products

24


Ontologies

Gene Ontology (GO)

Although primarily designed for annotation purposes, can also be used

as a lexical resource for indexing via the GoPubMed application

GOAnnotator

allows extraction of text-based GO annotations for a given protein identifier (SwissProt accession number)

GO Annotation Task in BioCreative I

showed that automatic detection of GO terms is more efficient in the case of short terms

25


Word Level

SwissProt

biological annotation database

BioThesaurus

widely used resource combining gene and protein names from

multiple sources

TerMine

developed at the National Center for Text Mining (NaCTeM)

integrates automatic term recognition approach using linguistic and

statistical analysis of candidate terms

26

Biomedical Literature Processing Applications

Provide access to information in scientific articles at

various levels of granularity

Building blocks for biomedical text processing can be

grouped with respect to the BioCreative tasks:

Document retrieval: core of the ‘interaction article’ subtask, to

select articles about protein-protein interactions

Entity mention: identification of mentions of biological entities

Entity normalization: linking biological entities (e.g., genes,

proteins, etc.) to biological resources (e.g., SwissProt, Entrez

Gene, etc.)
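Entity normalization can be sketched as a lookup from name variants to database identifiers. Everything in the lexicon below is a made-up placeholder (the `HYPO:` identifiers do not exist in SwissProt or Entrez Gene); the point is only the shape of the mapping and how a shared name returns more than one candidate.

```python
import re

# Hypothetical lexicon: identifier -> known name variants (placeholders, not real records).
LEXICON = {
    "HYPO:1": ["c-Jun", "JUN"],     # imagined human entry
    "HYPO:2": ["Jun"],              # imagined mouse entry sharing a name
    "HYPO:3": ["VRK1", "VRK-1"],
}


def normalize_key(name):
    """Lowercase a name and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def build_index(lexicon):
    """Invert the lexicon into normalized name -> set of identifiers."""
    index = {}
    for identifier, names in lexicon.items():
        for name in names:
            index.setdefault(normalize_key(name), set()).add(identifier)
    return index


def normalize(mention, index):
    """Return all candidate identifiers for a text mention, sorted for stable output."""
    return sorted(index.get(normalize_key(mention), set()))


INDEX = build_index(LEXICON)
print(normalize("C-JUN", INDEX))
print(normalize("JUN", INDEX))    # two candidates: the name is shared
print(normalize("vrk 1", INDEX))
```

When a mention maps to several identifiers, a real system would need further evidence, such as the organism mentioned in the article, to pick one.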

27

BioMed. Lit. Proc. Apps: Document Retrieval

Requires the ability to process and index massive volumes

of data (e.g., the entire MEDLINE collection)

robust, efficient wrt space and time

Look for keywords that characterize a collection of

papers, based on keyword frequency

basis of neighbor searches in MEDLINE (the predecessor of eTBLAST)

still the most heavily used system

Statistical analysis of word occurrences

many current literature mining systems rely on it

calculated over the whole PubMed database, resulting in weighted associations between biological entities

28


Statistical analysis of word occurrences

underlying assumption is that if two biological entities frequently co-occur, they should have some biological relationship

can provide high recall

challenge in human interpretation

lacks semantic information on the type of biological association

CoPub Mapper system

provides online access to ranked co-occurrence associations extracted from PubMed (between genes and biological terms)

PubGene system

generates a graphical protein interaction network based on protein-protein literature co-occurrences
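The co-occurrence idea can be sketched as follows, assuming each abstract has already been reduced to the set of entities it mentions. The weighting used here is pointwise mutual information, chosen for illustration; the slides do not state which statistic CoPub Mapper or PubGene apply.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_scores(documents):
    """Score entity pairs by pointwise mutual information over a document collection.

    `documents` is a list of sets, each holding the entities found in one abstract.
    """
    total = len(documents)
    single = Counter()
    pair = Counter()
    for entities in documents:
        single.update(entities)
        pair.update(combinations(sorted(entities), 2))
    scores = {}
    for (a, b), joint in pair.items():
        # PMI = log2( P(a,b) / (P(a) * P(b)) ), with probabilities as document fractions.
        scores[(a, b)] = math.log2((joint * total) / (single[a] * single[b]))
    return scores


ABSTRACTS = [
    {"VRK1", "c-Jun"},
    {"VRK1", "c-Jun"},
    {"VRK1", "p53"},
    {"p53"},
]

for entity_pair, score in sorted(cooccurrence_scores(ABSTRACTS).items()):
    print(entity_pair, round(score, 3))
```

The score says that two entities appear together more often than chance would suggest, but nothing about the type of association, which is the interpretation problem noted above.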

29


Stemming

converts words into standardized forms (stems)

essential component of IR systems and search engines

one common shortcoming

two semantically different words can be collapsed to a common stem

used by systems such as eTBLAST to quantify the similarity between documents

CoPub System

detects over-represented terms from multiple abstract

collections

eTBLAST

ranks retrieved PubMed records given an input article
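The role of stemming in document similarity can be sketched with a deliberately crude suffix stripper and a cosine measure over stem counts. This is not the Porter stemmer and not eTBLAST's ranking method; the suffix list and function names are assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Suffixes tried in order; a stem must keep at least three characters.
SUFFIXES = ("ations", "ation", "ing", "ed", "es", "s")


def crude_stem(word):
    """Strip the first matching suffix, if the remaining stem is long enough."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def term_vector(text):
    """Count stems of the lowercase alphanumeric tokens in a text."""
    return Counter(crude_stem(w) for w in re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two stem-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


query = term_vector("VRK1 binds c-Jun")
related = term_vector("binding of VRK1")
unrelated = term_vector("peptide sequences")

print(round(cosine(query, related), 3), cosine(query, unrelated))
print(crude_stem("mutations"), crude_stem("muted"))  # unrelated words, same stem
```

Because "binds" and "binding" share a stem, the first two texts overlap on two terms. The last line shows the shortcoming noted above: "mutations" and "muted" collapse to the same stem although their meanings differ.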

30


Clustering Algorithms

used to group genes according to their expression profiles in microarray experiments

clustering based on document similarity calculations has been used by PubClust and McSyBi

[Figure: list of systems for clustering and similarity ranking]

31

Biomedical Literature Processing Applications:

Gene Mention & Gene Normalization

Biologists search the annotation databases using

gene/protein names or symbols as queries

these names have been manually extracted from the literature

too time-consuming

unable to cover all synonyms or naming variants used by the

biologists

Automatic detection of protein & gene mentions

improves the coverage of annotation databases

enable semantically refined literature search

constitute a crucial initial step for other text mining systems

focus of BioCreative gene mention task

performance of 90% F measure

training data of 15K sentences & 5K test sentences

32
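For readers unfamiliar with the F measure quoted above, a minimal sketch of how it is computed from gold-standard and predicted mentions. It assumes exact-match scoring on (start, end) character offsets; the actual BioCreative evaluation rules may be more lenient, and the offsets below are invented.

```python
def precision_recall_f1(gold, predicted):
    """Compute precision, recall and balanced F measure for exact-match mentions."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F measure: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented character offsets of gene mentions in one sentence.
GOLD = [(0, 5), (22, 26), (40, 43)]
PREDICTED = [(0, 5), (22, 26), (30, 33), (50, 55)]

p, r, f = precision_recall_f1(GOLD, PREDICTED)
print(round(p, 3), round(r, 3), round(f, 3))
```

Two of four predictions are correct and two of three gold mentions are found, so the F measure lies between the two values and closer to the lower one.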



Most current bio-entity recognition systems

e.g., GAPSCORE, ABGENE

Can label text for protein or gene mentions

other systems such as ABNER

also identify cell lines or cell types

Chemical compound mentions

Another set of biological entities of interest

Oscar, open source system for chemical entity recognition

integrates dictionary of compound names

as well as using regular expressions, heuristics, and certain word

combinations to find chemical names in text
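A toy sketch of combining a compound dictionary with a regular expression, in the spirit of the description above. It does not reproduce Oscar's dictionary or rules: the three dictionary entries, the small element list, and the requirement that a formula contain a digit are all assumptions made for this example.

```python
import re

# Tiny illustrative dictionary of compound names.
COMPOUND_DICTIONARY = {"ethanol", "glucose", "aspirin"}

# A handful of element symbols; two-letter symbols come first in the alternation.
ELEMENT = r"(?:Cl|Na|C|H|O|N|S|P)"

# Formula heuristic: at least two element symbols with optional counts, and at least one digit.
FORMULA = re.compile(rf"(?=.*\d)(?:{ELEMENT}\d*){{2,}}")


def tag_chemicals(text):
    """Return (token, evidence) pairs for tokens that look like chemical names."""
    hits = []
    for match in re.finditer(r"[A-Za-z0-9]+", text):
        token = match.group(0)
        if token.lower() in COMPOUND_DICTIONARY:
            hits.append((token, "dictionary"))
        elif FORMULA.fullmatch(token):
            hits.append((token, "formula"))
    return hits


TEXT = "Cells were treated with ethanol and H2O before VRK1 staining."
print(tag_chemicals(TEXT))
```

The gene name VRK1 is not tagged because V, R and K are outside the element list used here; a fuller element list would need extra heuristics to separate formulas from gene symbols.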

33



Mentions of species and taxonomic names

important for the emerging field of biodiversity

crucial step to link gene mentions to corresponding organism

source

Detecting bio-entity mentions alone is often not enough

to retrieve informative sentences

BioIE system detects (for a given query keyword) only

sentences related to protein families, functions, etc.

Other applications such as iHOP

given a gene or protein, maps it to its corresponding db identifier, and

retrieves related sentences with definition info, etc.

34



Detecting bio-entity mentions alone is often not enough

to retrieve informative sentences

EBIMed and FACTA systems

for a given query protein, present a summary table of co-occurring

concepts based on PubMed abstracts

FABLE

retrieves co-occurring gene and protein mentions for a query keyword

results can be downloaded in XML or Excel format

For searching functional information for gene products

search with protein sequences is possible through the METIS and MedBlast systems

query sequence is linked to corresponding db record, and the

associated literature is retrieved afterwards

35


iHOP and InfoPubMed

allow retrieval of protein interaction sentences from PubMed

Chilibot

to find supporting relationship evidence between two

predefined entities of interest (genes, proteins, keywords)

MutationFinder

to extract amino-acid mutation mentions from large text

collections

MarkerInfoFinder

to detect information related to sequence variants of human

genes
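Extraction of amino-acid mutation mentions can be sketched with two regular expressions, one for one-letter notation (A123T) and one for three-letter notation (Gly12Val). These patterns are a simplified illustration and not the rule set of the mutation extraction tool named above; the function name `find_mutations` is hypothetical.

```python
import re

ONE_LETTER = "ACDEFGHIKLMNPQRSTVWY"

# Standard three-letter to one-letter amino-acid codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}
THREE_LETTER = "|".join(THREE_TO_ONE)

SHORT = re.compile(rf"\b([{ONE_LETTER}])(\d+)([{ONE_LETTER}])\b")
LONG = re.compile(rf"\b({THREE_LETTER})(\d+)({THREE_LETTER})\b")


def find_mutations(text):
    """Return (wild type, position, mutant) triples in order of appearance."""
    found = []
    for match in SHORT.finditer(text):
        wild, position, mutant = match.groups()
        found.append((match.start(), (wild, int(position), mutant)))
    for match in LONG.finditer(text):
        wild, position, mutant = match.groups()
        found.append((match.start(), (THREE_TO_ONE[wild], int(position), THREE_TO_ONE[mutant])))
    return [mutation for _, mutation in sorted(found)]


TEXT = "The A123T and Gly12Val substitutions reduced activity of VRK1."
print(find_mutations(TEXT))
```

Patterns this loose also match unrelated strings of the same shape, such as the histone name H2A, so a practical system needs additional context checks.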

36


PepBank Database (of peptide sequences)

a text-mining system was used to automatically detect and

extract peptide sequences from abstracts and full-text papers

Phospho.ELM Database

integrated a text-mining system to detect S/T/Y

phosphorylation sites from the literature

MeInfoText & PubMeth

use text-mining to provide detailed information on gene

methylation and association with cancer

Epiloc System

a text-based subcellular location prediction tool

(complementing alternative sequence-based localization algorithms)

37

Summary: Biological Text Mining Applications

from the Biology User Perspective

Protein-relations

Function

annotation &

localization

relations

Gene group & lists

analysis

38



Acronym and term extraction

Gene-disease association

39



Gene-disease association

Bio-entity tagging

Text retrieval, classification, clustering, similarity ranking

40



Protein sequence

Gene group & lists

analysis

41

References

Main References:

M. Krallinger, A. Valencia, L. Hirschman. Linking genes to

literature: text mining, information extraction, and retrieval

applications for biology. Genome Biol. 2008; 9:S8.

Z. Lu, PubMed and beyond: a survey of web tools for searching

biomedical literature, Database. 2011.

For original images & references to the mentioned tools, please

either conduct an online search with their names or refer to

the original articles above

42

Questions ?

Please let us know in case of any

questions/issues!

Further info: {scetinta, lsi}@cs.purdue.edu

43