Playing Biology’s Name Game: Identifying Protein Names In Scientific Text
Daniel Hanisch, Juliane Fluck, Heinz-Theodor Mevissen and
Ralf Zimmer
Pac Symp Biocomput. 2003;:403-14.
Abstract
Construction of a comprehensive general-purpose name dictionary
An accompanying automatic curation procedure based on a simple token model of protein names
An efficient search algorithm to analyze all abstracts in MEDLINE
Parameters are optimized using machine learning techniques
Model for protein and gene names
Protein names are often composed of more than one word (token)
The “order” of these words is not very important: permutation of tokens may occur
General-purpose dictionaries of protein names must be composed automatically
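The token view of names can be illustrated with a minimal sketch. The tokenization rule (lower-casing, splitting into letter runs and digit runs) and the function names `tokenize` and `same_tokens` are assumptions made for illustration, not the authors' implementation.

```python
import re
from typing import List

def tokenize(name: str) -> List[str]:
    """Lower-case a name and split it into letter runs and digit runs."""
    return re.findall(r"[a-z]+|[0-9]+", name.lower())

def same_tokens(a: str, b: str) -> bool:
    """True if both names contain the same tokens, regardless of order."""
    return sorted(tokenize(a)) == sorted(tokenize(b))

# A permuted, differently punctuated variant still compares equal.
print(same_tokens("interleukin 2 receptor", "receptor, interleukin-2"))  # True
print(same_tokens("interleukin 2 receptor", "interleukin 2"))            # False
```

Comparing sorted token lists ignores word order and punctuation but still distinguishes names that differ in a single token.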
Token classes (1/3)
Token classes (2/3)
Extract all words from the dictionary with frequency of occurrence > 100
Non-descriptive tokens: words that occur in databases but are rarely used in free text, or that have no influence on the significance of a match
Modifier tokens: words crucial for correct recognition
Token classes (3/3)
Specifier tokens: Arabic and Roman numerals and Greek letters
Delimiter tokens: used to gain specificity in the matching procedure by helping to identify name boundaries
Common words: obtained by comparison to a standard English dictionary
Standard tokens: gene identifiers, as they cannot easily be assigned to a separate class
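A rough sketch of how tokens might be assigned to these classes is given below. The word lists are tiny illustrative placeholders (only “precursor” and “receptor” are taken from the slides); the function name `classify` and the order in which the classes are checked are assumptions.

```python
import re

# Illustrative placeholder lists; real lists would be derived from the dictionary.
NON_DESCRIPTIVE = {"precursor", "protein", "fragment"}
MODIFIER = {"receptor", "inhibitor"}
DELIMITER = {"of", "and", ","}
COMMON = {"the", "was", "for"}
GREEK = {"alpha", "beta", "gamma", "delta", "kappa"}
# Roman numerals from I to XXXIX; the lookahead rejects the empty string.
ROMAN = re.compile(r"(?=[ivx])x{0,3}(ix|iv|v?i{0,3})")

def classify(token: str) -> str:
    """Return the token class of a single token (assumed precedence order)."""
    t = token.lower()
    if t.isdigit() or t in GREEK or ROMAN.fullmatch(t):
        return "specifier"
    if t in MODIFIER:
        return "modifier"
    if t in NON_DESCRIPTIVE:
        return "non-descriptive"
    if t in DELIMITER:
        return "delimiter"
    if t in COMMON:
        return "common"
    return "standard"  # everything else, e.g. gene identifiers

print([classify(t) for t in ["TNF", "receptor", "2", "precursor"]])
```

Anything that falls through all lists ends up as a standard token, which matches the idea that gene identifiers do not form a class of their own.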
Automatic generation of the dictionary
Extract gene symbols, alias names, and full names for all human genes from the HUGO Nomenclature database
Create an entry for each official gene symbol and add the corresponding names from the OMIM database
Extract all synonyms from the SWISSPROT and TREMBL databases and match them to the HUGO entries
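The three steps above can be sketched as a merge keyed by official gene symbol. This is only an outline under stated assumptions: the input shapes, the function name `build_dictionary`, the toy records, and the rule of matching protein-database entries to HUGO entries by gene name are hypothetical and do not reflect the real file formats of these databases.

```python
from collections import defaultdict

def build_dictionary(hugo_records, omim_names, protein_db_entries):
    """Merge names from three sources into one record per official symbol.

    hugo_records:       iterable of (symbol, aliases, full_name)
    omim_names:         dict mapping symbol -> list of names
    protein_db_entries: iterable of (gene_names, synonyms)
    """
    dictionary = defaultdict(set)
    alias_to_symbol = {}
    # Step 1: one entry per official symbol, seeded with the HUGO names.
    for symbol, aliases, full_name in hugo_records:
        dictionary[symbol].update([symbol, full_name, *aliases])
        for name in (symbol, *aliases):
            alias_to_symbol[name.upper()] = symbol
    # Step 2: add OMIM names for symbols that already have an entry.
    for symbol, names in omim_names.items():
        if symbol in dictionary:
            dictionary[symbol].update(names)
    # Step 3: attach protein-database synonyms via a matching gene name.
    for gene_names, synonyms in protein_db_entries:
        for gene_name in gene_names:
            symbol = alias_to_symbol.get(gene_name.upper())
            if symbol is not None:
                dictionary[symbol].update(synonyms)
                break
    return dict(dictionary)

# Toy records with made-up identifiers.
hugo = [("GENEX", ["GX1"], "gene x protein")]
omim = {"GENEX": ["gene x factor"]}
proteins = [(["gx1"], ["GX precursor"]), (["other"], ["unrelated name"])]
print(build_dictionary(hugo, omim, proteins))
```

Entries that cannot be matched to any official symbol are simply skipped in this sketch.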
Curation of the dictionary (1/3)
Goal: resolve ambiguities and remove nonsensical names from the dictionary
The curation procedure consists of two phases: expansion and pruning
Expansion:
Curation of the dictionary (2/3)
Pruning: remove redundancies, ambiguities, and irrelevant synonyms
First, each synonym is converted to a sequence of token class identifiers
Regular expressions are then used to find unspecific synonyms (e.g. only non-descriptive tokens, only specifier tokens, etc.)
Finally, a list of ambiguous names is stored separately with reference to their original records
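A minimal sketch of the pruning idea follows. The one-letter class codes, the toy lexicon, the exact regular expression, and the function names are assumptions chosen for illustration; the slides do not specify them.

```python
import re
from collections import defaultdict

# Toy lexicon mapping tokens to one-letter class codes (assumed encoding):
# N = non-descriptive, M = modifier, D = delimiter, S = specifier, T = standard.
LEXICON = {"precursor": "N", "protein": "N", "receptor": "M", "of": "D"}
GREEK = {"alpha", "beta", "gamma"}

def class_code(token):
    t = token.lower()
    if t.isdigit() or t in GREEK:
        return "S"
    return LEXICON.get(t, "T")

def signature(synonym):
    """Encode a synonym as a string of token class identifiers."""
    return "".join(class_code(t) for t in synonym.split())

# Unspecific: only non-descriptive/delimiter tokens, or only specifier/delimiter tokens.
UNSPECIFIC = re.compile(r"[ND]+|[SD]+")

def prune(synonyms):
    """Split synonyms into (kept, removed) according to their signature."""
    kept, removed = [], []
    for s in synonyms:
        (removed if UNSPECIFIC.fullmatch(signature(s)) else kept).append(s)
    return kept, removed

def ambiguous_names(dictionary):
    """Map each name that occurs in more than one record to those records."""
    owners = defaultdict(set)
    for record, synonyms in dictionary.items():
        for s in synonyms:
            owners[s.lower()].add(record)
    return {name: sorted(recs) for name, recs in owners.items() if len(recs) > 1}

print(prune(["protein precursor", "alpha 2", "TNF receptor 2"]))
```

Working on the class signature rather than on the raw text means a single short pattern covers every synonym built only from uninformative tokens.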
Curation of the dictionary (3/3)
The ambiguity list can be used to identify such entries and move them to the manual curation list based on their frequency of occurrence.
Efficient detection of names (1/3)
MEDLINE contains about 11 million abstracts
The search runs in time linear in the number of tokens of the parsed text
The algorithm sweeps over the abstract, processing one token at a time, and keeps a set of candidate solutions and two associated scoring measures for the present position: a boundary score and an acceptance score
Efficient detection of names (2/3)
Boundary score: controls the end of the extension of a candidate match and is increased on a token mismatch. The candidate is pruned if the boundary score exceeds the boundary threshold.
Acceptance score: determines whether the candidate is reported as a match. It is a linear combination of token-class-specific match and mismatch terms. In other words, the significance varies between token classes.
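The interplay of the two scores can be sketched for a single candidate synonym. This is a simplification: the actual search keeps a whole set of candidates per position, whereas the sketch scores one synonym from one start position. All weights, the threshold, the synonym, and the function name `score_candidate` are hypothetical.

```python
# Hypothetical per-class weights; the optimized values are not given here.
MATCH = {"standard": 1.0, "modifier": 1.0, "specifier": 1.0, "non-descriptive": 0.1}
MISMATCH = {"standard": 1.0, "modifier": 2.0, "specifier": 2.0, "non-descriptive": 0.1}

def score_candidate(text_tokens, start, synonym, boundary_threshold=1.0):
    """Extend one candidate from `start`; return (acceptance score, end index).

    synonym: list of (token, token_class) pairs; repeated tokens are not handled.
    """
    unmatched = dict(synonym)
    boundary = 0.0
    acceptance = 0.0
    end = start
    for pos in range(start, len(text_tokens)):
        token = text_tokens[pos].lower()
        if token in unmatched:
            # Matched token: add its class-specific match term.
            acceptance += MATCH[unmatched.pop(token)]
            end = pos + 1
        else:
            # Mismatch in the text: raise the boundary score, stop once too high.
            boundary += 1.0
            if boundary > boundary_threshold:
                break
        if not unmatched:
            break
    # Synonym tokens never found contribute class-specific mismatch terms.
    acceptance -= sum(MISMATCH[cls] for cls in unmatched.values())
    return acceptance, end

synonym = [("tnf", "standard"), ("receptor", "modifier"), ("precursor", "non-descriptive")]
print(score_candidate("the tnf receptor was cloned".split(), 1, synonym))
print(score_candidate("the tnf precursor was cloned".split(), 1, synonym))
```

With these assumed weights, the first text scores about 1.9 because only the low-weight token “precursor” is missing, while the second scores about -0.9 because the modifier “receptor” is missing. A candidate would then be reported only if its acceptance score exceeds a chosen threshold.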
Efficient detection of names (3/3)
Example:
If only the non-descriptive token “precursor” is unmatched in the candidate, a nearly maximal match score is computed (if non-descriptive tokens receive a small weight)
However, a mismatch on the semantically significant modifier token “receptor” leads to a substantial mismatch term (if weights are set appropriately)
Parameter optimization
Robust linear programming (RLP) was used to compute a set of sensible weights
This supervised machine learning technique uses a set of positive samples, i.e. correctly identified protein names, and a set of negative ones.
The match and mismatch weighting parameters for delimiter, specifier, modifier, and standard tokens were tuned.
The optimized weightings penalize mismatches of modifier and number tokens and reward matches of the other token classes to varying extents
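For orientation, one common formulation of robust linear programming for separating two sample sets is shown below. Whether the weights here were computed with exactly this form is an assumption. The symbols are introduced for illustration only: the rows of A (m rows) are feature vectors of positive samples, the rows of B (k rows) are those of negative samples, w is the weight vector, γ is the decision threshold, y and z are slack vectors, and e is a vector of ones.

```latex
\begin{aligned}
\min_{w,\,\gamma,\,y,\,z}\quad & \frac{1}{m}\, e^{T} y + \frac{1}{k}\, e^{T} z \\
\text{subject to}\quad & A w - e\gamma + y \ge e, \\
                       & -B w + e\gamma + z \ge e, \\
                       & y \ge 0,\quad z \ge 0 .
\end{aligned}
```

The objective minimizes the average violation on each side, so the solution stays meaningful even when the positive and negative samples are not linearly separable.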
Evaluation
The test dataset is based on the TRANSPATH database on regulatory interactions
All human proteins with SWISSPROT annotations were extracted
Abstracts were discarded if no text was available or if a protein was described for the first time
The resulting benchmark set consists of 611 associations (141 objects in 470 abstracts)
Results – 5-fold c.v.