Effective Reranking for Extracting Protein-protein Interactions from Biomedical Literature
Deyu Zhou, Yulan He and Chee Keong Kwoh
School of Computer Engineering
Nanyang Technological University, Singapore
30 August 2007
Outline
• Protein-protein interactions (PPIs) extraction
• Hidden Vector State (HVS) model for PPIs extraction
• Reranking approaches
• Experimental results
• Conclusions
Protein-Protein Interactions Extraction
[Diagram: Protein, Interact, Protein]
Spc97p interacts with Spc98 and Tub4 in the two-hybrid system
Spc97p interact Spc98
Spc97p interact Tub4
Existing Approaches (from simple to complicated)
• Statistics Methods
• Pattern Matching
• Parsing-Based
An example
However, unlike another tumor suppressor protein, p53, Rb did not have any significant effect on basal levels of transcription, suggesting that Rb specifically interacts with IE2 rather ...
Part-of-speech tagging
However/RB ,/, unlike/IN another/DT tumor/NN suppressor/NN protein/NN ,/, p53/NN ,/, Rb/NN did/VBD not/RB have/VB any/DT significant/JJ effect/NN on/IN basal/JJ levels/NNS of/IN transcription/NN ,/, suggesting/VBG that/IN Rb/NN specifically/RB interacts/VBZ with/IN IE2/NN rather/RB ...
Protein name identification
However/RB ,/, unlike/IN another/DT tumor/NN suppressor/NN protein/NN ,/, PROTEIN(p53/NN) ,/, PROTEIN(Rb/NN) did/VBD not/RB have/VB any/DT significant/JJ effect/NN on/IN basal/JJ levels/NNS of/IN transcription/NN ,/, suggesting/VBG that/IN PROTEIN(Rb/NN) specifically/RB interacts/VBZ with/IN PROTEIN(IE2/NN) rather/RB ...
Statistics-Based Approaches
Sentence level statistic: each protein pair co-occurring in the sentence adds +1 to its relation occurrence: (p53, Rb) +1, (p53, IE2) +1, (Rb, IE2) +1
Corpus level statistic: [Tables: Relation Occurrence per protein pair (values 8, 1, ..., 6 on the slide) and Relation Confidence per protein pair ((p53, IE2): 75%, ...)]
Predefined threshold a = 7
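The counting step can be sketched in a few lines of Python. This is a minimal illustration, not the exact statistic from the slide: the function names, the toy input, and the choice to keep pairs whose count is greater than or equal to the threshold are all assumptions.

```python
from collections import Counter
from itertools import combinations


def count_cooccurrences(sentences):
    """Count, over the corpus, how often each protein pair shares a sentence.

    sentences: list of lists of protein names found in each sentence.
    """
    counts = Counter()
    for proteins in sentences:
        # Each unordered pair of distinct proteins in a sentence gets +1.
        for pair in combinations(sorted(set(proteins)), 2):
            counts[pair] += 1
    return counts


def select_relations(counts, threshold):
    """Keep pairs whose corpus-level count reaches the threshold (assumed >=)."""
    return {pair for pair, n in counts.items() if n >= threshold}


# Toy input, made up for illustration: protein mentions per sentence.
sentences = [["p53", "Rb", "IE2"], ["Rb", "IE2"], ["Rb", "IE2"]]
counts = count_cooccurrences(sentences)
relations = select_relations(counts, threshold=2)
```

The weakness is visible even in the toy input: (p53, IE2) is counted for the first sentence although the sentence does not state that interaction.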
Pattern Matching Approaches
Pattern 1: Protein [*] interact[s] with protein
Pattern 2: protein RB VBZ WITH protein
Pattern matching output: Rb interact IE2, p53 interact IE2; Rb interact IE2
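A pattern of the first kind can be approximated with a regular expression. The sketch below is an assumption about how such a pattern might be written: the function name is hypothetical, protein names are taken from a given list, and the wildcard is read as "any number of tokens".

```python
import re


def match_pattern1(sentence, proteins):
    """Sketch of 'PROTEIN [*] interact[s] with PROTEIN' over plain text."""
    # Longest names first so that a short name never shadows a longer one.
    names = "|".join(re.escape(p) for p in sorted(proteins, key=len, reverse=True))
    pattern = re.compile(
        rf"\b(?P<a>{names})\b"      # first protein name
        r"(?:\s+\S+)*?"             # wildcard: as few intervening tokens as possible
        r"\s+interacts?\s+with\s+"  # trigger phrase
        rf"(?P<b>{names})\b"        # second protein name
    )
    return [(m.group("a"), "interact", m.group("b")) for m in pattern.finditer(sentence)]


fragment = "suggesting that Rb specifically interacts with IE2 rather"
found = match_pattern1(fragment, ["p53", "Rb", "IE2"])
```

Because the wildcard accepts any tokens, a pattern like this can also link a protein mentioned much earlier in the sentence to the trigger phrase, which is how pattern matching produces false positives.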
Parsing-Based Approaches
...Rb specifically interacts with IE2...
Syntactic processing
[Parse tree: word categories N ADV V P N, grouped into NP, PP and VP nodes]
Semantic processing
(<INTERACT><THE Rb PROTEIN><THE IE2 PROTEIN>)
Rb interact IE2
Semantic Parser
For each candidate word string W_n, we need to compute the most likely set of embedded concepts:
$$\hat{C} = \arg\max_{C} P(C \mid W_n) = \arg\max_{C} P(C)\, P(W_n \mid C)$$
where P(C) is the semantic model and P(W_n | C) is the lexical model.
We could use a simple finite state tagger, which can be robustly trained using EM, but the model is too weak to represent embeddings in natural language.
<s>/SS Spc97p/PROTEIN interacts/INTERACT with/DUMMY Spc98/PROTEIN and/DUMMY Tub4/PROTEIN [in the two-hybrid system]/DUMMY </s>/SE
Perhaps use some form of hierarchical HMM in which each state is either a terminal or a nested HMM. But when using EM, such models rarely converge on good solutions and, in practice, direct maximum-likelihood estimation from “tree-bank” data is needed to train them.
Spc97p interacts with Spc98 and Tub4 in the two-hybrid system
[Parse tree with node labels S, INTERACTION, SUBJECT, OBJECT, OBJECT, PROTEIN, INTERACT, PREP, PROTEIN, AND, PROTEIN, DUMMY]
Hidden Vector State Model
<s> Spc97p interacts with Spc98 and Tub4 in the two-hybrid system </s>
Vector state for each word (top of stack first):
<s>: [SS]
Spc97p: [PROTEIN, SS]
interacts: [INTERACT, PROTEIN, SS]
with: [DUMMY, INTERACT, PROTEIN, SS]
Spc98: [PROTEIN, INTERACT, PROTEIN, SS]
and: [DUMMY, INTERACT, PROTEIN, SS]
Tub4: [PROTEIN, INTERACT, PROTEIN, SS]
in the two-hybrid system: [DUMMY, SS]
</s>: [SE, SS]
The HVS model is an HMM in which each state corresponds to the stack of a push-down automaton with a bounded stack size. This is a very convenient framework for applying constraints.
HVS model transition constraints:
• finite stack depth D
• push only one non-terminal semantic concept onto the stack at each step
The model is then defined by three simple probability tables:
$$\hat{C} = \arg\max_{C,N} \prod_{t} P(n_t \mid C_{t-1})\, P(C_t[1] \mid C_t[2..D_t])\, P(W_t \mid C_t)$$
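To make the product concrete, the following sketch scores one candidate sequence of vector states in the log domain. The function name, the dictionary layout of the three tables, and the start state are assumptions for illustration; the search over all candidates (the argmax) is not shown.

```python
import math


def parse_log_prob(words, stacks, pop_table, push_table, word_table, start=("SS",)):
    """Sum over t of log P(n_t|C_{t-1}) + log P(C_t[1]|C_t[2..D_t]) + log P(W_t|C_t).

    Stacks are tuples with the top of the stack first, so C_t[1] is stacks[t][0].
    """
    log_p = 0.0
    prev = start
    for word, cur in zip(words, stacks):
        # n_t: number of concepts popped from the previous stack before the push.
        n_pop = len(prev) + 1 - len(cur)
        if n_pop < 0 or cur[1:] != prev[n_pop:]:
            raise ValueError("stack sequence violates the pop-then-push-one constraint")
        log_p += math.log(pop_table[(n_pop, prev)])       # P(n_t | C_{t-1})
        log_p += math.log(push_table[(cur[0], cur[1:])])  # P(C_t[1] | C_t[2..D_t])
        log_p += math.log(word_table[(word, cur)])        # P(W_t | C_t)
        prev = cur
    return log_p
```

Working with log probabilities avoids numerical underflow when the product runs over a long sentence.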
Parsing with the HVS model
… with Spc98 and Tub4 …
1) Pop 1 element from the previous stack state (n = 1): P(n_t | C_{t-1})
   [PROTEIN, INTERACT, PROTEIN, SS] → [INTERACT, PROTEIN, SS]
2) Push 1 pre-terminal semantic concept onto the stack: P(C_t[1] | C_t[2..D_t])
   [INTERACT, PROTEIN, SS] → [DUMMY, INTERACT, PROTEIN, SS]
3) Generate the next word: P(W_t | C_t)
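The pop and push steps can be sketched as a single stack operation. The function name is hypothetical, and the rule that the root SS may never be popped is an assumption made for this illustration.

```python
def hvs_transition(prev_stack, n_pop, new_concept, max_depth):
    """Apply one HVS step: pop n_pop concepts, then push exactly one concept.

    Stacks are lists with the top of the stack first.
    """
    if not 0 <= n_pop < len(prev_stack):
        raise ValueError("n_pop must leave at least the root on the stack")
    new_stack = [new_concept] + prev_stack[n_pop:]
    if len(new_stack) > max_depth:
        raise ValueError("stack depth limit D exceeded")
    return new_stack


prev = ["PROTEIN", "INTERACT", "PROTEIN", "SS"]  # state for "Spc98"
cur = hvs_transition(prev, n_pop=1, new_concept="DUMMY", max_depth=4)
# cur == ["DUMMY", "INTERACT", "PROTEIN", "SS"], the state for "and"
```

With the depth limit in place, a pop count of 0 from the same previous state would be rejected, which is how the bounded stack restricts the set of reachable states.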
Train using EM and apply constraints
Training text: CUL-1 was found to interact with SKR-1, SKR-2, SKR-3, and SKR-7 in yeast two-hybrid system
Abstract semantic annotation: PROTEIN ( INTERACT ( PROTEIN ) )
[Diagram: Training text, Data Constraints, EM Parameter Estimation, Parse Statistics, HVS Model Parameters]
Limit the forward-backward search to states that are consistent with the constraints
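One way to turn an abstract annotation into constraints is to expand it into the vector states it licenses. The sketch below is an assumption about that expansion, not the authors' procedure: the function name is hypothetical and only the bracket notation shown above is handled.

```python
def annotation_to_stacks(annotation, root="SS"):
    """Expand a nested annotation into vector states (top of stack first)."""
    tokens = annotation.replace("(", " ( ").replace(")", " ) ").split()
    path = [root]      # current path from the root, bottom of the stack first
    open_leaf = False  # True while the last concept has no children yet
    states = []
    for tok in tokens:
        if tok == "(":
            open_leaf = False  # the last concept becomes a parent and stays on the path
        elif tok == ")":
            if open_leaf:
                path.pop()     # drop the last leaf
                open_leaf = False
            path.pop()         # close the enclosing parent
        else:
            if open_leaf:
                path.pop()     # a sibling replaces the previous leaf
            path.append(tok)
            states.append(list(reversed(path)))
            open_leaf = True
    return states


allowed = annotation_to_stacks("PROTEIN ( INTERACT ( PROTEIN ) )")
```

For the annotation above this yields the states [PROTEIN, SS], [INTERACT, PROTEIN, SS] and [PROTEIN, INTERACT, PROTEIN, SS]; a constrained search would additionally have to allow DUMMY states for words that carry no annotated concept.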
Reranking Methodology
• Reranking approaches attempt to improve upon an existing probabilistic parser by reranking its output.
• Reranking has benefited applications such as named-entity extraction, semantic parsing and semantic labeling.
• Here, reranking is applied to the parses generated by the HVS model for protein-protein interaction extraction.
Architecture
[Diagram: the annotated corpus E is split into training data and test data. Training produces the HVS model and the reranking model. Semantic parsing of the test data with the HVS model yields parse results; reranking selects the parse ranked 1st, from which the protein-protein interactions are extracted.]
Features: Parsing Information IP, Structure Information IS, Complexity Information IC, ...
Reranking approaches
• Features for Reranking
Suppose sentence Si has its corresponding parse set Ci = {Cij, j = 1,.. N}
– Parsing Information
– Structure Information
– Complexity Information
Reranking approaches
Score is defined as
• log-linear regression model
• Neural Network
• Support Vector Machines
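As an illustration of the general idea only, a log-linear reranker can be sketched as a weighted sum of features per candidate parse, normalized over the candidate set. The score definition used in the talk is not given here, so the function names, the feature names and the weights in the usage below are all hypothetical.

```python
import math


def score(features, weights):
    """Linear score w . f(C_ij) in the log domain."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def rerank(candidates, weights):
    """Sort (parse, features) pairs by score, best first, with normalized probabilities."""
    scores = [score(feats, weights) for _, feats in candidates]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    ranked = sorted(zip(candidates, exps), key=lambda item: item[1], reverse=True)
    return [(parse, e / total) for (parse, _), e in ranked]


# Hypothetical candidates for one sentence S_i with made-up feature values.
candidates = [
    ("parse_a", {"parse_log_prob": -5.0, "depth": 3.0}),
    ("parse_b", {"parse_log_prob": -4.0, "depth": 4.0}),
]
ranked = rerank(candidates, {"parse_log_prob": 1.0, "depth": -0.1})
```

In practice the weights would be learned from the training parses; a neural network or SVM would replace the linear score with a different scoring function over the same features.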
Experiments
• Setup
– Corpus I
• comprises 300 abstracts randomly retrieved from the GENIA corpus
• GENIA is a collection of research abstracts selected from MEDLINE search results for the keywords (MeSH terms) “human, blood cells and transcription factors”
• split into two parts:
– Part I contains 1500 sentences (training data)
– Part II consists of 1000 sentences (test data)
Experimental Results
Figure 1: F-measure vs number of candidate parses.
Experimental Results (cont’d)
Experiments   Recall (%)   Precision (%)   F-Score (%)
Baseline      55.8         55.6            55.7
SVM           59.1         60.2            59.7
NN            57.9         61.8            59.8
LLR           58.5         61.2            59.8
Table 3: Results based on the interaction category.
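The F-scores can be checked against recall and precision, assuming the balanced F-measure (harmonic mean) was used. The sketch also assumes that the values after the row labels SVM, NN and LLR are grouped by column: three recalls, then three precisions, then three F-scores.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (recall, precision, reported F) per row of Table 3, in percent.
rows = {
    "Baseline": (55.8, 55.6, 55.7),
    "SVM": (59.1, 60.2, 59.7),
    "NN": (57.9, 61.8, 59.8),
    "LLR": (58.5, 61.2, 59.8),
}
for name, (recall, precision, reported) in rows.items():
    print(name, round(f_measure(precision, recall), 2), reported)
```

Under this column assignment each recomputed value lies within 0.1 of the reported F-score, which is the size of the rounding error in the inputs.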
Conclusions
• Three reranking methods were presented for the HVS model, applied to extracting protein-protein interactions from biomedical literature.
• Experimental results show that a 4% relative improvement in F-measure can be obtained by reranking the semantic parse results.
• Incorporating other semantic or syntactic information might give further gains.