Post on 15-Apr-2017
Bio-RE for KG completion
Gotta catch’em all!™
Claudiu Mihăilă
Because it matters
Drug development process
Signalling pathways
Scientia potentia est
• ∼26M biomedical articles indexed
• ∼3500 articles per day in 2015
• More information than any one person can comprehend
Structured databases
• High number of DBs
• Manually curated by experts
but
• Long backlog
• Limited coverage, reflecting bias of the curators
• Limited linkage to literature evidence
ML Objective
Resistance to IL-10 inhibition of interferon gamma production and expression of suppressor of cytokine signalling 1 in CD4+ T cells from patients with rheumatoid arthritis.
Supervised ML approaches
BioNLP-ST’16 – Task GE4 – NFκB KB construction
• 2-stage classification with SVM/LR + post-processing
• RNNs with LSTM units on words, PoSs, dependencies
System  P     R     F1
Evex    0.47  0.32  0.38
TEES    0.45  0.33  0.38
VERSE   0.60  0.23  0.33
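The F1 column above is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short sketch below recomputes it from the reported P and R values as a sanity check; the helper name `f1` is illustrative and not part of the shared task's evaluation tooling.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R, reported F1) as listed on the slide.
reported = {
    "Evex": (0.47, 0.32, 0.38),
    "TEES": (0.45, 0.33, 0.38),
    "VERSE": (0.60, 0.23, 0.33),
}

for system, (p, r, reported_f1) in reported.items():
    print(f"{system}: computed F1 = {f1(p, r):.2f}, reported F1 = {reported_f1:.2f}")
```

VERSE has the highest precision of the three, but its lower recall pulls its F1 below the other two systems.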
Supervised ML
• High confidence annotations
• Literature evidence marked up
but
• Limited coverage/high bias
• Limited number of corpora
• Expensive and slow to produce training data
Distant Supervision
Combining DBs and Supervised ML
Align relation candidates extracted from text with known relationships, and use the structured database as a distant supervision training signal
• no bias to specific genre
• cheap and fast to produce fairly large training sets
Distant Supervision Example
PubMed
Mutations in the gene encoding the TAR DNA-binding protein 43 have been identified in some familial amyotrophic lateral sclerosis (ALS).
A novel missense mutation in a highly conserved region of TDP-43 was identified in a patient with sporadic ALS.
We screened the TARDBP mutation in 721 Japanese ALS by direct sequencing.
DB:IS_ASSOCIATED_WITH
TDP43, ALS
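The alignment in this example can be sketched roughly as follows. This is a minimal illustration, not the pipeline used in the talk: the synonym dictionaries stand in for real entity recognition and resolution, and the names `KB`, `GENE_SYNONYMS`, `DISEASE_SYNONYMS`, `find_entities`, `distant_labels` and the `NO_RELATION` label are all hypothetical.

```python
import re
from itertools import product

# Toy knowledge base: (gene, disease) -> relation, as in the DB entry above.
KB = {("TDP43", "ALS"): "IS_ASSOCIATED_WITH"}

# Hypothetical lexicons mapping surface forms to canonical identifiers.
# A real system would use entity recognition and resolution instead.
GENE_SYNONYMS = {
    "TAR DNA-binding protein 43": "TDP43",
    "TDP-43": "TDP43",
    "TARDBP": "TDP43",
}
DISEASE_SYNONYMS = {
    "amyotrophic lateral sclerosis": "ALS",
    "ALS": "ALS",
}


def find_entities(sentence, lexicon):
    """Return the canonical IDs of all lexicon entries mentioned in the sentence."""
    found = set()
    for surface, canonical in lexicon.items():
        if re.search(r"\b" + re.escape(surface) + r"\b", sentence):
            found.add(canonical)
    return found


def distant_labels(sentences):
    """Label every co-occurring (gene, disease) pair using the KB as weak supervision."""
    examples = []
    for sentence in sentences:
        genes = find_entities(sentence, GENE_SYNONYMS)
        diseases = find_entities(sentence, DISEASE_SYNONYMS)
        for gene, disease in product(sorted(genes), sorted(diseases)):
            # Pairs absent from the KB are treated as negatives (an assumption).
            label = KB.get((gene, disease), "NO_RELATION")
            examples.append((sentence, gene, disease, label))
    return examples


sentences = [
    "Mutations in the gene encoding the TAR DNA-binding protein 43 have been "
    "identified in some familial amyotrophic lateral sclerosis (ALS).",
    "A novel missense mutation in a highly conserved region of TDP-43 was "
    "identified in a patient with sporadic ALS.",
    "We screened the TARDBP mutation in 721 Japanese ALS by direct sequencing.",
]

for _, gene, disease, label in distant_labels(sentences):
    print(gene, disease, label)
```

Every sentence that mentions both entities inherits the database label, whether or not it actually asserts the relation. That is the source of the noise that distant supervision trades for cheap, large training sets.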
Famous work
DeepDive (Stanford/Lattice)
• Gene-gene interaction from PLOS biomedical journals
• Uses BIOGRID for distant supervision
Literome
• Protein regulation event extraction from PubMed abstracts
• Uses the Pathway Interaction Database for distant supervision
Information Extraction Pipeline
PM, PMC
Entity Recognition
Entity Resolution
Syntactic Parsing
OpenIE Relations
Knowledge graph
Distant supervision Pipeline
Bio DBs
PM, PMC
Knowledge Graph
Distant Supervision
Extracted Relations
Curation
Information Extraction
Curation
Examples of learned relations
(IS_ASSOCIATED_WITH, CAH, WNK1)
. . . mineralocorticoid excess can be caused by congenital adrenal hyperplasia (CAH) . . . due to mutations in the WNK1, WNK4, KLHL3, CUL3 genes.
PM:22932914
(IS_ASSOCIATED_WITH, Brachydactyly, CHSY1)
Our results place Chsy1 as an essential regulator of joint patterning and provide a mouse model of human brachydactylies caused by mutations in CHSY1.
PM:22280990
(IS_ASSOCIATED_WITH, Cushing Syndrome, KCNJ5)
. . . these mutations, in addition to mutations in the KCNJ5 gene . . ., may be responsible for the tumorigenesis of APAs and CPAs with subclinical Cushing’s syndrome.
PM:26743443
Challenges
• Error propagation from upstream tasks
• Cross-sentence relations
• Long tails: overfitting to the more common entity/entity pairs
• Speculation, negation, changes over time, conflicting information
Thank you for your attention!
Any questions?