Crowd-powered knowledge integration
Benjamin Good
MEM seminar January 20, 2016
205H (across from Chair’s office)
@bgood
What are the functions of Fibronectin?
37186 articles
What are the functions of the 238 ‘significant’ genes that came up in my high throughput screen??
…
Gene | Property | Value
Fibronectin | Biological Process | Angiogenesis
Fibronectin | Cellular Localization | Extracellular matrix
Fibronectin | Related Disease | Glomerulopathy
“knowledge integration”, “curation”
“knowledge base”
Answers
Applications of knowledge bases
• Find information
• Plan research
  • “Known unknowns?”
• Interpret data
• Gene Ontology Enrichment Analysis
GO Enrichment Analysis
1) Mice treated to model psoriasis, gene expression measured
2) 115 genes identified
3) Tens of thousands of relevant articles
4) 1 rapid statistical test with a pretty readout…

Gene counts for overrepresented GO categories
inflammatory response | …
defense response | 15
taxis | 9
chemotaxis | 9
wound response | 13
immune response | 14
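The "rapid statistical test" behind a GO enrichment analysis is commonly a one-sided hypergeometric test. The slides do not say which test was used, so the following is only a minimal sketch under that assumption. The 115 identified genes and the 15 "defense response" hits come from the slide; the background size of 20,000 genes and the 400 genes annotated to the category are made-up placeholders.

```python
from math import comb


def hypergeom_enrichment_p(population: int, annotated: int, sample: int, hits: int) -> float:
    """Probability of seeing at least `hits` annotated genes when `sample`
    genes are drawn without replacement from `population` genes, of which
    `annotated` carry the GO term."""
    max_hits = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, max_hits + 1)
    )
    return tail / total


# 115 and 15 are from the slide; 20000 and 400 are hypothetical placeholders.
p_value = hypergeom_enrichment_p(population=20000, annotated=400, sample=115, hits=15)
print(f"p = {p_value:.3g}")
```

A real analysis would repeat this for every GO category and correct for multiple testing. The test can only find categories that curators have already annotated, which is the gap the following slides point at.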
Annotation missing from human GO annotation.
Should be here!
(‘5 HT Receptor’ means ‘Serotonin Receptor’)
Circa 2010
We don’t know what we are missing
[Figure: overrepresented GO categories (inflammatory response, defense response, response to wounding, immune response)]
Serotonin receptor activity? ????
Process of building knowledge bases
1. Do science
2. Publish it
3. Manually extract the knowledge
Gene | Property | Value
Fibronectin | Biological Process | Angiogenesis
Fibronectin | Cellular Localization | Extracellular matrix
Fibronectin | Related Disease | Glomerulopathy
Many scientists, powerful tools, comparatively little reward for curating knowledge
100’s of thousands of scientists vs. 100’s of curators
Professional knowledge integration does not scale up
How can we use this to better manage our collective knowledge?
[Timeline images: 1960, 1974, 1987, 1999, 2015]
Divide and Conquer Algorithm
Big Problem
Split: Smaller problem (×6)
Merge
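The split and merge pattern on this slide can be sketched with merge sort, a textbook divide and conquer algorithm. It is used here only as an illustration of the general pattern; the function names are chosen for this example and do not come from the talk.

```python
def split(problem: list[int]) -> tuple[list[int], list[int]]:
    """SPLIT: cut the big problem into two smaller problems."""
    middle = len(problem) // 2
    return problem[:middle], problem[middle:]


def merge(left: list[int], right: list[int]) -> list[int]:
    """MERGE: combine two solved (sorted) sub-problems into one solution."""
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def divide_and_conquer(problem: list[int]) -> list[int]:
    """Solve recursively: split until trivial, then merge the results."""
    if len(problem) <= 1:
        return problem
    left, right = split(problem)
    return merge(divide_and_conquer(left), divide_and_conquer(right))


print(divide_and_conquer([5, 2, 9, 1, 5, 6]))  # [1, 2, 5, 5, 6, 9]
```

Sorting works well because both steps are cheap and well defined. The rest of the talk asks what plays the role of `split` and `merge` when the problem is a knowledge base.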
Dividing and conquering knowledge integration
• Macro
  • Wikidata global knowledge platform for improving SPLIT and MERGE
• Micro
  • Crowdsourcing for extreme SPLIT
Divide and Conquer (bioinformatics)
Build knowledge base of all biology
Split? Ad hoc
Merge? Ad hoc, very hard
Merging knowledge bases: the language barrier
“Methadone”
ID:N0000000174 | Interacts with: “Moxifloxacin” | May treat: Opioid-Related Disorders
= ?
ID:4095 | Molecular Weight: 309.44518 g/mol | …
= ?
ID:DB00333 | Manufactured by: Roxane laboratories inc
Good for business, bad for science
Google Scholar search shows 469 papers about “identifier mapping” in bioinformatics
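The merge problem can be made concrete with a small sketch. It assumes the fields on the slide group by ID in the order they appear, and that someone has hand-built a mapping from each local ID to one shared key. The shared key `Q179996` is taken from the identifier list in the backup slides; the field names and the helper `merge_records` are hypothetical.

```python
# Three records for the same drug, keyed by three different local IDs.
# Field values are copied from the slide; field names are illustrative.
records = {
    "N0000000174": {
        "interacts_with": "Moxifloxacin",
        "may_treat": "Opioid-Related Disorders",
    },
    "4095": {"molecular_weight_g_per_mol": 309.44518},
    "DB00333": {"manufactured_by": "Roxane laboratories inc"},
}

# Hand-built identifier mapping: the part that "identifier mapping" papers
# are about. Without it, nothing says these three IDs are the same thing.
id_mapping = {
    "N0000000174": "Q179996",
    "4095": "Q179996",
    "DB00333": "Q179996",
}


def merge_records(records: dict, id_mapping: dict) -> dict:
    """Group record fields under a shared identifier."""
    merged: dict = {}
    for local_id, fields in records.items():
        shared_id = id_mapping.get(local_id)
        if shared_id is None:
            continue  # unmapped records are lost to the merge
        merged.setdefault(shared_id, {}).update(fields)
    return merged


methadone = merge_records(records, id_mapping)["Q179996"]
print(sorted(methadone))
```

The merge itself is a few lines; building and maintaining `id_mapping` is the expensive part.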
Global Knowledge Platform
• What would happen if everyone were working on literally the same database?
• Split up work more effectively by increasing cross-institutional awareness
• Reduce the merge problem by working with the same database entities from the outset
Wikidata is to data as Wikipedia is to text
“Giving more people more access to more knowledge”
A free and open repository of knowledge
Managed by the Wikimedia Foundation, which operates Wikipedia
We are seeding it with biomedical data
Gene, Drug, Disease
• All human, mouse genes and proteins
• All FDA approved drugs
• 9,000+ human diseases
Burgstaller et al (2016) Database (preprint in BioRxiv)
Mitraka et al (2015) Semantic Web Applications for the Life Sciences (best paper) (preprint in BioRxiv)
Microbial Genetic Data
• Widely distributed
• Difficult to query
• Not structured in a meaningful way
• A lot of interest from this community!
Tim Putman leading Microbial efforts in Wikidata
• Loading genes, proteins, annotations for 120 reference genomes. Completed 8 genomes so far
• Building a data model in Wikidata to capture multi-organism molecular interactions 1
• Creating a genome browser that will display microbial genes and annotations gathered dynamically from Wikidata
1 Putman et al (2016) (under review) (preprint in BioRxiv)
Dividing and conquering knowledge integration
• Macro
  • Wikidata global knowledge platform for distributed SPLIT and MERGE
• Micro
  • Crowdsourcing for extreme SPLIT
Reading…
…
Gene | Property | Value
Fibronectin | Biological Process | Angiogenesis
Fibronectin | Cellular Localization | Extracellular matrix
Fibronectin | Related Disease | Glomerulopathy
Can we break the task of extracting facts from the literature down so we can distribute it much more broadly (SPLIT)?
Extracting knowledge from text
…
Gene | Property | Value
Fibronectin | Biological Process | Angiogenesis
Fibronectin | Cellular Localization | Extracellular matrix
Fibronectin | Related Disease | Glomerulopathy
1. Find concepts in text
2. Identify relationships between concepts
Identify diseases
Experiments
1. Paid people 6 cents per abstract on the Amazon Mechanical Turk microtask workplace 1
2. Paid people 0 cents per abstract on http://mark2cure.org 2
• In both cases, the aggregated labor of 3 or more non-expert workers was statistically equivalent to that of a single professional
• The process was faster and far less expensive
1 Good et al (2015) Biocomputing
2 Tsueng et al in preparation
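The slide does not say how the work of several non-experts was aggregated. One simple scheme, shown here purely as an assumed illustration, is to keep a disease mention only when at least a minimum number of workers marked it. The worker data and the function name are hypothetical.

```python
from collections import Counter


def aggregate_annotations(worker_annotations: list[set[str]], min_votes: int) -> set[str]:
    """Keep each term that at least `min_votes` workers marked."""
    votes = Counter(term for annotation in worker_annotations for term in annotation)
    return {term for term, count in votes.items() if count >= min_votes}


# Hypothetical disease mentions marked by three workers in one abstract.
workers = [
    {"glomerulopathy", "psoriasis"},
    {"glomerulopathy"},
    {"glomerulopathy", "psoriasis", "fibronectin"},
]

consensus = aggregate_annotations(workers, min_votes=2)
print(sorted(consensus))  # ['glomerulopathy', 'psoriasis']
```

Raising `min_votes` trades recall for precision: the stray "fibronectin" mark from a single worker is dropped at a threshold of 2, and "psoriasis" would also be dropped at 3.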
Relation extraction
Assuming we can find concepts in text, can ‘the crowd’ correctly identify relationships?
Example relation verification task
Answer: it does not say that it causes ulcers; it is used to treat…
BioCreative evaluation
• 500 abstracts
• 0.505 F score (0.475 Precision, 0.540 Recall)
• 5th out of 18 teams
Li et al (2016) Database (preprint in BioRxiv)
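As a check on the reported figures, the F score is the harmonic mean of precision and recall, assuming the balanced (F1) form is the one reported. The precision and recall values below are from the slide.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F score (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f1 = f_score(precision=0.475, recall=0.540)
print(round(f1, 3))  # 0.505
```

The result matches the 0.505 on the slide, so the three reported numbers are consistent with each other.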
Different approaches produce very different results. Ongoing work to understand why.
Winner BioCreative (machine learning)
Collaborators, different machine learning scheme
Scripps Entry
“Ground Truth”
(Thanks to Alex Pico, WikiPathways)
[Figure: ideas, data, Knowledge. One label reads “Very good at this..”; three labels read “Could be better”.]
http://biobranch.org
http://knowledge.bio
The point… is to help you
Thanks!
• Gene Wiki Team
Andra Waagmeester (Micelio)
* Sebastian Burgstaller (Scripps)
* Tim Putman (Scripps)
* Elvira Mitraka (U Maryland)
Julia Turner (Scripps)
Justin Leong (UBC)
Lynn Schriml (U Maryland)
Paul Pavlidis (UBC)
• Microtask Team
* Toby Li (Scripps)
* Ginger Tsueng (Scripps)
Max Nanis (Scripps)
Jennifer Fouquier (Scripps)
Jake Bruggeman (Scripps)
• Bioinformatics Games Team
Margaret Wallace (Playmatics)
Nick Fortugno (Playmatics)
Melanie Stegman (Science Game Center)
• http://knowledge.bio
Richard and Kenneth Bruskiewich (Star informatics)
Farzon Ahmed (Star informatics)
• http://biobranch.org
* Karthik G (Scripps)
• Grant Writing and Management Team
Andrew Su (Scripps)
Chunlei Wu (Scripps)
* First author on manuscript cited in this presentation
Today
Another day
Depending on your database, Methadone = one or more of:
• 3953
• 00567621
• /m/058gq
• 76-99-3
• 4095
• C₂₁H₂₇NO
• 1S/C21H27NO/c1-5-20(23)21(16-17(2)22(3)4,18-12-8-6-9-13-18)19-14-10-7-11-15-19/h6-15,17H,5,16H2,1-4H3
• USSIQXCVUWKGNF-UHFFFAOYSA-N
• 6807
• CHEMBL651
• 00333
• UC6VBE7V1Z
• 4038959-5
• CCC(=O)C(CC(C)N(C)C)(c1ccccc1)c2ccccc2
• C07163
• N07BC02
• 5458
• N0000147909
• Q179996
• ...
Auto-Merge
• Group from Vienna independently loaded drug-drug interactions 1
• Without our work or even our awareness, this content integrated with our content to enable new, otherwise impossible queries:
  What clinically relevant drug-drug interactions are known for the drug methadone (NDF-RT N0000000174)? 2
• A fundamentally different process than existed before!
1 Pfunder et al (2015) Journal of Medical Internet Research
2 Mitraka et al (2015) Semantic Web Applications for the Life Sciences (best paper) (preprint in BioRxiv)
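A question like the one about methadone could be phrased as a SPARQL query against Wikidata. The sketch below only builds the query text and sends nothing over the network. The property IDs are assumptions that should be checked against Wikidata before use: `P2115` is assumed to be the NDF-RT ID property and `P769` the drug interaction property. The `wdt:`, `wikibase:` and `bd:` prefixes are the ones the Wikidata Query Service predefines. This is not the query from the cited paper.

```python
def build_interaction_query(
    ndfrt_id: str,
    ndfrt_property: str = "P2115",        # assumed: NDF-RT ID property
    interaction_property: str = "P769",   # assumed: drug interaction property
) -> str:
    """Build a SPARQL query listing the drugs that interact with the drug
    carrying the given NDF-RT identifier."""
    return f"""
SELECT ?drug ?drugLabel ?interactingDrug ?interactingDrugLabel WHERE {{
  ?drug wdt:{ndfrt_property} "{ndfrt_id}" .
  ?drug wdt:{interaction_property} ?interactingDrug .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
""".strip()


query = build_interaction_query("N0000000174")
print(query)
```

The query joins one group's identifier statements with another group's interaction statements in a single pattern, which is only possible because both sets of statements hang off the same Wikidata item.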