Crowd-powered knowledge integration Benjamin Good MEM seminar January 20, 2016 205H (across from Chair’s office) @bgood [email protected]

Crowd-powered knowledge integration

Benjamin Good

MEM seminar January 20, 2016

205H (across from Chair’s office)

@bgood

[email protected]

“knowledge”

• A lot

• Important

• Text

What are the functions of Fibronectin?

37186 articles

What are the functions of the 238 ‘significant’ genes that came up in my high throughput screen??


Gene         Property               Value
Fibronectin  Biological Process     Angiogenesis
Fibronectin  Cellular Localization  Extracellular matrix
Fibronectin  Related Disease        Glomerulopathy
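One way to picture such a knowledge base is as a list of (gene, property, value) triples indexed by gene. The sketch below is illustrative only: the three rows come from the slide, while the function names `build_index` and `lookup` and the gene name `SomeUnannotatedGene` are hypothetical.

```python
from collections import defaultdict

# Rows copied from the slide, stored as (gene, property, value) triples.
TRIPLES = [
    ("Fibronectin", "Biological Process", "Angiogenesis"),
    ("Fibronectin", "Cellular Localization", "Extracellular matrix"),
    ("Fibronectin", "Related Disease", "Glomerulopathy"),
]


def build_index(triples):
    """Group values by gene and then by property for fast lookup."""
    index = defaultdict(lambda: defaultdict(list))
    for gene, prop, value in triples:
        index[gene][prop].append(value)
    return index


def lookup(index, genes):
    """Return {gene: {property: [values]}}; genes with no rows map to {}."""
    result = {}
    for gene in genes:
        rows = index.get(gene, {})  # .get avoids creating empty entries
        result[gene] = {prop: list(values) for prop, values in rows.items()}
    return result


if __name__ == "__main__":
    index = build_index(TRIPLES)
    # "SomeUnannotatedGene" is a made-up name standing in for a gene with no rows.
    print(lookup(index, ["Fibronectin", "SomeUnannotatedGene"]))
```

With this structure, asking about 238 genes is a loop over a list rather than a reading assignment of tens of thousands of articles. A gene that maps to an empty dictionary may simply be unannotated, not functionless.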

“knowledge integration”

“curation”

“knowledge base”

Answers

Knowledge Bases


1,500+ listed at http://www.oxfordjournals.org/nar/database/a/

Applications of knowledge bases

• Find information
• Plan research
  • “Known unknowns?”
• Interpret data
  • Gene Ontology Enrichment Analysis

Interesting Gene List → Gene Ontology, Pathway, Network interpretation

GO Enrichment Analysis


1) Mice treated to model psoriasis, gene expression measured
2) 115 genes identified
3) Tens of thousands of relevant articles
4) 1 rapid statistical test with a pretty readout…

Gene counts for overrepresented GO categories:
• inflammatory response: …
• defense response: 15
• taxis: 9
• chemotaxis: 9
• Wound response: 13
• immune response: 14
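The slides do not say which statistical test was used. A common choice for GO enrichment is a one-sided hypergeometric test, sketched below under that assumption. The 115 genes and the count of 15 for "defense response" come from the slides; the background size of 20,000 genes and the 400 genes annotated to the term are hypothetical placeholders.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, hits):
    """One-sided hypergeometric tail probability P(X >= hits).

    population: number of genes in the background set
    annotated:  background genes annotated with the GO term
    sample:     size of the interesting gene list
    hits:       genes in the list annotated with the GO term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # 115 and 15 are from the slides; 20000 and 400 are made-up background numbers.
    p = enrichment_p_value(population=20000, annotated=400, sample=115, hits=15)
    print(f"p-value for 'defense response': {p:.3g}")
```

In practice, one such test is run per GO term, and the resulting p-values are corrected for multiple testing. A gene that lacks an annotation cannot contribute a hit, which is why gaps in the knowledge base matter for this analysis.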

Knowledge bases are important tools and will only grow more important over time


Great!


BUT


GO annotation is not complete

Annotation missing from human GO annotation.

Should be here!

(‘5 HT Receptor’ means ‘Serotonin Receptor’)

Circa 2010

Added to GO Jan. 2016

First characterized 1996 (Kohen et al, J Neurochem)

We don’t know what we are missing


inflammatory response

defense response

Serotonin receptor activity?

????

response to wounding

immune response

“Gene Ontology, it’s great, right?”

• “It sucks”

• “I only use it out of desperation”

WHY?!

Process of building knowledge bases

1. Do science
2. Publish it
3. Manually extract the knowledge

Gene         Property               Value
Fibronectin  Biological Process     Angiogenesis
Fibronectin  Cellular Localization  Extracellular matrix
Fibronectin  Related Disease        Glomerulopathy

why does he look so down?

Many scientists, powerful tools, comparatively little reward for curating knowledge

100’s of thousands 100’s

More than 2 articles published/minute

Professional knowledge integration does not scale up

1. Do science
2. Publish it
3. Manually extract the knowledge

Gene         Property               Value
Fibronectin  Biological Process     Angiogenesis
Fibronectin  Cellular Localization  Extracellular matrix
Fibronectin  Related Disease        Glomerulopathy


One thing is scaling up with the scientific literature.

[Figure: images labeled 1960, 1974, 1987, 1999, 2015]

How can we use this to better manage our collective knowledge?


Divide and Conquer Algorithm

[Diagram: a Big Problem is Split into six Smaller problems, whose results are then Merged]
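The slides do not name a particular algorithm. Merge sort is used here only as a familiar textbook instance of the split-then-merge pattern, and the function names are my own.

```python
def merge(left, right):
    """MERGE: combine two already-sorted lists into one sorted list."""
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    # One of the two lists is exhausted; append whatever remains of the other.
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def merge_sort(items):
    """Sort by splitting into smaller problems and merging their solutions."""
    if len(items) <= 1:  # a problem this small is already solved
        return list(items)
    mid = len(items) // 2
    # SPLIT: each half is an independent, smaller problem.
    left = merge_sort(items[:mid])
    right = merge_sort(items[mid:])
    return merge(left, right)


if __name__ == "__main__":
    print(merge_sort([5, 2, 9, 1, 5, 6]))
```

The pattern works well here because the split is trivial and the merge is cheap and well defined. For knowledge integration, the talk argues that neither holds by default.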

Dividing and conquering knowledge integration

• Macro
  • Wikidata global knowledge platform for improving SPLIT and MERGE
• Micro
  • Crowdsourcing for extreme SPLIT


Divide and Conquer (bioinformatics)

Build knowledge base of all biology

Split? Ad hoc

Merge? Ad hoc, very hard

Merging knowledge bases: the language barrier

“Methadone”
• ID:N0000000174; Interacts with: “Moxifloxacin”; May treat: Opioid-Related Disorders
• ID:4095; Molecular Weight: 309.44518 g/mol
• ID:DB00333; Manufactured by: Roxane laboratories inc

ID:N0000000174 = ID:4095 = ID:DB00333 ?

Good for business, bad for science

Google Scholar search shows 469 papers about “identifier mapping” in bioinformatics
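A minimal sketch of why identifier mapping is needed before records can be merged. The identifiers and facts are taken from the methadone slide, and the grouping of facts under each identifier is my reading of its layout. The source names `A`, `B`, `C`, the field names, and the shared key `"methadone"` are hypothetical.

```python
# Records as three separate databases might hold them (values from the slide).
SOURCES = {
    "A": {"N0000000174": {"interacts_with": "Moxifloxacin",
                          "may_treat": "Opioid-Related Disorders"}},
    "B": {"4095": {"molecular_weight_g_per_mol": 309.44518}},
    "C": {"DB00333": {"manufactured_by": "Roxane laboratories inc"}},
}

# Hypothetical hand-curated mapping: (source, local identifier) -> shared key.
ID_MAP = {
    ("A", "N0000000174"): "methadone",
    ("B", "4095"): "methadone",
    ("C", "DB00333"): "methadone",
}


def merge_sources(sources, id_map):
    """Merge per-source records under shared keys.

    Returns (merged, unmapped). Records whose identifier is missing from
    id_map cannot be merged and are reported instead of silently dropped.
    """
    merged = {}
    unmapped = []
    for source_name, records in sources.items():
        for local_id, fields in records.items():
            key = id_map.get((source_name, local_id))
            if key is None:
                unmapped.append((source_name, local_id))
                continue
            merged.setdefault(key, {}).update(fields)
    return merged, unmapped


if __name__ == "__main__":
    merged, unmapped = merge_sources(SOURCES, ID_MAP)
    print(merged)
    print(unmapped)
```

Everything depends on `ID_MAP`, which someone has to build and maintain for every pair of databases.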

Global Knowledge Platform

• What would happen if everyone was working on literally the same database?

• Split up work more effectively by increasing cross-institutional awareness

• Reduce the merge problem by working with the same database entities from the outset

Wikidata is to data as Wikipedia is to text

“Giving more people more access to more knowledge”

A free and open repository of knowledge

Managed by the Wikimedia Foundation, which operates Wikipedia

It’s a knowledge base!

• Anyone can edit

• Anyone can use

Item: Q84

We are seeding it with biomedical data

Gene, Drug, Disease

• All human, mouse genes and proteins

• All FDA approved drugs

• 9,000+ human diseases

Burgstaller et al (2016) Database (preprint in BioRxiv)

Mitraka et al (2015) Semantic Web Applications for the Life Sciences (best paper) (preprint in BioRxiv)

Nurturing a multi-community garden of biomedical knowledge

Gene, Drug, Disease

Application #1 (of many)

Burgstaller et al (2016) Database (preprint in BioRxiv)

Microbial Genetic Data

• Widely distributed
• Difficult to query
• Not structured in a meaningful way
• A lot of interest from this community!

Microbial Genetic Data

Tim Putman leading Microbial efforts in Wikidata

• Loading genes, proteins, annotations for 120 reference genomes. Completed 8 genomes so far

• Building a data model in Wikidata to capture multi-organism molecular interactions 1

• Creating a genome browser that will display microbial genes and annotations gathered dynamically from Wikidata

1 Putman et al (2016) (under review) (preprint in BioRxiv)

Dividing and conquering knowledge integration

• Macro
  • Wikidata global knowledge platform for distributed SPLIT and MERGE
• Micro
  • Crowdsourcing for extreme SPLIT


Reading…

Gene         Property               Value
Fibronectin  Biological Process     Angiogenesis
Fibronectin  Cellular Localization  Extracellular matrix
Fibronectin  Related Disease        Glomerulopathy

Can we break the task of extracting facts from the literature down so we can distribute it much more broadly (SPLIT) ?

Extracting knowledge from text

Gene         Property               Value
Fibronectin  Biological Process     Angiogenesis
Fibronectin  Cellular Localization  Extracellular matrix
Fibronectin  Related Disease        Glomerulopathy

1. Find concepts in text
2. Identify relationships between concepts

Find the diseases

Aggregate multiple judgments for quality

Identify diseases
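A minimal sketch of aggregating judgments by simple vote counting. The slides do not describe the exact aggregation scheme, so the threshold rule, the `min_votes` parameter, and the example annotations below are all assumptions made for illustration.

```python
from collections import Counter


def aggregate_annotations(worker_annotations, min_votes):
    """Keep each annotation marked by at least `min_votes` distinct workers.

    worker_annotations: one collection of annotations per worker, for example
    the disease mentions that worker highlighted in a single abstract.
    """
    votes = Counter()
    for annotations in worker_annotations:
        votes.update(set(annotations))  # each worker votes once per annotation
    return {annotation for annotation, n in votes.items() if n >= min_votes}


if __name__ == "__main__":
    # Hypothetical highlights from three workers on the same abstract.
    workers = [
        ["psoriasis", "glomerulopathy"],
        ["psoriasis", "glomerulopathy", "mouse"],
        ["psoriasis"],
    ]
    print(aggregate_annotations(workers, min_votes=2))
```

Raising `min_votes` trades recall for precision: fewer annotations survive, but each has more independent support.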

Experiments
1. Paid people 6 cents per abstract on the Amazon Mechanical Turk microtask workplace 1
2. Paid people 0 cents per abstract on http://mark2cure.org 2

• In both cases, the aggregated labor of 3 or more non-expert workers was statistically equivalent to that of a single professional
• The process was faster and far less expensive

1 Good et al (2015) Biocomputing
2 Tsueng et al, in preparation

Relation extraction

Assuming we can find concepts in text, can ‘the crowd’ correctly identify relationships ?

Chemical Causes Disease Workflow

Li et al (2016) Database (preprint in BioRxiv)

Example relation verification task

Answer: the text does not say that it causes ulcers; it is used to treat…

BioCreative evaluation

• 500 abstracts

• 0.505 F score (0.475 Precision, 0.540 Recall)

• 5th out of 18 teams
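The reported score is consistent with the balanced F score (F1), the harmonic mean of precision and recall. The slide says only "F score", so F1 is an assumption, though the arithmetic below supports it.

```python
def f_score(precision, recall):
    """Balanced F score (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Precision and recall as reported on the slide.
    print(round(f_score(0.475, 0.540), 3))  # prints 0.505
```

Because the harmonic mean is pulled toward the lower of the two values, a system cannot reach a high F score by maximizing recall alone.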

Li et al (2016) Database (preprint in BioRxiv)

Different approaches produce very different results. Ongoing work to understand why.

Winner BioCreative (machine learning)

Collaborators, different machine learning scheme

Scripps Entry

“Ground Truth”

(Thanks to Alex Pico, WikiPathways)

[Diagram: ideas, data, Knowledge. One step is labeled “Very good at this..”; the other three are labeled “Could be better”]

http://biobranch.org

http://knowledge.bio

The point… is to help you

Thanks!

• Gene Wiki Team

Andra Waagmeester (Micelio)

* Sebastian Burgstaller (Scripps)

* Tim Putman (Scripps)

* Elvira Mitraka (U Maryland)

Julia Turner (Scripps)

Justin Leong (UBC)

Lynn Schriml (U Maryland)

Paul Pavlidis (UBC)

• Microtask Team

* Toby Li (Scripps)

* Ginger Tsueng (Scripps)

Max Nanis (Scripps)

Jennifer Fouquier (Scripps)

Jake Bruggeman (Scripps)

• Bioinformatics Games Team

Margaret Wallace (Playmatics)

Nick Fortugno (Playmatics)

Melanie Stegman (Science Game Center)

• http://knowledge.bio

Richard and Kenneth Bruskiewich (Star informatics)

Farzon Ahmed (Star informatics)

• http://biobranch.org

* Karthik G (Scripps)

[email protected]

• Grant Writing and Management Team

Andrew Su (Scripps)

Chunlei Wu (Scripps)

* First author on manuscript cited in this presentation

Today

Another day

3-species metabolism, modeled in wikidata. Putman 2016

Results 593 abstracts compared to gold standard

• 7 days
• $192.90
• 17 workers

F = 0.81, k = 2

Errors from CDR workflow

CDR results: impact of voting

Identifiers form the foundation

17 identifier schemes integrated into wikidata item

Drug

Depending on your database, Methadone = one or more of:

• 3953

• 00567621

• /m/058gq

• 76-99-3

• 4095

• C₂₁H₂₇NO

• 1S/C21H27NO/c1-5-20(23)21(16-17(2)22(3)4,18-12-8-6-9-13-18)19-14-10-7-11-15-19/h6-15,17H,5,16H2,1-4H3

• USSIQXCVUWKGNF-UHFFFAOYSA-N

• 6807

• CHEMBL651

• 00333

• UC6VBE7V1Z

• 4038959-5

• CCC(=O)C(CC(C)N(C)C)(c1ccccc1)c2ccccc2

• C07163

• N07BC02

• 5458

• N0000147909

• Q179996

• ...

Auto-Merge

• Group from Vienna independently loaded Drug-Drug interactions 1

• Without our work or even our awareness, this content integrated with our content to enable new, otherwise impossible queries:

• A fundamentally different process than existed before!

1 Pfunder et al (2015) Journal of Medical Internet Research

What clinically relevant drug-drug interactions are known for the drug methadone (NDF-RT N0000000174)? 2

2 Mitraka et al (2015) Semantic Web Applications for the Life Sciences (best paper) (preprint in BioRxiv)

