Date posted: 10-May-2015
Category: Technology
Uploaded by: andrew-su
Crowdsourcing to structure biological knowledge
Andrew Su, Ph.D.
Department of Molecular and Experimental Medicine
The Scripps Research Institute
ISI, USC
August 16, 2012
Human genetics underlies human health
~3 billion bases
~23,000 genes
Molecular diagnostics & therapeutics
Molecular understanding of:
• Biological function
• Genetic variation
• Mutation
• Deletion
• Amplification
• …
“Gene annotation”
Structured gene annotations enable computation
Structured annotations
Few genes are well annotated
[Figure: counts per gene for 23,278 protein-coding genes, sorted by decreasing counts; series: Gene Ontology (GO) and PubMed; chart labels: 38%, 59%]
Most-annotated genes: TP53, TNF, APOE, MTHFR, IL6, HLA-DRB1, VEGFA, EGFR, TGFB1, ACE
Data: NCBI gene2pubmed, August 2010
Biocuration is a key annotation bottleneck
[Figure: number of PubMed-indexed articles per year, 1979 to 2009; y-axis from 0 to 1,000,000]
311,696 articles (1.5% of PubMed) have been cited by GO annotations
Sooner or later, the research community will
need to be involved in the annotation effort to scale
up to the rate of data generation.
The Long Tail is a prolific source of content
[Figure: content produced vs. contributors (sorted), divided into a Short Head and a Long Tail]
Category | Short Head | Long Tail
News | Newspapers | Blogs
Video | TV/Hollywood | YouTube
Product reviews | Consumer reports | Amazon reviews
Food reviews | Food critics | Yelp
Talent judging | Olympics | American Idol
We can harness the Long Tail of scientists to directly participate in
the gene annotation process.
From crowdsourcing to structured data
The Gene Wiki
Biological Games
10,000 gene “stubs” within Wikipedia
Protein structure
Symbols and identifiers
Tissue expression pattern
Gene Ontology annotations
Links to structured databases
Gene summary
Protein interactions
Linked references
Huss, PLoS Biol, 2008
Utility
Users
Contributors
Gene Wiki has a critical mass of readers
Total: 4.0 million views / month
Huss, PLoS Biol, 2008; Good, NAR, 2011
Gene Wiki has a critical mass of editors
Increase of ~10,000 words / month from >1,000 edits
Currently 1.42 million words
Approximately equal to 230 full-length articles
Good, NAR, 2011
[Figure: editor count and edit count; series: Editors, Edits]
A review article for every gene is powerful
Hyperlinks to related concepts
References to the literature
Reelin: 68 editors, 543 edits since July 2002
Heparin: 175 editors, 320 edits since June 2003
AMPK: 44 editors, 84 edits since March 2004
RNAi: 232 editors, 708 edits since October 2002
Filtering, extracting, and summarizing PubMed
Documents
Concepts
Document- and concept-centric text mining
Subject Object
Predicate
Simple text mining for gene annotations
Wikilink
GO exact match
Gene Wiki mapping
NCBI Entrez Gene: 334
Candidate assertion
GO:0006897
6319 novel Gene Ontology annotations
2147 novel Disease Ontology annotations
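The pipeline on this slide (wikilink, GO exact match, Gene Wiki mapping, candidate assertion) can be sketched roughly as follows. This is a minimal illustration, not the actual Gene Wiki mining code: the lookup tables, the page title, and the function name are assumptions made for the example, and only the identifiers 334 and GO:0006897 come from the slide.

```python
import re

# Illustrative lookup tables (assumptions). A real run would load the full
# Gene Ontology term list and the Gene Wiki page-to-gene mapping.
GO_TERMS = {"endocytosis": "GO:0006897"}
PAGE_TO_ENTREZ = {"Example gene page": 334}  # hypothetical page title

# Captures the link target of [[target]], [[target|label]] or [[target#section]].
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")

def candidate_assertions(page_title, wikitext):
    """Return sorted (entrez_id, go_id) pairs for wikilinks whose target
    exactly matches a GO term name (case-insensitive)."""
    entrez_id = PAGE_TO_ENTREZ[page_title]
    found = set()
    for target in WIKILINK.findall(wikitext):
        go_id = GO_TERMS.get(target.strip().lower())
        if go_id:
            found.add((entrez_id, go_id))
    return sorted(found)

text = "The protein is involved in [[endocytosis]] and binds [[heparin|heparin chains]]."
print(candidate_assertions("Example gene page", text))
```

Exact matching keeps the approach simple and favors precision; links that do not name a GO term, such as the heparin link above, are ignored.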
Gene Wiki+ for integrative queries
http://genewikiplus.org
mwsync
Dynamic queries across genes, diseases, SNPs
TOP 100 GENES
Gene Wiki+ for integrative queries
http://genewikiplus.org
mwsync
{{#ask: [[Category:Human_proteins]] [[is_associated_with:: <q>[[Category:Breast_cancer]]</q>]] [[HasSNP:: <q>[[is_associated_with:: <q>[[Category:Breast_cancer]]</q>]] </q>]]}}
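The query above nests one subquery inside another: it asks for human proteins associated with breast cancer that also have a SNP associated with breast cancer. As a rough sketch, the same string can be assembled from small helpers so that the disease category becomes a parameter. The helper names are hypothetical; only the category and property names come from the slide.

```python
def category(name):
    """Condition matching pages in a category."""
    return f"[[Category:{name}]]"

def subquery(condition):
    """Wrap a condition in Semantic MediaWiki subquery tags."""
    return f"<q>{condition}</q>"

def prop(name, value):
    """Condition on a semantic property."""
    return f"[[{name}::{value}]]"

def proteins_with_snps_for(disease_category):
    """Build an #ask query for human proteins that are associated with the
    disease and have a SNP associated with the same disease."""
    associated = prop("is_associated_with", subquery(category(disease_category)))
    snp_condition = prop("HasSNP", subquery(associated))
    conditions = [category("Human_proteins"), associated, snp_condition]
    return "{{#ask: " + " ".join(conditions) + "}}"

query = proteins_with_snps_for("Breast_cancer")
print(query)
```

Apart from whitespace, the printed string is the query shown on the slide; swapping in another disease category reuses the same structure.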
…
OMIM, PharmGKB
From crowdsourcing to structured data
The Gene Wiki
Biological Games
Not just the biomedical literature…
BioGPS aggregates gene-centric information
http://biogps.org
The plugin interface is simple and universal
KEGG: http://www.genome.jp/dbget-bin/www_bget?hsa:{{EntrezGene}}
STRING: http://string-db.org/newstring_cgi?...&identifier={{EnsemblGene}}
Pubmed: http://www.ncbi.nlm.nih.gov/sites/entrez?...&Term={{Symbol}}
URL template
Gene entity
Rendered URL
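The plugin mechanism combines a URL template with a gene entity to produce a rendered URL. A minimal sketch of that substitution step is shown below. It is not BioGPS's own implementation: the function name and the dictionary layout are assumptions, and the gene entity uses the identifiers for human CDK2 purely as example values. Only the KEGG template is used, because the other two templates are abbreviated on the slide.

```python
import re

# Matches placeholders of the form {{Key}}.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")

def render(template, gene):
    """Replace each {{Key}} in the template with the matching gene identifier."""
    def substitute(match):
        key = match.group(1)
        if key not in gene:
            raise KeyError(f"gene entity has no identifier named {key!r}")
        return str(gene[key])
    return PLACEHOLDER.sub(substitute, template)

# Example gene entity (assumed layout; identifiers for human CDK2).
gene = {"Symbol": "CDK2", "EntrezGene": 1017, "EnsemblGene": "ENSG00000123374"}
kegg_template = "http://www.genome.jp/dbget-bin/www_bget?hsa:{{EntrezGene}}"
print(render(kegg_template, gene))
```

Because a plugin is only a template string, registering a new resource requires no code on the resource's side, which is what makes the interface universal.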
Total of 389 gene-centric online databases registered as BioGPS plugins
BioGPS has a critical mass of users
• > 4100 registered users
• 4000 unique visitors per week
• 40,000 page views per week
Top 10 organizations
1. Harvard
2. NIH
3. UCSD
4. Scripps
5. MIT
6. Cambridge
7. U Penn
8. Stanford
9. Wash U
10. UNC
Daily pageviews
All resources should provide RDF…
Mining structured content from HTML
Defining a data extraction template
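One simple way to picture a data extraction template is a mapping from field names to patterns that locate each value in a page's HTML. The sketch below is an assumption-laden illustration rather than the template format used in BioGPS: the field names, patterns, and sample page are all invented for the example.

```python
import re

# Hypothetical template: one regular expression per field, each with a
# single capture group around the value to extract.
TEMPLATE = {
    "symbol": r'<h1 class="gene">([^<]+)</h1>',
    "chromosome": r"<td>Chromosome</td>\s*<td>([^<]+)</td>",
}

def extract(html, template):
    """Apply every field pattern to the page; missing fields become None."""
    record = {}
    for field, pattern in template.items():
        match = re.search(pattern, html)
        record[field] = match.group(1).strip() if match else None
    return record

# Invented sample page for the example.
page = """
<h1 class="gene">TP53</h1>
<table><tr><td>Chromosome</td> <td>17</td></tr></table>
"""
print(extract(page, TEMPLATE))
```

Regular expressions are brittle against layout changes, so a production template would more likely use a real HTML parser; the point here is only that one template turns every page of a resource into a structured record.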
…
TP53 TNF APOE IL6 VEGF …EGFR TGFB1
All resources should provide flat files…
From crowdsourcing to structured data
The Gene Wiki
Biological Games
Seven million human hours
http://www.flickr.com/photos/archana3k1/4124330493/
Twenty million human hours
http://www.flickr.com/photos/ableman/2171326385/
150 billion human hours per year
http://www.flickr.com/photos/rvp-cw/6243289302/
Using games to fold proteins
Fold.it players have successfully:
• Outperformed state-of-the-art protein folding algorithms (Cooper, Nature, 2010)
• Solved a previously intractable crystal structure (Khatib, Nat Struct Mol Biol, 2011)
• Designed an improved protein folding algorithm (Khatib, PNAS, 2011)
• Improved enzyme activity of a de novo designed enzyme (Eiben, Nat Biotechnol, 2011)
Using games to annotate gene-disease links
http://genegames.org
If it's ‘right’, you get points
then on to the next question
Click the related disease
hurry!
Dizeez players seem pretty smart…
In total:
• 207 unique gamers
• 1045 games played
• 8525 guesses
# Occurrences Gene Disease
7 GAST gastrinoma
7 RBP3 retinoblastoma
7 SSX1 synovial sarcoma
6 TG Graves' disease
6 CRYGC Cataract
6 SOX8 mental retardation
6 WRN Werner syndrome
6 ABL1 leukemia
6 MLL3 leukemia
6 SNAI2 breast carcinoma
Pubmed OMIM PharmGKB Gene Wiki
Dizeez players seem pretty smart…
# Occurrences Gene Disease
5 MECOM sarcoma
4 ATF7 cancer
3 ABCB5 acute myeloid leukemia
3 SART1 glioblastoma
3 NCK1 leukemia
3 NEK1 cancer
Pubmed OMIM PharmGKB Gene Wiki
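The occurrence tables can be read as a tally of how often players picked the same gene-disease pair. A minimal sketch of that tally, with an optional filter against pairs already present in a reference set, is given below. The function name, the threshold, and the toy data are assumptions for illustration and are not taken from the Dizeez results.

```python
from collections import Counter

def tally(guesses, known_pairs, min_count=3):
    """Count repeated (gene, disease) guesses and keep pairs that reach
    min_count and are absent from the known_pairs reference set."""
    counts = Counter(guesses)
    return [
        (n, gene, disease)
        for (gene, disease), n in counts.most_common()
        if n >= min_count and (gene, disease) not in known_pairs
    ]

# Toy data with placeholder names (not real game output).
guesses = (
    [("GENE1", "disease X")] * 4
    + [("GENE2", "disease Y")] * 3
    + [("GENE3", "disease Z")]
)
known_pairs = {("GENE2", "disease Y")}
print(tally(guesses, known_pairs))
```

Requiring several independent players to agree is a cheap proxy for confidence: a pair guessed once may be noise, while a pair guessed repeatedly and missing from the reference set is a candidate for curation.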
GenESP: Two-player annotation games
COMBO: Genomic predictors for disease
cancer normal
find patterns
make predictions on new samples
cancer
normal
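The two steps on this slide, finding patterns in labeled cancer and normal samples and then making predictions on new samples, are the standard train-and-predict loop of a classifier. The sketch below uses a nearest-centroid classifier as a stand-in; the slides do not say which model COMBO uses, and the feature values are made-up toy numbers.

```python
from math import dist

def centroid(samples):
    """Mean value of each feature across the given samples."""
    return [sum(column) / len(column) for column in zip(*samples)]

def fit(samples, labels):
    """Find patterns: compute one centroid per class label."""
    groups = {}
    for sample, label in zip(samples, labels):
        groups.setdefault(label, []).append(sample)
    return {label: centroid(rows) for label, rows in groups.items()}

def predict(model, sample):
    """Predict the label whose centroid is closest to the new sample."""
    return min(model, key=lambda label: dist(model[label], sample))

# Toy expression values for two genes (illustrative only).
train = [[9.1, 1.0], [8.7, 1.4], [2.0, 7.9], [1.6, 8.3]]
labels = ["cancer", "cancer", "normal", "normal"]
model = fit(train, labels)
print(predict(model, [8.0, 2.0]), predict(model, [2.5, 7.0]))
```

In a game setting, the interesting part is which genes go into the feature vector; the classifier itself only scores how well a chosen set of genes separates the two classes.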
We can harness the Long Tail of scientists to directly participate in
the gene annotation process.
Collaborators
Doug Howe, ZFIN
John Hogenesch, U Penn
Jon Huss, GNF
Luca de Alfaro, UCSC
Angel Pizzaro, U Penn
Faramarz Valafar, SDSU
Pierre Lindenbaum, Fondation Jean Dausset
Michael Martone, Rush
Konrad Koehler, Karo Bio
Warren Kibbe, Simon Lim, Northwestern
Many Wikipedia editors
WP:MCB Project
Group members
Erik Clarke
Ben Good
Salvatore Loguercio
Ian Macleod
Chunlei Wu
Funding and Support
(BioGPS: GM83924, Gene Wiki: GM089820)
Contact
http://sulab.org
[email protected]
@andrewsu
+Andrew Su
Summer internships for students!
Recruiting graduate students in quantitative biology! See http://education.scripps.edu/