A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced
Genomes
Andrew Su, Ph.D.
@andrewsu
[email protected]
http://sulab.org
January 16, 2014
GMOD 2014
Why am I giving this keynote?
http://www.flickr.com/photos/portland_mike/6140660504/
Harnessing the crowd…
… to organize information
http://www.flickr.com/photos/45697441@N00/6629580443
My simplified history of MODs
GMOD is widely used
199 (!) organizations listed as GMOD users
Does the current model scale?
[Figure: # sequenced genomes (log scale, 1 to 1,000,000) by year, 1997 to 2025; series: Bacteria, Eukaryotes, Archaea]
The Long Tail of genomic data is being lost
Identified 517 operons and 103 small regulatory RNAs...
At least you can download structured data…
Centralized Model Organism Database concept
CMOD
http://www.flickr.com/photos/aigle_dore/5626312363/
GMOD as a Service (GaaS)
http://www.flickr.com/photos/shannonmary/187131727/
Few genes are well annotated…
Data: NCBI, February 2013
[Figure: GO annotation counts for 20,473 protein-coding genes, sorted by decreasing counts; marked values: 41%, 65%; labeled genes: CTNNB1, VEGFA, SIRT1, FGFR2, TGFB1, TP53, MEF2C, BMP4, LEF1, WNT5A, TNF]
… because the literature is sparsely curated?
[Figure: number of PubMed-indexed articles (0 to 1,000,000) by year, 1979 to 2009; annotations: average capacity of human scientist, number of articles read by typical scientist]
311,696 articles (1.5% of PubMed) have been cited by GO annotations
Sooner or later, the research community will
need to be involved in the annotation effort to scale
up to the rate of data generation.
The Long Tail is a prolific source of content
[Figure: content produced vs. contributors (sorted), split into Short Head and Long Tail]
                   Short Head          Long Tail
News:              Newspapers          Blogs
Video:             TV/Hollywood        YouTube
Product reviews:   Consumer reports    Amazon reviews
Food reviews:      Food critics        Yelp
Talent judging:    Olympics            American Idol
Wikipedia is reasonably accurate
Wikipedia has breadth and depth
[Figure: articles and words (millions), Wikipedia vs. Britannica Online]
http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, July 2008
We can harness the Long Tail of scientists to directly participate in
the gene annotation process.
Filtering, extracting, and summarizing PubMed
[Diagram: Documents, Concepts, Review article]
Wiki success depends on positive feedback
[Diagram: Gene Wiki page utility, number of users, number of contributors]
10,000 gene “stubs” within Wikipedia
Protein structure
Symbols and identifiers
Tissue expression pattern
Gene Ontology annotations
Links to structured databases
Gene summary
Protein interactions
Linked references
Huss, PLoS Biol, 2008
Gene Wiki has a critical mass of readers
Total: 4.0 million views / month
Huss, PLoS Biol, 2008; Good, NAR, 2011
Gene Wiki has a critical mass of editors
Increase of ~10,000 words / month from >1,000 edits
Currently 1.42 million words
Approximately equal to 230 full-length articles
[Figure: editor count and edit count; series: Editors, Edits]
Good, NAR, 2011
A review article for every gene is powerful
References to the literature
Hyperlinks to related concepts
Reelin: 98 editors, 703 edits since July 2002
Heparin: 358 editors, 654 edits since June 2003
AMPK: 109 editors, 203 edits since March 2004
RNAi: 394 editors, 994 edits since October 2002
Making the Gene Wiki more computable
Structured annotations
Free text
Filling the gaps in gene annotation
Wikilink
GO exact match
Gene Wiki mapping
NCBI Entrez Gene: 334
Candidate assertion
GO:0006897
6319 novel GO annotations
2147 novel DO annotations
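The mining step on this slide (wikilink, exact match to a GO term, candidate assertion) can be sketched roughly as follows. This is a minimal illustration, not the actual Gene Wiki pipeline: the function name, the sample wikilinks, and the `known` set are hypothetical, and only Entrez Gene 334 and GO:0006897 (the GO term for endocytosis) come from the slide.

```python
def candidate_annotations(gene_id, wikilinks, go_labels, known):
    """Propose GO IDs for a gene from the wikilinks on its Gene Wiki page.

    gene_id   -- Entrez Gene ID the page is mapped to
    wikilinks -- titles of pages linked from the gene's article
    go_labels -- dict mapping lower-cased GO term names to GO IDs
    known     -- set of (gene_id, go_id) pairs already annotated
    """
    candidates = []
    for link in wikilinks:
        # Exact (case-insensitive) match between link title and GO term name.
        go_id = go_labels.get(link.strip().lower())
        if go_id is None:
            continue
        # Keep only assertions that are novel and not yet proposed.
        if (gene_id, go_id) not in known and go_id not in candidates:
            candidates.append(go_id)
    return candidates


# Hypothetical inputs for illustration only.
go_labels = {"endocytosis": "GO:0006897"}
wikilinks = ["Endocytosis", "Cell membrane"]
print(candidate_annotations(334, wikilinks, go_labels, known=set()))
```

Exact matching keeps precision high at the cost of recall; a real pipeline would presumably also need synonym handling and manual review of the candidates.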
Gene Wiki content improves enrichment analysis
[Diagram: GO term, Gene list, Concept recognition, PubMed abstracts, Enrichment analysis]
axon guidance (GO:0007411)
264 genes
811 articles
Linked genes through PubMed: P = 1.55E-20
         Yes      No
Yes      13       2
No       251      12033
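The 2×2 table above is the kind of input an enrichment test takes. The slide does not name the test, so the sketch below assumes a one-sided Fisher's exact test (a hypergeometric tail), a common choice for this analysis. The function name is made up for illustration. The p-value does not change if the table is transposed, so it does not matter which axis is "in the GO term" and which is "linked through PubMed".

```python
from math import comb


def enrichment_p(a, b, c, d):
    """One-sided Fisher's exact test for the table [[a, b], [c, d]].

    Returns the probability of seeing `a` or more in the top-left cell
    when the row and column totals are held fixed.
    """
    total = a + b + c + d
    row1 = a + b   # size of the first row
    col1 = a + c   # size of the first column
    denominator = comb(total, row1)
    # Sum hypergeometric terms from the observed count to the maximum possible.
    tail = sum(
        comb(col1, k) * comb(total - col1, row1 - k)
        for k in range(a, min(row1, col1) + 1)
    )
    return tail / denominator


# Counts taken from the axon guidance table on the slide.
p = enrichment_p(13, 2, 251, 12033)
print(f"P = {p:.3g}")
```

Under this assumption the result lands on the order of 1E-20, in line with the value shown on the slide.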
Gene Wiki content improves enrichment analysis
[Diagram: GO term, Gene list, Concept recognition, PubMed abstracts + Gene Wiki, Enrichment analysis]
muscle contraction (GO:0006936)
87 genes
Linked genes through PubMed: P = 1.0
Linked genes through PubMed + Gene Wiki: P = 1.22E-09
251 articles
87 articles
Gene Wiki content improves enrichment analysis
[Figure: p-value (PubMed only) vs. p-value (PubMed + GW); regions: more significant with PubMed + GW, more significant with PubMed only; highlighted point: Muscle contraction]
The Long Tail of scientists is a valuable source of
information on gene function
http://fiehnlab.ucdavis.edu/projects/rice_metabolome/
Can we skip text mining?
Wikidata
Provide a database of the world’s knowledge that
anyone can edit
- Denny Vrandečić
Wikidata understands scale
14 million Wikidata items…
…13 million total genes in Entrez Gene
27 million Wikidata statements…
…150k total GO annotations
Wikidata for biology
Reelin
http://www.wikidata.org/wiki/Q414043
[Diagram: statements about Reelin]
Properties: is a (Property:P31), regulates (Property:P128), interacts with (Property:P129)
Items: Protein, Glycoprotein, Neural development, VLDL receptor, Amyloid precursor protein
Item IDs: Q8054, Q187126, Q1345738, Q1979313, Q423510, Q414043
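One way to read the Reelin diagram is as a set of (item, property, value) statements. The sketch below is only that reading: the pairing of labels with Q and P identifiers follows the order in which they appear on the slide, and the assignment of values to properties is an assumption, not something the slide text confirms.

```python
# Labels paired with identifiers in slide order (an assumption).
PROPERTY_LABELS = {
    "P31": "is a",
    "P128": "regulates",
    "P129": "interacts with",
}
ITEM_LABELS = {
    "Q414043": "Reelin",
    "Q8054": "Protein",
    "Q187126": "Glycoprotein",
    "Q1345738": "Neural development",
    "Q1979313": "VLDL receptor",
    "Q423510": "Amyloid precursor protein",
}

# Each statement is a (subject item, property, value item) triple.
# Which value belongs to which property is an assumed reading of the diagram.
STATEMENTS = [
    ("Q414043", "P31", "Q8054"),
    ("Q414043", "P31", "Q187126"),
    ("Q414043", "P128", "Q1345738"),
    ("Q414043", "P129", "Q1979313"),
    ("Q414043", "P129", "Q423510"),
]


def describe(statements):
    """Render identifier triples as human-readable sentences."""
    return [
        f"{ITEM_LABELS[s]} {PROPERTY_LABELS[p]} {ITEM_LABELS[o]}"
        for s, p, o in statements
    ]


for line in describe(STATEMENTS):
    print(line)
```

Because every statement uses stable identifiers rather than free text, the same data can be queried or relabeled in any language without parsing prose.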
Wikidata for biology
[Diagram: the same statements by identifier; properties: Property:P31, Property:P128, Property:P129; items: Q8054, Q187126, Q1345738, Q1979313, Q423510, Q414043]
http://wikidata.org/w/api.php?action=wbgetentities&ids=Q414043&languages=en
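The URL above can be built and its response read programmatically. The sketch below assumes the JSON layout that `wbgetentities` returns (`entities`, then the item ID, then `claims` keyed by property) and adds a `format=json` parameter that is not in the slide's URL. The helper names and the sample response fragment are invented for illustration. `fetch_entity` needs network access and is not called here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = "http://wikidata.org/w/api.php"  # endpoint as shown on the slide


def entity_url(qid, language="en"):
    """Build a wbgetentities request URL for one item."""
    params = {
        "action": "wbgetentities",
        "ids": qid,
        "languages": language,
        "format": "json",  # assumption: ask for JSON explicitly
    }
    return API + "?" + urlencode(params)


def fetch_entity(qid):
    """Download and return the entity record (requires network access)."""
    with urlopen(entity_url(qid)) as response:
        return json.load(response)["entities"][qid]


def item_values(entity, pid):
    """Return the item IDs that an entity's claims give for one property."""
    values = []
    for claim in entity.get("claims", {}).get(pid, []):
        datavalue = claim["mainsnak"].get("datavalue")
        # Skip "no value" snaks and values that are not items.
        if datavalue and datavalue.get("type") == "wikibase-entityid":
            values.append("Q" + str(datavalue["value"]["numeric-id"]))
    return values


# Hand-written fragment in the assumed response shape, not real API output.
sample = {
    "claims": {
        "P31": [
            {
                "mainsnak": {
                    "snaktype": "value",
                    "property": "P31",
                    "datavalue": {
                        "type": "wikibase-entityid",
                        "value": {"entity-type": "item", "numeric-id": 8054},
                    },
                }
            }
        ]
    }
}

print(entity_url("Q414043"))
print(item_values(sample, "P31"))
```

Keeping URL construction and claim parsing separate from the download step makes the parsing logic testable offline.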
Increasing biological data in Wikidata
http://www.wikidata.org/wiki/Wikidata:Molecular_Biology_task_force
Loading genomic data into Wikidata
Entrez Gene
Ensembl
UniProt
UCSC
PDB
RefSeq
Wikidata gene model
Added ~1000 human genes so far…
Wikidata as CMOD?
[Diagram: CMOD, with a "Powered by:" label]
The Long Tail of bioinformaticians can collaboratively build a Centralized
Model Organism Database (CMOD).
Gene Wiki Collaborators
Doug Howe, ZFIN
John Hogenesch, U Penn
Jon Huss, GNF
Luca de Alfaro, UCSC
Angel Pizzaro, U Penn
Faramarz Valafar, SDSU
Pierre Lindenbaum, Fondation Jean Dausset
Michael Martone, Rush
Konrad Koehler, Karo Bio
Warren Kibbe, Simon Lim
Many Wikipedia editors
WP:MCB Project
Group members
Katie Fisch
Ben Good
Salvatore Loguercio
Tobias Meissner
Max Nanis
Chunlei Wu
Key group alumni
Adriel Carolino
Erik Clarke
Jon Huss
Marc Leglise
Maximilian Ludvigsson
Ian MacLeod
Camilo Orozco
Funding and Support
(BioGPS: GM83924, Gene Wiki: GM089820)
Contact
http://sulab.org
[email protected]
@andrewsu
+Andrew Su
Recruiting for student, postdoc, outreach, and/or staff positions!