THE WORLD OF BIOCURATION
OPTIMIZING ITS IMPACT
April 7, 2014—Seventh International Biocuration Conference
SOMEONE WHO IS RESPONSIBLE FOR THE CARE AND SUPERVISION OF BIOLOGICAL KNOWLEDGE RESOURCES AND THEIR USE
WHAT IS A BIOCURATOR?
WHAT DO BIOCURATORS DO TODAY?
• Credits to Kaveh Bazargan ᔥ
• @kaveh1000
FRUIT IN FOOD PROCESSOR
SMOOTHIE
RESEARCH
RESEARCH IN WORD PROCESSOR
PDF
FRUIT??
RESEARCH??
YOU, THE BIOCURATOR
BIOCURATORS OF THE WORLD UNITE!
• You have nothing to lose but your PDF files
OUR ROLE IN THE RESEARCH LIFE CYCLE
THE WORLD OF BIOCURATION
http://www.nbcnews.com/id/49258816/ns/technology_and_science-science/t/live-concert-microbial-data-turned-song-lab/#.UzSB9ceT4_E
DESIGNING EXPERIMENTS
http://www.langdonbiology.org/AP/labs/Notebook/AP_notebook.htm
COLLECTING DATA
Thomas Nast - http://www.victorianweb.org/art/illustration/nast/51.jpg
WRITING UP RESULTS
http://rrresearch.fieldofscience.com/2012_02_01_archive.html
REVIEWING CONCLUSIONS
CAPTURING KNOWLEDGE
ISB
~300 BIOCURATORS
BIOCURATION INVERSION
http://www.nsf.gov/statistics/nsf13331/pdf/nsf13331.pdf
HUNDREDS OF THOUSANDS OF GRAD STUDENTS
POST-DOCS
LABORATORIES
JOURNALS
IN THE LAB
EARLY INTERVENTION: SUPPORTING STANDARDS
• Promote community-accepted identifiers, ontologies, & formats
SUPPORT STANDARDS, THEY’RE OUR FRIEND
• November, 1999
• 45 biologists
• 14 days
• 140 megabases of Drosophila genome
• Published in March 2000
GENE ONTOLOGY, ET AL.
QUEST FOR ORTHOLOGS
• 30 phylogenomic databases
• Vary in number of species, taxonomic range, sampling density, and methodology
• Joint benchmarking effort
• Only possible through the use of shared reference proteomes and formats
questfororthologs.org/ and www.ebi.ac.uk/reference_proteomes
EARLY INTERVENTION: SUPPORTING STANDARDS
• Promote community-accepted identifiers, ontologies, & formats
• Develop and follow guidelines (paper and web-based)
• e.g. Gaudet, P., et al. Towards BioDBcore: a community-defined information specification for biological databases. Database 2011. PMCID: PMC3017395
• Resource Identification Initiative
• www.force11.org/Resource_identification_initiative
• Vasilevsky NA, et al. On the reproducibility of science: unique identification of research resources in the biomedical literature. PeerJ. 2013 Sep 5;1:e148. doi: 10.7717/peerj.148. PubMed PMID: 24032093; PubMed Central PMCID: PMC3771067.
EARLY INTERVENTION: SUPPORTING STANDARDS
• Promote community-accepted identifiers, ontologies, & formats
• Embed community-accepted standards in the lab environment
KNOCKOUT MOUSE PROJECT 2
• Broad standardized phenotyping of knockout mice on a standard genetic background
• Data collection from many centres
• www.mousephenotype.org
Cindy Smith
PROTOCOLS ARE STANDARDIZED
REQUIRE USE OF PARTICULAR ONTOLOGY TERMS TO DESCRIBE PHENOTYPE
EARLY INTERVENTION: SUPPORTING STANDARDS
• Promote community-accepted identifiers, ontologies, & formats
• Embed community-accepted standards in the lab environment
• Work with labs to embed standards into their data generation pipeline
• Stealth standards
STANDARDS THROUGH UTILITY: APOLLO
CSIRO VIDEO (DEMO AT genomearchitect.org)
TOOLS FOR THE COMMUNITY
• Web-based so researchers anywhere have access
• Concurrent access supports real-time collaboration
• Built-in support for standards (transparently compliant)
• Automatic generation of ready-made computable data
• Client-side application relieves server bottleneck and supports privacy
EARLY INTERVENTION: SUPPORTING STANDARDS
• Promote community-accepted identifiers, ontologies, & formats
• Embed community-accepted standards in the lab environment
• Stealth standards
• Re-purpose internal curation tools for external users
• Provide on-line documentation, hands-on training and rapid-response user help
• Work with educators to make these tools an integral part of the curriculum
• e.g. CACAO (Critical Assessment of Community Annotation using Ontologies), ecoliwiki.net/colipedia/index.php/CACAO_0.1
• DNA subway (Apollo)
SUBMISSION
• CANTO: curation.pombase.org
• Structured Digital Abstracts
• Identifiers for all named genes, proteins, metabolites or other objects in the article
• Main results described in simple ontology terms
• Experimental evidence types
• Not only a synopsis of the results but computer-readable
• Gerstein, M., et al. Structured digital abstract makes text mining easy. Nature 447, 142 (10 May 2007) | doi:10.1038/447142a.
• Minimal Information reporting guidelines
• http://mibbi.sourceforge.net/portal.shtml
SUBMITTING DATA IN A STRUCTURED WAY
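The elements of a structured digital abstract listed above (identifiers for named objects, results as ontology terms, evidence types) could be captured in a small machine-readable record. The following is only a minimal sketch: the class names, field names, and the DOI are hypothetical, and the identifier values are illustrative rather than taken from any article.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class Assertion:
    """One computable statement from an article (hypothetical layout)."""
    subject_id: str   # identifier of the gene, protein, or metabolite
    term_id: str      # ontology term describing the main result
    evidence_id: str  # experimental evidence type


@dataclass
class StructuredAbstract:
    article_doi: str
    assertions: List[Assertion] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() converts nested dataclasses into plain dicts and lists
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Illustrative values only
abstract = StructuredAbstract(
    article_doi="10.0000/example",
    assertions=[
        Assertion("PomBase:SPBC11B10.09", "GO:0004693", "ECO:0000314"),
    ],
)
print(abstract.to_json())
```

A record like this is a synopsis that software can read directly, which is the point of the "computer-readable" bullet.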
PUBLISHING
• First there were letters
• Then Henry Oldenburg created the first scientific journal in 1665
• Result: too much to absorb
Washed away on the sea of information
PEER AND EDITORIAL REVIEW BECAME A FILTER
CONSEQUENTLY…
• Figshare: figshare.org
• iDigBio: www.idigbio.org
• Dryad: datadryad.org
• eLife: www.elifesciences.org
• Unlike journal articles, the scale of web-native publishing may overwhelm attempts at manual curation (using current strategies)
THE MEDIUM OF PUBLICATION IS CHANGING
DO WE NEED TO CURATE?
SCHOLARSHIP: BEYOND THE PAPER. JASON PRIEM. NATURE 495, 437–440 (28 MARCH 2013)
“…powerful, online filters will distill communities’ impact judgements algorithmically”
SOME SAY NO…
DO WE NEED TO CURATE?
• Resolution of differences
• Clarity, eliminating noise
• Validation & design of automated methods
EVEN A PLACE LIKE GOOGLE USES CURATORS (AND SOFTWARE)
• Hundreds of operators per country
• Multiple kinds of errors: overlapping jurisdictions, accidental merges, mismatches between road maps and satellite images, etc.
• Every road that you see has been hand-massaged
http://www.theatlantic.com/technology/archive/2012/09/how-google-builds-its-maps-and-what-it-means-for-the-future-of-everything/261913/
DO WE NEED TO CURATE?
• Resolution of differences
• Clarity, eliminating noise
• Validation & design of automated methods
CLARITY
• Answer boxes: Quick answers to concrete questions
• Much of this information comes from Freebase which is structured in terms of entities and properties
Robert West, et al. Knowledge Base Completion via Search-Based Question Answering. http://www.cs.ubc.ca/~murphyk/Papers/www14.pdf WWW’14 April 7–11, 2014, Seoul, Korea. ACM 978-1-4503-2744-2/14/04. DOI:2568032
DO WE NEED TO CURATE?
• Resolution of differences
• Clarity, eliminating noise
• Validation & design of automated methods
• PDF is still the dominant form of distribution
• PDF “Annotation”
• UTOPIA, www.utopiadocs.com
• DOMEO, swan.mindinformatics.org
• Textpresso, www.textpresso.org
• All of these are still lacking domain specifics (or need to be taught)
• FORCE11, www.force11.org
• Common goal is advancing scientific communications
• Beyond the PDF
LITERATURE IS INFORMATIVE BUT IS NOT INFORMATION
VALIDATION AND DESIGN OF AUTOMATED METHODS
• Requires trusted reference datasets!
• Biocurators are partners with developers!
Write/modify software
Run the algorithm
Evaluate results
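The "evaluate results" step of this cycle depends on a trusted reference dataset. One common way to do the comparison is precision and recall of the algorithm's output against the curated set. This is only a sketch under that assumption; the function name and the gene and term labels are hypothetical.

```python
from typing import Dict, Set, Tuple

Annotation = Tuple[str, str]  # (gene identifier, ontology term)


def evaluate(predicted: Set[Annotation],
             reference: Set[Annotation]) -> Dict[str, float]:
    """Compare automated annotations with a curated reference set."""
    true_pos = len(predicted & reference)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(reference) if reference else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical curated reference and algorithm output
reference = {("geneA", "T1"), ("geneB", "T2"), ("geneC", "T3"), ("geneD", "T4")}
predicted = {("geneA", "T1"), ("geneB", "T2"), ("geneE", "T5")}

scores = evaluate(predicted, reference)
print(scores)
```

Re-running the same evaluation after each software change gives developers and biocurators a shared measure of whether the change helped.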
SCHOLARSHIP: BEYOND THE PAPER. JASON PRIEM. NATURE 495, 437–440 (28 MARCH 2013)
“…powerful, online filters will distill communities’ impact judgements algorithmically”
DO WE NEED TO CURATE?
THE PARABLE OF GOOGLE FLU: TRAPS IN BIG DATA ANALYSIS. DAVID LAZER ET AL. SCIENCE 14 MARCH 2014: VOL. 343 NO. 6176 PP. 1203-1205
“‘Big data hubris’ is the often implicit assumption that big data are a substitute for, rather than a supplement to, traditional data collection and analysis.”
DO WE NEED TO CURATE?
• Yes
• But…
SYSTEMATIC REVIEW & CRITICISM IS REQUIRED
OUR STRENGTH IS IN THE QUALITY OF THE INFORMATION WE CAN PROVIDE
CUSICK, M., ET AL. LITERATURE-CURATED PROTEIN INTERACTION DATASETS. NAT METHODS. JAN 2009; 6(1): 39–46. PMCID: PMC2683745
“…literature curated datasets have inherent reliability difficulties…”
HOW CAN BIOCURATORS ADDRESS CRITICISMS?
GREENBERG, S., HOW CITATION DISTORTIONS CREATE UNFOUNDED AUTHORITY: ANALYSIS OF A CITATION NETWORK. BMJ JULY 2009; 339 DOI: HTTP://DX.DOI.ORG/10.1136/
THE RISK (BY ANALOGY)
WE'RE RESPONSIBLE FOR THE QUALITY
• “Reviewing the quality of the data is an obligation of any entity that assumes responsibility over the data.”
• Limor Peer et al., IDCC 2014
PAINT APOPTOSIS: SUMMARY
• 52 families annotated:
• 8 were participants in the execution phase of apoptosis
• 44 others are either:
A. upstream of apoptosis
B. phenotypes
C. targets
Example 1: Protein (cytochrome c) upstream of apoptosis execution
Cytochrome c is directly involved in apoptotic DNA fragmentation
➢ [Cells] – [cytochrome c] = No apoptotic DNA fragmentation
➢ [Cells] – [cytochrome c] + [cytochrome c] = apoptotic DNA fragmentation
Example 2: Phenotype of reduced cell survival and increased DNA fragmentation
• E3 ubiquitin-protein ligase TRAF7 was annotated to execution phase of apoptosis
➢ Exogenous expression of TRAF7
➢ No other data in terms of where in apoptosis this may be.
➢ All we know is altering TRAF7 levels affects apoptosis.
Example 3: Target
• DSG2 was annotated to execution phase of apoptosis
DSG2 is a *target* of a protease (caspase), and although its degradation indeed seems to be a part of apoptosis it does not *mediate* apoptosis.
PROVE THE NEED FOR BIOCURATION
• Publish: Quantitative improvements before/after
• Publish: Curator consistency studies
• Publish: Independent external reviews
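A curator consistency study needs some measure of agreement. Cohen's kappa, which corrects raw agreement between two curators for agreement expected by chance, is one widely used choice. The slides do not say which statistic to use, so this is an assumed example; the labels below are hypothetical and reuse the apoptosis categories only for illustration.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Chance-corrected agreement between two curators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items on which both curators chose the same label
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if each curator labeled at random with
    # their own label frequencies
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a.keys() | freq_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both curators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical annotations of six protein families by two curators
curator_1 = ["upstream", "execution", "target", "execution", "phenotype", "execution"]
curator_2 = ["upstream", "execution", "execution", "execution", "phenotype", "target"]

kappa = cohens_kappa(curator_1, curator_2)
print(round(kappa, 3))
```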
ENABLING RESEARCH
WHAT IS A BIOCURATOR?
• A highly skilled and trained keeper of our biological heritage of knowledge.
• A content specialist who understands the research and can succinctly distill biological research results into computable data
• Considers the ease of finding this information, its relatedness to other information, and its research and educational usability
B6.Cg-Alms1foz/foz/J
increased weight, adipose tissue volume, glucose homeostasis altered
ALMS1 (NM_015120.4) [c.10775delC] + [-]
GENOTYPE
PHENOTYPE
obesity, diabetes mellitus, insulin resistance
increased food intake, hyperglycemia, insulin resistance
kcnj11c14/c14; insrt143/+(AB)
MODELS RECAPITULATE VARIOUS PHENOTYPIC ASPECTS OF DISEASE
B6.Cg-Alms1foz/foz/J
increased weight, adipose tissue volume, glucose homeostasis altered
GENOTYPE
PHENOTYPE
obesity, diabetes mellitus, insulin resistance
increased food intake, hyperglycemia, insulin resistance
kcnj11c14/c14; insrt143/+(AB)
MODELS RECAPITULATE VARIOUS PHENOTYPIC ASPECTS OF DISEASE
?
RESEARCH RESOURCES
Doelken SC et al. Dis. Model. Mech. 2013;6:358-372
Smedley D et al. Database. 2013; bat025
Mungall CJ et al. Genome Biol. 2010; 11(1):R2
Washington N et al. PLoS Biol. 2009; e1000247
CROSS-SPECIES PHENOTYPE COMPARISONS BY SEMANTIC SIMILARITY
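To make "semantic similarity" concrete: two phenotype profiles that share no exact terms can still be compared through the more general ontology terms they have in common. The sketch below uses Jaccard similarity over ancestor sets as a simple stand-in; the cited papers may use different measures, and the toy hierarchy and term names here are hypothetical.

```python
from typing import Dict, Iterable, Set

# Toy is_a hierarchy, child -> parents (hypothetical term names)
PARENTS: Dict[str, Set[str]] = {
    "increased body weight": {"abnormal body weight"},
    "obesity": {"abnormal body weight"},
    "abnormal body weight": {"growth/size phenotype"},
    "hyperglycemia": {"abnormal glucose homeostasis"},
    "insulin resistance": {"abnormal glucose homeostasis"},
    "abnormal glucose homeostasis": {"metabolism phenotype"},
    "growth/size phenotype": set(),
    "metabolism phenotype": set(),
}


def closure(terms: Iterable[str]) -> Set[str]:
    """Return the given terms plus all of their ancestors."""
    seen: Set[str] = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(PARENTS.get(term, ()))
    return seen


def jaccard(profile_a: Iterable[str], profile_b: Iterable[str]) -> float:
    """Shared ancestor-closed terms divided by all ancestor-closed terms."""
    a, b = closure(profile_a), closure(profile_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


model_profile = {"increased body weight", "hyperglycemia"}
patient_profile = {"obesity", "insulin resistance"}
similarity = jaccard(model_profile, patient_profile)
print(similarity)
```

The two profiles have no term in common, yet they score well above zero because their terms share parents. Curated ontology annotations are what make this kind of cross-species match possible.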
CANDIDATE GENE PRIORITIZATION
PHENOTYPIC INTERPRETATION OF VARIANTS IN EXOMES (PHIVE)
Whole exome
Remove off-target and common variants
Variant score from allele freq and pathogenicity
Phenotype score from phenotypic similarity
PhenIX/PhIVE score to give final candidates
http://monarchinitiative.org
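The workflow above can be sketched as a filter followed by two scores that are combined into one ranking. This is an illustration only, not the published PhIVE scoring: the frequency cutoff, the rarity weighting, and the simple averaging of the two scores are all assumptions, and the gene names and values are made up.

```python
from dataclasses import dataclass
from typing import List, Tuple

MAX_FREQ = 0.01  # assumed cutoff for "common" variants


@dataclass
class Variant:
    gene: str
    on_target: bool
    allele_freq: float      # population allele frequency, 0..1
    pathogenicity: float    # predicted pathogenicity, 0..1
    phenotype_score: float  # phenotypic similarity to the patient, 0..1


def variant_score(v: Variant) -> float:
    """Assumed formula: pathogenicity down-weighted for less rare variants."""
    rarity = 1.0 - min(v.allele_freq / MAX_FREQ, 1.0)
    return v.pathogenicity * rarity


def prioritize(variants: List[Variant]) -> List[Tuple[str, float]]:
    # Step 1: remove off-target and common variants
    kept = [v for v in variants if v.on_target and v.allele_freq < MAX_FREQ]
    # Steps 2-4: combine variant and phenotype scores (assumed: plain mean)
    scored = [(v.gene, (variant_score(v) + v.phenotype_score) / 2) for v in kept]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical exome calls
variants = [
    Variant("GENE_A", True, 0.0, 0.9, 0.8),
    Variant("GENE_B", True, 0.005, 0.9, 0.2),
    Variant("GENE_C", True, 0.2, 0.95, 0.9),   # common, filtered out
    Variant("GENE_D", False, 0.0, 0.99, 0.9),  # off-target, filtered out
]
ranking = prioritize(variants)
print(ranking)
```

In this toy data GENE_A outranks GENE_B mainly because of its phenotype score, which is the part of the ranking that depends on curated phenotype data.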
CONFIRMED DIAGNOSES
• Infantile Parkinsonism-dystonia
• Wiedemann-Steiner syndrome
• de novo SYNGAP1 mutation leading to autosomal dominant mental retardation
• Frank-ter Haar syndrome
• Infantile hypophosphatasia
• … (~28%)
RELATEDNESS ACROSS BIOLOGY
• Bio-Curator, not bio-Archivist
• Actively trying to represent current best understanding
• Support interoperability
• Support research and educational usability
• Support inference
• Not just for supporting searches, not just for finding PDF/online papers!
WHAT CAN BE DONE?
BIODIVERSITY DATA JOURNAL
FROM WRITING, SUBMISSION, PEER-REVIEW, EDITING, PUBLICATION TO DISSEMINATION
WHAT CAN ISB DO?
• Tangible support of standards efforts
• QfO, RII, MI, publish guidelines, validators …
• Create a curation mindset across the entire life cycle
• Support embedded/repurposed software, education, actively engage with text-miners, provide on-line support …
• Prove the necessity for curation
• Publish studies, greater emphasis on review and quality (assessment)
• Work with traditional publishers
• FORCE11, structured submissions
WHAT CAN YOU DO?
• Consider
• The ease of finding information
• Its relatedness to other information
• Its research and educational usability
RESEARCH??
YOU, THE BIOCURATOR
ISB
ACKNOWLEDGEMENTS AND THANKS
YOU ARE NOT ALONE