Date post: | 07-May-2015 |
Maryann E. Martone, Ph.D., University of California, San Diego
Neuroscience is unlikely to be served by a few large databases, as the genomics and proteomics communities are
Whole brain data (20 µm microscopic MRI)
Mosaic LM images (1 GB+)
Conventional LM images
Individual cell morphologies
EM volumes & reconstructions
Solved molecular structures
No single technology serves these all equally well. Multiple data types; multiple scales; multiple databases
http://neuinfo.org
• NIF’s mission is to maximize the awareness of, access to and utility of research resources produced worldwide to enable better science and promote efficient use
– NIF unites neuroscience information without respect to domain, funding agency, institute or community
– NIF is like a “PubMed” for all biomedical resources and a “PubMed Central” for databases
– Makes them searchable from a single interface
– Practical and cost-effective; tries to be sensible
– Learned a lot about current data practices
The Neuroscience Information Framework is an initiative of the NIH Blueprint consortium of institutes http://neuinfo.org
June 10, 2013 dkCOIN Investigator's Retreat
• A portal for finding and using neuroscience resources
A consistent framework for describing resources
Provides simultaneous search of multiple types of information, organized by category
Supported by an expansive ontology for neuroscience
Utilizes advanced technologies to search the “hidden web”
UCSD, Yale, Cal Tech, George Mason, Washington Univ
Literature
Database Federation
Registry
We’d like to be able to find:
• What is known:
– What are the projections of hippocampus?
– Is GRM1 expressed in cerebral cortex?
– What genes have been found to be upregulated in chronic drug abuse in adults?
– What animal models have similar phenotypes to Parkinson’s disease?
– What studies used my polyclonal antibody against GABA in humans?
• What is not known:
– Connections among data
– Gaps in knowledge
A framework makes it easier to address these questions
With the thousands of databases and other information sources available, simple descriptive metadata will not suffice
• NIF curators
• Nomination by the community
• Semi-automated text mining pipelines
NIF Registry: requires no special skills; site map available for local hosting
• NIF Data Federation
• DISCO interop
• Requires some programming skill
• Open Source Brain < 2 hr
Two-tiered system: low barrier to entry
Current Planned
DISCO Dashboard Functions
• Ingest Script Manager
• Public Script Repository
• Data & Event Tracker
• Versioning System
• Curator Tool
• Data Transformer Manager
Luis Marenco, Rixin Wang, Perry Miller, Gordon Shepherd, Yale University
NIF was designed to be populated rapidly with progressive refinement
Databases come in many shapes and sizes
• Primary data:
– Data available for reanalysis, e.g., microarray data sets from GEO; brain images from XNAT; microscopic images (CCDB/CIL)
• Secondary data:
– Data features extracted through data processing and sometimes normalization, e.g., brain structure volumes (IBVD); gene expression levels (Allen Brain Atlas); brain connectivity statements (BAMS)
• Tertiary data:
– Claims and assertions about the meaning of data, e.g., gene upregulation/downregulation, brain activation as a function of task
• Registries:
– Metadata
– Pointers to data sets or materials stored elsewhere
• Data aggregators:
– Aggregate data of the same type from multiple sources, e.g., Cell Image Library, SUMSdb, Brede
• Single source:
– Data acquired within a single context, e.g., Allen Brain Atlas
Researchers are producing a variety of information artifacts using a multitude of technologies
Hippocampus OR “Cornu Ammonis” OR “Ammon’s horn”
Query expansion: synonyms and related concepts
Boolean queries
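The query expansion described here can be illustrated with a minimal sketch. It assumes a small hand-written synonym table; the names `SYNONYMS`, `expand_query` and `matches` are hypothetical and are not part of NIF, which draws its synonyms from its ontology.

```python
# Hypothetical synonym table; a real system would read these from an ontology.
SYNONYMS = {
    "hippocampus": ["Cornu Ammonis", "Ammon's horn"],
}


def quote(term):
    """Quote multi-word terms so they survive as phrases in a Boolean query."""
    return f'"{term}"' if " " in term else term


def variants_of(term, synonyms=SYNONYMS):
    """Return the term followed by any synonyms registered for it."""
    return [term] + synonyms.get(term.lower(), [])


def expand_query(term, synonyms=SYNONYMS):
    """Build an OR query over the term and its synonyms."""
    return " OR ".join(quote(v) for v in variants_of(term, synonyms))


def matches(text, term, synonyms=SYNONYMS):
    """True if the text mentions the term or any of its synonyms."""
    lowered = text.lower()
    return any(v.lower() in lowered for v in variants_of(term, synonyms))


print(expand_query("Hippocampus"))
```

With this table, a record that only mentions "Ammon's horn" is still returned for a search on "hippocampus", which is the point of searching by concept rather than by string.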
Data sources categorized by “data type” and level of nervous system
Common views across multiple sources
Tutorials for using full resource when getting there from NIF
Link back to record in original source
Connects to
Synapsed with
Synapsed by
Input region
innervates
Axon innervates Projects to Cellular contact
Subcellular contact
Source site
Target site
Each resource implements a different, though related model; systems are complex and difficult to learn, in many cases
• You (and the machine) have to be able to find it
– Accessible through the web
– Structured or semi-structured
– Annotations
• You (and the machine) have to be able to use it
– Data type specified and in an actionable form
• You (and the machine) have to know what the data mean
– Semantics
– Context: experimental metadata
– Provenance: where did they come from
Knowledge in space and spatial relationships (the “where”)
Knowledge in words, terminologies and logical relationships (the “what”)
Purkinje Cell
Axon Terminal
Axon Dendritic Tree
Dendritic Spine
Dendrite
Cell body
Cerebellar cortex
There is little obvious connection between data sets taken at different scales using different microscopies without an explicit representation of the biological objects that the data represent
• NIF covers multiple structural scales and domains of relevance to neuroscience
• Aggregate of community ontologies with some extensions for neuroscience, e.g., Gene Ontology, ChEBI, Protein Ontology
NIFSTD
Organism
NS Function Molecule Investigation Subcellular structure
Macromolecule Gene
Molecule Descriptors
Techniques
Reagent Protocols
Cell
Resource Instrument
Dysfunction Quality Anatomical Structure
Brain has a Cerebellum
Cerebellum has a Purkinje Cell Layer
Purkinje Cell Layer has a Purkinje cell
Purkinje cell is a neuron
• Ontology: an explicit, formal representation of concepts and the relationships among them within a particular domain that expresses human knowledge in a machine-readable form
• Branch of philosophy: a theory of what is
• e.g., Gene ontologies
• Express neuroscience concepts in a way that is machine readable
– Synonyms, lexical variants
– Definitions
• Provide means of disambiguation of strings
– Nucleus part of cell; nucleus part of brain; nucleus part of atom
• Rules by which a class is defined, e.g., a GABAergic neuron is a neuron that releases GABA as a neurotransmitter
• Properties
– Support reasoning
• Provide universals for navigating across different data sources
– Semantic “index”
– Link data through relationships, not just one-to-one mappings
• Provide the basis for concept-based queries to probe and mine data
• Establish a semantic framework for landscape analysis
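A class defined by a rule, rather than by a list of members, can be sketched as follows. This is a toy illustration and not NIFSTD itself: the dictionaries `IS_A` and `RELEASES` and both function names are hypothetical, and the three cell types are included only as examples.

```python
# Toy ontology fragments (illustrative assumptions, not NIFSTD content).
IS_A = {
    "Purkinje cell": "neuron",
    "basket cell": "neuron",
    "granule cell": "neuron",
}
RELEASES = {
    "Purkinje cell": "GABA",
    "basket cell": "GABA",
    "granule cell": "glutamate",
}


def is_a(term, ancestor):
    """Walk the is-a chain upward until the ancestor is found or the chain ends."""
    while term is not None:
        if term == ancestor:
            return True
        term = IS_A.get(term)
    return False


def members_of_defined_class(parent, transmitter):
    """Return every term that satisfies the rule:
    'is a <parent> that releases <transmitter> as a neurotransmitter'."""
    return sorted(
        t for t in IS_A
        if is_a(t, parent) and RELEASES.get(t) == transmitter
    )


# "GABAergic neuron" is never listed explicitly; membership follows from the rule.
print(members_of_defined_class("neuron", "GABA"))
```

Because membership is computed from properties, adding a new cell type with its neurotransmitter is enough for it to appear in a search for GABAergic neurons.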
Mathematics, computer code or Esperanto
birnlex_1732 Brodmann.1
Explicit mapping of database content helps disambiguate non-unique and custom terminology
Aligns sources to the NIF semantic framework
• Search Google: GABAergic neuron
• Search NIF: GABAergic neuron
– NIF automatically searches for types of GABAergic neurons
Types of GABAergic neurons
Search by meaning not by string
Equivalence classes; restrictions
Arbitrary but defensible
• Neurons classified by
– Circuit role: principal neuron vs interneuron
– Molecular constituent: Parvalbumin-neurons, calbindin-neurons
– Brain region: Cerebellar neuron
– Morphology: Spiny neuron
• Molecule Roles: Drug of abuse, anterograde tracer, retrograde tracer
• Brain parts: Circumventricular organ
• Organisms: Non-human primate, non-human vertebrate
• Qualities: Expression level
• Techniques: Neuroimaging
What genes are upregulated by drugs of abuse in the adult mouse? (show me the data!)
Morphine Increased expression
Adult Mouse
• NIF Connectivity: 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions
– Brain Architecture Management System (rodent)
– Temporal lobe.com (rodent)
– Connectome Wiki (human)
– Brain Maps (various)
– CoCoMac (primate cortex)
– UCLA Multimodal database (Human fMRI)
– Avian Brain Connectivity Database (Bird)
• Total: 1800 unique brain terms (excluding Avian)
• Number of exact terms used in > 1 database: 42
• Number of synonym matches: 99
• Number of 1st order partonomy matches: 385
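The kind of overlap count reported above can be sketched in a few lines. This is an illustration under stated assumptions, not the analysis NIF ran: the source names `db_a`, `db_b`, `db_c`, the term lists and the synonym table are all made up, and partonomy matching is left out.

```python
from collections import Counter


def normalize(term):
    """Lower-case and collapse whitespace so trivial variants compare equal."""
    return " ".join(term.lower().split())


def exact_shared_terms(sources):
    """Terms that appear, as the same string, in more than one source."""
    counts = Counter()
    for terms in sources.values():
        counts.update({normalize(t) for t in terms})
    return sorted(t for t, n in counts.items() if n > 1)


def shared_concepts(sources, synonyms):
    """Concepts shared by more than one source once synonyms are mapped
    to a canonical label."""
    canon = {normalize(s): normalize(c) for c, syns in synonyms.items() for s in syns}
    counts = Counter()
    for terms in sources.values():
        counts.update({canon.get(normalize(t), normalize(t)) for t in terms})
    return sorted(c for c, n in counts.items() if n > 1)


# Made-up example data.
sources = {
    "db_a": ["Hippocampus", "Cerebellum"],
    "db_b": ["Ammon's horn", "cerebellum"],
    "db_c": ["Cornu Ammonis"],
}
synonyms = {"hippocampus": ["Cornu Ammonis", "Ammon's horn"]}

print(exact_shared_terms(sources))
print(shared_concepts(sources, synonyms))
```

In this toy data only one term matches by string, while mapping through synonyms recovers a second shared concept, which mirrors the gap between the exact and synonym counts on the slide.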
http://neurolex.org
• Semantic MediaWiki
• Provide a simple interface for defining the concepts required
• Lightweight semantics
• Good teaching tool for learning about semantic integration and the benefits of a consistent semantic framework
• Community based: • Anyone can contribute their terms, concepts, things
• Anyone can edit • Anyone can link
• Accessible: searched by Google • Growing into a significant knowledge base for neuroscience
• International Neuroinformatics Coordinating Facility
Demo D03
Larson et al, Frontiers in Neuroinformatics, in press
• Neurolex provides an on-line computable index for expressing models in semantic terms, and linking to other knowledge and data
• Implemented forms for certain types of entities
• Neuroscience knowledge in the web
Pages are linked through properties; knowledge base built through cross-modular relations and links to data; red links
• > 1000 DICOM terms – Karl Helmer – Data Sharing Task Force
• Tasks and Cognitive Concepts from Cognitive Atlas – Russ Poldrack
• > 280 neurons – Gordon Shepherd and 30 worldwide experts
• ~500 fly neurons from Fly Anatomy Ontology – David Osumi-Sutherland
• > 1200 brain parcellations
~20,000 concepts: spreadsheet downloads, through NIF Web Services, SPARQL endpoint
200,000 edits; 150 contributors
Because wiki pages have static URLs, they are searchable by Google
Neurolex: > 1 million triples
Dr. Yi Zeng: Chinese neural knowledge base NIF Cell Graph
1. Look brain region up in NeuroLex
2. Look up cells contained in the brain region
3. Find those cells that are known to project out of that brain region
4. Look up the neurotransmitters for those cells
5. Determine whether those neurotransmitters are known to be excitatory or inhibitory
6. Report the projection as excitatory or inhibitory, and report the entire chain of logic with links back to the wiki pages where they were made
7. Make sure user can get back to each statement in the logic chain to edit it if they think it is wrong
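The seven steps above can be sketched as a small lookup chain. This is a minimal sketch under stated assumptions: the knowledge base `KB`, the placeholder names `region_R`, `cell_A` and `cell_B`, and the `page` helper are hypothetical and do not reflect the NeuroLex data model or API.

```python
# Toy knowledge base; every entry is a placeholder, not a NeuroLex record.
KB = {
    "cells_in_region": {"region_R": ["cell_A", "cell_B"]},
    "projects_out": {"cell_A": True, "cell_B": False},
    "neurotransmitters": {"cell_A": ["glutamate"], "cell_B": ["GABA"]},
    "transmitter_sign": {"glutamate": "excitatory", "GABA": "inhibitory"},
}


def page(name):
    """Hypothetical link to the wiki page where a statement is recorded."""
    return f"wiki/{name}"


def projection_signs(region, kb=KB):
    """Report each outgoing projection with its sign and the chain of
    statements (each paired with a page link) that supports it."""
    reports = []
    for cell in kb["cells_in_region"].get(region, []):        # steps 1-2
        if not kb["projects_out"].get(cell, False):           # step 3
            continue
        for nt in kb["neurotransmitters"].get(cell, []):      # step 4
            sign = kb["transmitter_sign"].get(nt, "unknown")  # step 5
            chain = [                                         # steps 6-7
                (f"{region} contains {cell}", page(region)),
                (f"{cell} projects out of {region}", page(cell)),
                (f"{cell} releases {nt}", page(cell)),
                (f"{nt} is {sign}", page(nt)),
            ]
            reports.append({"cell": cell, "sign": sign, "chain": chain})
    return reports


for report in projection_signs("region_R"):
    print(report["cell"], report["sign"])
    for statement, link in report["chain"]:
        print("  ", statement, "->", link)
```

Keeping the link next to every statement is what lets a user follow the chain back and correct a single step, as step 7 requires.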
Stephen Larson CHEBI:18243
Are projections from the VTA excitatory or inhibitory?
• INCF Project – Neuron Registry – > 30 experts worldwide
– Fill out neuron pages in Neurolex Wiki
– Led by Dr. Gordon Shepherd
[Chart: number of total redlinks, easy fixes and hard fixes for soma location, dendrite location and axon location]
Social networks and community sites let us learn things from the collective behavior of contributors
neurolex.org: Semantic Wiki
• INCF Community encyclopedia • Define all vocabulary, terms, protocols, brain structures, diseases, etc
• Living review articles • Links to data, models and literature • Semantic organization, search, analysis and integration • Searchable via the web
• Global directory of all shared vocabularies, CDEs, etc
Slide courtesy of Sean Hill: International Neuroinformatics Coordinating Facility
Martin Telefont, HBP: Lab Space connecting to Knowledge Space
• NIF can be used to survey the data landscape
• Analysis of NIF shows multiple databases with similar scope and content
• Many contain partially overlapping data
• Data “flows” from one resource to the next
– Data is reinterpreted, reanalyzed or added to
• Is duplication good or bad?
NIF is trying to make it easier to work with diverse data
NIF is in a unique position to answer questions about the neuroscience landscape: Kepler Workflow engine + NIF semantics
Where are the data?
Striatum Hypothalamus Olfactory bulb
Cerebral cortex
Brain
Brain region
Data source
∞
What is easily machine processable and accessible
What is potentially knowable
What is known: Literature, images, human knowledge
Unstructured; Natural language processing, entity recognition, image processing and analysis; paywalls communication
Abstracts vs full text vs tables etc
Closed world vs open world
We know a lot about some things and less about others; some of NIF’s sources are comprehensive; others are highly biased
But...NIF has > 2M antibodies, 338,000 model organisms, and 3 million microarray records
Neocortex
Olfactory bulb
Neostriatum
Cochlear nucleus
All neurons with cell bodies in the same brain region are grouped together
Properties in Neurolex
Exposing knowledge gaps and biases
Where are the data?
Striatum Hypothalamus Olfactory bulb
Cerebral cortex
Brain
Brain region
Data source Funding
• Gemma: Gene ID + Gene Symbol
• DRG: Gene name + Probe ID
• Gemma presented results relative to baseline chronic morphine; DRG with respect to saline, so direction of change is opposite in the 2 databases
• Analysis:
– 1370 statements from Gemma regarding gene expression as a function of chronic morphine
– 617 were consistent with DRG; over half of the claims of the paper were not confirmed in this analysis
– Results for 1 gene were opposite in DRG and Gemma
– 45 did not have enough information provided in the paper to make a judgment
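The baseline problem described above can be made concrete with a small sketch: before comparing two sources, re-express each direction of change against the same reference condition. This is an illustration, not the procedure used in the analysis. It assumes a two-condition comparison (saline vs chronic morphine), records already keyed by a shared gene identifier, and hypothetical gene names and function names.

```python
FLIP = {"up": "down", "down": "up"}


def to_common_baseline(direction, baseline, common="saline"):
    """Re-express a direction of change relative to the common baseline.
    With only two conditions, swapping the baseline flips the direction."""
    return direction if baseline == common else FLIP[direction]


def compare(gemma, drg):
    """Classify each Gemma statement as consistent with, opposite to,
    or missing from DRG after normalizing both to the same baseline."""
    summary = {"consistent": [], "opposite": [], "missing": []}
    for gene, (direction, baseline) in gemma.items():
        if gene not in drg:
            summary["missing"].append(gene)
            continue
        ours = to_common_baseline(direction, baseline)
        theirs = to_common_baseline(*drg[gene])
        key = "consistent" if ours == theirs else "opposite"
        summary[key].append(gene)
    return summary


# Made-up records: gene -> (direction of change, baseline condition).
gemma = {
    "gene_1": ("down", "chronic morphine"),
    "gene_2": ("up", "chronic morphine"),
    "gene_3": ("up", "chronic morphine"),
}
drg = {
    "gene_1": ("up", "saline"),
    "gene_2": ("up", "saline"),
}

print(compare(gemma, drg))
```

Here `gene_1` looks contradictory as stored ("down" vs "up") but agrees once both are expressed relative to saline, which is why recording the baseline as explicit metadata matters.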
Relatively simple standards would make life easier
NIF favors a hybrid, tiered, federated system
• Domain knowledge
– Ontologies
• Claims, models and observations
– Virtuoso RDF triples
– Model repositories
• Data
– Data federation
– Spatial data
– Workflows
• Narrative
– Full text access
Neuron Brain part Disease Organism Gene
Caudate projects to Snpc
Grm1 is upregulated in chronic cocaine
Betz cells degenerate in ALS
NIF provides the tentacles that connect the pieces: a new type of entity for 21st century science
Technique People
Scholar
Library
Scholar
Publisher
FORCE11.org: Future of research communications and e-scholarship
Scholar
Consumer
Libraries
Data Repositories
Code Repositories Community databases/platforms
OA
Curators
Social Networks
Social Networks Social
Networks
Peer Reviewers
Narrative
Workflows
Data
Models
Multimedia
Nanopublications
Code
• Of the ~4000 columns that NIF queries, ~1300 map to one of our core categories:
– Organism
– Anatomical structure
– Cell
– Molecule
– Function
– Dysfunction
– Technique
• 30-50% of NIF’s queries autocomplete
• When NIF combines multiple sources, a set of common fields emerges
– Basic information models/semantic models exist for certain types of entities
Semantic frameworks create spaces in which to compare the current state of data and knowledge
• Several powerful trends should change the way we think about our data: One → Many
– Many data
• Generation of data is getting easier → shared data
• Data space is getting richer: more -omes every day
• But...compared to the biological space, still sparse
– Many resources: everyone wants to be “the” one but e pluribus unum
– Many eyes
• Wisdom of crowds
• More than one way to interpret data
– Many algorithms
• Not a single way to analyze data
– Many analytics
• “Signatures” in data may not be directly related to the question for which they were acquired but tell us something really interesting
New works need to be created with an eye towards the web and interoperability
Jeff Grethe, UCSD, Co-Investigator, Interim PI
Amarnath Gupta, UCSD, Co-Investigator
Anita Bandrowski, NIF Project Leader
Gordon Shepherd, Yale University
Perry Miller
Luis Marenco
Rixin Wang
David Van Essen, Washington University
Erin Reid
Paul Sternberg, Cal Tech
Arun Rangarajan
Hans Michael Muller
Yuling Li
Giorgio Ascoli, George Mason University
Sridevi Polavarum
Fahim Imam
Larry Lui
Andrea Arnaud Stagg
Jonathan Cachat
Jennifer Lawrence
Svetlana Sulima
Davis Banks
Vadim Astakhov
Xufei Qian
Chris Condit
Mark Ellisman
Stephen Larson
Willie Wong
Tim Clark, Harvard University
Paolo Ciccarese
Karen Skinner, NIH, Program Officer (retired)
Jonathan Pollock, NIH, Program Officer
And my colleagues in Monarch, dkNet, 3DVC, Force 11
Data Space
Laboratory Space
Knowledge Space
BAMS
Lexicon
Encyclopedia
47/50 major preclinical published cancer studies could not be replicated
• “The scientific community assumes that the claims in a preclinical study can be taken at face value - that although there might be some errors in detail, the main message of the paper can be relied on and the data will, for the most part, stand the test of time. Unfortunately, this is not always the case.”
• Getting data out sooner in a form where they can be exposed to many eyes and many analyses may allow us to expose errors and develop better metrics to evaluate the validity of data
Begley and Ellis, 29 MARCH 2012 | VOL 483 | NATURE | 531
• Every resource is resource limited: few have enough time, money, staff or expertise required to do everything they would like
– If the market can support 11 MRI databases, fine
– Some consolidation, coordination is usually warranted
• Big, broad and messy beats small, narrow and neat
– Without trying to integrate a lot of data, we will not know what needs to be done
– Progressive refinement; addition of complexity through layers
• Be flexible and opportunistic
– A single optimal technology/container for all types of scientific data and information does not exist; technology is changing
• Think globally; act locally:
– No source, not even NIF, is THE source; we are all a source
– Think about interoperation from the inception
Regional part of nervous system
Parcellation scheme parcel
Parcellation scheme parcel
Single species or strain
Parcellation scheme
Precise definition
Technique
INCF Task Force: Alan Rutenberg, Seth Ruffins
Functional part of nervous system
Partially overlaps
Taxon rank
General hierarchy
1200 parts of nervous system characterized (mostly) according to CUMBO terms
1200 “parcels” from individual atlases/papers
700 neurons 280 via Neuron Registry
Available via NIF vocabulary services (REST)
Hosted in a Virtuoso triple store via SPARQL