Standards and Ontologies for Data Annotation
Helen Parkinson
Microarray Informatics Team
European Bioinformatics Institute
NBN-EBI Course, October 2002
Talk structure
Annotation, problems and solutions
What is an ontology?
Examples and uses of existing ontologies
ArrayExpress, a database for microarray gene expression data
Use of ontologies to annotate microarray data in ArrayExpress
Informatics resources for biologists
Over 500 databanks and analysis tools that work over various resources
Repositories of knowledge and data at various levels: primary and secondary databases, and interfaces, e.g. EMBL, Swissprot, Ensembl
Knowledge often held as free text; limited use made of controlled vocabularies
Enormous amount of semantic heterogeneity and poor query facilities
Search for “Ssp1” gene in DDBJ/EMBL/Genbank
1: AB027913 Schizosaccharomyces pombe gene for Ser/Thr protein kinase, partial cds, clone:TA76
2: AL441624 S.pombe chromosome I cosmid c110
3: AL159180 S.pombe chromosome I P1 p14E8
4: AL049609 S.pombe chromosome III cosmid c297
5: AL136235 S.pombe chromosome I cosmid c664
6: D45882 Yeast ssp1 gene for protein kinase, complete cds
7: X59987 S.pombe SSP1 gene for mitochondrial Hsp70 protein (Ssp1)
Gene synonyms
Problem: a name can identify different genes, even in a well-annotated organism like S.pombe
Ssp1=SPAC664.11 SPAC110.04c SPCC297.03
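The ambiguity can be made concrete with a small lookup sketch. The mapping reuses the three systematic identifiers listed above for Ssp1; the dictionary layout and the function names `resolve` and `is_ambiguous` are illustrative assumptions, not part of any database API.

```python
from __future__ import annotations

# Illustrative only: one gene name mapped to every systematic identifier it may denote.
# The identifiers come from the slide; the data structure is an assumption.
NAME_TO_SYSTEMATIC_IDS = {
    "ssp1": ["SPAC664.11", "SPAC110.04c", "SPCC297.03"],
}


def resolve(name: str) -> list[str]:
    """Return all systematic identifiers a gene name may refer to (case-insensitive)."""
    return NAME_TO_SYSTEMATIC_IDS.get(name.lower(), [])


def is_ambiguous(name: str) -> bool:
    """A name is ambiguous when it resolves to more than one identifier."""
    return len(resolve(name)) > 1


if __name__ == "__main__":
    print(resolve("Ssp1"))
    print(is_ambiguous("Ssp1"))
```

A free text search can only return every hit for the name; a query on a systematic identifier does not have this problem.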
Annotation problems
Free text entries in databases cause problems: they are human readable but not machine readable, and humans are error prone
Example: many genes and proteins can have the same name, even in well-annotated organisms
Many important projects have no coordination of standards, e.g. for gene naming or describing developmental stages
Whose responsibility is this? The community's?
Possible solutions
Using ontologies, like the gene ontology but covering many more areas of biology than gene products
What is an ontology, and how can ontologies be used?
Thinking about how you describe the experiment as you start it
What is an ontology?
Captures knowledge for both humans and computer applications
Has a set of vocabulary definitions that capture a community’s knowledge of a domain
‘An ontology may take a variety of forms, but necessarily it will include a vocabulary of terms, and some specification of their meaning. This includes definitions and an indication of how concepts are inter-related which collectively impose a structure on the domain and constrain the possible interpretations of terms.’
It is more than a controlled vocabulary: it has structure (but a CV is a good place to start)
What does an ontology do?
Captures knowledge
Creates a shared understanding, between humans and for computers
Makes knowledge machine processable
Makes meaning explicit, by definition and context
Range of ontologies
[Figure: a spectrum of ontology types, from lightweight to formal: Catalog/ID; Terms/glossary; Thesauri (“narrower term” relation); Informal is-a; Formal is-a; Formal instance; Frames (properties); Value restrictions; Disjointness, Inverse, part-of…; General logical constraints. Example ontologies placed along the spectrum: Gene Ontology, Mouse Anatomy, EcoCyc, TAMBIS, MGED.]
Slide from Robert Stevens, University of Manchester
Three types of ontologies
Domain-oriented, which are either domain specific (e.g. E. coli) or domain general (e.g. gene function)
Task-oriented, which are either task specific (e.g. annotation analysis) or task general (e.g. problem solving)
Generic, which capture common high level concepts, such as Physical, Abstract and Substance
How can ontologies be used?
Community reference: neutral authoring.
Either defining a database schema or defining a common vocabulary for database annotation (avoiding free text).
Providing common access to information. Ontology-based search by forming queries over databases.
Understanding database annotation and technical literature.
Guiding and interpreting analyses and hypothesis generation
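Ontology-based search can be sketched as query expansion over an is-a hierarchy: a query for a general term also retrieves records annotated with its more specific terms. The tiny hierarchy, the sample identifiers and the function names below are hypothetical and chosen only for illustration.

```python
from __future__ import annotations

# Hypothetical is-a hierarchy (child -> parent); the terms are illustrative only.
IS_A = {
    "hepatocyte": "liver cell",
    "liver cell": "cell",
    "neuron": "cell",
}


def descendants(term: str, is_a: dict[str, str] = IS_A) -> set[str]:
    """Collect every term that is, directly or indirectly, a kind of `term`."""
    found: set[str] = set()
    frontier = [term]
    while frontier:
        current = frontier.pop()
        for child, parent in is_a.items():
            if parent == current and child not in found:
                found.add(child)
                frontier.append(child)
    return found


def search(annotations: dict[str, str], term: str) -> list[str]:
    """Return sample ids annotated with `term` or with any of its descendants."""
    wanted = {term} | descendants(term)
    return sorted(sample for sample, t in annotations.items() if t in wanted)


if __name__ == "__main__":
    samples = {"s1": "hepatocyte", "s2": "neuron", "s3": "liver cell"}
    print(search(samples, "liver cell"))
```

A plain text match on "liver cell" would miss the sample annotated as "hepatocyte"; the hierarchy is what makes the broader query possible.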
Components of an ontology
Class, container for information, has a definition and a relationship to other classes (is-a, part-of, kind-of)
Instances, terms that are contained within a class
Example of a class, subclass relationship
Class def African elephant
sub-class of elephant
slot constraint comes from
slot has filler Africa
This is just a formalised way to say that African elephants are a type of elephant that comes from Africa, but in a form that is machine readable
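The same class definition can be mirrored in a few lines of code to show what "machine readable" buys. This is a minimal sketch, not OIL or any ontology language: the `OntologyClass` type and its methods are assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class OntologyClass:
    """A class with an optional parent (is-a) and slot constraints (slot -> filler)."""
    name: str
    subclass_of: Optional["OntologyClass"] = None
    slot_constraints: dict = field(default_factory=dict)

    def is_a(self, other: "OntologyClass") -> bool:
        """Walk up the subclass chain looking for `other`."""
        current: Optional[OntologyClass] = self
        while current is not None:
            if current is other:
                return True
            current = current.subclass_of
        return False

    def all_constraints(self) -> dict:
        """Constraints inherited from parent classes, overridden by this class's own."""
        inherited = self.subclass_of.all_constraints() if self.subclass_of is not None else {}
        return {**inherited, **self.slot_constraints}


elephant = OntologyClass("elephant")
african_elephant = OntologyClass(
    "African elephant",
    subclass_of=elephant,
    slot_constraints={"comes from": "Africa"},
)

if __name__ == "__main__":
    print(african_elephant.is_a(elephant))
    print(african_elephant.all_constraints())
```

Because the relationship and the constraint are data rather than prose, a program can answer "is this an elephant?" or "where does it come from?" without parsing free text.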
Examples of usable external ontologies and CVs
NCBI taxonomy database
Jackson Lab mouse strains and genes
Edinburgh mouse atlas anatomy
Chemical and compound ontologies, e.g. CAS
Species specific: fly, A.thaliana
GO
GOBO ontologies
Various pathology ontologies
ICD10
International Statistical Classification of Diseases and Related Health Problems … or what people die of
Useful information that should be included in databases, e.g. microarray, health related, DeCode
International: defines diseases etc. universally
But too much definition can be problematic…
Too much definition can be bad
ICD-9 (E826): 8
READ-2 (T30..): 81
READ-3: 87
ICD-10 (V10-19): 587
V31.22 Occupant of three-wheeled motor vehicle injured in collision with pedal cycle, person on outside of vehicle, nontraffic accident, while working for income
W65.40 Drowning and submersion while in bath-tub, street and highway, while engaged in sports activity
X35.44 Victim of volcanic eruption, street and highway, while resting, sleeping, eating or engaging in other vital activities
And coverage may not be universal
ICD10 includes accidents in space
But not accidents involving collisions between cars and moose
Most Scandinavians are more likely to be injured colliding with a moose than in orbit
Summary
Ontologies can define terms and structure knowledge for both humans and machines; many do this successfully
If over-engineered, they are no longer human readable, and it becomes too hard to use them to annotate data
Introducing ArrayExpress: a database which needs an ontology
ArrayExpress
Public database for gene expression data
Aims to store well annotated (MIAME compliant) and well structured data
MIAME (Minimum Information About a Microarray Experiment)
Recorded info should be sufficient to interpret and replicate the experiment
Information should be structured so that querying and automated data analysis and mining are feasible*
*Brazma et al., Nature Genetics, 2001
Infrastructure at the EBI
[Figure: architecture diagram. Components shown: ArrayExpress (Oracle) at the EBI; MIAMExpress (MySQL); Expression Profiler for data analysis; external bioinformatic databases; other public microarray databases (GEO, CIBEX); web (www) access for queries, data analysis and submissions; MAGE-ML submissions and MAGE-ML pipelines; data sources: array manufacturers, LIMS, microarray software, data analysis software (MAGE-ML export) and local MIAMExpress installations (MAGE-ML files).]
ArrayExpress Conceptual Model
[Figure: conceptual model diagram. Entities shown: Experiment, Hybridisation, Array, Sample, Source (e.g., Taxonomy), Gene (e.g., EMBL), Normalisation, Data, Publication, External links.]
Public Data Access
Data export as tab delimited file
Export to Expression Profiler
As MAGE-ML, from query interface
Arrays exportable as tab delimited file
Getting data in
From local LIMS system
From other microarray database, e.g. BASE, Rosetta Resolver, SMD
Via MIAMExpress, point and click tool from EBI
MIAMExpress submission and annotation tool
Based on MIAME concepts and questionnaire
Perl-CGI, MySQL database
Experiment, Array, Protocol submissions
Generic annotation tool, all experiment types
Exports MAGE-ML
Array Definition Format
Tab delimited file format describing array
Defines relationships between features and sequences
Provides sequence annotation, database references
Exportable from db too
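As a rough illustration of what a tab delimited array description allows, the sketch below reads such a file and links features to sequence database accessions. The column names (`Feature`, `Sequence`, `Database`, `Accession`) and the feature coordinates are hypothetical and are not taken from the ADF specification; the sequence identifiers and Ensembl accessions are borrowed from the annotation example later in this talk.

```python
import csv
import io

# Hypothetical ADF-like content; the real ADF column headings may differ.
ADF_TEXT = (
    "Feature\tSequence\tDatabase\tAccession\n"
    "1.1\t341310_A\tEnsembl\tENSG00000162998\n"
    "1.2\t282310_A\tEnsembl\tENSG00000131626\n"
)


def read_adf(text):
    """Parse tab delimited text into a list of row dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def feature_to_accession(rows):
    """Map each array feature to the database accession of its sequence."""
    return {row["Feature"]: (row["Database"], row["Accession"]) for row in rows}


if __name__ == "__main__":
    rows = read_adf(ADF_TEXT)
    print(feature_to_accession(rows))
```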
ArrayExpress curation effort
User support and help documentation
Curation at source (not destination)
Support on ontologies and CVs
Minimize free text, removal of synonyms
MIAME encouragement
Help on MAGE-ML
Goal: to provide high-quality, well-annotated data to allow automated data analysis
Why do we need an ontology for the database?
To help users annotate their data usefully and easily
To perform structured queries
To accurately compare data
To avoid problems with free text searching
To avoid excessive curation workload in future
Sample annotation
Gene expression data only have meaning in the context of detailed sample descriptions
If the data are going to be interpreted by independent parties, sample information has to be searchable and recorded in the database
Controlled vocabularies and ontologies are needed for unambiguous sample description, e.g. cell type, compound, species, developmental stage
None of this is trivial
MGED Biomaterial (sample) Ontology
Under construction by the MGED Ontology Working Group (OWG), using OILed
Motivated by MIAME and coordinated with the ArrayExpress database model
We are defining classes, providing constraints, and adding terms
Now being extended to describe experiments and arrays
MGED BioMaterial Internal Terms
Internal and External Terms combined
Examples of external ontologies and CVs
NCBI taxonomy database
Jackson Lab mouse strains and genes
Edinburgh mouse atlas anatomy
HUGO nomenclature for Human genes
Chemical and compound ontologies, e.g. CAS
TAIR
Flybase anatomy
GO (www.geneontology.org)
GOBO ontologies
Example Annotation
Sample source and treatment description, and its correct annotation using the MGED BioMaterial Ontology classes and corresponding external references:
“Seven week old C57BL/6N mice were treated with fenofibrate. Liver was dissected out, RNA prepared”
MGED BioMaterial Ontology classes, with instances and external references for this sample
(class: instance [external reference])
BioMaterialDescription
BiosourceProperty
Organism: Mus musculus musculus, id: 39442 [NCBI Taxonomy]
Age: 7 weeks after birth
DevelopmentStage: Stage 28 [Mouse Anatomical Dictionary]
Sex: Female
StrainOrLine: C57BL/6 [International Committee on Standardized Genetic Nomenclature for Mice]
BiosourceProvider: Charles River, Japan
OrganismPart: Liver [Mouse Anatomical Dictionary]
BioMaterialManipulation
EnvironmentalHistory
CultureCondition
Temperature: 22 ± 2 °C
Humidity: 55 ± 5%
Light: 12 hours light/dark cycle
PathogenTests: Specified pathogen free conditions
Water: ad libitum
Nutrients: MF, Oriental Yeast, Tokyo, Japan
Treatment
CompoundBasedTreatment
(Compound): Fenofibrate, CAS 49562-28-9 [ChemIDplus]
(Treatment_application): in vivo, oral gavage
(Measurement): 100 mg/kg body weight
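The annotation above could be held as structured records rather than free text, which is what makes it searchable. The sketch below is one possible representation, assuming that each value pairs with the class in the same position on the slide; the `AnnotationTerm` type and the helper functions are illustrative and are not part of the MGED ontology or of ArrayExpress.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AnnotationTerm:
    """One annotated property: ontology class, value, and optional external source."""
    ontology_class: str
    value: str
    external_reference: Optional[str] = None


# Values taken from the example slide; the class/value pairing is an assumption.
SAMPLE_ANNOTATION = [
    AnnotationTerm("Organism", "Mus musculus musculus", "NCBI Taxonomy"),
    AnnotationTerm("Age", "7 weeks after birth"),
    AnnotationTerm("Sex", "Female"),
    AnnotationTerm(
        "StrainOrLine",
        "C57BL/6",
        "International Committee on Standardized Genetic Nomenclature for Mice",
    ),
    AnnotationTerm("OrganismPart", "Liver", "Mouse Anatomical Dictionary"),
    AnnotationTerm("Compound", "Fenofibrate, CAS 49562-28-9", "ChemIDplus"),
]


def find(annotation: list[AnnotationTerm], ontology_class: str) -> list[str]:
    """Return the values recorded for a given ontology class."""
    return [t.value for t in annotation if t.ontology_class == ontology_class]


def without_external_reference(annotation: list[AnnotationTerm]) -> list[str]:
    """List classes whose values are not backed by an external vocabulary."""
    return [t.ontology_class for t in annotation if t.external_reference is None]


if __name__ == "__main__":
    print(find(SAMPLE_ANNOTATION, "OrganismPart"))
    print(without_external_reference(SAMPLE_ANNOTATION))
```

A curator could use a check like `without_external_reference` to spot fields that still rely on locally defined terms.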
Forms make this annotation easier
Sanger Human and Mouse Array Annotation Pipeline
Takes sequences present on array
Runs Exonerate (alignment algorithm) against the NCBI assembly (from Ensembl)
Inherits annotation from Ensembl: gene names, database references, GO terms
Provides a common annotation in tab delimited format, which can be parsed to MAGE-ML or used in ADF
Pipeline available for external users to beta test
Example BioSequence annotation
341310_A FRZB; 2q32.1 ENSG00000162998 FRIZZLED-RELATED PROTEIN PRECURSOR (FRZB-1) (FREZZLED) (FRITZ). [Source:SWISSPROT;Acc:Q92765]AAC50736;AAB51298;AAC51217;transmembrane receptor (F);developmental processes (P);skeletal development (P);extracellular (C);membrane (C);U24163;U91903;U68057;NM_001463;
753587_A
282310_A PPFIA1; 11q13.3 ENSG00000131626 PROTEIN TYROSINE PHOSPHATASE, RECEPTOR TYPE, F POLYPEPTIDE (PTPRF), INTERACTING PROTEIN (LIPRIN), ALPHA 1. [Source:RefSeq;Acc:NM_003626]Q13136;AAC50173; U22816; NM_003626;
770192_A LGALS9; 17q11.2 ENSG00000168961 GALECTIN-9 (HOM-HD-21) (ECALECTIN). [Source:SWISSPROT;Acc:O00182]Q8WYQ7;BAB83623;CAA88922;BAA22166;BAA31542;BAB83625;BAB83624;CAB93851;lectin (F);galactose binding lectin (F); AB040130;AB040129;Z49107;AB006782;AB005894;AJ288083;AJ288084;AJ288085;AJ288086;AJ288087;AJ288088;AJ288089;AJ288090;NM_002308;NM_009587;
30502_C DLL1; 6q27 ENSG00000112577 DELTA-LIKE PROTEIN 1 PRECURSOR (DROSOPHILA DELTA HOMOLOG 1) (DELTA1) (H-DELTA-1). [Source:SWISSPROT;Acc:O00548]Q9UJV2;Q9NU41;AAF05834;AAB61286;AAG09716;CAB89569;calcium binding (F);structural molecule (F);Notch receptor ligand (F);histogenesis and organogenesis (P);cell differentiation (P);cell communication (P);integral membrane protein (C);membrane (C);AF196571;AF003522;AF222310;AL078605;NM_005618;
stSG89231 PSCD4; 22q13.1 ENSG00000100055 CYTOHESIN 4. [Source:SWISSPROT;Acc:Q9UIA0]Q9H7Q0;BAB15718;AAF15389;AAF28896;CAB63067;guanyl-nucleotide release factor (F);ARF guanyl-nucleotide exchange factor (F);AK024428;AF075458;AF125349;Z94160;NM_013385;
stSG89236 EP300; 22q13.2 ENSG00000100393 E1A-ASSOCIATED PROTEIN P300. [Source:SWISSPROT;Acc:Q09472]AAA18639; transcription factor (F);transcription co-activator (F);protein C-terminus binding (F);transcription regulation (P);signal transduction (P);neurogenesis (P);cell cycle (P);transcription, from Pol II promoter (P);nucleus (C);U01877; NM_001429;
stSG89351 PPARA; 22q13.31 ENSG00000100406 PEROXISOME PROLIFERATOR ACTIVATED RECEPTOR ALPHA (PPAR-ALPHA). [Source:SWISSPROT;Acc:Q07869]Q9BWQ7;AAA36468;CAA68898;AAB32649;CAB42862;CAB44427;AAH00052;transcription factor (F);steroid hormone receptor (F);ligand-dependent nuclear receptor (F);peroxisome receptor (F);transcription regulation (P);transcription, from Pol II promoter (P);energy pathways (P);fatty acid metabolism (P);nucleus (C);L02932;Y07619;S74349;AL049856;AL078611;BC000052;NM_005036;
stSG89356 22q12.1 ENSG00000159873 DJ366L4.2 (NOVEL PROTEIN) (FRAGMENT). [Source:SPTREMBL;Acc:Q9UGY6]Q9UGY6;Q9ULT6;CAB63034;BAB33319; AL023494;AB051436;
stSG89488 PCQAP; 22q11.21 ENSG00000099917 POSITIVE COFACTOR 2 GLUTAMINE/Q-RICH-ASSOCIATED PROTEIN (PC2 GLUTAMINE/Q-RICH-ASSOCIATED PROTEIN) (TPA-INDUCIBLE GENE-1) (TIG-1) (CTG7A). [Source:SWISSPROT;Acc:Q96RN5]AAC12944;AAK58423;BAB85034;AAH13985;AAH07529;AAB91443;transcription regulation (P);nucleus (C);AF056191;AF328769;AK074268;BC013985;BC007529;U80745;NM_015889;
stSG89493 22q12.2 EC 1.10.2.2;ENSG00000170192 UBIQUINOL-CYTOCHROME C REDUCTASE COMPLEX 7.2 KDA PROTEIN (EC 1.10.2.2) (CYTOCHROME C1, NONHEME 7 KDA PROTEIN) (COMPLEX III SUBUNIT X) (7.2 KDA CYTOCHROME C1-ASSOCIATED PROTEIN SUBUNIT) (HSPC119). [Source:SWISSPROT;Acc:Q9UDW1]Q9P012;Q9NZY4;AAF29115;BAB20672;AAF29083;AAH05402;AAF29023;oxidoreductase (F);ubiquinol-cytochrome c reductase (F);electron transport (P);complex III (ubiquinone to cytochrome c) (P);mitochondrion (C);inner membrane (C);mitochondrial electron transport chain complex (sensu Eukarya) (C);ubiquinol-cytochrome c reduAF161500;AC004882;AB028598;AF161468;BC005402;AF161536;NM_013387;
stSG89214 22q13.31 ENSG00000075234 Q9NWP8;BAA91331; AK000706;NM_017931;
<BioSequence length="500">
  <SequenceDatabases_assnlist>
    <DatabaseEntry accession="ENSG00000162998">
      <Database_assnref>
        <Database_ref identifier="DB:Ensembl"/>
      </Database_assnref>
    </DatabaseEntry>
  </SequenceDatabases_assnlist>
  <Type_assn>
    <OntologyEntry category="biosequence:type" value="Ensembl gene"/>
  </Type_assn>
  <NameValueType category="Ensembl:Description" value="FRIZZLED-RELATED PROTEIN PRECURSOR (FRZB-1) (FREZZLED) (FRITZ)"/>
  <OntologyEntry category="GO:molecular_function" value="transcription factor"/>
</BioSequence>
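The step from a tab delimited annotation record to a MAGE-ML style fragment can be sketched with the Python standard library. The element and attribute names follow the fragment shown on the slide; the function name `biosequence_element` and its parameters are assumptions, and the output is not validated against the MAGE-ML DTD.

```python
import xml.etree.ElementTree as ET


def biosequence_element(accession, length, description):
    """Build a BioSequence element shaped like the slide's fragment (illustrative only)."""
    seq = ET.Element("BioSequence", {"length": str(length)})

    # Database cross-reference for the sequence.
    dbs = ET.SubElement(seq, "SequenceDatabases_assnlist")
    entry = ET.SubElement(dbs, "DatabaseEntry", {"accession": accession})
    ref = ET.SubElement(entry, "Database_assnref")
    ET.SubElement(ref, "Database_ref", {"identifier": "DB:Ensembl"})

    # Sequence type, expressed as an ontology entry.
    type_assn = ET.SubElement(seq, "Type_assn")
    ET.SubElement(
        type_assn,
        "OntologyEntry",
        {"category": "biosequence:type", "value": "Ensembl gene"},
    )

    # Free text description carried as a name/value pair.
    ET.SubElement(
        seq,
        "NameValueType",
        {"category": "Ensembl:Description", "value": description},
    )
    return seq


if __name__ == "__main__":
    element = biosequence_element(
        "ENSG00000162998",
        500,
        "FRIZZLED-RELATED PROTEIN PRECURSOR (FRZB-1) (FREZZLED) (FRITZ)",
    )
    print(ET.tostring(element, encoding="unicode"))
```

Building the tree with a library, rather than concatenating strings, keeps the output well-formed and handles attribute escaping.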
Future
Further data acquisition for ArrayExpress
Update ArrayExpress to newest MAGE-OM
V2.0 MIAMExpress, domain specific, portable
Further ontology development and integration into tools, use of OWL
Curation tools (other than grep, Perl scripts)
Improved query interface for AE
ArrayExpress update tool
Data exchange between public databases
Resources
Schemas for both ArrayExpress and MIAMExpress, access to code
MAGE-ML examples: Arrays, Expts, Protocols
MIAME glossary, MAGE-MIAME-ontology mappings
List of ontology resources from MGED pages
Help in establishing pipelines
MGED software MAGEstk
Curation, help and advice
www.mged.org
www.ebi.ac.uk/arrayexpress
Acknowledgments
Microarray Informatics Team, EBI
Robert Stevens, Jeremy Rogers and colleagues, University of Manchester, UK
Chris Stoeckert, University of Pennsylvania, USA
Quote
‘Most biologists would rather share their toothbrush than share a gene name’
Michael Ashburner