Standards and Ontologies for Data Annotation Helen Parkinson Microarray Informatics Team European...

Standards and Ontologies for Data Annotation

Helen Parkinson

Microarray Informatics Team

European Bioinformatics Institute

NBN-EBI Course, October 2002

Talk structure

Annotation, problems and solutions

What is an ontology?

Examples and uses of existing ontologies

ArrayExpress – a database for microarray gene expression data

Use of ontologies to annotate microarray data in ArrayExpress

Informatics resources for biologists

Over 500 databanks and analysis tools that work over various resources

Repositories of knowledge and data at various levels: primary and secondary databases, and interfaces, e.g. EMBL, Swiss-Prot, Ensembl

Knowledge often held as free text; limited use made of controlled vocabularies

Enormous amount of semantic heterogeneity and poor query facilities

Search for “Ssp1” gene in DDBJ/EMBL/GenBank

1: AB027913 Schizosaccharomyces pombe gene for Ser/Thr protein kinase, partial cds, clone:TA76

2: AL441624 S.pombe chromosome I cosmid c110

3: AL159180 S.pombe chromosome I P1 p14E8

4: AL049609 S.pombe chromosome III cosmid c297

5: AL136235 S.pombe chromosome I cosmid c664

6: D45882 Yeast ssp1 gene for protein kinase, complete cds

7: X59987 S.pombe SSP1 gene for mitochondrial Hsp70 protein (Ssp1)

Gene synonyms

Problem: a name can identify different genes, even in a well annotated organism like S. pombe

Ssp1 = SPAC664.11, SPAC110.04c, SPCC297.03
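The ambiguity can be made concrete with a small lookup. This is a minimal sketch in Python: the three systematic IDs come from the slide, while the function names `build_index` and `resolve` are illustrative only and do not belong to any existing tool.

```python
from collections import defaultdict

# Name-to-ID assignments taken from the slide: one name, three systematic IDs.
SYNONYMS = [
    ("Ssp1", "SPAC664.11"),
    ("Ssp1", "SPAC110.04c"),
    ("Ssp1", "SPCC297.03"),
]


def build_index(pairs):
    """Map each lower-cased gene name to the set of systematic IDs it may denote."""
    index = defaultdict(set)
    for name, systematic_id in pairs:
        index[name.lower()].add(systematic_id)
    return index


def resolve(name, index):
    """Return the sorted systematic IDs for a name; more than one means ambiguity."""
    return sorted(index.get(name.lower(), ()))


index = build_index(SYNONYMS)
hits = resolve("ssp1", index)
if len(hits) > 1:
    print(f"'ssp1' is ambiguous: {', '.join(hits)}")
```

A query by name alone cannot tell the three genes apart; only the systematic identifier is unambiguous.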

Annotation problems

Free text entries in databases cause problems: they are human readable, not machine readable, and humans are error prone

Example: many genes and proteins can have the same name, even in well annotated organisms

Many important projects have no coordination of standards, e.g. for gene naming or describing developmental stages

Whose responsibility is this? – community?

Possible solutions

Using ontologies, like the Gene Ontology, but covering many more areas of biology than gene products

What is an ontology, and how can ontologies be used?

Thinking about how you describe the experiment as you start it

What is an ontology?

Captures knowledge for both humans and computer applications

Has a set of vocabulary definitions that capture a community’s knowledge of a domain

“An ontology may take a variety of forms, but necessarily it will include a vocabulary of terms, and some specification of their meaning. This includes definitions and an indication of how concepts are inter-related which collectively impose a structure on the domain and constrain the possible interpretations of terms.”

It is more than a controlled vocabulary: it has structure (but a CV is a good place to start)

What does an ontology do?

Captures knowledge

Creates a shared understanding – between humans and for computers

Makes knowledge machine processable

Makes meaning explicit – by definition and context

Range of ontologies

Figure: a spectrum of ontology types, from least to most expressive: Catalog/ID; Terms/glossary; Thesauri (“narrower term” relation); Informal is-a; Formal is-a; Formal instance; Frames (properties); Value restrictions; Disjointness, inverse, part-of…; General logical constraints.

Example ontologies placed along the spectrum: Gene Ontology, Mouse Anatomy, EcoCyc, TAMBIS, MGED.

Slide from Robert Stevens, University of Manchester

Three types of ontologies

Domain-oriented, which are either domain specific (e.g. E. coli) or domain general (e.g. gene function)

Task-oriented, which are either task specific (e.g. annotation analysis) or task general (e.g. problem solving)

Generic, which capture common high level concepts, such as Physical, Abstract and Substance

How can ontologies be used?

Community reference: neutral authoring.

Either defining database schema or defining a common vocabulary for database annotation (avoiding free text).

Providing common access to information. Ontology-based search by forming queries over databases.

Understanding database annotation and technical literature.

Guiding and interpreting analyses and hypothesis generation
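Ontology-based search can be sketched as query expansion over the is-a hierarchy: a query for a general term also returns records annotated with more specific terms. The hierarchy, the record IDs and the function names below are hypothetical and are not drawn from any real ontology release.

```python
# Hypothetical is-a hierarchy (child -> parent); illustrative terms only.
IS_A = {
    "Ser/Thr protein kinase": "protein kinase",
    "Tyr protein kinase": "protein kinase",
    "protein kinase": "kinase",
}

# Hypothetical annotated records: record id -> ontology term.
ANNOTATIONS = {
    "rec1": "Ser/Thr protein kinase",
    "rec2": "protein kinase",
    "rec3": "kinase",
    "rec4": "Tyr protein kinase",
}


def ancestors(term):
    """Yield the term itself and every term above it via is-a."""
    while term is not None:
        yield term
        term = IS_A.get(term)


def search(query_term):
    """Return ids of records annotated with query_term or any more specific term."""
    return sorted(
        rec for rec, term in ANNOTATIONS.items() if query_term in ancestors(term)
    )


print(search("protein kinase"))
```

A free text search for "protein kinase" would depend on how each submitter happened to word the entry; with a hierarchy, the relationship between terms is explicit and the query does not depend on wording.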

Components of an ontology

Class: a container for information, with a definition and relationships to other classes (is-a, part-of, kind-of)

Instances: terms that are contained within a class

Example of a class, subclass relationship

Class def African elephant

sub-class of elephant

slot constraint comes from

slot has filler Africa

This is just a formalised way of saying that African elephants are a type of elephant that comes from Africa, but in a form that is machine readable
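The African elephant definition can be mirrored in a few lines of Python to show what "machine readable" buys. This is a sketch under stated assumptions: `OntologyClass`, `is_a` and `slot_value` are hypothetical names invented for illustration, not the data model of OilEd or any other ontology editor.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class OntologyClass:
    """A class with an optional parent (sub-class of) and slot fillers."""

    name: str
    parent: Optional["OntologyClass"] = None
    slots: dict = field(default_factory=dict)

    def is_a(self, other):
        """True if this class is `other` or a (transitive) sub-class of it."""
        node = self
        while node is not None:
            if node is other:
                return True
            node = node.parent
        return False

    def slot_value(self, slot):
        """Look up a slot filler here or, failing that, on an ancestor class."""
        node = self
        while node is not None:
            if slot in node.slots:
                return node.slots[slot]
            node = node.parent
        return None


elephant = OntologyClass("elephant")
african_elephant = OntologyClass(
    "African elephant", parent=elephant, slots={"comes from": "Africa"}
)

print(african_elephant.is_a(elephant))
print(african_elephant.slot_value("comes from"))
```

Once the statement is held as structure rather than as a sentence, a program can answer "is this an elephant?" or "where does it come from?" without parsing text.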

Examples of usable external ontologies and CVs

NCBI taxonomy database

Jackson Lab mouse strains and genes

Edinburgh mouse atlas anatomy

Chemical and compound ontologies, e.g. CAS

Species specific: fly, A. thaliana

GO

GOBO ontologies

Various pathology ontologies

ICD10

International Statistical Classification of Diseases and Related Health Problems… or what people die of

Useful information that should be included in databases, e.g. microarray, health related, deCODE

International: defines disease etc. universally

But too much definition can be problematic…

Too much definition can be bad

ICD-9 (E826): 8

READ-2 (T30..): 81

READ-3: 87

ICD-10 (V10-19): 587

V31.22 Occupant of three-wheeled motor vehicle injured in collision with pedal cycle, person on outside of vehicle, nontraffic accident, while working for income

W65.40 Drowning and submersion while in bath-tub, street and highway, while engaged in sports activity

X35.44 Victim of volcanic eruption, street and highway, while resting, sleeping, eating or engaging in other vital activities

And coverage may not be universal

ICD10 includes accidents in space

But not accidents involving collisions between cars and moose

Most Scandinavians are more likely to be injured colliding with a moose than in orbit

Summary

Ontologies can define terms and structure knowledge for both humans and machines; many do this successfully

If over-engineered, they are no longer human readable, and it becomes too hard to use them to annotate data

Introducing ArrayExpress - a database which needs an ontology

ArrayExpress

Public database for gene expression data

Aims to store well annotated (MIAME compliant) and well structured data

MIAME

Recorded info should be sufficient to interpret and replicate the experiment

Information should be structured so that querying and automated data analysis and mining are feasible*

*Brazma et al., Nature Genetics, 2001

Infrastructure at the EBI

Figure: data flow around ArrayExpress. Components shown: ArrayExpress (Oracle); MIAMExpress (MySQL); Expression Profiler; other public microarray databases (GEO, CIBEX); external bioinformatic databases; array manufacturers; LIMS; microarray software; data analysis software; local MIAMExpress installations. Labelled flows: submissions, MAGE-ML, MAGE-ML export, MAGE-ML files, MAGE-ML pipelines; queries and data analysis via the web (www).

ArrayExpress Conceptual Model

Figure: entities in the model: Experiment, Hybridisation, Array, Sample, Source (e.g., Taxonomy), Gene (e.g., EMBL), Data, Normalisation, Publication, External links.

Public Data Access

Data export as tab delimited file

Export to Expression Profiler

As MAGE-ML, from query interface

Arrays exportable as tab delimited file

Getting data in

From local LIMS system

From other microarray database, e.g. BASE, Rosetta Resolver, SMD

Via MIAMExpress, point and click tool from EBI

MIAMExpress submission and annotation tool

Based on MIAME concepts and questionnaire

Perl-CGI, MySQL database

Experiment, Array, Protocol submissions

Generic annotation tool, all expt types

Exports MAGE-ML

Array Definition Format

Tab delimited file format describing array

Defines relationships between features and sequences

Provides sequence annotation, database references

Exportable from db too
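A tab delimited array description can be read with standard tooling. The sketch below assumes a made-up layout: the column names `Feature`, `Sequence`, `Database` and `Accession` and all the values are hypothetical placeholders, not the real Array Definition Format column headings.

```python
import csv
import io
from collections import defaultdict

# Hypothetical tab delimited content; columns and values are placeholders.
ADF_TEXT = (
    "Feature\tSequence\tDatabase\tAccession\n"
    "1.1.1\tseqA\tEMBL\tACC_A\n"
    "1.1.2\tseqA\tEMBL\tACC_A\n"
    "1.1.3\tseqB\tEMBL\tACC_B\n"
)


def features_by_sequence(adf_text):
    """Group feature identifiers by the sequence spotted at each feature."""
    reader = csv.DictReader(io.StringIO(adf_text), delimiter="\t")
    grouped = defaultdict(list)
    for row in reader:
        grouped[row["Sequence"]].append(row["Feature"])
    return dict(grouped)


def database_references(adf_text):
    """Collect the distinct (sequence, database, accession) references."""
    reader = csv.DictReader(io.StringIO(adf_text), delimiter="\t")
    return sorted({(r["Sequence"], r["Database"], r["Accession"]) for r in reader})


print(features_by_sequence(ADF_TEXT))
print(database_references(ADF_TEXT))
```

The point of the format is the feature-to-sequence relationship: several features can carry the same sequence, and each sequence carries its own database references.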

ArrayExpress curation effort

User support and help documentation

Curation at source (not destination)

Support on ontologies and CVs

Minimize free text, removal of synonyms

MIAME encouragement

Help on MAGE-ML

Goal: to provide high-quality, well-annotated data to allow automated data analysis

Why do we need an ontology for the database?

To help users annotate their data usefully and easily

To perform structured queries

To accurately compare data

To avoid problems with free text searching

To avoid excessive curation workload in future

Sample annotation

Gene expression data only have meaning in the context of detailed sample descriptions

If the data are going to be interpreted by independent parties, sample information has to be searchable and recorded in the database

Controlled vocabularies and ontologies are needed for unambiguous sample description, e.g. cell type, compound, species, developmental stage

None of this is trivial
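Checking submitted values against a controlled vocabulary is one way to keep free text out of sample descriptions. The sketch below assumes a tiny, hypothetical vocabulary for organism part; the terms, the synonym table and the function names are invented for illustration.

```python
# Hypothetical controlled vocabulary for organism part, with synonyms.
PREFERRED = {"liver", "kidney", "brain"}
SYNONYMS = {"hepatic tissue": "liver", "renal tissue": "kidney"}


def normalise(value):
    """Return the preferred term for a submitted value, or None if unknown."""
    key = " ".join(value.lower().split())  # ignore case and extra whitespace
    if key in PREFERRED:
        return key
    return SYNONYMS.get(key)


def check(values):
    """Split submitted values into accepted (mapped to terms) and rejected."""
    accepted, rejected = {}, []
    for value in values:
        term = normalise(value)
        if term is None:
            rejected.append(value)
        else:
            accepted[value] = term
    return accepted, rejected


accepted, rejected = check(["Liver", "hepatic  tissue", "livr"])
print(accepted)
print(rejected)
```

Rejected values go back to the submitter at the point of entry, which is cheaper than correcting them by curation after the data are in the database.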

MGED Biomaterial (sample) Ontology

Under construction by the MGED Ontology Working Group (OWG), using OilEd

Motivated by MIAME and coordinated with the ArrayExpress database model

We are defining classes, providing constraints, and adding terms

Now being extended to describe experiments and arrays

MGED BioMaterial Internal Terms

Internal and External Terms combined

Examples of external ontologies and CVs

NCBI taxonomy database

Jackson Lab mouse strains and genes

Edinburgh mouse atlas anatomy

HUGO nomenclature for Human genes

Chemical and compound ontologies, e.g. CAS

TAIR

FlyBase anatomy

GO (www.geneontology.org)

GOBO ontologies

Example Annotation

Sample source and treatment description, and its correct annotation using the MGED BioMaterial Ontology classes and corresponding external references:

“Seven week old C57BL/6N mice were treated with fenofibrate. Liver was dissected out, RNA prepared”

MGED BioMaterial Ontology classes, listed in the order shown on the slide, with the instance used for this sample and the external reference where one is given:

BioMaterialDescription
BiosourceProperty
Organism: Mus musculus musculus, id: 39442 (external reference: NCBI Taxonomy)
Age: 7 weeks after birth
DevelopmentStage: Stage 28 (external reference: Mouse Anatomical Dictionary)
Sex: Female
StrainOrLine: C57BL/6 (external reference: International Committee on Standardized Genetic Nomenclature for Mice)
BiosourceProvider: Charles River, Japan
OrganismPart: Liver (external reference: Mouse Anatomical Dictionary)
BioMaterialManipulation
EnvironmentalHistory
CultureCondition
Temperature: 22 ± 2 °C
Humidity: 55 ± 5%
Light: 12 hours light/dark cycle
PathogenTests: Specified pathogen free conditions
Water: ad libitum
Nutrients: MF, Oriental Yeast, Tokyo, Japan
Treatment
CompoundBasedTreatment
(Compound): Fenofibrate, CAS 49562-28-9 (external reference: ChemIDplus)
(Treatment_application): in vivo, oral gavage
(Measurement): 100 mg/kg body weight
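Part of this annotation can be written as structured data, to show how a free text sentence turns into class, value and source triples. The values below are a subset taken from the slide; the `Annotation` type and the helper functions are hypothetical and are not part of the MGED ontology or of MAGE.

```python
from typing import NamedTuple, Optional


class Annotation(NamedTuple):
    ontology_class: str            # MGED BioMaterial Ontology class
    value: str                     # instance chosen for this sample
    source: Optional[str] = None   # external reference, if any


# Subset of the mouse liver example from the slide.
SAMPLE = [
    Annotation("Organism", "Mus musculus musculus (id: 39442)", "NCBI Taxonomy"),
    Annotation("Age", "7 weeks after birth"),
    Annotation("DevelopmentStage", "Stage 28", "Mouse Anatomical Dictionary"),
    Annotation("Sex", "Female"),
    Annotation(
        "StrainOrLine",
        "C57BL/6",
        "International Committee on Standardized Genetic Nomenclature for Mice",
    ),
    Annotation("OrganismPart", "Liver", "Mouse Anatomical Dictionary"),
    Annotation("Compound", "Fenofibrate, CAS 49562-28-9", "ChemIDplus"),
]


def externally_referenced(sample):
    """Classes whose value is backed by an external ontology or vocabulary."""
    return [a.ontology_class for a in sample if a.source is not None]


def find(sample, ontology_class):
    """Return the value recorded for a class, or None if it is absent."""
    for annotation in sample:
        if annotation.ontology_class == ontology_class:
            return annotation.value
    return None


print(externally_referenced(SAMPLE))
print(find(SAMPLE, "OrganismPart"))
```

With the sample held this way, a query such as "all liver samples" becomes a lookup on `OrganismPart` rather than a text search through descriptions.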

Forms make this annotation easier

Sanger Human and Mouse Array Annotation Pipeline

Takes sequences present on array

Exonerate (alignment algorithm) against the NCBI assembly (from Ensembl)

Inherits annotation from Ensembl: gene names, database references, GO terms

Provides a common annotation in tab delimited format, which can be parsed to MAGE-ML or used in ADF

Pipeline available for external users to beta test

Example BioSequence annotation

341310_A FRZB; 2q32.1 ENSG00000162998 FRIZZLED-RELATED PROTEIN PRECURSOR (FRZB-1) (FREZZLED) (FRITZ). [Source:SWISSPROT;Acc:Q92765] AAC50736;AAB51298;AAC51217; transmembrane receptor (F);developmental processes (P);skeletal development (P);extracellular (C);membrane (C); U24163;U91903;U68057;NM_001463;

753587_A282310_A PPFIA1; 11q13.3 ENSG00000131626 PROTEIN TYROSINE PHOSPHATASE, RECEPTOR TYPE, F POLYPEPTIDE (PTPRF), INTERACTING PROTEIN (LIPRIN), ALPHA 1. [Source:RefSeq;Acc:NM_003626] Q13136;AAC50173; U22816; NM_003626;

770192_A LGALS9; 17q11.2 ENSG00000168961 GALECTIN-9 (HOM-HD-21) (ECALECTIN). [Source:SWISSPROT;Acc:O00182] Q8WYQ7;BAB83623;CAA88922;BAA22166;BAA31542;BAB83625;BAB83624;CAB93851; lectin (F);galactose binding lectin (F); AB040130;AB040129;Z49107;AB006782;AB005894;AJ288083;AJ288084;AJ288085;AJ288086;AJ288087;AJ288088;AJ288089;AJ288090;NM_002308;NM_009587;

30502_C DLL1; 6q27 ENSG00000112577 DELTA-LIKE PROTEIN 1 PRECURSOR (DROSOPHILA DELTA HOMOLOG 1) (DELTA1) (H-DELTA-1). [Source:SWISSPROT;Acc:O00548] Q9UJV2;Q9NU41;AAF05834;AAB61286;AAG09716;CAB89569; calcium binding (F);structural molecule (F);Notch receptor ligand (F);histogenesis and organogenesis (P);cell differentiation (P);cell communication (P);integral membrane protein (C);membrane (C); AF196571;AF003522;AF222310;AL078605;NM_005618;

stSG89231 PSCD4; 22q13.1 ENSG00000100055 CYTOHESIN 4. [Source:SWISSPROT;Acc:Q9UIA0] Q9H7Q0;BAB15718;AAF15389;AAF28896;CAB63067; guanyl-nucleotide release factor (F);ARF guanyl-nucleotide exchange factor (F); AK024428;AF075458;AF125349;Z94160;NM_013385;

stSG89236 EP300; 22q13.2 ENSG00000100393 E1A-ASSOCIATED PROTEIN P300. [Source:SWISSPROT;Acc:Q09472] AAA18639; transcription factor (F);transcription co-activator (F);protein C-terminus binding (F);transcription regulation (P);signal transduction (P);neurogenesis (P);cell cycle (P);transcription, from Pol II promoter (P);nucleus (C); U01877; NM_001429;

stSG89351 PPARA; 22q13.31 ENSG00000100406 PEROXISOME PROLIFERATOR ACTIVATED RECEPTOR ALPHA (PPAR-ALPHA). [Source:SWISSPROT;Acc:Q07869] Q9BWQ7;AAA36468;CAA68898;AAB32649;CAB42862;CAB44427;AAH00052; transcription factor (F);steroid hormone receptor (F);ligand-dependent nuclear receptor (F);peroxisome receptor (F);transcription regulation (P);transcription, from Pol II promoter (P);energy pathways (P);fatty acid metabolism (P);nucleus (C); L02932;Y07619;S74349;AL049856;AL078611;BC000052;NM_005036;

stSG89356 22q12.1 ENSG00000159873 DJ366L4.2 (NOVEL PROTEIN) (FRAGMENT). [Source:SPTREMBL;Acc:Q9UGY6] Q9UGY6;Q9ULT6;CAB63034;BAB33319; AL023494;AB051436;

stSG89488 PCQAP; 22q11.21 ENSG00000099917 POSITIVE COFACTOR 2 GLUTAMINE/Q-RICH-ASSOCIATED PROTEIN (PC2 GLUTAMINE/Q-RICH-ASSOCIATED PROTEIN) (TPA-INDUCIBLE GENE-1) (TIG-1) (CTG7A). [Source:SWISSPROT;Acc:Q96RN5] AAC12944;AAK58423;BAB85034;AAH13985;AAH07529;AAB91443; transcription regulation (P);nucleus (C); AF056191;AF328769;AK074268;BC013985;BC007529;U80745;NM_015889;

stSG89493 22q12.2 EC 1.10.2.2;ENSG00000170192 UBIQUINOL-CYTOCHROME C REDUCTASE COMPLEX 7.2 KDA PROTEIN (EC 1.10.2.2) (CYTOCHROME C1, NONHEME 7 KDA PROTEIN) (COMPLEX III SUBUNIT X) (7.2 KDA CYTOCHROME C1-ASSOCIATED PROTEIN SUBUNIT) (HSPC119). [Source:SWISSPROT;Acc:Q9UDW1] Q9P012;Q9NZY4;AAF29115;BAB20672;AAF29083;AAH05402;AAF29023; oxidoreductase (F);ubiquinol-cytochrome c reductase (F);electron transport (P);complex III (ubiquinone to cytochrome c) (P);mitochondrion (C);inner membrane (C);mitochondrial electron transport chain complex (sensu Eukarya) (C);ubiquinol-cytochrome c reduAF161500;AC004882;AB028598;AF161468;BC005402;AF161536;NM_013387;

stSG89214 22q13.31 ENSG00000075234 Q9NWP8;BAA91331; AK000706;NM_017931;
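In these records each GO term carries a one-letter suffix: (F) for molecular function, (P) for biological process and (C) for cellular component. The following sketch groups the terms of one record by aspect; the input string is copied from the FRZB record, and the function name `group_go_terms` is illustrative only.

```python
import re

ASPECTS = {
    "F": "molecular function",
    "P": "biological process",
    "C": "cellular component",
}

# A term name followed by its aspect code in parentheses, e.g. "membrane (C)".
TERM = re.compile(r"^(?P<name>.+?)\s*\((?P<aspect>[FPC])\)$")


def group_go_terms(field):
    """Split a semicolon-separated GO field and group term names by aspect."""
    grouped = {label: [] for label in ASPECTS.values()}
    for item in field.split(";"):
        item = item.strip()
        if not item:
            continue  # skip the empty piece after a trailing semicolon
        match = TERM.match(item)
        if match:
            grouped[ASPECTS[match.group("aspect")]].append(match.group("name"))
    return grouped


frzb_terms = (
    "transmembrane receptor (F);developmental processes (P);"
    "skeletal development (P);extracellular (C);membrane (C);"
)
print(group_go_terms(frzb_terms))
```

Splitting on semicolons rather than commas matters here, because some term names, such as "transcription, from Pol II promoter", contain commas themselves.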

<BioSequence length="500">
  <SequenceDatabases_assnlist>
    <DatabaseEntry accession="ENSG00000162998">
      <Database_assnref>
        <Database_ref identifier="DB:Ensembl"/>
      </Database_assnref>
    </DatabaseEntry>
  </SequenceDatabases_assnlist>
  <Type_assn>
    <OntologyEntry category="biosequence:type" value="Ensembl gene"/>
  </Type_assn>
  <NameValueType category="Ensembl:Description" value="FRIZZLED-RELATED PROTEIN PRECURSOR (FRZB-1) (FREZZLED) (FRITZ)"/>
  <OntologyEntry category="GO:molecular_function" value="transmembrane receptor"/>
</BioSequence>
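Once a fragment like this is well-formed XML, it can be read with any XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` on a simplified copy of the fragment; the element and attribute names follow the slide, the copy is not a complete or validated MAGE-ML document, and the `summarise` function is hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed copy of the slide's fragment (not a full MAGE-ML file).
FRAGMENT = """
<BioSequence length="500">
  <SequenceDatabases_assnlist>
    <DatabaseEntry accession="ENSG00000162998">
      <Database_assnref>
        <Database_ref identifier="DB:Ensembl"/>
      </Database_assnref>
    </DatabaseEntry>
  </SequenceDatabases_assnlist>
  <Type_assn>
    <OntologyEntry category="biosequence:type" value="Ensembl gene"/>
  </Type_assn>
</BioSequence>
"""


def summarise(xml_text):
    """Extract length, database accession and ontology entries from a fragment."""
    root = ET.fromstring(xml_text)
    entry = root.find("SequenceDatabases_assnlist/DatabaseEntry")
    database = entry.find("Database_assnref/Database_ref")
    return {
        "length": int(root.get("length")),
        "accession": entry.get("accession"),
        "database": database.get("identifier"),
        "ontology": {
            e.get("category"): e.get("value") for e in root.iter("OntologyEntry")
        },
    }


print(summarise(FRAGMENT))
```

The same annotation that a curator reads in the tab delimited file is reachable here by element path, which is what makes automated exchange between databases practical.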

Future

Further data acquisition for ArrayExpress

Update ArrayExpress to newest MAGE-OM

V2.0 MIAMExpress, domain specific, portable

Further ontology development and integration into tools, use of OWL

Curation tools (other than grep, Perl scripts)

Improved query interface for AE

ArrayExpress update tool

Data exchange between public databases

Resources

Schemas for both ArrayExpress and MIAMExpress, access to code

MAGE-ML examples, Arrays, Expts, Protocols

MIAME glossary, MAGE-MIAME-ontology mappings

List of ontology resources from MGED pages

Help in establishing pipelines

MGED software MAGEstk

Curation, help and advice

www.mged.org

www.ebi.ac.uk/arrayexpress

Acknowledgments

Microarray Informatics Team, EBI

Robert Stevens, Jeremy Rogers and colleagues, University of Manchester, UK

Chris Stoeckert, University of Pennsylvania, USA

Quote

‘Most biologists would rather share their toothbrush than share a gene name’

Michael Ashburner

Date post:	30-Dec-2015