June 2006

The many uses for information content over bio-ontologies

Phillip Lord

School of Computing Science

Newcastle University

Summary

• The purpose and status of ontologies within biology.

• The motivation for measures.

• Our original information content measures and their validation.

• More recent use cases and our initial attempts to solve these.

The problems of biology

• Biology has few formal representations and few grand theories.

• Most of the knowledge is represented as text.

UniProt: a protein database?

ID PRIO_HUMAN STANDARD; PRT; 253 AA.
AC P04156;
DT 01-NOV-1986 (Rel. 03, Created)
DT 01-NOV-1986 (Rel. 03, Last sequence update)
DT 20-AUG-2001 (Rel. 40, Last annotation update)
DE Major prion protein precursor (PrP) (PrP27-30) (PrP33-35C) (ASCR).
GN PRNP.
OS Homo sapiens (Human).
OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
OX NCBI_TaxID=9606;
RN [1]
RP SEQUENCE FROM N.A.
RX MEDLINE=86300093; PubMed=3755672;
RA Kretzschmar H.A., Stowring L.E., Westaway D., Stubblebine W.H.,
RA Prusiner S.B., Dearmond S.J.;
RT "Molecular cloning of a human prion protein cDNA.";
RL DNA 5:315-324(1986).
RN [2]
RP SEQUENCE OF 8-253 FROM N.A.
RX MEDLINE=86261778; PubMed=3014653;
RA Liao Y.-C.J., Lebo R.V., Clawson G.A., Smuckler E.A.;
RT "Human prion protein cDNA: molecular cloning, chromosomal mapping,
RT and biological implications.";
RL Science 233:364-367(1986).
RN [3]
RP SEQUENCE OF 58-85 AND 111-150 (VARIANT AMYLOID GSS).
RX MEDLINE=91160504; PubMed=1672107;
RA Tagliavini F., Prelli F., Ghiso J., Bugiani O., Serban D.,
RA Prusiner S.B., Farlow M.R., Ghetti B., Frangione B.;
RT "Amyloid protein of Gerstmann-Straussler-Scheinker disease (Indiana
RT kindred) is an 11 kd fragment of prion protein with an N-terminal
RT glycine at codon 58.";
RL EMBO J. 10:513-519(1991).
RN [4]
RP STRUCTURE BY NMR OF 118-221.
RX MEDLINE=20359708; PubMed=10900000;
RA Calzolai L., Lysek D.A., Guntert P., von Schroetter C., Riek R.,
RA Zahn R., Wuethrich K.;
RT "NMR structures of three single-residue variants of the human prion
RT protein.";
RL Proc. Natl. Acad. Sci. U.S.A. 97:8340-8345(2000).
CC -!- FUNCTION: THE FUNCTION OF PRP IS NOT KNOWN. PRP IS ENCODED IN THE
CC HOST GENOME AND IS EXPRESSED BOTH IN NORMAL AND INFECTED CELLS.
CC -!- SUBUNIT: PRP HAS A TENDENCY TO AGGREGATE YIELDING POLYMERS CALLED
CC "RODS".
CC -!- SUBCELLULAR LOCATION: ATTACHED TO THE MEMBRANE BY A GPI-ANCHOR.
CC -!- POLYMORPHISM: THE FIVE TANDEM OCTAPEPTIDE REPEATS REGION IS HIGHLY
CC UNSTABLE. INSERTIONS OR DELETIONS OF OCTAPEPTIDE REPEAT UNITS ARE
CC ASSOCIATED TO PRION DISEASE.

FT SIGNAL 1 22
FT CHAIN 23 230 MAJOR PRION PROTEIN.
FT PROPEP 231 253 REMOVED IN MATURE FORM (BY SIMILARITY).
FT LIPID 230 230 GPI-ANCHOR (BY SIMILARITY).
FT CARBOHYD 181 181 N-LINKED (GLCNAC...) (PROBABLE).
FT DISULFID 179 214 BY SIMILARITY.
FT DOMAIN 51 91 5 X 8 AA TANDEM REPEATS OF P-H-G-G-G-W-G-
FT Q.
FT REPEAT 51 59 1.
FT REPEAT 60 67 2.
FT REPEAT 68 75 3.
FT REPEAT 76 83 4.
FT REPEAT 84 91 5.
FT IN PATIENTS WHO HAVE A PRP MUTATION AT
FT CODON 178: PATIENTS WITH MET DEVELOP FFI,
FT THOSE WITH VAL DEVELOP CJD).
FT /FTId=VAR_006467.
FT VARIANT 171 171 N -> S (IN SCHIZOAFFECTIVE DISORDER).
FT /FTId=VAR_006468.
FT VARIANT 178 178 D -> N (IN FFI AND CJD).
FT /FTId=VAR_006469.
FT VARIANT 180 180 V -> I (IN CJD).
FT /FTId=VAR_006470.
FT VARIANT 183 183 T -> A (IN FAMILIAL SPONGIFORM
FT ENCEPHALOPATHY).
FT /FTId=VAR_006471.
FT VARIANT 187 187 H -> R (IN GSS).
FT /FTId=VAR_008746.
FT VARIANT 188 188 T -> K (IN EOAD; DEMENTIA ASSOCIATED TO
FT PRION DISEASES).
FT /FTId=VAR_008748.
FT VARIANT 188 188 T -> R.
FT /FTId=VAR_008747.
FT VARIANT 196 196 E -> K (IN CJD).
FT /FTId=VAR_008749.
FT /FTId=VAR_006472.
SQ SEQUENCE 253 AA; 27661 MW; 43DB596BAAA66484 CRC64;
MANLGCWMLV LFVATWSDLG LCKKRPKPGG WNTGGSRYPG QGSPGGNRYP PQGGGGWGQP
HGGGWGQPHG GGWGQPHGGG WGQPHGGGWG QGGGTHSQWN KPSKPKTNMK HMAGAAAAGA
VVGGLGGYML GSAMSRPIIH FGSDYEDRYY RENMHRYPNQ VYYRPMDEYS NQNNFVHDCV
NITIKQHTVT TTTKGENFTE TDVKMMERVV EQMCITQYER ESQAYYQRGS SMVLFSSPPV
ILLISFLIFL IVG
//

CC -!- DISEASE: PRP IS FOUND IN HIGH QUANTITY IN THE
CC BRAIN OF HUMANS AND ANIMALS INFECTED
CC WITH NEURODEGENERATIVE DISEASES KNOWN AS
CC TRANSMISSIBLE SPONGIFORM ENCEPHALOPATHIES OR PRION
CC DISEASES, LIKE: CREUTZFELDT-JAKOB DISEASE (CJD),
CC GERSTMANN-STRAUSSLER SYNDROME (GSS), FATAL
CC FAMILIAL INSOMNIA (FFI) AND KURU IN HUMANS;
CC SCRAPIE IN SHEEP AND GOAT; BOVINE SPONGIFORM
CC ENCEPHALOPATHY (BSE) IN CATTLE; TRANSMISSIBLE
CC MINK ENCEPHALOPATHY (TME); CHRONIC WASTING
CC DISEASE (CWD) OF MULE DEER AND ELK; FELINE
CC SPONGIFORM ENCEPHALOPATHY (FSE) IN CATS AND
CC EXOTIC UNGULATE ENCEPHALOPATHY (EUE) IN
CC NYALA AND GREATER KUDU. THE PRION DISEASES
CC ILLUSTRATE THREE MANIFESTATIONS OF CNS
CC DEGENERATION: (1) INFECTIOUS (2)
CC SPORADIC AND (3) DOMINANTLY INHERITED FORMS.
CC TME, CWD, BSE, FSE, EUE ARE ALL THOUGHT TO
CC OCCUR AFTER CONSUMPTION OF PRION-INFECTED
CC FOODSTUFFS.
DR EMBL; M13667; AAA19664.1; -.
DR EMBL; M13899; AAA60182.1; -.
DR EMBL; D00015; BAA00011.1; -.
DR PIR; A05017; A05017.
DR PIR; A24173; A24173.
DR PIR; S14078; S14078.
DR PDB; 1E1G; 20-JUL-00.
DR PDB; 1E1J; 20-JUL-00.
DR PDB; 1E1P; 20-JUL-00.
DR PDB; 1E1S; 21-JUL-00.
DR PDB; 1E1U; 20-JUL-00.
DR PDB; 1E1W; 20-JUL-00.
DR MIM; 176640; -.
DR MIM; 123400; -.
DR MIM; 137440; -.
DR MIM; 245300; -.
DR MIM; 600072; -.
DR MIM; 604920; -.
DR InterPro; IPR000817; Prion.
DR Pfam; PF00377; prion; 1.
DR PRINTS; PR00341; PRION.
DR SMART; SM00157; PRP; 1.
DR PROSITE; PS00291; PRION_1; 1.
DR PROSITE; PS00706; PRION_2; 1.
KW Prion; Brain; Glycoprotein; GPI-anchor; Repeat; Signal;
KW 3D-structure; Polymorphism; Disease mutation.

The purpose of GO

“The original intent of the group was to construct a set of vocabularies comprising terms that we could share with a common understanding of the meaning, and that could support cross-database queries”

The Gene Ontology

• One solution to this is the Gene Ontology (GO).

• Uses informal representation (DAG)

• Has only two label types (is-a and part-of)

Searching GO

• It is possible to search for a GO term or all of its children.

• But what about sibling terms?

• Is it possible to define entities as being related?

• How semantically similar are two GO terms?

Edge Distance Measures

• The closer two terms are, the more similar.

• “photoreceptor” and “transmembrane receptor” share a common parent.

• But so do “signal transducer” (the parent of the other two) and “chaperone”.

• Surely the first two are closer?
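
The problem with plain edge counting can be illustrated with a minimal Python sketch. The toy graph is an assumption made for illustration: only the four term names and the statement that each pair shares a parent come from the slide, while "molecular function" as the shared parent of the second pair is assumed. `edge_distance` is a hypothetical helper, not part of any GO library.

```python
from collections import deque

# Toy is-a graph: child -> list of parents (hypothetical fragment, not real GO).
PARENTS = {
    "photoreceptor": ["signal transducer"],
    "transmembrane receptor": ["signal transducer"],
    "signal transducer": ["molecular function"],
    "chaperone": ["molecular function"],
    "molecular function": [],
}


def edge_distance(a, b, parents=PARENTS):
    """Shortest number of edges between a and b, ignoring edge direction."""
    # Build an undirected adjacency map from the child -> parents map.
    neighbours = {term: set(ps) for term, ps in parents.items()}
    for child, ps in parents.items():
        for parent in ps:
            neighbours[parent].add(child)
    # Breadth-first search starting from a.
    queue, seen = deque([(a, 0)]), {a}
    while queue:
        term, dist = queue.popleft()
        if term == b:
            return dist
        for nxt in neighbours[term] - seen:
            seen.add(nxt)
            queue.append((nxt, dist + 1))
    return None  # the two terms are not connected


deep_pair = edge_distance("photoreceptor", "transmembrane receptor")
shallow_pair = edge_distance("signal transducer", "chaperone")
```

In this toy graph both pairs are two edges apart, so unweighted edge counting cannot tell them apart.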

Edge Counting with Weights

• Weight the edges to scale the distance calculation.

• “high-affinity tryptophan transporter” is 14 terms deep.

• “anticoagulant” is only 3 terms deep.

• Hand-annotating GO with edge weights would be a significant task.

• Even if we knew how to do it.

How is GO used?

• Can we use the information in the corpus?

• Can we define similarity extensionally rather than intensionally?

Alpha Mating Factor

Rosetta Inpharmatics: Pubs: Signaling and Circuitry of Multiple MAPK Pathways...

Zymo Research's new products are for E. coli transformation, bubble-free gel casting,

ALPHA-MATING FACTOR H-TRP-HIS-TRP-LEU-GLN-LEU-LYS-PRO-GLY-GLN-PRO-MET-TYR-OH. Yeast P values The alpha project @ tMSI: Mating response

Substance used by yeast to indicate that they are in an appropriate state for mating

Sex Pheromone

Primal Instinct Pheromones - Pheromone The secret formula to get girls!

PHEROMONE POWER human sex pheromones PHEROMONE POWER The most

powerfull love potion! Human Pheromone the proven ingredient PHEROMONE ATTRACTION building self confidence PHEROMONE ATTRACTION Primal Instinct pheromones - Incredible Learn the art of SEDUCTION. All Free Information. sex pheromone --aphrodisiac -- pheromone smell !!

Information Content and GO

• We define p(c) as the number of times a term c, or any of its children, occurs in the annotation corpus, divided by the number of times any term occurs.
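
A minimal Python sketch of this definition follows. The graph and the annotation corpus are invented toy data, and `ancestors` and `term_probabilities` are hypothetical helper names. "Children" is read here as all descendants, which is an assumption about the intended meaning.

```python
from collections import Counter

# Toy is-a graph: child -> list of parents (hypothetical fragment, not real GO).
PARENTS = {
    "photoreceptor": ["signal transducer"],
    "transmembrane receptor": ["signal transducer"],
    "signal transducer": ["molecular function"],
    "chaperone": ["molecular function"],
    "molecular function": [],
}

# Hypothetical corpus: one entry per time a term is used to annotate a protein.
ANNOTATIONS = [
    "photoreceptor",
    "transmembrane receptor", "transmembrane receptor",
    "signal transducer",
    "chaperone", "chaperone", "chaperone", "chaperone",
]


def ancestors(term, parents):
    """All terms reachable by following parent links, excluding term itself."""
    found, stack = set(), list(parents[term])
    while stack:
        t = stack.pop()
        if t not in found:
            found.add(t)
            stack.extend(parents[t])
    return found


def term_probabilities(annotations, parents):
    """p(c) = (uses of c or any descendant of c) / (total number of term uses)."""
    counts = Counter({term: 0 for term in parents})
    for term in annotations:
        counts[term] += 1
        # A use of a term also counts once for each of its ancestors.
        for anc in ancestors(term, parents):
            counts[anc] += 1
    total = len(annotations)
    return {term: counts[term] / total for term in parents}


P = term_probabilities(ANNOTATIONS, PARENTS)
```

Because every use of a term also counts towards its ancestors, p(c) never decreases when moving up the graph, and the root has p(c) = 1.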

Probabilities to Similarities

• We define the probability of the minimum subsumer, pms(c1,c2), as:

$p_{ms}(c_1, c_2) = \min_{c \in S(c_1, c_2)} \{ p(c) \}$

• where S(c1,c2) is the set of all shared parents of c1 and c2.

…to similarities

sim(c1,c2) = - ln pms(c1,c2)

• After Resnik (1995)
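
The two definitions can be combined into a short Python sketch. The graph and the probabilities are hypothetical toy values, and `subsumers` and `resnik_similarity` are illustrative names. One further assumption: a term is treated as subsuming itself, as in Resnik's formulation, whereas the slide speaks only of shared parents.

```python
import math

# Toy is-a graph: child -> list of parents (hypothetical fragment, not real GO).
PARENTS = {
    "photoreceptor": ["signal transducer"],
    "transmembrane receptor": ["signal transducer"],
    "signal transducer": ["molecular function"],
    "chaperone": ["molecular function"],
    "molecular function": [],
}

# Hypothetical term probabilities p(c) for the toy graph.
P = {
    "photoreceptor": 0.125,
    "transmembrane receptor": 0.25,
    "signal transducer": 0.5,
    "chaperone": 0.5,
    "molecular function": 1.0,
}


def subsumers(term, parents):
    """The term itself plus all of its ancestors."""
    found, stack = set(), [term]
    while stack:
        t = stack.pop()
        if t not in found:
            found.add(t)
            stack.extend(parents[t])
    return found


def resnik_similarity(c1, c2, p=P, parents=PARENTS):
    """sim(c1, c2) = -ln p_ms(c1, c2), p_ms being the lowest p(c) among shared subsumers."""
    shared = subsumers(c1, parents) & subsumers(c2, parents)
    p_ms = min(p[c] for c in shared)
    return -math.log(p_ms)


receptors = resnik_similarity("photoreceptor", "transmembrane receptor")
top_level = resnik_similarity("signal transducer", "chaperone")
```

With these toy numbers the two receptor terms score ln 2 (about 0.69), while "signal transducer" and "chaperone" share only the root and score 0.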

Validation

• This is fine, but does it work?

• Does it return sensible results?

• Does it return results that a biologist thinks are sensible?

• If two genes are similar, then they should have similar annotation.

Validation

• For each protein, perform sequence similarity search, take top 50 hits.

• Compare these hits.

• Plot sequence vs semantic similarity.

Hunting for Annotation Errors

A spermine synthase is annotated as a “spermidine synthase” (GO:0004766).

The mis-annotations reported here stem from the dataset incorporated from manual GO annotation by Proteome Inc., and extracted via LocusLink (E. Camon, pers. comm.). For all of those protein pairs which identified a misannotation,

Summary

• The Gene Ontology annotation provides a large corpus

• We can use information content based measures to define a similarity measure

• The measure conforms to biological intuition.

• We have used this measure to isolate annotation errors.

Secretory Proteins in Bacillus

• Bacillus is a widespread, variable bacterial genus.

• Most famous are subtilis (for being dull) and anthracis (for being deadly).

• Secretory proteins are often important in pathogenesis.

Secretome Workflow

• Shaded boxes indicate the set of secreted proteins. The number in brackets represents the total number of proteins to be classified at each level, across the 12 Bacillus species.

Embarrassment of Riches

• We now have too many proteins to handle.

• What is the average function? How are the functions distributed among known functions?

Summarising GO

• GO Slims:- these are defined subsets of GO. Different subsets for different species.

• Ranks:- GO is a graph, so ranks aren’t really sensible; it’s not clear that they mean anything anyway.

Information Content Based Measures

• Choose a threshold, p(t)

• Define a subset of GO where p(c) < p(t) and, for all parents of c, p(parent of c) > p(t)

• All terms with p(c) < p(t) should have at least one parent in this subset.

• All terms with p(c) > p(t) are probably dull anyway

• Should be (very) approx 1/p(t) terms in subset.
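
The subset definition above can be sketched in a few lines of Python. The graph and probabilities are hypothetical toy values, and `threshold_subset` is an illustrative name rather than an existing tool.

```python
# Toy is-a graph: child -> list of parents (hypothetical fragment, not real GO).
PARENTS = {
    "photoreceptor": ["signal transducer"],
    "transmembrane receptor": ["signal transducer"],
    "signal transducer": ["molecular function"],
    "chaperone": ["molecular function"],
    "molecular function": [],
}

# Hypothetical term probabilities p(c) for the toy graph.
P = {
    "photoreceptor": 0.125,
    "transmembrane receptor": 0.25,
    "signal transducer": 0.5,
    "chaperone": 0.5,
    "molecular function": 1.0,
}


def threshold_subset(p_t, p=P, parents=PARENTS):
    """Terms with p(c) < p_t whose parents all have p(parent) > p_t."""
    return {
        c
        for c in parents
        if p[c] < p_t and all(p[parent] > p_t for parent in parents[c])
    }


fine_grained = threshold_subset(0.3)
coarse = threshold_subset(0.6)
```

Lowering the threshold moves the cut further down the graph: in this toy example 0.6 selects "signal transducer" and "chaperone", while 0.3 selects the two receptor terms beneath "signal transducer".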

[Figure: “Terms”. Number of Terms (y-axis, 0 to 1200) against Threshold (x-axis, 0.9 down to 0.00001).]

• Many more terms at each threshold than expected

• Most of the variation occurs between thresholds of 0.1 and 0.001

• Number of leaf terms with p(c) > p(t) gives

Acknowledgments

Similarity

• Robert Stevens, Andy Brass, Carole Goble, University of Manchester

Bacillus

• Tracy Craddock, Anil Wipat, Colin Harwood, Newcastle University

