Post on 23-Feb-2016
T. Hamp & L. Richter
Protein Prediction II Exercise
Exercise – Project Layout
General remarks – recap: Report 60 pts, Exam 40 pts, weekly presentations by each group, one bad presentation allowed, groups of 3-4 students
Contact & Questions: pp2ex@rostlab.org only!
The exercise is taken from the CAFA competition
Prediction of HPO terms
HPO: Human phenotype ontology
Terms – Definitions and Explanations
Amino acids (aa): Building blocks of proteins; 20 different aa are found in proteins
Protein sequence: String of characters representing a sequence of amino acids (string from a 20-letter alphabet)
The protein sequence defines the protein structure and the protein function (within some limits)
Protein sequences are stored in large publicly available repositories
One of the best-known repositories is UniProt (http://www.uniprot.org/) and its section Swiss-Prot
Besides the sequence, these databases also hold additional information about the protein
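The "string from a 20-letter alphabet" idea can be sketched in a few lines of Python. This is only an illustration: the names `STANDARD_AA`, `is_standard_protein` and `nonstandard_letters` are made up here, and the check deliberately rejects the ambiguity and rare codes (such as X, B, Z or U) that real database entries may contain.

```python
# The 20 standard amino acids in one-letter code.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def is_standard_protein(seq: str) -> bool:
    """Return True if seq is non-empty and uses only the 20 standard letters."""
    seq = seq.strip().upper()
    return len(seq) > 0 and set(seq) <= STANDARD_AA


def nonstandard_letters(seq: str) -> set:
    """Return the letters in seq that are not among the 20 standard codes."""
    return set(seq.strip().upper()) - STANDARD_AA
```

A check like this may be useful as a first sanity test before feeding sequences into a prediction method.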
Ontology (in information science)
Ontology: An ontology represents knowledge as a set of concepts within a domain, using a shared vocabulary to denote types, properties and interrelationships of those concepts
Human Phenotype Ontology (HPO): Set of concepts describing human appearance (shape, health, and so forth)
HPO concepts are hierarchically ordered, i.e. there is an "is-a" relationship between them
They are arranged in a tree-like fashion
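The "is-a" hierarchy can be illustrated with a small Python sketch. The term names below are hypothetical and are not real HPO identifiers; the point is only that a term annotated to a protein implies all of its more general ancestor terms. The sketch allows more than one parent per term, which is one reason to speak of "tree-like" rather than a strict tree.

```python
# Toy "is-a" hierarchy: child term -> list of parent terms.
# The term names are hypothetical and NOT real HPO identifiers.
IS_A = {
    "root": [],
    "abnormality_of_limbs": ["root"],
    "abnormality_of_hand": ["abnormality_of_limbs"],
    "short_finger": ["abnormality_of_hand"],
    "abnormality_of_skeleton": ["root"],
}


def ancestors(term, is_a=IS_A):
    """Return all terms reachable from term via is-a edges (term itself excluded)."""
    seen = set()
    stack = list(is_a.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, []))
    return seen


def propagate(terms, is_a=IS_A):
    """Close a set of terms under is-a by adding every ancestor of every term."""
    closed = set(terms)
    for term in terms:
        closed |= ancestors(term, is_a)
    return closed
```

With this toy hierarchy, a protein annotated with `short_finger` would also count as annotated with `abnormality_of_hand`, `abnormality_of_limbs` and `root`.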
Our competition
Proteins are annotated (described) with experimentally determined information
As time goes by: Proteins are associated with information about experimentally confirmed effects on the human phenotype
The associated terms are taken from the Human Phenotype Ontology
Experimental determination is slow and expensive
=> we try to predict associated HPO terms for the yet un-annotated proteins
More formal steps
Find a function that assigns a set of HPO terms T to a sequence s so that the number of false assignments is minimal and the number of true assignments is maximal
Remember: The true evaluation is done after submission, when sequences that are not yet annotated get experimentally determined annotations
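One way to make "few false assignments, many true assignments" concrete is to compare the predicted term set with the experimentally determined term set for a sequence and compute precision and recall. The sketch below assumes this set-based view; it is not necessarily the scoring used in the competition, and the function name `score_prediction` and the term labels in the usage note are hypothetical.

```python
def score_prediction(predicted, true):
    """Compare a predicted term set with the true term set for one sequence."""
    predicted, true = set(predicted), set(true)
    tp = len(predicted & true)   # true assignments: predicted and correct
    fp = len(predicted - true)   # false assignments: predicted but wrong
    fn = len(true - predicted)   # missed terms: correct but not predicted
    precision = tp / (tp + fp) if predicted else 0.0
    recall = tp / (tp + fn) if true else 0.0
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}
```

For example, predicting the terms {a, b, c} when the true terms are {b, c, d, e} gives two true assignments, one false assignment and two missed terms. Minimising false assignments corresponds to high precision, maximising true assignments to high recall.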
Tasks
Download files from www.rostlab.org/~richter/pp2_files.tgz
Get familiar with the provided files
Especially the column names (look them up at UniProt and HPO)
Read: http://biofunctionprediction.org/sites/default/files/IntroductionCAFA_pedja.pdf
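A quick way to get a first look at the files and their column names is to print the first line of every file in the downloaded archive. The sketch below uses only Python's standard library; the function name `peek_archive` is made up, and it assumes the archive contains plain text files, which has not been checked here.

```python
import io
import tarfile


def peek_archive(fileobj, n_lines=1):
    """Map each regular file in a .tgz archive to its first n_lines text lines."""
    preview = {}
    with tarfile.open(fileobj=fileobj, mode="r:gz") as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue  # skip directories and links
            handle = tar.extractfile(member)
            text = io.TextIOWrapper(handle, encoding="utf-8", errors="replace")
            lines = []
            for _ in range(n_lines):
                line = text.readline()
                if not line:
                    break
                lines.append(line.rstrip("\n"))
            preview[member.name] = lines
    return preview


# Possible usage, assuming the archive was saved as pp2_files.tgz:
# with open("pp2_files.tgz", "rb") as fh:
#     for name, lines in peek_archive(fh).items():
#         print(name, lines)
```

If a file has a header row, its first line should show the column names that can then be looked up at UniProt and HPO.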