Bioinformatics approaches for…


Teresa K Attwood
Faculty of Life Sciences & School of Computer Science
University of Manchester, Oxford Road
Manchester M13 9PT, UK

http://www.bioinf.man.ac.uk/dbbrowser/

…analysing GPCRs…
…which craft is best?

Overview
• What are GPCRs?
  – why they’re interesting & important
  – why bioinformatics approaches are important
• In silico function prediction – a reality check
• Family-based methods for characterising GPCRs
• Understanding the tools
  – problems with pair-wise & family-based approaches
  – estimating (biological) significance
• Seeking deeper functional insights
• Conclusions

What are GPCRs?
G protein-coupled receptors
[Figure labels: GDP, GTP]
• A functionally diverse family of cell-surface 7TM proteins
• Functional diversity achieved via
  – interaction with a variety of ligands
  – stimulation of various intracellular pathways via coupling to different G proteins

Why are GPCRs interesting?
Attwood, TK & Flower, DR (2002) Trawling the genome for G protein-coupled receptors: the importance of integrating bioinformatic approaches. In Drug Design – Cutting Edge Approaches, pp.60-71.
• They are ubiquitous
  – >800 GPCR genes in the human genome, from 3 major superfamilies
    • rhodopsin-, secretin- & metabotropic glutamate receptor-like
• Share almost no sequence similarity
  – but are united by common 7TM architecture
• Constitute a complex multi-gene family
  – populated by >50 families & >350 subtypes

Isn’t just stamp collecting!
Attwood, TK & Flower, DR (2002) Trawling the genome for G protein-coupled receptors: the importance of integrating bioinformatic approaches. In Drug Design – Cutting Edge Approaches, pp.60-71.
• GPCRs are of profound biomedical importance
  – targets for >50% of prescription drugs
  – yield sales >$16 billion/annum
    • they’re big business!
• Given their importance, we need to
  – characterise the ones we know about
  – identify new ones
• & discover what they do!
  – e.g., as potential new drug targets

Why studying GPCRs is difficult
• Only 2 crystal structures available
  – bovine rhodopsin (2000) & human β2-adrenergic receptor (2007)
• Many GPCRs haven’t been characterised experimentally
  – remain ‘orphans’, with unknown ligand specificity
• With >800 human GPCRs, this isn’t much to go on!

Why use bioinformatics approaches?
• Computational approaches are important
  – can be used to help identify, characterise & model novel receptors
    • usually by similarity & extrapolation of known characteristics
• Bioinformatics thus offers complementary tools for elucidating the structures & functions of receptors
• But the task is non-trivial
  – GPCRs exhibit rich relationships & complex molecular interactions
    • present many challenges for in silico analysis
  – in trying to derive meaningful functional insights, traditional methods are likely to be limited

[Figure: GPCR signalling pathways. Ligand classes: biogenic amines, amino acids, lipids, peptides, proteins, light, others. G protein subunits: α, β, γ (GTP-bound). Downstream components: Src, Grb2, Shc, Sos, Ras, Rap, RasGRF, Raf1, B-Raf, PI3Kγ, PLCβ, PKC, PKA, Ca2+. Outcome: regulation of gene expression in the nucleus.]

We’ve been using biology-unaware search tools to analyse such complex systems.
How far can we truly expect to understand cellular function with such naïve approaches…?

In silico function prediction
…a reality check
• What is the function of this structure?
• What is the function of this sequence?
• What is the function of this motif?
  – the fold provides a scaffold, which can be decorated in different ways by different sequences to confer different functions
  – knowing the fold & function allows us to rationalise how the structure effects its function at the molecular level
“A test case for structural genomics Structure-based assignment of the biochemical function of hypothetical protein mj0577” (Zarembinski et al., PNAS 95 1998)
Although the structure co-crystallised with ATP, the biochemical function of the protein is unknown

What's in a sequence?

Methods for family analysis
Attwood, TK (2000). The quest to deduce protein function from sequence: the role of pattern databases. Int. J. Biochem. Cell Biol., 32(2), 139–155.
• Single motif methods
  – Fuzzy regex (eMOTIF)
  – Exact regex (PROSITE)
• Full domain alignment methods
  – Profiles (Profile Library)
  – HMMs (Pfam)
• Multiple motif methods
  – Identity matrices (PRINTS)
  – Weight matrices (Blocks)

The challenge of family analysis
• highly divergent family with single function?
• superfamily with many diverse functional families?
  – must distinguish if function analysis done in silico
  – a tough challenge!

In the beginning was PROSITE
[GSTALIVMFYWC]-[GSTANCPDE]-{EDPKRH}-x(2)-[LIVMNQGA]-x(2)-[LIVMFT]-[GSTANC]-[LIVMFYWSTAC]-[DENH]-R
[Figure label: TM domain]
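A PROSITE pattern is close to a regular expression, so it can be translated mechanically. The following is a minimal sketch, assuming the full PS00237 pattern as quoted in the PROSITE entry; the function names `prosite_to_regex` and `scan`, and the toy segment, are hypothetical and are not part of any PROSITE software.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE PA-line pattern into a Python regular expression.

    Handles residue classes [..], exclusions {..}, the wildcard x and
    repeat counts such as x(2) or x(2,4). Terminal anchors (< and >) are
    not handled in this sketch.
    """
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core in ("x", "X"):
            core = "."                       # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # any residue except these
        # [..] classes and single letters are already valid regex
        if low and high:
            core += "{%s,%s}" % (low, high)
        elif low:
            core += "{%s}" % low
        parts.append(core)
    return "".join(parts)

# Pattern text as given on the two PA lines of PS00237.
PS00237 = ("[GSTALIVMFYWC]-[GSTANCPDE]-{EDPKRH}-x(2)-[LIVMNQGA]-x(2)-[LIVMFT]-"
           "[GSTANC]-[LIVMFYWSTAC]-[DENH]-R-[FYWCSH]-x(2)-[LIVM].")
PS00237_REGEX = re.compile(prosite_to_regex(PS00237))

def scan(sequence: str):
    """Return (start, matched segment) for each non-overlapping match."""
    return [(m.start(), m.group()) for m in PS00237_REGEX.finditer(sequence)]

# Hypothetical toy segment, built by hand to satisfy every pattern position.
TOY_MATCH = "GSAAALAALGLDRFAAL"
print(scan("MK" + TOY_MATCH + "QQ"))
```

The match is all-or-nothing: changing the single conserved R in the toy segment removes the hit entirely, which is the behaviour behind the false negatives discussed for this pattern.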

Diagnostic limitations of PROSITE
ID   G_PROTEIN_RECEP_F1_1; PATTERN.
AC   PS00237;
DT   APR-1990 (CREATED); NOV-1997 (DATA UPDATE); SEP-2004 (INFO UPDATE).
DE   G-protein coupled receptors family 1 signature.
PA   [GSTALIVMFYWC]-[GSTANCPDE]-{EDPKRH}-x(2)-[LIVMNQGA]-x(2)-[LIVMFT]-
PA   [GSTANC]-[LIVMFYWSTAC]-[DENH]-R-[FYWCSH]-x(2)-[LIVM].
NR   /RELEASE=44.6,159201;
NR   /TOTAL=1622(1621); /POSITIVE=1530(1529); /UNKNOWN=0(0);
NR   /FALSE_POS=92(92); /FALSE_NEG=261; /PARTIAL=61;
• This represents an apparent 22% error rate
  – the actual rate is probably higher
• Thus, a match to a pattern is not necessarily true
  – & a mis-match is not necessarily false!
• False-negatives are a fundamental limitation to this type of pattern matching
  – if you don't know what you're looking for, you'll never know you missed it!
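The 22% figure can be reproduced from the NR lines of the entry. The sketch below assumes the error rate is (false positives + false negatives) divided by the total number of pattern matches; that reading is an assumption, chosen because it reproduces the quoted figure. Precision and sensitivity are added for comparison and do not appear on the slide.

```python
# Counts copied from the NR lines of PS00237 (release 44.6).
TRUE_POS = 1530    # /POSITIVE
FALSE_POS = 92     # /FALSE_POS
FALSE_NEG = 261    # /FALSE_NEG
TOTAL = 1622       # /TOTAL = true positives + false positives

# Assumed definition: wrong or missed assignments relative to all matches.
apparent_error_rate = (FALSE_POS + FALSE_NEG) / TOTAL

# Standard definitions, shown for comparison only.
precision = TRUE_POS / (TRUE_POS + FALSE_POS)
sensitivity = TRUE_POS / (TRUE_POS + FALSE_NEG)

print(f"apparent error rate: {apparent_error_rate:.1%}")
print(f"precision:           {precision:.1%}")
print(f"sensitivity:         {sensitivity:.1%}")
```

The 61 partial matches are left out of this calculation, and false negatives can only be counted for sequences already known to be family members, which is one reason the true rate is probably higher.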

Where do motifs (fingerprints) fit in?
(fingerprints are hierarchical)
[Figure labels: loop region, TM domain, TM domain]

Rhodopsin-like superfamily, family & subtype GPCRs in PRINTS
Attwood, TK (2001) A compendium of specific motifs for diagnosing GPCR subtypes. TiPS, 22(4), 162-165.

Searching PRINTS - FingerPRINTScan
Scordis, P, Flower, DR & Attwood, TK (1999) FingerPRINTScan: intelligent searching of the PRINTS motif database. Bioinformatics, 15, 523-524.
• GPCR fingerprints are embedded in PRINTS
  – allows diagnosis of GPCR mosaics
[Figure labels: N, C]

Visualising fingerprints
Attwood, TK & Findlay, JBC (1993) Design of a discriminating fingerprint for G-protein-coupled receptors. Protein Eng., 6(2), 167–176.


Diagnosing partial matches
• Missed by PROSITE
  – wasn’t annotated as a FN
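Why a multiple-motif fingerprint can still diagnose a sequence that a single pattern misses can be sketched with a toy example. This is not the FingerPRINTScan algorithm: the motifs, the sequence, the 0.7 threshold and the function names are all hypothetical, and the scoring is a plain residue-count matrix chosen only to illustrate the idea of a partial match.

```python
from collections import Counter

def frequency_matrix(aligned_segments):
    """Per-column residue counts from ungapped, equal-length motif segments."""
    return [Counter(column) for column in zip(*aligned_segments)]

def best_hit(matrix, sequence):
    """Return (score, start) of the best window; score is a fraction of the maximum."""
    width = len(matrix)
    max_score = sum(max(col.values()) for col in matrix)
    best = (0.0, -1)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(col[res] for col, res in zip(matrix, window)) / max_score
        if score > best[0]:
            best = (score, start)
    return best

def scan_fingerprint(motifs, sequence, threshold=0.7):
    """List (motif index, start) for motifs that score well and occur in order."""
    hits, last_start = [], -1
    for index, segments in enumerate(motifs):
        score, start = best_hit(frequency_matrix(segments), sequence)
        if score >= threshold and start > last_start:
            hits.append((index, start))
            last_start = start
    return hits

# Hypothetical three-motif fingerprint and a sequence carrying only two motifs.
TOY_MOTIFS = [
    ["GNLLV", "GNVLV", "GNLLI"],
    ["DRYLA", "DRYVA", "ERYLA"],
    ["NPIIY", "NPLIY", "NPVIY"],
]
TOY_SEQUENCE = "MKTGNLLVAAAAAAADRYLAQQQQQWWWWW"

hits = scan_fingerprint(TOY_MOTIFS, TOY_SEQUENCE)
print(f"{len(hits)} of {len(TOY_MOTIFS)} motifs matched: {hits}")
```

A single-pattern method built on the third motif would report nothing for this sequence, whereas the fingerprint reports a partial match on two of three motifs in the right order.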

An integrated approach
Mulder, NJ, Apweiler, R, Attwood, TK, Bairoch, A et al. (2007) New developments in InterPro. NAR, 35, D224-8.
• To simplify sequence analysis, the family dbs were integrated within a unified annotation resource – InterPro
  – initial partners were PRINTS, PROSITE, profiles & Pfam
    • now many more partners
  – linked to its satellite dbs
    • but lags behind their coverage
  – by Oct 2007, it had 14,768 entries & covered 76% of UniProtKB
    • major role in fly & human genome annotation

InterPro – method comparison

Where has this got us?

Understanding the tools
…estimating significance
• How do we know what to believe?
• Let’s explore some of the difficulties that arise when pair-wise search tools (BLAST & FastA) & family-based methods are used naïvely
  – these examples caution us to think about what the results actually mean in biological terms…

Identifying sequence similarity
• GPCRs present many challenges for in silico functional analysis
• Several signature-based methods now available
  – with different areas of optimum application
• Yet naïve, pair-wise similarity searching has been the mainstay of functional annotation efforts
  – it allows us to identify/quantify relationships between sequences

• But quantifying similarity between sequences is not the same as identifying their functions

Problems with pairwise similarity tools
Gaulton, A & Attwood, TK (2003) Bioinformatics approaches for the classification of G protein-coupled receptors. Current Opinion in Pharmacology, 3, 114-120.
• For identifying precise families to which receptors belong & the ligands they bind, pair-wise tools are limited
  – at what level of seq ID is ligand specificity conserved?
    • some GPCRs with 25% ID share a common ligand;
    • others, with greater levels, don’t…
• It may be impossible to tell from BLAST if an orphan belongs to a known family (the top hit), or if it will bind a novel ligand
  – e.g., for the now de-orphaned UR2R, BLAST indicates most similarity to the type 4 SSRs, yet it is known to bind a different (related) ligand

When is a GPCR not an SSR?
Query length: 389 AA
Date run: 2002-10-18 09:08:29 UTC+0100 on sib-blast.unil.ch
Taxon: Homo sapiens
Database: XXswissprot
120,412 sequences; 45,523,583 total letters
SWISS-PROT Release 40.29 of 10-Oct-2002

Db AC     Description                                                Score  E-value
sp Q9UKP6 Q9UKP6 Orphan receptor [Homo sapiens...                      782  0.0
sp P31391 SSR4_HUMAN Somatostatin receptor type 4 (SS4R) [SSTR4]...    167  3e-41
sp O43603 GALS_HUMAN Galanin receptor type 2 (GAL2-R) (GALR2) [G...    147  4e-35
sp P30872 SSR1_HUMAN Somatostatin receptor type 1 (SS1R) (SRIF-2...    144  3e-34
sp P32745 SSR3_HUMAN Somatostatin receptor type 3 (SS3R) (SSR-28...    140  3e-33
sp P35346 SSR5_HUMAN Somatostatin receptor type 5 (SS5R) (SSTR5)...    140  6e-33
sp P30874 SPLICE ISOFORM B of P30874 [SSTR2] [Homo sapiens...          134  3e-31
sp P30874 SSR2_HUMAN Somatostatin receptor type 2 (SS2R) (SRIF-1...    134  3e-31
sp P48145 GPR7_HUMAN Neuropeptides B/W receptor type 1 (G protei...    133  7e-31
sp O60755 GALT_HUMAN Galanin receptor type 3 (GAL3-R) (GALR3) [G...    132  2e-30
sp P41143 OPRD_HUMAN Delta-type opioid receptor (DOR-1) [OPRD1] ...    128  2e-29
sp P35372 SPLICE ISOFORM 1A of P35372 [OPRM1] [Homo sapien...          125  1e-28
sp P35372 OPRM_HUMAN Mu-type opioid receptor (MOR-1) [OPRM1] [Ho...    125  1e-28

When is a GPCR not an SSR?
…when it’s a UR2R
Query length: 389 AA
Date run: 2002-10-18 09:08:29 UTC+0100 on sib-blast.unil.ch
Taxon: Homo sapiens
Database: XXswissprot
120,412 sequences; 45,523,583 total letters
SWISS-PROT Release 40.29 of 10-Oct-2002

Db AC     Description                                                Score  E-value
sp Q9UKP6 UR2R_HUMAN Urotensin II receptor (UR-II-R) [GPR14] [Ho...    782  0.0
sp P31391 SSR4_HUMAN Somatostatin receptor type 4 (SS4R) [SSTR4]...    167  3e-41
sp O43603 GALS_HUMAN Galanin receptor type 2 (GAL2-R) (GALR2) [G...    147  4e-35
sp P30872 SSR1_HUMAN Somatostatin receptor type 1 (SS1R) (SRIF-2...    144  3e-34
sp P32745 SSR3_HUMAN Somatostatin receptor type 3 (SS3R) (SSR-28...    140  3e-33
sp P35346 SSR5_HUMAN Somatostatin receptor type 5 (SS5R) (SSTR5)...    140  6e-33
sp P30874 SPLICE ISOFORM B of P30874 [SSTR2] [Homo sapiens...          134  3e-31
sp P30874 SSR2_HUMAN Somatostatin receptor type 2 (SS2R) (SRIF-1...    134  3e-31
sp P48145 GPR7_HUMAN Neuropeptides B/W receptor type 1 (G protei...    133  7e-31
sp O60755 GALT_HUMAN Galanin receptor type 3 (GAL3-R) (GALR3) [G...    132  2e-30
sp P41143 OPRD_HUMAN Delta-type opioid receptor (DOR-1) [OPRD1] ...    128  2e-29
sp P35372 SPLICE ISOFORM 1A of P35372 [OPRM1] [Homo sapien...          125  1e-28
sp P35372 OPRM_HUMAN Mu-type opioid receptor (MOR-1) [OPRM1] [Ho...    125  1e-28

[Figure: two plots, UR2R_HUMAN vs SOMATOSTANR and UR2R_HUMAN vs UROTENSIN2R; x-axis: residue number, 1 to 380]

The trouble with top hits

• The most statistically significant hit is not always the most biologically relevant

• Yet many rule-based ‘expert systems’ still rely on top BLAST or FastA hits to make their diagnoses

• BLAST/FastA ‘see’ generic similarity & not the often-subtle differences that constitute the functional determinants between closely-related receptor families & subtypes

• Failure to appreciate this fundamental point has generated numerous annotation errors in our databases
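One way to avoid leaning on the top hit alone is to summarise the whole hit list by family. The sketch below uses the E-values from the UR2R BLAST report shown earlier (self-hit excluded); the family labels are hand-assigned from the description lines and the function names are hypothetical.

```python
from collections import defaultdict

# (entry, family label, E-value) from the BLAST report for the UR2R query.
# Family labels are an assumption of this sketch, assigned by hand from
# the description lines; they are not part of the BLAST output.
HITS = [
    ("SSR4_HUMAN", "somatostatin", 3e-41),
    ("GALS_HUMAN", "galanin", 4e-35),
    ("SSR1_HUMAN", "somatostatin", 3e-34),
    ("SSR3_HUMAN", "somatostatin", 3e-33),
    ("SSR5_HUMAN", "somatostatin", 6e-33),
    ("SSR2_HUMAN isoform B", "somatostatin", 3e-31),
    ("SSR2_HUMAN", "somatostatin", 3e-31),
    ("GPR7_HUMAN", "neuropeptides B/W", 7e-31),
    ("GALT_HUMAN", "galanin", 2e-30),
    ("OPRD_HUMAN", "opioid", 2e-29),
    ("OPRM_HUMAN isoform 1A", "opioid", 1e-28),
    ("OPRM_HUMAN", "opioid", 1e-28),
]

def summarise_by_family(hits):
    """Count hits and record the best (smallest) E-value per family."""
    summary = defaultdict(lambda: {"count": 0, "best_e": float("inf")})
    for _name, family, e_value in hits:
        entry = summary[family]
        entry["count"] += 1
        entry["best_e"] = min(entry["best_e"], e_value)
    return dict(summary)

def family_switches(hits):
    """Count how often the family label changes going down the ranked list."""
    ranked = sorted(hits, key=lambda hit: hit[2])
    labels = [family for _name, family, _e in ranked]
    return sum(1 for a, b in zip(labels, labels[1:]) if a != b)

for family, stats in summarise_by_family(HITS).items():
    print(f"{family:18s} hits={stats['count']}  best E={stats['best_e']:.0e}")
print("family switches in ranked list:", family_switches(HITS))
```

Four families cleanly separated by score would give three switches; more than that means the families interleave, so the ranking by itself does not single out one ligand class. The summary still cannot reveal that the query binds urotensin II, which is the point of the slide.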

[Figure labels: “-opioid receptor” (three times, subtype prefixes lost in extraction), “true”]

Misleading annotation via FastA
• As we’ve seen, it’s tempting to use top hits from BLAST or FastA results to classify unknown proteins
  – but this may lead us (& especially computer programs) to false functional conclusions
• PSI-BLAST is more sensitive than BLAST, because it creates a profile from hits above a given threshold
  – but this too can cause problems
  – let’s take a closer look

Misleading results from BLAST

So, is UL78 a GPCR?
& if so, what sort?

What PSI-BLAST said
(profile dilution in action)

What GeneQuiz said…
a thrombin receptor

What GeneQuiz said later…

Overview of results
pair-wise & family-based methods

What is UL78?
Columns: Tool | No hit | Poor hit | Significant hit
BLAST: GPCRs in list
PSI-BLAST: thrombin receptor; chemokine & opioid receptors
PROSITE profile: GPCR
PRINTS
Blocks-PRINTS: GPCR
GeneQuiz: thrombin receptor; C5A receptor

Bioinformatics tools, alone, cannot tell us!

So, beware top hits
…but also beware bottom hits!
Let us now compare & contrast some InterPro results with those of its source dbs…

Rhodopsin-like superfamily GPCRs in InterPro 2005
IPR000276  GPCR_Rhodopsn         7752 proteins
PS50262    G_PROTEIN_RECEP_F1_2  7702 proteins
PF00001    7tm_1                 7064 proteins
PS00237    G_PROTEIN_RECEP_F1_1  6527 proteins
PR00237    GPCRRHODOPSN          5821 proteins (don’t include partials)

Rhodopsin-like superfamily GPCRs in the source databases
Pfam: FP ?, FN ?, U ?, TP ? 8776; matches 7064
PROSITE (profile): FP 3, FN 3, U 12, TP 1837; matches 7702
PROSITE (regex): FP 92, FN 261, U 0, TP 1530; matches 6527
PRINTS: FP 0, FN ?, U 0, TP 1154; matches 5821 (>2165 updated)

Rhodopsin-like superfamily GPCRs in InterPro 2007
IPR000276  GPCR_Rhodopsn         16,845 proteins
PS50262    G_PROTEIN_RECEP_F1_2  16,714 proteins
PF00001    7tm_1                 15,712 proteins
PR00237    GPCRRHODOPSN          13,405 proteins
PS00237    G_PROTEIN_RECEP_F1_1  13,723 proteins

No human curator has time to validate all these matches…
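The scale of the validation problem can be made concrete from the two sets of InterPro counts. This is a rough sketch, assuming that every protein matched by a member signature is also counted under the parent entry IPR000276; the function name `coverage` is hypothetical.

```python
# Protein counts per signature, copied from the 2005 and 2007 InterPro slides.
INTERPRO_2005 = {"IPR000276": 7752, "PS50262": 7702, "PF00001": 7064,
                 "PS00237": 6527, "PR00237": 5821}
INTERPRO_2007 = {"IPR000276": 16845, "PS50262": 16714, "PF00001": 15712,
                 "PS00237": 13723, "PR00237": 13405}

def coverage(counts, parent="IPR000276"):
    """Fraction of the parent entry's proteins matched by each member signature."""
    total = counts[parent]
    return {sig: n / total for sig, n in counts.items() if sig != parent}

growth = INTERPRO_2007["IPR000276"] / INTERPRO_2005["IPR000276"]
print(f"entry grew {growth:.2f}-fold between the two releases")
for label, counts in (("2005", INTERPRO_2005), ("2007", INTERPRO_2007)):
    for signature, fraction in coverage(counts).items():
        print(f"{label} {signature}: {fraction:.1%} of IPR000276")
```

The methods disagree on thousands of proteins in each release, and the entry roughly doubled in two years, so the set of matches needing expert inspection grows faster than any curator can check.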

14,615 rhodopsin-like superfamily GPCRs in Pfam?
ID   Q6NV75 PRELIMINARY; PRT; 609 AA.
AC   Q6NV75;
DT   05-JUL-2004 (TrEMBLrel. 27, Created)
DT   05-JUL-2004 (TrEMBLrel. 27, Last sequence update)
DT   05-JUL-2004 (TrEMBLrel. 27, Last annotation update)
DE   G protein-coupled receptor 153.
GN   Name=GPR153;
OS   Homo sapiens (Human).
OX   NCBI_TaxID=9606
RN   [1]
RP   SEQUENCE FROM N.A.
RC   TISSUE=Brain;
RA   Strausberg R.L., Feingold E.A., Grouse L.H., Derge J.G.,
RA   Jones S.J., Marra M.A.;
RT   "Generation and initial analysis of more than 15,000 full-length
RT   human and mouse cDNA sequences.";
RL   Proc. Natl. Acad. Sci. U.S.A. 99:16899-16903(2002).
RP   SEQUENCE FROM N.A.
RC   TISSUE=Brain;
RA   Strausberg R.;
RL   Submitted (MAR-2004) to the EMBL/GenBank/DDBJ databases.
DR   EMBL; BC068275; AAH68275.1; -.
DR   GO; GO:0004872
DR   InterPro; IPR000276; GPCR_Rhodpsn.
DR   Pfam; PF00001; 7tm_1; 1.
DR   PROSITE; PS50262; G_PROTEIN_RECEP_F1_2; 1.
KW   Receptor
SQ   SEQUENCE 609 AA; 65341 MW; E525CC7F60D0891C CRC64;
     MSDERRLPGS AVGWLVCGGL SLLANAWGIL SVGAKQKKWK PLEFLLCTLA ATHMLNVAVP
     IATYSVVQLR RQRPDFEWNE GLCKVFVSTF YTLTLATCFS VTSLSYHRMW MVCWPVNYRL
     SNAKKQAVHT VMGIWMVSFI LSALPAVGWH DTSERFYTHG CRFIVAEIGL GFGVCFLLLV
     GGSVAMGVIC TAIALFQTLA VQVGRQADHR AFTVPTIVVE DAQGKRRSSI DGSEPAKTSL
     QTTGLVTTIV FIYDCLMGFP VLVVSFSSLR ADASAPWMAL CVLWCSVAQA LLLPVFLWAC
     DRYRADLKAV REKCMALMAN DEESDDETSL EGGISPDLVL ERSLDYGYGG DFVALDRMAK
     YEISALEGGL PQLYPLRPLQ EDKMQYLQVP PTRRFSHDDA DVWAAVPLPA FLPRWGSGED
     LAALAHLVLP AGPERRRASL LAFAEDAPPS RARRRSAESL LSLRPSALDS GPRGARDSPP
     GSPRRRPGPG PRSASASLLP DAFALTAFEC EPQALRRPPG PFPAAPAAPD GADPGEAPTP
     PSSAQRSPGP RPSAHSHAGS LRPGLSASWG EPGGLRAAGG GGSTSSFLSS PSESSGYATL
     HSDSLGSAS
//

Pfam: match Q6NV75/24-297
PROSITE (profile): no match
PROSITE (regex): no match
PRINTS: no match
ClustalW – sequences too divergent to be aligned
[Figure label: false negative]

Beware top & bottom hits
…but also beware simplistic analysis tools coupled with wet experiments!
Let’s finally look at how hydropathy profiles can compel biologists to make strange deductions…
- & still get their results published in Science!

Pfam: Lanthionine synthetase C-like protein
PROSITE (profile): no match
PROSITE (regex): no match
PRINTS: no match
ClustalW – sequences too divergent to be aligned

ID   Q9C929_ARATH Unreviewed; 401 AA.
AC   Q9C929;
DT   01-JUN-2001, integrated into UniProtKB/TrEMBL.
DT   01-JUN-2001, sequence version 1.
DT   24-JUL-2007, entry version 23.
DE   Putative G protein-coupled receptor; 80093-78432.
GN   Name=F14G24.19; OrderedLocusNames=At1g52920;
OS   Arabidopsis thaliana (Mouse-ear cress).
OC   Eukaryota; Viridiplantae; Streptophyta; ... Arabidopsis.
OX   NCBI_TaxID=3702;
RN   [1]
RP   NUCLEOTIDE SEQUENCE.
RA   Lin X., Kaul S., Town C.D., Benito M., Creasy T.H., Haas B.J., Wu D.,
RA   Maiti R., Ronning C.M., Koo H., Fujii C.Y., Utterback T.R.,
RA   Barnstead M.E., Bowman C.L., White O., Nierman W.C., Fraser C.M.;
RT   "Arabidopsis thaliana chromosome 1 BAC F14G24 genomic sequence.";
RL   Submitted (DEC-1999) to the EMBL/GenBank/DDBJ databases.
RN   [2]
RP   NUCLEOTIDE SEQUENCE.
RA   Town C.D., Kaul S.;
RL   Submitted (JAN-2001) to the EMBL/GenBank/DDBJ databases.
DR   EMBL; AC019018; AAG52264.1; -; Genomic_DNA.
DR   PIR; E96570; E96570.
DR   UniGene; At.66935; -.
DR   GenomeReviews; CT485782_GR; AT1G52920.
DR   KEGG; ath:At1g52920; -.
DR   TAIR; At1g52920; -.
DR   GO; GO:0004872; F:receptor activity; IEA:UniProtKB-KW.
DR   InterPro; IPR007822; LANC_like.
DR   Pfam; PF05147; LANC_like; 1.
KW   Receptor.
SQ   SEQUENCE 401 AA; 45284 MW; C9D3BF8CC8F0FE0B CRC64;
     MPEFVPEDLS GEEETVTECK DSLTKLLSLP YKSFSEKLHR YALSIKDKVV WETWERSGKR
     VRDYNLYTGV LGTAYLLFKS YQVTRNEDDL KLCLENVEAC DVASRDSERV TFICGYAGVC
     ALGAVAAKCL GDDQLYDRYL ARFRGIRLPS DLPYELLYGR AGYLWACLFL NKHIGQESIS
     SERMRSVVEE IFRAGRQLGN KGTCPLMYEW HGKRYWGAAH GLAGIMNVLM HTELEPDEIK
     DVKGTLSYMI QNRFPSGNYL SSEGSKSDRL VHWCHGAPGV ALTLVKAAQV YNTKEFVEAA
     MEAGEVVWSR GLLKRVGICH GISGNTYVFL SLYRLTRNPK YLYRAKAFAS FLLDKSEKLI
     SEGQMHGGDR PFSLFEGIGG MAYMLLDMND PTQALFPGYE L
//
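A hydropathy profile is simple arithmetic, which is why it is easy to over-interpret. The sketch below is a minimal sliding-window calculation, assuming the Kyte-Doolittle scale, a 19-residue window and a 1.6 cut-off; none of these choices is stated on the slides, and the function names and toy sequence are hypothetical.

```python
# Kyte-Doolittle hydropathy values for the 20 standard residues.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(sequence, window=19):
    """Mean hydropathy of every full window along the sequence."""
    scores = [KYTE_DOOLITTLE[residue] for residue in sequence]
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]

def hydrophobic_segments(sequence, window=19, threshold=1.6):
    """Merge consecutive windows at or above the threshold into (start, end) spans.

    Spans are half-open residue indices covering every qualifying window.
    """
    segments, start = [], None
    profile = hydropathy_profile(sequence, window)
    for i, value in enumerate(profile):
        if value >= threshold and start is None:
            start = i
        elif value < threshold and start is not None:
            segments.append((start, i - 1 + window))
            start = None
    if start is not None:
        segments.append((start, len(profile) - 1 + window))
    return segments

# Hypothetical toy sequence: one hydrophobic stretch flanked by charged residues.
TOY = "D" * 10 + "L" * 25 + "D" * 10
print(hydrophobic_segments(TOY))
```

All the calculation reports is a count of hydrophobic stretches. Seven such stretches are consistent with a 7TM protein but do not establish one, which is why a profile like this needs the family-based evidence above before a sequence is called a GPCR.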

[Figure repeated: GPCR signalling pathways (ligand classes, G protein subunits & downstream effectors)]

Remember
Computers don’t do biology!
They do sums (quickly) & crude string matching

Seeking deeper functional insights
Attwood, TK, Croning, MD & Gaulton, A (2002) Deriving structural and functional insights from a ligand-based hierarchical classification of G protein-coupled receptors. Protein Eng., 15, 7-12.
• S’family, family & subtype motifs have different locations
• If s’family motifs define the common scaffold, hypothesis:
  – family motifs relate to ligand binding?
  – subtype motifs relate to G protein coupling?
  – powerful tools for subtyping & potentially de-orphaning GPCRs

Locations of ligand-binding residues & motif distribution

Locations of G protein-coupling residues & distribution of motifs
[Figure legend: subtype motifs & # of fingerprints mapping to each region; G protein coupling regions & # of families mapping to each region]

Seeking deeper functional insights?
Attwood, TK, Croning, MD & Gaulton, A (2002) Deriving structural and functional insights from a ligand-based hierarchical classification of G protein-coupled receptors. Protein Eng., 15, 7-12.
• Clearly, many family- & subtype motifs are simply in the ‘wrong’ place for the initial hypothesis to be true
[Figure labels: GPCR superfamily, Muscarinic receptors, Muscarinic receptor M5]

Refining the hypothesis
• Besides, it’s not that simple
  – only part of the answer
• Need to consider that GPCRs don’t function in isolation
  – their functions are modulated via interactions with other proteins
• Also, the phenomenon of dimerisation challenges the view of the GPCR monomer as functional unit
  – many GPCRs exist as homo- & heterodimers
• Such observations demand a more systematic analysis of motifs & their likely functional roles

Oligomerisation & protein-protein interaction residues/regions
A pilot study with adrenergic, bradykinin & dopamine receptors
[Figure legend: family-level motifs; subfamily-level motifs; residues involved in oligomerisation; residues involved in protein-protein interaction; residues involved in G protein coupling; residues involved in ligand binding]

Where next?
• Based on location, some family-level motifs couldn’t be involved in ligand binding & some subtype-level motifs couldn’t be involved in G protein coupling
  – clearly, 3D location must be taken into account
    • functional correlations would then be stronger
• The remaining motifs are likely to be involved in other molecular interactions
  – e.g., dimerisation, effector proteins… (early results promising)
    • this will help us to build a knowledge-based system to help suggest the likely functional roles for family- & subtype-level motifs in future

Conclusions
• There are many barriers to success for the jobbing bioinformatician, e.g.:
  – not fully understanding the processes we’re trying to model & predict (e.g., protein folding)
  – the dynamic nature of biological data
  – not having been rigorous in the way we define &/or describe biology/biological processes in the literature
  – the volume of data, data heterogeneity
  – maintenance of data, propagation of errors…
• Possibly the largest hurdle is that computers are number crunchers
  – they don’t do biology, & trying to teach them is hard
  – & the harder we try, the clearer it is how naïve we’ve been

Conclusions
• In silico functional annotation requires several dbs to be searched & several tools to be used
  – different methods provide different perspectives
  – dbs aren’t complete & their contents don’t fully overlap
• The more dbs searched, the harder it is to interpret results
• The more computers are involved in automating annotation, the greater the need for collaboration
  – especially between s/w developers, annotators & ‘wet’ experimentalists
• The more data we have, the more rigorous we must be in thinking/writing if we are to make sense of the complexities

Conclusions
Flower, DR & Attwood, TK (2004) Integrative bioinformatics for functional genome annotation: trawling for G protein-coupled receptors. Semin Cell Dev Biol., 15(6), 693-701.
• For GPCRs, there are many analysis tools available
  – BLAST, FastA, family databases, modelling tools, etc.
• We must understand the limitations of the methods
  – no method is infallible or able to replace the need for biological validation
  – use all available resources & understand their problems
  – none is best!
• Used wisely, bioinformatics tools are useful
  – BLAST/FastA offer broad brush strokes, motif-methods add fine detail
  – together, they facilitate receptor characterisation & prediction of ligand specificity, & allow identification of novel ligand-binding, G protein-coupling or other likely molecular interaction motifs
• We are a long way from having reliable tools for deducing GPCR function & structure from sequence
  – but with the right approach, there is hope
