Protein Information Resource
Oversight and Scientific Advisory Board Meeting
November 14, 2005
Georgetown University Medical Center
Welcome and Introduction
Vassilios Papadopoulos, Ph.D.
Associate Vice President & Director, Biomedical Graduate Research Organization
Georgetown University Medical Center
David States, M.D., Ph.D.
Chair, PIR Oversight and Scientific Advisory Board
Professor & Director of Bioinformatics, University of Michigan
PIR/UniProt Overview
Project Overview, Organization, Infrastructure
Cathy H. Wu, Ph.D.
Director, PIR
Professor, Georgetown University Medical Center
Protein Information Resource (PIR)
UniProt Universal Protein Resource: Central Resource of Protein Sequence and Function
PIRSF Family Classification System: Protein Classification and Functional Annotation
iProClass Integrated Protein Database: Data Integration and Protein Mapping
Cyber Infrastructure (Interoperability and Dissemination): Ontology, XML, Object/Relational DB, J2EE Architecture
Integrated Protein Informatics Resource for Genomic/Proteomic Research
http://pir.georgetown.edu
UniProt: Universal Protein Resource
International Consortium: Protein Information Resource (PIR), European Bioinformatics Institute (EBI), Swiss Institute of Bioinformatics (SIB)
NIH U01 Grant (NHGRI/NIGMS/NLM/NIMH/NCRR/NIDCR) Phase I (09/02-08/05): $6 Million Annual Bridge (09/05-?/06): $6.6M Phase II (?/06-?/09): $6.6-8.0(?)M
Central Resource of Protein Sequence and Function
http://www.uniprot.org
NHGRI
UniProt Databases
UniProt Archive (UniParc): Comprehensive sequence archive with sequence history. Produced at EBI.
UniProt Reference Clusters (UniRef): Non-redundant reference clusters for sequence search. Produced at PIR.
UniProt Knowledgebase (UniProtKB): Integration of PIR-PSD, Swiss-Prot and TrEMBL databases. Stable, comprehensive, fully classified, richly and accurately annotated knowledgebase.
UniProtKB/Swiss-Prot: Produced at SIB
UniProtKB/TrEMBL: Produced at EBI
Literature-based and automated annotation at SIB, PIR, EBI
UniProt Management Structure
Scientific Advisory Panel (SAP) to be established by NHGRI
UniProt Project Coordination
UniProt email discussion groups
Project Liaisons and Ad hoc teams
Tri-weekly teleconference calls
Tri-annual face-to-face Consortium meetings: January 12-13, 2006 at Geneva; April 10-11, 2006 at Georgetown University
Exchange visits of scientific and technical staff: Five PIR staff at SIB (1-2 weeks, Nov 05) for annotation integration
Retreats: France, 2004
UniProt Activities at PIR
Integration of PIR-PSD into UniProtKB Swiss-Prot/TrEMBL
Incorporation of unique PIR entries
Incorporation of PIR annotations: references, experimental features with literature evidence tag
Functional annotation of UniProtKB proteins
Development of PIRSF family classification system & PIRSF curation => Comprehensive coverage of all UniProtKB proteins
Development of rule-based annotation system & PIRNR (name rule)/PIRSR (site rule) curation => Rule curation and integration into Swiss-Prot/TrEMBL annotation pipelines & propagation of annotations (e.g., name, GO, site)
Production of UniRef100/90/50 databases => Enhancement & scaling
Creation of UniProt web site and help system => Unified UniProt web site & user community interaction
PIRSF Classification System
PIRSF: Evolutionary relationships of proteins from super- to sub-families
Curated families with name rules and site rules
Curation platform with classification/visualization tools
Deliverables: UniProtKB annotations, InterPro families, PIRSF reports, PIRSF curation platform
Protein Classification and Functional Annotation
PIRSF classification levels
Domain Superfamily: one common Pfam domain
PIRSF Superfamily: 0 or more levels; one or more common domains
PIRSF Homeomorphic Family: exactly one level; full-length sequence similarity and common domain architecture
PIRSF Homeomorphic Subfamily: 0 or more levels; functional specialization
Example: PF02153, Prephenate dehydrogenase (PDH)
PIRSF001499: Bifunctional CM/PDH (T-protein)
PIRSF006786: PDH, feedback inhibition-insensitive
PIRSF005547: PDH, feedback inhibition-sensitive
Example: PF01817, Chorismate mutase (CM)
PIRSF017318: CM of AroQ class, eukaryotic type
PIRSF001501: CM of AroQ class, prokaryotic type
PIRSF026640: Periplasmic CM
PIRSF001500: Bifunctional CM/PDT (P-protein)
PIRSF001499: Bifunctional CM/PDH (T-protein)
Example: PF00219, Insulin-like growth factor binding protein (IGFBP)
PIRSF001969: IGFBP
PIRSF500001: IGFBP-1
…
PIRSF500006: IGFBP-6
PIRSF018239: IGFBP-related protein, MAC25 type
Example: PF02735, Ku70/Ku80 beta-barrel domain
PIRSF800001: Ku70/80 autoantigen
PIRSF003033: Ku70 autoantigen
PIRSF016570: Ku80 autoantigen
PIRSF006493: Ku, prokaryotic type
PIRSF Work Group Meeting, April 2003
iProClass Integrated Protein Database
Data integration from >90 databases
Underlying data warehouse for protein ID/name/bibliography mapping
Integration of protein family, function, structure for functional annotation
Rich link (link + summary) for value-added reports of UniProt proteins
Data Integration and Protein Mapping
iProClass Integrated Protein Knowledgebase: integrated sources by category
Disease/Variation: OMIM, HapMap, …
Ontology: GO
Protein Sequence: UniProt, UniRef, UniParc, RefSeq, GenPept, …
Gene/Genome: GenBank/EMBL/DDBJ, LocusLink, UniGene, MGI, TIGR, …
Gene Expression: GEO, GXD, ArrayExpress, CleanEx, SOURCE, …
Structure: PDB, SCOP, CATH, PDBSum, MMDB, …
Family: PIRSF, InterPro, Pfam, Prosite, COG, …
Interaction: DIP, BIND, …
Taxonomy: NCBI Taxon, NEWT
Protein Expression: Swiss-2DPAGE, PMG, …
Literature: PubMed
Function/Pathway: EC-IUBMB, KEGG, BioCarta, EcoCyc, WIT, …
Modification: RESID, PhosphoBase, …
Report sections: NCBI X-Refs, Gene/Genome, Gene Ontology, KEGG Pathway, Structure Homolog, PTM, EC, Additional Refs
Funded by NSF
iProLINK Literature Mining Resource
iProLINK: integrated Protein Literature, INformation and Knowledge
Literature Mining & Protein Curation
NLP Research
Literature-Based Curation: Bibliography Mapping & Annotation Extraction
Protein Name Ontology: Named Entity Recognition & Ontology Induction
Databases: UniProt, PIRSF, iProClass, GO
Bibliography: PubMed
Dictionary and Ontology: Protein Names and Synonyms; PIRSF Family Names in DAG
Guidelines: Protein Naming Rules; Name Tagging Guidelines
Literature Corpus: Name-Tagged; Annotation-Tagged
Bibliography Display: Mapping of Protein ID to PubMed ID; Papers Categorized by Annotations; Papers Tagged with Annotations
Bibliography Submission: Protein Mapping; Annotation Categorization
http://pir.georgetown.edu/iprolink
Funded by NSF
Bibliography report: Annotated bibliography for UniProtKB proteins
BioThesaurus reports: Protein and gene names for UniProtKB proteins
RLIMS-P program: Tag PubMed abstracts for phosphorylation objects
Protein ontology DAG: PIRSF-based ontology
NIAID Proteomic Admin Center
Funded by NIAID
NIAID Proteomic Master Catalog & Complete Proteomes
iProXpress (integrated Protein eXpression Analysis System) for Protein Function and Pathway Analysis
IP/2D/MS Proteomic Data, Gene Expression
Gene/Peptide-Protein Mapping
Sequence Analysis & Data Mining
Function/Pathway Discovery
Protein Information Matrix, Clustered Matrix, Clustered Graph, Interaction Map, Pathway Map
iProClass
http://pir.georgetown.edu/proteomics/
Bioinformatics Infrastructure
NCI caBIG: PIR grid-enablement (Programming access to UniProtKB)
NSF TeraGrid: All-against-all BLAST (UniProtKB related sequences)
PIR Bioinformatics Framework
Software Framework: J2EE n-Tier Architecture with Object Models
Database Distribution: XML, FASTA, Relational (Oracle 9i, MySQL)
Other Deliverables: Object Models, Web Services
Funded by NCI
Architecture diagram: Clients, Middle Tier, Data Source
Clients: Web Browser; Applications (Java Web Start)
Middle Tier: Servlet [Controller]; JSP, HTML, XML (XSLT) [Presentation]; Domain Objects [Model]; DAOManager with SQLDAO, FLATDAO, XMLDAO
Adapters: JDBC, FlatFileAdapter, XMLAdapter
Data Source: MySQL, DB2, Oracle; Legacy Databases; XML Repositories
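The layering in the diagram (domain objects reached through a DAO manager that hides whether data comes from SQL, flat files, or XML) can be illustrated with a minimal sketch. The slides describe a J2EE implementation; the Python below is only an illustration of the pattern, and every class name, the tab-separated flat-file layout, and the XML element names are assumptions made for this example, not PIR's actual code or formats.

```python
import xml.etree.ElementTree as ET
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ProteinEntry:
    """Domain object (the [Model] box in the diagram)."""
    accession: str
    name: str


class ProteinDAO(ABC):
    """Common interface that each data-source adapter implements."""

    @abstractmethod
    def find(self, accession: str) -> Optional[ProteinEntry]:
        ...


class FlatFileDAO(ProteinDAO):
    """Reads 'accession<TAB>name' lines (a hypothetical flat-file layout)."""

    def __init__(self, text: str) -> None:
        self._rows = dict(
            line.split("\t", 1) for line in text.splitlines() if line.strip()
        )

    def find(self, accession: str) -> Optional[ProteinEntry]:
        name = self._rows.get(accession)
        return ProteinEntry(accession, name) if name is not None else None


class XmlDAO(ProteinDAO):
    """Reads <entry accession="..."><name>...</name></entry> (hypothetical schema)."""

    def __init__(self, xml_text: str) -> None:
        self._root = ET.fromstring(xml_text)

    def find(self, accession: str) -> Optional[ProteinEntry]:
        for node in self._root.iter("entry"):
            if node.get("accession") == accession:
                return ProteinEntry(accession, node.findtext("name", default=""))
        return None


class DAOManager:
    """Tries each registered DAO in order and returns the first hit."""

    def __init__(self, daos: Iterable[ProteinDAO]) -> None:
        self._daos = list(daos)

    def find(self, accession: str) -> Optional[ProteinEntry]:
        for dao in self._daos:
            entry = dao.find(accession)
            if entry is not None:
                return entry
        return None


# Usage with made-up accessions and names:
manager = DAOManager([
    FlatFileDAO("P00001\tExample protein A\n"),
    XmlDAO('<entries><entry accession="P00002"><name>Example protein B</name></entry></entries>'),
])
```

A controller would call `manager.find(...)` without knowing which back end answered, which is the point of placing the DAO layer between the presentation tier and the data sources.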
Computing Environment
Computers: Two Sun V880, IBM P690, 100-CPU Linux Cluster, Compaq 4100 Alpha
Networking: Internet2, GU Network (1 Gbps)
Network diagram components
GU UIS Advanced Research Computing; GU Gateway to outside world; GU Cisco switches (10/100 Mbps and 10/100/1000 Mbps); 1 Gbit/sec link
Alpha Server 4100: PIR website, development system, Oracle database
IBM P690: UniProt mirror, DB2, Oracle, FTP Site #2
Sun Fire V880 (production system): UniProt website, Oracle, TimeLogic, FTP Site #1
Sun Fire V880 (development system): FTP Site #3
Linux server: UniProt mail server, Jitterbug
Linux cluster: 50 Linux PCs with 100 CPUs, BLAST/FASTA, Linux NFS file server
Windows 2K server: printing, virus protection, backups
PCs, portable PCs, network printers
PIR Environment
Funding: ~$3 Million Annual Total (2/3 UniProt, 1/3 Other)
Home Institution: Georgetown University Medical Center (GUMC)
Subcontract: National Biomedical Research Foundation (NBRF)
New Location: Off-Campus (GU North Campus), 6250 sq ft, Suite 1200, 3300 Whitehaven Street NW, Washington, DC 20007
PIR Organization
25 Staff Members: 14 GU, 11 NBRF
22 FTEs: 12.7 GU, 9.3 NBRF
17 with Doctorate Degree
11 GU Faculty: 2 Professors, 1 Research Associate Professor, 6 Research Assistant Professors, 2 Research Instructors
PIR Director
Dr. Cathy Wu, Professor (GU)
Informatics Team (12) (10.7 FTE)
Executive Team Members
Dr. Peter McGarvey, Project Manager & Research Associate Professor (GU)
Dr. Hongzhan Huang, Bioinformatics Team Lead & Research Assistant Professor (GU)
Baris Suzek, Associate Team Lead, Bioinformatics & Research Associate (GU)
Staff Members
Dr. Leslie Arminski, System Manager (NBRF)
Dr. Hsing-Kuo Hua, Software Engineer (NBRF)
Dr. Xin Yuan, Bioinformatics Scientist & Research Instructor (GU)
Dr. Robel Y. Kahsay, Bioinformatics Scientist & Research Instructor (GU)
Yongxing Chen, Bioinformatics Programmer (NBRF)
Jing Zhang, Bioinformatics Programmer (NBRF)
Sehee Chung, Software Engineer (GU)
Natalia Petrova, PhD Student (GU) (0.5)
Jess Catana, System Manager (GU) (0.2)
Protein Science Team (12) (10.3 FTE)
Executive Team Members
Dr. Winona Barker, Director Emeritus of PIR (NBRF) (0.55)
Dr. Darren Natale, Team Lead, Protein Science & Research Assistant Professor (GU)
Dr. Zhangzhi Hu, Associate Team Lead, Protein Science & Research Assistant Professor (GU)
Dr. Lai-Su L. Yeh, Administrative Coordinator (NBRF)
Staff Members
Dr. Robert S. Ledley, NBRF President, Professor (NBRF/GU) (0.05)
Dr. Anastasia Nikolskaya, Senior Protein Scientist & Research Assistant Professor (GU)
Dr. Raja Mazumder, Scientific Coordinator & Research Assistant Professor (GU)
Dr. C.R. Vinayaka, Senior Protein Scientist (NBRF)
Dr. Sona Vasudevan, Senior Protein Scientist (NBRF)
Dr. Cecilia Arighi, Senior Protein Scientist & Research Assistant Professor (GU)
Vincent Hermoso, Protein Research Assistant (NBRF) (0.7)
Christina Fang, Project Coordinator & Protein Research Assistant (NBRF)
PIR Community Interactions (since 2004)
Presentations and Invited Seminars
NIH Proteomics Workshop (Bi-Annual) – Bioinformatics Day
Conference Demos/Posters: ISMB-05, US HUPO-05, SOFG04
Over 20 Invited Presentations: Keystone, Human Brain Project Satellite Symposium, PDB Symposium, HUPO-05
Policy Forums, Committees: NSF Plant Cyberinfrastructure, NIH Protein Structure Initiative, HUPO Proteomics Standards Initiative
Publications: Over 25 Refereed Papers and Book Chapters
Collaborations and Interactions
Collaborated and interacted with over 10 research institutions
Hosted face-to-face meetings for NIAID/caBIG projects
Paper and Grant Reviews
Reviewed over 20 papers for refereed journals and conferences
Served on NSF/NIH grant review panels
PIR-Georgetown Interactions
Teaching
Courses: Bioinformatics (BCHB 521), Advanced Bioinformatics (BCHB 621)
Lectures: Medical Biochemistry, Protein Biomarker, Introductory Biology
Mentoring: Mentored 9 graduate students (PhD students, MS Internship projects)
Intercampus Seminars
Proposal Submission by PIR Young Investigators as PI: Six proposals to federal and other agencies
PIR/UniProt – Summary & Statistics
Database Growth
Database Usage
Unified UniProt Web Site
PIR UniProt Consortium Interactions
Peter McGarvey, Ph.D.
UniProt (Universal Protein Resource) http://www.uniprot.org
UniProt: the world's most comprehensive catalog of information on proteins
UniProt Knowledgebase (UniProtKB): Integration of Swiss-Prot, TrEMBL and PIR-PSD. Fully classified, richly and accurately annotated protein sequences with minimal redundancy and extensive cross-references.
Swiss-Prot section: Manually-annotated protein sequences
TrEMBL section: Computer-annotated protein sequences
UniProt Reference Clusters (UniRef): Non-redundant reference sequences clustered from UniProtKB and UniParc for comprehensive or fast sequence searches at 100%, 90%, or 50% identity (UniRef100, UniRef90, UniRef50)
UniProt Archive (UniParc): A stable, comprehensive archive of all publicly available protein sequences for sequence tracking from: Swiss-Prot, TrEMBL, PIR-PSD, EMBL, Ensembl, IPI, PDB, RefSeq, FlyBase, WormBase, Patent Offices, etc.
Database Growth
[Chart: axis 0 to 6,000,000, by major release from Rel 1.0 (Dec-03) through Rel 6.4 (Nov-05). Series: UniParc, UniRef100, UniProtKB, UniProtKB/TrEMBL, UniRef90, UniRef50, UniProtKB/Swiss-Prot. Labels: +EVN, -EVN.]
FTP Downloads 2005
[Chart: FTP downloads in 2005, axis 0 to 7,000. Series: UniRef50, UniRef90, UniRef100, UniProt/Swiss-Prot, UniProt/TrEMBL.]
[Chart: Unique Domains, axis 0 to 50,000, for PIR.Georgetown.Edu and PIR.UniProt.Org.]
[Chart: Hits, axis 0 to 2,500,000, for PIR.UniProt.Org and PIR.Georgetown.Edu.]
Customer Email Topics
help@uniprot.org & pirmail@georgetown.edu

          UniProtKB  UniRef  UniParc  iProClass  PSD    NREF   PIRSF
UniProt   ~75%       12%     8%       < 1%       < 1%   < 1%   < 1%
PIR       22%        1%      1%       21%        ~15%   18%    10%

          FTP Site   Web Site   XML    ID Mapping
UniProt   16%        29%        16%    ~18%
PIR       11%        27%        15%    ~3%

1 Day Turnaround
550 UniProt emails, 720 PIR emails
“PIR is a wonderful resource.” – Craig
“Thank you for your prompt response, as always UniProt is on the ball!” – Fiona
PIR/UniProt – Unified UniProt Web Site
Dec. 03, Three Synchronized Sites based on PIR Design
Nov. 04, Established Goals for Unified Web Sites.
2005, Back-end Data and Software Platform Developed.
Nov. 05, PIR Playing a Lead Role in Developing Specifications for the Interface.
June 06, Release of Unified UniProt Web Site Hosted by PIR and EBI
PIR/UniProt - Consortium Interactions
UniProt liaison group (discussion of high-level issues)
UniProt web site committee (Unified UniProt web site planning)
UniProt Link committee (working with external databases)
UniProt help-mail (answering user inquiries)
UniProt document committee (documentation, tutorials and FAQs)
UniProt XML group (XML documentation and maintenance)
UniProt group for automatic annotation pipeline
Manual curation of Swiss-Prot template sequences
Manual curation of site rules and controlled vocabularies
Development of automatic annotation rules
Development of protein naming guidelines
Incorporation of new protein families into InterPro
PIR routinely visits or hosts colleagues from EBI and SIB for discussions.
Biweekly update of UniRef, UniParc and UniProtKB databases
Protein Classification and Annotation
Darren Natale, Ph.D.
Team Lead, Protein Science, PIR
Research Assistant Professor, GUMC
Protein Curation Activities
PIRSF – classification of homeomorphic proteins based on evolutionary relationships
PIRNR – family-based “Name Rules” that define the parameters for propagating specific name, EC and GO annotation to members
PIRSR – family-based “Site Rules” that define the parameters for propagating specific feature annotation to members
Specialized Tools (I)
DAG: Preserves these three features in a navigable format
•Pfam/PIRSF Hierarchy
•Domain Relatives
•Domain Composition
In edit mode, allows easy creation, destruction, and movement of PIRSFs
Specialized Tools (II)
PIR Tree and Alignment Viewer (PIRTAV)
HPS = 3-hexulose-6-phosphate synthase
KGPDC = 3-keto-L-gulonate 6-phosphate decarboxylase
Panels: Phylogenetic Tree, Classification/Annotation, Alignment
PIRSF Curation Pipeline
Uncurated level: computer-generated
Preliminary Curation Level: curate membership (principal tools: BLAST results, iterative blastclust, on-the-fly HMM); curate domain architecture; select seeds
Full Curation Level: curate name and some references; optional: write abstract indicating function, structure, etc.
After name review session and HMM performance check, all information (HMM, membership, annotation) is sent to EBI for integration into InterPro (Full level only).
PIRNR Curation Pipeline
Start with PIRSF curated to Full level
Define match criteria for application of the rule
Review protein name, synonyms, EC numbers, GO terms
Find those that are appropriate to propagate to members that match rule criteria
After review of propagable information, send match conditions, exclusion conditions, and propagated fields to EBI for inclusion into automatic annotation pipeline. Results are displayed in EBI’s UniProt entry extended view.
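A name rule as described above combines match conditions, exclusion conditions, and the fields to propagate. A minimal sketch of that idea follows; the rule identifier, the length window, and the keyword-based exclusion are assumptions chosen for illustration and are not the actual PIRNR rule format.

```python
from dataclasses import dataclass, field


@dataclass
class Protein:
    accession: str
    pirsf_id: str
    length: int
    description: str = ""
    annotations: dict = field(default_factory=dict)


@dataclass
class NameRule:
    rule_id: str                    # hypothetical identifier
    pirsf_id: str                   # match condition: family membership
    min_length: int                 # match condition (assumed): length window
    max_length: int
    excluded_keywords: tuple = ()   # exclusion condition (assumed): e.g. fragments
    propagate: dict = field(default_factory=dict)  # fields to copy, e.g. {"DE": ...}


def rule_matches(rule: NameRule, protein: Protein) -> bool:
    """True when the protein meets the match conditions and no exclusion applies."""
    if protein.pirsf_id != rule.pirsf_id:
        return False
    if not rule.min_length <= protein.length <= rule.max_length:
        return False
    text = protein.description.lower()
    return not any(word.lower() in text for word in rule.excluded_keywords)


def apply_rule(rule: NameRule, protein: Protein) -> bool:
    """Propagate the rule's fields to a matching protein; keep existing values."""
    if not rule_matches(rule, protein):
        return False
    for key, value in rule.propagate.items():
        protein.annotations.setdefault(key, value)
    return True


# Usage: the family ID and name come from the slides; everything else is made up.
rule = NameRule(
    rule_id="EXAMPLE-RULE-1",
    pirsf_id="PIRSF001499",
    min_length=300,
    max_length=450,
    excluded_keywords=("fragment",),
    propagate={"DE": "Bifunctional CM/PDH (T-protein)"},
)
member = Protein("X00001", "PIRSF001499", 373)
fragment = Protein("X00002", "PIRSF001499", 373, description="Fragment")
```

Using `setdefault` means an entry's existing annotation is never overwritten, which is one conservative way to handle propagation; the real pipeline may resolve conflicts differently.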
PIRSR Curation Pipeline
Start with PIRSF with curated membership and seeds. At least one member must have solved structure.
Edit seed-to-structure alignment to define and retain conserved regions covering pertinent residues
Build Site HMM from concatenated conserved regions
Define feature annotation using controlled vocabulary with evidence attribution
Apply rules to PIRSF members, create log files to send to SIB (UniProtKB/Swiss-Prot) or EBI (UniProtKB/TrEMBL). Results are incorporated into UniProtKB flat files.
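The core of applying a site rule is checking that a family member carries the expected residue at each position defined on the structure-backed template. The slides use a Site HMM for this; the sketch below swaps in a much simpler check on a gapped pairwise alignment, purely to illustrate the mapping step. The rule layout, feature labels, and sequences are hypothetical.

```python
from typing import List, Optional, Tuple


def map_site(template_aln: str, member_aln: str,
             template_pos: int) -> Optional[Tuple[int, str]]:
    """Map a 1-based template position to (member position, residue).

    Both strings are rows of the same alignment, with '-' for gaps.
    Returns None if the member has a gap at that column.
    """
    t = m = 0
    for t_char, m_char in zip(template_aln, member_aln):
        if t_char != "-":
            t += 1
        if m_char != "-":
            m += 1
        if t_char != "-" and t == template_pos:
            return (m, m_char) if m_char != "-" else None
    return None


def apply_site_rule(sites: List[dict], template_aln: str,
                    member_aln: str) -> List[Tuple[str, int, str]]:
    """Return feature annotations only if every rule site is conserved."""
    features = []
    for site in sites:
        hit = map_site(template_aln, member_aln, site["template_pos"])
        if hit is None or hit[1] not in site["allowed"]:
            return []  # one non-conserved site rejects the whole rule
        features.append((site["feature"], hit[0], hit[1]))
    return features


# Usage with made-up sequences: template MKHDS aligned to member MRAHDS.
example_sites = [
    {"template_pos": 3, "allowed": {"H"}, "feature": "ACT_SITE"},
    {"template_pos": 5, "allowed": {"S", "T"}, "feature": "BINDING"},
]
conserved = apply_site_rule(example_sites, "MK-HDS", "MRAHDS")
rejected = apply_site_rule(example_sites, "MK-HDS", "MRAQDS")
```

Requiring all sites to match before emitting any feature mirrors the conservative spirit of rule-based propagation, though the actual acceptance criteria are not specified in the slides.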
Progress on Protein Curation Activities

                      Nov-2004   Nov-2005
PIRSF (Families)      5876       7083
PIRNR (Name Rules)    320        1321
PIRSR (Site Rules)    81         164

PIRSF chart values: 4222, 1595, 1266; 1207; labeled 162 Preliminary, 693 Full, 352 Full + Desc
PIRNR chart values: 561, 420, 251; 1001; labeled 428 DE/GO/EC, 342 DE/GO, 157 DE
PIRSR chart values: 112, 38, 14; 83; labeled 35 Active, 34 Metal/Binding, 14 Misc.

Impact Measurements
PIRSFs integrated into InterPro: Sent: 1,775; PIRSF-unique: 840
PIRNR touches on UniProtKB/TrEMBL: Entries: 60,300; Annotation lines: 281,400
PIRSR touches on UniProtKB: Entries: 41,000 (9,800); Feature lines: 100,000 (27,000)
Increasing Throughput & Impact
PIRSF PIRNR PIRSR
Curated
Full
To InterPro
AutoAnno
With Structure
Active
•Comprehensive coverage
•Curation “push”
•Propagation at PIR
•Add ligand-binding
Increased specificity
Active +
Ligand
•Emphasize Full/InterPro •Rules to EBI •Active sites
All three will be integrated into the Swiss-Prot annotation platform
UniRef Databases
Hongzhan Huang, Ph.D.
Bioinformatics Team Lead
Protein Information Resource, GUMC
UniRef (UniProt Reference Clusters)
Non-Redundant Reference Clusters for Sequence Searching
Derived from UniProtKB and Selected UniParc Sources
UniRef100: 100% sequence identity
UniRef90: 90% sequence identity (1/3 size reduction from UniRef100)
UniRef50: 50% sequence identity (2/3 size reduction)
Release 6.4 (Nov 05)
UniRef100
The most comprehensive sequence dataset for sequence similarity search
3,176K sequences in UniRef100 vs. 3,022K sequences in NCBI nr
Source Sequences: Complete UniProtKB (splice variants as separate entries); selected UniParc (e.g. Ensembl and RefSeq)
Non-Redundancy: Combine identical sequences from all species; merge sub-fragments
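The two non-redundancy steps for UniRef100 (combining identical sequences and merging sub-fragments) can be sketched as follows. This is an illustrative toy, not the production procedure: it treats a sub-fragment as any sequence that is an exact substring of a longer one, and it uses a quadratic scan that would not scale to millions of sequences. The function name and record layout are assumptions.

```python
from typing import Dict, Iterable, List, Tuple


def merge_identical_and_fragments(
    records: Iterable[Tuple[str, str]]
) -> Dict[str, List[str]]:
    """Group (identifier, sequence) records into non-redundant clusters.

    Returns a dict mapping each representative sequence to its member IDs.
    """
    # Step 1: combine identical sequences, regardless of source or species.
    by_sequence: Dict[str, List[str]] = {}
    for identifier, sequence in records:
        by_sequence.setdefault(sequence, []).append(identifier)

    # Step 2: visit longest sequences first so that a containing sequence
    # is already a representative when its sub-fragments are reached.
    clusters: Dict[str, List[str]] = {}
    for sequence in sorted(by_sequence, key=lambda s: (-len(s), s)):
        parent = next((rep for rep in clusters if sequence in rep), None)
        if parent is None:
            clusters[sequence] = list(by_sequence[sequence])
        else:
            clusters[parent].extend(by_sequence[sequence])
    return clusters


# Usage with made-up identifiers and sequences:
example_clusters = merge_identical_and_fragments([
    ("a", "MKTAYIAKQR"),
    ("b", "MKTAYIAKQR"),   # identical to "a"
    ("c", "TAYIA"),        # sub-fragment of "a"
    ("d", "GGGG"),         # unrelated
])
```

A fragment that is contained in several longer sequences is assigned to the first representative found; how such ties are resolved in practice is not stated in the slides.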
UniRef90 & UniRef50
Reduced sequence datasets for faster sequence similarity search
Representative sequence for each cluster
Clustering Algorithm: CD-HIT (fast, top down, non-overlapping); PIR’s parallelized version running on Linux Cluster
UniRef90: 1/3 size reduction
UniRef50: 2/3 size reduction
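The greedy, top-down, non-overlapping style of clustering mentioned above can be sketched in a few lines: sequences are visited longest first, and each either joins the first representative it is similar enough to or becomes a new representative. This is a simplified illustration, not CD-HIT itself; in particular `crude_identity` is a stand-in based on Python's `difflib`, where the real tool uses short-word filtering and alignment. All names are assumptions, and sequences are assumed to be non-empty.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def crude_identity(a: str, b: str) -> float:
    """Matched characters divided by the shorter length (stand-in measure)."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / min(len(a), len(b))


def greedy_cluster(sequences: Dict[str, str], threshold: float) -> Dict[str, List[str]]:
    """Return {representative ID: member IDs}; each sequence joins one cluster."""
    order = sorted(sequences, key=lambda sid: (-len(sequences[sid]), sid))
    clusters: Dict[str, List[str]] = {}
    for sid in order:
        for rep in clusters:
            if crude_identity(sequences[rep], sequences[sid]) >= threshold:
                clusters[rep].append(sid)
                break
        else:
            # No representative was similar enough: start a new cluster.
            clusters[sid] = [sid]
    return clusters


# Usage with made-up sequences; s2 differs from s1 at one of 20 positions.
example_sequences = {
    "s1": "ACDEFGHIKLMNPQRSTVWY",
    "s2": "ACDEFGHIKLMNPQRSTVWA",
    "s3": "WWWWWWWWWW",
}
clusters_90 = greedy_cluster(example_sequences, threshold=0.90)
```

Because every sequence is compared only against representatives, the result depends on visiting order, which is why the longest sequence in each cluster ends up as its representative.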
UniRef50 Sequence Classification
Completely automated, biweekly-updated classification of all proteins
How good are the UniRef50 clusters?
Evaluated by all-against-all BLAST search results
98% of the clusters are of good quality: each sequence matches every other sequence within the cluster
Problematic clusters
One long sequence bridges two or more unrelated sub-clusters.
May result from incorrect gene models, domain fusion, or polyproteins
A new algorithm will be developed with length/overlap parameters to detect and regroup such clusters.
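The quality criterion above (every sequence matches every other sequence in the cluster) and the bridging problem can both be expressed as simple graph checks over all-against-all hits. The sketch below assumes the BLAST results have already been reduced to a set of unordered ID pairs that count as matches; that representation and the function names are assumptions, and this is not the planned length/overlap algorithm.

```python
from itertools import combinations
from typing import FrozenSet, List, Sequence, Set


def is_good_cluster(members: Sequence[str], hits: Set[FrozenSet[str]]) -> bool:
    """True when every pair of members has a recorded match."""
    return all(frozenset(pair) in hits for pair in combinations(members, 2))


def connected_components(members: Sequence[str],
                         hits: Set[FrozenSet[str]]) -> List[Set[str]]:
    """Split members into groups linked directly or indirectly by matches."""
    remaining = set(members)
    components = []
    while remaining:
        start = remaining.pop()
        component, stack = {start}, [start]
        while stack:
            current = stack.pop()
            linked = {m for m in remaining if frozenset((current, m)) in hits}
            remaining -= linked
            component |= linked
            stack.extend(linked)
        components.append(component)
    return components


def find_bridges(members: Sequence[str], hits: Set[FrozenSet[str]]) -> List[str]:
    """Members whose removal splits the rest into two or more groups."""
    return [
        m for m in members
        if len(connected_components([x for x in members if x != m], hits)) > 1
    ]


# Usage: one long sequence matches two sub-clusters that do not match each other.
example_hits = {
    frozenset(pair) for pair in [
        ("long", "a1"), ("long", "a2"), ("long", "b1"), ("long", "b2"),
        ("a1", "a2"), ("b1", "b2"),
    ]
}
example_members = ["long", "a1", "a2", "b1", "b2"]
```

Here `find_bridges(example_members, example_hits)` singles out the long sequence, after which the two remaining components could be regrouped as separate clusters.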
Usages of UniRef Clusters
UniRef90/50 for comprehensive automated classification of proteins
Faster searches and less cluttered similarity search outputs
More even sampling of sequence space and reduction of search bias
UniRef for integrity check of database annotation
UniRef100 to annotate EST sequences
UniRef50 to detect incorrect gene models
UniRef90/50 for PIRSF family classification
UniRef90 to recruit new PIRSF family members
UniRef50 to create new PIRSF families
UniRef50 Clusters -> merge related clusters -> PIRCF Families (Computer-generated Families) -> checked by curator -> PIRSF Families
Literature Mining
Zhang-Zhi Hu, M.D.
Associate Team Lead, Protein Science, PIR
Research Assistant Professor, GUMC
iProLINK: An Integrated Resource for Protein Literature Mining
Complete UniProtKB bibliography mapping
RLIMS-P text mining tool for protein phosphorylation
BioThesaurus: protein/gene names
PIR/UniProt Protein Bibliography
355,629 unique citations (PMID) are in iProClass for 2.4 million UniProtKB entries.
166,950 (47%) citations are currently in UniProtKB.
The additional 188,679 (53%) unique citations are taken from sources such as GeneRIF, SGD, MGI.
Bibliography report: curated citations, user-submitted, computationally mapped
BioThesaurus report
BioThesaurus: comprehensive collection of gene/protein names from multiple sources and their associations with database entities.
Example: IAPP named in 18 entries
Applications of BioThesaurus
Gene/protein names mapping: search synonyms; resolve name ambiguity
Database annotation: error detection (conflicting names in UniProtKB)
Literature mining: query expansion (synonyms and text-variants allow for expanded search results)
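The BioThesaurus applications listed on this slide (synonym search, ambiguity detection, query expansion) all follow from a two-way index between names and database entries. The sketch below is a minimal illustration of that idea; the class, the normalization rule, and the entry identifiers are hypothetical and do not reflect the actual BioThesaurus schema.

```python
from collections import defaultdict
from typing import List


def normalize(name: str) -> str:
    """Fold case and drop punctuation so text variants collapse together."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


class NameThesaurus:
    def __init__(self) -> None:
        self._entries_by_name = defaultdict(set)
        self._names_by_entry = defaultdict(set)

    def add(self, entry_id: str, name: str) -> None:
        self._entries_by_name[normalize(name)].add(entry_id)
        self._names_by_entry[entry_id].add(name)

    def entries_for(self, name: str) -> List[str]:
        """All entries that carry this name or a text variant of it."""
        return sorted(self._entries_by_name.get(normalize(name), ()))

    def is_ambiguous(self, name: str) -> bool:
        """A name is ambiguous when it points to more than one entry."""
        return len(self.entries_for(name)) > 1

    def expand_query(self, name: str) -> List[str]:
        """Every recorded name of every entry that matches the query name."""
        terms = set()
        for entry_id in self.entries_for(name):
            terms |= self._names_by_entry[entry_id]
        return sorted(terms)


# Usage with made-up entry identifiers:
thesaurus = NameThesaurus()
thesaurus.add("ENTRY_A", "IAPP")
thesaurus.add("ENTRY_A", "Islet amyloid polypeptide")
thesaurus.add("ENTRY_B", "iapp")
thesaurus.add("ENTRY_C", "Unrelated protein")
```

With this data, `thesaurus.is_ambiguous("IAPP")` is true because two entries share the name, which is the kind of case a curator would then inspect for conflicting annotation.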
Rule-based LIterature Mining System for Protein Phosphorylation (RLIMS-P)
RLIMS-P report: PMID:1939059
Extracted objects: kinase, substrate, sites
P12957
MEDLINE abstract (PubMed ID)
Phosphorylation feature extraction (RLIMS-P)
UniProtKB entry mapping (PMID mapping)
UniProtKB site feature annotation & evidence attribution
1876 UniProtKB entries are currently annotated with 4042 phosphorylation sites.
105K unique citations (PMID) are in UniProtKB/Swiss-Prot.
Batch processing by RLIMS-P yielded 4690 abstracts with phosphorylation information, 913 of them with site information, including 214 in UniProtKB entries with no annotated phosphorylation features.
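To give a feel for rule-based extraction of the three phosphorylation objects (kinase, substrate, site), here is a deliberately tiny sketch using a single regular expression. It is not the RLIMS-P rule set, which is far richer; the pattern only handles active-voice sentences of the form "X phosphorylates Y at Ser-N", and the function name and example sentences are assumptions.

```python
import re
from typing import Dict, List

# One toy pattern: "<kinase> phosphorylates <substrate> at|on <Ser|Thr|Tyr>[-]<position>"
PHOSPHO_PATTERN = re.compile(
    r"(?P<kinase>[A-Za-z0-9-]+)\s+phosphorylate[sd]?\s+"
    r"(?P<substrate>[A-Za-z0-9-]+)\s+(?:at|on)\s+"
    r"(?P<residue>Ser|Thr|Tyr)-?(?P<position>\d+)"
)


def extract_phosphorylation(text: str) -> List[Dict[str, object]]:
    """Return one record per sentence fragment that fits the toy pattern."""
    records = []
    for match in PHOSPHO_PATTERN.finditer(text):
        record = match.groupdict()
        record["position"] = int(record["position"])
        records.append(record)
    return records


# Usage on an illustrative sentence:
example_records = extract_phosphorylation(
    "We show that PKA phosphorylates CREB at Ser-133 in stimulated cells."
)
```

Passive constructions ("CREB is phosphorylated by PKA") and multi-word protein names are not covered here; handling such variation is exactly why a dedicated rule system is needed.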
NIAID Biodefense Proteomics Program
Peter McGarvey, Ph.D.
NIAID Biodefense Proteomics Program
7 Proteomics Research Centers: Identifying Targets for Therapeutic Interventions
“..discovering targets for potential candidates for the next generation of vaccines, therapeutics, and diagnostics”
Administrative Resource Center: Support research centers, public distribution of results and protocols
..establish a Scientific Working Group, Interoperability Working Group, Data infrastructure and promote awareness of the project so scientists worldwide can utilize these resources.
Administrative Resource
Project Management: Social & Scientific Systems (SSS)
Meetings and Communications
Web Portal
NIAID Annual Meeting at PIR, May 2006
Scientific Coordination: PIR & VBI
Scientific Advisory Working Group (SWG)
Interoperability Working Group (IWG)
Data Infrastructure: PIR & VBI
Proteomic Database: Storage and Retrieval (VBI)
Data Management and Analysis Tools (PIR/VBI)
Integrated Protein Knowledge System (PIR)
Proteomics Program Interaction Map
Data Integration at Admin Center
Multiple Data Types from Proteomics Research Centers
Data Exchange Format, Controlled Vocabulary, Ontology
Integrated Data at VBI
Master Catalog & Complete Proteomes at GU-PIR: iProClass, UniProt, PIRSF
Protein ID, Peptide/Protein Sequence Mapping
NCI caBIG™ Projects
Baris E. Suzek
Associate Bioinformatics Team Lead
Protein Information Resource, GUMC
About caBIG
The cancer Biomedical Informatics Grid: WWW of cancer research
National Cancer Institute (NCI) and over 50 cancer centers
Goals:
Breaking down technical and collaborative barriers within the cancer community
Facilitating connectivity and sharing of information through common standards and unifying architecture
Addressing not only syntactic but also semantic interoperability
https://cabig.nci.nih.gov
PIR Activities in caBIG
Domain Workspaces
Clinical Trial Management Systems
Integrative Cancer Research Workspace
PIR Developer Project: Grid Enablement of PIR
PIR Adopter Project (Tester): SEED Genome Annotation Tool
PIR Participant (Consultant): Protein informatics tools, databases
Tissue Banks and Pathology Tools Workspace
Cross Cutting Workspaces
Architecture
Vocabularies and Common Data Elements
PIR Participant: Protein models, objects, vocabularies, ontologies
Grid-Enablement of PIR
Goal: UniProt Knowledgebase (UniProtKB) serves as the central protein information resource for cancer research
One of four caBIG reference projects: PIR (Georgetown University), caTIES (University of Pittsburgh), rProteomics (Duke University), caArray (NCICB/Georgetown)
First phase completed: UniProtKB is searchable through caGrid browser
Second phase to be developed: Expose more information from PIR/UniProt databases to caBIG; increase semantic/syntactic interoperability with other services
Current Architecture: caGrid 0.5
PIR SEED Adoption
SEED Genome Annotation Tool
Developer: U Chicago/Argonne National Lab
Open source and distributed framework for genome annotation
Support subsystems annotation and metabolic reconstructions
Explore functional coupling based on genome context, metabolic pathway, and phylogenetic profile
PIR roles
Assist development of use cases
Create test procedures and test the system
Develop user manual