InterPro ontology mapped into Chado schema
cvterm_relationship table
Protocols for Representation of Protein Domain Annotations in Clade-Oriented Databases: a Case Study at the Legume Information System using Chado/Tripal
Pooja E. Umale, Andrew D. Farmer
National Center for Genome Resources (NCGR), Santa Fe, NM 87505, USA
Introduction Methods Results
InterPro Consortium Databases
PROSITE
HAMAP
PFAM
PRINTS
ProDom
SMART
TIGRFAMS
PIRSF
SUPERFAMILY
CATH-Gene3D
PANTHER
Input FASTA amino acid sequences
Score BLAST hits
Tokenize BLAST hits
Score the tokens (lexical analysis)
Gene Ontology annotation
Assign best-scoring description
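The token-scoring idea in the workflow steps above can be illustrated with a minimal sketch. This is not AHRD's actual algorithm or API: the stop-word list, the scoring formula, the function names, and the example hits are all assumptions made for illustration. Each token is weighted by the share of total bit score carried by the hits that contain it, and the description whose tokens have the highest mean weight is chosen.

```python
import re
from collections import defaultdict

# Hypothetical stop words; AHRD uses its own configurable filter lists.
STOP_WORDS = {"protein", "putative", "predicted", "family", "like"}


def tokenize(description):
    """Split a description into lowercase tokens, dropping stop words."""
    words = re.findall(r"[a-z0-9]+", description.lower())
    return [w for w in words if w not in STOP_WORDS]


def best_description(hits):
    """Pick a description from (description, bit_score) pairs.

    Simplified scoring (an assumption, not AHRD's formula): a token's
    score is the fraction of the total bit score contributed by hits
    that contain it; a description's score is the mean of its token scores.
    """
    total = sum(score for _, score in hits)
    token_score = defaultdict(float)
    for desc, score in hits:
        for tok in set(tokenize(desc)):
            token_score[tok] += score / total

    def description_score(desc):
        toks = tokenize(desc)
        if not toks:
            return 0.0
        return sum(token_score[t] for t in toks) / len(toks)

    return max(hits, key=lambda hit: description_score(hit[0]))[0]


# Illustrative BLAST hits for one query protein (made-up scores).
hits = [
    ("Phenylalanine ammonia-lyase", 900.0),
    ("Putative ammonia lyase family protein, partial", 850.0),
    ("Uncharacterized protein", 400.0),
]
chosen = best_description(hits)
```

In this toy input the tokens "ammonia" and "lyase" are supported by the two strongest hits, so the first description scores highest, while the low-information "Uncharacterized protein" is penalized.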
InterPro is a searchable database that we use to elucidate protein function and annotation for our project. The InterProScan tool scans query sequences against the InterPro protein signature databases. We employed AHRD (https://github.com/groupschoof/AHRD) to assign human-readable descriptions to predicted proteins. For a better user experience and visualization, we also incorporated the protein domain annotations into the MSA view provided by Jalview. The Chado database, Drupal (an open source content management system) and GMOD's Tripal are the software tools that were used for data storage and module/website development.
Acknowledgements
Web-based presentation of protein domain data and annotations is available at http://www.legumeinfo.org/search/protein_domains. We developed a shareable Tripal extension module for this purpose; it enables search by domain and links our domain-oriented representation to other modules that showcase the genes and gene families of legumes.
Gene family set sharing common domain
Chado schema representation of InterProScan results
Example: Jalview display of protein domain annotations on consensus sequence of a gene family
AHRD tool workflow
feature table (match$1_26_518), protein_hmm_match domain feature: feature_id, organism_id, uniquename, type_id
featureloc table (for source feature 1): featureloc_id, feature_id, srcfeature_id, fmin, fmax
featureloc table (for source feature 2): featureloc_id, feature_id, srcfeature_id, fmin, fmax
organism table: organism_id, genus, species
cvterm table: cvterm_id, cv_id, name
feature table (PF00221), HMM representation of domain: feature_id
feature table (glyma.Glyma.10G209800.1), polypeptide feature: feature_id
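The schema layout above, in which one match feature is located on two source features (the polypeptide and the domain HMM), can be tried out with a small sketch. It uses an in-memory SQLite database as a stand-in for Chado's PostgreSQL schema and keeps only a few columns; the `type` text column (replacing `type_id` and the cvterm join), the type name given to the PF00221 feature, the coordinates, and the helper name `proteins_with_domain` are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    uniquename TEXT NOT NULL,
    type       TEXT NOT NULL   -- simplification of type_id -> cvterm.name
);
CREATE TABLE featureloc (
    featureloc_id INTEGER PRIMARY KEY,
    feature_id    INTEGER REFERENCES feature(feature_id),
    srcfeature_id INTEGER REFERENCES feature(feature_id),
    fmin          INTEGER,
    fmax          INTEGER,
    rank          INTEGER
);
""")

conn.executemany("INSERT INTO feature VALUES (?, ?, ?)", [
    (1, "glyma.Glyma.10G209800.1", "polypeptide"),
    (2, "PF00221", "domain_hmm"),             # type name is a placeholder
    (3, "match$1_26_518", "protein_hmm_match"),
])

# The match feature gets two locations: one on the polypeptide and one
# on the domain model. Coordinates here are illustrative only.
conn.executemany("INSERT INTO featureloc VALUES (?, ?, ?, ?, ?, ?)", [
    (1, 3, 1, 25, 518, 0),   # match located on the polypeptide
    (2, 3, 2, 0, 493, 1),    # same match located on the domain HMM
])


def proteins_with_domain(conn, domain_name):
    """Return (polypeptide, fmin, fmax) for every match to the named domain."""
    sql = """
    SELECT p.uniquename, pl.fmin, pl.fmax
    FROM feature d
    JOIN featureloc dl ON dl.srcfeature_id = d.feature_id
    JOIN featureloc pl ON pl.feature_id = dl.feature_id
                      AND pl.featureloc_id != dl.featureloc_id
    JOIN feature p     ON p.feature_id = pl.srcfeature_id
    WHERE d.uniquename = ? AND p.type = 'polypeptide'
    ORDER BY p.uniquename
    """
    return conn.execute(sql, (domain_name,)).fetchall()


result = proteins_with_domain(conn, "PF00221")
```

Treating the domain as a feature in its own right is what makes the "all genes sharing this domain" query a plain join, rather than a text search over per-gene properties.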
Display of set of genes that have common domain
Protein domains can be conceptualized from a number of perspectives, from their role in defining an individual protein's structure and function to their evolutionary role in creating novel molecular functions through duplication and recombination into unique multi-domain protein architectures. Although many species- and clade-oriented databases use standard protein domain analyses to characterize the putative functions and cellular localizations of the gene products represented in the genomes and transcriptomes of their species of interest, this is often limited to treating the matched domains as properties of the genes that simply aid their classification and retrieval. While this gene-centric perspective is clearly of great importance, elevating domains to a prominent position in such databases has the potential to provide insights into many interesting biological questions, from the role of domains in constraining and shaping intra-species diversity patterns (including SNPs, splice isoforms, and gene fusions) to their role in defining gene family groupings of orthologous and paralogous genes and in revealing the evolutionary dynamics of those families. We have utilized and extended a set of widely used open source tools for analysis, storage and web-based presentation of protein domain data to populate the Chado database underlying the Legume Information System (http://legumeinfo.org). We make this data available through a shareable Tripal extension module that enables search by domains, exploits the ontological structure of InterPro, and interlinks our domain-oriented representation with other modules for presentation of genes and gene families.
Protein domain search page
dbxref table (IPR001106)
cvterm table (Aromatic amino acid lyase)
cvterm_id
dbxref_id
References/Publications
The InterPro protein families database: the classification resource after 15 years. Nucleic Acids Research, Jan 2015; doi:10.1093/nar/gku1243
InterProScan 5: genome-scale protein function classification. Bioinformatics, Jan 2014; doi:10.1093/bioinformatics/btu031
Waterhouse AM, Procter JB, Martin DMA, Clamp M, Barton GJ (2009) Jalview Version 2 - a multiple sequence alignment editor and analysis workbench. Bioinformatics 25:1189-1191. doi:10.1093/bioinformatics/btp033
Ficklin S.P., Sanderson L.A., Cheng C.H., Staton M.E., Lee T., Cho I.H., Jung S., Bett K.E., Main D. Tripal: a construction toolkit for online genome databases. Database. 2011:bar044.
Example
GFF file storing iprscan results
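As a rough illustration of how a GFF record of an InterProScan match could be turned into the coordinates stored in featureloc, the sketch below parses one tab-separated GFF3 line. The sample line is hypothetical (it is not taken from the poster's file), and the function name `parse_gff3_line` is an assumption. The one conversion shown relies on GFF3 using 1-based inclusive coordinates while Chado stores 0-based interbase `fmin`/`fmax`.

```python
def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict with Chado-style coordinates."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-separated GFF3 columns, got %d" % len(cols))
    seqid, source, ftype, start, end, score, strand, phase, attrs = cols

    # Column 9 holds semicolon-separated key=value pairs.
    attributes = {}
    for pair in attrs.split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[key] = value.strip('"')

    return {
        "seqid": seqid,
        "source": source,
        "type": ftype,
        "fmin": int(start) - 1,   # GFF3 1-based start -> interbase fmin
        "fmax": int(end),         # GFF3 inclusive end == interbase fmax
        "attributes": attributes,
    }


# Hypothetical record for illustration only.
sample = "\t".join([
    "glyma.Glyma.10G209800.1", "Pfam", "protein_match",
    "26", "518", ".", "+", ".",
    "Name=PF00221;Dbxref=InterPro:IPR001106",
])
record = parse_gff3_line(sample)
```

A loader would then create or look up the polypeptide and domain features by name and write `fmin`/`fmax` into featureloc for the match.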
Future Directions
• Use of the ontology structure of InterPro to enhance searching
• Display of intraspecific variation in the context of the domain architecture (similar to how we are now displaying interspecific variation in the MSAs)
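One way the first direction might work is to expand a searched domain term to all of its descendants through the cvterm_relationship table before matching features. The sketch below is only an assumption about how that could be done: it uses SQLite in place of PostgreSQL, omits the relationship `type_id` column, and the term names and the helper `term_with_descendants` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cvterm (
    cvterm_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);
CREATE TABLE cvterm_relationship (
    subject_id INTEGER REFERENCES cvterm(cvterm_id),  -- child term
    object_id  INTEGER REFERENCES cvterm(cvterm_id)   -- parent term
);
""")

# Hypothetical four-term hierarchy: parent > child > grandchild, plus
# one unrelated term.
conn.executemany("INSERT INTO cvterm VALUES (?, ?)", [
    (1, "parent domain"),
    (2, "child domain"),
    (3, "grandchild domain"),
    (4, "unrelated domain"),
])
conn.executemany("INSERT INTO cvterm_relationship VALUES (?, ?)", [
    (2, 1),
    (3, 2),
])


def term_with_descendants(conn, name):
    """Return the named term plus every term below it in the hierarchy."""
    sql = """
    WITH RECURSIVE subtree(cvterm_id) AS (
        SELECT cvterm_id FROM cvterm WHERE name = ?
        UNION
        SELECT r.subject_id
        FROM cvterm_relationship r
        JOIN subtree s ON r.object_id = s.cvterm_id
    )
    SELECT c.name
    FROM cvterm c
    JOIN subtree s ON s.cvterm_id = c.cvterm_id
    ORDER BY c.cvterm_id
    """
    return [row[0] for row in conn.execute(sql, (name,))]


expanded = term_with_descendants(conn, "parent domain")
```

A search for the parent term would then also return proteins annotated only with the more specific child entries.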