Date post: | 15-Jan-2016 |
Semantics powered Bioinformatics
Amit Sheth, William S. York, et al.
Large Scale Distributed Information Systems Lab & Complex Carbohydrate Research Center
University of Georgia
http://lsdis.cs.uga.edu
Project Information:
Background: SW for Life Sciences
• Bioinformatics of Glycan Expression, a component of the NCRR "Integrated Technology Resource for Biomedical Glycomics".
• W3C Interest Group on Semantic Web for Health care and Life Sciences
• Deployed Active Semantic Electronic Medical Patient Record application at the Athens Heart Center
Agenda
• Review of Accomplishments/Ongoing Work:
o GLYDE standard
o GlycO Ontology
o ProPreO Ontology
o Semantic Analytical Glycomics Workflow
o Visualization
o Semantic Web Services: WSDL-S/METEOR-S
GLYDE standard
• An XML based representation format for glycan structures
• Inter-convertible with existing data represented using IUPAC or LINUCS.
• In progress: Incorporation of Probability based representation
• In progress: Incorporation of aspects for visualization of structures using GLYDE (XML) files
GLYDE - An expressive XML standard for the representation of glycan structure. Carbohydrate Research, 340 (18), Dec 30, 2005.
• Enable querying and export of query results in GLYDE format
• Using GLYDE representation for disambiguation, mapping and matching
[Figure: a QUERY against MonosaccharideDB, SweetDB and KEGG returns its RESULT in GLYDE format (<glyde><residue> ... </residue></glyde>).]
Collaborative GlycoInformatics
• Development of GLYDE semantic web portal
• Integration with www.glycosciences.de
o Visualization aspect integrated with LiGraph (Heidelberg) or OntoVista (UGA)
• Semantic Annotation of publications in GlycoProteomics domain
[Figure: GLYDE Semantic Portal, with labels KEGG, MonosaccharideDB and www.glycosciences.de.]
Collaborative GlycoInformatics
Evolving collaboration between:
• LSDIS/CCRC: Will York, Amit Sheth, Michael Pierce
• EUROCarbDB (German Cancer Research Center): Willi von der Lieth
• Consortium for Functional Glycomics (CFG): Rahul Raman, Ram Sasisekharan, Thomas Lütteke
• N.D. Zelinsky Institute of Organic Chemistry (Moscow) Yuriy Knirel
• Mitsui Knowledge Industry (Japan): Hisashi Narimatsu, Norihiro Kikuchi
• Kyoto Encyclopedia of Genes and Genomes (KEGG): Minoru Kanehisa, Kiyoko F. Aoki-Kinoshita
• Palo Alto Research Center (PARC): David Goldberg
Semantic GlycoInformatics - Ontologies
• GlycO: A domain ontology for glycan structures, glycan functions and enzymes (embodying knowledge of the structure and metabolism of glycans)
o Contains 600+ classes and 100+ properties that describe structural features of glycans; unique population strategy
o URL: http://lsdis.cs.uga.edu/projects/glycomics/glyco
• ProPreO: A comprehensive process ontology modeling experimental proteomics
o Contains 330 classes, 6 million+ instances
o Models three phases of experimental proteomics
o URL: http://lsdis.cs.uga.edu/projects/glycomics/propreo
GlycO taxonomy
The first levels of the GlycO taxonomy
Most relationships and attributes in GlycO
GlycO exploits the expressiveness of OWL-DL. Cardinality constraints, value constraints, and existential and universal restrictions on the range and domain of properties allow the classification of unknown entities as well as the deduction of implicit relationships.
Pathway representation in GlycO
Pathways do not need to be explicitly defined in GlycO. The residue, glycan, enzyme and reaction descriptions contain all the knowledge necessary to infer pathways.
Zooming in a little …
The N-glycan with KEGG ID 00015 is the substrate of reaction R05987, which is catalyzed by an enzyme of the class EC 2.4.1.145. The product of this reaction is the glycan with KEGG ID 00020.
[Figure labels: Reaction R05987; catalyzed by enzyme 2.4.1.145; adds_glycosyl_residue; N-glycan_b-D-GlcpNAc_13]
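The idea that pathways can be inferred rather than stored can be sketched in a few lines of Python. This is only an illustration, not GlycO's OWL-DL reasoning: it assumes reactions are available as plain (substrate, enzyme, product) records. The first record is the reaction named on the slide; the second record (R_HYPOTHETICAL, EC_HYPOTHETICAL, G_NEXT) is invented purely to show two steps being chained.

```python
from typing import NamedTuple


class Reaction(NamedTuple):
    reaction_id: str
    substrate: str  # glycan consumed by the reaction
    enzyme: str     # EC class of the catalyzing enzyme
    product: str    # glycan produced by the reaction


def infer_pathways(reactions, start):
    """Return every reaction chain that begins at glycan `start`.

    No pathway is stored anywhere: chains are derived by matching the
    product of one reaction to the substrate of the next.
    """
    by_substrate = {}
    for reaction in reactions:
        by_substrate.setdefault(reaction.substrate, []).append(reaction)

    pathways = []

    def extend(glycan, chain, seen):
        # Only follow reactions whose product has not been visited (no cycles).
        next_steps = [r for r in by_substrate.get(glycan, [])
                      if r.product not in seen]
        if not next_steps:
            if chain:
                pathways.append(chain)
            return
        for reaction in next_steps:
            extend(reaction.product, chain + [reaction],
                   seen | {reaction.product})

    extend(start, [], {start})
    return pathways


REACTIONS = [
    # From the slide: glycan 00015 -> R05987 (EC 2.4.1.145) -> glycan 00020.
    Reaction("R05987", "00015", "2.4.1.145", "00020"),
    # Hypothetical follow-up step, made up for illustration only.
    Reaction("R_HYPOTHETICAL", "00020", "EC_HYPOTHETICAL", "G_NEXT"),
]

PATHWAYS = infer_pathways(REACTIONS, "00015")
```

Calling `infer_pathways(REACTIONS, "00015")` yields one chain of two reactions, even though no pathway object was ever declared.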
Ontology Population
• The next slides show the different steps that were necessary to populate GlycO with glycan structures from multiple sources.
• GLYDE is used to disambiguate between representations from multiple sources
Ontology population workflow
[Figure: flowchart with the nodes Semagix Freedom knowledge extractor; Instance Data; Has CarbBank ID? (YES / NO); IUPAC to LINUCS; LINUCS to GLYDE; Compare to Knowledge Base; Already in KB? (YES: next Instance / NO); Insert into KB.]
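The de-duplication at the heart of this workflow can be sketched as follows. This is a simplified illustration under stated assumptions: the IUPAC to LINUCS to GLYDE conversion is collapsed into a single caller-supplied function, the CarbBank ID check is not modeled, and the knowledge base is a plain dictionary. The names `populate` and `strip_whitespace` are hypothetical.

```python
from typing import Callable, Dict, Iterable, List


def populate(
    instances: Iterable[str],
    to_canonical: Callable[[str], str],
    knowledge_base: Dict[str, str],
) -> List[str]:
    """Insert structures whose canonical form is not yet in the KB.

    Returns the canonical keys that were newly inserted.
    """
    inserted = []
    for raw in instances:
        key = to_canonical(raw)      # stands in for IUPAC -> LINUCS -> GLYDE
        if key in knowledge_base:    # "Already in KB?" YES: next instance
            continue
        knowledge_base[key] = raw    # NO: insert into KB
        inserted.append(key)
    return inserted


def strip_whitespace(structure: str) -> str:
    """Toy canonicalizer (hypothetical): drop all whitespace."""
    return "".join(structure.split())


KB: Dict[str, str] = {}
NEW_KEYS = populate(
    [
        "[][Asn]{ [(4+1)][b-D-GlcpNAc]{} }",   # same structure, spaced out
        "[][Asn]{[(4+1)][b-D-GlcpNAc]{}}",
    ],
    strip_whitespace,
    KB,
)
```

Both inputs reduce to the same canonical key, so only one entry reaches the knowledge base. A real canonical form would have to normalize far more than whitespace, for example the order of branches.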
LINUCS representation:
[][Asn]{[(4+1)][b-D-GlcpNAc]{[(4+1)][b-D-GlcpNAc]{[(4+1)][b-D-Manp]{[(3+1)][a-D-Manp]{[(2+1)][b-D-GlcpNAc]{}[(4+1)][b-D-GlcpNAc]{}}[(6+1)][a-D-Manp]{[(2+1)][b-D-GlcpNAc]{}}}}}}
GLYDE (XML) representation:
<Glycan>
  <aglycon name="Asn"/>
  <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
    <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
      <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="Man">
        <residue link="3" anomeric_carbon="1" anomer="a" chirality="D" monosaccharide="Man">
          <residue link="2" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
          </residue>
          <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
          </residue>
        </residue>
        <residue link="6" anomeric_carbon="1" anomer="a" chirality="D" monosaccharide="Man">
          <residue link="2" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
          </residue>
        </residue>
      </residue>
    </residue>
  </residue>
</Glycan>
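The "LINUCS to GLYDE" step can be illustrated with a small recursive parser. This is a sketch, not the project's converter: it assumes the LINUCS subset visible in the example on these slides (a root of the form `[][aglycon]{...}` and residues named like `b-D-GlcpNAc`), and it assumes the monosaccharide attribute is obtained by dropping the ring-form letter (`p` or `f`), as the slide's XML suggests. The function name `linucs_to_glyde` is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# "[(4+1)][b-D-GlcpNAc]{"  or, for the root,  "[][Asn]{"
_NODE = re.compile(r"\[(?:\((\d+)\+(\d+)\))?\]\[([^\]]*)\]\{")
# anomer - chirality - three-letter base, ring form, optional substituent
_NAME = re.compile(r"^([ab])-([DL])-([A-Z][a-z]{2})[pf](.*)$")


def _parse_children(text, pos, parent):
    """Parse sibling residues from `pos` up to the matching '}'."""
    while pos < len(text) and text[pos] != "}":
        match = _NODE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected input at offset {pos}")
        link, anomeric_carbon, name = match.groups()
        parts = _NAME.match(name)
        if link is None or parts is None:
            raise ValueError(f"unsupported residue {name!r}")
        anomer, chirality, base, substituent = parts.groups()
        residue = ET.SubElement(parent, "residue", {
            "link": link,
            "anomeric_carbon": anomeric_carbon,
            "anomer": anomer,
            "chirality": chirality,
            "monosaccharide": base + substituent,  # e.g. GlcpNAc -> GlcNAc
        })
        pos = _parse_children(text, match.end(), residue)
    if pos >= len(text):
        raise ValueError("missing closing '}'")
    return pos + 1  # skip the closing '}'


def linucs_to_glyde(linucs):
    """Convert a LINUCS string (subset) into a <Glycan> element tree."""
    text = "".join(linucs.split())
    root = _NODE.match(text)
    if root is None or root.group(1) is not None:
        raise ValueError("expected a root of the form [][aglycon]{...}")
    glycan = ET.Element("Glycan")
    ET.SubElement(glycan, "aglycon", {"name": root.group(3)})
    end = _parse_children(text, root.end(), glycan)
    if end != len(text):
        raise ValueError("trailing input after the root node")
    return glycan


LINUCS_EXAMPLE = (
    "[][Asn]{[(4+1)][b-D-GlcpNAc]{[(4+1)][b-D-GlcpNAc]"
    "{[(4+1)][b-D-Manp]{[(3+1)][a-D-Manp]"
    "{[(2+1)][b-D-GlcpNAc]{}[(4+1)][b-D-GlcpNAc]"
    "{}}[(6+1)][a-D-Manp]{[(2+1)][b-D-GlcpNAc]{}}}}}}"
)
GLYCAN = linucs_to_glyde(LINUCS_EXAMPLE)
```

`ET.tostring(GLYCAN, encoding="unicode")` serializes the result; the nesting of `residue` elements mirrors the nesting of braces in the LINUCS string.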
ProPreO
• ProPreO: A process ontology to capture the proteomics experimental lifecycle:
o Separation
o Mass spectrometry
o Analysis
o 330 classes
o 110 properties
o 6 million+ instances
Manual annotation of mouse kidney spectrum by a human expert. For clarity, only 19 of the major peaks have been annotated.
Usage: Mass spectrometry analysis
Goldberg, et al, Automatic annotation of matrix-assisted laser desorption/ionization N-glycan spectra, Proteomics 2005, 5, 865–875
P(S | M = 3461.57) = 0.6, P(T | M = 3461.57) = 0.4
Semantic Annotation of Experimental Data
• Enables ontology-mediated disambiguation
• Allows correlation between disparate entities using semantic relations
Semantic GlycoProteomics Workflow
[Figure: workflow diagram. Samples: Cell Culture, Glycoprotein Fraction, Glycopeptides Fraction, Peptide Fraction (multiplicities 1, n, n*m). Steps: extract, proteolysis, Separation technique I, PNGase, Separation technique II, Mass spectrometry, Data reduction, binning, Peptide identification, Signal integration, Data correlation. Data: ms data, ms/ms data, ms peaklist, ms/ms peaklist, Peptide list, N-dimensional array. Outcome: Glycopeptide identification and quantification.]
Web Services based Workflow = Web Process
[Figure: a WORKFLOW composed of Web Service 1 to Web Service 4 (WS1 to WS4), with platform labels Linux, Solaris, Mac and Windows XP.]
BOWSER
• Use semantics for describing Web Services
• WSDL-S (LSDIS/IBM)
• Use service-level annotation of Web Services
• Graphical traversal of taxonomy of biological concepts to search for Web Services
• http://128.192.9.11:8080/stargate/bowser.jsp
Semantic Annotation of Scientific Data
830.9570 194.9604 2
580.2985 0.3592
688.3214 0.2526
779.4759 38.4939
784.3607 21.7736
1543.7476 1.3822
1544.7595 2.9977
1562.8113 37.4790
1660.7776 476.5043
ms/ms peaklist data
<ms/ms_peak_list>
  <parameter instrument="micromass_QTOF_2_quadropole_time_of_flight_mass_spectrometer" mode="ms/ms"/>
  <parent_ion_mass>830.9570</parent_ion_mass>
  <total_abundance>194.9604</total_abundance>
  <z>2</z>
  <mass_spec_peak m/z="580.2985" abundance="0.3592"/>
  <mass_spec_peak m/z="688.3214" abundance="0.2526"/>
  <mass_spec_peak m/z="779.4759" abundance="38.4939"/>
  <mass_spec_peak m/z="784.3607" abundance="21.7736"/>
  <mass_spec_peak m/z="1543.7476" abundance="1.3822"/>
  <mass_spec_peak m/z="1544.7595" abundance="2.9977"/>
  <mass_spec_peak m/z="1562.8113" abundance="37.4790"/>
  <mass_spec_peak m/z="1660.7776" abundance="476.5043"/>
</ms/ms_peak_list>
Annotated ms/ms peaklist data
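The step from the raw peak list to its annotated form can be sketched in Python. This is an illustration only: it assumes the layout shown on the slide (a header line with parent ion mass, total abundance and charge, followed by one m/z and abundance pair per line). Because `/` is not allowed in XML names, the sketch substitutes the names `ms_ms_peak_list` and `mz` for the slide's `ms/ms_peak_list` and `m/z`; those substitutions and the function name `annotate_peaklist` are assumptions.

```python
import xml.etree.ElementTree as ET

RAW_PEAKLIST = """\
830.9570 194.9604 2
580.2985 0.3592
688.3214 0.2526
779.4759 38.4939
784.3607 21.7736
1543.7476 1.3822
1544.7595 2.9977
1562.8113 37.4790
1660.7776 476.5043
"""


def annotate_peaklist(raw, instrument, mode="ms/ms"):
    """Wrap a whitespace-separated peak list in descriptive XML elements."""
    rows = [line.split() for line in raw.splitlines() if line.strip()]
    # Header line: parent ion mass, total abundance, charge state.
    parent_mass, total_abundance, charge = rows[0]

    root = ET.Element("ms_ms_peak_list")  # assumed name, valid in XML
    ET.SubElement(root, "parameter", {"instrument": instrument, "mode": mode})
    ET.SubElement(root, "parent_ion_mass").text = parent_mass
    ET.SubElement(root, "total_abundance").text = total_abundance
    ET.SubElement(root, "z").text = charge

    # Remaining lines: one peak each, as (m/z, abundance).
    for mz, abundance in rows[1:]:
        ET.SubElement(root, "mass_spec_peak", {"mz": mz, "abundance": abundance})
    return root


ANNOTATED = annotate_peaklist(
    RAW_PEAKLIST,
    instrument="micromass_QTOF_2_quadropole_time_of_flight_mass_spectrometer",
)
```

Values are kept as the original strings rather than parsed to floats, so the annotated file reproduces the instrument output digit for digit.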
Discovery of relationship between biological entities
[Figure labels: identified and quantified peptides; fragment of specific protein; lectin; collection of N-glycan ligands; collection of biosynthetic enzymes; specific cellular process; GlycO; ProPreO; Gene Ontology (GO); genomic database (Mascot/Sequest).]
The inference: instances of the class collection of biosynthetic enzymes (GNT-V) are involved in the specific cellular process (metastasis).
Semantic Web Services using WSDL-S
• Formalize description and classification of Web Services using ProPreO concepts
<?xml version="1.0" encoding="UTF-8"?>
<wsdl:definitions targetNamespace="urn:ngp" ….. xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <wsdl:types>
    <schema targetNamespace="urn:ngp" xmlns="http://www.w3.org/2001/XMLSchema">
      …..
      </complexType>
    </schema>
  </wsdl:types>
  <wsdl:message name="replaceCharacterRequest">
    <wsdl:part name="in0" type="soapenc:string"/>
    <wsdl:part name="in1" type="soapenc:string"/>
    <wsdl:part name="in2" type="soapenc:string"/>
  </wsdl:message>
  <wsdl:message name="replaceCharacterResponse">
    <wsdl:part name="replaceCharacterReturn" type="soapenc:string"/>
  </wsdl:message>
WSDL ModifyDB
WSDL-S ModifyDB
<?xml version="1.0" encoding="UTF-8"?>
<wsdl:definitions targetNamespace="urn:ngp" …… xmlns:wssem="http://www.ibm.com/xmlns/WebServices/WSSemantics" xmlns:ProPreO="http://lsdis.cs.uga.edu/ontologies/ProPreO.owl">
  <wsdl:types>
    <schema targetNamespace="urn:ngp" xmlns="http://www.w3.org/2001/XMLSchema">
      ……
      </complexType>
    </schema>
  </wsdl:types>
  <wsdl:message name="replaceCharacterRequest" wssem:modelReference="ProPreO#peptide_sequence">
    <wsdl:part name="in0" type="soapenc:string"/>
    <wsdl:part name="in1" type="soapenc:string"/>
    <wsdl:part name="in2" type="soapenc:string"/>
  </wsdl:message>
[Figure: ProPreO process ontology, showing the concepts data, sequence and peptide_sequence. Caption: concepts defined in the process ontology.]
Description of a Web Service using: Web Service Description Language
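How a client might read the semantic annotation back out of a WSDL-S file can be sketched with Python's standard XML parser. The document below is a trimmed version of the ModifyDB excerpt, with the elided parts left out so that it parses; the `xmlns:wsdl` binding is not visible in the excerpt and is assumed to be the standard WSDL 1.1 namespace. The function name `model_references` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed: the standard WSDL 1.1 namespace (elided on the slide).
WSDL_NS = "http://schemas.xmlsoap.org/wsdl/"
# Taken from the WSDL-S excerpt.
WSSEM_NS = "http://www.ibm.com/xmlns/WebServices/WSSemantics"

WSDL_S = """\
<wsdl:definitions targetNamespace="urn:ngp"
    xmlns:wsdl="http://schemas.xmlsoap.org/wsdl/"
    xmlns:wssem="http://www.ibm.com/xmlns/WebServices/WSSemantics">
  <wsdl:message name="replaceCharacterRequest"
      wssem:modelReference="ProPreO#peptide_sequence">
    <wsdl:part name="in0" type="soapenc:string"/>
    <wsdl:part name="in1" type="soapenc:string"/>
    <wsdl:part name="in2" type="soapenc:string"/>
  </wsdl:message>
  <wsdl:message name="replaceCharacterResponse">
    <wsdl:part name="replaceCharacterReturn" type="soapenc:string"/>
  </wsdl:message>
</wsdl:definitions>
"""


def model_references(wsdl_text):
    """Map each annotated message name to its ontology concept."""
    root = ET.fromstring(wsdl_text)
    attribute = f"{{{WSSEM_NS}}}modelReference"
    return {
        message.get("name"): message.get(attribute)
        for message in root.iter(f"{{{WSDL_NS}}}message")
        if message.get(attribute) is not None
    }


REFERENCES = model_references(WSDL_S)
```

Only the request message carries a `wssem:modelReference`, so the response message is absent from the result. A service registry could index services by these concept names to support the taxonomy-driven search described for BOWSER.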
Semantic Visualization
• Ontologies are meant for machine consumption
• Often too convoluted for the human eye
• The scientist needs to know the concepts she uses for annotation
• Build a visualization environment that translates the formal concepts into a representation the domain expert understands well
Single Glycan
Customizable Layouts
• Using customizable layouts, knowledge can be formalized in a machine-understandable way and then visually translated for the user's needs.
o Cartoonist representation for the glycobiologist
o Chemical reactions shown as left side and right side, instead of the convoluted representation in the ontology
Ongoing and Future Work
• SemURI: Semantic URI based provenance scheme using ProPreO
• RDF-based version of the GLYDE schema
• A framework for semantic annotation of experimental data
• Integration of large datasets (~500MB) into ProPreO for reasoning
Further details at:
• http://lsdis.cs.uga.edu/projects/glycomics/