Post on 22-Dec-2015
Ram Sasisekharan
Biological Engineering Division
Massachusetts Institute of Technology
Cambridge, MA
Bioinformatics: Glycomics
A post-genomic paradigm
Outline
• Overview – “omics”
• Systems Biology
• Pre- and Post-Genomic bioinformatics
• Issues with Glycomics
• Addressing the Challenges
• New Research Models for Post-Genomics Bioinformatics – the Glue Grants
• Consortium for Functional Glycomics
• Conclusions
Central dogma in biology and the age of the “omics”
[Diagram: Genotype (DNA Sequence) → RNA → Translation of Protein Sequence → Posttranslational Modification → PHENOTYPE. Stage labels: Genomics, Proteomics, Glycomics (an emerging paradigm)]
Information Content in Biological Systems
Sequence
– DNA: Nucleotides
– Protein: Amino Acids
– Carbohydrates: Monosaccharides
Structure
– Secondary
– Tertiary
– Quaternary
Biological Activity
– Molecular
– Cellular
– Tissue
Interactions between molecules
Systems Biology
• Manipulate – Molecular Genetics, Chemical Genetics, Cell Engineering
• Measure – Biochemistry, Imaging, Bioelectronics
• Model – Bayesian Networks, Boolean Networks
• Mine – Bioinformatics
What is Bioinformatics?
• Assimilation
• Cataloging
• Classification
of biological information for
• Model Creation
• Prediction of behavior of a biological system for a given set of inputs
• Data Acquisition – Web Interface, Data Curation, Knowledge base
• Data Storage – Database Infrastructure, Database Design
• Data Dissemination – Search Engines, Simple/Advanced Queries
• Tools for Data Analysis – Statistical Analysis, Comparison, Scoring functions
• Model Creation – Network of relationships between structural and functional attributes of biological macromolecules
• Prediction – System Behavior to set of Inputs
Landmarks in Genomics
1970s Advent of DNA sequencing
1980s DNA sequencing automated
1990s Era of Bioinformatics: Rapid computational manipulation, storage and dissemination of sequence information
1995 First whole genome sequenced
1999 First human chromosome sequenced
2001 Draft of human genome
Evolving framework for Bioinformatics
Pre-genomics Bioinformatics
• Representing sequence information – single alphabet code
– DNA: {A,T,G,C}
– Proteins: {A,C,D-I,K-N,P-T,V,W,Y}
– Carbohydrates: not well defined
• Storing Information – simple flat file databases
– Sequence Databases – GenBank, SwissProt: Flat file databases without any annotation or structuring of gene and protein sequence information
– Structure Databases – Protein Databank: Flat file database. Structural annotations such as the classification of structural superfamilies (SCOP) were created from PDB entries
– Biological Activity – There was no real database that catalogued the important biological roles of biopolymers. Part of this information was stored as additional text fields in the sequence and structure databases
Limited development of bioinformatics platforms for carbohydrates
Evolving framework for Bioinformatics
Post-genomics Bioinformatics – Proteomics, Glycomics
• Types of information
– Data sets from high-throughput experiments – microarrays, mass spectrometry and other analytical tools
– Data sets from diverse experiments – mouse models to study the biological macromolecule in vivo, sensitive assays for studying interactions between proteins in a biological pathway
• Types of Databases – Complex relational databases
– Relational databases store different attributes obtained from high-throughput experimental data and the relationships between these attributes
Increasing awareness of the importance of carbohydrates in fundamental biological functions, yet little development of bioinformatics applications to represent, store and manipulate information in carbohydrates
Glycomics
Types of Carbohydrates
Branched Sugars: N-Glycans
[Figure labels: P, Asn-X-Thr/Ser, Nascent Polypeptide, OST, ER, Cytosol, Golgi, N-glycan diversity]
Types of Carbohydrates
Linear Sugars: Glycosaminoglycans
GAGs are the most acidic and information-dense linear sugars
[Figure label: Cell]
Representing Information in Carbohydrates
Proteins and DNA – The backbone is mostly fixed; variation among building blocks comes primarily from the side-chain R groups (R = 20 for amino acids, R = 4 for bases)
Carbohydrates – The chemical configuration of the monosaccharide building blocks, the linkages between monosaccharides, and the exocyclic substitutions (R groups) all vary, making both linear and branched sugar structures highly information dense
X: SO3-; Y: Ac/SO3- – Variation in the chemical configuration (I/G) and in the exocyclic sulfation pattern gives 32 building blocks, compared with 20 amino acids and 4 bases
High information density makes representing the information content of carbohydrates a challenging task – simple alphabetic codes don’t efficiently capture the information content
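The comparison of 32, 20 and 4 building blocks can be made concrete with a short calculation. The sketch below is only illustrative: it assumes the 32 HSGAG building blocks arise from five independent two-state properties (I/G, three O-sulfation positions, and acetyl versus sulfate at the N position), which is one reading of the slide and not something the transcript spells out. The property names are hypothetical labels.

```python
from itertools import product
from math import log2

# Assumed two-state properties of one HSGAG disaccharide (labels are illustrative).
PROPERTIES = [
    "uronic_acid_I_or_G",
    "2-O_sulfate",
    "6-O_sulfate",
    "3-O_sulfate",
    "N_acetyl_or_sulfate",
]

# Every combination of the five binary properties is one building block.
building_blocks = list(product((0, 1), repeat=len(PROPERTIES)))

ALPHABET_SIZES = {
    "DNA bases": 4,
    "amino acids": 20,
    "HSGAG disaccharides": len(building_blocks),
}


def bits_per_unit(alphabet_size: int) -> float:
    """Bits needed to identify one building block from an alphabet of this size."""
    return log2(alphabet_size)


for name, size in ALPHABET_SIZES.items():
    print(f"{name}: {size} building blocks, {bits_per_unit(size):.2f} bits per unit")
```

Under this assumption each disaccharide carries 5 bits, against 2 bits for a base, which is one way to read "information dense".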
Carbohydrate – Protein Interaction:
• Carbohydrate – Protein interactions are key in modulating cell-cell communication
• Glycosylation on cell surface proteins acts as a recognition motif for proteins on multiple cell types, including immune cells and pathogens
• Because binding relies on multivalent interactions, the binding between a single carbohydrate and a lectin is weak and therefore hard to characterize
Multivalent interactions between carbohydrates and proteins complicate the relationship between these interacting species
Glycosaminoglycan Paradigm
• Cell surface proteoglycans comprise long GAG polysaccharides that provide the cell with a “sugar coat”
• GAGs interact with a multitude of signaling molecules in a sequence-specific manner and modulate important biological processes
• Different GAG sequences have different affinities for a particular signaling molecule, and this gradient in affinity plays a key role in “analog” regulation of biological function
[Figure labels: IL-8, TGF-, FGF, INF-, VEGF, TNF, Chemokine, Enzymes, Integrins, Pathogens]
Characterization of Carbohydrate Structures
Complex Biosynthesis
– Biosynthesis is not template based and it involves several enzymes
– There are multiple isoforms of these enzymes with different substrate specificities, further increasing the diversity of structures
– It is not possible at this time to amplify tissue-derived carbohydrates due to their complex biosynthesis – low amounts of biological sample
[Figure labels: NDST, 2OST, 6OST, 3OST, Epimerase]
Characterization of Carbohydrate Structures
Challenges in Isolation and Purification
– Due to the chemical heterogeneity, it is difficult to obtain sufficient amounts of pure, homogeneous sample.
– Often the sample analyzed is a mixture, so the sequence information in many cases cannot be fully determined – non-deterministic system
Partial information on carbohydrate structure due to limitations in their structural characterization poses significant challenges in storing and manipulating information content in carbohydrates
Advancing Glycomics – Key Issues
• Representing Information in Carbohydrates is complicated – alphabetic codes are too cumbersome to handle information density
• Dealing with analysis of low amounts of tissue derived material
• As a result of the challenges in the structural analysis of carbohydrates, there is a need to develop tools to represent and manipulate partial/non-deterministic information on carbohydrate structures
• “Analog” regulation of biological function by carbohydrates poses a challenge in providing functional attributes to specific carbohydrate structures
Addressing the Challenges
Representing information in Carbohydrates – HSGAG as model system
• Property encoded nomenclature (PEN)
– Numerical scheme that optimally allocates bits to encode “properties” and the identity of the building block of biopolymers
– Facilitates the use of mathematical operations to manipulate the information.
Features of PEN framework
• Derived from an internal logic that is based on the chemical nature of the building block
• Uses a numerical system and mathematical operations to perform manipulations
• Can be easily extended to encode more variations either by using more bits or higher numerical base due to the flexibility of the number system
• Facilitates comparison of “properties” directly since property encoded is a function of the chemical identity of the building block.
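The features listed above can be illustrated with a small bit-field encoding. This is a minimal sketch, not the published PEN scheme: the 5-bit layout, the constant names and the two-hex-digit string form are all assumptions chosen for illustration, since the slide does not give the actual bit assignment.

```python
# Hypothetical 5-bit layout for one HSGAG disaccharide (an assumption for
# illustration; the slide does not give the published bit assignment).
G_ACID = 0b10000  # uronic acid: 0 = iduronic (I), 1 = glucuronic (G)
S2 = 0b01000      # 2-O sulfate
S6 = 0b00100      # 6-O sulfate
S3 = 0b00010      # 3-O sulfate
N_SULF = 0b00001  # N position: 0 = acetyl, 1 = sulfate
SULFATE_MASK = S2 | S6 | S3 | N_SULF


def encode(uronic: str, s2: bool, s6: bool, s3: bool, n_sulfated: bool) -> int:
    """Pack the properties of one disaccharide into a 5-bit integer."""
    if uronic not in ("I", "G"):
        raise ValueError("uronic must be 'I' or 'G'")
    code = G_ACID if uronic == "G" else 0
    for flag, bit in ((s2, S2), (s6, S6), (s3, S3), (n_sulfated, N_SULF)):
        if flag:
            code |= bit
    return code


def sulfate_count(code: int) -> int:
    """Number of sulfate groups, read directly from the code with a bit mask."""
    return bin(code & SULFATE_MASK).count("1")


def is_acetylated(code: int) -> bool:
    """True when the N position carries an acetyl group in this layout."""
    return not code & N_SULF


def encode_sequence(codes) -> str:
    """Write a chain as two hexadecimal digits per disaccharide."""
    return "".join(f"{code:02X}" for code in codes)


trisulfated = encode("I", s2=True, s6=True, s3=False, n_sulfated=True)
unsulfated = encode("G", s2=False, s6=False, s3=False, n_sulfated=False)
print(encode_sequence([trisulfated, unsulfated]), sulfate_count(trisulfated))
```

Because each property occupies its own bit, a property such as the sulfate count is a mask-and-count operation on the code itself, and adding a further variation would only need one more bit.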
Dealing with low sample amounts – Sensitive MALDI-MS analysis
• Matrix – Caffeic acid
• The complex with a basic peptide, (RG)nR, is detected
• Laser-induced ionization leads to formation of molecular ions
• Mass of the saccharide is obtained to an accuracy of <1 Dalton by subtracting the mass of the peptide from the mass of the complex
• Accurately determines masses of picomole amounts of sample, typical of biologically important HSGAG oligosaccharides
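The subtraction step above amounts to one line of arithmetic. The sketch below assumes that the free peptide and the peptide-saccharide complex are both observed as singly charged ions, so the charge carrier cancels in the difference; the function name and the example peak values are hypothetical and are not data from the talk.

```python
def saccharide_mass(complex_mz: float, peptide_mz: float) -> float:
    """Saccharide mass from two singly charged peaks (assumed charge state).

    complex_mz: observed m/z of the peptide-saccharide complex
    peptide_mz: observed m/z of the free basic peptide
    """
    if complex_mz <= peptide_mz:
        raise ValueError("the complex must be heavier than the free peptide")
    return complex_mz - peptide_mz


# Hypothetical peak positions, for illustration only.
mass = saccharide_mass(complex_mz=5382.0, peptide_mz=4226.8)
print(f"saccharide mass: {mass:.1f} Da")
```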
Applications of PEN
Mass – Composition relationship
The length and the numbers of sulfates and acetates of an HLGAG oligomer can be unambiguously assigned for oligomers up to a tetradecasaccharide
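One way to explore the mass-composition relationship is to tabulate the mass of every allowed composition and look up an observed mass. The sketch below rests on assumptions that are not in the slide: a simple additive mass model, approximate monoisotopic increments, and a site limit of three O-sulfation positions plus one N position per disaccharide. All names are hypothetical.

```python
# Approximate monoisotopic masses in Da (assumed values for this sketch).
WATER = 18.0106
DISACCHARIDE_RESIDUE = 337.1009  # uronic acid + glucosamine residues, unsubstituted
SULFATE = 79.9568                # mass added per sulfate group (SO3)
ACETATE = 42.0106                # mass added per N-acetyl group (C2H2O)


def oligomer_mass(n_disaccharides: int, n_sulfates: int, n_acetates: int) -> float:
    """Mass of a chain under a simple additive model."""
    return (n_disaccharides * DISACCHARIDE_RESIDUE + WATER
            + n_sulfates * SULFATE + n_acetates * ACETATE)


def compositions(max_disaccharides: int = 7):
    """Yield (length, sulfates, acetates) allowed by the assumed site limits."""
    for n in range(1, max_disaccharides + 1):
        for acetates in range(n + 1):                    # at most one N-acetyl per unit
            for sulfates in range(4 * n - acetates + 1):  # 3 O-sites + free N-sites
                yield n, sulfates, acetates


def match_mass(observed: float, tolerance: float = 0.5, max_disaccharides: int = 7):
    """Return every composition whose model mass lies within the tolerance."""
    return [c for c in compositions(max_disaccharides)
            if abs(oligomer_mass(*c) - observed) <= tolerance]


def smallest_gap(max_disaccharides: int = 7) -> float:
    """Smallest mass difference between any two tabulated compositions."""
    masses = sorted(oligomer_mass(*c) for c in compositions(max_disaccharides))
    return min(b - a for a, b in zip(masses, masses[1:]))


print(match_mass(oligomer_mass(3, 5, 1)))
print(f"smallest gap up to 7 disaccharides: {smallest_gap():.3f} Da")
```

If the smallest gap stays above the measurement accuracy, every mass maps to a single composition; the default limit of 7 disaccharides corresponds to a tetradecasaccharide.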
Applications of PEN: PEN-MALDI Sequencing Strategy
[Flow diagram: Formalism (hexadecimal binary notation based mass-line) and MALDI-MS; All Possible Sequences → experimental constraints (composition, chemical, enzymatic; iterative) → Unique Solution]
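The strategy in the diagram, starting from all possible sequences and eliminating them with experimental constraints until one remains, can be sketched as a filter over an enumerated list. Everything specific here is an assumption for illustration: the 5-bit codes per disaccharide, the composition, and the two "experimental" constraints are hypothetical and do not come from the talk.

```python
from itertools import product

# Hypothetical 5-bit code per disaccharide (illustrative layout):
# bit 4 = G instead of I, bits 3..1 = 2-O, 6-O, 3-O sulfate, bit 0 = N-sulfate.
G_ACID, S2, S6, S3, N_SULF = 0b10000, 0b01000, 0b00100, 0b00010, 0b00001
SULFATE_MASK = S2 | S6 | S3 | N_SULF


def sulfates(code: int) -> int:
    return bin(code & SULFATE_MASK).count("1")


def acetates(code: int) -> int:
    # In this simplified layout the N position is acetylated when not sulfated.
    return 0 if code & N_SULF else 1


def sequence_candidates(n_blocks, n_sulfates, n_acetates, constraints=()):
    """Enumerate all sequences with the given composition, then apply constraints."""
    candidates = [
        seq for seq in product(range(32), repeat=n_blocks)
        if sum(map(sulfates, seq)) == n_sulfates
        and sum(map(acetates, seq)) == n_acetates
    ]
    for constraint in constraints:  # each constraint stands for one experiment
        candidates = [seq for seq in candidates if constraint(seq)]
    return candidates


# Hypothetical experimental findings for a tetrasaccharide (two disaccharides).
def first_block_is_trisulfated_iduronate(seq):
    return seq[0] == (S2 | S6 | N_SULF)


def second_block_is_6_sulfated_glucuronate(seq):
    return bool(seq[1] & G_ACID) and bool(seq[1] & S6)


by_composition = sequence_candidates(2, n_sulfates=4, n_acetates=1)
solution = sequence_candidates(
    2, 4, 1,
    constraints=(first_block_is_trisulfated_iduronate,
                 second_block_is_6_sulfated_glucuronate),
)
print(len(by_composition), solution)
```

Composition alone leaves many candidates; each added constraint shrinks the list, and the procedure ends when a single sequence is left.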
Sequencing HSGAGs: Example
New Research Models for Post-Genomics Bioinformatics – the Glue Grants
• Alliance for Cellular Signaling (AfCS): To understand as completely as possible the relationships between sets of inputs and outputs in signaling cells that vary both temporally and spatially, i.e. how cells interpret signals in a context-dependent manner
• Cell Migration Consortium: To accelerate progress in cell migration-related research by fostering interdisciplinary research activities and producing novel reagents and information
• Consortium for Functional Glycomics: To define the paradigms by which carbohydrate-binding proteins function in cellular communication
• Inflammation and Host Response Consortium: Designed to acquire new scientific knowledge about the biological basis for different outcomes in injured patients
Consortium for Functional Glycomics
Organization of the Core Facilities
Data Organization: Databases
Data Storage: Database Design
• Overview
– Classification of data
  • 6 key identifiers (name tags for data) – CBP ID, GT ID, Carb ID, Project ID, Microarray ID, Mouse Strain ID
  • Data Fields – provide structure to the type of data being entered. Selection of the appropriate data fields depends on what kinds of data will be entered
– Linking data
  • Data fields pertaining to a specific attribute are stored in a table
  • Each table will be linked to other tables via common data fields or identifiers.
  • The data tables and their links form an “Ontology Diagram”
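The idea of tables linked through shared identifiers can be sketched with Python's built-in sqlite3 module. The table names, column names and rows below are hypothetical and only illustrate the linking principle; they are not the Consortium's actual schema or data.

```python
import sqlite3

# In-memory database; all tables, columns and rows are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cbp (
    cbp_id TEXT PRIMARY KEY,          -- key identifier for a carbohydrate binding protein
    name   TEXT
);
CREATE TABLE carbohydrate (
    carb_id   TEXT PRIMARY KEY,       -- key identifier for a carbohydrate
    structure TEXT
);
CREATE TABLE binding (                -- links the two tables via their identifiers
    cbp_id     TEXT REFERENCES cbp(cbp_id),
    carb_id    TEXT REFERENCES carbohydrate(carb_id),
    project_id TEXT,
    signal     REAL
);
""")

conn.execute("INSERT INTO cbp VALUES (?, ?)", ("CBP001", "example lectin"))
conn.execute("INSERT INTO carbohydrate VALUES (?, ?)", ("Carb0001", "hypothetical glycan"))
conn.execute("INSERT INTO binding VALUES (?, ?, ?, ?)", ("CBP001", "Carb0001", "PRJ001", 0.8))

# Follow the links: which protein binds which carbohydrate in a given project?
rows = conn.execute("""
    SELECT cbp.name, carbohydrate.structure, binding.signal
    FROM binding
    JOIN cbp USING (cbp_id)
    JOIN carbohydrate USING (carb_id)
    WHERE binding.project_id = ?
""", ("PRJ001",)).fetchall()
print(rows)
```

Each table holds the data fields for one attribute, and the join over the common identifiers is what the ontology diagram draws as links between tables.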
Data Storage: Database Ontology
[Ontology diagram. Node and field labels: CBP (Cell line, cDNA sequence, Yield); Mouse Strain (Immunology, Histology); Carbohydrate Ligand Binding Data; Biochemical pathway; Expression; Mouse Studies; CBP-Carbohydrate Interaction. Contributor labels: Core C, Core D, Core F, Core G, Core H, PI]
Data Storage: Relational Databases
[Diagram of linked tables; quoted lines are the link labels]

Protein
  CBP ID: CBP001
  GenBank: GB0001
  SwissProt: SP00001
  PDB: 1XXX
  . . .

Author
  Name: XYZ, …
  Email: 1@2.3
  Institution: ABC
  …
"characterized"

Protein Expression
  Cell Line: BL21
  Gel Image: Img.jpg
  cDNA clone: GB0002
  . . .

Carbohydrate
  Carb ID: CBP001
  Structure notation
  Carb DB: Carb0001
  PDB: 1XYZ
  . . .
"was expressed using"
"interacts with"

Characterization
  Carb ID: Carb0001
  Mass Spec: MS-1.jpg
  NMR: NMR.jpg
  Structure notation
  . . .
"characterized using"

Author
  Name: MNO, …
  Email: 2@3.4
  Institution: IJK
  …
"characterized"

Sample Ontology from AfCS
Conclusions
• In the post-genomics era, high-throughput experimental methods are generating large data sets pertaining to multiple sequence, structure and functional attributes of genes and proteins – Transition from Traditional Biology to Information-driven “Systems Biology”
• With constantly increasing computational power, there has been a big leap in the development of bioinformatics tools to deal with large data sets
• Increasing awareness of the role of carbohydrates in fundamental biological processes modulating cell-cell and cell-matrix interactions
• Development of bioinformatics applications for carbohydrates has many challenges due to their complexity and heterogeneity
• Addressing these challenges would enable the development of bioinformatics for glycobiology to provide a more comprehensive description of the “state” of a biological system and to better predict the “response” of a biological system to a given set of “inputs”