Ram Sasisekharan Biological Engineering Division Massachusetts Institute of Technology Cambridge, MA...


Ram Sasisekharan
Biological Engineering Division

Massachusetts Institute of Technology
Cambridge, MA

Bioinformatics: Glycomics

A post-genomic paradigm

Outline
• Overview – “omics”
• Systems Biology
• Pre- and Post-Genomic bioinformatics
• Issues with Glycomics
• Addressing the Challenges
• New Research Models for Post Genomics Bioinformatics – the Glue Grants
• Consortium for Functional Glycomics
• Conclusions

Central dogma in biology and the age of the “omics”

[Figure: Genotype (DNA Sequence) → RNA → Translation of Protein Sequence → Posttranslational Modification → PHENOTYPE. Labels: Genomics; Proteomics; Glycomics (an emerging paradigm)]

Information Content in Biological Systems

Sequence
DNA: Nucleotides
Protein: Amino Acids
Carbohydrates: Monosaccharides

Structure
Secondary
Tertiary
Quaternary

Biological Activity
Molecular
Cellular
Tissue

Interactions between molecules

Manipulate: Molecular Genetics, Chemical Genetics, Cell Engineering

Measure: Biochemistry, Imaging, Bioelectronics

Bayesian Networks, Boolean Networks

Bioinformatics

Systems Biology

What is Bioinformatics?

• Assimilation
• Cataloging
• Classification
of biological information for
• Model Creation
• Prediction of behavior of a biological system for a given set of inputs

Data Acquisition: Web Interface, Data Curation, Knowledge base

Data Storage: Database Infrastructure, Database Design

Data Dissemination: Search Engines, Simple/Advanced Queries

Tools for Data Analysis: Statistical Analysis, Comparison, Scoring functions

Model Creation: Network of relationships between structural and functional attributes of biological macromolecules

Prediction: System Behavior to set of Inputs

Landmarks in Genomics

1970s Advent of DNA sequencing

1980s DNA sequencing automated

1990s Era of Bioinformatics: Rapid computational manipulation, storage and dissemination of sequence information

1995 First whole genome sequenced

1999 First human chromosome sequenced

2001 Draft of human genome

Evolving framework for Bioinformatics

Pre-genomics Bioinformatics
• Representing sequence information – single alphabet code
  – DNA: {A,T,G,C}
  – Proteins: {A,C,D-I,K-N,P-T,V,W,Y}
  – Carbohydrates: not well defined
• Storing information – simple flat file databases
  – Sequence Databases – GenBank, SwissProt: flat file databases without any annotation or structuring of gene and protein sequence information
  – Structure Databases – Protein Data Bank (PDB): flat file database. Structural annotations such as the classification of structural superfamilies (SCOP) were created from PDB entries
  – Biological Activity – there was no real database that catalogued the important biological roles of biopolymers. Part of this information was stored as additional text fields in the sequence and structure databases

Limited development of bioinformatics platforms for carbohydrates
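The single alphabet codes listed above can be written out explicitly. The sketch below is illustrative only: `expand_alphabet` and `is_valid` are hypothetical helper names, and the range shorthand (D-I, K-N, P-T) is read as consecutive letters.

```python
def expand_alphabet(spec: str) -> list[str]:
    """Expand a shorthand such as 'A,C,D-I' into individual letters."""
    letters = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            # A range like D-I means every letter from D through I.
            start, end = part.split("-")
            letters.extend(chr(c) for c in range(ord(start), ord(end) + 1))
        else:
            letters.append(part)
    return letters


DNA = expand_alphabet("A,T,G,C")
PROTEIN = expand_alphabet("A,C,D-I,K-N,P-T,V,W,Y")


def is_valid(sequence: str, alphabet: list[str]) -> bool:
    """True if every character of the sequence belongs to the alphabet."""
    return set(sequence) <= set(alphabet)


print(len(DNA), len(PROTEIN))  # 4 20
```

A one-letter-per-unit code works here because each alphabet is small and fixed; the slides argue that carbohydrates lack such a well-defined alphabet.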

Evolving framework for Bioinformatics

Post-genomics Bioinformatics – Proteomics, Glycomics

• Types of information
  – Data sets from high throughput experiments – microarray, mass spectrometry and other analytical tools
  – Data sets from diverse experiments – mouse models to study the biological macromolecule in vivo, sensitive assays for studying interactions between proteins in a biological pathway
• Types of Databases – complex relational databases
  – Relational databases store different attributes obtained from high throughput experimental data and relationships between these attributes

Increasing awareness of the importance of carbohydrates in fundamental biological functions, yet little development of bioinformatics applications to represent, store and manipulate information in carbohydrates

Glycomics

Types of Carbohydrates

Branched Sugars: N-Glycans

[Figure labels: Asn-X-Thr/Ser; Nascent Polypeptide; Cytosol; N-glycan diversity]

Types of Carbohydrates

Linear Sugars: Glycosaminoglycans

GAGs are the most acidic and information dense linear sugars

Representing Information in Carbohydrates

[Figure labels: R = 20 (amino acids); R = 4 (bases)]

In the case of carbohydrates, there are variations in the chemical configuration of the monosaccharide building blocks, in the linkage between monosaccharides, and in the exocyclic substitutions (R groups), making both linear and branched sugar structures highly information dense

X: SO3- ; Y: Ac/SO3-

– Variation in the chemical configuration (I/G) and the exocyclic sulfation pattern gives 32 building blocks, in comparison with 20 amino acids and 4 bases.

High information density makes representation of information content in carbohydrates a challenging task – simple alphabetic codes don’t efficiently capture the information content

Proteins and DNA – the backbone is mostly fixed; variation in the building blocks is primarily due to variation in the side chain R groups
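The count of 32 building blocks can be reproduced by enumeration. This is a sketch under an assumption: the slide text gives the I/G choice, X as a sulfation site, and Y as Ac or SO3, but the number of X sites is not stated in the text, so three X sites are assumed here because that yields 2 × 2 × 2 × 2 × 2 = 32.

```python
import math
from itertools import product

URONIC = ("I", "G")       # chemical configuration of the uronic acid
X_STATES = ("H", "SO3")   # an X site is either unsulfated or sulfated
Y_STATES = ("Ac", "SO3")  # the Y site is either acetylated or sulfated

# Assumption: three X sites per disaccharide unit (not stated in the slide text).
blocks = list(product(URONIC, X_STATES, X_STATES, X_STATES, Y_STATES))
print(len(blocks))  # 32

# Bits needed to name one building block in each kind of polymer.
for name, size in (("bases", 4), ("amino acids", 20), ("disaccharide units", len(blocks))):
    print(name, round(math.log2(size), 2))
```

The last loop puts the comparison on one scale: a unit drawn from 32 possibilities carries 5 bits, against 2 bits for a base.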

Carbohydrate – Protein Interaction:

• Carbohydrate – protein interactions are key in modulating cell-cell communication

• Glycosylation on cell surface proteins acts as a recognition motif for proteins on multiple cell types, including immune cells and pathogens

• Because these interactions are multivalent, the binding between a single carbohydrate and a lectin is weak and thus hard to characterize

Multivalent interactions between carbohydrates and proteins complicate the relationship between these interacting species

Glycosaminoglycan Paradigm

• Cell surface proteoglycans comprise long GAG polysaccharides that provide the cell with a “sugar coat”

• GAGs interact with a multitude of signaling molecules in a sequence specific manner and modulate important biological processes

• Different GAG sequences have differential affinities for a particular signaling molecule, and this gradient in affinity plays a key role in “analog” regulation of biological function

[Figure labels: IL-8, TGF-, FGF, INF-, VEGF, TNF, Chemokine, Enzymes, Integrins, Pathogens]

Characterization of Carbohydrate Structures

Complex Biosynthesis

– Biosynthesis is not template based and it involves several enzymes

– There are multiple isoforms of these enzymes with different substrate specificities, further increasing the diversity of structures

– It is not possible at this time to amplify tissue derived carbohydrates due to their complex biosynthesis – low amounts of biological sample

[Figure labels: 3OST; Epimerase]

Characterization of Carbohydrate Structures

Challenges in Isolation and Purification

– Due to the chemical heterogeneity, it is difficult to obtain sufficient amounts of pure, homogeneous samples.

– Often the sample analyzed is a mixture, therefore the sequence information in many cases cannot be fully determined – non-deterministic system

Partial information on carbohydrate structure due to limitations in their structural characterization poses significant challenges in storing and manipulating information content in carbohydrates

Advancing Glycomics – Key Issues

• Representing Information in Carbohydrates is complicated – alphabetic codes are too cumbersome to handle information density

• Dealing with analysis of low amounts of tissue derived material

• As a result of the challenges in the structural analysis of carbohydrates, there is a need to develop tools to represent and manipulate partial/non-deterministic information on carbohydrate structures

• “Analog” regulation of biological function by carbohydrates poses a challenge in providing functional attributes to specific carbohydrate structures

Addressing the Challenges

Representing information in Carbohydrates – HSGAG as model system

• Property encoded nomenclature (PEN)
  – Numerical scheme that optimally allocates bits to encode “properties” and the identity of the building block of biopolymers
  – Facilitates the use of mathematical operations to manipulate the information.

Features of PEN framework

• Derived from an internal logic that is based on the chemical nature of the building block

• Uses a numerical system and mathematical operations to perform manipulations

• Can be easily extended to encode more variations either by using more bits or higher numerical base due to the flexibility of the number system

• Facilitates comparison of “properties” directly since property encoded is a function of the chemical identity of the building block.
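The idea of allocating bits to properties can be sketched in a few lines. The bit layout below is a hypothetical one chosen for illustration (one bit per property); it is not taken from the slides and should not be read as the published PEN table. The names `encode`, `sulfate_count` and `differing_properties` are likewise made up for this sketch.

```python
# Hypothetical layout: one bit per property of a disaccharide building block.
I_BIT = 0b10000   # uronic acid is iduronic (1) or glucuronic (0)
S2_BIT = 0b01000  # 2-O sulfate present
S6_BIT = 0b00100  # 6-O sulfate present
S3_BIT = 0b00010  # 3-O sulfate present
NS_BIT = 0b00001  # N-sulfated (1) or N-acetylated (0)


def encode(iduronic: bool, s2: bool, s6: bool, s3: bool, n_sulfated: bool) -> int:
    """Pack the five properties into one small integer."""
    code = 0
    for flag, bit in ((iduronic, I_BIT), (s2, S2_BIT), (s6, S6_BIT),
                      (s3, S3_BIT), (n_sulfated, NS_BIT)):
        if flag:
            code |= bit
    return code


def sulfate_count(code: int) -> int:
    """Number of sulfate groups: count the set bits among the four sulfate bits."""
    return bin(code & 0b01111).count("1")


def differing_properties(a: int, b: int) -> int:
    """Bit mask of the properties in which two building blocks differ."""
    return a ^ b


a = encode(True, True, True, False, True)
b = encode(False, False, True, False, True)
print(format(a, "02X"))  # 1D
print(sulfate_count(a), sulfate_count(b))  # 3 2
print(differing_properties(a, b) == I_BIT | S2_BIT)  # True
```

Because the code is a number rather than a letter, a property comparison becomes a single bitwise operation, and adding a property means adding a bit.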

Dealing with low sample amounts – Sensitive MALDI-MS analysis

• Matrix – Caffeic acid

• Complex with basic peptide – (RG)nR detected

• Laser induced ionization leads to formation of molecular ions

• Mass of saccharide is obtained to an accuracy of <1 Dalton, by subtracting mass of peptide from mass of complex

• Accurately determine masses of picomolar amounts of sample typical of biologically important HSGAG oligosaccharides
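The subtraction step described above is simple arithmetic. The sketch below uses made-up masses, not measured values, and the function names are hypothetical.

```python
def saccharide_mass(complex_mass: float, peptide_mass: float) -> float:
    """Mass of the saccharide = mass of the complex minus mass of the peptide."""
    if complex_mass <= peptide_mass:
        raise ValueError("complex must be heavier than the peptide alone")
    return complex_mass - peptide_mass


def within_tolerance(measured: float, expected: float, tolerance: float = 1.0) -> bool:
    """Check a measured mass against an expected one, using the <1 Dalton figure."""
    return abs(measured - expected) < tolerance


# Illustrative numbers only.
measured = saccharide_mass(complex_mass=5155.5, peptide_mass=4000.0)
print(measured)  # 1155.5
print(within_tolerance(measured, 1155.0))  # True
```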

Applications of PEN

Mass – composition relationship

The length and the number of sulfates and acetates of an HLGAG oligomer can be unambiguously assigned for oligomers up to a tetradecasaccharide
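One way to picture the mass to composition assignment is a brute-force search over length, sulfates and acetates. This is a sketch, not the method in the slides: the mass constants are approximate average values entered by hand for illustration, the model ignores end modifications such as unsaturated uronic acids, and the bounds (at most one acetate per disaccharide, at most four substituted positions per disaccharide) are simplifying assumptions.

```python
# Approximate average masses in Daltons (assumed values; verify before real use).
DISACCHARIDE = 337.28  # uronic acid + glucosamine residues, unsubstituted
SULFATE = 80.06        # SO3 added per sulfate group
ACETATE = 42.04        # C2H2O added per N-acetyl group
WATER = 18.02          # one water for the chain ends


def oligomer_mass(n_disaccharides: int, n_sulfates: int, n_acetates: int) -> float:
    """Mass of an oligomer from its composition."""
    return (n_disaccharides * DISACCHARIDE + n_sulfates * SULFATE
            + n_acetates * ACETATE + WATER)


def candidate_compositions(observed_mass: float, tolerance: float = 0.5,
                           max_disaccharides: int = 7):
    """All (length, sulfates, acetates) triples whose mass matches the observation.

    max_disaccharides=7 corresponds to a tetradecasaccharide.
    """
    hits = []
    for n in range(1, max_disaccharides + 1):
        for acetates in range(n + 1):
            for sulfates in range(4 * n - acetates + 1):
                mass = oligomer_mass(n, sulfates, acetates)
                if abs(mass - observed_mass) <= tolerance:
                    hits.append((n, sulfates, acetates))
    return hits


observed = oligomer_mass(3, 6, 1)  # stand-in for a measured hexasaccharide mass
print(candidate_compositions(observed))
```

If the returned list holds a single triple, the composition is assigned unambiguously for that mass and tolerance; a longer list would mean the mass alone is not enough.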

Applications of PEN: PEN-MALDI Sequencing Strategy

[Flow diagram labels: Formalism: hexadecimal binary notation based mass-line; MALDI-MS; All Possible Sequences; Experimental: composition, chemical, enzymatic (iterative); Unique Solution]
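The strategy of starting from all possible sequences and eliminating candidates iteratively can be sketched as follows. Everything here is an assumption for illustration: the 5-bit layout for building blocks, the function names, and the constraints, which stand in for results of chemical or enzymatic experiments and are not taken from the slides.

```python
from itertools import product

# Assumed 5-bit layout per building block (illustrative, not the published table):
# bit 4: iduronic (1) / glucuronic (0); bits 3-1: 2-O, 6-O, 3-O sulfate;
# bit 0: N-sulfate (1) / N-acetyl (0).
I_BIT, S2_BIT, S6_BIT, S3_BIT, NS_BIT = 16, 8, 4, 2, 1


def sulfate_count(block: int) -> int:
    return bin(block & 0b01111).count("1")


def is_acetylated(block: int) -> bool:
    return not block & NS_BIT


def master_list(length: int, n_sulfates: int, n_acetates: int):
    """All sequences of `length` blocks whose composition matches the measured one."""
    return [
        seq for seq in product(range(32), repeat=length)
        if sum(sulfate_count(b) for b in seq) == n_sulfates
        and sum(is_acetylated(b) for b in seq) == n_acetates
    ]


def eliminate(candidates, constraints):
    """Apply constraints one at a time, stopping once at most one candidate is left."""
    for constraint in constraints:
        candidates = [seq for seq in candidates if constraint(seq)]
        if len(candidates) <= 1:
            break
    return candidates


# Hypothetical tetrasaccharide: 2 blocks, 4 sulfates, 1 acetate.
candidates = master_list(2, 4, 1)
constraints = [
    lambda seq: bool(seq[0] & I_BIT) and bool(seq[0] & S2_BIT),       # first block: iduronic, 2-O sulfated
    lambda seq: not seq[1] & I_BIT,                                   # second block: glucuronic
    lambda seq: is_acetylated(seq[1]) and sulfate_count(seq[1]) == 1,  # second block: N-acetyl, one sulfate
    lambda seq: bool(seq[1] & S6_BIT),                                # that sulfate is 6-O
    lambda seq: bool(seq[0] & S6_BIT),                                # first block is 6-O sulfated
]
remaining = eliminate(candidates, constraints)
print(len(candidates), "->", len(remaining))
print([tuple(format(b, "02X") for b in seq) for seq in remaining])
```

Each constraint shrinks the list; the loop stops as soon as a single sequence survives, which mirrors the "unique solution" end point of the diagram.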

Sequencing HSGAGs: Example

New Research Models for Post Genomics Bioinformatics – the Glue Grants

• Alliance for Cellular Signaling (AfCS): To understand as completely as possible the relationships between sets of inputs and outputs in signaling cells that vary both temporally and spatially, i.e. how cells interpret signals in a context-dependent manner

• Cell Migration Consortium: To accelerate progress in cell migration-related research by fostering interdisciplinary research activities and producing novel reagents and information

• Consortium for Functional Glycomics: Define the paradigms by which carbohydrate binding proteins function in cellular communication

• Inflammation and Host Response Consortium: Designed to acquire new scientific knowledge about the biological basis for different outcomes in injured patients.

Consortium for Functional Glycomics

Organization of the Core Facilities

Data Organization: Databases

Data Storage: Database Design

• Overview
  – Classification of data
    • 6 key identifiers (name tags for data) – CBP ID, GT ID, Carb ID, Project ID, Microarray ID, Mouse Strain ID
    • Data Fields – provide structure to the type of data being entered. Selection of the appropriate data fields depends on what kinds of data will be entered
  – Linking data
    • Data fields pertaining to a specific attribute are stored in a table
    • Each table will be linked to other tables via common data fields or identifiers.
    • The data tables and their links form an “Ontology Diagram”
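The linking of tables through shared identifiers can be sketched with an in-memory SQLite database. The table names, column names and sample values below are hypothetical placeholders chosen for illustration; they are not the consortium's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One table per attribute group; identifiers act as the name tags.
    CREATE TABLE protein (
        cbp_id    TEXT PRIMARY KEY,
        genbank   TEXT,
        swissprot TEXT
    );
    CREATE TABLE carbohydrate (
        carb_id            TEXT PRIMARY KEY,
        structure_notation TEXT
    );
    -- Link table: which carbohydrate binding protein interacts with which carbohydrate.
    CREATE TABLE interaction (
        cbp_id  TEXT REFERENCES protein(cbp_id),
        carb_id TEXT REFERENCES carbohydrate(carb_id),
        PRIMARY KEY (cbp_id, carb_id)
    );
""")

# Placeholder rows.
conn.execute("INSERT INTO protein VALUES (?, ?, ?)", ("CBP001", "GB0001", "SP00001"))
conn.execute("INSERT INTO carbohydrate VALUES (?, ?)", ("Carb0001", "placeholder notation"))
conn.execute("INSERT INTO interaction VALUES (?, ?)", ("CBP001", "Carb0001"))


def carbohydrates_for(cbp_id: str) -> list[str]:
    """Follow the link from a protein identifier to the carbohydrates it binds."""
    rows = conn.execute(
        """
        SELECT c.carb_id
        FROM interaction AS i
        JOIN carbohydrate AS c ON c.carb_id = i.carb_id
        WHERE i.cbp_id = ?
        """,
        (cbp_id,),
    ).fetchall()
    return [row[0] for row in rows]


print(carbohydrates_for("CBP001"))  # ['Carb0001']
```

Keeping the interaction in its own table lets one protein link to many carbohydrates and the reverse, without repeating the attribute fields of either.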

Data Storage: Database Ontology

[Ontology diagram labels: CBP; Cell line; cDNA sequence; Yield; Expression; Mouse Strain; Mouse Studies; Immunology; Histology; CBP-Carbohydrate Interaction; Core C; Core D; Core F; Core G; Core H]

Data Storage: Relational Databases

Protein
  CBP ID: CBP001
  GenBank: GB0001
  SwissProt: SP00001
  PDB: 1XXX

Author
  Name: XYZ, …
  Email: 1@2.3
  Institution: ABC

(relationship: characterized)

Protein Expression
  Cell Line: BL21
  Gel Image: Img.jpg
  cDNA clone: GB0002

Carbohydrate
  Carb ID: CBP001
  Structure notation
  Carb DB: Carb0001
  PDB: 1XYZ

(relationship: was expressed using)

(relationship: interacts with)

Characterization
  Carb ID: Carb0001
  Mass Spec: MS-1.jpg
  NMR: NMR.jpg
  Structure notation
  . . .

(relationship: characterized using)

Author
  Name: MNO, …
  Email: 2@3.4
  Institution: IJK

(relationship: characterized)

Sample Ontology from AfCS

Conclusions

• In the post-genomics era, high-throughput experimental methods are generating large data sets pertaining to multiple sequence, structure and functional attributes of genes and proteins – transition from traditional biology to information driven “Systems Biology”

• With constantly increasing computational power, there has been a big leap in development of bioinformatics tools to deal with large data sets

• Increasing awareness of the role of carbohydrates in fundamental biological processes modulating cell-cell and cell-matrix interactions

• Development of bioinformatics applications for carbohydrates has many challenges due to their complexity and heterogeneity

• Addressing these challenges would enable the development of bioinformatics for glycobiology to provide a more comprehensive description of the “state” of a biological system and to better predict the “response” of a biological system to a given set of “inputs”
