hop Indonesia

8/8/2019 hop Indonesia


Dr. Virendrakumar (Virendra) C. Bhavsar
Professor
Dean 2003-2008

Director, Advanced Computational Research Lab. 2000-10
Faculty of Computer Science

University of New Brunswick (UNB)
Fredericton, Canada

Visiting Professor
Center for Development of Advanced Computing (C-DAC)

Pune, India

Bioinformatics An Overview


Outline

Introduction: UNB, C-DAC, Bioinformatics

Genome: Genes, Proteomes, Evolution

Databases and Information Retrieval

Sequence Alignment and Phylogenetic trees

Protein Structure and Drug Discovery

Proteomics and Systems Biology

Infrastructure: UNB and C-DAC

Research Work at the University of New Brunswick and C-DAC

Future



University of New Brunswick (UNB)


Faculty of Computer Science

The First Faculty of CS in Canada

University of New Brunswick

Fredericton, New Brunswick, Canada

Oldest English Language University in Canada

Established in 1785



Fredericton and UNB



Center for Development of Advanced Computing (C-DAC)

India



1987

The Government of India decides to launch a national initiative for development of indigenous supercomputers

Government of USA refuses sale of supercomputer to India

India requires supercomputer for weather forecasting

History

C-DAC: HPC: Evolution and Road Map

[Figure: timeline of C-DAC HPC systems and performance targets]

Milestones: 1991, 1994, 1998, 2002-03, 2007 (10 TF), 2010 (100 TF), 2012-13 (1 PF)

Systems: PARAM 8000, PARAM 9000, PARAM 10000, PARAM Padma

Garuda Grid Computing: PoC (100 Mbps, 17 locations), Main Phase

Stages: Technology Denial; Viable HPC business computing environment; Platform for user community to interact/collaborate; Social computing with participatory approach



C-DAC Centres

Headquarters: Pune

Centres:
- Pune
- Knowledge Park, Bangalore
- Electronics City, Bangalore
- Chennai
- Delhi
- Hyderabad
- Kolkata
- Mohali
- Mumbai
- Noida
- Thiruvananthapuram

[Map legend: C-DAC HQ, Centres]

Total Manpower is 2100 across all the centres of C-DAC

C-DAC's Thrust Areas

High Performance Computing & Grid Computing: Hardware, Software, Systems, Applications, Research, Technology, Infrastructure

Multilingual Computing: Tools, Fonts, Products, Solutions, Research, Technology Development

Software Technologies: OSS, Multimedia, ICT for masses, E-Governance, Geomatics

Professional Electronics: Digital Broadband, Wireless Systems, Network Technologies, Power Electronics, Real-Time Systems, Embedded Systems, VLSI/ASIC Design, Agri Electronics

Cyber Security & Cyber Forensics: Cyber Security tools, technologies & solution development, Research & Training

Health Informatics: Hospital Information System, Telemedicine, Decision Support System

Ubiquitous Computing: RFID, Design, Development and Integration of Ubicomp System Components

Education & Training: e-Learning Technologies & Services



Compute Nodes

No. of Processors : 248 (Power 4 @ 1 GHz)

Aggregate Peak Computing : 1005 GFs (~ 1 TF)

File Servers

No. of Processors : 24 (UltraSparc-III@900MHz)

Aggregate Memory : 96 GigaBytes

Internal Storage : 0.4 TeraBytes

File System : QFS

Operating System : Solaris 8

Networks

Primary : PARAMNet-II @ 2.5 Gbps Full Duplex

Backup : Gigabit Ethernet @ 1 Gbps Full Duplex

Management : 10/100 Mbps Fast Ethernet

External Storage

Storage Array : 5 TeraBytes with 16 T3 disk arrays

Tape Library : 12 TeraBytes - L700 (5 LTO drives)

Software

HPCC - C-DAC's High Performance Computing and Communication software suite

Compilers, Parallel Libraries and Tools

Ranked 171 in 2nd quarter end and 258 as per the latest ranking

C-DAC

Advanced Computing Training School (ACTS)



ACTS @ a glance

An outfit initiated by C-DAC R&D in 1993

Begun with a modest 20 students and grown to over 5000 students

Trained more than a quarter million students

Grown from one city, one centre to 30 cities and 50 centres within India

Over 150 crores of investment and 600 plus dedicated manpower

Spread from India to international

From one course to more than 10 courses

International Presence

Tajikistan

Uzbekistan

Mauritius

Ghana

Seychelles

Myanmar

Russia

Tanzania

Turkmenistan

Lesotho

Belarus

Saudi Arabia

Azerbaijan

Armenia



Post Graduate Courses

DAC : Diploma in Advanced Computing

DACA : Diploma in Advanced Computer Arts

DVLSI : Diploma in VLSI Design

WiMC : Diploma in Wireless & Mobile Computing

DSSD : Diploma in System Software Development

DGi : Diploma in Geo informatics

DISCS : Diploma in Information System & Cyber Security

DHI : Diploma in Healthcare Informatics

DLC : Diploma in Language Computing

DIVESD : Diploma in Integrated VLSI & Embedded System Design

DESD : Diploma in Embedded Systems Design

DPC : Diploma in Parallel Computing

Post Graduate Diploma Programs

M.Tech. Programs

Computer Science & Engineering

Software Engineering

Information Technology

VLSI

Artificial Intelligence

Grid Computing & Storage Management

Embedded Systems Design

Wireless & Network Technology

Process Control & Instrumentation



Training Programmes UNDER Tech sangam


Bioinformatics



Bioinformatics

The creation and development of advanced information and computational techniques for solving problems in biology

High Performance Computing (HPC)

Hardware and software for high speed computations and large storage

Definitions


Bio Introduction



Molecular Biology

Living organisms (on Earth)

Lipids - Separate inside from outside

Proteins - Build 3D machinery to perform biological functions

DNA - Store information on how to build machinery

Diagram of a cell

Lipid membranes - provide barrier

Protein structures - do work

DNA nucleus - store info


Molecular Biology

Deoxyribonucleic Acid (DNA)

Composition

- Sequence of nucleotides

- Nucleotide = deoxyribose sugar + phosphate group + base



Molecular Biology - DNA

DNA: contains genetic instructions used in the development and functioning of all known living organisms with the exception of some viruses.

DNA molecules: long-term storage of information.

DNA: a set of blueprints, like a recipe or a code, since it contains the instructions needed to construct other components of cells, such as proteins and RNA molecules.

Genes: The DNA segments that contain instructions to construct the above components of cells

Other DNA sequences: structural purposes, or are involved in regulating the use of this genetic information.

Chemically, DNA consists of:

- Two long polymers of simple units called nucleotides, with backbones made of sugars and phosphate groups joined by ester bonds.

- These two strands run in opposite directions to each other and are therefore anti-parallel.

- Attached to each sugar is one of four types of molecules called bases. It is the sequence of these four bases along the backbone that encodes information. This information is read using the genetic code, which specifies the sequence of the amino acids within proteins.

- The code is read by copying stretches of DNA into the related nucleic acid RNA, in a process called transcription.




- DNA is organized into long structures called chromosomes.

- Chromosomes are duplicated before cells divide, in a process called DNA replication.

- Eukaryotic organisms (animals, plants, fungi, and protists) store most of their DNA inside the cell nucleus and some of their DNA in organelles, such as mitochondria or chloroplasts.

- Prokaryotes (bacteria and archaea) store their DNA only in the cytoplasm.


Molecular Biology

RNA: Ribonucleic acid (RNA)

- a long chain of nucleotide units

- Each nucleotide consists of a nitrogenous base, a ribose sugar, and a phosphate

RNA is very similar to DNA

RNA is usually single-stranded

DNA is usually double-stranded

RNA nucleotides contain ribose while DNA contains deoxyribose (a type of ribose that lacks one oxygen atom)

RNA has the base uracil rather than the thymine that is present in DNA



Molecular Biology

DNA: DNA -> DNA (Replication)

RNA: DNA -> RNA (Transcription / Gene Expression)

Protein: RNA -> Protein (Translation)

DNA, RNA, Proteins

Proteins and nucleic acids (DNA, RNA) are essential components for living organisms

DNA -> Transcription -> RNA -> Translation -> Proteins
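
The transcription step can be sketched in a few lines of Python. This is only an illustration, assuming the DNA is supplied as the coding strand read 5' to 3'; the function names `transcribe` and `reverse_complement` are hypothetical and do not come from any tool named in these slides.

```python
# Complementary base pairs in double-stranded DNA.
DNA_PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def transcribe(coding_strand: str) -> str:
    """Return the mRNA matching a DNA coding strand (thymine becomes uracil)."""
    return coding_strand.upper().replace("T", "U")


def reverse_complement(dna: str) -> str:
    """Return the opposite (anti-parallel) strand, read 5' to 3'."""
    return "".join(DNA_PAIRS[base] for base in reversed(dna.upper()))


if __name__ == "__main__":
    dna = "ATGGCGATGAGT"
    print("DNA :", dna)
    print("mRNA:", transcribe(dna))
    print("Opposite strand:", reverse_complement(dna))
```

Translation, the second arrow in the chain, additionally needs the genetic code table.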

[Figure: chromosome, DNA, and genes (Gene 1, Gene 2, ...)]



Raw Biological data: Nucleic Acids (DNA)

Raw Biological data: Amino acid residues (proteins)



Standard Genetic Code

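Second base:  T            C            A            G

First base T
              TTT Phe (F)  TCT Ser (S)  TAT Tyr (Y)  TGT Cys (C)
              TTC Phe (F)  TCC Ser (S)  TAC Tyr (Y)  TGC Cys (C)
              TTA Leu (L)  TCA Ser (S)  TAA Ter      TGA Ter
              TTG Leu (L)  TCG Ser (S)  TAG Ter      TGG Trp (W)

First base C
              CTT Leu (L)  CCT Pro (P)  CAT His (H)  CGT Arg (R)
              CTC Leu (L)  CCC Pro (P)  CAC His (H)  CGC Arg (R)
              CTA Leu (L)  CCA Pro (P)  CAA Gln (Q)  CGA Arg (R)
              CTG Leu (L)  CCG Pro (P)  CAG Gln (Q)  CGG Arg (R)

First base A
              ATT Ile (I)  ACT Thr (T)  AAT Asn (N)  AGT Ser (S)
              ATC Ile (I)  ACC Thr (T)  AAC Asn (N)  AGC Ser (S)
              ATA Ile (I)  ACA Thr (T)  AAA Lys (K)  AGA Arg (R)
              ATG Met (M)  ACG Thr (T)  AAG Lys (K)  AGG Arg (R)

First base G
              GTT Val (V)  GCT Ala (A)  GAT Asp (D)  GGT Gly (G)
              GTC Val (V)  GCC Ala (A)  GAC Asp (D)  GGC Gly (G)
              GTA Val (V)  GCA Ala (A)  GAA Glu (E)  GGA Gly (G)
              GTG Val (V)  GCG Ala (A)  GAG Glu (E)  GGG Gly (G)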

Triplets of DNA bases, called codons, each code for an amino acid
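
A minimal Python sketch of codon-by-codon translation with the standard genetic code. The table is built from the conventional T, C, A, G ordering; the function name `translate` and the choice to stop at the first Ter codon are assumptions made for illustration.

```python
from itertools import product

BASES = "TCAG"
# One-letter amino acids for the 64 codons in TCAG order (first, second, third base);
# "*" marks the three Ter (stop) codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna: str) -> str:
    """Translate a DNA coding sequence, reading frame 0, until the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")  # "X" for codons with unknown bases
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("ATGGCGATGAGTCGTGTAGCTTCT"))
```

Reading the same DNA from offset 1 or 2, or from the opposite strand, gives the other reading frames.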

A Protein Structure



Protein 3D structure

The structure of the protein sequence determines the functionality

http://anatomy.med.unsw.edu.au/cbl/research/cytoskeleton/swissprotactin.htm


Informatics



FASTA formatted Sequences

FASTA: "FAST-All" alignment; it works with any alphabet
- FAST-P for protein alignment
- FAST-N for nucleotide alignment

Sample FASTA formatted Sequences

FASTA: "FAST-All" alignment; it works with any alphabet, an extension of "FAST-P" (protein) and "FAST-N" (nucleotide) alignment.

EST sequence (A, C, G, T)

>gi|39796586|gb|CK247430.1|CK247430 EST731067 potato callus cDNA library, mRNA sequence

ACAAGTCACTATAGGGACATGCTTCAATTTTTTCAAAACATCTTGAATAGTACAAAGTGCACAACATACT

CCAAAAAACTGAATACATTTTCTATTGTCAATATCTATAGCCATATGACTTTCAGTGCGACCTATGCATT

CATAACTCCCGCTACCAAATCCACCATGTAGTGCTTACAACAACAAGCCTAGTGAGAACGTAAGCCTGGT

CTGGAGCCAAAAGCAAATTATGTATACTAAAAAACCCCCTGGCTAAAATGCATATCATGATTAGTAGTGA

CATT

Protein Sequence (20 different amino acids)

>gi|532319|pir|TVFV2E|TVFV2E envelope protein

ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT

QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC

HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCK

MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKK

TYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLVEITPIGF

APTEVRRYTGGHERQKRVPF
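
Records in this format are simple to read programmatically. The sketch below is a minimal, dependency-free reader, assuming each record starts with a ">" header line followed by one or more sequence lines; `parse_fasta` is a hypothetical helper name, and production code would more likely use an established library.

```python
def parse_fasta(text: str) -> list[tuple[str, str]]:
    """Return (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines carry no information
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are concatenated
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


if __name__ == "__main__":
    sample = (
        ">gi|39796586|gb|CK247430.1|CK247430 EST731067 potato callus cDNA library, mRNA sequence\n"
        "ACAAGTCACTATAGGGACATGCTTCAATTTTTTCAAAACATCTTGAATAGTACAAAGTGCACAACATACT\n"
        "CATT\n"
    )
    for header, sequence in parse_fasta(sample):
        fields = header.split("|")  # the gi|...|gb|... identifier fields
        print(fields[3], len(sequence), "bases")
```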



Biological Databases

Genome databases: flat files or relational databases

GenBank, EMBL, DDBJ, PDB, SWISSPROT, PIR

Classification of Biological databases:

- primary databases (GenBank, EMBL, DDBJ)

- secondary databases (SWISSPROT, PDB, PIR)

Biological databases

Like any other database

Data organization for optimal analysis

Data is of different types

Raw data (DNA, RNA, protein sequences)

Curated data (DNA, RNA and protein annotated sequences and structures, expression data)



Biological databases - Examples

Nucleotide Databases: Alternative Splicing, EMBL-Bank, Ensembl, Genomes Server, Genome, MOT, EMBL-Align, Simple Queries, dbSTS Queries, Parasites, Mutations, IMGT

Genome Databases: Human, Mouse, Yeast, C.elegans, FLYBASE, Parasites

Protein Databases: Swiss-Prot, TrEMBL, InterPro, CluSTr, IPI, GOA, GO, Proteome Analysis, HPI, IntEnz, TrEMBLnew, SP_ML, NEWT, PANDIT

Structure Databases: PDB, MSD, FSSP, DALI

Microarray Database: ArrayExpress

Literature Databases: MEDLINE, Software Biocatalog, Flybase Archives

Alignment Databases: BAliBASE, Homstrad, FSSP



3D Macromolecular structural data

Data originates from NMR or X-ray crystallography techniques

If the 3D structure of a protein is solved ... they have it

PDB: Protein Data Bank

What to take home

Databases are a collection of data

Need to access and maintain easily and flexibly

Biological information is vast and sometimes very redundant

Distributed databases bring it all together with quality controls, cross-referencing and standardization

Computers can only create data, they do not give answers



Bioinformatics


Gene sequences determine biological function

Genomic DNA -> Amino acids -> Proteins -> Function

Similar composition -> similar function?

- DNA sequences
- Amino acid sequences
- Protein 3-D structure

Predicting protein function

- Designer drugs
- Personalized treatments

Premise of Bioinformatics



Bioinformatics

Determining protein function

Hard way

- Biological / chemical analyses

- Determine 3D structure w/ x-ray crystallography, NMR

Easy way?

- Sequence protein / DNA, then find close match in database

- Guess function based on match

- Validate guess in lab

Bioinformatics is imprecise

- Similar to data-mining

- Only suggests possible relationships

- Must validate: correlation is not causation


Growth of Bioinformatics

1970s

- DNA sequencing

- Alignment w/ dynamic programming (Needleman-Wunsch, 1970; Smith-Waterman followed in 1981)

1980s

- Sequence databases (EMBL, GenBank)

- Alignment w/ FASTA (linked lists, hashing)

1990s

- Automatic DNA sequencing

- Alignment w/ BLAST (neighborhood words, probabilities)

- Internet & WWW

Now

- Genomics, Proteomics
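
To make the dynamic-programming idea behind Smith-Waterman local alignment concrete, here is a compact Python sketch. The scoring values (match +2, mismatch -1, linear gap -1) are assumptions chosen for illustration; real searches use substitution matrices and affine gap penalties, and heuristic tools such as FASTA and BLAST avoid filling the full matrix.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -1):
    """Return (best score, aligned a, aligned b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # H[i][j]: best score ending at a[i-1], b[j-1]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; scores never drop below zero, which makes the alignment local.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    score, top, bottom = smith_waterman("GGACGTGG", "TTACGTTT")
    print(score)
    print(top)
    print(bottom)
```

The matrix has len(a) x len(b) cells, which is why exhaustive alignment against a whole database is expensive.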



Bioinformatics Topics

Sequence alignments

- Find similarity between DNA / protein (amino acid) sequences

Genome assembly

- Combining genomic fragments to form whole genome

Gene identification & annotation

- Identify and classify genes on the genome

Microarrays & gene expression analysis

- Use DNA microarray (gene chip) to measure mRNA

Protein folding

- Compute 3-D protein structure from protein sequence

Phylogenetic analysis

- Find genetic relationships between sequences / species

What Does Genomics Mean?

Genomics: a science that studies the genetic material of a species at the molecular level

A scientific approach to identify and define the function of genes, as well as uncover when and how genes work together to produce traits

Structural Genomics approaches (mapping) - focus on traits controlled by one or a few genes, and often only provide information regarding the location of a gene or genes

Examine the interrelationships and interactions between thousands of genes

How do we do this?



Genome Organization

[Figure: leaf and tuber tissue, chromosome, DNA]

Genome Organization

Proteins are building blocks for living organisms

Proteins are derived from DNA:
- Transcription: the gene (RNA) that codes proteins is formed from DNA
- Translation: RNA triplets (codons) code into amino acids

A gene can also be identified through its complementary DNA (cDNA); the active or expressed gene sequences are termed Expressed Sequence Tags (ESTs)

[Figure: chromosome, DNA, Gene 1, Gene 2, ...]



Promoter (switch)

Coding ORF (message)

....TATACAGCAAAATAGAAAGATCTAGTGTCCCATGGCGATGAGTCGTGTAGCTTCT.

DNA

Gene 1 Gene 2 Etc.

Genome Organization

cDNA Collections (Libraries)

Various tissues are collected from the plant, and messages are extracted

Leaf

Messages

Tuber

Messages




The messages are copied to form double-stranded DNA copies (cDNA) of each message

Leaf cDNA Tuber cDNA

Each copy is glued into a piece of bacterial DNA for easier storage, handling and propagation, resulting in a collection or library of cDNAs for each tissue


The cDNAs are then read or sequenced, to give the order of As, Cs, Gs or Ts for each

We are left with the sequence of each gene that is active (expressed) in each cell, tissue or organ studied

These are Expressed Sequence Tags or ESTs

Using complex computer resources, these ESTs can be analyzed and compared with known sequences and proteins

Look for messages associated with specific organs or characteristics/traits



Take Home Points

Messages from various genes are important, as they dictate which proteins are produced

Promoters are also important, as they dictate where a specific message and protein is produced

Genomics involves the study of all of the messages produced by the various plant cells

A lot of information needs to be organized and analyzed

Database

Contains all the ESTs sequences

Contains useful annotations

Blast Searches

Contig Assemblies

Transmembrane Spanning Regions

Gel Pictures

EST Information



Data Analysis

Tens of thousands of ESTs available for study

Most methods to study message distributions are low throughput AND time consuming

Genomics necessitates the large scale study of gene expression

How can we do this?

Microarray Analysis


Microarray Analysis - Processing

[Figure: Intensity Dependence Comparison. Log(R/G) plotted against 0.5*(Log(G)+Log(R)) for Slide3 and Slide70, each with a polynomial fit; R2 = 0.2014 and R2 = 0.6185]

Processing steps: Image Processing, Data Normalization, Differential Gene Expression, Cluster Analysis, Pathway Analysis




Signal

Background


Irregular size or shape

Irregular placement

Low intensity

Saturation

Spot variance

Background variance

[Figure: example spots: indistinguishable, saturated, bad print, artifact, misalignment]




Calculate numeric characteristics of each spot

Throw out spots that do not meet minimum requirements for each characteristic

Throw out spots that do not have minimum overall combined quality


Microarray Analysis - Data Normalization

Normalize data to correct for variances

Dye bias

Location bias

Intensity bias

Pin bias

Slide bias

Control vs. non-control spots
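
As a rough illustration of the simplest correction for dye bias, the sketch below computes the two quantities used in the intensity-dependence plot, M = log2(R/G) and A = 0.5*(log2(R) + log2(G)), and then centres M on its median. Global median centring is an assumption chosen for brevity; the slides do not say which method was used, and intensity-dependent (curve-fitting) normalization is a common alternative. All function names are hypothetical.

```python
import math
from statistics import median


def ma_values(red: list[float], green: list[float]) -> tuple[list[float], list[float]]:
    """Return per-spot log ratios (M) and average log intensities (A)."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * (math.log2(r) + math.log2(g)) for r, g in zip(red, green)]
    return m, a


def median_center(m: list[float]) -> list[float]:
    """Shift log ratios so that the median spot shows no differential expression."""
    shift = median(m)
    return [value - shift for value in m]


if __name__ == "__main__":
    # Made-up background-corrected intensities for four spots (red and green channels).
    red = [200.0, 400.0, 800.0, 3200.0]
    green = [100.0, 200.0, 400.0, 400.0]
    m, a = ma_values(red, green)
    print("M before:", m)
    print("M after :", median_center(m))
```

In this made-up example the red channel reads twice the green on most spots, so centring removes that shared offset and leaves only the last spot with a positive log ratio.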



Cluster genes based on expression profiles

Gene expression across several treatments

Hypothesis: Genes with similar function have similar expression profiles

Microarray Analysis - Clustering

Expression Profile Clustering
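
One simple way to group genes by expression profile is to compare profiles with the Pearson correlation and put a gene into the first cluster whose representative it correlates with strongly. This greedy scheme, the 0.9 threshold, and the gene names are all assumptions for illustration; hierarchical clustering or k-means would be more typical in practice.

```python
import math


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation of two equal-length profiles (neither may be constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def cluster_profiles(profiles: dict[str, list[float]], threshold: float = 0.9) -> list[list[str]]:
    """Greedily assign each gene to the first cluster whose first member it resembles."""
    clusters: list[list[str]] = []
    for gene, profile in profiles.items():
        for cluster in clusters:
            representative = profiles[cluster[0]]
            if pearson(profile, representative) >= threshold:
                cluster.append(gene)
                break
        else:
            clusters.append([gene])  # no similar cluster found, start a new one
    return clusters


if __name__ == "__main__":
    # Hypothetical expression values across four treatments.
    profiles = {
        "geneA": [1.0, 2.0, 3.0, 4.0],
        "geneB": [2.0, 4.0, 6.0, 8.0],
        "geneC": [4.0, 3.0, 2.0, 1.0],
    }
    print(cluster_profiles(profiles))
```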



[Figure: Project, Database, Engine]

Microarray Analysis - Data Management

Information Processing and Handling

Assembly and annotation of genomic data

EST analysis and databases

Cluster analysis of microarray data

Comparisons of various transcriptomic methods

Integration of sequence, transcriptomic, proteomic, metabolomic, transgenic data



Research Problems in Bioinformatics

Find genomes of all organisms

Identify and annotate all genes

Compute sequence -> 3D structure for all proteins

Compare DNA / protein sequences for similarity

Compare families of DNA / protein sequences

Reason to be optimistic: Biology is finite

~30,000 human genes; ~1000 protein superfamilies

but computer speeds keep increasing

Fighting Bird Flu



Virus in 3-D

Bioinformatics Infrastructure: High Performance Computing



1974 - 1 MHz clock
1988 - 40 MHz
2002 - 2 GHz
2009 - P4 3.0 GHz, Quad-core 2.66 GHz

Intel Montecito chip: 1.72 Billion transistors

NVidia 280 series GPU: 1.4 Billion transistors

- Circuit complexity doubles every 18 months
- Computing power at a given cost doubles every 18 months
- Processor clock rates: 40% increase/year + more instr./cycle
- DRAM access times: 10% improvement/year -> caches required

Advances in Microprocessor Technology


Jaguar

Oak Ridge National Lab., USA

- 1.72 Petaflop/s on the Linpack benchmark (1 Petaflop/s = one quadrillion, i.e. a million billion (10**15), floating-point operations/sec)

- 2.332 Petaflops peak (i.e. 2332 Teraflops)

- Power 1750 Watt/sq ft; ~50 million KWh per year

- Space 4352 square feet, larger than an NBA basketball court

Current Supercomputer Nov 2009



Future

IBM Cyclops64: supercomputer on a chip

C-DAC initiative for 2010: petaflop machine

NCSA, USA, 2011: petaflop machine

NASA, SGI and Intel Pleiades: 10 petaflop by 2012

1 Exaflop (10**18 flops) by 2019

Human brain neural simulations: 10 exaflop by 2025

2-week full weather modeling: 1 zettaflops (10**21 flops) by 2030

High Performance Computing and Networking@




Advanced Computational Research Lab (ACRL) Infrastructure

People, Research, Excellence

ACEnet: Atlantic Computational Excellence Network

Hosting sites:

Member sites:



ACEnet

Atlantic Canada is a distributed environment

$30 million initiative

Waterways make networking solutions difficult (e.g. Cabot Strait)

ACEnet

World-class HPC facilities

Behave as a single, regionally distributed computational power grid

Create and operate sophisticated collaboration facilities to bind together geographically dispersed research communities.



ACEnet at UNB

Fundy: SUN cluster, AMD Opteron, 632 cores

ACEnet: 3324 cores

Internet connectivity > 2Gbps at UNB



Collaboration Grid

Collaboration gear across Atlantic Canada

- Lecture rooms equipped so ACEnet sites can share seminars and participate remotely

- ACEnet cafés at each site sharing continuous video feeds

- Desktop level collaboration equipment for personal communication

Access Grid streams tens to hundreds of Mbps across the CANARIE network

ACEnet

Bioinformatics Research@




The Canadian Potato Genome Project

Collaborators

Dr. Patricia Evans (UNB), Dr. Barry Flinn (BioAtlantech), Dr. David Dekoyer (Potato Research Center), Carleton University, Nova Scotia Agricultural College

Students: Aijazuddin Syed (MCS Student), En Zhang (MCS Student),

Zheng Wang (MCS Student), Marc Cooper (MCS Student),

Rachita Sharma (PhD Student)

Potato

Integral part of diet: French fries, mashed potatoes

Provides 12 essential vitamins

Fourth most important crop worldwide

Potato has not been explored in terms of functional and bio-chemical traits

Much is still unknown in the potato genome regarding the control of potato development and processing/quality traits (disease resistance, stress tolerance, carbohydrate metabolism, tuber shape)



Economic Importance Of The Potato

Integral part of the diet of a large proportion of the world's population

Supplies at least 12 essential vitamins and minerals

Still much unknown regarding the control of potato development and processing/quality traits (i.e. disease resistance, stress tolerance, carbohydrate metabolism, tuber shape)

The Canadian Potato Genome Project (CPGP)

46% of national potato production $1 Billion/year

Home of McCain Foods Ltd. $5.5 billion/year

Potato Research Center (PRC) of AAFC

Solanum Genomics International Inc./BioAtlantech

Carleton University


Nova Scotia Agricultural College (NSAC)



CPGP Goals

CPGP targets genes associated with tuber health and tuber quality:
- Tuber Health: Late Blight and Common Scab
- Tuber Quality: Stable dry matter accumulation, cold sweetening and after-cooking darkening

[Figure: leaf and tuber tissue; DNA with Gene 1, Gene 2, ...]

Project Description

Identification Of A Differential Gene Expression Pattern And Genes Related To Resistance In Potato Late Blight

One of the most devastating diseases of potato worldwide

If left unmanaged, complete destruction of crops can occur

Attacks leaves and tubers; large necrotic lesions on leaves and dry rot that spreads through tubers; secondary bacteria and fungi often infect through late blight lesions



Late Blight Project

Collaborative effort with AAFC Potato Research Centre

Population of blight-sensitive and blight-resistant plants of near isogenicity

cDNA libraries made from leaves of a blight-sensitive and a blight-resistant plant

2500 messages were sequenced from each library (5000 total ESTs)

Different ESTs to be profiled for expression

The tremendous amounts of data generated will need to be managed efficiently

Database - Sequence Info



Late Blight Project

cDNA Microarray Using SGII Clones, hybridized with Cy3 (resistant) + Cy5 (susceptible) probes (reciprocal labelling experiments)

ANDLBRLF02345HTF.01 - Class II chitinase

ANDLBRLF01256HTF.01 - Pathogenesis-related protein P23 precursor

ANDLBRLF02041HTF.01 - Unknown protein

What Use Is All Of This Information?

Transgenics:
- Enhance tuber quality, processing traits, disease resistance, stress tolerance more rapidly than breeding

Expression Assisted Selection:
- Obtain expression profiles for thousands of genes associated with specific traits or characteristics
- Use these profiles as a baseline to compare with the expression profiles of unknown clones; crosses

New Protein Products:
- Identify genes encoding secreted proteins/ligands
- Test these for growth-promoting/other effects
- Express genes in batch cultures and purify proteins



GFP expression in tobacco cells

GA-20 oxidase in potato:

GA-20 oxidase knockouts with enhanced tuber production

GA-20 oxidase knockouts with reduced tuber sprouting

Example Of Gene Use


The Canadian Potato Genome Project

Project stages:
- Sequence the gene and build cDNA libraries [Solanum Genomics Intl. Inc (SGII)]
- EST sequence generation [National Research Council at Halifax and SGII]
- Bioinformatics: base-calling, clustering, BLAST, annotations, and gene expression [UNB and PRC]
- Microarray profiling [SGII, PRC, UNB, Ontario Cancer Institute, and NSAC]

Data passed between stages: leaf and tuber cDNA; FASTA formatted EST sequence & trace files

Sample FASTA formatted Sequences

EST sequence

>gi|39796586|gb|CK247430.1|CK247430 EST731067 potato callus cDNA library, mRNA sequence

ACAAGTCACTATAGGGACATGCTTCAATTTTTTCAAAACATCTTGAATAGTACAAAGTGCACAACATACT

CCAAAAAACTGAATACATTTTCTATTGTCAATATCTATAGCCATATGACTTTCAGTGCGACCTATGCATT

CATAACTCCCGCTACCAAATCCACCATGTAGTGCTTACAACAACAAGCCTAGTGAGAACGTAAGCCTGGT

CTGGAGCCAAAAGCAAATTATGTATACTAAAAAACCCCCTGGCTAAAATGCATATCATGATTAGTAGTGA

CATT

Protein Sequence

>gi|532319|pir|TVFV2E|TVFV2E envelope protein

ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT

QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC

HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCK

MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKK

TYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLVEITPIGF

APTEVRRYTGGHERQKRVPFVXXXXXXXXXXXXXXXXXXXXXXVQSQHLLAGILQQQKNL

LAAVEAQQQMLKLTIWGVK



Data Analysis - Bioinformatics

Tens of thousands of ESTs available for study

Most methods to study message distributions are low throughput AND time consuming

Genomics necessitates the large scale study of gene expression

Automation required for routine processes

Data acquisition for potato genome annotation

Automated protein classification with rule maintenance

Use agents to integrate the software and primary databases in a flexible and robust way

Overview of Bioinformatics Research at UNB

[Figure: components and data of the UNB bioinformatics work]

Components: TraceScan; Automated Data Acquisition Pipeline; Automated Protein Classification and Rule Maintenance; Multi-Agent System for Potato Genome Annotation

Data: EST sequences; homologs, motifs, fingerprints, transmembrane and signal sites



TraceScan - Keywords

Chromatogram - visual representation of the digital output produced by an automated sequencing machine. A chromatogram is drawn as a set of four overlapping waveforms, one for each nucleotide base

Base-calling - determining the set of nucleotide bases for a DNA sequence strand from the analysis of the digital output produced by a sequencing machine

Heterozygosity - exists in the chromatogram where a second strong peak appears beneath a primary peak. This may indicate the presence of a secondary nucleotide base at that location in the sequence

BLAST - Basic Local Alignment Search Tool
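
The idea of base-calling with a check for a strong secondary peak can be sketched as follows. This is a toy model, not the PHRED algorithm or TraceScan's modification of it: it assumes the trace has already been reduced to one intensity per base channel at each called position, and the 0.5 ratio threshold and the name `call_bases` are arbitrary illustrative choices.

```python
def call_bases(peaks: list[dict[str, float]], secondary_ratio: float = 0.5):
    """Return (position, primary base, secondary base or None) for each peak.

    A secondary base is reported when the second-highest channel reaches at least
    `secondary_ratio` of the highest channel, hinting at possible heterozygosity.
    """
    calls = []
    for position, intensities in enumerate(peaks):
        ranked = sorted(intensities.items(), key=lambda item: item[1], reverse=True)
        (primary, top), (secondary, second) = ranked[0], ranked[1]
        has_secondary = top > 0 and second / top >= secondary_ratio
        calls.append((position, primary, secondary if has_secondary else None))
    return calls


if __name__ == "__main__":
    # Made-up channel intensities at two positions of a trace.
    peaks = [
        {"A": 900.0, "C": 30.0, "G": 20.0, "T": 10.0},   # clean A peak
        {"A": 50.0, "C": 800.0, "G": 20.0, "T": 600.0},  # C with a strong T underneath
    ]
    for call in call_bases(peaks):
        print(call)
```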

Example of a Chromatogram



The TraceScan Software System

Designed to investigate sequence quality, potential polymorphisms, and base heterozygosity in EST sequences.

Relies on the combined analysis of a DNA sequence trace file, the trace chromatogram, and multiple alignment of sequence homologs.

Allows base-calls to be substituted where superimposed peaks have been detected in the trace.

Base-calls deemed in error can be corrected to improve sequence quality and data reliability.

TraceScan

Visualizes DNA sequence chromatograms

Detects overlapping trace peaks using modifications to the PHRED base-caller

Peaks are highlighted on the user interface.

Modifications to PHRED enable base-calls with overlapping peaks to be substituted.

Base substitutions produce a new set of base quality scores for the sequence.



TraceScan

An interface to NCBI BLAST provides sequence comparison capabilities.

Sequences are compared using BLASTN and BLASTX.

BLASTN alignments are analyzed in search of discrepancies that may identify base-calling errors or putative polymorphisms in the trace sequence.

Reading frames from BLASTX results are analyzed to examine if substituted base-calls result in synonymous or non-synonymous codon substitutions.
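
Whether a substituted base-call changes the encoded amino acid can be checked directly against the standard genetic code. The sketch below is illustrative only; `classify_substitution` is a hypothetical helper and does not reflect how TraceScan is implemented.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; "*" marks stop codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_substitution(codon: str, position: int, new_base: str):
    """Replace one base of a codon (position 0, 1 or 2) and classify the effect.

    Returns (mutated codon, amino acid before, amino acid after, classification).
    """
    original = codon.upper()
    mutated = original[:position] + new_base.upper() + original[position + 1:]
    before, after = CODON_TABLE[original], CODON_TABLE[mutated]
    kind = "synonymous" if before == after else "non-synonymous"
    return mutated, before, after, kind


if __name__ == "__main__":
    print(classify_substitution("GCT", 2, "C"))  # third-position change within Ala
    print(classify_substitution("GAT", 2, "A"))  # Asp to Glu
```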

TraceScan System Architecture



The Automated Data Acquisition Pipeline (ADAP) - Keywords

Hypothetical Protein: The protein sequence that is obtained from transcription and translation of the DNA sequence. It is hypothetical because we do not know whether it is the real protein that the DNA codes for.

Homologs: Evolutionarily related protein sequences

Comparative genomics: A technique where the functional traits of a protein sequence are learnt from its homologs

Motifs: Highly conserved regions of protein sequences

Fingerprints: Collection of motifs

BLASTP: Basic Local Alignment Search Tool for protein-to-protein searches



Automated Data Acquisition Pipeline (ADAP)

Gathers data for genome annotation

ADAP features:

- Uses comparative genomics to learn from the homologs

- New variant of BLAST, Parameter Regulated Iterative BLAST (PRI-BLAST)

- Uses 7 different analysis/search tools

- A few software design patterns are used

- Perl, MySQL, Perl-DBI, BioPerl, EMBOSS, BLASTP, SGE 5.3, and Perl-Gtk on Linux

ADAP Overview

Input: FASTA formatted EST sequences

Phase 1: Hypothetical protein extraction and homolog generation (homologs and HPs)

Phase 2: Sequence based protein structure prediction

Phase 3: Database search based protein family prediction

Potato ADAP database, accessed through a Perl-MySQL database interface

[Figure legend: data flow, database interactions]


Parameter Regulated Iterative BLAST (PRI-BLAST)

A static set of BLASTP parameters (neighborhood score, E-value, fraction identical, BLOSUM matrix, etc.) works poorly, since proteins evolve at different rates

PRI-BLAST iteratively performs BLASTP on the query sequence and categorizes the query as

a Celebrity query (C): many homologs

an Average query (A): a few or no homologs

an Obscured query (O): some homologs

PRI-BLAST Rule module

Decides which set of BLASTP parameters to use

Halts the PRI-BLAST

Statistical module

Density of homologs is computed through SQL statements
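The iterate-and-halt behaviour of the rule module can be sketched as follows. This is only an illustration under stated assumptions: the parameter sets, the `min_homologs` halting rule, and the injected `run_search` callable are hypothetical stand-ins, and the actual PRI-BLAST rules and SQL-based density statistics are not given in these slides.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Params = Dict[str, object]
Hit = Tuple[str, float]  # (subject id, E-value)


def iterative_search(
    query: str,
    parameter_sets: Sequence[Params],
    run_search: Callable[[str, Params], List[Hit]],
    min_homologs: int = 5,
) -> Tuple[Params, List[Hit]]:
    """Try parameter sets in order and halt once enough homologs are found.

    `run_search` stands in for a BLASTP run; the halting rule is an assumption.
    """
    if not parameter_sets:
        raise ValueError("at least one parameter set is required")
    params, hits = parameter_sets[0], []
    for params in parameter_sets:
        hits = run_search(query, params)
        if len(hits) >= min_homologs:
            break  # rule module: enough homologs, stop iterating
    return params, hits


def fake_search(query: str, params: Params) -> List[Hit]:
    """Toy replacement for BLASTP that filters a fixed hit list by E-value."""
    table = [("s1", 1e-50), ("s2", 1e-8), ("s3", 0.5)]
    return [hit for hit in table if hit[1] <= params["evalue"]]


PARAMETER_SETS = [
    {"evalue": 1e-10, "matrix": "BLOSUM80"},  # strict
    {"evalue": 1e-3, "matrix": "BLOSUM62"},
    {"evalue": 10.0, "matrix": "BLOSUM45"},   # permissive
]

if __name__ == "__main__":
    chosen, found = iterative_search("MKV", PARAMETER_SETS, fake_search, min_homologs=2)
    print(chosen, found)
```

Passing the search function in as an argument keeps the loop testable without a BLAST installation.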

Example BLASTP report

BLASTP 2.2.8 [Jan-05-2004]

Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer
. . . . . Nucleic Acids Res. 25:3389-3402.

Query= CK00043.5prime
(182 letters)

Database: All non-redundant GenBank CDS
translations+PDB+SwissProt+PIR+PRF excluding environmental samples
1,795,144 sequences; 592,604,613 total letters

Searching..................................................done

                                                                      Score   E
Sequences producing significant alignments:                           (bits)  Value

gb|AAD46849.2| LD03471p [Drosophila melanogaster]                       329   5e-90
ref|NP_651977.1| CG6773-PA [Drosophila melanogaster] >gi|7300991...     285   1e-76
ref|XP_312881.1| ENSANGP00000014751 [Anopheles gambiae] >gi|2129...     209   7e-54
gb|AAH54585.1| Unknown (protein for MGC:63980) [Danio rerio]            184   4e-46
.
.
.

>gb|AAD46849.2| LD03471p [Drosophila melanogaster] Length = 386

Score = 329 bits (1155), Expect = 5e-90
Identities = 181/182 (99%), Positives = 181/182 (99%)

Query: 1   VKRRKKTRLAFNQFIWRPDERISSKMVSLLQEIDTEHEDMVHHAALDFYGLLLATCSSDG 60
           VKRRKKTRLAFNQFIWRPDERISSKMVSLLQEIDTEHEDMVHHAALDFYGLLLATCSSDG
Sbjct: 6   VKRRKKTRLAFNQFIWRPDERISSKMVSLLQEIDTEHEDMVHHAALDFYGLLLATCSSDG 65

Query: 61  SVRIFHSRKNNKALAELKGHQGPVWQVAWAHPKFGNILASCSYDRKVIVWKSTTPRDWTK 120
           SVRIFHSRKNNKALAELKGHQGPVWQVAWAHPKFGNILASCSYDRKVIVWKSTTPRDWTK
Sbjct: 66  SVRIFHSRKNNKALAELKGHQGPVWQVAWAHPKFGNILASCSYDRKVIVWKSTTPRDWTK 125

Query: 121 LYEYSNHDSSVNSVDFAPSEYGLVLACASSDGSVSVLTCNTEYGVWDAKKIPNXHTIGVN 180
           LYEYSNHDSSVNSVDFAPSEYGLVLACASSDGSVSVLTCNTEYGVWDAKKIPN HTIGVN
Sbjct: 126 LYEYSNHDSSVNSVDFAPSEYGLVLACASSDGSVSVLTCNTEYGVWDAKKIPNAHTIGVN 185

Query: 181 AI 182
           AI
Sbjct: 186 AI 187
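The hit summary in a report like this one can be turned into data with a few lines of code. The sketch below assumes each summary line ends with an integer bit score followed by an E-value; the function names are hypothetical, and a production pipeline would more likely use an established parser such as the one in BioPerl.

```python
import re
from typing import List, Tuple

# Subject id, description, bit score, E-value (the last two tokens on the line).
SUMMARY_LINE = re.compile(r"^(\S+)\s+(.*?)\s+(\d+)\s+(\S+)$")


def parse_hits(lines: List[str]) -> List[Tuple[str, int, float]]:
    """Extract (subject id, bit score, E-value) from BLAST summary lines."""
    hits = []
    for line in lines:
        match = SUMMARY_LINE.match(line.strip())
        if not match:
            continue
        subject, _description, score, evalue = match.groups()
        if evalue.startswith("e"):
            evalue = "1" + evalue  # older reports print e.g. "e-100"
        try:
            hits.append((subject, int(score), float(evalue)))
        except ValueError:
            continue  # not a hit line after all
    return hits


def filter_hits(hits, max_evalue: float):
    """Keep hits whose E-value does not exceed the threshold."""
    return [hit for hit in hits if hit[2] <= max_evalue]


REPORT_LINES = [
    "gb|AAD46849.2| LD03471p [Drosophila melanogaster] 329 5e-90",
    "ref|NP_651977.1| CG6773-PA [Drosophila melanogaster] >gi|7300991... 285 1e-76",
    "ref|XP_312881.1| ENSANGP00000014751 [Anopheles gambiae] >gi|2129... 209 7e-54",
    "gb|AAH54585.1| Unknown (protein for MGC:63980) [Danio rerio] 184 4e-46",
]

if __name__ == "__main__":
    for hit in filter_hits(parse_hits(REPORT_LINES), max_evalue=1e-50):
        print(hit)
```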



motif search based Protein Sequence Analysis (mPSA)

Motifs are conserved regions of protein sequences, and a fingerprint is a collection of motifs in some order

mPSA (Phases 2 & 3) for the ADAP contains 6 mPSA tools from EMBOSS

Phase 2: sequence-based mPSA

secondary structure: transmembrane regions (Tmap), signal cleavage sites (Sigcleave), and general secondary structure (Garnier)

super-secondary structure: DNA-binding sites (Helixturnhelix)

Phase 3: database-search-based mPSA

protein motifs from PROSITE (Patmatmotifs) and protein fingerprints from PRINTS (Pscan)
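To make the idea of a motif concrete, the sketch below translates a simple PROSITE-style pattern into a regular expression and scans a sequence with it. It handles only the basic syntax (`x`, `[...]`, `{...}`, repeat counts, and the `<` / `>` anchors), the test sequence is made up, and it is not a description of how Patmatmotifs works internally.

```python
import re
from typing import List, Tuple


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as x(2) or x(2,4).
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        if match is None:
            raise ValueError("empty pattern element in %r" % pattern)
        core, low, high = match.groups()
        if core == "x":                  # any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"   # forbidden residues
        if low and high:
            core += "{%s,%s}" % (low, high)
        elif low:
            core += "{%s}" % low
        parts.append(prefix + core + suffix)
    return "".join(parts)


def find_motif(pattern: str, sequence: str) -> List[Tuple[int, str]]:
    """Return (0-based start, matched text) for non-overlapping matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    # N-glycosylation site pattern; the sequence is a toy example.
    print(prosite_to_regex("N-{P}-[ST]-{P}"))
    print(find_motif("N-{P}-[ST]-{P}", "MKNGSAANPTQ"))
```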

Homologues for Various Ranges of Lengths of Hyp. Proteins

[Bar chart. x-axis: length of hyp. protein, in bins of 5 from 10-15 to 110-115. y-axis: number of homologues (total), 0 to 10000.]

Shorter protein sequences have more homologs; these can be false positives



Homologues with E




Automated Protein Classification and Rule Maintenance

Use machine-learning techniques to derive classification rules

Apply the rules to classify uncharacterized sequences

Rule construction process: categorized sequences and their related data -> a decision tree consisting of rules

Rule application process: uncharacterized sequences -> newly characterized sequences



Automated Protein Classification and Rule Maintenance

Source data collection

Automated rule generation

Machine-learning algorithms and their comparison

Automated rule maintenance

Automated Rule Generation

C4.5 and CITree algorithms produce decision trees

WEKA (Waikato Environment for Knowledge Analysis) will be used for analyzing the dataset (http://www.cs.waikato.ac.nz/~ml/index.html)
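C4.5 chooses the attribute to split on by gain ratio, which can be shown on a toy dataset. The sketch below covers only that criterion for categorical attributes; the feature names (`tm`, `signal`) and class labels are invented for illustration, and pruning, continuous attributes, and the other parts of C4.5 are left out.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def gain_ratio(rows, attribute):
    """Information gain of a categorical attribute divided by its split information."""
    labels = [label for _, label in rows]
    partitions = defaultdict(list)
    for features, label in rows:
        partitions[features[attribute]].append(label)
    total = len(rows)
    remainder = sum(len(part) / total * entropy(part) for part in partitions.values())
    gain = entropy(labels) - remainder
    split_info = entropy([features[attribute] for features, _ in rows])
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical training rows: (features, protein class).
ROWS = [
    ({"tm": "yes", "signal": "yes"}, "membrane"),
    ({"tm": "yes", "signal": "no"}, "membrane"),
    ({"tm": "no", "signal": "yes"}, "secreted"),
    ({"tm": "no", "signal": "no"}, "cytosolic"),
]

if __name__ == "__main__":
    best = max(["tm", "signal"], key=lambda name: gain_ratio(ROWS, name))
    print(best, gain_ratio(ROWS, "tm"), gain_ratio(ROWS, "signal"))
```

The attribute with the highest gain ratio becomes the test at the current tree node, and the procedure recurses on each partition.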

Rule generation process (flowchart)

Start

Rule construction & decision tree creation (input: sequences and their related data)

Rule sieving

Is the rule qualified? (Yes / No)

Update Rule Database

End of rules? (Yes / No)

Apply rules to annotate target sequences (Target Sequence Database)

Update Target Sequence Database

End



Comparison of Algorithms

Evaluation criteria for the machine-learning algorithms: accuracy and AUC (area under the ROC (Receiver Operating Characteristic) curve)

Performance analysis
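Both criteria are easy to compute for a binary classifier. The sketch below uses the rank interpretation of AUC (the probability that a randomly chosen positive example scores higher than a randomly chosen negative one, ties counting one half); the labels, scores, and the 0.5 decision threshold are made-up values for illustration.

```python
def accuracy(labels, predictions):
    """Fraction of predictions that equal the true labels."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Toy data: true classes and classifier scores.
LABELS = [1, 1, 0, 0]
SCORES = [0.9, 0.4, 0.5, 0.1]

if __name__ == "__main__":
    predicted = [1 if s >= 0.5 else 0 for s in SCORES]
    print(accuracy(LABELS, predicted), auc(LABELS, SCORES))
```

Accuracy depends on the chosen threshold, whereas AUC summarizes ranking quality over all thresholds, which is one reason to report both.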



Tree Generated using Weka


Multi-agent Systems

A multi-agent system is one that consists of a number of agents, which interact with one another

In the most general case, agents will be acting on behalf of users with different goals and motivations

To successfully interact, they will require the ability to cooperate, coordinate, and negotiate with each other, much as people do

Multi-Agent System for Potato Genome Annotation

[Diagram components]

Agents: Information Agent, Pipeline Agent, Rule Construction Agent, Database Update Agent

Modules: Automated Data Acquisition Pipeline, Classification Module

Data sources: Web, NRDB MONTH, PRINTS, PROSITE

Databases: Local Database, Target Sequence Database, Rule Database



Mapping Transcription Factors from a Model to a Non-Model Organism

Transcription Factor

Group of proteins that initiate transcription

transcriptional activators

transcriptional repressors

Consists of DNA-binding domains

Binds to the binding-site regions (specific DNA sequences)

Controls the expression of the genes

Human genome: 2600 proteins contain DNA-binding domains



Transcription Factor Mapping

Model Organism

Investigated thoroughly by biologists

Nodes: transcription factors

Non-Model Organism

Not much data available

Nodes: predicted transcription factors

[Diagram: transcription factors A, B, C in the source genome are mapped to A1, B1, C1 in the target genome]


Transcription Factor Mapping



Methodology

BLASTP is used to map transcription factors from E. coli and Bacillus subtilis to the E. coli group and the Bacillus group

Parameter: E-value threshold, from 1e-5 to 10

Not all transcription factors from one genome can be mapped to another genome

The number of confirmed mappings between any two genomes depends on the definition of confirmed mapping used

Compare the available transcription factors of the target genome to the predicted set of transcription factors
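The mapping and comparison steps can be sketched as follows. Everything here is an illustrative assumption rather than the study's method: the hit tuples are invented, the best hit per source factor is taken as its mapping, and a mapping counts as "confirmed" when the predicted target is in the known set.

```python
def map_transcription_factors(hits, evalue_threshold):
    """Map each source factor to its best target hit within the E-value threshold.

    `hits` is an iterable of (source_tf, target_protein, evalue) tuples.
    """
    best = {}
    for source_tf, target, evalue in hits:
        if evalue > evalue_threshold:
            continue
        if source_tf not in best or evalue < best[source_tf][1]:
            best[source_tf] = (target, evalue)
    return {tf: target for tf, (target, _) in best.items()}


def compare_with_known(mapping, known_targets):
    """Return (confirmed targets, precision, recall) against the known set."""
    predicted = set(mapping.values())
    known = set(known_targets)
    confirmed = predicted & known
    precision = len(confirmed) / len(predicted) if predicted else 0.0
    recall = len(confirmed) / len(known) if known else 0.0
    return confirmed, precision, recall


# Hypothetical BLASTP hits from a source genome (A, B, C) into a target genome.
HITS = [
    ("A", "A1", 1e-40),
    ("A", "X9", 0.5),
    ("B", "B1", 0.05),
    ("C", "C1", 5.0),
]
KNOWN = {"A1", "B1", "C1"}

if __name__ == "__main__":
    for threshold in (0.01, 0.1, 10.0):
        mapping = map_transcription_factors(HITS, threshold)
        print(threshold, mapping, compare_with_known(mapping, KNOWN))
```

Sweeping the threshold in this way shows the trade-off the slides describe: a strict threshold misses true factors, while a loose one admits more candidates.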


Summary of Mapping Results

Transcription factor mapping in bacterial genomes

Proposed method is able to map most of the transcription factors

Transcription factor sequence motifs are well preserved

0.1 and 0.01: best E-value thresholds

Correct choice of E-value threshold can be more important than selection of an evolutionarily closer model organism



Bioinformatics @ C-DAC

Dr. Rajendra Joshi
Group Coordinator: Bioinformatics

Scientific and Engineering Computing Group

Centre for Development of Advanced Computing
Pune - 411007

[email protected]
http://bioinfo-portal.cdac.in

Bioinformatics Resources & Applications Facility (BRAF)

Funded by the Department of Information Technology (DIT), Ministry of Communications and Information Technology

Grid-enabling of numerous bioinformatics codes like SW, BLAST, ClustalW, AMBER, CHARMM, etc.

As part of BRAF, the team interacted with scientists from various CSIR labs, IITs and industries



BIOGENE: 1 TF machine

AMD processor, 2.6 GHz (total: 204 cores, 1060.8 GF)

4 Sun X4600 servers (8 sockets, dual core each), giving 64 cores

32 Sun X2200 servers (dual socket, dual core each), giving 128 cores

Backup server: Sun X2200 (4 cores)

Storage server: two Sun X2200 (8 cores)

InfiniBand switch (Mellanox DDR2, 48 port)

Storage: 20 terabytes, RAID 5

Tape library with autoloader

Benchmarking completed for AMBER, CHARMM, MEME, SW, Fasta, ClustalW, BLAST
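The core count and peak figure quoted for BIOGENE can be cross-checked with a short calculation. The node counts and clock speed are taken from the list above; the value of 2 floating-point operations per core per cycle is an assumption, chosen because it reproduces the stated 1060.8 GF.

```python
# (description, number of servers, cores per server), from the BIOGENE list.
NODE_GROUPS = [
    ("Sun X4600 compute", 4, 16),   # 8 sockets x 2 cores
    ("Sun X2200 compute", 32, 4),   # 2 sockets x 2 cores
    ("Sun X2200 backup", 1, 4),
    ("Sun X2200 storage", 2, 4),
]
CLOCK_GHZ = 2.6
FLOPS_PER_CYCLE = 2  # assumption: not stated in the slides


def total_cores(groups):
    """Sum servers x cores-per-server over all node groups."""
    return sum(servers * cores for _, servers, cores in groups)


def peak_gflops(cores, clock_ghz, flops_per_cycle):
    """Theoretical peak = cores x clock (GHz) x floating-point ops per cycle."""
    return cores * clock_ghz * flops_per_cycle


if __name__ == "__main__":
    cores = total_cores(NODE_GROUPS)
    print(cores, round(peak_gflops(cores, CLOCK_GHZ, FLOPS_PER_CYCLE), 1))
```

Under these assumptions the 64 + 128 + 4 + 8 cores add up to the quoted 204, and the peak comes to roughly 1 TF.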

Using BRAF Facility

Gipsy portal: use a browser and open the URL

http://gipsy.bioinfo-portal.cdac.in

Command line login

ssh -p 30005 gateway.cdac.in

Help on command line usage is available in the README file in the user's home directory.

Helpline: [email protected]



Bioinformatics Application Software for High-End Clusters and Grid

iMolDock : An interface for Molecular Docking on HPC

GENOPIPE : Automated Genome Annotation Pipeline on HPC

Anvaya : A Workflow Environment for High Throughput Comparative Genomics

Taxo Grid : Phylogeny on Grid

GenomeGrid : Bioinformatics Problem Solving Environment on Grid

GIPSY : Bioinformatics Problem Solving Environment on HPC

High-throughput Workflows for Genome Analysis



Collaboration: Biotechnology and Biological Sciences Research Council (UK)

A systems biology based approach for annotation of Salmonella and Mycobacterium genomes

Establishment of a common bioinformatics pipeline for analyses of bacterial genomes, with emphasis on identification of virulence and pathogenic factors

Collaboration: Institute of Animal Health (UK)

Genome Annotation: Salmonella

Causative agent of typhoid

Transmitted via food contamination

Economic losses as it affects livestock

Annotation of 5 Salmonella genomes with a wide host range

[Figure: Food-borne disease cycle: Salmonella]

Genome annotation via GENOPIPE

Single nucleotide polymorphism



Collaboration: University of Surrey (UK)

Expert curation of the Mycobacterium leprae genome: causative agent of leprosy

Development of a tool to calculate the molecular weight of metabolites

Furin Complex

Collaboration: Oregon Health & Science University (USA)

Collaborative project initiated with OHSU in December 2009

Provide computational support to the experimental studies at OHSU, through MD simulations on the BIOGENE cluster

Propeptide domain of serine protease Furin acts as a pH sensor

Phenomenon has been elucidated in-silico through MD simulations

Ten sets of simulations performed using NAMD



Collaborations: caBIG (NIH)

The National Cancer Institute (NCI) is involved in deployment of an integrated biomedical informatics infrastructure, the cancer Biomedical Informatics Grid (caBIG)

network that will freely connect the entire cancer community

caBIG would set up a node at C-DAC

GARUDA Grid and BRAF resources may be used

Collaboration: IIT Madras
CGMD studies on GPCR

OA1 (GPR143): a GPCR

Belongs to Class I GPCR, Rhodopsin family

7TM receptors or heptahelical receptors

An integral membrane glycoprotein of 404 aa

Protein product of the ocular albinism type 1 gene

Ocular albinism: an X-linked inherited disorder in which the eye lacks melanin pigment

Homology based approach along with CGMD simulation has been



Collaboration: Jubilant Biosys

Simulate fragment binding sites by molecular dynamics simulation methods

To identify the most probable site of interaction of chemical fragments in the protein

8 large simulations of 10 ns each were carried out

Results handed over to Jubilant

Collaboration: Nicholas Piramal

Contract research project

To understand protein-ligand interactions using molecular dynamics simulations

Involves carrying out molecular dynamics simulations on very large biomolecular systems

Benefits in designing better molecules for known drug targets

Four 20 ns molecular dynamics simulations have been carried out



Conclusion

Biology is transforming from observational and physical experiments to a computational science

Bioinformatics: an exciting research area

Challenges: Biology and Computer Science have different ways of working and need close collaboration

Opportunities: new crops, personalized medicine, early diagnosis, ...


Research Problems in Bioinformatics

Find genomes of all organisms

Identify and annotate all genes

Compute 3D structure from sequence for all proteins

Compare DNA / protein sequences for similarity

Compare families of DNA / protein sequences

Reason to be optimistic: Biology is finite

~30,000 human genes; ~1000 protein superfamilies

but computer speeds keep increasing



Business Opportunities

Clinical research

Gene therapy

Molecular science

Pharmaceutical companies: automated technologies to manufacture effective therapies and drugs, due to increasing concerns about drug safety and the stringent regulations that govern clinical trials for drug discovery

Bioinformatics platform market growing at a very fast rate

Global bioinformatics market: ~$8.3 billion by 2014

Knowledge management (2009): $1.3 billion

Bioinformatics platforms market (2014): ~$3.9 billion


Business Opportunities

Global bioinformatics market segments

- Bioinformatics platforms
  - Sequence alignment platforms
  - Sequence manipulation platforms
  - Sequence analysis platforms
  - Structural analysis platforms
- Content/knowledge management tools
  - Specialized knowledge management tools
  - Generalized knowledge management tools
- Services
  - Data analysis
  - Sequencing services
  - Database & management services
- Applications



Thank You!
