8/8/2019 hop Indonesia
Dr. Virendrakumar (Virendra) C. Bhavsar
Professor; Dean, 2003-2008
Director, Advanced Computational Research Lab., 2000-10
Faculty of Computer Science
University of New Brunswick (UNB), Fredericton, Canada
Visiting Professor, Center for Development of Advanced Computing (C-DAC), Pune, India
Bioinformatics: An Overview
Outline
Introduction: UNB, C-DAC, Bioinformatics
Genome: Genes, Proteomes, Evolution
Databases and Information Retrieval
Sequence Alignment and Phylogenetic trees
Protein Structure and Drug Discovery
Proteomics and Systems Biology
Infrastructure: UNB and C-DAC
Research Work at the University of New Brunswick and C-DAC
Future
University of New Brunswick (UNB)
Faculty of Computer Science
The First Faculty of CS in Canada
University of New Brunswick
Fredericton, New Brunswick, Canada
Oldest English Language University in Canada
Established in 1785
Fredericton and UNB
Center for Development of Advanced Computing (C-DAC)
India
History
1987
India requires a supercomputer for weather forecasting
The Government of the USA refuses the sale of a supercomputer to India (technology denial)
The Government of India decides to launch a national initiative for the development of indigenous supercomputers
C-DAC: HPC Evolution and Road Map
[Figure: timeline. 1991: PARAM 8000; 1994: PARAM 9000; 1998: PARAM 10000; 2002-03: PARAM Padma; 2007: 10 TF; 2010: 100 TF; 2012-13: 1 PF. Garuda Grid Computing: PoC phase (100 Mbps, 17 locations), then Main Phase. Goals: viable HPC business computing environment; platform for user community to interact/collaborate; social computing with participatory approach.]
C-DAC Centres
Headquarters: Pune
Centres
Pune
Knowledge Park, Bangalore
Electronics City, Bangalore
Chennai
Delhi
Hyderabad
Kolkata
Mohali
Mumbai
Noida
Thiruvananthapuram
Total Manpower is 2100 across all the centres of C-DAC
C-DAC's Thrust Areas
High Performance Computing & Grid Computing: Hardware, Software, Systems, Applications, Research, Technology, Infrastructure
Multilingual Computing: Tools, Fonts, Products, Solutions, Research, Technology Development
Software Technologies: OSS, Multimedia, ICT for masses, E-Governance, Geomatics
Professional Electronics: Digital Broadband, Wireless Systems, Network Technologies, Power Electronics, Real-Time Systems, Embedded Systems, VLSI/ASIC Design, Agri Electronics
Cyber Security & Cyber Forensics: Cyber Security tools, technologies & solution development, Research & Training
Health Informatics: Hospital Information System, Telemedicine, Decision Support System
Ubiquitous Computing: RFID, Design, Development and Integration of Ubicomp System Components
Education & Training: e-Learning Technologies & Services
Compute Nodes
No. of Processors : 248 (Power 4 @ 1 GHz)
Aggregate Peak Computing : 1005 GFs (~1 TF)
File Servers
No. of Processors : 24 (UltraSparc-III@900MHz)
Aggregate Memory : 96 GigaBytes
Internal Storage : 0.4 TeraBytes
File System : QFS
Operating System : Solaris 8
Networks
Primary : PARAMNet-II @ 2.5 Gbps Full Duplex
Backup : Gigabit Ethernet @ 1 Gbps Full Duplex
Management : 10/100 Mbps Fast Ethernet
External Storage
Storage Array : 5 TeraBytes with 16 T3 disk arrays
Tape Library : 12 TeraBytes - L700 (5 LTO drives)
Software
HPCC - C-DAC's High Performance Computing and Communication software suite
Compilers, Parallel Libraries and Tools
Ranked 171 in 2nd quarter end and 258 as per the latest ranking
C-DAC
Advanced Computing Training School (ACTS)
ACTS @ a glance
An outfit initiated by C-DAC R&D in 1993
Began with a modest 20 students and has grown to over 5000 students
Trained more than a quarter million students
Grown from one city and one centre to 30 cities and 50 centres within India
Over 150 crores of investment and 600-plus dedicated manpower
Spread from India to international locations
From one course to more than 10 courses
International Presence
Tajikistan
Uzbekistan
Mauritius
Ghana
Seychelles
Myanmar
Russia
Tanzania
Turkmenistan
Lesotho
Belarus
Saudi Arabia
Azerbaijan
Armenia
Post Graduate Courses
Post Graduate Diploma Programs
DAC : Diploma in Advanced Computing
DACA : Diploma in Advanced Computer Arts
DVLSI : Diploma in VLSI Design
WiMC : Diploma in Wireless & Mobile Computing
DSSD : Diploma in System Software Development
DGi : Diploma in Geo informatics
DISCS : Diploma in Information System & Cyber Security
DHI : Diploma in Healthcare Informatics
DLC : Diploma in Language Computing
DIVESD : Diploma in Integrated VLSI & Embedded System Design
DESD : Diploma in Embedded Systems Design
DPC : Diploma in Parallel Computing
M.Tech. Programs
Computer Science & Engineering
Software Engineering
Information Technology
VLSI
Artificial Intelligence
Grid Computing & Storage Management
Embedded Systems Design
Wireless & Network Technology
Process Control & Instrumentation
Training Programmes UNDER Tech sangam
Bioinformatics
Definitions
Bioinformatics: The creation and development of advanced information and computational techniques for solving problems in biology
High Performance Computing (HPC): Hardware and software for high speed computations and large storage
Bio Introduction
Molecular Biology
Living organisms (on Earth)
Lipids - Separate inside from outside
Proteins - Build 3D machinery to perform biological functions
DNA - Store information on how to build machinery
Diagram of a cell
Lipid membranes - provide barrier
Protein structures - do work
DNA nucleus - store info
Molecular Biology
Deoxyribonucleic Acid (DNA)
Composition
- Sequence of nucleotides
- Nucleotide = deoxyribose sugar + phosphate group + base
Molecular Biology - DNA
DNA: contains genetic instructions used in the development and functioning of all known living organisms, with the exception of some viruses.
DNA molecules: long-term storage of information.
DNA: a set of blueprints, like a recipe or a code, since it contains the instructions needed to construct other components of cells, such as proteins and RNA molecules.
Genes: the DNA segments that contain instructions to construct the above components of cells.
Other DNA sequences: serve structural purposes, or are involved in regulating the use of this genetic information.
Chemically, DNA consists of two long polymers of simple units called nucleotides, with backbones made of sugars and phosphate groups joined by ester bonds. These two strands run in opposite directions to each other and are therefore anti-parallel. Attached to each sugar is one of four types of molecules called bases. It is the sequence of these four bases along the backbone that encodes information.
Molecular Biology - DNA
- Two long polymers of simple units called nucleotides, with backbones made of sugars and phosphate groups joined by ester bonds.
- These two strands run in opposite directions to each other and are therefore anti-parallel.
- Attached to each sugar is one of four types of molecules called bases. It is the sequence of these four bases along the backbone that encodes information. This information is read using the genetic code, which specifies the sequence of the amino acids within proteins.
- The code is read by copying stretches of DNA into the related nucleic acid RNA, in a process called transcription.
- Within cells, DNA is organized into long structures called chromosomes. These chromosomes are duplicated before cells divide, in a process called DNA replication.
Molecular Biology - DNA
- DNA is organized into long structures called chromosomes.
- Chromosomes are duplicated before cells divide, in a process called DNA replication.
- Eukaryotic organisms (animals, plants, fungi, and protists) store most of their DNA inside the cell nucleus and some of their DNA in organelles, such as mitochondria or chloroplasts.
- Prokaryotes (bacteria and archaea) store their DNA only in the cytoplasm.
Molecular Biology
RNA: Ribonucleic acid (RNA)
- a long chain of nucleotide units
- Each nucleotide consists of a nitrogenous base, a ribose sugar, and a phosphate
RNA is very similar to DNA
RNA is usually single-stranded
DNA is usually double-stranded
RNA nucleotides contain ribose while DNA contains deoxyribose (a type of ribose that lacks one oxygen atom)
RNA has the base uracil rather than the thymine that is present in DNA
Molecular Biology
DNA: DNA -> DNA (Replication)
RNA: DNA -> RNA (Transcription / Gene Expression)
Protein: RNA -> Protein (Translation)
DNA, RNA, Proteins
Proteins and nucleic acids (DNA, RNA) are essential components for living organisms
DNA -> Transcription -> RNA -> Translation -> Proteins
[Figure: a chromosome and its DNA, with genes (Gene 1, Gene 2, ...) marked along the DNA]
Raw biological data: nucleic acids (DNA)
Raw biological data: amino acid residues (proteins)
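Standard Genetic Code
(rows: first base; columns: second base; Ter = termination codon)
     T            C            A            G
T    TTT Phe (F)  TCT Ser (S)  TAT Tyr (Y)  TGT Cys (C)
     TTC Phe (F)  TCC Ser (S)  TAC Tyr (Y)  TGC Cys (C)
     TTA Leu (L)  TCA Ser (S)  TAA Ter      TGA Ter
     TTG Leu (L)  TCG Ser (S)  TAG Ter      TGG Trp (W)
C    CTT Leu (L)  CCT Pro (P)  CAT His (H)  CGT Arg (R)
     CTC Leu (L)  CCC Pro (P)  CAC His (H)  CGC Arg (R)
     CTA Leu (L)  CCA Pro (P)  CAA Gln (Q)  CGA Arg (R)
     CTG Leu (L)  CCG Pro (P)  CAG Gln (Q)  CGG Arg (R)
A    ATT Ile (I)  ACT Thr (T)  AAT Asn (N)  AGT Ser (S)
     ATC Ile (I)  ACC Thr (T)  AAC Asn (N)  AGC Ser (S)
     ATA Ile (I)  ACA Thr (T)  AAA Lys (K)  AGA Arg (R)
     ATG Met (M)  ACG Thr (T)  AAG Lys (K)  AGG Arg (R)
G    GTT Val (V)  GCT Ala (A)  GAT Asp (D)  GGT Gly (G)
     GTC Val (V)  GCC Ala (A)  GAC Asp (D)  GGC Gly (G)
     GTA Val (V)  GCA Ala (A)  GAA Glu (E)  GGA Gly (G)
     GTG Val (V)  GCG Ala (A)  GAG Glu (E)  GGG Gly (G)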
Triplets of DNA bases, called codons, each code for an amino acid
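The codon table can be applied mechanically. The following is a minimal Python sketch, not part of the original slides: it builds the standard genetic code from the T, C, A, G ordering used in the table and translates a DNA coding strand codon by codon. The names `transcribe` and `translate` are illustrative, and stopping at the first Ter codon is an assumption of this sketch.

```python
from itertools import product

BASES = "TCAG"  # same row/column order as the genetic code table
# One-letter amino acids for the 64 codons in TCAG order; "*" marks Ter (stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def transcribe(dna):
    """Return the mRNA that matches the DNA coding strand (T replaced by U)."""
    return dna.upper().replace("T", "U")


def translate(dna):
    """Translate a DNA coding strand in frame 0, stopping at the first stop codon.

    Trailing bases that do not fill a codon are ignored; characters other
    than A, C, G, T raise KeyError.
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCGATGAGTCGTGTAGCTTCT"))  # prints MAMSRVAS
```

Real sequences may contain ambiguity codes such as N, which this sketch does not handle.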
A Protein Structure
Protein 3D structure
The structure of the protein sequence determines the functionality
http://anatomy.med.unsw.edu.au/cbl/research/cytoskeleton/swissprotactin.htm
Informatics
FASTA formatted Sequences
FASTA: "FAST-All" alignment; it works with any alphabet
- FAST-P for protein alignment
- FAST-N for nucleotide alignment
Sample FASTA formatted Sequences
FASTA: "FAST-All" alignment; it works with any alphabet, an extension of "FAST-P" (protein) and "FAST-N" (nucleotide) alignment.
EST sequence (A, C, G, T)
>gi|39796586|gb|CK247430.1|CK247430 EST731067 potato callus cDNA library, mRNA sequence
ACAAGTCACTATAGGGACATGCTTCAATTTTTTCAAAACATCTTGAATAGTACAAAGTGCACAACATACT
CCAAAAAACTGAATACATTTTCTATTGTCAATATCTATAGCCATATGACTTTCAGTGCGACCTATGCATT
CATAACTCCCGCTACCAAATCCACCATGTAGTGCTTACAACAACAAGCCTAGTGAGAACGTAAGCCTGGT
CTGGAGCCAAAAGCAAATTATGTATACTAAAAAACCCCCTGGCTAAAATGCATATCATGATTAGTAGTGA
CATT
Protein Sequence (20 different amino acids)
>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCK
MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKK
TYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLVEITPIGF
APTEVRRYTGGHERQKRVPF
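Reading records like the two samples above takes only a few lines of code. This is a minimal Python sketch, assuming the usual convention that a line starting with `>` is a header and the following lines, up to the next header, are one sequence. The function name `parse_fasta` is illustrative and not from the slides.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []  # start a new record
        elif header is not None:
            chunks.append(line)  # sequence lines are concatenated
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


sample = """>seq1 demo nucleotide record
ACAAGTCACT
ATAGGG
>seq2 demo protein record
ELRLRYCAPAGF
"""
for name, seq in parse_fasta(sample):
    print(name, len(seq))
```

The two records in `sample` are shortened stand-ins made up for the example, not database entries.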
Biological Databases
Genome databases: flat files or relational databases
GenBank, EMBL, DDBJ, PDB, SWISSPROT, PIR
Classification of Biological databases:
- primary databases (GenBank, EMBL, DDBJ)
- secondary databases (SWISSPROT, PDB, PIR)
Biological databases
Like any other database
Data organization for optimal analysis
Data is of different types
Raw data (DNA, RNA, protein sequences)
Curated data (DNA, RNA and protein annotated sequences and structures, expression data)
Biological databases - Examples
Nucleotide Databases: Alternative Splicing, EMBL-Bank, Ensembl, Genomes Server, Genome, MOT, EMBL-Align, Simple Queries, dbSTS Queries, Parasites, Mutations, IMGT
Genome Databases: Human, Mouse, Yeast, C.elegans, FLYBASE, Parasites
Protein Databases: Swiss-Prot, TrEMBL, InterPro, CluSTr, IPI, GOA, GO, Proteome Analysis, HPI, IntEnz, TrEMBLnew, SP_ML, NEWT, PANDIT
Structure Databases: PDB, MSD, FSSP, DALI
Microarray Database: ArrayExpress
Literature Databases: MEDLINE, Software Biocatalog, Flybase Archives
Alignment Databases: BAliBASE, Homstrad, FSSP
PDB - Protein Data Bank
3D macromolecular structural data
Data originates from NMR or X-ray crystallography techniques
If the 3D structure of a protein is solved ... they have it
What to take home
Databases are a collection of data
Need to access and maintain easily and flexibly
Biological information is vast and sometimes very redundant
Distributed databases bring it all together with quality controls, cross-referencing and standardization
Computers can only create data, they do notgive answers
Bioinformatics
Premise of Bioinformatics
Gene sequences determine biological function
Genomic DNA -> Amino acids -> Proteins -> Function
Similar composition -> similar function?
- DNA sequences
- Amino acid sequences
- Protein 3-D structure
Predicting protein function
- Designer drugs
- Personalized treatments
Bioinformatics
Determining protein function
Hard way
-Biological / chemical analyses
- Determine 3D structure w/ x-ray crystallography, NMR
Easy way?
- Sequence protein / DNA, then find a close match in a database
- Guess function based on match
- Validate guess in lab
Bioinformatics is imprecise
- Similar to data-mining
- Only suggests possible relationships
- Must validate: correlation is not causation
Growth of Bioinformatics
1970s
- DNA sequencing
- Alignment w/ Smith-Waterman (dynamic programming)
1980s
- Sequence databases (EMBL, GenBank)
- Alignment w/ FASTA (linked lists, hashing)
1990s
- Automatic DNA sequencing
- Alignment w/ BLAST (neighborhood words, probabilities)
- Internet & WWW
Now
- Genomics, Proteomics
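The dynamic-programming alignment named in the timeline can be illustrated in a few lines. This is a minimal Python sketch of Smith-Waterman local alignment scoring with a linear gap penalty; the scoring values (match 2, mismatch -1, gap -2) are illustrative assumptions, and the function returns only the best local score rather than the aligned strings.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets an alignment restart anywhere, which makes it local.
            H[i][j] = max(0, diagonal, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best


print(smith_waterman("TTACGTTT", "GGACGGG"))  # shared "ACG" scores 6
```

The table has len(a) x len(b) cells, so the cost grows with the product of the sequence lengths. That cost is the motivation for the faster heuristic tools (FASTA, BLAST) listed for later decades.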
Bioinformatics Topics
Sequence alignments
- Find similarity between DNA / protein (amino acid) sequences
Genome assembly
- Combining genomic fragments to form whole genome
Gene identification & annotation
- Identify and classify genes on the genome
Microarrays & gene expression analysis
- Use DNA microarray (gene chip) to measure mRNA
Protein folding
- Compute 3-D protein structure from protein sequence
Phylogenetic analysis
- Find genetic relationships between sequences / species
What Does Genomics Mean?
Genomics: a science that studies the genetic material of a species at the molecular level
A scientific approach to identify and define the function of genes, as well as uncover when and how genes work together to produce traits
Structural Genomics approaches (mapping) - focus on traits controlled by one or a few genes, and often only provide information regarding the location of a gene or genes
Examine the interrelationships and interactions between thousands of genes
How do we do this?
Genome Organization
[Figure: leaf and tuber tissue; chromosome; DNA; Gene 1, Gene 2, ...]
Proteins are building blocks for living organisms
Proteins are derived from DNA
- Transcription: the gene message (RNA) that codes for proteins is formed from DNA
- Translation: RNA triplets (codons) code for amino acids
A gene can also be identified through its complementary DNA (cDNA); sequences of active or expressed genes are termed Expressed Sequence Tags (ESTs)
Genome Organization
Promoter: switch
Coding ORF: message
....TATACAGCAAAATAGAAAGATCTAGTGTCCCATGGCGATGAGTCGTGTAGCTTCT.
[Figure: DNA with Gene 1, Gene 2, etc.]
cDNA Collections (Libraries)
Various tissues are collected from the plant, and messages are extracted
Leaf
Messages
Tuber
Messages
cDNA Collections (Libraries)
The messages are copied to form double-stranded DNA copies (cDNA) of each message
Leaf cDNA Tuber cDNA
Each copy is glued into a piece of bacterial DNA for easier storage, handling and propagation, resulting in a collection or library of cDNAs for each tissue
cDNA Collections (Libraries)
The cDNAs are then read or sequenced, to give the
order of As, Cs, Gs or Ts for each
We are left with the sequence of each gene that is
active (expressed) in each cell, tissue or organ studied
These are Expressed Sequence Tags or ESTs
Using complex computer resources, these ESTs can
be analyzed and compared with known sequences
and proteins
Look for messages associated with specific organs or
characteristic/traits
Take Home Points
Messages from various genes are important, as they dictate which proteins are produced
Promoters are also important, as they dictate where a specific message and protein is produced
Genomics involves the study of all of the messages produced by the various plant cells
A lot of information needs to be organized and analyzed
Database
Contains all the EST sequences
Contains useful annotations
Blast Searches
Contig Assemblies
Transmembrane Spanning Regions
Gel Pictures
EST Information
Data Analysis
Tens of thousands of ESTs available for study
Most methods to study message distributions are low throughput AND time consuming
Genomics necessitates the large scale study of
gene expression
How can we do this?
Microarray Analysis
Microarray Analysis - Processing
[Figure: intensity dependence comparison. Log(R/G) plotted against 0.5*(Log(G)+Log(R)) for Slide 3 and Slide 70, with polynomial fits (R2 = 0.2014 and R2 = 0.6185)]
Processing steps: Image Processing, Data Normalization, Differential Gene Expression, Cluster Analysis, Pathway Analysis
Microarray Analysis - Processing
Signal vs. background
Spot quality problems:
- Irregular size or shape
- Irregular placement
- Low intensity
- Saturation
- Spot variance
- Background variance
Example spots: indistinguishable, saturated, bad print, artifact, misalignment
Microarray Analysis - Processing
Calculate numeric characteristics of each spot
Throw out spots that do not meet minimum requirements for each characteristic
Throw out spots that do not have minimum overall combined quality
Microarray Analysis - Data Normalization
Normalize data to correct for variances
Dye bias
Location bias
Intensity bias
Pin bias
Slide bias
Control vs. non-control spots
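The quantities plotted in the intensity dependence comparison, Log(R/G) against 0.5*(Log(G)+Log(R)), are simple to compute, and the simplest correction for an overall dye bias is to re-centre the log ratios. The sketch below is a minimal Python illustration, not the method used in the project. It assumes base-2 logarithms and a global median shift, both of which are choices made for the example (the slides do not state the log base, and the plot shows polynomial fits rather than a median shift).

```python
import math
from statistics import median


def ma_values(red, green):
    """Return (M, A) lists: M = log2(R/G), A = 0.5 * (log2 R + log2 G)."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * (math.log2(r) + math.log2(g)) for r, g in zip(red, green)]
    return m, a


def median_normalize(m):
    """Shift log ratios so their median is zero (global dye-bias correction)."""
    centre = median(m)
    return [value - centre for value in m]


# Made-up spot intensities for three spots on one slide.
red = [200.0, 400.0, 800.0]
green = [100.0, 100.0, 100.0]
m, a = ma_values(red, green)
print(median_normalize(m))
```

Intensity-dependent bias would need a fit of M against A instead of a single shift; the same `ma_values` output could feed such a fit.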
Microarray Analysis - Clustering
Cluster genes based on expression profiles
Gene expression across several treatments
Hypothesis: Genes with similar function have similar expression profiles
Expression Profile Clustering
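One simple way to act on the hypothesis that similar function means similar expression is to group genes whose profiles are strongly correlated. The following is a minimal Python sketch under stated assumptions: Pearson correlation as the similarity measure, a hypothetical threshold of 0.9, and a greedy grouping against the first member of each cluster. The slides do not say which clustering method was used.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (neither may be constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def cluster_profiles(profiles, threshold=0.9):
    """Greedily group genes whose profile correlates with a cluster's first gene."""
    clusters = []
    for gene, profile in profiles.items():
        for cluster in clusters:
            if pearson(profile, profiles[cluster[0]]) >= threshold:
                cluster.append(gene)
                break
        else:
            clusters.append([gene])  # no similar cluster found: start a new one
    return clusters


# Made-up expression values across four treatments.
profiles = {
    "gene1": [1.0, 2.0, 3.0, 4.0],
    "gene2": [2.0, 4.0, 6.0, 8.5],
    "gene3": [4.0, 3.0, 2.0, 1.0],
}
print(cluster_profiles(profiles))
```

The greedy pass depends on input order; hierarchical or k-means clustering would be the more usual choices for real data.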
Microarray Analysis - Data Management
[Figure: project, database, engine]
Information Processing and Handling
Assembly and annotation of genomic data
EST analysis and databases
Cluster analysis of microarray data
Comparisons of various transcriptomic methods
Integration of sequence, transcriptomic, proteomic,
metabolomic, transgenic data
Research Problems in Bioinformatics
Find genomes of all organisms
Identify and annotate all genes
Compute sequence -> 3D structure for all proteins
Compare DNA / protein sequences for similarity
Compare families of DNA / protein sequences
Reason to be optimistic: Biology is finite
~30,000 human genes; ~1000 protein superfamilies
but computer speeds keep increasing
Fighting Bird Flu
Virus in 3-D
Bioinformatics Infrastructure: High Performance Computing
Advances in Microprocessor Technology
1974 - 1 MHz clock
1988 - 40 MHz
2002 - 2 GHz
2009 - P4 3.0 GHz, Quad-core 2.66 GHz
Intel Montecito chip: 1.72 billion transistors
NVidia 280 series GPU: 1.4 billion transistors
- Circuit complexity doubles every 18 months
- Computing power at a given cost doubles every 18 months
- Processor clock rates: 40% increase/year + more instr./cycle
- DRAM access times: 10% increase/year -> caches required
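The growth rates quoted above compound quickly. As a small worked example in Python (an illustration added here, with hypothetical function names), doubling every 18 months gives a factor of 2^(12t/18) after t years, while a fixed yearly rate r gives (1 + r)^t.

```python
def doubling_growth(years, doubling_months=18):
    """Growth factor after `years` when a quantity doubles every `doubling_months`."""
    return 2 ** (years * 12 / doubling_months)


def compound_growth(years, rate_per_year):
    """Growth factor after `years` at a fixed yearly rate (0.40 means 40%/year)."""
    return (1 + rate_per_year) ** years


for years in (3, 6, 15):
    print(
        years,
        round(doubling_growth(years), 1),        # circuit complexity
        round(compound_growth(years, 0.40), 1),  # processor clock rate
        round(compound_growth(years, 0.10), 1),  # DRAM
    )
```

Over the same period the 40%/year and 10%/year curves diverge steadily, which is the gap the slide says caches are needed to bridge.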
Current Supercomputer Nov 2009
Jaguar
Oak Ridge National Lab., USA
- 1.72 Petaflop/s on the Linpack benchmark (peta = quadrillion: a million billion, 10**15, floating-point operations/sec, Flops)
- 2.332 Petaflops peak (i.e. 2332 Teraflops)
- Power: 1750 Watt/sq ft; ~50 million kWh per year
- Space: 4352 square feet, larger than an NBA basketball court
Future
IBM Cyclops64 - supercomputer on a chip
C-DAC initiative for 2010 - petaflop machine
NCSA, USA 2011 - petaflop machine
NASA, SGI and Intel Pleiades - 10 petaflop by 2012
1 Exaflop (10**18 flops) by 2019
Human brain neural simulations - 10 exaflop by 2025
2-week full weather modeling - 1 zetaflops (10**21 flops) by 2030
High Performance Computing and Networking @ University of New Brunswick
Advanced Computational Research Lab (ACRL) Infrastructure
People, Research, Excellence
ACEnet: Atlantic Computational Excellence Network
Hosting sites:
Member sites:
ACEnet
Atlantic Canada is a distributed environment
$30 million initiative
Waterways make networking solutions difficult (e.g. Cabot Strait)
ACEnet
World-class HPC facilities
Behave as a single, regionally distributed computational power grid
Create and operate sophisticated collaboration facilities to bind together geographically dispersed research communities.
ACEnet at UNB
Fundy: SUN cluster, AMD Opteron, 632 cores
ACEnet: 3324 cores
Internet connectivity > 2Gbps at UNB
ACEnet Collaboration Grid
Collaboration gear across Atlantic Canada
- Lecture rooms equipped so ACEnet sites can share seminars and participate remotely
- ACEnet cafés at each site sharing continuous video feeds
- Desktop level collaboration equipment for personal communication
Access Grid streams tens to hundreds of Mbps across the CANARIE network
Bioinformatics Research @ University of New Brunswick
The Canadian Potato Genome Project
Collaborators
Dr. Patricia Evans (UNB), Dr. Barry Flinn (BioAtlantech), Dr. David Dekoyer (Potato Research Center), Carleton University, Nova Scotia Agricultural College
Students: Aijazuddin Syed (MCS Student), En Zhang (MCS Student), Zheng Wang (MCS Student), Marc Cooper (MCS Student), Rachita Sharma (PhD Student)
Potato
Integral part of diet: French fries, mashed potatoes
Provides 12 essential vitamins
Fourth most important crop worldwide
Potato has not been explored in terms of functional and bio-chemical traits
Much is still unknown about how the potato genome controls potato development and processing/quality traits (disease resistance, stress tolerance, carbohydrate metabolism, tuber shape)
Economic Importance Of The Potato
Integral part of the diet of a large proportion of the world's population
Supplies at least 12 essential vitamins and minerals
Still much unknown regarding the control of potato development and processing/quality traits (i.e. disease resistance, stress tolerance, carbohydrate metabolism, tuber shape)
The Canadian Potato Genome Project (CPGP)
46% of national potato production - $1 billion/year
Home of McCain Foods Ltd. - $5.5 billion/year
Potato Research Center (PRC) of AAFC
Solanum Genomics International Inc./BioAtlantech
Carleton University
University of New Brunswick
Nova Scotia Agricultural College (NSAC)
CPGP Goals
CPGP targets genes associated with tuber health and tuber quality:
- Tuber Health: Late Blight and Common Scab
- Tuber Quality: Stable dry matter accumulation, cold sweetening and after-cooking darkening
[Figure: leaf and tuber; DNA with Gene 1, Gene 2, ...]
Project Description
Identification Of A Differential Gene Expression Pattern And Genes Related To Resistance In Potato Late Blight
One of the most devastating diseases of potato worldwide
If left unmanaged, complete destruction of crops can occur
Attacks leaves and tubers; large necrotic lesions on leaves and dry rot that spreads through tubers; secondary bacteria and fungi often infect through late blight lesions
Late Blight Project
Collaborative effort with AAFC Potato Research Centre
Population of blight-sensitive and blight-resistant plants of near isogenicity
cDNA libraries made from leaves of a blight-sensitive and a blight-resistant plant
2500 messages were sequenced from each library
(5000 total ESTs)
Different ESTs to be profiled for expression
The tremendous amounts of data generated will need to be
managed efficiently
Database - Sequence Info
Late Blight Project
cDNA Microarray Using SGII Clones
hybridized with Cy3 (resistant) + Cy5 (susceptible) probes
(reciprocal labelling experiments)
ANDLBRLF02345HTF.01 - Class II chitinase
ANDLBRLF01256HTF.01 - Pathogenesis-related protein
P23 precursor
ANDLBRLF02041HTF.01 - Unknown protein
What Use Is All Of This Information?
Transgenics:
- Enhance tuber quality, processing traits, disease resistance, stress tolerance more rapidly than breeding
Expression Assisted Selection:
- Obtain expression profiles for thousands of genes associated with specific traits or characteristics
- Use these profiles as a baseline to compare with the expression profiles of unknown clones; crosses
New Protein Products:
- Identify genes encoding secreted proteins/ligands
- Test these for growth-promoting/other effects
- Express genes in batch cultures and purify proteins
Example Of Gene Use
GFP expression in tobacco cells
GA-20 oxidase in potato:
- GA-20 oxidase knockouts with enhanced tuber production
- GA-20 oxidase knockouts with reduced tuber sprouting
Information Processing and Handling
Assembly and annotation of genomic data
EST analysis and databases
Cluster analysis of microarray data
Comparisons of various transcriptomic methods
Integration of sequence, transcriptomic, proteomic,
metabolomic, transgenic data
The Canadian Potato Genome Project
Sequence the gene and build cDNA libraries [Solanum Genomics Intl. Inc. (SGII)]
EST sequence generation [National Research Council at Halifax and SGII]
Bioinformatics: base-calling, clustering, BLAST, annotations, and gene expression [UNB and PRC]
Microarray profiling [SGII, PRC, UNB, Ontario Cancer Institute, and NSAC]
Intermediate data: leaf and tuber cDNA; FASTA formatted EST sequence & trace files
Sample FASTA formatted Sequences
EST sequence
>gi|39796586|gb|CK247430.1|CK247430 EST731067 potato callus cDNA library,
mRNA sequence
ACAAGTCACTATAGGGACATGCTTCAATTTTTTCAAAACATCTTGAATAGTACAAAGTGCACAACATACT
CCAAAAAACTGAATACATTTTCTATTGTCAATATCTATAGCCATATGACTTTCAGTGCGACCTATGCATT
CATAACTCCCGCTACCAAATCCACCATGTAGTGCTTACAACAACAAGCCTAGTGAGAACGTAAGCCTGGT
CTGGAGCCAAAAGCAAATTATGTATACTAAAAAACCCCCTGGCTAAAATGCATATCATGATTAGTAGTGA
CATT
Protein Sequence
>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCK
MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKK
TYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLVEITPIGF
APTEVRRYTGGHERQKRVPFVXXXXXXXXXXXXXXXXXXXXXXVQSQHLLAGILQQQKNL
LAAVEAQQQMLKLTIWGVK
Database
Contains all the EST sequences
Contains useful annotations
Blast Searches
Contig Assemblies
Transmembrane Spanning Regions
Gel Pictures
EST Information
Data Analysis - Bioinformatics
Tens of thousands of ESTs available for study
Most methods to study message distributions are low
throughput AND time consuming
Genomics necessitates the large scale study of gene expression
Automation required for routine processes
Data acquisition for potato genome annotation
Automated protein classification with rule maintenance
Use agents to integrate the software and primary databases in
a flexible and robust way
Overview of Bioinformatics Research at UNB
[Figure: research components and the data passed between them]
Components: TraceScan; Automated Data Acquisition Pipeline; Automated Protein Classification and Rule Maintenance; Multi-Agent System for Potato Genome Annotation
Data: EST sequences; homologs, motifs, fingerprints, transmembrane, and signal sites
TraceScan - Keywords
Chromatogram - visual representation of the digital output produced by an automated sequencing machine. A chromatogram is drawn as a set of four overlapping waveforms, one for each nucleotide base
Base-calling - determining the set of nucleotide bases for a DNA sequence strand from the analysis of the digital output produced by a sequencing machine
Heterozygosity - exists in the chromatogram where a second strong peak appears beneath a primary peak. This may indicate the presence of a secondary nucleotide base at that location in the sequence
BLAST - Basic Local Alignment Search Tool
Example of a Chromatogram
The TraceScan Software System
Designed to investigate sequence quality, potential polymorphisms, and base heterozygosity in EST sequences.
Relies on the combined analysis of a DNA sequence trace file, the trace chromatogram, and multiple alignment of sequence homologs.
Allows base-calls to be substituted where superimposed peaks have been detected in the trace.
Base-calls deemed in error can be corrected to improve sequence quality and data reliability.
TraceScan
Visualizes DNA sequence chromatograms
Detects overlapping trace peaks using modifications to the PHRED base-caller
Peaks are highlighted on the user interface.
Modifications to PHRED enable base-calls with overlapping peaks to be substituted.
Base substitutions produce a new set of base quality scores for the sequence.
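The idea of a second strong peak beneath a primary peak can be illustrated with a toy rule. The sketch below is a minimal Python illustration and is not TraceScan's or PHRED's actual algorithm: it assumes the four trace intensities at one called position are already available, and it uses a hypothetical ratio threshold of 0.5 to decide whether the second-highest signal counts as a secondary base.

```python
def call_position(intensities, secondary_ratio=0.5):
    """Return (primary_base, secondary_base_or_None) for one trace position.

    `intensities` maps each base (A, C, G, T) to its peak height.
    `secondary_ratio` is a hypothetical threshold chosen for this example.
    """
    ranked = sorted(intensities.items(), key=lambda item: item[1], reverse=True)
    (primary, primary_height), (secondary, secondary_height) = ranked[0], ranked[1]
    if primary_height > 0 and secondary_height / primary_height >= secondary_ratio:
        return primary, secondary  # possible heterozygous position
    return primary, None


# Made-up peak heights at two positions.
print(call_position({"A": 900, "C": 40, "G": 600, "T": 10}))
print(call_position({"A": 900, "C": 40, "G": 100, "T": 10}))
```

A flagged position is only a candidate; the slides describe confirming it against aligned homologs before substituting a base-call.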
TraceScan
An interface to NCBI BLAST provides sequence comparison capabilities.
Sequences are compared using BLASTN and BLASTX.
BLASTN alignments are analyzed in search of discrepancies that may identify base-calling errors or putative polymorphisms in the trace sequence.
Reading frames from BLASTX results are analyzed to examine if substituted base-calls result in synonymous or non-synonymous codon substitutions.
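Whether a substituted base-call is synonymous can be checked directly against the standard genetic code. This is a minimal Python sketch of that check, assuming the reading frame is already known so that the affected codon can be isolated. The function name `classify_substitution` is illustrative and not part of TraceScan.

```python
from itertools import product

BASES = "TCAG"
# One-letter amino acids for the 64 codons in TCAG order; "*" marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_substitution(codon, position, new_base):
    """Classify replacing the base at `position` (0, 1 or 2) of `codon`."""
    mutated = codon[:position] + new_base + codon[position + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    return "synonymous" if before == after else "non-synonymous"


print(classify_substitution("CTT", 2, "C"))  # Leu -> Leu
print(classify_substitution("GAT", 2, "A"))  # Asp -> Glu
```

A change that creates a stop codon is reported here as non-synonymous; a fuller tool would likely report it separately.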
TraceScan System Architecture
The Automated Data Acquisition Pipeline (ADAP) - Keywords
Hypothetical Protein: The protein sequence that is obtained from transcription and translation of the DNA sequence. It is hypothetical because we do not know if it is the real protein that the DNA codes for.
Homologs: Evolutionarily related protein sequences
Comparative genomics: A technique where the functional traits of a protein sequence are learnt from its homologs
Motifs: Highly conserved regions of protein sequences
Fingerprints: Collection of motifs
BLASTP: Basic Local Alignment Search Tool for protein-to-protein searches
Automated Data Acquisition Pipeline (ADAP)
Gathers data for genome annotation
ADAP features:
Uses comparative genomics to learn from the homologs
New variant of BLAST, Parameter Regulated Iterative BLAST (PRI-BLAST)
Uses 7 different analysis/search tools
A few software design patterns are used
Perl, MySQL, Perl-DBI, BioPerl, EMBOSS, BLASTP, SGE 5.3, and Perl-Gtk on Linux
ADAP Overview
Input: FASTA-formatted EST sequences
Phase 1: Hypothetical protein extraction and homolog generation
Phase 2: Sequence-based protein structure prediction
Phase 3: Database-search-based protein family prediction
Homologs and HPs
Perl-MySQL Database Interface
Potato ADAP database
Legend: Data Flow, Database Interactions
Parameter Regulated Iterative BLAST (PRI-BLAST)
A static set of BLASTP parameters (neighborhood score, E-value, fraction identical, BLOSUM matrix, etc.) is not adequate, since proteins evolve at different rates
PRI-BLAST iteratively performs BLASTP on the query sequence and categorizes the query as
a Celebrity query (C): many homologs
an Average query (A): a few or no homologs
an Obscured query (O): some homologs
PRI-BLAST Rule module
Decides which set of BLASTP parameters to use
Halts the PRI-BLAST
Statistical module
Density of homologs is computed through SQL statements
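The iterate-and-categorize idea can be sketched as follows. This is an illustration only, not the PRI-BLAST implementation: the thresholds (`many`, `some`, `enough`), the parameter sets, and the `fake_blastp` stand-in are all assumptions, and the category labels simply follow the wording on the slide.

```python
def categorize(n_homologs, many=50, some=5):
    """Label a query by homolog count; thresholds are assumed, labels follow the slide."""
    if n_homologs >= many:
        return "C"  # Celebrity query: many homologs
    if n_homologs >= some:
        return "O"  # Obscured query: some homologs (as worded on the slide)
    return "A"      # Average query: a few or no homologs (as worded on the slide)

def pri_blast(query, run_blastp, parameter_sets, enough=50):
    """Try parameter sets in order; halt once enough homologs are found (assumed rule)."""
    hits, used = [], None
    for params in parameter_sets:
        hits = run_blastp(query, params)
        used = params
        if len(hits) >= enough:
            break
    return categorize(len(hits)), used, hits

# Stand-in for a real BLASTP call: returns a fixed number of fake hits per E-value cutoff.
FAKE_COUNTS = {1e-10: 1, 1e-5: 10, 10: 100}

def fake_blastp(query, params):
    return ["hit"] * FAKE_COUNTS[params["evalue"]]

parameter_sets = [
    {"evalue": 1e-10, "matrix": "BLOSUM80"},
    {"evalue": 1e-5, "matrix": "BLOSUM62"},
    {"evalue": 10, "matrix": "BLOSUM45"},
]
category, used, hits = pri_blast("CK00043.5prime", fake_blastp, parameter_sets)
print(category, used["evalue"], len(hits))
```

In the pipeline described here, the hit counts would come from the statistical module's SQL queries rather than from an in-memory list.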
Example BLASTP report
BLASTP 2.2.8 [Jan-05-2004]
Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer
. . . . . Nucleic Acids Res. 25:3389-3402.
Query= CK00043.5prime
(182 letters)
Database: All non-redundant GenBank CDS
translations+PDB+SwissProt+PIR+PRF excluding environmental samples
1,795,144 sequences; 592,604,613 total letters
Searching..................................................done
Score E
Sequences producing significant alignments: (bits) Value
gb|AAD46849.2| LD03471p [Drosophila melanogaster] 329 5e-90
ref|NP_651977.1| CG6773-PA [Drosophila melanogaster] >gi|7300991... 285 1e-76
ref|XP_312881.1| ENSANGP00000014751 [Anopheles gambiae] >gi|2129... 209 7e-54
gb|AAH54585.1| Unknown (protein for MGC:63980) [Danio rerio] 184 4e-46
.
.
.
>gb|AAD46849.2| LD03471p [Drosophila melanogaster] Length = 386
Score = 329 bits (1155), Expect = 5e-90
Identities = 181/182 (99%), Positives = 181/182 (99%)
Query: 1 VKRRKKTRLAFNQFIWRPDERISSKMVSLLQEIDTEHEDMVHHAALDFYGLLLATCSSDG 60
VKRRKKTRLAFNQFIWRPDERISSKMVSLLQEIDTEHEDMVHHAALDFYGLLLATCSSDG
Sbjct: 6 VKRRKKTRLAFNQFIWRPDERISSKMVSLLQEIDTEHEDMVHHAALDFYGLLLATCSSDG 65
Query: 61 SVRIFHSRKNNKALAELKGHQGPVWQVAWAHPKFGNILASCSYDRKVIVWKSTTPRDWTK 120
SVRIFHSRKNNKALAELKGHQGPVWQVAWAHPKFGNILASCSYDRKVIVWKSTTPRDWTK
Sbjct: 66 SVRIFHSRKNNKALAELKGHQGPVWQVAWAHPKFGNILASCSYDRKVIVWKSTTPRDWTK 125
Query: 121 LYEYSNHDSSVNSVDFAPSEYGLVLACASSDGSVSVLTCNTEYGVWDAKKIPNXHTIGVN 180
LYEYSNHDSSVNSVDFAPSEYGLVLACASSDGSVSVLTCNTEYGVWDAKKIPN HTIGVN
Sbjct: 126 LYEYSNHDSSVNSVDFAPSEYGLVLACASSDGSVSVLTCNTEYGVWDAKKIPNAHTIGVN 185
Query: 181 AI 182
AI
Sbjct: 186 AI 187
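A pipeline stage that consumes such a report needs the score, E-value, and identity of each hit. The sketch below pulls them out of the three header lines shown above with regular expressions; `parse_hit` is a hypothetical helper, not part of ADAP or BioPerl, and it does not handle every variant of BLAST text output (for example, E-values printed without a leading digit).

```python
import re

REPORT_LINES = """\
>gb|AAD46849.2| LD03471p [Drosophila melanogaster] Length = 386
Score = 329 bits (1155), Expect = 5e-90
Identities = 181/182 (99%), Positives = 181/182 (99%)
"""

def parse_hit(text):
    """Extract subject ID, bit score, E-value and fractional identity from one hit."""
    subject = re.search(r"^>(\S+)", text, re.MULTILINE)
    score = re.search(r"Score\s*=\s*([\d.]+)\s*bits.*?Expect\s*=\s*(\S+)", text)
    ident = re.search(r"Identities\s*=\s*(\d+)/(\d+)", text)
    matches, aligned = int(ident.group(1)), int(ident.group(2))
    return {
        "subject": subject.group(1),
        "bits": float(score.group(1)),
        "evalue": float(score.group(2)),
        "identity": matches / aligned,
    }

hit = parse_hit(REPORT_LINES)
print(hit["subject"], hit["bits"], hit["evalue"], round(hit["identity"], 4))
```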
Motif Search Based Protein Sequence Analysis (mPSA)
Motifs are conserved regions of protein sequences, and a fingerprint is a collection of motifs in some order
mPSA (Phases 2 & 3) for the ADAP contains 6 mPSA tools from EMBOSS
Phase 2: sequence-based mPSA
secondary structure: transmembrane regions (Tmap), signal sites (Sigcleave), and general secondary structure (Garnier)
super-secondary structure: DNA binding sites (Helixturnhelix)
Phase 3: database-search-based mPSA
protein motifs from PROSITE (Patmatmotifs) and protein fingerprints from PRINTS (Pscan)
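To make the notion of a motif search concrete, here is a minimal sketch that converts a PROSITE-style pattern into a regular expression and scans a sequence. It is not how Patmatmotifs is implemented; the helper names and the toy sequence are assumptions, and only the common pattern elements (`x`, `[..]`, `{..}`, `(n)`, `(n,m)`, `<`, `>`) are handled.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern such as 'N-{P}-[ST]-{P}' into a Python regex."""
    regex = pattern.rstrip(".").replace("-", "")
    regex = regex.replace("x", ".")                    # x = any residue
    regex = re.sub(r"\{([A-Z]+)\}", r"[^\1]", regex)   # {P} = any residue except P
    regex = re.sub(                                    # (n) or (n,m) = repetition count
        r"\((\d+)(?:,(\d+))?\)",
        lambda m: "{" + m.group(1) + ("," + m.group(2) if m.group(2) else "") + "}",
        regex,
    )
    return regex.replace("<", "^").replace(">", "$")  # sequence-terminus anchors

def find_motifs(sequence, pattern):
    """Return (1-based start, matched text) for non-overlapping matches."""
    compiled = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in compiled.finditer(sequence)]

print(prosite_to_regex("N-{P}-[ST]-{P}"))
print(find_motifs("MKNGSAANPTQ", "N-{P}-[ST]-{P}"))
```

The second asparagine in the toy sequence is followed by proline, so the exclusion `{P}` rejects it and only the first site is reported.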
Figure: Homologues for Various Ranges of Lengths of Hyp. Proteins (x-axis: Length of Hyp. Protein, in bins from 10-15 to 110-115; y-axis: Number of Homologues, 0 to 10000; series: Homologues (Total))
Shorter protein sequences have more homologs; these can be false positives
Homologues with E
Bioinformatics Research at UNB
Automated Protein Classification and Rule Maintenance
Automated Data Acquisition Pipeline
TraceScan
Multi-Agent System for Potato Genome Annotation
EST sequences
Homologs, Motifs, Fingerprints, Transmembrane, and Signal sites
Automated Protein Classification and Rule Maintenance
Use machine-learning techniques to derive classification rules
Apply the rules to classify uncharacterized sequences
Categorized sequences and their related data
Rule construction process
A decision tree consisting of rules
Uncharacterized sequences
Rule application process
Newly characterized sequences
Automated Protein Classification and Rule Maintenance
Source data collection
Automated rule generation
Machine-learning algorithms and their comparison
Automated rule maintenance
Automated Rule Generation
C4.5 and CITree algorithms produce decision trees
WEKA (Waikato Environment for Knowledge Analysis) will be used for analyzing the dataset. (http://www.cs.waikato.ac.nz/~ml/index.html)
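C4.5 chooses the attribute to split on by gain ratio, that is, information gain normalized by the entropy of the split itself. The sketch below computes it for categorical attributes; it is a plain-Python illustration rather than WEKA's code, and the attribute names (`tm`, `sig`) and class labels are invented for the example.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (bits) of a list of categorical values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())

def gain_ratio(rows, attribute, target):
    """Gain ratio of splitting rows (a list of dicts) on a categorical attribute."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    # Expected entropy of the class label after the split.
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    gain = entropy([row[target] for row in rows]) - remainder
    split_info = entropy([row[attribute] for row in rows])
    return gain / split_info if split_info > 0 else 0.0

# Hypothetical annotated sequences: transmembrane helix and signal site flags.
rows = [
    {"tm": "yes", "sig": "no", "cls": "transporter"},
    {"tm": "yes", "sig": "yes", "cls": "transporter"},
    {"tm": "no", "sig": "no", "cls": "enzyme"},
    {"tm": "no", "sig": "yes", "cls": "enzyme"},
]
print(gain_ratio(rows, "tm", "cls"), gain_ratio(rows, "sig", "cls"))
```

In this toy data the `tm` attribute separates the classes perfectly, so a tree builder would place it at the root and emit one rule per branch.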
Figure: Rule Generation process (flowchart). Flowchart elements: Start; Sequences and their related data; Rule Construction & Decision Tree Creation; Rule Sieving; Is the rule qualified? (Yes/No); Update Rule Database; End of Rules? (Yes/No); Apply rules to annotate target sequences; Update Target Sequence Database; End. Data stores: Rule Database, Target Sequence Database.
Comparison of Algorithms
Evaluation criteria for machine-learning algorithms: accuracy and AUC (Area Under the ROC (Receiver Operating Characteristic) Curve)
Performance analysis
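The two criteria can be computed as in the following sketch. The AUC is calculated through its rank interpretation (the probability that a randomly chosen positive example is scored above a randomly chosen negative one, ties counting one half); the labels and scores are made-up numbers, not results from this work.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]          # hypothetical classifier scores
y_pred = [int(s >= 0.5) for s in scores]
print(accuracy(y_true, y_pred), auc(y_true, scores))
```

Accuracy depends on the chosen cutoff (0.5 here), whereas AUC summarizes ranking quality over all cutoffs, which is why the two are often reported together.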
Tree Generated using Weka
Bioinformatics Research at UNB
Automated Protein Classification and Rule Maintenance
Automated Data Acquisition Pipeline
TraceScan
Multi-Agent System for Potato Genome Annotation
EST sequences
Homologs, Motifs, Fingerprints, Transmembrane, and Signal sites
Multi-agent Systems
A multi-agent system is one that consists of a number of agents, which interact with one another
In the most general case, agents will be acting on behalf of users with different goals and motivations
To successfully interact, they will require the ability to cooperate, coordinate, and negotiate with each other, much as people do
Multi-Agent System for Potato Genome Annotation
Figure: architecture diagram. Components: Information Agent; Pipeline Agent; Rule Construction Agent; Database Update Agent; Automated Data Acquisition Pipeline; Classification Module; Web; Local Database; Rule Database; Target Sequence Database. External data sources: NRDB, MONTH, PRINTS, PROSITE.
Mapping Transcription Factors from a Model to a Non-Model Organism
Transcription Factor
Group of proteins that initiate transcription
transcriptional activators
transcriptional repressors
Consists of DNA binding domains
Binds to the binding site regions (specific DNA sequences)
Controls the expression of the genes
Human genome: 2600 proteins contain DNA-binding domains
Transcription Factor Mapping
Model Organism
Investigated thoroughly by biologists
Nodes: Transcription factors
Non-Model Organism
Not much data available
Nodes: Predicted transcription factors
Figure: nodes A, B, C in the Source Genome are mapped to nodes A1, B1, C1 in the Target Genome
Transcription Factor Mapping
Methodology
BLASTP is used to map transcription factors from E. coli and Bacillus subtilis to the E. coli group and the Bacillus group
Parameter: E-value threshold, 1e-5 to 10
Not all transcription factors from one genome can be mapped to another genome
The number of confirmed mappings between any two genomes depends on the definition of confirmed mapping used
Compare the available transcription factors of the target genome to the predicted set of transcription factors
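The mapping and comparison steps can be sketched as below. This is an illustration under stated assumptions, not the authors' code: hits are taken to be (source TF, target protein, E-value) tuples already extracted from BLASTP output, a "confirmed mapping" is assumed to mean that the predicted target protein is a known transcription factor of the target genome, and all identifiers and numbers are invented.

```python
def map_transcription_factors(hits, evalue_threshold):
    """For each source TF keep the best target hit whose E-value passes the threshold."""
    best = {}
    for source_tf, target_protein, evalue in hits:
        if evalue > evalue_threshold:
            continue
        if source_tf not in best or evalue < best[source_tf][1]:
            best[source_tf] = (target_protein, evalue)
    return {tf: target for tf, (target, _) in best.items()}

def compare(predicted, known):
    """Compare predicted target TFs with the known TFs of the target genome."""
    predicted, known = set(predicted), set(known)
    confirmed = predicted & known
    precision = len(confirmed) / len(predicted) if predicted else 0.0
    recall = len(confirmed) / len(known) if known else 0.0
    return confirmed, precision, recall

hits = [  # hypothetical BLASTP hits
    ("lacI", "t1", 1e-40),
    ("lacI", "t2", 1e-3),
    ("crp", "t3", 0.05),
    ("fnr", "t4", 5.0),
]
mapping = map_transcription_factors(hits, evalue_threshold=0.1)
confirmed, precision, recall = compare(mapping.values(), known={"t1", "t4"})
print(mapping, confirmed, precision, recall)
```

Re-running the same comparison over a range of thresholds is one way to see how sensitive the number of confirmed mappings is to the chosen E-value cutoff.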
Summary of Mapping Results
Transcription factor mapping in bacterial genomes
The proposed method is able to map most of the transcription factors
Transcription factor sequence motifs are well preserved
0.1 and 0.01: best E-value thresholds
Correct choice of E-value threshold can be more important than selection of an evolutionarily closer model organism
Bioinformatics @ C-DAC
Dr. Rajendra Joshi
Group Coordinator: Bioinformatics
Scientific and Engineering Computing Group
Centre for Development of Advanced Computing, Pune - 411007
[email protected]
http://bioinfo-portal.cdac.in
Bioinformatics Resources & Applications Facility (BRAF)
Funded by the Department of Information Technology (DIT), Ministry of Communications and Information Technology
Grid-enabling of numerous bioinformatics codes like SW, BLAST, ClustalW, AMBER, CHARMM, etc.
As part of BRAF, the team interacted with scientists from various CSIR labs, IITs and industries
AMD processor 2.6 GHz (Total: 204 cores, 1060.8 GF)
4 nos. of Sun X4600 (8 socket dual core each) giving 64 cores
32 nos. of Sun X2200 (dual socket dual core each) giving 128 cores
Backup server: Sun X2200 (4 cores)
Storage server: two Sun X2200 (8 cores)
Infiniband switch (Mellanox DDR2, 48 port)
Storage: 20 Terabytes, RAID5
Tape library with autoloader
Benchmarking completed for AMBER, CHARMM, MEME, SW, Fasta, ClustalW, BLAST
BIOGENE: 1 TF machine
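The core count and the quoted peak figure can be cross-checked with simple arithmetic. The slide does not say how the 1060.8 GF figure was derived; the sketch below assumes 2 floating-point operations per core per clock cycle, which happens to reproduce it exactly.

```python
# (number of machines, cores per machine) taken from the hardware list above
nodes = {
    "Sun X4600 compute": (4, 8 * 2),   # 8 sockets x 2 cores
    "Sun X2200 compute": (32, 2 * 2),  # 2 sockets x 2 cores
    "Sun X2200 backup": (1, 4),
    "Sun X2200 storage": (2, 4),
}
total_cores = sum(count * cores for count, cores in nodes.values())

CLOCK_GHZ = 2.6
FLOPS_PER_CYCLE = 2  # assumption, not stated on the slide
peak_gflops = total_cores * CLOCK_GHZ * FLOPS_PER_CYCLE

print(total_cores, round(peak_gflops, 1))
```

Note that the total counts the backup and storage servers as well as the 192 compute cores.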
Using BRAF Facility
Gipsy portal: Use a browser and open the URL
http://gipsy.bioinfo-portal.cdac.in
Command line login
ssh -p 30005 gateway.cdac.in
Help on command line usage is available in the README file in the user's home directory.
Helpline: [email protected]
Bioinformatics Application Software for High-End Clusters and Grid
iMolDock : An interface for Molecular Docking on HPC
GENOPIPE : Automated Genome Annotation Pipeline on HPC
Anvaya : A Workflow Environment for High Throughput Comparative Genomics
Taxo Grid : Phylogeny on Grid
GenomeGrid : Bioinformatics Problem Solving Environment on Grid
GIPSY : Bioinformatics Problem Solving Environment on HPC
High-throughput Workflows for Genome Analysis
Collaboration: Biotechnology and Biological Sciences Research Council (UK)
A Systems Biology based approach for annotation of Salmonella and Mycobacterium genomes
Establishment of a common bioinformatics pipeline for analyses of bacterial genomes, with emphasis on identification of virulence and pathogenic factors
Collaboration: Institute of Animal Health (UK)
Genome Annotation: Salmonella
Causative agent of Typhoid
Transmitted via food contamination
Economic losses as it affects livestock
Annotation of 5 Salmonella genomes with a wide host-range
Figure: Food-borne disease cycle: Salmonella
Genome Annotation via GENOPIPE
Single nucleotide polymorphism
Collaboration: University of Surrey (UK)
Expert curation of the Mycobacterium leprae genome: causative agent of Leprosy
Development of a tool to calculate molecular weight of metabolites
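As an illustration of what a molecular weight calculation involves (not the tool developed in this collaboration), the sketch below sums average atomic masses over a flat chemical formula. The function name, the small element table, and the restriction to formulas without parentheses or charges are assumptions made for brevity.

```python
import re

# Average atomic masses in g/mol (rounded); only a few common elements are included.
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999, "P": 30.974, "S": 32.06}

def molecular_weight(formula):
    """Molecular weight of a flat formula such as 'C6H12O6' (no parentheses or charges)."""
    tokens = re.findall(r"([A-Z][a-z]?)(\d*)", formula)
    # Reject anything the simple tokenizer could not fully account for.
    if "".join(symbol + count for symbol, count in tokens) != formula:
        raise ValueError(f"cannot parse formula: {formula!r}")
    return sum(ATOMIC_MASS[symbol] * (int(count) if count else 1) for symbol, count in tokens)

print(round(molecular_weight("C6H12O6"), 3))  # glucose
```

An element missing from the table raises `KeyError`, which makes gaps in the table visible instead of silently under-counting.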
Furin Complex
Collaboration: Oregon Health & Science University (USA)
Collaborative project initiated with OHSU in December 2009
Provide computational support to the experimental studies at OHSU, through MD simulations on the BIOGENE cluster
Propeptide domain of serine protease Furin acts as a pH sensor
Phenomenon has been elucidated in-silico through MD simulations
Ten sets of simulations performed using NAMD
Collaborations: caBIG (NIH)
The National Cancer Institute (NCI) is involved in deployment of an integrated biomedical informatics infrastructure, the cancer Biomedical Informatics Grid (caBIG), a network that will freely connect the entire cancer community
caBIG would set up a node at C-DAC; GARUDA grid and BRAF resources may be used
Collaborations: IIT Madras
CGMD studies on GPCR
OA1 (GPR143), a GPCR
Belongs to Class I GPCR, Rhodopsin family
7TM receptors or heptahelical receptors
An integral membrane glycoprotein of 404 aa
Protein product of the ocular albinism type 1 gene
Ocular albinism: an X-linked inherited disorder in which the eye lacks melanin pigment
Homology based approach along with CGMD simulation has been
Collaboration: Jubilant Biosys
Simulate fragment binding sites by Molecular Dynamics simulation methods
To identify the most probable site of interaction of chemical fragments in the protein
8 large simulations of 10 ns each were carried out
Results handed over to Jubilant
Collaboration: Nicholas Piramal
Contract research project
To understand protein-ligand interactions using Molecular Dynamics simulations
Involves carrying out molecular dynamics simulations on very large biomolecular systems
Benefits in designing better molecules for known drug targets
Four 20 ns molecular dynamics simulations have been carried out
Conclusion
Biology is transforming from observational and physical experiments to a computational science
Bioinformatics - Exciting research area
Challenges: Biology and Computer Science have different ways of working and need close collaboration
Opportunities: new crops, personalized medicine, early diagnosis
Research Problems in Bioinformatics
Find genomes of all organisms
Identify and annotate all genes
Compute 3D structure from sequence for all proteins
Compare DNA / protein sequences for similarity
Compare families of DNA / protein sequences
Reason to be optimistic: Biology is finite
~30,000 human genes; ~1000 protein superfamilies
but computer speeds keep increasing
Business Opportunities
Clinical research
Gene therapy
Molecular science
Pharmaceutical companies - automated technologies to manufacture effective therapies and drugs, due to increasing concerns about drug safety and the stringent regulations that govern clinical trials for drug discovery
Bioinformatics platform market growing at a very fast rate
Global bioinformatics market: ~$8.3 billion by 2014
Knowledge management - 2009 - $1.3 billion
Bioinformatics platforms market - 2014 - ~$3.9 billion
Business Opportunities
Global bioinformatics market segments
- Bioinformatics platforms
  - Sequence alignment platforms
  - Sequence manipulation platforms
  - Sequence analysis platforms
  - Structural analysis platforms
- Content/Knowledge management tools
  - Specialized knowledge management tools
  - Generalized knowledge management tools
- Services
  - Data Analysis
  - Sequencing Services
  - Database & Management services
- Applications
Thank You!