Computational Biology: What For?
Protein family Gene sequence Structure
Non-coding regions
Gene networks
Gene order
Translational Bioinformatics
PLOS Computational Biology: A Peer-Reviewed, Open Access Journal
Translational Bioinformatics: a collection of Education Articles, 2012
http://www.ploscollections.org/article/browseIssue.action?issue=info:doi/10.1371/issue.pcol.v03.i11
A field that illustrates the impact of Computational Biology:
- integrating large volumes of molecular and clinical data to better understand the molecular basis of disease, which in turn will change clinical practice and, ultimately, improve human health.
Informatics : the study of how to represent, store, search, retrieve and analyze information
Medical Informatics : concerns medical information
BioInformatics : concerns basic biological information
Clinical Informatics : focuses on the clinical delivery part of medical informatics
Biomedical Informatics : merges bioinformatics and medical informatics
Imaging Informatics : focuses on images
Translational:
- How can we effectively improve the diagnosis, prognosis and treatment of patients?
● Development of medical devices
● Molecular diagnostics
● Small-molecule therapeutics
● Vaccines
● Etc.
● Mastering the explosion of knowledge in molecular biology, genetics and genomics.
Has knowledge of the double-helix structure of DNA been transferred into practical improvements, in the form of technological innovation, that have an impact on human health?
What is certain is that we can now rapidly measure:
- DNA sequences (at the scale of a whole genome!)
- RNA sequences and their expression
- protein sequences, their structure, expression and modifications
- the structure, presence and quantity of small-molecule metabolites
Translational Bioinformatics : the development and application of informatics methods that connect molecular entities to clinical entities :
Molecular Informatics : concerns molecular/cellular information
Clinical Informatics : focuses on the clinical delivery part of medical informatics
PubMed: http://www.ncbi.nlm.nih.gov/pubmed
UMLS (Unified Medical Language System): http://www.nlm.nih.gov/research/umls/
Genes; DNA; messenger RNAs; MicroRNAs; Proteins; Signaling molecules and cascades; Metabolites; Cellular communication processes; Cellular organization
GenBank: http://www.ncbi.nlm.nih.gov/genbank/
Gene Expression Omnibus: http://www.ncbi.nlm.nih.gov/geo
Protein Data Bank: http://www.wwpdb.org/
KEGG: http://www.genome.jp/kegg/
MetaCyc: http://metacyc.org
Reactome: http://www.reactome.org
Impact: personalized medicine?
Genome association, mining of genetic markers
Analysis of personal genomic data, mining of data from electronic records, etc.
4 main chapters for this session:
- Quantitative Imaging
- Machine Learning / Data Mining
- Graph and network representations
- Knowledge representation: data, databases, ontologies
Technologies / development environments:
- Java: ImageJ, Weka
- Python: Biopython, NumPy, SciPy, Matplotlib, Enthought Python Distribution and Canopy
- scripts: Perl, Gawk
- Inkscape, ImageMagick / Sphinx / XML, SBML, BioPAX, GPML, JSON, SQL, NoSQL, Hadoop
Software:
- Clustal → T-Coffee, PathwayAPI, BioGRID, PatternHunter …
→ Choices to be made, a co-design of the course ...
Practical sessions on:
- Knowledge integration (ontologies, data)
- Construction of gene interaction networks: R, statistics
- Visualization of the interactome: PPI, diseases, Cytoscape software, topological graph analysis (network analyzer, Steiner trees, MST)
- Knowledge assembly and interpretation: GSEA, GO, KEGG, BioLattice, pathology- and ontology-based analysis
- Prioritization of disease-causing genes: MLP, Weka
- Image analysis: ImageJ, CellProfiler, DcellIQ, NeuronStudio, ITK
- -omics visualization: R, Python
- Graph algorithms, complexity
A constant biological perspective:
- Gene feature recognition:
→ TIS (Translation Initiation Site)
→ TSS (Transcription Start Site)
→ Feature generation → Feature selection → Feature integration → Gene finding
- Gene expression analysis:
→ Affymetrix GeneChip data
→ Gene expression profile classification
→ Gene expression profile clustering
→ Gene regulatory circuit reconstruction: differentially expressed genes, gene interaction prediction
- Sequence alignment / comparison / homology:
→ Multiple sequence alignment (dynamic programming)
→ Function assignment to protein sequence (guilt by association)
→ Discovery of the active site or domain of a function
→ PPI / proteomic profile analysis
→ Key mutation site identification
- Phylogenetic tree:
→ Construction
→ Comparison
- Biological networks / graph of interactions:
→ Natural pathways
→ PPI networks
→ Protein complex prediction
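The dynamic-programming idea behind sequence alignment, mentioned in the list above, can be sketched for the simplest case: the Needleman-Wunsch score of a global alignment of two sequences. This is an illustrative sketch only; the function name and the scoring values (match, mismatch, gap) are assumptions, not values from the course, and multiple alignment tools such as Clustal or T-Coffee build on far more elaborate schemes.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch score for a global alignment of strings a and b.

    The scoring values are illustrative assumptions.
    """
    # prev[j] = best score for aligning the prefix of a processed so far with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap inserted in b
            left = curr[j - 1] + gap  # gap inserted in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]
```

Only two rows of the table are kept in memory, which is enough when the score alone is needed; recovering the alignment itself would require the full table or a traceback structure.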
TIS : Translation Initiation Site Recognition/Prediction
299 HSU27655.1 CAT U27655 Homo sapiens
CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG 80
CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA 160
GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA 240
CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT
............................................................ 80
................................iEEEEEEEEEEEEEEEEEEEEEEEEEEE 160
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE 240
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
A cDNA sample
Why is the second ATG the TIS?
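Every ATG in a cDNA is only a candidate TIS, and a recognizer has to decide among them. A minimal sketch of the first step, assuming we simply list each ATG with its upstream and downstream context so that features can later be generated from those windows. The function name and the `flank` window size are hypothetical, not part of any tool named in this course.

```python
def candidate_tis(cdna, flank=3):
    """Return (position, upstream, downstream) for every ATG in the cDNA.

    position is the 0-based index of the A of the ATG; flank is an
    illustrative window size, not a value taken from the course.
    """
    seq = cdna.upper()
    hits = []
    start = seq.find("ATG")
    while start != -1:
        upstream = seq[max(0, start - flank):start]
        downstream = seq[start + 3:start + 3 + flank]
        hits.append((start, upstream, downstream))
        # search from the next base so overlapping occurrences are not missed
        start = seq.find("ATG", start + 1)
    return hits
```

Choosing the true TIS among these candidates is the learning problem: the context windows are turned into features, which are then selected and integrated by a classifier.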
Gene prediction
T-Cell Epitopes Prediction By Artificial Neural Network
• Honeyman et al., Nature Biotechnology 16:966-969, 1998
PAS Prediction
BEGIN → Incoming sequences → Feature generation → Feature selection → Feature integration → END
SVM in Weka
Histone Promoter Recognition Programs
9 motifs discovered by the MEME algorithm in the histone promoter 5' region [-250,-1] among 127 histone promoters
Diagnosis of Childhood Acute Lymphoblastic Leukemia (ALL) and Optimization of Risk-Benefit Ratio of Therapy
Immunophenotyping
Subtypes: TEL-AML1, BCR-ABL, Hyperdiploid >50, E2A-PBX1, MLL, T-ALL, Novel
Affymetrix GeneChip Micro Array Analysis
Proteomics Data : Guilt-by-Association
Compare T with seqs of known function in a db
Assign to T same function as homologs
Confirm with suitable wet experiments
Discard this function as a candidate
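The guilt-by-association steps above can be sketched as follows. This is an assumption-heavy illustration: a k-mer Jaccard similarity stands in for a real homology search (such as BLAST), the threshold is arbitrary, and all function names are hypothetical. The wet-lab confirmation step has no code equivalent.

```python
def kmers(seq, k=3):
    """Set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def similarity(a, b, k=3):
    """Jaccard similarity of k-mer sets; a crude stand-in for homology."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)

def assign_function(target, annotated, threshold=0.5, k=3):
    """Propose for target the function of its most similar annotated sequence.

    annotated maps sequence -> known function. Returns None when no
    sequence is similar enough (no candidate function to test).
    """
    best = max(annotated, key=lambda s: similarity(target, s, k), default=None)
    if best is None or similarity(target, best, k) < threshold:
        return None
    return annotated[best]
```

A returned function is only a candidate: it still has to be confirmed experimentally, and discarded if the experiment fails.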
Topology of Protein Interaction Networks: Hubs, Cores, Bipartites
Maslov & Sneppen, Science, v296, 2002
318 edges, 329 nodes
In the nucleus of S. cerevisiae
Proteomics Data: Subgraphs in Protein Interaction
Yeast SH3 domain-domain interaction network: 394 edges, 206 nodes
Tong et al., Science, v295, 2002
8 proteins containing SH3
5 binding at least 6 of them
Configuration a is less likely than configuration b in protein interaction networks → Graph/Network Mining
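Topological notions such as hubs reduce to simple computations on an edge list. A minimal sketch, assuming an undirected interaction network given as pairs of protein names; the function names and the degree cutoff are illustrative assumptions, not definitions from the cited papers.

```python
from collections import Counter

def degrees(edges):
    """Number of interactions per protein in an undirected edge list."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg

def hubs(edges, min_degree):
    """Proteins with at least min_degree interactions, in sorted order."""
    return sorted(n for n, d in degrees(edges).items() if d >= min_degree)
```

For example, with edges A-B, A-C, A-D and B-C, protein A has degree 3 and is the only hub at a cutoff of 3.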
Decision Tree Based In-Silico Cancer Diagnosis
[Figure: decision tree with a root node, internal nodes testing features x1 to x4 against thresholds (> a1, > a2), and leaf nodes labeled with class A or B]
Prognosis based on Gene Expression Profiling
• Yeoh et al., Cancer Cell 1:133-143, 2002; Differentiating MLL subtype from other subtypes of childhood leukemia
● Training data (14 MLL vs 201 others), Test data (6 MLL vs 106 others), Number of features: 12558
Given a test sample, at most 3 of the 4 genes’ expression values are needed to make a decision!
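Why fewer than all four genes are consulted follows from how a decision tree is evaluated: only the features on one root-to-leaf path are read. The sketch below uses an invented tree; its shape, thresholds and gene names g1 to g4 are hypothetical and are not the tree learned from the Yeoh et al. data.

```python
# Hypothetical tree: an internal node is (feature, threshold, low_branch, high_branch),
# a leaf is a class label. Not the tree from the study.
TREE = ("g1", 0.5,
        ("g2", 0.5, "other", ("g3", 0.5, "other", "MLL")),
        ("g4", 0.5, "MLL", "other"))

def classify(tree, sample):
    """Return (label, features consulted) for one sample (dict feature -> value)."""
    used = []
    node = tree
    while not isinstance(node, str):
        feature, threshold, low, high = node
        used.append(feature)
        node = high if sample[feature] > threshold else low
    return node, used
```

In this invented tree the longest path reads three genes (g1, g2, g3) and the shortest reads two (g1, g4), so no sample ever needs all four expression values.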
Discovery of Diagnostic Biomarkers for Ovarian Cancer
● Motivation: cure rate ~ 95% if correct diagnosis at early stage
● Proteomic profiling data obtained from patients’ serum samples
● The first data set, by Petricoin et al., was published in The Lancet, 2002
● Data set of June 2002.
● 253 samples: 91 controls and 162 patients suffering from the disease; 15154 features (proteins, peptides; precisely, mass/charge identities)
● SVM: 0 errors; Naïve Bayes: 19 errors; k-NN: 15 errors.
Mining Errors from Bio Databases
RECORD
SINGLE SOURCE DATABASE
Invalid values
Ambiguity
Incompatible schema
ATTRIBUTE
Uninformative sequences
Undersized sequences
Annotation error
Dubious sequences
Sequence redundancy
Data provenance flaws
Cross-annotation error
Sequence structure violation
Vector contaminated sequence
Erroneous data transformation
MULTIPLE SOURCE DATABASE
• Among the 5,146,255 protein records queried using Entrez in the major protein or translated nucleotide databases, 3,327 protein sequences were shorter than four residues (as of Sep 2004).
• By Nov 2004, the total number of undersized protein sequences had increased to 3,350.
• Among the 43,026,887 nucleotide records queried using Entrez in the major nucleotide databases, 1,448 records contained sequences shorter than six bases (as of Sep 2004).
• By Nov 2004, the total number of undersized nucleotide sequences had increased to 1,711.
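Detecting undersized sequences is a simple length check. A minimal sketch, assuming records are available as (identifier, sequence) pairs; the thresholds are the ones quoted above (four residues, six bases), while the function and variable names are hypothetical.

```python
# Thresholds quoted in the text: shorter than 4 residues / shorter than 6 bases.
MIN_LENGTH = {"protein": 4, "nucleotide": 6}

def undersized(records, kind):
    """Identifiers of records whose sequence is below the threshold for kind.

    records is an iterable of (identifier, sequence) pairs; kind is
    "protein" or "nucleotide".
    """
    limit = MIN_LENGTH[kind]
    return [rec_id for rec_id, seq in records if len(seq) < limit]
```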
Undersized protein sequences in major databases
[Bar chart: number of records (axis 0 to 1200) by sequence length (1 to 3) for DDBJ, EMBL, GenBank, PDB, SwissProt and PIR]
Undersized nucleotide sequences in major databases
[Bar chart: number of records (axis 0 to 250) by sequence length (1 to 5) for DDBJ, EMBL, GenBank and PDB]
Example Meaningless Seqs
Duplicates detected by association rules
[Bar chart: FP% and FN% (axis 0 to 60) for association rules Rule 1 to Rule 7; legend: FP%, FN% x 1000]
Rule 1: S(Seq)=1 ^ N(Seq Length)=1 ^ M(PDB)=0 (99.7%)
Rule 2: S(Seq)=1 ^ M(PDB)=0 ^ M(Species)=1 (97.1%)
Rule 3: S(Seq)=1 ^ N(Seq Length)=1 ^ M(Species)=1 ^ M(PDB)=0 (96.8%)
Rule 4: S(Seq)=1 ^ M(PDB)=0 ^ M(DB)=0 (93.1%)
Rule 5: S(Seq)=1 ^ M(Seq Length)=1 ^ M(PDB)=0 ^ M(DB)=0 (92.8%)
Rule 6: S(Seq)=1 ^ M(Species)=1 ^ M(PDB)=0 ^ M(DB)=0 (90.4%)
Rule 7: S(Seq)=1 ^ N(Seq Length)=1 ^ M(Species)=1 ^ M(PDB)=0 ^ M(DB)=0 (90.1%)
Rule 1. Identical sequences with the same sequence length and not originated from PDB are 99.7% likely to be duplicates.
Rule 2. Identical sequences with the same sequence length and of the same species are 97.1% likely to be duplicates.
Rule 3. Identical sequences with the same sequence length, of the same species and not originated from PDB are 96.8% likely to be duplicates.
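Rule 1 can be read as a predicate on a pair of records. The sketch below is one possible reading, under stated assumptions: S(Seq)=1 is taken to mean identical sequences, N(Seq Length)=1 equal lengths, and M(PDB)=0 that neither record comes from PDB. The record layout (a dict with `seq` and `db` keys) and the function name are hypothetical.

```python
def rule1_duplicate(rec_a, rec_b):
    """True when a pair of records satisfies this reading of Rule 1.

    Each record is a dict with keys 'seq' (sequence string) and
    'db' (source database name); both keys are assumptions.
    """
    same_seq = rec_a["seq"] == rec_b["seq"]
    same_len = len(rec_a["seq"]) == len(rec_b["seq"])
    not_pdb = rec_a["db"] != "PDB" and rec_b["db"] != "PDB"
    return same_seq and same_len and not_pdb
```

A pair that satisfies the rule is a likely duplicate, not a certain one: the rule's reported confidence is 99.7%.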
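[Figure: population tree drawn from a common root, with time since split as the axis. Populations: Australian, Papuan, Polynesian, Indonesian, Cherokee, Navajo, Japanese, Tibetan, English, Italian, Ethiopian, Mbuti Pygmy. Regions: Africa, Europe, Asia, America, Oceania, Australasia.]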
● Estimate the order in which “populations” evolved
● Based on assimilated frequencies of many different genes
● But …
– Is human evolution a succession of population fissions?
– Is there such a thing as a proto-Anglo-Italian population which split, never to meet again, and became the inhabitants of England and Italy?
Phylogenetic tree construction
Predicting interactions using phylogenetic profile
Pellegrini et al. PNAS 96, 4285-4288 (1999)
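The idea of a phylogenetic profile can be sketched in a few lines: each protein is described by a vector recording in which genomes a homolog is present (1) or absent (0), and proteins with matching profiles are predicted to be functionally linked. This is a simplified illustration that only pairs up identical profiles; the function name and data layout are assumptions.

```python
def linked_pairs(profiles):
    """Pairs of proteins with identical presence/absence profiles.

    profiles maps protein name -> tuple of 0/1 values, one per genome,
    with genomes listed in the same order for every protein.
    """
    names = sorted(profiles)
    return [(p, q)
            for i, p in enumerate(names)
            for q in names[i + 1:]
            if profiles[p] == profiles[q]]
```

For example, if P1 and P2 are both present in genomes 1, 3 and 4 and absent from genome 2, they are reported as a predicted pair, while a protein with a different pattern is not.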
Comparative genomics
GWAS Genome Wide Association Studies