BackgroundGAME
Secondary Structure PredictionSummary
Ph.D. in Electronic and Computer Engineering
Dept. of Electrical and Electronic Engineering
University of Cagliari
Protein Secondary Structure Prediction: Novel Methods and Software Architectures
Filippo Giuseppe Ledda
Advisor: Prof. Giuliano Armano
Cagliari, March 2, 2011
Filippo Giuseppe Ledda Ph.D. defense talk, March 2, 2011
Contribution
Bioinformatics
Protein secondary structure prediction
- SLB: a novel input encoding technique
- SSP2: a novel multiple-expert architecture
- Heterogeneous Output Combination
BCalign: a new pairwise alignment algorithm based on beta-contact prediction
Machine Learning
GAME: a general architecture and framework for real-world prediction problems
Outline
1 Background
   Secondary Structure Prediction
   Facing Real-World Problems
2 GAME
   Introduction
   The Framework
3 Secondary Structure Prediction
   Exploiting Inter-Sequence Correlation
   Exploiting Intra-Structure Correlation
Protein Structure Prediction
Biological background
- Characteristics of living things (the phenotype) are determined by their genetic code (the genotype)
- Proteins, the first realization of the phenotype, carry out most cell functions
- Central dogma: DNA → RNA → Protein Sequence
- Protein Sequence → Structure → Function
Prediction
Predicting protein structure from sequence is one of the great open problems of biology
Protein Structure Prediction
Why predict structure?
- Structure determines function
- No general theory to obtain structure from sequence
- Sequences are much easier to collect than structures
- Sequences can be engineered
Possible applications
- Disease analysis
- Ad-hoc drug synthesis
- “Engineering” life!
Protein Structure Prediction
- Reliable tertiary structure prediction needs template structures (close homologues)
- 3D ab initio prediction is very hard → use secondary structure prediction
Protein Secondary Structure
Secondary structure consists of local conformations of residues, stabilized by hydrogen bonds
Secondary Structure Prediction
Example
ERLCLKYLVYKDLRTRGYIVKTGLKYGADFRLYERGANI
↓
CCHHHHHHHHHHHHHCCCEEEECHHHCCCEEEECCCCCC
Issues
- Structured data
- Very large input space
- Large and biased training set
- Low input/output correlation
- Labelling noise
Evaluation Measures
Q3. A measure of accuracy, defined as follows:
\[ Q_3 = \frac{100}{3} \sum_{i \in \{h,e,c\}} \frac{tp_i}{N_i} \qquad (1) \]
SOV. The Segment OVerlap score (Zemla, 1999) accounts for the predictive ability of a system by considering the overlap between predicted and actual structural segments.
Ch, Ce, Cc. The Matthews Correlation Coefficient (Matthews, 1975) relies on the concept of confusion matrix. Definition:
\[ C_i = \frac{tp_i \, tn_i - fp_i \, fn_i}{\sqrt{(tp_i + fp_i)(tp_i + fn_i)(tn_i + fp_i)(tn_i + fn_i)}} \qquad (2) \]
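As a rough illustration of Equations (1) and (2), the sketch below computes the class-averaged Q3 and the per-class Matthews coefficient from two label strings. It assumes that N_i is the number of residues observed in class i and that tp, tn, fp and fn are counted one-vs-rest for each class. The function names and the toy label strings are made up for the example.

```python
import math

CLASSES = "HEC"  # helix, strand, coil


def confusion(observed, predicted, cls):
    """One-vs-rest counts (tp, tn, fp, fn) for a single class."""
    tp = tn = fp = fn = 0
    for obs, pred in zip(observed, predicted):
        if obs == cls and pred == cls:
            tp += 1
        elif obs != cls and pred != cls:
            tn += 1
        elif pred == cls:
            fp += 1  # predicted as cls, observed as something else
        else:
            fn += 1  # observed as cls, predicted as something else
    return tp, tn, fp, fn


def q3(observed, predicted):
    """Equation (1): (100 / 3) * sum over classes of tp_i / N_i."""
    total = 0.0
    for cls in CLASSES:
        tp, _, _, fn = confusion(observed, predicted, cls)
        n_i = tp + fn  # residues observed in class i (assumed meaning of N_i)
        if n_i:
            total += tp / n_i
    return 100.0 * total / 3.0


def mcc(observed, predicted, cls):
    """Equation (2); returns 0.0 when the denominator vanishes."""
    tp, tn, fp, fn = confusion(observed, predicted, cls)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


observed = "HHEECC"
predicted = "HHEECH"  # hypothetical prediction with one wrong residue
print(round(q3(observed, predicted), 2))
print([round(mcc(observed, predicted, c), 2) for c in CLASSES])
```

Returning 0.0 for an empty denominator is a convention chosen here for convenience; other conventions are possible.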
The Informative Sources
The success of a predictor is determined by its ability to exploit the main correlations of the problem
Informative sources
1 Sequence-to-structure (Anfinsen’s dogma)
2 Intra-sequence (very low)
3 Inter-sequence (homology)
4 Intra-structure (structure interactions)
5 Inter-structure (homology)
Classical Approach
Machine learning techniques with specific architectures make it possible to exploit different information sources.
Example: the PHD architecture (Rost, 1993)
Real-World Problems
Protein structure prediction is a real-world problem (RWP)
Common issues of RWPs
- Deal with structured data (e.g. sequences, texts, images)
- Importance of pre-processing/feature extraction
- Noisy and biased example data
- Uncertain labelling
Useful approaches
- Human expertise in the field
- Custom architectures
- Wise balancing of datasets
- Try many different solutions (test & select)
Tools for Machine Learning
- Established tools (e.g. WEKA, RapidMiner) make it easy to set up and compare many ML techniques
- Unfortunately, they help with only part of the issues of RWPs
Limits of the established tools
- Work with atomic data instances
  - Need for extra software for feature extraction/decomposition
  - Memory requirements explode due to decomposed structured data
  - No performance measure for the structured data
- No support for release in the real environment
- Static approach to datasets
GAME
GAME [1, 4] is a framework and tool for the definition and assessment of predictors for real-world problems.
Features
- Plug-in architecture
- Support for structured data
- Support for comparative experiments
- Support for expert combination
- Just-in-time dataset iteration
- Support for final release
- Portability
GAME Approach to Prediction
- Data are managed in their natural format
- Suitable encoding modules interface with underlying (generic) prediction algorithms
[Figure: GAME approach to prediction. Labels: Source, Formatting/extraction, Instance data (Input Data, Output Data), Input Encoding, Output Encoding/Labelling, Learning Algorithm, Machine learning system, Prediction/Classification]
Generic Architectures with Multiple Experts
Three kinds of experts make it possible to build tree architectures
Experts
1 Ground
2 Refiner (to build pipelines)
3 Combiner (for parallel combination)
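The three roles can be pictured with a small sketch. This is not GAME's API: GAME is a Java framework and its class names are not given in the talk. The Python classes below are hypothetical stand-ins, and averaging is only one possible combination rule, chosen here for simplicity.

```python
class Ground:
    """Leaf expert: wraps a function that maps an input to a prediction."""

    def __init__(self, predict_fn):
        self.predict_fn = predict_fn

    def predict(self, x):
        return self.predict_fn(x)


class Refiner:
    """Pipeline stage: post-processes the prediction of an inner expert."""

    def __init__(self, inner, refine_fn):
        self.inner = inner
        self.refine_fn = refine_fn

    def predict(self, x):
        return self.refine_fn(self.inner.predict(x))


class Combiner:
    """Parallel combination: averages the predictions of several experts."""

    def __init__(self, experts):
        self.experts = list(experts)

    def predict(self, x):
        outputs = [expert.predict(x) for expert in self.experts]
        return [sum(column) / len(column) for column in zip(*outputs)]


def normalise(scores):
    """Rescale scores so that they sum to one."""
    total = sum(scores)
    return [value / total for value in scores]


# A two-branch tree: a ground expert followed by a refiner, combined in
# parallel with a second ground expert. The constant outputs are toy values.
tree = Combiner([
    Refiner(Ground(lambda x: [2.0, 1.0, 1.0]), normalise),
    Ground(lambda x: [0.0, 0.5, 0.5]),
])
print(tree.predict("any input"))  # [0.25, 0.375, 0.375]
```

Because every expert exposes the same `predict` method, refiners and combiners can be nested to any depth, which is what yields a tree.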
Configuring Experts: Graphical Interface
The Framework
GAME is written in Java 6.0 following a plug-in architecture (currently at version 2.0: 354 classes and 2173 methods)
Usage
1 Define the problem (implement data description modules)
2 Define the encoding modules
3 Graphically configure and run experiments
Applied to
- Secondary structure prediction
- Optical character recognition
- Antibody packing angle prediction
- Protein beta contact prediction
Configuring Experiments: Architecture
[Figure: architecture of a GAME experiment. Labels: Module manager (Algorithm, Encoder, Decoder, Expert, Dataset Iterator, Controller and Data modules), Setting Manager (General Settings), Expert, Experiment, Predictor, Decoder, Dataset Iterator, Training Dataset, Test Dataset, Instance Data, GUI/XML/Serialization, Net/File System]
Configuring Experiments: Graphical Interface
Defining Modules – Code
Defining modules that integrate with the graphical interface is very simple.
Example: [Figure: example module code]
Defining Modules – Interface
With the given example code, configuration and documentation windows are generated automatically
GAME for Secondary Structure Prediction
GAME has been used to perform all the secondary structure prediction experiments
Currently
- ~15 input encoding methods
- ~5 output encoding methods
- Custom post-processing
- Standard measures: Q3, SOV, Matthews correlation coefficient
- Two web servers released
Inter-Sequence Correlation
A strong correlation can be observed between sequences of evolutionarily related proteins (homologues)
Background
- Sequences diverge during evolution from a common ancestor
- Structure is much more conserved than sequence
- Hence, similar sequences have similar structure
How to use this information for prediction?
Including information about similar sequences in the input representation. About +10% accuracy.
Input Encoding for Secondary Structure Prediction
The prediction algorithm operates on slices extracted with a sliding window
The protein’s sequence is represented as a profile before slices are extracted
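A minimal sketch of slice extraction, assuming the profile is a list with one feature vector per residue. The function name, the odd window width and the zero-padding at the sequence ends are assumptions made for the example; the talk does not say how the ends are handled.

```python
def sliding_windows(profile, width):
    """Return one flattened slice per residue.

    Each slice concatenates `width` profile rows centred on the residue;
    positions that fall outside the sequence are filled with zeros.
    """
    if width % 2 == 0:
        raise ValueError("width must be odd so that the window has a centre")
    half = width // 2
    padding = [0.0] * len(profile[0])
    slices = []
    for centre in range(len(profile)):
        window = []
        for pos in range(centre - half, centre + half + 1):
            row = profile[pos] if 0 <= pos < len(profile) else padding
            window.extend(row)
        slices.append(window)
    return slices


# Toy profile: three residues, two features each.
toy_profile = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for window in sliding_windows(toy_profile, width=3):
    print(window)
```

Each slice is then the input vector from which the label of its central residue is predicted.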
Input Encoding
Naive Approach
- Statically associate a vector to each amino acid (position-independent encoding)
- Relies only on the low local sequence-structure correlation
Modern Approach
- Exploit inter-sequence correlation with position-specific representations
- Use multiple alignments
- Three phases:
  1 Similarity search (find homologues)
  2 Multiple alignment (align them)
  3 Encoding (generate profile)
Multiple Sequence Alignment
Proteins are evolutionarily related to other proteins. Multiple alignments highlight the changes that occurred during the evolution of a protein family
Substitutions observed in a multiple alignment are expected to barely affect structure
Frequency-Based Encoding (FR)
Frequency-based encoding
- Introduced for prediction by PHD (Rost, 1993)
- Purely position-specific
- Represents each column of the multiple alignment with the observed frequency of each amino acid
Weakness
The multiple alignment does not always allow the substitution frequencies to be estimated reliably
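The frequency-based encoding can be sketched as follows, assuming the multiple alignment is given as equal-length strings with '-' for gaps. Ignoring gaps when counting is an assumption of this example, as are the function name and the toy alignment.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def frequency_profile(alignment):
    """Return one 20-value frequency vector per alignment column."""
    length = len(alignment[0])
    if any(len(sequence) != length for sequence in alignment):
        raise ValueError("aligned sequences must have the same length")
    profile = []
    for col in range(length):
        # Keep only standard amino acids; gaps and unknown symbols are skipped.
        residues = [seq[col] for seq in alignment if seq[col] in AMINO_ACIDS]
        total = len(residues)
        profile.append(
            [residues.count(aa) / total if total else 0.0 for aa in AMINO_ACIDS]
        )
    return profile


toy_alignment = ["ACD", "AC-", "GCD"]  # made-up alignment of three sequences
profile = frequency_profile(toy_alignment)
print(profile[0][AMINO_ACIDS.index("A")])  # frequency of A in the first column
```

With only three sequences, each frequency is a multiple of 1/3 or 1/2, which illustrates the weakness noted above: few aligned sequences give a coarse estimate.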
PSSM Encoding
PSI-BLAST PSSM is the most widely adopted way to encode multiple alignment information for similarity searches and prediction (first used with PSIpred (Jones, 1999))
Algorithm
PSSM (Henikoff & Henikoff, 1996) improves the frequencies by adding pseudo-counts where the estimation is weak
- Pseudo-counts are obtained from standard amino acid substitution matrices
- Fewer search hits, more pseudo-counts
- Main issue: how many pseudo-counts, exactly?
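The trade-off between observed frequencies and pseudo-counts can be illustrated with a simple weighted blend. This is a sketch of the general idea only, not the PSI-BLAST implementation: the weighting `alpha = n_observations - 1`, the default `beta` and the toy frequency vectors are assumptions made for the example.

```python
def blend(observed, pseudo, n_observations, beta=10.0):
    """Mix observed frequencies with pseudo-frequencies.

    `alpha` grows with the number of observations, so a column backed by
    many search hits keeps its observed frequencies, while a column backed
    by few hits falls back on the pseudo-frequencies.
    """
    alpha = max(n_observations - 1, 0)
    return [
        (alpha * f + beta * g) / (alpha + beta)
        for f, g in zip(observed, pseudo)
    ]


observed = [1.0, 0.0]  # toy column over a two-letter alphabet
pseudo = [0.5, 0.5]    # hypothetical values derived from a substitution matrix

print(blend(observed, pseudo, n_observations=1))    # [0.5, 0.5]
print(blend(observed, pseudo, n_observations=11))   # [0.75, 0.25]
```

The open question raised on the slide corresponds to the choice of `beta` relative to `alpha`.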
Sum-Linear Blosum Encoding
Algorithm
Like PSSM, SLB [2] uses substitution matrices to enhance frequency counts.
- A linear combination of BLOSUM62 matrix columns is used to encode the i-th position in the protein sequence
- Matrix columns are weighted with the frequency counts
Advantages
- Naturally combines position-independent and position-specific information
- Simple implementation
- Better performance?
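A minimal sketch of the idea, under stated assumptions: the position is encoded as the sum of substitution-matrix columns, each weighted by the frequency of its amino acid in the alignment column. To keep the example short, a made-up 3-letter matrix stands in for BLOSUM62; `TOY_MATRIX` and the function names are hypothetical, and any normalisation used by the actual method [2] is not reproduced here.

```python
ALPHABET = "ACD"

# Invented symmetric scores for illustration only; this is NOT BLOSUM62.
TOY_MATRIX = {
    "A": {"A": 4, "C": 0, "D": -2},
    "C": {"A": 0, "C": 9, "D": -3},
    "D": {"A": -2, "C": -3, "D": 6},
}


def matrix_column(matrix, amino_acid):
    """Column of the substitution matrix for one amino acid, as a list."""
    return [matrix[row][amino_acid] for row in ALPHABET]


def slb_encode(frequencies, matrix=TOY_MATRIX):
    """Encode one position as a frequency-weighted sum of matrix columns.

    `frequencies` maps each amino acid seen at this position of the
    alignment to its observed frequency.
    """
    encoded = [0.0] * len(ALPHABET)
    for amino_acid, frequency in frequencies.items():
        for k, score in enumerate(matrix_column(matrix, amino_acid)):
            encoded[k] += frequency * score
    return encoded


print(slb_encode({"A": 1.0}))            # [4.0, 0.0, -2.0]
print(slb_encode({"A": 0.5, "D": 0.5}))  # [1.0, -1.5, 2.0]
```

With a single sequence and no homologues, the result is just the matrix column of that amino acid, that is, a position-independent encoding; as more aligned residues contribute, the vector becomes position-specific.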
Experimental Results
Encoding techniques were compared with the PHD architecture
Intra-Structure Correlation
Secondary structure elements are not independent: mutual interactions hold between the amino acids along the protein, including the hydrogen bonds involved in the formation of secondary structure
Residue interactions
- Within α-helices and β-strands
- Between β-strands in the same sheet
Issue
Residues should not be predicted independently
Exploiting Intra-Structure Correlation
Common Approach
Modern predictors include a structure-to-structure layer, which refines the point-by-point predictions of a first sequence-to-structure layer
Our Proposal
Also include intra-structure information in the output representation [3, 6]
The SSP2 Architecture
Issue
Adding information to the output representation makes the training process harder
Solution
Add the information step-by-step along suitable pipelines
SSP2 Architecture
1 Combination
2 Pipelines
3 Diversity
GAMESSP2
- GAMESSP2 is a realization of SSP2 with GAME
- Uses variable output windows to produce scalable representations
Test configuration (pipeline from IN to OUT)
1 GROUND (ANN): ENC = PSSM, Wout = 1..11
2 REFINER (ANN): ENC = SLB / none, Wout = 1..9
3 REFINER (ANN): ENC = none, Wout = 1
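One possible reading of the output window Wout is that the training target of a residue contains the labels of its neighbours as well as its own. The sketch below builds such targets under that assumption; the one-hot layout, the zero-padding at the sequence ends and the function name are hypothetical and are not taken from the talk.

```python
ONE_HOT = {"H": [1, 0, 0], "E": [0, 1, 0], "C": [0, 0, 1]}


def output_targets(labels, w_out):
    """Build one target vector per residue from a window of `w_out` labels.

    With w_out = 1 the target is the usual one-hot label of the residue;
    larger windows append the labels of neighbouring residues.
    """
    if w_out % 2 == 0:
        raise ValueError("w_out must be odd so that the window has a centre")
    half = w_out // 2
    targets = []
    for centre in range(len(labels)):
        target = []
        for pos in range(centre - half, centre + half + 1):
            inside = 0 <= pos < len(labels)
            target.extend(ONE_HOT[labels[pos]] if inside else [0, 0, 0])
        targets.append(target)
    return targets


print(output_targets("HEC", 1))     # plain one-hot labels
print(output_targets("HEC", 3)[1])  # labels of residues 0, 1 and 2
```

Under this reading, the last stage with Wout = 1 brings the prediction back to one label per residue.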
Finding the Best Configuration
ENC = <PSSM, none, none>, Wout = <j, k, 1>
System         Q3     SOV    Ch    Ce    Cc
j = 1, k = 1   79.70  77.31  0.74  0.67  0.61
j = 1, k = 3   79.71  77.46  0.75  0.67  0.61
j = 1, k = 5   79.79  78.00  0.75  0.67  0.61
j = 1, k = 7   79.80  78.14  0.75  0.67  0.61
j = 1, k = 9   79.78  77.70  0.75  0.67  0.61
j = 3, k = 5   80.17  78.37  0.75  0.67  0.61
j = 3, k = 7   80.14  78.87  0.75  0.67  0.62
j = 3, k = 9   80.02  78.21  0.75  0.67  0.61
j = 5, k = 7   80.38  78.45  0.76  0.68  0.62
j = 5, k = 9   80.13  78.55  0.76  0.67  0.62
j = 7, k = 9   80.23  78.65  0.76  0.67  0.62
Finding the Best Configuration
ENC = <PSSM, SLB, none>, Wout = <j, k, 1>
System         Q3     SOV    Ch    Ce    Cc
j = 1, k = 1   79.70  77.31  0.74  0.67  0.61
j = 1, k = 3   79.66  77.43  0.74  0.66  0.61
j = 1, k = 5   79.95  77.67  0.75  0.67  0.61
j = 1, k = 7   80.09  77.79  0.75  0.67  0.62
j = 1, k = 9   79.97  77.97  0.75  0.67  0.62
j = 3, k = 5   80.11  78.00  0.75  0.67  0.62
j = 3, k = 7   80.34  78.29  0.76  0.67  0.62
j = 3, k = 9   80.12  78.24  0.75  0.67  0.61
j = 5, k = 7   80.17  78.78  0.75  0.67  0.62
j = 5, k = 9   80.14  78.54  0.76  0.67  0.62
j = 7, k = 9   80.06  78.47  0.75  0.67  0.62
Benchmarking: EVA common set 6
The four best configurations were combined to obtain the final GAMESSP2 system, which was benchmarked on the EVA common set 6 (212 proteins)
System                    Q3     SOV    Ch    Ce    Cc
PHDpsi (Rost, 1993)       74.99  70.87  0.66  0.69  0.53
PSIpred (Jones, 1999)     77.76  75.36  0.69  0.74  0.56
PROFsec (Rost, unp.)      76.70  74.76  0.68  0.72  0.56
DBNN (Yao et al., 2008)   77.8   72.4   0.71  0.65  0.58
GAMESSP2                  78.34  76.17  0.70  0.76  0.59
Summary
- Five different kinds of correlation were identified for the secondary structure prediction problem
- We realized GAME, a general framework for real-world prediction problems, and used it for secondary structure prediction
- We proposed a novel input encoding to take into account inter-sequence correlations
- We proposed an original use of output encoding along suitable pipelines to exploit intra-structure correlations
Outlook
- Apply the framework to further problems
- Exploit inter-structure correlations
Publications
[1] F. Ledda, L. Milanesi, and E. Vargiu. GAME: A Generic Architecture based on Multiple Experts for Predicting Protein Structures. International Journal Communications of SIWN, 3:107–112, 2008.
[2] G. Armano, F. Ledda, and E. Vargiu. Sum-Linear Blosum: A Novel Protein Encoding Method for Secondary Structure Prediction. International Journal Communications of SIWN, 6:71–77, 2009.
[3] G. Armano, F. Ledda, and E. Vargiu. SSP2: A Novel Software Architecture for Predicting Protein Secondary Structure. Sequence and Genome Analysis: Methods and Application, in press.
[4] G. Armano, F. Ledda, and E. Vargiu. GAME: a Generic Architecture based on Multiple Experts for bioinformatics applications. In BITS Annual Meeting 2009, Genova (Italy), 2009.
[5] F. Ledda and E. Vargiu. Experimenting Heterogeneous Output Combination to Improve Secondary Structure Predictions. In Workshop on Data Mining and Bioinformatics, Cagliari (Italy), 2008.
[6] G. Armano and F. Ledda. Exploiting Intra-Structure Information for Secondary Structure Prediction with Multifaceted Pipelines. Submitted.
[7] F. Ledda, G. Armano, and A. C. R. Martin. Using a Beta Contact Predictor to Guide Pairwise Sequence Alignments for Comparative Modelling. Submitted.