
A New Biclustering Algorithm for Analyzing Biological Data

Date post: 23-Feb-2016
Page 1: A New Biclustering Algorithm for Analyzing Biological Data

A New Biclustering Algorithm for Analyzing Biological Data

Prashant Paymal

Advisor: Dr. Hesham Ali

Page 2: A New Biclustering Algorithm for Analyzing Biological Data

Introduction

• Microarray technology is used to study the expression of many genes at once

• Microarray experiments produce large amounts of data

• Proper analysis of the data is important to get meaningful information from it

• There is a need for new analysis techniques

Page 3: A New Biclustering Algorithm for Analyzing Biological Data

Data Analysis

• From data to knowledge

• We need to process data by grouping and synthesizing information into a “big picture” based upon characteristics and relationships

• One of the most widely used analysis techniques is traditional clustering

Page 4: A New Biclustering Algorithm for Analyzing Biological Data

Traditional Clustering

• Applied to either rows or columns of the data matrix separately

• Each gene is defined using all the conditions

• Each condition is characterized by the activity of all the genes that belong to it

[Figure: data matrices with axes labeled Genes and Conditions]

Page 5: A New Biclustering Algorithm for Analyzing Biological Data

Motivation

• The large amount of data poses great challenges for analysis

• Clustering algorithms consider all the conditions to group genes and all the genes to group conditions

• Biologically, data may show similar behavior only in a subset of the conditions rather than in all of them

• Traditional clustering algorithms will very likely miss some important information
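
To make the motivation concrete, here is a small illustrative sketch. The two expression profiles are made-up toy values (not from the presentation): the genes move together perfectly under the first four conditions, but the similarity is hidden when all eight conditions are compared at once.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

# Toy expression levels for two genes under eight conditions (hypothetical).
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0, 1.0, 4.0, 2.0]
gene_b = [2.0, 4.0, 6.0, 8.0, 1.0, 7.0, 2.0, 6.0]

# Similarity over all conditions, as traditional clustering would see it.
full = pearson(gene_a, gene_b)

# Similarity over only the first four conditions.
subset = pearson(gene_a[:4], gene_b[:4])

print(f"all conditions: {full:.3f}, subset of conditions: {subset:.3f}")
```

A method that compares genes across every condition would treat these two genes as dissimilar, even though they behave identically on a subset of the conditions.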

Page 6: A New Biclustering Algorithm for Analyzing Biological Data

Biclustering

• The term “biclustering” was first used in gene expression data analysis by Cheng and Church (2000)

• Clusters do not need to include all parameters (genes in bioinformatics) for all conditions

• Data matrix
  ▫ Each gene: one row
  ▫ Each condition: one column
  ▫ Each element: expression level of a gene under a specific condition
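
As a minimal sketch of this layout, the snippet below stores a toy matrix with one row per gene and one column per condition, and extracts a bicluster as the submatrix defined by a subset of rows and a subset of columns. The gene names, condition names, and values are hypothetical.

```python
# Rows are genes, columns are conditions (toy values, not real measurements).
genes = ["g1", "g2", "g3", "g4"]
conditions = ["c1", "c2", "c3", "c4", "c5"]
matrix = [
    [0.1, 2.0, 0.3, 2.1, 0.2],  # g1
    [0.5, 0.4, 0.6, 0.2, 0.9],  # g2
    [0.7, 2.2, 0.1, 1.9, 0.4],  # g3
    [0.3, 0.8, 0.5, 0.6, 0.1],  # g4
]

def submatrix(data, rows, cols):
    """Return the block of `data` restricted to the given row and column indices."""
    return [[data[r][c] for c in cols] for r in rows]

# A bicluster is a pair (subset of genes, subset of conditions).
bicluster_rows = [0, 2]  # g1, g3
bicluster_cols = [1, 3]  # c2, c4
block = submatrix(matrix, bicluster_rows, bicluster_cols)
print(block)
```

Here genes g1 and g3 are both highly expressed under conditions c2 and c4 only, which is the kind of pattern a bicluster is meant to capture.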

Page 7: A New Biclustering Algorithm for Analyzing Biological Data

Biclustering (Cont.)

• Performs clustering in these two dimensions simultaneously

• Each gene is selected using only a subset of the conditions

• Each condition is selected using only a subset of the genes

[Figure: data matrix with axes labeled Genes and Conditions]

Page 8: A New Biclustering Algorithm for Analyzing Biological Data

Goal of Biclustering

• To identify subgroups of genes and subgroups of conditions by performing simultaneous clustering of both rows and columns of the gene expression matrix, instead of clustering these two dimensions separately

• Finding biclusters is an NP-hard problem; biclustering is in fact a generalized version of traditional clustering

Page 9: A New Biclustering Algorithm for Analyzing Biological Data

Previous Work

• A systematic comparison and evaluation of biclustering methods for gene expression data - Amela Prelic et al. (2006)

• Algorithms:
  ▫ Statistical-Algorithmic Method for Bicluster Analysis (SAMBA)
  ▫ Order Preserving Submatrix Algorithm (OPSM)
  ▫ Iterative Signature Algorithm (ISA)
  ▫ Cheng and Church algorithm
  ▫ xMotif
  ▫ Bimax

Page 10: A New Biclustering Algorithm for Analyzing Biological Data

Previous Work (Cont.)

• Comparative Analysis of Biclustering Algorithms - Doruk Bozdag … (2010)

• Algorithms:
  ▫ Correlated Pattern Bicluster Algorithm (CPB)
  ▫ Cheng and Church Algorithm
  ▫ Order Preserving Submatrix Algorithm (OPSM)
  ▫ HARP Algorithm
  ▫ Minimum Sum-Squared Residue-based CoClustering Algorithm (MSSRCC)
  ▫ Statistical-Algorithmic Method for Bicluster Analysis (SAMBA)

Page 11: A New Biclustering Algorithm for Analyzing Biological Data

The Importance of Assessment

• Different algorithms give different solutions for the same data

• There is no agreed-upon guideline for choosing among them

• Validation techniques
  ▫ External validation measures: evaluate a result based on knowledge of the correct class labels
  ▫ Internal validation measures: evaluate a result based on information intrinsic to the data alone

Page 12: A New Biclustering Algorithm for Analyzing Biological Data

Validation

• In most biclustering papers, external validation measures are used to assess the methods
  ▫ It is not clear how to extend notions such as homogeneity and separation to the biclustering context (Gat-Viks et al. 2003)
  ▫ Internal measures do not work well for biclustering, which is why Gat-Viks et al. (2003) and Handl et al. (2005) recommend external measures

Page 13: A New Biclustering Algorithm for Analyzing Biological Data

Objectives of the Project

• Comprehensive Assessment Technique
  ▫ Internal measures as well as external measures

• Customized Biclustering Method
  ▫ Input domain

Page 14: A New Biclustering Algorithm for Analyzing Biological Data

Validation using Synthetic Data

• Testing using manufactured data
  ▫ The portion of the implanted bicluster that the algorithm was able to return
  ▫ The portion of the algorithm output that is external or irrelevant to the implanted bicluster
  ▫ Two metrics to evaluate cluster quality
    U: Uncovered portion of the implanted bicluster
    E: Portion of the output cluster external to the implanted bicluster
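
The two metrics can be sketched as follows. This is only an illustration under stated assumptions: a bicluster is treated as the set of (gene, condition) cells it covers, and both U and E are normalized as fractions of cells. The slides say "portion" without giving an exact formula, so the normalization and the function name `uncovered_and_external` are hypothetical.

```python
from itertools import product

def cells(genes, conditions):
    """All (gene, condition) cells covered by a bicluster."""
    return set(product(genes, conditions))

def uncovered_and_external(implanted, output):
    """Return (U, E) for one implanted bicluster and one output bicluster.

    U: fraction of implanted cells the output failed to cover (assumed definition).
    E: fraction of output cells lying outside the implanted bicluster (assumed definition).
    Both biclusters must be non-empty.
    """
    implanted_cells = cells(*implanted)
    output_cells = cells(*output)
    u = len(implanted_cells - output_cells) / len(implanted_cells)
    e = len(output_cells - implanted_cells) / len(output_cells)
    return u, e

# Implanted 10 x 10 bicluster versus an output shifted by five genes.
implanted = (range(0, 10), range(0, 10))
output = (range(5, 15), range(0, 10))
u, e = uncovered_and_external(implanted, output)
print(u, e)
```

A perfect recovery gives U = 0 and E = 0; larger values mean the algorithm missed more of the implanted bicluster or returned more irrelevant cells.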

Page 15: A New Biclustering Algorithm for Analyzing Biological Data

Validation using Synthetic Data

• Testing using real (domain-specific) data, for example using the gene match score
  ▫ Let M1 and M2 be two sets of biclusters
  ▫ Average of the maximum match scores for all biclusters in M1 with respect to the biclusters in M2

• Potential improvements
  ▫ The gene match score does not consider samples / conditions
  ▫ Specificity and sensitivity
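
A minimal sketch of the gene match score described above, assuming the per-pair match score is the Jaccard coefficient of the two gene sets, as in Prelic et al. Each bicluster is represented as a (genes, conditions) pair; the function names and example sets are hypothetical.

```python
def jaccard(a, b):
    """Jaccard coefficient |a ∩ b| / |a ∪ b| of two collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def gene_match_score(m1, m2):
    """Average, over biclusters in m1, of the best gene-set match found in m2.

    Conditions are ignored, which is the limitation noted on the slide.
    """
    if not m1 or not m2:
        return 0.0
    best = [max(jaccard(g1, g2) for g2, _ in m2) for g1, _ in m1]
    return sum(best) / len(m1)

# Two toy sets of biclusters, each bicluster given as (genes, conditions).
m1 = [({"a", "b", "c"}, {"x", "y"}), ({"d", "e"}, {"y", "z"})]
m2 = [({"a", "b"}, {"x"}), ({"d", "e", "f"}, {"z"})]
score = gene_match_score(m1, m2)
print(round(score, 4))
```

The score is not symmetric in general: `gene_match_score(m1, m2)` and `gene_match_score(m2, m1)` answer different questions, so both directions are usually reported.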

Page 16: A New Biclustering Algorithm for Analyzing Biological Data

Proposed Assessment

• Calculate sensitivity and specificity scores
  ▫ Specificity: proportion of actual negatives that are correctly identified
  ▫ Sensitivity: proportion of actual positives that are correctly identified

• Improve existing measures
  ▫ Average of the maximum match scores for all biclusters in M1 with respect to the biclusters in M2 (considering both genes and samples)

• Assessment based on knowledge of domain data
  ▫ The resulting biclusters were evaluated based on the enrichment of Gene Ontology (GO) terms
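
The sensitivity and specificity definitions above can be sketched at the level of matrix cells. This is an assumption for illustration: the slides do not say whether positives are counted per gene, per sample, or per cell, and the function name and example sizes are hypothetical.

```python
from itertools import product

def sensitivity_specificity(true_cells, predicted_cells, all_cells):
    """Cell-level sensitivity and specificity of a predicted bicluster.

    Positives are cells inside the true bicluster; negatives are all other cells.
    """
    tp = len(true_cells & predicted_cells)              # true positives
    fn = len(true_cells - predicted_cells)              # false negatives
    fp = len(predicted_cells - true_cells)              # false positives
    tn = len(all_cells - true_cells - predicted_cells)  # true negatives
    return tp / (tp + fn), tn / (tn + fp)

# Toy 20 x 20 matrix with a true 10 x 10 bicluster and a shifted prediction.
all_cells = set(product(range(20), range(20)))
true_cells = set(product(range(0, 10), range(0, 10)))
predicted_cells = set(product(range(5, 15), range(0, 10)))

sensitivity, specificity = sensitivity_specificity(true_cells, predicted_cells, all_cells)
print(sensitivity, round(specificity, 4))
```

In this toy case half of the true cells are recovered (sensitivity 0.5), while 250 of the 300 negative cells are correctly left out.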

Page 17: A New Biclustering Algorithm for Analyzing Biological Data

Experiments

• Given two biclustering results
  ▫ M1: Result of a biclustering algorithm
  ▫ M2: True result
  ▫ (G1, C1) ∈ M1 and (G2, C2) ∈ M2

• Calculate similarity scores (Jaccard coefficient)
  ▫ $\frac{|G_1 \cap G_2|}{|G_1 \cup G_2|}$ and $\frac{|C_1 \cap C_2|}{|C_1 \cup C_2|}$

• Calculate the two scores
  ▫ Score 1: fraction of the algorithm result that is included in the true result
  ▫ Score 2: fraction of the true result that the algorithm can find
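
One possible way to compute the two scores from the gene and condition Jaccard coefficients is sketched below. The slides do not state how the two coefficients are combined or aggregated, so two assumptions are made here: the similarity of a bicluster pair is the product of its gene and condition Jaccard coefficients, and each score is an average of best matches. All function names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard coefficient |a ∩ b| / |a ∪ b| of two collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def bicluster_similarity(b1, b2):
    """Similarity of two (genes, conditions) biclusters.

    Assumption: product of the gene and condition Jaccard coefficients.
    """
    (g1, c1), (g2, c2) = b1, b2
    return jaccard(g1, g2) * jaccard(c1, c2)

def average_best_match(source, target):
    """Average, over biclusters in `source`, of the best similarity in `target`."""
    best = [max(bicluster_similarity(s, t) for t in target) for s in source]
    return sum(best) / len(source)

# M1: algorithm output; M2: true result (toy data).
found = [({1, 2, 3, 4}, {1, 2}), ({7, 8}, {5, 6})]
truth = [({1, 2, 3, 4}, {1, 2})]

score_1 = average_best_match(found, truth)  # how much of the output is correct
score_2 = average_best_match(truth, found)  # how much of the truth was found
print(score_1, score_2)
```

In this example the algorithm finds the true bicluster exactly (Score 2 is 1.0) but also returns a spurious one, which halves Score 1.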

Page 18: A New Biclustering Algorithm for Analyzing Biological Data

Results

• Synthetic data: 100 genes and 100 samples

• 10 implanted biclusters, each of size 10 × 10 (10 genes and 10 samples)

• Used different publicly available biclustering algorithm implementations

• Score 1: fraction of the algorithm result that is included in the true result

• Score 2: fraction of the true result that the algorithm can find

| Algorithm | No. of biclusters | Score 1 | Score 2 |
|---|---|---|---|
| Cheng and Church Algorithm (CC) | 8 | 0.475 | 0.38 |
| Iterative Signature Algorithm (ISA) | 9 | 1 | 0.9 |
| Order Preserving Submatrix Algorithm (OPSM) | 32 | 0.273139 | 0.874044 |
| Statistical-Algorithmic Method for Bicluster Analysis (SAMBA) | 9 | 0.5 | 0.45 |
| xMotif Algorithm | 87 | 0.100023 | 0.870204 |
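
A synthetic dataset of the shape described on this slide could be generated roughly as follows. The slides only give the dimensions, so everything else is an assumption for illustration: Gaussian background noise, non-overlapping biclusters placed along the diagonal, and a constant elevated expression level inside each bicluster. The function name `make_synthetic` and its parameters are hypothetical.

```python
import random

def make_synthetic(n_genes=100, n_samples=100, n_biclusters=10, size=10, seed=0):
    """Build a noisy matrix with square biclusters implanted along the diagonal.

    Requires n_biclusters * size <= min(n_genes, n_samples).
    Returns the matrix and the list of implanted (genes, samples) index sets.
    """
    rng = random.Random(seed)
    # Background: standard normal noise (assumed).
    matrix = [[rng.gauss(0.0, 1.0) for _ in range(n_samples)] for _ in range(n_genes)]
    implanted = []
    for k in range(n_biclusters):
        rows = range(k * size, (k + 1) * size)
        cols = range(k * size, (k + 1) * size)
        for r in rows:
            for c in cols:
                # Elevated, nearly constant signal inside the bicluster (assumed).
                matrix[r][c] = 5.0 + rng.gauss(0.0, 0.1)
        implanted.append((set(rows), set(cols)))
    return matrix, implanted

matrix, implanted = make_synthetic()
print(len(matrix), len(matrix[0]), len(implanted))
```

The returned `implanted` list plays the role of the true result (M2) when scoring the output of a biclustering algorithm run on `matrix`.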

Page 19: A New Biclustering Algorithm for Analyzing Biological Data

Conclusion

• Traditional clustering is too restrictive a technique for analyzing datasets in various application domains

• We need new, flexible analysis techniques such as biclustering to deal with possible imperfections in the input datasets

• Assessment of data analysis is critical and must be considered when selecting the right tool for each application domain

• Biclustering is a powerful tool for data analysis in a variety of domains and can be applied to datasets beyond biology

Page 20: A New Biclustering Algorithm for Analyzing Biological Data

References

• Madeira, S.C., Oliveira, A.L.: Biclustering algorithms for biological data analysis: A survey

• Prelic, A., et al.: A systematic comparison and evaluation of biclustering methods for gene expression data

• http://cheng.ececs.uc.edu/biclustering

• http://www.tik.ethz.ch/~sop/bicat/

• http://acgt.cs.tau.ac.il/expander/

Page 21: A New Biclustering Algorithm for Analyzing Biological Data

Thank you…

