My Research Work and Clustering

Dr. Bernard Chen, Ph.D.
University of Central Arkansas

Fall 2010

Outline

Introduction
Experimental Setup
Clustering
Future Work

Central Dogma of Molecular Biology

Amino Acids, the Subunits of Proteins

Protein Primary, Secondary, and Tertiary Structure

Protein 3D Structure

Protein Sequence Motif

Although there are 20 amino acids, the primary structure of a protein is not built by choosing among them at random.

Sequence motifs: a relatively small number of functionally or structurally conserved sequence patterns that occur repeatedly in a group of related proteins.

Protein Sequence Motif

These biologically significant regions or residues are usually:

Enzyme catalytic sites
Prosthetic group attachment sites (heme, pyridoxal-phosphate, biotin…)
Amino acids involved in binding a metal ion
Cysteines involved in disulfide bonds
Regions involved in binding a molecule (ATP/ADP, GDP/GTP, Ca, DNA…)

Goal of Our Group

The main purpose is to obtain and extract protein sequence motifs that are universally conserved across protein family boundaries.

Discuss the relation between protein primary structure and tertiary structure.

Experimental Setup

HSSP matrix: 1b25

Representation of Segment

Sliding window size: 9

Each window corresponds to a sequence segment, which is represented by a 9 × 20 matrix plus the nine corresponding secondary structure labels obtained from DSSP.

More than 560,000 segments (413 MB) are generated by this method.
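
The segment representation described above can be sketched as follows. This is a minimal illustration, not the group's actual extraction code: the function name `sliding_segments`, the toy profile, and the toy secondary structure string are all hypothetical, and a real HSSP profile would supply the 20 per-position amino acid frequencies.

```python
WINDOW = 9  # sliding window size stated on the slide


def sliding_segments(profile, secondary, window=WINDOW):
    """Cut a per-residue profile into overlapping windows.

    profile   : list of rows, one per residue, each with 20 frequency values
    secondary : string with one secondary structure label per residue
    Returns a list of (window x 20 matrix, window-length label string) pairs.
    """
    if len(profile) != len(secondary):
        raise ValueError("profile and secondary structure must have equal length")
    segments = []
    for start in range(len(profile) - window + 1):
        rows = [list(row) for row in profile[start:start + window]]
        segments.append((rows, secondary[start:start + window]))
    return segments


# Hypothetical toy input: 12 residues, uniform frequencies, made-up labels.
toy_profile = [[0.05] * 20 for _ in range(12)]
toy_dssp = "HHHHEEEECCCC"

segments = sliding_segments(toy_profile, toy_dssp)
print(len(segments))  # 12 - 9 + 1 = 4 windows
```

A protein of length n yields n - 8 segments with this window size, which is how a modest number of proteins can produce several hundred thousand segments.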

DSSP: Obtaining Secondary Structure Information

Clustering Algorithms

We used two clustering algorithms in our approach:

K-means Clustering
Fuzzy C-means Clustering

K-means Clustering
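
For readers without the slide figures, a minimal K-means sketch may help. It is a plain-Python illustration of the standard assign-then-update loop, not the implementation used in this work. Squared Euclidean distance is an assumption made here for simplicity; the talk mentions an HSSP-BLOSUM62 measure, which is not reproduced in this sketch. The toy points are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=100, seed=0):
    """Standard K-means: returns (labels, centers)."""
    rng = random.Random(seed)
    # Initialise centers with k distinct data points.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return labels, centers


# Hypothetical toy data: two well-separated groups of three points.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(toy_points, k=2)
print(labels)
```

Each point receives exactly one label, which is the "hard" assignment that Fuzzy C-means relaxes.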


Fuzzy C-means Clustering
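
A minimal sketch of Fuzzy C-means (FCM) follows, using the textbook update rules: centers are membership-weighted means, and memberships are inversely related to distance. It is an illustration only. The fuzzifier `m = 2.0`, the squared Euclidean distance, the tolerance, and the toy data are assumptions, not values taken from this work.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fuzzy_c_means(points, c, m=2.0, iters=100, tol=1e-6, seed=0):
    """Textbook FCM: returns (memberships, centers).

    memberships[j][i] is the degree to which point j belongs to cluster i;
    every row sums to 1.
    """
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial memberships, each row normalised to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])
    centers = []
    for _ in range(iters):
        # Center update: mean of the points weighted by membership ** m.
        centers = []
        for i in range(c):
            w = [u[j][i] ** m for j in range(n)]
            wsum = sum(w)
            centers.append(
                [sum(w[j] * points[j][d] for j in range(n)) / wsum for d in range(dim)]
            )
        # Membership update: proportional to d^(-2/(m-1)), normalised per point.
        new_u = []
        for p in points:
            d2 = [max(sq_dist(p, ctr), 1e-12) for ctr in centers]
            inv = [d ** (-1.0 / (m - 1.0)) for d in d2]
            s = sum(inv)
            new_u.append([v / s for v in inv])
        shift = max(abs(a - b) for ra, rb in zip(u, new_u) for a, b in zip(ra, rb))
        u = new_u
        if shift < tol:  # memberships have stabilised
            break
    return u, centers


# Hypothetical toy data: two well-separated groups of three points.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
memberships, fcm_centers = fuzzy_c_means(toy_points, c=2)
strongest = [row.index(max(row)) for row in memberships]
print(strongest)
```

Because every point keeps a graded membership in every cluster, a point near a boundary can later be placed in more than one group.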







Granular Computing Model

[Figure: Original dataset -> Fuzzy C-Means Clustering -> Information Granule 1 ... Information Granule M -> K-means Clustering on each granule -> Join Information -> Final Sequence Motifs Information]
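
The splitting step of the granular model can be sketched as below: FCM memberships decide which information granule or granules each segment falls into, and each granule is then clustered separately with K-means before the results are joined. The function name `split_into_granules`, the membership threshold of 0.15, and the toy membership matrix are hypothetical; the slides do not state how memberships are turned into granules.

```python
def split_into_granules(memberships, threshold=0.15):
    """Return one list of sample indices per granule.

    A sample joins every granule in which its membership reaches the
    threshold (an assumed rule), so granules may overlap.
    """
    n_granules = len(memberships[0])
    granules = [[] for _ in range(n_granules)]
    for idx, row in enumerate(memberships):
        chosen = [g for g, u in enumerate(row) if u >= threshold]
        if not chosen:
            # Fall back to the strongest granule so no sample is dropped.
            chosen = [max(range(n_granules), key=row.__getitem__)]
        for g in chosen:
            granules[g].append(idx)
    return granules


# Hypothetical FCM output for four samples and three granules.
toy_memberships = [
    [0.90, 0.05, 0.05],
    [0.45, 0.50, 0.05],
    [0.10, 0.10, 0.80],
    [0.34, 0.33, 0.33],
]
granules = split_into_granules(toy_memberships)
print(granules)  # [[0, 1, 3], [1, 3], [2, 3]]
```

With overlapping granules, the member counts summed over all granules can exceed the size of the original dataset, while each individual granule is much smaller than the whole.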

Motivation

Reduce Space-complexity

                   Number of Members   Number of Clusters   Data Size
Granule 0          136112              151                  99.9MB
Granule 1          68792               76                   50.5MB
Granule 2          86094               95                   63.2MB
Granule 3          65361               72                   47.9MB
Granule 4          63159               70                   46.3MB
Granule 5          120130              133                  88.2MB
Granule 6          128874              143                  94.6MB
Granule 7          4583                5                    3.3MB
Granule 8          43254               48                   31.7MB
Granule 9          5032                6                    3.7MB
Total              721390              799                  529MB
Original dataset   562745              800                  413MB

Table 1: Summary of results obtained by FCM
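
The per-granule cluster counts in Table 1 are consistent with sharing out 800 clusters in proportion to granule size and rounding to the nearest integer. That rule is an inference from the numbers, not something stated on the slides, so the sketch below should be read as a plausibility check rather than the documented method.

```python
# Member counts per granule, copied from Table 1.
members = {
    "Granule 0": 136112,
    "Granule 1": 68792,
    "Granule 2": 86094,
    "Granule 3": 65361,
    "Granule 4": 63159,
    "Granule 5": 120130,
    "Granule 6": 128874,
    "Granule 7": 4583,
    "Granule 8": 43254,
    "Granule 9": 5032,
}

TARGET_CLUSTERS = 800  # cluster count used for the original dataset

# Assumed rule: clusters proportional to granule size, rounded to nearest.
total_members = sum(members.values())
clusters = {
    name: round(TARGET_CLUSTERS * count / total_members)
    for name, count in members.items()
}

for name, k in clusters.items():
    print(name, k)
print("Total", sum(clusters.values()))  # 799, matching the table
```

Rounding each share independently explains why the granule clusters add up to 799 rather than exactly 800.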

Reduce Time-complexity

Wei’s method: 1285968 sec (15 days) * 6 = 7715568 sec (90 days)

Granular Model: 154899 sec (FCM exe time) + 231720 sec (2.7 days) * 6 = 1545219 sec (18 days)
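
The timing comparison can be checked with a few lines of arithmetic. The figures are copied from the slide; the variable names are hypothetical, and the factor of 6 is used as given because the slide does not say what it counts. Note that 1285968 * 6 evaluates to 7715808, slightly different from the stated 7715568, so the sketch uses the stated total for the comparison.

```python
SECONDS_PER_DAY = 86_400

# Figures as stated on the slide.
fcm_seconds = 154_899          # one-off FCM execution time
kmeans_pass_seconds = 231_720  # K-means time on the granules, per repeat
repeats = 6                    # factor given on the slide
wei_total_seconds = 7_715_568  # stated total for Wei's method

granular_total = fcm_seconds + kmeans_pass_seconds * repeats
speedup = wei_total_seconds / granular_total

print(granular_total)                               # 1545219
print(round(granular_total / SECONDS_PER_DAY, 1))   # 17.9 days
print(round(kmeans_pass_seconds / SECONDS_PER_DAY, 1))  # 2.7 days
print(round(speedup, 1))                            # 5.0
```

On these numbers the granular model takes roughly one fifth of the time of the non-granular approach.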

HSSP-BLOSUM62 Measure

Part 1: Bioinformatics Knowledge and Dataset Collection
Part 2: Discovering Protein Sequence Motifs
Part 3: Motif Information Extraction
Part 4: Mining the Relations between Motifs and Motifs
Part 5: Protein Local Tertiary Structure Prediction

Future Work

Part 3: Protein information extraction by decision tree

Part 4: Clustering with association rules and graph theory

Part 4: Super-rule generation by DBSCAN

Apply DBSCAN to build super-rules among all motifs

Part 5: Protein local tertiary structure prediction

By decision tree, naïve Bayes, association rule algorithms, and more…
