My Research Work and Clustering
Dr. Bernard Chen, Ph.D., University of Central Arkansas
Fall 2010
Outline
Introduction
Experimental Setup
Clustering
Future Works
Central Dogma of Molecular Biology
Amino Acids, the subunits of proteins
Protein Primary, Secondary, and Tertiary Structure
Protein 3D Structure
Protein Sequence Motif
Although there are 20 amino acids, protein primary structure is not constructed by choosing among them at random.
Sequence motifs: a relatively small number of functionally or structurally conserved sequence patterns that occur repeatedly in a group of related proteins.
Protein Sequence Motif
These biologically significant regions or residues are usually:
Enzyme catalytic sites
Prosthetic group attachment sites (heme, pyridoxal-phosphate, biotin…)
Amino acids involved in binding a metal ion
Cysteines involved in disulfide bonds
Regions involved in binding a molecule (ATP/ADP, GDP/GTP, Ca, DNA…)
Goal of our group
The main purpose is to obtain and extract protein sequence motifs that are universally conserved across protein family boundaries.
Discuss the relation between protein primary structure and tertiary structure.
Experiment setup: HSSP matrix: 1b25
Representation of Segment
Sliding window size: 9
Each window corresponds to a sequence segment, which is represented by a 9 × 20 matrix plus nine corresponding secondary structure values obtained from DSSP.
More than 560,000 segments (413MB) are generated by this method.
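The sliding-window representation described above can be sketched as follows. This is a minimal illustration, not the group's actual code: it assumes the HSSP profile is already loaded as a list of 20-value rows (one per residue) and the DSSP assignment as a string of the same length. The names `extract_segments`, `toy_profile`, and `toy_dssp` are hypothetical.

```python
import random

WINDOW = 9     # sliding window size given on the slide
N_AMINO = 20   # one profile column per amino acid

def extract_segments(profile, secondary):
    """Return a list of (9 x 20 window, 9-character DSSP string) pairs."""
    if len(profile) != len(secondary):
        raise ValueError("profile and secondary structure lengths differ")
    if any(len(row) != N_AMINO for row in profile):
        raise ValueError("every profile row must have 20 values")
    segments = []
    # One segment per window position; a protein shorter than 9 yields none.
    for start in range(len(secondary) - WINDOW + 1):
        stop = start + WINDOW
        segments.append((profile[start:stop], secondary[start:stop]))
    return segments

# Toy input (made up): a 12-residue protein with random profile values.
rng = random.Random(0)
toy_profile = [[rng.random() for _ in range(N_AMINO)] for _ in range(12)]
toy_dssp = "HHHHEEEECCCC"
segments = extract_segments(toy_profile, toy_dssp)
```

A protein of length L yields L - 8 segments, so the toy protein of 12 residues produces 4.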
DSSP: Obtaining secondary structure information
Clustering Algorithms
We used two clustering algorithms in our approach:
K-means Clustering
Fuzzy C-means Clustering
K-means Clustering
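As a reminder of how K-means proceeds, here is a minimal sketch of the standard assign-then-update loop. Several choices are assumptions made for illustration only: squared Euclidean distance (the slides do not state the distance measure used on the 9 × 20 segments), initial centroids passed in by the caller, and the toy two-dimensional data.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]

def kmeans(points, initial_centroids, max_iter=100):
    """Run the assign/update loop until the labels stop changing."""
    centroids = [list(c) for c in initial_centroids]
    k = len(centroids)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = mean(members)
    return centroids, labels

# Toy data (made up): two well-separated groups of three points.
toy = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
       [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]]
centroids, labels = kmeans(toy, initial_centroids=[toy[0], toy[3]])
```

Each point ends up in exactly one cluster, which is the property that Fuzzy C-means relaxes.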
Fuzzy C-means Clustering
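For comparison with K-means, the sketch below shows the textbook Fuzzy C-means updates, in which every point receives a degree of membership in every cluster. The fuzzifier `m = 2.0`, the Euclidean distance, the caller-supplied initial centers, and the toy data are assumptions for illustration; the slides do not give the settings actually used.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def update_memberships(points, centers, m):
    """u_ij = 1 / sum_k (d_ij / d_ik) ** (2 / (m - 1)); rows sum to 1."""
    power = 2.0 / (m - 1.0)
    memberships = []
    for p in points:
        dists = [distance(p, c) for c in centers]
        if 0.0 in dists:
            # The point sits exactly on a center: share full membership there.
            hits = [1.0 if d == 0.0 else 0.0 for d in dists]
            row = [h / sum(hits) for h in hits]
        else:
            row = [1.0 / sum((dj / dk) ** power for dk in dists) for dj in dists]
        memberships.append(row)
    return memberships

def update_centers(points, memberships, m):
    """Each center is the mean of all points weighted by u_ij ** m."""
    dim = len(points[0])
    centers = []
    for j in range(len(memberships[0])):
        weights = [row[j] ** m for row in memberships]
        total = sum(weights)
        centers.append([
            sum(w * p[d] for w, p in zip(weights, points)) / total
            for d in range(dim)
        ])
    return centers

def fuzzy_c_means(points, initial_centers, m=2.0, max_iter=100, tol=1e-6):
    """Alternate the two updates until the centers move less than tol."""
    centers = [list(c) for c in initial_centers]
    for _ in range(max_iter):
        memberships = update_memberships(points, centers, m)
        new_centers = update_centers(points, memberships, m)
        shift = max(distance(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, update_memberships(points, centers, m)

# Toy data (made up): two well-separated groups of three points.
toy = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
       [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]]
centers, memberships = fuzzy_c_means(toy, initial_centers=[toy[0], toy[3]])
```

Because memberships are graded rather than exclusive, a point near a boundary can count toward more than one group.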
Granular Computing Model
[Figure: Original dataset → Fuzzy C-Means Clustering → Information Granules 1 … M → K-means Clustering on each granule → Join Information → Final Sequence Motifs Information]
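The first stage of the model, turning fuzzy memberships into information granules, might look like the sketch below. The rule used here is an assumption, not something stated in the slides: a segment joins every granule in which its membership reaches a cutoff, and always joins its best granule. The function name `split_into_granules`, the cutoff value, and the toy memberships are hypothetical.

```python
def split_into_granules(memberships, threshold):
    """Return one list of point indices per granule.

    A point joins every granule where its membership is at least
    `threshold`, and always joins the granule where it is highest,
    so no point is dropped. Granules may therefore overlap.
    """
    n_granules = len(memberships[0])
    granules = [[] for _ in range(n_granules)]
    for index, row in enumerate(memberships):
        best = max(range(n_granules), key=lambda g: row[g])
        for g, u in enumerate(row):
            if u >= threshold or g == best:
                granules[g].append(index)
    return granules

# Toy memberships (made up): 4 points, 2 granules.
toy_memberships = [
    [0.90, 0.10],
    [0.55, 0.45],
    [0.20, 0.80],
    [0.48, 0.52],
]
granules = split_into_granules(toy_memberships, threshold=0.4)
granule_sizes = [len(g) for g in granules]
```

Under such a rule the granule sizes can add up to more than the number of original points, which would be consistent with Table 1 listing more total members than the original dataset contains. K-means would then run separately on each granule before the results are joined.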
Motivation
Reduce Space-complexity
                   Number of Members   Number of Clusters   Data Size
Granule 0          136112              151                  99.9MB
Granule 1          68792               76                   50.5MB
Granule 2          86094               95                   63.2MB
Granule 3          65361               72                   47.9MB
Granule 4          63159               70                   46.3MB
Granule 5          120130              133                  88.2MB
Granule 6          128874              143                  94.6MB
Granule 7          4583                5                    3.3MB
Granule 8          43254               48                   31.7MB
Granule 9          5032                6                    3.7MB
Total              721390              799                  529MB
Original dataset   562745              800                  413MB
Table 1: Summary of results obtained by FCM
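The cluster counts in Table 1 appear consistent with sharing the 800 clusters of the original dataset among the granules in proportion to granule size. The slides do not state this rule, so the short check below is an observation about the numbers rather than a description of the method.

```python
ORIGINAL_CLUSTERS = 800  # cluster count listed for the original dataset

# (members, clusters) per granule, copied from Table 1.
table = {
    "Granule 0": (136112, 151),
    "Granule 1": (68792, 76),
    "Granule 2": (86094, 95),
    "Granule 3": (65361, 72),
    "Granule 4": (63159, 70),
    "Granule 5": (120130, 133),
    "Granule 6": (128874, 143),
    "Granule 7": (4583, 5),
    "Granule 8": (43254, 48),
    "Granule 9": (5032, 6),
}

total_members = sum(members for members, _ in table.values())

# Assumed rule: clusters proportional to granule size, rounded to nearest.
allocated = {
    name: round(ORIGINAL_CLUSTERS * members / total_members)
    for name, (members, _) in table.items()
}
listed = {name: clusters for name, (_, clusters) in table.items()}
matches = allocated == listed
```

Rounding each share independently is why the per-granule counts sum to 799 rather than 800.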
Reduce Time-complexity
Wei’s method: 1285968 sec (15 days) × 6 = 7715568 sec (90 days)
Granular Model: 154899 sec (FCM execution time) + 231720 sec (2.7 days) × 6 = 1545219 sec (18 days)
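The granular-model arithmetic can be checked in a few lines. Reading the factor of 6 as six repeated clustering runs is an assumption, and the total for Wei's method is taken as stated on the slide rather than recomputed.

```python
SECONDS_PER_DAY = 24 * 60 * 60

fcm_seconds = 154899          # one-off FCM execution time
kmeans_run_seconds = 231720   # K-means over all granules, per run
runs = 6                      # assumed meaning of the "x 6" factor

granular_total = fcm_seconds + kmeans_run_seconds * runs
granular_days = granular_total / SECONDS_PER_DAY
per_run_days = kmeans_run_seconds / SECONDS_PER_DAY

wei_total_stated = 7715568    # total for Wei's method as stated on the slide
speedup = wei_total_stated / granular_total

print(f"granular model: {granular_total} sec = {granular_days:.1f} days")
print(f"speedup over Wei's method: {speedup:.1f}x")
```

The FCM step is paid only once, so its cost is amortized over the repeated runs.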
HSSP-BLOSUM62 Measure
Part 1: Bioinformatics Knowledge and Dataset Collection
Part 2: Discovering Protein Sequence Motifs
Part 3: Motif Information Extraction
Part 4: Mining the Relations between Motifs and Motifs
Part 5: Protein Local Tertiary Structure Prediction
Future Works
Part 3: Protein information extraction by decision tree
Part 4: Clustering with association rules and graph theory
Part 4: Super-rule generation by DBSCAN
Apply DBSCAN to build super-rules among all motifs
Part 5: Protein local tertiary structure prediction
By decision tree, naïve Bayes, association rule algorithms, and more…