
Transactional Database Transformation and Its Application in Prioritizing Human Disease Genes Yang...

Date post: 31-Mar-2015
Upload: muhammad-beecham
Transactional Database Transformation and Its Application in Prioritizing Human Disease Genes Yang Xiang Department of Biomedical Informatics, The Ohio State University Homepage: http://bmi.osu.edu/~yxiang Joint work with Philip R.O. Payne and Kun Huang To appear in IEEE/ACM Transactions on Computational Biology and Bioinformatics
Transcript
Page 1: Transactional Database Transformation and Its Application in Prioritizing Human Disease Genes Yang Xiang Department of Biomedical Informatics, The Ohio.

Transactional Database Transformation and Its Application in Prioritizing Human Disease Genes

Yang Xiang
Department of Biomedical Informatics, The Ohio State University

Homepage: http://bmi.osu.edu/~yxiang

Joint work with Philip R.O. Payne and Kun Huang
To appear in IEEE/ACM Transactions on Computational Biology and Bioinformatics

Page 2

Motivation: Netflix Problem

• The Netflix Problem: given the current user ratings, how do we recommend movies to users?

? ? ? ? ? ? ? 4 ? ?
? 3 ? ? ? ? ? ? ? ?
? ? ? 2 ? ? ? ? 4 ?
? ? 5 ? ? 3 ? ? ? ?
? ? ? ? 4 ? ? ? ? ?
? ? 3 ? ? ? 2 ? ? ?
? ? ? 1 ? ? ? ? ? ?
? ? ? ? ? ? ? ? 1 ?

(Matrix axes: Users and Movies.)

Page 3

Motivation: Matrix Completion

• 0s are unsampled entries; all other values are sampled entries.

0 0 0 0 0 0 0 4 0 0
0 3 0 0 0 0 0 0 0 0
0 0 0 2 0 0 0 0 4 0
0 0 5 0 0 3 0 0 0 0
0 0 0 0 4 0 0 0 0 0
0 0 3 0 0 0 2 0 0 0
0 0 0 1 0 0 0 0 0 0
0 0 0 0 0 0 0 0 1 0

(Matrix axes: Users and Movies.)

• Can we recover matrices of this kind?

Page 4

Matrix Completion Theory and Methods

• If the number m of sampled entries obeys

m ≥ C n^1.2 r log n

for some positive numerical constant C, then with very high probability, most n×n matrices of rank r can be perfectly recovered. [Candès et al. Exact Matrix Completion via Convex Optimization, Foundations of Computational Mathematics, 9(6), 717-772.]

• Matrix completion methods (http://perception.csl.uiuc.edu/matrix-rank/sample_code.html#MC)
– Singular Value Thresholding
– OptSpace
– Accelerated Proximal Gradient
– Subspace Evolution and Transfer
– GROUSE
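To make the idea of completing a low-rank matrix concrete, here is a minimal sketch. It is not one of the methods listed above: it is a plain rank-1 alternating least squares fit, chosen only because it fits in a few lines. The function name `complete_rank_one` and the small example matrix are hypothetical, and 0 marks an unsampled entry as on the previous slide.

```python
def complete_rank_one(M, n_iters=50):
    """Fill the 0 (unsampled) entries of M with a rank-1 fit u[i] * v[j].

    Illustration only: alternating least squares on the sampled entries,
    not Singular Value Thresholding or any other method named on the slide.
    """
    n_rows, n_cols = len(M), len(M[0])
    u = [1.0] * n_rows
    v = [1.0] * n_cols
    for _ in range(n_iters):
        # Update each row factor using only the sampled entries of that row.
        for i in range(n_rows):
            num = sum(M[i][j] * v[j] for j in range(n_cols) if M[i][j] != 0)
            den = sum(v[j] ** 2 for j in range(n_cols) if M[i][j] != 0)
            if den > 0:
                u[i] = num / den
        # Update each column factor using only the sampled entries of that column.
        for j in range(n_cols):
            num = sum(M[i][j] * u[i] for i in range(n_rows) if M[i][j] != 0)
            den = sum(u[i] ** 2 for i in range(n_rows) if M[i][j] != 0)
            if den > 0:
                v[j] = num / den
    # Keep sampled entries as they are; fill the rest from the rank-1 fit.
    return [[M[i][j] if M[i][j] != 0 else u[i] * v[j] for j in range(n_cols)]
            for i in range(n_rows)]


# Hypothetical rank-1 matrix [1,2,3] x [1,2,4] with two entries hidden (set to 0).
sampled = [
    [1, 2, 0],
    [2, 4, 8],
    [0, 6, 12],
]
completed = complete_rank_one(sampled)
```

On this toy input the two hidden entries should come back close to 4 and 3. The rating matrix on the Netflix slide is far too sparsely sampled for such a fit to be meaningful, which is why the theory above asks for enough sampled entries.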

Page 5

Transactional Database, (0,1)-Matrix, and Bipartite Graph

Transactional database:

Transaction  Items
1            Bread, Diaper, Eggs
2            Beer, Coke, Apples
3            Bread, Milk, Beer, Coke
4            Diaper, Eggs, Apples
5            Bread, Beer, Coke

(0,1)-matrix:

   Bread  Milk  Diaper  Beer  Eggs  Coke  Apples
1  1      0     1       0     1     0     0
2  0      0     0       1     0     1     1
3  1      1     0       1     0     1     0
4  0      0     1       0     1     0     1
5  1      0     0       1     0     1     0

Bipartite graph: transactions 1 to 5 on one side, items (Bread, Milk, Diaper, Beer, Eggs, Coke, Apples) on the other, with an edge wherever the matrix entry is 1.
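The three views on this slide carry the same information. A minimal sketch of the conversion, using the slide's transactions (the helper names `to_matrix` and `to_edges` are illustrative, not from the paper):

```python
transactions = {
    1: ["Bread", "Diaper", "Eggs"],
    2: ["Beer", "Coke", "Apples"],
    3: ["Bread", "Milk", "Beer", "Coke"],
    4: ["Diaper", "Eggs", "Apples"],
    5: ["Bread", "Beer", "Coke"],
}
items = ["Bread", "Milk", "Diaper", "Beer", "Eggs", "Coke", "Apples"]


def to_matrix(transactions, items):
    """One row per transaction, one column per item; 1 means the item was bought."""
    return [[1 if item in bought else 0 for item in items]
            for _, bought in sorted(transactions.items())]


def to_edges(transactions):
    """Bipartite graph as an edge list of (transaction id, item) pairs."""
    return [(tid, item)
            for tid, bought in sorted(transactions.items())
            for item in bought]


matrix = to_matrix(transactions, items)
edges = to_edges(transactions)
```

Each 1 in the matrix is exactly one edge of the graph, so the number of edges equals the number of 1 entries.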

Page 6

Question: Can a (0,1)-matrix be completed?

   Bread  Milk  Diaper  Beer  Eggs  Coke  Apples
1  1      0     1       0     1     0     0
2  0      0     0       1     0     1     1
3  1      1     0       1     0     1     0
4  0      0     1       0     1     0     1
5  1      0     0       1     0     1     0

Consider each transaction to be a customer. What is each customer's attitude towards un-purchased items (i.e., the 0 entries)?

It does not make good sense to use the sampling model of matrix completion here, i.e., to treat each non-zero as a sampled entry and each zero as an unsampled entry.

Page 7

Our proposal: (0,1)-matrix transformation

• An entry is evaluated by its supporting patterns (independent evidence).
• P is a supporting pattern for entry (i,j) if and only if P covers (i,j) and M(x,y)=1 for every entry (x,y) ∈ P\{(i,j)}.
• Since the value of (i,j) is not considered for a supporting pattern, the supporting patterns of an entry are independent of the entry's value.
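The definition can be checked mechanically. The sketch below assumes a pattern P is a rectangular block given by a set of rows and a set of columns, which matches the bicliques used on later slides but is an assumption here. The function name `is_supporting_pattern` is illustrative, and indices are 0-based over the transaction example.

```python
# Rows: transactions 1..5; columns: Bread, Milk, Diaper, Beer, Eggs, Coke, Apples.
M = [
    [1, 0, 1, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 1, 1],
    [1, 1, 0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1, 0, 1],
    [1, 0, 0, 1, 0, 1, 0],
]


def is_supporting_pattern(M, rows, cols, i, j):
    """True if the block rows x cols covers (i, j) and every other cell in it is 1.

    Assumption: a pattern is a rectangular block (row set x column set).
    The value of M[i][j] itself is never inspected.
    """
    if i not in rows or j not in cols:
        return False
    return all(M[x][y] == 1
               for x in rows for y in cols
               if (x, y) != (i, j))


# Entry (transaction 2, Bread) is 0, yet transactions 3 and 5 bought
# Bread, Beer and Coke, and transaction 2 bought Beer and Coke.
supported = is_supporting_pattern(M, {1, 2, 4}, {0, 3, 5}, 1, 0)

# Transaction 1 did not buy Beer, so this block is not a supporting pattern.
not_supported = is_supporting_pattern(M, {0, 1}, {0, 3}, 1, 0)
```

Because the check skips the cell (i, j), flipping that cell between 0 and 1 cannot change the answer, which is the independence property stated in the last bullet.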

Page 8

Support Pattern Measurement used in this work

Biomedical Informatics question: How can we efficiently transform M into F, defined above, such that F can unbiasedly predict the unknown gene-phenotype relationships?

Page 9

Find supporting patterns and calculate F(i,j) for one entry

Find supporting patterns for the magenta entry (4,d):

   a b c d e f g h i j
1  1 0 1 0 0 1 1 0 0 1
2  1 1 0 1 1 0 1 1 0 1
3  0 0 1 1 0 1 0 0 1 1
4  0 1 1 0 1 1 1 0 0 0
5  1 0 1 1 0 0 0 1 1 0
6  0 1 0 1 1 1 1 0 1 1
7  1 1 0 0 1 0 1 1 1 0
8  1 0 0 1 0 1 0 1 0 0

Submatrix for entry (4,d): rows 2, 3, 5, 6, 8 (the rows with a 1 in column d) and columns b, c, e, f, g (the columns with a 1 in row 4):

   b c e f g
2  1 0 1 0 1
3  0 1 0 1 0
5  0 1 0 0 0
6  1 0 1 1 1
8  0 0 0 1 0

Find the maximum edge biclique: F(4,d)=6
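The worked example can be reproduced by brute force. This sketch assumes, consistent with F(4,d)=6 on the slide, that F(i,j) is the edge count of the maximum edge biclique inside the submatrix. The helper names are illustrative, indices are 0-based (row 4 is index 3, column d is index 3), and enumerating all column subsets is only feasible for tiny inputs.

```python
from itertools import combinations

M = [
    [1, 0, 1, 0, 0, 1, 1, 0, 0, 1],
    [1, 1, 0, 1, 1, 0, 1, 1, 0, 1],
    [0, 0, 1, 1, 0, 1, 0, 0, 1, 1],
    [0, 1, 1, 0, 1, 1, 1, 0, 0, 0],
    [1, 0, 1, 1, 0, 0, 0, 1, 1, 0],
    [0, 1, 0, 1, 1, 1, 1, 0, 1, 1],
    [1, 1, 0, 0, 1, 0, 1, 1, 1, 0],
    [1, 0, 0, 1, 0, 1, 0, 1, 0, 0],
]


def support_submatrix(M, i, j):
    """Rows (other than i) with a 1 in column j, columns (other than j) with a 1 in row i."""
    rows = [x for x in range(len(M)) if x != i and M[x][j] == 1]
    cols = [y for y in range(len(M[0])) if y != j and M[i][y] == 1]
    return rows, cols


def max_edge_biclique(M, rows, cols):
    """Try every non-empty column subset and keep the all-ones block with most cells."""
    best = (0, [], [])
    for k in range(1, len(cols) + 1):
        for subset in combinations(cols, k):
            covered = [x for x in rows if all(M[x][y] == 1 for y in subset)]
            edges = len(covered) * len(subset)
            if edges > best[0]:
                best = (edges, covered, list(subset))
    return best


rows, cols = support_submatrix(M, 3, 3)
score, best_rows, best_cols = max_edge_biclique(M, rows, cols)
```

Here `score` is 6, reached by rows 2 and 6 with columns b, e and g in the slide's labeling.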

Page 10

Maximal biclique and maximum edge biclique

• A biclique is maximal if it cannot be extended.
• A maximum edge biclique is a maximal biclique with the maximum number of edges.
• Listing all maximal bicliques is an NP-hard problem. Finding one maximum edge biclique is NP-hard too.

Page 11

Solutions for listing all maximal bicliques

• Association Rule Mining
– Frequent Itemset: an itemset whose support is no less than a minimum support (minsup) threshold. In the transaction example, with minsup=3, {Beer} is a frequent itemset, and so is {Beer, Coke}.
– (Frequent) Closed Itemset: an itemset is closed if none of its immediate supersets has the same support as the itemset.
– Maximal Frequent Itemset: an itemset is maximal frequent if it is frequent and none of its immediate supersets is frequent.
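The three definitions can be tried directly on the transaction example. This is a brute-force sketch for illustration only (it enumerates every itemset, which real miners avoid); the variable and function names are not from the paper.

```python
from itertools import combinations

transactions = [
    {"Bread", "Diaper", "Eggs"},
    {"Beer", "Coke", "Apples"},
    {"Bread", "Milk", "Beer", "Coke"},
    {"Diaper", "Eggs", "Apples"},
    {"Bread", "Beer", "Coke"},
]
items = sorted(set().union(*transactions))
minsup = 3


def support(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def immediate_supersets(itemset):
    """All itemsets obtained by adding exactly one more item."""
    return [itemset | {x} for x in items if x not in itemset]


all_itemsets = [frozenset(c)
                for k in range(1, len(items) + 1)
                for c in combinations(items, k)]

frequent = [s for s in all_itemsets if support(s) >= minsup]
# Closed: every immediate superset has strictly smaller support.
frequent_closed = [s for s in frequent
                   if all(support(sup) < support(s) for sup in immediate_supersets(s))]
# Maximal frequent: no immediate superset is frequent.
maximal_frequent = [s for s in frequent
                    if all(support(sup) < minsup for sup in immediate_supersets(s))]
```

With minsup=3 the frequent itemsets are {Bread}, {Beer}, {Coke} and {Beer, Coke}. {Beer} is not closed because {Beer, Coke} has the same support of 3, so the frequent closed itemsets are {Bread} and {Beer, Coke}.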

Page 12

Solutions for listing all maximal bicliques

• A closed itemset together with its supporting transaction set corresponds exactly to a maximal biclique in the corresponding bipartite graph.
• Use frequent closed itemsets to approximate closed itemsets.
• MAFIA: mines frequent itemsets, frequent closed itemsets, and maximal frequent itemsets. http://himalaya-tools.sourceforge.net/Mafia/

(Figure: the (0,1)-matrix of the five-transaction example and its bipartite graph, with transactions 1 to 5 on one side and the items Bread, Milk, Diaper, Beer, Eggs, Coke, Apples on the other.)
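The correspondence in the first bullet can be seen on the five-transaction example. The sketch below is a brute-force stand-in for a closed itemset miner such as MAFIA, not MAFIA itself: it intersects every group of transactions to get a closed itemset, then collects all transactions containing that itemset. The function name `maximal_bicliques` is illustrative.

```python
from itertools import combinations

transactions = {
    1: {"Bread", "Diaper", "Eggs"},
    2: {"Beer", "Coke", "Apples"},
    3: {"Bread", "Milk", "Beer", "Coke"},
    4: {"Diaper", "Eggs", "Apples"},
    5: {"Bread", "Beer", "Coke"},
}


def maximal_bicliques(transactions):
    """Return {(supporting transaction set, closed itemset)} for all closed itemsets.

    Exponential in the number of transactions; for illustration only.
    """
    found = set()
    tids = sorted(transactions)
    for k in range(1, len(tids) + 1):
        for group in combinations(tids, k):
            # Items shared by every transaction in the group form a closed itemset.
            common = set.intersection(*(transactions[t] for t in group))
            if not common:
                continue
            # Its supporting transactions are all transactions containing it.
            supporters = frozenset(t for t in tids if common <= transactions[t])
            found.add((supporters, frozenset(common)))
    return found


bicliques = maximal_bicliques(transactions)
```

For this example the result has 9 pairs, for instance transactions {2, 3, 5} with items {Beer, Coke}: neither side can be extended without breaking the all-ones block.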

Page 13

Solution Summary for one entry (i,j)

• Construct the submatrix corresponding to the entry (i,j).
• Use a frequent closed itemset mining tool to build the frequent closed itemsets (set the support threshold as low as the computer can handle).
• Build the supporting transactions for the frequent closed itemsets; this yields all the candidate maximal bicliques.
• Find the maximum edge biclique and get the F(i,j) value.

Page 14

How about all entries?

• The previous solution is for one entry. How about all entries in an m×n matrix?
• Simply repeating the previous calculation m×n times is not a wise choice.

Page 15

IndEvi Algorithm in a Nutshell

• Assume the input is the set of maximal bicliques of the original (0,1)-matrix.
• Project each maximal biclique horizontally and vertically.

Let C be the maximal biclique shown by the shaded area.

Can you figure out how to calculate F_C(i,j) for an entry (i,j)?

• Each entry remembers the largest F_C(i,j) with respect to all Cs.

Please refer to the paper for the algorithm details.
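One plausible reading of the projection idea is sketched below; it is not the paper's algorithm, whose details are only in the paper. The assumptions are that F_C(i,j) is the number of cells in the part of C restricted to rows (other than i) with a 1 in column j and columns (other than j) with a 1 in row i, and that F(i,j) is the maximum over all C. All function names are illustrative, and the maximal bicliques come from a brute-force enumeration rather than a closed itemset miner.

```python
from itertools import combinations

M = [
    [1, 0, 1, 0, 0, 1, 1, 0, 0, 1],
    [1, 1, 0, 1, 1, 0, 1, 1, 0, 1],
    [0, 0, 1, 1, 0, 1, 0, 0, 1, 1],
    [0, 1, 1, 0, 1, 1, 1, 0, 0, 0],
    [1, 0, 1, 1, 0, 0, 0, 1, 1, 0],
    [0, 1, 0, 1, 1, 1, 1, 0, 1, 1],
    [1, 1, 0, 0, 1, 0, 1, 1, 1, 0],
    [1, 0, 0, 1, 0, 1, 0, 1, 0, 0],
]


def maximal_bicliques(M):
    """Brute-force stand-in for a closed itemset miner (fine for 8 rows only)."""
    n_rows, n_cols = len(M), len(M[0])
    found = set()
    for k in range(1, n_rows + 1):
        for group in combinations(range(n_rows), k):
            cols = frozenset(y for y in range(n_cols) if all(M[x][y] for x in group))
            if cols:
                rows = frozenset(x for x in range(n_rows) if all(M[x][y] for y in cols))
                found.add((rows, cols))
    return found


def score_all_entries(M, bicliques):
    """For every entry keep the largest F_C(i, j) over all maximal bicliques C."""
    n_rows, n_cols = len(M), len(M[0])
    F = [[0] * n_cols for _ in range(n_rows)]
    for rows, cols in bicliques:
        # Horizontal projection: per row, how many columns of C hold a 1.
        row_hits = [sum(M[i][y] for y in cols) for i in range(n_rows)]
        # Vertical projection: per column, how many rows of C hold a 1.
        col_hits = [sum(M[x][j] for x in rows) for j in range(n_cols)]
        for i in range(n_rows):
            for j in range(n_cols):
                # Leave out row i and column j so the value of (i, j) never counts.
                r = col_hits[j] - (M[i][j] if i in rows else 0)
                c = row_hits[i] - (M[i][j] if j in cols else 0)
                F[i][j] = max(F[i][j], r * c)
    return F


F = score_all_entries(M, maximal_bicliques(M))
```

Under these assumptions `F[3][3]`, the entry (4,d) of the worked example, comes out as 6, and the maximal bicliques are enumerated once for the whole matrix rather than once per entry.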

Page 16

IndEviRe Algorithm: Independent Evidence Reconstruction

• The IndEvi algorithm ensures that each entry (i,j) remembers its largest F_C(i,j) value and the corresponding reference to C in the set of maximal bicliques.
• The IndEviRe algorithm reconstructs the supporting pattern according to the reference and the value of (i,j).
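A sketch of what reconstruction from a stored reference could look like, under stated assumptions: the reference identifies a maximal biclique C as a (row set, column set) pair, and the supporting pattern is the part of C compatible with row i and column j, plus row i and column j themselves. The function name `reconstruct_pattern` is hypothetical and the paper's IndEviRe may differ in detail. Indices are 0-based over the 8×10 example matrix.

```python
M = [
    [1, 0, 1, 0, 0, 1, 1, 0, 0, 1],
    [1, 1, 0, 1, 1, 0, 1, 1, 0, 1],
    [0, 0, 1, 1, 0, 1, 0, 0, 1, 1],
    [0, 1, 1, 0, 1, 1, 1, 0, 0, 0],
    [1, 0, 1, 1, 0, 0, 0, 1, 1, 0],
    [0, 1, 0, 1, 1, 1, 1, 0, 1, 1],
    [1, 1, 0, 0, 1, 0, 1, 1, 1, 0],
    [1, 0, 0, 1, 0, 1, 0, 1, 0, 0],
]


def reconstruct_pattern(M, biclique, i, j):
    """Rebuild the supporting pattern of (i, j) from a referenced maximal biclique.

    Assumption: keep the rows of C with a 1 in column j and the columns of C
    with a 1 in row i, then add row i and column j so the pattern covers (i, j).
    """
    rows, cols = biclique
    pattern_rows = {x for x in rows if x != i and M[x][j] == 1} | {i}
    pattern_cols = {y for y in cols if y != j and M[i][y] == 1} | {j}
    return pattern_rows, pattern_cols


# Maximal biclique rows {2, 4, 6, 7} x columns {b, e, g} in the slide's labels.
C = (frozenset({1, 3, 5, 6}), frozenset({1, 4, 6}))
pattern_rows, pattern_cols = reconstruct_pattern(M, C, 3, 3)
```

For entry (4,d) this yields rows {2, 4, 6} and columns {b, d, e, g} in the slide's labels: row 7 drops out because it has a 0 in column d.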

Page 17

Key theorem: unbiased predicting

Page 18

Application in Prioritizing Human Disease Genes

• Transactional data: gene-to-phenotype (G2P) dataset from http://human-phenotype-ontology.org (10/03/2010)
• Closed itemset generator: MAFIA, http://himalaya-tools.sourceforge.net/Mafia/
• Platform: Linux, C++, STL
• Cross-validation platform (10/04/2010): www.geneanswers.com (GACOM)

Page 19

Measurement: Fold Enrichment

Intuitively, fold enrichment measures how well known disease genes are ranked among all genes.

Page 20

Results

• Among all 34503 (=|E|) known gene-phenotype relations, 4598 (=|E'|), i.e., 13.3264% (=x%), have their gene ranked among the top 0.1107% (=y%) of the 1807 candidate genes for the phenotype, achieving a 120.4 (x/y=13.3264/0.1107) fold enrichment.

(Figure: results plotted against rank cutoff.)
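The arithmetic behind the 120.4 figure can be checked in a few lines. The rank cutoff of 2 is an inference, not stated on the slide: 2 out of 1807 candidate genes is 0.1107%, which matches y. The function name `fold_enrichment` is illustrative.

```python
def fold_enrichment(hits, total_relations, rank_cutoff, n_candidates):
    """Return (fold, x%, y%) where x = hits / total and y = cutoff / candidates."""
    x = hits / total_relations        # fraction of known relations recovered
    y = rank_cutoff / n_candidates    # fraction of candidates inside the cutoff
    return x / y, 100 * x, 100 * y


# |E'| = 4598, |E| = 34503, 1807 candidate genes; cutoff 2 is an assumption.
fold, x_pct, y_pct = fold_enrichment(4598, 34503, 2, 1807)
```

With these inputs x is about 13.3264%, y is about 0.1107%, and the ratio is about 120.4, in line with the slide.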

Page 21

Case Study: Colon Cancer

Page 22

Case Study: Breast Cancer

Page 23

Case Study: Osteoarthritis

• Supporting pattern (by IndEviRe) for TNXB: {COL3A1, COL5A1, COL5A2, TNXB} × {AUTOSOMAL DOMINANT INHERITANCE, ECCHYMOSES, JOINT DISLOCATION, MITRAL VALVE PROLAPSE, SOFT SKIN, OSTEOARTHRITIS}
• Supporting pattern (by IndEviRe) for VWF: {COL3A1, COL5A1, COL5A2, TNXB, VWF} × {AUTOSOMAL DOMINANT INHERITANCE, ECCHYMOSES, MITRAL VALVE PROLAPSE, OSTEOARTHRITIS}

Page 24

Conclusion

• The supporting patterns for an entry in a (0,1)-matrix are a good resource for knowledge inference.
• Frequent closed itemset mining provides a practical platform for solving our problems.
• The IndEvi and IndEviRe algorithms can efficiently calculate the F score and reconstruct the evidence for any entry, given the maximal bicliques as input. The result for an entry is independent of its original value (0 or 1). Only one call of frequent closed itemset mining on the original matrix is necessary.
• Readers may revise the F function for different applications.
• The algorithm is simple to implement, and the result is easy to analyze. Our method has a wide range of applications.
• The study on human gene-phenotype data shows that our method is efficient and effective.

Page 25

Thanks!

Questions?

