Redescription Mining: Algorithms and Applications in Bioinformatics · 2020-01-18 · Redescription...

Redescription Mining: Algorithms and Applications in Bioinformatics

Deept Kumar

Department of Computer Science

Virginia Tech, Blacksburg, VA 24061

Dissertation submitted to the Faculty of the Virginia Polytechnic Institute and State University

in partial fulfillment of the requirements for the degree of

Doctor of Philosophy in Computer Science

Examining Committee:

Naren Ramakrishnan, Chair
Malcolm Potts
Richard Helm
T. M. Murali
Chris North

April 19, 2007

Blacksburg, Virginia

Keywords: redescriptions, redescription mining, storytelling, bioinformatics.

Copyright © 2007, Deept Kumar

Redescription Mining: Algorithms and Applications in Bioinformatics

Deept Kumar

Abstract

Scientific data mining purports to extract useful knowledge from massive datasets curated through computational science efforts, e.g., in bioinformatics, cosmology, geographic sciences, and computational chemistry. In the recent past, we have witnessed major transformations of these applied sciences into data-driven endeavors. In particular, scientists are now faced with an overload of vocabularies for describing domain entities. All of these vocabularies offer alternative and mostly complementary (sometimes, even contradictory) ways to organize information and each vocabulary provides a different perspective into the problem being studied. To further knowledge discovery, computational scientists need tools to help uniformly reason across vocabularies, integrate multiple forms of characterizing datasets, and situate knowledge gained from one study in terms of others.

This dissertation defines a new pattern class called redescriptions that provides high level capabilities for reasoning across domain vocabularies. A redescription is a shift of vocabulary, or a different way of communicating the same information; redescription mining finds concerted sets of objects that can be defined in (at least) two ways using given descriptors. We present the CARTwheels algorithm for mining redescriptions by exploiting equivalences of partitions induced by distinct descriptor classes as well as applications of CARTwheels to several bioinformatics datasets. We then outline how we can build more complex data mining operations by cascading redescriptions to realize a story, leading to a new data mining capability called storytelling. Besides applications to characterizing gene sets, we showcase its uses in other datasets as well. Finally, we extend the core CARTwheels algorithm by introducing a theoretical framework, based on partitions, to systematically explore redescription space; generalizing from mining redescriptions (and stories) within a single domain to relating descriptors across different domains, to support complex relational data mining scenarios; and exploiting structure of the underlying descriptor space to yield more effective algorithms for specific classes of datasets.

In loving memory of:

Chandrani Lamgora (1925–1997)
Shailesh Lamgora (1963–1997)
Kamla Srivastava (1907–1999)
Dr. H.C. Srivastava (1929–2000)
Namrata Srivastava (1974–2002)
Shanti Srivastava (1923–2003)

Acknowledgments

I would like to begin by thanking Dr. Naren Ramakrishnan, my friend, philosopher and guide in the truest sense of the words in the past five years. With the PhD being my first degree in Computer Science, Naren has been extremely patient with me as I made myself familiar with the field starting from the very basics. His encouragement and vote of confidence have been instrumental in helping me bring about the change in thought process required of me over the course of my PhD. Many of the ideas developed during the course of this research work (notably, the CARTwheels algorithm for redescription mining and the notion of essential tree pairs) have been hugely influenced by Naren’s inputs.

An important aspect of my research work has been the applications of techniques developed in Bioinformatics. This part of my work benefited hugely from the inputs of Dr. Malcolm Potts and Dr. Richard Helm who helped me make sense of the computational results obtained. Additionally, they have provided the motivation for many of the research questions considered during the course of my PhD.

My collaboration on two different research projects with Dr. T. M. Murali has been of immense help in broadening my understanding of current Bioinformatics research. Special thanks go out to Murali for his keen interest in helping me solve some of the programming issues I faced at various stages of my work.

My interaction with Dr. Chris North has pushed me to consider better and more informative visualization for the results obtained from my research work. The myoglobin project with Dr. Alexey Onufriev has opened my eyes to bioinformatics research related to structural biology. My sincere thanks go out to Dr. Lenwood Heath and Dr. Ruth Greene for initiating me into the field of Bioinformatics and providing me with the fundamental concepts I utilized during the course of my research.

I have had the wonderful opportunity to interact with and collaborate with fellow Computer Science students including Saverio Perugini, Alan Sioson, Dan Moisa, Greg Grothaus, Satish Tadepalli, Jory Z. Ruscio, and Joseph Gresock among others. Vibha Singhal, a staff member in the department, was also a great help in the initial phases of my research work. From the Biochemistry department, Steve Slaughter helped me procure the data used for parts of my research work. A special thanks to Steve for serving on my defense committee as a proxy member for Dr. Malcolm Potts.

Outside of my work, my long stay in Blacksburg was made enjoyable by the company of my friends including Nilanjan Saha, Maulik Shukla, Ashwini Shukla, KP, Gautam Vasudevan, Gautam Verma, Priya Bhavasar, Amlan Dasgupta, Rajan Paradkar, Arul Isai Imran, and Prakriti Parijat among others. Above all, Rekha, with whom I shared the most wonderful seven years I could imagine, has been the one constant that has helped me get through the good and not so good times in Blacksburg.

Last, but certainly not least, I wish to express my gratitude towards all members of my family who have supported me during the course of my academic life. Special thanks are in order for my parents Dr. G. K. Srivastava and Shashi Srivastava and my two lovely sisters Ritu and Ruchika who have shown an unwavering belief in me that has helped me complete this long cherished dream of mine.

Contents

1 Introduction
  1.1 The setting for redescription mining
  1.2 Research Questions
  1.3 Original contributions of this work
  1.4 Outline of Document

2 The CARTwheels Redescription Mining Algorithm
  2.1 Redescription Mining as Alternating Tree Induction
    2.1.1 Working Example
    2.1.2 Why does CARTwheels work?
    2.1.3 Configuring Alternations in CARTwheels
    2.1.4 The CARTwheels Algorithmic Framework
    2.1.5 Assessing Significance of Mined Redescriptions
    2.1.6 Implementation Details
  2.2 Applications in Bioinformatics
    2.2.1 Datasets
    2.2.2 Descriptor Definition
    2.2.3 CARTwheels Configuration
    2.2.4 Example Redescriptions
    2.2.5 Effect of ρ and η
  2.3 Discussion

3 Survey of Related Research
  3.1 Attribute-Value Learning Techniques
    3.1.1 Association Rule Mining
    3.1.2 Decision Trees
  3.2 Relational Mining
    3.2.1 ILP
    3.2.2 Propositionalization
  3.3 Clustering
    3.3.1 Categorical Clustering
    3.3.2 Coupled Clustering
    3.3.3 Cluster Ensembles
    3.3.4 Conceptual Clustering
  3.4 Constructive induction
  3.5 Co-training
  3.6 Profiling methods
  3.7 Schema matching
  3.8 Model management
  3.9 Summary – A Case for Redescription Mining

4 Using Redescriptions to Elucidate Stress Response Pathways
  4.1 Experiment Overview
    4.1.1 Dataset
    4.1.2 CARTwheels Configuration
  4.2 Understanding Desiccation Tolerance
    4.2.1 Recovery/Rehydration Phase
    4.2.2 Desiccation Tolerance and Sulfur Metabolism
  4.3 Discussion

5 Algorithms for Storytelling
  5.1 Introduction
  5.2 Background
  5.3 Designing a Storyteller
    5.3.1 Working Example
    5.3.2 Implementation
    5.3.3 Heuristic
    5.3.4 Data structures
    5.3.5 Assessing Significance of Stories
  5.4 Experimental Results
    5.4.1 Word Overlaps
    5.4.2 Gene Sets
    5.4.3 PubMed Abstracts
  5.5 Related Research
  5.6 Discussion

6 Systematizing the Exploration of Redescriptions in CARTwheels
  6.1 Sources of Redundancies in CARTwheels
  6.2 Partition Space
  6.3 From CARTs to Partitions
  6.4 Redescriptions in Partition Space
    6.4.1 CARTwheels in Partition Space
  6.5 Essential Partitions
    6.5.1 Border of Essential Partition Pairs
    6.5.2 Essential Tree Pairs
  6.6 Approach to Mine Essential Tree Pairs
    6.6.1 Traversal Strategies
    6.6.2 Proposed Solution Strategy
  6.7 Algorithm Details
    6.7.1 Further Curtailing the Space of Trees Considered
  6.8 Case Study
  6.9 Discussion

7 Extensions and Novel Applications of Redescription Mining
  7.1 Redescription Mining in Structured Descriptor Spaces
    7.1.1 Problem Statement
    7.1.2 Solution Strategy
    7.1.3 Case studies
  7.2 Redescription Mining with Many-Many Relationships
    7.2.1 Problem Statement
    7.2.2 Solution Strategy
    7.2.3 Results
  7.3 Constrained Storytelling
    7.3.1 Problem Statement
    7.3.2 Solution Strategy
    7.3.3 Results
  7.4 Cross-taxonomic Comparisons using Redescriptions
    7.4.1 Problem Statement
    7.4.2 Results
  7.5 Shattering
    7.5.1 Problem Statement
    7.5.2 Solution Strategy
    7.5.3 Results
  7.6 Discussion

8 Conclusion

Bibliography

Vita

List of Figures

1.1 Six descriptors defined over a universal set of countries.
1.2 Dependencies between research questions studied in this document.

2.1 Example data for illustrating operation of CARTwheels algorithm.
2.2 Dataset to initialize CARTwheels algorithm and the first classification tree induced.
2.3 Dataset for second and third iteration of CARTwheels algorithm.
2.4 Alternating tree growing in the CARTwheels algorithm.
2.5 Contour plot depicting best attainable Jaccard’s coefficient, for different set sizes.
2.6 Seven example redescriptions mined using CARTwheels.
2.7 Frequency plot of descriptor sizes for universal set G1, G2, and G3, respectively.
2.8 Precision for redescriptions mined vs. ρ for universal set G1, G2, and G3, respectively.
2.9 Total number of redescriptions mined vs. ρ for universal set G1, G2, and G3, respectively.

4.1 Results of redescription analysis of desiccation and rehydration in yeast.
4.2 Redescription analysis and sulfur metabolism.

5.1 An example input to storytelling.
5.2 Example data for illustrating operation of storytelling algorithm.
5.3 Behavior of distance metric used for decision tree construction.
5.4 Storytelling using CARTwheels alternation.
5.5 Dataset for the first and second alternation.
5.6 The working of our heuristic function in A* search.
5.7 The graph G (used to prove admissibility of heuristic function) for our running example.
5.8 Fraction of stories mined and average time required to mine stories as a function of story length for different values of θ using L5 and L10.
5.9 Fraction of stories mined and average time required to mine stories as a function of story length for different values of b using L5 and L10.
5.10 Comparison of the behavior of heuristic based search with BFS for L5 and L10.
5.11 A significant story among gene sets from protein modification to hexokinase.
5.12 Fraction of stories mined and average time required to mine stories as a function of story length, for different values of θ and b in the geneset study.
5.13 Comparison of the behavior of heuristic based search with BFS for the geneset study.
5.14 An example significant story among PubMed abstracts relating chemical stresses.
5.15 An example significant story among PubMed abstracts around cAMP levels.
5.16 Fraction of stories mined and average time required to mine stories as a function of story length, for different values of θ and b using PubMed documents.
5.17 Comparison of the behavior of heuristic based search with BFS for PubMed documents.

6.1 Precision for redescriptions mined vs. number of unique redescriptions mined.
6.2 Various examples of partitions of universal set O.
6.3 A part of the lattice induced by partitions of O. The partitions at the same level have the same number of blocks.
6.4 X and Y descriptors used to explain mappings between tree and partitions.
6.5 Modeling partitions induced by decision trees.
6.6 An example of a logically reducible tree.
6.7 The first two trees constructed using (a) Y descriptors and (b) X descriptors in our running example used for the partitions-based analysis.
6.8 The partition lattice that corresponds to the tree in Fig. 6.7(a).
6.9 The partition lattice that corresponds to the tree in Fig. 6.7(b).
6.10 The decoupling process used to find essential tree pairs.
6.11 Plot showing count of total number of essential tree pairs mined and non-essential trees encountered in our case study.
6.12 Plot showing count of total number of redescriptions and unique redescriptions mined as part of the essential tree pairs in our case study.

7.1 Examples of structures seen in descriptors encountered in bioinformatics datasets.
7.2 The effect of structure in descriptor space on the decision trees constructed.
7.3 Plot showing the effect of utilizing descriptor space structure in mining redescriptions for study involving GO CEL categories and clustering based descriptors for yeast.
7.4 Plot showing the effect of utilizing descriptor space structure in mining redescriptions for study involving GO CEL and GO MOL categories for all genes in yeast.
7.5 A homologous set of genes motivating the problem of generalizing Jaccard’s coefficient to many-many relationships.
7.6 The degree distribution for genes used in worm and fly for our cross-genomic expression data analysis.
7.7 Behavior of clustering factor and node coverage for all redescriptions possible using descriptors from worm and fly for our cross-genomic expression data analysis.
7.8 Redescription R1CG for our cross-genomic study.
7.9 Redescription R2CG for our cross-genomic study.
7.10 The degree distribution for transcription factors and the regulated genes for our choice of universal set.
7.11 Behavior of clustering factor and node coverage for all redescriptions possible using descriptors defined over transcription factors and regulated genes in yeast.
7.12 Redescription R1TF for our transcriptional regulation study.
7.13 Redescription R2TF for our transcriptional regulation study.
7.14 An ER diagram depicting the flow of data starting from transcription factors to feedback molecules.
7.15 Constrained story R1CS showing clusters from gene expression data considered in a sequential order of four temporal windows.
7.16 The constrained story R2CS based on a pre-specified model.
7.17 Examples of cross-taxonomic redescription across the GO category hierarchies.
7.18 Examples of redescriptions that hold with Jaccard’s coefficient of 1 for more than 1 species.
7.19 Shattering using two descriptors f1 = {a1, a2} and f2 = {a1, a3}.
7.20 Relationship between Shatterability threshold value and Shatter Index.
7.21 Four connected components formed by gene pairs that have a Shatterability of 0 in our study.

List of Tables

2.1 CARTwheels algorithmic framework.
2.2 Summary of universal sets and descriptors.

4.1 Descriptor vocabularies and domains (SGD refers to the Saccharomyces genome database).

5.1 Storytelling algorithmic framework.
5.2 Heuristic for storytelling A* search.

6.1 CARTwheels algorithmic framework for mining essential tree pairs.

7.1 Summary of input GO categories for the 6 species considered.
7.2 Summary of redescriptions obtained for the 6 species.
7.3 Pairwise overlap between redescriptions obtained for the 6 species.
7.4 GO categories involved in redescriptions mined for all 6 species.

Chapter 1

Introduction

Data mining is a field with historic roots in the business analytics and commercial data processing disciplines. Two classical examples of business applications of data mining include analysis of transactional point-of-sale data and capturing trends in stock market trades. The Internet explosion has provided access to new forms of data, such as web logs, and fostered applications such as automated recommender systems.

Over the last two decades, increased computational capabilities have accelerated the generation and accumulation of huge amounts of scientific data [74], and have consequently set the stage for data mining in scientific domains. Some of the important scientific domains in which data mining methodologies have been applied include geology, geophysics, astrophysics, cosmology, chemical and materials engineering, and bioinformatics [74]. Datasets stemming from scientific experiments are typically larger and less homogeneous than in the business domain. As a result, although most of the business-oriented data mining algorithms work for scientific datasets, special methodologies are required to work with scientific data more effectively and efficiently.

There are two types of scientific analysis tasks that can be supported through data mining [12]. Discovery driven mining can be used to drive the observation→hypothesis step in scientific methodology, while verification driven mining supports the hypothesis→experiment step. Of the two, the observation→hypothesis step requires the development of more sophisticated methodologies, specific to the data type, for purposes of knowledge discovery.

Discovery-driven mining is usually a two-step process. The first step is that of data reduction which involves cataloging, classification, segmentation or partitioning of the data [13]. This step produces manageable datasets, amenable to further analysis. Once the data is in a reduced (or catalog) form, the second step involves applying learning algorithms to derive hypotheses from it. The outputs of learning algorithms can often be viewed as a further form of data reduction, especially if we adopt the view that mined patterns are summarizers or compressed forms of the original datasets.

The reduced forms of data are the sine qua non of all data-driven sciences today; in fact, these disciplines are drowning in not just the dimensionality of data, but also in the multitude of descriptors available for characterizing data. As an example, consider gene expression studies using bioinformatics approaches. The universal set of genes in a given organism (O) can be studied in many ways, such as functional categorizations, expression level quantification using microarrays, protein interactions, and participation in biological pathways. Each of these methodologies provides a different vocabulary to define subsets of O (e.g., ‘genes localized in cellular compartment nucleus’, ‘genes up-expressed two-fold or more in heat stress’, ‘genes encoding for proteins that form the immunoglobulin complex’, and ‘genes involved in glucose biosynthesis’). These subsets of O can be thought of as descriptors defined using a particular vocabulary. It is important to note that there are typically no restrictions on the way descriptors can be defined. They could be as systematic or objective as the ones described above. However, very subjective descriptors, like ‘a biologist’s favorite genes’ or ‘genes that a biologist believes could show interesting behavior together’, are also equally valid. It is important to develop data mining approaches that can accommodate such a diversity of descriptors in a unified manner. This research work introduces redescription mining as a principled methodology that uses available descriptor vocabularies to mine patterns called redescriptions.

Countries with > 200 Nobel prize winners = {US}
Countries with > 150 billionaires = {US}
Countries with history of communism = {China, Russia}
Countries with defense budget > $30 billion = {France, Germany, Japan, UK, US}
Permanent members of the U.N. Security Council = {China, France, Russia, UK, US}
Countries with declared nuclear arsenals = {China, France, India, Israel, Pakistan, Russia, UK, US}

Figure 1.1: Six descriptors defined over a universal set of countries.

1.1 The setting for redescription mining

A redescription is a shift-of-vocabulary, or a different way of communicating information about a given subset of data. The goal of redescription mining is to find subsets of data that afford multiple descriptions.

Formally, the inputs to redescription mining are the universal set of objects $O$ and two sets ($X$ and $Y$) of subsets of $O$. The elements of $X$ are the descriptors $X_i$, and are assumed to form a covering of $O$ ($\bigcup_i X_i = O$). Similarly, $\bigcup_i Y_i = O$. The only requirement of a descriptor is that it be a proper subset of $O$ and denote some logical grouping of the underlying objects (for ease of interpretation). The goal of redescription mining is to find equivalence relationships of the form $Z_j \Leftrightarrow Z_k$ that hold at or above a given Jaccard’s coefficient $\theta$ (i.e., $\frac{|Z_j \cap Z_k|}{|Z_j \cup Z_k|} \geq \theta$), where $Z_j$ and $Z_k$ are set-theoretic expressions involving $X_i$’s and $Y_i$’s, respectively. For tractability purposes, some restrictions on the length of the allowable set-theoretic expressions (not their form) are assumed to be provided. Redescription mining hence involves constructive induction (the task of inventing new features) and exhibits traits of both unsupervised and supervised learning. It is unsupervised because it finds conceptual clusters underlying data, and it can be viewed as supervised because clusters defined using descriptors are given meaningful characterizations (in terms of other descriptors).

Consider the set of all countries in the world. The elements of this set can be described in various ways, e.g., geographical location, political status, scientific capabilities, and economic prosperity. Such features allow the definition of various subsets of the given (universal) set, as descriptors. Examples of these are shown in Fig. 1.1. A redescription involves a subset definable in two ways, for instance

‘Countries with > 200 Nobel prize winners’ ⇔ ‘Countries with > 150 billionaires’

This redescription involves two descriptors, and says that the countries with more than 200 Nobel prize winners are also those countries with more than 150 billionaires. One country satisfies both descriptors, namely U.S.A., and hence it can be said to have been redescribed. The strength of the redescription is given by the symmetric Jaccard’s coefficient as 1/1 = 1. Descriptors on either side of a redescription can involve more than one entity, e.g.,

‘Countries with defense budget > $30 billion’ ⇔ ‘Permanent members of U.N. Security Council’

This redescription is only approximate, however, since the left descriptor contains U.S.A., U.K., Japan, France, Germany and the right descriptor represents U.S.A., U.K., Russia, France, China. The two descriptors share three countries out of the seven in their union, so the Jaccard’s coefficient is 3/7 ≈ 0.429. To strengthen redescriptions such as the one above, more selective descriptors can be used:

‘Countries with declared nuclear arsenals’ ⇔ ‘Permanent members of U.N. Security Council’

which improves the Jaccard’s coefficient to 5/8 = 0.625 since the left descriptor now represents U.S.A., U.K., Russia, France, China, India, Israel, Pakistan. Another approach to strengthening is to form set-theoretic operations (union, intersection, difference) involving the given descriptors; e.g., the redescription

‘Countries with defense budget > $30 billion’ ∩ ‘Countries with declared nuclear arsenals’ ⇔ ‘Permanent members of U.N. Security Council’ − ‘Countries with history of communism’

Here, a set intersection has been constructed on the left and a set difference on the right, from the given descriptors, yielding a redescription for the 3-element set: {US, UK, France}.
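The Jaccard’s coefficients quoted for the country example can be checked with a few lines of Python. This is only a sketch for verification: the descriptor sets are transcribed from Fig. 1.1, while the short dictionary keys and the helper name `jaccard` are illustrative choices, not part of the original work.

```python
from fractions import Fraction

# Descriptors transcribed from Figure 1.1; the short keys are illustrative names.
DESCRIPTORS = {
    "nobel": {"US"},
    "billionaires": {"US"},
    "communism": {"China", "Russia"},
    "defense": {"France", "Germany", "Japan", "UK", "US"},
    "unsc": {"China", "France", "Russia", "UK", "US"},
    "nuclear": {"China", "France", "India", "Israel",
                "Pakistan", "Russia", "UK", "US"},
}


def jaccard(left, right):
    """Return |left & right| / |left | right| as an exact fraction."""
    union = left | right
    if not union:
        raise ValueError("Jaccard's coefficient is undefined for two empty sets")
    return Fraction(len(left & right), len(union))


d = DESCRIPTORS
results = {
    "nobel <=> billionaires": jaccard(d["nobel"], d["billionaires"]),
    "defense <=> unsc": jaccard(d["defense"], d["unsc"]),
    "nuclear <=> unsc": jaccard(d["nuclear"], d["unsc"]),
    # Intersection on the left, set difference on the right.
    "defense & nuclear <=> unsc - communism": jaccard(
        d["defense"] & d["nuclear"], d["unsc"] - d["communism"]
    ),
}

for label, value in results.items():
    print(f"{label}: {value} (about {float(value):.3f})")
```

Exact fractions are used so that the values can be compared directly with the ratios given in the text.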

A typical approach to mining such patterns would be to first fix the form of the set-theoretic expressions and then search within the space of possible instantiations. The goal of this work is to investigate an algorithmic framework (CARTwheels [75]) that simultaneously constructs set-theoretic expressions and searches in the space of possible redescriptions. CARTwheels is described in detail in subsequent chapters, but some initial remarks are made at this point.
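For contrast, the fix-the-form-then-search baseline mentioned above might look like the following sketch. It is not CARTwheels. Several things here are assumptions made purely for illustration: the expression form (a single descriptor, or one binary operation over two descriptors), the split of the Fig. 1.1 descriptors into two vocabularies `X` and `Y`, and all function names.

```python
from itertools import combinations

# Assumed split of four Figure 1.1 descriptors into two vocabularies.
X = {
    "defense": {"France", "Germany", "Japan", "UK", "US"},
    "nuclear": {"China", "France", "India", "Israel",
                "Pakistan", "Russia", "UK", "US"},
}
Y = {
    "unsc": {"China", "France", "Russia", "UK", "US"},
    "communism": {"China", "Russia"},
}


def jaccard(left, right):
    """Jaccard's coefficient; defined as 0.0 here when both sets are empty."""
    union = left | right
    return len(left & right) / len(union) if union else 0.0


def expressions(vocabulary):
    """Yield (label, set) for each descriptor and each two-descriptor combination."""
    for name, members in vocabulary.items():
        yield name, members
    for (n1, s1), (n2, s2) in combinations(vocabulary.items(), 2):
        yield f"({n1} & {n2})", s1 & s2
        yield f"({n1} | {n2})", s1 | s2
        yield f"({n1} - {n2})", s1 - s2
        yield f"({n2} - {n1})", s2 - s1


def fixed_form_search(x_vocab, y_vocab, theta):
    """Enumerate every X-expression/Y-expression pair with Jaccard >= theta."""
    hits = []
    for x_label, x_set in expressions(x_vocab):
        for y_label, y_set in expressions(y_vocab):
            if not x_set or not y_set:
                continue  # an empty expression describes nothing
            score = jaccard(x_set, y_set)
            if score >= theta:
                hits.append((score, x_label, y_label))
    return sorted(hits, reverse=True)


for score, x_label, y_label in fixed_form_search(X, Y, theta=0.6):
    print(f"{score:.3f}  {x_label} <=> {y_label}")
```

The number of candidate expressions grows quickly with the number of descriptors and the allowed expression length, which is one reason to prefer a method that constructs expressions while it searches.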

CARTwheels exploits one property of redescription space and two important properties of binary decision trees (CARTs) to mine redescriptions. The property of redescription space it exploits is that redescriptions never occur in isolation, but in pairs. For instance, the redescription Zj ⇔ Zk coexists with ¬Zj ⇔ ¬Zk. The properties of trees that are exploited are more subtle. First, if the nodes in such a tree correspond to boolean membership variables of the given descriptors, then we can interpret paths to represent set intersections, differences, or complements; unions of paths would correspond to disjunctions. Second, a partition of paths in the tree corresponds to a partition of objects. All three properties are employed in CARTwheels, which grows two trees in opposite directions so that they are joined at the leaves. Essentially, one tree exposes a partition of objects via its choice of subsets and the other tree tries to grow to match this partition using a different choice of subsets. If partition correspondence is established, then paths that join can be read off as redescriptions. CARTwheels explores the space of possible tree matchings via an alternation process whereby trees are repeatedly re-grown to match the partitions exposed by the other tree. By suitably configuring this alternation, we aim to guarantee, with non-zero probability, that any redescription existing in the dataset would be found.


Figure 1.2: Dependencies between research questions studied in this document. Each arrow points towards the dependent research question(s). [Diagram nodes: RQ1 (Chap. 4); RQ2 (Chap. 5); RQ3 (Chap. 6); RQ4, RQ5, RQ6, RQ7, RQ8 (Chap. 7).]

1.2 Research Questions

The redescription mining approach as described leads to a number of possible avenues for further research. A few of these possibilities will be explored in the proposed work. In particular, eight research questions are considered (see Fig. 1.2), covering algorithmic, application, and generalization aspects of redescriptions. Our domain contexts are primarily drawn from bioinformatics although we outline examples from other domains as well. We begin with applications and compositional uses of the basic CARTwheels algorithm. In particular,

RQ1: How can we utilize redescriptions as starting points to gain domain-specific insight into datasets?

Chapter 4 describes a case study that elucidates stress response pathways by building upon concerted sets of genes identified via redescriptions. We show how a systematic ‘expansion’ of redescriptions into pathways, by following interaction relationships, helps gain valuable insight into molecular mechanisms underlying the given datasets.

Though the aim of redescription mining is to find redescriptions with very high Jaccard’s coefficient (preferably 1), redescriptions that hold with Jaccard’s coefficient < 1 also find application in many domains. An approximate redescription implies a common meeting ground for two concerted communities of objects. A chain of such approximate redescriptions can effectively relate two subsets that have nothing in common! This is especially useful in storytelling and link analysis applications. A query such as ’what is the relationship between people traveling on Flight 847 and the top 10 wanted list by the FBI?’ can be posed in terms of redescription finding.
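To see how a chain of approximate redescriptions can connect two disjoint sets, consider the following toy sketch. The four sets and the names `story`, `link_strengths`, and `end_to_end` are made up purely for illustration and do not come from any dataset discussed in this work.

```python
def jaccard(a, b):
    """Jaccard's coefficient of two non-empty sets."""
    return len(a & b) / len(a | b)

# A hypothetical chain of object sets; neighbours overlap, the end points do not.
story = [{1, 2, 3}, {2, 3, 4}, {3, 4, 5}, {4, 5, 6}]

# Strength of each approximate redescription between consecutive sets.
link_strengths = [jaccard(s, t) for s, t in zip(story, story[1:])]

# Direct comparison of the two end points of the chain.
end_to_end = jaccard(story[0], story[-1])

print(link_strengths, end_to_end)
```

Every link in the chain holds at a Jaccard’s coefficient of 0.5, yet the first and last sets share no objects at all.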

In storytelling, the two sets that need to be connected are provided as input. The redescription mining algorithm starts with one of the descriptors as the descriptor of choice in the decision tree constructed. During the alternations that follow, the underlying algorithm should be able to find intermediate descriptors that maneuver the exploration towards the second descriptor. This requires an exploration policy that is able to direct the alternation in a focused and task-specific manner:

RQ2: How can storytelling be modeled in terms of redescription mining?

In Chapter 5 we present a novel algorithm that uses redescription mining as a building block in an A* search to mine stories. This demonstrates the validity of not just storytelling but the emergence of compositional data mining to build complex algorithms from simpler primitives.


The above two questions are addressed without any significant changes to the basic CARTwheels approach. A deeper question is to be able to systematize CARTwheels alternation so that redundant redescriptions are avoided in the exploration strategy:

RQ3: How can we systematize the search for redescriptions in CARTwheels?

In Chapter 6, we present an approach that systematizes CARTwheels alternation by organizing search along the lattice of partitions induced by descriptors and path expressions involving descriptors. This helps control any potential redundancy issues by ensuring that any CART is explored at most once and arranging a ‘basis set’ of these CARTs along a border separating redescribable from non-redescribable partitions.

The two threads of research above (compositionality and systematized partition-based search) are variously combined in the next set of research questions. All these questions are addressed in Chapter 7.

Quite frequently, there is a structure underlying the descriptor space that can be fruitfully exploited by the data mining algorithm. Examples are the biological functional category hierarchies (e.g., GO) where there is a set-containment structure to the categories. There is a critical need to push structural information deeper into the algorithm design:

RQ4: How can structural characteristics of descriptor space be exploited in redescription mining?

The Jaccard’s coefficient, currently employed in CARTwheels, utilizes foreign key relationships to assess the strength of redescriptions across descriptors. The measure, as described, works only for the case where there is a one-to-one relationship between entities in the participating descriptors. However, this might not always be the case. A motivating example can be found in homology relationships: one gene can have more than one ortholog in another organism. Thus, the next research question is intended to extend the basic capability of redescription mining to accommodate such relationships:

RQ5: How can redescription mining be generalized to accommodate many-many relationships in the underlying data?

Having extended redescription mining to many-many relationships, we focus on loosening the single-domain restriction to allow for redescriptions across multiple domains. An example would be to look for redescriptions between genes and proteins, given a mapping between the two. This approach can be easily generalized to span a whole set of domains in a given entity-relationship (ER) diagram.

RQ6: How can redescription mining be used to relate entities from different domains?

Finally, we outline two novel applications of redescriptions by addressing the questions:

RQ7: Can we use redescription mining as a basis for cross-taxonomic comparisons?

We present the results of an extensive study of redescriptions across the three GO taxonomies and comparisons of the sets of redescriptions across organisms.

RQ8: Can we use redescription mining to study how a given set of descriptors ‘shatters’ the underlying object set?

We define new measures such as the shatterability threshold and the shatter index that aid in identifying ‘modules’ of highly conserved properties over a space of descriptors.


1.3 Original contributions of this work

This research work:

• introduces and formalizes the data mining task of redescription mining

• presents the CARTwheels algorithmic framework for mining redescriptions

• presents extensions and applications of the CARTwheels algorithmic framework

• introduces and formalizes the data mining task of mining stories

• presents the storytelling algorithm for mining stories

• introduces the notion of mining constrained stories

• describes how redescription and its extensions help characterize scientific datasets

1.4 Outline of Document

We begin with an introduction to the CARTwheels algorithmic framework for mining redescriptions [75] in Chapter 2. This chapter provides an in-depth understanding of the task and an approach for mining redescriptions which is then contrasted with related work in Chapter 3. In Chapter 4, we describe an application of CARTwheels to elucidate stress response pathways using results and data from biological experiments [83]. The Storytelling algorithm [37] is introduced in Chapter 5. In Chapter 6, we define an approach to systematize the search for redescriptions using CARTwheels. Chapter 7 deals with extensions to redescription mining and some novel applications of interest in Bioinformatics. Chapter 8 serves as the conclusion for this research work.


Chapter 2

The CARTwheels Redescription Mining Algorithm

In this chapter, we introduce the CARTwheels redescription mining algorithm. CARTwheels grows two trees in opposite directions so that they are joined at the leaves. It thus exploits the duality between class partitions and path partitions in an induced classification tree to model and mine redescriptions. Algorithm design decisions, implementation details, and experimental results are presented.

2.1 Redescription Mining as Alternating Tree Induction

Classification and regression trees (CARTs) were among the earliest proposed approaches for pattern classification and data mining [9]. While being powerful in terms of accuracy and efficiency of induction, their results are also simple to understand as they mimic the decision-making logic of human experts. The renewed emphasis on data mining propagated by the KDD community in the 1990s has fueled a resurgence of interest in tree-based methods [18, 23].

Our use of CARTs is different from classical supervised learning situations because we use CARTs as a way to learn maps between descriptors, rather than from features to classes. We work with two trees at every stage of the algorithm: the decision conditions in the first tree (say, top) are based on set membership checks in entries from X and the bottom tree is based on membership checks in entries from Y; thus matching of leaves corresponds to a potential redescription. This idea hence uses paths in the classification trees as representations of boolean expressions involving the descriptors.

CARTwheels is an alternating algorithm, in that the top tree is initially fixed and the bottom tree is grown to match it. Next, the bottom tree is fixed, and the top tree is re-grown. This process continues, spouting redescriptions along the way, until designated stopping criteria are met.

2.1.1 Working Example

For ease of illustration, consider the artificial example in Fig. 2.1 that shows two sets of descriptors for the universal set O = {o1, o2, o3, o4, o5}. Here, the set X corresponds to the set of descriptors {X1, X2, X3, X4} and Y corresponds to {Y1, Y2, Y3, Y4}. The cardinalities of X and Y may not be the same in the general case. Further, in a realistic application, the number of descriptors would far exceed the number of objects.

X1 = { o2, o3 }        Y1 = { o1, o2 }
X2 = { o3, o4 }        Y2 = { o2, o3, o4 }
X3 = { o2, o4 }        Y3 = { o3, o5 }
X4 = { o1, o5 }        Y4 = { o1, o2, o5 }

Figure 2.1: Example data for illustrating operation of CARTwheels algorithm.

object  Y1  Y2  Y3  Y4  class
o1      √   ×   ×   √   X4
o2      √   √   ×   √   X1
o3      ×   √   √   ×   X1
o4      ×   √   ×   ×   X2
o5      ×   ×   √   √   X4

Y3
  yes -> Y2
    yes -> X1
    no  -> X4
  no  -> Y1
    yes -> X4
    no  -> X2

Figure 2.2: (left) Dataset to initialize CARTwheels algorithm. (right) Induced classification tree.

To initialize the CARTwheels alternation, we prepare a traditional dataset for classification tree induction, where the entries correspond to the objects, the boolean features are derived from one of X or Y, and the classes are derived from the other. In the dataset shown in Fig. 2.2 (left), the features correspond to set membership in entries of Y and each object is assigned a unique class, chosen from the Xi’s it participates in. We employed a greedy set covering of the objects using the entries of X in order to establish the class labels in Fig. 2.2 (left). For instance, o2 belongs to both X1 and X3, but the tie is broken in favor of X1. Notice that in this process, X3 does not receive any representation in the prepared dataset.
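The class labels of Fig. 2.2 (left) can be reproduced with a small greedy set cover. The sketch below is one possible reading of that step, not the actual implementation: the function name `greedy_class_labels` is hypothetical, and the tie-breaking rule (prefer the descriptor listed first) is an assumption that happens to agree with the labels in the figure.

```python
def greedy_class_labels(objects, descriptors):
    """Give every object a single class label via greedy set covering.

    descriptors maps a name to a set of objects; ties between equally
    good descriptors go to the one listed first (an assumed rule).
    """
    uncovered = set(objects)
    labels = {}
    while uncovered:
        # Pick the descriptor covering the most still-uncovered objects.
        best = max(descriptors, key=lambda name: len(descriptors[name] & uncovered))
        newly_covered = descriptors[best] & uncovered
        if not newly_covered:
            break  # the remaining objects belong to no descriptor
        for obj in newly_covered:
            labels[obj] = best
        uncovered -= newly_covered
    return labels

O = ["o1", "o2", "o3", "o4", "o5"]
X = {
    "X1": {"o2", "o3"},
    "X2": {"o3", "o4"},
    "X3": {"o2", "o4"},
    "X4": {"o1", "o5"},
}

labels = greedy_class_labels(O, X)
print(sorted(labels.items()))
```

With this ordering the cover selects X1, then X4, then X2, so X3 never appears as a class label.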

A classification tree can now be grown using any of the impurity measures studied in the literature (e.g., entropy, Gini index, misclassification rate). Fig. 2.2 (right) depicts a possible tree. The leaves of the tree deterministically predict a class label from X, typically the majority class. At this point, the specific details of how the tree was induced are not important, only that any such tree will induce a partition of the underlying objects. In this case, the tree induces a 3-partition which mirrors the 3-class partition present in the original dataset, but is not exactly the same. The leftmost path corresponds to the region Y3 ∩ Y2, the rightmost path corresponds to O − Y3 − Y1, and the union of the two middle paths gives (Y3 − Y2) ∪ (Y1 − Y3). The reader can verify that these regions do not have a one-to-one correspondence with the regions X1, X2, and X4 in the original partition. In fact, only X2 enjoys such a correspondence, with O − Y3 − Y1. In ‘reading off’ a partition from a tree in this manner, a conjunction thus results from a path of length > 1, a disjunction results from multiple paths predicting the same class, with negations corresponding to following the ‘no’ branch from a given node. This partition is used as the starting point for the alternation (Fig. 2.4, first frame).
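The reading-off procedure can be traced mechanically. In the sketch below the tree of Fig. 2.2 (right) is encoded as nested tuples (descriptor, yes-subtree, no-subtree) with class labels at the leaves; this encoding and the helper name `leaf_regions` are assumptions made for illustration only.

```python
O = {"o1", "o2", "o3", "o4", "o5"}
Y = {
    "Y1": {"o1", "o2"},
    "Y2": {"o2", "o3", "o4"},
    "Y3": {"o3", "o5"},
    "Y4": {"o1", "o2", "o5"},
}
# Tree of Fig. 2.2 (right): (descriptor, yes-subtree, no-subtree); leaves are labels.
tree = ("Y3", ("Y2", "X1", "X4"), ("Y1", "X4", "X2"))
# Class partition of Fig. 2.2 (left), obtained from the greedy labelling.
class_partition = {"X1": {"o2", "o3"}, "X2": {"o4"}, "X4": {"o1", "o5"}}

def leaf_regions(node, objects, sets):
    """Map each predicted class to the union of objects reaching its leaves."""
    if isinstance(node, str):  # a leaf: every object arriving here gets this label
        return {node: set(objects)}
    name, yes_child, no_child = node
    regions = {}
    branches = ((yes_child, objects & sets[name]), (no_child, objects - sets[name]))
    for child, subset in branches:
        for label, members in leaf_regions(child, subset, sets).items():
            # Paths predicting the same class are union-ed (a disjunction).
            regions.setdefault(label, set()).update(members)
    return regions

regions = leaf_regions(tree, O, Y)
matching = [label for label in regions if regions[label] == class_partition[label]]
print(regions, matching)
```

The region predicted as X4 is the union of two paths, and only the region predicted as X2 coincides with its class in the original partition.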

obj.  X1  X2  X3  X4  class
o1    ×   ×   ×   √   (Y3 − Y2) ∪ (Y1 − Y3)
o2    √   ×   √   ×   (Y3 − Y2) ∪ (Y1 − Y3)
o3    √   √   ×   ×   Y3 ∩ Y2
o4    ×   √   √   ×   O − Y3 − Y1
o5    ×   ×   ×   √   (Y3 − Y2) ∪ (Y1 − Y3)

obj.  Y1  Y2  Y3  Y4  class
o1    √   ×   ×   √   (X3 ∩ X1) ∪ (X4 − X3)
o2    √   √   ×   √   (X3 ∩ X1) ∪ (X4 − X3)
o3    ×   √   √   ×   (O − X3 − X4)
o4    ×   √   ×   ×   (X3 − X1)
o5    ×   ×   √   √   (X3 ∩ X1) ∪ (X4 − X3)

Figure 2.3: (top) Dataset for second iteration of CARTwheels algorithm. Notice that class labels are now set-theoretic expressions involving Yi’s. (bottom) Dataset for third iteration of CARTwheels algorithm.

[Figure 2.4 graphic: three frames showing the top tree (root Y3, children Y2 and Y1), the bottom tree grown to match it (root X3, children X1 and X4), and the re-grown top tree (root Y4, then Y3).]

Figure 2.4: Alternating tree growing in the CARTwheels algorithm. The alternation begins with a tree (first frame) defining set-theoretic expressions to be matched. The bottom tree is then grown to match the top tree (second frame), which is then fixed, and the top tree is re-grown (third frame). Colored arrows indicate the matching paths. Redescriptions corresponding to matching paths at every stage are read off and subjected to evaluation by Jaccard’s coefficient.

We now prepare a dataset with entries from X as the features and the regions thus formed (involving Yi’s) as the classes, as shown in Fig. 2.3 (top). Inducing a classification tree from this dataset really corresponds to growing a second tree to match the first tree at the leaves, as depicted in Fig. 2.4 (second frame). In this case, the second tree also learns a 3-partition and we can evaluate each of these matchings using the Jaccard’s measure. This produces three redescriptions:

(X3 ∩ X1) ∪ (X4 − X3) ⇔ (Y3 − Y2) ∪ (Y1 − Y3)

(X3 − X1) ⇔ (O − Y3 − Y1)

(O − X3 − X4) ⇔ (Y3 ∩ Y2)

all of which hold at Jaccard’s coefficient 1. This need not be the case in general. The bottom tree might be able to match only some paths in the top tree, or the matches might not pass our Jaccard’s cutoff. This process is then continued, now with Yi’s as features and the partitions derived from the bottom tree as classes (see bottom of Fig. 2.3). The new matchings yield the redescriptions:

(X3 ∩ X1) ∪ (X4 − X3) ⇔ Y4

(O − X3 − X4) ⇔ (Y3 − Y4)

(X3 − X1) ⇔ (O − Y3 − Y4)

which, fortuitously, also have a Jaccard’s coefficient of 1. Notice that, this time, the root decision node that has been picked is Y4 (see third frame of Fig. 2.4) and the tree actually resembles a decision list (a tree where every internal node has a leaf on its ‘yes’ branch). The alternation can be continued (see Sec. 2.1.4 for ways to configure the search).
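The six redescriptions mined in the second and third iterations can be verified directly against the data of Fig. 2.1. The following check is an illustrative sketch; the variable names are ours.

```python
O = {"o1", "o2", "o3", "o4", "o5"}
X1, X2, X3, X4 = {"o2", "o3"}, {"o3", "o4"}, {"o2", "o4"}, {"o1", "o5"}
Y1, Y2, Y3, Y4 = {"o1", "o2"}, {"o2", "o3", "o4"}, {"o3", "o5"}, {"o1", "o2", "o5"}

def jaccard(a, b):
    """Jaccard's coefficient; two empty sets score 0 by the convention assumed here."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# (left side, right side) pairs read off the matched trees.
second_iteration = [
    ((X3 & X1) | (X4 - X3), (Y3 - Y2) | (Y1 - Y3)),
    (X3 - X1, O - Y3 - Y1),
    (O - X3 - X4, Y3 & Y2),
]
third_iteration = [
    ((X3 & X1) | (X4 - X3), Y4),
    (O - X3 - X4, Y3 - Y4),
    (X3 - X1, O - Y3 - Y4),
]

scores = [jaccard(left, right) for left, right in second_iteration + third_iteration]
print(scores)
```

All six pairs evaluate to identical object sets on both sides, confirming a Jaccard’s coefficient of 1 in each case.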

If we limit the size of the trees at every iteration, it is easy to see that the set-expressions constructed cannot get arbitrarily long. In our running example, we use a depth limit of 2 so that all expressions on either side of a mined redescription can involve at most three descriptors. The longest expressions result from unions of two paths involving different subtrees.

2.1.2 Why does CARTwheels work?

The use of trees to mine one-directional implications (rules) is well understood and is the idea behind algorithms such as C4.5 [72]. In CARTwheels, we exploit the duality between class partitions and path partitions to posit the stronger notion of equivalence. In fact, if a tree reduces the entropy to zero, it is clear that there must be a one-to-one correspondence between its path partitions and class partitions, which are really path partitions from the other tree. Keep in mind that different paths are union-ed when they predict the same class, and this property is crucial to establishing the duality.

The search for redescriptions in CARTwheels can be viewed as a problem of identifying (and creating) correlated random variables. We present a simple analysis for the case of a one-level tree. A descriptor, e.g., D, can be considered to be a discrete random variable that takes on values from O. Every object in D occurs with probability 1/|D| and other objects occur with probability zero, to yield a total probability mass of 1. Notice that this makes the self-entropy of such a random variable the logarithm of the size of the descriptor. Now consider running a CARTwheels alternation with a depth limit of 1 for the classification trees. Mining a redescription with Jaccard’s coefficient of 1 is equivalent to identifying a random variable D′ whose entropy distance from D is zero. The entropy distance is given by:

H(D, D′) − I(D; D′)

where H(D, D′) is the joint entropy of {D, D′} and I denotes the mutual information, in turn given by:

I(D; D′) = H(D) − H(D|D′)

where H(D) is the self-entropy of D and H(D|D′) is the conditional entropy of D given D′. In other words, for an exact redescription, the average reduction in uncertainty about D due to knowing D′ is exactly the self-entropy of D, causing an entropy distance of 0. Entropy distance is a true distance measure, unlike measures such as the Kullback-Leibler (KL) divergence. Smaller values of entropy distance hence imply higher values of Jaccard’s coefficient.

Figure 2.5: Contour plot depicting best attainable Jaccard’s coefficient, for different set sizes.
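The zero-distance condition can be made explicit with two standard identities: the chain rule for joint entropy and the definition of mutual information used above. The short derivation below is a sketch under the uniform-distribution assumption stated in the text, added for illustration.

```latex
\begin{align*}
H(D) &= -\sum_{o \in D} \frac{1}{|D|} \log \frac{1}{|D|} = \log |D| \\
H(D, D') - I(D; D')
  &= \bigl[ H(D) + H(D' \mid D) \bigr] - \bigl[ H(D) - H(D \mid D') \bigr] \\
  &= H(D' \mid D) + H(D \mid D')
\end{align*}
```

Both conditional entropies are non-negative, so the distance vanishes exactly when each variable determines the other, i.e., H(D|D′) = H(D′|D) = 0, which is the case where I(D; D′) = H(D).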

2.1.3 Configuring Alternations in CARTwheels

CARTwheels provides a general framework to explore a space of redescriptions; to configure its alternation, there are several issues to be considered.

We will begin by observing that the continuation of CARTwheels alternation, after mining a redescription, is really an attempt to explore and stay within a relatively small region of high Jaccard’s coefficient. Fig. 2.5 shows an idealized scenario where descriptors (or expressions derived from them) occur in all possible sizes, with the best possible overlaps. In a realistic dataset, the regions of high Jaccard’s coefficient might be disjoint, and a good exploration policy must try to visit all potential regions.

In contrast to traditional classification tree induction, which is aimed at reducing entropy, CARTwheels must actually maintain entropy in some form, since impurity drives exploration. However, if the impurity in the underlying datasets remains constant, some redescriptions are bound to be found over and over again. The trade-off here is clearly between exploration and redundancy: to support sufficient exploration, we must accept redundancy, and conversely if we desire to reduce redundancy, we must settle for insufficient coverage of the redescription space. This trade-off suggests that a tunable parameter for CARTwheels alternation is the number of times that a descriptor is allowed to participate in redescriptions.

2.1.4 The CARTwheels Algorithmic Framework

Table 2.1 describes the CARTwheels algorithmic framework in detail. The outline follows the example shown previously: construct_dataset prepares a dataset suitable for CART induction as in Fig. 2.2; construct_tree creates the decision tree of depth d; and paths_to_classes reads expressions off an induced tree, to be used as classes in the next step for each object in O. Notice the use of an impurify function in both the initialization and the alternation steps, which typically assigns the second-best class label to the chosen leaf l. Additional impurification steps, to aid exploration, are included in our implementations of construct_tree (e.g., we do not always branch on the attribute with the best entropy gain and sometimes perform randomized moves at the root level).

Table 2.1: CARTwheels algorithmic framework.

Input: objects O, descriptor sets {Xi}, {Yi}
Output: redescriptions R
Parameters:
    θ (Jaccard’s coefficient), d (depth of trees),
    ρ (# of class participations allowed/descriptor), and
    η (max. # of consecutive unsuccessful alternations).

Initialization:
    set answer set R = {}
    set class participation counts for all {Xi}, {Yi} = 0
    set feature set F = {Yi}; set classes C = {Xi}
    set dataset D = construct_dataset(O, F, C)
    set tree t = construct_tree(D, d)
    if (all leaves in t have same class c ∈ C)
        set l = random leaf in t having non-zero entropy
        impurify(t, l)
    endif
    C = paths_to_classes(t)
    flag = false; count = 0

Alternation:
    G = {Xi}
    while (count < η)
        F = G
        if (flag = false)
            {Xi} = G; G = {Yi}
        else
            {Yi} = G; G = {Xi}
        endif
        D = construct_dataset(O, F, C)
        t = construct_tree(D, d)
        if (all leaves in t have same class c ∈ C)
            set l = random leaf in t having non-zero entropy
            impurify(t, l)
        endif
        Rnew = eval(t, θ)
        if (Rnew = {})
            count = count + 1
        else
            count = 0
            foreach c ∈ C
                if c is involved in some r ∈ Rnew
                    H = descriptors(c)
                    foreach descriptor g ∈ G ∩ H
                        increase g’s class participation count
                        if g’s class participation count > ρ
                            remove g from G
                        endif
                    endfor
                endif
            endfor
        endif
        R = R ∪ Rnew; flag = not(flag)
        C = paths_to_classes(t)
    end while

The eval function returns redescriptions satisfying the Jaccard’s threshold θ. Our implementation of eval requires redescriptions to hold in both the mined and complementary forms, e.g., for the equivalence Z1 ∪ Z2 ⇔ Z3 to be considered as a redescription, it must hold with Jaccard’s coefficient at least θ, as must its complement: ¬Z1 ∩ ¬Z2 ⇔ ¬Z3. This ensures that every redescription truly induces a partition of O × O space. The descriptors function analyzes a set-theoretic expression and returns the set of descriptors participating in it.
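The two-sided requirement on mined redescriptions can be sketched as follows. This is not the actual eval implementation: the function name `holds_in_both_forms`, the toy universes, and the convention for empty sets are all assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard's coefficient; two empty sets score 0 by the convention assumed here."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def holds_in_both_forms(left, right, universe, theta):
    """Accept left <=> right only if it and its complement both reach theta."""
    mined = jaccard(left, right)
    complement = jaccard(universe - left, universe - right)
    return mined >= theta and complement >= theta

# Toy case 1: both forms are strong (3/4 and 6/7).
big_universe = set(range(1, 11))
accepted = holds_in_both_forms({1, 2} | {3}, {1, 2, 3, 4}, big_universe, theta=0.5)

# Toy case 2: the mined form scores 4/5 but the complements do not overlap.
small_universe = {1, 2, 3, 4, 5}
rejected = holds_in_both_forms({1, 2, 3, 4}, {1, 2, 3, 4, 5}, small_universe, theta=0.5)

print(accepted, rejected)
```

The second case shows why the complement check matters: a high coefficient on the mined form alone can hide a complement that is not redescribed at all.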

The important tunable parameter in Table 2.1 is ρ, controlling the trade-off between redundancy and exploration. A participation count is incremented each time a given descriptor appears in a redescription in its role as part of a class, and when this reaches ρ, the descriptor is removed from consideration. The parameter η specifies the maximum number of consecutive alternations that CARTwheels can go through without mining any redescriptions.
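The participation bookkeeping can be sketched in a few lines. The helper below follows the test as written in Table 2.1 (a descriptor is dropped once its count exceeds ρ); the function name `update_participation` and the example descriptor names are hypothetical.

```python
from collections import Counter

def update_participation(candidates, counts, class_descriptor_sets, rho):
    """Update class participation counts after an alternation that mined redescriptions.

    class_descriptor_sets holds, for each class expression involved in a new
    redescription, the set of descriptor names appearing in that expression.
    """
    for used in class_descriptor_sets:
        # sorted() copies the intersection, so removing from candidates is safe.
        for g in sorted(candidates & used):
            counts[g] += 1
            if counts[g] > rho:
                candidates.discard(g)
    return candidates

G = {"Y1", "Y2", "Y3"}
counts = Counter()

update_participation(G, counts, [{"Y2", "Y3"}], rho=1)
after_first = set(G)  # nothing exceeds rho yet

update_participation(G, counts, [{"Y1", "Y3"}], rho=1)
print(after_first, G, dict(counts))
```

With ρ = 1, the descriptor Y3 is retired after its second participation, which steers later alternations towards descriptors that have been used less often.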

2.1.5 Assessing Significance of Mined Redescriptions

To assess the significance of a redescription Zi ⇔ Zj, we use the cumulative hypergeometric distribution [91] to determine the probability of obtaining, by chance, a rate of co-occurrence of Zi and Zj (over the object domain) at least as high as the observed rate, given their marginal occurrence probabilities. This reduces to the (symmetric) probability that a randomly constructed Zj of size |Zj| has an overlap of at least |Zi ∩ Zj| with the fixed set Zi:

\[
1 - \sum_{k=0}^{|Z_i \cap Z_j| - 1} \frac{\binom{|Z_i|}{k} \binom{|O| - |Z_i|}{|Z_j| - k}}{\binom{|O|}{|Z_j|}} \tag{2.1}
\]

Observe that one way to get a strong p-value would be to have very small sizes for Zi and Zj (which in turn, make the achievement of a respectable Jaccard’s coefficient difficult). On the other hand, if Zi and Zj are large, the ease with which they could overlap increases, and hence even high Jaccard’s coefficients might not correspond to a strong p-value. Therefore, for interpretation purposes, it is important to not think of intersection size as a surrogate for significance of redescriptions. In the experiments reported in the next section, statistically significant redescriptions involving as few as 1 object to as many as 80 objects are presented.
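Equation 2.1 can be evaluated directly with exact binomial coefficients. The sketch below uses only the Python standard library (`math.comb`, available from Python 3.8); the function name `redescription_p_value` and the toy sizes are assumptions for illustration.

```python
from math import comb

def redescription_p_value(n_objects, size_i, size_j, overlap):
    """Probability that a random set of size_j shares at least `overlap`
    objects with a fixed set of size_i, both drawn from n_objects objects."""
    total = comb(n_objects, size_j)
    # Probability mass of overlaps strictly smaller than the observed one.
    below = sum(
        comb(size_i, k) * comb(n_objects - size_i, size_j - k)
        for k in range(overlap)
    )
    return 1 - below / total

# Toy case: 5 objects, two descriptors of size 3 that coincide exactly.
p_exact_match = redescription_p_value(5, 3, 3, 3)

# An overlap of zero is always attained, so the probability is 1.
p_no_overlap = redescription_p_value(5, 3, 3, 0)

print(p_exact_match, p_no_overlap)
```

In the toy case only one of the ten possible 3-element sets matches the fixed set, so the probability is 1/10. Even a perfect match is therefore unremarkable on a universe this small.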

2.1.6 Implementation Details

CARTwheels is implemented in C++ atop a Postgres database providing access to the descriptors. We use an AD-tree data structure [57] for fast counting purposes and estimation of entropy (this is distinct from the classification tree that combines the descriptors). The AD-tree provides access to the distributions of ‘class labels’ for every combination of ‘features’ and, since the definitions of features and class labels change at every iteration, is rebuilt continually. Notice that the data structure is expected to provide both the sizes of descriptors as well as their negations (when we follow the ‘no’ branch) and hence, the depth of the AD-tree is set to just greater than the allowable depth of the classification trees. The CARTwheels algorithm consults the AD-tree whenever it must make a choice of a decision node (except when its move is exploratory). After evaluating matchings, set-expressions read off the trees are subjected to tabular minimization, in order to arrive at a canonical form.

The implementation allows for configuring the space of redescriptions that are explored. The depth limit for the top and bottom trees can be individually specified, and we can also preferentially include or exclude certain types of expressions in mined redescriptions. For instance, syntactic constraints on redescriptions (e.g., only conjunctions are allowed) can be incorporated as biases in the tree construction phase of CARTwheels.

2.2 Applications in Bioinformatics

We now present an application of CARTwheels to studying gene expression datasets from microarray experiments conducted on the budding yeast Saccharomyces cerevisiae. Bioinformatics is fertile ground for application of CARTwheels and S. cerevisiae is arguably the most well studied (and documented) model organism through bioinformatics techniques. Practically every experimental methodology applied toward yeast can be viewed as a way to define descriptors. Even the results of other data analysis/mining algorithms can be used as a source of descriptors! The underlying universal set of objects could be initialized to the set of genes, proteins, or processes, in S. cerevisiae. CARTwheels hence brings many computational and experimental technologies to bear upon redescription mining. It supports the capture of both similarities and distinctions among descriptors derived from these diverse sources.

2.2.1 Datasets

The redescription process begins by defining the universal set of genes (or open reading frames, ORFs) G, which is dependent on our biological goals. Here, we are interested in characterizing similarities and differences in yeast gene expression behavior across related families of stresses. Gasch et al. ([21]) is an important source for such a study since it provides results from more than 170 comparisons, across a variety of environmental stresses. We use three different universal sets, to illustrate diverse ways of using the CARTwheels framework:

G1: the set of ORFs that show significant change in expression (more than 1-fold up- or down-regulation) in some time point in each of the five stresses from (heat shock from 25°C to 37°C, hyper-osmotic shock, hypo-osmotic shock, H2O2 exposure, and mild heat shock at variable osmolarity).

G2: the set of ORFs that show more than 4-fold up- or down-regulation change in expression in some time point in each of the seven stresses from (heat shock from 25°C to 37°C, hyper-osmotic shock, hypo-osmotic shock, H2O2 exposure, mild heat shock at variable osmolarity, heat shock from 37°C to 25°C, and heat shock from 29°C to 33°C). Notice that two additional stresses are included, relative to how G1 was constructed.

G3: the set of ORFs that show more than 4-fold up- or down-regulation change in expression in some time point in each of the seven stresses in G2 and that do not belong to the set of ESR (Environmental Stress Response) genes as characterized by Gasch et al. ([21]). The ESR dataset (comprising 868 ORFs) constitutes a characterization of yeast ORFs that show a marked uniformity of expression across diverse stresses, and hence has been excluded by many researchers in their analyses; see, for instance, ([79]).

The choice of the universal set can be viewed as a conditioning context and must be kept in mind when interpreting any mined redescriptions. It can be viewed as an implicit descriptor occurring on both sides of every mined redescription, e.g., Zi ⇔ Zj in G1 can be viewed as Zi ∩ G1 ⇔ Zj ∩ G1.

Table 2.2: Summary of universal sets and descriptors.

                                       G1     G2     G3
# stresses                              5      7      7
# expts                                 7      9      9
# ORFs                                 74    332    171
GO (biological process) descriptors   210    479    382
GO (cellular component) descriptors    42    112     97
GO (molecular function) descriptors   126    298    204
Expression level range descriptors    224    373    344
k-means clusters                       70    270      0
Histone expression range descriptors  152    168    162
# descriptors                         824   1700   1189

2.2.2 Descriptor Definition

We defined descriptors for the genes in the chosen universal sets in a variety of ways. One class of descriptors was derived from categories in the GO (Gene Ontology) biological process, GO cellular component, and GO molecular function taxonomies, that have representation among the chosen genes. The microarray results from the stresses of Gasch et al. (relevant to each universal set) were bucketed to yield range descriptors of the form ‘expression level ∈ [%x, 0] in time point %y of stress experiment %z’ (for negative %x) and ‘expression level ∈ [0, %x] in time point %y of stress experiment %z’ (for positive %x). Notice that we are not constrained to pick descriptors from only the stresses used to define the universal set, although we have made that choice here. Further, k-means clustering was performed using the Genesis software suite ([88]) on each of the stresses individually, with a setting of 10 clusters for G1 and 10 and 20 clusters for G2. No descriptors based on k-means clustering were defined for G3. Since heat shock and mild heat shock at variable osmolarity are actually pairs of experiments, this step yields (5+2) × 10 = 70 (for G1) and (7+2) × (10 + 20) = 270 (for G2) descriptors, depicting clusters of genes with similar temporal profiles. It must be kept in mind that each of these experiments in turn comprises multiple time points, different for each stress. Finally, we included microarray results from a histone depletion experiment conducted by Wyrick et al. ([96]) and created range descriptors similar to the Gasch stresses; this is to allow us to relate the effect of histone depletion to that of environmental stresses. Table 2.2 summarizes the number of descriptors of each type defined for each of the universal sets, and provides count statistics. Fig. 2.7 presents frequency plots for the sizes of the descriptors in each of the universal sets. As expected, a majority of descriptors in each case have very few ORFs.
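The bucketing of expression levels into range descriptors can be sketched as below. This is a hypothetical illustration for a single time point of a single experiment: the function name `range_descriptors`, the cut-off values, the placeholder ORF names, and the expression levels are all invented for the example and are not taken from the Gasch et al. data.

```python
def range_descriptors(expression, cutoffs):
    """Build range descriptors for one time point of one stress experiment.

    expression maps an ORF name to its expression level; each cut-off x
    yields the descriptor [x, 0] if x is negative and [0, x] if positive.
    """
    descriptors = {}
    for x in cutoffs:
        if x < 0:
            label = f"expression level in [{x}, 0]"
            members = {orf for orf, level in expression.items() if x <= level <= 0}
        elif x > 0:
            label = f"expression level in [0, {x}]"
            members = {orf for orf, level in expression.items() if 0 <= level <= x}
        else:
            continue  # a cut-off of zero defines no range
        descriptors[label] = members
    return descriptors

# Placeholder ORF names and made-up expression levels.
levels = {"orf_a": -1.5, "orf_b": -0.4, "orf_c": 0.7, "orf_d": 1.8}
descriptors = range_descriptors(levels, cutoffs=(-2, -1, 1, 2))

for label, members in descriptors.items():
    print(label, sorted(members))
```

Each resulting descriptor is simply a set of ORFs, so it can be used as a feature or class in the alternation like any GO category or cluster.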

2.2.3 CARTwheels Configuration

To invoke CARTwheels for a particular universal set, we initialized X to be all descriptors de-rived from the Gasch et al. dataset (which includes the range descriptors as well as the k-meansclusters). This ensures that all redescriptions will involve some aspect of the Gasch et al. experi-ment and prevents the possibility of, say, mining a redescription between two GO taxonomies. Ywas initialized to the set of all descriptors; thus, there is some overlap between X and Y . In or-der to prevent obvious redescriptions arising from this overlap, the algorithm was precluded fromutilizing descriptors in one tree if they are already present in the other tree.

We employed a Jaccard’s threshold θ of 0.5 and a depth limit d ≤ 2 in both the top and bottom tree induction alternations. The limit on the number of allowable alternations η was set to 10, and ρ was varied from 1 to 6. Redescriptions mined by CARTwheels are subjected to a ‘tightening’ step, akin to rule pruning in packages like C4.5 ([72]). This might involve attempting to drop terms from both sides of the redescription, or restricting range descriptors (if they occur in the redescription), and determining whether this causes significant degradation of Jaccard’s coefficient. If no degradation is observed, then the redescription can be tightened. A p-value cutoff of 0.001 for significance of redescriptions was utilized for all our results in this chapter. We first describe the qualitative nature of biological results obtained through redescription and then assess the exploratory behavior of CARTwheels.
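The thresholding and tightening steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the CARTwheels implementation: one side of the redescription is taken to be a plain conjunction of descriptors, each descriptor is a set of objects, and the `tolerance` parameter and the function names are hypothetical.

```python
from functools import reduce

def jaccard(a, b):
    """Jaccard's coefficient |a ∩ b| / |a ∪ b| of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def conjunction(terms):
    """Set of objects satisfying every descriptor in a non-empty list."""
    return reduce(set.intersection, terms)

def tighten(lhs_terms, rhs, tolerance=0.0):
    """Greedily drop conjuncts from the left side while Jaccard's
    coefficient does not degrade by more than `tolerance`."""
    terms = list(lhs_terms)
    best = jaccard(conjunction(terms), rhs)
    i = 0
    while i < len(terms) and len(terms) > 1:
        candidate = terms[:i] + terms[i + 1:]
        score = jaccard(conjunction(candidate), rhs)
        if score >= best - tolerance:
            # Dropping term i does not hurt: keep the shorter expression.
            terms, best = candidate, max(best, score)
        else:
            i += 1
    return terms, best

# Hypothetical descriptors: d2 is redundant given d1.
d1, d2, rhs = {1, 2, 3}, {1, 2, 3, 4, 5}, {1, 2, 3}
kept, score = tighten([d1, d2], rhs)
```

A redescription would be reported only if its coefficient meets the threshold θ (0.5 in our configuration); the sketch leaves that comparison to the caller.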

2.2.4 Example Redescriptions

Seven key mined redescriptions (R1–R7) are depicted in Fig. 2.6. R1–R3 are defined over universal set G1, R4–R6 over G2, and R7 over G3. These redescriptions were selected both for their biological interest and for their feature construction novelties. The proteins encoded by genes in a redescription may interact with one another or with other proteins not included in the redescription. Such analyses make it possible to uncover cryptic and subtle features of gene expression and regulation.

[Figure 2.6: seven boxes, one per redescription. Each box shows the statement of the redescription, a graphical form (expression-range bars, GO category diagrams, or plots of relative expression level against time point), and an ORF list. The statements, Jaccard’s coefficients, and ORF lists from the figure are listed below.]

R1: Heat Shock (hs-1), 10 minutes <= -2 <=> Histone depletion, 6 hr >= -6
Jaccard’s coefficient: 0.78
ORFs: YAL025C, YGL055W, YGR145W, YML123C, YNL141W, YOR315W, YPL093W

R2: Cellular GO category 16020: membrane AND NOT Cellular GO category 19866: inner membrane <=> Trend (H2O2)
Jaccard’s coefficient: 0.89
ORFs: YDR342C, YGL055W, YHR094C, YHR096C, YML123C, YOL084W, YOL101C, YPL019C

R3: Trend (HS2) <=> Trend (SORB_1M) <=> Trend (SORB_29C_1M_TO_33C_NO) <=> Trend (HS2)
Jaccard’s coefficients: 0.82, 0.63, 0.69
ORFs: YAL025C, YBR247C, YCL054W, YGR145W, YHR052W, YML123C, YMR185W, YNL141W, YPL019C, YPR112C

R4: Heat Shock (hs-1), 30 minutes <= -2 <=> Cellular GO category 5730: nucleolus <=> 0.32 mM H2O2, 30 minutes <= -1
Jaccard’s coefficients: 0.51, 0.56
ORFs: YBR247C, YCL059C, YDL148C, YDL153C, YDR398W, YGL029W, YGL078C, YGL171W, YGR159C, YHR088W, YHR089C, YHR169W, YHR196W, YKL078W, YKL099C, YKL172W, YKR081C, YLL008W, YLL011W, YLR002C, YLR129W, YLR175W, YLR197W, YLR222C, YLR276C, YML093W, YMR131C, YMR229C, YMR290C, YNL002C, YNL061W, YNL110C, YNL175C, YNL248C, YNL308C, YOL041C, YOL077C, YOR145C, YOR310C, YOR340C, YPL043W, YPL093W, YPL126W, YPL211W, YPL266W

R5: Heat Shock, 15 minutes hs-2 <= -1 <=> (Biological GO category 7010: cytoskeleton organization and biogenesis AND Biological GO category 80: G1 phase of mitotic cell cycle) OR (Biological GO category 42254: ribosome biogenesis and assembly AND NOT Biological GO category 7010: cytoskeleton organization and biogenesis) <=> Heat Shock 10 minutes, hs-1 <= -2 <=> Biological GO category 7046: ribosome biogenesis <=> Heat Shock 30 minutes hs-1 <= -1
Jaccard’s coefficients: 0.58, 0.53, 0.55, 0.51
ORFs: YBR247C, YDL014W, YDL148C, YDL153C, YDR087C, YDR398W, YGL029W, YGL078C, YGL171W, YGR103W, YHR089C, YHR169W, YHR196W, YKL009W, YKL078W, YKL099C, YKL172W, YKR081C, YLL008W, YLL011W, YLR002C, YLR009W, YLR129W, YLR175W, YLR197W, YLR222C, YLR276C, YML093W, YMR131C, YMR229C, YNL002C, YNL061W, YNL075W, YNL110C, YNL308C, YOR294W, YOR310C, YPL043W, YPL126W, YPL211W, YPL266W

R6: Trend (HS29_TO_33) <=> Trend (HS1)
Jaccard’s coefficient: 0.51
ORFs: YCL040W, YDR171W, YER103W, YFL014W, YFR053C, YGL037C, YGR088W, YGR248W, YHL021C, YHR104W, YLL026W, YLR178C, YLR327C, YML100W, YML128C, YMR105C, YMR250W, YOR173W, YOR374W

R7: Cellular GO category 5618: cell wall <=> 29C + 1M sorbitol to 33C + 1M sorbitol, 15 minutes >= 1 AND 0.32 mM H2O2, 20 min >= 2 OR 29C + 1M sorbitol to 33C + 1M sorbitol, 30 minutes <= -2
Jaccard’s coefficient: 0.75
ORFs: YKL096W, YOR382W, YPL130W

Figure 2.6: Seven redescriptions mined using CARTwheels. Each box gives a readable statement of the redescription, presents it in graphical form, and identifies the ORFs conforming to the redescription. R1–R3 are defined over universal set G1, R4–R6 over G2, and R7 over G3. The Jaccard’s coefficient is displayed over the redescription arrow.

[Figure 2.7: three frequency plots with panel titles ‘Descriptors defined over G1’, ‘Descriptors defined over G2’, and ‘Descriptors defined over G3’; x-axis: size of descriptors; y-axis: frequency.]

Figure 2.7: Frequency plot of descriptor sizes for universal set G1, G2, and G3, respectively.

R1 is a redescription where both sides involve descriptors from gene expression bucketing. It relates negatively expressed ORFs in the histone depletion experiment with similarly expressed ORFs in a Gasch comparison (heat shock). R1 can be read as ‘of the 74 ORFs in the first universal set, the ORFs negatively expressed in the histone depletion experiment (6 hours) are also those that are negatively expressed two-fold or more in the heat shock (10 minutes) experiment.’ This redescription holds with a Jaccard’s coefficient of 0.78. Since each side contains a single descriptor, this redescription does not present any set construction. R1 involves 7 ORFs, three of which are reported to be regulated by similar mechanisms, according to the work of Segal et al. ([79]). These ORFs comprise functions related to metabolism, catalytic activity, and are located in the cytoplasm. The Pearson coefficients for these ORFs in the histone depletion experiments match very strongly, showcasing the use of redescription in identifying a concerted set of ORFs.

R2 relates a k-means cluster to a set difference of two related GO cellular component categories. While the 8 ORFs in R2 appear to be part of different response pathways, 5 of these 8 ORFs are similarly regulated according to the work of Segal et al.; these genes relate to the cellular hyperorganization and membrane dynamics in the regulation network.

R3 is actually a triangle of redescription relationships that illustrates the power of CARTwheels. Three different experimental comparisons are involved in this circular chain of redescriptions, with 10 ORFs being implicated in all three descriptors. From a biological standpoint, this is a very interesting result: the common genes indicate concerted participation across stress conditions, whereas the genes participating in, say, two of the descriptors, but not the third, suggest a careful diversification of functionality. Out of the 10 ORFs taking part in the redescription, 6 are related to cell growth and maintenance and 5 have binding motifs related to the DNA binding protein REB1. The importance of phosphate and ribosomes appears to be salient in this redescription. It is important to note that the circularity of R3 is not directly mined by CARTwheels, but inferred post-hoc from a linear chain.

The theme in R4 is ribosome assembly/biogenesis and RNA processing. R4 is a linear chain comprising two redescriptions, and uses a GO descriptor as an intermediary between two expression-based descriptors. It is also interesting that this redescription involves a set of 45 ORFs!

R5 is an even longer chain involving 41 ORFs that are common to all descriptors. Notice the rather complicated set construct involving a disjunction of a conjunction and a difference, involving three different GO biological categories. Incidentally, this is the most complicated set expression representable in a 2-level tree.
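To make the shape of this set expression concrete, the sketch below evaluates a disjunction of a conjunction and a difference over small hypothetical gene sets. The set contents and the function name `two_level_expression` are illustrative assumptions; only the form of the expression follows the middle descriptor of R5.

```python
def two_level_expression(root, yes_child, no_child):
    """(root AND yes_child) OR (no_child AND NOT root): the objects reaching
    two selected leaves of a depth-2 tree that splits on `root` first."""
    return (root & yes_child) | (no_child - root)

# Hypothetical memberships; the names only echo the GO categories in R5.
cytoskeleton = {"g1", "g2", "g3"}
g1_phase = {"g2", "g3", "g7"}
ribosome_biogenesis = {"g3", "g4", "g5"}

selected = two_level_expression(cytoskeleton, g1_phase, ribosome_biogenesis)
```

Each path from the root to a leaf contributes a conjunction of tests, and the union of the selected paths gives the disjunction.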

R6 is a relationship between two k-means clusters, between heat shock stresses. The ORFs participating in R6 demonstrate a clear focus on sugar or sugar phosphate metabolism.

R7 is a redescription relating a disjunction of descriptors to a GO cellular component category. It is also our first example of a redescription where a rectangular region is mined in a 2D space involving two different experimental comparisons. Usually such a region would require a 4-level tree, but since it is bounded by the extremal values specific to each experiment, it can be captured by a conjunction of merely two descriptors.

Figure 2.8: Precision for redescriptions mined vs. ρ for universal set G1, G2, and G3, respectively.

Figure 2.9: Total number of redescriptions mined vs. ρ for universal set G1, G2, and G3, respectively.

2.2.5 Effect of ρ and η

If we view the alternation process as one of information retrieval, we can adapt traditional precision and recall metrics for algorithm assessment. Precision here refers to the number of unique redescriptions as a fraction of the total number of redescriptions mined. Recall refers to the number of unique redescriptions as a fraction of the total number of redescriptions possible. Unfortunately, the latter metric is nearly impossible to compute, even for our depth limit of 2. For even the smallest universal set considered here, the size of the space of possible redescriptions is O(10^14)! Our approach hence is to track precision and the total number of redescriptions, across various values of ρ.
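The precision metric defined above reduces to a short computation. In this sketch a redescription is represented as a pair of descriptor expressions given as strings, and two redescriptions are treated as identical when they relate the same two expressions in either order; both choices are assumptions made for illustration.

```python
def precision(mined):
    """Fraction of mined redescriptions that are unique.

    Each redescription is a pair of descriptor expressions; the pair is
    treated as unordered, which is an assumption of this sketch.
    """
    if not mined:
        return 0.0
    unique = {frozenset(pair) for pair in mined}
    return len(unique) / len(mined)

# Four mined redescriptions, two of which repeat earlier ones.
mined = [("A", "B"), ("B", "A"), ("A", "C"), ("A", "C")]
p = precision(mined)
```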

Fig. 2.8 shows the monotonic decrease of precision as ρ is increased, and Fig. 2.9 depicts the steady increase in the total number of redescriptions mined. These graphs indicate that the trade-off between redundancy and exploration holds across all the datasets considered here. A formal characterization is underway. The effect of the η parameter, which controls the number of unsuccessful consecutive alternations, is as expected: increasing η results in a greater number of (total and unique) redescriptions mined (not shown for space considerations).

2.3 Discussion

This chapter is a first exploration into the formulation of the redescription mining problem and has presented an approach for mining redescriptions automatically. Redescriptions can be thought of as generalizations of one-directional implications (e.g., association rules [3], rules in ILP [59]), where one descriptor is required to be a proper subset of the other. This generalization, coupled with the automatic identification of set-theoretic constructions, makes CARTwheels a very powerful approach to mining (approximate) equivalence relations. We have demonstrated the effectiveness of CARTwheels in a domain that exhibits a richness of descriptors, and shown how it captures patterns involving small as well as large sets of objects.

In the next chapter, we present further relationships between redescription mining and other learning paradigms.


Chapter 3

Survey of Related Research

This chapter surveys redescription-related research in data mining, covering attribute-value learning techniques, relational mining techniques, and clustering techniques. A redescription can be variously thought of as a generalization of the results from these simple techniques. Ideas pursued in the constructive induction, co-training, schema matching and model management literatures are also presented. Redescription mining can be considered as having similar motivations to these works.

3.1 Attribute-Value Learning Techniques

In attribute-value learning techniques, the actual values of entities or items of interest (without any level of abstraction) are used for mining patterns. Such techniques can be viewed as working with a single table of data. Some of the important types of attribute-value learning techniques include association rule mining and decision tree induction. These are explained in detail in the following subsections.

3.1.1 Association Rule Mining

The task of association rule mining is best understood in terms of the original application domain in which it was first described [2]. Consider a supermarket transaction database recorded by a retailer over a period of time. One of the analyses performed on such transaction databases is to look for groups of items, or itemsets, that appear frequently together in transactions. A cut-off value for the support of an itemset, defined as the number of transactions containing that itemset, is used to distinguish frequent itemsets from non-frequent ones. Once a frequent itemset I is found, useful implication rules of the form I1 → I2 can be derived from it, where I1 and I2 are both itemsets and I1 ∪ I2 = I. The confidence of such a rule is the ratio of the supports of (I1 ∪ I2) and I1. Association rule mining is hence the problem of finding implications that satisfy user-defined minimum support and confidence constraints.
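The definitions of support and confidence above can be written down directly. The transactions and item names below are hypothetical; the sketch assumes that the antecedent of a rule occurs in at least one transaction.

```python
# Hypothetical supermarket transactions, each a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]

def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def confidence(antecedent, consequent, transactions):
    """support(I1 ∪ I2) / support(I1) for the rule I1 -> I2."""
    joint = support(antecedent | consequent, transactions)
    return joint / support(antecedent, transactions)

s = support({"bread", "milk"}, transactions)
c = confidence({"bread"}, {"milk"}, transactions)
```

Here the itemset {bread, milk} appears in two transactions, and two of the three transactions containing bread also contain milk.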

The task of mining association rules can be logically broken down into two steps: the first consists of finding frequent itemsets and the second involves finding implication rules. If there are n items in a transaction database, finding frequent itemsets by going through each of the (2^n − 1) possibilities is computationally very expensive. One of the most commonly used algorithms for mining frequent itemsets is Apriori [3], which uses the lattice structure inherent in itemsets to posit that an itemset Ik with one more item than an itemset Ij it contains cannot have a higher support than Ij. This suggests a bottom-up pruning algorithm that first verifies whether all subsets of an itemset have sufficient support, before evaluating the given itemset. Over the years, a number of methods have tried to optimize this algorithm for large datasets using methods like partitioning [77], hashing [66], and sampling [92]. The second step of finding implication rules has received comparatively less attention, primarily because this step does not lend itself to significant algorithmic optimizations.
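The bottom-up pruning idea can be sketched as a level-wise search. This is a minimal, unoptimized illustration of the Apriori principle rather than the algorithm of [3] in full; the transactions are hypothetical and support is counted by a linear scan.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset: support} for all itemsets with support >= min_support."""
    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = {item for t in transactions for item in t}
    level = {frozenset([i]): count(frozenset([i])) for i in items}
    level = {s: c for s, c in level.items() if c >= min_support}
    frequent = dict(level)
    k = 2
    while level:
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        }
        level = {c: count(c) for c in candidates}
        level = {c: n for c, n in level.items() if n >= min_support}
        frequent.update(level)
        k += 1
    return frequent

transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]
frequent = apriori(transactions, min_support=2)
```

With a minimum support of 2, all single items and all pairs are frequent, while the three-item set is evaluated and rejected because only one transaction contains it.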

Although initially designed for and applied toward transaction databases, association rule mining has been used in a wide variety of domains. Some of the fields that have benefited from association rule mining include cross-marketing, catalog design, and web log analysis.

An indirect contribution of the association rule mining community to data mining is the design of a number of data structures that are used for fast counting of support and confidence. The ADtree [57] is a particularly useful tree structure that stores an itemset along each of its branches. Each node in the tree denotes a particular item and records a count for the number of transactions that contain the itemset traversed by the path from the root to the node. This data structure has been employed in CARTwheels for fast entropy gain estimation.

Although association rule mining is a very useful method to extract causalities among items, it is primarily focused on mining relationships in a single table. Any relational information inherent in data must first be re-mapped (e.g., by joins) into a single table setting. In many data-rich applications, such as bioinformatics, it would be better to more directly exploit the relational structure in the mining algorithm. Nevertheless, attempts have been made to bridge relational mining and association rule mining, and the methodologies described thus far in this section are still pertinent.

3.1.2 Decision Trees

Decision trees were among the earliest proposed approaches for pattern classification and data mining [9]. While being powerful in terms of accuracy and efficiency of induction, their results are also simple to understand as they mimic the decision-making logic of human experts. The renewed emphasis on data mining propagated by the knowledge discovery in databases (KDD) community in the early 1990s has fueled a resurgence of interest in tree-based methods. Researchers have revisited tree induction algorithms in the context of datasets residing in secondary storage [18, 23], creating scalable and highly efficient implementations [8]. The many fielded applications of tree-based methods range from everyday uses such as spam filtering [31] to astrophysical domains such as classifying galaxies [35].

Decision trees aim to use a set of attributes to classify data points into a set of predefined discrete-valued target functions or classes. A training dataset is used to learn the decision tree and test examples are used to check its accuracy. Typically, each data point takes one value out of the possible domain of values for each attribute (although there could be missing data), as well as a class label. A decision tree with a range of discrete class labels is called a classification tree, whereas a decision tree with a range of continuous values is called a regression tree; hence the name classification and regression tree (CART) [9] is used interchangeably with decision tree.

Any decision tree tests for an attribute at each internal node of the tree. Branches coming out of that node represent the values that the particular attribute can take. Any leaf in the tree corresponds to some class from the set of possibilities. The order in which attributes are chosen as the tree is constructed can be governed by a variety of measures such as information gain, gain ratio, and Gini index.

ID3 [70] is one of the earliest and best-known implementations of a decision tree learning algorithm. It uses a greedy top-down search through the space of decision trees and uses information gain to select the attribute tested at each internal node. Information gain is defined in terms of entropy, which is a measure of disorder in the classification of data based on the various values that an attribute can take. The difference in entropy before and after splitting at a particular attribute for a given node defines the information gain. ID3 chooses the attribute that maximizes information gain at each node. Hence, there is an inherent bias towards shorter trees in this algorithm. The resulting decision tree can also be restated as a set of if-then rules if required.
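The attribute selection step of ID3 can be sketched as follows. The data representation (a list of dictionaries), the attribute names, and the function names are assumptions made for illustration; entropy is measured in bits and the weighted entropy after a split uses the fraction of rows in each branch.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of `target` before the split minus the weighted entropy after
    splitting `rows` (a list of dicts) on `attribute`."""
    before = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    after = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return before - after

def best_attribute(rows, attributes, target):
    """ID3's greedy choice: the attribute with maximal information gain."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))

# Hypothetical data: attribute "a" determines the class, "b" is uninformative.
rows = [
    {"a": 0, "b": 0, "y": "n"},
    {"a": 0, "b": 1, "y": "n"},
    {"a": 1, "b": 0, "y": "p"},
    {"a": 1, "b": 1, "y": "p"},
]
chosen = best_attribute(rows, ["a", "b"], "y")
```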

One of the problems with the initial ID3 implementation of decision trees was that of overfitting, wherein an increase in the depth of the tree (and hence better accuracy) for classifying a training dataset results in decreased accuracy for test examples. This is particularly the case when the training examples have random error or noise. Some of the methods used to reduce overfitting include reduced-error pruning [71] and rule post-pruning [72] used in the C4.5 variant of ID3.

Although there have been a number of attempts to incorporate continuous-valued attributes [14, 15], decision tree mining is best suited for data with discrete-valued attributes and classifications. It is however restrictive in the kind of constructs that can be encapsulated in the mined tree.

Though decision trees have almost always been used for classification purposes, they could find application in feature construction if tree building is looked at as a way to combine features. This role of decision trees is central to the CARTwheels redescription framework presented earlier.

3.2 Relational Mining

The popularity of relational database management systems (RDBMSs) has helped pose the problem of mining relationships at the relational level. This would involve using a relational structure, like a set of predicates, to represent the data as well as structure the space of possible patterns. One of the most important methods in this category is inductive logic programming (ILP) [59].

3.2.1 ILP

ILP requires a set of training examples E and background knowledge B. The objective is to find a hypothesis H, defined using some concept description language L as a set of clauses, such that H is complete and consistent with respect to the background knowledge B and the examples E. The set of examples E, stated as ground facts in the language L, consists of E+, the set of examples that are true (positive), and E−, the set of examples that are false (negative) for the hypothesis H. Much of the ILP literature focuses on finding H that consists of clauses with at most one literal as the consequent (or head of the clause); such statements are called Horn clauses.

ILP can be viewed as a search problem through the possible hypothesis space. This hypothesis space can be imagined as a lattice using the concept of θ-subsumption [39] to structure it. Clause w θ-subsumes w′ if there exists a substitution θ such that wθ is a subset of w′. Clause w is at least as general as clause w′ (w <= w′) if w θ-subsumes w′. Clause w is more general than w′ (w < w′) if w <= w′ holds and w′ <= w does not. In this case, w′ is a specialization of w and w is a generalization of w′. If w θ-subsumes w′, then w entails w′.
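The definition of θ-subsumption can be checked by a small backtracking search for function-free clauses. The representation used here is an assumption of this sketch: a clause is a list of literals, a literal is a tuple whose first element is a signed predicate name (‘+’ for the head, ‘-’ for body literals), and variables follow the Prolog convention of starting with an uppercase letter.

```python
def is_variable(term):
    """Prolog-style convention (an assumption here): variables start uppercase."""
    return isinstance(term, str) and term[:1].isupper()

def match_literal(lit, target, theta):
    """Try to extend substitution `theta` so that lit·theta equals target."""
    if lit[0] != target[0] or len(lit) != len(target):
        return None
    theta = dict(theta)
    for s, t in zip(lit[1:], target[1:]):
        if is_variable(s):
            # Bind s to t, or fail if s is already bound to something else.
            if theta.setdefault(s, t) != t:
                return None
        elif s != t:
            return None
    return theta

def theta_subsumes(w, w_prime, theta=None):
    """True if some substitution maps every literal of w into w_prime."""
    theta = {} if theta is None else theta
    if not w:
        return True
    first, rest = w[0], w[1:]
    for target in w_prime:
        extended = match_literal(first, target, theta)
        if extended is not None and theta_subsumes(rest, w_prime, extended):
            return True
    return False

# daughter(X, Y) :- parent(Y, X)   versus a ground, more specific clause.
general = [("+daughter", "X", "Y"), ("-parent", "Y", "X")]
specific = [("+daughter", "mary", "ann"), ("-parent", "ann", "mary"),
            ("-female", "mary")]
```

In this example the substitution {X/mary, Y/ann} maps `general` into `specific`, so `general` is at least as general as `specific`, while the converse does not hold.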


θ-subsumption provides a generality ordering for hypotheses, thus structuring the hypothesis space. It can then be used to prune out large parts of the search space. In the generalization of w′ to w, if w′ entails a negative example, so will w; hence generalizations of w′ can be neglected. On the other hand, if w does not cover a positive example, w′, which is a specialization of w, will also not cover it. Hence, the specializations of w need not be considered. The search of the hypothesis space can be performed in a bottom-up manner using generalization techniques or a top-down manner using specialization approaches.

In generalization, the algorithm starts with the most specific clause that covers a given example and generalizes the clause till it starts to cover negative examples. Two basic techniques here are relative least general generalization (rlgg) used in GOLEM [61] and inverse resolution used in CIGOL [60]. The lgg of two clauses [67] is the least upper bound of the two clauses in the θ-subsumption lattice. rlgg is simply the lgg relative to the background knowledge B. In inverse resolution, the basic step involves the inversion of the logical resolution process using a generalization operator based on inverse substitution [60].

In specialization, a top-down search of the θ-subsumption graph is employed in systems such as MIS (Model Inference System [80]). Here, the algorithm starts with the most general clauses and refines or specializes them by either applying a substitution to the clause or adding a literal to the clause. This is continued down the lattice till the clause does not cover any negative examples. The clauses considered during the search need to be able to cover at least one positive example.

The rules mined using ILP are one-directional implications, where one descriptor (clause) is required to be an almost proper subset of the other. The rules mined can be used to posit containment in the consequent descriptor if the antecedent descriptor is true. Redescriptions can be considered as an extension of the relational rule mining in ILP to bidirectional rules. Notice that ILP rules are relatively restricted in the set-constructs that can be used in clauses, as compared to redescriptions.

3.2.2 Propositionalization

Propositionalization refers to the transformation of a relational representation of the ILP learning problem into a propositional (attribute-value or feature-based) representation.

The LINUS [39] system is one of the first projects to look at using propositionalization for relational rule learning. The types of literals to be used as the head predicate and in the body are provided just as in other ILP implementations. All the features (or predicates) are constructed prior to the learning process, from the literal types provided. This is in sharp contrast with classical ILP, where the feature construction takes place as the θ-subsumption tree is traversed during the learning process. A truth table is then constructed for each of the data points using a substitution for each variable in the literals constructed. Each data point thus becomes equivalent to a transaction and each constructed feature is equivalent to an item. This data is then used for learning relational rules employing attribute-value techniques.
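The truth-table construction can be illustrated with a toy example. The background facts, literal templates, and examples below are hypothetical, and the code is a sketch of the general idea of propositionalization rather than of the LINUS interface.

```python
# Hypothetical background knowledge as ground facts.
facts = {("parent", "ann", "mary"), ("female", "mary"), ("parent", "tom", "bob")}

# Features constructed before learning: one per literal over the head
# variables X and Y of the target relation daughter(X, Y).
features = {
    "parent(Y,X)": lambda x, y: ("parent", y, x) in facts,
    "female(X)": lambda x, y: ("female", x) in facts,
}

# Each example is a substitution for (X, Y).
examples = [("mary", "ann"), ("bob", "tom")]

# One row of truth values per example: a 'transaction' over feature 'items'.
table = [{name: f(x, y) for name, f in features.items()} for x, y in examples]
```

The resulting table can be handed to any attribute-value learner, such as the association rule or decision tree methods of Section 3.1.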

The LINUS algorithm cannot induce clauses with variables that occur in the body but not in the head (i.e., variables that are existentially quantified). The improved DINUS algorithm [39] overcomes this restriction and handles clauses with determinate local variables. A further variation of the DINUS system with the capability to deal with nondeterminate local variables in the body of the hypothesis clause has also been proposed [40].


One of the main problems with propositionalization-based systems is the ever-increasing number of features constructed and considered. Even though attempts have been made to move away from exhaustive first-order feature construction [40], the number of constructs is still huge. Thus these systems are very time consuming and memory intensive. Although redescription mining also involves feature construction, it is closer to simple ILP algorithms as the features are constructed while the algorithm is searching in the feature space and not prior, as in propositionalization. Also, the type of constructs involved in redescription mining is a proper superset of the ones supported by LINUS and DINUS (which are chiefly conjunctions).

3.3 Clustering

Clustering is a very widely used technique that partitions data points into groups or clusters such that points within a cluster are closer (through some metric of closeness) to each other than to points in other clusters [34]. Much of the application of clustering is to numerical data as it is much easier to define similarity metrics (e.g., distances) for cardinal data. However, there has been some work done on clustering categorical data as well. This relates directly to the aim of redescription mining, and is surveyed here. Coupled clustering and cluster ensembles are advanced clustering techniques and share with redescription mining the idea of reconciling clusters defined in more than one vocabulary.

3.3.1 Categorical Clustering

One of the most important issues in clustering of categorical data is the definition of an appropriate similarity measure. Various measures have been used, ranging from the Jaccard’s coefficient between vectors of categorical data to modified forms of entropy in COOLCAT [4]. Many of these methods (e.g., CACTUS [17]) try to take advantage of the fact that categorical data is usually restricted in domain, unlike numeric data that is generally unbounded. An interesting example of a categorical clustering system is STIRR [25], where a dynamical system is used to represent a table of categorical data. The idea then is to search for fixed points of this dynamical system.

Categorical clustering, like clustering in general, suffers from a lack of describability constraints. Thus, though it is possible to group data together, it is difficult to supply a label that best describes the contents of a mined cluster. Because of the explicit use of descriptors in redescriptions, this problem never arises in redescriptions.

3.3.2 Coupled Clustering

Coupled clustering tries to partition two datasets into corresponding subsets such that each subset in one dataset has a matching subset in the other dataset. Each pair of subsets found in this manner is called a coupled cluster. This method thus considers both within-dataset similarities and between-dataset similarities to form the coupled clusters. Recently, Marx et al. have suggested a method [48] where only the between-dataset similarities are used to drive the clustering process. They use the cost-based framework for pairwise clustering suggested by Puzicha et al. [69]. A cost criterion defined using the cost function A(SS, CC) is used to guide the search for a suitable clustering configuration. Here SS is a collection of pairwise similarity scores for each pair of data points and CC is a candidate clustering configuration of data divided into k clusters. The cost function A suggested is limited to similarity values in each cluster and takes the size of a cluster into consideration for calculating the weight contribution of that cluster. The idea then in coupled clustering is to search through various configurations and look for the one with minimum cost. Marx et al. present an application and results of this approach for synthetic and textual data.

The aim of coupled clustering is very similar to that of redescription mining. Both methods look for a cluster of items that share some commonalities. However, the approach is applied at the level of items in coupled clustering as compared to relations in redescription mining. This once again leads to describability issues for the results obtained.

3.3.3 Cluster Ensembles

Cluster ensembles aim to achieve almost the same objective as coupled clustering. However, there are two key differences. First, cluster ensembles are not restricted to two datasets. Second, the datasets are reconciled post-clustering; thus clusters are created for each dataset individually and consensus functions are used to combine these clusters into one set of clusters. The method, introduced by Strehl and Ghosh [86], revolves around building a hypergraph using the set of clusterings available. The hypergraph is induced by creating a hyperedge for each cluster. A data point that is a part of a cluster is a vertex in the hyperedge corresponding to that cluster. There are a number of different algorithms suggested in [86] to obtain the reconciled clusters from the hypergraph. The hypergraph-partitioning algorithm (HGPA) is used to look for a hyperedge separator that partitions the hypergraph into k connected components of approximately the same size, thus resulting in k clusters. Alternatively, the meta-cluster algorithm (MCLA) can be used to create meta-hyperedges by collapsing hyperedges using Jaccard’s coefficient to measure closeness of two hyperedges. Each meta-hyperedge has an association vector which keeps track of the number of times a data point is associated with that meta-hyperedge. After k meta-clusters have been created in such manner, a data point is then assigned to the meta-hyperedge (meta-cluster) for which it has the highest entry in the association vector.
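The hypergraph construction and the closeness measure used when collapsing hyperedges can be sketched as follows. This covers only the first ingredients of MCLA, not the full algorithm of [86]; the clusterings, gene names, and function names are hypothetical.

```python
def hyperedges(clusterings):
    """One hyperedge (a frozenset of data points) per cluster of every
    clustering. Each clustering maps a data point to its cluster label."""
    edges = []
    for labels in clusterings:
        groups = {}
        for point, label in labels.items():
            groups.setdefault(label, set()).add(point)
        edges.extend(frozenset(g) for g in groups.values())
    return edges

def jaccard(a, b):
    """Jaccard's coefficient of two non-empty hyperedges."""
    return len(a & b) / len(a | b)

def closest_pair(edges):
    """Indices of the two hyperedges with the highest Jaccard's coefficient."""
    pairs = [(i, j) for i in range(len(edges)) for j in range(i + 1, len(edges))]
    return max(pairs, key=lambda p: jaccard(edges[p[0]], edges[p[1]]))

# Two hypothetical clusterings of four genes, e.g. one based on expression
# profiles and one based on functional categories.
clusterings = [
    {"g1": 0, "g2": 0, "g3": 1, "g4": 1},
    {"g1": "a", "g2": "a", "g3": "a", "g4": "b"},
]
edges = hyperedges(clusterings)
pair = closest_pair(edges)
```

Hyperedges from the same clustering are disjoint and therefore have a coefficient of zero, so the closest pair always combines clusters from different clusterings when any overlap exists.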

Cluster ensembles can be used for efficient clustering of entities defined in more than one vocabulary. A simple example could be reconciling functional categorization and gene expression profile based clusters for genes to find reconciled clusters. This is very similar to what redescription mining can achieve. However, an added advantage of redescription mining is that the combinations of clusters are not just reconciliatory but can involve differences or unions of clusters.

3.3.4 Conceptual Clustering

In conceptual clustering [52], the primary objective is not to optimize a mathematical function of cluster quality but a more abstract notion of cluster describability. Clusters are chosen in favor of the ease with which they can be described, using given vocabulary. The input to conceptual clustering is typically a dataset, along with a set of features for labeling clusters. Concepts that characterize the output clusters from conceptual clustering are usually conjunctions between relations for variables that describe the entities.

Conceptual clustering is motivated towards making the results of clustering techniques describable and hence produces results that are similar to redescriptions. However, redescription mining imposes the stricter requirement that a cluster be describable in two ways.


3.4 Constructive induction

Constructive induction is the task of ‘inventing’ new features for use in data mining. The assumption here is that the mined patterns will be easier to state or infer if they are represented in an alternative vocabulary. An early example is the CLASSIC system [11], which uses description logics to extend the representation capabilities of first-order logic. Another approach, given in [53], uses derived descriptors to find inductive rules involving descriptors not present in the original set of observations provided. Here, simple logical rules are proposed as a method for generating new descriptors. For example, given genes with expression values for a time series experiment, a derived descriptor could refer to the set of genes with expression value between 1 and 2 for the first time point. Redescription mining shares the goal of constructive induction and is motivated by the use of trees as feature combiners.

3.5 Co-training

Co-training [7] is included in this survey due to its striking resemblance to the CARTwheels algorithm. However, its goals are different. It addresses the problem of learning when the classifications of some data points are not known. The available data D has a set of features F that consists of two disjoint subsets F1 and F2. Each of these subsets would suffice to learn the classifier if all data points had a classification assigned to them. However, because some classifications are missing, two independent learners K1 and K2 are required to learn and train each other. At first, K1 is learned using D projected onto F1 and K2 is learned using D projected onto F2. Then K1 teaches K2 using a data point Di (from D) that best classifies a class Cj in K1: the projection of Di on F2 is used as a training example for class Cj for K2. The learners have their roles reversed when K2 teaches K1. This whole cycle is repeated as long as the classifier accuracy improves.

The cyclic process in co-training is very similar to the way redescriptions are mined using the cyclic learning of decision trees. However, the aim at each step of the process is very different. In redescription mining it is important to maintain impurity while the alternations take place, in order to ensure exploration of the redescription space. In co-training the learners try to match as well as possible so that they can converge to an accurate learner. Also, the idea of feature construction while mining redescriptions has no specific importance in co-training.

3.6 Profiling methods

There exists a class of algorithms whose chief purpose is to profile a given category of data in terms of features that are assumed to be given. In fact, inducing understandable definitions of classes using a set of features [94] is a goal common to all descriptive classification applications. Redescription mining loses the distinction between classes and features and provides necessary as well as sufficient descriptors for covering other descriptors. An additional requirement in class profiling is to choose features that absolutely or partially contrast one class from another and, typically, to choose a minimal subset of features that can contrast all pairs of classes. Redescriptions impose an equivalence relation over the expression space that can be efficiently harnessed to answer these queries.


Niche finding is a special case of profiling classes where the classes are singleton sets. Redescriptions hence cover single instances, such as this example from the domain of universities [93]: ‘Wake Forest University is the only suburban Baptist university.’ Finding niches for individuals and objects has important applications in product placement, targeting recommendations, and personalization.
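Niche finding as described above can be sketched as a search for the smallest conjunctions of attribute values that single out one instance. The sketch below is only illustrative: the function name niches, the brute-force search, and the placeholder records (univ_a, univ_b, univ_c) are assumptions and do not come from [93].

```python
from itertools import combinations


def niches(records, target, max_size=2):
    """Return the smallest attribute combinations whose values match only `target`."""
    attributes = sorted(records[target])
    for size in range(1, max_size + 1):
        found = []
        for combo in combinations(attributes, size):
            matches = [
                name
                for name, record in records.items()
                if all(record[a] == records[target][a] for a in combo)
            ]
            if matches == [target]:
                found.append(combo)
        if found:  # stop at the smallest size that yields a niche
            return found
    return []


# Hypothetical toy records; the names and values are placeholders.
universities = {
    "univ_a": {"setting": "suburban", "affiliation": "Baptist"},
    "univ_b": {"setting": "urban", "affiliation": "Baptist"},
    "univ_c": {"setting": "suburban", "affiliation": "none"},
}
result = niches(universities, "univ_a")
print(result)
```

Here neither attribute alone isolates univ_a, but the conjunction of both does, which mirrors the form of the ‘only suburban Baptist university’ statement.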

The work by Pu and Mendelzon [68] aims to use profiling as a way to re-state queries in ways that are compact or easier to compute. These transformed queries, obtained using the MDL (minimum description length) principle, can be viewed as redescriptions with Jaccard’s coefficient 1. However, Pu and Mendelzon’s focus is on complexity results for targeted classes of expressions, specifically those used in hierarchical and multidimensional (OLAP) databases. On the other hand, CARTwheels’s bias supports general DNF expressions, is aimed at mining all redescriptions (exact and approximate) from a given dataset, and is not driven by compactness constraints.

3.7 Schema matching

A schema describes the structure of a set of elements, using notions such as relational tables and XML elements. Schema matching [73] aims to reconcile two schemas using a semantic-level mapping between elements of the schemas. This approach finds applications in a wide variety of fields including data integration, data warehousing and, very importantly, e-business.

The problem of schema matching can be stated simply as that of finding a Match operator between two schemas. A number of systems are available to find this Match operator using different types of learning techniques. SemInt [42] looks for mappings at the level of individual attributes in two schemas. It starts by clustering similar attributes from the first schema (S1). These clusters are then used to train a neural network. Attributes from the second schema (S2) are then passed through this neural network to determine the best corresponding cluster in S1. The Semantic Knowledge Articulation Tool (SKAT) [55] uses learned rules to determine equivalences between two schemas.

The relationships considered in schema matching research are primarily of the foreign key nature or otherwise operate at the instance level. Redescription mining can be considered a significant extension of schema matching as it involves more complex set-theoretic relationships. Redescriptions can also be employed for finding bi-directional rules to relate elements in two schemas.

3.8 Model management

Model management [5] is a framework that recognizes the complex inter-relationships that exist in multi-database enterprises and provides union, intersection, and difference operators for reconciliation, integration, and migration purposes. It uses a Match operator, as in schema matching, to find mappings between two models. Arithmetic operations such as Diff (to find differences between two models), Merge (to find the union of two models) and Compose (to apply mappings between models transitively) are also available. All of these use the Match operator between models of interest.

Cupid [46] is used for schema matching in model management. It uses weighted means of linguistic and structural similarity between pairs of elements in two schemas to decide on the mappings between elements.

In model management, the relationships between models are assumed to be user provided; the actual operator used between two models is defined by the user. Redescription mining chooses the most meaningful operators on the fly and hence can be used to obtain the operators and their order of usage for achieving a certain model reconciliation task.

3.9 Summary – A Case for Redescription Mining

The data mining methodologies discussed constitute a wide range of techniques and application domains. However, most of them are applicable to specific domains and data types. In redescription mining, a key advantage of using descriptors and their constructs is that almost any type of data (numeric or categorical) that affords a set representation can be used for mining. This even includes the results of other data mining algorithms (such as clusters) that can be posed as sets.

In the methods surveyed, the features or descriptors available for mining are usually fixed. In cases where feature construction is allowed, the constructs are either user specified (as in model management) or are too simplistic (as in propositionalization). Redescription mining allows for complex set-theoretic constructs to be created and used for mining purposes. Even though the constructs can get arbitrarily complex, they maintain a describability that the results of traditional clustering algorithms do not have.

Since the features constructed on the two sides of the redescription result in a set of items that can be described uniquely in two different vocabularies, they can be expected to show much more commonality than items that follow single directional rules (as in ILP). Redescription mining is hence a method aimed towards mining tight, describable sets of items using the most informative features available.


Chapter 4

Using Redescriptions to Elucidate Stress Response Pathways

This chapter presents an application of CARTwheels to elucidating stress response pathways in the budding yeast Saccharomyces cerevisiae. For the purpose of this chapter, we consider results from a microarray experiment aimed at analyzing the transcriptional response of yeast to desiccation and rehydration under glucose-limiting conditions. Due to the multiple stresses involved in desiccation and rehydration, and our desire to compare the response of yeast to these stresses with those reported previously for conditions such as heat shock and diauxic shift, we applied the CARTwheels algorithm to descriptors obtained from various gene expression datasets. Application of redescription mining sheds new light on the mechanisms and processes involved in desiccation (stasis) and rehydration (recovery) in yeast.

4.1 Experiment Overview

The experiment considered consists of samples of yeast colonies taken and analyzed (using microarrays) at seven time-points indicating different stages of the desiccation/rehydration process. Samples taken at the start of the drying process correspond to time-point T1. T2 corresponds to 18h after the start of the drying process and T3 (Tdry) corresponds to 42h after the start. At this point, the yeast colonies were rehydrated and samples were collected for various analyses at time points of 15 min (T4), 45 min (T5), 90 min (T6) and 360 min (T7) after the start of rehydration. Microarrays for samples corresponding to the seven time-points mentioned above were used to obtain relative expression estimates with respect to both T1 (6 comparisons) and Tdry (4 comparisons).

4.1.1 Dataset

We defined the universal set of genes as the 210 yeast ORFs that exhibited fold changes (positive or negative) greater than five for any experimental comparison, or had an average expression of 4-fold (positive or negative) or greater across all comparisons. This universal set is referred to as the high expressors in the rest of this chapter.
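The selection rule for high expressors can be written as a small filter. The sketch below is one possible reading of that rule, not the code used in the study: it assumes signed fold changes, and it assumes that the ‘average expression’ criterion averages the magnitudes of the fold changes. The function name, parameter names, and toy profiles are hypothetical.

```python
def is_high_expressor(fold_changes, peak_cutoff=5.0, mean_cutoff=4.0):
    """Apply the two selection criteria to the signed fold changes of one ORF.

    Assumption: the average criterion is taken over absolute fold changes.
    """
    magnitudes = [abs(f) for f in fold_changes]
    exceeds_peak = max(magnitudes) > peak_cutoff
    exceeds_mean = sum(magnitudes) / len(magnitudes) >= mean_cutoff
    return exceeds_peak or exceeds_mean


# Hypothetical toy profiles: one signed fold change per comparison.
profiles = {
    "orf_a": [6.2, -1.0, 1.5],   # one comparison exceeds 5-fold
    "orf_b": [-4.5, 4.0, -4.2],  # average magnitude is above 4-fold
    "orf_c": [2.0, -3.0, 1.0],   # meets neither criterion
}
universal_set = {orf for orf, changes in profiles.items() if is_high_expressor(changes)}
print(sorted(universal_set))
```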

Table 4.1: Descriptor vocabularies and domains (SGD refers to the Saccharomyces Genome Database).

Vocabulary        Domain                                  Count   Source
GO CEL            GO Cellular Category                      336   SGD
GO BIO            GO Biological Category                   1404   SGD
GO MOL            GO Molecular Category                    1494   SGD
DESIC             Expression levels: this work              303   this work
GASCH             Expression levels: Gasch et al.           936   [21]
HISTONE           Expression levels: Histone depletion      231   [96]
KMC               k-means clusters (from DESIC)              60   this work
HYDRO INDEX       Hydropathic protein index                  33   [19]
HYDRO SCORE       Hydropathic protein score                  37   [19]
MOD HYDRO INDEX   Modified protein hydropathicity            39   [19]
DRY ACTIVE        Expression level: commercial yeast         32   this work

We employed 11 types of vocabularies for descriptors that provide a total of just over 5000 descriptors (Table 4.1). One class of descriptors was derived from categories in the GO (Gene Ontology) biological process, GO cellular component, and GO molecular function taxonomies that are represented among the chosen genes. The microarray results from our experiment, chosen stress experiments of Gasch et al. [21] and histone depletion experiments by Wyrick et al. ([96]) were bucketed to yield range descriptors of the form ‘expression level ∈ [%x, 0] in time point %y of stress experiment %z’ (for negative %x) and ‘expression level ∈ [0, %x] in time point %y of stress experiment %z’ (for positive %x). Further, k-means clustering was performed using the Genesis software suite ([88]) on our time series data, with a setting of 10 and 20 clusters for all comparisons with respect to time points 1 and 3, yielding 60 descriptors. The hydropathy index descriptors were based on predictive models of the tendency of an amino acid to be structurally inside the protein. Finally, expression buckets of data from a commercial dry active yeast were used to form a final class of descriptors. Table 4.1 summarizes the number of descriptors of each type and the source for each descriptor family.
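The bucketing of expression values into range descriptors can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the pipeline used for the study: the function name range_descriptors, the input layout, the threshold values, and the gene names are all hypothetical.

```python
def range_descriptors(expression, thresholds):
    """Build range descriptors as sets of genes.

    expression: {gene: {(experiment, time_point): expression level}}
    thresholds: bucket boundaries %x; a negative x gives the range [x, 0],
                a positive x gives the range [0, x].
    Returns {descriptor name: set of genes whose level falls in the range}.
    """
    descriptors = {}
    for gene, levels in expression.items():
        for (experiment, time_point), level in levels.items():
            for x in thresholds:
                low, high = (x, 0) if x < 0 else (0, x)
                if low <= level <= high:
                    name = (
                        f"expression level in [{low}, {high}] "
                        f"in time point {time_point} of {experiment}"
                    )
                    descriptors.setdefault(name, set()).add(gene)
    return descriptors


# Hypothetical toy data: one measurement per gene.
expression = {
    "gene_1": {("DESIC", "T2"): -1.5},
    "gene_2": {("DESIC", "T2"): 1.0},
    "gene_3": {("DESIC", "T2"): -3.0},
}
descriptors = range_descriptors(expression, thresholds=[-2, 2])
for name, genes in sorted(descriptors.items()):
    print(name, sorted(genes))
```

Each resulting descriptor is simply a set of genes, which is the only representation that redescription mining requires of its inputs.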

4.1.2 CARTwheels Configuration

To invoke CARTwheels for our universal set, X was initialized to be all descriptors derived from our experimental dataset (which includes the range descriptors as well as the k-means clusters). This ensures that all redescriptions involve some aspect of the results from our experiment and prevents the possibility of, say, mining a redescription between two GO taxonomies. Y was initialized to the set of all descriptors; thus, there is some overlap between X and Y. In order to prevent obvious redescriptions arising from this overlap, the algorithm was precluded from utilizing descriptors in one tree if they are already present in the other tree.

A Jaccard’s threshold θ of 0.5 was employed and a depth-limit d ≤ 2 was set for both the top and bottom tree induction alternations. The limit on the number of allowable alternations η was set to 20, and ρ was set to 5. As before, redescriptions mined by CARTwheels are subjected to a ‘tightening’ step, akin to rule pruning in packages like C4.5 ([72]). A p-value cutoff of 0.005 for significance of redescriptions was utilized in this experiment. To gauge the importance of the results, the qualitative nature of biological results obtained through redescriptions is described.


4.2 Understanding Desiccation Tolerance

Ten key mined redescriptions (R1–R10) are depicted in Fig. 4.1 and Fig. 4.2. These redescriptions were selected both for their biological interest and for their feature construction novelties. The proteins encoded by genes in a redescription may interact with one another or with other proteins not included in the redescription. Such analyses make it possible to uncover cryptic and subtle features of gene expression and regulation.

Redescription R1 imposed a trend on the descriptor set showing genes with increasing expression after rehydration (Fig. 4.1), and redescribed them in relation to genes up-regulated during heat shock but down-regulated during peroxide treatment. Three genes were identified: SIP18, GRE1, and a putative γ-glutamyl kinase gene, YHR033W. SIP18 and GRE1 were two of the most up-regulated genes found in our study. Sip18p and Gre1p share those properties used to define a larger group of proteins, the hydrophilins [19], but possess no strong mutual similarities except for their N-termini. A FASTA-based search of the N-terminal 18-mers in the non-redundant database identified similar N-termini in putative proteins from Kluyveromyces lactis and Candida glabrata, as well as a dehydrin-like protein from the desiccation-tolerant resurrection plant Selaginella lepidophylla. SIP18 may be part of an osmotic stress response locus on Chromosome XIII consisting of ALD2, PAI3 and SIP18, with regulation occurring through the HOG2 signaling cascade [54] as well as through the cyclin-dependent kinase Ssn3p (Srb10p, [32]). Analysis of Sip18p demonstrated that the carboxy-terminal lysine residues of the protein are essential for binding to phospholipids in vitro [78]. Transcription of GRE1 and two other genes, GRE2 and GRE3, was reported to be responsive to osmotic and other stresses [20]. In our experiments GRE2 was activated 2-fold upon desiccation, after which its expression decreased during rehydration; transcription of GRE3 showed no activation in response to either stress. Gre2p was reported to interact with Pwp1p and Ygr111wp [33]. Pwp1p is a β-transducin-like protein containing WD-40 repeats characteristic of F-box-like proteins in many organisms [38]. The latter proteins are implicated in control of cell cycle transition, transcriptional regulation and signal transduction. Ygr111wp interacts with Lys14p [33], a transcriptional activator of the lysine pathway genes (with 2-aminoadipate semialdehyde as co-inducer) and is involved in saccharopine reductase synthesis.

Redescription R2 identified two additional candidate desiccation-regulated genes. URA7 (CTP synthetase) and YOR309C (unknown function) had transcription profiles of marked down-regulation at T2 and up-regulation following rehydration. CTP synthetase plays an essential role in the synthesis of all membrane phospholipids in eukaryotic cells including yeast [50, 65]. YOR309C encodes a 16-kDa protein with lysine and arginine contents of 16.7 and 27.8%, respectively, which is similar to the nucleic acid-binding protamine encoded by YOR053W. Control of chromatin condensation is likely an important component of the drying response of cells [81] and this is a potential role for YOR309C. In an effort to look at heterochromatin/euchromatin relationships, we attempted to redescribe the desiccation and rehydration data with those of a histone depletion study [96]. In Redescription R9, two formate dehydrogenase genes, FDH1 and FDH2, along with a heat shock encoding gene (SSA3) were significantly up-regulated in both studies, and in another case (Redescription R10) a putative tyrosine phosphatase gene (YGR203C) was found to be down-regulated during desiccation and rehydration, but increased in expression during the initial phase of the histone depletion experiment.


Figure 4.1: Redescription analysis of desiccation and rehydration in yeast. The top panel of each redescription describes the redescription in a text format. The bottom panel shows the redescription in graphical format. The Jaccard’s coefficient for each redescription (redescription quality) is shown over the redescription (blue) arrow. Also included in the bottom panel is the gene list obtained from the redescription. Limiters are in place for fold changes so that they do not pass through a fold change of zero. Thus when a fold change is stated as greater than -2, a range from -2 to 0 is meant. If a region defined in the graphical format extends to the end of the axis, it indicates that genes above the highest value listed on the axis will still be included (i.e., there is no upper limit on the fold change value). Notice that some redescriptions such as R4 are multi-way and provide more than two ways to describe a given set of genes. Redescriptions R9 and R10 compare the desiccation data with that reported from the histone depletion study of Wyrick et al. [96]; all others utilize data from Gasch et al. [21].


[Figure 4.2 graphic, panels A to C.]
Panel A, Redescription R5: Heat Shock, 30 min ≤ -1 ⇔ T2 vs T1 ≥ -5 AND NOT T2 vs T1 ≥ -1 (Jaccard’s coefficient 0.71).
Panel B, Redescription R5 gene list: ARO4, ASN1, CLN2, GAS3, HEM13, HIS1, IMD4, PHO3, RPL-7A, 7B, 13A, 17B, 27B, 40B, RPS-0B, 9B, 10A, 16B, 22B, 26B, SAH1, SAM1, SUN4, TEF4, TPO2, URA7, UTR2, YHB1, YBR238C, YER156C, YFR055W, YOR309C.
Panel C, interaction network. Legend categories: redescribed, down regulated; not redescribed, interacting, down regulated; not redescribed, interacting, up regulated; not redescribed, interacting, no change; not redescribed, marked up regulation; involved in sulfur metabolism; metabolite, cofactor.

Figure 4.2: Redescription analysis and sulfur metabolism. Panel A, statement of Redescription R5 that relates heat shock to the desiccation experiment. Panel B, a graphical depiction of the redescription and the genes identified. Panel C, desiccation and heat shock lead to down-regulation of sets of genes with a central function in sulfur metabolism. Genes present in the redescription were first analyzed to determine if their products had any known interaction with one another, e.g., SAM1 and URA7, or with other proteins, e.g., CLN2 and CDC34. The network of interactions and functions was built in an iterative fashion; in view of the potential function of its protein in phospholipid binding, SIP18 is shown close to other genes associated with lipid synthesis and binding (URA7, BOI1).


4.2.1 Recovery/Rehydration Phase

Redescription analysis was also used to investigate the rehydration process. Redescription R2 (Fig. 4.1) associates genes that are increased in transcription level during recovery (relative to Tdry) with those associated with a 1M sorbitol stress. Two genes were identified: SPS100 and YGL157W. YGL157W was down-regulated overall relative to T1, but its level increased significantly during the rehydration phase. Interestingly, this gene is annotated as encoding a protein with similarity to plant dihydroflavonol 4-reductase, the corresponding function of the protein encoded by GRE2. SPS100, up-regulated more than 7-fold, encodes a sporulation-specific wall maturation protein, Sps100p, which has a protective role during the early stages of spore wall formation [41], presumably when wall reorganization is occurring.

In redescription R3, HOR7, a hyperosmolarity-responsive gene, was up-regulated almost 10-fold at 6 hours of rehydration (relative to Tdry); the same gene was induced 7-fold in the presence of 0.32mM H2O2 for 30 min, but not for 50 min (Fig. 4.1). Hor7p is a small, depolarizing plasma membrane-bound protein [43]. It shares 49% sequence identity with Ddr2p, a protein currently annotated as a DNA damage response heat shock protein. Ddr2p has been suggested to be a vacuolar membrane protein (SGD website), and it is highly probable that it serves a role similar to that of Hor7p in the vacuole membrane. The mode of expression of HOR7 suggests a role in the rehydration response, thereby suggesting that depolarization of the plasma membrane may be a physiological response to rehydration.

4.2.2 Desiccation Tolerance and Sulfur Metabolism

Redescription R5 (Fig. 4.2) states that “among the 210 high-expressors found in our study with BY4743, the only genes that are 1-fold or more down-regulated during heat shock (30 min;24) are also those genes that are between 1-fold and 5-fold down-regulated at T2”. This redescription relates the expression changes of yeast during desiccation to those occurring during heat shock [21]. A conspicuous feature of Redescription R5 is the presence of three genes involved in sulfur metabolism: SAM1 (S-adenosylmethionine synthetase), encoding the protein that synthesizes the potential riboswitch ligand S-adenosylmethionine (SAM, AdoMet, 13); SAH1 (S-adenosyl-L-homocysteine hydrolase); and YFR055W (cystathionine β-lyase). Riboswitch ligands such as SAM appear to serve as ancient master control molecules whose concentrations are being monitored to ensure homeostasis of a much wider set of metabolic pathways [89, 95], and indeed SAM has recently been implicated in G1 cell cycle regulation [56].

The results of a redescription such as R5 can be used to further evaluate responses by determining the known interactions of the proteins encoded by these three genes. Here, Sam1p was reported to interact with 13 other proteins [22], and the gene for one of these, URA7, is also present in the redescription. In an iterative process using each of the genes (and their respective protein interactions [6]) a network of interactions was assembled from the redescription set (Fig. 4.2). Assembly of the network also relied upon the primary microarray data to infer possible additional relationships. For example, MET30, encoding a cell-cycle F-box protein, and also involved in sulfur metabolism and protein ubiquitination, can directly or indirectly be associated with TEF4 (translation elongation factor EF-1γ) and CLN2 (cyclin-dependent protein kinase regulator), both of which are present in the redescription. Note that MET30 itself was not present in the redescription, and in fact it had the highest level of up-regulation of all cell cycle control genes. Further refinement of the network included the clustering of genes from a pathway that shows interactions with other genes in the redescription but not with one another; for example, the clustering of HIS4 and HIS1, ARO1 and ARO4, LYS14 and LYS12, and GAS1 and GAS3. With the exception of MET30, SIP18 and CDC34, the transcription of each gene in the proposed network was either down-regulated or unchanged, suggesting a role for sulfur metabolism in the desiccation and rehydration response.

Many of the genes in R5 were identified in a subsequent redescription (R7 in Fig. 4.1) that mined trends in early down-regulation and late up-regulation with respect to heat shock transcription. Included among these were ribosomal protein genes S16B, S24B and S26B, PHO3 and CLN2. The acid phosphatase encoded by PHO3 (3.6 fold increased) is involved in transport of the vitamin thiamine, another potential riboswitch ligand. Thiamine-repressible acid phosphatase physiologically catalyzes the hydrolysis of thiamine phosphates in the periplasmic space of S. cerevisiae, and participates in the utilization of the thiamine moiety. CLN2, encoding G1 cyclin, a regulatory protein involved in reentry to the cell cycle following arrest, was also included in the subset. However, the highest fold induction of transcription was associated with LSR1, encoding U2 snRNA, a component involved in early assembly of the spliceosome. Of 39 up-regulated genes that were identified in redescription R8 (Fig. 4.1), 16 are involved in the transcriptional changes of the major metabolic pathways in response to desiccation. Additional genes present in redescription R8 include many of the high expressors in our experiment, including the acetate transport gene ADY2, and INO1, the gene encoding inositol 1-phosphate synthase. This redescription emphasizes that the response of yeast to water stress has components that are different from those involved in an oxidant stress.

4.3 Discussion

This chapter has shown that redescriptions are not only interesting patterns in their own right but can also serve as starting points for further analysis into stress response pathways. As expected, redescription analysis is extremely useful in relating the results of different experiments that could have a bearing on the results of the yeast study considered here. Knowledge of the molecular basis of desiccation tolerance as pointed to by redescriptions can help contribute to our basic understanding of the living cell and its inherent ability to enter into, and return from, a state of complete metabolic arrest.


Chapter 5

Algorithms for Storytelling

In this chapter, we formulate a new data mining problem called storytelling as a generalization of redescription mining. Recall that in traditional redescription mining, we are given a set of objects and a collection of subsets defined over these objects. The goal is to view the set system as a vocabulary and identify two expressions in this vocabulary that induce the same set of objects. Storytelling, on the other hand, aims to explicitly relate object sets that are disjoint (and hence, maximally dissimilar) by finding a chain of (approximate) redescriptions between the sets. This problem finds applications in bioinformatics, for instance, where the biologist is trying to relate a set of genes expressed in one experiment to another set, implicated in a different pathway. We outline an efficient storytelling implementation that embeds the CARTwheels redescription mining algorithm in an A* search procedure, using the former to supply next move operators on search branches to the latter. This approach is practical and effective for mining large datasets and, at the same time, exploits the structure of partitions imposed by the given vocabulary. Three application case studies are presented: a study of word overlaps in large English dictionaries, exploring connections between genesets in a bioinformatics dataset, and relating publications in the PubMed index of abstracts.

5.1 Introduction

Consider the set system in Fig. 5.1 where the six objects are books and the descriptors denote books about traveling in London (Y), books containing information about places where popes are interred (G), popular books about the history of codes and ciphers (R), books about Mary Magdalene (M), and books about the ancient Priory of Sion (B). An example redescription for this dataset is: ‘books involving Priory of Sion as well as Mary Magdalene are the same as non-travel books describing where popes are interred,’ or B ∩ M ⇔ G − Y. This is an exact redescription and gives two different ways of defining the singleton set {‘The Da Vinci Code’}.

Figure 5.1: An example input to storytelling.

While traditional redescription mining is focused on finding object sets that are similar, storytelling aims to explicitly relate object sets that are disjoint (and hence, maximally dissimilar). Given a start and end descriptor, the goal here is to find a path from one to the other through a sequence of intermediaries, each of which is an approximate redescription of its neighbor(s). An example story in the above dataset results when we try to relate London travel books to books about codes and cipher history: some London travel books (Y) overlap with books about places where popes are interred (G), some of which are books about ancient codes (R). This story is a sequence of (approximate) redescriptions: Y ⇔ G ⇔ R. Each step of this story holds with Jaccard’s coefficient 1/3 (the ratio of the number of common elements to the number of elements on either side of the redescription). A stronger story, that holds with Jaccard’s coefficient 1/2 at each step, is: B ⇔ (G ∩ M) ⇔ R.
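The Jaccard values quoted for these two stories can be checked on a small set system. The memberships below are hypothetical: they are chosen only to be consistent with the statements in the text (six books, B ∩ M = G − Y = {‘The Da Vinci Code’}, and the stated coefficients) and are not a transcription of Fig. 5.1. The placeholder titles book_1 to book_5 are likewise assumptions.

```python
from fractions import Fraction


def jaccard(a, b):
    """Jaccard's coefficient |a & b| / |a | b| as an exact fraction."""
    union = a | b
    return Fraction(len(a & b), len(union)) if union else Fraction(0)


# Hypothetical memberships consistent with the statements in the text.
Y = {"book_1", "book_2"}            # London travel
G = {"book_2", "da_vinci_code"}     # where popes are interred
R = {"da_vinci_code", "book_3"}     # history of codes and ciphers
M = {"da_vinci_code", "book_4"}     # Mary Magdalene
B = {"da_vinci_code", "book_5"}     # Priory of Sion

# Exact redescription: both sides define the same singleton set.
exact_left, exact_right = B & M, G - Y

# Story Y <=> G <=> R and the stronger story B <=> (G & M) <=> R.
weak_story = [jaccard(Y, G), jaccard(G, R)]
strong_story = [jaccard(B, G & M), jaccard(G & M, R)]

print(exact_left == exact_right, weak_story, strong_story)
```

Using exact fractions avoids floating-point comparisons when checking a step against a threshold such as 1/3.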

Why is this problem interesting and relevant? Storytelling finds application in many domains, such as bioinformatics, computational linguistics, document modeling, social network analysis, and counter-terrorism. In these contexts, stories reveal multiple forms of insights. First, since intermediaries must conform to a priori knowledge, we can think of storytelling as a carefully argued process of removing and adding participants, not unlike a real story. Knowing exactly which objects must be displaced, and in what order, helps expose the mechanics of complex relationships. Second, storytelling can be viewed as an abstraction of relationship navigation for propositional vocabularies and offers insights similar to what we expect to gain from techniques such as inductive logic programming applied to multi-relational databases. Third, storytelling reveals insight into how the underlying Venn diagram of sets is organized, and how it can be harnessed for explaining disjoint connections. In particular, we can investigate if certain sets have greater propensity for participating in some stories more than others. Such insights have great explanatory power and help formulate hypotheses for situating new data in the context of well-understood processes. Finally, in domains such as bioinformatics, the emergence of high-throughput data acquisition systems (e.g., genome-wide functional screens, microarrays, RNAi assays) has made it easy for a domain scientist to define vocabularies and sets. We argue that these domains are now suffering from ‘descriptor overload’; storytelling promises to be a valuable tool to attack this problem and reconcile disparate vocabularies.

Why is this problem difficult? Storytelling is non-trivial because the space of possible descriptor expressions is not enumerable beforehand and hence the network of overlap relationships cannot be materialized statically. In a typical application, we have hundreds to thousands of objects and an order of magnitude more descriptors, with an even larger number of possible set-theoretic constructions made of the descriptors. Effective storytelling solutions must multiplex the task of constructive induction of descriptor expressions with focused search toward the end point of the story.


5.2 Background

The inputs to storytelling are the universal set of objects O (e.g., books, genes) and a collection S of subsets of O. The elements of S are descriptors Si, and are assumed to form a covering of O, i.e., $\bigcup_i S_i = O$. In addition, two elements of S are designated as the starting descriptor (A) and the ending descriptor (B), where typically A ∩ B = {}. In the simplest version of storytelling, we must formulate a sequence of descriptors (called a story) Z1, Z2, ···, Zk ∈ S where Z1 = A, Zk = B, and J(Zi, Zj) ≥ θ, 1 ≤ i < k, j = i + 1, θ being a user supplied threshold. Here, J denotes the Jaccard’s coefficient between two sets, given by:

\[
J(Z_i, Z_j) = \frac{|Z_i \cap Z_j|}{|Z_i \cup Z_j|}
\]

J is zero if the sets are disjoint, is one if they are the same, and is between zero and one if either descriptor is a subset of the other, or when they are overlapping. Every successive pair of descriptors in the story constitutes a redescription, Zi ⇔ Zj, whose left and right sides (approximately) induce the same subset of O. Observe that storytelling can progress only when redescriptions are strictly approximate, i.e., when 0 < J(Zi, Zj) < 1.

In the generalized version of storytelling studied here, the intermediaries Zi are not constrained to be just elements of S but can be set-theoretic expressions over the Si’s, e.g., S1 ∪ S3, S2 − S4, S1 ∩ (S2 ∪ S5). In this case, to ensure well-posedness, we will require that all participating expressions conform to a given bias, e.g., monotone forms, conjunctions, or are otherwise length-limited.

Additionally, in the generalized version, we do not restrict our descriptors to be simple sets but allow them to be multisets as well. Here, we consider O to be an ordered set of elements and model each multiset descriptor (Zi) as a weighted vector $V_{Z_i}$ defined over O. The weight corresponding to each element in O could simply be the frequency of that element in the descriptor or it could be a more complicated formulation of the same. For multisets, we use the weighted Jaccard’s coefficient as our similarity measure between two descriptors. The weighted Jaccard’s coefficient (Jw) between descriptors Zi and Zj can be defined as follows:

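\[
J_w(Z_i, Z_j) = \frac{\sum_{x=1}^{|O|} V_{Z_i}[x]\, V_{Z_j}[x]}{\sum_{x=1}^{|O|} \left(V_{Z_i}[x]\right)^2 + \sum_{x=1}^{|O|} \left(V_{Z_j}[x]\right)^2 - \sum_{x=1}^{|O|} V_{Z_i}[x]\, V_{Z_j}[x]}
\tag{5.1}
\]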

Notice that the weighted Jaccard’s coefficient reduces to the unweighted case if we use 1 as the weight for each element present in a multiset and 0 for elements not in the multiset. The approach we outline for mining stories for simple sets generally holds for the case of multiset descriptors as well. In case any of the steps needs to be handled differently, we outline the details of the alternate approach as well.
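The two similarity measures can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this work: the function names `jaccard` and `weighted_jaccard` are hypothetical, multisets are represented as dictionaries from objects to weights, and returning 0 for two empty descriptors is an assumed convention.

```python
def jaccard(a, b):
    """Jaccard's coefficient |a ∩ b| / |a ∪ b| of two sets."""
    union = a | b
    # Assumed convention: two empty sets have similarity 0.
    return len(a & b) / len(union) if union else 0.0


def weighted_jaccard(va, vb):
    """Weighted Jaccard's coefficient of Eqn. 5.1.

    va and vb map each object to its weight; absent objects weigh 0.
    """
    keys = set(va) | set(vb)
    dot = sum(va.get(x, 0) * vb.get(x, 0) for x in keys)
    norm_a = sum(w * w for w in va.values())
    norm_b = sum(w * w for w in vb.values())
    denom = norm_a + norm_b - dot
    return dot / denom if denom else 0.0


# With 0/1 weights the weighted form agrees with the set form.
a, b = {"o1", "o2", "o3"}, {"o3", "o5"}
same = weighted_jaccard({x: 1 for x in a}, {x: 1 for x in b}) == jaccard(a, b)
```

With 0/1 weights, the numerator of Eqn. 5.1 counts the shared elements and the denominator counts the union, which is why the final line evaluates to true.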

There are many algorithms proposed for mining redescriptions, some based on systematic enumeration and pruning (e.g., CHARM-L [97]) and some based on heuristic search (like CARTwheels considered in Chap. 2). These algorithms also differ in their choice of bias (conjunctions in CHARM-L, and depth-limited DNF expressions in CARTwheels). We focus on CARTwheels as it provides the exploratory features necessary to incrementally construct stories. However, unlike Chap. 2, where CARTwheels alternation was configured to enumeratively explore the space of all redescriptions, the main contribution of the present chapter is to demonstrate how we can focus the alternation toward a target descriptor.


5.3 Designing a Storyteller

We embed CARTwheels inside an A* search procedure, using the former to supply next move operators on search branches to the latter. Each move is a possible redescription to be explored and a heuristic function evaluates these redescriptions for their potential to lead to the end descriptor of the story. In this chapter, we focus on story length (the number of redescriptions needed to reach the end descriptor) as the primary criterion of optimality, although different criteria might be more suitable in other applications. Backtracking happens when a previously unexplored move (redescription) appears more attractive than the current descriptor. The search terminates when we reach a descriptor that is within the specified Jaccard’s threshold from the ending descriptor or when there are no redescriptions left to explore.

5.3.1 Working Example

For ease of illustration, consider the artificial example in Fig. 5.2 with six descriptors {S1, S2, S3, S4, S5, S6} defined over the universal set O = {o1, o2, o3, o4, o5, o6}. Our goal is to find a story between descriptor S1, corresponding to the set {o1}, and S5, corresponding to the set {o5}, such that each step is a redescription that holds with Jaccard’s coefficient at least θ = 0.5. In this example, we set the maximum depth of CARTs used to 2.

We begin with a CART that models the start descriptor S1 by exposing the partition $P_1 = \{S_1, \overline{S_1}\}$, i.e., {{o1}, {o2, o3, o4, o5, o6}}. The end descriptor is similarly captured by a CART exposing the partition $P_5 = \{S_5, \overline{S_5}\}$ = {{o5}, {o1, o2, o3, o4, o6}}. We first translate our requirement of minimum Jaccard’s coefficient into a threshold on minimum distance between partitions. (This is necessary because, unlike metrics such as entropy gain and Gini coefficient, a constraint on Jaccard’s coefficient does not directly characterize probability distributions necessary for incremental induction of decision trees.)

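The specific partition-distance metric we use between partitions $P_i = \{Z_i, \overline{Z_i}\}$ and $P_{i+1} = \{Z_{i+1}, \overline{Z_{i+1}}\}$ is the Lopez de Mantaras [44] criterion:

\[
\begin{aligned}
D(P_i, P_{i+1}) = {} & -\frac{1}{|O|}\left[ z_a \log\frac{z_a}{|Z_{i+1}|} + z_c \log\frac{z_c}{|Z_{i+1}|} \right] \\
& -\frac{1}{|O|}\left[ z_b \log\frac{z_b}{|\overline{Z_{i+1}}|} + z_d \log\frac{z_d}{|\overline{Z_{i+1}}|} \right] \\
& -\frac{1}{|O|}\left[ z_a \log\frac{z_a}{|Z_i|} + z_b \log\frac{z_b}{|Z_i|} \right] \\
& -\frac{1}{|O|}\left[ z_c \log\frac{z_c}{|\overline{Z_i}|} + z_d \log\frac{z_d}{|\overline{Z_i}|} \right]
\end{aligned}
\tag{5.2}
\]

where $z_a = |Z_i \cap Z_{i+1}|$, $z_b = |Z_i \cap \overline{Z_{i+1}}|$, $z_c = |\overline{Z_i} \cap Z_{i+1}|$, and $z_d = |\overline{Z_i} \cap \overline{Z_{i+1}}|$.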

S1 = {o1}
S2 = {o1, o2, o3}
S3 = {o2, o4}
S4 = {o3, o5}
S5 = {o5}
S6 = {o6}

Figure 5.2: Example data for illustrating operation of storytelling algorithm.

Figure 5.3: Behavior of distance metric used for decision tree construction. Each column of data points from the x-axis upwards represents the distance values as |Zi − Zi+1| increases for a fixed value of |Zi+1 − Zi|. The gray data points bound the region in which we need the distance between partitions to lie for the Jaccard’s threshold to hold. The parameters used in this plot are: |Zi| = 10, |O| = 100, θ = 0.2.

In Eqn. 5.2, the first four terms correspond to the entropy when Pi+1 is used to split Pi. Similarly, the last four terms correspond to the entropy when Pi is used to split Pi+1. Given |Zi| and |O|, the distance value in Eqn. 5.2 is dependent on |Zi − Zi+1| and |Zi+1 − Zi|. The Jaccard’s coefficient between Zi and Zi+1 can also be calculated using these two quantities as:

\[
J(Z_i, Z_{i+1}) = \frac{|Z_i| - |Z_i - Z_{i+1}|}{|Z_i| + |Z_{i+1} - Z_i|}
\]

If we fix the value of |Zi+1 − Zi|, where

\[
0 \le |Z_{i+1} - Z_i| \le \left\lfloor \frac{(1 - \theta)\,|Z_i|}{\theta} \right\rfloor
\]

there are values of |Zi − Zi+1|, such that

\[
\left\lfloor (1 - \theta)\,|Z_i| \right\rfloor \ge |Z_i - Z_{i+1}| \ge 0
\]

and where the Jaccard’s coefficient between Zi and Zi+1 is greater than θ.

Fig. 5.3 shows the behavior of the distance function in Eqn. 5.2 for different values of |Zi − Zi+1| and |Zi+1 − Zi|. Here, each column of data points from the x-axis upwards represents the distance values as |Zi − Zi+1| increases for a fixed value of |Zi+1 − Zi|. This figure clearly indicates the region (bounded by the points shown in gray) in which we need the distance between partitions to lie for the Jaccard’s threshold to hold. We call the region covered by points in Fig. 5.3 the region of similar partitions. In general, if we try to minimize the distance between two partitions, we pull the next partition towards the region of similar partitions. Apart from this simple relationship between the distance metric and Jaccard’s coefficient, another reason why we chose this metric is that it promises to provide a better match between two consecutive partitions given the restricted length of the trees we construct [44].

Figure 5.4: Storytelling using CARTwheels alternation. Beginning with S1, the starting descriptor exposed by the bottom tree in (a), the alternation systematically moves toward S5, the ending descriptor in (d). At each step we alternately keep one of the trees fixed and grow a new tree to match it. The partition induced by each tree is shown above or below the tree as the case may be. The story mined here is the sequence of redescriptions: S1 ⇔ (S2 − S3) ⇔ $\overline{S_3}$ ⇔ S4 ⇔ S5.
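The distance of Eqn. 5.2 and its relationship to the Jaccard’s coefficient can be made concrete with a small Python sketch. The function names are hypothetical, the logarithm is taken to be natural (the base is an assumption, since it only rescales the distance), and 0 log 0 is treated as 0.

```python
import math


def partition_distance(z_i, z_next, universe):
    """Eqn. 5.2 for the 2-block partitions induced by z_i and z_next.

    Sums the entropy of P_i when split by P_{i+1} and vice versa.
    """
    n = len(universe)
    comp_i, comp_next = universe - z_i, universe - z_next
    total = 0.0
    # Each block of one partition is split by the two blocks of the other.
    for blocks, splitters in (((z_next, comp_next), (z_i, comp_i)),
                              ((z_i, comp_i), (z_next, comp_next))):
        for block in blocks:
            for part in splitters:
                z = len(block & part)
                if z:  # 0 * log(0) is taken as 0
                    total -= (z / n) * math.log(z / len(block))
    return total


def jaccard_from_differences(size_i, lost, gained):
    """J(Z_i, Z_{i+1}) from |Z_i|, |Z_i - Z_{i+1}| and |Z_{i+1} - Z_i|."""
    return (size_i - lost) / (size_i + gained)
```

For the parameters of Fig. 5.3 (|Zi| = 10, θ = 0.2), the two extreme cases `jaccard_from_differences(10, 8, 0)` and `jaccard_from_differences(10, 0, 40)` both sit exactly on the threshold, matching the bounds ⌊(1 − θ)|Zi|⌋ = 8 and ⌊(1 − θ)|Zi|/θ⌋ = 40.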

The general idea now is to consider the partition exposed by the current tree and induce new trees to match this partition. In Chap. 2, we allow these partitions to be composed of arbitrary blocks so that when two trees match, we obtain as many redescriptions as there are matching blocks. For storytelling, we focus on only 2-block partitions since, by the statement of the storytelling problem, both start and end descriptors induce only 2-block partitions and we desire a single chain of redescriptions between them. Thus, for a 1-level tree, which has only one descriptor such as Zi, the partition would be $P_i = \{Z_i, \overline{Z_i}\}$. For a 2-level tree, the partition would correspond to how the four paths in the tree are merged to induce a 2-block partition.

Consider the beginning partition $\{S_1, \overline{S_1}\}$. In constructing a tree to match this partition, we look for a descriptor inducing a 2-block partition such that the distance between that partition and $\{S_1, \overline{S_1}\}$ is minimized. If, for one or more descriptors, the distance lies in the similar partitions region, we greedily choose the one which induces a block Zi+1 with the highest Jaccard’s coefficient with the end point of the story. Once the tree has been constructed in this manner, class assignments at the leaves are made by majority and paths that lead to a given class are union-ed to form redescriptions.


obj.   S2   S3   S4   S5   S6   class
o1     √    ×    ×    ×    ×    $S_1$
o2     √    √    ×    ×    ×    $\overline{S_1}$
o3     √    ×    √    ×    ×    $\overline{S_1}$
o4     ×    √    ×    ×    ×    $\overline{S_1}$
o5     ×    ×    √    √    ×    $\overline{S_1}$
o6     ×    ×    ×    ×    √    $\overline{S_1}$

obj.   S1   S3   S4   S5   S6   class
o1     √    ×    ×    ×    ×    $(S_2 - S_3)$
o2     ×    √    ×    ×    ×    $\overline{(S_2 - S_3)}$
o3     ×    ×    √    ×    ×    $(S_2 - S_3)$
o4     ×    √    ×    ×    ×    $\overline{(S_2 - S_3)}$
o5     ×    ×    √    √    ×    $\overline{(S_2 - S_3)}$
o6     ×    ×    ×    ×    √    $\overline{(S_2 - S_3)}$

Figure 5.5: (top) Dataset to initialize storytelling algorithm. (bottom) Dataset for the second alternation.

For instance, Fig. 5.4 (a) shows the decision tree we have constructed to match the partition $\{S_1, \overline{S_1}\}$. This tree provides the first step in the story to be the redescription S1 ⇔ (S2 − S3). In this example, we show only one possible ‘next tree’ but in our implementation, we maintain a number of such possible matching trees, to simulate a branching process and for potential backtracking. Note that while the current redescription holds with a Jaccard’s value of 0.5, the new descriptor does not have any overlap with S5 (the target).

The alternation process is easily understood by considering the datasets from which each tree is being learned. To initialize the alternation, we prepare a traditional dataset for classification tree induction (see the first dataset in Fig. 5.5), where the entries correspond to the objects, the class (to be learned) corresponds to membership in the starting descriptor, and the boolean features are comprised of the remaining descriptors. For the next step in our story, we use the partition $\{(S_2 - S_3), \overline{(S_2 - S_3)}\}$ as the classes to match and consider the second dataset in Fig. 5.5. In constructing the new dataset, observe that we ignore the descriptor that is the top-most node (here, S2) in the decision tree that defines the current partition. This ensures that we do not utilize the same features for matching a partition as those that define the partition! The one-level tree we learn at this stage is shown in Fig. 5.4 (b). The redescription of interest here is (S2 − S3) ⇔ $\overline{S_3}$, which also holds with a Jaccard’s coefficient of 0.5. Although it introduces the element we seek (o5), the redescription to the end point of the story, $\overline{S_3}$ ⇔ S5, has only a Jaccard’s coefficient of 0.25. We hence continue the search and obtain the redescription $\overline{S_3}$ ⇔ S4, which gives us the desired overlap with the target, and our final redescription, namely S4 ⇔ S5. Our story is thus

S1 ⇔ (S2 − S3) ⇔ $\overline{S_3}$ ⇔ S4 ⇔ S5.
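The Jaccard’s values quoted for this story can be checked directly against the data of Fig. 5.2. The short Python sketch below does so; it reads the third descriptor of the story as the complement of S3, an interpretation that is consistent with the values of 0.5 and 0.25 stated above, and the variable names are only illustrative.

```python
O = {"o1", "o2", "o3", "o4", "o5", "o6"}
S = {1: {"o1"}, 2: {"o1", "o2", "o3"}, 3: {"o2", "o4"},
     4: {"o3", "o5"}, 5: {"o5"}, 6: {"o6"}}


def jaccard(a, b):
    """Jaccard's coefficient of two non-empty sets."""
    return len(a & b) / len(a | b)


# The story of the working example, with the complement of S3 written out.
story = [("S1", S[1]),
         ("S2 - S3", S[2] - S[3]),
         ("not S3", O - S[3]),
         ("S4", S[4]),
         ("S5", S[5])]

# Jaccard's coefficient of every consecutive pair (each redescription).
steps = [(left, right, jaccard(a, b))
         for (left, a), (right, b) in zip(story, story[1:])]

# Overlap of the intermediate descriptor "not S3" with the target S5.
shortcut = jaccard(O - S[3], S[5])
```

Every entry of `steps` meets the threshold θ = 0.5, whereas `shortcut` falls below it, which is why the search must pass through S4 before reaching S5.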

5.3.2 Implementation

The storytelling algorithmic framework is shown in Table 5.1 and follows the outline of the working example above. The key utility functions are: construct_dataset, which prepares the dataset D at each alternation; construct_tree, which is called b (branching factor) times at each step to create trees of desired depth limit d; eval, which determines whether the Jaccard’s coefficient between the current descriptor and the union of the paths leading to it in the current tree is higher than or equal to θ; calculate_heuristic_score, which computes the heuristic score for each tree used to guide the search; and print_story, which prints the story by tracing back the sequence of mined redescriptions.

In the outer A* search procedure, qualified trees are placed in the open list OL (a priority queue) and considered in order of their evaluations. If the heuristic evaluation hN for the currently picked tree tN is zero, we have arrived at a tree that has sufficient Jaccard’s overlap with the end point of the story and we terminate. If hN is not zero, a new set of classes C for objects in O is induced using the function paths_to_classes, tN is moved to the closed list, and all trees induced by construct_tree are placed in the open list. This process is repeated until there is no tree left in the open list or a story has been found.

Just as the notion of ‘class’ is revised periodically at each alternation, so is the candidate set of ‘features.’ Observe that, in the Initialization step, the set of possible features involves all except the starting descriptor. Inside the Alternation subroutine, the candidate set of features is made equal to all except the feature used at the root of the current tree (supplied by the routine top).

The heuristic score hj for tree tj is combined with cost expended so far (gj) to arrive at the evaluation criterion sj. Nodes in OL are hence ordered by sj. We assume unit cost per redescription so that the story length is the number of redescriptions required to traverse to the ending descriptor. The heuristic function h is designed to systematically never over-estimate the number of redescriptions remaining and takes the value of zero for a tree whose partition is within the specified Jaccard’s coefficient to the ending descriptor. We now present details of h and prove its admissibility.
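The ordering of the open list by s = g + h can be illustrated with a deliberately simplified Python sketch. It is not the framework of Table 5.1: instead of inducing candidate descriptors with CARTwheels, it assumes a fixed, pre-enumerated pool of named descriptor expressions, and it uses a coarse heuristic (0 at the goal, 1 when one redescription away, 2 otherwise) that never over-estimates. The function `find_story` and the pool naming scheme are hypothetical.

```python
import heapq
from itertools import count


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_story(pool, start, end, theta):
    """A*-style search for a shortest story within a fixed candidate pool."""
    def h(name):
        # Coarse lower bound on the number of redescriptions still needed.
        if name == end:
            return 0
        return 1 if jaccard(pool[name], pool[end]) >= theta else 2

    tie = count()  # tie-breaker so the heap never compares paths
    open_list = [(h(start), next(tie), 0, start, [start])]
    closed = set()
    while open_list:
        _, _, g, name, path = heapq.heappop(open_list)
        if name == end:
            return path
        if name in closed:
            continue
        closed.add(name)
        for nxt, members in pool.items():
            # A move is legal only if it is a redescription at threshold theta.
            if nxt not in closed and jaccard(pool[name], members) >= theta:
                heapq.heappush(open_list, (g + 1 + h(nxt), next(tie),
                                           g + 1, nxt, path + [nxt]))
    return None


# Candidate pool for the working example: descriptors, complements, S2 - S3.
O = {"o1", "o2", "o3", "o4", "o5", "o6"}
S = {"S1": {"o1"}, "S2": {"o1", "o2", "o3"}, "S3": {"o2", "o4"},
     "S4": {"o3", "o5"}, "S5": {"o5"}, "S6": {"o6"}}
pool = dict(S)
pool.update({"~" + name: O - members for name, members in S.items()})
pool["S2 - S3"] = S["S2"] - S["S3"]

story = find_story(pool, "S1", "S5", 0.5)
```

On this pool the search recovers the story of the working example; the essential difference in the actual algorithm is that the pool is never materialized, and candidates are instead grown on demand as decision trees.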

5.3.3 Heuristic

Table 5.2 outlines the approach to estimate hj for tree tj. This algorithm can be understood as follows. Assume that the new descriptor Zj (provided by tree tj) has fj elements in common with the target descriptor Y and ej elements that do not participate in Y. This means that Zj must shed enough of the ej elements and acquire enough of the |Y| − fj elements in order to have a Jaccard’s threshold of ≥ θ with Y. The goal of calculate_heuristic_score is to estimate the minimum number of redescriptions required to shed the requisite number among ej elements and acquire some of the necessary |Y| − fj elements. The procedure first conservatively estimates if the current discrepancies already correspond to a Jaccard’s threshold of ≥ θ with Y, in which case it returns zero. If this is not possible, the procedure estimates the shortest number of steps in which the deletions and additions can happen by a recursive computation. Two extremes are considered at each step: the case where we can acquire as many of the necessary new elements as dictated by θ without any removals, and the case where we can shed as many of the unnecessary elements as dictated by θ without any additions. This step provides us the bounds δfmax and δemax in Table 5.2. We then search combinatorially within these ranges for the maximal number of deletions, for every possible number of additions, such that θ holds, akin to dynamic programming. The minimum number of redescriptions over all possibilities is then returned.

Table 5.1: Storytelling algorithmic framework.

Input:
    1. a domain of objects O;
    2. a collection S of sets defined over O;
    3. a designated starting descriptor A ∈ S; and
    4. a designated ending descriptor B ∈ S.

Parameters:
    1. a threshold θ (0 < θ < 1) denoting the minimum required Jaccard’s coefficient for each connection in the story;
    2. d (depth of trees) that imposes a bias $\mathcal{B}$ over set expressions defined on S; and
    3. branching factor b that restricts the maximum number of possible next states from each state in the A* search.

Output: a story comprising a sequence of intermediaries Z1, Z2, ···, Zk, such that Z1 = A, Zk = B, J(Zi, Zi−1) ≥ θ, 1 < i ≤ k, and each of the Zi’s is in the desired bias $\mathcal{B}$.

Initialization:
    set open list for A* search OL = {}; closed list for A* search CL = {}
    set classes C = {A, $\overline{A}$}; features F = S − C
    set dataset D = construct_dataset(O, F, C)
    for (1 ≤ j ≤ b)
        set tree tj = construct_tree(D, d, j)
        if (eval(tj, θ))
            set gj = 0; hj = calculate_heuristic_score(tj, C, B, θ, |O|); sj = gj + hj
            add tj with (sj, gj, hj) as a node in OL
        end if
    end for

Alternation:
    set boolean done = false
    while ((OL is not empty) AND !done)
        tN = head(OL)
        if (hN = 0)
            print_story(tN, CL)
            set done = true
        end if
        set classes C = paths_to_classes(tN); features F = S − top(tN)
        set dataset D = construct_dataset(O, F, C)
        for (1 ≤ j ≤ b)
            set tree tj = construct_tree(D, d, j)
            if (eval(tj, θ))
                set gj = gN + 1; hj = calculate_heuristic_score(tj, C, B, θ, |O|); sj = gj + hj
                add tj with (sj, gj, hj) as a node in OL
            end if
        end for
        delete head from OL
    end while

For a worked out example, consider Fig. 5.6. Here assume that our start descriptor (for the universal set O defined earlier) is defined using the first tree grown in our running example (Fig. 5.4(a)), i.e., S2 − S3. The end descriptor is the same as in our running example. In the directed graph shown, each node corresponds to the tuple (f, e). Thus, the start node (rectangle) is set to (0, 2) since the descriptor {o1, o3} contains none (0) of the required elements and 2 spurious elements (which are to be discarded). Similarly, the end node (hexagon) is set to (1, 0). The Jaccard’s coefficient threshold is set to 0.5. For (0, 2), δfmax = 1 and δemax = 1. Thus, the best possible next states correspond to the other ends of the two directed edges starting from (0, 2). However, none of the next two states possible has the required overlap with (1, 0). Hence, we recurse the given problem into a smaller subproblem. As an example, we move to the state (1, 1) from the node (1, 2). This new state has the required overlap with the final node and thus we have found a path to the final state as: (0, 2) → (1, 2) → (1, 1) → (1, 0). The other possible nodes and paths can also be examined in this manner to construct the whole graph shown. From this graph, the minimum path length found is 3 and that is the value returned by our heuristic function for the state (0, 2).
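The recursion of Table 5.2 can be sketched in Python as follows. This is one possible reading of the table rather than the implementation used here, and it departs from the table in three labeled ways: it returns the number of remaining moves instead of threading the counter h through the calls, it skips the no-op move (i = 0, k = 0) so that every recursive call makes progress, and it caps the number of removals k at e. The name `heuristic_steps` is hypothetical, and θ is best passed as an exact `Fraction` to avoid rounding in the floor computations.

```python
from fractions import Fraction
from functools import lru_cache
from math import floor, inf


def heuristic_steps(f, e, y_size, theta, universe_size):
    """Moves needed until a state (f, e) is within theta of the target Y."""
    theta = Fraction(theta)

    @lru_cache(maxsize=None)
    def minpath(f, e):
        # Jaccard's coefficient between the current descriptor and Y.
        if Fraction(f, e + y_size) >= theta:
            return 0
        df_max = min(floor((1 - theta) * (f + e) / theta), y_size - f)
        # Assumption of this sketch: removals are also capped at e.
        de_max = min(floor((1 - theta) * (f + e)),
                     universe_size - y_size - e, e)
        best = inf
        for i in range(df_max + 1):
            # Largest number of removals k that keeps the move legal.
            for k in range(de_max, -1, -1):
                if i == 0 and k == 0:
                    break  # assumption: a move must change the state
                if Fraction(f + e - k, f + e + i) >= theta:
                    best = min(best, 1 + minpath(f + i, e - k))
                    break
        return best

    return minpath(f, e)
```

Under this reading the function counts moves only until the current state first meets the threshold with Y, so the final hop onto Y itself is not included. For the state (0, 2) of Fig. 5.6 it evaluates to 2, reaching (1, 1) through either (1, 2) or (0, 1); adding the last hop to (1, 0) gives the three-edge paths drawn in the figure.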

Table 5.2: Heuristic for storytelling A* search.

calculate_heuristic_score(tj, C, Y, θ, |O|):
    set Zj−1 = target class from C; Zj = block from tj that redescribes to Zj−1
    set fj = |Zj ∩ Y|; ej = |Zj − Y|
    set hj = 0
    calculate hj = minpath(fj, ej, |Y|, hj, θ, |O|)
    return hj

minpath(f, e, |Y|, h, θ, |O|):
    calculate θY = f/(e + |Y|)
    if (θY ≥ θ)
        return h
    else
        calculate δfmax = min(⌊(1 − θ)(f + e)/θ⌋, |Y| − f); δemax = min(⌊(1 − θ)(f + e)⌋, |O| − |Y| − e)
        set hmin = ∞
        for (i = 0; i ≤ δfmax; i = i + 1)
            set done = false
            for (k = δemax; k ≥ 0 and !done; k = k − 1)
                calculate θnew = (f + e − k)/(f + e + i)
                if (θnew ≥ θ)
                    set done = true; hcurr = minpath(f + i, e − k, |Y|, h + 1, θ, |O|)
                    if (hcurr < hmin)
                        set hmin = hcurr
                    end if
                end if
            end for
        end for
        return hmin
    end if

Lemma 1 The heuristic defined by calculate_heuristic_score is admissible.

Proof: The heuristic function is considered to be admissible if it never overestimates the cost (length) of a path from a node to the goal. Our proof for the admissibility of the heuristic function takes into consideration the graph of all nodes (f, e) possible and progressively shrinks it to only the nodes that could take part in the shortest path from the node to the goal. It then shows that the exhaustive search we perform using the dynamic programming paradigm finds the shortest path in this reduced graph, which in turn corresponds to the shortest path in the original graph. This shortest path length is never greater than the actual number of steps needed to go from the given node to the goal.

Figure 5.6: The working of our heuristic function. The start state (descriptor) is represented by the rectangle node corresponding to the tuple (0, 2) and the end state by the hexagon node (1, 0). For θ = 0.5, the next states considered from each state are connected by the edges. The two different sets of lines (black and gray) represent the paths that could be taken to reach the final state in the minimum (3) number of steps (redescriptions).

Figure 5.7: The graph G for our running example. Edges in the graph correspond to redescriptions and nodes are (f, e) tuples. The start node for our story is shown in rectangle and the end node is the hexagon. Edges represent redescriptions that hold for Jaccard’s coefficient of 0.5. The directed dashed edges represent the path taken in our running example to reach the final state from the start state. The thick directed edges represent the path taken in evaluation of the heuristic.

In very general terms, given A, B, O and θ, we can construct a graph $G = \{\mathcal{V}, \mathcal{E}\}$ where $\mathcal{V} = \{V_i \mid V_i = (f_i, e_i),\ f_i \le |B|,\ (f_i + e_i) \le |O|\}$. For any two nodes $V_i$ and $V_j$ in $\mathcal{V}$, we can calculate the maximum value of Jaccard’s coefficient possible between them as:

\[
J_{max}(V_i, V_j) = \frac{|V_i \cap V_j|}{f_i + e_i + f_j + e_j - |V_i \cap V_j|}
\]

where $|V_i \cap V_j|$ is defined as $\min(f_i, f_j) + \min(e_i, e_j)$.

Using this formula, we can define $\mathcal{E} = \{E_{i,j} \mid E_{i,j} = (V_i, V_j),\ V_i \in \mathcal{V},\ V_j \in \mathcal{V},\ J_{max}(V_i, V_j) \ge \theta\}$. The graph G for our working example is shown in Fig. 5.7. Notice that any descriptor in S can be represented as a node in the graph G using the mapping function $M(Z_i) = V_i$ where $V_i = (f_i, e_i)$, $f_i = |Z_i \cap B|$ and $e_i = |Z_i| - f_i$. Thus, in G, B is represented as the node $V_B = (f_B, e_B) = (|B|, 0)$ (hexagon node in Fig. 5.7) and A is the node $V_A = (f_A, e_A)$, typically of the form $(0, |A|)$ (rectangle node in Fig. 5.7). We use these definitions for $V_A$ and $V_B$ throughout the course of this proof.

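Next, we define a subgraph of G restricted to the legal descriptors from S as defined by the bias $\mathcal{B}$ as $G_S = \{\mathcal{V}_S, \mathcal{E}_S\}$. Here, $\mathcal{V}_S = \{V_i \mid V_i \in \mathcal{V},\ \exists Z_i \in S \text{ s.t. } M(Z_i) = V_i\}$ and $\mathcal{E}_S = \{E_{i,j} \mid E_{i,j} = (V_i, V_j),\ V_i \in \mathcal{V}_S,\ V_j \in \mathcal{V}_S,\ J_{max}(V_i, V_j) \ge \theta\}$. Note that $V_A \in \mathcal{V}_S$ and $V_B \in \mathcal{V}_S$.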

For the sake of our proof, let us assume that we can find a story between A and B for the given θ, S and bias $\mathcal{B}$. This implies that there exists at least one (shortest) path that connects $V_A$ and $V_B$ in graphs G and $G_S$. Any shortest path in G from A to B of length k can be represented as a sequence of nodes $sn(A, B) = \{V_1, \ldots, V_i, V_{i+1}, \ldots, V_k\}$ where $V_1 = V_A$ and $V_k = V_B$. Note that |sn(A, B)| in $G_S$ provides a lower bound for the length of the shortest story that we seek with our A* search.

In traversing sn(A, B) from A to B, our aim is to decrease $e_A$ (by removing elements) and increase $f_A$ (by adding elements) such that we reach a node that has a sufficient overlap with B. Here, the θ threshold might not allow the required addition/deletion to happen in just one step. In the ideal case, sn(A, B) does not need to contain nodes $V_i \in \mathcal{V}$ that have $f_i < f_A$ or $e_i > e_A$. Using this restriction for all intermediate nodes in sn(A, B), we can define the shortest path sp(A, B) as $sp(A, B) = \{V_1, \ldots, V_i, V_{i+1}, \ldots, V_k \mid f_{i+1} \ge f_i,\ e_{i+1} \le e_i\}$. Thus, for our search for sp(A, B), we can restrict ourselves to the directed graph $G_D = \{\mathcal{V}_D, \mathcal{E}_D\}$. Here, $\mathcal{V}_D = \{V_i \mid V_i \in \mathcal{V},\ V_i = (f_i, e_i),\ f_i \ge f_A,\ e_i \le e_A\}$ and $\mathcal{E}_D = \{E_{i,j} \mid E_{i,j} \in \mathcal{E},\ f_j \ge f_i,\ e_j \le e_i\}$. By defining the graph $G_D$ in such a manner, we have eliminated all nodes and edges from G that would make the heuristic spend extra steps (redescriptions) in gaining unwanted elements and losing elements that belong to B. However, $G_D$ still retains all nodes and edges from G that could be a part of the shortest path. Thus, our search for |sp(A, B)| in G is equivalent to a search for |sp(A, B)| in $G_D$.

If we define the set of nodes $\mathcal{V}_B = \{V_i \mid V_i \in \mathcal{V}_D,\ (V_i, V_B) \in \mathcal{E}_D\}$, it is clear that there is exactly one node in sp(A, B) that is part of $\mathcal{V}_B$. Consider a node $V_i \in \mathcal{V}_D - (\mathcal{V}_B \cup \{V_B\})$, i.e., one that is neither the last (destination) descriptor nor the last but one descriptor in the story. Consider any edge in $G_D$ of the form $(V_i, V_j)$. Without loss of generality, we can assume that $V_j$ is obtained by adding $\delta f_i$ elements to and removing $\delta e_i$ elements from $V_i$. Thus, the Jaccard’s coefficient between $V_i$ and $V_j$ is given by the formula:

\[
J(V_i, V_j) = \frac{f_i + e_i - \delta e_i}{f_i + e_i + \delta f_i}
\]

The maximum value of $\delta e_i$ possible is obtained when $\delta f_i = 0$ and $J(V_i, V_j)$ is as close to θ as possible. In such a case, $\delta e_{i,max}$ can be defined as:

\[
\delta e_{i,max} = \left\lfloor (1 - \theta)(f_i + e_i) \right\rfloor
\]

Similarly, the maximum value of $\delta f_i$ possible is obtained when $\delta e_i = 0$ and $J(V_i, V_j)$ is as close to θ as possible. In such a case, $\delta f_{i,max}$ can be defined as:

\[
\delta f_{i,max} = \left\lfloor \frac{(1 - \theta)(f_i + e_i)}{\theta} \right\rfloor
\]
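Both bounds follow from requiring the Jaccard’s coefficient of the edge to stay at or above θ while one of the two changes is held at zero. A short derivation, using only the expression for J(Vi, Vj) given above:

```latex
% Removals only (\delta f_i = 0):
\frac{f_i + e_i - \delta e_i}{f_i + e_i} \ge \theta
\;\Longleftrightarrow\;
\delta e_i \le (1 - \theta)(f_i + e_i)

% Additions only (\delta e_i = 0):
\frac{f_i + e_i}{f_i + e_i + \delta f_i} \ge \theta
\;\Longleftrightarrow\;
\delta f_i \le \frac{(1 - \theta)(f_i + e_i)}{\theta}
```

Since both quantities count elements, the largest admissible values are the floors of the right-hand sides. For example, for the node (0, 2) with θ = 0.5 these give at most 1 removal and at most 2 additions, the latter being further capped by the number of elements of B still missing.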

For all possible edges $(V_i, V_j)$, the possible values of $\delta f_i$ and $\delta e_i$ lie in the range $\delta f_{i,max} \ge \delta f_i \ge 0$ and $\delta e_{i,max} \ge \delta e_i \ge 0$. It is quite evident that as $\delta f_i$ increases, the corresponding values for $\delta e_i$ such that θ holds become smaller. For each value of $\delta f_i$, there is a maximum value of $\delta e_i$, denoted by $\delta e^{\delta f}_{i,max}$, such that the θ threshold holds. We call an edge from $V_i$ to any one of the combinations $(f_i + \delta f_i,\ e_i - \delta e^{\delta f}_{i,max})$ a critical edge. In Fig. 5.7, the two critical edges for the node (0, 4) are shown by darkened (undirected) lines.

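We can define the critical graph $G_C = \{\mathcal{V}_C, \mathcal{E}_C\}$ where $\mathcal{V}_C = \{V_i \mid V_i \in \mathcal{V}_D\}$ and

\[
\begin{aligned}
\mathcal{E}_C = {} & \{E_{i,j} \mid E_{i,j} = (V_i, V_j),\ V_i \in \mathcal{V}_D - (\mathcal{V}_B \cup \{V_B\}),\ V_j \in \mathcal{V}_D - \{V_B\}, \\
& \quad V_i = (f_i, e_i),\ V_j = (f_j, e_j),\ f_j = f_i + \delta f_i,\ e_j = e_i - \delta e^{\delta f}_{i,max}\} \\
& \cup\ \{E_{i,j} \mid E_{i,j} = (V_i, V_j),\ V_i \in (\mathcal{V}_B \cup \{V_B\}),\ V_j = V_B\}
\end{aligned}
\]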


It is clear from our definition of a critical edge that |sp(A, B)| in $G_D$ (and hence G) is the same as |sp(A, B)| in $G_C$. The critical graph in Fig. 5.7 consists of the nodes highlighted in gray.

Our heuristic in Table 5.2 exhaustively searches for |sp(A, B)| in $G_C$ and hence finds the length of the shortest path between A and B in G. Since $G_S$ is a subgraph of G, the shortest path in G is never longer than the shortest path in $G_S$ that we seek in our storytelling algorithm. Our argument above holds with any other intermediate descriptor in our story as the starting descriptor and B as the ending descriptor. From the discussion above we can conclude that our heuristic is admissible. As an example, the shortest path found between A = (0, 1) and B = (1, 0) by the heuristic is shown by the thick directed lines in Fig. 5.7. The actual path found in the running example is shown in dashed lines. □

The heuristic function defined in Table 5.2 is applicable for simple set descriptors. In case the descriptors are modeled as multisets, we cannot use the same approach. This is because, for instance, elimination of different subsets from a given descriptor, even if they are of the same size, will result in different Jaccard’s coefficients. This implies that we will have to exhaustively search all combinations for removal and addition of elements to determine the theoretically shortest possible story length. To avoid this, for the case of the weighted Jaccard’s coefficient in Eqn. 5.1, we use a simpler heuristic function wherein we estimate the maximum weight we can gain/lose at each step. We add and remove these maximum weight values from our current document weight. Using this idea, we calculate the number of steps required to gain enough of the weight for the final document and lose enough weight from the current document, to reach a Jaccard’s coefficient above the threshold for the final document (observe that this is still admissible but not as good a tracker of the shortest story length).

5.3.4 Data structures

The efficient implementation of our storytelling algorithm hinges on data structures for fast estimation of overlaps. This problem has been studied by the database community in various guises, e.g., similarity search [62] and set joins on similarity predicates [76]. Specific solutions advocate the use of signature trees [47], hierarchical bitmap indices [58], and locality sensitive hashing [26], especially the technique of min-wise independent permutations [10] that is particularly geared toward the Jaccard’s coefficient. In this chapter, we combine an AD-tree data structure [57] with the signature tables [1] approach for efficient similarity search in categorical data.

The signature table is constructed before the Initialization step mentioned in Table 5.1. Here, objects in the universal set are divided into a predefined number of clusters (c) on the basis of their co-occurrence frequencies. This is achieved by first constructing a graph where each object is a node and objects that co-occur form edges. Each edge is associated with a weight which is the inverse of the co-occurrence frequency for that edge. The weight associated with each node is the sum of the weights of each edge it is a part of. The total weight of the graph is the sum of the weight of all nodes. We set the critical weight of the graph to be the total weight divided by c. For finding each cluster, we begin with the nodes that are a part of the edge that currently has the minimum weight associated with it. We delete these nodes from the graph and add them to the current cluster. The weight of the cluster is now the sum of the weights of nodes it contains (as obtained from the original graph). We continue to add nodes to the cluster that have the minimum weight in an edge associated with any of the objects that already are a part of the cluster. This continues till the weight of the cluster is greater than the critical weight. At this point, we recalculate the critical weight on the basis of the nodes remaining in the original graph and proceed to finding the next set of nodes, till all c clusters have been found.

Each descriptor, originally a binary vector of size |O|, can now be condensed into a binary vector (the signature) of size c by encoding a 1 for each cluster that has an object present in the descriptor and a 0 for each cluster that has no object in the descriptor.

All descriptors and their co-occurrence frequencies (used in constructing a decision tree of depth more than 1) are also built into an AD-tree at this stage. The descriptors at the top level of the AD-tree are additionally linked to their signatures. When a similarity search query is issued, only nodes that correspond to signatures of interest need to be investigated. At greater depths in the AD-tree, we can either construct individual signature tables for each node in the AD-tree or we can opt to use traditional AD-tree nodes that contain descriptor names and co-occurrence frequencies. In our implementation, we used traditional AD-tree nodes at depth greater than 1.

Using these data structures, we can reduce the number of descriptors searched against at each step and improve the speed of computation of stories. For instance, in the first call to the function construct_tree, where we are looking for the best match for the class X from among the descriptor set D, we can reduce X to a vector of size c ($X^c$). Also, we keep a count of the number of objects in X that belong to each of the c clusters in the form of a frequency vector $f^c$. The optimistic Jaccard’s coefficient (OJ) between $X^c$ and a signature vector $V_i^c$ corresponding to a set of descriptors can then be calculated by the formula

\[
OJ(X^c, V_i^c) = \frac{\sum_{j=1}^{c} f^c[j]\, X^c[j]\, V_i^c[j]}{\sum_{j=1}^{c} f^c[j]}
\]

We then compare $X^c$ to all the signature vectors and retain only those for which the optimistic Jaccard’s coefficient is above θ. This narrows down our search to only those descriptors that have potential to provide the necessary overlap.
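The pruning step can be illustrated with a small Python sketch. The helper names are hypothetical, and the three clusters used in the usage lines are chosen by hand for illustration rather than produced by the co-occurrence clustering described above.

```python
def signature(members, clusters):
    """Binary signature: 1 for every cluster sharing an object with the set."""
    return [1 if members & cluster else 0 for cluster in clusters]


def cluster_counts(members, clusters):
    """Frequency vector: number of the set's objects in each cluster."""
    return [len(members & cluster) for cluster in clusters]


def optimistic_jaccard(x, v_sig, clusters):
    """Optimistic Jaccard's coefficient between class x and a signature."""
    f = cluster_counts(x, clusters)
    x_sig = signature(x, clusters)
    total = sum(f)
    # Objects of x lying in clusters that the signature also touches.
    hit = sum(fj * xj * vj for fj, xj, vj in zip(f, x_sig, v_sig))
    return hit / total if total else 0.0


# Hand-picked clusters over the universal set of the working example.
clusters = [{"o1", "o2"}, {"o3", "o4"}, {"o5", "o6"}]
x = {"o1", "o3"}                                   # the class S2 - S3
oj_s4 = optimistic_jaccard(x, signature({"o3", "o5"}, clusters), clusters)
oj_s6 = optimistic_jaccard(x, signature({"o6"}, clusters), clusters)
```

When the clusters partition O, the optimistic value can only over-estimate the true Jaccard’s coefficient, because it credits every object of X that falls in a cluster touched by the descriptor. Here S4 survives the filter at θ = 0.5 although its true coefficient with X is 1/3, while S6 is pruned without ever being compared object by object.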

5.3.5 Assessing Significance of Stories

The significance of a story is assessed at the level of each redescription participating in the story. As mentioned in Chap. 2, we assess the significance of redescription $Z_i \Leftrightarrow Z_{i+1}$ using the cumulative hypergeometric distribution [91]: given the marginal occurrence probabilities of $Z_i$ and $Z_{i+1}$ over the object domain, we determine the probability of obtaining, by chance, a rate of co-occurrence at least as high as the one observed. For simple set descriptors, the corresponding equation is reproduced here as:

\[
p(Z_i, Z_{i+1}) = 1 - \sum_{j=0}^{|Z_i \cap Z_{i+1}| - 1} \frac{\binom{|Z_i|}{j} \binom{|O| - |Z_i|}{|Z_{i+1}| - j}}{\binom{|O|}{|Z_{i+1}|}} \tag{5.3}
\]
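Eqn. 5.3 can be evaluated directly with exact integer binomials. The following is a minimal sketch, not our implementation; the function name `redescription_p_value` and its argument names are assumptions made for illustration.

```python
from math import comb


def redescription_p_value(n_objects, size_a, size_b, overlap):
    """Probability of an overlap of at least `overlap` objects by chance (Eqn. 5.3).

    n_objects : |O|, size of the object domain
    size_a    : |Z_i|
    size_b    : |Z_{i+1}|
    overlap   : |Z_i ∩ Z_{i+1}|, the observed overlap
    """
    denom = comb(n_objects, size_b)
    # Cumulative probability of observing strictly fewer than `overlap` common objects.
    below = sum(comb(size_a, j) * comb(n_objects - size_a, size_b - j)
                for j in range(overlap))
    return 1 - below / denom


# Five objects, two descriptors of size 2 that coincide exactly.
print(redescription_p_value(5, 2, 2, 2))  # about 0.1
```

With an observed overlap of 0 the sum is empty and the function returns 1, as expected: any pair of descriptors overlaps in at least zero objects.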

To account for multiple hypothesis testing, the significance threshold is determined by first characterizing the distribution for all descriptors tested. This is achieved by fixing $Z_i$ and randomly sampling from the set of available descriptors. For each such descriptor $Z_k$, we use the formula in Eqn. 5.3 to determine the co-occurrence probability of an overlap of at least $|Z_i \cap Z_k|$. We use the q-value approach [85] suggested for estimating false discovery rate to calculate the adjusted significance of the probability obtained for the redescription between $Z_i$ and $Z_{i+1}$. For all redescriptions considered in our case studies, the q-value threshold was set to 0.01.

For the case of multisets, elements have unequal weights, which implies that Eqn. 5.3 cannot be used for estimating the probability we seek. Therefore, we employ an analogous methodology wherein we randomly generate multisets of size $|Z_{i+1}|$, with each element associated with a weight assigned to it. We calculate the proportion of these randomly generated sets that have an overlap greater than $J_w(Z_i, Z_{i+1})$ with $Z_i$ to estimate the probability $p(Z_i, Z_{i+1})$. This process needs to be repeated for randomly chosen descriptors to obtain the distribution of probabilities required for estimating the q-value.

5.4 Experimental Results

Our three experimental studies are meant to illustrate different aspects of our storytelling algorithm and implementation. The first study characterizes word overlaps in large English dictionaries and illustrates scalability of the implementation and how the different parameter settings affect the quality of stories mined. The second study, involving gene sets in bioinformatics, showcases the constructive induction capabilities of CARTwheels when used for storytelling. This study and the third, which builds stories between PubMed abstracts, also illustrate interesting nuggets of discovered knowledge.

5.4.1 Word Overlaps

In our first study, we implement storytelling for the MorphWord puzzle wherein we are given two words, e.g., PURE and WOOL, and we must morph one into the other by changing only one letter at a time (meaningfully). One solution is: PURE → PORE → POLE → POLL → POOL → WOOL. Here we can think of a word as a set of (letter, position) tuples so that all meaningful English words constitute the descriptors. Each step of this story is an approximate redescription between two four-element sets, having three elements in common. It is important to note that words that are anagrams of each other (e.g., ‘ELVIS’ and ‘LIVES’) will not have a Jaccard’s coefficient of 1, since position is important.
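The word encoding above can be sketched in a few lines. This is only an illustration of the representation; the helper names `word_descriptor` and `jaccard` are assumptions and do not refer to our implementation.

```python
def word_descriptor(word):
    """Represent a word as a set of (letter, position) tuples."""
    return {(letter, pos) for pos, letter in enumerate(word)}


def jaccard(a, b):
    """Jaccard's coefficient |a ∩ b| / |a ∪ b| between two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


story = ["PURE", "PORE", "POLE", "POLL", "POOL", "WOOL"]
steps = list(zip(story, story[1:]))

# Every step of the story shares three of the four (letter, position) elements.
print([len(word_descriptor(u) & word_descriptor(v)) for u, v in steps])  # [3, 3, 3, 3, 3]
print(jaccard(word_descriptor("PURE"), word_descriptor("PORE")))    # 0.6
print(jaccard(word_descriptor("ELVIS"), word_descriptor("LIVES")))  # 0.25
```

The anagram pair shares only the elements (V, 2) and (S, 4), which is why its coefficient stays well below 1.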

We harvested words of length 3 to 13 letters from the Wordox dictionary of English words (http://www.esclub.gr/games/wordox/), yielding more than 160,000 words. Consistent with the MorphWord puzzle, we restrict all CARTs to be of depth d = 1 and study the effect of θ and b on the number of stories possible, length of stories mined, and time taken to mine stories. For ease of interpretation, we recast Jaccard’s thresholds in terms of the number of letters in common (lc) between two words. Although MorphWord is traditionally formulated with lc = 1, we explore higher lc values as well. Due to space restrictions, we present our results on subsets of the master word list, namely 5 letter words (L5) and 10 letter words (L10). In each case, we selected 100,000 pairs of words at random and tried to find stories between them, with different lc and b settings. An example story we mined with five letter words (setting lc = 3) for start and end words BOOTH and FLASH respectively is: BOOTH ⇔ BOATS ⇔ BEAMS ⇔ DEADS ⇔ GRADS ⇔ GRADE ⇔ CRAZE ⇔ CRASH ⇔ FLASH.

Fig. 5.8 (first column) depicts plots of the fraction of stories (out of 100,000) mined with various story lengths as a function of lc, for a branching factor b = 5. In these plots, a story length of 0, rather counter-intuitively, implies that no story was found for the word pair considered. The critical story length where the majority of stories are mined steadily increases as lc is increased. This is because, as lc is increased, more overlap is required at each step of the story, so that it takes longer for one word to morph into another. At the same time, the total number of stories mined decreases as lc is increased, due to the lack of viable redescriptions.

Figure 5.8: (First column) Fraction of stories mined as a function of story length, for different values of lc. (Second column) Average time required to mine stories as a function of story length, for different values of lc. (Top row) Five letter word vocabulary (L5). (Bottom row) Ten letter word vocabulary (L10).

To study the effect of b on the length of stories mined, we focus our attention on lc values of 2 for L5 and 5 for L10. Fig. 5.9 (first column) shows plots of the fraction of stories mined with various lengths as a function of b. As before, a path length of 0 in the plots implies that no story was found for the word pair considered. Here, there are qualitative changes between the two datasets (Fig. 5.9, first column, top and bottom rows).

For L5, the lc value chosen (2) supports a significant number of short stories. This lc value affords a high number of possible redescriptions per descriptor (about 20–100), making it highly likely that the A* algorithm will follow a path that leads close to the target word. In other words, b does not have as significant an impact for this dataset.

For L10, the lc value chosen contributes to a higher probability of longer stories. As a result, the branching factor b plays a crucial role. This is evident in the case of b = 1, where the excessively greedy strategy is often rendered futile. As b increases, the chances of going down toward the target word increase and more stories are mined.

To study the effect of these parameters on the time required to mine stories, we set b = 5 as before for understanding the role of lc. We computed the average time taken to mine a story, for various story lengths, across all pairs of words considered. Fig. 5.8 (second column) shows plots of this average time against story lengths, for different lc values. In these plots, we have normalized the time measurements by calibrating the maximum time to have a value of 1. The actual time taken to find stories ranges from a few seconds to a few minutes (on a 2.3 GHz Apple Xserve G5 computer with 4 GB RAM) depending on the parametric settings as well as the story length. (This holds for the other studies as well.) The plots in Fig. 5.8 (second column) show that the general behavior in the two graphs is again quite similar. There is a near linear increase in time required, with steeper increases for lower lc values. This is because the lower lc values cause an increase in the number of possibilities (within the bound of b = 5) which must be explored before converging on the shortest path. Also observe the higher times for story lengths of 0, indicating that it takes longer to conclude that stories do not exist. Similar linear trends are observed in time versus the role of b (Fig. 5.9, second column). Here, steeper profiles are witnessed for higher b values. Once again, this is due to the increase in the number of possibilities, although as Fig. 5.9, second column (bottom panel) shows, these increases appear to taper off quickly.

These figures clearly indicate the underlying trade-off in mining stories: time versus importance of optimal story lengths.

Fig. 5.10 highlights the advantage of using the heuristic function described in Table 5.2. Here, the top row corresponds to plots for L5 with lc = 3 and b = 10 and the bottom row corresponds to L10 with lc = 5 and b = 10. The plot on the left for L5 shows the behavior of the effective branching factor for both BFS and heuristic search. Here, there is a vast difference in the branching factor initially but this difference tends to decrease as the story length increases. This behavior is primarily because as the story length increases, we are more and more likely to encounter nodes that have very few redescriptions (1 or 2) associated with them. Our choice of lc being so high also ensures that the number of redescriptions is low. The branching factor is not binding in these cases. Correspondingly, the plot on the right shows the time comparison of heuristic search against BFS. It is important to note that the time taken for the heuristic search also includes the time taken to compute the heuristic. This plot shows that even though the effective branching factor becomes quite similar for the two cases as story length increases, there is still a significant difference between the time taken in mining stories, with the heuristic search requiring less time.

Figure 5.9: (First column) Fraction of stories mined as a function of story length, for different values of b. (Second column) Average time required to mine stories as a function of story length, for different values of b. (Top row) Five letter word vocabulary (L5). (Bottom row) Ten letter word vocabulary (L10).

Figure 5.10: Comparison of the behavior of heuristic based search with BFS for (top) L5, with lc = 3 and b = 10 and (bottom) L10, with lc = 5 and b = 10. The plots on the left compare the average value of the effective branching factor for various story lengths. The plots on the right compare the time taken using the two search strategies.

Figure 5.11: A significant story among gene sets from protein modification to hexokinase. [Figure: a chain of four gene sets, {YGL158W, YJL164C, YKL035W} ⇔ {YCL040W, YFR053C, YGL158W, YJL164C, YKL035W} ⇔ {YCL040W, YDR516C, YFR053C, YGL158W, YGR052W, YJL164C} ⇔ {YCL040W, YDR516C, YFR053C}, annotated with the descriptors protein modification, kinase, cell growth and/or maintenance, transferase (transferring P-containing groups), and hexokinase.]

A more pronounced effect of the heuristic function can be seen in the plots from L10. Here, there is a vast difference between the heuristic and BFS for both the effective branching factor and time taken in searching for all story lengths. This is because, for our choice of lc, we find stories that are not very long. This also decreases the chances of encountering nodes with very few redescriptions.

5.4.2 Gene Sets

In our second case study, we mine stories among descriptors defined over gene sets in the budding yeast S. cerevisiae. For this study, we used descriptors defined for the universal sets G1 and G3 in Chap. 2. An example significant story (for G1), between the GO categories protein modification and hexokinase, mined for θ = 0.5, b = 5, and d = 2 is shown in Fig. 5.11. Observe that the second descriptor in the story involves a set intersection performed by CARTwheels. A unifying feature that links the genes in this story is their common role in nutrient control and carbohydrate metabolism, particularly metabolism of glucose-phosphate. Considering the three genes in the first descriptor, YKL035W is involved in the reversible conversion of glucose-1-phosphate to UDP-glucose via UTP; YJL164C is a cAMP dependent kinase and binds both YFL033C (glucose repressed, nutrient control) and YIL033C (glycogen accumulation); and YGL158W is a kinase that binds YGL115W (release from glucose repression). Two new genes enter the story with the first redescription, namely YCL040W (involved in phosphorylation of glucose) and YFR053C (a hexokinase also involved in the phosphorylation of glucose in the glycolysis pathway). In traversing the second redescription, two additional genes appear: YDR516C is involved in phosphorylation of glucose and, most importantly, also binds YCL040W (which is present in earlier redescriptions). YGR052W is a mitochondrial serine/threonine kinase of unknown function. Through the thread of the story we predict that YGR052W may also be involved in an aspect of glucose metabolism and/or nutrient control.

We also conducted a study of the behavior of our storytelling algorithm similar to the one for the words example. For all these studies, we used G3 from Chap. 2 as the universal set of genes and fixed the depth of trees constructed to 2. Fig. 5.12 (Top left) depicts plots of the fraction of stories (out of 100,000) mined with various story lengths as a function of θ, for a branching factor b = 5. As is evident from the plot, the number of stories mined for each path length shows a much steeper drop here than in the words example. Also, even though there is a peak story length for which most stories appear, that value does not change with θ, unlike in the previous study. The primary reason for this behavior is that far fewer descriptors are involved in this study than in the previous one, resulting in far fewer redescriptions.

Figure 5.12: Fraction of stories mined as a function of story length, for different values of θ (Top left) and different values of b (Bottom left); average time required to mine stories as a function of story length, for different values of θ (Top right) and different values of b (Bottom right); for the gene set study.

Figure 5.13: Comparison of the behavior of heuristic based search with BFS for the gene set study, with θ = 0.3 and b = 10. The plot on the left compares the average value of the effective branching factor for various story lengths. The plot on the right compares the time taken using the two search strategies.

For our study with varying values of b, we fixed θ to 0.5. For this value of θ, a majority of the descriptor pairs searched do not yield a story. Thus, in Fig. 5.12 (Bottom left), we have not shown the fraction of descriptor pairs for which no stories were mined. This plot is very similar to the ones seen in the previous example. Once the peak fraction value for a given story length is reached, increasing the b value does not have any effect on the fraction of stories mined.

To study the effect of these parameters on the time required to mine stories, we set b = 5 as before for understanding the role of θ. Fig. 5.12 (Top right) shows plots of the average time against story lengths, for different θ values. There is a near linear increase in time required, with steeper increases for lower θ. This is because the lower θ values cause an increase in the number of possibilities (within the bound of b = 5) which must be explored before converging on the shortest path. Also, as θ is increased, the time required to conclude that no story exists decreases because fewer redescriptions are possible. Similar linear trends are observed in time versus the role of b (Fig. 5.12 (Bottom right)). Here, steeper profiles are witnessed for higher b values. Once again, this is due to the increase in the number of possibilities.

Fig. 5.13 compares heuristic search to BFS for the gene set study with θ = 0.3 and b = 10. The comparison here is very similar to that for L5. As before, the plot on the left shows the effective branching factor for both BFS and heuristic search. The difference between the two plots narrows with story length because of the lack of redescriptions available for nodes. The plot on the right shows the time comparison of heuristic search against BFS. This plot again shows that the difference between the two search strategies is maintained even when the effective branching factors differ very little. Notice that for story length 1, the time taken by heuristic search is more than that of BFS. This is primarily because the time taken to calculate the heuristic value is dominant at such small story lengths.


5.4.3 PubMed Abstracts

For our final case study, we consider the more than 140,000 publications about yeast in the PubMed index and focus on finding stories between publication abstracts. Each abstract is hence a descriptor over terms/keywords. We restrict our CARTs to be of depth 1 and also adopt the weighted Jaccard’s measure as defined in Eqn. 5.1, which is more suited to measuring similarity between bags.

To generate keywords, we focused on the 3756 abstracts containing the keywords ‘yeast AND stress’. For each of these documents, we used the corresponding title and abstract and split the text into words delimited by spaces. We removed completely numeric or special character words as well as a set of stop words from these words. We used Porter’s stemming and manual inspection to group similar words together. We calculated the IDF value for each word (w) that appears in any document as the log2 of the ratio of the total number of documents to the number of documents in which w appears. Next, we computed TFw.IDFw for each word that appears in a document, for each of the documents, and eliminated 95% of these values by setting a threshold of 7 on this value. The words that have TFw.IDFw values above 7 were retained for each document along with their frequency values. This resulted in a total of 6821 unique words in our study.
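The keyword selection described above can be sketched as follows. This is a simplified illustration, not our pipeline: it assumes the documents are already tokenized, stemmed, and stripped of stop words, it takes TF to be the raw count of a word in a document (an assumption), and the name `tfidf_keywords` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_keywords(docs, threshold):
    """Keep, for each document, the words whose TF.IDF value exceeds `threshold`.

    docs : list of documents, each a list of (already cleaned) words
    IDF of a word w is log2(total documents / documents containing w).
    Returns one {word: TF.IDF} dictionary per document.
    """
    n_docs = len(docs)
    doc_freq = Counter(w for words in docs for w in set(words))
    idf = {w: math.log2(n_docs / df) for w, df in doc_freq.items()}
    kept = []
    for words in docs:
        tf = Counter(words)  # raw term counts (assumed definition of TF)
        kept.append({w: tf[w] * idf[w] for w in tf if tf[w] * idf[w] > threshold})
    return kept


# Toy corpus; the threshold of 1.5 is illustrative (the study used 7).
docs = [["heat", "stress", "heat"], ["stress", "cadmium"], ["stress", "heat"], ["stress"]]
print(tfidf_keywords(docs, threshold=1.5))
# [{'heat': 2.0}, {'cadmium': 2.0}, {}, {}]
```

A word that occurs in every document, such as "stress" here, has an IDF of 0 and is therefore never retained.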

For document $Z_i$, the vector $V_{Z_i}[w]$ in the definition of weighted Jaccard’s in Eqn. 5.1 is given by $V_{Z_i}[w]$ = TFw.IDFw. Two examples of significant stories we mined using this function are given in Figs. 5.14 and 5.15 (the PubMed IDs and publication dates are given alongside).

The first story (see Fig. 5.14), mined with θ = 0.2, b = 5, begins with a high throughput experiment that links chemical stress to gene expression in Saccharomyces cerevisiae, and ends with heat stress transcription factors in tomato. The ‘story line’ was initiated through comparisons between oxidative and heavy metal stresses. This led to a paper identifying a gene from Candida sp. that was expressed when the cells are exposed to cadmium but not copper, mercury, lead or manganese. Interestingly, a BLAST search for the encoded protein sequence indicates that the protein is novel. The link between tomato heat stress transcription factors and a cadmium-specific gene with no known match in the current databases was through work with the fission yeast Schizosaccharomyces pombe where a study looked specifically at heat and cadmium stress responses. This story hence illustrates the key players in the systems biology of related chemical stresses.

For our next example, we mined for stories that follow a strict sequential order of publication year. One example here is shown in Fig. 5.15, with settings θ = 0.4, b = 5. The featured compound in this story, cAMP (cyclic adenosine monophosphate), is a signaling molecule found in all forms of life. In yeast it serves as a relay for glucose levels, effecting distinct responses in accord with nutrient need and availability. This ‘backwards in time’ story starts with a paper that addresses specific changes in gene transcription modulated by cAMP levels in Schizosaccharomyces pombe. It was connected with a paper that also addressed the relationship between nutrients and cAMP, but with a different yeast (Saccharomyces cerevisiae) and with a different emphasis (partners upstream and downstream of where cAMP intersects the pathway). The third paper in the story describes the relationship between a serine/threonine protein kinase (Snf1) and nutrient levels, and how it is related to AMP concentrations (the degradation product of cAMP), while the fourth paper links catalase gene expression to cAMP. Together these four papers provide a continual story line on how yeast responds to changes in nutrient levels. Interestingly, Paper #4 has been cited 82 times (as of March 2006), Paper #3 114 times, and Paper #2 182 times (Paper #1 is too new to be heavily cited). However, the only connection between them in the citation indices is that Paper #2 has referenced Paper #4.

Early expression of yeast genes affected by chemical stress (PMID:15713640; 2005–02–16)
⇓
Glutathione, but not transcription factor Yap1, is required for carbon source-dependent resistance to oxidative stress in Saccharomyces cerevisiae (PMID:10794174; 2000–06–22)
⇓
The role of glutathione in yeast dehydration tolerance (PMID:14697735; 2003–12–30)
⇓
The adaptive response of Saccharomyces cerevisiae to mercury exposure (PMID:11816031; 2002–01–29)
⇓
Cloning, characterization, and expression of the CIP2 gene induced under cadmium stress in Candida sp. (PMID:9627968; 1998–07–09)
⇓
Mutations in the Schizosaccharomyces pombe heat shock factor that differentially affect responses to heat and cadmium stress (PMID:10071222; 1999–04–07)
⇓
Genome-wide analysis of the biology of stress responses through heat shock transcription factor (PMID:15169889; 2004–05–31)
⇓
Isolation and characterization of HsfA3, a new heat stress transcription factor of Lycopersicon peruvianum (PMID:10849352; 2000–10–10)
⇓
Heat stress transcription factors from tomato can functionally replace HSF1 in the yeast Saccharomyces cerevisiae (PMID:9268023; 1997–09–18)

Figure 5.14: An example significant story among PubMed abstracts relating chemical stresses.

Suppressors of an adenylate cyclase deletion in the fission yeast Schizosaccharomyces pombe (PMID:15189983; 2004–06–10)
⇓
Novel sensing mechanisms and targets for the cAMP-protein kinase A pathway in Saccharomyces cerevisiae (PMID:10476026; 1999–10–21)
⇓
Deletion of SNF1 affects the nutrient response of yeast and resembles mutations which activate the adenylate cyclase pathway (PMID:1752415; 1992–01–24)
⇓
Negative regulation of transcription of the Saccharomyces cerevisiae catalase T (CTT1) gene by cAMP is mediated by a positive control element (PMID:1848176; 1991–04–15)

Figure 5.15: An example significant story among PubMed abstracts around cAMP levels.

As in the other studies, we studied the effect of θ and b on the length of stories mined as well as the time taken to mine stories. The results are shown in Fig. 5.16. For studies with varying θ, we fixed the b value to 5 and for varying b, we fixed θ to 0.2. The results shown follow the same kind of trends as seen in the previous two studies. It is important to note here that the behavior of our algorithm does not change even though the heuristic that we use in this case is different from the one for the previous two cases.

Fig. 5.17 compares heuristic search to BFS for PubMed documents with θ = 0.05 and b = 5. The comparison more or less mirrors the earlier studies. The plot on the left shows the effective branching factor for both BFS and heuristic search. The difference between the two plots narrows with story length because of the lack of redescriptions available for nodes. The plot on the right shows the time comparison of heuristic search against BFS. This plot again shows that the difference between the two search strategies is maintained even when the effective branching factors differ very little. For story length 1, the time taken by heuristic search is again more than that of BFS because the time taken to calculate the heuristic value is dominant at such small time values.

5.5 Related Research

We briefly survey related research in three categories: storytelling in information visualization, approaches for topic tracking in documents, and link mining.

In the first category, storytelling has been viewed, not in a data mining context, but as an information organization tool based on narrative structures from real life. Kuchinsky et al. [36] propose an interactive approach for biological information management using three constructs: items, collections (of items), and stories. A ‘story editor’ is used to form an outline of the story using a template. The players (items and itemsets) are then used to fill in the template manually to complete the story.

Pertinent work in the topic tracking community, e.g., [28], focuses on post-processing search results into storylines by analyzing bipartite graphs of document-term relationships. Here a story is a thread of related documents with temporal as well as semantic coherence. Although similar to our PubMed abstracts case study, these works are focused on unsupervised discovery of all threads whereas we focus on directed storylines between given start and end points. Furthermore, as shown in our Gene sets case study, we allow arbitrary set constructions for the purpose of positing overlaps by casting stories as a generalization of redescriptions. The definition of a ‘thread’ is also different in this work and relies on the notion of node-disjoint directed paths.

Figure 5.16: Fraction of stories mined as a function of story length, for different values of θ (Top left) and different values of b (Bottom left); average time required to mine stories as a function of story length, for different values of θ (Top right) and different values of b (Bottom right); for PubMed documents.

Figure 5.17: Comparison of the behavior of heuristic based search with BFS for PubMed documents with θ = 0.05 and b = 5. The plot on the left compares the average value of the effective branching factor for various story lengths. The plot on the right compares the time taken using the two search strategies.

Link mining [24] begins with data that can be modeled as a collection of links and, in this sense, storytelling can be approached as a problem of analyzing overlap relationships. However, the links used and sought by us are between sets of items rather than individual items, and these sets are not enumerated beforehand. The concept of stories is also inherently similar in spirit to relational knowledge discovery, e.g., [63], but observe that our vocabularies are primarily propositional in nature, and defined over a single domain of objects. In future work, we aim to generalize storytelling into relational redescription mining where the stories can straddle different domains and employ relationships for navigating across domains.

Finally, the applications presented here suggest comparisons to classical discovery systems such as Swanson’s Arrowsmith [90], which can be viewed as seeking stories of length two. Our stories can be of arbitrary lengths with differing complexities of the participating descriptors.

5.6 Discussion

By defining stories as chains of redescriptions, we have been able to design a storytelling algorithm as A* search around the outputs of a redescription mining algorithm. We have demonstrated the scalability of this approach using the Word overlaps case study and showcased its potential for knowledge discovery using the Gene sets and PubMed abstracts case studies.

In future work, we aim to investigate other metrics for evaluating stories besides story length, e.g., based on the number of objects temporarily brought into the story, the story’s conformance to prior background knowledge, or using overlap metrics that better mirror a domain scientist’s conception of set similarity. We also aim to explore connections to works that characterize the structure of partitions [51, 82] and investigate whether storylines can be designed around paths in such discrete structures. We also intend to generalize from propositional to predicate vocabularies and cast storytelling in the context of relational redescriptions. This will help provide structured stories that follow a template of connections. Our eventual goal is to establish storytelling as an important tool for reasoning with data and domain theories.


Chapter 6

Systematizing the Exploration of Redescriptions in CARTwheels

The CARTwheels approach described in Chap. 2 serves as a good starting point for our foray into the space of possible redescriptions. However, the approach is plagued by two different types of redundancies that we discuss at the beginning of this chapter. To outline our approach to reduce redundancy in CARTwheels, we introduce the notion of partition space and relate it to the trees we construct in CARTwheels. Next, we outline the reasons for moving into the partition space representation for redescriptions. We look at various strategies for traversing the space of partitions and discuss an approach we have designed to cover the space of partitions possible in CARTwheels in an efficient manner. This approach deals with and reduces both types of redundancies discussed in this chapter.

6.1 Sources of Redundancies in CARTwheels

In the CARTwheels approach, an important parameter that governs the alternation process is the number of times a particular descriptor is allowed to be involved in a tree as a part of the derived descriptors that form the class assignments for objects. This parameter, called ρ, is set to a specified value before we start the CARTwheels alternation process. Another parameter, η, is used to define the stopping criterion for our alternation process. The algorithm stops when it has had η consecutive alternations without finding a redescription.

Fig. 6.1 (derived from the results shown in Chap. 2) shows the number of unique redescriptions mined as a function of the precision obtained by the CARTwheels algorithm for different values of ρ. The value for ρ decreases from 5 to 1 as we move along the x-axis. This plot clearly demonstrates the precision-recall trade-off that we need to consider in the operation of our algorithm. If we desire greater numbers of unique redescriptions, we are bound to achieve that only at a cost of lower precision, which would imply more repetitions of redescriptions mined.

The performance of the CARTwheels algorithm can be improved by (a) reducing the size of the space of redescriptions to only those that are essential, and (b) defining an efficient approach for mining such redescriptions. A critical step in this approach would be to carefully define a set of essential redescriptions such that (a) all unique redescriptions are a part of this set in some form, and (b) the number of repeated redescriptions is minimized. In the next few sections we define such a set of essential redescriptions and outline a method to mine them.

Figure 6.1: Precision for redescriptions mined vs. number of unique redescriptions mined.

Figure 6.2: Various examples of partitions of O. (a) is a 3-block partition of O represented as {o1o3, o2o4, o5}. (b) is the O⊤ partition for O. (c) is the partition O⊥.

6.2 Partition Space

Consider the set of objects O = {o1, o2, o3, o4, o5} defined in Chap. 2. Let $\Pi_O$ represent the set of all partitions of O. A partition $\pi \in \Pi_O$ is made up of several pair-wise disjoint blocks that span O, i.e., $\pi = \{\pi^1, \pi^2, \ldots, \pi^m\}$ where $\pi^i \subseteq O$, $\pi^i \cap \pi^j = \emptyset$ (for $i \neq j$), and $\bigcup_i \pi^i = O$. The one-block partition is represented by O⊥. The partition containing |O| blocks is represented by O⊤. An example of a partition of O is shown in Fig. 6.2(a) where each indivisible rectangle in the figure represents a block. Fig. 6.2(b) and (c) show O⊥ and O⊤ for our choice of O. Given any two blocks $\pi_a^i$ and $\pi_b^j$ (from the same or different partitions), we say that block $\pi_a^i$ splits or shatters block $\pi_b^j$ iff $\pi_a^i \cap \pi_b^j \subset \pi_b^j$, or equivalently iff $\pi_b^j \setminus \pi_a^i \neq \emptyset$. For example the block {o1, o5} shatters {o1, o4, o5} but the opposite is not true. Note that when there is no ambiguity we write the set {o1, o4, o5} as o1o4o5. If $\pi_a, \pi_b \in \Pi_O$, we say that $\pi_a$ is a sub-partition of $\pi_b$, denoted $\pi_a \leq \pi_b$, if every block of $\pi_a$ is contained in a block of $\pi_b$ or equivalently if no block of $\pi_b$ shatters a block of $\pi_a$. For example {o1o2, o3o5, o4} ≤ {o1o2o4, o3o5}.
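The two relations just defined can be checked mechanically. The sketch below is illustrative only: blocks are represented as Python frozensets and partitions as lists of blocks, and the names `shatters`, `is_subpartition`, and `block` are assumptions introduced for the example. The sub-partition test uses the "every block is contained in a block" formulation.

```python
def block(*objects):
    """Build a block (an immutable set of object names)."""
    return frozenset(objects)


def shatters(block_a, block_b):
    """True iff block_a ∩ block_b is a proper subset of block_b, as defined in the text."""
    return (block_a & block_b) < block_b


def is_subpartition(pi_a, pi_b):
    """True iff every block of pi_a is contained in some block of pi_b."""
    return all(any(a <= b for b in pi_b) for a in pi_a)


print(shatters(block("o1", "o5"), block("o1", "o4", "o5")))  # True
print(shatters(block("o1", "o4", "o5"), block("o1", "o5")))  # False

pi_a = [block("o1", "o2"), block("o3", "o5"), block("o4")]
pi_b = [block("o1", "o2", "o4"), block("o3", "o5")]
print(is_subpartition(pi_a, pi_b))  # True
print(is_subpartition(pi_b, pi_a))  # False
```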

Given two partitions $\pi_a$ and $\pi_b$, define $\pi_a \cap \pi_b = \{\pi_a^i \cap \pi_b^j \neq \emptyset \mid \pi_a^i \in \pi_a, \pi_b^j \in \pi_b\}$. For example, $\{o_1o_2, o_3o_5, o_4\} \cap \{o_1o_2o_5, o_3o_4\} = \{o_1o_2, o_3, o_4, o_5\}$. For a partition $\pi_a$, define an equivalence relation $\gamma$ on its blocks with respect to (w.r.t.) $\pi_b$ as follows: 1) $(\pi_a^i, \pi_a^i) \in \gamma$ for all $\pi_a^i$, and 2) $(\pi_a^i, \pi_a^j) \in \gamma$ (with $i \neq j$) if there exists $\pi_b^k \in \pi_b$ such that both $\pi_a^i$ and $\pi_a^j$ shatter $\pi_b^k$. Let $[\pi_a^i]_\gamma$ be the equivalence class of $\pi_a^i$ w.r.t. $\gamma$, and let $\Gamma_a^b = \{[\pi_a^i]_\gamma\}$ be the partition over the blocks of $\pi_a$ induced by $\gamma$. For example, with $\pi_a = \{o_1, o_2, o_3, o_4o_5\}$ and $\pi_b = \{o_1, o_2, o_3o_4, o_5\}$, $\Gamma_a^b = \{\{o_1\}, \{o_2\}, \{o_3, o_4o_5\}\}$. Note that $o_3$ and $o_4o_5$ belong to the same class since they both shatter $o_3o_4 \in \pi_b$. Define $\pi_a \cup \pi_b = \{\bigcup_{\pi^j \in [\pi_a^i]_\gamma} \pi^j \mid \pi_a^i \in \pi_a\}$. For example, continuing the example above, $\pi_a \cup \pi_b = \{o_1, o_2, o_3o_4o_5\}$.

The sub-partition relation $\leq$ forms a lattice over the set of all partitions, where the join ($\vee$) of any two partitions is defined as $\pi_a \vee \pi_b = \pi_a \cap \pi_b$, and the meet ($\wedge$) of any two partitions is defined as $\pi_a \wedge \pi_b = \pi_a \cup \pi_b$.

Figure 6.3: A part of the lattice induced by partitions of $O$. The partitions at the same level have the same number of blocks.
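The two operations on partitions can be sketched as follows. This is an illustrative implementation, not the dissertation's code: the function names are hypothetical, and the union is computed by merging blocks that overlap (directly or through a chain of overlaps), which is assumed here to give the same result as the equivalence classes of $\gamma$ on the examples in the text.

```python
from typing import FrozenSet, List, Set

Partition = FrozenSet[FrozenSet[str]]


def make_partition(*blocks: str) -> Partition:
    """Build a partition from strings such as "o1 o3", one string per block."""
    return frozenset(frozenset(b.split()) for b in blocks)


def partition_intersection(pa: Partition, pb: Partition) -> Partition:
    """All non-empty pairwise intersections of a block of pa with a block of pb."""
    return frozenset(a & b for a in pa for b in pb if a & b)


def partition_union(pa: Partition, pb: Partition) -> Partition:
    """Merge blocks of pa and pb that are linked by a chain of overlaps."""
    merged: List[Set[str]] = []
    for block in list(pa) + list(pb):
        current = set(block)
        untouched = []
        for group in merged:
            if group & current:
                current |= group      # absorb every group that overlaps the new block
            else:
                untouched.append(group)
        merged = untouched + [current]
    return frozenset(frozenset(g) for g in merged)


inter = partition_intersection(
    make_partition("o1 o2", "o3 o5", "o4"),
    make_partition("o1 o2 o5", "o3 o4"),
)
union = partition_union(
    make_partition("o1", "o2", "o3", "o4 o5"),
    make_partition("o1", "o2", "o3 o4", "o5"),
)
print(sorted(sorted(b) for b in inter))
print(sorted(sorted(b) for b in union))
```

Both calls reproduce the worked examples given in the text for $\pi_a \cap \pi_b$ and $\pi_a \cup \pi_b$.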

Fig. 6.3 shows a part of the lattice induced by partitions of $O$. The partitions shown at the same level contain the same number of blocks. The connections in the lattice denote the sub-partition relationship. For any connection in the lattice, the partition at the bottom is a sub-partition of the partition at the top.

6.3 From CARTs to Partitions

Every decision tree (CART) we construct in the CARTwheels alternation process can be directly mapped to a partition. The partition induced by a tree is determined by the descriptors used in constructing the tree as well as the paths that are fused together in disjunctions. As an example, consider the two trees shown in Fig. 6.5. Here, we use the definition of the X and Y descriptors (reproduced in Fig. 6.4) used in the running example in Chap. 2. Both the trees shown in Fig. 6.5(a) and Fig. 6.5(b) use the same set of descriptors but differ in which paths are combined in disjunctions.

$X_1 = \{o_2, o_3\}$        $Y_1 = \{o_1, o_2\}$
$X_2 = \{o_3, o_4\}$        $Y_2 = \{o_2, o_3, o_4\}$
$X_3 = \{o_2, o_4\}$        $Y_3 = \{o_3, o_5\}$
$X_4 = \{o_1, o_5\}$        $Y_4 = \{o_1, o_2, o_5\}$

Figure 6.4: X and Y descriptors used to explain mappings between trees and partitions.

[Figure 6.5: three decision trees, panels (a), (b), and (c); (a) and (b) use nodes $X_1$, $X_2$, $X_3$; (c) uses nodes $Y_4$, $Y_2$.]

Figure 6.5: Modeling partitions induced by trees. (a) and (b) are two trees that use the same descriptors in the same nodes but induce different partitions, because of differences in how paths are grouped; (c) the tree uses different and fewer descriptors than the tree in (a) but induces the same partition.

The tree shown in Fig. 6.5(a) can be used to define three derived descriptors:

• $X_1 \wedge X_3 = \{o_2\}$ (shown in red)

• $(X_1 \wedge \neg X_3) \vee (X_2 \wedge \neg X_1) = \{o_3, o_4\}$ (shown in green)

• $(\neg X_1 \wedge \neg X_2) = \{o_1, o_5\}$ (shown in blue)

Each of these derived descriptors can be thought of as a block in the partition of $O$ defined by the tree in Fig. 6.5(a). Thus, the tree corresponds to the partition $\{o_1o_5, o_2, o_3o_4\}$.
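The mapping from a tree to a partition can be sketched in a few lines. The sketch below assumes the tree of Fig. 6.5(a) has $X_1$ at the root, $X_3$ on its yes-branch, and $X_2$ on its no-branch (a layout inferred from the caption of Fig. 6.6, not stated explicitly); the path labels `p1` to `p4` and the function names are hypothetical.

```python
O = frozenset({"o1", "o2", "o3", "o4", "o5"})
X = {
    "X1": frozenset({"o2", "o3"}),
    "X2": frozenset({"o3", "o4"}),
    "X3": frozenset({"o2", "o4"}),
    "X4": frozenset({"o1", "o5"}),
}

# Root-to-leaf paths as (descriptor, required truth value) pairs.
# Assumed layout: X1 at the root, X3 below "yes", X2 below "no".
paths = {
    "p1": [("X1", True), ("X3", True)],
    "p2": [("X1", True), ("X3", False)],
    "p3": [("X1", False), ("X2", True)],
    "p4": [("X1", False), ("X2", False)],
}


def path_extension(path):
    """Objects that satisfy every test along the path."""
    objs = set(O)
    for name, value in path:
        objs &= X[name] if value else (O - X[name])
    return frozenset(objs)


def induced_partition(grouping):
    """Each group of paths is a disjunction; its block is the union of the path extensions."""
    blocks = set()
    for group in grouping:
        block = frozenset().union(*(path_extension(paths[p]) for p in group))
        if block:
            blocks.add(block)
    return frozenset(blocks)


partition_a = induced_partition([["p1"], ["p2", "p3"], ["p4"]])   # grouping of Fig. 6.5(a)
partition_b = induced_partition([["p1", "p3"], ["p2"], ["p4"]])   # a different grouping
print(sorted(sorted(b) for b in partition_a))
print(sorted(sorted(b) for b in partition_b))
```

Changing only the grouping of the same four paths changes the induced partition, which is the point the text makes about Fig. 6.5(a) and Fig. 6.5(b).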

In Fig. 6.5(b), we show another tree with the same descriptors as in Fig. 6.5(a) but a different grouping of the paths. This tree corresponds to the partition $\{o_1o_5, o_2o_4, o_3\}$. Thus, with different groupings of paths, the same descriptors could be used to define different partitions.

It is important to mention here that the number of paths (4) induced by the trees in Fig. 6.5(a) and Fig. 6.5(b) need not be the same as the number of blocks in the partitions defined by the trees. Observe that in Fig. 6.5(a), four paths map to a 3-block partition. In general, a tree of depth $d$ could correspond to a partition with $b$ blocks, where $1 \leq b \leq 2^d$, depending on the number of descriptors used and the grouping of the paths. We will use this idea later in defining an efficient exploration for CARTwheels.

Fig. 6.5(c) shows another tree constructed using Y descriptors that maps to the same partition as the tree in Fig. 6.5(a). Notice that the Y tree uses fewer descriptors and hence contains fewer paths than the X tree.


[Figure 6.6: two decision trees, panels (a) and (b); (a) uses nodes $X_1$, $X_2$, $X_3$ with leaves $o_2$, $o_3$, $o_4$, and $o_1, o_5$; (b) uses nodes $X_1$, $X_2$ with leaves $o_2, o_3$, $o_4$, and $o_1, o_5$.]

Figure 6.6: The tree in (a) can be logically reduced to the tree in (b). This is because the expression $(X_1 \wedge X_3) \vee (X_1 \wedge \neg X_3)$ is equivalent to $X_1$. Hence, both trees induce the same partition. For our purpose, we do not consider trees like (a) and reduce them to forms such as (b).

Fig. 6.6 shows another pair of trees that are related to each other. The tree in Fig. 6.6(a) contains two paths (marked in green) that combine to form a disjunction which is logically reducible. Specifically, $(X_1 \wedge X_3) \vee (X_1 \wedge \neg X_3)$ is equivalent to $X_1$. Therefore, the tree in Fig. 6.6(a) induces the same partition as the tree in Fig. 6.6(b), which uses one fewer descriptor and contains one fewer path. Even though the trees shown in Fig. 6.6 differ in construction, we will consider them equivalent since one is a logically reduced form of the other. For the purpose of our analysis and CARTwheels exploration, we will not consider trees such as the one shown in Fig. 6.6(a) but only its logically reduced form.

For ease of further discussion, we define the sets of all trees of depth $d$ possible using X and Y descriptors as $\tau_{X,d}$ and $\tau_{Y,d}$ respectively. Each tree in these sets, along with information about path disjunctions, corresponds to a partition in the corresponding partition space.

6.4 Redescriptions in Partition Space

In CARTwheels, the decision trees that we can construct are restricted by the descriptors available to us (as X and Y) and the maximum depth $d$ of trees allowed. Using these restrictions, we can define the notion of realizable partitions as follows:

Realizable partition: Any partition in $\Pi_O$ that can be constructed using a subset of the given set of descriptors (X or Y) in a decision tree of depth $d$ is a realizable partition. We use $\Pi_{O,X,d}$ to denote the set of all realizable partitions using descriptor set X with decision trees of maximum depth $d$. The corresponding set for Y is $\Pi_{O,Y,d}$.

As may be recalled, any redescription always occurs in a pair with its negation. Thus, a redescription can be thought of as two realizable partitions with two blocks each such that there is sufficient overlap (defined by $\theta$) between two unique pairs of blocks in the partitions. We can extend this idea to partitions with more than two blocks by using the redescription operator defined as:

Redescription operator: The redescription operator $R$ between two partitions is defined using two sets of realizable partitions (like $\Pi_{O,X,d}$, $\Pi_{O,Y,d}$) as $R : R(\pi_i) = \pi_j$. Here, $\pi_i \in \Pi_{O,X,d}$ and $\pi_j \in \Pi_{O,Y,d}$, the number of blocks in $\pi_i$ equals the number of blocks in $\pi_j$, and for each block $\pi_i^k$, there exists a unique block $\pi_j^k$ such that $\mathcal{J}(\pi_i^k, \pi_j^k) \geq \theta$.

As is obvious from the definition, if $R(\pi_i) = \pi_j$, then $R(\pi_j) = \pi_i$. Also, if $\pi_i = \pi_j$, then $R(\pi_i) = \pi_j$ for any $\theta$ threshold. Notice that the redescription operator $R$ is not a function but a relation; for every $\pi_i$ there could be more than one $\pi_j$ for which the $R$ relation holds.

Another notion that is closely related to the redescription operator is the redescribable partition pair:

Redescribable partition pair: Given $\Pi_{O,X,d}$, $\Pi_{O,Y,d}$ and a Jaccard's coefficient threshold $\theta$, a redescribable partition pair is a partition tuple $(\pi_i, \pi_j)$ where $\pi_i \in \Pi_{O,X,d}$, $\pi_j \in \Pi_{O,Y,d}$, such that $R(\pi_i) = \pi_j$.
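A small sketch of how the relation $R$ could be tested for two given partitions follows. It assumes that "a unique block" means a one-to-one matching between the blocks of the two partitions, and it finds such a matching by brute force over permutations, which is only sensible for the small block counts ($\leq 2^d$) considered here. The names `jaccard` and `redescribes` are hypothetical.

```python
from fractions import Fraction
from itertools import permutations


def jaccard(a, b):
    """Jaccard's coefficient |a & b| / |a | b|, as an exact fraction."""
    union = a | b
    return Fraction(len(a & b), len(union)) if union else Fraction(1)


def redescribes(pi, pj, theta):
    """True if the blocks of pi and pj can be paired one-to-one with Jaccard >= theta."""
    blocks_i, blocks_j = list(pi), list(pj)
    if len(blocks_i) != len(blocks_j):
        return False
    return any(
        all(jaccard(a, b) >= theta for a, b in zip(blocks_i, perm))
        for perm in permutations(blocks_j)
    )


def P(*blocks):
    """Shorthand: P("o1 o5", "o2") builds a partition from space-separated blocks."""
    return frozenset(frozenset(b.split()) for b in blocks)


pi_x = P("o1 o5", "o2", "o3 o4")
pi_y = P("o1 o5", "o2", "o3 o4")
pi_other = P("o1 o5", "o2 o4", "o3")
print(redescribes(pi_x, pi_y, 1))
print(redescribes(pi_x, pi_other, 1), redescribes(pi_x, pi_other, Fraction(1, 2)))
```

Identical partitions are related for any threshold, while the partition with a different grouping is only related to `pi_x` once the threshold is relaxed.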

An interesting property of redescribable partition pairs viewed in the partition space is stated in the following theorem:

Theorem 6.4.1 For a redescribable partition pair $(\pi_i, \pi_j)$, for any two blocks in each partition $\pi_i^k, \pi_i^l \in \pi_i$ and $\pi_j^k, \pi_j^l \in \pi_j$, if $\pi_i^k \Leftrightarrow \pi_j^k$ and $\pi_i^l \Leftrightarrow \pi_j^l$ for a given $\theta$, then the redescription $\pi_i^k \cup \pi_i^l \Leftrightarrow \pi_j^k \cup \pi_j^l$ holds with Jaccard's coefficient $\geq \theta$.

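Proof: The Jaccard's coefficient for the redescription $\pi_i^k \cup \pi_i^l \Leftrightarrow \pi_j^k \cup \pi_j^l$ is given by the formula:

\[
\mathcal{J}\big((\pi_i^k \cup \pi_i^l), (\pi_j^k \cup \pi_j^l)\big) = \frac{|(\pi_i^k \cup \pi_i^l) \cap (\pi_j^k \cup \pi_j^l)|}{|(\pi_i^k \cup \pi_i^l) \cup (\pi_j^k \cup \pi_j^l)|} \tag{6.1}
\]

Since $\pi_i$ and $\pi_j$ are partitions, $|\pi_i^k \cap \pi_i^l| = 0$ and $|\pi_j^k \cap \pi_j^l| = 0$. Therefore, we can write the two terms in the Jaccard's coefficient calculation as follows.

\[
\begin{aligned}
|(\pi_i^k \cup \pi_i^l) \cap (\pi_j^k \cup \pi_j^l)| &= |\pi_i^k \cap \pi_j^k| + |\pi_i^k \cap \pi_j^l| + |\pi_i^l \cap \pi_j^k| + |\pi_i^l \cap \pi_j^l| \\
|(\pi_i^k \cup \pi_i^l) \cup (\pi_j^k \cup \pi_j^l)| &= |\pi_i^k| + |\pi_i^l| + |\pi_j^k| + |\pi_j^l| - |(\pi_i^k \cup \pi_i^l) \cap (\pi_j^k \cup \pi_j^l)|
\end{aligned}
\]

As is evident from Eqn. 6.1, the Jaccard's coefficient value increases with $|\pi_i^k \cap \pi_j^l|$ and $|\pi_i^l \cap \pi_j^k|$. Thus, only changing these two numbers, the minimum value of the Jaccard's coefficient is obtained when we set $|\pi_i^k \cap \pi_j^l| = 0$ and $|\pi_i^l \cap \pi_j^k| = 0$. Furthermore, the minimum values possible for $|\pi_i^k \cap \pi_j^k|$ and $|\pi_i^l \cap \pi_j^l|$ given that $\pi_i^k \Leftrightarrow \pi_j^k$ and $\pi_i^l \Leftrightarrow \pi_j^l$ are:

\[
\begin{aligned}
|\pi_i^k \cap \pi_j^k|_{min} &= \frac{\theta}{1 + \theta}\,\big(|\pi_i^k| + |\pi_j^k|\big) \\
|\pi_i^l \cap \pi_j^l|_{min} &= \frac{\theta}{1 + \theta}\,\big(|\pi_i^l| + |\pi_j^l|\big)
\end{aligned}
\]

Thus, the minimum value possible for the Jaccard's coefficient in such a scenario is given by:

\[
\mathcal{J}\big((\pi_i^k \cup \pi_i^l), (\pi_j^k \cup \pi_j^l)\big)_{min} = \frac{\frac{\theta}{1+\theta}\,\big(|\pi_i^k| + |\pi_j^k| + |\pi_i^l| + |\pi_j^l|\big)}{\big(1 - \frac{\theta}{1+\theta}\big)\,\big(|\pi_i^k| + |\pi_j^k| + |\pi_i^l| + |\pi_j^l|\big)}
\]

This reduces to $\mathcal{J}((\pi_i^k \cup \pi_i^l), (\pi_j^k \cup \pi_j^l))_{min} = \theta$. Therefore, the redescription $\pi_i^k \cup \pi_i^l \Leftrightarrow \pi_j^k \cup \pi_j^l$ always holds. □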
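As an informal sanity check of Theorem 6.4.1 (not a substitute for the proof), the following sketch draws random disjoint block pairs and confirms that fusing them never drops the Jaccard's coefficient below the smaller of the two block-level coefficients. The universe size, trial count, and seed are arbitrary choices, and all function names are hypothetical.

```python
import random
from fractions import Fraction


def jaccard(a, b):
    """Jaccard's coefficient as an exact fraction (both sets assumed non-empty)."""
    return Fraction(len(a & b), len(a | b))


def random_disjoint_pair(rng, universe):
    """Assign each object to block k, block l, or neither; k and l are disjoint by construction."""
    labels = {o: rng.choice("kl-") for o in universe}
    k = frozenset(o for o in universe if labels[o] == "k")
    l = frozenset(o for o in universe if labels[o] == "l")
    return k, l


def check_theorem(trials=2000, n=8, seed=0):
    """Return True if no random instance violates the bound of Theorem 6.4.1."""
    rng = random.Random(seed)
    universe = list(range(n))
    for _ in range(trials):
        ik, il = random_disjoint_pair(rng, universe)   # two blocks of pi_i
        jk, jl = random_disjoint_pair(rng, universe)   # two blocks of pi_j
        if not (ik and il and jk and jl):
            continue
        # Largest theta for which both block-level redescriptions hold.
        theta = min(jaccard(ik, jk), jaccard(il, jl))
        if jaccard(ik | il, jk | jl) < theta:
            return False
    return True


print(check_theorem())
```

Exact fractions are used instead of floats so that the comparison against $\theta$ is not affected by rounding.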

As we will see later, Theorem 6.4.1 is required for us to be able to define the set of essential partitions.


Figure 6.7: The first two trees constructed using (a) Y descriptors and (b) X descriptors in our running example used for the partitions-based analysis.

6.4.1 CARTwheels in Partition Space

In CARTwheels, we are provided with descriptor sets X and Y and the depth $d$ of the decision trees to be constructed. Consider the spaces (lattices) of realizable partitions $\Pi_{O,X,d}$ and $\Pi_{O,Y,d}$. At each alternation in CARTwheels, we switch between these lattices while looking for redescribable partition pairs. From the partitions viewpoint, the essence of CARTwheels is the following lemma:

Lemma 2 If $\pi_1 = \{\pi_1^1, \pi_1^2, \cdots, \pi_1^k\}$ and $\pi_2 = \{\pi_2^1, \pi_2^2, \cdots, \pi_2^k\}$ are partitions of $O$ (containing the same number $k$ of blocks), and $\pi_1^1 \subseteq \pi_2^1$, $\pi_1^2 \subseteq \pi_2^2$, $\cdots$, and $\pi_1^k \subseteq \pi_2^k$, then $\pi_1^1 = \pi_2^1$, $\pi_1^2 = \pi_2^2$, $\cdots$, and $\pi_1^k = \pi_2^k$.

Proof: Assume the contrary, i.e., that there exists an index $l$ such that $\pi_1^l \subset \pi_2^l$. This implies that there is at least one object $o \in O$ s.t. $o \in \pi_2^l$ and $o \notin \pi_1^l$. Since $o$ belongs to the $l$th block in the second partition, it cannot belong to any other block in the same partition. Similarly, since $o$ does not belong to the $l$th block in the first partition, it must belong to some other block (say $m$) in that partition. But this would mean that $\pi_1^m$ is not a subset of $\pi_2^m$, contradicting the premise. □

Lemma 2 allows us to evaluate partition pairs obtained from two consecutive trees in CARTwheels by examining the set of single directional implications from the second partition to the first. If each block in the second partition corresponds to a unique block in the first partition, we can conclude that each of the implications is actually a redescription. Thus, for a partition pair with $b$ blocks, we can evaluate the $b$ redescriptions possible by using $b$ implication tests as opposed to the $2b$ tests required when we do not have the notion of partitions involved.

6.5 Essential Partitions

Consider the first two trees $(t_{X,1}, t_{Y,1})$ grown in our running example for CARTwheels (in Chap. 2) shown in Fig. 6.7. These are 2-level trees with 4 paths each. We can use different groupings of the paths in these trees to realize 1, 2, 3 and 4-block partitions from them. The two lattices of all partitions that could be induced by these trees are shown in Fig. 6.9 and Fig. 6.8.
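The lattice of partitions that a single tree can induce is obtained by enumerating every grouping of its leaves. The sketch below does this for four leaves, assuming the leaf contents $o_1o_2$, $o_3$, $o_4$, $o_5$ as they appear at the bottom of the lattice in Fig. 6.8; `set_partitions` is a hypothetical helper, not part of CARTwheels.

```python
from collections import Counter


def set_partitions(items):
    """Yield every way to group the items into non-empty blocks (lists of lists)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for smaller in set_partitions(rest):
        # Put `first` into each existing block in turn ...
        for i in range(len(smaller)):
            yield smaller[:i] + [smaller[i] + [first]] + smaller[i + 1:]
        # ... or give it a block of its own.
        yield smaller + [[first]]


# Leaf contents assumed from Fig. 6.8 (tree built from Y descriptors).
leaves = [frozenset({"o1", "o2"}), frozenset({"o3"}), frozenset({"o4"}), frozenset({"o5"})]

lattice = set()
for grouping in set_partitions(leaves):
    # Fusing the paths in a group means taking the union of their leaves.
    lattice.add(frozenset(frozenset().union(*group) for group in grouping))

by_blocks = Counter(len(p) for p in lattice)
print(len(lattice), sorted(by_blocks.items()))
```

For four leaves this yields 15 partitions: one with a single block, seven with two blocks, six with three blocks, and one with four blocks.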

The partitions shown in red in Fig. 6.9 and Fig. 6.8 are induced by the two trees (along with the path assignments) in our running example. For ease of further analysis, we call these partitions $\pi_{X,1}$ and $\pi_{Y,1}$ respectively. Consider the partitions $\pi_{X,2}$ and $\pi_{Y,2}$ shown in blue in Fig. 6.9 and Fig. 6.8 respectively. It is evident from the figures that $\pi_{X,1} \leq \pi_{X,2}$ and $\pi_{Y,1} \leq \pi_{Y,2}$. We know from the example involving the two trees from Chap. 2 that $R(\pi_{Y,1}) = \pi_{X,1}$ for $\theta = 1$. This knowledge combined with Theorem 6.4.1 can be used to conclude that $R(\pi_{Y,2}) = \pi_{X,2}$. Note that there is no redescribable pair of partitions $(\pi_{X,i}, \pi_{Y,j})$ from the two lattices shown for which $\pi_{Y,j} \leq \pi_{Y,1}$ and $\pi_{X,i} \leq \pi_{X,1}$. Therefore, the redescribable partition pair $(\pi_{X,1}, \pi_{Y,1})$ provides a unique set of redescriptions. We can use such pairs of partitions to define the notion of essential partition pairs.

[Figure 6.8: lattice of the 15 partitions obtained by grouping the leaves $o_1o_2$, $o_3$, $o_4$, $o_5$, from the one-block partition $\{o_1o_2o_3o_4o_5\}$ to the four-block partition $\{o_1o_2, o_3, o_4, o_5\}$; the partitions $\pi_{Y,1}$ and $\pi_{Y,2}$ are marked.]

Figure 6.8: The partition lattice that corresponds to the tree in Fig. 6.7(a). The partition marked in red corresponds to the exact partition for the tree in Fig. 6.7(a). The partition marked in red is a sub-partition of the partition in blue.

Figure 6.9: The partition lattice that corresponds to the tree in Fig. 6.7(b). The partition marked in red corresponds to the exact partition for the tree in Fig. 6.7(b). The partition marked in red is a sub-partition of the partition in blue.

Essential partition pair: An essential partition pair comprises a redescribable partition pair $\pi_{X,i} \in \Pi_{O,X,d}$ and $\pi_{Y,j} \in \Pi_{O,Y,d}$ such that there exists no other redescribable partition pair $\pi_{X,k} \in \Pi_{O,X,d}$, $\pi_{Y,l} \in \Pi_{O,Y,d}$ such that $\pi_{X,k} \leq \pi_{X,i}$ and $\pi_{Y,l} \leq \pi_{Y,j}$.
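Given a list of redescribable partition pairs, the essential ones are those with no other redescribable pair below them in both lattices. The following sketch filters such a list; the two example pairs are made up for illustration and are not taken from the running example of Chap. 2, and the function names are hypothetical.

```python
def is_subpartition(pa, pb):
    """pa <= pb iff every block of pa is contained in some block of pb."""
    return all(any(a <= b for b in pb) for a in pa)


def essential_pairs(redescribable):
    """Keep the pairs that have no other redescribable pair below them in both lattices."""
    essential = []
    for px, py in redescribable:
        dominated = any(
            (qx, qy) != (px, py) and is_subpartition(qx, px) and is_subpartition(qy, py)
            for qx, qy in redescribable
        )
        if not dominated:
            essential.append((px, py))
    return essential


def P(*blocks):
    """Shorthand: P("o1 o2", "o3") builds a partition from space-separated blocks."""
    return frozenset(frozenset(b.split()) for b in blocks)


# Hypothetical redescribable pairs (identical X and Y partitions, so theta = 1 holds).
fine = P("o1 o2", "o3", "o4 o5")
coarse = P("o1 o2 o3", "o4 o5")
pairs = [(coarse, coarse), (fine, fine)]
result = essential_pairs(pairs)
print(len(result), result[0] == (fine, fine))
```

The coarse pair is discarded because the fine pair refines it on both sides, which mirrors the relationship between $(\pi_{X,2}, \pi_{Y,2})$ and $(\pi_{X,1}, \pi_{Y,1})$ described above.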

Notice that two or more essential partition pairs can correspond to redescriptions between the same blocks. However, it is the combination of the blocks taking part in redescriptions that is unique to any essential partition pair.

6.5.1 Border of Essential Partition Pairs

Consider the two lattices of partitions shown in Fig. 6.9 and Fig. 6.8. Consider any path in these lattices (from top to bottom) along which every partition pair encountered is redescribable. For the two lattices shown, all such paths end up in the partition pair $(\pi_{X,1}, \pi_{Y,1})$. As already discussed, this is an essential partition pair. Clearly, the essential partition pair can be seen to be a part of a border separating the space of redescribable partitions from the space of partitions that cannot be redescribed. Thus, the task of finding essential partitions is akin to visiting the partition pairs on this border.

6.5.2 Essential Tree Pairs

Consider a tree pair $(t_{X,i} \in \tau_{X,d}, t_{Y,j} \in \tau_{Y,d})$ that corresponds to a redescribable partition pair $(\pi_{X,i} \in \Pi_{O,X,d}, \pi_{Y,j} \in \Pi_{O,Y,d})$. Assume that $(\pi_{X,i}, \pi_{Y,j})$ is not an essential partition pair in the lattices $\Pi_{O,X,d}$ and $\Pi_{O,Y,d}$, with $(\pi^*_{X,i}, \pi^*_{Y,j})$ being the corresponding essential pair of partitions. It is possible that the tree pair $(t_{X,i}, t_{Y,j})$ does not contain any combination of paths that can result in $(\pi^*_{X,i}, \pi^*_{Y,j})$. Thus, if we aim to mine only the essential partition pairs, we will ignore such a tree pair. However, this tree pair might contain a unique (in terms of descriptors and not items) set of redescriptions which we would lose in such a scenario. To guard against this, we define the notion of an essential tree pair. This definition requires the sub-lattice of the lattice of all partitions of $O$ that is possible given a particular tree. For example, the sub-lattice of $\Pi_{O,X,d}$ that is possible given a tree $t_{X,i}$ is termed $\Pi^i_{O,X,d}$.

Essential tree pair: An essential tree pair comprises a pair of trees $(t_{X,i} \in \tau_{X,d}, t_{Y,j} \in \tau_{Y,d})$ that correspond to a redescribable partition pair $(\pi_{X,k} \in \Pi^i_{O,X,d}, \pi_{Y,l} \in \Pi^j_{O,Y,d})$ such that there exists no other redescribable partition pair $(\pi_{X,m} \in \Pi^i_{O,X,d}, \pi_{Y,n} \in \Pi^j_{O,Y,d})$ for which $\pi_{X,m} \leq \pi_{X,k}$ and $\pi_{Y,n} \leq \pi_{Y,l}$.

Notice that mining the essential tree (and not partition) pairs would correspond to mining all redescriptions possible for a given choice of X, Y and $d$, since every redescription will be represented by one block, or a disjunction of more than one block, that is a part of an essential tree pair. Thus, to ensure that each unique redescription is mined, we need to focus on mining all essential tree pairs (as opposed to partition pairs).

6.6 Approach to Mine Essential Tree Pairs

This section aims to provide a theoretical basis and reasoning for our approach for mining essential tree pairs using CARTwheels.

6.6.1 Traversal Strategies

Even though we have moved from mining essential partition pairs to essential tree pairs, we can in essence still traverse the lattices of partition sets $\Pi_{O,X,d}$ and $\Pi_{O,Y,d}$ to mine them. Thus, an important aspect of our approach to mine the essential tree pairs is to define a lattice traversal strategy suitable for this purpose.

Given any lattice such as the one shown in Fig. 6.3 imposed by a set of partitions, we can traverse it using an approach that could be classified as top-down, bottom-up or a combination of both. Here, we will enumerate the advantages and disadvantages of using the top-down or bottom-up approach to mine essential tree pairs.

Top-down traversal

In the top-down traversal approach, we start with the set of all 2-block partitions in $\Pi_{O,X,d}$ and $\Pi_{O,Y,d}$. We can mine all tree pairs from these two lattices that correspond to redescribable 2-block partitions. Next, we move to all 3-block partitions in the two lattices. Any tree pair that corresponds to a redescribable 3-block partition has to correspond to a redescribable 2-block partition. Therefore, we can use the set of redescribable 2-block tree (partition) pairs to try and form any 3-block partition pairs that correspond to redescribable partitions. If any such 3-block tree pair exists for a given 2-block tree pair, we can discard the 2-block tree pair. Otherwise, the 2-block tree pair is essential and is retained. We can continue moving down the lattice in this manner till we reach the $2^d$-block partitions, which form the last set of trees considered under our bias.

The major advantage of using this approach for mining essential tree pairs, and one it shares with the Apriori algorithm, is that the space of tree (partition) pairs to be searched shrinks as we go down the lattice. This is because of the sub-partition relationship between partitions on different levels in the lattice. A small problem with this approach is that the first set of redescribable tree pairs it finds are not necessarily essential. We need to traverse at least one more level to determine if that is the case.

Bottom-up traversal

In the bottom-up traversal approach, we start with the set of all $2^d$-block partitions in $\Pi_{O,X,d}$ and $\Pi_{O,Y,d}$. We can mine all tree pairs from these two lattices that correspond to redescribable $2^d$-block partitions. Next, we move to all $(2^d - 1)$-block partitions in the two lattices. For any tree pair $(t_{X,i}, t_{Y,j})$ we consider at this level, we first check for the presence of a $2^d$-block redescribable tree pair $(t_{X,k}, t_{Y,l})$ for which we could fuse two blocks to obtain $(t_{X,i}, t_{Y,j})$. If such a tree pair exists, we do not need to explicitly test if $(t_{X,i}, t_{Y,j})$ is a redescribable tree pair (see Theorem 6.4.1). Notice that even if a particular $(2^d - 1)$-block tree (in the X or Y domain) can be specialized to a tree which provides an essential $2^d$-block tree pair, we still need to search for redescriptions for the $(2^d - 1)$-block tree, as there could be some other trees that form a redescribable partition pair with it at level $(2^d - 1)$ but not at $2^d$. Thus, at any level in the partition lattices, we need to search for essential tree pairs for every tree pair we can construct. The traversal of each level continues in a similar manner as we move up the lattice till we reach the 1-block partition.

The major advantage of using this approach for mining essential tree pairs is that any tree pair mined by this approach has to be an essential tree pair. However, a big disadvantage is that the search space for tree pairs at any level of the partitions is not reduced in any way using this approach.

6.6.2 Proposed Solution Strategy

The approach we adopt to find essential tree pairs consists of an informed top-down search in the X and Y partition lattices. We chose the top-down traversal strategy as it allows for a reduction in the search space as we go down the lattice, a property which is critical given the large number of trees that need to be constructed at each level during traversal.

In our approach, we first search for any 2-block tree pair $(t_{X,i}, t_{Y,j})$ that corresponds to a redescribable partition pair. We use the process of decoupling to obtain the essential tree pairs that correspond to $(t_{X,i}, t_{Y,j})$. Taking the idea of the border of partitions and Theorem 6.4.1 into consideration, in decoupling, we break down the disjunctions for blocks (in a redescribable partition pair) that are redescriptions of each other to form two new blocks that could be redescriptions of each other. This is akin to the idea of moving from a partition to its sub-partition in the two lattices simultaneously. In decision trees, any disjunction between two paths taking part in a block can be decoupled to obtain two blocks in the tree. An example of decoupling for a tree (from a tree pair) is shown in Fig. 6.10. Given two trees that correspond to a redescribable partition pair, we can continue decoupling the paths as long as the partition pair that results is still redescribable. The final result of this process is a tree pair that corresponds to the essential tree pair for the tree pair we started with.
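The decoupling step can be illustrated with a small greedy sketch. It assumes that each block of a tree is stored as a tuple of path extensions (sets of objects), and that a matched block pair is split whenever some two-way split of its X paths and of its Y paths yields two block pairs that both still meet the threshold. This is a simplification: it follows the first valid split it finds and returns a single result, whereas the text notes that a tree pair may have several essential tree pairs. All names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def extent(paths):
    """Objects covered by a disjunction of paths."""
    return frozenset().union(*paths)


def jaccard(a, b):
    """Jaccard's coefficient as an exact fraction (1 for two empty sets)."""
    union = a | b
    return Fraction(len(a & b), len(union)) if union else Fraction(1)


def two_way_splits(paths):
    """Yield every split of a tuple of paths into two non-empty groups."""
    first, rest = paths[0], list(paths[1:])
    for r in range(len(rest)):                 # leave at least one path on the right
        for chosen in combinations(rest, r):
            left = (first, *chosen)
            right = tuple(p for p in rest if p not in chosen)
            yield left, right


def decouple(x_paths, y_paths, theta):
    """Greedily split a matched block pair while both halves stay matched."""
    for x1, x2 in two_way_splits(x_paths):
        for ya, yb in two_way_splits(y_paths):
            for y1, y2 in ((ya, yb), (yb, ya)):
                if (jaccard(extent(x1), extent(y1)) >= theta
                        and jaccard(extent(x2), extent(y2)) >= theta):
                    return decouple(x1, y1, theta) + decouple(x2, y2, theta)
    return [(x_paths, y_paths)]                # no valid split: the block pair is final


fs = frozenset
# A 2-block pair built from the paths of the trees in Fig. 6.5(a) and Fig. 6.5(c).
block1 = decouple((fs({"o2"}), fs({"o1", "o5"})), (fs({"o2"}), fs({"o1", "o5"})), 1)
block2 = decouple((fs({"o3"}), fs({"o4"})), (fs({"o3", "o4"}),), 1)
final_blocks = sorted(sorted(extent(x)) for x, _ in block1 + block2)
print(final_blocks)
```

Starting from the 2-block partition $\{o_1o_2o_5, o_3o_4\}$, the first block decouples into $o_2$ and $o_1o_5$, while the second cannot be split because the Y tree covers $o_3o_4$ with a single path.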

Once we have found the essential tree pairs that correspond to the 2-block tree pair $(t_{X,i}, t_{Y,j})$, we can move to the next 2-block tree pair. Here, a few observations about the decoupling process can help us narrow down the search space:

Observation 4.1: Consider an essential tree pair $(t_{X,i}, t_{Y,j})$ and any other tree pair $(t_{X,k}, t_{Y,l})$ that corresponds to a redescribable partition pair. If $(t_{X,k}, t_{Y,l})$ can be decoupled to $(t_{X,i}, t_{Y,j})$, then $(t_{X,i}, t_{Y,j})$ is the essential tree pair for $(t_{X,k}, t_{Y,l})$.

Observation 4.2: Given a 2-block redescribable tree (partition) pair $(t_{X,i}, t_{Y,j})$ that involves $p_{X,i}$ and $p_{Y,j}$ paths respectively, let the tree pair $(t^*_{X,i}, t^*_{Y,j})$ be the essential tree pair for $(t_{X,i}, t_{Y,j})$. Consider another 2-block tree pair $(t_{X,k}, t_{Y,l})$, where $p_{X,k} \leq p_{X,i}$ and $p_{Y,l} \leq p_{Y,j}$, obtained by the disjunction of logically reducible paths (see Fig. 6.6) in $(t_{X,i}, t_{Y,j})$. Then $(t^*_{X,i}, t^*_{Y,j})$ is also the essential tree pair for $(t_{X,k}, t_{Y,l})$.

Observation 4.1 provides a method that allows us to determine if the essential tree pair for any new tree pair found already exists within the set of essential tree pairs mined. Observation 4.2 suggests that trees should be mined in the order of decreasing number of paths if possible. Using these observations to better focus our search, we could exhaust all 2-block tree pairs possible to find the full set of essential tree pairs.

[Figure 6.10: two copies of a decision tree with nodes $X_3$, $X_1$, $X_2$ and leaves $o_3$, $o_2$, $o_4$, and $o_1, o_5$, before (left) and after (right) decoupling.]

Figure 6.10: The decoupling process: The tree on the left is the original tree with three blocks shown in green, blue, and yellow. After decoupling the blue block, the tree on the right is formed with four blocks (green, red, yellow, and blue).

6.7 Algorithm Details

We use the basic approach described in the previous section to define our algorithm for mining essential tree pairs as shown in Table 6.1.

We initialize the algorithm by setting the set of essential tree pairs $ET$ to the empty set. Also, we assign the set of available Y descriptors to be $U = Y$.

The algorithm begins by choosing any one of the $Y_i \in Y$ randomly. For any such $Y_i$, we initialize the set of visited trees $VT_y$ to the empty set. The set of essential tree pairs $TP$ for the particular $Y_i$ is also initialized to the empty set. Depending on our goal, we can either search for redescriptions involving all trees with $Y_i$ as top node or restrict our search to a part of this space. In the algorithm defined, we search through all trees possible. In the first pass through the corresponding loop, we generate a random 2-block tree with $2^d$ paths and $Y_i$ as top node using the function construct_tree(). In the subsequent alternations (loops), we already have a $t_x$ from the previous alternation that we can use to construct a new tree $t_y$. Notice that in the construct_tree() function, the fourth parameter $VT_y$ is passed to ensure that the new tree is not a repeat. In this function, we can use any of the metrics for growing decision trees. In all our studies, we used entropy gain as our metric. Another important characteristic of this function is that it gives preference to trees in decreasing order of number of paths. This is done in accordance with Observation 4.2 so that any tree which is likely to produce a minimum generator for other logically reducible trees is encountered first.

After constructing $t_y$, we check whether $t_y$ and $t_x$ correspond to redescribable partitions. If so, we use the function find_esspair() (which in turn uses Observation 4.1) to check for the presence of an essential tree pair for $(t_x, t_y)$ in $TP$. In case no such pair exists, we use decoupling in the function generate_esspair() to find the tree pair(s) $(t_x^*, t_y^*)$ that are the essential tree pairs for $(t_x, t_y)$ and add them to $TP$.

Next, we use the classes obtained from the tree $t_y$ and the X descriptors as features to construct matching trees for $t_y$. We use the construct_tree() function for this purpose once again. Here, we could restrict ourselves to just the first essential tree pair involving a particular tree $t_y$ or we could look for all matching tree pairs possible. We store all matching trees in a priority queue $T_x$ ordered by decreasing number of paths in the trees, for reasons outlined earlier. While the set $T_x$ is not empty, we pop() the first tree in the queue as $t_x$. We again check whether the trees $t_x$ and $t_y$ correspond to a redescribable partition pair. If so, we follow the same procedure as before to find and add essential tree pairs for $(t_x, t_y)$ to $TP$.

This process of finding distinct trees with $Y_i$ continues till either no viable new tree can be constructed or a preset limit on the number of trees is reached. After completing the processing of a particular $Y_i$, we add the set of essential tree pairs $TP$ to the set $ET$. We move to the next $Y_i$ by using the best_match() function to find the closest match to the partition induced by the current tree $t_x$ from the set $U$ of available Y descriptors.

Table 6.1: CARTwheels algorithmic framework for mining essential tree pairs.

Input: objects O, descriptor sets X, Y
Output: essential tree pairs ET as decision tree pairs from the X and Y domains
Parameters: θ (Jaccard's coefficient), d (depth of trees)
Algorithm:
  set answer set ET = {}
  set unused Y descriptors U = Y
  set Y_i = random descriptor from Y
  set current tree in X domain t_x = null
  while Y_i ≠ null
    set U = U − Y_i
    set visited trees in the Y domain as VT_y = {}
    set essential tree pairs for visited trees as TP = {}
    while an unvisited Y tree with top node Y_i is found
      set count = count + 1
      set classes C = paths_to_classes(t_x)
      set feature set F = {Y}
      set dataset D = construct_dataset(O, F, C)
      set tree t_y = construct_tree(D, Y_i, d, VT_y)
      if R(t_y) = t_x
        if find_esspair((t_x, t_y), TP) = null
          set (t*_x, t*_y) = generate_esspair(t_x, t_y)
          add (t*_x, t*_y) to TP
        end if
      end if
      set classes C = paths_to_classes(t_y)
      set feature set F = {X}
      set dataset D = construct_dataset(O, F, C)
      set tree set T_x = construct_tree(D, null, d, null)
      while T_x ≠ {}
        set t_x = pop(T_x)
        if R(t_y) = t_x
          if find_esspair((t_x, t_y), TP) = null
            set (t*_x, t*_y) = generate_esspair(t_x, t_y)
            add (t*_x, t*_y) to TP
          end if
        end if
      end while
    end while
    add TP to ET
    set Y_i = best_match(t_x, U)
  end while
  return ET

It is important to note here that the algorithm outlined resembles the original CARTwheels algorithm if we use limits to constrain the number of X and Y trees considered. However, even in such a case, we always mine the essential tree pairs for any two trees that match each other.

6.7.1 Further Curtailing the Space of Trees Considered

The construct_tree() function described earlier performs an intelligent search through the space of all trees possible for a given $t_x$ or $t_y$. The size of this space is not critical when a pre-specified limit is applied to the number of matching trees to be constructed. When there is no limit specified on the number of trees, we consider two separate cases to reduce our search space:

Case 1 ($\theta = 1$): In the case where $\theta$ is set to 1, we use the constraint that each leaf in the decision tree constructed has to have an entropy of 0 to be able to match the tree being searched for. While constructing a decision tree of depth $d$, if the minimum entropy for any leaf ($l$) in a particular configuration of nodes is found to be greater than 0, we can rule out all trees that contain the sequence of nodes from the root of the tree to $l$.
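One possible reading of the Case 1 rule is sketched below: for the objects that reach the last internal node of a partially built tree, the node prefix can be discarded if no candidate descriptor splits those objects into two leaves of entropy 0. This interpretation, the function names, and the small class assignment are assumptions made for illustration only.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels; 0 for an empty or pure list."""
    n = len(labels)
    counts = Counter(labels)
    return sum(-(c / n) * log2(c / n) for c in counts.values()) if n else 0.0


def split_is_pure(objects, descriptor, klass):
    """True if splitting on the descriptor leaves both children with entropy 0."""
    inside = [klass[o] for o in objects if o in descriptor]
    outside = [klass[o] for o in objects if o not in descriptor]
    return entropy(inside) == 0 and entropy(outside) == 0


def prefix_can_be_pruned(objects, candidates, klass):
    """With theta = 1, a node prefix is useless if no candidate gives pure leaves."""
    return not any(split_is_pure(objects, d, klass) for d in candidates)


# Hypothetical classes taken from the blocks of the tree to be matched.
klass = {"o1": "A", "o5": "A", "o3": "B", "o4": "B"}
reaching = {"o1", "o3", "o4", "o5"}          # objects arriving at the last node
X2 = frozenset({"o3", "o4"})
X3 = frozenset({"o2", "o4"})
print(prefix_can_be_pruned(reaching, [X3], klass))
print(prefix_can_be_pruned(reaching, [X2, X3], klass))
```

With only `X3` available the prefix is pruned, because one child mixes both classes; once `X2` is also a candidate, a pure split exists and the prefix is kept.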

Case 2 ($\theta < 1$): This is a slightly harder situation for pruning out a group of trees. Consider a tree $t_x$ given to us for which we need to mine a $t_y$ that could provide redescriptions. Let the partition that corresponds to $t_x$ be $\pi_x$. Any $t_y$ with partition $\pi_y$ which can provide two redescriptions with the 2-block tree $t_x$ has to satisfy the following conditions:

• For each of the two blocks, $\pi_x^1$ and $\pi_x^2$, there has to be a path in $t_y$ (we call it $p_{y,i}$) such that the confidence of the implication rule $p_{y,i} \Rightarrow \pi_x^1$ or $p_{y,i} \Rightarrow \pi_x^2$ is greater than $\theta$.

• For each of the two blocks, $\pi_x^1$ and $\pi_x^2$, the disjunction of paths with leaves assigned to a block has to have a confidence greater than $\theta$ for the implication rule from the disjunction to either of $\pi_x^1$ or $\pi_x^2$.

In searching for trees $t_y$, while adding the last node to the tree, we only consider those descriptors that satisfy the confidence conditions stated above. If no such node exists, we have successfully eliminated all trees with the current configuration (of other nodes) from consideration. It is important to mention here that the pruning strategies above are more suited for low values of $d$.
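The two confidence conditions of Case 2 can be checked as in the sketch below. It assumes that the confidence of a rule $A \Rightarrow C$ is $|A \cap C| / |A|$ and that each path of $t_y$ comes with the index of the block of $\pi_x$ it is assigned to; the second condition is checked against that assigned block. The function names and the example paths are hypothetical.

```python
def confidence(antecedent, consequent):
    """Confidence of the rule antecedent => consequent: |A & C| / |A|."""
    return len(antecedent & consequent) / len(antecedent) if antecedent else 0.0


def satisfies_case2(paths, assignment, blocks, theta):
    """Check both Case 2 conditions for a candidate tree t_y.

    paths      : path extensions (sets of objects) of t_y
    assignment : assignment[i] is the index of the block that path i is assigned to
    blocks     : the two blocks of the partition of t_x
    """
    for idx, block in enumerate(blocks):
        # Condition 1: some single path implies this block with confidence > theta.
        if not any(confidence(p, block) > theta for p in paths):
            return False
        # Condition 2: the disjunction of the paths assigned to this block does too.
        assigned = [p for p, a in zip(paths, assignment) if a == idx]
        disjunction = frozenset().union(*assigned)
        if confidence(disjunction, block) <= theta:
            return False
    return True


fs = frozenset
blocks = (fs({"o1", "o2", "o5"}), fs({"o3", "o4"}))
good_paths = [fs({"o2"}), fs({"o1", "o5"}), fs({"o3", "o4"})]
bad_paths = [fs({"o2", "o3"}), fs({"o1", "o5"}), fs({"o4"})]
print(satisfies_case2(good_paths, [0, 0, 1], blocks, 0.8))
print(satisfies_case2(bad_paths, [0, 0, 1], blocks, 0.8))
```

In the second call the disjunction assigned to the first block covers $o_3$ as well, so its confidence is 0.75 and the candidate is rejected at $\theta = 0.8$.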

6.8 Case Study

To demonstrate the effectiveness of our approach in reducing redundancy resulting from mining non-essential tree pairs, we conducted a simple case study that involved mining a subset of all essential tree pairs for the input set of descriptors. In particular, for each descriptor $Y_i$, we chose to construct 300 unique trees that have it as the top node and found the 200 best trees that match such a tree. Even though this covers only a portion of the space of all tree pairs, we can obtain useful statistics from such a study to show the improvement we can expect by mining only essential tree pairs.


Figure 6.11: Plot showing count of total number of essential tree pairs mined and non-essential trees encountered in our case study.

For our study we considered the universal set of yeast genes G1 from Chap. 2 that consists of 74 genes. In order to meet the goals of our experiment, we used a very limited set of descriptors from the ones used earlier. From the set of expression descriptors, we used all descriptors that correspond to the clusters obtained using gene expression data profiles. From the functional annotation descriptors, we chose to use the GO cellular location categories for our analysis. We used a depth of 2 for trees constructed using either of the vocabularies. We used a p-value cutoff of $10^{-5}$ to remove insignificant redescriptions.

In our study, we set up CARTwheels to find all essential tree pairs possible (given the constraint on tree counts) for different values of the $\theta$ threshold from 0.4 to 1.0. We set the X descriptors to be the expression based descriptors and the Y descriptors to be the GO cellular category based descriptors. For this setting, Fig. 6.11 shows the total number of essential trees mined (black line) for various values of $\theta$. The second curve (grey line) corresponds to the number of trees eliminated from consideration since they are not essential. This figure shows that, for lower values of $\theta$, about 25% of the total number of trees were found to be non-essential and removed from consideration. This proportion goes down to about 18% from $\theta = 0.8$ onward. This is primarily because at higher values of $\theta$, disjunctions between paths are more likely to provide us with redescriptions than the corresponding decoupled paths.

Fig. 6.12 shows the total number of redescriptions that the essential tree pairs correspond to (black line) and the number of unique redescriptions that are part of this set (grey line). As would be expected, both the number and the proportion of non-unique redescriptions are higher at lower values of θ, because more trees with 3 or 4 blocks (that could be repeated) are part of the essential tree pairs. As θ becomes high, a majority of the essential pairs are 2-block trees that provide us with two unique redescriptions each. Thus, most of the redescriptions mined at high values of θ (above 0.6) are unique.


Figure 6.12: Plot showing count of total number of redescriptions and unique redescriptions mined as part of the essential tree pairs in our case study.

6.9 Discussion

In this chapter, we built a theoretical framework that defined the notion of essential trees which need to be mined to cover the space of all redescriptions possible. We also presented a modified version of the CARTwheels approach for mining essential trees using a top-down search in the partition-space lattice induced by the descriptors provided. As the results in our case study seem to indicate, we can expect such a search to reduce the space of trees explored and hence tackle some of the redundancy in the results from CARTwheels.


Chapter 7

Extensions and Novel Applications of Redescription Mining

One of the themes of this dissertation has been the versatility of redescriptions, not just as data mining patterns, but also as the starting point to elucidate pathways and as a compositional unit to construct stories. In this chapter, we present several new extensions and applications of redescription mining.

7.1 Redescription Mining in Structured Descriptor Spaces

7.1.1 Problem Statement

As described in the previous chapter, CARTwheels can be used to perform a systematic exploration of the space of redescriptions to reduce redundancy in the patterns mined. In our approach and analysis to this point, we have not modeled any structure inherent in the space of descriptors. Descriptor space is often non-flat and could have a rich structure, such as a hierarchy or a general partial order, underlying it. As an example, consider the descriptors constructed from gene expression data for experiments conducted by Gasch et al. [21] presented in Chap. 2. In this study, for each integer cutoff within the range of expression values observed at a particular time point of the time series dataset, we defined a descriptor containing genes that are expressed with magnitude more than the cutoff. This method of constructing the descriptors automatically results in a structure of containment relationships for descriptors from this family, as shown in Fig. 7.1(a). A similar containment relationship is inherent to the GO functional category based descriptors which are typically related, not simply by a linear chain as in the expression-based descriptors, but by a hierarchy (Fig. 7.1(b)). In other branches of the GO taxonomy, a general partial order of descriptors is available. Redescriptions constructed from descriptors related through such structural constraints hence might exhibit a form of logical redundancy that can potentially be eliminated in CARTwheels.
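The containment chain among cutoff-based expression descriptors can be illustrated with a minimal sketch. The gene names and expression values below are hypothetical (they are not taken from the Gasch et al. data), the helper `threshold_descriptors` is an illustrative name rather than part of CARTwheels, and the comparison `>=` follows the labels in Fig. 7.1(a).

```python
def threshold_descriptors(expression, cutoffs):
    """Map each cutoff c to the set of genes with expression >= c."""
    return {c: {g for g, v in expression.items() if v >= c} for c in cutoffs}


# Hypothetical expression values at a single time point.
expression = {"g1": 4.2, "g2": 3.1, "g3": 2.5, "g4": 1.0, "g5": 0.3}
descriptors = threshold_descriptors(expression, cutoffs=[1, 2, 3, 4])

# Each higher cutoff selects a subset of the genes selected by the lower one,
# so the descriptors form a linear chain of containment.
chain_holds = all(descriptors[c + 1] <= descriptors[c] for c in (1, 2, 3))
```

Because the chain holds by construction, any conjunction of two descriptors from this family collapses to the one with the higher cutoff, which is the kind of logical redundancy discussed in this section.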


[Figure 7.1 content: (a) descriptors EXPR_01, EXPR_02, EXPR_03, and EXPR_04, defined by expression >= 1, >= 2, >= 3, and >= 4 respectively; (b) nodes labeled G1 to G4 and GO category 1 to GO category 4.]

Figure 7.1: Examples of structures seen in descriptors encountered in bioinformatics datasets. (a) Linear chain of subsets for gene expression based descriptors, and (b) Tree like structure for GO functional category hierarchy.

7.1.2 Solution Strategy

The harnessing of descriptor structure to develop effective data mining algorithms is a popular theme in the KDD literature [29]; in fact, the classical Apriori [2] algorithm can be viewed as exploiting one such constraint through an anti-monotonicity criterion, to avoid evaluating some itemsets. Another example of exploiting descriptor structure can be seen in ILP algorithms. The structure is provided to ILP algorithms as background knowledge B in the form of ground facts or other predicates [39]. These are then used to block out some portions of the θ-subsumption lattice such that rules that restate B or any of its logical derivatives are not reported.

In our solution strategy, we assume that the structure inherent to the descriptor space is provided to us in the form of Horn clauses such as ¬Xi ∨ Xj (equivalently, Xi → Xj) where Xi, Xj ∈ X. In CARTwheels, the use of a decision tree to model boolean expressions implies that all derived descriptors are in DNF form. If Xi and Xj lie along a path in a decision tree, then it is clear that clauses involving these two descriptors must fall into one of two categories:

• Pij: Xi ∧ Xj and Xi ∧ ¬Xj

• Pji: Xj ∧ Xi and Xj ∧ ¬Xi

The simplest trees with paths containing one of these two pairs of conjunctions are shown in Fig. 7.2 where (a) corresponds to Pij and (b) corresponds to Pji. Using the clause ¬Xi ∨ Xj, the conjunctions shown in Fig. 7.2 can be reduced as follows:

• Pij: Xi ∧ Xj = Xi and Xi ∧ ¬Xj = ∅

• Pji: Xj ∧ Xi = Xi and Xj ∧ ¬Xi ≠ ∅

This analysis implies that for Pij, the presence of Xj is superfluous and will result in only redundant derived descriptors. Pji provides us with one redundant conjunction (Xj ∧ Xi) and one non-redundant conjunction (Xj ∧ ¬Xi). Since these two conjunctions will always occur as a pair, as part of possibly two derived descriptors, we will need to retain the redundant conjunction (path) of the derived descriptors in Pji.

[Figure 7.2 content: (a) tree rooted at Xi with child Xj, with leaves labeled not empty, empty, and ?; (b) tree rooted at Xj with child Xi, with leaves labeled not empty, not empty, and ?. Branches are labeled yes and no.]

Figure 7.2: The effect of structure in descriptor space on the decision trees constructed. (a) Decision tree contains a reducible path and a path that corresponds to an empty set. (b) Decision tree contains a reducible path and a non-reducible path that corresponds to a non-empty set.
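The reductions for the two categories can be checked on a small example. The sketch below assumes the containment reading of the structural constraint (every item in X_i is also in X_j) and uses hypothetical item sets; set difference stands in for conjunction with a negated descriptor.

```python
# Hypothetical descriptors over a small universal set, with X_i contained in X_j.
X_i = {1, 2, 3}
X_j = {1, 2, 3, 4, 5}
constraint_holds = X_i <= X_j   # the structural constraint X_i -> X_j

# P_ij: X_i is tested first and evaluates to true.
p_ij_yes = X_i & X_j            # reduces to X_i, so testing X_j adds nothing
p_ij_no = X_i - X_j             # X_i AND NOT X_j is always empty

# P_ji: X_j is tested first and evaluates to true.
p_ji_yes = X_j & X_i            # again reduces to X_i (the redundant path)
p_ji_no = X_j - X_i             # X_j AND NOT X_i, non-empty here
```

Only `p_ji_no` describes a set that cannot be expressed by a single descriptor, which is why trees of the second kind cannot simply be discarded.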

From Fig. 7.2, it can be seen that conjunctions listed in Pji are formed when the tree we construct contains a path from Xi to Xj where depth(Xi) > depth(Xj) and Xj evaluates to true on the path. Here, depth() is a function that measures the depth of a node in the tree with the root node at 0. Analogously, Pij is formed when the path from Xi to Xj is such that depth(Xj) > depth(Xi) and Xi evaluates to true on the path. A check for this situation can hence be pushed into the decision tree construction procedure.

The discussion above also applies to the structural information ¬Xj → ¬Xi (the contrapositive of Xi → Xj) that we can deduce from the Horn clause provided to us along with the assumption of a universal set of items. In this case, the only non-redundant descriptor is formed when the tree we construct contains a path from Xi to Xj where depth(Xj) > depth(Xi) and Xi evaluates to false on the path.

Building on the above discussion, we adopt a two-step approach to tackle the redundancy that creeps in due to structure in descriptor spaces. In the first step, we restrict the construction of trees in CARTwheels such that no tree that could contain conjunctions of the type in Pij is formed. This can be done while constructing the ADtree support structure for CARTwheels by storing structure information alongside each node of the ADtree. Thus, when a particular branch that starts with a descriptor needs to be extended, its ascendants in the structural hierarchy are not considered for the next node to be supplied for the decision tree, even though one of them might reduce the entropy the most.

A second step is required to eliminate the redundancy that results from trees that contain conjunctions of the type in Pji. This is achieved through a post-processing step to reduce redescriptions obtained into their shortest logical form using the structure information. We use tabular minimization for this purpose and output the reduced form of the redescription as the result.
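The two steps can be sketched as follows, under several simplifying assumptions: the hierarchy is a tree given as a hypothetical child-to-parent map (GO is in general a DAG), the first step is applied only to descriptors that evaluate to true on the branch, and the second step handles only conjunctions of positive descriptors. The function `reduce_conjunction` is a simplified stand-in, not the tabular minimization used in this work, and none of the names below come from the CARTwheels implementation.

```python
def ancestors(descriptor, parent):
    """All ascendants of a descriptor in an acyclic child -> parent map."""
    found = set()
    while descriptor in parent:
        descriptor = parent[descriptor]
        found.add(descriptor)
    return found


def allowed_splits(branch_true, candidates, parent):
    """Step 1: skip candidates that contain a descriptor already true on the branch."""
    blocked = set()
    for d in branch_true:
        blocked |= ancestors(d, parent)
    return [c for c in candidates if c not in blocked and c not in branch_true]


def reduce_conjunction(positive, parent):
    """Step 2 (simplified): drop descriptors implied by another one in the conjunction."""
    implied = set()
    for d in positive:
        implied |= ancestors(d, parent)
    return {d for d in positive if d not in implied}


# Hypothetical hierarchy: GO1 contains GO2 and GO3; GO2 contains GO4.
parent = {"GO2": "GO1", "GO3": "GO1", "GO4": "GO2"}
splits = allowed_splits({"GO4"}, ["GO1", "GO2", "GO3", "GO4"], parent)
reduced = reduce_conjunction({"GO2", "GO4"}, parent)
```

Here a branch on which GO4 is true may only be extended with GO3, and the conjunction GO2 ∧ GO4 reduces to GO4.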

7.1.3 Case studies

To understand the effect of introducing the structure of the descriptor space into redescription mining, we performed two relevant simulation studies. The purpose of these studies is to get an insight into the scale of reduction in the redescription space that could be obtained using the ideas discussed in this section.


Figure 7.3: Plot showing the effect of utilizing descriptor space structure in mining redescriptions for study involving GO CEL categories and clustering based descriptors for yeast. The three curves correspond to the total number of redescriptions mined, number of redescriptions found to be redundant because of Pij type constructs, and (Pji and/or Pij) type constructs.

Study with GO CEL categories

In the first study, we used the structure of descriptors for one of our descriptor families (GO Cellular location, GO CEL) supplied as input for redescription mining.

Universal set: The universal set used in this study is the same as G1 used in Chap. 2. This set consists of 74 genes that show significant expression change in each of the yeast stress response experiments conducted by Gasch et al. that we consider.

Descriptors: The set of descriptors in this study consists of the GO cellular location category for the genes on one side and the clusters obtained from the expression profiles on the other. The total number of descriptors used in this study is 42 for GO CEL and 70 for the expression based descriptors. We also used the structure inherent to the GO cellular location categories as containment relationships between descriptors.

CARTwheels configuration: Motivated by the goal of this study, we set the maximum depth for expression based descriptors as 1 and for GO CEL descriptors as 2. The p-value cutoff for valid redescriptions was set to $10^{-5}$. Since we have used conjunctions between descriptors as the basis of our analysis above, we used a restricted version of CARTwheels where only conjunctions between descriptors were allowed.

Results: Initially, we allowed all possible conjunctions to form between descriptors of the same type and found redescriptions across them. We varied the value of θ from 0.1 to 1.0 and recorded the number of unique redescriptions mined in each case. Next, we removed conjunctions of the type in Pij from our analysis and recorded the reduced number of redescriptions mined. Finally, we removed conjunctions of the type shown in Pji and counted the unique redescriptions mined.

Fig. 7.3 shows the result of our analysis. For each value of θ, we plot three numbers: total number of redescriptions mined, number of redescriptions found to be redundant because of constructs of type Pij, and the number of redescriptions that are redundant because of constructs of type Pij and/or Pji. As an example, at θ = 0.5 our knowledge of the structure of GO CEL descriptors helps us discard about 30% of the redescriptions after removing constructs of type Pij and about 50% of the redescriptions after pruning out both types of constructs. This figure clearly demonstrates the usefulness of our analysis. The number of unique redescriptions mined is shown to be reduced dramatically after each of the steps for the dataset considered, thus resulting in a reduction in redundancy.

Study across GO MOL and GO CEL categories

For our second study, we considered two structurally rich descriptor sets as input to CARTwheels and performed a similar analysis as before.

Universal set: For this study, we used the universal set of all yeast genes that have both GO molecular function (GO MOL) and GO cellular location (GO CEL) categories defined for them. This provides us with a universal set of 3206 genes.

Descriptors: We used the GO MOL and GO CEL category assignments for genes as descriptors. This resulted in a total of 1472 descriptors for GO MOL and 317 descriptors for GO CEL categories. We again used hierarchy information between GO categories to define a containment relationship between descriptors.

CARTwheels configuration: Unlike the previous study, we are provided with structurally rich descriptors for both the descriptor sets provided as input. In order to use structural information from the descriptor sets, we set a depth of 2 for both types of descriptors in CARTwheels. The rest of the settings in this analysis were exactly the same as for the previous case.

Results: We performed a similar three step analysis on the data set as before and kept counts of the number of redescriptions obtained after each step. Fig. 7.4 shows the results of our analysis. Even though we find a similar trend in the three curves shown here as in the previous study, the gulf between the number of redescriptions mined and those that are structurally redundant is quite large, especially for low values of θ. For high values of θ, the proportion of redundant redescriptions is quite large (about 50% for θ = 0.9). However, as the θ value decreases, the sheer number of descriptors used in this study allows for a lot more redescriptions to be found as compared to the number of redescriptions involving structurally related descriptors. Thus, even though the number of redescriptions eliminated increases with decreasing θ, the redundancy removed is obscured by the total number of redescriptions mined.

The studies in this section clearly showcase the reduction in redundancy we can obtain by utilizing the structure of descriptor space provided to us. Further, even though the second study shows that the proportion of structurally redundant redescriptions is much lower at very low values of θ, we would generally focus on the θ > 0.5 domain where the redundancy could be a sizable portion of the redescriptions mined for other datasets as well.


Figure 7.4: Plot showing the effect of utilizing descriptor space structure in mining redescriptions for study involving GO CEL and GO MOL categories for all genes in yeast. The three curves correspond to the total number of redescriptions mined, number of redescriptions found to be redundant because of Pij type constructs, and (Pji and/or Pij) type constructs.

7.2 Redescription Mining with Many-Many Relationships


Thus far, Jaccard’s coefficient has been our measure of choice, in CARTwheels, to evaluate the goodness of a redescription. In the case where it relates sets of items that share a one-one correspondence, it gives a good measure for the overlap between the subsets involved in a redescription. However, it does not have a logical extension defined for the case where relationships between items are many-to-one or many-to-many. Therefore, a new measure is required to quantify the goodness of a redescription for the many-to-many case.

To see the relevance of this problem, consider the hypothetical sets of genes for organisms P and Q shown in Fig. 7.5. Links between genes in the two organisms indicate orthologous relationships. Assume that a descriptor D1 defined over P involves genes p1 and p2, and descriptor D2 defined over Q involves q1 and q3. Unlike redescriptions presented earlier, descriptors D1 and D2 cannot be reconciled for finding similarity between them as they do not share the same universal set. The presence of more than one ortholog for any gene in a descriptor (enclosed in dotted boxes in Fig. 7.5) also prevents us from viewing this as a simple case of mapping across different universal sets.

[Figure 7.5 content: genes p1, p2, p3, p4 of Organism P and genes q1, q2, q3, q4 of Organism Q.]

Figure 7.5: A homologous set of genes motivating the problem of generalizing Jaccard’s coefficient to many-many relationships.


The related work pertinent to this question arises from research that defines overlaps for special situations. Strehl et al. [87] present a modified version of Jaccard’s coefficient between vectors xa and xb that is able to handle discrete as well as continuous non-negative features. Here, each data point (or item) is represented as a vector of numerical values. Ganesan et al. [16] formulate measures for similarity that exploit structure in data domains (as described earlier). They specifically consider data that can be modeled as a rooted tree. The data points (or items) are the leaves of this tree. They measure the similarity between two itemsets derived from these items by using the intersection of ancestors of the items in each set. Of the two measures they propose, GCSM (generalized cosine-similarity measure) uses many-to-many matches. It defines the similarity contribution of each item in itemset I1 by considering all items in itemset I2 that have a (possibly non-unary) similarity with it.

Our method to tackle redescriptions in the many-to-many case uses the bipartite graph inherent to objects taking part in many-to-many relationships for defining similarity measures. The measures enumerated using this method are not reducible or connected to the traditional Jaccard’s coefficient we have used thus far. For the purpose of this approach, let O1 and O2 be two universal sets of objects provided. The many-to-many relationship between objects in O1 and O2 can be summarized as the set of pairs E := {Ei,j | Ei,j = (oi, oj), oi ∈ O1, oj ∈ O2}. We do not assume any weights attached to connections in E. Also, each object in the sets O1 and O2 is assumed to be involved in at least one relationship. Together, O1, O2, and E define an undirected bipartite graph G := (O1 ∪ O2, E). Let the sets of descriptors defined over items in O1 and O2 be X and Y respectively. Any pair of descriptors (Xa, Yb) can be used to define a subgraph of G in many different ways. For our purpose, we consider the bipartite subgraph Ga,b := (Xa ∪ Yb, Ea,b) where Ea,b = {Ei,j | Ei,j ∈ E, Ei,j = (oi, oj), oi ∈ Xa, oj ∈ Yb}. We can now use certain properties of Ga,b to measure the strength of a correspondence (redescription) between descriptors Xa and Yb.

Clustering factor

Given the graph Ga,b with |Xa| nodes in O1 connected to |Yb| nodes in O2, the maximum number of edges possible in such a bipartite graph is |Xa||Yb|. A graph with as many edges would correspond to a complete bipartite graph, or a biclique. The number of edges in the graph Ga,b is the cardinality of the set of edges |Ea,b|. Thus, one measure of the strength of redescription between Xa and Yb is the clustering factor C(Xa, Yb):

$$C(X_a, Y_b) = \frac{|E_{a,b}|}{|X_a|\,|Y_b|} \qquad (7.1)$$


As is evident from this definition, 0 ≤ C(Xa, Yb) ≤ 1. Values of C(Xa, Yb) close to 1 would require a large value of degree for items in Xa and Yb, especially for large values of |Xa| and |Yb|.
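A minimal sketch of the clustering factor follows. It assumes the relationships are given as a set of (object in O1, object in O2) pairs and that both descriptors are non-empty; the function name and the gene labels are hypothetical.

```python
def clustering_factor(X_a, Y_b, edges):
    """Fraction of possible X_a-Y_b pairs that are actually related.

    Assumes X_a and Y_b are non-empty sets and edges is a set of (u, v)
    pairs with u from the first universal set and v from the second.
    """
    induced = {(u, v) for (u, v) in edges if u in X_a and v in Y_b}
    return len(induced) / (len(X_a) * len(Y_b))


# Hypothetical labels: one gene on the left related to three of four on the right.
X_a = {"f1"}
Y_b = {"w1", "w2", "w3", "w4"}
edges = {("f1", "w1"), ("f1", "w2"), ("f1", "w3"), ("f2", "w4")}

c = clustering_factor(X_a, Y_b, edges)
```

The edge ("f2", "w4") is ignored because f2 is not in X_a, so three of the four possible pairs are present.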

Node coverage

C(Xa, Yb) can be thought of as an edge-centric measure for measuring similarity between descriptors. Although high values of this measure are desirable, it is possible to find interesting pairs of descriptors Xa and Yb where there are very few connections involving a node (item) but all items have at least one connection. This idea provides the basis for our node-centric measure called node coverage (N(Xa, Yb)). For the purpose of this definition, let Ia,b be the set of isolated nodes in Ga,b, i.e., those whose degree is zero.

$$N(X_a, Y_b) = 1 - \frac{|I_{a,b}|}{|X_a| + |Y_b|} \qquad (7.2)$$

This measure is the ratio of the number of nodes with at least one edge incident on them to the total number of nodes in the graph Ga,b. Again, 0 ≤ N(Xa, Yb) ≤ 1. Values of N(Xa, Yb) close to 1 imply that for almost every item in Xa, there exists an item in Yb that corresponds to it and vice-versa.

The two measures defined can be used separately or in conjunction depending on the nature of the bipartite graph, descriptors provided, and the type of results sought.
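Node coverage can be sketched in the same style. The example below uses hypothetical labels and an edge list chosen so that 3 of 4 nodes on one side and 7 of 10 on the other have at least one connection; it assumes both descriptors are non-empty.

```python
def node_coverage(X_a, Y_b, edges):
    """1 - (isolated nodes) / (all nodes) in the subgraph induced by X_a and Y_b."""
    induced = [(u, v) for (u, v) in edges if u in X_a and v in Y_b]
    covered_x = {u for u, _ in induced}
    covered_y = {v for _, v in induced}
    isolated = (len(X_a) - len(covered_x)) + (len(Y_b) - len(covered_y))
    return 1 - isolated / (len(X_a) + len(Y_b))


# Hypothetical descriptors with 4 and 10 items and 7 relationships between them.
X_a = {"a1", "a2", "a3", "a4"}
Y_b = {f"b{k}" for k in range(1, 11)}
edges = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a3", "b4"),
         ("a3", "b5"), ("a3", "b6"), ("a3", "b7")}

n = node_coverage(X_a, Y_b, edges)
```

Four of the fourteen nodes are isolated, so the coverage is 10/14, while the same subgraph has only 7 of 40 possible edges. This is the situation the node-centric measure is meant to capture: most items have a counterpart even though the subgraph is far from a biclique.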

CARTwheels configuration for many-to-many case

Assume that we are given a partition of objects in the universal set O1 that we would like to use to classify objects in O2. Since each object in O2 could map to more than one object in O1 (potentially in different blocks), we need an unambiguous policy to assign classes to objects in O2. Towards this end, without any loss of generality, let us consider an object o1,i ∈ O1 that maps to objects o2,j, o2,k ∈ O2. If o2,j and o2,k belong to different blocks, the class is assigned as the block containing the object (from o2,j and o2,k) with the minimum degree. In case of a tie between two or more items from different blocks, the class is assigned as the block with minimum cumulative degree from amongst those in contention. Once this class assignment is completed for objects in O2, we can use tree growing measures such as entropy gain or gini index to grow the decision tree with descriptors for O2 serving as features. Leaf assignments are made on the basis of frequency with at least one leaf assigned a different class to ensure exploration.
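The class assignment policy can be sketched as below. This is one reading of the rule rather than the CARTwheels implementation: "cumulative degree" is interpreted here as the summed degree of the object's related items within each contending block, and any remaining tie is broken by block label purely for determinism. All names and data are hypothetical.

```python
from collections import defaultdict


def assign_class(neighbors, block_of, degree):
    """Pick a block for one object given its related objects in the partitioned set.

    neighbors: related objects in the partitioned universal set
    block_of:  maps each such object to its block label
    degree:    maps each such object to its degree in the bipartite graph
    """
    min_degree = min(degree[n] for n in neighbors)
    contenders = {block_of[n] for n in neighbors if degree[n] == min_degree}
    if len(contenders) == 1:
        return contenders.pop()
    # Tie between blocks: compare the summed degree of this object's
    # neighbors per contending block (an assumption of this sketch).
    cumulative = defaultdict(int)
    for n in neighbors:
        if block_of[n] in contenders:
            cumulative[block_of[n]] += degree[n]
    # Any remaining tie is broken by block label, for determinism only.
    return min(contenders, key=lambda b: (cumulative[b], b))


block_of = {"p1": "A", "p2": "B", "p3": "B"}
clear_winner = assign_class(["p1", "p2", "p3"], block_of,
                            {"p1": 2, "p2": 1, "p3": 3})
tie_broken = assign_class(["p1", "p2", "p3"], block_of,
                          {"p1": 1, "p2": 1, "p3": 2})
```

In the first call p2 has the unique minimum degree, so block B is chosen; in the second, p1 and p2 tie and block A wins on the smaller cumulative degree (1 versus 3).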

Having defined the two measures and the CARTwheels configuration that we can use in the many-to-many case, we consider examples of their application to two biologically relevant scenarios.

Assessing significance of redescriptions mined

To measure the significance of redescriptions with a given clustering factor and node coverage, we used an idea similar to the one used for the one-to-one case. Consider any redescription A ⇔ B where A ⊆ O1 and B ⊆ O2. We estimated the probability that a pair of randomly generated sets of size |A| from O1 and |B| from O2 have node coverage and clustering factor greater than or equal to the redescription in consideration. This was done using a simulation with 100,000 pairs of randomly generated subsets of O1 and O2. We used the relationships between items in O1 and O2 to calculate the clustering factor and node coverage for each randomly generated pair. The ratio of the number of such pairs for which the two measures are at least the values for our redescription to the total number of pairs tested provides an estimate of the probability (or p-value) that we seek. For all our results shown here, we used a p-value cut-off of 0.005.

Figure 7.6: The degree distribution for genes used in worm and fly for our cross-genomic expression data analysis.
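The simulation-based significance estimate described in this subsection can be sketched as follows. The function names, the default of 1,000 trials (the study used 100,000 pairs), the fixed seed, and the toy graphs are all assumptions made for illustration.

```python
import random


def measures(A, B, edges):
    """Return (clustering factor, node coverage) for the subgraph induced by A and B."""
    induced = [(u, v) for (u, v) in edges if u in A and v in B]
    clustering = len(induced) / (len(A) * len(B))
    covered = len({u for u, _ in induced}) + len({v for _, v in induced})
    coverage = covered / (len(A) + len(B))
    return clustering, coverage


def empirical_p_value(A, B, O1, O2, edges, trials=1000, seed=0):
    """Fraction of random (|A|, |B|)-sized pairs at least as good on both measures."""
    rng = random.Random(seed)
    c_obs, n_obs = measures(A, B, edges)
    pool1, pool2 = sorted(O1), sorted(O2)  # sequences, as random.sample requires
    hits = 0
    for _ in range(trials):
        rand_a = set(rng.sample(pool1, len(A)))
        rand_b = set(rng.sample(pool2, len(B)))
        c, n = measures(rand_a, rand_b, edges)
        if c >= c_obs and n >= n_obs:
            hits += 1
    return hits / trials


O1 = {"a0", "a1", "a2"}
O2 = {"b0", "b1", "b2"}
complete = {(a, b) for a in O1 for b in O2}
sparse = {("a0", "b0"), ("a1", "b1"), ("a2", "b2")}

p_complete = empirical_p_value({"a0"}, {"b0", "b1"}, O1, O2, complete, trials=200)
p_sparse = empirical_p_value({"a0"}, {"b0"}, O1, O2, sparse, trials=200)
```

In the complete graph every random pair matches the observed values, so the estimate is 1 and the redescription is not significant. In the sparse graph only a random pair that happens to be related can match, so the estimate is lower.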

7.2.3 Results

Gene expression data analysis across multiple organisms

For the purposes of this study, we used microarray gene expression data from a time course analysis conducted for each of Caenorhabditis elegans [49] (worm) and Drosophila melanogaster [84] (fly). The worm time course consists of 7 time points and the fly time course consists of 9 time points. The aim of both the experiments considered is to study the temporal response of the respective organism to heat stress (or heat shock). The purpose of our study is to find similarities and differences in the behavior of genes in these two experiments through orthologous relationships. Orthologs between worm and fly genes were obtained from the InParanoid database [64].


As a first step, the set of genes for each of the organisms was restricted to those with at least one ortholog. Also, genes with missing expression values were not considered in our study. This resulted in a total of 3572 genes for worm and 4160 genes for fly related by 10405 ortholog relationships. Out of these genes, we constructed our universal set on the basis of actual expression values from the time series data. In particular, the universal set consists of genes that showed more than 1.5-fold positive or negative expression change in any of the time points along with all their orthologs (even if the orthologs do not satisfy the expression cut-off required). This resulted in a total of 173 genes for worm and 211 genes for fly related by 547 ortholog relationships.
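The construction of the two universal sets can be sketched as below. The sketch assumes that expression profiles are stored as signed fold-change values, that the cutoff test is `abs(value) > cutoff`, that every gene in the input already has at least one ortholog, and that the retained relationships are those with both endpoints in the universal sets. The function name and the data are hypothetical.

```python
def build_universal_sets(expr1, expr2, orthologs, cutoff=1.5):
    """Keep genes passing the cutoff at any time point, plus all their orthologs.

    expr1, expr2: gene -> list of expression values, one per time point
    orthologs:    set of (gene in organism 1, gene in organism 2) pairs
    """
    def passes(profile):
        return any(abs(v) > cutoff for v in profile)

    seed1 = {g for g, p in expr1.items() if passes(p)}
    seed2 = {g for g, p in expr2.items() if passes(p)}
    # Orthologs are added even when they do not pass the cutoff themselves.
    U1 = seed1 | {u for (u, v) in orthologs if v in seed2}
    U2 = seed2 | {v for (u, v) in orthologs if u in seed1}
    kept = {(u, v) for (u, v) in orthologs if u in U1 and v in U2}
    return U1, U2, kept


expr_worm = {"w1": [0.2, 2.0], "w2": [0.1, -0.3]}
expr_fly = {"f1": [0.0, 0.4], "f2": [-1.8, 0.2], "f3": [0.1, 0.1]}
orthologs = {("w1", "f1"), ("w2", "f2"), ("w2", "f3")}

U_worm, U_fly, kept = build_universal_sets(expr_worm, expr_fly, orthologs)
```

Here w1 and f2 pass the cutoff; w2 and f1 enter only as orthologs of those genes, and f3 is left out because its sole ortholog w2 does not pass the cutoff.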

The degree distribution of genes from the two organisms considered in our universal set is shown in Fig. 7.6. The histogram indicates that the majority of genes under consideration for both organisms have exactly one ortholog involved with them. Very few genes have more than 5 orthologs although there are a few outliers (at degree values of 17 and 26) for the two organisms. This initial analysis suggests that we cannot expect many descriptor pairs across the two organisms to have high values for the clustering factor although node coverage is not affected.

For each of the universal sets of genes, we constructed two types of expression-based descriptors. First, genes were bucketed to yield range descriptors of the form ‘expression level ∈ [%x, 0] in time point %y for organism %z’ (for negative %x) and ‘expression level ∈ [0, %x] in time point %y for organism %z’ (for positive %x). This resulted in a total of 184 descriptors for worm and 153 descriptors for fly. Further, k-means clustering was performed using the Genesis software suite ([88]) on our time series data with a setting of 30 clusters for each of the organisms independently. Thus, the total number of descriptors used in this study is 214 for worm and 183 for fly. For the purpose of our study, we restricted decision trees for the two organisms to be of depth 1 (i.e., no derived descriptors).
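The range descriptors can be sketched as follows. The choice of cutoffs, the treatment of interval endpoints as inclusive, the omission of empty descriptors, and the naming scheme are assumptions of this sketch, and the expression values are hypothetical.

```python
def range_descriptors(expression, cutoffs, organism):
    """Build range descriptors per time point.

    expression: gene -> list of values, one per time point
    cutoffs:    nonzero numbers; negative x gives [x, 0], positive x gives [0, x]
    """
    n_points = len(next(iter(expression.values())))
    descriptors = {}
    for t in range(n_points):
        for x in cutoffs:
            low, high = (x, 0) if x < 0 else (0, x)
            name = (f"expression level in [{low}, {high}] "
                    f"in time point {t + 1} for {organism}")
            members = {g for g, p in expression.items() if low <= p[t] <= high}
            if members:  # empty descriptors are dropped in this sketch
                descriptors[name] = members
    return descriptors


expression = {"g1": [-0.5, 1.2], "g2": [-1.5, 0.4], "g3": [0.8, -2.5]}
descriptors = range_descriptors(expression, cutoffs=[-2, -1, 1, 2], organism="worm")
```

As with the cutoff-based descriptors of Sect. 7.1, descriptors built this way for the same time point and sign are nested, since a wider range contains every gene in a narrower one.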

As the first part of our results, we examine the behavior of the two measures we defined earlier with relation to each other. Towards this end, we conducted a run of CARTwheels where we found all redescriptions possible for different values of clustering factor and node coverage. This resulted in a total of 16,617 redescriptions. The values of clustering factor and node coverage for each of these redescriptions were recorded.

Fig. 7.7 shows the results from our study. This plot shows that very few of the redescriptions have a high value of the clustering factor. Also, high values for the clustering factor go hand-in-hand with high values for the node coverage. However, it is possible to have very high values for the node coverage but low values for clustering factor. Many of the redescriptions up to about a node coverage value of 0.4 to 0.5 have very low values for the clustering factor. On the basis of this plot, we utilize a cutoff of 0.6 for node coverage and 0.1 for clustering factor to obtain reasonably well connected redescriptions. For this choice of our measures, we obtained a total of 151 redescriptions. We discuss two of the redescriptions found in our analysis in detail.

Figure 7.7: Behavior of clustering factor and node coverage for all redescriptions possible using descriptors from worm and fly for our cross-genomic expression data analysis.

The first redescription R1CG is shown in Fig. 7.8. This redescription involves 1 gene in fly related to 3 genes in worm. The clustering factor in this case is 0.75 and the node coverage is 0.8. This redescription relates a mildly down-expressed gene (FBgn0001224) in fly, 64 hours after applying heat stress, to three highly down-expressed genes at the start of application of heat stress to worm. Interestingly, all three genes in worm that take part in this redescription are structural constituent of eye lens (according to their GO molecular function category). All three are also reported to be induced in response to heat shock or any other environmental stress (from Wormbase). In addition, WBGene00002016 and WBGene00002017 are also involved in determination of adult life span (according to their GO biological process annotation). The FBgn0001224 gene in fly is known to be a heat shock protein (Hsp23) and is involved with the GO molecular function actin binding. The GO biological processes assigned to this gene are defense response, protein folding and response to heat. This redescription provides an example of a set of genes related to each other since they are all involved directly with response to a common stress, namely heat shock.

[Figure 7.8 content: R1CG. Drosophila: foldchange <= -1 for Heat shock 64 hr <=> C. elegans: foldchange <= -4.5 for Heat shock 0 hr. Values shown at the equivalence sign: 0.8 and 0.75. Genes shown: FBgn0001224; WBGene00002016, WBGene00002017, WBGene00002018, WBGene00014192. Axes: Heat shock, 0 hr and Heat shock, 64 hr.]

Figure 7.8: Redescription R1CG for our cross-genomic study. The two expression based descriptors along with the subgraph they induce are shown in the figure. The clustering factor and node coverage lie to the right and left of the equivalence sign respectively.

[Figure 7.9 content: R2CG. Drosophila: Cluster 6 for Heat shock <=> C. elegans: Cluster 5 for Heat shock. Values shown at the equivalence sign: 0.71 and 0.18. Genes shown: FBgn0032061, FBgn0037879, FBgn0029752, FBgn0035491; WBGene00000830, WBGene00008028, WBGene00008625, WBGene00010228, WBGene00011674, WBGene00012816, WBGene00013972, WBGene00014254, WBGene00015062, WBGene00015246. Panels: Drosophila Heat Shock, Expression Value against Time Point (1 to 9); C. elegans Heat Shock, Expression Value against Time Point (1 to 7).]

Figure 7.9: Redescription R2CG for our cross-genomic study. The two clustering based descriptors along with the subgraph they induce are shown in the figure. The clustering factor and node coverage are below and above the equivalence sign respectively.

The second redescription we discuss (R2CG) is shown in Fig. 7.9. This redescription relates a cluster obtained from gene expression profiles in the fly heat shock experiment to a cluster from the worm heat shock experiment. These clusters contain 4 and 10 genes respectively. Out of these, 3 genes in fly and 7 genes in worm take part in the redescription with 7 orthologous connections between them. This results in a clustering factor of 0.18 and a node coverage of 0.71. The pair of genes FBgn0032061 and WBGene00000830 related to each other are well annotated and have very similar annotations. These two genes are involved in the GO molecular function of catalase activity and take part in electron transport and response to oxidative stress biological processes (GO). Again, the pair of genes FBgn0029752 and WBGene00015062 in our study are well annotated. Both these genes are involved in the molecular function of protein disulfide oxidoreductase activity (GO). However, WBGene00015062 is involved with the glycerol ether metabolic process while FBgn0029752 is involved with sulfur metabolism. The other related set of genes FBgn0037879 in fly and WBGene00008028, WBGene00008625, WBGene00012816, WBGene00013972, and WBGene00015246 in worm are not annotated very well. All these genes are found in the extracellular region and are defense-related proteins containing the SCP domain.

Figure 7.10: The degree distribution for transcription factors and the regulated genes for our choice of universal set.

Both our examples so far involve only many-one relationships; our next case study presents redescriptions involving true many-many relationships.

Gene expression data analysis using transcriptional regulation

In this study, we aim to use information about transcriptional regulation of genes in yeast (Saccharomyces cerevisiae) to relate transcription factors and the genes that they regulate. This is a many-to-many scenario since one gene could be regulated by many transcription factors and any transcription factor could regulate more than one gene. Notice that transcription factors are themselves modeled as gene products and therefore the universal set of transcription factors is a subset of the set of all genes. It is also important to realize that there is no one-to-one correspondence between a transcription factor and itself in the set of genes since a transcription factor does not regulate itself unless explicitly specified.

We used gene expression results from the desiccation-rehydration experiment we considered earlier (see Chap. 4). This experiment involves a seven time point study of the response of yeast to desiccation and rehydration. Here, we defined our universal set to be the set of genes up or down-expressed more than two-fold in any of the time points in our experiment and the transcription factors that regulate them [30] (irrespective of their expression values). This resulted in a total of 117 transcription factors, 863 genes and 1827 regulatory relationships between them.

Figure 7.11: Behavior of clustering factor and node coverage for all redescriptions possible using descriptors defined over transcription factors and regulated genes in yeast.

The degree distribution of transcription factors and regulated genes in our universal set is shown in Fig. 7.10. This plot is quite different from the ones seen in the previous study as there is a high number of transcription factors with a high degree value. However, this should not be surprising since the number of transcription factors considered is much smaller than the size of the regulated genes set (about 1/8th). Therefore, each transcription factor can be expected to regulate a large number of genes. The maximum degree for a transcription factor is 113. In contrast to the transcription factors, the set of regulated genes shows a similar gradient of frequency distribution with increase in degree as seen in the previous study. Again, the degree distribution of regulated genes suggests that we cannot expect many descriptor pairs across the two universal sets to have high values for the clustering factor.

Our descriptor set for this study consists of only the results from k-means clustering of the temporal expression data considered. For the yeast transcription factors, we used settings of 20 and 40 clusters, respectively, in two independent runs of the Genesis software. For the regulated genes universal set, we used settings of 50 and 100 clusters in two separate runs. This resulted in a


[Figure 7.12 (image omitted). Redescription R1TF: Transcription Factors: Cluster 25/40 <=> Regulated Genes: Cluster 31/100, annotated with the values 0.75 and 0.40. Panels: transcription factor expression value vs. desiccation-rehydration time point (2-7); regulated gene expression value vs. desiccation-rehydration time point (2-7). Node labels: SWI5, CYB2, PMA2, YML131W, IME1, ROX1, CIN5, MSN4.]

Figure 7.12: Redescription R1TF for our transcriptional regulation study. The two clusters (descriptors) along with the subgraph they induce are shown in the figure. The clustering factor and node coverage are below and above the equivalence sign respectively.

total of 210 descriptors in our study. We again set the maximum depth of trees in CARTwheels to 1, since we do not need to construct derived descriptors from the clusters provided as input.

As in the previous study, we begin with an examination of the behavior of our two measures in relation to each other for this new universal set. We conducted a run of CARTwheels in which we found all redescriptions possible for any values of clustering factor and node coverage. This resulted in a total of 3143 redescriptions. The values of clustering factor and node coverage for each of these redescriptions were recorded.

Fig. 7.11 shows the results from our study. This plot is almost identical to the earlier plot. Very few of the redescriptions have a high value of the clustering factor. Many of the redescriptions up to a node coverage value of about 0.4 to 0.5 have very low values for the clustering factor. In this case, we use a cutoff of 0.5 for node coverage and 0.1 for clustering factor to obtain reasonably well connected redescriptions. For this choice of cutoffs, we obtained a total of 109 redescriptions. We discuss two of the redescriptions found in our analysis.

The first redescription, R1TF, is shown in Fig. 7.12. This redescription relates a cluster in the transcription factor domain to one in the regulated genes domain. It can be seen from the behavior of genes in the cluster that the transcription factors stabilize at about −3 fold-change after the rehydration occurs at time point 3. At the same time, the regulated genes that were about 3 fold expressed at time point 2 start to show reduced expression with the onset of rehydration. Eventually, at time points 6 and 7 there is no gene expression for any of these genes. In the redescription between the two clusters, the relationship between them is defined using 3 transcription factors and


[Figure 7.13 (image omitted). Redescription R2TF: Transcription Factors: Cluster 15/20 <=> Regulated Genes: Cluster 47/50, annotated with the values 0.56 and 0.11. Panels: two plots of expression value vs. desiccation-rehydration time point (2-7), one of them titled "Regulated genes". Node labels: DAL81, YER066C-A, ERG11, YLR112W, RTA1, FHL1, CAD1, GTS1, INO4, MRS3, HMS2, RPL15A, YLR065C, RPS31, RPS29A, RPL31B, RPL20A, YMR173W-A.]

Figure 7.13: Redescription R2TF for our transcriptional regulation study. The two clusters (descriptors) along with the subgraph they induce are shown in the figure. The clustering factor and node coverage are below and above the equivalence sign respectively.


3 genes that they regulate. Interestingly, two of the genes (YML131W and ROX1) are coregulated by more than one transcription factor in the cluster considered. Of the three regulated genes under consideration, IME1 is a meiosis-inducing protein and is involved in meiotic gene expression. ROX1 is a site-specific DNA binding protein that acts as a repressor. The gene YML131W is a putative protein of unknown function with similarity to oxidoreductases. Its involvement in a redescription such as this could provide pointers towards its functionality.

The second redescription, R2TF, is shown in Fig. 7.13. This redescription relates another cluster in the transcription factor domain to one in the regulated genes domain. The connection between the two descriptors is made using 2 transcription factors and 8 regulated genes. The transcription factors in the cluster seem to be mildly up-expressed during the desiccation phase but are a little more abundantly expressed after rehydration. The change in their behavior mirrors the change in the regulated gene expression values. Here, genes that are highly down-expressed during the desiccation stage slowly return to no change in expression by about the last time point after rehydration. Interestingly, one of the transcription factors (FHL1) regulates 7 of the genes taking part in our redescription. The other transcription factor, DAL81, regulates a hypothetical protein, YLR112W. Five of the 7 genes regulated by FHL1 are listed as ribosomal proteins (RPL15A, RPL20A, RPL31B, RPS29A, and RPS31). HMS2 is a heat shock transcription factor homolog while RTA1 is involved in 7-aminocholesterol resistance. Another interesting aspect of this redescription is that 2 out of the 5 genes not taking part in the redescription are hypothetical proteins and 1 corresponds to a questionable ORF.

This study concludes our section on mining redescriptions for many-to-many relationships across entities. The two measures we have introduced in this section appear to work well in unison and help us extract biologically useful results. However, it is important to have an understanding of the measure cutoffs we should use given the set of connections available to us.

7.3 Constrained Storytelling


Consider an ER diagram representation of some downstream gene level responses to stress shown in Fig. 7.14. Here, the sequence of events is in the counter-clockwise direction starting from the entity Transcription factor. A critical application of inter-domain redescription mining is its potential use in traversing an entire entity-relationship diagram such as the one shown through a sequence of redescriptions. This could bring redescription into a powerful framework of its own, with capabilities akin to ILP. In addition, if redescriptions can be mined across different entities in an entity-relationship diagram, they provide a way of constructing a story, a method that can be termed constrained storytelling, where the structure of the story is fixed (e.g., slots, roles, and parts) but not the participants.


Relating entities between domains (like genes and proteins) is very reminiscent of the problems tackled in the schema matching literature. As mentioned earlier, in schema matching, the Match operator needs to be learned using a variety of approaches. The operator thus learned can be


[Figure 7.14 (image omitted). ER diagram. Entities: Transcription Factor, Response element, Gene expression, Mobilization of defense mechanism, Feedback molecules. Relationship labels: bind to, cause, leads to, causes build-up of, could repress. Attribute labels: Sequence, Location, Name, Type, Gene name, Level, Action.]

Figure 7.14: An ER diagram depicting the flow of data starting from transcription factors to feedback molecules.

applied in a variety of ways to relate items in the two domains. The similarity derived between two schemas could be based on the sameness of data types or primary or foreign key relationships that could have different cardinality. The use of schema matching has, however, not been extended to build stories.

For each entity in an ER diagram, we use relevant attributes of the entity to define descriptors. (Although we can use relationships to define descriptors as well, we prefer to reserve them for navigating to different object sets in stories.) We further assume a relationship mapping table between directly connected entities. We can exploit the mapping between any pair of such entities to find redescriptions using the one-to-one or many-to-many measures, as the case may be. We can now consider each descriptor taking part in a redescription as a node in a graph where edges are defined using redescription relationships between descriptors. It is important to note that these descriptors will be shared across redescription classes if an entity is connected to more than one other entity. Thus, any traversal of the ER diagram can be thought of as a path in the graph from a descriptor defined for a starting entity to any descriptor for the final entity.

To find paths in the graph of redescriptions available to us, we use a simple depth-first search (DFS) beginning from any descriptor corresponding to the first entity. All paths encountered during a DFS performed on the graph that span the start and end entities correspond to the constrained stories we seek. Thus, finding constrained stories is a two-stage process: in the first step we use CARTwheels to construct the nodes and edges of our redescription graph, which is used in the second step to perform a depth-first search. Although not attempted here, we can also utilize the A* heuristic from our propositional storytelling implementation by suitably defining measures on the quality of the path.
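The second stage of this process can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the redescriptions are assumed to have been mined already and are passed in as descriptor pairs, each descriptor is assumed to be tagged with its entity, and the names `find_stories`, `edges`, `entity_of`, and the toy descriptors are hypothetical.

```python
from collections import defaultdict

def find_stories(edges, entity_of, entity_order):
    """Enumerate paths that visit one descriptor per entity, in order.

    edges        : iterable of (descriptor_a, descriptor_b) redescriptions
    entity_of    : dict mapping each descriptor to its entity name
    entity_order : list of entity names fixing the structure of the story
    """
    # Build an undirected adjacency list over descriptors.
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    stories = []

    def dfs(path):
        depth = len(path)
        if depth == len(entity_order):
            stories.append(list(path))  # path spans the start and end entities
            return
        wanted = entity_order[depth]
        for nxt in sorted(neighbours[path[-1]]):
            # Only step to descriptors of the next entity in the story.
            if entity_of[nxt] == wanted and nxt not in path:
                path.append(nxt)
                dfs(path)
                path.pop()

    starts = sorted(d for d in neighbours if entity_of[d] == entity_order[0])
    for start in starts:
        dfs([start])
    return stories

# Toy redescription graph (hypothetical descriptor names).
edges = [("TF:c1", "Gene:c7"), ("TF:c2", "Gene:c7"),
         ("Gene:c7", "GO:x"), ("Gene:c9", "GO:y")]
entity_of = {"TF:c1": "TF", "TF:c2": "TF", "Gene:c7": "Gene",
             "Gene:c9": "Gene", "GO:x": "GO", "GO:y": "GO"}
stories = find_stories(edges, entity_of, ["TF", "Gene", "GO"])
```

In this toy graph only the two paths through `Gene:c7` span all three entities; `Gene:c9` is dropped because no transcription factor descriptor is redescribed to it.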


7.3.3 Results

For this research question, we apply the approach described above on two different structures (ER diagrams) used as models to find structured stories. We again utilize the gene expression data obtained from the yeast desiccation-rehydration experiment for this study.

The universal set that we used is a slight modification of the one used in our transcriptional regulation study. Here, the universal set consists of yeast genes that are up- or down-expressed at least two-fold at one of the six time points (2–7). As before, we define a universal set of transcription factors by using transcriptional regulation relationships to retain all transcription factors that regulate at least one of the highly expressed genes. Out of this set of genes and transcription factors, we eliminate all genes that do not have any GO functional category assigned to them. This constraint results in universal sets of 699 genes and 117 transcription factors respectively.

We used a moving window on the time series data containing three consecutive time points in each window. Thus, for our temporal dataset with six time points (2 to 7), we have a total of four windows possible. For each of these windows, we used the gene expression profiles from the experiment to cluster genes. As before, we used settings of 20 and 40 clusters for the transcription factors and settings of 50 and 100 clusters for the set of regulated genes in k-means clustering. This resulted in a total of 600 descriptors for regulated genes and 240 descriptors for transcription factors. In addition, we used the GO biological process, cellular location, and molecular function categories as descriptors in our study. This resulted in a total of 352 GO descriptors assigned to at least one of the regulated genes or transcription factors.
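The window and descriptor counts quoted above follow from simple arithmetic, which the short sketch below reproduces. The helper name `sliding_windows` is hypothetical; the sketch assumes one descriptor per cluster, per clustering run, per window.

```python
def sliding_windows(time_points, width):
    """Return all runs of `width` consecutive time points."""
    return [tuple(time_points[i:i + width])
            for i in range(len(time_points) - width + 1)]

time_points = [2, 3, 4, 5, 6, 7]            # the six time points used here
windows = sliding_windows(time_points, 3)   # T2-T4, T3-T5, T4-T6, T5-T7

# Two k-means runs per window for each universal set.
gene_descriptors = len(windows) * (50 + 100)
tf_descriptors = len(windows) * (20 + 40)
print(len(windows), gene_descriptors, tf_descriptors)  # 4 600 240
```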

For our first analysis, we performed a simple redescription analysis across each of the time windows keeping our universal set as the set of regulated genes. Thus, we have a total of 150 descriptors for each time window and we tried to mine redescriptions across successive time windows. The goal here is to traverse the four time windows as a sequence of redescriptions to study small additions/subtractions that might occur to the clusters across the time windows. We used a Jaccard’s coefficient threshold of 0.6 and set the depth of decision trees constructed to 1. After mining redescriptions for successive time windows, we performed a DFS on the graph obtained to find a total of 31 sequences of redescriptions that span the 4 time windows.

An example of such a sequence (or structured story) is shown in Fig. 7.15. The story R1CS relates genes that are highly up-expressed at almost all time points in the yeast desiccation-rehydration experiment. The four clusters related through this story have 11 genes common to all of them. For the first redescription, of the 5 genes in the first cluster that do not take part in the redescription, one (YJL045W) makes a reappearance in the last cluster found. There is only 1 gene (SPS100) in common between the first two clusters apart from the 11 that are common to all.

Interestingly, the cluster found for time window T3-T5 contains exactly the same genes as the cluster for window T4-T6. In fact, only 1 gene (SPS100) out of the 16 genes in the clusters from T3-T5 and T4-T6 is not present in the cluster for T5-T7. This suggests that the gene SPS100, which is highly expressed from before rehydration through the initial phases of rehydration, slightly drops off in expression with increasing time.

The last cluster found (T5-T7 time window) contains the gene ADH2, which was actually unexpressed during the desiccation and initial rehydration phase (T2, T3 time points). However, this gene begins to be strongly up-expressed after the rehydration stage, joining the set of high-expressors in the time window T5-T7. ADH2 can thus be seen as a gene that is strongly affected by


Figure 7.15: Constrained story R1CS showing clusters from gene expression data considered in a sequential order of four temporal windows. The 11 genes that are common to all four clusters are listed inside the box in the middle of the figure. Genes listed along the redescription symbol correspond to the genes shared by that particular redescription. Genes that belong to a cluster but do not take part in any redescription are shown above (for the top clusters) or below (for the bottom clusters) them.


[Figure 7.16 (image omitted). Top: the model used, Transcription Factors (regulate) Genes (belong to) GO Category. Bottom: story R2CS: TF, Time window T3-T5: Cluster 26/40 <=> Regulated Genes, Time window T3-T5: Cluster 53/100 <=> GO Bio 282: bud site selection. Values shown on the arrows: 0.30, 0.43 and 0.57. Panels: expression value vs. time point (3-5) for the transcription factor cluster and for the regulated genes cluster. Node labels: SWI4, MBP1, BUD9, RSR1, FKS1, BUD7, HEM12. Genes listed above the GO category: AXL2, BUD28.]

Figure 7.16: (top) The model used to mine the constrained story R2CS. (bottom) The constrained story R2CS showing a cluster from transcription factor expression profiles for time window T3-T5. This cluster is related to a cluster, obtained for the same time window T3-T5, that contains three genes that it regulates. For the many-to-many redescription here, the number above the arrow is the clustering factor and the number below the arrow is the node coverage. In the last step of the story, the gene expression cluster is redescribed to the biological process GO category: bud site selection. Genes that are a part of this cluster and are contained in the bud site selection GO category are shaded in green. Genes that belong to the GO category bud site selection but do not take part in the redescription are listed above it.

the desiccation-rehydration process. It is important to mention here that even though many of the results in a study like this could be seen to mirror the results that would be obtained from a clustering technique across all time windows, genes like ADH2 and their interesting behavior cannot be singled out using simple clustering techniques.

For our second result, we modeled the data/entities available to us as shown in Fig. 7.16 (top). Our aim here is to find a structured story across different types of entities (transcription factors and regulated genes) which themselves can be redescribed to useful GO categories. This study builds upon our earlier example that used this dataset by introducing the constraint of similarity with GO categories for the expression based descriptors. The two types of redescriptions required in this study were mined by using two distinct setups for CARTwheels. To find redescriptions between transcription factor and regulated genes expression clusters, we went back to our similarity measures for the many-to-many case and used a setting of 0.5 for the node coverage and 0.1 for the clustering factor. As is evident from the descriptors available to us, we used a maximum depth


of 1 for all trees constructed in the CARTwheels process.

For the second step in this study, we used a maximum depth of 2 for descriptors involving GO categories and 1 for cluster descriptors for genes. We also disallowed the formation of disjunctions in derived descriptors redescribed to the clusters, for ease of interpretation. The Jaccard’s coefficient threshold for this study was set to 0.4 and we required that the support of the redescriptions found be at least 3.

Results from these two steps were combined using DFS to obtain a total of 102 constrained stories. An example of a story found in this analysis is shown in Fig. 7.16 (bottom). Story R2CS first relates the transcription factor SWI4, which is mildly up-expressed in the time window T3-T5 with a definite trend towards the unexpressed state (which it reaches at time point T6), to the genes that it regulates. These genes (BUD9, RSR1, and FKS1) belong to a cluster defined for the same time window T3-T5. They are highly down-expressed at T3 but, after the onset of rehydration, they trend towards relatively lower levels of down-expression. This trend continues beyond time point T5 as well. The second step of our story relates the gene expression cluster under consideration to the biological process GO category ‘bud site selection.’ This information, together with the first redescription, can be used to reason about the behavior of the genes considered in this redescription. Since three out of the five genes in the second descriptor are involved in the selection of a budding site for yeast, they can be expected to be strongly down-expressed when a stress such as desiccation is applied to the yeast cells. Under such conditions, SWI4 can be seen to be contributing to the inhibition of genes that are involved in bud site selection. However, upon rehydration, the yeast cells can start the budding process once again. Thus, the expression level for SWI4 begins to reduce post-rehydration and the inhibited genes start to reach favorable expression states to carry out their designated activities.

7.4 Cross-taxonomic Comparisons using Redescriptions


Assume that we are provided with two families of functional annotations or ontologies, E and F, over the same space of objects (e.g., genes). The objective is to conduct an all-pairs redescription study relating categories or concepts between E and F. From the results of such a study, if e1 ∈ E is redescribed to f2 ∈ F with a very high Jaccard’s coefficient, we could help impute annotations and properties typically associated with e1 to f2 (and vice versa). The results of such a study can then be used for functional enrichment of unclassified genes, to analyze the structural consistencies (and inconsistencies) of different ontologies, and in general as an educational tool to communicate similarities and differences across taxonomies. Finally, when the ontologies apply to multiple organisms, we can study the extent to which redescriptions transfer across organisms and whether some organisms have more developed ontologies than others.
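For the simplest case, in which no derived descriptors are formed, the all-pairs study reduces to comparing every category in E with every category in F. The sketch below illustrates this under the assumption that each category is available as a set of gene identifiers; the function names and the toy annotations are hypothetical, and the tree-based search performed by CARTwheels is not reproduced.

```python
def jaccard(a, b):
    """Jaccard's coefficient |a ∩ b| / |a ∪ b| of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def all_pairs(family_e, family_f, threshold):
    """Yield (e, f, J) for every cross-family pair with J >= threshold."""
    for e_name, e_genes in sorted(family_e.items()):
        for f_name, f_genes in sorted(family_f.items()):
            j = jaccard(e_genes, f_genes)
            if j >= threshold:
                yield e_name, f_name, j

# Toy annotations; gene and category names are made up.
E = {"e1": {"g1", "g2", "g3"}, "e2": {"g4", "g5"}}
F = {"f1": {"g4", "g5", "g6"}, "f2": {"g1", "g2", "g3"}}
hits = list(all_pairs(E, F, threshold=0.6))
```

Here `e1` and `f2` cover exactly the same genes (coefficient 1), while `e2` and `f1` share two of three genes (coefficient 2/3); both pairs pass a threshold of 0.6.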

7.4.2 Results

We conducted a cross-taxonomic study using the GO biological process (GO BIO), GO cellular component (GO CEL), and GO molecular function (GO MOL) assignments available for the Arabidopsis thaliana (arabidopsis), Drosophila melanogaster (fly), Homo sapiens (human), Mus


Table 7.1: Summary of input GO categories for the 6 species considered.

                          Arabidopsis  Fly       Human     Mouse     Worm      Yeast
Data dated                07/13/04     12/23/04  12/22/04  12/22/04  12/13/04  12/29/04
Universal set size        13572        8911      23424     25142     11606     5731
Genes with BIO defined    6340         7424      18068     18193     9299      4711
Genes with CEL defined    3114         4131      16002     17362     5179      5713
Genes with MOL defined    12817        7606      21135     21887     8975      5714
BIO categories involved   1043         2493      2837      2774      1361      1691
CEL categories involved   205          530       543       496       261       424
MOL categories involved   1212         2013      2516      2230      1062      1470

[Figure 7.17 (image omitted). Venn diagrams, each annotated with the value 1.00:
(a) Fly: GO BIO 51013: microtubule severing <=> GO CEL 8352: katanin
(b) Worm: GO BIO 6415: translational termination AND GO CEL 5737: cytoplasm <=> GO MOL 3747: translation release factor activity
(c) Arabidopsis: GO BIO 6415: translational termination AND GO CEL 8372: cellular component unknown <=> GO MOL 3747: translation release factor activity]

Figure 7.17: Examples of cross-taxonomic redescription: (a) A redescription between two functional categories for the Fly genome, (b) A redescription involving an intersection between two categories for the Worm genome, (c) A redescription involving the GO CEL category ‘cellular component unknown’ for the Arabidopsis genome. This redescription relates to the one in (b).

musculus (mouse), Caenorhabditis elegans (worm), and Saccharomyces cerevisiae (yeast) genomes from the GO database website (http://www.geneontology.org/GO.current.annotations.shtml). The GO hierarchy information used for propagation of GO categories up the GO tree was also taken from the same website. For each organism, only those genes were considered that have at least one GO category defined, other than the categories for unknown GO BIO, GO MOL, and GO CEL. The summary of the data used is provided in Table 7.1.

The experiment using the GO assignments for genes in each genome was performed as follows. The universal set of genes was defined as above. Within this universal set, the three families of GO categories were used individually as well as in pairs to form input sets of descriptors. In all the runs, the descriptor family for which redescriptions were sought was used to build one-level trees. Two-level trees were used for each pair of descriptor families used to construct derived descriptors. Thus, if a redescription was sought between the GO BIO categories on one side and combinations of GO CEL and GO MOL categories on the other, the study was done using a one-level tree for the GO BIO categories and up to a 2-level tree for GO MOL and GO CEL categories. We also restricted all derived descriptors to involve just intersections and differences between descriptors. The support threshold was set at 3 to retain only the most significant redescriptions. The Jaccard’s coefficient threshold was set at 0.5.
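To make the setup concrete, the sketch below enumerates derived descriptors built from one category of each of two families (intersections and differences only) and compares them with single categories of the third family. It is a brute-force stand-in written for illustration, not the tree-based CARTwheels search, and it assumes that support means the number of genes shared by the two sides. The function names are hypothetical, the gene identifiers are invented, and the category names are borrowed from Fig. 7.17(b).

```python
from itertools import product

def derived(left, right):
    """Intersections and differences of one descriptor from each family."""
    for (ln, lg), (rn, rg) in product(sorted(left.items()), sorted(right.items())):
        yield f"{ln} AND {rn}", lg & rg
        yield f"{ln} - {rn}", lg - rg
        yield f"{rn} - {ln}", rg - lg

def mine(target, left, right, min_support=3, min_jaccard=0.5):
    """Brute-force stand-in for the tree-based search (not CARTwheels)."""
    found = []
    for expr, genes in derived(left, right):
        for t_name, t_genes in sorted(target.items()):
            common = genes & t_genes
            union = genes | t_genes
            # Support is taken here as the number of shared genes (assumption).
            if union and len(common) >= min_support:
                j = len(common) / len(union)
                if j >= min_jaccard:
                    found.append((expr, t_name, j))
    return found

# Toy data: gene identifiers are invented for illustration.
bio = {"translational termination": {"g1", "g2", "g3", "g4"}}
cel = {"cytoplasm": {"g1", "g2", "g3", "g9"}}
mol = {"translation release factor activity": {"g1", "g2", "g3"}}
found = mine(mol, bio, cel)
```

With these toy sets only the intersection of the two categories matches the target, with 3 shared genes and a coefficient of 1; the two differences share no genes with it and are discarded by the support threshold.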

Fig. 7.17 shows a few examples of redescriptions mined using CARTwheels. Fig. 7.17(a) shows a simple redescription between a GO BIO (microtubule severing) and GO CEL (katanin) category. This redescription holds for the Fly genome with a Jaccard’s coefficient of 1 and involves 4 genes. This type of redescription can be easily used to relate functional enrichments across different taxonomies as described earlier. Fig. 7.17(b) shows a redescription involving a derived


Table 7.2: Summary of redescriptions obtained for the 6 species. Numbers in brackets indicate the number of redescriptions with no derived descriptors.

                          Arabidopsis   Fly           Human         Mouse         Worm          Yeast
BIO categories involved   259 (244)     169 (138)     389 (915)     375 (314)     257 (239)     260 (224)
CEL categories involved   50 (40)       146 (92)      102 (71)      94 (70)       70 (58)       149 (103)
MOL categories involved   176 (140)     230 (149)     369 (237)     358 (241)     271 (217)     205 (162)
BIO redescriptions        6852 (469)    15483 (324)   43828 (589)   41969 (622)   37174 (713)   23473 (513)
CEL redescriptions        4971 (139)    12567 (207)   22046 (163)   15531 (147)   11280 (178)   43293 (329)
MOL redescriptions        9352 (408)    39788 (363)   68765 (582)   66920 (581)   52445 (711)   20163 (388)

descriptor formed by the intersection of a GO BIO and a GO CEL category which relates to a GO MOL category. This redescription holds for the Worm genome with a Jaccard’s coefficient of 1 and involves 3 genes. Fig. 7.17(c) shows a redescription for the Arabidopsis genome where the GO BIO and GO MOL categories involved are the same as in Fig. 7.17(b). The difference here is that GO CEL is assigned the GO category ‘cellular component unknown.’ This redescription also holds with a Jaccard’s coefficient of 1 and involves 11 genes. The pair of redescriptions found could potentially be used to better characterize the GO CEL categorization for the genes involved for the Arabidopsis genome.

Table 7.2 summarizes the number of GO categories for which redescriptions are available and the number of redescriptions mined for each of the six species. In all cases, the use of derived descriptors (intersection and difference based) results in a significant increase in the number of categories involved in at least one redescription. Also, as is to be expected, a much higher number of redescriptions are found for genomes that have more categories involved with the genes (giving more descriptors to form derived descriptors with). Comparing Table 7.1 and Table 7.2, we can conclude that a large proportion of the functional categories have redescriptions associated with them for all genomes.

Cross-genomic comparisons using redescriptions

Redescriptions found in the cross-taxonomic study described above can be used to validate and check the consistency of GO category assignments across different genomes. For this analysis, we conducted a pairwise comparison of redescriptions found for two different species and checked for overlap (same descriptors involved). Importantly, we did not require that the support or Jaccard’s coefficient be the same for the same redescription across a pair of species. The overlap observed is summarized in Table 7.3. It is important to note here that the results shown do not take any dependencies between GO categories across multiple species into consideration. Therefore, some of the conserved redescriptions found could be a result of using orthologous relationships between genes from two organisms for assigning GO categories. In a future study, we could eliminate such GO category assignments for genes and conduct a similar cross-genomic comparison with the remainder.
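The pairwise comparison can be sketched as a set intersection over redescription keys. The sketch assumes that each redescription is stored under a key built from the descriptor expressions on its two sides, written in a consistent order, with its support and Jaccard's coefficient kept as the value and ignored when matching. The function name and the placeholder category identifiers are hypothetical.

```python
from itertools import combinations

def overlap(species_a, species_b):
    """Redescriptions present in both species, matched on descriptors only.

    Each argument maps a key (lhs, rhs), where each side is a string
    expression over GO categories, to its (support, jaccard) pair.
    Support and Jaccard values are deliberately ignored when matching.
    """
    return set(species_a) & set(species_b)

# Placeholder category identifiers, for illustration only.
worm = {("GO:A AND GO:B", "GO:C"): (3, 1.0), ("GO:D", "GO:E"): (5, 0.8)}
human = {("GO:A AND GO:B", "GO:C"): (4, 1.0), ("GO:F", "GO:G"): (7, 0.6)}
shared = overlap(worm, human)

# Pairwise overlap counts, in the spirit of Table 7.3.
species = {"worm": worm, "human": human}
table = {(a, b): len(overlap(species[a], species[b]))
         for a, b in combinations(sorted(species), 2)}
```

The shared redescription is found even though its support differs between the two toy species (3 genes versus 4).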

The species with large numbers of redescriptions (mouse and human) have high overlaps with each other as well as with other species. The fly and arabidopsis redescriptions show the minimum overlap. This is a result of the low number of descriptors available and redescriptions found for these species. As would be expected, arabidopsis and yeast, which differ most drastically from the other 4 species, show a low amount of overlap.

Fig. 7.18 shows 3 examples of redescriptions found to be common for various species for a Jaccard’s threshold of 1. Fig. 7.18(a) shows a redescription common to the yeast and worm genomes. This redescription involves an intersection between a GO MOL and a GO CEL category related to a GO BIO category. It involves 12 genes in the yeast genome and 4 genes in the worm genome. Fig. 7.18(b) shows a redescription common to the worm and human genomes. This redescription again involves an intersection with the GO CEL category ‘cellular component unknown’ that is conserved across the two species. It involves 4 genes in the human genome and 3 genes in the worm genome. Fig. 7.18(c) shows a redescription common to the human, mouse, and worm genomes. It involves the intersection between a GO MOL and GO CEL category related to a GO BIO category. This redescription involves 45 genes for human, 30 genes for mouse, and 16 genes for the worm genome. Out of the redescription counts shown in Table 7.3, 920 redescriptions involving 15 categories were found for all 6 species. These 15 categories are listed in Table 7.4. All these categories lie quite high in the GO hierarchy and involve a lot of genes. Not surprisingly, there is no example of a redescription involving a very specific and precise functional category that could be found to be conserved across the six species.

Table 7.3: Pairwise overlap between redescriptions obtained for the 6 species. The numbers in brackets indicate the number of distinct descriptors involved.

              Arabidopsis   Fly           Human         Mouse         Worm          Yeast
Arabidopsis   -             2744 (48)     4550 (129)    4383 (120)    4133 (124)    4824 (59)
Fly           2744 (48)     -             13306 (123)   11198 (94)    8237 (96)     5561 (117)
Human         4550 (129)    13306 (123)   -             59674 (475)   29912 (291)   9871 (116)
Mouse         4383 (120)    11198 (94)    59674 (475)   -             26884 (282)   5278 (91)
Worm          4133 (124)    8237 (96)     29912 (291)   26884 (282)   -             4555 (86)
Yeast         4824 (59)     5561 (117)    9871 (116)    5278 (91)     4555 (86)     -

[Figure 7.18 (image omitted). Venn diagrams, each annotated with the value 1.00:
(a) Yeast, Worm: protein-nucleus import, docking; protein transporter activity; nuclear pore
(b) Worm, Human: glycylpeptide N-tetradecanoyltransferase activity; N-terminal protein myristoylation; cellular component unknown
(c) Human, Mouse, Worm: microtubule polymerization; GTPase activity; microtubule cytoskeleton]

Figure 7.18: Examples of redescriptions that hold with a Jaccard’s coefficient of 1 for more than 1 species: (a) A redescription common to the yeast and worm genomes, (b) A redescription common to the worm and human genomes, (c) A redescription common to the human, mouse, and worm genomes.

7.5 Shattering


In this research task, we use redescription analysis to define a new measure of similarity between the underlying objects. Specifically, we find gene pairs or groups of genes that are frequently redescribed together, in a given vocabulary of descriptors. This problem of finding ‘similar’ genes helps generalize similarly motivated ideas in the literature, ranging from gene or protein sequence


Table 7.4: GO categories involved in redescriptions mined for all 6 species.

GO category   Type   Description
GO:0003735    MOL    structural constituent of ribosome
GO:0004672    MOL    protein kinase activity
GO:0004674    MOL    protein serine/threonine kinase activity
GO:0004812    MOL    tRNA ligase activity
GO:0005840    CEL    ribosome
GO:0006413    BIO    translational initiation
GO:0006418    BIO    tRNA aminoacylation for protein translation
GO:0006468    BIO    protein amino acid phosphorylation
GO:0008452    MOL    RNA ligase activity
GO:0016310    BIO    phosphorylation
GO:0016875    MOL    ligase activity, forming carbon-oxygen bonds
GO:0016876    MOL    ligase activity, forming aminoacyl-tRNA and related compounds
GO:0016886    MOL    ligase activity, forming phosphoric ester bonds
GO:0043038    BIO    amino acid activation
GO:0043039    BIO    tRNA aminoacylation

[Figure 7.19 (image omitted). Three Venn diagrams over f1 and f2, with the shaded region labeled (a) {a1}, (b) {a2}, (c) {a1, a2, a3}.]

Figure 7.19: Shattering using two descriptors f1 = {a1, a2} and f2 = {a1, a3}. (a) f1 ∩ f2 shatters f1 and separates a1 from a2. It also shatters f2 and separates a1 from a3. (b) f1 − f2 shatters f1 and separates a1 from a2. (c) f1 ∪ f2 brings together a1 and a3. The element(s) shown at the top of each frame is the set corresponding to the shaded region in the corresponding Venn diagram.

similarity to the use of the GO hierarchical structure for this purpose [45].


The idea in this study is to see how boolean expressions involving multiple descriptors help potentially shatter the contents of the original descriptors. As a result of this shattering, some genes that were together under one descriptor would no longer be grouped together. Fig. 7.19 shows some motivating examples for two descriptors f1 = {a1, a2} and f2 = {a1, a3}. The intersection between f1 and f2 in Fig. 7.19(a) shatters f1 and separates a1 from a2. It also shatters f2 and separates a1 from a3. Similarly, the difference between f1 and f2 in Fig. 7.19(b) shatters f1 and separates a1 from a2. The derived descriptor based on the union of f1 and f2 in Fig. 7.19(c) actually brings together a1 and a3.
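The three cases of Fig. 7.19 can be checked directly with set operations. In the sketch below, the helper `separated` is a hypothetical name: it reports which pairs of an original descriptor end up on opposite sides of a derived descriptor.

```python
f1 = {"a1", "a2"}
f2 = {"a1", "a3"}

def separated(original, derived):
    """Pairs from `original` that `derived` splits (one inside, one outside)."""
    return {frozenset((x, y)) for x in original for y in original
            if x < y and ((x in derived) != (y in derived))}

by_intersection_f1 = separated(f1, f1 & f2)  # a1 kept, a2 dropped
by_intersection_f2 = separated(f2, f1 & f2)  # a1 kept, a3 dropped
by_difference_f1 = separated(f1, f1 - f2)    # a2 kept, a1 dropped
by_union_f1 = separated(f1, f1 | f2)         # nothing is split
union = f1 | f2                              # all three elements together
```

The intersection and the difference each split a pair of an original descriptor, whereas the union splits none and groups all three elements.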

CARTwheels, through its ability to construct derived descriptors, shatters the contents of descriptors at each alternation. We can define the frequency of occurrence of a gene as the number of redescriptions that it takes part in. Analogously, the co-occurrence frequency of a gene pair is


Figure 7.20: Relationship between Shatterability threshold value and Shatter Index. Notice that we did not connect the last point (1, 1) to our main plot. This is because there is a discontinuity at this point: the Shatter Index jumps from 0.18 to 1 when the Shatterability threshold is set to 1.

the number of redescriptions in which the two appear together. On the basis of the example con-sidered, it is easy to see that the co-occurrence frequency of a gene pair among the redescriptionsmined can be expected to be not greater than the frequencies of occurrence of the individual geneseven if the gene pair is a part of the same descriptor.

For cases like the one considered above, the more often a gene pair is found to co-occur, the more closely the two genes can be expected to be related to each other (functionally as well as otherwise). In other words, the more easily a gene pair can be shattered, the more different the genes in the pair are likely to be.

To quantify the extent of this shattering, we introduce a new measure called Shatterability (S). Given that the frequency of a gene pair {a1, a2} in the redescriptions mined is z1,2, the frequency of gene a1 is y1, and the frequency of gene a2 is y2, the Shatterability for the gene pair (S1,2) is defined as 1 minus the Jaccard's coefficient derived from the frequencies, as shown in Equation 7.3. Thus a Shatterability value of 0 implies that the corresponding gene pair could not be shattered at all.

\[
S_{1,2} = 1 - \frac{z_{1,2}}{y_1 + y_2 - z_{1,2}} \tag{7.3}
\]
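Equation 7.3 can be evaluated directly from a list of mined redescriptions. The sketch below assumes a hypothetical input format (one set of gene names per redescription) and a hypothetical function name `shatterability`. It scores only genes that occur in at least one redescription, since the denominator is zero otherwise.

```python
from collections import Counter
from itertools import combinations

def shatterability(redescriptions):
    """Return {frozenset({a, b}): S} for every pair of genes seen in the input.

    `redescriptions` is a list of gene sets, one per mined redescription
    (an input format assumed for this sketch).
    """
    y = Counter()  # y[a]: number of redescriptions containing gene a
    z = Counter()  # z[pair]: number of redescriptions containing both genes
    for genes in redescriptions:
        y.update(genes)
        z.update(frozenset(p) for p in combinations(sorted(genes), 2))
    scores = {}
    for a, b in combinations(sorted(y), 2):
        pair = frozenset((a, b))
        # Equation 7.3: one minus the Jaccard's coefficient of the frequencies.
        scores[pair] = 1 - z[pair] / (y[a] + y[b] - z[pair])
    return scores

# Toy input: the three shaded regions of Fig. 7.19 treated as redescriptions.
redescriptions = [{"a1"}, {"a2"}, {"a1", "a2", "a3"}]
scores = shatterability(redescriptions)
for pair, s in sorted(scores.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(pair), round(s, 3))
```

In this toy input, a1 and a2 each occur twice and co-occur once, so their Shatterability is 1 − 1/3; a3 occurs once, always together with the other two, giving 1 − 1/2 for both pairs that involve it.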

Another parameter of interest in this study is the Shatter Index (SI). The Shatter Index for a given universal set O and a given threshold t on the Shatterability is defined as shown in Equation 7.4. This measure helps us choose a good threshold on the Shatterability to obtain a reasonable set of gene pairs. Notice that the Shatter Index is defined over the whole universal set, whereas Shatterability is defined for individual gene pairs.

\[
SI_t = \frac{\#\ \text{of gene pairs}\ \{a_i, a_j\}\ \text{with}\ S_{i,j} \le t}{|O|\,(|O| - 1)/2} \tag{7.4}
\]
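As a rough illustration of Equation 7.4, the sketch below counts the pairs at or below a threshold and divides by the number of unordered pairs in the universal set. The function name `shatter_index`, the dictionary layout, and the example values are assumptions made here. Pairs absent from the dictionary are treated as never co-occurring and hence as having Shatterability 1.

```python
def shatter_index(scores, universe_size, t):
    """Fraction of all gene pairs in the universal set with Shatterability <= t.

    `scores` maps gene pairs to Shatterability values; pairs missing from it
    are treated as having Shatterability 1 (an assumption of this sketch).
    """
    total_pairs = universe_size * (universe_size - 1) // 2
    missing = total_pairs - len(scores)
    hits = sum(1 for s in scores.values() if s <= t)
    if t >= 1:
        hits += missing  # unscored pairs count only once the threshold reaches 1
    return hits / total_pairs

# Hypothetical scores for three pairs in a universal set of four genes.
example_scores = {("a1", "a2"): 2 / 3, ("a1", "a3"): 0.5, ("a2", "a3"): 0.5}
for t in (0.0, 0.5, 0.9, 1.0):
    print(t, shatter_index(example_scores, universe_size=4, t=t))
```

For a universal set of 230 ORFs, the denominator is 230 × 229 / 2 = 26,335 gene pairs.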

Figure 7.21: Four connected components formed by gene pairs that have a Shatterability of 0 in our study. (a) The component shown does not exhibit any interesting categorization. (b) The five genes are involved with RNA dependent adenosinetriphosphatase. (c) and (d) These components are a part of the small nucleolar ribonucleoprotein complex.

7.5.3 Results

For this case study, we restricted ourselves to descriptors based on functional categorization. We could potentially involve gene expression descriptors as well in a later study to make the results more unexpected and interesting.

The universal set is defined on the basis of gene expression levels in experiments of interest. Here, we have chosen a universal set of yeast genes defined in Chap. 2 (G2) as the set of ORFs that show more than 4-fold up- or down-regulation in expression at some time point in each of seven stresses [21] (heat shock from 25°C to 37°C, hyper-osmotic shock, hypo-osmotic shock, H2O2 exposure, mild heat shock at variable osmolarity, heat shock from 37°C to 25°C, and heat shock from 29°C to 33°C). In addition, the ORFs are required to have at least one functional category defined for them. This results in a universal set of 230 ORFs. The descriptors we use are simply the GO category assignments for genes in our universal set. This results in a total of 479 GO BIO, 112 GO CEL and 298 GO MOL descriptors.

The study was conducted using a 1-level tree for GO BIO descriptors and relating the classes obtained from this tree to descriptors obtained from 2-level trees for GO CEL and GO MOL descriptors. Redescriptions involving only intersections and differences were used. Also, the minimum support was set to 1 and the minimum Jaccard's coefficient was set to 0.5.

Our study resulted in a total of 1905 redescriptions for our universal set of choice. From these redescriptions, counts of co-occurrence of pairs of genes were derived and the Shatterability values calculated. Fig. 7.20 shows the change in Shatter Index with different Shatterability thresholds for this study. Here, the steep jump at the end of the graph (near a Shatterability value of 1) happens because a significant number of the gene pairs (about 80%) have a co-occurrence frequency of 0 and hence are allocated a Shatterability value of 1.

Next, we analyzed the set of genes with very low Shatterability values. For this purpose, we chose the 80 gene pairs with a Shatterability of 0. We built a graph from these gene pairs using each gene as a node and each gene pair as an edge in the graph. We looked for connected components in this graph and eliminated those with fewer than 4 genes in them. This reduced the graph to 71 edges involving 24 nodes. The four connected components that remained are shown in Fig. 7.21. Fig. 7.21 (a) shows a small component with 4 genes. This component is not very interesting in terms of the categories that the genes are involved with. The component shown in Fig. 7.21 (b) is involved with the GO MOL category RNA dependent adenosinetriphosphatase. Components shown in Fig. 7.21 (c) and Fig. 7.21 (d) are actually a part of the same protein complex (GO CEL: small nucleolar ribonucleoprotein complex) and share some interesting functional categories (GO BIO: processing of 20S pre-rRNA and GO MOL: snoRNA binding).
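The graph step described above (genes as nodes, zero-Shatterability pairs as edges, small components discarded) can be sketched with a plain depth-first traversal. The function name `components_of_min_size` and the gene labels are hypothetical, and this is not necessarily how the analysis was implemented in the study.

```python
from collections import defaultdict

def components_of_min_size(pairs, min_size=4):
    """Connected components (as sets of genes) with at least `min_size` genes."""
    adj = defaultdict(set)
    for a, b in pairs:
        adj[a].add(b)
        adj[b].add(a)
    seen, result = set(), []
    for start in adj:
        if start in seen:
            continue
        # Depth-first traversal collecting everything reachable from `start`.
        stack, comp = [start], set()
        while stack:
            gene = stack.pop()
            if gene in comp:
                continue
            comp.add(gene)
            stack.extend(adj[gene] - comp)
        seen |= comp
        if len(comp) >= min_size:
            result.append(comp)
    return result

# Hypothetical gene pairs with a Shatterability of 0.
pairs = [("g1", "g2"), ("g2", "g3"), ("g3", "g4"), ("g5", "g6")]
print(components_of_min_size(pairs))
```

In this toy input, the chain g1 to g4 survives the size filter while the isolated pair (g5, g6) is dropped.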

The idea of shattering presented here appears to be reasonably successful in finding small concerted modules of conserved functionality. It remains future work to extend such analysis to descriptors defined from specific experimental data.

7.6 Discussion

In this chapter, we showed that the structure inherent to descriptors can be used to reduce the redundancy in the redescriptions mined. We also defined a measure of similarity for redescriptions involving many-to-many relationships between items. The case studies involving the measures introduced yielded biologically interesting results, and these measures can be used to conduct large scale cross-genomic redescription analysis for multiple genomes. Results obtained using two simple models in our case studies for constrained storytelling suggest that it can be a useful tool in mining patterns across a sequence of domains. We also introduced the notion of shattering, which builds upon knowledge gained through redescription mining to measure the similarity or closeness between items.


Chapter 8

Conclusion

In this dissertation, we have proposed the new data mining task of redescription mining, aimed at finding patterns using multiple vocabularies of data descriptors available in a given domain. The CARTwheels approach that we developed for mining redescriptions finds applications in descriptor-rich fields such as bioinformatics. We conducted several studies involving redescription mining using biological data and generated interesting and novel results from both a computational and a biological perspective.

This research opens up many interesting avenues for further investigation, many of which are already being pursued. The storytelling idea, presented in Chapter 5, has now developed into an independent research project of its own, with emphasis on modeling combinatorial relationships by mining the literature [27]. Stories mined over a corpus of documents can be summarized into ‘novellas’ that shed light on trends in the biomedical literature. The constrained storytelling idea has now resulted in the ‘compositional data mining’ project that brings in the dual notion of ‘biclusters’ to complement redescriptions. Finally, the theoretical work on partition spaces and shattering suggests that redescription can be used as a valuable primitive for characterizing datasets, not just for data mining purposes, but also for large scale comparative studies.

We have also shown here that redescription mining provides a domain-neutral way to cast complex data mining scenarios in terms of simpler primitives. This work makes it possible to formulate and solve entirely new classes of research problems that are vital to knowledge discovery. The key to success in our approach is the use of domain-scientist-defined object sets (i.e., descriptors) as the starting point of analysis, ensuring relevance of mined results. As scientists are empowered to create their own vocabularies and descriptors and reason with them, there will be greater understanding of scientific datasets. Redescription mining promises to be an important tool in this endeavor.


Bibliography

[1] C.C. Aggarwal, J.L. Wolf, and P.S. Yu. A New Method for Similarity Indexing of Market Basket Data. In Proc. SIGMOD’99, pages 407–418, 1999.

[2] R. Agrawal, T. Imielinski, and A.N. Swami. Mining Association Rules between Sets of Items in Large Databases. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pages 207–216, May 1993.

[3] R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules in Large Databases. In Proceedings of the 20th International Conference on Very Large Data Bases, pages 487–499, Sep 1994.

[4] D. Barbara, Y. Li, and J. Couto. COOLCAT: An Entropy-based Algorithm for Categorical Clustering. In Proceedings of the Eleventh International Conference on Information and Knowledge Management, pages 582–589, Nov 2002.

[5] P.A. Bernstein, R. Pottinger, and A.Y. Halevy. A Vision for Management of Complex Models. SIGMOD Record (ACM Special Interest Group on Management of Data), Vol. 29(4):pages 55–63, Dec 2000.

[6] B.J. Breitkreutz, C. Stark, and M. Tyers. The GRID: the General Repository for Interaction Datasets. Genome Biol., Vol. 4:pages R23, 2003.

[7] A. Blum and T. Mitchell. Combining Labeled and Unlabeled Data with Co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 92–100, Jul 1998.

[8] J.S. Bradley, J. Gehrke, R. Ramakrishnan, and R. Srikant. Scaling Mining Algorithms to Large Databases. Communications of the ACM, Vol. 45(8):pages 38–43, Aug 2002.

[9] L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone. Classification and Regression Trees. Chapman and Hall/CRC, 1984.

[10] A.Z. Broder, M. Charikar, A.M. Frieze, and M. Mitzenmacher. Min-Wise Independent Permutations. JCSS, Vol. 60(3):pages 630–659, June 2000.

[11] W.W. Cohen and H. Hirsh. Learning the CLASSIC Description Logic: Theoretical and Experimental Results. In Proceedings of the Fourth International Conference on Principles of Knowledge Representation and Reasoning, pages 121–133, 1994.


[12] J. Crawford and F. Crawford. Data Mining in a Scientific Environment. In Proceedings of AUUG 96 and Asia Pacific World Wide Web Conference, Sep 1996.

[13] U. Fayyad, D. Haussler, and P. Stolorz. Mining Scientific Data. Communications of the ACM, Vol. 39(11):pages 51–57, Nov 1996.

[14] U.M. Fayyad and K.B. Irani. On the Handling of Continuous-Valued Attributes in Decision Tree Generation. Machine Learning, Vol. 8(1):pages 87–102, 1992.

[15] U.M. Fayyad and K.B. Irani. Multi-interval discretization of continuous-valued attributes for classification learning. In Proceedings of IJCAI-93, the 13th International Joint Conference on Artificial Intelligence, pages 1022–1027, Aug/Sep 1993.

[16] P. Ganesan, H. Garcia-Molina, and J. Widom. Exploiting Hierarchical Domain Structure to Compute Similarity. ACM Transactions on Information Systems, Vol. 21(1):pages 64–93, Jan 2003.

[17] V. Ganti, J. Gehrke, and R. Ramakrishnan. CACTUS - Clustering Categorical Data Using Summaries. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 73–83, Aug 1999.

[18] V. Ganti, J. Gehrke, and R. Ramakrishnan. Mining Very Large Databases. IEEE Computer, Vol. 32(8):pages 38–45, Aug 1999.

[19] A. Garay-Arroyo, J.M. Colmenero-Flores, A. Garciarrubio, and A.A. Covarrubias. Highly hydrophilic proteins in prokaryotes and eukaryotes are common during conditions of water deficit. J. Biol. Chem., Vol. 275:pages 5668–5674, 2000.

[20] A. Garay-Arroyo and A.A. Covarrubias. Three genes whose expression is induced by stress in Saccharomyces cerevisiae. Yeast, Vol. 15:pages 879–892, 1999.

[21] A.P. Gasch, P.T. Spellman, C.M. Kao, O. Carmel-Harel, M.B. Eisen, G. Storz, D. Botstein, and P.O. Brown. Genomic Expression Programs in the Response of Yeast Cells to Environmental Changes. Molecular Biology of the Cell, Vol. 11:pages 4241–4257, 2000.

[22] A.C. Gavin, M. Bosche, R. Krause, P. Grandi, M. Marzioch, A. Bauer, J. Schultz, J.M. Rick, A.M. Michon, C.M. Cruciat, et al. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature, Vol. 415:pages 141–147, 2002.

[23] J. Gehrke, R. Ramakrishnan, and V. Ganti. RainForest: A Framework for Fast Decision Tree Construction of Large Datasets. Data Mining and Knowledge Discovery, Vol. 4(2/3):pages 127–162, July 2000.

[24] L. Getoor. Link Mining: A New Data Mining Challenge. SIGKDD Explorations, Vol. 5(1):pages 84–89, 2003.

[25] D. Gibson, J.M. Kleinberg, and P. Raghavan. Clustering Categorical Data: An Approach Based on Dynamical Systems. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 311–322, Aug 1998.


[26] A. Gionis, P. Indyk, and R. Motwani. Similarity Search in High Dimensions via Hashing. In Proc. VLDB’99, pages 518–529, Sep 1999.

[27] J. Gresock, D. Kumar, R.F. Helm, M. Potts, and N. Ramakrishnan. Mining Novellas from PubMed Abstracts using a Storytelling Algorithm, Feb 2007. Technical Report TR-07-08, Department of Computer Science, Virginia Tech.

[28] R. Guha, R. Kumar, D. Sivakumar, and R. Sundaram. Unweaving a Web of Documents. In Proc. KDD’05, pages 574–579, 2005.

[29] J. Han, L.V.S. Lakshmanan, and R.T. Ng. Constraint-Based, Multidimensional Data Mining. Computer (special issues on Data Mining), Vol. 32(8):pages 46–50, Aug 1999.

[30] C.T. Harbison, D.B. Gordon, T.I. Lee, N.J. Rinaldi, K.D. Macisaac, T.W. Danford, N.M. Hannett, J.B. Tagne, D.B. Reynolds, J. Yoo, E.G. Jennings, J. Zeitlinger, D.K. Pokholok, M. Kellis, P.A. Rolfe, K.T. Takusagawa, E.S. Lander, D.K. Gifford, E. Fraenkel, and R.A. Young. Transcriptional regulatory code of a eukaryotic genome. Nature, Vol. 431(7004):pages 99–104, 2004.

[31] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2001.

[32] F.C.P. Holstege, E.G. Jennings, J.J. Wyrick, T.I. Lee, C.J. Hengartner, M.R. Green, T.R. Golub, E.S. Lander, and R.A. Young. Dissecting the regulatory circuitry of a eukaryotic genome. Cell, Vol. 95:pages 717–728, 1998.

[33] T. Ito, T. Chiba, R. Ozawa, M. Yoshida, M. Hattori, and Y. Sakaki. A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc. Natl. Acad. Sci. USA, Vol. 98:pages 4569–4574, 2001.

[34] A. Jain and R.C. Dubes. Algorithms for Clustering Data. Prentice-Hall, Inc., 1988.

[35] C. Kamath, E. Cantu-Paz, I.K. Fodor, and N.A. Tang. Classifying Bent-Double Galaxies. IEEE Computing in Science and Engineering, Vol. 4(4):pages 52–60, Jul/Aug 2002.

[36] A. Kuchinsky, K. Graham, D. Moh, A. Adler, K. Babaria, and M.L. Creech. Biological Storytelling: a Software Tool for Biological Information Organization based upon Narrative Structure. ACM SIGGROUP Bulletin, Vol. 23(2):pages 4–5, Aug 2002.

[37] D. Kumar, N. Ramakrishnan, M. Potts, and R.F. Helm. Algorithms for Storytelling. In Proc. KDD’06, pages 604–610, 2006.

[38] H. Kuroda, N. Takahashi, H. Shimada, M. Seki, K. Shinozaki, and M. Matsui. Classification and expression analysis of Arabidopsis F-box-containing protein genes. Plant Cell Physiol., Vol. 43:pages 1073–1085, 2002.

[39] N. Lavrac and S. Dzeroski. Inductive Logic Programming: Techniques and Applications. Ellis Horwood, New York, 1994.


[40] N. Lavrac and P.A. Flach. An Extended Transformation Approach to Inductive Logic Programming. ACM Transactions on Computational Logic, Vol. 2(4):pages 458–494, Oct 2001.

[41] D.T.S. Law and J. Segall. The SPS100 gene of Saccharomyces cerevisiae is activated late in the sporulation process and contributes to spore wall maturation. Mol. Cell. Biol., Vol. 8:pages 912–922, 1988.

[42] W. Li and C. Clifton. SEMINT: A Tool for Identifying Attribute Correspondence in Heterogeneous Databases Using Neural Networks. Data and Knowledge Engineering, Vol. 33(1):pages 49–84, Apr 2000.

[43] Q. Lisman, D. Urli-Stam, and J.C.M. Holthuis. HOR7, a multicopy suppressor of the Ca2+-induced growth defect in sphingolipid mannosyltransferase-deficient yeast. J. Biol. Chem., Vol. 279:pages 36390–36396, 2004.

[44] R. Lopez de Mantaras. A Distance-Based Attribute Selection Measure for Decision Tree Induction. Machine Learning, Vol. 6:pages 81–92, 1991.

[45] P.W. Lord, R.D. Stevens, A. Brass, and C.A. Goble. Semantic similarity measures as tools for exploring the gene ontology. In Pac. Symp. Biocomput., pages 601–612, 2003.

[46] J. Madhavan, P.A. Bernstein, and E. Rahm. Generic Schema Matching with Cupid. In Proceedings of the 27th International Conference on Very Large Data Bases, pages 49–58, Sep 2001.

[47] N. Mamoulis, D.W. Cheung, and W. Lian. Similarity Search in Sets and Categorical Data using the Signature Tree. In Proc. ICDE’03, pages 75–86, 2003.

[48] Z. Marx, I. Dagan, J.M. Buhmann, and E. Shamir. Coupled Clustering: A Method for Detecting Structural Correspondence. Journal of Machine Learning Research, Vol. 3:pages 747–780, Dec 2002.

[49] S.A. McCarroll, C.T. Murphy, S. Zou, S.D. Pletcher, C. Chin, Y.N. Jan, C. Kenyon, C.I. Bargmann, and H. Li. Comparing genomic expression patterns across species identifies shared transcriptional profile in aging. Nature Genetics, Vol. 36:pages 197–204, 2004.

[50] V.M. McDonough, R.J. Buxeda, M.E.C. Bruno, O. Ozier-Kalogeropoulos, M.T. Adeline, C.R. McMaster, R.M. Bell, and G.M. Carman. Regulation of Phospholipid Biosynthesis in Saccharomyces cerevisiae by CTP. J. Biol. Chem., Vol. 270:pages 18774–18780, 1995.

[51] M. Meila. Comparing Clusterings by the Variation of Information. In Proc. COLT’03, pages 173–187, 2003.

[52] R.S. Michalski. Knowledge Acquisition through Conceptual Clustering: A Theoretical Framework and Algorithm for Partitioning Data into Conjunctive Concepts. International Journal of Policy Analysis and Information Systems, Vol. 4:pages 219–243, 1980.

[53] R.S. Michalski. A Theory and Methodology of Inductive Learning. Artificial Intelligence, Vol. 20(2):pages 111–161, 1983.


[54] V.J. Miralles and R. Serrano. A genomic locus in Saccharomyces cerevisiae with four genes up-regulated by osmotic stress. Mol. Microbiol., Vol. 17:pages 653–662, 1995.

[55] P. Mitra, G. Wiederhold, and J. Jannink. Semi-automatic Integration of Knowledge Sources. In Proceedings of Fusion ’99, Jul 1999.

[56] M. Mizunuma, K. Miyamura, D. Hirata, H. Yokoyama, and T. Miyakawa. Involvement of S-adenosylmethionine in G1 cell-cycle regulation in Saccharomyces cerevisiae. Proc. Natl. Acad. Sci. USA, Vol. 101:pages 6086–6091, 2004.

[57] A.W. Moore and M.S. Lee. Cached Sufficient Statistics for Efficient Machine Learning with Large Datasets. Journal of Artificial Intelligence Research, Vol. 8:pages 67–91, 1998.

[58] M. Morzy, T. Morzy, A. Nanopoulos, and Y. Manolopoulos. Hierarchical Bitmap Index: An Efficient and Scalable Indexing Technique for Set-Valued Data. In Proc. ADBIS’03, pages 236–252, Sep 2003.

[59] S. Muggleton. Scientific Knowledge Discovery using Inductive Logic Programming. Communications of the ACM, Vol. 42(11):pages 42–46, Nov 1999.

[60] S. Muggleton and W. Buntine. Machine Invention of First-order Predicates by Inverting Resolution. In Proceedings of the Fifth International Conference on Machine Learning, pages 339–352, 1988.

[61] S. Muggleton and C. Feng. Efficient Induction of Logic Programs. In Proceedings of the 1st Conference on Algorithmic Learning Theory, pages 368–381, 1990.

[62] A. Nanopoulos and Y. Manolopoulos. Efficient Similarity Search for Market Basket Data. VLDB Journal, Vol. 11(2):pages 138–152, 2002.

[63] J. Neville and D. Jensen. Supporting Relational Knowledge Discovery: Lessons in Architecture and Algorithm Design. In Proc. Data Mining Lessons Learned Workshop, ICML’02, 2002.

[64] K.P. O’Brien, M. Remm, and E.L. Sonnhammer. Inparanoid: a comprehensive database of eukaryotic orthologs. Nucleic Acids Res., Vol. 33:pages D476–D480, 2005.

[65] D.B. Ostrander, D.J. O’Brien, J.A. Gorman, and G.M. Carman. Effect of CTP synthetase regulation by CTP on phospholipid synthesis in Saccharomyces cerevisiae. J. Biol. Chem., Vol. 273:pages 18992–19001, 1998.

[66] J.S. Park, M.S. Chen, and P.S. Yu. An Effective Hash-Based Algorithm for Mining Association Rules. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, pages 175–186, May 1995.

[67] G.D. Plotkin. A Note on Inductive Generalization. In Machine Intelligence, volume 5, pages 153–163. Edinburgh University Press, 1970.

[68] K.Q. Pu and A.O. Mendelzon. Concise descriptions of subsets of structured sets. ACM Transactions on Database Systems (TODS), Vol. 30(1):pages 211–248, Mar 2000.


[69] J. Puzicha, T. Hofmann, and J.M. Buhmann. A Theory of Proximity Based Clustering: Structure Detection by Optimization. Pattern Recognition, Vol. 33(4):pages 617–634, 2000.

[70] J.R. Quinlan. Induction of Decision Trees. Machine Learning, Vol. 1(1):pages 81–106, 1986.

[71] J.R. Quinlan. Rule Induction with Statistical Data – A Comparison with Multiple Regression. Journal of the Operational Research Society, Vol. 38:pages 347–352, 1987.

[72] J.R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

[73] E. Rahm and P.A. Bernstein. A Survey of Approaches to Automatic Schema Matching. VLDB Journal, Vol. 10(4):pages 334–350, 2001.

[74] N. Ramakrishnan and A. Grama. Mining Scientific Data. Advances in Computers, Vol. 55:pages 119–169, 2001.

[75] N. Ramakrishnan, D. Kumar, B. Mishra, M. Potts, and R.F. Helm. Turning CARTwheels: An Alternating Algorithm for Mining Redescriptions. In Proc. KDD’04, pages 266–275, 2004.

[76] S. Sarawagi and A. Kirpal. Efficient Set Joins on Similarity Predicates. In Proc. SIGMOD’04, pages 743–754, June 2004.

[77] A. Savasere, E. Omiecinski, and S. Navathe. An Efficient Algorithm for Mining Association Rules in Large Databases. In Proceedings of the 21st International Conference on Very Large Data Bases, pages 432–444, Sep 1995.

[78] D. Scheglmann, K. Werner, G. Eiselt, and R. Klinger. Role of paired basic residues of protein C-termini in phospholipid binding. Protein Eng., Vol. 15:pages 521–528, 2002.

[79] E. Segal, M. Shapira, A. Regev, D. Pe’er, D. Botstein, D. Koller, and N. Friedman. Module Networks: Identifying Regulatory Modules and their Condition-Specific Regulators from Gene Expression Data. Nature Genetics, Vol. 34(2):pages 166–176, 2003.

[80] E.Y. Shapiro. Algorithmic Program Debugging. MIT Press, 1983.

[81] B. Shirkey, N.J. McMaster, S.C. Smith, D.J. Wright, H. Rodriguez, P. Jaruga, M. Birincioglu, R.F. Helm, and M. Potts. Genomic DNA of Nostoc commune (Cyanobacteria) becomes covalently modified during long-term (decades) desiccation but is protected from oxidative damage and degradation. Nucleic Acids Res., Vol. 31:pages 2995–3005, 2003.

[82] D.A. Simovici and S. Jaroszewicz. An Axiomatization of Partition Entropy. IEEE Transactions on Information Theory, Vol. 48(7):pages 2138–2142, 2002.

[83] J.A. Singh, D. Kumar, R. Ramakrishnan, V. Singhal, J. Jervis, J. Garst, S. Slaughter, A.M. DeSantis, M. Potts, and R.F. Helm. Transcriptional Response of Saccharomyces cerevisiae to Desiccation and Rehydration. Applied and Environmental Microbiology, Vol. 71(12):pages 8752–8763, Dec 2005.


[84] J.G. Sorensen, M.M. Nielsen, M. Kruhoffer, J. Justesen, and V. Loeschcke. Full genome gene expression analysis of the heat stress response in Drosophila melanogaster. Cell Stress and Chaperones, Vol. 10(4):pages 312–328, 2005.

[85] J.D. Storey and R. Tibshirani. Statistical significance for genome-wide experiments. Proceedings of the National Academy of Sciences, Vol. 100:pages 9440–9445, 2003.

[86] A. Strehl and J. Ghosh. Cluster Ensembles – A Knowledge Reuse Framework for Combining Multiple Partitions. Journal of Machine Learning Research, Vol. 3:pages 583–617, Mar 2003.

[87] A. Strehl, J. Ghosh, and R. Mooney. Impact of Similarity Measures on Web-page Clustering. In Proceedings of the 17th National Conference on Artificial Intelligence: Workshop of Artificial Intelligence for Web Search (AAAI 2000), pages 58–64, July 2000.

[88] A. Sturn, J. Quackenbush, and Z. Trajanoski. Genesis: Cluster Analysis of Microarray Data. Bioinformatics, Vol. 18(1):pages 207–208, 2002.

[89] N. Sudarsan, J.E. Barrick, and R.R. Breaker. Metabolite-binding RNA domains are present in the genes of eukaryotes. RNA, Vol. 9:pages 644–647, 2003.

[90] D.R. Swanson and N.R. Smalheiser. An Interactive System for Finding Complementary Literatures: A Stimulus to Scientific Discovery. Artificial Intelligence, Vol. 91(2):pages 183–203, 1997.

[91] S. Tavazoie, J.D. Hughes, M.J. Campbell, R.J. Cho, and G.M. Church. Systematic determination of genetic network architecture. Nature Genetics, Vol. 22(3):pages 213–215, 1999.

[92] H. Toivonen. Sampling Large Databases for Association Rules. In Proceedings of the 22th International Conference on Very Large Data Bases, pages 134–145, Sep 1996.

[93] R. Valdes-Perez. Pickniche software, 1999. http://www.cs.cmu.edu/sci-disc/pickniche.html.

[94] R. Valdes-Perez, V. Pericliev, and F. Pereira. Concise, Intelligible, and Approximate Profiling of Multiple Classes. International Journal of Human-Computer Studies, Vol. 53(3):pages 411–436, 2000.

[95] W.C. Winkler, A. Nahvi, N. Sudarsan, J.E. Barrick, and R.R. Breaker. An mRNA structure that controls gene expression by binding S-adenosylmethionine. Nat. Struct. Biol., Vol. 10:pages 701–707, 2003.

[96] J.J. Wyrick, F.C. Holstege, E.G. Jennings, H.C. Causton, D. Shore, M. Grunstein, E.S. Lander, and R.A. Young. Chromosomal Landscape of Nucleosome-Dependent Gene Expression and Silencing in Yeast. Nature, Vol. 402:pages 418–421, 1999.

[97] M.J. Zaki and N. Ramakrishnan. Reasoning about Sets using Redescription Mining. In Proc. KDD’05, pages 364–373, 2005.


Vita

Deept Kumar was born in the small town of Muzzaffarnagar in U.P., India. He received most of his initial schooling from Hartmann College in Bareilly. Deept pursued his Bachelor’s degree in Chemical Engineering at the Indian Institute of Technology, Mumbai from 1994 to 1998. After completion of his undergraduate studies, he worked with Infosys Technologies Limited, Bangalore as a Software Engineer for a year. In 1999, he moved to Virginia Tech in Blacksburg to pursue his M.S. in Environmental Engineering. In 2002, he joined the Computer Science department at Virginia Tech to pursue his doctoral degree. Deept received his Ph.D. in Computer Science in May 2007.
