The integration of business intelligence and knowledge management

by W. F. Cody, J. T. Kreulen, V. Krishna, and W. S. Spangler

Enterprise executives understand that timely, accurate knowledge can mean improved business performance. Two technologies have been central in improving the quantitative and qualitative value of the knowledge available to decision makers: business intelligence and knowledge management. Business intelligence has applied the functionality, scalability, and reliability of modern database management systems to build ever-larger data warehouses, and to utilize data mining techniques to extract business advantage from the vast amount of available enterprise data. Knowledge management technologies, while less mature than business intelligence technologies, are now capable of combining today’s content management systems and the Web with vastly improved searching and text mining capabilities to derive more value from the explosion of textual information. We believe that these systems will blend over time, borrowing techniques from each other and inspiring new approaches that can analyze data and text together, seamlessly. We call this blended technology BIKM. In this paper, we describe some of the current business problems that require analysis of both text and data, and some of the technical challenges posed by these problems. We describe a particular approach based on an OLAP (on-line analytical processing) model enhanced with text analysis, and describe two tools that we have developed to explore this approach: eClassifier performs text analysis, and Sapient integrates data and text through an OLAP-style interaction model. Finally, we discuss some new research that we are pursuing to enhance this approach.

A critical component for the success of the modern enterprise is its ability to take advantage of all available information. This challenge becomes more difficult with the constantly increasing volume of information, both internal and external to an enterprise. It is further exacerbated because many enterprises are becoming increasingly “knowledge-centric,” and therefore a larger number of employees need access to a greater variety of information to be effective. The explosive growth of the World Wide Web clearly compounds this problem.

Enterprises have been investing in technology in an effort to manage the information glut and to glean knowledge that can be leveraged for a competitive edge. Two technologies in particular have shown good return on investment in some applications and are benefiting from a large concentration of research and development. The technologies are business intelligence (BI) and knowledge management (KM).

Business intelligence technology has coalesced in the last decade around the use of data warehousing and on-line analytical processing (OLAP). Data warehousing is a systematic approach to collecting relevant business data into a single repository, where it is organized and validated so that it can be analyzed and presented in a form that is useful for business decision-making [1]. The various sources for the relevant business data are referred to as the operational data stores (ODS). The data are extracted, transformed, and loaded (ETL) from the ODS systems into a data mart. An important part of this process is data cleansing, in which variations on schemas and data values from disparate ODS systems are resolved. In the data mart, the data are modeled as an OLAP cube (multidimensional model), which supports flexible drill-down and roll-up analyses. Tools from various vendors (e.g., Hyperion, Brio, Cognos) provide the end user with a query and analysis front end to the data mart. Large data warehouses currently hold tens of terabytes of data, whereas smaller, problem-specific data marts are typically in the 10 to 100 gigabyte range.

Knowledge management definitions span organizational behavioral science, collaboration, content management, and other technologies. In this context, we are using it to address technologies used for the management and analysis of unstructured information, particularly text documents. It is conjectured that there is as much business knowledge to be gleaned from the mass of unstructured information available as there is from classical business data. We believe this to be true and assert that unstructured information will become commonly used to provide deeper insights and explanations into events discovered in the business data. The ability to provide insights into observed events (e.g., trends, anomalies) in the data will clearly have applications in business, market, competitive, customer, and partner intelligence as well as in many domains such as manufacturing, consumer goods, finance, and life sciences.

The variety of textual information sources is extremely large, including business documents, e-mail, news and press articles, technical journals, patents, conference proceedings, business contracts, government reports, regulatory filings, discussion groups, problem report databases, sales and support notes, and, of course, the Web. Knowledge and content management technologies are used to search, organize, and extract value from all of these information sources and are a focus of significant research and development [2,3]. These technologies include clustering, taxonomy building, classification, information extraction, and summarization. An increasing number of applications, such as expertise location [4,5], knowledge portals, customer relationship management (CRM), and bioinformatics, require merging these unstructured information technologies with structured business data analysis.

It is our belief that over time techniques from both BI and KM will blend. Today’s disparate systems will use techniques from each and will, in turn, inspire new techniques that will seamlessly span the analysis of both data and text. With this in mind, we describe our contributions in this direction. First, we briefly describe some business problems that motivate this integration and some of the technical challenges that they pose. Then we describe eClassifier, a comprehensive text analysis tool that provides a framework for integrating advanced text analytics. Next, we present an example that motivates our particular approach toward integrating data and text analysis and describe our architecture for a combined data and document warehouse and associated tooling. Finally, we discuss some current research directions in extracting information from documents that can increase the value of a data cube.

Motivation for BIKM

The desire to extend the capabilities of business intelligence applications to include textual information has existed for quite some time. The major inhibitors have included the separation of the data on different data management systems, typically across different organizations, and the immaturity of automated text analysis techniques for deriving business value from large amounts of text. The current focus on information integration in many enterprises is rapidly diminishing the first inhibitor, and advances in machine learning, information retrieval, and statistical natural language processing are eroding the second.

Examples of BIKM problems. To understand the importance of BIKM, it is useful to look at some real business problems and to determine how this technology can provide a return on the investment (ROI). The ROI can be achieved, in general, in one of two ways: (1) through cost reductions and identification of inefficiencies (improved productivity), and (2) through identification of revenue opportunities and growth. Here are some typical scenarios in which our customers believe their business analyses would benefit substantially from data and text integration:

1. Understanding sales effectiveness. A telemarketing revenue data cube can help identify products that are most successfully sold over the phone, sales representatives who generate the most sales, and customers who are the most receptive to this sales approach. Unfortunately, the particular sales techniques used by these successful sales representatives in various situations are not captured by quantitative measures in the OLAP cube. However, these sales conversations are now frequently recorded and converted to text. The text of conversations associated with high-revenue sales representatives and high-yield customers can be analyzed by various language processing or pattern detection techniques to find patterns in the use of phrases or phrase sequences.

2. Improving support and warranty analysis. Frequently in business applications, short text descriptions, for example from customer complaints, are recorded in a database but are then encoded into short classification codes by a person. The code fields then become the basis for any business analysis of the set of customer complaints. Variations in the assignment of codes by different people can cause emerging trends or problem situations to be overlooked. The application of modern linguistic and machine-learning techniques (i.e., classification) to the text could provide a more consistent encoding, or at least a validation of the human encoding, as the basis for the business analysis.

3. Relating CRM to profitability. Data cubes for understanding revenues achieved over a set of customers frequently omit the costs associated with individual customers. In some industries these costs can substantially reduce the profit from a customer. The costs can include the number of calls the customer made into the business for problem resolution, complaint handling, or just “hand-holding.” Extracting measures of these costs (e.g., time spent on the phone with the customer) and measures of the customer’s loyalty for continued business (e.g., sentiment analysis of the customer interaction) from a customer relationship management (CRM) system and merging these measures into the revenue cube would provide a more complete picture of the profitability derived from a customer [6].

Environmental issues. We have briefly presented some typical customer scenarios in which bringing text analysis together with classical data analysis can provide increased business value. Naturally, there are environments of varying complexity in which these scenarios occur, and consequently there are a variety of technologies and tools that may be needed in these different environments. In this section, we distinguish three general environments based on the degree of integration of the text and the data sources.

The simplest scenario occurs when the text information sits inside the same database as the business data and is unambiguously associated with the related business data. The text may simply be in character fields in the business data records or in separate tables linked with the data records through common join attributes. In this situation, text analysis techniques can be used to extract value from the text in the form of additional attributes, relationships, and facts that can then be directly related to the business data. As integrated database systems that bring text (e.g., XML [Extensible Markup Language]) together with data in a single database become more common, the ability to use text analysis to enrich the directly related data will also increase.

Currently, most textual information is in systems distinct from the ODS systems used to populate business intelligence data marts. In the simplest case the text system has meta-data that logically correspond to fields in the business data, for example, customer name or product name, which can be used to link the text with the data. However, the text system may use different forms for the meta-data values than those in the database, and this necessitates a data mapping transform to determine the correct association of text to data, for example, associating “DB2” with IBM DB2 Universal Database*, Enterprise-Extended Edition, or “J. Smith” with John W. Smith. These problems are common and difficult when integrating data from different source systems, but for this discussion we assume that enough data cleansing and transformation tools exist to at least somewhat automate this mapping [7].

In the absence of adequate meta-data to relate the text to the data, classification technology can be used to categorize the text documents. The classes might correspond to the values in a data attribute, for example, the members in a dimension of an OLAP cube. The assignment of a document to a particular class for a data attribute (e.g., product name) could have a confidence measure associated with it, and the document might be assigned to several classes. This classification process may require training, and it should make use of any relevant meta-data available in the database. Once the text has been appropriately related to the business data, it can be processed by the text analysis techniques to extract the desired business information, such as additional attributes, relationships, or facts [8].

A more complicated situation arises when, unlike in the previous examples, the sources of text to relate to a business data analysis are not known a priori. The relevant text sources can depend on the type of data analysis being performed, and the number of possibilities for such sources may be very large. In this case, a discovery process is needed to identify the appropriate text sources for the business analysis, and then an association technology is needed to relate the text to the data records. Finally, the appropriate text analysis can be used to extract the business value. As a brief example, consider a business analyst exploring a revenue cube and detecting a downward movement in revenues for a software product in some part of the United States [9]. The data cube shows the phenomenon but does not provide any explanation for it. Because the issue is the revenue decline of a software product in a certain region, there is a natural set of questions that might be asked to understand the revenue decline and a substantial number of text sources one might wish to review to find the answers. In general, the questions to be asked depend on the issue under investigation and the characteristics of the data, for example, its schema, its meta-data, its application context. In this example the text sources could include:

1. Enterprise-specific information, such as service call logs about the product and competitive intelligence reports about other companies’ products

2. Purchased text information, from sources such as Hoovers and Dun & Bradstreet, on general software market conditions

3. Public documents in Web forums that contain discussions about products, such as ePinions.com

Current work on meta-data to represent the information content published in data sources and work on question-answering systems to match questions to information sources will help to automate the discovery process [10]. In all of these cases, the interactive analysis of data and text may ultimately require the use of a modern text-analysis tool to explore the text documents themselves. In the next section we describe such a tool.

eClassifier

Research and development investment in knowledge-management technologies has made significant progress. However, there still exists a need for an approach that integrates complementary and best-of-breed algorithms with guidance from domain expertise. eClassifier was designed to fill this need by incorporating multiple algorithms into an architecture that supports the integration of additional algorithms as they become proven. eClassifier is an application that can quickly analyze a large collection of documents and utilize multiple algorithms, visualizations, and metrics to create and to maintain a taxonomy. The taxonomies that eClassifier helps to create can be arbitrarily complex hierarchical categorizations. The algorithms and representation must be robust in order to apply across many diverse domains. In our research, we very quickly encountered environments where the documents to be analyzed were ungrammatical and contained misspellings, esoteric terms, and abbreviations. Help-desk problem tickets or discussion groups are examples of such environments.

eClassifier is a comprehensive text-analysis application that allows a knowledge worker to learn from large collections of unstructured documents. It was designed to employ a “mixed-initiative” approach that applies domain expertise, through interactions with state-of-the-art text analysis algorithms and visualization, to provide a global understanding of a document collection. Most of the complexities inherent in text mining are hidden by using default behaviors, which can be modified as a user gains experience. The tool can be used to automatically categorize a large collection of text documents and then provide to a knowledge worker a broad spectrum of controls to refine the building of an arbitrarily complex hierarchical taxonomy. eClassifier has implemented numerous analytical, graphical, and reporting algorithms to allow a deep understanding of the concepts contained within a document collection. The tool has been optimized to analyze over a million documents. Additionally, after a given taxonomy has been generated, a classification object can be published and used within another application, through the eClassifier run-time API (application programming interface), to dynamically retrieve information about the documents as well as to incrementally process new documents. Advanced visualization techniques allow the concept space to be explored from many different perspectives. Multiple taxonomies can be generated and explored to discover new relationships or important cross sections. Text sorting and extraction techniques provide valuable concept summarizations for each category. Many views are provided, including spreadsheets, bar graphs, plots, trees, and summary reports.

We have used eClassifier extensively in conjunction with Lotus Discovery Server* and IBM Global Services on both internal and customer information sources. Based on our application of eClassifier in various domains, with many users, we have reached the conclusion that it is very difficult to automatically produce a satisfactory taxonomy for a diverse set of users without allowing human intervention. The power of eClassifier is that it explicitly provides for the incorporation of human judgment at all appropriate phases of the taxonomy generation process. It provides the necessary tools for understanding the taxonomy, for efficiently editing it, and for validating that the taxonomy is learnable by a classifier.

Document representation. The applications to which eClassifier has typically been applied are characterized by documents about a single concept. Such application domains with documents that are relatively short include help-desk problem tickets and e-mail. In domains with longer, multitopic documents, some preprocessing is needed to break the documents down into conceptual chunks. Typically this is done using document structures such as chapters, sections, or paragraphs.

eClassifier represents each document with a vector of weighted frequencies from a feature space of terms and phrases [11,12]. The feature space is obtained by counting the occurrence of terms and phrases in each document, and the vector is normalized to have unit Euclidean norm (the sum of the squares of the elements is one). To reduce the feature space representation for efficiency of computation and scalability, while maintaining maximum information, we utilize several techniques. We use stop-word lists to eliminate words bearing no content. We utilize synonym lists to collapse semantically similar words and stem variants to their base form. We use stock phrase lists to eliminate structural or no-content repetitive phrases. Stock phrases can also be automatically detected by the system through the use of statistical counting techniques. We use “include word” lists to identify semantically important terms that must remain in the feature space. Finally, we heuristically reduce the feature space by removing terms with the highest and lowest frequency of occurrences.
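
As a concrete illustration of this representation, the following minimal Python sketch (not the eClassifier implementation; the stop-word and synonym lists are placeholder assumptions) builds term-frequency vectors and normalizes each to unit Euclidean norm:

    from collections import Counter
    import numpy as np

    STOP_WORDS = {"the", "a", "is", "to"}      # placeholder stop-word list
    SYNONYMS = {"laptop": "thinkpad"}          # placeholder synonym mapping

    def tokenize(text):
        return [SYNONYMS.get(t, t) for t in text.lower().split() if t not in STOP_WORDS]

    def build_vectors(docs):
        # Feature space: every surviving term across the collection.
        vocab = sorted({t for d in docs for t in tokenize(d)})
        index = {t: i for i, t in enumerate(vocab)}
        vectors = np.zeros((len(docs), len(vocab)))
        for row, doc in enumerate(docs):
            for term, count in Counter(tokenize(doc)).items():
                vectors[row, index[term]] = count
        # Normalize each document vector to unit Euclidean norm.
        norms = np.linalg.norm(vectors, axis=1, keepdims=True)
        return vectors / np.where(norms == 0, 1, norms), vocab

    docs = ["the thinkpad screen is broken", "laptop battery fails to charge"]
    X, vocab = build_vectors(docs)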

Once the feature space is determined, eClassifier uses a dictionary tool that provides a convenient method for dynamically inspecting and modifying the feature space. This tool provides a frequency measure and a relevance measure for each term and phrase. The frequency measure is the percentage of documents in which the term occurs, and the relevance measure is the maximum frequency with which a term occurs in any category, effectively measuring the term’s influence on the taxonomy. Terms or phrases with high values for either of these measures should be considered carefully, because they heavily influence the document representation and therefore the resulting taxonomy. We have found this combination of techniques to be important and effective across a broad range of document sources.

Taxonomy generation. The first step in the analysis of the document collection is to create an initial categorization or taxonomy, which can be automated by applying clustering algorithms. In eClassifier we have implemented an optimized variant of the k-means algorithm [13,14] using a cosine similarity metric [15] to automatically partition the documents into k disjoint clusters. K-means can then be applied to each cluster to create a hierarchical taxonomy. In addition to k-means we have implemented an EM (expectation maximization) clustering algorithm and EM with MHAC (modified hierarchical agglomerative clustering), which is a variant that generates hierarchical taxonomies [16]. In practice we have found automatic clustering algorithms to be very effective in creating initial taxonomies, which are used to get a sense of the concepts contained in the document collection. However, clustering does not always partition the documents in ways that are meaningful to a human. To partially address this, we have developed some additional methods for creating taxonomies, one of which is an interactive, query-based clustering that seeds categories based on a set of keywords, tests out the queries, and then refines the clusters based on the observed results. The query-based clusters can then be further subclassed using unsupervised clustering techniques. Finally, we have also found that it is sometimes useful to start with an initial classification based upon meta-data provided with the documents.
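
A bare-bones sketch of k-means driven by cosine similarity on unit-normalized vectors (our own illustrative code, not the optimized variant or the EM/MHAC algorithms used in eClassifier):

    import numpy as np

    def cosine_kmeans(X, k, iters=20, seed=0):
        """Partition rows of X (unit-normalized) into k clusters by cosine similarity."""
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(iters):
            # For unit vectors the dot product is the cosine similarity.
            labels = np.argmax(X @ centroids.T, axis=1)
            for j in range(k):
                members = X[labels == j]
                if len(members):
                    c = members.mean(axis=0)
                    centroids[j] = c / np.linalg.norm(c)
        return labels, centroids

Applying the same routine recursively to each resulting cluster yields a hierarchical taxonomy, as described above.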

Taxonomy evaluation. Once we have an initial taxonomy of the documents, eClassifier provides the means to understand and to evaluate it. This allows us to address the unexpected results that do not meet human expectations. Figure 1 is an eClassifier screenshot showing summary information and statistics for a set of categories (note this could be at any depth in a hierarchical taxonomy). This view provides category label, size, cohesion, and distinctness measures. The vector-space model lends itself to computation of a normalized centroid for each cluster, which represents the average document in the cluster for the current feature space. The category centroid is central to the computation of the summary information in this view.

Figure 1 eClassifier class table view

The category label is generated using a term-coverage algorithm that identifies dominant terms in the feature space. If a single term occurs in 90 percent or more of the documents in a category, the category is labeled with that term. If more than one term occurs with 90 percent frequency, then all of these terms (up to four) will be included in the name, with the “&” character between each term. If no single term covers 90 percent of the documents, then the most frequent term becomes the first entry in the name. The second entry is the one that occurs most frequently in all documents that do not contain the first term of the name. If the set of documents containing either of these two words is now 90 percent of the documents in the category, these two words combined become the name (separated by a comma). If not, this process is repeated. If none of the top four terms is contained in 10 percent or more of the documents, the category is called “Miscellaneous.” We have experimented with various other algorithms for labeling categories, including finding the most frequently occurring phrases. Although these sometimes appear to be more meaningful, we have found them to be often misleading and to mischaracterize the category as a whole. Although this algorithm is effective for quickly summarizing a category, we also allow the user to assign a different label at any time.
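
The labeling heuristic can be approximated with a greedy term-coverage loop along these lines (a simplified sketch of the description above, not the product code; doc_terms is assumed to be one set of feature-space terms per document in the category):

    def label_category(doc_terms, coverage=0.90, max_terms=4):
        """Greedy term-coverage labeling (illustrative simplification)."""
        n = len(doc_terms)
        uncovered = list(doc_terms)
        name = []
        while len(name) < max_terms and uncovered:
            # Most frequent term among documents not yet covered by the name.
            counts = {}
            for terms in uncovered:
                for t in terms:
                    counts[t] = counts.get(t, 0) + 1
            if not counts:
                break
            best = max(counts, key=counts.get)
            # Give up if even the first candidate is rare in the category.
            if not name and counts[best] < 0.10 * n:
                return "Miscellaneous"
            name.append(best)
            uncovered = [terms for terms in uncovered if best not in terms]
            if (n - len(uncovered)) >= coverage * n:
                break
        return ", ".join(name) if name else "Miscellaneous"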

In addition to a label, three metrics are computed for each category by default. The size column displays a count of the number of documents in the category and its percentage of the total collection. Cohesion is a measure of the variance of the documents within a category. The cohesion is calculated based on the average cosine distance from the centroid. We have found that this provides a good measure of the similarity within a category, and categories with low cohesion are good candidates for splitting. Distinctness is a measure of variance of the documents between categories. The distinctness is calculated based on the cosine distance from a category’s centroid to its nearest neighboring category’s centroid. We have found this to provide a good measure of similarity between categories, so categories with low distinctness are similar to a neighboring category and would be candidates for potential merging.
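
With unit-length document vectors, both measures reduce to simple centroid arithmetic. The sketch below follows the description above (illustrative only; the exact formulas used by eClassifier may differ):

    import numpy as np

    def centroid(X):
        c = X.mean(axis=0)
        return c / np.linalg.norm(c)

    def cohesion(X_category):
        # Based on the cosine distance of each document from the category centroid;
        # reported here as average cosine similarity, so a low value marks a
        # dispersed category that is a candidate for splitting.
        return float(np.mean(X_category @ centroid(X_category)))

    def distinctness(X_by_category, cat):
        # Cosine distance from this category's centroid to its nearest neighbor's;
        # a low value marks a category that could be merged with a neighbor.
        cents = {name: centroid(X) for name, X in X_by_category.items()}
        own = cents[cat]
        return min(1.0 - float(own @ other) for name, other in cents.items() if name != cat)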

Figure 2 eClassifier class view

Category evaluation. In addition to understanding a given taxonomy at a macro level, it is important to be able to precisely understand what core concept a category represents. To address this need, eClassifier provides a special view that shows statistics about term frequency, induced classification rules, and document examples, as shown in Figure 2. eClassifier has a bar graph representation of the category centroid. For each term in the feature space, it shows the frequency of occurrence within the given class (red bar) and the frequency of occurrence within the total document collection (blue bar). The terms are sorted in decreasing order of red minus blue bar height in order to focus attention on the most relevant terms for the class. The class components panel visualizes the effect of certain terms (inclusion or exclusion) when used as a decision tree classifier. Nodes in the tree are selected based on minimizing the entropy of in-category vs. out-of-category documents. If certain rules are particularly meaningful, a user can click on the node and create a new class from the identified set of documents. Finally, this view provides example documents from each category. Several sorting techniques are available. Ordering by “most typical” is calculated based on proximity to the centroid. This is an effective technique: examining a few documents close to the centroid helps a user to understand the essence of the category’s concept. Ordering by “least typical,” by showing example documents farthest from the centroid, helps the user to evaluate uniformity within the category. Examples that the user identifies as not really belonging to the category can easily be moved to other categories or to a newly created category. With each modification, all relevant statistics are dynamically updated.
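
The idea behind the induced rules can be illustrated by choosing the single term whose presence best separates in-category from out-of-category documents, i.e., the split of minimum weighted entropy (our simplification, not eClassifier's actual rule-induction code):

    import math

    def entropy(pos, neg):
        total = pos + neg
        if total == 0 or pos == 0 or neg == 0:
            return 0.0
        p = pos / total
        return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

    def best_split_term(doc_terms, in_category, vocab):
        # doc_terms: one term set per document; in_category: parallel booleans.
        best_term, best_score = None, float("inf")
        n = len(doc_terms)
        for term in vocab:
            score = 0.0
            for present in (True, False):
                pos = sum(1 for terms, c in zip(doc_terms, in_category)
                          if (term in terms) == present and c)
                neg = sum(1 for terms, c in zip(doc_terms, in_category)
                          if (term in terms) == present and not c)
                score += (pos + neg) / n * entropy(pos, neg)
            if score < best_score:
                best_term, best_score = term, score
        return best_term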

Taxonomy visualization. Visualization is an important technique to convey information. eClassifier uses visualization to help a user to explore the relationships between categories of documents within a taxonomy. We show an example of eClassifier’s visualization in Figure 3. In the visualization each dot represents a document, which, when clicked on, will be displayed. The color of the dot denotes its membership within a corresponding category. A large dot represents a category centroid, which is the average feature vector for the documents in that category. For each rendering we select the centroids of three categories to form a plane. All the documents are then projected onto this plane.

This visualization is useful for exploring the relationship between various categories. We can quickly see which categories are close in proximity, and we can find specific documents that may span these categories by selecting documents that lie on their respective borders.
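
One way to realize the projection described above is to build an orthonormal basis for the plane spanned by the three selected centroids and drop every document vector onto it (a geometric sketch under our own assumptions; the actual rendering code is not shown in the paper):

    import numpy as np

    def project_to_plane(X, c1, c2, c3):
        """Return 2-D coordinates of the rows of X on the plane through c1, c2, c3."""
        # Two in-plane directions, orthonormalized by Gram-Schmidt.
        u = c2 - c1
        u = u / np.linalg.norm(u)
        v = c3 - c1
        v = v - (v @ u) * u
        v = v / np.linalg.norm(v)
        return np.column_stack(((X - c1) @ u, (X - c1) @ v))

Repeating this projection over different choices of three centroids gives the successive views used for touring.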

Figure 3 eClassifier visualization

The visualization gives multiple views of the data by allowing the user to select different planes on which to project. This can be done for all possible selections of three centroids to show many different views of the data, in procession. This process is known as touring [17]. The visualization also has a “navigator mode,” which displays only closely related categories and allows the user to navigate by clicking on encircled centroids to show that category’s most closely related categories.

Classification. Once a taxonomy is created for a document collection, it is often useful to assign additional documents to the taxonomy as they become available. To do this, eClassifier creates a batch classifier to process the additional documents. We have found that no single classification algorithm is superior under all circumstances, so we have implemented four algorithms and a methodology for evaluating which is best for a given document collection. For a given taxonomy, half of the documents are selected as the training set and half are left as the test set. A classification model is generated for each of the four implemented algorithms (nearest centroid, naive Bayes multivariate, naive Bayes multinomial, and decision tree) based on the training set. The best algorithm is then selected by determining classification accuracy performance on the test set. At each level of the taxonomy hierarchy a different classifier may be selected, based on which approach is most accurate at classifying the documents at that level. Additionally, as was the case during the clustering process, we allow complete control over the selection of the classification approach. Based on the (lack of) classification accuracy of the model selected, the user may choose to make adjustments to the taxonomy to improve the accuracy of the classifier. The classification accuracy for various classification algorithms can also be visualized in the class view (see Figure 1).
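
The model-selection step can be sketched with off-the-shelf scikit-learn classifiers standing in for the four algorithms named above (an illustration under that assumption; eClassifier's own implementations are not shown in the paper):

    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import BernoulliNB, MultinomialNB
    from sklearn.neighbors import NearestCentroid
    from sklearn.tree import DecisionTreeClassifier

    def select_classifier(X, labels, seed=0):
        """Train four candidate models on half the data and keep the most accurate."""
        X_train, X_test, y_train, y_test = train_test_split(
            X, labels, test_size=0.5, random_state=seed)
        candidates = {
            "nearest centroid": NearestCentroid(),
            "naive Bayes (multivariate)": BernoulliNB(),
            "naive Bayes (multinomial)": MultinomialNB(),
            "decision tree": DecisionTreeClassifier(random_state=seed),
        }
        scores = {name: model.fit(X_train, y_train).score(X_test, y_test)
                  for name, model in candidates.items()}
        best = max(scores, key=scores.get)
        return best, candidates[best], scores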

Analysis and reporting. In addition to the techniques described for taxonomy generation and maintenance, eClassifier provides several techniques for deeper analysis of the text, for example, discovering correlations of the text with corresponding data and comparing document collections. The first technique we call “FAQ analysis” because it has commonly been applied to find frequently asked questions in help-desk data sets, although it can, in general, find frequently occurring topics in any document collection. Discovery of correlations is useful when analyzing a given taxonomy with respect to time (trend analysis) or against other associated meta-data. eClassifier will run a chi-squared test to find statistical anomalies for a given category in relation to other categorical attributes associated with documents. Continuous variables, such as time, are made discrete before analysis. Analyzing an attribute vs. time in this way can lead to the discovery of spikes or other interesting trends. This technique can also be applied to any categorical data associated with the document. For example, assume we generated a technology-based taxonomy of patents using eClassifier. We could then analyze the patents to see which technologies are receiving the most patents over time. Once we know which technologies are “hot,” we could then analyze the patents with their associated corporate assignees to see which corporations are active in the hot technologies.
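
For instance, a category-by-quarter contingency table of document counts can be screened for anomalies with a standard chi-squared test (a sketch with invented counts, using scipy; the statistic eClassifier reports may be computed differently):

    import numpy as np
    from scipy.stats import chi2_contingency

    # Rows: taxonomy categories; columns: discretized time buckets (quarters).
    counts = np.array([
        [40, 42, 38, 95],   # hypothetical category whose volume spikes in the last quarter
        [55, 60, 58, 57],
        [30, 28, 33, 31],
    ])
    chi2, p_value, dof, expected = chi2_contingency(counts)
    # A small p_value flags a distribution over time that departs from expectation;
    # comparing counts with expected shows where the spike is.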

Another useful analysis is to use a generated taxonomy to compare document collections. For a given taxonomy and collection of documents, we can analyze a second collection of documents to discover which areas are poorly covered within the taxonomy. We have applied this technique to help-desk problem tickets and the associated self-help knowledge bases to identify knowledge gaps, for example, problems that are not well covered in the knowledge base. This can also be used to compare a collection of requirements documents against a collection of capability documents to discover knowledge-gap deficiencies.

An integration paradigm

In each of the environments discussed earlier, text is ultimately associated with business data records to enhance the understanding of the data. In some analysis-oriented environments, just bringing the associated text “near” the data with a flexible, interactive browsing and analysis tool such as eClassifier is sufficient to provide the user with some explanation for the business phenomenon. In the “discovery” environment this may be the natural and only realizable paradigm. Therefore, in addition to search capability, mechanisms to discover patterns, attributes, and schema in the documents, allowing them to be readily associated with the data, and tooling to provide an interactive analysis environment for both data and text will be helpful here. Though a valuable step, this approach has scalability problems if the number of sources or the number of associated documents is large.

In the more narrowly constrained first and second environments discussed earlier, we might strive to achieve a tighter integration of the text information with the associated data. One way to do this is to use an OLAP multidimensional data model [1] as the integrating mechanism. The typical dimensional data model for an OLAP system uses a star schema as the model for a data cube. A basic star schema consists of a fact table at its center and a corresponding set of dimension tables, as shown in Figure 4. A fact table is a normalized table that consists of a set of measures or facts and a set of attributes represented by foreign keys into a set of dimension tables. The measures are typically numeric and additive (or at least partially additive). Because fact tables can have a very large number of rows, great effort is made to keep the columns as concise as possible. A dimension table is highly denormalized and contains the descriptive attributes of each fact table entry. These attributes can consist of multiple hierarchies as well as simple attributes. Because the dimension tables typically consist of less than 1 percent of the overall storage requirement, it is quite acceptable to repeat information to improve system query performance. The level at which the dimensions and measures encapsulate the data is referred to as the “fact grain.” An example of a low-level grain is at the transactional level, where the dimensions are the product, geography, and date of the transaction, and the measures are the dollar revenue and units sold.

In the example in Figure 4 we have three dimension tables: product, geography, and date. The product dimension has an associated hierarchy: group → type → product. The geography dimension has an associated hierarchy: country → state → city. The date dimension has an associated hierarchy: year → quarter → day. These three dimensions represent the attributes that we can use to analyze our facts. In this example, we have a revenue fact table. Each row in the fact table represents the aggregate transactions at the lowest level in each of the dimensions. In this case, each fact is aggregated at product, city, and day. The measures that are aggregated are revenue and units sold. This model allows us to explore the effect of product, geography, and date, at all levels in each hierarchy, on revenue and units sold and other measures computed from these values such as total revenue, average revenue, total units sold, and average units sold. Typically an analyst would use an application to analyze these various measures by looking at trends over time, or by finding weaknesses or strengths in products or geographies.
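
As a small illustration of this kind of dimensional analysis, the following pandas sketch uses the sample rows of Figure 4 (reconstructed below) to roll revenue up from the product/city/day grain to product group and quarter (our own illustrative code, not tied to any particular OLAP engine or to the Sapient tool):

    import pandas as pd

    facts = pd.DataFrame({
        "PKey": [1, 1, 3, 4], "GKey": [1, 2, 3, 4], "TKey": [2, 2, 3, 4],
        "Revenue": [1000, 2000, 25000, 20000], "Units": [1, 2, 5, 2],
    })
    product = pd.DataFrame({
        "PKey": [1, 2, 3, 4],
        "Group": ["Software", "Software", "Hardware", "Hardware"],
        "Product": ["DB2", "MQ Series", "S/390", "Thinkpad T20"],
    })
    date = pd.DataFrame({
        "DKey": [1, 2, 3, 4],
        "Year": [2002, 2002, 2001, 2001],
        "Quarter": ["Q1", "Q2", "Q4", "Q3"],
    })

    cube = (facts.merge(product, on="PKey")
                 .merge(date, left_on="TKey", right_on="DKey"))
    rollup = (cube.groupby(["Group", "Year", "Quarter"])
                  .agg(total_revenue=("Revenue", "sum"),
                       average_revenue=("Revenue", "mean"),
                       total_units=("Units", "sum"))
                  .reset_index())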

Figure 4 Example star schema data model

  PRODUCT DIMENSION
  PKey  Group     Type       Product
  01    Software  Database   DB2
  02    Software  Messaging  MQ Series
  03    Hardware  Server     S/390
  04    Hardware  PC         Thinkpad T20

  GEOGRAPHY DIMENSION
  GKey  Country  State   City
  01    USA      CA      San Jose
  02    USA      NY      NY
  03    USA      IL      Chicago
  04    Canada   Quebec  Toronto

  DATE DIMENSION
  DKey  Year  Quarter  Day
  01    2002  Q1       Jan 1
  02    2002  Q2       Apr 15
  03    2001  Q4       Dec 10
  04    2001  Q3       Aug 20

  REVENUE FACTS
  PKey  GKey  TKey  Revenue  Units
  01    01    02    1000     1
  01    02    02    2000     2
  03    03    03    25000    5
  04    04    04    20000    2

Integrating text information into this analysis requires progress in several areas of text analytics in which we are currently working. The first is the use of text classification technology either to find attributes in the documents that can be used to link them to the data, or to find attributes in the documents that can be used as additional dimensions to deepen the understanding of the data. Second, we are researching information extraction technologies to process the text and to compute quantitative values from the documents. The quantitative values can then be used as measures in a document fact table (Table 1). The combined data can not only be “sliced and diced” in the traditional OLAP paradigm of data analysis, but also the related documents can be explored in various ways that exploit their structure to make their content more useful. This interaction model and its underlying information model is an area for our current research.

Consider again the example in Figure 4. The facts have keys for the dimensions of product, geography, and date. Now we also have a database of problem tickets resulting from service calls. The problem tickets have meta-data recorded along with a transcript of the problem description. If we run a set of analyses over this collection of documents we can hope to accomplish several things. First, by using a classification process we can divide the problem tickets by problem type, thereby creating a new dimension, in addition to the existing meta-data dimensions, into which problem tickets can be categorized. Second, by running experimental text analyses over the text of the problem tickets we can attempt to quantify the severity of the problem in the ticket. Upon doing this, the problem ticket documents can be organized into their own fact table with the schema shown in Table 1.

Table 1 Schema for problem ticket documents: Product key, Geography key, Date key, Customer key, Problem type key, Days open, Severity of the complaint, Ticket identifier

The first four columns, which are foreign keys into dimension tables, are derived from the meta-data associated with the tickets in the problem ticket database. The fifth column is now a dimension associated with the problem that was created by automatically classifying the tickets. The sixth column is a measure associated with the problem that can be calculated from the meta-data. The seventh column is a measure of the severity of the problem as calculated by a text analysis of the transcription of the call. This may be a scoring of the frustration or anger felt by the caller. Finally, the last column ties this document fact back to the original document in the ticket database to facilitate movement from the OLAP environment of these facts into a document analysis environment.

Given these fact table schemas, if we roll up the first fact table (Figure 4) along the product, geography, and date dimensions, while computing the average dollar sales and average units sold, and if we roll up the second table along the product, geography, date, and problem dimensions while computing the number of customer keys, the average days open, and the average complaint severity, then the join will give us a picture of the revenue as well as the problem costs for each product, per location, per time period associated with a problem type (e.g., installation, missing CDs, etc.). Then, with an integrated tooling environment we can perform this type of quantitative dimensional OLAP analysis and then seamlessly move into a document analysis to understand the complaints in more depth. A discussion of such an experimental tooling environment that has been built at the Almaden Research Center follows.
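
A pandas sketch of that two-cube roll-up and join, with column names invented to follow Table 1 and the Figure 4 schema (the actual Sapient schema and tooling are not reproduced here):

    import pandas as pd

    def revenue_vs_problem_cost(revenue_cube, ticket_cube):
        """Join average revenue with problem-ticket cost measures per shared keys."""
        keys = ["product_key", "geography_key", "date_key"]
        revenue = (revenue_cube.groupby(keys)
                   .agg(avg_revenue=("revenue", "mean"),
                        avg_units=("units", "mean"))
                   .reset_index())
        problems = (ticket_cube.groupby(keys + ["problem_type_key"])
                    .agg(num_customers=("customer_key", "nunique"),
                         avg_days_open=("days_open", "mean"),
                         avg_severity=("severity", "mean"))
                    .reset_index())
        # Revenue and problem cost per product, location, period, and problem type.
        return problems.merge(revenue, on=keys, how="left")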

Integrated BIKM tools (Sapient & eClassifier). In the previous section we describe our text analysis system, eClassifier. In this section we describe the tooling we have built to apply the OLAP data model to text documents, creating a document warehouse. We then describe how we link the data model for the data and the documents through shared dimensions, and how this enhances our analytical capabilities. Finally, we describe how text analytics can be used to dynamically enhance this data model with what we call dynamic dimensions.

The tool we have developed allows us to explore data cubes with a star schema and consists of a report view and navigational controls. The report view provides a view of the results of data queries on a data cube. The reports can be summary tables (Figure 5), trend line graphs (Figure 6), or pie charts. An important part of the navigational controls is the pair of dimension and metric selection boxes. The dimension selection box allows the user to select and drill down on each dimension. This includes drilling down a dimension hierarchy or cross drilling from one dimension to another. The metric selection box allows the user to select metrics that are computable for the given data cube. Additional navigation buttons allow forward and backward navigation to view previous reports. Other navigation controls are discussed later.

Figure 5 Document counts for products

Figure 6 Time trend chart for products

Document warehousing. We extend the techniques used on data in business intelligence to documents by using a dimensional model where the fact table granularity is a document, and the dimension tables hold the attributes of the document. Without additional processing this representation is a “factless” fact table, because there are, as yet, no associated measures. The process of populating the document warehouse has some complexities beyond typical ETL processing. In many cases the source of the documents is not an operational data store. Typically documents are automatically and incrementally put into the document warehouse based on either a subscription (push model) or a scheduled retrieval process (pull model). Additionally, we need a method to filter the documents because not all documents will be relevant to the purpose of the document cube.
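
A minimal sketch of a scheduled (pull-model) load with a relevance filter, producing factless fact rows (fetch_documents, is_relevant, and resolve_keys are hypothetical callables standing in for whatever source adapter, filter, and key mapping an installation would provide):

    def load_document_warehouse(fetch_documents, is_relevant, resolve_keys, fact_rows):
        """Fetch new documents, filter them, and append factless fact rows."""
        for doc in fetch_documents():             # e.g. everything published since the last run
            if not is_relevant(doc):              # drop documents outside the cube's purpose
                continue
            keys = resolve_keys(doc["metadata"])  # map author/date/source meta-data to dimension keys
            # A factless fact row: dimension keys plus a document pointer, no measures yet.
            fact_rows.append({**keys, "doc_id": doc["doc_id"]})
        return fact_rows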

Depending on the source, most documents have some associated meta-data that can naturally be used to populate some dimensions, such as author, date of publication, and document source. However, there are dimensions of potential interest that may not be included in the meta-data. If the dimension is known, classification techniques can be used to populate it. Using this model, all of the techniques previously described that are available to data cubes are now available to document cubes.

Shared dimensions. Thus far we have shown how star schemas can be used to organize and analyze both data and document cubes. Although each on its own can provide very useful information, providing a mechanism to link them will allow deeper analysis and thereby provide greater value. As an example, we revisit our product-geography-date revenue cube from Figure 4. If we have a collection of documents that are relevant to the given products, in the given geographies, over the given times, the information they contain and its relationship to the business data analysis can greatly improve decision making. Some documents that could provide insight in this example would be sales logs, customer support logs, news and press articles, marketing material, and discussion groups. All of these could provide unique insights into why a product is selling well or poorly in a given geography during a given time frame. The key to achieving these insights is to directly link the data to the documents through shared dimensions. An example data model of data and document cubes with shared dimensions is illustrated in Figure 7.

Figure 7 Shared dimension data model: a DATA FACT table (metrics: Transaction Counts, Total Revenue, Average Revenue, Total Units, Average Units) and a DOCUMENT FACT table (metric: Document Counts) linked through shared PRODUCT (Prod_ID, Prod_Line, Prod_Group, Product), GEOGRAPHY (Geo_ID, Site, Location), and DATE (Date_ID, Date, Month, Year) dimensions; the document fact also carries CAUSE (Cause_ID, Cause), SUBJECT (Subj_ID, Subject), SOLUTION (Sol_ID, Solution), and DOCUMENT (Doc_ID, Title, Abstract) dimensions and an associated TOKEN table (Doc_ID, Token_ID, Offset).

Dynamic dimensions. At this point we have data and document cubes that are linked through shared dimensions. All of the analytical techniques used on data cubes can be used on the document cubes. Given the linkage created by shared dimensions, we can use the constraints used to identify a subset of data to then identify the corresponding set of documents and then make inferences from those documents about the data. For example, if the data show a drop in revenue for a product in certain geographies during a given time period, we can use these constraints on the document cube to identify the documents that might best explain the drop in revenue. We can then use standard OLAP techniques to investigate the relationship to any additional (non-shared) dimensions available for the documents. However, sometimes the existing dimensions and their taxonomies may be insufficient to fully explain the data. The documents can then be further analyzed using a deeper text analytical system such as eClassifier. We have provided this in our BIKM system by augmenting the document warehouse with an additional table (i.e., the token table) that has the document identifier, token identifier, and token offset for every token in every document (shown in Figure 7). The token table allows us to dynamically select (extract) and initiate eClassifier on an arbitrary subset of the documents from the document warehouse. Once we have invoked eClassifier on the documents we can perform all of the analytical capabilities outlined previously.
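
With the token table in place, extracting a slice for eClassifier is a selection plus a join, roughly as follows (a pandas sketch with column names taken from Figure 7; the actual extraction path is not shown in the paper):

    import pandas as pd

    def extract_slice_tokens(document_fact, token_table, constraints):
        """Select documents matching OLAP-style constraints and return their tokens.

        constraints example: {"Prod_ID": 4, "Geo_ID": 2} for the slice under study.
        """
        docs = document_fact
        for column, value in constraints.items():
            docs = docs[docs[column] == value]
        doc_ids = docs["Doc_ID"].unique()
        # All tokens, in offset order, for just the selected documents.
        tokens = (token_table[token_table["Doc_ID"].isin(doc_ids)]
                  .sort_values(["Doc_ID", "Offset"]))
        return doc_ids, tokens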

Furthermore, eClassifier can be used to create a new taxonomy over this selected set of documents. This new taxonomy is effectively a new (hierarchical) dimension that adds value to the existing data and document cubes. For example, problem tickets can be classified into problem types. This dimension provides a finer granularity for understanding the problems that are contributing to the costs associated with products in a given region and time period.

The new taxonomy can be made available to the document warehouse by creating a corresponding dimension table to represent the taxonomy and then populating an added column in the fact table, associating all known documents with the newly published dimension. This new dimension is now available to all of the analytical and reporting capabilities in the OLAP environment. Additional processing can be performed to classify all of the documents that were not in the extracted set of documents into the new dimension.
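
Publishing a taxonomy back to the warehouse then amounts to building a dimension table from the category labels and adding the corresponding foreign key to the document fact table (a pandas sketch; the column names are assumptions in the spirit of Figure 7):

    import pandas as pd

    def publish_taxonomy(document_fact, doc_categories, dim_name="NewTaxonomy"):
        """doc_categories: DataFrame with Doc_ID and Category columns from the taxonomy."""
        # Dimension table: one row per category with a surrogate key.
        dim = doc_categories[["Category"]].drop_duplicates().reset_index(drop=True)
        dim[dim_name + "_ID"] = dim.index + 1
        # Attach the new foreign key to the document fact table; documents that were
        # not in the extracted set are left unclassified (NaN) until reprocessed.
        keyed = doc_categories.merge(dim, on="Category")
        fact = document_fact.merge(keyed[["Doc_ID", dim_name + "_ID"]],
                                   on="Doc_ID", how="left")
        return fact, dim[[dim_name + "_ID", "Category"]]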

For example, we selected the “ThinkPad* T20” product (see Figure 5) and extracted into eClassifier the 2858 documents associated with this product. We used eClassifier to produce the new taxonomy shown in Figure 8. We then saved this taxonomy for the document warehouse by publishing it as the “new thinkpad taxonomy” dimension and updating the document fact table appropriately. This allows us to drill from within the data warehouse, and the results are shown in Figure 9.

Figure 8 eClassifier taxonomy for ThinkPad T20 documents

Figure 9 Dynamic dimension results

Summary and future research

The previous sections discuss our current integration model for data and text analysis and the tooling we have built to experiment with it. The missing, and somewhat open-ended, portion of this integration is the text analytics that will be used to create the quantitative metrics that populate the document cube and augment the data cube metrics. There is significant work going on in the IBM research community, especially within the unstructured information management area, to perform information extraction from documents. These efforts include: (1) extracting quantitative facts from documents (e.g., the financial terms of a contract); (2) deducing relationships between entities in a document (e.g., new product A competes with product B); and (3) measuring the level of subjective values such as severity or sentiment in documents (e.g., a customer letter reflects extreme displeasure with a company’s service). Currently we are exploring techniques to accomplish these tasks based on statistical machine-learning approaches. We hope to report on these in a future paper.

Another area of future research that we believe is promising is the integration of ontologies into the taxonomy generation and dimension publishing portions of our BIKM architecture. Ontologies provide a level of semantics that we do not currently address, allowing improved taxonomies and reasoning about the data and text. Furthermore, emerging ontological technologies such as the semantic Web can provide a vehicle to integrate the text and data under study with a far larger body of text and data, thereby expanding the potential insights.

In this paper we show that text integrated with business data can provide valuable insights for improving the quality of business decisions. We describe a text analysis framework and how to integrate it into a business intelligence data warehouse by introducing a document warehouse and linking the two through shared dimensions. We believe that this provides a platform on which to build and research new algorithms to find the currently hidden business value in the vast amount of text related to business data. Technologies in the areas of information extraction and integrated text and data mining will build on this framework, allowing it to achieve its full business potential.

Acknowledgments

The authors gratefully acknowledge Dharmendra Modha, Ray Strong, Justin Lessler, Thomas Brant, Iris Eiron, Hamid Pirahesh, Shivakumar Vaithyanathan, and Anant Jhingran for their contributions to eClassifier, Sapient, and the underlying ideas of BIKM.

*Trademark or registered trademark of International Business Machines Corporation.

Cited references

1. R. Kimball, The Data Warehouse Toolkit, John Wiley & Sons, Inc., New York (1996).
2. D. Sullivan, Document Warehousing and Text Mining, John Wiley & Sons, Inc., New York (2001).
3. T. Nasukawa and T. Nagano, “Text Analysis and Knowledge Mining System,” IBM Systems Journal 40, No. 4, 967–984 (2001).
4. W. Pohs, Practical Knowledge Management, IBM Press, Double Oak, TX (2001).
5. W. Pohs, G. Pinder, C. Dougherty, and M. White, “The Lotus Knowledge Discovery System: Tools and Experiences,” IBM Systems Journal 40, No. 4, 956–966 (2001).
6. See http://www-4.ibm.com/software/data/bi/banking/ezmart.htm.
7. M. Hernandez, R. J. Miller, and L. Haas, “Clio: A Semi-Automatic Tool for Schema Mapping,” Proceedings, Special Interest Group on Management of Data, Santa Barbara, CA (May 21–24, 2001).
8. See http://www.itl.nist.gov/iad/894.02/related_projects/muc/index.html.
9. S. Sarawagi, R. Agrawal, and N. Megiddo, “Discovery-Driven Exploration of OLAP Data Cubes,” Proceedings, 6th International Conference on Extending Database Technology, Valencia, Spain (March 23–27, 1998), pp. 168–182.
10. C. Kwok, O. Etzioni, and D. S. Weld, “Scaling Question Answering to the Web,” Proceedings, 10th International World Wide Web Conference, Hong Kong (May 1–5, 2001), available at http://www10.org/cdrom/papers/120/.
11. G. Salton and M. J. McGill, Introduction to Modern Retrieval, McGraw-Hill Publishing, New York (1983).
12. G. Salton and C. Buckley, “Term-Weighting Approaches in Automatic Text Retrieval,” Information Processing and Management 4, No. 5, 512–523 (1988).
13. R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis, John Wiley & Sons, Inc., New York (1973).
14. J. A. Hartigan, Clustering Algorithms, John Wiley & Sons, Inc., New York (1975).
15. E. Rasmussen, “Clustering Algorithms,” W. B. Frakes and R. Baeza-Yates, Editors, Information Retrieval: Data Structures and Algorithms, Prentice Hall, Englewood Cliffs, New Jersey (1992), pp. 419–442.
16. S. Vaithyanathan and B. Dom, “Model-Based Hierarchical Clustering,” available at http://www.almaden.ibm.com/cs/people/dom/papers/uai2k.ps.
17. I. Dhillon, D. Modha, and S. Spangler, “Visualizing Class Structures of Multi-Dimensional Data,” Proceedings, 30th Conference on Interface, Computer Science and Statistics, May 1998.

Accepted for publication July 12, 2002.

William F. Cody IBM Research Division, Almaden Research Center, 650 Harry Road, San Jose, California 95120 (electronic mail: [email protected]). Dr. Cody is a senior manager of the Knowledge Middleware and Technology group at IBM’s Almaden Research Center. He received his Ph.D. degree in mathematics in 1979 and has held various product development, research, and management positions with IBM since joining the company in 1974. He has published papers on database applications, database technology, software engineering, and group theory.

Jeffrey T. Kreulen IBM Research Division, Almaden Research Center, 650 Harry Road, San Jose, California 95120 (electronic mail: [email protected]). Dr. Kreulen is a manager at the IBM Almaden Research Center. He holds a B.S. degree in applied mathematics (computer science) from Carnegie Mellon University and an M.S. degree in electrical engineering and a Ph.D. degree in computer engineering, both from Pennsylvania State University. Since joining IBM in 1992, he has worked on multiprocessor systems design and verification, operating systems, systems management, Web-based service delivery, and integrated text and data analysis.

Vikas Krishna IBM Research Division, Almaden Research Center, 650 Harry Road, San Jose, California 95120 (electronic mail: [email protected]). Mr. Krishna is a software engineer at the IBM Almaden Research Center. He holds a B.Tech. degree in naval architecture from IIT Madras, an M.E. degree in computational fluid dynamics from Memorial University, Newfoundland, Canada, and an M.S. degree in computer engineering from Syracuse University, New York. Since joining IBM in 1997, he has developed systems for Web-based service delivery, business-to-business information exchange, and the integrated analysis of text and data.

W. Scott Spangler IBM Research Division, Almaden Research Center, 650 Harry Road, San Jose, California 95120 (electronic mail: [email protected]). Mr. Spangler has been doing knowledge base and data mining research for the past 15 years, lately at IBM and previously at the General Motors Technical Center, where he won the prestigious “Boss” Kettering award (1992) for technical achievement. Since coming to IBM in 1996, he has developed software components, available through the Lotus Discovery Server product and IBM alphaWorks*, for data visualization and text mining. He holds a B.S. degree in mathematics from the Massachusetts Institute of Technology and an M.S. degree in computer science from the University of Texas.
