
www.elsevier.com/locate/dsw

Decision Support Systems 42 (2006) 727–744

The concept of document warehousing for multi-dimensional modeling of textual-based business intelligence

Frank S.C. Tseng a,*, Annie Y.H. Chou b,1

a Department of Information Management, National Kaohsiung First University of Science and Technology, 1 University Road, YenChao, Kaohsiung, Taiwan 824, ROC

b Department of Computer and Information Science, Chinese Military Academy, Taiwan

Available online 17 June 2005

Abstract

During the past decade, data warehousing has been widely adopted in the business community. It provides multi-dimensional analyses of cumulated historical business data to help contemporary administrative decision-making. Nevertheless, it is believed that only about 20% of the information can be extracted from data warehouses, which concern numeric data only; the other 80% is hidden in non-numeric data or even in documents. Therefore, many researchers now advocate that it is time to conduct research on document warehousing to capture complete business intelligence. Document warehouses, unlike traditional document management systems, include extensive semantic information about documents, cross-document feature relations, and document grouping or clustering to provide more accurate and more efficient access to text-oriented business intelligence. In this paper, we discuss the basic concept of document warehousing and present its formal definitions. Then, we propose a general system framework and elaborate some useful applications to illustrate the importance of document warehousing. The work is essential for establishing an infrastructure to help combine text processing with numeric OLAP processing technologies. The combination of data warehousing and document warehousing will be one of the most important kernels of knowledge management and customer relationship management applications.

© 2005 Elsevier B.V. All rights reserved.

Keywords: Data warehousing; Document warehousing; Knowledge management; OLAP

0167-9236/$ - see front matter © 2005 Elsevier B.V. All rights reserved.

doi:10.1016/j.dss.2005.02.011

This research was partially supported by the National Science Council, Taiwan, ROC, under Contract No. NSC-91-2416-H-327-005.

* Corresponding author. Tel.: +886 7 6011000x4113; fax: +886 7 7659541.

E-mail addresses: [email protected] (F.S.C. Tseng), [email protected] (A.Y.H. Chou).

1 Tel.: +886 7 7438179.

1. Introduction

Data warehousing [18] and data mining techniques [17] are gaining popularity as organizations realize the benefits of being able to perform multi-dimensional analyses of cumulated historical business data to help contemporary administrative decision-making [2,4,15,17,22]. This inspires enterprises to eagerly seek useful business intelligence (BI) from both internal and external data. Business intelligence is supposed to provide decision-makers with the tactical and strategic information they need for understanding, managing, and coordinating the operations and processes in organizations.

However, much of this effort has only touched the tip of the information iceberg. While the techniques regarding data warehouses, multi-dimensional models, on-line analytical processing (OLAP), or even ad hoc reports have served enterprises well, they do not completely address the full scope of business intelligence. It is believed [42] that, for the business intelligence of an enterprise, only about 20% of the information can be extracted from formatted data stored in relational databases. The remaining 80% is hidden in unstructured or semi-structured documents. This is because the most prevalent medium for expressing information and knowledge is text. For instance, market survey reports, project status reports, meeting records, customer complaints, e-mails, patent application sheets, and advertisements of competitors are all recorded in documents.

Moreover, documents on the Web, in enterprise repositories, and in public document management systems are all growing as well. Therefore, knowledge workers, managers, and executives still have to spend much of their working time reading dozens, if not hundreds, of various types of electronic documents spread over the Internet. There is just too much text to digest in daily life. The fast-growing and tremendous amount of documents has far exceeded the human ability for comprehension without powerful tools. As a result, when important decisions are made, some relevant documents may be ignored, and some irrelevant documents may be considered by intuition. We believe that leaving out information induced from relevant documents, or keeping information intuitively guessed from irrelevant documents, may be detrimental, leading to disastrous strategies woven from incomplete information.

To alleviate this problem, Grigsby [14], McCabe et al. [26] and Sullivan [33] have advocated that documents should be properly warehoused according to some well-defined concepts for expanding the scope of business intelligence to include textual information. Ishikawa and colleagues [19–21] went further by implementing a prototype system to support the management of compound documents as well as keyword-based and content-based retrieval. They used ECA rules to classify multimedia documents, and SOM (Self-Organizing Map) to cluster a set of collected texts into a number of groups in a retrieval space of manageable dimensions.

Hence, we think one of the next challenges of the information community will be the study of document warehousing and text mining to help enterprises obtain complete business intelligence. Although research on text mining has been conducted widely (for example, readers are referred to Refs. [44,23–25,40,34]), the issues regarding document warehousing are rarely addressed. We have proposed a multi-dimensional indexing structure, called D-tree, in Ref. [36] to study the performance measurement for constructing document warehouses. Some theoretical analyses on the properties of indexing a document warehouse were also elaborated in Ref. [35]. With document warehouses, the documents of enterprises can be well organized for effective analysis or feature extraction to create distilled and fruitful business intelligence.

Since there are usually many diverse concepts involved in a document, a document is multi-dimensional in nature. Document warehouses, unlike traditional document management systems, include extensive semantic information about documents, cross-document feature relations, and document grouping or clustering to provide more accurate and more efficient access to text-oriented business intelligence. To facilitate flexible and effective multi-dimensional on-line analytical document processing and browsing, a multi-dimensional query language for querying document warehouses is indispensable. In Ref. [37], we have devised a multi-dimensional query expression for querying document warehouses to provide users with an easy and efficient way of performing on-line analytical processing on documents.

Although issues about document warehousing have been addressed in Refs. [14,26,33], no formal definitions have been established so far. In this work, we will first discuss the concept of document warehousing and formally define the related terms. Then, we propose a framework for document warehousing and elaborate some applications of document warehousing to sketch an attractive roadmap of using document warehouses.


As Web applications proliferate tremendously, there will be a great need for rapid text processing and browsing. Document warehousing not only provides an infrastructure for developing tools with which business executives can systematically organize, understand, and properly categorize their documents to help strategic decision-making, but also integrates all kinds of related documents so that they can be browsed instantly.

Document warehousing also provides an important platform for on-line analytical processing (OLAP) at the text level for the interactive analysis of multi-dimensional documents of various granularities, which facilitates effective text mining, integrates documents into the business intelligence infrastructure, and provides the means to search for and target specific information the way we now do with numeric data. Furthermore, as the construction of data warehouses can be viewed as an important step for data mining, the construction of document warehouses can be regarded as an indispensable preprocessing step for text mining. We realize that, no matter how wonderful the mechanism a system adopts, it cannot do much without good content organization of the domain on which it is to work. Moreover, we often recognize that, once a good content organization is available, many different mechanisms might be employed equally well to implement effective systems. A well-organized document warehouse provides these mechanisms with a good content organization to work on.

In this paper, based on our prior work [35–37], we further illustrate the general architecture of a document warehouse and its applications. The work is essential for establishing an infrastructure to help combine text processing with numeric OLAP processing technologies. We believe such an infrastructure can help extend numeric data analysis for combination with text processing technologies to make data warehousing and document warehousing one of the most important kernels of knowledge management and customer relationship management applications. By combining document warehousing and data warehousing, documents can be integrated into the business intelligence infrastructure and can provide the means to search for and target specific information the way we now do with numeric data.

Although the content of documents is often more than text and may include some graphics or even multimedia data, in this paper we only consider the textual parts of documents. In the future, we will extend our work to encompass the entirety of documents, which may be called multimedia warehousing [19–21].

Our paper is organized as follows. In Section 2, the important concepts of document warehousing are formally presented. Based on these definitions, we propose a general architecture for constructing document warehouses in Section 3. Some of the applications of document warehouses are then discussed in Section 4. Finally, we conclude and propose some future work in Section 5.

2. An introduction to document warehousing

In the following, we give some definitions about

document, dimension, document tuple, and document

cube for document warehousing.

Definition 1. A document $T = \{k_1, k_2, \ldots, k_i\}$ is a logical unit of text characterized by a set of keywords $\{k_1, k_2, \ldots, k_i\}$.

To organize documents into structures, we need the

concept of dimension defined as follows.

Definition 2. A dimension D is a tree structure of m levels, $m \ge 1$, which is used for representing the hierarchical relationships among a set of keywords. A node in a dimension D is called a member, and each internal node contains a special child called the summary member, denoted '*', which is used for denoting the total concept of the other children of the internal node.

When drawing a dimension, we usually leave out the summary members, since each has the same meaning as its parent node. Besides, the keywords in a dimension are not limited to those contained in document contents. Any property or metadata of a document file (e.g., those defined in the Dublin Core Metadata Element Set [38]) can also be regarded as a keyword in a dimension for constructing document cubes. Furthermore, if documents are organized into predefined categories, the category hierarchy to which a document belongs can also be regarded as a dimension. That is, text is not unstructured as is often assumed, which has been pointed out by Sullivan [33]. The concept of dimension can be employed to model the structure inherently hidden in text.

Table 1
A relation Region and its alternative for constructing dimension R

(a)
Location  City
South     Tainan
South     Kaohsiung
South     Pingtong
North     Taipei
North     Taoyun
North     Hsinchu

(b)
Tag  Parent  Level  Keyword
1    1       1      (All Region)
2    1       2      South
3    1       2      North
4    2       3      Tainan
5    2       3      Kaohsiung
6    2       3      Pingtong
7    3       3      Taipei
8    3       3      Taoyun
9    3       3      Hsinchu

According to the keyword sources, dimensions can be distinguished into the following types:

1. Ordinary dimension. A dimension containing keywords used for scanning the document contents.

2. Metadata dimension. A dimension containing keywords used for scanning document file properties or metadata. For example, the Dublin Core Metadata Element Set defines title, creator, subject, description, publisher, contributor, date, type, format, identifier, source, language, relation, coverage, and rights; all of these can be regarded as metadata dimensions.

3. Category dimension. A dimension containing keywords corresponding to the nodes in a category hierarchy, such as WordNet [27,41], in which all considered documents should be multi-categorized. If a document is not already related to such a dimension, the relation can be determined manually or assigned automatically by document categorization tools.

To simplify our discussion, we mainly use ordinary

dimensions, together with the metadata dimension

time (i.e., date), in the following examples.

Definition 3. For a dimension D, the ith-level member set, denoted $D(i)$, is defined as $D(i) = \{a \mid a \text{ is a member in the } i\text{th level of } D \text{, but } a \text{ is not a summary member}\}$. Besides, we use $D(0)$ to denote the union of all non-summary members in D, which is the union of all ith-level member sets in D. That is, $D(0) = \bigcup_{1 \le i \le h} D(i)$, where h is the height of D. In practice, each $D(i)$ has a specific name, which will be called the ith-level name.

Practically, a dimension can be constructed from a relational table, with each level corresponding to an attribute in the relation and the attribute names usually used as the corresponding level names. Besides, any keyword in a dimension can be implemented as a set of synonyms to encompass more semantics. To illustrate the above definitions, we give an example as follows.

Example 1. Suppose there is a relation Region representing the regions of Taiwan as shown in Table 1(a). An alternative representation is shown in Table 1(b). This relation can be used to construct a dimension, denoted R, as depicted in Fig. 1, where the first level corresponds to the dimension itself, which is commonly denoted "(All Region)", and the second and third levels are derived from the attributes Location and City, respectively. All nodes in Fig. 1 with label '*' are summary members. That is, the summary member in the second level has the same meaning as all regions in Taiwan, which represents {South, North}. Besides, the summary members under South and North have the same meanings as South and North, which denote {Tainan, Kaohsiung, Pingtong} and {Taipei, Taoyun, Hsinchu}, respectively. By omitting all the summary members, Fig. 1 is redrawn in Fig. 2. According to the illustration of dimension R, we know that R(1) = {(All Region)}, R(2) = {South, North}, R(3) = {Tainan, Kaohsiung, Pingtong, Taipei, Taoyun, Hsinchu}, and R(0) = {(All Region), South, North, Tainan, Kaohsiung, Pingtong, Taipei, Taoyun, Hsinchu}.
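The level member sets of Example 1 can be reproduced with a few lines of Python. The sketch below is only an illustration: it assumes the dimension is given as the rows of Table 1(b), and the names `REGION_ROWS` and `level_member_sets` are hypothetical. Summary members are not stored, since each has the same meaning as its parent.

```python
from collections import defaultdict

# Rows of Table 1(b): (Tag, Parent, Level, Keyword). The root is its own parent.
REGION_ROWS = [
    (1, 1, 1, "(All Region)"),
    (2, 1, 2, "South"),
    (3, 1, 2, "North"),
    (4, 2, 3, "Tainan"),
    (5, 2, 3, "Kaohsiung"),
    (6, 2, 3, "Pingtong"),
    (7, 3, 3, "Taipei"),
    (8, 3, 3, "Taoyun"),
    (9, 3, 3, "Hsinchu"),
]


def level_member_sets(rows):
    """Return {i: D(i)} for every level i, plus D(0) as the union of all levels."""
    levels = defaultdict(set)
    for _tag, _parent, level, keyword in rows:
        levels[level].add(keyword)
    result = dict(levels)
    # D(0) is the union of all ith-level member sets (Definition 3).
    result[0] = set().union(*levels.values())
    return result


R = level_member_sets(REGION_ROWS)
print(sorted(R[2]))  # ['North', 'South']
print(len(R[0]))     # 9
```

Keeping the levels in a dictionary keyed by level number mirrors the notation R(i) directly, so R[0] plays the role of R(0).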

For a dimension D, there are two basic operations

called drill-down and roll-up, which are formally

defined as follows.

Definition 4. For a dimension D, expanding an internal node to obtain all of its children is called drill-down, and shrinking a set of children to obtain their common parent is called roll-up.

Fig. 1. An illustration of dimension R (level 1: (All Region); level 2: South, North; level 3: Tainan, Kaohsiung, Pingtong, Taipei, Taoyun, Hsinchu; each internal node also has a summary member '*').

This can be further clarified by the following

definitions.

Definition 5. For any two n-tuples of keywords $A = (a_1, a_2, \ldots, a_i, \ldots, a_n)$ and $B = (b_1, b_2, \ldots, b_i, \ldots, b_n)$ defined on n dimensions $(D_1, D_2, \ldots, D_i, \ldots, D_n)$, where $a_i, b_i \in D_i(0)$, we say that B is a member of drilling down A along dimension $D_i$ (or A is a member of rolling up B along dimension $D_i$), denoted $A \succ_i B$, if and only if there exists exactly one i, $1 \le i \le n$, such that $b_i$ is a child of $a_i$ in $D_i$, and $b_j = a_j$ for all $j \ne i$.

Definition 6. For a document T with unique identifier $id_T$, a document index of T defined on n dimensions $(D_1, D_2, \ldots, D_n)$ is denoted $x = (id_T, K_T)$, where $K_T = (K_1, K_2, \ldots, K_i, \ldots, K_n)$ is an n-tuple of keyword sets such that, for all $1 \le i \le n$, each $K_i$ contains a set of keywords and every keyword $k_{ij} \in K_i$ satisfies $k_{ij} \in T$ and $k_{ij} \in D_i(0)$.

Fig. 2. A concise illustration of dimension R (summary members omitted; level names: All Region, Location, City; nodes tagged 1 to 9 as in Table 1(b)).

For simplicity, the first and second components of a document index $x = (id_T, K_T)$ will be denoted $x^1$ and $x^2$ (i.e., $x^1 = id_T$ and $x^2 = K_T$), respectively. When all $|K_i| = 1$, the document index is also called a base document index, and each $K_i$ can also be denoted by its only element for convenience (that is, in such cases, $K_T = (\{k_1\}, \{k_2\}, \ldots, \{k_i\}, \ldots, \{k_n\})$ can be abbreviated as $K_T = (k_1, k_2, \ldots, k_i, \ldots, k_n)$). If there is at least one $K_i$ such that $|K_i| > 1$, and the sizes of the other $K_j$'s all equal 1, then the document index is also called a composite document index. Finally, if there is some $K_i$ such that $|K_i| = 0$, then the document index is also called a degenerate document index. In the following, a degenerate document index with some $|K_i| = 0$ will be generalized by using the top-level member set of the corresponding dimension, $D_i(1)$, to substitute for the missing keyword set $K_i$.

Example 2. Suppose there is a complaint e-mail issued by a customer as shown in Fig. 3. Then, a base document index of T defined on the two dimensions (R, P), region and product, can be obtained as x = (A0001, ({Kaohsiung}, {TV})), where A0001 is the unique identifier of T.

To whom it may concern: We have bought a TV from your Kaohsiung branch last weekend. However, we found the screen is severely unstable. Please give us the phone number of your service center. Thank you for your kindly help.

Sincerely,

Frank S.C. Tseng

Fig. 3. A complaint e-mail issued by a customer (A0001).
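The construction of a document index and the three kinds of index named above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: a document is reduced to a plain set of keywords, each dimension is given by its member set D_i(0), and the names `document_index`, `classify`, `R0`, and `P0` are hypothetical.

```python
def document_index(doc_id, doc_keywords, dimensions):
    """Build x = (id_T, K_T): K_i holds the document keywords found in D_i(0)."""
    k_t = tuple(frozenset(doc_keywords & members) for members in dimensions)
    return (doc_id, k_t)


def classify(index, top_levels):
    """Name the kind of index; a degenerate one is generalized with D_i(1)."""
    doc_id, k_t = index
    sizes = [len(k) for k in k_t]
    if any(size == 0 for size in sizes):
        # Substitute the top-level member set D_i(1) for each missing K_i.
        filled = tuple(k if k else frozenset(top) for k, top in zip(k_t, top_levels))
        return "degenerate", (doc_id, filled)
    if all(size == 1 for size in sizes):
        return "base", index
    return "composite", index


# Member sets R(0) and P(0); P0 lists only a few product members for brevity.
R0 = {"(All Region)", "South", "North", "Tainan", "Kaohsiung", "Pingtong",
      "Taipei", "Taoyun", "Hsinchu"}
P0 = {"(All Product)", "Appliance", "TV", "Refrigerator"}
TOPS = [{"(All Region)"}, {"(All Product)"}]

# Keywords assumed to have been extracted from the e-mail of Fig. 3.
x = document_index("A0001", {"TV", "Kaohsiung", "screen", "unstable"}, [R0, P0])
print(classify(x, TOPS)[0])  # base
```

A document that mentions no region at all would yield an empty first keyword set, and `classify` would then return the generalized index with {(All Region)} in its place.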

The basic component of a document cube is called

a cell, which is defined as follows.

Definition 7. A cell defined on n dimensions $(D_1, D_2, \ldots, D_n)$ is denoted $c = (t_c, X_c)$, where $t_c = (c_1, c_2, \ldots, c_i, \ldots, c_n)$, $c_i \in D_i(0) \cup \{*\}$, $1 \le i \le n$, and $X_c = \{x_1, x_2, \ldots, x_j, \ldots, x_m\}$ is a set of document indices of the form $x_j = (id_{T_j}, (K_1, K_2, \ldots, K_n))$, where $id_{T_j}$ is the unique identifier of some document $T_j$ and $K_i \cap D_i(0) \ne \emptyset$, $1 \le i \le n$. The set of all such document unique identifiers $id_{T_j}$ involved in the cell $c = (t_c, X_c)$ is denoted $ID(c) = \{x_j^1 \mid \forall x_j \in X_c\}$. That is, a document with unique identifier in $ID(c)$ can be directly accessed from the cell c.

Definition 8. A cell $c = (t_c, X_c)$, where $t_c = (c_1, c_2, \ldots, c_i, \ldots, c_n)$, defined on n dimensions $(D_1, D_2, \ldots, D_n)$ is called an m-d cell, $0 \le m \le n$, if and only if there are exactly m non-summary members $c_i$ (i.e., $c_i \ne *$). If $m = n$ and $c_i \in D_i(h_i)$, where $h_i$ is the height of $D_i$, for all $1 \le i \le n$, then c is also called a base cell; otherwise c is called a non-base cell.

Definition 9. An n-dimensional i-d cell $a = ((a_1, a_2, \ldots, a_n), X_a)$ is a parent of another n-dimensional i-d cell $b = ((b_1, b_2, \ldots, b_n), X_b)$ if and only if the following conditions hold:

1. There exists exactly one k such that $a_k$ is the parent of $b_k$ in $D_k$, and $a_l = b_l$ for all $l \ne k$, $1 \le l \le n$.

2. $ID(b) \subseteq ID(a)$, where $ID(a)$ and $ID(b)$ are the sets of all document unique identifiers involved in cells a and b, respectively.

Fig. 4. A sample illustration of a document cube (axes: region, product, and time; a base cell d and a non-base cell a are marked, together with their identifier sets ID(d) and ID(a) pointing to documents T1, T2, and T3 in S).

Definition 10. A document cube $DC = (S, (D_1, D_2, \ldots, D_n))$, where S is a set of documents defined on n dimensions $(D_1, D_2, \ldots, D_n)$, is a cube composed of all cells $c_i = (t_{c_i}, X_{c_i})$ with $t_{c_i} \in \times_{1 \le j \le n} D_j(0)$ and $ID(c_i) \subseteq S$.
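One way to materialize such a cube is to map every cell tuple to its identifier set ID(c). The sketch below is an assumption-laden illustration: dimensions are child-to-parent maps whose root has no entry, a summary member is represented by its parent (the two have the same meaning), only two dimensions are used, and the documents A0002 and A0003 as well as all function names are hypothetical.

```python
from itertools import product


def ancestors_or_self(parent_of, member):
    """Path from `member` up to the top-level member of its dimension."""
    path = [member]
    while member in parent_of:  # the root has no entry, so the loop stops there
        member = parent_of[member]
        path.append(member)
    return path


def build_cube(indices, parent_maps):
    """Map each cell tuple t_c to ID(c), the set of document identifiers it reaches."""
    cube = {}
    for doc_id, keyword_sets in indices:
        per_dim = []
        for keywords, parent_of in zip(keyword_sets, parent_maps):
            members = set()
            for keyword in keywords:
                members.update(ancestors_or_self(parent_of, keyword))
            per_dim.append(members)
        # Register the document in its own cell and in every rolled-up cell.
        for t_c in product(*per_dim):
            cube.setdefault(t_c, set()).add(doc_id)
    return cube


PARENT_R = {
    "South": "(All Region)", "North": "(All Region)",
    "Tainan": "South", "Kaohsiung": "South", "Pingtong": "South",
    "Taipei": "North", "Taoyun": "North", "Hsinchu": "North",
}
PARENT_P = {"Appliance": "(All Product)", "TV": "Appliance", "Refrigerator": "Appliance"}

indices = [
    ("A0001", ({"Kaohsiung"}, {"TV"})),
    ("A0002", ({"Tainan"}, {"TV"})),
    ("A0003", ({"Taipei"}, {"Refrigerator"})),
]
cube = build_cube(indices, [PARENT_R, PARENT_P])
print(sorted(cube[("South", "TV")]))  # ['A0001', 'A0002']
```

Because every document is registered in all of its ancestor cells, the condition ID(b) ⊆ ID(a) of Definition 9 holds by construction for each parent cell a of a cell b.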

Based on the above definitions, a set of documents S can be multi-dimensionally indexed by a document cube $DC = (S, (D_1, D_2, \ldots, D_n))$, which allows users to browse documents by rolling up and drilling down along some dimensions $D_i$ for different granularities and perspectives, obtaining further insight into relationships among documents. A sample illustration of a document cube DC = (S, (R, P, T)) is shown in Fig. 4, where R and P represent the aforementioned dimensions region and product, respectively. Besides, we assume T is a dimension representing time.

3. A framework for document warehousing

Designing a comprehensive architecture for document warehousing can be challenging because document warehousing covers a wide spectrum of concepts, as we have shown in Section 2. Fortunately, a general architecture has already been established for data warehousing in Ref. [2]. Based on that architecture, we extend the constructs to include more features for document warehousing. The proposed architecture is shown in Fig. 5.

Fig. 5. The proposed architecture of document warehouses (components: document sources A, B, and C; front-end component; warehouse administrator; integrated document base; metadata; summarized and highly summarized documents; archive; back-end component; on-line analytical processing, text mining tools, and application programs).

Based on this architecture, we outline the general process of extraction, transformation, loading, dimensional modeling, and construction of a document warehouse in Fig. 6. In this process, documents stored in different document sources are transformed and loaded into the document base. At the same time, some of the metadata are retrieved or generated and stored in the metadata repository. Besides, the documents may be further integrated and categorized into groups in the document base according to their metadata or keywords. Then, by applying the dimensional modeling process, fruitful document cubes can be created for on-line analytical processing at the text level based on certain business models. Notice that a document cube does not need space to store the document contents; it only contains the dimension information and the file pointers, which can be used to trace back the original document contents stored in the document base. Finally, the processed result can be presented via hyper-linked Web presentations.

Fig. 6. The general process of extraction, transformation, loading, dimensional modeling, and construction of a document warehouse.

The major components of a document warehouse

are explained as follows.

3.1. Document sources

Documents for a document warehouse are supplied from:

1. Internal sources: In an organization, there are documents in various formats spread throughout the organization in any kind of document repository. The files may be in XML format, MS Word format, e-mail, or even plain text.

2. External sources: Documents may also come from the Internet, including Web pages, FTP sites, commercially available document bases, private documents shared by private servers, or document repositories associated with an organization's suppliers or customers.

3.2. Front-end component

The front-end component performs all the necessary pre-processing of documents, such as text summarization [12,16], text feature extraction [10], document categorization [3], or other text mining procedures [24,25,34], and then stores the obtained features or patterns in the metadata or stores the summarized result as another summarized document.

3.3. Warehouse administrator

The warehouse administrator performs all the operations associated with the management of the documents in the warehouse. The operations include:

1. Enrich the metadata of all stored documents: Some of the document metadata (e.g., those defined in the Dublin Core Metadata Element Set [38]) may be missing and should be added manually by the warehouse administrator.

2. Perform necessary text mining operations or generate summaries of documents, either manually or by software tools (e.g., IBM Intelligent Miner for Text [44]).

3. Create the dimensions and document indexes for constructing document cubes.

4. Archive documents and related data/metadata.

3.4. Back-end component

The back-end component performs all the operations responsible for the management of user queries. It is typically composed of a set of document access tools, a multi-dimensional document query interface [37], document warehouse monitoring tools, and customized tools.


3.5. Highly summarized documents

This part stores all the summaries derived from multiple documents that belong to the same cluster or category. Some work of this kind has already been reported [9,13].

The simplest format of a highly summarized document is a set of keywords appearing in the original document. Keywords of a document can be derived by computing the traditional tf*idf weights [28,29], pivoted cosine weights [31], or weights derived by any other term-weighting scheme.

Table 2
The proposed metadata design

Attribute name   Description
Title            A name given to the resource.
Creator          An entity primarily responsible for making the content of the resource.
Subject          A topic of the content of the resource.
Description      An account of the content of the resource.
Publisher        An entity responsible for making the resource available.
Contributor      An entity responsible for making contributions to the content of the resource.
Date             A date of an event in the lifecycle of the resource.
Type             The nature or genre of the content of the resource.
Format           The physical or digital manifestation of the resource.
Identifier       An unambiguous reference to the resource within a given context.
Source           A reference to a resource from which the present resource is derived.
Language         A language of the intellectual content of the resource.
Relation         A reference to a related resource.
Coverage         The extent or scope of the content of the resource.
Rights           Information about rights held in and over the resource.
...              ...
Keywords         The keyword set derived from the resource.
Summarization    A brief summary generated from the resource by a summarization tool.
File_Path        A file pointer used to address the resource.
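As a rough illustration of keyword derivation by term weighting, the sketch below ranks the terms of each document by tf * log(N / df). This particular formula is only one common tf*idf variant and is an assumption here, as are the whitespace tokenizer, the sample texts, and the function name `tfidf_keywords`; the cited schemes may differ in detail.

```python
import math
from collections import Counter


def tfidf_keywords(docs, top_n=3):
    """Return the top_n highest-weighted terms of each document.

    Weight = tf * log(N / df), one common tf*idf variant (an assumption here).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    result = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {term: tf[term] * math.log(n_docs / df[term]) for term in tf}
        # Sort by descending weight, breaking ties alphabetically.
        ranked = sorted(weights, key=lambda term: (-weights[term], term))
        result.append(ranked[:top_n])
    return result


SAMPLE_DOCS = [
    "tv screen unstable tv",
    "printer paper jam",
    "tv remote missing",
]
keywords = tfidf_keywords(SAMPLE_DOCS, top_n=2)
print(keywords[0])  # ['screen', 'unstable']
```

In this toy collection the term "tv" occurs in two of the three documents, so its weight is lower than that of terms occurring in a single document, even though it appears twice in the first text.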

3.6. Metadata

The metadata repository of a document warehouse stores all the metadata derived from all documents. All the processes in the proposed architecture will use the metadata interchangeably. In this paper, we propose to design the metadata as described in Table 2. That is, the metadata can be stored in a traditional relational table, which contains attributes for all the elements defined in the Dublin Core Metadata Element Set [38] and some additional attributes. For simplicity, we only list three extra attributes in Table 2, where 'Summarization', 'Keywords', and 'File_Path' are used to store the summarization of the document, the keyword set derived from the original document, and the file path leading back to the document from which the metadata are derived, respectively. Such a file path is inherently unique and can be used to describe the mapping between the document sources and a common view of the information within the document warehouse.
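The relational table of Table 2 can be sketched with Python's built-in sqlite3 module. The table name `Metadata`, the TEXT column types, the semicolon-separated keyword string, and the sample row (including its title and file path) are assumptions made for illustration only.

```python
import sqlite3

# The fifteen Dublin Core elements plus the three extra attributes of Table 2.
DUBLIN_CORE = [
    "Title", "Creator", "Subject", "Description", "Publisher", "Contributor",
    "Date", "Type", "Format", "Identifier", "Source", "Language", "Relation",
    "Coverage", "Rights",
]
EXTRA = ["Keywords", "Summarization", "File_Path"]

conn = sqlite3.connect(":memory:")
columns = ", ".join(f'"{name}" TEXT' for name in DUBLIN_CORE + EXTRA)
# File_Path is declared unique because it identifies the source document.
conn.execute(f"CREATE TABLE Metadata ({columns}, UNIQUE (File_Path))")

# Hypothetical metadata row for the complaint e-mail A0001.
conn.execute(
    'INSERT INTO Metadata ("Title", "Creator", "Type", "Keywords", "File_Path") '
    "VALUES (?, ?, ?, ?, ?)",
    ("Complaint about TV", "Frank S.C. Tseng", "e-mail", "Kaohsiung;TV", "/docs/A0001.eml"),
)
row = conn.execute('SELECT "Title", "File_Path" FROM Metadata').fetchone()
print(row)  # ('Complaint about TV', '/docs/A0001.eml')
```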

Some of the metadata can be obtained or derived directly from the document itself. For example, documents stored in Microsoft Word format have some summary information associated with the file itself, and we can retrieve it directly from the stream by employing Structured Storage [39]. Structured Storage provides file and data persistence in COM by handling a single file as a structured collection of objects known as storages and streams.

4. Data modeling of document warehouses

The dimensional modeling technique [18,22] widely adopted in data warehouse modeling can be extended for document warehouses. Every dimensional model is composed of one central table with a composite key, called the fact table, which uses foreign keys to link to a set of dimension tables. This characteristic 'star-like' structure is also called a star schema. Such a multi-dimensional data model for text permits the definition of any dimension of interest as defined in Definition 2. In Fig. 7, we show a star schema for modeling document warehouses.

Fig. 7. An example star schema of a document warehouse. The central fact table holds the foreign keys Title_ID, Creator_ID, ..., Date_ID, ..., Rights_ID, ..., Keyword_ID, ..., together with Document_ID and the measure count; it is linked to the Title, Creator, Date, Rights, and Keyword dimensions and to the document set S (containing documents such as T1, T2, and T3).
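A reduced version of the star schema in Fig. 7 can be tried out with sqlite3. The sketch keeps only two dimensions and flattens them to a single level; the table names `Keyword_Dim` and `Date_Dim`, the column `doc_count` (standing in for the measure count), and all sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Keyword_Dim (Keyword_ID INTEGER PRIMARY KEY, Keyword TEXT);
CREATE TABLE Date_Dim (Date_ID INTEGER PRIMARY KEY, Quarter TEXT);
CREATE TABLE S (Document_ID TEXT PRIMARY KEY, File_Path TEXT);
-- Central fact table: a composite key made of foreign keys, plus one measure.
CREATE TABLE Fact_Table (
    Keyword_ID INTEGER REFERENCES Keyword_Dim (Keyword_ID),
    Date_ID INTEGER REFERENCES Date_Dim (Date_ID),
    Document_ID TEXT REFERENCES S (Document_ID),
    doc_count INTEGER DEFAULT 1,
    PRIMARY KEY (Keyword_ID, Date_ID, Document_ID)
);
INSERT INTO Keyword_Dim VALUES (1, 'TV'), (2, 'Printer');
INSERT INTO Date_Dim VALUES (1, 'Q1, 2003'), (2, 'Q2, 2003');
INSERT INTO S VALUES ('A0001', '/docs/A0001.eml'), ('A0002', '/docs/A0002.eml');
INSERT INTO Fact_Table (Keyword_ID, Date_ID, Document_ID) VALUES
    (1, 1, 'A0001'), (1, 2, 'A0002'), (2, 2, 'A0002');
""")

# Document count per keyword, aggregated over the Date dimension.
rows = conn.execute("""
    SELECT k.Keyword, SUM(f.doc_count)
    FROM Fact_Table AS f JOIN Keyword_Dim AS k ON k.Keyword_ID = f.Keyword_ID
    GROUP BY k.Keyword ORDER BY k.Keyword
""").fetchall()
print(rows)  # [('Printer', 1), ('TV', 2)]
```

The fact table stores no document text; a join with S on Document_ID recovers the file path of each counted document.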


4.1. Dimensions

As we have discussed in Section 2, dimensions can be distinguished into the following types:

1. Ordinary dimension. A document can be highly summarized by a set of keywords. Therefore, we can construct an ordinary dimension containing a set of keywords to allow users to pinpoint the desired documents directly.

2. Metadata dimension. The elements defined in the Dublin Core Metadata Element Set (title, creator, subject, description, publisher, contributor, date, type, format, identifier, source, language, relation, coverage, and rights) can all be regarded as metadata dimensions. Some of the dimensions might be hierarchies or simply related data.

3. Category dimension. For example, a hierarchy such as WordNet or its subset, or user-defined hierarchies, can be employed as category dimensions. Notice that more than one category dimension may be used to construct a document cube, since a document can be categorized in different ways from various points of view.

In Table 1, we have presented two representations for the dimension Region. Each representation can easily be converted into the other. The structure of Table 1(a) is easier for people to understand, and that of Table 1(b) is more efficient for computer processing. For dimensional modeling, we assume each dimension Di is internally stored in the relation Di(Tag, Parent, Level, Keyword), conforming to the structure of Table 1(b), such that the attribute Tag is the primary key and Di.Parent is a foreign key referencing Di.Tag. For example, the dimensions Product and Time in Fig. 4 can be represented as shown in Fig. 8.

Under such circumstances, for a dimension D, D(0) is the whole relation, and the ith-level member set D(i) can be easily obtained by the SQL statement "SELECT Tag, Keyword FROM D WHERE Level = i", for $1 \le i \le n$.
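The level query above can be tried out with a small sketch. The following is only an illustration, assuming an in-memory SQLite database; the rows are Tags 1 to 9 of the Product relation in Fig. 8, and the helper names build_dimension and level_members are hypothetical, not part of the paper.

```python
import sqlite3

# Rows (Tag, Parent, Level, Keyword) taken from the Product relation of Fig. 8.
PRODUCT_ROWS = [
    (1, 1, 1, "(All Product)"),
    (2, 1, 2, "Appliance"),
    (3, 1, 2, "Communication"),
    (4, 1, 2, "Computer"),
    (5, 2, 3, "TV"),
    (6, 2, 3, "Refrigerator"),
    (7, 3, 3, "Cellular Phone"),
    (8, 3, 3, "Radio"),
    (9, 4, 3, "Monitor"),
]


def build_dimension(rows):
    """Create an in-memory relation Product(Tag, Parent, Level, Keyword)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE Product (
            Tag     INTEGER PRIMARY KEY,
            Parent  INTEGER REFERENCES Product(Tag),
            Level   INTEGER NOT NULL,
            Keyword TEXT NOT NULL
        )
        """
    )
    conn.executemany("INSERT INTO Product VALUES (?, ?, ?, ?)", rows)
    return conn


def level_members(conn, level):
    """Return the member set D(level) as a list of (Tag, Keyword) pairs."""
    cur = conn.execute(
        "SELECT Tag, Keyword FROM Product WHERE Level = ? ORDER BY Tag",
        (level,),
    )
    return cur.fetchall()


conn = build_dimension(PRODUCT_ROWS)
for level in (1, 2, 3):
    print(level, level_members(conn, level))
```

The level is passed as a bound parameter rather than formatted into the SQL string, which keeps the statement identical to the one in the text apart from the placeholder.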

4.2. The fact table

In general, the central fact table may be composed

of the following attributes:

1. A composite key, which is composed of a set of

foreign keys to the following dimensions:

(a) Ordinary dimensions: For example, the dimen-

sion Keyword shown in Fig. 7 is an ordinary

dimension.

Product                               Time
Tag  Parent  Level  Keyword           Tag  Parent  Level  Keyword
1    1       1      (All Product)     1    1       1      (All Time)
2    1       2      Appliance         2    1       2      2003
3    1       2      Communication     3    1       2      2004
4    1       2      Computer          4    2       3      Q1, 2003
5    2       3      TV                5    2       3      Q2, 2003
6    2       3      Refrigerator      6    2       3      Q3, 2003
7    3       3      Cellular Phone    7    2       3      Q4, 2003
8    3       3      Radio             8    3       3      Q1, 2004
9    4       3      Monitor           9    3       3      Q2, 2004
10           3      Printer           10   3       3      Q3, 2004
                                      11   3       3      Q4, 2004

Fig. 8. Dimensions Product and Time represented in relations.


(b) Metadata dimensions: For example, the dimensions Title, Date, Creator, . . . , and Rights shown in Fig. 7 are metadata dimensions.

(c) Category dimensions. Note that no category dimensions are shown in Fig. 7.

2. Attributes used to derive the measures in a document cube. The document count (i.e., the attribute count of the fact table in Fig. 7) can be regarded as the default measure in a document cube. Another possible measure, defined in Ref. [26], is the weight of the term frequency of the corresponding keyword.

3. A column Document_ID representing the document identifier, which is a foreign key to the relation S(Document_id, File_path) shown in Fig. 7. That is, the set S in Fig. 7 can be regarded as a dimension containing all the document identifiers and the corresponding file paths, where Document_id is the primary key and File_path stores the file path or URI of each document.

Therefore, the fact table can be stored in the relation Fact_Table(Document_id, D1_Tag, D2_Tag, . . . , Di_Tag, . . . , Dn_Tag, Count), where Document_id is a foreign key referencing S.Document_id, and each Di_Tag is a foreign key matching the primary key Di.Tag of dimension Di. For each document T (with identifier $id_T$) in S, the document index $x = (id_T, K_T)$, where $K_T = (K_1, K_2, \ldots, K_i, \ldots, K_n)$, can be used to generate the Cartesian product $\{id_T\} \times \left( \prod_{1 \le i \le n} K_i \right)$ as the set of initial tuples in the Fact_Table.
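The generation of initial tuples is a plain Cartesian product, which the following sketch illustrates. It makes several assumptions: the function name initial_fact_tuples is hypothetical, keywords stand in for the dimension tags to keep the output readable, and the Count attribute is left out. The sample index is the one used in Example 3.

```python
from itertools import product


def initial_fact_tuples(doc_id, keyword_sets):
    """Return {doc_id} x K_1 x ... x K_n as a list of tuples.

    keyword_sets is the sequence (K_1, ..., K_n), one set per dimension.
    Each set is sorted first so that the output order is deterministic.
    """
    ordered = [sorted(keywords) for keywords in keyword_sets]
    return [(doc_id, *combo) for combo in product(*ordered)]


# Document index (A0001, ({South, Kaohsiung}, {TV})) on dimensions (R, P).
tuples = initial_fact_tuples("A0001", [{"South", "Kaohsiung"}, {"TV"}])
for row in tuples:
    print(row)
```

The number of generated tuples is the product of the sizes of the keyword sets, which is why redundant ancestor keywords inflate the fact table.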

Note that the initial tuples generated by the above

process may cause redundancies. We formulate this

by the following definition.

Definition 11. A document index defined on n dimensions $(D_1, D_2, \ldots, D_n)$, $x = (id_T, K_T)$, where $K_T = (K_1, K_2, \ldots, K_i, \ldots, K_n)$, is minimal if and only if, for any two keywords $k_x, k_y \in K_i$ with $k_x \neq k_y$, there is no ancestry relationship between $k_x$ and $k_y$ in $D_i$, for all $1 \le i \le n$. That is, $k_x$ is not an ancestor of $k_y$ in $D_i$, and vice versa.

For a non-minimal document index $x = (id_T, K_T)$, where $K_T = (K_1, K_2, \ldots, K_i, \ldots, K_n)$, we can always reduce $x$ to a minimal document index, denoted $\dot{x}$, by iteratively finding pairs of keywords $k_x$ and $k_y$ in $K_i$, for all $1 \le i \le n$, such that $k_y$ is a descendant of $k_x$, and then eliminating the ancestor $k_x$ and retaining the descendant $k_y$. This process is denoted $x \rightarrow \dot{x}$. It reduces the storage cost of a document index without loss of indexing information, since the documents indexed by an ancestor can be recursively derived by rolling up the documents indexed by its lowest descendants along the corresponding dimension.

Example 3. Suppose there is a non-minimal document index of T (with unique identifier A0001), defined on dimensions (R, P), $x = (\text{A0001}, (\{\text{South}, \text{Kaohsiung}\}, \{\text{TV}\}))$. After $x \rightarrow \dot{x}$, we obtain $\dot{x} = (\text{A0001}, (\{\text{Kaohsiung}\}, \{\text{TV}\}))$: Kaohsiung is a descendant of South in the dimension R, so South is eliminated, and the documents indexed by South can still be derived from Tainan, Kaohsiung, and Pingtong.
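The reduction to a minimal document index can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: each dimension is given as a child-to-parent map over keywords, the root points to itself as in Fig. 8, the root name "(All Region)" is assumed by analogy with "(All Product)", and the function names are hypothetical.

```python
def ancestors(parent, node):
    """Yield the proper ancestors of node by following parent links.

    The walk stops at a root, i.e. a node that is its own parent or that
    has no entry in the map. The seen set guards against malformed cycles.
    """
    seen = set()
    while node in parent and parent[node] != node and node not in seen:
        seen.add(node)
        node = parent[node]
        yield node


def minimize(keyword_sets, parents):
    """Drop every keyword that is an ancestor of another keyword in its set."""
    reduced = []
    for keywords, parent in zip(keyword_sets, parents):
        covered = set()
        for keyword in keywords:
            covered.update(ancestors(parent, keyword))
        reduced.append(set(keywords) - covered)
    return reduced


# Child -> parent maps for dimensions R and P (only the parts needed here).
REGION = {
    "(All Region)": "(All Region)",
    "North": "(All Region)",
    "South": "(All Region)",
    "Tainan": "South",
    "Kaohsiung": "South",
    "Pingtong": "South",
}
PRODUCT = {
    "(All Product)": "(All Product)",
    "Appliance": "(All Product)",
    "TV": "Appliance",
}

minimal = minimize([{"South", "Kaohsiung"}, {"TV"}], [REGION, PRODUCT])
print(minimal)
```

Collecting all ancestors first and removing them in one set difference gives the same result as the pairwise elimination described in the text, without repeated passes over the keyword set.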

4.3. The construction of document cubes

The process of constructing a document cube differs somewhat from that of constructing a data cube. Although deriving a document index with respect to metadata and category dimensions is the same as in data cubes, generating a document index with respect to an ordinary dimension tends to be time-consuming. Computing a measure in a data cube is just numerical computation, whereas computing a document index (defined in Definition 6) requires scanning the document content to match the keywords in every ordinary dimension. Therefore, another indexing structure is needed to accommodate the document indices derived from ordinary dimensions. We have proposed such an indexing structure, called D-tree, in Refs. [35,36], where its performance was also evaluated.
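To make the cost of this scanning step concrete, the following sketch matches the keywords of one ordinary dimension against a document by a naive linear scan. It is not the D-tree of Refs. [35,36]; the function name index_document, the sample e-mail text, and the matching rule (case-insensitive, whole words or phrases) are all assumptions made for illustration.

```python
import re


def index_document(text, keywords):
    """Return the dimension keywords that occur in text.

    Matching is case-insensitive and anchored at word boundaries, so a
    keyword such as "TV" does not match inside a longer word.
    """
    found = set()
    for keyword in keywords:
        pattern = r"\b" + re.escape(keyword) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.add(keyword)
    return found


PRODUCT_KEYWORDS = {"TV", "Refrigerator", "Cellular Phone", "Radio"}
email = "The TV I bought in Kaohsiung stopped working after two weeks."

print(index_document(email, PRODUCT_KEYWORDS))
```

Each document is scanned once per keyword, so the work grows with both the number of documents and the size of the dimension, which is the overhead a dedicated indexing structure is meant to reduce.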

5. Applications of document warehouses

In this section, we present two applications of document warehouses. The first is a document warehouse that organizes complaint e-mails to support better customer relationship management. The second is a document warehouse that indexes a set of journal papers falling into different categories and published in different journals on different dates.

5.1. An application for customer relationship

management

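Suppose there is a company that manufactures appliances, communication equipment, and computer peripherals, and that has established branches in the north and south regions. The objective is to warehouse customer complaint e-mails for customer relationship management.

[Figure 9: a star schema. The central Fact Table has the attributes Region_ID (FK), Creator_ID (FK), Date_ID (FK), Product_ID (FK), Time_ID (FK), Document_ID, and count (Measure). It is linked to CreatorDimension, DateDimension, ProductDimension, RegionDimension, and TimeDimension, and to the document set S containing documents such as T1, T2, and T3.]

Fig. 9. An example star schema for complaint e-mail management.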

After modeling the document cube, we obtain two

metadata dimensions (Creator and Date) and three

ordinary dimensions (Region, Product, and Time) as

shown in Fig. 9.

We briefly describe these dimensions as follows:

1. Ordinary dimension. The dimensions Region and

Product are as shown in Figs. 2 and 10, respec-

tively. The dimension Time is the purchase time

described in the e-mail.

2. Metadata dimension. The dimension Creator stores the e-mail addresses of customers, and the dimension Date stores the dates on which the e-mails were received.

3. Category dimension. There are no category dimen-

sions shown in this example. However, the e-mail

documents could be further categorized either man-

ually or automatically by software tools into hier-

archical categories.

The fact table is composed of the following attributes:

1. A composite key, which is composed of a set of foreign keys to the aforementioned dimensions.

2. The attribute count, which is regarded as the default measure in this document cube.

3. A column Document_ID, which serves as a foreign key referencing the relation S(Document_ID, file_path).


[Figure 10: tree diagram of dimension P with nodes tagged 1 to 11. Level 1: (All Product). Level 2: Appliance, Communication, Computer. Level 3: TV, Refrigerator, DVD Player, Cellular Phone, Radio, Laptop Computer, Desktop Computer.]

Fig. 10. A concise illustration of dimension P.


After constructing the document cube, we can perform on-line analytical processing on it, as illustrated in Fig. 11. Note that each count shown in Fig. 11 is actually a hyperlink to a page containing the original e-mails.
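The slicing and drill-through behavior described above can be sketched with a small query example. Everything in it is illustrative: the fact rows and document identifiers (E001 and so on) are made up, keywords are stored directly instead of dimension tags, and the function names slice_counts and drill_through are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Fact (Document_ID TEXT, Region TEXT, Product TEXT, Time TEXT)"
)
# Made-up complaint e-mails, one fact row per (document, region, product, time).
conn.executemany(
    "INSERT INTO Fact VALUES (?, ?, ?, ?)",
    [
        ("E001", "North", "TV", "Q1, 2004"),
        ("E002", "North", "TV", "Q1, 2004"),
        ("E003", "South", "TV", "Q1, 2004"),
        ("E004", "South", "Radio", "Q1, 2004"),
        ("E005", "North", "Radio", "Q4, 2003"),
    ],
)


def slice_counts(conn, time_keyword):
    """Slice on Time and count distinct documents per (Product, Region)."""
    cur = conn.execute(
        "SELECT Product, Region, COUNT(DISTINCT Document_ID) "
        "FROM Fact WHERE Time = ? "
        "GROUP BY Product, Region ORDER BY Product, Region",
        (time_keyword,),
    )
    return {(product, region): count for product, region, count in cur}


def drill_through(conn, time_keyword, product, region):
    """Return the document identifiers behind one count cell."""
    cur = conn.execute(
        "SELECT DISTINCT Document_ID FROM Fact "
        "WHERE Time = ? AND Product = ? AND Region = ? "
        "ORDER BY Document_ID",
        (time_keyword, product, region),
    )
    return [row[0] for row in cur]


counts = slice_counts(conn, "Q1, 2004")
print(counts)
print(drill_through(conn, "Q1, 2004", "TV", "North"))
```

In a Web front end, each entry of counts would be rendered as a hyperlink whose target lists the documents returned by drill_through for that cell.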

5.2. An application for journal paper warehousing

Suppose there is a university laboratory, which

intends to warehouse research journal papers accord-

ing to some predefined categories.

After modeling the document cube, we establish

one category dimension Category and two metadata

dimensions: Source and Times, as shown in Fig. 12,

where Category represents the predefined categories,

Source stores the journal names of the selected papers,

and Times records the publication dates.

We briefly describe these dimensions as follows:

1. Ordinary dimension. There is no ordinary dimen-

sion in this example. However, users may add a

dimension containing all of the keywords in the

selected journal papers, and organize the keywords

into hierarchies, either manually or automatically

by some text processing tools.


2. Metadata dimension. The dimension Source stores the names of the journals, which are published by organizations such as ACM, IEEE, Elsevier, and Kluwer. The dimension Times stores the publication dates, organized according to the Year–Month hierarchy.

3. Category dimension. The dimension Category stores the predefined categories, i.e., computer-aided engineering, data mining and knowledge discovery, data model, data structures, data warehousing, database management, . . ., and so forth. These predefined categories are all placed at the same level to simplify the illustration. Note that papers may be multi-categorized into these categories. That is, a paper may fall into two or more categories.

The fact table is composed of the following attributes:

1. A composite key, which is composed of a set of foreign keys to the aforementioned dimensions.

2. The attribute count, which is regarded as the default measure in this document cube.

3. A column Document_ID, which serves as a foreign key referencing the relation S(Document_ID, file_path).

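Time: 2004/Quarter 1

                                    Region
Product                             North  South
Appliance      TV                   5      2
               Refrigerator         3      7
               DVD Player           7      9
Computer       Laptop Computer      8      1
               Desktop Computer     2      3
Communication  Radio                6      1
               Cellular Phone       5      7

Fig. 11. On-line analytical processing over the example document cube.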

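[Figure 12: a star schema. The central Fact Table has the attributes Category_ID (FK), Source_ID (FK), Times_ID (FK), Document_ID, and count (Measure). It is linked to CategoryDimension, TimeDimension, and SourceDimension, and to the document set S containing documents such as T1 to T5.]

Fig. 12. An example star schema for journal paper warehousing.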

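[Figure 13: screenshot of the OLAP query interface, showing Column Axes and Row Axes selectors over Source, Category, and (All Times), Submit Query and Reset buttons, and the note "Last Times Selected: 2004".]

Fig. 13. On-line analytical processing over the example document cube.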



As a proof of concept, we have implemented this example so that users can perform on-line analytical processing on the resulting document cube, as illustrated in Fig. 13. The dimension Times is used for slicing the document cube. Note that each count shown in Fig. 13 is actually a hyperlink to a Web page containing the detailed paper listing, as shown in Fig. 14 (obtained by slicing the Times dimension by '2004' and clicking the count at the intersection of 'Elsevier' and 'Data Mining and Knowledge Discovery'). If all of the paper files remain available on the publishers' Web sites over a long period, the document warehouse does not need to copy and store the physical files locally. That is, such a document warehouse saves storage space and provides fast document access without performance degradation as the warehouse grows.

6. Conclusion and future directions

6.1. Conclusion

While data warehouses and numeric-centric business intelligence technologies have served most enterprises well, they do not address the complete scope of business intelligence. In this paper, we advocate the importance of constructing document warehouses to support text-centric business intelligence, and propose an architecture for document warehousing. When documents are warehoused, users can perform ad hoc on-line analytical processing (OLAP) over text in a document warehouse, just as they can perform OLAP over summarized data in a data warehouse.

Fig. 14. The detailed paper listing after the count in the intersection of 'Elsevier' and 'Data Mining and Knowledge Discovery' was clicked.

Document warehousing not only provides fast document access without performance degradation as the warehouse grows, but also supports a set of versatile applications for content management of enterprise business intelligence. In business, document warehousing can help administrators organize meeting reports, gazettes, or even customer complaint e-mails, where company personnel, products, and time may be regarded as the dimensions, so that documents related to particular employees, products, times, or places can be retrieved or browsed instantly. In recent years, most data warehouse applications have been applied in Customer Relationship Management (CRM), a promising trend in business. However, a data warehouse only supports numeric analyses of customer behavior. To find out why customers buy (or do not buy) certain products, we need to establish a document warehouse. With data warehousing, users can clearly understand business phenomena in terms of who, what, when, where, and which. Nevertheless, to discover why the phenomena occurred, a document warehouse should be employed [33].

When documents are warehoused, version control becomes much easier, since users can directly trace documents by some criteria along the time dimension. This also makes document warehousing an attractive organization for on-line topic detection and event tracking [1] of news. In addition, document clustering can be achieved directly via visualizations, and users can develop document summarization tools [9,16,30] to summarize a cluster of related documents. To sum up, data warehousing and document warehousing are not only among the most important infrastructures of knowledge management, but also the kernel of customer relationship management. They organize formatted data and documents, respectively, on a multi-dimensional basis. We compare their similarities and differences in Table 3.

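Table 3
A comparison between document warehousing and data warehousing

Similarities

1. Both have the same construction process. We may employ a star schema or snowflake schema [22] to design the modeling process.

2. Both gather business documents/data from heterogeneous resources.

3. Users can perform on-line analytical processing over the established result.

Differences

1. Document warehousing: Intends to obtain text-oriented business intelligence.
   Data warehousing: Intends to obtain numeric-oriented business intelligence.

2. Document warehousing: Resources are gathered from market survey reports, project status reports, meeting records, customer complaints, e-mails, patent application sheets, and advertisements of competitors.
   Data warehousing: Resources are gathered from internal databases of POS (point-of-sale) systems, ERP (enterprise resource planning) systems, accounting systems, or financial management systems.

3. Document warehousing: Filters out unnecessary documents and intends to help users address problems regarding why.
   Data warehousing: Aggregates numerical data according to various dimensions, and intends to help users address problems regarding who, what, when, where, and which.

4. Document warehousing: Enriched with text mining techniques to summarize or categorize documents.
   Data warehousing: Enriched with data mining techniques to summarize, classify, and cluster formatted data or find associations.

5. Document warehousing: Document sources should be integrated in file systems or native XML databases [6,7].
   Data warehousing: Data sources can be integrated in relational databases.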

6.2. Future works

In our future work, we will develop further techniques for document warehousing. The preliminary components may include the following modules.

1. Employ XML Schema [43] to define document metadata. We advocate using the Extensible Markup Language (XML) as the intermediate medium for document interchange.

2. Incorporate automatic text summarization [12,14,23], key feature extraction [10], or even document classification and categorization [3] techniques into document warehousing. This includes developing text summarization techniques that extract the most important 10–20% of the content so that users can digest documents more easily, and determining how to bind a document summary to its corresponding documents.

3. Develop automatic document metadata decomposition and mechanisms for storing the obtained metadata in native XML or XML-enabled databases [5–7]. This helps users manage document warehouses more efficiently.
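As a rough illustration of item 1, the sketch below parses an XML metadata record that uses Dublin Core element names and turns it into candidate values for metadata dimensions. It is an assumption-laden example rather than a proposed design: the record layout, the sample values, and the function name metadata_values are hypothetical, and no XML Schema validation is performed.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"

# A made-up metadata record for one complaint e-mail.
RECORD_XML = f"""
<record xmlns:dc="{DC_NS}">
  <dc:title>Complaint about a TV</dc:title>
  <dc:creator>customer@example.com</dc:creator>
  <dc:date>2004-02-15</dc:date>
  <dc:identifier>A0001</dc:identifier>
</record>
"""


def metadata_values(xml_text):
    """Map each Dublin Core element name to the list of its text values."""
    root = ET.fromstring(xml_text)
    prefix = "{" + DC_NS + "}"
    values = {}
    for child in root:
        # ElementTree exposes qualified tags as "{namespace}localname".
        if child.tag.startswith(prefix):
            name = child.tag[len(prefix):]
            values.setdefault(name, []).append((child.text or "").strip())
    return values


metadata = metadata_values(RECORD_XML)
print(metadata)
```

Each key of the returned mapping corresponds to one potential metadata dimension (for example creator or date), and the values are what would be looked up in that dimension when the fact tuples are generated.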



In addition, although the dimension concepts defined in this paper are organized into hierarchical structures, we assume that, when scanning a document, the system ignores the hierarchical relationships among the keywords in the document. Building on this work, we wish to incorporate natural language processing technologies to enhance the linguistic analysis and annotation results of document parsing, and to adopt domain-specific ontologies [8,11,25,32] so that more refined concepts can be built into the corresponding dimensions of a document cube. Ontological analysis can help clarify the structure of knowledge regarding a set of related documents. Given a set of related documents corresponding to a specific domain, the ontology forms the semantic heart of any system of knowledge representation, and the document cube forms the syntactic centroid of any system of concept organization.

Finally, since constructing a document warehouse requires scanning a large number of documents, which tends to be time-consuming, a parallel architecture for this process will be investigated in the future.

References

[1] J. Allan, R. Pepka, V. Lavrenko, On-line new event detection

and tracking, Proceedings of the 21st Annual International

ACM SIGIR Conference on Research, 1998, pp. 37–45.

[2] S. Anahory, D. Murray, Data Warehousing in the Real World:

A Practical Guide for Building Decision Support Systems,

Addison-Wesley Longman, Harlow, England, 1997.

[3] A. Appiani, F. Cesarini, A. Colla, M. Diligenti, M. Gori, S.

Marinai, G. Soda, Automatic document classification and

indexing in high-volume applications, International Journal

on Document Analysis and Recognition 4 (2) (2002) 69–83.

[4] M.J.A. Berry, G. Linoff, Data Mining Techniques: For Mar-

keting, Sales, and Customer Support, John Wiley & Sons,

New York, 1997.

[5] E. Bertino, B. Catania, Integrating XML and databases, IEEE

Internet Computing 5 (4) (2001) 84–88.

[6] E. Bertino, E. Ferrari, XML and database integration, IEEE

Internet Computing 5 (6) (2001) 75–76.

[7] M. Champion, Native XML vs. XML-Enabled: the Difference Makes a Difference, Software AG: The XML Company, http://www.softwareag.com/xml/library/champion_nativexml.htm.

[8] B. Chandrasekaran, J.R. Josephson, V.R. Benjamins, What are

ontologies, and why do we need them? IEEE Intelligent

Systems 14 (1) (1999 Jan./Feb.) 20–26.

[9] H.H. Chen, S.J. Huang, A summarization system for Chinese

news from multiple sources, Proceedings of the 4th Interna-

tional Workshop on Information Retrieval with Asia Lan-

guage, 1999, pp. 1–7.

[10] F.F. Feng, W.B. Croft, Probabilistic techniques for phrase

extraction, Information Processing & Management 37 (2)

(2001 Mar.) 199–220.

[11] N. Fridman, C.D. Hafner, The state of the art in ontology

design, AI Magazine 18 (3) (1997) 53–74.

[12] J. Goldstein, M. Kantrowitz, V. Mittal, J. Carbonell, Summa-

rizing text documents: sentence selection and evaluation

metrics, Proceedings of SIGIR, 1999, pp. 121–128.

[13] J. Goldstein, V.O. Mittal, J.G. Carbonell, J.P. Callan, Creating

and evaluating multi-document sentence extract summaries,

Proceedings of the 9th International Conference on Informa-

tion and Knowledge Management, 2000, pp. 165–172.

[14] M. Grigsby, The Internet Document Warehouse: Content Management for the Back Office, Technical Report, IMERGE Consulting, Inc., 2001, http://www.imergeportal.com/publishedarticles.asp.

[15] R. Hackathorn, Data warehousing energizes your enterprise,

Datamation 1 (1995 Feb.) 38–42.

[16] U. Hahn, I. Mani, The challenges of automatic summarization,

IEEE Computer 33 (11) (2000 Nov.) 29–36.

[17] J. Han, M. Kamber, Data Mining: Concepts and Techniques,

Morgan Kaufmann Publishers, 2001.

[18] W.H. Inmon, Building the Data Warehouse, John Wiley and

Sons, New York, NY, 1993.

[19] H. Ishikawa, K. Kubota, Y. Noguchi, K. Kato, M. Ono, N.

Yoshizawa, A. Kanaya, A document warehouse: a multimedia

database approach, Proceedings of the IEEE 9th International

Workshop on Database and Expert Systems Applications

(DEXA’98) Vienna, Austria, Aug. 26–28, 1998, pp. 90–94.

[20] H. Ishikawa, K. Kubota, Y. Noguchi, K. Kato, M. Ono, N.

Yoshizawa, Y. Kanemasa, Document warehousing based on a

multimedia database system, IEEE International Conference

on Data Engineering, 1999, pp. 168–173.

[21] H. Ishikawa, M. Ohta, K. Kato, Document warehousing: a

document-intensive application of a multimedia database, Pro-

ceedings of the IEEE 11th International Workshop on Re-

search Issues in Data Engineering, Heidelberg, Germany,

April 01–02, 2001, pp. 25–31.

[22] R. Kimball, The Data Warehouse Toolkit: Practical Techniques

for Building Dimensional Data Warehouses, John Wiley &

Sons, Inc., 1996.

[23] K. Knight, Mining online text, Communications of the ACM

42 (11) (1999).

[24] S.-H. Lin, C.-S. Shih, M.C. Chen, J.-M. Ho, M.-T. Ko, Y.-M.

Huang, Extracting classification knowledge of Internet docu-

ments with mining term associations: a semantic approach,

Proceedings of ACM SIGIR Conference on Research and

Development in Information Retrieval, 1998, pp. 241–249.

[25] S. Loh, L.K. Wives, J.P. de Oliveira, Concept-based knowledge discovery in texts extracted from the web, SIGKDD Explorations 2 (1) (2000 Jun.).

[26] M.C. McCabe, J. Lee, A. Chowdhury, D. Grossman, O. Frieder, On the design and evaluation of a multi-dimensional approach to information retrieval, Proceedings of the 23rd Annual International ACM SIGIR Conference, 2000, pp. 363–365.



[27] G.A. Miller, Wordnet: an online lexical database, International

Journal of Lexicography 3 (4) (1990) 235–312.

[28] G. Salton, Automatic Text Processing, Addison-Wesley Pub-

lishing Company, 1988.

[29] G. Salton, M. Gill, Introduction to Modern Information Re-

trieval, McGraw-Hill, 1983.

[30] S. Sekine, C. Nobata, A Survey of Multi-Document Summa-

rization, Proceedings HLT-NAACL Text Summarization

Workshop and Document Understanding Conference (DUC

2003), pp. 65–72.

[31] A. Singhal, C. Buckley, M. Mitra, Pivoted document length normalization, Proceedings of the 19th Annual International ACM SIGIR Conference, 1996, pp. 21–29.

[32] V. Sugumaran, V.C. Storey, Ontologies for conceptual model-

ing: their creation, use, and management, Data and Knowledge

Engineering 42 (2002) 251–271.

[33] D. Sullivan, Document Warehousing and Text Mining: Tech-

niques for Improving Business Operations, Marketing and

Sales, John Wiley & Sons, Inc., 2001.

[34] Ah-Hwee Tan, Text mining: the state of the art and the

challenges, Proceedings of the PAKDD 99—Workshop on

Knowledge Discovery from Advanced Databases, Beijing,

1999, pp. 50–70.

[35] F.S.C. Tseng, W.P. Lin, A study on indexing structure and its

properties for constructing document warehouses, Proceedings

of the 20th Workshop on Combinatorial Mathematics and

Computation Theory, Taiwan, 2003 (Aug.), pp. 18–27.

[36] F.S.C. Tseng, W.P. Lin, D-Tree: A Multi-Dimensional

Indexing Structure for Constructing Document Warehouses,

Journal of Information Science and Engineering, in press.

[37] F.S.C. Tseng, Design of a Multi-Dimensional Query Expres-

sion for Document Warehouses, Information Sciences, accept-

ed and in press.

[38] http://dublincore.org/, Dublin Core Metadata Initiative.

[39] http://msdn.microsoft.com/library/default.asp?url=/library/en-us/

stg/stg/structured_storage_start_page.asp, Structured Storage.

[40] http:otn.oracle.com/products/text/x/tech_Overviews/imt_817.

html, Oracle Corporation. InterMedia Text 8.1.6.

[41] http://www.globalwordnet.org, The GlobalWordnet Association.

[42] http://www.survey.com, "Development Snapshot: Warehouse Data of the Future," Application Development Trends, Feb. 2000.

[43] http://www.w3.org/XML/schema.

[44] http://www-3.ibm.com/software/data/iminer/fortext, IBM In-

telligent Miner for Text: Text Analysis Tools version 2.10.0.

Frank S.C. Tseng received his B.S., M.S. and Ph.D. degrees, all in

computer science and information engineering from National Chiao

Tung University, Taiwan, ROC, in 1986, 1988, and 1992, respec-

tively. He is one of the winners of Acer Long Term Ph.D. disser-

tation prize in 1992. From 1993 to 1995, he served the military in

the General Headquarters of ROC Air Force. He joined the faculty

of the Department of Information Management, Yuan-Ze University, Taiwan, ROC, in August 1995. From 1996 to 1997, he was the

chairman of the department. He is currently with the Department of

Information Management, National Kaohsiung First University of

Science and Technology, as an associate professor. His research

interests include heterogeneous database systems, XML technolo-

gies for Internet computing, data warehousing, data mining, and

document warehousing. He has published extensively in journals

such as the VLDB Journal, IEEE Transactions on Knowledge and

Data Engineering, Data and Knowledge Engineering, Journal of

Systems and Software, Distributed and Parallel Databases: An

International Journal, Journal of Information Science, Information

Sciences, and Journal of Information Science and Engineering.

Dr. Tseng is a member of the IEEE Computer Society and the

Association for Computing Machinery. He was listed in Marquis

Who’s Who in Medicine and Healthcare in May 2004.

Annie Y.H. Chou received her B.S. degree in applied mathematics,

M.S. degree in computer science and information engineering, and

Ph.D. degree in computer and information science, all from National

Chiao Tung University, Taiwan, ROC, in 1987, 1989, and 1996,

respectively. From 1989 to 1992, she was an assistant researcher of

Chunghua Telecom Laboratories, Taiwan, ROC. She joined the

faculty of the Department of Computer and Information Science,

Chinese Military Academy, in August 1997. She is presently the

chairman of the department. Her research interests include mathe-

matical analysis of computer algorithms, file organization design,

data warehousing and data mining, and internet computing. She was

a member of the Phi Tau Phi Scholastic Honor Society. Dr. Chou

has had papers published in the Computer Journal and Journal of

Information Science and Engineering.
