www.elsevier.com/locate/dsw
Decision Support Systems
The concept of document warehousing for multi-dimensional
modeling of textual-based business intelligence
Frank S.C. Tseng a,*, Annie Y.H. Chou b,1
a Department of Information Management, National Kaohsiung First University of Science and Technology, 1 University Road,
YenChao, Kaohsiung, Taiwan 824, ROC
b Department of Computer and Information Science, Chinese Military Academy, Taiwan
Available online 17 June 2005
Abstract
During the past decade, data warehousing has been widely adopted in the business community. It provides
multi-dimensional analyses on cumulated historical business data for helping contemporary administrative decision-making.
Nevertheless, it is believed that only about 20% of information can be extracted from data warehouses, which concern numeric
data only; the other 80% is hidden in non-numeric data or even in documents. Therefore, many researchers now advocate that it is
time to conduct research work on document warehousing to capture complete business intelligence. Document warehouses,
unlike traditional document management systems, include extensive semantic information about documents, cross-document
feature relations, and document grouping or clustering to provide a more accurate and more efficient access to text-oriented
business intelligence. In this paper, we discuss the basic concept of document warehousing and present its formal definitions.
Then, we propose a general system framework and elaborate some useful applications to illustrate the importance of document
warehousing. The work is essential for establishing an infrastructure to help combine text processing with numeric OLAP
processing technologies. The combination of data warehousing and document warehousing will be one of the most important
kernels of knowledge management and customer relationship management applications.
© 2005 Elsevier B.V. All rights reserved.
Keywords: Data warehousing; Document warehousing; Knowledge management; OLAP
0167-9236/$ - see front matter © 2005 Elsevier B.V. All rights reserved.
doi:10.1016/j.dss.2005.02.011
This research was partially supported by the National Science
Council, Taiwan, ROC, under Contract No. NSC-91-2416-H-327-005.
* Corresponding author. Tel.: +886 7 6011000x4113; fax: +886 7 7659541.
E-mail addresses: [email protected] (F.S.C. Tseng),
[email protected] (A.Y.H. Chou).
1 Tel.: +886 7 7438179.
1. Introduction
Data warehousing [18] and data mining techniques
[17] are gaining popularity as organizations realize the
benefits of being able to perform multi-dimensional
analyses of cumulated historical business data to help
contemporary administrative decision-making [2,4,15,
17,22]. This inspires enterprises to eagerly delve into
useful business intelligence (BI) from both internal
F.S.C. Tseng, A.Y.H. Chou / Decision Support Systems 42 (2006) 727–744
and external data. Business intelligence is supposed to
provide decision-makers with the tactical and strategic
information they need for understanding, managing,
and coordinating the operations and processes in orga-
nizations.
However, most of these efforts have only touched
the tip of the information iceberg. While the techniques
regarding data warehouses, multi-dimensional
models, on-line analytical processing (OLAP), or
even ad hoc reports have served enterprises well,
they do not address the full scope of
business intelligence. It is believed [42] that, for the
business intelligence of an enterprise, only about 20%
of the information can be extracted from formatted data
stored in relational databases. The remaining 80%
is hidden in unstructured or semi-structured
documents. This is because the most prevalent
medium for expressing information and knowledge is
text. For instance, market survey reports, project
status reports, meeting records, customer complaints,
e-mails, patent application sheets, and advertisements
of competitors are all recorded in documents.
Meanwhile, the documents on the Web, in enterprise
repositories, and in public document management
systems keep growing. Knowledge workers, managers,
and executives therefore still have to spend
much of their working time reading dozens, if not
hundreds, of electronic documents of various types
spread over the Internet. There is just too much text to
digest in daily life. The fast-growing and tremendous
amount of documents has far exceeded the human
ability for comprehension without powerful tools.
As a result, when important decisions are made,
some relevant documents may be ignored, and some
irrelevant documents may be considered by intuition.
We believe that leaving out information from
relevant documents, or relying on information guessed
intuitively from irrelevant documents, may be
detrimental: a strategy woven from incomplete
information can lead to disaster.
To alleviate this problem, Grigsby [14],
McCabe et al. [26] and Sullivan [33] have advocated
that documents should be properly warehoused
according to some well-defined concepts, expanding
the scope of business intelligence to include
textual information. Ishikawa and colleagues [19–21]
went further by implementing a prototype
system that supports the management of compound
documents and both keyword-based and content-based
retrieval. They used ECA rules to classify multimedia
documents, and SOM (Self-Organizing Map) to cluster a
set of collected texts into a number of groups in a
retrieval space of manageable dimensions.
Hence, we think one of the next challenges for the
information community will be the study of
document warehousing and text mining to help
enterprises obtain complete business intelligence.
Although research on text mining has been conducted
widely (for examples, readers are referred to
Refs. [44,23–25,40,34]), the issues regarding
document warehousing are rarely addressed. We have
proposed a multi-dimensional indexing structure,
called D-tree, in Ref. [36] to study the performance
of constructing document warehouses. Some
theoretical analyses of the properties of indexing a
document warehouse were also elaborated in Ref. [35].
With document warehouses, the documents of
enterprises can be well organized for effective analysis
or feature extraction to create distilled and fruitful
business intelligence.
Since there are usually many diverse concepts
involved in a document, a document is multi-dimen-
sional in nature. Document warehouses, unlike tradi-
tional document management systems, include
extensive semantic information about documents,
cross-document feature relations, and document
grouping or clustering to provide more accurate and
more efficient access to text-oriented business
intelligence. To facilitate flexible and effective
multi-dimensional on-line analytical document processing and
browsing, a multi-dimensional query language for
querying document warehouses is indispensable. In
Ref. [37], we devised a multi-dimensional query
expression for querying document warehouses, giving
users an easy and efficient way of performing
on-line analytical processing on documents.
Although issues about document warehousing
have been addressed in Refs. [14,26,33], there are
still no formal definitions established up to now. In
this work, we will first discuss the concept of docu-
ment warehousing and formally define the related
terms. Then, we propose a framework for document
warehousing and elaborate some applications of doc-
ument warehousing to sketch an attractive roadmap of
using document warehouses.
As Web applications proliferate,
there will be a great need for rapid text
processing and browsing. Document warehousing
not only provides an infrastructure for developing
tools that help business executives systematically
organize, understand, and properly categorize their
documents for strategic decision-making, but also
integrates all kinds of related documents so that they
can be browsed instantly.
Document warehousing also provides an important
platform for on-line analytical processing (OLAP) at
the text level, supporting the interactive analysis of
multi-dimensional documents of various granularities.
This facilitates effective text mining, integrates documents
into the business intelligence infrastructure, and
provides the means to search for and target specific
information the way we now do with numeric data.
Furthermore, as the construction of data warehouses
can be viewed as an important step for data mining,
the construction of document warehouses can be
regarded as an indispensable preprocessing step for
text mining. We realize that, no matter how wonderful
the mechanism a system adopts, it cannot do
much without good content organization of the
domain on which it is to work. Moreover, once a
good content organization is available, many
different mechanisms might be employed equally well
to implement effective systems. A well-organized
document warehouse provides exactly such a content
organization for these mechanisms to work on.
In this paper, based on our prior works [35–37],
we further illustrate the general architecture of a
document warehouse and its applications. The
work is essential for establishing an infrastructure
to help combine text processing with numeric
OLAP processing technologies. We believe such
an infrastructure can help extend numeric data anal-
ysis for combination with text processing technolo-
gies to make data warehousing and document
warehousing one of the most important kernels of
knowledge management and customer relationship
management applications. By combining document
warehousing and data warehousing, documents can
be integrated into the business intelligence infra-
structure and can provide the means to search for
and target specific information the way we now do
with numeric data.
Although the content of documents is often more
than text and may include graphics or even
multimedia data, in this paper we consider only the
textual parts of documents. In the future, we will
extend our work to encompass the entirety of docu-
ments, which may be called multimedia warehousing
[19–21].
Our paper is organized as follows. In Section 2, the
important concepts of document warehousing are for-
mally presented. Then, based on these definitions, we
will propose a general architecture for constructing
document warehouses in Section 3. Then, some of the
applications of document warehouses will be dis-
cussed in Section 4. Finally, we conclude and propose
some future work in Section 5.
2. An introduction to document warehousing
In the following, we give some definitions about
document, dimension, document tuple, and document
cube for document warehousing.
Definition 1. A document $T = \{k_1, k_2, \ldots, k_i\}$ is a
logical unit of text characterized by a set of keywords
$\{k_1, k_2, \ldots, k_i\}$.
To organize documents into structures, we need the
concept of dimension defined as follows.
Definition 2. A dimension $D$ is a tree structure of $m$
levels, $m \ge 1$, which is used for representing the
hierarchical relationships among a set of keywords. A
node in a dimension $D$ is called a member, and each
internal node contains a special child called the summary
member, denoted '*', which is used for denoting the
total concept of the other children of the internal node.
When drawing a dimension, we usually leave out the
summary members, since each has the same meaning as
its parent node. Besides, the keywords in a dimension
are not limited to only those contained in document
contents. Any property or metadata of a document file
(e.g., those defined in Dublin Core Metadata Element
Set [38]) can also be regarded as a keyword in a
dimension for constructing document cubes. Further-
more, if documents are organized into predefined
categories, the category hierarchy to which a docu-
ment belongs can also be regarded as a dimension.
That is, text is not unstructured as is often assumed,
which has been pointed out by Sullivan [33]. The
concept of dimension can be employed to model the
structure inherently hidden in text.

Table 1
A relation Region and its alternative for constructing dimension R

(a)
Location  City
South     Tainan
South     Kaohsiung
South     Pingtong
North     Taipei
North     Taoyun
North     Hsinchu

(b)
Tag  Parent  Level  Keyword
1    1       1      (All Region)
2    1       2      South
3    1       2      North
4    2       3      Tainan
5    2       3      Kaohsiung
6    2       3      Pingtong
7    3       3      Taipei
8    3       3      Taoyun
9    3       3      Hsinchu
According to the keyword sources, dimensions can
be distinguished into the following types:
1. Ordinary dimension. A dimension whose keywords
are used for scanning the document contents.
2. Metadata dimension. A dimension whose keywords
are used for scanning document file properties
or metadata. For example, the Dublin Core Metadata
Element Set defines title, creator, subject,
description, publisher, contributor, date, type,
format, identifier, source, language, relation,
coverage, and rights; all of them can be regarded as
metadata dimensions.
3. Category dimension. A dimension whose keywords
correspond to the nodes in a category
hierarchy, such as Wordnet [27,41], into which all
considered documents are categorized, possibly into
several categories. If a document is not already related
to such a dimension, its categories can be determined
manually or assigned automatically by document
categorization tools.
To simplify our discussion, we mainly use ordinary
dimensions, together with the metadata dimension
time (i.e., date), in the following examples.
Definition 3. For a dimension $D$, the $i$th-level member
set, denoted $D(i)$, is defined as
$D(i) = \{a \mid a \text{ is a member in the } i\text{th level of } D, \text{ but } a \text{ is not a summary member}\}$.
Besides, we use $D(0)$ to denote the union of
all non-summary members in $D$, which is the union of
all $i$th-level member sets in $D$. That is,
$D(0) = \bigcup_{1 \le i \le h} D(i)$, where $h$ is the height of $D$.
In practice, each $D(i)$ has a specific name, which will be
called the $i$th-level name.
Practically, a dimension can be constructed from a
relational table, with each level corresponding to an
attribute in the relation and the attribute names usually
used as the corresponding level names. Besides, any
keyword in a dimension can be implemented as a set of
synonyms to encompass more semantics. To illustrate
the above definitions, we give an example as follows.
Example 1. Suppose there is a relation Region
representing the regions of Taiwan as shown in Table 1(a).
An alternative representation is shown in Table 1(b). This
relation can be used to construct a dimension, denoted R,
as depicted in Fig. 1, where the first level corresponds
to the dimension itself, which is commonly denoted
"(All Region)", and the second and third levels are
derived from the attributes Location and City,
respectively. All nodes in Fig. 1 with label '*' are summary
members. That is, the summary member in the second
level has the same meaning as all regions in Taiwan,
which represents {South, North}. Besides, the
summary members under South and North have the
same meaning as South and North,
which denote {Tainan, Kaohsiung, Pingtong} and
{Taipei, Taoyun, Hsinchu}, respectively. By omitting
all the summary members, Fig. 1 is redrawn in Fig. 2.
According to the illustration of dimension R, we know
that R(1) = {(All Region)}, R(2) = {South, North},
R(3) = {Tainan, Kaohsiung, Pingtong, Taipei, Taoyun,
Hsinchu}, and R(0) = {(All Region), South, North,
Tainan, Kaohsiung, Pingtong, Taipei, Taoyun, Hsinchu}.
Fig. 1. An illustration of dimension R (a three-level tree: level 1 is (All Region); level 2 is South and North; level 3 is Tainan, Kaohsiung, Pingtong, Taipei, Taoyun, and Hsinchu; each internal node also has a summary member '*').

For a dimension D, there are two basic operations
called drill-down and roll-up, which are formally
defined as follows.

Definition 4. For a dimension D, expanding an
internal node to obtain all of its children is called
drill-down, and shrinking a set of children to obtain
their common parent is called roll-up.

This can be further clarified by the following
definitions.
Definition 5. For any two $n$-tuples of keywords
$A = (a_1, a_2, \ldots, a_i, \ldots, a_n)$ and $B = (b_1, b_2, \ldots, b_i, \ldots, b_n)$
defined on $n$ dimensions $(D_1, D_2, \ldots, D_i, \ldots, D_n)$,
where $a_i, b_i \in D_i(0)$, we define $B$ to be a
member of drilling down $A$ along dimension $D_i$ (or $A$
to be a member of rolling up $B$ along dimension $D_i$),
denoted A �i B, if and only if there exists exactly one $i$,
$1 \le i \le n$, such that $b_i$ is a child of $a_i$ in $D_i$, and $b_j = a_j$
for all $j \ne i$.
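Definitions 4 and 5 can be sketched in code as follows. This is only an illustration under stated assumptions: each dimension is encoded as a child-to-parent map, dimension R follows Table 1, and the small product dimension P as well as all function names are hypothetical.

```python
# Child -> parent maps. R follows Table 1; P is a small product
# hierarchy assumed here purely for illustration.
PARENT_R = {
    "South": "(All Region)", "North": "(All Region)",
    "Tainan": "South", "Kaohsiung": "South", "Pingtong": "South",
    "Taipei": "North", "Taoyun": "North", "Hsinchu": "North",
}
PARENT_P = {
    "Appliance": "(All Product)",
    "TV": "Appliance", "Refrigerator": "Appliance",
}
DIMS = (PARENT_R, PARENT_P)


def drill_down(parent_of, member):
    """Definition 4: expand a member into the set of its children."""
    return {child for child, parent in parent_of.items() if parent == member}


def roll_up(parent_of, member):
    """Definition 4: map a member to its parent (None for the root)."""
    return parent_of.get(member)


def drills_down_along(a, b, dims, i):
    """Definition 5: True if tuple b results from drilling down tuple a
    along dimension i (0-based) while all other components are equal."""
    return all(
        (dims[j].get(b[j]) == a[j]) if j == i else (a[j] == b[j])
        for j in range(len(dims))
    )
```

For instance, ("Kaohsiung", "TV") is a member of drilling down ("South", "TV") along R, but not of drilling down ("South", "Appliance"), because in the latter case two components change at once.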
Definition 6. For a document $T$ with unique identifier
$id_T$, a document index of $T$ defined on $n$ dimensions
$(D_1, D_2, \ldots, D_n)$ is denoted $x = (id_T, K_T)$, where
$K_T = (K_1, K_2, \ldots, K_i, \ldots, K_n)$ is an $n$-tuple of keyword
sets, such that each $K_i$ contains a set of keywords,
and for all keywords $k_{ij} \in K_i$, $k_{ij} \in T$ and
$k_{ij} \in D_i(0)$, for all $1 \le i \le n$.
Fig. 2. A concise illustration of dimension R (the tree of Fig. 1 without summary members; the level names of levels 1 to 3 are All Region, Location, and City, and the nodes are numbered 1 to 9).

For simplicity, the first and second components of
a document index $x = (id_T, K_T)$ will be denoted $x_1$ and
$x_2$ (i.e., $x_1 = id_T$ and $x_2 = K_T$), respectively. When all
$|K_i| = 1$, the document index is also called a base
document index, and each $K_i$ can also be denoted
by its only element for convenience (that is, in such
cases, $K_T = (\{k_1\}, \{k_2\}, \ldots, \{k_i\}, \ldots, \{k_n\})$ can be
abbreviated as $K_T = (k_1, k_2, \ldots, k_i, \ldots, k_n)$). If there
is at least one $K_i$ such that $|K_i| > 1$, and the sizes of
the other $K_j$'s all equal 1, then the document index
is also called a composite document index. Finally, if
there is some $K_i$ such that $|K_i| = 0$, then the document
index is also called a degenerate document
index. In the following, a degenerate document
index with some $|K_i| = 0$ will be generalized by
using the top-level member set of the corresponding
dimension, $D_i(1)$, to substitute the missing keyword
set $K_i$.

Example 2. Suppose there is a complaint e-mail
issued from a customer as shown in Fig. 3. Then, a
base document index of T defined on the above two
dimensions (R, P) can be obtained as x = (A0001,
({Kaohsiung}, {TV})), where A0001 is the unique
identifier of T.

To whom it may concern:
We have bought a TV from your Kaohsiung branch last weekend. However, we found the screen is severely unstable. Please give us the phone number of your service center. Thank you for your kindly help.
Sincerely,
Frank S.C. Tseng
Fig. 3. A complaint e-mail issued by a customer (A0001).
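A document index as in Example 2 could be derived roughly as follows. This is a minimal sketch, not the authors' procedure: it assumes a plain case-sensitive whole-word match between dimension members and the text, and the names R0, P0, document_index, and index_kind are hypothetical.

```python
import re

# D(0) of the two dimensions: region (Table 1) and product (Fig. 8).
R0 = {"(All Region)", "South", "North", "Tainan", "Kaohsiung",
      "Pingtong", "Taipei", "Taoyun", "Hsinchu"}
P0 = {"(All Product)", "Appliance", "Communication", "Computer", "TV",
      "Refrigerator", "Cellular Phone", "Radio", "Monitor", "Printer"}

# Body of the complaint e-mail of Fig. 3.
EMAIL = ("To whom it may concern: We have bought a TV from your Kaohsiung "
         "branch last weekend. However, we found the screen is severely "
         "unstable. Please give us the phone number of your service center.")


def document_index(doc_id, text, dimensions):
    """Return (id_T, (K_1, ..., K_n)): for each dimension, the set of
    its members that occur in the text as whole words."""
    key_sets = tuple(
        {k for k in members
         if re.search(r"\b" + re.escape(k) + r"\b", text)}
        for members in dimensions
    )
    return (doc_id, key_sets)


def index_kind(key_sets):
    """Classify an index as base, composite, or degenerate."""
    sizes = [len(k) for k in key_sets]
    if any(s == 0 for s in sizes):
        return "degenerate"
    if all(s == 1 for s in sizes):
        return "base"
    return "composite"


x = document_index("A0001", EMAIL, (R0, P0))
```

With these inputs, x equals ("A0001", ({"Kaohsiung"}, {"TV"})), which index_kind classifies as a base document index. A real system would also need synonym handling and case normalization, which this sketch leaves out.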
The basic component of a document cube is called
a cell, which is defined as follows.
Definition 7. A cell defined on $n$ dimensions $(D_1, D_2, \ldots, D_n)$
is denoted $c = (t_c, X_c)$, where $t_c = (c_1, c_2, \ldots, c_i, \ldots, c_n)$,
$c_i \in D_i(0) \cup \{\text{'*'}\}$, $1 \le i \le n$, and
$X_c = \{x_1, x_2, \ldots, x_j, \ldots, x_m\}$ is a set of document indices of the
form $x_j = (id_{T_j}, (K_1, K_2, \ldots, K_n))$, where $id_{T_j}$ is the
unique identifier of some document $T_j$ and
$K_i \cap D_i(0) \ne \emptyset$, $1 \le i \le n$. The set of all such document
unique identifiers $id_{T_j}$ involved in the cell $c = (t_c, X_c)$
is denoted $ID(c) = \{x_{j1} \mid \forall x_j \in X_c\}$, where $x_{j1}$ is the
first component of $x_j$. That is, a document
with unique identifier in $ID(c)$ can be directly
accessed from the cell $c$.

Definition 8. A cell $c = (t_c, X_c)$, where $t_c = (c_1, c_2, \ldots, c_i, \ldots, c_n)$,
defined on $n$ dimensions $(D_1, D_2, \ldots, D_n)$ is called an $m$-d cell,
$0 \le m \le n$, if and only if there
are exactly $m$ non-summary members $c_i$ (i.e.,
$c_i \ne \text{'*'}$). If $m = n$ and $c_i \in D_i(h_i)$, where $h_i$ is
the height of $D_i$, for all $1 \le i \le n$, then $c$ is also
called a base cell; otherwise $c$ is called a non-base
cell.

Definition 9. An $n$-dimensional $i$-d cell $a = ((a_1, a_2, \ldots, a_n), X_a)$
is a parent of another $n$-dimensional $i$-d
cell $b = ((b_1, b_2, \ldots, b_n), X_b)$, if and only if the
following conditions hold:
1. There exists exactly one $k$, such that $a_k$ is the
parent of $b_k$ in $D_k$, and $a_l = b_l$, for all $l \ne k$, $1 \le l \le n$.
2. $ID(b) \subseteq ID(a)$, where $ID(a)$ and $ID(b)$ are the sets
of all document unique identifiers involved in cells
$a$ and $b$, respectively.

Fig. 4. A sample illustration of a document cube (axes: region, with (All Region) over North (Taipei, Taoyuan, HsinChu) and South (Tainan, Kaohsiung, Pingtong); product, with (All Product) over Appliance (TV, Refrigerator), Communication (Cellular Phone, Radio), and Computer (Monitor, Printer); and time. A base cell and a non-base cell are marked, and the sets ID(a) and ID(d) of two cells a and d point to documents T1, T2, and T3 in S).
Definition 10. A document cube $DC = (S, (D_1, D_2, \ldots, D_n))$,
where $S$ is a set of documents defined on $n$
dimensions $(D_1, D_2, \ldots, D_n)$, is a cube composed of
all cells $c_i = (t_{c_i}, X_{c_i})$ with
$t_{c_i} \in D_1(0) \times D_2(0) \times \cdots \times D_n(0)$ and
$ID(c_i) \subseteq S$.
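Definitions 7 to 10 can be illustrated with a small in-memory cube. The sketch below makes several assumptions that are not in the paper: it uses only two dimensions (R and P), it stores for each cell just the identifier set ID(c), it lets the root members stand in for the summary member '*', and the documents A0002 and A0003 are invented for the example.

```python
from itertools import product

# Child -> parent maps for region (Table 1) and a product subset (Fig. 8).
PARENT_R = {
    "South": "(All Region)", "North": "(All Region)",
    "Tainan": "South", "Kaohsiung": "South", "Pingtong": "South",
    "Taipei": "North", "Taoyun": "North", "Hsinchu": "North",
}
PARENT_P = {
    "Appliance": "(All Product)",
    "TV": "Appliance", "Refrigerator": "Appliance",
}

# Base document indices (id, (region keyword, product keyword)).
# A0001 is Example 2; A0002 and A0003 are hypothetical.
INDICES = [
    ("A0001", ("Kaohsiung", "TV")),
    ("A0002", ("Taipei", "TV")),
    ("A0003", ("Tainan", "Refrigerator")),
]


def ancestors_or_self(parent_of, member):
    """Return the member followed by all of its ancestors up to the root."""
    chain = [member]
    while member in parent_of:
        member = parent_of[member]
        chain.append(member)
    return chain


def build_cube(indices, dims):
    """Map every cell coordinate tuple t_c to its identifier set ID(c).

    A document counts for the cell of its own keywords and for every
    cell reachable by rolling up along any combination of dimensions.
    """
    cube = {}
    for doc_id, keys in indices:
        chains = [ancestors_or_self(d, k) for d, k in zip(dims, keys)]
        for coord in product(*chains):
            cube.setdefault(coord, set()).add(doc_id)
    return cube


cube = build_cube(INDICES, (PARENT_R, PARENT_P))
```

In this cube, the cell ("South", "Appliance") holds {"A0001", "A0003"}, and the base cell ("Kaohsiung", "TV") holds a subset of its parent cell ("South", "TV"), which matches condition 2 of Definition 9.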
Based on the above definitions, a set of documents
S can be multi-dimensionally indexed by a
document cube $DC = (S, (D_1, D_2, \ldots, D_n))$, which
allows users to browse documents by rolling up and
drilling down along some dimensions $D_i$ for different
granularities and perspectives, obtaining further
insight into relationships among documents. A sample
illustration of a document cube DC = (S, (R, P,
T)) is shown in Fig. 4, where R and P represent the
aforementioned dimensions region and product,
respectively. Besides, we assume T is a dimension
representing time.

Fig. 5. The proposed architecture of document warehouses (components: Document Sources A, B, and C; Front-End Component; warehouse administrator; metadata; integrated document base; summarized documents; highly summarized documents; archive; Back-End Component; Text Mining Tools; Application Programs; On-Line Analytical Processing).
3. A framework for document warehousing
Designing a comprehensive architecture for document
warehousing can be challenging because document
warehousing covers a wide spectrum of concepts,
as we have shown in Section 2. Fortunately, a
general architecture has already been established for
data warehousing in Ref. [2]. Based on that architecture,
we extend the constructs to include more features
for document warehousing. The proposed architecture
is shown in Fig. 5.

Based on this architecture, we outline the general
process of extraction, transformation, loading,
dimensional modeling, and construction of a document
warehouse in Fig. 6. In this process, documents stored
in different document sources are transformed
and loaded into the document base. At the
same time, some of the metadata are retrieved or
generated and stored in the metadata repository. Besides,
the documents may be further integrated and categorized
into groups in the document base according to their
metadata or keywords. Then, by applying the
dimensional modeling process, fruitful document cubes can
be created for on-line analytical processing at the text
level based on certain business models. Notice that a
document cube does not need space to store the
document contents; it only contains the dimension
information and the file pointers, which can be used
to trace back the original document contents stored in
the document base. Finally, the processed result can
be presented via hyper-linked Web presentations.

Fig. 6. The general process of extraction, transformation, loading, dimensional modeling, and construction of a document warehouse (flow: Document Sources A and B; Transformation and Loading; Document Base and Metadata; Cluster; Dimensional Modeling; Doc Cubes; On-Line Analytical Processing with Business Models and OLAP User Query/Tool; Presentation. The sample cube shows products such as PC, Scanner, Printer, LCD Monitor, Software, Digital Camera, and Phone against the years 1997 to 2003).
The major components of a document warehouse
are explained as follows.
3.1. Document sources
Documents for a document warehouse are supplied
from:
1. Internal sources: In an organization, documents
in various formats are spread throughout
the organization in all kinds of document
repositories. The files may be in XML format, MS Word
format, e-mail, or even plain text.
2. External sources: Documents may also come from
the Internet, including Web pages, FTP sites,
commercially available document bases, private
documents shared by private servers, or document
repositories associated with an organization's
suppliers or customers.
3.2. Front-end component
The front-end component performs all the necessary
pre-processing of documents, such as text
summarization [12,16], text feature extraction [10],
document categorization [3], or other text mining
procedures [24,25,34], and then stores the obtained
features or patterns in the metadata, or stores the
summarized result as another summarized document.
3.3. Warehouse administrator
The warehouse administrator performs all the
operations associated with the management of
the documents in the warehouse. The opera-
tions include:
1. Enrich the metadata of all stored documents: Some
of the document metadata (e.g., those defined in
Dublin Core Metadata Element Set [38]) may be
missing and should be added manually by the
warehouse administrator.
2. Perform necessary text mining operations or gen-
erate the summarization for documents either man-
ually or by software tools (e.g., IBM Intelligent
Miner for Text [44]).
3. Create the dimensions and document indexes for
constructing document cubes.
4. Archive documents and related data/metadata.
3.4. Back-end components
The back-end component performs all the opera-
tions responsible for the management of user queries.
It is typically composed of a set of document access
tools, a multi-dimensional document query interface
[37], document warehouse monitoring tools, and cus-
tomized tools.
3.5. Highly summarized documents
This part stores the summarizations derived from
multiple documents that belong to the same cluster
or category. Some such work has
already been conducted [9,13].

The simplest format of a highly summarized document
can be represented by a set of keywords
appearing in the original document. Keywords of a
document can be derived by computing the traditional
tf*idf weights [28,29], pivoted cosine weights [31], or
weights derived by any other term-weighting scheme.

Table 2
The proposed metadata design
Attribute name  Description
Title           A name given to the resource.
Creator         An entity primarily responsible for making the content of the resource.
Subject         A topic of the content of the resource.
Description     An account of the content of the resource.
Publisher       An entity responsible for making the resource available.
Contributor     An entity responsible for making contributions to the content of the resource.
Date            A date of an event in the lifecycle of the resource.
Type            The nature or genre of the content of the resource.
Format          The physical or digital manifestation of the resource.
Identifier      An unambiguous reference to the resource within a given context.
Source          A reference to a resource from which the present resource is derived.
Language        A language of the intellectual content of the resource.
Relation        A reference to a related resource.
Coverage        The extent or scope of the content of the resource.
Rights          Information about rights held in and over the resource.
...             ...
Keywords        The keyword set derived from the resource.
Summarization   A brief summary generated from the resource by a summarization tool.
File_Path       A file pointer used to address the resource.
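As an illustration of keyword derivation by term weighting, the sketch below ranks the tokens of each document by one common tf*idf variant, tf multiplied by log(N/df). This particular formula, the pre-tokenized input, and the function name tfidf_keywords are assumptions made for the example; the schemes of Refs. [28,29,31] may differ in detail.

```python
import math
from collections import Counter


def tfidf_keywords(docs, top_k=3):
    """docs maps a document id to its list of tokens.

    Returns, for each document, the top_k tokens ranked by
    tf * log(N / df); ties are broken alphabetically.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each token once per document

    keywords = {}
    for doc_id, tokens in docs.items():
        term_freq = Counter(tokens)
        score = {t: term_freq[t] * math.log(n_docs / doc_freq[t])
                 for t in term_freq}
        ranked = sorted(score, key=lambda t: (-score[t], t))
        keywords[doc_id] = ranked[:top_k]
    return keywords


# Hypothetical, already tokenized documents.
DOCS = {
    "d1": ["tv", "screen", "unstable", "tv"],
    "d2": ["printer", "paper", "jam", "printer"],
    "d3": ["tv", "remote", "missing"],
}
KEYWORDS = tfidf_keywords(DOCS, top_k=2)
```

Note that "tv" is not among the top keywords of d1 although it occurs twice there, because it also appears in d3 and is therefore less discriminating than "screen" or "unstable".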
3.6. Metadata
The metadata of a document warehouse stores all
the metadata derived from all documents. All the
processes in the proposed architecture share this
metadata. In this paper, we propose to
design the metadata as described in Table 2. That is, the
metadata can be stored in a traditional relational
table, which contains attributes for all the elements
defined in the Dublin Core Metadata Element Set
[38] and some additional attributes. For simplicity,
we only list three extra attributes in Table 2, where
'Summarization', 'Keywords', and 'File_Path' are
used to store the summarization of the document,
the keyword set derived from the original document,
and the file path leading back to
the document from which the metadata are derived,
respectively. Such a file path is inherently unique and
can be used to describe the mapping between the
document sources and a common view of the
information within the document warehouse.
Some of the metadata can be obtained or derived
directly from the document itself. For example,
documents stored in Microsoft Word format have
summary information associated with the file itself,
and we can retrieve it directly from the stream
by employing Structured Storage [39]. Structured
Storage provides file and data persistence in COM
by handling a single file as a structured collection of
objects known as storages and streams.
4. Data modeling of document warehouses
The dimensional modeling technique [18,22]
adopted widely in data warehouse modeling can be
extended for document warehouses. Every dimensional
model is composed of one central table with a
composite key, called the fact table, which uses
foreign keys to link to a set of dimension tables. This
characteristic 'star-like' structure is also called a star
schema. Such a multi-dimensional data model for text
permits the definition of any dimension of interest as
defined in Definition 2. In Fig. 7, we show a star
schema for modeling document warehouses.
Fig. 7. An example star schema of a document warehouse (a central Fact Table with attributes Title_ID (FK), Creator_ID (FK), ..., Date_ID (FK), ..., Rights_ID (FK), ..., Keyword_ID (FK), ..., Document_ID, and count (Measure), linked to the Title, Creator, Date, Rights, and Keyword dimensions and to the document set S containing documents T1, T2, and T3).
4.1. Dimensions
As we have discussed in Section 2, dimensions can
be distinguished into the following types:
1. Ordinary dimension. A document can be highly
summarized by a set of keywords. Therefore, we
can construct an ordinary dimension containing a
set of keywords to allow users to pinpoint the
desired documents directly.
2. Metadata dimension. The elements defined
in the Dublin Core Metadata Element Set (title,
creator, subject, description, publisher, contributor,
date, type, format, identifier, source, language,
relation, coverage, and rights) can all be regarded as
metadata dimensions. Some of the dimensions
might be hierarchies or simply related data.
3. Category dimension. For example, a hierarchy
such as Wordnet or its subset, or user-defined
hierarchies, can be employed as category dimensions.
Notice that more than one category dimension
may be used to construct a document
cube, since a document can be categorized
into different categories from various points of
view.
In Table 1, we have presented two representations
for the dimension Region. Each representation can
easily be converted into the other.
The structure of Table 1(a) is easier for people to
understand, and that of Table 1(b) is more efficient for
computer processing. For dimensional modeling, we
assume each dimension Di is internally stored in the
relation Di(Tag, Parent, Level, Keyword) conforming
to the structure of Table 1(b), such that the
attribute Tag is the primary key,
and Di.Parent is a foreign key referencing
Di.Tag. For example, the dimensions Product and
Time in Fig. 4 can be represented as shown in
Fig. 8.
Under such circumstances, for a dimension D,
D(0) is the whole relation, and the other ith-level
member sets D(i) can be easily obtained by the
SQL statement "SELECT Tag, Keyword FROM D
WHERE Level = i", for $1 \le i \le n$.
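The relational storage of a dimension and the level query above can be tried out directly. The following sketch uses Python's built-in sqlite3 module with an in-memory database; the choice of SQLite and the helper names level_members and children are assumptions for illustration, while the table layout follows Table 1(b).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Region (
        Tag     INTEGER PRIMARY KEY,
        Parent  INTEGER REFERENCES Region(Tag),
        Level   INTEGER NOT NULL,
        Keyword TEXT    NOT NULL
    )
""")
# Rows of Table 1(b); the root row references itself as parent.
conn.executemany(
    "INSERT INTO Region VALUES (?, ?, ?, ?)",
    [
        (1, 1, 1, "(All Region)"),
        (2, 1, 2, "South"), (3, 1, 2, "North"),
        (4, 2, 3, "Tainan"), (5, 2, 3, "Kaohsiung"), (6, 2, 3, "Pingtong"),
        (7, 3, 3, "Taipei"), (8, 3, 3, "Taoyun"), (9, 3, 3, "Hsinchu"),
    ],
)


def level_members(conn, level):
    """D(i): all (Tag, Keyword) pairs stored at the given level."""
    cur = conn.execute(
        "SELECT Tag, Keyword FROM Region WHERE Level = ? ORDER BY Tag",
        (level,),
    )
    return cur.fetchall()


def children(conn, tag):
    """Drill-down: keywords whose parent is the given tag (root excluded)."""
    cur = conn.execute(
        "SELECT Keyword FROM Region WHERE Parent = ? AND Tag <> Parent "
        "ORDER BY Tag",
        (tag,),
    )
    return [row[0] for row in cur.fetchall()]
```

Calling level_members(conn, 2) returns the Location level, and children(conn, 2) lists the cities under South.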
4.2. The fact table
In general, the central fact table may be composed
of the following attributes:
1. A composite key, which is composed of a set of
foreign keys to the following dimensions:
(a) Ordinary dimensions: For example, the dimension
Keyword shown in Fig. 7 is an ordinary dimension.
(b) Metadata dimensions: For example, the dimensions
Title, Date, Creator, ..., and Rights shown in
Fig. 7 are metadata dimensions.
(c) Category dimensions. Notice that no category
dimensions are shown in Fig. 7.
2. Attributes used to derive the measures in a
document cube. The document count (i.e., the attribute
count of the fact table in Fig. 7) can be regarded as
the default measure in a document cube. Another
possible measure has been defined in Ref. [26] as the
weight of the term frequency of the corresponding
keyword.
3. A column Document_ID representing the document
identifier, which is a foreign key to the relation
S(Document_ID, File_path) shown in Fig. 7.
That is, the set S in Fig. 7 can be regarded as a
dimension containing all the document identifiers
and the corresponding file paths, where Document_ID
is the primary key and File_path stores the file
path or URI of each document.

Product                              Time
Tag  Parent  Level  Keyword          Tag  Parent  Level  Keyword
1    1       1      (All Product)    1    1       1      (All Time)
2    1       2      Appliance        2    1       2      2003
3    1       2      Communication    3    1       2      2004
4    1       2      Computer         4    2       3      Q1, 2003
5    2       3      TV               5    2       3      Q2, 2003
6    2       3      Refrigerator     6    2       3      Q3, 2003
7    3       3      Cellular Phone   7    2       3      Q4, 2003
8    3       3      Radio            8    3       3      Q1, 2004
9    4       3      Monitor          9    3       3      Q2, 2004
10           3      Printer          10   3       3      Q3, 2004
                                     11   3       3      Q4, 2004
Fig. 8. Dimensions Product and Time represented in relations.
Therefore, the fact table can be stored in the
relation Fact_Table(Document_ID, D1_Tag, D2_Tag,
..., Di_Tag, ..., Dn_Tag, Count), where Document_ID
is a foreign key referencing S.Document_ID,
and each Di_Tag is a foreign key matching
the primary key Di.Tag of dimension Di. For each
document T (with identifier $id_T$) in S, the document
index $x = (id_T, K_T)$, where $K_T = (K_1, K_2, \ldots, K_i, \ldots, K_n)$,
can be used to generate $\{id_T\} \times (K_1 \times K_2 \times \cdots \times K_n)$
as the set of initial tuples in the Fact_Table.
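As a minimal sketch of this generation step, the Cartesian product can be computed with Python's itertools.product. The function name initial_fact_tuples is hypothetical, keywords stand in for dimension tags for readability, and setting Count to 1 for every initial tuple is an assumption made here for illustration.

```python
from itertools import product


def initial_fact_tuples(doc_id, keyword_sets):
    """Return {doc_id} x K1 x ... x Kn as tuples (doc_id, k1, ..., kn, count).

    keyword_sets holds one sequence of keywords per dimension, in dimension order.
    Assumption: every initial tuple carries Count = 1 (it stands for one document).
    """
    return [(doc_id, *combo, 1) for combo in product(*keyword_sets)]


# The document index of Example 3: A0001 on dimensions (R, P).
rows = initial_fact_tuples("A0001", [["South", "Kaohsiung"], ["TV"]])
for row in rows:
    print(row)
```

The index yields two initial tuples, one for South and one for Kaohsiung, which is the kind of redundancy the minimality condition addresses.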
Note that the initial tuples generated by the above
process may cause redundancies. We formulate this
by the following definition.
Definition 11. A document index defined on n dimensions
$(D_1, D_2, \ldots, D_n)$, $x = (id_T, K_T)$, where
$K_T = (K_1, K_2, \ldots, K_i, \ldots, K_n)$, is minimal if and only if, for any two
keywords $k_x, k_y \in K_i$ with $k_x \ne k_y$, there is no ancestry
relationship between $k_x$ and $k_y$ in $D_i$, for all $1 \le i \le n$.
That is, $k_x$ is not an ancestor of $k_y$ in $D_i$, and vice versa.
For a non-minimal document index $x = (id_T, K_T)$,
where $K_T = (K_1, K_2, \ldots, K_i, \ldots, K_n)$, we can always
reduce $x$ into a minimal document index, denoted $\dot{x}$, by
iteratively finding pairs of keywords $k_x$ and $k_y$ in $K_i$, for
all $1 \le i \le n$, such that $k_y$ is a descendant of $k_x$, and then
eliminating the ancestor $k_x$ and retaining the descendant
$k_y$. Such a process is denoted $x \to \dot{x}$. It helps
to reduce the storage cost of a document index without
loss of indexing information, since the documents
indexed by an ancestor can be recursively derived by
rolling up the indexed documents from its lowest
descendants along the corresponding dimension.
Example 3. Suppose there is a non-minimal document
index of T (with unique identifier A0001), defined on
dimensions (R, P), $x$ = (A0001, ({South, Kaohsiung},
{TV})). After $x \to \dot{x}$, we obtain $\dot{x}$ = (A0001,
({Kaohsiung}, {TV})), since, according to the dimension R,
the documents indexed by South can be derived
from Tainan, Kaohsiung, and Pingtong.
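The reduction of Definition 11 can be sketched as follows, assuming each dimension is available as a child-to-parent map. The function names and the root label "(All Region)" are hypothetical; only the South subtree of R is stated in the text, and the placement of TV under Appliance follows Fig. 8.

```python
def ancestors(keyword, parent_of):
    """Return the set of proper ancestors of keyword in one dimension hierarchy."""
    found = set()
    current = keyword
    # Stop at a keyword with no recorded parent or at a self-referencing root.
    while current in parent_of and parent_of[current] != current:
        current = parent_of[current]
        if current in found:  # guard against malformed (cyclic) input
            break
        found.add(current)
    return found


def minimize(index, hierarchies):
    """Reduce a document index (doc_id, (K1, ..., Kn)) to its minimal form.

    hierarchies[i] maps each keyword of dimension i to its parent keyword.
    A keyword is dropped when it is an ancestor of another keyword in the same Ki.
    """
    doc_id, keyword_sets = index
    reduced = []
    for keywords, parent_of in zip(keyword_sets, hierarchies):
        redundant = set()
        for k in keywords:
            redundant |= ancestors(k, parent_of) & set(keywords)
        reduced.append(set(keywords) - redundant)
    return doc_id, tuple(reduced)


# Child -> parent maps (illustrative fragments of dimensions R and P).
REGION = {"South": "(All Region)", "Tainan": "South",
          "Kaohsiung": "South", "Pingtong": "South"}
PRODUCT = {"Appliance": "(All Product)", "TV": "Appliance"}

x = ("A0001", ({"South", "Kaohsiung"}, {"TV"}))
print(minimize(x, [REGION, PRODUCT]))
```

For the index of Example 3, South is removed because it is an ancestor of Kaohsiung, while an index that is already minimal is returned unchanged.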
4.3. The construction of document cubes
To construct a document cube, the process is somewhat
different from that for data cubes. Although
document indices derived from metadata and
category dimensions are obtained in the same way as in data
cubes, generating a document index
with respect to an ordinary dimension tends to be
time-consuming. Computing a measure in a data cube
is purely numerical computation, whereas
computing a document index (defined in Definition 6)
requires scanning the document content to match the
keywords in any ordinary dimension. Therefore, it is
necessary to develop another indexing structure to
accommodate the document indices derived from
ordinary dimensions. We have proposed such an
indexing structure, called D-tree, in Refs. [35,36],
where the performance of the indexing structure
was also evaluated.
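To make the cost of this scan concrete, the following is a naive baseline sketch, not the D-tree of Refs. [35,36]: it simply tests every keyword of every ordinary dimension against the document text. The function name scan_document, the sample e-mail, and the short keyword lists are all hypothetical.

```python
import re


def scan_document(doc_id, text, dimension_keywords):
    """Naively derive a document index (doc_id, (K1, ..., Kn)) by scanning text.

    dimension_keywords holds, per ordinary dimension, the keywords to look for.
    Matching is case-insensitive and on whole words or phrases only.
    """
    keyword_sets = []
    for keywords in dimension_keywords:
        matched = {
            k for k in keywords
            if re.search(r"\b" + re.escape(k) + r"\b", text, flags=re.IGNORECASE)
        }
        keyword_sets.append(matched)
    return doc_id, tuple(keyword_sets)


# Hypothetical complaint e-mail and small illustrative keyword lists.
mail = "The TV I bought in Kaohsiung stopped working after two weeks."
index = scan_document("A0001", mail,
                      [["Tainan", "Kaohsiung", "Pingtong"], ["TV", "Radio"]])
print(index)
```

Each document is searched once per keyword, so the work grows with both the number of documents and the size of the dimensions, which illustrates why a dedicated indexing structure is desirable.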
5. Applications of document warehouses
In this section, we present two applications of
document warehouses. The first is a document
warehouse for organizing complaint e-mails to
provide better customer relationship management.
The second is a document warehouse
indexing a set of journal papers that fall into different
categories and were published in different journals on
different dates.
5.1. An application for customer relationship
management
Suppose there is a company manufacturing appliances,
communication equipment, and computer peripherals,
and it has established branches in the north
and south regions. The objective is to warehouse
customer complaint e-mails for customer relationship
management.
After modeling the document cube, we obtain two
metadata dimensions (Creator and Date) and three
ordinary dimensions (Region, Product, and Time) as
shown in Fig. 9.
We briefly describe these dimensions as follows:
1. Ordinary dimension. The dimensions Region and
Product are as shown in Figs. 2 and 10, respectively.
The dimension Time is the purchase time
described in the e-mail.
2. Metadata dimension. The dimension Creator stores
the e-mail addresses of customers and the dimension
Date stores the dates on which the e-mails were received.
3. Category dimension. There are no category
dimensions shown in this example. However, the e-mail
documents could be further categorized, either
manually or automatically by software tools, into
hierarchical categories.
The fact table is composed of the following
attributes:
1. A composite key, which is composed of a set of
foreign keys to the aforementioned dimensions.
2. The attribute count is regarded as the default
measure in this document cube.
3. A column Document_ID, which serves as a foreign key
referencing the relation S(Document_ID,
file_path).
Fig. 9. An example star schema for complaint e-mail management.
[Diagram: the fact table holds Region_ID (FK), Creator_ID (FK),
Date_ID (FK), Product_ID (FK), Time_ID (FK), Document_ID, and
count (Measure), and is linked to the Creator, Date, Product,
Region, and Time dimensions and to S.]
Fig. 10. A concise illustration of dimension P.
[Tree diagram with nodes numbered 1 to 11. Level 1: (All Product).
Level 2: Appliance, Computer, Communication. Level 3: TV,
Refrigerator, DVD Player, Laptop Computer, Desktop Computer,
Radio, Cellular Phone.]
After constructing the document cube, we can
perform on-line analytical processing on it,
as illustrated in Fig. 11. Notice that
each count shown in Fig. 11 is actually a
hyperlink to a page containing the original
e-mails.
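The kind of query behind such a view can be sketched with Python's sqlite3 module. This is an illustration under simplifying assumptions, not the authors' implementation: Region and the purchase quarter are stored as plain strings rather than dimension tags, the tags and sample rows are invented, and the count measure is computed on the fly as COUNT(DISTINCT Document_ID).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Product (Tag INTEGER PRIMARY KEY, Parent INTEGER,
                          Level INTEGER, Keyword TEXT);
    CREATE TABLE Fact_Table (Document_ID TEXT, Region TEXT,
                             Product_Tag INTEGER REFERENCES Product(Tag),
                             Quarter TEXT);
    -- Hypothetical tags; only the hierarchy shape follows dimension P.
    INSERT INTO Product VALUES
        (1, 1, 1, '(All Product)'), (2, 1, 2, 'Appliance'), (3, 1, 2, 'Computer'),
        (4, 2, 3, 'TV'), (5, 2, 3, 'Refrigerator'), (6, 3, 3, 'Laptop Computer');
    -- Hypothetical complaint e-mails M1..M5.
    INSERT INTO Fact_Table VALUES
        ('M1', 'North', 4, '2004/Q1'), ('M2', 'North', 4, '2004/Q1'),
        ('M3', 'South', 5, '2004/Q1'), ('M4', 'South', 6, '2004/Q1'),
        ('M5', 'North', 6, '2003/Q4');
    """
)


def rolled_up_counts(conn, quarter):
    """Slice by quarter, roll Product up one level, and count documents per cell."""
    cur = conn.execute(
        "SELECT grp.Keyword, f.Region, COUNT(DISTINCT f.Document_ID) "
        "FROM Fact_Table AS f "
        "JOIN Product AS leaf ON leaf.Tag = f.Product_Tag "
        "JOIN Product AS grp ON grp.Tag = leaf.Parent "
        "WHERE f.Quarter = ? "
        "GROUP BY grp.Keyword, f.Region "
        "ORDER BY grp.Keyword, f.Region",
        (quarter,),
    )
    return cur.fetchall()


print(rolled_up_counts(conn, "2004/Q1"))
```

The self-join on Product performs the roll-up from leaf products to their level-2 groups; dropping the second join and grouping on leaf.Keyword would give the drilled-down view instead.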
5.2. An application for journal paper warehousing
Suppose there is a university laboratory, which
intends to warehouse research journal papers accord-
ing to some predefined categories.
After modeling the document cube, we establish
one category dimension, Category, and two metadata
dimensions, Source and Times, as shown in Fig. 12,
where Category represents the predefined categories,
Source stores the journal names of the selected papers,
and Times records the publication dates.
We briefly describe these dimensions as follows:
1. Ordinary dimension. There is no ordinary dimension
in this example. However, users may add a
dimension containing all of the keywords in the
selected journal papers, and organize the keywords
into hierarchies, either manually or automatically
by some text processing tools.
2. Metadata dimension. The dimension Source stores
the journal names, which are published by
organizations such as ACM, IEEE, Elsevier, and
Kluwer. The dimension Times stores the date of
publication, which is organized according to the
Year–Month hierarchy.
3. Category dimension. The dimension Category
stores the predefined categories, i.e., computer-aided
engineering, data mining and knowledge discovery,
data model, data structures, data warehousing,
database management, ..., and so forth. These
predefined categories are all organized into the same
level to simplify the illustration. Notice that
papers may be multi-categorized into these categories.
That is, a paper may fall into two or more categories.
The fact table is composed of the following
attributes:
1. A composite key, which is composed of a set of
foreign keys to the aforementioned dimensions.
2. The attribute count is regarded as the default
measure in this document cube.
3. A column Document_ID, which serves as a foreign key
referencing the relation S(Document_ID,
file_path).

Time slice: 2004/Quarter 1
                                  Region
Product                           North  South
Appliance      TV                 5      2
               Refrigerator       3      7
               DVD Player         7      9
Computer       Laptop Computer    8      1
               Desktop Computer   2      3
Communication  Radio              6      1
               Cellular Phone     5      7
Fig. 11. On-line analytical processing over the example document cube.
Fig. 12. An example star schema for journal paper warehousing.
[Diagram: the fact table holds Category_ID (FK), Source_ID (FK),
Times_ID (FK), Document_ID, and count (Measure), and is linked to
the Category, Source, and Time dimensions and to S.]
Fig. 13. On-line analytical processing over the example document cube.
[Screenshot of the query form with selectable row and column axes
(Source, Category) and the Times dimension sliced at 2004.]
To prove the concept proposed in this paper, we
have implemented this example to allow users to
perform on-line analytical processing on the obtained
document cube as illustrated in Fig. 13. The dimension
Times is used for slicing the document cube. Notice
that each count shown in Fig. 13 is actually a
hyperlink to a Web page containing the
detailed paper listing shown in Fig. 14 (obtained by slicing
the Times dimension by '2004' and clicking the count
at the intersection of 'Elsevier' and 'Data Mining and
Knowledge Discovery'). If all of the paper files remain
stored on the publishers' Web sites over a long period, then
this document warehouse does not need to copy and
store the physical files locally. That is, such a document
warehouse saves storage space while still providing
fast document access as the size of the warehouse grows.
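The drill-through from a count to its paper listing can be sketched as a join between the fact table and S, again with Python's sqlite3 module. The schema is simplified (dimension values are stored as strings, and the year stands in for the Times dimension), and the document identifiers and publisher.example URIs are invented placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE S (Document_ID TEXT PRIMARY KEY, File_path TEXT NOT NULL);
    CREATE TABLE Fact_Table (Document_ID TEXT REFERENCES S(Document_ID),
                             Category TEXT, Source TEXT, Year INTEGER);
    -- Only URIs are kept in S; the paper files stay on the publishers' sites.
    INSERT INTO S VALUES
        ('P1', 'https://publisher.example/papers/p1.pdf'),
        ('P2', 'https://publisher.example/papers/p2.pdf'),
        ('P3', 'https://publisher.example/papers/p3.pdf');
    -- P1 is multi-categorized, so it contributes two fact rows.
    INSERT INTO Fact_Table VALUES
        ('P1', 'Data Mining and Knowledge Discovery', 'Elsevier', 2004),
        ('P1', 'Data Warehousing', 'Elsevier', 2004),
        ('P2', 'Data Mining and Knowledge Discovery', 'Elsevier', 2004),
        ('P3', 'Data Mining and Knowledge Discovery', 'IEEE', 2003);
    """
)


def drill_through(conn, year, source, category):
    """Return the (Document_ID, File_path) pairs behind one cell of the sliced cube."""
    cur = conn.execute(
        "SELECT s.Document_ID, s.File_path "
        "FROM Fact_Table AS f JOIN S AS s ON s.Document_ID = f.Document_ID "
        "WHERE f.Year = ? AND f.Source = ? AND f.Category = ? "
        "ORDER BY s.Document_ID",
        (year, source, category),
    )
    return cur.fetchall()


listing = drill_through(conn, 2004, "Elsevier", "Data Mining and Knowledge Discovery")
for doc_id, uri in listing:
    print(doc_id, uri)
```

The length of the returned listing equals the count displayed in the corresponding cell, and because S holds links rather than file contents, the warehouse itself stays small.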
6. Conclusion and future directions
6.1. Conclusion
While data warehouses and numeric-centric
business intelligence technologies have served most
enterprises well, they do not address the
complete scope of business intelligence. In this paper,
we advocate the importance of constructing document
warehouses to support text-centric business intelligence,
and propose an architecture for document warehousing.
When documents are warehoused, users can
perform ad hoc on-line analytical processing (OLAP)
over text in a document warehouse, just as
they can perform OLAP over summarized data in a
data warehouse.

Fig. 14. The detailed paper listing after the count in the intersection of
'Elsevier' and 'Data Mining and Knowledge Discovery' was clicked.

The concept of document warehousing not only
provides fast document access
without degradation in performance as the size
of the warehouse grows, but also offers a set of
versatile applications for content management of
enterprise business intelligence. In business, document
warehousing can help administrators organize
meeting reports, gazettes, or even customer complaint
e-mails, where company personnel, products,
and time may be regarded as the dimensions,
such that documents related to particular employees or
products, at a particular time or place, can be
retrieved or browsed instantly. In recent years,
most data warehouse applications have been applied
in Customer Relationship Management (CRM), a
promising trend in business affairs. However, a
data warehouse supports only numeric
analyses of customer behavior. To find out
why customers buy (or do not buy) some products,
we need to establish a document warehouse. With data
warehousing, users can clearly understand business phenomena
in terms of who, what, when, where, and which.
Nevertheless, to discover why the phenomena
occurred, a document warehouse should be
employed [33].
When documents are warehoused, the task of
version control becomes much easier, since users
can directly trace the documents based on some
criteria along the time dimension. Such merits
also make document warehousing an attractive
organization for on-line topic detection and event
tracking [1] of news. Besides, document clustering
can be achieved directly via visualizations. Users
can also develop document summarization
tools [9,16,30] to summarize a cluster of related
documents. To sum up, data warehousing and
document warehousing are not only among the most
important infrastructures of knowledge management,
but also the kernel of customer relationship
management. They organize formatted data and
documents, respectively, on a multi-dimensional
basis. We compare their similarities and
differences in Table 3.
Table 3
A comparison between document warehousing and data warehousing

Similarities
1. Both have the same construction process. We may employ star
   schema or snowflake [22] to design the modeling process.
2. Both gather business documents/data from heterogeneous resources.
3. Users can do on-line analytical processing over the established
   result.

Differences
1. Document warehousing: Intends to obtain text-oriented business
   intelligence.
   Data warehousing: Intends to obtain numeric-oriented business
   intelligence.
2. Document warehousing: Resources gathered from market survey
   reports, project status reports, meeting records, customer
   complaints, e-mails, patent application sheets, and
   advertisements of competitors.
   Data warehousing: Resources gathered from internal databases of
   POS (point-of-sale) systems, ERP (enterprise resource planning)
   systems, accounting systems, or financial management systems.
3. Document warehousing: It filters out unnecessary documents and
   intends to help users to address problems regarding why.
   Data warehousing: It aggregates numerical data according to
   various dimensions, and intends to help users to address
   problems regarding who, what, when, where, and which.
4. Document warehousing: Enriched with text mining techniques to
   summarize documents or categorize documents.
   Data warehousing: Enriched with data mining techniques to
   summarize, classify, cluster formatted data or find the
   associations.
5. Document warehousing: Document sources should be integrated in
   file systems, or native XML databases [6,7].
   Data warehousing: Data sources can be integrated in relational
   databases.

6.2. Future works
In our future work, we will develop more techniques
for document warehousing. The preliminary
components may include the following modules.
1. Employ XML Schema [43] to define document
metadata. We advocate using the Extensible Markup
Language (XML) as the intermediate medium
for document interchange.
2. Incorporate automatic text summarization [12,14,
23], key feature extraction [10], or even document
classification and categorization [3] techniques for
document warehousing. Develop related text
summarization techniques to extract the most important
10–20% of the content so that users can digest the documents
more easily, and propose how to bind a document
summary with its corresponding documents for
document warehousing.
3. Develop automatic document metadata decomposition and
the mechanisms for storing the obtained metadata
into native XML or XML-enabled databases [5–7].
This helps users manage document warehouses
more efficiently.
Besides, although the dimension concepts defined
in this paper are organized into hierarchical structures,
we assume that, when scanning a document,
the system ignores the hierarchical relationships
among keywords in the document. Based on this
work, we wish to incorporate natural language
processing technologies to enhance the linguistic analysis
and annotation results of document parsing, and
to elaborate the work of adopting domain-specific
ontologies [8,11,25,32] with more refined concepts to be built
into the corresponding dimensions of a document cube.
Ontological analysis can help clarify the structure of
knowledge regarding a set of related documents. Given
a set of related documents corresponding to a specific
domain, the ontology forms the semantic heart of any
system of knowledge representation, and their document
cube forms the syntactic centroid of any system of
concept organization.
Finally, since the construction of a document warehouse
has to scan a large number of documents, which
is a time-consuming task, a parallel architecture
for such a process will be investigated further in
the future.
References
[1] J. Allan, R. Papka, V. Lavrenko, On-line new event detection
and tracking, Proceedings of the 21st Annual International
ACM SIGIR Conference on Research, 1998, pp. 37–45.
[2] S. Anahory, D. Murray, Data Warehousing in the Real World:
A Practical Guide for Building Decision Support Systems,
Addison-Wesley Longman, Harlow, England, 1997.
[3] A. Appiani, F. Cesarini, A. Colla, M. Diligenti, M. Gori, S.
Marinai, G. Soda, Automatic document classification and
indexing in high-volume applications, International Journal
on Document Analysis and Recognition 4 (2) (2002) 69–83.
[4] M.J.A. Berry, G. Linoff, Data Mining Techniques: For Mar-
keting, Sales, and Customer Support, John Wiley & Sons,
New York, 1997.
[5] E. Bertino, B. Catania, Integrating XML and databases, IEEE
Internet Computing 5 (4) (2001) 84–88.
[6] E. Bertino, E. Ferrari, XML and database integration, IEEE
Internet Computing 5 (6) (2001) 75–76.
[7] M. Champion, Native XML vs. XML-Enabled: the Difference
Makes a Difference, Software AG: The XML Company,
http://www.softwareag.com/xml/library/champion_nativexml.htm.
[8] B. Chandrasekaran, J.R. Josephson, V.R. Benjamins, What are
ontologies, and why do we need them? IEEE Intelligent
Systems 14 (1) (1999 Jan./Feb.) 20–26.
[9] H.H. Chen, S.J. Huang, A summarization system for Chinese
news from multiple sources, Proceedings of the 4th Interna-
tional Workshop on Information Retrieval with Asia Lan-
guage, 1999, pp. 1–7.
[10] F.F. Feng, W.B. Croft, Probabilistic techniques for phrase
extraction, Information Processing & Management 37 (2)
(2001 Mar.) 199–220.
[11] N. Fridman, C.D. Hafner, The state of the art in ontology
design, AI Magazine 18 (3) (1997) 53–74.
[12] J. Goldstein, M. Kantrowitz, V. Mittal, J. Carbonell, Summa-
rizing text documents: sentence selection and evaluation
metrics, Proceedings of SIGIR, 1999, pp. 121–128.
[13] J. Goldstein, V.O. Mittal, J.G. Carbonell, J.P. Callan, Creating
and evaluating multi-document sentence extract summaries,
Proceedings of the 9th International Conference on Informa-
tion and Knowledge Management, 2000, pp. 165–172.
[14] M. Grigsby, The Internet Document Warehouse: Content
Management for the Back Office, Technical Report,
IMERGE Consulting, Inc., 2001,
http://www.imergeportal.com/publishedarticles.asp.
[15] R. Hackathorn, Data warehousing energizes your enterprise,
Datamation 1 (1995 Feb.) 38–42.
[16] U. Hahn, I. Mani, The challenges of automatic summarization,
IEEE Computer 33 (11) (2000 Nov.) 29–36.
[17] J. Han, M. Kamber, Data Mining: Concepts and Techniques,
Morgan Kaufmann Publishers, 2001.
[18] W.H. Inmon, Building the Data Warehouse, John Wiley and
Sons, New York, NY, 1993.
[19] H. Ishikawa, K. Kubota, Y. Noguchi, K. Kato, M. Ono, N.
Yoshizawa, A. Kanaya, A document warehouse: a multimedia
database approach, Proceedings of the IEEE 9th International
Workshop on Database and Expert Systems Applications
(DEXA’98) Vienna, Austria, Aug. 26–28, 1998, pp. 90–94.
[20] H. Ishikawa, K. Kubota, Y. Noguchi, K. Kato, M. Ono, N.
Yoshizawa, Y. Kanemasa, Document warehousing based on a
multimedia database system, IEEE International Conference
on Data Engineering, 1999, pp. 168–173.
[21] H. Ishikawa, M. Ohta, K. Kato, Document warehousing: a
document-intensive application of a multimedia database, Pro-
ceedings of the IEEE 11th International Workshop on Re-
search Issues in Data Engineering, Heidelberg, Germany,
April 01–02, 2001, pp. 25–31.
[22] R. Kimball, The Data Warehouse Toolkit: Practical Techniques
for Building Dimensional Data Warehouses, John Wiley &
Sons, Inc., 1996.
[23] K. Knight, Mining online text, Communications of the ACM
42 (11) (1999).
[24] S.-H. Lin, C.-S. Shih, M.C. Chen, J.-M. Ho, M.-T. Ko, Y.-M.
Huang, Extracting classification knowledge of Internet docu-
ments with mining term associations: a semantic approach,
Proceedings of ACM SIGIR Conference on Research and
Development in Information Retrieval, 1998, pp. 241–249.
[25] S. Loh, L.K. Wives, J.P. de Oliverira, Concept-based knowledge
discovery in texts extracted from the web, SIGKDD
Explorations 2 (1) (2000 Jun.).
[26] M.C. McCabe, J. Lee, A. Chowdhury, D. Grossman, O. Frieder,
On the design and evaluation of a multi-dimensional approach
to information retrieval, Proceedings of the 23rd Annual
International ACM SIGIR Conference, 2000, pp. 363–365.
[27] G.A. Miller, Wordnet: an online lexical database, International
Journal of Lexicography 3 (4) (1990) 235–312.
[28] G. Salton, Automatic Text Processing, Addison-Wesley Pub-
lishing Company, 1988.
[29] G. Salton, M.J. McGill, Introduction to Modern Information
Retrieval, McGraw-Hill, 1983.
[30] S. Sekine, C. Nobata, A Survey of Multi-Document Summa-
rization, Proceedings HLT-NAACL Text Summarization
Workshop and Document Understanding Conference (DUC
2003), pp. 65–72.
[31] A. Singhal, C. Buckley, M. Mitra, Pivoted document length
normalization, Proceedings of the 19th Annual International
ACM SIGIR Conference, 1996, pp. 21–29.
[32] V. Sugumaran, V.C. Storey, Ontologies for conceptual model-
ing: their creation, use, and management, Data and Knowledge
Engineering 42 (2002) 251–271.
[33] D. Sullivan, Document Warehousing and Text Mining: Tech-
niques for Improving Business Operations, Marketing and
Sales, John Wiley & Sons, Inc., 2001.
[34] Ah-Hwee Tan, Text mining: the state of the art and the
challenges, Proceedings of the PAKDD 99—Workshop on
Knowledge Discovery from Advanced Databases, Beijing,
1999, pp. 50–70.
[35] F.S.C. Tseng, W.P. Lin, A study on indexing structure and its
properties for constructing document warehouses, Proceedings
of the 20th Workshop on Combinatorial Mathematics and
Computation Theory, Taiwan, 2003 (Aug.), pp. 18–27.
[36] F.S.C. Tseng, W.P. Lin, D-Tree: A Multi-Dimensional
Indexing Structure for Constructing Document Warehouses,
Journal of Information Science and Engineering, in press.
[37] F.S.C. Tseng, Design of a Multi-Dimensional Query Expres-
sion for Document Warehouses, Information Sciences, accept-
ed and in press.
[38] http://dublincore.org/, Dublin Core Metadata Initiative.
[39] http://msdn.microsoft.com/library/default.asp?url=/library/en-us/
stg/stg/structured_storage_start_page.asp, Structured Storage.
[40] http:otn.oracle.com/products/text/x/tech_Overviews/imt_817.
html, Oracle Corporation. InterMedia Text 8.1.6.
[41] http://www.globalwordnet.org, The GlobalWordnet Association.
[42] http://www.survey.com, "Development Snapshot: Warehouse Data of the Future," Application Development Trends, Feb. 2000.
[43] http://www.w3.org/XML/schema.
[44] http://www-3.ibm.com/software/data/iminer/fortext, IBM In-
telligent Miner for Text: Text Analysis Tools version 2.10.0.
Frank S.C. Tseng received his B.S., M.S. and Ph.D. degrees, all in
computer science and information engineering from National Chiao
Tung University, Taiwan, ROC, in 1986, 1988, and 1992, respec-
tively. He is one of the winners of Acer Long Term Ph.D. disser-
tation prize in 1992. From 1993 to 1995, he served the military in
the General Headquarters of ROC Air Force. He joined the faculty
of the Department of Information Management, Yuan-Ze University,
Taiwan, ROC, in August 1995. From 1996 to 1997, he was the
chairman of the department. He is currently with the Department of
Information Management, National Kaohsiung First University of
Science and Technology, as an associate professor. His research
interests include heterogeneous database systems, XML technolo-
gies for Internet computing, data warehousing, data mining, and
document warehousing. He has published extensively in journals
such as the VLDB Journal, IEEE Transactions on Knowledge and
Data Engineering, Data and Knowledge Engineering, Journal of
Systems and Software, Distributed and Parallel Databases: An
International Journal, Journal of Information Science, Information
Sciences, and Journal of Information Science and Engineering.
Dr. Tseng is a member of the IEEE Computer Society and the
Association for Computing Machinery. He was listed in Marquis
Who’s Who in Medicine and Healthcare in May 2004.
Annie Y.H. Chou received her B.S. degree in applied mathematics,
M.S. degree in computer science and information engineering, and
Ph.D degree in computer and information science, all from National
Chiao Tung University, Taiwan, ROC, in 1987, 1989, and 1996,
respectively. From 1989 to 1992, she was an assistant researcher of
Chunghua Telecom Laboratories, Taiwan, ROC. She joined the
faculty of the Department of Computer and Information Science,
Chinese Military Academy, in August 1997. She is presently the
chairman of the department. Her research interests include mathe-
matical analysis of computer algorithms, file organization design,
data warehousing and data mining, and internet computing. She was
a member of the Phi Tau Phi Scholastic Honor Society. Dr. Chou
has had papers published in the Computer Journal and Journal of
Information Science and Engineering.