  • www.elsevier.com/locate/dsw

    Decision Support Systems

    The concept of document warehousing for multi-dimensional modeling of textual-based business intelligence

    Frank S.C. Tseng a,*, Annie Y.H. Chou b,1

    a Department of Information Management, National Kaohsiung First University of Science and Technology, 1 University Road, YenChao, Kaohsiung, Taiwan 824, ROC

    b Department of Computer and Information Science, Chinese Military Academy, Taiwan

    Available online 17 June 2005

    Abstract

    During the past decade, data warehousing has been widely adopted in the business community. It provides multi-dimensional analyses of cumulated historical business data to help contemporary administrative decision-making. Nevertheless, it is believed that only about 20% of the information can be extracted from data warehouses, which concern numeric data only; the other 80% is hidden in non-numeric data or even in documents. Therefore, many researchers now advocate that it is time to conduct research on document warehousing to capture complete business intelligence. Document warehouses, unlike traditional document management systems, include extensive semantic information about documents, cross-document feature relations, and document grouping or clustering to provide more accurate and more efficient access to text-oriented business intelligence. In this paper, we discuss the basic concept of document warehousing and present its formal definitions. Then, we propose a general system framework and elaborate some useful applications to illustrate the importance of document warehousing. The work is essential for establishing an infrastructure to help combine text processing with numeric OLAP processing technologies. The combination of data warehousing and document warehousing will be one of the most important kernels of knowledge management and customer relationship management applications.

    © 2005 Elsevier B.V. All rights reserved.

    Keywords: Data warehousing; Document warehousing; Knowledge management; OLAP

    0167-9236/$ - see front matter © 2005 Elsevier B.V. All rights reserved.

    doi:10.1016/j.dss.2005.02.011

    This research was partially supported by the National Science Council, Taiwan, ROC, under Contract No. NSC-91-2416-H-327-005.

    * Corresponding author. Tel.: +886 7 6011000x4113; fax: +886 7 7659541.

    E-mail addresses: [email protected] (F.S.C. Tseng), [email protected] (A.Y.H. Chou).

    1 Tel.: +886 7 7438179.

    1. Introduction

    Data warehousing [18] and data mining techniques [17] are gaining popularity as organizations realize the benefits of being able to perform multi-dimensional analyses of cumulated historical business data to help contemporary administrative decision-making [2,4,15,17,22]. This inspires enterprises to eagerly delve into useful business intelligence (BI) from both internal and external data. Business intelligence is supposed to provide decision-makers with the tactical and strategic information they need for understanding, managing, and coordinating the operations and processes in organizations.

    F.S.C. Tseng, A.Y.H. Chou / Decision Support Systems 42 (2006) 727–744

    However, much of this effort has only touched the tip of the information iceberg. While the techniques regarding data warehouses, multi-dimensional models, on-line analytical processing (OLAP), and even ad hoc reports have served enterprises well, they do not address the full scope of business intelligence. It is believed [42] that, for the business intelligence of an enterprise, only about 20% of the information can be extracted from formatted data stored in relational databases. The remaining 80% is hidden in unstructured or semi-structured documents. This is because the most prevalent medium for expressing information and knowledge is text. For instance, market survey reports, project status reports, meeting records, customer complaints, e-mails, patent application sheets, and advertisements of competitors are all recorded in documents.

    Moreover, documents on the Web, in enterprise repositories, and in public document management systems are all growing as well. Therefore, knowledge workers, managers, and executives still have to spend much of their working time reading dozens, if not hundreds, of electronic documents of various types spread over the Internet. There is simply too much text to digest in daily life. The fast-growing, tremendous volume of documents has far exceeded the human ability to comprehend it without powerful tools.

    As a result, when important decisions are made, some relevant documents may be ignored, and some irrelevant documents may be considered by intuition. We believe that leaving out information from relevant documents, or keeping information guessed intuitively from irrelevant documents, may be detrimental, leading to disastrous strategies built on incomplete information.

    To alleviate this problem, Grigsby [14], McCabe et al. [26], and Sullivan [33] have advocated that documents should be properly warehoused according to some well-defined concepts, expanding the scope of business intelligence to include textual information. Ishikawa and colleagues [19–21] went further by implementing a prototype system that supports management of compound documents as well as keyword-based and content-based retrieval. They used ECA rules to classify multimedia documents, and a SOM (Self-Organizing Map) to cluster a set of collected texts into a number of groups in a retrieval space of manageable dimensions.

    Hence, we think one of the next challenges for the information community will be the study of document warehousing and text mining to help enterprises obtain complete business intelligence. Although research on text mining has been conducted widely (for example, readers are referred to Refs. [44,23–25,40,34]), the issues regarding document warehousing are rarely addressed. We have proposed a multi-dimensional indexing structure, called the D-tree, in Ref. [36] to study the performance measurement for constructing document warehouses. Some theoretical analyses of the properties of indexing a document warehouse were also elaborated in Ref. [35]. With document warehouses, the documents of enterprises can be well organized for effective analysis or feature extraction to create distilled and fruitful business intelligence.

    Since a document usually involves many diverse concepts, a document is multi-dimensional in nature. Document warehouses, unlike traditional document management systems, include extensive semantic information about documents, cross-document feature relations, and document grouping or clustering to provide more accurate and more efficient access to text-oriented business intelligence. To facilitate flexible and effective multi-dimensional on-line analytical document processing and browsing, a multi-dimensional query language for querying document warehouses is indispensable. In Ref. [37], we devised a multi-dimensional query expression for querying document warehouses to give users an easy and efficient way of performing on-line analytical processing on documents.

    Although issues about document warehousing have been addressed in Refs. [14,26,33], no formal definitions have been established so far. In this work, we first discuss the concept of document warehousing and formally define the related terms. Then, we propose a framework for document warehousing and elaborate some applications to sketch an attractive roadmap for using document warehouses.


    As Web applications proliferate, there will be a great need for rapid text processing and browsing. Document warehousing not only provides an infrastructure for developing tools with which business executives can systematically organize, understand, and properly categorize their documents to support strategic decision-making, but also integrates all kinds of related documents so that they can be browsed instantly.

    Document warehousing also provides an important platform for on-line analytical processing (OLAP) at the text level, supporting the interactive analysis of multi-dimensional documents of various granularities. It facilitates effective text mining, integrates documents into the business intelligence infrastructure, and provides the means to search for and target specific information the way we now do with numeric data. Furthermore, just as the construction of data warehouses can be viewed as an important step for data mining, the construction of document warehouses can be regarded as an indispensable preprocessing step for text mining. We realize that, no matter how wonderful the mechanism a system adopts, it cannot do much without good content organization of the domain on which it is to work. Moreover, once a good content organization is available, many different mechanisms might be employed equally well to implement effective systems. A well-organized document warehouse provides exactly such a content organization for these mechanisms to work on.

    In this paper, based on our prior works [35–37], we further illustrate the general architecture of a document warehouse and its applications. The work is essential for establishing an infrastructure to help combine text processing with numeric OLAP processing technologies. We believe such an infrastructure can help extend numeric data analysis by combining it with text processing technologies, making data warehousing and document warehousing together one of the most important kernels of knowledge management and customer relationship management applications. By combining document warehousing and data warehousing, documents can be integrated into the business intelligence infrastructure, providing the means to search for and target specific information the way we now do with numeric data.

    Although the content of documents is often more than text and may include graphics or even multimedia data, in this paper we consider only the textual parts of documents. In the future, we will extend our work to encompass the entirety of documents, which may be called multimedia warehousing [19–21].

    Our paper is organized as follows. In Section 2, the important concepts of document warehousing are formally presented. Based on these definitions, we propose a general architecture for constructing document warehouses in Section 3. Some applications of document warehouses are discussed in Section 4. Finally, we conclude and propose some future work in Section 5.

    2. An introduction to document warehousing

    In the following, we give definitions of document, dimension, document tuple, and document cube for document warehousing.

    Definition 1. A document T = {k1, k2, ..., ki} is a logical unit of text characterized by a set of keywords {k1, k2, ..., ki}.

    To organize documents into structures, we need the concept of dimension, defined as follows.

    Definition 2. A dimension D is a tree structure of m levels, m ≥ 1, which is used for representing the hierarchical relationships among a set of keywords. A node in a dimension D is called a member, and each internal node contains a special child called the summary member, denoted '*', which denotes the total concept of the other children of the internal node.

    When drawing a dimension, we usually leave out the summary members, since each has the same meaning as its parent node. Besides, the keywords in a dimension are not limited to those contained in document contents. Any property or metadata of a document file (e.g., those defined in the Dublin Core Metadata Element Set [38]) can also be regarded as a keyword in a dimension for constructing document cubes. Furthermore, if documents are organized into predefined categories, the category hierarchy to which a document belongs can also be regarded as a dimension. That is, text is not unstructured as is often assumed, as pointed out by Sullivan [33]. The concept of dimension can be employed to model the structure inherently hidden in text.

    Table 1
    A relation Region and its alternative for constructing dimension R

    (a)
    Location    City
    South       Tainan
    South       Kaohsiung
    South       Pingtong
    North       Taipei
    North       Taoyun
    North       Hsinchu

    (b)
    Tag    Parent    Level    Keyword
    1      1         1        (All Region)
    2      1         2        South
    3      1         2        North
    4      2         3        Tainan
    5      2         3        Kaohsiung
    6      2         3        Pingtong
    7      3         3        Taipei
    8      3         3        Taoyun
    9      3         3        Hsinchu

    According to the keyword sources, dimensions can be classified into the following types:

    1. Ordinary dimension. A dimension containing keywords used for scanning the document contents.

    2. Metadata dimension. A dimension containing keywords used for scanning document file properties or metadata. For example, the Dublin Core Metadata Element Set defines title, creator, subject, description, publisher, contributor, date, type, format, identifier, source, language, relation, coverage, and rights; all of these can be regarded as metadata dimensions.

    3. Category dimension. A dimension containing keywords corresponding to the nodes in a category hierarchy, such as WordNet [27,41], in which all considered documents should be multi-categorized. A document's relation to such a dimension, if not already given, can be determined manually or assigned automatically by document categorization tools.

    To simplify our discussion, we mainly use ordinary dimensions, together with the metadata dimension time (i.e., date), in the following examples.

    Definition 3. For a dimension D, the ith-level member set, denoted D(i), is defined as D(i) = {a | a is a member in the ith level of D, but a is not a summary member}. Besides, we use D(0) to denote the union of all non-summary members in D, which is the union of all ith-level member sets in D. That is, D(0) = ∪_{1≤i≤h} D(i), where h is the height of D. In practice, each D(i) has a specific name, which will be called the ith-level name.

    Practically, a dimension can be constructed from a relational table, with each level corresponding to an attribute in the relation and the attribute names usually used as the corresponding level names. Besides, any keyword in a dimension can be implemented as a set of synonyms to encompass more semantics. To illustrate the above definitions, we give an example as follows.

    Example 1. Suppose there is a relation Region representing the regions of Taiwan as shown in Table 1(a). An alternative representation is shown in Table 1(b). This relation can be used to construct a dimension, denoted R, as depicted in Fig. 1, where the first level corresponds to the dimension itself, which is commonly denoted "(All Region)", and the second and third levels are derived from the attributes Location and City, respectively. All nodes in Fig. 1 with label '*' are summary members. That is, the summary member in the second level has the same meaning as all regions in Taiwan, which represents {South, North}. Besides, the summary members under South and North have the same meanings as South and North, which denote {Tainan, Kaohsiung, Pingtong} and {Taipei, Taoyun, Hsinchu}, respectively. By omitting all the summary members, Fig. 1 is redrawn in Fig. 2. According to the illustration of dimension R, we know that R(1) = {(All Region)}, R(2) = {South, North}, R(3) = {Tainan, Kaohsiung, Pingtong, Taipei, Taoyun, Hsinchu}, and R(0) = {(All Region), South, North, Tainan, Kaohsiung, Pingtong, Taipei, Taoyun, Hsinchu}.
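The construction in Example 1 can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names `build_dimension` and `member_set` are hypothetical, summary members are left implicit as in Fig. 2, and each keyword is assumed to occur under a single parent.

```python
# Rows of the relation Region from Table 1(a): (Location, City).
REGION = [
    ("South", "Tainan"), ("South", "Kaohsiung"), ("South", "Pingtong"),
    ("North", "Taipei"), ("North", "Taoyun"), ("North", "Hsinchu"),
]


def build_dimension(root, rows):
    """Build a dimension tree from relation rows.

    Returns (children, level): children maps a member to its child members,
    level maps every non-summary member to its level (the root is level 1).
    Summary members ('*') are left implicit, as in Fig. 2.
    Assumes each keyword occurs under a single parent.
    """
    children = {root: []}
    level = {root: 1}
    for row in rows:
        parent = root
        for depth, keyword in enumerate(row, start=2):
            if keyword not in level:
                level[keyword] = depth
                children.setdefault(parent, []).append(keyword)
            parent = keyword
    return children, level


def member_set(level, i):
    """Return D(i) for i >= 1, or D(0), the union of all levels, for i == 0."""
    return {k for k, lv in level.items() if i == 0 or lv == i}


children, level = build_dimension("(All Region)", REGION)
print(sorted(member_set(level, 2)))   # second-level members of R
print(len(member_set(level, 0)))      # size of R(0)
```

Each attribute of the relation contributes one level below the root, which mirrors the remark that attribute names usually serve as level names.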

    Fig. 1. An illustration of dimension R. [Tree with three levels: level 1, (All Region); level 2, South and North; level 3, Tainan, Kaohsiung, and Pingtong under South, and Taipei, Taoyun, and Hsinchu under North. Each internal node also has a summary member '*'.]

    For a dimension D, there are two basic operations called drill-down and roll-up, which are formally defined as follows.

    Definition 4. For a dimension D, expanding an internal node to obtain all of its children is called drill-down, and shrinking a set of children to obtain their common parent is called roll-up.

    This can be further clarified by the following

    definitions.

    Definition 5. For any two n-tuples of keywords A = (a1, a2, ..., ai, ..., an) and B = (b1, b2, ..., bi, ..., bn) defined on n dimensions (D1, D2, ..., Di, ..., Dn), where ai, bi ∈ Di(0), we say that B is a member of drilling down A along dimension Di (or A is a member of rolling up B along dimension Di), denoted A ⊳i B, if and only if there exists exactly one i, 1 ≤ i ≤ n, such that bi is a child of ai in Di, and bj = aj for all j ≠ i.
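As a minimal sketch of Definition 5, the check below tests whether one keyword tuple drills down another along a single dimension. The function name `drill_down_dimension` is hypothetical, the product dimension P is a made-up fragment used only for illustration, and indices are 0-based here rather than 1-based as in the text.

```python
# Child -> parent maps for two dimensions. R follows Table 1; P is a
# hypothetical fragment of a product dimension, used only for illustration.
R_PARENT = {
    "South": "(All Region)", "North": "(All Region)",
    "Tainan": "South", "Kaohsiung": "South", "Pingtong": "South",
    "Taipei": "North", "Taoyun": "North", "Hsinchu": "North",
}
P_PARENT = {"Appliance": "(All Product)", "TV": "Appliance", "Refrigerator": "Appliance"}


def drill_down_dimension(a, b, parents):
    """Return the index i such that b drills down a along dimension i, else None.

    Follows Definition 5: exactly one position differs, and at that
    position b[i] is a child of a[i]; every other position is unchanged.
    Indices are 0-based here, unlike the 1-based indices in the text.
    """
    if not (len(a) == len(b) == len(parents)):
        return None
    changed = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    if len(changed) != 1:
        return None
    i = changed[0]
    return i if parents[i].get(b[i]) == a[i] else None


dims = (R_PARENT, P_PARENT)
print(drill_down_dimension(("South", "TV"), ("Kaohsiung", "TV"), dims))
print(drill_down_dimension(("South", "Appliance"), ("Kaohsiung", "TV"), dims))
```

The second call changes two coordinates at once, so it is not a single drill-down step and the function returns `None`.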

    Definition 6. For a document T with unique identifier idT, a document index of T defined on n dimensions (D1, D2, ..., Dn) is denoted x = (idT, KT), where KT = (K1, K2, ..., Ki, ..., Kn) is an n-tuple of keyword sets, such that each Ki contains a set of keywords, and for all keywords kij ∈ Ki, kij ∈ T and kij ∈ Di(0), for all 1 ≤ i ≤ n.

    Fig. 2. A concise illustration of dimension R. [The tree of Fig. 1 without summary members, with level names All Region, Location, and City, and nodes labeled with tags 1 to 9.]

    For simplicity, the first and second components of a document index x = (idT, KT) will be denoted x^1 and x^2 (i.e., x^1 = idT and x^2 = KT), respectively. When all |Ki| = 1, the document index is also called a base document index, and each Ki can also be denoted by its only element for convenience (that is, in such cases, KT = ({k1}, {k2}, ..., {ki}, ..., {kn}) can be abbreviated as KT = (k1, k2, ..., ki, ..., kn)). If there is at least one Ki such that |Ki| > 1, and the sizes of the other Kj's all equal 1, then the document index is also called a composite document index. Finally, if there is some Ki such that |Ki| = 0, then the document index is also called a degenerate document index. In the following, a degenerate document index with some |Ki| = 0 will be generalized by using the top-level member set of the corresponding dimension, Di(1), to substitute for the missing keyword set Ki.

    Example 2. Suppose there is a complaint e-mail issued by a customer as shown in Fig. 3.

        To whom it may concern: We have bought a TV from your Kaohsiung branch last weekend. However, we found the screen is severely unstable. Please give us the phone number of your service center. Thank you for your kindly help.

        Sincerely,

        Frank S.C. Tseng

    Fig. 3. A complaint e-mail issued by a customer (A0001).

    Then, a base document index of T defined on the above two dimensions (R, P) can be obtained as x = (A0001, ({Kaohsiung}, {TV})), where A0001 is the unique identifier of T.
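The index in Example 2 can be reproduced with a short sketch. It assumes a naive single-word tokenizer (so multi-word keywords such as "(All Region)" are never matched) and a hypothetical fragment P0 of the product dimension; `document_index` and `index_kind` are illustrative names, not part of the paper.

```python
import re

# D(0) for the two dimensions. R0 follows Example 1; P0 is a hypothetical
# fragment of a product dimension.
R0 = {"(All Region)", "South", "North", "Tainan", "Kaohsiung", "Pingtong",
      "Taipei", "Taoyun", "Hsinchu"}
P0 = {"(All Product)", "Appliance", "TV", "Refrigerator"}

EMAIL = (
    "To whom it may concern: We have bought a TV from your Kaohsiung branch "
    "last weekend. However, we found the screen is severely unstable. "
    "Please give us the phone number of your service center. "
    "Thank you for your kindly help."
)


def document_index(doc_id, text, dimensions):
    """Return (doc_id, (K1, ..., Kn)), where Ki holds the words of the text in Di(0)."""
    words = set(re.findall(r"[A-Za-z]+", text))
    return doc_id, tuple(words & members for members in dimensions)


def index_kind(index):
    """Classify an index as 'base', 'composite', or 'degenerate'."""
    sizes = [len(k) for k in index[1]]
    if any(size == 0 for size in sizes):
        return "degenerate"
    if all(size == 1 for size in sizes):
        return "base"
    return "composite"


x = document_index("A0001", EMAIL, (R0, P0))
print(x, index_kind(x))
```

The degenerate case is tested first because an index with an empty keyword set is degenerate regardless of the sizes of the other sets.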

    The basic component of a document cube is called

    a cell, which is defined as follows.

    Definition 7. A cell defined on n dimensions (D1, D2, ..., Dn) is denoted c = (tc, Xc), where tc = (c1, c2, ..., ci, ..., cn), ci ∈ Di(0) ∪ {'*'}, 1 ≤ i ≤ n, and Xc = {x1, x2, ..., xj, ..., xm} is a set of document indices of the form xj = (idTj, (K1, K2, ..., Kn)), where idTj is the unique identifier of some document Tj and Ki ∩ Di(0) ≠ ∅, 1 ≤ i ≤ n. The set of all such document unique identifiers idTj involved in the cell c = (tc, Xc) is denoted ID(c) = {xj^1 | xj ∈ Xc}. That is, a document with unique identifier in ID(c) can be directly accessed from the cell c.

    Fig. 4. A sample illustration of a document cube. [Axes: region, product, and time. Region members: (All Region); North, South; Taipei, Taoyuan, HsinChu, Tainan, Kaohsiung, Pingtong. Product members: (All Product); Appliance, Communication, Computer; TV, Refrigerator, Cellular Phone, Radio, Monitor, Printer. Two cells, a and d, are marked with their document sets ID(a) and ID(d); the figure labels a base cell and a non-base cell.]

    Definition 8. A cell c = (tc, Xc), where tc = (c1, c2, ..., ci, ..., cn), defined on n dimensions (D1, D2, ..., Dn) is called an m-d cell, 0 ≤ m ≤ n, if and only if there are exactly m non-summary members ci (i.e., ci ≠ '*'). If m = n and ci ∈ Di(hi), where hi is the height of Di, for all 1 ≤ i ≤ n, then c is also called a base cell; otherwise c is called a non-base cell.

    Definition 9. An n-dimensional i-d cell a = ((a1, a2, ..., an), Xa) is a parent of another n-dimensional i-d cell b = ((b1, b2, ..., bn), Xb) if and only if the following conditions hold:

    1. There exists exactly one k such that ak is the parent of bk in Dk, and al = bl for all l ≠ k, 1 ≤ l ≤ n.

    2. ID(b) ⊆ ID(a), where ID(a) and ID(b) are the sets of all document unique identifiers involved in cells a and b, respectively.
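The cell classification of Definition 8 can be illustrated with a small sketch. It assumes that the height of a dimension is its deepest level number and uses a hypothetical fragment of the product dimension; `cell_order` and `is_base_cell` are illustrative names only.

```python
SUMMARY = "*"

# Level of each non-summary member (root = level 1). R follows Example 1;
# P is a hypothetical fragment of a product dimension.
R_LEVEL = {"(All Region)": 1, "South": 2, "North": 2, "Tainan": 3, "Kaohsiung": 3,
           "Pingtong": 3, "Taipei": 3, "Taoyun": 3, "Hsinchu": 3}
P_LEVEL = {"(All Product)": 1, "Appliance": 2, "TV": 3, "Refrigerator": 3}


def cell_order(tc):
    """Return m for an m-d cell: the number of non-summary coordinates."""
    return sum(1 for c in tc if c != SUMMARY)


def is_base_cell(tc, levels):
    """True if every coordinate is a non-summary member at the deepest level
    of its dimension (Definition 8)."""
    return all(
        c != SUMMARY and lv.get(c) == max(lv.values())
        for c, lv in zip(tc, levels)
    )


levels = (R_LEVEL, P_LEVEL)
print(cell_order(("Kaohsiung", "TV")), is_base_cell(("Kaohsiung", "TV"), levels))
print(cell_order(("South", SUMMARY)), is_base_cell(("South", SUMMARY), levels))
```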

    Definition 10. A document cube DC = (S, (D1, D2, ..., Dn)), where S is a set of documents defined on n dimensions (D1, D2, ..., Dn), is a cube composed of all cells ci = (tci, Xci) with tci ∈ D1(0) × D2(0) × ... × Dn(0) and ID(ci) ⊆ S.

    Based on the above definitions, a set of documents S can be multi-dimensionally indexed by a document cube DC = (S, (D1, D2, ..., Dn)), which allows users to browse documents by rolling up and drilling down along some dimensions Di for different granularities and perspectives, obtaining further insight into relationships among documents. A sample illustration of a document cube DC = (S, (R, P, T)) is shown in Fig. 4, where R and P represent the aforementioned dimensions region and product, respectively. Besides, we assume T is a dimension representing time.
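One way to picture a document cube is as a mapping from cell coordinates to sets of document identifiers. The sketch below is an assumption-laden illustration rather than the authors' construction: it materializes only non-empty cells, uses the root member in place of the summary member, works on two dimensions only, and the document A0002 as well as the product fragment P are hypothetical.

```python
from itertools import product

# Child -> parent maps. R follows Table 1; P is a hypothetical fragment.
R_PARENT = {
    "South": "(All Region)", "North": "(All Region)",
    "Tainan": "South", "Kaohsiung": "South", "Pingtong": "South",
    "Taipei": "North", "Taoyun": "North", "Hsinchu": "North",
}
P_PARENT = {"Appliance": "(All Product)", "TV": "Appliance", "Refrigerator": "Appliance"}


def with_ancestors(member, parent):
    """Return the member together with all of its ancestors up to the root."""
    chain = {member}
    while member in parent:
        member = parent[member]
        chain.add(member)
    return chain


def build_cube(indices, parents):
    """Map each non-empty cell coordinate tuple to the set ID(c) of document ids.

    indices is an iterable of document indices (doc_id, (K1, ..., Kn)).
    A document is registered in the cell of its own keywords and in every
    cell reachable from it by rolling up.
    """
    cube = {}
    for doc_id, keyword_sets in indices:
        per_dim = [
            set().union(*(with_ancestors(k, p) for k in ks))
            for ks, p in zip(keyword_sets, parents)
        ]
        for tc in product(*per_dim):
            cube.setdefault(tc, set()).add(doc_id)
    return cube


# A0001 is the e-mail of Example 2; A0002 is a made-up second document.
indices = [
    ("A0001", ({"Kaohsiung"}, {"TV"})),
    ("A0002", ({"Taipei"}, {"TV"})),
]
cube = build_cube(indices, (R_PARENT, P_PARENT))
print(sorted(cube[("South", "TV")]))
print(sorted(cube[("(All Region)", "Appliance")]))
```

Because every document is also registered in the rolled-up cells, the identifier set of a child cell is always contained in that of its parent, which matches the second condition of Definition 9.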

    3. A framework for document warehousing

    Designing a comprehensive architecture for document warehousing can be challenging because document warehousing covers a wide spectrum of concepts, as we have shown in Section 2. Fortunately, a general architecture has already been established for data warehousing in Ref. [2]. Based on that architecture, we extend the constructs to include more features for document warehousing. The proposed architecture is shown in Fig. 5.

    Fig. 5. The proposed architecture of document warehouses. [Components shown: Document Sources A, B, and C; Front-End Component; warehouse administrator; metadata; integrated document base; summarized documents; highly summarized documents; archive; Back-End Component; On-Line Analytical Processing; Text Mining Tools; Application Programs.]

    Based on this architecture, we outline the general process of extraction, transformation, loading, dimensional modeling, and construction of a document warehouse in Fig. 6. In this process, documents stored in different document sources are respectively transformed and loaded into the document base. At the same time, some of the metadata are retrieved or generated into the metadata repository. Besides, the documents may be further integrated and categorized into groups in the document base according to their metadata or keywords. Then, by applying the dimensional modeling process, fruitful document cubes can be created for on-line analytical processing at the text level based on certain business models. Notice that a document cube does not need space to store the document contents; it only contains the dimension information and the file pointers, which can be used to trace back the original document contents stored in the document base. Finally, the processed result can be presented via hyper-linked Web presentations.

    Fig. 6. The general process of extraction, transformation, loading, dimensional modeling, and construction of a document warehouse. [Stages shown: Document Sources A and B; Transformation and Loading; Document Base and Metadata; Cluster; Dimensional Modeling; Doc Cubes; On-Line Analytical Processing; OLAP User Query/Tool; Business Models; Presentation.]
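The remark that a cube keeps only file pointers, not document contents, can be sketched as follows. This is a hypothetical illustration: `MetadataRecord` carries just a few of the attributes listed in Table 2, `resolve` is an invented helper, and the file paths are made up.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataRecord:
    """A few of the attributes listed in Table 2; the rest are omitted."""
    identifier: str
    title: str = ""
    date: str = ""
    keywords: set = field(default_factory=set)
    file_path: str = ""


def resolve(doc_ids, repository):
    """Return the file paths of the given document ids, in sorted order.

    Ids without a metadata record are skipped.
    """
    return sorted(repository[i].file_path for i in doc_ids if i in repository)


# Hypothetical repository content and cell; paths are made up for illustration.
repository = {
    "A0001": MetadataRecord("A0001", title="Complaint e-mail",
                            keywords={"Kaohsiung", "TV"},
                            file_path="docbase/mail/A0001.txt"),
    "A0002": MetadataRecord("A0002", keywords={"Taipei", "TV"},
                            file_path="docbase/mail/A0002.txt"),
}
cell_ids = {"A0002", "A0001"}
print(resolve(cell_ids, repository))
```

Keeping the cube free of document text means that cells stay small, while the document base remains the single place where contents are stored.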

    The major components of a document warehouse

    are explained as follows.

    3.1. Document sources

    The documents for a document warehouse are supplied from:

    1. Internal sources: In an organization, documents in various formats are spread throughout the organization in all kinds of document repositories. The files may be in XML format, MS Word format, e-mail, or even plain text.

    2. External sources: Documents may also come from the Internet, including Web pages, FTP sites, commercially available document bases, private documents shared by private servers, or document repositories associated with an organization's suppliers or customers.

    3.2. Front-end component

    The front-end component performs all the necessary pre-processing of documents, such as text summarization [12,16], text feature extraction [10], document categorization [3], or other text mining procedures [24,25,34], and then stores the obtained features or patterns in the metadata or stores the summarized result as another summarized document.

    3.3. Warehouse administrator

    The warehouse administrator performs all the operations associated with the management of the documents in the warehouse. The operations include:

    1. Enriching the metadata of all stored documents: some of the document metadata (e.g., those defined in the Dublin Core Metadata Element Set [38]) may be missing and should be added manually by the warehouse administrator.

    2. Performing necessary text mining operations or generating summaries of documents, either manually or with software tools (e.g., IBM Intelligent Miner for Text [44]).

    3. Creating the dimensions and document indexes for constructing document cubes.

    4. Archiving documents and related data/metadata.

    3.4. Back-end components

    The back-end component performs all the operations responsible for the management of user queries. It is typically composed of a set of document access tools, a multi-dimensional document query interface [37], document warehouse monitoring tools, and customized tools.

  • F.S.C. Tseng, A.Y.H. Chou / Decision Support Systems 42 (2006) 727–744 735

    3.5. Highly summarized documents

    This part stores all the summaries derived from
    multiple documents that belong to the same cluster
    or category. Some work on such multi-document
    summarization has already been reported [9,13].

    The simplest form of a highly summarized document
    is a set of keywords appearing in the original
    document. Keywords of a document can be derived by
    computing the traditional tf*idf weights [28,29],
    pivoted cosine weights [31], or weights derived by
    any other term-weighting scheme.

    Table 2
    The proposed metadata design

    Attribute name   Description
    --------------   ----------------------------------------------------
    Title            A name given to the resource.
    Creator          An entity primarily responsible for making the
                     content of the resource.
    Subject          A topic of the content of the resource.
    Description      An account of the content of the resource.
    Publisher        An entity responsible for making the resource
                     available.
    Contributor      An entity responsible for making contributions to
                     the content of the resource.
    Date             A date of an event in the lifecycle of the resource.
    Type             The nature or genre of the content of the resource.
    Format           The physical or digital manifestation of the
                     resource.
    Identifier       An unambiguous reference to the resource within a
                     given context.
    Source           A reference to a resource from which the present
                     resource is derived.
    Language         A language of the intellectual content of the
                     resource.
    Relation         A reference to a related resource.
    Coverage         The extent or scope of the content of the resource.
    Rights           Information about rights held in and over the
                     resource.
    ...              ...
    Keywords         The keyword set derived from the resource.
    Summarization    A brief summary generated from the resource by a
                     summarization tool.
    File_Path        A file pointer used to address the resource.
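    The keyword derivation mentioned in Section 3.5 can be
    sketched as follows. This is only a minimal illustration
    of the classic tf*idf weight (raw term frequency times
    the logarithm of the inverse document frequency); the
    function name tfidf_keywords, the whitespace tokenizer,
    and the three sample complaint texts are assumptions made
    for the example and are not part of the proposed system.

```python
import math
from collections import Counter


def tfidf_keywords(docs, top_k=3):
    """Return the top_k highest-weighted terms of each tokenized document."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    keywords = []
    for doc in docs:
        tf = Counter(doc)
        # Classic weight: term frequency times log inverse document frequency.
        weights = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
        # Sort by descending weight; break ties alphabetically.
        ranked = sorted(weights, key=lambda t: (-weights[t], t))
        keywords.append(ranked[:top_k])
    return keywords


# Hypothetical sample texts, tokenized by whitespace.
docs = [
    "tv screen broken tv remote".split(),
    "printer paper jam printer".split(),
    "tv delivery late".split(),
]
print(tfidf_keywords(docs, top_k=1))
```

    A term that occurs in many documents (here "tv") receives
    a low weight even when it is frequent within one document,
    which is why such schemes tend to select discriminating
    keywords.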

    3.6. Metadata

    The metadata component of a document warehouse stores
    all the metadata derived from all documents. All the
    processes in the proposed architecture share this
    metadata. In this paper, we propose to design the
    metadata as described in Table 2. That is, the metadata
    can be stored in a traditional relational table, which
    contains attributes for all the elements defined in the
    Dublin Core Metadata Element Set [38] and some additional
    attributes. For simplicity, we only list three extra
    attributes in Table 2, where ‘Summarization’, ‘Keywords’,
    and ‘File_Path’ are used to store the summarization of
    the document, the keyword set derived from the original
    document, and the file path leading back to the document
    from which the metadata are derived, respectively. Such
    a file path is inherently unique and can be used to
    describe the mapping between the document sources and a
    common view of the information within the document
    warehouse.
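    As an illustration of storing the metadata of Table 2 in
    a relational table, the following sketch uses Python's
    built-in sqlite3 module. The table name Metadata, the
    TEXT column types, the choice of File_Path as primary key
    (motivated by the remark that the file path is inherently
    unique), and the sample row are all assumptions made for
    the example.

```python
import sqlite3

# The fifteen elements of the Dublin Core Metadata Element Set (Table 2).
DUBLIN_CORE = [
    "Title", "Creator", "Subject", "Description", "Publisher",
    "Contributor", "Date", "Type", "Format", "Identifier",
    "Source", "Language", "Relation", "Coverage", "Rights",
]


def create_metadata_table(conn):
    """Create one relational table holding Dublin Core plus three extras."""
    names = DUBLIN_CORE + ["Keywords", "Summarization"]
    cols = ", ".join(f'"{name}" TEXT' for name in names)
    # Assumption: File_Path serves as the key because it is unique per document.
    conn.execute(f'CREATE TABLE Metadata ({cols}, "File_Path" TEXT PRIMARY KEY)')


conn = sqlite3.connect(":memory:")
create_metadata_table(conn)
# Hypothetical sample row; missing elements simply stay NULL.
conn.execute(
    'INSERT INTO Metadata ("Title", "Creator", "Keywords", "File_Path") '
    "VALUES (?, ?, ?, ?)",
    ("Complaint about TV", "customer@example.com", "TV;Kaohsiung", "/mail/A0001.eml"),
)
row = conn.execute(
    'SELECT "Title", "File_Path" FROM Metadata WHERE "Keywords" LIKE ?',
    ("%TV%",),
).fetchone()
print(row)
```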

    Some of the metadata can be obtained or derived
    directly from the document itself. For example,
    documents stored in Microsoft Word format have some
    summary information associated with the file itself,
    and we can retrieve it directly from the stream by
    employing Structured Storage [39]. Structured Storage
    provides file and data persistence in COM by handling
    a single file as a structured collection of objects
    known as storages and streams.

    4. Data modeling of document warehouses

    The dimensional modeling technique [18,22] adopted
    widely in data warehouse modeling can be extended for
    document warehouses. Every dimensional model is composed
    of one central table with a composite key, called the
    fact table, which uses foreign keys to link to a set of
    dimension tables. This characteristic ‘star-like’
    structure is also called a star schema. Such a
    multi-dimensional data model for text permits the
    definition of any dimension of interest as defined in
    Definition 2. In Fig. 7, we show a star schema for
    modeling document warehouses.

    Fig. 7. An example star schema of a document warehouse.
    (The fact table, with attributes Title_ID (FK),
    Creator_ID (FK), ..., Date_ID (FK), ..., Rights_ID (FK),
    ..., Keyword_ID (FK), ..., Document_ID, and count
    (Measure), is linked to the Date, Creator, Title, Rights,
    and Keyword dimensions and to the document set S
    containing T1, T2, and T3.)
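    A reduced version of the star schema in Fig. 7 can be
    sketched with Python's built-in sqlite3 module. Only two
    of the dimensions (Creator and Keyword) are kept for
    brevity; the sample creators, keywords, and file paths
    are hypothetical, and including Document_ID in the
    composite key is an assumption made here so that each
    fact row is unique.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE S (Document_ID TEXT PRIMARY KEY, File_Path TEXT);
CREATE TABLE CreatorDimension (Creator_ID INTEGER PRIMARY KEY, Creator TEXT);
CREATE TABLE KeywordDimension (Keyword_ID INTEGER PRIMARY KEY, Keyword TEXT);
-- Central fact table: foreign keys to the dimensions plus the count measure.
CREATE TABLE Fact_Table (
    Creator_ID  INTEGER REFERENCES CreatorDimension(Creator_ID),
    Keyword_ID  INTEGER REFERENCES KeywordDimension(Keyword_ID),
    Document_ID TEXT    REFERENCES S(Document_ID),
    "count"     INTEGER DEFAULT 1,
    PRIMARY KEY (Creator_ID, Keyword_ID, Document_ID)
);
INSERT INTO S VALUES
    ('T1', '/docs/t1.txt'), ('T2', '/docs/t2.txt'), ('T3', '/docs/t3.txt');
INSERT INTO CreatorDimension VALUES (1, 'Alice'), (2, 'Bob');
INSERT INTO KeywordDimension VALUES (1, 'TV'), (2, 'Printer');
INSERT INTO Fact_Table (Creator_ID, Keyword_ID, Document_ID) VALUES
    (1, 1, 'T1'), (1, 2, 'T1'), (2, 1, 'T2'), (2, 2, 'T3');
""")

# Aggregate the default measure along the Keyword dimension.
rows = conn.execute("""
    SELECT k.Keyword, SUM(f."count")
    FROM Fact_Table AS f JOIN KeywordDimension AS k USING (Keyword_ID)
    GROUP BY k.Keyword
    ORDER BY k.Keyword
""").fetchall()
print(rows)
```

    Document T1 is indexed by both keywords, so it
    contributes one fact row to each keyword cell.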


    4.1. Dimensions

    As we have discussed in Section 2, dimensions can

    be distinguished into the following types:

    1. Ordinary dimension. A document can be highly

    summarized by a set of keywords. Therefore, we

    can construct an ordinary dimension containing a

    set of keywords to allow users to pinpoint the

    desired documents directly.

    2. Metadata dimension. That is, those elements de-

    fined in Dublin Core Metadata Element Set: title,

    creator, subject, description, publisher, contributor,

    date, type, format, identifier, source, language, re-

    lation, coverage, and rights, can all be regarded as

    metadata dimensions. Some of the dimensions

    might be hierarchies or simply related data.

    3. Category dimension. For example, a hierarchy such
    as Wordnet or its subset, or user-defined hierarchies,
    can be employed as category dimensions. Note that more
    than one category dimension may be used to construct a
    document cube, since a document can be assigned to
    several categories from various points of view.

    In Table 1, we have presented two representations
    for the dimension Region. Each representation can be
    easily obtained from the other by conversion. The
    structure of Table 1a is easier for people to
    understand, and that of Table 1b is more efficient for
    computer processing. For dimensional modeling, we
    assume each dimension Di is internally stored in the
    relation Di(Tag, Parent, Level, Keyword), conforming to
    the structure of Table 1b, such that the attribute Tag
    is the primary key and Di.Parent is a foreign key
    referencing Di.Tag. For example, the dimensions Product
    and Time in Fig. 4 can be represented as shown in
    Fig. 8.

    Under such circumstances, for a dimension D, D(0) is
    the whole relation, and the other ith level member sets
    D(i) can be easily obtained by the SQL statement
    “SELECT Tag, Keyword FROM D WHERE Level = i”, for
    1 ≤ i ≤ n.
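    The level query above can be tried out with Python's
    built-in sqlite3 module. The rows below follow the
    Product relation of Fig. 8; the helper name
    level_members and the added ORDER BY clause are
    assumptions made for the example.

```python
import sqlite3

# Rows of the Product dimension as (Tag, Parent, Level, Keyword), per Fig. 8.
PRODUCT_ROWS = [
    (1, 1, 1, "(All Product)"),
    (2, 1, 2, "Appliance"), (3, 1, 2, "Communication"), (4, 1, 2, "Computer"),
    (5, 2, 3, "TV"), (6, 2, 3, "Refrigerator"),
    (7, 3, 3, "Cellular Phone"), (8, 3, 3, "Radio"),
    (9, 4, 3, "Monitor"), (10, 4, 3, "Printer"),
]


def level_members(conn, level):
    """Return the member set D(i): all (Tag, Keyword) pairs on one level."""
    return conn.execute(
        "SELECT Tag, Keyword FROM Product WHERE Level = ? ORDER BY Tag",
        (level,),
    ).fetchall()


conn = sqlite3.connect(":memory:")
# Parent is a foreign key that references Tag of the same relation.
conn.execute(
    "CREATE TABLE Product (Tag INTEGER PRIMARY KEY, "
    "Parent INTEGER REFERENCES Product(Tag), Level INTEGER, Keyword TEXT)"
)
conn.executemany("INSERT INTO Product VALUES (?, ?, ?, ?)", PRODUCT_ROWS)
print(level_members(conn, 2))
```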

    4.2. The fact table

    In general, the central fact table may be composed

    of the following attributes:

    1. A composite key, which is composed of a set of

    foreign keys to the following dimensions:

    (a) Ordinary dimensions: For example, the dimen-

    sion Keyword shown in Fig. 7 is an ordinary

    dimension.

    Product                               Time
    Tag  Parent  Level  Keyword           Tag  Parent  Level  Keyword
    1    1       1      (All Product)     1    1       1      (All Time)
    2    1       2      Appliance         2    1       2      2003
    3    1       2      Communication     3    1       2      2004
    4    1       2      Computer          4    2       3      Q1, 2003
    5    2       3      TV                5    2       3      Q2, 2003
    6    2       3      Refrigerator      6    2       3      Q3, 2003
    7    3       3      Cellular Phone    7    2       3      Q4, 2003
    8    3       3      Radio             8    3       3      Q1, 2004
    9    4       3      Monitor           9    3       3      Q2, 2004
    10   4       3      Printer           10   3       3      Q3, 2004
                                          11   3       3      Q4, 2004

    Fig. 8. Dimensions product and time represented in relations.


    (b) Metadata dimensions: For example, the dimensions
    Title, Date, Creator, . . . , and Rights as shown in
    Fig. 7 are metadata dimensions.

    (c) Category dimensions. Note that no category
    dimensions are shown in Fig. 7.

    2. Attributes used to derive the measures in a docu-

    ment cube. The document count (i.e., the attribute

    count of fact table in Fig. 7) can be regarded as the

    default measure in a document cube. Another pos-

    sible measure has been defined in Ref. [26] as the

    weight of the term frequency of the corresponding

    keyword.

    3. A column Document_ID that represents the document
    identifier and is a foreign key linking to the relation
    S(Document_id, File_path) as shown in Fig. 7. That is,
    the set S in Fig. 7 can be regarded as a dimension
    containing all the document identifiers and the
    corresponding file paths, where Document_id is the
    primary key and File_path is used for storing the file
    path or URI of the documents.

    Therefore, the fact table can be stored in the
    relation Fact_Table(Document_id, D1_Tag, D2_Tag, . . . ,
    Di_Tag, . . . , Dn_Tag, Count), where Document_id is a
    foreign key referencing S.Document_id, and each Di_Tag
    is a foreign key matching the primary key Di.Tag of
    dimension Di. For each document T (with identifier
    $id_T$) in S, the document index $x = (id_T, K_T)$,
    where $K_T = (K_1, K_2, \ldots, K_i, \ldots, K_n)$, can
    be used to generate
    $\{id_T\} \times (K_1 \times K_2 \times \cdots \times K_n)$
    as the set of initial tuples in the Fact_Table.

    Note that the initial tuples generated by the above

    process may cause redundancies. We formulate this

    by the following definition.

    Definition 11. A document index defined on n dimensions
    $(D_1, D_2, \ldots, D_n)$, $x = (id_T, K_T)$, where
    $K_T = (K_1, K_2, \ldots, K_i, \ldots, K_n)$, is minimal
    if and only if, for any two keywords $k_x, k_y \in K_i$
    with $k_x \neq k_y$, there is no ancestry relationship
    between $k_x$ and $k_y$ in $D_i$, for all
    $1 \le i \le n$. That is, $k_x$ is not an ancestor of
    $k_y$ in $D_i$, and vice versa.

    For a non-minimal document index $x = (id_T, K_T)$,
    where $K_T = (K_1, K_2, \ldots, K_i, \ldots, K_n)$, we
    can always reduce $x$ into a minimal document index,
    denoted $\dot{x}$, by iteratively finding pairs of
    keywords $k_x$ and $k_y$ in $K_i$, for all
    $1 \le i \le n$, such that $k_y$ is a descendant of
    $k_x$, and then eliminating the ancestor $k_x$ and
    retaining the descendant $k_y$. Such a process can be
    denoted as $x \to \dot{x}$. It helps to reduce the
    storage cost of a document index without loss of
    indexing information, since the documents indexed by an
    ancestor can be recursively derived by rolling up the
    indexed documents from its lowest descendants along the
    corresponding dimension.

    Example 3. Suppose there is a non-minimal document
    index of T (with unique identifier A0001), defined on
    dimensions (R, P), $x$ = (A0001, ({South, Kaohsiung},
    {TV})). Then, after $x \to \dot{x}$, we obtain
    $\dot{x}$ = (A0001, ({Kaohsiung}, {TV})), since,
    according to the dimension R, the documents indexed by
    South can be derived from Tainan, Kaohsiung, and
    Pingtong.
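    The reduction of Example 3 and the generation of initial
    fact tuples can be sketched as follows. This is only an
    illustration under stated assumptions: dimensions are
    represented as child-to-parent dictionaries, keywords
    are used directly in place of tags, the root names and
    the fragments of dimensions R and P are partial and
    partly hypothetical, and the function names are not
    taken from the paper.

```python
from itertools import product

# Child -> parent maps for (assumed, partial) dimensions R and P.
REGION_PARENT = {
    "North": "(All Region)", "South": "(All Region)",
    "Tainan": "South", "Kaohsiung": "South", "Pingtong": "South",
}
PRODUCT_PARENT = {"Appliance": "(All Product)", "TV": "Appliance"}


def ancestors(keyword, parent):
    """Return the set of proper ancestors of a keyword in one dimension."""
    found = set()
    while keyword in parent:
        keyword = parent[keyword]
        found.add(keyword)
    return found


def minimize(keyword_sets, parents):
    """Drop every keyword that is an ancestor of another keyword in its K_i."""
    minimal = []
    for keywords, parent in zip(keyword_sets, parents):
        covered = set()
        for k in keywords:
            covered |= ancestors(k, parent)
        minimal.append({k for k in keywords if k not in covered})
    return minimal


def initial_tuples(doc_id, keyword_sets):
    """Cartesian product {id_T} x (K_1 x ... x K_n), sorted for stable output."""
    return sorted((doc_id, *combo) for combo in product(*keyword_sets))


doc_id, k_t = "A0001", [{"South", "Kaohsiung"}, {"TV"}]
reduced = minimize(k_t, [REGION_PARENT, PRODUCT_PARENT])
print(initial_tuples(doc_id, k_t))      # two tuples, one of them redundant
print(initial_tuples(doc_id, reduced))  # one tuple after minimization
```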

    4.3. The construction of document cubes

    The process of constructing a document cube is
    somewhat different from that of constructing a data
    cube. Although deriving a document index according to
    metadata and category dimensions is the same as in data
    cubes, generating a document index with respect to an
    ordinary dimension tends to be time-consuming. This is
    because computing a measure in a data cube is just
    numerical computation, whereas to compute a document
    index (defined in Definition 6), we have to scan the
    document content to match the keywords in every
    ordinary dimension. Therefore, it is necessary to
    develop another indexing structure to accommodate the
    document indices derived from ordinary dimensions. We
    have already proposed an indexing structure, called
    D-tree, to meet this objective in Refs. [35,36], where
    the performance evaluation of the indexing structure
    was also studied.

    5. Applications of document warehouses

    In this section, we present two applications of
    document warehouses. The first one is a document
    warehouse for organizing complaint e-mails to provide
    better customer relationship management. The other one
    is a document warehouse that indexes a set of journal
    papers falling in different categories and published in
    different journals on different dates.

    5.1. An application for customer relationship

    management

    Suppose there is a company manufacturing appliances,
    communication equipment, and computer peripherals, and
    it has established branches in the north and south
    regions. The objective is to warehouse customer
    complaint e-mails for customer relationship management.

    After modeling the document cube, we obtain two
    metadata dimensions (Creator and Date) and three
    ordinary dimensions (Region, Product, and Time) as
    shown in Fig. 9.

    We briefly describe these dimensions as follows:

    1. Ordinary dimension. The dimensions Region and
    Product are as shown in Figs. 2 and 10, respectively.
    The dimension Time is the purchase time described in
    the e-mail.

    2. Metadata dimension. The dimension Creator stores
    the e-mail addresses of customers, and the dimension
    Date stores the date on which each e-mail was received.

    3. Category dimension. There are no category dimensions
    shown in this example. However, the e-mail documents
    could be further categorized either manually or
    automatically by software tools into hierarchical
    categories.

    The fact table is composed of the following
    attributes:

    1. A composite key, which is composed of a set of
    foreign keys to the aforementioned dimensions.

    2. The attribute count, which is regarded as the
    default measure in this document cube.

    3. A column Document_ID, which serves as a foreign key
    referencing the relation S(Document_ID, file_path).

    Fig. 9. An example star schema for complaint e-mail
    management. (The fact table, with attributes Region_ID
    (FK), Creator_ID (FK), Date_ID (FK), Product_ID (FK),
    Time_ID (FK), Document_ID, and count (Measure), is
    linked to the Creator, Date, Product, Region, and Time
    dimensions and to the document set S containing T1, T2,
    and T3.)

    Fig. 10. A concise illustration of dimension P.
    (Level 1: (All Product); Level 2: Appliance,
    Communication, and Computer; Level 3: TV, Refrigerator,
    DVD Player, Cellular Phone, Radio, Laptop Computer, and
    Desktop Computer. The nodes carry the tags 1 to 11.)


    After constructing the document cube, we can perform
    on-line analytical processing on the obtained document
    cube as illustrated in Fig. 11. Note that each count
    shown in Fig. 11 is actually a hyperlink to a page
    containing the original e-mails.
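    A roll-up over such a document cube can be sketched as
    follows. The cell counts and the grouping of products
    under Appliance, Computer, and Communication are taken
    from Fig. 11 as we read it (2004/Quarter 1); the function
    name roll_up is hypothetical, and simply summing the
    counts assumes that each e-mail is indexed under exactly
    one product, since otherwise an e-mail would be counted
    more than once at the parent level.

```python
from collections import defaultdict

# Level-3 product -> level-2 parent, following the grouping in Fig. 11.
PARENT = {
    "TV": "Appliance", "Refrigerator": "Appliance", "DVD Player": "Appliance",
    "Laptop Computer": "Computer", "Desktop Computer": "Computer",
    "Radio": "Communication", "Cellular Phone": "Communication",
}

# (product, region) -> document count for 2004/Quarter 1.
CELLS = {
    ("TV", "North"): 5, ("TV", "South"): 2,
    ("Refrigerator", "North"): 3, ("Refrigerator", "South"): 7,
    ("DVD Player", "North"): 7, ("DVD Player", "South"): 9,
    ("Laptop Computer", "North"): 8, ("Laptop Computer", "South"): 1,
    ("Desktop Computer", "North"): 2, ("Desktop Computer", "South"): 3,
    ("Radio", "North"): 6, ("Radio", "South"): 1,
    ("Cellular Phone", "North"): 5, ("Cellular Phone", "South"): 7,
}


def roll_up(cells, parent):
    """Aggregate the counts one level up along the Product dimension."""
    totals = defaultdict(int)
    for (item, region), count in cells.items():
        totals[(parent[item], region)] += count
    return dict(totals)


rolled = roll_up(CELLS, PARENT)
for key in sorted(rolled):
    print(key, rolled[key])
```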

    5.2. An application for journal paper warehousing

    Suppose there is a university laboratory that intends
    to warehouse research journal papers according to some
    predefined categories.

    After modeling the document cube, we establish one
    category dimension, Category, and two metadata
    dimensions, Source and Times, as shown in Fig. 12,
    where Category represents the predefined categories,
    Source stores the journal names of the selected papers,
    and Times records the publication dates.

    We briefly describe these dimensions as follows:

    1. Ordinary dimension. There is no ordinary dimension
    in this example. However, users may add a dimension
    containing all of the keywords in the selected journal
    papers, and organize the keywords into hierarchies,
    either manually or automatically by some text
    processing tools.

    2. Metadata dimension. The dimension Source stores the
    journal names, which are grouped by publishing
    community (e.g., ACM, IEEE, Elsevier, and Kluwer). The
    dimension Times stores the date of publishing, which is
    organized according to the Year–Month hierarchy.

    3. Category dimension. The dimension Category stores
    the predefined categories, i.e., computer-aided
    engineering, data mining and knowledge discovery, data
    model, data structures, data warehousing, database
    management, . . ., and so forth. These predefined
    categories are all organized into the same level to
    simplify the illustration. Note that all papers are
    multi-categorized into these categories. That is, a
    paper may fall into two or more categories.

    The fact table is composed of the following
    attributes:

    1. A composite key, which is composed of a set of
    foreign keys to the aforementioned dimensions.

    2. The attribute count, which is regarded as the
    default measure in this document cube.

    3. A column Document_ID, which serves as a foreign key
    referencing the relation S(Document_ID, file_path).

    2004/Quarter 1                         Region
    Product                                North  South
    Appliance        TV                    5      2
                     Refrigerator          3      7
                     DVD Player            7      9
    Computer         Laptop Computer       8      1
                     Desktop Computer      2      3
    Communication    Radio                 6      1
                     Cellular Phone        5      7

    Fig. 11. On-line analytical processing over the example document cube.

    Fig. 12. An example star schema for journal paper
    warehousing. (The fact table, with attributes
    Category_ID (FK), Source_ID (FK), Times_ID (FK),
    Document_ID, and count (Measure), is linked to the
    Category, Source, and Time dimensions and to the
    document set S containing T1 to T5.)

    Fig. 13. On-line analytical processing over the example
    document cube. (Screenshot of the query interface with
    column and row axes for Source and Category, the Times
    dimension sliced at 2004, and Submit Query and Reset
    buttons.)


    To prove the concept proposed in this paper, we have
    implemented this example to allow users to perform
    on-line analytical processing on the obtained document
    cube, as illustrated in Fig. 13. The dimension Times is
    used for slicing the document cube. Note that each
    count shown in Fig. 13 is actually a hyperlink to a Web
    page containing the detailed paper listing, as shown in
    Fig. 14 (obtained by slicing the Times dimension by
    ‘2004’ and clicking the count at the intersection of
    ‘Elsevier’ and ‘Data Mining and Knowledge Discovery’).
    If all of the paper files remain available on the
    publishers’ Web sites over a long period, then this
    document warehouse does not need to copy and store the
    physical files locally. That is, such a document
    warehouse effectively saves storage space and provides
    very fast document access without degradation in
    performance even as the size of the warehouse grows.
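    The slicing and drill-through behavior described above
    can be sketched as follows. The paper records,
    identifiers, and category assignments are hypothetical,
    and the function name slice_by_year is an assumption;
    the sketch only illustrates how a multi-categorized
    paper contributes to several cells and how each cell
    keeps the list of papers behind its count.

```python
from collections import defaultdict

# Hypothetical papers; a paper may fall into two or more categories.
PAPERS = [
    {"id": "P1", "source": "Elsevier", "year": 2004,
     "categories": ["Data Mining and Knowledge Discovery", "Data Warehousing"]},
    {"id": "P2", "source": "Elsevier", "year": 2004,
     "categories": ["Data Mining and Knowledge Discovery"]},
    {"id": "P3", "source": "IEEE", "year": 2004,
     "categories": ["Data Warehousing"]},
    {"id": "P4", "source": "Elsevier", "year": 2003,
     "categories": ["Data Mining and Knowledge Discovery"]},
]


def slice_by_year(papers, year):
    """Map each (source, category) cell to the ids of the papers it counts."""
    cells = defaultdict(list)
    for paper in papers:
        if paper["year"] == year:
            for category in paper["categories"]:
                cells[(paper["source"], category)].append(paper["id"])
    return dict(cells)


cells = slice_by_year(PAPERS, 2004)
cell = ("Elsevier", "Data Mining and Knowledge Discovery")
# The count is the cell size; the id list plays the role of the hyperlink target.
print(len(cells[cell]), cells[cell])
```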

    Fig. 14. The detailed paper listing after the count in
    the intersection of ‘Elsevier’ and ‘Data Mining and
    Knowledge Discovery’ was clicked.

    6. Conclusion and future directions

    6.1. Conclusion

    While data warehouses and numeric-centric business
    intelligence technologies have served most enterprises
    well, they do not address the complete scope of
    business intelligence. In this paper, we advocate the
    importance of constructing document warehouses to
    support text-centric business intelligence, and propose
    an architecture for document warehousing. When
    documents are warehoused, users can perform ad hoc
    on-line analytical processing (OLAP) over text in a
    document warehouse, just as they can perform OLAP over
    summarized data in a data warehouse.

    Document warehousing not only provides very fast
    document access without degradation in performance even
    as the size of the warehouse grows, but also offers a
    set of versatile applications for content management of
    enterprise business intelligence. In business, document
    warehousing can help administrators organize meeting
    reports, gazettes, or even customer complaint e-mails,
    where the company personnel, products, and time may be
    regarded as the dimensions, such that documents related
    to particular employees or products, at a particular
    time or place, can be retrieved or browsed instantly.
    In recent years, we have seen most data warehouse
    applications applied in Customer Relationship
    Management (CRM), a promising trend in business
    affairs. However, a data warehouse only supports
    numeric analyses of customer behaviors. To find out why
    customers buy (or do not buy) some products, we need to
    establish a document warehouse. With data warehousing,
    users can clearly see business phenomena regarding who,
    what, when, where, and which. Nevertheless, to discover
    why the phenomena occurred, a document warehouse should
    be employed [33].

    When documents are warehoused, the task of version
    control becomes much easier, since users can directly
    trace the documents based on some criteria along the
    time dimension. Such merits also make document
    warehousing an attractive organization for on-line
    topic detection and event tracking [1] of news.
    Besides, document clustering can be achieved directly
    via visualizations. Users can also develop document
    summarization tools [9,16,30] to summarize a cluster of
    related documents. To sum up, data warehousing and
    document warehousing are not only among the most
    important infrastructures of knowledge management, but
    also the kernel of customer relationship management.
    They are used for organizing formatted data and
    documents, respectively, on a multi-dimensional basis.
    We compare their similarities and differences in
    Table 3.

    Table 3
    A comparison between document warehousing and data warehousing

    Similarities
    1. Both have the same construction process. We may employ a star
       schema or a snowflake schema [22] to design the modeling process.
    2. Both gather business documents/data from heterogeneous resources.
    3. Users can do on-line analytical processing over the established
       result.

    Differences
        Document warehousing                    Data warehousing
    1.  Intends to obtain text-oriented         Intends to obtain numeric-oriented
        business intelligence.                  business intelligence.
    2.  Resources gathered from market          Resources gathered from internal
        survey reports, project status          databases of POS (point-of-sale)
        reports, meeting records, customer      systems, ERP (enterprise resource
        complaints, e-mails, patent             planning) systems, accounting
        application sheets, and                 systems, or financial management
        advertisements of competitors.          systems.
    3.  It filters out unnecessary              It aggregates numerical data
        documents and intends to help users     according to various dimensions, and
        to address problems regarding why.      intends to help users to address
                                                problems regarding who, what, when,
                                                where, and which.
    4.  Enriched with text mining               Enriched with data mining techniques
        techniques to summarize documents       to summarize, classify, cluster
        or categorize documents.                formatted data or find the
                                                associations.
    5.  Document sources should be              Data sources can be integrated in
        integrated in file systems, or          relational databases.
        native XML databases [6,7].

    6.2. Future works

    In our future work, we will develop more techniques
    for document warehousing. The preliminary components
    may include the following modules.

    1. Employ XML Schema [43] to define document metadata.
    We advocate using the Extensible Markup Language (XML)
    as the intermediate medium for document interchange.

    2. Incorporate automatic text summarization [12,14,23],
    key feature extraction [10], or even document
    classification and categorization [3] techniques for
    document warehousing. Develop related text
    summarization techniques to extract the most important
    10–20% of the content so that users can digest the
    documents more easily, and propose how to bind a
    document summary with its corresponding documents for
    document warehousing.

    3. Develop automatic document metadata decomposition
    and the mechanisms for storing the obtained metadata
    into native XML or XML-enabled databases [5–7]. This
    helps users manage document warehouses more
    efficiently.
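    To illustrate the first future-work module (XML as the
    intermediate medium for document metadata), the
    following sketch serializes one metadata record with
    Python's standard xml.etree.ElementTree module. It
    produces plain XML rather than an XML Schema definition;
    the root element name document, the function name
    metadata_to_xml, and the sample record are assumptions,
    while the namespace URI is the one published for the
    Dublin Core element set.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
DC_ELEMENTS = {
    "Title", "Creator", "Subject", "Description", "Publisher",
    "Contributor", "Date", "Type", "Format", "Identifier",
    "Source", "Language", "Relation", "Coverage", "Rights",
}


def metadata_to_xml(record):
    """Serialize one metadata record; Dublin Core fields get the dc: prefix."""
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("document")  # hypothetical root element name
    for name, value in record.items():
        # Extra attributes such as File_Path stay outside the dc namespace.
        tag = f"{{{DC_NS}}}{name.lower()}" if name in DC_ELEMENTS else name
        ET.SubElement(root, tag).text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample record.
record = {
    "Title": "Complaint about TV",
    "Creator": "customer@example.com",
    "File_Path": "/mail/A0001.eml",
}
xml_text = metadata_to_xml(record)
print(xml_text)
```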


    Besides, although the dimension concepts defined in
    this paper are organized into hierarchical structures,
    we assume that, when scanning a document, the system
    ignores the hierarchical relationships among keywords
    in the document. Based on this work, we wish to
    incorporate natural language processing technologies to
    enhance the linguistic analysis and annotation results
    of document parsing, and to elaborate the work of
    adopting domain-specific ontologies [8,11,25,32] with
    more refined concepts to be built into the
    corresponding dimensions of a document cube.
    Ontological analysis can help clarify the structure of
    knowledge regarding a set of related documents. Given a
    set of related documents corresponding to a specific
    domain, the ontology forms the semantic heart of any
    system of knowledge representation, and their document
    cube forms the syntactic centroid of any system of
    concept organization.

    Finally, since the construction of a document
    warehouse has to scan a large number of documents,
    which tends to be time-consuming, a parallel
    architecture for such a process will be investigated in
    the future.

    References

    [1] J. Allan, R. Pepka, V. Lavrenko, On-line new event detection

    and tracking, Proceedings of the 21st Annual International

    ACM SIGIR Conference on Research, 1998, pp. 37–45.

    [2] S. Anahory, D. Murray, Data Warehousing in the Real World:

    A Practical Guide for Building Decision Support Systems,

    Addison-Wesley Longman, Harlow, England, 1997.

    [3] A. Appiani, F. Cesarini, A. Colla, M. Diligenti, M. Gori, S.

    Marinai, G. Soda, Automatic document classification and

    indexing in high-volume applications, International Journal

    on Document Analysis and Recognition 4 (2) (2002) 69–83.

    [4] M.J.A. Berry, G. Linoff, Data Mining Techniques: For Mar-

    keting, Sales, and Customer Support, John Wiley & Sons,

    New York, 1997.

    [5] E. Bertino, B. Catania, Integrating XML and databases, IEEE

    Internet Computing 5 (4) (2001) 84–88.

    [6] E. Bertino, E. Ferrari, XML and database integration, IEEE

    Internet Computing 5 (6) (2001) 75–76.

    [7] M. Champion, Native XML vs. XML-Enabled: the Difference

    Makes a Difference, Software AG: The XML Company,

    http://www.softwareag.com/xml/library/champion_nativexml.htm.

    [8] B. Chandrasekaran, J.R. Josephson, V.R. Benjamins, What are

    ontologies, and why do we need them? IEEE Intelligent

    Systems 14 (1) (1999 Jan./Feb.) 20–26.

    [9] H.H. Chen, S.J. Huang, A summarization system for Chinese

    news from multiple sources, Proceedings of the 4th Interna-

    tional Workshop on Information Retrieval with Asia Lan-

    guage, 1999, pp. 1–7.

    [10] F.F. Feng, W.B. Croft, Probabilistic techniques for phrase

    extraction, Information Processing & Management 37 (2)

    (2001 Mar.) 199–220.

    [11] N. Fridman, C.D. Hafner, The state of the art in ontology

    design, AI Magazine 18 (3) (1997) 53–74.

    [12] J. Goldstein, M. Kantrowitz, V. Mittal, J. Carbonell, Summa-

    rizing text documents: sentence selection and evaluation

    metrics, Proceedings of SIGIR, 1999, pp. 121–128.

    [13] J. Goldstein, V.O. Mittal, J.G. Carbonell, J.P. Callan, Creating

    and evaluating multi-document sentence extract summaries,

    Proceedings of the 9th International Conference on Informa-

    tion and Knowledge Management, 2000, pp. 165–172.

    [14] M. Grigsby, The Internet Document Warehouse: Content

    Management for the Back Office, Technical Report,

    IMERGE Consulting, Inc., 2001.

    http://www.imergeportal.com/publishedarticles.asp.

    [15] R. Hackathorn, Data warehousing energizes your enterprise,

    Datamation 1 (1995 Feb.) 38–42.

    [16] U. Hahn, I. Mani, The challenges of automatic summarization,

    IEEE Computer 33 (11) (2000 Nov.) 29–36.

    [17] J. Han, M. Kamber, Data Mining: Concepts and Techniques,

    Morgan Kaufmann Publishers, 2001.

    [18] W.H. Inmon, Building the Data Warehouse, John Wiley and

    Sons, New York, NY, 1993.

    [19] H. Ishikawa, K. Kubota, Y. Noguchi, K. Kato, M. Ono, N.

    Yoshizawa, A. Kanaya, A document warehouse: a multimedia

    database approach, Proceedings of the IEEE 9th International

    Workshop on Database and Expert Systems Applications

    (DEXA’98) Vienna, Austria, Aug. 26–28, 1998, pp. 90–94.

    [20] H. Ishikawa, K. Kubota, Y. Noguchi, K. Kato, M. Ono, N.

    Yoshizawa, Y. Kanemasa, Document warehousing based on a

    multimedia database system, IEEE International Conference

    on Data Engineering, 1999, pp. 168–173.

    [21] H. Ishikawa, M. Ohta, K. Kato, Document warehousing: a

    document-intensive application of a multimedia database, Pro-

    ceedings of the IEEE 11th International Workshop on Re-

    search Issues in Data Engineering, Heidelberg, Germany,

    April 01–02, 2001, pp. 25–31.

    [22] R. Kimball, The Data Warehouse Toolkit: Practical Techniques

    for Building Dimensional Data Warehouses, John Wiley &

    Sons, Inc., 1996.

    [23] K. Knight, Mining online text, Communications of the ACM

    42 (11) (1999).

    [24] S.-H. Lin, C.-S. Shih, M.C. Chen, J.-M. Ho, M.-T. Ko, Y.-M.

    Huang, Extracting classification knowledge of Internet docu-

    ments with mining term associations: a semantic approach,

    Proceedings of ACM SIGIR Conference on Research and

    Development in Information Retrieval, 1998, pp. 241–249.

    [25] S. Loh, L.K. Wives, J.P. de Oliverira, Concept-based knowl-

    edge discovery in texts extracted from the web, SIGKDD

    Explorations 2 (1) (2000 Jun.)1998.

    [26] M.C. McCabe, J. Lee, A. Chowdhury, D. Grossman, O. Frieder,

    On the design and evaluation of a multi-dimensional approach

    to information retrieval, Proceedings of the 23rd Annual

    International ACM SIGIR Conference, 2000, pp. 363–365.



    [27] G.A. Miller, Wordnet: an online lexical database, International

    Journal of Lexicography 3 (4) (1990) 235–312.

    [28] G. Salton, Automatic Text Processing, Addison-Wesley Pub-

    lishing Company, 1988.

    [29] G. Salton, M. Gill, Introduction to Modern Information Re-

    trieval, McGraw-Hill, 1983.

    [30] S. Sekine, C. Nobata, A Survey of Multi-Document Summa-

    rization, Proceedings HLT-NAACL Text Summarization

    Workshop and Document Understanding Conference (DUC

    2003), pp. 65–72.

    [31] A. Singhal, C. Buckley, M. Mitra, Pivoted document length

    normalization, Proceedings of the 19th Annual International

    ACM SIGIR Conference, 1996, pp. 21–29.

    [32] V. Sugumaran, V.C. Storey, Ontologies for conceptual model-

    ing: their creation, use, and management, Data and Knowledge

    Engineering 42 (2002) 251–271.

    [33] D. Sullivan, Document Warehousing and Text Mining:

    Techniques for Improving Business Operations, Marketing and

    Sales, John Wiley & Sons, Inc., 2001.

    [34] Ah-Hwee Tan, Text mining: the state of the art and the

    challenges, Proceedings of the PAKDD 99—Workshop on

    Knowledge Discovery from Advanced Databases, Beijing,

    1999, pp. 50–70.

    [35] F.S.C. Tseng, W.P. Lin, A study on indexing structure and its

    properties for constructing document warehouses, Proceedings

    of the 20th Workshop on Combinatorial Mathematics and

    Computation Theory, Taiwan, 2003 (Aug.), pp. 18–27.

    [36] F.S.C. Tseng, W.P. Lin, D-Tree: A Multi-Dimensional

    Indexing Structure for Constructing Document Warehouses,

    Journal of Information Science and Engineering, in press.

    [37] F.S.C. Tseng, Design of a Multi-Dimensional Query

    Expression for Document Warehouses, Information Sciences,

    accepted and in press.

    [38] http://dublincore.org/, Dublin Core Metadata Initiative.

    [39] http://msdn.microsoft.com/library/default.asp?url=/library/en-us/stg/stg/structured_storage_start_page.asp, Structured Storage.

    [40] http://otn.oracle.com/products/text/x/tech_Overviews/imt_817.html, Oracle Corporation. InterMedia Text 8.1.6.

    [41] http://www.globalwordnet.org, The GlobalWordnet Association.

    [42] http://www.survey.com, “Development Snapshot: Warehouse Data of the Future,” Application Development Trends, Feb. 2000.

    [43] http://www.w3.org/XML/schema.

    [44] http://www-3.ibm.com/software/data/iminer/fortext, IBM

    Intelligent Miner for Text: Text Analysis Tools version 2.10.0.

    Frank S.C. Tseng received his B.S., M.S. and Ph.D. degrees, all in

    computer science and information engineering from National Chiao

    Tung University, Taiwan, ROC, in 1986, 1988, and 1992, respectively.

    He was one of the winners of the Acer Long Term Ph.D. dissertation

    prize in 1992. From 1993 to 1995, he served in the military at

    the General Headquarters of ROC Air Force. He joined the faculty

    of the Department of Information Management, Yuan-Ze University,

    Taiwan, ROC, in August 1995. From 1996 to 1997, he was the

    chairman of the department. He is currently with the Department of

    Information Management, National Kaohsiung First University of

    Science and Technology, as an associate professor. His research

    interests include heterogeneous database systems, XML technologies

    for Internet computing, data warehousing, data mining, and

    document warehousing. He has published extensively in journals

    such as the VLDB Journal, IEEE Transactions on Knowledge and

    Data Engineering, Data and Knowledge Engineering, Journal of

    Systems and Software, Distributed and Parallel Databases: An

    International Journal, Journal of Information Science, Information

    Sciences, and Journal of Information Science and Engineering.

    Dr. Tseng is a member of the IEEE Computer Society and the

    Association for Computing Machinery. He was listed in Marquis

    Who’s Who in Medicine and Healthcare in May 2004.

    Annie Y.H. Chou received her B.S. degree in applied mathematics,

    M.S. degree in computer science and information engineering, and

    Ph.D. degree in computer and information science, all from National

    Chiao Tung University, Taiwan, ROC, in 1987, 1989, and 1996,

    respectively. From 1989 to 1992, she was an assistant researcher at

    Chunghua Telecom Laboratories, Taiwan, ROC. She joined the

    faculty of the Department of Computer and Information Science,

    Chinese Military Academy, in August 1997. She is presently the

    chairman of the department. Her research interests include

    mathematical analysis of computer algorithms, file organization design,

    data warehousing and data mining, and Internet computing. She was

    a member of the Phi Tau Phi Scholastic Honor Society. Dr. Chou

    has had papers published in the Computer Journal and Journal of

    Information Science and Engineering.
