Text Mining –

An Examination of Text Mining Software for Tri-State Times

Sujan Manandhar

Virginia Dressler

Information Science

Dr. J. Holmes

Mar 19, 2007


Introduction

With an estimated 80 percent or more of Web content existing as unedited, unstructured text (Haravu & Neelameghan, 2003, p. 100), the difficulty of retrieving relevant information effectively, particularly from massive amounts of data, is increasingly evident. This can be seen in queries to commercial search engines, which often yield high recall but low precision. Commercial search engines also rely on indexing and ranking mechanisms that are hidden from the user yet strongly shape the results of any search through their selection of terms and ordering of documents. The larger the number of recalled documents, the more time and energy the user must spend separating meaningful information from the irrelevant. As a result, the demand for high-quality information retrieval methods has become a growing focus of software development.

Text mining is in a relatively early phase and, as such, is still limited in application and use, yet it often yields beneficial results. In this paper, the authors look at text mining as a solution to one of the problems of information retrieval. Text mining can be defined as the discovery of new, previously unknown information by automatically extracting information from processed text sources. In addition, text mining is capable of linking extracted information together to form new facts or new hypotheses to be explored further by more conventional means of experimentation (Hearst, 2003). Text mining can also be used to uncover a "narrative in an unstructured mass of text" (Haravu & Neelameghan, p. 103) and to show how a particular environment, such as a defined market or business, is evolving.

Concepts across texts can be linked together through a number of natural language processing methods, such as a corresponding thesaurus, a glossary, pre-structured subject representations, or the schedule of a classification scheme. Depending on the application, these are either preformatted components of a software package or can be manipulated by the end user. For this study, the authors chose relatively basic applications of text mining in order to study the results of a query with respect to the decisions and considerations of a test group.

The authors conducted a study in which they evaluated different text mining software for the information organization needs of the Tri-State University newspaper. The study compares various text mining systems on their functionality and ease of use, as well as their fit with the small budget and skill set of the users.

Problem Statement

When different text mining software is applied to the same set of documents, the same query will yield different results. Differences in the algorithms and weighting schemas have the greatest impact on the retrieved data, though the assessment and evaluation by the end user should also be considered an influencing factor in the overall effectiveness and relevance of the retrieved information. Additionally, a comparison of the results will examine the varying degrees of user satisfaction, indicating a strong degree of subjectivity and differences in how users relate to information and language.

Extent of the Study

This study looks at only a small set of simple applications in order to provide a cursory example of the limitations and benefits of text mining. By using a small set of documents from a single source, the authors hope to constrain the problem enough to examine the effectiveness of the retrieved information.


We believe that future research will be needed to explore the topic further, with larger case studies and with attention to how text mining can be implemented in settings and uses beyond the medical, technical, and business domains. A review and study of the major applications of text mining would also be useful at this point. Since text mining applications are still at an early stage, we feel it is important to observe their current limitations and to learn from existing applications. The possibility of creating links across increasingly massive banks of textual information will be a major topic of study as our dependence on, and investment in, technology increases.

Background

Text mining can mean different things depending on the method and purpose. Sharp (2001) aptly says, "text mining per se is new and is still defining itself." Yeates (2002) says that text mining discovers patterns in natural language text and is the process of analyzing text to extract information from it for particular purposes. Natarajan (2005) describes text mining as intelligent text analysis that discovers previously unknown links and relationships between sets of documents, including non-trivial information. Whatever the context or definition, the fundamental aspect of the computational process involved in text mining is the detection of patterns.

The applications of text mining date to the mid-1990s, with IBM's release of the first software package, Intelligent Text Miner, in 1998. The recognition that information is raw material for an organization has shaped how software designers approach this aspect of information retrieval (do Prado, Oliveira, et al., 2004, p. 225). The considerations of the user have also become a larger part of software design, particularly with respect to natural language processing capabilities. This is especially valuable in situations where controlled (or structured) data is lacking, such as Web content, e-mails, and other informal documents. Even a simple information retrieval task is often complicated by factors inherent to unstructured data, such as variations in spelling.

Natarajan (2005) cites five requirements for high-quality text mining. These desirable conditions and factors for information retrieval include: comprehensiveness of the information, quality of the knowledge base, a high-quality method of information retrieval, techniques and protocols of information extraction (such as the implementation of an internal thesaurus), and technical expertise (p. 33). Interestingly, library classification and cataloguing methods have often been used as reference points in many natural language processing applications.

Literature Review

The topic of text mining is still relatively new, and most of the literature on the subject tends to be geared towards practical applications (business, financial, medical, etc.). But as far back as the 1950s, Luhn (1958), in a seminal paper on automatic abstracting, pointed out "the resolving power of significant words" in primary text. Doyle (1961) hints at early text mining, saying that "natural characterization and organization of information can come from analysis of frequencies and distributions of words in libraries." Swanson (1988_1), a major force behind text mining, examined scientific literature as a natural phenomenon worthy of "exploration, correlation, and synthesis."

Sullivan (2004) discusses text mining within a business model, outlining the constructs of such software and the impact and benefits of text mining in a practical scenario. Sullivan also notes the limitations, as of 2004, in the state of NLP and its error rates. He concludes that the pre-existing (preformatted) processing schemas and components in many text mining applications are inappropriate for most queries. More effective are methods that allow the user to identify relations and categories within a sample set of documents ("supervised learning," p. 102). In turn, algorithms are created from these decisions and choices of related documents.

do Prado (2004) applies the CRISP-DM methodological approach to a case study. A set of 57,000 documents from a news agency was gathered in 2001 to study the relationships between the text and defined groupings of information (types of news). Clusters of text connected otherwise unrelated items, indicating a method of knowledge discovery. Seven main clustering schemas, both hierarchical and non-hierarchical, were found in this group of documents. A conceptual approach to the search yielded better results (p. 224). Sharp (2001) looks at several examples of text mining and natural language processing (NLP). He also traces the role of machine learning in text mining and how it can play a pivotal part in its development. He highlights some of the main features of text mining along with some of the seminal works related to it.

Feldman and Sanger (2006) were among the first to encapsulate the whole topic of text mining in a book, breaking down the core principles as well as examining probabilistic models. A selection of existing applications in specific fields of interest, mainly business, technical, and medical settings, is also studied.

Major Concepts in Text Mining

Before getting into the nuts and bolts of text mining, it is important to know some of the key concepts that lead up to it. Expounding on all the concepts related to text mining would not be possible within the scope of this paper, so the authors explain some of the important terms that this study draws on: natural language processing, knowledge discovery, and data mining.

Knowledge Discovery

Knowledge discovery is the process of finding novel, interesting, and useful patterns in data.

Data mining is a subset of knowledge discovery. This method allows the data to suggest new

hypotheses to test (Purple Insight, n.d.). James M. Caruthers makes an interesting analogy and

says "Instead of mining for a nugget of gold, knowledge discovery is more like sifting through a

warehouse filled with small gears, levers, etc., none of which is particularly valuable by itself.

After appropriate assembly, however, a Rolex watch emerges from the disparate parts." (Venere,

2004)

In his seminal work, Swanson (1988_2) showed that it was possible to discover new

knowledge from existing literature by linking the information present in complementary but

disconnected articles. Smalheiser and Swanson (1998) postulated a number of new biomedical

hypotheses, which were later verified by domain experts. They developed two approaches,

known as Open and Closed discovery, for generating new hypotheses. However, their research

required substantial manual input. Since then, a number of efforts have aimed to automate the

discovery process.

Natural Language Processing

Natural language processing (NLP) is a major component of text mining. NLP is the branch of linguistics that deals with computational models of language. Sharp (2001) says that NLP can differentiate how words are used, for example through sentence parsing and part-of-speech tagging, and in the process add discriminatory power to statistical text analysis. He argues that NLP could be a powerful tool for text mining.
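
To make part-of-speech tagging concrete, the short sketch below shows how a sentence can be tokenized and tagged with the open-source NLTK toolkit. NLTK and the sample sentence are illustrative assumptions on our part; they are not tools or data used in this study.

```python
# A minimal part-of-speech tagging sketch using NLTK (an illustrative choice,
# not a component of any system evaluated in this paper).
import nltk

# One-time downloads of the tokenizer and tagger models.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentence = "The editorial team selects major news articles every morning."
tokens = nltk.word_tokenize(sentence)   # split the sentence into word tokens
tagged = nltk.pos_tag(tokens)           # attach a part-of-speech tag to each token

print(tagged)
# e.g. [('The', 'DT'), ('editorial', 'JJ'), ('team', 'NN'), ('selects', 'VBZ'), ...]
```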

Natural language has evolved to help humans communicate with one another and to record information. Computers are still incapable of understanding natural language, though language processing mechanisms that attempt to make meaningful and relevant connections between words continue to be an area of interest and study. Humans are able to recognize and apply linguistic patterns to text, overcoming obstacles such as slang, spelling variations, and contextual meaning, whereas computers do not handle these easily. Conversely, although our language capabilities allow us to comprehend unstructured data, we lack the computer's ability to process text in large volumes at high speeds. The key to text mining is creating technology that combines a human's linguistic capabilities with the speed and accuracy of a computer (Fan, Wallace, et al., 2005).

The complete understanding of natural language text is difficult to attain. Text mining

focuses on extracting a small amount of information from text with high reliability. (Yeates,

2002) Natural language is ambiguous and the same keyword may express entirely different

meanings, e.g. “Washington” may be a person or a place. Such ambiguity is normally resolved

through context. The inverse problem is that different expressions may refer to the same

meaning, e.g. “car” and “automobile”. From these two problems, it is easy to rule out the surface

expression of the keywords alone as a proper representation for text mining. (Chibelushi, Sharp,

& Salter, n.d.)

Moreover, because of the centrality of natural language text to its mission, text mining


also draws on advances made in other computer science disciplines concerned with the handling

of natural language. Perhaps most notably, text mining exploits techniques and methodologies

from the areas of information retrieval, information extraction, and corpus-based computational

linguistics (Feldman & Sanger, 2006).

One example of applying a large-scale natural language processing database to practical search methods is the WordNet project (Stevenson, 2003, p. 39). A cognitive psychologist drew on research into the structure of the human mental lexicon and attempted to construct a resource that would mirror this structure. A basic block of terms was created in the research, and over half of these terms were found to have a rather large number of synonyms, which were tiered into hierarchical chains. For example, the term 'canary' was related to about 20 other terms, the closest being 'finch' and the furthest (but still deemed related) 'entity' (p. 40). Applied to text mining, all relevant synonyms would in effect be used with a query. There are, again, a number of thesaurus and dictionary programs, of varying quality, that can become part of the natural language processing. The quality of these programs affects the search, since too many synonyms result in higher recall of similar terms that are less likely to be relevant.
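
As a rough illustration of this kind of synonym expansion, the sketch below uses NLTK's WordNet interface to collect synonyms and hypernym-chain terms for a query word. The expansion strategy and the example term are assumptions made for demonstration; this is not the procedure Stevenson describes.

```python
# Query-expansion sketch built on NLTK's WordNet interface (illustrative only;
# the expansion strategy is an assumption, not the cited methodology).
import nltk
nltk.download("wordnet", quiet=True)
from nltk.corpus import wordnet as wn

def expand_query(term, max_terms=20):
    """Collect synonyms and hypernym-chain terms related to a noun."""
    related = set()
    for synset in wn.synsets(term, pos=wn.NOUN):
        # Direct synonyms: other lemma names in the same synset.
        related.update(lemma.name() for lemma in synset.lemmas())
        # Walk up the hypernym chain ('canary' -> 'finch' -> ... -> 'entity').
        for chain in synset.hypernym_paths():
            related.update(node.name().split(".")[0] for node in chain)
    related.discard(term)
    return sorted(related)[:max_terms]

print(expand_query("canary"))
```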

Technological advances are, however, beginning to close the gap between human and computer languages. The field of natural language processing has produced technologies that teach computers natural language, enabling them to analyze, understand, and even generate text (Fan, Wallace, et al., 2005). As more time and research are devoted to this facet of text mining, the possibilities for relevant, meaningful results will grow.


Data Mining

Text mining has its roots in data mining and consequently has many similar features.

Like data mining, text mining seeks to extract useful information from data sources through the

identification and exploration of interesting patterns. But unlike data mining, in text mining the

data sources are created from defined and processed document collections. Interesting patterns

are found not among formalized database records but in the unstructured textual data in the

documents in these collections (Feldman & Sanger, 2006). In addition, text mining and data mining systems share a similar high-level architecture, including preprocessing routines, pattern-discovery algorithms, and presentation-layer elements such as visualization tools that enhance the browsing of answer sets.

Text mining adopts many of the specific types of patterns in its core knowledge discovery

operations that were first introduced in data mining research (Feldman & Sanger, 2006). While

data mining mostly deals with structured data, text mining is designed to handle structured data

from databases or XML files, and can also handle unstructured or semi-structured data sets (such

as email, full-text documents, and HTML files). As a result, text mining is a better solution for

companies where huge volumes of diverse information must be merged and managed (Fan,

Wallace, et. al. 2005).

To date, however, most research and development efforts have centered on data mining using structured data. Because data mining deals with data that have already been stored in a structured format, much of its preprocessing focus falls on two critical tasks: 1) scrubbing and normalizing data and 2) creating extensive numbers of table joins. In text mining, by contrast, preprocessing deals with the identification and extraction of representative features for natural language documents. These preprocessing operations mainly transform unstructured data stored in document collections into a more explicitly structured intermediate format (Feldman & Sanger, 2006).
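
One concrete form of that structured intermediate format is a simple term-count table. The sketch below, using only the Python standard library and two invented article snippets, shows the basic transformation from free text to a structured representation.

```python
# Sketch: converting unstructured text into a structured intermediate form
# (a per-document table of term counts). The articles are invented examples.
import re
from collections import Counter

articles = {
    "a1": "Council approves new library budget for the campus.",
    "a2": "Budget cuts at the campus library worry student groups.",
}

def to_features(text):
    """Lowercase, tokenize, and count terms: unstructured text -> term counts."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

structured = {doc_id: to_features(text) for doc_id, text in articles.items()}
print(structured["a1"]["library"], structured["a2"]["budget"])   # 1 1
```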

Process of Text Mining

Text mining can be examined in three stages: text preparation, text processing, and text analysis. A key element of text mining is its focus on the document collection. At its simplest, a

document collection can be any grouping of text-based documents. Practically speaking,

however, most text mining solutions are aimed at discovering patterns across very large

document collections. Another critical element in text mining is the document. For practical

purposes, a document can be very informally defined as a unit of discrete textual data within a

collection that usually, but not necessarily, correlates with some real-world document such as a

business report, legal memorandum, e-mail, research paper, manuscript, article, press release, or

news story (Feldman & Sanger, 2006).

In the initial stage, text preparation, a selected set of documents is input into the text mining software, which cleans and preprocesses the data. The text processing stage is where the user enters the picture and submits an information query to the program. An algorithm is applied to the processed data, clustering the data to find meaningful patterns. Text analysis then evaluates the relevance of the mined text and renders it in a more tangible form (Natarajan, 2005).

After text files are input into a system, text mining software produces a semantic network of the key concepts and terms in each file, defined by a weighting algorithm. This algorithm is used to find meaningful patterns in the data based on the frequency of terms. Phrasal analysis is also available in addition to term searches. It has often proved to be the most effective method of text mining, as noun phrases such as company names, personal names, locations, or case names are often more useful than single-term searches. Subject headings and content descriptors adopted for this purpose are frequently derived from library classification schemes. The user can enter a query and receive a compilation of relevant data pulled from the documents in the form of an XML or HTML file, or even a comma-separated file. These results are often highly visual or graphical, and aim to produce an organizational knowledge map. The purpose of this compilation is to discover new knowledge or information based on similar concepts, or to find a narrative in an unstructured set of documents.

An understanding of how the selected body of information is mapped is important to the information retriever. To use text mining software effectively, there should be a certain level of awareness of the classification, clustering, and categorization schemes of platform servers, network servers, databases, and workgroup applications. The quality of the information that is mined depends largely on a certain level of organization within the originating knowledge base. Finally, the results are reviewed by the individual, who assigns relevance and value to the information.

Figure 1 - Text Mining Process (adapted from Fan, Wallace, et al., 2005)


Technologies in Text Mining

Several techniques within text mining technology are utilized across different applications. These include information extraction, topic tracking, summarization, categorization, clustering, concept linkage, information visualization, and question answering (Fan, Wallace, et al., 2005). In this section we review the technologies that we used in evaluating the text mining software for our research, the Tri-State project.

Information extraction - This technology is a popular method for analyzing unstructured text and identifying key phrases and relationships within it. It looks for predefined sequences in the text by pattern matching, and is especially useful when dealing with large volumes of text (Fan, Wallace, et al., 2005).
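
A very small illustration of pattern-based extraction is given below: two regular expressions pull dates and dollar amounts out of free text. The patterns are deliberately simplified stand-ins for the richer pattern-matching rules used in real information extraction systems.

```python
# Sketch: pattern-matching extraction with regular expressions. The patterns
# are simplified illustrations, not the rules of any particular product.
import re

text = ("The city council met on March 12, 2007 and approved a $45,000 grant "
        "for the student newspaper, effective April 1, 2007.")

date_pattern = (r"\b(?:January|February|March|April|May|June|July|August|"
                r"September|October|November|December) \d{1,2}, \d{4}\b")
money_pattern = r"\$\d[\d,]*(?:\.\d{2})?"

print(re.findall(date_pattern, text))    # ['March 12, 2007', 'April 1, 2007']
print(re.findall(money_pattern, text))   # ['$45,000']
```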

Summarization - Text summarization helps users figure out whether a lengthy document meets their needs and is worth reading. The goal is to reduce the length and detail of a document while retaining its main points and overall meaning. Sentence extraction mines important sentences from a given text by statistically weighting all the sentences in the text. Other heuristics, such as position information and extracting sentences that follow key phrases like "in conclusion," are also used in summarization. Headings and other markers of subtopics are searched in order to identify the document's key points (Fan, Wallace, et al., 2005).
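
The sketch below shows sentence extraction in its simplest form: sentences are scored by the frequencies of the content words they contain and the top-scoring sentences are returned in their original order. It is an assumed toy implementation, not the summarizer of any tool evaluated later in this paper.

```python
# Toy extractive summarizer: score sentences by word frequency, keep the top ones.
# An illustrative sketch only, not the algorithm of any evaluated system.
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "was", "for", "on"}

def summarize(text, n_sentences=2):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    freq = Counter(words)
    # Score each sentence by the total frequency of its content words.
    scored = sorted(
        sentences,
        key=lambda s: sum(freq[w] for w in re.findall(r"[a-z']+", s.lower())
                          if w not in STOPWORDS),
        reverse=True,
    )
    top = set(scored[:n_sentences])
    # Return the chosen sentences in their original order.
    return " ".join(s for s in sentences if s in top)

article = ("The city council approved the library budget. The vote followed weeks "
           "of debate. Students protested the original budget cuts. The council "
           "meets again next month.")
print(summarize(article))
```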

Topic tracking - A topic-tracking system keeps user profiles and, based on the documents a user views, forecasts other documents of interest to that user. Such a system allows users to choose keywords and notifies them when news relating to those topics becomes available. Some tools let users select particular categories of interest, while others infer users' interests from their reading histories and the click-through information they leave behind online (Fan, Wallace, et al., 2005).
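
A minimal sketch of the same idea, assuming a simple keyword-based user profile rather than the learned profiles of commercial tools, is shown below: an incoming article is flagged when enough profile terms appear in it.

```python
# Sketch: a keyword-profile topic tracker. The profile, article, and threshold
# are illustrative assumptions, not how any commercial tracker works.
def matches_profile(article_text, profile_keywords, threshold=2):
    """Flag an article when at least `threshold` profile keywords appear in it."""
    text = article_text.lower()
    hits = sum(1 for keyword in profile_keywords if keyword.lower() in text)
    return hits >= threshold

profile = ["budget", "library", "tuition"]
article = "Trustees debate tuition increases and the shrinking library budget."
print(matches_profile(article, profile))   # True: two profile keywords matched
```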

Term weighting and association rules are also common in text mining. In the term weighting technique, a document is represented by removing functional words (conjunctions, prepositions, pronouns, adverbs, etc.) and then assigning weights to content words (e.g. agent, decision making) in order to describe how important each word is for that particular document or document collection. This reflects the fact that some words carry more meaning than others (Chibelushi, Sharp, & Salter, n.d.). Association rules have made their way from data mining to text mining. An association rule is a probabilistic statement about the co-occurrence of certain events in a database or large collection of texts (Ibid.).
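
The best-known weighting scheme of this kind is TF-IDF. The sketch below, which assumes the scikit-learn library and a toy document set, drops English function words and weights the remaining content words by how characteristic they are of each document.

```python
# Sketch: TF-IDF term weighting with function words removed. scikit-learn and
# the toy documents are assumptions made for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "The agent supports decision making for the editorial team",
    "Decision making in newsrooms depends on the editorial agent",
    "The basketball team won the championship game on Saturday",
]

vectorizer = TfidfVectorizer(stop_words="english")   # drop conjunctions, pronouns, etc.
weights = vectorizer.fit_transform(documents)

# Show the weighted content words of the first document.
terms = vectorizer.get_feature_names_out()
row = weights[0].toarray().ravel()
print({term: round(w, 3) for term, w in zip(terms, row) if w > 0})
```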

Purpose of Text Mining

The purpose of text mining is to make meaningful connections within unstructured text data. As discussed above, the three main stages in text mining are text preparation, text processing, and text analysis. Depending on the software or sites used (or the guidance of the human counterpart), preparation involves formatting the data during the selection and preprocessing stage. One issue in the results of mined text is the lack of a hierarchy in the display of indexes or clustered information.

A data-mining algorithm is then applied to the preprocessed data. Sentence and paragraph boundaries are identified and parts of speech are tagged across the set of documents at this point. Natural language processing then provides conceptual relations between entities and can make links between certain chunks of information (people, associated companies, etc.). Text analysis is the more subjective aspect of the process, involving the evaluation of the output. After being run through these algorithms, the resulting text may be subjected to further processing (a link discovery tool, or other such tool).

Rudimentary term extraction is the most basic form of text mining: lists of terms drawn from a set of texts are weighted and assembled into a feature vector. A search at any scale would, in effect, measure the similarities between documents by comparing their feature vectors. In some text mining software, the user or systems administrator takes a sample group of documents and creates rules about terms, which the software translates into an algorithm; as mentioned earlier, this has been referred to as "supervised learning" (Sullivan, p. 102). Alternatively, other software comes with preexisting classification schemas that weight phrases and terms ("unsupervised learning," Sullivan, p. 102).
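
Continuing the feature-vector idea, the sketch below compares two documents by the cosine of the angle between their term-count vectors. The example documents and the use of raw, unweighted counts are simplifying assumptions.

```python
# Sketch: cosine similarity between term-frequency feature vectors.
# Raw counts are used here purely for illustration.
import math
import re
from collections import Counter

def vectorize(text):
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine_similarity(vec_a, vec_b):
    shared = set(vec_a) & set(vec_b)
    dot = sum(vec_a[t] * vec_b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

doc1 = "The senate passed the campus budget bill on Tuesday."
doc2 = "Campus budget bill clears the student senate."
print(round(cosine_similarity(vectorize(doc1), vectorize(doc2)), 3))
```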

The underlying notion in text mining is that the frequency of term occurrence equates to relevance. Perhaps more useful are maximum frequent sequences (MFS), a method of extracting the phrases that occur most frequently in a set of documents. Specific phrases, such as a company name, product name, or proper name, can often provide higher precision in what is recalled. It should also be noted that a certain level of awareness on the part of the user, regarding how the language of the query (controlled or natural) relates to the search, is vital to obtaining pertinent results.
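
As a rough approximation of phrase-level mining, the sketch below counts fixed-length word bigrams across a small set of invented articles. Using fixed-length n-grams instead of true maximal frequent sequences is a deliberate simplification for brevity.

```python
# Sketch: frequent word bigrams as a crude stand-in for maximal frequent
# sequences (a deliberate simplification, not the MFS algorithm itself).
import re
from collections import Counter

def frequent_phrases(texts, n=2, min_count=2):
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return [(" ".join(gram), c) for gram, c in counts.most_common() if c >= min_count]

articles = [
    "Tri-State Times reports on the campus budget vote.",
    "The campus budget vote drew a large crowd.",
    "Editors at the Tri-State Times covered the vote.",
]
print(frequent_phrases(articles))   # e.g. [('campus budget', 2), ('budget vote', 2), ...]
```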

As the set of documents grows, patterns begin to emerge within the mined text, and the number of patterns eventually grows as well. The more successful cases of text mining are in areas with highly controlled language, such as biotechnology, competitive intelligence, and consumer product development. Table 1 summarizes some of the main uses of text mining and the technologies used in the principal industries.


Table 1 - Applications of text mining in various industries (Source: Fan, Wallace, et al., 2005)

As previously mentioned, natural language techniques have been applied to text mining in an attempt to represent both the user and the document in the search process and thereby aid the search method. In addition, searching with NLP can classify documents together by discovering multi-point relationships between terms and phrases. NLP is, however, prone to error, particularly in environments with a wide range of topics and styles. Even so, beneficial connections can be made between previously unrelated items.


Research Model Using the CRISP-DM

CRISP-DM was developed in late 1996 by three main players in the then-young data-mining market: DaimlerChrysler (then Daimler-Benz), SPSS (then ISL), and NCR. DaimlerChrysler already had some experience applying data mining in its business operations. SPSS had been providing services related to data mining since 1990 and launched the first commercial data mining workbench, Clementine, in 1994. NCR had the Teradata data warehouse and teams of data mining consultants and technology specialists to service its clients' requirements (CRISP-DM, 2000). Over the years the model has been developed and adapted for better data mining across a variety of purposes. We found that this particular model would be an effective method to apply to our case study.

The CRISP-DM methodological model consists of the following phases (Sullivan, 2004):

Business understanding - The clients’ point of view is considered at the first stage,

identifying the requirements and objectives of the selected applications. Problems and

restrictions of each application are identified and examined as well. This phase also

incorporates a description of the client background, the business objectives, and a

description of the criteria used to measure the success of the study.

Data understanding - All relevant information is identified to carry out the application and to develop an initial gauge of the application's content, quality, and utility. This initial collection of data assists the analyst in discovering the particulars of the individual programs. Problems related to the format and values of the applications are also examined at this point. The manner in which the data was collected, including the different sources, meaning, volumes, reading procedure, and so on, will also be of interest, as these can give an indication of the quality of the data.

Data preparation - In this stage, the final data set from which the model will be created and validated is assembled and reviewed. Tools for data extraction, cleaning, and transformation are applied during data preparation. Combinations of tables, format changes, and aggregations of values are carried out to satisfy the input requirements of the particular learning algorithms.

Modeling - Data-mining techniques are selected and applied at this stage, according to the objectives defined in the first step of the model. Modeling is the core phase of KDD (Knowledge Discovery and Data Mining) and corresponds to the choice of a technique, its parameterization, and its execution over a defined training data set. Other models can also be created in this phase if required.

Evaluation - The previous steps are reviewed in order to verify the results against the objectives defined in the business understanding phase. The next tasks to be performed are also defined here. Depending on the results, course corrections may be defined, which correspond to returning to one of the already performed phases using other parameters, or to looking for additional data.

Deployment - This phase sets out the actions necessary to make the acquired knowledge available to the organization. A final report is generated to explain the results and the experience gained that is useful to the client business.


Figure 2: Phases of the CRISP-DM Process Model (Source: CRISP-DM, 2000)

The CRISP-DM process is a cycle in which the sequence of the phases is not necessarily strict. Moving back and forth between different phases is essential, and the sequence of tasks depends on the outcome of each phase. The arrows in Figure 2 indicate the most important and frequent dependencies between phases, and the outer circle indicates that the process is cyclic in nature.

The mining process continues after a solution has been deployed. The lessons learned

during the practice can generate new, often more focused business questions. Subsequent mining

processes will benefit from the experiences of previous ones, with discovery and examination of

successful results.


In our study for the Tri-State Times we decided to use the CRISP-DM model as well. It is a popular and proven approach on which many business solutions have depended. In a study analyzing meeting transcripts, Chibelushi, Sharp, and Salter (n.d.) recorded a set of meetings and transcribed them for further processing. The transcripts were manually edited to prepare them for the modeling phase and then further analyzed to track the themes discussed, extract the key issues and any associated actions, and identify the initiator. Their approach combined statistical natural language processing with semantic analysis of the transcripts. do Prado (2004) applies the CRISP-DM methodological approach to a case study in order to examine the relationships between the text and defined groupings of news items.

The Tri-State Times Text Mining Project

The student newspaper, the Tri-State Times, is researching the use of text mining software to assist in finding news articles, columns, and editorials that are similar to selected news items. In addition, the software should be able to identify the primary terms and phrases in a selected article so that similarities to other articles can be found.

The members of the Tri-State Times have approached the School of Library Information Science (SLIS) for assistance in researching text mining software that would be appropriate for their use. However, due to a lack of funds, they would like a system that can be purchased at minimal cost or that is available for free. In addition, the upkeep of the system should be easy and should not incur any major additional costs.

The members of the research team at SLIS will conduct a study in which they evaluate different text mining software that could satisfy the needs of the Tri-State Times. Once the search has been narrowed down, they will conduct tests based on the tasks the newspaper needs to perform. In addition to comparing the ease of use of the various systems, a survey will be conducted with a sample of the users to assess the functionality of each system and how the users feel about the system and the resulting data sets. A combination of these factors will be used to determine the best text mining solution for the newspaper.

Step 1 - Business Understanding

During this initial phase, it is necessary to focus on understanding the project objectives and requirements from the organization's perspective, and then to convert this knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives.

The student newspaper at Tri-State University is short-staffed. The editorial team has consulted SLIS for assistance in finding a solution to help them archive important articles. The objective of this project is to find a method of locating articles, columns, and editorials similar to the stories that the editorial team selects. To achieve this, the team surveys and selects major news articles and then looks for similar articles covering the same news story in other major newspapers. One can search most of the major news sources manually, one at a time, or a text mining model can be used to assist in the process. The text mining model will be able to identify the key terms and ideas in the articles, and these terms can then be used to look up similar articles. In addition, with certain text mining software, other articles can be identified and selected from other news sources automatically. Finally, the key terms and sentences can also be used to formulate a summary of the article. This summary can be placed into a collection or database of summaries and accessed in the future for use in writing editorials, opinions, and other articles.


Step 2 - Data Understanding

Data understanding begins with the initial collection of news stories from various national news websites, e.g. CNN, Fox News, and Yahoo News. Next, local stories of interest are selected from local and regional newspapers. To become familiar with the data, the main stories are identified from one or two of these news sources. A major data quality issue may be the availability of more than one version of a news article at various times, especially on the Internet. Another issue is the amount of time spent identifying the main news article, and whether this becomes unwieldy. Once the editorial team selects the news articles (the dataset), a member of the staff should be able to enter them into the system and generate the output of key terms and phrases, a summary of the article, and possible matches of similar articles.

Step 3 - Data Preparation

The data preparation phase covers all activities needed to construct the final dataset (the data that will be fed into the modeling tool or tools) from the initial raw data. In this case, the raw data includes all the news articles that the editorial team deemed worth archiving during the first selection. The final dataset includes the articles drawn from that raw data which are considered more important than the rest. During this phase, the team chooses one or two versions of the main stories of the day from the news sources. These news articles are converted to text versions so that they can be input without the images. The text versions of the news stories make up the final dataset, ready to be entered into the text mining system.
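
Converting a saved web article to plain text can be done with Python's built-in HTML parser, as in the hedged sketch below. The markup shown is invented for illustration, and real news pages would need more careful cleanup (navigation, advertisements, and so on).

```python
# Sketch: stripping HTML markup (and thus images) from a saved article so that
# only the text remains. Standard library only; the sample HTML is invented.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

    def text(self):
        return " ".join(self.chunks)

html = "<h1>Budget vote</h1><p>The council met <img src='x.jpg'> on Tuesday.</p>"
parser = TextExtractor()
parser.feed(html)
print(parser.text())   # "Budget vote The council met on Tuesday."
```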

Step 4 - Modeling

During this phase, various modeling techniques are selected and applied. The SLIS team looked at a range of text mining tools in light of the needs of the Tri-State Times. Text mining tools vary widely in capability and price; the tools offered by major vendors cost on the order of thousands of dollars. Some of the major text mining tools are compared below along with their major features:

Table 2 - Text-mining technologies offered by commercial vendors (Source: Fan, Wallace, et al., 2005)

Although most of the above systems offered all the functionality that the Tri-State project needed, these tools were beyond the budget of the student newspaper. In addition, learning and applying these systems would require considerable effort. The SLIS team therefore looked for systems that were simpler to use and that were inexpensive or offered for free on the Internet, keeping the initial costs minimal. The team narrowed the search down to three text mining tools: 1) Termine, 2) Textalyser, and 3) Topicalizer.

The SLIS team took a sample set of the chosen articles from the final data set and entered them into the respective tools. The outputs from these systems were then tabulated and compared. In some cases the outputs were not comparable; in those cases, another set of data was entered into the systems. The outputs for these articles were then evaluated.

Step 5 - Evaluation

During this stage the SLIS team evaluated the results generated by the three selected text mining tools. The results were compared on the basis of three criteria: 1) the terms and phrases selected, 2) the summary generated, and 3) the generated list of similar articles.

The key terms that the systems generated were evaluated in terms of whether they could be used to look for related articles from other news sources. The summaries were compared to see if the tools were able to generate a concise summary that could be used in the future for editorials and opinions. In addition, the summaries were evaluated to see whether the system was able to connect ideas or merely extracted sentences from the input text.

Finally, the set of articles that each system generated as similar to the input article was also evaluated. The number of articles in the output and the variety of sources were also investigated. These outputs were evaluated by a team of eleven participants who looked at the three criteria, 1) terms and phrases selected, 2) summary generated, and 3) generated list of similar articles, for each of the three systems. In addition, these users also evaluated the systems on ease of use and available functionality.

The users tested the three software tools with a few news articles that they input into the systems. The outputs were generated and compared for each of the systems. In addition, a short survey was taken to evaluate their user experience (see Appendix A for the survey). Overall, the users were satisfied with all three of the text mining tools evaluated. The fact that these systems were available on the Internet at no cost was attractive to the users, who were aware of the financial constraints. The ease of use of all three systems was also a feature that the participants liked. All three tools tested provided keywords and phrases that were helpful in locating other news articles of interest to the users. But because of the additional functionality of Topicalizer, including the summary and the links to other articles, most of the participants preferred this system to the others tested.

Topicalizer is a service that automatically analyzes a document, specified by a URL or as plain text, with regard to its word, phrase, and text structure. It provides information on a given text including the following: word, sentence, and paragraph counts, collocations, syllable structure, lexical density, keywords, readability, and a short abstract of what the given text is about (Topicalizer, n.d.).
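
Two of those figures, lexical density and a readability score, are straightforward to approximate. The sketch below computes them with a crude syllable heuristic and the standard Flesch Reading Ease formula; it is our own illustration, not Topicalizer's actual implementation.

```python
# Sketch: approximate lexical density and Flesch Reading Ease for a text.
# The syllable counter is a crude heuristic; this is not Topicalizer's method.
import re

FUNCTION_WORDS = {"the", "a", "an", "and", "or", "but", "of", "to", "in",
                  "on", "is", "are", "was", "were", "it", "that", "this"}

def count_syllables(word):
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))   # vowel groups

def text_statistics(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    content_words = [w for w in words if w.lower() not in FUNCTION_WORDS]
    lexical_density = len(content_words) / len(words)
    flesch = (206.835 - 1.015 * (len(words) / len(sentences))
              - 84.6 * (syllables / len(words)))
    return {"words": len(words), "sentences": len(sentences),
            "lexical_density": round(lexical_density, 2),
            "flesch_reading_ease": round(flesch, 1)}

print(text_statistics("The council approved the budget. Students cheered loudly."))
```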

The results of the study were summarized as follows:

Based on the information collected from the users, there was a 72% rate of agreement on the resulting sets of information. Interestingly, this is practically identical to previous studies measuring the frequency of agreement between expert and non-expert semantic taggers (Stevenson, 2003, p. 50). The areas of disagreement often involved topics that were more open to multiple interpretations (where the user was not sure of the context or relevance).

Step 6 - Deployment

Creation of the model is generally not the end of the project. Once the testers chose Topicalizer as the system of choice for the Tri-State Times and the trial runs were complete, the results were organized and stored so that they can be retrieved as needed by the student newspaper. A reporting structure was created in a consistent and understandable format so that the users of these results can interpret and apply them easily.

Conclusion

By using a selection of text mining software, our study found that differences exist in the manner in which people relate to information, even when given the same query. Apart from differences in the particular algorithms and weighting schemas, clear differences were found in the decision processes of different individuals confronted with the same set of information, and also in the comparison of the satisfaction surveys. These differences are ultimately subjective in nature, though important information for further improvement of the software can be found within them.

We felt that although text mining has certain limitations, there are also opportunities for further research and study. The existing applications of text mining have yielded many benefits, and we feel there are many possibilities for improving the current software. Our study was limited to a specific set of data and users, while future research could utilize a much broader scope of application (larger sets of data, a wider array of topics, and a larger user base).


References

Atkinson-Abutridy, John. (2004). Semantically-Driven Explanatory Text Mining: Beyond Keywords. Retrieved March 3, 2007, from Universidad de Concepción, Departamento de Ingeniería Informática, http://www.springerlink.com/content/v78y4u242a67uupe/fulltext.pdf

Chibelushi, C., Sharp, B., Salter, A. (2004) A Text Mining Approach to Tracking Elements of Decision Making: a pilot study. Retrieved March 5, 2007, from Staffordshire University, School of Computing, http://www.comp.lancs.ac.uk/computing/research/cseg/projects/tracker/css_iceis04.pdf

CRISP-DM (2000). Retrieved March 2, 2007, from http://www.crisp-dm.org/Overview/index.htm

do Prado, H. A., Moreira de Oliveira, J. P., Ferneda, E., Wives, L. K., Silva, E. M., Loh, S. (2004). Transforming Textual Patterns into Knowledge. In Raisinghani, Mahesh (Ed.), Business intelligence in the digital economy: opportunities, limitations, and risks. (p. 207-227). Hershey, PA: Idea Group Publications.

Doyle, L. (1961). Semantic road maps for literature searchers. Journal of the Association for Computing Machinery, 8, 223-239.

Fan, W., Wallace, L., Rich, S., & Zhang, Z. (2005). Tapping Into the Power of Text Mining. Retrieved March 10, 2007, from http://pubs.dlib.vt.edu:9090/2/01/text_mining_final_preprint.pdf

Feldman, R. & Sanger, J. (2006). The text mining handbook: advanced approaches in analyzing unstructured data. New York: Cambridge University Press.

Haravu, L. J., & Neelameghan, A. (2003). Text Mining and Data Mining in Knowledge Organization and Discovery: The Making of Knowledge-Based Products. Cataloging & Classification Quarterly, 37 (1/2), 97-113.

Hearst, M. (2003). What Is Text Mining?. Retrieved on March 12, 2007 from http://www.ischool.berkeley.edu/~hearst/text-mining.html

Luhn, H. P. (1958). The automatic creation of literature abstracts. IBM Journal of Research and Development, 2, 159-165.

Mironova, S. Y., Berry M. W., Atchley, S., Beck, M. (2004). Advancements in text mining algorithms and software. In Kargupta, H. (Ed.) Data mining: next generation challenges and future directions. (p. 425-436). Menlo Park, CA: MIT Press

Natarajan, M. (2005, July). Role of Text Mining in Information Extraction and Information Management. DESIDOC Bulletin of Information Technology, 25 (4), 31-38.

Purple Insight, (n.d.) Retrieved on March 12, 2007 from http://www.purpleinsight.com/downloads/docs/visualizer_tutorial/glossary/go01.html

Sharp, M. (2001). Text Mining. Seminar in Information Studies, Retrieved on Mar 02, 2007 from http://www.scils.rutgers.edu/~msharp/text_mining.htm

Smalheiser, N.R., Swanson, D.R., (1998). “Using ARROWSMITH: A Computer Assisted Approach to Formulating and Assessing Scientific Hypotheses”, Computer Methods and Programs in Biomedicine 57(3), 149-153.

SPSS (2005) Improve Business Results with Text Mining. Retrieved March 1, 2007, from http://www.spss.com/PDFs/TMC4SPC-0105.pdf

Stevenson, M. (2003). Word Sense Disambiguation. Stanford, CA: Center for the Study of Language and Information.

Sullivan, D. (2004). Text Mining in Business Intelligence. In Raisinghani, Mahesh (Ed.), Business intelligence in the digital economy: opportunities, limitations, and risks. (pp. 98-110). Hershey, PA: Idea Group Publications

Swanson, D. R. (1988_1). Historical note: Information retrieval and the future of an illusion. Journal of the American Society for Information Science, 39, 92-98.

Swanson, D.R. (1988_2). Migraine and Magnesium: Eleven neglected connections. Perspectives in Biology and Medicine, 31, 526-557

Termine. Retrieved February 22, 2007, from http://www.nactem.ac.uk/software/termine/

Textalyser. Retrieved February 22, 2007, from http://textalyser.net/

Topicalizer. Retrieved February 22, 2007, from http://www.topicalizer.com/

Venere, E. (2004) “ 'Knowledge discovery' Could Speed Creation of New Products” Purdue News Service Retrieved on Mar 03, 2007 from http://www.purdue.edu/UNS/html4ever/2004/041018.Caruthers.discover.html

Yeates, S. (2002). Text Mining. Retrieved on March 02, 2007 from http://www.cs.waikato.ac.nz/~nzdl/textmining/


Appendix A

Questionnaire for Participants (Note: One form for each system)

Please feel free to add or remove success viewpoints as appropriate.

Estimate the level of success of each query, using this response scale.

5 = very satisfied, 4 = satisfied, 3 = neutral, 2 = dissatisfied, 1 = very dissatisfied

1. __ The generated summary was relevant to the query.

2. __ The terms and phrases of the query were found to give relevant results.

3. __ The results gave too many variations to similar articles.

4. __ The system was easy to use and understand.

5. __ The system provided simple and ample methods of search mechanisms.


Identify both the successful and unsuccessful elements found in the resulting information. Were

these useful for further study, or were they irrelevant?


Estimate the level of satisfaction of the results, using this response scale.

5 = very satisfied, 4 = satisfied, 3 = neutral, 2 = dissatisfied, 1 = very dissatisfied

1. __ The resulting data sets were sufficient for our purpose of use and study.

2. __ The extracted information led to the discovery of new knowledge.

3. __

4. __

5. __


Rate the following characteristics of the environment for the project being reviewed. Use a scale of 1 to 5, where 1 is the lowest rating and 5 is the highest. If an item does not apply, mark an X.

__ ease of software use

__ software quality

__ analysis capability

__ design capability

__ appropriateness of technology to query

__ effective use of data configuration

__ quality assurance of data

__ clarity of source
