CHAPTER 1
INTRODUCTION
Large quantities of data are now being accumulated, yet there is usually a wide gap between
the stored data and the knowledge that could be derived from it. This transition does not
occur automatically; that is where Data Mining comes into the picture. Exploratory
Data Analysis starts from some initial knowledge about the data, whereas Data Mining can
provide more in-depth knowledge of the data. Extracting knowledge from massive data
is one of the most desired capabilities of Data Mining. Manual data analysis has been around
for some time, but it becomes a bottleneck when the data are large. Fast-developing
computer science and engineering techniques and methodologies generate new demands to
mine complex types of data. A number of Data Mining techniques (such as association,
clustering, and classification) have been developed to mine this vast amount of data. Previous studies
[67] of Data Mining mostly focused on structured data, such as relational, transactional,
and data warehouse data. However, in reality, a substantial portion of the available
information is stored in text databases (or document databases), which consist of large
collections of documents from various sources, such as news articles, books, digital
libraries, and Web pages. Text databases are growing rapidly due to the increasing amount
of information available in electronic form, such as electronic publications, e-mail,
CD-ROMs, and the World Wide Web (which can also be viewed as a huge, interconnected,
dynamic text database).
Data stored in most text databases are semi-structured, in that they are neither
completely unstructured nor completely structured. For example, a document may contain
a few structured fields, such as title, authors, publication date, length, and category,
but also some largely unstructured text components, such as abstract and contents.
There has been a great deal of study on the modeling and implementation of
semi-structured data in recent database research. Information Retrieval techniques, such as text
indexing, have been developed to handle unstructured documents. However, traditional
Information Retrieval techniques become inadequate for the increasingly vast amounts of
text data. Typically, only a small fraction of the many available documents will be relevant
to a given individual or user. Without knowing what could be in the documents, it is
difficult to formulate effective queries for analyzing and extracting useful information
from the data. Users need tools to compare different documents, rank the importance and
relevance of the documents, or find patterns and trends across multiple documents. Thus,
Text Mining has become an increasingly popular and essential theme in Data Mining.
Text Mining, also known as knowledge discovery from text or document information
mining, refers to the process of extracting interesting patterns from very large text corpora
for the purpose of discovering knowledge [129]. It is an interdisciplinary field involving
Information Retrieval, Text Understanding, Information Extraction, Clustering,
Categorization, Topic Tracking, Concept Linkage, Computational Linguistics,
Visualization, Database Technology, Machine Learning, and Data Mining [120].
Text Mining tools and applications aim to capture the relationships between data. They
can be roughly organized into two groups. One group focuses on document exploration
functions to organize documents based on their content and provide an environment for a
user to navigate and browse in a document or concept space. It includes Clustering,
Visualization, and Navigation. The other group focuses on text analysis functions to
analyze the content of the documents and discover relationships between concepts or
entities described in the documents. They are mainly based on natural language processing
techniques, including Information Retrieval, Information Extraction, Text Categorization,
and Summarization [128], [129].
Content-based text selection techniques have been extensively evaluated in the context of
Information Retrieval. Every approach to text selection has four basic components:
Some technique for representing the documents
Some technique for representing the information need (i.e., profile construction)
Some way of comparing the profiles with the document representations
Some way of using the results of that comparison
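The four components listed above can be sketched in a few lines of Python. This is only an illustrative sketch, not the method proposed in this thesis: it assumes a bag-of-words representation for both documents and profile, cosine similarity as the comparison, and ranking above a threshold as the use of the result. The function names (`represent`, `cosine`, `select`) and the sample documents are hypothetical.

```python
import math
from collections import Counter


def represent(text):
    """Components 1 and 2: represent a text as a bag of lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Component 3: compare a profile with a document representation."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select(documents, information_need, threshold=0.0):
    """Component 4: keep documents scoring above the threshold, best first."""
    profile = represent(information_need)
    scored = [(cosine(profile, represent(doc)), doc) for doc in documents]
    kept = [pair for pair in scored if pair[0] > threshold]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)


# Hypothetical sample data, for illustration only.
documents = [
    "text mining extracts patterns from text",
    "stock prices fell sharply today",
    "clustering groups similar text documents",
]
ranked = select(documents, "text mining")
for score, doc in ranked:
    print(f"{score:.2f}  {doc}")
```

In practice the representation would usually involve stop-word removal, stemming, and term weighting; those steps are omitted here to keep the four components visible.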
This thesis is about Text Mining for Information Retrieval. Our main interest is in how
the techniques and tools of Text Mining can be used for exploration. The search engine is
the best-known Information Retrieval tool. The application of Text Mining to
Information Retrieval improves the precision of IR systems and reduces the number of
documents that a single query returns. In this thesis, we propose a method of ranking
Web text pages based on statistical heuristics. In addition, an efficient and global method of
clustering is proposed to group similar documents. To speed up text
document retrieval, a new method of storing the inverted index file is proposed that uses
the range partitioning feature of Oracle; the Random Access
Memory requirement is reduced considerably by storing the inverted file on secondary storage and
bringing only the required portion into main memory. Apart from this, we have
done some work in the adjacent field of Text Document Summarization to generate
single-document extractive summaries, which can be used to cluster similar documents (Web
text documents) using the proposed clustering method.
1.1 Data Mining – Concepts and Techniques
Database technology has evolved from primitive file processing to the development of
database management systems with query and transaction processing. Due to the explosive
growth in data collected from applications including business and management,
government administration, science and engineering, and environmental control, there is
an increasing demand for efficient and effective data analysis and data understanding
tools. Data warehouse systems provide some data analysis capabilities which include data
cleaning, data integration and OLAP (On-Line Analytical Processing). These analysis
techniques provide functionalities such as summarization, consolidation and aggregation,
as well as the ability to view information from different angles. Although OLAP tools support
multidimensional analysis and decision making, additional data analysis tools are required
for in-depth analysis, such as data classification, clustering, and the characterization of
data changes over time. The widening gap between data and information calls for a
systematic development of Data Mining tools which will turn "data tombs" into "golden
nuggets" of knowledge. Data Mining tools perform data analysis which may uncover
important data patterns, contributing greatly to business strategies, knowledge bases, and
scientific and medical research.
Data Mining has many definitions:
Definition 1: Data Mining, also referred to as knowledge discovery in databases, is the process
of nontrivial extraction of implicit, previously unknown, and potentially useful information
(such as knowledge rules, constraints, and regularities) from data in databases [106].
Definition 2: "Data Mining is the process of sorting through large amounts of data and
picking out relevant information. It is usually used by business intelligence organizations,
and financial analysts, but is increasingly being used in the sciences to extract information
from the enormous data sets generated by modern experimental and observational
methods" [38].
The ultimate goal of Data Mining is prediction; predictive Data Mining is the most
common type of Data Mining and the one with the most direct business applications.
The process of Data Mining consists of three stages.
a) Exploration - This is the first stage of the Data Mining process. It covers data
preparation, which includes data cleaning, data transformation, and selecting subsets of
records. If the data set is large and contains a large number of variables
("fields"), this stage also performs preliminary feature selection to reduce
the number of variables to a manageable range using statistical methods.
Depending on the nature of the analytic problem, the exploration stage may involve a
simple choice of straightforward predictors for a regression model, or elaborate
exploratory analyses using a wide variety of graphical and statistical methods to
identify the most relevant variables and determine the complexity and/or the
general nature of the models. The information produced in this stage is then used in the
next stage.
b) Model building and validation – In this stage various models are considered and
the best one is chosen based on its predictive performance (i.e., how well it
explains the variability in question and produces stable results across samples).
A variety of techniques have been developed that apply different models to the
same data set and compare their performance in order to choose the best. These techniques,
often called "competitive evaluation of models", are considered the core of
predictive Data Mining and include Bagging (Voting, Averaging), Boosting,
Stacking (Stacked Generalizations), and Meta-Learning.
c) Deployment – In this stage, the best model selected in the
previous stage is applied to new data in order to generate predictions or estimates of the expected
outcome.
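The three stages can be illustrated with a deliberately small sketch. It assumes a toy data set of (x, y) pairs, uses the removal of incomplete records as the only exploration step, and compares just two candidate models (a constant mean predictor and a least-squares line) on a held-out validation sample. All names and data values are hypothetical and chosen only for illustration; the sketch does not implement Bagging, Boosting, or Stacking.

```python
def explore(raw_rows):
    """Exploration: data cleaning, here dropping records with missing values."""
    return [(x, y) for x, y in raw_rows if x is not None and y is not None]


def fit_mean(train):
    """Candidate model 1: always predict the mean of the training targets."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y


def fit_line(train):
    """Candidate model 2: least-squares line (assumes the x values are not all equal)."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def mean_squared_error(model, rows):
    return sum((model(x) - y) ** 2 for x, y in rows) / len(rows)


def build_and_validate(train, validation):
    """Model building and validation: keep the model with the lowest validation error."""
    candidates = {"mean": fit_mean(train), "line": fit_line(train)}
    errors = {name: mean_squared_error(model, validation)
              for name, model in candidates.items()}
    best = min(errors, key=errors.get)
    return best, candidates[best]


# Hypothetical raw data containing two incomplete records.
raw = [(1, 2.0), (2, 4.1), (3, None), (4, 8.0), (5, 9.9), (6, 12.1), (None, 3.0)]
clean = explore(raw)
train, validation = clean[:3], clean[3:]
best_name, best_model = build_and_validate(train, validation)

# Deployment: apply the selected model to new data.
prediction = best_model(7)
print(best_name, round(prediction, 2))
```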
Data Mining involves an integration of techniques from multiple disciplines such as
database technology, statistics, machine learning, high performance computing, pattern
recognition, neural networks, data visualization, Information Retrieval, image and signal
processing, and spatial data analysis. By performing Data Mining, interesting knowledge,
regularities, or high-level information can be extracted from databases and viewed or
browsed from different angles. The discovered knowledge can be applied to decision
making, process control, information management, query processing, and so on.
Therefore, Data Mining is considered one of the most important frontiers in database
systems and one of the most promising new database applications in the information
industry [49], [67].
In Section 1.1.1 below, we discuss the different Data Mining techniques based on
the kinds of databases to be mined and the kinds of knowledge to be mined. After this, we
briefly explain the general architecture of Data Mining.
1.1.1 Nature of Data
In this section, we discuss different data stores [32], [67] on which mining can be
performed. In principle, Data Mining should be applicable to any kind of information
repository. This includes relational databases, data warehouses, transactional databases,
object-oriented and object-relational databases, spatial databases, time-series databases,
text databases, multimedia databases and the World-Wide Web. The challenges and
techniques of mining may differ for each of the repository systems. A brief introduction
to each of the major data repository systems listed above is given below:
a) Relational databases
A relational database is a collection of tables. Each table in a relational database has a
unique name, consists of a set of attributes (columns or fields), and usually stores a large
number of tuples (records or rows). An object in a relational table is represented by a tuple,
is identified by a unique key (primary key), and is described by a set of attribute
values. Data Mining applied to relational databases allows searching for trends or data
patterns. For example, Data Mining systems may detect deviations, such as items whose
sales are far from those expected in comparison with the previous year. It may also
analyze customer data to predict the credit risk of new customers based on their income,
age, and previous credit information.
Relational databases are among the most widely available and richest information
repositories for Data Mining.
b) Data warehouses
A data warehouse is a repository of information collected from multiple sources, stored
under a unified schema, and which usually resides at a single site. Data warehouses are
constructed via a process of data cleansing, data transformation, data integration, data
loading, and periodic data refreshing. Although data warehouse tools help support data
analysis, additional tools for Data Mining are required to allow more in-depth and
automated analysis.
c) Transactional databases
In general, a transactional database consists of a file where each record represents a
transaction. A transaction typically includes a unique transaction identity number (trans
ID), and a list of the items making up the transaction (such as items purchased in a store).
The transactional database may have additional tables associated with it, which contain
other information regarding the transaction. For example, a sales transactional database
records the transaction date, the customer ID number, the
salesperson ID number, the name of the branch at which the sale occurred, and so on. A
regular data retrieval system cannot answer queries like "Which items sold well together?"
However, Data Mining systems for transactional data can identify relationships across
transactions, such as the sets of items frequently sold together. For
example, based on the sales trend that microwave-proof containers are commonly
purchased together with microwaves, an offer on an expensive set of microwave-proof
containers can be made to customers buying selected models of microwave, in the hope of
selling more of the expensive containers.
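The query "Which items sold well together?" can be approximated by counting how often each pair of items appears in the same transaction. The following is a minimal sketch under that assumption; the transaction IDs, items, and the `min_count` threshold are hypothetical, and practical systems use more scalable frequent-itemset algorithms.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactional data: transaction ID -> items purchased.
transactions = {
    "T100": ["microwave", "microwave-proof container", "bread"],
    "T200": ["microwave", "microwave-proof container"],
    "T300": ["bread", "milk"],
    "T400": ["microwave", "milk"],
}


def frequent_pairs(transactions, min_count):
    """Return item pairs that occur together in at least min_count transactions."""
    counts = Counter()
    for items in transactions.values():
        # Sort so that each pair is always counted under the same key.
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}


pairs = frequent_pairs(transactions, min_count=2)
print(pairs)
```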
d) Object-Oriented Databases
Object-oriented databases are based on the object-oriented programming paradigm,
where each entity is an object. Data and code relating to an object are encapsulated into a
single unit. Each object has associated with it the following:
A set of variables that describe the object.
A set of messages that the object can use to communicate with other objects or the rest
of the database system.
A set of methods, where each method holds the code to implement a message.
Objects that share a common set of properties can be grouped into an object class. Each
object is an instance of its class. Object classes can be organized into class/subclass
hierarchies. Such a class inheritance feature benefits information sharing.
e) Object-Relational Databases
The object-relational model extends the basic relational data model by adding the power to
handle complex data types, class hierarchies, and object inheritance.
Data Mining techniques provide the methods to handle complex object structures, complex
data types, class and subclass hierarchies, property inheritance, and methods and
procedures.
f) Spatial Databases
Spatial databases contain spatial-related information. Such databases include geographic
(map) databases, VLSI chip design databases, and medical and satellite image databases.
Spatial databases may be represented in raster format (n-dimensional bit maps or pixel
maps) or vector format (roads, bridges, buildings, lakes).
Spatial data cubes may be constructed to organize data into multidimensional structures
and hierarchies, on which OLAP operations (such as drill-down and roll-up) can be
performed. Spatial Data Mining includes spatial data description, classification,
association, clustering, and spatial trend and outlier analysis.
g) Temporal databases and Time-Series Databases
Temporal databases and time-series databases both store time-related data. A temporal
database stores relational data having time-related attributes which may involve several
timestamps, each having different semantics. A time-series database stores sequences of
values that change with time, such as data collected regarding the stock exchange.
Data mining techniques can be used to find the characteristics of object evolution or the
trend of changes for objects in the database. Such information can be useful in decision
making and strategy planning.
h) Text Databases
Text databases are databases that contain word descriptions (sentences, paragraphs) for
objects (summary reports, error messages, documents). Text databases may be highly
unstructured (some Web pages on the World Wide Web), semi-structured (e-mail messages,
HTML/XML Web pages), or relatively structured (library databases). Data Mining on text
databases may uncover general descriptions of object classes, as well as keyword or
content associations, and the clustering behavior of text objects. To handle this, standard
Data Mining methods need to be integrated with Information Retrieval techniques and the
construction or use of hierarchies specifically for text data (such as dictionaries and
thesauruses), as well as discipline-oriented term classification systems (such as in
chemistry, medicine, law or economics).
i) Multimedia Databases
Multimedia databases store image, audio, and video data, sequence data, and hypertext data
containing text, text markup, and linkages. For multimedia database mining, storage and
search techniques need to be integrated with standard Data Mining methods to handle
issues such as content-based retrieval and similarity search, generalization and
multidimensional analysis, classification and prediction analysis, and mining associations
in multimedia data.
j) World Wide Web
The World Wide Web serves as a huge, widely distributed, global information service
center for news, advertisements, financial management, education, and many other
information services. The Web contains a rich and dynamic collection of hyperlink information
and Web page access and usage information, providing rich sources for Data Mining. Web
mining involves mining Web linkage structures, Web contents, and Web access patterns to
identify authoritative pages, classify Web documents automatically, and build a
multilayered Web information base and Weblog.
1.1.2 Data Mining Techniques
There have been many advances in the research and development of Data Mining, and
many Data Mining techniques have been developed. Some of these techniques are briefly
discussed below:
a) Decision trees
A decision tree is a simple inductive learning structure. Given an instance of an object or
situation, which is specified by a set of properties, the tree returns a "yes" or "no" decision
about that instance. Decision tree learning is a common method used in Data Mining. Each
interior node corresponds to a variable; an arc to a child represents a possible value of that
variable. A leaf represents a possible value of the target variable given the values of the
variables represented by the path from the root. A tree can be "learned" by splitting the
source set into subsets based on an attribute value test. This process is repeated on each
derived subset in a recursive manner. In Data Mining, trees can also be described as the
combination of mathematical and computing techniques to aid the description,
categorization and generalization of a given set of data [40].
Decision trees use real Data Mining algorithms. By classifying the data, they help users
understand it in a descriptive form. A decision tree process
generates the rules followed in a process. For example, a lender at a bank goes through
a set of rules when approving a loan. Based on the loan data a bank has, the outcomes of
the loans (default or paid), and limits of acceptable levels of default, the decision tree can
set up the guidelines for the lending institution. These decision trees are very similar to the
first decision support (or expert) systems.
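The recursive splitting described above can be sketched as follows. This is a simplified illustration rather than a standard algorithm such as ID3 or C4.5: it assumes categorical attributes and chooses, at each node, the attribute whose split misclassifies the fewest records when every subset predicts its majority outcome. The loan records and attribute names are hypothetical.

```python
from collections import Counter


def majority(rows, target):
    """Most common value of the target variable among the rows."""
    return Counter(row[target] for row in rows).most_common(1)[0][0]


def split_errors(rows, attribute, target):
    """Records misclassified if each subset of the split predicts its majority value."""
    errors = 0
    for value in {row[attribute] for row in rows}:
        subset = [row for row in rows if row[attribute] == value]
        label = majority(subset, target)
        errors += sum(1 for row in subset if row[target] != label)
    return errors


def learn_tree(rows, attributes, target):
    """Recursively split the source set on attribute value tests."""
    if len({row[target] for row in rows}) == 1 or not attributes:
        return majority(rows, target)  # leaf: a value of the target variable
    best = min(attributes, key=lambda a: split_errors(rows, a, target))
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in {row[best] for row in rows}:
        subset = [row for row in rows if row[best] == value]
        branches[value] = learn_tree(subset, remaining, target)
    return (best, branches)  # interior node: attribute plus one arc per value


def classify(tree, instance):
    """Follow arcs from the root to a leaf (raises KeyError on an unseen value)."""
    while isinstance(tree, tuple):
        attribute, branches = tree
        tree = branches[instance[attribute]]
    return tree


# Hypothetical loan data with outcomes "paid" or "default".
loans = [
    {"income": "high", "history": "good", "outcome": "paid"},
    {"income": "high", "history": "bad", "outcome": "paid"},
    {"income": "low", "history": "good", "outcome": "paid"},
    {"income": "low", "history": "bad", "outcome": "default"},
]
tree = learn_tree(loans, ["income", "history"], "outcome")
print(tree)
print(classify(tree, {"income": "low", "history": "bad"}))
```

Each path from the root to a leaf reads as one rule, for example "if income is low and history is bad then default", which is how a tree yields guidelines of the kind a lender might follow.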
b) Memory-Based Reasoning (MBR)
MBR [109] uses known instances of a model to predict unknown instances. It maintains a
record of the characteristics of known records in a training dataset. When a new record arrives
for evaluation, the algorithm uses these characteristics to find neighbors
similar to the new record and uses them for prediction and classification. The MBR technique has
two key components: the distance function and the combination function. The distance
function is used to calculate the distance between the new record and the records in the
training dataset. The results obtained determine the neighbors (data records) of the new
incoming data record in the training dataset. In the next step, the combination function of the
algorithm combines the results of the various distance functions to determine the final
answer.
Hence, in solving a Data Mining problem using MBR, three critical issues are considered:
Selecting the most suitable historical records to form the training or base dataset
Establishing the best way to compose the historical record
Determining the two essential functions, namely, the distance function and the
combination function
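The roles of the two functions can be made concrete with a small sketch. It assumes Euclidean distance as the distance function and a majority vote over the k nearest records as the combination function; other choices are equally possible. The training records (age, income in thousands) and labels are hypothetical.

```python
import math
from collections import Counter


def distance(a, b):
    """Distance function: Euclidean distance between two numeric records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def combine(neighbors):
    """Combination function: majority vote over the neighbors' labels."""
    return Counter(label for _, label in neighbors).most_common(1)[0][0]


def mbr_predict(training, new_record, k=3):
    """Find the k nearest training records and combine their labels."""
    ranked = sorted(training, key=lambda rec: distance(rec[0], new_record))
    return combine(ranked[:k])


# Hypothetical training dataset: ((age, income), outcome).
training = [
    ((25, 30), "default"),
    ((30, 35), "default"),
    ((45, 80), "paid"),
    ((50, 90), "paid"),
    ((55, 85), "paid"),
]
print(mbr_predict(training, (48, 82)))
print(mbr_predict(training, (27, 32)))
```

In practice the attributes would normally be scaled before distances are computed, so that no single attribute dominates.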
c) Genetic Algorithms
Genetic algorithms [109] apply the "survival of the fittest" principle to Data Mining. They
use an iterative process of selection, cross-over, and mutation operators to evolve
successive generations of models. At each iteration, every model competes with the others,
inheriting traits from previous ones, until only the most predictive model survives.
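The selection, cross-over, and mutation loop can be sketched on a toy problem. In this sketch each "model" is simply a list of bits and the fitness is the number of ones, which stands in for predictive accuracy; the population size, mutation rate, and use of elitism are assumptions made for illustration.

```python
import random


def fitness(model):
    """Toy fitness: the number of ones (a stand-in for predictive accuracy)."""
    return sum(model)


def evolve(length=20, population_size=30, generations=40, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(population_size)]
    history = []  # best fitness observed at the start of each generation
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        parents = population[:population_size // 2]  # selection: fitter half survives
        children = [population[0][:]]                # elitism: carry over the best model
        while len(children) < population_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randint(1, length - 1)         # cross-over at a random cut point
            child = mother[:cut] + father[cut:]
            if rng.random() < 0.2:                   # mutation: flip one random bit
                i = rng.randrange(length)
                child[i] = 1 - child[i]
            children.append(child)
        population = children
    best = max(population, key=fitness)
    return best, history


best, history = evolve()
print(fitness(best), history[0], history[-1])
```

Because the best model is copied unchanged into each new generation, the best fitness never decreases from one generation to the next.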
d) Neural Networks
Neural Networks are analytic techniques modeled after the (hypothesized) processes of
learning in the cognitive system and the neurological functions of the brain. They are capable of
predicting new observations (on specific variables) from other observations (on the same
or other variables) after executing a process of so-called learning from existing data.
These algorithms are effective when the data is shapeless and lacks any apparent pattern.
The basic unit of an artificial neural network is the node, modeled after the neurons in the
brain. The other structure is the link that corresponds to the connection between neurons in
the brain. Neural networks [109] mimic the human brain by learning from a training
dataset and applying the learning to generalize patterns for classification and prediction.
Neural networks have high tolerance to noisy data and have the ability to classify patterns
on which they have not been trained. In addition, several algorithms have recently been
developed for the extraction of rules from trained neural networks. These factors
contribute towards the usefulness of neural networks for classification in Data Mining
[67].
e) Link Analysis
The Link Analysis technique mines relationships and discovers knowledge. This type of
capability can be used in a variety of advanced artificial intelligence applications like
logistics planning, event probability prediction, and intelligence gathering. For example, in
a sale transaction at a supermarket, many items bought together in one trip are all linked
together. Some technologies included in this category can also perform machine learning
and reasoning functions.
Depending upon the types of knowledge discovery, link analysis techniques have three
types of applications [109]:
Associations Discovery. Associations are affinities between items. Association
discovery algorithms find combinations where the presence of one item suggests the
presence of another.
Sequential Pattern Discovery. These algorithms discover patterns where one set of
items follows another specific set. Time plays a role in these patterns. When records
are selected for analysis, including date and time information as data items enables the
discovery of sequential patterns.
Similar Time Sequence Discovery. This technique depends on the availability of time
sequences. In the previous technique, the results indicate sequential events over time.
This technique, however, finds a sequence of events and then comes up with other similar
sequences of events.
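Sequential pattern discovery differs from association discovery in that order matters. A minimal sketch, assuming records of the form (customer, day, item) and counting, for each ordered pair of items, the number of customers who bought the first item before the second, is given below. The purchase records are hypothetical.

```python
from collections import Counter

# Hypothetical records: (customer ID, day of purchase, item).
purchases = [
    ("c1", 1, "printer"),
    ("c1", 5, "ink"),
    ("c2", 2, "printer"),
    ("c2", 3, "paper"),
    ("c2", 9, "ink"),
    ("c3", 4, "ink"),
    ("c3", 6, "printer"),
]


def sequential_pairs(records):
    """Count customers for whom one item is followed later by another item."""
    by_customer = {}
    # Order each customer's purchases by time before looking for sequences.
    for customer, day, item in sorted(records, key=lambda r: (r[0], r[1])):
        by_customer.setdefault(customer, []).append(item)
    counts = Counter()
    for items in by_customer.values():
        seen = set()  # count each ordered pair at most once per customer
        for i, first in enumerate(items):
            for later in items[i + 1:]:
                if first != later:
                    seen.add((first, later))
        counts.update(seen)
    return counts


patterns = sequential_pairs(purchases)
print(patterns.most_common(1))
```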
f) Clustering and Nearest Neighbor
Clustering [91] seeks to identify a finite set of abstract categories that describe the data by
determining natural affinities in the data set based upon a pre-defined distance or similarity
measure. Clustering can employ categories of different types (e.g., a flat partition, a
hierarchy of increasingly fine-grained partitions, or a set of possibly overlapping clusters).
Clustering can proceed by agglomeration, where instances are initially merged to form
small clusters and small clusters are merged to form larger ones; or by successive division
of larger clusters into smaller ones. Some clustering algorithms produce explicit cluster
descriptions; others produce only implicit descriptions. Different methods [8], [12], [21],
[52] are available to generate the clusters containing similar documents.
The Nearest Neighbor algorithm supports clustering and classification by matching cases
internally to each other or to an exemplar specified by a domain expert. A simple example
of a nearest neighbor method would be as follows: given a set X = {x1, x2, x3, …, xn} of
vectors composed of n features with binary values, for each pair (xi, xj), create a
vector vi of length n by comparing the values of each corresponding feature ni of the pair
(xi, xj), entering a 1 for each feature ni with matching values and 0 otherwise. Then
sum the vi values to compute the degree of match. Those pairs (xi, xj) with the largest result
are the nearest neighbors. For complex nearest neighbor methods, features can be
weighted to reflect degree of importance. Domain expertise is needed to select salient
features, compute weights for those features, and select a distance or similarity measure.
Nearest neighbor approaches have been used for text classification.
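The simple matching procedure described above translates almost directly into code. The sketch below follows that description for unweighted binary features; the three sample vectors are hypothetical.

```python
from itertools import combinations


def match_vector(xi, xj):
    """Enter 1 where the corresponding features have matching values, else 0."""
    return [1 if a == b else 0 for a, b in zip(xi, xj)]


def nearest_pairs(X):
    """Return the index pairs with the highest degree of match, plus all scores."""
    scores = {}
    for (i, xi), (j, xj) in combinations(enumerate(X), 2):
        scores[(i, j)] = sum(match_vector(xi, xj))  # degree of match
    best = max(scores.values())
    return [pair for pair, s in scores.items() if s == best], scores


# Hypothetical binary feature vectors.
X = [
    [1, 0, 1, 1],
    [1, 0, 0, 1],
    [0, 1, 0, 0],
]
nearest, scores = nearest_pairs(X)
print(nearest, scores)
```

A weighted variant would multiply each entry of the match vector by a feature weight before summing.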
g) Rule Induction
Rule induction [62] is an important technique of machine learning used for Data Mining. It
is the technique which expresses the regularities hidden in data in terms of rules. Usually
rules are expressions of the form
if (attribute-1, value-1) && (attribute-2, value-2) && ... && (attribute-n, value-n)
then (decision, value).
Some rule induction systems induce more complex rules, in which values of attributes may
be expressed by negation of some values or by a value subset of the attribute domain. Data
from which rules are derived are usually presented in the form of a table in which cases
(or examples) are labels (or names) for rows and variables are labeled as attributes and a
decision.
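To make the rule form concrete, the sketch below represents a rule as a list of (attribute, value) conditions paired with a (decision, value) conclusion, and applies it to the cases of a small table. It shows only how an induced rule is evaluated, not how rules are induced; the attribute names, values, and cases are hypothetical.

```python
def rule_matches(conditions, case):
    """True if every (attribute, value) condition holds for the case."""
    return all(case.get(attribute) == value for attribute, value in conditions)


def apply_rule(rule, case):
    """Return the rule's (decision, value) if the case matches, otherwise None."""
    conditions, decision = rule
    return decision if rule_matches(conditions, case) else None


# if (income, high) && (history, good) then (loan, approve)
rule = ([("income", "high"), ("history", "good")], ("loan", "approve"))

# Hypothetical table: case names label the rows, attributes label the columns.
table = {
    "case1": {"income": "high", "history": "good"},
    "case2": {"income": "low", "history": "good"},
}
decisions = {name: apply_rule(rule, case) for name, case in table.items()}
print(decisions)
```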
1.1.3 Data Mining Applications
A wide variety of applications benefit from Data Mining. The technology
encompasses a rich collection of proven techniques that cover a wide range of applications
in both the commercial and noncommercial realms. In some cases, multiple techniques are
used, back to back, to greater advantage.
Listed below are a few major applications of Data Mining [67], [109]:
a) Applications in the business area:
Data Mining technology has widespread applications in the commercial arena. Most of the
tools target the commercial sector. A few examples of Data Mining in the business area
are outlined as follows:
Customer Segmentation - Businesses use Data Mining to understand their
customers. Cluster detection algorithms discover clusters of customers sharing the
same characteristics.
Market Basket Analysis - Link analysis algorithms uncover affinities between
products that are bought together. Other businesses such as upscale auction houses
use these algorithms to find customers to whom they can sell higher-value items.
Risk Management - Insurance companies and mortgage businesses use Data
Mining to uncover risks associated with potential customers.
Fraud Detection - Credit card companies use Data Mining to discover abnormal
spending patterns of customers. Such patterns can expose fraudulent use of the
cards.
Delinquency Tracking - Loan companies use the technology to track customers
who are likely to default on repayments.
Demand Prediction - Retail and other businesses use Data Mining to match
demand and supply trends to forecast demand for specific products.
b) Applications in the Telecommunications Industry
The telecommunication industry has quickly evolved from offering local and long distance
telephone services to providing many other comprehensive communication services,
including fax, pager, cellular phone, Internet messenger, images, e-mail, computer and
Web data transmission, and other data traffic. With the deregulation of the
telecommunication industry in many countries and the development of new
communication technologies, the telecommunication market is rapidly expanding and
highly competitive. This creates a great demand for Data Mining in order to help
understand the business involved, identify telecommunication patterns, catch fraudulent
activities, make better use of resources, and improve the quality of service.
A few examples of Data Mining in the telecommunication industry are outlined as follows:
Multidimensional analysis of telecommunication data – Telecommunication data is
intrinsically multidimensional with dimensions such as calling-time, duration,
location of caller, and type of call. The multidimensional analysis of such data can
be used to identify and compare the data traffic, system work load, resource usage,
user group behavior, profit, and so on. OLAP and visualization tools are used to
consolidate telecommunication data.
Fraudulent pattern analysis and the identification of unusual patterns – It is
important to identify potentially fraudulent users and their
atypical usage patterns, to detect attempts to gain fraudulent entry to customer
accounts, and to discover unusual patterns such as switch and route congestion
patterns and periodic calls from automatic
dial-out equipment that has been improperly programmed. Many of these types
of patterns can be discovered by multidimensional analysis, cluster analysis, and
outlier analysis.
Use of visualization tools in telecommunication data analysis – Tools for OLAP
visualization, linkage visualization, association visualization, and clustering have been
useful for telecommunication data analysis.
c) Applications in Banking and Finance
The banking and finance industry is fertile ground for Data Mining. Banks and financial
institutions generate large volumes of detailed transaction data. Fraud detection, risk
assessment of potential customers, trend analysis, and direct marketing are the primary
Data Mining applications at banks.
In the financial area, requirements for forecasting dominate. Forecasting of stock prices
and commodity prices with a high level of approximation can mean large profits. Neural
network algorithms are used in forecasting, options and bond trading, portfolio
management, and in mergers and acquisitions.
A few examples of Data Mining in banking and finance are outlined as follows:
Loan payment prediction and customer credit policy analysis - Loan payment
prediction and customer credit analysis are critical to the business of a bank. Many
factors can strongly or weakly influence loan payment performance and customer
credit rating. Data Mining methods, such as attribute selection and attribute
relevance ranking, may help identify important factors and eliminate irrelevant
ones.
Classification and clustering of customers for targeted marketing - Classification
and clustering methods can be used for customer group identification and targeted
marketing. Customers with similar behaviors regarding loan payments may be
identified by multidimensional clustering techniques. These can help identify
customer groups, associate a new customer with an appropriate customer group,
and facilitate targeted marketing.
Detection of money laundering and other financial crimes - To detect money
laundering and other financial crimes, it is important to integrate information from
multiple databases (like bank transaction databases, and federal or state crime
history databases), as long as they are potentially related to the study. Multiple data
analysis tools can then be used to detect unusual patterns, such as large amounts of
cash flow at certain periods, by certain groups of customers. Useful tools include
data visualization tools (to display transaction activities using graphs by time and
by groups of customers), linkage analysis tools (to identify links among different
customers and activities), classification tools (to filter unrelated attributes and rank
the highly related ones), clustering tools (to group different cases), outlier analysis
tools (to detect unusual amounts of fund transfers or other activities), and
sequential pattern analysis tools (to characterize unusual access sequences). These
tools may identify important relationships and patterns of activities and help
investigators focus on suspicious cases for further detailed examination.
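The outlier analysis step mentioned above can be illustrated with a minimal sketch. It assumes a robust rule based on the median absolute deviation (one common choice, not a method prescribed by the text); the function name, the threshold of 3.5 and the sample amounts are hypothetical.

```python
from statistics import median

def flag_unusual_transfers(amounts, threshold=3.5):
    """Return the indices of amounts whose modified z-score exceeds the threshold."""
    med = median(amounts)
    # Median absolute deviation: a spread estimate that is itself robust to outliers.
    mad = median(abs(a - med) for a in amounts)
    if mad == 0:
        return []  # more than half the values equal the median; the score is undefined
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score.
    return [i for i, a in enumerate(amounts)
            if 0.6745 * abs(a - med) / mad > threshold]

# Hypothetical transfer amounts for one group of customers.
transfers = [120.0, 95.5, 130.0, 110.0, 101.0, 9800.0, 115.0, 99.0]
suspicious = flag_unusual_transfers(transfers)
```

A flagged index only marks a case for closer examination by an investigator; it is not evidence of a crime by itself.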
d) Applications in Biomedical and DNA data Analysis
The past decade has seen an explosive growth in genomics, proteomics, functional
genomics, and biomedical research. Examples range from the identification and
comparative analysis of the genomes of human and other species (by discovering
sequencing patterns, gene functions, and evolution paths) to the investigation of genetic
networks and protein pathways, and the development of new pharmaceuticals and
advances in cancer therapies. Biological Data Mining has become an essential part of a
new research field called bioinformatics.
A few examples of Data Mining in biological data analysis are outlined as follows:
Semantic integration of heterogeneous, distributed genomic and proteomic
databases - Genomic and proteomic data sets are often generated at different labs
and by different methods. They are distributed, heterogeneous, and of a wide
variety. The semantic integration of such data is essential to the cross-site analysis
of biological data. Also, it is important to find correct linkages between research
literature and their associated biological entities. Such integration and linkage
analysis would facilitate the systematic and coordinated analysis of genome and
biological data. Data cleaning, data integration, reference reconciliation,
classification, and clustering methods will facilitate the integration of biological
data and the construction of data warehouses for biological data analysis.
Visualization tools in genetic data analysis - Visualization and visual Data Mining
play an important role in biological data analysis. Alignments among genomic or
proteomic sequences and the interactions among complex biological structures are
most effectively presented in graphic forms, transformed into various kinds of
easy-to-understand visual displays. Such visually appealing structures and patterns
facilitate pattern understanding, knowledge discovery, and interactive data
exploration.
Association and path analysis - Most diseases are not triggered by a single gene but
by a combination of genes acting together. Recently, many studies have focused on
the comparison of one gene to another. Association analysis methods can be used
to help determine the kinds of genes that are likely to co-occur in target samples.
Such analysis would facilitate the discovery of groups of genes and the study of
interactions and relationships between them. While a group of genes may
contribute to a disease process, different genes may become active at different
stages of the disease. If the sequence of genetic activities across the different stages
of disease development can be identified, it may be possible to develop
pharmaceutical interventions that target the different stages separately, therefore
achieving more effective treatment of the disease. Such path analysis is expected to
play an important role in genetic studies.
e) Applications in the Retail Industry
The retail industry is a major application area for Data Mining, since it collects huge
amounts of data on sales, customer shopping history, goods transportation, consumption,
and service. The quantity of data collected continues to expand rapidly, especially due to
the increasing ease, availability, and popularity of business conducted on the Web, or e-
commerce. Retail Data Mining can help identify customer buying behaviors, discover
customer shopping patterns and trends, improve the quality of customer service, achieve
better customer retention and satisfaction, enhance goods consumption ratios, design more
effective goods transportation and distribution policies, and reduce the cost of business.
A few examples of Data Mining in the retail industry are outlined as follows:
Customer Segmentation - Direct marketing involves targeting campaigns and
promotions to specific customer segments. Cluster detection and other predictive
Data Mining algorithms provide customer segmentation. Customer segmentation
tools discover clusters and predict success rates for direct marketing campaigns. At
the backend, Data Mining tools for customer segmentation can be integrated with
the data warehouse for data selection and extraction. At the front end, these tools
work well with standard presentation software.
Market basket analysis - Retail industry promotions necessarily require knowledge
of which products to promote and in what combinations. Retailers use link analysis
algorithms to find affinities among products that usually sell together. Based on the
affinity grouping, retailers can plan their special sale items and also the arrangement
of products on the shelves.
Inventory Management - Inventory for a retailer encompasses thousands of
products. Inventory turnover and management are significant concerns for these
businesses. Retailers use Data Mining for inventory management.
Sales forecasting – Retail sales are subject to strong seasonal fluctuations. Holidays
and weekends also make a difference. Therefore, sales forecasting is critical for the
industry. The retailers turn to the predictive algorithms of Data Mining technology
for sales forecasting.
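As an illustration of the market basket analysis described above, the following sketch counts how often pairs of products are bought together and derives support and confidence for each pair. The basket contents, the function name and the minimum support of 0.5 are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

def pair_rules(baskets, min_support=0.5):
    """Return (antecedent, consequent, support, confidence) for frequent item pairs."""
    n = len(baskets)
    item_counts = Counter(item for b in baskets for item in set(b))
    pair_counts = Counter(
        pair for b in baskets for pair in combinations(sorted(set(b)), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # fraction of baskets containing both items
        if support < min_support:
            continue
        # Confidence of "x -> y": share of baskets with x that also contain y.
        rules.append((a, b, support, count / item_counts[a]))
        rules.append((b, a, support, count / item_counts[b]))
    return rules

# Hypothetical sales transactions.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "cereal"],
    ["bread", "butter", "cereal"],
    ["milk", "bread"],
]
rules = pair_rules(baskets)
```

In this data only the bread and butter pair passes the support threshold, which is the kind of affinity a retailer could use for shelf arrangement or joint promotions.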
1.1.4 Data Mining – Interdisciplinary domain
Data Mining involves an integration of techniques from multiple disciplines such as
database technology, statistics, machine learning, data visualization, Information Retrieval,
image and signal processing, and spatial data analysis. Emphasis is on efficient and
scalable Data Mining techniques for large databases. By performing Data Mining,
interesting knowledge, regularities, or high-level information can be extracted from
databases and viewed or browsed from different angles. The discovered knowledge can be
applied to decision making, process control, information management, query processing,
and so on. Therefore, Data Mining is considered one of the most important frontiers in
database systems and one of the most promising interdisciplinary developments in the
information industry. A few of the disciplines that can be integrated with Data
Mining are briefly discussed below:
a) Statistics
Data Mining in Statistics deals with finding useful patterns in data sets. This includes
hypothesis testing and parameter estimation. Hypothesis testing is a part of inferential
statistics, which starts with an initial premise (called the Null Hypothesis) and then tests
the collected data against this premise. If the data are sufficiently inconsistent with the
premise at a chosen significance level, the Null Hypothesis is rejected; otherwise it is not
rejected, which does not by itself prove it true. Parameter
Estimation deals with finding parameters, such as means and standard deviations, that would
describe the distribution of a given sample of data points.
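Parameter estimation and hypothesis testing can be sketched together in a few lines. The example below assumes a two-sided large-sample z-test, which is only one of many possible tests; the sample values, the null mean of 10.0 and the significance level of 0.05 are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def z_test(sample, null_mean):
    """Two-sided large-sample z-test of H0: population mean == null_mean."""
    n = len(sample)
    estimate = mean(sample)   # parameter estimation: sample mean
    spread = stdev(sample)    # parameter estimation: sample standard deviation
    z = (estimate - null_mean) / (spread / sqrt(n))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return estimate, spread, z, p_value

# Hypothetical sample of 30 measurements (10 values repeated 3 times for brevity).
sample = [10.2, 9.8, 10.5, 10.1, 9.9, 10.4, 10.3, 10.0, 10.6, 10.2] * 3
estimate, spread, z, p_value = z_test(sample, null_mean=10.0)
alpha = 0.05
decision = "reject H0" if p_value < alpha else "fail to reject H0"
```

The decision is phrased as rejecting or failing to reject the Null Hypothesis, because a test of this kind cannot establish that the Null Hypothesis is true.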
Finding optimal strategies for Data Collection is another issue with Statistics. Methods
need to be developed, which would efficiently search large databases to find representative
sample data points. Different Data Mining techniques have to be used for evolving data
than for static data.
Model Estimation of the data samples is also an important aspect of Statistics. Samples can
have different model distributions, leading to development of different algorithms for
them. Applicability of algorithms, hence, becomes a major issue in this case.
b) Relational Databases
An RDBMS (Relational Database Management System) stores data in tables and can
quickly retrieve the requested data through a query language such as SQL. Query
optimization plays an important role in an RDBMS and deals with finding the best possible
method for processing the given queries, in terms of the time taken for the processing task and the
reliability of the query responses.
The database component of Data Mining provides fast and reliable access to data.
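A small sketch can make the role of the RDBMS concrete. It uses SQLite through Python's standard sqlite3 module purely as a stand-in for whatever database a real system would use; the table, column and index names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO transactions (customer_id, amount) VALUES (?, ?)",
    [(1, 50.0), (2, 75.5), (1, 20.0), (3, 310.0), (2, 12.25)],
)
# An index gives the query optimizer a faster access path than a full table scan.
conn.execute("CREATE INDEX idx_customer ON transactions (customer_id)")

query = "SELECT SUM(amount) FROM transactions WHERE customer_id = ?"
total = conn.execute(query, (1,)).fetchone()[0]

# Ask the optimizer how it intends to execute the query; the wording of the
# plan differs between SQLite versions, so it is only collected here.
plan = conn.execute("EXPLAIN QUERY PLAN " + query, (1,)).fetchall()
plan_text = " ".join(str(row[-1]) for row in plan)
```

Printing `plan_text` typically shows whether the optimizer chose the index, which is the kind of decision query optimization is concerned with.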
c) Artificial Intelligence
The goal of Artificial Intelligence is to perceive information from the environment
using intelligent agents and to automate the task of finding a set of actions via logical
reasoning to achieve a predetermined goal.
Search techniques are employed to map a set of perceptions into a single action or a set of actions.
The search techniques can be divided into uninformed search (for example, uniform-cost
search) and informed search.
Heuristics are used in informed search methods to find an optimal set of actions to achieve
the desired goals.
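The use of a heuristic in informed search can be illustrated with A* search, a standard informed search algorithm chosen here as an example; the text does not name a specific algorithm. The grid layout, function name and Manhattan-distance heuristic are assumptions of this sketch.

```python
import heapq

def a_star(grid, start, goal):
    """Return the length of a shortest 4-connected path on a grid, or None.

    grid is a list of strings; '#' marks a blocked cell.
    """
    def h(cell):
        # Manhattan distance never overestimates on a 4-connected unit-cost grid.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    frontier = [(h(start), 0, start)]  # entries are (f = g + h, g, cell)
    best_g = {start: 0}
    while frontier:
        _, g, cell = heapq.heappop(frontier)
        if cell == goal:
            return g
        if g > best_g.get(cell, float("inf")):
            continue  # stale queue entry
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < len(grid) and 0 <= nc < len(grid[0]) and grid[nr][nc] != "#":
                ng = g + 1
                if ng < best_g.get((nr, nc), float("inf")):
                    best_g[(nr, nc)] = ng
                    heapq.heappush(frontier, (ng + h((nr, nc)), ng, (nr, nc)))
    return None

grid = [
    "....",
    ".##.",
    "....",
]
steps = a_star(grid, (0, 0), (2, 3))
```

Setting the heuristic to zero for every cell would turn this into uniform-cost search, which explores without any guidance toward the goal.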
Knowledge representation methods are used to describe the relationships between
different objects in the environment. Examples of these are First Order Logic and
production rules. Some of these knowledge representation methods can also be used to
describe the semantics of the knowledge, e.g. frames and semantic networks.
Knowledge acquisition, maintenance and application are other branches of Artificial
Intelligence, which are closely related to Databases and also to Data Mining.
d) Machine Learning
Machine Learning focuses on complex representations and search methods for specialized
data-intensive problems. Different machine learning methods utilize the specific prior
knowledge associated with the collected data. Such methods are generally more dependent
on the domain of the data.
Machine learning is often used in the context of Data Mining, to denote the application of
generic model-fitting or classification algorithms for predictive Data Mining. Unlike
traditional statistical data analysis, which is usually concerned with the estimation of
population parameters by statistical inference, the emphasis in Data Mining (and machine
learning) is usually on the accuracy of prediction (predicted classification), regardless of
whether or not the "models" or techniques that are used to generate the prediction are
interpretable or open to simple explanation. Good examples of this type of technique often
applied to predictive Data Mining are neural networks or meta-learning techniques such as
boosting, etc. These methods usually involve the fitting of very complex "generic" models
that are not related to any reasoning or theoretical understanding of underlying causal
processes; instead, these techniques can be shown to generate accurate predictions or
classification in cross validation samples.
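The idea of judging a model by its predictive accuracy on cross validation samples can be sketched as follows. The classifier used here (a one-dimensional nearest-centroid rule) is deliberately trivial and is only an assumption of the example, as are the data and the function name.

```python
from statistics import mean

def k_fold_accuracy(points, labels, k=5):
    """Estimate the accuracy of a 1-D nearest-centroid classifier by k-fold cross validation."""
    n = len(points)
    correct = 0
    for fold in range(k):
        test_idx = set(range(fold, n, k))  # every k-th item forms one fold
        train = [(x, y) for i, (x, y) in enumerate(zip(points, labels))
                 if i not in test_idx]
        # "Fit": one centroid (mean) per class, using training data only.
        centroids = {c: mean(x for x, y in train if y == c)
                     for c in {y for _, y in train}}
        for i in test_idx:
            predicted = min(centroids, key=lambda c: abs(points[i] - centroids[c]))
            correct += predicted == labels[i]
    return correct / n

# Hypothetical, well-separated data.
points = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
labels = ["low"] * 5 + ["high"] * 5
accuracy = k_fold_accuracy(points, labels, k=5)
```

Each item is predicted by a model that never saw it during fitting, which is what makes the resulting accuracy a fair estimate of predictive performance.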
e) Visualization
Visualization is used to gain visual insights into the structure of the data. Users can
interactively explore the data, e.g. zoom in/out, rotate for images, or display some specific
detailed information for some attributes. Visualization is abundantly used as a pre- and
post-processing tool for KDD.
Various branches of Visualization are:
Displaying summarized properties of relevant data.
Exploring various relationships between variables (attributes).
Investigating large databases and conveying large amounts of information.
Analyzing data from geographic or spatial domains.
f) Information Retrieval
Information Retrieval (IR) is the area of study concerned with searching for documents,
for information within documents, and for metadata about documents, as well as that of
searching relational databases and the World Wide Web. In response to various challenges
of providing information access, the field of Information Retrieval evolved to give
principled approaches to searching various forms of content. The field began with
scientific publications and library records, but soon spread to other forms of content,
particularly those of information professionals, such as journalists, lawyers, and doctors.
Much of the scientific research on Information Retrieval has occurred in these contexts,
and much of the continued practice of Information Retrieval deals with providing access to
unstructured information in various corporate and governmental domains. Many
universities and public libraries use IR systems to provide access to books, journals and
other documents. Web search engines are the most visible IR applications.
g) Online Analytical Processing
OLAP allows users to browse data following logical questions about the data. OLAP
generally includes the ability to drill down into data, moving from highly summarized
views of data into more detailed views. This is generally achieved by moving along
hierarchies of data. For example, if one were analyzing populations, one could start with
the most populous continent, and then drill down to the most populous country, then to the
state level, then to the city level, then to the neighborhood level. OLAP also includes
browsing up hierarchies (drill up), across different dimensions of data (drill across), and
many other advanced techniques for browsing data, such as automatic time variation when
drilling up or down time hierarchies. OLAP is by far the most implemented and used
technique. It is also generally the most intuitive and easy to use.
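Drilling down and drilling up along a hierarchy can be sketched with plain aggregation. The fact rows below are illustrative figures, not real population statistics, and the function names and the three-level hierarchy are assumptions of this example.

```python
from collections import defaultdict

# Hypothetical fact rows: (continent, country, city, population in millions).
facts = [
    ("Asia", "India", "Mumbai", 12.5),
    ("Asia", "India", "Delhi", 11.0),
    ("Asia", "Japan", "Tokyo", 9.3),
    ("Europe", "France", "Paris", 2.1),
    ("Europe", "Germany", "Berlin", 3.6),
]
LEVELS = ("continent", "country", "city")

def roll_up(rows, level):
    """Aggregate population at the given hierarchy level (drill up)."""
    depth = LEVELS.index(level) + 1
    totals = defaultdict(float)
    for *path, population in rows:
        totals[tuple(path[:depth])] += population
    return dict(totals)

def drill_down(rows, prefix):
    """Show the next, more detailed level beneath one member of the hierarchy.

    prefix must be shorter than the full hierarchy, e.g. ("Asia",).
    """
    children = [r for r in rows if r[:len(prefix)] == tuple(prefix)]
    return roll_up(children, LEVELS[len(prefix)])

by_continent = roll_up(facts, "continent")
top = max(by_continent, key=by_continent.get)
by_country = drill_down(facts, top)
```

Starting from the most populous continent in the summarized view, `drill_down` exposes the country-level totals beneath it, mirroring the population example in the text.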
1.1.5 Architecture of Data Mining
Broadly, Data Mining architecture [95] has three layers: a Database Layer, with sub-layers to
prepare and store data and metadata; a Data Mining Application Layer, which uses the
algorithms to process the data and store the results in the database; and a Front-End Layer, to
facilitate the parameter settings for the Data Mining Application and visualization of the
results in interpretable form.
a) Database Layer
The Database layer can be hosted on an RDBMS, on a file system, or on a mixture of the
two; e.g., data from source systems may be initially staged on a file system and then
loaded onto an RDBMS. The Database layer may consist of various sub-layers. The data in
these sub-layers interfaces with multiple systems based on the activities in which it
participates. Figure 1.2 represents the various sub-layers in the Database layer.
[Figure 1.1: Generic 3-layer architecture for Data Mining. Components: Database, Data
Mining Application, Front End. Flow labels: Parameters, Data Mining Queries; Data Mining
Results, Metadata; Database Queries; Metadata, Data.]
[Figure 1.2: Metadata Layer. Components: Metadata, Staging Area, Prepared Input Data,
Data Mining Results. Flow labels: Metadata and Data Extracted from Source Systems;
Transformation, Cleansing and Consolidation; Data and Metadata for Data Mining; Output
from Data Mining Application.]
i. Metadata layer is the most commonly and frequently used layer. It forms the
backbone for the data in the entire Data Mining Architecture and holds information about
data sources, transformation algorithms, cleansing rules and the Data Mining
Results.
ii. Data Layer comprises the Staging Area, Prepared / Processed Data and Data Mining
Results. The Staging Area is used for temporarily holding the data sourced from
various source systems. It can be held in any form, e.g. flat files or tables in an RDBMS.
This data is transformed, cleansed, consolidated and loaded into a structured
schema during the Data Preparation process. This prepared data is used as Input Data
for Data Mining. The base data may undergo summarization or derivation based on
the business case before it is presented to the Data Mining Application.
iii. The Data Mining output can be captured in the Data Mining Results layer so that it
can be made available to the users for visualization and analysis.
b) Data Mining Application Layer
The Data Mining Application has two primary components, as shown in Figure 1.3:
[Figure 1.3: Data Mining Application Layer. Components: Data Manager, Data Mining
Tools/Algorithms. Flow labels: Input Data From Database; Data Mining Results.]
i. Data Manager Layer
It manages the data in the Database Layer and controls the data flow for Data Mining
purposes. It provides the following functionalities:
Manage Data Sets - The Data Manager layer helps to classify the input data into
multiple sets so that they can be utilized during various stages of the Data Mining
task such as for building the Data Mining Model, Final Testing and Deployment
tasks. It also classifies the results of the Data Mining task, which might be utilized
for further processing.
Input Data Flow - The Data Manager layer provides transformation routines to extract
the data from the database in the required specific format (like itemized data for
Associations) for the Data Mining task. It also controls the flow of data as per the
Data Mining task requirements, i.e. row by row or bulk load.
Output Data Flow - The Data Manager layer manages the results generated by the Data
Mining task and delivers them to target systems (Front End or other systems like
CRM) in the required data format and data flow specifications.
The Data Manager layer needs to be portable across the databases from which
data has to be extracted and the Data Mining tools in use.
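The input data flow handled by the Data Manager can be sketched as a pair of small routines: one that reshapes database rows into itemized transactions for association mining, and two that deliver records either row by row or in bulk. All function names and the sample rows are hypothetical; a real Data Manager would read from the Database Layer rather than from a list.

```python
from itertools import islice

def itemize(rows):
    """Group (transaction_id, item) rows into one item list per transaction."""
    transactions = {}
    for transaction_id, item in rows:
        transactions.setdefault(transaction_id, []).append(item)
    return transactions

def row_by_row(rows):
    """Deliver records one at a time."""
    yield from rows

def bulk(rows, batch_size):
    """Deliver records in lists of up to batch_size."""
    iterator = iter(rows)
    while batch := list(islice(iterator, batch_size)):
        yield batch

# Hypothetical rows as they might be extracted from a sales table.
rows = [(1, "bread"), (1, "milk"), (2, "bread"), (3, "milk"), (3, "cereal")]
itemized = itemize(rows)
single_records = list(row_by_row(rows))
batches = list(bulk(rows, batch_size=2))
```

Keeping the delivery mode separate from the transformation is one way to let the same extraction serve Data Mining tools with different loading requirements.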
ii. Data Mining Tools / Algorithms
This is the heart of the complete DM architecture. Numerous tools are available in the
market, such as SAS, SPSS, Teradata Miner and IBM Intelligent Miner, to facilitate the
application of algorithms on the input data. These Data Mining tools perform different
tasks using various techniques / algorithms, depending upon the business need, to analyze the
data and generate the results.
c) Front End Layer
Front End is the user interface layer. It provides the following primary functionalities:
i. Administration
Administration screens for the Data Mining tasks are usually provided as a part of the
products / tools. These are utilized to administer the following primary tasks:
Data flow processes (e.g. Extracts, Loads)
Data Mining routines
Error reporting and correction
User security settings
ii. Input Parameter Settings
During the Data Mining Model build, iterations are inevitable. These iterations are
needed to fine-tune the model by changing various parameters involved in the model.
For executing a Data Mining task, the user needs to provide respective input parameters,
then observe the effect on the results and change the parameters if needed based on the
interpretation and understanding of the results. This facility is provided in the Front End
Layer.
iii. Data Mining Results / Visualization
The results of a Data Mining task sometimes need formatting or conversion to a
user-understandable form before they can be reported to the user. The front end caters to the
predefined formats of the output files generated by the respective Data Mining technique
and gives the user the flexibility to view and analyze the results of Data Mining. A reporting
utility performs the task of displaying the reports, charts and smart reports (e.g. Clusters,
Trees, and Networks).
1.2 Text Mining – Characteristics and domains of applications
Data Mining is typically concerned with the detection of patterns in numeric data, but very
often important (e.g., critical to business) information is stored in the form of text. Unlike
numeric data, text is often amorphous, and difficult to deal with. Text Mining generally
consists of the analysis of (multiple) text documents by extracting key phrases, concepts,
etc. and the preparation of the text processed in that manner for further analyses with
numeric Data Mining techniques (e.g., to determine co-occurrences of concepts, key
phrases, names, addresses, product names, etc.).
1.2.1 Representation of text documents
In Text Mining study, a document is generally used as the basic unit of analysis. A
document is a sequence of words and punctuation, following the grammatical rules of the
language, containing any relevant segment of text and can be of any length. It can be a
paper, an essay, a book, a web page, an email, etc., depending on the type of analysis being
performed and depending upon the goals of the researcher. In some cases, a document may
contain only a chapter, a single paragraph, or even a single sentence. The fundamental unit
of text is a word. A term is usually a word, but it can also be a word-pair or phrase. In this
thesis, we will use term and word interchangeably. Words are composed of characters, and
are the basic units from which meaning is constructed. By combining words with
grammatical structure, a sentence is made. Sentences are the basic unit of action in text,
containing information about the action of some subject. Paragraphs are the fundamental
unit of composition and contain a related series of ideas or actions. As the length of text
increases, additional structural forms become relevant, often including sections, chapters,
entire documents, and finally, a corpus of documents. A corpus is a collection of
documents, and a lexicon is the set of all unique words in the corpus [120].
In Text Mining studies, a sentence is regarded simply as a set of words, or a "bag of
words", and the order of words can be changed without impacting the outcome of the
analysis. The syntactical structure of a sentence or paragraph is intentionally ignored in
order to efficiently handle the text. The bag-of-words concept is also referred to as
exchangeability in the generative language model [84].
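The bag-of-words view and the notion of a lexicon can be shown in a few lines. The tokenization rule (lower-casing and splitting on non-alphanumeric characters) is a simplifying assumption of this sketch, and the counts kept per word make the bag a multiset rather than a plain set.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lower-case the text, split it into word tokens, and count each token."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

# Two hypothetical sentences with the same words in a different order.
a = bag_of_words("The dog chased the cat.")
b = bag_of_words("The cat chased the dog.")
same_bag = a == b

# The lexicon is the set of all unique words in the corpus.
corpus = ["The dog chased the cat.", "A cat sat."]
lexicon = set().union(*(bag_of_words(doc) for doc in corpus))
```

The two sentences describe different events but produce identical bags, which shows exactly what information the representation discards.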
1.2.2 Text Mining Techniques
Text Mining is an interdisciplinary field that utilizes techniques from the general field
of Data Mining and additionally, combines methodologies from various other areas
such as Information Extraction, Information Retrieval, Computational Linguistics,
Categorization, Clustering, Summarization, Topic Tracking and Concept Linkage [46],
[50], [120]. In the following sections, we will discuss each of these technologies and the
role that they play in Text Mining.
a) Information Extraction
Information extraction (IE) [71] is a process of automatically extracting structured
information from unstructured and/or semi-structured machine-readable documents,
processing human language texts by means of NLP. The final output of the extraction
process is some type of database obtained by looking for predefined sequences in text, a
process called pattern matching [64].
Tasks performed by IE systems include:
Term analysis, which identifies the terms appearing in a document. This is
especially useful for documents that contain many complex multi-word terms, such
as scientific research papers.
Named-entity recognition, which identifies the names appearing in a document,
such as names of people or organizations. Some systems are also able to recognize
dates and expressions of time, quantities and associated units, percentages, and so
on.
Fact extraction, which identifies and extracts complex facts from documents. Such
facts could be relationships between entities or events.
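The pattern-matching approach behind these tasks can be sketched with regular expressions that turn free text into a structured record. The three patterns below are deliberately naive and are assumptions of this example; practical IE systems rely on much richer linguistic processing.

```python
import re

DATE = re.compile(
    r"\b(\d{1,2}) (January|February|March|April|May|June|July|"
    r"August|September|October|November|December) (\d{4})\b"
)
# Naive person pattern: a title followed by one or more capitalized words.
PERSON = re.compile(r"\b(?:Dr|Mr|Ms|Prof)\. ([A-Z][a-z]+(?: [A-Z][a-z]+)*)")
PERCENT = re.compile(r"\b(\d+(?:\.\d+)?)%")

def extract(text):
    """Return a structured record of the dates, people and percentages found in text."""
    return {
        "dates": [" ".join(parts) for parts in DATE.findall(text)],
        "people": PERSON.findall(text),
        "percentages": [float(p) for p in PERCENT.findall(text)],
    }

# Hypothetical input sentence.
text = "On 12 March 2009 Dr. Jane Smith reported that sales rose 4.5% over the quarter."
record = extract(text)
```

Records of this form could be stored as rows in a database, which is what allows conventional Data Mining techniques to be applied afterwards.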
IE transforms a corpus of textual documents into a more structured database. The
database constructed by an IE module can then be provided to the KDD module for
further mining of knowledge, as illustrated in Figure 1.4.
[Figure 1.4: Overview of IE-based Text Mining framework. Flow: Text, Information
Extraction, DB, Data Mining, Rules; the whole pipeline is labeled Text Data Mining.]
b) Information Retrieval
Retrieval of text-based information, also termed Information Retrieval (IR), has become a
topic of great interest with the advent of text search engines on the Internet. Text is
considered to be composed of two fundamental units, namely the document (book, journal
paper, chapters, sections, paragraphs, Web pages, computer source code, and so forth) and
the term (word, word-pair, and phrase within a document). Traditionally in IR, text queries
and documents both are represented in a unified manner, as sets of terms, to compute the
distances between queries and documents, thus providing a framework within which to directly
implement simple text retrieval algorithms.
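A simple retrieval algorithm of this kind can be sketched by representing both the query and the documents as sets of terms and ranking documents by their distance to the query. The Jaccard distance is used here as one possible set-based distance; it, the whitespace tokenization and the sample documents are assumptions of the example.

```python
def terms(text):
    """Represent a text as the set of its lower-cased, whitespace-separated terms."""
    return set(text.lower().split())

def jaccard_distance(a, b):
    """1 - |A & B| / |A | B|; 0 means identical term sets, 1 means disjoint sets."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

# Hypothetical document collection.
documents = {
    "d1": "text mining extracts patterns from text",
    "d2": "mining for minerals and ores",
    "d3": "relational database query optimization",
}
query = terms("text mining")
ranking = sorted(documents, key=lambda d: jaccard_distance(query, terms(documents[d])))
```

Because queries and documents share one representation, the same distance function serves both for ranking documents against a query and for comparing documents with each other.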
c) Computational Linguistics/ Natural Language Processing
Natural Language Processing is a theoretically motivated range of computational
techniques for analyzing and representing naturally occurring texts at one or more levels of
linguistic analysis for the purpose of achieving human-like language processing for a
range of tasks or applications. The goal of Natural Language Processing (NLP) is to design
and build a computer system that will analyze, understand, and generate natural human
languages. Applications of NLP include machine translation of one human-language text
to another; generation of human-language text such as fiction, manuals, and general
descriptions; interfacing to other systems such as databases and robotic systems thus
enabling the use of human-language type commands and queries; and understanding
human-language text to provide a summary or to draw conclusions.
An NLP system performs the following tasks:
Parse a sentence to determine its syntax.
Determine the semantic meaning of a sentence.
Analyze the text context to determine its true meaning for comparing it with other
text.
The role of NLP in Text Mining is to provide the systems in the information extraction
phase with linguistic data that they need to perform their task. Often this is done by
annotating documents with information like sentence boundaries, part-of-speech tags, and
parsing results, which can then be read by the information extraction tools.
d) Categorization
Categorization is the process of recognizing, differentiating and understanding the ideas
and objects to group them into categories, for specific purpose. Ideally, a category
illuminates a relationship between the subjects and objects of knowledge. Categorization is
fundamental in language, prediction, inference, decision making and in all kinds of
environmental interaction.
There are many categorization theories and techniques. In a broader historical view,
however, three general approaches to categorization may be identified as:
Classical categorization - According to the classical view, categories should be
clearly defined, mutually exclusive and collectively exhaustive, so that any entity
belongs to one, and only one, of the proposed categories.
Conceptual clustering – It is a modern variation of the classical approach in which
classes (clusters or entities) are generated by first formulating their conceptual
descriptions and then classifying the entities according to these descriptions.
Conceptual clustering is closely related to fuzzy set theory, in which objects may
belong to one or more groups, in varying degrees of fitness.
Prototype theory - Categorization can also be viewed as the process of grouping
things based on prototypes. Categorization based on prototypes is the basis for
human development, and relies on learning about the world via embodiment.
e) Topic Tracking
A topic tracking [64] system works by keeping user profiles and, based on the documents
the user views, predicts other documents of interest to the user. Yahoo offers a free topic
tracking tool (www.alerts.yahoo.com) that allows users to choose keywords and notifies
them when news relating to those topics becomes available.
Topic tracking technology, however, has limitations. For example, if a user sets up an alert
for "Text Mining", s/he will receive several news stories on mining for minerals, and very
few that are actually on Text Mining. Some of the better Text Mining tools let users select
particular categories of interest, or the software can even automatically infer the user's
interests based on his/her reading history and click-through information. Keyword
extraction has become a basis of several Text Mining applications such as search engines,
text categorization, summarization, and topic detection. In [99], Nelken et al. proposed a
disambiguation system that separates the on-topic occurrences and filters them from the
potential multitude of references to unrelated entities.
f) Clustering
Clustering [3] is a technique in which objects of logically similar properties are physically
placed together in one class of objects and a single access to the disk makes the entire class
available. There are many clustering methods available, and each of them may give a
different grouping of a dataset. The choice of a particular method will depend on the type
of output desired, the known performance of method with particular types of data, the
hardware and software facilities available and the size of the dataset. In general, clustering
methods may be divided into two categories based on the cluster structure which they
produce. The non-hierarchical methods divide a dataset of N objects into M clusters, with
or without overlap. These methods are divided into partitioning methods, in which the
classes are mutually exclusive, and the less common clumping methods, in which overlap
is allowed. Each object is a member of the cluster with which it is most similar; however
the threshold of similarity has to be defined. The hierarchical methods produce a set of
nested clusters in which each pair of objects or clusters is progressively nested in a larger
cluster until only one cluster remains. The hierarchical methods can be further divided into
agglomerative or divisive methods. In agglomerative methods, the hierarchy is built up in
a series of N-1 agglomerations, or fusions, of pairs of objects, beginning with the
unclustered dataset. The less common divisive methods begin with all objects in a single
cluster and at each of N-1 steps divide some cluster into two smaller clusters, until each
object resides in its own cluster.
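The series of N-1 fusions in an agglomerative method can be traced with a short sketch. It assumes single-linkage distance between clusters and one-dimensional points, both chosen only to keep the example small; the function names and data are hypothetical.

```python
def single_link(a, b):
    """Distance between two clusters: the gap between their closest members."""
    return min(abs(x - y) for x in a for y in b)

def agglomerate(points):
    """Single-linkage agglomerative clustering of 1-D points.

    Returns the list of merges, each as (cluster_a, cluster_b, distance).
    """
    clusters = [(p,) for p in points]  # start with every object in its own cluster
    merges = []
    while len(clusters) > 1:
        # Pick the pair of clusters that are currently closest to each other.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j], single_link(clusters[i], clusters[j])))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

merges = agglomerate([1.0, 1.5, 5.0, 5.2, 9.0])
```

Five objects yield four merges, and reading the list from first to last reproduces the nested hierarchy from the individual objects up to the single all-inclusive cluster.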
g) Concept Linkage
Concept linkage [46] identifies related documents based on the concepts they share.
The primary goal of concept linkage is to provide browsing for
information rather than searching for it as in IR. For example, a Text Mining software
solution may easily identify a link between topics X and Y, and Y and Z. Concept linkage
is a valuable concept in Text Mining which could also detect a potential link between X
and Z, something that a human researcher has not come across because of the large
volume of information s/he would have to sort through to make the connection. Concept
linkage is beneficial for identifying links between diseases and treatments. In the near future,
Text Mining tools with concept linkage capabilities are expected to be beneficial in the biomedical
field, helping researchers to discover new treatments by associating treatments that have
been used in related fields.
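The X, Y, Z example can be sketched as a search for indirect links in a co-occurrence graph. The sketch assumes that the concepts in each document have already been extracted; the document names, concept labels and function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def concept_links(documents):
    """Map each concept to the set of concepts it co-occurs with in some document."""
    direct = defaultdict(set)
    for concepts in documents.values():
        for a, b in combinations(sorted(set(concepts)), 2):
            direct[a].add(b)
            direct[b].add(a)
    return direct

def candidate_links(direct):
    """Triples (X, Z, Y): X and Z never co-occur, but both are linked to Y."""
    found = set()
    for y, neighbours in direct.items():
        for x, z in combinations(sorted(neighbours), 2):
            if z not in direct[x]:
                found.add((x, z, y))
    return found

# Hypothetical concepts already extracted from two papers.
documents = {
    "paper1": ["disease X", "protein Y"],
    "paper2": ["protein Y", "drug Z"],
}
links = candidate_links(concept_links(documents))
```

No single document mentions both the disease and the drug, yet the shared protein makes the pair a candidate link worth a researcher's attention.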
h) Information Visualization
Visual Text Mining [46], or information visualization, puts large textual sources in a
visual hierarchy or map and provides browsing capabilities, in addition to simple
searching, e.g., Informatik V's DocMiner. The user can interact with the document map
by zooming, scaling, and creating sub-maps. The government can use information
visualization to identify terrorist networks or to find information about crimes that may
have been previously thought unconnected. It could provide them with a map of
possible relationships between suspicious activities so that they can investigate
connections that they would not have come up with on their own. Text Mining with
Information visualization has been shown to be useful in academic areas, where it can
allow an author to easily identify and explore papers in which s/he is referenced. It is
useful to users, allowing them to narrow down a broad range of documents and explore
related topics.
i) Summarization
A summary is a text that is produced from one or more texts, that contains a significant
portion of the information in the original text(s), and that is no longer than half of the
original text(s). 'Text' here includes multimedia documents, on-line documents,
hypertexts, etc. The types of summary that have been identified include indicative
summaries (that provide an idea of what the text is about without giving any content) and
informative ones (that do provide some shortened version of the content). Extracts are
summaries created by reusing portions (words, sentences, etc.) of the input text verbatim,
while abstracts are created by re-generating the extracted content. A generic summary is not
related to a specific topic, while a query-based summary discusses the
topic mentioned in the given query. A summary can also be created for a single document or
for multiple documents.
1.2.3 Domains of applications of Text Mining
Regarded as the next wave of knowledge discovery, Text Mining has a very high
commercial value. It is an emerging technology for analyzing large collections of
unstructured documents for the purposes of extracting interesting and non-trivial patterns
or knowledge.
Text Mining applications can be broadly organized into two groups [129]:
Document exploration tools – They organize documents based on their text content
and provide an environment for a user to navigate and browse in a document or
concept space. A popular approach is to perform clustering on the documents based on
their similarities in content and present the groups or clusters of the documents in
certain graphical representation.
Document analysis tools – They analyze the text content of the documents and
discover the relationships between concepts or entities described in the documents.
They are mainly based on natural language processing techniques, including text
analysis, text categorization, information extraction, and summarization.
There are many possible application domains based on Text Mining technology [42], [64],
[130]. We briefly mention a few below:
Customer profile analysis, e.g., mining incoming emails for customer complaints
and feedback.
Patent analysis, e.g., analyzing patent databases for major technology players,
trends, and opportunities.
Information dissemination, e.g., organizing and summarizing trade news and
reports for personalized information services.
Company resource planning, e.g., mining a company's reports and correspondence
for activities, status, and problems reported.
Security issues, e.g., analyzing plain text sources such as Internet news. This domain
also covers the study of text encryption.
Open-ended survey responses, e.g., analyzing the words or terms that
are commonly used by respondents to describe the pros and cons of a product or
service (under investigation), which may reveal common misconceptions or confusion
regarding the items in the study.
Text classification, e.g., automatically filtering out most undesirable "junk email"
based on certain terms or words that are not likely to appear in legitimate
messages.
Competitive Intelligence, e.g., enabling companies to organize the information they
collect about themselves, the market and their competitors, to manage the
enormous amount of data to be analyzed, and to adapt their strategies to present
market demands and opportunities.
Customer Relationship Management (CRM), e.g., rerouting specific requests
automatically to the appropriate service or supplying immediate answers to the
most frequently asked questions.
Multilingual Applications of Natural Language Processing, e.g., identifying and
analyzing web pages published in different languages.
Technology watch, e.g., identifying the relevant Science & Technology literature,
and extracting the required information from this literature efficiently.
Text summarization, e.g., creating a condensed version of a document or a
document collection (multi-document summarization) that should contain its most
important topics.
Bio-entity recognition, e.g., identifying and classifying technical terms in the
domain of molecular biology that correspond to instances of concepts of
interest to biologists. Examples of such entities include the names of proteins and
genes, and their locations of activity, such as cells or organism names.
Organize repositories of document-related meta-information, e.g., automatic text
categorization methods [107] are used to create structured metadata used for
searching and retrieving relevant documents based on a query.
Gain insights about trends and relations between people, places and organizations, e.g.,
aggregating and comparing information extracted automatically from documents of a
certain type, such as incoming mail, customer letters and news-wires.
1.2.4 Architecture of a Text Mining system
A Text Mining system takes as input a collection of documents and preprocesses
each document by checking its format and character sets [136]. Next, the preprocessed
documents go through a text analysis phase, sometimes repeating techniques, until the
required information is extracted. Three text analysis techniques are shown in Figure 1.5,
but many other combinations of techniques could be used depending on the goals of the
organization. The resulting extracted information can be input to a management
information system, yielding an abundant amount of knowledge for the user of that
system. Figure 1.6 shows the detailed processing steps followed in a Text Mining system.
Figure 1.5 : An example of Text Mining System (stages shown: document collection;
retrieve and preprocess document; analyze text by summarization, clustering and
information extraction; management information system; knowledge)
Figure 1.6 : Text Mining Process (stages shown: document collection; retrieve and
pre-process document; feature generation and feature selection; TM techniques, i.e.
Information Retrieval, Summarization, Topic Discovery and Information Extraction;
management information system; knowledge)
The steps of the Text Mining process shown above are briefly discussed below:
a) Document files of different formats, such as PDF files, txt files or flat files, are
collected from different sources such as online chat, SMS, emails, message boards,
newsgroups, blogs, wikis and web pages. This unstructured dataset of documents is
pre-processed to perform the following three tasks:
Tokenize the file into individual tokens using space as the delimiter.
Remove the stop words, which do not convey any meaning.
Use the Porter stemmer algorithm to stem the words to their common root word.
b) Feature Generation and Feature Selection are performed on the
retrieved and preprocessed documents to represent the unstructured text documents
in a more structured spreadsheet format. Feature Selection algorithms help to
identify the important features. Finding an optimal set requires an exhaustive search
of all subsets of features of the chosen cardinality; when a large number of features
is available this is impractical, so for supervised learning algorithms the search is
for a satisfactory set of features instead of an optimal set.
c) After the appropriate selection of features the Text Mining techniques are
incorporated for the applications like Information Retrieval, Information
Extraction, Summarization and Topic Discovery for necessary knowledge
discovery process.
As Figure 1.6 depicts, the resulting knowledge is stored in the management information
system, from which it can later be retrieved.
1.3 Information Retrieval – Basic Concepts, Models and
Techniques
In this section, we first briefly discuss techniques for Information Retrieval, such as
extraction of index terms and retrieval models. Finally, we describe the different
Information Retrieval evaluation techniques and the framework of Information Retrieval.
1.3.1 Introduction
Information Retrieval is a field at the intersection of information science and computer
science. The term was coined in 1951 by Mooers, who argued that IR covers the
"intellectual aspects" of the description of information and of the systems for searching it [97].
It concerns itself with the indexing and retrieval of information from heterogeneous and
mostly-textual information resources. In other words, Information Retrieval is defined as
"The study of systems for indexing, searching, and recalling data, particularly text or other
unstructured forms."
As shown in Figure 1.7, an Information Retrieval [72] process begins when a user enters a
query into the system. Queries are formal statements of information needs, for example
search strings in web search engines. In Information Retrieval a query does not uniquely
identify a single object in the collection. An object is an entity that is represented by
information in a database. Depending on the application, the data objects may be, for
example, text documents, images, audio, mind maps or videos. Often the documents
themselves are not kept or stored directly in the IR system, but are instead represented in
the system by document surrogates or metadata. User queries are matched against the
database information. Several objects may match the query, perhaps with different degrees
of relevancy. Most IR systems compute a numeric score of how well each object in the
database matches the query, and rank the objects according to this value. The top-ranking
objects are then shown to the user. The process may then be iterated if the user wishes to
refine the query.
1.3.2 Document Preprocessing
The aim of text preprocessing is to transform each document into a sequence of features
that will be used in subsequent steps. The input documents go through text segmentation,
punctuation removal, conversion of upper to lower case, and stopword removal [114]. In
this section we outline the preprocessing steps to represent the text document as a term
vector.
Figure 1.7: A general model of Information Retrieval (components shown: test collections,
e.g. document databases; user with goals, tasks, intentions, etc.; information need or
anomalous state of knowledge; representation and organization; organized text surrogates;
query; comparison or interaction; retrieved texts; evaluation; modification; use;
relevance feedback)
1.3.2.1 Lexical Analysis of the text
Lexical analysis is the process of converting the stream of text of a document into a
stream of words, which are later treated as index terms. The spaces between the words are
treated as word separators. Several issues have to be considered while identifying the
words in the text document.
Punctuation marks such as the dot (.), comma (,), hyphen (-) and apostrophe (') are removed
during lexical analysis. In some situations, however, removing punctuation marks has a
negative impact on retrieval performance. For example, the word "T.V." contains
punctuation marks as an integral part, and reducing it to "TV" does not affect the
analysis. If the punctuation marks are removed from "a.m.", on the other hand, the
resulting word "am" has a totally different meaning, so in this case the dots should not
be removed.
During lexical analysis, the case of the words is ignored, and two words with the same
sequence of letters, like "AMERICA" and "America", are treated as equal. Sometimes this
conflates different words, such as "US" and "us".
Words containing digits are usually not treated as index terms. However, they may carry
important information, as in "512B.C.". Alphanumeric words should therefore be kept, and
only words consisting solely of digits should be discarded.
Lexical analysis thus needs suitable exceptions alongside the general rules in order to
handle the issues discussed above and to minimize document preprocessing errors.
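The rules above can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure used in this thesis: the function name `lexical_analysis` and the exception list `KEEP_DOTS` are hypothetical, and a real system would need a much longer list of exceptions.

```python
import re

# Hypothetical exception list: abbreviations whose dots must be preserved.
KEEP_DOTS = {"a.m.", "p.m."}

def lexical_analysis(text):
    """Turn a text stream into a list of candidate index terms."""
    terms = []
    for raw in text.split():                      # spaces separate words
        # Ignore case and strip punctuation surrounding the word.
        token = raw.lower().strip(",;:!?\"'()[]")
        if token in KEEP_DOTS:                    # exception: keep "a.m." intact
            terms.append(token)
            continue
        token = re.sub(r"[.\-']", "", token)      # general rule: drop . - '
        if not token or token.isdigit():          # discard digit-only words
            continue
        terms.append(token)                       # alphanumeric words are kept
    return terms

print(lexical_analysis("The T.V. show starts at 9 a.m. in 512B.C., not 2024"))
```

Note that case folding in this sketch still maps "US" and "us" to the same term, which is exactly the ambiguity described above.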
1.3.2.2 Elimination of Stopwords
Stopwords [1], [14] are regarded as 'functional words' which carry little meaning of their
own in natural language, so they are ignored when identifying the index terms. Eliminating
stopwords considerably reduces the size of the indexing structure.
A stopword list normally includes articles (like a, the), prepositions (like on, in), and
conjunctions (like but, and). Words which occur too frequently across the document
collection can also be included in the stopword list, as such words do not help to
discriminate between documents. For example, the "Reuters-21578" collection contains Reuters
newswire stories, so the word "Reuter" appears in each document and hence cannot be
used as an index term to differentiate among the documents.
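A minimal sketch of both kinds of stopwords, the fixed list and the collection-specific frequent words, is given below. The tiny `STOPWORDS` set, the function names and the threshold `max_doc_ratio` are assumptions made for illustration only.

```python
from collections import Counter

# Small illustrative list; practical stopword lists are much longer.
STOPWORDS = {"a", "the", "on", "in", "but", "and"}

def corpus_stopwords(docs, max_doc_ratio=0.9):
    """Terms that occur in more than max_doc_ratio of the documents.

    docs is a list of token lists; the threshold is a hypothetical choice.
    """
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))        # count each term once per document
    n = len(docs)
    return {term for term, count in df.items() if count / n > max_doc_ratio}

def remove_stopwords(tokens, extra=frozenset()):
    """Drop fixed stopwords and any collection-specific ones in extra."""
    return [t for t in tokens if t not in STOPWORDS and t not in extra]

docs = [["reuter", "oil", "prices"],
        ["reuter", "the", "bank"],
        ["reuter", "wheat"]]
frequent = corpus_stopwords(docs)     # "reuter" occurs in every document
print(frequent, remove_stopwords(docs[1], frequent))
```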
1.3.2.3 Stemming
Stemming [29], [125] is the process of reducing inflected (or sometimes derived) terms to
their stem, base or root form, generally a written word form. A stem is the portion of a
word which is left after the removal of its affixes (i.e., prefixes and suffixes). In
practice, affix removal mostly means removing the suffix from the word. For example, the
strings "collected", "collection" and "collecting" are all based on the term "collect".
Stemming reduces the size of the indexing structure by reducing the number of distinct
index terms, and improves retrieval performance by mapping the variants of the same root
word to a common concept. Many suffix removal algorithms are known, such as the Lovins
algorithm, Krovetz stemming, the Paice/Husk algorithm and the Porter algorithm [25].
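To make the idea of suffix removal concrete, the following is a deliberately naive sketch. It is not the Porter algorithm or any of the other algorithms named above; the suffix list and the minimum stem length are assumptions chosen only to reproduce the "collect" example.

```python
# Hypothetical, very small suffix list; real stemmers use ordered rule sets.
SUFFIXES = ("ing", "ion", "ed", "s")

def strip_suffix(word, min_stem=3):
    """Remove the longest matching suffix, keeping at least min_stem letters."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ("collected", "collection", "collecting", "collect"):
    print(w, "->", strip_suffix(w))
```

The minimum stem length keeps short words such as "sing" from being reduced to a meaningless stem, one of the many special cases that full stemming algorithms handle with explicit rules.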
1.3.2.4 Term Frequency and Weighting
The relevance of each term in the document is estimated using the term frequency
information to generate weights for all the terms in a document [1], [96], [150], [119].
Different methods are used to calculate the weight of a term:
a) Term Frequency (TF) Weighting
The weight of the $j$th term in the $i$th document using TF weighting is represented as
$$w_{ij} = tf_{ij}$$
where $tf_{ij}$ is the frequency of occurrence of the $j$th term in the $i$th document.
However, TF weighting does not consider the frequency of the term across all the
documents in the document corpus.
b) Term Frequency (TF) × Inverse Document Frequency (IDF) Weighting
TF×IDF weighting approach weights the frequency of a term in a document with a factor
that discounts its importance if it appears in most of the documents, as in this case the term
is assumed to have little discriminating power. The weight of the $j$th term in the $i$th
document is represented as
$$w_{ij} = tf_{ij} \cdot \log\left(\frac{n}{n_j}\right) \tag{1.1}$$
where
$n$ is the total number of documents in the document pool
$n_j$ is the number of documents in the pool containing the $j$th term, $n_j \le n$
c) TF×IDF Weighting with Length Normalization
In this approach, to account for documents of different lengths each document vector is
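normalized so that it is of unit length. Here, the weight of the $j$th term in the $i$th
document is represented as
$$w_{ij} = \frac{tf_{ij} \cdot \log\left(\frac{n}{n_j}\right)}{\sqrt{\sum_{k=1}^{m} \left( tf_{ik} \cdot \log\left(\frac{n}{n_k}\right) \right)^{2}}} \tag{1.2}$$
where $m$ is the total number of unique terms appearing in the $i$th document.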
TF×IDF weighting approach [116] weighs the frequency of a term in a document based on
two observations:
• The relevance of a term to the topic of a document is proportional to the number of times
it appears in the document.
• If a term appears in a large number of documents in the document set, then it cannot be
used to discriminate between the different documents.
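The three weighting schemes can be illustrated with a short Python sketch. It follows equations (1.1) and (1.2) directly; the function name `tf_idf_vectors`, the use of the natural logarithm and the toy documents are assumptions made for the example.

```python
import math
from collections import Counter

def tf_idf_vectors(docs, normalize=True):
    """Return one {term: weight} dict per document.

    docs is a list of token lists. With normalize=False the weights follow
    equation (1.1); with normalize=True each vector is scaled to unit
    length as in equation (1.2).
    """
    n = len(docs)
    df = Counter()                       # n_j: documents containing term j
    for tokens in docs:
        df.update(set(tokens))
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)             # tf_ij: raw term frequency
        w = {t: f * math.log(n / df[t]) for t, f in tf.items()}
        if normalize:
            norm = math.sqrt(sum(v * v for v in w.values()))
            if norm > 0:                 # all-zero vectors are left as they are
                w = {t: v / norm for t, v in w.items()}
        vectors.append(w)
    return vectors

docs = [["oil", "oil", "price"], ["oil", "bank"], ["oil", "wheat", "wheat"]]
for vec in tf_idf_vectors(docs):
    print(vec)
```

In this toy collection "oil" occurs in every document, so its weight is zero regardless of how often it occurs, which is the discounting effect described above.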
1.3.3 Retrieval models
For the Information Retrieval [72] to be efficient, the documents are typically transformed
into a suitable representation. There are several representation models. These
representation models are categorized according to two dimensions: the mathematical
basis and the properties of the model.
a) Mathematical basis
Based on mathematical basis, the representation models are classified as:
i. Set-theoretic models represent documents as sets of words or phrases. Similarities
are usually derived from set-theoretic operations on those sets. Common Set-
theoretic models include:
Standard Boolean model
Extended Boolean model
Fuzzy retrieval
ii. Algebraic models represent documents and queries as vectors, matrices, or tuples.
The similarity of the query vector and document vector is represented as a scalar
value. Some common algebraic models are:
Vector space model
Generalized vector space model
(Enhanced) Topic-based Vector Space Model
Extended Boolean model
Latent semantic indexing
iii. Probabilistic models treat the process of document retrieval as a probabilistic
inference. Similarities are computed as probabilities that a document is relevant for
a given query. Probabilistic theorems like the Bayes' theorem are often used in
these models. Common Probabilistic models are:
Binary Independence Model
Probabilistic relevance model, on which the Okapi (BM25) relevance
function is based
Uncertain inference
Language models
Divergence-from-randomness model
Latent Dirichlet allocation
iv. Feature-based retrieval models view documents as vectors of values of feature
functions (or just features) and seek the best way to combine these features into a
single relevance score, typically by learning-to-rank methods. Feature functions are
arbitrary functions of document and query, and as such can easily incorporate
almost any other retrieval model as just another feature.
b) Properties of the model
Based on properties of the model, the representation models are classified as:
Models without term interdependencies treat different terms/words as independent.
This is usually represented in vector space models by the orthogonality
assumption of term vectors, or in probabilistic models by an independence
assumption for term variables.
Models with immanent term interdependencies allow a representation of
interdependencies between terms. However, the degree of the interdependency
between two terms is defined by the model itself. It is usually directly or indirectly
derived (e.g. by dimensional reduction) from the co-occurrence of those terms in
the whole set of documents.
Models with transcendent term interdependencies allow a representation of
interdependencies between terms, but they do not specify how the interdependency
between two terms is defined. They rely on an external source for the degree of
interdependency between two terms (for example, a human or sophisticated
algorithms).
In this section, we discuss the following three Information Retrieval models,
one from each of the first three mathematical categories:
Standard Boolean Model
Vector Space Model
Binary Independence Probabilistic Model
1.3.3.1 Standard Boolean Model
The Boolean model of Information Retrieval is a classical Information Retrieval model. It
is based on Boolean logic and classical set theory, in that both the documents to be
searched and the user's query are conceived as sets of terms. Retrieval is based on whether
or not the documents contain the query terms. Let
T = {t1, t2, ..., tj, ..., tm} (1.3)
be a finite set of elements called index terms (e.g. words or expressions, possibly
stemmed, that describe or characterize documents, such as the keywords given for an
article), and let
D = {D1, ..., Di, ..., Dn} (1.4)
be a finite set of elements called documents, where each Di is an element of the power
set of T, i.e. a set of index terms.
A query Q is a Boolean expression in conjunctive normal form:
Q = (Wi OR Wk OR ...) AND ... AND (Wj OR Ws OR ...), (1.5)
where each literal Wj is either tj or NOT tj. Here tj means that the term tj is present
in a document, whereas NOT tj means that it is not.
Equivalently, Q can be given in disjunctive normal form. An operation called
retrieval, consisting of two steps, is defined as follows:
1. For each literal Wj, the set Sj of documents that satisfy it is obtained, i.e. the
documents that contain tj if Wj = tj, or that do not contain tj if Wj = NOT tj:
Sj = {Di | Di satisfies Wj} (1.6)
2. The documents retrieved in response to Q are the result of the corresponding set
operations. For the conjunctive normal form (1.5), the answer to Q is
INTERSECTION ( UNION Sj),
i.e. the union of the sets Sj within each clause, intersected over all clauses. For a
query in disjunctive normal form the answer is, correspondingly,
UNION ( INTERSECTION Sj).
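The two retrieval steps can be sketched in Python for a query in conjunctive normal form such as (1.5). The representation of a query as a list of clauses, each a list of `(term, present)` pairs, and the function name `boolean_retrieve` are assumptions made for this illustration.

```python
def boolean_retrieve(docs, cnf_query):
    """Evaluate a CNF Boolean query over documents given as sets of terms.

    docs      : dict mapping a document name to its set of index terms
    cnf_query : list of clauses; each clause is a list of (term, present)
                literals, where present=False stands for NOT term
    """
    all_docs = set(docs)
    result = set(all_docs)
    for clause in cnf_query:
        clause_docs = set()
        for term, present in clause:
            # Step 1: the set S_j of documents satisfying the literal W_j.
            containing = {d for d, terms in docs.items() if term in terms}
            s_j = containing if present else all_docs - containing
            clause_docs |= s_j            # OR within a clause: union
        result &= clause_docs             # AND between clauses: intersection
    return result

docs = {
    "D1": {"text", "mining"},
    "D2": {"data", "mining"},
    "D3": {"text", "retrieval"},
}
# (text OR data) AND (NOT retrieval)
query = [[("text", True), ("data", True)], [("retrieval", False)]]
print(sorted(boolean_retrieve(docs, query)))
```

The result is an unordered set: a document either satisfies the query or it does not, which is why ranking the output is difficult in this model.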
The main advantages of the Boolean model are:
i. Clean Formalism
ii. Easy to implement
iii. Intuitive concept
On the other hand, the disadvantages of the Boolean model are:
i. Exact matching may retrieve too few or too many documents
ii. Difficult to rank output, some documents are more important than others
iii. Hard to translate a query into a Boolean expression
iv. All terms are equally weighted
v. More like data retrieval than Information Retrieval
1.3.3.2 Vector Space Model
The vector space model (or term vector model) [66], [118], [121] is an algebraic model used
for information filtering, Information Retrieval, indexing and relevancy ranking. It
represents natural language documents (or any objects, in general) in a formal manner
through the use of vectors (of identifiers, such as index terms) in a
multi-dimensional linear space.
Documents and queries are represented as vectors.
dj = (w1,j,w2,j,...,wt,j) (1.7)
q = (w1,q,w2,q,...,wt,q) (1.8)
Each dimension corresponds to a separate term. If a term occurs in the document, its value
in the vector is non-zero. Several different ways of computing these values, also known as
(term) weights, have been developed. One of the best known schemes is tf-idf weighting.
The definition of term depends on the application. Typically terms are single words,
keywords, or longer phrases. If the words are chosen to be the terms, the dimensionality of
the vector is the number of words in the vocabulary (the number of distinct words
occurring in the corpus).
Vector operations can be used to compare documents with queries.
Figure 1.8: Vector Space Model
Relevance rankings of documents in a keyword search can be calculated, using the
assumptions of document similarities theory, by comparing the deviation of angles
between each document vector and the original query vector, where the query is
represented as the same kind of vector as the documents.
In practice, it is easier to calculate the cosine of the angle between the vectors, instead of
the angle itself:
$$\cos\theta = \frac{d_2 \cdot q}{\lVert d_2 \rVert \, \lVert q \rVert} \tag{1.9}$$
where $d_2 \cdot q$ is the dot product of the document vector $d_2$ and the query vector $q$
(ref. Figure 1.8), $\lVert d_2 \rVert$ is the norm of vector $d_2$, and $\lVert q \rVert$ is the norm of vector $q$.
The norm of a vector is calculated as:
$$\lVert q \rVert = \sqrt{\sum_{i=1}^{t} w_{i,q}^{2}} \tag{1.10}$$
A cosine value of zero means that the query and document vectors are orthogonal and have
no match (i.e. no query term occurs in the document being considered).
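Ranking by the cosine of the angle between document and query vectors can be sketched as follows. Vectors are stored as sparse `{term: weight}` dictionaries; the function names and the toy weights are assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0                      # an empty vector matches nothing
    return dot / (norm_u * norm_v)

def rank(query, docs):
    """Return (name, score) pairs sorted by decreasing cosine with the query."""
    scores = {name: cosine(vec, query) for name, vec in docs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

docs = {
    "d1": {"text": 1.0, "mining": 1.0},
    "d2": {"text": 2.0},
    "d3": {"image": 1.0},
}
for name, score in rank({"text": 1.0}, docs):
    print(name, round(score, 3))
```

Document d3 shares no term with the query, so its vector is orthogonal to the query vector and its score is zero, while d1 receives a partial match between 0 and 1.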
The vector space model has the following advantages over the Standard Boolean model:
i. Simple model based on linear algebra
ii. Term weights not binary
iii. Allows computing a continuous degree of similarity between queries and
documents
iv. Allows ranking documents according to their possible relevance
v. Allows partial matching
The vector space model has the following limitations:
i. Long documents are poorly represented because they have poor similarity values (a
small scalar product and a large dimensionality)
ii. Search keywords must precisely match document terms; word substrings might
result in a "false positive match"
iii. Semantic sensitivity; documents with similar context but different term vocabulary
won't be associated, resulting in a "false negative match".
iv. The order in which the terms appear in the document is lost in the vector space
representation.
v. Assumes terms are statistically independent
vi. Weighting is intuitive but not very formal
1.3.3.3 Probabilistic Model (Binary Independence Model)
The Binary Independence Model (BIM) is a probabilistic Information Retrieval technique
that makes some simple assumptions to make the estimation of document/query similarity
probability feasible.
The "binary" in BIM is to be taken in the sense of "yes or no": the representation is an
ordered set of Boolean variables. A document or query is represented by a vector
with one Boolean element for each term under consideration. More specifically, a
document is represented by a vector d = (x1, ..., xm) where xt=1 if term t is present in the
document d and xt=0 if it is not. With this simplification, many documents can have the
same vector representation. Queries are represented in a similar way. "Independence"
signifies that terms in the document are considered independently of each other and no
association between terms is modeled. This assumption is very limiting, but it has been
shown to give good enough results in many situations. This independence is the
"naive" assumption of a Naive Bayes classifier, where properties that imply each other are
nonetheless treated as independent for the sake of simplicity. This assumption allows the
representation to be treated as an instance of a Vector space model by considering each
term as a value of 0 or 1 along a dimension orthogonal to the dimensions used for the other
terms.
The probability P(R|d,q) that a document is relevant derives from the probability of
relevance of the term vector of that document, P(R|x,q). By using the Bayes rule we get:
$$P(R \mid x, q) = \frac{P(x \mid R, q) \, P(R \mid q)}{P(x \mid q)} \tag{1.11}$$
where P(x|R=1,q) and P(x|R=0,q) are the probabilities that, if a relevant or non-relevant
document, respectively, is retrieved, its representation is x. The exact
probabilities cannot be known beforehand, so estimates from statistics about the
collection of documents must be used.
P(R=1|q) and P(R=0|q) indicate the prior probability of retrieving a relevant or
non-relevant document, respectively, for a query q. If, for instance, we knew the percentage of
relevant documents in the collection, then we could use it to estimate these probabilities.
Since a document is either relevant or non relevant to a query we have that:
P(R = 1 | x,q) + P(R = 0 | x,q) = 1 (1.12)
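The following sketch shows how the Bayes rule and the independence assumption combine into a relevance probability. The per-term probabilities (the chance that a term is present in a relevant or in a non-relevant document) and the prior are hypothetical numbers; in practice they would have to be estimated from collection statistics, and the function names are chosen only for this example.

```python
def likelihood(x, term_probs):
    """P(x | R, q) under the independence assumption.

    x          : tuple of 0/1 term indicators for a document
    term_probs : probability that each term is present, given R
    """
    result = 1.0
    for present, p in zip(x, term_probs):
        result *= p if present else (1.0 - p)
    return result

def posterior_relevance(x, p_rel, p_nonrel, prior_rel):
    """P(R=1 | x, q) by the Bayes rule.

    P(x | q) is expanded over R=1 and R=0; it is assumed to be non-zero.
    """
    joint_rel = likelihood(x, p_rel) * prior_rel
    joint_nonrel = likelihood(x, p_nonrel) * (1.0 - prior_rel)
    return joint_rel / (joint_rel + joint_nonrel)

# Hypothetical estimates for a two-term vocabulary.
x = (1, 0)                      # first term present, second absent
p_rel, p_nonrel = (0.8, 0.3), (0.2, 0.4)
print(posterior_relevance(x, p_rel, p_nonrel, prior_rel=0.1))
```

Swapping the roles of the relevant and non-relevant estimates gives P(R=0|x,q), and the two posteriors sum to one as required by equation (1.12).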
Roelleke and Wang [113] investigate probabilistic relational implementations of BIM
using probabilistic relational algebra for the integration of Information Retrieval
in databases. This work arose from the interest shown by Cui and Potok [36] in
applying probabilistic Information Retrieval models to structured
data, in order to investigate the problem of ranking the answers to a database query
when many tuples are returned.
1.3.4 Similarity measures
Many Data Mining techniques, and Text Mining techniques in particular, such as clustering
and classification, are based on similarity measures between objects. Measures of
similarity may be obtained from vectors of measurements or characteristics describing each
object (document). Given a particular vector-space representation, the distance between
documents can be defined as a well-defined function of the distance between their document
vectors. Two commonly used similarity measures are Euclidean distance and Cosine distance
[70], [115], [116], [121].
1.3.4.1 Euclidean Distance
Euclidean distance is the square root of the sum of squared differences between the term
vectors of a pair of text documents.
The Euclidean distance between two document vectors $d_i$ and $d_k$ is computed as:
$$Ed_{ik} = \sqrt{\sum_{j=1}^{c} \left( d_{ij} - d_{kj} \right)^{2}}, \quad \text{such that } i \neq k,\; i, k \le n \tag{1.13}$$
where $d_{ij}$ (or $d_{kj}$) is the weight of the $j$th term in document $d_i$ (or $d_k$), and
$c$ is the sum of the numbers of unique terms in $d_i$ ($a$) and $d_k$ ($b$), $c = a + b$.
A disadvantage of Euclidean distance is that it is dominated by terms that have larger
frequencies of occurrence in the document. The distance is therefore normalized to a value
between 0 and 1 to prevent this.
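The sensitivity of Euclidean distance to term frequencies can be seen in a small sketch. The normalization shown here, scaling each vector to unit length before measuring the distance, is only one possible choice and is an assumption of this example rather than the scheme prescribed above; the function names are likewise illustrative.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two sparse term-weight vectors."""
    terms = set(u) | set(v)             # union of the terms of both documents
    return math.sqrt(sum((u.get(t, 0.0) - v.get(t, 0.0)) ** 2 for t in terms))

def unit_length(u):
    """Scale a vector to unit length (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(w * w for w in u.values()))
    return {t: w / norm for t, w in u.items()} if norm > 0 else dict(u)

d1 = {"text": 1.0, "mining": 1.0}
d2 = {"text": 3.0, "mining": 3.0}       # same content, three times as long
print(euclidean(d1, d2))                            # large: driven by frequency
print(euclidean(unit_length(d1), unit_length(d2)))  # close to zero
```

The two documents use the same terms in the same proportions, yet the raw distance is large because of the higher frequencies in the longer document.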
1.3.4.2 Cosine Distance
Cosine similarity is a measure of similarity between two vectors of m dimension by
finding the angle between them, often used to compare documents in Text Mining. For
text matching, the document vectors di and dk are the tf-idf vectors of the documents,
which give a high weight to terms that have a high frequency occurrence in the document
and low frequency occurrence in whole document corpus.
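Cosine distance between two documents $d_i$ and $d_k$ is defined as:
$$\mathrm{Cosine}_{ik} = \frac{\sum_{j=1}^{a+b} d_{ij} \, d_{kj}}{\sqrt{\sum_{j=1}^{a} d_{ij}^{2}} \, \sqrt{\sum_{j=1}^{b} d_{kj}^{2}}} \tag{1.14}$$
where $d_{ij}$ (or $d_{kj}$) is the weight of the $j$th term in document $d_i$ (or $d_k$).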
For non-negative term weights, the cosine measure gives values between 0 and 1, with
documents that contain similar words tending to have a similarity close to 1, and
documents with few words in common tending to have a similarity close to 0.
Unlike Euclidean distance, cosine distance ensures that only words shared between
compared documents are considered (the weight of a word is zero if it does not appear in a
document). Cosine distance has been found historically to be quite effective in practical IR
experiments, where queries are also expressed using the same term-based representation as
that used for documents.
1.3.5 Retrieval Evaluation
Many different measures for evaluating the performance of Information Retrieval systems
have been proposed. Cleverdon et al. [35] listed six criteria that could be used to evaluate
an Information Retrieval system: (1) coverage, (2) time lag, (3) recall, (4) precision, (5)
presentation and (6) user effort. Of these criteria, recall and precision have most frequently
been applied in measuring Information Retrieval performance. The measures require a
collection of documents and a query, and each of them relies on a judgment of the
relevance of every document to that query.
In the following discussion, we explain the popular retrieval evaluation measures: recall,
precision and F-measure.
1.3.5.1 Recall
Recall [72], [112] is the fraction of the documents relevant to the query that are
successfully retrieved:
$$\text{Recall} = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{relevant documents}\}|}$$
In binary classification, recall is called sensitivity. Recall can also be defined as the
probability that a relevant document is retrieved by the query.
It is trivial to achieve 100% recall by returning all documents in response to any query.
Recall alone is therefore not enough: the number of non-relevant documents retrieved
must also be taken into account, which is done by computing the precision.
1.3.5.2 Precision
Precision [72], [112] is the fraction of the documents retrieved that are relevant to the
user's information need:
$$\text{Precision} = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{retrieved documents}\}|}$$
In binary classification, precision is analogous to positive predictive value. Precision takes
all retrieved documents into account. It can also be evaluated at a given cut-off rank,
considering only the topmost results returned by the system. This measure is called
precision at n or P@n.
1.3.5.3 F-measure
F-measure is the weighted harmonic mean [133], [142] of precision and recall.
The traditional F-measure or balanced F-score is defined as:
$$F = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
This is also known as the F1 measure, because recall and precision are evenly weighted.
Two other commonly used F measures are the F2 measure, which weights recall twice as
much as precision, and the F0.5 measure, which weights precision twice as much as recall.
The general formula of the F-measure for non-negative real β is:
$$F_{\beta} = \frac{(1 + \beta^{2}) \cdot \text{Precision} \cdot \text{Recall}}{\beta^{2} \cdot \text{Precision} + \text{Recall}}$$
Fβ measures the effectiveness of retrieval with respect to a user who attaches β times as
much importance to recall as to precision.
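The three retrieval measures can be computed with a few lines of Python. This is a minimal sketch under the usual set-based definitions; the function names and the toy result list are assumptions made for illustration.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)          # relevant documents retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall; beta=1 gives F1."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def precision_at(ranking, relevant, n):
    """P@n: precision over the n top-ranked results only."""
    return sum(1 for doc in ranking[:n] if doc in relevant) / n

ranking = ["d1", "d2", "d3", "d4"]            # system output, best first
relevant = {"d1", "d3", "d5"}                 # relevance judgments
p, r = precision_recall(ranking, relevant)
print(p, r, f_beta(p, r), f_beta(p, r, beta=2), precision_at(ranking, relevant, 2))
```

Here recall is lower than it could be because the relevant document d5 was never retrieved, while precision suffers from the two non-relevant documents d2 and d4.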
Two cluster quality measures [70], which evaluate how well the current clusters
represent the "true" classes, are discussed below:
1.3.5.4 Purity
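Purity is a measure of the extent to which a cluster contains samples of a single class
[39], [69]. The purity of the $i$th cluster can be computed as:
$$\mathrm{Purity}_i = \max_{j} \, p_{ij}, \qquad p_{ij} = \frac{n_{ij}}{n_i}$$
where,
$p_{ij}$ is the probability that a member of the $i$th cluster belongs to the $j$th class
$n_{ij}$ is the total number of documents from the $j$th class assigned to the $i$th cluster
$n_i$ is the total number of documents in the $i$th cluster
Total purity can be computed as:
$$\mathrm{Purity} = \sum_{i=1}^{k} \frac{n_i}{n} \, \mathrm{Purity}_i$$
where,
$n_i$ is the total number of documents in the $i$th cluster
$n$ is the total number of documents in the document pool
$k$ is the total number of clusters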
1.3.5.5 Entropy
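Entropy [15], [70], [121] measures the degree to which each cluster consists of samples of
a single category class.
The entropy of each cluster i can be computed as
$$\mathrm{Entropy}_i = - \sum_{j=1}^{Q} p_{ij} \log p_{ij}$$
where,
$\mathrm{Entropy}_i$ denotes the entropy of the $i$th cluster
$Q$ is the total number of category classes
$p_{ij}$ is the probability that a member of the $i$th cluster belongs to the $j$th class
The total entropy of the cluster set can be computed by weighting the entropy of each
cluster by its size:
$$\mathrm{Entropy} = \sum_{i=1}^{k} \frac{n_i}{n} \, \mathrm{Entropy}_i$$
where,
$k$ denotes the total number of clusters
The lower the total entropy, the better the clustering performance; an entropy of zero
means that every cluster contains documents of a single class.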
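Both cluster quality measures can be computed from the true class labels of the documents in each cluster, as the following sketch shows. The function name `cluster_quality`, the use of the natural logarithm and the toy clusters are assumptions made for illustration.

```python
import math
from collections import Counter

def cluster_quality(clusters):
    """Return (total purity, total entropy) of a clustering.

    clusters is a list with one entry per cluster; each entry is the list
    of true class labels of the documents assigned to that cluster.
    """
    n = sum(len(labels) for labels in clusters)
    total_purity = 0.0
    total_entropy = 0.0
    for labels in clusters:
        n_i = len(labels)
        if n_i == 0:
            continue                              # empty clusters add nothing
        counts = Counter(labels)                  # n_ij for each class j
        purity_i = max(counts.values()) / n_i     # max_j p_ij
        entropy_i = -sum((c / n_i) * math.log(c / n_i) for c in counts.values())
        total_purity += (n_i / n) * purity_i      # weight by cluster size
        total_entropy += (n_i / n) * entropy_i
    return total_purity, total_entropy

clusters = [["a", "a", "a", "b"], ["b", "b"], ["c", "c", "a", "a"]]
print(cluster_quality(clusters))
print(cluster_quality([["a", "a"], ["b"]]))       # every cluster is pure
```

A clustering in which every cluster contains a single class yields a total purity of 1 and a total entropy of 0.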
1.4 Thesis Outline
The thesis is organized chapter wise as follows:
Chapter 1: This chapter is devoted to an introduction to Data Mining, Text Mining and
Information Retrieval. Different techniques, application areas and the architecture of Data
Mining and Text Mining are discussed in the chapter. The chapter also outlines the basic
concepts, models and techniques of Information Retrieval, such as extraction of index terms
and retrieval models. At the end of the chapter, different Information Retrieval evaluation
techniques and the framework of Information Retrieval are explained.
Chapter 2: In this chapter, related work on document indexing, the hyperlink
structure of web pages, clustering and text document summarization is discussed. Based
on the literature survey on each topic, the problems and challenges identified in existing
tools and techniques are discussed in brief, providing the basis for the work to be
carried out.
Chapter 3: This chapter discusses the method for Quick Text Retrieval Algorithm
Supporting Synonyms Based on Fuzzy Logic. Different compression algorithms (like Elias
Gamma code, Elias Delta code, Fibonacci Code) are explained in the chapter and the
concept of Fuzzy Information Retrieval is studied along with Suffix tree clustering.
Chapter 4: This chapter is about Web Page Ranking Based on Text Content of Linked
Pages. In this chapter, different link analysis ranking algorithms (HITS, pSALSA,
SALSA, HubAvg, AThresh, HThresh, FThresh, BFS) are discussed, and the rank scores of
Web text pages obtained through these algorithms are compared with those of
the proposed approach.
Chapter 5: This chapter discusses the problem of Automatic generation of initial value K
to apply K-means method for Text Documents Clustering. In the chapter, methods and
limitations of different clustering techniques like K-means, Bisecting K-means, HFTC
(Hierarchical Document clustering using Frequent Itemsets), Hybrid PSO+K-means,
Global K-means are discussed. Later, a new method of text document clustering is
proposed to overcome the limitations of existing clustering methods discussed in the
chapter.
Chapter 6: This chapter is about Document Summarization based on Sentence Ranking
Using Vector Space Model. In this chapter, different summarization tools like Copernic,
SweSum, Extractor, MSWord AutoSummarizer, Intelligent, Brevity, Pertinence are
studied. Based on the analysis of these tools, a new method of summarization is proposed,
and the summaries obtained by applying these tools to the DUC-2002 dataset are then
compared using the ROUGE-1.5.5 toolkit.
Chapter 7: This is the last chapter of the thesis, in which the conclusion and future
scope are discussed.