CHAPTER TWO
LITERATURE REVIEW
This chapter reviews the basic topics of data mining: its meaning, the reasons for its
application, and its various tasks, processes, techniques, and application areas. It
dwells particularly on the artificial neural network approach, including the strengths
and weaknesses of these algorithms and when to apply them.
2.1 DATA MINING
Data mining is the process of discovering meaningful new correlations, patterns, and
trends by sifting through large amounts of data stored in repositories, using pattern
recognition technologies as well as statistical and mathematical techniques [6].
Data mining also refers to the analysis of the large quantities of data that are stored
in computers in the form of files or databases. It is also called exploratory data
analysis, among other things [5].
Data mining is not limited to business. It has been used heavily in the medical field,
including the analysis of patient records to help identify best practices.
2.2 WHY DATA MINING
Data mining has caught on in a big way in the last few years due to the following
factors [6]:
i) Data is being produced and collected at an unprecedented rate.
ii) Data is being warehoused – data warehousing brings together data from
different sources into a common format, with consistent definitions for keys and fields.
iii) Computing power is affordable – the prices of disk, memory, processing
power, and I/O bandwidth are within the reach of many ordinary businesses.
iv) Commercial data mining software products now exist.
2.3 DATA MINING PROCESS
In order to conduct data mining analysis systematically, a general process is usually
followed. There are some standard processes, two of which are CRISP-DM (an
industry-standard process consisting of the sequence of steps usually involved in a
data mining study) and SEMMA, which stands for Sample, Explore, Modify, Model,
Assess (figure 2.2 depicts how the SEMMA phases interact) [7]. While not every step
of either approach is needed in every analysis, the process provides good coverage of
the steps required, starting with data exploration and collection, then data processing,
analysis, the drawing of inferences, and implementation.
2.3.1 CRISP-DM
CRISP-DM – the Cross-Industry Standard Process for Data Mining – is widely used by
industrial and corporate organizations. The model consists of six phases intended
as a cyclical process.
Business Understanding: Business understanding includes determining
business objectives, assessing the current situation, establishing data mining
goals, and developing a project plan.
Figure 2.1 CRISP-DM processes
Data Understanding: Once business objectives and the project plan are
established, data understanding considers data requirements. This step can
include initial data collection, data description, data exploration, and the
verification of data quality. Data exploration such as viewing summary
statistics (which includes the visual display of categorical variables) can
occur at the end of this phase. Models such as cluster analysis can also be
applied during this phase, with the intent of identifying patterns in the data.
Data Preparation: Once the data resources available are identified, they
need to be selected, cleaned, built into the form desired, and formatted.
Data cleaning and data transformation in preparation of data modeling
needs to occur in this phase. Data exploration at a greater depth can be
applied during this phase, and additional models utilized, again providing
the opportunity to see patterns based on business understanding.
Modeling: Data mining software tools such as visualization (plotting data and
establishing relationships) and cluster analysis (to identify which variables
go well together) are useful for initial analysis. Tools such as generalized
rule induction can develop initial association rules. Once greater data
understanding is gained (often through pattern recognition triggered by
viewing model output), more detailed models appropriate to the data type
can be applied. The division of data into training and test sets is also needed
for modeling.
Evaluation: Model results should be evaluated in the context of the business
objectives established in the first phase (business understanding). This will
lead to the identification of other needs (often through pattern recognition),
frequently reverting to prior phases of CRISP-DM. Gaining business
understanding is an iterative procedure in data mining, where the results of
various visualization, statistical, and artificial intelligence tools show the
user new relationships that provide a deeper understanding of
organizational operations.
Deployment: Data mining can be used both to verify previously held
hypotheses and for knowledge discovery (identification of unexpected and
useful relationships). Through the knowledge discovered in the earlier
phases of the CRISP-DM process, sound models can be obtained that may
then be applied to business operations for many purposes, including
prediction or identification of key situations. These models need to be
monitored for changes in operating conditions, because what might be true
today may not be true a year from now. If significant changes do occur, the
model should be redone. It’s also wise to record the results of data mining
projects so documented evidence is available for future studies.
Figure 2.2 Schematic of SEMMA (original from SAS)
2.4 DATA MINING TASKS
CLASSIFICATION: This consists of examining the features of a newly presented
object and assigning it to one of a predefined set of classes. For our purposes,
the objects to be classified are generally represented by records in databases,
and the act of classification consists of updating each record by filling in a field
with a class code of some kind.
The classification task is characterized by a well-defined set of classes
and a training set consisting of pre-classified examples. The task is to build a
model of some kind that can be applied to unclassified data in order to classify
it.
ESTIMATION: While classification deals with discrete outcomes such as yes or
no, or measles, rubella, or chicken pox, estimation deals with continuously valued
outcomes. Given some input data, we use estimation to come up with a value for
some unknown continuous variable such as income, height, or credit card
balance.
In practice, estimation is often used to perform classification tasks, and neural
networks are well suited to estimation tasks.
PREDICTION: Prediction is the same as classification and estimation except that
the records are classified according to some predicted future behaviour or
estimated future value. In a prediction task, the only way to check the accuracy of
the classification is to wait and see.
Any of the techniques used for classification and estimation can be adapted for use
in prediction by using training examples in which the value to be predicted is
already known, along with historical data for those examples. The historical data
is used to build a model that explains the current observed behaviour. When this
model is applied to current inputs, the result is a prediction of future behaviour.
AFFINITY GROUPING: The task of affinity grouping is to determine which things
go together. It can be used to identify cross-selling opportunities and to design
attractive packages or groupings of products and services. Affinity grouping is one
simple approach to generating rules from data: if two items go together, two
association rules can be generated from them.
CLUSTERING: This is the task of segmenting a heterogeneous population into a
number of more homogeneous subgroups. It does not rely on predefined classes;
records are grouped together on the basis of self-similarity, and it is up to the
analyst to determine what meaning, if any, to attach to the resulting clusters.
Clustering is often done as a prelude to some other form of data mining or
modelling.
DESCRIPTION: Sometimes the purpose of data mining is simply to describe what
is going on in a complicated database in a way that increases our understanding
of the people, products, or processes that produced the data in the first place.
Some of the techniques discussed later in this chapter, such as market basket
analysis tools, are purely descriptive. Others, like neural networks, provide
next to nothing in the way of description [6].
2.5 DATA MINING ISSUES
As data mining initiatives continue to evolve, there are several issues Congress may
decide to consider related to implementation and oversight. These issues include,
but are not limited to, data quality, interoperability, mission creep, and privacy. As
with other aspects of data mining, while technological capabilities are important,
other factors also influence the success of a project’s outcome [6].
Data Quality
Data quality refers to the accuracy and completeness of the data. Data quality can
also be affected by the structure and consistency of the data being analyzed. The
presence of duplicate records, the lack of data standards, the timeliness of updates,
and human error can significantly impact the effectiveness of the more complex
data mining techniques, which are sensitive to subtle differences that may exist in
the data. To improve data quality, it is sometimes necessary to “clean” the data,
which can involve the removal of duplicate records, normalizing the values used to
represent information in the database, accounting for missing data points,
removing unneeded data fields, identifying anomalous data points (e.g., an
individual whose age is shown as 142 years), and standardizing data formats (e.g.,
changing dates so they all include MM/DD/YYYY).
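The cleaning steps described above can be sketched in a few lines. The following is a minimal illustration using only the Python standard library; the record fields, values, and the 0–120 plausible-age range are assumptions for illustration, not from the source.

```python
from datetime import datetime

# Hypothetical raw records illustrating the problems described above
raw = [
    {"id": 1, "name": "Ada",  "age": "34",  "joined": "2021-03-05"},
    {"id": 1, "name": "Ada",  "age": "34",  "joined": "2021-03-05"},  # duplicate record
    {"id": 2, "name": "Bob",  "age": "142", "joined": "7/4/2020"},    # anomalous age
    {"id": 3, "name": "Cleo", "age": "",    "joined": "2019-12-01"},  # missing data point
]

def clean(records):
    seen, out = set(), []
    for r in records:
        key = tuple(sorted(r.items()))
        if key in seen:                            # remove duplicate records
            continue
        seen.add(key)
        age = int(r["age"]) if r["age"] else None  # account for missing data points
        if age is not None and not 0 < age < 120:  # flag anomalous values
            age = None
        joined = r["joined"]
        for fmt in ("%Y-%m-%d", "%m/%d/%Y"):       # standardize dates to MM/DD/YYYY
            try:
                joined = datetime.strptime(r["joined"], fmt).strftime("%m/%d/%Y")
                break
            except ValueError:
                pass
        out.append({**r, "age": age, "joined": joined})
    return out

for row in clean(raw):
    print(row)
```

Real cleaning pipelines would also normalize value encodings and drop unneeded fields, as noted above, but the shape of the work is the same.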
Interoperability
This refers to the ability of a computer system and/or data to work with other
systems or data using common standards or processes. For data mining,
interoperability of databases and software is important to enable the search and
analysis of multiple databases simultaneously, and to help ensure the compatibility
of data mining activities of different agencies. Similarly, as agencies move forward
with the creation of new databases and information sharing efforts, they will need
to address interoperability issues during their planning stages to better ensure the
effectiveness of their data mining projects.
Mission Creep
Mission creep refers to the use of data for purposes other than that for which the
data was originally collected. This can occur regardless of whether the data was
provided voluntarily by the individual or was collected through other means. One of
the primary reasons for misleading results is inaccurate data. All data collection
efforts suffer accuracy concerns to some degree. Ensuring the accuracy of
information can require costly protocols that may not be cost effective if the data is
not of inherently high economic value. In well-managed data mining projects, the
original data collecting organization is likely to be aware of the data’s limitations
and account for these limitations accordingly. However, such awareness may not be
communicated or heeded when the data is used for other purposes.
Privacy
As additional information sharing and data mining initiatives have been announced,
increased attention has focused on the implications for privacy.
Concerns about privacy focus both on actual projects proposed, as well as concerns
about the potential for data mining applications to be expanded beyond their
original purposes (mission creep).
So far there has been little consensus about how data mining should be carried out,
with several competing points of view being debated. Some observers contend that
tradeoffs may need to be made regarding privacy to ensure security. In contrast,
some privacy advocates argue in favor of creating clearer policies and exercising
stronger oversight.
2.6 BASIC STYLES OF DATA MINING
The first, hypothesis testing, is a top-down approach that attempts to substantiate or
disprove preconceived ideas. The second, knowledge discovery, is a bottom-up
approach that starts with the data and tries to get it to tell us something we didn’t
already know [6].
2.6.1 Hypothesis
A hypothesis is a proposed explanation whose validity can be tested. Testing the
validity of a hypothesis is done by analyzing data that may simply be collected by
observation or generated through experiment.
The process of hypothesis testing
The hypothesis testing method has several steps:
1) Generate good ideas (hypotheses).
2) Determine what data would allow these hypotheses to be tested.
3) Locate the data.
4) Prepare the data for analysis.
5) Build computer models based on the data.
6) Evaluate the computer models to confirm or reject the hypotheses.
2.6.2 Knowledge Discovery
Undirected learning has long been a goal of artificial intelligence researchers in the
academic discipline called machine learning. In the real world, discovering valuable
patterns is worthwhile, but it is still hard work.
Knowledge discovery can be either directed or undirected.
Directed Knowledge Discovery
This is goal oriented. There is a specific field whose value we want to predict, a
fixed set of classes to be assigned to each record, or a specific relationship we
want to explore.
Here are the steps in the process of directed knowledge discovery:
1. Identify source of pre-classified data.
2. Prepare data for analysis
3. Build and train computer model
4. Evaluate the computer model
Undirected Knowledge Discovery
Here, there is no target field. The data mining tool is simply let loose on the
data in the hope that it will discover meaningful structure.
The Process of Undirected Knowledge Discovery
Here are the steps in the process of undirected knowledge discovery:
1. Identify source of pre-classified data.
2. Prepare data for analysis
3. Build and train computer model
4. Evaluate the computer model
5. Apply the computer model to new data.
6. Identify potential targets for directed knowledge discovery.
7. Generate new hypotheses to test.
DATA MINING TECHNIQUES / METHODS
2.7 MEMORY-BASED REASONING
Memory-based reasoning systems are a type of model, supporting the modeling
phase of the data mining process. Their unique feature is that they are relatively
machine driven, involving automatic classification of cases. It is a highly useful
technique that can be applied to text data as well as traditional numeric data
domains.
Memory-based reasoning is an empirical classification method [8]. It operates by
comparing new unclassified records with known examples and patterns.
The case that most closely matches the new record is identified, using one of a
number of different possible measures. Memory-based reasoning provides best
overall classification when compared with the more traditional approaches in
classifying jobs with respect to back disorders [9].
Matching: While matching algorithms are not normally found in standard
data mining software, they are useful in many specific data mining
applications. Fuzzy matching has been applied to discover patterns in the
data relative to user expectations [10]. Java software has been used to
completely automate document matching [11]. Matching can also be applied to
pattern identification in geometric environments [12].
There are a series of measures that have been applied to implement
memory-based reasoning. The simplest technique assigns a new observation
to the pre-classified example most similar to it. The Hamming distance
metric identifies the nearest neighbor as the example from the training
database with the highest number of matching fields (or lowest number of
non-matching fields). Case-based reasoning is a well-known expert system
approach that assigns new cases to the past case that is closest in some
sense. Thus case-based reasoning can be viewed as a special case of the
nearest neighbor technique.
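As a concrete illustration of the Hamming-distance flavour of memory-based reasoning described above, the following minimal Python sketch classifies a new record by copying the class of the training example with the fewest non-matching fields. The records, field values, and class labels are hypothetical.

```python
def hamming_distance(a, b):
    """Number of fields on which two records disagree."""
    return sum(1 for x, y in zip(a, b) if x != y)

def classify(new_record, training_set):
    """Assign the class of the nearest (most similar) pre-classified example."""
    nearest = min(training_set,
                  key=lambda ex: hamming_distance(new_record, ex["fields"]))
    return nearest["class"]

# Hypothetical pre-classified examples: (marital status, housing, employment)
training_set = [
    {"fields": ("married", "owner",  "employed"), "class": "low risk"},
    {"fields": ("single",  "renter", "student"),  "class": "high risk"},
    {"fields": ("married", "renter", "employed"), "class": "low risk"},
]

print(classify(("single", "renter", "student"), training_set))  # -> high risk
```

A weighted variant would simply multiply each mismatch by a per-field weight before summing, emphasizing some fields over others.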
Weighted Matching
Data mining can involve deletion of variables, but the usual attitude is to
retain data because you don’t know what it may provide. Weighting provides
another means to emphasize certain variables over others. All that would
change would be that the “Matches” measure could now represent a
weighted score for selecting the best matching case.
Distance Minimization
This concept uses the distance measured from the observation to be
classified to each of the observations in the known data set.
In this case, nominal and ordinal data need to be converted to
meaningful ratio data.
Strength of Memory-Based Reasoning
It produces results that are readily understandable
It is applicable to arbitrary data types, even non-relational data.
It works efficiently on any number of fields.
Maintaining the training set requires a minimal amount of effort.
Weaknesses of Memory-Based Reasoning
It is computationally expensive when doing classification and prediction
It requires a large amount of storage for the training set.
Results can be dependent on the choice of distance function, combination
function, and number of neighbours.
2.8 ASSOCIATION RULES IN KNOWLEDGE DISCOVERY
An association rule is an expression of the form X → Y, where X is a set of items and
Y is a single item. Association rule methods are an initial data exploration approach
that is often applied to extremely large data sets.
Association rule mining provides valuable information for assessing significant
correlations. It has been applied to a variety of fields, including medicine [13]
and medical insurance fraud detection [14].
Many algorithms have been proposed for mining association rules in large
databases. Most, such as the Apriori algorithm, identify correlations among
transactions consisting of categorical attributes using binary values. Some data
mining approaches involve weighted association rules for binary values [15] or time
intervals [16].
Data structure is an important issue due to the scale of data usually encountered [17].
Structured query language (SQL) has been a fundamental tool in manipulation of
database content. Knowledge discovery involves ad hoc queries, needing efficient
query compilation. Lopes et al. considered functional dependencies in inference
problems. SQL was used by those researchers to generate sets of attributes that
were useful in identifying item clusters.
Key measures in association rule mining include support and confidence.
Support refers to the degree to which a relationship appears in the data.
Confidence relates to the probability that if a precedent occurs, a
consequent will occur. The rule X → Y has minimum support value minsup
if minsup percent of transactions support X → Y; the rule X → Y holds with
minimum confidence value minconf if minconf percent of transactions that
support X also support Y. For example, from the transactions kept in
supermarkets, an association rule such as “Bread and Butter → Milk” could
be identified through association mining.
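The support and confidence measures just defined can be computed directly from transaction data. The Python sketch below uses made-up supermarket baskets and expresses both measures as fractions rather than percentages.

```python
# Hypothetical supermarket transactions, each a set of purchased items
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "butter", "milk", "eggs"},
    {"milk", "eggs"},
    {"bread", "milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(x, y):
    """Of the transactions that support X, the fraction that also support Y."""
    return support(x | y) / support(x)

x, y = {"bread", "butter"}, {"milk"}
print(support(x | y))    # support of the rule "Bread and Butter -> Milk": 0.4
print(confidence(x, y))
```

An Apriori-style algorithm would generate candidate itemsets and keep only those whose support exceeds minsup, then derive rules meeting minconf from the survivors.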
2.9 MARKET BASKET ANALYSIS
Market-basket analysis refers to methodologies studying the composition of a
shopping basket of products purchased during a single shopping event.
This technique has been widely applied to grocery store operations (as well as
other retailing operations, to include restaurants). Market basket data in its rawest
form would be the transactional list of purchases by customer, indicating only the
items purchased together (with their prices). This data is challenging because of a
number of characteristics [18]:
A very large number of records (often millions of transactions per day)
Sparseness (each market basket contains only a small portion of items
carried)
Heterogeneity (those with different tastes tend to purchase a specific
subset of items).
The aim of market-basket analysis is to identify what products tend to be purchased
together. Analyzing transaction-level data can identify purchase patterns, such as
which frozen vegetables and side dishes are purchased with steak during barbecue
season. This information can be used in determining where to place products in the
store, as well as aid inventory management. Product presentations and staffing can
be more intelligently planned for specific times of day, days of the week, or
holidays. Another commercial application is electronic couponing, tailoring coupon
face value and distribution timing using information obtained from market baskets [19].
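A first pass over such transactional data often simply counts which item pairs co-occur in the same basket. The Python sketch below does this with made-up baskets; real retail data would involve millions of transactions and a sparse item space, as noted above.

```python
from itertools import combinations
from collections import Counter

# Hypothetical transaction-level purchase lists (items bought in one shopping event)
baskets = [
    ["steak", "charcoal", "frozen corn"],
    ["steak", "charcoal", "beer"],
    ["bread", "milk"],
    ["steak", "charcoal"],
    ["bread", "milk", "beer"],
]

pair_counts = Counter()
for basket in baskets:
    # every unordered pair of distinct items purchased together
    pair_counts.update(combinations(sorted(set(basket)), 2))

for pair, count in pair_counts.most_common(2):
    print(pair, count)
```

The most frequent pairs are the candidates for co-location in the store, joint promotion, or coupon targeting.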
2.9.1 Market Basket Analysis Benefits
The ultimate goal of market basket analysis is finding the products that
customers frequently purchase together. The stores can use this information by
putting these products in close proximity of each other and making them more
visible and accessible for customers at the time of shopping.
These assortments can affect customer behavior and promote sales of
complementary items. This information can also be used to decide the layout of
catalogs, putting items with strong associations together in sales catalogs. The
advantage of using sales data for promotions and store layout is that consumer
behavior determines which items are associated. This information may
vary with the area and the assortment of items available in stores, and
point-of-sale data reflects the real behavior of the group of customers who
frequently shop at the same store. Catalogs designed on the basis of market
basket analysis are expected to be more effective in influencing consumer
behavior and promoting sales.
2.9.2 Strength of Market Basket Analysis
It produces clear and understandable results
It supports undirected data mining
It works on variable-length data.
The computations it uses are simple to understand
2.9.3 Weaknesses of Market Basket Analysis
It requires exponentially more computational effort as the problem size
grows.
It has limited support for attributes of the data
It is difficult to determine the right number of items
It discounts rare items
2.10 FUZZY SETS IN DATA MINING
Real-world applications are full of vagueness and uncertainty. Several theories on
managing uncertainty and imprecision have been advanced, including fuzzy set
theory [20], probability theory [21], rough set theory [22], and set pair theory [23].
Fuzzy set theory is used more than the others because of its simplicity and
similarity to human reasoning. Fuzzy modeling provides a very useful tool to deal
with human vagueness in describing scales of value. The advantage of the fuzzy
approach in data mining is that it serves as an “… interface between a numerical
scale and a symbolic scale which is usually composed of linguistic terms” [24].
Fuzzy association rules described in linguistic terms help users better understand
the decisions they face[25]. Fuzzy set theory is being used more and more frequently
in intelligent systems. A fuzzy set A in a universe U is defined as
A = {(x, µA(x)) | x ∈ U, µA(x) ∈ [0,1]}, where µA(x) is a membership function
indicating the degree of membership of x in A. The greater the value of µA(x), the
more x belongs to A. Fuzzy sets can also be thought of as an extension of the
traditional crisp sets and categorical/ordinal scales, in which each element is
either in the set or not in the set (a membership function of either 1 or 0).
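A membership function of this kind is easy to write down concretely. The Python sketch below defines a fuzzy set “tall” over heights; the 160 cm and 190 cm breakpoints and the linear ramp are arbitrary assumptions for illustration.

```python
def mu_tall(height_cm):
    """Degree of membership in the fuzzy set 'tall', in [0, 1]:
    0 at or below 160 cm, 1 at or above 190 cm, linear in between."""
    if height_cm <= 160:
        return 0.0
    if height_cm >= 190:
        return 1.0
    return (height_cm - 160) / 30

# Unlike a crisp set (membership exactly 0 or 1), membership grows gradually:
for h in (150, 175, 190):
    print(h, mu_tall(h))  # -> 150 0.0, then 175 0.5, then 190 1.0
```

This is the “interface between a numerical scale and a symbolic scale” quoted above: the number 175 maps to a partial degree of the linguistic term “tall”.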
Fuzzy set theory in its many manifestations (interval-valued fuzzy sets, vague sets,
grey-related analysis, rough set theory, etc.) is highly appropriate for dealing with
the masses of data available.
There are many data mining tools available: to cluster data, to help analysts
find patterns, and to find association rules. The majority of data mining approaches
to classification work with numerical and categorical information.
Most data mining software tools offer some form of fuzzy analysis.
Modifying continuous data is expected to degrade model accuracy, but might be
more robust with respect to human understanding. (Fuzzy representations might
lose accuracy with respect to numbers that don’t really reflect accuracy of human
understanding, but may better represent the reality humans are trying to express.)
Another approach to fuzzifying data is to make it categorical. Categorization of data is
expected to yield greater inaccuracy on test data. However, both treatments are still
useful if they better reflect human understanding, and might even be more accurate
on future implementations.
The categorical limits selected are key to accurate model development. Not many
data mining techniques take into account ordinal data features.
2.10.1 Fuzzy Association Rules
With the rapid growth of data in enterprise databases, making sense of valuable
information becomes more and more difficult. KDD (Knowledge Discovery in
Databases) can help to identify effective, coherent, potentially useful and previously
unknown patterns in large databases [26]. Data mining plays an important role in the
KDD process, applying specific algorithms for extracting desirable knowledge or
interesting patterns from existing datasets for specific purposes. Most of the
previous studies focused on categorical attributes.
Mining fuzzy association rules for quantitative values has long been considered by
a number of researchers, most of whom based their methods on the important
APriori algorithm [27]. Each of these researchers treated all attributes (or all the
linguistic terms) as uniform. However, in real-world applications, users may have
more interest in rules that contain fashionable items.
Decreasing the minimum support minsup and minimum confidence minconf to
obtain rules containing fashionable items is not ideal, because the efficiency of the
algorithm will be reduced and many uninteresting rules will be generated
simultaneously [28]. Weighted quantitative association rule mining based on a fuzzy
approach has been proposed (by Genesei) using two different definitions of
weighted support: with and without normalization [29].
In the non-normalized case, he used the product operator to define the combined
weight and fuzzy value.
The combined weight or fuzzy value becomes very small, even tending to zero,
when the number of items in a candidate itemset is large, so the support level is
very small. This can result in data overflow and make the algorithm terminate
unexpectedly when calculating the confidence value.
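The numerical problem just described is easy to reproduce: under the product operator, the combined fuzzy value of a candidate itemset shrinks geometrically with the number of items, and dividing by a support that has underflowed to zero breaks the confidence calculation. A small Python demonstration, with a made-up membership degree:

```python
membership = 0.1  # assumed fuzzy value of each item in the candidate itemset

for k in (2, 10, 400):
    # product operator over k items; at k = 400 the float underflows to 0.0
    print(k, membership ** k)

# Once the combined support underflows to zero, the confidence calculation
# (a division by that support) fails outright:
try:
    confidence = (membership ** 401) / (membership ** 400)
except ZeroDivisionError as err:
    print("confidence calculation failed:", err)
```

Normalizing the combined value, or working with log-domain sums instead of raw products, avoids this collapse.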
2.11 ROUGH SET
Rough set analysis is a mathematical approach that is based on the theory of rough
sets first introduced by Pawlak (1982) [22]. The purpose of rough sets is to discover
knowledge in the form of business rules from imprecise and uncertain data sources.
Rough set theory is based on the notion of indiscernibility and the inability to
distinguish between objects, and provides an approximation of sets or concepts by
means of binary relations, typically constructed from empirical data.
Such approximations can be said to form models of our target concepts; hence
the typical use of rough sets falls under the bottom-up approach to model
construction. The intuition behind this approach is the fact that in real life, when
dealing with sets, we often have no means of precisely distinguishing individual set
elements from each other due to limited resolution (lack of complete and detailed
knowledge) and uncertainty associated with their measurable characteristics.
As an approach to handling imperfect data, rough set analysis complements other
more traditional theories such as probability theory, evidence theory, and fuzzy set
theory.
2.11.1 A Brief Theory of Rough Sets
Statistical data analysis faces limitations in dealing with data with high levels of
uncertainty or with non-monotonic relationships among the variables.
The original idea behind Pawlak’s rough set theory was “… vagueness inherent to the
representation of a decision situation.
Vagueness may be caused by granularity of the representation. Due to the
granularity, the facts describing a situation are either expressed precisely
by means of ‘granules’ of the representation or only approximately” [30]. The
vagueness and imprecision problems are present in the information that describes
most real world applications.
2.11.2 Rough Sets as an Information System
In rough sets, an information system is a representation of data that describes
some objects. An information system S is a 4-tuple S = <U, Q, V, f>,
where U is the closed universe, a nonempty finite set of N objects {x1, x2, …, xN};
Q is a nonempty finite set of n attributes {q1, q2, …, qn} that uniquely characterize
the objects; V = ∪q∈Q Vq, where Vq is the set of values of attribute q; and
f : U × Q → V is the total decision function, called the information function, such
that f(x, q) ∈ Vq for every q ∈ Q and x ∈ U [31]. In an illustrative information
system describing six stores, for example, the stores would be the universe U, the
first three attributes Q, their possible values V, and the profit category the
decision function f.
20
Any pair (q, v), for q ∈ Q and v ∈ Vq, is called a descriptor in an information system S.
The information system can be represented as a finite data table, in which the
columns represent the attributes, the rows represent the objects and the cells
represent the attribute values f(x, q). Thus, each row in the table describes the
information about an object in S.
If we let S = <U, Q, V, f> be an information system, A ⊆ Q
be a subset of attributes, and x, y ∈ U be objects, then x and y are indiscernible by
the set of attributes A in S if and only if f(x, a) = f(y, a) for every a ∈ A. Every subset
of attributes A thus determines an equivalence relation on the universe U, denoted
IND(A) and called an indiscernibility relation. It can be defined as
IND(A) = {(x, y) ∈ U × U : f(x, a) = f(y, a) for all a ∈ A}. If the pair of objects
(x, y) belongs to the relation IND(A), then objects x and y are called indiscernible
with respect to attribute set A. In other words, we cannot distinguish object x from
y based on the information contained in the attribute set A.
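The indiscernibility relation is straightforward to compute from a data table. The Python sketch below uses a made-up information system of four objects, checks whether pairs are indiscernible with respect to an attribute subset A, and then lists the equivalence classes of IND(A).

```python
# Hypothetical information function f(x, q): objects described by attribute values
f = {
    "x1": {"size": "big",   "colour": "red", "shape": "round"},
    "x2": {"size": "big",   "colour": "red", "shape": "square"},
    "x3": {"size": "small", "colour": "red", "shape": "round"},
    "x4": {"size": "big",   "colour": "red", "shape": "square"},
}

def indiscernible(x, y, attrs):
    """(x, y) is in IND(A) iff f(x, a) == f(y, a) for every attribute a in A."""
    return all(f[x][a] == f[y][a] for a in attrs)

A = ["size", "colour"]
print(indiscernible("x1", "x2", A))  # True: they agree on size and colour
print(indiscernible("x1", "x3", A))  # False: they differ on size

# The equivalence classes of IND(A) partition the universe U:
classes = {}
for x in f:
    classes.setdefault(tuple(f[x][a] for a in A), []).append(x)
print(list(classes.values()))  # -> [['x1', 'x2', 'x4'], ['x3']]
```

Note that dropping an attribute from A can merge classes: with A = ["colour"] alone, all four objects would be indiscernible, which is exactly the loss of resolution the theory is built around.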
2.11.3 Some Exemplary Applications of Rough Sets
Most of the successful applications of rough sets are in the field of medicine, more
specifically, in medical diagnosis or prediction of outcomes.
Rough sets have been applied to analyze a database of patients with duodenal ulcer
treated by highly selective vagotomy (HSV) [32]. The goal was to predict the
long-term success of the operation, as evaluated by a surgeon into four outcome classes.
This successful HSV study is still one of the few data analysis studies, regardless of
methodology, that has managed to cross the clinical deployment barrier. There has
been a steady stream of rough set applications in medicine. Some more recent
applications include analysis of breast cancer [33] and other forms of diagnosis [34], as
well as support to triage of abdominal pain [35] and analysis of Medicaid Home Care
Waiver programs [36].
In addition to medicine, rough sets have also been applied to a wide range of
application areas, including real estate property appraisal [37], predicting
bankruptcy [38], and predicting gaming ballot outcomes [39]. Rough sets have been
applied to identify better stock trading timing [40], to enhance support vector
machine models in manufacturing process document retrieval [41], and to evaluate
safety performance of construction firms [42]. Rough sets have thus been useful in
many applications.
2.12 SUPPORT VECTOR MACHINES
Support vector machines (SVMs) are supervised learning methods that generate
input-output mapping functions from a set of labeled training data [5].
The mapping function can be either a classification function (used to categorize the
input data) or a regression function (used for estimation of the desired output). For
classification, nonlinear kernel functions are often used to transform the input data
(inherently representing highly complex nonlinear relationships) to a high
dimensional feature space in which the input data becomes more separable (i.e.,
linearly separable) compared to the original input space. Then, the maximum-
margin hyperplane is constructed to optimally separate the classes in the
training data: two parallel hyperplanes, one on each side of the separating
hyperplane, are pushed apart so as to maximize the distance between them [5].
The assumption is that the larger the margin (the distance between these
parallel hyperplanes), the smaller the generalization error of the classifier will be.
SVMs belong to a family of generalized linear models which achieve a classification
or regression decision based on the value of a linear combination of features.
They are also said to belong to the family of "kernel methods".
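The kernel idea described above can be illustrated with a short sketch, assuming the scikit-learn library is available; the dataset is synthetic (two concentric classes that are not linearly separable in the input space).

```python
# A minimal sketch of maximum-margin classification with an RBF kernel,
# assuming scikit-learn; the data are synthetic, purely for illustration.
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Nonlinearly separable data: one class forms a ring around the other.
X, y = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The RBF kernel implicitly maps inputs to a high-dimensional feature
# space in which a maximum-margin hyperplane can separate the classes.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```

A linear kernel would fail on this data; the transformation into the feature space is what makes the classes linearly separable.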
Figure 2.3 Process map and the main steps of the rough sets analysis
In addition to its solid mathematical foundation in statistical learning theory, SVMs
have demonstrated highly competitive performance in numerous real-world
applications, such as medical diagnosis, bioinformatics, face recognition, image
processing and text mining, which has established SVMs as one of the most popular,
state-of-the-art tools for knowledge discovery and data mining.
Similar to artificial neural networks, SVMs possess the well-known ability of being
universal approximators of any multivariate function to any desired degree of
accuracy. Therefore, they are of particular interest to modeling highly nonlinear,
complex systems and processes.
Regression
A version of an SVM for regression, called support vector
regression (SVR), has also been proposed. The model produced by support vector classification
(as described above) only depends on a subset of the training data, because
the cost function for building the model does not care about training points
that lie beyond the margin. Analogously, the model produced by SVR only
depends on a subset of the training data, because the cost function for
building the model ignores any training data that are close (within a
threshold ε) to the model prediction [6].
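The role of the ε threshold can be seen in a small sketch, assuming scikit-learn; the data are synthetic (a noisy sine curve), and widening the ε-tube should leave fewer support vectors.

```python
# A hedged sketch of epsilon-insensitive support vector regression,
# assuming scikit-learn; training points within epsilon of the prediction
# incur no loss and therefore do not become support vectors.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, 80)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 80)  # noisy sine curve

# A wider epsilon-tube ignores more points, so fewer support vectors remain.
for eps in (0.01, 0.2):
    model = SVR(kernel="rbf", C=10.0, epsilon=eps).fit(X, y)
    print(f"epsilon={eps}: {len(model.support_)} support vectors")
```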
2.12.1 Use of SVM – A Process-Based Approach
Due largely to their superior classification performance, support vector machines
(SVMs) have recently become a popular technique for classification-type problems. Even
though they are considered easier to use than artificial neural networks, users
who are not familiar with the intricacies of SVMs often obtain unsatisfactory results. In
this section we provide a process-based approach to the use of SVMs which is more
likely to produce good results.
1. Preprocess the data
   Scrub the data: deal with missing values, presumably incorrect values,
   and noise in the data.
2. Transform the data
   Numerisize the data.
   Normalize the data.
3. Develop the model(s)
   Select the kernel type (RBF is a natural choice).
   Determine the kernel parameters for the selected kernel type (e.g.,
   C and γ for RBF) – a hard problem. One should consider using
   cross-validation and experimentation to determine the appropriate
   values for these parameters.
   If the results are satisfactory, finalize the model; otherwise change the
   kernel type and/or kernel parameters to achieve the desired accuracy
   level.
4. Extract and deploy the model.
Figure 2.4 A process-based approach to the use of SVMs
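The parameter-determination step in the process above, using cross-validation over candidate values of C and γ, can be sketched as follows, assuming scikit-learn; the dataset and grid values are illustrative choices.

```python
# A sketch of kernel-parameter selection by cross-validation, assuming
# scikit-learn: a grid of candidate C and gamma values for the RBF kernel
# is evaluated, as the process above recommends.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]},
    cv=5,  # 5-fold cross-validation for each parameter combination
)
grid.fit(X, y)
print("best parameters:", grid.best_params_)
print("cross-validated accuracy:", round(grid.best_score_, 3))
```

If the best cross-validated accuracy is unsatisfactory, the grid (or the kernel type itself) would be changed, exactly as the process prescribes.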
2.12.2 Support Vector Machines versus Artificial Neural Networks
The development of ANNs followed a heuristic path, with applications and
extensive experimentation preceding theory. In contrast, the development of SVMs
involved sound theory first, then implementation and experiments.
A significant advantage of SVMs is that while ANNs can suffer from multiple local
minima, the solution to an SVM is global and unique.
Two further advantages of SVMs are that they have a simple geometric interpretation
and give a sparse solution. Unlike ANNs, the computational complexity of SVMs
does not depend on the dimensionality of the input space. ANNs use empirical risk
minimization, whilst SVMs use structural risk minimization. The reason that SVMs
often outperform ANNs in practice is that they address the biggest problem with
ANNs: SVMs are less prone to overfitting.
They differ radically from comparable approaches such as neural networks:
SVM training always finds a global minimum, and their simple geometric
interpretation provides fertile ground for further investigation.
Most often Gaussian kernels are used, in which case the resulting SVM
corresponds to an RBF network with Gaussian radial basis functions. As the
SVM approach "automatically" solves the network complexity problem, the size
of the hidden layer is obtained as the result of the QP procedure. Hidden
neurons and support vectors correspond to each other, so the center problem
of the RBF network is also solved, as the support vectors serve as the basis
function centers.
In problems when linear decision hyperplanes are no longer feasible, an
input space is mapped into a feature space (the hidden layer in NN
models), resulting in a nonlinear classifier.
SVMs, after the learning stage, create the same type of decision
hypersurfaces as do some well-developed and popular NN classifiers.
Note that the training of these diverse models is different. However,
after the successful learning stage, the resulting decision surfaces are
identical.
Unlike conventional statistical and neural network methods, the SVM
approach does not attempt to control model complexity by keeping the
number of features small.
Classical learning systems like neural networks suffer from
theoretical weaknesses, e.g. back-propagation usually converges only to
locally optimal solutions. Here SVMs can provide a significant
improvement.
In contrast to neural networks, SVMs automatically select their model
size (by selecting the support vectors).
The absence of local minima from the above algorithms marks a major
departure from traditional systems such as neural networks.
While the weight decay term is an important aspect for obtaining good
generalization in the context of neural networks for regression, the margin
plays a somewhat similar role in classification problems.
In comparison with traditional multilayer perceptron neural networks
that suffer from the existence of multiple local minima solutions,
convexity is an important and interesting property of nonlinear SVM
classifiers.
SVMs have been developed in the reverse order to the development of
neural networks (NNs). SVMs evolved from sound theory to
implementation and experiments, while NNs followed a more
heuristic path, from applications and extensive experimentation to
theory.
2.12.3 Disadvantages of Support Vector Machines
Besides their advantages, from a practical point of view SVMs
have some limitations. An important practical question that is not entirely
solved is the selection of the kernel function parameters – for Gaussian
kernels the width parameter σ – and the value of ε in the ε-insensitive
loss function.
A second limitation is speed and size, both in training and testing: SVMs
involve complex and time-demanding calculations. From a practical point of
view, perhaps the most serious problem with SVMs is the high algorithmic
complexity and extensive memory requirements of the required quadratic
programming in large-scale tasks. Shi et al. have conducted comparative
testing of SVMs against other algorithms on real credit card data.
The processing of discrete data presents another problem.
Despite these limitations, because SVMs are based on a sound theoretical foundation
and the solutions they produce are global and unique in nature (as opposed to getting
stuck in local minima), they are nowadays among the most popular prediction modeling
techniques in the data mining arena. Their use and popularity will only increase as
popular commercial data mining tools incorporate them into their
modeling arsenals [43].
2.13 PERFORMANCE EVALUATION FOR PREDICTIVE MODELING
Once a predictive model is developed using the historical data, one would be
curious as to how the model will perform for the future (on the data that it has not
seen during the model building process). One might even try multiple model types
for the same prediction problem, and then, would like to know which model is the
one to use for the real-world decision making situation, simply by comparing them
on their prediction performance (e.g., accuracy). But, how do you measure the
performance of a predictor? What are the commonly used performance metrics?
What is accuracy? How can we accurately estimate the performance measures? Are
there methodologies that are better in doing so in an unbiased manner? These
questions are answered in the following sub-sections. First, the most commonly
used performance metrics will be described, then a wide range of estimation
methodologies are explained and compared to each other [5].
2.13.1 Performance Metrics for Predictive Modeling
In classification problems, the primary source of performance measurements
is a coincidence matrix (a.k.a. classification matrix or a contingency table).
The numbers along the diagonal from upper-left to lower-right represent the
correct decisions made, and the numbers outside this diagonal represent the errors.
The true positive rate (also called hit rate or recall) of a classifier is estimated by
dividing the correctly classified positives (the true positive count) by the total
positive count. The false positive rate (also called false alarm rate) of the classifier is
estimated by dividing the incorrectly classified negatives (the false positive count)
by the total negative count.
The overall accuracy of a classifier is estimated by dividing the total number of correctly
classified positives and negatives by the total number of samples.
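The metrics defined above can be computed directly from the cells of the coincidence matrix; the counts below are made up purely for illustration.

```python
# Illustrative counts from a hypothetical coincidence (confusion) matrix.
tp, fn = 80, 20   # actual positives: correctly / incorrectly classified
fp, tn = 10, 90   # actual negatives: incorrectly / correctly classified

true_positive_rate = tp / (tp + fn)   # hit rate, recall, sensitivity
false_positive_rate = fp / (fp + tn)  # false alarm rate
accuracy = (tp + tn) / (tp + fn + fp + tn)
specificity = tn / (tn + fp)
precision = tp / (tp + fp)
# The F-measure is the harmonic mean of precision and recall.
f_measure = (2 * precision * true_positive_rate
             / (precision + true_positive_rate))

print(true_positive_rate, false_positive_rate, accuracy)  # 0.8 0.1 0.85
```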
Other performance measures, such as recall (a.k.a. sensitivity), specificity
and the F-measure, are also used, for example for calculating aggregated performance
measures (e.g., the area under the ROC curve).
2.13.2 Estimation Methodology for Classification Models
Estimating the accuracy of a classifier induced by some supervised learning
algorithms is important for the following reasons. First, it can be used to estimate
its future prediction accuracy which could imply the level of confidence one should
have in the classifier’s output in the prediction system. Second, it can be used for
choosing a classifier from a given set (selecting the "best" model from two or more
candidate models). Lastly, it can be used to assign confidence levels to multiple
classifiers so that the outcome of a combining classifier can be optimized. Combined
classifiers are becoming increasingly popular due to empirical results
suggesting that they produce more robust and more accurate predictions
compared to the individual predictors. For estimating the final accuracy of a
classifier one would like an estimation method with low bias and low variance. In
some application domains, to choose a classifier or to combine classifiers the
absolute accuracies may be less important and one might be willing to trade off bias
for low variance.
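One widely used estimation methodology that addresses these bias and variance concerns is k-fold cross-validation; the sketch below assumes scikit-learn, and uses a decision tree purely as an example classifier.

```python
# A sketch of k-fold cross-validation as an estimation methodology,
# assuming scikit-learn: the mean over folds estimates future accuracy,
# and the spread across folds indicates the variance of that estimate.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(f"estimated accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

The same procedure, run for several model types, gives the unbiased comparison described above for choosing the "best" model.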
2.14 DECISION TREES
Decision trees are powerful and popular tools for classification and prediction. The
attractiveness of tree-based methods is due in large part to the fact that, in contrast
to neural networks, decision trees represent rules. Rules can readily be expressed
in English so that humans can understand them, or in a database access language
like SQL so that records falling into a particular category may be retrieved.
There is a variety of algorithms for building decision trees which share the
desirable trait of explicability. Two of the most popular go by the acronyms CART
and CHAID, which stand respectively for Classification and Regression Trees and
Chi-square Automatic Interaction Detection. A newer algorithm, C4.5, is gaining
popularity and is now available in several software packages [5].
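The point that a fitted tree can be read back as human-understandable rules can be sketched briefly, assuming scikit-learn (whose decision tree is a CART-style implementation).

```python
# A hedged sketch showing that a fitted decision tree can be read back
# as rules, assuming scikit-learn (which implements a CART-style algorithm).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(data.data, data.target)

# export_text prints the tree as nested if/else rules on the features,
# which could equally be translated into English or SQL predicates.
print(export_text(tree, feature_names=list(data.feature_names)))
```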
2.14.1 Strengths of Decision-Tree Methods
The strengths of decision trees are:
Decision trees are able to generate understandable rules.
Decision trees perform classification without requiring much computation.
Decision trees are able to handle both continuous and categorical variables.
Decision trees provide a clear indication of which fields are most important
for prediction or classification.
Figure 2.5 A beverage prediction tree
2.14.2 Weakness of Decision Trees Methods
Decision trees are less appropriate for estimation tasks where the goal is to predict
the value of a continuous variable such as income, blood pressure or interest rate.
Decision trees are also problematic for time-series data unless a lot of effort is put
into presenting the data in such a way that trends and sequential patterns are made
visible.
2.14.3 Application of Decision Tree Methods
Decision-tree methods are a good choice when the data mining task is classification
of records or prediction of outcomes. Use decision trees when your goal is to assign
each record to one of a few broad categories. Decision trees are also a natural
choice when your goal is to generate rules that can be easily understood, explained,
and translated into SQL or a natural language.
2.15 GENETIC ALGORITHM
Genetic Algorithms, first introduced by Holland in 1975 [44], have been applied to a
variety of problems and offer intriguing possibilities for general purpose adaptive
search algorithms in artificial intelligence, especially, but not necessarily, for
situations where it is difficult or impossible to precisely model the external
circumstances faced by the program. Search based on evolutionary models had, of
course, been tried before Holland. However, these models were based on mutation
and natural selection and were not notably successful. The principal difference of
Holland’s approach was the incorporation of a ’crossover’ operator to mimic the
effect of sexual reproduction.
Figure 2.6 below illustrates the basic idea of GA
Figure 2.6 Generic Model for Genetic Algorithm
Genetic algorithms are mathematical procedures utilizing the process of genetic
inheritance. They have been usefully applied to a wide variety of analytic problems.
Data mining can combine human understanding with automatic analysis of data to
detect patterns or key relationships. Given a large database defined over a number
of variables, the goal is to efficiently find the most interesting patterns in the
database. Genetic algorithms have been applied to identify interesting patterns in
some applications.
They are usually used in data mining to improve the performance of other
algorithms, one example being decision tree algorithms, another association rules.
Genetic algorithms require a certain data structure. They operate on a population
with characteristics expressed in categorical form. The analogy with genetics is
that the population (genes) consists of characteristics (alleles). One way to
implement genetic algorithms is to apply operators (reproduction, crossover,
selection) with the feature of mutation to enhance generation of potentially better
combinations. The genetic algorithm process is thus:
1. Randomly select parents.
2. Reproduce through crossover. Reproduction is the operator choosing which
individual entities will survive. In other words, some objective function or selection
characteristic is needed to determine survival.
Crossover relates to changes in future generations of entities.
3. Select survivors for the next generation through a fitness function.
4. Mutation is the operation by which randomly selected attributes of randomly
selected entities in subsequent operations are changed.
5. Iterate until either a given fitness level is attained, or the preset number of
iterations is reached.
Genetic algorithm parameters include population size, crossover rate (the
probability that individuals will crossover), and the mutation rate (the probability
that a certain entity mutates) [45].
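The five steps and three parameters above can be sketched on a toy problem; the "OneMax" fitness function (count of 1-bits) and all parameter values below are illustrative choices, not part of the source.

```python
# A minimal genetic algorithm sketch on the toy "OneMax" problem
# (fitness = number of 1-bits in the string); population size, crossover
# rate (PC) and mutation rate (PM) are the parameters the text lists.
import random

random.seed(0)
BITS, POP, GENS, PC, PM = 20, 30, 60, 0.9, 0.02

def fitness(ind):
    """Objective function deciding survival: count of 1-bits."""
    return sum(ind)

# Initial population of random bit strings.
pop = [[random.randint(0, 1) for _ in range(BITS)] for _ in range(POP)]

for _ in range(GENS):                       # 5. iterate a preset number of times
    nxt = []
    while len(nxt) < POP:
        # 1. randomly select parents (tournament of three, fittest wins)
        p1 = max(random.sample(pop, 3), key=fitness)
        p2 = max(random.sample(pop, 3), key=fitness)
        c1, c2 = p1[:], p2[:]
        if random.random() < PC:            # 2. reproduce through crossover
            cut = random.randrange(1, BITS)
            c1, c2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
        for c in (c1, c2):                  # 4. mutate randomly selected bits
            for i in range(BITS):
                if random.random() < PM:
                    c[i] = 1 - c[i]
        nxt += [c1, c2]
    pop = nxt[:POP]                         # 3. survivors form the next generation

best = max(pop, key=fitness)
print("best fitness:", fitness(best), "out of", BITS)
```

In a real data mining setting, `fitness` would instead score a candidate model or rule set against the data.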
2.15.1 Genetic Algorithm Advantages: Genetic algorithms are very easy to
develop and to validate, which makes them highly attractive where they apply. The
algorithm is parallel, meaning that it can be applied to large populations efficiently.
The algorithm is also efficient in that if it begins with a poor original solution, it can
rapidly progress to good solutions. The use of mutation makes the method capable of
identifying global optima even in very nonlinear problem domains. The method
does not require knowledge about the distribution of the data.
2.15.2 Genetic Algorithm Disadvantages: Genetic algorithms require
mapping data sets to a form in which attributes have discrete values for the genetic
algorithm to work with. This is usually possible, but can lose a great deal of
detailed information when dealing with continuous variables. Coding the data into
categorical form can unintentionally lead to biases in the data.
There are also limits to the size of data set that can be analyzed with genetic
algorithms. For very large data sets, sampling will be necessary, which leads to
different results across different runs over the same data set.
2.15.3 GA Operators
Selection
This is the procedure for choosing individuals (parents) on which to perform
crossover in order to create new solutions. The idea is that the ‘fitter’ individuals
are more prominent in the selection process, with the hope that the offspring they
create will be even fitter still.
Two commonly used procedures are ‘roulette wheel’ and ‘tournament’ selection.
In roulette wheel, each individual is assigned a slice of a wheel, the size of the
slice being proportional to the fitness of the individual. The wheel is then spun
and the individual opposite the marker becomes one of the parents. In
tournament selection several individuals are chosen at random and the fittest
becomes one of the parents.
Crossover
Along with mutation, crossover is the operator that creates new candidate
solutions. A position is randomly chosen on the string and the two parents are
‘crossed over’ at this point to create two new solutions. Multiple point crossover
is where this occurs at several points along the string. A crossover probability
(Pc) is often given which enables a chance that the parents descend into the next
generation unchanged.
Mutation
After crossover, each bit of the string has the potential to mutate, based on a
mutation probability (Pm). In binary encoding mutation involves the flipping of
a bit from 0 to 1 or vice versa.
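The two selection procedures described above can be compared with a small sketch; the population and fitness values are made up for illustration.

```python
# A small sketch of roulette-wheel and tournament selection; the
# individuals and fitness values are invented for the example.
import random

random.seed(1)
population = ["A", "B", "C", "D"]
fitness = {"A": 1.0, "B": 2.0, "C": 3.0, "D": 4.0}

def roulette(pop):
    # Each individual gets a wheel slice proportional to its fitness.
    return random.choices(pop, weights=[fitness[i] for i in pop], k=1)[0]

def tournament(pop, k=2):
    # k individuals are drawn at random and the fittest becomes a parent.
    return max(random.sample(pop, k), key=lambda i: fitness[i])

picks = [roulette(population) for _ in range(10000)]
# "D" (fitness 4) should be chosen about four times as often as "A" (fitness 1).
print("D picked", picks.count("D"), "times; A picked", picks.count("A"), "times")
```

Increasing the tournament size `k` raises the selection pressure, since weaker individuals have less chance of being the fittest in their tournament.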
2.15.4 Application of Genetic Algorithms in Data Mining
Genetic algorithms have been applied to data mining in two ways. External
support is through evaluation or optimization of some parameter for another
learning system, often hybrid systems using other data mining tools such as
clustering or decision trees. In this sense, genetic algorithms help other data mining
tools operate more efficiently. Genetic algorithms can also be directly applied to
analysis, where the genetic algorithm generates descriptions, usually as decision
rules or decision trees. Many applications of genetic algorithms within data mining
have been outside of business.
Specific examples include medical data mining and computer network intrusion
detection. In business, genetic algorithms have been applied to customer
segmentation, credit scoring, and financial security selection.
Genetic algorithms can be very useful within a data mining analysis dealing with
many attributes and many more observations. They avoid the brute-force checking of
all combinations of variable values, which can make some data mining algorithms
more effective. However, the application of genetic algorithms requires expression of
the data as discrete outcomes, with a calculable functional value upon which to
base selection.
This does not fit all data mining applications. Genetic algorithms are useful because
sometimes it does fit. We review an application to demonstrate some of the aspects
of genetic algorithms.
2.16 NEURAL NETWORKS
Neural computation is introduced as an intelligent system relating the processing
parameters to the process responses. Such a system is based on an artificial neural
network (ANN) [46], which is an interconnected structure of processing elements
called neurons. The ANN structure consists of the input pattern representing the
processing parameters, the output pattern, and the hidden layers describing implicitly
the correlations between the processing parameters and the output characteristics.
The connection between a pair of neurons is described by a number called a
weight, expressing the strength of the connection [47].
Three steps are required to optimize the ANN structure: the
training, validation and testing steps. There are several types of neural network
architectures, but this study focuses on the multilayer perceptron (MLP) and the
backpropagation network.
2.16.1 BIOLOGICAL BACKGROUND
Structural and Functional Organization of the Brain
The inspiration for the development of ANNs lies in the organization and
functionality of the (human) brain. The brain is organized in different structural
levels, which correspond to small-scale and large-scale anatomical and
functional organizations. Different functions take place in different organization
levels. The hierarchy of these levels is shown in Fig. 2.8, from the lowest
(bottom) to the highest (top). Therefore, the lowest (basic) level of brain
structural organization is the molecular level and the highest is the Central
Nervous System [48].
The synapses are the neuronal interconnections and their function depends on
specific molecules and ions. The next level is the neural microcircuit, which is an
assembly of synaptic connections organized to produce a specific functional
operation. The neural microcircuits are grouped to form dendritic subunits
that are parts of the dendritic trees of individual neurons. It is believed that
neurons are the simplest computing units in the brain, the simplest elements that
can perform computational tasks. At the next hierarchical and complexity level
Figure 2.7 Structure of a neural cell in human brain
we have local neural circuits (neural networks), which are constructed from the
same type of neurons, and are able to perform operations characteristic of a
localized region of the brain [49].
Figure 2.8 Schematic structural organization of the brain.
At a higher level, these neural circuits are organized into interregional circuits
that involve multiple regional neural networks located in different parts of the
brain, connected through specific pathways, columns and topographic maps. These
structures are organized to respond to incoming sensory information.
Neurophysiological experiments have shown clearly that different sensory
inputs (motor, somatosensory, visual, auditory, etc.) are mapped onto
specialized corresponding areas of the cerebral cortex. At the ultimate level of
complexity and hierarchy, the interregional circuits mediate specific types of
behaviour in the central nervous system.
2.16.2 The Neuron
The key word to understand the brain structural organization and function is the
neuron. The idea of the neuron was introduced by Ramón y Cajal in 1911 and
refers to the fundamental logical units of which the whole Central Nervous System
consists. It is indicative that the neuron lies somewhere in the middle of the
structural organization of the brain shown in Fig. 2.8. A neuron is a nerve cell with
all of its processes. Neurons are one of the main distinguishing features of animals
(plants do not have nerve cells). Neurons come in a wide variety of shapes, sizes
and functionality in different parts of the brain. The number of different classes of
neurons that have been identified in humans lies between seven and a hundred (the
wide variation in that estimate is related to how restrictively a class of
neurons is defined) [49]. A simple representation of a neuron is shown in Fig. 2.9.
Fig 2.9 Schematic representation of a typical neuron
As shown in Fig. 2.9, the neuron typically consists of three main parts: the
dendrites (or dendritic tree) and the synapses (or synaptic connections or synaptic
terminals), the neuron cell body, and the axon. Typically the neuron can be in two
states:
the resting state, where no electrical signal is generated, and the firing state, where
the neuron depolarises and an electrical signal is generated (that is, the output of
the neuron) [48].
The neuron receives inputs from other neurons that are connected to it, via
synaptic connections that are mainly positioned on the dendrites. The incoming
signals (which are in the form of positive or negative electrical potentials) are
summed in the neuron's cell body (also called the soma) and if the resulting sum exceeds a
certain amount, referred to as the activation threshold, then the neuron
depolarises and an electrical pulse is generated. This pulse is commonly known as
an action potential or spike.
Originating at or close to the cell body of the neuron the action potential propagates
through the axon of the neuron at constant velocity and amplitude to the synaptic
terminals. Through these synaptic terminals the electrical signals generated at one
neuron are transmitted to the neurons that it is interconnected to.
Typically, neural events happen in the millisecond (10⁻³ s) range, whereas in a
silicon chip the corresponding time range is of the order of nanoseconds (10⁻⁹ s).
Thus, biological neurons are five to six orders of magnitude slower than silicon
chips.
2.16.3 Dendrites and Synapses
The dendrites, the receptive zones of the neuron, have an irregular surface and a
great number of branches. As shown at the top right of Fig. 2.9, dendritic
spines and synaptic inputs are observed on a dendrite. These synaptic inputs
are the points at which a neuron is connected to other neurons and receives input signals
from them. Thus synapses are the elementary functional and structural units that
mediate the interactions between neurons. Between one and ten thousand
incoming synapses are typical for cortical neurons. With respect to the nature of the
signal that is transferred through a synapse there are two kinds of synaptic
connections, the chemical synapse and the electrical synapse, with the former being
the most common [49].
In the case of the chemical synapse there is no actual contact between the presynaptic and
the postsynaptic neuron. Instead there is a synaptic gap (synaptic cleft), and the
chemical synapse operates as follows: when an electrical signal arrives from the
presynaptic neuron at the synapse, a process at the presynaptic neuron liberates a
number of molecules of a chemical substance called a neurotransmitter. These
neurotransmitter molecules diffuse across the synaptic gap and are captured in
specialized regions of the dendrites of the postsynaptic neuron by molecules
called neuroreceptors, generating electrical signals in the postsynaptic
region. Thus, a chemical synapse converts electrical signals that are generated in
the presynaptic neuron into chemical signals that travel through the synaptic gap
and then back into postsynaptic electrical signals.
It is obvious that this kind of synaptic transmission is unidirectional and
nonreciprocal, i.e., chemical synapses carry signals from a neuron that always plays
the role of the presynaptic unit to another neuron that always plays the role of the
postsynaptic unit. This is the main difference between chemical and electrical
synapses [49].
In the case that two neurons are interconnected via an electrical synapse, an
electrical signal can be transmitted from the neuron with the higher voltage to the one
with the lower voltage; thus signal transmission can be bi-directional in electrical
synapses. This characteristic of electrical synapses means that there is no fixed
presynaptic and postsynaptic neuron in this kind of synaptic connection, and
these roles can be interchanged depending on the electrical conditions on each
of the interconnected neurons.
Besides distinguishing synapses as chemical or electrical according
to the nature of the transmitted signal, we can classify synapses, with respect to the
kind of activation produced in the postsynaptic neuron, into two main
categories: excitatory synapses and inhibitory synapses. In the first case,
that of the excitatory synapse, the electrical potential transmitted to the
postsynaptic neuron is positive and has an excitation effect. In the second case, that of
the inhibitory synapse, the postsynaptic potential is negative and imposes
inhibition on the postsynaptic neuron.
A key point in the synaptic transmission is that the signals are weighted. That is,
some postsynaptic potentials are stronger than others.
2.16.4 Neuron Cell Body
The neuron cell body (or soma) has a triangular-like form and contains the
nucleus of the cell. As shown in Fig. 2.9, the dendrites lead into the
neuron cell body, carrying the incoming inputs (electrical signals generated by the
postsynaptic potentials). These electrical signals affect the membrane potential of
the cell body of the neuron. Typically, when in the resting state, the membrane
potential of a neuron is approximately –70 mV. If the incoming postsynaptic
potential is positive (excitatory) the membrane potential is increased, moving
closer to the firing state. If the incoming postsynaptic potential is negative
(inhibitory) the membrane potential is decreased, moving away from the firing
state [49].
All the incoming postsynaptic potentials are summed in both time and space
(temporal and spatial summation). If the resulting sum is equal to or greater than
the firing threshold of the neuron, so that the membrane potential exceeds a certain
value (typically –60 mV), then the neuron depolarises (fires) and an action potential
is generated and propagated through the axon of the neuron to the synaptic
terminals.
After firing, the neuron returns to the resting state and the membrane potential to
the appropriate resting value. This is not done instantaneously, but takes a little
time, called the refractory period of the neuron. When the refractory period has
passed, the neuron is ready to fire again if it receives the appropriate input.
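The firing rule just described can be caricatured in a few lines of code; the potential values below follow the text (–70 mV resting, –60 mV threshold), while the input potentials are made up for illustration.

```python
# A toy sketch of the firing rule described above: postsynaptic potentials
# are summed onto the membrane potential, and the neuron fires once the
# potential crosses the threshold; the input values are illustrative only.
RESTING, THRESHOLD = -70.0, -60.0   # membrane potentials in mV

def run(neuron_inputs):
    potential, spikes = RESTING, 0
    for psp in neuron_inputs:        # excitatory (+) or inhibitory (-) PSPs
        potential += psp
        if potential >= THRESHOLD:   # depolarisation: an action potential
            spikes += 1
            potential = RESTING      # return to rest (refractory period)
    return spikes

# Four excitatory potentials of +3 mV push the neuron past the threshold once.
print(run([3, 3, 3, 3, -2, 3, 3, 3]))
```

This deliberately ignores temporal decay of the potential; it only illustrates the summation-and-threshold behaviour described in the text.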
2.16.5 The Axon
In cortical neurons, the axon is very long and thin and is characterized by high
electrical resistance and very large capacitance. The neural axon is the main
transmission line of the neuron and propagates the action potential. The axon has a
smoother surface than the dendrites and carries the characteristic nodes of Ranvier
(not shown in Fig. 2.9) that help the propagation of the action potential along the
axon. The axon terminates at the synaptic terminals that establish the
interconnection of the neuron to other neurons [49].
2.16.6 The Neuron Model
To build up an ANN we need to model the biological neuron, the elementary
computing unit in the brain that is capable of performing information-processing
operations. The simplest model of a neuron is shown in Fig. 2.10.
Neurons, also referred to as processing elements (PEs), nodes, short-term memory
devices, or threshold logic units, are the ANN components where most, if not all,
of the computing is done. The generic model of the neuron shown in Fig. 2.10
constitutes the basis for designing and implementing ANNs. As indicated in Fig.
2.10, there are three basic elements of the neuronal model: a set of synapses or
synaptic (connecting) links, an adder (logical unit) and an activation function
(threshold function).
The synapses, or connecting links, carry the input signals to the neuron, coming
from either the environment or the outputs of other neurons. Each synapse is
characterized by a weight or strength of its own, which affects the impact of the
specific input.
Therefore, the incoming signals to a neuron are weighted, i.e. multiplied by the
appropriate value of the synaptic weight. To be more specific, a signal xj at the input
of synapse j of the kth neuron is multiplied by the synaptic weight wkj. In the
notation followed here, the first subscript refers to the neuron in question and the
second subscript refers to the input to which the weight refers.
Figure 2.10 Model of a neuron.
In general, and in keeping with the biological picture, there are two primary types
of synaptic connections: excitatory and inhibitory. Excitatory connections increase
the neuron's activation and are typically represented by positive signals. Inhibitory
connections, on the other hand, decrease the neuron's activation and are typically
represented by negative signals. The two types of connections are implemented
using positive and negative values, respectively, for the corresponding synaptic
weights [49].
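As a minimal sketch of this weighting step, the effect of excitatory (positive) and inhibitory (negative) synaptic weights can be illustrated in Python; the input and weight values below are hypothetical:

```python
# Weighted synaptic inputs for a single neuron k (hypothetical values).
# Positive weights model excitatory connections, negative weights inhibitory ones.
inputs = [0.5, 1.0, 0.8]       # signals x_1, x_2, x_3 arriving at the synapses
weights = [0.7, -0.4, 0.2]     # w_k1 (excitatory), w_k2 (inhibitory), w_k3

# Each incoming signal x_j is multiplied by its synaptic weight w_kj.
weighted = [w * x for w, x in zip(weights, inputs)]
```

Here the second connection is inhibitory: its contribution wk2 x2 is negative and lowers the neuron's total activation.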
One of the most important features of the model neuron presented here, as of
biological neurons, is that the values of the synaptic weights are subject to
alteration and modification in response to various inputs and in accordance with
the network's own rules for modification. This feature, technically called synaptic
modification, is of great importance since it is closely related to the ANN's ability
to adapt and learn.
Sometimes there is an additional parameter bk associated with the inputs. The role
of this additional parameter depends on the type of the activation function.
Typically it is considered an internal bias, which can also be weighted. In a
somewhat different approach, this parameter is a threshold value (denoted by θk for
the kth neuron) that must be exceeded for any neuronal activation to occur. In
general, it is a parameter that has the effect of increasing or decreasing the
neuron's net input υk to the activation function, according to whether its value is
positive or negative [47].
The second basic element of the model neuron is the adder. This element sums the
input signals that are transmitted to the neuron through its synapses and weighted
by them. The described operations constitute a linear combiner. As mentioned
above, the total result of summing the incoming weighted signals and adding the
bias bk (or subtracting the threshold θk) is denoted by υk.
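The adder described above can be sketched as a small Python function; the name linear_combiner and the convention of passing a negative bias to represent a threshold are illustrative choices, not from the source:

```python
def linear_combiner(x, w, bias=0.0):
    """Adder element of the model neuron: sum the weighted inputs and add
    the bias b_k, producing the activation v_k handed to the activation
    function.  To subtract a threshold theta_k instead, pass bias=-theta_k."""
    return sum(wj * xj for wj, xj in zip(w, x)) + bias
```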
The third basic element of the model neuron is the activation function, also
referred to as the squashing function or signal function. The role of the activation
function is to squash (limit) the output signal of the neuron to a certain (finite)
range. Thus, the activation function maps a (possibly infinite) domain (the input)
to a pre-specified range (the output). A great number of mathematical functions
could be suitable for the role of the activation function of a neuron. However, four
families of functions are the most widely used: the step, the linear, the ramp and
the sigmoid functions [50].
The step (or threshold) function is described by the following equation:
Thus, the step function of (Eq. 2.1) returns a positive value if its argument is a
nonnegative number, otherwise it returns a negative value if its argument is a
negative number. A special case of the step function is for = 1 and = 0. In thatγ δ
case (Eq. 2.1) is transformed to (Eq. 2.2):
This special case of the step function is commonly referred to as the Heaviside
function. A plot of the Heaviside function is shown in Fig. 2.11a. A neuron that
incorporates the Heaviside function as its activation function is usually referred
to as the McCulloch-Pitts neuron model, in recognition of the pioneering work done
by McCulloch and Pitts back in 1943. According to that neuron model, the output of
a neuron turns to the firing state, generating an output signal equal to 1, if the
total input to the neuron is non-negative. Otherwise, in the case that the total
input is negative, the neuron remains in the resting state, generating no signal
(zero output). This characteristic behaviour is referred to as the all-or-none
property of the McCulloch-Pitts model [47]. The all-or-none property is in
accordance with the behaviour of biological neurons, where the total postsynaptic
potential (input) must exceed a certain internal threshold value in order for the
neuron to fire and generate an action potential. If that threshold value is not
exceeded, the neuron remains in the resting state.
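The McCulloch-Pitts neuron can be sketched directly from the definitions above; the function names and the example weights and threshold are illustrative:

```python
def heaviside(v):
    """Heaviside step activation (Eq. 2.2): 1 for v >= 0, else 0."""
    return 1 if v >= 0 else 0

def mcculloch_pitts(x, w, theta):
    """All-or-none McCulloch-Pitts neuron: fires (output 1) if and only if
    the weighted input sum reaches the internal threshold theta."""
    v = sum(wj * xj for wj, xj in zip(w, x)) - theta
    return heaviside(v)
```

With hypothetical weights [1, 1] and threshold 1.5, the neuron fires only when both binary inputs are active, i.e., it realises a logical AND.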
The next family of activation functions is the linear function, described in its general
form by the equation:
φ(υ) = αυ (Eq. 2.3)
The parameter α is a real-valued constant that regulates the magnification of the
neuron activity υ. Despite its simple form, the linear function is rather
inappropriate for the role of the activation function of a neuron, since it is not
bounded (given that the input parameter υ is not bounded either).
The third family of commonly used activation functions is the ramp function, also
referred to as the piece-wise linear function. The ramp function is a linear
function bounded to the range [-γ, +γ] and in its general form is described by the
equation:

φ(υ) = γ, if υ ≥ γ; υ, if -γ < υ < γ; -γ, if υ ≤ -γ (Eq. 2.4)

In the above equation γ and -γ correspond to the maximum and the minimum output
values respectively, i.e., the upper and lower bounds of the mapping. The
piece-wise linear functions of (Eq. 2.4) are often used to represent a simplified
nonlinear operation and can be viewed as an approximation to a nonlinear
amplifier. Depending on the value of the input υ, the ramp function operates as a
linear function without running into saturation if υ is in the linear region;
otherwise the function returns the upper or lower saturation value. In the special
case of γ = 1/2, with 1 and 0 as the upper and lower bounds respectively, (Eq.
2.4) takes the form:

φ(υ) = 1, if υ ≥ 1/2; υ + 1/2, if -1/2 < υ < 1/2; 0, if υ ≤ -1/2 (Eq. 2.5)
A graphical representation of the ramp function described in (Eq. 2.5) is shown in
Fig. 2.11b. As shown there, this special form of the ramp function exhibits a
linear part in the range -1/2 < υ < 1/2 and saturates to the upper or the lower
bound if υ exceeds that range.
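The ramp functions of (Eqs. 2.4 and 2.5) can be sketched as follows; the function names and the default value of gamma are illustrative:

```python
def ramp(v, gamma=0.5):
    """General ramp activation (Eq. 2.4): linear inside (-gamma, gamma),
    saturating at -gamma and +gamma outside that region."""
    return max(-gamma, min(gamma, v))

def ramp01(v):
    """Special case of Eq. 2.5 (gamma = 1/2, bounds 0 and 1):
    linear for -1/2 < v < 1/2, saturating at 0 and 1."""
    return max(0.0, min(1.0, v + 0.5))
```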
The fourth and final family of activation functions is the sigmoid functions. The
family of sigmoid functions is by far the most pervasive type of activation
function and the most commonly used in the implementation of an ANN. That is
because the sigmoid functions incorporate a number of properties that are most
desirable in the construction of a neuron. There are several types of sigmoid
functions. A common type is the logistic function, described by the following
equation:

φ(υ) = 1 / (1 + e^(-αυ)) (Eq. 2.6)
The parameter α is the slope parameter of the sigmoid function. Graphical
representations of the logistic sigmoid function for different values of the slope
parameter α are shown in Fig. 2.11c. The shape of the curves reveals why the
sigmoid functions have been given that name: the s-shape of their graphs. As is
easily recognised in Fig. 2.11c, the logistic sigmoid function is a bounded,
monotonic, non-decreasing function that provides a graded, nonlinear response.
Thus, the logistic function balances between linear and nonlinear behaviour.
The upper and lower bounds (saturation values) of that function are 1 and 0
respectively. Another feature of the logistic function that is partially revealed
in Fig. 2.11c is the role of the slope parameter α. The greater the value of that
parameter, the steeper the increase of the logistic function. In the limit, as the
slope parameter approaches infinity, the logistic function turns into a simple
step (Heaviside) function.
However, for values of the slope parameter in the normal range, the logistic
function is a continuous and differentiable function that returns a continuous
range of values from 0 to 1 (a graded response). By contrast, the Heaviside
function is not differentiable.
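A minimal sketch of the logistic function of (Eq. 2.6), with the slope parameter exposed (the function name is illustrative):

```python
import math

def logistic(v, a=1.0):
    """Logistic sigmoid (Eq. 2.6): bounded, monotonic, differentiable,
    with outputs in (0, 1); the slope parameter a controls steepness."""
    return 1.0 / (1.0 + math.exp(-a * v))
```

Raising a makes the curve steeper; for very large a, logistic(v, a) behaves like the Heaviside step.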
A second sigmoid-type function that ranges in the interval [0, 1] is the augmented
ratio of squares function, defined as:

φ(υ) = υ²/(1 + υ²), if υ > 0; 0, otherwise (Eq. 2.7)

What is common to the activation functions described by (Eqs. 2.2, 2.5, 2.6 and
2.7) is that they return an output in the range from 0 to 1. However, sometimes it is
desirable to have an activation function in the range from -1 to 1. In that case we
have to give a different definition of the threshold function of (Eqs. 2.1 and
2.2). The new form of the threshold function is described by the following
equation:

φ(υ) = 1, if υ > 0; 0, if υ = 0; -1, if υ < 0 (Eq. 2.8)

The above equation is commonly referred to as the signum function, since it
returns the sign of the parameter υ, or 0 if υ is neither positive nor negative.

Fig. 2.11. Three common types of activation functions. (a) Threshold (Heaviside)
function. (b) Piece-wise linear (ramp) function. (c) Sigmoid function for varying
slope parameter α.
Similarly, other types of sigmoid functions must be presented for the case where
the desirable output range is from -1 to 1, instead of the range from 0 to 1
returned by the logistic sigmoid function of (Eq. 2.6). In that case, among
others, two reasonable candidates exist. The first one is a hyperbolic
trigonometric function, the hyperbolic tangent function, which is defined as:

φ(υ) = tanh(υ) (Eq. 2.9)

The second one is defined by the formula:
Both functions defined in the last two equations have saturation levels at -1
(lower) and 1 (upper), and therefore range in [-1, 1].
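The signed-range activations of (Eqs. 2.8 and 2.9) can be sketched as follows (function names are illustrative):

```python
import math

def signum(v):
    """Signum activation (Eq. 2.8): +1, 0 or -1 according to the sign of v."""
    return (v > 0) - (v < 0)

def tanh_activation(v):
    """Hyperbolic tangent activation (Eq. 2.9): saturates at -1 and +1."""
    return math.tanh(v)
```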
The description of the neural dynamics in mathematical terms follows. According to
the notation introduced above, assume that the kth neuron receives m synaptic
connections; xj is the incoming input signal via the jth synaptic connection, wkj
is the corresponding synaptic weight of that connection, θk is the threshold and
bk is the bias. In the case that the adder sums the total incoming weighted
signals and subtracts the threshold θk, the obtained result υk is given by the
mathematical formula:

υk = Σ(j=1..m) wkj xj - θk

In (Eq. 2.13), the bias bk is instead included in the sum as the product wk0 x0,
where x0 = 1 and wk0 = bk:

υk = Σ(j=0..m) wkj xj (Eq. 2.13)
Finally, let yk be the output signal of the kth neuron that receives the total
incoming signal υk. The output of the neuron is given by the next formula:

yk = φ(υk) (Eq. 2.14)

In the above equation, φ(υ) is the activation function, which can be any one of
those described in Eqs. 2.1 – 2.10.
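Putting the pieces together, the complete model neuron of (Eqs. 2.13 and 2.14) can be sketched as a single function; the name neuron_output and the default logistic activation are illustrative choices:

```python
import math

def neuron_output(x, w, bias=0.0, phi=None):
    """Complete model neuron: the weighted sum of the inputs plus the bias
    gives the activation v_k, which the activation function phi maps to the
    output y_k.  The logistic sigmoid is used when no phi is supplied."""
    if phi is None:
        phi = lambda v: 1.0 / (1.0 + math.exp(-v))
    v = sum(wj * xj for wj, xj in zip(w, x)) + bias
    return phi(v)
```

Any of the activation functions described above can be passed in as phi, e.g. a step function for a McCulloch-Pitts style unit.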
The neuron-like processing element presented here models, approximately, three of
the processes we know real neurons perform. As far as we know, there are at least
150 processes performed by the neurons in the human brain. Despite the obvious
poverty of the model neuron, it handles several basic functions. Namely, the model
neuron is capable of receiving and evaluating the input signals, of calculating a
total of the combined inputs and comparing that total to some threshold level, and
finally of determining what the output should be. In addition to the deterministic
neuronal model presented above, for some applications of neural networks it is
desirable to incorporate a stochastic feature into the dynamics of the neural
model. In such a case, the neuronal model is based on a modification of the
bi-state neuronal element of McCulloch-Pitts and is permitted to reside in only
two states: +1 and -1. The
decision of a neuron to alter its state is probabilistic. Thus, the neuron fires
(is in the +1 state) with firing probability P(υ), and it remains in the -1 state
with probability 1 - P(υ). The firing probability is given by the formula:

P(υ) = 1 / (1 + e^(-υ/T)) (Eq. 2.15)

In the above formula, T is a pseudo-temperature that is incorporated to control
the noise level, and thus the uncertainty and the stochastic nature of firing; it
must be understood as a parameter that represents the effects of synaptic noise.
In the limiting case T → 0, the stochastic neural model reduces to the noiseless
(therefore deterministic) form described by the McCulloch-Pitts neural model in
(Eq. 2.2) [47].
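The stochastic bi-state neuron can be sketched as follows, assuming a logistic firing probability P(υ) = 1/(1 + e^(-υ/T)), which is a common choice for this model; the overflow guard and the function names are illustrative:

```python
import math
import random

def firing_probability(v, T=1.0):
    """Logistic firing probability P(v), guarded against overflow in exp
    when v/T is extreme (as T -> 0 this tends to a hard 0/1 decision)."""
    if v / T > 700.0:
        return 1.0
    if v / T < -700.0:
        return 0.0
    return 1.0 / (1.0 + math.exp(-v / T))

def stochastic_state(v, T=1.0, rng=random.random):
    """Stochastic bi-state neuron: +1 with probability P(v), else -1.
    The pseudo-temperature T sets the synaptic-noise level; as T -> 0
    the rule reduces to the deterministic McCulloch-Pitts behaviour."""
    return 1 if rng() < firing_probability(v, T) else -1
```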
2.16.6 Supervised and unsupervised learning
The learning algorithm of a neural network can either be supervised or
unsupervised.
A neural net is said to learn in a supervised manner if the desired output is
already known. Neural nets that learn unsupervised have no such target outputs, so
it cannot be determined in advance what the result of the learning process will
look like. During the learning process, the units (weight values) of such a neural
net are "arranged" within a certain range, depending on the given input values.
The goal is to group similar units close together in certain areas of the value
range. This effect can be used efficiently for pattern classification purposes [51].
2.16.7 Forward propagation
Forward propagation is a supervised learning algorithm and describes the "flow of
information" through a neural net from its input layer to its output layer.
The algorithm works as follows:
1. Set all weights to random values ranging from -1.0 to +1.0
2. Set an input pattern (binary values) to the neurons of the net's input layer
3. Activate each neuron of the following layer:
Multiply the weight values of the connections leading to this neuron with
the output values of the preceding neurons
Add up these values
Pass the result to an activation function, which computes the output value of
this neuron
4. Repeat this until the output layer is reached
5. Compare the calculated output pattern to the desired target pattern and
compute an error value
6. Change all weights by adding the error value to the (old) weight values
7. Go to step 2
8. The algorithm ends, if all output patterns match their target patterns
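Steps 2–4 of the algorithm above (the forward pass) can be sketched as follows, assuming a logistic activation; the nested-list weight layout and the function names are illustrative choices:

```python
import math

def sigmoid(v):
    """Logistic activation used for each neuron in the forward pass."""
    return 1.0 / (1.0 + math.exp(-v))

def forward(layers, pattern):
    """Propagate an input pattern through the net layer by layer (steps 2-4).
    `layers` is a list of weight matrices; layers[k][j] holds the weights of
    the connections leading into neuron j of the next layer."""
    activations = list(pattern)
    for weight_matrix in layers:
        # For each neuron: multiply the incoming weights by the preceding
        # outputs, add them up, and pass the sum to the activation function.
        activations = [
            sigmoid(sum(w * a for w, a in zip(row, activations)))
            for row in weight_matrix
        ]
    return activations
```

The calculated output pattern returned here would then be compared to the target pattern (step 5) to drive the weight changes.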
2.16.8 Multi-layer Perceptron: This was first introduced by M. Minsky and
S. Papert in 1969 [52]. It is a special case of the perceptron whose first-layer
units are replaced by trainable threshold logic units in order to allow it to
solve non-linearly separable problems. Minsky and Papert called a multi-layer
perceptron with one trainable hidden layer a Gamba perceptron. The structure is
shown in Figure 2.12.
Figure 2.12: Structure of a multi-layer perceptron, with an input layer, a first
and a second hidden layer, and an output layer.
Each layer is fully connected to the next one. Depending on complexity,
performance and implementation considerations, the number of hidden layers may be
increased or decreased, with a corresponding increase or decrease in the number of
hidden units and connections.
Both the perceptron and the multi-layer perceptron are trained with error-
correction learning. But since the hidden units of a multi-layer perceptron do not
have an explicit error available, further work on the multi-layer perceptron
stopped around 1970, until a method to train multi-layer perceptrons was later
discovered. The method is called Back Propagation, or the generalized Delta Rule.
With this method, processing is done from the input to the output layer, that is,
in the forward direction, after which the computed errors are propagated back in
the backward direction to change the weights so as to obtain a better result.
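The error-correction idea can be sketched in much simplified form for a single linear unit; this is the delta rule for one unit, not full back propagation, and the function name and learning rate are illustrative:

```python
def delta_rule_step(x, w, target, eta=0.5):
    """One error-correction (delta rule) update for a single linear unit:
    compute the output in the forward direction, then use the error
    (target minus output) to adjust each weight in the backward direction."""
    y = sum(wj * xj for wj, xj in zip(w, x))
    error = target - y
    new_w = [wj + eta * error * xj for wj, xj in zip(w, x)]
    return new_w, error
```

Repeating such updates over the training patterns gradually reduces the output error, which is the behaviour steps 5–8 of the forward-propagation algorithm describe.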
2.16.9 Strength of Artificial Neural Networks
They can handle a wide range of problems
They provide good results even in complicated domains
They handle both categorical and continuous variables
They are available in many off-the-shelf packages
2.16.10 Weaknesses of Artificial Neural Networks
They require inputs in the range of 0 to 1
They cannot explain their results
They may converge prematurely to an inferior solution
2.17 ON-LINE ANALYTICAL PROCESSING
OLAP is the next advance in giving end-user access to data.
These are client-server tools that have an advanced graphical interface talking to
an efficient and powerful presentation of the data called a cube. The cube is
ideally suited for queries that allow users to slice-and-dice the data in any way
they see fit. The cube itself is stored either in a relational database, typically
using a star schema, or in a special multi-dimensional database that optimizes
OLAP operations. OLAP tools have very fast response times, measured in seconds;
SQL queries on a standard relational database would in many cases require hours or
days to generate the same information. In addition, OLAP tools provide handy
analysis functions that are difficult or impossible to express in SQL.
2.17.1 OLAP and Data Mining
We have to provide feedback to people and use the information from data mining to
improve business processes. We need to enable people to provide input, in the form
of observations, hypotheses and hunches, about what results are important and how
to use those results [6].
In the larger solution to exploit data, OLAP clearly plays an important role as a
means of broadening the audience with access to data.
2.17.2 Strengths of OLAP
It is a powerful visualization tool.
It provides fast, interactive response time
It is good for analyzing time series
It can be used to find some clusters and outliers
Many vendors offer OLAP products
2.17.3 Weaknesses of OLAP
Setting up a cube can be difficult
It does not handle continuous variables well
Cubes can quickly become out-of-date
It is not data mining
2.18 DATA MINING APPLICATION AREAS
Other application areas are:
Health sector
Food and drug product safety
Election analysis
Detection of terrorists or criminals
etc.
2.19 DATA MINING TOOLS
Many good data mining software products are available [5]:
Enterprise Miner by SAS
Intelligent Miner by IBM
CLEMENTINE by SPSS
PolyAnalyst by Megaputer
WEKA (from the University of Waikato in New Zealand), etc.
Given a CSP P = (V, D, C), its dual transformation dual(P) = (Vdual(P), Ddual(P), Cdual(P)) is defined as follows.
Vdual(P) = {c1, …, cm}, where c1, …, cm are called dual variables. For each constraint Ci of P there is a unique corresponding dual variable ci. We use vars(ci) and rel(ci) to denote the corresponding sets vars(Ci) and rel(Ci) (when the context is not ambiguous). Ddual(P) = {dom(c1), …, dom(cm)} is the set of domains for the dual variables. For each dual variable ci, dom(ci) = rel(Ci), i.e., each value for ci is a tuple over vars(Ci). An assignment of a value t to a dual variable ci, ci ← t, can thus be viewed as a sequence of assignments to the ordinary variables x ∈ vars(ci), where each such ordinary variable is assigned the value t[x].
Cdual(P) is a set of binary constraints over Vdual(P) called the dual constraints. There is a dual constraint between dual variables ci and cj if S = vars(ci) ⋂ vars(cj) ≠ ∅. In this dual constraint a tuple ti ∈ dom(ci) is compatible with a tuple tj ∈ dom(cj) iff ti[S] = tj[S], i.e., the two tuples have the same values over their common variables.
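Under the definitions above, a small illustrative sketch of the dual transformation; the CSP itself (three 0/1 variables and two hypothetical constraints) is invented for the example:

```python
from itertools import product

# Hypothetical CSP: variables x1, x2, x3 with domains {0, 1} and constraints
# C1: x1 != x2 over vars {x1, x2};  C2: x2 == x3 over vars {x2, x3}.
constraints = {
    "c1": (("x1", "x2"), [t for t in product((0, 1), repeat=2) if t[0] != t[1]]),
    "c2": (("x2", "x3"), [t for t in product((0, 1), repeat=2) if t[0] == t[1]]),
}

# Dual variables: one per constraint, with dom(ci) = rel(Ci).
dual_domains = {ci: rel for ci, (_, rel) in constraints.items()}

def compatible(ti, tj, vars_i, vars_j):
    """Dual-constraint check: tuples ti and tj are compatible iff they
    agree on every variable shared by the two original constraints."""
    shared = set(vars_i) & set(vars_j)
    return all(ti[vars_i.index(x)] == tj[vars_j.index(x)] for x in shared)
```

Here c1 and c2 share the variable x2, so a dual constraint exists between them, and a pair of tuples is compatible exactly when both assign x2 the same value.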
Given a CSP P = (V, D, C), its hidden transformation hidden(P) = (Vhidden(P), Dhidden(P), Chidden(P)) is defined as follows. Vhidden(P) = {x1, …, xn} ∪ {c1, …, cm}, where {x1, …, xn} is the original set of variables in V (called ordinary variables) and c1, …, cm are dual variables generated from the constraints in C. There is a unique dual variable corresponding to each constraint Ci ∈ C. When dealing with the hidden transformation, the dual variables are sometimes called hidden variables. Dhidden(P) = {dom(x1), …, dom(xn)} ∪ {dom(c1), …, dom(cm)} is the set of domains for the ordinary and dual variables. For each dual variable ci, dom(ci) = rel(Ci).
V = {x1, …, xn} is a finite set of n variables. D = {dom(x1), …, dom(xn)} is a set of domains; each variable x ∈ V has a corresponding finite domain of possible values, dom(x). C = {C1, …, Cm} is a set of m constraints; each constraint C ∈ C is a pair (vars(C), rel(C)) defined as follows: