CHAPTER TWO
LITERATURE REVIEW
This chapter reviews the basic topics of data mining: its meaning, the reasons for its
application, and its various tasks, processes, techniques, and application areas. It
dwells particularly on the artificial neural network approach, including the strengths
and weaknesses of these algorithms and when to apply them.
2.1 DATA MINING
Data mining is the process of discovering meaningful new correlations, patterns, and
trends by sifting through large amounts of data stored in repositories, using pattern
recognition technologies as well as statistical and mathematical techniques [6].
Data mining also refers to the analysis of the large quantities of data that are stored
in computers in the form of files or databases. It is also called exploratory data
analysis, among other things [5].
Data mining is not limited to business. It has been used heavily in the medical field,
including the analysis of patient records to help identify best practices.
2.2 WHY DATA MINING
Data mining has caught on in a big way in the last few years due to the following
factors [6]:
i) Data is being produced and collected at an unprecedented rate.
ii) Data is being warehoused – data warehousing brings together data from
different sources into a common format, with consistent definitions for keys and fields.
iii) Computing power is affordable – the prices of disk, memory, processing
power, and I/O bandwidth are within the reach of many ordinary businesses.
iv) Commercial data mining software products now exist.
2.3 DATA MINING PROCESS
In order to conduct data mining analysis systematically, a general process is usually
followed. There are some standard processes, two of which are CRISP-DM (an
industry-standard process consisting of the sequence of steps usually involved in a
data mining study) and SEMMA, which stands for Sample, Explore, Modify, Model,
Assess (figure 2.2 depicts how the SEMMA phases interact) [7]. While not every step
of either approach is needed in every analysis, the process provides good coverage of
the steps required, starting with data exploration and collection, then data processing,
analysis, the drawing of inferences, and implementation.
2.3.1 CRISP-DM
CRISP-DM – the Cross-Industry Standard Process for Data Mining – is widely used by
industrial and corporate organizations. The model consists of six phases intended
as a cyclical process.
Business Understanding: Business understanding includes determining
business objectives, assessing the current situation, establishing data mining
goals, and developing a project plan.
Figure 2.1 CRISP-DM processes
Data Understanding: Once business objectives and the project plan are
established, data understanding considers data requirements. This step can
include initial data collection, data description, data exploration, and the
verification of data quality. Data exploration such as viewing summary
statistics (which includes the visual display of categorical variables) can
occur at the end of this phase. Models such as cluster analysis can also be
applied during this phase, with the intent of identifying patterns in the data.
Data Preparation: Once the data resources available are identified, they
need to be selected, cleaned, built into the form desired, and formatted.
Data cleaning and data transformation in preparation of data modeling
needs to occur in this phase. Data exploration at a greater depth can be
applied during this phase, and additional models utilized, again providing
the opportunity to see patterns based on business understanding.
Modeling: Data mining software tools such as visualization (plotting data and
establishing relationships) and cluster analysis (to identify which variables
go well together) are useful for initial analysis. Tools such as generalized
rule induction can develop initial association rules. Once greater data
understanding is gained (often through pattern recognition triggered by
viewing model output), more detailed models appropriate to the data type
can be applied. The division of data into training and test sets is also needed
for modeling.
Evaluation: Model results should be evaluated in the context of the business
objectives established in the first phase (business understanding). This will
lead to the identification of other needs (often through pattern recognition),
frequently reverting to prior phases of CRISP-DM. Gaining business
understanding is an iterative procedure in data mining, where the results of
various visualization, statistical, and artificial intelligence tools show the
user new relationships that provide a deeper understanding of
organizational operations.
Deployment: Data mining can be used both to verify previously held
hypotheses and for knowledge discovery (identification of unexpected and
useful relationships). Through the knowledge discovered in the earlier
phases of the CRISP-DM process, sound models can be obtained that may
then be applied to business operations for many purposes, including
prediction or identification of key situations. These models need to be
monitored for changes in operating conditions, because what might be true
today may not be true a year from now. If significant changes do occur, the
model should be redone. It’s also wise to record the results of data mining
projects so documented evidence is available for future studies.
Figure 2.2 Schematic of SEMMA (original from SAS)
2.4 DATA MINING TASKS
CLASSIFICATION: This consists of examining the features of a newly presented
object and assigning it to one of a predefined set of classes. For our purposes,
the objects to be classified are generally represented by records in databases,
and the act of classification consists of updating each record by filling in a field
with a class code of some kind.
The classification task is characterized by a well-defined set of classes
and a training set consisting of pre-classified examples. The task is to build a
model of some kind that can be applied to unclassified data in order to classify
it.
ESTIMATION: While classification deals with discrete outcomes such as yes or
no, or measles, rubella, or chicken pox, estimation deals with continuously valued
outcomes. Given some input data, we use estimation to come up with a value for
some unknown continuous variable such as income, height, or credit card
balance.
In practice, estimation is often used to perform classification tasks, and neural
networks are well suited to estimation tasks.
PREDICTION: Prediction is the same as classification and estimation except that
the records are classified according to some predicted future behaviour or
estimated future value. In a prediction task, the only way to check the accuracy of
the classification is to wait and see.
Any of the techniques used for classification and estimation can be adapted for use
in prediction by using training examples in which the value to be predicted is
already known, along with historical data for those examples. The historical data
is used to build a model that explains the current observed behaviour. When this
model is applied to current inputs, the result is a prediction of future behaviour.
AFFINITY GROUPING: The task of affinity grouping is to determine which things
go together. It can be used to identify cross-selling opportunities and to design
attractive packages or groupings of products and services. Affinity grouping is one
simple approach to generating rules from data: if two items go together, two
association rules can be generated from them.
CLUSTERING: This is the task of segmenting a heterogeneous population into a
number of more homogeneous subgroups. It does not rely on predefined classes;
records are grouped together on the basis of self-similarity, and it is up to the
analyst to determine what meaning, if any, to attach to the resulting clusters.
Clustering is often done as a prelude to some other form of data mining or
modelling.
DESCRIPTION: Sometimes the purpose of data mining is simply to describe what
is going on in a complicated database in a way that increases our understanding
of the people, products, or processes that produced the data in the first place.
Some of the techniques discussed later in this chapter, such as market basket
analysis tools, are purely descriptive. Others, like neural networks, provide
next to nothing in the way of description [6].
2.5 DATA MINING ISSUES
As data mining initiatives continue to evolve, there are several issues Congress may
decide to consider related to implementation and oversight. These issues include,
but are not limited to, data quality, interoperability, mission creep, and privacy. As
with other aspects of data mining, while technological capabilities are important,
other factors also influence the success of a project’s outcome [6].
Data Quality
Data quality refers to the accuracy and completeness of the data. Data quality can
also be affected by the structure and consistency of the data being analyzed. The
presence of duplicate records, the lack of data standards, the timeliness of updates,
and human error can significantly impact the effectiveness of the more complex
data mining techniques, which are sensitive to subtle differences that may exist in
the data. To improve data quality, it is sometimes necessary to “clean” the data,
which can involve the removal of duplicate records, normalizing the values used to
represent information in the database, accounting for missing data points,
removing unneeded data fields, identifying anomalous data points (e.g., an
individual whose age is shown as 142 years), and standardizing data formats (e.g.,
changing dates so they all include MM/DD/YYYY).
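The cleaning steps described above can be sketched in a few lines. The following is a minimal illustration using only the Python standard library; the record fields, values, and the 0–120 plausible-age range are assumptions for illustration, not from the source.

```python
from datetime import datetime

# Hypothetical raw records illustrating the problems described above
raw = [
    {"id": 1, "name": "Ada",  "age": "34",  "joined": "2021-03-05"},
    {"id": 1, "name": "Ada",  "age": "34",  "joined": "2021-03-05"},  # duplicate record
    {"id": 2, "name": "Bob",  "age": "142", "joined": "7/4/2020"},    # anomalous age
    {"id": 3, "name": "Cleo", "age": "",    "joined": "2019-12-01"},  # missing data point
]

def clean(records):
    seen, out = set(), []
    for r in records:
        key = tuple(sorted(r.items()))
        if key in seen:                            # remove duplicate records
            continue
        seen.add(key)
        age = int(r["age"]) if r["age"] else None  # account for missing data points
        if age is not None and not 0 < age < 120:  # flag anomalous values
            age = None
        joined = r["joined"]
        for fmt in ("%Y-%m-%d", "%m/%d/%Y"):       # standardize dates to MM/DD/YYYY
            try:
                joined = datetime.strptime(r["joined"], fmt).strftime("%m/%d/%Y")
                break
            except ValueError:
                pass
        out.append({**r, "age": age, "joined": joined})
    return out

for row in clean(raw):
    print(row)
```

Real cleaning pipelines would also normalize value encodings and drop unneeded fields, as noted above, but the shape of the work is the same.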
Interoperability
This refers to the ability of a computer system and/or data to work with other
systems or data using common standards or processes. For data mining,
interoperability of databases and software is important to enable the search and
analysis of multiple databases simultaneously, and to help ensure the compatibility
of data mining activities of different agencies. Similarly, as agencies move forward
with the creation of new databases and information sharing efforts, they will need
to address interoperability issues during their planning stages to better ensure the
effectiveness of their data mining projects.
Mission Creep
Mission creep refers to the use of data for purposes other than that for which the
data was originally collected. This can occur regardless of whether the data was
provided voluntarily by the individual or was collected through other means. One of
the primary reasons for misleading results is inaccurate data. All data collection
efforts suffer accuracy concerns to some degree. Ensuring the accuracy of
information can require costly protocols that may not be cost effective if the data is
not of inherently high economic value. In well-managed data mining projects, the
original data collecting organization is likely to be aware of the data’s limitations
and account for these limitations accordingly. However, such awareness may not be
communicated or heeded when the data is used for other purposes.
Privacy
As additional information sharing and data mining initiatives have been announced,
increased attention has focused on the implications for privacy.
Concerns about privacy focus both on actual projects proposed, as well as concerns
about the potential for data mining applications to be expanded beyond their
original purposes (mission creep).
So far there has been little consensus about how data mining should be carried out,
with several competing points of view being debated. Some observers contend that
tradeoffs may need to be made regarding privacy to ensure security. In contrast,
some privacy advocates argue in favor of creating clearer policies and exercising
stronger oversight.
2.6 BASIC STYLES OF DATA MINING
The first, hypothesis testing, is a top-down approach that attempts to substantiate or
disprove preconceived ideas. The second, knowledge discovery, is a bottom-up
approach that starts with the data and tries to get it to tell us something we didn’t
already know [6].
2.6.1 Hypothesis
A hypothesis is a proposed explanation whose validity can be tested. Testing the
validity of a hypothesis is done by analyzing data that may simply be collected by
observation or generated through experiment.
The process of hypothesis testing
The hypothesis testing method has several steps:
1) Generate good ideas (hypotheses).
2) Determine what data would allow these hypotheses to be tested.
3) Locate the data.
4) Prepare the data for analysis.
5) Build computer models based on the data.
6) Evaluate the computer models to confirm or reject the hypotheses.
2.6.2 Knowledge Discovery
Undirected learning has long been a goal of artificial intelligence researchers in the
academic discipline called machine learning. In the real world, discovering valuable
patterns is worthwhile, but it is still hard work.
Knowledge discovery can be either directed or undirected.
Directed Knowledge Discovery
This is goal oriented. There is a specific field whose value we want to predict, a
fixed set of classes to be assigned to each record, or a specific relationship we
want to explore.
Here are the steps in the process of directed knowledge discovery:
1. Identify source of pre-classified data.
2. Prepare data for analysis
3. Build and train computer model
4. Evaluate the computer model
Undirected Knowledge Discovery
Here, there is no target field. The data mining tool is simply let loose on the
data in the hope that it will discover meaningful structure.
The Process of Undirected Knowledge Discovery
Here are the steps in the process of undirected knowledge discovery:
1. Identify source of pre-classified data.
2. Prepare data for analysis
3. Build and train computer model
4. Evaluate the computer model
5. Apply the computer model to new data.
6. Identify potential targets for directed knowledge discovery.
7. Generate new hypotheses to test.
DATA MINING TECHNIQUES / METHODS
2.7 MEMORY-BASED REASONING
Memory-based reasoning systems are a type of model, supporting the modeling
phase of the data mining process. Their unique feature is that they are relatively
machine driven, involving automatic classification of cases. It is a highly useful
technique that can be applied to text data as well as traditional numeric data
domains.
Memory-based reasoning is an empirical classification method [8]. It operates by
comparing new unclassified records with known examples and patterns.
The case that most closely matches the new record is identified, using one of a
number of different possible measures. Memory-based reasoning provides best
overall classification when compared with the more traditional approaches in
classifying jobs with respect to back disorders [9].
Matching: While matching algorithms are not normally found in standard
data mining software, they are useful in many specific data mining
applications. Fuzzy matching has been applied to discover patterns in the
data relative to user expectations [10]. Java software has been used to
completely automate document matching [11]. Matching can also be applied to
pattern identification in geometric environments [12].
There are a series of measures that have been applied to implement
memory-based reasoning. The simplest technique assigns a new observation
to the pre-classified example most similar to it. The Hamming distance
metric identifies the nearest neighbor as the example from the training
database with the highest number of matching fields (or lowest number of
non-matching fields). Case-based reasoning is a well-known expert system
approach that assigns new cases to the past case that is closest in some
sense. Thus case-based reasoning can be viewed as a special case of the
nearest neighbor technique.
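As a concrete illustration of the Hamming-distance flavour of memory-based reasoning described above, the following minimal Python sketch classifies a new record by copying the class of the training example with the fewest non-matching fields. The records, field values, and class labels are hypothetical.

```python
def hamming_distance(a, b):
    """Number of fields on which two records disagree."""
    return sum(1 for x, y in zip(a, b) if x != y)

def classify(new_record, training_set):
    """Assign the class of the nearest (most similar) pre-classified example."""
    nearest = min(training_set,
                  key=lambda ex: hamming_distance(new_record, ex["fields"]))
    return nearest["class"]

# Hypothetical pre-classified examples: (marital status, housing, employment)
training_set = [
    {"fields": ("married", "owner",  "employed"), "class": "low risk"},
    {"fields": ("single",  "renter", "student"),  "class": "high risk"},
    {"fields": ("married", "renter", "employed"), "class": "low risk"},
]

print(classify(("single", "renter", "student"), training_set))  # -> high risk
```

A weighted variant would simply multiply each mismatch by a per-field weight before summing, emphasizing some fields over others.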
Weighted Matching
Data mining can involve deletion of variables, but the usual attitude is to
retain data because you don’t know what it may provide. Weighting provides
another means to emphasize certain variables over others. All that would
change would be that the “Matches” measure could now represent a
weighted score for selecting the best matching case.
Distance Minimization
This concept uses the distance measured from the observation to be
classified to each of the observations in the known data set.
In this case, nominal and ordinal data need to be converted to
meaningful ratio data.
Strength of Memory-Based Reasoning
It produces results that are readily understandable
It is applicable to arbitrary data types, even non-relational data.
It works efficiently on any number of fields.
Maintaining the training set requires a minimal amount of effort.
Weaknesses of Memory-Based Reasoning
It is computationally expensive when doing classification and prediction
It requires a large amount of storage for the training set.
Results can be dependent on the choice of distance function, combination
function, and number of neighbours.
2.8 ASSOCIATION RULES IN KNOWLEDGE DISCOVERY
An association rule is an expression of the form X → Y, where X is a set of items and
Y is a single item. Association rule methods are an initial data exploration approach
that is often applied to extremely large data sets.
Association rule mining provides valuable information for assessing significant
correlations. It has been applied to a variety of fields, including medicine [13]
and medical insurance fraud detection [14].
Many algorithms have been proposed for mining association rules in large
databases. Most, such as the Apriori algorithm, identify correlations among
transactions consisting of categorical attributes using binary values. Some data
mining approaches involve weighted association rules for binary values [15] or time
intervals [16].
Data structure is an important issue due to the scale of data usually encountered [17].
Structured query language (SQL) has been a fundamental tool in manipulation of
database content. Knowledge discovery involves ad hoc queries, needing efficient
query compilation. Lopes et al. considered functional dependencies in inference
problems. SQL was used by those researchers to generate sets of attributes that
were useful in identifying item clusters.
Key measures in association rule mining include support and confidence.
Support refers to the degree to which a relationship appears in the data.
Confidence relates to the probability that if a precedent occurs, a
consequent will occur. The rule X → Y has minimum support value minsup
if minsup percent of transactions support X → Y; the rule X → Y holds with
minimum confidence value minconf if minconf percent of transactions that
support X also support Y. For example, from the transactions kept in
supermarkets, an association rule such as “Bread and Butter → Milk” could
be identified through association mining.
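The support and confidence measures just defined can be computed directly from transaction data. The Python sketch below uses made-up supermarket baskets and expresses both measures as fractions rather than percentages.

```python
# Hypothetical supermarket transactions, each a set of purchased items
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "butter", "milk", "eggs"},
    {"milk", "eggs"},
    {"bread", "milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(x, y):
    """Of the transactions that support X, the fraction that also support Y."""
    return support(x | y) / support(x)

x, y = {"bread", "butter"}, {"milk"}
print(support(x | y))    # support of the rule "Bread and Butter -> Milk": 0.4
print(confidence(x, y))
```

An Apriori-style algorithm would generate candidate itemsets and keep only those whose support exceeds minsup, then derive rules meeting minconf from the survivors.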
2.9 MARKET BASKET ANALYSIS
Market-basket analysis refers to methodologies studying the composition of a
shopping basket of products purchased during a single shopping event.
This technique has been widely applied to grocery store operations (as well as
other retailing operations, to include restaurants). Market basket data in its rawest
form would be the transactional list of purchases by customer, indicating only the
items purchased together (with their prices). This data is challenging because of a
number of characteristics [18]:
A very large number of records (often millions of transactions per day)
Sparseness (each market basket contains only a small portion of items
carried)
Heterogeneity (those with different tastes tend to purchase a specific
subset of items).
The aim of market-basket analysis is to identify what products tend to be purchased
together. Analyzing transaction-level data can identify purchase patterns, such as
which frozen vegetables and side dishes are purchased with steak during barbecue
season. This information can be used in determining where to place products in the
store, as well as aid inventory management. Product presentations and staffing can
be more intelligently planned for specific times of day, days of the week, or
holidays. Another commercial application is electronic couponing, tailoring coupon
face value and distribution timing using information obtained from market baskets [19].
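A first pass over such transactional data often simply counts which item pairs co-occur in the same basket. The Python sketch below does this with made-up baskets; real retail data would involve millions of transactions and a sparse item space, as noted above.

```python
from itertools import combinations
from collections import Counter

# Hypothetical transaction-level purchase lists (items bought in one shopping event)
baskets = [
    ["steak", "charcoal", "frozen corn"],
    ["steak", "charcoal", "beer"],
    ["bread", "milk"],
    ["steak", "charcoal"],
    ["bread", "milk", "beer"],
]

pair_counts = Counter()
for basket in baskets:
    # every unordered pair of distinct items purchased together
    pair_counts.update(combinations(sorted(set(basket)), 2))

for pair, count in pair_counts.most_common(2):
    print(pair, count)
```

The most frequent pairs are the candidates for co-location in the store, joint promotion, or coupon targeting.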
2.9.1 Market Basket Analysis Benefits
The ultimate goal of market basket analysis is finding the products that
customers frequently purchase together. The stores can use this information by
putting these products in close proximity of each other and making them more
visible and accessible for customers at the time of shopping.
These assortments can affect customer behavior and promote sales of
complementary items. This information can also be used to decide the layout of
catalogs, putting items with strong associations together in sales catalogs. The
advantage of using sales data for promotions and store layout is that consumer
behavior determines which items are associated. This information may
vary with the area and the assortment of items available in stores, and
point-of-sale data reflects the real behavior of the group of customers who
frequently shop at the same store. Catalogs designed on the basis of market
basket analysis are expected to be more effective in influencing consumer
behavior and promoting sales.
2.9.2 Strength of Market Basket Analysis
It produces clear and understandable results
It supports undirected data mining
It works on variable-length data.
The computations it uses are simple to understand
2.9.3 Weaknesses of Market Basket Analysis
It requires exponentially more computational effort as the problem size
grows.
It has limited support for attributes of the data
It is difficult to determine the right number of items
It discounts rare items
2.10 FUZZY SETS IN DATA MINING
Real-world applications are full of vagueness and uncertainty. Several theories on
managing uncertainty and imprecision have been advanced, including fuzzy set
theory [20], probability theory [21], rough set theory [22], and set pair theory [23].
Fuzzy set theory is used more than the others because of its simplicity and
similarity to human reasoning. Fuzzy modeling provides a very useful tool to deal
with human vagueness in describing scales of value. The advantage of the fuzzy
approach in data mining is that it serves as an “… interface between a numerical
scale and a symbolic scale which is usually composed of linguistic terms” [24].
Fuzzy association rules described in linguistic terms help users better understand
the decisions they face[25]. Fuzzy set theory is being used more and more frequently
in intelligent systems. A fuzzy set A in a universe U is defined as
A = {(x, µA(x)) | x ∈ U, µA(x) ∈ [0,1]}, where µA(x) is a membership function
indicating the degree of membership of x in A. The greater the value of µA(x), the
more x belongs to A. Fuzzy sets can also be thought of as an extension of the
traditional crisp sets and categorical/ordinal scales, in which each element is
either in the set or not in the set (a membership function of either 1 or 0).
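A membership function of this kind is easy to write down concretely. The Python sketch below defines a fuzzy set “tall” over heights; the 160 cm and 190 cm breakpoints and the linear ramp are arbitrary assumptions for illustration.

```python
def mu_tall(height_cm):
    """Degree of membership in the fuzzy set 'tall', in [0, 1]:
    0 at or below 160 cm, 1 at or above 190 cm, linear in between."""
    if height_cm <= 160:
        return 0.0
    if height_cm >= 190:
        return 1.0
    return (height_cm - 160) / 30

# Unlike a crisp set (membership exactly 0 or 1), membership grows gradually:
for h in (150, 175, 190):
    print(h, mu_tall(h))  # -> 150 0.0, then 175 0.5, then 190 1.0
```

This is the “interface between a numerical scale and a symbolic scale” quoted above: the number 175 maps to a partial degree of the linguistic term “tall”.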
Fuzzy set theory in its many manifestations (interval-valued fuzzy sets, vague sets,
grey-related analysis, rough set theory, etc.) is highly appropriate for dealing with
the masses of data available.
There are many data mining tools available: to cluster data, to help analysts
find patterns, and to find association rules. The majority of data mining approaches
to classification work with numerical and categorical information.
Most data mining software tools offer some form of fuzzy analysis.
Modifying continuous data is expected to degrade model accuracy, but might be
more robust with respect to human understanding. (Fuzzy representations might
lose accuracy with respect to numbers that don’t really reflect accuracy of human
understanding, but may better represent the reality humans are trying to express.)
Another approach to fuzzifying data is to make it categorical. Categorization of data is
expected to yield greater inaccuracy on test data. However, both treatments are still
useful if they better reflect human understanding, and might even be more accurate
on future implementations.
The categorical limits selected are key to accurate model development. Not many
data mining techniques take into account ordinal data features.
2.10.1 Fuzzy Association Rules
With the rapid growth of data in enterprise databases, making sense of valuable
information becomes more and more difficult. KDD (Knowledge Discovery in
Databases) can help to identify effective, coherent, potentially useful and previously
unknown patterns in large databases [26]. Data mining plays an important role in the
KDD process, applying specific algorithms for extracting desirable knowledge or
interesting patterns from existing datasets for specific purposes. Most of the
previous studies focused on categorical attributes.
Mining fuzzy association rules for quantitative values has long been considered by
a number of researchers, most of whom based their methods on the important
APriori algorithm [27]. Each of these researchers treated all attributes (or all the
linguistic terms) as uniform. However, in real-world applications, users may have
more interest in rules that contain fashionable items.
Decreasing the minimum support minsup and minimum confidence minconf to
obtain rules containing fashionable items is not ideal, because the efficiency of the
algorithm will be reduced and many uninteresting rules will be generated
simultaneously [28]. Weighted quantitative association rule mining based on a fuzzy
approach has been proposed (by Genesei) using two different definitions of
weighted support: with and without normalization [29].
In the non-normalized case, he used the product operator to define the combined
weight and fuzzy value.
The combined weight or fuzzy value becomes very small, even tending to zero,
when the number of items in a candidate itemset is large, so the support level is
very small. This can result in data overflow and make the algorithm terminate
unexpectedly when calculating the confidence value.
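The numerical problem just described is easy to reproduce: under the product operator, the combined fuzzy value of a candidate itemset shrinks geometrically with the number of items, and dividing by a support that has underflowed to zero breaks the confidence calculation. A small Python demonstration, with a made-up membership degree:

```python
membership = 0.1  # assumed fuzzy value of each item in the candidate itemset

for k in (2, 10, 400):
    # product operator over k items; at k = 400 the float underflows to 0.0
    print(k, membership ** k)

# Once the combined support underflows to zero, the confidence calculation
# (a division by that support) fails outright:
try:
    confidence = (membership ** 401) / (membership ** 400)
except ZeroDivisionError as err:
    print("confidence calculation failed:", err)
```

Normalizing the combined value, or working with log-domain sums instead of raw products, avoids this collapse.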
2.11 ROUGH SET
Rough set analysis is a mathematical approach that is based on the theory of rough
sets first introduced by Pawlak (1982) [22]. The purpose of rough sets is to discover
knowledge in the form of business rules from imprecise and uncertain data sources.
Rough set theory is based on the notion of indiscernibility and the inability to
distinguish between objects, and provides an approximation of sets or concepts by
means of binary relations, typically constructed from empirical data.
Such approximations can be said to form models of our target concepts; hence
the typical use of rough sets falls under the bottom-up approach to model
construction. The intuition behind this approach is the fact that in real life, when
dealing with sets, we often have no means of precisely distinguishing individual set
elements from each other due to limited resolution (lack of complete and detailed
knowledge) and uncertainty associated with their measurable characteristics.
As an approach to handling imperfect data, rough set analysis complements other
more traditional theories such as probability theory, evidence theory, and fuzzy set
theory.
2.11.1 A Brief Theory of Rough Sets
Statistical data analysis faces limitations in dealing with data with high levels of
uncertainty or with non-monotonic relationships among the variables.
The original idea behind Pawlak’s rough set theory was “… vagueness inherent to the
representation of a decision situation.
Vagueness may be caused by granularity of the representation. Due to the
granularity, the facts describing a situation are either expressed precisely
by means of ‘granules’ of the representation or only approximately” [30]. The
vagueness and imprecision problems are present in the information that describes
most real world applications.
2.11.2 Rough Sets as an Information System
In rough sets, an information system is a representation of data that describes
some objects. An information system S is a 4-tuple S = <U, Q, V, f>,
where U is the closed universe, a nonempty finite set of N objects {x1, x2, …, xN};
Q is a nonempty finite set of n attributes {q1, q2, …, qn} that uniquely characterize
the objects; V = ∪q∈Q Vq, where Vq is the set of values of attribute q; and
f : U × Q → V is the total decision function, called the information function, such
that f(x, q) ∈ Vq for every q ∈ Q and x ∈ U [31]. In an illustrative information
system describing six stores, for example, the stores would be the universe U, the
first three attributes Q, their possible values V, and the profit category the
decision function f.
20
Any pair (q, v), for q ∈ Q and v ∈ Vq, is called a descriptor in an information system S.
The information system can be represented as a finite data table, in which the
columns represent the attributes, the rows represent the objects and the cells
represent the attribute values f(x, q). Thus, each row in the table describes the
information about an object in S.
If we let S = <U, Q, V, f> be an information system, A ⊆ Q
be a subset of attributes, and x, y ∈ U be objects, then x and y are indiscernible by
the set of attributes A in S if and only if f(x, a) = f(y, a) for every a ∈ A. Every subset
of attributes A thus determines an equivalence relation on the universe U, denoted
IND(A) and called an indiscernibility relation. It can be defined as
IND(A) = {(x, y) ∈ U × U : f(x, a) = f(y, a) for all a ∈ A}. If the pair of objects
(x, y) belongs to the relation IND(A), then objects x and y are called indiscernible
with respect to attribute set A. In other words, we cannot distinguish object x from
y based on the information contained in the attribute set A.
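The indiscernibility relation is straightforward to compute from a data table. The Python sketch below uses a made-up information system of four objects, checks whether pairs are indiscernible with respect to an attribute subset A, and then lists the equivalence classes of IND(A).

```python
# Hypothetical information function f(x, q): objects described by attribute values
f = {
    "x1": {"size": "big",   "colour": "red", "shape": "round"},
    "x2": {"size": "big",   "colour": "red", "shape": "square"},
    "x3": {"size": "small", "colour": "red", "shape": "round"},
    "x4": {"size": "big",   "colour": "red", "shape": "square"},
}

def indiscernible(x, y, attrs):
    """(x, y) is in IND(A) iff f(x, a) == f(y, a) for every attribute a in A."""
    return all(f[x][a] == f[y][a] for a in attrs)

A = ["size", "colour"]
print(indiscernible("x1", "x2", A))  # True: they agree on size and colour
print(indiscernible("x1", "x3", A))  # False: they differ on size

# The equivalence classes of IND(A) partition the universe U:
classes = {}
for x in f:
    classes.setdefault(tuple(f[x][a] for a in A), []).append(x)
print(list(classes.values()))  # -> [['x1', 'x2', 'x4'], ['x3']]
```

Note that dropping an attribute from A can merge classes: with A = ["colour"] alone, all four objects would be indiscernible, which is exactly the loss of resolution the theory is built around.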
2.11.3 Some Exemplary Applications of Rough Sets
Most of the successful applications of rough sets are in the field of medicine, more
specifically, in medical diagnosis or prediction of outcomes.
Rough sets have been applied to analyze a database of patients with duodenal ulcer
treated by highly selective vagotomy (HSV) [32]. The goal was to predict the
long-term success of the operation, as evaluated by a surgeon into four outcome classes.
This successful HSV study is still one of the few data analysis studies, regardless of
methodology, that has managed to cross the clinical deployment barrier. There has
been a steady stream of rough set applications in medicine. Some more recent
applications include analysis of breast cancer [33] and other forms of diagnosis [34], as
well as support to triage of abdominal pain [35] and analysis of Medicaid Home Care
Waiver programs [36].
In addition to medicine, rough sets have also been applied to a wide range of
application areas, including real estate property appraisal [37], predicting
bankruptcy [38], and predicting gaming ballot outcomes [39]. Rough sets have been
applied to identify better stock trading timing [40], to enhance support vector
machine models in manufacturing process document retrieval [41], and to evaluate
safety performance of construction firms [42]. Rough sets have thus been useful in
many applications.
2.12 SUPPORT VECTOR MACHINES
Support vector machines (SVMs) are supervised learning methods that generate
input-output mapping functions from a set of labeled training data [5].
The mapping function can be either a classification function (used to categorize the
input data) or a regression function (used for estimation of the desired output). For
classification, nonlinear kernel functions are often used to transform the input data
(inherently representing highly complex nonlinear relationships) to a high
dimensional feature space in which the input data becomes more separable (i.e.,
linearly separable) compared to the original input space. Then, the maximum-
margin hyperplane is constructed to optimally separate the classes in the
training data: two parallel hyperplanes, one on each side of the separating
hyperplane, are pushed apart so as to maximize the distance between them [5].
The assumption is that the larger the margin (the distance between these
parallel hyperplanes), the smaller the generalization error of the classifier will be.
SVMs belong to a family of generalized linear models which achieve a classification
or regression decision based on the value of a linear combination of features.
They are also said to belong to the family of "kernel methods".
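The kernel idea described above can be illustrated with a short sketch, assuming the scikit-learn library is available; the dataset is synthetic (two concentric classes that are not linearly separable in the input space).

```python
# A minimal sketch of maximum-margin classification with an RBF kernel,
# assuming scikit-learn; the data are synthetic, purely for illustration.
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Nonlinearly separable data: one class forms a ring around the other.
X, y = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The RBF kernel implicitly maps inputs to a high-dimensional feature
# space in which a maximum-margin hyperplane can separate the classes.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```

A linear kernel would fail on this data; the transformation into the feature space is what makes the classes linearly separable.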
Figure 2.3 Process map and the main steps of the rough sets analysis
In addition to its solid mathematical foundation in statistical learning theory, SVMs
have demonstrated highly competitive performance in numerous real-world
applications, such as medical diagnosis, bioinformatics, face recognition, image
processing and text mining, which has established SVMs as one of the most popular,
state-of-the-art tools for knowledge discovery and data mining.
Similar to artificial neural networks, SVMs possess the well-known ability of being
universal approximators of any multivariate function to any desired degree of
accuracy. Therefore, they are of particular interest to modeling highly nonlinear,
complex systems and processes.
Regression
A version of an SVM for regression, called support vector
regression (SVR), has also been proposed. The model produced by support vector classification
(as described above) only depends on a subset of the training data, because
the cost function for building the model does not care about training points
that lie beyond the margin. Analogously, the model produced by SVR only
depends on a subset of the training data, because the cost function for
building the model ignores any training data that are close (within a
threshold ε) to the model prediction [6].
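The role of the ε threshold can be seen in a small sketch, assuming scikit-learn; the data are synthetic (a noisy sine curve), and widening the ε-tube should leave fewer support vectors.

```python
# A hedged sketch of epsilon-insensitive support vector regression,
# assuming scikit-learn; training points within epsilon of the prediction
# incur no loss and therefore do not become support vectors.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, 80)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 80)  # noisy sine curve

# A wider epsilon-tube ignores more points, so fewer support vectors remain.
for eps in (0.01, 0.2):
    model = SVR(kernel="rbf", C=10.0, epsilon=eps).fit(X, y)
    print(f"epsilon={eps}: {len(model.support_)} support vectors")
```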
2.12.1 Use of SVM – A Process-Based Approach
Due largely to their superior classification performance, support vector machines
(SVMs) have recently become a popular technique for classification-type problems. Even
though they are considered easier to use than artificial neural networks, users
who are not familiar with the intricacies of SVMs often obtain unsatisfactory results. In
this section we provide a process-based approach to the use of SVMs which is more
likely to produce good results.
1. Preprocess the data
   Scrub the data: deal with missing values, presumably incorrect values,
   and noise in the data.
2. Transform the data
   Numerisize the data.
   Normalize the data.
3. Develop the model(s)
   Select the kernel type (RBF is a natural choice).
   Determine the kernel parameters for the selected kernel type (e.g.,
   C and γ for RBF) – a hard problem. One should consider using
   cross-validation and experimentation to determine the appropriate
   values for these parameters.
   If the results are satisfactory, finalize the model; otherwise change the
   kernel type and/or kernel parameters to achieve the desired accuracy
   level.
4. Extract and deploy the model.
Figure 2.4 A process-based approach to the use of SVMs
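The parameter-determination step in the process above, using cross-validation over candidate values of C and γ, can be sketched as follows, assuming scikit-learn; the dataset and grid values are illustrative choices.

```python
# A sketch of kernel-parameter selection by cross-validation, assuming
# scikit-learn: a grid of candidate C and gamma values for the RBF kernel
# is evaluated, as the process above recommends.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]},
    cv=5,  # 5-fold cross-validation for each parameter combination
)
grid.fit(X, y)
print("best parameters:", grid.best_params_)
print("cross-validated accuracy:", round(grid.best_score_, 3))
```

If the best cross-validated accuracy is unsatisfactory, the grid (or the kernel type itself) would be changed, exactly as the process prescribes.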
2.12.2 Support Vector Machines versus Artificial Neural Networks
The development of ANNs followed a heuristic path, with applications and
extensive experimentation preceding theory. In contrast, the development of SVMs
involved sound theory first, then implementation and experiments.
A significant advantage of SVMs is that while ANNs can suffer from multiple local
minima, the solution to an SVM is global and unique.
Two further advantages of SVMs are that they have a simple geometric interpretation
and give a sparse solution. Unlike ANNs, the computational complexity of SVMs
does not depend on the dimensionality of the input space. ANNs use empirical risk
minimization, whilst SVMs use structural risk minimization. The reason that SVMs
often outperform ANNs in practice is that they address the biggest problem with
ANNs: SVMs are less prone to overfitting.
They differ radically from comparable approaches such as neural networks:
SVM training always finds a global minimum, and their simple geometric
interpretation provides fertile ground for further investigation.
Most often Gaussian kernels are used, in which case the resulting SVM
corresponds to an RBF network with Gaussian radial basis functions. As the
SVM approach "automatically" solves the network complexity problem, the size
of the hidden layer is obtained as the result of the QP procedure. Hidden
neurons and support vectors correspond to each other, so the center problem
of the RBF network is also solved, as the support vectors serve as the basis
function centers.
In problems when linear decision hyperplanes are no longer feasible, an
input space is mapped into a feature space (the hidden layer in NN
models), resulting in a nonlinear classifier.
SVMs, after the learning stage, create the same type of decision
hypersurfaces as do some well-developed and popular NN classifiers.
Note that the training of these diverse models is different. However,
after the successful learning stage, the resulting decision surfaces are
identical.
Unlike conventional statistical and neural network methods, the SVM
approach does not attempt to control model complexity by keeping the
number of features small.
Classical learning systems like neural networks suffer from
theoretical weaknesses, e.g. back-propagation usually converges only to
locally optimal solutions. Here SVMs can provide a significant
improvement.
In contrast to neural networks, SVMs automatically select their model
size (by selecting the support vectors).
The absence of local minima from the above algorithms marks a major
departure from traditional systems such as neural networks.
While the weight decay term is an important aspect for obtaining good
generalization in the context of neural networks for regression, the margin
plays a somewhat similar role in classification problems.
In comparison with traditional multilayer perceptron neural networks
that suffer from the existence of multiple local minima solutions,
convexity is an important and interesting property of nonlinear SVM
classifiers.
SVMs have been developed in the reverse order to the development of
neural networks (NNs). SVMs evolved from sound theory to
implementation and experiments, while NNs followed a more
heuristic path, from applications and extensive experimentation to
theory.
2.12.3 Disadvantages of Support Vector Machines
Besides their advantages, from a practical point of view SVMs
have some limitations. An important practical question that is not entirely
solved is the selection of the kernel function parameters – for Gaussian
kernels the width parameter σ – and the value of ε in the ε-insensitive
loss function.
A second limitation is speed and size, both in training and testing: SVMs
involve complex and time-demanding calculations. From a practical point of
view, perhaps the most serious problem with SVMs is the high algorithmic
complexity and extensive memory requirements of the required quadratic
programming in large-scale tasks. Shi et al. have conducted comparative
testing of SVMs against other algorithms on real credit card data.
The processing of discrete data presents another problem.
Despite these limitations, because SVMs are based on a sound theoretical foundation
and the solutions they produce are global and unique in nature (as opposed to getting
stuck in local minima), they are nowadays among the most popular prediction modeling
techniques in the data mining arena. Their use and popularity will only increase as
popular commercial data mining tools incorporate them into their
modeling arsenals [43].
2.13 PERFORMANCE EVALUATION FOR PREDICTIVE MODELING
Once a predictive model is developed using the historical data, one would be
curious as to how the model will perform for the future (on the data that it has not
seen during the model building process). One might even try multiple model types
for the same prediction problem, and then, would like to know which model is the
one to use for the real-world decision making situation, simply by comparing them
on their prediction performance (e.g., accuracy). But, how do you measure the
performance of a predictor? What are the commonly used performance metrics?
What is accuracy? How can we accurately estimate the performance measures? Are
there methodologies that are better in doing so in an unbiased manner? These
questions are answered in the following sub-sections. First, the most commonly
used performance metrics will be described, then a wide range of estimation
methodologies are explained and compared to each other [5].
2.13.1 Performance Metrics for Predictive Modeling
In classification problems, the primary source of performance measurements
is a coincidence matrix (a.k.a. classification matrix or a contingency table).
The numbers along the diagonal from upper-left to lower-right represent the
correct decisions made, and the numbers outside this diagonal represent the errors.
The true positive rate (also called hit rate or recall) of a classifier is estimated by
dividing the correctly classified positives (the true positive count) by the total
positive count. The false positive rate (also called false alarm rate) of the classifier is
estimated by dividing the incorrectly classified negatives (the false positive count)
by the total negative count.
The overall accuracy of a classifier is estimated by dividing the total number of correctly
classified positives and negatives by the total number of samples.
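The metrics defined above can be computed directly from the cells of the coincidence matrix; the counts below are made up purely for illustration.

```python
# Illustrative counts from a hypothetical coincidence (confusion) matrix.
tp, fn = 80, 20   # actual positives: correctly / incorrectly classified
fp, tn = 10, 90   # actual negatives: incorrectly / correctly classified

true_positive_rate = tp / (tp + fn)   # hit rate, recall, sensitivity
false_positive_rate = fp / (fp + tn)  # false alarm rate
accuracy = (tp + tn) / (tp + fn + fp + tn)
specificity = tn / (tn + fp)
precision = tp / (tp + fp)
# The F-measure is the harmonic mean of precision and recall.
f_measure = (2 * precision * true_positive_rate
             / (precision + true_positive_rate))

print(true_positive_rate, false_positive_rate, accuracy)  # 0.8 0.1 0.85
```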
Other performance measures, such as recall (a.k.a. sensitivity), specificity
and the F-measure, are also used, for example for calculating aggregated performance
measures (e.g., the area under the ROC curve).
2.13.2 Estimation Methodology for Classification Models
Estimating the accuracy of a classifier induced by some supervised learning
algorithms is important for the following reasons. First, it can be used to estimate
its future prediction accuracy which could imply the level of confidence one should
have in the classifier’s output in the prediction system. Second, it can be used for
choosing a classifier from a given set (selecting the "best" model from two or more
candidate models). Lastly, it can be used to assign confidence levels to multiple
classifiers so that the outcome of a combining classifier can be optimized. Combined
classifiers are becoming increasingly popular due to empirical results
suggesting that they produce more robust and more accurate predictions
compared to the individual predictors. For estimating the final accuracy of a
classifier one would like an estimation method with low bias and low variance. In
some application domains, to choose a classifier or to combine classifiers the
absolute accuracies may be less important and one might be willing to trade off bias
for low variance.
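One widely used estimation methodology that addresses these bias and variance concerns is k-fold cross-validation; the sketch below assumes scikit-learn, and uses a decision tree purely as an example classifier.

```python
# A sketch of k-fold cross-validation as an estimation methodology,
# assuming scikit-learn: the mean over folds estimates future accuracy,
# and the spread across folds indicates the variance of that estimate.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(f"estimated accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

The same procedure, run for several model types, gives the unbiased comparison described above for choosing the "best" model.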
2.14 DECISION TREES
Decision trees are powerful and popular tools for classification and prediction. The
attractiveness of tree-based methods is due in large part to the fact that, in contrast
to neural networks, decision trees represent rules. Rules can readily be expressed
in English so that humans can understand them, or in a database access language
like SQL so that records falling into a particular category may be retrieved.
There is a variety of algorithms for building decision trees which share the
desirable trait of explicability. Two of the most popular go by the acronyms CART
and CHAID, which stand respectively for Classification and Regression Trees and
Chi-square Automatic Interaction Detection. A newer algorithm, C4.5, is gaining
popularity and is now available in several software packages [5].
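The point that a fitted tree can be read back as human-understandable rules can be sketched briefly, assuming scikit-learn (whose decision tree is a CART-style implementation).

```python
# A hedged sketch showing that a fitted decision tree can be read back
# as rules, assuming scikit-learn (which implements a CART-style algorithm).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(data.data, data.target)

# export_text prints the tree as nested if/else rules on the features,
# which could equally be translated into English or SQL predicates.
print(export_text(tree, feature_names=list(data.feature_names)))
```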
2.14.1 Strengths of Decision-Tree Methods
The strengths of decision trees are:
Decision trees are able to generate understandable rules.
Decision trees perform classification without requiring much computation.
Decision trees are able to handle both continuous and categorical variables.
Decision trees provide a clear indication of which fields are most important
for prediction or classification.
Figure 2.5 A beverage prediction tree
2.14.2 Weakness of Decision Trees Methods
Decision trees are less appropriate for estimation tasks where the goal is to predict
the value of a continuous variable such as income, blood pressure or interest rate.
Decision trees are also problematic for time-series data unless a lot of effort is put
into presenting the data in such a way that trends and sequential patterns are made
visible.
2.14.3 Application of Decision Tree Methods
Decision-tree methods are a good choice when the data mining task is classification
of records or prediction of outcomes. Use decision trees when your goal is to assign
each record to one of a few broad categories. Decision trees are also a natural
choice when your goal is to generate rules that can be easily understood, explained,
and translated into SQL or a natural language.
2.15 GENETIC ALGORITHM
Genetic Algorithms, first introduced by Holland in 1975 [44], have been applied to a
variety of problems and offer intriguing possibilities for general purpose adaptive
search algorithms in artificial intelligence, especially, but not necessarily, for
situations where it is difficult or impossible to precisely model the external
circumstances faced by the program. Search based on evolutionary models had, of
course, been tried before Holland. However, these models were based on mutation
and natural selection and were not notably successful. The principal difference of
Holland’s approach was the incorporation of a ’crossover’ operator to mimic the
effect of sexual reproduction.
Figure 2.6 below illustrates the basic idea of GA
Figure 2.6 Generic Model for Genetic Algorithm
Genetic algorithms are mathematical procedures utilizing the process of genetic
inheritance. They have been usefully applied to a wide variety of analytic problems.
Data mining can combine human understanding with automatic analysis of data to
detect patterns or key relationships. Given a large database defined over a number
of variables, the goal is to efficiently find the most interesting patterns in the
database. Genetic algorithms have been applied to identify interesting patterns in
some applications.
They are usually used in data mining to improve the performance of other
algorithms, one example being decision tree algorithms, another association rules.
Genetic algorithms require a certain data structure. They operate on a population
with characteristics expressed in categorical form. The analogy with genetics is
that the population (genes) consists of characteristics (alleles). One way to
implement genetic algorithms is to apply operators (reproduction, crossover,
selection) with the feature of mutation to enhance generation of potentially better
combinations. The genetic algorithm process is thus:
1. Randomly select parents.
2. Reproduce through crossover. Reproduction is the operator choosing which
individual entities will survive. In other words, some objective function or selection
characteristic is needed to determine survival.
Crossover relates to changes in future generations of entities.
3. Select survivors for the next generation through a fitness function.
4. Mutation is the operation by which randomly selected attributes of randomly
selected entities in subsequent operations are changed.
5. Iterate until either a given fitness level is attained, or the preset number of
iterations is reached.
Genetic algorithm parameters include population size, crossover rate (the
probability that individuals will crossover), and the mutation rate (the probability
that a certain entity mutates) [45].
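The five steps and three parameters above can be sketched on a toy problem; the "OneMax" fitness function (count of 1-bits) and all parameter values below are illustrative choices, not part of the source.

```python
# A minimal genetic algorithm sketch on the toy "OneMax" problem
# (fitness = number of 1-bits in the string); population size, crossover
# rate (PC) and mutation rate (PM) are the parameters the text lists.
import random

random.seed(0)
BITS, POP, GENS, PC, PM = 20, 30, 60, 0.9, 0.02

def fitness(ind):
    """Objective function deciding survival: count of 1-bits."""
    return sum(ind)

# Initial population of random bit strings.
pop = [[random.randint(0, 1) for _ in range(BITS)] for _ in range(POP)]

for _ in range(GENS):                       # 5. iterate a preset number of times
    nxt = []
    while len(nxt) < POP:
        # 1. randomly select parents (tournament of three, fittest wins)
        p1 = max(random.sample(pop, 3), key=fitness)
        p2 = max(random.sample(pop, 3), key=fitness)
        c1, c2 = p1[:], p2[:]
        if random.random() < PC:            # 2. reproduce through crossover
            cut = random.randrange(1, BITS)
            c1, c2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
        for c in (c1, c2):                  # 4. mutate randomly selected bits
            for i in range(BITS):
                if random.random() < PM:
                    c[i] = 1 - c[i]
        nxt += [c1, c2]
    pop = nxt[:POP]                         # 3. survivors form the next generation

best = max(pop, key=fitness)
print("best fitness:", fitness(best), "out of", BITS)
```

In a real data mining setting, `fitness` would instead score a candidate model or rule set against the data.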
2.15.1 Genetic Algorithm Advantages: Genetic algorithms are very easy to
develop and to validate, which makes them highly attractive where they apply. The
algorithm is parallel, meaning that it can be applied to large populations efficiently.
The algorithm is also efficient in that if it begins with a poor original solution, it can
rapidly progress to good solutions. The use of mutation makes the method capable of
identifying global optima even in very nonlinear problem domains. The method
does not require knowledge about the distribution of the data.
2.15.2 Genetic Algorithm Disadvantages: Genetic algorithms require
mapping data sets to a form in which attributes have discrete values for the genetic
algorithm to work with. This is usually possible, but can lose a great deal of
detailed information when dealing with continuous variables. Coding the data into
categorical form can unintentionally lead to biases in the data.
There are also limits to the size of data set that can be analyzed with genetic
algorithms. For very large data sets, sampling will be necessary, which leads to
different results across different runs over the same data set.
2.15.3 GA Operators
Selection
This is the procedure for choosing individuals (parents) on which to perform
crossover in order to create new solutions. The idea is that the ‘fitter’ individuals
are more prominent in the selection process, with the hope that the offspring they
create will be even fitter still.
Two commonly used procedures are ‘roulette wheel’ and ‘tournament’ selection.
In roulette wheel, each individual is assigned a slice of a wheel, the size of the
slice being proportional to the fitness of the individual. The wheel is then spun
and the individual opposite the marker becomes one of the parents. In
tournament selection several individuals are chosen at random and the fittest
becomes one of the parents.
Crossover
Along with mutation, crossover is the operator that creates new candidate
solutions. A position is randomly chosen on the string and the two parents are
‘crossed over’ at this point to create two new solutions. Multiple point crossover
is where this occurs at several points along the string. A crossover probability
(Pc) is often given which enables a chance that the parents descend into the next
generation unchanged.
Mutation
After crossover, each bit of the string has the potential to mutate, based on a
mutation probability (Pm). In binary encoding mutation involves the flipping of
a bit from 0 to 1 or vice versa.
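The two selection procedures described above can be compared with a small sketch; the population and fitness values are made up for illustration.

```python
# A small sketch of roulette-wheel and tournament selection; the
# individuals and fitness values are invented for the example.
import random

random.seed(1)
population = ["A", "B", "C", "D"]
fitness = {"A": 1.0, "B": 2.0, "C": 3.0, "D": 4.0}

def roulette(pop):
    # Each individual gets a wheel slice proportional to its fitness.
    return random.choices(pop, weights=[fitness[i] for i in pop], k=1)[0]

def tournament(pop, k=2):
    # k individuals are drawn at random and the fittest becomes a parent.
    return max(random.sample(pop, k), key=lambda i: fitness[i])

picks = [roulette(population) for _ in range(10000)]
# "D" (fitness 4) should be chosen about four times as often as "A" (fitness 1).
print("D picked", picks.count("D"), "times; A picked", picks.count("A"), "times")
```

Increasing the tournament size `k` raises the selection pressure, since weaker individuals have less chance of being the fittest in their tournament.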
2.15.4 Application of Genetic Algorithms in Data Mining
Genetic algorithms have been applied to data mining in two ways. External
support is through evaluation or optimization of some parameter for another
learning system, often hybrid systems using other data mining tools such as
clustering or decision trees. In this sense, genetic algorithms help other data mining
tools operate more efficiently. Genetic algorithms can also be directly applied to
analysis, where the genetic algorithm generates descriptions, usually as decision
rules or decision trees. Many applications of genetic algorithms within data mining
have been outside of business.
Specific examples include medical data mining and computer network intrusion
detection. In business, genetic algorithms have been applied to customer
segmentation, credit scoring, and financial security selection.
Genetic algorithms can be very useful within a data mining analysis dealing with
many attributes and many more observations. They avoid the brute-force checking of
all combinations of variable values, which can make some data mining algorithms
more effective. However, the application of genetic algorithms requires expression of
the data as discrete outcomes, with a calculable functional value upon which to
base selection.
This does not fit all data mining applications. Genetic algorithms are useful because
sometimes it does fit. We review an application to demonstrate some of the aspects
of genetic algorithms.
2.16 NEURAL NETWORKS
Neural computation is introduced as an intelligent system relating the processing
parameters to the process responses. Such a system is based on an artificial neural
network (ANN) [46], which is an interconnected structure of processing elements
called neurons. The ANN structure consists of the input pattern representing the
processing parameters, the output pattern, and the hidden layers describing implicitly
the correlations between the processing parameters and the output characteristics.
The connection between a pair of neurons is described by a number called a
weight, expressing the strength of the connection [47].
Three steps are required to optimize the ANN structure: the
training, validation and testing steps. There are several types of neural network
architectures, but this study focuses on the multilayer perceptron (MLP) and the
backpropagation network.
2.16.1 BIOLOGICAL BACKGROUND
Structural and Functional Organization of the Brain
The inspiration for the development of ANNs lies in the organization and
functionality of the (human) brain. The brain is organized in different structural
levels, which correspond to small-scale and large-scale anatomical and
functional organizations. Different functions take place in different organization
levels. The hierarchy of these levels is shown in Fig. 2.8, from the lowest
(bottom) to the highest (top). Therefore, the lowest (basic) level of brain
structural organization is the molecular level and the highest is the Central
Nervous System [48].
The synapses are the neuronal interconnections and their function depends on
specific molecules and ions. The next level is the neural microcircuit, which is an
assembly of synaptic connections organized to produce a specific functional
operation. The neural microcircuits are grouped to form dendritic subunits
that are parts of the dendritic trees of individual neurons. It is believed that
neurons are the simplest computing units in the brain, the simplest elements that
can perform computational tasks. At the next hierarchical and complexity level
Figure 2.7 Structure of a neural cell in human brain
we have local neural circuits (neural networks), which are constructed from the
same type of neurons, and are able to perform operations characteristic of a
localized region of the brain [49].
Figure 2.8 Schematic structural organization of the brain.
At a higher level, these neural circuits are organized into interregional circuits
that involve multiple regional neural networks located in different parts of the
brain, connected through specific pathways, columns and topographic maps. These
structures are organized to respond to incoming sensory information.
Neurophysiological experiments have shown clearly that different sensory
inputs (motor, somatosensory, visual, auditory, etc.) are mapped onto
specialized corresponding areas of the cerebral cortex. At the ultimate level of
complexity and hierarchy, the interregional circuits mediate specific types of
behaviour in the central nervous system.
2.16.2 The Neuron
The key word to understand the brain structural organization and function is the
neuron. The idea of the neuron was introduced by Ramón y Cajal in 1911 and
refers to the fundamental logical units of which the whole Central Nervous System
consists. It is indicative that the neuron lies somewhere in the middle of the
structural organization of the brain shown in Fig. 2.8. A neuron is a nerve cell with
all of its processes. Neurons are one of the main distinguishing features of animals
(plants do not have nerve cells). Neurons come in a wide variety of shapes, sizes
and functionality in different parts of the brain. The number of different classes of
neurons that have been identified in humans lies between seven and a hundred (the
wide variation in that estimate is related to how restrictively a class of
neurons is defined) [49]. A simple representation of a neuron is shown in Fig. 2.9.
Fig 2.9 Schematic representation of a typical neuron
As shown in Fig. 2.9, the neuron typically consists of three main parts: the
dendrites (or dendritic tree) and the synapses (or synaptic connections or synaptic
terminals), the neuron cell body, and the axon. Typically the neuron can be in two
states:
the resting state, where no electrical signal is generated, and the firing state, where
the neuron depolarises and an electrical signal is generated (that is, the output of
the neuron) [48].
The neuron receives inputs from other neurons that are connected to it, via
synaptic connections that are mainly positioned on the dendrites. The incoming
signals (which are in the form of positive or negative electrical potentials) are
summed in the neuron's cell body (also called the soma) and if the resulting sum exceeds a
certain amount, referred to as the activation threshold, then the neuron
depolarises and an electrical pulse is generated. This pulse is commonly known as
an action potential or spike.
Originating at or close to the cell body of the neuron the action potential propagates
through the axon of the neuron at constant velocity and amplitude to the synaptic
terminals. Through these synaptic terminals the electrical signals generated at one
neuron are transmitted to the neurons that it is interconnected to.
Typically, neural events happen in the millisecond (10⁻³ s) range, whereas in a
silicon chip the corresponding time range is of the order of nanoseconds (10⁻⁹ s).
Thus, biological neurons are five to six orders of magnitude slower than silicon
chips.
2.16.3 Dendrites and Synapses
The dendrites, the receptive zones of the neuron, have an irregular surface and a
great number of branches. As shown at the top right of Fig. 2.9, dendritic
spines and synaptic inputs are observed on a dendrite. These synaptic inputs
are the points at which a neuron is connected to other neurons and receives input signals
from them. Thus synapses are the elementary functional and structural units that
mediate the interactions between neurons. Between one and ten thousand
incoming synapses are typical for cortical neurons. With respect to the nature of the
signal that is transferred through a synapse there are two kinds of synaptic
connections, the chemical synapse and the electrical synapse, with the former being
the most common [49].
In the case of the chemical synapse there is no actual contact between the presynaptic and
the postsynaptic neuron. Instead there is a synaptic gap (synaptic cleft), and the
chemical synapse operates as follows: when an electrical signal arrives from the
presynaptic neuron at the synapse, a process at the presynaptic neuron liberates a
number of molecules of a chemical substance called a neurotransmitter. These
neurotransmitter molecules diffuse across the synaptic gap and are captured in
specialized regions of the dendrites of the postsynaptic neuron by molecules
called neuroreceptors, generating electrical signals in the postsynaptic
region. Thus, a chemical synapse converts electrical signals that are generated in
the presynaptic neuron into chemical signals that travel through the synaptic gap
and then back into postsynaptic electrical signals.
It is obvious that this kind of synaptic transmission is unidirectional and
nonreciprocal, i.e., chemical synapses carry signals from a neuron that always plays
the role of the presynaptic unit to another neuron that always plays the role of the
postsynaptic unit. This is the main difference between chemical and electrical
synapses [49].
In the case that two neurons are interconnected via an electrical synapse, an
electrical signal can be transmitted from the neuron with the higher voltage to the one
with the lower voltage; thus signal transmission can be bi-directional in electrical
synapses. This characteristic of electrical synapses means that there is no fixed
presynaptic and postsynaptic neuron in this kind of synaptic connection, and
these roles can be interchanged depending on the electrical conditions on each
of the interconnected neurons.
Besides distinguishing synapses as chemical or electrical according
to the nature of the transmitted signal, we can classify synapses, with respect to the
kind of activation produced in the postsynaptic neuron, into two main
categories: excitatory synapses and inhibitory synapses. In the first case,
that of the excitatory synapse, the electrical potential transmitted to the
postsynaptic neuron is positive and has an excitation effect. In the second case, that of
the inhibitory synapse, the postsynaptic potential is negative and imposes
inhibition on the postsynaptic neuron.
A key point in the synaptic transmission is that the signals are weighted. That is,
some postsynaptic potentials are stronger than others.
2.16.4 Neuron Cell Body
The neuron cell body (or soma) has a triangular-like form and contains the
nucleus of the cell. As shown in Fig. 2.9, the dendrites lead into the
neuron cell body, carrying the incoming inputs (electrical signals generated by the
postsynaptic potentials). These electrical signals affect the membrane potential of
the cell body of the neuron. Typically, when in the resting state, the membrane
potential of a neuron is approximately –70 mV. If the incoming postsynaptic
potential is positive (excitatory) the membrane potential is increased, moving
closer to the firing state. If the incoming postsynaptic potential is negative
(inhibitory) the membrane potential is decreased, moving away from the firing
state [49].
All the incoming postsynaptic potentials are summed in both time and space
(temporal and spatial summation). If the resulting sum is equal to or greater than
the firing threshold of the neuron, so that the membrane potential exceeds a certain
value (typically –60 mV), then the neuron depolarises (fires) and an action potential
is generated and propagated through the axon of the neuron to the synaptic
terminals.
After firing, the neuron returns to the resting state and the membrane potential to
the appropriate resting value. This is not done instantaneously, but takes a little
time, called the refractory period of the neuron. When the refractory period has
passed, the neuron is ready to fire again if it receives the appropriate input.
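The firing rule just described can be caricatured in a few lines of code; the potential values below follow the text (–70 mV resting, –60 mV threshold), while the input potentials are made up for illustration.

```python
# A toy sketch of the firing rule described above: postsynaptic potentials
# are summed onto the membrane potential, and the neuron fires once the
# potential crosses the threshold; the input values are illustrative only.
RESTING, THRESHOLD = -70.0, -60.0   # membrane potentials in mV

def run(neuron_inputs):
    potential, spikes = RESTING, 0
    for psp in neuron_inputs:        # excitatory (+) or inhibitory (-) PSPs
        potential += psp
        if potential >= THRESHOLD:   # depolarisation: an action potential
            spikes += 1
            potential = RESTING      # return to rest (refractory period)
    return spikes

# Four excitatory potentials of +3 mV push the neuron past the threshold once.
print(run([3, 3, 3, 3, -2, 3, 3, 3]))
```

This deliberately ignores temporal decay of the potential; it only illustrates the summation-and-threshold behaviour described in the text.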
2.16.5 The Axon
In cortical neurons, the axon is very long and thin and is characterized by high
electrical resistance and very large capacitance. The neural axon is the main
transmission line of the neuron and propagates the action potential. The axon has a
smoother surface than the dendrites and carries the characteristic nodes of Ranvier
(not shown in Fig. 2.9) that help the propagation of the action potential along the
axon. The axon terminates at the synaptic terminals that establish the
interconnection of the neuron to other neurons [49].
2.16.6 The Neuron Model
To build up an ANN we need to model the biological neuron, the elementary
computing unit in the brain that is capable of performing information-processing
operations. The simplest model of a neuron is shown in Fig. 2.10.
Neurons, also referred to as processing elements (PEs), nodes, short-term memory
devices, or threshold logic units, are the ANN components where most, if not all,
of the computing is done. The generic model of the neuron shown in Fig. 2.10
constitutes the basis for designing and implementing ANNs. As indicated in Fig.
2.10, there are three basic elements of the neuronal model: a set of synapses or
synaptic (connecting) links, an adder (logical unit) and an activation function
(threshold function).
The synapses, or connecting links, carry the input signals to the neuron, coming
from either the environment or the outputs of other neurons. Each synapse is
characterized by a weight or strength of its own, which affects the impact of the
specific input.
Therefore, the incoming signals to a neuron are weighted, i.e. multiplied by the
appropriate value of the synaptic weight. To be more specific, a signal xj at the input
of synapse j of the kth neuron is multiplied by the synaptic weight wkj. In the
notation followed here, the first subscript refers to the neuron in question and the
second subscript refers to the input to which the weight refers.
Figure 2.10 Model of a neuron.
In general, and in keeping with the biological picture, there are two primary types
of synaptic connections: excitatory and inhibitory. Excitatory connections increase
the neuron's activation and are typically represented by positive signals. Inhibitory
connections, on the other hand, decrease the neuron's activation and are typically
represented by negative signals. The two types of connections are implemented
using positive and negative values, respectively, for the corresponding synaptic
weights [49].
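As a minimal sketch of this weighting step, the effect of excitatory (positive) and inhibitory (negative) synaptic weights can be illustrated in Python; the input and weight values below are hypothetical:

```python
# Weighted synaptic inputs for a single neuron k (hypothetical values).
# Positive weights model excitatory connections, negative weights inhibitory ones.
inputs = [0.5, 1.0, 0.8]       # signals x_1, x_2, x_3 arriving at the synapses
weights = [0.7, -0.4, 0.2]     # w_k1 (excitatory), w_k2 (inhibitory), w_k3

# Each incoming signal x_j is multiplied by its synaptic weight w_kj.
weighted = [w * x for w, x in zip(weights, inputs)]
```

Here the second connection is inhibitory: its contribution wk2 x2 is negative and lowers the neuron's total activation.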
One of the most important features of the model neuron presented here, as of
biological neurons, is that the values of the synaptic weights are subject to
alteration and modification in response to various inputs and in accordance with
the network's own rules for modification. This feature, technically called synaptic
modification, is of great importance since it is closely related to the ANN's ability
to adapt and learn.
Sometimes there is an additional parameter bk associated with the inputs. The role
of this additional parameter depends on the type of the activation function.
Typically it is considered an internal bias, which can also be weighted. In a
somewhat different approach, this parameter is a threshold value (denoted by θk for
the kth neuron) that must be exceeded for any neuronal activation to occur. In
general, it is a parameter that has the effect of increasing or decreasing the
neuron's net input υk to the activation function, according to whether its value is
positive or negative [47].
The second basic element of the model neuron is the adder. This element sums the
input signals that are transmitted to the neuron through its synapses and weighted
by them. The described operations constitute a linear combiner. As mentioned
above, the total result of summing the incoming weighted signals and adding the
bias bk (or subtracting the threshold θk) is denoted by υk.
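The adder described above can be sketched as a small Python function; the name linear_combiner and the convention of passing a negative bias to represent a threshold are illustrative choices, not from the source:

```python
def linear_combiner(x, w, bias=0.0):
    """Adder element of the model neuron: sum the weighted inputs and add
    the bias b_k, producing the activation v_k handed to the activation
    function.  To subtract a threshold theta_k instead, pass bias=-theta_k."""
    return sum(wj * xj for wj, xj in zip(w, x)) + bias
```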
The third basic element of the model neuron is the activation function, also
referred to as the squashing function or signal function. The role of the activation
function is to squash (limit) the output signal of the neuron to a certain (finite)
range. Thus, the activation function maps a (possibly infinite) domain (the input)
to a pre-specified range (the output). A great number of mathematical functions
could be suitable for the role of the activation function of a neuron. However, four
families of functions are the most widely used: the step, the linear, the ramp and
the sigmoid functions [50].
The step (or threshold) function is described by the following equation:
Thus, the step function of (Eq. 2.1) returns a positive value if its argument is a
nonnegative number, otherwise it returns a negative value if its argument is a
negative number. A special case of the step function is for = 1 and = 0. In thatγ δ
case (Eq. 2.1) is transformed to (Eq. 2.2):
This special case of the step function is commonly referred to as the Heaviside
function. A plot of the Heaviside function is shown in Fig. 2.11a. A neuron that
incorporates the Heaviside function as its activation function is usually referred
to as the McCulloch-Pitts neuron model, in recognition of the pioneering work done
by McCulloch and Pitts back in 1943. According to that neuron model, the output of
a neuron turns to the firing state, generating an output signal equal to 1, if the
total input to the neuron is non-negative. Otherwise, in the case that the total
input is negative, the neuron remains in the resting state, generating no signal
(zero output). This characteristic behaviour is referred to as the all-or-none
property of the McCulloch-Pitts model [47]. The all-or-none property is in
accordance with the behaviour of biological neurons, where the total postsynaptic
potential (input) must exceed a certain internal threshold value in order for the
neuron to fire and generate an action potential. If that threshold value is not
exceeded, the neuron remains in the resting state.
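The McCulloch-Pitts neuron can be sketched directly from the definitions above; the function names and the example weights and threshold are illustrative:

```python
def heaviside(v):
    """Heaviside step activation (Eq. 2.2): 1 for v >= 0, else 0."""
    return 1 if v >= 0 else 0

def mcculloch_pitts(x, w, theta):
    """All-or-none McCulloch-Pitts neuron: fires (output 1) if and only if
    the weighted input sum reaches the internal threshold theta."""
    v = sum(wj * xj for wj, xj in zip(w, x)) - theta
    return heaviside(v)
```

With hypothetical weights [1, 1] and threshold 1.5, the neuron fires only when both binary inputs are active, i.e., it realises a logical AND.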
The next family of activation functions is the linear function, described in its general
form by the equation:
φ(υ) = αυ (Eq. 2.3)
The parameter α is a real-valued constant that regulates the magnification of the
neuron activity υ. Despite its simple form, the linear function is rather
inappropriate for the role of the activation function of a neuron, since it is not
bounded (given that the input parameter υ is not bounded either).
The third family of commonly used activation functions is the ramp function, also
referred to as the piece-wise linear function. The ramp function is a linear
function bounded to the range [-γ, +γ] and in its general form is described by the
equation:

φ(υ) = γ, if υ ≥ γ; υ, if -γ < υ < γ; -γ, if υ ≤ -γ (Eq. 2.4)

In the above equation γ and -γ correspond to the maximum and the minimum output
values respectively, i.e., the upper and lower bounds of the mapping. The
piece-wise linear functions of (Eq. 2.4) are often used to represent a simplified
nonlinear operation and can be viewed as an approximation to a nonlinear
amplifier. Depending on the value of the input υ, the ramp function operates as a
linear function without running into saturation if υ is in the linear region;
otherwise the function returns the upper or lower saturation value. In the special
case of γ = 1/2, with 1 and 0 as the upper and lower bounds respectively, (Eq.
2.4) takes the form:

φ(υ) = 1, if υ ≥ 1/2; υ + 1/2, if -1/2 < υ < 1/2; 0, if υ ≤ -1/2 (Eq. 2.5)
A graphical representation of the ramp function described in (Eq. 2.5) is shown in
Fig. 2.11b. As shown there, this special form of the ramp function exhibits a
linear part in the range -1/2 < υ < 1/2 and saturates to the upper or the lower
bound if υ exceeds that range.
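The ramp functions of (Eqs. 2.4 and 2.5) can be sketched as follows; the function names and the default value of gamma are illustrative:

```python
def ramp(v, gamma=0.5):
    """General ramp activation (Eq. 2.4): linear inside (-gamma, gamma),
    saturating at -gamma and +gamma outside that region."""
    return max(-gamma, min(gamma, v))

def ramp01(v):
    """Special case of Eq. 2.5 (gamma = 1/2, bounds 0 and 1):
    linear for -1/2 < v < 1/2, saturating at 0 and 1."""
    return max(0.0, min(1.0, v + 0.5))
```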
The fourth and final family of activation functions is the sigmoid functions. The
family of sigmoid functions is by far the most pervasive type of activation
function and the most commonly used in the implementation of an ANN. That is
because the sigmoid functions incorporate a number of properties that are most
desirable in the construction of a neuron. There are several types of sigmoid
functions. A common type is the logistic function, described by the following
equation:

φ(υ) = 1 / (1 + e^(-αυ)) (Eq. 2.6)
The parameter α is the slope parameter of the sigmoid function. Graphical
representations of the logistic sigmoid function for different values of the slope
parameter α are shown in Fig. 2.11c. The shape of the curves reveals why the
sigmoid functions have been given that name: the s-shape of their graphs. As is
easily recognised in Fig. 2.11c, the logistic sigmoid function is a bounded,
monotonic, non-decreasing function that provides a graded, nonlinear response.
Thus, the logistic function balances between linear and nonlinear behaviour.
The upper and lower bounds (saturation values) of that function are 1 and 0
respectively. Another feature of the logistic function that is partially revealed
in Fig. 2.11c is the role of the slope parameter α. The greater the value of that
parameter, the steeper the increase of the logistic function. In the limit, as the
slope parameter approaches infinity, the logistic function turns into a simple
step (Heaviside) function.
However, for values of the slope parameter in the normal range, the logistic
function is a continuous and differentiable function that returns a continuous
range of values from 0 to 1 (a graded response). By contrast, the Heaviside
function is not differentiable.
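A minimal sketch of the logistic function of (Eq. 2.6), with the slope parameter exposed (the function name is illustrative):

```python
import math

def logistic(v, a=1.0):
    """Logistic sigmoid (Eq. 2.6): bounded, monotonic, differentiable,
    with outputs in (0, 1); the slope parameter a controls steepness."""
    return 1.0 / (1.0 + math.exp(-a * v))
```

Raising a makes the curve steeper; for very large a, logistic(v, a) behaves like the Heaviside step.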
A second sigmoid-type function that ranges in the interval [0, 1] is the augmented
ratio of squares function, defined as:

φ(υ) = υ²/(1 + υ²), if υ > 0; 0, otherwise (Eq. 2.7)

What is common to the activation functions described by (Eqs. 2.2, 2.5, 2.6 and
2.7) is that they return an output in the range from 0 to 1. However, sometimes it is
desirable to have an activation function in the range from -1 to 1. In that case we
have to give a different definition of the threshold function of (Eqs. 2.1 and
2.2). The new form of the threshold function is described by the following
equation:

φ(υ) = 1, if υ > 0; 0, if υ = 0; -1, if υ < 0 (Eq. 2.8)

The above equation is commonly referred to as the signum function, since it
returns the sign of the parameter υ, or 0 if υ is neither positive nor negative.

Fig. 2.11. Three common types of activation functions. (a) Threshold (Heaviside)
function. (b) Piece-wise linear (ramp) function. (c) Sigmoid function for varying
slope parameter α.
Similarly, other types of sigmoid functions must be presented for the case where
the desirable output range is from -1 to 1, instead of the range from 0 to 1
returned by the logistic sigmoid function of (Eq. 2.6). In that case, among
others, two reasonable candidates exist. The first one is a hyperbolic
trigonometric function, the hyperbolic tangent function, which is defined as:

φ(υ) = tanh(υ) (Eq. 2.9)

The second one is defined by the formula:
Both functions defined in the last two equations have saturation levels at -1
(lower) and 1 (upper), and therefore range in [-1, 1].
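The signed-range activations of (Eqs. 2.8 and 2.9) can be sketched as follows (function names are illustrative):

```python
import math

def signum(v):
    """Signum activation (Eq. 2.8): +1, 0 or -1 according to the sign of v."""
    return (v > 0) - (v < 0)

def tanh_activation(v):
    """Hyperbolic tangent activation (Eq. 2.9): saturates at -1 and +1."""
    return math.tanh(v)
```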
The description of the neural dynamics in mathematical terms follows. According to
the notation introduced above, assume that the kth neuron receives m synaptic
connections; xj is the incoming input signal via the jth synaptic connection, wkj
is the corresponding synaptic weight of that connection, θk is the threshold and
bk is the bias. In the case that the adder sums the total incoming weighted
signals and subtracts the threshold θk, the obtained result υk is given by the
mathematical formula:

υk = Σ(j=1..m) wkj xj - θk

In (Eq. 2.13), the bias bk is instead included in the sum as the product wk0 x0,
where x0 = 1 and wk0 = bk:

υk = Σ(j=0..m) wkj xj (Eq. 2.13)
Finally, let yk be the output signal of the kth neuron that receives the total
incoming signal υk. The output of the neuron is given by the next formula:

yk = φ(υk) (Eq. 2.14)

In the above equation, φ(υ) is the activation function, which can be any one of
those described in Eqs. 2.1 – 2.10.
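Putting the pieces together, the complete model neuron of (Eqs. 2.13 and 2.14) can be sketched as a single function; the name neuron_output and the default logistic activation are illustrative choices:

```python
import math

def neuron_output(x, w, bias=0.0, phi=None):
    """Complete model neuron: the weighted sum of the inputs plus the bias
    gives the activation v_k, which the activation function phi maps to the
    output y_k.  The logistic sigmoid is used when no phi is supplied."""
    if phi is None:
        phi = lambda v: 1.0 / (1.0 + math.exp(-v))
    v = sum(wj * xj for wj, xj in zip(w, x)) + bias
    return phi(v)
```

Any of the activation functions described above can be passed in as phi, e.g. a step function for a McCulloch-Pitts style unit.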
The neuron-like processing element presented here models, approximately, three of
the processes we know real neurons perform. As far as we know, there are at least
150 processes performed by the neurons in the human brain. Despite the obvious
poverty of the model neuron, it handles several basic functions. Namely, the model
neuron is capable of receiving and evaluating the input signals, of calculating a
total of the combined inputs and comparing that total to some threshold level, and
finally of determining what the output should be. In addition to the deterministic
neuronal model presented above, for some applications of neural networks it is
desirable to incorporate a stochastic feature into the dynamics of the neural
model. In such a case, the neuronal model is based on a modification of the
bi-state neuronal element of McCulloch-Pitts and is permitted to reside in only
two states: +1 and -1. The
decision of a neuron to alter its state is probabilistic. Thus, the neuron fires
(is in the +1 state) with firing probability P(υ), and it remains in the -1 state
with probability 1 - P(υ). The firing probability is given by the formula:

P(υ) = 1 / (1 + e^(-υ/T)) (Eq. 2.15)

In the above formula, T is a pseudo-temperature that is incorporated to control
the noise level, and thus the uncertainty and the stochastic nature of firing; it
must be understood as a parameter that represents the effects of synaptic noise.
In the limiting case T → 0, the stochastic neural model reduces to the noiseless
(therefore deterministic) form described by the McCulloch-Pitts neural model in
(Eq. 2.2) [47].
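The stochastic bi-state neuron can be sketched as follows, assuming a logistic firing probability P(υ) = 1/(1 + e^(-υ/T)), which is a common choice for this model; the overflow guard and the function names are illustrative:

```python
import math
import random

def firing_probability(v, T=1.0):
    """Logistic firing probability P(v), guarded against overflow in exp
    when v/T is extreme (as T -> 0 this tends to a hard 0/1 decision)."""
    if v / T > 700.0:
        return 1.0
    if v / T < -700.0:
        return 0.0
    return 1.0 / (1.0 + math.exp(-v / T))

def stochastic_state(v, T=1.0, rng=random.random):
    """Stochastic bi-state neuron: +1 with probability P(v), else -1.
    The pseudo-temperature T sets the synaptic-noise level; as T -> 0
    the rule reduces to the deterministic McCulloch-Pitts behaviour."""
    return 1 if rng() < firing_probability(v, T) else -1
```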
2.16.6 Supervised and unsupervised learning
The learning algorithm of a neural network can either be supervised or
unsupervised.
A neural net is said to learn in a supervised manner if the desired output is
already known. Neural nets that learn unsupervised have no such target outputs, so
it cannot be determined in advance what the result of the learning process will
look like. During the learning process, the units (weight values) of such a neural
net are "arranged" within a certain range, depending on the given input values.
The goal is to group similar units close together in certain areas of the value
range. This effect can be used efficiently for pattern classification purposes [51].
2.16.7 Forward propagation
Forward propagation is a supervised learning algorithm and describes the "flow of
information" through a neural net from its input layer to its output layer.
The algorithm works as follows:
1. Set all weights to random values ranging from -1.0 to +1.0
2. Set an input pattern (binary values) to the neurons of the net's input layer
3. Activate each neuron of the following layer:
Multiply the weight values of the connections leading to this neuron with
the output values of the preceding neurons
Add up these values
Pass the result to an activation function, which computes the output value of
this neuron
4. Repeat this until the output layer is reached
5. Compare the calculated output pattern to the desired target pattern and
compute an error value
6. Change all weights by adding the error value to the (old) weight values
7. Go to step 2
8. The algorithm ends, if all output patterns match their target patterns
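Steps 2–4 of the algorithm above (the forward pass) can be sketched as follows, assuming a logistic activation; the nested-list weight layout and the function names are illustrative choices:

```python
import math

def sigmoid(v):
    """Logistic activation used for each neuron in the forward pass."""
    return 1.0 / (1.0 + math.exp(-v))

def forward(layers, pattern):
    """Propagate an input pattern through the net layer by layer (steps 2-4).
    `layers` is a list of weight matrices; layers[k][j] holds the weights of
    the connections leading into neuron j of the next layer."""
    activations = list(pattern)
    for weight_matrix in layers:
        # For each neuron: multiply the incoming weights by the preceding
        # outputs, add them up, and pass the sum to the activation function.
        activations = [
            sigmoid(sum(w * a for w, a in zip(row, activations)))
            for row in weight_matrix
        ]
    return activations
```

The calculated output pattern returned here would then be compared to the target pattern (step 5) to drive the weight changes.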
2.16.8 Multi-layer Perceptron: This was first introduced by M. Minsky and
S. Papert in 1969 [52]. It is a special case of the perceptron whose first-layer
units are replaced by trainable threshold logic units in order to allow it to
solve non-linearly separable problems. Minsky and Papert called a multi-layer
perceptron with one trainable hidden layer a Gamba perceptron. The structure is
shown in Figure 2.12.
Figure 2.12: Structure of a multi-layer perceptron, with an input layer, a first
and a second hidden layer, and an output layer.
Each layer is fully connected to the next one. Depending on complexity,
performance and implementation considerations, the number of hidden layers may be
increased or decreased, with a corresponding increase or decrease in the number of
hidden units and connections.
Both the perceptron and the multi-layer perceptron are trained with error-
correction learning. But since the hidden units of a multi-layer perceptron do not
have an explicit error available, further work on the multi-layer perceptron
stopped around 1970, until a method to train multi-layer perceptrons was later
discovered. The method is called Back Propagation, or the generalized Delta Rule.
With this method, processing is done from the input to the output layer, that is,
in the forward direction, after which the computed errors are propagated back in
the backward direction to change the weights so as to obtain a better result.
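The error-correction idea can be sketched in much simplified form for a single linear unit; this is the delta rule for one unit, not full back propagation, and the function name and learning rate are illustrative:

```python
def delta_rule_step(x, w, target, eta=0.5):
    """One error-correction (delta rule) update for a single linear unit:
    compute the output in the forward direction, then use the error
    (target minus output) to adjust each weight in the backward direction."""
    y = sum(wj * xj for wj, xj in zip(w, x))
    error = target - y
    new_w = [wj + eta * error * xj for wj, xj in zip(w, x)]
    return new_w, error
```

Repeating such updates over the training patterns gradually reduces the output error, which is the behaviour steps 5–8 of the forward-propagation algorithm describe.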
2.16.9 Strength of Artificial Neural Networks
They can handle a wide range of problems
They provide good results even in complicated domains
They handle both categorical and continuous variables
They are available in many off-the-shelf packages
2.16.10 Weaknesses of Artificial Neural Networks
They require inputs in the range of 0 to 1
They cannot explain their results
They may converge prematurely to an inferior solution
2.17 ON-LINE ANALYTICAL PROCESSING
OLAP is the next advance in giving end-user access to data.
These are client-server tools that have an advanced graphical interface talking to
an efficient and powerful presentation of the data called a cube. The cube is
ideally suited for queries that allow users to slice-and-dice the data in any way
they see fit. The cube itself is stored either in a relational database, typically
using a star schema, or in a special multi-dimensional database that optimizes
OLAP operations. OLAP tools have very fast response times, measured in seconds;
SQL queries on a standard relational database would in many cases require hours or
days to generate the same information. In addition, OLAP tools provide handy
analysis functions that are difficult or impossible to express in SQL.
2.17.1 OLAP and Data Mining
We have to provide feedback to people and use the information from data mining to
improve business processes. We need to enable people to provide input, in the form
of observations, hypotheses and hunches, about what results are important and how
to use those results [6].
In the larger solution to exploit data, OLAP clearly plays an important role as a
means of broadening the audience with access to data.
2.17.2 Strengths of OLAP
It is a powerful visualization tool.
It provides fast, interactive response time
It is good for analyzing time series
It can be used to find some clusters and outliers
Many vendors offer OLAP products
2.17.3 Weaknesses of OLAP
Setting up a cube can be difficult
It does not handle continuous variables well
Cubes can quickly become out-of-date
It is not data mining
2.18 DATA MINING APPLICATION AREAS
Other application areas are:
Health sector
Food and drug product safety
Election analysis
Detection of terrorists or criminals
etc.
2.19 DATA MINING TOOLS
Many good data mining software products are available [5]:
Enterprise Miner by SAS
Intelligent Miner by IBM
CLEMENTINE by SPSS
PolyAnalyst by Megaputer
WEKA (from the University of Waikato in New Zealand), etc.
Given a CSP P = (V, D, C), its dual transformation dual(P) = (Vdual(P), Ddual(P), Cdual(P)) is defined as follows.
Vdual(P) = {c1, …, cm}, where c1, …, cm are called dual variables. For each constraint Ci of P there is a unique corresponding dual variable ci. We use vars(ci) and rel(ci) to denote the corresponding sets vars(Ci) and rel(Ci) (when the context is not ambiguous). Ddual(P) = {dom(c1), …, dom(cm)} is the set of domains for the dual variables. For each dual variable ci, dom(ci) = rel(Ci), i.e., each value for ci is a tuple over vars(Ci). An assignment of a value t to a dual variable ci, ci ← t, can thus be viewed as a sequence of assignments to the ordinary variables x ∈ vars(ci), where each such ordinary variable is assigned the value t[x].
Cdual(P) is a set of binary constraints over Vdual(P) called the dual constraints. There is a dual constraint between dual variables ci and cj if S = vars(ci) ⋂ vars(cj) ≠ ∅. In this dual constraint a tuple ti ∈ dom(ci) is compatible with a tuple tj ∈ dom(cj) iff ti[S] = tj[S], i.e., the two tuples have the same values over their common variables.
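Under the definitions above, a small illustrative sketch of the dual transformation; the CSP itself (three 0/1 variables and two hypothetical constraints) is invented for the example:

```python
from itertools import product

# Hypothetical CSP: variables x1, x2, x3 with domains {0, 1} and constraints
# C1: x1 != x2 over vars {x1, x2};  C2: x2 == x3 over vars {x2, x3}.
constraints = {
    "c1": (("x1", "x2"), [t for t in product((0, 1), repeat=2) if t[0] != t[1]]),
    "c2": (("x2", "x3"), [t for t in product((0, 1), repeat=2) if t[0] == t[1]]),
}

# Dual variables: one per constraint, with dom(ci) = rel(Ci).
dual_domains = {ci: rel for ci, (_, rel) in constraints.items()}

def compatible(ti, tj, vars_i, vars_j):
    """Dual-constraint check: tuples ti and tj are compatible iff they
    agree on every variable shared by the two original constraints."""
    shared = set(vars_i) & set(vars_j)
    return all(ti[vars_i.index(x)] == tj[vars_j.index(x)] for x in shared)
```

Here c1 and c2 share the variable x2, so a dual constraint exists between them, and a pair of tuples is compatible exactly when both assign x2 the same value.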
Given a CSP P = (V, D, C), its hidden transformation hidden(P) = (Vhidden(P), Dhidden(P), Chidden(P)) is defined as follows. Vhidden(P) = {x1, …, xn} ∪ {c1, …, cm}, where {x1, …, xn} is the original set of variables in V (called ordinary variables) and c1, …, cm are dual variables generated from the constraints in C. There is a unique dual variable corresponding to each constraint Ci ∈ C. When dealing with the hidden transformation, the dual variables are sometimes called hidden variables. Dhidden(P) = {dom(x1), …, dom(xn)} ∪ {dom(c1), …, dom(cm)} is the set of domains for the ordinary and dual variables. For each dual variable ci, dom(ci) = rel(Ci).
V = {x1, …, xn} is a finite set of n variables. D = {dom(x1), …, dom(xn)} is a set of domains; each variable x ∈ V has a corresponding finite domain of possible values, dom(x). C = {C1, …, Cm} is a set of m constraints; each constraint C ∈ C is a pair (vars(C), rel(C)) defined as follows: