8/7/2019 Data Mining Session


Introduction

Data mining is the process of extracting valid, previously unknown, and ultimately comprehensible information from large databases and using it to make crucial business decisions. The extracted information can be used to form a prediction or classification model, identify relations between database records, or provide a summary of the database being mined.


Introduction

The goal of identifying and utilizing information hidden in data has three requirements.

- First, the captured data must be integrated into organization-wide views, instead of department-specific views, and often supplemented with open source and/or purchased data.

- Second, the information contained in the integrated data must be extracted, or mined.

- Third, the mined information must be organized in ways that enable decision-making.


Hypothesis Verification

The verification model takes a hypothesis from the user and tests its validity against the data. The emphasis is on the user, who is responsible for formulating the hypothesis and issuing the query on the data to affirm or negate the hypothesis.

A system that supports this operation is called a verification-driven data mining system. Such systems suffer from two problems: 1) They require the decision-maker to hypothesize the desired information. 2) The quality of the extracted information depends on the user's interpretation of the posed query's results.
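
As a rough illustration of the verification model, the sketch below tests one user-supplied hypothesis ("older customers buy more often") against a handful of records. The records, the age threshold, and the function names are all hypothetical; a real system would issue a query against the database instead.

```python
# Hypothetical records: (customer_age, bought_product)
records = [
    (25, False), (31, False), (38, True), (44, True),
    (47, True), (52, False), (58, True), (63, True),
]

def purchase_rate(rows):
    """Fraction of rows whose purchase flag is True (rows must be non-empty)."""
    return sum(1 for _, bought in rows if bought) / len(rows)

def verify_hypothesis(rows, age_threshold):
    """Check the user's hypothesis that customers at or above the
    threshold age buy more often than younger customers."""
    older = [r for r in rows if r[0] >= age_threshold]
    younger = [r for r in rows if r[0] < age_threshold]
    older_rate = purchase_rate(older)
    younger_rate = purchase_rate(younger)
    return older_rate > younger_rate, older_rate, younger_rate

supported, older_rate, younger_rate = verify_hypothesis(records, age_threshold=40)
print(supported, older_rate, younger_rate)
```

Both problems named above are visible here: the user had to think of the hypothesis and the threshold, and deciding what the two rates mean is left to the user.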


Information Discovery

The discovery model differs in its emphasis: the system itself automatically discovers important information hidden in the data. The data is sifted in search of frequently occurring patterns, trends, and generalisations about the data without intervention or guidance from the user. Discovery, or data mining, tools aim to reveal a large number of facts about the data in as short a time as possible. The corresponding systems are called discovery-driven data mining systems.

In summary, verification-driven data mining allows the decision-maker to express and verify organizational and personal domain knowledge and hypotheses, while discovery-driven data mining is used to refine these hypotheses, as well as to identify information not previously hypothesized by the user.


Data Mining Process

[Figure: Data -> (Selection) -> Target Data -> (Preprocessing) -> Preprocessed Data -> (Transformation) -> Transformed Data -> (Data Mining) -> Patterns -> (Interpretation & evaluation) -> Knowledge]


Data Mining Process

[Figure: Data Warehouse -> (Select) -> Selected Data -> (Transform) -> Transformed Data -> (Mine) -> Extracted Information -> (Assimilate) -> Assimilated Information]


Data Mining Operations

1) Creation of prediction and classification models: the goal of this operation is to use the contents of the database, which reflect historical data, i.e., data about the past, to automatically generate a model that can predict future behavior.

2) Link analysis: the goal of link analysis is to establish relations between the records in a database.

3) Database segmentation: it is often necessary to partition databases into collections of related records, either as a means of obtaining a summary of each database, or before performing a data mining operation such as model creation or link analysis.

4) Deviation detection: its goal is to identify outlying points in a particular data set, and explain whether they are due to noise or other impurities being present in the data, or due to causal reasons.
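
Deviation detection can be sketched with a simple z-score rule. This is one common technique chosen for illustration, not one the slides prescribe; the sales figures, the function name, and the threshold of 2 standard deviations are assumptions.

```python
from statistics import mean, stdev

def find_outliers(values, threshold=2.0):
    """Return (index, value) pairs lying more than `threshold`
    sample standard deviations away from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # all values identical, nothing deviates
    return [(i, v) for i, v in enumerate(values)
            if abs(v - mu) / sigma > threshold]

# Hypothetical daily sales with one unusual day
daily_sales = [102, 98, 101, 97, 103, 99, 100, 250, 96, 104]
print(find_outliers(daily_sales))
```

The rule only flags the outlying point. Deciding whether it comes from noise or from a real cause still requires looking at the record itself.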


Data Mining Techniques

Supervised Induction

Association Discovery

Sequence Discovery

Conceptual Clustering

Visualization

Neural Network


Supervised Induction

Supervised induction refers to the process of automatically creating a classification model from a set of records (examples), called the training set. The training set may be a sample of the database or warehouse being mined, the entire database, or a data warehouse. A supervised induction technique is particularly suitable for data mining if it has three characteristics:

1. It can produce high quality models even when the data in the training set is noisy and incomplete.

2. The resulting models are comprehensible and explainable so that the user can understand how decisions are made by the system.

3. It can accept domain knowledge. Such knowledge can expedite the induction task while simultaneously improving the quality of the induced model.
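
To make the idea concrete, here is a minimal sketch of a very simple supervised induction method, a one-attribute rule learner in the style of 1R. It is chosen only because its output is easy to read; the slides do not name a specific algorithm. The training records, attribute names, and function name are hypothetical.

```python
from collections import Counter, defaultdict

def induce_one_rule(training_set, attributes, target):
    """Pick the single attribute whose value -> majority-class rule
    makes the fewest errors on the training set."""
    best = None
    for attr in attributes:
        counts = defaultdict(Counter)
        for record in training_set:
            value = record.get(attr)
            if value is None:  # incomplete record: skip it for this attribute
                continue
            counts[value][record[target]] += 1
        # For each attribute value, predict the most frequent class
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best

training_set = [
    {"income": "high", "owns_home": "yes", "responded": "yes"},
    {"income": "high", "owns_home": "no", "responded": "yes"},
    {"income": "low", "owns_home": "yes", "responded": "no"},
    {"income": "low", "owns_home": "no", "responded": "no"},
    {"income": "high", "owns_home": None, "responded": "yes"},  # incomplete
    {"income": "low", "owns_home": "yes", "responded": "yes"},  # noisy
]
attr, rule, errors = induce_one_rule(
    training_set, ["income", "owns_home"], "responded")
print(attr, rule, errors)
```

The induced model is a small lookup table, so a user can read directly how each decision is made. Domain knowledge could enter through the choice of candidate attributes.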


Association Discovery

Given a set of records, each containing some number of items from a collection, an association discovery function is an operation against this set of records that returns affinities that exist among the collection of items. These affinities can be expressed by rules such as "72% of all the records that contain items A, B, and C also contain items D and E." The specific percentage of occurrences is called the confidence factor of the association.
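
The confidence factor of a rule can be computed directly from the records. The sketch below assumes each record is a set of item labels; the sample records and the function name are made up for illustration.

```python
def confidence(records, antecedent, consequent):
    """Fraction of the records containing every antecedent item
    that also contain every consequent item."""
    antecedent, consequent = set(antecedent), set(consequent)
    with_antecedent = [r for r in records if antecedent <= set(r)]
    if not with_antecedent:
        return 0.0  # rule never applies in this data
    with_both = [r for r in with_antecedent if consequent <= set(r)]
    return len(with_both) / len(with_antecedent)

# Hypothetical records
records = [
    {"A", "B", "C", "D", "E"},
    {"A", "B", "C", "D", "E"},
    {"A", "B", "C", "D"},
    {"A", "B", "C", "D", "E"},
    {"A", "B"},
    {"C", "D", "E"},
]
rule_confidence = confidence(records, {"A", "B", "C"}, {"D", "E"})
print(rule_confidence)
```

Four of these records contain A, B, and C, and three of those also contain D and E, so the rule holds with a confidence factor of 75% on this sample.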


Sequence Discovery

Sequence discovery applies when records can be related to one another, for example by the identity of the customer who made them. Such a situation is typical of a direct mail application. In this case, a catalog merchant has, for each customer, the sets of products that the customer buys in every purchase order. A sequence discovery function will analyze such collections of related records and will detect frequently occurring patterns of products bought over time.
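
A much simplified sketch of sequence discovery follows. It only finds two-step patterns ("product a in an earlier order, product b in a later order") and counts how many customers show each one. The customer data, the `min_support` parameter, and the function name are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

def frequent_sequences(orders_by_customer, min_support):
    """Return ordered product pairs (a, b) where at least `min_support`
    customers bought a in an earlier order and b in a later one."""
    support = Counter()
    for orders in orders_by_customer.values():
        seen = set()
        # combinations() keeps list order, so `earlier` precedes `later`
        for earlier, later in combinations(orders, 2):
            for a in earlier:
                for b in later:
                    seen.add((a, b))
        support.update(seen)  # each customer counts once per pattern
    return {pattern: n for pattern, n in support.items() if n >= min_support}

# Hypothetical purchase histories: one list of orders per customer,
# each order being the set of products bought
orders_by_customer = {
    "c1": [{"tent"}, {"sleeping bag"}, {"stove"}],
    "c2": [{"tent", "boots"}, {"sleeping bag"}],
    "c3": [{"boots"}, {"tent"}, {"stove"}],
}
patterns = frequent_sequences(orders_by_customer, min_support=2)
print(patterns)
```

On this sample, two patterns reach the support threshold: a tent followed later by a sleeping bag, and a tent followed later by a stove.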


Conceptual Clustering

Clustering is used to segment a database into subsets, the clusters, with the members of each cluster sharing a number of interesting properties. The results of a clustering operation are used in one of two ways. First, for summarizing the contents of the target database by considering the characteristics of each created cluster rather than those of each record in the database. Second, as an input to other methods, e.g., supervised induction. A cluster is a smaller and more manageable data set for the supervised inductive learning component.
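
The segmentation step can be illustrated with plain k-means, used here as a stand-in because it is short to write; it is a numeric method and not conceptual clustering proper. The points, the starting centroids, and the function name are hypothetical.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points, starting from the given centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Assign each point to the nearest centroid (squared distance)
            nearest = min(
                range(len(centroids)),
                key=lambda i: (p[0] - centroids[i][0]) ** 2
                + (p[1] - centroids[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if empty)
        centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])

# Summarize the database by cluster instead of by record
for centroid, members in zip(centroids, clusters):
    print(len(members), centroid)
```

Each cluster's size and centroid give the kind of summary described above, and each member list could be passed on as a smaller training set.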


Visualization

Visualization provides analysts with visual summaries of data from a database. It can also be used as a method for understanding the information extracted using other data mining methods. Data mining necessitates the use of interactive visualization techniques that allow the user to quickly and easily change the type of information displayed, as well as the particular visualization method used. Visualizations are particularly useful for noticing phenomena that hold for a relatively small subset of the data, and thus are drowned out by the rest of the data when statistical tests are used, since these tests generally check for global features.

The advantage of using visualization is that the analyst does not have to know what type of phenomenon he is looking for in order to notice something unusual or interesting.


Neural Network

Neural networks are an approach to computing that involves developing mathematical structures with the ability to learn. The methods are the result of academic investigations to model nervous system learning. Neural networks have the remarkable ability to derive meaning from complicated or imprecise data and can be used to extract patterns and detect trends that are too complex to be noticed by either humans or other computer techniques. A trained neural network can be thought of as an "expert" in the category of information it has been given to analyse. This expert can then be used to provide projections given new situations of interest and answer "what if" questions.

[Figure: a network with inputs X1 to X6 and outputs Z1 and Z2, and a single unit that combines inputs X1 to X6 with weights W1 to W6 through F(I).]

F(I) = X1*W1 + X2*W2 + ... + X6*W6
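
The weighted sum F(I) that a single unit computes over inputs X1 to X6 and weights W1 to W6 can be written in a few lines. The input and weight values below are invented for the example, and no activation function or training step is shown because the slides do not specify one.

```python
def weighted_sum(inputs, weights):
    """Compute F(I) = X1*W1 + X2*W2 + ... for one unit."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    return sum(x * w for x, w in zip(inputs, weights))

# Hypothetical values for X1..X6 and W1..W6
x = [1, 0, 1, 1, 0, 1]
w = [0.5, -0.2, 0.1, 0.4, 0.3, -0.5]
f_i = weighted_sum(x, w)
print(f_i)
```

Learning, in this picture, means adjusting the weights so that the outputs match known examples.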

