8/7/2019 Data Mining Session
Introduction
Data mining is the process of extracting valid, previously unknown, and ultimately comprehensible information from large databases and using it to make crucial business decisions. The extracted information can be used to form a prediction or classification model, identify relations between database records, or provide a summary of the database being mined.
Introduction
The goal of identifying and utilizing information hidden in data has three requirements.
- First, the captured data must be integrated into organization-wide views, instead of department-specific views, and often supplemented with open source and/or purchased data.
- Second, the information contained in the integrated data must be extracted, or mined.
- Third, the mined information must be organized in ways that enable decision-making.
Hypothesis Verification
The verification model takes a hypothesis from the user and tests its validity against the data. The emphasis is on the user, who is responsible for formulating the hypothesis and issuing the query on the data to affirm or negate the hypothesis.
A system that supports this operation is called a verification-driven data mining system. Such systems suffer from two problems: 1) They require the decision-maker to hypothesize the desired information. 2) The quality of the extracted information depends on the user's interpretation of the posed query's results.
Information Discovery
The discovery model differs in its emphasis: it is the system that automatically discovers important information hidden in the data. The data is sifted in search of frequently occurring patterns, trends and generalisations about the data without intervention or guidance from the user. The discovery or data mining tools aim to reveal a large number of facts about the data in as short a time as possible. The corresponding systems are called discovery-driven data mining systems.
In summary, verification-driven data mining allows the decision-maker to express and verify organizational and personal domain knowledge and hypotheses, while discovery-driven data mining is used to refine these hypotheses, as well as to identify information not previously hypothesized by the user.
Data Mining Process
[Figure: Data -> (Selection) -> Target Data -> (Preprocessing) -> Preprocessed Data -> (Transformation) -> Transformed Data -> (Data Mining) -> Pattern -> (Interpretation & evaluation) -> Knowledge]
Data Mining Process
[Figure: Data Warehouse -> (Select) -> Selected Data -> (Transform) -> Transformed Data -> (Mine) -> Extracted Information -> (Assimilate) -> Assimilated Information]
Data Mining Operations
1) Creation of prediction and classification models: the goal of this operation is to use the contents of the database, which reflect historical data, i.e., data about the past, to automatically generate a model that can predict future behavior.
2) Link analysis: the goal of link analysis is to establish relations between the records in a database.
3) Database segmentation: it is often necessary to partition databases into collections of related records, either as a means of obtaining a summary of each database, or before performing a data mining operation such as model creation or link analysis.
4) Deviation detection: its goal is to identify outlying points in a particular data set, and explain whether they are due to noise or other impurities being present in the data, or due to causal reasons.
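As an illustration of the deviation detection operation, here is a minimal sketch that flags outlying points by their distance from the mean, measured in standard deviations. The z-score rule, the function name `outliers`, the threshold of 2.0, and the sample values are all assumptions made for this example; the slides do not prescribe a method. The sketch only identifies candidate points. Explaining whether they stem from noise or from causal reasons remains an analyst's task.

```python
from statistics import mean, pstdev


def outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean.

    Hypothetical helper: a simple z-score rule, one of many possible
    ways to flag outlying points.
    """
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Made-up sample data with one obviously unusual value.
data = [10, 11, 9, 10, 12, 10, 11, 50]
flagged = outliers(data)
print(flagged)
```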
Data Mining Techniques
Supervised Induction
Association Discovery
Sequence Discovery
Conceptual Clustering
Visualization
Neural Network
Supervised Induction
Supervised induction refers to the process of automatically creating a classification model from a set of records (examples), called the training set. The training set may be a sample of the database or warehouse being mined, the entire database, or a data warehouse. A supervised induction technique is particularly suitable for data mining if it has three characteristics:
1. It can produce high quality models even when the data in the training set is noisy and incomplete.
2. The resulting models are comprehensible and explainable so that the user can understand how decisions are made by the system.
3. It can accept domain knowledge. Such knowledge can expedite the induction task while simultaneously improving the quality of the induced model.
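To make the idea of inducing a comprehensible model from a training set concrete, here is a minimal sketch of a one-attribute rule learner, similar in spirit to the 1R method. It is only one possible supervised induction technique, chosen here for brevity. The function name `induce_one_rule`, the attribute names, and the training records are hypothetical and do not come from the slides.

```python
from collections import Counter, defaultdict


def induce_one_rule(training_set, attributes, label):
    """Pick the attribute whose value -> majority-class rule makes the fewest training errors.

    Returns (attribute, rule, errors), where `rule` maps each attribute
    value to the predicted class.
    """
    best = None
    for attr in attributes:
        # Count class labels for each value of this attribute.
        by_value = defaultdict(Counter)
        for record in training_set:
            by_value[record[attr]][record[label]] += 1
        # Predict the majority class for each value.
        rule = {value: counts.most_common(1)[0][0] for value, counts in by_value.items()}
        errors = sum(1 for r in training_set if rule[r[attr]] != r[label])
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best


# Hypothetical training set: did a customer respond to a mailing?
training_set = [
    {"income": "high", "owns_home": "yes", "responds": "yes"},
    {"income": "high", "owns_home": "no", "responds": "yes"},
    {"income": "low", "owns_home": "yes", "responds": "no"},
    {"income": "low", "owns_home": "no", "responds": "no"},
    {"income": "high", "owns_home": "yes", "responds": "yes"},
]
attr, rule, errors = induce_one_rule(training_set, ["income", "owns_home"], "responds")
print(attr, rule, errors)
```

The induced model is a plain value-to-class table, so a user can read directly how each decision is made, which is the second characteristic listed above.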
Association Discovery
Given a set of records, each containing a collection of items, an association discovery function is an operation against this set of records that returns the affinities that exist among the collection of items. These affinities can be expressed by rules such as "72% of all the records that contain items A, B, and C also contain items D and E". The specific percentage of occurrences is called the confidence factor of the association.
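The confidence factor can be computed directly from its definition: among the records that contain the first group of items, count the fraction that also contain the second group. The sketch below assumes records are represented as sets of item names. The function name `confidence` and the toy records are made up for illustration.

```python
def confidence(records, antecedent, consequent):
    """Fraction of records containing `antecedent` that also contain `consequent`.

    Returns None when no record contains the antecedent items.
    """
    antecedent, consequent = set(antecedent), set(consequent)
    matching = [r for r in records if antecedent <= set(r)]
    if not matching:
        return None  # the rule never applies
    hits = sum(1 for r in matching if consequent <= set(r))
    return hits / len(matching)


# Hypothetical records: four contain A, B and C; three of those also contain D and E.
records = [
    {"A", "B", "C", "D", "E"},
    {"A", "B", "C", "D", "E"},
    {"A", "B", "C", "D"},
    {"A", "B", "C", "D", "E"},
    {"A", "B"},
]
rule_confidence = confidence(records, {"A", "B", "C"}, {"D", "E"})
print(rule_confidence)  # 0.75
```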
Sequence Discovery
A typical setting for sequence discovery is a direct mail application. In this case, a catalog merchant has the information, for each customer, of the sets of products that the customer buys in every purchase order. A sequence discovery function will analyze such collections of related records and will detect frequently occurring patterns of products bought over time.
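A very small version of this analysis can be sketched by counting, across customers, how often one product is bought in an earlier order and another in a later order. This is a simplified illustration restricted to two-step patterns, not a full sequence discovery algorithm. The function name `frequent_sequences`, the `min_customers` parameter, and the purchase histories are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_sequences(histories, min_customers):
    """Find (earlier_product, later_product) pairs shared by at least `min_customers` customers.

    `histories` maps a customer id to that customer's purchase orders in
    time order, each order being a set of products.
    """
    counts = Counter()
    for orders in histories.values():
        seen = set()  # count each pair at most once per customer
        for i, j in combinations(range(len(orders)), 2):  # i is always earlier than j
            for earlier in orders[i]:
                for later in orders[j]:
                    seen.add((earlier, later))
        counts.update(seen)
    return {pair: n for pair, n in counts.items() if n >= min_customers}


# Made-up catalog purchase histories.
histories = {
    "c1": [{"tent"}, {"sleeping bag"}, {"stove"}],
    "c2": [{"tent", "lantern"}, {"sleeping bag"}],
    "c3": [{"stove"}, {"tent"}],
}
patterns = frequent_sequences(histories, min_customers=2)
print(patterns)
```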
Conceptual Clustering
Clustering is used to segment a database into subsets, the clusters, with the members of each cluster sharing a number of interesting properties. The results of a clustering operation are used in one of two ways. First, for summarizing the contents of the target database by considering the characteristics of each created cluster rather than those of each record in the database. Second, as an input to other methods, e.g., supervised induction. A cluster is a smaller and more manageable data set for the supervised inductive learning component.
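The segmentation and summarizing use of clustering can be illustrated with a small sketch. Note that it uses plain k-means on one-dimensional values, which is a different and simpler technique than conceptual clustering, swapped in here only because it is short. The function name `kmeans_1d`, the starting centers, and the values are assumptions for the example.

```python
def kmeans_1d(values, centers, iterations=10):
    """Segment numeric values into clusters around the given starting centers.

    Returns (centers, clusters). Each center summarizes its cluster.
    """
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assign every value to its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters


# Made-up values forming two obvious groups.
values = [1, 2, 3, 20, 21, 22]
centers, clusters = kmeans_1d(values, centers=[0, 10])
print(centers, clusters)
```

Each resulting cluster could then be passed on its own to a supervised induction step, in line with the second use described above.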
Visualization
Visualization provides analysts with visual summaries of data from a database. It can also be used as a method for understanding the information extracted using other data mining methods. Data mining necessitates the use of interactive visualization techniques that allow the user to quickly and easily change the type of information displayed, as well as the particular visualization method used. Visualizations are particularly useful for noticing phenomena that hold for a relatively small subset of the data, and thus are drowned out by the rest of the data when statistical tests are used, since these tests generally check for global features.
The advantage of using visualization is that the analyst does not have to know what type of phenomenon he is looking for in order to notice something unusual or interesting.
Neural Network
Neural networks are an approach to computing that involves developing mathematical structures with the ability to learn. The methods are the result of academic investigations to model nervous system learning. Neural networks have the remarkable ability to derive meaning from complicated or imprecise data and can be used to extract patterns and detect trends that are too complex to be noticed by either humans or other computer techniques. A trained neural network can be thought of as an "expert" in the category of information it has been given to analyse. This expert can then be used to provide projections given new situations of interest and answer "what if" questions.
[Figure: network diagram with input nodes X1 to X6 and nodes Z1 and Z2; detail of a single unit F(I) with inputs X1 to X6 and weights W1 to W6]
F(I) = X1*W1 + X2*W2 + ...
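The F(I) formula pairs each input with a weight and adds up the products. The sketch below assumes the truncated sum continues in the same way over all six inputs and weights shown in the figure, which the slide does not state explicitly. The function name `weighted_sum` and the numeric values are made up for illustration.

```python
def weighted_sum(inputs, weights):
    """Compute X1*W1 + X2*W2 + ... over paired inputs and weights.

    Assumption: the slide's truncated formula extends over every input.
    """
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    return sum(x * w for x, w in zip(inputs, weights))


# Hypothetical values for X1..X6 and W1..W6.
x = [1, 0, 2, 1, 0, 3]
w = [0.5, -1.0, 0.25, 2.0, 1.0, -0.5]
f_i = weighted_sum(x, w)
print(f_i)  # 1.5
```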