
Data Mining Concepts

Date post: 13-Jan-2015
Upload: dung-nguyen
Description:
An introduction to data mining techniques.
Transcript
Page 1: Data Mining Concepts

Data Mining Concept

Ho Viet Lam - Nguyen Thi My Dung
May 14th, 2007

Page 2: Data Mining Concepts

Content

o Introduction
o Overview of data mining technology
o Association rules
o Classification
o Clustering
o Applications of data mining
o Commercial tools
o Conclusion

Page 3: Data Mining Concepts

Introduction

o What is data mining?
o Why do we need to ‘mine’ data?
o What kinds of data can we ‘mine’?

Page 4: Data Mining Concepts

What is data mining?

o The process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques.

o A part of the Knowledge Discovery in Data (KDD) process.

Page 5: Data Mining Concepts

Why data mining?

o The explosive growth in data collection
o The storing of data in data warehouses
o The availability of increased access to data from Web navigation and intranets
o We need a more effective way to use these data in the decision support process than traditional query languages alone

Page 6: Data Mining Concepts

On what kind of data?

o Relational databases
o Data warehouses
o Transactional databases
o Advanced database systems
  - Object-relational
  - Spatial and temporal
  - Time-series
  - Multimedia, text
  - WWW
  - …

[Figure labels: Structure - 3D Anatomy; Function - 1D Signal; Metadata - Annotation]

Page 7: Data Mining Concepts

Overview of data mining technology

o Data Mining vs. Data Warehousing
o Data Mining as a part of the Knowledge Discovery Process
o Goals of Data Mining and Knowledge Discovery
o Types of Knowledge Discovered during Data Mining

Page 8: Data Mining Concepts

Data Mining vs. Data Warehousing

o A Data Warehouse is a repository of integrated information, available for queries and analysis. Data and information are extracted from heterogeneous sources as they are generated ... This makes it much easier and more efficient to run queries over data that originally came from different sources.

o The goal of data warehousing is to support decision making with data!

Page 9: Data Mining Concepts

Knowledge Discovery in Databases and Data Mining

o The non-trivial extraction of implicit, unknown, and potentially useful information from databases.

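o The knowledge discovery process comprises six phases: data selection, data cleansing, enrichment, data transformation or encoding, data mining, and the reporting and display of the discovered information.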

Page 10: Data Mining Concepts

Goals of Data Mining and KDD

o Prediction: how certain attributes within the data will behave in the future.

o Identification: identify the existence of an item, an event, or an activity.

o Classification: partition the data into categories.

o Optimization: optimize the use of limited resources.

Page 11: Data Mining Concepts

Types of Knowledge Discovery during Data Mining

o Association rules
o Classification hierarchies
o Sequential patterns
o Patterns within time series
o Clustering

Page 12: Data Mining Concepts

Content

o Introduction
o Overview of data mining technology
o Association rules
o Classification
o Clustering
o Applications of data mining
o Commercial tools
o Conclusion

Page 13: Data Mining Concepts

Association Rules

o Purpose
  - To provide rules that correlate the presence of one set of items with the presence of another set of items

Page 14: Data Mining Concepts

Association Rules

o Some concepts
o Market-basket model
  - Look for combinations of products
  - Put the SHOES near the SOCKS so that if a customer buys one they will buy the other

o Transaction: the fact that a person buys some items from the itemset at the supermarket

Page 15: Data Mining Concepts

Association Rules

o Some concepts

Transaction-id  Time  Items-Bought
101             6:35  Bread, milk, cookies, juice
792             7:38  Milk, juice
1130            8:05  Milk, eggs
1735            8:40  Bread, cookies, coffee

Item     Support
Milk     3
Bread    2
Cookies  2
Juice    2
Coffee   1
Eggs     1

o Support: how frequently a specific itemset occurs in the database.
o Confidence of the rule X => Y: support (X U Y) / support (X).
o Example: for Bread => Juice, only one of the four transactions contains both items (support 25%), and one of the two transactions containing bread also contains juice, so the confidence is 50%.
o The goal of mining association rules is to generate all possible rules that exceed some minimum user-specified support and confidence thresholds.
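The two definitions can be checked against the four example transactions. This is a minimal sketch: the function names `support` and `confidence` are illustrative, and support is returned as a fraction of transactions rather than as the raw counts shown in the item table.

```python
# The four example transactions, as sets of items (comments give the transaction ids).
transactions = [
    {"bread", "milk", "cookies", "juice"},  # 101
    {"milk", "juice"},                      # 792
    {"milk", "eggs"},                       # 1130
    {"bread", "cookies", "coffee"},         # 1735
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(lhs, rhs, transactions):
    """support(lhs U rhs) / support(lhs) for the rule lhs => rhs."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)


print(support({"bread", "juice"}, transactions))       # 0.25
print(confidence({"bread"}, {"juice"}, transactions))  # 0.5
print(confidence({"milk"}, {"juice"}, transactions))   # two thirds
```

A rule is kept only if both numbers clear their thresholds, which is why Bread => Juice (support 25%) would be dropped under a 50% minimum support even though its confidence is 50%.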

Page 16: Data Mining Concepts

Association Rules

o Apriori Algorithm

Input: database of m transactions, D, and a minimum support, mins, represented as a fraction of m

Output: frequent itemsets, L1, L2, …, Lk

Page 17: Data Mining Concepts

Association Rules

o Apriori algorithm

Database D (mins = 2 transactions, minf = 0.5):

Transaction-id  Time  Items-Bought
101             6:35  Bread, milk, cookies, juice
792             7:38  Milk, juice
1130            8:05  Milk, eggs
1735            8:40  Bread, cookies, coffee

Item     Support
Milk     3
Bread    2
Cookies  2
Juice    2
Eggs     1
Coffee   1

The candidate 1-itemsets:

Itemset  milk  bread  juice  cookies  eggs  coffee
Freq     0.75  0.5    0.5    0.5      0.25  0.25

Frequent 1-itemsets (freq >= 0.5):

Itemset  milk  bread  juice  cookies
Freq     0.75  0.5    0.5    0.5

The candidate 2-itemsets:

Itemset  {milk, bread}  {milk, juice}  {bread, juice}  {milk, cookies}  {bread, cookies}  {juice, cookies}
Freq     0.25           0.5            0.25            0.25             0.5               0.25

Frequent 2-itemsets (freq >= 0.5):

Itemset  {milk, juice}  {bread, cookies}
Freq     0.5            0.5

The candidate 3-itemsets: {…}

Frequent 3-itemsets: Ф

Result: milk, bread, juice, cookies, {milk, juice}, {bread, cookies}

Page 18: Data Mining Concepts

o Apriori Algorithm

Begin
  compute support(ij) = count(ij)/m for each individual item, i1, i2, …, in, by scanning the database once and counting the number of transactions that item ij appears in;
  the candidate frequent 1-itemset, C1, will be the set of items i1, i2, …, in;
  the subset of items ij from C1 where support(ij) >= mins becomes the frequent 1-itemset, L1;
  k = 1;
  termination = false;
  repeat
    Lk+1 = Ф;
    create the candidate frequent (k+1)-itemset, Ck+1, by combining members of Lk that have k-1 items in common (this forms candidate frequent (k+1)-itemsets by selectively extending frequent k-itemsets by one item);
    in addition, only consider as elements of Ck+1 those k+1 items such that every subset of size k appears in Lk;
    scan the database once and compute the support for each member of Ck+1; if the support for a member of Ck+1 >= mins then add that member to Lk+1;
    if Lk+1 is empty then termination = true else k = k+1;
  until termination
end;
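The pseudocode above can be sketched in Python as follows. This is an illustrative version under simplifying assumptions, not a tuned implementation: the function name `apriori` is made up for this example, minimum support is passed as a fraction, and the support of each candidate is computed by rescanning the transaction list rather than in one shared pass.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]
    m = len(transactions)

    def supp(candidate):
        # Fraction of transactions containing the whole candidate itemset.
        return sum(1 for t in transactions if candidate <= t) / m

    # C1 is every single item; L1 keeps those meeting the minimum support.
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        candidate = frozenset([item])
        s = supp(candidate)
        if s >= min_support:
            current[candidate] = s

    frequent = dict(current)
    k = 1
    while current:
        # Join step: merge members of Lk that share k-1 items.
        candidates = set()
        for a, b in combinations(current, 2):
            union = a | b
            if len(union) == k + 1:
                # Prune step: every subset of size k must itself be in Lk.
                if all(frozenset(sub) in current for sub in combinations(union, k)):
                    candidates.add(union)
        # Keep only candidates that meet the minimum support (this is Lk+1).
        current = {}
        for candidate in candidates:
            s = supp(candidate)
            if s >= min_support:
                current[candidate] = s
        frequent.update(current)
        k += 1
    return frequent


example = [
    {"bread", "milk", "cookies", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "cookies", "coffee"},
]
for itemset, s in sorted(apriori(example, 0.5).items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

On the four example transactions with a minimum support of 0.5, the loop stops after k = 2 because the two frequent pairs share no item and so produce no candidate 3-itemset.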

Page 19: Data Mining Concepts

Association Rules

o Demo of Apriori Algorithm

Page 20: Data Mining Concepts

Association Rules

o Frequent-pattern tree algorithm
  - Motivated by the fact that Apriori-based algorithms may generate and test a very large number of candidate itemsets.
  - Example: with 1000 frequent 1-items, Apriori would have to generate C(1000, 2) = 499,500 candidate 2-itemsets.
  - The FP-growth algorithm is one approach that eliminates the generation of a large number of candidate itemsets.
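As a quick sanity check on the size of the candidate set, the number of candidate 2-itemsets built from n frequent items is the number of unordered pairs, n choose 2. By contrast, 2^n - 1 counts every non-empty itemset of any size. A small sketch using only the standard library:

```python
import math

n = 1000  # number of frequent 1-items in the example

# Candidate 2-itemsets are unordered pairs of frequent items.
candidate_pairs = math.comb(n, 2)
print(candidate_pairs)  # 499500

# Upper bound on itemsets of every size: all non-empty subsets of n items.
all_itemsets = 2**n - 1
print(len(str(all_itemsets)), "decimal digits")
```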

Page 21: Data Mining Concepts

Association Rules

o Frequent-pattern tree algorithm
  - Generating a compressed version of the database in terms of an FP-Tree
  - FP-Growth algorithm for finding frequent itemsets

Page 22: Data Mining Concepts

Association Rules

o FP-Tree algorithm

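Item head table (each entry also holds a link to that item's nodes in the tree):

Item     Support
Milk     3
Bread    2
Cookies  2
Juice    2

Building the tree from Root, one transaction at a time (items reordered by descending support, infrequent items dropped):

Transaction  Items bought                 Ordered frequent items       Effect on the tree
1            Bread, milk, cookies, juice  Milk, bread, cookies, juice  Root -> Milk:1 -> Bread:1 -> Cookies:1 -> Juice:1
2            Milk, juice                  Milk, juice                  Milk:2, new branch Milk -> Juice:1
3            Milk, eggs                   Milk                         Milk:3
4            Bread, cookies, coffee       Bread, cookies               New branch Root -> Bread:1 -> Cookies:1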
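The construction steps for the example can be sketched in Python. The class and function names (`Node`, `build_fp_tree`) are illustrative, and breaking ties between equally frequent items alphabetically is an assumption made here so that the ordering matches the example (milk, bread, cookies, juice).

```python
from collections import Counter


class Node:
    """One FP-tree node: an item, a count, and child nodes keyed by item."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, min_count):
    # First scan: count how many transactions contain each item.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {item: c for item, c in counts.items() if c >= min_count}

    root = Node(None)
    header = {item: [] for item in frequent}  # item -> list of its tree nodes

    # Second scan: insert each transaction's frequent items, most frequent first.
    for t in transactions:
        ordered = sorted((i for i in set(t) if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header


def show(node, depth=0):
    """Print the tree with one level of indentation per depth."""
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)


example = [
    ["bread", "milk", "cookies", "juice"],
    ["milk", "juice"],
    ["milk", "eggs"],
    ["bread", "cookies", "coffee"],
]
root, header = build_fp_tree(example, min_count=2)
show(root)
```

Because shared prefixes are merged, the three transactions that contain milk share a single `milk` node with count 3, which is where the compression comes from.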

Page 23: Data Mining Concepts

Association Rules

o FP-Growth algorithm

Root
  Milk:3
    Bread:1
      Cookies:1
        Juice:1
    Juice:1
  Bread:1
    Cookies:1

Frequent itemsets: {milk, juice}, {bread, cookies}, milk, bread, cookies, juice

Page 24: Data Mining Concepts

Association Rules

o FP-Growth algorithm

Procedure FP-growth (tree, alpha);
Begin
  If tree contains a single path then
    For each combination, beta, of the nodes in the path
      Generate pattern (beta U alpha) with support = minimum support of nodes in beta
  Else
    For each item, I, in the header of the tree do
    Begin
      Generate pattern beta = (I U alpha) with support = I.support;
      Construct beta’s conditional pattern base;
      Construct beta’s conditional FP-tree, beta_tree;
      If beta_tree is not empty then
        FP-growth (beta_tree, beta);
    End
end
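The recursion can be sketched in Python. Note the simplification: this version keeps each conditional pattern base as a plain list of (path, count) pairs instead of compressing it into a conditional FP-tree, and it omits the single-path shortcut. It follows the same pattern-growth idea and finds the same itemsets, but it is not the tree-based procedure itself. The name `fp_growth` and the alphabetical tie-break are assumptions of this example.

```python
from collections import defaultdict


def fp_growth(pattern_base, min_count, suffix=frozenset()):
    """Mine frequent itemsets from a list of (items, count) pairs.

    Returns {frozenset(itemset): support count}; `suffix` is the pattern
    (alpha in the pseudocode) that every itemset at this level extends.
    """
    # Count each item within this (conditional) pattern base.
    counts = defaultdict(int)
    for items, n in pattern_base:
        for item in items:
            counts[item] += n
    frequent = {i: c for i, c in counts.items() if c >= min_count}

    # Order items by descending support; ties are broken alphabetically.
    order = sorted(frequent, key=lambda i: (-frequent[i], i))
    rank = {item: r for r, item in enumerate(order)}

    # Drop infrequent items and sort every path into that order.
    paths = []
    for items, n in pattern_base:
        kept = sorted((i for i in items if i in rank), key=rank.get)
        if kept:
            paths.append((kept, n))

    results = {}
    for item in reversed(order):  # start from the least frequent item
        beta = suffix | {item}
        results[beta] = frequent[item]
        # Conditional pattern base: the prefix of each path that ends at `item`.
        conditional = [(p[:p.index(item)], n) for p, n in paths if item in p]
        conditional = [(p, n) for p, n in conditional if p]
        results.update(fp_growth(conditional, min_count, beta))
    return results


example = [
    ["bread", "milk", "cookies", "juice"],
    ["milk", "juice"],
    ["milk", "eggs"],
    ["bread", "cookies", "coffee"],
]
patterns = fp_growth([(t, 1) for t in example], min_count=2)
for itemset, count in sorted(patterns.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

No candidate itemsets are generated and tested here: each itemset is grown from an item that is already known to be frequent in its conditional base.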

Page 25: Data Mining Concepts

Association Rules

o Demo of FP-Growth algorithm

Page 26: Data Mining Concepts

Classification

o Introduction
  - Classification is the process of learning a model that describes different classes of data; the classes are predetermined.
  - The model that is produced is usually in the form of a decision tree or a set of rules.

[Figure: sample decision tree]

Married
  yes: Salary
    <20k: Poor risk
    >=20k and <50k: Fair risk
    >=50k: Good risk
  no: Acct balance
    <5k: Poor risk
    >=5k: Age
      <25: Fair risk
      >=25: Good risk

Page 27: Data Mining Concepts

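Sample training data (class attribute: Loanworthy):

RID  Married  Salary    Acct balance  Age   Loanworthy
1    No       >=50k     <5k           >=25  Yes
2    Yes      >=50k     >=5k          >=25  Yes
3    Yes      20k..50k  <5k           <25   No
4    No       <20k      >=5k          <25   No
5    No       <20k      <5k           >=25  No
6    Yes      20k..50k  >=5k          >=25  Yes

Expected information, where p_i is the fraction of records in class i:

$$I(S_1, S_2, \ldots, S_n) = -\sum_{i=1}^{n} p_i \log_2 p_i$$

Entropy of an attribute A with m distinct values, where S_{ij} is the number of records of class i that take the j-th value of A:

$$E(A) = \sum_{j=1}^{m} \frac{S_{1j} + \ldots + S_{nj}}{S} \cdot I(S_{1j}, \ldots, S_{nj})$$

Information gain:

Gain(A) = I - E(A)

For the six records (3 "yes", 3 "no"): I(3,3) = 1

Attribute     E(A)  Gain(A)
Married       0.92  0.08
Salary        0.33  0.67
Acct balance  0.92  0.08
Age           0.54  0.46

Salary has the highest gain, so it is chosen as the root. Resulting tree:

Salary
  <20k: class is “no” {4,5}
  20k..50k: Age
    <25: class is “no” {3}
    >=25: class is “yes” {6}
  >=50k: class is “yes” {1,2}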
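The expected-information and gain calculations can be reproduced from the six sample records. This is a minimal sketch: the tuple layout, the `ATTRS` column map and the function names are choices made for this example only.

```python
from collections import Counter
from math import log2

# (married, salary, acct_balance, age, loanworthy) for RID 1..6
records = [
    ("no",  ">=50k",    "<5k",  ">=25", "yes"),
    ("yes", ">=50k",    ">=5k", ">=25", "yes"),
    ("yes", "20k..50k", "<5k",  "<25",  "no"),
    ("no",  "<20k",     ">=5k", "<25",  "no"),
    ("no",  "<20k",     "<5k",  ">=25", "no"),
    ("yes", "20k..50k", ">=5k", ">=25", "yes"),
]
ATTRS = {"married": 0, "salary": 1, "acct_balance": 2, "age": 3}
CLASS = 4  # column index of the class attribute


def info(rows):
    """I(S1, ..., Sn): expected information of the class distribution in rows."""
    counts = Counter(r[CLASS] for r in rows)
    total = len(rows)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def expected_info(rows, col):
    """E(A): information of each partition, weighted by partition size."""
    groups = {}
    for r in rows:
        groups.setdefault(r[col], []).append(r)
    return sum(len(g) / len(rows) * info(g) for g in groups.values())


def gain(rows, col):
    """Gain(A) = I - E(A)."""
    return info(rows) - expected_info(rows, col)


for name, col in ATTRS.items():
    print(f"{name}: E = {expected_info(records, col):.2f}, gain = {gain(records, col):.2f}")
```

Salary comes out with the largest gain (about 0.67), because two of its three partitions contain a single class and so contribute no remaining uncertainty.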

Page 28: Data Mining Concepts

Classification

o Algorithm for decision tree induction

Procedure Build_tree (Records, Attributes);
Begin
  Create a node N;
  If all Records belong to the same class, C, then
    Return N as a leaf node with class label C;
  If Attributes is empty then
    Return N as a leaf node with class label C, such that the majority of Records belong to it;
  Select attribute Ai (with the highest information gain) from Attributes;
  Label node N with Ai;
  For each known value, Vj, of Ai do
  Begin
    Add a branch from node N for the condition Ai = Vj;
    Sj = subset of Records where Ai = Vj;
    If Sj is empty then
      Add a leaf, L, with class label C, such that the majority of Records belong to it, and return L
    Else add the node returned by Build_tree(Sj, Attributes - Ai);
  End;
End;
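A compact Python sketch of the induction procedure, run on the six loan records. Assumptions to note: the tree is returned as nested dictionaries rather than node objects, branches are created only for attribute values that actually occur in the current subset (so the "Sj is empty" case never arises), and ties in information gain are broken by the order of the attribute list. Inside the 20k..50k branch, Age and Acct balance tie, and the list order below makes the sketch pick Age.

```python
from collections import Counter
from math import log2


def entropy(rows, target):
    counts = Counter(r[target] for r in rows)
    return -sum(c / len(rows) * log2(c / len(rows)) for c in counts.values())


def gain(rows, attr, target):
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset, target)
    return entropy(rows, target) - remainder


def build_tree(rows, attributes, target):
    classes = [r[target] for r in rows]
    if len(set(classes)) == 1:
        return classes[0]                            # pure node: leaf with that class
    if not attributes:
        return Counter(classes).most_common(1)[0][0]  # majority class
    # Pick the attribute with the highest information gain (first one on ties).
    best = max(attributes, key=lambda a: gain(rows, a, target))
    rest = [a for a in attributes if a != best]
    tree = {best: {}}
    for value in sorted({r[best] for r in rows}):
        subset = [r for r in rows if r[best] == value]
        tree[best][value] = build_tree(subset, rest, target)
    return tree


columns = ("married", "salary", "acct_balance", "age", "loanworthy")
raw = [
    ("no",  ">=50k",    "<5k",  ">=25", "yes"),
    ("yes", ">=50k",    ">=5k", ">=25", "yes"),
    ("yes", "20k..50k", "<5k",  "<25",  "no"),
    ("no",  "<20k",     ">=5k", "<25",  "no"),
    ("no",  "<20k",     "<5k",  ">=25", "no"),
    ("yes", "20k..50k", ">=5k", ">=25", "yes"),
]
records = [dict(zip(columns, row)) for row in raw]

tree = build_tree(records, ["married", "salary", "age", "acct_balance"], "loanworthy")
print(tree)
```

The nested dictionary reads top-down like the tree: the outer key is the attribute tested at a node, and each value is either a class label or another test.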

Page 29: Data Mining Concepts

Classification

o Demo of decision tree

Page 30: Data Mining Concepts

Clustering

o Introduction
  - The previous data mining task, classification, deals with partitioning data based on a pre-classified training sample.
  - Clustering is an automated process to group related records together.
  - Related records are grouped together on the basis of having similar values for attributes.
  - The groups are usually disjoint.

Page 31: Data Mining Concepts

Clustering

o Some concepts
  - An important facet of clustering is the similarity function that is used.
  - When the data is numeric, a similarity function based on distance is typically used.
  - Euclidean metric (Euclidean distance), Minkowski metric, Manhattan metric.

Euclidean distance between two records r_j and r_k with n attributes:

$$\mathrm{Distance}(r_j, r_k) = \sqrt{|r_{j1} - r_{k1}|^2 + \ldots + |r_{jn} - r_{kn}|^2}$$
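The three metrics named above are closely related: the Minkowski distance with exponent p reduces to the Manhattan distance for p = 1 and to the Euclidean distance for p = 2. A small sketch, with illustrative function names and records given as tuples of numbers:

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length numeric records."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def manhattan(a, b):
    return minkowski(a, b, 1)


def euclidean(a, b):
    return minkowski(a, b, 2)


r_j, r_k = (0, 0), (3, 4)
print(euclidean(r_j, r_k))  # 5.0
print(manhattan(r_j, r_k))  # 7.0
```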

Page 32: Data Mining Concepts

Clustering

o K-means clustering algorithm
o Input: a database D of m records r1, …, rm and a desired number of clusters, k
o Output: a set of k clusters

Begin
  Randomly choose k records as the centroids for the k clusters;
  Repeat
    Assign each record, ri, to a cluster such that the distance between ri and the cluster centroid (mean) is the smallest among the k clusters;
    Recalculate the centroid (mean) for each cluster based on the records assigned to the cluster;
  Until no change;
End;
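The loop above can be sketched in plain Python. The details below are assumptions made for the sake of a runnable example: records are tuples of numbers, squared Euclidean distance is used for the assignment step, a `seed` parameter makes the random initial choice repeatable, `max_iter` guards against endless loops, and a cluster that loses all its records simply keeps its previous centroid.

```python
import random


def kmeans(records, k, seed=0, max_iter=100):
    """Return (centroids, assignment), where assignment[i] is the cluster of records[i]."""
    rng = random.Random(seed)
    centroids = rng.sample(records, k)  # k distinct records as initial centroids
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each record goes to its nearest centroid.
        new_assignment = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(r, centroids[c])))
            for r in records
        ]
        if new_assignment == assignment:  # "until no change"
            break
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its assigned records.
        for c in range(k):
            members = [r for r, a in zip(records, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment


points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 9.5)]
centroids, assignment = kmeans(points, k=2)
print(centroids)
print(assignment)
```

The result depends on the initial centroids in general, so in practice the procedure is often run several times with different seeds and the best clustering is kept.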

Page 33: Data Mining Concepts

Clustering

o Demo of K-means algorithm

Page 34: Data Mining Concepts

Content

o Introduction
o Overview of data mining technology
o Association rules
o Classification
o Clustering
o Applications of data mining
o Commercial tools
o Conclusion

Page 35: Data Mining Concepts

Applications of data mining

o Market analysis
  - Marketing strategies
  - Advertisement

o Risk analysis and management
  - Finance and financial investments
  - Manufacturing and production

o Fraud detection and detection of unusual patterns (outliers)
  - Telecommunication
  - Financial transactions
  - Anti-terrorism (!!!)

Page 36: Data Mining Concepts

Applications of data mining

o Text mining (news groups, email, documents) and Web mining

o Stream data mining

o DNA and bio-data analysis
  - Disease outcomes
  - Effectiveness of treatments
  - Identifying new drugs

Page 37: Data Mining Concepts

Commercial tools

o Oracle Data Miner: http://www.oracle.com/technology/products/bi/odm/odminer.html
o Data To Knowledge: http://alg.ncsa.uiuc.edu/do/tools/d2k
o SAS: http://www.sas.com/
o Clementine: http://spss.com/clemetine/
o Intelligent Miner: http://www-306.ibm.com/software/data/iminer/

Page 38: Data Mining Concepts

Conclusion

o Data mining is a “decision support” process in which we search for patterns of information in data.

o This technique can be used on many types of data.

o Overlaps with machine learning, statistics, artificial intelligence, databases, visualization…

Page 39: Data Mining Concepts

Conclusion

o The result of mining may be to discover the following types of “new” information:
  - Association rules
  - Sequential patterns
  - Classification trees
  - …

Page 40: Data Mining Concepts

References

o Fundamentals of Database Systems, fourth edition -- R. Elmasri, S. B. Navathe -- Addison-Wesley -- ISBN 0-321-20448-4

o Discovering Knowledge in Data: An Introduction to Data Mining -- Daniel T. Larose -- Wiley -- ISBN 0-471-66652-2

o RESOURCES FROM THE INTERNET

Thanks for listening!!!

