1. Data Mining Concept
- Ho Viet Lam - Nguyen Thi My Dung
- May 14th, 2007
2. Content
- Overview of data mining technology
- Applications of data mining
3. Introduction
- Why do we need to mine data?
- On what kind of data can we mine?
4. What is data mining?
- The process of discovering meaningful new correlations,
patterns and trends by sifting through large amounts of data stored
in repositories, using pattern recognition technologies as well as
statistical and mathematical techniques.
- A part of the Knowledge Discovery in Data (KDD) process.
5. Why data mining?
- The explosive growth in data collection
- The storing of data in data warehouses
- The availability of increased access to data from Web
navigation and intranets
- We have to find a more effective way to use these data
in the decision support process than just using traditional query
languages
6. On what kind of data?
- Advanced database systems
[Figure labels: Structure - 3D, Anatomy, Function, 1D Signal, Metadata, Annotation]
7. Overview of data mining technology
- Data Mining vs. Data Warehousing
- Data Mining as a part of Knowledge Discovery Process
- Goals of Data Mining and Knowledge Discovery
- Types of Knowledge Discovery during Data Mining
8. Data Mining vs. Data Warehousing
- A Data Warehouse is a repository of integrated information,
available for queries and analysis. Data and information are
extracted from heterogeneous sources as they are generated. This
makes it much easier and more efficient to run queries over data
that originally came from different sources.
- The goal of data warehousing is to support decision making with
data!
9. Knowledge Discovery in Databases and Data Mining
- The non-trivial extraction of implicit, unknown, and
potentially useful information from databases.
- The knowledge discovery process comprises six phases:
10. Goals of Data Mining and KDD
- Prediction: how certain attributes within the data will behave
in the future.
- Identification: identify the existence of an item, an event,
or an activity.
- Classification: partition the data into categories.
- Optimization: optimize the use of limited resources.
11. Types of Knowledge Discovery during Data Mining
- Classification hierarchies
- Patterns within time-series
12. Content
- Overview of data mining technology
- Application of data mining
13. Association Rules
- Association rules correlate the presence of one set of items
with the presence of another set of items
14. Association Rules
- Look for combinations of products
- Put the SHOES near the SOCKS so that if a customer buys one
they will buy the other
- Transaction: a record of a person buying some items of the
itemset at a supermarket
15. Association Rules
- Support: how frequently a specific itemset occurs in
the database
- Confidence of the rule X => Y:
support(X U Y) / support(X)
- Example with the transactions on this slide: for X => Y taken as
Bread => Juice, the confidence is 50%
- The goal of mining association rules is to generate all possible
rules that exceed some minimum user-specified support and
confidence thresholds
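The two measures can be checked by hand on the four transactions listed on this slide. The sketch below is only illustrative: the function names `support` and `confidence` and the dictionary layout are assumptions, not part of the original slides.

```python
# The four transactions from the slide's table, keyed by transaction id.
transactions = {
    101: {"bread", "milk", "cookies", "juice"},
    792: {"milk", "juice"},
    1130: {"milk", "eggs"},
    1735: {"bread", "cookies", "coffee"},
}

def support(itemset, db):
    """Fraction of transactions in db that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for items in db.values() if itemset <= items)
    return hits / len(db)

def confidence(lhs, rhs, db):
    """support(lhs U rhs) / support(lhs) for the rule lhs => rhs."""
    return support(set(lhs) | set(rhs), db) / support(lhs, db)

print(support({"milk", "juice"}, transactions))        # 0.5
print(support({"bread", "juice"}, transactions))       # 0.25
print(confidence({"bread"}, {"juice"}, transactions))  # 0.5
```

Bread and juice appear together in one of four transactions (support 25%), while bread appears in two, so the rule Bread => Juice has 50% confidence.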
Transaction-id  Time  Items-Bought
101             6:35  Bread, Milk, cookies, juice
792             7:38  Milk, juice
1130            8:05  Milk, eggs
1735            8:40  Bread, cookies, coffee

Item     Support
Milk     3
Bread    2
Cookies  2
Juice    2
Eggs     1
Coffee   1
16. Association Rules
- Input: a database of m transactions, D, and a minimum support,
mins, represented as a fraction of m
- Output: frequent itemsets, L1, L2, ..., Lk
17. Association Rules
Database D, mins = 2, minf = 0.5, frequent if support >= 0.5

Transaction-id  Time  Items-Bought
101             6:35  Bread, milk, cookies, juice
792             7:38  Milk, juice
1130            8:05  Milk, eggs
1735            8:40  Bread, cookies, coffee

Item     Support
Milk     3
Bread    2
Cookies  2
Juice    2
Coffee   1
Eggs     1

- The candidate 1-itemsets: milk (0.75), bread (0.5), juice (0.5),
cookies (0.5), eggs (0.25), coffee (0.25)
- Frequent 1-itemsets: milk (0.75), bread (0.5), juice (0.5),
cookies (0.5)
- The candidate 2-itemsets: {milk, bread} (0.25), {milk, juice} (0.5),
{bread, juice} (0.25), {milk, cookies} (0.25), {bread, cookies} (0.5),
{juice, cookies} (0.25)
- Frequent 2-itemsets: {milk, juice} (0.5), {bread, cookies} (0.5)
- The candidate 3-itemsets: none (the two frequent 2-itemsets share
no item)
- Frequent 3-itemsets: none
- Result: milk, bread, juice, cookies, {milk, juice}, {bread, cookies}
18.
- Compute support(i_j) = count(i_j)/m for each individual
item, i_1, i_2, ..., i_n, by scanning the database once and counting
the number of transactions that item i_j appears in
- The candidate frequent 1-itemset, C_1, will be the set of
items i_1, i_2, ..., i_n
- The subset of items i_j from C_1 where support(i_j) >= mins
becomes the frequent 1-itemset, L_1
- Starting from k = 1, create the candidate frequent (k+1)-itemset,
C_{k+1}, by combining members of L_k that have k-1 items in common
(this forms candidate frequent (k+1)-itemsets by selectively
extending frequent k-itemsets by one item)
- In addition, only consider as elements of C_{k+1} those
(k+1)-itemsets such that every subset of size k appears in L_k
- Scan the database once and compute the support for each
member of C_{k+1}; if the support for a member of C_{k+1} >= mins
then add that member to L_{k+1}
- If L_{k+1} is empty then termination = true; otherwise increase k
by one and repeat the last three steps
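The steps above can be sketched in a few lines of Python. This is a minimal, unoptimized illustration rather than a reference implementation: the function name `apriori`, the use of `frozenset` keys, and the list-of-sets database layout are all assumptions made for the example. The data is the four-transaction example from the earlier slides, with mins = 0.5.

```python
from itertools import combinations

def apriori(db, mins):
    """Return {itemset: support} for every itemset with support >= mins."""
    m = len(db)

    def supports(candidates):
        # One scan of the database per level.
        counts = {c: 0 for c in candidates}
        for items in db:
            for c in candidates:
                if c <= items:
                    counts[c] += 1
        return {c: n / m for c, n in counts.items()}

    # C1 is every individual item; L1 keeps those meeting the threshold.
    c1 = {frozenset([i]) for items in db for i in items}
    level = {c: s for c, s in supports(c1).items() if s >= mins}
    frequent = dict(level)
    k = 1
    while level:  # terminates when L(k+1) is empty
        # Join: unions of two frequent k-itemsets with k-1 items in common.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune: every subset of size k must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k))}
        level = {c: s for c, s in supports(candidates).items() if s >= mins}
        frequent.update(level)
        k += 1
    return frequent

db = [
    {"bread", "milk", "cookies", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "cookies", "coffee"},
]
result = apriori(db, 0.5)
for itemset in sorted(result, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), result[itemset])
```

On this data the sketch finds the four frequent single items plus {milk, juice} and {bread, cookies}, and stops at k = 2 because no candidate 3-itemset can be formed.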
19. Association Rules
- Demo of Apriori Algorithm
20. Association Rules
- Frequent-pattern tree algorithm
- Motivated by the fact that Apriori-based algorithms may
generate and test a very large number of candidate itemsets.
- For example, with 1000 frequent 1-items, Apriori would have to
generate C(1000, 2) = 499,500 candidate 2-itemsets
- The FP-growth algorithm is one approach that eliminates the
generation of a large number of candidate itemsets
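As a quick check on the scale involved: the number of candidate 2-itemsets that can be formed from n frequent items is the number of unordered pairs, n choose 2. The variable names below are illustrative only.

```python
from math import comb

n_frequent_items = 1000
candidate_pairs = comb(n_frequent_items, 2)  # unordered pairs of frequent items
print(candidate_pairs)  # 499500
```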
21. Association Rules
- Frequent-pattern tree algorithm
- Generating a compressed version of the database in terms of an
FP-Tree
- FP-Growth Algorithm for finding frequent itemsets
22. Association Rules
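Building the FP-tree one transaction at a time (infrequent items are
dropped, the rest are ordered by descending support):
- Transaction 1: Bread, milk, cookies, juice -> Milk, bread,
cookies, juice; path Root - Milk:1 - Bread:1 - Cookies:1 - Juice:1
- Transaction 2: Milk, juice -> Milk, juice; Milk:2, new child Juice:1
- Transaction 3: Milk, eggs -> Milk; Milk:3
- Transaction 4: Bread, cookies, coffee -> Bread, cookies; new path
Root - Bread:1 - Cookies:1

Item head table (each entry also holds a link into the tree):
Item     Support
Milk     3
Bread    2
Cookies  2
Juice    2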
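The construction on this slide can be sketched as two passes over the database. This is an illustrative sketch under stated assumptions: the `Node` class, the `build_fp_tree` and `show` names, the minimum count of 2 (matching mins = 2 from the Apriori example), and the alphabetical tie-break between items of equal support are all choices made for the example, not taken from the slides.

```python
from collections import Counter

class Node:
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(db, min_count):
    # Pass 1: count every item and keep only the frequent ones.
    counts = Counter(i for items in db for i in items)
    frequent = {i: c for i, c in counts.items() if c >= min_count}
    root = Node(None)
    header = {i: [] for i in frequent}  # item -> nodes carrying that item
    # Pass 2: insert each transaction, most frequent items first.
    for items in db:
        ordered = sorted((i for i in items if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header

def show(node, depth=0):
    """Print the tree with one 'item:count' per line, indented by depth."""
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)

db = [
    {"bread", "milk", "cookies", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "cookies", "coffee"},
]
root, header = build_fp_tree(db, min_count=2)
show(root)
```

Shared prefixes are what make the tree compact: the three transactions that contain milk all pass through a single `milk` node whose count ends at 3.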
23. Association Rules
Resulting FP-tree:
Root
  Milk:3
    Bread:1
      Cookies:1
        Juice:1
    Juice:1
  Bread:1
    Cookies:1
[Figure labels: Milk, juice; Bread, cookies; Milk, bread, cookies, juice]
24. Association Rules
- Procedure FP-growth (tree, alpha);
  - If tree contains a single path then
    - For each combination, beta, of the nodes in the path
      - Generate pattern (beta U alpha) with support = minimum
        support of nodes in beta
  - Else
    - For each item, i, in the header of the tree do
      - Generate pattern beta = (i U alpha) with support = i.support;
      - Construct beta's conditional pattern base;
      - Construct beta's conditional FP-tree, beta_tree;
      - If beta_tree is not empty then
        - FP-growth (beta_tree, beta);
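The recursive part of the procedure can be sketched as follows. This is a simplified illustration, not the full procedure: it omits the single-path shortcut and always takes the per-item branch, which yields the same itemsets with more recursion. The names `Node`, `build_tree`, and `fp_growth`, the representation of a pattern base as (items, count) pairs, and the minimum count of 2 are assumptions made for the example.

```python
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_tree(base, min_count):
    """Build an FP-tree from (items, count) pairs; return its header table."""
    counts = Counter()
    for items, n in base:
        for i in items:
            counts[i] += n
    frequent = {i: c for i, c in counts.items() if c >= min_count}
    root = Node(None, None)
    header = {i: [] for i in frequent}  # item -> nodes carrying that item
    for items, n in base:
        ordered = sorted((i for i in items if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for i in ordered:
            if i not in node.children:
                node.children[i] = Node(i, node)
                header[i].append(node.children[i])
            node = node.children[i]
            node.count += n
    return header

def fp_growth(header, min_count, alpha=frozenset()):
    """Yield (itemset, count) for each frequent itemset that extends alpha."""
    for item, nodes in header.items():
        beta = alpha | {item}
        yield beta, sum(n.count for n in nodes)
        # Conditional pattern base: the prefix path above each node,
        # weighted by that node's count.
        base = []
        for n in nodes:
            path, p = [], n.parent
            while p.item is not None:
                path.append(p.item)
                p = p.parent
            if path:
                base.append((path, n.count))
        beta_header = build_tree(base, min_count)
        if beta_header:  # conditional FP-tree is not empty
            yield from fp_growth(beta_header, min_count, beta)

db = [
    {"bread", "milk", "cookies", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "cookies", "coffee"},
]
header = build_tree([(t, 1) for t in db], min_count=2)
result = dict(fp_growth(header, min_count=2))
for itemset in sorted(result, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), result[itemset])
```

No candidate itemsets are generated and tested against the database here; each pattern grows only from items that actually co-occur in the prefix paths of the tree.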
25. Association Rules
- Demo of FP-Growth algorithm
26. Classification
- Classification is the process of learning a model that
describes different classes of data; the classes are
predetermined
- The model that is produced is usually in the form of a decision
tree or a set of rules
[Figure: sample decision tree over the attributes married, salary,
acct balance, and age, with leaves labeled good risk and fair risk]
27.
- Class attribute
- Expected information: I(3,3) = 1
- Information gain: Gain(A) = I - E(A)
- E(Married) = 0.92, Gain(Married) = 0.08
- E(Salary) = 0.33, Gain(Salary) = 0.67
- E(A.balance) = 0.82, Gain(A.balance) = 0.18
- E(Age) = 0.81, Gain(Age) = 0.19
[Figure: decision tree built from these gains; recoverable labels:
Salary, age, Class is no {4,5}, >=50k, 20k..50k, >=5k, >=25]
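The quantities on this slide can be computed with a short sketch. The slides do not include the underlying training records, so the per-partition class counts below are assumptions chosen to be consistent with I(3,3) = 1, E(Salary) = 0.33, and E(Married) = 0.92; the function names `info`, `expected_info`, and `gain` are likewise illustrative.

```python
from math import log2

def info(*class_counts):
    """Expected information I(s1, ..., sn) in bits for the given class counts."""
    total = sum(class_counts)
    return -sum(c / total * log2(c / total) for c in class_counts if c)

def expected_info(partitions):
    """E(A): information of each partition, weighted by partition size."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * info(*p) for p in partitions)

def gain(class_counts, partitions):
    """Gain(A) = I - E(A)."""
    return info(*class_counts) - expected_info(partitions)

# Assumed (yes, no) class counts per attribute value; not from the slides.
salary = [(2, 0), (0, 2), (1, 1)]
married = [(2, 1), (1, 2)]

print(info(3, 3))                       # 1.0
print(round(gain((3, 3), salary), 2))   # 0.67
print(round(gain((3, 3), married), 2))  # 0.08
```

The attribute with the largest gain (Salary in the slide's figures) is the one chosen to split on first.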