THE INTEGRATION OF FUZZY LOGIC AND GRAPH SEARCH
FOR SEQUENTIAL PATTERN MINING
SUKANYA YUENYONG
A THESIS SUBMITTED IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR
THE DEGREE OF MASTER OF SCIENCE
(TECHNOLOGY OF INFORMATION SYSTEM MANAGEMENT)
FACULTY OF GRADUATE STUDIES
MAHIDOL UNIVERSITY
2008
COPYRIGHT OF MAHIDOL UNIVERSITY
ACKNOWLEDGEMENTS
I would like to express my sincere gratitude to my major advisor, Assist. Prof. Dr. Pisit
Phokharatkul, for his kindness, valuable advice, and numerous suggestions.
I would like to thank my co-advisors, Dr. Rangsipan Marukatat and
Dr. Noppadol Wanichworanant. I profoundly thank them for their valuable
recommendations.
I would like to thank Dr. Bunlur Emaruchi, committee chair, for his kindness
and all his support.
I would like to thank Assoc. Prof. Dr. Chom Kimpan, the external
examiner of the thesis defense, for his suggestions and comments.
I would like to thank Mr. Noppadol Teeranachaideekul for his suggestions, and
my best friends for their encouragement.
Finally, I am thankful to my mother and my uncle for their support,
encouragement, and love.
Sukanya Yuenyong
Fac. of Grad. Studies, Mahidol Univ. Thesis /
iv
THE INTEGRATION OF FUZZY LOGIC AND GRAPH SEARCH FOR
SEQUENTIAL PATTERN MINING
SUKANYA YUENYONG 4837225 EGTI / M
M.Sc. (TECHNOLOGY OF INFORMATION SYSTEM MANAGEMENT)
THESIS ADVISORS: PISIT PHOKHARATKUL, D.Eng., NOPPADOL
WANICHWORANANT, Ph.D., AND RANGSIPAN MARUKATAT, Ph.D.
ABSTRACT
Sequential pattern discovery is an important problem in data mining. In recent
years, many researchers have tried to find new techniques to extract sequential
patterns from large databases. In this research, an effective way of integrating fuzzy
logic and graph search methods to create the fuzzy logic and graph search (FGS)
algorithm for sequential pattern mining is proposed. The execution times of the two
graph search techniques were compared, and it was found that depth-first search
(DFS) takes less execution time than breadth-first search (BFS). Also, the FGS
algorithm takes less execution time than the GST algorithm for k-sequences with
k ≥ 2. The outcomes of the FGS algorithm are more valuable than those of the GST
algorithm because the quantitative values of each transaction are considered. Finally,
it was found that the number of FGS outcomes is substantially lower than the number
of GST outcomes. Sometimes this reduction is an advantage, but it may not be so in
all cases.
KEY WORDS: DATA MINING / SEQUENTIAL PATTERN / FUZZY LOGIC /
GRAPH SEARCH
46 pp.
THE INTEGRATION OF FUZZY LOGIC AND GRAPH SEARCH FOR SEQUENTIAL PATTERN MINING (abstract translated from Thai)
SUKANYA YUENYONG 4837225 EGTI / M
M.Sc. (TECHNOLOGY OF INFORMATION SYSTEM MANAGEMENT)
THESIS ADVISORS: PISIT PHOKHARATKUL, D.Eng., NOPPADOL WANICHWORANANT, Ph.D., RANGSIPAN MARUKATAT, Ph.D.
ABSTRACT
The discovery of sequential patterns is one of the important problems in data mining. Over the years many researchers have kept trying to find new techniques to extract sequential patterns from large databases. This research presents the effectiveness of integrating fuzzy logic and graph search for sequential pattern mining, and compares the execution times of two graph search methods. It was found that depth-first search uses less execution time than breadth-first search. In addition, the integration of fuzzy logic and graph search uses less execution time than the graph search technique alone, and gives more valuable results because the proposed method also considers the quantity of each item in the database, which in turn reduces the number of sequential patterns found. In some cases this reduction is an advantage, but in others it may be a disadvantage.
46 pp.
CONTENTS
Page
ACKNOWLEDGEMENTS iii
ABSTRACT iv
LIST OF TABLES viii
LIST OF FIGURES ix
CHAPTER
I INTRODUCTION
1.1 General Introduction 1
1.2 Statement of Problem 1
1.3 Objective of work 2
1.4 Scope of work 2
1.5 Expected Result 3
II LITERATURE REVIEW
2.1 Related Researches 4
2.2 Data Mining 4
2.3 Sequential Pattern 6
2.4 Fuzzy Set 10
2.5 Graph Theory 16
III MATERIALS AND METHODS
3.1 Research Methodology and Procedure 23
3.2 Materials 33
IV RESULTS
4.1 Simulation Model 34
4.2 Experimental Results 34
V DISCUSSION
5.1 Comparison between algorithms 41
VI CONCLUSION AND RECOMMENDATIONS
6.1 Conclusion of this work 43
6.2 Recommendations 43
REFERENCES 45
BIOGRAPHY 46
LIST OF TABLES
TABLES Page
Table 2.1 Sorted Transaction Data 8
Table 2.2 Large Item sets Minimum Support = 40% 9
Table 2.3 Transformed Database 9
Table 2.4 Example values 12
Table 3.1 The data table of 1-sequence 23
Table 3.2 The data table after transforming the quantitative values 26
Table 3.3 The data table after transforming the quantitative values 27
of customer id ‘1000’
Table 3.4 All 2-sequences of customer id ‘1000’ 27
Table 3.5 The data table of 2-sequence 28
Table 3.6 The relational graph table structure 28
Table 3.7 The relational graph table 29
Table 3.8 The sequential patterns table 30
Table 3.9 The sequential patterns table with confidence values 30
Table 4.1 The outcomes of the FGS algorithm (6 patterns in total) 35
LIST OF FIGURES
Figures Page
Figure 2.1 Knowledge Discovery in Database processes 5
Figure 2.2 Bivalent Sets to Characterize the Temp. of a room 10
Figure 2.3 Fuzzy sets to characterize the Temp. of a room 11
Figure 2.4 A membership function based on the person's age 12
Figure 2.5 Fuzzy set Union 14
Figure 2.6 Fuzzy set Intersection 15
Figure 2.7 Fuzzy set Complement 16
Figure 2.8 A pictorial view of the example graph; 17
the edge (x,x) is called a self-loop
Figure 2.9 Example of a directed graph 17
Figure 2.10 Example of an undirected graph 18
Figure 2.11 The Edge List Graph Representation 19
Figure 2.12 Breadth-first search spreading through a graph 20
Figure 2.13 Depth-first search on an undirected graph 22
Figure 3.1 Membership functions 24
Figure 3.2 The relational graph 29
Figure 4.1 The environment C250.I10.D5000 33
Figure 4.2 The environment C500.I20.D5000 37
Figure 4.3 The environment classified by gender 38
Figure 4.4 The environment classified by age 38
CHAPTER I
INTRODUCTION
1.1 General Introduction
Nowadays there is an abundance of information in daily life, and many data
collections exist. Computers are a very popular tool, and we use them to collect
data in various formats such as text files, databases, and XML. The advantages
of data collection go beyond search and review: we can extract desirable
knowledge from the existing data, a process called data mining. Data mining is the
process of extracting interesting information or patterns from large information
repositories. Recently, data mining has been recognized as a new area and has been
growing at a rapid pace. Due to the rapid growth of data, new data mining
techniques are urgently needed.
There are many types of data mining. Sequential pattern discovery is an
important problem in data mining. In recent years there have been, and continue to be,
many researchers trying to find new techniques to extract sequential patterns from
large databases. They have used different algorithms such as AprioriAll, Generalized
Sequential Pattern (GSP), PrefixSpan, SPADE, Graph Search Techniques, MEMory
Indexing for Sequential Pattern mining (MEMISP), Sequential Pattern minIng with
Regular expressIon consTraints (SPIRIT), fuzzy approaches, Multi-Dimensional
Sequential Pattern Mining, Incremental Mining of Sequential Patterns, and Periodic
Pattern Analysis. Each algorithm has both advantages and disadvantages, and
researchers try to amend the disadvantages and build on the advantages.
1.2 Statement of Problem
A problem of finding sequential patterns is what the customer will buy after they
have already bought some item or item set. That problem is concerned with
inter-transaction patterns.
There are two important aspects of mining sequential patterns. The first is the
execution-time performance. The second is the result: how to extract the most
useful sequential patterns. This problem can be construed in many ways depending on
individual interest; in the case of inventory, for example, sequential pattern mining is
used for predicting consumer purchasing behavior. The outcome of sequential pattern
mining can predict which product or group of products will be purchased next, given
the product or group of products already purchased.
Although the algorithms always extract sequential patterns, users will
always desire better patterns. New algorithms must therefore continue to address the
two important constraints: less execution time and better results.
1.3 Objective of work
The objectives of this work are:
1.3.1 To develop a method that employs fuzzy logic and a graph search
algorithm for sequential pattern mining.
1.3.2 To benchmark the performance of sequential pattern mining using
the integration of fuzzy logic and a graph search algorithm.
1.4 Scope of work
The scope of this work is:
1.4.1 Comparing the execution time of sequential pattern mining
between depth-first search and breadth-first search.
1.4.2 The sample data in this work are synthetically generated
transaction data.
1.4.3 This research does not involve data cleaning or data
preprocessing.
1.5 Expected Result
The outcome of this work will be a new algorithm for mining sequential patterns
based on the integration of fuzzy logic and graph search. The advantages and
disadvantages of this algorithm will be analyzed.
CHAPTER II
LITERATURE REVIEW
2.1 Related Researches
Mining sequential patterns was first introduced in “Mining sequential patterns” [4]
by Agrawal. In recent years researchers have tried to find new techniques for
extracting sequential patterns from large databases. They have used different
algorithms, such as AprioriAll and AprioriSome, the DSG algorithm, fuzzy
algorithms, and path traversal pattern mining, which adopted an Apriori-like method
to find sequential patterns. However, it is very costly to generate candidate sets and
repeatedly scan the database.
Graph Search Techniques [6] give an advantage over Apriori-like algorithms.
This algorithm can generate large sequences without constructing candidate sequences
and can generate large k-sequences without deriving them from large (k-1)-sequences
step by step. Besides, this algorithm considers time constraints as well, which makes
the discovered sequential patterns more useful.
However, this algorithm does not consider the quantitative values of each
transaction. Patterns that convey quantities are more valuable than those that do not.
Such patterns can be applied to many issues such as catalog design, store layout,
customer segmentation, cross-marketing strategies, cross-inventory strategies,
effectiveness of promotional campaigns, etc.
For these reasons, this work proposes a new algorithm for sequential pattern
mining that considers both quantity and time constraints.
2.2 Data Mining
Data Mining [1] is the process of extracting interesting (non-trivial, implicit,
previously unknown and potentially useful) information or patterns from large
information repositories such as relational databases, data warehouses, XML
repositories, etc. Data mining is also known as one of the core processes of Knowledge
Discovery in Databases (KDD). The KDD processes are shown in Figure 2.1.
[Figure 2.1 components: Graphical User Interface; Pattern Evaluation; Data Mining
Tools; Data Repositories (Database, Data Warehouse, Other Repositories); Data
Cleaning & Integration; Knowledge Base]
Figure 2.1 Knowledge Discovery in Database processes.
In the Knowledge Discovery in Databases process, we first need to clean and
integrate the databases. Since the data may come from different databases, which may
contain inconsistencies and duplications, we must clean the data source by removing
noise or making compromises. Suppose we have two different databases in which
different words are used to refer to the same thing in their schemas; when we
integrate the two sources we can only choose one of them, provided we know that
they denote the same thing. Real-world data also tend to be incomplete and noisy due
to manual input mistakes. The integrated data sources can be stored in a database,
data warehouse, or other repository.
As not all the data in the database are related to our mining task, the second
process is to select task-related data from the integrated resources and transform them
into a format that is ready to be mined. Suppose we want to find which items are often
purchased together in a supermarket. While the database that records the purchase
history may contain customer ID, items bought, transaction time, prices, number of
each item, and so on, for this specific task we only need the items bought. After
selection of relevant data, the database to which we apply our data mining techniques
will be much smaller, and consequently the whole process will be more efficient.
Various data mining techniques are applied to the data source, and different
knowledge comes out as the mining result. That knowledge is evaluated against
certain rules, such as domain knowledge or concepts. After the evaluation, as shown
in Figure 2.1, if the result does not satisfy the requirements or contradicts the domain
knowledge, we have to redo some processes until we get the right results. Depending
on the evaluation result we may have to redo the mining, or the user may modify his
requirements. After we obtain the knowledge, the final step is to visualize it as data
cubes or 3D graphics. This process tries to make the data mining results easier to use
and more understandable.
2.3 Sequential Pattern
A sequential pattern [1] is a sequence of item sets that frequently occur in a
specific order; all items in the same item set are supposed to have the same
transaction-time value or fall within a time gap. Usually all the transactions of a
customer are viewed together as one sequence, called the customer-sequence, where
each transaction is represented as an item set in that sequence and all the transactions
are listed in a certain order with regard to transaction-time.
2.3.1 Support
A customer supports a sequence s if s is contained in the corresponding
customer-sequence. The support of a sequence s is defined as the fraction of
customers who support this sequence:

    Support(s) = (number of supporting customers) / (total number of customers)

Example: There are 10 customers. If we define the minimum support as 20%,
every pattern must be supported by at least

    20% = (number of supporting customers) / 10
    number of supporting customers = 0.20 × 10 = 2 customers
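The support computation above can be sketched in Python; this is an illustrative sketch using the customer-sequences of Table 2.3, not the thesis's implementation:

```python
# Illustrative sketch: computing the support of a sequential pattern
# over a set of customer-sequences (data from Table 2.3).

def contains(seq, pattern):
    """True if `pattern` occurs as a subsequence of `seq`
    (each pattern itemset contained in a later transaction)."""
    i = 0
    for itemset in seq:
        if i < len(pattern) and set(pattern[i]) <= set(itemset):
            i += 1
    return i == len(pattern)

def support(pattern, customer_sequences):
    """Fraction of customers whose sequence contains `pattern`."""
    n = sum(1 for seq in customer_sequences.values() if contains(seq, pattern))
    return n / len(customer_sequences)

# Customer-sequences of Table 2.3
db = {
    1: [(30,), (90,)],
    2: [(10, 20), (30,), (40, 60, 70)],
    3: [(30, 50, 70)],
    4: [(30,), (40, 70), (90,)],
    5: [(90,)],
}
print(support([(30,), (90,)], db))  # 0.4 -> <(30)(90)> meets 40% minimum support
```

Customers 1 and 4 support <(30)(90)>, so its support is 2/5 = 40%, exactly the minimum support used in Table 2.2.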
2.3.2 Sequential support-confidence (S-confidence) [11]
The support confidence (s-confidence) of a sequential pattern S = <s1, s2, ..., sm>,
where each sj = (xj1 xj2 ... xjk) and xjt is an item in the item set sj, is a measure that
reflects the overall support affinity among items within the sequence. It is the ratio of
the minimum support of items within this pattern to the maximum support of items
within the pattern. That is, the measure is defined as

    S-conf(S) = Min{ support({xm'k'}) : 1 ≤ m' ≤ m, 1 ≤ k' ≤ length(sm') }
                / Max{ support({xm''k''}) : 1 ≤ m'' ≤ m, 1 ≤ k'' ≤ length(sm'') }
2.3.3 Sequential support affinity pattern [11]
A sequential pattern is a sequential support affinity pattern if the s-confidence of
the pattern is no less than a minimum s-confidence (min_sconf). In other words, a
sequential pattern S is a sequential support affinity pattern if and only if |S| > 0 and
s-confidence(S) ≥ min_sconf.
Example: consider a pattern S = {(AB)(AC)(ABC)(AE)} and S` =
{(BC)(BD)(BCD)(BF)}. Assume that a min_sconf is 0.5, support ({A}) = 2, support
({B}) = 5, support ({C}) = 8, support ({D}) = 4, support ({E}) = 5, and support ({F})
= 6, where support (X) is the support value of a sequential pattern X. Then, the
sequential s-confidence (S) is 0.25 (2/8) and s-confidence (S`) is 0.5 (4/8). Therefore,
sequential pattern S is not a sequential support affinity pattern but pattern S` is a
sequential support affinity pattern.
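The worked example can be reproduced with a short sketch; the item supports are the assumed values given in the text:

```python
# Sketch of the s-confidence measure of Section 2.3.2, using the
# item supports assumed in the worked example above.

def s_confidence(pattern, item_support):
    """Ratio of the minimum item support to the maximum item support
    over all items occurring in the pattern."""
    supports = [item_support[x] for itemset in pattern for x in itemset]
    return min(supports) / max(supports)

item_support = {'A': 2, 'B': 5, 'C': 8, 'D': 4, 'E': 5, 'F': 6}
S  = [('A', 'B'), ('A', 'C'), ('A', 'B', 'C'), ('A', 'E')]
S2 = [('B', 'C'), ('B', 'D'), ('B', 'C', 'D'), ('B', 'F')]
print(s_confidence(S, item_support))   # 0.25 -> below min_sconf = 0.5
print(s_confidence(S2, item_support))  # 0.5  -> a support affinity pattern
```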
2.3.4 Sequential Pattern Mining
Sequential pattern mining is the process of extracting sequential patterns
whose support exceeds a predefined minimum support threshold. Since the number of
sequences can be very large, and users have different interests and requirements, a
minimum support is usually pre-defined by users to get the most interesting sequential
patterns. Using the minimum support we can prune out those sequential patterns of no
interest, consequently making the mining process more efficient. Obviously a higher
support is desired for more useful and interesting sequential patterns.
Table 2.1 Sorted Transaction Data

Customer-id   Transaction-time   Purchased-items
1             Oct 23 '02         30
1             Oct 28 '02         90
2             Oct 18 '02         10, 20
2             Oct 21 '02         30
2             Oct 27 '02         40, 60, 70
3             Oct 15 '02         30, 50, 70
4             Oct 08 '02         30
4             Oct 16 '02         40, 70
4             Oct 25 '02         90
5             Oct 20 '02         90
Table 2.2 Large Item Sets (Minimum Support = 40%)

Large Item Sets   Mapped To
(30)              1
(40)              2
(70)              3
(40,70)           4
(90)              5
Table 2.3 Transformed Database

Customer-id   Customer Sequence         Transformed DB                    After Mapping
1             <(30)(90)>                <{(30)}{(90)}>                    <{1}{5}>
2             <(10,20)(30)(40,60,70)>   <{(30)}{(40)(70)(40,70)}>         <{1}{2,3,4}>
3             <(30,50,70)>              <{(30)(70)}>                      <{1,3}>
4             <(30)(40,70)(90)>         <{(30)}{(40)(70)(40,70)}{(90)}>   <{1}{2,3,4}{5}>
5             <(90)>                    <{(90)}>                          <{5}>
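The transformation behind Tables 2.2 and 2.3 can be sketched as follows; this illustrates the mapping idea only, not the mining algorithm itself:

```python
# Sketch of the transformation step of Tables 2.2-2.3: each transaction is
# replaced by the IDs of the large itemsets it contains; transactions that
# contain no large itemset are dropped.

large = {(30,): 1, (40,): 2, (70,): 3, (40, 70): 4, (90,): 5}  # Table 2.2

def transform(sequence):
    """Map each transaction to the sorted IDs of the large itemsets it
    contains; drop transactions that contain none."""
    out = []
    for trans in sequence:
        ids = sorted(large[ls] for ls in large if set(ls) <= set(trans))
        if ids:
            out.append(ids)
    return out

print(transform([(10, 20), (30,), (40, 60, 70)]))  # [[1], [2, 3, 4]]     (customer 2)
print(transform([(30,), (40, 70), (90,)]))         # [[1], [2, 3, 4], [5]] (customer 4)
```

Note how customer 2's first transaction (10, 20) disappears because it contains no large itemset, reproducing the "After Mapping" column of Table 2.3.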
Sequential pattern mining is used in a great spectrum of areas. In computational
biology, sequential pattern mining is used to analyze the mutation patterns of different
amino acids. Business organizations use sequential pattern mining to study customer
behaviors. Sequential pattern mining is also used in system performance analysis and
telecommunication network analysis.
2.4 Fuzzy Set
Fuzzy Set Theory [7] was formalized by Professor Lotfi Zadeh at the University
of California in 1965. What Zadeh proposed was very much a paradigm shift that first
gained acceptance in the Far East, and its successful application has ensured its
adoption around the world.
A paradigm is a set of rules and regulations which defines boundaries and tells us
what to do to be successful in solving problems within these boundaries. For example
the use of transistors instead of vacuum tubes is a paradigm shift - likewise the
development of Fuzzy Set Theory from conventional bivalent set theory is a paradigm
shift.
Bivalent set theory can be somewhat limiting if we wish to describe a
'humanistic' problem mathematically. For example, Figure 2.2 below illustrates
bivalent sets to characterize the temperature of a room.
Figure 2.2 Bivalent Sets to Characterize the Temp. of a room.
The most obvious limiting feature of bivalent sets that can be seen clearly from
the diagram is that they are mutually exclusive - it is not possible to have membership
of more than one set (opinion would widely vary as to whether 50 degrees Fahrenheit
is 'cold' or 'cool' hence the expert knowledge we need to define our system is
mathematically at odds with the humanistic world). Clearly, it is not accurate to define
a transition from a quantity such as 'warm' to 'hot' by the application of one degree
Fahrenheit of heat. In the real world a smooth (unnoticeable) drift from warm to hot
would occur.
This natural phenomenon can be described more accurately by Fuzzy Set Theory.
Figure 2.3 below shows how fuzzy sets quantifying the same information can describe
this natural drift.
Figure 2.3 Fuzzy sets to characterize the Temp. of a room.
The whole concept can be illustrated with this example. Let's talk about people
and "youngness". In this case the set S (the universe of discourse) is the set of people.
A fuzzy subset YOUNG is also defined, which answers the question "to what degree is
person x young?" To each person in the universe of discourse, we have to assign a
degree of membership in the fuzzy subset YOUNG. The easiest way to do this is with
a membership function based on the person's age.
young(x) = 1,                  if age(x) ≤ 20
           (30 − age(x))/10,   if 20 < age(x) ≤ 30
           0,                  if age(x) > 30
A graph of this looks like:
Figure 2.4 A membership function based on the person's age.
Given this definition, here are some example values:
Table 2.4 Example values.
Person Age Degree of youth
Johan 10 1.00
Edwin 21 0.90
Parthiban 25 0.50
Arosha 26 0.40
Chin Wei 28 0.20
Rajkumar 83 0.00
So given this definition, we'd say that the degree of truth of the statement
"Parthiban is YOUNG" is 0.50.
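The membership function and Table 2.4 can be transcribed directly:

```python
# The young(x) membership function defined above, transcribed as-is.

def young(age):
    if age <= 20:
        return 1.0
    if age <= 30:
        return (30 - age) / 10
    return 0.0

# Reproduces the degrees of youth in Table 2.4
for name, age in [('Johan', 10), ('Edwin', 21), ('Parthiban', 25),
                  ('Arosha', 26), ('Chin Wei', 28), ('Rajkumar', 83)]:
    print(f'{name}: {young(age):.2f}')
```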
Note: Membership functions almost never have as simple a shape as young(x). They
will at least tend to be triangles pointing up, and they can be much more complex than
that. Furthermore, membership functions so far are discussed as if they always are
based on a single criterion, but this isn't always the case, although it is the most
common case. One could, for example, want to have the membership function for
YOUNG depend on both a person's age and their height (Arosha's short for his age).
This is perfectly legitimate, and occasionally used in practice. It's referred to as a
two-dimensional membership function. It's also possible to have even more criteria, or to
have the membership function depend on elements from two completely different
universes of discourse.
2.4.1 Implementations [10]
For triangular and trapezoidal fuzzy set membership functions, let
a ≤ b ≤ c ≤ d denote the characteristic points:
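Since the formulas themselves were given as figures, here is the standard textbook formulation of triangular and trapezoidal membership functions over characteristic points a ≤ b ≤ c ≤ d; it may differ in edge-case conventions from the exact form used in [10]:

```python
# Standard trapezoidal and triangular membership functions
# (textbook formulation; not necessarily the exact form in [10]).

def trapezoidal(x, a, b, c, d):
    """Rises linearly on [a, b], equals 1 on [b, c], falls on [c, d], 0 elsewhere."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)

def triangular(x, a, b, c):
    """Special case of the trapezoid with a single peak at b (b == c)."""
    return trapezoidal(x, a, b, b, c)

print(trapezoidal(15, 10, 20, 30, 40))  # 0.5, on the rising edge
print(triangular(25, 20, 30, 40))       # 0.5, approaching the peak
```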
2.4.2 Fuzzy Set Operations.
2.4.2.1 Union
The membership function of the union of two fuzzy sets A and B, with
membership functions μA and μB respectively, is defined as the maximum of the two
individual membership functions. This is called the maximum criterion.
Figure 2.5 Fuzzy set Union.
The Union operation in Fuzzy set theory is the equivalent of the OR operation in
Boolean algebra.
2.4.2.2 Intersection
The membership function of the intersection of two fuzzy sets A and B, with
membership functions μA and μB respectively, is defined as the minimum of the
two individual membership functions. This is called the minimum criterion.
Figure 2.6 Fuzzy set Intersection
The Intersection operation in Fuzzy set theory is the equivalent of the
AND operation in Boolean algebra.
2.4.2.3 Complement
The membership function of the complement of a fuzzy set A, with
membership function μA, is defined as the negation of the specified membership
function. This is called the negation criterion.
Figure 2.7 Fuzzy set Complement
The Complement operation in Fuzzy set theory is the equivalent of the
NOT operation in Boolean algebra.
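The three operations can be expressed over discrete membership vectors; the example sets here are made up for illustration:

```python
# The fuzzy set operations of Section 2.4.2 over discrete membership vectors.

def fuzzy_union(mu_a, mu_b):         # maximum criterion (fuzzy OR)
    return [max(a, b) for a, b in zip(mu_a, mu_b)]

def fuzzy_intersection(mu_a, mu_b):  # minimum criterion (fuzzy AND)
    return [min(a, b) for a, b in zip(mu_a, mu_b)]

def fuzzy_complement(mu_a):          # negation criterion (fuzzy NOT)
    return [1 - a for a in mu_a]

A = [0.0, 0.25, 0.75, 1.0]
B = [0.2, 0.5, 0.4, 0.1]
print(fuzzy_union(A, B))         # [0.2, 0.5, 0.75, 1.0]
print(fuzzy_intersection(A, B))  # [0.0, 0.25, 0.4, 0.1]
print(fuzzy_complement(A))       # [1.0, 0.75, 0.25, 0.0]
```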
2.5 Graph Theory [8]
2.5.1 The Graph Abstraction
A graph is a mathematical abstraction that is useful for solving many kinds of
problems. Fundamentally, a graph consists of a set of vertices, and a set of edges,
where an edge is something that connects two vertices in the graph. More precisely, a
graph is a pair (V,E), where V is a finite set and E is a binary relation on V. V is called
a vertex set whose elements are called vertices. E is a collection of edges, where an
edge is a pair (u,v) with u,v in V. In a directed graph, edges are ordered pairs,
connecting a source vertex to a target vertex. In an undirected graph edges are
unordered pairs and connect the two vertices in both directions, hence in an undirected
graph (u,v) and (v,u) are two ways of writing the same edge.
This definition of a graph is vague in certain respects; it does not say what a
vertex or edge represents. They could be cities with connecting roads, or web-pages
with hyperlinks. These details are left out of the definition of a graph for an important
reason; they are not a necessary part of the graph abstraction. By leaving out the
details we can construct a reusable theory that can help us solve many different
kinds of problems.
Back to the definition: a graph is a set of vertices and edges. For purposes of
demonstration, let us consider a graph where we have labeled the vertices with letters,
and we write an edge simply as a pair of letters. Now we can write down an example
of a directed graph as follows:
V = {v, b, x, z, a, y}
E = {(b,y), (b,y), (y,v), (z,a), (x,x), (b,x), (x,v), (a,z)}
G = (V, E)
Figure 2.8 gives a pictorial view of this graph. The edge (x,x) is called a self-loop.
Edges (b,y) and (b,y) are parallel edges, which are allowed in a multigraph
(but are normally not allowed in a directed or undirected graph).
Figure 2.9 Example of a directed graph.
Next we have a similar graph, though this time it is undirected; Fig. 2.10 gives
the pictorial view. Self-loops are not allowed in undirected graphs. This graph is the
undirected version of the previous directed graph, meaning it has the same vertices
and the same edges with their directions removed; the self-loop has been removed,
and edges such as (a,z) and (z,a) are collapsed into one edge. One can go the other
way and make a directed version of an undirected graph by replacing each edge with
two edges, one pointing in each direction.
V = {v, b, x, z, a, y }
E = { (b,y), (y,v), (z,a), (b,x), (x,v) }
G = (V, E)
Figure 2.10 Example of an undirected graph.
Now for some more graph terminology. If some edge (u,v) is in graph G, then
vertex v is adjacent to vertex u. In a directed graph, edge (u,v) is an out-edge of vertex
u and an in-edge of vertex v. In an undirected graph, edge (u,v) is incident on vertices
u and v.
In the directed graph of Fig. 2.9, vertex y is adjacent to vertex b (but b is not
adjacent to y); the edge (b,y) is an out-edge of b and an in-edge of y. In the undirected
graph of Fig. 2.10, y is adjacent to b and vice versa, and the edge (y,b) is incident on
vertices y and b.
In a directed graph, the number of out-edges of a vertex is its out-degree and
the number of in-edges is its in-degree. For an undirected graph, the number of edges
incident to a vertex is its degree. In Fig. 2.9, vertex b has an out-degree of 3 and an in-
degree of zero. In Fig. 2.10, vertex b simply has a degree of 2.
Now a path is a sequence of edges in a graph such that the target vertex of
each edge is the source vertex of the next edge in the sequence. If there is a path
starting at vertex u and ending at vertex v, we say that v is reachable from u. A path is
simple if none of the vertices in the sequence are repeated. The path <(b,x), (x,v)> is
simple, while the path <(a,z), (z,a)> is not. Also, the path <(a,z), (z,a)> is called a
cycle because the first and last vertex in the path are the same. A graph with no cycles
is acyclic.
A planar graph is a graph that can be drawn on a plane without any of the
edges crossing over each other. Such a drawing is called a plane graph. A face of a
plane graph is a connected region of the plane surrounded by edges. An important
property of planar graphs is that the number of faces, edges, and vertices are related
through Euler's formula: |F| - |E| + |V| = 2. This means that a simple planar graph has
at most O (|V|) edges.
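Euler's formula is easy to sanity-check on familiar planar graphs; the counts below are the standard values for these graphs, not taken from the text:

```python
# Quick sanity check of Euler's formula |F| - |E| + |V| = 2 on two
# well-known planar graphs (faces include the outer, unbounded face).
examples = [
    ('triangle',   2, 3, 3),   # 1 inner face + outer face, 3 edges, 3 vertices
    ('cube graph', 6, 12, 8),  # 6 faces, 12 edges, 8 vertices
]
for name, F, E, V in examples:
    assert F - E + V == 2, name
print('Euler check passed')
```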
2.5.2 Edge List Representation
An edge-list representation of a graph is simply a sequence of edges, where
each edge is represented as a pair of vertex IDs. The memory required is only O(E).
Edge insertion is typically O(1), though accessing a particular edge is O(E) (not
efficient). Fig. 2.11 shows an edge-list representation of the graph in Fig. 2.9. The
edge_list adaptor class can be used to create implementations of the edge-list
representation.
Figure 2.11 The Edge List Graph Representation.
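A minimal Python sketch of the edge-list idea (not the edge_list adaptor class itself), using the undirected edge set listed above for Fig. 2.10:

```python
# Minimal edge-list representation: O(E) storage, O(1) insertion,
# O(E) scan to look up a particular edge.

class EdgeList:
    def __init__(self):
        self.edges = []          # sequence of (u, v) vertex-ID pairs

    def insert(self, u, v):      # O(1) append
        self.edges.append((u, v))

    def has_edge(self, u, v):    # O(E) linear scan, hence "not efficient"
        return (u, v) in self.edges

g = EdgeList()
for u, v in [('b', 'y'), ('y', 'v'), ('z', 'a'), ('b', 'x'), ('x', 'v')]:
    g.insert(u, v)
print(g.has_edge('b', 'x'))  # True
print(g.has_edge('x', 'z'))  # False
```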
2.5.3 Graph Search Algorithms
Tree edges are edges in the search tree (or forest) constructed (implicitly or
explicitly) by running a graph search algorithm over a graph. An edge (u,v) is a tree
edge if v was first discovered while exploring (corresponding to the visitor explore()
method) edge (u,v). Back edges connect vertices to their ancestors in a search tree. So
for edge (u,v) the vertex v must be the ancestor of vertex u. Self loops are considered
to be back edges. Forward edges are non-tree edges (u,v) that connect a vertex u to a
descendant v in a search tree. Cross edges are edges that do not fall into the above
three categories.
2.5.3.1 Breadth-First Search
Breadth-first search is a traversal through a graph that touches all of the
vertices reachable from a particular source vertex. In addition, the order of the
traversal is such that the algorithm will explore all of the neighbors of a vertex before
proceeding on to the neighbors of its neighbors. One way to think of breadth-first
search is that it expands like a wave emanating from a stone dropped into a pool of
water. Vertices in the same ``wave'' are the same distance from the source vertex. A
vertex is discovered the first time it is encountered by the algorithm. A vertex is
finished after all of its neighbors are explored. Here's an example to help make this
clear. A graph is shown in Fig. 2.12 and the BFS discovery and finish order for the
vertices is shown below.
Figure 2.12 Breadth-first search spreading through a graph.
Order of discovery: s r w v t x u y
Order of finish: s r w v t x u y
We start at vertex s, and first visit r and w (the two neighbors of s). Once
both neighbors of s are visited, we visit the neighbor of r (vertex v), then the neighbors
of w (the discovery order between r and w does not matter), which are t and x. Finally
we visit the neighbors of t and x, which are u and y.
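The traversal can be reproduced in code. The adjacency lists below are an assumption reconstructed to match the discovery order stated for Fig. 2.12, since the figure itself is not shown here:

```python
from collections import deque

# BFS sketch reproducing the discovery order listed above.
# Adjacency lists are an assumed reconstruction of Fig. 2.12.
adj = {
    's': ['r', 'w'], 'r': ['s', 'v'], 'w': ['s', 't', 'x'],
    'v': ['r'], 't': ['w', 'x', 'u'], 'x': ['w', 't', 'u', 'y'],
    'u': ['t', 'x', 'y'], 'y': ['x', 'u'],
}

def bfs(adj, source):
    order = [source]                 # discovery order
    seen = {source}
    queue = deque([source])
    while queue:
        u = queue.popleft()          # u is finished once all neighbors explored
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                order.append(v)
                queue.append(v)
    return order

print(' '.join(bfs(adj, 's')))  # s r w v t x u y
```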
2.5.3.2 Depth-First Search
A depth first search (DFS) visits all the vertices in a graph. When
choosing which edge to explore next, this algorithm always chooses to go ``deeper''
into the graph. That is, it will pick the next adjacent unvisited vertex until reaching a
vertex that has no unvisited adjacent vertices. The algorithm will then backtrack to the
previous vertex and continue along any as-yet unexplored edges from that vertex.
After DFS has visited all the reachable vertices from a particular source vertex, it
chooses one of the remaining undiscovered vertices and continues the search. This
process creates a set of depth-first trees which together form the depth-first forest. A
depth-first search categorizes the edges in the graph into three categories: tree-edges,
back-edges, and forward or cross-edges (it does not specify which). There are typically
many valid depth-first forests for a given graph, and therefore many different (and
equally valid) ways to categorize the edges.
One interesting property of depth-first search is that the discovery and
finish times for each vertex form a parenthesis structure. If we use an open parenthesis
when a vertex is discovered and a close parenthesis when a vertex is finished, then the
result is a properly nested set of parentheses. Fig. 2.13 shows DFS applied to an
undirected graph, with the edges labeled in the order they were explored. Below we
list the vertices of the graph ordered by discover and finish time, as well as show the
parenthesis structure. DFS is used as the kernel for several other graph algorithms,
including topological sort and two of the connected component algorithms. It can also
be used to detect cycles.
Sukanya Yuenyong Literature Review / 22
Figure 2.13 Depth-first search on an undirected graph.
Order of discovery: a b e d c f g h i
Order of finish: d f c e b a i h g
Parenthesis: (a (b (e (d d) (c (f f) c) e) b) a) (g (h (i i) h) g)
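The parenthesis structure above can be reproduced with a short recursive DFS. The undirected edge list below is an assumption, chosen to be consistent with the discovery and finish orders listed for Fig. 2.13.

```python
# Undirected edges chosen (as an assumption) to be consistent with the
# discovery/finish orders listed above for Fig. 2.13.
edges = [('a', 'b'), ('b', 'e'), ('e', 'd'), ('e', 'c'), ('c', 'f'),
         ('g', 'h'), ('h', 'i')]
adj = {v: [] for v in 'abcdefghi'}
for u, v in edges:
    adj[u].append(v)
    adj[v].append(u)

tokens, seen = [], set()

def dfs(u):
    """Visit u, emitting '(u' on discovery and 'u)' on finish."""
    seen.add(u)
    tokens.append('(' + u)
    for v in adj[u]:
        if v not in seen:
            dfs(v)
    tokens.append(u + ')')

for v in 'abcdefghi':        # restart from undiscovered vertices
    if v not in seen:        # each restart yields one tree of the DFS forest
        dfs(v)

print(' '.join(tokens))
# (a (b (e (d d) (c (f f) c) e) b) a) (g (h (i i) h) g)
```

The two restarts produce the two trees of the depth-first forest, which appear as the two top-level parenthesized groups.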
CHAPTER III
MATERIALS AND METHODS
3.1 Research Methodology and Procedure
In this work, we integrate two techniques, fuzzy logic and graph search, for
sequential pattern mining. First, we define fuzzy membership functions to transform
the quantitative values. Next, we mine the sequential patterns with two graph search
algorithms: depth-first search (DFS) and breadth-first search (BFS).
The sequential pattern mining proceeds step by step as follows:
1. Prepare sample data.
2. Assume the fuzzy membership functions.
3. Transform the quantitative values.
4. Create 2-sequences from 1-sequences table.
5. Design and construct the relational graph table.
6. Search the sequential patterns from the relational graph table.
7. Calculate the sequential patterns confidence values.
8. Evaluate the performance of algorithm.
3.1.1 Prepare sample data.
For the sample data, we generate synthetic data in the transaction database
with the Advanced Data Generator program. We use the notation C for the number of
customers, I for the number of kinds of items, and D for the number of transactions. For
example, C500.I5.D5000 represents a simulation environment with 500 customers,
5 kinds of items, and 5000 transactions. We use various experimental datasets to
evaluate the output.
The structure of the data table (1-sequences) is as follows:
Table 3.1 The data table of 1-sequences
Purchased Item Customer Id Transaction Time Quantities
C 1000 20/3/2550 10
A 1000 14/4/2550 9
E 1000 30/3/2550 7
B 1000 24/5/2550 2
C 1001 15/12/2550 10
A 1001 16/12/2550 9
A 1001 22/10/2550 8
B 1002 16/8/2550 5
C 1003 4/3/2550 1
C 1003 31/10/2550 10
A 1003 16/11/2550 9
D 1004 1/1/2550 2
A 1004 2/4/2550 7
E 1004 27/6/2550 11
D 1004 31/8/2550 1
3.1.2 Assume the fuzzy membership functions.
In this work, the purchased quantities are divided into three fuzzy regions:
Low, Middle and High. Example membership functions are shown as follows:
Figure 3.1 Membership functions Low, Mid and High over the number of items,
with breakpoints at 3, 6 and 12.
µlow(x) = max(min((6-x)/(6-3), 1), 0)
µmid(x) = max(min((x-3)/(6-3), (12-x)/(12-6)), 0)
µhig(x) = max(min((x-6)/(12-6), 1), 0)
The fuzzy membership functions must be defined carefully. The number of
memberships and the item ranges covered by each membership depend on the application.
3.1.3 Transform the quantitative values.
We transform the quantitative value of each transaction datum into fuzzy
sets using the membership functions given above. The result of each membership
function is called the 'degree of truth' [12]. The maximum value therefore represents
the maximum truth.
For example, suppose the quantity of item C is 5:
µLow(5) = max(min((6-5)/(6-3), 1), 0) = max(min(0.33, 1), 0) = 0.33
µMid(5) = max(min((5-3)/(6-3), (12-5)/(12-6)), 0) = max(min(0.67, 1.17), 0) = 0.67
µHig(5) = max(min((5-6)/(12-6), 1), 0) = max(min(-0.17, 1), 0) = 0
The maximum degree of truth = max(µLow(5), µMid(5), µHig(5)) = max(0.33, 0.67, 0) = 0.67
So we transform the quantitative value of item C to 'Mid' because the middle
function gives the maximum degree of truth.
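This transformation step can be sketched directly from the membership functions above. The function and variable names below are illustrative, not part of the original implementation.

```python
def mu_low(x):
    return max(min((6 - x) / (6 - 3), 1), 0)

def mu_mid(x):
    return max(min((x - 3) / (6 - 3), (12 - x) / (12 - 6)), 0)

def mu_hig(x):
    return max(min((x - 6) / (12 - 6), 1), 0)

def fuzzify(x):
    """Return the region label with the maximum degree of truth, plus all degrees."""
    degrees = {'low': mu_low(x), 'mid': mu_mid(x), 'hig': mu_hig(x)}
    return max(degrees, key=degrees.get), degrees

label, degrees = fuzzify(5)
print(label)                     # 'mid'
print(round(degrees['low'], 2))  # 0.33
print(round(degrees['mid'], 2))  # 0.67
print(degrees['hig'])            # 0
```

For a quantity of 5, the middle function wins with a degree of truth of 0.67, so item C is labeled 'Mid', matching the worked example above.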
However, the support count of each item in the table must also be greater
than or equal to the minimum support.
The support count is defined as the fraction of customers who support the
sequence:
Support(C) = Number of supporting customers / Total number of customers
For example, suppose there are 5 customers. If we define the minimum support
at 20%, then each kind of item must be purchased by at least
20% x 5 = 0.2 x 5 = 1 customer.
So we prune away any kind of item that is purchased by fewer than one distinct
customer.
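A minimal sketch of this support computation, using the raw items of Table 3.1 (in the full pipeline the support is computed per fuzzy-labeled item; the plain item letters here are a simplification):

```python
# Support of an item = distinct customers who bought it / total customers.
# Rows are (item, customer_id) pairs taken from Table 3.1.
rows = [('C', 1000), ('A', 1000), ('E', 1000), ('B', 1000),
        ('C', 1001), ('A', 1001), ('A', 1001), ('B', 1002),
        ('C', 1003), ('C', 1003), ('A', 1003), ('D', 1004),
        ('A', 1004), ('E', 1004), ('D', 1004)]

customers = {cid for _, cid in rows}           # 5 distinct customers

def support(item):
    buyers = {cid for it, cid in rows if it == item}
    return len(buyers) / len(customers)

min_support = 0.2                              # i.e. at least 1 of 5 customers
kept = [it for it in 'ABCDE' if support(it) >= min_support]
print(support('C'))   # 0.6 (customers 1000, 1001, 1003)
print(kept)           # ['A', 'B', 'C', 'D', 'E']
```

With a 20% threshold every item survives here, since each is bought by at least one of the five customers.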
The structure of the data table is as follows:
Table 3.2 The data table after transforming the quantitative values
Purchased Item Customer Id Transaction Time Quantities
C-hig 1000 20/3/2550 10
A-hig 1000 14/4/2550 9
E-mid 1000 30/3/2550 7
B-low 1000 24/5/2550 2
C-hig 1001 15/12/2550 10
A-hig 1001 16/12/2550 9
A-hig 1001 22/10/2550 8
B-mid 1002 16/8/2550 5
C-low 1003 4/3/2550 1
C-hig 1003 31/10/2550 10
A-hig 1003 16/11/2550 9
D-low 1004 1/1/2550 2
A-mid 1004 2/4/2550 7
E-hig 1004 27/6/2550 11
D-low 1004 31/8/2550 1
3.1.4 Create 2-sequences from the 1-sequences table
We create 2-sequences from the 1-sequence data table above by pairing two
items, where the first item must be purchased before the other. However, the support
count of each 2-sequence must be greater than or equal to the minimum support, and
the time space between the two items must not exceed the maximum interval (the
time constraint). For example, if we define the max-interval as 30 days, we keep only
the sequential patterns whose time space between consecutive items is at most 30
days.
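The pairing with the time constraint can be sketched as follows, using customer 1000's purchases; the dates keep the Buddhist-era year 2550 exactly as in the tables (function and variable names are illustrative).

```python
from datetime import date
from itertools import combinations

MAX_INTERVAL = 30  # days (the time constraint)

# Purchases of customer 1000 (Table 3.3); Buddhist-era year kept as-is.
purchases = [('C-hig', date(2550, 3, 20)), ('E-mid', date(2550, 3, 30)),
             ('A-hig', date(2550, 4, 14)), ('B-low', date(2550, 5, 24))]

purchases.sort(key=lambda p: p[1])   # the earlier item must come first
pairs = []
for (a, t1), (b, t2) in combinations(purchases, 2):
    gap = (t2 - t1).days             # time space between the two purchases
    if gap <= MAX_INTERVAL:
        pairs.append((a, b, gap))

for a, b, gap in pairs:
    print(f'({a})({b})  time space = {gap}')
```

The pairs that survive the 30-day constraint are (C-hig)(E-mid), (C-hig)(A-hig) and (E-mid)(A-hig), matching the 2-sequences kept for customer 1000 in Table 3.5.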
Table 3.3 The data table after transforming the quantitative values, customer id '1000'
Purchased Item Customer Id Transaction Time Quantities
C-hig 1000 20/3/2550 10
A-hig 1000 14/4/2550 9
E-mid 1000 30/3/2550 7
B-low 1000 24/5/2550 2
Table 3.4 2-sequences of customer id '1000'
Sequence Customer Id Start Time End Time Time Space
(C-hig)(A-hig) 1000 20/3/2550 14/4/2550 25
(C-hig)(E-mid) 1000 20/3/2550 30/3/2550 10
(C-hig)(B-low) 1000 20/3/2550 24/5/2550 65
(A-hig)(B-low) 1000 14/4/2550 24/5/2550 40
(E-mid)(A-hig) 1000 30/3/2550 14/4/2550 15
(E-mid)(B-low) 1000 30/3/2550 24/5/2550 55
The time spaces of (C-hig)(B-low), (A-hig)(B-low) and (E-mid)(B-low) exceed
the max-interval, which means they are uninteresting sequences.
Next, we prune away the 2-sequences whose support count is less than the
minimum support.
Table 3.5 The data table of 2-sequences
Sequence Customer Id Start Time End Time
(C-hig)(A-hig) 1000 20/3/2550 14/4/2550
(C-hig)(E-mid) 1000 20/3/2550 30/3/2550
(E-mid)(A-hig) 1000 30/3/2550 14/4/2550
(C-hig)(A-hig) 1001 15/12/2550 16/12/2550
(C-hig)(A-hig) 1003 31/10/2550 16/11/2550
Note: If the start time equals the end time (purchased on the same day), we
separate the item names with a space, such as (C-hig A-mid).
3.1.5 Design and construct the relational graph table.
We design a relational graph table with the following properties. The
structure of the table is as follows:
Table 3.6 The relational graph table structure
From_vertex To_vertex Edge_type Customer Id Start Time End Time
Note: the data type of 'Edge_type' is Boolean; it is set to true if the start
time is not equal to the end time.
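Building the table rows from the 2-sequences of Table 3.5 can be sketched as follows; the field names mirror the structure above, and the Boolean rule for Edge_type is applied directly.

```python
# Build relational-graph rows from the 2-sequences of Table 3.5.
# Edge_type is True when the start time differs from the end time.
two_sequences = [
    ('C-hig', 'A-hig', 1000, '20/3/2550', '14/4/2550'),
    ('C-hig', 'E-mid', 1000, '20/3/2550', '30/3/2550'),
    ('E-mid', 'A-hig', 1000, '30/3/2550', '14/4/2550'),
    ('C-hig', 'A-hig', 1001, '15/12/2550', '16/12/2550'),
    ('C-hig', 'A-hig', 1003, '31/10/2550', '16/11/2550'),
]

graph_table = [
    {'From_vertex': f, 'To_vertex': t, 'Edge_type': start != end,
     'Customer': cid, 'Start': start, 'End': end}
    for f, t, cid, start, end in two_sequences
]

print(all(row['Edge_type'] for row in graph_table))  # True
```

All five rows get Edge_type true here because none of the 2-sequences was purchased within a single day, matching the 'T' values of Table 3.7.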
3.1.6 Search the sequential patterns from the relational graph table.
We search for the sequential patterns in the relational graph table with two
graph search algorithms: depth-first search (DFS) and breadth-first search (BFS).
DFS follows these rules:
Step 1: Select an unvisited node, visit it, and treat it as the current node.
Step 2: Find an unvisited neighbor of the current node by comparing the
start time of the unvisited neighbor with the end time of the current node,
visit it, and make it the new current node.
Step 3: If the current node has no unvisited neighbors, backtrack to its
parent and make that the new current node, then repeat the above two steps
until no more nodes can be visited.
Step 4: If there are still unvisited nodes, repeat from Step 1.
Algorithm:
Procedure DFS(input: graph G)
begin
  Stack S; Node s, x;
  while (G has an unvisited node) do
    s := an unvisited node;
    visit(s); push(s, S);
    while (S is not empty) do
      x := top(S);
      if (x has an unvisited neighbor y) then
        visit(y); push(y, S);
      else
        pop(S);
      endif
    endwhile
  endwhile
end
BFS follows the following rules:
Step 1: Select an unvisited node s, visit it, and make it the root of the BFS
tree being formed. Its level is called the current level.
Step 2: From each node x in the current level, in the order in which the
level's nodes were visited, visit all the unvisited neighbors of x. The newly
visited nodes form a new level, which becomes the next current level.
Step 3: Repeat the previous step until no more nodes can be visited.
Step 4: If there are still unvisited nodes, repeat from Step 1.
Algorithm:
Procedure BFS(input: graph G)
begin
  Queue Q; Node s, x;
  while (G has an unvisited node) do
    s := an unvisited node;
    visit(s); Enqueue(s, Q);
    while (Q is not empty) do
      x := Dequeue(Q);
      for (each unvisited neighbor y of x) do
        visit(y); Enqueue(y, Q);
      endfor
    endwhile
  endwhile
end
For example:
Table 3.7 The relational graph table
From_vertex To_vertex Edge_type Customer Id Start Time End Time
C-hig A-hig T 1000 20/3/2550 14/4/2550
C-hig E-mid T 1000 20/3/2550 30/3/2550
E-mid A-hig T 1000 30/3/2550 14/4/2550
C-hig A-hig T 1001 15/12/2550 16/12/2550
C-hig A-hig T 1003 31/10/2550 16/11/2550
Figure 3.2 The relational graph: vertices C-hig, E-mid and A-hig, with each edge
labeled by its start and end times (e.g. C-hig > A-hig, 20/3/2550 > 14/4/2550).
Table 3.8 The sequential patterns table
Sequential Patterns Customer Id
C-hig > A-hig 1000
C-hig > A-hig 1001
C-hig > A-hig 1003
C-hig > E-mid > A-hig 1000
Sukanya Yuenyong Materials and methods/32
Next, we prune away the sequential patterns whose support count is less than
the minimum support. The remaining sequential patterns are:
1. C-hig > A-hig
2. C-hig > E-mid > A-hig
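The graph search over the relational graph table can be sketched in Python. This is a simplified DFS, not the thesis implementation: the chaining rule (the next edge's start time must equal the current edge's end time, i.e. both refer to the same purchase) is an assumption based on Step 2 of the DFS rules, and only maximal paths from source vertices are reported.

```python
# Edges of the relational graph (Table 3.7), one tuple per row:
# (from_vertex, to_vertex, customer_id, start_time, end_time).
edges = [
    ('C-hig', 'A-hig', 1000, '20/3/2550', '14/4/2550'),
    ('C-hig', 'E-mid', 1000, '20/3/2550', '30/3/2550'),
    ('E-mid', 'A-hig', 1000, '30/3/2550', '14/4/2550'),
    ('C-hig', 'A-hig', 1001, '15/12/2550', '16/12/2550'),
    ('C-hig', 'A-hig', 1003, '31/10/2550', '16/11/2550'),
]

patterns = []

def extend(cid, cust_edges, path, end_time):
    """DFS: chain edges whose start time equals the current end time (assumption)."""
    nexts = [e for e in cust_edges
             if e[0] == path[-1] and e[3] == end_time]
    if not nexts:                       # maximal path: report it as a pattern
        patterns.append((cid, ' > '.join(path)))
    for _, to, _, _, e_end in nexts:
        extend(cid, cust_edges, path + [to], e_end)

for cid in sorted({e[2] for e in edges}):
    cust_edges = [e for e in edges if e[2] == cid]
    targets = {e[1] for e in cust_edges}
    for frm, to, _, start, end in cust_edges:
        if frm not in targets:          # start only from source vertices
            extend(cid, cust_edges, [frm, to], end)

for cid, p in patterns:
    print(cid, p)
```

For the table above this yields exactly the four rows of Table 3.8: C-hig > A-hig for customers 1000, 1001 and 1003, and C-hig > E-mid > A-hig for customer 1000.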
3.1.7 Calculate the sequential patterns' confidence values
Each pattern has a different confidence value. We calculate the confidence
value [11] from
S-conf(S) = Min{ support(x_m'k' ⊆ s_m') : 1 ≤ m' ≤ m, 1 ≤ k' ≤ length(s_m') }
          / Max{ support(x_m''k'' ⊆ s_m'') : 1 ≤ m'' ≤ m, 1 ≤ k'' ≤ length(s_m'') }
The supports of C-hig, A-hig and E-mid are 0.6, 0.6 and 0.2 respectively. So the
confidence value of C-hig > A-hig = (0.6/0.6) = 1, and the confidence value of
C-hig > E-mid > A-hig = (0.2/0.6) = 0.33.
Table 3.9 The sequential patterns table with confidence values
Sequential Patterns Confidence
C-hig > A-hig 1
C-hig > E-mid > A-hig 0.33
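The confidence computation reduces to a ratio of the smallest to the largest item support in a pattern, and can be sketched as follows (the support dictionary holds the example values from the text):

```python
# 1-item supports from the running example (fraction of the 5 customers).
support = {'C-hig': 0.6, 'A-hig': 0.6, 'E-mid': 0.2}

def s_conf(pattern):
    """Sequential confidence: min item support / max item support [11]."""
    values = [support[item] for item in pattern]
    return min(values) / max(values)

print(s_conf(['C-hig', 'A-hig']))                     # 1.0
print(round(s_conf(['C-hig', 'E-mid', 'A-hig']), 2))  # 0.33
```

This reproduces the two confidence values of Table 3.9.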
3.1.8 Evaluate the performance of the algorithm.
We explore the execution time and the sequential patterns from the
integration of fuzzy logic and graph search under different simulation environments
and minimum supports.
3.2 Materials
• Hardware: The specification of the computer used for mining sequential
patterns is as follows:
• CPU: Intel® Core™ 2 Duo T7300 (2.00 GHz, 800 MHz FSB, 4 MB L2)
• RAM: 1024 MB DDR2-667
• Hard disk: 160 GB, 5400 RPM
• Software:
• Microsoft Access
• Microsoft Visual Web Developer 2005 Express Edition
• Microsoft Word
• Advanced Data Generator
CHAPTER IV
RESULTS
4.1 Simulation Model
We undertake several experiments on a laptop computer: Intel® Core™ 2 Duo
T7300 (2.00 GHz, 800 MHz FSB, 4 MB L2), 1024 MB DDR2-667, 160 GB 5400
RPM HDD. We use the integration of fuzzy logic and graph search to mine
sequential patterns. In the mining process, we use two graph search algorithms,
depth-first search (DFS) and breadth-first search (BFS), and compare their execution
times. We generate synthetic data with the Advanced Data Generator program. In each
experiment, we use the notation C for the number of customers, I for the number of
kinds of items, and D for the total number of transactions. For example, the
experiment label C500.I10.D5000 represents a simulation environment with 500
customers, 10 kinds of items, and 5000 total transactions.
4.2 Experimental Results
Experiment 1:
We explore the execution times of the graph search technique (GST) and the
integration of fuzzy logic and graph search (FGS). In this experiment we use two
graph search techniques, depth-first search (DFS) and breadth-first search (BFS),
under different simulation environments and minimum supports, as shown in Fig. 4.1.
Here we assume the fuzzy membership functions are low <= 3, middle = 10 and
high >= 18, and the max-interval = 30 days. As a result, we found that DFS takes
less execution time than BFS. Besides, the FGS algorithm takes less execution time
than the GST algorithm when the sequence length k is greater than 1 (k >= 2). Even
when we change the total number of customers and the total kinds of items, the FGS
algorithm always takes less execution time than the GST algorithm.
Finally, we also found that high support values need little execution time
because the support value of each k-sequence then usually falls below the minimum
support, so the search is pruned early.
Figure 4.1(a) Execution time versus minimum support (1 to 0.2) for GST, DFS and
BFS in the environment C250.I10.D5000.
Figure 4.1(b) Execution time versus minimum support (1 to 0.2) for GST, DFS and
BFS in the environment C500.I10.D5000.
Figure 4.1(c) Execution time versus minimum support (1 to 0.2) for GST, DFS and
BFS in the environment C1000.I10.D5000.
Figure 4.1(d) Execution time versus minimum support (1 to 0.2) for GST, DFS and
BFS in the environment C500.I15.D5000.
Figure 4.1(e) Execution time versus minimum support (1 to 0.2) for GST, DFS and
BFS in the environment C500.I20.D5000.
Experiment 2:
We explore the outcomes of each algorithm, shown in Table 4.1. Here the
experiment label is C250.I10.D5000 and we assume the fuzzy membership functions
are low <= 3, middle = 10 and high >= 18, the max-interval = 30 days, and the
minimum support = 20%. The FGS algorithm and the GST algorithm take 101.05 and
3,652.34 seconds respectively, and the outcomes of the FGS algorithm carry more
information than those of the GST algorithm: they report the quantity of each item in
the sequential patterns. This gives more valuable outcomes for many applications
such as catalog design, store layout, cross-marketing strategies, cross-inventory
strategies, and the effectiveness of promotional campaigns. Besides, the FGS
algorithm reduces the number of outcome patterns from 7860 to 6.
Table 4.1(a) The outcomes of the FGS algorithm (6 patterns in total)
Sequential Patterns S-Confidence
B-hig -> G-hig 0.9794
I-hig -> A-hig 0.9695
F-hig -> A-hig 0.8781
B-hig -> A-hig 0.8719
F-hig -> A-hig -> A-hig 0.5737
B-hig -> G-hig -> G-hig 0.5697
Table 4.1(b) The outcomes of the GST algorithm (22 of 7860 patterns)
Sequential Patterns S-Confidence
A -> A -> A -> A -> A -> A -> A -> A -> H -> G 0.8964
D -> A -> A -> B -> B -> B -> B -> B -> C -> A -> E 0.8924
D -> A -> A -> C -> C -> C -> C -> A -> A -> C -> I 0.8924
D -> A -> B -> B -> E -> G -> A -> B -> B -> J 0.8924
E -> A -> A -> C -> C -> D -> G -> I -> F -> H -> G 0.8725
E -> A -> B -> C -> C -> C -> A -> A -> E -> H 0.8725
B -> A -> A -> A -> C -> C -> E -> G -> F -> G 0.8645
B -> A -> A -> B -> B -> B -> B -> B -> C -> A -> E 0.8645
B -> A -> C -> D -> E -> C -> A -> B -> H -> I -> I 0.8645
B -> A -> D -> C -> A -> A -> A -> D -> D -> G -> H 0.8645
B -> A -> F -> D -> A -> A -> C -> C 0.8645
B -> B -> A -> A -> A -> A -> A -> A -> A -> A -> H -> G 0.8645
B -> B -> A -> A -> C -> C -> D -> G -> I -> F -> H -> G 0.8645
E -> B -> A -> A -> A -> A -> A -> A -> A -> A -> H -> G 0.8645
C -> I -> C -> E -> E -> C -> F -> D -> G -> I 0.8566
C -> I -> E -> D -> G -> A -> A -> J -> H -> H 0.8566
C -> H -> D -> A -> A -> A -> A -> A -> A -> A -> A -> H -> G 0.8406
C -> H -> D -> A -> A -> B -> B 0.8406
C -> H -> E -> G -> D -> D -> D 0.8406
F -> A -> A -> C -> C -> D -> G -> I -> F -> H -> G 0.8367
F -> A -> B -> D -> E -> C -> B -> E -> C 0.8367
F -> B -> A -> A -> C -> C -> D -> G -> I -> F -> H -> G 0.8367
Experiment 3:
We explore the execution time of the FGS algorithm under different
simulation environments and minimum supports, as shown in Fig. 4.2. Here the
number of transactions varies from 5000 to 20000 and the minimum support from
20% to 100%. When the number of transactions increases, the execution time
increases as well.
Figure 4.2 Execution time (sec) of the FGS algorithm versus minimum support
(1 to 0.2) for the environments C500.I10.D5000, C500.I10.D10000 and
C500.I10.D20000.
Experiment 4:
We explore the outcomes of the FGS algorithm, shown in Fig. 4.3. Here we
assume the fuzzy membership functions are low <= 3, middle = 10 and high >= 18,
and the max-interval = 30 days. We classify the customers by gender (male and
female) and by age into 4 groups: 13-18, 19-25, 26-40 and 41-60. Gender and age do
not affect the results, but we found that when the number of transactions increases,
the execution time increases as well.
Figure 4.3(a) Execution time (sec) versus minimum support (1 to 0.2), classified by
gender (All, Female, Male), environment C500.I10.D5000.
Figure 4.3(b) Execution time (sec) versus minimum support (1 to 0.2), classified by
age (All, 13-18, 19-25, 26-40, 41-60), environment C500.I10.D5000.
CHAPTER V
DISCUSSION
5.1 Comparison between algorithms
In the algorithm comparison, we consider the results of each algorithm in the
same environment. We found that the depth-first search (DFS) algorithm takes less
execution time than the breadth-first search (BFS) algorithm. Changing the total
number of customers, kinds of items, or transactions does not change this outcome.
When we compare the integration of fuzzy logic and graph search (FGS) with
the graph search technique (GST), we found that the FGS algorithm takes less
execution time than the GST algorithm. This is because the FGS algorithm prunes out
more items than the GST algorithm. For example, suppose we define the minimum
support at 20%, item C occurs in 25 transactions, and 25% of customers purchased it.
For GST, all transactions of item C are inserted into the 1-sequence data table. The
FGS algorithm, however, uses fuzzy sets to split item C into 3 sets: C-hig occurring in
12 transactions, C-mid in 6 transactions, and C-low in 7 transactions. Even though
25% of customers purchased item C, that does not mean we insert all of its
transactions into the 1-sequence data table; we must examine the support values of
C-hig, C-mid and C-low separately. If only C-hig has a support value greater than the
minimum support, then only the 12 transactions of C-hig are inserted into the
1-sequence data table. In this case, the FGS algorithm prunes out 13 transactions of C,
so it needs less execution time to search for sequential patterns than the GST
algorithm. The FGS algorithm greatly reduces the number of outcomes; this reduction
may make us miss useful or important patterns.
We found that a higher support value always needs less execution time than a
lower one because it prunes out more transactions.
Also, if we separate the data transactions by other properties such as gender
and age, we find the outcomes are more valuable because we know who will buy the
items.
Finally, we consider the quantity of each item. The outcomes are more
valuable because the GST algorithm tells us only what the customer will buy after
they have already bought some items, whereas the FGS algorithm tells us both what
and how much the customer will buy. This can be applied to many areas such as
catalog design, store layout, cross-marketing strategies, cross-inventory strategies,
and the effectiveness of promotional campaigns.
CHAPTER VI
CONCLUSION AND RECOMMENDATIONS
6.1 Conclusion of this work
We propose the integration of fuzzy logic and graph search (FGS) to find
sequential patterns in a transaction database. Through the experiments, we found the
FGS algorithm superior to the graph search technique (GST) algorithm.
The advantages of the FGS algorithm over the GST algorithm are as follows:
- The FGS algorithm takes less execution time than the GST algorithm when
the sequence length k is greater than 1 (k >= 2).
- The outcomes of the FGS algorithm are more valuable than those of the
GST algorithm because we consider the quantitative value of each item in each
transaction.
Other advantages of the FGS algorithm are:
- The FGS algorithm can generate k-sequences (k >= 3) without building
them from (k-1)-sequences.
- The outcomes of the FGS algorithm are more valuable because we consider
the time space between consecutive items in a sequence.
The disadvantages of the FGS algorithm are:
- The FGS algorithm takes more execution time than the GST algorithm for
1-sequences.
- The FGS algorithm greatly reduces the number of outcome patterns; this
reduction may make us miss useful or important patterns.
6.2 Recommendations
In this work, we integrate fuzzy logic and graph search to mine sequential
patterns, which gives more advantages than using only a graph search algorithm.
In the future we shall apply another branch of fuzzy logic, the fuzzy graph,
to mine sequential patterns. It will be interesting to analyze how the results differ
when fuzzy logic and graph search are merged in this way.
REFERENCES
1. Qiankun Zhao and Sourav S. Bhowmick. Sequential Pattern Mining: A Survey.
Nanyang Technological University, Singapore, 2003.
2. Han, J. and Kamber, M. 2000. Data Mining: Concepts and Techniques.
Morgan Kaufmann.
3. Yang, J., Wang, W., and Yu, P. S. 2001. InfoMiner: mining surprising
periodic patterns. In Proceedings of the Seventh ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining. ACM Press, 395-400.
4. Agrawal, R. and Srikant, R. 1995. Mining sequential patterns. In Eleventh
International Conference on Data Engineering, P. S. Yu and A. S. P. Chen, Eds.
IEEE Computer Society Press, Taipei, Taiwan, 3-14.
5. Tzung-Pei Hong, Kuie-Ying Lin and Shyue-Liang Wang. Mining Fuzzy
Sequential Patterns from Multiple-Item Transactions. I-Shou University,
Taiwan.
6. Yin-Fu Huang and Shao-Yuan Lin. Mining Sequential Patterns Using Graph
Search Techniques. Institute of Electronic and Information Engineering,
National Yunlin University of Science and Technology.
7. http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol2/jp6/article2.html
8. http://www.boost.org/libs/graph/doc/graph_theory_review.html
9. http://www.csc.umist.ac.uk/people/wolkenhauer.htm
10. Olaf Wolkenhauer. Fuzzy Mathematics. Control Systems Centre of UMIST, UK.
11. Unil Yun. Mining Sequential Support Affinity Patterns with Weight Constraints.
Electronics and Telecommunications Research Institute, Telematics & USN
Research Division, Telematics Service Convergence Research Team, Korea.
BIOGRAPHY
NAME Miss Sukanya Yuenyong
DATE OF BIRTH 11 November 1982
PLACE OF BIRTH Bangkok, Thailand
INSTITUTIONS ATTENDED Siam University, 2004:
Bachelor of Science
(Computer Science)
Mahidol University, 2008:
Master of Science
(Technology of Information System
Management)
HOME ADDRESS 139/1 M.6, Bhudsakhon Rd.,
Suanluang, Kratumban,
Samutsakhon, Thailand
Tel. 08-9125-5005
E-mail : [email protected]