Web Usage Mining Structuring semantically enriched clickstream data


by

Peter I. Hofgesang Stud.nr. 1421247

A thesis submitted to the

Department of Computer Science

in partial fulfilment of the requirements for the degree of Master of Computer Science at the

Vrije Universiteit

Amsterdam, The Netherlands

August 2004

Supervisor: Dr. Wojtek Kowalczyk, Faculty of Sciences, Vrije Universiteit Amsterdam, Department of Computer Science

Second reader: Dr. Elena Marchiori, Faculty of Sciences, Vrije Universiteit Amsterdam, Department of Computer Science


Abstract

Web servers worldwide generate a vast amount of information on web users’ browsing activities. Several researchers have studied these so-called clickstream or web access log data to better understand and characterize web users. Clickstream data can be enriched with information about the content of visited pages and the origin (e.g., geographic, organizational) of the requests. The goal of this project is to analyse user behaviour by mining enriched web access log data. We discuss techniques and processes required for preparing, structuring and enriching web access logs. Furthermore, we present several web usage mining methods for extracting useful features. Finally, we employ all these techniques to cluster the users of the domain www.cs.vu.nl and to study their behaviours comprehensively. The contributions of this thesis are a content- and origin-based data enrichment and a tree-like visualization of frequent navigational sequences. This visualization gives an easily interpretable view of patterns in which relevant information is highlighted. The results of this project can be applied to diverse purposes, including marketing, web content advising, (re-)structuring of web sites and several other e-business processes, such as recommendation and advertising systems.


Content

1 Introduction
2 Related research
3 Data preparation
    3.1 Data description
    3.2 Cleaning access log data
    3.3 Data integration
    3.4 Storing the log entries
    3.5 An overall picture
4 Data structuring
    4.1 User identification
    4.2 User groups
    4.3 Session identification
    4.4 An overall picture
5 Profile mining models
    5.1 Mining frequent itemsets
    5.2 The mixture model
    5.3 The global tree model
6 Analysing log files of the www.cs.vu.nl web server
    6.1 Input data
    6.2 Distribution of content-types within the VU-pages and access log entries
    6.3 Experiments on data structuring
    6.4 Mining frequent itemsets
    6.5 The mixture model
    6.6 The global tree model
7 Conclusion and future work
Acknowledgements
Bibliography
APPENDIX
    APPENDIX A. The uniform resource locator (URL)
    APPENDIX B. Input file structures
    APPENDIX C. Experimental details
    APPENDIX D. Implementation details
    APPENDIX E. Content of the CD-ROM


Structure

This Master Thesis is organized as follows:

Chapter 1, “Introduction”
This chapter provides a high-level overview of the related research and the main goals of this project.

Chapter 2, “Related research”
Chapter 2 gives a comprehensive overview of the related research known so far.

Chapter 3, “Data preparation”
This chapter follows all the steps of the data preparation process. It starts by describing the main characteristics of the input data, followed by a description of the data cleaning process. The section on data integration explains how the different data sources are merged for data enrichment, while the next section concerns data loading. Finally an overall scheme and an experiments section are laid out.

Chapter 4, “Data structuring”
In chapter 4 we explain how the semantically enriched data are combined to form user sessions. The chapter also discusses the process of user identification and describes groups of users, both of which are prerequisites for the identification of sessions. It ends with an overall scheme of data structuring followed by a section of experiments.

Chapter 5, “Profile mining models”
This chapter provides an overview of the theoretical background of the applied data mining models. First it explains the widely used algorithm for mining frequent itemsets. The following section describes the recently researched mixture model architecture. Finally a tree model is proposed for exploiting the hierarchical structure of session data.

Chapter 6, “Analysing log files of the www.cs.vu.nl web server”
Chapter 6 discusses experimental results of the mining models applied to the semantically enriched data. All the input data are related to a specific web domain: www.cs.vu.nl.

Chapter 7, “Conclusion and future work”
Finally, in chapter 7 we present the conclusions of our research and explore avenues of future work.


1 Introduction

The rapid growth of the information reachable via the Internet makes it increasingly difficult to manage. Numerous companies face the problem of publishing their product range or information online in an efficient, easily manageable way. Exploring the habits and behaviour of web users plays a key role in dissecting and understanding this problem.

Web mining is the application of data mining techniques to web data sets. The three major web mining methods are web content mining, web structure mining and web usage mining. Content mining applies methods to web documents. Structure mining reveals hidden relations in web site and web document structures. In this thesis we employ web usage mining, which provides methods to discover useful usage patterns from web data.

Web servers are responsible for providing the available web content on user requests. They collect all the information on request activities into so-called log files. Log data are a rich source for web usage mining.

Much scientific research addresses web usage mining, and especially the exploration of user behaviour. Besides, there is a great demand in the business sector for personalized, custom-designed systems that conform closely to the requirements of users.

There is also a substantial amount of prior scientific work on modelling web user characteristics. Some of it presents a complete framework for the whole web usage mining task (e.g., Mobasher et al. (1996) [18] proposed WEBMINER). Many works present models based on page access frequency and modified association rule mining algorithms, such as [1, 31, 23]. Xing and Shen (2003) [30] proposed two algorithms (UAM and PNT) for predicting user navigational preferences, both based on page visit frequency and page viewing time. UAM is a URL-URL matrix providing page-to-page transition probabilities computed over all users’ statistics. PNT is a tree-based algorithm for mining preferred navigation paths. Nanopoulos and Manolopoulos (2001) [21] present a graph-based model for finding traversal patterns in web page access sequences. They introduce one level-wise and two non-level-wise algorithms for large paths that exploit the graph structure.

While most of the models work on the global “session level”, an increasing number of studies show that the exploration of user groups or clusters is essential for better characterisation. Hay et al. (2003) [14] suggest the Sequence Alignment Method (SAM) for measuring the distance between sessions while incorporating structural information. The proposed distance reflects the number of operations required to transform sessions into one another. Clusters based on the SAM distance form the basis of further examinations. Chevalier et al. (2003) [8] suggest rich navigation patterns consisting of frequent page set groups and web user groups based on demographical patterns. They show the correlation between the two types of data.

Other research points far beyond frequency-based models: Cadez et al. (2003) [4] propose a finite mixture of Markov models on sequences of URL categories traversed by users. This complex probability-based structure models the data generation process itself.

In this thesis we discuss techniques and processes required for further analysis. Furthermore, we present several web usage mining methods for extracting useful features. An overall process workflow can be seen in figure 1.


[Figure 1: The overall process workflow. Labels recoverable from the figure: INPUT DATA (web server’s access log data; content type mapping table with URL / content type pairs; geographical and organizational information), DATA PREPARATION (data filtering, data integration, database), USER SELECTION, SESSION IDENTIFICATION (identified sessions written in AR, MM and GTM formats), PROFILE MINING (association rules, mixture model, tree model).]

This thesis considers three separate data sets as input data. Access log data are generated by the web server of the specified domain and contain user access entries. The content-type mapping table contains relations between documents and their categories in the form of URL / content type pairs. Mapping tables can be generated either by classifier algorithms or by content providers. In the latter case, the contents of pages are given explicitly in the form of content categories (e.g., news, sport, weather, etc.). Geographical and organizational information makes it possible to distinguish different categories of users.

All data mining tasks start with data preparation, which prepares the input data for further examination. It consists of four main steps, as can be seen in figure 1. Data filtering strips out irrelevant entries, data integration enriches log data with content labels, and the enriched data are stored in a database. The user selection process selects the entries of users belonging to a specified group for session identification.

The next step in the process is session identification. Related log entries are identified as unique user navigational sequences. Finally, these sequences are written to output files in different formats depending on the application.

The profile mining step applies several web usage mining methods to discover relevant patterns. It uses an association rules mining algorithm [1] for mining frequent page sets and for generating interesting rules. It also applies the mixture model proposed by Cadez et al. (2001) [5] to build a predictive model of the navigational behaviour of users. Finally, it presents a tree model for representing and visualizing visiting patterns in a natural way.

In the experimental part of this thesis we employ all these techniques to address the problem of defining clusters of the users of the www.cs.vu.nl web domain, and we study their behaviours comprehensively.

The contributions of this thesis are content-based data enrichment and the visualization of frequent navigational sequences. Data enrichment augments users’ transactional data with the content types of visited pages and documents, and distinguishes among users based on geographical and organizational information. The visualization presents a tree-like view of patterns that highlights relevant information and can be interpreted easily.


2 Related research

There are numerous commercial software packages for obtaining statistical patterns from web logs, such as [11, 22, 37]. They focus mostly on highlighting log data statistics and frequent navigation patterns, but in most cases do not explore relationships among relevant features.

Some research aims at proposing data structures to facilitate web log mining processes. Punin et al. (2001) [24] defined the XML languages XGMML and LOGML. XGMML is for graph description, while LOGML is for web log description. Other papers focus only (or mostly) on data preparation [6, 13, 15]. Furthermore, complete frameworks have been presented for the whole web usage mining task (e.g., Mobasher et al. (1996) [18] proposed WEBMINER).

Many works, such as [1, 23, 31], present models based on page access frequency and modified apriori [1] (frequent itemset mining) algorithms. Some papers (e.g., [32] [10] [9]) present online recommender systems to assist the users’ browsing or purchasing activity. Yao et al. (2000) [32] use standard data mining and machine learning techniques (e.g., frequent itemset mining, the C4.5 classifier, etc.) combined with agent technologies to provide an agent-based recommendation system for web pages. Cho et al. (2002) [10] suggest a product recommendation method based on data mining techniques and product taxonomy. This method employs decision tree induction for selecting the users likely to buy the recommended products.

Hay et al. (2003) [14] apply the sequence alignment method (SAM) for clustering user navigational paths. SAM is a distance-based measuring technique that considers the order of sequences. The SAM distance of two sequences reflects the number of transformations (i.e., delete, insert, reorder) required to equalize them. Clustering requires a distance matrix, which holds the SAM distance scores for all session pairs. The analysis of the resulting clusters showed that the SAM-based method outperforms the conventional measure based on association distance.

Runkler and Bezdek (2003) [27] use the relational alternating cluster estimation (RACE) algorithm for clustering web page sequences. RACE finds the centers for a specified number of clusters based on a page sequence distance matrix. The algorithm alternately computes the distance matrix and one of the cluster centers in each iteration. They propose the Levenshtein (a.k.a. edit) distance for measuring the distance between members (i.e., the textual representations of the sequences of visited page numbers within sessions). The Levenshtein distance counts the number of delete, insert or change steps necessary to transform one word into the other.

Pei et al. (2000) [23] propose a data structure called the web access pattern tree (WAP-tree) for efficient mining of access patterns from web logs. WAP-trees store all the frequent candidate sequences that have a support higher than a preset threshold. The only information stored in a WAP-tree is the label and frequency count of each node. To mine useful patterns in WAP-trees they present the WAP-mine algorithm, which applies conditional search for finding frequent events. Together, the WAP-tree structure and the WAP-mine algorithm offer an alternative to apriori-like algorithms.

Smith and Ng (2003) [28] present a self-organizing map framework (LOGSOM) to mine web log data and present a visualization tool for user assistance.

Jenamani et al. (2003) [16] use a semi-Markov process model for understanding e-customer behaviour. The keys of the method are a transition probability matrix (P) and a mean holding time matrix (M). P is a stochastic matrix whose elements store the probabilities of transitions between states. M stores the average length of time that a process remains in state i before moving to state j. In this way the probabilistic model is able to capture the time elapsed between transitions.

Some papers present methods based on content assumptions. Baglioni et al. (2003) [2] use URL syntax to determine page categories and to explore the relation between users’ sex and navigational behaviour. Cadez et al. (2003) [4] experiment on categorized data from Msnbc.com.

Visualization of frequent navigational patterns makes them easier for humans to perceive. Cadez et al. (2003) [4] present the WebCanvas tool for visualizing Markov chain clusters. This tool represents all user navigational paths for each cluster, colour coded by page categories. Youssefi et al. (2003) [33] present a 3D visualization of superimposed web log patterns and extracted web structure graphs.


3 Data preparation

Preparing the input data is the first step of all data mining and web usage mining tasks. The data in this case are, as mentioned above, the access log files of the web server of the examined domain and the content types mapping table of the HTML pages within this domain. Data preparation consists of three main steps: data cleaning/filtering, data integration and data storing. Data cleaning is the task of removing all irrelevant entries from the access log data set. Data integration establishes the relation between log entries and content mappings. The last step is to store the enriched data in a convenient database. A comprehensive study of all these preprocessing tasks has been made by Cooley et al. (1999) [13].

This chapter starts with a description of the input data and how they are generated, followed by the details of access log cleaning and of integrating the log entries with the mapping data. Finally it presents the database scheme for data storing and an overall picture and description of the data preparation process.

3.1 Data description

This section describes the details of the access log and content type mapping data.

3.1.1 Access log files

Visitors to a web site click on links and their browser in turn requests pages from the web server. Each request is recorded by the server in so-called access log files¹. Access logs contain the requests for a given period of time. The time interval used is normally a setting of the web server. There is a log file for each period, and old ones are archived or erased depending on their usage and importance.

Most log files of web servers are stored in the common log file format (CLFF) [34] or in the extended log file format (ELFF) [35]. An extended log file contains a sequence of lines of ASCII characters, each terminated by either LF or CRLF. Each entry consists of a sequence of fields relating to a single HTTP transaction. Fields are separated by white space. If a field is unused in a particular entry, a dash ("-") marks the omitted field. Web servers can be configured to write different fields into the log file in different formats. The most common fields used by web servers are the following: remotehost, rfc931, authuser, date, request, status, bytes, referer, user_agent.

¹ There are other types of log files generated by the web server as well, but this project does not consider them.


The meanings of all these fields are explained in the table below with given examples:

Table 1: The most commonly used fields of access log file entries by web servers

Field name and description of the field (with example)

remotehost
    Remote hostname (or IP number if DNS hostname is not available).
    Example: 82.168.4.229

rfc931
    The remote login name of the user.
    Example: -

authuser
    The username with which the user has authenticated himself.
    Example: -

[date]
    Date and time of the request with the web server’s time zone.
    Example: [20/Jan/2004:23:17:37 +0100]

"request"
    The request line exactly as it came from the client. It consists of three subfields: the request method, the resource to be transferred, and the used protocol.
    Example: "GET / HTTP/1.1"

status
    The HTTP status code returned to the client.
    Example: 200

bytes
    The content-length of the document transferred.
    Example: 12079

"referer"
    The URL the client was on before requesting the URL.
    Example: "-"

"user_agent"
    The software the client claims to be using.
    Example: "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)"
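As an illustration of the field layout described above, the following is a minimal Python sketch that splits one log entry into the nine fields. The regular expression and the name `parse_entry` are assumptions made for this example, not the implementation used in the thesis, and a server configured with a different field layout would need a different pattern.

```python
import re

# Assumed layout: the nine fields listed above, in that order.
LOG_PATTERN = re.compile(
    r'(?P<remotehost>\S+) (?P<rfc931>\S+) (?P<authuser>\S+) '
    r'\[(?P<date>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_entry(line):
    """Return a dict of field name -> string, or None if the line does not fit."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    return match.groupdict()
```

Entries that do not match the assumed layout yield `None`, so a caller can simply skip them.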

3.1.2 Content types mapping table

A content types mapping table is a table containing URL/content type pair entries. URLs are file locator paths referring to documents, and content types are labels giving the types of the documents (for more details about URLs refer to APPENDIX A). Content types can be generated either by an algorithm or by content providers, in which case the contents of pages are given explicitly (e.g., sport pages refer to “sport” content, etc.). Generator algorithms can be further distinguished by whether they produce the content types automatically or are driven by human interaction.


We use an external algorithm [3], which attaches labels to all HTML documents in a collection of HTML pages based on their contents. The algorithm is based on the naive Bayes classifier supplemented by a smart example selector algorithm. It uses only the textual content of the HTML pages, stripping out the control tags. Parts of the text enclosed within special tags (e.g., title or header tags) are given a bias. The algorithm chooses the first 100 pages randomly to be categorized by humans. This initialization step is followed by an “active learning” method, which chooses further examples by considering the ones already selected.

This thesis also deals with documents other than HTML (e.g., pdf, ps, doc, rtf, etc.). However, it would be difficult to attach labels to each of them based on their content, because the structure of these files is specific and most of the time very complex, and their size is usually very large. For these reasons a very simple technique is used to identify such documents. The label “documents” is attached to all pdf and ps files, which refer to scientific papers, e-books, documentation, etc., while the label “other documents” is attached to all other document types (e.g., doc, rtf, ppt, etc.). Other documents are, e.g., administrative papers, forms, etc. Accordingly, the mapping table is extended with entries for these two labels. The following table presents an example of a content types mapping table:

Table 2: An example of a content-type mapping table

URL                             Content type identifier
bi/courses-en.html              4
ci/DataMine/DIANA/index.html    6
…                               …

3.2 Cleaning access log data

As described above, raw access log files contain a vast number of diverse request entries. Each log entry can be informative for some application, but this project excludes most of them. Processing certain types of requests would lead to wrong conclusions (e.g., requests generated by spider engines). Besides, stripping the data has a positive effect on processing time and on the required storage space. Since this project focuses only on the documents themselves (such as html, pdf, ps and doc files), all request entries for other file types should be stripped out. Furthermore, as the main goal is the characterization of users, robot transactions, i.e. web traffic generated automatically by robot programs, must also be filtered out. There are several other criteria for filtering. Detailed descriptions of the filtering criteria and methods follow below.

3.2.1 Filtering unsupported extensions

A typical web page is made up of many individual files. Besides the HTML page itself it consists of graphical elements, style sheets, mappings, etc., all in separate files. Each user request for an HTML file triggers hidden requests for all the files required to display that specific page. As a result, access log files contain the traces of all these hidden requests as well. Extension filtering strips out all request entries for file types other than the predefined ones (for the structure of the extension list file refer to APPENDIX B4 Extension filter list file). The extension of a requested file can be extracted from the “request” field of the log entry.

An example of such a request field:
"GET /ai/kr/imgs/ibrow.jpg HTTP/1.0"
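A minimal Python sketch of extension filtering follows. The extension set and the function names are illustrative assumptions; in the thesis the list of supported extensions comes from a separate file (APPENDIX B4), whose contents are not reproduced here. Directory requests such as "GET / HTTP/1.1" carry no extension and would need path completion before this check.

```python
import posixpath

# Hypothetical list of supported document extensions (assumption for this sketch).
ALLOWED_EXTENSIONS = {".html", ".htm", ".pdf", ".ps", ".doc", ".rtf", ".ppt"}


def requested_path(request_field):
    """Return the resource subfield of 'METHOD resource PROTOCOL', or None."""
    parts = request_field.split()
    return parts[1] if len(parts) >= 2 else None


def has_supported_extension(request_field, allowed=ALLOWED_EXTENSIONS):
    """True if the requested file has one of the allowed extensions."""
    path = requested_path(request_field)
    if path is None:
        return False
    extension = posixpath.splitext(path)[1].lower()
    return extension in allowed
```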

3.2.2 Filtering spider transactions

A significant portion of log file entries is generated by robot programs. These robots, also known as spider or crawler engines, automatically search through a specific range of the web. They index web content for search engines, prepare content for offline browsing, or serve several other purposes. What all crawlers have in common is that, although they are mostly supervised by humans, they generate systematic, algorithmic requests. Without eliminating spider entries from the log files, the characteristics of real users would be distorted by the features of machines.

Spiders can be identified by searching for specific spider patterns in the “user_agent” field of log entries. Most well-behaved spiders put their name, or some other pattern that identifies them, into this field. Once such a pattern has been found, the filter method ignores the examined log entry. Spider patterns can be found by searching the web. There are several pages dealing with spider activities and patterns, and there are many professional forums on the subject (mostly discussing how to avoid spiders) [29]. Spider patterns are collected in a separate spider list file (refer to APPENDIX B5).

An example of a user_agent field:
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)"
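The pattern search described above can be sketched in Python as a case-insensitive substring test. The patterns below are placeholders chosen for illustration only; the actual spider list file (APPENDIX B5) is not reproduced here, and `is_spider` is a hypothetical name.

```python
# Illustrative patterns only (assumption); a real list would be much longer.
SPIDER_PATTERNS = ["googlebot", "slurp", "crawler", "spider"]


def is_spider(user_agent, patterns=SPIDER_PATTERNS):
    """True if any known spider pattern occurs in the user_agent field."""
    agent = user_agent.lower()
    return any(pattern in agent for pattern in patterns)
```

A substring test of this kind only catches robots that identify themselves; robots that send a browser-like user_agent pass through unnoticed.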

3.2.3 Filtering dynamic pages

Web pages generated dynamically on user requests are called “dynamic pages”. These pages cannot be located on the web server as individual files, since they are built by a specific engine using several data sources. For this reason dynamic pages cannot be analyzed in a simple way. With some additional techniques, however, it is still possible to obtain useful information. Jacobs et al. (2001) [15] use an inductive logic programming (ILP) framework to reveal usage patterns based on the dynamic page link parameters that are passed to the server. Since it is not an objective of this thesis to apply sophisticated methods for information recovery on dynamic pages, the filtering process simply eliminates all such references.


There is no standard for the structure of URL requests for dynamic pages, except that the parameters, which consist of name/value pairs, appear after the “?” (question mark) in the URL. Therefore, dynamic pages can basically be filtered out by searching for the question mark in the “request” fields of log entries. Note that requests for a dynamic page without any parameters, and thus without the delimiting question mark, are stripped out during extension filtering (e.g., *.jsp, *.php, *.asp pages).

An example of such a dynamic page’s request field:
"GET /obp/overview.php?lang=en HTTP/1.0"
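The question mark test can be sketched in a few lines of Python. The function name is hypothetical, and the sketch assumes the request field has the usual "METHOD resource PROTOCOL" form.

```python
def is_dynamic_request(request_field):
    """True if the requested resource contains a '?' parameter delimiter."""
    parts = request_field.split()
    if len(parts) < 2:
        return False  # malformed request field: no resource subfield
    return "?" in parts[1]
```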

3.2.4 Filtering HTTP request methods

HTTP/1.0 [25, 26] allows several methods to be used to indicate the purpose of a request. The most often used methods are GET, HEAD and POST. Since the GET method is the only way of requesting a document that could be useful for this project, the request method filter ignores all other requests. The filter examines the “request” field of the log entry for the “GET” method identifier.

An example of a request field that is filtered out:
"POST /modules/coppermine/themes/default/theme.php HTTP/1.0"
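A minimal Python sketch of the request method filter, assuming the method is the first white-space separated subfield of the request field; `is_get_request` is a name chosen for this example.

```python
def is_get_request(request_field):
    """True only if the request method subfield is exactly 'GET'."""
    parts = request_field.split()
    return bool(parts) and parts[0] == "GET"
```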

3.2.5 Filtering and replacing escape characters

URL escape characters are special character sequences made up of a leading “%” character and two hexadecimal characters. They substitute special characters in URL requests that could be problematic while transferring requests to different types of servers. In this way special characters are replaced by sequences of standard characters. In most cases the task is only to replace these escape sequences with the characters they represent, but in certain instances URLs contain corrupted sequences that cannot be interpreted. In these cases the entries should be ignored. Corrupt sequences can be caused by typing errors of users, automatically generated robot requests, etc.
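The decoding step above can be sketched in Python with the standard library function `urllib.parse.unquote`. Because `unquote` leaves malformed sequences untouched rather than reporting them, the sketch first checks for a “%” that is not followed by two hexadecimal characters. The name `decode_url` and the choice to return `None` for corrupt URLs are assumptions of this example.

```python
import re
from urllib.parse import unquote

# A '%' that is not followed by exactly two hexadecimal characters.
_BAD_ESCAPE = re.compile(r"%(?![0-9A-Fa-f]{2})")


def decode_url(url):
    """Replace %XX escape sequences; return None if a sequence is corrupt."""
    if _BAD_ESCAPE.search(url):
        return None  # corrupted sequence: the log entry should be ignored
    return unquote(url)
```

Note that `unquote` interprets the escaped bytes as UTF-8 by default; logs written in another encoding would need the `encoding` argument set accordingly.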

3.2.6 Filtering unsuccessful requests

If a user requests a page that does not exist, the server answers with an error status and the browser displays the well-known “404 error, page not found” message. In this case the user has to use the “back” button to navigate back to the previous page, or type a different URL manually. Either way the user does not navigate through the requested page, since the error page does not provide any link to follow. For this reason the log entries of erroneous requests should also be ignored. These entries can be filtered by examining the “status” field. The status of failed requests is mostly “404”. In special cases the status field can take other values as well, such as “503”.


An example of such a log entry:
200.177.162.127 - - [16/May/2004:08:07:42 +0200] "POST /modules/coppermine/include/init.inc.php HTTP/1.0" 404 302 "-" "Mozilla 4.0 (Linux)"
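A minimal Python sketch of the status check. It assumes that only status codes of the 2xx class count as successful; the text above names 404 and 503 as failures but does not say how other codes (for instance redirects, or 304 for cached pages) are treated, so that choice is an assumption of this example.

```python
def is_successful(status):
    """True for three-digit status codes in the 2xx (success) class."""
    return status.isdigit() and len(status) == 3 and status.startswith("2")
```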

3.2.7 Filtering request URLs for a domain name

The URL of a page request consists of a domain name and the path of the requested document relative to the public directory of the domain. Since the domain name is unambiguous to the responsible web server, the server stores only the relative path of the request in the access log files, without the domain name. In a few cases, however, log file entries contain the whole absolute path. This leads to mapping errors during data integration, since the mapping table contains only relative paths and the comparison is based on the similarity of paths. For these reasons a URL in the “request” field has to be transformed to the relative format.

An example of such a request field:
"GET /www.cs.vu.nl/fb/generated/wrk_units/120.html HTTP/1.1"
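The transformation to the relative format can be sketched in Python as a prefix removal. The three prefix forms handled below are assumptions; the source only shows the "/www.cs.vu.nl/..." form, and `to_relative_path` is a hypothetical name.

```python
DOMAIN = "www.cs.vu.nl"


def to_relative_path(url, domain=DOMAIN):
    """Strip a leading domain name so that only the relative path remains."""
    # Assumed prefix variants; only the second appears in the source example.
    for prefix in ("http://" + domain, "/" + domain, domain):
        if url == prefix:
            return "/"
        if url.startswith(prefix + "/"):
            return url[len(prefix):]
    return url  # already relative
```

Requiring a "/" directly after the prefix avoids cutting into a path whose first component merely starts with the domain name.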

3.2.8 Path completion

When a user requests a public directory instead of a specific file, the web server tries to find the default page in that directory. The default page is “index.html” in most cases, but it varies between web servers. Thus, if a log entry contains a directory request, the task is to complete the URL with the name of the default page. It is possible that the server does not contain the default page in the requested directory. In that case the log entry will be filtered out when it is looked up in the content type mapping table (refer to section 3.1.2 Content types mapping table).

An example of such a request field:

original request field: "GET /pub/minix/ HTTP/1.1"
completed request field: "GET /pub/minix/index.html HTTP/1.1"

3.2.9 Filtering anchors
Anchors are special qualifiers for HTML link references. They act as reference points within a single web page. If a named anchor is placed somewhere in the body of an HTML page, a link that consists of the address of the page followed by a hash mark and the name of the anchor (i.e., link + “#” + anchor name) scrolls directly to the place where the anchor is put. Anchors should be stripped from URLs, otherwise the HTML document cannot be found in the mapping table. An example of such a request field:
"GET /vakgroepen/ai/education/courses/micd/opgave_1.html#1c HTTP/1.1"
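The three URL transformations of sections 3.2.7 to 3.2.9 can be sketched together as follows. The function names and the constants `DOMAIN` and `DEFAULT_PAGE` are hypothetical and chosen for illustration only; “index.html” as default page is the common case mentioned above, not a property of every server.

```python
DOMAIN = "www.cs.vu.nl"      # domain served by the web server
DEFAULT_PAGE = "index.html"  # assumed default page; varies between servers

def normalise_path(path: str) -> str:
    """Bring the path of a request into the relative format of the mapping table."""
    # 3.2.9: strip a named anchor ("#...") from the end of the path
    path = path.split("#", 1)[0]
    # 3.2.7: drop a domain name that was logged as the first path element
    prefix = "/" + DOMAIN
    if path == prefix or path.startswith(prefix + "/"):
        path = path[len(prefix):] or "/"
    # 3.2.8: complete a directory request with the default page
    if path.endswith("/"):
        path += DEFAULT_PAGE
    return path

def normalise_request(request: str) -> str:
    """Apply normalise_path to a request field such as 'GET /pub/minix/ HTTP/1.1'."""
    method, path, protocol = request.split(" ", 2)
    return f"{method} {normalise_path(path)} {protocol}"

print(normalise_request("GET /pub/minix/ HTTP/1.1"))
```

A request field that does not consist of three space-separated parts raises a `ValueError` in this sketch; a real filter would have to decide whether to drop or repair such entries.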


We don’t filter frame pages. Frames are supported by the HTML specification and make it possible to split an HTML document into several “sub documents” (e.g., a frame for the navigation menu, a frame for the content, etc.). Each frame refers to a specific HTML document, resulting in a separate page request. The main frame page contains mostly special tags for controlling all the subframes. This page is either labelled miscellaneous or labelled the same as its subframes by the text mining algorithm [3]. Either way there is no need to pay special attention to such pages while preparing the data.

3.3 Data integration
A novel approach in this project is to use the content types of the visited pages rather than URL references. Content types, as described earlier, are given in a special mapping table where each entry consists of a URL/content type pair (refer to section 3.1.2 Content types mapping table). Data integration in this context means that a content type label should be attached to every single stored log entry. The simplest and most convenient method is to attach content labels to transactions during data cleaning². This saves time, since both processes share the same pass over the data. After a log entry has been cleaned and filtered, the data integration step looks up the entry’s request URL in the mapping table. If the URL is present, the corresponding type label is attached to the entry. Otherwise the extension of the URL is checked for a valid document type other than HTML (refer to section 3.2.1 Filtering unsupported extensions), and looked up in the table again. If the extension indicates an HTML page, the entry should be deleted³.
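The lookup order described above can be illustrated with a short Python sketch. It rests on an assumption that the text does not spell out: that the second lookup uses the document type (the file extension) as key, so that the mapping table also holds one entry per supported non-HTML document type. The function name `content_label` and the example labels are hypothetical.

```python
import os

def content_label(path, mapping_table, document_types):
    """Return the content-type label for a request path, or None to drop the entry.

    mapping_table:  dict from relative URL (or, by assumption, from a
                    document-type extension) to a content-type label
    document_types: set of supported non-HTML extensions, e.g. {"pdf", "ps"}
    """
    label = mapping_table.get(path)
    if label is not None:
        return label                      # URL is present in the table
    extension = os.path.splitext(path)[1].lstrip(".").lower()
    if extension in document_types:
        # Assumption: non-HTML documents are labelled by their document type.
        return mapping_table.get(extension)
    return None                           # unmapped HTML page: delete the entry

# Hypothetical table: label 1 for a page, label 7 for PDF documents
table = {"/index.html": 1, "pdf": 7}
print(content_label("/docs/report.pdf", table, {"pdf"}))
```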

3.4 Storing the log entries
The final step of the data preparation is to store the data in a convenient database. MySQL was chosen as the database server in spite of the fact that the current version does not support stored procedures. In most cases it would be easier and faster to manipulate the data with procedures inside the database, but this limitation caused no insurmountable difficulties during the project. The advantages of MySQL are that it is fast, easy to maintain, free to use for research purposes and widely accepted. The database scheme for storing cleaned log entries can be seen in table 3.

² Depending on the application. For continuous streaming data, a better solution would be to attach labels to entries online, probably using the content identification model in addition to a preset mapping table in order to identify unknown contents.
³ This step could be improved by using the original classifier model in case of a missing URL.

Database scheme of the cslog table

column name     type name
id              bigint
remotehost      varchar
rfc931          varchar
authuser        varchar
transdate       datetime
request         text
content_type    tinyint
status          smallint
bytes           int
referer         text
user_agent      text

Table 3

The column names correspond to the log field names mentioned in section 3.1.1 Access log files, except for the content_type field, which holds the attached content type described in the previous section, and id, which is the unique identifier of the entries.

3.5 An overall picture
The following figure gives an overall picture of our data preparation scheme.

[Figure: loading/filtering/mapping access log data. Elements shown: RAW LOG, the objects LogParser, Transaction, TransactionFilter, MappingTable and Log2Database, the resulting Transaction (filtered, mapped), the DATABASE, and the files cslog.txt, datahandling.prop, spider.flt, extension.flt and mapping_table.mtd.]

Figure 2: An overall picture of the data preparation

The first step in the data preparation process is performed by the LogParser object, which loads the raw log files into memory line by line. This object transforms all entries into suitable Transaction objects, which contain all the fields of the log file. Once a Transaction has been parsed, it goes through the TransactionFilter, which filters out useless entries (by simply ignoring them). After this step a content-type label is attached to all remaining transactions by the MappingTable object. Finally Log2Database loads the filtered transactions into the specified database.


4 Data structuring
Sessions, a.k.a. transactions⁴, constitute the basis of most web log mining processes. They are related to users and composed of the pages visited during a separate browsing activity. This chapter starts with the description of user identification, which is essential for session identification. This is followed by details on the grouping of users, which is also a relevant topic, as the characterization of users is the main goal of this project. The next section deals with session identification methods and types, and discusses how the selection is restricted to groups of users. The final section presents a comprehensive overview of the data structuring process.

4.1 User identification
Identification of users is essential for efficient data mining. It makes it possible to distinguish user-specific data within the whole data set. It is straightforward to identify users in Intranet applications, since they are required to identify themselves through a login process. It is much more complicated in the case of public domains. The reason is that Internet protocols (e.g., HTTP, TCP/IP) do not require user authentication from client applications (e.g., web browsers). The only identifying information exchanged is the IP address of the client machine. Identification based on this information is unreliable, because multiple users may use the same machine (and thus the same IP address) to connect to the Internet, while a single user may use several machines to access the same service. Besides, proxy servers and firewalls hide the true IP address of the client.
There are several solutions to this problem. Content providers can force users to register for their services. In this way users have to follow a login process each time they want to browse the contents. To avoid explicit user authentication, servers can use so-called cookies. Cookies are user-specific files stored on client machines. Each time a user visits the same service, the server can obtain user information from the stored cookies. The most accurate identification based solely on access log files is to use the IP address and the browser agent type together as a unique user identification pair [13]. Some papers, however, use IP/cookie pairs [2]. The identification procedure proposed in this thesis takes place inside the database as a select query, which fills the users table from the cslog table. Table 4 shows the data scheme of the users table.

⁴ Market basket analysis terminology uses “transaction” for the items purchased at once. The information technology (IT) sector, meanwhile, uses “transaction” for single client-server request-response exchanges. Furthermore, IT terminology also uses the term “session” (which is analogous to a market basket) to denote consecutive page visits of a user, a.k.a. navigation sequences. To resolve the conflict, this thesis uses both terms to denote navigation sequences, except in chapter 3 Data preparation, where “transaction” means a page access.


Data scheme of the users table

column name     type name
id              bigint
remotehost      varchar
host_name       varchar
TLD             varchar
user_agent      text

Table 4

The remotehost and user_agent fields hold the above-mentioned identification pair, while host_name and TLD are discussed in the next section (4.2 User groups).
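In the thesis the identification is carried out by a select query inside the database. Purely as an illustration of the idea, the following in-memory Python sketch assigns one identifier to every distinct (remotehost, user_agent) pair; the function name `identify_users` and the sample values are hypothetical.

```python
def identify_users(entries):
    """Assign a user id to every distinct (remotehost, user_agent) pair.

    entries: iterable of (remotehost, user_agent) tuples, one per log entry.
    Returns the pair -> id table and the list of ids in entry order.
    """
    user_ids = {}
    assigned = []
    for remotehost, user_agent in entries:
        key = (remotehost, user_agent)
        if key not in user_ids:
            user_ids[key] = len(user_ids) + 1   # next free identifier
        assigned.append(user_ids[key])
    return user_ids, assigned

# Two browsers behind the same address are treated as two users.
entries = [("10.0.0.1", "Mozilla/4.0"),
           ("10.0.0.1", "Opera/7.0"),
           ("10.0.0.1", "Mozilla/4.0")]
users, ids = identify_users(entries)
print(ids)
```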

4.2 User groups
Arranging users into specific user groups is essential for further examinations. All the statistics and models described later are based on sessions belonging to user groups. The advantage of systems with user authentication is the availability of personal information on registered users, which helps to form the most exact and diverse groups. In the case of public domains, the possibilities are restricted to the information that can be mined from access log files: groups can be formed based on user IP addresses (network ranges), geographical data, visiting frequency, etc. Access log file entries contain either the IP address or the domain name in the remotehost field. In either case the missing counterpart (the domain name or the IP address) should be looked up and updated in the users table. After this process the remotehost field refers to the IP address, while the host_name field refers to the corresponding domain name in the users table.

Organizational groups
A natural grouping of users is present in most internal networks in the form of subnetwork address ranges. Subnetwork address ranges determine subnetwork domains within the whole network. There can be separate network ranges for user groups like staff, management, students, administration, etc. Using these ranges and the IP addresses of users, a variety of groups can be formed.

Geographical groups
Most network (IP) addresses or network ranges have a domain name registered to them. The domain name consists of level and sublevel names divided by dots. The rightmost name of the whole string is the top level domain (TLD). A TLD can be a country code like nl, hu, uk, etc. or another reserved name for public organizations, such as com, org, gov, etc. The rest of the domain name can be built of organization names followed by department names etc., all in a hierarchical structure (e.g., www.cs.vu.nl). Geographical distinction among users can be set up using TLD names. A group can be formed, for example, based on the “nl” TLD. Users are selected for this group by searching for the “nl” TLD in their corresponding domain name. No geographical observations can be obtained from organizational TLDs, such as the network infrastructure (net) and commercial (com) top level domains. This is because these domains can be registered worldwide and thus have no clear relationship to countries.
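Both kinds of group selection can be sketched with the Python standard library. The function names are hypothetical, and the address range used in the example is a reserved documentation range, not a real subnet of the analysed network.

```python
import ipaddress

def top_level_domain(host_name):
    """Return the rightmost label of a domain name, or None for a bare IP address."""
    last = host_name.strip(".").split(".")[-1].lower()
    # A purely numeric last label means the name is an unresolved IP address.
    if not last or last.isdigit():
        return None
    return last

def in_subnet(remotehost, network):
    """True if the IP address lies in the given subnetwork range (CIDR notation)."""
    return ipaddress.ip_address(remotehost) in ipaddress.ip_network(network)

print(top_level_domain("www.cs.vu.nl"))          # geographical group key
print(in_subnet("192.0.2.17", "192.0.2.0/24"))   # organizational group test
```

A geographical group is then the set of users whose `top_level_domain` equals a country code, and an organizational group the set of users for whom `in_subnet` holds for one of the ranges of that group.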


4.3 Session identification
Sessions constitute the basis of most web log mining processes. They are related to users and composed of the pages visited during a separate browsing activity. Visited pages belong to a specific domain and form a sequence in visiting order. It is worth mentioning that not all requests are present in the log files. Most browsers use caching, which allows previously visited pages to be reused instead of being downloaded again. Besides, proxy servers also cache pages: they store frequently visited pages within a company to reduce bandwidth load. As a result, some pages of a visiting sequence are viewed “offline”, which means that no log entry refers to these accesses. This problem can be solved by setting the expiration time of pages to a minimum, which forces clients to download expired pages again. However, this solution assumes that we can change the documents. Several algorithmic solutions have been proposed for this problem (e.g., [13]). We believe that the main characteristics can be observed without such data preparation techniques.
Several session identification methods are described in the literature [6, 13, 20]. The most widely accepted methods are the so-called time frame (or time window) identification [13] and the maximal forward reference (MFR) identification [7]. Both methods work on pre-selected page accesses, i.e., on data grouped by users and ordered by access time. The data consist of the user identification number (id field), the date and time of the page access (transdate field) and the content type of the visited page (content_type field). In addition, MFR requires the request URL (request field).
The time frame identifier divides the page accesses of a user using a time window. The suggested length of this window is approximately 30 minutes [13, 14, 30], and most commercial products also use a 30 minute timeout for splitting. The identifier iterates through the entries, and whenever the access time (transdate) of an entry falls outside the time interval, it starts a new session and starts to measure the time interval again from that entry.
The maximal forward reference identifier adds page access entries to a session list up to the page before a backward reference is made. A backward reference is defined as a page that is already contained in the set of pages of the current transaction. When a backward reference is found, the identifier starts a new session list and goes on with the iteration. For example, an access sequence of A B C B D E E E F G would be broken into four transactions, i.e. A B C, B D E, E, and E F G. The drawback of this method is that it does not consider that some of the “backward” references may provide useful information. Besides, it may keep entries within the same session even if a week elapsed between them.
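Both identifiers can be sketched in a few lines of Python. The sketch follows the descriptions above, with function names chosen for illustration; the input is assumed to hold the page accesses of a single user, already ordered by access time.

```python
from datetime import datetime, timedelta

def time_frame_sessions(accesses, window=timedelta(minutes=30)):
    """Split (timestamp, page) pairs of one user into sessions by a time window.

    The window is measured from the first entry of the current session, as
    described in the text, not from the previous entry.
    """
    sessions = []
    window_start = None
    for timestamp, page in accesses:
        if window_start is None or timestamp - window_start > window:
            sessions.append([])          # entry falls outside the window
            window_start = timestamp
        sessions[-1].append(page)
    return sessions

def mfr_sessions(pages):
    """Split a page sequence of one user at every backward reference."""
    sessions, current = [], []
    for page in pages:
        if page in current:              # backward reference: close the session
            sessions.append(current)
            current = []
        current.append(page)
    if current:
        sessions.append(current)
    return sessions

print(mfr_sessions(list("ABCBDEEEFG")))
```

Applied to the access sequence of the example above, `mfr_sessions` yields the four transactions A B C, B D E, E and E F G.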


4.4 An overall picture
The figure below represents the functional model of session identification:

[Figure: GetSessions. Elements shown: the DATABASE (cslog, users tables), the webmining.prop file, the objects TransactionMemoryIterator, TransactionSimple, UserGroupSelector, UserIPGroupSelector, UserCountryGroupSelector, Identifier, TimeFrameIdentifier, MFRIdentifier and SessionFormatPrinter, the data flows “page access entries”, “selected entries” and “identified sessions”, and the output “User sessions in the appropriate data format”.]

Figure 3: Functional model of the session identification process

At the beginning the TransactionMemoryIterator object retrieves all the log entries from the cslog table, ordered by id and sub-ordered by transdate. Note that although the number of log entries can be large, the memory requirement of the whole dataset is still manageable, because all the information needed for an entry is its id, content_type and transdate (and the URL for MFR identification). After fetching the data, TransactionMemoryIterator iterates through the user ids, and for each id it asks UserGroupSelector to decide whether the given user belongs to the group or not. More specifically, UserGroupSelector can be a subnetwork range selector (UserIPGroupSelector) or a geographical group selector (UserCountryGroupSelector), depending on the settings in the webmining.prop properties file (for more information on group selection refer to section 4.2 User groups). When a user is selected by the group selector, its access entries are passed on to the Identifier, which divides them into user sessions.


Note again that an Identifier can be, as described earlier (section 4.3 Session identification), a time frame identifier (TimeFrameIdentifier) or a maximal forward reference identifier (MFRIdentifier). Finally, the identified sessions of a user are appended to the output file by the SessionFormatPrinter in the appropriate format (e.g., association rule format, mixture model format, global tree model format, etc.).


5 Profile mining models
So far we have discussed the techniques and steps required for data preparation and data enrichment. This chapter discusses the data mining models used in this project for pattern discovery on enriched data. It starts with an explanation of the widely used association rules mining technique and continues with a recent model called the mixture model. Finally it presents the global tree model, which represents session data in a natural way and makes it easy to mine session-specific statistics on stored data. This model can also present its structure in an easily interpretable graphical way. Consider the following formal notion⁵ as dataset representation for all the models described below:

Notion 5.1
Let $D = \{D_1, D_2, \ldots, D_N\}$ be a transaction or session data set generated by $N$ individuals, where $D_i$ is the observed data on the $i$-th user, $1 \le i \le N$. Each individual data set $D_i$ consists of a set of one or more transactions for that user, i.e., $D_i = \{y_{i1}, \ldots, y_{ij}, \ldots, y_{in_i}\}$, where $n_i$ is the total number of transactions observed for user $i$, and $y_{ij}$ is the $j$-th transaction for user $i$, $1 \le j \le n_i$.
An individual session $y_{ij}$ consists of content-type references of visited pages within a user session: $y_{ij} = \{n_{ij1}, \ldots, n_{ijk}, \ldots, n_{ijk_{ij}}\}$, where $k_{ij}$ is the length of the $i$-th user's $j$-th session, $k_{ij} \ge 1$.
$n_{ijn}$ is a content-type reference, which can take values from the content type reference range: $1 \le n_{ijn} \le K$. Each reference of the range $1 \ldots K$ refers to a content group (refer to section 3.1.2 Content types mapping table).

⁵ Note that the notion is almost the same as the one proposed in [9], with the difference that transactions are not considered as sets of items but rather as ordered lists of content types of the pages visited within a session.

5.1 Mining frequent itemsets
One of the most well known and popular data mining techniques is the association rules (AR) or frequent itemsets mining algorithm. The algorithm was originally proposed by Agrawal et al. [1] for market basket analysis. Because of its wide applicability, many revised algorithms have been introduced since then, and AR mining is still a widely researched area.

The aim of association rule mining is to explore relations and important rules in large datasets, expressed as implications of the form “if premise then conclusion” ($X \rightarrow Y$, $X \cap Y = \emptyset$).

A dataset is considered as a sequence of entries consisting of attribute values also known as items. A set of such items is called an itemset (entries themselves are itemsets). Formally,

Let $I = \{i_1, i_2, \ldots, i_n\}$ be a collection of all items, where $i_j$ ($j \in 1 \ldots n$) is an item. An itemset is a collection of items, where each item can occur at most once. A transaction or session is an itemset.

Using the notions (Notion 5.1) introduced at the beginning of this chapter, items refer to $n_{ijn}$ content-type references and an itemset is a $y_{ij}$ user session with the restriction that each item can occur at most once.

Let $X \rightarrow Y$, $X \cap Y = \emptyset$ be an association rule. It has support $s$ (in $D$) if $s\%$ of the transactions from $D$ contain $X \cup Y$. It has confidence $c$ if $c\%$ of the transactions from $D$ that contain $X$ also contain $Y$.

A problem with association rules is that for a given number $i$ of items there are $2^i$ itemsets and for each $k$-itemset there are $2^k$ rules. This could result in an unacceptable amount of rules. The solution is to consider only rules with a support and confidence higher than $s$ and $c$.

The problem of mining association rules can be decomposed in two major steps:

1. Find all frequent itemsets that have support greater than the threshold $s$ and
2. for each frequent itemset, generate all the rules that have confidence greater than the threshold $c$.

“Apriori” was the first association rules mining algorithm. Many improved algorithms (most of them “apriori”-based) have been introduced since it was published. In the following we give the pseudo code of the “apriori” algorithm [1].
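Before turning to the algorithm, the two measures can be illustrated with a small Python sketch. The function names and the sample sessions are hypothetical; sessions are lists of content-type references, and they are reduced to sets because each item counts at most once in an itemset.

```python
def support(itemset, transactions):
    """Percentage of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return 100.0 * hits / len(transactions)

def confidence(premise, conclusion, transactions):
    """Percentage of the transactions containing `premise` that also contain `conclusion`."""
    premise, conclusion = frozenset(premise), frozenset(conclusion)
    with_premise = [set(t) for t in transactions if premise <= set(t)]
    if not with_premise:
        return 0.0          # convention chosen here for an unsupported premise
    both = sum(1 for t in with_premise if conclusion <= t)
    return 100.0 * both / len(with_premise)

# Five hypothetical sessions over content types 1..4
sessions = [[1, 2, 3], [1, 2], [2, 3], [1, 4], [1, 2, 1]]
print(support({1, 2}, sessions), confidence({1}, {2}, sessions))
```

In this example the itemset {1, 2} occurs in three of the five sessions (support 60%), and three of the four sessions that contain 1 also contain 2 (confidence 75% for the rule 1 → 2).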

Pseudo code of the “apriori” algorithm [1]

Initial conditions
    L_k: set of large k-itemsets (itemsets that have minimal support)
    C_k: set of candidate k-itemsets
    D: set of transactions (as described above), t ∈ D
    s: support threshold

Algorithm
    L_1 = {frequent 1-itemsets};
    for (k = 2; L_{k-1} ≠ ∅; k++) {
        C_k = set of new candidates;
        for all transactions t ∈ D
            for all k-subsets sub of t
                if (sub ∈ C_k) sub.count++;
        L_k = {c ∈ C_k | c.count ≥ s};
    }
    Set of all frequent itemsets = ∪_k L_k

    C_k = set of new candidates:
        step 1: C_k ⇐ empty
        step 2: C_k ⇐ {p ∪ q | p, q ∈ L_{k-1} ∧ |p ∪ q| = k}
        step 3: C_k ⇐ C_k − {p | p ∈ C_k ∧ ∃ (k−1)-subset of p ∉ L_{k-1}}

Rules can be generated incrementally, starting from rules with 1-itemset conclusions, because of the following property of confidence:

Let $L$ be a frequent itemset and $A \subset L$ a subset of it. Then the following statement is true: if the confidence of $(L - A) \Rightarrow A$ is $c$, then for any $B \subset A$ the confidence of $(L - B) \Rightarrow B$ is at least $c$.

5.2 The mixture model
In their paper Cadez et al. (2001) [5] proposed a generative mixture model for predicting user profiles and behaviours based on historical transaction data. A mixture model is a way of representing a more complex probability distribution in terms of simpler models. It uses a Bayesian framework for parameter estimation, and it addresses the heterogeneity of page visits. Even if a user has not visited a page before, the model can predict it with a low probability. Cadez et al. (2001) presented both a global and an individual model; this thesis applies only the global mixture model. In this thesis transaction data consistently mean web page visits or sessions, instead of the slightly different market basket data used in [5]. While sessions are ordered sequences of visited pages, market baskets are sets of purchased items. However, session data can simply be transformed into the market basket data structure in order to apply the mixture model:

Notion 5.2 (alteration of Notion 5.1)
For the mixture model approach the transaction notion should be altered in the following way: an individual session $y_{ij}$ consists of counts of content-type references of visited pages within a user navigational sequence. $y_{ij} = \{n_{ij1}, \ldots, n_{ijk}, \ldots, n_{ijK}\}$, where $n_{ijk}$ indicates how many pages of content type $k$ are in the $i$-th user's $j$-th session, $1 \le k \le K$, $n_{ijk} \ge 0$.

The global mixture model consists of K components. Each of the components describes a prototype transaction forming a basis function. A component models a specific session prototype, which consists of visited page types with counts relatively higher than for other items. A K-component mixture model for modelling a user's site visit $y_{ij}$ is given below:

Notion 5.3 – K-component mixture model
$$p(y_{ij}) = \sum_{k=1}^{K} \alpha_k P_k(y_{ij}) \qquad (1)$$
where $\alpha_k > 0$ is the component weight for the $k$-th component, $\sum_k \alpha_k = 1$, and $P_k$, $1 \le k \le K$, is the $k$-th mixture component.

As for modelling components, [5] proposed a simple memoryless multinomial model. For every component there is a multinomial distribution $\Theta_k = (\Theta_{k1}, \ldots, \Theta_{kC})$, conditioned on $n_{ij}$, the total number of pages visited in the $i$-th user's $j$-th session. The mixture model (Notion 5.3, equation (1)) completed with multinomials can be written as

Notion 5.4 – Mixture model with multinomials
$$p(y_{ij}) = \sum_{k=1}^{K} \alpha_k \prod_{c=1}^{C} \Theta_{kc}^{n_{ijc}} \qquad (2)$$
Note that in equations (1) and (2) $K$ denotes the number of mixture components, while the content types are indexed by $c = 1, \ldots, C$.
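Equation (2) can be evaluated directly, as the following Python sketch shows. The function name and all parameter values are hypothetical and chosen for illustration only; the sketch follows equation (2) literally, i.e. without a multinomial coefficient, and does not estimate any parameters.

```python
import math

def mixture_probability(counts, weights, thetas):
    """Evaluate equation (2) for one session.

    counts:  n_ijc, the number of visited pages per content type c
    weights: alpha_k, one weight per component (summing to 1)
    thetas:  Theta_kc, one multinomial parameter vector per component
    """
    total = 0.0
    for alpha_k, theta_k in zip(weights, thetas):
        # product over content types of Theta_kc raised to the count n_ijc
        component = math.prod(t ** n for t, n in zip(theta_k, counts))
        total += alpha_k * component
    return total

# Hypothetical model: 2 components over 3 content types
weights = [0.6, 0.4]
thetas = [[0.5, 0.25, 0.25],
          [0.1, 0.1, 0.8]]
p = mixture_probability([2, 0, 1], weights, thetas)
print(p)
```

For the session with counts (2, 0, 1) the first component contributes 0.6 · 0.5² · 0.25 and the second 0.4 · 0.1² · 0.8, so the first prototype explains this session far better than the second.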

The full data likelihood, under the assumption that individuals behave independently, is presented below:

Notion 5.5 – Full data likelihood
$$p(D \mid \Theta) = \prod_{i=1}^{N} p(D_i \mid \Theta) \qquad (3)$$
$\Theta$ represents the unknown parameters: both the parameters of the K component multinomials, $\{\Theta_1, \ldots, \Theta_K\}$, and the $\alpha$ vector of profile weights, $\{\alpha_1, \ldots, \alpha_K\}$.

The unknown parameters $\{\Theta_1, \ldots, \Theta_K\}$ and $\{\alpha_1, \ldots, \alpha_K\}$ are estimated by an expectation maximization (EM) algorithm.

5.3 The global tree model
Pei et al. (2000) [23] propose a WAP-tree architecture for efficiently mining frequent itemsets. Besides the tree structure, this tree based model contains a link-queue for each type of label. The queues connect all nodes with the same label, forming chains. Xing and Shen (2003) [30] present the so-called preferred navigation tree (PNT) for mining preferred navigation paths. PNT stores the URL, the frequency of visits and the visiting time in its nodes. In our approach we use a global tree model (GTM). The GTM provides a special representation of session data for groups of users. The structure of the model is similar to that of the PNT presented in [30]. The model preserves the information obtained from the structure of sessions and it stores individual pages in visiting order. In this model sessions with the same prefix share the same branch of the tree, which reduces the storage required for the model. The model was also built to be able to visualize frequent navigational paths in a tree structure. Visualization helps to understand the patterns by highlighting relevant information.
Each node in a tree model registers four pieces of information: a content-type label, a frequency number, a reference to its parent node and references to its children nodes. The root of the tree model is a special virtual node with an optional title label and frequency 0. Every other node is labelled by one of the content-type labels and is associated with a frequency, which stores the number of occurrences in the original session database of the corresponding prefix ending with that content type. A model consists of K branches (session trees) connected to the virtual root node, where K is the number of content types (refer to Notion 5.1). Each branch contains a root node labelled with a unique content-type identifier. A branch stores only those user sessions which start with a page labelled with the same content type as its root. Figure 4 presents the visualization of a sample tree model.
An $A \rightarrow B$ path of a tree from any node A to any node B (where the level number of A in the tree is not greater than that of B) represents one or more subsessions, where the frequency number of the B node represents the total number of sessions containing this ordered subsequence pattern. A special case of the $A \rightarrow B$ path is when A is the root node (of a session tree). In this case the path represents one or more sessions or subsessions, depending on the frequency of the B node and the summed frequency of its children nodes:

Let $f_B$ be the frequency number of the B node and let $sum = \sum_{C \in children(B)} f_C$ be the summed frequency of its children nodes. Let $Root \rightarrow B$ be the path from the root node to the B node; then $Root \rightarrow B$ represents at least one real session if $f_B > sum$, in which case the difference $f_B - sum$ gives the number of real $Root \rightarrow B$ sessions.

Building the tree model
Model building starts with the initialization of the K session trees. Each tree is initialized for a unique content type k. Then all sessions of the data set are added to their corresponding session tree. Each session is examined for its first page type and a tree is selected according to the result. Adding a session to its tree can be implemented recursively. The recursive function takes a parent node and a subsession as parameters and updates or creates the child node of this parent with the content type given by the first element of the subsession. The recursive step passes the child node as the parent parameter, and the new subsession parameter arises from the removal of the first entry of the original subsession. The recursive process stops when the length of the subsession is less than or equal to one.

Algorithm to build the global tree model

Initial conditions
    s ∈ D: session
    s_i: the i-th element of session s
    sessionTrees: array [1..K] of SessionTree
    SessionTree: tree object for content type k, consists of a root node and children nodes
    node: a SessionTree node containing
        ct: content type of the node
        freq: frequency of this node
        parent: node reference to the parent node
        children: array [1..L] of node references

Scheme of the algorithm:
    init sessionTrees;
    for all s ∈ D {
        sessionTrees[s_1].add(s);
    }

Initialization of sessionTrees:
    init sessionTrees {
        for i = 1..K {
            sessionTrees[i] = SessionTree(i, 0);
            root.ct ⇐ i; root.freq ⇐ 0;
            root.parent ⇐ null; root.children ⇐ null;
        }
    }

Adding a session to the correspondent SessionTree:
    sessionTrees[content_type].add(s: session) {
        if s_1 ≠ content_type return;
        addSession(root, s);
    }

    addSession(node: parentNode, s: session) {
        node.freq++;
        if s.length > 1 {
            s.removeFirstElement();
            if child for s.firstElement() exists {
                addSession(child for s.firstElement(), s);
            } else {
                addSession(create child for s.firstElement(), s);
            }
        }
    }
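A compact Python sketch of the tree construction is given below. It is an illustration only, not the Java implementation of the project: names such as `Node`, `build_model`, `real_sessions` and `preferred_paths` are hypothetical, session trees are created on demand instead of being initialized for all K content types, and the last function anticipates the level-wise scan for supported paths that is described under “Mining preferred paths from GTM”.

```python
class Node:
    """Tree node with content type, frequency, parent and children references."""
    def __init__(self, ct, parent=None):
        self.ct = ct
        self.freq = 0
        self.parent = parent
        self.children = {}              # content type -> child Node

def add_session(node, session):
    """Add a (sub)session whose first element corresponds to `node`."""
    node.freq += 1
    if len(session) > 1:
        rest = session[1:]              # remove the first element
        child = node.children.get(rest[0])
        if child is None:               # create the child on first use
            child = node.children[rest[0]] = Node(rest[0], node)
        add_session(child, rest)

def build_model(sessions):
    """Return the session trees, keyed by the content type of their root."""
    trees = {}
    for s in sessions:
        if not s:
            continue
        if s[0] not in trees:
            trees[s[0]] = Node(s[0])
        add_session(trees[s[0]], s)
    return trees

def real_sessions(node):
    """Number of sessions that end exactly at `node` (f_B minus the children sum)."""
    return node.freq - sum(c.freq for c in node.children.values())

def preferred_paths(trees, threshold):
    """Level-wise scan: all Root -> B paths whose frequency reaches the threshold."""
    supported = {}
    candidates = [((root.ct,), root) for root in trees.values()]
    while candidates:
        next_level = []
        for path, node in candidates:
            if node.freq >= threshold:  # low-support branches are not expanded
                supported[path] = node.freq
                next_level.extend((path + (c.ct,), c) for c in node.children.values())
        candidates = next_level
    return supported

sessions = [[1, 2, 3], [1, 2], [1, 2, 3], [2, 1], [1]]
trees = build_model(sessions)
print(preferred_paths(trees, 2))
```

Because a child can never be more frequent than its parent, skipping the children of an unsupported node loses no supported path. The recursion depth of `add_session` equals the session length, which is harmless for typical sessions but would call for an iterative version if very long sessions occurred.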

Mining preferred paths from GTM
Preferred navigation paths can be mined directly from the tree model. (A, B) paths or sessions that have a higher support than a preset threshold value are the preferred navigation paths. The algorithm given below scans each level of all session trees for possible candidates, ignoring branches that have low support.

Mining preferred paths

Initial conditions
    s: support threshold
    supported: list of supported sessions and their support value
    candidates: the list of candidate nodes
    candidateChildren: list of candidate children nodes

Algorithm
    candidates ⇐ root node;
    while candidates.size() ≠ 0 do {
        candidateChildren ⇐ empty;
        for i = 1..candidates.size() do {
            if frequencyOf(root, candidates[i]) ≥ s {
                supported ⇐ add((root, candidates[i]), support)
            }
            // gather possible child candidates
            for j = 1..candidates[i].NumberOfChildren do {
                if child_j.freq ≥ s {
                    candidateChildren.add(child_j);
                }
            }
        }
        candidates ⇐ candidateChildren;
    };

Trees’ similarity
In the following a tree similarity measure is proposed for determining the distance of different tree models. By means of the similarity measure we can determine the likeness of trees. We expect that the similarity of two trees built from two distinct session data sets will be high if the data sets were generated by users with similar behaviour. The proposed distance measure considers both the structure of the tree and the frequency of tree nodes. The distance satisfies the following criteria:

Assumptions
1. A similarity distance measures not only the structure of trees but also (or rather) the frequencies of their nodes. Higher frequencies should be taken into account with higher weights.
2. The extra information that originates from sessions should be exploited.
3. Considering trees $T_1, T_2$: the distance of $T_1$ from $T_2$ should be equal to the distance of $T_2$ from $T_1$. Formally $T_1.dist(T_2) = T_2.dist(T_1)$.

The distance measure proposed is a simple approach based on forming the intersection of the two trees’ session data sets.

Trees’ similarity

Initial conditions
    T_1, T_2: the two tree models
    candidates: the list of candidate nodes
    candidateChildren: list of candidate children nodes
    sum: registers the number of all common sessions

Algorithm
    sum ⇐ 0;
    candidates ⇐ put all the root nodes from session trees that occur in both trees;
    while candidates.size() ≠ 0 do {
        for i = 1..candidates.size() do {
            if T_1 and T_2 both have (root, candidates_i) sessions {
                sum ⇐ add the number of common sessions;
            };
            candidateChildren ⇐ put all the children nodes of candidates_i that are present in both trees;
        };
        candidates ⇐ candidateChildren;
    };

The similarity proportion can then be easily calculated by dividing the sum value by the number of different sessions in the two trees (the summed number of all sessions in the two trees minus the sum value). Multiplying the resulting value by 100 gives the similarity percentage.
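The resulting similarity percentage can be illustrated with a short Python sketch. It does not traverse the trees: it works directly on the session lists from which the two trees would be built, which contain the same information as the “real” session counts stored in the trees. Counting a session that occurs several times in both data sets as often as it occurs in the smaller one is an assumption, since the text does not define the number of common sessions in more detail; the function name is hypothetical.

```python
from collections import Counter

def tree_similarity(sessions_a, sessions_b):
    """Similarity percentage of two session data sets (common / different sessions)."""
    a = Counter(tuple(s) for s in sessions_a)
    b = Counter(tuple(s) for s in sessions_b)
    # Assumption: multiset intersection, i.e. the minimum count per session
    common = sum((a & b).values())
    different = sum(a.values()) + sum(b.values()) - common
    if different == 0:
        return 0.0      # convention chosen here for two empty data sets
    return 100.0 * common / different

group_1 = [[1, 2], [1, 2], [3]]
group_2 = [[1, 2], [3], [4]]
print(tree_similarity(group_1, group_2))
```

In the example two sessions are common to both groups, and 3 + 3 − 2 = 4 different sessions remain, which gives a similarity of 50%. Since the intersection is symmetric, the measure satisfies the third assumption above.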


Visualization of tree models
Frequent navigational paths are conventionally represented by text or tables, which are not easy to understand. Visualization of a tree model, however, makes it easy to interpret the patterns. A picture of a tree model consists of nodes with content-type labels and their colour code. Nodes are connected with lines (edges) whose thickness marks the frequency of the given path. In addition, each edge to a child node carries a proportional number, which shows the distribution of frequencies among the children of a node. The number of “real” sessions for that path of the tree is also given in parentheses. The tree visualization contains only the supported sessions, based on a support threshold set for the model. Figure 4 presents a sample tree.

Figure 4: Visualization of a sample tree model

The sample tree above (figure 4) contains nine different content-type nodes. Its most frequent “starting” node is english/department. 62% of the visitors (that is 9 visitors in this case) start on the department pages and then go on to the faculty pages. Faculty pages have a visiting rate of 100% within this branch, which means that all of the users who went on from the department pages also visited faculty pages, and so on.


6 Analysing log files of the www.cs.vu.nl web server

For the purpose of this thesis the discussion is restricted to the analysis of user behaviour for a single web domain, www.cs.vu.nl. All the data used in the following experiments therefore relate to the web server of the Computer Science Department of the Vrije Universiteit. This chapter presents experimental results obtained with all the techniques described earlier. The first section describes the details of the input access log files and the mapping table. This is followed by experimental results of the data preparation and data structuring techniques. Finally, the last sections present results of the three profile mining models AR, MM and GTM.

Results of association rules and frequent itemsets mining show which page sets users tend to visit within a session and what rules can be defined on frequent itemsets. A mixture model can tell what distribution the data come from and how many components (based on different user behaviours) are likely to have generated the data. Both AR mining and the mixture model ignore the information that can be mined from the order of pages within sessions. The global tree model, in contrast, is based on the structure of sessions. It can answer the question of which session sequences (or subsequences) are highly preferred by users. It also provides a visualization of frequent navigational paths in a tree structure.

Most of the algorithms were implemented in the Java programming language. For further details on their implementation each section refers to the proper APPENDIX table. Only the most frequent and most important patterns are presented in this chapter, but the CD-ROM accompanying this master thesis contains all the results and outputs of the experiments (refer to APPENDIX E).

6.1 Input data

The input data in this case are the access log files of the www.cs.vu.nl web server for a certain period of time, the content types mapping table of the HTML pages of the www.cs.vu.nl domain, and the organizational and geographical information for user group identification.

6.1.1 Access log files

Four consecutive access log files were collected from the www.cs.vu.nl server and merged together. In total they cover a little over one month of access log entries. The details are summarized in the table below:

Details on the merged access log entries

File name | cs_access_log_20040530-20040704
Size (MB) | 1 533,344
Period | 30 May 2004 – 4 July 2004
Number of entries | 7 126 732

Table 5


The Apache web server of the www.cs.vu.nl domain writes the following fields, in the given order, into the log files: remotehost, rfc931, authuser, date, request, status, bytes, referrer, user_agent. For the accepted access log file structure refer to APPENDIX B2.
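As an illustration of how such entries can be split into the nine fields, the sketch below parses one line with a regular expression. It assumes that the files follow the usual Apache combined log layout (quoted request, referrer and user agent, date in square brackets); the exact structure accepted by the thesis software is specified in APPENDIX B2 and may differ. The name `parse_entry` and the sample line are invented for this example.

```python
import re

# Assumed layout: Apache "combined" log format, fields in the order listed above.
LOG_PATTERN = re.compile(
    r'(?P<remotehost>\S+) (?P<rfc931>\S+) (?P<authuser>\S+) '
    r'\[(?P<date>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_entry(line):
    """Return the nine fields of one log line as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


# Invented sample entry (documentation-only IP address).
sample = ('192.0.2.7 - - [30/May/2004:06:25:11 +0200] '
          '"GET /index.html HTTP/1.0" 200 5120 "-" "Mozilla/4.0 (compatible; MSIE 6.0)"')
fields = parse_entry(sample)
print(fields["remotehost"], fields["request"], fields["status"])
```

Lines that do not match the pattern yield `None`, so they can be counted as corrupt records rather than stopping the run.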

6.1.2 The mapping table

Data enrichment is partly based on the content information of visited web pages. This information is given by a table with URL/content type entries. The table was generated by a text mining algorithm that was developed in a different project [3]. The text mining algorithm attaches labels to all HTML pages of a document set based on their contents. The HTML pages (VU-pages) were downloaded by wget [36] invoked with the following parameters:

wget -l5 -r -t5 -A.htm,.html http://www.cs.vu.nl

The given parameters force wget to download all the *.htm and *.html files from the www.cs.vu.nl domain recursively, to a depth of five levels. If a page cannot be accessed, wget retries the download up to four more times. This resulted in a collection of 13 011 HTML pages (with a total size of about 90 MB) that were subsequently assigned to 19 categories:

Description of the content-types (content-categories)

type id. | type name | description
1 | photo | This type refers to pages containing a negligible quantity of textual information with one or more images. It most likely refers to personal photo albums, lecture slides or informational pages with messages like “under construction” or “this page has been moved to …”.
2 | miscellaneous | The “miscellaneous” type refers to pages with absent or insufficient content. It most likely refers to framesets or to empty, file list, form, moved or redirected pages. It can contain photo pages as well if the page doesn’t contain relevant textual information.
3 | dutch/department | This type-group contains department pages in Dutch.
4 | english/reference | The “english/reference” group most likely refers to pages containing e-books or manual pages for different systems or programs. It can be a manual for an operating system or an API reference for a programming language. It contains pages written in English.
5 | english/activity | This group most likely refers to pages containing invitations to official or free time activities. These events can be scientific conferences, exhibitions, concerts, trips for international students or any other event connected with the University. The group contains pages written in English.
6 | english/department | This category contains department pages in English.
7 | english/project | This type most likely refers to research projects of the computer science department, written in English.
8 | english/person/faculty | This group most likely refers to pages of the faculty members. They are usually very formal and mostly consist of fields of research, professional background, research projects and other information related to the member’s research area or department. It contains pages written in English.
9 | english/person/student | This group most likely refers to student pages. Student pages mostly contain personal information (e.g., hobby, lyrics, etc.) and links to pages of friends and courses. The group contains pages written in English.
10 | english/person/faculty/publication | The “english/person/faculty/publication” category most likely refers to pages containing publications of faculty members, comprising at least the abstracts. It contains pages written in English.
11 | english/course | This group most likely refers to course pages. They mostly contain the description of the course, lecture slides, recommended literature and set assignments in English.
12 | dutch/course | The same as the english/course group, but containing pages written in Dutch.
13 | dutch/person/student | The same as the english/person/student group, but containing pages written in Dutch.
14 | dutch/person/faculty | The same as the english/person/faculty group, but containing pages written in Dutch.
15 | other_language | This type-group contains pages written in languages other than English or Dutch.
16 | dutch/project | The same as the english/project group, but containing pages written in Dutch.
17 | dutch/activity | The same as the english/activity group, but containing pages written in Dutch.
18 | documents | This group contains documents in Adobe Acrobat (pdf) or Postscript (ps) format. They are most likely to be scientific papers, publications, e-books, etc.
19 | other documents | “Other documents” contains documents in Microsoft Word (doc), Microsoft PowerPoint (ppt), Microsoft Excel (xls), Rich text (rtf) or plain text (txt) format. They are most likely to be administrative papers, forms, course materials etc.

Table 6: Description of the content-types

The labelling algorithm (with human support) provided only an approximate categorization of pages. Roughly 74% of the pages received the right labels; see [3] for details. To shorten the type names we will in some places use the letters “E” and “D” for the English and Dutch groups (e.g., “E/department” refers to “english/department”). For the file structure of the mapping table accepted by the webmining package, refer to APPENDIX B3.


6.1.3 Experiments on data preparation

The following table contains statistical results of the access log data filtering and content type data integration.

Statistics of the access log data filtering and content type data integration

Filtering method | Number of filtered (bad) records | Percentage of the total nr. of entries
Unsupported extensions | 5 062 579 | 71,04%
Spider transactions | 2 779 464 | 39,00%
Dynamic pages | 205 046 | 2,88%
Unsupported HTTP request methods | 55 343 | 0,78%
Corrupt escape characters | 10 305 | 0,14%
Unsuccessful requests | 381 454 | 5,35%
Domain filter | 728 | 0,01%
Path completion or anchor stripping | 778 079 | 10,92%
All methods (valid transactions) | 795 987 | 11,17%
Mapping errors (on valid transactions) | 348 048 | 4,88%
Total transactions stored (valid transactions with content type) | 447 939 | 6,29%

Table 7

All numbers in the table are given relative to the total number of records, which is why the percentages do not add up to 100%. Most of the filtering methods above are required to obtain more exact results on user behaviour. The exception is the elimination of dynamic pages, which is not strictly necessary for this purpose; the analysis of dynamic pages, however, requires a much more sophisticated system. Since the targeted domain does not contain a significant amount of dynamic pages (2,88% of total accesses), and it can be assumed that static and dynamic pages are mostly not mixed within a single user session, the corresponding filter simply ignores all dynamic pages appearing in the log files.

Not surprisingly, the statistics show that most of the entries contain requests for unsupported file types. A vast number of transactions are also generated by spiders. The mapping table does not contain content type entries for 43,73% of the valid transactions. We assume this results from frequent changes in the pages of www.cs.vu.nl. Mapping errors occur when a page referred to by a log entry is missing from the page collection (and therefore from the mapping table). The total number of valid transactions after mapping the content-types to log entries is 447 939. These were stored in a database. For the implementation details on data cleaning and data integration, refer to APPENDIX D1 and APPENDIX D2.
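The filtering step can be pictured as a chain of tests applied to each parsed log entry. The Python sketch below is a simplified illustration and not the thesis implementation (see APPENDIX D1 and D2 for that). The list of supported extensions, the spider hints in the user agent, the rule that only GET requests with a 2xx status are kept, and the default page name used for path completion are all assumptions made for this example.

```python
from urllib.parse import urlsplit

# Assumed rule sets; the real lists used in the thesis are not given in this section.
SUPPORTED_EXT = (".htm", ".html", ".pdf", ".ps", ".doc", ".ppt", ".xls", ".rtf", ".txt")
SPIDER_HINTS = ("bot", "crawler", "spider")


def reject_reason(entry):
    """Return the name of the first filter that rejects the entry, or None if it is valid."""
    parts = entry["request"].split()
    if len(parts) != 3:
        return "corrupt request"
    method, url, _protocol = parts
    if method != "GET":
        return "unsupported method"
    if not 200 <= int(entry["status"]) < 300:
        return "unsuccessful request"
    if any(hint in entry["user_agent"].lower() for hint in SPIDER_HINTS):
        return "spider"
    target = urlsplit(url)          # also separates the #anchor from the path
    if target.query:
        return "dynamic page"
    path = target.path
    if path.endswith("/"):
        path += "index.html"        # path completion with an assumed default page
    if not path.lower().endswith(SUPPORTED_EXT):
        return "unsupported extension"
    return None
```

Note that this sketch reports only the first failing filter, whereas Table 7 counts every filter independently against the total number of records, so its percentages overlap.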


6.2 Distribution of content-types within the VU-pages and access log entries

The following table shows the frequencies of content-types within the VU-pages and the access log entries of the www.cs.vu.nl.

Distribution of content-types within the VU-pages and access log entries

id | category | VU-pages: frequency | VU-pages: percentage | log entries: frequency | log entries: percent. of (1-17) | log entries: percent. of total
1 | photo | 2 432 | 18,68% | 34671 | 9,49% | 7,74%
2 | miscellaneous | 2 727 | 20,95% | 67149 | 18,38% | 14,99%
3 | dutch/department | 3 | 0,02% | 17698 | 4,84% | 3,95%
4 | english/reference | 966 | 7,42% | 13240 | 3,62% | 2,96%
5 | english/activity | 32 | 0,25% | 640 | 0,18% | 0,14%
6 | english/department | 269 | 2,07% | 23138 | 6,33% | 5,17%
7 | english/project | 441 | 3,39% | 14984 | 4,10% | 3,35%
8 | english/person/faculty | 1 084 | 8,33% | 69334 | 18,97% | 15,48%
9 | english/person/student | 549 | 4,22% | 17203 | 4,71% | 3,84%
10 | english/person/faculty/publications | 1 901 | 14,60% | 39208 | 10,73% | 8,75%
11 | english/course | 111 | 0,85% | 19588 | 5,36% | 4,37%
12 | dutch/course | 806 | 6,19% | 11755 | 3,22% | 2,62%
13 | dutch/person/student | 1 210 | 9,29% | 31709 | 8,68% | 7,08%
14 | dutch/person/faculty | 10 | 0,08% | 260 | 0,07% | 0,06%
15 | other_language | 417 | 3,20% | 1718 | 0,47% | 0,38%
16 | dutch/project | 27 | 0,21% | 212 | 0,06% | 0,05%
17 | dutch/activity | 26 | 0,20% | 2901 | 0,79% | 0,65%
18 | documents* | - | - | 66801 | - | 14,91%
19 | other documents* | - | - | 15730 | - | 3,51%
- | total without 18-19 | 13 011 | 100,00% | 365408 | 100% | -
- | total with all | - | - | 447939 | - | 100,00%

* The mapping table entries contain only the extensions for categories 18 and 19.

Table 8

The two distributions of content types (table 8) reveal relevant information on user behaviour. Relative to their large share of the collection of HTML pages, the categories photo, E/reference and D/course are under-represented in the visit frequencies. The relatively low proportion of course visits (Dutch and English together, 8,58%) is unexpected. Furthermore, the high proportion of visits from the Netherlands (refer to figure 5 in section 6.3.1) may indicate that students visit course pages mostly from home. Publications (E/person/faculty/publication) are mostly visited from foreign countries, as one may expect.

On the other hand, the E/person/faculty, E/department and D/department categories have higher rates in the log entries. The high proportion of the E/department category can be explained by the fact that pages within this class are placed at the top level of the hierarchy of the VU-pages, so they provide the links for reaching other pages. In addition, many users within the VU set department pages as their starting page. The E/reference category is mostly visited from other countries, and documents are also mostly downloaded from foreign countries, as can be seen in figure 5. The summed proportion of English pages within the VU-pages is 41,13%, against 15,99% for Dutch pages. In the log entries, 54% of the entries belong to English categories, while 17,66% of the entries belong to Dutch page requests.

6.3 Experiments on data structuring

This section provides details on data structuring. It starts by presenting the user groups and their related statistics, and then gives details on session identification.

6.3.1 The user groups formed for the users of www.cs.vu.nl

The remotehost field of a log entry is given either as an IP address or as a domain name. The IP address is required for grouping users by network ranges, while the domain names are important for the geographical sorting. The UpdateDBIPAddresses program was used to update the remotehost fields of the cslog table (refer to table 3 in section 3.4 Storing the log entries), replacing entries given as domain names with the corresponding IP addresses. The next step is to select users into the users table using the updated remotehost fields and the user_agent fields from the cslog table. In the following step, the UpdateDBHostNames program fills in the host_name field for every corresponding remotehost address in the users table. While processing the domain names, it also determines their top level domain (TLD) and fills in the TLD field. For details on UpdateDBIPAddresses and UpdateDBHostNames refer to APPENDIX D1.

A total of 118,141 users have been identified from the log entries, based on unique remotehost/user_agent pairs. The following groups make distinctions among these users. After identifying all the available IP addresses and domain names, the following demographic data can be obtained from the users’ TLD field. The table below contains the details of the 20 most frequent TLDs. TLDs are ranked by frequency, and a summarized count for all the other top level domains is given in the second-to-last row. A table containing all the details of the TLDs can be found in APPENDIX C3.

The 20 most frequent top level domains

rank | TLD | count | country
1 | nl | 19248 | Netherlands
2 | net | 18299 | network infrastructure
3 | com | 11457 | commercial
4 | fr | 3125 | France
5 | be | 3058 | Belgium
6 | de | 3001 | Germany
7 | ca | 2133 | Canada
8 | it | 2038 | Italy
9 | uk | 1903 | United Kingdom
10 | au | 1852 | Australia
11 | edu | 1803 | educational establishments
12 | jp | 1532 | Japan
13 | br | 1485 | Brazil
14 | ch | 963 | Switzerland
15 | mx | 935 | Mexico
16 | pl | 878 | Poland
17 | at | 635 | Austria
18 | fi | 610 | Finland
19 | dk | 553 | Denmark
20 | se | 531 | Sweden
- | - | 8642 | sum of all other countries
- | - | 33460 | number of users without geographical information


Table 9

Not surprisingly, the table shows that most user visits come from the Netherlands. Besides the home country, users from nine other countries show keen interest in the computer science pages of the VU. Three of them are neighbouring (or nearby) countries, namely France, Belgium and Germany. Visitors from these countries are probably students looking for further studies or fellow researchers interested in project or staff details. The other six countries are spread worldwide. A total of 33,460 users have no geographical information, because no domain name could be looked up for their IP addresses. The following user groups were formed according to the available geographical and organisational information.

Geographical groups

Groups formed by geographical information (by TLD acronyms) are described in the table below:

The description of the geographical groups

group name | description | nr of users
nl | Contains users identified by the “nl” top level domain. | 19248
other | All the other countries and organizations, i.e. every TLD that differs from “nl”. Note that we didn’t eliminate the com, org, net and edu TLDs from this category despite their “undeterminable” geographical origin. They form the basis and the most frequent part of this group, so eliminating them would result in the loss of many valuable user sessions. However, during the analysis we have to keep in mind that a significant part of this group may in fact belong to the “nl” group. | 65433

Table 10
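The user identification and geographical grouping described in this section can be summarised in a few lines of Python. This is a hedged sketch rather than the code of UpdateDBIPAddresses or UpdateDBHostNames: the function names, the dictionary keys and the rule for recognising an unresolved address (a last label that is not purely alphabetic) are assumptions made for illustration.

```python
def identify_users(entries):
    """Treat every distinct (remotehost, user_agent) pair as one user."""
    return {(entry["remotehost"], entry["user_agent"]) for entry in entries}


def tld_of(host_name):
    """Return the top level domain of a resolved host name, or None if there is none."""
    if not host_name:
        return None
    last_label = host_name.rstrip(".").rsplit(".", 1)[-1].lower()
    # A numeric last label means the address could not be resolved to a name.
    return last_label if last_label.isalpha() else None


def geographical_group(tld):
    """Map a TLD to the "nl" or "other" group; users without a TLD stay ungrouped."""
    if tld is None:
        return None
    return "nl" if tld == "nl" else "other"


print(geographical_group(tld_of("host.cs.vu.nl")), geographical_group(tld_of("example.com")))
```

Users for which `tld_of` returns `None` correspond to the users without geographical information in Table 9 and are left out of both groups.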

Organizational groups

In the Computer Science Department of the Vrije Universiteit there are separate network ranges for user groups such as staff, students, administration, etc. Groups identified by the address ranges they belong to are described in the table below:

The description of the organizational groups

group name | description | nr of users
staff | Contains users identified by the subnet address range for teachers of the Computer Science department. | 274
student | Contains users identified by the subnet address range for student machines of the Computer Science department. | 567

Table 11
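Assigning a user to an organizational group amounts to testing whether the IP address falls into a given subnet. The sketch below uses Python's standard `ipaddress` module. The two address ranges are placeholders taken from the address block reserved for documentation, because the real subnets of the department are not given in the text; the name `organizational_group` is likewise only illustrative.

```python
import ipaddress

# Hypothetical ranges for illustration only; the actual subnets are not stated here.
GROUP_RANGES = {
    "staff": ipaddress.ip_network("192.0.2.0/25"),
    "student": ipaddress.ip_network("192.0.2.128/25"),
}


def organizational_group(remotehost):
    """Return "staff" or "student" if the address lies in a known range, else None."""
    try:
        address = ipaddress.ip_address(remotehost)
    except ValueError:
        return None                 # not an IP address (e.g. an unresolved host name)
    for group, network in GROUP_RANGES.items():
        if address in network:
            return group
    return None


print(organizational_group("192.0.2.10"), organizational_group("192.0.2.200"))
```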

Figure 5 shows the distributions of content-types for the user groups. The most popular group was the geographical “other” group, followed by the “nl” group (with proportions of 58,74% and 41,26% of all geographically labelled transactions). Not surprisingly, the organizational groups have a much lower visit rate than the geographical groups, since they contain far fewer users (the proportions of the “staff” and “student” groups are almost identical, 52,42% and 47,58%). For more details on figure 5 refer to the analysis of table 8 (section 6.2).

[Bar chart “Content-type distributions among user groups”: content-type identifiers (1 to 19) against visiting frequency (0 to 35 000), with one series each for the nl, other, staff and student groups.]

Figure 5: Distribution of content-types among user groups


6.3.2 Session identification

Two session identification methods were described earlier in this thesis. The following table shows statistics on time frame (TF) identified sessions for all user groups. The timeout parameter was set to the “standard” [13] length of 30 minutes.

User group session statistics for time frame identification

group name | total nr. of sessions | min length | avg length | max length | std. deviation of length
all users | 165 778 | 1 | 2.7 | 2 299 | 9.2
Geographical groups
nl | 39 671 | 1 | 3.39 | 275 | 7.29
other | 79 750 | 1 | 2.4 | 352 | 4.74
Subnet network ranges
staff | 2 795 | 1 | 5.5 | 193 | 11.41
student | 3 123 | 1 | 4.47 | 134 | 6.36

Table 12

The table above shows that users visit around 3-5 pages on average within a single session. The statistics also show that users within the VU tend to visit more pages per session than the average. The surprisingly large maximal length within “all users” looks like a spider transaction. However, the raw transaction data show no signs of spider activity: the user_agent field contains no spider pattern, and the requested pages do not show a systematic download. This may be because some spiders, for various reasons, pretend to be real users.

The table below contains statistics on maximal forward reference (MFR) identified sessions for all user groups. It shows that all groups contain many more sessions than with TF identification. This derives from the nature of MFR identification, which breaks the session if a page has previously occurred in it.

User group session statistics for maximal forward reference identification

group name | total nr. of sessions | min length | avg length | max length | std. deviation of length
all users | 247407 | 1 | 1.81 | 705 | 4.23
Geographical groups
nl | 63478 | 1 | 2.12 | 705 | 5.04
other | 115764 | 1 | 1.65 | 487 | 2.87
Subnet network ranges
staff | 6245 | 1 | 2.46 | 108 | 5.7
student | 5306 | 1 | 2.63 | 66 | 2.42

Table 13

In neither table do the geographical groups sum up to the total number of sessions of all users. This is because the database contains many IP addresses with missing domain names (host names cannot be looked up for them). Based on our observations we use time frame identified session data for the further experiments. TF identification seemed more realistic on the examined database entries, and most researchers also apply this method, for example [30]. All the following experiments are based on session data instead of raw log entries.
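The difference between the two identification methods can be made concrete with a small Python sketch. It rests on two assumptions that should be checked against the method descriptions earlier in the thesis: the TF timeout is applied to the gap between consecutive requests of one user, and MFR is reduced to the simple rule stated above (a new session starts as soon as a page recurs). The function names and the sample data are invented.

```python
from datetime import datetime, timedelta


def time_frame_sessions(visits, timeout=timedelta(minutes=30)):
    """Split one user's time-ordered (timestamp, page) visits at gaps longer than the timeout."""
    sessions = []
    for stamp, page in visits:
        if sessions and stamp - sessions[-1][-1][0] <= timeout:
            sessions[-1].append((stamp, page))
        else:
            sessions.append([(stamp, page)])
    return [[page for _, page in session] for session in sessions]


def mfr_sessions(pages):
    """Simplified MFR: close the current session when a page already in it is requested again."""
    sessions, current = [], []
    for page in pages:
        if page in current:
            sessions.append(current)
            current = []
        current.append(page)
    if current:
        sessions.append(current)
    return sessions


start = datetime(2004, 5, 30, 12, 0)
visits = [(start, "a"), (start + timedelta(minutes=10), "b"), (start + timedelta(minutes=50), "c")]
print(time_frame_sessions(visits))
print(mfr_sessions(["a", "b", "a", "c"]))
```

Because every repeated page closes a session, the simplified MFR rule produces more and shorter sessions than TF on the same data, which is the tendency visible in Tables 12 and 13.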


6.4 Mining frequent itemsets

This section provides information on frequent page sets and association rules for all the user groups. The AR implementation used in this project for data analysis is an Apriori-T (Apriori Total) algorithm, developed by the LUCS-KDD research team, which makes use of a "reverse" set enumeration tree where each level of the tree is defined in terms of an array (i.e. the T-tree data structure is a form of Trie)⁷ [12]. For further details on the implementation refer to APPENDIX D4. The support and confidence threshold values for the association rules mining algorithm were tuned to yield as many important patterns as possible while keeping the share of useless information low. The analysis of all sessions presents an overall picture of all the user sessions retrieved from the database. A more detailed characterisation follows in the analysis of the geographical and organizational groups.
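To make the notions of support and frequent itemset concrete, the following Python sketch mines itemsets with a plain level-wise Apriori search. It is deliberately not the Apriori-T algorithm with its T-tree used in the thesis, only a compact stand-in that produces the same kind of output. The preprocessing follows the description of the input data (duplicates removed, types sorted, trivial sessions dropped), where "trivial" is assumed to mean sessions with a single distinct content type. The function names and the sample sessions are invented.

```python
from collections import Counter
from itertools import combinations


def prepare(sessions):
    """Deduplicate and sort each session; drop sessions with a single distinct type."""
    cleaned = (tuple(sorted(set(session))) for session in sessions)
    return [session for session in cleaned if len(session) > 1]


def frequent_itemsets(sessions, min_support, max_size=3):
    """Return {itemset: support} for all itemsets up to max_size with support >= min_support."""
    n = len(sessions)
    result = {}
    for size in range(1, max_size + 1):
        counts = Counter(
            combo
            for session in sessions
            for combo in combinations(session, size)
            # Apriori pruning: every (size-1)-subset must already be frequent.
            if size == 1 or all(sub in result for sub in combinations(combo, size - 1))
        )
        level = {combo: k / n for combo, k in counts.items() if k / n >= min_support}
        if not level:
            break
        result.update(level)
    return result


# Invented sessions of content-type ids.
data = prepare([[8, 10, 8], [8, 10, 6], [8, 6], [13]])
for itemset, support in sorted(frequent_itemsets(data, min_support=0.5).items()):
    print(itemset, f"{support:.2%}")
```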

6.4.1 The analysis of all visits

The analysis of frequent itemsets within the sessions of all users gives an overall picture of user behaviour on the www.cs.vu.nl domain. Frequent one-itemsets with their supports are presented in the table below:

Frequent one-itemsets of all visits

nr | items (content-type labels and category names) | support
1 | (8) E/person/faculty | 51,10%
2 | (10) E/person/faculty/publication | 35,45%
3 | (2) miscellaneous | 32,81%
4 | (6) E/department | 18,87%
5 | (11) E/course | 16,89%
6 | (13) D/person/student | 16,54%
7 | (4) E/reference | 16,00%
8 | (18) documents | 15,56%
9 | (1) photo | 14,65%
10 | (7) E/project | 12,45%
11 | (19) other documents | 9,64%
12 | (3) D/department | 9,59%
13 | (12) D/course | 9,52%
14 | (9) E/person/student | 9,51%

Table 14

Item (1) shows that more than half of the sessions contain pages of faculty members (in English), and 35,45% include publication pages of faculty members (2). The high support of miscellaneous pages (3) does not indicate any special custom. It probably shows that a great proportion of the pages contain frames⁸. Department pages were used in 24,05% of the transactions, probably as starting points of user visits⁹. Course pages were visited in approximately 26% of the sessions (in this case the co-occurrence of the two categories is negligible, approximately 0,5%). English course pages were almost twice as popular as Dutch course pages. Dutch student pages are more popular than English ones. Sessions containing English or Dutch student pages account for 23,09%, based on the same calculation. Table 15 shows the selected frequent two-itemsets.

⁷ The input data for the algorithm contain sessions with redundant elements removed and types in ascending order. Trivial sessions that contain only one page are also stripped out.

Frequent two-itemsets of all visits

nr | items (content-type labels and category names) | support
1 | (10) E/person/faculty/publication, (8) E/person/faculty | 19,44%
2 | (8) E/person/faculty, (6) E/department | 12,69%
3 | (8) E/person/faculty, (2) miscellaneous | 11,72%
4 | (8) E/person/faculty, (1) photo | 9,19%
5 | (10) E/person/faculty/publication, (2) miscellaneous | 9,12%
6 | (18) documents, (8) E/person/faculty | 8,26%
7 | (11) E/course, (8) E/person/faculty | 8,13%
8 | (8) E/person/faculty, (4) E/reference | 7,99%
9 | (13) D/person/student, (2) miscellaneous | 7,83%
10 | (18) documents, (10) E/person/faculty/publication | 7,63%
11 | (10) E/person/faculty/publication, (6) E/department | 7,27%
12 | (10) E/person/faculty/publication, (4) E/reference | 6,96%
13 | (11) E/course, (10) E/person/faculty/publication | 6,81%
14 | (13) D/person/student, (8) E/person/faculty | 6,43%
15 | (8) E/person/faculty, (7) E/project | 6,17%
16 | (10) E/person/faculty/publication, (7) E/project | 5,16%
17 | (9) E/person/student, (8) E/person/faculty | 4,61%
18 | (6) E/department, (3) D/department | 4,41%
19 | (13) D/person/student, (1) photo | 4,08%
20 | (8) E/person/faculty, (3) D/department | 3,84%
21 | (13) D/person/student, (12) D/course | 3,70%
22 | (7) E/project, (6) E/department | 3,26%
23 | (19) other documents, (11) E/course | 3,23%
24 | (10) E/person/faculty/publication, (3) D/department | 3,07%
25 | (19) other documents, (10) E/person/faculty/publication | 2,97%
26 | (13) D/person/student, (9) E/person/student | 2,96%

Table 15
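The combined figures quoted before Table 15 (24,05% for department pages and 23,09% for student pages) follow from the inclusion-exclusion rule: the support of "A or B" is the support of A plus the support of B minus the support of their co-occurrence. The short Python check below recomputes them from the values in Tables 14 and 15; the function name `union_support` is only illustrative.

```python
def union_support(support_a, support_b, support_both):
    """Support (in percent) of sessions containing A or B, by inclusion-exclusion."""
    return support_a + support_b - support_both


# Supports in percent, taken from Table 14 (single types) and Table 15 (pairs).
department = union_support(18.87, 9.59, 4.41)   # E/department, D/department
student = union_support(16.54, 9.51, 2.96)      # D/person/student, E/person/student
print(round(department, 2), round(student, 2))
```

Both results agree with the percentages given in the text.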

We can set up some rough custom models based on the two-itemsets. 19,44% of the visits show interest in information on faculty members and their research. Itemsets don’t provide sequential information, but presumably visits belonging to (1) consist of an entry page for a faculty member and a subsequent publication page of that person. (2), (6), (10) and (24) may also belong to this “custom group”. (2) and (24) suggest that such visits start on the department pages and then go on to faculty member pages. Itemsets (6) and (10) show that many of the users download scientific material from the pages of faculty members. (8), (12), (15) and (16) show special interest in faculty member pages for project information and references. Itemsets that contain the miscellaneous type indicate that pages are probably structured in framesets, such as pages in content categories 8, 10 and 13 in itemsets (3), (5) and (9).

Itemset (4) can be interpreted as a primitive model for “free time” or “photo viewer” activities. It contains page visits to photo galleries of faculty members. These galleries mostly contain personal photos, such as travel images. (19) also relates to this custom group, with the difference that it contains student photo gallery pages.

(7), (13) and (23) form a “study” custom group. Many members of the scientific staff present all their professional information on a single web page. In this case the content classifier algorithm will probably choose the content-type that refers to the largest topic on the page. This presumably resulted in the strange combination of itemset (13). (7) and (13) basically indicate the same thing: they contain visits to course pages (in English) from faculty member pages. Most probably the member is the teacher of the course or is closely related to it. (23) shows that 3,23% of the visits result in the download of course materials.

⁸ Frame pages, as described earlier, mostly do not contain valuable information for the content classifier algorithm.
⁹ This result is the sum of the supports of the English and Dutch pages minus the support of their co-occurrence; refer to table 15 of two-itemsets.

Frequent three-itemsets of all visits

nr | items (content-type labels and category names) | support
1 | (10) E/person/faculty/publication, (8) E/person/faculty, (6) E/department | 4,96%
2 | (18) documents, (10) E/person/faculty/publication, (8) E/person/faculty | 4,21%
3 | (10) E/person/faculty/publication, (8) E/person/faculty, (7) E/project | 3,54%
4 | (11) E/course, (10) E/person/faculty/publication, (8) E/person/faculty | 3,05%
5 | (10) E/person/faculty/publication, (8) E/person/faculty, (4) E/reference | 2,68%

Table 16

The table above contains the frequent three-itemsets. Itemsets (1), (2), (3) and (5) form the previously described “faculty member” or “research” custom group. A possible scenario for a user visit based on these sets is that a user starts the visit on the department pages, then goes to a faculty member page and visits the member’s publication page. In the meantime he downloads materials from the member’s pages. He would also, with great probability, visit project or reference pages from faculty member pages. (4) represents the “study” custom group. Such visits start from a faculty member’s page or his publication page (which probably also has a mixed type of content) and end on course pages related to the member.

Association rules of all visits

nr | premise | conclusion | confidence
1 | (7) E/project, (10) E/person/faculty/publication | (8) E/person/faculty | 68.7%
2 | (6) E/department, (10) E/person/faculty/publication | (8) E/person/faculty | 68.22%
3 | (6) E/department | (8) E/person/faculty | 67.26%
4 | (7) E/project, (8) E/person/faculty | (10) E/person/faculty/publication | 57.39%
5 | (10) E/person/faculty/publication, (18) documents | (8) E/person/faculty | 55.1%
6 | (10) E/person/faculty/publication | (8) E/person/faculty | 54.82%
7 | (18) documents | (8) E/person/faculty | 53.04%
8 | (8) E/person/faculty, (18) documents | (10) E/person/faculty/publication | 50.94%

Table 17


Association rules provide more information on frequent itemsets. Table 17 contains the rules that have a confidence higher than 50%. Rule (1), for example, indicates that if a user visits project and publication pages, he will also visit faculty member pages with 68,7% confidence. All the rules in the table belong to the “research” custom group. This confirms the importance of this type of behaviour and indicates that it is the most significant among the visiting behaviour types.
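The confidence of a rule is the support of premise and conclusion together divided by the support of the premise alone. As a cross-check, the Python lines below recompute two of the rules in Table 17 from the supports published in Tables 14 and 15; the function name `confidence` is only illustrative, and small deviations are to be expected because the published supports are rounded.

```python
def confidence(support_rule, support_premise):
    """conf(X -> Y) = supp(X and Y together) / supp(X)."""
    return support_rule / support_premise


# Rule 6: (10) E/person/faculty/publication -> (8) E/person/faculty
rule6 = confidence(19.44, 35.45)
# Rule 3: (6) E/department -> (8) E/person/faculty
rule3 = confidence(12.69, 18.87)
print(f"{rule6:.2%} {rule3:.2%}")
```

Both values agree with the confidences listed in Table 17 (54.82% and 67.26%) to within a few hundredths of a percentage point.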

6.4.2 The analysis of the geographical groups

Table 18 shows the selected frequent one-itemsets of the “nl” and “other” geographical groups.

Frequent one-itemsets of the geographical groups

nr | items (content-type labels and category names) | support in “nl” group | support in “other” group
1 | (8) E/person/faculty | 44,57% | 53,34%
2 | (13) D/person/student | 36,13% | 5,94%
3 | (10) E/person/faculty/publication | 22,81% | 41,32%
4 | (1) photo | 20,17% | 11,58%
5 | (12) D/course | 19,86% | 3,84%
6 | (6) E/department | 17,90% | 19,81%
7 | (9) E/person/student | 12,69% | 7,51%
8 | (18) documents | 11,70% | 16,74%
9 | (7) E/project | 11,38% | 15,63%
10 | (11) E/course | 8,54% | 21,49%
11 | (4) E/reference | 8,35% | 18,63%

Table 18

The “research” behaviour type is significant in both categories, but the summed support values for itemsets (1), (3), (8), (9) and (11) show that research pages have an almost 50% higher visit rate within the “other” group. The summed support values for the content-categories of type 8, 10, 4, 18 and 7 are 145,66% for the “other” and 98,81% for the “nl” user group.

“Free time” behaviour is more frequent in the “nl” group, based on the support values for the student and photo categories. While the support of the photo category is 20,17% in “nl” visits, it is 11,58% in the “other” group. Student pages are also frequently visited within the “nl” group. The summed support for Dutch and English student pages is 42,92% (after subtracting their co-occurrence), while the same value in the “other” group is approximately 13,45% (their co-occurrence within this group is negligible).

Not surprisingly, the “study” custom group is also more frequent in the “nl” than in the “other” group. Dutch and English course pages have a summed support of 28,4% in the “nl” group, while the same value in the “other” group is 25,33%. For the “nl” group this indicates that many students probably study, and therefore visit course pages, from home. The “other” group contains very few Dutch course visits, which is the second most frequent category among the “nl” visits, but has a surprisingly large number of visits to English course pages. This indicates that English course pages contain useful information for foreign visitors.


Table 19 contains frequent two-itemsets of the geographical groups.

Frequent two-itemsets of the geographical groups

nr | items (content-type labels and category names) | support in “nl” group | support in “other” group
1 | (10) E/person/faculty/publication, (8) E/person/faculty | 14,17% | 22,11%
2 | (13) D/person/student, (8) E/person/faculty | 12,04% | 3,04%
3 | (8) E/person/faculty, (6) E/department | 9,42% | 14,91%
4 | (13) D/person/student, (1) photo | 8,59% | - *
5 | (9) E/person/student, (8) E/person/faculty | 8,55% | - *
6 | (8) E/person/faculty, (7) E/project | 8,43% | 5,31%
7 | (13) D/person/student, (12) D/course | 8,36% | - *
8 | (8) E/person/faculty, (1) photo | 8,24% | 9,50%
9 | (10) E/person/faculty/publication, (6) E/department | 7,02% | 7,37%
10 | (8) E/person/faculty, (4) E/reference | 6,17% | 7,85%
11 | (18) documents, (8) E/person/faculty | 5,91% | 9,15%
12 | (13) D/person/student, (9) E/person/student | 5,90% | - *
13 | (12) D/course, (8) E/person/faculty | 5,60% | - *
14 | (9) E/person/student, (1) photo | 4,24% | - *
15 | (10) E/person/faculty/publication, (4) E/reference | 4,17% | 8,18%
16 | (18) documents, (10) E/person/faculty/publication | 4,11% | 9,00%
17 | (11) E/course, (8) E/person/faculty | 3,56% | 10,43%
18 | (11) E/course, (10) E/person/faculty/publication | 3,33% | 8,43%

* Not supported by the set support threshold value.

Table 19

The table above shows that the “other” group indeed contains mostly “official” visits, such as itemsets (1), (3), (6), (9), (10), (11), (15) and (16). Visitors in this group most likely start on English department pages, go on to faculty member pages and from there navigate to the members’ publication pages. A large percentage of them also visit reference and project pages by following links from faculty members’ pages. Many users within this group download documents from faculty members. “Official” visits are also frequent in the “nl” group, but in contrast with the “other” group it also contains a great number of “study” visits. Itemsets (7), (13), (17) and (18) support the assumption that most of the “study” visits start on faculty member pages and then go on to course pages. It is interesting to note that Dutch and English pages do not mix within sessions. This is probably because the Vrije Universiteit provides bachelor’s and master’s degrees: the official language of bachelor education is Dutch, whereas most master’s courses are taught in English. The “other” group also contains a large number of English course pages visited from faculty members’ pages. Such visits may be generated by interested teachers and students from abroad. Itemsets (4), (8), (12) and (14) show “free time” visits. (4) and (14) contain visits to Dutch and English student pages followed by visits to the students’ photo pages via links from those pages. (8) contains the same type of sessions for faculty member pages. (2), (5) and (7) indicate a mixed activity of “free time” and other behaviour types.


Frequent three-itemsets of the geographical groups are presented in table 20.

Frequent three-itemsets of the geographical groups (support per group)

 #  items (content-type labels and category names)                             “nl” group    “other” group
 1  (10) E/person/faculty/publication, (8) E/person/faculty, (6) E/department  4,49%         5,13%
 2  (10) E/person/faculty/publication, (8) E/person/faculty, (7) E/project     4,41%         3,25%
 3  (13) D/person/student, (8) E/person/faculty, (1) photo                     4,41%         - *
 4  (13) D/person/student, (9) E/person/student, (8) E/person/faculty          4,37%         - *
 5  (13) D/person/student, (9) E/person/student, (1) photo                     3,68%         - *
 6  (10) E/person/faculty/publication, (8) E/person/faculty, (4) E/reference   3,30%         - *
 7  (18) documents, (10) E/person/faculty/publication, (8) E/person/faculty    2,63%         4,93%
 8  (11) E/course, (10) E/person/faculty/publication, (8) E/person/faculty     - *           3,59%

* Not supported by the set support threshold value.

Table 20

Itemset (1) shows the “classic” research visit: users start on department pages, navigate to faculty members’ pages and then to the members’ publication pages. (2), (6), (7) and probably (8) also belong to the “research” custom. (3), (4) and (5) present mostly “free time” visits. The “study” behaviour type is missing from the three-itemsets. A possible explanation is that students of the University know the exact URLs of study pages and go there directly instead of starting from department pages and following the links.

6.4.3 The analysis of the organizational groups

Table 21 shows the frequent one-itemsets of the “staff” and “student” organizational groups.

Frequent one-itemsets of the organizational groups (support per group)

 #  items (content-type labels and category names)  “staff” group   “student” group
 1  (8) E/person/faculty                            68,36%          59,41%
 2  (10) E/person/faculty/publication               31,75%          36,88%
 3  (3) D/department                                31,23%          16,31%
 4  (6) E/department                                26,58%          18,61%
 5  (13) D/person/student                           22,13%          16,38%
 6  (1) photo                                       16,75%          5,88%
 7  (18) documents                                  15,62%          18,40%
 8  (12) D/course                                   13,34%          16,59%
 9  (7) E/project                                   12,82%          41,50%
10  (4) E/reference                                 7,65%           19,03%
11  (9) E/person/student                            6,83%           6,16%
12  (11) E/course                                   - *             10,36%

* Not supported by the set support threshold value.

Table 21


One would expect larger differences between the support values of the “staff” and “student” organizational groups than those presented in table 21. One might think that categories like student, photo and course pages are visited at a significantly higher rate in the “student” than in the “staff” group. The opposite can be observed for itemsets (5), (6) and (11). This shows that, for some reason, teachers are more interested in student and photo pages than students are. A possible explanation for this phenomenon is that Ph.D. students within the “staff” group visit the pages of their fellow students. The table also shows that both groups are interested in “research” pages and that members of the “staff” group do not visit the English course pages often enough to reach the support threshold. (9) and (10) show that students are much more interested in project and reference pages than teachers. Two- and three-itemsets will probably provide more information on the aforementioned discrepancies. Frequent two-itemsets are presented in the table below.

Frequent two-itemsets for the organizational groups (support per group)

 #  items (content-type labels and category names)           “staff” group   “student” group
 1  (10) E/person/faculty/publication, (8) E/person/faculty  22,34%          27,08%
 2  (8) E/person/faculty, (3) D/department                   19,23%          5,95%
 3  (8) E/person/faculty, (6) E/department                   14,99%          9,45%
 4  (13) D/person/student, (8) E/person/faculty              14,89%          5,11%
 5  (8) E/person/faculty, (1) photo                          10,55%          - *
 6  (18) documents, (8) E/person/faculty                     10,55%          7,07%
 7  (10) E/person/faculty/publication, (6) E/department      9,72%           7,84%
 8  (8) E/person/faculty, (7) E/project                      8,79%           37,16%
 9  (8) E/person/faculty, (4) E/reference                    5,89%           18,05%
10  (7) E/project, (4) E/reference                           - *             15,75%

* Not supported by the set support threshold value.

Table 22

Itemsets (2) and (3) show that the “research” behaviour type is more common within the “staff” group than within the “student” group. Teachers may look for contact information (e.g., telephone number, email address) of their colleagues via faculty member pages. Interestingly, photo galleries of faculty members (5) are only popular among teachers. (8) and (9) indicate that project and reference pages mostly contain study material for students. Project and reference pages probably cover useful information for course assignments that students have to do in groups. This would explain why a great proportion of students use the University’s infrastructure to visit such pages.


Table 23 contains frequent three-itemsets of the organizational groups:

Frequent three-itemsets of the organizational groups (support per group)

 #  items (content-type labels and category names)                                  “staff” group   “student” group
 1  (8) E/person/faculty, (6) E/department, (3) D/department                        6,83%           2,52%
 2  (10) E/person/faculty/publication, (8) E/person/faculty, (6) E/department       6,10%           4,48%
 3  (13) D/person/student, (8) E/person/faculty, (10) E/person/faculty/publication  5,79%           - *
 4  (10) E/person/faculty/publication, (8) E/person/faculty, (3) D/department       5,17%           - *
 5  (10) E/person/faculty/publication, (6) E/department, (3) D/department           5,07%           - *
 6  (10) E/person/faculty/publication, (8) E/person/faculty, (7) E/project          4,45%           18,61%
 7  (18) documents, (10) E/person/faculty/publication, (8) E/person/faculty         3,72%           3,43%
 8  (8) E/person/faculty, (7) E/project, (4) E/reference                            - *             15,68%
 9  (10) E/person/faculty/publication, (8) E/person/faculty, (4) E/reference        - *             12,32%
10  (10) E/person/faculty/publication, (7) E/project, (4) E/reference               - *             11,41%
11  (12) D/course, (8) E/person/faculty, (6) E/department                           - *             2,52%

* Not supported by the set support threshold value.

Table 23

Itemsets (2) and (4) show the “classic” “research” visit. Teachers tend to visit faculty pages in a sequence of department, faculty member and faculty member’s publication pages. Itemsets (1) and (5) indicate that in most cases users change the language of department pages within their sessions. The “study” behaviour type is popular among student visits. These visits consist of pages from the faculty member, publication and project or reference categories, as in itemsets (6), (8), (9) and (10).

6.4.4 Conclusion

The study of frequent itemsets indicated that the most significant behaviour type is “research” in almost every user group. Research visits make up more than 50% of all sessions. Among the geographical groups, “other” also has more than 50% support for research pages, while among the organizational groups “staff” has the higher visit rate for this type of page. The “study” behaviour has a relatively low base among all visits. However, the geographical “nl” and the organizational “student” groups have a high visit rate for the “study” custom. The “free time” behaviour type has a base of approximately 20% within all sessions. This high visit rate is also typical of the “nl” geographical group but is not significantly apparent among sessions belonging to the organizational groups. However, the “staff” group has a relatively large visit rate for photo galleries of faculty members.


6.5 The mixture model

In the previous section we drew the basic inferences from the frequent itemsets. In this section we try to refine the established behaviour characteristics with an analysis of mixture models (MM) for each session group. The mixture model implementation used for modelling session data was developed in a different project [17]. A mixture model can be viewed as a clustering of all the users. Each cluster is characterized by a vector of frequencies (thetas) with which members of the cluster visit specific pages. These frequencies can be visualized in a bar chart, a kind of “group profile”. Additionally, the parameter alpha can be interpreted as the cluster size. To interpret the charts in the following sections it is necessary to look at the legend (refer to table 6: Description of the content-types in section 6.1.2 Mapping table). In our experiments we ran the MM algorithm with 10 different settings for the number of mixture components (from 1 up to 10 components) to build models on each data set. We set the algorithm to repeat the model building process 10 times for each component setting and to choose the most probable model for each setting. We use log-probability scores (“logp scores”) to evaluate the predictive power of the models. Logp scores are calculated with the formula of Notion 5.5 (in chapter 5), transformed to the logarithm of the expression. A higher logp score means that the model is evaluated to be more probable on the data set. In most cases we include only the figure of the most probable mixture model in the thesis.
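The procedure described above (alphas as cluster sizes, thetas as per-cluster page frequencies, repeated model building and selection by log-probability) can be sketched as follows. This is a minimal illustration and not the implementation from [17], which is not reproduced in this thesis. It assumes a mixture of multinomials over content-type counts fitted with the EM algorithm; the function names, the smoothing constant and the toy data are hypothetical.

```python
import math
import random


def e_step(counts, alphas, thetas):
    """Return (log-likelihood, responsibilities) under the current parameters.

    The multinomial coefficient is dropped: it does not depend on the model.
    """
    loglik, resp = 0.0, []
    for x in counts:
        logs = [math.log(a) + sum(c * math.log(t[i]) for i, c in enumerate(x) if c)
                for a, t in zip(alphas, thetas)]
        top = max(logs)
        norm = top + math.log(sum(math.exp(v - top) for v in logs))  # log-sum-exp
        loglik += norm
        resp.append([math.exp(v - norm) for v in logs])
    return loglik, resp


def fit_mixture(counts, k, iters=50, seed=0, eps=1e-3):
    """Fit k components with EM; returns (alphas, thetas, log-likelihood)."""
    rng = random.Random(seed)
    n, d = len(counts), len(counts[0])
    alphas = [1.0 / k] * k
    thetas = []
    for _ in range(k):  # random starting profiles
        row = [rng.random() + 0.1 for _ in range(d)]
        thetas.append([v / sum(row) for v in row])
    for _ in range(iters):
        _, resp = e_step(counts, alphas, thetas)
        for j in range(k):  # M-step; eps is a small smoothing term (assumption)
            weight = sum(r[j] for r in resp)
            alphas[j] = (weight + eps) / (n + k * eps)
            row = [eps + sum(r[j] * x[i] for r, x in zip(resp, counts))
                   for i in range(d)]
            thetas[j] = [v / sum(row) for v in row]
    loglik, _ = e_step(counts, alphas, thetas)
    return alphas, thetas, loglik


def best_of_restarts(counts, k, restarts=10):
    """Repeat the model building and keep the most probable model."""
    return max((fit_mixture(counts, k, seed=s) for s in range(restarts)),
               key=lambda model: model[2])


# Toy per-session counts over three content types (hypothetical data).
counts = [[3, 0, 0], [2, 1, 0], [4, 0, 0], [0, 0, 3], [0, 1, 2], [0, 0, 4]]
best1 = best_of_restarts(counts, 1)
best2 = best_of_restarts(counts, 2)
```

On this toy data the two-component model separates the sessions into two profiles and reaches a higher log-likelihood than the trivial one-component model, which mirrors the comparison of component settings by logp scores.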

6.5.1 The analysis of all visits

The figure below presents the logp scores of all 10 mixture component settings.

[Plot: LogLikelihood(#iterations, #clusters). x-axis: Number of iterations (1 to 10); y-axis: LogL (×10^5, from -10.5 to -7.5); legend: 1 to 10.]

Figure 6: Logp scores of the 10 mixture component settings of all visits


The trivial one-component mixture model shows only data statistics similar to the frequent one-itemsets, which we already discussed in the previous section. We chose the maximal number of mixture components heuristically. The logp scores of models with more than 6 or 7 components tend to have the same characteristics and are “close” to each other. Therefore we chose the most probable model by analysing the models with 2 to 7 components.

[Bar charts of the two components (x-axis ticks 0 to 20). Component sizes: α = 0.58, 0.42. LogL = -952781.3177.]

Figure 7: Two-component mixture model of all visits

The histograms in the mixture model above present clusters of similar users. The alphas are the cluster sizes and the heights of the histogram bars present the “interests” of the members of these clusters. The first component of figure 7 shows the “research” and “study” behaviour types and the second presents a mixture of “free time” and “study” activities. These mixtures within the base components indicate that the number of components is probably higher than two. The analysis of all figures resulted in choosing the model with six mixture components as the most probable (figure 8). The first base component refers to a “research” behaviour that has a very high (0,27) probability. The second component also has a high probability and shows a “study” behaviour type with visits to faculty member, publication, reference and course pages and downloads of course materials. The third component refers to the “student page visit” custom with visits to English and Dutch student pages. It also has some visits to Dutch course and faculty member pages. Component number four is present in almost all mixture models of all visits regardless of the number of components (above two components) and presents no interpretable information, given that the miscellaneous category mostly refers to frameset or empty pages. Component five refers to a “determined research downloads” behaviour, which means that users know the exact URL of the material they want to download. The last component presents a “free time” visit model in which users visit photo galleries. The visited photos belong mostly to faculty members.


[Bar charts of the six components (x-axis ticks 0 to 20). Component sizes: α = 0.27, 0.24, 0.21, 0.15, 0.1, 0.031. LogL = -830947.0322.]

Figure 8: Six-component mixture model of all visits

6.5.2 The analysis of the geographical groups

We also chose six as the most probable number of components for the geographical groups. The first component of the mixture model of the “nl” geographical group in figure 9 presents the “free time” behaviour, with visits to Dutch student pages, at a probability of 0.28. The second most probable component (nr. 2) refers to “study” visits that probably start on Dutch department pages, go on to Dutch course pages and finally download course materials. The “research” component (nr. 3) also has a high probability and refers to the “classical” sequence of department pages, faculty member pages and members’ publication pages. In component four miscellaneous pages are combined with department, faculty member and student pages. This could mean that the structure of these pages is based on frames, but no visiting characteristics can be observed. The “determined research downloads” habit is presented in component five and accounts for 7,6% of all sessions. The last component is a mixture of “free time” visits. It contains visits to faculty members’ and students’ photo pages as well as to activity pages.


[Bar charts of the six components (x-axis ticks 0 to 20). Component sizes: α = 0.28, 0.22, 0.21, 0.16, 0.076, 0.054. LogL = -240571.6526.]

Figure 9: Six-component mixture model of the geographical “nl” group

The probability of the “research” behaviour is much higher within the geographical “other” group than in the “nl” model. The first and second components of figure 10 together account for more than 50% of probability for research pages. The first component can also model the “study” custom for foreign students. Component three models the interest in student pages. This component also has a relatively large probability. “Determined research downloads” account for approximately 10% of the sessions in this group. The “photo viewing” habit is presented with a low probability in the last component.


[Bar charts of the six components (x-axis ticks 0 to 20). Component sizes: α = 0.27, 0.25, 0.17, 0.16, 0.1, 0.042. LogL = -354766.0415.]

Figure 10: Six-component mixture model of the geographical “other” group

6.5.3 The analysis of the organizational groups

Figure 11 shows the six-component mixture model of the organizational “staff” group. The high presence of English and Dutch department pages in the first component (from the top) without any other significant category may imply that the web browsers on staff members’ machines are set to show department pages as start pages. Component two shows a similar habit, with the difference that teachers may set their own home page as the start page, with a probability of 0,29. The third component shows interest in faculty member and research pages. Teachers may look for colleagues’ contact information. Component four refers to a “determined student page visits” behaviour and five to a “determined download” habit. The last component shows that photo pages are also visited by direct requests, but the “photo viewing” behaviour is not popular within this group. Most of the components do not contain department pages. This indicates that most of the users within this group know the URLs of the resources they need.


[Bar charts of the six components (x-axis ticks 0 to 20). Component sizes: α = 0.32, 0.29, 0.18, 0.089, 0.08, 0.039. LogL = -19836.6606.]

Figure 11: Six-component mixture model of the organizational “staff” group

Components one and five most probably refer to a “study” habit in the six-component mixture model of the “student” group (figure 12). The third component implies the classic “research” sequence. Component four represents the “free time” visit behaviour with visits to Dutch student pages. This group also contains the “determined download” habit, represented by component five. The last component also shows some kind of “free time” activity, with visits to activity pages and possible downloads of registration forms for free time events.


[Bar charts of the six components (x-axis ticks 0 to 20). Component sizes: α = 0.25, 0.23, 0.19, 0.15, 0.11, 0.068. LogL = -26581.5718.]

Figure 12: Six-component mixture model of the organizational “student” group

6.5.4 Conclusion

The results of the mixture model analysis show the same major characteristics as the results of frequent itemset mining. The “research” behaviour type is the most probable visit activity among all visits, followed by the “study” and “free time” habits. The geographical “nl” group contains more “free time” visits, while the “other” group has a higher visit rate for “research” pages. Sessions in the “staff” group are more likely to consist of research or start-up (department) pages, whilst the “student” group contains more visits of behaviour types like “study”, “research” and “free time”.


6.6 The global tree model

In contrast with the previous models, the global tree model (GTM) is based on the sequential information of sessions (in terms of consecutive page visits). The tree model provides frequent navigational paths and a tree-like visualization of relevant patterns. The analysis of all raw sessions of the user groups would result in large, plain and barely informative trees. Since we want to analyse “complex” user navigational paths, we strip out one-length sessions. One-length sessions are generated mostly by users following links from search result pages, by start page settings of web clients and by direct visits. Either way, these items shift the overall characteristics of user behaviour. We also eliminate consecutive redundant elements within sessions (e.g., we analyse the sequence 6 8 10 instead of 6 8 8 10). This transformation gathers up all sessions with the same characteristics and preserves the ordering information. In the following experiments we insert only partial trees or trees referring to sessions with a relatively high support rate. Furthermore, we refer to the “total” trees for each group in APPENDIX C4 – C8. The CD-ROM contains additional tree visualization figures in high resolution.
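The two session transformations described above can be sketched in a few lines of Python. The function name is hypothetical, and the order of the two steps is an assumption of this sketch: consecutive repeats are collapsed first, so that a session such as 8 8 is also treated as a one-length session and removed.

```python
from itertools import groupby


def prepare_sessions(sessions):
    """Collapse consecutive repeats, then drop sessions of length one."""
    collapsed = ([label for label, _ in groupby(session)] for session in sessions)
    return [session for session in collapsed if len(session) > 1]


# Toy sessions of content-type labels (hypothetical data).
raw = [[6, 8, 8, 10], [8], [8, 8], [6, 8, 6]]
prepared = prepare_sessions(raw)  # [[6, 8, 10], [6, 8, 6]]
```

Only consecutive duplicates are removed, so a return to an earlier page type (6 8 6) keeps its ordering information.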

6.6.1 The analysis of all visits

Figure 13 shows the tree visualization of all visits with a support threshold of 3%:

Figure 13: The tree model of all visits (3% of support threshold)

This partial tree shows that “research” is the most important behaviour type among all visits. 29% of the sessions start with faculty member pages and go on to publication pages. Table 24 (and the figure in APPENDIX C4) shows that, surprisingly, only a relatively low number of sessions start on the department pages. Most of the users go directly to faculty members’ pages and browse the members’ publication pages from there. If a user starts from department pages, they mostly continue to faculty member pages, as shown in session type (8). It is interesting that 16% of the sessions that start on faculty members’ pages go on to the department pages. This is the opposite of what one might expect. A relatively large proportion of sessions start directly on publication pages. 17% of these sessions end with the download of documents, whilst 19% of them end with a visit to reference pages.
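Percentages such as “16% of the sessions that start on faculty members’ pages go on to the department pages” can be read off a tree of session prefixes. The sketch below is an illustration under an explicit assumption: it treats the tree as a prefix tree in which each node counts the sessions that share that prefix. The exact construction of the global tree model is given elsewhere in the thesis and is not restated here; all names and the toy sessions are hypothetical.

```python
from collections import defaultdict


def build_tree(sessions):
    """Count how many sessions share each prefix; keys are prefix tuples."""
    counts = defaultdict(int)
    for session in sessions:
        for end in range(1, len(session) + 1):
            counts[tuple(session[:end])] += 1
    return counts


def prune(counts, n_sessions, min_support):
    """Keep only the prefixes whose share of all sessions reaches min_support."""
    return {p: c for p, c in counts.items() if c / n_sessions >= min_support}


def continuation_share(counts, prefix, next_label):
    """Fraction of sessions starting with prefix that continue with next_label."""
    prefix = tuple(prefix)
    return counts.get(prefix + (next_label,), 0) / counts[prefix]


# Toy sessions of content-type labels (hypothetical data).
sessions = [[8, 10], [8, 10, 18], [8, 6], [6, 8], [8, 10]]
tree = build_tree(sessions)
frequent = prune(tree, len(sessions), min_support=0.4)
share = continuation_share(tree, (8,), 10)  # 3 of the 4 sessions starting on 8
```

Pruning by a support threshold corresponds to showing only a partial tree, while the continuation share is relative to the parent node rather than to all sessions.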

Frequent sessions of all visits by 1% of support threshold

 #  session                                                               frequency  percentage
 1  (8) E/person/faculty, (10) E/person/faculty/publication               1344       3%
 2  (2) miscellaneous, (7) E/project, (2) miscellaneous, (7) E/project,
    (2) miscellaneous, (7) E/project, (2) miscellaneous, (7) E/project    887        2%
 3  (8) E/person/faculty, (6) E/department                                705        2%
 4  (8) E/person/faculty, (4) E/reference                                 600        1%
 5  (8) E/person/faculty, (11) E/course                                   582        1%
 6  (8) E/person/faculty, (1) photo                                       571        1%
 7  (10) E/person/faculty/publication, (18) documents                     505        1%
 8  (6) E/department, (8) E/person/faculty                                478        1%
 9  (13) D/person/student, (2) miscellaneous, (13) D/person/student       454        1%
10  (8) E/person/faculty, (18) documents                                  444        1%
11  (10) E/person/faculty/publication, (4) E/reference                    439        1%
12  (11) E/course, (19) other documents                                   397        1%

Table 24

6.6.2 The analysis of the geographical groups

Figures 14 and 15, and the figures in APPENDIX C5 and C6, contain the most frequent navigational paths of the geographical groups. As we stated earlier, users within the “other” group tend to visit “research” pages more frequently than users within the “nl” group. 34% of their visits start on faculty member pages and the majority then go on to publication pages. Some of these visits end with the download of documents or a return to member pages. The “nl” group also contains a large proportion (18%) of “research” pages. However, the corresponding branch in the “nl” tree is a mixture of contents: it contains student and photo pages as well. 11% of the visitors in the “other” group use faculty member pages to reach E/course materials. No such behaviour can be observed in the “nl” group; on the contrary, the “nl” group does not contain E/course pages at all among its frequent navigational paths. 13% of the visitors of faculty member pages go on to see photo pages of the members. This proportion is twice as high as in the “nl” group. Visitors in the “other” group use the department pages more frequently. Most of these visits are likely to go on to the faculty members’ pages in both the “other” (55% of them) and the “nl” (26% of them) groups. In the “other” group, publication and project pages are also frequent destinations from the department pages. The “nl” group tends to have more “free time” visits than the “other” group. 19% of the visits in the “nl” group contain student pages and 20% of them include photo galleries. The “study” behaviour type does not appear as an individual (sub)branch of the “nl” tree, but study pages are spread around the tree.


Figure 14: The tree model of the “nl” group

Figure 15: The tree model of the “other” group

6.6.3 The analysis of the organizational groups

Figures 16 and 17, and the figures in APPENDIX C7 and C8, contain the most frequent navigational paths of the organizational groups. Visits of the “staff” users start mostly on faculty member pages. They then navigate to publication pages, download materials or simply go to department pages. The most relevant session structure for the “student” group starts on miscellaneous pages, then goes to faculty member pages followed by project pages and finally ends on either publication or reference pages. Reference, project, faculty member and publication pages are spread over the whole “student” tree, mixed with other components. Both the “staff” and “student” trees are mixtures of a kind. They do not contain clear user behaviour types, whereas the trees for the geographical groups and for all sessions do. The reason is probably that the organizational groups contain far fewer sessions.


Figure 16: The tree model of the “staff” group

Figure 17: The tree model of the “student” group

6.6.4 The similarity of tree models

Table 25 contains the similarity measures for all pairs of tree models of all groups. We “equalized” each pair of session data sets before measuring them. This means that we randomly stripped out sessions from the larger data set to make the number of sessions equal within each pair. The diagonal from the upper left to the lower right corner contains 100% similarity, since these cells compare each group with itself. The similarity matrix is symmetrical because the similarity measure is commutative.
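The “equalizing” step can be illustrated with the following Python sketch. It only covers the random down-sampling of the larger data set; the similarity measure itself is defined elsewhere in the thesis and is not reproduced here. The function name, the fixed random seed and the toy data are assumptions made for this example.

```python
import random


def equalize(sessions_a, sessions_b, seed=0):
    """Randomly strip sessions from the larger set so both have equal size."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    size = min(len(sessions_a), len(sessions_b))

    def shrink(sessions):
        if len(sessions) == size:
            return sessions
        return rng.sample(sessions, size)  # sample without replacement

    return shrink(sessions_a), shrink(sessions_b)


# Toy session sets of different sizes (hypothetical data).
large = [[i, i + 1] for i in range(10)]
small = [[0, 1], [1, 2], [2, 3]]
large_eq, small_eq = equalize(large, small)
```

Sampling without replacement keeps every retained session an original session of the larger group, so only the size of that group changes.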


Similarity measures for tree models of all user groups

                          group all       geog. – nl      geog. – other   org. – staff    org. – student
group all                 100%            40.36%          70.46%          20.29%          21.11%
geographical – nl         40.36%          100%            27.76%          20.08%          27.78%
geographical – other      70.46%          27.76%          100%            19.18%          16.59%
organizational – staff    20.29%          20.08%          19.18%          100%            23.75%
organizational – student  21.11%          27.78%          16.59%          23.75%          100%

Table 25

According to these figures the “other” group is the most similar to the “all sessions” group. This is not surprising, since the “other” group accounts for the largest part of all sessions. Comparing the “nl” and “other” groups results in a similarity of 27,76%, while comparing the “staff” and “student” groups yields a similarity of 23,75%.

6.6.5 Conclusion

The analysis of the tree models mostly confirms our preliminary assumptions (in the AR and MM sections; for the details refer to sections 6.4 and 6.5) about the ordering of pages in typical sessions. However, in some cases it turned out that the expected orders are not realistic. Most of the groups contain the subsequence of faculty member pages followed by department pages at a higher frequency than department pages followed by faculty member pages. One would expect the opposite.


7 Conclusion and future work

In our work we have presented a methodology for web usage mining. We discussed the preprocessing and enrichment of web server access log entries. Data enrichment integrates the content types of documents with the access log entries. In the next step the enriched data is structured into user navigational sequences. Using geographical and organizational data we set up groups of users and their related sessions. We presented three data mining models for exploring user behaviour within these groups. First, the association rules mining algorithm was used to explore frequent itemsets and rules on them. Second, the mixture model presented a clustering of users by similar collections of pages they visit. Third, the global tree model was proposed for mining frequent navigational paths while preserving the sequential information of page visits. Visualization of the tree models facilitates human perception and in this manner helped to identify the most important patterns. Finally we applied all the discussed techniques to the web site of the Computer Science Department of the Vrije Universiteit, The Netherlands (www.cs.vu.nl domain). Analysing the experimental results, we discovered three significant types of user behaviour: “research”, “study” and “free time”. Sessions belonging to the “research” behaviour type consist mostly of faculty members’ pages, their publication pages, reference and project pages. They include department pages for navigation and downloads of (scientific) documents. The “study” custom mostly refers to Dutch and English course pages but also contains reference and project pages in large numbers. The “free time” visits consist mostly of photo pages, activity pages and Dutch and English student pages. Other minor behaviour types are described within the analysis of the models.
In general the “research” custom is by far the most popular among all sessions and among most of the session groups. Study pages are not as popular as research pages but have a significant base within all sessions. The “free time” habit is the least popular among the behaviour types but it still has a relatively large support among all sessions. We categorized all the user sessions into four subcategories, two geographical and two organizational. The geographical categories are the “nl” and the “other” groups, where “nl” refers to sessions of users from the Netherlands and “other” consists of sessions of users from all other countries. The organizational categories are “staff” and “student”, referring to the sessions of the staff and student users of the university. The “research” custom is the most frequent among users of the “other” group. Approximately half of the sessions within this group relate to this custom. Among the organizational groups, “staff” has more research sessions. The “study” custom is the most frequent in the geographical “nl” and the organizational “student” groups. The “other” group also contains a large number of visits to English course pages. Free time visits are the most popular among “nl” sessions and have a significant base within “student” visits. The “staff” group also contains a significant proportion of sessions containing pages of faculty members’ photos. Surprisingly, department pages are infrequent among the starting pages of user sessions. This indicates that most of the users do not use department pages for page lookups. However, a popular scenario is to start the visiting sequence on faculty member pages and then go on to department pages to navigate to the next destination. Another conclusion is that course pages are not very popular among students. This may indicate that students visit course pages mostly from home. However, a significant proportion of students visit reference and project pages from the labs of the University. This may imply that they mostly use the facilities of the University for solving group assignments. It should be remarked that the Vrije Universiteit has a dedicated Intranet system (the Blackboard) for managing all the information about courses. Although most of the courses have informational pages within the VU-pages, this system may provide extra information for users. User visits to the Blackboard system were not tracked within this project. Another important pattern is that users from abroad tend to visit English course pages. These sessions may be generated either by students looking for course materials for their studies or by foreign teachers reading up on course information. The analysis covers only a short period (one month: June) and the observed patterns certainly change over time. This may explain some “extreme” patterns found by the data analysis. To avoid “periodical” patterns it would be interesting to perform the data analysis automatically, e.g., once a month. We tried to develop algorithms that are as accurate as possible, but there are some internal and external limitations that are influenced by or might influence the experimental results. Web logs of public web domains provide insufficient information on users. Some of the identified users and their sessions may contain incorrect data despite the heuristics applied in the identification processes. The accuracy can only be improved by using cookies or other external identification techniques (refer to section 6.3 Data structuring). The problem is completely solved in the analysis of an Intranet (login required) application because of the automatic user identification.
Another problem is the high number of mapping errors, which occur either because some requests refer to a deeper level in the structure of the HTML pages than the level that was set for downloading, or because the requested pages were removed in the meantime. The number of mapping errors can therefore partly be reduced by downloading the VU-pages to deeper levels. The exponential growth in the number of pages would, however, overload the content classifier algorithm. A much more sophisticated solution would be to build a separate content retrieval system in which every page of a website has at least one entry consisting of a URL, a content type and a timestamp. Each time a page changes, a new entry would record its new content type (in case it differs from the previous one). During the analysis of access log files, the timestamp of each log entry would be compared with the timestamps of the content entries and the suitable content label would be chosen.

The accuracy of content labelling is the most critical part of the whole process. The current average accuracy of 74% for content categories supports the reliability of the major user characteristics that were observed. However, increasing the accuracy would result in less “noise” in the data and even more reliable experiments.

The web usage mining system proposed in this thesis was built primarily for “static” analysis. This means that the access log files, the VU-pages and all other input data were evaluated “offline”, independently of the time at which they were generated. A potential improvement would be to process access log entries and attach labels to the requested documents “online”, at the time they are generated. This would allow us to analyse systems dynamically. I will continue my research within the DIANA project (http://www.cs.vu.nl/ci/DataMine/DIANA/), focusing on the real-time analysis of dynamic systems and on the development of new, adaptive algorithms for mining data streams.


Acknowledgements

I wish to express my sincere appreciation to Dr. Wojtek Kowalczyk for his assistance and insight throughout the development of this project. I would also like to express my sincere thanks to Dr. Elena Marchiori for her valuable advice and feedback. In addition, special thanks go to Dr. Frits Daalmans for technical and non-technical advice, support, and editing. I also thank Krisztián Balog for the fruitful cooperation during the project and for providing the content label data for the VU-pages. I thank Dr. Elisabeth Hornung, my Mom, for her advice and suggestions, and many thanks to my family for supporting me during my year in the Netherlands. I would also like to say a special thanks to my lovely girlfriend Maya for her patience and for pushing me over the finishing line. Finally, I would like to express my special thanks to the Vrije Universiteit Amsterdam for the opportunity to participate in the one-year International Master Program.


Bibliography

1. Agrawal, R., Imielinski, T., and Swami, A. (1993), Mining association rules between sets of items in large databases. In Proceedings of the ACM SIGMOD Conference on Management of Data, pages 207–216

2. Baglioni, M., Ferrara, U., Romei, A., Ruggieri, S., and Turini, F. (2003), Preprocessing and Mining Web Log Data for Web Personalization. 8th Italian Conf. on Artificial Intelligence vol. 2829 of LNCS, p.237-249

3. Balog, K. (2004), An Intelligent Support System for Developing Text Classifiers. MSc. Thesis, Vrije Universiteit of Amsterdam, The Netherlands

4. Cadez, I. V., Heckerman, D., Meek, C., Smyth, P., and White, S. (2003), Model-Based Clustering and Visualization of Navigation Patterns on a Web Site. Data Mining and Knowledge Discovery, vol.7 n.4, p.399-424

5. Cadez, I.V., Smyth, P., Ip, E., and Mannila, H. (2001), Predictive Profiles for Transaction Data using Finite Mixture Models. Technical Report No. 01–67, Information and Computer Science Department, University of California, Irvine

6. Chen, Z., Fu, A., and Tong, F. (2002), Optimal Algorithms for Finding User Access Sessions from Very Large Web Logs. Proceedings of the Sixth Pacific-Asia Conference on Knowledge Discovery and Data Mining, (PAKDD), Taipei

7. Chen, M., Park, J.S., and Yu, P.S. (1998), Efficient data mining for path traversal patterns. IEEE Transactions on Knowledge and Data Engineering, vol.10 n.2, p.209-221

8. Chevalier, K., Bothorel, C., and Corruble, V. (2003), Discovering rich navigation patterns on a web site. Proceedings of the 6th International Conference on Discovery Science Hokkaido University Conference Hall, Sapporo, Japan

9. Cho, Y.H., and Kim, J.K. (2004), Application of Web usage mining and product taxonomy to collaborative recommendations in e-commerce. Expert Systems with Applications vol.26, p.233–246

10. Cho, Y.H., Kim, J.K., Kim, S.H. (2002), A personalized recommender system based on web usage mining and decision tree induction. Expert Systems with Applications vol.23, p.329–342

11. ClickTracks. Retrieved February 12, 2004 from http://www.clicktracks.com/

12. Coenen, F. (2004), The LUCS-KDD Apriori-T Association Rule Mining Algorithm, http://www.cxc.liv.ac.uk/~frans/KDD/Software/Apriori_T/aprioriT.html, Department of Computer Science, The University of Liverpool, UK.

13. Cooley, R., Mobasher, B., and Srivastava, J. (1999), Data Preparation for Mining World Wide Web Browsing Patterns. Knowledge and Information Systems, vol.1(1), p.5-32

14. Hay B., Wets, G., and Vanhoof K. (2003), Segmentation of visiting patterns on websites using a sequence alignment method. Journal of Retailing and Consumer Services vol.10, p.145–153

15. Jacobs, N., Heylighen, A., and Blockeel, H. (2001), Dynamic Website Mining. Proceedings of the European Symposium on Intelligent Technologies, Hybrid Systems and their implementation on Smart Adaptive Systems, Tenerife (Spain)

16. Jenamani, M., Mohapatra, P.K.J., and Ghose, S. (2003), A stochastic model of e-customer behaviour. Electronic Commerce Research and Applications vol.2, p.81–94


17. Mixture model implementation within the DIANA project http://www.cs.vu.nl/ci/DataMine/DIANA/

18. Mobasher, B., Jain, N., Han, E., and Srivastava, J. (1996), Web Mining: Pattern discovery from World Wide Web transactions. Technical Report TR 96-050, University of Minnesota, Dept. of Computer Science, Minneapolis

19. Zaki, M. J. (2002), Efficiently Mining Frequent Trees in a Forest. SIGKDD ’02 Edmonton, Alberta, Canada

20. Nanopoulos, A., and Manolopoulos, Y. (2000), Finding Generalized Path Patterns for Web Log Data Mining. J. Stuller et al. (Eds.): ADBIS-DASFAA, LNCS 1884, p.215-228

21. Nanopoulos A., Manolopoulos Y. (2001), Mining patterns from graph traversals. Data and Knowledge Engineering No. 37, p.243-266

22. OneStat.com. Retrieved February 12, 2004 from http://www.onestat.com/

23. Pei, J., Han, J., Mortazavi-asl, B., and Zhu, H. (2000), Mining Access Patterns Efficiently from Web Logs. Proceedings of the 4th Pacific-Asia Conference on Knowledge Discovery and Data Mining, p.396-407

24. Punin, J.R., Krishnamoorthy, M.S., and Zaki, M.J. (2001), LOGML: Log Markup Language for Web Usage Mining. Proceedings in WEBKDD Workshop 2001: Mining Log Data Across All Customer TouchPoints (with SIGKDD01), San Francisco

25. Fielding R., Gettys, J., Frystyk, H., Masinter, L., Leach, P. and Berners-Lee, T. Hypertext Transfer Protocol - HTTP/1.1. Network Working Group. RFC 2616.

26. Berners-Lee, T., Fielding, R. and H. Frystyk. Hypertext Transfer Protocol - HTTP/1.0. Network Working Group. RFC 1945.

27. Runkler, T.A., and Bezdek, J.C. (2003), Web mining with relational clustering. International Journal of Approximate Reasoning vol.32, p.217–236

28. Smith, K.A., and Ng, A. (2003), Web page clustering using a self-organizing map of user navigation patterns. Decision Support Systems vol.35, p.245–256

29. Spider pattern lists were verified on the services listed below. Retrieved March 17, 2004 from http://www.robotstxt.org (list of well known robots - not up to date) and http://www.spy-bot.net/list_adware.asp (list of spiders), and using the http://www.google.com search engine for missing or uncertain spiders.

30. Xing, D., and Shen, J. (2004), Efficient data mining for web navigation patterns. Information and Software Technology vol.46, p.55–63

31. Yang, Q., Li T.I., and Wang K. (2003), Web-log Cleaning for Constructing Sequential Classifiers. Applied Artificial Intelligence vol. 17, iss. 5-6, p.431-441

32. Yao, Y., Hamilton, H.J., and Wang, X.W. (2000), PagePrompter: An Intelligent Agent for Web Navigation Created Using Data Mining Techniques. Technical report, Department of Computer Science, University of Regina Regina, Saskatchewan, Canada

33. Youssefi, A.H., Duke, D.J., Zaki, M.J., and Glinert, E.P. (2003), Towards Visual Web Mining. In Proceeding of Visual Data Mining at IEEE Intl Conference on Data Mining (ICDM), Florida

34. Luotonen, A. (1995), The Common Logfile Format, http://www.w3.org/pub/WWW/Daemon/User/Config/Logging.html

35. Extended Log File Format - W3C WD-logfile-960221, W3C Working Draft WD-logfile-960221. http://www.w3.org/TR/WD-logfile.html

36. GNU Software Foundation (1999), Wget. Available at http://www.gnu.org/software/wget/wget.html.

37. Webtrends. Retrieved February 12, 2004 from http://www.netiq.com/products/log


APPENDIX

APPENDIX A. The uniform resource locator (URL)

Uniform resource locators (URLs) identify resources on the World Wide Web. The syntax of an HTTP URL is

'http://' host.domain [':' port] [path ['?' query]]

where

- host.domain is the name of the web service (server)
- port is optional (default is 80)
- path is the absolute location of the requested resource on the server (consists of path + file name + extension with delimiter fields)
- query is a collection of parameters in case of dynamic pages
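As an illustration of the syntax above, the following minimal sketch splits an HTTP URL into host, port, path and query using the standard java.net.URI class. The class name UrlParts is hypothetical and is not part of the webmining package; the sample URL is the referrer example that appears in Appendix B.

```java
import java.net.URI;

// Hypothetical helper for illustration; not a class of the webmining package.
class UrlParts {
    final String host;
    final int port;
    final String path;
    final String query; // null when the URL has no query part

    UrlParts(String host, int port, String path, String query) {
        this.host = host;
        this.port = port;
        this.path = path;
        this.query = query;
    }

    static UrlParts parse(String url) {
        URI uri = URI.create(url);
        // URI.getPort() returns -1 when no port is given; HTTP then defaults to 80.
        int port = uri.getPort() == -1 ? 80 : uri.getPort();
        return new UrlParts(uri.getHost(), port, uri.getRawPath(), uri.getRawQuery());
    }

    public static void main(String[] args) {
        UrlParts p = parse("http://www.google.com.br/search?hl=pt-BR&ie=UTF-8&q=Andrew&meta=");
        System.out.println("host  = " + p.host);
        System.out.println("port  = " + p.port);
        System.out.println("path  = " + p.path);
        System.out.println("query = " + p.query);
    }
}
```

The raw (undecoded) path and query are returned so that the components can be compared character by character with the request field of a log entry.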

APPENDIX B. Input file structures

1. The structure of the properties file

The properties file contains the adjustable properties of the webmining package in the form of key/value pairs, one pair per line. The key and the value of a pair are delimited by the ‘=’ character. Supported properties are described in the table below:

Supported properties of the properties file

Database properties
JDBC_driver_name Name of the JDBC driver for database connection. e.g., com.mysql.jdbc.Driver
connection_name Name of the database connection. e.g., jdbc:mysql://localhost/test
user_name Name of the user for the specified database. e.g., TEST
user_password Password for the user. e.g., test
log_table_name Name of the access log table. e.g., cslog
log_users_table_name Name of the users table. e.g., users

Properties for data handling
access_log_path Path and file name of the (merged) access log file. e.g., c:\log.txt

Properties for transaction filtering
default_page_name Name of the default HTML page. e.g., index.html
accepted_extensions_list Path and file name of the extension list file.
spider_engines_list Path and file name of the spider list file.

Properties for session identification
time_frame_intervall Length of time frame (in minutes) for time frame identification. e.g., 30
group_selector_type Type of group selector. Possible values: all, subnets, country. They refer to all sessions (all), only sessions generated by a user specified by the file given by the network_range_file_name key (subnets) and sessions generated by a user specified by the file given by the country_list_file_name key (country).
network_range_file_name Path and file name for the file specifying a subnet group.
country_list_file_name Path and file name for the file specifying a country group.

Properties for data integration
mapping_table_path Path and file name for the content mapping table file.
generated_mapping_table_path Path and file name for the artificial content mapping table file to be generated.

Properties for geographical statistics
country_codes_file_name Path and file name for the country codes file containing the names and short names for most of the countries in the World.

Table 26: Supported properties of the properties file
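A file of this key/value shape can be read with the standard java.util.Properties class. The sketch below is only an illustration under that assumption: the text does not state how the webmining package actually reads its properties file, the class name PropertiesSketch is hypothetical, and the fallback value of 30 minutes simply reuses the example from the table.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper for illustration; not a class of the webmining package.
class PropertiesSketch {
    // Parses key=value lines; a real run would pass a FileReader instead of a String.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Reads time_frame_intervall; the default of 30 is an assumption for this sketch.
    static int timeFrameMinutes(Properties props) {
        return Integer.parseInt(props.getProperty("time_frame_intervall", "30").trim());
    }

    public static void main(String[] args) {
        String text = String.join("\n",
            "JDBC_driver_name=com.mysql.jdbc.Driver",
            "connection_name=jdbc:mysql://localhost/test",
            "log_table_name=cslog",
            "time_frame_intervall=30",
            "group_selector_type=all");
        Properties props = parse(text);
        System.out.println(props.getProperty("connection_name"));
        System.out.println(timeFrameMinutes(props));
    }
}
```

Note that java.util.Properties treats the backslash as an escape character, so with this class a Windows path such as c:\log.txt would have to be written as c:\\log.txt or c:/log.txt.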

2. The structure of access log files

The Apache web server of the www.cs.vu.nl domain uses the extended log file format [35] and writes the following fields into the log files, in order of appearance: remotehost, rfc931, authuser, date, request, status, bytes, referrer, user_agent. Log files can contain an arbitrary number of entries, each containing all the fields described above. Each line of the file should contain exactly one entry, and fields are separated by one or more white spaces. The syntax of an entry is the following:

Syntax of access log file
remotehost character string (e.g., 81.69.10.150)
rfc931 character string (e.g., -)
authuser character string (e.g., -)
date given in [dd/MMM/yyyy:HH:mm:ss Z] format (e.g., [30/May/2004:03:30:10 +0200])
request character string (e.g., "GET /~fbenmba/straatremixes/images/home.png HTTP/1.1")
status integer (e.g., 200)
bytes integer (e.g., 12193)
referrer character string (e.g., http://www.google.com.br/search?hl=pt-BR&ie=UTF-8&q=Andrew&meta=)
user_agent character string (e.g., Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1))

Table 27
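The entry syntax above can be parsed with a single regular expression. The following sketch is an illustration only and is not the LogParser class of the webmining package. It assumes that the request, referrer and user_agent fields are enclosed in double quotes, as in Apache's combined log format, and that a bytes value of "-" means zero bytes; the class name LogEntrySketch is hypothetical.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for illustration; not the LogParser of the webmining package.
class LogEntrySketch {
    // Nine fields in the documented order. Assumption: request, referrer and
    // user_agent are double-quoted, as in Apache's combined log format.
    private static final Pattern ENTRY = Pattern.compile(
        "^(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+\\[([^\\]]+)\\]\\s+\"([^\"]*)\"\\s+"
        + "(\\d{3})\\s+(\\d+|-)\\s+\"([^\"]*)\"\\s+\"([^\"]*)\"\\s*$");

    // Date pattern taken from the table: dd/MMM/yyyy:HH:mm:ss Z
    private static final DateTimeFormatter DATE =
        DateTimeFormatter.ofPattern("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH);

    final String remoteHost, request, referrer, userAgent;
    final OffsetDateTime date;
    final int status;
    final long bytes;

    private LogEntrySketch(Matcher m) {
        remoteHost = m.group(1);
        date = OffsetDateTime.parse(m.group(4), DATE);
        request = m.group(5);
        status = Integer.parseInt(m.group(6));
        bytes = m.group(7).equals("-") ? 0 : Long.parseLong(m.group(7)); // "-" treated as 0 (assumption)
        referrer = m.group(8);
        userAgent = m.group(9);
    }

    // Returns null when the line does not match the expected entry syntax.
    static LogEntrySketch parse(String line) {
        Matcher m = ENTRY.matcher(line);
        return m.matches() ? new LogEntrySketch(m) : null;
    }

    public static void main(String[] args) {
        String line = "81.69.10.150 - - [30/May/2004:03:30:10 +0200] "
            + "\"GET /~fbenmba/straatremixes/images/home.png HTTP/1.1\" 200 12193 "
            + "\"http://www.google.com.br/search?hl=pt-BR&ie=UTF-8&q=Andrew&meta=\" "
            + "\"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)\"";
        LogEntrySketch e = parse(line);
        System.out.println(e.remoteHost + " " + e.date + " " + e.status + " " + e.bytes);
    }
}
```

Lines that do not match are reported as null rather than raising an exception, so that a caller can count and skip malformed entries.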

3. The structure of the content types mapping table file

A content type mapping table contains URL/content type pairs. Three types of entries can occur in a mapping file:

Description of the file structure of the content types mapping table

number of content types

The first meaningful line should contain the number of distinct content types (n). Content types are considered as integers in the range 0 .. n (all numbers inclusive).

URL / content type pairs

After this line there can be an arbitrary number of URL/content type pair entries, where each line corresponds to one pair. The two parts of a pair should be delimited by a white space.

textual entries

The file may contain comment lines anywhere in the structure, provided that a hash mark (‘#’) is the first character of the line.

Table 28
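The file structure of Table 28 can be read as sketched below. This is an illustration and not the MappingTable class of the webmining package: the class name MappingTableSketch, the example URLs and the decision to skip blank lines are assumptions, while the comment rule, the leading count line and the 0 .. n range follow the description above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical reader for illustration; not the MappingTable of the webmining package.
class MappingTableSketch {
    final int contentTypeCount;
    final Map<String, Integer> typeByUrl = new HashMap<>();

    MappingTableSketch(List<String> lines) {
        int count = -1; // -1 means the count line has not been seen yet
        for (String raw : lines) {
            // Comment lines start with '#'; skipping blank lines is an assumption.
            if (raw.startsWith("#") || raw.trim().isEmpty()) {
                continue;
            }
            String line = raw.trim();
            if (count < 0) {
                count = Integer.parseInt(line); // first meaningful line: n
                continue;
            }
            String[] parts = line.split("\\s+");
            int type = Integer.parseInt(parts[1]);
            if (type < 0 || type > count) {
                throw new IllegalArgumentException("content type out of range: " + line);
            }
            typeByUrl.put(parts[0], type);
        }
        contentTypeCount = count;
    }

    public static void main(String[] args) {
        // The URLs below are made-up examples.
        MappingTableSketch table = new MappingTableSketch(Arrays.asList(
            "# content type mapping",
            "3",
            "http://www.cs.vu.nl/index.html 0",
            "http://www.cs.vu.nl/research/index.html 2"));
        System.out.println(table.contentTypeCount + " " + table.typeByUrl);
    }
}
```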


4. The structure of the extension filter list file

The extension filter list file contains extension entries used for filtering out file types that are not allowed. The structure of such a file is as follows:

Description of the file structure of the extension filter list file

extension entries

Each meaningful line of the file should contain a specific file extension (without any dots or special marks, e.g., ‘html’).


Table 29

5. The structure of the spider filter pattern list file

Spider filtering is based on the spider pattern list file, which contains recognizable patterns for known spiders. The file was created by pre-examining the log files of the cs.vu.nl web server and filtering out suspicious user agents and unusual patterns. These patterns were checked against the pages of spider list providers such as [29]. The file contains spider pattern entries, each on a separate line. The structure of such a file is as follows:

Description of the file structure of the spider pattern list file

spider pattern entries

Each meaningful line of the file should contain a specific spider pattern.


Table 30
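The two filter lists described in items 4 and 5 could be applied as sketched below. The matching rules are assumptions made for this illustration, since the text does not specify them: a spider pattern is taken to match when it occurs as a case-insensitive substring of the user agent, and an extension is taken from the text after the last dot of the last path segment. The class name TransactionFilterSketch is hypothetical and is not the TransactionFilter class of the webmining package; the Wget user agent string is a made-up example.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical filter for illustration; not the TransactionFilter of the webmining package.
class TransactionFilterSketch {
    // Assumption: a pattern matches as a case-insensitive substring of the user agent.
    static boolean isSpider(String userAgent, Collection<String> patterns) {
        String ua = userAgent.toUpperCase(Locale.ROOT);
        for (String pattern : patterns) {
            if (ua.contains(pattern.toUpperCase(Locale.ROOT))) {
                return true;
            }
        }
        return false;
    }

    // Expects a request path without a query part. Requests without an extension
    // are rejected here; handling them via default_page_name is not shown.
    static boolean hasAcceptedExtension(String path, Set<String> accepted) {
        int slash = path.lastIndexOf('/');
        int dot = path.lastIndexOf('.');
        if (dot <= slash) {
            return false; // no dot in the last path segment
        }
        return accepted.contains(path.substring(dot + 1).toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        Set<String> accepted = new HashSet<>(Arrays.asList("html", "htm", "php"));
        Collection<String> spiders = Arrays.asList("BOT", "WGET", "SLURP");
        System.out.println(isSpider("Wget/1.9.1", spiders));
        System.out.println(isSpider("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)", spiders));
        System.out.println(hasAcceptedExtension("/~fbenmba/straatremixes/images/home.png", accepted));
        System.out.println(hasAcceptedExtension("/index.html", accepted));
    }
}
```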

APPENDIX C. Experimental details

1. Spider pattern list and rank

Table 31 contains spider patterns ranked by frequency counts during the www.cs.vu.nl access log file analysis.

Spider pattern list and rank
rank pattern frequency
1 NET CLR 1228717
2 BOT 576740
3 WGET 220265
4 FUNWEBPRODUCTS 197637
5 DIGEXT 194020
6 CRAWLER 101244
7 SLURP 90338
8 HOTBAR 48969
9 JEEVES 48511
10 HTTRACK 21432
11 IA_ARCHIVER 17058
12 GRUB-CLIENT 9599
13 YCOMP 7246
14 LIBWWW-PERL 4476
15 APPIE 3705
16 AVSEARCH 2563
17 SPIDER 1432
18 TELEPORT 1316
19 WEBCOPIER 794
20 WEBCOLLAGE 528
21 DVD OWNER 524
22 FREESURF 490
23 LYCOS 467
24 ROADRUNNER 461
25 YAHOO 389
26 SCOOTER 194
27 FLASHGET 187
28 INFOSEEK 87
29 WEBSEARCH 29
30 PITA 23
31 T-H-U-N-D-E-R-S-T-O-N-E 14
32 FLUFFY 4
33 NETNEWSWIRE 3
34 WEBDUP 2
35 WEBVAC 0
36 VIAS 0
37 ZYBORG 0
38 TEOMAAGENT 0
39 GULLIVER 0
40 ARCHITEXT 0
41 MERCATOR 0
42 ULTRASEEK 0
43 MANTRAAGENT 0
44 MOGET 0
45 MUSCATFERRET 0
46 SLEEK 0
47 KIT_FIREBALL 0

Table 31: Spider pattern list and rank

2. Extension list and rank

Table 32 contains extensions ranked by frequency counts during the analysis of the www.cs.vu.nl access log files. We list here the 100 most frequent extensions, leaving out some unknown and infrequent items.

Extension list and frequency
rank extension freq.
1 gif 1919367
2 jpg 1785872
3 html 1705041
4 js 353113
5 png 199905
6 pdf 138413
7 css 127857
8 php 111405
9 htm 105481
10 ico 58851
11 pac 50358
12 txt 48777
13 ps 45563
14 mp3 31337
15 php3 27459
16 gz 25263
17 zip 19123
18 1 13988
19 doc 11046
20 wrl 9831
21 bmp 9283
22 3 8254
23 ppt 7931
24 2 7851
25 taz 6833
26 tar 6772
27 class 5822
28 tgz 4860
29 z 4494
30 shtml 4470
31 swf 3945
32 misc 3699
33 jpeg 3584
34 xml 3471
35 pl 3334
36 asp 3333
37 wma 3265
38 mid 3211
39 eps 3154
40 java 2973
41 tex 2839
42 c 2485
43 rdf 2373
44 jar 2336
45 8 2332
46 dcr 2328
47 5 2108
48 cgi 2017
49 9 2011
50 wmv 2009
51 xbm 1943
52 mnx 1935
53 hs 1909
54 pas 1893
55 announce 1878
56 wav 1858
57 4 1575
58 xls 1516
59 exe 1373
60 tab 1332
61 h 1189
62 spf 1160
63 xhtml 1083
64 dvi 1058
65 1 1013
66 imp 845
67 bib 830
68 xsl 793
69 stdout 788
70 rss 770
71 bak 767
72 idx 710
73 smi 699
74 m 696
75 dtd 652
76 pn 592
77 avi 587
78 cur 570
79 nl 530
80 readme 493
81 wmz 484
82 old 428
83 fst 424
84 cpp 420
85 mpg 362
86 log 344
87 ref 338
88 owl 331
89 rtf 304
90 au 282
91 com 233
92 pps 205
93 fla 183
94 sgml 171
95 aux 166
96 ram 146
97 srt 132
98 mpeg 128
99 rar 122
100 bat 119

Table 32: Extension list and frequency

3. Geographical distribution of users visiting www.cs.vu.nl

The table below contains the geographical distribution of users visiting www.cs.vu.nl during the observed period.

Geographical distribution of users
rank TLD count country
1 nl 19248 netherlands
2 net 18299 network infrastructure
3 com 11457 commercial
4 fr 3125 france
5 be 3058 belgium
6 de 3001 germany
7 ca 2133 canada
8 it 2038 italy
9 uk 1903 united kingdom
10 au 1852 australia
11 edu 1803 educational establishments (primarily us)
12 jp 1532 japan
13 br 1485 brazil
14 ch 963 switzerland
15 mx 935 mexico
16 pl 878 poland
17 at 635 austria
18 fi 610 finland
19 dk 553 denmark
20 se 531 sweden
21 ar 498 argentina
22 es 491 spain
23 gr 471 greece
24 hu 443 hungary
25 no 393 norway
26 us 374 united states
27 org 352 other organizations not clearly falling within the other gtlds
28 nz 341 new zealand
29 il 313 israel
30 ru 302 russian federation
31 pt 301 portugal
32 sg 275 singapore
33 mil 273 us military
34 cz 261 czech republic
35 gov 235 us government
36 tr 233 turkey
37 cl 228 chile
38 tw 226 taiwan, province of china
39 ro 210 romania
40 in 151 india
41 hr 128 croatia
42 sk 116 slovakia
43 za 114 south africa
44 ma 104 morocco
45 lt 102 lithuania
46 hk 101 hong kong
47 uy 94 uruguay
48 ie 93 ireland
49 th 90 thailand
50 ee 86 estonia
51 sa 83 saudi arabia
52 co 79 colombia
53 do 79 dominican republic
54 my 67 malaysia
55 id 65 indonesia
56 kr 60 korea, republic of
57 ua 54 ukraine
58 si 54 slovenia
59 yu 49 yugoslavia (now serbia and montenegro, iso code has changed to cs)
60 ph 49 philippines
61 is 47 iceland
62 cy 38 cyprus
63 bg 34 bulgaria
64 ve 32 venezuela
65 lu 28 luxembourg
66 mu 27 mauritius
67 int 26 null
68 cn 24 china
69 tt 22 trinidad and tobago
70 lv 19 latvia
71 py 19 paraguay
72 ec 19 ecuador
73 cr 16 costa rica
74 np 14 nepal
75 pk 13 pakistan
76 pe 12 peru
77 lb 12 lebanon
78 md 11 moldova, republic of
79 nu 10 niue
80 by 9 belarus
81 fj 9 fiji
82 ni 9 nicaragua
83 ke 9 kenya
84 aw 8 aruba
85 mz 7 mozambique
86 mt 6 malta
87 jo 6 jordan
88 bn 5 brunei darussalam
89 bw 5 botswana
90 arpa 5 address and routing parameter area
91 cu 5 cuba
92 qa 5 qatar
93 na 5 namibia
94 zw 4 zimbabwe
95 aero 4 null
96 kh 4 cambodia
97 bm 4 bermuda
98 su 4 null
99 mk 3 macedonia, the former yugoslav republic of
100 kz 3 kazakhstan
101 fo 3 faroe islands
102 ir 3 iran, islamic republic of
103 tz 3 tanzania, united republic of
104 tv 3 tuvalu
105 to 3 tonga
106 tg 2 togo
107 biz 2 null
108 sv 2 el salvador
109 al 2 albania
110 uz 2 uzbekistan
111 ad 2 andorra
112 lk 2 sri lanka
113 om 2 oman
114 gl 2 greenland
115 jm 2 jamaica
116 cc 2 cocos (keeling) islands
117 mg 2 madagascar
118 sr 2 suriname
119 ba 1 bosnia and herzegovina
120 cx 1 christmas island
121 nc 1 new caledonia
122 am 1 armenia
123 sz 1 swaziland
124 pa 1 panama
125 vn 1 viet nam
126 ls 1 lesotho
127 ge 1 georgia
128 ae 1 united arab emirates
129 pg 1 papua new guinea
130 rw 1 rwanda
131 bs 1 bahamas
132 ao 1 angola
133 ky 1 cayman islands
134 sm 1 san marino
135 bt 1 bhutan
136 ug 1 uganda
137 st 1 sao tome and principe
138 zm 1 zambia
139 az 1 azerbaijan

Table 33: Geographical distribution of users


4. Global tree model of “all visits” by s = 1,3 support threshold

Figure 18


5. Global tree model of “nl” group by s = 1,0 support threshold

Figure 19


6. Global tree model of “other” group by s = 1,5 support threshold

Figure 20


7. Global tree model of “staff” group by s = 1,0 support threshold

Figure 21


8. Global tree model of “student” group by s=0,8 support threshold

Figure 22


APPENDIX D. Implementation details

All the algorithms required for the tasks described in the thesis were implemented in the Java language. We used a MySQL database server for data storage and retrieval. Details on the implementation and the database are listed in the table below.

Technical details

Implementation
language Java
package name webmining
note the package is database independent

Database
name MySQL
version 4.0.18
note MySQL does not support stored procedures before version 5.0 (which, while this thesis was written, was only in beta stage and as such unstable). This makes data processing a bit more difficult and less efficient, because some processing steps would ideally run directly inside the database. However, all the tasks could be carried out with adequate efficiency.

Table 34

Our webmining package contains six major subpackages: datahandling, dataintegration, sessionidentification, patterndiscovery, stats and visualization. All the main classes belonging to these packages are listed below with brief descriptions.

1. Data preparation (cleaning, filtering, loading) – webmining.datahandling package

Main objects of the webmining.datahandling package
DatabaseConnection Handles database connection (based on the properties file).
HostNameLookup Provides methods for IP address – domain name lookup.
LoadLog The main object which manages the cleaning and loading process.
Log2Database Loads the prepared transactions into the database.
LogParser The parser object which parses the input raw log file into useful Transactions.
Transaction This object stores all information of a log entry in parsed format.
TransactionFilter Filter object that can filter out useless or not supported transactions.
TransactionSimple Simplified transaction object for log information retrieval from the database.
UpdateDBHostNames Updates the users table with host names for corresponding remotehost fields.
UpdateDBIPAddresses Updates the cslog table’s remotehost fields with IP addresses in case they contain host names.

Data files used by the package
cslog.txt Text file containing log entries in raw format.
webmining.prop Properties file that contains all the properties needed for the process (e.g., database properties, file paths and file names, etc.).

extension.flt This file contains all the file extensions for request URLs that are supported by this project.
spider.flt This file contains all known spider engine names or spider patterns for filtering out spider transactions.

Table 35

2. Data preparation (integration) – webmining.dataintegration package

Main objects of the webmining.dataintegration package
GenerateAMT The main process for generating an artificial mapping table using the GenerateArtificialMappingTable object.
GenerateArtificalMappingTable Generates artificial mapping data from the specified access log file, with randomly added content types in the given interval.
MappingTable Representation of the mapping table. It reads the mapping information, (URL, content type) entries, from the specified text file and stores them in an efficiently searchable HashTable.

Data files used by the package
mapping_table.mtd Text file containing mapping entries for the specific collection of documents (HTML pages).
webmining.prop The properties file which contains all the properties needed for the process (e.g., database properties, file paths and names).

Table 36

3. Data structuring – webmining.sessionidentification package

Main objects of the webmining.sessionidentification package
GetSessions The main object which manages the identification process.
Identifier Interface for all identifier objects. It specifies that an identifier should make sessions from a given set of user page access entries (from an Array of TransactionSimple objects).
MFRIdentifier An identifier object which identifies sessions by the maximal forward reference method.
SessionFormatPrinter Provides methods for printing identified user sessions in different output formats into the specified output file.
TimeFrameIdentifier An identifier object which identifies sessions using the time frame identification method.
TransactionDBIterator (deprecated class) This object retrieves user page accesses for every user separately and invokes the specified identifier on the collected data. As a result it gives back the identified sessions. (It is much slower than the memory iterator, thus this class is no longer used.)
TransactionMemoryIterator This object retrieves all the page accesses (or rather their content types) for every user into the memory and invokes the specified identifier with the collected data. As a result it gives back the identified sessions.

Data files used by the package
webmining.prop The properties file which contains all the properties needed for the processes (e.g., database properties, file paths and names).

Table 37
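The idea behind time frame identification can be sketched as follows. This is not the TimeFrameIdentifier class of the webmining package, and the exact rule is an assumption: here a new session starts whenever the gap between two consecutive requests of the same user exceeds the time frame (the time_frame_intervall property, in minutes). The section on data structuring defines the rule actually used. The class name TimeFrameSketch and the use of epoch seconds are hypothetical choices for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; not the TimeFrameIdentifier of the webmining package.
class TimeFrameSketch {
    // Splits one user's request times (epoch seconds, ascending) into sessions.
    // Assumption: a gap longer than frameMinutes starts a new session.
    static List<List<Long>> split(List<Long> requestTimes, int frameMinutes) {
        long limitSeconds = frameMinutes * 60L;
        List<List<Long>> sessions = new ArrayList<>();
        List<Long> current = null;
        long previous = 0;
        for (long time : requestTimes) {
            if (current == null || time - previous > limitSeconds) {
                current = new ArrayList<>();
                sessions.add(current);
            }
            current.add(time);
            previous = time;
        }
        return sessions;
    }

    public static void main(String[] args) {
        // Requests at minute 0, 10, 25, 70 and 75, expressed in seconds.
        List<Long> times = Arrays.asList(0L, 600L, 1500L, 4200L, 4500L);
        System.out.println(split(times, 30));
    }
}
```

With a 30 minute frame the 45 minute gap between the third and fourth request separates the example into two sessions.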


4. Profile mining models – webmining.patterndiscovery package

This package contains two subpackages, for the association rules mining (assoc) and the global tree model (gtm) implementations.

Main objects of the webmining.patterndiscovery.assoc package
note Note that the LUCS-KDD Apriori-T Association Rule Mining Algorithm implemented by Coenen, F. (2004) [12] was put into the webmining package structure without any modification. The following class descriptions are mainly from the documentation of the program.
AprioriTapp Fundamental Apriori-T application.
AprioriTsortedApp Apriori-T application with input data preprocessed so that it is ordered according to frequency of single items; this serves to reduce the computation time.
AprioriTsortedPrunedApp Apriori-T application with data ordered according to frequency of single items and columns representing unsupported 1-itemsets removed; again this serves to enhance computational efficiency.
AssocRuleMining Set of general ARM utility methods to allow: (i) data input and input error checking, (ii) data preprocessing, (iii) manipulation of records (e.g., operations such as subset, member, union etc.) and (iv) data and parameter output.
RuleList Set of methods that allow the creation and manipulation (e.g., ordering, etc.) of a list of ARs.
TotalSupportTree Methods to implement the "Apriori-T" algorithm using the "Total support" tree data structure (T-tree).
TtreeNode Methods concerned with the structure of Ttree nodes. Arrays of these structures are used to store nodes at the same level in any sub-branch of the T-tree. Note this is a separate class to the other T-tree classes which are arranged in a class hierarchy.

Data files used by the package
input_session.txt Plain text file containing the input user sessions in a special format. Page content types within a session are in ascending order with redundant pages removed.

Table 38

Main objects of the webmining.patterndiscovery.gtm package
GlobalTreeModel This class provides the representation for the tree model.
LoadGTM Initializes the tree model and loads all the user sessions into it. It is also responsible for managing tree visualization.
SessionTree SessionTree is a tree structure containing all the sessions for a specific starting page. The whole model consists of SessionTrees in a number of distinct content types.
TreeNode Contains information for one node such as parent and children references, content type and frequency of the node, etc.

Data files used by the package
input_session.txt Plain text file containing the input user sessions.

Table 39
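The tree structure described for the gtm subpackage can be illustrated with a small prefix tree over sequences of content types. The sketch below is not the GlobalTreeModel, SessionTree or TreeNode implementation of the webmining package. It assumes that the frequency of a node counts the sessions whose page sequence passes through that node and that support is that count divided by the total number of sessions; the class name GtmSketch and the artificial root node are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a global tree model; not the gtm classes of the webmining package.
class GtmSketch {
    static final class Node {
        final int contentType;
        final Node parent;
        final Map<Integer, Node> children = new TreeMap<>();
        int frequency; // number of sessions passing through this node (assumption)

        Node(int contentType, Node parent) {
            this.contentType = contentType;
            this.parent = parent;
        }
    }

    // Artificial root; each child is the tree of sessions sharing one starting content type.
    final Node root = new Node(-1, null);

    // Inserts one session, given as the sequence of content types of its pages.
    void addSession(int... contentTypes) {
        Node node = root;
        root.frequency++;
        for (int type : contentTypes) {
            Node child = node.children.get(type);
            if (child == null) {
                child = new Node(type, node);
                node.children.put(type, child);
            }
            child.frequency++;
            node = child;
        }
    }

    // Fraction of all sessions that start with the given sequence of content types.
    double support(int... prefix) {
        Node node = root;
        for (int type : prefix) {
            node = node.children.get(type);
            if (node == null) {
                return 0.0;
            }
        }
        return root.frequency == 0 ? 0.0 : (double) node.frequency / root.frequency;
    }

    public static void main(String[] args) {
        GtmSketch model = new GtmSketch();
        model.addSession(1, 2, 3);
        model.addSession(1, 2);
        model.addSession(4);
        System.out.println(model.support(1, 2));
    }
}
```

Pruning such a tree by a support threshold s then amounts to dropping every node whose support falls below s.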


APPENDIX E. Content of the CD-ROM

The CD-ROM accompanying this master's thesis contains all the input and data files as well as all the important results produced during the project. It also contains the source and binary code of the whole webmining package and this thesis in electronic format. To make browsing easier we made an HTML user interface for the provided content. It is accessible from the root of the CD by opening the “index.html” file.
