algorithms are k-means and k-medoids. The
advantage of the partition-based
algorithms is that they create the clusters
iteratively, but the drawback is that the number of
clusters has to be determined in advance and only
spherical shapes can be detected as clusters.
Hierarchical algorithms provide a
hierarchical grouping of the objects. There are
two approaches, the bottom-up and the top-down
approach. In the bottom-up approach, at the
beginning of the algorithm each object represents a
different cluster, and at the end all objects belong to
the same cluster. In the top-down approach, at the
start of the algorithm all objects belong to the same
cluster, which is split until each object constitutes a
different cluster. A key aspect of these kinds of
algorithms is the definition of the distance
measures between the objects and between the
clusters. The drawback of hierarchical
algorithms is that once an object is assigned to a
given cluster, the assignment cannot be modified later.
Furthermore, as in the partition-based case, only
spherical clusters can be obtained. The advantage
of hierarchical algorithms is that the validation
indices (correlation, inconsistency measure) that
can be defined on the clusters can be used to
determine the number of clusters.
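The bottom-up scheme described above can be sketched as follows. This is a minimal illustration only: the one-dimensional points, the single-linkage cluster distance (distance between the two closest members), and the function names are assumptions made for the example, not choices stated in the text.

```python
def single_linkage(a, b):
    """Cluster distance: distance between the closest pair of members (assumed linkage)."""
    return min(abs(x - y) for x in a for y in b)


def agglomerate(points, target_clusters):
    """Merge the two closest clusters until only target_clusters remain."""
    # Bottom-up start: every object is its own cluster.
    clusters = [[p] for p in points]
    while len(clusters) > max(target_clusters, 1):
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # A merge is never undone later, which is the drawback noted above.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]
```

Stopping at a chosen number of clusters is one possible termination rule; running the loop down to a single cluster reproduces the full hierarchy described in the text.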
Density-based algorithms start by
searching for core objects and then grow the
clusters from these cores by searching for
objects that lie in the neighbourhood within a given
radius of an object. The advantage of this type of
algorithm is that it can detect clusters of arbitrary
shape and can filter out noise.
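A simplified sketch of this idea is given below, in the spirit of DBSCAN. The parameter names `radius` and `min_points`, the Euclidean distance, and the convention of labelling noise as -1 are assumptions made for illustration.

```python
import math


def region_query(points, idx, radius):
    """Indices of all points within `radius` of points[idx], including itself."""
    return [j for j, q in enumerate(points) if math.dist(points[idx], q) <= radius]


def density_cluster(points, radius, min_points):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    labels = [None] * len(points)
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, radius)
        if len(neighbours) < min_points:
            labels[i] = -1  # provisionally noise; a cluster may still claim it
            continue
        # points[i] is a core object: start a new cluster and grow it.
        cluster_id += 1
        labels[i] = cluster_id
        queue = list(neighbours)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id  # border point reached from a core object
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            more = region_query(points, j, radius)
            if len(more) >= min_points:
                queue.extend(more)  # j is itself a core object, keep growing
    return labels
```

Points that never fall within the radius of a core object keep the label -1, which is how noise is filtered out.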
Grid-based algorithms use a hierarchical grid
structure to
decompose the object space into a finite number of
cells. For each cell, statistical information about
the objects is stored, and the clustering is performed on
these cells. The advantage of this approach is the
fast processing time, which is in general independent
of the number of data objects.
Fuzzy clustering algorithms suppose that no hard
clusters exist on the set of objects, but that one object
can be assigned to more than one cluster. The best-
known fuzzy clustering algorithm is fuzzy c-means (FCM).
III. Analysis of Problem
With the explosive growth of information
sources available on the World Wide Web and the
rapidly increasing pace of adoption of internet
commerce, the internet has evolved into a gold mine
that contains or dynamically generates information
that is beneficial to e-businesses. A web site is the
most direct link a company has to its current and
potential customers. Companies can study
visitors’ activities through web analysis and find
patterns in the visitors’ behavior. Web usage
patterns can be directly applied to efficiently
manage activities related to e-Business, e-CRM,
e-Services, e-Education, e-Newspapers, and
e-Government. With the large number of companies
using the internet to distribute and collect
information, knowledge discovery on the web has
become an important research area.
Applications like Customer Relationship
Management (CRM) can use data from within and
outside an organization to build an understanding
of its customers on an individual basis or on a group
basis, for example by forming customer profiles.
Improved knowledge about the customers’
preferences and needs forms the basis for effective
CRM. For a business, it is important to retain the
loyalty of its existing customers and to attract new
customers. Automated data mining or knowledge
discovery techniques can be used to discover web
user profiles. These mass user profiles can be
discovered by automatically extracting frequent access
patterns from the history of previous user click
streams stored in web log files. Although there have been
considerable advances in web usage mining, there
have been no detailed studies presenting a fully
integrated approach to mining a real web site that
accounts for characteristics such as evolving profiles,
dynamic content, and the
availability of a taxonomy or database in addition to
web logs.
IV. Proposed Work
The general scheme of the proposed approach for
mining usage profiles using fuzzy clustering
consists of the steps described below.
Web Log Data
When users visit a Web site, the Web server
stores the information about their accesses in a log
file. Each record of a log file represents a page
request issued by a Web user. In particular, it
typically contains the following information: the user’s
IP address, the date and time of the access, the URL of the
requested page, the request protocol, and a code indicating
the status of the request.
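As an illustration, a record with these fields can be parsed as sketched below. The sketch assumes the widely used Common Log Format layout; the actual layout depends on the server configuration, and the pattern, field names, and function name are hypothetical.

```python
import re
from datetime import datetime

# Assumed Common Log Format, e.g.
# 192.168.1.10 - - [10/Jan/2012:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" (?P<status>\d{3})'
)


def parse_log_line(line):
    """Return a dict with ip, time, method, url, protocol, status; None if no match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    return record
```

Lines that do not match the assumed layout are returned as `None`, so they can be counted or inspected rather than silently misparsed.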
Usage Pre-processing
The aim of the pre-processing step is to
identify user sessions starting from the information
contained in the access log file.
Data pre-processing involves two main steps: data
cleaning and user session identification.
Archana N Boob et al ,Int.J.Computer Technology & Applications,Vol 3 (1),329-331
IJCTA | JAN-FEB 2012 Available [email protected]
330
ISSN:2229-6093
Data cleaning
The first step of log data pre-processing consists in
removing useless requests from log files. In
particular, data cleaning removes redundant
references such as images, sound files, multiple
frames, and dynamic pages that have the same
template. We eliminate the irrelevant items by
checking the suffix of the URL requests. Hence, all
log entries with filename suffixes such as gif, jpeg,
jpg, and map are removed. These operations not
only remove uninteresting sessions but also
simplify the mining task that follows.
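The suffix check described above can be sketched as follows. The suffix list is taken from the text; the function names and the decision to ignore query strings and letter case are assumptions made for the example.

```python
from urllib.parse import urlparse

# Suffixes named in the text as irrelevant for usage mining.
IRRELEVANT_SUFFIXES = (".gif", ".jpeg", ".jpg", ".map")


def is_relevant(url):
    """True if the requested URL does not end in one of the irrelevant suffixes."""
    path = urlparse(url).path.lower()  # drop any query string, ignore case
    return not path.endswith(IRRELEVANT_SUFFIXES)


def clean_requests(urls):
    """Keep only the requests that are relevant for mining."""
    return [u for u in urls if is_relevant(u)]
```

In practice the suffix list would be extended to cover whatever other embedded resources appear in the logs of the site under study.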
User session identification
- User identification
User identification refers to the process of associating
each user with the web log entries for the pages that user
visited. Visitors are distinguished according to their
IP address and user agent. Due to the
existence of caches, proxy servers (including cafes,
etc.) and firewalls, this step can be very
complicated and time-consuming. Scholars have put
forward some heuristic rules to identify users: (1)
Different IP addresses represent different
users. (2) When the IP address is the same,
different operating systems
or browsers represent different users. (3) When the
IP address, the operating system, and the
browser are all the same, check whether there is a direct link
between the requested page and the pages visited
previously; if so, the requests come from a single user, and if not,
from different users.
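Rules (1) and (2) can be sketched as below by treating each distinct combination of IP address and user agent as one user. Rule (3) needs the link structure of the site and is left out of this sketch. The record keys `ip` and `agent` and the function name are hypothetical.

```python
def identify_users(records):
    """Attach a user id to each record, one id per (IP address, user agent) pair."""
    user_ids = {}
    labelled = []
    for record in records:
        # Rule (1): different IP -> different user.
        # Rule (2): same IP but different OS/browser (user agent) -> different user.
        key = (record["ip"], record["agent"])
        if key not in user_ids:
            user_ids[key] = len(user_ids)
        labelled.append({**record, "user": user_ids[key]})
    return labelled
```

Two visitors behind the same proxy with identical browsers would still be merged by this sketch, which is the case rule (3) is meant to address.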
- Session identification
A session refers to a series of activities from when a
user first logs into the website to when the
user leaves it. The goal of session identification is to obtain a
meaningful sequence of visits within a specific time period.
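One common way to delimit sessions, sketched below, is to start a new session whenever the gap between two consecutive requests of the same user exceeds a timeout. The text does not specify a rule; the 30-minute timeout and the function name are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def split_sessions(requests, timeout=timedelta(minutes=30)):
    """Split one user's (timestamp, url) requests into lists of URLs, one per session."""
    sessions = []
    last_time = None
    for timestamp, url in sorted(requests):
        # A long pause is taken to mean the user left and came back later.
        if last_time is None or timestamp - last_time > timeout:
            sessions.append([])
        sessions[-1].append(url)
        last_time = timestamp
    return sessions
```

The function is meant to be applied separately to the requests of each identified user.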
Session categorization by fuzzy clustering
Once user sessions have been identified, a
clustering process is applied in order to group
similar sessions in the same category. Each session
category includes users exhibiting a common
browsing behaviour and hence similar interests.
Different types of clustering algorithms can be applied
to Web data. One important criterion to be considered
in the choice of the clustering method is the
possibility of creating overlapping clusters. This is
a fundamental facet in Web personalization, where
the ambiguity of the navigational data requires that
a user may belong to more than one category or
profile. Fuzzy clustering turns out to be a good
candidate method to handle ambiguity in the data,
since it enables the creation of overlapping clusters
and introduces a degree of item-membership in
each cluster.
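The following is a minimal sketch of how sessions could be grouped with fuzzy c-means. It assumes that each session is encoded as a binary vector over the visited URLs and that Euclidean distance is used; the encoding, the fuzzifier `m = 2`, the fixed iteration count, and all function names are assumptions for the example rather than details given in the text.

```python
import random


def sessions_to_vectors(sessions):
    """Encode each session as a 0/1 vector over the sorted set of all URLs."""
    urls = sorted({url for s in sessions for url in s})
    return urls, [[1.0 if url in s else 0.0 for url in urls] for s in sessions]


def fuzzy_c_means(data, n_clusters, m=2.0, n_iter=100, seed=0):
    """Return (centers, memberships); memberships[i][k] is in [0, 1], rows sum to 1."""
    rng = random.Random(seed)
    n, dim = len(data), len(data[0])
    # Random initial memberships, each row normalised to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(n_clusters)]
        total = sum(row)
        u.append([v / total for v in row])
    centers = []
    for _ in range(n_iter):
        # Centre update: mean of the data weighted by membership^m.
        centers = []
        for k in range(n_clusters):
            weights = [u[i][k] ** m for i in range(n)]
            total = sum(weights)
            centers.append(
                [sum(w * x[d] for w, x in zip(weights, data)) / total for d in range(dim)]
            )
        # Membership update: closer centres receive a higher degree of membership.
        for i, x in enumerate(data):
            dists = [
                max(sum((a - b) ** 2 for a, b in zip(x, c)) ** 0.5, 1e-12)
                for c in centers
            ]
            for k in range(n_clusters):
                u[i][k] = 1.0 / sum((dists[k] / dj) ** (2.0 / (m - 1.0)) for dj in dists)
    return centers, u
```

A session that mixes pages from two groups of URLs receives a non-zero membership in both categories, which is the overlapping behaviour the text calls for. A session could then be assigned to every category in which its membership exceeds a chosen threshold.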
V. Desired Implications
Web usage mining involves mining the
usage characteristics of the users of Web
applications. The extracted information can then
be used in a variety of ways, for example to enhance the
quality of electronic commerce services, to
personalize web portals, and to improve the
applications.
The proposed method can be
implemented for mining the usage
characteristics of the users of Web applications.
The number of similar URLs visited by users in a
particular session gives an understanding of user
behavior. Comparing the number of URLs visited with the
user sessions gives a clear understanding of how users’
behavior evolves.