LOGS MINER : PORTAL FOR DATA MINING WEB ACCESS LOGS
Presented by Andrew Wong
9th Annual IUG Meeting at HKU Library, 8 December 2009
Agenda
• Definitions
• Motivations
• Architecture of Logs Miner
• Logs Miner User Interface
• Logs Miner reports
• Benefits
• Future development
Definitions
Web data mining: “application of data mining methodologies, techniques, and models to variety of data forms, structures, and usage patterns that comprise the World Wide Web” (Markov, Z. & Larose, D. T. 2007)
Three scopes of Web data mining:
• Web content mining
• Web structure mining
• Web log mining
Definitions
Web log mining
• Discovers user access patterns from Web usage logs
• Is also called web usage mining
• Three processing stages:
1. Pre-processing
2. Pattern discovery
3. Pattern analysis
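The three stages can be illustrated with a very small pipeline. This is only a sketch: the records are hypothetical, already-simplified tuples, and the choice of which status codes to keep and what counts as a "pattern" are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical, already-simplified log records: (host, path, status).
records = [
    ("lbz000.ust.hk", "/catalog/", 200),
    ("lbz333.ust.hk", "/catalog/?s=brandy", 304),
    ("lbz222.ust.hk", "/stream/xml/stream.xml", 304),
    ("lbz000.ust.hk", "/missing", 404),
]

def preprocess(rows):
    """Stage 1: keep only successful or not-modified responses."""
    return [r for r in rows if r[2] in (200, 304)]

def discover(rows):
    """Stage 2: count requests per path, ignoring the query string."""
    return Counter(path.split("?", 1)[0] for _, path, _ in rows)

def analyze(counts, top=1):
    """Stage 3: rank paths by request count."""
    return counts.most_common(top)

result = analyze(discover(preprocess(records)))
print(result)
```

In Logs Miner itself these stages are handled by AWStats and the query interface; the sketch only shows how the stages feed into one another.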
Purposes of web log mining
• Identify and classify different groups of patrons
• Understand the search patterns of different groups of patrons
• Adapt web user interfaces to suit users' needs
• Provide statistical data for collection management
Web logs
lbz000.ust.hk - - [16/Nov/2009:12:03:26 +0800] "GET /catalog/ HTTP/1.1" 200 20283 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5 (.NET CLR 3.5.30729)"
lbnxyz.ust.hk - - [16/Nov/2009:12:03:27 +0800] "GET /catalog/?s=brandy&feed=rss HTTP/1.1" 304 - "-" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 1 subscribers; feed-id=10486796160015392754)"
lbz222.ust.hk - - [16/Nov/2009:12:03:30 +0800] "GET /stream/xml/stream.xml HTTP/1.1" 304 - "-" "Mozilla/5.0 (Windows; U; Windows NT 6.0; zh-TW; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5"
lbz333.ust.hk - - [16/Nov/2009:12:03:33 +0800] "GET /catalog/?s=brandy HTTP/1.1" 304 - "-" "Mozilla/5.0 (Windows; U; Windows NT 6.0; zh-TW; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5"
lbz444.ust.hk - - [16/Nov/2009:12:03:35 +0800] "GET /stream/xml/stream.xml HTTP/1.1" 304 - "-" "Mozilla/5.0 (Windows; U; Windows NT 6.0; zh-TW; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5"
• Web logs provide a large amount of information on user actions
Web logs
Sample entry:
lbz000.ust.hk - - [16/Nov/2009:12:03:26 +0800] "GET /catalog/ HTTP/1.1" 200 20283 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5 (.NET CLR 3.5.30729)"

Field                            Value
Remote host field                lbz000.ust.hk
Date/Time field                  [16/Nov/2009:12:03:26 +0800]
HTTP request field               "GET /catalog/ HTTP/1.1"
Status code field                200
Transfer volume (bytes) field    20283
User agent field                 "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5 (.NET CLR 3.5.30729)"
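The field breakdown above can be reproduced with a regular expression. The following is a minimal sketch, assuming the entry follows the layout of the sample line (host, two placeholder fields, timestamp, request, status, bytes, referrer, user agent); the names `LOG_PATTERN` and `parse_line` are illustrative and are not part of Logs Miner or AWStats.

```python
import re
from datetime import datetime

# One named group per field of the sample entry.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    entry = m.groupdict()
    entry["status"] = int(entry["status"])
    # A "-" in the bytes column means no body was transferred.
    entry["bytes"] = 0 if entry["bytes"] == "-" else int(entry["bytes"])
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    return entry

sample = ('lbz000.ust.hk - - [16/Nov/2009:12:03:26 +0800] "GET /catalog/ HTTP/1.1" '
          '200 20283 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.5) '
          'Gecko/20091102 Firefox/3.5.5 (.NET CLR 3.5.30729)"')
entry = parse_line(sample)
print(entry["host"], entry["status"], entry["bytes"])
```

Month names are parsed with `%b`, which depends on the process locale; the sketch assumes an English or C locale.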
Various types of web log
Common Log Format: usually used by Apache web server logs and Apache Tomcat logs (the example below is the combined variant, which adds the referrer and user agent fields)
e.g. Library web server, INNOPAC, SmartCAT, Institutional Repository
lbz000.ust.hk - - [16/Nov/2009:12:03:26 +0800] "GET /catalog/ HTTP/1.1" 200 20283 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5 (.NET CLR 3.5.30729)"
Microsoft IIS Log Format
e.g. ILLiad, Class Registration Form
2009-07-20 01:22:44 GET /ce/ - 66.249.71.201 HTTP/1.1 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) - 401 1891 0
Include:
• Remote host field
• Date field
• Time field
• HTTP request field
• Status code field
• Transfer Volume (Bytes)
• Referrer field
• User agent field
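An IIS entry such as the one above is space-separated, with spaces inside a value encoded as "+". The sketch below splits the sample line into named fields. The column names are W3C extended log field names, but their order here is an assumption (the names of the last three columns in particular are guesses): a real IIS log declares the actual order in its "#Fields:" header line, which the slides do not show.

```python
# Assumed column order; a real IIS log declares it in a "#Fields:" header.
ASSUMED_FIELDS = [
    "date", "time", "cs-method", "cs-uri-stem", "cs-uri-query", "c-ip",
    "cs-version", "cs(User-Agent)", "cs(Referer)", "sc-status", "sc-bytes",
    "time-taken",
]

def parse_iis_line(line, fields=ASSUMED_FIELDS):
    """Map one space-separated IIS entry onto the assumed field names."""
    values = line.split()
    if line.startswith("#") or len(values) != len(fields):
        return None  # header/comment line, or an unexpected layout
    record = dict(zip(fields, values))
    # Spaces are logged as "+"; a literal "+" cannot be told apart from one.
    record["cs(User-Agent)"] = record["cs(User-Agent)"].replace("+", " ")
    return record

sample = ("2009-07-20 01:22:44 GET /ce/ - 66.249.71.201 HTTP/1.1 "
          "Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) "
          "- 401 1891 0")
record = parse_iis_line(sample)
print(record["c-ip"], record["sc-status"])
```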
Various types of web log
Microsoft Streaming Server
e.g. Streaming video
143.89.160.133 2009-09-02 10:21:20 - /arc-open/oudpa/OUDPA-2008-Sobel-Adventures_in_Science_Writing.wmv 0 6 5 200 {3300AD50-2C39-46c0-AE0A-41B7139D4722} 11.0.5721.5251 en-US WMFSDK/11.0.5721.5251_WMPlayer/11.0.5721.5268 - wmplayer.exe 11.0.5721.5145 Windows_XP 5.1.0.2600 Pentium 3816 216613290 2830093 rtsp TCP - - - 2244972 2244972 398 398 0 0 0 0 0 0 1 1 100 143.89.105.168 lbms07.ust.hk 1 0 - 245 file://C:\wmhome\hkust\arc-open\oudpa/OUDPA-2008-Sobel-Adventures_in_Science_Writing.wmv mms://stream.ust.hk/arc-open/oudpa/OUDPA-2008-Sobel-Adventures_in_Science_Writing.wmv OUDPA-2008-Sobel-Adventures_in_Science_Writing.wmv - - 0
Fields only for streaming server:
• Video codec
• Audio codec
• Duration
• Client’s player
Web Logfile analysis tools
Tools used to analyze web access logs:
• AccessWatch v1.33
• Analog 6.0
• Pwebstats
• RefStats 1.2
• INNOPAC Millennium Web Report – Search Statistics
Others:
• AWStats
• Sawmill Analytics
• Webalizer
Motivations
• Create a portal for storing and analyzing all the different web access logs
• Provide an interface for querying web access logs to generate dynamic statistical reports
AWStats as core
• Able to analyze different log formats, including Apache NCSA combined log files, IIS log files (W3C), and streaming server log files
• Can also analyze non-standardized log formats
• Works from the command line and from a browser as a CGI
• Build a web interface to query the data (Logs Miner)
• Pre-process the raw log data, running large-scale queries in a cron job
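As an illustration of the command-line and cron usage mentioned above, the sketch below builds the AWStats update command for a few profiles and formats one crontab entry for each. The install path, the profile names, and the schedule are all hypothetical; `-config=` and `-update` are standard `awstats.pl` options. Nothing is executed; the script only prints the lines.

```python
import shlex

# Hypothetical install path and profile names; adjust to the local setup.
AWSTATS = "/usr/local/awstats/wwwroot/cgi-bin/awstats.pl"
PROFILES = ["library", "repository", "archives"]

def update_command(profile):
    """Build the argv for one AWStats update run (nothing is executed here)."""
    return ["perl", AWSTATS, f"-config={profile}", "-update"]

def crontab_line(profile, minute=0, hour=2):
    """Format a crontab entry that runs the update once a day."""
    return f"{minute} {hour} * * * {shlex.join(update_command(profile))}"

# Stagger the jobs by ten minutes so they do not all start at once.
lines = [crontab_line(p, minute=i * 10) for i, p in enumerate(PROFILES)]
print("\n".join(lines))
```

Running the heavy update step on a schedule keeps the interactive queries fast, since they read data that AWStats has already pre-processed.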
AWStats as core
• Unlimited log file size
• Reports the number of visits and unique visitors
• Provides plug-ins to extend functionality
• Open source
Requirements for AWStats
• Web log files: the raw data must contain the standard web log components, such as client IP address, status code, HTTP request field ……
• Any OS platform that supports PERL
System configuration of Logs Miner:
• PC-level workstations
• CentOS release 5.4
• Apache web server 2.0
• PERL v.5.8.8
• AWStats 6.9
Logs Miner architecture
[Architecture diagram. Labels: Raw logs (Library web server, INNOPAC, SmartCAT, Institutional repository, Digital archives …..); Preprocessing; AWStats; Pattern discovery, pattern analysis; AWStats reports; Access statistics; Customized report; Logs Miner UI]
Logs Miner user interface
• A portal for mining web access log data and retrieving information about the usage of multiple web applications
• Built on top of AWStats, an open source log analyzer
• Currently set up to analyze more than 20 library servers and applications, including Library Web Server, INNOPAC, Institutional Repository, Digital Archives, SmartCAT, ILLiad, Streaming Video Server, etc.
Logs Miner user interface
URL: https://lbnx16.ust.hk/mining
[Screenshots of the interface, with callouts:]
• Includes 20+ applications
• Provides three types of report
• Filtered by URL or Host
• Generates yearly or monthly reports
• Query box supporting regular expressions
• Tips for constructing query strings
Three types of reports
• AWStats reports
• Access statistics
  - filtered by URL / Host
• Customized reports
AWStats report
[Screenshots of AWStats reports, with callouts:]
Reports the number of:
- unique visitors
- visits
- these numbers exclude visits from robots
Created by plugins: geoip
Work in progress
HKUST's iPhone Application for receiving Library information and searching on SmartCAT
Access statistics report
Query box supporting regular expressions
Access statistics report – filtered by URL
Access statistics report – filtered by Host
Example (1) – Usage of a database
Database title: Cambridge Journals Online
URL: http://library.ust.hk/cgi/db/cambridge.pl?subscribedTo
Server name: library.ust.hk (Library web server)
Parameters: /cgi/db/cambridge.pl?subscribedTo
Include pattern: cgi\/db\/cambridge\.pl.+
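To see what the include pattern of Example (1) selects, it can be tried against a few request URLs. This is a sketch only: the request list is hypothetical, and Python's `re` module stands in for whatever regular expression engine the Logs Miner query box uses.

```python
import re

# Include pattern from Example (1).
INCLUDE = re.compile(r"cgi\/db\/cambridge\.pl.+")

# Hypothetical request URLs, for illustration only.
requests = [
    "/cgi/db/cambridge.pl?subscribedTo",
    "/cgi/db/cambridge.pl?subscribedTo&from=guide",
    "/cgi/db/cambridge.pl",      # nothing after ".pl", so ".+" fails
    "/cgi/db/cambridgeXpl?x",    # "\." requires a literal dot
    "/catalog/",
]

matched = [url for url in requests if INCLUDE.search(url)]
print(len(matched), "of", len(requests), "requests match")
```

The trailing `.+` requires at least one character after `cambridge.pl`, so only requests that carry a query string are counted.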
Example (2) – Usage of a document of HKUST Institutional Repository
Document: Long, Jiafu 2005, Autoinhibition of X11/Mint scaffold proteins revealed by the closed ……
URL: http://repository.ust.hk/dspace/bitstream/1783.1/2496/1/nsmb958.pdf
Server name: repository.ust.hk (HKUST Institutional Repository)
Parameters: /dspace/bitstream/1783.1/2496/1/nsmb958.pdf
Include pattern: \/1783\.1\/2496\/1\/nsmb958\.pdf
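The include pattern in Example (2) escapes the dots so that they match literally. A small sketch of why this matters, using Python's `re.escape` to build such a pattern from a path; the helper name `include_pattern_for` and the look-alike URL are hypothetical. Note that `re.escape` leaves "/" unescaped in current Python versions, which matches the same strings as the slash-escaped form on the slide.

```python
import re

def include_pattern_for(path):
    """Escape regex metacharacters so the path is matched literally."""
    return re.escape(path)

path = "/1783.1/2496/1/nsmb958.pdf"
pattern = include_pattern_for(path)

url = "/dspace/bitstream/1783.1/2496/1/nsmb958.pdf"
lookalike = "/dspace/bitstream/1783x1/2496/1/nsmb958.pdf"  # hypothetical

escaped_hits = [u for u in (url, lookalike) if re.search(pattern, u)]
# Without escaping, "." matches any character, so the look-alike also matches.
unescaped_hits = [u for u in (url, lookalike) if re.search(path, u)]
print(pattern)
print(len(escaped_hits), len(unescaped_hits))
```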
Example (3) – Access by particular group
Number of accesses to the Library web page from Library public workstations
Library web page
URL: http://library.ust.hk/
Server name: library.ust.hk (Library web server)
Client naming convention:
• OPAC workstation (lbb[nnn].ust.hk)
• IC workstation (lbc[nnn].ust.hk)
• Computer Lab (lba[nnn].ust.hk)
Include pattern: lb(a|b|c)[\d]+\.ust\.hk
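The naming convention in Example (3) also allows accesses to be broken down by workstation type. The sketch below is an illustration, not a Logs Miner feature: the host list is hypothetical, and the pattern is written as `lb([abc])\d+\.ust\.hk`, anchored at both ends, which is an assumption about the intended form.

```python
import re
from collections import Counter

# Anchored form of the host pattern, capturing the group letter.
HOST = re.compile(r"^lb([abc])\d+\.ust\.hk$")
GROUPS = {"a": "Computer Lab", "b": "OPAC workstation", "c": "IC workstation"}

def classify(host):
    """Return the workstation group for a host name, or None if not public."""
    m = HOST.match(host)
    return GROUPS[m.group(1)] if m else None

# Hypothetical remote hosts taken from a log.
hosts = ["lbb012.ust.hk", "lbc101.ust.hk", "lba007.ust.hk",
         "lbz000.ust.hk", "lbb012.ust.hk", "example.com"]

counts = Counter(g for g in map(classify, hosts) if g is not None)
print(dict(counts))
```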
Example (4) – Exclude particular group
Number of accesses to the Digital Archives from the HKUST campus, excluding HKUST Library staff
Digital university archives
URL: http://archives.ust.hk/
Server name: archives.ust.hk (Digital Archives)
Client naming convention:
• Library staff workstation (lbz[nnn].ust.hk)
Include pattern: ^.+\.ust\.hk$
Exclude pattern: lbz.+\.ust\.hk
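The combination of an include and an exclude pattern in Example (4) can be sketched as a two-step filter: a host is counted only if it matches the include pattern and does not match the exclude pattern. The host list is hypothetical, and the exclude pattern is written here as `lbz.+\.ust\.hk`, which is an assumption about the intended form.

```python
import re

INCLUDE = re.compile(r"^.+\.ust\.hk$")   # any host on the HKUST campus
EXCLUDE = re.compile(r"lbz.+\.ust\.hk")  # library staff workstations

def is_counted(host):
    """Count a host only if it is on campus and not a staff workstation."""
    return bool(INCLUDE.search(host)) and not EXCLUDE.search(host)

# Hypothetical remote hosts, for illustration only.
hosts = ["lbz000.ust.hk", "lbb012.ust.hk", "pc42.cse.ust.hk",
         "crawler.example.com"]

counted = [h for h in hosts if is_counted(h)]
print(counted)
```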
Example (5) – Number of virtual visits
• A virtual visit is defined as a user’s request on the library’s website in order to use one of the services provided by the library.
• One Key Performance Indicator: virtual visits per capita
• Includes the main web applications:
  - Library web server
  - INNOPAC
  - SmartCAT (next-generation catalog)
  - HKUST Institutional Repository
  - Digital Archives
  - HKUST ILLiad
Example (5) – Number of virtual visits
Report the number of:
• Visits
  - a visit is counted when a unique IP address accesses a page and then requests other pages with less than an hour between consecutive requests
[Diagram: a chain of requests, each within an hour of the previous one, counts as one visit]
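The visit definition above can be sketched as a simple grouping of requests per IP address: a new visit starts whenever an hour or more has passed since that IP's previous request. This illustrates the definition on the slide and is not AWStats's own implementation; the sample requests are hypothetical, and treating a gap of exactly one hour as a new visit is an assumption.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def count_visits(requests, gap=timedelta(hours=1)):
    """requests: iterable of (ip, timestamp). Returns {ip: number of visits}."""
    by_ip = defaultdict(list)
    for ip, stamp in requests:
        by_ip[ip].append(stamp)
    visits = {}
    for ip, stamps in by_ip.items():
        stamps.sort()
        n = 1  # the first request always opens a visit
        for prev, cur in zip(stamps, stamps[1:]):
            if cur - prev >= gap:
                n += 1  # an hour or more of silence starts a new visit
        visits[ip] = n
    return visits

day = datetime(2009, 11, 16)
sample = [
    ("10.0.0.1", day.replace(hour=9, minute=0)),
    ("10.0.0.1", day.replace(hour=9, minute=30)),
    ("10.0.0.1", day.replace(hour=10, minute=20)),  # 50 min later: same visit
    ("10.0.0.1", day.replace(hour=12, minute=0)),   # 100 min later: new visit
    ("10.0.0.2", day.replace(hour=9, minute=10)),
]
visits = count_visits(sample)
print(visits, "total:", sum(visits.values()))
```

Note that the third request arrives more than an hour after the first one but still belongs to the same visit, because only the gap between consecutive requests matters.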
Example (5) – Number of virtual visits

Application             Unique visitors     Visits       Pages   Visits/visitor   Pages/visit
Library web server              413,324  1,018,811   6,078,913             2.46          5.96
IR                               94,596    133,458     632,256             1.41          4.73
Digital Archives                  1,497      3,511      90,489             2.34         25.77
E-Journal                        21,833     42,768     376,473             1.95          8.80
E-theses                         25,848     34,956     116,664             1.35          3.33
HKUST ILLiad                      8,039     18,548     138,109             2.30          7.44
SmartCAT                          4,202      9,398     288,787             2.23         30.72
Streaming Videos                    778      1,233       4,073             1.58          3.30
Total                           570,117  1,262,683   7,725,764             2.21          6.11

Virtual visits in 2009: 1,262,683 (2.21 visits per visitor, 6.11 pages per visit)
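The Total row and the two ratio columns can be re-derived from the per-application figures. In this sketch the Library web server page count is taken as 6,078,913, the reading that is consistent with the Total row. The published ratios appear to be truncated to two decimals rather than rounded, so the sketch truncates as well; that is an inference from the numbers, not something the slides state.

```python
import math

# (unique visitors, visits, pages) per application, from the table.
usage = {
    "Library web server": (413324, 1018811, 6078913),
    "IR": (94596, 133458, 632256),
    "Digital Archives": (1497, 3511, 90489),
    "E-Journal": (21833, 42768, 376473),
    "E-theses": (25848, 34956, 116664),
    "HKUST ILLiad": (8039, 18548, 138109),
    "SmartCAT": (4202, 9398, 288787),
    "Streaming Videos": (778, 1233, 4073),
}

def truncate2(x):
    """Cut a positive number to two decimals without rounding up."""
    return math.floor(x * 100) / 100

# Column-wise sums: (unique visitors, visits, pages).
totals = tuple(sum(column) for column in zip(*usage.values()))
visits_per_visitor = truncate2(totals[1] / totals[0])
pages_per_visit = truncate2(totals[2] / totals[1])
print(totals, visits_per_visitor, pages_per_visit)
```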
Customized reports
• Built-in customized reports provide a full picture of page visit figures for similar pages
From HKUST Library Web Server (http://library.ust.hk):
• Sitemap
• Databases List
• Course Guides
• Database Guides
• Subject Guides
Customized reports
[Screenshots of customized reports, with callouts:]
SubSet: Sitemap, Databases List, Course Guides, Database Guides, Subject Guides
HKUST library web sitemap
Add more customized report templates:
• E-Journal list
• Library Forms
• ……
Benefits of Logs Miner
• Central place for storing, processing and analyzing web log data
• Combines usage data from different server logs
• Statistical reports can be generated dynamically
• Flexible querying interface enables users to construct their own statistical reports in real time
Privacy issue
• From web access logs, an individual client’s actions can be tracked
• Protected by firewall, file permissions, and user authentication
• The Logs Miner user interface can only be accessed from the library network
IMPORTANT: As data retrieved in your searches or reports may contain usage patterns of our users, please be careful not to re-distribute such information outside of the HKUST Library.
Future Development
• Include more web applications, such as the HKUST PowerSearch server (federated search of the Library’s subscription resources)
• Create more customized report templates, such as an E-journal list
References
Han, J., & Kamber, M. 2006. Data mining: Concepts and techniques (2nd ed.). Amsterdam: Morgan Kaufmann.
Liu, H., & Keselj, V. 2007. Combined mining of web server logs and web contents for classifying user navigation patterns and predicting users' future requests. Data & Knowledge Engineering, 61(2): 304.
Markov, Z., & Larose, D. T. 2007. Data mining the web: Uncovering patterns in web content, structure, and usage. Hoboken, N.J.: Wiley-Interscience.