There are only 10 types of people in the world:
Those who understand binary, and those who don't.
Web Mining and Information Assurance
Dr. Xueping Li
Dept. of Industrial & Information Engineering, University of Tennessee
Outline
Introduction to Web Mining
Web content mining
Web usage mining
Web structure mining
Complex Networks
What Is Data Mining?
Data mining (knowledge discovery in databases):
Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases
Alternative names and their “inside stories”:
  Data mining: a misnomer?
  Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
Source: Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques
Why Data Mining? — Potential Applications
Database analysis and decision support
  Market analysis and management
    target marketing, customer relation management, market basket analysis, cross selling, market segmentation
  Risk analysis and management
    forecasting, customer retention, improved underwriting, quality control, competitive analysis
  Fraud detection and management
Other applications
  Text mining (news group, email, documents) and Web analysis
  Intelligent query answering
Source: Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques
Motivation: “Necessity is the Mother of Invention”
What is Web Mining?
Discovering useful information from the World-Wide Web and its usage patterns
Applications
  Web search, e.g., Google, Yahoo, …
  Vertical search, e.g., FatLens, Become, …
  Recommendations, e.g., Amazon.com
  Advertising, e.g., Google, Yahoo
  Web site design, e.g., landing page optimization
Structured vs. Web data mining
Traditional data mining
  Data is structured and relational
  Well-defined tables, columns, rows, keys, and constraints
Web data
  Readily available data, rich in features and patterns
    Text, image, audio, video
  Spontaneous formation and evolution of
    topic-induced graph clusters
    hyperlink-induced communities
Challenges
  Content includes truth, lies, obsolete information, contradictions, …
  Uncontrolled quality, widely distributed, rapidly changing, heterogeneous/complex data types, no consistent semantics or structure within or across objects, etc. (XHTML & XML?)
Size of the Web
Number of pages
  Technically, infinite
    Because of dynamically generated content
    Lots of duplication (30-40%)
  Best estimate of “unique” static HTML pages comes from search engine claims
    Google = 8 billion, Yahoo = 20 billion
    Lots of marketing hype
Number of unique web sites
  Netcraft survey says 72 million sites (http://news.netcraft.com/archives/web_server_survey.html)
Growth of the Internet
* Fig. source: Douglas E. Comer, Computer Networks and Internets with Internet Applications, 4e, Pearson Prentice Hall, 2004
Web Mining Taxonomy
Web content mining (WCM)
Web usage mining (WUM)
Web structure mining (WSM)
Fig. Web mining taxonomy: Web Mining branches into Web Content Mining, Web Usage Mining, and Web Structure Mining
WCM & WUM
Main source of the data: Log files
Log files are the main source of data about the activity of our web server
Typical line of a log file:
  2005-05-29 04:13:40 128.2.215.4 - W3SVC1 WM 160.36.231.167 80 GET /Kdd/wm/wm.zip - 206 64 1507568 551 1816312 HTTP/1.1 www.utk.edu Mozilla/4.0+(compatible;+MSIE+5.5;+Windows+NT+5.0) - http://li.utk.edu/kdd/wm
E.g., log files on WinNT/2000 reside in the \winnt\system32\logfiles\ system directory
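A minimal sketch of how the sample log line above might be split into named fields before any mining step. The field names in `ASSUMED_FIELDS` are an assumption based on the usual IIS W3C extended log layout; the slide does not show the log's `#Fields:` header, so the real column meanings may differ. `parse_log_line` is a hypothetical helper name.

```python
# ASSUMPTION: column names follow a typical IIS W3C extended log layout.
# The slide does not show the "#Fields:" header, so this mapping is a guess.
ASSUMED_FIELDS = [
    "date", "time", "c-ip", "cs-username", "s-sitename", "s-computername",
    "s-ip", "s-port", "cs-method", "cs-uri-stem", "cs-uri-query",
    "sc-status", "sc-win32-status", "sc-bytes", "cs-bytes", "time-taken",
    "cs-version", "cs-host", "cs(User-Agent)", "cs(Cookie)", "cs(Referer)",
]

# The sample line from the slide, copied verbatim.
LOG_LINE = (
    "2005-05-29 04:13:40 128.2.215.4 - W3SVC1 WM 160.36.231.167 80 GET "
    "/Kdd/wm/wm.zip - 206 64 1507568 551 1816312 HTTP/1.1 www.utk.edu "
    "Mozilla/4.0+(compatible;+MSIE+5.5;+Windows+NT+5.0) - "
    "http://li.utk.edu/kdd/wm"
)


def parse_log_line(line, fields=ASSUMED_FIELDS):
    """Split one space-separated log line into a {field name: value} dict."""
    values = line.split()
    if len(values) != len(fields):
        raise ValueError(
            f"expected {len(fields)} fields, found {len(values)}"
        )
    return dict(zip(fields, values))


record = parse_log_line(LOG_LINE)
print(record["c-ip"], record["cs-method"], record["cs-uri-stem"])
```

Spaces inside the user-agent string are encoded as `+` in this format, which is why a plain whitespace split is enough here.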
What kind of problems do we solve?
Personalization of web services:
  Preparing offers (discounts, products, contents) customized for each particular user
Understanding of what is going on at the web server:
  Customer groups identification, behavioral patterns
  …the goal is to better organize web services
  …optimization of site navigation
Better “banner ad” selection to increase the probability of being clicked by the user
  …it is not hard to increase the probability
Building psychological profiles based on the texts read by the user
  …to get more info about the user than he has about himself
Etc.
Data analysis methods
Log files include sequences of events (click-streams):
…methods for analyzing event sequences are usually modified classical methods from the area of Data-Mining for analysis of very large databases
Basic methods are modified methods for induction of association rules, clustering, decision trees
Other analytic methods are from the areas of Text-Mining, Statistics and Machine-Learning
Fig. A General Architecture for Web Usage Mining
WUM - web usage miner
Main goal: navigation pattern discovery
  sequence of pages through the website
  typical patterns
  optimization of site navigation
Three steps
  log file cleaning
  pattern analysis
  visualization
Association rules example
Items = {milk, coke, pepsi, beer, juice}. Support threshold = 3 baskets.
B1 = {m, c, b} B2 = {m, p, j}
B3 = {m, b} B4 = {c, j}
B5 = {m, p, b} B6 = {m, c, b, j}
B7 = {c, b, j} B8 = {b, c}
Frequent itemsets: {m}, {c}, {b}, {j}, {m, b}, {c, b}, {j, c}.
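The frequent itemsets listed above can be checked with a short brute-force sketch. It counts, for every candidate itemset, how many of the eight baskets contain it and keeps those with support of at least 3. This is an illustration only, not the Apriori algorithm or any particular tool's implementation; `frequent_itemsets` is a hypothetical helper name.

```python
from itertools import combinations

# The eight baskets from the slide (m=milk, c=coke, p=pepsi, b=beer, j=juice).
BASKETS = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]


def frequent_itemsets(baskets, min_support):
    """Return {itemset: support} for all itemsets in >= min_support baskets."""
    items = sorted(set().union(*baskets))
    result = {}
    for size in range(1, len(items) + 1):
        found_any = False
        for candidate in combinations(items, size):
            support = sum(1 for basket in baskets if set(candidate) <= basket)
            if support >= min_support:
                result[frozenset(candidate)] = support
                found_any = True
        if not found_any:
            # No frequent itemset of this size, so none of a larger size
            # can be frequent either (every subset would have to be frequent).
            break
    return result


frequent = frequent_itemsets(BASKETS, min_support=3)
for itemset, support in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
```

Brute force is fine for five items; real click-stream data needs candidate pruning, which is the point of the modified classical methods mentioned earlier.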
Association Rules
If-then rules about the contents of baskets.
{i1, i2,…,ik} → j means: “if a basket contains all of i1,…,ik then it is likely to contain j.”
Confidence of this association rule is the probability of j given i1,…,ik.
Association rules example (cont.)
B1 = {m, c, b} B2 = {m, p, j}
B3 = {m, b} B4 = {c, j}
B5 = {m, p, b} B6 = {m, c, b, j}
B7 = {c, b, j} B8 = {b, c}
An association rule: {m, b} → c. Confidence = 2/4 = 50%.
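The 2/4 figure can be reproduced with a few lines. This is a sketch using the same eight baskets: confidence is computed as the fraction of baskets containing the left-hand side that also contain the right-hand item. `confidence` is a hypothetical helper name.

```python
# The eight baskets from the slide (m=milk, c=coke, p=pepsi, b=beer, j=juice).
BASKETS = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]


def confidence(baskets, antecedent, consequent):
    """Estimate P(consequent | antecedent) from basket counts."""
    antecedent = set(antecedent)
    matching = [basket for basket in baskets if antecedent <= basket]
    if not matching:
        return 0.0  # rule never applies; treat confidence as 0 by convention
    hits = sum(1 for basket in matching if consequent in basket)
    return hits / len(matching)


rule_confidence = confidence(BASKETS, {"m", "b"}, "c")
print(f"{{m, b}} -> c: {rule_confidence:.0%}")
```

Four baskets (B1, B3, B5, B6) contain both m and b, and two of them (B1, B6) also contain c.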
Association rules in Web-logs
Searching for rules that connect two or more events, e.g.
60% of the users that visited the URL /company/product also visited /company/product/product1.html
30% of the users that visited the URL /company/special-offer/ also visited /company/product2.html
Profiling using time dimension
Searching for rules that connect two or more events, taking into account the time dimension:
30% of the users that visited the URL /company/product/product1.html also searched for the words W1 and W2 on Yahoo in the last week
60% of the users that ordered product1 also ordered product2 within the next 15 days
Classification rules
Identification of behavior for groups of users; additional information can be obtained from cookies, registration, etc.:
Users that frequently visit page /company/products/product3.html are from educational institutions
50% of the users that visited /company/products/product4.html are in age group of 20-25 and live at the sea coast
Real-Time Data-Analysis
At some web servers there are too many hits to be saved and analyzed off-line:
  …we have a data stream, with no time or space for off-line data analysis (e.g. search engines, shops, banks, news, …)
  …we would like to understand what is going on, e.g. to detect anomalies or changes in trends
The solution is to use special methods for online event analysis:
  Methods are able to analyze non-stationary data
  At each moment results (models) are in human-readable form (e.g. decision trees, rules, …)
  …no need to save log files
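The slides do not name a specific online method. As one hedged illustration of the idea, the sketch below keeps an exponentially weighted running mean and variance of a hit-count stream and flags values far from the current mean. The class name, the parameters `alpha` and `threshold`, and the sample hit counts are all assumptions made for this example. Nothing is stored except two numbers, so no log file is needed.

```python
class StreamingAnomalyDetector:
    """Exponentially weighted mean/variance tracker (illustrative only)."""

    def __init__(self, alpha=0.1, threshold=4.0):
        self.alpha = alpha          # weight of the newest observation
        self.threshold = threshold  # flag if |deviation| > threshold * std
        self.mean = None
        self.var = 0.0

    def update(self, value):
        """Consume one observation; return True if it looks anomalous."""
        if self.mean is None:
            self.mean = float(value)  # first observation initialises the mean
            return False
        deviation = value - self.mean
        std = self.var ** 0.5
        is_anomaly = std > 0 and abs(deviation) > self.threshold * std
        # Older data decays geometrically, so the model can follow
        # slow drift in a non-stationary stream.
        self.mean += self.alpha * deviation
        self.var = (1 - self.alpha) * (self.var + self.alpha * deviation ** 2)
        return is_anomaly


# Hypothetical hits-per-minute stream with a burst at the end.
hits = [100, 102, 98, 101, 99, 100, 103, 97, 100, 500]
detector = StreamingAnomalyDetector()
flags = [detector.update(h) for h in hits]
print([h for h, flagged in zip(hits, flags) if flagged])
```

The state at any moment is just a mean and a variance, which keeps the model readable by a human, in the spirit of the requirement above.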
Document visualization
Application: From Web Log to Web Loyalty
A study done by the Harvard Business School indicates that an increase of 5% in customer loyalty can increase profitability from 25% to as much as 80%. (Multimedia Live, 2001)
Web search
Fig. Web search overview, with components labeled The Web, Web crawler, Indexer, Indexes, Ad indexes, Search, and User; the example results page for the query “miele” shows Web results (about 7,310,000 hits) alongside Sponsored Links
Search engine components
Spider (a.k.a. crawler/robot): builds corpus
  Collects web pages recursively
    For each known URL, fetch the page, parse it, and extract new URLs
    Repeat
  Additional pages from direct submissions & other sources
The indexer: creates inverted indexes
  Various policies with respect to which words are indexed, capitalization, support for Unicode, stemming, support for phrases, etc.
Query processor: serves query results
  Front end: query reformulation, word stemming, capitalization, optimization of Booleans, etc.
  Back end: finds matching documents and ranks them
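A minimal sketch of the indexer and back-end roles described above: an inverted index maps each word to the set of documents containing it, and a query is answered by intersecting those sets. The three documents, the lowercase-only tokenization policy, and the function names are assumptions for illustration. Stemming, phrases, and ranking are left out.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every query term (Boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical mini-corpus standing in for crawled pages.
DOCS = {
    "d1": "Web mining discovers patterns",
    "d2": "Web usage mining analyzes log files",
    "d3": "PageRank ranks Web pages",
}
index = build_inverted_index(DOCS)
print(sorted(search(index, "web mining")))
```

The capitalization policy is applied in one place (`tokenize`), so the same normalization is used at indexing time and at query time.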
Typical anatomy of a large-scale crawler.
PageRank
Used by Google
Prioritizes pages returned from search by looking at Web structure.
Importance of a page is calculated based on the number of pages which point to it (backlinks).
Weighting is used to give more importance to backlinks coming from important pages.
PageRank (cont’d)
PR(p) = c (PR(1)/N1 + … + PR(n)/Nn)
PR(i): PageRank of a page i which points to target page p.
Ni: number of links coming out of page i
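The formula can be iterated directly on a small graph. The sketch below assumes that c is chosen at each step so that all ranks sum to 1, and that every page has at least one outgoing link; neither is stated on the slide. The three-page graph and the function name are made up for illustration. This is the simplified formula shown here, not the damped version used in production search engines.

```python
def simplified_pagerank(links, iterations=100):
    """Iterate PR(p) = c * sum(PR(i) / N_i) over pages i linking to p.

    `links` maps each page to the list of pages it links to.
    ASSUMPTIONS: every page has outgoing links, every link target is a key
    of `links`, and c normalises the ranks so that they sum to 1.
    """
    pages = sorted(links)
    rank = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        new_rank = {page: 0.0 for page in pages}
        for source, targets in links.items():
            share = rank[source] / len(targets)  # PR(i) / N_i
            for target in targets:
                new_rank[target] += share
        total = sum(new_rank.values())
        rank = {page: value / total for page, value in new_rank.items()}  # factor c
    return rank


# Hypothetical graph: A links to B and C, B links to C, C links back to A.
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = simplified_pagerank(LINKS)
print({page: round(value, 3) for page, value in ranks.items()})
```

C collects rank from both A and B, and passes all of it back to A, so A and C end up with equal rank and B with half as much.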
WSM
Self-Similarity of Internet Traffic
Internet Invariant
Scale Free Network
Self-Similarity of Internet Traffic (Measured) and Not in Poisson or Ordinary Telephone Traffic
Internet Invariant
FTP transfers, Pareto tail
Interarrival time of packets, heavy-tailed
Connection duration, lognormal
TCP connections/Web session, heavy-tailed
Session duration, Pareto
…
Martin J. Fischer et al., “Analyzing the Waiting Time Process in Internet Queueing Systems With the Transform Approximation Method”
Random Networks (Erdos/Renyi, 1960)
Average path length L ~ ln N, small;
Clustering coefficient C ~ 0; C: probability that any two nodes are connected to each other, given that they are both connected to a common node (probability that friends of friends are friends)
Regular Networks
High degree of clustering: C ~ 1
Average path length L: large
Small-World Networks
High degree of clustering: C ~ 1
Average path length L: small (due to shortcuts);
D.J.Watts and S.H. Strogatz, Nature 393, pp. 440-442 (1998)
Random, Small-World, and Regular Networks
Network        Clustering C    Path length L
Regular        High            Large
Small-World    High            Small
Random         Low             Small
Examples of small-world networks: power grid, internet, social network, scientific citation network, movie-actor network, etc.
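The two quantities compared above, C and L, can be measured on any graph. The sketch below does so for a ring lattice, a simple stand-in for a regular network in which each node is linked to its k nearest neighbours on a ring. The helper names and the choice n = 20, k = 4 are assumptions for illustration. C is computed as the average over nodes of the fraction of neighbour pairs that are themselves linked, and L as the mean shortest-path length found by breadth-first search.

```python
from collections import deque
from itertools import combinations


def ring_lattice(n, k):
    """Regular network: node v is linked to its k nearest ring neighbours (k even)."""
    adjacency = {v: set() for v in range(n)}
    for v in range(n):
        for step in range(1, k // 2 + 1):
            u = (v + step) % n
            adjacency[v].add(u)
            adjacency[u].add(v)
    return adjacency


def clustering_coefficient(adjacency):
    """Average fraction of a node's neighbour pairs that are linked to each other."""
    total = 0.0
    for neighbours in adjacency.values():
        if len(neighbours) < 2:
            continue  # fewer than two neighbours contributes 0
        pairs = list(combinations(neighbours, 2))
        linked = sum(1 for a, b in pairs if b in adjacency[a])
        total += linked / len(pairs)
    return total / len(adjacency)


def average_path_length(adjacency):
    """Mean shortest-path length over all reachable ordered node pairs (BFS)."""
    total, count = 0, 0
    for source in adjacency:
        distance = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for u in adjacency[v]:
                if u not in distance:
                    distance[u] = distance[v] + 1
                    queue.append(u)
        total += sum(distance.values())
        count += len(distance) - 1
    return total / count


lattice = ring_lattice(n=20, k=4)
C = clustering_coefficient(lattice)
L = average_path_length(lattice)
print(f"C = {C:.3f}, L = {L:.3f}")
```

Adding a few random long-range links to such a lattice and re-running both measurements is one way to explore the shortcut effect behind the small-world row of the table.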
Complex Networks: How are they formed?
Growth
  Starting with a small number of nodes, at every time step a new node with a number (m) of links is added
Preferential attachment
  Barabasi-Albert (BA) model: the probability for node i to acquire a new link is
    $\Pi_i(k_i) = k_i / \sum_j k_j$
  This results in an algebraic degree distribution
    $P(k) \sim k^{-r}$
A. L. Barabasi and R. Albert, Science 286, 509 (1999)
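The two ingredients, growth and preferential attachment, can be sketched in a few lines. The version below is an illustration rather than the authors' code. It assumes a fully connected seed of m + 1 nodes and picks targets from a pool in which each node appears once per link it already has, so the chance of being chosen is proportional to its degree k_i. The function name, seed graph, and parameters are assumptions.

```python
import random


def barabasi_albert(n, m, seed=0):
    """Grow a network to n nodes; each new node attaches m links preferentially.

    ASSUMPTION: the seed network is a complete graph on m + 1 nodes.
    Returns (degree dict, edge list).
    """
    rng = random.Random(seed)
    edges = [(u, v) for u in range(m + 1) for v in range(u + 1, m + 1)]
    degree = {v: m for v in range(m + 1)}
    # Each node appears in the pool once per link end, so a uniform draw
    # from the pool selects node i with probability k_i / sum_j k_j.
    pool = [node for edge in edges for node in edge]
    for new_node in range(m + 1, n):
        targets = set()
        while len(targets) < m:  # m distinct existing nodes
            targets.add(rng.choice(pool))
        for target in targets:
            edges.append((new_node, target))
            degree[target] += 1
            pool.extend((new_node, target))
        degree[new_node] = m
    return degree, edges


degree, edges = barabasi_albert(n=200, m=2)
print("max degree:", max(degree.values()), "min degree:", min(degree.values()))
```

Early nodes tend to accumulate many links while late arrivals stay near degree m, which is the qualitative signature of an algebraic degree distribution.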
Consequence of Algebraic Degree Distribution
Statistical moments
  $\langle k^n \rangle \sim \int_0^\infty k^n \, k^{-r} \, dk$
do not exist for n = [r]-1, [r], …, where [r] is the smallest integer greater than r: networks have no characteristic scales (Scale-Free Networks)
Examples of SFN:
  (1) WWW, r(in)~2.1, r(out)~2.4
  (2) Internet (r~2.5)
  (3) Network of movie actors (r~2.3)
  (4) Electrical power-grid of western US (r~4)
  (5) Scientific citation network (r~3.0)
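The claim that high moments do not exist can be made explicit with a short worked step. The finite cutoffs $k_{\min}$ and $K$ are introduced here only for illustration and are not part of the slide.

```latex
\langle k^n \rangle \;\sim\; \int_{k_{\min}}^{K} k^{\,n-r}\,dk
  \;=\; \frac{K^{\,n-r+1} - k_{\min}^{\,n-r+1}}{n-r+1}
  \qquad (n \neq r-1),
\qquad
\int_{k_{\min}}^{K} k^{-1}\,dk \;=\; \ln\frac{K}{k_{\min}}
  \qquad (n = r-1).
```

As $K \to \infty$ both expressions grow without bound whenever $n \ge r-1$. For the WWW in-degree exponent r ~ 2.1 quoted above, the mean (n = 1) stays finite because 1 < 1.1, while the second moment (n = 2) already diverges, so the degree distribution has no finite variance to set a characteristic scale.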
Alternative Models
For scale-free networks, the preferential attachment probability $\Pi_i(k_i) \sim k_i$ leads to an algebraic degree distribution;
For random networks, the attachment probability does not depend on $k_i$: $\Pi_i(k_i)$ = constant, which leads to an exponential degree distribution: $P(k) \sim e^{-ak}$;
Many realistic networks exhibit the scale-free feature only to a certain extent. Often, algebraic and exponential distributions are observed in different ranges of k.
How robust is the Internet?
SFN is robust against random attacks but vulnerable to malicious intentional attacks
Yuhai Tu, How robust is the Internet? Nature, Vol 406, July 2000
More topics
Privacy issues in Web mining
Crawling the web
Web graph analysis
Structured data extraction
Classification and vertical search
Collaborative filtering
Web advertising and optimization
Mining web logs
Systems issues
Hmm, conclusion?
Web mining should be used by everybody who offers services on the web and is not satisfied with simple access statistics!
The idea is to make something more out of the data already collected by your computer.
It is expected that Web mining will soon become a standard part of a typical web solution.
Marko Grobelnik http://www-ai.ijs.si/MarkoGrobelnik/ Institut Jožef Stefan
Acknowledgements & References
Fowler, T. B., “A Short Tutorial on Fractals and Internet Traffic,” The Telecommunication Review, Volume 10, Mitretek Systems, McLean, VA, pp. 1-14, 1999.
Bastian Germershaus, “Integration of association rules into WUM”
Gao Kun, “Analysis Techniques of Discovered Patterns”
Mengdan Yu, “Mining E-Business Gold”
Stanford CS345 “Data Mining”
Thanks~~~