The Web and Searching for Information

Multimedia Information Systems VO/KU (707.020)

Christoph Trattner

Know-Center

November 23, 2015

Christoph Trattner (Know-Center) The Web and Searching for Information November 23, 2015 1 / 65


Outline

1 Internet and the Web

2 Web as a Graph

3 Navigation Behavior

4 Search

5 Data analysis for navigation

6 Social Web


Internet and the Web

The Web

What are the reasons for the success of the Web?

Network - the Internet

Addressability across the network

Simplicity

Cross platform, extensible, based on standards

Architecture that scales


Internet and the Web

Internet Growth

Figure: Internet Growth (http://wstweb1.ecs.soton.ac.uk/web-observatory/about/tracking-explosive-growth/)


Internet and the Web

Internet Growth

Not only computers, but also other devices (e.g. phones) are connected to the Internet

Billions of devices connected today, 80 billion expected by 2020 (IDATE)

1990: 0.01 PB/month (about 100,000 GB transferred over the Internet per year)

Global internet traffic (Cisco estimates)

1990: 0.001 PB/month

2000: 84 PB/month

2010: more than 20000 PB/month

forecast for 2018: 132000 PB/month


Internet and the Web

Internet Growth

Figure: Internet Map http://www.opte.org/


Internet and the Web

The Web

The fastest growth of any technology in human history

Time to reach 50 million people

Telephone: 75 years

Radio: 35 years

TV: 13 years

The Web: 4 years


Internet and the Web

The Web

Figure: Jakob Nielsen, 100 Million Web Sites, http://www.useit.com/alertbox/web-growth.html

Explosive growth (1991-1997): 850%/year

Rapid growth (1998-2001): 150%/year

Maturing growth (2002-2006): 25%/year


Internet and the Web

The Web

The size of the Web

Visible Web and The Deep Web (behind passwords)

Estimates: Deep Web several orders of magnitude larger

The size of the Web ≈ 1000 billion (2008) http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html

Indexed pages ≈ 50 billion http://www.worldwidewebsize.com/


Internet and the Web

The Web

Figure: http://techcrunch.com/2009/05/08/is-the-growth-of-the-web-slowing-down-or-just-taking-a-breather/


Internet and the Web

The Web

The Incredible Growth Of The Web (1984-2013) infographic: http://www.mediabistro.com/alltwitter/web-growth-history_b48671

Internet users: from 1000 in 1984 to 3 billion in 2014

Web sites: from 130 in 1994 to over one billion in 2014


Internet and the Web

The Web

Queries per day (Google): 10000 in 1998 to 4.4 billion in 2012 (http://www.internetlivestats.com/google-search-statistics/)

Social media users: Facebook - 1 billion, Twitter and LinkedIn - 200 million

Mobile: 1.3 billion smartphones in 2012, over half used for browsing

See more Internet and Web stats: http://www.internetlivestats.com/


Internet and the Web

The Web

Figure: The Web is dead, http://www.wired.com/magazine/2010/08/ff_webrip/all/1

Web grows but its share is sinking

Mobile apps get things done on the Internet without using the Web


Internet and the Web

The Web

The Web Ain’t Dead Yet (And It’s Getting Easier to Create) http://www.wired.com/epicenter/2011/08/web-aint-dead-easier-to-make/

Apps and big platforms: easy to use but hard to program

Figure: HTML5


Web as a Graph

Information retrieval on the Web

How do we access and retrieve data on the Web?

Type a URL

Browse/Navigate

Search

To understand these we need to analyze the Web as a natural phenomenon, as an object of scientific inquiry


Web as a Graph

Navigating the Web

Graph Structure in the Web, Broder et al., 2000

What is the structure of the Web?

Which pages can be accessed by navigation?

How fast can you reach an arbitrary Web page by navigation?

Analysis of a Web crawl: ≈ 200 million pages, 1.5 billion links

Goal: understand Web structure on a macroscopic scale


Web as a Graph

Graph Structure in the Web

Figure: Bow-tie model of the Web graph


Web as a Graph

Graph Structure in the Web

SCC: the heart of the Web

IN: new pages, not discovered yet

OUT: corporate websites

TENDRILS: disconnected from SCC
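The bow-tie regions can be made concrete on a toy graph. The sketch below is only an illustration, not the method of Broder et al.: it assumes a page `core_page` that is known to lie in the SCC, and it lumps tendrils and fully disconnected pages together as `OTHER`. The function names and the link list are hypothetical.

```python
from collections import defaultdict

def reachable(start, edges):
    """Return the set of nodes reachable from `start` by following `edges`."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in edges[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def bow_tie(links, core_page):
    """Split pages into SCC, IN, OUT and OTHER relative to `core_page`."""
    forward, backward = defaultdict(list), defaultdict(list)
    pages = set()
    for src, dst in links:
        forward[src].append(dst)
        backward[dst].append(src)
        pages.update((src, dst))
    from_core = reachable(core_page, forward)   # pages the core can reach
    to_core = reachable(core_page, backward)    # pages that can reach the core
    scc = from_core & to_core
    return {
        "SCC": scc,
        "IN": to_core - scc,
        "OUT": from_core - scc,
        "OTHER": pages - from_core - to_core,
    }

# Hypothetical toy Web: a, b, c form a cycle (the core).
links = [("a", "b"), ("b", "c"), ("c", "a"),
         ("new", "a"), ("c", "corp"), ("new", "tendril"), ("x", "y")]
parts = bow_tie(links, "a")
for region, members in parts.items():
    print(region, sorted(members))
```

Two graph traversals (one along the links, one against them) are enough here because every page in the SCC can reach, and be reached from, every other SCC page.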


Web as a Graph

Graph Structure in the Web

The diameter of SCC is 28

The diameter of the graph is over 500

Two randomly chosen pages are connected by a path in only 24% of the cases

Average directed path length around 16

Average undirected path length around 6


Web as a Graph

Graph Structure in the Web

Figure: In-degree distribution on the Web


Web as a Graph

Graph Structure in the Web

In-degree: Power law with exponent 2.1

Graph Evolution: Densification and Shrinking Diameters by J. Leskovec, 2007.

Study of various real world graphs

Densification: the number of edges grows superlinearly in the number of nodes over time

Average distance between nodes often shrinks

Shrinking diameter as graph grows

The current Web graph has a similar structure
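The two regularities on this slide can be written compactly. The symbols k, N(t), E(t) and a are notation chosen here for illustration and do not appear in the slides; the range of the exponent a is the one reported in the densification paper.

```latex
% In-degree distribution of the Web graph (power law)
P(\text{in-degree} = k) \propto k^{-2.1}

% Densification: number of edges E(t) versus number of nodes N(t)
E(t) \propto N(t)^{a}, \qquad 1 < a < 2
```

An exponent a = 1 would mean a constant average degree, so a > 1 expresses that the graph becomes denser as it grows.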


Navigation Behavior

Navigational behavior on the Web

Study by Huberman in 1998: Strong Regularities in World Wide Web Surfing

Model gives a probability distribution for the number of pages (depth) a user will visit in a site

Observing the number of links users follow on a website

Theoretical model confirmed with the log analysis of several large websites


Navigation Behavior

Navigational behavior on the Web

Figure: Number of links followed (clicks) vs. number of users (frequency)


Navigation Behavior

Navigational behavior on the Web

Study by Gleich et al. in 2010

Tracking the Random Surfer: Empirically Measured Teleportation Parameters in PageRank

Teleportation parameter α is the probability that a user will not follow a link but will jump to another page, e.g. by typing a URL in the address bar

Google estimated this parameter, setting α = 0.15

Study measured α empirically
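To illustrate what "measuring α empirically" can mean, here is a deliberately simplified sketch. It is not the estimation procedure of Gleich et al.: it assumes a single browsing session given as a list of visited pages and a known link structure, and it counts every transition that does not follow an existing link as a teleport. The names `estimate_alpha`, `links` and `session` are hypothetical.

```python
def estimate_alpha(session, links):
    """Fraction of page transitions in `session` that do not follow a link."""
    transitions = list(zip(session, session[1:]))
    if not transitions:
        return 0.0
    teleports = sum(
        1 for src, dst in transitions if dst not in links.get(src, set())
    )
    return teleports / len(transitions)

# Hypothetical link structure: page -> set of pages it links to.
links = {
    "home": {"news", "about"},
    "news": {"story"},
    "story": {"home"},
}
# Hypothetical session: two of the five transitions are jumps, not clicks.
session = ["home", "news", "story", "about", "home", "news"]
alpha = estimate_alpha(session, links)
print(f"estimated alpha = {alpha:.2f}")
```

Real toolbar logs are noisier than this (back button, tabs, search result pages), which is part of why a dedicated study was needed.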


Navigation Behavior

Navigational behavior on the Web

Browser toolbar logs of Microsoft toolbar

The entire Web: α ≈ 0.35

HelloMovies (structured hierarchical navigation): α ≈ 0.35

Wikipedia: α ≈ 0.6

Findings: Users still navigate

Wikipedia vs. HelloMovies: more link structure → more navigation


Navigation Behavior

Navigation: summary of problems

Web graph not completely connected

Central navigational structures are not possible

Users do not follow too many links

But, still users navigate!

Some further studies showed the importance of combining search and navigation: first search, then navigate, then refine, etc.


Search

Search engines - categories

Index search engines with spiders/robots

Catalog search engines (Web directories)

Combinations of index and catalog search engines

Meta search engines

Recommendation systems


Search

Search engine architecture

Figure: Generic search engine architecture


Search

Index search engines

Robots collect data by following links

Complex web pages: HTML, CSS, JavaScript, text as graphics, Flash, frames... → problems for robots

Gathered data stored in database (page repository)

Indexing module analyses pages and writes them into an index

Query module searches in the index using keywords

Ranking module sorts results according to estimated relevance
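The modules listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not how any real search engine is built: the page repository is a plain dictionary (a robot would fill it), the index is an in-memory inverted index, queries are AND queries, and ranking uses raw word counts. All names and URLs are hypothetical.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(repository):
    """Indexing module: map each word to the set of URLs containing it."""
    index = defaultdict(set)
    for url, text in repository.items():
        for word in tokenize(text):
            index[word].add(url)
    return index

def query(index, keywords):
    """Query module: return URLs that contain all keywords."""
    words = tokenize(keywords)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

def rank(repository, urls, keywords):
    """Ranking module: sort by total keyword count, ties broken by URL."""
    words = tokenize(keywords)
    def score(url):
        tokens = tokenize(repository[url])
        return sum(tokens.count(w) for w in words)
    return sorted(urls, key=lambda u: (-score(u), u))

# Hypothetical page repository, as a robot might have collected it.
repository = {
    "http://example.org/fruit": "Fresh fruit and more fruit from our online fruit shop",
    "http://example.org/veg": "Vegetables and fruit delivered",
    "http://example.org/about": "About our shop",
}
index = build_index(repository)
results = rank(repository, query(index, "fruit"), "fruit")
print(results)
```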


Search

Ranking of search results

No problems in finding information

Problems in ranking (millions of) results

Ranking strategies

Word counts (how many times does a search word appear?)

Proximity (how close are the search words to each other?)

Position of words in a document (title, meta tags, ...)

Title and meta information: <meta name="keywords" content="fruits, vegetables">, <meta name="description" content="online fruit-shop">
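The three strategies can be combined into one toy score. The weighting below (a title hit counts three times, proximity adds the inverse of the smallest word gap) is an arbitrary assumption made for this sketch, not a scheme taken from the slides or from any search engine, and the function names are hypothetical. The two query words are assumed to be different.

```python
def min_distance(tokens, word_a, word_b):
    """Smallest gap in words between word_a and word_b, or None if one is missing."""
    pos_a = [i for i, t in enumerate(tokens) if t == word_a]
    pos_b = [i for i, t in enumerate(tokens) if t == word_b]
    if not pos_a or not pos_b:
        return None
    return min(abs(i - j) for i in pos_a for j in pos_b)

def score(title, body, word_a, word_b, title_weight=3.0):
    """Toy relevance score: word counts + weighted title hits + proximity bonus."""
    title_tokens = title.lower().split()
    body_tokens = body.lower().split()
    counts = sum(body_tokens.count(w) for w in (word_a, word_b))
    title_hits = sum(title_tokens.count(w) for w in (word_a, word_b))
    gap = min_distance(body_tokens, word_a, word_b)
    proximity = 1.0 / gap if gap else 0.0
    return counts + title_weight * title_hits + proximity

# Hypothetical document and two-word query.
title = "Online fruit shop"
body = "we sell fresh fruit and organic vegetables"
gap = min_distance(body.split(), "fruit", "vegetables")
value = score(title, body, "fruit", "vegetables")
print(gap, round(value, 3))
```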


Search

Google ranking

Currently Google has the best results

Ranking of Google based on two components

Hits (in content)

PageRank (most important)

How does it work?

Find documents with hits, calculate weights

Apply PageRank


Search

Google ranking

Plain hits - full-text hits (words are somewhere in the text)

URL

Title (second most important)

Anchor text (most important)

Meta Tags

Font sizes of the text - relative to the document


Search

Google ranking

Why is the idea of using anchor text cool?

If destination document is an image you can still find it!
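A small sketch shows why this works: the words of the anchor text are indexed under the URL the link points to, so a target without any text of its own (such as an image) still becomes findable. The data layout and the name `index_anchor_text` are assumptions made for this illustration.

```python
import re
from collections import defaultdict

def index_anchor_text(links):
    """Map each anchor-text word to the set of link targets it describes."""
    index = defaultdict(set)
    for source_url, target_url, anchor_text in links:
        for word in re.findall(r"[a-z0-9]+", anchor_text.lower()):
            index[word].add(target_url)
    return index

# Hypothetical links as (source page, target, anchor text).
links = [
    ("http://example.org/blog", "http://example.org/img/graz.jpg", "Photo of Graz clock tower"),
    ("http://example.org/travel", "http://example.org/img/graz.jpg", "Graz at night"),
    ("http://example.org/blog", "http://example.org/about.html", "About us"),
]
anchor_index = index_anchor_text(links)
print(anchor_index["graz"])
```

The same mechanism is what makes the bombing technique described later in the lecture possible: the text is supplied by the linking pages, not by the target.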


Search

Google ranking

PageRank

Google robot investigates links on the Web

Calculate link statistics for the Web

Pages that have more links pointing to them get higher PageRank

Higher PageRank - more relevant

Pointing pages also have a PageRank

PageRank contributes to PageRank - recursive definition


Search

Google Ranking

PageRank (formula)

$$PR(A) = (1 - d) + d \left( \frac{PR(T_1)}{C(T_1)} + \dots + \frac{PR(T_n)}{C(T_n)} \right)$$

d - constant, usually 0.85

PR(A) - PageRank of Page A

T_1 ... T_n - all pages pointing to Page A

PR(T_1) ... PR(T_n) - PageRanks of pages pointing to Page A

C(T_x) - number of outgoing links from Page T_x


Search

Google Ranking

Formula is iterative

To calculate PR(A) you need to know PR(T1) ... PR(Tn) - but you don’t know them

Start with 0 for all PRs and iterate until there is no difference in values

The formula converges ;)

For small networks 20-40 iteration steps needed

For big networks - hundreds of iterations (but each iteration is extremely costly for the Web graph)
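The iteration described above can be sketched as follows. This is a minimal illustration of the slide formula PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)), not the code behind the lecture's online examples; the function name, the stopping tolerance `tol` and the two toy graphs are assumptions.

```python
def pagerank(links, d=0.85, tol=1e-9, max_iter=1000):
    """Iterate the PageRank formula from all zeros until values stop changing.

    `links` maps each page to the list of pages it points to.
    Returns (PageRank per page, number of iterations used).
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    out_degree = {p: len(links.get(p, ())) for p in pages}
    incoming = {p: [] for p in pages}
    for src, targets in links.items():
        for dst in targets:
            incoming[dst].append(src)

    pr = {p: 0.0 for p in pages}  # start with 0 for all PRs
    for iteration in range(1, max_iter + 1):
        new_pr = {
            p: (1 - d) + d * sum(pr[t] / out_degree[t] for t in incoming[p])
            for p in pages
        }
        change = max(abs(new_pr[p] - pr[p]) for p in pages)
        pr = new_pr
        if change < tol:
            break
    return pr, iteration

# Hypothetical closed cycle A -> B -> C -> A: no PageRank is lost.
cycle, cycle_steps = pagerank({"A": ["B"], "B": ["C"], "C": ["A"]})
# Hypothetical graph where C has no outgoing links and keeps what it receives.
sink, sink_steps = pagerank({"A": ["B", "C"], "B": ["C"]})
print(cycle, cycle_steps)
print(sink, sink_steps)
```

In the cycle every page converges to 1, so the average is 1. In the second graph the average stays below 1 and PR(C) > PR(B) > PR(A).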


Search

Google ranking

Each page gives its PageRank to pages that it points to

No discrimination: each page shares its PageRank equally among its outgoing links, $PR(T_n) / C(T_n)$ per link

PageRank forms a probability distribution of pages being accessed

The normalized sum of all PRs (in closed topology) is equal to 1

$$\frac{1}{n} \sum_{i=1}^{n} PR(T_i) = 1$$


Search

Google ranking

PageRank Example 1

Calculate http://kmi.tugraz.at/staff/vsabol/courses/mmis1/examples/google/google_01.php

Source Code http://kmi.tugraz.at/staff/vsabol/courses/mmis1/examples/google/google_01.phps


Search

Google ranking

PageRank Example 2

Calculate http://kmi.tugraz.at/staff/vsabol/courses/mmis1/examples/google/google_02.php

Source Code http://kmi.tugraz.at/staff/vsabol/courses/mmis1/examples/google/google_02.phps


Search

Google ranking

PageRank Example 3

Calculate http://kmi.tugraz.at/staff/vsabol/courses/mmis1/examples/google/google_03.php

Source Code http://kmi.tugraz.at/staff/vsabol/courses/mmis1/examples/google/google_03.phps


Search

Google ranking

Average < 1

Page C saves its PR

If no page saves its PR: Average = 1

If the number of pages is very high: Average ≈ 1


Search

Google ranking

PageRank calculates the probability that you reach a page if you browse “randomly”

Each page gets at least something:

Obviously: PR(C) > PR(B) > PR(A)


Search

Google ranking

PageRank Example 4

Calculate http://kmi.tugraz.at/staff/vsabol/courses/mmis1/examples/google/google_04.php

Source Code http://kmi.tugraz.at/staff/vsabol/courses/mmis1/examples/google/google_04.phps


Search

Google ranking

PageRank Example 5

Calculate http://kmi.tugraz.at/staff/vsabol/courses/mmis1/examples/google/google_05.php

Source Code http://kmi.tugraz.at/staff/vsabol/courses/mmis1/examples/google/google_05.phps


Search

Google ranking

PageRank Example 6

Calculate http://kmi.tugraz.at/staff/vsabol/courses/mmis1/examples/google/google_06.php

Source Code http://kmi.tugraz.at/staff/vsabol/courses/mmis1/examples/google/google_06.phps

Receiving PR externally is good for a PR of a site


Search

Google ranking

Analysis of results

Hierarchy increases PR of the page on the top (homepage)

If you point out you give away some of your PR

Hope that what you give you will get back

If links point to your homepage you will get a lot of PR

Especially if a page with high PR points in!


Search

Google ranking

What does PageRank actually measure?

Popularity!

People create links to a page because they know about the page!

Well-known page gets a lot of links - high PR

It relies on the very nature of the Web and its community

Reasons for the success!


Search

Google ranking

PageRank Example 7 (Google Bombing)

Calculate http://kmi.tugraz.at/staff/vsabol/courses/mmis1/examples/google/google_07.php

Source Code http://kmi.tugraz.at/staff/vsabol/courses/mmis1/examples/google/google_07.phps

1000 spam pages, no wasted PageRank


Search

Google ranking

PageRank Example 8 (Google Bombing)

Calculate http://kmi.tugraz.at/staff/vsabol/courses/mmis1/examples/google/google_08.php

Source Code http://kmi.tugraz.at/staff/vsabol/courses/mmis1/examples/google/google_08.phps

external page with a huge PageRank


Search

Google ranking

Analysis of results (Google Bombing)

Quality of the page is most important

People will point to your page!

Google can remove you from the index because of bombing!


Search

Google ranking

When can bombing be “successful”?

Take some unusual anchor text

Make many links with that text to a known page

Submit a query with that text

Famous bomb: http://www.google.com/search?q=miserable+failure

Jokes, cannot earn you money

Useful for political activism and raising awareness


Search

Google ranking

Original PageRank paper: http://www-db.stanford.edu/~backrub/google.html

Mathematical analysis of PageRank

Langville and Meyer: Deeper inside PageRank, 2004

PR as a variant of eigenvector centrality of the Web graph


Search

Google ranking

Centrality measures: identifying most important nodes in a graph

Eigenvector centrality

A node is important if it is connected to other important nodes

Issue 1: in a directed network a node with no incoming links has an eigenvector centrality of 0

Correct by giving each page a small amount of centrality (α)


Search

Google ranking

Issue 2: a high-centrality node with many links passes huge amounts of centrality to its targets

Correct this by splitting the centrality equally among all linked nodes

PageRank made exactly those two corrections
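The two corrections can be followed step by step in formulas. The notation is an assumption made for this sketch: x_i is the centrality of page i, A_{ji} = 1 if page j links to page i (0 otherwise), λ and β are constants, and C(j) is the number of outgoing links of page j as in the PageRank formula.

```latex
% Eigenvector centrality: a page is important if important pages link to it
x_i = \frac{1}{\lambda} \sum_j A_{ji}\, x_j

% Correction 1: every page receives a small constant amount \alpha
x_i = \alpha + \beta \sum_j A_{ji}\, x_j

% Correction 2: page j splits its centrality equally over its C(j) outgoing links
x_i = \alpha + \beta \sum_j A_{ji}\, \frac{x_j}{C(j)}

% With \alpha = 1 - d and \beta = d this is the PageRank formula
PR(i) = (1 - d) + d \sum_j A_{ji}\, \frac{PR(j)}{C(j)}
```

With d = 0.85 the constant term is α = 0.15, which matches the teleportation value quoted earlier for Google.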


Search

Catalog Search Engines

Listing of Web sites organised into a hierarchical structure

Editorial office checks links/pages

Smaller number of pages

Easier to find things (for beginners)

Yahoo directory (http://dir.yahoo.com),

DMOZ open directory project (http://www.dmoz.org/)


Search

Meta Search Engines

Search several search engines simultaneously and collect the results

No specific syntax to learn

Cannot use special features of search engines

Issues with ranking (often round robin)

Add additional capabilities (e.g. result clustering)

Mamma (mother of all search engines) http://www.mamma.com

Clusty (previously Vivisimo) http://www.clusty.com
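Round-robin ranking, mentioned above as a common choice for meta search engines, can be sketched as follows. This is an illustration only: the result lists and the name `round_robin_merge` are hypothetical, and real meta search engines also have to normalize URLs before detecting duplicates.

```python
from itertools import zip_longest

def round_robin_merge(result_lists):
    """Interleave ranked result lists: all first hits, then all second hits, ..."""
    merged, seen = [], set()
    for tier in zip_longest(*result_lists):
        for url in tier:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged

# Hypothetical ranked results from two underlying search engines.
engine_a = ["a.example/1", "shared.example", "a.example/2"]
engine_b = ["shared.example", "b.example/1"]
merged = round_robin_merge([engine_a, engine_b])
print(merged)
```

The weakness is visible in the sketch: the merge looks only at positions, so it cannot tell whether the second hit of one engine is better than the first hit of another.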


Data analysis for navigation

Recommendation systems

Data and link analysis to support navigation

Automatic creation of recommendations

Collaborative filtering: recommendations based on past behaviour of users

Content-based filtering: recommendations based on similar item properties

Automatic creation of hierarchies (for navigation)

Automatic creation of overviews, etc.


Data analysis for navigation

Recommendation systems

Bookstore: client likes a book; users who liked that book were interested in these books as well

Shops

Books, videos, CDs, ...

e.g. Amazon.com
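The bookstore case is the simplest form of collaborative filtering and can be sketched by counting co-occurrences. This is a toy illustration, not how Amazon.com or any real shop computes recommendations; the data and the name `also_liked` are hypothetical.

```python
from collections import Counter

def also_liked(likes, book, top_n=3):
    """Books most often liked by users who also liked `book`."""
    counts = Counter()
    for user_books in likes.values():
        if book in user_books:
            counts.update(b for b in user_books if b != book)
    # Most frequent first; ties broken alphabetically for a stable result.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [b for b, _ in ordered][:top_n]

# Hypothetical past behaviour: user -> set of liked books.
likes = {
    "ann": {"Dune", "Neuromancer", "Foundation"},
    "bob": {"Dune", "Foundation"},
    "cid": {"Dune", "Emma"},
    "dee": {"Emma", "Persuasion"},
}
recommendations = also_liked(likes, "Dune")
print(recommendations)
```

Only user behaviour is used here; a content-based approach would instead compare properties of the books themselves (author, genre, keywords).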


Data analysis for navigation

The End

Any questions?


Data analysis for navigation

The End

30.11.2015: Visualization in the Web

07.12.2015: Introduction to Social Web

14.12.2015: Recent Trends in Social Media and the second partial exam
