Date post: 31-Dec-2015

The WWW as a Database: WWW Query Languages

Curtis Dyreson

James Cook University

(Townsville, Australia)

Aalborg University

Outline

• searching the WWW
- search engines
- WWW query languages

• WebSQL
- WWW graph
- cost

• Jumping Spider
- hybrid

Searching the WWW

• search engines
- Altavista, Infoseek, 2100 others!

• static architecture
- robot: periodic, slow, non-uniform coverage
- index: keywords to URLs, fast, ranking algorithm

• example query

Lecture notes on trees in a data structures course.

A Search Engine Index

[Figure, built up over six slides: a search engine index with entries for data structures, lecture notes, and trees.]
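The index described above maps keywords to URLs, and a multi-keyword query intersects the matching URL sets. The following is a minimal Python sketch of that idea, not the design of any particular engine: the page URLs and texts are hypothetical, and ranking is left out.

```python
from collections import defaultdict

# Hypothetical pages (URL -> text); not taken from the slides.
PAGES = {
    "http://example.org/cs/ds/": "data structures course home",
    "http://example.org/cs/ds/notes.html": "lecture notes for data structures",
    "http://example.org/cs/ds/trees.html": "lecture notes on trees",
}

def build_index(pages):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Return the URLs that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())  # intersect the URL sets
    return result

INDEX = build_index(PAGES)
```

With this toy data, `search(INDEX, "lecture notes trees")` finds the trees page, while `search(INDEX, "trees data structures")` finds nothing, because no single page contains all of those words.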

WWW Query Languages

• search engines index single pages

• multi-page concepts

• hunting strategy
- search engine to nearby page
- manual search

• WWW query languages
- WebSQL, W3QS, WebLog

WWW Graph Structure

• large (650K servers, 350M pages)

• dynamic, cyclic

• page = node, link = edge

WebSQL

• SQL-like

• search engine to find pages

• path expression (regular expression of links)

• text manipulation predicates

SELECT <attribute list>
FROM <document list>
WHERE <predicate>;

WebSQL From Clause

• from clause collects a set of documents

• unstructured - primitive schema

Document[URL, text, link to URL, modify date]

• MENTIONS - retrieve from search engine

DOCUMENT x SUCH THAT x MENTIONS ‘data structures’

SELECT z.URL
FROM DOCUMENT x SUCH THAT x MENTIONS ‘data structures’,
     DOCUMENT y SUCH THAT x -> y,
     DOCUMENT z SUCH THAT y ->* z
WHERE y CONTAINS ‘lecture notes’
  AND z CONTAINS ‘trees’;

WebSQL From Clause

• path expression finds related documents

• URL

DOCUMENT x SUCH THAT “http://www.cs.auc.dk”

• local link: ->

DOCUMENT y SUCH THAT x -> y

• global link: =>

DOCUMENT y SUCH THAT x => y
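To make the local/global distinction concrete, here is a small Python sketch that labels a link by comparing host names. Treating "local" as "same host" is an assumption made for this illustration, and `link_kind` is a hypothetical helper, not part of WebSQL.

```python
from urllib.parse import urlparse

def link_kind(source_url, target_url):
    """Return '->' for a local link (same host) or '=>' for a global link.

    Assumption for this sketch: two URLs are local to each other
    exactly when their host parts are equal, ignoring case.
    """
    source_host = urlparse(source_url).netloc.lower()
    target_host = urlparse(target_url).netloc.lower()
    return "->" if source_host == target_host else "=>"
```

For example, a link from `http://www.cs.auc.dk/` to `http://www.cs.auc.dk/courses/` is labelled `->`, and a link to `http://example.org/` is labelled `=>`.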

WebSQL From Clause

• at most one link: ?

DOCUMENT y SUCH THAT x ->(->)? y

• any number of links: *

DOCUMENT y SUCH THAT x ->* y

• alternation: |

DOCUMENT y SUCH THAT x (=> | ->*) y
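Because a path expression is a regular expression over links, it can be sketched with an ordinary regex engine. The Python sketch below is an illustration under stated assumptions, not the WebSQL implementation: each local link is encoded as the letter `L` and each global link as `G`, so `->(->)?` becomes `LL?` and `(=> | ->*)` becomes `G|L*`. The graph and the `max_links` cutoff are hypothetical.

```python
import re

# Hypothetical link graph: page -> list of (link kind, target page).
# "L" stands for a local link (->) and "G" for a global link (=>).
GRAPH = {
    "a": [("L", "b"), ("G", "d")],
    "b": [("L", "c")],
    "c": [("L", "a")],  # cycle back to a
    "d": [],
}

def matching_pages(graph, start, pattern, max_links):
    """Pages reached from start by a path whose link kinds match pattern.

    Paths are enumerated exhaustively up to max_links links, so the
    search terminates even though the graph contains a cycle.
    """
    compiled = re.compile(pattern)
    found = set()
    stack = [(start, "")]
    while stack:
        page, labels = stack.pop()
        if compiled.fullmatch(labels):
            found.add(page)
        if len(labels) < max_links:
            for kind, target in graph.get(page, []):
                stack.append((target, labels + kind))
    return found
```

On this graph, `matching_pages(GRAPH, "a", "LL?", 3)` yields pages `b` and `c`, and `matching_pages(GRAPH, "a", "G|L*", 3)` yields all four pages, since `L*` also matches the empty path at `a`.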

WebSQL From Clause: Example

FROM Document X SUCH THAT X MENTIONS ‘Java’,
     Document Y SUCH THAT X -> | ->-> Y

[Figure, built up over four slides: pages one or two local links away from a page that mentions Java.]

WebSQL From Clause

• path expression limits search space

• local link, search limited to local machine

• global link, can go anywhere

• =>* would search all of WWW

• pre-analysis, filtering

• even three to four local links infeasible

WebSQL Where Clause

• like SQL

• CONTAINS, text search of retrieved document

• can push CONTAINS into navigation

WHERE y CONTAINS ‘lecture notes’ AND y.length < 4000;

WebSQL Query

• Find lecture notes on trees in a data structures course.

SELECT z.URL
FROM DOCUMENT x SUCH THAT x MENTIONS ‘data structures’,
     DOCUMENT y SUCH THAT x -> y,
     DOCUMENT z SUCH THAT y ->* z
WHERE y CONTAINS ‘lecture notes’
  AND z CONTAINS ‘trees’;
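The query can be read as three nested navigation steps. The Python sketch below evaluates it over a hypothetical single-site web held in memory; it is an illustration of the query's meaning, not of the WebSQL engine. The page texts, the `mentions` stand-in for the search engine, and all function names are assumptions.

```python
# Hypothetical single-site web: URL -> (text, local links).
WEB = {
    "/ds/": ("data structures course", ["/ds/notes/", "/ds/staff.html"]),
    "/ds/notes/": ("lecture notes index",
                   ["/ds/notes/lists.html", "/ds/notes/trees.html"]),
    "/ds/notes/lists.html": ("linked lists", ["/ds/notes/"]),
    "/ds/notes/trees.html": ("binary trees and heaps", ["/ds/notes/"]),
    "/ds/staff.html": ("teaching staff", ["/ds/"]),
}

def contains(url, phrase):
    """CONTAINS: text search of one retrieved document."""
    return phrase in WEB[url][0]

def mentions(phrase):
    """Stand-in for the search engine lookup behind MENTIONS."""
    return [url for url in WEB if contains(url, phrase)]

def closure(url):
    """All pages reachable over zero or more local links (->*)."""
    seen, stack = {url}, [url]
    while stack:
        for target in WEB[stack.pop()][1]:
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen

def run_query():
    results = set()
    for x in mentions("data structures"):
        for y in WEB[x][1]:                    # x -> y
            if not contains(y, "lecture notes"):
                continue                       # CONTAINS applied during navigation
            for z in closure(y):               # y ->* z
                if contains(z, "trees"):
                    results.add(z)             # SELECT z.URL
    return results
```

Checking `y CONTAINS` before expanding `y ->* z` avoids computing the closure for pages that cannot contribute to the result.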

[Figure, built up over eight slides: the query evaluated on the WWW graph. First data structures -> lecture notes, then lecture notes ->* trees; the result is the path through the data structures, lecture notes, and trees pages.]

WebSQL Example

WebSQL Architecture

• Java implementation

WWW Query Language - Drawbacks

• dynamic architecture

• O(k**p)

- p is length of path expression

- k is branching factor

• a priori knowledge of topology

• back links are a problem
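A quick calculation shows why dynamic evaluation is costly. The Python sketch below assumes, for illustration only, that every page has the same number of outgoing links and that every link is followed; under those assumptions the pages fetched at depths 1 through p add up to k + k**2 + ... + k**p.

```python
def pages_fetched(branching_factor, path_length):
    """Upper bound on pages fetched when every link is followed.

    Assumes each page has exactly branching_factor outgoing links
    and that no page is reached twice (the worst case).
    """
    return sum(branching_factor ** depth
               for depth in range(1, path_length + 1))
```

With 10 links per page, a path expression of length 4 already touches up to `pages_fetched(10, 4)`, which is 11110 pages.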

Jumping Spider - a Hybrid

• like a search engine

- static architecture

- keyword searches

• like a WWW query language

- uses modified WWW graph

- one kind of path expression

Kinds of Links

• content refinement queries are common

• heuristic

information in subdirectories is refined

• different kinds of links

back - subdirectory to parent

down - parent directory to subdirectory

side - unrelated directories
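The three kinds of links can be told apart from the URLs alone. The Python sketch below is one possible reading of the definitions above, not the Jumping Spider code. Two choices are assumptions of this sketch: a link between pages in the same directory is counted as a down link, and a link to a different host is counted as a side link.

```python
import posixpath
from urllib.parse import urlparse

def directory_of(url):
    """Return (host, directory path ending in '/') for a URL."""
    parts = urlparse(url)
    path = parts.path or "/"
    directory = path if path.endswith("/") else posixpath.dirname(path)
    if not directory.endswith("/"):
        directory += "/"
    return parts.netloc.lower(), directory

def classify_link(source_url, target_url):
    """Classify a link as 'down', 'back', or 'side'."""
    source_host, source_dir = directory_of(source_url)
    target_host, target_dir = directory_of(target_url)
    if source_host != target_host:
        return "side"                      # assumption: other hosts are unrelated
    if target_dir.startswith(source_dir):
        return "down"                      # same directory (assumption) or a subdirectory
    if source_dir.startswith(target_dir):
        return "back"                      # subdirectory to parent
    return "side"                          # unrelated directories
```

For example, a link from `http://example.org/cs/` to `http://example.org/cs/ds/notes.html` is a down link, and the reverse direction is a back link.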

Re-using the WWW Graph

[Figures, one slide each: Directory Trees; Down Links; Back Links; Eliminate Back Links; Transitive Closure of Down Links; Plus a Side Link.]
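The restructuring steps named in these slide titles can be sketched as a reachability computation. The Python code below is a guess at the intent, since the figures are not available: it drops back links, follows any number of down links, and allows at most one side link along a path. That exact rule, the link data, and the function name are assumptions of this sketch.

```python
from collections import deque

# Hypothetical classified links: (source, target, kind).
LINKS = [
    ("/cs/", "/cs/ds/", "down"),
    ("/cs/ds/", "/cs/ds/notes/", "down"),
    ("/cs/ds/notes/", "/cs/ds/", "back"),
    ("/cs/ds/notes/", "/math/graphs/", "side"),
    ("/math/graphs/", "/math/graphs/trees/", "down"),
    ("/math/graphs/trees/", "/physics/", "side"),
]

def reachable(links, start, max_side_links=1):
    """Pages reachable from start over down links plus a limited
    number of side links; back links are ignored."""
    graph = {}
    for source, target, kind in links:
        if kind != "back":                       # eliminate back links
            graph.setdefault(source, []).append((target, kind))
    seen = {(start, 0)}
    queue = deque([(start, 0)])
    while queue:
        page, side_used = queue.popleft()
        for target, kind in graph.get(page, []):
            used = side_used + (1 if kind == "side" else 0)
            if used <= max_side_links and (target, used) not in seen:
                seen.add((target, used))
                queue.append((target, used))
    return {page for page, _ in seen} - {start}
```

Starting from `/cs/`, the sketch reaches the two pages below it and, through the single permitted side link, the two `/math/graphs/` pages, but not `/physics/`, which would need a second side link.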

[Figure, built up over five slides: data structures -> lecture notes, then lecture notes -> trees.]

Analysis

• search engine index

- adds a pertinent index

• pertinent index - O(n log n) to O(n**2) space

- all URLs that can reach this URL

- tree-like, so should be close to O(n log n)

• more intersections

• implemented in Perl 5
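The two indexes can be combined so that a multi-page query becomes a chain of set intersections. The sketch below uses Python rather than Perl and is only an illustration under assumptions: both dictionaries hold hypothetical data, and `answer` is a hypothetical helper, not the system's interface.

```python
# Hypothetical keyword index: phrase -> URLs whose page contains it.
KEYWORDS = {
    "data structures": {"/cs/ds/"},
    "lecture notes": {"/cs/ds/notes/", "/math/notes/"},
    "trees": {"/cs/ds/notes/trees.html", "/math/notes/trees.html"},
}

# Hypothetical second index: URL -> all URLs that can reach it.
REACHED_FROM = {
    "/cs/ds/notes/": {"/cs/", "/cs/ds/"},
    "/cs/ds/notes/trees.html": {"/cs/", "/cs/ds/", "/cs/ds/notes/"},
    "/math/notes/": {"/math/"},
    "/math/notes/trees.html": {"/math/", "/math/notes/"},
}

def answer(phrases):
    """URLs matching the last phrase that are reached through pages
    matching each earlier phrase, in order."""
    current = set(KEYWORDS.get(phrases[0], set()))
    for phrase in phrases[1:]:
        current = {
            url
            for url in KEYWORDS.get(phrase, set())
            # keep url only if some page from the previous step reaches it
            if REACHED_FROM.get(url, set()) & current
        }
    return current
```

For the running example, `answer(["data structures", "lecture notes", "trees"])` keeps only the trees page under `/cs/ds/notes/`; the `/math/notes/` pages are dropped because no data structures page reaches them.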

Related Work

• WWW query languages

WebSQL (Arocena et al. - WWW6 ’97)

W3QS (Konopnicki and Shmueli - VLDB ’95)

WebLog (Lakshmanan et al. - RIDE ’96)

AKIRA (Lacroix et al. - ER ’97)

• Indexes that already use directories

Infoseek

WebGlimpse (Manber et al. - Usenix ’97)

• Semi-structured data models - many

Future Work

• scale to size of WWW

• extended query language (negation)

• easier installation

