
CS276A Text Information Retrieval, Mining, and Exploitation

Date post: 06-Jan-2016
CS276A Text Information Retrieval, Mining, and Exploitation Lecture 15 26 Nov 2002
Transcript
Page 1: CS276A Text Information Retrieval, Mining, and Exploitation

CS276A Text Information Retrieval, Mining, and Exploitation

Lecture 15, 26 Nov 2002

Page 2: CS276A Text Information Retrieval, Mining, and Exploitation

Recap: Web Anatomy

[Figure: Web anatomy diagram; labels: www.ibm.com, …/~newbie/, /…/…/leaf.htm]

Page 3: CS276A Text Information Retrieval, Mining, and Exploitation

Recap: Size of the Web

Capture-recapture technique
  Assumes engines get independent random subsets of the Web

E2 contains x% of E1. Assume E2 contains x% of the Web as well.

Knowing the size of E2, compute the size of the Web:
  Size of the Web = 100 * E2 / x

[Figure: overlapping sets E1 and E2 inside the WEB]

Bharat & Broder: 200 M (Nov 97), 275 M (Mar 98)
Lawrence & Giles: 320 M (Dec 97)
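
The estimate above can be sketched in a few lines. This is a minimal illustration, assuming E1 is represented by a sample of its URLs and E2 by a set that supports membership tests; the function name, the example URLs, and the numbers are hypothetical and are not the measurements quoted above.

```python
def estimate_web_size(e1_sample_urls, e2_index, e2_size):
    """Capture-recapture estimate: Size of the Web = 100 * E2 / x."""
    # x = percentage of the E1 sample that E2 also contains
    hits = sum(1 for url in e1_sample_urls if url in e2_index)
    if hits == 0:
        raise ValueError("no overlap between the engines: cannot estimate")
    x = 100.0 * hits / len(e1_sample_urls)
    return 100.0 * e2_size / x


# Hypothetical data: 10 URLs sampled from E1, 4 of which E2 also indexes.
sample = ["u%d" % i for i in range(10)]
e2 = {"u0", "u1", "u2", "u3"}
estimate = estimate_web_size(sample, e2, e2_size=110)
```

The estimate is only as good as the independence assumption: if both engines favour the same popular pages, x is inflated and the Web size is underestimated.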

Page 4: CS276A Text Information Retrieval, Mining, and Exploitation

Recent Measurements

Source: http://www.searchengineshowdown.com/stats/change.shtml

Page 5: CS276A Text Information Retrieval, Mining, and Exploitation

Today’s Topics

Web IR infrastructure
Search deployment
XML intro
XML indexing and search

Page 6: CS276A Text Information Retrieval, Mining, and Exploitation

Web IR Infrastructure

Connectivity Server
  Fast access to links to support link analysis

Term Vector Database
  Fast access to document vectors to augment link analysis

Page 7: CS276A Text Information Retrieval, Mining, and Exploitation

Connectivity Server [CS1: Bhar98b, CS2 & 3: Rand01]

Fast web graph access to support connectivity analysis

Stores mappings in memory from URL to outlinks, URL to inlinks

Applications
  HITS, PageRank computations
  Crawl simulation
  Graph algorithms: web connectivity, diameter, etc.
  Visualizations

Page 8: CS276A Text Information Retrieval, Mining, and Exploitation

Usage

Input: graph algorithm + URLs + values
  URLs to fingerprints (FPs) to IDs
Execution: graph algorithm runs in memory
  IDs to URLs
Output: URLs + values

Translation tables on disk
  URL text: 9 bytes/URL (compressed from ~80 bytes)
  FP (64b) -> ID (32b): 5 bytes
  ID (32b) -> FP (64b): 8 bytes
  ID (32b) -> URLs: 0.5 bytes

Page 9: CS276A Text Information Retrieval, Mining, and Exploitation

ID assignment

Partition URLs into 3 sets, sorted lexicographically
  High: Max degree > 254
  Medium: 254 > Max degree > 24
  Low: remaining (75%)

IDs assigned in sequence (densely)

E.g., HIGH IDs: Max(indegree, outdegree) > 254

  ID        URL
  9891      www.amazon.com/
  9912      www.amazon.com/jobs/
  9821878   www.geocities.com/
  40930030  www.google.com/
  85903590  www.yahoo.com/

Adjacency lists
  In-memory tables for outlinks, inlinks
  List index maps from an ID to start of adjacency list
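
The ID assignment and list index described on this slide could look roughly like the sketch below. It is an illustration under assumptions, not the Connectivity Server's code: the function names, the example.org URLs, and the degree values are hypothetical, and a degree of exactly 254 is placed in the medium tier because the slide leaves that boundary open.

```python
def assign_ids(degrees):
    """degrees: dict url -> (indegree, outdegree). Returns dict url -> dense ID."""
    def tier(url):
        d = max(degrees[url])
        if d > 254:
            return 0  # high
        if d > 24:
            return 1  # medium (boundary value 254 lands here: an assumption)
        return 2      # low

    # Partition into the three tiers, sort lexicographically inside each,
    # then hand out IDs in sequence so the ID space is dense.
    ordered = sorted(degrees, key=lambda u: (tier(u), u))
    return {url: i for i, url in enumerate(ordered)}


def build_adjacency(outlinks, ids):
    """Return (index, flat): index[i] is where ID i's list starts in flat."""
    lists = [[] for _ in range(len(ids))]
    for src, dsts in outlinks.items():
        lists[ids[src]] = sorted(ids[d] for d in dsts)
    index, flat = [], []
    for lst in lists:
        index.append(len(flat))
        flat.extend(lst)
    index.append(len(flat))  # sentinel: end of the last list
    return index, flat


degrees = {
    "www.yahoo.com/": (900, 300),
    "www.amazon.com/": (500, 40),
    "example.org/a": (30, 2),
    "example.org/b": (1, 3),
}
ids = assign_ids(degrees)
index, flat = build_adjacency(
    {"example.org/b": ["www.yahoo.com/", "www.amazon.com/"],
     "example.org/a": ["example.org/b"]},
    ids,
)
```

The adjacency list of ID i is then `flat[index[i]:index[i + 1]]`, which is the lookup the list index exists to support.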

Page 10: CS276A Text Information Retrieval, Mining, and Exploitation

Adjacency List Compression - I

[Figure: a list index (entries 104, 105, 106) pointing into a sequence of adjacency lists (98 132 153 98 147 153), shown again with delta-encoded adjacency lists (-6 34 21 -8 49 6)]

• Adjacency list:
  - Smaller delta values are exponentially more frequent (80% to same host)
  - Compress deltas with variable length encoding (e.g., Huffman)

• List index pointers: 32b for high, Base+16b for med, Base+8b for low
  - Avg = 12b per pointer
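
A small sketch of the delta encoding follows. Two things here are assumptions rather than the slide's content: the figure's numbers are read as the lists of IDs 104 and 106 (which reproduces the deltas -6 34 21 and -8 49 6), and a zigzag plus variable-byte code stands in for the Huffman code the slide mentions, simply because it is shorter to show.

```python
def delta_encode(source_id, targets):
    """First delta is relative to the source ID, the rest to the previous target."""
    deltas, prev = [], source_id
    for t in sorted(targets):
        deltas.append(t - prev)
        prev = t
    return deltas


def delta_decode(source_id, deltas):
    targets, prev = [], source_id
    for d in deltas:
        prev += d
        targets.append(prev)
    return targets


def zigzag(n):
    """Map signed to unsigned so small negative deltas stay small."""
    return 2 * n if n >= 0 else -2 * n - 1


def varint(z):
    """Variable-byte code: 7 payload bits per byte, high bit = 'more follows'."""
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def compress(source_id, targets):
    return b"".join(varint(zigzag(d)) for d in delta_encode(source_id, targets))


deltas_104 = delta_encode(104, [98, 132, 153])
deltas_106 = delta_encode(106, [98, 147, 153])
packed = compress(104, [98, 132, 153])
```

Because most links stay on the same host, and lexicographic ID assignment keeps a host's pages close together, most deltas fit in a single byte here.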

Page 11: CS276A Text Information Retrieval, Mining, and Exploitation

List Index Pointers

[Figure: ID to adjacency list lookup. The list index stores a base (4 bytes) and offsets for 16 IDs; an ID selects its offset, which points into the adjacency lists. Other labels in the figure: URL Info, LC:TID, FRQ:RL]

Page 12: CS276A Text Information Retrieval, Mining, and Exploitation

Adjacency List Compression - II

Inter-list compression
  Basis: Similar URLs may share links
    Close in ID space => adjacency lists may overlap
  Approach
    Define a representative adjacency list for a block of IDs
      Adjacency list of a reference ID
      Union of adjacency lists in the block
    Represent adjacency list in terms of deletions and additions when it is cheaper to do so

Measurements
  Intra List + Starts: 8-11 bits per link (580M pages/16GB RAM)
  Inter List: 5.4-5.7 bits per link (870M pages/16GB RAM)
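
The "deletions and additions" idea can be sketched as follows. The cost model (count the entries that must be stored) and the tuple layout are assumptions made for illustration; the slide does not say how "cheaper" is measured.

```python
def encode_against_reference(adj, ref):
    """Store adj as edits to a reference list when that needs fewer entries."""
    adj_set, ref_set = set(adj), set(ref)
    deletions = sorted(ref_set - adj_set)   # in the reference, not in adj
    additions = sorted(adj_set - ref_set)   # in adj, not in the reference
    if len(deletions) + len(additions) < len(adj):
        return ("ref", deletions, additions)
    return ("raw", sorted(adj))             # fall back to the plain list


def decode_against_reference(encoded, ref):
    if encoded[0] == "raw":
        return list(encoded[1])
    _, deletions, additions = encoded
    return sorted((set(ref) - set(deletions)) | set(additions))


reference = [98, 132, 153, 200]
similar = encode_against_reference([98, 132, 153, 201], reference)
unrelated = encode_against_reference([5, 6], reference)
```

A list that overlaps heavily with the reference shrinks to a couple of edits, while an unrelated list is kept as-is.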

Page 13: CS276A Text Information Retrieval, Mining, and Exploitation

Term Vector Database [Stat00]

Fast access to 50 word term vectors for web pages

Term Selection:
  Restricted to middle 1/3rd of lexicon by document frequency
  Top 50 words in document by TF.IDF

Term Weighting:
  Deferred till run-time (can be based on term freq, doc freq, doc length)

Applications
  Content + Connectivity analysis (e.g., Topic Distillation)
  Topic specific crawls
  Document classification

Performance
  Storage: 33GB for 272M term vectors
  Speed: 17 ms/vector on AlphaServer 4100 (latency to read a disk block)
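
Term selection as described above might be sketched like this. The exact TF.IDF formula (tf * log(N/df)), the tie-breaking rule, and the toy lexicon are assumptions; the slide only fixes the middle-third restriction and the top-50 cut. Raw frequencies are stored because weighting is deferred to run-time.

```python
import math
from collections import Counter


def middle_third(doc_freq):
    """Terms in the middle third of the lexicon when ranked by document frequency."""
    ranked = sorted(doc_freq, key=lambda t: (doc_freq[t], t))
    n = len(ranked)
    return set(ranked[n // 3: n - n // 3])


def term_vector(tokens, doc_freq, num_docs, allowed, k=50):
    """Top-k allowed terms by TF.IDF, stored with their raw term frequency."""
    tf = Counter(t for t in tokens if t in allowed)

    def score(t):
        return tf[t] * math.log(num_docs / doc_freq[t])

    top = sorted(tf, key=lambda t: (-score(t), t))[:k]
    return [(t, tf[t]) for t in top]


# Hypothetical lexicon with document frequencies out of 100 documents.
doc_freq = {"the": 100, "of": 90, "web": 40, "graph": 30, "hits": 2, "zyzzyva": 1}
allowed = middle_third(doc_freq)
vector = term_vector("the web graph of the web".split(), doc_freq, 100, allowed)
```

Dropping both ends of the lexicon removes stop words and very rare strings before the per-document ranking is applied.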

Page 14: CS276A Text Information Retrieval, Mining, and Exploitation

Architecture

[Figure: URLid to term vector lookup. Labels: URL Info; LC:TID; FRQ:RL; 128-byte TV record (terms, freq); base (4 bytes); bit vector for 480 URLids; offset; URLid * 64 / 480]

Page 15: CS276A Text Information Retrieval, Mining, and Exploitation

Search Deployment

Web IR is just one (very specific) type of IR

Commercially most important IR application: Enterprise search (large corporations)
  Problem different from Web IR

Peer-2-Peer (P2P) search
  Another search deployment strategy

Page 16: CS276A Text Information Retrieval, Mining, and Exploitation

Enterprise Search Deployment

[Figure: enterprise search deployment. Labels: Database, Corporate Network, Company Web Site, E-Commerce, Web Portals, Enterprises, Proprietary content, Public content, World Wide Web, Sources, Markets, Search Boxes, Content Location, Content Management, Groupware]

Page 17: CS276A Text Information Retrieval, Mining, and Exploitation

Evolution of Enterprise Search

1st Generation (1985 - 1993): Classic Information Retrieval
  User: Trained specialist
  Scope: Small, closed collections
  Technology: Pattern/string matching

2nd Generation (1994 - 1999): Driven by WWW
  User: Everyone
  Scope: Intranet/Extranet
  Technology: Pattern/string matching and external factors for relevance ranking + categorization

3rd Generation (2000+): Discovery (Text Mining)
  User: Everyone and software agents
  Scope: Structured, semi-structured and unstructured information
  Technology: Introduction of linguistic and semantic processing

Page 18: CS276A Text Information Retrieval, Mining, and Exploitation

Enterprise IR is a lot more than search …

Security
  Cannot search what you should not read

Content organization & creation
  Automatic classification
  Taxonomy generation
  Support for multiple languages, multiple formats

Conduits into databases and other content management (homes for "valuable" content)

Information processing tools
  Annotation
  Range searches
  Custom ranking criteria
  Cross lingual tools, …

Individual preferences
  Personalization
  Notification, …

Page 19: CS276A Text Information Retrieval, Mining, and Exploitation

Peer-To-Peer (P2P) Search

No central index

Each node in a network builds and maintains own index

Each node has "servent" software
  On booting, servent pings ~4 other hosts
  Connects to those that respond
  Initiates, propagates and serves requests

Page 20: CS276A Text Information Retrieval, Mining, and Exploitation

Which hosts to connect to?

The ones you connected to last time
Random hosts you know of
Request suggestions from central (or hierarchical) nameservers

All govern system's shape and efficiency

Page 21: CS276A Text Information Retrieval, Mining, and Exploitation

Serving P2P search requests

Send your request to your neighbors

They send it to their neighbors
  Decrement "time to live" for query
  Query dies when ttl = 0

Send search matches back along requesting path
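
The flooding scheme above can be simulated on a small graph. This is a sketch under assumptions: the network is a plain adjacency dict, each node's index is a set of terms, and a `seen` set stands in for whatever duplicate suppression a real servent uses. The node names are hypothetical.

```python
from collections import deque


def flood_search(neighbors, content, start, term, ttl):
    """Return {matching node: path along which its match travels back to start}."""
    seen = {start}
    queue = deque([(start, ttl, [start])])
    hits = {}
    while queue:
        node, remaining, path = queue.popleft()
        if node != start and term in content.get(node, ()):
            # Matches go back along the requesting path, reversed.
            hits[node] = list(reversed(path))
        if remaining == 0:
            continue  # query dies when ttl = 0
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                # Each forwarding step decrements the time to live.
                queue.append((nxt, remaining - 1, path + [nxt]))
    return hits


# A chain A - B - C - D where C and D both hold a match for "xml".
neighbors = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
content = {"C": {"xml"}, "D": {"xml"}}
near = flood_search(neighbors, content, "A", "xml", ttl=2)
far = flood_search(neighbors, content, "A", "xml", ttl=3)
```

With ttl=2 the query never reaches D, so D's match is invisible to A. That is the recall cost of bounding the flood.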

Page 22: CS276A Text Information Retrieval, Mining, and Exploitation

Some P2P Networks

Gnutella Kazaa Bearshare Aimster Grokster Morpheus

Page 23: CS276A Text Information Retrieval, Mining, and Exploitation

P2P: Information Retrieval Issues

Why is this more difficult than centralized IR?

Page 24: CS276A Text Information Retrieval, Mining, and Exploitation

P2P: Information Retrieval Issues

Selection of nodes to query Merging of results Spam

Page 25: CS276A Text Information Retrieval, Mining, and Exploitation

What is XML?

eXtensible Markup Language

A framework for defining markup languages

No fixed collection of markup tags

Each XML language is targeted at a particular application

All XML languages share features

Enables building of generic tools

Page 26: CS276A Text Information Retrieval, Mining, and Exploitation

Basic Structure

An XML document is an ordered, labeled tree

Character data leaf nodes contain the actual data (text strings)
  Data nodes must be non-empty and non-adjacent to other character data nodes

Element nodes are each labeled with
  a name (often called the element type), and
  a set of attributes, each consisting of a name and a value
Element nodes can have child nodes

Page 27: CS276A Text Information Retrieval, Mining, and Exploitation

XML Example

Page 28: CS276A Text Information Retrieval, Mining, and Exploitation

XML Example

<chapter id="cmds">
  <chaptitle>FileCab</chaptitle>
  <para>This chapter describes the commands that manage the <tm>FileCab</tm>inet application.</para>
</chapter>
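
To make the "ordered, labeled tree" view concrete, the sketch below parses this example with Python's standard `xml.etree.ElementTree` and lists element and character data nodes in document order. The `walk` helper is hypothetical. ElementTree stores text after a child element in that child's `tail`, which is why "inet application." is emitted after the `tm` element.

```python
import xml.etree.ElementTree as ET

DOC = ('<chapter id="cmds"><chaptitle>FileCab</chaptitle>'
       '<para>This chapter describes the commands that manage the '
       '<tm>FileCab</tm>inet application.</para></chapter>')


def walk(elem, depth=0):
    """Yield (depth, kind, value) for element and text nodes in document order."""
    yield depth, "element", elem.tag
    if elem.text and elem.text.strip():
        yield depth + 1, "text", elem.text
    for child in elem:
        yield from walk(child, depth + 1)
        # Character data that follows a child belongs to the parent element.
        if child.tail and child.tail.strip():
            yield depth + 1, "text", child.tail


root = ET.fromstring(DOC)
nodes = list(walk(root))
for depth, kind, value in nodes:
    print("  " * depth + kind + ": " + value)
```

The `para` element has three children in order: a character data node, the `tm` element, and a second character data node.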

Page 29: CS276A Text Information Retrieval, Mining, and Exploitation

Elements

Elements are denoted by markup tags

<foo attr1="value" … > thetext </foo>

  Element start tag: foo
  Attribute: attr1
  The character data: thetext
  Matching element end tag: </foo>

Page 30: CS276A Text Information Retrieval, Mining, and Exploitation

XML vs HTML

Relationship?

Page 31: CS276A Text Information Retrieval, Mining, and Exploitation

XML vs HTML

HTML is a markup language for a specific purpose (display in browsers)

XML is a framework for defining markup languages

HTML can be formalized as an XML language (XHTML)

XML defines logical structure only

HTML: same intention, but has evolved into a presentation language

Page 32: CS276A Text Information Retrieval, Mining, and Exploitation

XML: Design Goals

Separate syntax from semantics to provide a common framework for structuring information

Allow tailor-made markup for any imaginable application domain

Support internationalization (Unicode) and platform independence

Be the future of (semi)structured information (do some of the work now done by databases)

Page 33: CS276A Text Information Retrieval, Mining, and Exploitation

Why Use XML?

Represent semi-structured data (data that are structured, but don’t fit relational model)

XML is more flexible than DBs

XML is more structured than simple IR

You get a massive infrastructure for free

Page 34: CS276A Text Information Retrieval, Mining, and Exploitation

Applications of XML

XHTML
CML – chemical markup language
WML – wireless markup language
ThML – theological markup language

<h3 class="s05" id="One.2.p0.2">Having a Humble Opinion of Self</h3>
<p class="First" id="One.2.p0.3">EVERY man naturally desires knowledge
  <note place="foot" id="One.2.p0.4">
    <p class="Footnote" id="One.2.p0.5"><added id="One.2.p0.6">
      <name id="One.2.p0.7">Aristotle</name>, Metaphysics, i. 1.
    </added></p>
  </note>; but what good is knowledge without fear of God? Indeed a humble rustic who serves God is better than a proud intellectual who neglects his soul to study the course of the stars.
  <added id="One.2.p0.8"><note place="foot" id="One.2.p0.9">
    <p class="Footnote" id="One.2.p0.10"> Augustine, Confessions V. 4. </p>
  </note></added>
</p>

Page 35: CS276A Text Information Retrieval, Mining, and Exploitation

XML Schemas

Schema = syntax definition of XML language

Schema language = formal language for expressing XML schemas

Examples
  DTD
  XML Schema (W3C)

Relevance for XML IR
  Our job is much easier if we have a (one) schema

Page 36: CS276A Text Information Retrieval, Mining, and Exploitation

XML Tutorial

http://www.brics.dk/~amoeller/XML/index.html

(Anders Møller and Michael Schwartzbach)

Previous (and some following) slides are based on their tutorial

Page 37: CS276A Text Information Retrieval, Mining, and Exploitation

XML Indexing and Search

Page 38: CS276A Text Information Retrieval, Mining, and Exploitation

Native XML Database

Uses XML document as logical unit

Should support
  Elements
  Attributes
  PCDATA (parsed character data)
  Document order

Contrast with
  DB modified for XML
  Generic IR system modified for XML

Page 39: CS276A Text Information Retrieval, Mining, and Exploitation

XML Indexing and Search

Most native XML databases have taken a DB approach
  Exact match
  Evaluate path expressions
  No IR type relevance ranking

Only a few focus on relevance ranking

Page 40: CS276A Text Information Retrieval, Mining, and Exploitation

Timber: XML as DB extension

DB: search tuples
Timber: search trees

Main focus
  Complex and variable structure of trees (vs. tuples)
  Ordering
  XML query optimization vs relational optimization

Page 41: CS276A Text Information Retrieval, Mining, and Exploitation

ToXin

Native XML database

Exploits overall path structure
  Supports any general path query

Query evaluation in three stages
  Preselection stage
  Selection stage
  Postselection stage

Page 42: CS276A Text Information Retrieval, Mining, and Exploitation

ToXin: Motivation

Strawman: Index all paths occurring in database
  Does not allow backward navigation

Example query: find all the titles of articles from 1990

Page 43: CS276A Text Information Retrieval, Mining, and Exploitation

Query Evaluation Stages

Pre-selection
  First navigation down the tree

Selection
  Value selection according to filter

Post-selection
  Navigation up and down again
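
The three stages can be illustrated on the earlier example query, "find all the titles of articles from 1990". The sketch below walks a plain in-memory tree with Python's `xml.etree.ElementTree` and a hand-built parent map; it does not reproduce ToXin's index structures, and the sample bibliography and function name are hypothetical.

```python
import xml.etree.ElementTree as ET

BIB = """<bib>
  <article><title>Paper A</title><year>1990</year></article>
  <article><title>Paper B</title><year>1995</year></article>
  <article><title>Paper C</title><year>1990</year></article>
</bib>"""


def titles_from_year(root, year):
    # ElementTree has no parent pointers, so build a child -> parent map
    # to make backward navigation possible.
    parent = {child: p for p in root.iter() for child in p}
    # Pre-selection: first navigation down the tree to the candidate nodes.
    years = root.findall("./article/year")
    # Selection: value selection according to the filter.
    selected = [y for y in years if y.text == year]
    # Post-selection: navigate up to the article, then down again to its title.
    return [parent[y].findtext("title") for y in selected]


root = ET.fromstring(BIB)
titles = titles_from_year(root, "1990")
```

The post-selection step is the one a forward-only path index cannot answer, since it has to move from a `year` node back up to its `article`.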

Page 44: CS276A Text Information Retrieval, Mining, and Exploitation

ToXin

Page 45: CS276A Text Information Retrieval, Mining, and Exploitation

Factors Impacting Performance

Data source specific
  Document size
  Number of XML nodes and values
  Path complexity (degree of nesting)
  Average value size

Query specific
  Selectiveness of path constraint
  Size of query answer
  Number of elements selected by filter

Page 46: CS276A Text Information Retrieval, Mining, and Exploitation

Benchmark Parameters

Page 47: CS276A Text Information Retrieval, Mining, and Exploitation

Query Classification

Page 48: CS276A Text Information Retrieval, Mining, and Exploitation

Evaluation

Page 49: CS276A Text Information Retrieval, Mining, and Exploitation

ToXin: Summary

Efficient native XML database

All paths are indexed (not just from root)

Path index linear in corpus size

Shortcomings
  Order of nodes ignored
  Semantics of IDRefs ignored

What is missing?

Page 50: CS276A Text Information Retrieval, Mining, and Exploitation

IR/Relevance Ranking for XML

Why is this difficult?

Page 51: CS276A Text Information Retrieval, Mining, and Exploitation

IR XML Challenge 1: Term Statistics

There is no document unit in XML

How do we compute tf and idf?

Global tf/idf over all text context is useless

Indexing granularity

Page 52: CS276A Text Information Retrieval, Mining, and Exploitation

IR XML Challenge 2: Fragments

IR systems don't store content (only index)

Need to go to document for displaying fragment

Easier in DB framework

Page 53: CS276A Text Information Retrieval, Mining, and Exploitation

Relevance Ranking for XML

Will revisit next week

Page 54: CS276A Text Information Retrieval, Mining, and Exploitation

Querying XML

Semistructured queries
XPath
XQuery

Page 55: CS276A Text Information Retrieval, Mining, and Exploitation

Types of (Semi)Structured Queries

Location/position ("chapter no.3")

Simple attribute/value
  /play/title contains "hamlet"

Path queries
  title contains "hamlet"
  /play//title contains "hamlet"

Complex graphs
  Employees with two managers

All of the above: mixed structure/content

Subsumes: hyperlinks

Page 56: CS276A Text Information Retrieval, Mining, and Exploitation

XPath

Declarative language for
  Addressing (used in XLink/XPointer and in XSLT)
  Pattern matching (used in XSLT and in XQuery)

Location path
  a sequence of location steps separated by /

Example:
  child::section[position()<6] / descendant::cite / attribute::href

Page 57: CS276A Text Information Retrieval, Mining, and Exploitation

Axes in XPath

ancestor, ancestor-or-self, attribute, child, descendant, descendant-or-self, following, following-sibling, namespace, parent, preceding, preceding-sibling, self

Page 58: CS276A Text Information Retrieval, Mining, and Exploitation

Location steps

A single location step has the form:
  axis :: node-test [ predicate ]

The axis selects a rough set of candidate nodes (e.g. the child nodes of the context node).

The node-test performs an initial filtration of the candidates based on their types (chardata node, processing instruction, etc.), or names (e.g. element name).

The predicates (zero or more) cause a further, potentially more complex, filtration.

Example:
  child::section[position()<6]
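
The three-step location path from the XPath slide, child::section[position()<6] / descendant::cite / attribute::href, can be emulated step by step. Python's standard `xml.etree.ElementTree` does not accept the unabbreviated axis syntax, so the sketch below spells each step out by hand; the function name and the test document are hypothetical.

```python
import xml.etree.ElementTree as ET


def cite_hrefs(context):
    """Emulate child::section[position()<6]/descendant::cite/attribute::href."""
    hrefs = []
    # Step 1. axis child, node-test section, predicate position() < 6:
    # keep the first five section children of the context node.
    sections = [c for c in context if c.tag == "section"][:5]
    for section in sections:
        # Step 2. axis descendant, node-test cite: iter() would also visit
        # the section itself, but its tag is not "cite".
        for cite in section.iter("cite"):
            # Step 3. axis attribute, node-test href.
            if "href" in cite.attrib:
                hrefs.append(cite.attrib["href"])
    return hrefs


# Seven sections, each with one nested citation r1 .. r7.
doc = ET.fromstring("<doc>" + "".join(
    '<section><p><cite href="r%d"/></p></section>' % i for i in range(1, 8)
) + "</doc>")
result = cite_hrefs(doc)
```

Each step narrows or moves the node set produced by the previous one, which is all a location path is.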

Page 59: CS276A Text Information Retrieval, Mining, and Exploitation

XQuery

SQL for XML

Usage scenarios
  Human-readable documents
  Data-oriented documents
  Mixed documents (e.g., patient records)

Relies on
  XPath
  XML Schema datatypes

Turing complete

XQuery is still a working draft.
  More than a hundred open issues as of 2002.11.10

Page 60: CS276A Text Information Retrieval, Mining, and Exploitation

XQuery

The principal forms of XQuery expressions are:
  path expressions
  element constructors
  FLWR ("flower") expressions
  list expressions
  conditional expressions
  quantified expressions
  datatype expressions

Evaluated with respect to a context

Page 61: CS276A Text Information Retrieval, Mining, and Exploitation

FLWR

FOR $p IN document("bib.xml")//publisher
LET $b := document("bib.xml")//book[publisher = $p]
WHERE count($b) > 100
RETURN $p

FOR generates an ordered list of bindings of publisher names to $p

LET associates to each binding a further binding of the list of book elements with that publisher to $b

at this stage, we have an ordered list of tuples of bindings: ($p,$b)

WHERE filters that list to retain only the desired tuples

RETURN constructs for each tuple a resulting value
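
The FOR / LET / WHERE / RETURN pipeline can be mimicked over plain Python data to show how the tuples of bindings flow through it. This is an analogy, not XQuery: books are (title, publisher) pairs, publisher names are de-duplicated before the FOR loop (an assumption the query on the slide does not state), and the threshold is a parameter so the example stays small.

```python
def flwr(books, threshold):
    """Publishers with more than `threshold` books, in first-seen order."""
    results = []
    # FOR: an ordered list of bindings for $p (duplicates removed, order kept).
    publishers = list(dict.fromkeys(pub for _, pub in books))
    for p in publishers:
        # LET: bind $b to the whole list of books with that publisher.
        b = [title for title, pub in books if pub == p]
        # WHERE: retain only the desired ($p, $b) tuples.
        if len(b) > threshold:
            # RETURN: construct a resulting value for each surviving tuple.
            results.append(p)
    return results


books = [("t1", "A"), ("t2", "B"), ("t3", "A"), ("t4", "C"), ("t5", "A")]
big = flwr(books, 2)
everyone = flwr(books, 0)
```

FOR iterates and produces one tuple per binding, while LET binds a whole list without multiplying the tuples.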

Page 62: CS276A Text Information Retrieval, Mining, and Exploitation

XQuery vs SQL

Order matters!
  document("zoo.xml")//chapter[2]//figure[caption = "Tree Frogs"]

XQuery is Turing complete, SQL is not.

Page 63: CS276A Text Information Retrieval, Mining, and Exploitation

XQuery Example

Møller and Schwartzbach

Page 64: CS276A Text Information Retrieval, Mining, and Exploitation

XQuery Standard on Ranking (2.3.1)

Document order defines a total ordering among all the nodes seen by the language processor. Within a given document, the document node is the first node, followed by element nodes, text nodes, comment nodes, and processing instruction nodes in the order of their representation in the XML form of the document (after expansion of entities). Element nodes occur before their children. The namespace nodes of an element immediately follow the element node, in implementation-defined order. The attribute nodes of an element immediately follow its namespace nodes, and are also in implementation-defined order.

The relative order of nodes in distinct documents is implementation-defined but stable within a given query or transformation. In other words, given two distinct documents A and B, if a node in document A is before a node in document B, then every node in document A is before every node in document B. The relative order among free-floating nodes (those not in a document) is implementation-defined.

Page 65: CS276A Text Information Retrieval, Mining, and Exploitation

Next Week (12/3)

XML indexing and search II

Metadata indexing and search
  Dublin Core, RDF, DAML+OIL

