Hammouda Webcast May21

1/12

Text Mining: Fast Phrase-based Text Indexing and Matching

Khaled Hammouda, Ph.D. Student

PAMI Research Group

University of Waterloo

Waterloo, Ontario, Canada

LORNET Theme 4


2/12

The Problem

Information Source: Web / LOR

Text Documents, Web Documents, Discussion Articles, ...

Automatic Clustering/Grouping into topics such as:

  Programming Languages

  Database Systems

  Pattern Recognition

  Data Mining

How do we judge similarity?


3/12

Clustering Documents

Group Similar Documents Together

  Maximize intra-cluster similarity

  Minimize inter-cluster similarity

Need to accurately calculate document similarity

[Figure: three document clusters, annotated with intra-cluster similarity (within a cluster) and inter-cluster similarity (between clusters)]


4/12

Document Similarity

How similar is each document to every other document?

Very time consuming!

O(n^2)
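To make the quadratic cost concrete, the sketch below counts the unordered document pairs that an all-pairs comparison has to visit. The function name is hypothetical and chosen only for illustration.

```python
from itertools import combinations


def count_pairwise_comparisons(n_docs: int) -> int:
    """Count the unordered document pairs visited by an all-pairs comparison."""
    return sum(1 for _ in combinations(range(n_docs), 2))


# The count grows as n * (n - 1) / 2, i.e. quadratically in the number of documents.
for n in (10, 100, 1000):
    print(n, count_pairwise_comparisons(n))
```

For 1,000 documents this already means 499,500 similarity computations, which is why each individual comparison has to be cheap.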


5/12

Document Similarity

Information Theoretic Measure (Dekang98):

How do we intersect every pair of documents without sacrificing efficiency?

What features should we intersect?

  Words

  Phrases

sim(A, B) = |A ∩ B| / |A ∪ B|
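The choice between words and phrases can be illustrated with a small sketch. It assumes the measure is taken as the ratio of shared features to all features of the two documents (a set-overlap reading, which is an assumption about the formula on this slide), and it treats every contiguous run of two or more words as a phrase. All function names and the two toy documents are hypothetical.

```python
def word_features(sentences):
    """Set of individual words in a document (given as a list of sentences)."""
    return {word for sentence in sentences for word in sentence.split()}


def phrase_features(sentences, min_len=2):
    """Set of contiguous word runs of at least min_len words, per sentence."""
    features = set()
    for sentence in sentences:
        words = sentence.split()
        for start in range(len(words)):
            for end in range(start + min_len, len(words) + 1):
                features.add(tuple(words[start:end]))
    return features


def overlap_similarity(a, b):
    """Assumed measure: shared features divided by all features."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Two toy documents with identical vocabulary but different word order.
doc_a = ["mild river rafting trips"]
doc_b = ["trips rafting river mild"]

print(overlap_similarity(word_features(doc_a), word_features(doc_b)))
print(overlap_similarity(phrase_features(doc_a), phrase_features(doc_b)))
```

With word features the two documents look identical, while with phrase features they share nothing, which is the kind of local-context difference that phrase matching is meant to capture.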


6/12

Fast Phrase-based Document Indexing and Matching

Document Index Graph Structure

  A model based on a digraph representation of the phrases in the document set

  Nodes correspond to unique terms

  Edges maintain phrase representation

  A phrase is a path in the graph

  The model is an inverted list (terms → documents)

  Nodes carry term weight information for each document in which they appear

  Shared phrases can be matched efficiently

Phrase-based Features

  Phrases: a more informative feature than individual words

  Local context matching

  Represent sentences rather than words

  Facilitate phrase-matching between documents

  Achieve accurate pair-wise document similarity

  Avoid the high dimensionality of the vector space model

  Allow incremental processing

Document 1

  river rafting
  mild river rafting
  river rafting trips

Document 2

  wild river adventures
  river rafting vacation plan

Document 3

  fishing trips
  fishing vacation plan
  booking fishing trips
  river fishing

[Figure: Document Index Graph built from the three documents, with nodes mild, wild, river, rafting, adventures, booking, fishing, trips, vacation, plan]
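The structure described on this slide can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the presenter's implementation: the class and method names are hypothetical, term weights are omitted, and only shared phrases of two or more words are reported. Nodes are unique terms, each edge records where a pair of consecutive terms occurs (document, sentence, position), and a new document is matched against the graph before it is inserted, which is what allows incremental processing.

```python
from collections import defaultdict


class DocumentIndexGraph:
    """Toy phrase index: nodes are unique terms, edges link consecutive terms."""

    def __init__(self):
        # term -> ids of the documents containing it (the inverted list)
        self.nodes = defaultdict(set)
        # (term, next_term) -> doc id -> {(sentence number, position of term)}
        self.edges = defaultdict(lambda: defaultdict(set))

    def _occurs(self, a, b, doc_id, sent, pos):
        """True if edge a -> b occurs in doc_id at that sentence position."""
        return (sent, pos) in self.edges.get((a, b), {}).get(doc_id, ())

    def _shared_phrases(self, words):
        """Yield (doc_id, phrase) for maximal word runs already in the graph."""
        for i in range(len(words) - 1):
            first_edge = self.edges.get((words[i], words[i + 1]), {})
            for doc_id, occurrences in first_edge.items():
                for sent, pos in occurrences:
                    # Runs that also match one word earlier are reported
                    # when the scan starts from that earlier word.
                    if i > 0 and self._occurs(
                        words[i - 1], words[i], doc_id, sent, pos - 1
                    ):
                        continue
                    end = i + 1
                    # Follow the path edge by edge while positions stay consecutive.
                    while end + 1 < len(words) and self._occurs(
                        words[end], words[end + 1], doc_id, sent, pos + end - i
                    ):
                        end += 1
                    yield doc_id, " ".join(words[i : end + 1])

    def add_document(self, doc_id, sentences):
        """Index a document; return {earlier doc id: set of shared phrases}."""
        tokenized = [sentence.lower().split() for sentence in sentences]
        matches = defaultdict(set)
        for words in tokenized:  # match first, against earlier documents only
            for other, phrase in self._shared_phrases(words):
                matches[other].add(phrase)
        for sent, words in enumerate(tokenized):  # then extend the graph
            for pos, word in enumerate(words):
                self.nodes[word].add(doc_id)
                if pos + 1 < len(words):
                    self.edges[(word, words[pos + 1])][doc_id].add((sent, pos))
        return dict(matches)


docs = {
    1: ["river rafting", "mild river rafting", "river rafting trips"],
    2: ["wild river adventures", "river rafting vacation plan"],
    3: ["fishing trips", "fishing vacation plan", "booking fishing trips",
        "river fishing"],
}
graph = DocumentIndexGraph()
for doc_id, sentences in docs.items():
    print(doc_id, graph.add_document(doc_id, sentences))
```

Run on the three example documents, the sketch reports "river rafting" as shared between Documents 1 and 2 and "vacation plan" as shared between Documents 2 and 3. Single shared terms such as "river" are not reported as phrases here but can be read from `graph.nodes`.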


7/12

Document Index Graph

[Figure: the graph built incrementally as Documents 1, 2 and 3 from the previous slide are added one at a time. Matching phrases shown in the figure: river rafting; river; vacation plan; river; trips]


8/12

Phrase-based Document Indexing

Document Index Graph (internal structure)

Node "river" and its outgoing edges:

  e0: river → rafting
  e1: river → fishing
  e2: river → adventures

Document Table and Edge Tables (for node "river")

  doc   TF        ET
  1     {0,0,3}   e0: s1(1), s2(2), s3(1)
  2     {0,0,2}   e0: s2(1); e2: s1(2)
  3     {0,0,1}   e1: s4(1)

[Figure: Document Index Graph (size scalability)]

[Figure: Document Index Graph (time performance)]


9/12

Effect of using phrase-based similarity over individual words

[Figure: Effect of using phrase similarity (F-measure)]

[Figure: Effect of using phrase similarity (Entropy)]
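The two plots report clustering quality as F-measure and Entropy. The sketch below shows how these two scores are commonly computed from true class labels and cluster assignments. The function names, the unnormalized natural-log entropy, and the toy labels are assumptions for illustration; this is not the evaluation code behind the slides.

```python
from collections import Counter
from math import log


def f_measure(labels, clusters):
    """Class-size-weighted best F-score of each class over all clusters."""
    n = len(labels)
    class_sizes = Counter(labels)
    cluster_sizes = Counter(clusters)
    joint = Counter(zip(labels, clusters))
    total = 0.0
    for cls, n_i in class_sizes.items():
        best = 0.0
        for clu, n_j in cluster_sizes.items():
            n_ij = joint[(cls, clu)]
            if n_ij:
                precision, recall = n_ij / n_j, n_ij / n_i
                best = max(best, 2 * precision * recall / (precision + recall))
        total += n_i / n * best
    return total  # higher is better, 1.0 is a perfect clustering


def entropy(labels, clusters):
    """Cluster-size-weighted entropy of the class mix inside each cluster."""
    n = len(labels)
    cluster_sizes = Counter(clusters)
    joint = Counter(zip(labels, clusters))
    total = 0.0
    for (cls, clu), n_ij in joint.items():
        p = n_ij / cluster_sizes[clu]
        total -= cluster_sizes[clu] / n * p * log(p)
    return total  # lower is better, 0.0 means every cluster is pure


labels = ["a", "a", "b", "b"]  # hypothetical true classes
print(f_measure(labels, [0, 0, 1, 1]), entropy(labels, [0, 0, 1, 1]))
print(f_measure(labels, [0, 1, 0, 1]), entropy(labels, [0, 1, 0, 1]))
```

A good clustering pushes F-measure toward 1 and Entropy toward 0, so the two plots are read in opposite directions.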


10/12

Applications

Grouping search engine results on-the-fly (incremental processing)

Creating taxonomies of documents (Yahoo! and Open Directory style)

Implementing "Find Related" or "Find Similar" features of information retrieval systems

Automatic generation of descriptive phrases about a set of documents (i.e. labeling clusters)

Detecting plagiarism


11/12

Collaboration

Provide Data Mining services (primarily text mining) for other groups

Opportunity for collaboration with U of Saskatchewan: I-Help Discussion System

Course Delivery Tools

Others are welcome


12/12

Questions

Instant Messaging

MSN Messenger: [email protected]

E-mail

[email protected]
