
Hammouda Webcast May21

Date post: 05-Apr-2018
  • 8/2/2019 Hammouda Webcast May21

    1/12

    Text Mining: Fast Phrase-based Text Indexing and Matching

    Khaled Hammouda, Ph.D. Student

    PAMI Research Group

    University of Waterloo

    Waterloo, Ontario, Canada

    LORNET Theme 4

    The Problem

    Information Source: Web / LOR

    Text Documents, Web Documents, Discussion Articles...

    Automatic Clustering/Grouping

    Programming Languages
    Database Systems
    Pattern Recognition
    Data Mining

    How do we judge similarity?

    Clustering Documents

    Group Similar Documents Together
    - Maximize intra-cluster similarity
    - Minimize inter-cluster similarity

    Need to accurately calculate document similarity

    [Figure: three document clusters, labeled with Intra-Cluster Similarity and Inter-Cluster Similarity]

    Document Similarity

    How similar is each document to every other document?

    Very time consuming! O(n^2)
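The quadratic cost comes from the number of document pairs: comparing every document with every other one takes n(n-1)/2 similarity computations. A minimal Python sketch of that count follows; the function name `count_pairwise_comparisons` is an illustrative assumption, not something from the talk.

```python
from itertools import combinations

def count_pairwise_comparisons(num_docs: int) -> int:
    """Count the unordered document pairs a brute-force comparison visits."""
    return sum(1 for _ in combinations(range(num_docs), 2))

for n in (10, 100, 1000):
    # The count equals the closed form n * (n - 1) / 2, which grows as O(n^2).
    print(n, count_pairwise_comparisons(n), n * (n - 1) // 2)
```

Going from 100 to 1000 documents multiplies the number of comparisons by roughly 100, which is why the cost of each individual comparison matters.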

    Document Similarity

    Information Theoretic Measure (Dekang98):

    $\mathrm{sim}(A, B) = \dfrac{A \cap B}{A \cup B}$

    How do we intersect every pair of documents without sacrificing efficiency?

    What features should we intersect?
    - Words
    - Phrases
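The choice between words and phrases can be illustrated with a small Python sketch. It reads the measure as the ratio of shared features to all features and uses plain set sizes, which is an assumption (the slide does not say how features are weighted). Treating phrases as adjacent word pairs is also a simplification chosen for this example.

```python
def word_features(sentences):
    """Set of individual words across all sentences."""
    return {w for s in sentences for w in s.split()}

def bigram_features(sentences):
    """Set of adjacent word pairs; a simple stand-in for phrases (assumption)."""
    feats = set()
    for s in sentences:
        words = s.split()
        feats.update(zip(words, words[1:]))
    return feats

def overlap_similarity(a, b):
    """Size of the intersection over size of the union (unweighted; assumption)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

doc_a = ["river rafting trips"]
doc_b = ["rafting river trips"]   # same words, different order

print(overlap_similarity(word_features(doc_a), word_features(doc_b)))
print(overlap_similarity(bigram_features(doc_a), bigram_features(doc_b)))
```

With word features the two documents look identical; with phrase features they share nothing, because word order is part of the feature.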

    Fast Phrase-based Document Indexing and Matching

    Document Index Graph Structure
    - A model based on a digraph representation of the phrases in the document set
    - Nodes correspond to unique terms
    - Edges maintain phrase representation
    - A phrase is a path in the graph
    - The model is an inverted list (terms → documents)
    - Nodes carry term weight information for each document in which they appear
    - Shared phrases can be matched efficiently

    Phrase-based Features
    - Phrases: more informative feature than individual words (local context matching)
    - Represent sentences rather than words
    - Facilitate phrase-matching between documents
    - Achieves accurate document pair-wise similarity
    - Avoid high-dimensionality of vector space model
    - Allow incremental processing

    Document 1
    - river rafting
    - mild river rafting
    - river rafting trips

    Document 2
    - wild river adventures
    - river rafting vacation plan

    Document 3
    - fishing trips
    - fishing vacation plan
    - booking fishing trips
    - river fishing

    [Figure: Document Index Graph with nodes mild, wild, river, rafting, adventures, booking, fishing, trips, vacation, plan]
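The graph for the three example documents can be sketched in Python as follows. This is a minimal illustration, not the author's implementation: the names `DOCS` and `build_graph`, and the choice to store `(doc, sentence_no, position)` tuples on each edge, are assumptions made for the example.

```python
from collections import defaultdict

# The three example documents from the slide, one phrase (sentence) per entry.
DOCS = {
    1: ["river rafting", "mild river rafting", "river rafting trips"],
    2: ["wild river adventures", "river rafting vacation plan"],
    3: ["fishing trips", "fishing vacation plan",
        "booking fishing trips", "river fishing"],
}

def build_graph(docs):
    """Return (nodes, edges).

    nodes maps term -> set of doc ids (the inverted list);
    edges maps (term, next_term) -> list of (doc, sentence_no, position).
    """
    nodes = defaultdict(set)
    edges = defaultdict(list)
    for doc_id, sentences in docs.items():
        for s_no, sentence in enumerate(sentences, start=1):
            words = sentence.split()
            for pos, word in enumerate(words, start=1):
                nodes[word].add(doc_id)
                if pos < len(words):
                    # words[pos] is the word that follows `word` (pos is 1-based).
                    edges[(word, words[pos])].append((doc_id, s_no, pos))
    return nodes, edges

nodes, edges = build_graph(DOCS)
print(sorted(nodes))                  # the unique terms, one node each
print(sorted(nodes["river"]))         # documents that contain "river"
print(edges[("river", "rafting")])    # every place "river rafting" occurs
```

Each unique term gets one node no matter how many documents use it, and a phrase such as "river rafting" is recovered by following the edge from `river` to `rafting`.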

    Document Index Graph

    Document 1
    - river rafting
    - mild river rafting
    - river rafting trips

    Document 2
    - wild river adventures
    - river rafting vacation plan

    Document 3
    - fishing trips
    - fishing vacation plan
    - booking fishing trips
    - river fishing

    [Figure: Document Index Graph for the documents above, with nodes mild, wild, river, rafting, adventures, booking, fishing, trips, vacation, plan]

    Matching phrases listed on the slide:
    - river rafting
    - river
    - vacation plan
    - river
    - trips
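Matching can happen while a document is being added, which is what makes incremental processing possible. The Python sketch below is a simplified stand-in rather than the author's algorithm: it keeps term occurrences as `(doc, sentence_no, position)` tuples, and following consecutive positions plays the role of following an edge. The names `DOCS` and `add_document` are assumptions for the example.

```python
from collections import defaultdict

DOCS = {
    1: ["river rafting", "mild river rafting", "river rafting trips"],
    2: ["wild river adventures", "river rafting vacation plan"],
    3: ["fishing trips", "fishing vacation plan",
        "booking fishing trips", "river fishing"],
}

def add_document(index, doc_id, sentences):
    """Insert one document and return {earlier_doc: set of shared phrases}.

    index maps term -> list of (doc, sentence_no, position) occurrences.
    """
    matches = defaultdict(set)
    new_entries = []
    for s_no, sentence in enumerate(sentences, start=1):
        words = sentence.split()
        active = {}  # (doc, sentence_no, last_pos) -> start index in `words`
        for j, word in enumerate(words):
            extended = {}
            for d, s, p in index.get(word, []):
                # Extend a run that ended at the previous position, else start one.
                extended[(d, s, p)] = active.pop((d, s, p - 1), j)
            # Runs still in `active` could not be extended, so they are maximal.
            for (d, _, _), start in active.items():
                matches[d].add(" ".join(words[start:j]))
            active = extended
            new_entries.append((word, (doc_id, s_no, j + 1)))
        for (d, _, _), start in active.items():
            matches[d].add(" ".join(words[start:]))
    # Register the new document only after matching, so it never matches itself.
    for word, entry in new_entries:
        index.setdefault(word, []).append(entry)
    return dict(matches)

index = {}
for doc_id, sentences in DOCS.items():
    print(doc_id, add_document(index, doc_id, sentences))
```

On the example documents this reports "river rafting" and "river" when Document 2 is added, and "vacation plan", "river", and "trips" when Document 3 is added. No earlier document has to be re-read.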

    Phrase-based Document Indexing

    Document Index Graph (internal structure)

    [Figure: nodes river, rafting, adventures, fishing; edges e0, e1, e2]

    Document Table (ET column: Edge Tables)

    doc   TF        ET
    1     {0,0,3}   e0: s1(1), s2(2), s3(1)
    2     {0,0,2}   e0: s2(1); e2: s1(2)
    3     {0,0,1}   e1: s4(1)

    [Plot: Document Index Graph (size scalability)]

    [Plot: Document Index Graph (time performance)]
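The per-node bookkeeping can be sketched as a small Python data structure. This is an illustration under stated assumptions: an entry such as s2(2) is read as "sentence 2, position 2", which is consistent with the example documents but is not spelled out on the slide. The term frequency is kept as a single count rather than the three-component TF shown on the slide, whose components the slide does not explain. `DocEntry`, `document_table`, and `DOCS` are hypothetical names.

```python
from collections import defaultdict
from dataclasses import dataclass, field

DOCS = {
    1: ["river rafting", "mild river rafting", "river rafting trips"],
    2: ["wild river adventures", "river rafting vacation plan"],
    3: ["fishing trips", "fishing vacation plan",
        "booking fishing trips", "river fishing"],
}

@dataclass
class DocEntry:
    """One row of a node's document table (field names are illustrative)."""
    tf: int = 0  # occurrences of the term in this document
    # Maps next_term -> [(sentence_no, position), ...] for each outgoing edge.
    edge_table: dict = field(default_factory=lambda: defaultdict(list))

def document_table(term, docs):
    """Build the document table for one term from {doc_id: [sentences]}."""
    table = {}
    for doc_id, sentences in docs.items():
        for s_no, sentence in enumerate(sentences, start=1):
            words = sentence.split()
            for pos, word in enumerate(words, start=1):
                if word != term:
                    continue
                entry = table.setdefault(doc_id, DocEntry())
                entry.tf += 1
                if pos < len(words):  # the term has an outgoing edge here
                    entry.edge_table[words[pos]].append((s_no, pos))
    return table

for doc_id, entry in document_table("river", DOCS).items():
    print(doc_id, entry.tf, dict(entry.edge_table))
```

For the term "river" the counts come out as 3, 2, and 1 for Documents 1 to 3, and the sentence and position pairs line up with the entries in the Document Table.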

    Effect of using phrase-based similarity over individual words

    [Plot: Effect of using phrase similarity (F-measure)]

    [Plot: Effect of using phrase similarity (Entropy)]
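The slide names the two quality measures without defining them. The Python sketch below uses the definitions commonly applied to clustering results, which is an assumption about what was plotted: F-measure takes the best-matching cluster for each true class, weighted by class size, and entropy measures how mixed the true classes are inside each cluster, weighted by cluster size. The function names are illustrative.

```python
from collections import Counter
from math import log

def clustering_f_measure(labels, clusters):
    """Class-size-weighted best-match F-measure; higher is better."""
    n = len(labels)
    class_sizes = Counter(labels)
    cluster_sizes = Counter(clusters)
    joint = Counter(zip(labels, clusters))
    total = 0.0
    for cls, n_i in class_sizes.items():
        best = 0.0
        for clu, n_j in cluster_sizes.items():
            n_ij = joint[(cls, clu)]
            if n_ij:
                p, r = n_ij / n_j, n_ij / n_i   # precision and recall
                best = max(best, 2 * p * r / (p + r))
        total += n_i / n * best
    return total

def clustering_entropy(labels, clusters):
    """Cluster-size-weighted entropy of class labels; lower is better."""
    n = len(labels)
    cluster_sizes = Counter(clusters)
    joint = Counter(zip(labels, clusters))
    total = 0.0
    for (cls, clu), n_ij in joint.items():
        p = n_ij / cluster_sizes[clu]
        total -= cluster_sizes[clu] / n * p * log(p)
    return total

labels = ["a", "a", "b", "b"]
print(clustering_f_measure(labels, [0, 0, 1, 1]), clustering_entropy(labels, [0, 0, 1, 1]))
print(clustering_f_measure(labels, [0, 1, 0, 1]), clustering_entropy(labels, [0, 1, 0, 1]))
```

A clustering that separates the two classes perfectly scores the maximum F-measure and zero entropy, while one that splits each class evenly across both clusters scores worse on both.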

    Applications

    - Grouping search engine results on-the-fly (incremental processing)
    - Creating taxonomies of documents (Yahoo! and Open Directory style)
    - Implementing "Find Related" or "Find Similar" features of information retrieval systems
    - Automatic generation of descriptive phrases about a set of documents (i.e. labeling clusters)
    - Detecting plagiarism

    Collaboration

    Provide Data Mining services (primarily text mining) for other groups

    Opportunity for collaboration with U of Saskatchewan: I-Help Discussion System

    Course Delivery Tools

    Others are welcome

    Questions

    Instant Messaging

    MSN Messenger: [email protected]

    E-mail

    [email protected]

