Page 1

Graph-Based Methods for Multilingual Text and Web Mining

Mark Last
Department of Information Systems Engineering
Ben-Gurion University of the Negev

In cooperation with
Horst Bunke (University of Bern)
Abraham Kandel, Adam Schenker (University of South Florida)
Alex Markov, Marina Litvak, Guy Danon (Ben-Gurion University)

E-mail: [email protected]
Home Page: http://www.bgu.ac.il/~mlast/

Text Mining Day 2009 at BGU, May 25, 2009

Page 2

Agenda

• Introduction and Motivation
• Graph-Based Representations of Text and Web Documents
• Graph-Based Categorization and Clustering Algorithms
• The Hybrid Approach to Web Document Categorization
• Graph-Based Keyword Extraction
• Summary

Prof. Mark Last (BGU) 2

Page 3

INTRODUCTION AND MOTIVATION

Page 4

Web Mining Tasks

[Diagram: taxonomy of web mining tasks]
• Web Mining
– Web Structure Mining: PageRank
– Web Usage Mining
– Web Content Mining: Information Search and Retrieval, Document Categorization, Document Clustering, Keyword Extraction and Summarization

Page 5

The Vector-Space Model (Salton et al., 1975)

• A text document is considered a “bag of words (terms / features)”
– Document $d_j = (w_{1j}, \ldots, w_{|T|j})$, where $T = (t_1, \ldots, t_{|T|})$ is the set of terms (features) that occur at least once in at least one document (the vocabulary)
• Term: n-gram, single word, noun phrase, keyphrase, etc.
• Term weights: binary, frequency-based, etc.
• Meaningless (“stop”) words are removed
• Stemming operations may be applied
– Leaders => Leader
– Expiring => expire
• The ordering and position of words, as well as document logical structure and layout, are completely ignored

May 29, 2009 5
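
A minimal Python sketch of the bag-of-words representation described on this slide. The stop-word list, the whitespace tokenizer, and the function names are illustrative assumptions rather than anything taken from the talk; term weights are raw frequencies and stemming is left out.

```python
from collections import Counter

# Illustrative stop-word list (an assumption, not the list used in the talk).
STOP_WORDS = {"a", "an", "the", "of", "and", "to", "in", "is"}

def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def build_vocabulary(docs):
    """T: every term that occurs at least once in at least one document."""
    return sorted({term for doc in docs for term in tokenize(doc)})

def to_vector(doc, vocabulary):
    """d_j = (w_1j, ..., w_|T|j) with frequency-based weights."""
    counts = Counter(tokenize(doc))
    return [counts[term] for term in vocabulary]

docs = ["the car bomb", "bomb killed the driver"]
vocabulary = build_vocabulary(docs)
vectors = [to_vector(doc, vocabulary) for doc in docs]
```

Because only counts are kept, `to_vector("bomb car", vocabulary)` and `to_vector("car bomb", vocabulary)` are identical, which is the loss of word order noted in the last bullet.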

Page 6

Advantages of the Vector-Space Model (based on Joachims, 2002)

• A simple and straightforward representation for English and other languages, where words have a clear delimiter
• Most weighting schemes require a single scan of each document
• A fixed-size vector representation makes unstructured text accessible to most classification algorithms (from decision trees to SVMs)
• Consistently good results in the information retrieval domain (mainly, on English corpora)

Page 7

Limitations of the Vector-Space Model

• Text documents
– Ignoring the word position in the document
– Ignoring the ordering of words in the document
• Web Documents
– Ignoring the information contained in HTML tags (e.g., document sections)
• Multilingual documents
– Word separation may be tricky in some languages (e.g., Latin, German, Chinese, etc.)
– No comprehensive evaluation on large non-English corpora

Page 8

Word Separation in Ancient Latin

[Photos of two inscriptions]
• The Arch of Titus, Rome (1st Century AD)
• Dedication to Julius Caesar (1st Century BC)

Words are separated by triangles

Page 9

GRAPH-BASED REPRESENTATIONS OF TEXT AND WEB DOCUMENTS

Introduced in Schenker et al., 2005

Page 10

Relevant Definitions (Based on Bunke and Kandel, 2000)

• A (labeled) graph G is a 4-tuple $G = (V, E, \alpha, \beta)$, where V is a set of nodes (vertices), $E \subseteq V \times V$ is a set of edges connecting the nodes, $\alpha$ is a function labeling the nodes, and $\beta$ is a function labeling the edges.

[Figure: example graph with nodes labeled A, B, C and edges labeled x, y, annotated “Node label” and “Edge label”]

• Node and edge IDs are omitted for brevity
• Graph size: $|G| = |V| + |E|$

Page 11

The Graph-Based Model of Web Documents – Basic Ideas

• At most one node for each unique term in a document
• If a word B follows a word A, there is a directed edge from A to B
– Unless the words are separated by certain punctuation marks (periods, question marks, and exclamation points)
• Stop words are removed
• Graph size may be limited by including only the most frequent terms
• Stemming
– Alternate forms of the same term (singular/plural, past/present/future tense, etc.) are conflated to the most frequently occurring form
• Several variations for node and edge labeling (see the next slides)
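
The construction rules on this slide can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the stop-word list, the regular-expression tokenizer, and the name `build_graph` are hypothetical, stemming is omitted, and words left adjacent after stop-word and frequency filtering are treated as following each other.

```python
import re
from collections import Counter

# Illustrative stop-word list (an assumption, not the list used in the talk).
STOP_WORDS = {"a", "an", "the", "of", "and", "to", "in", "has"}

def build_graph(text, max_nodes=None):
    """Return (nodes, edges); edges are directed (A, B) pairs, B follows A."""
    # Periods, question marks, and exclamation points break adjacency.
    segments = re.split(r"[.?!]", text.lower())
    token_lists = [
        [w for w in re.findall(r"[a-z0-9]+", seg) if w not in STOP_WORDS]
        for seg in segments
    ]
    counts = Counter(w for tokens in token_lists for w in tokens)
    # most_common(None) returns every term, so max_nodes=None keeps all of them.
    keep = {w for w, _ in counts.most_common(max_nodes)}
    nodes, edges = set(), set()
    for tokens in token_lists:
        tokens = [w for w in tokens if w in keep]
        nodes.update(tokens)                   # one node per unique term
        edges.update(zip(tokens, tokens[1:]))  # edge from each word to the next
    return nodes, edges

nodes, edges = build_graph("Yahoo news service. More news reports!")
```

In this toy input there is no edge from `service` to `more`, because a period separates them, while `news` is a single node shared by both sentences.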

Page 12

The Standard Representation

• Edges are labeled according to the document section in which one word follows the other
– Title (TI) contains the text related to the document’s title and any provided keywords (meta-data);
– Link (L) is the “anchor text” that appears in clickable hyper-links on the document;
– Text (TX) comprises any of the visible text in the document (this includes anchor text but not title and keyword text)

[Figure: example graph with nodes YAHOO, NEWS, MORE, SERVICE, REPORTS, REUTERS and edges labeled TI, L, TX]
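
A small Python sketch of section-labeled edges, assuming the document has already been split into TI, L, and TX sections and tokenized. The function name and the toy token lists are hypothetical; the edges are not read off the slide's figure.

```python
def build_standard_graph(sections):
    """sections maps a label ("TI", "L", "TX") to a list of token lists.

    Returns (nodes, edges), where each edge is (A, B, section_label).
    """
    nodes, edges = set(), set()
    for label, token_lists in sections.items():
        for tokens in token_lists:
            nodes.update(tokens)
            edges.update((a, b, label) for a, b in zip(tokens, tokens[1:]))
    return nodes, edges

# Hypothetical tokenized sections, loosely reusing the words in the figure.
sections = {
    "TI": [["yahoo", "news"]],
    "L": [["more", "news"]],
    "TX": [["news", "service", "reports"], ["reuters", "reports"]],
}
nodes, edges = build_standard_graph(sections)
```

Storing the label inside the edge tuple lets the same word pair carry different labels when it occurs in more than one section.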

Page 13

The Simple Representation

• The graph is based only on the visible text on the page (title and meta-data are ignored)
• Edges are not labeled

[Figure: example graph with nodes NEWS, MORE, SERVICE, REPORTS, REUTERS and unlabeled edges]

Page 14

Other Representations

• The n-distance Representation
– Look up to n terms ahead and connect the succeeding terms with an edge that is labeled with the distance between them (n)
• The n-simple Representation
– Look up to n terms ahead and connect the succeeding terms with an unlabeled edge
• The Absolute Frequency Representation
– Each node and edge is labeled with an absolute frequency measure
• The Relative Frequency Representation
– Each node and edge is labeled with a relative frequency measure
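
The look-ahead used by the n-distance and n-simple representations can be sketched in Python as below. The function names are hypothetical, and the input is assumed to be one already-tokenized sentence.

```python
def n_distance_edges(tokens, n):
    """Edges (A, B, distance) for every ordered pair at most n positions apart."""
    edges = set()
    for i, a in enumerate(tokens):
        for dist in range(1, n + 1):
            if i + dist < len(tokens):
                edges.add((a, tokens[i + dist], dist))
    return edges

def n_simple_edges(tokens, n):
    """Same look-ahead, but the edges carry no label."""
    return {(a, b) for a, b, _ in n_distance_edges(tokens, n)}

tokens = ["car", "bomb", "baghdad"]
labeled = n_distance_edges(tokens, 2)
unlabeled = n_simple_edges(tokens, 2)
```

With n = 2, `car` is linked to `baghdad` at distance 2 in addition to the two distance-1 edges.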

Page 15

Graph-Based Document Representation – Example
Source: www.cnn.com, 24/05/2005

Page 16

Graph-Based Document Representation – Parsing

[Figure: the page annotated with its title, link, and text sections]

Page 17

Graph-Based Document Representation – Preprocessing

TITLE
CNN.com International

Text
A car bomb has exploded outside a popular Baghdad restaurant, killing three Iraqis and wounding more than 110 others, police officials said. Earlier an aide to the office of Iraqi Prime Minister Ibrahim al-Jaafari and his driver were killed in a drive-by shooting.

Links
Iraq bomb: Four dead, 110 wounded.
FULL STORY.

[Callouts: stop word removal; stemming (“killing”)]

Page 18

Graph-Based Document Representation – Preprocessing

TITLE
CNN.com International

Text
A car bomb has exploded outside a popular Baghdad restaurant, killing three Iraqis and wounding more than 110 others, police officials said. Earlier an aide to the office of Iraqis Prime Minister Ibrahim al-Jaafari and his driver were killing in a driver shooting.

Links
Iraqis bomb: Four dead, 110 wounding.
FULL STORY.

Page 19

Standard Graph-Based Document Representation

Ten most frequent terms are used

Word            Frequency
Iraq            3
Kill            2
Bomb            2
Wound           2
Drive           2
Explod          1
Baghdad         1
International   1
CNN             1
Car             1

[Figure: graph over the nodes KILL, DRIVE, CAR, IRAQ, BOMB, EXPLOD, BAGHDAD, WOUND, CNN, INTERNATIONAL, with edges labeled TX (Text), L (Link), or TI (Title)]

Page 20

Simple Graph-Based Document Representation

Ten most frequent terms are used

Word            Frequency
Iraq            3
Kill            2
Bomb            2
Wound           2
Drive           2
Explod          1
Baghdad         1
International   1
CNN             1
Car             1

[Figure: graph over the nodes KILL, DRIVE, CAR, IRAQ, BOMB, EXPLOD, BAGHDAD, WOUND, with unlabeled edges]

Page 21

GRAPH-BASED CATEGORIZATION AND CLUSTERING ALGORITHMS

Based on Schenker et al., 2005

Page 22

“Lazy” Document Categorization with Graph-Based Models

• The Basic k-Nearest Neighbors (k-NN) Algorithm
– Input: a set of labeled training documents, a query document d, and a parameter k defining the number of nearest neighbors to use
– Output: a label indicating the category of the query document d
– Step 1. Find the k nearest training documents to d according to a distance measure
– Step 2. Select the category of d to be the category held by the majority of the k nearest training documents
• k-Nearest Neighbors with Graphs (Schenker et al., 2005)
– Represent the documents as graphs
– Use a graph-theoretical distance measure
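
The two k-NN steps can be sketched in Python with the distance passed in as a function, so that any graph-theoretical measure can be plugged in. The names are hypothetical, and `toy_distance` is only a stand-in computed on edge sets, not one of the measures from the talk.

```python
from collections import Counter

def knn_classify(query, training, k, distance):
    """training is a list of (item, label) pairs; distance(x, y) returns a float."""
    # Step 1: the k training items closest to the query.
    nearest = sorted(training, key=lambda pair: distance(query, pair[0]))[:k]
    # Step 2: majority vote over their labels.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def toy_distance(g1, g2):
    """Stand-in distance on edge sets: 1 - |intersection| / |union|."""
    return 1 - len(g1 & g2) / len(g1 | g2)

# Hypothetical training documents, each reduced to a set of directed edges.
training = [
    ({("car", "bomb"), ("bomb", "baghdad")}, "news"),
    ({("car", "bomb"), ("bomb", "kill")}, "news"),
    ({("course", "syllabus"), ("syllabus", "exam")}, "course"),
]
query = {("car", "bomb"), ("bomb", "wound")}
label = knn_classify(query, training, k=2, distance=toy_distance)
```

Every query is compared against the whole training set, which is why this kind of classifier is called “lazy”.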

Page 23

Distance between two Graphs

• Required properties
– (1) boundary condition: $d(G_1, G_2) \geq 0$
– (2) identical graphs have zero distance: $d(G_1, G_2) = 0 \rightarrow G_1 \cong G_2$
– (3) symmetry: $d(G_1, G_2) = d(G_2, G_1)$
– (4) triangle inequality: $d(G_1, G_3) \leq d(G_1, G_2) + d(G_2, G_3)$

Page 24

Maximum Common Subgraph (mcs)

• The graph G is a maximum common subgraph (mcs) if G is a common subgraph of $G_1$ and $G_2$ and there exists no other common subgraph G’ of $G_1$ and $G_2$ such that $|G'| > |G|$

[Figure: two example graphs $G_1$ and $G_2$ with labeled nodes and edges, and their maximum common subgraph G]

$|G| = |V| + |E| = 2 + 1 = 3$

Page 25

Minimum Common Supergraph (MCS)

• The graph G is a minimum common supergraph (MCS) if G is a common supergraph of $G_1$ and $G_2$ and there exists no other common supergraph G’ of $G_1$ and $G_2$ such that $|G'| < |G|$

[Figure: two example graphs $G_1$ and $G_2$ with labeled nodes and edges, and their minimum common supergraph G]

$|G| = |V| + |E| = 4 + 2 = 6$

Page 26

MMCSN Distance between two Graphs

• MMCSN Measure (Schenker et al., 2005):

$d_{MMCSN}(G_1, G_2) = 1 - \frac{|mcs(G_1, G_2)|}{|MCS(G_1, G_2)|}$

• $mcs(G_1, G_2)$ - maximum common subgraph
• $MCS(G_1, G_2)$ - minimum common supergraph

[Figure: example graphs $G_1$ and $G_2$ over the nodes A, B, C, D, with their $mcs(G_1, G_2)$ and $MCS(G_1, G_2)$]

$d_{MMCSN}(G_1, G_2) = 1 - \frac{2 + 1}{4 + 5} = 0.667$
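
A Python sketch of the MMCSN distance for the special case where every node label is unique within a graph, as in the document graphs with one node per term. Under that assumption the maximum common subgraph reduces to the intersection of node and edge sets and the minimum common supergraph to their union. Edge labels are ignored, the function names are hypothetical, and the two graphs are made up for illustration rather than copied from the figure.

```python
def size(g):
    """|G| = |V| + |E| for g = (nodes, edges)."""
    return len(g[0]) + len(g[1])

def mcs(g1, g2):
    """Maximum common subgraph, assuming node labels are unique in each graph."""
    return g1[0] & g2[0], g1[1] & g2[1]

def MCS(g1, g2):
    """Minimum common supergraph under the same unique-label assumption."""
    return g1[0] | g2[0], g1[1] | g2[1]

def d_mmcsn(g1, g2):
    """1 - |mcs| / |MCS|; undefined when both graphs are empty."""
    return 1 - size(mcs(g1, g2)) / size(MCS(g1, g2))

# Hypothetical graphs (not the ones drawn on the slide).
g1 = ({"A", "B", "C"}, {("A", "B"), ("B", "C"), ("A", "C")})
g2 = ({"A", "B", "D"}, {("A", "B"), ("A", "D"), ("B", "D")})
distance = d_mmcsn(g1, g2)  # 1 - (2 + 1) / (4 + 5)
```

For general graphs with repeated labels, finding the maximum common subgraph is a much harder search problem; the set operations here are valid only because labels identify nodes.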

Page 27

Other Distance Measures

• Bunke and Shearer (1998):

$d_{MCS}(G_1, G_2) = 1 - \frac{|mcs(G_1, G_2)|}{\max(|G_1|, |G_2|)}$

• Wallis et al. (2001):

$d_{WGU}(G_1, G_2) = 1 - \frac{|mcs(G_1, G_2)|}{|G_1| + |G_2| - |mcs(G_1, G_2)|}$

• Bunke (1997):

$d_{UGU}(G_1, G_2) = |G_1| + |G_2| - 2|mcs(G_1, G_2)|$

• Fernández and Valiente (2001):

$d_{MMCS}(G_1, G_2) = |MCS(G_1, G_2)| - |mcs(G_1, G_2)|$
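
All four measures depend only on graph sizes, so they can be sketched in Python as functions of $|G_1|$, $|G_2|$, $|mcs|$, and $|MCS|$. The function names and the example sizes are hypothetical.

```python
def d_mcs(size_g1, size_g2, size_mcs):
    """Bunke and Shearer (1998)."""
    return 1 - size_mcs / max(size_g1, size_g2)

def d_wgu(size_g1, size_g2, size_mcs):
    """Wallis et al. (2001)."""
    return 1 - size_mcs / (size_g1 + size_g2 - size_mcs)

def d_ugu(size_g1, size_g2, size_mcs):
    """Bunke (1997); not normalized to [0, 1]."""
    return size_g1 + size_g2 - 2 * size_mcs

def d_mmcs(size_MCS, size_mcs):
    """Fernández and Valiente (2001); not normalized to [0, 1]."""
    return size_MCS - size_mcs

# Hypothetical sizes: |G1| = |G2| = 6, |mcs| = 3, |MCS| = 9.
results = (d_mcs(6, 6, 3), d_wgu(6, 6, 3), d_ugu(6, 6, 3), d_mmcs(9, 3))
```

The first two measures are bounded by 1, while the last two grow with graph size, which matters when documents of very different lengths are compared.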

Page 28

k-Nearest Neighbors with Graphs
Sample Accuracy Results (Schenker et al., 2004)

Benchmark Data Set: K-series (Boley et al., 1999)
2,340 web documents from 20 categories
Source: English news pages hosted at Yahoo!

[Figure: classification accuracy (70% to 86%) versus Number of Nearest Neighbors (k), from 1 to 10. Series: Vector model (cosine), Vector model (Jaccard), Graphs (40 nodes/graph), Graphs (70 nodes/graph), Graphs (100 nodes/graph), Graphs (150 nodes/graph). Annotations: “Best results with graphs”, “Best results with vectors”]

Page 29

k-Nearest Neighbors with Graphs
Average Time to Classify One Document

Method                     Average time to classify one document
Vector (cosine)            7.8 seconds
Vector (Jaccard)           7.79 seconds
Graphs, 40 nodes/graph     8.71 seconds
Graphs, 70 nodes/graph     16.31 seconds
Graphs, 100 nodes/graph    24.62 seconds

Page 30

“Lazy” Document Categorization with Graph-Based Models

• Advantages
– Keeps HTML structure information
– Retains original order of words
– Outperforms the vector-space model with several distance measures
• Limitation
– Can work only with “lazy” classifiers (such as k-NN), which have a very low classification speed
• Conclusion
– Graph models cannot be used directly for fast, model-based classification of web documents (e.g., using a decision tree)
• Solution
– The hybrid approach: represent a document as a vector of sub-graphs (in a few minutes…)

Page 31

The Graph-Based k-Means Clustering Algorithm

Inputs: the set of n data items (represented by graphs) and a parameter k, defining the number of clusters to create
Outputs: the centroids of the clusters (represented by median graphs) and, for each data item, the cluster (an integer in [1,k]) it belongs to

Step 1. Assign each data item randomly to a cluster (from 1 to k).
Step 2. Using the initial assignment, determine the median of the set of graphs of each cluster.
Step 3. Given the new medians, assign each data item to be in the cluster of its closest median, using a graph-theoretic distance measure.
Step 4. Re-compute the medians as in Step 2. Repeat Steps 3 and 4 until the medians do not change.

The median of a set of graphs S (Bunke et al., 2001) is a graph g ∈ S such that g has the lowest average distance to all elements in S:

$g = \arg\min_{\forall s \in S} \left( \frac{1}{|S|} \sum_{i=1}^{|S|} d(s, G_i) \right)$
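
A compact Python sketch of Steps 1 to 4 with the median defined as on this slide. The distance is passed in as a function; the names, the iteration cap, the 0-based cluster numbering, and the handling of empty clusters (re-seeding with a random item) are assumptions added for illustration.

```python
import random

def median(items, distance):
    """The member of items with the lowest average distance to all members."""
    return min(items, key=lambda s: sum(distance(s, g) for g in items) / len(items))

def graph_k_means(items, k, distance, seed=0, max_iter=100):
    """Return (medians, assignment); clusters are numbered 0..k-1 here."""
    rng = random.Random(seed)
    assignment = [rng.randrange(k) for _ in items]          # Step 1
    medians = None
    for _ in range(max_iter):
        new_medians = []
        for c in range(k):                                  # Steps 2 and 4
            members = [x for x, a in zip(items, assignment) if a == c]
            # Assumption: an empty cluster is re-seeded with a random item.
            new_medians.append(median(members, distance) if members else rng.choice(items))
        if new_medians == medians:                          # medians unchanged: stop
            break
        medians = new_medians
        assignment = [                                      # Step 3
            min(range(k), key=lambda c: distance(x, medians[c])) for x in items
        ]
    return medians, assignment

# Stand-in data: integers with absolute difference instead of graphs.
items = [1, 2, 3, 10, 11, 12]
medians, assignment = graph_k_means(items, 2, lambda a, b: abs(a - b))
```

Because the median is always one of the input items, no new graph has to be synthesized, and the same loop works for any objects that come with a distance function.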

Page 32

Graph-Based Document Clustering
Comparative Evaluation – Dunn Index

$DI = \frac{d_{min}}{d_{max}}$

$d_{min}$ - the minimum distance between any two objects in different clusters
$d_{max}$ - the maximum distance between any two items in the same cluster

[Figure: Dunn index comparison chart, annotated “The best graph-based methods” and “The best vector-based method”]

5/29/2009 Lecture No. 11 32
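
The Dunn index as defined on this slide can be computed directly from pairwise distances. In this Python sketch the function name is hypothetical and integers with absolute difference stand in for graphs and a graph distance.

```python
from itertools import combinations

def dunn_index(items, labels, distance):
    """d_min / d_max over all pairs, split by whether the pair shares a cluster."""
    pairs = list(combinations(zip(items, labels), 2))
    between = [distance(a, b) for (a, la), (b, lb) in pairs if la != lb]
    within = [distance(a, b) for (a, la), (b, lb) in pairs if la == lb]
    return min(between) / max(within)

# Stand-in data: two clusters of numbers.
items = [1, 2, 10, 12]
labels = [0, 0, 1, 1]
di = dunn_index(items, labels, lambda a, b: abs(a - b))  # 8 / 2
```

Larger values indicate compact, well-separated clusters. The sketch assumes at least two clusters and at least one cluster with two distinct members; otherwise `min` or the division fails.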

Page 33

THE HYBRID APPROACH TO WEB DOCUMENT CATEGORIZATION

Presented in Markov et al., 2008

Page 34

The Hybrid Approach to Document Categorization (Markov et al., 2006)

• Basic Idea
– Represent a document as a vector of sub-graphs
– Categorize documents with a model-based classifier (e.g., a decision tree), which is much faster than a “lazy” method
• Naïve Approach
– Select sub-graphs that are most frequent in each category
• Smart Approach
– Select sub-graphs that are more frequent in a specific category than in other categories
• Smart Approach with Fixed Threshold
– Select sub-graphs that are frequent in a specific category and more frequent than in other categories

Page 35

Predictive Model Induction with Hybrid Representation (Markov et al., 2006)

[Flow diagram with the stages: Web or text documents, Graph construction, Documents graph representation, Subgraph extraction, Text representation, Feature selection (optional), Creation of prediction model, Document classification rules]

• Set of documents with known categories – the training set
• Extraction of sub-graphs relevant for classification
• Representation of all documents as vectors with Boolean values for every sub-graph in the set
• Identification of best attributes (Boolean features) for classification
• Finally – prediction model induction and extraction of classification rules
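
A Python sketch of the Boolean vector step, assuming unique node labels so that sub-graph containment becomes a subset test. The frequency filter `select_frequent` and its threshold `t_min` are hypothetical stand-ins for the selection criteria; generating the candidate sub-graphs themselves is not shown.

```python
def contains(doc_graph, sub_graph):
    """With unique node labels, sub-graph containment is a subset test."""
    return sub_graph[0] <= doc_graph[0] and sub_graph[1] <= doc_graph[1]

def to_boolean_vector(doc_graph, sub_graphs):
    """One Boolean feature per selected sub-graph."""
    return [contains(doc_graph, sg) for sg in sub_graphs]

def select_frequent(candidates, category_docs, t_min):
    """Keep candidates found in at least a fraction t_min of a category's documents."""
    return [
        sg for sg in candidates
        if sum(contains(d, sg) for d in category_docs) / len(category_docs) >= t_min
    ]

# Hypothetical document graphs of one category, as (nodes, edges).
docs = [
    ({"arab", "bank", "politic"}, {("arab", "bank"), ("arab", "politic")}),
    ({"arab", "west", "politic"}, {("arab", "west"), ("arab", "politic")}),
]
candidates = [
    ({"arab", "politic"}, {("arab", "politic")}),
    ({"arab", "bank"}, {("arab", "bank")}),
]
selected = select_frequent(candidates, docs, t_min=1.0)
vectors = [to_boolean_vector(d, candidates) for d in docs]
```

The resulting fixed-length vectors can be fed to any model-based classifier, which is the point of the hybrid representation.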

Page 36

Frequent Subgraph Extraction – Example
(based on the FSG algorithm by Kuramochi and Karypis, 2004)

[Figure: columns “Subgraphs”, “Document Graph”, and “Extensions”, built from the nodes Arab, Bank, Politic, and West]

Page 37

Comparative Evaluation

• Benchmark Data Sets
– K-series (Source: Boley et al., 1999)
  • 2,340 documents and 20 categories
  • Documents in that collection were originally news pages hosted at Yahoo
– U-series (Source: Craven et al., 1998)
  • 4,167 documents taken from the computer science departments of four different universities: Cornell, Texas, Washington, and Wisconsin
  • 7 major categories: course, faculty, students, project, staff, department, and other
  • Known as the “WebKB Dataset”
• Dictionary construction
– The N most frequent words in each document were taken for vector / graph construction, that is, exactly the same words in each document were used for both the graph-based and the bag-of-words representations

Page 38

Categorization Accuracy and Speed

[Figure: four panels plotting Classification Accuracy against Frequent Terms Used (20 to 100): “Accuracy Comparison for C4.5, K-series”, “Accuracy Comparison for NBC, K-series”, “Accuracy Comparison for C4.5, U-series”, “Accuracy Comparison for NBC, U-series”. Series: Bag-of-words, Hybrid Naïve, Hybrid Smart, Hybrid Smart with Fixed Threshold. Classification speed annotations on the slide: 1.2, 0.3, 125, and 1.7 sec. per 1,000 documents]

Page 39

Percentage of Multi-node Subgraphs

[Figure: four panels plotting the share of Multi Node Graphs against Frequent Terms Used (20 to 100): “Relative Number of Multi Node Graphs for C4.5, K-series”, “Relative Number of Multi Node Graphs for C4.5, U-series”, “Relative Number of Multi Node Graphs for NBC, K-series”, “Relative Number of Multi Node Graphs for NBC, U-series”. Series: Hybrid Naïve, Hybrid Smart, Hybrid Smart with Fixed Threshold]

Page 40

GRAPH-BASED KEYWORD EXTRACTION

Litvak and Last (2008)

Page 41

Our methodology

• A keyword is a word that appears in the document summary.
• Document representation: the “simple” directed graph
– Unique nodes: non-stop words
– Unlabeled edges: order relationship
  • A → B: B appears after A in the same sentence
• Keyword extraction as a first stage of extractive summarization
– The most salient words (“keywords”) are extracted in order to generate a summary.

Page 42

The “simple” graph-based document representation

Example:

Text
<title> Hurricane Gilbert Heads Toward Dominican Coast </title>
<TEXT> Hurricane Gilbert swept toward the Dominican Republic Sunday, and the Civil Defense alerted its heavily populated south coast to prepare for high winds, heavy rains and high seas. The storm was approaching from the southeast with sustained winds of 75 mph gusting to 92 mph. </TEXT>

Graph
[Figure: the document graph, with nodes Hurricane, Gilbert, heads, swept, Dominican, Republic, Sunday, civil, defense, alerted, heavily, populated, south, coast, prepare, winds, heavy, rains, seas, storm, approaching, southeast, sustained, 75, mph, gusting, 92]

Page 43

Keyword extraction
The supervised approach

• Training a classification algorithm on a repository of summarized documents.
• Each node in a document graph belongs to one of two classes:
– YES - the word is included in the document extractive summary
– NO - otherwise.

Page 44

The Supervised approach (cont.)

The features used for node classification:
• In Degree – number of incoming edges
• Out Degree – number of outgoing edges
• Degree – total number of edges
• Frequency – term frequency of the word represented by the node
• Frequent words distribution – ∈ {0, 1}, equals 1 iff Frequency ≥ threshold (0.05)
• Location Score – the average of the location scores of all sentences (S(N)) containing the word N represented by the node, where the location score of a sentence is the reciprocal of its position in the text (1/i)
• Tfidf Score – the tf-idf score of the word represented by the node. We used formula: [formula image not preserved]
• Headline Score – ∈ {0, 1}, equals 1 iff the document headline contains the word represented by the node

Page 45: Graph-Based Methods May2009.ppt - BGUfrankel/TextMiningMay09/... · • Graph-Based Representations of Text andBased Representations of Text and Web Documents • Graph-Based Categorization

Feature extraction

Example:

[Figure: document graph; nodes include southeast, Gilbert, swept, sustained, storm, approaching, winds, heads, mph, 75, gusting, Hurricane, Dominican, heavy, 92, seas, rains, alerted, populated, heavily, civil, prepare, south, defense, Sunday, coast, Republic]

• Node “Dominican”:
– In Degree = 2
– Out Degree = 2
– Degree = 4
– Frequency = 2/27 = 0.074
– Frequent words distribution = 1
– Location Score = (1/1 + 1/2)/2 = 0.75
– Tfidf Score = (0.07/1.07) * log2(566/2) = 0.53
– Headline Score = 1
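The arithmetic for the “Dominican” node can be checked with a few lines. The tf-idf expression used here, tf/(1 + tf) · log2(|D|/df), is inferred from the numbers in this example rather than stated in the text, so treat it as an assumption; the reading of 566 as the collection size and 2 as the document frequency, and all variable names, are likewise illustrative.

```python
import math

frequency = 2 / 27                      # 2 occurrences among 27 words
frequent = int(frequency >= 0.05)       # threshold from the feature list
location_score = (1 / 1 + 1 / 2) / 2    # word appears in sentences 1 and 2

def tfidf(tf, n_docs, doc_freq):
    # Assumed form, inferred from the worked example: tf/(1+tf) * log2(|D|/df).
    return tf / (1 + tf) * math.log2(n_docs / doc_freq)

tfidf_rounded = tfidf(0.07, 566, 2)     # the example rounds tf to 0.07
tfidf_exact = tfidf(frequency, 566, 2)  # same expression with tf = 2/27
```

With tf rounded to 0.07 the expression gives about 0.53, matching the example; with the unrounded 2/27 it gives about 0.56.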

Page 46

The unsupervised approach

• Unsupervised text unit extraction in the context of the text summarization task.

• No collection of summarized documents is needed.

• We apply the HITS algorithm to document graphs.

Page 47

HITS (Kleinberg, J.M. 1999)

• For each node, HITS produces two sets of scores, an “authority” and a “hub”:

• For the total rank (H) calculation we used the following four functions:
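A minimal sketch of HITS by power iteration, assuming the standard Kleinberg updates: a node's authority score is the sum of the hub scores of the nodes pointing to it, and its hub score is the sum of the authority scores of the nodes it points to, with normalization after each round. The four combination functions in `RANK_FUNCTIONS` (authority only, hub only, average, maximum) are illustrative assumptions and are not taken from the text; `hits` and `top_n` are hypothetical names.

```python
import math

def hits(edges, iterations=50):
    """Return (authority, hub) score dicts for a directed graph of (source, target) pairs."""
    nodes = {n for edge in edges for n in edge}
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum hub scores over incoming edges.
        new_auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        # Hub update: sum the freshly updated authority scores over outgoing edges.
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        # Normalize both vectors to unit Euclidean length.
        a_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {n: v / a_norm for n, v in new_auth.items()}
        hub = {n: v / h_norm for n, v in new_hub.items()}
    return auth, hub

# Assumed ways to combine the two scores into one rank (illustrative only).
RANK_FUNCTIONS = {
    "authority": lambda a, h: a,
    "hub": lambda a, h: h,
    "average": lambda a, h: (a + h) / 2,
    "max": max,
}

def top_n(edges, n, combine):
    """Return the n highest-ranked nodes; ties are broken alphabetically."""
    auth, hub = hits(edges)
    rank = {node: combine(auth[node], hub[node]) for node in auth}
    return sorted(rank, key=lambda node: (-rank[node], node))[:n]

# Toy graph: "a" and "b" both point to "c", which points to "d".
edges = [("a", "c"), ("b", "c"), ("c", "d")]
auth, hub = hits(edges)
```

In the toy graph, "c" ends up as the dominant authority and "a" and "b" as the dominant hubs, so `top_n(edges, 1, RANK_FUNCTIONS["authority"])` selects "c".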

Page 48

Experimental results

• DUC 2002 collection:
– 566 English texts along with 2-3 summaries per document on average.
– The size (|V|) of syntactic graphs extracted from these texts is 196 on average, varying from 62 to 876.

Page 49

Comparison of supervised and unsupervised approaches

• We consider an unsupervised model based on extracting the top N ranked words for different values of 10 ≤ N ≤ 120.

• A set of the top 2 features, Frequent words distribution and In Degree, is used for NBC.

Page 50

SUMMARY

Prof. Mark Last (BGU)

Page 51

Selected Publications

• A. Schenker, M. Last, H. Bunke, A. Kandel, "Classification of Web Documents Using Graph Matching", International Journal of Pattern Recognition and Artificial Intelligence, Special Issue on Graph Matching in Computer Vision and Pattern Recognition, Vol. 18, No. 3, 2004, pp. 475-496.

• A. Schenker, H. Bunke, M. Last, A. Kandel, "Graph-Theoretic Techniques for Web Content Mining", World Scientific, 2005.

• A. Markov, M. Last, "A Simple, Structure-Sensitive Approach for Web Document Classification", Atlantic Web Intelligence Conference (AWIC2005), Lodz, Poland, June 2005.

• A. Markov, M. Last, and A. Kandel, "Fast Categorization of Web Documents Represented by Graphs", in Advances in Web Mining and Web Usage Analysis, O. Nasraoui, et al. (Eds), Springer Lecture Notes in Computer Science (LNCS/LNAI), Vol. 4811, 2007, pp. 56-71.

• A. Markov, M. Last, and A. Kandel, "The Hybrid Representation Model for Web Document Classification", International Journal of Intelligent Systems, Vol. 23, No. 6, pp. 654-679, 2008.

• M. Litvak and M. Last, "Graph-Based Keyword Extraction for Single-Document Summarization", Proceedings of the 2nd Workshop on Multi-source, Multilingual Information Extraction and Summarization (MMIES2), Manchester, UK, August 23, 2008, pp. 17–24.

Page 52

Future Research

• Enhancing graph representations of text and web documents
– Utilizing POS tagging
– Concept fusion based on available ontologies
– Implementing graph representations for more languages

• Identification of the most relevant sections in long documents, online forums, etc.

• Cross-lingual summarization of text documents

• Topic detection and tracking in the web content

• Opinion and sentiment mining

Page 53

Hohentwiel, May 2008
