
Collective Relational Clustering

Indrajit Bhattacharya
Assistant Professor, Department of CSA
Indian Institute of Science

Relational Data

Recent abundance of relational (‘non-iid’) data
o Internet
o Social networks
o Citations in scientific literature
o Biological networks
o Telecommunication networks
o Customer shopping patterns
o …

Various applications
o Web Mining
o Online Advertising and Recommender Systems
o Bioinformatics
o Citation analysis
o Epidemiology
o Text Analysis
o …

Clustering for Relational Data

Lot of research in Statistical Relational Learning over the last decade
o Series of focused workshops in premier conferences
o Confluence of different research areas

Recent focus on unsupervised learning from relational data
o Regular papers in premier conferences
o Recent book: Relational Data Clustering: Models, Algorithms, and Applications, Bo Long, Zhongfei Zhang, Philip S. Yu, CRC Press 2009

Traditional vs Relational Clustering

Traditional clustering focuses on ‘flat’ data
o Cluster based on features of individual objects

Relational clustering additionally considers relations
o Heterogeneous relations across objects of different types
o Homogeneous relations across objects of the same type

Naïve solution: Flatten data, then cluster
o Loss of relational and structural information
o No influence propagation across relational chains
o Cannot discover interaction patterns across clusters

Collective relational clustering looks to cluster different data objects jointly

Early Instances of Relational Clustering

Graph Partitioning Problem
o Single type homogeneous relational data

Co-clustering Problem
o Bi-type heterogeneous relational data

General relational clustering considers multi-type data with heterogeneous and homogeneous relationships

Talk Outline

Introduction

Motivating Application: Entity Resolution over Heterogeneous Relational Data

The Relational Clustering Problem

Quick Survey of Relational Clustering Approaches

Probabilistic Model for Structured Relations

Probabilistic Model for Heterogeneous Relations

Future Directions








Application: Entity Resolution

Web data on Stephen Johnson


Ind. Researcher

Professor

Media Presenter

Movie Director

Photographer

Administrator


Data contains references to real world entities
o Structured entities (People, Products, Institutions, …)
o Topics / Concepts (comp science, movies, politics, …)

Aim: Consolidate (cluster) according to entities
o Entity Resolution: Map structured references to entities
o Sense Disambiguation: Group words according to senses
o Topic Discovery: Group words according to topics or concepts

Relationships for Entity Resolution

Movie Director

Photographer

Each document or structured record is a (co-occurrence) relation between references to persons, places, organizations, concepts, etc.

Relational Network Among Entities

[Figure: relational network around several distinct entities that share the name Stephen Johnson. Other nodes shown: Alfred Aho, Jeffrey Ullman, Bell Labs, Comp. Sc., Prog. Lang., Mark Cross, Chris Walshaw, Univ of Greenwich, HPC, Photography, Ansel Adams, Cinema, Direction, Peter Gabriel, White House, EPA, George W. Bush, Government, Media, Music, BBC, Entertainment, Leeds University]

Using the Network for Clustering

Given the network, find the assignment of data items or references to these entities
o Collective cluster assignment

Find a “nice” network of entities with regularities in the relational structure
o Researchers collaborate with colleagues on similar topics
o People send emails to colleagues and friends

Collective Cluster Assignment: Example

…To find a minimal match cost, dynamic programming, approach of [A Aho and S Johnson, 76], is used. …

[Figure: references grouped into ten clusters, labeled Cluster 1 to Cluster 5 and Cluster 11 to Cluster 15. Reference groups shown:]
o Stephen Johnson, S Johnson, SC Jonshon
o Alfred Aho, A Aho, A V Aho
o Jeffrey Ullman, J. Ullman, J D Ullman
o Bell Labs, AT&T Bell
o code generation, grammar, expression tree
o Stephen Johnson, Steve Johnson, S Johnson, S P Johnson
o Mark Cross, M Cross
o Chris Walshaw, Chris Walsaw, C Walshaw
o U. Greenwich, U. of GWich
o Parallelization, Structured Mesh, Code generation

Regularity in a Cluster Network

[Figure: two alternative clusterings of the same references: S. Johnson, S. Johnson, Stephen C. Johnson, S. Johnson, M. G. Everett, M. Everett, Alfred V. Aho, A. Aho]

Clustering 1 (cluster-cluster relation matrix)

     M   J1  A   J2
M    1   1   0   0
J1   1   1   1   0
A    0   1   1   1
J2   0   0   1   1

Clustering 2 (cluster-cluster relation matrix)

     M   J1  A   J2
M    1   1   0   0
J1   1   1   0   0
A    0   0   1   1
J2   0   0   1   1

Cl. 1 has better separation of attributes

Cl. 2 has fewer cluster-cluster relations
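
A small sketch of how the "fewer cluster-cluster relations" comparison could be checked. The two matrices are copied from the slide (rows and columns M, J1, A, J2); which matrix belongs to which clustering is an assumption based on the order in which they appear, and `count_cluster_relations` is a hypothetical helper name.

```python
# Cluster-cluster relation matrices as read off the slide (order: M, J1, A, J2).
# Assumption: the first matrix belongs to Clustering 1, the second to Clustering 2.
clustering_1 = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
]
clustering_2 = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]


def count_cluster_relations(matrix):
    """Count undirected relations between distinct clusters (upper triangle only)."""
    n = len(matrix)
    return sum(matrix[i][j] for i in range(n) for j in range(i + 1, n))


relations_1 = count_cluster_relations(clustering_1)
relations_2 = count_cluster_relations(clustering_2)
print(relations_1, relations_2)
```

Under this reading, the second clustering links fewer distinct cluster pairs, which is the kind of regularity the slide points to.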

Collective Relational Clustering

Goal: Given relations among data items, assign them to clusters such that relational neighborhoods of clusters have regularities (in addition to attribute similarities within clusters)

Challenges:
o Collective / joint clustering decisions over relational neighborhoods
o Defining regularity in relational neighborhoods
o Searching over relational networks








Relational Clustering: Different Approaches

Greedy Agglomerative Algorithms
o Bhattacharya et al ’04, Dong et al ’05

Information Theoretic Methods
o Mutual Information (Dhillon et al ’03)
o Information Bottleneck (Slonim & Tishby ’03)
o Bregman Divergence (Merugu et al ’04, Merugu et al ’06)

Matrix Factorization Techniques
o SVD, BVD (Long et al ’05, Long et al ’06)

Graph Cuts
o Min Cut, Ratio Cut, Normalized Cut (Dhillon ’01)

Relational Clustering: Probabilistic Approaches

Models for Co-clustering
o Taskar et al, ’01; Hofmann et al, ’98

Infinite Relational Model (Kemp et al, ’06)

Mixed Membership Relational Clustering model (Long et al, ’06)

Topic Model Extensions
o Correlated Topic Models (Blei et al, ’06)
o Grouped Cluster Model (Bhattacharya et al, ’06)
o Gaussian Process Topic Models (Agovic & Banerjee, ’10)

Markov Logic Network (Kok & Domingos, ’08)

Model for Mixed Relational Data (Bhattacharya et al, ’08)








Modeling Groups of Entities

Bell Labs Group: Alfred V Aho, Jeffrey D Ullman, Ravi Sethi, Stephen C Johnson

Parallel Processing Research Group: Mark Cross, Chris Walshaw, Kevin McManus, Stephen P Johnson, Martin Everett

P1: C. Walshaw, M. Cross, M. G. Everett, S. Johnson

P2: C. Walshaw, M. Cross, M. G. Everett, S. Johnson, K. McManus

P3: C. Walshaw, M. Cross, M. G. Everett

P4: Alfred V. Aho, Stephen C. Johnson, Jeffrey D. Ullman

P5: A. Aho, S. Johnson, J. Ullman

P6: A. Aho, R. Sethi, J. Ullman

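LDA-Group Model

[Figure: plate diagram of the LDA-Group model over the variables θ, z, a, r, Φ and V, with hyperparameters α and β and plates P, R, T and A]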

Entity label a and group label z for each reference r

θ: ‘mixture’ of groups for each co-occurrence

Φz: multinomial for choosing entity a for each group z

Va: multinomial for choosing reference r from entity a

Dirichlet priors with α and β
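
Read as a generative story, the bullets above suggest the following factorization for a single co-occurrence with R references. This is a sketch rather than the talk's own formula: the indices p (co-occurrence) and i (reference) are notational assumptions, and so is the pairing of α with θ and of β with Φ.

```latex
\[
P(\mathbf{z}, \mathbf{a}, \mathbf{r} \mid \theta_p, \Phi, V)
  = \prod_{i=1}^{R} P(z_i \mid \theta_p)\, P(a_i \mid \Phi_{z_i})\, P(r_i \mid V_{a_i}),
\qquad
\theta_p \sim \mathrm{Dirichlet}(\alpha), \quad
\Phi_z \sim \mathrm{Dirichlet}(\beta)
\]
```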

Annotations on the LDA-Group model diagram:
o Reference, e.g. S. Johnson
o Entity, e.g. Stephen P Johnson
o Group, e.g. Bell Labs
o Generate document
o Generate names

Inference Using Gibbs Sampling

Approximate inference with Gibbs sampling
o Find conditional distribution for any reference given current groups and entities of all other references
o Sample from conditional distribution
o Repeat over all references until convergence

When number of groups and entities are known

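\[
P(z_i = t \mid \mathbf{z}_{-i}, \mathbf{a}, \mathbf{r}) \;\propto\;
\frac{n^{DT}_{d_i t} + \alpha / T}{n^{DT}_{d_i *} + \alpha} \times
\frac{n^{AT}_{a_i t} + \beta / A}{n^{AT}_{* t} + \beta}
\]

\[
P(a_i = a \mid \mathbf{z}, \mathbf{a}_{-i}, \mathbf{r}) \;\propto\;
\frac{n^{AT}_{a t} + \beta / A}{n^{AT}_{* t} + \beta} \times Sim(r_i, v_a)
\]

For a new entity:

\[
P(a_i = a_{new} \mid \mathbf{z}, \mathbf{a}_{-i}, \mathbf{r}) \;\propto\;
\frac{\beta / A}{n^{AT}_{* t} + \beta} \times Sim(r_i, v_{a_{new}})
\]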

Hidden name for a new entity equally prefers all observed references
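
A minimal sketch of one Gibbs step for the group label of a single reference. It assumes the usual collapsed form in which the conditional is a document-group term times an entity-group term, each smoothed by α/T or β/A; the count-table layout and the names `group_conditional` and `sample_group` are hypothetical, and the toy counts are made up.

```python
import random


def group_conditional(doc, entity, n_dt, n_at, alpha, beta):
    """Return the normalized conditional P(z_i = t | rest) for every group t.

    n_dt[d][t]: references in co-occurrence d currently assigned to group t
    n_at[a][t]: references of entity a currently assigned to group t
    Both tables must already exclude the reference being resampled.
    """
    num_groups = len(n_dt[doc])
    num_entities = len(n_at)
    doc_total = sum(n_dt[doc])
    weights = []
    for t in range(num_groups):
        group_total = sum(n_at[a][t] for a in range(num_entities))
        doc_term = (n_dt[doc][t] + alpha / num_groups) / (doc_total + alpha)
        entity_term = (n_at[entity][t] + beta / num_entities) / (group_total + beta)
        weights.append(doc_term * entity_term)
    norm = sum(weights)
    return [w / norm for w in weights]


def sample_group(probs, rng):
    """Draw a group index from the conditional distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Toy counts (hypothetical): 2 co-occurrences, 3 entities, 2 groups.
n_dt = [[3, 0], [0, 2]]
n_at = [[2, 0], [1, 0], [0, 2]]
probs = group_conditional(doc=0, entity=0, n_dt=n_dt, n_at=n_at, alpha=1.0, beta=1.0)
new_group = sample_group(probs, random.Random(0))
```

A full sampler would decrement the counts for the reference, call these two functions, increment the counts for the sampled group, and repeat over all references.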

Non Parametric Entity Resolution

Number of entities not a parameter
o Allow number of entities to grow with data

For each reference choose any existing entity, or a new entity a_new

Faster Inference: Split-Merge Sampling

Naïve strategy reassigns data items individually

Alternative: allow clusters to merge or split

For cluster ai, find conditional probabilities for
1. Merging with existing cluster aj
2. Splitting back to last merged clusters
3. Remaining unchanged

Sample next state for ai from distribution

O(n g + e) time per iteration compared to O(n g + n e)

ER: Evaluation Datasets

CiteSeer
o 1,504 citations to machine learning papers (Lawrence et al.)
o 2,892 references to 1,165 author entities

arXiv
o 29,555 publications from High Energy Physics (KDD Cup ’03)
o 58,515 refs to 9,200 authors

Elsevier BioBase
o 156,156 Biology papers (IBM KDD Challenge ’05)
o 831,991 author refs
o Keywords, topic classifications, language, country and affiliation of corresponding author, etc

ER: Experimental Evaluation

LDA-ER outperforms baselines in all datasets
o A: Same entity to refs with attr similarity over a threshold
o A*: Transitive closure over decisions in A

Baselines require threshold as parameter
o Best achievable performance over all thresholds

LDA-ER does not require similarity threshold

         CiteSeer  arXiv  BioBase
A        0.980     0.976  0.568
A*       0.990     0.971  0.559
LDA-ER   0.993     0.981  0.645
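
The two baselines can be sketched in a few lines. This is an illustration under stated assumptions, not the evaluation code behind the table: the talk does not name the attribute similarity measure, so `difflib.SequenceMatcher` stands in for it, and the function names and the threshold value are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def attr_sim(a, b):
    """Stand-in string similarity in [0, 1]; the talk does not name the measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def baseline_a(refs, threshold):
    """A: mark a pair as co-referent when attribute similarity reaches the threshold."""
    return {(i, j) for i, j in combinations(range(len(refs)), 2)
            if attr_sim(refs[i], refs[j]) >= threshold}


def baseline_a_star(refs, threshold):
    """A*: transitive closure over the pairwise decisions of A, using union-find."""
    parent = list(range(len(refs)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in baseline_a(refs, threshold):
        parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(refs)):
        clusters.setdefault(find(i), []).append(refs[i])
    return sorted(sorted(c) for c in clusters.values())


refs = ["Alfred V. Aho", "Alfred Aho", "A. Aho", "M. Cross"]
clusters = baseline_a_star(refs, threshold=0.7)
```

Both baselines depend on the threshold, which is why the slide reports their best performance over all thresholds.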

ER: Trends in Semi-Synthetic Data

Bigger improvement with
o bigger % of ambiguous refs
o more refs per co-occurrence
o more neighbors per entity

[Figure: three plots of F1 for A, A* and LDA-ER, against (1) percentage of ambiguous attributes, (2) avg #references / hyper-edge, (3) avg # neighbors / entity]








Entity Resolution over a Document Collection

In a document collection, which names refer to the same entities?

Harrison Ford is a resourceful person who stay out of reach to the marshal. David Toohy has written some interesting plots and chases

When it comes to create a universe George Lucas is undisputed leader. Harrison Ford has done justice and special effects are superb.

Lucas script seemed funny enough. It was a fairly good movie with couple of laughs. There was not much story but Ford was good.

Harrison Ford the adventurer is it in yet another quest. To find his father who is in search of the Holy Grail. George Lucas has done a wonderful job.

Jointly Modeling the Textual Content

• Words are indicative of the concept entities
• Concept entities are related to person entities





Relational Clustering Over Structured and Unstructured Data

Document words belong to two categories
o References to structured entities
o References to (unstructured) concept entities

Collectively determine clusters for both types of entities

Relational patterns over two types of entities

Simplifications for learning
o Observed domain of entities w/ structured attributes
o Observed relationships between domain entities and categories for constructing relational neighborhoods

Generative Model for Documents from Structured Entities

[Figure: plate diagram over the variables t, e, c, a and w, with plates n, m and N]

Generate N reviews one by one

First choose a genre, say Action

Choose an Action movie, say Indiana Jones

Generate n mentions for movie
o Choose movie attribute, say Actor
o Get attribute value, say Harrison Ford
o Generate mention for attribute value, e.g. Harrison Ford or Ford

Generate m Action words
o adventurer, quest, justice …

P(t): Prior over genres
P(e | t): Movies for genre
P(w | t): Words for genre
P(c): Prior over movie attributes
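
The generative steps for a review can be sketched as follows. Every probability below is a made-up toy value, the two-movie table is a cut-down illustration, and `noisy_mention` is a hypothetical stand-in for the mention model, which the slide does not spell out.

```python
import random

# Hypothetical toy parameters; none of these numbers come from the talk.
P_T = {"Action": 0.5, "Comedy": 0.5}                              # P(t)
P_E_GIVEN_T = {"Action": {"Indiana Jones": 1.0},                  # P(e | t)
               "Comedy": {"American Graffiti": 1.0}}
P_W_GIVEN_T = {"Action": {"adventurer": 0.4, "quest": 0.3, "justice": 0.3},
               "Comedy": {"funny": 0.6, "laughs": 0.4}}           # P(w | t)
P_C = {"Actor": 0.5, "Writer": 0.5}                               # P(c)
ATTRS = {"Indiana Jones": {"Actor": "Harrison Ford", "Writer": "George Lucas"},
         "American Graffiti": {"Actor": "Harrison Ford", "Writer": "George Lucas"}}


def draw(dist, rng):
    """Sample a key of `dist` with probability proportional to its value."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys], k=1)[0]


def noisy_mention(value, rng):
    """Toy mention model: emit the full attribute value or only its last token."""
    return value if rng.random() < 0.5 else value.split()[-1]


def generate_review(n_mentions, m_words, rng):
    t = draw(P_T, rng)                      # choose a genre
    e = draw(P_E_GIVEN_T[t], rng)           # choose a movie of that genre
    mentions = []
    for _ in range(n_mentions):
        c = draw(P_C, rng)                  # choose a movie attribute
        mentions.append(noisy_mention(ATTRS[e][c], rng))
    words = [draw(P_W_GIVEN_T[t], rng) for _ in range(m_words)]
    return {"genre": t, "entity": e, "mentions": mentions, "words": words}


review = generate_review(n_mentions=2, m_words=3, rng=random.Random(0))
```

Both toy movies share the same actor and writer, so only the genre words can tell them apart, which mirrors the motivation for modeling words and mentions jointly.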

Entity Identification: Evaluation

Movie Reviews
o 12,500 reviews: First 10 reviews for top 50 movies for 25 genres

Structured Movie Database from IMDB
o 26,250 movies: Top 1250 movies from 25 genres + 25,000 others
o Movie table with 7 columns, but no movie name column
o Genre + Top 2 actors, actresses, directors, writers

Entity Identification Baseline
o Aggregate similarity over all mentions to score entity for doc
o Does not use unstructured words in document

Document Classification Baseline
o SVM-Light with default parameters
o Uses all words in the document, including structured mentions

Ent-Id: Experimental Results on IMDB

[Figure: Doc Cat Accuracy (0 to 100) against % Training Data (0 to 100) for DC-Base and JM]

Baseline catches up with joint model only when 35% docs provided for training

Improvement in ent-id accuracy

Significant drop in entropy over entity choices

          EI Accuracy  EI Entropy
JM        40.80%       0.67%
EI-Base   38.50%       2.359

Ent-Id: Results on Semi-Synthetic Data

[Figure: two plots against genre overlap p0 (from 1 down to 0), both on a 0 to 1 scale: Doc Cat Accuracy for JM and DC-Base, and Entity Id accuracy for JM and EI-Base]

Ent-Id improves from 38% to 60% for medium overlap and to 70% when words clearly indicate genre

80% training data for baseline, none for JM

Joint model outperforms baseline for large overlap between genres

Future Directions

Handling uncertain relations
o Coupling with information extraction

Modeling the cluster network
o Regularization for networks

Scalable inference mechanisms

Incorporating domain knowledge and user interaction
o Semi-supervision
o Active learning

References

A Agovic and A Banerjee, Gaussian Process Topic Models, UAI 2010
S Kok and P Domingos, Extracting Semantic Networks from Text via Relational Clustering, ECML 2008
I Bhattacharya, S Godbole, and S Joshi, Structured Entity Identification and Document Categorization: Two Tasks with One Joint Model, SIGKDD 2008
I Bhattacharya and L Getoor, Collective Entity Resolution in Relational Data, ACM-TKDD, March 2007
A Banerjee, S Basu, S Merugu, Multi-Way Clustering on Relation Graphs, SIAM SDM 2007
B Long, M Zhang, P S Yu, A Probabilistic Framework for Relational Clustering, SIGKDD 2007
D Zhou, J Huang, B Schoelkopf, Learning with hypergraphs: Clustering, classification, and embedding, NIPS 2007
B Long, M Zhang, X Wu, P S Yu, Spectral Clustering for Multi-type Relational Data, ICML 2006
I Bhattacharya and L Getoor, A Latent Dirichlet Model for Unsupervised Entity Resolution, SIAM SDM 2006
X Dong, A Halevy, J Madhavan, Reference reconciliation in complex information spaces, SIGMOD 2005
I Bhattacharya and L Getoor, Iterative Record Linkage for Cleaning and Integration, SIGMOD–DMKD, 2004
B Taskar, E Segal, D Koller, Probabilistic Classification and Clustering in Relational Data, IJCAI 2001

Backup Slides

P1: “JOSTLE: Partitioning of Unstructured Meshes for Massively Parallel Machines”, C. Walshaw, M. Cross, M. G. Everett, S. Johnson

P2: “Partitioning & Mapping of Unstructured Meshes to Parallel Machine Topologies”, C. Walshaw, M. Cross, M. G. Everett, S. Johnson, K. McManus

P3: “Dynamic Mesh Partitioning: A Unified Optimisation and Load-Balancing Algorithm”, C. Walshaw, M. Cross, M. G. Everett

P4: “Code Generation for Machines with Multiregister Operations”, Alfred V. Aho, Stephen C. Johnson, Jeffrey D. Ullman

P5: “Deterministic Parsing of Ambiguous Grammars”, A. Aho, S. Johnson, J. Ullman

P6: “Compilers: Principles, Techniques, and Tools”, A. Aho, R. Sethi, J. Ullman

Entity Resolution From Structured Relations

[Figure: two Stephen Johnson entities with their relational neighborhoods. One is shown with Alfred Aho, Jeffrey Ullman, Bell Labs and Prog. Lang.; the other with Mark Cross, Chris Walshaw, Univ of Greenwich and HPC]

LDA-ER Generative Process: Illustration

For each paper p:
1. Choose θp
2. For each author
   o Sample z from θp
   o Sample a from Φz
   o Sample r from Va

[Figure: generating the author references of paper P5]

θP5 = [ p(G1)=0.1, p(G2)=0.9 ]

ΦG1: Walshaw 0.2, Johnson1 0.2, McManus 0.2, Cross 0.2, Everett 0.2
ΦG2: Ullman 0.3, Aho 0.3, Sethi 0.2, Johnson2 0.2

o z=G2, a=Aho (from ΦG2), r=A.Aho (from VA)
o z=G2, a=U (from ΦG2), r=J.Ullman (from VU)
o z=G2, a=J2 (from ΦG2), r=S.Johnson (from VJ2)

VJ1 = Stephen P Johnson: S C Johnson 0.04, Stephen C Johnson 0.04, S Johnson 0.90
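
The three sampling steps per author can be sketched directly. The group-to-entity probabilities follow one reading of the slide's figure; the name distributions in `V` are invented for illustration, and θ is passed in as a fixed vector rather than drawn from its Dirichlet prior, which is a simplification.

```python
import random

# Group -> distribution over entities (one reading of the slide's figure).
PHI = {
    "G1": {"Walshaw": 0.2, "Johnson1": 0.2, "McManus": 0.2, "Cross": 0.2, "Everett": 0.2},
    "G2": {"Ullman": 0.3, "Aho": 0.3, "Sethi": 0.2, "Johnson2": 0.2},
}
# Entity -> distribution over observed name strings (made-up values).
V = {
    "Walshaw": {"C. Walshaw": 1.0},
    "Johnson1": {"S. Johnson": 1.0},
    "McManus": {"K. McManus": 1.0},
    "Cross": {"M. Cross": 1.0},
    "Everett": {"M. G. Everett": 0.5, "M. Everett": 0.5},
    "Ullman": {"J. Ullman": 1.0},
    "Aho": {"A. Aho": 0.5, "Alfred V. Aho": 0.5},
    "Sethi": {"R. Sethi": 1.0},
    "Johnson2": {"S. Johnson": 0.9, "Stephen C. Johnson": 0.1},
}


def draw(dist, rng):
    """Sample a key of `dist` with probability proportional to its value."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys], k=1)[0]


def generate_paper(theta_p, n_authors, rng):
    """Return (group, entity, reference) triples for one paper."""
    triples = []
    for _ in range(n_authors):
        z = draw(theta_p, rng)   # sample group from the paper's mixture
        a = draw(PHI[z], rng)    # sample entity from the group
        r = draw(V[a], rng)      # sample observed name from the entity
        triples.append((z, a, r))
    return triples


theta_p5 = {"G1": 0.1, "G2": 0.9}
paper = generate_paper(theta_p5, n_authors=3, rng=random.Random(0))
```

Only the reference strings are observed in real data; inference has to recover the group and entity labels that this sketch generates explicitly.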

Generating References from Entities

Entities are not directly observed
1. Hidden attribute for each entity
2. Similarity measure for pairs of attributes

A distribution over attributes for each entity

Hidden attribute: Stephen C Johnson

Attribute            Probability
S C Johnson          0.2
Stephen C Johnson    0.6
S Johnson            0.2
Alfred Aho           0.0
M. Cross             0.0

ER: Performance for Specific Names

Significantly larger improvements for ‘ambiguous names’

Name         Best F1 for A/A*   F1 for LDA-ER
cho_h        0.80               1.00
davis_a      0.67               0.89
kim_s        0.93               0.99
kim_y        0.93               0.99
lee_h        0.88               0.99
lee_j        0.98               1.00
liu_j        0.95               0.97
sarkar_s     0.67               1.00
sato_h       0.82               0.97
sato_t       0.85               1.00
shin_h       0.69               1.00
veselov_a    0.78               1.00
yamamoto_k   0.29               1.00
yang_z       0.77               0.97
zhang_r      0.83               1.00
zhu_z        0.57               1.00

Simplifying the problem: Entity Identification

Assume database on entities available
o IMDB movie database
o DBLP, PubMed paper database
o Customer databases in companies

#  Movie Name                           Actor          Writer        Genre      Rating
1  Indiana Jones and the Last Crusade   Harrison Ford  George Lucas  Adventure  Excellent
2  American Graffiti                    Harrison Ford  George Lucas  Comedy     Average
3  Star Wars: Return of the Jedi        Harrison Ford  George Lucas  Sci-Fi     Excellent
4  Fugitive                             Harrison Ford  David Twohy   Action     Good

Entity Identification: Still Difficult

• Not enough information to disambiguate

• Noise in entity mentions





American Graffiti : Harrison Ford, George Lucas

Indiana Jones and the Last Crusade : Harrison Ford, George Lucas

Star Wars: Return of the Jedi : Harrison Ford, George Lucas

?

??

Fugitive: Harrison Ford, David Twohy

The Intuition

Categorization and Entity Identification help each other

Classifier predicts additional attributes from document for use in entity identification
o Classifiers for Genre, Rating, Country of the movie …

Entity identification creates labeled data for training the classifier
o Reviews tagged with movies labeled with Genre, Rating, etc

Problem Formulation

[Figure: the movie table (Movie Name, Actor, Writer, Genre, Rating) annotated with columns C, entities E and type column T]

• Structured mentions derived from column values

• Unstructured words determined by type value

Problem: Find the central entity for each document and categorize the documents according to type values

• Unobserved central entity for each document

Formalizing the Intuition

\[
P(d_i \mid \cdot) = P(t_i, e_i, c_i, a_i, w_i \mid \cdot)
  = P(t_i)\, P(e_i \mid t_i)\, P(c_i)\, P(a_i \mid e_i, c_i)\, P(w_i \mid t_i)
\]

Traditional entity identification only considers structured mentions as evidence

Traditional document categorization only considers words as evidence

Here, words suggest type values, and entities relevant for those types get priority

Mentions suggest entities, and type values relevant for those entities get priority
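
A hedged sketch of how the factorization P(t) P(e | t) P(c) P(a | e, c) P(w | t) could be used to score candidate (type, entity) pairs for one document. The dictionary layout, the smoothing constant for unseen mentions and words, and all toy probabilities are assumptions made for this illustration.

```python
import math

SMOOTH = 1e-6  # assumed floor probability for unseen mentions and words


def log_joint(t, e, mentions, words, model):
    """log P(t) + log P(e|t) + sum log P(c) P(a|e,c) + sum log P(w|t)."""
    score = math.log(model["p_t"][t]) + math.log(model["p_e_given_t"][t][e])
    for c, a in mentions:
        score += math.log(model["p_c"][c])
        score += math.log(model["p_a_given_ec"][(e, c)].get(a, SMOOTH))
    for w in words:
        score += math.log(model["p_w_given_t"][t].get(w, SMOOTH))
    return score


def best_assignment(mentions, words, model):
    """Pick the (type, entity) pair with the highest joint score."""
    candidates = [(t, e) for t in model["p_t"] for e in model["p_e_given_t"][t]]
    return max(candidates, key=lambda te: log_joint(te[0], te[1], mentions, words, model))


# Toy model: two movies with identical actor and writer, but different genres.
MOVIES = {"Indiana Jones and the Last Crusade": "Adventure", "American Graffiti": "Comedy"}
model = {
    "p_t": {"Adventure": 0.5, "Comedy": 0.5},
    "p_e_given_t": {g: {m: 1.0} for m, g in MOVIES.items()},
    "p_c": {"Actor": 0.5, "Writer": 0.5},
    "p_a_given_ec": {},
    "p_w_given_t": {"Adventure": {"quest": 0.5, "adventurer": 0.5},
                    "Comedy": {"funny": 0.5, "laughs": 0.5}},
}
for movie in MOVIES:
    model["p_a_given_ec"][(movie, "Actor")] = {"Harrison Ford": 0.7, "Ford": 0.3}
    model["p_a_given_ec"][(movie, "Writer")] = {"George Lucas": 0.6, "Lucas": 0.4}

mentions = [("Actor", "Harrison Ford"), ("Writer", "Lucas")]
words = ["quest", "adventurer"]
best = best_assignment(mentions, words, model)
```

In this toy setting the mentions alone cannot separate the two movies; the genre words tip the joint score, which is the interaction the slide describes.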

Unsupervised EM for Inference

Infer hidden entity and type value from observed words and references for each document

Initialize posteriors using entity references only

Restrict assignment space for tractability

Objective Function

Minimize:

\[
\sum_i \sum_j \; w_A \, sim_A(c_i, c_j) + w_R \, \mathbb{1}_R(c_i, c_j)
\]

o w_A: weight for attributes
o sim_A(c_i, c_j): similarity of attributes
o w_R: weight for relations
o 1_R(c_i, c_j): 1 iff relational edge exists between ci and cj

Greedy agglomerative clustering step: merge cluster pair with max reduction in objective function value

\[
\Delta(c_i, c_j) = w_A \, sim_A(c_i, c_j) + w_R \, |N(c_i) \cap N(c_j)|
\]

o sim_A(c_i, c_j): similarity of attributes
o |N(c_i) ∩ N(c_j)|: common cluster neighborhood

Collective Relational Clustering Algorithm

1. Find similar references using ‘blocking’
2. Bootstrap clusters using attributes and relations
3. Compute similarities for cluster pairs and insert into priority queue
4. Repeat until priority queue is empty
5.    Find ‘closest’ cluster pair
6.    Stop if similarity below threshold
7.    Merge to create new cluster
8.    Update similarity for ‘related’ clusters

O(n k log n) algorithm w/ efficient implementation
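
Steps 3 to 8 can be sketched with a heap and lazy invalidation of outdated entries. This is a naive illustration, not the efficient implementation the slide refers to: blocking and bootstrapping (steps 1 and 2) are skipped, `difflib.SequenceMatcher` stands in for the attribute similarity, the relational term is the raw count of shared neighbor clusters, and the weights, threshold and example names are assumptions.

```python
import heapq
from difflib import SequenceMatcher
from itertools import combinations


def cluster_refs(names, edges, w_a=0.5, w_r=0.5, threshold=0.5):
    clusters = {i: {i} for i in range(len(names))}   # start from singletons
    owner = list(range(len(names)))                  # reference -> cluster id

    def neighbours(c):
        """Clusters that co-occur with cluster c in at least one hyper-edge."""
        return {owner[r] for e in edges if e & clusters[c] for r in e} - {c}

    def sim(ci, cj):
        attr = max(SequenceMatcher(None, names[a], names[b]).ratio()
                   for a in clusters[ci] for b in clusters[cj])
        return w_a * attr + w_r * len(neighbours(ci) & neighbours(cj))

    heap = [(-sim(i, j), i, j) for i, j in combinations(clusters, 2)]
    heapq.heapify(heap)
    while heap:
        neg, i, j = heapq.heappop(heap)
        if i not in clusters or j not in clusters or sim(i, j) != -neg:
            continue                                  # outdated entry, skip it
        if -neg < threshold:
            break                                     # closest pair is too dissimilar
        clusters[i] |= clusters.pop(j)                # merge j into i
        for r in clusters[i]:
            owner[r] = i
        for k in {i} | neighbours(i):                 # refresh 'related' clusters
            for other in clusters:
                if other != k:
                    heapq.heappush(heap, (-sim(k, other), min(k, other), max(k, other)))
    return sorted(sorted(names[r] for r in c) for c in clusters.values())


names = ["Alfred Aho", "S. Johnson", "Alfred Aho", "Stephen Johnson",
         "M. Cross", "S. P. Johnson"]
edges = [{0, 1}, {2, 3}, {4, 5}]                      # co-author hyper-edges
result = cluster_refs(names, edges)
```

In this toy run the two identical Aho references merge first on attributes alone; the shared Aho neighbor then lifts the similarity of "S. Johnson" and "Stephen Johnson" above the threshold, while "S. P. Johnson" stays separate because it has no common neighborhood.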
