Identity Resolution in Email Collections

Tamer Elsayed and Douglas W. Oard

CLIP Colloquium, UMD, Feb 2009

Department of Computer Science, UMIACS, and iSchool


Real Problem

National Archives

Search request: Clinton White House tobacco policy

32 million emails

200,000

80,000

Hired 25 persons for 6 months …


Identity Resolution in Email

Date: Wed Dec 20 08:57:00 EST 2000
From: Kay Mann <[email protected]>
To: Suzanne Adams <[email protected]>
Subject: Re: GE Conference Call has be rescheduled

Did Sheila want Scott to participate? Looks like the call will be too late for him.

"Sheila": WHO?


Enron Collection

-----Original Message-----
From: [email protected]@ENRON
Sent: Monday, July 30, 2001 2:24 PM
To: Sager, Elizabeth; Murphy, Harlan; [email protected]; [email protected]: [email protected]: Shhhh.... it's a SURPRISE !

Message-ID: <1494.1584620.JavaMail.evans@thyme>
Date: Mon, 30 Jul 2001 12:40:48 -0700 (PDT)
From: [email protected]: [email protected]: RE: Shhhh.... it's a SURPRISE !
X-From: Sager, Elizabeth </O=ENRON/OU=NA/CN=RECIPIENTS/CN=ESAGER>
X-To: '[email protected]@ENRON'

Hope all is well.
Count me in for the group present.
See ya next week if not earlier

Please call me (713) 207-5233

Liza

Elizabeth Sager713-853-6349

Hi Shari

Thanks!

Shari

55 Sheila’s !!

weisman, pardo, glover, rich, jones, breeden, huckaby, tweed, mcintyre, chadwick, birmingham, kahanek, foraker, tasman, fisher, petitt, Dombo, Robbins, chang, jarnot, kirby, knudsen, boehringer, lutz, glover, wollam, jortner, neylon, whanger, nagel, graves, mclaughlin, venville, rappazzo, miller, swatek, hollis, maynes, nacey, ferrarini, dey, macleod, howard, darling, watson, perlick, advani, hester, kenner, lewis, walton, whitman, berggren, osowski, kelly

Rank Candidates


Generative Model

1. Choose “person” c to mention

p(c)

2. Choose appropriate “context” X to mention c

p(X | c)

3. Choose a “mention” l

p(l | X, c)

Example: mention “sheila” in the context “GE conference call”
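The three steps above can be read as a factorization of a joint distribution. The block below is a sketch that follows from the chain rule and Bayes' rule under that reading; it is not copied from the slides, and the sum over candidate identities c' is an added notational assumption.

```latex
% Joint probability implied by the three generative steps
p(c, X, l) = p(c)\, p(X \mid c)\, p(l \mid X, c)

% Posterior over identities for an observed mention l in context X
p(c \mid l, X) =
  \frac{p(c)\, p(X \mid c)\, p(l \mid X, c)}
       {\sum_{c'} p(c')\, p(X \mid c')\, p(l \mid X, c')}
```

Ranking candidates by this posterior is one way to make "who is Sheila?" a probabilistic question.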


3-Step Solution

(1) Identity Modeling

(2) Context Reconstruction

(3) Mention Resolution

Posterior Distribution


Outline

Introduction and Approach Overview
Computational Model of Identity
Context Reconstruction
Mention Resolution
Evaluation on Existing Collections
Scalable MapReduce Implementation
New Test Collection
Conclusion and Future Work


“Easy References” of Identity

-----Original Message-----
From: [email protected]@ENRON
Sent: Monday, July 30, 2001 2:24 PM
To: Sager, Elizabeth; Murphy, Harlan; [email protected]; [email protected]: [email protected]: Shhhh.... it's a SURPRISE !

Message-ID: <1494.1584620.JavaMail.evans@thyme>
Date: Mon, 30 Jul 2001 12:40:48 -0700 (PDT)
From: [email protected]: [email protected]: RE: Shhhh.... it's a SURPRISE !
X-From: Sager, Elizabeth </O=ENRON/OU=NA/CN=RECIPIENTS/CN=ESAGER>
X-To: '[email protected]@ENRON'

Hope all is well.
Count me in for the group present.
See ya next week if not earlier

Please call me (713) 207-5233

Liza

Elizabeth Sager713-853-6349

Hi Shari

Thanks!

Shari

Email Standards

Email-Client Behavior

User Regularities


Representational Model of Identity

77,240 “non-trivial” models

[email protected]

14 (Quoted Headers)

sheila glover

932 (Main Headers)

sheila

19 (Salutation)

216 (Signature)

sg

19 (Signature)

sheila glover

1170 (User Name)

Representational Model


Computational Model of Identity

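Variables: c (identity), t (name type), m (observed mention)

$$p(t \mid c) = \frac{\mathrm{freq}(t, c)}{\sum_{t' \in T} \mathrm{freq}(t', c)}$$

$$p(m \mid t, c) = \frac{\mathrm{freq}(m, t, c)}{\sum_{m' \in \mathrm{assoc}(c)} \mathrm{freq}(m', t, c)}$$

$$p(m \mid c) = \sum_{t \in T} p(m \mid t, c)\, p(t \mid c)$$

$$p(c) = \frac{\mathrm{assoc}(c)}{\sum_{c' \in C} \mathrm{assoc}(c')}$$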
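A minimal sketch of how the frequency-based estimates on this slide could be computed. It assumes the identity model is available as a list of (identity, name type, mention) observations; the function names, the input layout, and the reading of assoc(c) as a per-identity count are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict


def build_counts(observations):
    """Count freq(m, t, c) and freq(t, c) from (identity, name_type, mention) triples.

    The triple layout is an assumption made for this sketch.
    """
    freq_mtc = defaultdict(int)  # freq(m, t, c)
    freq_tc = defaultdict(int)   # freq(t, c)
    for c, t, m in observations:
        freq_mtc[(m, t, c)] += 1
        freq_tc[(t, c)] += 1
    return freq_mtc, freq_tc


def p_mention_given_identity(m, c, freq_mtc, freq_tc):
    """p(m | c) = sum over name types t of p(m | t, c) * p(t | c)."""
    total_c = sum(n for (t, cc), n in freq_tc.items() if cc == c)
    if total_c == 0:
        return 0.0
    prob = 0.0
    for (t, cc), n_tc in freq_tc.items():
        if cc != c:
            continue
        p_t = n_tc / total_c                      # p(t | c)
        p_m = freq_mtc.get((m, t, c), 0) / n_tc   # p(m | t, c)
        prob += p_m * p_t
    return prob


def prior(c, assoc):
    """p(c) = assoc(c) / sum of assoc(c'); assoc is a hypothetical {identity: count} map."""
    total = sum(assoc.values())
    return assoc.get(c, 0) / total if total else 0.0
```

For example, an identity observed three times as "sheila glover" in main headers and once as "sheila" in a salutation gets p("sheila" | c) = 0.25 under this sketch.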


Identity Models

Candidates

Likelihood: p(“sheila” | c)


Outline

Introduction and Approach Overview
Computational Model of Identity
Context Reconstruction (current section)
Mention Resolution
Evaluation on Existing Collections
Scalable MapReduce Implementation
New Test Collection
Conclusion and Future Work


Contextual Space

Local Context

Conversational Context

Topical Context


Topical Context

Date: Fri Dec 15 05:33:00 EST 2000
From: [email protected]: vince j kaminski <[email protected]>
Cc: sheila walton <[email protected]>
Subject: Re: Grant Masson

Great news. Lets get this moving along. Sheila, can you work out GE letter?

Vince, I am in London Monday/Tuesday, back Weds late. I'll ask Sheila to fix this for you and if you need me call me on my cell phone.

[email protected]

Date: Wed Dec 20 08:57:00 EST 2000
From: Kay Mann <[email protected]>
To: Suzanne Adams <[email protected]>
Subject: Re: GE Conference Call has be rescheduled

Did Sheila want Scott to participate? Looks like the call will be too late for him.

Terms highlighted in both emails: Sheila, GE, call


Contextual Space

Social Context

Local Context

Conversational Context

Topical Context


Social Context

Date: Wed Dec 20 08:57:00 EST 2000
From: Kay Mann <[email protected]>
To: Suzanne Adams <[email protected]>
Subject: Re: GE Conference Call has be rescheduled

Did Sheila want Scott to participate? Looks like the call will be too late for him.

Date: Tue, 19 Dec 2000 07:07:00 -0800 (PST)
From: [email protected]: [email protected]
Subject: ESA Option Execution

Kay
Can you initial the ESA assignment and assumption agreement or should I ask Sheila Tweed to do it? I believe she is currently en route from Portland.
Thanks, Rebecca

Sheila Tweed

[email protected]

[email protected]


Formally

A context x_k of an email e_i is a probability distribution over emails: $p(e_j \mid x_k(e_i))$

Probability estimated based on type of context

Contextual Space is a linear combination of 4 contexts
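To make the "linear combination of 4 contexts" concrete, here is a small sketch. It assumes each context is given as a dictionary from email id to probability and that the mixing weights are supplied by the caller; the weights, the function name, and the data layout are hypothetical, since the slides do not state how the combination is parameterized.

```python
def combine_contexts(contexts, weights):
    """Mix per-context distributions over emails into one contextual space.

    contexts: {context_name: {email_id: probability}}
    weights:  {context_name: non-negative weight} (assumed, not from the slides)
    Weights are normalized so that the result is again a distribution
    whenever every input context is a distribution.
    """
    total_w = sum(weights[name] for name in contexts)
    combined = {}
    for name, dist in contexts.items():
        w = weights[name] / total_w
        for email_id, p in dist.items():
            combined[email_id] = combined.get(email_id, 0.0) + w * p
    return combined
```

With equal weights, an email that is likely under both the social and the topical context ends up ranked above one that appears in only one of them.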


Context Expansion

topical

time

people

content

social

conversational

local

Temporal similarity affects social and topical similarity


Temporal Similarity

Decay over time
Gaussian and Linear functions
Time difference / Rank
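The two decay shapes named on this slide could look like the sketch below. The parameter names `sigma` and `window`, and the exact functional forms, are assumptions for illustration; `delta` may be either a time difference or a rank difference, matching the "Time difference / Rank" option.

```python
import math


def gaussian_decay(delta, sigma):
    """Similarity that falls off as a Gaussian of the distance delta.

    sigma (assumed parameter) controls how quickly similarity decays.
    """
    return math.exp(-(delta ** 2) / (2 * sigma ** 2))


def linear_decay(delta, window):
    """Similarity that falls linearly to zero at |delta| == window (assumed parameter)."""
    return max(0.0, 1.0 - abs(delta) / window)
```

Both functions return 1.0 for emails sent at the same time and approach (or reach) 0 as the gap grows.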


Social Similarity

Two sets of participants (email addresses)
Binary, Overlap, Jaccard, Both
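A sketch of set-based social similarity between the participants of two emails. The Jaccard coefficient is standard; the definitions used here for "Binary" (any shared participant) and "Overlap" (overlap coefficient) are assumptions, and the "Both" option is left out because the slides do not define it.

```python
def binary_sim(a, b):
    """1.0 if the two participant sets share at least one address, else 0.0 (assumed definition)."""
    return 1.0 if a & b else 0.0


def overlap_sim(a, b):
    """Overlap coefficient: |A ∩ B| / min(|A|, |B|) (assumed definition)."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def jaccard_sim(a, b):
    """Jaccard coefficient: |A ∩ B| / |A ∪ B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

Each function takes two Python sets of email addresses, for example the sender plus recipients of each message.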


Temporal Effect

Temporal Sim, Pure Social Sim → Social Sim → Normalize → Social Context


Topical Similarity

Standard IR Similarity: BM25

Email as a DOCUMENT?
  Subject
  Body (+Subject)
  Root of thread
  Concatenated path to root

Combined similarly with temporal similarity

email

reply / forward
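For reference, a compact sketch of BM25 scoring over tokenized emails. The slides name BM25 but give no variant or parameters, so the defaults `k1=1.2` and `b=0.75` and the smoothed idf used here are assumptions, as is the plain token-list document representation.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs, k1=1.2, b=0.75):
    """Score every document in docs (a non-empty list of token lists) against query_tokens.

    Uses a common BM25 formulation with a smoothed, non-negative idf;
    the parameter defaults are conventional choices, not values from the talk.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    df = Counter()
    for d in docs:
        df.update(set(d))  # document frequency: count each term once per document
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(query_tokens):
            if tf[term] == 0:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Any of the document representations listed above (subject, body, thread root, path to root) can be tokenized and passed in as `docs`.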


Contextual Space (emails)

Social Context

Local Context

Conversational Context

Topical Context


Contextual Space (mentions)

“Sheila”

social

conversational

social

topical

social

topical

topical

“Sheila Tweed”

“sheila”

[email protected]

“sg”

“Sheila Walton”

“Sheila”


Outline

Introduction and Approach Overview
Computational Model of Identity
Context Reconstruction
Mention Resolution (current section)
Evaluation on Existing Collections
Scalable MapReduce Implementation
New Test Collection
Conclusion and Future Work


Mention Resolution

Candidates

Likelihood: p(“sheila” | c)

Goal: estimate p(c | m, X(m)) and rank accordingly

Date: Wed Dec 20 08:57:00 EST 2000
From: Kay Mann <[email protected]>
To: Suzanne Adams <[email protected]>
Subject: Re: GE Conference Call has be rescheduled

Did Sheila want Scott to participate? Looks like the call will be too late for him.

"Sheila": candidates ranked 1, 2, 3, ?


[1] Context-Free Resolution

“Sheila”

social

conversational

social

topical

social

topical

topical

“Sheila Tweed”

“sheila”

[email protected]

“sg”

“Sheila Walton”

“Sheila”

“Sheila”

X

Context-Free Resolution
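Context-free resolution can be sketched as plain Bayes' rule: rank candidates by p(m | c) p(c), ignoring the context X. The dictionary layout and the function name below are assumptions for illustration; the probabilities would come from an identity model such as the one described earlier in the talk.

```python
def context_free_posterior(mention, likelihood, prior):
    """Rank candidates for a mention using only the identity model.

    likelihood: {candidate: {mention: p(m | c)}}  (assumed layout)
    prior:      {candidate: p(c)}
    Returns a list of (candidate, posterior) pairs, best first,
    or an empty list when no candidate can generate the mention.
    """
    scores = {
        c: likelihood[c].get(mention, 0.0) * prior.get(c, 0.0)
        for c in likelihood
    }
    z = sum(scores.values())
    if z == 0:
        return []
    return sorted(
        ((c, s / z) for c, s in scores.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
```

Contextual resolution would then adjust these scores using the resolutions of mentions in related emails.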


[2] Contextual Resolution

“Sheila”

social

social social

topical

“Sheila Tweed”

“sheila”

[email protected]

“sg”

“Sheila Walton”

“Sheila”

Context-Free Resolution


Outline

Introduction and Approach Overview
Computational Model of Identity
Context Reconstruction
Mention Resolution
Evaluation on Existing Collections (current section)
Scalable MapReduce Implementation
New Test Collection
Conclusion


Test Collections

Collection     Emails    Identities  Mention Queries  Candidates (Min. / Avg. / Max.)
Sager          1,628     627         51               1 / 4 / 11
Shapiro        974       855         49               1 / 8 / 21
Enron-subset   54,018    27,340      78               1 / 152 / 489
Enron-all      248,451   123,783     78               3 / 518 / 1785


Evaluation Measures

Commonly used in “known-item” retrieval

Success @1 (i.e., Precision @1)
  One-best

MRR (Mean Reciprocal Rank)
  Inverse of the harmonic mean of the ranks r_i of the true answers

$$\mathrm{MRR} = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{r_i}$$
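Both measures are easy to compute from the rank of the true answer for each mention-query. The sketch below assumes 1-based ranks and that every query has its true answer somewhere in the ranked list; the function names are illustrative.

```python
def success_at_1(ranks):
    """Fraction of queries whose true answer is ranked first (ranks are 1-based)."""
    return sum(1 for r in ranks if r == 1) / len(ranks)


def mean_reciprocal_rank(ranks):
    """MRR = (1/n) * sum of 1/r_i over the n queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)
```

For three queries whose true answers are ranked 1, 2, and 4, Success @1 is 1/3 and MRR is (1 + 1/2 + 1/4) / 3.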


Comparison w/Literature

                MRR                            Success @ 1
Collection      Context Expansion  Lit. Best   Context Expansion  Lit. Best
Sager           0.911              0.889       0.863              0.804
Shapiro         0.913              0.879       0.878              0.779
Enron-subset    0.91               -           0.846              (0.82)
Enron-all       0.89               -           0.821              -

Earlier expansion approach, reported in ACL 2008

Improved expansion

0.87  0.92


Limitations

Resolving single mentions

All mention-queries are sampled from Enron to Enron emails

All mention-queries refer to Enron employees

Small for train/test split

Scalable Implementation for Resolving All Mentions

New Test Collection


Outline

Introduction and Approach Overview
Computational Model of Identity
Context Reconstruction
Mention Resolution
Evaluation on Existing Collections
Scalable MapReduce Implementation (current section)
New Test Collection
Conclusion


Scalable Implementation

Two Bottlenecks:

1. Context expansion of ALL emails
  For each email: ranked list of “Similar” emails

2. Resolution of ALL mentions
  Resolution of one mention depends on resolution of all other mentions in context


Context Expansion of ALL Emails

Goal: For each email: ranked list of “Similar” emails
Need for BOTH social and topical contexts
Efficient implementation

Abstract Problem: Computing Pairwise Similarity


Trivial Solution

Load each vector O(N) times
Load each term O(df_t^2) times

Goal: scalable and efficient solution for large collections


Better Solution

Load weights for each term once
Each term contributes O(df_t^2) partial scores

Each term contributes only if appears in


MapReduce Framework

(a) Map: (k1, v1) → [(k2, v2)]

(b) Shuffle: group values by keys

(c) Reduce: (k2, [v2]) → [(k3, v3)]

The framework handles low-level details transparently


Decomposition

Load weights for each term once
Each term contributes O(df_t^2) partial scores

Each term contributes only if appears in

map

reduce
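One plausible reading of this decomposition, sketched in plain Python rather than on an actual MapReduce cluster: the map step walks each term's postings list once and emits a partial product for every pair of documents sharing that term, and the reduce step sums the partial products per pair. The in-memory postings format, the function names, and the use of an inner product as the similarity are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations


def map_phase(postings):
    """Emit ((doc_i, doc_j), w_i * w_j) for every pair of documents sharing a term.

    postings: {term: [(doc_id, weight), ...]}  (assumed input layout)
    A term with document frequency df yields df * (df - 1) / 2 pairs.
    """
    for term, plist in postings.items():
        for (di, wi), (dj, wj) in combinations(sorted(plist), 2):
            yield (di, dj), wi * wj


def reduce_phase(pairs):
    """Sum the partial scores for each document pair (the shuffle is implicit here)."""
    sims = defaultdict(float)
    for key, partial in pairs:
        sims[key] += partial
    return dict(sims)
```

Calling `reduce_phase(map_phase(postings))` returns a similarity score for every pair of emails with at least one term in common; pairs with no shared term never appear, which is where the savings over the trivial all-pairs solution come from.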


Expansion Using MapReduce

Using generic pairwise-similarity for both topical and social expansion

doc rep.
  topical: body/root/path
  social: participants

time window

rank cut-off

df-cut

temporal sim model

context sim model

context graph


Context Mention-Graph

“Sheila”

social

conversational

social

topical

social

topical

topical

“Sheila Tweed”

“sheila”

[email protected]

“sg”

“Sheila Walton”

“Sheila”

Context-Free Resolution

map

reduce


Resolution System Using MapReduce

[Diagram: system pipeline with stages labeled Preprocessing, Expansion, Resolution, and Packing]

Emails, Threads, Identity Models

Mention Recognition and Prior Computation (Dict., Prior)

Local Expansion, Conv. Expansion, Topical Expansion, Social Expansion

Local Graph, Conv. Graph, Topical Graph, Social Graph

Prior Resolution

Local Resolution, Conv. Resolution, Topical Resolution, Social Resolution

Merging Context Resolutions

Posterior Resolution


Outline

Introduction and Approach Overview
Computational Model of Identity
Context Reconstruction
Mention Resolution
Evaluation on Existing Collections
Scalable MapReduce Implementation
New Test Collection (current section)
Conclusion


New Test Collection

Random sample from CMU-Enron collection
“Annotation + Search” interface available
Total annotators: 3
Annotation time: ~50 hours
Not only resolutions: time, difficulty, confidence, evidence, and comments

Total mention-queries: 584
80% resolvable, 82% of them to Enron domain
Overall inter-annotator agreement: ~81%


Mention-Query Selection


Distribution of Names Based on Resolution

Enron-Resolvable: 390 (66%)
Non-Enron-Resolvable: 80 (14%)
Unresolvable: 114 (20%)
  Probably-Enron: 24 (4%)
  Probably-Non-Enron: 62 (11%)
  Unknown: 28 (5%)


Distribution Based on Difficulty

[Stacked bar chart: share of mention-queries by difficulty, 0% to 100%; counts shown on the bars]

                       Hard   Moderately Hard   Easy
Enron-Resolvable       43     108               239
Non-Enron-Resolvable   8      39                33


Distribution Based on Confidence

[Stacked bar chart: share of mention-queries by annotator confidence, 0% to 100%; counts shown on the bars]

                       Not Confident   Somewhat Confident   Very Confident
Enron-Resolvable       11              43                   336
Non-Enron-Resolvable   2               15                   63
Unresolvable           4               31                   79


Distribution of Time Spent

[Figure: distribution of time spent per mention-query; x-axis: Time (minutes), 0 to 12; y-axis: 0% to 22%; series: Enron-Resolvable, Non-Enron-Resolvable, Unresolvable, each with a fitted trend line; R² values shown: 0.98, 0.95, 0.88]


Evaluation again …

[Bar chart: MRR (0.0 to 1.0) for topical, social, and combined expansion]

MRR                         topical   social   combination
Old Collection: Enron       0.90      0.84     0.92
New Collection: Non-Enron   0.44      0.47     0.59
New Collection: Enron       0.69      0.76     0.80
New Collection: All         0.65      0.71     0.76


Pairwise Agreement

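[Diagram: pairwise agreement among annotators a1, a2, and a3, broken down into Enron-resolvable, Non-enron-resolvable, and Unresolvable mention-queries]

Counts shown: 195, 190, 199, 50, 50, 27, 23

16/16 (100%), 2/4 (50%), 4/7 (57%)
12/16 (75%), 1/5 (20%), 2/2 (100%)
35/38 (92%), 2/2 (100%), 6/10 (60%)
24/27 (89%), 5/12 (42%), 11/11 (100%)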


Individual Annotator Agreement

[Bar chart: agreement (0% to 100%) for annotators a1, a2, and a3 on Enron-Resolvable, Non-enron-Resolvable, and Unresolvable mention-queries]

Values shown: 6/9, 3/9, 28/32, 6/10, 37/40, 2/2, 24/27, 5/12, 11/11


Overall Agreement

[Bar chart: agreement (0% to 100%); bar labels: Enron-Resolvable, Non-enron-Resolvable, Resolvable, Overall, Unresolvable]

Values shown: 0.90, 0.43, 0.81, 0.77


Agreement Based on Difficulty

[Bar chart: agreement (0% to 100%) for Easy versus Non-Easy mention-queries, Enron-Resolvable and Non-Enron-Resolvable]

Values shown: 57/59, 5/7, 30/38, 5/16


Agreement Based on Confidence

[Bar chart: agreement (0% to 100%) for Very-Confident versus Not-Very-Confident annotations, Enron-Resolvable, Non-Enron-Resolvable, and Unresolvable]

Values shown: 74/82, 8/18, 17/22, 13/15, 2/5, 6/8


Conclusion and Future Work

Identity resolution by non-participants is feasible
  And automatic systems for that can be built, ~90-75% accurate

Proposed generative probabilistic model
Context expansion using temporal similarity
Scalable implementation using “Pairwise Sim with MapReduce”

Developed largest test collection for the task
  80% resolvable, 82% of them to Enron employees

Effectiveness scales well to large collections

Efficiency results
Evaluation using double-assessments
Iterative approach for “joint resolution”


Thank You!


Related Work

Diehl et al. (SIAM, 2006)
  Developed Enron-subset collection
  Temporal traffic models
  Candidates must have communicated with sender

Minkov et al. (SIGIR, 2006)
  Developed Sager and Shapiro collections
  Graphical framework
  Large collections?

