
Understanding Email Traffic

Date post: 14-Jul-2015
Upload: david-graus
Understanding email traffic David Graus, University of Amsterdam [email protected] @dvdgrs

Dec. 12, 2014 - Frontiers of Forensic Science 2

Some background…

• PhD candidate at ILPS
  • Information Extraction & Retrieval
• Project in NWO’s Forensic Science program
  • Semantic Search in E-Discovery


Information Retrieval?

• Finding material of unstructured nature from large collections


Information Extraction?

• Text mining
  • Discovering patterns in text data


Semantic Search in E-Discovery?


Semantic Search?


E-Discovery?

• Retrieving and securing digital forensic evidence


E-Discovery

→ Semantic Search in E-Discovery

Semantic Search in E-Discovery

• Supporting search for digital forensic evidence
  • from emails, hard drives, mobile phones, etc.
  • not the open web
  • (Google won’t help us here)

Search in E-Discovery

• Finding out who knew what, from whom, and when
• We don’t know what we’re looking for
• What we’re looking for might be deliberately hidden
• Communication might be very domain-specific, contextualized or incomplete

Approach

• Generic search is not the answer
• Google: high-precision search
• E-Discovery: high-recall & exploratory search

Tasks

• Support iterative search
• Support (re)formulating questions and hypotheses
• Retrieve all relevant traces


Recipient recommendation

• Given a sender, an email, all possible recipients (in an enterprise);
• Predict which recipient(s) are most likely to receive the email


Why?

• Understanding communication in/structure of an enterprise
• Finding “unexpected” communication
• Applications in:
  • enterprise search
  • expert finding
  • community detection
  • spam classification
  • anomaly detection


How?

• Gmail
  • Who do you frequently “co-address”
  • egonetwork
• Related work
  • Social Network Analysis (SNA)
  • Email content
• Us
  • SNA + email content


Part 1: Social Network Analysis?


image by Calvinius - Creative Commons Attribution-Share Alike 3.0


SNA for predicting recipients?

1. Importance of a node in the network
   • Prior probability
   • More important people are more likely to be recipients of an(y) email
2. Connection strength between two nodes
   • Conditional probability
   • Given the sender, nodes strongly associated with that sender are more likely to be recipients


Part 2: Email content

• Statistical Language Models (LMs)
  • Assign a probability to [a sequence of] words
  • By counting words
• Used in lots of places:
  • Web Search
  • Machine Translation
  • Speech Recognition
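As a minimal sketch of "assigning a probability to words by counting words", the Python snippet below builds a maximum-likelihood unigram model. The function names `unigram_lm` and `sequence_probability`, the whitespace tokenization, and the toy emails are illustrative assumptions, not something shown in the talk.

```python
from collections import Counter


def unigram_lm(texts):
    """Estimate P(word) by relative frequency over whitespace-tokenized texts."""
    counts = Counter(word for text in texts for word in text.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def sequence_probability(lm, text):
    """Probability of a word sequence under the unigram model.

    Words the model has never seen get probability 0 here; a real system
    would smooth these estimates.
    """
    prob = 1.0
    for word in text.lower().split():
        prob *= lm.get(word, 0.0)
    return prob


# Toy data, made up for illustration.
emails = ["meeting at noon", "meeting moved to friday"]
lm = unigram_lm(emails)
```

Calling `sequence_probability(lm, "meeting friday")` multiplies the two word probabilities; any unseen word drives the product to zero, which is why practical language models add smoothing.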

Language Models

• Language models as communication “profiles”
  1. Incoming LM (how people talk to user)
  2. Outgoing LM (how user talks to people)
  3. Interpersonal LM (how node1 talks with node2)
  4. Corpus LM (how everyone talks)


Why language models?

• Comparisons between communication profiles:
  • Find nodes with most similar communication


Model

• Given sender and email, predict recipients
• Ranking function, built from three components:
  • Email likelihood P(E|R,S): estimated using language modeling
  • Sender likelihood P(S|R): using SNA to estimate closeness of R and S
  • Recipient likelihood P(R): using SNA to estimate importance of R
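
One way the email likelihood, sender likelihood, and recipient likelihood could combine into a single ranking score, assuming a Bayes-rule factorization in which the normalizing term P(S, E) is dropped because it is the same for every candidate recipient R. This is a sketch under that assumption, not necessarily the exact formula from the slide:

```latex
P(R \mid S, E) \;\propto\; P(E \mid R, S) \cdot P(S \mid R) \cdot P(R)
```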


Email likelihood

• P(word|R,S): interpersonal LM
• P(word|R): recipient LM
• P(word): corpus LM
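
A sketch of how the three word probabilities could be mixed into one email likelihood, assuming linear interpolation with fixed weights. The interpolation form, the placeholder weights, and every name in the snippet are assumptions made for illustration.

```python
import math


def email_log_likelihood(words, interpersonal_lm, recipient_lm, corpus_lm,
                         weights=(0.6, 0.2, 0.2)):
    """Log P(E|R,S) as a sum of per-word log mixture probabilities.

    Each *_lm is a dict mapping word -> probability. The weights are
    illustrative mixing weights and must sum to 1.
    """
    w_rs, w_r, w_c = weights
    log_prob = 0.0
    for word in words:
        p = (w_rs * interpersonal_lm.get(word, 0.0)
             + w_r * recipient_lm.get(word, 0.0)
             + w_c * corpus_lm.get(word, 0.0))
        if p == 0.0:
            return float("-inf")  # word unseen even in the corpus model
        log_prob += math.log(p)
    return log_prob


# Toy models, made up for illustration.
interpersonal = {"budget": 0.5, "report": 0.5}
recipient = {"budget": 0.25, "report": 0.25, "lunch": 0.5}
corpus = {"budget": 0.1, "report": 0.1, "lunch": 0.2, "the": 0.6}

score = email_log_likelihood(["budget", "report"], interpersonal, recipient, corpus)
```

Working in log space avoids numerical underflow on long emails, and the corpus model acts as a fallback for words the sender and recipient have never exchanged.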

Sender Likelihood P(S|R): strength of connection between two nodes

1. Number of emails sent between nodes
2. Number of times two nodes are addressed together

Recipient Likelihood P(R): importance of node

1. Number of emails received
2. PageRank score
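
The count-based signals can be turned into probabilities in a few lines. The sketch below covers only "number of emails received" for P(R) and "number of emails sent between nodes" for P(S|R); PageRank and co-addressing counts are left out. The normalization choices, function names, and toy data are assumptions for illustration.

```python
from collections import Counter


def recipient_prior(emails):
    """P(R): share of all received emails that went to each recipient."""
    received = Counter(r for _, recipients in emails for r in recipients)
    total = sum(received.values())
    return {node: n / total for node, n in received.items()}


def sender_likelihood(emails, sender, recipient):
    """P(S|R): share of the recipient's incoming emails that came from sender."""
    from_sender = sum(1 for s, rs in emails if s == sender and recipient in rs)
    incoming = sum(1 for _, rs in emails if recipient in rs)
    return from_sender / incoming if incoming else 0.0


# Toy log of (sender, recipients) pairs, made up for illustration.
emails = [
    ("alice", ["bob", "carol"]),
    ("alice", ["bob"]),
    ("carol", ["bob"]),
    ("bob", ["alice"]),
]
prior = recipient_prior(emails)
```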


SNA

1. Importance of a node in the network

2. Strength of connection between nodes

Email Content

1. Interpersonal LM
2. Recipient LM
3. Corpus LM


Approach: time-based

Timeline:

• Training period: build models (SNA + LM)
• Testing period: predict recipients
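
A time-based split can be sketched as follows, assuming each email carries a `sent` timestamp and a single cutoff separates the two periods. The field name, the cutoff date, and the toy emails are hypothetical.

```python
from datetime import datetime


def split_by_time(emails, cutoff):
    """Split emails into (training, testing) around a cutoff timestamp.

    Emails sent before the cutoff are used to build the models; emails sent
    at or after it are held out for recipient prediction.
    """
    training = [e for e in emails if e["sent"] < cutoff]
    testing = [e for e in emails if e["sent"] >= cutoff]
    return training, testing


# Toy emails, made up for illustration.
emails = [
    {"id": 1, "sent": datetime(2001, 1, 5)},
    {"id": 2, "sent": datetime(2001, 3, 9)},
    {"id": 3, "sent": datetime(2001, 6, 1)},
]
training, testing = split_by_time(emails, cutoff=datetime(2001, 4, 1))
```

Splitting on time rather than at random keeps the models from seeing communication that happened after the emails they are asked to predict.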


Testing

• Remove recipients from email
• Rank all nodes in the network, by computing:
  1. P(E|R,S): Similarity between sender and candidate LMs
  2. P(S|R): Strength of connection between sender and candidate
  3. P(R): Importance of candidate
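
The ranking step can be sketched as below, assuming the three components are multiplied with equal weight (summed in log space) and that each has already been computed per candidate. The function name, the equal weighting, and the toy scores are assumptions for illustration.

```python
import math


def rank_candidates(candidates, email_loglik, sender_lik, recipient_prior):
    """Rank candidates by log P(E|R,S) + log P(S|R) + log P(R), best first.

    Each argument after `candidates` is a dict mapping candidate -> score.
    Candidates with a zero probability in any component rank last.
    """
    def score(r):
        p_s, p_r = sender_lik.get(r, 0.0), recipient_prior.get(r, 0.0)
        if p_s <= 0.0 or p_r <= 0.0:
            return float("-inf")
        return email_loglik.get(r, float("-inf")) + math.log(p_s) + math.log(p_r)

    return sorted(candidates, key=score, reverse=True)


# Toy component scores, made up for illustration.
candidates = ["bob", "carol", "dave"]
email_loglik = {"bob": -4.0, "carol": -3.0, "dave": -2.0}
sender_lik = {"bob": 0.5, "carol": 0.1, "dave": 0.0}
recipient_prior = {"bob": 0.5, "carol": 0.3, "dave": 0.2}

ranking = rank_candidates(candidates, email_loglik, sender_lik, recipient_prior)
```

In this toy case "dave" has the best content match but has never received mail from the sender, so the network signal pushes him to the bottom of the ranking.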


Findings: What works?

• Importance of node:
  • Number of received emails of node
  • PageRank
• Strength of connection:
  • Number of emails between nodes
  • Number of times co-addressed
• LM similarity: Interpersonal LM is most important (60%-20%-20%)


Analysis: SNA vs email content

• SNA:
  • SNA signals deteriorate over time
  • SNA signals are most informative on highly active users
• Email content:
  • LM signal improves over time
  • LM signal does worse with highly active users


Finally

• Combining Social Network Analysis with Language Modeling works better than using either alone.


Future work

• Consider structure of network in more detail
  • Departments?
  • Friends/family?
• Include ‘time decay’
• Dynamically weight LM/SNA?


Applications in E-Discovery/Digital Forensics

• Anomaly detection
  • Given a working prediction model, identify “unexpected” communication
• Language models for communication
  • For a node, find the most different interpersonal communication
    • Friends/family vs colleagues?
  • Find communication that differs from the corpus-based communication
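
One way a working prediction model could flag "unexpected" communication, assuming an email counts as unexpected when an actual recipient falls outside the model's top-k predictions. The threshold `top_k`, the function name, and the toy data are hypothetical choices, not something specified in the talk.

```python
def unexpected_recipients(ranking, actual_recipients, top_k=3):
    """Return actual recipients that the model did not place in its top_k.

    `ranking` is the model's candidate list, best first. Recipients missing
    from the ranking entirely are also reported.
    """
    expected = set(ranking[:top_k])
    return [r for r in actual_recipients if r not in expected]


# Toy prediction and ground truth, made up for illustration.
ranking = ["bob", "carol", "dave", "erin", "frank"]
flagged = unexpected_recipients(ranking, ["bob", "frank"], top_k=3)
```

A flagged recipient is only a lead for an investigator to look at, since a low rank can also reflect a new colleague or a gap in the training period.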


Fin

• Questions?