* Publically available em
ail dataset
PERSONALIZED COMMUNITY DETECTION USING COLLABORATIVE
SIMILARITY MEASURE
Presenter: Waqas Nawaz
5th International Conference on Data Mining and Intelligent Information Technology ApplicationsICMIA2013, June 18 - 20, 2013, Jeju Island, Republic of Korea
Waqas Nawaz, Yongkoo Han, Kifayat-Ullah Khan, Young-Koo Lee
The Case of Enron* Email Database
Data and Knowledge Engineering Lab,Department of Computer Engineering, Kyung Hee University,
Yongin-si, Gyeonggi-do 446-701, Republic of Korea
2
AGENDA
Motivation Problem Statement Related Work Proposed Idea
System Architecture Preprocessing Partitioning
Applications References
3
MOTIVATION
Email is an asynchronous and most prevalent means of communication among others (vocal, visual)
Users: Worldwide email service providers namely Hotmail, Yahoo and Gmail has 330, 302 and 193 million users respectively in 2009-11 [1]
Communication: Yahoo email network deliver around 38 thousand emails per second Globally[2]
It contains significant information: sender, receivers, contents, time etc.
to depict one’s communication behavior or pattern
4
PERSONALIZATION
Personal Interaction Network (Net) from Host’s (user ‘A’) Repository
A
B
@B
C
@
@
C
F
G
@G
G G
72
4
5Host A
B@
BC
@
@
E
D
@D
B C
42
3
5Host
User ‘A’ as Sender User ‘A’ as Receiver
5
PROBLEM STATEMENT FORMULATION (1/2)
Email provides a sophisticated bridge among individuals or groups to interact and share useful information
Is there any possibility to identify the groups having similar communication pattern?
Can we partition the individuals or sources separately with irregular activity?
Grouping the strongly connected individuals with analogous interaction patterns (behaviors) is admissible?
6
PROBLEM STATEMENT FORMULATION (2/2)
By Using Personalized Information (Restricted Information) Find the K group of users Application: Email Perspective
User Point of View Automatic email group prediction or management Email Strainer (e.g. Assist spam filtration )
Service Provider Point of View Efficiently anticipate the global network structure
using personalized information
7
RELATED WORK In literature Email database has been utilized to:
(CEAS-2005) Detect Worm Propagation: Statistically Analyze features to classify normal (legitimate) and abnormal (worms affected) emails [4]
(KDD-2005) Predict Organizational Structure Discover of important nodes (Email users) to predict hidden organizational structure [6] [7]*
(CEAS-2007) Characterization of Email: Measure the flow and frequency of user email toward the identification of communities of interest [12]
(ACM-2008*) Predict Organizational Structure Classify emails within the organization to determine its structure [10]
(ICADIWT-2008) Determine Popularity of Individuals in Community: Find popularity of individuals based on ranking strategy (Page Rank) and evaluated using organization hierarchy [11]
(CEC*-2009) Detect Email Communication Networks: Detect key users based on Evolutionary approach using SNA [8]
(KDD-2009) Email Prioritization: Personalize email prioritization based on social network analysis [5]
(IEEE*-2010) Identify Hidden Relationships: Predict sensitive relationships based on frequency of mutual private communication [9]
* BM
EI-2
01
0
8
PROBLEM STATEMENT INVESTIGATION
Key factors: Users Interaction
Email exchange (bi-directional) Behavioral Patterns
Email Metadata or Properties Sender, Receiver Time stamp (granularity? Day, Week, Month, Year) content's information
Subject, Body, Attachment(type, size , count) etc. Strength (Frequency: no. of sent and received)
9
PROPOSED ARCHITECTURE Input: Email Data Source Output: User Grouping Two Stages:
Preprocessing Partitioning
10
PROPOSED SOLUTION (1/5) Preprocessing – Emails Network to Graph
Mapping Email address Nodes or Vertices Email Meta Data Vertex Attributes No. of Emails Weights on edges
11
PROPOSED SOLUTION (2/5) Preprocessing – Behavioral Feature Extraction
Extracted Information (Email Metadata) Avg. No. of Emails (sent and received)
Avg. no. of recipients in To, Cc, and Bcc Ratio of sent and received emails Min, Max and Avg. attachments and size Dominating type of attachment
Power point, PDF, Word, Excel … Time stamp (Granularity Daily, Weekly, Monthly)
Morning, Noon, After-Noon, Evening, Night, Late-Night Avg. length or size of email subject and body Ratio of emails with attachments
Behavioral Feature Set
12
PROPOSED SOLUTION (3/5) Preprocessing – Feature Scaling
We have scale down or mapped related set of features from real space nominal space ID space for time efficiency (bypassing the extra computation for clustering)
Parallel coordinates are utilized
13
Morning <4am-
12pm>; 7
After Noon
<12pm-4pm>; 4
Evening <4pm-
8pm>; 4
Night <8pm-
12am>; 4
Late Night
(12am-4am); 4
PROPOSED SOLUTION (4/5) Preprocessing – Time Feature Scaling
First postulate the granularity level Day, Week, Month etc.
Communication Time Slices (CTS) Splitting the complete day (days level, real space) into
mutually exclusive time intervals and assign labels (nominal space) accordingly
14
PROPOSED SOLUTION (5/5) Partitioning – Intra-Graph Clustering*
1. Collaborative Similarity Measure (CSM)
2. K-Medoid Clustering
Collaborative Similarity = CSimሺ𝑣𝑎,𝑣𝑏ሻ= =
ە۔
𝛼∗𝑆𝐼𝑀ሺ𝑣𝑎,𝑣𝑏ሻ𝑠𝑡𝑟𝑢𝑐𝑡ۓ + (1− 𝛼) ∗𝑆𝐼𝑀ሺ𝑣𝑎,𝑣𝑏ሻ𝑐𝑜𝑛𝑡𝑒𝑥𝑡 , 𝑣𝑎 ↔ 𝑣𝑏
ෑ� 𝐶𝑆𝑖𝑚൫𝑣𝑝𝑖 ,𝑣𝑝𝑖+1൯𝑞
𝑖=1 , 𝑣𝑎 ⋯𝑣𝑏, 𝑣𝑝 𝑖𝑠 𝑜𝑛 𝑝𝑎𝑡ℎ 𝑣𝑎 𝑎𝑛𝑑 𝑣𝑏 (1− 𝛼) ∗𝑆𝐼𝑀ሺ𝑣𝑎,𝑣𝑏ሻ𝑐𝑜𝑛𝑡𝑒𝑥𝑡 , 𝑜𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒
𝑆𝐼𝑀ሺ𝑣𝑎,𝑣𝑏ሻ𝑠𝑡𝑟𝑢𝑐𝑡𝑤𝑒𝑖𝑔ℎ𝑡𝑒𝑑 = ቐ
a2𝑁𝑎 ∩ 𝑁𝑏a2a2𝑁𝑎 ∪ 𝑁𝑏a2∗(𝑤𝑎𝑏), 𝑣𝑎 ↔ 𝑣𝑏 0, 𝑜𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒
𝑆𝐼𝑀ሺ𝑣𝑎,𝑣𝑏ሻ𝑐𝑜𝑛𝑡𝑒𝑥𝑡𝑤𝑒𝑖𝑔ℎ𝑡𝑒𝑑 = ς (𝑤𝑎𝑖)𝑀𝑖=1, 𝑣𝑎&& 𝑣𝑏←𝑎𝑖 ς (𝑤𝑎𝑗)𝑀𝑗=1, 𝑣𝑎| | 𝑣𝑏←𝑎𝑗 , 𝑣𝑎 ↔ 𝑣𝑏 | | 𝑣𝑎 ⋯𝑣𝑏
*Publish
ed in
DA
SFA
A W
orksh
op 2
01
2
15
APPLICATIONS
Automatically formed group of users with identical patterns can be applied to data (email) network: Part of security mechanism to assist in
detection of abnormal activity (e.g. spam filtering)
Known fraudulent accounts(individual or group) can facilitate to determine other fraudulent users
Identify the source of virus or infected group Assist Email Management System (Grouping
or Filtering) E.g. auto generated mailing list Automatic group level email prioritization
Community analysis or formation based on analogous communication patterns
16
REFERENCES
1. http://www.email-marketing-reports.com/metrics/email-statistics.htm
2. http://visualize.yahoo.com/3. CEAS 2005 – Second Conference on Email and Anti-Spam, July
21-22, 2005, Stanford University, California, USA, 2005.4. Steve Martin, Anil Sewani et. al., “Analyzing Behavioral
Features for Email Classification”, In CEAS [3].5. Shinjae Yoo et. al., “Mining Social Networks for Personalized
Email Prioritization”, KDD June 28 – July 1. 2009.6. J. Shetty, and J. Adibi, “Discovering Important Nodes through
Graph Entropy The Case of Enron Email Database”, KDD, Chicago, Illinois, 2005.
7. H. Yang et. al., “Discovering Important Nodes through Comprehensive Assessment Theory on Enron Email Database”, IEEE 3rd International Conference on BMEI, 2010.
8. G. Wilson and W. Banzhaf, “Discovery of Email Communication Networks from the Enron Corpus with Genetic Algorithm using SNA”, IEEE Congress on Evolutionary Computation, 2009.
17
REFERENCES (CONT…)
9. Hanhe Lin, “Predicting Sensitive Relationships form Email Corpus”, IEEE 4th Int. Conf. on Genetic and Evolutionary Computing, 2010.
10. K. Yelupula et. al., “Social Network Analysis for Email Classification”, ACM – SE , 28 – 29 March. 2008.
11. Anton Timofieiev et. al., “Social communities detection in Enron Corpus using h-Index”, IEEE ICADIWT, 4 – 6 Aug. 2008.
12. L. Johansen, M. Rowell, K. Butler, and P. McDaniel. “Email Communities of Interest”, In 4th Conference on Email and Anti-Spam (CEAS), Mountain View, CA, July 2007.
18
RELATED WORK (1/11)
In literature Email database has been utilized to: (CEAS-2005) Detect Worm Propagation: Statistically
Analyze features to classify normal (legitimate) and abnormal (worms affected) emails [4] Many researchers examine incoming email exclusively, which
does not provide detailed information about individual user’s behavior; Outgoing emails need to be utilized
Features extracted from outgoing emails to distinguish between normal and abnormal email activity
Examined the contributions of these features to their classification or the sensitivity of their model to those feature
Features includes presence of HTML, script tags, embedded images, hyperlinks, types of attachments, no of sent, etc.
Application: Novel worm detection based on SVM and Naïve Bayes classifiers.
19
RELATED WORK (2/11)
In literature Email database has been utilized to: (CEAS-2005) Detect Worm Propagation:
Statistically Analyze features to classify normal (legitimate) and abnormal (worms affected) emails [4]
20
RELATED WORK (3/11) (KDD-2005) Predict Organizational Structure
Discover of important nodes (Email users) to predict hidden organizational structure [6] [7] Event based entropy to find influential users Focused on text mining and natural language
processing
21
* BM
EI-2010
RELATED WORK (4/11) (KDD-2005) Predict Organizational Structure
Discover of important nodes (Email users) to predict hidden organizational structure [6] [7]*
22
RELATED WORK (5/11)
(CEAS-2007) Characterization of Email: Measure the flow and frequency of user email toward the identification of communities of interest [12]
A large number of communications indicates as association whereas a fewer number of communication does not
Frequency Thresholding
k-means Clusterings of Inbound weightsEmail communications between users
23
* 28-29 March
RELATED WORK (6/11)
(ACM-2008*) Predict Organizational Structure Classify emails within the organization to determine its structure [10] The analysis of email flows within the
organization by neglecting email contents Ranking individuals based on no. of emails sent
and received Algorithm steps
1. Get the statistical view of email conversations2. WEKA tool (k-Means) is utilized for clustering of
organization3. Validation of the results with original organization
structure
24
* 28-29 March
RELATED WORK (7/11)
(ACM-2008*) Predict Organizational Structure Classify emails within the organization to determine its structure [10]
25
* ICA
DIW
T, 4 – 6 Aug
RELATED WORK (8/11)
(IEEE*-2008) Determine Popularity of Individuals in Community: Find popularity of individuals based on ranking strategy (Page Rank) and evaluated using organization hierarchy [11] Detect abnormal activities Topologically identifying the communities
(triangle estimation measure the density of triangles in the network)
Importance of individual is estimated using h-index Organization hierarchy comparison for analysis
26
RELATED WORK (9/11) (CEC*-2009) Detect Email Communication
Networks: Detect key users based on Evolutionary approach using SNA [8] Detect important actors w.r.t email transactions Topological metrics used for network analysis
Degree (mean) Density Proximity Prestige
* IEEE C
ongre
ss on E
volu
tionary
Com
puta
tion
Hypothetical Networks
27
RELATED WORK (10/11) (KDD-2009) Email Prioritization: Personalize
email prioritization based on social network analysis [5] Automatically determine the importance level of
non-spam email messages Topological features
In/out/total degree centrality, Clustering coefficient, Clique count, Betweenness centrality
An example of bipartite email network: circles are contact persons and rectangles are email messages. Some email messages have human-assigned importance values but others do not. The network enables us to propagate the partially available importance values from messages to persons, and vice versa
28
RELATED WORK (11/11) (IEEE*-2010) Identify Hidden Relationships:
Predict sensitive relationships based on frequency of mutual private communication [9] Relationship means belonging to same
department, workgroup, or being friends(sensitive)
Email network modeled as directed weighted graph
Two Steps1. Counting mutual privacy communication among
users2. Evaluate the cluster factor information between
every pair of email users
* 4th
Int. C
onf. o
n G
enetic a
nd E
volu
tionary
C
om
putin
g