Thesis Proposal:
A Quantitative Analysis of the Spread of Information in Social Networks
Joshua S. White
Advisor: Jeanna N. Matthews, PhD
03/05/13
Outline
• Problem
• To Date
• Recently Completed Work
• Current Work
• Inspiration
• Unanswered Questions
• Current Tool Kits
• Our Approach
• Schedule of Completion
Problem
• Social media networks are the fastest-growing category of Internet content today and make up its largest portion
• These networks have only recently (2010-Present) been studied in any detail
• Most work has sampled small portions of the network and tried to predict outcomes (for example, in politics)
1. Tom Pick. "102 Compelling Social Media and Online Marketing Stats and Facts for 2012 (and 2013)." Business 2 Community. January 2, 2013
Problem (continued)
[Figure: ACM Digital Library Search Results (Sampled Dec, 2012 - Total = 20796), by topic]
• Social Networks and Political Analysis
• Using Social Networks as Datasets for Machine Learning
• Actor Types in Any Network
• Social Network Graphing
• Malware and Social Networks
• Social Network Memes
• Botnets and Social Networks
• Individuals' Influence on Social Networks
• Social Network Analysis Tools
• Actor Types in Twitter
To Date: Coalmine
• The basis for a Social Network Analysis Tool
Coalmine: an experience in building a system for social media analytics. JS White, JN Matthews, JL Stacy. SPIE Defense, Security, and Sensing, 84080A-84080A-11
To Date: Coalmine
• Coalmine
– Method scales well based on initial tests
– Manual and automated detection
– Configurable data collection capabilities
– Trial and error filter design tool
• At the Time (Major Future Work)
– Rebuild of the tool:
• Fix scaling limitations
• More extensible Map/Reduce method
– Solve map-piping issue
• Inclusion of multi-job support
• New storage and distribution method
– Solve replication and state issues
Coalmine: Data Set Overview
• Over the course of 2012 we collected 165 TB of Twitter data (uncompressed)
– 147 “Full Days”, 100 “Partial Days”
• Estimated 65 Billion Tweets
– Twitter traffic at est. 175 million tweets per day in 2012
– Collection rates between 50% and 80% for “Full Days”
– Data in JSON format using Twitter's REST API
1. Shea Bennett. "Just How Big Is Twitter In 2012 [INFOGRAPHIC]," All Twitter - The Unofficial Twitter Resource, February 2013
Coalmine: Data Set Overview
• Basic observable patterns
– Twitter has a lot of outages
– Posting rates follow predictable patterns
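A minimal sketch of how posting rates could be tallied from the collected JSON, not the Coalmine implementation itself. It assumes one tweet object per line and a `created_at` field in the format used by Twitter's v1.1 API (for example "Mon Mar 05 13:08:45 +0000 2012"); the function name `tweets_per_hour` is hypothetical.

```python
import json
from collections import Counter
from datetime import datetime

# Assumed timestamp layout of the `created_at` field.
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def tweets_per_hour(lines):
    """Count tweets per hour of day (in the timestamp's own offset)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or corrupt records
        if not isinstance(tweet, dict) or "created_at" not in tweet:
            continue  # skip non-tweet messages such as delete notices
        stamp = datetime.strptime(tweet["created_at"], CREATED_AT_FORMAT)
        counts[stamp.hour] += 1
    return counts

sample = [
    '{"created_at": "Mon Mar 05 13:08:45 +0000 2012", "text": "a"}',
    '{"created_at": "Mon Mar 05 13:59:01 +0000 2012", "text": "b"}',
    '{"created_at": "Mon Mar 05 14:00:00 +0000 2012", "text": "c"}',
    '{"delete": {"status": {"id": 1}}}',
    'not json',
]
print(sorted(tweets_per_hour(sample).items()))
```

Plotted over many days, hours with unusually low counts would be candidates for outages, while the recurring shape of the curve reflects the predictable posting pattern.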
To Date: Phishing Analysis
A method for the automated detection phishing websites through both site characteristics and image analysis. JS White, JN Matthews, JL Stacy. SPIE Defense, Security, and Sensing, 84080B-84080B-11
• pHash Process:
– Reduce image size to 32 x 32 pixels
– Reduce the color to greyscale
– Calculate the DCT (creates frequency scalars)
– Reduce the DCT to its 8 x 8 block
– Second DCT reduction: set each bit to 1 or 0 depending on whether the value is above or below the average DCT value
– Take the hash
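The steps above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the code used in the paper: the image is assumed to be a list of rows of (r, g, b) tuples, resizing is nearest-neighbour, the DCT is an unnormalised DCT-II, the 8 x 8 block kept is assumed to be the lowest-frequency corner, and all function names are hypothetical.

```python
import math

def to_grey(pixel):
    """Convert an (r, g, b) tuple to luminance; pass grey values through."""
    if isinstance(pixel, (int, float)):
        return float(pixel)
    r, g, b = pixel
    return 0.299 * r + 0.587 * g + 0.114 * b

def resize_nearest(image, size=32):
    """Nearest-neighbour resize of a list-of-rows image to size x size."""
    h, w = len(image), len(image[0])
    return [[image[y * h // size][x * w // size] for x in range(size)]
            for y in range(size)]

def low_freq_dct(grey, keep=8):
    """Return the keep x keep lowest-frequency 2D DCT-II coefficients."""
    n = len(grey)
    # Cosine basis for the first `keep` frequencies only.
    cos = [[math.cos(math.pi * (2 * x + 1) * u / (2 * n)) for x in range(n)]
           for u in range(keep)]
    # Separable transform: along rows first, then along columns.
    rows = [[sum(grey[y][x] * cos[v][x] for x in range(n))
             for v in range(keep)] for y in range(n)]
    return [[sum(rows[y][v] * cos[u][y] for y in range(n))
             for v in range(keep)] for u in range(keep)]

def phash(image):
    """64-bit perceptual hash: resize, greyscale, DCT, threshold on mean."""
    small = resize_nearest(image, 32)
    grey = [[to_grey(p) for p in row] for row in small]
    coeffs = [c for row in low_freq_dct(grey, 8) for c in row]
    mean = sum(coeffs) / len(coeffs)
    bits = 0
    for c in coeffs:
        bits = (bits << 1) | (1 if c > mean else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

# A synthetic 32 x 32 image and a 2x enlarged copy of it.
base = [[((x * 7) % 256, (y * 5) % 256, ((x + y) * 3) % 256)
         for x in range(32)] for y in range(32)]
enlarged = [[base[y // 2][x // 2] for x in range(64)] for y in range(64)]
print(hamming(phash(base), phash(enlarged)))
```

Two screenshots would then be compared by the Hamming distance between their hashes, with a small distance suggesting visually similar pages. Many pHash implementations exclude the first (DC) coefficient from the average; the sketch follows the slide and averages all 64 values.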
To Date: Phishing Analysis
Results:
• Two Methods:
– Page characteristic analysis
– Image similarity analysis
• Proof of concept system
• Need for a generically customizable filter
Recently Completed Work:
• BEK (Blackhole Exploit Kit) Infection Vector Analysis
– Finished development of a filter for detection of suspect accounts
• Submitted to IEEE CNS (Conference on Communications and Network Security)
– “It's you on photo?: Automatic Detection of Twitter Accounts Infected With the Blackhole Exploit Kit”
Recently Completed Work:
[Figure panels: Normal, Infectious]
Current Work:
• KONY2012 Meme Analysis
– Finished extraction of relevant data, identification of tag variants, directed graphs of information flow
• Preparing for submission to ASONAM (Advances in Social Networks Analysis and Mining)
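A rough sketch of what tag-variant identification and a directed information-flow graph might look like, not the analysis actually performed. It assumes tweets are dictionaries shaped like Twitter v1.1 JSON (`text`, `user.screen_name`, and `retweeted_status` on retweets); the regular expression and the function names are hypothetical choices for illustration.

```python
import re
from collections import Counter

# Assumed definition of a "variant": any hashtag containing "kony".
TAG_PATTERN = re.compile(r"#(\w*kony\w*)", re.IGNORECASE)

def tag_variants(tweets):
    """Count each lowercased hashtag variant that contains 'kony'."""
    counts = Counter()
    for tweet in tweets:
        for tag in TAG_PATTERN.findall(tweet.get("text", "")):
            counts[tag.lower()] += 1
    return counts

def retweet_edges(tweets):
    """Weighted directed edges (original author -> retweeter)."""
    edges = Counter()
    for tweet in tweets:
        original = tweet.get("retweeted_status")
        if original:
            source = original["user"]["screen_name"]
            target = tweet["user"]["screen_name"]
            edges[(source, target)] += 1
    return edges

sample = [
    {"user": {"screen_name": "a"}, "text": "Stop him #KONY2012"},
    {"user": {"screen_name": "b"}, "text": "RT @a: Stop him #KONY2012",
     "retweeted_status": {"user": {"screen_name": "a"}}},
    {"user": {"screen_name": "c"}, "text": "#StopKony now #kony12"},
]
print(tag_variants(sample))
print(retweet_edges(sample))
```

The edge direction follows the flow of information: an edge from `a` to `b` means content authored by `a` was passed on by `b`.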
Current Work:
• Actor Types Analysis
– Literature review completed, started identifying statistical and temporal characteristics of each type
• Planned for submission to LEET'13 (Large-Scale Exploits and Emergent Threats)
Inspiration
• Our work was inspired in part by Malcolm Gladwell’s book, The Tipping Point
– Life as an epidemic
• Thinking this way led us to consider the spread of information and trends in terms of an outbreak for which key people (Mavens, Connectors, and Salesmen) are primarily responsible.
1. Gladwell, M. (2000). The tipping point. Boston: Little, Brown and Company.
Some Unanswered Questions
• Automatic classification of actor types in social networks
– Do Gladwell's classifications apply?
• Connectors, Mavens, and Salesmen
– Who are the opinion leaders?
• Privacy-related implications of social network analysis
• Do social networks have the level of impact on public opinion/mass media that some believe?
– Can we predict changes in the public's or individuals' opinions using social network datasets as a base?
– Can we predict how memes/news will spread?
– Are individuals covertly manipulating mass media through social networks?
• Is there a generally applicable way to identify major events like natural disasters as they happen?
Current Tool Kits
• Tool Kits and Methods:
– Only one well-developed tool kit:
• NodeXL
– Small datasets (under 5000 nodes)
– Built-in statistics and data collection capabilities
– Built on MS Excel
– Allows exploration of group relationships
– Highest usage seems to be for politics-related research
1. Smith, M., Milic-Frayling, N., Shneiderman, B., Mendes Rodrigues, E., Leskovec, J., Dunne, C. (2010). NodeXL: a free and open network overview, discovery and exploration add-in for Excel 2007/2010, http://nodexl.codeplex.com/ from the Social Media Research Foundation, http://www.smrfoundation.org
Approach
• Borrow from traditional “Social Network Analysis” as it relates to the study of Sociology
• Most tools can't handle extremely large datasets
– We employ the MapReduce methodology as our core for data analysis
• Treat the analysis system like a filtering system and build “rules” for how the data should be processed
• Each rule is essentially constrained to a single Mapper
• Use case studies based on available data to develop individual statistics and rules
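The idea of one rule per Mapper can be illustrated with a small single-process sketch. This is not the Coalmine code: the rule names (`hashtag_rule`, `link_rule`), the simplified record layout (`user` as a plain string), and the in-memory `run_rule` driver are all assumptions made for illustration, standing in for a real MapReduce framework.

```python
from collections import defaultdict

def hashtag_rule(tweet):
    """Hypothetical rule: emit (hashtag, 1) for each hashtag in the text."""
    for word in tweet.get("text", "").split():
        tag = word.strip(".,!?:;").lower()
        if tag.startswith("#") and len(tag) > 1:
            yield tag, 1

def link_rule(tweet):
    """Hypothetical rule: emit (user, 1) when the tweet contains a link."""
    text = tweet.get("text", "")
    if "http://" in text or "https://" in text:
        yield tweet.get("user", "unknown"), 1

def run_rule(rule, records):
    """Run one rule as a mapper, group its output by key, then sum."""
    grouped = defaultdict(list)
    for record in records:              # map phase: the rule is the mapper
        for key, value in rule(record):
            grouped[key].append(value)  # shuffle: collect values per key
    return {key: sum(values) for key, values in grouped.items()}  # reduce

tweets = [
    {"user": "alice", "text": "Watch this #KONY2012 http://example.com"},
    {"user": "bob", "text": "#kony2012 is trending!"},
    {"user": "alice", "text": "nothing to see here"},
]
print(run_rule(hashtag_rule, tweets))
print(run_rule(link_rule, tweets))
```

Because each rule only sees one record at a time and emits key/value pairs, new rules can be added without touching the driver, which is the property that lets the same filter logic scale out across mappers.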
Schedule of Completion:
Questions: