Clarkson - Joshua White - Research Proposal Presentation

Thesis Proposal:

A Quantitative Analysis of the Spread of Information in Social Networks

Joshua S. White

Advisor: Jeanna N. Matthews, PhD

03/05/13

Outline

• Problem

• To Date

• Recently Completed Work

• Current Work

• Inspiration

• Unanswered Questions

• Current Tool Kits

• Our Approach

• Schedule of Completion

• Social media networks are the fastest-growing segment of Internet content today and make up its largest portion

• These networks have only recently (2010-present) been studied in any detail

• Most work has sampled small portions of the network and tried to predict outcomes (e.g., political outcomes)

Problem

1. Tom Pick. "102 Compelling Social Media and Online Marketing Stats and Facts for 2012 (and 2013)." Business 2 Community. January 2, 2013

[Figure: ACM Digital Library search results by topic (sampled Dec. 2012, total = 20,796). Topics shown: Social Networks and Political Analysis; Using Social Networks as Datasets for Machine Learning; Twitter; Actor Types in Any Network; Social Network Graphing; Malware and Social Networks; Social Network Memes; Botnets and Social Networks; Individuals' Influence on Social Networks; Social Network Analysis Tools; Actor Types in Twitter]

Problem (continued)

To Date: Coalmine

• The basis for a Social Network Analysis Tool

Coalmine: an experience in building a system for social media analytics. JS White, JN Matthews, JL Stacy. SPIE Defense, Security, and Sensing, 84080A-84080A-11

To Date: Coalmine

• Coalmine

  – Method scales well based on initial tests

  – Manual and automated detection

  – Configurable data collection capabilities

  – Trial and error filter design tool

• At the Time (Major Future Work)

  – Rebuild of the tool:

    • Fix scaling limitations

    • More extensible Map/Reduce method

      – Solve map-piping issue

    • Inclusion of multi-job support

    • New storage and distribution method

      – Solve replication and state issues

To Date: Coalmine

Coalmine: Data Set Overview

• Over the course of 2012 we collected 165 TB of Twitter data (uncompressed)

  – 147 “full days”, 100 “partial days”

• Estimated 65 billion tweets

  – Twitter traffic estimated at 175 million tweets per day in 2012

  – Collection rates between 50% and 80% for “full days”

  – Data in JSON format, collected using Twitter's REST API

1. Shea Bennett. "Just How Big Is twitter In 2012 [INFOGRAPHIC]," All Twitter - The Unofficial Twitter Resource, February 2013
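As a rough consistency check on the two totals above, they imply an average stored size per tweet. The sketch below assumes decimal terabytes (10^12 bytes), which the slide does not specify, and the variable names are illustrative only.

```python
# Back-of-envelope check: average uncompressed bytes per collected tweet.
# Assumption: 1 TB = 10**12 bytes (the slide does not say decimal or binary).
TB = 10**12

collected_bytes = 165 * TB          # 165 TB of uncompressed Twitter data
estimated_tweets = 65 * 10**9       # estimated 65 billion tweets

avg_bytes_per_tweet = collected_bytes / estimated_tweets
print(f"about {avg_bytes_per_tweet / 1000:.1f} KB per tweet")
```

A figure of a few kilobytes per record is plausible for tweets stored as full JSON objects with user and entity metadata.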

Coalmine: Data Set Overview

• Basic observable patterns

– Twitter has a lot of outages

– Posting rates follow predictable patterns
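One way to observe posting-rate patterns like those noted above is to bin tweets by hour of day. This is a minimal sketch, not Coalmine's code: it assumes each record carries the `created_at` string used in Twitter's v1.1 JSON, and the function name and sample records are hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone

# Timestamp layout of the `created_at` field in Twitter's v1.1 JSON.
TWITTER_TIME = "%a %b %d %H:%M:%S %z %Y"

def tweets_per_hour(tweets):
    """Count tweets by UTC hour of day (0-23) using their `created_at` field."""
    counts = Counter()
    for tweet in tweets:
        posted = datetime.strptime(tweet["created_at"], TWITTER_TIME)
        counts[posted.astimezone(timezone.utc).hour] += 1
    return counts

# Hypothetical sample records; real input would be the collected JSON.
sample = [
    {"created_at": "Mon Mar 05 13:08:45 +0000 2012"},
    {"created_at": "Mon Mar 05 13:59:01 +0000 2012"},
    {"created_at": "Tue Mar 06 02:15:30 +0000 2012"},
]
hourly = tweets_per_hour(sample)
```

Gaps in such a histogram over consecutive days would also be one simple signal of collection or service outages.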

To Date: Phishing Analysis

A method for the automated detection phishing websites through both site characteristics and image analysis. JS White, JN Matthews, JL Stacy. SPIE Defense, Security, and Sensing, 84080B-84080B-11

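• pHash (perceptual hash) process:

– Reduce the image size to 32 x 32 pixels

– Reduce the color to greyscale

– Calculate the DCT (discrete cosine transform), which produces frequency coefficients

– Reduce the DCT to its 8 x 8 block of low-frequency coefficients

– Second DCT reduction: set each bit to 1 or 0 depending on whether the coefficient is above or below the average DCT value

– Take the hash from the resulting bits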
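The steps above can be sketched in pure Python as follows. This is an illustrative sketch, not the implementation from the paper: the function names are hypothetical, nearest-neighbour sampling stands in for whatever resizing the authors used, and the average is taken over the 8 x 8 block without its first (DC) coefficient, a common pHash variant that the slide does not confirm.

```python
import math

def resize_nearest(image, size=32):
    """Resample a 2-D list of pixels to size x size (nearest neighbour)."""
    h, w = len(image), len(image[0])
    return [[image[y * h // size][x * w // size] for x in range(size)]
            for y in range(size)]

def to_grayscale(rgb_rows):
    """Convert rows of (r, g, b) tuples to luminance (BT.601 weights)."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in rgb_rows]

def dct_low_freq(pixels, keep=8):
    """Top-left keep x keep block of the orthonormal 2-D DCT-II."""
    n = len(pixels)
    basis = []
    for k in range(keep):
        scale = math.sqrt((1 if k == 0 else 2) / n)
        basis.append([scale * math.cos((2 * i + 1) * k * math.pi / (2 * n))
                      for i in range(n)])
    # Separable transform: along rows first, then along columns.
    rows = [[sum(p * b for p, b in zip(row, basis[u])) for u in range(keep)]
            for row in pixels]
    return [[sum(rows[y][u] * basis[v][y] for y in range(n))
             for u in range(keep)] for v in range(keep)]

def phash(rgb_rows):
    """64-bit perceptual hash of an image given as rows of (r, g, b)."""
    gray = to_grayscale(resize_nearest(rgb_rows, 32))
    coeffs = [c for row in dct_low_freq(gray, 8) for c in row]
    # Assumption: exclude the DC term so it does not dominate the average.
    mean = sum(coeffs[1:]) / (len(coeffs) - 1)
    bits = 0
    for c in coeffs:
        bits = (bits << 1) | (1 if c > mean else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")
```

Two page screenshots would then be compared with `hamming(phash(a), phash(b))`, where a small distance suggests visually similar images; the threshold to use is a tuning choice not given in the slides.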


Results:

• Two methods:

  – Page characteristic analysis

  – Image similarity analysis

• Proof of concept system

• Need for a generically customizable filter


Recently Completed Work:

• BEK (Blackhole Exploit Kit) Infection Vector Analysis

  – Finished development of a filter for detection of suspect accounts

    • Submitted to IEEE CNS (Conference on Communications and Network Security)

      – “It's you on photo?: Automatic Detection of Twitter Accounts Infected With the Blackhole Exploit Kit”

Recently Completed Work:

[Figure: panels labeled “Normal” and “Infectious”]

Current Work:

• KONY2012 Meme Analysis

  – Finished extraction of relevant data, identification of tag variants, directed graphs of information flow

    • Preparing for submission to ASONAM (Advances in Social Network Analysis and Mining)
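The tag-variant and information-flow steps above might look like the following sketch. It is not the analysis code from this work: it assumes tweets in Twitter's v1.1 JSON layout (`entities.hashtags`, `user.screen_name`, `retweeted_status`), treats any hashtag containing "kony" as a variant (a hypothetical rule), and models flow only through retweets.

```python
import re
from collections import defaultdict

# Hypothetical variant rule: any hashtag containing "kony", case-insensitive.
TAG_PATTERN = re.compile(r"kony", re.IGNORECASE)

def matching_tags(tweet):
    """Return the tweet's hashtags (lower-cased) that match TAG_PATTERN."""
    tags = tweet.get("entities", {}).get("hashtags", [])
    return {t["text"].lower() for t in tags if TAG_PATTERN.search(t["text"])}

def build_flow_graph(tweets):
    """Collect tag variants and weighted edges original_author -> retweeter."""
    edges = defaultdict(int)
    variants = set()
    for tweet in tweets:
        tags = matching_tags(tweet)
        if not tags:
            continue
        variants |= tags
        original = tweet.get("retweeted_status")
        if original:
            src = original["user"]["screen_name"]
            dst = tweet["user"]["screen_name"]
            edges[(src, dst)] += 1
    return dict(edges), variants

# Hypothetical sample records for illustration only.
sample = [
    {"user": {"screen_name": "bob"},
     "entities": {"hashtags": [{"text": "KONY2012"}]},
     "retweeted_status": {"user": {"screen_name": "alice"}}},
    {"user": {"screen_name": "carol"},
     "entities": {"hashtags": [{"text": "StopKony"}]},
     "retweeted_status": {"user": {"screen_name": "alice"}}},
    {"user": {"screen_name": "dave"},
     "entities": {"hashtags": [{"text": "lunch"}]}},
]
edges, variants = build_flow_graph(sample)
```

Each edge points from the account that originated a tweet to an account that retweeted it, so a node's out-degree gives a first, crude measure of how far its posts spread.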

Current Work:

• Actor Types Analysis

  – Literature review completed, started identifying statistical and temporal characteristics of each type

    • Planned for submission to LEET'13 (Large-Scale Exploits and Emergent Threats)


Inspiration

• Our work was inspired in part by Malcolm Gladwell’s book, The Tipping Point – Life as an epidemic

• Thinking this way led us to consider the spread of information and trends in terms of an outbreak for which key people (Mavens, Connectors, and Salesmen) are primarily responsible.

1. Gladwell, M. (2000). The tipping point. Boston: Little, Brown and Company.

Some Unanswered Questions

• Automatic classification of actor types in social networks

  – Do Gladwell's classifications apply?

    • Connectors, mavens and salesmen

  – Who are the opinion leaders?

• Privacy-related implications of social network analysis

• Do social networks have the level of impact on public opinion/mass media that some believe?

  – Can we predict changes in public or individual opinions using social network datasets as a base?

  – Can we predict how memes/news will spread?

  – Are individuals covertly manipulating mass media through social networks?

• Is there a generally applicable way to identify major events like natural disasters as they happen?

Current Tool Kits

• Tool Kits and Methods:

  – Only one well-developed tool kit:

    • NodeXL

      – Small datasets (under 5000 nodes)

      – Built-in statistics and data collection capabilities

      – Built on MS Excel

      – Allows exploration of group relationships

      – Highest usage seems to be for politics-related research

1. Smith, M., Milic-Frayling, N., Shneiderman, B., Mendes Rodrigues, E., Leskovec, J., Dunne, C., (2010). NodeXL: a free and open network overview, discovery and exploration add-in for Excel 2007/2010, http://nodexl.codeplex.com/ from the Social Media Research Foundation, http://www.smrfoundation.org


Approach

• Borrow from traditional “Social Network Analysis” as it relates to the study of Sociology

• Most tools can't handle extremely large datasets

  – We employ the MapReduce methodology as our core for data analysis

• Treat the analysis system like a filtering system and build “rules” for how the data should be processed

• Each rule is essentially constrained to a single Mapper

• Use case studies based on available data to develop individual statistics and rules
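The idea of treating each rule as a single mapper can be illustrated with a small in-process stand-in for MapReduce. This is a sketch under stated assumptions, not the system described here: the rule names, the `run_rule` helper, and the sample records are hypothetical, and a real deployment would run the map and reduce steps on a distributed framework rather than in one process.

```python
from collections import defaultdict

def rule_tweets_per_user(tweet):
    """Rule (mapper): emit (user, 1) for every tweet."""
    yield tweet["user"]["screen_name"], 1

def rule_links_per_user(tweet):
    """Rule (mapper): emit (user, 1) only for tweets that carry a URL."""
    if tweet.get("entities", {}).get("urls"):
        yield tweet["user"]["screen_name"], 1

def run_rule(rule, tweets, reducer=sum):
    """Apply one rule as the map step, group by key, then reduce each group."""
    groups = defaultdict(list)
    for tweet in tweets:
        for key, value in rule(tweet):
            groups[key].append(value)
    return {key: reducer(values) for key, values in groups.items()}

# Hypothetical sample records for illustration only.
sample = [
    {"user": {"screen_name": "alice"}, "entities": {"urls": [{"url": "http://t.co/x"}]}},
    {"user": {"screen_name": "alice"}, "entities": {"urls": []}},
    {"user": {"screen_name": "bob"}, "entities": {"urls": [{"url": "http://t.co/y"}]}},
]
totals = run_rule(rule_tweets_per_user, sample)
with_links = run_rule(rule_links_per_user, sample)
```

Because each rule only looks at one record at a time, rules of this shape can be added or swapped without changing the grouping and reducing machinery, which is the property that makes the filter-style design a natural fit for MapReduce.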

Schedule of Completion:

Questions:
