David lyon week3
Posted: 14-Feb-2017
Page 1

Subreddit Subcultures

Insight Data Engineering Fellowship, Silicon Valley

David Lyon

Page 2

2007 - Impersonal Web

Page 3

2017 - Personal Web

Page 4

Reddit Comment Dataset

2 billion comments

1 million subreddits

Page 5

Personalization of Reddit Over Time

Reddit Clustering App

https://youtu.be/XHczo0TM17E

Page 6

Data Pipeline: Ingestion / Processing → User Interface

Page 7

Challenge 1: Data Size

Every month on Reddit: 60k subreddits, 3 million unique authors

● Reddit is too big to cluster directly!

● The raw clustering matrix has 200 billion elements.

Page 8

Solution 1: Filtering

Every month on Reddit: 6k active subreddits, 30k active authors

● Filter for activity: at least 100 comments/month

● The active clustering matrix has 200 million elements

● Now 1000 times faster to cluster
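
The filtering step can be sketched with NumPy boolean masks. This is a minimal sketch on a toy counts matrix; the variable names are illustrative, and applying the 100-comment threshold to both subreddits and authors is my reading of the slide:

```python
import numpy as np

# Toy subreddit-by-author matrix of monthly comment counts.
# Rows: subreddits, columns: authors.
counts = np.array([
    [120,   0,  80,  1],
    [  2,   1,   0,  0],   # inactive subreddit
    [ 90,  60,   0,  3],
    [  0,   2,   1,  0],   # inactive subreddit
])

THRESHOLD = 100  # comments per month, from the slide

# Keep only rows (subreddits) and columns (authors) whose total
# monthly activity meets the threshold.
active_subs = counts.sum(axis=1) >= THRESHOLD
active_auths = counts.sum(axis=0) >= THRESHOLD
filtered = counts[active_subs][:, active_auths]

print(filtered.shape)  # (2, 1)
```

Dropping inactive rows and columns before clustering is what shrinks the 200-billion-element raw matrix to the 200-million-element active matrix above.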

Page 9

Challenge 2: Too Many Authors

Every month on Reddit: 6k active subreddits, 30k active authors

● Too many individual authors

● Need to cluster by topic, not by author

Page 10

Solution 2: PCA

Every month on Reddit: 6k active subreddits, 300 shared topics

● PCA transforms author space to topic space by finding correlations

● PCA shrinks the dimensionality by another 100 times, from 30k authors to 300 shared topics
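
The author-space to topic-space transform can be sketched with plain SVD-based PCA in NumPy. This is a minimal sketch on random toy data; the real pipeline ran on Spark with far larger matrices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy matrix: 6 subreddits x 8 authors of normalized comment activity.
X = rng.standard_normal((6, 8))

# PCA: center the author columns, then project onto the top
# principal directions (the "shared topics").
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

n_topics = 3
topics = Xc @ Vt[:n_topics].T   # 6 subreddits x 3 topic scores

print(topics.shape)  # (6, 3)
```

Each subreddit is now described by a few topic scores instead of thousands of author counts, and the topic directions are ordered by how much of the correlation structure they capture.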

Page 11

Challenge 3: Slow PCA

Even on a cluster, PCA takes too long on 200 million elements: 100 minutes on 9 Spark workers, over 80% of total pipeline time.

PCA scales as O(MT), where M is the number of matrix elements and T is the number of topics after PCA.

Page 12

Solution 3: Random PCA

Use Facebook Research's Random PCA (fbpca, 2014) on a single node.

fbpca is O(M ln(T)), so for 250 topics fbpca is 45 times faster in principle. In practice, one fbpca worker is 5x faster than 9 full-PCA Spark workers for an average-sized month.
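
The claimed 45x follows directly from the two complexities: for fixed M, the ratio of O(MT) to O(M ln T) is T / ln(T). A quick arithmetic check (not a benchmark):

```python
import math

T = 250  # number of PCA topics
speedup = T / math.log(T)
print(round(speedup))  # ~45
```

The observed 5x is smaller than the theoretical 45x because the single fbpca node is being compared against 9 parallel Spark workers, not against one.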

Page 13

David Lyon, PhD in Physics from the University of Illinois

I love hiking, table tennis, and astrophysics.

Page 14

Solution 3: Silhouette Analysis

● Silhouette Analysis reveals clustering at small k

● Also reveals a second clustering scale of around 400 clusters in this case
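
Silhouette analysis can be sketched as follows. This is a minimal NumPy implementation of the silhouette score for a given labeling of 1-D toy points; a production run would use a library routine such as scikit-learn's silhouette_score:

```python
import numpy as np

def silhouette(points, labels):
    """Mean silhouette score: for each point, a = mean distance to its
    own cluster, b = lowest mean distance to another cluster,
    s = (b - a) / max(a, b)."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    scores = []
    for i, p in enumerate(points):
        same = points[labels == labels[i]]
        # Mean distance to own cluster, excluding the point itself.
        a = np.abs(same - p).sum() / (len(same) - 1)
        # Lowest mean distance to any other cluster.
        b = min(np.abs(points[labels == other] - p).mean()
                for other in set(labels.tolist()) - {labels[i]})
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Two well-separated 1-D clusters...
pts = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
good = [0, 0, 0, 1, 1, 1]   # matches the true structure
bad  = [0, 1, 0, 1, 0, 1]   # ignores the structure

print(silhouette(pts, good) > silhouette(pts, bad))  # True
```

Sweeping k and plotting the mean silhouette is what reveals the small-k clustering and the second scale near 400 clusters mentioned above.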

Page 15

Next Steps - Random PCA for Spark.ml

Step 1: Learn Scala!

Step 2: Contribute it to the open-source community

Step 3: Streaming Random PCA?

Page 16

Next Steps - Popular Topics by Cluster

● Find the popular topics within each cluster using Term-Frequency Inverse-Document-Frequency (TF-IDF)

● Terms are 1-grams and 2-grams used in each cluster, and the document frequency is over all of Reddit for that month.
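
The per-cluster topic extraction can be sketched with a small TF-IDF computation. This is a minimal stdlib sketch using unigrams only; the hypothetical three "documents" stand in for the concatenated comments of each cluster, and the slide computes document frequency over all of Reddit rather than over the clusters alone:

```python
import math
from collections import Counter

# One "document" per cluster: all comments in that cluster for the month.
clusters = {
    "sports":  "goal match season goal team playoff".split(),
    "movies":  "plot scene director season finale scene".split(),
    "cooking": "recipe oven sauce recipe knife team".split(),
}

# Document frequency: how many clusters each term appears in.
df = Counter()
for terms in clusters.values():
    df.update(set(terms))

def tfidf(cluster):
    """Score each term by frequency in this cluster, discounted by
    how common it is across all documents."""
    tf = Counter(clusters[cluster])
    n_docs = len(clusters)
    return {t: tf[t] * math.log(n_docs / df[t]) for t in tf}

# Highest-scoring term in each cluster.
for name in clusters:
    scores = tfidf(name)
    print(name, max(scores, key=scores.get))
```

Terms shared across clusters (like "season" or "team" here) are discounted, so each cluster surfaces the terms that distinguish it.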

Page 17

Challenge 3: Finding K for K-Means

● The number of clusters is not the same as the number of PCA topics

● Clustering can happen on more than one scale

(Figure: example clusters labeled Football, Baseball, TV, Movies)

Page 18

K-Means Clustering

Nearby subreddits in feature space…

…become clustered!

(Figure: points labeled Football, Baseball, TV, Movies grouping into clusters)
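
K-Means itself can be sketched in a few lines of NumPy (Lloyd's algorithm on toy 2-D points). This is a minimal sketch; the real pipeline would use a library implementation on the PCA topic vectors:

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centers at k distinct data points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center.
        d = np.linalg.norm(X[:, None] - centers[None, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each center to the mean of its assigned points.
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers

# Two obvious clusters in "feature space".
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
labels, centers = kmeans(X, k=2)
print(labels)
```

The two groups of nearby points end up with the same label, which is exactly the "nearby subreddits become clustered" picture on the slide.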

Page 19

Random PCA

● Complexity of PCA is O(mnk) for m rows, n input columns, k output columns

● "Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions" (Halko, Martinsson & Tropp, 2009)

● Fast Randomized SVD (Facebook Research, 2014)

● Complexity of Random PCA is O(mn ln(k))

● For k = 100, Random PCA is more than 20x faster!
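
The randomized method from the Halko et al. paper can be sketched in NumPy: sample the column space with a random projection, orthonormalize, and take an exact SVD of the much smaller projected matrix. This is a minimal sketch without the oversampling and power iterations a production implementation such as fbpca adds:

```python
import numpy as np

def random_svd(A, k, seed=0):
    rng = np.random.default_rng(seed)
    m, n = A.shape
    # Range finder: sample the column space with a random projection.
    omega = rng.standard_normal((n, k))
    Q, _ = np.linalg.qr(A @ omega)        # m x k orthonormal basis
    # SVD of the small k x n matrix instead of the full m x n one.
    B = Q.T @ A
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    return Q @ U_small, s, Vt

# Low-rank test matrix: exact rank 3.
rng = np.random.default_rng(1)
A = rng.standard_normal((200, 3)) @ rng.standard_normal((3, 100))

U, s, Vt = random_svd(A, k=3)
err = np.linalg.norm(A - (U * s) @ Vt) / np.linalg.norm(A)
print(err < 1e-8)  # an exactly rank-3 matrix is recovered
```

The expensive full SVD is replaced by a QR on a thin m × k matrix plus an SVD of a k × n matrix, which is where the O(mnk) → O(mn ln k) saving comes from.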

Page 20

Before PCA

(Table: the raw subreddit × author comment-count matrix. Rows: Football, Baseball, TV, Movies; columns: Auth1 through Auth1,000,000. Almost all of the million author columns are empty for any given subreddit, with only scattered counts such as 2, 1, 15.)

Page 21

After PCA

Sub      | Sporting | Fictional | Political
Football |       80 |         2 |         1
Baseball |       90 |         3 |         2
TV       |        6 |        80 |        77
Movies   |        2 |        80 |        20

Page 22

Anatomy of a Reddit Comment

Fields: Body, Author, Date, Subreddit

Group by month, then group by subreddit

Count # of comments by each author per subreddit

Normalize authors so each author has mean = 0 and variance = 1
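
The normalization step can be sketched in NumPy on a toy counts matrix; the real pipeline did the grouping and counting in Spark:

```python
import numpy as np

# Subreddit-by-author comment counts for one month (toy numbers).
counts = np.array([
    [4.0, 0.0, 2.0],
    [1.0, 3.0, 0.0],
    [0.0, 3.0, 4.0],
])

# Normalize each author column to mean = 0 and variance = 1,
# so prolific authors don't dominate the correlations PCA finds.
normalized = (counts - counts.mean(axis=0)) / counts.std(axis=0)

print(normalized.mean(axis=0).round(6))  # ~[0, 0, 0]
print(normalized.var(axis=0).round(6))   # ~[1, 1, 1]
```

After this step every author contributes the same total variance, regardless of how many comments they wrote.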

Page 23

Growth in Number of Subreddits

(Figure: growth curve from 40 subreddits to 1 million subreddits)

Page 24

Week 4 Challenges

● Spark for iterative machine learning, because Spark can map-reduce in memory

● By reducing the dimension of the data, …

● No streaming: clustering requires lots of data and clusters change slowly, but the time window was reduced from monthly to daily

Page 25

Clustering is Universal

● Galaxies cluster into superclusters of ~100k members

● The red dot is our galaxy

● Human knowledge is clustered - purple for physics, blue for chemistry, green for biology and medicine.

● The big blob to the upper left is Liberal Arts.

Page 26

Subreddit Clustering

● Monthly graph from 10k subreddits × 2 million authors = 20 billion matrix entries

● Drastically reduce the size of the data using Principal Component Analysis, normalized so that larger subreddits aren't favored

● Cluster in the reduced-dimensional space using K-means

● Find topics within clusters based on relative frequency of 1-grams and 2-grams

Page 27

Social media brings us closer

● Continual contact with over 1 billion people

● We can find people who share our exact interests

...and separates us

● Less tolerance for differences - unfriend or ban from community!

● Online communities become bubbles isolated from each other

