Date posted: 14-Feb-2017 | Category: Data & Analytics | Uploaded by: dave-lyon
Subreddit Subcultures
Insight Data Engineering Fellowship, Silicon Valley
David Lyon
2007 - Impersonal Web
2017 - Personal Web
Reddit Comment Dataset
2 billion comments
1 million subreddits
Personalization of Reddit Over Time
Reddit Clustering App
https://youtu.be/XHczo0TM17E
Data Pipeline: Ingestion / Processing / User Interface
Challenge 1: Data Size
Every month on Reddit: 60k subreddits, 3 million unique authors
● Reddit is too big to cluster directly!
● The raw clustering matrix has 200 billion elements.
Solution 1: Filtering
Every month on Reddit: 6k active subreddits, 30k active authors
● Filter for activity: at least 100 comments/month
● The active clustering matrix has 200 million elements
● Now 1000 times faster to cluster
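As a sketch, the activity filter amounts to masking out rows and columns whose monthly totals fall below the threshold. The toy matrix and the choice to apply the same 100-comment threshold to authors are illustrative assumptions, not the talk's actual code:

```python
import numpy as np

# Toy subreddit-by-author comment-count matrix (rows: subreddits, cols: authors).
# In the real pipeline this is roughly 60k x 3M per month.
counts = np.array([
    [120,   0,  40,   5],   # active subreddit
    [  2,   1,   0,   0],   # inactive subreddit
    [ 90,  80,   0,  30],   # active subreddit
    [  0,   0,   3,   0],   # inactive subreddit
])

MIN_COMMENTS = 100  # activity threshold: 100 comments/month

active_subs = counts.sum(axis=1) >= MIN_COMMENTS
# Assumption: the same threshold is applied to authors; the slide only
# states "filter for activity: 100 comments/month".
active_authors = counts.sum(axis=0) >= MIN_COMMENTS

filtered = counts[active_subs][:, active_authors]
print(filtered.shape)  # (2, 1)
```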
Challenge 2: Too Many Authors
Every month on Reddit: 6k active subreddits, 30k active authors
● There are too many individual authors to use each one as a feature
● Need to cluster by topic, not by author
Solution 2: PCA
Every month on Reddit: 6k active subreddits, 300 shared topics
● PCA transforms author space into topic space by finding correlations between authors
● PCA shrinks the dimensionality by another 100 times (30k authors → 300 topics)
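A minimal numpy sketch of the PCA step, projecting subreddits from author space onto a handful of principal components ("topics"); the matrix sizes and variable names here are illustrative, not the real pipeline's:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy standardized subreddit-by-author matrix (roughly 6k x 30k in the real pipeline).
X = rng.standard_normal((50, 200))

n_topics = 5  # ~300 in the real pipeline

# PCA via SVD of the column-centered matrix.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Project each subreddit onto the top principal components ("topics").
topics = Xc @ Vt[:n_topics].T
print(topics.shape)  # (50, 5)
```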
Challenge 3: Slow PCA
Even on a cluster, PCA takes too long on 200 million elements: 100 minutes on 9 Spark workers.
● PCA scales as O(MT)
● M is the number of matrix elements; T is the number of topics after PCA
Over 80% of total time!
Solution 3: Random PCA
Use Facebook Research's Random PCA (fbpca, 2014) on a single node.
● fbpca is O(M ln(T))
● For 250 topics, fbpca is 45 times faster! One fbpca worker is 5x faster than 9 full-PCA Spark workers.
5x faster for an average-sized month
David Lyon
PhD in Physics from the University of Illinois
I love hiking, table tennis, and astrophysics
Solution 3: Silhouette Analysis
● Silhouette Analysis reveals clustering at small k
● Also reveals a second clustering scale of around 400 clusters in this case
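A sketch of a silhouette sweep over k, using scikit-learn (an assumption: the talk does not say which library it used); the synthetic, well-separated blobs stand in for subreddit topic vectors:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy data: 4 well-separated clusters standing in for subreddit topic vectors.
X, _ = make_blobs(n_samples=300,
                  centers=[[0, 0], [8, 0], [0, 8], [8, 8]],
                  cluster_std=0.6, random_state=0)

# Average silhouette score for each candidate k; the peak suggests the best k.
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)
```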
Next Steps - Random PCA for Spark.ml
Step 1: Learn Scala!
Step 2: Contribute to Open Source community
Step 3: Streaming Random PCA?
Next Steps - Popular Topics by Cluster
● Find the popular topics within each cluster using Term-Frequency Inverse-Document-Frequency (TF-IDF)
● Terms are 1-grams and 2-grams used in each cluster; the document frequency is computed over all of Reddit for that month.
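A minimal pure-Python sketch of the TF-IDF idea (1-grams only for brevity; the toy cluster texts are invented, and the real document frequency would come from all of Reddit for the month):

```python
import math
from collections import Counter

# Toy "documents": comment text concatenated per cluster (illustrative).
clusters = {
    "sports": "goal match team goal season team",
    "movies": "plot actor scene plot director",
    "mixed":  "team plot news",
}

# Document frequency: in how many clusters does each term appear?
df = Counter()
for text in clusters.values():
    df.update(set(text.split()))

n_docs = len(clusters)

def tfidf(cluster):
    """Term frequency within the cluster, weighted by inverse document frequency."""
    counts = Counter(clusters[cluster].split())
    total = sum(counts.values())
    return {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}

scores = tfidf("sports")
top = max(scores, key=scores.get)
print(top)  # "goal": frequent in this cluster, absent from the others
```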
Challenge 3: Finding K for K-Means
● Number of clusters is not the same as number of PCA topics
● Clustering can happen on more than one scale
[Diagram: example subreddits (Football, Baseball, TV, Movies) in feature space]
K-Means Clustering
Nearby subreddits in feature space…
…become clustered!
[Diagram: Football, Baseball, TV, and Movies grouped into clusters]
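The K-means step can be sketched as a minimal Lloyd's-algorithm loop in numpy; the four toy points stand in for the Football/Baseball/TV/Movies picture above:

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Minimal Lloyd's algorithm: assign points to the nearest centroid, recompute."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance of every point to every centroid.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Two obvious groups standing in for "Football/Baseball" vs "TV/Movies".
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
labels, _ = kmeans(X, k=2)
print(labels)
```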
Random PCA
● Complexity of PCA is O(mnk) for m rows, n input columns, k output columns
● "Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions" (Halko et al., 2009)
● Fast Randomized SVD (Facebook Research, 2014)
● Complexity of Random PCA is O(mn ln(k))
● For k = 100, Random PCA is more than 20x faster!
Before PCA
Sub        Auth1  Auth2  Auth3  …  Auth999,999  Auth1,000,000
Football       2      1
Baseball       3      1     15
TV             5      2     22
Movies         1     21      1                              2
After PCA
Sub        Sporting  Fictional  Political
Football         80          2          1
Baseball         90          3          2
TV                6         80         77
Movies            2         80         20
Anatomy of a Reddit Comment
Body | Author | Date | Subreddit
Group by month, then group by subreddit
Count #comments by author per subreddit
Normalize authors so each author has mean=0 and variance = 1
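The per-author normalization above is a standard z-score over each column; a sketch with toy counts (the guard for constant columns is an added assumption):

```python
import numpy as np

# Toy subreddit-by-author comment counts (rows: subreddits, cols: authors).
counts = np.array([[4.0, 0.0, 10.0],
                   [2.0, 5.0,  0.0],
                   [0.0, 1.0,  2.0]])

# Standardize each author (column) to mean 0 and variance 1.
mean = counts.mean(axis=0)
std = counts.std(axis=0)
std[std == 0] = 1.0  # guard against constant columns (assumption)
z = (counts - mean) / std

print(z.mean(axis=0))  # ~[0, 0, 0]
print(z.std(axis=0))   # ~[1, 1, 1]
```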
Growth in Number of Subreddits
40 subreddits
1 million subreddits
Week 4 Challenges
● Chose Spark for iterative machine learning, because Spark can map-reduce in memory
● Reduced the dimensionality of the data to make clustering tractable
● No streaming: clustering requires lots of data and clusters change slowly, but the time window was reduced from monthly to daily
Clustering is Universal
● Galaxies cluster into superclusters of ~100k members
● The red dot is our galaxy
● Human knowledge is clustered - purple for physics, blue for chemistry, green for biology and medicine.
● The big blob to the upper left is Liberal Arts.
Subreddit Clustering
● Monthly graph: 10k subreddits x 2 million authors = 20 billion matrix entries
● Drastically reduce the size of the data using Principal Component Analysis, normalized so that larger subreddits aren’t favored
● Cluster in the reduced-dimensional space using K-means
● Find topics within clusters based on the relative frequency of 1-grams and 2-grams
Social media brings us closer
● Continual contact with over 1 billion people
● We can find people who share our exact interests
...and separates us
● Less tolerance for differences - unfriend or ban from community!
● Online communities become bubbles isolated from each other