Date post: | 11-Apr-2017 |
Category: | Software |
Upload: | hakka-labs |
HISTORY OF VIRALITY
THE DATA
THE DATA: OLD VERSION
Article being viewed
User viewing article
Time of pageview
Referring domain
THE DATA: NEW VERSION
Article being viewed
Time of pageview
Referring domain
User viewing article
Referring User
DIFFERENT PERSPECTIVE:
Pageviews are a process on a graph!
WHAT THE GRAPH LOOKS LIKE:
WHAT THE PROCESS LOOKS LIKE:
WHAT THE DATA LOOKS LIKE:
WHAT CAN YOU DO WITH OLD PAGEVIEWS?
(Educated) Guess!
CONNIE
OLD GRAPH RECONSTRUCTION: MODEL-BASED INFERENCE
Probabilistic: You can infer connections that aren’t there!
Error Prone: Graph statistics can be susceptible to small changes in the graph
Gets larger when differences in pageview times get smaller
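The idea that a guessed connection becomes more likely as the gap between two pageview times shrinks can be sketched as follows. This is an illustrative toy, not the CONNIE algorithm and not the pyconnie API: the exponential weighting, the `tau` time scale, and the function name `guess_parents` are all assumptions made for the example.

```python
import math

def guess_parents(pageviews, tau=60.0):
    """Guess who referred each user, given only (user, time) pageviews.

    Assumption: the weight of a candidate parent decays exponentially
    with the time gap, exp(-dt / tau). Each user is assumed to appear once.
    Returns {user: (guessed_parent, probability)} for all but the first viewer.
    """
    ordered = sorted(pageviews, key=lambda pv: pv[1])
    guesses = {}
    for i, (user, t) in enumerate(ordered):
        earlier = ordered[:i]  # only earlier viewers can be parents
        if not earlier:
            continue
        weights = [math.exp(-(t - t0) / tau) for _, t0 in earlier]
        total = sum(weights)
        best = max(range(len(earlier)), key=lambda j: weights[j])
        guesses[user] = (earlier[best][0], weights[best] / total)
    return guesses

views = [("a", 0.0), ("b", 10.0), ("c", 100.0)]
guesses = guess_parents(views)
print(guesses["c"][0])  # the most recent earlier viewer is the best guess
```

Because every earlier viewer gets a nonzero weight, the guess can name a connection that never existed, which is the error-prone side of the probabilistic approach.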
SIMPLIFIED VERSION:
Observe:
Guess:
Reality:
Check out a toy implementation here!
github.com/akellehe/pyconnie
NEW GRAPH RECONSTRUCTION: TRIVIAL
These are actually Unique Visitors …
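With a referring user recorded on every pageview, each log row is already an edge, so reconstruction reduces to grouping rows. A minimal sketch, assuming each row has been reduced to a hypothetical `(referring_user, viewing_user)` pair:

```python
from collections import defaultdict

def build_graph(pageviews):
    """Build {referrer: set of referred users} from (referrer, user) pairs.

    Using a set collapses repeat pageviews between the same two users,
    so nodes stand for unique visitors rather than individual pageviews.
    """
    graph = defaultdict(set)
    for referrer, user in pageviews:
        graph[referrer].add(user)
    return dict(graph)

rows = [("promo", "u1"), ("u1", "u2"), ("promo", "u1")]
print(build_graph(rows))
```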
LIFE IS A LITTLE MESSY…
This is more like what the Pageview graph looks like
PROBLEM: DATA MUNGING
• Lots of potential for heuristics!
• How do we get promotion attribution from propagations?
• Trees are important: how can we be sure we get them?
PROBLEM: STREAMLINING ANALYSIS
• How do we work from a common set of definitions?
• How do we avoid repeating analysis?
• How can we streamline data visualization? EDA?
• How do we share optimized analyses? And avoid inefficient (but correct) algorithms?
DEFINE DATA STRUCTURES!
• All data munging happens “under the hood”
• Data pre-processing is unit-tested
• No room for heuristics: standardization!
• Hard math definitions can be consistency-checked!
PROPAGATION SET
For one article
For the site (or other set of articles, S)
Pageviews to article b in time T
Pageviews to the site in time T
The simplest data structure. Just a representation of the raw pageview logs.
Represented as a generator of UserEdge objects
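A propagation set exposed as a generator might look like the sketch below. The slides name `UserEdge` but do not show its fields, so the fields here (`referrer`, `user`, `article`, `time`) and the function `propagation_set` with its parameters are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional, Tuple

@dataclass(frozen=True)
class UserEdge:
    # Hypothetical fields; the real UserEdge may differ.
    referrer: str
    user: str
    article: str
    time: float

Row = Tuple[str, str, str, float]

def propagation_set(rows: Iterable[Row],
                    article: Optional[str] = None,
                    start: float = float("-inf"),
                    end: float = float("inf")) -> Iterator[UserEdge]:
    """Lazily yield edges for one article (or all articles) in [start, end)."""
    for referrer, user, art, t in rows:
        if (article is None or art == article) and start <= t < end:
            yield UserEdge(referrer, user, art, t)

rows = [("promo", "u1", "b", 1.0), ("u1", "u2", "b", 2.0), ("u1", "u3", "c", 3.0)]
for edge in propagation_set(rows, article="b"):
    print(edge)
```

A generator keeps memory use flat over large logs, since edges are produced one at a time instead of being loaded up front.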
PROPAGATION GRAPH
INFLUENCE GRAPH
Propagation graph together with a map that measures the influence of the origin user in p on the pageviewing user
CONSIDER:
PROPAGATION FOREST
The propagation graph is great, but we’d also like a concept like unique visitors!
If there is attribution ordering in the graph, we can trace content back to its source!
PROPAGATION FOREST: FIRST PARENT ATTRIBUTION
n pageviews, one UV
The first parent gets the credit
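First-parent attribution can be sketched as keeping only the earliest in-edge for each user. The `(referrer, user, time)` tuple layout and the function name are assumptions made for this example.

```python
def first_parent_attribution(edges):
    """Collapse many pageviews per user into one in-edge: the earliest.

    edges: iterable of (referrer, user, time).
    Returns {user: (referrer, time)}, so every user has indegree 1.
    """
    parent = {}
    for referrer, user, t in edges:
        if user not in parent or t < parent[user][1]:
            parent[user] = (referrer, t)
    return parent

edges = [("u1", "u3", 5.0), ("promo", "u1", 1.0), ("u2", "u3", 4.0), ("promo", "u2", 2.0)]
print(first_parent_attribution(edges))
```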
RESULT: ALL GRAPHS ARE FORESTS
Promotions have 0 indegree
Users have 1 indegree
total edges in connected components:
Trees!
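The edge count behind this result can be written as a short counting argument. This is a sketch under the assumption that a connected component contains exactly one promotion source and n users; the symbols V, E, n and the indegree notation are introduced here for illustration.

```latex
|E| \;=\; \sum_{v \in V} \deg^{-}(v)
    \;=\; \underbrace{0}_{\text{promotion}} \;+\; \underbrace{1 + \cdots + 1}_{n \text{ users}}
    \;=\; n \;=\; |V| - 1
```

A connected graph with |V| nodes and |V| - 1 edges is a tree. A component with no promotion source would instead have |E| = |V| and so contain a cycle.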
CAREFUL FOR EDGE CASES: MISSING DATA?
All connected components should be rooted at a promotion source.
What happens if we lose the first edge (e.g. use the wrong T)?
PROPAGATION FOREST: CYCLE BREAKING
Consider …
Cycle is not broken by first-parent attribution
Traversal algorithms go on forever!
Consider …
As long as they’re not equal, they can be ordered, say
Then, there is a node in the cycle with an out-edge younger than its in-edge:
The original pageview for that node must have been lost. Cut the in-edge (FPA!).
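A minimal sketch of this cut, assuming first-parent attribution has already run so that every user has exactly one in-edge, and assuming all edge times are distinct. It reads "younger" as "earlier timestamp", which is an interpretation rather than something the slides state. In any cycle, the node whose in-edge has the latest time necessarily has an earlier out-edge, so that in-edge is the one removed. The function name and data layout are hypothetical.

```python
def break_cycles(parent):
    """Cut one in-edge per cycle in a {user: (referrer, time)} map.

    The node cut is the one whose in-edge is latest in the cycle: its
    out-edge within the cycle is earlier, so its original pageview
    must be missing from the data.
    """
    parent = dict(parent)  # do not mutate the caller's map
    for start in list(parent):
        seen, path, node = {}, [], start
        # Walk toward the root until we fall off the map or revisit a node.
        while node in parent and node not in seen:
            seen[node] = len(path)
            path.append(node)
            node = parent[node][0]
        if node in seen:  # revisited a node on this walk: a cycle
            cycle = path[seen[node]:]
            cut = max(cycle, key=lambda u: parent[u][1])
            del parent[cut]
    return parent

# a -> b (t=1), b -> c (t=2), c -> a (t=3) form a cycle; d hangs off a.
parent = {"a": ("c", 3), "b": ("a", 1), "c": ("b", 2), "d": ("a", 5)}
print(break_cycles(parent))
```

After the cut, the node that lost its in-edge becomes the root of its component, so parent-pointer traversals terminate.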
SUCCESS!
Cycle-breaking + FPA = Trees!
Each tree is the UV graph downstream from a promotion source: promotion attribution!
Additional Benefits:
Most information diffusion analyses model trees growing on graphs.
Many algorithms simplify when run on trees!
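Promotion attribution on a forest can be sketched as tracing each user back to the root of its tree. The `{user: referrer}` layout and the function name are assumptions; the input is assumed to be cycle-free.

```python
from collections import Counter

def attribute_to_promotions(parent):
    """Map every user to the root (promotion source) of its tree.

    parent: {user: referrer} describing a forest (no cycles).
    Roots are the nodes that never appear as a key.
    """
    root_of = {}
    for user in parent:
        path, node = [], user
        # Climb until we hit a root or a node whose root is already known.
        while node in parent and node not in root_of:
            path.append(node)
            node = parent[node]
        root = root_of.get(node, node)
        for visited in path:
            root_of[visited] = root
    return root_of

parent = {"u1": "promo_a", "u2": "u1", "u3": "promo_b"}
roots = attribute_to_promotions(parent)
print(Counter(roots.values()))  # unique visitors credited to each promotion
```

Caching the root of every node on the climbed path means each edge is walked a bounded number of times, one example of an algorithm that simplifies on trees.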
SUPERTREE
We may want to run an algorithm, or calculate a tree statistic from a whole forest, instead of just one tree. How can we do that?
Merge all the roots (promotion sources) together into one “super-node”
The whole forest becomes a SuperTree!
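The merge can be sketched by redirecting every edge that leaves a root so that it leaves a single super-node instead. The `{user: referrer}` layout, the function name, and the `"SUPER"` label are assumptions for illustration.

```python
def build_supertree(parent, super_root="SUPER"):
    """Merge all roots of a forest into one node.

    parent: {user: referrer}. Roots are referrers that have no parent.
    Returns a new {user: referrer} map describing a single tree.
    """
    roots = set(parent.values()) - set(parent)
    return {user: (super_root if ref in roots else ref)
            for user, ref in parent.items()}

forest = {"u1": "promo_a", "u2": "u1", "u3": "promo_b"}
print(build_supertree(forest))
```

Any single-tree algorithm or statistic can then be run once on the merged structure rather than once per tree.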
SUPERTREE: EXAMPLE
APPLICATION: LARGE SCALE DATA VIS
WHY IS IT SLOW?
Layouts often consider repelling each node from every other: O(n^2) time complexity
Good for a few thousand nodes
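A quick count shows why all-pairs repulsion stops scaling. The sketch below only counts node pairs per layout iteration; it is arithmetic, not a layout algorithm, and the function name is made up for the example.

```python
def repulsion_pairs(n):
    """Number of distinct node pairs a naive repulsion step must visit."""
    return n * (n - 1) // 2

print(repulsion_pairs(3_000))    # 4498500
print(repulsion_pairs(100_000))  # 4999950000
```

Going from a few thousand nodes to a hundred thousand multiplies the work per iteration by roughly a thousand.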
OPENORD: SIMULATED ANNEALING
Linear main layout
Quadratic settling phase
Implemented in Gephi
OPENORD
Good for ~10k Users
Slow for ~100k Users
Messy! (if you skip the quadratic step!)
TAKE ADVANTAGE OF TREE STRUCTURE!
Traverse the tree to decide where to place nodes!
H3 LAYOUT
Each parent is in the center of a hemisphere.
Children are laid out on the surface of the hemisphere
They become centers of smaller hemispheres (if they’re parents)
Etc.
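The recursive placement described above can be sketched with a single tree traversal. This is a simplified Euclidean illustration, not the pyh3 API and not the full H3 algorithm, which works in hyperbolic space: every hemisphere here opens along +z, children are spread with a golden-angle spiral, and the `shrink` factor and function name are assumptions.

```python
import math

def hemisphere_layout(children, root, radius=1.0, shrink=0.4):
    """Place each node's children on a hemisphere centered on that node.

    children: {node: [child, ...]} describing a tree.
    Returns {node: (x, y, z)}. One visit per node, so time is linear.
    """
    positions = {root: (0.0, 0.0, 0.0)}
    golden = math.pi * (3.0 - math.sqrt(5.0))  # golden angle in radians
    stack = [(root, radius)]
    while stack:
        parent, r = stack.pop()
        px, py, pz = positions[parent]
        kids = children.get(parent, [])
        for i, kid in enumerate(kids):
            z = (i + 0.5) / len(kids)       # height in (0, 1): upper half only
            ring = math.sqrt(1.0 - z * z)   # radius of the circle at that height
            theta = golden * i
            positions[kid] = (px + r * ring * math.cos(theta),
                              py + r * ring * math.sin(theta),
                              pz + r * z)
            stack.append((kid, r * shrink))  # children get smaller hemispheres
    return positions

tree = {"root": ["a", "b", "c"], "a": ["d"]}
for node, xyz in hemisphere_layout(tree, "root").items():
    print(node, xyz)
```

No node is ever compared against every other node, which is what the tree structure buys over force-directed layouts.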
A NEW IMPLEMENTATION
pip install pyh3
WITH D3
MORE APPLICATIONS
ATTRIBUTION
Instead of
CASCADE PREDICTION
GRAPH AND TEMPORAL PROPERTIES ARE IMPORTANT!
TEST THE INFLUENTIALS HYPOTHESIS
IMPROVE CONTENT TARGETING
FINDING THE CAUSES OF VIRALITY
Consider fitting a model:
User Features, content features, context features, User pair features
UNDER CONSTRUCTION:
Online Regression!
Real-time feature weights tell which features correlate with propagation probabilities!
Drives hypothesis-building!
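One way to picture online regression with live feature weights is a logistic model updated one event at a time by stochastic gradient descent. The slides do not say which model or update rule the team uses, so the class below, its learning rate, and the two toy features are assumptions for illustration.

```python
import math

class OnlineLogit:
    """Logistic regression trained one event at a time (plain SGD)."""

    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        z = self.b + sum(w * xi for w, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, y):
        """x: feature vector, y: 1 if the pageview propagated, else 0."""
        err = self.predict(x) - y
        self.b -= self.lr * err
        self.w = [w - self.lr * err * xi for w, xi in zip(self.w, x)]

model = OnlineLogit(n_features=2)
# Toy stream: feature 0 always accompanies propagation, feature 1 never does.
for _ in range(200):
    model.update([1.0, 0.0], 1)
    model.update([0.0, 1.0], 0)
print(model.w)  # weights can be inspected at any point in the stream
```

Reading the weights mid-stream shows which features currently move with propagation. They indicate correlation, not cause, which is why they feed hypotheses rather than settle them.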
THE TEAM