BETTER ALGORITHMS FROM BIGGER DATA
Chris Bingham, CTO, Crimson Hexagon
April 26th, 2012
INTRODUCTION
Crimson Hexagon and me
ABOUT CRIMSON HEXAGON
•Founded 4 years ago; now 40+ employees in Boston
•Help companies make actionable business decisions
•Based on unique analysis of social media and internal data
•Customers include F100, agencies, UN
•Tech stack:
  • Java, with R for algorithms
  • Massive Lucene infrastructure with custom shard management
  • Distributed computing framework for analysis
  • Hadoop increasingly used
BIG DATA, BETTER DATA, BETTER ALGORITHMS
•World’s largest searchable social media archive
•>200 billion posts in 2012
•Adding 1 billion every 2-3 days
•Twitter, Facebook, blogs, forums, comments, news, etc.
BIG DATA, BETTER DATA, BETTER ALGORITHMS
•Who’s talking and listening?
  • Demographics
  • Interests
  • Relationships
•Trends and comparisons
  • Compared to yourself, over time
  • Compared to industry, competitors, etc.
•Human input
  • Define specific business question and possible answers
  • Provides focus and context
BIG DATA, BETTER DATA, BETTER ALGORITHMS
•Based on work by co-founder Gary King at Harvard
•Takes all those billions of posts, plus the human input
•Leverages the human judgment at massive scale
•Quantitative answers to specific business questions
•Accurate in any language
ALGORITHMS AND BIG DATA
The problem of leverage
MACHINE LEARNING
Let’s consider a typical data-analysis problem
using machine learning.
How does having more data help (or hurt) us?
DEFINE CATEGORIES
[Figure: categories A, B, C, D]
Some set of user-defined categories (AKA topics, classes, etc.)
PROVIDE TRAINING
[Figure: categories A, B, C, D with training examples]
Training examples to map features to categories
LEARN A MODEL
[Figure: categories A, B, C, D]
Algorithm classifies items into categories based on training data
CLASSIFY ITEMS
[Figure: categories A, B, C, D; incoming items w, x, y, z]
Incoming unknown items to be classified
OBTAIN RESULTS
[Figure: items w, x, y, z placed into categories A, B, C, D]
Result: Items are classified, hopefully correctly!
DID IT WORK?
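[Figure: algorithm's classification of items w, x, y, z into categories A, B, C, D, side by side with the human classification]
Compare the algorithm's output to that of human(s) to measure accuracy. Here “z” was incorrectly classified.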
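The steps so far (define categories, provide training, learn a model, classify, compare to a human) can be sketched in a few lines. This is only an illustrative toy, not the classifier discussed in the talk: the word-overlap scoring, the example texts, and the function names `train`, `classify`, and `accuracy` are all assumptions made up for this sketch. The texts are chosen so that item z is misclassified, as on the slide.

```python
from collections import Counter, defaultdict

def train(examples):
    """Build per-category word counts from (text, category) pairs."""
    model = defaultdict(Counter)
    for text, category in examples:
        model[category].update(text.lower().split())
    return model

def classify(model, text):
    """Pick the category whose training words overlap the text the most."""
    words = text.lower().split()
    # sorted() makes ties deterministic: the alphabetically first category wins
    return max(sorted(model), key=lambda c: sum(model[c][w] for w in words))

def accuracy(model, labeled_items):
    """Fraction of items where the algorithm agrees with the human label."""
    correct = sum(classify(model, text) == label for text, label in labeled_items)
    return correct / len(labeled_items)

# Hypothetical training examples, one per category
training = [
    ("love this phone", "A"),
    ("hate this phone", "B"),
    ("price is too high", "C"),
    ("where can I buy it", "D"),
]

# Hypothetical items w, x, y, z with the labels a human would assign
items = [
    ("love this one", "A"),               # w
    ("really hate this", "B"),            # x
    ("the price is high", "C"),           # y
    ("too hard to find in stores", "D"),  # z: shares "too" with C only
]

model = train(training)
print(accuracy(model, items))  # 0.75: three of the four items match the human label
```

The single shared word "too" is enough to pull z into category C, which is the kind of thin per-item evidence the later slides come back to.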
ERROR RATE
We were wrong 25% of the time.
What happens when we add more data?
[Chart: 75% correct, 25% wrong]
SCALE TO BIG DATA
We just make the same mistakes on a larger scale.
[Figure: two charts, each showing 75% correct, 25% wrong]
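A small simulation makes the point concrete. It assumes a classifier whose chance of being wrong on any single item is fixed at 25% (the figure from the slide) and does not depend on how many items it sees; the function name `observed_error_rate` and the batch sizes are hypothetical.

```python
import random

def observed_error_rate(n_items, per_item_error=0.25, seed=0):
    """Simulate classifying n_items with a fixed per-item error probability."""
    rng = random.Random(seed)
    wrong = sum(rng.random() < per_item_error for _ in range(n_items))
    return wrong / n_items

# More items means more mistakes in absolute terms, at roughly the same rate
for n in (1_000, 200_000):
    rate = observed_error_rate(n)
    print(f"{n:>7} items: {rate:.1%} wrong, about {round(rate * n)} mistakes")
```

Under this assumption, feeding in more items only multiplies the number of mistakes; the rate stays close to 25%.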
CAN MORE DATA HELP?
Can bigger data help us? In some ways.
• It can enable more types of analysis
• It can enable analysis of more categories
• It can provide more raw material for training and validation
What about accuracy?
[Figure: categories A, B, C, D, E, F]
HUMAN SCALE
[Figure: categories A, B, C, D; items w, x, z]
More training usually improves accuracy, but that takes more humans as well as more data.
Humans don’t scale.
FEEDBACK
[Figure: categories A, B, C, D; item y]
For some applications, users can implicitly provide feedback through their use, e.g. ad placement or spam detection.
But this isn’t possible in all cases, and the initial results can’t be too wrong to begin with.
BOOTSTRAPPING
[Figure: classified items w, x, y, z fed back into categories A, B, C, D]
We can also feed the classified items back into the training set (no human intervention).
Some incorrect classifications will become part of the training! But that doesn’t necessarily hurt.
BOOTSTRAPPING RESULT
[Figure: categories A, B, C, D with additional items (r through z) classified]
The more data you have, the more you can classify.
The more you classify, the more training data you obtain.
The more training data, the more accurate the results.
And we didn’t have to scale the human involvement.
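The bootstrapping loop can be sketched as follows. This is a minimal illustration under assumed details, not the system described in the talk: the word-overlap classifier, the example texts, and the names `train`, `classify`, and `bootstrap` are hypothetical. Machine-assigned labels go straight back into the training set with no human check, so a wrong label would be absorbed as well.

```python
from collections import Counter, defaultdict

def train(examples):
    """Build per-category word counts from (text, category) pairs."""
    model = defaultdict(Counter)
    for text, category in examples:
        model[category].update(text.lower().split())
    return model

def classify(model, text):
    """Pick the category with the most word overlap; ties go alphabetically."""
    words = text.lower().split()
    return max(sorted(model), key=lambda c: sum(model[c][w] for w in words))

def bootstrap(seed_examples, unlabeled):
    """Classify each unlabeled item and add it to the training set as-is."""
    examples = list(seed_examples)
    for text in unlabeled:
        model = train(examples)
        # The label comes from the machine, not a human, and may be wrong
        examples.append((text, classify(model, text)))
    return train(examples)

# Hypothetical human-provided training: one example per category
seed = [("great", "A"), ("terrible", "B")]
# Hypothetical unlabeled posts that arrive later
unlabeled = ["great camera", "terrible battery"]

before = train(seed)
after = bootstrap(seed, unlabeled)
print(classify(before, "battery died"))  # A: no overlap at all, so the tie-break decides
print(classify(after, "battery died"))   # B: "battery" was learned from a machine-labeled post
```

The human input stayed at two examples; the extra vocabulary came entirely from items the model labeled itself.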
INDIVIDUAL VS. AGGREGATE
So far we’ve considered classification of individual items. This is the conventional machine-learning approach.
[Figure: items w, x, y, z sorted into categories A, B, C, D; resulting proportions: 25% A, 25% B, 50% C, 0% D]
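Going from individual classifications to category sizes is a counting step. The sketch below assumes the four items were assigned to A, B, C, and C respectively, which is what the 25/25/50/0 split implies; the function name `category_proportions` is hypothetical.

```python
from collections import Counter

def category_proportions(assignments, categories):
    """Turn per-item category assignments into a share per category."""
    counts = Counter(assignments.values())
    total = len(assignments)
    return {c: counts[c] / total for c in categories}

# Assumed per-item results for w, x, y, z
assignments = {"w": "A", "x": "B", "y": "C", "z": "C"}
proportions = category_proportions(assignments, ["A", "B", "C", "D"])

for category, share in proportions.items():
    print(f"{share:.0%} {category}")
# prints:
# 25% A
# 25% B
# 50% C
# 0% D
```

Any per-item mistake carries straight through to these shares, which is why the following slides ask whether the proportions can be estimated more directly.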
INDIVIDUAL VS. AGGREGATE
What if we want to know the size of each category, rather than which items are in which category?
e.g. epidemiology, polls, market research
[Figure: items w, x, y, z and categories A, B, C, D]
INDIVIDUAL VS. AGGREGATE
Considered individually, each item gives us only a limited amount of information.
As a result, there will be limited correlation with the training data, and therefore poor accuracy.
[Figure: each item w, x, y, z matched on its own against the categories (A? B? C? D?); 75% correct, 25% wrong]
INDIVIDUAL VS. AGGREGATE
When considered in the aggregate, there’s much more data correlating with the training data for each category.
As a result, we can make more accurate estimates of the category proportions.
[Figure: W+X+Y+Z combined to estimate %A, %B, %C, %D; 85% correct, 15% wrong]
INDIVIDUAL VS. AGGREGATE
Now, increasing the amount of data can actually increase the accuracy, with the same amount of human training data.
[Figure: S+T+U+V+W+X+Y+Z combined to estimate %A, %B, %C, %D; 95% correct, 5% wrong]
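One simple way to see why aggregate estimates can improve with volume is the sketch below. It is not the algorithm from the talk; it swaps in a standard two-category correction (often called adjusted classify-and-count), and the true share of 30%, the 75%/25% hit and false-alarm rates, and the name `estimate_proportion` are all assumptions for illustration. The error rates are treated as known from a fixed, human-labeled training set, so the human effort does not grow with the data.

```python
import random

def estimate_proportion(n_items, true_p=0.3, tpr=0.75, fpr=0.25, seed=0):
    """Estimate the share of category A among n_items from noisy per-item labels."""
    rng = random.Random(seed)
    flagged = 0
    for _ in range(n_items):
        is_a = rng.random() < true_p
        # The per-item classifier is right only 75% of the time, either way
        says_a = rng.random() < (tpr if is_a else fpr)
        flagged += says_a
    observed = flagged / n_items
    # Invert observed = p * tpr + (1 - p) * fpr to recover p
    estimate = (observed - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, estimate))

# Same classifier, same error rates, more data each time
for n in (100, 10_000, 200_000):
    print(f"{n:>7} items: estimated share of A = {estimate_proportion(n):.3f} (true 0.300)")
```

Each individual label is still wrong a quarter of the time, but the sampling noise in the aggregate shrinks as the number of items grows, so the estimated share tends toward the true one.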
CONCLUSION
•Bigger data is important
•Better data is important
•Better algorithms are important
•The sweet spot is when one leverages the other
Bigger data can lead to better algorithms.
QUESTIONS?