Date post: | 07-Jul-2015 |
Category: | Technology |
Upload: | visible-technologies |
On Empirical Sentiment Accuracy Bounds
Shawn Rutledge, Chief Scientist
Copyright © 2011 Visible. All rights reserved.
Visible’s Sentiment Approach
Algorithms
• State of the art
• Beyond overhyped NLP
Features
• Deep experience
• Social NLP & Context
Data
• Massive proprietary data
A sentiment model
based on years of
labeling social data for
enterprises.
10^7+ labels, 10^5+ topics, 10^2+ enterprises.
Visible was one of the
first Social Media
Monitoring solutions on
the market.
We have tens of
millions of human-annotated
social media posts.
Basically all breakthroughs
in the last two decades
have come from better data.
• Claims: “We have 97%
Accuracy”
• Experience: “The best
vendor tested had 50%
accuracy at the post
level”
• Experience: Sentiment
accuracy is the most
dissatisfying feature
according to Forrester
research; only 45% are
satisfied with vendor
sentiment accuracy
Sentiment, The Accuracy Disconnect
There is a disconnect
between the hype and the
experience in the
marketplace
1. Solve relevance first, sentiment second.
2. Accuracy is the wrong measure to optimize.
3. Sentiment is more subjective than you think it is.
Key Findings
After spending several years on research with the best available data,
here are some of the key findings.
We won’t have time to cover the first two. The
third could be an alternate title for this talk.
Double Blind, Multi-Reviewer Study:
Audit Findings, Large Financial Institution
No statistically significant
difference between human
labeled and AI labeled
sentiment
1. Same posts labeled by both human
labeling practice and automation.
2. At least two auditors grade each
label. Blind to label source.
A typical study.
Reviewers can’t tell the
difference between Visible’s
statistical models and human
annotators.
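The slides do not say which test backed the "no statistically significant difference" finding. As one plausible sketch, a pooled two-proportion z-test can compare how often auditors accept human labels versus automated labels. The function name and all counts below are hypothetical and are not taken from the audit.

```python
import math

def two_proportion_z_test(agree_a, n_a, agree_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a, p_b = agree_a / n_a, agree_b / n_b
    pooled = (agree_a + agree_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts (assumption, not study data):
# auditors accepted 370 of 500 human labels and 360 of 500 automated labels.
z, p_value = two_proportion_z_test(370, 500, 360, 500)
print(f"z = {z:.2f}, p = {p_value:.2f}")
```

With counts like these the p-value is well above 0.05, which is the pattern a "no significant difference" result would show.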
But…
Auditors agree with each other only 73% of the time [95% CI: 69%-77%].
So is sentiment “solved”?
No. Auditors think people and
automation are both poor. And they
don’t agree with each other.
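An interval like the one quoted for auditor agreement can be computed from raw counts. The sketch below uses the normal (Wald) approximation for a proportion. The slides do not give the sample size or the interval method, so the function name and the counts are assumptions chosen only for illustration.

```python
import math

def agreement_ci(agreements, total, z=1.96):
    """Return (rate, lower, upper) using the normal (Wald) approximation."""
    rate = agreements / total
    half_width = z * math.sqrt(rate * (1 - rate) / total)
    return rate, max(0.0, rate - half_width), min(1.0, rate + half_width)

# Hypothetical counts (assumption, not study data):
# two auditors gave the same grade on 365 of 500 labels.
rate, low, high = agreement_ci(365, 500)
print(f"{rate:.0%} [95% CI: {low:.0%}-{high:.0%}]")
```

The interval narrows with the square root of the sample size, so an interval a few points wide on either side implies a few hundred double-graded labels.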
Key Audit Findings, Large Financial Institution
Both auditors agree with
label only 58% of the time
At least one auditor agrees with label 91%
of the time
Social Media Professionals Grading Human Annotations
58% - 91% is a huge range.
Another way of looking at the same study
Both auditors agree: proxy for “hard” graders
At least one auditor agrees: proxy for “easy” graders
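The two figures on this slide are the strict and lenient readings of the same grades. A minimal sketch of how both can be computed, assuming exactly two auditors per label (the study says "at least two") and using made-up grades:

```python
def grading_rates(grades):
    """grades: list of (auditor_a_agrees, auditor_b_agrees) booleans, one pair per label."""
    n = len(grades)
    both = sum(a and b for a, b in grades) / n          # strict: "hard" grader proxy
    at_least_one = sum(a or b for a, b in grades) / n   # lenient: "easy" grader proxy
    mean_single = sum(a + b for a, b in grades) / (2 * n)
    return both, at_least_one, mean_single

# Hypothetical toy grades (assumption, not data from the audit).
grades = ([(True, True)] * 6 + [(True, False)] * 2
          + [(False, True)] + [(False, False)])
both, at_least_one, mean_single = grading_rates(grades)
print(both, at_least_one, mean_single)

# Inclusion-exclusion: the average single-auditor rate is the midpoint
# of the strict and lenient rates.
assert abs(mean_single - (both + at_least_one) / 2) < 1e-12
```

Under the two-auditor assumption, the same identity applied to the slide's figures puts the average single-auditor agreement with the label at (58% + 91%) / 2 = 74.5%.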
True Across a Wide Variety of Problems
Multi-Reviewer 3rd party audits across a variety of Brands consistently show
relatively low agreement rates.
About 81% Inter-Annotator Agreement [IQR: 78% - 83%]
This talk
promised
bounds and
here they are.
80% is also consistent
with academic research
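A summary such as "about 81% [IQR: 78% - 83%]" can be produced by computing raw percent agreement per audit and then taking the median and quartiles across audits. The sketch below assumes simple pairwise percent agreement and the inclusive quantile method. The slides do not specify either, and the labels and per-audit rates are hypothetical.

```python
import statistics
from itertools import combinations

def pairwise_agreement(labels_by_annotator):
    """Fraction of (annotator pair, post) comparisons where the two labels match."""
    matches = comparisons = 0
    for a, b in combinations(labels_by_annotator, 2):
        for x, y in zip(a, b):
            matches += (x == y)
            comparisons += 1
    return matches / comparisons

def median_and_iqr(rates):
    """Return (median, (q1, q3)) across per-audit agreement rates."""
    q1, q2, q3 = statistics.quantiles(rates, n=4, method="inclusive")
    return q2, (q1, q3)

# Hypothetical labels from two annotators on five posts (assumption).
one_audit = [
    ["pos", "neg", "neu", "pos", "neg"],
    ["pos", "neg", "pos", "pos", "neg"],
]
print(pairwise_agreement(one_audit))  # 4 of 5 labels match

# Hypothetical per-audit agreement rates (assumption, not Visible's data).
audit_rates = [0.75, 0.78, 0.81, 0.83, 0.86]
median, (q1, q3) = median_and_iqr(audit_rates)
print(f"{median:.0%} [IQR: {q1:.0%} - {q3:.0%}]")
```

Raw percent agreement does not correct for agreement expected by chance, so it is best read alongside the label distribution of each audit.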
1. Yes, your team
2. Evaluating sentiment takes care
3. Accuracy claims in the 90s are either exaggerated or naïve (over-fit)
4. It will take effort to get your team in tight agreement on sentiment definitions
5. Real breakthroughs in sentiment accuracy will come from personalization
Takeaways
We all think we’re better than average drivers.
Similarly, although most of us have heard something like the
80% agreement statistic, we don’t think it applies to us. The
main thing I want you to take away from this talk is that it
does apply to you. People within your department, your
team, sitting in the cube next to you, disagree with you
about 20% of the time.
The implications are also worth taking
to heart. When people claim accuracies
much higher than 80%, they are either
lying or they don’t know what they are
doing (overfit to one dataset).
Similar to what has happened in Search, real breakthroughs will come
through personalization. Deeper linguistics (dealing with sarcasm, humor,
contextual knowledge) is interesting but can’t help break the 80% barrier.
Only if teams put the work into getting tight, consistent sentiment definitions
(with >80% agreement) do algorithms have a chance to do that well.
Thank You!
@shawnrut
@Visible
VisibleTechnologies.com