Empirical Sentiment Accuracy Bounds

On Empirical Sentiment Accuracy Bounds

Shawn Rutledge, Chief Scientist

Copyright © 2011 Visible. All rights reserved.

Visible’s Sentiment Approach

Algorithms

• State of the art

• Beyond overhyped NLP

Features

• Deep experience

• Social NLP & Context

Data

• Massive proprietary data

A sentiment model based on years of labeling social data for enterprises.

10^7+ labels, 10^5+ topics, 10^2+ enterprises.

Visible was one of the first Social Media Monitoring solutions on the market.




We have tens of millions of human-annotated social media posts.




Basically all breakthroughs in the last two decades have come from better data.


• Claims: “We have 97% accuracy”

• Experience: “The best vendor tested had 50% accuracy at the post level”

• Experience: Sentiment accuracy is the most dissatisfying feature according to Forrester research; only 45% are satisfied with vendor sentiment accuracy

Sentiment, The Accuracy Disconnect

There is a disconnect between the hype and the experience in the marketplace.


1. Solve relevance first, sentiment second.

2. Accuracy is the wrong measure to optimize.

3. Sentiment is more subjective than you think it is.

Key Findings

After several years of research with the best available data, these are some of the key findings.



We won’t have time to cover the first two. The third could be an alternate title for this talk.


Double Blind, Multi-Reviewer Study:

Audit Findings, Large Financial Institution

No statistically significant difference between human-labeled and AI-labeled sentiment.

1. Same posts labeled by both human labeling practice and automation.

2. At least two auditors grade each label, blind to label source.

A typical study. Reviewers can’t tell the difference between Visible’s statistical models and human annotators.


But…

Auditors agree with each other only 73% of the time [95% CI: 69%-77%].

So is sentiment “solved”? No. Auditors think people and automation are both poor, and they don’t agree with each other.
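As a rough illustration of how an auditor agreement rate and its confidence interval could be computed, here is a minimal Python sketch. The slides do not say which interval method the study used; the normal (Wald) approximation, the function name `agreement_with_ci`, and the example labels are all assumptions made for illustration.

```python
import math


def agreement_with_ci(labels_a, labels_b, z=1.96):
    """Return (agreement, lower, upper) for two graders' labels.

    Uses a normal-approximation interval, which is an assumption here;
    the original study may have used a different method.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Count the items on which both graders gave the same label.
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    p = matches / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


# Hypothetical data: 100 posts, the two auditors agree on 73 of them.
auditor_1 = ["positive"] * 100
auditor_2 = ["positive"] * 73 + ["negative"] * 27
rate, low, high = agreement_with_ci(auditor_1, auditor_2)
print(f"agreement={rate:.2f}, 95% CI=({low:.3f}, {high:.3f})")
```

With only 100 hypothetical posts the interval is wider than the one on the slide, which suggests the real study graded more labels; the sketch does not try to recover that number.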


Key Audit Findings, Large Financial Institution

Both auditors agree with the label only 58% of the time (a proxy for “hard” graders).

At least one auditor agrees with the label 91% of the time (a proxy for “easy” graders).

Social Media Professionals Grading Human Annotations

Another way of looking at the same study: 58% to 91% is a huge range.
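The two figures above are two ways of scoring the same audit. A minimal Python sketch, assuming each label was graded by exactly two auditors and each grade is recorded as a boolean (the function name `audit_rates` and the example grades are hypothetical):

```python
def audit_rates(grades):
    """grades: list of (auditor_1_agrees, auditor_2_agrees) booleans, one pair per label.

    Returns (both_agree_rate, at_least_one_agrees_rate).
    """
    if not grades:
        raise ValueError("grades must not be empty")
    n = len(grades)
    both = sum(a and b for a, b in grades) / n       # strict, "hard grader" proxy
    at_least_one = sum(a or b for a, b in grades) / n  # lenient, "easy grader" proxy
    return both, at_least_one


# Hypothetical grades for 100 labels, chosen only to mirror the slide's rates.
grades = [(True, True)] * 58 + [(True, False)] * 33 + [(False, False)] * 9
both, at_least_one = audit_rates(grades)
print(f"both agree: {both:.0%}, at least one agrees: {at_least_one:.0%}")
```

The same set of grades yields very different headline numbers depending on which rule is reported, which is the point of the slide.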


True Across a Wide Variety of Problems

Multi-reviewer third-party audits across a variety of brands consistently show relatively low agreement rates.

About 81% inter-annotator agreement [IQR: 78%-83%].

This talk promised bounds, and here they are.



80% is also consistent with academic research.
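For readers who want to summarize their own audits the same way, a median with an interquartile range can be computed with Python's standard library. This is only a sketch: the per-audit agreement rates below are invented placeholders, not the audits behind the slide, and the slide does not say which quartile convention was used (the default `exclusive` method of `statistics.quantiles` is assumed).

```python
import statistics


def summarize_agreement(rates):
    """Return (median, q1, q3) for a list of per-audit agreement rates."""
    q1, _, q3 = statistics.quantiles(rates, n=4)  # quartile cut points
    return statistics.median(rates), q1, q3


# Hypothetical agreement rates, one per brand audit (placeholder values only).
audit_rates_by_brand = [0.76, 0.78, 0.80, 0.81, 0.82, 0.83, 0.85]
median, q1, q3 = summarize_agreement(audit_rates_by_brand)
print(f"median={median:.2f}, IQR=[{q1:.2f}, {q3:.2f}]")
```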


1. Yes, your team

2. Evaluating sentiment takes care

3. Accuracy claims in the 90s are either exaggerated or naïve (over-fit)

4. It will take effort to get your team in tight agreement on sentiment definitions

5. Real breakthroughs in sentiment accuracy will come from personalization

Take Aways

We all think we’re better-than-average drivers. Similarly, although most of us have heard something like the 80% agreement statistic, we don’t think it applies to us. The main thing I want you to take away from this talk is that it does apply to you. People within your department, on your team, sitting in the cube next to you, disagree with you about 20% of the time.



The implications are also worth taking to heart. When people claim accuracies much higher than 80%, they are either lying or they don’t know what they are doing (overfit to one dataset).
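One way to see why the agreement rate acts as a ceiling is a toy calculation. The model below is an assumption made for illustration, not something taken from the studies in this talk: a model and a grader each match some reference label independently, with probabilities `p_model` and `p_grader`, and when either deviates it picks uniformly among the other classes. The function name `measured_accuracy` is hypothetical.

```python
def measured_accuracy(p_model, p_grader, num_classes=3):
    """Expected rate at which a grader marks the model's label as correct.

    Toy assumptions: independent deviations from a reference label,
    spread uniformly over the remaining classes.
    """
    if num_classes < 2:
        raise ValueError("need at least two classes")
    both_match_reference = p_model * p_grader
    # Both deviate from the reference and happen to land on the same class.
    both_deviate_same = (1 - p_model) * (1 - p_grader) / (num_classes - 1)
    return both_match_reference + both_deviate_same


# A model that always matches the reference, graded by someone who
# matches the reference 80% of the time, is scored at about 80%.
print(round(measured_accuracy(1.0, 0.8), 3))
# A model that matches the reference 90% of the time scores lower still.
print(round(measured_accuracy(0.9, 0.8), 3))
```

Under these assumptions, even a model that never deviates from the reference is scored at the grader's own agreement rate, so a measured accuracy in the 90s would require graders who agree far more than the audits above suggest.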



Similar to what has happened in Search, real breakthroughs will come through personalization. Deeper linguistics (dealing with sarcasm, humor, contextual knowledge) is interesting but can’t help break the 80% barrier.

If teams put the work into getting tight, consistent sentiment definitions (with >80% agreement), only then do algorithms have a chance to do that well.

Thank You!

@shawnrut

@Visible

VisibleTechnologies.com

Date post:	07-Jul-2015
Category:	Technology
Upload:	visible-technologies