Sifting Social Data: Word Sense Disambiguation Using Machine Learning

Post on 02-Jul-2015

description

Slides prepared for "Social Media and the Defense Sector" http://www.smi-online.co.uk/defence/uk/social-media-within-the-military-and-defence-sector

transcript

Sifting Social Data: Word Sense Disambiguation Using Machine Learning

Dr. Stuart Shulman

Founder & CEO, Texifter

“…a wealth of information creates a poverty of attention.” - Herbert Simon, 1971

Pronounced “tech-sifter”; the metaphor is of a sifter

Text Classification

A 2,500-year-old problem

Plato argued it would be frustrating and it still is…

Grimmer & Stewart “Text as Data” Political Analysis (2013)

Volume is a problem for scholars

Coders are expensive

Groups struggle to accurately label text at scale

Validation of both humans and machines is “essential”

Some models are easier to validate than others

All models are wrong

Automated models enhance/amplify, but don’t replace humans

There is no one right way to do this

“Validate, validate, validate”

“What should be avoided then, is the blind use of any method without a validation step.”

Our free, open-source, web-based text analytics toolkit

The original software kernel: tools for measurement

A mission to avoid tennis elbow

Items load to the screen and the coder hits the keystroke

Keystroke human coding: alone or in groups

Codes

Metadata Data

Human coding can be distributed to individuals, groups & crowds

Computer science & NSF influences: measure everything

How fast?

How reliable?

How accurate?

Stuart Shulman – Texifter

Inter-rater reliability is one critical measurement
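
One common way to put a number on inter-rater reliability is Cohen's kappa, which corrects raw agreement between two coders for the agreement expected by chance. The slides do not say which statistic the toolkit reports, so the sketch below is only an illustration; the function name and the sample codes are hypothetical.

```python
from collections import Counter


def cohen_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders who labeled the same items in the same order."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must label the same non-empty list of items")
    n = len(coder_a)
    # Observed agreement: share of items where the two labels match.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement: computed from each coder's marginal label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label for every item
    return (observed - expected) / (1 - expected)


# Hypothetical codes from two coders on the same six items.
coder_a = ["relevant", "relevant", "irrelevant", "relevant", "irrelevant", "irrelevant"]
coder_b = ["relevant", "irrelevant", "irrelevant", "relevant", "irrelevant", "relevant"]
kappa = cohen_kappa(coder_a, coder_b)
print(round(kappa, 3))
```

Here the coders agree on four of six items (0.667 raw agreement), but half of that would be expected by chance, so kappa is considerably lower than the raw figure.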

Plugged in to APIs & Government

Import data directly via APIs or from your desktop

Full historical Twitter access

PowerTrack operators for more precise queries

Store social data with survey responses and other data

Private, 3rd party & free (rate limited) social data sources

Unlimited “fire hose” premium data sources

The Five Pillars of Text Analytics

Search

Filtering

De-duplication and Clustering

Human Coding

Machine-Learning

Pillar #1: Search

Pillar #1: Defined multi-term search

Pillar #2: Filters

Pillar #3: Deduplication & clustering
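
As a rough illustration of this pillar, near-duplicate posts (retweets, copies with trivial punctuation changes) can be grouped by comparing sets of overlapping word n-grams with Jaccard similarity. This is a generic sketch, not a description of how the toolkit does it; the function names, the 0.6 threshold, and the sample posts are all assumptions made for the example.

```python
import re


def shingles(text, size=3):
    """Set of overlapping word n-grams from lowercased text with punctuation dropped."""
    words = re.findall(r"[a-z0-9#@']+", text.lower())
    if len(words) < size:
        return {tuple(words)}
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def group_near_duplicates(texts, threshold=0.6):
    """Greedy single pass: each text joins the first group whose first member
    it resembles closely enough, otherwise it starts a new group."""
    groups = []  # list of (shingles of first member, [indices of members])
    for index, text in enumerate(texts):
        current = shingles(text)
        for representative, members in groups:
            if jaccard(current, representative) >= threshold:
                members.append(index)
                break
        else:
            groups.append((current, [index]))
    return [members for _, members in groups]


# Hypothetical posts: three variants of one message and one unrelated post.
posts = [
    "Drone strike reported near the border this morning",
    "drone strike reported near the border this morning!",
    "RT Drone strike reported near the border this morning",
    "My new drone takes amazing photos of the beach",
]
clusters = group_near_duplicates(posts)
print(clusters)
```

Collapsing each group to a single representative means a human coder labels the message once rather than once per copy.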

Pillar #4: Human coding (a.k.a. labeling or tagging)

Pillar #4: Human coding

Pillar #4: Human coding (adjudication)

Pillar #5: Machine-learning

Our ActiveLearning engine and coding tools combine…

what humans do best… with what computers do best

Humans and machines learning together

Keep humans “in-the-loop” for more accurate results and better insights
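
The slides do not describe how the ActiveLearning engine chooses items for human review. One widely used active-learning strategy is uncertainty sampling: send coders the items the current classifier is least sure about. The sketch below assumes a binary classifier that outputs a probability per item; the function name, item ids, and scores are hypothetical.

```python
def most_uncertain(probabilities, batch_size=2):
    """Return the ids whose predicted probability of the positive class is
    closest to 0.5, where a binary classifier is least certain."""
    ranked = sorted(probabilities, key=lambda item_id: abs(probabilities[item_id] - 0.5))
    return ranked[:batch_size]


# Hypothetical classifier scores: P(relevant) for five items not yet coded.
scores = {"t1": 0.97, "t2": 0.52, "t3": 0.08, "t4": 0.41, "t5": 0.88}
to_code_next = most_uncertain(scores)
print(to_code_next)
```

Items scored near 0 or 1 are left to the machine, while human attention goes to the borderline cases, where a new label is most informative for the next round of training.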

Word sense disambiguation (relevance)

Human coding can be converted into machine classifiers

Accumulated human coding becomes training data via machine-learning
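
To make the idea concrete, here is a minimal sketch of how human codes could train a relevance classifier for an ambiguous keyword. It uses a small multinomial Naive Bayes model written from scratch; the slides do not name the algorithm the toolkit uses, so the choice of model, the function names, and the coded examples are all assumptions for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train(coded_items):
    """Build word and label counts from (text, label) pairs made by human coders."""
    word_counts = defaultdict(Counter)  # label -> word frequencies
    label_counts = Counter()            # label -> number of coded items
    for text, label in coded_items:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocabulary


def classify(text, model):
    """Return the label with the highest Naive Bayes log score."""
    word_counts, label_counts, vocabulary = model
    total_items = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        # Log prior plus log likelihoods with add-one (Laplace) smoothing.
        score = math.log(label_counts[label] / total_items)
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for word in tokenize(text):
            if word in vocabulary:  # words never seen in training are skipped
                score += math.log((word_counts[label][word] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical human codes for posts matching the ambiguous keyword "tank".
coded = [
    ("the tank rolled across the field during the exercise", "relevant"),
    ("army tank crews trained near the base", "relevant"),
    ("tank battalion deployed to the border", "relevant"),
    ("my fish tank needs a new filter", "irrelevant"),
    ("filled the gas tank before the road trip", "irrelevant"),
    ("she wore a tank top to the gym", "irrelevant"),
]
model = train(coded)
military_sense = classify("army tank deployed near the border", model)
other_sense = classify("cleaning the fish tank filter", model)
print(military_sense, other_sense)
```

Every post contains the keyword, so a search alone cannot separate the senses; the surrounding words, learned from the coders' labels, do the disambiguation. With only six training items this is a toy, and any real classifier would need validation against held-out human codes.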

Users can drill into interactive reporting displays

Use metadata to examine sub-sets of responses and create reports.

Slicing big piles of text into smaller, more focused sets is key

Ultimately all text analytics are filtering techniques

Crowdsourcing accelerates the insight generation process through machine-learning

Distributed for synchronous & asynchronous collaboration

CoderRank (patent pending) for enhanced machine-learning is our key innovation

For more information visit the Texifter table or discovertext.com

@discovertext

Thank you for listening!