
Twitter Part-of-Speech Tagging for All: Overcoming Sparse and Noisy Data

Date post: 26-Jan-2015
Upload: leon-derczynski
Description:
Download software: http://gate.ac.uk/wiki/twitter-postagger.html Original paper: http://derczynski.com/sheffield/papers/twitter_pos.pdf Part-of-speech information is a pre-requisite in many NLP algorithms. However, Twitter text is difficult to part-of-speech tag: it is noisy, with linguistic errors and idiosyncratic style. We present a detailed error analysis of existing taggers, motivating a series of tagger augmentations which are demonstrated to improve performance. We identify and evaluate techniques for improving English part-of-speech tagging performance in this genre. Further, we present a novel approach to system combination for the case where available taggers use different tagsets, based on vote-constrained bootstrapping with unlabeled data. Coupled with assigning prior probabilities to some tokens and handling of unknown words and slang, we reach 88.7% tagging accuracy (90.5% on development data). This is a new high in PTB-compatible tweet part-of-speech tagging, reducing token error by 26.8% and sentence error by 12.2%. The model, training data and tools are made available.
Twitter Part-of-Speech Tagging for All: Overcoming Sparse and Noisy Data
Leon Derczynski, Alan Ritter, Sam Clark, Kalina Bontcheva
Transcript
Page 1: Twitter Part-of-Speech Tagging for All:  Overcoming Sparse and Noisy Data

Twitter Part-of-Speech Tagging for All: Overcoming Sparse and Noisy Data

Leon Derczynski
Alan Ritter
Sam Clark

Kalina Bontcheva

Page 2: Twitter Part-of-Speech Tagging for All:  Overcoming Sparse and Noisy Data

Streaming social media is powerful

● It's Big Data!
– Velocity: 500M tweets / day
– Volume: 20M users / month
– Variety: earthquakes, stocks, this guy

● Sample of all human discourse - unprecedented
● Not only where people are & when, but also what they are doing
● Interesting stuff - just ask the NSA!

Page 3: Twitter Part-of-Speech Tagging for All:  Overcoming Sparse and Noisy Data

Tweets are dirty

● You all know what Twitter is, so let's just look at some difficult tweets

● Orthography: Kk its 22:48 friday nyt :D really tired so imma go to sleep :) good nyt x god bles xxxxx

● Fragments: Bonfire tonite. All are welcome, joe included

● Capitalisation: Don't Have Time To Stop In??? Then, Check Out Our Quick Full Service Drive Thru Window :)

● Nonverbal acts: RT @Huddy85: @Mz_Twilightxxx *kisses your ass**sneezes after* Lol

Page 4: Twitter Part-of-Speech Tagging for All:  Overcoming Sparse and Noisy Data

Tough tweets: Do we even care?

● Most tweets are linguistically fairly well-formed
● RT @DesignerDepot: Minimalist Web Design: When Less is More - http://ow.ly/2FwyX

● just went on an unfollowing spree... there's no point of following you if you haven't tweeted in 10+ days. #justsaying ..

● The tweets we find most difficult are those that seem to say the least

● So im in tha chi whts popping tonight?

● i just gave my momma some money 4 a bill.... she smiled when i put it n her hand __AND__ said "i wanna go out to eat"... -______- HELLA SCAN

Page 5: Twitter Part-of-Speech Tagging for All:  Overcoming Sparse and Noisy Data

We do care

● However, there is utility in trivia:
– Sadilek: predict if you will get flu, using spatial co-location and friend network

– Sugumaran, U. Northern Iowa. Crow corpse reports precede West Nile Virus

– Emerging events: tendency to describe briefly

''There's a dead crow in my garden''

@mari: i think im sick ugh..

Page 6: Twitter Part-of-Speech Tagging for All:  Overcoming Sparse and Noisy Data

Problem representation

● Tweets into finite tokens (PTB + URLs, smileys)
● Put tokens in categories, depending on linguistic function

● Discriminative – cases one by one

– e.g. unigram tagger

● Sequence labelling– order matters!

– consider neighbouring labels

● Goal: label the whole sequence correctly
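The discriminative case above can be sketched with a toy most-frequent-tag (unigram) tagger; the mini training set and the NN fallback below are illustrative assumptions, not the paper's setup.

```python
from collections import Counter, defaultdict

def train_unigram_tagger(tagged_sents):
    """Learn the most frequent tag per token; no context is used."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for token, tag_ in sent:
            counts[token.lower()][tag_] += 1
    return {tok: tags.most_common(1)[0][0] for tok, tags in counts.items()}

def tag(model, tokens, default="NN"):
    # Each token is labelled independently -- neighbouring labels are
    # ignored, which is exactly why sentence-level accuracy suffers.
    return [model.get(t.lower(), default) for t in tokens]

train = [[("good", "JJ"), ("night", "NN")], [("good", "JJ"), ("sleep", "NN")]]
model = train_unigram_tagger(train)
print(tag(model, ["good", "night"]))  # ['JJ', 'NN']
```

A sequence labeller would instead score whole label sequences, letting neighbouring tags influence each other.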

Page 7: Twitter Part-of-Speech Tagging for All:  Overcoming Sparse and Noisy Data

Word order still matters.. just

● Hard for tweets: exclamations and fragments
● Whole sequences a bit rare
● @FeeninforPretty making something to eat, aint ate all day

● Peace green tea time!! Happyzone!!!! :)))))

● Sentence structure cues (e.g. caps) often:
– absent

– over-used

Page 8: Twitter Part-of-Speech Tagging for All:  Overcoming Sparse and Noisy Data

How do current tools do?

● Badly!
– Out of the box:

– Trained on Twitter, IRC and WSJ data:

Page 9: Twitter Part-of-Speech Tagging for All:  Overcoming Sparse and Noisy Data

Where do they break?

● Continued work extending Stanford Tagger
● Terrible at doing whole sentences

– Best was 10% accuracy

– SotA on newswire about 55-60%

● Problems on unknown words – a good target set for improving performance
– 1 in 5 words completely unseen

– 27% token accuracy on this group

Page 10: Twitter Part-of-Speech Tagging for All:  Overcoming Sparse and Noisy Data

What errors occur on unknowns?

● Gold standard errors (dank_UH je_UH → _FW)
● Training lacks IV words (Internet, bake)
● Pre-taggables (URLs, mentions, retweets)
● NN vs. NNP (derek_NN, Bed_NNP)
● Slang (LUVZ, HELLA, 2night)
● Genre-specific (unfollowing)
● Tokenisation errors (ass**sneezes)
● Orthographic (suprising)

Page 11: Twitter Part-of-Speech Tagging for All:  Overcoming Sparse and Noisy Data

Do we have enough data?

● No, it's even worse than normal

– Ritter: 15K tokens, PTB, one annotator

– Foster: 14K tokens, PTB, low-noise

– CMU: 39K tokens, custom, narrow tagset

Page 12: Twitter Part-of-Speech Tagging for All:  Overcoming Sparse and Noisy Data

Tweet PoS-tagging issues

● From analysis, three big issues identified:

1. Many unseen words / orthographies

2. Uncertain sentence structure

3. Not enough annotated data

● Continued with Ritter dataset

Page 13: Twitter Part-of-Speech Tagging for All:  Overcoming Sparse and Noisy Data

Unseen words in tweets

● Two classes:
● Standard token, non-standard orthography:

– freinds– KHAAAANNNNNNN!

● Non-standard token, standard orthography
– omg + bieber = omb
– Huntington

Page 14: Twitter Part-of-Speech Tagging for All:  Overcoming Sparse and Noisy Data

Unseen words in tweets

● Majority of non-standard orthographies can be corrected with a gazetteer: typical Pareto
– vids → videos
– cussin → cursing
– hella → very

● No need to bother with e.g. Brown clustering
● 361 entries give 2.3% token error reduction
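A minimal sketch of such a gazetteer-based normaliser; the vids/cussin/hella pairs are from the slide, while "tonite → tonight" and the lookup details are assumptions about how the real 361-entry list is applied.

```python
# Hypothetical slice of the gazetteer; the real 361-entry list ships
# with the GATE Twitter tagger.
GAZETTEER = {
    "vids": "videos",
    "cussin": "cursing",
    "hella": "very",
    "tonite": "tonight",  # assumed entry, for illustration
}

def normalise(tokens):
    """Replace known non-standard spellings before tagging."""
    return [GAZETTEER.get(t.lower(), t) for t in tokens]

print(normalise(["hella", "good", "vids"]))  # ['very', 'good', 'videos']
```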

Page 15: Twitter Part-of-Speech Tagging for All:  Overcoming Sparse and Noisy Data

Unseen words in tweets

● The rest can be handled reasonably with word shape and contextual features

● Using edu.stanford.nlp.tagger.maxent.ExtractorFramesRare

● Features include:– word prefix and suffix shapes

– distribution of shape in corpus

– shapes of neighbouring words

● Corpus is small, so adjust the rare-word threshold
● +5.35% absolute token acc., +18.5% sentence
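Word-shape features of this flavour can be sketched as follows; this is a loose imitation of what Stanford's rare-word extractors compute, not the actual feature set.

```python
import re

def word_shape(token):
    """Map characters to classes and collapse runs, so elongated
    spellings like KHAAAANNNNNNN! generalise to one shape."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"[0-9]", "d", shape)
    # Collapse repeated shape characters.
    return re.sub(r"(.)\1+", r"\1", shape)

def rare_word_features(tokens, i):
    """Prefix/suffix and neighbouring-shape features for a rare word."""
    t = tokens[i]
    return {
        "shape": word_shape(t),
        "prefix3": t[:3].lower(),
        "suffix3": t[-3:].lower(),
        "prev_shape": word_shape(tokens[i - 1]) if i > 0 else "<s>",
        "next_shape": word_shape(tokens[i + 1]) if i < len(tokens) - 1 else "</s>",
    }

print(word_shape("KHAAAANNNNNNN!"))  # X!
```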

Page 16: Twitter Part-of-Speech Tagging for All:  Overcoming Sparse and Noisy Data

Tweet “sentence” “structure”

● They are structured (sometimes)

● We still do better if we look at global features
– Unigram tagger accuracy: 66%

● Sentence-level accuracy is important
– Unigram tagger sentence accuracy: 2.3%

Page 17: Twitter Part-of-Speech Tagging for All:  Overcoming Sparse and Noisy Data

Tweet “sentence” “structure”

● Tweets contain some constrained-form tokens
● Links, hashtags, user mentions, some smileys
● We can fix the label for these tokens

● Knowing P(c_i) constrains both P(c_{i-1} | c_i) and P(c_{i+1} | c_i)

Page 18: Twitter Part-of-Speech Tagging for All:  Overcoming Sparse and Noisy Data

Tweet “sentence” “structure”

● This allows us to prune the transition graph of labels in the sequence

● Because the graph is read in both directions, fixing any label point impacts whole tweet

● Setting label priors reduces token error 5.03%
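A sketch of how the constrained-form tokens might be recognised before setting label priors; the URL/USR/HT/RT tag names follow the Ritter-style PTB extensions, but the exact regexes here are assumptions.

```python
import re

# Patterns for pre-taggable, constrained-form tokens.
PRETAGGABLE = [
    (re.compile(r"^https?://\S+$"), "URL"),  # links
    (re.compile(r"^@\w+$"), "USR"),          # user mentions
    (re.compile(r"^#\w+$"), "HT"),           # hashtags
    (re.compile(r"^RT$"), "RT"),             # retweet marker
]

def fixed_label(token):
    """Return a fixed tag for a pre-taggable token, else None.

    In the sequence model, clamping the label at these positions prunes
    the transition lattice in both directions around them."""
    for pattern, tag in PRETAGGABLE:
        if pattern.match(token):
            return tag
    return None

print([fixed_label(t) for t in ["RT", "@Huddy85", "#fml", "lol"]])
# ['RT', 'USR', 'HT', None]
```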

Page 19: Twitter Part-of-Speech Tagging for All:  Overcoming Sparse and Noisy Data

Not enough data

● Big unlabelled data - 75 000 000 tweets / day (en)
● Bootstrapping sometimes helps in this case

● Problem: initial accuracy is too low • ︵ _UH
● Solution: consensus with > 1 tagger ◕ ◡ ◕_UH
● Problem: only one tagger using PTB tags ⋋〴_⋌〵_UH
● Solution: vote-constrained bootstrapping ⊙ʘ_UH

Page 20: Twitter Part-of-Speech Tagging for All:  Overcoming Sparse and Noisy Data

Vote-constrained bootstrapping

● Not many taggers available for building semi-supervised data

● We chose Ritter's tagger plus the CMU tagger

● Where classes don't map 1:1
● Create equivalence classes between tags

– CMU tag R (adverb) → PTB (WRB,RB,RBR,RBS)

– CMU tag ! (interjection) → PTB (UH)
● Coarser tag constrains the set of fine-grained tags
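The equivalence classes can be represented as a simple map from coarse CMU tags to PTB tag sets; the R and ! entries are from the slide, the remaining entries are illustrative assumptions.

```python
# Partial CMU -> PTB equivalence classes (entries beyond R and ! are
# assumed for illustration, not taken from the paper).
CMU_TO_PTB = {
    "R": {"WRB", "RB", "RBR", "RBS"},   # adverbs
    "!": {"UH"},                        # interjections
    "V": {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ", "MD"},
    "N": {"NN", "NNS"},
    "^": {"NNP", "NNPS"},               # proper nouns
}

def tags_agree(cmu_tag, ptb_tag):
    """A coarse CMU tag agrees with any PTB tag in its equivalence class."""
    return ptb_tag in CMU_TO_PTB.get(cmu_tag, set())

print(tags_agree("^", "NNP"))  # True
print(tags_agree("N", "VBZ"))  # False
```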

Page 21: Twitter Part-of-Speech Tagging for All:  Overcoming Sparse and Noisy Data

Vote-constrained bootstrapping

● Ask both taggers to label the candidate input
● Add tweet to semi-supervised data if both agree
● Lebron_^ + Lebron_NNP → OK, Lebron_NNP
● books_N + books_VBZ → Fail, reject tweet

● Evaluated quality on development set
– Agreed on 17.8% of tweets

– Of those, 97.4% of tokens correctly PTB-labelled

– 71.3% whole tweets correctly labelled
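The agreement test can be sketched as follows, assuming a CMU→PTB equivalence map like the one two slides back; the helper name and the small mapping shown are hypothetical.

```python
def agree(cmu_tagged, ptb_tagged, cmu_to_ptb):
    """Accept a tweet for the semi-supervised pool only if every PTB tag
    falls inside the equivalence class of the corresponding CMU tag."""
    if len(cmu_tagged) != len(ptb_tagged):
        return None  # tokenisation mismatch: reject
    for (tok_a, ct), (tok_b, pt) in zip(cmu_tagged, ptb_tagged):
        if tok_a != tok_b or pt not in cmu_to_ptb.get(ct, set()):
            return None  # taggers disagree: reject the whole tweet
    return ptb_tagged  # keep the fine-grained PTB labels

mapping = {"^": {"NNP"}, "V": {"VBZ", "VB"}}
print(agree([("Lebron", "^")], [("Lebron", "NNP")], mapping))
# [('Lebron', 'NNP')]
print(agree([("books", "N")], [("books", "VBZ")], mapping))  # None
```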

Page 22: Twitter Part-of-Speech Tagging for All:  Overcoming Sparse and Noisy Data

Vote-constrained bootstrapping

● Results:
– Use Trendminer lang ID + data

– Collected 1.5M agreed-upon tokens

● Adding this bootstrapped data reduced error by:
– Token-level: 13.7%
– Sentence-level: 4.5%

www.trendminer-project.eu

Page 23: Twitter Part-of-Speech Tagging for All:  Overcoming Sparse and Noisy Data

Final results

● Unknown accuracy rate: from 27.8% to 74.5%

                        Token   Sentence
Baseline: Ritter T-Pos  84.55    9.32
GATE: eval set          88.69   20.34
 - error reduction      26.80   12.15
GATE: dev set           90.54   28.81
 - error reduction      38.77   21.49

Page 24: Twitter Part-of-Speech Tagging for All:  Overcoming Sparse and Noisy Data

Where do we go next?

● Local tag sequence bounds?
● Better handling of hashtags

– I'm stressed at 9am, shopping on my lunch break... can't deal w/ this today. #retailtherapy

– I'm so #bored today

● More data – bootstrapped
● More data – part-bootstrapped (e.g. CMU GS)
● More data – human annotated

● Parsing

Page 25: Twitter Part-of-Speech Tagging for All:  Overcoming Sparse and Noisy Data

Downloadable & Friendly

● As command-line tool; as GATE PR; as Stanford Tagger model

● Included in GATE's TwitIE toolkit (4pm, Europa)
● 1.5M token dataset available

● Updates since submission:
– Better handling of contractions

– Less sensitive to tokenisation scheme

● Please play!

Page 26: Twitter Part-of-Speech Tagging for All:  Overcoming Sparse and Noisy Data

Thank you for your time!

There is hope:

Jersey Shore is overrated. studying and history homework then a fat night of sleep!

Do you have any questions?

Page 27: Twitter Part-of-Speech Tagging for All:  Overcoming Sparse and Noisy Data

Owoputi et al.

● NAACL'13 paper: 90.5% token accuracy with the PTB tagset
● Advancement of the Gimpel tagger, used for our bootstrapping
● Late discovery: can be adapted to the PTB tagset with good results
● We use techniques disjoint from Owoputi's; combining them could give an even better result!
● Our model is readily re-usable and integrated into existing NLP tool sets

Page 28: Twitter Part-of-Speech Tagging for All:  Overcoming Sparse and Noisy Data

Capitalisation

● Noisy tweets have unusual capitalisation, right?
– Buy Our Widgets Now
– ugh I haet u all .. stupd ppl #fml

● Lowercase model with lowercased data allows us to ignore capitalisation noise

● Tried multiple approaches to classifying noisy vs. well-formed capitalisation

● Gain from ignoring case in noisy tweets offset by loss from mis-classified well-cased data
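One crude way to classify noisy vs. well-formed capitalisation (an illustrative heuristic, not the classifier the paper evaluated):

```python
def looks_noisily_cased(tokens):
    """Flag a tweet whose mid-sentence tokens are mostly Title-Cased,
    as in marketing spam; a deliberately simple, assumed heuristic."""
    mid = tokens[1:]
    if not mid:
        return False
    titled = sum(1 for t in mid if t[:1].isupper() and t[1:].islower())
    return titled / len(mid) > 0.5

print(looks_noisily_cased("Buy Our Widgets Now".split()))  # True
print(looks_noisily_cased("ugh I haet u all".split()))     # False
```

Misfires on well-cased text like this would erase the gain from lowercasing, matching the trade-off noted above.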

