Adaptive Parser-Centric Text Normalization

Post on 10-Jun-2015


description

Wonderful work done with Congle Zhang (my summer intern in 2012) and my IBM colleagues. Nominated for a best paper award and presented at ACL 2013.

Congle Zhang, Tyler Baldwin, Howard Ho, Benny Kimelfeld, and Yunyao Li. Adaptive Parser-Centric Text Normalization. Proceedings of ACL, pp. 1159-1168, 2013.

transcript


Adaptive Parser-Centric Text Normalization

Congle Zhang*, Tyler Baldwin**, Howard Ho**, Benny Kimelfeld**, Yunyao Li**

*University of Washington   **IBM Research - Almaden

[Diagram: public, web, and private text sources (social media, news, SEC, USPTO, internal data, subscription data) feed into text analytics, which powers applications such as marketing, financial investment, drug discovery, and law enforcement.]

Text analytics is the key to discovering hidden value in text.

DREAM vs. REALITY

[Contrasting images; image from http://samasource.org]

CAN YOU READ THIS ON THE FIRST ATTEMPT?

ay woundent of see ’ em

I would not have seen them.

When a machine reads it

Results from Google Translate:

Chinese: 唉看见他们woundent
Spanish: ay woundent de verlas
Japanese: ローマ法王進呈の AY woundent
Portuguese: ay woundent de vê-los
German: ay woundent de voir 'em

Text Normalization
• Informal writing → standard written form

ay woundent of see ’ em
→ normalize →
I would not have seen them .

Challenge: Grammar

Previous text normalization maps out-of-vocabulary non-standard tokens to their in-vocabulary standard forms:

ay woundent of see ’ em → would not of see them

vs. the grammatical sentence: I would not have seen them.

Challenge: Domain Adaptation

Tailor the same text normalization solution to the different writing styles of different data sources.

Challenge: Evaluation
• Previous: word error rate & BLEU score
• However,
 – words are not equally important
 – non-word information (punctuation, capitalization) can be important
 – word reordering is important
• How does the normalization actually impact the downstream applications?

Adaptive Parser-Centric Text Normalization
• Produces grammatical sentences
• Domain transferable
• Evaluated by parsing performance

Outline: Model • Inference • Learning • Instantiation • Evaluation • Conclusion

Model: Replacement Generator

• Replacement <i,j,s>: replace tokens x_i … x_{j-1} with s
• Domain customization
 – Generic (cross-domain) replacements
 – Domain-specific replacements

Example on Ay_1 woudent_2 of_3 see_4 'em_5:
<2,3,"would not"> (edit), <1,2,"Ay"> (same), <1,2,"I"> (edit), <1,2,ε> (delete), <6,6,"."> (insert), …
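To make the replacement abstraction concrete, here is a minimal Python sketch (my illustration, not the paper's code; the generator set and the contraction table are assumptions, and indices are 0-based where the slides are 1-based):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Replacement:
    """Replace input tokens x_i ... x_{j-1} with the string s."""
    i: int            # start index (0-based here; the slides use 1-based)
    j: int            # end index (exclusive); i == j encodes an insertion
    s: str            # replacement string; "" encodes a deletion
    generator: str    # which generator proposed it (used as a feature later)

def leave_intact(tokens):
    """Generic generator: every token may stay as it is."""
    return [Replacement(i, i + 1, tok, "same") for i, tok in enumerate(tokens)]

CONTRACTIONS = {"woudent": "would not", "'em": "them"}  # illustrative table

def contraction(tokens):
    """Generic generator: expand known contractions."""
    return [Replacement(i, i + 1, CONTRACTIONS[t.lower()], "contraction")
            for i, t in enumerate(tokens) if t.lower() in CONTRACTIONS]

def insert_final_period(tokens):
    """Generic generator: optionally insert '.' after the last token."""
    n = len(tokens)
    return [Replacement(n, n, ".", "insert-punct")]

tokens = ["Ay", "woudent", "of", "see", "'em"]
candidates = leave_intact(tokens) + contraction(tokens) + insert_final_period(tokens)
```

Each generator only proposes candidates; the model decides later which ones are actually used.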

Model: Boolean Variables
• Associate a unique Boolean variable X_r with each replacement r
 – X_r = true: replacement r is used to produce the output sentence

Example: <2,3,"would not"> = true → output contains "… would not …"

Model: Normalization Graph
• A graphical model over the replacements for "Ay woudent of see 'em"

[Graph: nodes *START*, <1,2,"Ay">, <1,2,"I">, <2,3,"would">, <2,4,"would not have">, <3,4,"of">, <4,5,"seen">, <4,6,"see him">, <5,6,"them">, <6,6,".">, *END*; edges connect compatible adjacent replacements.]

Model: Legal Assignment
• Soundness
 – No two true replacements overlap
 – e.g., <1,2,"Ay"> and <1,2,"I"> cannot both be true
• Completeness
 – Every input token is captured by at least one true replacement

Model: Legal = Path
• A legal assignment is exactly a path from *START* to *END*

[Graph as above, highlighting the path *START* → <1,2,"I"> → <2,4,"would not have"> → <4,6,"see him"> → <6,6,"."> → *END*]

Output: I would not have see him.
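Since a legal assignment is exactly a *START*-to-*END* path, decoding a path into the output sentence is straightforward. A small sketch building on the Replacement class above (0-based indices; the slide's example path reproduced):

```python
def decode(path):
    """Turn a *START*-to-*END* path of replacements into the output sentence."""
    # Adjacent replacements must satisfy prev.j == cur.i, so the path covers
    # every input token exactly once (complete) with no overlaps (sound).
    for prev, cur in zip(path, path[1:]):
        assert prev.j == cur.i, "not a legal assignment"
    return " ".join(r.s for r in path if r.s)  # "" (deletion) contributes nothing

path = [Replacement(0, 1, "I", "edit"),
        Replacement(1, 3, "would not have", "contraction"),
        Replacement(3, 5, "see him", "edit"),
        Replacement(5, 5, ".", "insert-punct")]
print(decode(path))  # -> I would not have see him .
```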

Model: Assignment Probability
• Log-linear model; feature functions on edges

[Same normalization graph, with feature functions attached to the edges.]
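The slide leaves the formula implicit; a standard log-linear form consistent with "feature functions on edges" would be (notation mine, not taken from the paper):

$$p_\theta(a \mid x) \;=\; \frac{1}{Z(x)} \exp\!\Big(\sum_{(r,r') \in \mathrm{path}(a)} \theta \cdot \phi(r, r', x)\Big), \qquad Z(x) \;=\; \sum_{a' \text{ legal}} \exp\!\Big(\sum_{(r,r') \in \mathrm{path}(a')} \theta \cdot \phi(r, r', x)\Big)$$

where path(a) is the set of edges on the path for assignment a, φ is the edge feature vector, and θ the learned weights.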

Outline: Model • Inference • Learning • Instantiation • Evaluation • Conclusion

Inference
• Select the assignment with the highest probability
• Computationally hard on general graphical models …
• But in our model it boils down to finding the longest path in a weighted directed acyclic graph
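This is the textbook longest-path recurrence over a topological order; a minimal sketch (the node/edge representation is assumed, with edge weights θ·φ precomputed):

```python
def best_path(nodes, edges, start, end):
    """Highest-scoring start-to-end path in a weighted DAG (Viterbi-style)."""
    NEG = float("-inf")
    score = {u: NEG for u in nodes}
    back = {}
    score[start] = 0.0
    for u in nodes:                        # nodes must be in topological order
        if score[u] == NEG:
            continue                       # unreachable so far
        for v, w in edges.get(u, []):      # relax every edge u -> v of weight w
            if score[u] + w > score[v]:
                score[v] = score[u] + w
                back[v] = u
    path, u = [end], end                   # follow back-pointers from the end
    while u != start:
        u = back[u]
        path.append(u)
    return score[end], path[::-1]
```

For the normalization graph, sorting replacements by (i, j) (with *START* first and *END* last) gives a valid topological order, since every edge goes from <i,j,s> to some <j,k,s'>.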

Inference: weighted longest path

[Graph as above; the highest-weight path is *START* → <1,2,"I"> → <2,4,"would not have"> → <4,6,"see him"> → <6,6,"."> → *END*]

Output: I would not have see him.

Outline: Model • Inference • Learning • Instantiation • Evaluation • Conclusion

Learning
• Perceptron-style algorithm
 – Update weights by comparing (1) the most probable output under the current weights with (2) the gold sequence

Input: (1) informal sentence: Ay woudent of see ‘em; (2) gold: I would not have seen them.; (3) the normalization graph
Output: weights of the features

Learning: Gold vs. Inferred

[Graph as above, contrasting the gold sequence with the most probable sequence under the current weights θ.]

Learning: Update Weights on the Differential Edges

[Graph as above; the weights w_i are increased on edges that appear only in the gold path, so that the gold sequence becomes "longer".]
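The three learning slides amount to a structured-perceptron update: decode with the current weights, then shift weight between the gold and predicted paths so that only the differential edges change. A sketch under that reading (the edge feature function phi is assumed, not the authors' exact update):

```python
from collections import Counter

def perceptron_update(weights, gold_path, predicted_path, phi, lr=1.0):
    """One perceptron step on the differential edges.

    phi(u, v) -> Counter of feature values for the edge u -> v.
    Features on gold-only edges are boosted, features on predicted-only
    edges are penalized; edges shared by both paths cancel out.
    """
    def path_features(path):
        total = Counter()
        for u, v in zip(path, path[1:]):
            total.update(phi(u, v))
        return total

    delta = path_features(gold_path)
    delta.subtract(path_features(predicted_path))
    for feat, diff in delta.items():
        if diff:
            weights[feat] = weights.get(feat, 0.0) + lr * diff
    return weights
```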

Outline: Model • Inference • Learning • Instantiation • Evaluation • Conclusion

Instantiation: Replacement Generators

Generator               From       To
leave intact            good       good
edit distance           bac        back
lowercase               NEED       need
capitalize              it         It
Google spell            dispaear   disappear
contraction             wouldn't   would not
slang language          ima        I am going to
insert punctuation      ε          .
duplicated punctuation  !?         !
delete filler           lmao       ε

Instantiation: Features
• N-gram
 – frequency of the phrases induced by an edge
• Part-of-speech
 – encourage certain behaviors, such as avoiding the deletion of noun phrases
• Positional
 – capitalize words after stop punctuation
• Lineage
 – which generator spawned the replacement
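A sketch of how these four feature families could be expressed as a single edge feature function (the feature names and the two lookups are placeholders standing in for a language-model table and a POS tagger, not the paper's exact features):

```python
from collections import Counter

def phi(prev, cur, ngram_logfreq, pos_of_deleted):
    """Features for the edge prev -> cur (both Replacement objects)."""
    feats = Counter()
    # N-gram: (log-)frequency of the phrase the edge induces
    phrase = (prev.s + " " + cur.s).strip()
    feats["ngram_logfreq"] = ngram_logfreq(phrase)
    # Part-of-speech: e.g. discourage deleting noun phrases
    if cur.s == "":
        feats["delete_" + pos_of_deleted(cur)] += 1.0
    # Positional: reward capitalization right after stop punctuation
    if prev.s.endswith((".", "!", "?")) and cur.s[:1].isupper():
        feats["capitalized_after_stop"] += 1.0
    # Lineage: which generator spawned the replacement
    feats["generator=" + cur.generator] += 1.0
    return feats
```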

Outline: Model • Inference • Learning • Instantiation • Evaluation • Conclusion

Evaluation Metrics: Compare Parses

[Pipeline: the input sentence goes (a) through a human expert to a gold sentence and (b) through the normalizer to a normalized sentence; both are run through a parser, and the gold parse is compared with the normalized parse.]

Focus on subjects, verbs, and objects (SVO).

Evaluation Metrics: Example

Test: I kinda wanna get ipad NEW
Gold: I kind of want to get a new iPad.

Verbs: test = verb(get); gold = verb(want), verb(get)
 precision_v = 1/1, recall_v = 1/2

Subject-object: test = subj(get,I), subj(get,wanna), obj(get,NEW); gold = subj(want,I), subj(get,I), obj(get,iPad)
 precision_so = 1/3, recall_so = 1/3
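The metric itself is plain set precision/recall over the extracted relations; a small sketch that reproduces the example's numbers:

```python
def precision_recall(test, gold):
    """Set precision/recall over extracted dependency relations."""
    test, gold = set(test), set(gold)
    correct = len(test & gold)
    return correct / len(test), correct / len(gold)

# Verbs: test = {verb(get)}, gold = {verb(want), verb(get)}
print(precision_recall({"verb(get)"}, {"verb(want)", "verb(get)"}))
# -> (1.0, 0.5)   i.e. precision_v = 1/1, recall_v = 1/2

# Subject-object relations
print(precision_recall(
    {"subj(get,I)", "subj(get,wanna)", "obj(get,NEW)"},
    {"subj(want,I)", "subj(get,I)", "obj(get,iPad)"}))
# -> (0.333..., 0.333...)   i.e. precision_so = recall_so = 1/3
```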

Evaluation: Baselines
• w/oN: without normalization
• Google: Google spell checker
• w2wN: word-to-word normalization [Han and Baldwin 2011]
• Gw2wN: gold-standard word-to-word normalizations from previous work (whenever available)

Evaluation: Domains
• Twitter [Han and Baldwin 2011]
 – Gold: grammatical sentences
• SMS [Choudhury et al. 2007]
 – Gold: grammatical sentences
• Call-Center Log: proprietary
 – Text-based responses about users' experience with a call center for a major company
 – Gold: grammatical sentences

Evaluation: Twitter

• Twitter-specific replacement generators
 – Hashtags (#), at-mentions (@), and retweets (RT)
 – Generators that allow either the initial symbol or the entire token to be deleted

Evaluation: Twitter

                 Verb                 Subject-Object
System           Pre   Rec   F1       Pre   Rec   F1
w/oN             83.7  68.1  75.1     31.7  38.6  34.8
Google           88.9  78.8  83.5     36.1  46.3  40.6
w2wN             87.5  81.5  84.4     44.5  58.9  50.7
Gw2wN            89.8  83.8  86.7     46.9  61.0  53.0
generic          91.7  88.9  90.3     53.6  70.2  60.8
domain specific  95.3  88.7  91.9     72.5  76.3  74.4

Domain-specific generators yielded the best overall performance.

Evaluation: Twitter (same results table as above)

Even without domain-specific generators, our system outperformed the word-to-word normalization approaches.

Evaluation: Twitter (same results table as above)

Even perfect word-to-word normalization is not good enough!

Evaluation: SMS

SMS-specific replacement generator:
 – a mapping dictionary of SMS abbreviations

Evaluation: SMS

                 Verb                 Subject-Object
System           Pre   Rec   F1       Pre   Rec   F1
w/oN             76.4  48.1  59.0     19.5  21.5  20.4
Google           85.1  61.6  71.5     22.4  26.2  24.1
w2wN             78.5  61.5  68.9     29.9  36.0  32.6
Gw2wN            87.6  76.6  81.8     38.0  50.6  43.4
generic          86.5  77.4  81.7     35.5  47.7  40.7
domain specific  88.1  75.0  81.0     41.0  49.5  44.8

Evaluation: Call-Center

Call center-specific generator:
 – a mapping dictionary of call-center abbreviations (e.g., "rep." → "representative")

Evaluation: Call-Center

                 Verb                 Subject-Object
System           Pre   Rec   F1       Pre   Rec   F1
w/oN             98.5  97.1  97.8     69.2  66.1  67.6
Google           99.2  97.9  98.5     70.5  67.3  68.8
generic          98.9  97.4  98.1     71.3  67.9  69.6
domain specific  99.2  97.4  98.3     87.9  83.1  85.4

Discussion
• Domain transfer with a small amount of effort is possible
• Performing normalization is indeed beneficial to dependency parsing
 – Simple word-to-word normalization is not enough

Conclusion
• Normalization framework with an eye toward domain adaptation
• Parser-centric view of normalization
• Our system outperformed competitive baselines over three different domains
• Dataset to spur future research
 – https://www.cs.washington.edu/node/9091/

Team
