
Linguistics Methodology meets Language Reality:
The quest for robustness, scalability, and portability in (spoken) language applications

Bob Carpenter
SpeechWorks International

2

The Standard Cliché(s)

• Moore’s Cliché:
  – Exponential growth in computing power and memory will continue to open up new possibilities

• The Internet Cliché:
  – With the advent and growth of the world-wide web, an ever-increasing amount of information must be managed

3

More Standard Clichés

• The Convergence Cliché:
  – Data, voice, and video networking will be integrated over a universal network that:
    • includes land lines and wireless
    • includes broadband and narrowband
    • has IP (Internet Protocol) as its likely implementation

• The Interface Cliché:
  – The three forces above (growth in computing power, information online, and networking) will both enable and require new interfaces
  – Speech will become as common as graphics

4

Some Comp Ling Clichés

• The Standard Linguist’s Cliché
  – “But it must be recognized that the notion ‘probability of a sentence’ is an entirely useless one, under any known interpretation of this term.”
  – Noam Chomsky, 1969 [essay on Quine]

• The Standard Engineer’s Cliché
  – “Anytime a linguist leaves the group the recognition rate goes up.”
  – Fred Jelinek, 1988 [address to DARPA]

5

The “Theoretical Abstraction”

• mature, monolingual, native language speaker
  – idealized to complete knowledge of language

• static, homogeneous language community
  – all speakers learn identical grammars

• “competence” (vs. “performance”)
  – “performance” is a natural class
  – wetware “implementation” follows theory in divorcing “knowledge of language” from processing

• assumes the existence and innateness of a “language faculty”

6

The Explicit Methodology

• “Empirical” basis is binary grammaticality judgements
  – “intuitive” (to a “properly” trained linguist)
  – innateness and the “language faculty”
  – appropriate for phonetics through dialogue
  – in practice, very little agreement at the boundaries and no standard evaluations of theories vs. data

• Models of particular languages
  – by grammars that generate formal languages
  – low priority for transformationalists
  – high priority for monostratalists/computationalists

7

The Holy Grail of Linguistics

• A grammar meta-formalism in which
  – all and only natural language grammars (idealized as above) can be expressed
  – assumed to correspond to the “language faculty”

• The Grail is sought by every major camp of linguists
  – Explains why all major linguistic theories look alike from any perspective outside of a linguistics department
  – The expedient abstractions have become an end in themselves

8

But, Applications Require

• Robustness
  – acoustic and linguistic variation
  – disfluencies and noise

• Scalability
  – from embedded devices to palmtops to clients to servers
  – across tasks from simple to complex
  – from system-initiative form-filling to mixed-initiative dialogue

• Portability
  – simple adaptation to new tasks and new domains
  – preferably automated as much as possible

9

The $64,000 Question

• How do humans handle unrestricted language so effortlessly in real time?

• Unfortunately, the “classical” linguistic assumptions and methodology completely ignore this issue

• Psycholinguistics has uncovered some baselines:
  – lexicon (and syntax?): highly parallel
  – time course of processing: totally online
  – information integration: <= 200 ms for all sources

• But it is short on explanations

10

(AI) Success by Stupidity

• Jaime Carbonell’s Argument (ECAI, mid-1990s)
• Systems appear “intelligent” because they’re too limited to do anything wrong: the “right” answer is hardcoded

• Typical in computational NL grammars
  – lexicon limited to the demo
  – rules limited to common ones (e.g., no heavy shift)

• Scaling up usually destroys this limited “success”
  – 1,000,000s of “grammatical” readings with large grammars

11

My Favorite Experiments (I)

• Mike Tanenhaus et al. (Univ. Rochester)
• Head-mounted eye tracking

[Figure: subject hears “Pick up the yellow plate”; eye movements track semantic resolution, with ~200 ms tracking time]

• Clearly shows that understanding is online

12

My Favorite Experiments (II)

• Garden Paths are Context Sensitive
  – Crain & Steedman (U. Connecticut & U. Edinburgh)
  – if the noun is not unique in context, postmodification is much more likely than if the noun picks out a unique individual

• Garden Paths are Frequency and Agreement Sensitive
  – Tanenhaus et al.
  – The horse raced past the barn fell. (raced is likely a past tense)
  – The horses brought into the barn fell. (brought is likely a participle, and a less likely activity for horses)

13

Stats: Explanation or Stopgap

• A Common View
  – Statistics are some kind of approximation of underlying factors requiring further explanation.

• Steve Abney’s Analogy (AT&T Labs)
  – Statistical queueing theory
  – Consider traffic flows through a toll gate on a highway.
  – Underlying factors are diverse, and explain the actions of each driver, their cars, possible causes of flat tires, drunk drivers, etc.
  – Statistics is more insightful [explanatory] in this case as it captures emergent generalizations
  – It is a reductionist error to insist on a low-level account

14

Competence vs. Performance

• What is computed vs. how it is computed
• The what can be traditional grammatical structure
• Not all structures are computed, regardless of the how
• Define the what probabilistically, independently of the how

15

Algebraic vs. Statistical

• A False Dichotomy
  – All statistical systems have an algebraic basis, even if trivial

• The Good News:
  – The best statistical systems have the best linguistic conditioning (most “explanatory” in the traditional sense)
  – Statistical estimators are far less significant than the appropriate linguistic conditioning
  – The rest of the talk provides examples of this

16

Bayesian Statistical Modeling

• Concerned with prior and posterior probabilities
• Allows updates of reasoning
• Bayes’ Law: P(A,B) = P(A|B) P(B) = P(B|A) P(A)
• E.g., the source/channel model for speech recognition (toy sketch below)
  – Ws: sequence of words
  – As: sequence of acoustic observations
  – Compute ArgMax_Ws P(Ws|As)

    ArgMax_Ws P(Ws|As)
      = ArgMax_Ws P(As|Ws) P(Ws) / P(As)
      = ArgMax_Ws P(As|Ws) P(Ws)

  P(As|Ws): acoustic model    P(Ws): language model
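As a toy illustration (not from the talk), a minimal sketch of the source/channel decision rule in Python: pick the word string whose acoustic score times language model score is highest. The hypotheses and the probability values are invented for illustration only.

# Toy sketch of the source/channel decision rule: choose the word string Ws
# maximizing P(As|Ws) * P(Ws).  The scores below are invented; a real
# recognizer computes them from acoustic frames and a trained language model.

hypotheses = {
    # word string: (P(As|Ws), P(Ws)) -- made-up numbers
    "flights from boston today": (0.020, 1e-3),
    "flights from austin today": (0.018, 8e-4),
    "lights for boston to pay":  (0.025, 1e-5),
}

def decode(hyps):
    """Return the word string maximizing acoustic score * language model score."""
    return max(hyps, key=lambda ws: hyps[ws][0] * hyps[ws][1])

print(decode(hypotheses))   # -> 'flights from boston today'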

17

Simple Bayesian Update Example

• Monty Hall’s Let’s Make a Deal
• Three curtains with a prize behind one, no other info
• The contestant chooses one of the three
• Monty then opens the curtain of one of the others that does not have the prize
  – if you choose curtain 2, then one of curtain 1 or 3 must not contain the prize
• Monty then lets you either keep your first guess, or change to the remaining curtain he didn’t open
• Should you switch, stay, or doesn’t it matter?

18

Answer

• Yes! You should switch.
• Why? Consider the possibilities (a simulation sketch follows the tables):

  Stay (P(win) = 1/3):

    prize behind \ you select    1      2      3
    1                            win    lose   lose
    2                            lose   win    lose
    3                            lose   lose   win

  Switch (P(win) = 2/3):

    prize behind \ you select    1      2      3
    1                            lose   win    win
    2                            win    lose   win
    3                            win    win    lose
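As a quick check (my addition, not on the slide), a short simulation reproduces the 1/3 vs. 2/3 split; the random setup of prize and first guess is just one way to code it.

# Monte Carlo check of the Bayesian update argument: switching wins about
# 2/3 of the time, staying about 1/3.
import random

def play(switch, trials=100_000):
    wins = 0
    for _ in range(trials):
        prize = random.randrange(3)
        guess = random.randrange(3)
        # Monty opens a curtain that is neither the guess nor the prize
        opened = next(c for c in range(3) if c != guess and c != prize)
        if switch:
            guess = next(c for c in range(3) if c != guess and c != opened)
        wins += (guess == prize)
    return wins / trials

print("stay  :", play(switch=False))   # ~0.33
print("switch:", play(switch=True))    # ~0.67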

19

Defaults via Bayesian Inference

• Bayesian inference provides an explanation for the “rationality” of default reasoning

• Reason by choosing an action to maximize expected payoff given some knowledge
  – ArgMax_Action Payoff(Action) * P(Action|Knowledge)

• Given additional information, update to Knowledge’
  – ArgMax_Action Payoff(Action) * P(Action|Knowledge’)
  – The chosen action may be different, as in Let’s Make a Deal (sketch below)

• Inferences are not logically sound, but are “rational”
• The Bayesian framework integrates the partiality and uncertainty of background knowledge
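A minimal sketch of that decision rule, tied to the Let's Make a Deal example. I read the slide's P(Action|Knowledge) as the probability of success given the action and the current knowledge; the payoff values and probabilities are my own illustrative assumptions.

# Sketch of ArgMax_Action Payoff(Action) * P(win | Action, Knowledge).
# Payoffs and probabilities are illustrative assumptions, not from the talk.

def best_action(payoff, p_win):
    """Return the action with maximal expected payoff."""
    return max(payoff, key=lambda a: payoff[a] * p_win[a])

payoff = {"stay": 1.0, "switch": 1.0}   # the prize is worth the same either way

# Before Monty opens a curtain, both actions look equally good.
before = {"stay": 1/3, "switch": 1/3}
# After he opens an empty curtain (the updated Knowledge'), switching is better.
after  = {"stay": 1/3, "switch": 2/3}

print(best_action(payoff, before))   # 'stay' (a tie; either action is rational)
print(best_action(payoff, after))    # -> 'switch'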

20

Example: Allophonic Variation

• English pronunciation (M. Riley & A. Ljolje, AT&T)
• Derived from TIMIT with phoneme/phone labels
  – orthographic: bottle
  – phonological: / b aa t ax l /  (ARPAbet phonemes)
  – phonetic: 0.75 [ b aa dx el ]  (TIMITbet phones)
              0.13 [ b aa t el ]
              0.10 [ b aa dx ax l ]
              0.02 [ b aa t ax l ]

• Allophonic variation is non-deterministic

21

Eg: Allophonic Variation (cont’d)

• Simple statistical model (simplified, without insertion)
• Estimate the probability of phones given phonemes:

    P(a1,…,aM | p1,…,pM)
      = P(a1 | p1,…,pM) * P(a2 | p1,…,pM, a1) * … * P(aM | p1,…,pM, a1,…,aM-1)

• Approximate the phoneme context to +/- K phones
• Approximate the phone history to 0 or 1 phones
  – 0: … P(aJ | pJ-K,…,pJ,…,pJ+K) …
  – 1: … P(aJ | pJ-K,…,pJ,…,pJ+K, aJ-1) …

• Uses a word boundary marker and stress (toy sketch below)
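A minimal sketch of the order-1 approximation, estimated by counting over aligned phoneme/phone strings. The alignment format, the "-" deletion symbol, and the tiny training pairs are my own illustrative assumptions; the real model was trained on TIMIT and smoothed with decision trees.

# Sketch of estimating P(a_j | p_{j-K}..p_{j+K}, a_{j-1}) by counting over
# aligned phoneme/phone strings.  K = 1, and '-' marks a deleted phone so the
# aligned strings stay the same length (the "no insertion" simplification).
from collections import Counter, defaultdict

K = 1  # phoneme context window

def contexts(phonemes, phones):
    """Yield ((phoneme window, previous phone), phone) pairs."""
    padded = ["#"] * K + list(phonemes) + ["#"] * K
    prev = "<s>"
    for j, phone in enumerate(phones):
        yield (tuple(padded[j:j + 2 * K + 1]), prev), phone
        prev = phone

counts = defaultdict(Counter)

# Tiny invented aligned training pairs (phonemes, phones).
training = [
    ("b aa t ax l".split(), "b aa dx - el".split()),
    ("b aa t ax l".split(), "b aa t - el".split()),
    ("b aa t ax l".split(), "b aa dx - el".split()),
]
for phonemes, phones in training:
    for ctx, phone in contexts(phonemes, phones):
        counts[ctx][phone] += 1

def p_phone(phone, ctx):
    total = sum(counts[ctx].values())
    return counts[ctx][phone] / total if total else 0.0

ctx = (("aa", "t", "ax"), "aa")   # /t/ with neighbors aa _ ax, after phone [aa]
print(p_phone("dx", ctx))         # -> 0.666...  (flapping is most likely here)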

22

Eg: Allophonic Variation (concl’d)

• Cluster phonological features using decision trees
• Sparse data is smoothed by decision trees over standard features (+/- stop, voicing, aspiration, etc.)
• Conditional entropy: 1.5 bits without context, 0.8 bits with it
• The most likely allophone is correct 85.5% of the time; the correct one is in the top 5, 99% of the time
• An average of 17 pronunciations/word is needed to reach 95%
• Robust: handles multiple pronunciations
• Scalable: to the whole of English pronunciation
• Portable: easy to move to new dialects with training
  – K. Knight (ISI): similar techniques for Japanese pronunciation of English words!

23

Example: Co-articulation

• HMMs have been applied to speech since the mid-70s
• Two major recent improvements, the first being simply more training data and cycles
• The second is context-dependent triphones
• Instead of one HMM per phoneme/phone, use one per context-dependent triphone (sketch below)
  – example: t-r+u ‘an r preceded by t and followed by u’
  – crucially clustered by phonological features to overcome sparsity
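A minimal sketch of the relabeling step only; the padding symbol and the ARPAbet-style phone names are my own assumptions, and the feature-based clustering is not shown.

# Sketch of context-dependent triphone labeling: each phone is renamed with
# its left and right neighbors, e.g. the 'r' in "t r uw" becomes 't-r+uw'.
# Clustering the resulting triphones by phonological features is omitted.

def triphones(phones, pad="sil"):
    padded = [pad] + list(phones) + [pad]
    return [f"{l}-{c}+{r}" for l, c, r in zip(padded, padded[1:], padded[2:])]

print(triphones("t r uw".split()))
# -> ['sil-t+r', 't-r+uw', 'r-uw+sil']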

24

Exploratory Data Analysis

(Trendier: data mining; Trendiest: information harvesting)

• Specious argument: a statistical model won’t help explain linguistic processes.

• Counter 1: Abney’s anti-reductionist argument
• But even if you don’t believe that:
• Counter 2: In “other sciences” (pace the linguistic tradition), statistics is used to discover regularities

• Allophone example: “had your” pronunciation
  – / d / is 51% likely to be realized as [ jh ], 37% as [ d ]
  – if / d / is realized as [ jh ], / y / deletes 84% of the time
  – if / d / is realized as [ d ], / y / deletes 10% of the time

25

Balancing Gricean Maxims

• Grice gives us conflicting maxims:
  – quantity (be exactly as informative as required)
  – quality (try to make your contribution true)
  – manner (be perspicuous; e.g., avoid ambiguity, be brief)

• Manner pulls in opposite directions
  – quality without ambiguity lengthens statements
  – quantity and (part of) manner require brevity

• Balance by estimating a multidimensional “goodness” metric for generation

26

Gricean Balance (cont’d)

• Consider the problem of aggregation in generation:
  – Every student ran slowly or every student walked quickly.
    aggregates to:
  – Every student ran slowly or walked quickly.

• This reduces sentence length, shortens clause length, and increases ambiguity.
• These tradeoffs need to be balanced (toy scoring sketch below)
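A toy sketch of what such a multidimensional goodness metric might look like; the feature names, feature values, and weights are my own illustrative assumptions, not anything from the talk.

# Toy sketch of a weighted "goodness" score over candidate realizations,
# trading off length against scope ambiguity.  All values are invented.

def goodness(features, weights):
    return sum(weights[name] * value for name, value in features.items())

candidates = {
    "Every student ran slowly or every student walked quickly.":
        {"words": 9, "scope_ambiguity": 0},
    "Every student ran slowly or walked quickly.":
        {"words": 7, "scope_ambiguity": 1},
}

# Penalize both length and ambiguity; these weights favor the unambiguous form.
weights = {"words": -0.5, "scope_ambiguity": -2.0}

best = max(candidates, key=lambda c: goodness(candidates[c], weights))
print(best)   # -> the longer, unambiguous realization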

27

Collins’ Head/Dependency Parser

• Michael Collins, 1998 UPenn PhD thesis
• Parses the WSJ with ~90% constituent precision/recall
• A generative model of tree probabilities
• Clever linguistic decomposition and training (toy sketch below):
  – P(RootCat, HeadTag, HeadWord)
  – P(DaughterCat | MotherCat, HeadTag, HeadWord)
  – P(SubCat | MotherCat, DtrCat, HeadTag, HeadWord)
  – P(ModifierCat, ModifierTag, ModifierWord | SubCat, MotherCat, DaughterCat, HeadTag, HeadWord, Distance)
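To make the decomposition concrete, here is a toy computation of a tree's probability as a product of conditional factors of roughly this shape; the factor names and numbers are invented stand-ins, not Collins' trained parameters.

# Toy sketch: a generative head-driven model scores a tree as the product of
# conditional factors like those listed above.  Values are invented.
from math import exp, log

factors = {
    "P(RootCat=S, HeadTag=VBD, HeadWord=rose)":               1e-3,
    "P(DaughterCat=NP | S, VBD, rose)":                       0.4,
    "P(SubCat={} | S, NP, VBD, rose)":                        0.6,
    "P(Modifier=ADVP, RB, sharply | ..., Distance=adjacent)": 0.1,
    "P(STOP | ...)":                                          0.8,
}

log_prob = sum(log(p) for p in factors.values())
print("tree log-prob:", round(log_prob, 3), " prob:", exp(log_prob))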

28

Eg: Collins’ Parser (cont’d)

• Distance encodes heaviness
• Adjunct vs. complement modifiers are distinguished
• Head words and tags model lexical variation and word-word attachment preferences
• Also conditions on punctuation, coordination, and UDCs
• 12,000-word vocabulary plus an unknown-word attachment model (by Collins) and tag model (by A. Ratnaparkhi, another 1998 UPenn thesis)
• Smoothed by backing off from words to categories
• Trivial statistical estimators; the power is in the conditioning

29

Computational Complexity

• Wide-coverage linguistic grammars generate millions of readings
• But Collins’ parser runs faster than real time on a notebook on unseen sentences of up to 100 words
• How? Pruning (sketch below).
• Collins found that tighter statistical estimates of tree likelihoods, from more features and more complex grammars, ran faster because a tighter beam could be used
  – (E. Charniak & S. Caraballo at Brown have really pushed the envelope here)
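A minimal sketch of beam pruning over chart edges; the edge labels, scores, and beam width are invented, and a real parser applies this per span while filling the chart.

# Sketch of beam pruning: keep only the analyses whose probability is within
# a fixed factor (the beam) of the best analysis for the same span.

def beam_prune(edges, beam=1e-3):
    """edges: list of (label, prob) for one span; keep those inside the beam."""
    best = max(prob for _, prob in edges)
    return [(label, prob) for label, prob in edges if prob >= best * beam]

edges = [("S", 2e-6), ("VP", 1.5e-6), ("NP", 3e-9), ("FRAG", 4e-12)]
print(beam_prune(edges))
# -> keeps S, VP, and (just barely) NP; FRAG falls outside the beam.
# A tighter beam (e.g. 1e-2) would also drop NP, trading accuracy for speed.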

30

Complexity (cont’d)

• Collins’ parser is not complete in the usual sense
• But neither are humans (e.g., garden paths)
• Speed can be traded for accuracy in statistical parsers
• Syntax is not processed autonomously
  – Humans can’t parse without context, semantics, etc.
  – Even phone or phoneme detection is very challenging, especially in a noisy environment
  – Top-down expectations and knowledge of likely bottom-up combinations prune the vast search space online
  – The question is how to combine it with other factors

31

N-best and Word Graphs

• Speech recognizers can return n-best histories
  – flights from Boston today
  – flights from Austin today
  – flights for Boston to pay
  – lights for Boston to pay

• They can also return a packed word graph of histories; the arc log probs along each path sum to the joint acoustics/word-string log prob (toy sketch below)

[Figure: packed word graph over the hypotheses above, with arcs labeled flights/lights, from/for, Boston/Austin, and today/to pay]
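A minimal sketch of such a packed word graph as a DAG with log probabilities on the arcs, plus a Viterbi-style best-path search; the states, arcs, and scores are invented to mirror the example above.

# Sketch of a packed word graph: a DAG whose arcs carry words and log probs,
# so alternative hypotheses share structure.  Arcs and scores are invented.

# (from_state, to_state, word, log_prob); states are topologically numbered.
arcs = [
    (0, 1, "flights", -0.2), (0, 1, "lights", -1.8),
    (1, 2, "from",    -0.3), (1, 2, "for",    -1.5),
    (2, 3, "Boston",  -0.4), (2, 3, "Austin", -1.2),
    (3, 4, "today",   -0.5), (3, 4, "to pay", -2.0),
]

def best_path(arcs, start=0, final=4):
    """Viterbi-style best path through the lattice."""
    best = {start: (0.0, [])}
    for src, dst, word, lp in sorted(arcs):      # process in state order
        if src in best:
            score = best[src][0] + lp
            if dst not in best or score > best[dst][0]:
                best[dst] = (score, best[src][1] + [word])
    return best[final]

print(best_path(arcs))
# -> (-1.4, ['flights', 'from', 'Boston', 'today'])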

32

Probabilistic Graph Processing

• The architecture we’re exploring in the context of spoken dialogue systems involves:
  – speech recognizers that produce probabilistic word graph output
  – a tagger that transforms a word graph into a word/tag graph with scores given by joint probabilities
  – a parser that transforms a word/tag graph into a graph-based chart (as in CKY or chart parsing)

• Allows each module to rescore the output of the previous module’s decision (toy pipeline sketch below)

• Apply this architecture to speech act detection, dialogue act selection, and generation

33

Prices rose sharply after hours
(15-best output shown as a word/tag graph, after minimization)

[Figure: minimized word/tag graph whose arcs carry word:tag pairs, e.g. prices:NNS vs. prices:NN, rose:VBD vs. rose:VBP vs. rose:NN vs. rose:NNP, sharply:RB, after:IN vs. after:RB, hours:NNS]

34

Challenge: Beat n-grams

• Backed-off trigram models estimated from 300M words of WSJ provide the best language models (toy backoff sketch below)

• We know there is more to language than two words of history

• The challenge is to find out how to model it.
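For concreteness, a minimal backed-off trigram sketch; it uses a crude fixed backoff penalty rather than the Katz-style discounting such models actually used, and the tiny training data is invented.

# Minimal sketch of a backed-off trigram model with a fixed backoff penalty.
# Real models of the time were trained on ~300M words of WSJ text.
from collections import Counter

ALPHA = 0.4  # fixed backoff penalty

def train(sentences):
    uni, bi, tri = Counter(), Counter(), Counter()
    for s in sentences:
        toks = ["<s>", "<s>"] + s.split() + ["</s>"]
        for i in range(2, len(toks)):
            uni[toks[i]] += 1
            bi[(toks[i - 1], toks[i])] += 1
            tri[(toks[i - 2], toks[i - 1], toks[i])] += 1
    return uni, bi, tri

def prob(w, u, v, uni, bi, tri):
    """P(w | u, v): trigram if seen, else back off to bigram, then unigram."""
    if tri[(u, v, w)]:
        return tri[(u, v, w)] / bi[(u, v)]
    if bi[(v, w)] and uni[v]:
        return ALPHA * bi[(v, w)] / uni[v]
    return ALPHA * ALPHA * uni[w] / sum(uni.values())

uni, bi, tri = train(["prices rose sharply", "prices fell sharply after hours"])
print(prob("sharply", "prices", "rose", uni, bi, tri))  # seen trigram -> 1.0
print(prob("after", "rose", "sharply", uni, bi, tri))   # unseen trigram -> backs off to P(after|sharply): 0.4 * 1/2 = 0.2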

35

Conclusions

• Applications need a ranking of hypotheses
• A beam can reduce processing time to linear
  – good statistics are needed to do this

• More linguistic features are better for statistical models
  – the relevant ones and their weights can be induced from data
  – linguistic rules emerge from these generalizations

• Using acoustic / word / tag / syntax graphs allows the propagation of uncertainty
  – the ideal is totally online (the model is compatible with this)
  – approximation allows simpler modules to do the first pruning

36

Plugs

Run, don’t walk, to read:

• Steve Abney. 1996. Statistical methods and linguistics. In J. L. Klavans and P. Resnik, eds., The Balancing Act. MIT Press.
• Mark Seidenberg and Maryellen MacDonald. 1999. A probabilistic constraints approach to language acquisition and processing. Cognitive Science.
• Dan Jurafsky and James H. Martin. 2000. Speech and Language Processing. Prentice-Hall.
• Chris Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press.