Computational Linguistics

Date post: 11-Jan-2016
Upload: glain
Page 1: Computational Linguistics

Computational Linguistics

What is it and what (if any) are its

unifying themes?

Page 2: Computational Linguistics

2

Computational linguistics

Page 3: Computational Linguistics

3

I often agree with XKCD…

Page 4: Computational Linguistics

4

[Figure: fields placed along a scale from “more rigorous” to “less rigorous” (“more flakey”). Labels shown: physics, chemistry, biology, psychology, neuropsychology, computational linguistics, literary criticism, and “linguistics?”.]

Page 5: Computational Linguistics

5

What defines the rigor of a field?

• Whether results are reproducible

• Whether theories are testable/falsifiable

• Whether there is a common set of methods for similar problems

• Whether approaches to problems can yield interesting new questions/answers

Page 6: Computational Linguistics

6

Linguistics

Page 7: Computational Linguistics

7

[Figure: scale from “more rigorous” to “less rigorous”. Labels shown: engineering, sociology, linguistics, literary criticism.]

Page 8: Computational Linguistics

8

The true situation with linguistics

[Figure: subfields of linguistics placed along a scale from “more rigorous” to “less rigorous”. Labels shown: psycholinguistics; experimental phonetics; historical linguistics; “theoretical” linguistics (e.g. lexical-functional grammar); some areas of sociolinguistics (e.g. Bill Labov); “theoretical” linguistics (e.g. minimalist syntax); other areas of sociolinguistics (e.g. Deborah Tannen).]

Page 9: Computational Linguistics

9

Okay, enough already. What is computational linguistics?

• Text normalization/segmentation

• Morphological analysis

• Automatic word pronunciation prediction

• Transliteration

• Word-class prediction: e.g. part of speech tagging

• Parsing

• Semantic role labeling

• Machine translation

• Dialog systems

• Topic detection

• Summarization

• Text retrieval

• Bioinformatics

• Language modeling for automatic speech recognition

• Computer-aided language learning (CALL)

Page 10: Computational Linguistics

10

Computational linguistics

• Often thought of as natural language engineering

• But there is also a serious scientific component to it.

Page 11: Computational Linguistics

11

Why CL may seem ad hoc

• Wide variety of areas (as in linguistics)

• If it’s natural language engineering, the goal is often just to build something that works

• Techniques tend to change in somewhat faddish ways…
  – For example: machine learning approaches fall in and out of favor

Page 16: Computational Linguistics

16

Machine learning in CL

• In general it’s a plus since it has meant that evaluation has become more rigorous

• But it’s important that the field not turn into applied machine learning

• For this to be avoided, people need to continue to focus on what linguistic features are important

• Fortunately, this seems to be happening

Page 17: Computational Linguistics

17

Some interesting themes…

• Finite-state methods:
  – Many application areas
  – Raises interesting questions about how much of language is “regular” (in the sense of “finite state”)

• Grammar induction:
  – Linguists have done a poor job at their stated goal of explaining how humans learn grammar

• Computational models of language change:
  – Historical evidence for language change is only partial. There are many changes in language for which we have no direct evidence.

Page 18: Computational Linguistics

18

Finite state methods

• Used from the 1950’s onwards

• Went out of fashion a bit during the 1980’s

• Then a revival in the 1990’s with the advent of weighted finite-state methods

Page 19: Computational Linguistics

19

Some applications

• Analysis of word structure – morphology

• Analysis of sentence structure
  – Part of speech tagging
  – Parsing

• Speech recognition

• Text normalization

• Computational biology

• …

Page 20: Computational Linguistics

20

Regular languages

• A regular language is a language over a finite alphabet that can be built up from finite sets of strings by applying the following operations:
  – Set union
  – Concatenation
  – Kleene star (closure under repeated concatenation, including zero repetitions)
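The three operations can be tried out on small finite languages. The following is a minimal Python sketch, not something from the slides: languages are plain sets of strings, and the Kleene star is truncated at a hypothetical `max_reps` bound because the full closure is infinite.

```python
def union(a, b):
    """Set union of two languages."""
    return a | b


def concat(a, b):
    """Every string of a followed by every string of b."""
    return {x + y for x in a for y in b}


def star(a, max_reps):
    """Kleene star of a, truncated to at most max_reps repetitions (an
    assumption of this sketch; the real closure has no such bound)."""
    result = {""}   # zero repetitions gives the empty string
    layer = {""}
    for _ in range(max_reps):
        layer = concat(layer, a)
        result |= layer
    return result


# Strings of (a | bc)* that use at most two repetitions
small = star(union({"a"}, {"bc"}), 2)
```

Here `small` contains the empty string, `a`, `bc`, and the four two-part combinations `aa`, `abc`, `bca` and `bcbc`.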

Page 21: Computational Linguistics

21

Finite state automata: formal definition

Every regular language can be recognized by a finite-state automaton.

Every finite-state automaton recognizes a regular language. (Kleene’s theorem)
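As a small illustration of the correspondence, the sketch below runs a deterministic automaton encoded as a transition table and compares it with the equivalent regular expression. The example language (ab)* and the names `accepts` and `AB_STAR` are assumptions made for this illustration; the slide's own formal definition is not reproduced in the transcript.

```python
import re


def accepts(transitions, start, finals, s):
    """Run a deterministic finite-state automaton over the string s."""
    state = start
    for ch in s:
        state = transitions.get((state, ch))
        if state is None:      # no transition defined: reject
            return False
    return state in finals


# Hypothetical two-state automaton for the language (ab)*
AB_STAR = {(0, "a"): 1, (1, "b"): 0}

# The automaton and the regular expression agree on these strings
agreement = all(
    accepts(AB_STAR, 0, {0}, s) == bool(re.fullmatch(r"(ab)*", s))
    for s in ["", "ab", "abab", "a", "aba", "ba", "abb"]
)
```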

Page 22: Computational Linguistics

22

Representation of FSA’s: State Diagram

Page 23: Computational Linguistics

23

Regular relations: formal definition

Page 24: Computational Linguistics

24

Finite-state transducers

Page 25: Computational Linguistics

25

An FST

Page 26: Computational Linguistics

26

Composition

• In addition to union, concatenation and Kleene closure, regular relations are closed under composition

• Composition is to be understood here the same way as composition in algebra:
  – R1○R2 means take the output of R1 and feed it to the input of R2
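For finite relations, composition can be sketched directly on sets of (input, output) pairs. This is only an illustration in Python: real finite-state toolkits compose transducers state by state, and the relations `R1` and `R2` below are made-up examples, not the ones pictured on the following slides.

```python
def compose(r1, r2):
    """R1 ○ R2: feed every output of r1 into r2 as its input."""
    return {(x, z) for (x, y1) in r1 for (y2, z) in r2 if y1 == y2}


# Hypothetical relations over short strings
R1 = {("a", "b"), ("aa", "bb")}
R2 = {("b", "c"), ("bb", "cc"), ("d", "e")}

composed = compose(R1, R2)
```

The pair ("d", "e") in `R2` contributes nothing to the result, because no output of `R1` matches its input.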

Page 27: Computational Linguistics

27

Composition: an illustration

Page 28: Computational Linguistics

28

R1 as a transducer

Page 29: Computational Linguistics

29

R2 as a transducer

Page 30: Computational Linguistics

30

R1○R2

Page 31: Computational Linguistics

31

Some things you can do with FSTs

• Text analysis/normalization
  – Word segmentation
  – Abbreviation expansion
  – Digit-to-number-name mappings
  i.e. mapping from writing to language

• Morphological analysis

• Syntactic analysis
  – E.g. part-of-speech tagging

• (With weights) pronunciation modeling and language modeling for speech recognition

Page 32: Computational Linguistics

32

That’s fine for engineering but…

• Does it really account for the facts?
  – Is morphology really regular?
  – Is the mapping between writing and speech really regular?

Page 33: Computational Linguistics

33

What is morphology?

• scripsērunt is third person, plural, perfect, active of scrībō (‘I write’)

• Morphology relates word forms
  – the “lemma” of scripsērunt is scrībō

• Morphology analyzes the structure of word forms
  – scripsērunt has the structure scrīb+s+ērunt

Page 34: Computational Linguistics

34

Morphology is a relation

• Imagine you have a Latin morphological analyzer comprising:
  – D: a relation that maps between surface form and decomposed form
  – L: a relation that maps between decomposed form and lemma

• Then:
  – scripsērunt ○ D = scrīb+s+ērunt
  – scripsērunt ○ D ○ L = scrībō
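The two equations can be mimicked with one-entry relations. This is a toy Python sketch: a real analyzer would encode D and L as transducers covering the whole lexicon, whereas the sets below hold only the single form from the slide, and the helper name `apply_relation` is an assumption.

```python
def apply_relation(forms, relation):
    """Map a set of input strings through a relation given as (input, output) pairs."""
    return {out for (inp, out) in relation if inp in forms}


# One-entry stand-ins for the relations D and L described above
D = {("scripsērunt", "scrīb+s+ērunt")}   # surface form -> decomposed form
L = {("scrīb+s+ērunt", "scrībō")}        # decomposed form -> lemma

decomposed = apply_relation({"scripsērunt"}, D)
lemma = apply_relation(decomposed, L)
```

Working with sets of strings rather than single strings leaves room for ambiguous forms, which a relation (unlike a function) can map to several analyses.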

Page 35: Computational Linguistics

35

English regular plurals

• cat + s = cats /s/

• dog + s = dogs /z/

• spouse + s = spouses /əz/

• This can be implemented by a rule that composes with the base word, inserting the relevant form of the affix at the end
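A rough sketch of the allomorph rule in Python, written as an ordinary function rather than a compiled transducer. The phoneme classes are a simplified textbook approximation, and the phonemic transcriptions of cat, dog and spouse are assumptions of this example, not taken from the slides.

```python
# Simplified phoneme classes (an assumption of this sketch)
SIBILANTS = {"s", "z", "ʃ", "ʒ", "tʃ", "dʒ"}
VOICELESS = {"p", "t", "k", "f", "θ"}


def plural(phonemes):
    """Append the form of the plural affix selected by the base's final sound."""
    last = phonemes[-1]
    if last in SIBILANTS:
        suffix = "əz"      # spouse -> spouses
    elif last in VOICELESS:
        suffix = "s"       # cat -> cats
    else:
        suffix = "z"       # dog -> dogs
    return phonemes + [suffix]


cats = plural(["k", "æ", "t"])
dogs = plural(["d", "ɒ", "g"])
spouses = plural(["s", "p", "aʊ", "s"])
```

The sibilant test has to come first, since /s/ would otherwise never reach the /əz/ branch if it were also listed as voiceless.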

Page 36: Computational Linguistics

36

Templatic affixes in Yowlumne

Transducer for each affix transforms base into required templatic form and appends the relevant string.

Page 37: Computational Linguistics

37

Subtractive morphology

Transducer deletes final VC of the base…

Page 38: Computational Linguistics

38

Bontoc infixation

• Insert a marker “>” after the first consonant (if any)

• Change “>” into the infix –um-
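The two steps can be sketched as two string rewrites applied in sequence, which mirrors composing two transducers. This Python sketch makes several assumptions: "first consonant" is read as a word-initial consonant, vowels are taken to be a, e, i, o, u, and the test word fikas ("strong") with its derived form fumikas is the usual textbook example of Bontoc infixation rather than something given in the slides.

```python
import re

VOWELS = "aeiou"   # assumed vowel inventory for this sketch


def mark(base):
    """Step 1: insert the marker ">" after a word-initial consonant, if there is one."""
    return re.sub(rf"^([^{VOWELS}]?)", r"\1>", base, count=1)


def spell_out(marked):
    """Step 2: rewrite the marker as the infix -um-."""
    return marked.replace(">", "um")


def infix(base):
    """Apply step 1, then step 2."""
    return spell_out(mark(base))
```

For example, `mark("fikas")` gives `f>ikas` and `infix("fikas")` gives `fumikas`; a vowel-initial base simply receives the marker at its left edge.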

Page 39: Computational Linguistics

39

Side note … infixation in English

Kalamazoo

f*****g

Page 40: Computational Linguistics

40

Reduplication: Gothic

Problem: mapping w to ww is not a regular relation

Page 41: Computational Linguistics

41

Factoring Reduplication

• Prosodic constraints

• Copy verification transducer C

Page 42: Computational Linguistics

42

Non-Exact Copies

• Dakota (Inkelas & Zoll, 1999):

Page 43: Computational Linguistics

43

Non-Exact Copies

• Basic and modified stems in Sye (Inkelas & Zoll, 1999):

“they will fall all over”

Page 44: Computational Linguistics

44

Morphological Doubling Theory (Inkelas & Zoll, 1999)

• Most linguistic accounts of reduplication assume that the copying is done as part of morphology

• In MDT:
  – Reduplication involves doubling at the morphosyntactic level – i.e. one is actually simply repeating words or morphemes
  – Phonological doubling is thus expected, but not required

Page 45: Computational Linguistics

45

Gothic Reduplication under Morphological Doubling Theory

Page 46: Computational Linguistics

46

Summary

• If Inkelas & Zoll are right then all morphology can be computed using regular relations

• This in turn suggests that computational morphology has picked the right tool for the job

Page 47: Computational Linguistics

47

Another Example: Linguistic analysis of text

• Maps from the stuff you see on the page – e.g. text written in the standard orthography of a language – into linguistic units (words, morphemes, phonemes…)

• For example:
  – I ate a 25kg bass
  – [aɪ ɛɪt ə twɛnti faɪv kɪləgræm bæs]

• This can be done using transducers
  – But is the mapping between writing and language really regular (finite-state)?

Page 48: Computational Linguistics

48

Linguistic analysis of text

• Abbreviation expansion

• Disambiguation

• Number expansion

• Morphological analysis of words

• Word pronunciation

• …

Page 49: Computational Linguistics

49

A transducer for number names

Consider a machine that maps between digit strings and their reading as number names in English.

30,294,005,179,018,903.56 → thirty quadrillion, two hundred and ninety four trillion, five billion, one hundred seventy nine million, eighteen thousand, nine hundred three, point five six
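The mapping can be approximated in a few lines of Python. This is a plain function, not a finite-state transducer, and it only illustrates the input/output behaviour: it handles one direction (digits to words), supports values up to the quadrillions, and never inserts "and", so its reading of the example differs from the slide's in that one word. The names `group_name` and `number_name` are assumptions of this sketch.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
SCALES = ["", "thousand", "million", "billion", "trillion", "quadrillion"]


def group_name(n):
    """Name a three-digit group, 0 < n < 1000."""
    words = []
    hundreds, rest = divmod(n, 100)
    if hundreds:
        words += [ONES[hundreds], "hundred"]
    if rest >= 20:
        tens, ones = divmod(rest, 10)
        words.append(TENS[tens])
        if ones:
            words.append(ONES[ones])
    elif rest:
        words.append(ONES[rest])
    return " ".join(words)


def number_name(digits):
    """Read a digit string such as '18,903.56' as an English number name."""
    whole, _, frac = digits.replace(",", "").partition(".")
    n = int(whole)
    parts = []
    scale = 0
    while n:
        n, group = divmod(n, 1000)
        if group:   # all-zero groups such as the "000" in 1,000 are skipped
            parts.append((group_name(group) + " " + SCALES[scale]).strip())
        scale += 1
    parts.reverse()
    if not parts:
        parts = ["zero"]
    if frac:
        # digits after the decimal point are read one at a time
        parts.append("point " + " ".join(ONES[int(d)] for d in frac))
    return ", ".join(parts)
```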

Page 50: Computational Linguistics

50

Mapping between speech and writing

It seems obvious on the face of it that the mapping between speech and its written form is regular. After all, the words are ordered in the same way as speech. Even the letters tend to be ordered in the same way as the sounds they represent.

Page 51: Computational Linguistics

51

Some examples where it isn’t…

[Figure: examples of ‘honorific inversion’, with the transliteration fragments twt`nx, jmn and nb, xpr, w, r`.]

Page 52: Computational Linguistics

52

Finite state methods

• In morphology they seem almost exactly correct as characterizations of the natural phenomenon

• In the mapping from writing to language, again, finite-state models seem almost exactly correct

Page 53: Computational Linguistics

53

Grammar induction

The common “nativist” view in linguistics…

From Gilbert Harman's review of Chomsky's New Horizons in the Study of Language and Mind (published in Journal of Philosophy, 98(5), May 2001):

Further reflection along these lines and a great deal of empirical study of particular languages has led to the "principles and parameters" framework which has dominated linguistics in the last few decades. The idea is that languages are basically the same in structure, up to certain parameters, for example, whether the head of a phrase goes at the beginning of a phrase or at the end. Children do not have to learn the basic principles, they only need to set the parameters. Linguistics aims at stating the basic principles and parameters by considering how languages differ in certain more or less subtle respects. The result of this approach has been a truly amazing outpouring of discoveries about how languages are the same yet different.

Page 54: Computational Linguistics

54

Similarly…

Children come equipped with a set of principles of grammar construction (i.e. Universal Grammar (UG)). The principles of UG have open parameters. Specific grammars arise once values for these open parameters are specified. Parameter values are determined on the basis of [the primary linguistic data]. A language specific grammar, then, is simply a specification of the values that the principles of UG leave open.

Cedric Boeckx and Norbert Hornstein. 2003. “The Varying Aims of Linguistic Theory.”

Page 55: Computational Linguistics

55

My “challenge” with Shalom Lappin

Page 56: Computational Linguistics

56

Page 57: Computational Linguistics

57

Automatic induction of grammars from unannotated text

• Klein, Dan and Manning, Christopher. 2004. “Corpus-based induction of syntactic structure: models of dependency and constituency”. Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics

• Lots of subsequent work…

Page 58: Computational Linguistics

58

Different syntactic representations

Page 59: Computational Linguistics

59

Dependency Model with Valence (DMV)

• Each head generates a set of non-STOP arguments to one side, then a STOP argument; then similarly on the other side

• Trained using expectation maximization
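The generative story in the first bullet can be sketched as a sampler. This Python sketch makes strong simplifying assumptions: the stop probability depends only on whether an argument has already been generated on the current side (the full model also conditions on the head and the direction), the probability tables `P_STOP` and `P_CHOOSE` are invented toy values, and the expectation-maximization training step is not shown.

```python
import random

# Toy parameters, invented for illustration (not values from the paper).
# P_STOP is keyed by "no argument generated yet on this side".
P_STOP = {True: 0.3, False: 0.8}
P_CHOOSE = {
    "VERB": {"NOUN": 0.7, "ADV": 0.3},
    "NOUN": {"DET": 0.6, "ADJ": 0.4},
}


def generate(head, rng):
    """Sample a dependency tree: arguments on one side, STOP, then the other side."""
    tree = {"head": head, "left": [], "right": []}
    for side in ("left", "right"):
        adjacent = True
        while True:
            options = P_CHOOSE.get(head)
            # Heads with no possible arguments stop immediately
            if options is None or rng.random() < P_STOP[adjacent]:
                break   # the STOP argument was generated
            arg = rng.choices(list(options), weights=list(options.values()))[0]
            tree[side].append(generate(arg, rng))
            adjacent = False
    return tree


sample = generate("VERB", random.Random(0))
```

Passing an explicit `random.Random` instance keeps a sampled tree reproducible for a given seed.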

Page 60: Computational Linguistics

60

Performance

Page 61: Computational Linguistics

61

Improvements

• Constituent structure can be induced in a similar way to inducing word classes (e.g. parts of speech) – by considering the environments in which the putative constituent finds itself.

• In Klein & Manning’s constituent-context model (CCM) the probability of a bracketing is computed as follows:

Page 62: Computational Linguistics

62

Combined DMV+CCM

Subsequent work – e.g. Rens Bod’s 2006 Unsupervised Data Oriented Parsing – reports F-scores close to 83.0

For comparison, the best supervised parsers get about 91.0

Page 63: Computational Linguistics

63

Some objections … and a synopsis

• Children do not learn grammars from unannotated text corpora: they get a lot of guidance from the environmental situation
  – Sure

• Performance of automatic induction algorithms is still far from human performance so they do not constitute evidence that we can do away with (nativist) linguistic theories of language acquisition
  – They do not show this. But the argument would have more weight if nativist theories had already been demonstrated to contribute to a working model of grammar induction

• But Computational Linguistics is starting to make some serious contributions to this 50-year-old debate

Page 64: Computational Linguistics

64

The evolution of complex structure in language

Examples from: Stump, Gregory (2001) Inflectional Morphology: A Theory of Paradigm Structure. Cambridge University Press.

Page 65: Computational Linguistics

65

Evolutionary Modeling (A tiny sample)

• Hare, M. and Elman, J. L. (1995) Learning and morphological change. Cognition, 56(1):61--98.

• Kirby, S. (1999) Function, Selection, and Innateness: The Emergence of Language Universals. Oxford

• Nettle, D. "Using Social Impact Theory to simulate language change". Lingua, 108(2-3):95--117, 1999.

• de Boer, B. (2001) The Origins of Vowel Systems. Oxford

• Niyogi, P. (2006) The Computational Nature of Language Learning and Evolution. Cambridge, MA: MIT Press.

Page 66: Computational Linguistics

66

A multi-agent simulation

• System is seeded with a grammar and small number of agents
  – Each agent randomly selects a set of phonetic rules to apply to forms
  – Agents are assigned to one of a small number of social groups

• 2 parents “beget” child agents.
  – Children are exposed to a predetermined number of training forms combined from both parents
    • Forms are presented proportional to their underlying “frequency”
  – Children must learn to generalize to unseen slots for words
  – Learning algorithm similar to:
    • David Yarowsky and Richard Wicentowski (2000) "Minimally supervised morphological analysis by multimodal alignment." Proceedings of ACL-2000, Hong Kong, pages 207-216.
    • Features include last n-characters of input form, plus semantic class
  – Learners select the optimal surface form to derive other forms from (optimal = requiring the simplest resulting ruleset – a Minimum Description Length criterion)

• Forms are periodically pooled among all agents and the n best forms are kept for each word and each slot

• Population grows, but is kept in check by “natural disasters” and a quasi-Malthusian model of resource limitations
  – Agents age and die according to reasonably realistic mortality statistics

Page 67: Computational Linguistics

67

Final states for a given initial state

Page 68: Computational Linguistics

68

Another example

• Kirby, Simon. 2001. “Spontaneous evolution of linguistic structure: an iterated learning model of the emergence of regularity and irregularity.” IEEE Transactions on Evolutionary Computation, 5(2):102--110.

• Assumes two meaning components each with 5 values, for 25 possible words

• Initial “speaker” randomly selects examples from the 25, producing random strings for each, and “teaches” them to the “hearer”

• Not all of the slots are filled, thus producing a “bottleneck”: the hearer must compute forms for the missing slots
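The bottleneck itself is easy to sketch. The Python fragment below covers only the teaching step: the string length, the alphabet, the number of taught examples and the function names are all assumptions of this illustration, and the hearer's generalization to the missing slots, which is the heart of Kirby's model, is not attempted.

```python
import itertools
import random


def random_string(rng, length=4, alphabet="abcd"):
    """Invent a random form (length and alphabet are arbitrary choices here)."""
    return "".join(rng.choice(alphabet) for _ in range(length))


def teach(rng, n_values=5, n_taught=15):
    """One teaching episode: the speaker names a random subset of the meanings."""
    # Two meaning components with n_values values each: 25 meanings by default
    meanings = list(itertools.product(range(n_values), repeat=2))
    taught = rng.sample(meanings, n_taught)
    # The hearer only ever observes forms for the taught meanings
    hearer = {m: random_string(rng) for m in taught}
    # The remaining slots are the bottleneck the hearer must fill in
    missing = [m for m in meanings if m not in hearer]
    return meanings, hearer, missing


meanings, hearer, missing = teach(random.Random(0))
```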

Page 69: Computational Linguistics

69

The basic algorithm produces results that are too regular

Initial state

Final state

Page 70: Computational Linguistics

70

A more realistic result…

• Addition of other constraints, including
  – a random tendency for “speakers” to omit symbols,
  – a frequency distribution over the 25 possible meaning combinations

Page 71: Computational Linguistics

71

Summary

• Evolutionary modeling is evolving slowly
  – We are a long way from being able to model the complexities of known language evolution

• Nonetheless, computational approaches promise to lend insights into how complex social systems such as language change over time, and complement discoveries in historical linguistics

Page 72: Computational Linguistics

72

Final thoughts

• Language is central to what it means to be human.

• Language is used to:
  – Communicate information
  – Communicate requests
  – Persuade, cajole…
  – (In written form) record history
  – Deceive

• Other animals do some or most of these things (cf. Anindya Sinha’s work on bonnet macaques)

• But humans are better at all of these

Page 73: Computational Linguistics

73

Final thoughts

• So the scientific study of language ought to be more central than it is

• We need to learn much more about how language works
  – How humans evolved language
  – How languages changed over time
  – How humans learn language

• Computational linguistics can contribute to all of these questions.

Page 74: Computational Linguistics

74

