LING/C SC/PSYC 438/538 Lecture 1 Sandiway Fong

LING/C SC/PSYC 438/538

Lecture 1
Sandiway Fong

Syllabus
Details:
• 538: introductory level, no formal pre-requisites
• 438: LING 388 or familiarity with one or more of the following: formal languages, syntax, data structures, or compilers
• Instructor: Sandiway Fong, Depts. of Linguistics and Computer Science
• Office: Douglass 311 (ph. 626-6567)
• Hours:
– by appt. or walk-in
– after class (best if you have quick Qs)
• Email: [email protected]
• Meet: Tuesdays/Thursdays in AME S314, 2–3:15pm
• No class on:
– November 11th (Veterans Day)
– November 27th (Thanksgiving)

Syllabus
• Course objectives:
– introduction to computational linguistics
– survey a range of topics
– introduction to programming
• Expected learning outcomes:
– acquire the ability to write short programs
– familiarity with basic concepts, techniques, and applications
– be equipped to take more advanced classes in computational linguistics, e.g. 581 (Spring)

Syllabus
• Grading
– 438:
• homeworks 100%
• note: all homeworks are required
– 538:
• homeworks 75% (the 538 homeworks will be a superset of the 438 exercises)
• chapter presentation 25%
• Homework submissions:
– email only: [email protected]
– by midnight of the due date
– typically one week to complete
– (homeworks will be presented in class)

Syllabus
• Homeworks:
– you may discuss questions with other students
– however, you must write it up yourself (in your own words)
– cite (web) references and your classmates (in the case of discussion)
– Student Code of Academic Integrity: plagiarism etc.
• http://deanofstudents.arizona.edu/codeofacademicintegrity
• Revisions to the syllabus:
– “the information contained in the course syllabus, other than the grade and absence policies, may be subject to change with reasonable advance notice, as deemed appropriate by the instructor.”

Syllabus
• Absences:
– tell me ahead of time so we can make special arrangements
– I expect you to attend lectures (though attendance will not be taken)
• Required text:
– Speech and Language Processing, Jurafsky & Martin, 2nd edition, Prentice Hall, 2008
• Special equipment:
– none; all software required for the course is freely available off the net
• Classroom etiquette:
– ask questions
– use your own laptop or lab computer

• Topics (16 weeks):
– Programming Language: Perl
– Regular Expressions (see the short Perl sketch after this list)
– Automata (Finite State)
– Transducers (Finite State)
– Programming Language: Prolog (definite clause grammars)
– Part of Speech Tagging
– Stemming (Morphology)
– Edit Distance (Spelling)
– Grammars (Regular, Context-free)
– Parsing (Syntax trees, algorithms)
– N-grams (Probability, Smoothing)
– and more …
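As a taste of the first two topics, here is a minimal Perl sketch (illustrative only, not from the course materials) showing a regular-expression match and a substitution:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $sentence = "the cat sat on the mat";

    # match: capture the first word ending in "at"
    if ($sentence =~ /\b(\w+at)\b/) {
        print "found: $1\n";          # prints "found: cat"
    }

    # substitute: replace every word ending in "at"
    (my $redacted = $sentence) =~ s/\b\w+at\b/XXX/g;
    print "$redacted\n";              # prints "the XXX XXX on the XXX"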

Course website
• Download lecture slides from my homepage:
– http://dingo.sbs.arizona.edu/~sandiway/#courses
– available from class time (and afterwards; look for corrections/updates)
– in .pptx (animations) and .pdf formats

Course website

Miss a lecture?

• Available for review:
– linked via course homepage to http://ua.lecturecast.arizona.edu/
– access to low-res video, laptop screen, slides, index (searchable)

Textbook (J&M)
• 2008 (2nd edition)
• nearly 1000 pages (maybe more than a full year’s worth…)
• 25 chapters, divided into 5 parts:
– I. Words
– II. Speech – not this course
– III. Syntax
– IV. Semantics and Pragmatics
– V. Applications

Book chapters
• 1. Introduction
• 1.1. Knowledge in Speech and Language Processing
• 1.2. Ambiguity
• 1.3. Models and Algorithms
• 1.4. Language, Thought, and Understanding
• 1.5. The State of the Art
• 1.6. Some Brief History
• 1.6.1. Foundational Insights: 1940s and 1950s
• 1.6.2. The Two Camps: 1957–1970
• 1.6.3. Four Paradigms: 1970–1983
• 1.6.4. Empiricism and Finite-State Models Redux: 1983–1993
• 1.6.5. The Field Comes Together: 1994–1999
• 1.6.6. The Rise of Machine Learning: 2000–2008
• 1.6.7. On Multiple Discoveries
• 1.6.8. A Final Brief Note on Psychology
• 1.7. Summary
• Bibliographical and Historical Notes
• I. Words
• 2. Regular Expressions and Automata
• 2.1. Regular Expressions
• 2.1.1. Basic Regular Expression Patterns
• 2.1.2. Disjunction, Grouping, and Precedence
• 2.1.3. A Simple Example
• 2.1.4. A More Complex Example
• 2.1.5. Advanced Operators
• 2.1.6. Regular Expression Substitution, Memory, and ELIZA
• 2.2. Finite-State Automata
• 2.2.1. Use of an FSA to Recognize Sheeptalk
• 2.2.2. Formal Languages
• 2.2.3. Another Example
• 2.2.4. Non-Deterministic FSAs
• 2.2.5. Use of an NFSA to Accept Strings
• 2.2.6. Recognition as Search
• 2.2.7. Relation of Deterministic and Non-Deterministic Automata
• 2.3. Regular Languages and FSAs
• 2.4. Summary
• Bibliographical and Historical Notes
• Exercises
• 3. Words and Transducers
• 3.1. Survey of (Mostly) English Morphology
• 3.1.1. Inflectional Morphology
• 3.1.2. Derivational Morphology
• 3.1.3. Cliticization
• 3.1.4. Non-Concatenative Morphology
• 3.1.5. Agreement
• 3.2. Finite-State Morphological Parsing
• 3.3. Construction of a Finite-State Lexicon
• 3.4. Finite-State Transducers
• 3.4.1. Sequential Transducers and Determinism
• 3.5. FSTs for Morphological Parsing
• 3.6. Transducers and Orthographic Rules
• 3.7. The Combination of an FST Lexicon and Rules
• 3.8. Lexicon-Free FSTs: The Porter Stemmer
• 3.9. Word and Sentence Tokenization
• 3.9.1. Segmentation in Chinese
• 3.10. Detection and Correction of Spelling Errors
• 3.11. Minimum Edit Distance
• 3.12. Human Morphological Processing
• 3.13. Summary
• Bibliographical and Historical Notes
• Exercises
• 4. N-Grams
• 4.1. Word Counting in Corpora
• 4.2. Simple (Unsmoothed) N-Grams
• 4.3. Training and Test Sets
• 4.3.1. N-Gram Sensitivity to the Training Corpus
• 4.3.2. Unknown Words: Open Versus Closed Vocabulary Tasks
• 4.4. Evaluating N-Grams: Perplexity
• 4.5. Smoothing
• 4.5.1. Laplace Smoothing
• 4.5.2. Good-Turing Discounting
• 4.5.3. Some Advanced Issues in Good-Turing Estimation
• 4.6. Interpolation
• 4.7. Backoff
• 4.7.1. Advanced: Details of Computing Katz Backoff α and P*
• 4.8. Practical Issues: Toolkits and Data Formats
• 4.9. Advanced Issues in Language Modeling
• 4.9.1. Advanced Smoothing Methods: Kneser-Ney Smoothing
• 4.9.2. Class-Based N-Grams
• 4.9.3. Language Model Adaptation and Web Use
• 4.9.4. Using Longer-Distance Information: A Brief Summary
• 4.10. Advanced: Information Theory Background
• 4.10.1. Cross-Entropy for Comparing Models
• 4.11. Advanced: The Entropy of English and Entropy Rate Constancy
• 4.12. Summary
• Bibliographical and Historical Notes
• Exercises


Syllabus
• Coverage:
– Intro to programming
• we’re going to use Perl
• Python is another (perhaps more) popular language
– Topics: selected chapters from J&M
• Chapters 1–6, skip the Speech part (7–11), then selections from 12–25

Homework: Reading
• Chapter 1 from J&M:
– introduction and history
– available online: http://www.cs.colorado.edu/~martin/SLP/Updates/1.pdf
• The whole book is available as an e-book:
– www.coursesmart.com

Homework: Install Perl
• Install Perl on your laptop:
– should be pre-installed on Macs and Linux (Ubuntu); check your machine
– on Windows PCs, if you don’t already have it, it’s freely available here:
• http://www.activestate.com/ (don’t pay, get the free version)

Homework: Install Perl
• Ubuntu or Mac: check from a terminal that Perl is present:
perl -v
which perl

Homework: Install Perl
• Other methods: see http://learn.perl.org/installing/
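Once installed, a two-line test script is an easy sanity check; this is just an illustrative sketch, not part of the assignment:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # print a greeting plus the running interpreter's version ($] is built in)
    print "hello, 438/538\n";
    print "running Perl $]\n";

Save it as hello.pl and run it with: perl hello.pl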

Learning Perl

• Learn Perl:
– books…
– online resources: http://learn.perl.org/
• Next time, we begin with:
– http://perldoc.perl.org/perlintro.html

Language and Computers

• Enormous amounts of data stored:
– world-wide web (WWW)
– corporate databases
– your own hard drive
• Major categories of data:
– numeric
– language: words, text, sound
– pictures, video

Language and Computers

• We know what we want from computer software
• “killer applications”: those that can make sense of language data
– retrieve language data (information retrieval, IR)
– summarize knowledge contained in language data
– sentiment analysis from online product reviews
– answer questions (QA), make logical inferences
– translate from one language into another
– recognize speech: transcribe
– etc.

Language and Computers

• In other words, we’d like computers to be smart about language
– possess “intelligence”
– pass the Turing Test …

Language and Computers

• In other words, we’d like computers to be smart about language
– possess intelligence
– well, perhaps not too smart…

From 2001: A Space Odyssey… (HAL)

Language and Computers

• (Un)fortunately, we’re not there yet…
– there is a gap between what computers can do and what we want them to be able to do

Often quoted (but not verified): when "The spirit is willing, but the flesh is weak" was translated into Russian and then back to English, the result was "The vodka is good, but the meat is rotten."

But with Google Translate or Babelfish, it’s not difficult to find (funny) examples…

Language and Computers

• and how can we tell if the translation is right anyway?

• http://fun.drno.de/pics/english/only-in-china/TranslateServerError.jpg

Language and Computers

Language and Computers

• Obama: "At a certain point, I've just concluded that for me personally it is important for me to go ahead and affirm that I think same-sex couples should be able to get married."

Is this sentence complicated? Why?

Language and Computers

Language and Computers

Executive Summarization

Language and Computers
• Do you trust Google Translate?
• a real case: 4,000,000 yen or 40,000 yen?

Language and Computers

• Puzzle: translation of 4万円以下
• Now fixed (almost), with auto-detect on
• character-by-character gloss:
– 4: 4
– 万円: 10,000 yen
– 以下: less than/below/not exceeding
• no spaces: a segmentation task (a toy segmenter sketch follows below)
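Japanese is written without spaces, so a system must first decide where the words are. As a toy illustration of one classic approach, here is a hypothetical Perl sketch of greedy longest-match (“maximum matching”) segmentation over a made-up three-entry dictionary; real systems are far more sophisticated:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use utf8;
    binmode(STDOUT, ':utf8');

    # made-up toy dictionary, sorted longest first for greedy matching
    my @dict = sort { length($b) <=> length($a) } ('万円', '以下', '4');

    sub segment {
        my ($s) = @_;
        my @words;
        while (length $s) {
            my $hit;
            for my $w (@dict) {            # try the longest word first
                if (index($s, $w) == 0) { $hit = $w; last; }
            }
            $hit //= substr($s, 0, 1);     # unknown: emit one character
            push @words, $hit;
            $s = substr($s, length $hit);
        }
        return @words;
    }

    print join(' | ', segment('4万円以下')), "\n";  # prints "4 | 万円 | 以下"

Greedy matching mis-segments too, of course; the point is only that segmentation has to happen before translation can.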

Language and Computers

• Non-compositionality Puzzle

Language and Computers

• What happened? 4万円以下 can be segmented as follows:
– 4: 4
– 万円: 10,000 yen
– 以下: less than/below/not exceeding
• yet the whole string 4万円以下 was rendered as “Million yen”

Language and Computers

• Still, problems remain (as of August 27, 2013):
– another glitch, but off by an order of magnitude in the other direction: 10,000 → 1,000
– still better than “million”

Language and Computers
• a visit to the Peabody Essex Museum (Massachusetts):
– a Qing dynasty Huīzhōu (徽州)-style house
• so what do those 3 characters (Yin Yu Tang) – the name of the house – actually mean? 蔭餘堂 (simplified 荫余堂)

Language and Computers

• Meaning of 荫余堂 / 蔭餘堂 (simplified/traditional spelling)
– the strange romanization is not the translation I’m looking for…

Language and Computers

• Meaning of 荫余堂 / 蔭餘堂

Meaning in language is (mostly) compositional

Language and Computers

• Meaning of 1st character: 荫 / 蔭 

Language and Computers

• Meaning of 2nd character: 余 /  餘

Language and Computers

• Meaning of 余 /  餘

Language and Computers

• Meaning of 3rd character: 堂

Language and Computers

• Meaning of 蔭 餘 堂, character by character (two candidate glosses each):
– 蔭: shade / shady
– 餘: I / remainder
– 堂: Church / Hall
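To make the compositionality idea concrete, here is a toy, hypothetical Perl sketch that builds a phrase gloss by composing per-character glosses from a made-up dictionary (picking one candidate gloss per character from the table above):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use utf8;
    binmode(STDOUT, ':utf8');

    # made-up character-to-gloss dictionary (one gloss per character)
    my %gloss = (
        '蔭' => 'shade',
        '餘' => 'remainder',
        '堂' => 'hall',
    );

    my $name    = '蔭餘堂';
    my @glosses = map { $gloss{$_} // '?' } split //, $name;
    print join(' + ', @glosses), "\n";   # prints "shade + remainder + hall"

Composition gets you a literal reading; choosing the right gloss for each character, and a fluent overall rendering, is where it gets hard.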

Applications
• technology is still in development
– even if we are willing to pay...
• machine translation has been worked on since after World War II (1950s)
– still not perfected today
– why? what are the properties of human languages that make it hard?

Natural Language Properties

• which properties are going to be difficult for computers to deal with?

• grammar (rules for putting words together into sentences)
– How many rules are there? 100, 1000, 10000, more …?
– Which portions are learnt, and which are innate?
– Do we have all the rules written down somewhere?
• lexicon (dictionary)
– How many words do we need to know? 1000, 10000, 100000 …? (a token/type counting sketch follows below)
• meaning and inference (semantic interpretation, commonsense world knowledge)
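One way to get a feel for the lexicon question is simply to count. Here is a small, hypothetical Perl sketch (not from the course materials) that tallies word tokens versus distinct word types in any text fed to it on standard input:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # count word tokens and distinct word types on standard input
    my (%type, $tokens);
    while (my $line = <STDIN>) {
        for my $w (split /\W+/, lc $line) {
            next unless length $w;    # skip empty fields from punctuation
            $type{$w}++;
            $tokens++;
        }
    }
    printf "%d tokens, %d types\n", $tokens // 0, scalar keys %type;

Run it as: perl wordcount.pl < sometext.txt — larger corpora keep yielding new types, which is part of why the lexicon question has no simple answer.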

Computers vs. Humans

• Knowledge of language:
– computers are way faster than humans
• they kill us at arithmetic and chess (and now Jeopardy as well …)
– but human beings are so good at language that we often take our ability for granted
• processed without conscious thought
• we do pretty complex things

