CS388: Natural Language Processing
Lecture 1: Introduction
Greg Durrett
Administrivia
‣ Course website: http://www.cs.utexas.edu/~gdurrett/courses/sp2021/cs388.shtml
‣ Piazza: link on the course website
‣ My office hours: Tuesday 1pm-2pm, Wednesday 3:30pm-4:30pm
‣ TA: Xi Ye. See course website for OHs
‣ Lecture: Tuesdays and Thursdays 9:30am-10:45am
‣ Note: my OHs today are 12:30pm-1:30pm
‣ Gradescope: you should’ve gotten an email
Course Requirements
‣ 391L Machine Learning (or equivalent)
‣ 311 or 311H Discrete Math for Computer Science (or equivalent)
‣ Additional prior exposure to probability, linear algebra, optimization, linguistics, and NLP useful but not required
‣ Python experience
‣ Mini 1 is out now (due January 28), please look at it soon
‣ If this seems like it’ll be challenging for you, come and talk to me (this is smaller-scale than the projects, which are smaller-scale than the final project)
What’s the goal of NLP?
‣ Be able to solve problems that require deep understanding of text
‣ Example: dialogue systems
  User: Siri, what’s your favorite kind of movie?
  Siri: I like superhero movies!
  User: What’s come out recently?
  Siri: The Avengers
Question Answering
When was Abraham Lincoln born?
  map to Birthday field
  Name                 Birthday
  Lincoln, Abraham     2/12/1809
  Washington, George   2/22/1732
  Adams, John          10/30/1735
  → February 12, 1809
How many visitor centers are there in Rocky Mountain National Park?
  The park has a total of five visitor centers
  → five
Machine Translation
中共中央政治局7月30日召开会议,会议分析研究当前经济形势,部署下半年经济工作。
  ↓ Translate
The Political Bureau of the CPC Central Committee held a meeting on July 30 to analyze and study the current economic situation and plan economic work in the second half of the year.
People’s Daily, August 10, 2020
Phrase by phrase: The Political Bureau of the CPC Central Committee | July 30 | hold a meeting
Automatic Summarization
…
One of New America’s writers posted a statement critical of Google. Eric Schmidt, Google’s CEO, was displeased.
The writer and his team were dismissed.
…
Annotations on the summary:
  compress text
  provide missing context
  paraphrase to provide clarity
NLP Analysis Pipeline
Text → Text Analysis → Annotations → Applications
Text Analysis / Annotations:
  Syntactic parses
  Coreference resolution
  Entity disambiguation
  Discourse analysis
Applications:
  Summarize
  Extract information
  Answer questions
  Identify sentiment
  Translate
‣ NLP is about building these pieces!
‣ All of these components are modeled with statistical approaches trained with machine learning
How do we represent language?
Text
Labels
  the movie was good → +
  Beyoncé had one of the best videos of all time → subjective
Sequences/tags
  [Tom Cruise]PERSON stars in the new [Mission Impossible]WORK_OF_ART film
Trees
  I eat cake with icing
  [figure: parse tree; node labels S, NP, VP, PP, NP, VBZ, NN]
  flights to Miami → λx. flight(x) ∧ dest(x) = Miami
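The logical form above can be read as a predicate that holds for exactly those entities that are flights and whose destination is Miami. The following is a minimal sketch of that reading, assuming a toy database; the `Entity` class, the `DATABASE` rows, and the helper names are hypothetical and are not part of the lecture.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entity:
    """A hypothetical database row; the fields are made up for illustration."""
    name: str
    kind: str            # e.g. "flight" or "train"
    dest: Optional[str]  # destination city, if any


def is_flight(x: Entity) -> bool:
    """The predicate flight(x) from the logical form."""
    return x.kind == "flight"


# Mirrors the notation: λx. flight(x) ∧ dest(x) = Miami
flights_to_miami = lambda x: is_flight(x) and x.dest == "Miami"

# Invented sample data.
DATABASE = [
    Entity("UA101", "flight", "Miami"),
    Entity("UA202", "flight", "Austin"),
    Entity("T7", "train", "Miami"),
]

# The denotation of the phrase is the set of entities satisfying the predicate.
denotation = [x.name for x in DATABASE if flights_to_miami(x)]
print(denotation)
```

Only the first row satisfies both conjuncts: the second is a flight to a different city, and the third goes to Miami but is not a flight.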
How do we use these representations?
Text → Text Analysis → {Labels, Sequences, Trees, …} → Applications
Applications:
  Tree transducers (for machine translation)
  Extract syntactic features
  Tree-structured neural networks
  end-to-end models
  …
‣ Main question: What representations do we need for language? What do we want to know about it?
‣ Boils down to: what ambiguities do we need to resolve?
Why is language hard? (and how can we handle that?)
Language is Ambiguous!
‣ Hector Levesque (2011): “Winograd schema challenge” (named after Terry Winograd, the creator of SHRDLU)
The city council refused the demonstrators a permit because they ______ violence
The city council refused the demonstrators a permit because they advocated violence
The city council refused the demonstrators a permit because they feared violence
‣ Referential ambiguity
‣ >5 datasets in the last two years examining this problem and commonsense reasoning
Language is Ambiguous!
‣ Syntactic and semantic ambiguities: parsing needed to resolve these, but need context to figure out which parse is correct
Teacher Strikes Idle Kids
  N N V N  or  N V ADJ N
Ban on Nude Dancing on Governor’s Desk
  [figure: alternative parse trees with NP and PP nodes]
Iraqi Head Seeks Arms
  Head: body/position
  Arms: body/weapon
example credit: Dan Klein
Language is Really Ambiguous!
‣ There aren’t just one or two possibilities which are resolved pragmatically
il fait vraiment beau
  It is really nice out
  It’s really nice
  The weather is beautiful
  It is really beautiful outside
  He makes truly beautiful
  It fact actually handsome
‣ Combinatorially many possibilities, many you won’t even register as ambiguities, but systems still have to resolve them
What do we need to understand language?
‣ Lots of data!
slide credit: Dan Klein
What do we need to understand language?
‣ World knowledge: have access to information beyond the training data
DOJ greenlights Disney-Fox merger
  DOJ: Department of Justice
  greenlights: metaphor; “approves”
‣ What is a green light? How do we understand what “green lighting” does?
‣ Need commonsense knowledge
‣ Grounding: learn what fundamental concepts actually mean in a data-driven way
McMahan and Stone (2015), Golland et al. (2010)
What do we need to understand language?
‣ Linguistic structure
‣ …but computers probably won’t understand language the same way humans do
‣ However, linguistics tells us what phenomena we need to be able to deal with and gives us hints about how language works
Centering Theory, Grosz et al. (1995)
What do we need to understand language?
What techniques do we use? (to combine data, knowledge, linguistics, etc.)
A brief history of (modern) NLP
Timeline axis: 1980, 1990, 2000, 2010, 2020
Labels on the timeline:
  Largely rule-based, expert systems
  earliest stat MT work at IBM
  Penn treebank [figure: tree with S, NP, VP]
  Ratnaparkhi tagger [figure: tags NNP, VBZ]
  Collins vs. Charniak parsers
  Unsup: topic models, grammar induction
  Sup: SVMs, CRFs, NER, Sentiment
  Semi-sup, structured prediction
  Neural
  Pretraining
Supervised vs. Unsupervised
‣ Supervised techniques work well on very little data (even neural networks)
[figure: annotation (two hours!) vs. unsupervised learning; annotation gives the better system!]
“Learning a Part-of-Speech Tagger from Two Hours of Annotation”, Garrette and Baldridge (2013)
‣ Fully unsupervised techniques have fallen out of favor
Pretraining
‣ Language modeling: predict the next word in a text: $P(w_i \mid w_1, \ldots, w_{i-1})$
P(w | I want to go to) =
  0.01 Hawai’i
  0.005 LA
  0.0001 class
: use this model for other purposes
P(w | the acting was horrible, I think the movie was) =
  0.1 bad
  0.001 good
‣ Model understands some sentiment?
‣ Train a neural network to do language modeling on massive unlabeled text, fine-tune it to do {tagging, sentiment, question answering, …}
Peters et al. (2018), Devlin et al. (2019)
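To make the next-word distributions above concrete, here is a minimal sketch of a count-based language model. It is a deliberate simplification and not the neural approach the slide describes: it assumes a bigram model that conditions only on the previous word, and the `CORPUS` sentences and function names are invented for illustration.

```python
from collections import Counter, defaultdict

# Toy corpus, invented for illustration; a real language model is trained on
# massive unlabeled text.
CORPUS = [
    "the acting was horrible and the movie was bad",
    "the acting was great and the movie was good",
    "the plot was horrible and the movie was bad",
    "i think the movie was bad",
]


def train_bigram_counts(sentences):
    """Count how often each word follows each previous word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()
        for prev, word in zip(tokens, tokens[1:]):
            counts[prev][word] += 1
    return counts


def next_word_distribution(counts, context):
    """Estimate P(w | context) using only the last context word (bigram assumption)."""
    prev = context.split()[-1]
    total = sum(counts[prev].values())
    return {w: c / total for w, c in counts[prev].items()}


counts = train_bigram_counts(CORPUS)
dist = next_word_distribution(counts, "i think the movie was")

# Print candidate next words from most to least probable.
for word, prob in sorted(dist.items(), key=lambda kv: -kv[1]):
    print(f"{prob:.3f} {word}")
```

Because this sketch looks only at the final word "was", it cannot use the earlier "the acting was horrible" to shift probability toward "bad". Conditioning on the full history $w_1, \ldots, w_{i-1}$ is what lets a model pick up on that kind of sentiment cue.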
Interpretability
‣ When we have complex models, how do we understand their decisions?
‣ “Attribution”: understand what parts of the input contribute to a prediction
‣ Why was it class A instead of class B?
‣ What is the “counterfactual” scenario we are considering (the foil)?
  I drank tea because I don’t like coffee
  I drank tea because I was thirsty
  (Jacovi and Goldberg, 2020)
‣ Dataset biases: does our data have flaws that prevent the model from doing the right thing?
‣ Probing: what representations get learned in deep models?
Wallace, Gardner, Singh: Interpretability Tutorial at EMNLP 2020
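One simple way to make attribution concrete is leave-one-out (occlusion): delete each input token in turn and measure how much the model's score changes. The sketch below applies this to a toy bag-of-words sentiment scorer. The `WEIGHTS` table and function names are hypothetical, and this is only one of several attribution methods, not necessarily the one the tutorial emphasizes.

```python
# Hypothetical word weights for a toy bag-of-words sentiment scorer; a real
# system would learn these from data.
WEIGHTS = {"good": 2.0, "great": 2.5, "bad": -2.0, "horrible": -3.0, "not": -1.0}


def score(tokens):
    """Sum of word weights; positive means positive sentiment in this toy model."""
    return sum(WEIGHTS.get(t, 0.0) for t in tokens)


def leave_one_out(tokens):
    """Attribution of token i = score(full input) - score(input without token i)."""
    full = score(tokens)
    return [
        (t, full - score(tokens[:i] + tokens[i + 1:]))
        for i, t in enumerate(tokens)
    ]


tokens = "the acting was horrible but the movie was good".split()
attributions = leave_one_out(tokens)

# Print each token with its signed contribution to the overall score.
for token, contribution in attributions:
    print(f"{token:>10} {contribution:+.1f}")
```

For a purely additive scorer like this one, each token's attribution equals its weight. For a neural model the same procedure still runs, but deleting a token can change how the remaining tokens are interpreted, so the scores need more careful reading.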
Where are we?
‣ NLP consists of: analyzing and building representations for text, solving problems involving text
‣ These problems are hard because language is ambiguous, requires drawing on data, knowledge, and linguistics to solve
‣ Knowing which techniques to use requires understanding dataset size, problem complexity, and a lot of tricks!
‣ NLP encompasses all of these things
NLP vs. Computational Linguistics
‣ NLP: build systems that deal with language data
‣ CL: use computational tools to study language
Hamilton et al. (2016)
NLP vs. Computational Linguistics
‣ Computational tools for other purposes: literary theory, political science…
Bamman, O’Connor, Smith (2013)
Outline
[figure: course schedule; bracketed groups: “ML and structured prediction for NLP”, “Neural nets”]
Outline: Syntax + Semantics
Outline: Applications
Ethics
‣ We will touch on ethical issues throughout the course
‣ E.g., “toxic degeneration”: systems can generate {racist, sexist, …} content
https://toxicdegeneration.allenai.org/
Course Goals
‣ Cover fundamental machine learning techniques used in NLP
‣ Make you a “producer” rather than a “consumer” of NLP tools
‣ Cover modern NLP problems encountered in the literature: what are the active research topics in 2021?
‣ The four assignments should teach you what you need to know to understand nearly any system in the literature (e.g.: state-of-the-art NER system = project 1 + mini 2 + BERT, basic MT system = project 2)
‣ Understand how to look at language data and approach linguistic phenomena
Assignments
‣ Two minis (10% each), two projects (20% each)
  ‣ Implementation-oriented, with an open-ended component to each
‣ Mini 1 (classification) is out NOW
‣ 1 week for minis, ~2 weeks per project, 5 “slip days” for automatic extensions
‣ Grading:
  ‣ Minis: largely graded based on code performance
  ‣ Projects: graded on a mix of code performance, writeup, extension
These projects require understanding of the concepts, ability to write performant code, and ability to think about how to debug complex systems. They are challenging, so start early!
Assignments
‣ Final project (40%)
  ‣ Groups of 2 preferred, 1 is possible
  ‣ (Brief!) proposal to be approved by me by the midpoint of the semester
  ‣ Written in the style and tone of an ACL paper
Conduct
A climate conducive to learning and creating knowledge is the right of every person in our community. Bias, harassment and discrimination of any sort have no place here. If you notice an incident that causes concern, please contact the Campus Climate Response Team: diversity.utexas.edu/ccrt
The College of Natural Sciences is steadfastly committed to enriching and transformative educational and research experiences for every member of our community. Find more resources to support a diverse, equitable and welcoming community within Texas Science and share your experiences at cns.utexas.edu/diversity
Survey (on Instapoll)
1. Name
2. Fill in: I am a [CS / ____] [PhD / masters / undergrad] in year [1 2 3 4 5+]
3. Write one reason you want to take this class or one thing you want to get out of it
4. One interesting fact about yourself, or what you like to do in your spare time