Recap: BERT Announcements

Post on 22-Feb-2022

Recap: BERT

Announcements

‣ A4 back today, A5 back soon

‣ FP check-in due today, will be returned soon

‣ eCIS evaluations: please fill these out

Multilinguality

Dealing with other languages

‣ Some of our algorithms have been specific to English

‣ Some structures like constituency parsing don’t make sense for other languages

‣ Neural methods are typically tuned to English-scale resources, and may not be the best for other languages where less data is available

‣ Other languages present some challenges not seen in English at all!

‣ Questions:

1) What other phenomena/challenges do we need to solve?

2) How can we leverage existing resources to do better in other languages without just annotating massive data?

This Lecture

‣ Morphological richness: effects and challenges

‣ Morphology tasks: analysis, inflection, word segmentation

‣ Cross-lingual tagging and parsing

‣ Cross-lingual word representations

Morphology

What is morphology?

‣ Study of how words form

‣ Derivational morphology: create a new lexeme from a base

estrange (v) => estrangement (n)

become (v) => unbecoming (adj)

‣ May not be totally regular: enflame => inflammable

‣ Inflectional morphology: word is inflected based on its context

I become / she becomes

‣ Mostly applies to verbs and nouns

Morphological Inflection

‣ In English:

I arrive / you arrive / he/she/it arrives

we arrive / you arrive / they arrive

[X] arrived

‣ In French:

Morphological Inflection

‣ In Spanish:

Noun Inflection

‣ Not just verbs either; gender, number, case complicate things

‣ Nominative: I/he/she, accusative: me/him/her, genitive: mine/his/hers

‣ Dative: merged with accusative in English, shows recipient of something

I give the children a book <=> Ich gebe den Kindern ein Buch

I teach the children <=> Ich unterrichte die Kinder

Irregular Inflection

‣ Common words are often irregular

‣ I am / you are / she is

‣ Je suis / tu es / elle est

‣ Soy / eres / es

‣ Less common words typically fall into some regular paradigm, so these are somewhat predictable

Agglutinating Languages

‣ Finnish/Hungarian (Finno-Ugric), also Turkish: what a preposition would do in English is instead part of the verb

halata: “hug”

illative: “into”; adessive: “on”

‣ Many possible forms, and in newswire data only a few are observed

Morphologically-Rich Languages

‣ Many languages spoken all over the world have much richer morphology than English

‣ CoNLL 2006/2007: dependency parsing + morphological analyses for ~15 mostly Indo-European languages

‣ SPMRL shared tasks (2013-2014): Syntactic Parsing of Morphologically-Rich Languages

‣ Word piece / byte-pair encoding models for MT are pretty good at handling these if there’s enough data

Morphologically-Rich Languages

‣ Great resources for challenging your assumptions about language and for understanding multilingual models!

Morphological Analysis / Inflection

Morphological Analysis

‣ In English, lexical features on words and word vectors are pretty effective

‣ In other languages, lots more unseen words due to rich morphology! Affects parsing, translation, …

‣ When we’re building systems, we probably want to know base form + morphological features explicitly

‣ How to do this kind of morphological analysis?

Morphological Analysis: Hungarian

Ám a kormány egyetlen adó csökkentését sem javasolja.

But the government does not recommend reducing taxes.

kormány: n=singular|case=nominative|proper=no

egyetlen: deg=positive|n=singular|case=nominative

adó: n=singular|case=nominative|proper=no

csökkentését: n=singular|case=accusative|proper=no|pperson=3rd|pnumber=singular

javasolja: mood=indicative|t=present|p=3rd|n=singular|def=yes

Morphological Analysis

‣ Given a word in context, need to predict what its morphological features are

‣ Basic approach: combines two modules:

‣ Lexicon: tells you what possibilities are for the word

‣ Analyzer: statistical model that disambiguates

‣ Models are largely CRF-like: score morphological features in context

‣ Lots of work on Arabic inflection (high amounts of ambiguity)
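
The lexicon-plus-analyzer approach can be sketched roughly as follows. This is only a toy illustration, not any published system: the lexicon entries, the feature strings, and the hand-set `context_score` function are hypothetical stand-ins, and the greedy left-to-right choice replaces the CRF-style joint scoring a real analyzer would use.

```python
# Toy lexicon (hypothetical entries): surface form -> possible (lemma, features) analyses.
LEXICON = {
    "i": [("i", "pos=pron|p=1st|n=singular")],
    "the": [("the", "pos=det")],
    "saw": [("see", "pos=verb|t=past"), ("saw", "pos=noun|n=singular")],
}


def context_score(prev_feats, feats):
    """Hand-set stand-in for a learned model that scores an analysis in context."""
    score = 0.0
    if prev_feats is None:
        return score
    if "pos=pron" in prev_feats and "pos=verb" in feats:
        score += 1.0  # pronoun followed by verb is plausible
    if "pos=det" in prev_feats and "pos=noun" in feats:
        score += 1.0  # determiner followed by noun is plausible
    return score


def analyze(words):
    """Look up candidates in the lexicon, then greedily pick the best one in context."""
    analyses = []
    prev_feats = None
    for word in words:
        key = word.lower()
        candidates = LEXICON.get(key, [(key, "unknown")])
        best = max(candidates, key=lambda c: context_score(prev_feats, c[1]))
        analyses.append(best)
        prev_feats = best[1]
    return analyses


print(analyze(["I", "saw"]))
print(analyze(["the", "saw"]))
```

In this sketch the ambiguous form "saw" is resolved to the verb after a pronoun and to the noun after a determiner; the lexicon supplies the options and the scorer only chooses among them.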

Morphological Inflection

‣ Inverse task of analysis: given base form + features, inflect the word

‣ Hard for unknown words: need models that generalize

Durrett and DeNero (2013)

w i n d e n
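
To make the inflection task concrete, here is a minimal sketch that learns suffix-rewrite rules from (base form, features, inflected form) triples and applies the most frequent matching rule to a new base form. It is a deliberately naive baseline and not the model of Durrett and DeNero (2013); the training triples and the feature string are hypothetical examples.

```python
from collections import Counter, defaultdict


def suffix_rule(lemma, form):
    """Strip the longest common prefix and return (lemma_suffix, form_suffix)."""
    i = 0
    while i < min(len(lemma), len(form)) and lemma[i] == form[i]:
        i += 1
    return lemma[i:], form[i:]


def learn_rules(examples):
    """Count suffix-rewrite rules separately for each feature bundle."""
    rules = defaultdict(Counter)
    for lemma, feats, form in examples:
        rules[feats][suffix_rule(lemma, form)] += 1
    return rules


def inflect(lemma, feats, rules):
    """Apply the most frequent rule whose left-hand suffix matches the lemma."""
    for (old, new), _count in rules[feats].most_common():
        if lemma.endswith(old):
            return lemma[: len(lemma) - len(old)] + new
    return None  # no applicable rule was learned


# Hypothetical training data: German verbs, third person singular present.
FEATS = "p=3rd|n=singular|t=present"
examples = [
    ("machen", FEATS, "macht"),
    ("sagen", FEATS, "sagt"),
    ("kaufen", FEATS, "kauft"),
]
rules = learn_rules(examples)
print(inflect("lachen", FEATS, rules))
print(inflect("winden", FEATS, rules))
```

The single learned rule (en => t) handles "lachen" correctly, but for the unseen verb "winden" it produces "windt" where German requires "windet". That is the generalization problem: a usable model has to condition on more of the stem than the final suffix.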

Morphological Inflection

‣ Machine translation where phrase table is defined in terms of lemmas

‣ “Translate-and-inflect”: translate into uninflected words and predict inflection based on source side

Chahuneau et al. (2013)

Chinese Word Segmentation

‣ Word segmentation: some languages including Chinese are totally untokenized

‣ LSTMs over character embeddings / character bigram embeddings to predict word boundaries

‣ Having the right segmentation can help machine translation

Chen et al. (2015)
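
Predicting word boundaries is commonly framed as labeling each character, and the sketch below shows only the final decoding step from labels to words. The B/I (begin/inside) label scheme is an assumption here, not something the slides specify, and the labels are written by hand rather than produced by an LSTM.

```python
def tags_to_words(chars, tags):
    """Group characters into words: 'B' starts a new word, 'I' continues the current one."""
    if len(chars) != len(tags):
        raise ValueError("need exactly one tag per character")
    words = []
    for ch, tag in zip(chars, tags):
        if tag == "B" or not words:
            words.append(ch)   # start a new word
        else:
            words[-1] += ch    # extend the current word
    return words


# Hand-written labels standing in for a tagger's predictions.
sentence = "我喜欢自然语言处理"
labels = "BBIBIBIBI"
print(tags_to_words(sentence, labels))
```

With this framing, segmentation errors are just tagging errors on individual characters, which is why a sequence model over character (or character bigram) embeddings fits the task.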

Cross-Lingual Tagging and Parsing

Cross-Lingual Tagging

‣ Labeling POS datasets is expensive

‣ Can we transfer annotation from high-resource languages (English, etc.) to low-resource languages?

English: raw text + POS data

Spanish: raw text + en-es bitext + POS data => Spanish tagger

Malagasy: raw text + bitext => Malagasy tagger

Cross-Lingual Tagging

‣ Can we leverage word alignment here?

‣ Tag with English tagger, project across bitext, train French tagger? Works pretty well

align:
I like it a lot
Je l’ aime beaucoup

tag:
I like it a lot
N V PR DT ADJ

Projected tags:
Je l’ aime beaucoup
N PR V ??

Das and Petrov (2011)
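
The tag-and-project idea can be sketched in a few lines. This is a bare-bones illustration under stated assumptions: the word alignment is written by hand (in practice it would come from an automatic aligner), and the rule "keep a tag only when all aligned source words agree" is a simplification rather than the graph-based method of the cited paper.

```python
def project_tags(src_tags, tgt_len, alignment, unknown="??"):
    """Project source-side tags onto target words through word alignments.

    alignment is a list of (source_index, target_index) pairs. A target word
    keeps a tag only if every source word aligned to it carries the same tag.
    """
    candidates = [set() for _ in range(tgt_len)]
    for s, t in alignment:
        candidates[t].add(src_tags[s])
    return [next(iter(c)) if len(c) == 1 else unknown for c in candidates]


src = ["I", "like", "it", "a", "lot"]
src_tags = ["N", "V", "PR", "DT", "ADJ"]
tgt = ["Je", "l’", "aime", "beaucoup"]
# Hand-written alignment: I-Je, like-aime, it-l’, and both "a" and "lot" to "beaucoup".
alignment = [(0, 0), (1, 2), (2, 1), (3, 3), (4, 3)]

print(list(zip(tgt, project_tags(src_tags, len(tgt), alignment))))
```

Here "beaucoup" receives two conflicting tags (DT and ADJ) and is left as "??". A tagger trained on projected data has to cope with such gaps and with alignment noise.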

Cross-Lingual Parsing

‣ Now that we can POS tag other languages, can we parse them too?

‣ Direct transfer: train a parser over POS sequences in one language, then apply it to another language

McDonald et al. (2011)

train:

I like tomatoes
PRON VERB NOUN

I like them
PRON VERB PRON

Parser trained to accept tag input: VERB is the head of PRON and NOUN

parse new data:

Je les aime
PRON PRON VERB
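
Direct transfer works because the parser never sees words, only tags. The toy sketch below illustrates that with simple attachment counts over POS tags; it is not the parser used by McDonald et al. (2011), and the two-sentence "treebank" and the greedy attachment procedure are assumptions made for illustration.

```python
from collections import Counter


def train(treebank):
    """Count root tags and (child_tag, head_tag) attachments; words are never used."""
    root_counts, attach_counts = Counter(), Counter()
    for sent in treebank:
        tags = [tag for tag, _ in sent]
        for tag, head in sent:
            if head < 0:
                root_counts[tag] += 1
            else:
                attach_counts[(tag, tags[head])] += 1
    return root_counts, attach_counts


def parse(tags, model):
    """Pick the likeliest root, then greedily attach remaining tokens to the growing tree."""
    root_counts, attach_counts = model
    n = len(tags)
    root = max(range(n), key=lambda i: root_counts[tags[i]])
    heads = [None] * n
    heads[root] = -1
    attached = {root}
    while len(attached) < n:
        # Best (child, head) pair: highest attachment count, then shortest distance.
        child, head = max(
            ((c, h) for c in range(n) if c not in attached for h in attached),
            key=lambda ch: (attach_counts[(tags[ch[0]], tags[ch[1]])], -abs(ch[0] - ch[1])),
        )
        heads[child] = head
        attached.add(child)
    return heads


# English training trees as (tag, head_index) pairs; -1 marks the root.
english = [
    [("PRON", 1), ("VERB", -1), ("NOUN", 1)],  # I like tomatoes
    [("PRON", 1), ("VERB", -1), ("PRON", 1)],  # I like them
]
model = train(english)

# French test sentence "Je les aime", given only its tags.
print(parse(["PRON", "PRON", "VERB"], model))
```

Even though the French word order (PRON PRON VERB) never appears in the English data, the tag-level statistics still attach both pronouns to the verb.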

Cross-Lingual Parsing

‣ Multi-dir: transfer a parser trained on several source treebanks to the target language

‣ Multi-proj: more complex annotation projection approach

McDonald et al. (2011)

Cross-Lingual Word Representations

Multilingual Embeddings

Ammar et al. (2016)

‣ multiCluster: use bilingual dictionaries to form clusters of words that are translations of one another, replace corpora with cluster IDs, train “monolingual” embeddings over all these corpora

‣ Works okay but not all that well

I have an apple => 47 24 89 1981

J’ai des oranges => 47 24 18 427

ID 47: I, Je, J’

ID 24: ai, have

‣ Input: corpora in many languages. Output: embeddings where similar words in different languages have similar embeddings
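
The clustering and ID-replacement steps of multiCluster can be sketched as below. This is a minimal illustration under assumptions: the dictionary pairs are made up, clusters are formed with a simple union-find over translation pairs, and the final step (training embeddings on the ID sequences) is left out.

```python
def build_clusters(dictionary_pairs):
    """Merge words linked by dictionary entries; return a word -> cluster ID mapping."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in dictionary_pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    roots, cluster_id = {}, {}
    for word in list(parent):
        root = find(word)
        cluster_id[word] = roots.setdefault(root, len(roots))
    return cluster_id


def encode(tokens, lang, cluster_id):
    """Replace each token by its cluster ID; tokens outside the dictionary stay language-tagged."""
    return [cluster_id.get((lang, tok), f"{lang}:{tok}") for tok in tokens]


# Hypothetical bilingual dictionary entries, keyed by (language, word).
pairs = [
    (("en", "i"), ("fr", "je")),
    (("en", "i"), ("fr", "j'")),
    (("en", "have"), ("fr", "ai")),
]
ids = build_clusters(pairs)
print(encode(["i", "have", "an", "apple"], "en", ids))
print(encode(["j'", "ai", "des", "oranges"], "fr", ids))
```

Both sentences now begin with the same two IDs, so an ordinary monolingual embedding trainer run over the ID sequences treats "I have" and "J'ai" as the same context.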

Multilingual Sentence Embeddings

Artetxe et al. (2019)

‣ Form BPE vocabulary over all corpora (50k merges); will include characters from every script

‣ Take a bunch of bitexts and train an MT model between a bunch of language pairs with shared parameters, use W as sentence embeddings
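
For the BPE vocabulary step, a compact sketch of merge learning over a mixed-script word list is given below. It follows the standard greedy procedure (repeatedly merge the most frequent adjacent symbol pair) on a tiny made-up corpus with 3 merges instead of 50k, and it is not the exact tooling used in the cited work.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Greedily learn BPE merges; every word starts as a sequence of characters."""
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges, vocab


# Tiny made-up corpus mixing Latin and Chinese script.
corpus = ["low", "low", "lower", "中华", "中华"]
merges, vocab = learn_bpe(corpus, num_merges=3)
print(merges)
print(sorted(vocab))
```

Because the starting symbols are the characters of all corpora together, the vocabulary contains characters from every script, and frequent sequences in any language get merged into larger units.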

Multilingual Sentence Embeddings

‣ Train a system for NLI (entailment/neutral/contradiction of a sentence pair) on English and evaluate on other languages

Artetxe et al. (2019)

Multilingual BERT

Devlin et al. (2019)

‣ Take top 104 Wikipedias, train BERT on all of them simultaneously

‣ What does this look like?

Beethoven may have proposed unsuccessfully to Therese Malfatti, the supposed dedicatee of "Für Elise"; his status as a commoner may again have interfered with those plans.

当人们在马尔法蒂身后发现这部小曲的手稿时,便误认为上面写的是“Für Elise”(即《给爱丽丝》)[51]。

Кита́й (официально — Кита́йская Наро́дная Респу́блика, сокращённо — КНР; кит. трад. 中華人民共和國, упр. 中华人民共和国, пиньинь: Zhōnghuá Rénmín

Multilingual BERT: Results

Pires et al. (2019)

‣ Can transfer BERT directly across languages with some success

‣ …but this evaluation is on languages that all share an alphabet

Multilingual BERT: Results

Pires et al. (2019)

‣ Urdu (Arabic script) => Hindi (Devanagari). Transfers well despite different alphabets!

‣ Japanese => English: different script and very different syntax

Scaling Up: XLM-R

Conneau et al. (2019)

‣ Larger “Common Crawl” dataset, better performance than mBERT

‣ Low-resource languages benefit from training on other languages

‣ High-resource languages see a small performance hit, but not much

Where are we now?

‣ Universal Dependencies: treebanks (+ tags) for 70+ languages

‣ Many languages are still small, so projection techniques may still help

‣ More corpora in other languages, less and less reliance on structured tools like parsers, and pretraining on unlabeled data means that performance on other languages is better than ever

‣ Multilingual models seem to be working better and better, but still many challenges for low-resource settings

Takeaways

‣ Many languages have richer morphology than English and pose distinct challenges

‣ Problems: how to analyze rich morphology, how to generate with it

‣ Can leverage resources for English using bitexts

‣ Next time: wrap-up + discussion of ethics