Post on 12-Oct-2020
BD003: Introduction to NLP
Dr. Diana Maynard, University of Sheffield, UK
IADS Big Data and Analytics Summer School, 31 July 2017, University of Essex
Part 1: Introduction to GATE
Why GATE?
• GATE is the most widely used open source toolkit for NLP in the world
• We’re using it because it’s a great way to showcase all the core NLP components that are used for text analysis tasks
• You can play with all the tools in GATE and try out things for yourself to see how it works
• And also because we’re experts
• Developed at the University of Sheffield since 2000 (in its current form)
• The person who has led the development of the NLP tools in GATE since 2000 is the one presenting to you now :)
• And by the way, just because it’s old doesn’t mean it’s out of date. GATE is in constant development, with new technologies being constantly added.
What is GATE?
• Open-source software framework and set of ready solutions for text/natural language processing
• Re-usable abstractions for documents, format conversion, corpora, annotations, storage, algorithms, ...
• A graphical user interface to interactively develop solutions (GATE GUI, GATE Developer)
• A (Java) library providing a programming API for using the abstractions
• An infrastructure of pluggable components (GATE Plugins)
• Ready-made solutions to get you started
• Companion software for semantic search (Mimir)
• Scalable from laptop to massive processing on the cloud (including real-time stream processing)
About this tutorial
• This tutorial will get you started with the GATE graphical user interface (GUI), also known as “GATE Developer”
• It will be a hands-on session. Please try things out in GATE as the topics are presented.
• Things suggested for you to try yourself are in red.
• Start GATE on your computer now (if you haven't already) by double clicking the icon
• Please don't jump ahead: if you're already finished with a task, perhaps you can help your neighbour if they get stuck.
• Please try to keep questions during the sessions related to the current topic
• There will be time at the end or in the breaks for more general questions
GATE GUI
[Screenshot of the GATE GUI, labelled: Menu Bar, Shortcut Buttons, Resources Pane, Resource Features, Display Pane, Messages]
Resources
• Most things you use within GATE are “resources”:
• Language resources (LRs) are documents, document collections, ontologies...
  • A collection of documents is known as a corpus
• Processing resources (PRs) are programs that operate on text within the documents, and often create or modify annotations
• Datastores are for storing documents and corpora for later use
• Applications (“pipelines”) are sequences of processing resources that run on one or more documents
Displaying Resources
• When you first open GATE, the display pane will show messages from the system in the “Messages” tab
• The display pane displays whatever elements you are currently working with, e.g. an application, a document or a processing resource, each in its own tab
• Double clicking on a resource in the resources pane will display it
• Tabs along the top of the display pane allow you to choose which of the open resources to display
Create New Document
• From the Resource Pane, right click “Language Resources” → New → GATE Document
• Ignore the parameter settings that will be displayed
• Click OK
• “GATE Document_<id>” will now be added to “Language Resources”
• Double click that document name
• A tab is opened in the display pane, showing the empty document. You can enter some text there if you want.
Empty Document
[Screenshot of an empty document, labelled: Document Tab, Document Name, Document Editor, Document Editor Buttons, Document Resource Views]
Document Editor
• The Document Editor is shown as a new Tab in the Display Pane, alongside the Message Pane
• There are buttons on the top of the Editor, e.g. “Annotation Sets” (we will learn about them later).
• There are tabs at the bottom of the Document Tab: these show different “Views” of the document.
• The small pane in the lower left shows the “document features” (optional information associated with the document resource as key/value pairs)
Simple operations on resources
• Right clicking on the name of a resource in the resource pane gives access to a menu of actions
• Double clicking on the name of a resource opens a view of the resource in the display pane (triple clicking the name can be used to rename)
• Selecting a resource instance and pressing the Delete (Mac: Fn+BS) key will generally close it
• You can also right click and then select “Close”
Parameters
• Resources can have parameters which must be specified when the resource is created: Initialization (init) Parameters
• Processing resources can also have parameters which can be changed for each run: Runtime Parameters
• Init parameters specify how a resource is created, e.g. the location of a document to load
• Runtime parameters configure what a processing resource does, e.g. whether some processing is case-sensitive or not.
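The difference between the two kinds of parameter can be sketched in a few lines of Python. This is only an illustration of the idea, not GATE code: the `KeywordMatcher` class, its `keywords` init parameter and its `case_sensitive` runtime parameter are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class KeywordMatcher:
    """Hypothetical processing resource; not a real GATE class."""

    # Init parameter: supplied once, when the resource is created.
    keywords: frozenset
    # Runtime parameter: may be changed before each run.
    case_sensitive: bool = True

    def run(self, text: str) -> list:
        """Return the whitespace-separated tokens that match a keyword."""
        if self.case_sensitive:
            return [tok for tok in text.split() if tok in self.keywords]
        lowered = {kw.lower() for kw in self.keywords}
        return [tok for tok in text.split() if tok.lower() in lowered]


matcher = KeywordMatcher(keywords=frozenset({"GATE", "ANNIE"}))
strict = matcher.run("gate and ANNIE")   # case-sensitive run
matcher.case_sensitive = False           # change a runtime parameter
relaxed = matcher.run("gate and ANNIE")  # same resource, different behaviour
print(strict, relaxed)
```

The resource is created once and then run twice with different runtime settings, which mirrors how a PR in an application can be reconfigured between runs without being reloaded.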
Loading a document
• GATE can read and load documents in many formats: e.g. plain text, HTML, XML, PDF, Word, CoNLL, CSV, JSON
• GATE can load documents from files and from URLs
• When a document is loaded, it gets converted to GATE internal format as document text + annotations.
Loading a document
• To load a document:
  - right click on Language Resources → “New → GATE Document” OR
  - File menu → New Language Resource → GATE Document
• Use the sourceURL parameter to specify the document to be loaded:
  - type the filename or URL, or
  - click the file browser icon to navigate to the correct document.
• Load a file from your hands-on materials: corpora → news-texts → ft-airlines-27-jul-2001.xml
• Load a web page: for this the http:// or https:// part of the URL is required, e.g. http://news.bbc.co.uk
• Note: if you use the BBC page above, we suggest picking a story and clicking on it to get a better document for processing, as the main news page contains mainly just links
Document viewer
[Screenshot of the document viewer, labelled: Document viewer buttons; Document; Highlighted tab is the resource currently being viewed]
Annotations
• Annotations are central to GATE
• Annotations represent aspects of the text you want to analyze: words, sentences, Dates, Person Names
• Annotations are named by their type, e.g. “Person”
• An annotation consists of:
  • Annotation type
  • start and end offsets
  • set of features, each feature is an arbitrary name/value pair, e.g. orth="upperInitial"
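To make the three parts of an annotation concrete, here is a minimal Python sketch. It models the idea only and is not the GATE `Annotation` class; the example sentence and the offsets are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """Illustrative model only; not the GATE Annotation class."""

    type: str                                     # e.g. "Person"
    start: int                                    # offset of the first character
    end: int                                      # offset just past the last character
    features: dict = field(default_factory=dict)  # arbitrary name/value pairs


text = "John Smith visited London."
person = Annotation("Person", 0, 10, {"orth": "upperInitial"})
location = Annotation("Location", 19, 25)

# The annotation does not copy the text; it points into it via offsets.
person_text = text[person.start:person.end]
location_text = text[location.start:location.end]
print(person_text, "|", location_text)
```

Because annotations refer to the text by offset rather than containing it, any number of annotations can overlap or cover the same span without changing the document text.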
Annotation Sets
• Annotations are grouped into sets
• Each set can contain any number of annotations of any type
• You can create and organize your annotation sets as you wish.
• Predefined sets
  • Default set (empty name): cannot be deleted
  • “Original markups”: annotations from the markups in the file
  • “Key”: by convention, used for gold standard annotations
• Click the “Annotation Sets” button in the document viewer
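As a rough mental model, annotation sets behave like a mapping from a set name to a collection of annotations, with the empty name standing for the default set. The Python sketch below assumes that simplified model; the helper functions `types_in` and `of_type` and the sample spans are hypothetical and are not part of GATE.

```python
from collections import defaultdict

# Set name -> list of (type, start, end) tuples; "" stands for the default set.
annotation_sets = defaultdict(list)
annotation_sets[""].append(("Person", 0, 10))
annotation_sets[""].append(("Location", 19, 25))
annotation_sets["Original markups"].append(("paragraph", 0, 26))
annotation_sets["Key"].append(("Person", 0, 10))


def types_in(set_name: str) -> list:
    """Return the sorted annotation types present in one set."""
    return sorted({ann_type for ann_type, _, _ in annotation_sets[set_name]})


def of_type(set_name: str, ann_type: str) -> list:
    """Return the (start, end) spans of one annotation type in one set."""
    return [(s, e) for t, s, e in annotation_sets[set_name] if t == ann_type]


print(types_in(""), of_type("Key", "Person"))
```

Listing the types in a set corresponds to expanding that set in the Annotation Sets view, and selecting one type corresponds to ticking its checkbox.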
Annotation Sets
[Screenshot of the Annotation Sets view, labelled: Default annotation set, Original markups annotation set, Annotation types, Document Viewer Buttons, Tabs]
Viewing annotations
• Clicking on the Annotation Sets button opens a new pane on the right hand side inside the document view (Annotation Sets view)
• Default (unnamed) set contains some examples of annotations
• Click on the ▶ to display the annotation types belonging to that set
• You should see types such as Location, Date, Person etc.
• Click the checkbox for an annotation type to view all the annotations of that type in the document
A closer look at the annotations
• Click the Annotations List button from the menu above the Display pane
• Table shows annotation type, annotation set, offsets, annotation id, and features (for all selected annotations)
• Select a row in the table to highlight the annotation in the text
• There are also other annotation views possible such as the Annotation Stack and Coreference Editor
Annotations
[Screenshot of a document with annotations, labelled: Date annotation, Annotations table]
Editing existing annotations
• Select an annotation type from the Annotation Sets view and hover over a highlighted annotation in the text
• A popup window displays more information about it: this is the annotation editor
• Click the drawing pin symbol at the top of the editor. This will “pin” the window open (you can still move the window around on your screen if you wish)
• Try editing the annotation: you can change the annotation type, feature names and values, the span of the annotation (clicking left and right arrows at the top of the box) or delete the annotation or its features (red Xs)
• Close the annotation editor by clicking the X in the top right corner, then view your edited annotation in the Annotation List
Annotation editor
[Screenshot of the annotation editor, labelled: annotation editor, Annotation type, feature name, value]
Creating a Corpus
• A corpus is a collection of documents.
• For most GATE applications, it is easier to work with a corpus rather than an individual document, even if that corpus only contains one document.
• Right click Language Resources → New → GATE Corpus
• OR
• File menu → New Language Resource → GATE Corpus
• As with the documents, you can name your corpus or use the default GATE name.
Adding documents to a corpus
1. With the init parameter: click the edit button and add documents that are already loaded in GATE to the corpus. Click OK when done.
or
2. Create an empty corpus. Open the corpus and use the + button to add documents, or drag them from the Resources pane, or populate it from a file directory (next slide)
• Double click on the corpus name to view the corpus.
• Double click the document listed there to view it.
Populating a Corpus (1)
• Usually, a corpus will consist of more than one document. Sometimes there could be hundreds of documents in a corpus.
• Using the populate function means you don't have to preload the documents in GATE first, and allows you to load all the documents into the corpus in one go
• To do this, let's first tidy up a bit
• It's best to keep GATE GUI clutter-free by removing any unwanted resources and documents, or it can get a bit confusing
• Close all open documents and corpora
Populating a Corpus (2)
• Create a new, empty corpus (don't add any documents to it yet)
• Right click on the corpus name in the Resources pane and select Populate
• Use the file browser icon to select the name of the directory with your documents (corpora/news-texts)
• All the documents will be loaded in one go
• View the contents of the corpus as before
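Conceptually, populating a corpus just means walking a directory and adding one document per file. The Python sketch below shows that idea under simplifying assumptions: the corpus is a plain list, each document is a dict, and the `populate` function and the two sample files are invented for the example rather than taken from GATE.

```python
import tempfile
from pathlib import Path


def populate(corpus: list, directory: Path, pattern: str = "*.xml") -> int:
    """Append every file matching `pattern` to `corpus`; return the count."""
    added = 0
    for path in sorted(directory.glob(pattern)):
        corpus.append({"name": path.name,
                       "text": path.read_text(encoding="utf-8")})
        added += 1
    return added


# Demonstration on a throwaway directory with two made-up files.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "a.xml").write_text("<doc>first</doc>", encoding="utf-8")
    (folder / "b.xml").write_text("<doc>second</doc>", encoding="utf-8")
    corpus = []
    loaded = populate(corpus, folder)

print(loaded, [doc["name"] for doc in corpus])
```

The point of the one-step populate is the same as in the GUI: nothing has to be loaded individually beforehand, and the corpus starts out empty.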
Processing Resources and Plugins
• Processing resources (PRs) are the tools that process and annotate text (text processing algorithms). Often this means creating or modifying annotations on the text.
• An “application” or “pipeline” consists of any number of PRs, run sequentially over a corpus of documents
• A plugin is a collection of PRs, and other resources bundled together. For example, everything needed for IE in ANNIE is in the ANNIE plugin.
• An application can use PRs from one or more different plugins.
• In order to use PRs, you need to load the relevant plugin(s)
• Plugins are loaded via the Plugin Manager (green jigsaw piece icon)
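The idea of an application as PRs run one after another over a corpus can be sketched as follows. This is a toy Python model, not GATE code: the two PRs (`tokenise` and `tag_capitalised`), the dict-based document and the `run_pipeline` function are assumptions made for the illustration.

```python
import re


def tokenise(doc: dict) -> None:
    """Add one Token annotation per run of non-space characters."""
    for match in re.finditer(r"\S+", doc["text"]):
        doc["annotations"].append(("Token", match.start(), match.end()))


def tag_capitalised(doc: dict) -> None:
    """Add a Capitalised annotation over each Token starting in upper case."""
    for ann_type, start, end in list(doc["annotations"]):
        if ann_type == "Token" and doc["text"][start].isupper():
            doc["annotations"].append(("Capitalised", start, end))


def run_pipeline(pipeline: list, corpus: list) -> None:
    """Run every PR, in order, over every document in the corpus."""
    for doc in corpus:
        for pr in pipeline:
            pr(doc)


corpus = [{"text": "GATE runs in Sheffield", "annotations": []}]
run_pipeline([tokenise, tag_capitalised], corpus)
capitalised = [a for a in corpus[0]["annotations"] if a[0] == "Capitalised"]
print(capitalised)
```

Order matters here: `tag_capitalised` relies on the Token annotations that `tokenise` creates, so swapping the two would produce no Capitalised annotations at all.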
Plugins
• Click the icon on the top GATE menu to open the Plugin Manager [or go via File → Manage CREOLE Plugins]
• Depending on your version of GATE, you may see a popup box:
• The user plugin folder is a folder on your computer where plugins other than those provided by GATE are stored
Plugins
[Screenshot of the Plugin Manager, labelled: List of available plugins; Resources in the selected plugin; Load the plugin for this session only; Load the plugin every time GATE starts; Apply all the settings; Close the plugins manager]
Plugins
• Select a plugin to see (on the right-hand side) the names of the resources it contains
• Check the relevant “Load Now” box to load a plugin of your choice
• Click “Apply All” to load the selected plugin
• Click “Close”
• Right click on Processing Resources to see which new PRs are now available
Applications
Here's one I made earlier: ANNIE
• ANNIE is a ready made collection of PRs that performs Information Extraction on unstructured text.
• A detailed explanation of ANNIE will be given in the second part. For now, we're just going to use it as an example of an application.
• Later, we'll show you how to make your own application from scratch.
• Click the icon from the top GATE menu OR select File → Load ANNIE system
• Select “with defaults”
• Load any document from the hands-on material and add it to a corpus
Running an application
• View the ANNIE application by double clicking on it
[Screenshot of the application view, labelled: PRs selected in application (in order of their execution); Corpus on which the application is executed; Runtime parameters of the selected PR; Execute the application]
Viewing the results
• When a message appears in the bottom left corner of your GATE window saying something like “ANNIE run in 1.3 seconds”, the application has finished.
• Double click on the document to view it
• View the annotations by selecting Annotation Sets and clicking on any Annotation types in the Default (unnamed) set
• If you want, you can view the annotations table too.
• Remember that not all the results will be perfect! Later in the course, you'll learn more about the causes of these errors.
Adding new PRs (1)
• Let's add a Verb Phrase Chunker PR to ANNIE.
• First, we have to load the plugin that contains it, and then load the PR into GATE, before we can add it to the application.
• Use the plugins manager to load the Tools plugin.
• Right click on Processing Resources and select “New” → “ANNIE VP Chunker”
• Leave all the default parameters set and click “OK”
Adding new PRs (2)
• Now we need to add the new PR to the application.
• Double click on ANNIE.
• You'll see the VP chunker is in the list of loaded PRs. This means it's available in GATE, but isn't yet contained in the application.
• Add it to the application by selecting it and using the right arrow to transfer it.
• Now use the up arrow to move it to the right place in the application. It should go after (below) the POS tagger but before (above) the NE transducer.
• Run the application and view the results on the document.
• You should see a new annotation type “VG”.
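Placing the new PR in the right position amounts to inserting an item into an ordered list. The Python sketch below shows this with a hypothetical `insert_after` helper; the PR names in the list are illustrative labels, and the exact component list of ANNIE in your GATE version may differ.

```python
def insert_after(pipeline: list, new_pr: str, after: str) -> list:
    """Return a copy of `pipeline` with `new_pr` placed right after `after`."""
    position = pipeline.index(after) + 1
    return pipeline[:position] + [new_pr] + pipeline[position:]


# Illustrative PR names only; the real ANNIE component list may differ.
annie = ["Tokeniser", "Gazetteer", "Sentence Splitter", "POS Tagger",
         "NE Transducer", "OrthoMatcher"]
extended = insert_after(annie, "VP Chunker", after="POS Tagger")
print(extended)
```

The position reflects the constraint stated on the slide: the chunker sits below the POS tagger and above the NE transducer in the execution order.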
Saving documents
• Using datastores
• Saving documents for use outside GATE
Types of datastores
• There are 2 types of datastore:
  • Serial datastores store data directly in a directory
  • Lucene datastores provide a searchable repository with Lucene-based indexing
• For now, we'll look at serial datastores
Create a new serial datastore
• Right click “Datastores” from the Resources pane and select “Create Datastore”
• Select “Serial Datastore”
• Create a new empty directory by clicking the “Create New Folder” icon and give your new directory a name
• Select this directory and click “Open”
• Now your datastore is ready to store your documents
Save documents to the datastore
• Right click on your corpus and select “Save to Datastore”
• Select the datastore that you just created
• Now close the corpus and document
• Double click on the name of the datastore in the Resources pane
• You should see the corpus and document
• Double click on them to load them back into GATE and view them
• They should contain the annotations you created previously
• You can remove things from the datastore by right clicking on their name in the datastore and selecting “Delete”
• You can add several corpora to the same datastore
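The save, close and reload cycle can be pictured as serialising each resource to its own file inside a directory. The Python sketch below is an analogy only: it uses `pickle` and invented `save` and `load` helpers, and it does not reproduce the on-disk format that GATE's serial datastore actually uses.

```python
import pickle
import tempfile
from pathlib import Path


def save(store: Path, name: str, resource: dict) -> None:
    """Serialise one resource to its own file inside the store directory."""
    (store / f"{name}.pkl").write_bytes(pickle.dumps(resource))


def load(store: Path, name: str) -> dict:
    """Read a previously saved resource back from the store directory."""
    return pickle.loads((store / f"{name}.pkl").read_bytes())


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp)
    doc = {"text": "John Smith visited London.",
           "annotations": [("Person", 0, 10)]}
    save(store, "doc1", doc)
    restored = load(store, "doc1")   # annotations come back with the text
    stored_names = sorted(p.stem for p in store.iterdir())

print(restored == doc, stored_names)
```

As in the hands-on exercise, what comes back from the store includes the annotations that were on the document when it was saved.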
Summary
• This first session has given you a guided tour of the GATE GUI
• We looked at language resources, datastores, applications and processing resources
• There are lots of other tools and options you can play with: see the User Guide for more info
• Next, we'll look at various NLP components, and further examine ANNIE, GATE's default Information Extraction system