BD003: Introduction to NLP

Post on 12-Oct-2020


Dr. Diana Maynard, University of Sheffield, UK

IADS Big Data and Analytics Summer School, 31 July 2017, University of Essex

Part 1: Introduction to GATE

Why GATE?

• GATE is the most widely used open source toolkit for NLP in the world
• We’re using it because it’s a great way to showcase all the core NLP components that are used for text analysis tasks
• You can play with all the tools in GATE and try out things for yourself to see how it works
• And also because we’re experts
• Developed at the University of Sheffield since 2000 (in its current form)
• The person who has led the development of the NLP tools in GATE since 2000 is the one presenting to you now :)
• And by the way, just because it’s old doesn’t mean it’s out of date. GATE is in constant development, with new technologies being constantly added.

What is GATE?

• Open-source software framework and set of ready solutions for text/natural language processing
• Re-usable abstractions for documents, format conversion, corpora, annotations, storage, algorithms, ...
• A graphical user interface to interactively develop solutions (GATE GUI, GATE Developer)
• A (Java) library providing a programming API for using the abstractions
• An infrastructure of pluggable components (GATE Plugins)
• Ready-made solutions to get you started
• Companion software for semantic search (Mimir)
• Scalable from laptop to massive processing on the cloud (including real-time stream processing)

About this tutorial

• This tutorial will get you started with the GATE graphical user interface (GUI), also known as “GATE Developer”
• It will be a hands-on session. Please try things out in GATE as the topics are presented.
• Things suggested for you to try yourself are in red.
• Start GATE on your computer now (if you haven't already) by double clicking the icon
• Please don't jump ahead: if you've already finished a task, perhaps you can help your neighbour if they get stuck.
• Please try to keep questions during the sessions related to the current topic
• There will be time at the end or in the breaks for more general questions

GATE GUI

[Screenshot of the GATE GUI, with labels: Menu Bar, Shortcut Buttons, Resources Pane, Resource Features, Messages, Display Pane]

Resources

• Most things you use within GATE are “resources”:
• Language resources (LRs) are documents, document collections, ontologies...
  • A collection of documents is known as a corpus
• Processing resources (PRs) are programs that operate on text within the documents, and often create or modify annotations
• Datastores are for storing documents and corpora for later use
• Applications (“pipelines”) are sequences of processing resources that run on one or more documents

Displaying Resources

• When you first open GATE, the display pane will show messages from the system in the “Messages” tab
• The display pane displays whatever elements you are currently working with, e.g. an application, a document or a processing resource, each in its own tab
• Double clicking on a resource in the resources pane will display it
• Tabs along the top of the display pane allow you to choose which of the open resources to display

Create New Document

• From the Resources pane, right click “Language Resources” → New → GATE Document
• Ignore the parameter settings that will be displayed
• Click OK
• “GATE Document_<id>” will now be added to “Language Resources”
• Double click that document name
• A tab is opened in the display pane, showing the empty document. You can enter some text there if you want.

Empty Document

[Screenshot of an empty document, with labels: Document Tab, Document Editor, Document Name, Document Editor Buttons, Document Resource Views]

Document Editor

• The Document Editor is shown as a new tab in the Display Pane, alongside the Messages tab
• There are buttons on the top of the Editor, e.g. “Annotation Sets” (we will learn about them later).
• There are tabs at the bottom of the Document Tab: these show different “Views” of the document.
• The small pane in the lower left shows the “document features” (optional information associated with the document resource as key/value pairs)

Simple operations on resources

• Right clicking on the name of a resource in the resources pane gives access to a menu of actions
• Double clicking on the name of a resource opens a view of the resource in the display pane (triple clicking the name can be used to rename)
• Selecting a resource instance and pressing the Delete (Mac: Fn+Backspace) key will generally close it
• You can also right click and then select “Close”

Parameters

• Resources can have parameters which need to be specified when the resource is created: Initialization (init) Parameters
• Processing resources can also have parameters which can be changed for each run: Runtime Parameters
• Init parameters specify how a resource is created, e.g. the location of a document to load
• Runtime parameters configure what a processing resource does, e.g. whether some processing is case-sensitive or not.
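
The split between the two kinds of parameter can be sketched in a few lines of plain Python. This is a toy model, not GATE's API: the class `ToyTokenMatcher`, its `word_list` argument and the `case_sensitive` flag are all made up for illustration. The word list plays the role of an init parameter (fixed when the resource is created), and the flag plays the role of a runtime parameter (free to change on every run).

```python
class ToyTokenMatcher:
    """Toy stand-in for a processing resource (hypothetical, not a GATE class)."""

    def __init__(self, word_list):
        # Init parameter: supplied once, when the resource is created.
        self.word_list = list(word_list)

    def run(self, text, case_sensitive=True):
        # Runtime parameter: can be set differently for each run.
        tokens = text.split()
        if case_sensitive:
            return [t for t in tokens if t in self.word_list]
        lowered = {w.lower() for w in self.word_list}
        return [t for t in tokens if t.lower() in lowered]


matcher = ToyTokenMatcher(["GATE", "ANNIE"])
strict = matcher.run("gate runs ANNIE", case_sensitive=True)
relaxed = matcher.run("gate runs ANNIE", case_sensitive=False)
```

The same `matcher` object gives different results on the two runs without being re-created, which is the point of keeping runtime parameters separate.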

Loading a document

• GATE can read and load documents in many formats: e.g. plain text, HTML, XML, PDF, Word, CoNLL, CSV, JSON
• GATE can load documents from files and from URLs
• When a document is loaded, it gets converted to GATE internal format as document text + annotations.

Loading a document

• To load a document:
  - right click on Language Resources → “New → GATE Document” OR
  - File menu → New Language Resource → GATE Document
• Use the sourceURL parameter to specify the document to be loaded:
  - type the filename or URL, or
  - click the file browser icon to navigate to the correct document.
• Load a file from your hands-on materials: corpora → news-texts → ft-airlines-27-jul-2001.xml
• Load a web page: for this, the http:// or https:// part of the URL is required, e.g. http://news.bbc.co.uk
• Note: if you use the BBC page above, we suggest picking a story and clicking on it to get a better document for processing, as the main news page contains mainly just links

Document viewer

[Screenshot of the document viewer, with labels: Document viewer buttons; Document; Highlighted tab is the resource currently being viewed]

Annotations

• Annotations are central to GATE
• Annotations represent aspects of the text you want to analyze: words, sentences, dates, person names
• Annotations are named by their type, e.g. “Person”
• An annotation consists of:
  • an annotation type
  • start and end offsets
  • a set of features, where each feature is an arbitrary name/value pair, e.g. orth="upperInitial"
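
The three parts of an annotation listed above map naturally onto a small record type. The following is a minimal sketch in plain Python, not GATE's own classes: the `Annotation` dataclass, its field names, the example sentence and the offsets are all assumptions made for illustration. Only the `orth` feature name and its value `upperInitial` come from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """Toy annotation: a type, a character span, and free-form features."""
    ann_type: str
    start: int          # offset of the first character covered
    end: int            # offset just past the last character covered
    features: dict = field(default_factory=dict)


text = "Diana Maynard works in Sheffield."

# A Token annotation carrying one feature (an arbitrary name/value pair).
token = Annotation("Token", 0, 5, {"orth": "upperInitial"})

# A Location annotation with no features; offsets are computed from the text.
place_start = text.index("Sheffield")
location = Annotation("Location", place_start, place_start + len("Sheffield"))


def covered_text(annotation):
    """Return the stretch of text that the annotation's offsets point at."""
    return text[annotation.start:annotation.end]
```

Note that the annotation does not store the text itself, only offsets into it, which is why several annotations of different types can cover the same span.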

Annotation Sets

• Annotations are grouped into sets
• Each set can contain any number of annotations of any type
• You can create and organize your annotation sets as you wish.
• Predefined sets:
  • Default set (empty name): cannot be deleted
  • “Original markups”: annotations from the markups in the file
  • “Key”: by convention, used for gold standard annotations
• Click the “Annotation Sets” button in the document viewer
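
One way to picture annotation sets is as a mapping from a set name to the annotations stored under that name, with the empty string standing for the default set. The sketch below is a toy model in plain Python, not GATE's API; the tuple layout, the offsets and the `title` and `paragraph` types are invented for the example.

```python
from collections import defaultdict

# Set name -> list of (type, start, end) tuples. "" is the default set.
annotation_sets = defaultdict(list)
annotation_sets[""].append(("Person", 0, 13))
annotation_sets["Key"].append(("Person", 0, 13))
annotation_sets["Original markups"].append(("paragraph", 0, 33))
annotation_sets["Original markups"].append(("title", 0, 13))


def types_in(set_name):
    """List the annotation types present in one set, alphabetically."""
    annotations = annotation_sets.get(set_name, [])
    return sorted({ann_type for ann_type, _, _ in annotations})
```

Here the same `Person` span appears once in the default set and once in “Key”, which mirrors the usual workflow of comparing system output against gold standard annotations kept in a separate set.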

Annotation Sets

[Screenshot of the Annotation Sets view, with labels: Default annotation set, Original markups annotation set, Annotation types, Document Viewer Buttons, Tabs]

Viewing annotations

• Clicking on the Annotation Sets button opens a new pane on the right hand side inside the document view (Annotation Sets view)
• The default (unnamed) set contains some examples of annotations
• Click on the ▶ to display the annotation types belonging to that set
• You should see types such as Location, Date, Person etc.
• Click the checkbox for an annotation type to view all the annotations of that type in the document

A closer look at the annotations

• Click the Annotations List button from the menu above the Display pane
• The table shows annotation type, annotation set, offsets, annotation id, and features (for all selected annotations)
• Select a row in the table to highlight the annotation in the text
• There are also other annotation views possible, such as the Annotation Stack and Coreference Editor

Annotations

[Screenshot of a document with annotations displayed, with labels: Date annotation, Annotations table]

Editing existing annotations

• Select an annotation type from the Annotation Sets view and hover over a highlighted annotation in the text
• A popup window displays more information about it: this is the annotation editor
• Click the drawing pin symbol at the top of the editor. This will “pin” the window open (you can still move the window around on your screen if you wish)
• Try editing the annotation: you can change the annotation type, feature names and values, the span of the annotation (clicking left and right arrows at the top of the box) or delete the annotation or its features (red Xs)
• Close the annotation editor by clicking the X in the top right corner, then view your edited annotation in the Annotation List

Annotation editor

[Screenshot of the annotation editor, with labels: Annotation type, feature name, value]

Creating a Corpus

• A corpus is a collection of documents.
• For most GATE applications, it is easier to work with a corpus rather than an individual document, even if that corpus only contains one document.
• Right click Language Resources → New → GATE Corpus
• OR
• File menu → New Language Resource → GATE Corpus
• As with the documents, you can name your corpus or use the default GATE name.

Adding documents to a corpus

1. With the init parameter: click the edit button and add documents that are already loaded in GATE to the corpus. Click OK when done.
or
2. Create an empty corpus. Open the corpus and use the + button to add documents, or drag them from the Resources pane, or populate it from a file directory (next slide)

• Double click on the corpus name to view the corpus.
• Double click the document listed there to view it.

Populating a Corpus (1)

• Usually, a corpus will consist of more than one document. Sometimes there could be hundreds of documents in a corpus.
• Using the populate function means you don't have to preload the documents in GATE first, and allows you to load all the documents into the corpus in one go
• To do this, let's first tidy up a bit
• It's best to keep the GATE GUI clutter-free by removing any unwanted resources and documents, or it can get a bit confusing
• Close all open documents and corpora

Populating a Corpus (2)

• Create a new empty corpus (don't add any documents to it yet)
• Right click on the corpus name in the Resources pane and select Populate
• Use the file browser icon to select the name of the directory with your documents (corpora/news-texts)
• All the documents will be loaded in one go
• View the contents of the corpus as before
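
Conceptually, populating a corpus means walking a directory and adding one document per file. The plain-Python sketch below imitates that idea only; it is not how GATE implements Populate. The function name `populate`, the `*.xml` pattern, the dictionary layout and the file names in the demo are all assumptions.

```python
import tempfile
from pathlib import Path


def populate(corpus, directory, pattern="*.xml"):
    """Append one toy document per matching file, in file-name order."""
    for path in sorted(Path(directory).glob(pattern)):
        corpus.append({"name": path.name,
                       "text": path.read_text(encoding="utf-8")})
    return corpus


# Demo on a throwaway directory with made-up files.
with tempfile.TemporaryDirectory() as folder:
    for name in ("b.xml", "a.xml", "notes.txt"):
        Path(folder, name).write_text("text of " + name, encoding="utf-8")
    corpus = populate([], folder)

names = [doc["name"] for doc in corpus]
```

The corpus starts empty and is filled in one call, and files that do not match the pattern (here `notes.txt`) are skipped.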

Processing Resources and Plugins

• Processing resources (PRs) are the tools that process and annotate text (text processing algorithms). Often this means creating or modifying annotations on the text.
• An “application” or “pipeline” consists of any number of PRs, run sequentially over a corpus of documents
• A plugin is a collection of PRs and other resources bundled together. For example, everything needed for IE in ANNIE is in the ANNIE plugin.
• An application can use PRs from one or more different plugins.
• In order to use PRs, you need to load the relevant plugin(s)
• Plugins are loaded via the Plugin Manager (green jigsaw piece icon)
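
The idea of a pipeline, PRs run one after another over every document in a corpus, can be sketched with ordinary functions. This is a toy model in plain Python, not GATE code: the two functions `tokeniser` and `capital_tagger`, the `Capitalised` annotation type and the document layout are invented for illustration.

```python
import re


def tokeniser(doc):
    """Toy PR: add a Token annotation for every whitespace-separated word."""
    for match in re.finditer(r"\S+", doc["text"]):
        doc["annotations"].append(("Token", match.start(), match.end()))


def capital_tagger(doc):
    """Toy PR: reads Token annotations and marks those starting upper-case."""
    for ann_type, start, end in list(doc["annotations"]):
        if ann_type == "Token" and doc["text"][start].isupper():
            doc["annotations"].append(("Capitalised", start, end))


def run_pipeline(prs, corpus):
    """Run each PR, in order, over each document in the corpus."""
    for doc in corpus:
        for pr in prs:
            pr(doc)


corpus = [{"text": "GATE was built in Sheffield", "annotations": []}]
run_pipeline([tokeniser, capital_tagger], corpus)

doc = corpus[0]
capitalised = [doc["text"][s:e] for t, s, e in doc["annotations"]
               if t == "Capitalised"]
```

Order matters here: `capital_tagger` only finds something because `tokeniser` has already added Token annotations, which is typical of later PRs building on the output of earlier ones.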

Plugins

• Click the icon on the top GATE menu to open the Plugin Manager [or go via File → Manage CREOLE Plugins]
• Depending on your version of GATE, you may see a popup box:
• The user plugin folder is a folder on your computer where plugins other than those provided by GATE are stored

Plugins

[Screenshot of the Plugin Manager, with labels: List of available plugins; Resources in the selected plugin; Load the plugin for this session only; Load the plugin every time GATE starts; Apply all the settings; Close the plugins manager]

Plugins

• Select a plugin to see (on the right-hand side) the names of the resources it contains
• Check the relevant “Load Now” box to load a plugin of your choice
• Click “Apply All” to load the selected plugin
• Click “Close”
• Right click on Processing Resources to see which new PRs are now available

Applications

Here's one I made earlier: ANNIE

• ANNIE is a ready-made collection of PRs that performs Information Extraction on unstructured text.
• A detailed explanation of ANNIE will be given in the second part. For now, we're just going to use it as an example of an application.
• Later, we'll show you how to make your own application from scratch.
• Click the icon from the top GATE menu OR select File → Load ANNIE system
• Select “with defaults”
• Load any document from the hands-on material and add it to a corpus

Running an application

View the ANNIE application by double clicking on it.

[Screenshot of the application view, with labels: PRs selected in application (in order of their execution); Corpus on which the application is executed; Runtime parameters of the selected PR; Execute the application]

Viewing the results

• When a message appears in the bottom left corner of your GATE window saying something like “ANNIE run in 1.3 seconds”, the application has finished.
• Double click on the document to view it
• View the annotations by selecting Annotation Sets and clicking on any annotation types in the Default (unnamed) set
• If you want, you can view the annotations table too.
• Remember that not all the results will be perfect! Later in the course, you'll learn more about the causes of these errors.

Adding new PRs (1)

• Let's add a Verb Phrase Chunker PR to ANNIE.
• First, we have to load the plugin that contains it, and then load the PR into GATE, before we can add it to the application.
• Use the plugins manager to load the Tools plugin.
• Right click on Processing Resources and select “New” → “ANNIE VP Chunker”
• Leave all the default parameters set and click “OK”

Adding new PRs (2)

• Now we need to add the new PR to the application.
• Double click on ANNIE.
• You'll see the VP chunker is in the list of loaded PRs. This means it's available in GATE, but isn't yet contained in the application.
• Add it to the application by selecting it and using the right arrow to transfer it.
• Now use the up arrow to move it to the right place in the application. It should go after (below) the POS tagger but before (above) the NE transducer.
• Run the application and view the results on the document.
• You should see a new annotation type “VG”.
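
Placing the new PR in the right position amounts to inserting it into an ordered list. The sketch below treats the application as a plain Python list of labels; the labels are shorthand chosen for this example rather than exact GATE resource names, and the listed order is an assumption about a typical default ANNIE pipeline.

```python
# The application modelled as an ordered list of PR labels (illustrative only).
pipeline = [
    "Document Reset",
    "Tokeniser",
    "Gazetteer",
    "Sentence Splitter",
    "POS Tagger",
    "NE Transducer",
    "OrthoMatcher",
]

# Insert the chunker directly after the POS tagger, so it runs before the
# NE transducer.
pipeline.insert(pipeline.index("POS Tagger") + 1, "VP Chunker")

position = pipeline.index("VP Chunker")
```

Anchoring the insert on the POS tagger's position, rather than on a fixed index, keeps the chunker in the right place even if PRs are added or removed earlier in the list.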

Saving documents

• Using datastores
• Saving documents for use outside GATE

Types of datastores

• There are 2 types of datastore:
  • Serial datastores store data directly in a directory
  • Lucene datastores provide a searchable repository with Lucene-based indexing
• For now, we'll look at serial datastores

Create a new serial datastore

• Right click “Datastores” from the Resources pane and select “Create Datastore”
• Select “Serial Datastore”
• Create a new empty directory by clicking the “Create New Folder” icon and give your new directory a name
• Select this directory and click “Open”
• Now your datastore is ready to store your documents

Save documents to the datastore

• Right click on your corpus and select “Save to Datastore”
• Select the datastore that you just created
• Now close the corpus and document
• Double click on the name of the datastore in the Resources pane
• You should see the corpus and document
• Double click on them to load them back into GATE and view them
• They should contain the annotations you created previously
• You can remove things from the datastore by right clicking on their name in the datastore and selecting “Delete”
• You can add several corpora to the same datastore
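
The save, close and reload cycle above is a round trip through a directory on disk. The following plain-Python sketch imitates that cycle with JSON files; this is a stand-in chosen for readability, not the storage format GATE uses, and the function names, the file naming scheme and the sample document are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def save_to_datastore(directory, name, document):
    """Write one toy document into the datastore directory."""
    Path(directory).mkdir(parents=True, exist_ok=True)
    target = Path(directory) / f"{name}.json"
    target.write_text(json.dumps(document), encoding="utf-8")


def load_from_datastore(directory, name):
    """Read a previously saved toy document back from the directory."""
    source = Path(directory) / f"{name}.json"
    return json.loads(source.read_text(encoding="utf-8"))


document = {
    "text": "GATE was built in Sheffield",
    "annotations": [["Location", 18, 27]],
}

with tempfile.TemporaryDirectory() as folder:
    save_to_datastore(folder, "doc1", document)
    restored = load_from_datastore(folder, "doc1")
```

The reloaded document carries its annotations with it, which is the behaviour to check for after reopening a document from a datastore.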

Summary

• This first session has given you a guided tour of the GATE GUI
• We looked at language resources, datastores, applications and processing resources
• There are lots of other tools and options you can play with: see the User Guide for more info
• Next, we'll look at various NLP components, and further examine ANNIE, GATE's default Information Extraction system