BD003: Introduction to NLP Dr. Diana Maynard University of Sheffield, UK IADS Big Data and Analytics Summer School, 31 July 2017, University of Essex

BD003: Introduction to NLP

Dr. Diana Maynard, University of Sheffield, UK

IADS Big Data and Analytics Summer School, 31 July 2017, University of Essex


Part 1: Introduction to GATE


Why GATE?

• GATE is the most widely used open source toolkit for NLP in the world
• We’re using it because it’s a great way to showcase all the core NLP components that are used for text analysis tasks
• You can play with all the tools in GATE and try out things for yourself to see how it works
• And also because we’re experts
• Developed at the University of Sheffield since 2000 (in its current form)
• The person who has led the development of the NLP tools in GATE since 2000 is the one presenting to you now :)
• And by the way, just because it’s old doesn’t mean it’s out of date. GATE is in constant development with new technologies being constantly added.


What is GATE?

• Open-source software framework and set of ready solutions for text/natural language processing
• Re-usable abstractions for documents, format conversion, corpora, annotations, storage, algorithms, ...
• A graphical user interface to interactively develop solutions (GATE GUI, GATE Developer)
• A (Java) library providing a programming API for using the abstractions
• An infrastructure of pluggable components (GATE Plugins)
• Ready-made solutions to get you started
• Companion software for semantic search (Mimir)
• Scalable from laptop to massive processing on the cloud (including real-time stream processing)


About this tutorial

• This tutorial will get you started with the GATE graphical user interface (GUI), also known as “GATE Developer”
• It will be a hands-on session. Please try things out in GATE as the topics are presented.
• Things suggested for you to try yourself are in red.
• Start GATE on your computer now (if you haven't already) by double clicking the icon
• Please don't jump ahead: if you're already finished with a task, perhaps you can help your neighbour if they get stuck.
• Please try to keep questions during the sessions related to the current topic
• There will be time at the end or in the breaks for more general questions


GATE GUI

[Screenshot of the GATE GUI with labels: Menu Bar, Shortcut Buttons, Resources Pane, Resource Features, Messages, Display Pane]


Resources

• Most things you use within GATE are “resources”:
• Language resources (LRs) are documents, document collections, ontologies...
  • A collection of documents is known as a corpus
• Processing resources (PRs) are programs that operate on text within the documents, and often create or modify annotations
• Datastores are for storing documents and corpora for later use
• Applications (“pipelines”) are sequences of processing resources that run on one or more documents


Displaying Resources

• When you first open GATE, the display pane will show messages from the system in the “Messages” tab
• The display pane displays whatever elements you are currently working with, e.g. an application, a document or a processing resource, each in its own tab
• Double clicking on a resource in the resources pane will display it
• Tabs along the top of the display pane allow you to choose which of the open resources to display


Create New Document

• From the Resources Pane, right click “Language Resources” → New → GATE Document
• Ignore the parameter settings that will be displayed
• Click OK
• “GATE Document_<id>” will now be added to “Language Resources”
• Double click that document name
• A tab is opened in the display pane, showing the empty document. You can enter some text there if you want.


Empty Document

[Screenshot of an empty document with labels: Document Tab, Document Name, Document Editor, Document Editor Buttons, Document Resource Views]


Document Editor

• The Document Editor is shown as a new Tab in the Display Pane, alongside the Message Pane
• There are buttons on the top of the Editor, e.g. “Annotation Sets”: we will learn about them later.
• There are tabs at the bottom of the Document Tab: these show different “Views” of the document.
• The small pane in the lower left shows the “document features” (optional information associated with the document resource as key/value pairs)


Simple operations on resources

• Right clicking on the name of a resource in the resources pane gives access to a menu of actions
• Double clicking on the name of a resource opens a view of the resource in the display pane (triple clicking the name can be used to rename it)
• Selecting a resource instance and pressing the Delete key (Mac: Fn+Backspace) will generally close it
• You can also right click and then select “Close”


Parameters

• Resources can have parameters which need to be specified when the resource is created: Initialization (init) Parameters
• Processing resources can also have parameters which can be changed for each run: Runtime Parameters
• Init parameters specify how a resource is created, e.g. the location of a document to load
• Runtime parameters configure what a processing resource does, e.g. whether some processing is case-sensitive or not.
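
The difference between the two kinds of parameter can be sketched in plain Java. This is only an illustration under stated assumptions: `ParameterSketch`, `KeywordCounter` and `setCaseSensitive` are hypothetical names invented here and are not part of the GATE API. The keyword list plays the role of an init parameter (fixed at creation), while case sensitivity plays the role of a runtime parameter (changeable between runs).

```java
import java.util.List;

// Hypothetical sketch: these names are NOT part of the GATE API.
class ParameterSketch {

    /** A toy "processing resource" that counts keyword matches in a text. */
    static final class KeywordCounter {
        // Init parameter: supplied once, when the resource is created.
        private final List<String> keywords;
        // Runtime parameter: may be changed before each run.
        private boolean caseSensitive = true;

        KeywordCounter(List<String> keywords) {
            this.keywords = List.copyOf(keywords);
        }

        void setCaseSensitive(boolean caseSensitive) {
            this.caseSensitive = caseSensitive;
        }

        /** Counts whitespace-separated tokens that match any keyword. */
        int run(String text) {
            int count = 0;
            for (String token : text.split("\\s+")) {
                for (String keyword : keywords) {
                    boolean match = caseSensitive
                            ? token.equals(keyword)
                            : token.equalsIgnoreCase(keyword);
                    if (match) {
                        count++;
                    }
                }
            }
            return count;
        }
    }

    public static void main(String[] args) {
        KeywordCounter counter = new KeywordCounter(List.of("GATE"));
        String text = "GATE and gate and Gate";
        System.out.println("case-sensitive matches: " + counter.run(text));
        counter.setCaseSensitive(false); // same resource, different runtime setting
        System.out.println("case-insensitive matches: " + counter.run(text));
    }
}
```

The same object is run twice: only the runtime setting changes, while the init parameter would require creating a new resource.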


Loading a document

• GATE can read and load documents in many formats: e.g. plain text, HTML, XML, PDF, Word, CoNLL, CSV, JSON
• GATE can load documents from files and from URLs
• When a document is loaded, it gets converted to GATE internal format as document text + annotations.


Loading a document

• To load a document:
  - right click on Language Resources → New → GATE Document
  OR
  - File menu → New Language Resource → GATE Document
• Use the sourceURL parameter to specify the document to be loaded:
  - type the filename or URL, or
  - click the file browser icon to navigate to the correct document.
• Load a file from your hands-on materials: corpora → news-texts → ft-airlines-27-jul-2001.xml
• Load a web page – for this the http:// or https:// part of the URL is required, e.g. http://news.bbc.co.uk
• Note: if you use the BBC page above, we suggest picking a story and clicking on it to get a better document for processing, as the main news page contains mainly just links


Document viewer

[Screenshot of the document viewer with labels: Document viewer buttons; Document; Highlighted tab is the resource currently being viewed]


Annotations

• Annotations are central to GATE
• Annotations represent aspects of the text you want to analyze: words, sentences, Dates, Person Names
• Annotations are named by their type, e.g. “Person”
• An annotation consists of
  • Annotation type
  • start and end offsets
  • set of features, each feature is an arbitrary name/value pair, e.g. orth="upperInitial"
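
As a rough illustration of that three-part structure, here is a minimal plain-Java sketch. The class `AnnotationSketch.Annotation` and its field names are hypothetical and are not GATE's own classes; the sketch also assumes that the end offset is exclusive (as in Java's `substring`) and that feature values are strings, neither of which the slides state.

```java
import java.util.Map;

// Hypothetical sketch: these names are NOT part of the GATE API.
class AnnotationSketch {

    /** An annotation: a type, a character span, and a set of features. */
    static final class Annotation {
        final String type;
        final int start; // offset of the first character covered
        final int end;   // assumption: offset just past the last character covered
        final Map<String, String> features; // arbitrary name/value pairs

        Annotation(String type, int start, int end, Map<String, String> features) {
            if (start < 0 || end < start) {
                throw new IllegalArgumentException("invalid offsets");
            }
            this.type = type;
            this.start = start;
            this.end = end;
            this.features = Map.copyOf(features);
        }

        /** Returns the piece of document text that this annotation covers. */
        String coveredText(String documentText) {
            return documentText.substring(start, end);
        }
    }

    public static void main(String[] args) {
        String text = "Diana Maynard works in Sheffield.";
        Annotation person =
                new Annotation("Person", 0, 13, Map.of("orth", "upperInitial"));
        System.out.println(person.type + ": " + person.coveredText(text)
                + " " + person.features);
    }
}
```

Note that the annotation does not copy the text it covers; it only points into the document by offsets, which is why editing the span in the GUI changes what the annotation highlights.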


Annotation Sets

• Annotations are grouped into sets
• Each set can contain any number of annotations of any type
• You can create and organize your annotation sets as you wish.
• Predefined sets
  • Default set (empty name): cannot be deleted
  • “Original markups”: annotations from the markups in the file
  • “Key”: by convention, used for gold standard annotations
• Click the “Annotation Sets” button in the document viewer
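
One way to picture this grouping is as a map from set name to the annotations in that set, with the default set stored under the empty name. The following plain-Java sketch is an assumption-laden illustration only: `AnnotationSetSketch` and its methods are hypothetical names, not the GATE API, and each annotation is reduced to its type name for brevity.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: these names are NOT part of the GATE API.
class AnnotationSetSketch {
    static final String DEFAULT_SET = ""; // the default set has an empty name

    // set name -> annotation types in that set (offsets and features omitted)
    private final Map<String, List<String>> sets = new LinkedHashMap<>();

    AnnotationSetSketch() {
        sets.put(DEFAULT_SET, new ArrayList<>()); // always present
    }

    /** Adds an annotation to the named set, creating the set if needed. */
    void add(String setName, String annotationType) {
        sets.computeIfAbsent(setName, name -> new ArrayList<>()).add(annotationType);
    }

    /** Returns the annotations in the named set (empty if there is no such set). */
    List<String> get(String setName) {
        return sets.getOrDefault(setName, List.of());
    }

    /** Removes a named set; the default set cannot be deleted. */
    boolean remove(String setName) {
        if (DEFAULT_SET.equals(setName)) {
            return false;
        }
        return sets.remove(setName) != null;
    }

    public static void main(String[] args) {
        AnnotationSetSketch doc = new AnnotationSetSketch();
        doc.add(DEFAULT_SET, "Date");
        doc.add("Key", "Person"); // "Key": gold standard annotations, by convention
        System.out.println("default set: " + doc.get(DEFAULT_SET));
        System.out.println("Key set: " + doc.get("Key"));
        System.out.println("default set removed? " + doc.remove(DEFAULT_SET));
    }
}
```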


Annotation Sets

[Screenshot of the Annotation Sets view with labels: Default annotation set, Original markups annotation set, Annotation types, Document Viewer Buttons, Tabs]


Viewing annotations

• Clicking on the Annotation Sets button opens a new pane on the right hand side inside the document view (Annotation Sets view)
• Default (unnamed) set contains some examples of annotations
• Click on the ▶ to display the annotation types belonging to that set
• You should see types such as Location, Date, Person etc.
• Click the checkbox for an annotation type to view all the annotations of that type in the document


A closer look at the annotations

• Click the Annotations List button from the menu above the Display pane
• The table shows annotation type, annotation set, offsets, annotation id, and features (for all selected annotations)
• Select a row in the table to highlight the annotation in the text
• There are also other annotation views possible such as the Annotation Stack and Coreference Editor


Annotations

[Screenshot of a document with labels: Date annotation, Annotations table]


Editing existing annotations

• Select an annotation type from the Annotation Sets view and hover over a highlighted annotation in the text
• A popup window displays more information about it: this is the annotation editor
• Click the drawing pin symbol at the top of the editor. This will “pin” the window open (you can still move the window around on your screen if you wish)
• Try editing the annotation: you can change the annotation type, feature names and values, the span of the annotation (clicking left and right arrows at the top of the box) or delete the annotation or its features (red Xs)
• Close the annotation editor by clicking the X in the top right corner, then view your edited annotation in the Annotation List


Annotation editor

[Screenshot of the annotation editor with labels: annotation editor, Annotation type, feature name, value]


Creating a Corpus

• A corpus is a collection of documents.
• For most GATE applications, it is easier to work with a corpus rather than an individual document, even if that corpus only contains one document.
• Right click Language Resources → New → GATE Corpus
• OR
• File menu → New Language Resource → GATE Corpus
• As with the documents, you can name your corpus or use the default GATE name.


Adding documents to a corpus

1. With the init parameter: click the edit button and add documents that are already loaded in GATE to the corpus. Click OK when done.
or
2. Create an empty corpus. Open the corpus and use the + button to add documents, or drag them from the Resources pane, or populate it from a file directory (next slide)

• Double click on the corpus name to view the corpus.
• Double click the document listed there to view it.


Populating a Corpus (1)

• Usually, a corpus will consist of more than one document. Sometimes there could be hundreds of documents in a corpus.
• Using the populate function means you don't have to preload the documents in GATE first, and allows you to load all the documents into the corpus in one go
• To do this, let's first tidy up a bit
• It's best to keep GATE GUI clutter-free by removing any unwanted resources and documents, or it can get a bit confusing
• Close all open documents and corpora


Populating a Corpus (2)

• Create a new empty corpus (don't add any documents to it yet)
• Right click on the corpus name in the Resources pane and select Populate
• Use the file browser icon to select the name of the directory with your documents (corpora/news-texts)
• All the documents will be loaded in one go
• View the contents of the corpus as before
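
Conceptually, populating a corpus means walking a directory and loading each file as one document. The sketch below shows that idea using only the Java standard library; it does not use the GATE API. `PopulateSketch`, `populate` and `createDemoDirectory` are hypothetical names, the "corpus" is reduced to a list of document texts, and the two demo files are invented stand-ins for the hands-on materials.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch using only the Java standard library, NOT the GATE API.
class PopulateSketch {

    /** Loads every regular file in the directory as one document text, in path order. */
    static List<String> populate(Path directory) {
        List<String> corpus = new ArrayList<>();
        try (Stream<Path> entries = Files.list(directory)) {
            List<Path> files = entries
                    .filter(Files::isRegularFile)
                    .sorted()
                    .collect(Collectors.toList());
            for (Path file : files) {
                corpus.add(Files.readString(file)); // one file becomes one document
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return corpus;
    }

    /** Creates a temporary directory with two small invented files for the demo. */
    static Path createDemoDirectory() {
        try {
            Path dir = Files.createTempDirectory("news-texts");
            Files.writeString(dir.resolve("a.txt"), "First story");
            Files.writeString(dir.resolve("b.txt"), "Second story");
            return dir;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<String> corpus = populate(createDemoDirectory());
        System.out.println(corpus.size() + " documents loaded: " + corpus);
    }
}
```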


Processing Resources and Plugins


Processing Resources and Plugins

• Processing resources (PRs) are the tools that process and annotate text (text processing algorithms). Often this means creating or modifying annotations on the text.
• An “application” or “pipeline” consists of any number of PRs, run sequentially over a corpus of documents
• A plugin is a collection of PRs and other resources bundled together. For example, everything needed for information extraction (IE) in ANNIE is in the ANNIE plugin.
• An application can use PRs from one or more different plugins.
• In order to use PRs, you need to load the relevant plugin(s)
• Plugins are loaded via the Plugin Manager (green jigsaw piece icon)
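
The idea of PRs running sequentially over a corpus can be sketched as a pair of nested loops. This is a plain-Java illustration, not GATE code: `PipelineSketch`, `Document`, `ProcessingResource` and `run` are hypothetical names, and the two toy PRs in `main` are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: these names are NOT part of the GATE API.
class PipelineSketch {

    /** A toy document: its text plus the annotation labels added so far. */
    static final class Document {
        final String text;
        final List<String> annotations = new ArrayList<>();

        Document(String text) {
            this.text = text;
        }
    }

    /** A processing resource reads a document and usually adds or changes annotations. */
    interface ProcessingResource {
        void execute(Document document);
    }

    /** Runs every PR, in list order, over every document in the corpus. */
    static void run(List<ProcessingResource> pipeline, List<Document> corpus) {
        for (Document document : corpus) {
            for (ProcessingResource pr : pipeline) {
                pr.execute(document);
            }
        }
    }

    public static void main(String[] args) {
        // First PR: records how many whitespace-separated tokens the text has.
        ProcessingResource tokenCounter =
                d -> d.annotations.add("tokens=" + d.text.split("\\s+").length);
        // Second PR: depends on what earlier PRs have already produced.
        ProcessingResource summariser =
                d -> d.annotations.add("earlierAnnotations=" + d.annotations.size());

        List<Document> corpus = List.of(
                new Document("GATE is open source"),
                new Document("ANNIE extracts information"));
        run(List.of(tokenCounter, summariser), corpus);
        for (Document d : corpus) {
            System.out.println(d.text + " -> " + d.annotations);
        }
    }
}
```

Because later PRs can read what earlier ones produced, the order of the list matters, which is why the GUI lets you reorder PRs within an application.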


Plugins

• Click the icon on the top GATE menu to open the Plugin Manager [or go via File → Manage CREOLE Plugins]
• Depending on your version of GATE, you may see a popup box:

[Screenshot of the popup box]

• The user plugin folder is a folder on your computer where plugins other than those provided by GATE are stored


Plugins

[Screenshot of the Plugin Manager with labels: List of available plugins; Resources in the selected plugin; Load the plugin for this session only; Load the plugin every time GATE starts; Apply all the settings; Close the plugins manager]


Plugins

• Select a plugin to see (on the right-hand side) the names of the resources it contains
• Check the relevant “Load Now” box to load a plugin of your choice
• Click “Apply All” to load the selected plugin
• Click “Close”
• Right click on Processing Resources to see which new PRs are now available


Applications


Here's one I made earlier: ANNIE

• ANNIE is a ready made collection of PRs that performs Information Extraction on unstructured text.
• A detailed explanation of ANNIE will be given in the second part. For now, we're just going to use it as an example of an application.
• Later, we'll show you how to make your own application from scratch.
• Click the icon from the top GATE menu OR select File → Load ANNIE system
• Select “with defaults”
• Load any document from the hands-on material and add it to a corpus


Running an application

View the ANNIE application by double clicking on it

[Screenshot of the application view with labels: PRs selected in application (in order of their execution); Corpus on which the application is executed; Runtime parameters of the selected PR; Execute the application]


Viewing the results

• When a message appears in the bottom left corner of your GATE window saying something like “ANNIE run in 1.3 seconds”, the application has finished.
• Double click on the document to view it
• View the annotations by selecting Annotation Sets and clicking on any Annotation types in the Default (unnamed) set
• If you want, you can view the annotations table too.
• Remember that not all the results will be perfect! Later in the course, you'll learn more about the causes of these errors.


Adding new PRs (1)

• Let's add a Verb Phrase Chunker PR to ANNIE.
• First, we have to load the plugin that contains it, and then load the PR into GATE, before we can add it to the application.
• Use the plugins manager to load the Tools plugin.
• Right click on Processing Resources and select “New” → “ANNIE VP Chunker”
• Leave all the default parameters set and click “OK”


Adding new PRs (2)

• Now we need to add the new PR to the application.
• Double click on ANNIE.
• You'll see the VP chunker is in the list of loaded PRs. This means it's available in GATE, but isn't yet contained in the application.
• Add it to the application by selecting it and using the right arrow to transfer it.
• Now use the up arrow to move it to the right place in the application. It should go after (below) the POS tagger but before (above) the NE transducer.
• Run the application and view the results on the document.
• You should see a new annotation type “VG”.
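
Placing the new PR "after the POS tagger but before the NE transducer" amounts to inserting into an ordered list at a chosen position. Here is a minimal plain-Java sketch of that step, with PRs reduced to name strings. `InsertPrSketch` and `insertAfter` are hypothetical names, not the GATE API, and the short pipeline in `main` is illustrative rather than ANNIE's full PR list.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: these names are NOT part of the GATE API.
class InsertPrSketch {

    /** Returns a copy of the pipeline with newPr placed directly after the PR named `after`. */
    static List<String> insertAfter(List<String> pipeline, String newPr, String after) {
        int index = pipeline.indexOf(after);
        if (index < 0) {
            throw new IllegalArgumentException("PR not in pipeline: " + after);
        }
        List<String> result = new ArrayList<>(pipeline);
        result.add(index + 1, newPr); // later PRs shift down by one position
        return result;
    }

    public static void main(String[] args) {
        // Illustrative pipeline only; not the complete list of PRs in ANNIE.
        List<String> pipeline = List.of("Tokeniser", "POS Tagger", "NE Transducer");
        System.out.println(insertAfter(pipeline, "VP Chunker", "POS Tagger"));
    }
}
```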


Saving documents

• Using datastores
• Saving documents for use outside GATE


Types of datastores

• There are 2 types of datastore:
  • Serial datastores store data directly in a directory
  • Lucene datastores provide a searchable repository with Lucene-based indexing
• For now, we'll look at serial datastores


Create a new serial datastore

• Right click “Datastores” from the Resources pane and select “Create Datastore”
• Select “Serial Datastore”
• Create a new empty directory by clicking the “Create New Folder” icon and give your new directory a name
• Select this directory and click “Open”
• Now your datastore is ready to store your documents


Save documents to the datastore

• Right click on your corpus and select “Save to Datastore”
• Select the datastore that you just created
• Now close the corpus and document
• Double click on the name of the datastore in the Resources pane
• You should see the corpus and document
• Double click on them to load them back into GATE and view them
• They should contain the annotations you created previously
• You can remove things from the datastore by right clicking on their name in the datastore and selecting “Delete”
• You can add several corpora to the same datastore
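
The save, close and reload cycle above can be imitated with standard Java serialization into a directory. This sketch is only an analogy for how a directory-backed store behaves: `SerialStoreSketch`, its methods and the one-file-per-resource `.ser` naming are assumptions made for illustration, and they do not describe GATE's actual on-disk layout or API.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: NOT the GATE API and NOT GATE's on-disk format.
class SerialStoreSketch {
    private final Path directory;

    SerialStoreSketch(Path directory) {
        this.directory = directory;
    }

    /** Creates a store backed by a fresh temporary directory. */
    static SerialStoreSketch inTempDirectory() {
        try {
            return new SerialStoreSketch(Files.createTempDirectory("datastore"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes the resource to its own file inside the store directory. */
    void save(String name, Serializable resource) {
        Path file = directory.resolve(name + ".ser");
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(resource);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads a previously saved resource back from the store directory. */
    Object load(String name) {
        Path file = directory.resolve(name + ".ser");
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        SerialStoreSketch store = inTempDirectory();
        ArrayList<String> corpus = new ArrayList<>(List.of("doc one", "doc two"));
        store.save("corpus", corpus);             // "Save to Datastore"
        System.out.println(store.load("corpus")); // load it back after closing
    }
}
```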


Summary

• This first session has given you a guided tour of the GATE GUI
• We looked at language resources, datastores, applications and processing resources
• There are lots of other tools and options you can play with: see the User guide for more info
• Next, we'll look at various NLP components, and further examine ANNIE, GATE's default Information Extraction system

