©2012 Paula Matuszek
CSC 9010: Text Mining Applications
Fall, 2012
Introduction to GATE
Dr. Paula Matuszek
Taken partially from a presentation by Lin Lin. http://iwayan.info/Research/Interoperability/Tutor_Workshop/AmitShethGlobalInfInfrastucture/Presentation/GATE.ppt
What is GATE?
Stands for General Architecture for Text Engineering.
Developed at the University of Sheffield.
Component-based architecture with data separated from applications; many discrete capabilities are included as plugins.
Who Uses GATE?
Scientists performing experiments that involve processing human language
Developers developing applications with language processing components
Teachers and students of courses about language and language computation
Us :-)
How Can GATE Help?
Specify an architecture, or organizational structure, for language processing software.
Provide a framework that implements the architecture and can be used to embed language processing capabilities in applications.
Provide a development environment, built on top of the framework, made up of convenient tools for developing components (plugins).
Really? Yeah, really.
It’s been under development for 15 years and is still under very active development.
Open-source, with dozens of developers, some of whom have been involved since the beginning.
Active community that provides good support:
– Mailing list: lists.sourceforge.net/lists/listinfo/gate-users
– Twitter: twitter.com/#!/GateAcUk
– LinkedIn: http://www.linkedin.com/groups/GATE-2230077
Many other text mining capabilities have been integrated with it.
An almost overwhelming amount of documentation
GATE Architecture Overview
http://gate.ac.uk/overview.html
GATE Product Family
GATE Developer: IDE for language processing, with information extraction and other plugins.
GATE Embedded: object library which can be included in applications.
GATE Teamware: collaborative annotation environment.
GATE Mimir: a “multiparadigm index” which supports semantic indexing and search.
GATE Wiki: “controllable wiki” based on Grails and Subversion.
GATE Cloud: GATE Embedded running on supercomputer hardware.
GATE Components
We will deal primarily with GATE Developer. It has four components:
– Applications: groups of processes to be run on a document or corpus.
– LanguageResources (LRs): entities such as lexicons, documents, corpora, annotation schemas, ontologies.
– ProcessingResources (PRs): tools that operate on unstructured text, such as parsers and tokenizers. These are mostly plugins.
– DataStores: saved processed documents and resources.
Overview of GATE Developer
GATE Developer
Resources Pane
– applications: groups of processes to run on a document or corpus
– language resources: corpus, ontologies, schemas
– processing resources: tools that operate on unstructured text
– datastores: saved documents and resources
Display Pane: whatever you’re currently working with.
Setup Options
Configuration
– Appearance: font, skin
– Advanced:
– add space on markup (to make html and xml more readable)
– Save options and session on exit
– Insert append or prepend (for annotations)
– default browser (for user guide)
Input (?)
– default language
Language Resources
Language Resources can be of four kinds:
– Documents are modeled as content plus annotations plus features.
– A Corpus is a Java Set whose members are Documents.
– Annotations are organized in graphs, which are modeled as Java sets of Annotation.
– Schemas are XML schemas describing allowable annotations and features.
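The model above can be mirrored in a few lines of plain Python. This is only a rough sketch to make the shape of the data concrete, not GATE's API: the class names `Annotation` and `Document`, their fields, and the sample `kind` feature are hypothetical stand-ins, and GATE itself implements these resources as Java classes.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Annotation:
    """A typed span over the text (hypothetical stand-in, not GATE's class)."""
    start: int            # offset of the first character
    end: int              # offset one past the last character
    type: str             # e.g. "Token" or "Location"
    features: tuple = ()  # (name, value) pairs; a tuple keeps the object hashable


@dataclass(eq=False)      # compare and hash by identity so documents can live in a set
class Document:
    """Content plus annotations plus features."""
    content: str
    annotations: set = field(default_factory=set)  # set of Annotation
    features: dict = field(default_factory=dict)   # document-level features

    def text_of(self, ann: Annotation) -> str:
        """Return the stretch of content that an annotation covers."""
        return self.content[ann.start:ann.end]


doc = Document("GATE was developed at Sheffield.")
doc.annotations.add(Annotation(22, 31, "Location", (("kind", "city"),)))
corpus = {doc}  # a corpus is simply a set whose members are documents
```

Note that the annotations live beside the text and point into it by offset; the content itself is never modified.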
Document Processing in GATE
Document:
– Formats including XML, RTF, email, HTML, SGML, and plain text.
– Identified and converted into GATE annotation format.
– Processed by Processing Resources.
– Results stored in a serial data store (based on Java serialization) or indexed in a Lucene database.
– Can also be exported as XML.
New Document
Documents are converted to GATE format; they can be saved for future use or exported.
Language Resources --> New --> Document
Name: can be left blank, and it will be created automatically (no spaces) from filename+UniqueID.
Checkmarks: required.
– just leave defaults
– sourceURL
– can be a file (click the folder icon to browse)
– or actual URL (GATE will fetch it)
– or set to stringContent to put content in directly.
Encoding will probably be utf-8.
markupAware: process XML and HTML tags
Document Display
Double-click document
– Text (minus annotations if you chose markupAware)
– Annotation Sets
– from XML, HTML, previous annotation work
– different colors for different categories
– Annotations list
– annotations chosen in Sets pane
Creating a Corpus
To import new documents we name the corpus and create it without any documents.
Language Resources --> New --> Corpus
Right-click and populate
– choose directory, extensions, encoding
This will create the corpus and show the corpus and the individual documents in the Resources Pane.
GATE Corpus
Corpus Display Pane:
– Add documents to a corpus with + button which appears when a corpus is displayed.
– Remove with -. (Note: this removes them from corpus, not from Developer)
Documents can be included in multiple corpora.
A corpus can be created from a single concatenated file by specifying the documentRootElement. This makes sense, for instance, for XML documents.
CREOLE
A Collection of REusable Objects for Language Engineering
The set of resources integrated with GATE
All the resources are packaged as Java Archive (or ‘JAR’) files, plus some XML configuration data.
Managed in the Creole Plugin Manager
Processing Resources: ANNIE
A family of Processing Resources for language analysis included with GATE
Stands for A Nearly-New Information Extraction system.
Uses finite-state techniques to implement various tasks: tokenization, semantic tagging, verb phrase chunking, and so on.
ANNIE IE Modules
http://gate.ac.uk/sale/tao/splitch6.html#chap:annie
Some ANNIE Components
Tokenizer
Gazetteer: lists of entities
Sentence Splitter
Part of Speech Tagger
– produces a part-of-speech tag as an annotation on each word or symbol.
Semantic Tagger
ANNIE Component: Tokenizer
Token Types
– word, number, symbol, punctuation, and spaceToken.
A tokenizer rule has a left hand side and a right hand side.
Tokenizer Rule
Operations used on the LHS:
– | (or)
– * (0 or more occurrences)
– ? (0 or 1 occurrences)
– + (1 or more occurrences)
The RHS uses ’;’ as a separator, and has the following format:
{LHS} > {Annotation type};{attribute1}={value1};...;{attribute n}={value n}
Example Tokenizer Rule
"UPPERCASE_LETTER" "LOWERCASE_LETTER"* > Token;orth=upperInitial;kind=word;
The sequence must begin with an uppercase letter, followed by zero or more lowercase letters. This sequence will then be annotated as type “Token”. The attribute “orth” (orthography) has the value “upperInitial”; the attribute “kind” has the value “word”.
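To see what this rule does, here is a rough Python imitation. It is a sketch under stated assumptions, not the tokenizer's actual implementation: the LHS is hand-translated into a regular expression over ASCII letters (the real character classes are broader), word boundaries stand in for the tokenizer's own matching strategy, and the function names `parse_rhs` and `apply_rule` are made up for illustration.

```python
import re

# Assumed translation of the LHS: one uppercase letter, then zero or more
# lowercase letters, taken as a whole word.
UPPER_INITIAL = re.compile(r"\b[A-Z][a-z]*\b")


def parse_rhs(rhs: str) -> dict:
    """Split 'Type;attr=value;...' into an annotation type and its attributes."""
    parts = [p for p in rhs.split(";") if p]
    return {"type": parts[0], "features": dict(p.split("=", 1) for p in parts[1:])}


def apply_rule(text: str, rhs: str = "Token;orth=upperInitial;kind=word;") -> list:
    """Return one annotation (with start and end offsets) per match of the LHS."""
    template = parse_rhs(rhs)
    return [
        {
            "start": m.start(),
            "end": m.end(),
            "type": template["type"],
            "features": dict(template["features"]),  # separate copy per annotation
        }
        for m in UPPER_INITIAL.finditer(text)
    ]
```

Applied to "Paula uses GATE", this sketch annotates only "Paula": "uses" has no uppercase initial, and "GATE" has uppercase letters after the first, so neither matches this particular rule.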
ANNIE Component: Gazetteer
The gazetteer lists used are plain text files, with one entry per line.
Each list represents a set of names, such as names of cities, organizations, days of the week, etc.
Example Gazetteer List
A small section of the list for units of currency:
……
Ecu
European Currency Units
FFr
Fr
German mark
German marks
New Taiwan dollar
New Taiwan dollars
NT dollar
NT dollars
……
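A gazetteer lookup over such a list can be sketched as follows. This is an illustration of the idea only: the function names, the `major_type` label, and the greedy longest-match strategy with a whole-word check are assumptions made for the example, not a description of how ANNIE's gazetteer is implemented.

```python
def load_gazetteer(list_text: str) -> set:
    """Read a plain-text list with one entry per line, skipping blank lines."""
    return {line.strip() for line in list_text.splitlines() if line.strip()}


def _is_boundary(text: str, i: int) -> bool:
    """True if position i does not fall between two alphanumeric characters."""
    return i <= 0 or i >= len(text) or not (text[i - 1].isalnum() and text[i].isalnum())


def lookup(text: str, entries: set, major_type: str) -> list:
    """Scan left to right, preferring the longest entry; return (start, end, type)."""
    by_length = sorted(entries, key=len, reverse=True)
    found, pos = [], 0
    while pos < len(text):
        for entry in by_length:
            end = pos + len(entry)
            if (text.startswith(entry, pos)
                    and _is_boundary(text, pos) and _is_boundary(text, end)):
                found.append((pos, end, major_type))
                pos = end
                break
        else:
            pos += 1  # no entry starts here; move on one character
    return found
```

Trying the longest entries first means "German marks" is found as a whole rather than as "German mark" plus a stray "s", and the whole-word check keeps "Fr" from matching inside "Friday".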
ANNIE Component: Semantic Tagger
Based on the JAPE language; it contains rules that act on annotations assigned in earlier phases.
Produces outputs of annotated entities.
ANNIE Component: Sentence Splitter
Segments the text into sentences.
This module is required for the tagger.
The splitter uses a gazetteer list of abbreviations to help distinguish sentence-marking full stops from other kinds.
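The role of the abbreviation list can be illustrated with a small Python sketch. The list contents and the function name `split_sentences` are hypothetical, and the logic is deliberately simplistic compared with the real splitter; it only shows why knowing the abbreviations matters.

```python
import re

# Hypothetical abbreviation list; the real splitter reads its list from a gazetteer.
ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "Prof.", "e.g.", "i.e.", "etc."}


def split_sentences(text: str, abbreviations=ABBREVIATIONS) -> list:
    """Split on ., ! or ? unless the mark belongs to a known abbreviation."""
    sentences, start = [], 0
    for m in re.finditer(r"[.!?]", text):
        end = m.end()
        # The whitespace-delimited token that ends at this punctuation mark.
        token = text[start:end].split()[-1]
        if token in abbreviations:
            continue  # a full stop inside an abbreviation does not end a sentence
        if end < len(text) and not text[end].isspace():
            continue  # punctuation in the middle of a token, as in "e.g."
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)  # trailing text without final punctuation
    return sentences
```

Without the abbreviation check, "Dr. Matuszek" would be cut in two after "Dr."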
Example Using ANNIE
http://services.gate.ac.uk/annie/
More next week.
Viewing and Editing Annotations
We have looked at annotations, both added by ANNIE and extracted from tags in the document.
It is sometimes useful to examine closely and edit these annotations:
– you are using a small corpus and want them correct before you proceed with other tools
– you have a sample set that will be used for training or for quality assurance and they need to be accurate
– you are still developing the resources being used to tag documents.
Unrestricted Annotation Editing
We can change to an arbitrary different annotation type.
The process is:
– choose text to be annotated
– hover over it or right click. The annotation editor pops up.
– if you’re changing it, delete existing annotation
– add new annotation, by choosing or typing it in
Restricted Annotation Editing
Typically we want better consistency and control for our editing.
Use a schema to specify allowable annotation types and features.
GATE includes many predefined schemas
Located at <root>/plugins/ANNIE/resources/schema
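The idea of restricting edits with a schema can be sketched in Python. Everything here is hypothetical: GATE's schemas are XML files, whereas this example uses a plain dictionary with made-up types, features, and values purely to show what "allowable annotation types and features" means in practice.

```python
# Hypothetical schema: annotation type -> feature name -> allowed values
# (None means any value is accepted). Real GATE schemas are XML files.
SCHEMA = {
    "Person": {"gender": {"male", "female"}},
    "Location": {"locType": None},
}


def validate(ann_type: str, features: dict, schema=SCHEMA) -> list:
    """Return a list of problems; an empty list means the annotation is allowed."""
    if ann_type not in schema:
        return [f"type {ann_type!r} is not in the schema"]
    problems = []
    allowed = schema[ann_type]
    for name, value in features.items():
        if name not in allowed:
            problems.append(f"feature {name!r} is not allowed on {ann_type}")
        elif allowed[name] is not None and value not in allowed[name]:
            problems.append(f"value {value!r} is not allowed for {name}")
    return problems
```

A schema-driven editor applies the same kind of check up front, by offering only the permitted types and features in the first place.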
Schema Annotation Editor
CREOLE resource to let us use the schema for annotation editing
Enable in Manage CREOLE Plugins window (under File menu)
Select an annotation, hover or right-click
Different editor window, specifying allowable types and features
Choose new type or feature.
More on Schemas and Editing
You can also initiate editing by right-clicking on an annotation in the annotations list.
You can use multiple schemata in processing one document.
Create an Application with Processing Resources (PRs)
Applications model a control strategy for the execution of PRs.
Simple pipelines: group a set of PRs together in order and execute them in turn.
Corpus pipelines: open each document in the corpus in turn, set that document as a runtime parameter on each PR, run all the PRs on the corpus, then close the document.
We will do this during lab.
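The two control strategies can be sketched in Python. This is a toy model, not GATE code: the PR classes, the `document` attribute, and the `execute` method name are assumptions chosen for the illustration, and documents are plain dictionaries.

```python
class WhitespaceTokenizer:
    """Toy PR: splits the current document's text on whitespace."""
    document = None  # runtime parameter, set by the pipeline

    def execute(self):
        self.document["tokens"] = self.document["text"].split()


class TokenCounter:
    """Toy PR: relies on the tokens produced by an earlier PR."""
    document = None

    def execute(self):
        self.document["token_count"] = len(self.document["tokens"])


def run_simple_pipeline(prs):
    """Execute each PR once, in the order given."""
    for pr in prs:
        pr.execute()


def run_corpus_pipeline(prs, corpus):
    """For each document: hand it to every PR, run the PRs in order, release it."""
    for document in corpus:
        for pr in prs:
            pr.document = document  # set the runtime parameter
        run_simple_pipeline(prs)
        for pr in prs:
            pr.document = None      # stands in for closing the document
```

Order matters: `TokenCounter` only works because `WhitespaceTokenizer` ran before it on the same document, which is the same dependency ANNIE's components have on one another.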
Saving GATE Language Resources and Applications
Data Stores:
– save processed documents for additional use
– specialized folder on a hard drive
– Lucene database
– improve processing times for large collections of documents
Types of Data Store
Serial Data Store:
– based on Java’s serialization system.
– stored in a directory
Lucene Data Store (Lucene is an open-source indexing and search tool.)
– searchable repository
– Lucene-based indexing
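For a feel of what a serial data store does, here is an analogous sketch in Python, using the `pickle` module in place of Java serialization. The function names and the one-file-per-resource layout are assumptions for the example; GATE's actual on-disk layout is not shown here.

```python
import pickle
from pathlib import Path


def save_resource(store_dir, name: str, resource) -> Path:
    """Serialize one resource to its own file inside the store directory."""
    store = Path(store_dir)
    store.mkdir(parents=True, exist_ok=True)
    path = store / f"{name}.pkl"
    with path.open("wb") as fh:
        pickle.dump(resource, fh)
    return path


def load_resource(store_dir, name: str):
    """Read a previously saved resource back from the store directory."""
    with (Path(store_dir) / f"{name}.pkl").open("rb") as fh:
        return pickle.load(fh)
```

The point of such a store is that a processed document comes back with its annotations intact, so the processing does not have to be repeated.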
Saving in a datastore
Create a folder.
Right-click to get Create Datastore menu
This only creates the store. Save corpora or documents in the Language Resources pane.
Once saved, they can be
Saving as XML
Individual documents can also be saved directly.
– Special GATE XML format
– annotations are appended to the document, locations for tags are embedded in body
– Preserve original format
– use for XML or html.
– will save all original tags and everything selected in the annotations
– For a plain text file, embeds inline tags.
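Embedding inline tags in a plain text file amounts to something like the following Python sketch. It assumes annotations are non-overlapping (start, end, type) triples and uses a made-up function name; it is not GATE's exporter, and the tag names are simply the annotation types.

```python
from xml.sax.saxutils import escape


def to_inline_xml(text: str, annotations) -> str:
    """Wrap each annotated span in a tag named after its annotation type."""
    out, pos = [], 0
    for start, end, ann_type in sorted(annotations):
        if start < pos:
            raise ValueError("overlapping annotations are not handled in this sketch")
        out.append(escape(text[pos:start]))  # plain text before the span
        out.append(f"<{ann_type}>{escape(text[start:end])}</{ann_type}>")
        pos = end
    out.append(escape(text[pos:]))           # plain text after the last span
    return "".join(out)
```

Overlapping annotations cannot be nested cleanly as inline tags, which is one reason a standoff format that keeps annotations separate from the body is useful.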
Saving Applications
Save a set of processing resources and their parameters.
– Right-click, save application state.
– Append .xgapp for name
To export as a standalone, export as teamware
– bundles all needed files
– intended for teamware but can be used for sharing directly.
And LOTS more
GATE is an extraordinarily rich system. Some of the other CREOLE resources included in the standard distribution:
– Annotation Merging, Quality assurance summarizer for comparing annotations
– Web crawler, Information Retrieval, Key Phrase Extraction
– Machine learning
– Domain-specific taggers (e.g., chemistry)
– Resources for many languages
CREOLE plugins for integrating with many other systems. E.g.
– UIMA
– Wordnet
– Penn BioTagger
– OpenCalais
– OpenNLP
– LingPipe
More details at http://gate.ac.uk/gate/doc/plugins.html
Some Links
Home page is http://gate.ac.uk/
Some good short tutorial videos for getting started: http://gate.ac.uk/demos/developer-videos/ . These are only a few minutes each, so they’re fast. Version 6, but they don’t seem to be very different.
User Guide: http://gate.ac.uk/sale/tao/index.html . This is apparently for version 7.1, which is a development build, but again it seems to be fine.
Lots of documentation (“acres” of it): http://gate.ac.uk/documentation.html
The wiki: http://gate.ac.uk/wiki/
Some very nice course materials, with a lot more detail than we will cover, including a unit on sentiment analysis: http://gate.ac.uk/wiki/training-materials-2011.html
What Next?
In lab we will create a simple application and use it.
Next week we will go into a lot more detail on using ANNIE for information extraction.
Homework. (You knew that was coming...)
I’m not going to get into programming in GATE or the more advanced applications. This might be the best tool for some of your projects, though.