Date post: 28-Dec-2015
Flexible Interfaces in the Application of Language Technology to an eScience Corpus C.J. Rupp, Ann Copestake, Simone Teufel & Benjamin Waldron Computer Laboratory, University of Cambridge
Transcript
Page 1: Flexible Interfaces in the Application of Language Technology to an eScience Corpus C.J. Rupp, Ann Copestake, Simone Teufel & Benjamin Waldron Computer.

Flexible Interfaces in the Application of Language Technology to an eScience Corpus

C.J. Rupp, Ann Copestake, Simone Teufel & Benjamin Waldron

Computer Laboratory, University of Cambridge

Outline

Two key interfaces:

  SciXML: XML markup for the logical structure of research papers

  SAF: Standoff Annotation Formalism for diverse linguistic information

Both are coded in XML and designed for flexibility, but what that means is distinct in the two cases.

SciBorg Architecture

[Diagram. Inputs: RSC papers, Nature papers, IUCr papers, Biology and CL (PDF), all feeding SciXML. Components: OSCAR, RASP, ERG/PET, POS tagging, WSD, RMRS merge. Shared layer: standoff annotation. Tasks: anaphora, rhetorical analysis.]

SciBorg Corpus

A corpus of Chemistry research papers from three publishers: the Royal Society of Chemistry (RSC), the Nature Publishing Group (NPG), and the International Union of Crystallography (IUCr).

Provided in the publishers’ XML markup, but each publisher uses a distinct markup scheme.

Conversion to SciXML

[Diagram. Sources converted to SciXML: RSC papers, Nature papers, IUCr papers, Biology and CL (PDF), PLOS Biology papers.]

SciXML Interface Requirements

Extensible
  So we can add additional publications

Neutral
  So as not to compromise any IP issues

Compatible with existing software

Expressive enough
  For adequate rendering in applications

Rendering Issues

We assume an application will display the paper
  Probably in hypertext

We must retain enough information to do this effectively
  Previous versions of SciXML focused on the logical structure of scientific papers.

The Development of SciXML

Developed for a medical corpus (2000)
  Extracted from HTML web pages

Extended for a Computational Linguistics corpus
  First from LaTeX
  Then from PDF via OCR

Now defined as a RELAX NG schema

Legacy Issues

The original SciXML schema had to interpret formatting.
  Lacking any organisation by function
  Dictating a flat paragraph structure
  Collecting all floats and notes in end lists
  But excluding text formatting

Adapted from Publishers’ Markup

List and table formats

Inline text formatting

Functional paragraph types (e.g. Theorem)

Position markers for floats

Conversion by XSLT

Most constructs can be handled quite simply:

<xsl:template match="sec">

<DIV DEPTH="{@level}">

<xsl:apply-templates/>

</DIV>

</xsl:template>

This makes the conversion script virtually a stylesheet.
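
For readers without an XSLT processor to hand, the effect of the template above can be approximated in Python with the standard library. This is only a sketch, not the project's conversion script: the element name `sec` and its `level` attribute come from the slide, while the sample fragment, its `p` child elements and the function name `convert_sections` are invented for illustration.

```python
import xml.etree.ElementTree as ET


def convert_sections(elem):
    """Return a copy of elem in which every <sec level="n"> becomes <DIV DEPTH="n">.

    All other elements are copied unchanged. That identity behaviour is an
    assumption; the slide shows only the rule for sec.
    """
    if elem.tag == "sec":
        out = ET.Element("DIV", {"DEPTH": elem.get("level", "")})
    else:
        out = ET.Element(elem.tag, dict(elem.attrib))
    out.text = elem.text
    out.tail = elem.tail
    # Recursing over the children plays the role of <xsl:apply-templates/>.
    for child in elem:
        out.append(convert_sections(child))
    return out


# Hypothetical publisher fragment, made up for this example.
source = '<sec level="1"><p>Intro</p><sec level="2"><p>Details</p></sec></sec>'
converted = ET.tostring(convert_sections(ET.fromstring(source)), encoding="unicode")
print(converted)
```

As in the XSLT version, nesting is preserved automatically because the rule is applied recursively to the children of each section.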

Schema Development

Both the XSLT stylesheet and the RNG schema have been developed on a naïve basis.
  Coding conversions for constructs that occur in the corpus

Eventually we have a big enough bag of tricks to make extension quite painless.

SciXML Constructs

Paper identifiers
  Unique identifiers, titles and authors

Sections
  Divisions embed recursively with headers

Inline text markup
  Font settings and LaTeX inclusion

Paragraph structure
  Paragraph elements and sub-paragraph boundaries in lists, abstracts, captions, etc.

SciXML Constructs

Citations and cross references
  Citations are significant, but we also need textual cross references, compound references, footnote markers and float markers.

Equations and examples
  (Linguistic) examples and equation environments

Lists, tables and figures
  Lists (including definition lists), tables, figures, and various other sections for (external) data.

Bibliography
  The bibliography section is important for citation tracking

RNG Schema (Fragment)

<define name="PAPER.ELEMENT">
  <element name="PAPER">
    <ref name="METADATA.ELEMENT" />
    <optional><ref name="PAGE.ELEMENT" /></optional>
    <ref name="TITLE.ELEMENT" />
    <optional><ref name="AUTHORLIST.ELEMENT" /></optional>
    <optional><ref name="ABSTRACT.ELEMENT" /></optional>
    <element name="BODY">
      <zeroOrMore><ref name="DIV.ELEMENT" /></zeroOrMore>
    </element>
    <optional>
      <element name="ACKNOWLEDGMENTS">
        <zeroOrMore>
          <choice>
            <ref name="REF.ELEMENT" />
            <ref name="INLINE.ELEMENT" />
          </choice>
        </zeroOrMore>
      </element>
    </optional>
    <optional><ref name="REFERENCELIST.ELEMENT" /></optional>
    <optional><ref name="AUTHORNOTELIST.ELEMENT" /></optional>
    <optional><ref name="FOOTNOTELIST.ELEMENT" /></optional>
    <optional><ref name="FIGURELIST.ELEMENT" /></optional>
    <optional><ref name="TABLELIST.ELEMENT" /></optional>
  </element>
</define>

<define name="REFERENCELIST.ELEMENT">
  <element name="REFERENCELIST">
    <zeroOrMore><ref name="REFERENCE.ELEMENT" /></zeroOrMore>
  </element>
</define>

Language Technology in SciBorg

The goal is Information Extraction from Chemistry research papers.

Various interfacing analysis components:
  Different levels of analysis
  Different analysis methods
  Specialised and general analysers

But a common semantic representation: RMRS (Robust Minimal Recursion Semantics)

And a common interface structure: SAF

Multiple Analysis Components

PET/ERG: “deep” analysis using detailed (HPSG) grammars and lexicons

RASP: robust shallow parsing with a statistically trained grammar

Each strand has a tokeniser, tagger and parser

OSCAR-3 analyses Chemistry terms and notation

Getting the Text out of SciXML

Only some spans of marked-up text contain linguistic text.

Using SciXML we can divide elements into: text (<P>), markup (<IT>) and non-text elements (<SUP>).

The analysers process, ignore and skip these, respectively.

We also use OSCAR-3 to detect data sections without significant text portions.
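
The three-way division can be sketched in a few lines of Python. This is one plausible reading of "process, ignore and skip", not the project's code: text inside `P` is collected, formatting tags such as `IT` are treated as transparent (the tag is ignored, its content is kept), and the content of `SUP` is dropped. The element names come from the slides; the sample fragment, the two sets and the function name `linguistic_text` are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

TEXT = {"P"}        # elements whose content is linguistic text
NON_TEXT = {"SUP"}  # elements whose content is skipped entirely
# Anything else (for example IT) is treated as transparent markup.


def linguistic_text(elem, inside_text=False):
    """Collect the character data that the analysers would see."""
    if elem.tag in NON_TEXT:
        return ""
    inside = inside_text or elem.tag in TEXT
    parts = []
    if inside and elem.text:
        parts.append(elem.text)
    for child in elem:
        parts.append(linguistic_text(child, inside))
        # Text that follows a child element belongs to the current element.
        if inside and child.tail:
            parts.append(child.tail)
    return "".join(parts)


# Made-up SciXML-like fragment for illustration only.
sample = "<DIV><P>Yield <IT>calculated for</IT> the product<SUP>12</SUP>.</P></DIV>"
extracted = linguistic_text(ET.fromstring(sample))
print(extracted)
```

A real pipeline would also need to record where each extracted character came from, so that analyses can be linked back to the SciXML source.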

SciBorg Parsing Architecture

[Diagram components: SciXML; sentence splitter; tokeniser for RASP; POS tagging; RASP parser; OSCAR; tokeniser for ERG; PET parser; SAF lattice.]

SAF Interface Requirements

Support results from different analysis components.

Allow the combination of complementary results
  But they will assign conflicting structures
  Ambiguity is common
  Analyses will form a graph or lattice (cf. chart parsing and word lattices)

Motivating Standoff

XML can only combine linguistic and formatting markup if they share the same tree structure.

Text: calculated for C11H18O3

Formatting markup: <IT>calculated for</IT> C<SB>11</SB>H<SB>18</SB>O<SB>3</SB>

Linguistic markup: <v>calculated</v> <pp>for <ne>C11H18O3</ne></pp>
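
The conflict can be made concrete by writing each element as a character span over the plain text and checking for spans that overlap without nesting. The sketch below is illustrative only: the span boundaries were worked out by hand from the example string, and the labels `SB#1` to `SB#3` and the function name `crosses` are made up here.

```python
text = "calculated for C11H18O3"

# Half-open character spans (start, end) over the plain text.
formatting = {"IT": (0, 14), "SB#1": (16, 18), "SB#2": (19, 21), "SB#3": (22, 23)}
linguistic = {"v": (0, 10), "pp": (11, 23), "ne": (15, 23)}


def crosses(a, b):
    """True if spans a and b overlap without one containing the other."""
    (a0, a1), (b0, b1) = sorted([a, b])
    return a0 < b0 < a1 < b1


conflicts = [
    (f_name, l_name)
    for f_name, f_span in formatting.items()
    for l_name, l_span in linguistic.items()
    if crosses(f_span, l_span)
]
print(conflicts)
```

Any pair reported here cannot be expressed in a single XML tree, because one element would have to close inside the other. Keeping the two layers in separate standoff annotations avoids the problem.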

Standoff Annotation

A common solution is to separate the flow of text from the annotations representing its analysis

The connection is formed by indexing at some consistent common level

SAF supports character offset indexing and XPoint indexing

Character Offset Indexing

Formatted text: Come here! (with "here" in italics)

Raw text: "<p>Come <i>here</i>!</p>"

Unicode character points (each dot is a point between characters, numbered from 0):

.<.p.>.C.o.m.e. .<.i.>.h.e.r.e.<./.i.>.!.<./.p.>.

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

Tokens

<token from='3' to='7' value='Come'/>

<token from='11' to='15' value='here'/>

<token from='19' to='20' value='!'/>
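
Offsets of this kind can be computed mechanically from the raw string. The following is a minimal sketch with a deliberately naive regular-expression tokeniser (words and single punctuation marks; tags are matched and discarded); it stands in for the real tokenisers and is not taken from them. The function name `tokens_with_offsets` is an assumption. Each `to` value is the point just after the token's last character, so `to - from` equals the token length.

```python
import re

raw = "<p>Come <i>here</i>!</p>"


def tokens_with_offsets(raw):
    """Return (from, to, value) for each token outside markup; 'to' is exclusive."""
    tokens = []
    # Alternation order matters: tags are matched first so their content
    # is never mistaken for a word.
    for m in re.finditer(r"<[^>]*>|\w+|[^\w\s<]", raw):
        if not m.group().startswith("<"):
            tokens.append((m.start(), m.end(), m.group()))
    return tokens


for start, end, value in tokens_with_offsets(raw):
    print(f"<token from='{start}' to='{end}' value='{value}'/>")
```

Because the offsets index into the raw file, markup characters count towards them even though the tokens themselves never contain markup.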

XPoint Indexing

Root (/)
  'P' (/1)
    text (/1/1): .C.o.m.e.
    'I' (/1/2)
      text (/1/2/1): .h.e.r.e.
    text (/1/3): .!.
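
The child-sequence paths in the tree can be reproduced with Python's `xml.dom.minidom`, which exposes text nodes as ordinary children. This is a sketch of the numbering scheme shown on the slide, not an implementation of the W3C XPointer specification, and the function name `text_node_paths` is an assumption.

```python
from xml.dom import Node, minidom


def text_node_paths(node, path=""):
    """Return {child-sequence path: text} for every text node below node."""
    paths = {}
    # Children are numbered from 1, counting elements and text nodes alike.
    for index, child in enumerate(node.childNodes, start=1):
        child_path = f"{path}/{index}"
        if child.nodeType == Node.TEXT_NODE:
            paths[child_path] = child.data
        else:
            paths.update(text_node_paths(child, child_path))
    return paths


doc = minidom.parseString("<p>Come <i>here</i>!</p>")
node_paths = text_node_paths(doc)
for path, data in node_paths.items():
    print(path, repr(data))
```

Note that the first text node includes the space after "Come", since the space sits inside the `p` element before the `i` element opens.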

Index Conversion

We currently use both character offset and XPoint indexing.

The choice is influenced by the XML parser.

This implies maintaining a conversion table for each (SciXML) file, e.g. /1/3/0 <-> 19
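
A conversion table of this kind could be built in a single pass over the raw file. The sketch below is a simplification under stated assumptions: it uses a regular expression rather than a real XML parser, so it ignores comments, CDATA sections, processing instructions and entity references, all of which a production version would have to handle. The key format `path/point` follows the example on the slide; the function name `xpoint_table` is hypothetical.

```python
import re


def xpoint_table(raw):
    """Map 'path/point' strings to character offsets in raw, for text nodes only."""
    table = {}
    path = []       # child indices of the currently open elements
    counters = [0]  # number of children seen so far at each open level
    for m in re.finditer(r"<(/?)[^>]*?(/?)>|[^<]+", raw):
        chunk = m.group()
        if not chunk.startswith("<"):   # a text node
            counters[-1] += 1
            node = "/" + "/".join(map(str, path + [counters[-1]]))
            for point in range(len(chunk) + 1):
                table[f"{node}/{point}"] = m.start() + point
        elif m.group(1):                # a closing tag
            path.pop()
            counters.pop()
        else:                           # an opening or empty-element tag
            counters[-1] += 1
            if not m.group(2):          # not self-closing: descend into it
                path.append(counters[-1])
                counters.append(0)
    return table


table = xpoint_table("<p>Come <i>here</i>!</p>")
offset_to_xpoint = {offset: xpoint for xpoint, offset in table.items()}
print(table["/1/3/0"], offset_to_xpoint[table["/1/3/0"]])
```

Inverting the dictionary gives the lookup in the other direction, which works here because every text character point has exactly one offset in the raw string.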

Standards for Standoff Annotation

MAF: ISO standard for morphological annotation

SMAF: an emergent standard extending this to the sentence level, e.g. for parser input

SAF: includes all annotations for a paper in one file

Types of SAF Annotation

Sentence segments

<annot type='sentence' id='s133' from='42065' source='v4987' target='v5154' to='43039' value='…calculated for C11H18O3….'/>

Tokens

<annot type='token' id='t5151' from='42988' to='43030' deps='s133' source='v5150' target='v5151' value='calculated'/>

<annot type='token' id='t5152' from='43031' to='43034' deps='s133' source='v5151' target='v5152' value='for'/>

<annot type='token' id='t5153' from='43035' to='43043' deps='s133' source='v5152' target='v5153' value='C11H18O3'/>
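
The `source` and `target` attributes make each annotation an edge between two lattice vertices. The sketch below shows how the token annotations could be read into a simple lattice and its paths enumerated. It is an illustration, not the SAF tooling: the enclosing `<saf>` root element and the function names `build_lattice` and `lattice_paths` are invented here, and the path search assumes the lattice has no cycles.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# The three token annotations from the slide, wrapped in a made-up <saf> root
# element so that the fragment parses as a single document.
SAF = """<saf>
<annot type='token' id='t5151' from='42988' to='43030' deps='s133' source='v5150' target='v5151' value='calculated'/>
<annot type='token' id='t5152' from='43031' to='43034' deps='s133' source='v5151' target='v5152' value='for'/>
<annot type='token' id='t5153' from='43035' to='43043' deps='s133' source='v5152' target='v5153' value='C11H18O3'/>
</saf>"""


def build_lattice(root):
    """Group annotations by source vertex: {source: [(target, id, value), ...]}."""
    edges = defaultdict(list)
    for annot in root.iter("annot"):
        edges[annot.get("source")].append(
            (annot.get("target"), annot.get("id"), annot.get("value"))
        )
    return edges


def lattice_paths(edges, start, end):
    """Enumerate every sequence of annotation values leading from start to end."""
    if start == end:
        return [[]]
    found = []
    for target, _annot_id, value in edges.get(start, []):
        for rest in lattice_paths(edges, target, end):
            found.append([value] + rest)
    return found


edges = build_lattice(ET.fromstring(SAF))
print(lattice_paths(edges, "v5150", "v5153"))
```

With a single tokenisation there is only one path. If a second analyser contributed competing edges between the same vertices, the same function would return each alternative reading.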

Types of SAF Annotation

Part of Speech (POS) tags

<annot type='pos' id='p5151' deps='t5151' source='v5150' target='v5151' value='VVN'/>

<annot type='pos' id='p5152' deps='t5152' source='v5151' target='v5152' value='IF'/>

<annot type='pos' id='p5153' deps='t5153' source='v5152' target='v5153' value='NP1'/>

OSCAR (NER) markup

<annot from="/1/5/6/27/51/2/83.1" to="/1/5/6/27/51/2/88/1.1" type="oscar" id="o554"><slot name="type">compound</slot><slot name="surface">C11H18O3</slot><slot name="provenance">formulaRegex</slot></annot>

Types of SAF Annotation

RMRS analyses:

<rmrs cfrom='42329' cto='43303'>

<label vid='420'/>

<ep cfrom='43258' cto='43288'><gpred>proper_q_rel</gpred><label vid='409'/><var sort='x' vid='410'/></ep>

<ep cfrom='43258' cto='43288'><gpred>named_rel</gpred><label vid='411'/><var sort='x' vid='410'/></ep>

<rarg><rargname>RSTR</rargname><label vid='409'/><var sort='h' vid='412'/></rarg>

<rarg><rargname>BODY</rargname><label vid='409'/><var sort='h' vid='413'/></rarg>

<rarg><rargname>CARG</rargname><label vid='411'/><constant>c11h18o3</constant></rarg>

<hcons hreln='qeq'><hi><var sort='h' vid='412'/></hi><lo><label vid='411'/></lo></hcons>

</rmrs>

SAF Flexibility

The standoff supports a variety of annotation types

Which communicate between different levels of analysis

And between different analysis paths

Hence it is also the main route for communication in the architecture

SciXML Flexibility

A common representation for the logical structure and essential formatting of research papers

Conversion from various publishers’ markup schemes

And, also, from HTML, LaTeX and PDF

Applied to several disciplines

