Natural Language Checking with Program Checking Tools


Fabrizio Perin, Lukas Renggli, Jorge Ressia

[Slide diagram: Programming languages are checked for syntax by the Parser and Compiler and for style by a Program Checker. Natural languages are checked for syntax by a Spell Checker and Grammar Checker and for style by TextLint.]

libraries: For parsing natural languages we use PetitParser [7], a flexible parsing framework that makes it easy to define parsers and to dynamically reuse, compose, transform and extend grammars. Furthermore, we use Glamour [8], an engine for scripting browsers. Glamour reifies the notion of a browser and defines the flow of data between different user interface widgets.

The contributions of this paper are:

(1) we apply ideas from program checking to the domain of natural language;
(2) we implement an object-oriented model used to represent natural text in Smalltalk;
(3) we demonstrate a pattern matcher for the detection of style issues in natural language; and
(4) we demonstrate a graphical user interface that presents and explains the problems detected by the tool.

[Figure 2 diagram, labels: Text, Parsing, Model, Validation, Failures; Rules, Styles; GUI.]

Fig. 2. Data Flow through TextLint.

Figure 2 gives an overview of the architecture of TextLint. Section 2 introduces the natural text model of TextLint and Section 3 details how text documents are parsed and the model is composed. Section 4 presents the rules that model the stylistic checks. Section 5 describes how stylistic rules are defined in TextLint. The implementation of the user interface is demonstrated in Section 6. We summarize related work in Section 8 and conclude and present future work in Section 9.

2 Modeling Text Documents

To perform analyses of written text it is necessary to have a model representing it. TextLint provides the abstractions for modeling written text from a structural point of view. The abstractions provided by our model are:

• The Document models a text document composed of paragraphs.
• The Paragraph models a sequence of sentences up to a break point. Paragraphs are responsible for answering the sentences and words that compose them.


• The Sentence is a set of syntactic elements or phrases ending with a sentence terminator.
• The Phrase models a set of syntactic elements of a particular length. A sentence provides access to all potential phrases of a specific size.
• The Syntactic Elements model the different tokens of a sentence. They are:
  · The Word models vocables or numbers in the text. A word is a sequence of alphanumeric characters.
  · The Punctuation models periods, commas, parentheses and other punctuation marks that are used in written text to separate paragraphs, sentences and their elements.
  · The Whitespace models blank areas between words and punctuation marks. Our model considers spaces, tabs and carriage returns as whitespace.
  · The Markup models LaTeX or HTML commands depending on the file type of the input.

All document elements answer the message text, which returns a plain string representation of the modeled text entity, ignoring markup tokens. Furthermore, all elements know their source interval in the document. The relationships among the elements in the model are depicted in Figure 3.
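The structure described above can be sketched in a few lines. The following is a Python illustration only, not TextLint's Smalltalk implementation: the class names follow the model, but the constructor arguments, the `words` and `phrases` helpers, and the choice to treat a phrase as a run of consecutive words are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SyntacticElement:
    """A token of a sentence together with its source interval."""
    source: str   # the token exactly as it appears in the input
    start: int    # offset of the first character in the document
    stop: int     # offset just past the last character

    def text(self) -> str:
        return self.source

    def interval(self) -> Tuple[int, int]:
        return (self.start, self.stop)


class Word(SyntacticElement): ...
class Punctuation(SyntacticElement): ...
class Whitespace(SyntacticElement): ...


class Markup(SyntacticElement):
    def text(self) -> str:
        return ""  # markup tokens are ignored in the plain-text view


@dataclass
class Sentence:
    elements: List[SyntacticElement]

    def text(self) -> str:
        return "".join(e.text() for e in self.elements)

    def interval(self) -> Tuple[int, int]:
        return (self.elements[0].start, self.elements[-1].stop)

    def words(self) -> List[Word]:
        return [e for e in self.elements if isinstance(e, Word)]

    def phrases(self, size: int) -> List[List[Word]]:
        # Assumption: a phrase is a window of `size` consecutive words.
        ws = self.words()
        return [ws[i:i + size] for i in range(len(ws) - size + 1)]


@dataclass
class Paragraph:
    sentences: List[Sentence]

    def words(self) -> List[Word]:
        return [w for s in self.sentences for w in s.words()]


@dataclass
class Document:
    paragraphs: List[Paragraph]


# Elements for the LaTeX input: \emph{TextLint} checks style.
sentence = Sentence([
    Markup("\\emph{", 0, 6), Word("TextLint", 6, 14), Markup("}", 14, 15),
    Whitespace(" ", 15, 16), Word("checks", 16, 22), Whitespace(" ", 22, 23),
    Word("style", 23, 28), Punctuation(".", 28, 29),
])
document = Document([Paragraph([sentence])])

print(sentence.text())      # TextLint checks style.
print(sentence.interval())  # (0, 29)
print([[w.text() for w in p] for p in sentence.phrases(2)])
```

Keeping the source interval on every element is what lets a checker point back at the exact position in the original file, even though `text()` hides the markup.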

[Figure 3 diagram, labels: Element with text() and interval(); Document 1..* Paragraph 1..* Sentence 1..* Phrase; SyntacticElement with text() and interval(); Word, Punctuation, Whitespace, Markup.]

Fig. 3. The TextLint model and the relationships between its classes.

3 From Strings to Objects

To build the high-level document model from the flat input string we use PetitParser [7]. PetitParser is a framework targeted at parsing formal languages (e.g., programming languages), but we employ it in this project to parse natural


Avoid "a lot"
Avoid "a"
Avoid "allow to"
Avoid "an"
Avoid "as to whether"
Avoid "can not"
Avoid "case"
Avoid "certainly"
Avoid "could"
Avoid "currently"
Avoid "different than"
Avoid "doubt but"
Avoid "each and every one"
Avoid "enormity"
Avoid "factor"
Avoid "funny"
Avoid "help but"
Avoid "help to"
Avoid "however"
Avoid "importantly"
Avoid "in order to"
Avoid "in regards to"
Avoid "in terms of"
Avoid "insightful"
Avoid "interesting"
Avoid "irregardless"
Avoid "one of the most"
Avoid "regarded as"
Avoid "required to"
Avoid "somehow"
Avoid "stuff"
Avoid "the fact is"
Avoid "the fact that"
Avoid "the truth is"
Avoid "thing"
Avoid "thus"
Avoid "true fact"
Avoid "would"
Avoid comma
Avoid connectors repetition
Avoid continuous punctuation
Avoid continuous word repetition
Avoid contraction
Avoid joined sentences
Avoid long paragraph
Avoid long sentence
Avoid passive voice
Avoid qualifier
Avoid whitespace
Avoid word repetition









(self word: 'somehow')











(self punctuation) , (self punctuation)











(self wordIn: #('am' 'are' 'were' 'being' ... )) , (self separator star) , ((self wordSatisfying: [ :value | value endsWith: 'ed' ]) / (self wordIn: #('awoken' 'been' 'born' 'beat' ... )))
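The three rule patterns shown here (a single word, two punctuation marks in a row, and the passive-voice sequence) are built from a handful of combinators: sequence (`,`), ordered choice (`/`) and repetition (`star`). The sketch below mimics that idea in Python over a flat token list. It is an illustration under assumptions, not TextLint's or PetitParser's code: the helper names are invented for the example, `separator` is assumed to mean whitespace, and the word lists contain only the entries visible above, since the rest are elided.

```python
import re

TOKEN = re.compile(r"(?P<word>\w+)|(?P<space>\s+)|(?P<punct>[^\w\s])")


def tokenize(text):
    """Split text into (kind, value) pairs: word, space or punct."""
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(text)]


# A pattern is a function (tokens, i) -> index after the match, or None.
def kind_satisfying(kind, pred=lambda value: True):
    def match(tokens, i):
        if i < len(tokens) and tokens[i][0] == kind and pred(tokens[i][1].lower()):
            return i + 1
        return None
    return match


def word_satisfying(pred): return kind_satisfying("word", pred)
def word_in(words): return word_satisfying(lambda w: w in set(words))
def word(text): return word_in([text])
def punctuation(): return kind_satisfying("punct")
def separator(): return kind_satisfying("space")  # assumption: whitespace only


def seq(*parts):
    """Match every part in order (PetitParser's `,`)."""
    def match(tokens, i):
        for part in parts:
            i = part(tokens, i)
            if i is None:
                return None
        return i
    return match


def choice(*alternatives):
    """Return the first alternative that matches (PetitParser's `/`)."""
    def match(tokens, i):
        for alternative in alternatives:
            j = alternative(tokens, i)
            if j is not None:
                return j
        return None
    return match


def star(part):
    """Match the part zero or more times."""
    def match(tokens, i):
        while True:
            j = part(tokens, i)
            if j is None or j == i:
                return i
            i = j
    return match


def find_all(pattern, tokens):
    """Scan left to right and collect the text of every match."""
    hits, i = [], 0
    while i < len(tokens):
        j = pattern(tokens, i)
        if j is not None and j > i:
            hits.append("".join(value for _, value in tokens[i:j]))
            i = j
        else:
            i += 1
    return hits


AUXILIARIES = ["am", "are", "were", "being"]    # only the entries shown above
IRREGULAR = ["awoken", "been", "born", "beat"]  # likewise, the rest is elided

passive_voice = seq(
    word_in(AUXILIARIES),
    star(separator()),
    choice(word_satisfying(lambda w: w.endswith("ed")), word_in(IRREGULAR)),
)

text = "The rules are defined in Smalltalk. They were born simple."
print(find_all(passive_voice, tokenize(text)))  # ['are defined', 'were born']
print(find_all(word("somehow"), tokenize("It somehow works")))  # ['somehow']
print(find_all(seq(punctuation(), punctuation()), tokenize("Really?!")))  # ['?!']
```

Because each combinator returns another pattern, a new rule is a one-line composition of existing parts. The same property is what makes the `ed`-suffix heuristic cheap to state, and also why it over-matches words such as "need" after an auxiliary.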









scientificPaperStyle := TLTextLintRule allRules - TLWordRepetitionInParagraphRule
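The expression above defines a style as the collection of all rules minus one. A minimal Python sketch of the same idea follows. It does not reproduce TextLint's classes: the registry, the three toy rules and their regular expressions are assumptions chosen for illustration, and "Avoid word repetition" is only a stand-in for TLWordRepetitionInParagraphRule.

```python
import re
from collections import Counter

ALL_RULES = {}  # rule name -> function returning the offending fragments


def rule(name):
    """Register a checking function under a rule name."""
    def register(fn):
        ALL_RULES[name] = fn
        return fn
    return register


@rule("Avoid 'somehow'")
def avoid_somehow(text):
    return re.findall(r"\bsomehow\b", text, flags=re.IGNORECASE)


@rule("Avoid continuous punctuation")
def avoid_continuous_punctuation(text):
    return re.findall(r"[.,;:!?]{2,}", text)


@rule("Avoid word repetition")
def avoid_word_repetition(text):
    # Toy criterion (an assumption): any word longer than three letters used twice.
    counts = Counter(re.findall(r"\w+", text.lower()))
    return sorted(w for w, n in counts.items() if n > 1 and len(w) > 3)


def check(style, text):
    """Run every rule of the style and keep only the rules that fired."""
    failures = {}
    for name in sorted(style):
        hits = ALL_RULES[name](text)
        if hits:
            failures[name] = hits
    return failures


# A style is a set of rule names, so set difference removes unwanted rules.
all_rules_style = set(ALL_RULES)
scientific_paper_style = all_rules_style - {"Avoid word repetition"}

sample = "The parser somehow parses the parser output.."
print(check(all_rules_style, sample))
print(check(scientific_paper_style, sample))
```

With the full style the sample triggers all three rules. The scientific paper style reports only the "somehow" and punctuation findings, because technical writing repeats terms on purpose.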









[Figure 6 plot: number of issues and number of words over time t, with vertical markers at t1, t2, t3 and t4.]

Fig. 6. Evolution of a paper from beginning to publication.

7.1 History of a Paper

Figure 6 depicts the number of stylistic issues detected by TextLint and the number of words in the text. The dashed vertical lines mark interesting moments in the lifetime of the document from the beginning to publication.

Up to point t1 we can see the early life of the paper. A significant amount of text was added and the number of TextLint issues steadily increased over time.

This growth slowed between points t1 and t2. We can observe that even though some new text was being added, the TextLint issues did not increase as much as in the previous part. In this period the authors proofread and rewrote portions of the paper to accommodate the ideas and to make the paper cohere around a single story.

Points t2 and t3 mark the moments when a native English speaker with over 30 years of experience in paper writing proofread the document. We can observe in both cases that the number of errors was systematically reduced after each of the interventions. The issues detected did not disappear immediately because the expert author often introduced annotations that were later fixed by the co-authors.

The peak at t3 marks the time before the paper submission. With the approaching deadline the authors added a lot of new issues. The time period between t3 and t4 depicts the correction of most issues and the final preparations of the paper for submission. Later the paper was accepted for publication.

Point t4 marks a slight increase in text size due to the introduction of passages addressing the reviewers' comments. Afterwards, there is an abrupt size reduction due to the elimination of comments and unnecessary text for the camera-ready version.


[Figure 7 bar chart, values per rule:]

Avoid ‘currently’                  -74%
Avoid ‘certainly’                  -25%
Avoid ‘would’                      -24%
Avoid ‘factor’                     -20%
Avoid long paragraph               -20%
Avoid ‘thus’                       -13%
Avoid ‘however’                    -10%
Avoid ‘case’                        -7%
Avoid ‘can not’                     -5%
Avoid ‘could’                       -5%
Avoid passive voice                 -4%
Avoid ‘insightful’                  -3%
Avoid ‘stuff’                       -3%
Avoid joined sentences              -1%
Avoid ‘as to whether’                0%
Avoid ‘different than’               0%
Avoid ‘doubt but’                    0%
Avoid ‘each and every one’           0%
Avoid ‘enormity’                     0%
Avoid ‘help but’                     0%
Avoid ‘in regards to’                0%
Avoid ‘irregardless’                 0%
Avoid ‘regarded as’                  0%
Avoid ‘the fact is’                  0%
Avoid ‘the truth is’                 0%
Avoid ‘true fact’                    0%
Avoid comma                          0%
Avoid qualifier                      2%
Avoid ‘funny’                        5%
Avoid ‘one of the most’              5%
Avoid ‘importantly’                  9%
Avoid long sentence                 10%
Avoid ‘an’                          10%
Avoid continuous punctuation        15%
Avoid ‘interesting’                 17%
Avoid ‘required to’                 17%
Avoid ‘a’                           23%
Avoid ‘in order to’                 23%
Avoid continuous word repetition    24%
Avoid ‘in terms of’                 24%
Avoid ‘somehow’                     25%
Avoid ‘help to’                     27%
Avoid ‘the fact that’               32%
Avoid whitespace                    45%
Avoid ‘allow to’                    46%
Avoid ‘a lot’                       55%
Avoid ‘thing’                       70%
Avoid contraction                   73%

Fig. 7. Effectiveness of various TextLint rules.

a more in-depth discussion of tools that comment on writing style could be included.

There is a wide variety of (commercial) libraries for natural language processing. Most of these libraries do not provide the necessary reusable abstractions to analyze stylistic concerns in text.

Natural language processing (NLP) is a field of computer science and linguistics concerned with the interactions between computers and human (natural) languages. NLP is concerned with natural language generation and understanding. Natural language generation is the process that converts information from a computational representation to readable human language. Natural language understanding works by converting samples of natural language into more formal forms understandable by computer systems. Bates [13] summarizes the NLP problems and state-of-the-art solutions in detail.


Future Work

‣ Natural Language Model

‣ Styles for Other Domains

‣ More Rules

textlint.lukas-renggli.ch
@textlint

Date post:	25-Dec-2014
Category:	Technology
Upload:	lukas-renggli