Date posted: 25-Dec-2014
Category: Technology
Uploaded by: lukas-renggli
Natural Language Checking with Program Checking Tools
Fabrizio Perin, Lukas Renggli, Jorge Ressia
[Figure: tools arranged by concern (Syntax, Style) and by domain (Programming Languages, Natural Languages). Labels: Parser, Compiler, Program Checker, Spell Checker, Grammar Checker, TextLint.]
libraries: For parsing natural languages we use PetitParser [7], a flexible parsing framework that makes it easy to define parsers and to dynamically reuse, compose, transform and extend grammars. Furthermore, we use Glamour [8], an engine for scripting browsers. Glamour reifies the notion of a browser and defines the flow of data between different user interface widgets.
The contributions of this paper are:
(1) we apply ideas from program checking to the domain of natural language;
(2) we implement an object-oriented model used to represent natural text in Smalltalk;
(3) we demonstrate a pattern matcher for the detection of style issues in natural language; and
(4) we demonstrate a graphical user interface that presents and explains the problems detected by the tool.
[Figure: diagram with the labels Text, Parsing, Model, Validation, Failures, Rules, Styles and GUI.]
Fig. 2. Data Flow through TextLint.
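The data flow named in Fig. 2 can be illustrated with a minimal Python sketch. It is not the Smalltalk implementation described in this paper: the function names `parse`, `avoid_word`, `validate` and `report` are hypothetical, the model is reduced to a list of words with offsets, a style is assumed to be a named collection of rules, and the GUI is replaced by a list of strings.

```python
import re
from typing import Callable, Dict, List, Tuple

Model = List[Tuple[str, int]]                     # (word, start offset) pairs
Rule = Callable[[Model], List[Tuple[int, int]]]   # returns failing intervals

def parse(text: str) -> Model:
    """Parsing stage: turn the flat string into a (very small) model."""
    return [(m.group(), m.start()) for m in re.finditer(r"[A-Za-z0-9]+", text)]

def avoid_word(word: str) -> Rule:
    """Build a rule that fails on every occurrence of `word` (case-insensitive)."""
    def rule(model: Model) -> List[Tuple[int, int]]:
        return [(start, start + len(w)) for w, start in model if w.lower() == word]
    return rule

def validate(model: Model, style: Dict[str, Rule]) -> List[Tuple[str, int, int]]:
    """Validation stage: run each rule of the style and collect failures."""
    failures = []
    for name, rule in style.items():
        failures.extend((name, start, stop) for start, stop in rule(model))
    return sorted(failures, key=lambda failure: failure[1])

def report(text: str, failures: List[Tuple[str, int, int]]) -> List[str]:
    """Stand-in for the GUI: one line per failure, quoting the source text."""
    return [f"{name}: '{text[start:stop]}' at {start}" for name, start, stop in failures]

if __name__ == "__main__":
    style = {'Avoid "somehow"': avoid_word("somehow"), 'Avoid "stuff"': avoid_word("stuff")}
    text = "Somehow the stuff works."
    for line in report(text, validate(parse(text), style)):
        print(line)
```

Keeping each stage a plain function makes it easy to swap the parser or the set of rules independently, which is the separation the figure suggests.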
Figure 2 gives an overview of the architecture of TextLint. Section 2 introduces the natural text model of TextLint and Section 3 details how text documents are parsed and the model is composed. Section 4 presents the rules which model the stylistic checks. Section 5 describes how stylistic rules are defined in TextLint. The implementation of the user interface is demonstrated in Section 6. We summarize related work in Section 8 and conclude and present future work in Section 9.
2 Modeling Text Documents
To perform analyses of written text it is necessary to have a model representing it. TextLint provides the abstractions for modeling written text from a structural point of view. The abstractions provided by our model are:
• The Document models a text document composed of paragraphs.
• The Paragraph models a sequence of sentences up to a break point. Paragraphs are responsible for answering the sentences and words that compose them.
• The Sentence is a set of syntactic elements or phrases ending with a sentence terminator.
• The Phrase models a set of syntactic elements of a particular length. A sentence provides access to all potential phrases of a specific size.
• The Syntactic Elements model the different tokens of a sentence. They are:
  · The Word models vocables or numbers in the text. A word is a sequence of alphanumeric characters.
  · The Punctuation models periods, commas, parentheses and other punctuation marks that are used in written text to separate paragraphs, sentences and their elements.
  · The Whitespace models blank areas between words and punctuations. Our model considers spaces, tabs and carriage returns as whitespace.
  · The Markup models LaTeX or HTML commands depending on the filetype of the input.
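The token kinds and the notion of phrases of a given size can be made concrete with a short Python sketch. It is an illustration under stated assumptions, not TextLint's parser: `tokenize` and `phrases` are hypothetical names, markup is left out, every character that is neither alphanumeric nor whitespace is treated as punctuation, newlines are counted as whitespace, and a phrase is taken to be a run of consecutive words.

```python
import re
from typing import List, Tuple

# One alternative per token kind; exactly one named group matches each token.
TOKEN_PATTERN = re.compile(
    r"(?P<word>[A-Za-z0-9]+)"        # a sequence of alphanumeric characters
    r"|(?P<whitespace>[ \t\r\n]+)"   # spaces, tabs, carriage returns, newlines
    r"|(?P<punctuation>.)",          # assumption: anything else is punctuation
    re.DOTALL,
)

def tokenize(sentence: str) -> List[Tuple[str, str]]:
    """Return (kind, text) pairs that together cover the whole input."""
    return [(m.lastgroup, m.group()) for m in TOKEN_PATTERN.finditer(sentence)]

def phrases(sentence: str, size: int) -> List[List[str]]:
    """Return every run of `size` consecutive words, as a sliding window."""
    if size < 1:
        raise ValueError("size must be at least 1")
    words = [text for kind, text in tokenize(sentence) if kind == "word"]
    return [words[i:i + size] for i in range(len(words) - size + 1)]

if __name__ == "__main__":
    print(tokenize("Hi, you."))
    print(phrases("in order to win", 3))
```

A rule that looks for a three-word expression only has to inspect the phrases of size three, which is why exposing all phrases of a given size on the sentence is convenient.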
All document elements answer the message text, which returns a plain string representation of the modeled text entity, ignoring markup tokens. Furthermore, all elements know their source interval in the document. The relationships among the elements in the model are depicted in Figure 3.
[Figure: class diagram. Element (text(), interval()) with Document, Paragraph, Sentence and Phrase, linked by 1-to-* relationships; SyntacticElement (text(), interval()) with Word, Punctuation, Whitespace and Markup.]
Fig. 3. The TextLint model and the relationships between its classes.
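To make the behavior of text and interval tangible, here is a minimal Python sketch of such a model. It is an assumption-laden illustration rather than the Smalltalk classes of TextLint: the `Composite` base class, the `source` and `start` fields, and the half-open `(start, stop)` interval convention are choices of this sketch, and Phrase is omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class SyntacticElement:
    source: str   # exact characters taken from the input
    start: int    # offset of the first character in the document

    def text(self) -> str:
        return self.source

    def interval(self) -> Tuple[int, int]:
        # Half-open interval [start, stop); a convention of this sketch.
        return (self.start, self.start + len(self.source))

class Word(SyntacticElement): pass
class Punctuation(SyntacticElement): pass
class Whitespace(SyntacticElement): pass

class Markup(SyntacticElement):
    def text(self) -> str:
        return ""  # markup is ignored in the plain-text view

@dataclass
class Composite:
    children: List = field(default_factory=list)

    def text(self) -> str:
        # Plain string of everything below this element, without markup.
        return "".join(child.text() for child in self.children)

    def interval(self) -> Tuple[int, int]:
        # Spans from the first child's start to the last child's stop;
        # an empty composite has no interval and raises IndexError.
        return (self.children[0].interval()[0], self.children[-1].interval()[1])

class Sentence(Composite): pass
class Paragraph(Composite): pass
class Document(Composite): pass

if __name__ == "__main__":
    sentence = Sentence([
        Word("See", 0), Whitespace(" ", 3), Markup("\\emph{", 4),
        Word("this", 10), Markup("}", 14), Punctuation(".", 15),
    ])
    document = Document([Paragraph([sentence])])
    print(document.text(), document.interval())
```

Because the interval still covers the markup characters while text skips them, a failure found in the plain text can be mapped back to the right place in the source file.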
3 From Strings to Objects
To build the high-level document model from the flat input string we use PetitParser [7]. PetitParser is a framework targeted at parsing formal languages (e.g., programming languages), but we employ it in this project to parse natural
• Avoid "a lot"
• Avoid "a"
• Avoid "allow to"
• Avoid "an"
• Avoid "as to whether"
• Avoid "can not"
• Avoid "case"
• Avoid "certainly"
• Avoid "could"
• Avoid "currently"
• Avoid "different than"
• Avoid "doubt but"
• Avoid "each and every one"
• Avoid "enormity"
• Avoid "factor"
• Avoid "funny"
• Avoid "help but"
• Avoid "help to"
• Avoid "however"
• Avoid "importantly"
• Avoid "in order to"
• Avoid "in regards to"
• Avoid "in terms of"
• Avoid "insightful"
• Avoid "interesting"
• Avoid "irregardless"
• Avoid "one of the most"
• Avoid "regarded as"
• Avoid "required to"
• Avoid "somehow"
• Avoid "stuff"
• Avoid "the fact is"
• Avoid "the fact that"
• Avoid "the truth is"
• Avoid "thing"
• Avoid "thus"
• Avoid "true fact"
• Avoid "would"
• Avoid comma
• Avoid connectors repetition
• Avoid continuous punctuation
• Avoid continuous word repetition
• Avoid contraction
• Avoid joined sentences
• Avoid long paragraph
• Avoid long sentence
• Avoid passive voice
• Avoid qualifier
• Avoid whitespace
• Avoid word repetition
(self word: 'somehow')
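The fragment above suggests that a rule is expressed as a pattern over words. How TextLint evaluates such a pattern is not shown here, so the following Python sketch is only one plausible reading: the combinators `word` and `seq`, the helper `find_all`, and the function `repeated_words` are hypothetical names introduced for illustration, and matching is assumed to be case-insensitive.

```python
from typing import Callable, List, Optional

# A matcher takes the word list and a position. It returns the position
# just after the match, or None when it does not match at that position.
Matcher = Callable[[List[str], int], Optional[int]]

def word(expected: str) -> Matcher:
    """Match a single word, ignoring case."""
    def match(words: List[str], pos: int) -> Optional[int]:
        if pos < len(words) and words[pos].lower() == expected.lower():
            return pos + 1
        return None
    return match

def seq(*parts: Matcher) -> Matcher:
    """Match several patterns one after the other."""
    def match(words: List[str], pos: int) -> Optional[int]:
        for part in parts:
            pos = part(words, pos)
            if pos is None:
                return None
        return pos
    return match

def find_all(pattern: Matcher, words: List[str]) -> List[range]:
    """Try the pattern at every position and collect the matching word ranges."""
    hits = []
    for start in range(len(words)):
        stop = pattern(words, start)
        if stop is not None:
            hits.append(range(start, stop))
    return hits

def repeated_words(words: List[str]) -> List[int]:
    """Indexes of words that repeat the word directly before them."""
    return [i for i in range(1, len(words)) if words[i].lower() == words[i - 1].lower()]

if __name__ == "__main__":
    words = "We somehow did it in order to to win".split()
    print(find_all(word("somehow"), words))
    print(find_all(seq(word("in"), word("order"), word("to")), words))
    print(repeated_words(words))
```

Composing small matchers keeps single-word rules such as Avoid "somehow" and multi-word rules such as Avoid "in order to" on the same footing, while structural checks like continuous word repetition can stay ordinary functions over the word list.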
Avoid "a lot"
Avoid "a"
Avoid "allow to"
Avoid "an"
Avoid "as to whether"
Avoid "can not"
Avoid "case"
Avoid "certainly"
Avoid "could"
Avoid "currently"
Avoid "different than"
Avoid "doubt but"
Avoid "each and every one"
Avoid "enormity"
Avoid "factor"
Avoid "funny"
Avoid "help but"
Avoid "help to"
Avoid "however"
Avoid "importantly"
Avoid "in order to"
Avoid "in regards to"
Avoid "in terms of"
Avoid "insightful"
Avoid "interesting"
Avoid "irregardless"
Avoid "one of the most"
Avoid "regarded as"
Avoid "required to"
Avoid "somehow"
Avoid "stuff"
Avoid "the fact is"
Avoid "the fact that"
Avoid "the truth is"
Avoid "thing"
Avoid "thus"
Avoid "true fact"
Avoid "would"
Avoid comma
Avoid connectors repetition
Avoid continuous punctuation
Avoid continuous word repetition
Avoid contraction
Avoid joined sentences
Avoid long paragraph
Avoid long sentence
Avoid passive voice
Avoid qualifier
Avoid whitespace
Avoid word repetition
(self punctuation) , (self punctuation)
(self wordIn: #('am' 'are' 'were' 'being' ... )) , (self separator star) , ((self wordSatisfying: [ :value | value endsWith: 'ed' ]) / (self wordIn: #('awoken' 'been' 'born' 'beat' ... )))
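The pattern above combines a sequence (`,`), an ordered choice (`/`) and a repetition (`star`) over word and separator tokens. The following Python sketch mimics that structure with plain functions. It is not PetitParser or the TextLint API: every name here (`tokenize`, `word_in`, `word_satisfying`, `separator`, `star`, `seq`, `choice`, `find_all`) is hypothetical, and the word lists contain only the entries visible in the snippet, since the rest are elided.

```python
import re


def tokenize(text):
    """Split text into (kind, value) tokens: word, separator or punctuation."""
    pattern = r"(?P<word>[\w']+)|(?P<separator>\s+)|(?P<punctuation>.)"
    return [(m.lastgroup, m.group()) for m in re.finditer(pattern, text)]


def word_satisfying(predicate):
    # Matches one word token whose text satisfies the predicate.
    def match(tokens, pos):
        if pos < len(tokens) and tokens[pos][0] == "word" and predicate(tokens[pos][1]):
            return pos + 1
        return None
    return match


def word_in(words):
    allowed = {w.lower() for w in words}
    return word_satisfying(lambda value: value.lower() in allowed)


def separator():
    def match(tokens, pos):
        if pos < len(tokens) and tokens[pos][0] == "separator":
            return pos + 1
        return None
    return match


def star(parser):
    # Zero or more repetitions; never fails.
    def match(tokens, pos):
        while True:
            nxt = parser(tokens, pos)
            if nxt is None or nxt == pos:
                return pos
            pos = nxt
    return match


def seq(*parsers):
    # All parsers must match one after the other.
    def match(tokens, pos):
        for parser in parsers:
            pos = parser(tokens, pos)
            if pos is None:
                return None
        return pos
    return match


def choice(*parsers):
    # Ordered choice: the first parser that matches wins.
    def match(tokens, pos):
        for parser in parsers:
            end = parser(tokens, pos)
            if end is not None:
                return end
        return None
    return match


def find_all(parser, tokens):
    """Try the parser at every token position and collect the matched text."""
    hits = []
    for start in range(len(tokens)):
        end = parser(tokens, start)
        if end is not None and end > start:
            hits.append("".join(value for _, value in tokens[start:end]))
    return hits


# Only the words visible in the snippet; the original lists are longer.
passive_voice = seq(
    word_in(["am", "are", "were", "being"]),
    star(separator()),
    choice(
        word_satisfying(lambda value: value.endswith("ed")),
        word_in(["awoken", "been", "born", "beat"]),
    ),
)
```

Calling `find_all(passive_voice, tokenize(text))` returns the matched phrases, so a checker could report each one as a possible passive-voice issue.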
scientificPaperStyle := TLTextLintRule allRules - TLWordRepetitionInParagraphRule
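The expression above builds a style by subtracting one rule from the set of all rules. A rough Python sketch of the same idea follows. The three toy rules, their regular expressions, the repetition threshold and the `run_style` helper are assumptions for illustration only and do not reproduce the TextLint rules.

```python
import re
from collections import Counter


def avoid_somehow(text):
    # Character offsets of the word "somehow".
    return [m.start() for m in re.finditer(r"\bsomehow\b", text, re.IGNORECASE)]


def avoid_continuous_punctuation(text):
    # Character offsets of two or more adjacent punctuation marks.
    return [m.start() for m in re.finditer(r"[.,;:!?]{2,}", text)]


def avoid_word_repetition(text):
    # Toy stand-in: words used more than twice (threshold is an assumption).
    counts = Counter(word.lower() for word in re.findall(r"[A-Za-z']+", text))
    return sorted(word for word, count in counts.items() if count > 2)


RULES = {
    "somehow": avoid_somehow,
    "continuous punctuation": avoid_continuous_punctuation,
    "word repetition": avoid_word_repetition,
}

# A style is a set of rule names; set difference removes unwanted rules.
all_rules = frozenset(RULES)
scientific_paper_style = all_rules - {"word repetition"}


def run_style(style, text):
    """Run every rule of the style and keep only rules that found something."""
    results = {}
    for name in sorted(style):
        found = RULES[name](text)
        if found:
            results[name] = found
    return results
```

Treating a style as a plain set keeps composition cheap: union adds rules, difference removes them, and the checker only needs to iterate over whatever set it is given.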
Validation
[Figure 6: plot of the number of issues and the number of words over time t, with dashed markers at t1, t2, t3 and t4]
Fig. 6. Evolution of a paper from beginning to publication.
7.1 History of a Paper
Figure 6 depicts the number of stylistic issues detected by TextLint and the number of words in the text. The dashed vertical lines mark interesting moments in the lifetime of the document from the beginning to publication.
Up to point t1 we can see the early life of the paper. A significant amount of text was added and the number of TextLint issues steadily increased over time.
This growth slowed between points t1 and t2. Even though some new text was added, the number of TextLint issues did not increase as much as in the previous period. During this time the authors proof-read and rewrote portions of the paper to accommodate the ideas and to make the paper cohere into a single story.
Points t2 and t3 mark the moments when a native English speaker with over 30 years of experience in paper writing proof-read the document. In both cases the number of issues was systematically reduced after the intervention. The issues detected did not disappear immediately because the expert author often introduced annotations that were later fixed by the co-authors.
The peak at t3 marks the time before the paper submission. With the approaching deadline the authors introduced many new issues. The period between t3 and t4 depicts the correction of most issues and the final preparations of the paper for submission. Later the paper was accepted for publication.
Point t4 marks a slight increase in text size due to the introduction of passages addressing the reviewers' comments. Afterwards, there is an abrupt size reduction due to the elimination of comments and unnecessary text for the camera-ready version.
Avoid ‘currently’: -74%
Avoid ‘certainly’: -25%
Avoid ‘would’: -24%
Avoid ‘factor’: -20%
Avoid long paragraph: -20%
Avoid ‘thus’: -13%
Avoid ‘however’: -10%
Avoid ‘case’: -7%
Avoid ‘can not’: -5%
Avoid ‘could’: -5%
Avoid passive voice: -4%
Avoid ‘insightful’: -3%
Avoid ‘stuff’: -3%
Avoid joined sentences: -1%
Avoid ‘as to whether’: 0%
Avoid ‘different than’: 0%
Avoid ‘doubt but’: 0%
Avoid ‘each and every one’: 0%
Avoid ‘enormity’: 0%
Avoid ‘help but’: 0%
Avoid ‘in regards to’: 0%
Avoid ‘irregardless’: 0%
Avoid ‘regarded as’: 0%
Avoid ‘the fact is’: 0%
Avoid ‘the truth is’: 0%
Avoid ‘true fact’: 0%
Avoid comma: 0%
Avoid qualifier: 2%
Avoid ‘funny’: 5%
Avoid ‘one of the most’: 5%
Avoid ‘importantly’: 9%
Avoid long sentence: 10%
Avoid ‘an’: 10%
Avoid continuous punctuation: 15%
Avoid ‘interesting’: 17%
Avoid ‘required to’: 17%
Avoid ‘a’: 23%
Avoid ‘in order to’: 23%
Avoid continuous word repetition: 24%
Avoid ‘in terms of’: 24%
Avoid ‘somehow’: 25%
Avoid ‘help to’: 27%
Avoid ‘the fact that’: 32%
Avoid whitespace: 45%
Avoid ‘allow to’: 46%
Avoid ‘a lot’: 55%
Avoid ‘thing’: 70%
Avoid contraction: 73%
Fig. 7. Effectiveness of various TextLint rules.
a more in-depth discussion of tools that comment on writing style could be included.
There is a wide variety of (commercial) libraries for natural language processing. Most of these libraries do not provide the necessary reusable abstractions to analyze stylistic concerns in text.
Natural Language processing (NLP) is a field of computer science and linguistics concerned with the interactions between computers and human (natural) languages. NLP is concerned with the natural language generation and understanding. Natural language generation is the process that converts information from a computational representation to readable human language. Natural language understanding works by converting samples of natural language into more formal forms understandable by computer systems. Bates [13] summarizes the NLP problems and state-of-art solutions in detail.
Future Work
‣ Natural Language Model
‣ Styles for Other Domains
‣ More Rules
textlint.lukas-renggli.ch
@textlint