
Running head: Lab 2 – LASI Prototype Product Specification 1

Lab 2 – LASI Prototype Product Specification

Red Team

Brittany Johnson

CS411W

Janet Brunelle

April 8, 2013

Version 1


Table of Contents

1 Introduction
    1.1 Purpose
    1.2 Scope
    1.3 Definitions, Acronyms, and Abbreviations
    1.4 References
    1.5 Overview

2 General Description
    2.1 Prototype Architecture Description
    2.2 Prototype Functional Description

List of Figures

Figure 1. Prototype Major Functional Component Diagram
Figure 2. GUI Site Map
Figure 3. Prototype Hardware and Software Component Diagram
Figure 4. Nouns
Figure 5. Verbs
Figure 6. Phrase

List of Tables

Table 1. Feature comparison between full product and prototype


1 Introduction

Linguistic analysis is the contextual study of written works and of how words combine to form an overall meaning. Themes are the subject-object-verb relationships that help the reader to comprehend and summarize what has been read. The complexity of a topic and the reader’s familiarity with it play an important role in comprehension. The reader’s comprehension, along with the ability to summarize the material, is important in being able to communicate the content of a document. Identifying a theme becomes even more difficult as the number of documents increases, because the theme across all of the documents may not be the theme of each individual document. Thus, it is often difficult for people to identify a common theme over a large set of documents in a timely, consistent, and objective manner. LASI will be a decision support tool to assist users in determining common themes across multiple documents.

1.1 Purpose

LASI stands for Linguistic Analysis for Subject Identification. It is a stand-alone theme-finding application conceived by the Old Dominion University CS410 Red Group. It is designed to be a decision support tool for large, multi-document linguistic analysis and to allow for more accurate and consistent results. LASI will be able to detect themes across many documents and can provide both individual and cross-document analysis to determine a single theme.

LASI’s ability to analyze multiple documents to find a common theme makes it a useful decision support tool for teachers, students, research analysts, and anyone else who needs to read through large sets of documents on a frequent basis. Teachers, for example, would be able to use LASI for an initial analysis of student papers to check whether each paper is consistent with its topic. Both students and research analysts could use LASI to quickly assess the usefulness of scientific and literary publications for the topic that they are researching.

1.2 Scope

Prototype features will differ from the real-world product in scale. Some features will be eliminated from the prototype due to limited development time. A complete list of features is available in Table 1.

Table 1. Feature comparison between full product and prototype

The types of documents that the LASI prototype accepts have been limited to DOC and DOCX. Scanned text recognition has been removed from the prototype since there is not enough time to get the OCR software fully functioning. The prototype will limit the number of documents that can be added to one project to three to five, and there is a size limit of 10 pages on each of those documents to ensure that the algorithm can function in a timely manner. Rather than focusing on every part of speech, the LASI prototype will focus on noun-verb binding. A few of the more complex features, such as user-defined dictionaries, synonym identification, and content assumption, also did not make it into the prototype.
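The limits described above can be expressed as a small validation step that runs before a project is created. The following is a minimal sketch in Python, not the prototype's C# code. The names (Document, validate_project, and the constants) are hypothetical, and treating five as the upper bound of the "three to five" range is an assumption.

```python
from dataclasses import dataclass
from pathlib import Path

# Limits taken from the scope paragraph; all names here are hypothetical.
ALLOWED_EXTENSIONS = {".doc", ".docx"}
MAX_DOCUMENTS = 5   # assumed upper bound of the "three to five" range
MAX_PAGES = 10      # per-document page limit


@dataclass
class Document:
    path: str
    page_count: int


def validate_project(documents):
    """Return a list of problems; an empty list means the project is acceptable."""
    problems = []
    if len(documents) > MAX_DOCUMENTS:
        problems.append(f"too many documents: {len(documents)} > {MAX_DOCUMENTS}")
    for doc in documents:
        extension = Path(doc.path).suffix.lower()
        if extension not in ALLOWED_EXTENSIONS:
            problems.append(f"{doc.path}: unsupported file type '{extension}'")
        if doc.page_count > MAX_PAGES:
            problems.append(f"{doc.path}: {doc.page_count} pages exceeds {MAX_PAGES}")
    return problems
```

Returning a list of problems rather than raising on the first one would let a GUI report every rejected document at once.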


1.3 Definitions, Acronyms, and Abbreviations

A.I.D.: Assessment Improvement Design

A.I.D. Process: A process that provides a quantitative and qualitative basis for identifying problems and determining the feasibility of solutions.

Analysis: Detailed examination of the elements or structure of something, typically as a basis for interpretation.

Document: A document herein refers to a formally written, expository paper which expounds, via a declarative approach, on a relatively quantifiable issue, goal, or area of research.

Head word: A locally distinct word within a phrase which, by its syntactic associations, determines the category of the phrase itself.

LASI: Linguistic Analysis for Subject Identification

Linguistic Analysis: The scientific analysis of a language.

Parser: Takes in DOC and DOCX files and converts them to TXT files.

Part of Speech Tagger: Software utility that associates words with the parts-of-speech in a sentence.

Phrase: An instance of the Phrase class.

Phrase: (Linguistically) A group of words standing together as a conceptual unit.

Phrase Class: The root of the taxonomy of class types which correspond to syntactic roles at the phrase level and whose instances contain a collection of Words which together represent a linguistic phrase.

Semantic Analysis: Relating the syntactical structure of words to their language-independent meanings.

SharpNLP: A natural language processing tool, written in C#, used to parse and tag parts-of-speech.

Strategic Document: Document produced by a client that defines their Goals, Visions and Missions.

Subject Identification: The process by which the subject matter and thematic content of documents are determined.


Syntactic Analysis: Identifies key words based on their location in the sentence, rather than their overall meaning throughout the document.

.TAGGED: The type of file that stores the output of the part-of-speech tagger, containing all of the text of the document with embedded syntactic annotations.

Theme: Subject-object-verb relationships that LASI is attempting to generate from the input set.

Tag: A label, or the act of attaching a label, that specifies the syntactic role of a selected element in a document.

Tagged Set: A group of words whose parts of speech and locations in a sentence have been identified by the parser.

WordNet: Compiler and provider of the data files which form the basis for the LASI thesaurus.

Word Class: The root of the taxonomy of class types which correspond to parts-of-speech at the word level and whose instances encapsulate each occurrence of a textually identified word.

Word Weight: A numeric value, associated with each syntactically and lexically unique word in a written work, indicating its significance.


1.4 References

Johnson, Brittany. (2013). Lab 1 – LASI Product Description.

Office binary to open xml. (n.d.). Retrieved from http://b2xtranslator.sourceforge.net/

SharpNLP. (n.d.). Retrieved from http://sharpnlp.codeplex.com/

1.5 Overview

This product specification provides the hardware and software configuration, external

interfaces, capabilities and features of the LASI prototype. The information provided in the

remaining sections of this document includes a detailed description of the hardware, software

and interface of the LASI prototype as well as the key features of the prototype.

2 General Description

The following sections describe the prototype in more detail. Section 2.1 identifies and

describes each architectural component of the prototype. Section 2.2 explains the prototype’s

functional requirement. Lastly, Section 2.3 describes the external interfaces of the prototype.


2.1 Prototype Architecture Description

The architecture for the LASI prototype consists of three major components: a graphical user interface, an algorithm, and a file management system. Figure 1 shows a major functional component diagram of the prototype.

The first major component is the graphical user interface (GUI). The LASI user interface is a Windows Presentation Foundation (WPF) project that uses XAML to define the structure of the views and C# to provide the interactivity. The LASI prototype GUI contains a Start-up Screen, a Create Project View, a Project Preview, an In Progress View, and a Results View.


Figure 1. Prototype Major Functional Component Diagram

[Diagram not reproduced. Labels recovered from the figure: Graphic User Interface (Start-up Screen, Create Project View, Project Preview, In Progress View, Results View); File Management (File Converter, Part-of-Speech Tagger); Algorithm (Tagged File Parser, Word & Phrase Binder with Subject, Object, and Attributive binding, Results Aggregator for Individual Documents and All Documents); flow labels Create Project, Documents Returned, and Begin Analysis.]


Figure 2. GUI Site Map

As shown in Figure 2, results can be viewed in three different formats: Top Results, Word Relationships, and Word Count and Weighting. The top results will be represented graphically based on the user’s preferred chart type. The charting engine used for this feature is part of the WPF Toolkit. The word relationships will also be displayed for each document, with each word colorized based on its part-of-speech. This will allow the user to see the relationships between all of the words and phrases in a document. Results will also be displayed based on individual word count and weight, where the displayed weight comes from the weighting algorithm. Results can be either printed or exported as PDF, JPG, or PNG.

The second major component is the file management system, which manages converting files and invoking the tagger. It contains the file converter and the part-of-speech tagger. The file converter that the LASI prototype uses is B2XTranslator, third-party open source software that can convert DOC and DOCX files into XML files. The part-of-speech tagger is SharpNLP, an open source natural language processing tool written in C#. The SharpNLP POS tagger tags words and phrases with their respective parts-of-speech for use by the LASI algorithm. SharpNLP uses the Penn Treebank part-of-speech tags to define the parts of speech.

The last major component is the algorithm, which is written in C#. The algorithm, as shown in Figure 1, contains a Tag Parser which converts the text into word and phrase types representative of their parts-of-speech. A Word, in reference to word types, is the root of the classification of class types which correspond to parts-of-speech at the word level and whose instances encapsulate each occurrence of a textually identified word. Figure 4 shows all of the Word types in the LASI prototype. Every word that is tagged by the part-of-speech tagger has a corresponding Word type.
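The Tag Parser step can be illustrated with a short sketch. It is written in Python for brevity and is not the prototype's C# code. It assumes that the tagged text consists of whitespace-separated tokens of the form word/TAG, which may not match the actual .TAGGED layout. The class names and the parse_tagged function are hypothetical. The tag prefixes (NN, VB, JJ, RB) are the standard Penn Treebank prefixes for nouns, verbs, adjectives, and adverbs.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Root of the word-level taxonomy; one instance per textual occurrence."""
    text: str


class Noun(Word):
    pass


class Verb(Word):
    pass


class Adjective(Word):
    pass


class Adverb(Word):
    pass


class OtherWord(Word):
    """Fallback for tags this sketch does not model (determiners, punctuation, ...)."""


# Penn Treebank tag prefixes mapped to hypothetical Word subclasses.
PREFIX_TO_CLASS = (("NN", Noun), ("VB", Verb), ("JJ", Adjective), ("RB", Adverb))


def parse_tagged(text):
    """Turn 'word/TAG word/TAG ...' into a list of typed Word instances."""
    words = []
    for token in text.split():
        # Split on the last slash so tokens such as 'and/or/CC' keep their text.
        surface, separator, tag = token.rpartition("/")
        if not separator:
            words.append(OtherWord(token))  # untagged token
            continue
        word_class = next(
            (cls for prefix, cls in PREFIX_TO_CLASS if tag.startswith(prefix)),
            OtherWord,
        )
        words.append(word_class(surface))
    return words
```

For example, parse_tagged("The/DT dog/NN barked/VBD") yields an OtherWord, a Noun, and a Verb, in that order.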


Figure 3. Word


A Phrase, as shown in Figure 5, is the root of the classification of class types which correspond to syntactic roles at the phrase level and whose instances contain a collection of Words which together represent a linguistic phrase. As with Word types, every type of phrase that can be tagged by our part-of-speech tagger is represented.


Figure 4. Phrase
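The relationship between a Phrase, the Words it contains, and its head word can be sketched as follows. This is a Python illustration with hypothetical names, not the prototype's C# class hierarchy. Choosing the last word whose Penn Treebank tag matches the phrase category as the head word is a simplifying assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Word:
    text: str
    tag: str  # Penn Treebank tag, e.g. "NN" or "VBD"


@dataclass
class Phrase:
    """Root of the phrase-level taxonomy: an ordered collection of Words."""
    words: List[Word] = field(default_factory=list)

    # Tag prefix that marks a candidate head word; the empty prefix in the
    # base class matches any tag, so the last word is used as the head.
    head_prefix = ""

    @property
    def text(self) -> str:
        return " ".join(word.text for word in self.words)

    def head(self) -> Optional[Word]:
        """Return the last word whose tag matches this phrase's category."""
        for word in reversed(self.words):
            if word.tag.startswith(self.head_prefix):
                return word
        return None


class NounPhrase(Phrase):
    head_prefix = "NN"


class VerbPhrase(Phrase):
    head_prefix = "VB"
```

With these definitions, NounPhrase([Word("the", "DT"), Word("red", "JJ"), Word("dog", "NN")]).head() returns the Word for "dog".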

The LASI prototype algorithm binds word and phrase types together based on their syntactic relationships via a logic flow derived from a state machine. Words and phrases will be bound together based on the Word or Phrase types described above and on how they relate to one another within phrases, paragraphs, and the document. The weighting algorithm will assign each word a weight based on its part-of-speech, its frequency count, and the number of times and ways it is referenced. For the LASI prototype we will be focusing on subject, object, and attributive binding.
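A state-machine binder of the kind described above might look like the following sketch. It is written in Python with hypothetical names and is not the prototype's actual logic. It assumes that each sentence has already been reduced to a sequence of (kind, text) phrases, where kind is "NP" for a noun phrase or "VP" for a verb phrase. It covers only subject and object binding, and attributive binding is omitted.

```python
from enum import Enum, auto
from typing import List, NamedTuple, Optional, Tuple


class State(Enum):
    SEEK_SUBJECT = auto()
    SEEK_VERB = auto()
    SEEK_OBJECT = auto()


class Binding(NamedTuple):
    subject: str
    verb: str
    obj: Optional[str]  # None when the verb has no object


def bind_sentence(phrases: List[Tuple[str, str]]) -> List[Binding]:
    """Walk the phrases of one sentence and emit subject-verb-object bindings."""
    state = State.SEEK_SUBJECT
    subject = verb = None
    bindings = []
    for kind, text in phrases:
        if kind == "NP":
            if state is State.SEEK_OBJECT:
                # A noun phrase after a verb completes the binding as its object.
                bindings.append(Binding(subject, verb, text))
                state = State.SEEK_SUBJECT
            else:
                # Otherwise the most recent noun phrase becomes the subject.
                subject = text
                state = State.SEEK_VERB
        elif kind == "VP":
            if state is State.SEEK_VERB:
                verb = text
                state = State.SEEK_OBJECT
            elif state is State.SEEK_OBJECT:
                # Second verb with no object in between: close the first
                # binding and reuse the subject for the new verb.
                bindings.append(Binding(subject, verb, None))
                verb = text
        # Phrases of any other kind are ignored in this sketch.
    if state is State.SEEK_OBJECT:
        bindings.append(Binding(subject, verb, None))
    return bindings
```

Keeping the state explicit makes each transition easy to test in isolation, which is one reason a state machine suits this kind of rule-based binding.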


2.2 Prototype Functional Description

The major functional components are shown in Figure 1. A user will interact with the LASI GUI and create a new project using documents that are of the correct file type and stripped of all graphics. The user will need to fill out all of the information required to create a new project. These actions will result in a new project being created and the document converter being called. When documents are added to a project in the GUI, the document converter takes each DOC or DOCX file and converts it to an XML file. Once a document is in XML, it is converted to raw text that can be used by the part-of-speech tagger.
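The XML-to-raw-text step can be illustrated for DOCX input, since a DOCX file is a ZIP archive whose main body is stored as WordprocessingML in word/document.xml. The sketch below uses only the Python standard library. It is not B2XTranslator and does not handle binary DOC files. The function name docx_to_text is hypothetical, and the handling of tables, headers, and footnotes is left out.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_to_text(source):
    """Extract plain text from a DOCX file (path or binary file-like object).

    Each w:p element becomes one line; the w:t runs inside it are joined.
    """
    with zipfile.ZipFile(source) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        runs = [node.text or "" for node in paragraph.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

The resulting string is the kind of raw text that could then be handed to a part-of-speech tagger.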

The user is then taken to the document preview, where they can either remove or add documents. Once analysis has begun, SharpNLP will embed part-of-speech tags into the text from each document. The tagged file is then passed to the tagged file reader, which assigns each word and phrase a Word or Phrase type corresponding to the part of speech given by the tagger.

Once the word and phrase binding is finished, the algorithm will begin weighting the words based on their frequency and the number of times they are referenced. The weighting metrics for each word will be based on a raw frequency as well as a relative frequency. The raw frequency is based on a simple word count, the number of times that the word was used in a particular manner, and a frequency count for synonyms of that word. The relative frequency will be based on the subject, verb, and object relationships between words as well as where a word is located in a document. Each of the word weights will then be passed on to the GUI Results page for the user to view.
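One way to combine a raw count with binding-based references is sketched below in Python. The specification does not give a weighting formula, so the formula, the part-of-speech weights, and the reference bonus are all assumptions chosen for illustration, and the function name word_weights is hypothetical. Synonym counts and document position are left out of this sketch.

```python
from collections import Counter

# Hypothetical tuning constants; the specification does not state any values.
POS_WEIGHT = {"noun": 1.0, "verb": 0.8}
DEFAULT_POS_WEIGHT = 0.2
REFERENCE_BONUS = 0.5  # added once for each binding that mentions the word


def word_weights(tagged_words, bindings):
    """Compute a weight for each (word, part-of-speech) pair.

    tagged_words: iterable of (word, pos) pairs, one per occurrence.
    bindings: iterable of (subject, verb, obj) triples; obj may be None.
    """
    raw = Counter((word.lower(), pos) for word, pos in tagged_words)

    references = Counter()
    for subject, verb, obj in bindings:
        for term in (subject, verb, obj):
            if term:
                references[term.lower()] += 1

    weights = {}
    for (word, pos), count in raw.items():
        base = POS_WEIGHT.get(pos, DEFAULT_POS_WEIGHT) * count
        weights[(word, pos)] = base + REFERENCE_BONUS * references[word]
    return weights
```

Under these assumed constants, a noun that occurs twice and appears in one binding receives a weight of 1.0 * 2 + 0.5 = 2.5.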

