CIS 895 – MSE PROJECT
KDD-Service based Numerical Entity Searcher (KSNES)
Presentation 2 on March 31st, 2009
Naga Sowjanya
OUTLINE
Project Data Flow Diagram
Action Items
Architectural Design
Test Plan
Formal Inspection Checklist
Project Plan
Prototype Demonstration
Questions / Comments
PROJECT DATA FLOW DIAGRAM:
NUMERICAL ENTITY SEARCHER
MODULES IN THE PROJECT
Webpage (JSP): For requesting and receiving information from the service.
POS Tagger (Java): Stanford POS Tagger
Numerical Phrase Extractor (Java): Implemented using Shallow Parsing Technique
Number-Unit/Date Pattern Recognizer (C++): Implemented based on the Numerical Quantifier developed by Benjamin Sapp, UIUC.
ACTION ITEMS
Implemented Numerical Phrase Extractor
Detailed Description of Test Plan
Wrote Formal Specification using USE
UML Representation of the System
ARCHITECTURAL DESIGN
Service Oriented Architecture
PACKAGE VIEW
Overall Package View
Class descriptions, attributes and operations are contained in the Architecture Design Document
SEQUENCE DIAGRAM
CLASS DIAGRAM (NPE PACKAGE)
CLASS DIAGRAM (NDPR PACKAGE)
IMPLEMENTING NUMERICAL PHRASE EXTRACTOR
Input: Tagged Text
I/PRP lost/VBD thirty-three/JJ dollars/NNS in/IN 1998/CD
Regular expressions are used to determine the numerical patterns in the input.
thirty-three/JJ dollars/NNS in/IN 1998/CD
Output: Numerical Phrases
thirty-three dollars in 1998
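The input-to-output step above can be sketched in Java as follows. This is a minimal illustration, not the project's code: the class name TagStripper, the method names, and the single PHRASE pattern are assumptions made for this example only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagStripper {
    // Hypothetical pattern for illustration only (not from the project source):
    // a number-like word, a plural noun, and an optional "in <year>" tail.
    private static final Pattern PHRASE = Pattern.compile(
        "[a-zA-Z0-9-]+/(JJ|CD) [a-zA-Z]+/NNS( in/IN \\d{4}/CD)?");

    // Removes the "/TAG" suffix from every token of a tagged span.
    static String stripTags(String tagged) {
        return tagged.replaceAll("/[A-Z$]+", "");
    }

    // Returns the first numerical phrase found in the tagged text, or null.
    static String extract(String taggedText) {
        Matcher m = PHRASE.matcher(taggedText);
        return m.find() ? stripTags(m.group()) : null;
    }

    public static void main(String[] args) {
        String input = "I/PRP lost/VBD thirty-three/JJ dollars/NNS in/IN 1998/CD";
        System.out.println(extract(input));
    }
}
```

The match is done on the tagged text, and the tags are stripped only from the matched span, so the POS information stays available while the pattern is being applied.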
TAGSET
SOME PATTERNS
"\\d+-\\d+(/JJ|/CD) [a-zA-Z]+/NN" parses
3-2/JJ lead/NN
20-20/JJ match/NN
"(between|Between|from|From|In|in|since|Since|during|During)/IN ..../CD (([a-zA-Z]+/CC|[a-z]+/TO) ..../CD)?" parses
'between 1987 and 1997', 'in 2007 and 2008'
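As a rough sketch, the two patterns on this slide can be compiled and run with java.util.regex as shown here. The class name SlidePatterns, the helper findAll, and the sample sentences are assumptions for illustration. The patterns are copied from the slide, with the stray space before "Since" removed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SlidePatterns {
    // Score-like phrases such as "3-2/JJ lead/NN".
    static final Pattern SCORE = Pattern.compile("\\d+-\\d+(/JJ|/CD) [a-zA-Z]+/NN");

    // Year or year-range phrases introduced by a preposition.
    static final Pattern YEAR_RANGE = Pattern.compile(
        "(between|Between|from|From|In|in|since|Since|during|During)/IN ..../CD"
        + " (([a-zA-Z]+/CC|[a-z]+/TO) ..../CD)?");

    // Collects every non-overlapping match of p in the tagged text.
    static List<String> findAll(Pattern p, String taggedText) {
        List<String> hits = new ArrayList<>();
        Matcher m = p.matcher(taggedText);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(findAll(SCORE, "They/PRP took/VBD a/DT 3-2/JJ lead/NN"));
        System.out.println(findAll(YEAR_RANGE,
            "Sales/NNS grew/VBD between/IN 1987/CD and/CC 1997/CD"));
    }
}
```

Note that "...." accepts any four characters, so the year pattern relies on the /CD tag rather than on digits to identify the year.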
ASSIGNING BOUNDS
Words are detected in order to set one of the bounds >, <, ~, =
"=" is used if no bound words are mentioned
Bound   Corresponding words
>       more than, no less than, no fewer than, at least, over
<       up to, not over, no more than, at most, less than
~       about, around, approximately, some, nearly, almost
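One way to picture the bound assignment is a lookup on the words that open a phrase. The sketch below is an assumption-laden illustration: the class BoundAssigner and method boundOf are hypothetical names, only a subset of the cue words from the table is included, and the real system may detect cues differently.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

class BoundAssigner {
    // Cue phrase -> bound symbol. Only a subset of the table's words (illustrative).
    private static final Map<String, Character> CUES = new LinkedHashMap<>();
    static {
        for (String w : new String[] {"more than", "no less than", "no fewer than", "over"}) {
            CUES.put(w, '>');
        }
        for (String w : new String[] {"up to", "not over", "no more than", "less than"}) {
            CUES.put(w, '<');
        }
        for (String w : new String[] {"about", "around", "approximately", "nearly", "almost"}) {
            CUES.put(w, '~');
        }
    }

    // Returns the bound for a phrase; '=' when the phrase starts with no known cue.
    static char boundOf(String phrase) {
        String p = phrase.trim().toLowerCase(Locale.ROOT);
        for (Map.Entry<String, Character> e : CUES.entrySet()) {
            // The trailing space keeps "over" from matching words like "overall".
            if (p.startsWith(e.getKey() + " ")) {
                return e.getValue();
            }
        }
        return '=';
    }

    public static void main(String[] args) {
        System.out.println(boundOf("About 100 miles per hour"));
        System.out.println(boundOf("Less than 10% support"));
        System.out.println(boundOf("$3 million"));
    }
}
```

Matching on the start of the phrase keeps "no more than" from being mistaken for "more than", since the negated form begins with a different word.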
SOME PATTERNS
"[a-zA-Z0-9]+/CD( percent/NN)?( out/IN)? of/IN( the/DT)?( [a-zA-Z]+/CD)?( [a-zA-Z]+/JJ)? [a-zA-Z]+(/NN|/NNS|/NNP)" parses
one of the five people
two of the groups
one of the rare cases
89 percent of people
five of the seven former employees
3 out of 5 people
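A minimal sketch of applying this "X of Y" pattern to tagged text follows. The class OfPattern, the method phrase, and the tagged sample sentences are assumptions for illustration. The pattern is taken from the slide, with the stray space between the optional groups removed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OfPattern {
    // "X of Y" pattern from the slide, split across lines for readability.
    static final Pattern X_OF_Y = Pattern.compile(
        "[a-zA-Z0-9]+/CD( percent/NN)?( out/IN)? of/IN( the/DT)?"
        + "( [a-zA-Z]+/CD)?( [a-zA-Z]+/JJ)? [a-zA-Z]+(/NN|/NNS|/NNP)");

    // Returns the first matching phrase with its POS tags removed, or null.
    static String phrase(String taggedText) {
        Matcher m = X_OF_Y.matcher(taggedText);
        if (!m.find()) {
            return null;
        }
        // The alternation tries /NN first, so for an /NNS token the match ends
        // just before the final "S"; the stripped words are unaffected.
        return m.group().replaceAll("/[A-Z]+", "");
    }

    public static void main(String[] args) {
        System.out.println(phrase("one/CD of/IN the/DT five/CD people/NNS"));
        System.out.println(phrase("89/CD percent/NN of/IN people/NNS"));
        System.out.println(phrase("five/CD of/IN the/DT seven/CD former/JJ employees/NNS"));
    }
}
```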
PHRASES THAT CAN BE PARSED
Numerical Phrases
27 year-old boy
A 3-2 lead
9 in 10 people
About 100 miles per hour
200 adults and children
$3 million
About two-thirds of the vote
The 17-mile drive
Less than 10% support
Six-bedroom apartment
5.987 ml
10:00 a.m. CST
From 400 to 500 miles
Temporal Phrases
Last year
Next week
Monday – Sunday
January–December
1956-60
Mid-1990s
Between 1999 and 2008
17th century
18 April 2008
Dec 21, 2009
October 10th 1984
John, 67
Since 1998
PHRASES THAT ARE NOT CURRENTLY PARSED
Numerical Phrases
six-pack of drinks
$100 more
252° (as the POS tagger can't parse this)
Temporal Phrases
31st of March 1998
Since mid-November
the January-April period
Future Work:
These phrases can also be parsed by adding more patterns to the current system, but for now only the most important and most commonly occurring patterns are considered.
The current goal is to establish the basic approach to numerical phrase extraction.
FORMAL SPECIFICATION
Created and validated using USE 2.3.1.
All classes are specified
All important attributes and methods are specified
Constructor methods are not specified
Contained at the end of the Architectural Design Document
TEST PLAN
Outputs are checked at each module by the developer by comparing them with manually calculated results:
Check if the POS tagger has produced the tagged text.
Check if the numerical phrases are extracted.
Check if each numerical phrase is broken down into Value, Unit and Unit-Type.
UML diagrams and the required specifications will be checked for consistency by two fellow MSE students.
User interaction will be tested by the developer and the technical inspectors.
FORMAL INSPECTION CHECKLIST
The following items are to be checked:
The symbols used in the class diagram conform to UML standards
The symbols used in the sequence diagrams conform to UML standards
The classes in the class diagrams have corresponding descriptions provided in the Architecture Document
The descriptions of the classes in the Architecture Document are clear and concise
The classes in the USE model are consistent with those in the Architecture Document
All the requirements in the Software Requirements Specification have been covered in the Architecture Document
The multiplicities in the USE model have been depicted in the class diagram
PROJECT SCHEDULE
Key Dates
Presentation 1: February 24th, 2009
Complete Numerical Sub-Chunker
Presentation 2: March 31st, 2009
Complete Numerical Phrase Extractor
Presentation 3: April 10th, 2009
Patch up the modules
Develop a GUI
Set them up on the server
Submit the complete documents to the committee by April 13th, 2009
Final Portfolio submitted by April 15th, 2009
PROJECT SCHEDULE
PROTOTYPE DEMONSTRATION
POS Tagger working
For now it works on the local machine
Numerical Phrase Extractor
For now it works on the local machine
PHASE 3 DELIVERABLES
Action Items
Component Design
Assessment Evaluation
Project Evaluation
User's Manual
Formal Technical Inspection Checklists
Presentation 3
Executable Project Source Code
TO-DO LIST
Revise the Documents
Revise Project Schedule
Work on the Phase 3 deliverables
Final Demo
Questions??
Suggestions!!
THANK YOU