Semi-Automatic Content Extraction from Specifications

Post on 25-Feb-2016


description

Semi-Automatic Content Extraction from Specifications. Krishnaprasad Thirunarayan, Department of Computer Science & Engineering, Wright State University; Aaron Berkovich and Dan Sokol, Cohesia Corporation. Extraction: Summarize in a prescribed vocabulary. Spec: Text. Spec: SDR.

transcript

1

Semi-Automatic Content Extraction from Specifications

Krishnaprasad Thirunarayan

Department of Computer Science & Engineering, Wright State University

Aaron Berkovich and Dan Sokol

Cohesia Corporation

2

Extraction : Summarize in a prescribed vocabulary

Spec: Text Spec: SDR

Domain Library

3

Participants

Sponsor: National Science Foundation
  SBIR: Phase I and Phase II

Industry: Cohesia Corporation
  Developer of (B2B) content and lower-level infrastructure

University: Wright State University
  User-level tools: conceptualization and design

Others: Geometric Software Solutions, …
  Tool/Product development and integration

4

Outline

Background and Goal (What?)

Motivation (Why?)

Details (How?)

Conclusions

5

Background and Goal

6

Manual Content Extraction

Input:

Paper-based specifications of a manufacturing task describing composition, processing, and testing of materials

Additional constraints imposed by customers and vendors

Appropriate Ontology and Domain Library defining standard vocabulary

7

Output: An “equivalent” formalized description of specs in Specification Definition Representation (SDR)

Observation: Specs originating from a common source (ASTM, SAE, GE) share vocabulary and structure.

An experienced extractor exploits the linguistic patterns found in specs to interpret them.

8

Assistance for Extraction

[Diagram labels: Paper Document, Text Document, Mark-Up Editor (Wizard), SDR Document, Document Proofer, original]

9

Semi-automatic Content Extraction

Starting from an electronic version of a spec, develop a strategy for semantic markup, to assist in creating an “equivalent” SDR.

Semantic Markup: The task of overlaying an abstract syntax (“the essence”) on the “free-form” text.

• Spec: Human-sensible

• Mark-up: Computer-sensible

Automate routine mechanical tasks.

10

AEROSPACE SPECIFICATION

TOLERANCES

Corrosion and Heat Resistant Steel, Iron Alloy, Titanium, and Titanium Alloy Bars and Wire

1. SCOPE: This specification covers established inch/pound manufacturing tolerances applicable to corrosion and heat resistant steel, iron alloy, titanium, and titanium alloy bars and wire ordered to inch/pound dimensions. These tolerances apply to all conditions unless otherwise noted. The term excl. is used to apply only to the higher figure of the specified range.

2. DIAMETER AND THICKNESS:

2.1 Cold Finished Bars:

2.1.1 Rounds, Squares, Hexagons, and Octagons (See 2.1.3 and 2.1.4)

TABLE I

                              Tolerance, Inch
Specified Diameter            Rounds,           Squares, Hexagons, and Octagons,
or Thickness, Inches          plus and minus    minus only
                              (See 2.1.1.1)     (See 2.1.1.2)
Over 0.500 to 1.000, excl     0.002             0.004
1.000                         0.0025            0.004
Over 1.000 to 1.500, excl     0.0025            0.006
1.500 to 2.000, incl          0.003             0.006
Over 2.000 to 3.000, incl     0.003             0.008
Over 3.000 to 4.000, incl     0.003             0.010

2.1.1.1 Size tolerances for round bars are plus and minus as shown in Table I, unless otherwise specified. If required, however, they may be specified all plus and nothing minus, or all minus and nothing plus, or any combination of plus and minus, if the total spread in size tolerance for a specified size is not less than the total spread shown in the table.

2.1.1.2 For titanium and titanium alloys, the difference among the three measurements of the distance between opposite faces of hexagons shall be not greater than one-half the size tolerance and the difference between the measurements of the distance between opposite faces of octagons shall be not greater than the size tolerance.

AS 2241J Issued 5-1-75 Revised 1-1-83

Semantic Mark-up

[Callout labels on the spec page: Spec Name, Spec Title, Revision, Revision Date, Qualifier, Characteristic, Value, Values, Procedure]

11

Ontology

(Gruber) An ontology is an explicit specification of a conceptualization, which is an abstract, simplified view of the world that we wish to represent for some purpose.

12

SDL Ontology

[Diagram: entities Document, Layer, Revision, Reference, Procedure, Characteristic, Value, and Domain Library, linked by relationships with cardinalities “1 or many”, “0, 1 or many”, and “Ref: 0, 1 or many”]

13

Spec: Text Spec: SDR

Extraction: Spec to SDR

14

Fundamental Obstacles

The relation between the spec and its SDR rendition is “not linear”.

  Same spec information duplicated in SDR in different contexts.

  Contiguous block of information in SDR spread out in spec.

Equivalence of phrases hard to formalize.

Tables and footnotes abbreviate information in irregular and complicated ways.

15

Linearizing through Abstraction: Introducing Specification Definition Language

[Diagram: Original Spec, SDL, and SDR, connected by arrows labeled Manual (original), Manual (Ph-I), Compiled (Ph-I), and Literal, Integrated, Semi-automatic (Ph-II)]

Original AMS-4976 spec is 8 pages. Its SDL equivalent is 15 pages.

Original AMS-5662J spec is 11 pages. Its SDR equivalent is 30 pages.

16

17

18

Introducing Extraction Wizard

19

Motivation (Why?)

20

Business Background (Supply Chain)

[Diagram: supply chain with labels Engine, Forger, Metal, and two Drawing/Spec pairs]

21

Diverse and Large number of specs and spec users

[Diagram: spec users Quality Assurance, Inspecting/Testing, Sales, and Engineering, with documents Certificate of Test, Sales Order, Lab Routing, and Production Routing]

Specs: AMS, DIN, JIS, PWA, GE, ASTM, GM, etc.

22

Quality Issues

Transcription Errors
  From spec to hand-written sheet to computer

Completeness
  Info in spec but missing in SDR

Soundness
  Info in SDR but not in spec

Uniformity of Form

Uniformity in Interpretation
  Different understanding of the meaning while mapping to SDR (Ambiguity/Inconsistency)

23

Efficiency Issues

Minimize time/effort required.
  Automate routine mechanizable tasks
  Eliminate “cut-paste-modify” cycle

Minimize duplication of information.
  Concise representation
    Size of translation = O(Size of spec).
  Update consistency

Flexible rendition into various external forms.

24

Details (How?)

25

Essence of our Approach : Literal Translation

Conceptually, every piece of info in SDR owes its existence to phrases in spec.

Enable maintenance of correspondence between spec and its translation, and attempt to embed the translation into spec.

Requires compilation into SDL/SDR. Cf. XML/XSL Technology

26

A semi-automatic approach is feasible only if the partially generated translations (annotations) are intelligible to an extractor in the context of the original spec and are systematically extensible.

Note that current manual extractions into SDL are not literal even though SDL enables it to an extent.

27

SDL Studio and its Extension

SDL Studio enables creation and editing of SDL documents. It has facilities to search domain library and compile SDL into an equivalent SDR.

This can be further enriched using
  Improved Domain Library Search
  Extraction and composition of SDL fragments
  Providing templates for commonly occurring “procedures”
  Table processor
  etc …

28

Domain Library Search Engine

29

Domain Library

Currently, it contains technical phrases pertinent to materials and processing requirements.

Cohesia creates and maintains DLs for in-house use and for use by its clients such as GE, Alcoa, Allvac, etc.

Typical size: 10,000 phrases

30

31

Improving Domain Library Search

Goal: Mapping “equivalent” phrases to same Domain Library Term

Uses:
  Techniques for prefix removal, stemming, and dealing with other variations for root recognition
  Stop words elimination
  Abbreviation expander and alias normalization

32

Algorithm Sketch

List[Phrase] dl;
Phrase in; Int mt;
List[Word] dlwm, inwm;   % with back references
List[Phrase] dlts;
begin
  dl := readAndBuildDomainLibrary();
  dlwm := buildWordMapAndBackLinks(dl);   % delete stop words, link words to DLTs
  (in, mt) := readInputPhraseAndMatchThreshold();
  inwm := buildWordMap(in);
  dlts := buildDLTsListContainingMatchedWords(dlwm, inwm);
  dlts := evaluateAndFilterDLTs(dlts, mt);
end;
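The algorithm sketch above can be illustrated with a minimal Python version. This is only a sketch under stated assumptions: the stop-word list, the function names, and the scoring rule (percentage of a DLT's words found in the input phrase) are hypothetical, and words are compared by exact equality rather than by the graded word match described on the next slide.

```python
from collections import defaultdict

# Hypothetical stop-word list; the slides do not give the actual one.
STOP_WORDS = {"a", "and", "for", "in", "of", "the", "to"}


def words(phrase):
    """Lower-case a phrase, split on whitespace, and drop stop words."""
    return [w for w in phrase.lower().split() if w not in STOP_WORDS]


def build_word_map(domain_library):
    """Map each word to the set of DLTs that contain it (the back links)."""
    word_map = defaultdict(set)
    for term in domain_library:
        for w in words(term):
            word_map[w].add(term)
    return word_map


def find_dlts(domain_library, input_phrase, match_threshold):
    """Return (DLT, score) pairs meeting the threshold, best score first.

    Assumed scoring: percentage of the DLT's words present in the input.
    """
    word_map = build_word_map(domain_library)
    input_words = set(words(input_phrase))

    # Collect every DLT that shares at least one word with the input.
    candidates = set()
    for w in input_words:
        candidates |= word_map.get(w, set())

    scored = []
    for term in candidates:
        term_words = words(term)
        hits = sum(1 for w in term_words if w in input_words)
        score = 100 * hits // len(term_words)
        if score >= match_threshold:
            scored.append((term, score))
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))
```

Because the lookup starts from the words of the input phrase, DLT words need not be contiguous in the input, and one phrase can yield several DLTs. Lowering the threshold trades precision for recall.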

33

Matching words

Int wordMatch(w1,w2)
begin
  % normalized = vowels deleted, i.e., only consonants present
  if caseUniformAndCleanedMatch(w1,w2)
    return 100;
  if normalizedMatch(w1,w2)
    return 90;
  if orderedNormalizedMatch(w1,w2)
    return 70;
  % analyze for differences due to prefix and suffix
  if normalizedDifferenceInPrefixSuffixTables(w1,w2)
    return 90;
  return 0;   % no match (assumed default; not stated on the slide)
end;
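A runnable Python sketch of the word-matching cascade follows. The scores 100, 90, and 70 come from the slide; everything else is an assumption made for illustration: the reading of "ordered normalized match" as "same consonants in any order" (to tolerate transposed letters), the tiny prefix and suffix tables, the minimum-stem-length guard, and the final score of 0 for a non-match.

```python
import re

VOWELS = set("aeiou")
# Hypothetical, tiny affix tables; real tables would be built from the
# prefixes and suffixes that actually arise in domain library terms.
PREFIXES = ("non", "un", "re")
SUFFIXES = ("ing", "ed", "es", "s")


def clean(word):
    """Lower-case and strip everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", word.lower())


def skeleton(word):
    """Normalized form: the cleaned word with vowels deleted."""
    return "".join(c for c in clean(word) if c not in VOWELS)


def strip_affixes(word):
    """Remove at most one known prefix and one known suffix."""
    w = clean(word)
    for p in PREFIXES:
        if w.startswith(p) and len(w) > len(p) + 2:
            w = w[len(p):]
            break
    for s in SUFFIXES:
        if w.endswith(s) and len(w) > len(s) + 2:
            w = w[:-len(s)]
            break
    return w


def word_match(w1, w2):
    """Return a match score between two words (0 means no match)."""
    if clean(w1) == clean(w2):
        return 100
    s1, s2 = skeleton(w1), skeleton(w2)
    if s1 and s1 == s2:                      # differ only in vowels
        return 90
    if s1 and sorted(s1) == sorted(s2):      # consonants transposed
        return 70
    a1, a2 = skeleton(strip_affixes(w1)), skeleton(strip_affixes(w2))
    if a1 and a1 == a2:                      # differ only in prefix/suffix
        return 90
    return 0
```

For example, under these assumptions `word_match("hardness", "hrdness")` scores 90 (missing vowel), `word_match("tensile", "tesnile")` scores 70 (transposition), and `word_match("forged", "forging")` scores 90 (suffix difference).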

34

Design Rationale

Input phrase may contain multiple DLTs.

DLT words may not appear contiguous in input.

Consonants are significant, and "correct" spellings may differ in vowels.
  Robustness with respect to spelling errors such as transposition of letters or missing vowels.

Stemmers do not work satisfactorily for words appearing in DLTs.
  Instead, create tables customized to deal with prefixes and suffixes that arise in practice, and normalize dynamically.

Err on the side of recall rather than precision.

Number of words < Number of DLTs

35

Extraction Tool

36

Overall Approach

Preprocessing: Obtain spec in plain text form (from MSWord format).

This is a practical alternative to scanning and OCR-ing a paper-based spec.

Saving it in HTML format has the benefit of isolating tables. On the downside, it retains formatting tags.

Semi-Automatic Extraction: Recognize phrases in spec text that are associated with a requirement and generate SDL fragments to assist an extractor.

37

Two Possible Avenues (From Document to SDL)

Iteratively annotate the document text with XML tags reflecting the SDL structure and ontology.
  Generate various views of the document and SDL from this single XML Master.

Iteratively generate a sequence of progressively detailed SDL documents from spec text.

38

First Avenue: Via XML

Semi-automatic extraction is accomplished in two phases:

  Initial automatic markup phase: Systematically recognize domain library terms in spec text and add suitable XML annotations. Then generate a first-cut SDL fragment.

  Subsequent manual conversion phase: Extractor organizes the information and completes the translation into an equivalent SDL.

Further steps: As the tool matures, automation can be attempted to produce more detailed extractions.
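The initial automatic markup phase can be pictured with a small Python sketch that wraps each recognized domain library term in an XML annotation. The `dlt` element and its `term` attribute are hypothetical names chosen for illustration, not the tag set used by the actual tool, and matching here is exact (case-insensitive) rather than the tolerant matching described earlier.

```python
import re
from xml.sax.saxutils import escape, quoteattr


def tag_dlts(text, domain_library):
    """Wrap each domain library term found in text in a <dlt> element."""
    if not domain_library:
        return escape(text)

    # Longest terms first, so "tensile strength" wins over "tensile".
    terms = sorted(domain_library, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    canonical = {t.lower(): t for t in domain_library}

    out, pos = [], 0
    for m in pattern.finditer(text):
        out.append(escape(text[pos:m.start()]))      # untouched spec text
        term = canonical[m.group(1).lower()]
        out.append("<dlt term=%s>%s</dlt>"
                   % (quoteattr(term), escape(m.group(1))))
        pos = m.end()
    out.append(escape(text[pos:]))
    return "".join(out)
```

The original wording stays inside the element while the canonical term goes in the attribute, which keeps the link between the spec text and its annotation.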

39

Advantages:

Focus is on a single persistent XML Master that tries to maintain a link between the spec and the extractions.

All the processing is orchestrated on this XML file.

Implements various views of the XML source using XSLFO and various transformations on the XML source using XSLT.

(cont’d)

40

Disadvantages:

There is a need to manage a separate SDL version to incorporate user inputs and corrections. This is because, even though it may be possible to represent SDL constructs using XML tags, it may not be possible to integrate user edits literally into the XML source.

(cont’d)

41

Semantic-Markup Algorithm

1. Insert Structure Tags
2. Insert Ontology Tags
3. Infer Missing Char.
4. Group Char. & Values
5. Group C-Vs into Procedures
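As an illustration of the first step (inserting structure tags), here is a minimal Python sketch that recognizes numbered clauses such as "2.1.1.1" and wraps each one in a flat `section` element. The element and attribute names are hypothetical, and the heading pattern is an assumption. In particular, a table row that starts with a number (for example "1.000 0.0025 0.004") would be mistaken for a heading, so real spec text would need table handling first.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# A clause number such as "1." or "2.1.1.1", followed by the clause text.
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+(.*)$")


def tag_structure(lines):
    """Wrap numbered clauses of a plain-text spec in flat <section> tags."""
    out = ["<spec>"]
    open_section = False
    for line in lines:
        text = line.strip()
        if not text:
            continue
        m = HEADING.match(text)
        if m:
            if open_section:
                out.append("</section>")
            out.append("<section num=%s>" % quoteattr(m.group(1)))
            open_section = True
            text = m.group(2)
        out.append(escape(text))
    if open_section:
        out.append("</section>")
    out.append("</spec>")
    return "\n".join(out)
```

Later steps (ontology tags, grouping of characteristics and values) would then operate on the resulting XML rather than on raw text.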

42

Functional Components

[Diagram: Text file, Structure Tagger, XML file, DLT Tagger (using the Domain Library), XML file, Group Tagger, XML file, SDL Converter, SDL file]

43

Tagging and Transforming

flex structTagger.flex
gcc lex.yy.c -lfl
a < GE.txt > GE.xml
java org.apache.xalan.xslt.Process -in GE.xml -xsl CSDLStylesheet.xsl -out GE.sdl
…
java org.apache.xalan.xslt.Process -in GE.xml -xsl CExpSDLStylesheet.xsl -out GE.exp.sdl
java org.apache.xalan.xslt.Process -in GE.xml -xsl OriginalStylesheet.xsl -out GE.org.txt

44

45

46

Second Avenue: SDL all along

As there is no obvious way of incorporating SDL edits into the XML source in general, try to generate legal SDL at different levels of detail all along.

Advantage: Yields SDL documents that can be immediately used in Spec Studio and extended by an extractor.

Disadvantage: This form does not retain correspondence with the original document explicitly.

47

Extraction Tool – Prototype Operation

Prototype Operation

48

49

Views: In the context of Spec

  Plain text view
  Text view with “requirement” phrases color coded and highlighted
  View of domain library terms found in the spec

Views: In the context of SDL

  Spec identity view + Large Note: Method D Extraction, Method C Extraction
  Procedure view
  Characteristic-value pair view

50

Extraction Method  Qualifiers                  Requirements                                             Procedures  References
D                  Spec Class Only             All information in notes                                 Not used    In notes
C                  Spec Class, Product, Alloy  All information in notes                                 Not used    In notes
B                  Many Qualifiers             Characteristic-Value pairs and notes                     Used        Retrieved
A                  Many Qualifiers             CV pairs, pre-conditions, permissibility, formulas, etc  Used        Retrieved

51

Additional Standalone Tools

Domain Library Browser
  Given a word or a phrase, display all the domain library information related to it.

SDL Fragment Generator
  Given a sentence, generate an SDL fragment that captures its essence.

These tools can assist an extractor in composing an SDL document incrementally.

52

Future Work

53

Longer-term Vision

Marketplace continues to confirm the need for tools to capture the semantic interpretation of document content.

Cohesia plans to productize the results of the research into a viable commercial product.

54

Example Engineering Tasks

How to express and represent templates for well-known “procedures”?
  Alternative to cut-paste-modify cycle
    Tensile Test
    Heat Treatment
    Melt Method
    Chemistry
    Packaging

55

How to express and represent heterogeneous tables and non-trivial footnotes in a spec in a convenient and uniform way?

How to create, manipulate, and store specs in SDR and SDL among other forms and maintain interoperability?

56

Example Research Questions

What are the forms of extraction rules?
  Phrase pattern matching
  Theory of equivalence/subsumption

Example: Aliases / Equivalent Phrases
  Creep = Plastic Strain
  Delivery Condition = Surface Finish
  Cause for Rejection = Rejection Criteria
  Imperfections detrimental to usage of product = Free of injurious defects

57

Rules for interpreting “logic words”
  o Connectives: and, or, …
  o Quantifiers: all, every, each, …
  o Modifiers: over, under, more, less, …
  o Negation: not, no, unless, except, “free of” ...

Mismatch?
  • A, B, and C => {A,B,C}
    union/OR-logic

Distributive Laws?
  • Lot and order number => lot number and order number
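The distributive rewrite "Lot and order number => lot number and order number" can be sketched as a single pattern-based rule in Python. This is a deliberately naive illustration, with a hypothetical function name and a fixed four-word pattern. It applies the rewrite whenever the pattern fits, whereas deciding when distribution is actually the intended reading is the open question the slide raises.

```python
import re

# "<modifier> and|or <modifier> <head>", e.g. "Lot and order number".
COORDINATION = re.compile(r"(\w+)\s+(and|or)\s+(\w+)\s+(\w+)")


def distribute_head(phrase):
    """Expand 'A and B head' into ['A head', 'B head'].

    Phrases that do not fit the four-word pattern are returned unchanged
    as a one-element list.
    """
    phrase = phrase.strip()
    m = COORDINATION.fullmatch(phrase)
    if not m:
        return [phrase]
    first, _connective, second, head = m.groups()
    return [f"{first} {head}", f"{second} {head}"]
```

A phrase such as "bars and wire ordered" also fits the pattern but should not be distributed, which shows why such rules need support from the domain library or from an extractor.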

58

Another Example Scenario

Buyers’ Purchase Order
  Melt Atmosphere = Inert Gas
  Sulphur < 2.0%
  Niobium < 0.5%

Sellers’ Inventory
  Melt Atmosphere = Argon
  Sulphur < 1.7%
  Columbium < 0.2%

Match?
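One way to make the "Match?" question concrete is the Python sketch below. It assumes that the first block of requirements on the slide is the buyer's purchase order and the second is the seller's inventory. The alias table (Columbium is an older name for Niobium) and the subsumption table (Argon is an inert gas) are hypothetical stand-ins for domain library knowledge, and the data layout and function names are invented for illustration.

```python
ALIASES = {"columbium": "niobium"}   # alias normalization
IS_A = {"argon": "inert gas"}        # subsumption: specific -> general


def canon(name):
    """Lower-case a name and replace a known alias by its canonical form."""
    n = name.strip().lower()
    return ALIASES.get(n, n)


def satisfies(offered, required):
    """True if every requirement is met by the offered description.

    Both arguments map a characteristic name to ("=", text) or
    ("<", upper_bound).
    """
    offered = {canon(k): v for k, v in offered.items()}
    for name, (op, req_val) in required.items():
        if canon(name) not in offered:
            return False
        off_op, off_val = offered[canon(name)]
        if off_op != op:
            return False
        if op == "<":
            # A tighter (or equal) upper bound implies the looser one.
            if off_val > req_val:
                return False
        else:
            o, r = canon(off_val), canon(req_val)
            if o != r and IS_A.get(o) != r:
                return False
    return True


buyer = {"Melt Atmosphere": ("=", "Inert Gas"),
         "Sulphur": ("<", 2.0), "Niobium": ("<", 0.5)}
seller = {"Melt Atmosphere": ("=", "Argon"),
          "Sulphur": ("<", 1.7), "Columbium": ("<", 0.2)}
```

Under these assumptions `satisfies(seller, buyer)` holds, while `satisfies(buyer, seller)` does not, because "Inert Gas" is more general than "Argon". The match is therefore directional, which is what a theory of subsumption rather than plain equivalence has to capture.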

59

What are the strategies for searching and matching?
  Top-down: Template-driven expectations
  Bottom-up: Identifying requirements present
  Closure: Manual addition / modification / disambiguation

60

Relevant Information Extraction Research and Technologies References

Message Understanding Conferences.

Work on NLP and IE at UMass, NYU, SRI, etc.

Search and Filtering tools.

61

Conclusions

62

[Diagram labels: Spec Text on Paper; Paper Scanning; Spec Text as Electronic Image; Optical Character Recognition; Spec Text in HTML/XML; Extraction Wizard; SDL Editor; SDL (XML); SDL Compiler; SDR; Read, Interpret, & Type. Stages marked Before, NSF SBIR Phase I, and NSF SBIR Phase II]

63

Appendix

64

AMS 4928N (Ti Alloy)

65

Tensile Test

66