1
On Embedding Machine-Processable Semantics into Documents
Krishnaprasad Thirunarayan
Department of Computer Science & Engineering
Wright State University
Dayton, OH-45435, USA
2
Talk Outline
Background and Motivation (Why?)
Goals (What?)
Details (How?)
Conclusions
3
Background and Motivation
4
Heterogeneous Doc. Spec. Defn. Rep.
Content Extraction:
Formalize doc, using controlled vocabulary
5
Problems with this approach to content extraction
Archiving the spec (for human comprehension) separately from its formalization is not conducive to traceability.
Manual extraction from the spec (from scratch) for each use is labor intensive, time consuming, and prone to typographical errors.
6
Observation
Conceptually, every piece of information in an extraction owes its existence to a phrase in the spec and, possibly, to the controlled vocabulary.
So, explore techniques to maintain the correspondence between a spec fragment and its formalization.
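One way to picture this correspondence is a record that stores each formalized statement together with the character span of the spec phrase it came from. The sketch below is only illustrative: the class name `TracedFact`, the notation in `formal`, and the sample spec sentence are assumptions made for the example, not part of the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TracedFact:
    """A formalized statement plus a pointer back to its source phrase."""
    formal: str  # machine-processable form (hypothetical notation)
    start: int   # offset of the source phrase in the spec text
    end: int     # end offset (exclusive)

    def source(self, spec: str) -> str:
        """Return the spec fragment this fact was extracted from."""
        return spec[self.start:self.end]


# Made-up spec sentence, used only to exercise the record.
spec = "Sheet 0.50 mm and under shall have a tensile strength of 165 ksi."
phrase = "tensile strength of 165 ksi"
start = spec.index(phrase)
fact = TracedFact("strength.tensile = 165", start, start + len(phrase))
print(fact.source(spec))
```

Because the fact carries offsets rather than a copy of the text, a reader can always jump from the formalization back to the exact wording in the spec.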
7
Goal
8
General Problem
Embed domain-specific mark-up (annotations) into a human-sensible document to make explicit the semantics of “content” text and complex data, and to augment an interpretation in a modular fashion.
  Document text: Human comprehensible
  Semantic Mark-up: Machine processable
9
Details (How?)
10
Nature of Specs
Semi-structured
Heterogeneous
  Text
  Tables
  Images
Constrained technical vocabulary
Available as MS Word document
11
Pre-processing Spec
Abstract content from spec document by removing display-oriented information
  Save text
  Save tabular data, preserving grid layout
  Retain links to images
  …
Note: “Save As text” option in MS Word inadequate
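As a rough sketch of this pre-processing step, the code below walks an XML rendering of a document, drops inline display markup, and keeps table cells in their grid positions. The element names (`doc`, `p`, `table`, `row`, `cell`, `b`, `i`) and the sample content are hypothetical and do not reflect the actual output schema of any converter.

```python
import xml.etree.ElementTree as ET


def extract_content(xml_text):
    """Return (paragraphs, tables); each table is a list of rows of cell strings."""
    root = ET.fromstring(xml_text)
    paragraphs, tables = [], []
    for element in root:
        if element.tag == "table":
            # Keep the grid: one inner list per row, one string per cell.
            grid = [
                ["".join(cell.itertext()).strip() for cell in row.iter("cell")]
                for row in element.iter("row")
            ]
            tables.append(grid)
        else:
            # itertext() flattens inline markup such as <b> or <i>.
            text = " ".join("".join(element.itertext()).split())
            if text:
                paragraphs.append(text)
    return paragraphs, tables


# Hypothetical input, for illustration only.
SAMPLE = """<doc>
  <p>Strength limits for <b>sheet</b> stock.</p>
  <table>
    <row><cell><i>0.50 - 1.00</i></cell><cell>160</cell><cell>150</cell></row>
  </table>
</doc>"""

paragraphs, tables = extract_content(SAMPLE)
print(paragraphs)
print(tables)
```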
12
Heterogeneous Document
13
XML generated by Majix
14
ASCII Output
15
Annotating Pre-processed Spec
Embedding Machine Processable Semantics
  Recognizing and tagging text using controlled vocabulary
    By-product of: Document Indexing and Semantic Search
  Tagging tabular data to make explicit its semantics: same grid layout, but different interpretation and dependencies based on headings
Explore: XML-based programming language Water for defining data and its behavior (semantics)
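The text-tagging step can be sketched as a phrase matcher over a controlled vocabulary. This is a minimal illustration under stated assumptions: the vocabulary contents, the `<term name="...">` element, and the function names `tag_terms` and `index_terms` are all hypothetical, and a real tagger would also need to handle inflections and line breaks inside phrases.

```python
import re

# Hypothetical controlled vocabulary: surface phrase -> tag name.
VOCABULARY = {
    "tensile strength": "strength.tensile",
    "yield strength": "strength.yield",
    "thickness": "thickness",
}


def _pattern(vocabulary):
    # Longest phrases first, so multi-word terms win over their substrings.
    phrases = sorted(vocabulary, key=len, reverse=True)
    alternatives = "|".join(re.escape(p) for p in phrases)
    return re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)


def tag_terms(text, vocabulary=VOCABULARY):
    """Wrap every vocabulary phrase found in text in a <term> element."""
    def wrap(match):
        name = vocabulary[match.group(1).lower()]
        return f'<term name="{name}">{match.group(1)}</term>'
    return _pattern(vocabulary).sub(wrap, text)


def index_terms(text, vocabulary=VOCABULARY):
    """Map each tag name to the offsets where it occurs (a simple index)."""
    index = {}
    for match in _pattern(vocabulary).finditer(text):
        name = vocabulary[match.group(1).lower()]
        index.setdefault(name, []).append(match.start())
    return index


sentence = "Yield Strength depends on thickness."
print(tag_terms(sentence))
print(index_terms(sentence))
```

The same matcher serves both purposes mentioned on the slide: the substitution produces the tagged text, and the offsets it visits form an index usable for search.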
16
Locating Controlled Vocabulary Terms
17
Example Table
Thickness (mm)    Tensile Strength (ksi)    Yield Strength (ksi)
0.50 and under    165                       155
0.50 – 1.00       160                       150
1.00 – 1.50       155                       145
18
Example of Tagged Table
Thickness (mm) Tensile Strength (ksi) Yield Strength (ksi)
table.<setHeading thickness strength.tensile strength.yield/>
0.50 and under 165 155
table.<addRow 0 0.50 165 155 />
0.50 - 1.00 160 150
table.<addRow 0.50 1.00 160 150 />
1.00 - 1.50 155 145
table.<addRow 1.00 1.50 155 145 /> ...
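The row annotations above follow a regular pattern, so they could plausibly be generated by a small string-processing routine rather than typed by hand. The sketch below assumes exactly the two row shapes shown on this slide ("X and under" and "X - Y"); the function name `annotate_row` and the choice of `0` as the lower bound for "and under" rows mirror the example but are otherwise assumptions.

```python
import re

NUM = r"(\d+(?:\.\d+)?)"
# "0.50 - 1.00 160 150" (hyphen or en dash between the bounds)
RANGE_ROW = re.compile(rf"^\s*{NUM}\s*[-\u2013]\s*{NUM}\s+{NUM}\s+{NUM}\s*$")
# "0.50 and under 165 155"
UNDER_ROW = re.compile(rf"^\s*{NUM}\s+and\s+under\s+{NUM}\s+{NUM}\s*$")


def annotate_row(line):
    """Turn one row of table text into a Water-style addRow annotation."""
    match = UNDER_ROW.match(line)
    if match:
        upper, tensile, yield_strength = match.groups()
        lower = "0"  # assumed lower bound for "and under" rows
    else:
        match = RANGE_ROW.match(line)
        if match is None:
            raise ValueError(f"unrecognized table row: {line!r}")
        lower, upper, tensile, yield_strength = match.groups()
    return f"table.<addRow {lower} {upper} {tensile} {yield_strength} />"


for row in ["0.50 and under 165 155", "0.50 - 1.00 160 150", "1.00 - 1.50 155 145"]:
    print(annotate_row(row))
```

Values are kept as strings so that the annotation reproduces the spec's own formatting (for example `0.50` rather than `0.5`).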
19
Example of Processing Code
<defclass table rows=required=vector heading=optional=vector>
<defmethod setHeading t=required ts=required ys=required>
<set heading=<vector t ts ys/>/>
</>
<defmethod addRow smin smax ts ys>
<set rows=
table.rows.<insert <vector smin smax ts ys/>/>/>
</>
<defmethod computeYieldStrength> … </>
<defmethod computeTensileStrength> … </>
…
</>
20
(cont’d)
<defclass table rows=required=vector heading=optional=vector>
…
<defmethod computeTensileStrength>
<set temp=fluid.Thickness/>
<set i=0/>
<do>
<until <and temp.<less table.rows.<get i/>.1/>
temp.<more_or_equal table.rows.<get i/>.0/> /> >
table.rows.<get i/>.2
</until>
<set i=i.<plus 1/>/>
</do>
</>
</>
21
(cont’d)
<defclass table rows=required=vector heading=optional=vector>
…
</>
fluid.<set Thickness=0.60/>
<try
<set TensileStrength=table.<computeTensileStrength/>/>
TensileStrength
>
"TABLE: out of range error occurred"
</try>
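For readers unfamiliar with Water, the same lookup logic can be restated in Python. This is an approximate analogue, not a translation of Water semantics: the class name `StrengthTable`, the method names, and the use of `LookupError` for the out-of-range case are assumptions. The row test follows the Water code above, which treats each row as the half-open interval smin <= thickness < smax.

```python
class StrengthTable:
    """Rows of (smin, smax, tensile, yield), looked up by thickness."""

    def __init__(self):
        self.heading = None
        self.rows = []

    def set_heading(self, t, ts, ys):
        self.heading = (t, ts, ys)

    def add_row(self, smin, smax, ts, ys):
        self.rows.append((smin, smax, ts, ys))

    def _find_row(self, thickness):
        # Same test as the Water loop: smin <= thickness < smax.
        for row in self.rows:
            if row[0] <= thickness < row[1]:
                return row
        raise LookupError("TABLE: out of range error occurred")

    def compute_tensile_strength(self, thickness):
        return self._find_row(thickness)[2]

    def compute_yield_strength(self, thickness):
        return self._find_row(thickness)[3]


table = StrengthTable()
table.set_heading("thickness", "strength.tensile", "strength.yield")
table.add_row(0, 0.50, 165, 155)
table.add_row(0.50, 1.00, 160, 150)
table.add_row(1.00, 1.50, 155, 145)

try:
    print(table.compute_tensile_strength(0.60))
except LookupError as error:
    print(error)
```

Note that with half-open intervals a thickness of exactly 0.50 falls in the second row, even though the spec's wording "0.50 and under" suggests the first; a faithful encoding would need to settle that boundary explicitly.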
22
Water
XML-based OO Scripting Language
Facilitates creating Web Services
  Run methods remotely via web browser
Generalizes dynamic typing to constraint checking
  Conformance of actuals to formals
23
Pros and cons
Encoding Improvement
  Amount of tagging can be controlled by suitably delimiting table data and annotating it with a corresponding “string-processing” method
Master Copy Update
  Changes to the spec require manual modification of the archived annotated version
Irregular Tables in Specs
  Different units, etc.
24
Some Related Work
Microsoft Smart Tags
  Recognize “controlled” words in Office 2003 documents and associate predefined list of actions with each occurrence
SHOE
Table data in a declarative (logic) language
25
Prolog rendition
strengthTableRow(0,    0.50, 165, 155).
strengthTableRow(0.50, 1.00, 160, 150).
strengthTableRow(1.00, 1.50, 155, 145).
% ...

strengthTable(Thickness, TensileStrength, YieldStrength) :-
    strengthTableRow(L, U, TensileStrength, YieldStrength),
    L =< Thickness, U > Thickness.

thicknessToTensileStrength(Thickness, TensileStrength) :-
    strengthTable(Thickness, TensileStrength, _).
thicknessToYieldStrength(Thickness, YieldStrength) :-
    strengthTable(Thickness, _, YieldStrength).

?- thicknessToYieldStrength(0.6,YS).
26
Conclusions
27
A Step towards Holy Grail
Ultimately, enable authoring and/or extracting the human-comprehensible and machine-processable parts of a document “hand in hand”, and keep them “side by side”.