Sample Talks for Organizational Hints

Date post: 05-Jan-2016

Krishnaprasad Thirunarayan
Department of Computer Science and Engineering
Wright State University, Dayton, OH-45435

Overall R&D Agenda

Develop semi-automatic techniques for information extraction/retrieval to enable man and machine to complement each other in assimilation of semi-structured, heterogeneous documents

=> Semantic Web Technologies.

A Modular Approach to Document Indexing and Semantic Search

Goal (What?)

Background and Motivation (Why?)

Implementation Details (How?)

Evaluation and Applications (Why?)

Conclusions

Goal

Develop a modular approach to improving effectiveness of searching documents for information

Reuse and integrate mature software components

Background and Motivation

Improve recall using information implicit in the English language

Improve precision and recall using domain-specific information implicit in the document collection

Assist manual content extraction by mapping document phrases to controlled vocabulary terms (domain library)

NSF-SBIR Phases I and II with Cohesia Corp.

Enable extensions:
  Spell check input query
  Organize search results through grouping
  Improve precision through sense disambiguation

Enable experimentation:
  Investigate the empirical relationship between significant eigenvalues in the Singular Value Decomposition (SVD) and the number of document clusters, using benchmarks.

Implementation Details (How?)

Tools Used

Apache’s Lucene APIs: a high-performance Java text search engine library with smart indexing strategies.

WordNet and Java WordNet Library

NIST and MathWorks’ Java Matrix package (JAMA) for LSI

Domain-specific controlled vocabulary for Materials and Process Specs

Jazzy, a Java Open Source Spell-Checker

MEDLINE dataset

20-Newsgroups dataset

Reuters-21578 newswire stories dataset

Architecture of Content-based Indexing and Semantic Search Engine

[Figure: architecture diagram. Labeled components: Document collection, Document Indexer, Inverted Document Index, LSA Term Matrix, Configurer, User query, Query Modifier, WordNet, Searcher, Highlighter, Output, Domain Library, Inverted DL Index, DL Term Locator.]

Evaluation and Application (Why?)

Enhanced search illustrating wildcard pattern and synonym expansion
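
The enhanced search on this slide combines wildcard patterns with synonym expansion. A minimal Python sketch of that idea follows. It assumes a toy in-memory inverted index and a hand-written synonym table standing in for WordNet; the system in the talk is built on Lucene and the Java WordNet Library, and none of the names below come from it.

```python
import fnmatch
from collections import defaultdict

# Hypothetical synonym table; a real system would consult WordNet.
SYNONYMS = {"strength": {"durability"}, "alloy": {"metal"}}

def build_index(docs):
    """Map each lowercase term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def expand_query(term, index):
    """Expand one query term via wildcard matching, then via synonyms."""
    term = term.lower()
    # Wildcards (* and ?) are matched against the indexed vocabulary.
    matched = set(fnmatch.filter(index.keys(), term))
    matched.add(term)
    for t in list(matched):
        matched |= SYNONYMS.get(t, set())
    return matched

def search(query, index):
    """Return ids of documents containing any expansion of any query term."""
    hits = set()
    for term in query.split():
        for t in expand_query(term, index):
            hits |= index.get(t, set())
    return hits

docs = {
    1: "tensile strength of the alloy",
    2: "durability testing of metal sheets",
    3: "yield stress tables",
}
index = build_index(docs)
print(sorted(search("strength", index)))  # synonym expansion also finds doc 2
print(sorted(search("tens*", index)))     # wildcard matches "tensile"
```

Expanding the query rather than the index keeps the index small, at the cost of more lookups per query; whether the original system made the same trade-off is not stated in the slides.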

Matching DL Items; DL Term and its location in the document
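
Locating domain library (DL) terms in a document can be sketched as phrase matching that records character offsets. The Python below is only an illustration under assumed rules (case-insensitive, whole-word matching, longest phrase wins); the vocabulary entries and the function name `locate_terms` are hypothetical and do not describe the actual DL Term Locator.

```python
import re

def locate_terms(text, vocabulary):
    """Return (term, start, end) for each vocabulary phrase found in text.

    Longer phrases are tried first so that "tensile strength" takes
    precedence over the shorter term "strength" at the same location.
    """
    hits = []
    taken = []  # character spans already claimed by a longer phrase
    for term in sorted(vocabulary, key=len, reverse=True):
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            if any(m.start() < e and s < m.end() for s, e in taken):
                continue  # overlaps a longer match found earlier
            taken.append((m.start(), m.end()))
            hits.append((term, m.start(), m.end()))
    return sorted(hits, key=lambda h: h[1])

# Hypothetical domain library entries, for illustration only.
domain_library = ["tensile strength", "yield strength", "strength", "thickness"]
doc = "Minimum Tensile Strength depends on thickness; strength values are in ksi."
for term, start, end in locate_terms(doc, domain_library):
    print(term, start, end, repr(doc[start:end]))
```

The recorded offsets are what a highlighter would need to mark each matched term in the displayed document.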

Example illustrating skippable group

LSI and Clustering

Exploring relationship between the number of significant eigenvalues and the number of document clusters

20-Mini-Newsgroup dataset: 2000 postings, 20 groups

Reuters-21578 Newswire Stories dataset: used 2000 stories at a time, 70 topics
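
To make the experiment concrete, here is a small Python sketch that counts "significant" eigenvalues for a toy term-by-document matrix with two obvious topic groups. The singular values of a matrix A are the square roots of the eigenvalues of AᵀA, so the sketch works on AᵀA directly. The matrix, the Jacobi eigenvalue routine, and the 10% significance cutoff are all assumptions made for illustration; the talk used JAMA and does not state its cutoff.

```python
import math

def gram(a):
    """Return A^T A for a term-by-document matrix given as a list of term rows."""
    n = len(a[0])
    return [[sum(row[i] * row[j] for row in a) for j in range(n)]
            for i in range(n)]

def jacobi_eigenvalues(sym, sweeps=50, tol=1e-12):
    """Eigenvalues of a symmetric matrix via cyclic Jacobi rotations."""
    a = [row[:] for row in sym]
    n = len(a)
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                # Rotation angle that zeroes a[p][q], kept within 45 degrees.
                tau = (a[q][q] - a[p][p]) / (2.0 * a[p][q])
                t = math.copysign(1.0, tau) / (abs(tau) + math.sqrt(1.0 + tau * tau))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = t * c
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p] = c * akp - s * akq
                    a[k][q] = s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k] = c * apk - s * aqk
                    a[q][k] = s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)

def count_significant(eigenvalues, fraction=0.1):
    """Count eigenvalues at least `fraction` of the largest (assumed cutoff)."""
    cutoff = fraction * eigenvalues[0]
    return sum(1 for v in eigenvalues if v >= cutoff)

# Toy data: documents 0-2 use terms 0-2 only, documents 3-5 use terms 3-5 only.
term_doc = [
    [2, 1, 2, 0, 0, 0],
    [1, 2, 1, 0, 0, 0],
    [1, 1, 2, 0, 0, 0],
    [0, 0, 0, 2, 1, 1],
    [0, 0, 0, 1, 2, 2],
    [0, 0, 0, 1, 1, 2],
]
eigenvalues = jacobi_eigenvalues(gram(term_doc))
print([round(v, 2) for v in eigenvalues])
print("significant eigenvalues:", count_significant(eigenvalues))
```

On this toy matrix the count matches the number of document groups; whether that relationship holds on real benchmarks is exactly what the experiment on this slide investigates.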

Conclusions

Useful assistance for manual content extraction from materials and process specs, given the controlled vocabulary

In future, this framework/infrastructure can be used for experiments with expressive and context-aware search.

Formalizing and Querying Heterogeneous Documents with Tables

Goal (What?)

Background and Motivation (Why?)

Implementation Details (How?)

Evaluation and Applications (Why?)

Conclusions

Goal

Define, embed, and use metadata in semi-structured documents containing tables.

Content-oriented/domain-specific metadata of a human-sensible document:
  Makes explicit the semantics of complex data
  Enables augmentation of an interpretation in a modular fashion

Heterogeneous Document

Background and Motivation

Embedding metadata improves traceability, thereby facilitating:
  Content extraction
  Verification
  Update

Implementation Details (How?)

XML Technology

Document-Centric View: XML is used to annotate documents for use by humans in the realm of document processing and content extraction.

Data-Centric View: XML is used as a text-based format for information exchange/serialization in the context of Web Services.

Basic idea behind our approach

Unify the two views by using XML elements to materialize the abstract syntax and, together with XML attributes and XML element definitions, to formalize the content.

Key advantage: Minimizes the maintenance of additional data structures needed to relate the original document to its formalization.

Two Concrete Implementations

Use the Web Services language Water, which amalgamates XML technology with programming language concepts

Use XML/XSLT infrastructure

Water-based approach

Each annotation reflects the semantics of the text fragment it encloses.

The annotated data can be interpreted by viewing it as a function/procedure call in Water. The correspondence between formal parameter and actual argument is position-based.

The semantics of an annotation is defined separately in Water, as a method definition in a class.

Example Table

Thickness (mm)    Tensile Strength (ksi)    Yield Strength (ksi)
0.50 and under    165                       155
0.50 - 1.00       160                       150
1.00 - 1.50       155                       145

Example of Tagged Table

Thickness (mm) Tensile Strength (ksi) Yield Strength (ksi)

table.<setHeading thickness strength.tensile strength.yield/>

0.50 and under 165 155

table.<addRow 0 0.50 165 155 />

0.50 - 1.00 160 150

table.<addRow 0.50 1.00 160 150 />

1.00 - 1.50 155 145

table.<addRow 1.00 1.50 155 145 /> ...

Example of Processing Code

<defclass table rows=required=vector heading=optional=vector>
  <defmethod setHeading t=required ts=required ys=required>
    <set heading=<vector t ts ys/>/>
  </>
  <defmethod addRow smin smax ts ys>
    <set rows=table.rows.<insert <vector smin smax ts ys/>/>/>
  </>
  <defmethod computeYieldStrength> … </>
  <defmethod computeTensileStrength> … </>
</>
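
For readers unfamiliar with Water, the same structure can be sketched in Python. This is only an analogue, not a translation of the author's code: the bodies of `computeYieldStrength` and `computeTensileStrength` are elided on the slide, so the lookup below is an assumption, using the half-open interval convention (smin <= thickness < smax) that the Prolog rendition later in the talk uses.

```python
class Table:
    """Python stand-in for the Water `table` class (illustrative only)."""

    def __init__(self):
        self.heading = None
        self.rows = []

    def set_heading(self, t, ts, ys):
        self.heading = (t, ts, ys)

    def add_row(self, smin, smax, ts, ys):
        # Arguments are bound by position, mirroring the Water call
        # table.<addRow 0 0.50 165 155 />.
        self.rows.append((smin, smax, ts, ys))

    def _lookup(self, thickness):
        # Assumed interval convention: smin <= thickness < smax.
        for smin, smax, ts, ys in self.rows:
            if smin <= thickness < smax:
                return ts, ys
        raise ValueError(f"no row covers thickness {thickness}")

    def compute_tensile_strength(self, thickness):
        return self._lookup(thickness)[0]

    def compute_yield_strength(self, thickness):
        return self._lookup(thickness)[1]

# Replaying the annotations from the tagged table as method calls.
table = Table()
table.set_heading("thickness", "strength.tensile", "strength.yield")
table.add_row(0, 0.50, 165, 155)
table.add_row(0.50, 1.00, 160, 150)
table.add_row(1.00, 1.50, 155, 145)
print(table.compute_yield_strength(0.6))
```

The point of the design is that interpreting the annotated document amounts to executing its annotations, so no separate parse of the table is needed.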

XML/XSLT-based approach

Each annotation reflects the semantics of the text fragment it encloses.

To make the annotated data XML compliant, dummy attributes such as one, two, three, and so on are introduced. The correspondence between formal attribute and actual value is name-based.

The semantics is defined separately, by interpreting XML elements and their XML attributes via XSLT.

Example of Tagged Table

<table type="Tensile">
  <dependency name="Yield Offset" value="0.2%"/>
  <tableSchema one="Thickness(min)" two="Thickness(max)"
               three="Tensile Strength" four="Yield Strength"/>
  <tableUnits one="in" two="in" three="ksi" four="ksi" />
  <tableData one="0" two="0.50" three="165" four="155" />
  <tableData one="0.50" two="1.00" three="160" four="150" />
  ...
</table>
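
The name-based correspondence can be illustrated by reading such a tagged table and resolving the dummy attribute names through the schema element. The talk does this with XSLT; the Python sketch below uses the standard `xml.etree.ElementTree` module instead, includes only the two data rows shown on the slide, and assumes the same half-open thickness intervals as the Prolog rendition. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

TAGGED_TABLE = """
<table type="Tensile">
  <dependency name="Yield Offset" value="0.2%"/>
  <tableSchema one="Thickness(min)" two="Thickness(max)"
               three="Tensile Strength" four="Yield Strength"/>
  <tableUnits one="in" two="in" three="ksi" four="ksi"/>
  <tableData one="0" two="0.50" three="165" four="155"/>
  <tableData one="0.50" two="1.00" three="160" four="150"/>
</table>
"""

def load_rows(xml_text):
    """Return one dict per tableData row, keyed by the schema's column names."""
    root = ET.fromstring(xml_text)
    # Maps dummy attribute names to column names, e.g. "one" -> "Thickness(min)".
    schema = root.find("tableSchema").attrib
    return [
        {schema[slot]: float(value) for slot, value in row.attrib.items()}
        for row in root.findall("tableData")
    ]

def yield_strength(rows, thickness):
    """Look up yield strength; assumes min <= thickness < max per row."""
    for row in rows:
        if row["Thickness(min)"] <= thickness < row["Thickness(max)"]:
            return row["Yield Strength"]
    return None  # thickness not covered by any row

rows = load_rows(TAGGED_TABLE)
print(yield_strength(rows, 0.6))
```

Because values are bound by attribute name, the attributes of a `tableData` element can appear in any order without changing the result.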

XSLT Stylesheets can be used to:

Query: to perform table look-ups.

Transform: to change units of measure, such as from standard SI units to FPS units and vice versa.

Format: to display the table in HTML form.

Extract: to recover the original table.

Verify: to check static semantic constraints on table data values.
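
Two of these uses, Transform and Verify, can be sketched in a few lines of Python operating on the example strength table. The talk implements them as XSLT stylesheets, so this is only an analogue. The specific constraints checked here (non-empty ranges, contiguous ranges, yield not exceeding tensile strength) are assumptions chosen for illustration; the slides do not list the actual constraints.

```python
KSI_TO_MPA = 6.894757  # 1 ksi = 6.894757 MPa

ROWS = [  # (thickness_min, thickness_max, tensile_ksi, yield_ksi)
    (0.0, 0.50, 165.0, 155.0),
    (0.50, 1.00, 160.0, 150.0),
    (1.00, 1.50, 155.0, 145.0),
]

def verify(rows):
    """Return a list of constraint violations (empty if none are found)."""
    problems = []
    for i, (lo, hi, tensile, yld) in enumerate(rows):
        if not lo < hi:
            problems.append(f"row {i}: empty thickness range {lo}-{hi}")
        if yld > tensile:
            problems.append(f"row {i}: yield strength exceeds tensile strength")
    for i in range(1, len(rows)):
        if rows[i][0] != rows[i - 1][1]:
            problems.append(f"rows {i - 1}/{i}: gap or overlap in thickness ranges")
    return problems

def to_mpa(rows):
    """Convert both strength columns from ksi to MPa, rounded to 0.1 MPa."""
    return [(lo, hi, round(t * KSI_TO_MPA, 1), round(y * KSI_TO_MPA, 1))
            for lo, hi, t, y in rows]

print(verify(ROWS))
for row in to_mpa(ROWS):
    print(row)
```

Keeping the checks separate from the data means the same verification can be reused across documents that contain tables of the same type.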

Evaluation and Application (Why?)

Advantage

Only tabular data in each document is annotated. The annotation definition is factored out as background knowledge. Thus, the semantics of each table type is specified just once outside the document and is reused with different documents containing similar tables.

Disadvantage

Both avenues require mature tool support for widespread adoption.

For example, develop an MS FrontPage-like interface in which the master document is the annotated form and the user explicitly interacts with and edits only a view of the annotated document, for readability, with support for export as XML to generate a well-formed XML document.

Prolog rendition

strengthTableRow(0,    0.50, 165, 155).
strengthTableRow(0.50, 1.00, 160, 150).
strengthTableRow(1.00, 1.50, 155, 145).
...

% A row applies when L =< Thickness < U.
strengthTable(Thickness, TensileStrength, YieldStrength) :-
    strengthTableRow(L, U, TensileStrength, YieldStrength),
    L =< Thickness, U > Thickness.

thicknessToTensileStrength(Thickness, TensileStrength) :-
    strengthTable(Thickness, TensileStrength, _).

thicknessToYieldStrength(Thickness, YieldStrength) :-
    strengthTable(Thickness, _, YieldStrength).

% With the three rows listed above, this query binds YS = 150.
?- thicknessToYieldStrength(0.6, YS).

Conclusion and Future Work

Develop a catalog of predefined tables, specifying them using Semantic Web formalisms (such as RDF, OWL, etc.) and mapping the tabular data into a set of predefined tables, possibly qualified.

Develop techniques for manual mapping of complex tables into simpler ones:
  To provide semantics to data.
  To improve traceability.
  To facilitate automatic manipulation.

Tailor and improve IE and IR techniques developed for text processing so that they apply to Semantic Web documents (in XML, RDF, etc.), benefiting from the additional support of ontologies (in OWL, etc.).

Our Related Publications

K. Thirunarayan, A. Berkovich, and D. Sokol, An Information Extraction Approach to Reorganizing and Summarizing Specifications, In: Information and Software Technology Journal, Vol. 47, Issue 4, pp. 215-232, 2005.

K. Thirunarayan, On Embedding Machine-Processable Semantics into Documents, In: IEEE Transactions on Knowledge and Data Engineering, Vol. 17, No. 7, pp. 1014-1018, July 2005.

Holy Grail

Ultimately develop principles, techniques, and tools to author and extract the human-readable and machine-comprehensible parts of a document hand in hand, and to keep them side by side.

