Page 1

Human Language Technologies for the Semantic Web

Department of Computer Science, University of Sheffield

Fabio Ciravegna and Yorick Wilks

Page 2

F. Ciravegna, AKT Town Meeting, April 2003

Language Technologies

• Goal
  – Building systems able to process natural language in its written or spoken form
• Methodology
  – Use of language analysis
• Technologies (examples):
  • Information Extraction from Text
  • Question Answering
  • Text Generation

Page 3

HLT for Kn. Management

• Use of HLT for Knowledge
  – Acquisition
  – Retrieval
  – Publication
• Main benefits
  – Cost reduction
  – Time needed for KM
  – Improving knowledge accessibility
    • Accessing/Diffusing/Understanding

Page 4

HLT in AKT for KM

acquisition retrieval publishing

Text mining

Information Extraction from Text

Text Generation

Page 5

HLT for Semantic Web

• Use of HLT for:
  – Document annotation
  – Information integration from different sources
• Benefit
  – Reduce annotation needs
  – Retrieve and integrate dispersed information

Page 6

Information Extraction

• Textual documents are pervasive (e.g. Web)
  – Contained knowledge cannot be queried, therefore cannot be
    • Used by automatic systems
    • Easily managed by humans
• IE can identify information in documents
  – e.g. to populate a database
  – e.g. to annotate documents
• Method: natural language analysis
  Words → Information → Knowledge

Page 7

IE tasks

• Named Entities
• Template Elements
• Template Relations
• Scenario Template

WASHINGTON, D.C. (October 5, 1999) - nQuest Inc. today announced that Paul Jacobs, former Vice-President of E-Commerce at SRA International, has joined the company's executive management team as president.

nQuest Inc.
Paul Jacobs
SRA International

Company: nQuest Inc.
Date: today
InPerson: Paul Jacobs
InRole: president

Company: SRA International
OutPerson: Paul Jacobs
OutRole: Vice-President of E-Commerce
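
The filled templates above can be pictured as simple records. The following is a minimal Python sketch, assuming a hypothetical class `SuccessionEvent` whose field names mirror the slot names on the slide; it is not the output format of any particular IE system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SuccessionEvent:
    """Hypothetical record for one filled template from the slide."""
    company: str
    date: Optional[str] = None
    in_person: Optional[str] = None
    in_role: Optional[str] = None
    out_person: Optional[str] = None
    out_role: Optional[str] = None


# Slot values copied from the slide's example.
joining = SuccessionEvent(
    company="nQuest Inc.",
    date="today",
    in_person="Paul Jacobs",
    in_role="president",
)
leaving = SuccessionEvent(
    company="SRA International",
    out_person="Paul Jacobs",
    out_role="Vice-President of E-Commerce",
)


def same_person_moves(a: SuccessionEvent, b: SuccessionEvent) -> bool:
    """True when the person joining in record a is the person leaving in record b."""
    return a.in_person is not None and a.in_person == b.out_person


print(same_person_moves(joining, leaving))
```

Linking the two records through the shared person is one way to picture how separate template fills relate to a single event.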

Page 8

IE Tools @ Sheffield

• GATE:
  – General Architecture for Text Engineering
  – Used to integrate HLT modules
• Annie:
  – Rule-based Named Entity Recogniser
  – Download at www.gate.ac.uk
• Amilcare:
  – Adaptive IE system
  – Portable using examples
  – www.nlp.shef.ac.uk/amilcare

Page 9

IE Tools @ Sheffield (2)

• Melita:
  – Annotation tool
  – Supported by adaptive IE (Amilcare)
  – Learns how to annotate
  – www.aktors.org/technologies/melita/
• Lasie
  – IE system for complex event extraction
  – Manual rule development
  – www.dcs.shef.ac.uk/research/groups/nlp/funded/lasie.html

Page 10

GATE is…

• An architecture
  – A macro-level organisational picture for LE software systems.
• A framework
  – For programmers, GATE is an object-oriented class library that implements the architecture.
• A development environment
  – For language engineers, computational linguists et al., GATE is a graphical development environment bundled with a set of tools for doing e.g. Information Extraction.
• Free software (LGPL). Mature, robust software (in development since 1995).
• Comes with…
  – Some free components...
  – ...and wrappers for other people's components
  – Tools for: evaluation; visualise/edit; persistence; IR; IE; dialogue; ontologies; etc.

Page 11

Some users…

At time of writing a representative fraction of GATE users includes:
• Longman Pearson publishing, UK;
• BT Exact Technologies, UK;
• Merck KgAa, Germany;
• Canon Europe, UK;
• Knight Ridder (the second biggest US news publisher);
• BBN Technologies, US;
• Sirma AI Ltd., Bulgaria;
• Resco AB, Sweden/Finland/Germany;
• Glaxo Smith Kline Plc: drug-based navigation of Medline abstracts;
• Master Foods NV: extraction of commodities events from news;
• the American National Corpus project, US;
• Imperial College, London, the University of Manchester, Queen Mary College, UMIST, the University of Karlsruhe, Vassar College, ISI / the University of Southern California and a large number of other UK, US and EU Universities;
• the Perseus Digital Library project, Tufts University, US.

Page 12

GATE and Content Extraction

• ANNIE: open-source IE system in GATE, providing modules needed for content extraction
  – Pre-processing
  – Named entity recognition
  – Coreference resolution
    • ANNIE handles proper names, pronouns, and nominals
• Easy-to-use pattern-action rule language to enable customisation and postprocessing of the IE results
• Contact Hamish Cunningham ([email protected])
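
To give a feel for what a pattern-action rule does, here is a minimal Python sketch. It is not the GATE rule language and does not reproduce its syntax: the patterns are ordinary regular expressions, the "action" is simply attaching a label, and the names `RULES`, `Annotation` and `apply_rules` are hypothetical.

```python
import re
from typing import List, NamedTuple


class Annotation(NamedTuple):
    start: int
    end: int
    label: str


# Each rule pairs a pattern with the label its action attaches.
RULES = [
    # Pattern: a title followed by a capitalised word. Action: tag as Person.
    (re.compile(r"\b(?:Dr|Prof|Mr|Ms)\. [A-Z][a-z]+"), "Person"),
    # Pattern: capitalised words ending in a company suffix. Action: tag as Organization.
    (re.compile(r"\b(?:[A-Z][A-Za-z]+ )+(?:Inc|Ltd)\."), "Organization"),
]


def apply_rules(text: str) -> List[Annotation]:
    """Run every rule over the text and collect annotations in text order."""
    found = []
    for pattern, label in RULES:
        for match in pattern.finditer(text):
            found.append(Annotation(match.start(), match.end(), label))
    return sorted(found)


text = "Dr. Jacobs joined Quest Labs Inc. today."
for ann in apply_rules(text):
    print(ann.label, repr(text[ann.start:ann.end]))
```

Customising the extractor then amounts to adding or editing entries in the rule list, which is the kind of post-processing the slide alludes to.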

Page 13

Amilcare Active annotation for the Semantic Web

• Tool for adaptive IE from Web-related texts
  – Specifically designed for document annotation
  – Trains with a limited amount of examples
  – Effective on different text types
    • From free texts to rigid docs (XML, HTML, etc.)
  – Tools for:
    • Normal user
      – Able to annotate a corpus
    • Amilcare Expert
      – Able to optimise experiments
    • IE Expert
      – Able to edit rules
  – Uses Annie for preprocessing up to Named Entity Recognition

[Ciravegna – IJCAI 2001]

Page 14

Implementation details

• 100% Java
• External interfaces:
  – API for use from other programs
  – GUI for manual training
• Requirements:
  – 10 MB on hard disk
  – Up to 300 MB RAM
• Contact Fabio Ciravegna ([email protected])

Page 15

Users

• Integrated with SW annotation tools:
  – MnM (Open Univ.)
  – Ontomat (Karlsruhe Univ.)
  – Melita (Sheffield Univ.)
• Users:
  – Merck (D)
  – ISOCO (SP)
  – Quinary (I)
  – Ontoprise (D)
  – University College Dublin (IE)
  – 2 departments of CNRS (F)
  – University of Trier (D)
  – University of Texas (Austin, USA)

Page 16

Document Annotation

• Many application areas require document annotation (enrichment)
  – Knowledge Management
    • Protocol analysis in industry (Kingston 94)
    • Italian police: 100 annotators/6 pages a day each
  – Semantic Web (Staab00, Motta02, Ciravegna02)
• Annotation is generally manual
  – Expensive
  – Inefficient
  – Difficult
  – Tedious & tiring
    • Error prone (15-30% inter-annotator disagreement)
  – Never ending
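
The inter-annotator disagreement figure quoted above can be made concrete with a small calculation. This is a sketch under an assumed definition (the share of items on which two annotators assign different labels); the slide does not say how the 15-30% was measured, and the labels below are invented for illustration.

```python
from typing import Sequence


def disagreement_rate(a: Sequence[str], b: Sequence[str]) -> float:
    """Fraction of items on which two annotators assigned different labels."""
    if len(a) != len(b):
        raise ValueError("both annotators must label the same items")
    if not a:
        raise ValueError("no items to compare")
    differing = sum(1 for x, y in zip(a, b) if x != y)
    return differing / len(a)


# Invented labels for ten tokens; "O" marks a token with no annotation.
annotator_1 = ["speaker", "O", "O", "location", "O", "stime", "O", "O", "etime", "O"]
annotator_2 = ["speaker", "O", "O", "O", "O", "stime", "O", "location", "etime", "O"]

print(f"{disagreement_rate(annotator_1, annotator_2):.0%}")
```

Here the two annotators differ on two of ten tokens, a rate that falls inside the range the slide reports.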

Page 17

Melita

• Document annotation tool
  – Uses adaptive IE engine to support annotation
• IE system:
  – Trains while users annotate
  – Provides preliminary annotation for new documents
• Advantages
  – Annotates trivial or previously seen cases
  – Focuses slow/expensive user activity on unseen cases
  – Validating extracted information
    • Simpler & less error prone
    • Speeds up corpus annotation
  – Learns how to improve capabilities

Page 18

Annotation with IE

[Diagram: the user annotates bare text and the IE system trains on the annotated corpus. The system then annotates further bare text, its annotation is compared with the user's, and it retrains using errors, missing tags and mistakes.]

Page 19

Annotation with Suggestions

[Diagram: the IE system annotates bare text, the user corrects the result, and the system uses the corrections to retrain.]
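
The suggest, correct and retrain cycle can be sketched in a few lines of Python. The learner below is a deliberately trivial stand-in that only memorises tagged strings; it is not Amilcare's learning algorithm, and the class and function names (`MemorisingAnnotator`, `annotate_with_suggestions`) and the sample documents are invented for illustration.

```python
from typing import Dict, List, Tuple

Annotation = Tuple[str, str]  # (text span, tag)


class MemorisingAnnotator:
    """Toy stand-in for an adaptive IE engine: it only remembers tagged strings."""

    def __init__(self) -> None:
        self.known: Dict[str, str] = {}

    def train(self, annotations: List[Annotation]) -> None:
        for span, tag in annotations:
            self.known[span] = tag

    def suggest(self, text: str) -> List[Annotation]:
        return [(span, tag) for span, tag in self.known.items() if span in text]


def annotate_with_suggestions(
    engine: MemorisingAnnotator,
    documents: List[str],
    user_annotations: List[List[Annotation]],
) -> List[int]:
    """Return how many annotations the user had to add by hand, per document."""
    manual_counts = []
    for text, gold in zip(documents, user_annotations):
        suggested = engine.suggest(text)
        # The user keeps correct suggestions and supplies whatever is missing.
        missing = [ann for ann in gold if ann not in suggested]
        manual_counts.append(len(missing))
        # The corrected document is then used to retrain the engine.
        engine.train(gold)
    return manual_counts


documents = [
    "Seminar by Dr. Smith in Room 101",
    "Dr. Smith speaks again in Room 202",
    "Talk in Room 101 by Dr. Smith",
]
user_annotations = [
    [("Dr. Smith", "speaker"), ("Room 101", "location")],
    [("Dr. Smith", "speaker"), ("Room 202", "location")],
    [("Dr. Smith", "speaker"), ("Room 101", "location")],
]

counts = annotate_with_suggestions(MemorisingAnnotator(), documents, user_annotations)
print(counts)
```

The manual effort per document shrinks as previously seen cases are handled by the engine, which is the benefit the slides claim for this style of annotation.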

Page 20

Cooperation: is IE a Useful Support?

CMU Seminars task. Test: 250 texts (Amilcare reports the best IE results ever)

[Figure: four panels (Location, Speaker, Stime, Etime), each plotting Precision, Recall and F-measure on a 0-100 scale against the number of training examples (0-140).]
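
For readers unfamiliar with the three measures plotted, the following Python sketch computes them from a set of predicted and a set of correct annotations. It assumes exact-match scoring and the balanced F-measure (the harmonic mean of precision and recall); the slides do not state which variant was used, and the spans below are invented.

```python
from typing import Set, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, tag)


def precision_recall_f(predicted: Set[Span], gold: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match precision, recall and balanced F-measure."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Invented example: four correct annotations, three predictions, two of them right.
gold = {(0, 9, "speaker"), (20, 28, "location"), (30, 34, "stime"), (38, 42, "etime")}
predicted = {(0, 9, "speaker"), (20, 28, "location"), (30, 36, "stime")}

p, r, f = precision_recall_f(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f} f-measure={f:.2f}")
```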

Page 21

Integrating Information

• Information is available over the Web
  – Dispersed
  – In textual format
• IE as basis for retrieval and integration of information
  – Unsupervised learning using
    • The redundancy of the Web
    • Available repositories
      – Collections of documents/data
      – Known services (e.g. databases, digital libraries, search engines)
    to bootstrap learning and produce simple high precision IE applications

Page 22

Mining Web Sites

• Extracting knowledge from CS Web sites
  Person:
  – Name
  – Position
  – Email/Telephone
  – Involvement in projects
  – Publications
  – Co-workers
• Information distributed
• Challenges
  – Retrieving information
  – Integrating information
  – Largely unsupervised by user

Page 23

Mining Web sites

People and Project names

HomePage Search

Basket: Project/People name lists and hyperlinks

• Annotates known names
• Trains on annotations to discover the HTML structure of the page
• Recovers all names and hyperlinks
• Mines the site looking for Project and People names
• Uses
  – Generic patterns
  – Annie
  – Citeseer for likely bigrams
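
The idea of annotating known names, learning the page structure around them and then recovering every name can be sketched as follows. This is a deliberately naive illustration that uses a fixed-width character context as the "structure"; the slides do not describe the actual learning method, the function names `learn_delimiters` and `extract_all` are hypothetical, and the page is made up. Only names are recovered here, not hyperlinks.

```python
import re
from typing import List, Optional, Tuple


def learn_delimiters(html: str, known: List[str], width: int = 2) -> Optional[Tuple[str, str]]:
    """Find the (left, right) character context shared by every known name."""
    contexts = set()
    for name in known:
        pos = html.find(name)
        if pos < width:
            continue  # name absent, or too close to the start to have context
        left = html[pos - width:pos]
        right = html[pos + len(name):pos + len(name) + width]
        contexts.add((left, right))
    # Only trust the context when all known names agree on it.
    return contexts.pop() if len(contexts) == 1 else None


def extract_all(html: str, delimiters: Tuple[str, str]) -> List[str]:
    """Recover every string that sits between the learned delimiters."""
    left, right = delimiters
    pattern = re.escape(left) + r"(.+?)" + re.escape(right)
    return re.findall(pattern, html)


# Made-up staff list page.
page = (
    "<ul>"
    '<li><a href="/people/ann">Ann Lee</a></li>'
    '<li><a href="/people/bob">Bob Ray</a></li>'
    '<li><a href="/people/cy">Cy Moss</a></li>'
    "</ul>"
)

delims = learn_delimiters(page, ["Ann Lee", "Bob Ray"])
names = extract_all(page, delims) if delims else []
print(names)
```

Two known names are enough here to recover a third that was never annotated, which is the bootstrapping effect the slide describes.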

Page 24

Mining Web sites

Projects/People Web pages

HomePage Search

Extracts personal data
• Addresses
• Tel number
• Email address
• …

Basket: Project/People name lists and hyperlinks

People and Projects Basket: Name lists and hyperlinks; Personal data

Page 25

Mining Web sites

People Publications

HomePage Search

People and Projects Basket: Name lists and hyperlinks; Personal data

• Annotates known papers
• Trains on annotations to discover the HTML structure
• Recovers co-authoring information

People and Projects Basket: Name lists and hyperlinks; Personal data; Co-authoring information

Page 26

Paper discovery

Page 27

Focus on people

Page 28

User Role

• Providing:
  – A URL
  – List of services (e.g. Google)
    • Train wrappers using examples
  – Some examples of fillers (e.g. projects)
• If necessary, correcting intermediate results

Page 29

Rationale

• Large collections (e.g. Web) contain redundant information
  – Redundancy can be used to bootstrap learning
• Mining the Web for information
  – Learned patterns
• Integration of information
  – Multiple evidence
    • Different strategies with different reliability
• Scruffy works!
  – User corrections of data if necessary
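
One way to picture "multiple evidence" from "strategies with different reliability" is to combine the reliabilities of every strategy that proposed a candidate. The sketch below uses a noisy-OR combination, which assumes the strategies err independently; this is an illustrative choice, not a method stated in the slides, and the strategy names, reliabilities and candidates are all invented.

```python
from typing import Dict, Iterable


def combined_confidence(reliabilities: Iterable[float]) -> float:
    """Noisy-OR: probability that at least one supporting strategy is right."""
    miss = 1.0
    for r in reliabilities:
        if not 0.0 <= r <= 1.0:
            raise ValueError("reliability must lie in [0, 1]")
        miss *= 1.0 - r
    return 1.0 - miss


# Invented reliabilities for three hypothetical extraction strategies.
STRATEGY_RELIABILITY: Dict[str, float] = {
    "learned_wrapper": 0.9,
    "named_entity_recogniser": 0.7,
    "generic_pattern": 0.5,
}

# Which strategies proposed each candidate project name (made-up data).
evidence = {
    "AKT": ["learned_wrapper", "generic_pattern"],
    "Home": ["generic_pattern"],
}

scores = {
    candidate: combined_confidence(STRATEGY_RELIABILITY[s] for s in sources)
    for candidate, sources in evidence.items()
}
for candidate, score in scores.items():
    print(candidate, round(score, 2))
```

A candidate backed by several strategies ends up with a higher score than one backed by a single weak strategy, so low-scoring items are natural targets for the user corrections mentioned above.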

Page 30

Conclusion

• In AKT we are using HLT (IE) for:
  – Helping in document annotation
  – Integrating information from different sources
• Benefit:
  – Reduce annotation needs
  – Retrieve and integrate dispersed information
    • Minimum user intervention

