Date post: | 20-Dec-2015 |
Information Extraction and Ontology Learning Guided by Web Directory
Authors: Martin Kavalec, Vojtěch Svátek
Presenter: Mark Vickers
Outline
Introduction
– Mining Indicator Terms
– Integrating Rainbow
– Ontological Analysis of Web Directories
– IE and Ontology Learning
Future Work
Related Work
Assessment
Introduction
Goal:
“…to extract information about (mostly generic) products, services and areas of competence of companies, from the free text chunks embedded in web presentations.”
Taking advantage of:
– Collections of extraction patterns
– Ontologies of problem domains
Approach: Combine Information Extraction with Ontologies
– Ontologies can improve quality of IE
– Extracted information can improve/extend ontologies
– Bootstrapping
Introduction
Uses Open Directory (http://dmoz.org)
– Obtain labeled training data
– Lightweight ontologies
“The Open Directory Project is the largest, most comprehensive human-edited directory of the Web.”
Mining Indicator Terms
Informative terms = generic names of products
Indicator terms = situated near informative terms
– Example: ‘our assortment includes…’
‘in our shop you can buy…’
Assumption: Directory headings coincide with informatives
Purpose: Generate extraction patterns based on indicator terms
They use deeper linguistic techniques
Mining Indicator Terms
Example:
…/Manufacturing/Materials/Metals/Steel/…
Informative terms
Match headings with text pages to find sentences containing informative terms
Grab nearby words as indicator terms
Generate extraction patterns from indicator terms
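The matching step can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: it uses a fixed token window instead of a parser, and the function names, the sample page text, and the `window` parameter are all assumptions made for the example.

```python
import re

def informative_terms(directory_path):
    """Split a directory path into lower-cased heading terms."""
    return [part.lower() for part in directory_path.strip("/").split("/") if part]

def candidate_indicators(text, terms, window=3):
    """Collect words lying within `window` tokens of any informative term."""
    found = []
    # Naive sentence split on whitespace that follows ., ! or ?
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        tokens = re.findall(r"[A-Za-z]+", sentence.lower())
        for i, token in enumerate(tokens):
            if token in terms:
                lo, hi = max(0, i - window), i + window + 1
                found.extend(t for t in tokens[lo:hi] if t not in terms)
    return found

# Hypothetical heading path and page text, for illustration only.
terms = informative_terms("/Manufacturing/Materials/Metals/Steel/")
page = "Our assortment includes steel and other metals. Contact us today."
nearby = candidate_indicators(page, terms)
```

Here `nearby` holds words from the first sentence only, because the second sentence contains no heading term.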
Mining Indicator Terms
Choosing Indicator Terms
– Syntactical analysis: Link Grammar Parser
– Chose verbs occurring closest in parse tree to informative word
– Arrange verbs into a frequency table
– Order by ratio of frequency near informative term to frequency in general
– Chose 8 most promising verbs
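The ranking step above might look like the sketch below. It assumes the verbs have already been extracted (the Link Grammar parsing is not shown), and the function name, its arguments, and the toy verb lists are hypothetical.

```python
from collections import Counter

def rank_indicator_verbs(verbs_near_informative, verbs_overall, top_n=8):
    """Order verbs by (frequency near informative terms) / (frequency overall)."""
    near = Counter(verbs_near_informative)
    overall = Counter(verbs_overall)
    ratios = {
        verb: count / overall[verb]
        for verb, count in near.items()
        if overall[verb] > 0  # skip verbs missing from the general sample
    }
    # Highest ratio first; ties broken alphabetically for a stable order.
    return sorted(ratios, key=lambda v: (-ratios[v], v))[:top_n]

# Toy data: "be" is frequent everywhere, so its ratio is low.
near = ["include", "include", "buy", "be"]
overall = ["include", "include", "buy", "buy", "be", "be", "be", "be", "be", "be"]
ranked = rank_indicator_verbs(near, overall, top_n=2)
```

Dividing by the overall frequency is what pushes very common verbs such as "be" down the list even when they often appear near informative terms.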
Mining Indicator Terms
Preliminary Testing
– Sampled 14,500 sentences containing heading terms
– Randomly chose 130 sentences with indicators
– Manually labeled to estimate if informative term was present or not
Example:
“We are equipped to run any grade of corrugated from E-flute to Triplewall, including all government grades.”
Mining Indicator Terms
Preliminary Test Results
Coverage
Non-Filtered: 10–20%
Pre-Filtered: 70–80%
Integration into Rainbow
RAINBOW (Reusable Architecture for INtelligent Brokering Of Web information access)
– Web Analysis Tasks:
Sentence Extraction
Explicit Metadata
HTML Structure*
Inline Image*
Link Topology Structure*
Page Similarity
– Internal Communication: based on SOAP
– Will use ontologies for verifying semantic consistency of web services provided within the distributed system
Integration into Rainbow
Rainbow will help solve “coverage” problem of directory links pointing to ‘barren’ pages
– Using Analysis of:
Keywords and HTML Structure on start-up pages
URLs of embedded links
– Metadata Extractor will be navigated towards promising pages.
– Looking for ‘about-us’ or ‘profile’ to find more syntactically correct text, for example.
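One way to picture the URL-based navigation is the sketch below. It is not Rainbow's actual code: the hint list only contains the two examples named on the slide, and the function names and sample URLs are invented for illustration.

```python
from urllib.parse import urlparse

# Assumed hint list, taken from the two examples mentioned above.
PROMISING_HINTS = ("about-us", "profile")

def is_promising(url):
    """True if the URL path contains one of the hint strings."""
    path = urlparse(url).path.lower()
    return any(hint in path for hint in PROMISING_HINTS)

def promising_links(urls):
    """Keep only the embedded links worth passing to the extractor."""
    return [u for u in urls if is_promising(u)]

# Hypothetical links found on a company start-up page.
links = [
    "http://example.com/index.html",
    "http://example.com/about-us.html",
    "http://example.com/company/profile",
    "http://example.com/gallery.html",
]
selected = promising_links(links)
```

A real system would presumably combine this with the keyword and HTML-structure analysis listed above; the sketch covers only the URL check.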
Ontological Analysis of Web Directories
Terms and Phrases in single heading belong to a small set of classes
Parent-child relations belong to particular classes corresponding to ‘deep’ ontological relations.
-Industries
- Construction_and_Maintenance
- Materials_and_supplies
- Masonry_and_Stone
- Natural_Stone
- International_Sources
- Mexico
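As a rough sketch of the raw material for this analysis, the path above can be broken into heading terms and parent-child pairs. The helper names are hypothetical, splitting on `_and_` is an assumption about how multi-term headings are written, and assigning classes to the terms and relations is left out entirely.

```python
def heading_terms(heading):
    """Split a heading such as 'Masonry_and_Stone' into its separate terms."""
    return [term.replace("_", " ") for term in heading.split("_and_")]

def parent_child_pairs(path):
    """Pair each heading with the heading directly below it."""
    return list(zip(path, path[1:]))

path = [
    "Industries",
    "Construction_and_Maintenance",
    "Materials_and_supplies",
    "Masonry_and_Stone",
    "Natural_Stone",
    "International_Sources",
    "Mexico",
]
pairs = parent_child_pairs(path)
stone_terms = heading_terms("Masonry_and_Stone")
```

Each pair, such as `("International_Sources", "Mexico")`, is a candidate for one of the ‘deep’ relation classes mentioned above.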
Ontological Analysis of Web Directories
Meta-ontology of directory headings
Class
Named Relations
Class-subclass Relations
Reflexive Binary Relations
Ontological Analysis of Web Directories
Interpretation Rules
IE and Ontology Learning
Extracting with plain indicator terms with simple heuristics works
But Even Better:
– Learn indicators for each class
– Use ontology analysis to classify indicators found
– Fill in database templates: true IE
IE and Ontology Learning
Classify Headings
Learn class-specific indicators
Human Classifies Directory Headings
(WordNet)
Closed Loop Strategy:
Future Work
Complete the information extraction & ontology learning loop.
With relation to Semantic Web, they want to adapt technique to the standards of usual explicit metadata
– Example: The information extracted can be converted to RDF triples, with indicator collections accessible over the web
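To make the RDF example concrete, an extracted fact could be serialized as an N-Triples line along these lines. The URIs, the `offers` predicate, and the function name are invented for illustration; the slides do not specify a vocabulary, and the escaping here covers only backslashes and double quotes.

```python
def to_ntriple(subject_uri, predicate_uri, literal):
    """Format one (subject, predicate, literal) fact as an N-Triples line."""
    escaped = literal.replace("\\", "\\\\").replace('"', '\\"')
    return f'<{subject_uri}> <{predicate_uri}> "{escaped}" .'

# Hypothetical company and vocabulary URIs.
triple = to_ntriple(
    "http://example.com/company/acme",
    "http://example.com/vocab#offers",
    "steel",
)
```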
Related Work
Combining IE and Ontologies (without use of web directories)
– Bootstrapping an Ontology-Based Information Extraction System
Advantages of using Link Grammar Parser
– Learning to Generate Semantic Annotation for Domain Specific Sentences
Using Yahoo to classify whole documents
– Turning Yahoo into an Automatic Web-Page Classifier
Similar work aimed at more structured information using search engines
– Extracting Patterns and Relations from the World Wide Web
Bootstrapping and other statistical methods for IE
– Text Classification by Bootstrapping with Keywords
– Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping
Assessment
I don’t think indicator term learning is done (even though they say it is)
Counts on not yet decided Ontology learning techniques
Need to develop an official directory