Measuring the Quality of Web Content using Factual Information

Funded by the Competence Centers program

www.know-center.at

16 April 2012

WebQuality 2012 workshop at WWW 2012

Elisabeth Lex, Michael Voelske, Marcelo Errecalde, Edgardo Ferretti, Leticia Cagnina, Christopher Horn, Benno Stein and Michael Granitzer

Agenda

Motivation

Approach

Results

Summary and Outlook

Motivation

People's decisions are often based on Web content

Web content often lacks quality control and verification

Inaccurate, incorrect information; no fact checking

Measures needed to capture credibility and quality aspects

With respect to facts!

Approach

Measure information quality based on factual information

3 Approaches:

Use simple statistics about the facts obtained from text

Exploit relational information contained in facts

Use semantic relationships like meronymy and hypernymy

First approach:

Use simple statistical features about facts in a document

Indicates how informative a document is

Derive facts from Web content using Open Information Extraction

Definition of Factual Density

Fact Count

Factual Density
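The formulas on this slide did not survive extraction. The following is a minimal sketch, assuming that fact count is the number of facts an Open Information Extraction system returns for a document and that factual density is that count normalized by document length. The function names, the toy facts, and the use of word count as the length measure are assumptions for illustration, not the authors' exact definitions.

```python
from typing import List, Tuple

# A fact is assumed to be a binary relation (e1, relation, e2).
Fact = Tuple[str, str, str]


def fact_count(facts: List[Fact]) -> int:
    """Number of facts extracted from one document."""
    return len(facts)


def factual_density(facts: List[Fact], text: str) -> float:
    """Fact count divided by document length in words (assumed length measure)."""
    n_words = len(text.split())
    if n_words == 0:
        return 0.0  # avoid division by zero for empty documents
    return fact_count(facts) / n_words


# Toy document and hand-written facts standing in for extractor output.
text = "Graz is the capital of Styria. Styria is located in Austria."
facts = [
    ("Graz", "is the capital of", "Styria"),
    ("Styria", "is located in", "Austria"),
]

print(fact_count(facts))
print(factual_density(facts, text))
```

Normalizing by length is what lets the measure compare documents of different sizes, since a longer document will usually contain more facts in absolute terms.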

Experiments

Wikipedia: 1000 Featured and Good articles versus 1000 Non-Featured (randomly selected)

Featured: a comprehensive coverage of the major facts in the context of the article’s subject

Baseline: Word Count [Blumenstock 2008]

Featured articles longer than non-featured

Bias: longer docs contain more facts

Evaluation: 2 Datasets

Unbalanced: articles differ in length

Balanced: articles similar in length
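The slides do not say how the balanced dataset was built. One plausible way to obtain article sets of similar length, sketched here purely as an assumption, is to pair each featured article with a non-featured article whose word count lies within a relative tolerance. The function name `match_by_length`, the `tolerance` parameter, and the toy word counts are all hypothetical.

```python
from typing import Dict, List, Tuple


def match_by_length(featured: Dict[str, int],
                    non_featured: Dict[str, int],
                    tolerance: float = 0.1) -> List[Tuple[str, str]]:
    """Greedily pair each featured article with the closest unused
    non-featured article whose word count is within `tolerance` (relative)."""
    available = sorted(non_featured.items(), key=lambda kv: kv[1])
    pairs = []
    for title, wc in sorted(featured.items(), key=lambda kv: kv[1]):
        best = None
        for i, (_, other_wc) in enumerate(available):
            diff = abs(other_wc - wc)
            if diff <= tolerance * wc:
                if best is None or diff < abs(available[best][1] - wc):
                    best = i
        if best is not None:
            other_title, _ = available.pop(best)  # each article is used once
            pairs.append((title, other_title))
    return pairs


# Toy word counts (title -> number of words).
featured = {"A": 5000, "B": 1200}
non_featured = {"x": 300, "y": 1250, "z": 4800, "w": 5600}

print(match_by_length(featured, non_featured))
```

Featured articles without a close enough partner are simply dropped, so the resulting sets stay comparable in length at the cost of some data.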

Distributions of docs in both datasets with respect to word count

Precision/Recall curves of Factual Density
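The curves themselves were images and are not part of this text. As a rough sketch of how such a curve can be computed, assuming documents are ranked by factual density and featured/good articles form the positive class, the following returns one (recall, precision) point per rank position. The function name and the toy scores are illustrative assumptions, not the authors' evaluation code.

```python
from typing import List, Tuple


def precision_recall_points(scores: List[float],
                            labels: List[bool]) -> List[Tuple[float, float]]:
    """Rank documents by score (descending) and return (recall, precision)
    after each rank position. labels[i] is True for featured/good articles."""
    total_pos = sum(labels)
    if total_pos == 0:
        return []  # recall is undefined without positive examples
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    points = []
    true_pos = 0
    for k, (_, is_pos) in enumerate(ranked, start=1):
        if is_pos:
            true_pos += 1
        points.append((true_pos / total_pos, true_pos / k))
    return points


# Toy factual density scores and labels (True = featured/good).
scores = [0.09, 0.07, 0.05, 0.02]
labels = [True, False, True, False]

for recall, precision in precision_recall_points(scores, labels):
    print(f"recall={recall:.2f} precision={precision:.2f}")
```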

Results: Factual Density on balanced corpus

Experiments – Relational Features

Approach 2: exploiting relational information contained in facts

Extract relational features from articles

Use relations from ReVerb: binary relations (e1, relation, e2)

Use them to train a classifier to discriminate between featured/good and non-featured
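The slides do not name the feature encoding or the classifier. The sketch below assumes one simple possibility: each article becomes a bag of relation phrases taken from its (e1, relation, e2) triples, and a multinomial Naive Bayes model with add-one smoothing separates the two classes. The class `RelationNaiveBayes`, the function `relation_features`, and the toy triples are hypothetical and chosen only to keep the example self-contained.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (e1, relation, e2)


def relation_features(triples: List[Triple]) -> Counter:
    """Bag of normalized relation phrases for one article."""
    return Counter(rel.lower().strip() for _, rel, _ in triples)


class RelationNaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative choice)."""

    def fit(self, docs: List[Counter], labels: List[str]) -> "RelationNaiveBayes":
        self.classes = sorted(set(labels))
        self.vocab = set()
        for doc in docs:
            self.vocab.update(doc)
        self.prior: Dict[str, float] = {}
        self.counts: Dict[str, Counter] = {}
        self.totals: Dict[str, int] = {}
        for c in self.classes:
            class_docs = [d for d, y in zip(docs, labels) if y == c]
            self.prior[c] = math.log(len(class_docs) / len(docs))
            agg: Counter = Counter()
            for d in class_docs:
                agg.update(d)
            self.counts[c] = agg
            self.totals[c] = sum(agg.values())
        return self

    def predict(self, doc: Counter) -> str:
        v = len(self.vocab)

        def log_score(c: str) -> float:
            s = self.prior[c]
            for rel, n in doc.items():
                if rel in self.vocab:  # ignore relations unseen in training
                    s += n * math.log((self.counts[c][rel] + 1) / (self.totals[c] + v))
            return s

        return max(self.classes, key=log_score)


# Toy training data: hand-written triples standing in for ReVerb output.
train = [
    ([("He", "was born in", "Graz"), ("He", "was awarded", "a prize")], "featured"),
    ([("It", "is", "a town"), ("It", "is", "small"), ("It", "has", "a church")], "non-featured"),
]
model = RelationNaiveBayes().fit(
    [relation_features(t) for t, _ in train], [y for _, y in train]
)

print(model.predict(relation_features([("She", "was born in", "Vienna")])))
```

Only the relation phrase is used as a feature here; the argument entities e1 and e2 could be added in the same way if they turn out to be informative.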

Summary

Simple fact related measure: Factual Density

Based on Factual Density, featured/good articles can be separated from non-featured ones if article length is similar

If articles differ in length, use word count! Future work: combine both measures

Plan to incorporate edit history (hypothesis: more editors, higher factual density)

Preliminary experiments with relational features

Promising results, more work in this direction

The goal here is to bring semantics into the field of Information Quality

We expect this to unlock several Information Quality (IQ) dimensions, e.g. generality vs. specificity

Thank you for your attention!

Elisabeth Lex

elex@know-center.at
