Page 1: Collaborative Publishing: Wiki and Wikipediapeople.cs.pitt.edu/~rosta/SocialWeb/wiki.pdf · 2008-02-28 · Wiktionary Wikinews Wikipedia Wikiversity Wikiquote Wikisource Commons English-language

Collaborative Publishing:

Wiki and Wikipedia

By Qi Li

Agenda

• Overview of Wiki and Wikipedia

• Knowledge Organization of Wikipedia

• Improving Wikipedia’s Accuracy

• Wikipedia in Natural Language Processing


Overview of Wiki and

Wikipedia

Reference:

Keshava P Subramanya

Roopa Kannan

What is Wikipedia?

• Wikipedia is a freely licensed encyclopedia written by thousands of volunteers in many languages

• Free license allows others to freely

copy, redistribute, and modify our work

commercially or non-commercially

• Founded January 15, 2001

wikipedia.org


What is a wiki?

• A wiki is software that allows users to

create, edit, and link web pages easily.

• Wikis are often used to create

collaborative websites and to power

community websites.

• Ward Cunningham, developer of the first wiki, WikiWikiWeb, originally described it as "the simplest online database that could possibly work".

wikipedia.org

What is the Wikimedia Foundation?

• Non-profit foundation

• Aims to distribute a free encyclopedia to every single person on the planet in their own language

• Wikipedia and its sister projects

• Funded by public donations

• Applying for grants

wikimediafoundation.org


Wikimedia Foundation

Governed by a Board of Directors (5 positions: 1 permanent (Jimmy Wales), 2 Bomis reps, 2 community reps)

Foundation coordinates official (volunteer) positions:

Fundraising, legal, technical development, press, etc

MediaWiki (software)

And the projects:

Local chapters: English (en); German (de); Italian (it); etc.: 215 languages in total

Wiktionary, Wikinews, Wikipedia, Wikiversity, Wikiquote, Wikisource, Commons

English-language Wikipedia community:

• Foundation board

• Developers, stewards, bureaucrats

• Admins

• Long-term users: lots of contribs, heavy community participation

• Logged-in users with some contributions, less community participation

• Anonymous IP edits

• Vandals, trolls, sockpuppets


Advantages of Free License

• Remains non-proprietary

• Decreases individual sense of

ownership

• Increases a sense of shared ownership

• Enhances the popularity of Wikipedia

• Attribution requirement extends brand

Free Software

• MediaWiki is GPL

• We use all free software on the website

• GNU/Linux

• Apache

• MySQL

• PHP


How big is Wikipedia?

• English Wikipedia is largest and has over 130 million words

• English Wikipedia larger than Britannica

and Microsoft Encarta combined

• In 15 months the publicly distributed

compressed database dumps may reach 1

terabyte total size

How big is Wikipedia Globally?

• English – 533,000 articles

• German – 220,000 articles

• Japanese – 110,000 articles

• French – 100,000 articles

• Swedish – 71,000 articles

• Nearly 1.5 million across 200 languages

• 20+ languages with >10,000 articles; 50+ with >1,000


How popular is Wikipedia?

Wikimedia Projects

• Wikipedia

• Wiktionary

• Wikibooks

• Wikisource

• Wikiquote

• Wikispecies

• Wikimedia Commons

• Wikinews


Wikimedia’s Hardware

• 40+ servers

• Squid caching servers in front to serve

cached objects quickly

• Apache/PHP webservers in the middle

• Database backend (MySQL)


MediaWiki

• MediaWiki is one of many wiki engines

• Collaborative software that allows users to

add or edit content

• Primarily developed for Wikipedia from

2002 onwards

• Scalable and multilingual

• Free license

MediaWiki features

• Quality control features (versioning)

• Editing features (simple markup)

• Community features (talk pages, profiles,

access levels)


Jakob Voss : Knowledge Organization with Wikipedia. 5th NKOS Workshop, Sep 21,2006

Knowledge Organization

with Wikipedia

Reference:

1. Jakob Voss Common Library Network (GBV) at 5th

NKOS Workshop, Alicante September 21, 2006

2. Phoebe Ayers: UC Davis, Physical Sciences &

Engineering Library, phoebe.ayers @ gmail.com

en.wikipedia.org/wiki/User:Phoebe Ayers

[[Outline]]

• Wikipedia: namespace

• Wikipedia's Category system

• Mapping

• Indexing with Wikipedia articles


[[What are Wikipedia namespaces?]]

• Main: The main namespace or article namespace is the encyclopedia proper. It is the default namespace and does not use a prefix.

• Portal (prefix Portal:) is for reader-oriented

portals that help readers find and browse through

articles related to a specific subject.

• User (prefix User:) is a namespace that provides pages for Wikipedia users' personal presentations and auxiliary pages for personal use, for example containing bookmarks to favorite pages.

• Image (prefix Image:, also called image

description pages) is a namespace that provides

info about images and sound clips, one page for

each, with a link to the image or sound clip itself.

Wikipedia Namespace (cont.)

• Category contains categories of pages, with each displaying

a list of pages in that category and optional additional text.

• Help (prefix Help:) documents the basic technical features of Wikipedia.

• Talk namespaces are used to discuss changes to the

corresponding page in the associated namespace. Pages in

the user talk namespace are used to leave messages for a

particular user.

– the talk namespace associated with the main article

namespace has the prefix Talk:,

– while the talk namespace associated with the user

namespace has the prefix User talk:


Wikipedia Namespace (cont.)

• MediaWiki (prefix MediaWiki:) is a namespace

containing interface texts such as link labels and

messages. They are used for adjusting the localisation

(i.e. local version) of interface messages without waiting

for a new LanguageXx.php file to get installed.

• Template (formerly part of the MediaWiki namespace) is

used to define a standard text which can then be

conveniently added within pages, either the text itself at

the time of adding, or a reference to the text at the time

of viewing the page. The latter way effectively changes

all such occurrences of the standard text automatically

by just editing the page where the text is defined.
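The prefix convention described on these namespace slides can be sketched in a few lines. This is a simplified illustration, not MediaWiki's actual title parser: the function name `split_title` and the `KNOWN_NAMESPACES` set are assumptions made for the example, and real wikis also normalise case and underscores.

```python
# Namespace prefixes as listed on the slides; the main namespace has no prefix.
KNOWN_NAMESPACES = {
    "Portal", "User", "Image", "Category", "Help",
    "Talk", "User talk", "MediaWiki", "Template",
}


def split_title(title: str) -> tuple[str, str]:
    """Return (namespace, page name) for a full page title."""
    prefix, sep, rest = title.partition(":")
    if sep and prefix in KNOWN_NAMESPACES:
        return prefix, rest
    # No recognised prefix: the page lives in the main (article) namespace.
    return "Main", title


for t in ["Information science", "User talk:Example",
          "Category:Information Science", "2001: A Space Odyssey"]:
    print(split_title(t))
```

A colon alone does not make a namespace: the last title contains one but stays in the main namespace because "2001" is not a known prefix.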

How do articles get written?

• Someone starts it

• Someone else checks it

• A (possibly third) party edits it…

http://en.wikipedia.org/wiki/Help:Contents/Editing_Wikipedia


Article Criteria

• Notable (encyclopedic)

• Not vanity

• Not duplication

• Community consensus…


Edit wars… and other things

that go boom


Predictable vandalism… posted and reverted the same minute (10:31)


How to edit Wikipedia categories

• Tagging by linking

[[Category:Information Science]]

...

• Open for all

• Blind tagging

• Multi-hierarchical relations

• High connectivity
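Because a category is assigned simply by writing a link in the page text, the tags of a page can be read back out of its wikitext. The sketch below assumes the English `Category:` prefix and a hypothetical helper name `extract_categories`; the sample text is invented for illustration.

```python
import re

# Matches [[Category:Name]] and [[Category:Name|sort key]]; the sort key is ignored.
CATEGORY_LINK = re.compile(
    r"\[\[\s*Category\s*:\s*([^\]|]+?)\s*(?:\|[^\]]*)?\]\]",
    re.IGNORECASE,
)


def extract_categories(wikitext: str) -> list[str]:
    """Return the category names tagged in a page's wikitext, in order."""
    return [m.group(1) for m in CATEGORY_LINK.finditer(wikitext)]


sample = (
    "A '''thesaurus''' is a controlled vocabulary.\n"
    "[[Category:Information Science]]\n"
    "[[Category:Knowledge representation|Thesaurus]]\n"
)
print(extract_categories(sample))
```

Category pages carry the same kind of tags themselves, which is how the multi-hierarchical relations on the slide arise.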


[[Wikipedia categories]]


[[Category system as KOS]]

Collaboratively edited, general thesaurus (en, Jan 2006: 91,502 categories, 923,196 articles)

[Figure: Distribution of descriptor levels. x-axis: level (1 to 12); y-axis: share of descriptors (0% to 35%); series: DDC (ext.) and Wikipedia (en).]

Voss (2006): Collaborative thesaurus tagging the Wikipedia way

http://arxiv.org/abs/cs.IR/0604036

[Figure: Distribution of descriptors per record. x-axis: descriptors (categories or tags) per record (1 to 9); y-axis: share of records (pages or posts), with ticks at 0%, 1%, 10%, 100%; series: Wikipedia, del.icio.us, exponential (λ=0.6).]


What is Collaborative Publishing?

• Collaborative: works are created by

multiple people together rather than

individually

• Publishing: knowledge

• Some projects are overseen by an editor

or editorial team

• Many grow without any top-down oversight

Characteristics 1: access control

• All users can edit any page, subject to access control

• Access control:

                  Create   Edit   Link   Browsing
Administration      ×        ×      ×       ×
Group
Individual
Public


Characteristics 2: Revision

control

Wikipedia’s Accuracy


Criticisms

• Could a collaborative project that anyone can edit be “a public good”?

– Contribute articles

– Quality of articles is close to Encyclopaedia Britannica

• Vandalism

• Creeping bureaucracy & growing instances of infighting among editors

• The community’s anti-intellectual attitude

• “digital Maoism”

• “faith-based encyclopaedia”

Further criticisms

• Entries for pop cultural figures vs. those

for great literary figures, scientists, etc.

• Entry for Britney Spears longer than entry

for St. Augustine

• Seinfeld longer than Shakespeare; Barbie

longer than Bellow

• Response: Nothing to get exercised about


80/10 Rule

• Counting only logged in users, and even excluding some prominent approved bot users

• 10 percent of all users make 80% of all edits

• 5 percent of all users make 66% of edits

• Half of all edits are made by just 2 1/2 percent of all users
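Figures like these come from ranking users by edit count and summing the top slice. A minimal sketch of that arithmetic follows; the function name `share_of_edits` and the per-user counts are hypothetical and are not Wikipedia data.

```python
def share_of_edits(edit_counts: list[int], top_fraction: float) -> float:
    """Fraction of all edits made by the most active `top_fraction` of users."""
    ranked = sorted(edit_counts, reverse=True)
    top_n = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:top_n]) / sum(ranked)


# Hypothetical counts for 20 users: two heavy editors and a long tail (100 edits total).
counts = [50, 30] + [2] * 2 + [1] * 16

print(share_of_edits(counts, 0.10))  # share made by the top 10% of users
print(share_of_edits(counts, 0.05))  # share made by the top 5% of users
```

With this toy distribution the top 10 percent of users (2 of 20) account for 80 of the 100 edits, the same shape as the rule on the slide.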

Edits by Anons

• Controversial, intriguing

• Yes, you can edit this page

• Without logging in!


Edits by Anons - %

• Anonymous users, identified only by IP address, can edit Wikipedia, and do

• But these edits make up a total of around 18% of all edits, with some evidence of a downward trend over time

• Anecdotally, many regular users report sometimes editing anonymously by accident or as a quiet form of sockpuppeting

Edits across namespaces

• Articles 85%

• Talk pages 8%

• User Page 3%

• User Talk Pages 4%

These percentages were stable across 2003 and 2004


Studying the Accuracy of Wikipedia

• Study by Nature

– “factual errors, omissions or misleading statements”: Wikipedia vs Britannica: 162 vs 123; major errors: 4 vs 4

• Survey: respondents were asked whether they think sample articles are accurate

– 76% answered “accurate”

Separate the wheat from the chaff

• Proposal 1: Based on explicit article

validation

– “trusted user” (defined using various criteria)

explicitly marks an article as “good”

– Peer-based explicit system: allow users to

choose which of their peers to trust, thus

providing different results for each user

– Shortcoming: requires explicit input from reviewers


• Proposal 2: automatically assess information quality by calculating metrics based on metadata recorded and stored by Wikipedia

– Metrics: # of edits made for the article and # of unique editors for the article

– Distinguishing two classes of pages

– Link ratio analysis

– Quality of editors

– Trustworthiness or reputation of authors and articles

– Segments instead of articles
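The two metrics named under Proposal 2, the number of edits and the number of unique editors per article, can be computed directly from revision metadata. The sketch below assumes a hypothetical `Revision` record with just an article title and an editor name; the sample history is invented.

```python
from collections import defaultdict
from typing import NamedTuple


class Revision(NamedTuple):
    article: str
    editor: str


def edit_metrics(revisions):
    """Return {article: (number of edits, number of unique editors)}."""
    edits = defaultdict(int)
    editors = defaultdict(set)
    for rev in revisions:
        edits[rev.article] += 1
        editors[rev.article].add(rev.editor)
    return {article: (edits[article], len(editors[article])) for article in edits}


# Hypothetical revision history, for illustration only.
history = [
    Revision("Venus", "alice"),
    Revision("Venus", "bob"),
    Revision("Venus", "alice"),
    Revision("Kinkajou", "carol"),
]
print(edit_metrics(history))
```

How these counts are turned into a quality judgement, for instance by thresholds or a trained classifier, is left open here, as it is on the slide.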

• Surprisingly successful

• Large/Complete/Coverage

• Again: Free


Wikipedia and copyright

• Need for copyright less than we imagined?

• Do our empirical assumptions about the

need for copyright need adjustment?

• Take open source software like Linux (Surowiecki, 2004).

References

• Cohen, Noam. “Courts Turn to Wikipedia, but Selectively.” The New York Times January 29 (2007): Section C, page 3.

• Economist. “Battle of Britannica.” Economist 378.8471 (April 1, 2006): 65-66.

• Fallis, Don. “The Epistemic Costs and Benefits of Collaboration.” Southern Journal of Philosophy 44.S (2006): 197-208.

• Fallis, Don. “On Verifying the Accuracy of Information: Philosophical Perspectives.” Library Trends 52.3 (2004): 463-487.

• Fricke, Martin and Don Fallis. “Indicators of Accuracy of Consumer Health Information on the Internet.” Journal of the American Medical Informatics Association 9 (2002): 73-79.

• Giles, J. “Internet Encyclopedias Go Head to Head.” Nature 438.7069 (December 15, 2005): 900-901.

• Hettinger, Edwin. “Justifying Intellectual Property.” Philosophy and Public Affairs 18 (1989): 31-52.

• Paine, Lynn Sharp. “Trade Secrets and the Justification of Intellectual Property: A Comment on Hettinger.” Philosophy and Public Affairs 20 (1991): 247-263.

• Poe, Marshall. “The Hive.” Atlantic Monthly 298.2 (September 2006): 86-94.

• Resnik, David. “A Pluralistic Account of Intellectual Property.” Journal of Business Ethics 46 (2003): 319-335.

• Schiff, Stacy. “Know it All.” New Yorker 82.23 (July 31, 2006).

• Sunstein, Cass. “Mobbed up.” New Republic 230.24 (June 28, 2004): 40-45.

• Surowiecki, James. The Wisdom of Crowds. New York: Anchor Books, 2004.


Wikipedia in NLP

Ontology

Thesauri

Categorization

Topic Detection

Information Retrieval (Query Expansion)

Word Sense Disambiguation

Question Answering

Translation (CLIR)

Wikitology !

• Using Wikipedia as an ontology offers the

best of both approaches

–Each article is a concept in the

ontology

–Terms linked via Wikipedia’s category

system and inter-article links

• It’s a consensus ontology created, kept

current and maintained by a diverse

community

• Overall content quality is high


Wikitology features

• Terms have unique IDs (URLs) and are “self describing” for people

• Several underlying graphs provide structure: categories, article links

• Article history contains useful meta-data (e.g., for trust)

• External sources provide more info (e.g., Google’s pagerank)

• Some of the data is available in structured form, e.g., in RDF from DBpedia

[[Semantic Wikipedia]]

Typed links: [[is capital of::England]]

=> RDF triples

Völkel et al (2006): Semantic Wikipedia. WWW2006 conference
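The step from a typed link to a triple can be sketched with a regular expression: the page is the subject, the text before `::` is the relation, and the link target is the object. The function name `typed_links_to_triples` and the sample sentence are assumptions for illustration, and the output is a plain Python tuple rather than serialised RDF.

```python
import re

# [[relation::target]] or [[relation::target|label]]; plain [[links]] do not match.
TYPED_LINK = re.compile(r"\[\[([^\]|:]+)::([^\]|]+)(?:\|[^\]]*)?\]\]")


def typed_links_to_triples(page_title: str, wikitext: str):
    """Return (subject, relation, object) triples for the typed links on a page."""
    return [
        (page_title, m.group(1).strip(), m.group(2).strip())
        for m in TYPED_LINK.finditer(wikitext)
    ]


text = "London [[is capital of::England]] and lies on the [[River Thames]]."
print(typed_links_to_triples("London", text))
```

Only the typed link yields a triple; the ordinary link to River Thames carries no relation and is skipped.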


Thesauri

• Reference:

– Mining Domain-Specific Thesauri from

Wikipedia: A case study, Milne, D., Medelyan,

O., and Witten, I. H. 2006. Proceedings of the

2006 IEEE/WIC/ACM International

Conference on Web Intelligence

– Milne, D., Witten, I. H., & Nichols, D. M.

(2007). Extracting corpus specific knowledge

bases from Wikipedia. CIKM. Lisbon,

Portugal.

Thesauri

• Thesauri:

– an indexed compilation of words with similar,

related, broader, narrower and opposite

meanings.

• Wikipedia

– Each article - a concept

– Hyperlinks - relations

• Equivalence - USE, USE FOR

• Hierarchical - BT, NT

• Associative - RT
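One way to picture the mapping is a small data structure that stores the three relation types per term. The assignment used below (redirects give USE / USE FOR, category parents give BT / NT, article links give RT) is an assumption made for this sketch, since the slide only says that hyperlinks provide the relations. The function name and the toy data are hypothetical.

```python
from collections import defaultdict


def build_thesaurus(redirects, parents, links):
    """redirects: alt name -> article; parents: article -> broader terms; links: (a, b) pairs."""
    thesaurus = defaultdict(lambda: defaultdict(set))
    for alt, article in redirects.items():          # equivalence
        thesaurus[alt]["USE"].add(article)
        thesaurus[article]["USE FOR"].add(alt)
    for narrower, broader_terms in parents.items():  # hierarchical
        for broader in broader_terms:
            thesaurus[narrower]["BT"].add(broader)
            thesaurus[broader]["NT"].add(narrower)
    for a, b in links:                               # associative
        thesaurus[a]["RT"].add(b)
        thesaurus[b]["RT"].add(a)
    return thesaurus


# Toy input, for illustration only.
t = build_thesaurus(
    redirects={"Panthera onca": "Jaguar"},
    parents={"Jaguar": ["Felidae"]},
    links=[("Jaguar", "Leopard")],
)
print(dict(t["Jaguar"]))
```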


Topic Detection

• Reference:

– Identifying document topics using the Wikipedia category network, Peter Schonhofen, Proceedings of the 2006 IEEE/ACM International Conference on Web Intelligence (WI 2006 Main Conference Proceedings)

• Topic Detection:

– utilize an ontology to detect concepts in the document

– select the most dominant concepts to represent the document.

• Ontology from wikipedia

– Coverage of wikipedia is general purpose and very

wide,

– Structure is rich and consistent
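The two steps of topic detection, finding concepts in the text and then keeping the dominant ones, can be sketched as a vote over categories. This is a deliberately naive illustration, not the method of the cited paper: the `ARTICLE_CATEGORIES` table is a hypothetical miniature ontology, and concepts are matched by single lowercase words only.

```python
from collections import Counter

# Hypothetical miniature ontology: article title -> categories.
ARTICLE_CATEGORIES = {
    "jaguar": ["Felines"],
    "leopard": ["Felines"],
    "puma": ["Felines"],
    "kinkajou": ["Procyonidae"],
}


def detect_topics(text: str, top_k: int = 1) -> list[str]:
    """Match words to article titles, vote for their categories, keep the top ones."""
    votes = Counter()
    for word in text.lower().split():
        for category in ARTICLE_CATEGORIES.get(word.strip(".,;"), []):
            votes[category] += 1
    return [category for category, _ in votes.most_common(top_k)]


print(detect_topics("The jaguar and the leopard are larger than the kinkajou."))
```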


Wikipedia structure

• Components: articles, image pages, discussion about article contents, authors, page component templates and so on.

• Articles: have titles, belong to categories, and refer to other articles

• Categories: organized hierarchically into sub- and super-categories (not just a tree)

• Authors: create the links between articles and the hierarchy of categories.

Wikipedia structure


Wikipedia for classification

• Reference:

– Overcoming the Brittleness Bottleneck using Wikipedia: Enhancing Text Categorization with Encyclopedic Knowledge. Evgeniy Gabrilovich and Shaul Markovitch. American Association for Artificial Intelligence, 2006

– Banerjee, S., Ramanathan, K., & Gupta, A. (2007). Clustering short text using Wikipedia. SIGIR

– Meyer, M., & Rensing, C. (2007). Categorizing Learning Objects based on Wikipedia as Substitute Corpus. Proceedings of the First International Workshop on Learning Object Discovery and Exchange.

Text Categorization

• Deals with automatic assignment of category labels to natural language documents

• Documents are represented as bags of words (BOW)

• Features are derived from words

• Limitation of BOW: restricted to individual word occurrences in the training set. Two headlines on the same topic can share almost no words:

– Wal-Mart supply chain goes real time

– Wal-Mart manages its stock with RFID technology

• Effective for medium-difficulty categorization, but poor on small categories or short documents

• Idea: use an encyclopedia to endow the machine with the breadth of knowledge available to humans
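The bag-of-words limitation is easy to see on the two Wal-Mart headlines from the slide. The sketch below reduces each headline to its set of lowercase words (word counts are dropped, which is enough for this comparison) and measures the overlap; the helper name `bag_of_words` is an assumption for the example.

```python
def bag_of_words(text: str) -> set[str]:
    """Lowercase, whitespace-split word set of a text."""
    return set(text.lower().split())


a = bag_of_words("Wal-Mart supply chain goes real time")
b = bag_of_words("Wal-Mart manages its stock with RFID technology")

shared = a & b
jaccard = len(shared) / len(a | b)
print(shared, round(jaccard, 3))
```

The only shared word is the company name, so a classifier that sees nothing but word occurrences has little evidence that both headlines are about the same topic.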


• Auxiliary text classifier:

– matching documents with the most relevant articles of Wikipedia

– conventional bag of words + new features

• Example of the idea behind the auxiliary text classifier:

– “Bernanke takes charge”

– BEN BERNANKE, FEDERAL RESERVE, CHAIRMAN OF THE FEDERAL RESERVE, ALAN GREENSPAN, MONETARISM, …

• Using Wikipedia

– Use text similarity algorithms to automatically identify encyclopedia articles relevant to each document

– Leverage the knowledge gained from these articles

• For “jaguar car models”, the Wikipedia-based feature generator returns:

– JAGUAR (CAR),

– DAIMLER and BRITISH LEYLAND MOTOR CORPORATION (companies merged with Jaguar),

– V12 (Jaguar’s engine),

– JAGUAR E-TYPE,

– JAGUAR XJ.

• For “jaguar Panthera onca”, it returns:

– JAGUAR,

– FELIDAE (feline species family),

– related felines such as LEOPARD, PUMA and BLACK PANTHER,

– as well as KINKAJOU.
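The matching step can be illustrated with a toy feature generator that ranks articles by text similarity to the input and returns the best titles as extra features. The sketch uses cosine similarity over raw word counts, which is a simple stand-in and not necessarily the similarity measure of the cited work; the one-line article texts in `ARTICLES` are hypothetical, not real Wikipedia content.

```python
import math
from collections import Counter

# Hypothetical one-line stand-ins for Wikipedia articles.
ARTICLES = {
    "JAGUAR (CAR)": "jaguar british car maker luxury models engine",
    "JAGUAR E-TYPE": "jaguar e-type sports car models built from 1961",
    "JAGUAR": "jaguar panthera onca big cat felidae americas",
    "LEOPARD": "leopard panthera pardus big cat felidae",
}


def vectorize(text: str) -> Counter:
    return Counter(text.lower().split())


def cosine(u: Counter, v: Counter) -> float:
    dot = sum(u[w] * v[w] for w in u)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def generate_features(document: str, top_k: int = 2) -> list[str]:
    """Titles of the top_k articles most similar to the document."""
    doc_vec = vectorize(document)
    ranked = sorted(
        ARTICLES,
        key=lambda title: cosine(doc_vec, vectorize(ARTICLES[title])),
        reverse=True,
    )
    return ranked[:top_k]


print(generate_features("jaguar car models"))
print(generate_features("jaguar Panthera onca"))
```

The same ambiguous word leads to car articles in one context and feline articles in the other, which is the effect the jaguar examples on the slide describe.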


• Some names denote multiple entities:

– “John Williams and the Boston Pops conducted a summer Star Wars concert at Tanglewood.”

John Williams ⇒ John Williams (composer)

– “John Williams lost a Taipei death match against his brother, Axl Rotten.”

John Williams ⇒ John Williams (wrestler)

– “John Williams won a Victoria Cross for his actions at the battle of Rorke’s Drift.”

John Williams ⇒ John Williams (VC)

Word Sense Disambiguation

• Some entities have multiple names:

– John Williams (composer) ⇐ John Williams

– John Williams (composer) ⇐ John Towner Williams

– John Williams (wrestler) ⇐ John Williams

– John Williams (wrestler) ⇐ Ian Rotten

– Venus (planet) ⇐ Venus

– Venus (planet) ⇐ Morning Star

– Venus (planet) ⇐ Evening Star


WSD

• Web searches

– Queries about Named Entities (NEs) constitute a significant portion of popular web queries.

– Ideally, search results are clustered such that:

• In each cluster, the queried name denotes the same entity.

• Each cluster is enriched by querying the web with alternative names of the corresponding entity.

• Web-based Information Extraction (IE)

– Aggregating extractions from multiple web pages can lead to improved accuracy in IE tasks (e.g. extracting relationships between NEs).

– Named entity disambiguation is essential for performing a meaningful aggregation.

Wikipedia Structures

• In general, there is a many-to-many

relationship between names and entities,

captured in Wikipedia through:

–Redirect articles.

–Disambiguation articles.

• Hyperlinks: An article may contain links to

other articles in Wikipedia.

• Categories: each article belongs to at

least one Wikipedia category.


Redirect Articles

• Redirect article:

– exists for each alternative name used to refer to an

entity in Wikipedia.

– Example: The article titled John Towner Williams consists of a pointer to the article John Williams (composer).

• Disambiguation article:

– lists all Wikipedia entities (articles) that may be

denoted by an ambiguous name.

– Example: The article titled John Williams (disambiguation) lists 22 entities (articles).
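Redirect and disambiguation articles together give the many-to-many mapping between names and entities. A minimal sketch follows, using only the examples that appear on the slides; the two lookup tables and the function names are assumptions for illustration, and an unknown name is simply treated as its own article title.

```python
# Entries taken from the examples on the slides; real tables are far larger.
REDIRECTS = {
    "John Towner Williams": "John Williams (composer)",
    "Ian Rotten": "John Williams (wrestler)",
    "Morning Star": "Venus (planet)",
    "Evening Star": "Venus (planet)",
}
DISAMBIGUATION = {
    "John Williams": [
        "John Williams (composer)",
        "John Williams (wrestler)",
        "John Williams (VC)",
    ],
}


def candidate_entities(name: str) -> list[str]:
    """Entities that a name may denote."""
    if name in DISAMBIGUATION:
        return list(DISAMBIGUATION[name])
    if name in REDIRECTS:
        return [REDIRECTS[name]]
    return [name]  # assumption: an unknown name is treated as an article title


def names_for(entity: str) -> list[str]:
    """Names that redirect to, or disambiguate into, an entity."""
    names = {alt for alt, target in REDIRECTS.items() if target == entity}
    names |= {name for name, targets in DISAMBIGUATION.items() if entity in targets}
    return sorted(names)


print(candidate_entities("John Williams"))
print(names_for("John Williams (composer)"))
```

Choosing among the candidates, for example by comparing the surrounding sentence with each candidate article, is the actual disambiguation step and is not covered by this sketch.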

Conclusion

• Overview of Wikipedia

• Knowledge organization in Wikipedia

• Accuracy of Wikipedia

• Application of Wikipedia in NLP

