Exploring Article Networks on Wikipedia with NodeXL

Post on 22-Jul-2015


EXPLORING ARTICLE NETWORKS ON WIKIPEDIA WITH NODEXL

PRESENTATION DESCRIPTION

• With 4.8 million articles in its English version, Wikipedia, the crowd-sourced online encyclopedia, is regularly one of the ten most-visited sites online. For many, it is the go-to source for a first read on a topic. The free and open-source Network Overview, Discovery and Exploration for Excel (NodeXL), an add-on to Microsoft Excel, enables the capture of “article networks” from Wikipedia. The resulting data visualizations, based on content network analysis, support the development of research leads; some understanding of how the public conceptualizes related concepts, people, events, and phenomena; the profiling of Wikipedia editors (both humans and ‘bots); and other research insights. This presentation will showcase this affordance of NodeXL and provide some ideas for practical applications of this channel of research and knowing.

2

OVERVIEW

• Wikipedia ethos and practices

• Wikipedia

• The many Wikipedias; the English Wikipedia

• The Wikimedia Foundation

• MediaWiki and basic functionalities

• Basic article network analysis

• NodeXL and basic functionalities; automation

3

OVERVIEW (CONT.)

• HTTP page networks on Wikipedia:

• article networks

• human author / editor networks

• robot networks

• Live demos

• Other (future) networks from Wikipedia

4

WIKIPEDIA ETHOS AND PRACTICES

• Objective, fact-based, and

research-focused

• Full research citations

• Isolating of opinions into Talk pages

• Open

• Open-access

• Open-source, freely licensed (Creative Commons Attribution-ShareAlike)

• Crowd-sourced knowledge co-creation; curated public data

• Crowd-funded 501(c)(3); transparent finances ($58.5 million goal for FY 2015)

• Editing via email-verified accounts

or Internet Protocol (IP) capture

5

WIKIPEDIA

THE MANY WIKIPEDIAS

• 288 Wikipedias (with 277 active)

• By share of all articles, in descending order: English (13.9%), Swedish (5.6%), Dutch (5.2%), German (5.2%), French (4.6%), Waray-Waray (3.6%), Russian (3.5%), Cebuano (3.4%), Italian (3.4%), Spanish (3.4%), and Other (48.2%)

• (“List of Wikipedias” on Wikipedia)

THE ENGLISH WIKIPEDIA

• Founded on Jan. 15, 2001

• 4.8 million articles

• 25 million user accounts

• 1,347 administrators (“English Wikipedia” on Wikipedia)

6

THE WIKIMEDIA FOUNDATION

• Objective: to encourage “the growth, development and distribution of free,

multilingual, educational content,” and to provide “the full content of these

wiki-based projects to the public free of charge”

• A range of projects: Wikipedia, Wikibooks, Wikiversity, Wikimedia

Commons, Wiktionary, Wikiquote, Wikivoyage, Wikidata, Wikinews,

Wikisource, Wikispecies, and MediaWiki (Wikimedia Foundation)

7

MEDIAWIKI AND BASIC FUNCTIONALITIES

• “wiki wiki”: “quick” or “fast” in Hawaiian

• Ward Cunningham as the developer of the first wiki software (WikiWikiWeb) in 1994 to

enable online collaborations with history versioning and rollback capabilities

• MediaWiki first released in 2002 for Wikipedia; now maintained under the Wikimedia Foundation (founded in 2003)

• Magnus Manske and Lee Daniel Crocker were the initial developers of this tool using PHP

(MediaWiki)

8

A WIKIMEDIA ARTICLE INTERFACE

9

A VIEW OF THE REVISION HISTORY

10

BASIC ARTICLE NETWORK ANALYSIS

• Basics of network graphs: nodes-links, entities-relationships, vertices-edges; undirected or directed graphs (digraphs); networks and meta-networks; subgraphs, clusters, and motifs; network centrality

• Direct ties represented in ego neighborhoods (with a maximum geodesic distance, or graph diameter, of 2); also 1.5-degree ties for transitivity (with a maximum geodesic distance, or graph diameter, of 3) and 2-degree ties to include the networks of the respective “alters” (with much larger maximum geodesic distances possible)
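
The three ego-network levels can be sketched in a few lines of Python. This is a minimal illustration on made-up toy data, assuming the common textbook definitions (1 degree: ego and its alters; 1.5 degree: the same vertices plus the ties among the alters; 2 degree: also the alters' own neighbors). The function name `ego_network` and the `adj` structure are hypothetical, and the importer used for the screenshots in this deck may define the levels differently.

```python
def ego_network(adj, ego, level):
    """Return (vertices, edges) for the ego network of `ego`.

    adj   : dict mapping a vertex to the set of vertices it links to (toy data)
    level : 1   -> ego and its alters, ego-to-alter edges only
            1.5 -> the same vertices, plus the edges among them
            2   -> also every vertex the alters link to
    """
    alters = set(adj.get(ego, ())) - {ego}
    vertices = {ego} | alters
    if level == 1:
        edges = {(ego, alter) for alter in alters}
    elif level == 1.5:
        edges = {(u, v) for u in vertices for v in adj.get(u, ()) if v in vertices}
    elif level == 2:
        # Follow every out-link of the ego and its alters, then add the new endpoints.
        edges = {(u, v) for u in vertices for v in adj.get(u, ())}
        vertices = vertices | {v for _, v in edges}
    else:
        raise ValueError("level must be 1, 1.5, or 2")
    return vertices, edges


# Hypothetical out-link data, for illustration only.
toy = {
    "Wiki": {"MediaWiki", "Wikipedia"},
    "MediaWiki": {"Wikipedia", "PHP"},
    "Wikipedia": set(),
    "PHP": {"Zend"},
}
for level in (1, 1.5, 2):
    vertices, edges = ego_network(toy, "Wiki", level)
    print(level, len(vertices), "vertices,", len(edges), "edges")
```

Even on this tiny graph, each step up in level adds edges or vertices, which hints at why the 2-degree extraction later in the deck grows so large.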

11

BASIC ARTICLE NETWORK ANALYSIS (CONT.)

• Entities may be individuals or groups, contents, and other elements

• Relatedness: Article networks created based on in-links and out-links; node “degree”

• Other types of relatedness are possible such as based on word co-occurrences, title

relatedness (same synset or “synonym set”), shared categories, and others

• Relations are conceptualized as enabling paths
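
As a small illustration of node “degree” in a link-based article network, the sketch below counts in-links and out-links per article from a directed edge list. The article pairs are invented for the example, and `degree_table` is a hypothetical helper name, not a NodeXL function.

```python
from collections import Counter


def degree_table(edges):
    """Count in-degree, out-degree, and total degree for each vertex.

    edges: iterable of (source, target) pairs, one per directed link.
    """
    edges = list(edges)
    out_deg = Counter(src for src, _ in edges)
    in_deg = Counter(dst for _, dst in edges)
    vertices = set(out_deg) | set(in_deg)
    return {
        v: {"in": in_deg[v], "out": out_deg[v], "total": in_deg[v] + out_deg[v]}
        for v in vertices
    }


# Hypothetical links between three articles.
links = [
    ("Wiki", "MediaWiki"),
    ("Wiki", "Wikipedia"),
    ("MediaWiki", "Wikipedia"),
]
table = degree_table(links)
for title in sorted(table):
    print(title, table[title])
```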

12

NODEXL AND BASIC FUNCTIONALITIES; AUTOMATION

• A free and open-source add-on to Microsoft Excel available on the Microsoft

CodePlex platform

• Enables…

• Graph visualization (with datasets from UCINET, GraphML, and other types)

• Data extraction from a number of social media platform APIs; refreshed runs based on

the same parameters (macros)

• Large number of tools of graph analysis

• A number of layout algorithms and selections to represent the data visually
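
Since GraphML is one of the dataset formats mentioned above, here is a minimal sketch of writing a directed edge list as a bare-bones GraphML document using only the Python standard library. The helper name `edges_to_graphml` and the sample edges are assumptions for illustration; the file carries no attributes beyond vertex ids, and whether a given NodeXL version needs more than that has not been checked here.

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"


def edges_to_graphml(edges, directed=True):
    """Serialize (source, target) pairs as a minimal GraphML string."""
    edges = list(edges)
    root = ET.Element("graphml", xmlns=GRAPHML_NS)
    graph = ET.SubElement(
        root, "graph", id="G",
        edgedefault="directed" if directed else "undirected",
    )
    # dict.fromkeys keeps first-seen order while removing duplicate vertices.
    for vertex in dict.fromkeys(v for edge in edges for v in edge):
        ET.SubElement(graph, "node", id=vertex)
    for src, dst in edges:
        ET.SubElement(graph, "edge", source=src, target=dst)
    return ET.tostring(root, encoding="unicode")


# Hypothetical edges, for illustration only.
graphml_text = edges_to_graphml([("Wiki", "MediaWiki"), ("Wiki", "Wikipedia")])
print(graphml_text)
```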

13

HTTP PAGE NETWORKS ON WIKIPEDIA (IN THIS CASE)

• HTTP page links within Wikipedia only, not connecting out to the Surface Web

• Directed graph of the one-way links (outlinks) from the target Wikipedia page

• May include article page networks, human page networks, robot page networks, and

others

• Networks seeded by one target title or name (as long as the string appears as a

page in Wikipedia)

• No need for an application programming interface (API) on the MediaWiki platform
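
To make the idea of a seeded outlink network concrete, the sketch below turns the internal links in a page's wikitext into directed edges from the seed title. It is an illustration only, not how NodeXL performs the extraction: the regular expression covers just the basic `[[Target]]`, `[[Target|label]]`, and `[[Target#Section]]` forms, the skipped prefixes are a small assumed subset, and the sample wikitext is invented.

```python
import re

# Matches [[Target]], [[Target|label]], and [[Target#Section]]; group 1 is the target title.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")
SKIPPED_PREFIXES = ("File:", "Image:", "Category:")  # assumed subset of non-article links


def outlink_edges(seed_title, wikitext):
    """Return unique (seed_title, target) edges for internal links in `wikitext`."""
    edges = []
    seen = set()
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip().replace("_", " ")
        if not target or target.startswith(SKIPPED_PREFIXES):
            continue
        # Treat the first letter as case-insensitive, as Wikipedia titles are.
        target = target[0].upper() + target[1:]
        if target not in seen:
            seen.add(target)
            edges.append((seed_title, target))
    return edges


# Invented wikitext, for illustration only.
sample = (
    "A [[wiki]] often runs on [[MediaWiki|wiki software]]. "
    "See [[Wikipedia#History]] and [[wiki]] again. [[Category:Wikis]]"
)
sample_edges = outlink_edges("Ward Cunningham", sample)
for edge in sample_edges:
    print(edge)
```

Repeating the same step for each target title is what expands a 1-degree network toward the larger extractions.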

14

MEDIAWIKI ARTICLE NETWORK ON WIKIPEDIA

(1 DEG., 237 VERTICES, 237 EDGES)

15

MEDIAWIKI ARTICLE NETWORK ON WIKIPEDIA

(1.5 DEG., 12,368 VERTICES AND 17,686 UNIQUE EDGES)

16

MEDIAWIKI ARTICLE NETWORK ON WIKIPEDIA

(2 DEG., 923,006 VERTICES)

17

In the first run, the software threw an “out of memory” exception and crashed. Another run was conducted on a different machine with more processing capability. The screenshots are from that data extraction. The data itself included some edge pairs (over half a dozen) in which one of the vertices was missing.
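
Edge pairs with a missing vertex can be flagged before analysis with a simple check like the one below. This is a sketch on invented data, assuming the extraction has been exported as a vertex list plus an edge list; `split_dangling_edges` is a hypothetical helper name.

```python
def split_dangling_edges(vertices, edges):
    """Separate edges whose endpoints are both known from those with a missing vertex.

    A blank or None endpoint counts as missing as long as it is not in `vertices`.
    """
    known = set(vertices)
    clean, dangling = [], []
    for src, dst in edges:
        if src in known and dst in known:
            clean.append((src, dst))
        else:
            dangling.append((src, dst))
    return clean, dangling


# Invented export, for illustration only.
vertex_list = ["MediaWiki", "PHP", "Wikipedia"]
edge_list = [
    ("MediaWiki", "PHP"),
    ("MediaWiki", ""),          # blank endpoint
    ("Wikipedia", "Nupedia"),   # endpoint absent from the vertex list
]
clean_edges, dangling_edges = split_dangling_edges(vertex_list, edge_list)
print(len(clean_edges), "clean;", len(dangling_edges), "with a missing vertex")
```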

EXAMPLE: ARTICLE NETWORK

• Who are individuals related to a topic? Events? Years? Topics? Which of

these may be useful leads to learn more about the basic seed topic?

• Based on a real-world individual, what is he or she known for? Who are

people that this person is connected with?

• Based on a technology, when did it originate? Who originated it? What were its precursor inventions? What inventions were linked to the particular technology?

18

EXAMPLE: ARTICLE NETWORK (CONT.)

• Based on collected lists, who is on a target list, and for what?

• Based on a particular topic, are there gaps in the information based on

“missing” article links?

• Based on a particular phenomenon, event, phrase, or individual in a foreign context and foreign language, what may be learned?

19

WIKI ARTICLE NETWORK ON WIKIPEDIA

(1 DEG., 162 VERTICES)

20

WEB_LOG_ANALYSIS_SOFTWARE ARTICLE NETWORK ON WIKIPEDIA (1 DEG., 13 VERTICES)

21

EXAMPLE: HUMAN (AUTHOR / EDITOR) USER NETWORK

• Based on the human user’s network on Wikipedia, what articles does he or she

tend to edit? In total, what does this network suggest about the person behind

the edits?

• (This requires the existence of a user page though.)

22

USER:LWEDEKIND NETWORK ON WIKIPEDIA (1 DEG., 9 VERTICES)

23

USER:THIS_LOUSY_T-SHIRT ARTICLE NETWORK ON WIKIPEDIA (1 DEG., 30 VERTICES)

24

EXAMPLE: ROBOT NETWORK

• Based on the approved robot user’s network, what are the interests of the

maker of the robot? What other accounts is the robot connected to?

25

USER:OGREBOT NETWORK ON WIKIPEDIA

(1 DEG., 5 VERTICES)

26

USER:EMAUSBOT NETWORK ON WIKIPEDIA

(1 DEG., 2 VERTICES)

27

ADDITIONAL APPROACHES

• Chaining from one target account to related others

• Cross-comparing information on the Wikipedia site with the extracted

networks

• Connecting the Wikipedia information with related sites on the Surface Web /

World Wide Web (WWW) and Internet

28

OTHER (FUTURE) NETWORKS FROM WIKIPEDIA

• The third-party add-in tool for NodeXL has options for user-content (two-mode) network extraction and for the mapping of co-editing networks, but those functions are apparently not currently enabled
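
For readers curious what a co-editing network would look like, the sketch below projects a two-mode (editor, article) list onto a one-mode editor network, weighting each pair of editors by the number of articles they have both edited. The edit records and the name `co_editing_network` are made up for illustration; this does not reproduce any NodeXL feature.

```python
from collections import defaultdict
from itertools import combinations


def co_editing_network(edits):
    """Project (editor, article) pairs onto weighted editor-editor ties.

    Returns a dict mapping a sorted (editor_a, editor_b) pair to the number
    of articles that both editors have edited.
    """
    editors_by_article = defaultdict(set)
    for editor, article in edits:
        editors_by_article[article].add(editor)
    weights = defaultdict(int)
    for editors in editors_by_article.values():
        for pair in combinations(sorted(editors), 2):
            weights[pair] += 1
    return dict(weights)


# Invented two-mode data; repeat edits to the same article count once.
edits = [
    ("Ann", "Wiki"), ("Bob", "Wiki"), ("Ann", "Wiki"),
    ("Ann", "MediaWiki"), ("Bob", "MediaWiki"), ("Cy", "MediaWiki"),
]
co_edits = co_editing_network(edits)
for pair in sorted(co_edits):
    print(pair, co_edits[pair])
```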

29

DISCUSSIONS

• Questions?

• Ideas for research?

30