The Ethics of Large-Scale Web Data Analysis (Webmetrics)
Mike Thelwall, Statistical Cybermetrics Research Group, University of Wolverhampton, UK
Rob Ackland, Australian Demographic and Social Research Institute, Australian National University
Virtual Knowledge Studio (VKS) Information Studies
Contents
What is webmetrics?
Context: Online access to personal information
Researchers' use of personal information
Confidentiality and anonymity
Resource issues
What ethical considerations apply to collecting and analysing web data on a large scale from unaware web "publishers"?
1. What is webmetrics?
Large-scale analysis of web-based data
Collecting and quantitatively analysing online information
The objective is not to find information about individuals but to identify trends
Data gathered with VOSON, SocSciBot, Issue Crawler, LexiURL, …
Example
VOSON hyperlink network of political parties from 6 countries (Ackland and Gibson, 2006). Node size proportional to outdegree; 76 nodes.
Normalised linking, smallest countries removed
[Figure: geopolitically connected hyperlink network; country nodes include Sweden, Finland, Norway, UK, Germany, Austria, Switzerland, Poland, Italy, Belgium, Spain, France and the Netherlands]
Example: Links between EU universities
AltaVista link searches
Link associations between social network sites
Example: Blog searching
2. Context: Online access to personal information
Blogs, social network sites and personal web sites contain information that is:
Private and protected (invisible to researchers)
Intentionally public
Publicly private¹ (intended for friends but allowed to be public)
Unintentionally public (public but believed by owner to be private)
1. Lange (2007)
Accessing “public” information
Commercial search engines
Web crawlers
Internet Archive (includes deleted info)
Who is using Dataveillance?
Dataveillance¹: downloading or otherwise gathering data on internet users in order to influence their behaviour
Google – can use email, searching, blogging and social network activities to target advertising (and may report to the US government)
Amazon – can use past activities to target adverts or improve its web site
1. Zimmer (2008)
3. Researchers’ use of personal information
Key issue: for large-scale research, data from/about the unaware is used without their approval, and possibly for purposes that they might disagree with
Which ethical safeguards should be taken for this kind of research?
Issue 1: People vs. Documents
Traditionally, documents can be researched without approval, but people can't
Even harsh criticism is fair practice (e.g., book review/analysis)
Since web pages are documents, researching them without permission is normally OK
Issue 2: Invasion of privacy? Natural vs. normative
A situation is naturally private¹ if a reasonable person would expect privacy
A situation is normatively private¹ if a reasonable person would expect others to protect their privacy
Non-secure web pages/data are typically naturally private
Accessing them is not normally an invasion of privacy, even if undesired by page owners and with negative consequences
1. Moor (2004)
4. Confidentiality and anonymity
When should anonymity be granted to research "subjects" (page owners)?
When a possibly undesired label is attached (e.g., hate group, terrorist)
When undesired groups might benefit (e.g., a league table of hate groups)
When publicly private individuals are singled out (e.g., a detailed analysis of an "average" blogger)
Should data be anonymised – as for Census data used for research?
5. Resource issues
Accessing a web page uses the owner's server time/bandwidth
Crawling a web site can use a lot of the owner's server time/bandwidth
This may incur charges or a loss of service quality
Robots.txt protocol
This file lists pages/folders in a web site that may not be crawled
It does not restrict crawling speed
It should be obeyed in research
Most individual users are probably unaware of it and so don't use its protection
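A minimal sketch of honouring robots.txt before fetching, using Python's standard robotparser module; the crawler name and URLs are hypothetical placeholders:

```python
# Minimal sketch: consult a site's robots.txt before fetching a page.
# The user-agent name and URLs are hypothetical placeholders.
from urllib import robotparser

USER_AGENT = "ExampleResearchCrawler"                # assumed crawler name
page_url = "http://www.example.org/staff/page.html"  # assumed target page

rp = robotparser.RobotFileParser()
rp.set_url("http://www.example.org/robots.txt")
rp.read()  # download and parse the site's robots.txt

if rp.can_fetch(USER_AGENT, page_url):
    print("Allowed to crawl:", page_url)
else:
    print("Disallowed by robots.txt, skipping:", page_url)
```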
Crawling speed
Web crawlers should not run so fast that they cause service issues
Full speed is probably OK on a UK university web site but not on a Burkina Faso library web site
Use judgement to decide how quickly to crawl, i.e., the length of pauses between requests
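A minimal sketch of pausing between requests; the pause length and URLs are illustrative assumptions, and the appropriate delay depends on the server being crawled:

```python
# Minimal sketch: pause between requests so a crawl does not monopolise
# the owner's server. URLs and the delay are illustrative assumptions.
import time
import urllib.request

urls = [
    "http://www.example.org/page1.html",  # hypothetical pages
    "http://www.example.org/page2.html",
]
PAUSE_SECONDS = 5  # use longer pauses for small or poorly resourced servers

for url in urls:
    with urllib.request.urlopen(url) as response:
        html = response.read()
    # ... process html here ...
    time.sleep(PAUSE_SECONDS)  # be polite: wait before the next request
```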
How many pages to crawl?
Crawling too many pages puts unnecessary strain on the server being crawled
Use judgement to decide the minimum number of pages/crawl depth that is enough
Use search engine queries as a substitute, if possible
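A minimal sketch of capping the page count and crawl depth; the limits, the naive link extraction and the start URL are illustrative assumptions rather than a recommended crawler design:

```python
# Minimal sketch: stop crawling after a fixed number of pages and a fixed
# link depth. Limits and link extraction are simplified for illustration.
import re
import urllib.request
from collections import deque

MAX_PAGES = 50   # assumed upper bound on pages fetched
MAX_DEPTH = 2    # assumed upper bound on link depth

def crawl(start_url):
    seen = set()
    queue = deque([(start_url, 0)])
    pages = {}
    while queue and len(pages) < MAX_PAGES:
        url, depth = queue.popleft()
        if url in seen or depth > MAX_DEPTH:
            continue
        seen.add(url)
        with urllib.request.urlopen(url) as response:
            html = response.read().decode("utf-8", errors="ignore")
        pages[url] = html
        # naive extraction of absolute links, for illustration only
        for link in re.findall(r'href="(http[^"]+)"', html):
            queue.append((link, depth + 1))
    return pages

# e.g. pages = crawl("http://www.example.org/")  # hypothetical start page
```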
Automatic search engine searches
Research can piggyback off the crawling of commercial search engines
No resource implications for site owners
Uses search engine "Application Programming Interfaces" (APIs)
Search engines specify a maximum number of searches per day
Results are limited to the imperfect web crawling/coverage of search engine crawlers
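As a sketch only: the code below queries a hypothetical search API and stops when an assumed daily quota is reached. The endpoint, parameter name, query syntax and quota are placeholders, not any real search engine's API.

```python
# Minimal sketch: use a (hypothetical) search engine API instead of crawling
# sites directly, and track an assumed daily query quota.
import json
import urllib.parse
import urllib.request

API_ENDPOINT = "https://api.searchengine.example/search"  # hypothetical endpoint
DAILY_QUOTA = 1000  # assumed maximum searches per day
queries_used = 0

def api_search(query):
    """Run one query against the hypothetical API, respecting the quota."""
    global queries_used
    if queries_used >= DAILY_QUOTA:
        raise RuntimeError("Daily query quota reached; resume tomorrow")
    queries_used += 1
    url = API_ENDPOINT + "?" + urllib.parse.urlencode({"q": query})
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))

# e.g. results = api_search("link:example.org")  # syntax varies by engine
```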
Summary
Researchers need to be aware of potential issues when doing large-scale data analysis research
Judgement is called for in all issues
Research does not normally need participant permission
Be sensitive to the impact of findings and any need for anonymity
References
Lange, P. G. (2007). Publicly private and privately public: Social networking on YouTube. Journal of Computer-Mediated Communication, 13(1). Retrieved May 8, 2008 from http://jcmc.indiana.edu/vol13/issue1/lange.html
Zimmer, M. (2008). The gaze of the perfect search engine: Google as an infrastructure of dataveillance. In A. Spink & M. Zimmer (Eds.), Web search: Multidisciplinary perspectives (pp. 77-99). Berlin: Springer.
Moor, J. H. (2004). Towards a theory of privacy for the information age. In R. A. Spinello & H. T. Tavani (Eds.), Readings in CyberEthics (2nd ed., pp. 407-417). Sudbury, MA: Jones and Bartlett.