The Ethics of Large-Scale Web Data Analysis (Webmetrics)
Mike Thelwall, Statistical Cybermetrics Research Group, University of Wolverhampton, UK
Rob Ackland, Australian Demographic and Social Research Institute, Australian National University
Virtual Knowledge Studio (VKS) Information Studies
Contents
What is webmetrics?
Context: Online access to personal information
Researchers' use of personal information
Confidentiality and anonymity
Resource issues
What ethical considerations apply to collecting and analysing web data on a large scale from unaware web "publishers"?
1. What is webmetrics?
Large-scale analysis of web-based data
Collecting and quantitatively analysing online information
The objective is not to find information about individuals but to identify trends
Data gathered with VOSON, SocSciBot, Issue Crawler, LexiURL, …
Example
VOSON hyperlink network of political parties from 6 countries (Ackland and Gibson, 2006). Node size proportional to outdegree; 76 nodes.
Normalised linking, smallest countries removed
[Figure: geopolitically connected hyperlink network; country nodes include Sweden, Finland, Norway, UK, Germany, Austria, Switzerland, Poland, Italy, Belgium, Spain, France and the Netherlands]
Example: Links between EU universities
AltaVista link searches
Link associations between social network sites
Example: Blog searching
2. Context: Online access to personal information
Blogs, social network sites and personal web sites contain information that is:
Private and protected (invisible to researchers)
Intentionally public
Publicly private¹ (intended for friends but allowed to be public)
Unintentionally public (public but believed by owner to be private)
1. Lange (2007)
Accessing “public” information
Commercial search engines
Web crawlers
Internet Archive (includes deleted info)
Who is using Dataveillance?
Dataveillance¹: downloading or otherwise gathering data on internet users in order to influence their behaviour
Google – can use email, searching, blogging and social network activities to target advertising (and may report to the US government)
Amazon – can use past activities to target adverts or improve its web site
1. Zimmer (2008)
3. Researchers’ use of personal information
Key issue: for large-scale research, data from/about the unaware is used without their approval, and possibly for purposes that they might disagree with
Which ethical safeguards should be taken for this kind of research?
Issue 1: People vs. Documents
Traditionally, documents can be researched without approval, but people can't
Even harsh criticism is fair practice (e.g., book review/analysis)
Since web pages are documents, researching them without permission is normally OK
Issue 2: Invasion of privacy? Natural vs. normative
A situation is naturally private¹ if a reasonable person would expect privacy
A situation is normatively private¹ if a reasonable person would expect others to protect their privacy
Non-secure web pages/data are typically naturally private
Accessing them is not normally an invasion of privacy, even if undesired by page owners and with negative consequences
1. Moor (2004)
4. Confidentiality and anonymity
When should anonymity be granted to research "subjects" (page owners)?
When a possibly undesired label is attached (e.g., hate group, terrorist)
When undesired groups might benefit (e.g., a league table of hate groups)
When publicly private individuals are singled out (e.g., a detailed analysis of an "average" blogger)
Should data be anonymised – as for Census data used for research?
5. Resource issues
Accessing a web page uses the owner's server time/bandwidth
Crawling a web site can use a lot of the owner's server time/bandwidth
This may incur charges or a loss of service quality
Robots.txt protocol
This file lists pages/folders in a web site that may not be crawled
It does not restrict crawling speed
It should be obeyed in research
Most individual users are probably unaware of it and so don't use its protection
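A minimal sketch of honouring robots.txt before fetching, using Python's standard robotparser module; the crawler name and URLs are hypothetical placeholders:

```python
# Minimal sketch: consult a site's robots.txt before fetching a page.
# The user-agent name and URLs are hypothetical placeholders.
from urllib import robotparser

USER_AGENT = "ExampleResearchCrawler"                # assumed crawler name
page_url = "http://www.example.org/staff/page.html"  # assumed target page

rp = robotparser.RobotFileParser()
rp.set_url("http://www.example.org/robots.txt")
rp.read()  # download and parse the site's robots.txt

if rp.can_fetch(USER_AGENT, page_url):
    print("Allowed to crawl:", page_url)
else:
    print("Disallowed by robots.txt, skipping:", page_url)
```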
Crawling speed
Web crawlers should not run so fast that they cause service issues
Full speed is probably OK on a UK university web site but not on a Burkina Faso library web site
Use judgement to decide how quickly to crawl, i.e., the length of pauses between requests
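A minimal sketch of pausing between requests; the pause length and URLs are illustrative assumptions, and the appropriate delay depends on the server being crawled:

```python
# Minimal sketch: pause between requests so a crawl does not monopolise
# the owner's server. URLs and the delay are illustrative assumptions.
import time
import urllib.request

urls = [
    "http://www.example.org/page1.html",  # hypothetical pages
    "http://www.example.org/page2.html",
]
PAUSE_SECONDS = 5  # use longer pauses for small or poorly resourced servers

for url in urls:
    with urllib.request.urlopen(url) as response:
        html = response.read()
    # ... process html here ...
    time.sleep(PAUSE_SECONDS)  # be polite: wait before the next request
```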
How many pages to crawl?
Crawling too many pages puts unnecessary strain on the server being crawled
Use judgement to decide the minimum number of pages/crawl depth that is enough
Use search engine queries as a substitute, if possible
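A minimal sketch of capping the page count and crawl depth; the limits, the naive link extraction and the start URL are illustrative assumptions rather than a recommended crawler design:

```python
# Minimal sketch: stop crawling after a fixed number of pages and a fixed
# link depth. Limits and link extraction are simplified for illustration.
import re
import urllib.request
from collections import deque

MAX_PAGES = 50   # assumed upper bound on pages fetched
MAX_DEPTH = 2    # assumed upper bound on link depth

def crawl(start_url):
    seen = set()
    queue = deque([(start_url, 0)])
    pages = {}
    while queue and len(pages) < MAX_PAGES:
        url, depth = queue.popleft()
        if url in seen or depth > MAX_DEPTH:
            continue
        seen.add(url)
        with urllib.request.urlopen(url) as response:
            html = response.read().decode("utf-8", errors="ignore")
        pages[url] = html
        # naive extraction of absolute links, for illustration only
        for link in re.findall(r'href="(http[^"]+)"', html):
            queue.append((link, depth + 1))
    return pages

# e.g. pages = crawl("http://www.example.org/")  # hypothetical start page
```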
Automatic search engine searches
Research can piggyback off the crawling of commercial search engines
No resource implications for site owners
Uses search engine "Application Programming Interfaces" (APIs)
Search engines specify a maximum number of searches per day
Results are limited to the imperfect web crawling/coverage of search engine crawlers
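As a sketch only: the code below queries a hypothetical search API and stops when an assumed daily quota is reached. The endpoint, parameter name, query syntax and quota are placeholders, not any real search engine's API.

```python
# Minimal sketch: use a (hypothetical) search engine API instead of crawling
# sites directly, and track an assumed daily query quota.
import json
import urllib.parse
import urllib.request

API_ENDPOINT = "https://api.searchengine.example/search"  # hypothetical endpoint
DAILY_QUOTA = 1000  # assumed maximum searches per day
queries_used = 0

def api_search(query):
    """Run one query against the hypothetical API, respecting the quota."""
    global queries_used
    if queries_used >= DAILY_QUOTA:
        raise RuntimeError("Daily query quota reached; resume tomorrow")
    queries_used += 1
    url = API_ENDPOINT + "?" + urllib.parse.urlencode({"q": query})
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))

# e.g. results = api_search("link:example.org")  # syntax varies by engine
```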
Summary
Researchers need to be aware of potential issues when doing large-scale data analysis research
Judgement is called for in all issues
Research does not normally need participant permission
Be sensitive to the impact of findings and any need for anonymity
References
Lange, P. G. (2007). Publicly private and privately public: Social networking on YouTube. Journal of Computer-Mediated Communication, 13(1). Retrieved May 8, 2008 from http://jcmc.indiana.edu/vol13/issue1/lange.html
Zimmer, M. (2008). The gaze of the perfect search engine: Google as an infrastructure of dataveillance. In A. Spink & M. Zimmer (Eds.), Web search: Multidisciplinary perspectives (pp. 77-99). Berlin: Springer.
Moor, J. H. (2004). Towards a theory of privacy for the information age. In R. A. Spinello & H. T. Tavani (Eds.), Readings in CyberEthics (2nd ed., pp. 407-417). Sudbury, MA: Jones and Bartlett.