Data in the Wild: Some Reflections

Chee Siang Ang, University of Kent | [email protected]
Ania Bobrowicz, University of Kent | [email protected]
Diane J. Schiano, djs.ux.consulting | [email protected]
Bonnie Nardi, University of California, Irvine | [email protected]

In recent years, the proliferation of online services such as social networking, gaming, Internet fora, and chat rooms has provided academic and corporate researchers opportunities to acquire and analyze large volumes of data on human activity and social interaction online. For example, massive corpora from Facebook, Twitter, and other user-generated data sources are being harvested "in the wild" on the Internet. Research based on such "found data" is increasingly common, as software tools become sophisticated enough to allow researchers to forage for data at fairly low cost. Researchers can now "not merely do more of the same, but in some cases conduct qualitatively new forms of analysis" [1]. Novel computational methods are being developed to integrate multiple distinct and often heterogeneous datasets (e.g., mobile location data and Twitter feeds) in the hope that important new relationships will emerge that cannot be found using a single data source. The growing tendency to apply new machine learning and other data-mining techniques to search for emerging patterns in existing datasets—rather than generate new data for planned hypothesis testing or qualitative exploration—is transforming the way research is being conducted.

When we say "data in the wild," we point to the fact that the datasets are not constructed and designed with research questions in mind, as in conventional surveys, censuses, interviews, logs, observational studies, and experimental studies. Conventional datasets are generally planned according to conceptual or theoretical interests, with articulated research questions. Researchers decide which attributes, variables, and data types are of interest prior to data collection. Conversely, if all the researcher has is, for example, a Twitter feed, it is not possible to ask questions such as "What is the political affiliation of this poster?" Using data-mining approaches, researchers can ask questions of the data, but the questions must be scoped to exactly what is available. Adhering to material that emerges in foraged data rather than more expansively creating data apropos a research question leads to difficulties deploying theory. Theories propose elements and relations that are tested with hypotheses or checked against ordered observations. Data must speak to the semantics of the elements and relations. A found corpus may or may not have pertinent data.

Trivial or obvious results may be reported in research centered on data-mining analysis because often such results are all the investigator could squeeze from the data. Findings that more contextualized methodologies easily capture may require more extensive processing of a large dataset or may evade the researcher altogether. Even grounded theory, which begins with data, often requires further data collection once interesting theoretical problems emerge. With found data, a loss of philosophical and epistemological grounding occurs as research is conducted with data over which the investigator has little control. Simple things, such as knowing the age, gender, educational background, and so forth, of those in a sample are often impossible. This is not to say that data mining does not produce valuable results, but rather to temper uncritical enthusiasm with some observations on its limitations.
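To make concrete the point that questions must be scoped to what a found dataset contains, consider a minimal sketch in Python using the pandas library, a common choice for this kind of tabular work. The records, field names, and the second "location" dataset are hypothetical, invented purely for illustration; the sketch is not drawn from any of the studies discussed here.

import pandas as pd

# A "found" corpus exposes only the fields the platform happened to log.
# These records and column names are hypothetical, for illustration only.
found_tweets = pd.DataFrame([
    {"user": "u1", "text": "off to the polls", "timestamp": "2013-03-01T09:12"},
    {"user": "u2", "text": "new phone, who dis", "timestamp": "2013-03-01T10:47"},
])

# Questions must be scoped to what is available: text and timestamps exist,
# so activity counts per user can be computed...
print(found_tweets.groupby("user").size())

# ...but attributes such as age, education, or political affiliation were never
# collected, so they cannot be queried or "backfilled" after the fact.
print("education" in found_tweets.columns)  # False

# A second, equally hypothetical found dataset (e.g., coarse mobile location)
# can be joined in the hope that new relationships emerge, but the merge only
# enriches users who happen to appear in both sources.
location_pings = pd.DataFrame([
    {"user": "u1", "city": "Canterbury"},
])
enriched = found_tweets.merge(location_pings, on="user", how="left")
print(enriched)

With a conventionally designed study, the missing attributes would have been specified before data collection; with found data, the only options are to work within the columns that happen to exist or to join in yet another found dataset, as above.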

Data in the wild may be copious, but not necessarily adequate to address important topics of inquiry. boyd and Crawford [2] note that even huge datasets lack certain kinds of data; for example, private messages on game servers that are not logged in the chat function will not be in the corpus. An ethnographer or linguist knows that in online games, for example, private messages range from gossip to appeals for practical help to cybersex, and that they are central to the social experience of gaming. A sociologist studying Twitter use through the feed alone cannot ask people about their level of education, experience with other computational artifacts, and so on. Researchers must decide whether what is available is adequate to address significant research topics. In the wild, we do not have the ability to "backfill" data not already present in a dataset. With conventional methodologies it is possible to progress to another sequence of data collection if we recognize that we missed critical data, or if new topics of inquiry open upon a first round of analysis, as they so often do (see [3]). Bold statements heralding a new era of social science, in which we can sweep away those pesky, cumbersome methods requiring us to collect the data ourselves, must be seen in the context of the significant limitations of found data, including a lack of flexibility (see [2,3]). Easy acquisition of vast volumes of naturalistic data has a price in terms of increased analytical complexity and interpretive uncertainty [4]. Moreover, new ethical and legal challenges arise, some of which are discussed here.

An Ethical Challenge

One challenge is characterized by a shift in relationships between researchers and subjects of study. Subjects no longer "participate" in research or give consent to be studied. Instead, they are treated as the source of data in a large online archive, even though they are not aware that the data they produce (if indeed they even imagine they are "producing data") could be useful to some researchers. By contrast, if taking part in a survey or census, even at very large scale, the subject understands that someone will use the information they supply. Logistically, it would be impossible to ask for consent from every single person represented in data such as that collected from Facebook, Twitter, chatrooms, blogs, game logs, and so on; for a variety of obvious reasons, these sources preclude gaining participant consent. Many researchers dealing with such data work on the assumption that since the data is available in the "public domain," it can be used for research purposes without consent. But game worlds, virtual worlds, Second Life, and so on are not public in the way that walking down the street is public. For example, participants may purchase monthly subscriptions for a virtual world, which they thus conceive of as a privileged space for which they are paying good money.

The ethical challenges of harvesting digital data have yet to be worked out. Our goal here is to draw attention to the issues in a nuanced way rather than to definitively answer them. We note that those supplying data in social media and email or by visiting websites, game worlds, and so on are providing unrecompensed labor to corporate and academic researchers. We may decide this form of labor is acceptable, but it should be recognized for what it is. We should consider certain questions. What if we offered people the ability to opt out of research even if their identities were kept anonymous? Or a micropayment in exchange for participation? If there is a micropayment, how should the labor be valued? What role should human subjects review committees play? These committees generally demand informed consent, a concern that grew out of abuses in Nazi Germany and the Tuskegee experiments (and other research; see [3]). Should we be thinking of ways to recast and reconfigure ethical arguments?

The issues have been profoundly altered by digital technologies, and it may be imperative to find new ways to talk about them. For instance, Google's Street View allowed Internet users to see inside the front windows of some houses that had been photographed [5]. What if someone is unaware that his or her house is on view, or does not know the channels through which to appeal to have his or her privacy restored? In Europe, the creation of Google Street View may not be legal in all jurisdictions. But its images may end up in found data, and researchers may inadvertently use them against the wishes of those who appear in them. By the lights of any human subjects review committee, this is an ethical violation. Because of the vague provenance of found data, the controls that human subjects committees have labored to institute over the past 60-plus years are diminished. Some European countries have laws prohibiting filming a person in public for the purpose of public display without the person's consent. While most countries do not have such laws, human subjects review committees would not approve the use of images that violated local standards and laws. But such laws may disappear from view in found data.

A Legal Challenge

A second challenge is legal. Using found data could potentially put researchers at risk of legal liability. While the majority of uses of online communication services are legitimate, it is widely known that such platforms can be used for illegal activities, ranging from violation of intellectual property rights, copying and downloading copyrighted material, invasion of privacy through identity theft, and spamming, all the way to more sinister activities such as luring unsuspecting visitors (sometimes children) into illegal or dangerous online and offline activities. All these things can and do happen on the Internet every day, and illegal material could find its way into a researcher's database. It is not feasible to manually inspect a very large database for such material, or to trust an automated filtering system to capture illicit material (especially visual material). Therefore, the researcher who harvested the data may potentially be held liable for breaking the law if in possession of data that violates a law.

Some of the authors of this article started a process of sampling screenshots of live webcam streams from a publicly available social networking site in order to analyze issues of self-disclosure and privacy. As a result, a large amount of visual data was captured. However, storage became problematic, as there was no way of knowing what the dataset actually contained. It may have included, for example, images that could be deemed either illegal or disturbing (e.g., child pornography or images possibly related to terrorism). This type of data collection raises not only ethical considerations but also the legal consequences of unwittingly capturing and storing potentially controversial material. Serious concerns as to how to proceed with further collection of data emerged in the authors' research when a preliminary inspection of initial sets of screenshots revealed that an adult user account was used by minors. Due to the large amount of data, manual inspection was not feasible. Automated processes are not reliable enough to detect potentially illegal material.

University management was approached, as well as external and internal legal personnel, in order to understand the researchers' legal position. However, discussions did not yield actionable guidelines beyond the advice not to pursue the research further. Another consideration was the effect possible negative media publicity might have on researchers' reputations and the university's image. Even if researchers do not actually break the law, there may be a temptation for the media or conservative political elements to sensationalize the story (see [6]). In this scenario, universities would be inclined to err on the side of caution and would be hesitant to support researchers who undertake risky projects. In view of the above, it was decided that the collected data should be destroyed. A promising project with potential social impact was abandoned.

Sidebar: Article 8 of the European Convention on Human Rights

Article 8 provides a right to respect for one's "private and family life, his home and his correspondence," subject to certain restrictions that are "in accordance with law" and "necessary in a democratic society." The article clearly provides a right to be free of unlawful searches, but the Court has given the protection for "private and family life" that this article provides a broad interpretation, holding, for instance, that the prohibition of private, consensual homosexual acts violates this article. This may be compared with the jurisprudence of the U.S. Supreme Court, which has also adopted a somewhat broad interpretation of the right to privacy. Furthermore, Article 8 sometimes comprises positive obligations: Whereas classical human rights are formulated as prohibiting a State from interfering with rights, and thus not to do something (e.g., not to separate a family under family life protection), the effective enjoyment of such rights may also include an obligation for the State to become active, and to do something (e.g., to enforce access for a divorced father to his child).

Conclusion

Data in the wild provides researchers with unprecedented access to large naturalistic datasets, resources that were not previously available. However, significant methodological, ethical, and legal concerns arise. The authors' own experience points to potential legal and ethical pitfalls in engaging with such data. Current ethical and methodological frameworks do not adequately address the gaps brought about by the scale and nature of this data. Because we are unsure of the ethical and legal ramifications of working with large datasets, there may be a "chilling effect" on research as we act conservatively to avoid pitfalls. Laws may be untested in court and difficult for the layperson to understand (see [7] and sidebar).

At present, we do not have a workable framework of guidelines for conducting large-scale research with data in the wild that would comprehensively address issues such as protecting individuals whose data is being captured online and informing researchers of risks. Internet laws are complicated, not only because they vary from country to country but also due to rapid changes in sharing information, which creates a regulatory gap [8]. For example, in the U.K. the legal framework for this area is complex and constantly evolving. Depending on the nature of the data being used, it potentially includes the Computer Misuse Act 1990, the Data Protection Act 1998, the regulatory framework of CEOP (the Child Exploitation and Online Protection Centre), and European legislation concerning human rights, data protection, and privacy.

We are calling for multidisciplinary research involving law, computer science, social science, and the humanities to address the concerns we have discussed. Some topics for future discussion include how to work out realistic guidelines for conducting research with data in the wild, and how to undertake educating and involving human subjects review committees, legislators, the public, students, and researchers themselves.

Endnotes:

1. Hannay, T. What can the Web do for science? Computer 43, 11 (2010), 84-87.

2. boyd, d. and Crawford, K. Six provocations for Big Data. A Decade in Internet Time: Symposium on the Dynamics of the Internet and Society. Oxford Internet Institute, 2011; http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1926431

3. Boellstorff, T., Nardi, B., Pearce, C., and Taylor, T.L. Ethnography and Virtual Worlds: A Handbook of Method. Princeton University Press, Princeton, NJ, 2012.

4. Big Data white paper: Challenges and Opportunities with Big Data. A community white paper developed by leading researchers across the United States. 2012; http://cra.org/ccc/docs/init/bigdatawhitepaper.pdf

5. Mills, E. Google’s street-level maps raising privacy concerns; http://www.usatoday.com/tech/news/internetprivacy/2007-06-01-google-maps-privacy_N.htm

6. Bainbridge, W. eGods: Fantasy Versus Faith. Oxford University Press, Oxford, U.K., 2013.

7. Legal documents: Child Exploitation and Online Protection Centre regulatory framework (http://ceop.police.uk/); Children Act 2004 (http://www.legislation.gov.uk/ukpga/2004/31/contents); Computer Misuse Act 1990 (http://www.legislation.gov.uk/ukpga/1990/18/contents); Data Protection Act 1998 (http://www.legislation.gov.uk/ukpga/1998/29/contents); The European Convention on Human Rights (http://www.hri.org/docs/Echr50.html)

8. Internet Society. Understanding your online identity: Protecting your privacy. 2012; http://www.internetsociety.org/understanding-your-online-identity-protecting-your-privacy

About the Authors

Chee Siang Ang is a lecturer in the School of Engineering and Digital Arts, University of Kent. His main research interest lies in social computing, specifically virtual worlds, computer games, and social networking. He is also very keen to investigate the applications of these technologies in various domains such as healthcare.

Ania Bobrowicz is a senior lecturer in digital arts at the University of Kent at Canterbury, U.K. Her research interests include art history, computer-mediated communication, and emerging societal issues brought about by digital technologies. She is a fellow of the Royal Society of Arts and holds an M.Sc. in multimedia systems (London Guildhall University) and an M.A. in applied linguistics (University of Warsaw).

Diane Schiano is a user experience researcher specializing in social, psychological, and design implications of emerging patterns of mediated cognition, communication, and connection. She has a Ph.D. in experimental psychology (Princeton) and an M.A. in counseling (ITP), and has worked at Stanford, NASA/Ames, Interval Research, AT&T Labs, Xerox PARC, and as an independent consultant.

Bonnie Nardi is a professor at UC Irvine and the author of Ethnography and Virtual Worlds: A Handbook of Method (with T. Boellstorff, C. Pearce, and T.L. Taylor, Princeton Univ. Press, 2012) and My Life as a Night Elf Priest: An Anthropological Account of World of Warcraft (Univ. of Michigan Press, 2010).

DOI: 10.1145/2427076.2427085 © 2013 ACM 1072-5520/13/03 $15.00

