Published in The American Sociologist 43(4): 448-468

Missing Data in Sociological Research: An Overview of Recent Trends and an Illustration for Controversial Questions, Active Nonrespondents and Targeted Samples

Jeremy R. Porter & Elaine Howard Ecklund

Published online: 23 June 2012
© Springer Science+Business Media, LLC 2012

Abstract In an age of telemarketers, spam emails, and pop-up advertisements, sociologists are finding it increasingly difficult to achieve high response rates for their surveys. Compounding these issues, the current political and social climate has decreased many survey respondents’ likelihood of responding to controversial questions, which are often at the heart of much research in the discipline. Here we discuss such implications for survey research in sociology using: a content analysis of the prevalence of missing data and survey research methods in the most cited articles in top sociology journals, a case study highlighting the extraction of meaningful information through an example of potential mechanisms driving the non-random missing data patterns in the Religion Among Academic Scientists dataset, and qualitative responses from non-responders in this same case. Implications are likely to increase in importance given the ubiquitous nature of survey research, missing data, and privacy concerns in sociological research.

Keywords Missing data . Non-response . Survey . Sociology . Religion . Family

Am Soc (2012) 43:448–468
DOI 10.1007/s12108-012-9161-6

This research was supported by a grant from the John Templeton Foundation (grant #11299; Elaine Howard Ecklund, PI). The authors also wish to thank Kelsey Pedersen for invaluable help with manuscript editing and formatting.

J. R. Porter (*)
City University of New York, Brooklyn College and Graduate Center, 218 Whitehead Hall, Brooklyn College, 2900 Bedford Ave, Brooklyn, NY 11220, USA
e-mail: [email protected]

E. H. Ecklund
Department of Sociology, Rice University, Houston, TX, USA

Introduction

Since early references to the survey of population samples as “a new research technique” by Gerhard Lenski in 1961, the landscape of survey research has changed dramatically (Lenski 1961). In many ways the widespread development and implementation of the sample survey method has been instrumental to the accumulation of knowledge across the social sciences and specifically sociology (Wright and Marsden 2010). In more recent years, however, a number of obstacles have developed in the implementation of the survey as a reliable method for the collection of social data. Many of these obstacles have been overcome by the introduction of more complex sampling and analytic methods, but many remain a concern to those involved in the collection of data via the sample survey. Here we highlight a few of these issues and their direct relevance to the most influential research conducted in the field of sociology in the recent past.

The inundation of Americans with requests for information and opinions has reached an all-time high. As an increasingly common response, many potential survey participants are actively withholding information or refusing to respond at all. Compounding this issue is the fact that an inquiry for information about opinions on politically or socially sensitive topics is currently met with extremely high levels of suspicion. It is extraordinarily difficult, then, for legitimate researchers to achieve high response rates for their surveys. Random-sample surveys of the general population now routinely report response rates of only 30 % (PEW Research Center for the People and Press 2004). And even when a high response is achieved, subjects often do not answer all of the questions on the survey, leading to the intractable problem of missing data. In relation to the latter, the current political and social climate, coupled with more advanced methods for respondent identification and decreased privacy, make many uneasy about divulging personal information and, in particular, their feelings about controversial issues.

Rose and Fraser (2008) point out the ubiquitous nature of missing data on surveys in this current age of concerns for privacy as well as political or social sensitivity, in particular the impact of these on survey response. Their research brings to light the importance of the issue at hand by stating that “missing data are nearly always a problem in research, and missing values represent a serious threat to the validity of inferences drawn from findings” (p. 71). Unfortunately, in order to accurately estimate relationships among indicators with high levels of missing data, very good prior knowledge about the patterns of what Allison (2002) calls “missingness” is needed. Furthermore, estimation methods, driven by variations in data replacement and imputation procedures, are at risk of producing statistics that are highly sensitive to the handling of the missing data.

Within the last decade, attention given to the issue of missing data in the research process has included much methodological and theoretical discussion about the implications of low response rates and item non-response, and how to overcome them (Abraham et al. 2006; Alosh 2009; DeSouza et al. 2009; Garcia et al. 2010; Gelman et al. 2005; Groves 2006; Groves et al. 2006; Haung et al. 2005; Khoshgoftaar et al. 2007; Martinussen et al. 2008; Montiel-Overall 2006; Olson 2006; Paik 2004; Porter et al. 2009; Rose and Fraser 2008; Southern et al. 2008; Satten and Carroll 2000; Verbeke and Molenberghs 2000, 2010). As response rates continue to drop, however, and missing data leave greater lacunae, researchers have been forced to handle less-than-ideal data situations. Common fixes for large amounts of missing data include the well-known statistical procedures of data weighting, imputation, and other corrective schemes.

The results of such imputation methods may be detrimental to the ability of researchers to confidently present the results of their studies (Allison 2002; Dawid 1984; Gelman et al. 1996; Rubin 1984). For instance, Alosh (2009) highlights the implications of variations in results based on differential imputation schemes. And Garcia and colleagues (2010) point to potential selection biases associated with different imputation schemes, while Haung and colleagues (2005) tackle the issue of relying on proxy information for Bayesian methods of data replacement. In each of the above examples, the researchers point out considerable variations across imputation procedures and highlight an important point when dealing with missing data; namely, the types of methods we use to account for missing data have a direct impact on findings. This is especially important when policy implications are nested within research and its findings.
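This sensitivity can be made concrete with a toy sketch (ours, not the authors’; the variable and its values are hypothetical, not from any of the cited studies): two common fixes for the same missing data, listwise deletion and mean imputation, agree on the mean but disagree on the spread, so any statistic built on the variance will shift with the choice of scheme.

```python
import statistics

# Hypothetical income variable (in thousands) with item non-response
# marked as None; the values are purely illustrative.
income = [52, None, 61, 48, None, 75, 55, None, 63, 50]

obs = [v for v in income if v is not None]

# Scheme 1: listwise deletion -- analyze the observed cases only.
del_mean, del_sd = statistics.mean(obs), statistics.stdev(obs)

# Scheme 2: mean imputation -- replace each missing value with the
# observed mean. The mean is unchanged, but the spread shrinks because
# the imputed points all sit exactly at the center.
imputed = [v if v is not None else del_mean for v in income]
imp_mean, imp_sd = statistics.mean(imputed), statistics.stdev(imputed)

print(round(del_mean, 2), round(del_sd, 2))
print(round(imp_mean, 2), round(imp_sd, 2))  # same mean, smaller SD
```

If the data are MNAR, neither estimate need be close to the truth; the point is only that the reported statistics move with the handling of the missing cases.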

Yet, we continue to lack a clear understanding of the kinds of questions to which specific populations of people are more or less likely to respond (Babbie 2007; Maxim 1999; Allison 2002; Griliches 1986). In light of potential result-quality issues and decreasing rates of survey response, a systematic interpretation of respondents’ refusal to answer an entire survey or select questions is vitally needed. The present case contributes to our understanding of this phenomenon by documenting the relationship between differential patterns of item non-response, controversial survey items, and insightful information that may be obtained from specific patterns of data missingness. Furthermore, as part of her study, Ecklund was in direct communication with non-respondents, garnering their reasons for not responding to a survey of scientists’ attitudes towards religion (discussed in greater detail below).

Here—in the first data collection of its kind—we use the Religion among Academic Scientists (RAAS) survey to examine the types of survey questions the targeted group did not answer and some of the reasons they did not answer them. The controversial questions in this study relate to religiosity among scientists, a group of individuals often caught in the middle of the larger societal debate about whether religion and science are in conflict (Evans and Evans 2008). Our findings are generated from analyses of missing data from what we call “active nonresponders,” both those who filled out part of the survey but abstained from part and those who did not respond to the survey but wrote to tell us why.

In both instances, we have some advantages in our interpretation of the data (or lack of data, in this case). First, controversial issues were the primary interest of the RAAS, so non-response had already been anticipated and plans made for its interpretation. Second, potential RAAS respondents directly communicated with this study’s PI, who is a fellow member of the targeted group. This familiarity allows for an interesting case in which direct discourse about the respondents’ reasons for non-response can be linked to their refusal to participate in the survey or parts of it. How often does a solicitor get to knock on the closed door again to ask, “Why not?”

According to our findings, target group matters. Our respondents—scientists from top U.S. universities—did not have the same missing data patterns (with regard to specific questions) as researchers have projected from surveys of the general population. Question topic also matters. Scientists were more likely to present missing data on the more controversial questions related to religion, especially when asked to compare their religiosity to that of the general population. We link their missing data patterns to family formation, religious socialization, and present religiosity. The specific findings may be further explained by the historical context in which the survey was completed. More broadly, our findings show that traditional statistics do not always help us understand the reasons behind missing data and low survey-response rates. Lastly, select populations may display unique missing data patterns that need to be understood as survey researchers attempt to develop more rigorous methodology.

Item Non-Response: Patterns of Implications

When attempting to understand potential patterns associated with missing data through the process of conducting analysis with collected survey data, the most basic form of the data can be dichotomized into two categories, unobserved (missing) and observed (non-missing) (Gelman et al. 2005). In Bayesian notation, the complete dataset is then made up of two components as expressed in the following equation:

ycom = (yobs, ymis)

Here the complete dataset (ycom) is composed of the observed cases (yobs) and missing cases (ymis) for any given variable. This very simplistic notation of the makeup of a complete dataset is somewhat logical and allows for the visualization of these two components in relation to one another. As the observed data (yobs) increases as a proportion of the dataset, and given all else equal in terms of analytic techniques, the findings are assumed to be more reliable relative to instances where the proportion of missing data (ymis) associated with any one variable is high. Furthermore, one should expect this zero-sum relationship to be dynamic: the higher the proportion missing (and thus the lower the proportion observed), the less reliable the coefficient estimates associated with that set of responses.
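As an illustration (our sketch, not part of the article), the decomposition can be written as a few lines of Python that split a single variable into its observed and missing components and report the proportion observed; using None to mark an item non-response is our convention here.

```python
# Split one survey variable into its observed and missing parts,
# with None marking an item non-response (an illustrative convention).
def decompose(y_com):
    y_obs = [v for v in y_com if v is not None]
    y_mis = [v for v in y_com if v is None]
    return y_obs, y_mis

def proportion_observed(y_com):
    y_obs, _ = decompose(y_com)
    return len(y_obs) / len(y_com)

# A hypothetical 10-respondent variable with 3 item non-responses.
answers = [5, None, 7, 4, None, 6, 5, None, 3, 4]
y_obs, y_mis = decompose(answers)
print(len(y_obs), len(y_mis), proportion_observed(answers))  # 7 3 0.7
```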

With regard to the magnitude of missing data, introduced above, the type of missing data is also very important to understand (Rubin 1976; Little and Rubin 1987; Little 1995; Allison 2002; Alosh 2009; among others). Most important here is the level of randomness associated with the missing data. If missingness is independent of the observed and unobserved data, then the data are denoted as being missing completely at random (MCAR). If missingness is dependent on the observed and unobserved data, however, it is considered missing not at random (MNAR). Similarly, if missingness is independent of the unobserved data, conditional on the observed data, it is considered missing at random (MAR). Missingness is ignorable if the patterns are MCAR or MAR, meaning the item non-responses are not identifiably associated with other characteristics of the survey sample. If item non-response is not identifiably independent of other sample characteristics (i.e. MNAR), it is non-ignorable (Verbeke and Molenberghs 2000, 2010; Gelman et al. 2005; Alosh 2009). Data identified as MNAR means that there is some underlying pattern to the data missingness that is associated with a trend in sociodemographics, attitudes, or other categorizing indicator of the sample.
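The MCAR/MNAR distinction can be illustrated with a small simulation (ours, with made-up numbers): under MCAR every item is withheld with the same flat probability, while under MNAR the probability of withholding depends on the unobserved value itself, so the observed mean drifts away from the true mean.

```python
import random

random.seed(1)

n = 10_000
# Hypothetical 1-7 religiosity scores for the full sample.
y = [random.randint(1, 7) for _ in range(n)]

# MCAR: a flat 20 % chance of non-response, independent of y.
mcar = [None if random.random() < 0.20 else v for v in y]

# MNAR: the chance of non-response grows with the (unobserved) score,
# e.g. more religious respondents are more likely to skip the item.
mnar = [None if random.random() < 0.05 * v else v for v in y]

def observed_mean(values):
    obs = [v for v in values if v is not None]
    return sum(obs) / len(obs)

true_mean = sum(y) / len(y)
print(round(true_mean, 2))            # true mean of the full sample
print(round(observed_mean(mcar), 2))  # close to the true mean
print(round(observed_mean(mnar), 2))  # biased downward
```

Ignoring the MCAR pattern costs only precision; ignoring the MNAR pattern biases the estimate, which is why such missingness is called non-ignorable.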

As a direct result, survey data with extremely low item-response rates and non-random patterns of missing data are less reliable as indicators of the research area in question and may ultimately fail to provide any useful guide for the implementation of social policy. While recent reports from The Pew Research Center indicate the potential for representativeness in spite of high levels of survey non-response (PEW Research Center for the People and Press 2004), certain types of missing data patterns unquestionably introduce bias to sample statistics (Allison 2002). Thus, the recent trends in survey research that include increasing rates of unreliable and biased social science data from survey non-response are particularly troubling in an era when reliable data are vitally needed to continue addressing the most pressing social problems of our day.

In some of the more popular texts on social research methods, scholars mention a number of ways to deal with the existence and magnitude of missing data (Babbie 2007; Maxim 1999; Neuman 2003). These texts uniformly agree that in-depth analyses of missing data could yield potential insights into and interpretation of their meaning. Furthermore, the past decade has seen further advancements in the handling and understanding of missing data across a wide diversity of disciplines. This interdisciplinary focus on the issue at hand brings to light the importance surrounding recent trends in data-collection quality and research reliability, regardless of discipline. For instance, this review of recent literature on the subject has found contributions from clinical trials (Alosh 2009; DeSouza et al. 2009; Garcia et al. 2010; Haung et al. 2005; Southern et al. 2008), social work (Rose and Fraser 2008), library sciences (Montiel-Overall 2006), statistics and data mining (Khoshgoftaar et al. 2007; Verbeke and Molenberghs 2010), public opinion (Abraham et al. 2006; Groves 2006; Groves et al. 2006; Olson 2006), sociology (Allison 2002), demography (Porter et al. 2009), and public health (Paik 2004). Of course this list is not exhaustive, but it does give an indication of the very recent and widespread interest in the subject of missing data across many seemingly unrelated academic disciplines. Regardless of disciplinary context, in all cases the primary interest of the research has been to better understand and deal with patterns of non-ignorable missing data. Yet, very little research exists among sociologists concerning potential issues with missing data and issues that may arise as a consequence. For the most complete analysis of missing data issues within the field of sociology, see Allison (2002). A discussion of some of the more basic issues associated with missing data follows.

Nonignorable Data

More often than not, data are not missing at random; more systematic reasons for non-response underlie their patterns (Allison 2002; Griliches 1986). Allison (2002) points out that if the data are not missing at random, we say that the missing data mechanism is nonignorable. He goes on to say that “unfortunately, for effective estimation with nonignorable missing data, very good prior knowledge about the nature of the missing data process usually is needed, because the data contain no information about what models would be appropriate, and the results typically will be very sensitive to the choice of model” (p. 5). As mentioned above, we specifically find ourselves at an advantage as we have confidential and direct information concerning both the survey respondents as well as the topic of interest (religion and science).

While the religion and science debate is much more controversial among certain groups in our society, there is some information that is consistently likely to yield high levels of missing data. Researchers generally find that unanswered questions, resulting in nonignorable missing data patterns, include questions about recent events (Babbie 2007), intrusive items (such as income, sexual behaviors, and criminal acts) (Maxim 1999), lost or restricted data (Neuman 2003), and other questions that respondents simply don’t know how to answer. Although researchers have some information about what kinds of questions respondents are less likely to answer, we do not have more fine-tuned data on whether these unanswered-question patterns are similar across different types of groups. For example, would a survey of an elite population generate the same unanswered-question patterns as a survey of the general population?

The Reality of Ignored Data

We focus our attention in this section on missing data and social science research, more specifically highly visible research in the discipline of sociology. Our focus is grounded in recent literature highlighting the ubiquity (Rose and Fraser 2008) and high frequency (Montiel-Overall 2006) with which missing data exhibits itself in social science research. Given how often researchers encounter issues involving item non-response, considerable focus has been given to post-hoc corrections for the estimation of social relationships involving data that are MNAR (non-ignorable). Rarely, however, do researchers who are on the ground discuss or develop methods for collecting information on the reasons that individuals decide not to participate in surveys or not to answer certain questions. Yet, we know that both survey and item response rates are at or approaching all-time lows, and the ability to understand the implications is extremely important.

As an example of the ubiquitous nature of survey research (and, by relation, the associated issues of missing data inherent in survey research) in sociology, we searched the most-cited articles in the American Sociological Review (ASR), the American Journal of Sociology (AJS), and Social Forces—often regarded as the most prestigious general sociology journals in the field according to empirical observation and/or reputational standing—and found (using A.W. Harzing’s “Publish or Perish”1 publication impact software) that of the 105 most-cited articles during the past 10 years, 79 % (n = 83) used primary or secondary data collected from survey/questionnaire techniques.2 The results of this analysis are presented in Fig. 1. The surveys that provided the data for examination in these articles included the well-known large-scale Panel Study of Income Dynamics (PSID); the General Social Survey (GSS); the National Longitudinal Survey of Youth (NLSY); the National Health and Nutritional Examination Survey (NHANES); the Current Population Survey (CPS); the National Longitudinal Study of Adolescent Health (AddHealth); the Health and Retirement Survey (HRS); the World Values Survey (WVS); the Union of International Associations, National Organizations Survey (NOS); the Survey of Crime, Community, and Health (CCH); and the Project on Human Development in Chicago Neighborhoods (PHDCN), among other less well-known surveys.

1 Harzing, A.W. 2011. Publish or Perish, version #3, available at www.harzing.com/pop.htm
2 In order to ensure equal coverage across the three journals, the top 35 articles since 2000 were identified and downloaded for further examination of data collection techniques and handling of any noted missing data. We further standardized the selection process by total cites per year so as not to give more weight to articles published earlier in the decade based on total cites. We do understand that there is still a lag in regards to publication and citation timing, but believe that we have collected a representative sample of the most visible sociology publications over the past decade.

As the figure is organized, the results in the left-hand side of the chart apply to all 105 of the most-cited articles across the three journals (35 each from ASR, AJS, and Social Forces). Here one can see the strong reliance of these articles on survey/questionnaire methods for the collection of the primary/secondary data sources that were used in their analyses. In the right-hand panel, the use of survey data by the most-cited articles over the past decade is broken down across each of the three journals. Here noticeable differences by journal type are uncovered, and findings show that articles in Social Forces were the most likely to be based on survey data (92 %), followed by ASR (86 %), and AJS (60 %).

The variations across journals are not necessarily important to our analysis here, except to highlight the differences that exist in the types of research projects published in specific types of journals and the resulting implications of item non-response in surveys. We also understand, however, that there are documented issues associated with all types of research methodologies, and our intention is not to highlight potential problems with published articles in these top sociology journals, but instead to shine light on the importance of understanding issues associated with item non-response, with special attention to controversial questions. Again, the fact that missing data exists is nothing new to even the most novice of researchers; however, the ability to make informed decisions given nonignorable patterns of missing data continues to escape even some of the most advanced.

Of the 79 % of articles in our sample that made use of survey data, item non-responses to controversial questions such as household income were, almost without exception, supplemented with imputed values in order to avoid a substantial loss of the sample. Of the surveys listed above, the PSID codebook reports issues with missing variables that include questions about the family, income, marital/fertility issues, and open-ended occupation/industry items. And the income question for the 2006 wave of the GSS reported over 15 % missing data, even though it was measured via the less-intrusive categorical question, thought to overcome well-documented problems with respondents reporting their exact income.

Fig. 1 Percentage reporting the usage of survey or questionnaire methods in the collection of data used in the identified articles (ASR, AJS, Social Forces), N = 105

As further documentation of the methods used in dealing with issues of missing data, Table 1 presents a tabular breakdown. From the table, one can see that 21 % of the studies in this sample did not use a survey- or questionnaire-based data collection technique. For those that did, nearly 20 % of all articles in the sample explicitly acknowledged an impact that missing data had on the design of their analysis (6.7 % imputed values and 1 % excluded cases) or on the presentation of their results (7.6 % acknowledged issues associated with the estimation of their results from missing data and 1.9 % tested for comparable relationships across other intact variables). Ultimately, nearly 61 % did not acknowledge any action taken in their analysis to account for missing data, but many did provide links to primary source materials associated with the secondary sources, which often indicated a correction, weighting, or imputation for missing data. Others did not acknowledge missing data or simply mentioned that they screened for missing data. These last groups are interesting; perhaps they did not have issues with missing data or simply did not account for the patterns of missingness that are appropriate with survey/questionnaire data. For those who did not mention missing data or simply relied on a linkage to materials from a secondary source, it is possible that these nonignorable patterns present underlying correlates that are likely to help guide future data collection procedures and even resulting policy implications from the study at hand.

Table 1 Identified techniques for handling missing data issues among the most influential sociology articles (N = 105); categories are based on the self-reported methods for handling missing data that appear within the actual published article. Figures are % (n):

- Did not use survey data: 21.0 (22)
- Imputed missing values: 6.7 (7)
- Acknowledged issues with missing data as a study limitation: 7.6 (8)
- Excluded variables with high levels of missingness: 1.0 (1)
- Tested for generalizability by comparing results to other research: 1.9 (2)
- Provides links to materials for secondary data sources, mentions “screening” for missing data, or does not explicitly acknowledge any missing data: 60.9 (64)
- Total number of articles: 100.0 (105)

While the authors of each of the ASR, AJS, and Social Forces articles handled their patterns of missing data and non-response in a manner deemed appropriate by reviewers and editors alike (as evidenced by their ability to pass peer review in the discipline’s top outlets), the fact that missing data was a notable issue in nearly a quarter (~22 %) of those articles that used a survey/questionnaire collection method indicates the magnitude of potential bias introduced by this concern. Perhaps an even larger issue here is the reliance on secondary source materials (and corrections, imputations, etc.) with no mention of missing data issues, which make up the other 73 % (64 out of 83) of studies using data collected via survey methods for their analysis. This highlights our main concern: that our discipline has such a high reliance on such methods, and that so little is understood about the unique circumstances driving these patterns in each case, begs for the continued development of such an understanding. In fact, it is well known that social scientists take a great number of precautions in collecting data; and yet, missing data still exists and it is very often systematic.

Most important, what are the underlying mechanisms of item non-response, and is it possible to gain substantial insights by examining missing-data patterns on survey questions? We expect that the answer is “yes” and unique to each data collection effort, given the social and political climate, the sensitivity of the questions being asked/study topic, geographic location, and numerous other issues that must be taken into account when designing data collection tools and analyzing/presenting their results. This is most true when a significant percentage of the missing data is directly involved in dependent and primary independent measures, especially when the core findings of a research project are based extensively on such indicators. The special case provided by the RAAS allows for a unique look into the reasons for complete survey non-response. Often these reasons related directly to the controversial nature of the issue at hand and its timely relationship to historical circumstance. Such analyses could potentially yield important insights, allowing researchers to understand both their topic and their respondents to a greater degree than is possible with existing methods.

Purpose and Expectations

The RAAS is a study of how elite scientists in the United States understand issues related to spirituality and religiosity. It provides an opportunity to understand the correlates of missing data on particular questions among a specific population (elite scientists), given the sensitive and controversial issue of religion and how it is negotiated in academia. Furthermore, given their appointment at the most prestigious universities in the United States, this sample is of scientists who have dedicated much of their lives to their work. As one might expect, asking this population questions about religion and their feelings on the subject is likely to be viewed by many as intrusive or controversial. Some of the more intrusive questions were aimed at understanding scientists’ religious outlook in comparison to other Americans, or their beliefs about God or a god, the Bible, and religion. (Question wording for these specific questions is further discussed below.) These questions yielded nonignorable patterns of item non-response, though the survey achieved a high response rate overall (75 %). Interestingly, of the 25 % that did not participate in the survey, roughly 26 % directly corresponded with the study’s PI, providing insightful qualitative information regarding reasons for not completing the survey. The RAAS is appropriate, then, as a case for examining non-ignorable missing data as well as the reasons individuals did not respond to a survey on a controversial topic (Ecklund 2010). Further, the survey was of a targeted sample, meaning that we can potentially generalize from this case to other surveys of scientists or other elite populations and even inform future studies of similar populations on the development and dissemination of their data collection instruments.

456 Am Soc (2012) 43:448–468

From this point forward, we provide an example analysis in which we examine a series of controversial survey questions in the RAAS in hopes of better understanding the non-random patterns of missing data within the unique context of that specific survey. Again, we believe that each data collection effort is undertaken within a unique context that must be understood in order to fully interpret the responses (and lack of responses) from any one instrument at any given point in time. This analysis is undertaken with that in mind and in hopes of uncovering the unique relationships associated with the identified, and systematic, patterns of missing data. In order to do so, we set forth a series of assumptions and expectations that will guide our research. First, we assume that controversial questions (in this case, questions about personal religious views) may be more likely to garner missing data and that these patterns will be statistically nonignorable. Second, we expect a series of findings in relation to specific questions and patterns of non-response in the RAAS. We expect that elite scientists will have different patterns of missing data on population-specific, more intrusive measures (i.e., race, age, gender, and nativity) (Ecklund 2010) when compared to what other survey researchers have found among the general population.

We also expect level of prestige will be an important predictor of missing data on controversial questions about religion. Given previous findings about differences in scientists' religiosity according to rank and number of articles published (Ecklund), we expect that scientists at the assistant professor level and those who have published fewer articles (controlling for differences in article-publishing conventions among disciplines) will have more missing data on particularly controversial questions than will scientists at the associate and full professor levels. Our expectation is based on the reasoning that those who have less prestige will feel less free to answer controversial questions related to religion.

Because of recent media attention regarding the irreligiosity of natural scientists, we also expect that natural scientists will be more likely to have missing data on controversial questions about religiosity than will social scientists. Based on previous research about religious socialization (Ecklund and Scheitle 2007; Ecklund 2010), we expect that scientists who are religious will be more likely to answer questions about religion, because they face less fear of being perceived as different from the general population. Finally, based on previous research indicating that having children is a positive predictor of religiosity among scientists (Ecklund and Scheitle 2007), we expect that scientists who have children will be more likely to be religious and therefore more likely to answer questions about religion on the survey.

An Example: Controversial Questions in the Religion Among Academic Scientists Study

The RAAS study began during May 2005, when 2,198 faculty members in the disciplines of physics, chemistry, biology, sociology, economics, political science, and psychology were randomly selected from the universities in the sample. Although faculty were randomly selected, oversampling occurred in the smaller fields and undersampling in the larger ones. For example, a little more than 62 % of all sociologists in the sampling frame were selected, while only 29 % of physicists and biologists were selected, reflecting the greater numerical presence of physicists and biologists at these universities when compared to sociologists. In analyses where discipline is not controlled for, data weights were used to correct for the over- and undersampling. (The data-weighting scheme is available upon request.)
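A correction of this kind can be sketched as inverse-probability design weights. The selection fractions and normalization below are illustrative assumptions loosely echoing the figures in the text, not the actual RAAS weighting scheme (which is available from the authors):

```python
# Illustrative sketch of design weights, NOT the RAAS scheme: each case is
# weighted by the inverse of its discipline's (hypothetical) selection fraction,
# then normalized so the weights average to 1.
selection_fraction = {
    "sociology": 0.62,   # oversampled smaller field
    "physics": 0.29,     # undersampled larger field
    "biology": 0.29,
}

sample = ["sociology", "physics", "biology", "physics", "biology"]

# Inverse-probability weight for each sampled case
raw = [1.0 / selection_fraction[d] for d in sample]

# Normalize so the weighted n equals the actual n
mean_w = sum(raw) / len(raw)
weights = [w / mean_w for w in raw]

print([round(w, 3) for w in weights])
```

Sampled members of undersampled fields receive weights above one, restoring each discipline's share in the population when weighted estimates are computed.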

Scientists included in the study were randomly selected from seven natural and social science disciplines at universities that appear on the University of Florida's annual report of the "Top American Research Universities."3 The University of Florida ranked elite institutions according to nine different measures, which include total research funding, federal research funding, endowment assets, annual giving, number of national academy members, faculty awards, doctorates granted, postdoctoral appointees, and median SAT scores for undergraduates. Universities were ranked and selected according to the number of times they appeared in the top 25 for each of these nine indicators.

Initially, the study's PI wrote a personalized letter to each potential participant in the study that contained a $15.00 cash pre-incentive, theirs to keep regardless of the decision to participate in the survey. Each letter included a unique identification code with which to log onto a Web site and complete the survey. After five reminder emails, the research firm commissioned to field the survey—Schulman, Ronca, and Bucuvalas, Inc. (SRBI)—called respondents up to a total of 20 times (or until they responded), requesting participation over the phone or on the Web. Six and a half percent of the respondents completed the survey over the phone, and 93.5 % completed the Web-based survey.

Shedding Qualitative Light on Reasons for Non-Response

Many of the scientists who chose not to participate wrote to tell us why. Overall, 131 personal emails or letters from those who did not wish to participate (out of 552 total nonrespondents) were received. Reasons for not participating were systematically coded in an attempt to uncover patterns. In total, the scientists provided 13 discrete reasons for not participating in the survey. Dominant reasons included lack of time, problems with the incentive, traveling or being away during the survey, and simply not wishing to participate. We did demographic analyses of the nonrespondent scientists and found no substantial differences along basic demographic indicators (such as gender, age, discipline, and race) between those who responded and those who did not. The results presented in Table 2, which categorizes the specific reasons communicated for not completing the survey, are still beneficial in allowing us to understand some of the reasons for survey non-participation. It is likely that those who did not communicate with the PI fall somewhere within these categories, but it is also likely that they constitute a distinct sub-population of non-respondents. In this case, "non-active nonresponders" likely did not respond for their own reasons, which we cannot ascertain.
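The coding exercise described here amounts to tallying coded reasons into the percentage distribution reported in Table 2. A minimal sketch, with hypothetical codes and a handful of invented responses rather than the actual RAAS correspondence:

```python
from collections import Counter

# Hypothetical coded reasons for a few active nonresponders, mirroring
# the systematic coding described in the text (labels are invented).
coded_reasons = [
    "no time", "no time", "issue with incentive",
    "traveling/away", "do not wish to participate", "no reason",
]

counts = Counter(coded_reasons)
total = len(coded_reasons)

# Tabulate like Table 2: reason, percent, and raw count
for reason, n in counts.most_common():
    print(f"{reason:<28} {100 * n / total:5.1f} %  (n={n})")
```

The same tally over the 131 real letters, with 13 coded categories, yields the distribution shown in Table 2.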

Some respondents wrote to explain that they did not participate in the study specifically because it was on what they perceived to be an extremely controversial topic. One biologist explained: "You are naïve if you think that you could prevent Homeland Security or other major governmental agency from obtaining this confidential information. Sorry but here is your money back." The biologist eventually ended up participating in the study. Still, his response might help explain why certain questions on the survey were much less likely to be answered than others. Even though complete confidentiality was assured and human subjects' protection was extensively discussed, some scientists were fearful that their answers to controversial questions might be used against them.

3 After the RAAS study began, the "Top American Research Universities" project moved to Arizona State University. See http://mup.asu.edu/, accessed April 17, 2009.

The pre-incentive also drew mixed reactions. For example, one psychologist said, "As soon as I opened that up, I thought, 'Oh my God. I've got the bills now. I have to do it' [laughs]. . . . It was just brilliant."4 Other scientists called the study "harassment" or even "coercion." For example, a well-known sociologist wrote the PI an email saying, "It is obnoxious to send money (cash!) to create the obligation to respond."5 It is important to note that the study received full human subjects' approval at the PI's university. Reasons for not responding are likely to vary by individual circumstance. In the case of the RAAS, about 26 % of those who did not respond to the survey (25 % of the total sample) communicated directly with the PI and provided reasons for not responding.

As economists and political scientists have already discovered, however, the pre-incentive works. There was an overall response rate of 75 %, or 1,646 respondents, ranging from a 68 % rate for psychologists to a 78 % rate for biologists. This is an extremely high response rate for a survey of faculty. Even the highly successful Carnegie Commission study of faculty resulted in only a 59.8 % rate.6 Understanding who did not complete these particular surveys is likely to help us place our findings in the context of a specific type of respondent, without over-generalizing to all scientists. However, we also understand that any meta-analysis

Table 2 Reasons identified by "active nonresponders" for not participating in the RAAS survey

Reason for not completing survey                      %       n
Too controversial                                     1.5     2
No time for survey                                   12.2    16
Have issue with the incentive                        10.7    14
Policy of not participating in surveys                2.3     3
Survey not open-ended enough                          2.3     3
Do not wish to participate                           21.4    28
Confidentiality issues                                0.8     1
Traveling, away, received survey after deadline      16.0    21
Computer or technical problem                         6.1     8
Retired, sabbatical, or ill                           3.8     5
Does not consider self appropriate for study          0.8     1
No reason                                            26.0    34
Total "active nonresponders"                        100.0   131

Data source: All data were collected from direct correspondence between survey respondents and the project PI.

4 Psyc 17, conducted January 3, 2006.
5 This individual did not participate in the survey.
6 See Ladd and Lipset, "The Politics of Academic Natural Scientists and Engineers."

of missing data is likely to be constrained at some point by a simple lack of identification.

Identifying Quantitative Patterns to Mechanisms of Non-Response

Complementing our qualitative understanding of non-participants in the RAAS survey, we also quantitatively examine missing-data patterns. As expected, the questions with the highest levels of nonignorable missing data were those regarding religion, the topic of the study. The survey asked some questions about religious identity, belief, and practice that were replicated from national surveys of the general population (such as the GSS), and other questions on spiritual practices, ethics, and the intersection of religion and science in the respondent's discipline, some of which were replicated from other national surveys and some of which were developed uniquely for this survey.7 There was also a series of inquiries about academic rank, publications, and demographic information.
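Item non-response rates of the kind reported in Table 3 can be computed directly from a response matrix in which skipped items are marked as missing. A minimal sketch with toy data and hypothetical item names (not the RAAS variables):

```python
import pandas as pd

# Toy survey data: None marks an item the respondent skipped.
# Column names are hypothetical stand-ins for survey items.
df = pd.DataFrame({
    "religious_views":  [1, None, None, 4, None, 7],
    "belief_about_god": [2, 3, None, 5, 6, 1],
    "income":           [110, 95, None, 130, 150, 120],
})

# Percent missing per item, sorted to surface the most-avoided questions
pct_missing = df.isna().mean().mul(100).sort_values(ascending=False)
print(pct_missing.round(1))
```

Ranking items this way is how one would identify the handful of questions, such as the religious-views comparison, whose non-response rates stand far above the rest of the instrument.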

In order to better understand the mechanisms driving the nonignorable patterns of missing data, we selected the questions that had the highest non-response rates and analyzed them with a series of logistic regression equations to better understand the likelihood of a respondent refusing to answer a given question based on a selection of identified correlates. The analysis is informative in its ability to uncover the mechanisms that may inform data replacement methods or simply the presentation of findings from previous and future research using the RAAS dataset. The general equation applied to the logistic regression analysis is as follows:

ln( p̂ / (1 − p̂) ) = B₀ + B₁X₁ + B₂X₂ + … + BᵢXᵢ

where the logged odds that the question was identified as missing (equal to 1) versus not identified as missing (equal to 0) are modeled as a constant (B₀) plus a series of independent-variable-specific slope coefficients (Bᵢ · Xᵢ).

Table 3 shows the questions on the survey that have the greatest proportion of missing data in the entire dataset. From the table, we find evidence supporting our expectation concerning controversial questions and item non-response, as the religiosity-specific questions are missing at a higher rate than are other measures more widely recognized as intrusive: income (6 % missing), family formation (<1 % missing), and number of children (2.5 % missing). But the question that asks where these elite scientists would place their religious views on a seven-point conservative/liberal scale, when compared to other Americans, has a missing-data rate of over 34 %. And over 16 % of those who responded to the survey were not willing to answer a question that described their feelings about the Bible. Questions about views on truth in religion and belief about God also garnered more than 10 % non-response. These preliminary analyses reveal that the missing data from this survey are not missing at random. They are "nonignorable," in Allison's (2002) terms, meaning that more complicated analysis is warranted. What is driving these patterns, and can we better understand these non-responses given an analysis of correlates associated with these missing patterns? That is our goal here.

7 The 1998 GSS had 2,832 respondents, although only half of the sample was asked the expanded set of religion and spirituality questions. The 2004 GSS had 2,812 respondents. Where possible, we used data from the GSS 2006 for the comparisons of scientists with the general population. See Davis et al. (2007).

To examine these specific patterns with regard to our expectations that variations exist across demographic indicators, professional standing, religiosity, and familial characteristics, we present results of a logistic regression analysis in Table 4. These four sub-groups were included in the analysis because they are often found to be linked to patterns of missing data in the general population (citizenship, gender, age, race, and family formation are used as explanatory variables) (Allison 2002; Griliches 1986). As mentioned, we also included professional characteristics that may be important to this population specifically, such as whether or not a faculty member is in the natural sciences (social sciences as comparison), prestige as indicated by number of published articles, and rank. Next, we included religiosity variables that may make a difference in missing-data rates, including present levels of religiosity (attendance at services and religious affiliation) and religious socialization (importance of religion in childhood). Finally, we included measures of family formation such as marital status and the number of children in the respondent's family, given the results of previous analyses linking it to religiosity using this dataset (Ecklund 2010).

Table 3 Selected controversial survey items and non-response rates

Survey questions and response categories                                  % Missing a

Which of the following comes closest to your views about truth            10.1 %
in religion?
  (1) There is little truth to any religion
  (2) There are basic truths in religion
  (3) There is the most truth in only one religion

Which of the following statements comes closest to expressing             10.1 %
what you believe about God?
  (1) I do not believe in God
  (2) There is no way to find out if there is a God
  (3) I believe in a higher power, but it is not God
  (4) I believe in God sometimes
  (5) I have some doubts but I believe in God
  (6) I have no doubt about God's existence

Which of these statements comes closest to describing your                16.1 %
feelings about the Bible?
  (1) The Bible is an ancient book of fables, written by men
  (2) The Bible is inspired by the word of God, but not the actual word
  (3) The Bible is the actual word of God and should be taken literally

Compared to most Americans, where would you place your                    34.3 %
RELIGIOUS views on a seven-point scale?
  (1) Extremely liberal → … → (7) Extremely conservative

a These percentages are relative to much lower levels of missing data concerning more widely known intrusive measures, including income (6 %), marital status (<1 %), and number of children at home (2.5 %)

Table 4 Odds ratios predicting missing data on selected controversial survey items (standard errors)a

                                 Belief about religion        Belief about God             Belief about Bible           Religious views
Explanatory variables            Reduced       Full           Reduced       Full           Reduced       Full           Reduced       Full
U.S. born                        0.855(.06)*   0.882(.07)†    1.064(.08)    1.147(.10)     0.961(.05)    0.964(.06)     0.981(.04)    1.032(.04)
Female                           0.934(.27)    0.761(.32)     1.393(.29)    1.945(.66)*    1.531(.19)*   1.756(.21)**   0.845(.15)    0.784(.18)
Age                              0.991(.01)    0.991(.01)     1.029(.01)**  1.006(.01)     1.015(.19)†   1.017(.01)     1.013(.01)*   1.011(.01)
Income                           1.022(.02)    1.053(.04)     0.948(.03)    0.897(.04)*    0.953(.02)*   0.945(.03)†    0.981(.01)    0.962(.02)
White                            1.008(.31)    1.278(.36)     0.492(.34)*   0.484(.41)†    0.559(.23)**  0.596(.25)*    1.613(.15)**  1.999(.23)***
Natural science                  0.926(.19)    1.095(.29)     1.123(.29)    0.939(.33)     1.241(.15)    1.341(.21)     0.951(.12)    1.187(.16)
Number of published articles     1.035(.05)    1.082(.08)     1.062(.05)    1.074(.09)     1.016(.04)    1.037(.06)     1.035(.03)    1.062(.04)
Assistant professor (ref. Full)  1.161(.27)    1.788(.53)     0.922(.28)    0.491(.60)     1.116(.21)    1.123(.39)     0.782(.17)    0.714(.29)
Associate professor (ref. Full)  1.199(.26)    2.253(.45)†    0.695(.31)    0.327(.58)*    1.092(.22)    1.027(.33)     1.235(.16)    1.175(.24)
Attendance at services           0.496(.53)    1.035(.55)     0.378(.57)†   0.398(.68)     0.571(.34)†   0.517(.43)     0.163(.56)*** 0.111(.71)**
Import. of relig. in childhood   0.851(.09)†   1.042(.12)     1.117(.11)    1.229(.14)     1.028(.07)    1.013(.09)     0.803(.06)*** 0.776(.07)***
Affiliated with relig. org.      0.761(.21)    0.553(.28)     1.211(.19)    0.993(.30)     1.055(.16)    0.949(.20)     0.162(.13)*** 0.147(.16)***
Married                          0.884(.23)    0.603(.31)†    1.087(.24)    1.257(.40)     1.084(.19)    1.144(.26)     0.936(.14)    0.865(.46)
Number of kids                   0.975(.07)    0.952(.11)     0.795(.07)**  0.851(.13)     0.926(.05)    0.977(.08)     0.897(.04)**  1.087(.18)
χ²                               –26.559*                     –28.342*                     –27.271*                     –309.452***

a Reduced models represent the isolated examination of the demographic, professional, religiosity, or family formation indicators independent of all other variables. In contrast, the full model represents the controlled effect of each indicator with all other variables in the model. ***p<0.001, **p<0.01, *p<0.05, †p<0.10

We find very interesting patterns associated with the systematic missingness across the four questions concerning religious views. Of specific interest, we find that some of the demographic indicators that are traditionally thought to be highly correlated with missing-data patterns are not among the strongest predictors in this analysis of elite scientists. Instead, we find them not to have a large influence on the likelihood of having missing data—with a couple of notable exceptions. Again, all odds ratios should be interpreted as the likelihood of the data point being missing (i.e., a ratio above one means more likely to be missing). The only significant effect of nativity is associated with the initial question concerning the respondents' general views. Here we find that those native to the United States were 12 to 15 % less likely to have avoided answering this question in both the reduced and full models. Similar, yet insignificant, results were found concerning nativity and a respondent's beliefs about the Bible. Interestingly, concerning their beliefs about God and their comparative views relative to other Americans, U.S. natives were directionally less likely to respond.

We found that women (when compared to men) were significantly more likely to have avoided questions on beliefs about God and the Bible, by 95 and 76 % respectively. While again not significant, women were less likely to have missing data on the question addressing their beliefs about religion and their comparative stance relative to the rest of Americans. Following the trend, older respondents were less likely to avoid the question concerning their beliefs in religion but more likely to avoid questions on their beliefs about God, the Bible, and their comparative views. In this case, only with regard to their beliefs about God is the probability significant. The effect of age, however, is ultimately explained away when professional indicators, religiosity, and family formation indicators are introduced.

Scientists with higher levels of income are significantly less likely to answer questions concerning their beliefs in God, the Bible, and their comparative views. They are more likely to answer questions concerning their beliefs about religion (although these results are not statistically significant). Finally, scientists who classified themselves as racially white were less likely to have missing data on beliefs about God and the Bible and more likely to have avoided answering the questions concerning their religious beliefs and their religious views compared to those of other Americans.

We find almost no evidence in support of our expectation that an increase in prestige leads to a higher probability of answering questions concerning religiosity. As the number of articles a respondent has published increases, we find that the respondent is actually more likely to be missing data on all questions of interest (although these results are not statistically significant). We also find that professors with an associate rank are over two times more likely than full professors to have avoided questions concerning their beliefs in religion but 67 % less likely to have avoided questions concerning their belief in God (results are significant). Finally, we see that natural scientists (when compared to social scientists) are not significantly more or less likely to avoid answering any of the questions of interest here, in contrast with our expectation.

When we turn to the influence of religious factors, we find that all of the religiosity measures are significant in decreasing the likelihood of missing data when scientists were asked to compare their religious views to those of other Americans. This provides strong support for our expectation that respondents who reported a higher level of religiosity would in fact be more likely to answer the controversial questions concerning religiosity. In particular, the results show that as religiosity increases, scientists are significantly more likely to answer questions about their religious views, their views on God, their views on the Bible, and their comparisons of their beliefs to those of other Americans. Only the measure in comparison to other Americans, however, holds in the full model.

Perhaps not surprisingly, this is the most influential set of predictors underlying the systematic patterns of nonignorable missing data. Not coincidentally, these religious indicators directly relate to the patterns of missing data among the religiously centered questions that are the endogenous variables in this analysis. This is extremely important for our ability to understand variations in missing-data patterns among these four questions. We find not only that the top four questions, in terms of their quantity of missing data, are all religion-centered questions, but also that at least one of them is strongly predicted by a set of alternative religiosity indicators that have minimal levels of missing data. What does this mean for our understanding of these patterns and future analyses? It means that those who regularly attend services, those for whom religion played an important role in childhood, and those who are currently affiliated with a religion are less likely to avoid these questions. Thus, the majority of missing data on these questions relates directly to individuals who are not religious or who are only weakly attached to religion. The question itself then seems to be somewhat non-exhaustive, as it only asks for a response on a seven-point scale about religious views. It is likely that much of the reason for non-response in this case is the inability of respondents to place themselves on the scale, given that the highest levels of missing data were associated with those who exhibited very low levels of religiosity on the questions that they did answer.

The final section of determinants in the logistic regression model concerns the potential relationship between the formation and development of a family, through marriage and having children, and the likelihood of answering questions about religion despite their controversial nature. We find support for this line of reasoning, as all statistically significant coefficients indicate a lower likelihood of missing data for both married respondents and those with children. Being married is related to a lower likelihood of avoiding questions concerning the respondent's religious views, and having children is significantly related to a lower likelihood of avoiding questions on God and on the comparison of one's religious stance with those of other Americans.

Discussion and Conclusion

Uncovering the mechanisms underlying patterns of missing data is as important in assessing the reliability and validity of a data collection tool and its items as the more popularly employed traditional statistical tests (alpha reliability tests, exploratory data reduction techniques, etc.). Very little attention has been given to this very important issue, however. We currently have sophisticated methods for handling missing data. Some have inherent in them the regression-based probability analyses we have undertaken here (i.e., multiple imputation and other stochastic replacement methods).
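A stochastic replacement method of the kind mentioned here can be sketched as a minimal multiple-imputation routine. The data below are simulated and the missingness is random by construction, which is exactly the assumption such methods lean on; the simulation is illustrative, not the authors' procedure:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: x fully observed; y generated from x, then ~30 % of y deleted
# completely at random (the benign case these methods assume).
x = rng.normal(0.0, 1.0, 200)
y = 2.0 + 0.8 * x + rng.normal(0.0, 1.0, 200)
miss = rng.random(200) < 0.3
y_obs = np.where(miss, np.nan, y)

# Fit y ~ x on the observed cases (np.polyfit returns slope, then intercept)
obs = ~np.isnan(y_obs)
b1, b0 = np.polyfit(x[obs], y_obs[obs], 1)
resid_sd = np.std(y_obs[obs] - (b0 + b1 * x[obs]))

# Stochastic regression imputation, repeated m times with fresh noise draws
m = 5
estimates = []
for _ in range(m):
    y_imp = y_obs.copy()
    y_imp[~obs] = b0 + b1 * x[~obs] + rng.normal(0.0, resid_sd, (~obs).sum())
    estimates.append(y_imp.mean())

# Pool across imputations (Rubin's rule for the point estimate: the average)
pooled_mean = float(np.mean(estimates))
print(round(pooled_mean, 3))
```

When missingness instead depends on the unobserved values themselves, as with the religiosity items analyzed above, this machinery still runs but its estimates inherit the bias; nothing in the routine reveals why the data are missing.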

None of these methods, however, allows understanding of the mechanisms driving the missing-data patterns. Filling this lacuna may provide insight into the development and continued advancement of "professional standards" in our own discipline of sociology as well as more broadly. For instance, the inherent issues with our question concerning the comparison of religious views are likely the reason that it had the highest rate of missing data in the entire dataset. If another survey utilized this question, the addition of a category for "no religious views" would likely alleviate some of the missing data. It is still likely that this would remain one of the most avoided questions, as evidenced by the fact that the next three highest levels of missing data were all on religiosity questions, but future analyses can now be informed by the results of this introspective examination of the RAAS data.

Further, we found significant differences among the questions that generated the most missing data. These patterns of missing data did not fall along the lines predicted by what researchers have found in the general population. Questions about income and personal family issues, which scholars find often experience missing data in the general population (Allison 2002), did not generate the same missing-data patterns among natural and social scientists who teach and do research at elite universities. Instead, questions that are most likely to be controversial to this population were the most likely to generate missing data. In at least one case, a structural mechanism for the level of missingness was uncovered; overall, these included questions where respondents were asked to compare their religious views to those of other Americans, describe their feelings about the Bible, state their views on truth in religion, and quantify their beliefs about God. Analysis also revealed that these missing-data patterns for sample-specific controversial questions were not ignorable, meaning that such questions had a high percentage of missing data that could be directly linked, through statistically significant associations, to indicators of demographic makeup, professional rank, religiosity, and family formation.

Further, we found that gender, race, and the presence of children in one's household had an impact on the likelihood of having missing data on the controversial religion questions mentioned above. Women were more likely than men to have missing data when asked how their views on religion compared to those of other Americans, as were scientists who were racially white. In terms of religiosity measures, scientists who were the least religious were the most likely to have missing data for questions on belief about God and on the comparison of their religious views to those of other Americans.

Some of these specific results about missing data among scientists may be explained by factors of marginalization. Our elite universities nearly all employ fewer women (especially in the natural sciences but even in the social sciences) than men. Women often feel marginalized in this situation, which may make them less likely to answer controversial questions about religion on a survey for fear of how the results might be used to further marginalize them. If this marginalization is happening, we might expect the same pattern of response from nonwhite groups, but there are too few black and Latino individuals in the sample to make a meaningful comparison. Asian Americans are overrepresented as a minority group among the sciences when compared to other minority groups, but many are first-generation immigrants or international scholars, who may not be part of the same minority-marginalization dynamics of American culture as other nonwhite racial minority groups.

This same kind of reasoning about marginalization might also be applied to those who are not religious. Researchers find that atheists and the nonreligious are somewhat marginalized in the general population (Edgell et al. 2006). This means that scientists who are not religious may be especially unlikely to answer questions about religion for fear that the larger public might use such results against them. (Remember the quote from the biologist who was afraid his answers would be leaked to Homeland Security.)

Interestingly, the specific questions deemed controversial to this group were often avoided at different rates, given both the question of interest and the determinant at hand. For instance, and based solely on the directionality of the obtained coefficients, respondents were often likely to answer one or two questions related to religion in general but avoid the others relating specifically to God and the Bible. Furthermore, general religion questions often seemed to be interpreted differently from those concerning God and the Bible. This may be because the latter questions connote adherence to the Christian religious traditions in particular. In this case, we find further evidence that potentially correctable issues in survey design would likely yield a more complete and reliable set of data in future rounds of data collection. Most obviously, we find that some of our survey questions do not allow respondents to provide an appropriate answer because they are not exhaustive. In this case, the fact that missing data exist is actually not associated with respondent avoidance, but instead with an error in item creation. We can make some assumptions from the patterns of missing data here, however. For instance, it is likely that those who did not answer the question do not feel that they have "religious views" and therefore cannot compare themselves in this respect. Whatever the case, there were also a number of times when respondents answered these questions in a manner correlated with differences in nativity, gender, age, income, race, field, rank, religiosity, and marital status. Many of these associations were insignificant, but they are interesting nonetheless.

The particular historical context of this survey may also be important. It took place in 2005, when there was much publicity surrounding controversial cases over intelligent design theory and charges that university professors were particularly biased in their teaching (Schrecker 2006; Ecklund 2010). Natural and social scientists in high-profile positions at that time may have been particularly reticent to openly compare their views about religion to those of the general population. Individual emails to the study PI expressed such reservations, along with concerns that the survey data be kept completely confidential and that respondents’ names not be identified.

Of general relevance to survey researchers, these results show the value of secondary analysis for understanding the underlying reasons for missing data in a particular data set, which the PI has done in other publications generated from this study (Ecklund et al. 2008). Where possible, it may be valuable for the researcher to have extensive contact with the research subjects to try to discern the possible reasons for their missing data. Further, these results show that missing data patterns, even fairly established ones having to do with demographics, may differ radically among survey populations. More and more survey researchers are trying to avoid low response rates by targeting specific (rather than representative) populations of respondents. The kinds of conversations we have generated in this article about survey response rates, and particularly those about response rates concerning missing data among specific populations and related controversial questions, may hold value for survey-methods classes as well as researchers in the field.

466 Am Soc (2012) 43:448–468

In sum, the ability to uncover issues associated with response incentives, survey item construction, reliability, and validity (among other issues inherent in survey data collection) is enough to warrant a more careful examination of the actual structure of the data we use to uncover many of the associations and relationships presented in our most coveted publication outlets. Coupled with the fact that nearly four out of every five of the articles obtained in this analysis made their contributions via data collected through survey or questionnaire methods, a more careful understanding of these mechanisms is warranted in order to continue to improve our ability to make reliable and valid inferences from the data we collect. Furthermore, it is important to move beyond simply employing a series of generic replacement methods in the process of imputation. Rather, researchers should more tightly link these data procedures, or secondary correction methods, to specific datasets, since every dataset is collected within a specific social, historical, and political context. The targeted sample, geographic coverage, and data collection method also play into the ultimate “structure” of any given dataset. We must then strive to understand these conditions and take them into account when analyzing and presenting our results. Often such results, especially those in publications with an impact as high as those examined in this study, inform policy and future research agendas for years to come. Finally, the information extracted from active nonresponders and missing data patterns should be incorporated into the future design of these studies.
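The contrast between a generic replacement method and an imputation model linked to the structure of the data can be illustrated with a small simulation. The variables and the scikit-learn-based chained-equations approach below are illustrative assumptions on our part, not the methods used in the articles reviewed here.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer

rng = np.random.default_rng(1)
n = 300

# Two correlated survey-style measures; y is missing for roughly 30% of cases
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(scale=0.5, size=n)
incomplete = np.column_stack([x, y])
incomplete[rng.random(n) < 0.3, 1] = np.nan

# Generic replacement: every missing y gets the column mean,
# which attenuates the x-y association in the completed data
mean_filled = SimpleImputer(strategy="mean").fit_transform(incomplete)

# Model-based (chained-equations style) imputation predicts y from x,
# preserving the association between the two items
model_filled = IterativeImputer(random_state=0).fit_transform(incomplete)

corr_mean = np.corrcoef(mean_filled, rowvar=False)[0, 1]
corr_model = np.corrcoef(model_filled, rowvar=False)[0, 1]
print(corr_mean, corr_model)
```

In a run of this sketch the mean-filled data show a visibly weaker correlation than the model-filled data, which is the attenuation that generic replacement introduces; an imputation model chosen with the dataset’s own structure in mind avoids it.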

References

Abraham, K., Maitland, A., & Bianchi, S. (2006). Nonresponse in the American Time Use Survey: Who is missing from the data and how much does it matter? Public Opinion Quarterly, 70(5), 676–703.
Allison, P. (2002). Missing data (Sage Quantitative Applications in the Social Sciences series). Sage Publishing.
Alosh, M. (2009). The impact of missing data in a generalized integer-valued autoregression model for count data. Journal of Biopharmaceutical Statistics, 19, 1039–1054.
Babbie, E. (2007). The practice of social research (11th ed.). Thomson Wadsworth Publishing.
Davis, J. A., Smith, T. W., & Marsden, P. V. (2007). General social surveys, 1972–2006 [Cumulative file]. ICPSR Study Number 4697.
Dawid, A. P. (1984). Statistical theory: The prequential approach (with discussion). Journal of the Royal Statistical Society A, 147, 278–292.
DeSouza, C. M., Legedza, A. T. R., & Sankoh, A. J. (2009). An overview of practical approaches for handling missing data in clinical trials. Journal of Biopharmaceutical Statistics, 19, 1055–1073.
Ecklund, E. H. (2010). Science vs. religion: What scientists really think. New York: Oxford University Press.
Ecklund, E. H., & Scheitle, C. (2007). Religion among academic scientists: Distinctions, disciplines, and demographics. Social Problems, 54(2), 289–307.
Ecklund, E. H., Park, J. Z., & Veliz, P. T. (2008). Secularization and religious change among elite scientists: A cross-cohort comparison. Social Forces, 86(4), 1805–1840.
Edgell, P., Gerteis, J., & Hartmann, D. (2006). Atheists as ‘other’: Moral boundaries and cultural membership in American society. American Sociological Review, 71, 211–234.
Evans, J. H., & Evans, M. S. (2008). Religion and science: Beyond the epistemological conflict narrative. Annual Review of Sociology, 34, 87–105.
Garcia, R. I., Ibrahim, J. G., & Zhu, H. (2010). Variable selection in the Cox regression model with covariates missing at random. Biometrics, 66, 97–104.
Gelman, A., Meng, X. L., & Stern, H. S. (1996). Posterior predictive assessment of model fitness via realized discrepancies (with discussion). Statistica Sinica, 6, 733–807.
Gelman, A., Van Mechelen, I., Verbeke, G., Heitjan, D. F., & Meulders, M. (2005). Multiple imputation for model checking: Completed-data plots with missing and latent data. Biometrics, 61, 74–85.
Griliches, Z. (1986). Comment on Behrman and Taubman. Journal of Labor Economics, 4(3), S146–S150.
Groves, R. (2006). Nonresponse rates and nonresponse bias in household surveys. Public Opinion Quarterly, 70(5), 646–675.
Groves, R., Couper, M., Presser, S., Singer, E., Tourangeau, R., Piani, G., & Nelson, L. (2006). Experiments in producing nonresponse bias. Public Opinion Quarterly, 70(5), 646–675.
Huang, R., Liang, Y., & Carriere, K. C. (2005). The role of proxy information in missing data analysis. Statistical Methods in Medical Research, 14, 457–471.
Khoshgoftaar, T. M., Van Hulse, J., Seiffert, C., & Zhao, L. (2007). The multiple imputation quantitative noise corrector. Intelligent Data Analysis, 11, 245–263.
Lenski, G. (1961). The religious factor. Garden City: Doubleday.
Little, R. J. A. (1995). Modeling the drop-out mechanism in repeated measures studies. Journal of the American Statistical Association, 90, 1112–1121.
Little, R. J. A., & Rubin, D. B. (1987). Statistical analysis with missing data. New York: Wiley.
Martinussen, T., Nord-Larsen, T., & Johannsen, V. K. (2008). Estimating forest cover in the presence of missing observations. Scandinavian Journal of Forest Research, 23, 266–271.
Maxim, P. (1999). Quantitative research methods in the social sciences. Oxford University Press.
Montiel-Overall, P. (2006). Implications of missing data in survey research. Canadian Journal of Information and Library Science, 30(3/4), 241–269.
Neuman, L. (2003). Social research methods: Qualitative and quantitative approaches (5th ed.). Allyn and Bacon Publishing.
Olson, K. (2006). Survey participation, nonresponse bias, measurement error bias, and total bias. Public Opinion Quarterly, 70(5), 737–758.
Paik, M. C. (2004). Nonignorable missingness in matched case–control data analysis. Biometrics, 60, 306–314.
Pew Research Center for the People and the Press. (2004). Polls face growing resistance, but still representative. Survey reports (April 20).
Porter, J. R., Cossman, R., & James, W. L. (2009). Research note: Imputing large group averages for missing data, using rural–urban continuum codes for density driven industry sectors. Journal of Population Research, 26, 273–278.
Rose, R. A., & Fraser, M. W. (2008). A simplified framework for using multiple imputation in social work research. Social Work Research, 32(3), 171–178.
Rubin, D. B. (1976). Inference and missing data. Biometrika, 63, 581–592.
Rubin, D. B. (1984). Multiple imputation for nonresponse in surveys. New York: Wiley.
Satten, G. A., & Carroll, R. J. (2000). Conditional and unconditional categorical regression models with missing covariates. Biometrics, 56, 384–388.
Schrecker, E. (2006). Worse than McCarthy. The Chronicle of Higher Education, February 10, 2006, p. B20.
Southern, D. A., Norris, C. M., Quan, H., Shrive, F. M., Galbraith, P. D., Humphries, K., Gao, M., Knudtson, M. L., & Ghali, W. A. (2008). An administrative data merging solution for dealing with missing data in a clinical registry: Adaptation from ICD-9 to ICD-10. BMC Medical Research Methodology, 8(1), 1–9.
Verbeke, G., & Molenberghs, G. (2000). Linear mixed models for longitudinal data. New York: Springer.
Verbeke, G., & Molenberghs, G. (2010). Arbitrariness of models for augmented and coarse data, with emphasis on incomplete data and random effects models. Statistical Modeling, 10(4), 391–419.
Wright, J. D., & Marsden, P. V. (2010). In P. V. Marsden & J. D. Wright (Eds.), The handbook of survey research (2nd ed.). United Kingdom: Emerald Press.
