RDAP13 Jared Lyle: Domain Repositories and Institutional Repositories Partn…

Post on 11-Nov-2014

description

Jared Lyle, ICPSR. Domain Repositories and Institutional Repositories Partnering to Curate: Opportunities and Examples. Panel: Partnerships between institutional repositories, domain repositories, and publishers. Research Data Access & Preservation Summit 2013, Baltimore, MD, April 4, 2013. #rdap13

transcript

Domain Repositories and Institutional Repositories Partnering to Curate: Opportunities and Examples

Jared Lyle
RDAP13

About ICPSR
• Founded in 1962 as a consortium of 21 universities to share the National Election Survey
• Today: 700+ members around the world
• Data dissemination for more than 20 federal and non-government sponsors
• 600,000+ visitors per year

What we do
• Acquire and archive social science data
• Distribute data to researchers
• Preserve data for future generations
• Provide training in quantitative methods

Archive size
• 8,000 data collections, over 60,000 data sets
• Grows by 300+ collections a year
• 9 Terabytes, soon to be 40+ Terabytes

http://www.icpsr.umich.edu

http://www.flickr.com/photos/dwiggs/3983200894/sizes/l/in/photostream/

1. Sharing Data (Archiving)

“It saves funding and avoids repeated data collecting efforts, allows the verification and replication of research findings, facilitates scientific openness, deters scientific misconduct, and supports communication and progress.”

Niu (2006). “Reward and Punishment Mechanism for Research Data Sharing.” http://www.iassistdata.org/downloads/iqvol304niu.pdf

“Virtually all geneticists believe that scientists should share their results freely with peers…”

Louis, Jones, and Campbell (2002). “Sharing in Science.” http://dx.doi.org/10.1511/2002.4.304

“…the era of data sharing has arrived.”

Samet (2009). “Data: To Share or Not to Share?” http://dx.doi.org/10.1097/EDE.0b013e3181930df3

http://www.data-pass.org/

Most PIs indicated that they wanted to be “Good Citizens” and help:

“This sounds like an exciting project.”

“I hope your project is successful because I think that it is important.”

“Good Citizens” = high willingness

…but no time, money, or resources to submit data to us.

Data Sharing (N=1,544)

[Bar chart. Categories: Data Are Archived; Has Copy of Data; Data Are Lost. Values shown: 14.2%, 58.7%, 25.7%.]

Pienta, Gutmann, & Lyle (2009). “Research Data in The Social Sciences: How Much is Being Shared?” http://ori.hhs.gov/content/research-research-integrity-rri-conference-2009

See also: Pienta, Gutmann, Hoelter, Lyle, & Donakowski (2008). “The LEADS Database at ICPSR: Identifying Important ‘At Risk’ Social Science Data.” http://www.data-pass.org/sites/default/files/Pienta_et_al_2008.pdf

Data Sharing (N=935)

Federal Agency | Shared Formally, Archived (n=111) | Shared Informally, Not Archived (n=415) | Not Shared (n=409)
NSF (27.3%)    | 22.4%                             | 43.7%                                   | 33.9%
NIH (72.7%)    | 7.4%                              | 45.0%                                   | 47.6%
Total          | 11.5%                             | 44.6%                                   | 43.9%

Pienta, Alter, & Lyle (2010). “The Enduring Value of Social Science Research: The Use and Reuse of Primary Research Data”. http://hdl.handle.net/2027.42/78307

2. Enhancing Data (Curating)

A well-prepared data collection “contains information intended to be complete and self-explanatory” for future users.

A corollary: Do no harm.

http://img.gawkerassets.com/img/17xbuy519gga2jpg/ku-xlarge.jpg

Data

Documentation

http://dx.doi.org/10.3886/ICPSR31521.v1


Disclosure Issues

• Direct Identifiers?
– personal names
– addresses (including ZIP codes)
– telephone numbers
– social security numbers
– driver license numbers
– patient numbers
– certification numbers

Disclosure Issues

• Indirect Identifiers?
– detailed geography (i.e., state, county, or census tract of residence)
– exact date of birth
– exact occupations held
– exact dates of events
– detailed income

Disclosure Issues

• External Linkages?
– public patient/medical records
– court records
– police and correction records
– Social Security records
– Medicare records
– driver’s licenses
– military records
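A first-pass screen for the identifier categories above could be sketched as a keyword check against variable names. The keyword lists and the function name `flag_identifiers` are assumptions made for illustration, not an ICPSR tool. Substring matching will produce false positives (for example, "update" contains "date"), and an actual confidentiality review also has to look at the values themselves and at the risk of external linkage.

```python
# Hypothetical keyword lists, loosely based on the identifier types listed above.
DIRECT = ("name", "address", "zip", "phone", "ssn", "license", "patient")
INDIRECT = ("county", "tract", "birth", "occupation", "income", "date")


def flag_identifiers(variables):
    """Return {variable: 'direct' | 'indirect'} for names matching a keyword.

    Variables that match no keyword are left out of the result.
    """
    flags = {}
    for var in variables:
        lowered = var.lower()
        if any(keyword in lowered for keyword in DIRECT):
            flags[var] = "direct"
        elif any(keyword in lowered for keyword in INDIRECT):
            flags[var] = "indirect"
    return flags


if __name__ == "__main__":
    # Made-up variable names for demonstration only.
    print(flag_identifiers(["RESP_NAME", "BIRTH_DATE", "Q12_TRUST"]))
```

A screen like this only narrows down where a human reviewer should look first; it does not decide whether a file is safe to release.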

http://www.flickr.com/photos/k3v1nm/3366181223/

Opportunity

“It saves funding and avoids repeated data collecting efforts, allows the verification and replication of research findings, facilitates scientific openness, deters scientific misconduct, and supports communication and progress.”

Niu (2006). “Reward and Punishment Mechanism for Research Data Sharing.” http://www.iassistdata.org/downloads/iqvol304niu.pdf

“Search/Compare Variables” examines 2.1 million variables in 4,000 data collections

Emerging sources and types of data

• Geo-spatial
• Video
• Administrative data
• Online text
• Transactions
• Clicks
• Sensors

Partnerships

Green, Ann G., and Myron P. Gutmann. (2007) "Building Partnerships Among Social Science Researchers, Institution-based Repositories, and Domain Specific Data Archives." OCLC Systems and Services: International Digital Library Perspectives. 23: 35-53. http://hdl.handle.net/2027.42/41214

“We propose that domain specific archives partner with institution based repositories to provide expertise, tools, guidelines, and best practices to the research communities they serve.”

Support:

http://www.icpsr.umich.edu/icpsrweb/IR/

5 Pilot Data Collections

http://www.flickr.com/photos/smithsonian/2551170386/

Selection & Appraisal

Recovery

Finding interested partners

http://www.flickr.com/photos/usnationalarchives/4726917373/

Time & Willingness

http://www.flickr.com/photos/floridamemory/7026619371/

Inter-university Consortium for Political and Social Research. Survey of Data Curation Services for Repositories, 2012. ICPSR34302-v1. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2012-09-21. doi:10.3886/ICPSR34302.v1

Survey of Repositories’ Data Needs

• Media recovery, format migration, data recovery
• Cost estimating and policy review
• Metadata tools, documentation, and catalog linkages
• Support networks and training
• Confidential data dissemination and confidentiality review

Repository Suggested Solutions:

1. Community Wayfinder

http://www.icpsr.umich.edu/files/ICPSR/access/dataprep.pdf

2. Confidentiality Review & Treatment

• Suppressing unique cases
• Grouping values (e.g., 13-29=1, 30-49=2)
• Top-coding (e.g., >1,000=1,000)
• Aggregating geographic areas
• Swapping values
• Sampling within a larger data collection
• Adding “noise”
• Replacing real data with synthetic data

http://www.icpsr.umich.edu/icpsrweb/content/DSDR/tools/qualanon.html
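Three of the treatments listed above (grouping values, top-coding, and suppressing unique cases) can be sketched in a few lines. This is a minimal illustration, not ICPSR's procedure: the function names, the record layout, and the handling of ages outside the two bands given on the slide are all assumptions.

```python
from collections import Counter


def group_age(age):
    """Recode exact age into the slide's example bands: 13-29 -> 1, 30-49 -> 2."""
    if 13 <= age <= 29:
        return 1
    if 30 <= age <= 49:
        return 2
    # Assumption: the slide gives only two bands, so other ages are left uncoded.
    return None


def top_code(value, cap=1000):
    """Replace any value above the cap with the cap itself (>1,000 -> 1,000)."""
    return min(value, cap)


def suppress_unique(records, keys):
    """Drop records whose combination of key values occurs exactly once."""
    counts = Counter(tuple(record[k] for k in keys) for record in records)
    return [r for r in records if counts[tuple(r[k] for k in keys)] > 1]


if __name__ == "__main__":
    # Made-up records for demonstration only.
    people = [
        {"county": "A", "age_group": group_age(25), "income": top_code(2500)},
        {"county": "A", "age_group": group_age(19), "income": top_code(400)},
        {"county": "B", "age_group": group_age(41), "income": top_code(900)},
    ]
    print(suppress_unique(people, keys=("county", "age_group")))
```

In the demonstration the lone county "B" record is dropped because its combination of county and age group appears only once, which is the kind of case that makes a respondent easy to single out.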

3. Access to Processing Tools

The Virtual Data Enclave (VDE) provides remote access to quantitative data in a secure environment.

Hermes Outputs

• ASCII data files
– Column- and tab-delimited
• Stat package setup files
– SAS, SPSS, Stata (.do and .dct)
• “Ready-to-go” data files
– SAS transport (CPORT engine)
– SPSS system (.sav)
– Stata system (.dta)
– R (.rda)
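The difference between the two ASCII layouts can be sketched as follows. This assumes that "column-delimited" means a fixed-width layout in which each variable occupies set character positions, which is why such files need a setup file to be read. The function names and the widths are hypothetical, and this is not how Hermes itself is implemented.

```python
def to_tab_delimited(rows):
    """Join each row's values with tabs, one record per line."""
    return "\n".join("\t".join(str(v) for v in row) for row in rows)


def to_fixed_columns(rows, widths):
    """Right-justify each value in a fixed-width field, one record per line.

    Assumes every value fits within its declared width.
    """
    return "\n".join(
        "".join(str(v).rjust(w) for v, w in zip(row, widths)) for row in rows
    )


if __name__ == "__main__":
    # Made-up records: case ID, age, income.
    data = [(1, 25, 1000), (2, 7, 50)]
    print(to_tab_delimited(data))
    print(to_fixed_columns(data, widths=[2, 3, 5]))
```

The tab-delimited form is self-describing enough to open in most software, while the fixed-width form depends on the column positions recorded in the accompanying setup files.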

Useful categories for discussion?
• Media recovery, format migration, data recovery
• Cost estimating and policy review
• Metadata tools, documentation, and catalog linkages
• Support networks and training
• Confidential data dissemination and confidentiality review

Your ideas on partnerships?

Thank you!

lyle@umich.edu