Data quality control, Data formats and preservation, Versioning and authenticity, Data storage...

Post on 11-Jan-2016


Data quality control, data formats and preservation, versioning and authenticity, data storage

Managing research data well workshop

London, 30 June 2009

Manchester, 1 July 2009

2

Good data management

• good research
• high quality data
• needs to be planned
• specific for purpose
• data can be understood and used now and in future
• data can then be shared and re-used

3

Can you understand / use these data?

SrvMthdDraft.doc

SrvMthdFinal.doc

SrvMthdLastOne.doc

SrvMthdRealVersion.doc

4

Quality control

Data quality control at various stages:

• data collection – e.g. instrument calibration; expert opinion; multiple measurements; computer assisted interviews

• data entry, digitisation, transcription and coding - standardised and consistent procedures
– e.g. set up validation rules for data entry; use input masks; detailed variable labelling; missing value coding; use controlled vocabularies or choice lists; best structure to organise data and data files

• data checking and verifying - automated and/or manual
– e.g. double entry; check for out-of-range values; apply random sample validation; statistical analyses (descriptives, frequencies, means, range, clustering) to detect errors or find anomalous values; verify data completeness
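Two of the checks listed above, out-of-range detection and double entry, lend themselves to automation. The following is a minimal sketch in Python; the variable name `age`, the valid range 0 to 120 and the missing-value code -9 are illustrative assumptions, not taken from the workshop material.

```python
# Illustrative sketch only: field names, the valid range and the
# missing-value code are assumptions made for this example.

MISSING = -9  # assumed code for "missing value"


def out_of_range(records, field, low, high):
    """Return indices of records whose value lies outside [low, high].

    Values equal to the missing-value code are skipped, not flagged.
    """
    flagged = []
    for i, rec in enumerate(records):
        value = rec[field]
        if value == MISSING:
            continue
        if not (low <= value <= high):
            flagged.append(i)
    return flagged


def double_entry_mismatches(first_pass, second_pass):
    """Compare two independent entries of the same records.

    Returns (record index, field name) pairs where the passes disagree.
    """
    if len(first_pass) != len(second_pass):
        raise ValueError("both passes must contain the same number of records")
    mismatches = []
    for i, (a, b) in enumerate(zip(first_pass, second_pass)):
        for field in a:
            if a[field] != b.get(field):
                mismatches.append((i, field))
    return mismatches


# The same three records keyed in twice; the first pass contains a typo.
entry_1 = [{"id": 1, "age": 34}, {"id": 2, "age": 340}, {"id": 3, "age": MISSING}]
entry_2 = [{"id": 1, "age": 34}, {"id": 2, "age": 34}, {"id": 3, "age": MISSING}]

print(out_of_range(entry_1, "age", 0, 120))       # [1]
print(double_entry_mismatches(entry_1, entry_2))  # [(1, 'age')]
```

Both checks only point at suspect records; deciding which value is correct still requires going back to the source.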

5

Data formats

• choice of software format for digital data:
– planned data analyses
– software availability
– hardware used
– discipline specific standards and customs

• digital data software dependent

• digital data endangered by obsolescence of software/hardware

• best formats for long-term preservation - standard formats, interchangeable formats, open formats

– e.g. tab-delimited; comma-delimited (CSV); ASCII; OpenDocument format; SPSS portable; XML
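As an illustration of writing data to two of the open formats named above, the sketch below uses Python's standard `csv` module to produce tab-delimited and comma-delimited text. The column names and values are invented for the example.

```python
import csv
import io

# Invented example records; any list of dicts with the same keys would do.
rows = [
    {"respondent": "R001", "region": "North West", "score": 4},
    {"respondent": "R002", "region": "London", "score": 5},
]


def to_delimited(rows, delimiter):
    """Serialise a list of dicts as delimited text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=list(rows[0]),
        delimiter=delimiter,
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


tab_text = to_delimited(rows, "\t")   # tab-delimited
csv_text = to_delimited(rows, ",")    # comma-delimited (CSV)
print(tab_text)
```

The `csv` module quotes any value that contains the delimiter, so free-text fields with commas or tabs survive the export. Variable labels and value codes are not carried by either format and would need to be documented separately.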

6

Data format conversions

• convert data for preservation or back-up, e.g. export, save as
• beware of conversion errors:
– loss of internal metadata
> e.g. convert MS Access to tab-delimited tables
– loss of editing, formatting, formulae
> e.g. convert MS Word to RTF
– truncation or loss of data
> e.g. string variables lost in SPSS – STATA conversion

• check for errors and changes after conversion

Example 1: MS Excel to tab-delimited
Example 2: Word to XML
Example 3: Proprietary audio file (DVF) to WAV
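Checking for changes after a conversion can be partly automated by reading the converted file back and comparing it with the source records. The sketch below assumes the converted file is delimited text with a header row; the function name and the sample records are hypothetical.

```python
import csv
import io


def compare_after_conversion(original_rows, converted_text, delimiter="\t"):
    """List differences between source rows and rows read back from delimited text.

    Values are compared as strings, because delimited text carries no
    type information.
    """
    converted_rows = list(csv.DictReader(io.StringIO(converted_text), delimiter=delimiter))
    problems = []
    if len(converted_rows) != len(original_rows):
        problems.append(
            f"row count changed: {len(original_rows)} -> {len(converted_rows)}"
        )
    for i, (src, dst) in enumerate(zip(original_rows, converted_rows)):
        for field, value in src.items():
            if field not in dst:
                problems.append(f"row {i}: field '{field}' missing")
            elif str(value) != dst[field]:
                problems.append(
                    f"row {i}: '{field}' changed from {value!r} to {dst[field]!r}"
                )
    return problems


original = [{"id": 1, "comment": "Very satisfied with service"}]

# A faithful conversion and one where a string variable was truncated.
good_text = "id\tcomment\n1\tVery satisfied with service\n"
truncated_text = "id\tcomment\n1\tVery sat\n"

print(compare_after_conversion(original, good_text))       # []
print(compare_after_conversion(original, truncated_text))
```

A comparison like this catches truncated or dropped values, but not the loss of formatting, formulae or internal metadata, which still has to be checked by hand.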

7

MS Excel format

Tab-delimited text format

8

Version control

• keep track of different copies or versions of data files

• which methods:
› single site vs. across locations
› single vs. multiple users
› different versions to be stored vs. files to be synchronised

• single user of data files:
› file naming – unique file names with date or version number (avoid spaces!), e.g. FoodInterview_1_draft; FoodInterview_1_final; HealthTests_06-04-2008; BGHSurveyProcedures_00_04
› version control table or file history within or alongside data file
› version control facility within software, e.g. MS Windows software

• multiple users of data files:
› same as above
› control rights to file editing: read/write permissions, e.g. Windows Explorer
› versioning/file sharing software: check files out/in, e.g. SVN, VSS, Google Docs, Amazon S3
› manual merging of multiple entries/edits

• synchronise files, e.g. MS SyncToy software
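The file naming advice (unique names, a date or version number, no spaces) can be applied consistently with a small helper. This is a sketch under assumptions: the pattern `base_vNN_YYYY-MM-DD.ext` and the function name are choices made for this example, not a convention prescribed by the slides.

```python
import re
from datetime import date


def versioned_name(base, version, when, extension):
    """Build a file name of the form 'Base_vNN_YYYY-MM-DD.ext'.

    Whitespace in the base name is removed, following the
    "avoid spaces" advice.
    """
    if version < 0:
        raise ValueError("version must not be negative")
    safe_base = re.sub(r"\s+", "", base)
    return f"{safe_base}_v{version:02d}_{when.isoformat()}.{extension}"


print(versioned_name("Food Interview", 2, date(2008, 4, 6), "doc"))
# FoodInterview_v02_2008-04-06.doc
```

The date is written year first so that an alphabetical file listing is also chronological; the zero-padded version number serves the same purpose for versions 1 to 99.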

9

Authenticity of data

• master files
• assign responsibility for master files
• record changes to master files
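One simple way to record changes to a master file is a change log kept alongside it. The sketch below appends one row per change to a tab-delimited log; the column names, file names and the person named are all hypothetical.

```python
import csv
import tempfile
from datetime import datetime, timezone
from pathlib import Path

# Assumed log layout; adapt the columns to the project's needs.
LOG_FIELDS = ["timestamp_utc", "master_file", "changed_by", "description"]


def record_change(log_path, master_file, changed_by, description):
    """Append one row to a tab-delimited change log, creating it if needed."""
    log_path = Path(log_path)
    is_new = not log_path.exists()
    with log_path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=LOG_FIELDS, delimiter="\t")
        if is_new:
            writer.writeheader()
        writer.writerow({
            "timestamp_utc": datetime.now(timezone.utc).isoformat(timespec="seconds"),
            "master_file": master_file,
            "changed_by": changed_by,
            "description": description,
        })


def read_changes(log_path):
    """Return the logged changes, oldest first, as a list of dicts."""
    with Path(log_path).open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


# Demonstration in a temporary folder so that nothing is left behind.
with tempfile.TemporaryDirectory() as folder:
    log = Path(folder) / "survey_master_changes.tab"
    record_change(log, "survey_master.tab", "A. Researcher", "recoded missing values")
    record_change(log, "survey_master.tab", "A. Researcher", "corrected case 17")
    history = read_changes(log)

print(len(history))               # 2
print(history[1]["description"])  # corrected case 17
```

The log records who changed what and when; it does not enforce responsibility for the master file, which remains an organisational decision.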

10

Data storage

• digital storage media unreliable
• file formats and physical storage media ultimately become obsolete
• optical (CD, DVD) and magnetic media (hard drive, tapes) vulnerable and subject to physical degradation

Best practice:
• use data formats with long-term readability
• storage strategy with at least two different forms of storage
• copy/migrate data files to new media between two and five years after first created
• check data integrity of stored data files at regular intervals (checksum)
• know your back-up strategy: institutional/personal; network server/PC/laptop
• maintain original copy, external local copy and external remote copy
• test file recovery
• Data Protection Act and data back-up – may require minimal data copies for personal data; secure storage
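The checksum-based integrity check mentioned above can be sketched as follows: compute a checksum for every stored file once, keep that manifest, and recompute at regular intervals to see whether anything has changed. SHA-256 and the function names are choices made for this example; the slides do not name a particular algorithm.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder):
    """Map each file's path (relative to folder) to its checksum."""
    folder = Path(folder)
    return {
        str(p.relative_to(folder)): sha256_of(p)
        for p in sorted(folder.rglob("*"))
        if p.is_file()
    }


def changed_files(folder, manifest):
    """Return files from the manifest that are missing or no longer match."""
    current = build_manifest(folder)
    return sorted(
        name for name, checksum in manifest.items()
        if current.get(name) != checksum
    )


# Demonstration in a temporary folder with one small data file.
with tempfile.TemporaryDirectory() as folder:
    data_file = Path(folder) / "survey.tab"
    data_file.write_text("id\tscore\n1\t4\n", encoding="utf-8")
    manifest = build_manifest(folder)
    unchanged = changed_files(folder, manifest)

    # Simulate silent corruption of the stored file.
    data_file.write_text("id\tscore\n1\t5\n", encoding="utf-8")
    damaged = changed_files(folder, manifest)

print(unchanged)  # []
print(damaged)    # ['survey.tab']
```

The manifest itself should be stored with each copy of the data, so that the local and remote copies can each be verified and a damaged file can be restored from a copy that still matches.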

11

Example: data storage and preservation at UKDA

Multi-copy, multi-storage media and multi version resilience:

• preservation copy (UKDA)
• shadow copy (UKDA)
• dissemination copy to reduce load on main system
• near-site online copy (on campus)
• off-site online copy
• tape-based offline copy (UKDA)

Diagram labels: scheduled nightly; robotic 3-monthly

12

Good data management practice

• plan data management early
• assign roles and responsibilities
• design data management according to needs and purpose of research
• data management throughout research

13

Resources

• ESDS (2008). Guide to good practice: micro data handling and security. http://www.esds.ac.uk/news/publications/microDataHandlingandSecurity.pdf

• Finch, L. & Webster, J. (2008). Caring for CDs and DVDs. NPO Preservation Guidance. Preservation in Practice Series. London, National Preservation Office. Available at http://www.bl.uk/npo/pdf/cd.pdf

• UK Data Archive (2009). Manage and Share Data. http://www.data-archive.ac.uk/sharing/

See: http://www.data-archive.ac.uk/sharing/furtherstorage.asp