
A brief history of markup of social science data: from punched cards to the ‘life cycle’...

A brief history of markup of social science data: from punched cards to the ‘life cycle’ approach. Presentation to: International symposium on XML for the long haul (Balisage 2010 pre-conference) By Laine G.M. Ruus, Librarian emeritus, University of Toronto 2010-08-02
Page 1

A brief history of markup of social science data: from punched cards to the ‘life cycle’ approach

Presentation to:

International symposium on XML for the long haul (Balisage 2010 pre-conference)

By Laine G.M. Ruus, Librarian emeritus, University of Toronto

2010-08-02

http://www.chass.utoronto.ca/~laine/misc/balisage2010.ppt

Page 2

Overview:

• What are data (that is, quantitative social science data)?

• History of social science quantitative data and metadata

• Lessons learned

Page 3

What are data?

Page 4

Data are…

• Representations of selected characteristics of a population of entities, e.g. individuals, companies, periods of time, etc.

• Characteristics are grouped, and variations of a characteristic are assigned (normally) numeric values

• Assigning numeric values to variations of a characteristic allows their manipulation by mathematical/statistical procedures
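
A minimal sketch of what this coding step can look like in practice. The coding scheme, variable names and respondents below are hypothetical, chosen only to illustrate the idea:

```python
from collections import Counter

# Hypothetical coding scheme: every variation of a characteristic
# is assigned a numeric code.
GENDER_CODES = {"male": 1, "female": 2}
PROVINCE_CODES = {"Quebec": 24, "Ontario": 35}


def encode(respondent):
    """Turn one respondent's answers into a tuple of numeric codes."""
    return (
        GENDER_CODES[respondent["gender"]],
        PROVINCE_CODES[respondent["province"]],
    )


# Hypothetical respondents (the "population of entities").
respondents = [
    {"gender": "female", "province": "Ontario"},
    {"gender": "male", "province": "Quebec"},
    {"gender": "female", "province": "Quebec"},
]

coded = [encode(r) for r in respondents]

# Once coded, the values can be counted and manipulated statistically.
gender_counts = Counter(gender for gender, _ in coded)
print(coded)
print(gender_counts)
```

Without the coding scheme, the numbers in `coded` cannot be interpreted, which is why the metadata matter as much as the data.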

Page 5

wisdom

knowledge

information (statistics)

data

Page 6

Data and statistics are not equals

• Statistics are of two kinds:
  – Descriptive statistics: summaries of common characteristics of the raw data units (one-way tables, two-way tables … multi-way tables)
  – Inferential statistics: measure strength and direction of relationships among characteristics of raw data units
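
The distinction can be sketched with a small example. The microdata below are hypothetical. The two-way table is a descriptive statistic. The chi-square statistic is one common inferential measure: it tests whether two characteristics are associated, and says nothing about the direction of the relationship.

```python
from collections import Counter

# Hypothetical microdata: one (gender, answer) pair per respondent.
records = (
    [("M", "yes")] * 30 + [("M", "no")] * 20
    + [("F", "yes")] * 20 + [("F", "no")] * 30
)


def crosstab(rows):
    """Descriptive statistic: a two-way table of cell counts."""
    return Counter(rows)


def chi_square(table):
    """Inferential statistic: Pearson chi-square for a two-way table."""
    row_totals, col_totals = Counter(), Counter()
    for (row, col), count in table.items():
        row_totals[row] += count
        col_totals[col] += count
    total = sum(table.values())
    statistic = 0.0
    for row in row_totals:
        for col in col_totals:
            # Count expected in this cell if the two characteristics
            # were unrelated.
            expected = row_totals[row] * col_totals[col] / total
            observed = table.get((row, col), 0)
            statistic += (observed - expected) ** 2 / expected
    return statistic


table = crosstab(records)
print(table)
print(chi_square(table))
```

Both functions start from the raw data units; neither the table nor the statistic can be turned back into the original records.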

Page 7

Of course, statistics (descriptive or inferential) can become data in their turn, and be used in other statistical procedures.

Page 8
Page 9

Data and statistics are not equals (cont’d)

• I.e., data are:
  – the raw materials from which statistics are generated
  – ideally, available at the level at which the data were originally collected (= microdata)
  – in need of manipulation with statistical software in order to be comprehensible

Page 10

Data

Page 11

raw data

Page 12

Metadata

Page 13

record layout

Page 14

variable description (aka data dictionary)

Page 15

gender
province

Page 16

syntax file for SPSS

Page 17

Metadata are…

• Instructions to explain the content and coding of a data set (whether numeric, alphabetic, or other), and aid in its correct interpretation

• Can be intended for human or computer consumption, but are ideally intended for both

Page 18

Raw data + a syntax file, processed through a statistical software package, result in a system file, which has an average shelf life of less than 10 years
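
A rough sketch of what a statistical package does when it combines raw data with the record layout and value labels from a syntax file. The layout, labels and data lines here are hypothetical, and the result is an ordinary Python structure rather than any real system file format:

```python
# Hypothetical record layout: (variable name, start column (1-based), width).
RECORD_LAYOUT = [
    ("respondent_id", 1, 4),
    ("gender", 5, 1),
    ("province", 6, 2),
]

# Hypothetical value labels, as a data dictionary would supply them.
VALUE_LABELS = {
    "gender": {"1": "Male", "2": "Female"},
    "province": {"24": "Quebec", "35": "Ontario"},
}


def parse_record(line):
    """Cut one fixed-width record into labelled variables."""
    values = {}
    for name, start, width in RECORD_LAYOUT:
        raw = line[start - 1:start - 1 + width]
        # Use the value label where one exists, else keep the raw code.
        values[name] = VALUE_LABELS.get(name, {}).get(raw, raw)
    return values


# Hypothetical raw data: meaningless without the layout above.
raw_data = ["0001135", "0002224"]

for line in raw_data:
    print(parse_record(line))
```

If `RECORD_LAYOUT` and `VALUE_LABELS` are lost, the strings in `raw_data` remain readable as text but can no longer be interpreted.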

Page 19

The beginnings

• Hollerith cards first used to process the 1890 US census of population

• By the 1930s, public opinion polling was being used to, e.g., predict electoral outcomes
  – The 1936 Literary Digest poll predicted the defeat of Roosevelt in the US presidential election

• Data-gathering make-work projects in the 1930s in the US, such as economic censuses and surveys on unemployment, crop production, etc.

Page 20

By the 1940s

• Polling and survey taking matured

• Beginnings of improved sampling methods, such as Gallup’s quota samples

• 1948 polls chose Dewey over Truman in the US presidential election, leading to the formation of a committee to determine the cause of the error

• The Roper Center was created, the first data archive (1946)

• Data stored on punched cards, and analyzed using card sorters and similar equipment

• And metadata usually looked like this…

Page 21

The metadata for the May 1945 Canadian Gallup Poll…

Page 22

The 1950s…

• UNIVAC 1, the first alphanumeric computer

• UNIVAC 1 correctly predicted the Eisenhower sweep in the 1952 US presidential election

• MIT began working on keyboard entry

• Development of the COBOL compiler and Fortran

• Magnetic tapes, at 200 bpi, could store the contents of 70,000 punched cards, i.e. about 5.6 megabytes of data

• Lucci & Rokkan promoted the idea of data management by libraries
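
The tape capacity figure above can be checked with simple arithmetic. The sketch assumes standard 80-column cards, one character per column, one byte per character, and decimal megabytes; none of these assumptions is stated on the slide.

```python
# Assumption: standard 80-column punched card, one character per column.
COLUMNS_PER_CARD = 80
# Assumption: one byte per character, decimal megabytes (1,000,000 bytes).
BYTES_PER_CHARACTER = 1
BYTES_PER_MEGABYTE = 1_000_000


def cards_to_megabytes(cards):
    """Storage needed to hold the full contents of a deck of cards."""
    total_bytes = cards * COLUMNS_PER_CARD * BYTES_PER_CHARACTER
    return total_bytes / BYTES_PER_MEGABYTE


print(cards_to_megabytes(70_000))  # 5.6
```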

Page 23

But the metadata for the August 1958 Canadian Gallup poll still looked like this…

Page 24

1960s…

• Development of BASIC, the Unix operating system, and ASCII, which allowed interchange of data among different computers

• Statistical software packages: DATA-TEXT, SPSS, P-STAT, BIOMED, NUCROS, SAS

• Magnetic tapes moved from 556 to 800 bpi

• Most social scientists were still writing their own local software, or using card-sorters and calculators to produce cross-tabulations and compute chi-squares

Page 25

1970s a watershed decade…

• Microprocessors, and 8” and later 5-1/4” diskettes

• Wang word processor, Ataris, Apple 1 and the Commodore PET

• dBASE, VisiCalc and WordStar

• ARPANET and expansion of time-sharing and online systems

• Online bibliographic services such as Dialog, BRS, and Orbit

Page 26

1970s (cont’d)

• David Nasatir wrote the first manual on data management under the aegis of UNESCO (1972)

• Mid-decade saw the creation of IASSIST, and the first training at ICPSR for data librarians

• US census of population 1970 partly disseminated on computer tapes instead of print, forcing libraries to consider this new medium

Page 27

1970s (cont’d)

• OSIRIS software, developed at the University of Michigan, included statistical capabilities as well as outstanding data and metadata management

• NSF funded the National Conference on Cataloging and Information Services for Machine-Readable Data Files at Airlie House in Virginia

• US Department of Justice funded the project which resulted in Roistacher’s Style manual for machine-readable data files, covering bibliographic identity, methodology, and data dictionary

Page 28

An OSIRIS codebook generally followed the Roistacher recommendations. The record layout and data dictionary portion looked like this:

Page 29

1980s

• Supercomputers and NSFNET changed the face of large scale computing, and PCs and Macs did the same for small scale computing

• BITNET, followed by the Internet, provided e-mail, listservs and remote login

• Tape cartridges held the equivalent of 8 million cards, or four times the capacity of a 6250 bpi tape. Five megabyte hard drives became available for microcomputers

• IBM brought microcomputing to the academic sector

• CD-ROMs, and the Cuadra directory of databases

Page 30

1980s (cont’d)

• Sue Dodd’s Cataloging machine-readable data files : an interpretive manual, 1982

• Social forces was one of the first journals to include guidelines on citing machine-readable data files

• Population index was the first bibliographic journal to cite data files

• A draft revision of AACR2 chapter 9 (renamed: Computer Files) was published in 1987 – bibliographic control for data files

Page 31

1990s

• Migration from IBM mainframes (EBCDIC) to Unix (ASCII)

• Demise of tapes for storage, in favour of widespread use of CD-ROM

• Statistics Canada made the electronic products from the census the primary product

• Gopher, developed in 1991, was replaced by the WWW and HTML, and by 1996 there were about 100,000 web servers

• Beginning of the DDI (Data documentation initiative) project in 1995, which published its first DTD in 1996

Page 32

Three major developments led up to DDI:

• OSIRIS’ metadata management capability

• Roistacher’s outline of machine-readable data file documentation (1980)

• Dodd’s cataloguing manual (1982)

Page 33

OSIRIS metadata

• OSIRIS dictionary provided structural information: location, size, missing data, a variable name and a variable label (brief)

• OSIRIS codebook provided a tagged format:
  – Introduction (unstructured)
  – Full question text
  – Variable values and value labels
  – Variable-level comments

• North American institutions standardized on the OSIRIS type-1 and type-4 codebooks, Europe on the type-3 format codebook

Page 34

Roistacher’s style manual

• Provided an outline of the information that should be contained in the full metadata (aka codebook), including
  – Bibliographic identity
  – Project history
  – File processing summary
  – Data dictionary contents
  – Recommended appendices

Page 35

Sue Dodd’s cataloguing manual

• Further refined the bibliographic identity component of the metadata

• Provided a cross-walk to AACR cataloguing rules

• Provided the foundation for the development of a MARC record

• Dodd also defined the components of a bibliographic citation

Page 36
Page 37

Many kinds of metadata for many purposes

• Data collection

• Data interpretation

• Data preservation

• Data discovery

• Coding standardization

Page 38

Based on the NISO metadata classification:

Descriptive
• MARC records
• RAD records
• Thesauri
• Concordances

Structural
• Syntax files for e.g. SAS or SPSS
• Programming syntax
• Record layouts
• Data dictionaries
• Missing data specifications
• Definitions of derived variables

Administrative
• Project conception, implementation and funding
• Methodology reports, sampling frames, etc.
• Questionnaires and data collection protocols
• Interviewer instructions
• Post-processing, weighting, etc.
• Access and dissemination restrictions
• Question banks

Page 39

DDI provides a format …

• From which other subtypes of metadata (bibliographic records, syntax files, question banks, etc.) can be generated

• That describes not just microdata, but also provides an intelligent means of describing aggregate statistics as data

• That can incorporate all documentation from original project conception to edition management and post-processing
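
As an illustration of generating one metadata subtype from another, the sketch below derives SPSS-style syntax from a small variable description. The XML fragment is a simplified, hypothetical document that only loosely follows DDI Codebook element names; it is not a valid DDI instance. The generated syntax is an approximation, and the file name `data.txt` is a placeholder.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical DDI-like fragment (not a valid DDI instance).
DDI_LIKE_XML = """
<codeBook>
  <dataDscr>
    <var name="GENDER">
      <location StartPos="5" width="1"/>
      <labl>Gender of respondent</labl>
      <catgry><catValu>1</catValu><labl>Male</labl></catgry>
      <catgry><catValu>2</catValu><labl>Female</labl></catgry>
    </var>
    <var name="PROVINCE">
      <location StartPos="6" width="2"/>
      <labl>Province of residence</labl>
    </var>
  </dataDscr>
</codeBook>
"""


def to_spss_syntax(xml_text, data_file="data.txt"):
    """Build approximate SPSS-style syntax from the variable descriptions."""
    root = ET.fromstring(xml_text)
    columns, var_labels, value_labels = [], [], []
    for var in root.iter("var"):
        name = var.get("name")
        location = var.find("location")
        start = int(location.get("StartPos"))
        end = start + int(location.get("width")) - 1
        columns.append(f"{name} {start}-{end}")
        # findtext only looks at direct children, so this is the
        # variable label, not a category label.
        var_labels.append(f"{name} '{var.findtext('labl')}'")
        categories = [
            f"{cat.findtext('catValu')} '{cat.findtext('labl')}'"
            for cat in var.findall("catgry")
        ]
        if categories:
            value_labels.append(f"{name} " + " ".join(categories))
    lines = [
        f"DATA LIST FILE='{data_file}' FIXED / " + " ".join(columns) + ".",
        "VARIABLE LABELS " + " / ".join(var_labels) + ".",
    ]
    if value_labels:
        lines.append("VALUE LABELS " + " / ".join(value_labels) + ".")
    return "\n".join(lines)


syntax = to_spss_syntax(DDI_LIKE_XML)
print(syntax)
```

The same parsed description could equally feed a bibliographic record or a question bank; only the output function would change.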

Page 40

DDI provides a format … (cont’d)

• 3rd generation data access tools (Nesstar, DDI, and Dataverse (VDC)) all support DDI 2.0 at present and offer a useful way to provide on-line remote distributed access to data discovery and data

• This leads to a proliferation of new applications of metadata and the realization of initiatives from earlier decades

Page 41

Lessons learned

• Three killers of data:
  – Software dependence
  – Lost metadata
  – Physical medium on which data are stored

• No solution as yet combines data, full metadata and statistical capability in a software-independent format

