Data Archiving and Networked Services
DANS is an institute of KNAW and NWO
Census data, CEDAR and the future of Digital Archiving: changing ideas, challenges & opportunities 1996-2014
Peter Doorn
Data Archiving and Networked Services
CEDAR Mini Symposium, Amsterdam, 31st March 2014
Contents
• Two slides about DANS
• Why digitize historical censuses?
• History of the census digitization projects 1996-2006
• Results: CD-ROMs, Websites, Publications
• Digital preservation of the first “digitally born” census of 1960
• Projects and activities since 2006
• Challenges for the years to come
What is DANS?
Institute of the Dutch Academy and Research Funding Organisation (KNAW & NWO) since 2005
First predecessor dates back to 1964 (Steinmetz Foundation), Historical Data Archive 1989
Mission: promote and provide permanent access to digital research information
EASY: Electronic Archiving System for self-deposit
NARCIS: Gateway to scholarly information in the Netherlands
Data Seal of Approval
Persistent Identifier URN:NBN resolver
Our services
Why digitize historical censuses?
• Important source for statistics and research
• Limited number of census books
• Preservation of 19th and 20th century originals
• Digital archiving
• Target audience: researchers, students, local governments, amateur historians, education
Systematic digitization of Dutch Census Books
• 1995/96: possibility raised in talks between CBS and Steinmetz Archive
• 1996: small pilot by CBS and Netherlands Historical Data Archive
  – Selection of material
  – How to digitize?
  – How to store?
  – How to publish?
  – Project plan for continuation project
Digitization in three projects
• 1997 - 1999:
  – Microfilming and scanning 200 books, 42,500 pages
  – Data-entry 10,000 pages Census 1899
• 2002 – March 2004:
  – Checking and correction censuses 1795-1859 and 1930
  – Archiving digitally born census 1960 and 1971
• March 2003 – July 2006: Life Courses in Context
  – First project in humanities funded by NWO “large investments”
  – In collaboration with Historical Sample of the Netherlands (Kees Mandemakers, IISG)
  – Data-entry censuses 1869-1956
  – Scanning handwritten tables 1947 and OCR tests
  – Documentation, harmonisation, “linking”, access, research
Digitizing Censuses: division of tasks
• Collaboration project CBS and NHDA/NIWI/DANS since early 1997
• Subsidized by NWO and KNAW
• CBS:
  – Data entry tables Census 1899
  – StatLine publication
• NIWI:
  – Scanning Census 1795-1971
  – OCR of Introduction to Census 1899
  – First Website Census 1899
Results 1999
• Set of 5 CD-ROMs
  – Images of censuses 1795-1971 (200 books, c. 42,500 pages)
• Set of 2 CD-ROMs
  – Database Census 1899
  – 27 books
  – 10,000 pages > 17,000,000 numbers/characters
• Introduction to Census 1899 (also as Website)
• StatLine publication tables of 1899
• Images 1899
• Conference & book with analyses of the Census 1899 (2001)
CD-ROM publications in September 1999
Book publications
[related projects: Historical GIS, HASH, HDNG]
Website of Introduction to Census 1899
www.volkstelling.nl
Launched in September 1999
Census 1899 also published in CBS StatLine
The 1960 census: the first born digital census in the Netherlands
• First computer at CBS: X1 Electrologica
• 1969: punch cards transferred to Steinmetz Archive
• Kf. 100 needed for reconstructing files
• Bitrot, data input errors and more…
121586004 3995813013110 3 52801322981010 1061121586010 3855413012060 3 52701322981010 106012158W000 3755113010010 2 52801322981010 0061121586001 3406713012050 00 0152701322981010 1860121586003 4225113013110 2 52801322113010 0061
1115100421115 6302120001000995581111405057126086200 B’(‘N3=‘)’5ZD,10B 17601115100421110 1306363301000075-81111718035817732405 SC2+NSC3); 17701115100421116 1305352202000900521111205041728284204 ‘,/’)’); 17801115100421119 4303430001000930521111203038829276500 B’(‘N3=‘)’5ZD,10B 1790
The size of the problem
Persons   Missing persons   Persons too many
Men       183,970           254,100
Women     182,755           7,661
Total     366,725           261,761
Launch button for the completely redesigned website www.volkstellingen.nl
Launched in November 2004
Web statistics 2004-2009
• 194,000 visitors (3,300 per month)
• 2 million page views
• 0.5 TB of data downloaded
[Chart: unique visitors, pages viewed and megabytes downloaded, 10/00 to 02/05; vertical axis ticks at 1,000, 10,000 and 100,000]
Projects and activities since 2006
• Digitization of “transparencies” and collotypes
• NLGIS – historical GIS
• Checking and correction
• Harmonisation
• Archiving in EASY
• Scanning historical data at CBS & CBS website
• HISTEL project
• CEDAR project
Digitization of “transparencies” and collotypes (early photocopies)
Overview of collotypes/transparencies (the columns from "Table content" to "Total" give characters per page)

Census                              Volumes             Pages    Table content  Stub column (printed)  Blank cells  Total  Remark
BDT 1930                            38                  6500     700            500                    550          1750
BDT 1950                            48                  15000    700            500                    400          1600
BDT 1963                            224                 70000    450            500                    350          1300
BDT 1978                            1314 microfiches    72270    350            500                    300          1150   digitally available
BRT 1930                            217                 27615    200            500                    1200         1900
VT & BRT 1947                       80                  29200    450            600                    450          1500
WT 1947                             20                  6935     400            400                    150          950
WT 1956                             131                 47815    500            500                    500          1500
VT 1960                             ?                   75000    500            500                    500          1500   partly digitally available
VT 1971                             printed from files  87000    500            500                    500          1500   digitally available
Total                                                   437335   475            500                    490          1465
Digitally available                                     226770   450            500                    433          1383
Total excl. digitally available                         210565   488            500                    513          1500
2006: Scanning and OCR of transparencies
Scan record attempt, February 2005: Census 1947
c. 12,500 pages scanned in one day
Manual data entry of 1947 Census
• Templates prepared for each table type
• Data entry carried out by Xerox (India)
• Supervision by Jan Jonker
• Archived in and available from DANS EASY
Project idea June 2009: New portal historical population data
Checking and correction
• Most underestimated task of the project
• Ongoing work since 1999…
• Distinction between data-entry / conversion errors and source errors
• Data-entry errors are corrected
• Error detection method based on differences between calculated and given row and column totals
• Source errors are indicated with notes…
• Tom Vreugdenhil is the hero of error checking and correction
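The totals-based error detection mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual tooling: the function name `find_total_mismatches` and the 3x3 table are invented for the example. It recomputes row and column sums from the keyed-in cells and compares them with the totals given in the source table; a cell where a bad row and a bad column intersect is a candidate data-entry error to check against the scan.

```python
from typing import List, Tuple


def find_total_mismatches(
    cells: List[List[int]],
    given_row_totals: List[int],
    given_col_totals: List[int],
) -> Tuple[List[int], List[int]]:
    """Return indices of rows and columns whose calculated sum
    differs from the total given in the source table."""
    bad_rows = [
        i for i, (row, given) in enumerate(zip(cells, given_row_totals))
        if sum(row) != given
    ]
    bad_cols = [
        j for j, given in enumerate(given_col_totals)
        if sum(row[j] for row in cells) != given
    ]
    return bad_rows, bad_cols


# Invented example table (not census data): the printed totals are
# correct, but one cell was keyed in as 6 instead of 9.
printed_row_totals = [22, 18, 15]
printed_col_totals = [25, 17, 13]
keyed_cells = [
    [12, 7, 3],
    [5, 6, 4],  # 6 keyed in place of 9
    [8, 1, 6],
]

bad_rows, bad_cols = find_total_mismatches(
    keyed_cells, printed_row_totals, printed_col_totals
)
# Cells at the intersection of a bad row and a bad column are suspects.
suspect_cells = [(i, j) for i in bad_rows for j in bad_cols]
print(suspect_cells)  # [(1, 1)]
```

A mismatch alone does not say whether the keyed value or the printed total is wrong; that is the data-entry versus source error distinction, which still needs a look at the original page.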
Harmonisation
• Three key variables:
  – Occupations
  – Municipalities
  – Religious denomination
Harmonizing occupations
• Occupations available for 1849, 1889, 1899, 1909, 1920, 1930 and 1947
• Coded according to Historical International Standard Codes of Occupations (HISCO)
• Results:
  – Coded occupations and exact content and context of each table with unique occupational titles (Excel & Access)
  – Total of all unique occupational titles in the censuses (Excel & Access)
  – Excel Workbook Lookup tool to code occupations automatically
  – Excel Workbook HISCO toolbar to search for codes, occupational titles and descriptions of occupations in the HISCO database
Harmonizing municipalities
• Based on the work by Onno Boonstra and Ad van der Meer “Repertorium van Nederlandse gemeenten 1812-2006”
• New standard code (“Amsterdam code”) for all Dutch municipalities that have ever existed
• Database tool to code municipalities in the censuses
ID  amsterdamse_code  begindatum  einddatum  gemeente_prov  gemeente    provincie
1   10001             1-1-1812    1-10-1816  Almenum        Almenum     Friesland
2   10002             1-1-1812    1-12-1999  Zuidlaren      Zuidlaren   Drenthe
3   10002             1-12-1999              Tynaarlo       Tynaarlo    Drenthe
4   10003             1-1-1812    30-1-1820  Zeddam         Zeddam      Gelderland
5   10004             1-1-1812               Zijpe          Zijpe       Noord-Holland
6   10005             1-10-1816              Opsterland     Opsterland  Friesland
7   10005             1-1-1812    1-10-1816  Ureterp        Ureterp     Friesland
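A minimal Python sketch of how such a table could be used to code a municipality name found in a census. It uses only the seven rows shown on the slide; the `MunicipalityPeriod` type and `lookup_code` function are hypothetical names, and treating `einddatum` as exclusive is an assumption (made because Zuidlaren ends and Tynaarlo begins on the same day).

```python
from datetime import date
from typing import List, NamedTuple, Optional


class MunicipalityPeriod(NamedTuple):
    amsterdamse_code: int
    begindatum: date
    einddatum: Optional[date]  # None: no end date given in the table
    gemeente: str
    provincie: str


# The seven rows shown on the slide.
PERIODS: List[MunicipalityPeriod] = [
    MunicipalityPeriod(10001, date(1812, 1, 1), date(1816, 10, 1), "Almenum", "Friesland"),
    MunicipalityPeriod(10002, date(1812, 1, 1), date(1999, 12, 1), "Zuidlaren", "Drenthe"),
    MunicipalityPeriod(10002, date(1999, 12, 1), None, "Tynaarlo", "Drenthe"),
    MunicipalityPeriod(10003, date(1812, 1, 1), date(1820, 1, 30), "Zeddam", "Gelderland"),
    MunicipalityPeriod(10004, date(1812, 1, 1), None, "Zijpe", "Noord-Holland"),
    MunicipalityPeriod(10005, date(1816, 10, 1), None, "Opsterland", "Friesland"),
    MunicipalityPeriod(10005, date(1812, 1, 1), date(1816, 10, 1), "Ureterp", "Friesland"),
]


def lookup_code(gemeente: str, on_date: date) -> Optional[int]:
    """Return the code for a municipality name valid on a given date,
    or None if no period matches. The end date is treated as exclusive
    (an assumption, see the note above)."""
    for p in PERIODS:
        if p.gemeente.lower() != gemeente.lower():
            continue
        if p.begindatum <= on_date and (p.einddatum is None or on_date < p.einddatum):
            return p.amsterdamse_code
    return None


# Ureterp in 1815 and Opsterland in 1899 resolve to the same code,
# which is what allows their census figures to be linked over time.
print(lookup_code("Ureterp", date(1815, 1, 1)))
print(lookup_code("Opsterland", date(1899, 12, 31)))
```

Keeping the validity period next to the name means a name that no longer existed at census time returns no code instead of a wrong one.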
CBS Historical Collection website: 19th and 20th century publications
HISTEL project
Umbrella project to oversee the various ongoing census activities, supervised by René van Horik:
• Transfer of data, website
  – New agreement between CBS and DANS
  – Publish www.volkstellingen.nl as extended data guide / paper in new DANS data journal
• "Anonymous open access" to the census data in EASY
• Archiving of existing data and newly scanned tables in EASY
• Version management, updating corrected tables
• Liaison with CEDAR
Archiving everything in EASY
Why a CEDAR project?
• Great examples of LOD projects on new census data
  – Are they applicable to historical tables?
• The historical censuses are stored in numerous containers in an archival silo
  – Can we open up the containers and silos to connect the data?
  – Can we make the data comparable over time?
  – Can we link it to outside sources?
• Is it viable to publish the whole DANS archive as LOD?
  – Provide insight into the possibilities for more data collections
Lots of challenges left…
• CEDAR: publishing the historical censuses as LOD
  – First priority for linking: linking the census data over time
  – Further harmonization is a prerequisite for this
  – LOD offers new insight into the extent of the harmonization problem and a systematic solution (we expect ;-)
• Archiving LOD
  – PRELIDA (PREserving Linked Data) project offers insight into the requirements and options
  – Storing the RDF is only part of the answer
• Lots of images of historical census tables left to turn into figures
• Preserving the census services: www.volkstellingen.nl no longer supported, NLGIS tool already gone
• Wish for 2020: a user-friendly tool to link historical census data over time and to external sources
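To make "publishing the historical censuses as LOD" concrete, here is a small Python sketch that writes one table cell as RDF in N-Triples syntax. It is an illustration under assumptions, not CEDAR's actual data model: typing the cell as an observation from the W3C RDF Data Cube vocabulary is an assumed modelling choice, and the `http://example.org/census/` namespace, the property names and the `observation_triples` function are hypothetical. The value is the "missing persons, men" figure from the table shown earlier, assumed here to belong to the 1960 census.

```python
from typing import List

# Real vocabulary terms.
QB = "http://purl.org/linked-data/cube#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"
# Hypothetical namespace, used for illustration only.
BASE = "http://example.org/census/"


def observation_triples(
    obs_id: str, year: int, sex: str, measure: str, value: int
) -> List[str]:
    """Serialise one census table cell as N-Triples lines."""
    subject = f"<{BASE}obs/{obs_id}>"
    return [
        f"{subject} <{RDF_TYPE}> <{QB}Observation> .",
        f'{subject} <{BASE}prop/censusYear> "{year}"^^<{XSD_INTEGER}> .',
        f"{subject} <{BASE}prop/sex> <{BASE}code/sex/{sex}> .",
        f'{subject} <{BASE}prop/{measure}> "{value}"^^<{XSD_INTEGER}> .',
    ]


triples = observation_triples(
    "1960-men-missing", 1960, "men", "missingPersons", 183970
)
print("\n".join(triples))
```

Because the sex code is a URI and not a string, the same code can be reused across census years, which is the kind of shared reference point that linking over time depends on.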
Thank you for your attention
www.dans.knaw.nl
www.narcis.nl
[email protected]
@pkdoorn