
Africa Census Processing Handbook – I. Data Capture

I. DATA CAPTURE

Table of Contents

I. DATA CAPTURE ..... 1
INTRODUCTION ..... 2
I.1 VALUE CHAIN ON DATA CONVERSION TECHNOLOGY ..... 3
I.1.1 OVERVIEW OF THE ISSUES ..... 4
I.1.2 DECISIONS ABOUT DATA CONVERSION TECHNOLOGY ..... 4
I.1.3 FUNDING ..... 5
I.1.4 SYSTEM TESTING DURING PILOTING PHASE OF CENSUS ..... 5
I.1.5 LOCAL TECHNICAL SUPPORT ..... 6
I.1.6 PLANNING AND PREPARATION OF CENSUS DATA PROCESSING: GUIDELINES FROM THE UN PRINCIPLES AND RECOMMENDATIONS ..... 6
I.2 DATA PROCESSING PROJECT MANAGEMENT LIFE CYCLE ..... 7
I.2.1 THE BUSINESS MODEL ..... 7
I.2.2 METHODS AND ELEMENTS OF DATA CAPTURE ..... 9
I.2.2.1 Data acquisition ..... 9
I.2.2.2 Data entry ..... 9
I.2.2.3 Data coding ..... 10
I.2.2.4 Data cleaning ..... 10
I.2.2.5 Data quality ..... 12
I.2.2.6 Data aggregation ..... 12
I.2.2.7 Data validation ..... 12
I.2.2.8 Data tabulation ..... 12
I.2.2.9 Data analysis ..... 13
I.2.2.10 Data warehousing ..... 13
I.2.2.11 Data mining ..... 13
I.3 FORMS DESIGN AND TESTING ..... 13
I.3.1 THE CENSUS QUESTIONNAIRE ..... 13
I.3.1.1 Respondent burden ..... 14
I.3.1.2 Question wording and format ..... 14
I.3.1.3 Layout and design ..... 15
I.3.1.4 Long forms and short forms ..... 15
I.3.2 PROCESSING REQUIREMENTS FOR THE CENSUS QUESTIONNAIRE ..... 15
I.3.2.1 Physical considerations ..... 15
I.3.2.2 Checking requirements for questionnaire layout ..... 16
I.4 PROCESSING ENVIRONMENT ..... 17
I.4.1 CRITERIA FOR SPACE CONSIDERATIONS ..... 17
I.4.2 INFRASTRUCTURE AND SYSTEM ..... 19
I.5 SOFTWARE ..... 19
I.5.1 EVALUATING SOFTWARE ..... 19
I.5.2 TESTING THE SOFTWARE ..... 21
I.5.3 ACQUIRING THE SOFTWARE ..... 22


I.5.4 DEVELOPING SOFTWARE APPLICATIONS IN-HOUSE ..... 24
I.5.5 REGIONAL APPROACHES FOR CENSUS DATA PROCESSING ..... 24
I.6 HARDWARE ..... 24
I.6.1 EVALUATING HARDWARE NEEDS ..... 24
I.6.2 ACQUIRING HARDWARE ..... 25
I.6.3 MANAGEMENT OF THE PRODUCTION ..... 27
I.7 PERSONAL DATA ASSISTANTS (PDAS) AND USE OF THE INTERNET ..... 28
I.7.1 USE OF HAND-HELD DEVICES FOR MAPPING ..... 28
I.7.2 COMPUTER ASSISTED PERSONAL INTERVIEWING: AN INTRODUCTION ..... 29
I.7.3 ESSENTIAL FEATURES OF CAPI ..... 30
I.7.3.1 Interviewing workflow ..... 30
I.7.3.2 Other factors ..... 30
I.7.4 CAPI COMPONENTS ..... 31
I.7.4.1 Software applications ..... 31
I.7.4.2 Computer hardware ..... 32
I.7.4.3 Human resources ..... 33
I.7.5 PDA CONCLUSIONS ..... 33
I.7.6 USE OF THE INTERNET IN DATA COLLECTION ..... 34
I.8 RESOURCE AVAILABILITY: EXTERNAL CONSULTANTS AND OUTSOURCING ..... 34
I.8.1 MANAGING OUTSIDE CONSULTANTS ..... 34
I.8.2 DIFFERING OBJECTIVES ..... 35
I.8.3 SPECIFICATIONS ..... 35
I.8.4 PREPARATION OF SPECIFICATIONS ..... 35
I.8.5 MONITORING THE OUTSOURCED PROJECT OR PROCESS ..... 36
I.9 DATA PROCESSING IMPLEMENTATION ..... 36
I.9.1 RECEIVING OF THE QUESTIONNAIRES ..... 36
I.9.2 DOCUMENT MANAGEMENT ..... 38
I.10 DATA CAPTURE ..... 38
I.10.1 KEY FROM PAPER ..... 38
I.10.2 KEY FROM IMAGE (KFI) ..... 40
I.10.2.1 Advantages ..... 40
I.10.2.2 Disadvantages ..... 41
I.10.2.3 Conclusions ..... 41
I.10.3 OPTICAL CHARACTER RECOGNITION (OCR) / INTELLIGENT CHARACTER RECOGNITION (ICR) ..... 42
I.10.3.1 Advantages ..... 42
I.10.3.2 Disadvantages ..... 43
I.10.3.3 Discussion ..... 43
I.10.4 OPTICAL MARK RECOGNITION (OMR) ..... 43
I.10.4.1 Advantages ..... 44
I.10.4.2 Disadvantages ..... 44
I.10.4.3 Discussion ..... 45
I.11 SCANNING VS KEYING ..... 45
I.11.1 ENTERING THE DATA ..... 46
I.11.1.1 Scanning ..... 46
I.11.1.2 Heads-down keying ..... 47
I.11.1.3 Interactive keying ..... 48
I.11.2 VERIFICATION ..... 49


I.11.3 EDITING CONSIDERATIONS WITH SCANNED DATA ..... 49
I.11.4 CONCLUSION ..... 51
I.12 RELATIONSHIP OF QUESTIONNAIRE FORMAT TO KEYING ..... 51
I.12.2 CODING ..... 54
I.12.3 EDITING ..... 55
I.13 CONCLUSIONS ..... 57


INTRODUCTION¹

1 The United Nations Statistics Division has facilitated the publication of handbooks on the collection, questionnaire design, and dissemination processes of the statistical value chain (see Figure 1). No specific guidelines exist for the data conversion phase, and this part of the publication is intended to fill that gap by serving as a reference on the various ways of carrying out data conversion. Data conversion is the movement of information collected from respondents into an electronic database for editing, tabulation, and analysis.

2 The earliest technology for data capture was "card punching", in which a machine literally punched holes in cards that were then read by another machine and transferred to large magnetic tapes. Technology evolved in the 1970s and 1980s, moving data conversion from punched cards to keying from paper directly into a mainframe computer. Keying continued with the arrival of personal computers, and in the 1990s countries moved more and more to scanning equipment and image processing. Currently, in some countries, respondents can complete a census form on Personal Data Assistants (PDAs) or over the internet.

3 National Statistics Offices (NSOs) in Africa sit at different points on the spectrum of available data conversion technology. NSOs regularly gather to share their experiences with data conversion, usually through presentations and documents. These meetings present an opportunity to compile a reference book for the benefit of countries moving along the data conversion technology spectrum. This handbook presents selected guidelines to assist NSOs.

4 NSOs make many decisions during the planning phases of a project. Timely decisions allow the NSO to establish the data items to be collected, choose appropriate technology to support collection and conversion, specify and test systems adequately, and develop dissemination products. Once NSOs make these decisions, they can develop detailed project plans with the associated resources and budget.

5 Rationale. Over the years, the number of African countries running population censuses and household surveys on a regular basis has been increasing, because:

1. Governments require statistics for planning, monitoring and evaluation.

2. Policy makers require statistics to be used in developing specific programs.

6 Supply of and demand for statistics are in line when NSOs meet these criteria. One of the requirements on the demand side is the time at which the statistical information is delivered after data collection. This demand requires statistical agencies to plan their censuses and surveys intensively. NSOs try to reduce the lead time between the completion of data collection and the release of the results and associated information products. The statistical value chain provides a framework for the processes, and their sequencing, to be undertaken in planning and conducting surveys and censuses.

¹ Statistics South Africa wrote the first draft of this section on data capture. Almost all of the information presented here comes from their long, distinguished contributions in this area.


Figure: Projected African Census Dates for the 2010 Round (Source: UNSD)

I.1 VALUE CHAIN ON DATA CONVERSION TECHNOLOGY

7 Requirements for data collection activities appear in the UN Principles and Recommendations for Population and Housing Censuses. But data capture remains among the biggest challenges in the census process. Sample surveys provide insights, but censuses have much larger volumes of data and so present greater challenges. Pressure is increasing from users who demand ever shorter turnaround times for the results.

Figure 1: The Statistical Value Chain

8 The data conversion process has to identify user needs, design and build systems to support operations, and convert data from the collection medium into a computer-based database. The data conversion value chain appears in the chart. The various elements include:

1. Identify User Needs. The NSO must work with users within the statistical office, with other government agencies, with university and other researchers, with private sector entrepreneurs and employers, and others, in establishing what information they need for planning, monitoring and evaluation.


2. Design Collection and Systems. The staff must then design the collection tools, whether hard copy or image based. This task involves the specification, development, testing (unit and system) and implementation of the computer and manual systems needed to support the running of the census. The NSO must then pilot test the systems. This phase includes developing specifications for online editing and post-capture editing.

3. Build Collection Tools and Systems. After designing the appropriate tools and the systems made up of those tools, the NSO must either build the collection tools and systems itself or outsource their construction. Both time and expertise are of the essence here: great systems that are not in place in time are useless to the agency, so the NSO must strike a balance.

4. Collect Information. Collection activities will depend somewhat on choices made about data capture – the questionnaires will differ in form and layout depending on whether the NSO scans or keys the collected data and, if it scans, on the type of scanning.

5. Process Information. Processing will also differ depending on the layout of the questionnaire. Coding of items such as places, industry and occupation occurs after collection. Even if all data are pre-coded, the processing will not be the same when data are keyed (with many personal computers) as when they are scanned (with a scanner and fewer personal computers used for verification). Editing and tabulation follow.

6. Analyze Information. After processing, the data will need to be analyzed. The form of capture should not create any issues for analysis; however, depending on the length of the data capture operations, data analysts may feel pressure.

7. Provide Access to Information. Various dissemination elements, like tables, reports, charts and maps, are now standard for all censuses. Additional dissemination materials, like table retrieval systems and public use microdata samples, are now also standard.

8. Archive Information. It is very important that all versions of the data set – from unedited to fully edited – be archived, as well as all programs and metadata.

9 NSOs implement the system through the census proper after the pilot. The problems encountered during the pilot differ from those encountered during the census proper because the census has much more data.

I.1.1 OVERVIEW OF THE ISSUES

10 National Statistical Offices conduct censuses to fulfill legal requirements for the country to obtain counts and characteristics of its population. Statistics acts contain this requirement. Other legislation depends on the availability of the census information. A country’s constitution may require that the seats in the national assembly be determined by the distribution of the residents. Other key beneficiaries of population censuses are government agencies that develop policy and programs, and monitor progress in improving living conditions of their people.

11 The requirements form the basis for the NSO’s work program. So, it follows that the NSO collects data both for users inside and outside of the NSO. Smaller sample surveys and administrative records also provide trends when collected in between large sample surveys and censuses, as part of the total work program. NSOs collect specific data in surveys and censuses according to user needs. Some of the needs are internal to the statistical organization, based on laws and other legal requirements of the country, usually from the legislative and executive branches. Outside agencies, including the United Nations (currently for the Millennium Development Goals), the World Bank, International Labor Organization, International Monetary Fund, and others, require information that comes from censuses.


I.1.2 DECISIONS ABOUT DATA CONVERSION TECHNOLOGY

12 Data collection can be by hard copy, online, or a combination of the two. The choice of approach influences design, layout, and collection production. These decisions, in turn, will determine the technology required to convert the data for processing and analysis. Staff members make these choices during the planning phase.

13 If an NSO uses hard copy forms, then it should have mechanisms to protect the forms from adverse weather conditions and high humidity. If the data conversion technology selected is keying from paper (KFP), then sufficient space must be available on the form for writing in codes for open-ended questions. Scanning with optical mark recognition (image processing) requires durable paper that can withstand the stresses of being scanned more than once. The data collector has to handle the questionnaire carefully to avoid introducing unintended marks that degrade the quality of the collected information. Recognition engines and software must be able to interpret any handwritten items, so the forms must provide areas where the enumerator records responses in blocks, which enforces legible writing. Finally, in many cases, the census data conversion infrastructure (software and hardware) will be a means of introducing new technology into the national statistics office.

I.1.3 FUNDING

14 The census has to run on timelines consistent with the Special Data Dissemination Standard, which requires project plans and statements of financial requirements. NSOs compile strategic and implementation plans setting out how they will undertake the census. The stakeholders in this process must be involved and kept informed of the plans; the intention of this interaction is to secure the commitment of the relevant stakeholders, especially those that provide funding. Among the stakeholders are national governments, represented by particular ministries, and international donors such as the United Nations Statistics Division and its related agencies. The benefit of a funding commitment is that the census program can run with confidence. Publicity and the acquisition of resources also depend on keeping to the timelines, and this timeliness gives users confidence in the results.

I.1.4 SYSTEM TESTING DURING PILOTING PHASE OF CENSUS

15 NSOs need to develop computer-based systems that support the census operations according to what might be called the System Development Life Cycle (SDLC). The life cycle requires that statistical offices:

1. Gather user requirements: Document what the system should achieve. For the census, this involves producing census information from either a single processing center or multiple centers. The documentation should include the collection tool. The activities that the system should support range from recruitment to product development. Parts of the system can be implemented at different times, as determined by the implementation plan of the program.

2. Develop a conceptual system model: Model how the parts or components of the system relate to each other.

3. Develop a physical system model: Define the functions and database tables the system requires, how they relate to each other, and the editing.

4. Build the system: Develop the code for each module/program.

5. Unit test: Determine whether each program performs according to the physical specifications. Programmers usually do the unit testing.


6. System test: Determine whether the system performs according to the entire physical specification and to user requirements. The pilot survey is another opportunity to perform system testing in a live environment.

7. Commission the system: Determine that the system, or part thereof, is ready to run in a live environment.

8. Maintain the system: Provide support, do performance tuning, and make amendments for changed requirements. A change control procedure needs to be implemented to ensure that only fully tested changes reach the live environment.

16 Unit and system testing are among the most important factors influencing a successful outcome for census and survey results. For this reason, strong emphasis is placed on allocating sufficient time and resources for system testing in the planning phase. This emphasis holds irrespective of the approach used for system development.

17 NSOs can use several approaches for developing systems, ranging from the Waterfall approach to Joint Application Development (JAD). The choice of approach is determined by the availability of the skills needed to implement it.

18 In the end, the quality of the computer system is determined by the extent to which it can assist users in performing their tasks at the time the system is implemented. If user requirements have changed since the specifications were developed, then compliance with the specifications is not the determining factor for the quality of the system.

I.1.5 LOCAL TECHNICAL SUPPORT

19 All African countries should conduct censuses at regular intervals, although not all of them do. The inter-censal period is used for identifying, and doing feasibility studies on, the appropriate technology for the next operation. The technologies available are in the areas of software (operating systems, database management systems, software development tools and data extraction tools) and hardware (scanners, workstations, file servers and network servers). Infrastructure, such as available bandwidth and telecommunication coverage, influences the extent of connectivity between and amongst sites.

20 Whether within-country support for such technology is available is one factor to consider, given the proliferation of computing technology. The main benefit of local support is that it reduces downtime to a few hours, in contrast to the days or even weeks that external support can take. This factor is even more important when the NSO is competing for the service provider's time with other customers. Transport and communication with the supplier, particularly if the supplier is in another country, are paramount.

21 The NSO must also build internal skills to provide first-line support and so reduce the turnaround time for minor incidents that occur during the data conversion phase.

22 Cost effectiveness and affordability are other factors to consider when deciding on the appropriateness of the technology. Many countries that previously processed census forms using key from paper are now using image processing that incorporates mark reading or character recognition. This process has even moved to respondents capturing their information online in some countries.

I.1.6 PLANNING AND PREPARATION OF CENSUS DATA PROCESSING: GUIDELINES FROM THE UN PRINCIPLES AND RECOMMENDATIONS

23 Plans for census data processing should be formulated as an integral part of the overall plan of the census. Those responsible for processing the census should be involved from the start of the overall planning process. Data processing will be required to obtain results for census tests, compilation of preliminary results, preparation of tabulations, evaluation of census results, analysis of census data, arrangements for storage in and retrieval from a database, identification and correction of errors, and so on. In addition, data processing is playing an increasing role in the planning and control of field operations and other aspects of census administration. More specifically, the design and layout of the census questionnaire, as well as the specifications for printing it, depend on the data capture technology adopted for a given census.

24 Data processing also has an impact on almost all aspects of the rest of the census operations, ranging from the selection of topics and the design of the questionnaire to the analysis of the final results. Therefore, data-processing requirements need to be taken into account at an early stage in the planning.

25 The existing data processing staff will certainly need to be expanded somewhat and will probably need some upgrading of skills, particularly if new computer hardware or software will be used in the census. Any needed training should be completed early enough so that those taking the training can play an active role in census planning and operations.

26 NSOs need to make decisions about the location of the various data processing equipment within the country, including whether any of the processing work is to be decentralized. Acquisition of both equipment and supplies can require long lead times so estimates of both data capture and computer processing workloads must be made early to enable timely procurement.

27 Census operations need adequate space. NSOs must pay attention to issues related to power supply, temperature, humidity, and dust in the data processing centre. The census operations must also provide for maintaining servers, scanners, guillotines, generators, and any other important equipment.

28 The last issue is to have smooth internet and/or web communications between the units and centers involved in census operations. Data processing operations should have well-protected space for the storage of the completed census forms as well as both on-site and off-site backup of all electronic information on all data processing disks/servers.

29 The NSO will have to make decisions about software for capturing, editing, and tabulating the census data while considering the processing equipment. Regardless of the software chosen, the statistical organization will need to allow sufficient time to train staff in its use. Some degree of customization will be needed to meet the specific requirements of the census, especially for off-the-shelf commercial software packages not specifically designed for census operations. Therefore, an information technology (IT) workforce has to be available for software implementation.

30 If the NSO decides to use outsourcing for some IT-related operations, these should be implemented to bring immediate economic and quality advantages to census operations. National statistical offices must ensure that outsourcing of census operations does not compromise data confidentiality. Contractors must not have free access to the basic census databases. Responsibility for hosting census databases must continue to rest with the national statistical offices. In short, NSOs should implement outsourcing to facilitate transfer of knowledge into the census organization such that they protect the essential features of privacy of individual respondents and the confidentiality of the data.

I.2 DATA PROCESSING PROJECT MANAGEMENT LIFE CYCLE

31 Good census data processing planning requires identifying what needs to be done, when, and by whom. The data processing project management life cycle aims at identifying the business aspects of data processing across planning, analysis, design, implementation, operations and support.

I.2.1 THE BUSINESS MODEL


32 As discussed by Stats South Africa², censuses should use a business model. The business model identifies a series of processes. Each business process should (1) identify a goal, (2) determine its specific inputs and outputs, (3) determine the required resources, (4) have a number of activities that are performed in some order, (5) affect one or more organizational units, and (6) create valuable output for both internal and external customers.

33 So, each business process is a collection of activities designed to produce a specific output for particular users. This process implies a strong emphasis on how the work is done during data processing. A process becomes a specific ordering of work activities across time and place, with a beginning, a middle, and an end, and clearly defined inputs and outputs: a structure for action. Detailed planning activities must be prepared to fit into the operational plan and linked to the timelines. Determination of detailed resource requirements links the planning activities and their costs to each activity. The planning process aims at ensuring that each activity has proper resources and organization. The process must also make sure that the output of each activity is of sufficient quality for all subsequent activities. In addition, planning must identify all relationships between the different activities.

34 Since the census cycle is long, planning cannot be static but must be flexible enough to take changes into account. Many issues require careful consideration when planning census data processing. These issues include (1) Specifying the objectives of the census data processing, (2) defining the role of different stakeholders, (3) setting goals and targets, (4) developing the census data processing project plans, and (5) developing the census data processing budget. Each activity may be broken down into a number of smaller tasks, reducing the complexity of the Census data processing.

35 Hence, the total census process requires defined, detailed sub-processes, activities and tasks. An inter-connected workflow puts these processes into sequence. The workflow will differentiate sub-processes that are (1) automated (i.e., system supported), (2) manual (i.e., requiring human intervention), (3) static (i.e., box storage), and/or (4) dynamic (i.e., questionnaire movement).

36 An analysis will determine the information required as inputs for each sub-process linked to the resources required to execute the process. The analysis will also determine the activities needed to reach the expected output. Also, a set of described events will control passing from activity to activity and from one status/type of information to another within acceptable quality and performance levels.

37 Detailed process descriptions must be more elaborate because they give detailed information about the methods used for each sub-process. They include the operational procedure manuals, the specific manuals for each sub-process, the specification of each process or system module and database, and the quality control plan for each sub-process. It is important at this stage to interact with the other census areas of work, particularly the questionnaire design team, the data collection team and the logistics team. The different specifications must take into account the required census data items, the questionnaire layout and format, the procedures for completing questionnaires, printing quality, accounting and questionnaire movements. The various aspects of census planning may revise or adapt the data processing methodology to the different census work areas.

38 The NSO then implements the processes once the different plans and resources are in place. These processes include the acquisition of the data processing site, the acquisition of specified resources (both IT and office related), the development and testing of the systems, and work area preparation (i.e., storage, network wiring, partitioning of the capturing area with tables and computers). Development and testing of the different system modules is also part of the implementation of the project. The processes need to be implemented within the planned timeframe, using the planned budget, with acceptable performance, yielding good quality results. The main result is an accurate database of information from the captured questionnaires.

39 Census procedures must structure the data processing operations as a capturing phase and a post-capturing phase. The capturing phase is the most demanding in terms of resources and timing, depending on the adopted mode of data capture. The capturing phase deals with document management (i.e., questionnaires and boxes), which can be grouped at a high level into receiving questionnaires, storing the documents, and capturing the information (i.e., key from paper, scanning, recognition, coding, key correction). The post-capture phase concerns the management of the data, dealing mainly with validation and editing, the statistical derivation of additional variables or any required adjustment, and the output process (i.e., tabulation and preparation of census products). Performance management and efficient use of resources are among the most important aspects of managing census data processing operations.

² From the original version of this handbook.

Figure DP2: The Data Processing High-Level Process Flow
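To make the capture and post-capture workflow concrete, the following minimal sketch (in Python) tracks questionnaire batches through the stages described above. The stage names follow the text, but the class, field and batch identifiers are illustrative assumptions, not part of any particular NSO system.

```python
# Minimal sketch of a batch register for tracking questionnaire boxes
# through the capture and post-capture phases. Names are illustrative.
from datetime import datetime

STAGES = [
    "received", "stored", "captured",      # capture phase
    "validated", "edited", "tabulated",    # post-capture phase
]

class BatchRegister:
    def __init__(self):
        self.batches = {}   # batch_id -> {"ea": ..., "stage": ..., "history": [...]}

    def register(self, batch_id, enumeration_area):
        self.batches[batch_id] = {
            "ea": enumeration_area,
            "stage": "received",
            "history": [("received", datetime.now())],
        }

    def advance(self, batch_id, new_stage):
        batch = self.batches[batch_id]
        # Enforce the planned sequence so no batch skips a stage.
        if STAGES.index(new_stage) != STAGES.index(batch["stage"]) + 1:
            raise ValueError(f"{batch_id}: cannot move from "
                             f"{batch['stage']} to {new_stage}")
        batch["stage"] = new_stage
        batch["history"].append((new_stage, datetime.now()))

    def progress_report(self):
        # Counts per stage, for daily performance management.
        report = {stage: 0 for stage in STAGES}
        for batch in self.batches.values():
            report[batch["stage"]] += 1
        return report

# Example: two boxes received, one already stored and captured.
register = BatchRegister()
register.register("BOX-0001", "EA-101")
register.register("BOX-0002", "EA-102")
register.advance("BOX-0001", "stored")
register.advance("BOX-0001", "captured")
print(register.progress_report())
```

A register of this kind also supplies the performance statistics that management needs during the capture phase.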

I.2.2 METHODS AND ELEMENTS OF DATA CAPTURE

40 Computer data capture is any process that uses a computer program to retrieve data and to summarize, analyze or otherwise convert the data into usable information. The process is automated in some way and run on a computer. The wider series of activities – editing, tabulating, analysing, sorting, summarizing, calculating, disseminating, and storing data – is computer processing. Census data processing systems typically start with raw data and end up with data sets that are internally consistent and available for tabulations and other dissemination. Data capture involves converting data from a physical questionnaire or an electronic device like a Personal Data Assistant (PDA) into a predetermined data format to be used for further data manipulation. Data processing includes that activity as well as the subsequent ones.

41 Elements of data processing. The first stage in census data processing involves converting data into a predetermined machine-readable format. Staff can apply various procedures to the data to get useful census information. The different elements of census data processing include:

1. Data acquisition
2. Data entry
3. Data coding
4. Data cleaning
5. Data quality
6. Data aggregation
7. Data validation
8. Data tabulation
9. Statistical analysis
10. Data warehousing
11. Data mining

I.2.2.1 Data acquisition

42 Digital data capture typically involves acquiring signals and waveforms and then processing them to obtain the desired information. The components of a data acquisition system include appropriate sensors that convert measurements to electrical signals, conditioning of those signals, and acquisition by data acquisition hardware. Computers acquire the data and then display, analyze, and store them. Data acquisition begins with the physical phenomenon or physical property of the object under investigation to be measured. In the case of a census, the physical locations of dwellings can be collected using satellite positioning. Another example is capturing household assets that carry barcodes with handheld scanners. When collecting anthropometric or health information about individuals, data acquisition can involve temperature measurement, the intensity or change of a light source, the air pressure inside a chamber, the weight of an object, or many other things related to census items. Depending on the device, signals may be digital (logic) signals or analog. Some devices, such as handheld scanners, run permanently connected to a PC, while others, such as Global Positioning System (GPS) receivers, are stand-alone devices that can be operated from a PC or completely independently of it.
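As a small illustration of this acquisition step, the sketch below converts a GPS position reported in the common NMEA degrees-and-minutes convention into the decimal degrees that would normally be stored against a dwelling record. The sentence shown is an invented example and the function name is our own; it is a sketch of the principle, not any particular receiver's interface.

```python
# Sketch: convert a GPS latitude/longitude from the NMEA "ddmm.mmmm"
# convention into decimal degrees for storage against a dwelling record.
# The example sentence below is invented for illustration.

def nmea_to_decimal(value: str, hemisphere: str) -> float:
    """Convert 'ddmm.mmmm' (or 'dddmm.mmmm') plus N/S/E/W to decimal degrees."""
    raw = float(value)
    degrees = int(raw // 100)          # whole degrees
    minutes = raw - degrees * 100      # remaining minutes
    decimal = degrees + minutes / 60.0
    return -decimal if hemisphere in ("S", "W") else decimal

# A typical $GPGGA fix sentence (fields are comma separated).
sentence = "$GPGGA,123519,4807.038,N,01131.000,E,1,08,0.9,545.4,M,46.9,M,,*47"
fields = sentence.split(",")
latitude = nmea_to_decimal(fields[2], fields[3])
longitude = nmea_to_decimal(fields[4], fields[5])
print(round(latitude, 5), round(longitude, 5))   # 48.1173 11.51667
```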

I.2.2.2 Data entry

43 As discussed below in the data capture part of this handbook, the method of data capture can be key to the success, or at least the timeliness, of the overall census. Two methods of data capture are covered: (1) keying and (2) scanning.

1. Key from paper (KFP) is a process where a data entry clerk reads handwritten or printed records and keys them into a computer. Statistical offices may hire keyers on a temporary basis for the duration of census data entry. For example, the census data entry clerks may take census questionnaires with handwritten answers and type what they see into a database using numerical codes. The number of data entry clerks needed is determined by their keying speeds, the number of documents to capture, and the timeframe of the project (a rough workload calculation is sketched below).

2. Character recognition. Some statistical agencies now use advanced technology for scanning documents to create images. These systems locate entries across the document by their positions and recognize the characters or digits based on the type of character. They use (1) Optical Character Recognition (OCR), which reads machine-printed characters, (2) Intelligent Character Recognition (ICR), which reads handwritten characters, and (3) Optical Mark Recognition (OMR), which reads marks in check boxes. All three methods use barcodes to assist in the capture. Although recognition technology is continuing to develop, the accuracy of character recognition varies widely based on (1) the quality of the original document, (2) the scanned image, and (3) the type of algorithm applied for recognition.

44 The final captured value will usually require a data entry clerk to review the results afterwards to check the accuracy of the data and to manually over-key corrected or missing information while viewing the image on-screen.
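The workload arithmetic mentioned under key from paper can be sketched as follows. Every figure here is an illustrative assumption that an NSO would replace with its own measured keying speeds, questionnaire volumes and timetable; the calculation itself is just a ceiling division.

```python
# Rough workload sketch for key-from-paper planning: how many data entry
# clerks are needed given keying speed, questionnaire volume and timeframe.
# All figures below are illustrative assumptions, not recommendations.
import math

questionnaires       = 2_500_000   # completed household forms to key
keystrokes_per_form  = 350         # average strokes per questionnaire
strokes_per_hour     = 9_000       # sustained speed of one trained keyer
productive_hours_day = 6           # excluding breaks and batch handling
working_days         = 120         # time allowed for the capture phase
verification_factor  = 1.2         # extra effort for re-keying / verification

total_strokes = questionnaires * keystrokes_per_form * verification_factor
strokes_per_keyer = strokes_per_hour * productive_hours_day * working_days
keyers_needed = math.ceil(total_strokes / strokes_per_keyer)

print(f"Estimated keyers required: {keyers_needed}")
```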

I.2.2.3 Data coding

45 Census data set formats require that the information has meaning and is fit for use. Good, acceptable formats are those where all values are numeric. A few variables, however, are collected as open text corresponding to data elements that form part of the census metadata. Some countries specify classifications such as hierarchies of geographical place names, and international standard classifications are normally used for occupations, industries, fields of education, and causes of death. The structure of these classifications must identify the data element name, provide a clear data element definition, one or more representation terms, a code corresponding to the enumerated values with a corresponding code list (the metadata), and, if possible, a list of synonyms for the data elements. Statistical offices do data coding manually in most cases; however, new approaches to automated coding are being developed, with different spellings and local-language items added to the metadata database.
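A minimal sketch of automated coding along these lines appears below: free-text occupation responses are normalized through a synonym table and matched against a code list. The codes and synonyms are illustrative only, not an official classification, and unmatched responses fall back to a manual coding queue.

```python
# Sketch of a simple automated coding step: map free-text occupation
# responses to numeric codes using a code list plus a synonym table.
# Codes and synonyms below are illustrative, not an official classification.

OCCUPATION_CODES = {
    "subsistence farmer": "6310",
    "primary school teacher": "2341",
    "street vendor": "5212",
}

SYNONYMS = {
    "peasant farmer": "subsistence farmer",
    "hawker": "street vendor",
    "teacher (primary)": "primary school teacher",
}

def code_occupation(response: str) -> str:
    text = response.strip().lower()
    text = SYNONYMS.get(text, text)            # normalise local wording
    return OCCUPATION_CODES.get(text, "9999")  # 9999 = refer to manual coding

for answer in ["Peasant farmer", "Hawker", "fisherman"]:
    print(answer, "->", code_occupation(answer))
```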

I.2.2.4 Data cleaning

46 Data cleaning, or data editing, is the act of detecting and correcting (or removing) corrupt or inaccurate records in a database. This aspect of the census process is covered in the Africa Census Editing Handbook, the second volume in this series. Data cleaning involves two procedures: (1) structure editing and (2) content editing. The operation described here is structure editing, which identifies incomplete, incorrect, inaccurate, irrelevant and similar parts of the data and then replaces, modifies or deletes these dirty data. Content editing is discussed below, after aggregation.

47 Data Auditing. The process of data cleaning should start with data auditing: using statistical methods to detect anomalies and contradictions. This procedure provides an indication of the characteristics of the anomalies and their locations. The detection and removal of anomalies is performed by a sequence of operations on the data known as the workflow. The causes of the anomalies and errors in the data have to be identified as closely as possible, and the workflow has to be specified with the corrective actions in mind and managed closely to avoid computationally expensive processing. Post-processing should inspect the results to verify their correctness and yield good quality data. The remaining cases are subjected to automated imputation, or a final transformation of the few outstanding cases, to meet the expected quality.

48 Inconsistencies detected or removed may have been originally caused by different data dictionary definitions of similar entities in different stores, may have been caused by user entry errors, or may have been corrupted in transmission or storage. The actual process of data cleansing may involve removing typographical errors or validating and correcting values against a known list of entities. The validation may be strict (such as rejecting any address that does not have a valid postal code) or fuzzy (such as correcting records that partially match existing, known records).
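The contrast between strict and fuzzy validation can be sketched as follows. The postal-code pattern and the place-name list are invented examples, and difflib is used here simply as one readily available fuzzy matcher; an NSO would substitute its own formats and reference lists.

```python
# Sketch contrasting strict and fuzzy validation, as described above.
# The postal code pattern and the place-name list are invented examples.
import re
import difflib

KNOWN_PLACES = ["Kinshasa", "Kisangani", "Kikwit", "Lubumbashi"]

def strict_postal_code(value: str) -> bool:
    # Strict rule: reject anything that is not exactly four digits.
    return re.fullmatch(r"\d{4}", value) is not None

def fuzzy_place_name(value: str):
    # Fuzzy rule: accept close spellings of a known place name.
    matches = difflib.get_close_matches(value.title(), KNOWN_PLACES, n=1, cutoff=0.8)
    return matches[0] if matches else None

print(strict_postal_code("0123"))      # True
print(strict_postal_code("12A4"))      # False
print(fuzzy_place_name("Kinshasaa"))   # 'Kinshasa'
print(fuzzy_place_name("Unknown"))     # None
```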

49 Some of the methods used for data cleaning include:

1. Parsing. In data cleaning, parsing detects syntax errors: a parser decides whether a string of data is acceptable within the allowed data specification, similar to the way a parser works with the grammar of a language. NSOs use parsing for open responses such as place names, occupations and industries, depending on specific conditions (such as different languages or slang).

2. Data Transformation. Data transformation maps the data from a given format into the new format needed for a particular application. In a census data set, additional variables may also need to be derived; for instance, age may be derived from date of birth, or household size from the number of persons in the household (see the sketch after this list).

3. Duplicate Elimination. Duplicate detection requires an algorithm for determining whether the data contain duplicate representations of the same entity. Usually, the data are sorted on keys to bring duplicate (or similar) entries closer together for faster identification. For census data, the key variables, including the geocodes, the household number and the person number, are used for this purpose.

4. Statistical Methods. NSOs analyze data using means, standard deviations, ranges, or clustering algorithms to find values that are unexpected and thus probably erroneous. Correcting such data is difficult because the true values are not known, but they can be resolved by setting the values to an average or another statistical value. Statistical methods can also handle missing values by replacing them with one or more plausible values, usually obtained by data augmentation algorithms. Many African censuses have used logical imputation or dynamic imputation of values from neighbouring records with similar characteristics (hot deck and cold deck).
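The sketch below illustrates two of the methods above on a toy data set: deriving age from date of birth against a census reference date (transformation), and flagging duplicates on the geocode, household and person keys. The field names and the reference date are assumptions made for illustration.

```python
# Sketch of two cleaning steps: deriving variables (transformation)
# and flagging duplicates on the census key variables.
# Field names and the reference date are illustrative assumptions.
from datetime import date

CENSUS_DAY = date(2020, 8, 15)   # illustrative census reference date

records = [
    {"geocode": "101", "hh": 1, "person": 1, "dob": date(1984, 3, 2)},
    {"geocode": "101", "hh": 1, "person": 2, "dob": date(2012, 9, 30)},
    {"geocode": "101", "hh": 1, "person": 2, "dob": date(2012, 9, 30)},  # duplicate
]

def derive_age(dob, reference=CENSUS_DAY):
    # Completed years at the census reference date.
    return reference.year - dob.year - ((reference.month, reference.day) < (dob.month, dob.day))

seen = set()
for rec in records:
    rec["age"] = derive_age(rec["dob"])
    key = (rec["geocode"], rec["hh"], rec["person"])
    rec["duplicate"] = key in seen      # flag, rather than silently delete
    seen.add(key)

household_size = sum(1 for r in records if not r["duplicate"])
print([(r["person"], r["age"], r["duplicate"]) for r in records], household_size)
```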

50 Various challenges and problems arise, depending on the approach adopted in the cleaning process, and the process sometimes depends on the tools available. Proper records of the processes applied must be documented and added to the metadata in order to understand the different challenges experienced. Some aspects of concern in data cleaning include:

1. Error Correction and Loss of Information. The most challenging problem within data cleaning remains the correction of values to remove duplicates and invalid entries. In many cases, the available information on such anomalies is limited and insufficient to determine the needed transformations or corrections, leaving the deletion of such entries as the only plausible solution. The deletion of data, though, leads to a loss of information that can be particularly costly when large amounts of data are deleted.

2. Maintenance of Cleaned Data. Data cleaning is an expensive and time-consuming process. A good system will include an audit trail of the corrections made to the data. This trail helps avoid incorrect interpretations and the need for additional cleaning after values in the collection change. The auditing process should cover every value that changes: the audit records a cleansing line for each item, and combination of items, to assure complete and efficient data collection and to provide management with statistics on the overall processing (see the sketch after this list).

3. Data Cleaning in Virtually Integrated Environments. Depending on statistical office policy, NSOs must keep the raw census data as close to the questionnaire as possible. In this case, virtually integrated sources are created from the raw data set, and cleaning is performed every time the data are accessed (i.e., using database triggers and stored procedures). The side effect of this procedure is considerably decreased response time and efficiency, but the benefits are a more complete edit trail and easier correction if errors are found to have been introduced partway through the process.³

4. Data Cleaning Framework. Finally, it is extremely important to emphasize that data cleaning requires a set of rules developed by subject matter specialists and implemented by programmers. Experience shows that some statisticians clean the data by trial and error; sometimes this works, but sometimes it does not. Subject matter specialists should use the UN editing handbook, previous census and survey edits for the country, and other sources to develop the various edits. This approach should produce a completely cleaned data set in a timely fashion. Data cleaning is an iterative process involving significant exploration and interaction within the framework, in the form of a collection of methods for error detection and elimination and for data auditing. That is, the editing operations work within a feedback system – invalid entries and inconsistent items are resolved systematically.
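A minimal sketch of the audit trail mentioned under maintenance of cleaned data follows: each correction is logged with the rule that triggered it, so the change can be reviewed or reversed later. The field names, the rule label and the output file name are illustrative assumptions.

```python
# Sketch of an audit trail for cleaning: every automated or manual
# correction is logged with the rule that triggered it, so the data set
# can be reviewed or rolled back later. Names are illustrative.
import csv
from datetime import datetime

def apply_correction(record, field, new_value, rule, audit_log):
    audit_log.append({
        "geocode": record["geocode"],
        "hh": record["hh"],
        "person": record["person"],
        "field": field,
        "old_value": record[field],
        "new_value": new_value,
        "rule": rule,
        "timestamp": datetime.now().isoformat(timespec="seconds"),
    })
    record[field] = new_value

record = {"geocode": "101", "hh": 1, "person": 3, "age": 999, "sex": 2}
audit_log = []
apply_correction(record, "age", 45, "AGE_RANGE_0_120_HOTDECK", audit_log)

with open("cleaning_audit.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=audit_log[0].keys())
    writer.writeheader()
    writer.writerows(audit_log)
```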

51 These procedures can occur alongside other data processing stages such as integration and maintenance. A good data cleaning framework assists in allocating resources between cleaning that requires manual intervention and automated cleaning. Although cleaning is required, some systems struggle to set reasonable time frames for its completion.

I.2.2.5 Data quality

52 The result of the data cleaning activities described above should be a fully structured data set. High quality data need to pass a set of quality criteria at the editing stage (a small computational sketch follows the list):

1. Accuracy: An aggregated value over the criteria of integrity, consistency and density

2. Integrity: An aggregated value over the criteria of completeness and validity

3. Completeness: Achieved by correcting data containing anomalies

4. Validity: Approximated by the amount of data satisfying integrity constraints

5. Consistency: Concerns contradictions and syntactical anomalies

6. Uniformity: Directly related to irregularities

7. Density: The quotient of missing values in the data and the number of total values that ought to be known

8. Uniqueness: Related to the number of duplicates in the data
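Several of these criteria can be measured directly on the captured file. The short Python sketch below is illustrative only; the field names, key fields and the treatment of blank values as missing are assumptions for the example. It computes the density of missing values (criterion 7) and a simple duplicate count for the uniqueness check (criterion 8).

# Illustrative computation of two of the quality measures listed above.
# The field list, key fields and missing-value convention are assumptions.

def density_of_missing(records, fields):
    """Quotient of missing values over the total number of values."""
    total = len(records) * len(fields)
    missing = sum(1 for r in records for f in fields if r.get(f) in (None, ""))
    return missing / total if total else 0.0

def duplicate_count(records, key_fields):
    """Number of records sharing the same identifying key."""
    seen, duplicates = set(), 0
    for r in records:
        key = tuple(r.get(f) for f in key_fields)
        duplicates += key in seen
        seen.add(key)
    return duplicates

records = [{"ea": "001", "hh": "0005", "line": "01", "age": "34"},
           {"ea": "001", "hh": "0005", "line": "01", "age": ""}]
print(density_of_missing(records, ["ea", "hh", "line", "age"]))  # 0.125
print(duplicate_count(records, ["ea", "hh", "line"]))            # 1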

I.2.1.6 Data aggregation

53 NSOs usually capture the data in small lots. That is, the census data are not captured all at once, but rather in small groups of data, over time and space. Sometimes the data capture is centralized, when the country is geographically small, has sufficient labor for centralized census processing, and does not have a large population. In other cases, the processing might be decentralized throughout the country. But, in either case, unless all of the forms are read at one time, an operation to aggregate the data will be needed in order to facilitate continued processing, tabulation, and dissemination.
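As a simple illustration, assuming the captured lots arrive as flat ASCII files in a batch directory (the directory layout and file-name pattern below are hypothetical), the lots can be concatenated into a single master file, with a record count kept as a control total against the check-in records.

# Minimal sketch: concatenate captured batches into one master file.
# The directory and file-name pattern are assumptions for illustration.
import glob

def aggregate_batches(batch_pattern, master_path):
    count = 0
    with open(master_path, "w", encoding="utf-8") as master:
        for batch_file in sorted(glob.glob(batch_pattern)):
            with open(batch_file, encoding="utf-8") as batch:
                for line in batch:
                    master.write(line)
                    count += 1
    return count  # number of records aggregated, usable as a control total

# e.g. aggregate_batches("capture/EA_*.dat", "master/census_raw.dat")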

3 The originally collected data should always be immediately archived, and archived in several places. As the data are edited, important versions of the data file should also be archived. And, again, the final edited data must be archived in several places so they will not inadvertently be lost.


I.2.1.7 Data validation

54 Data validation is what is often called “content editing”. Most of the Africa Census Editing Handbook covers the various aspects of data validation, so they are not covered here.

I.2.1.8 Data tabulation

55 After validation, NSOs make their tabulations. This operation is usually centralized in the National Statistical Office itself.

I.2.1.9 Data analysis

56 Census processing can be viewed as a data processing system in which "observable" information from the questionnaire is converted into data through the interpretation of human observers or scanners. The census databases store this information in typical numerical representations – integers, fixed-point, binary-coded decimal or floating-point numbers – each tied to a field or variable used in the analysis. Algorithmic derivations, logical deductions and statistical calculations based on the subject of analysis can then yield additional information.

57 Data capture must deliver the census data in formats appropriate for the statistical packages used in analysis. In addition to common office software like Excel, the most common applications used for census computer editing and tabulation are SPSS, SAS, STATA, CSPro, REDATAM, and SuperCross. Many other tabulation and analysis software packages now exist, both for personal computers and directly on the web. NSO staff can access these from office servers, from local computers, from compact disks, or online.
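Most of these packages read the flat ASCII files produced at capture. As an illustration, the sketch below reads a hypothetical fixed-width capture file into Python with pandas and produces quick frequencies; the column positions, field names and file name are assumptions for the example, and similar steps can be carried out in the other packages listed above.

# Minimal sketch: read a fixed-width ASCII capture file for quick checks.
# Column positions, field names and the file name are illustrative only.
import pandas as pd

colspecs = [(0, 3), (3, 7), (7, 9), (9, 10), (10, 13)]   # EA, household, line, sex, age
names = ["ea", "hhold", "line", "sex", "age"]

persons = pd.read_fwf("census_raw.dat", colspecs=colspecs, names=names, dtype=str)
persons["age"] = pd.to_numeric(persons["age"], errors="coerce")

# Quick frequency of sex, and an age summary, to check the captured file
print(persons["sex"].value_counts(dropna=False))
print(persons["age"].describe())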

I.2.1.10 Data warehousing4

58 After tabulation, analysis, and dissemination, the data must be archived. It is important, in archiving, to preserve all important versions of the collected data, including the original data, the various stages in the editing process, and, of course, the final, edited data. Appropriate metadata must also be preserved, including the actual questionnaires, forms and manuals, flow records, and so forth. The archiving section of the Africa Census Tabulation Handbook covers this aspect of the census process.
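Whatever archiving arrangement is used, it helps to be able to verify later that the copies held in several places are still identical to the originals. The sketch below is illustrative only: it writes a manifest of SHA-256 checksums for the files in a hypothetical archive directory, and the same manifest can be recomputed on any copy and compared.

# Minimal sketch: record SHA-256 checksums for archived census files so that
# copies held in several places can be verified later. Paths are illustrative.
import hashlib, pathlib

def checksum(path, chunk=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def write_manifest(archive_dir, manifest="MANIFEST.sha256"):
    root = pathlib.Path(archive_dir)
    with open(root / manifest, "w", encoding="utf-8") as out:
        for p in sorted(root.rglob("*")):
            if p.is_file() and p.name != manifest:
                out.write(f"{checksum(p)}  {p.relative_to(root)}\n")

# e.g. write_manifest("archive/census_2010/") -- rerun on each copy and compare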

I.2.1.11 Data mining

59 Although not covered in detail in these handbooks, it is important that the final, edited data set be as user-friendly as possible. As noted in the Africa Census Tabulation Handbook, public use micro-data files (PUMS) are now made as a matter of course as part of the census data processing. These data sets need to be easily accessible, and easily processed so that users within the public sector and in the private sector can easily obtain the compiled data they need for planning and policy formation. Also, many countries now put large amounts of their microdata or compiled data behind an electronic protective barrier, and allow users to make tables online without having access to the actual data themselves. In the best possible sense, this is data mining.

I.3. FORMS DESIGN AND TESTING

I.3.1 THE CENSUS QUESTIONNAIRE

60 The purpose of the census questionnaire is to capture data. A well-designed form captures data efficiently and effectively, with few respondent and enumerator errors. Form design requirements vary according to the enumeration and processing methods. Several issues needing consideration include:

4 It is important to note, again, that staff from Statistics South Africa wrote the initial draft of this volume. Their office uses “data warehousing” where later versions of the handbook use “archiving.” These terms can be used interchangeably.


1. Respondent burden;

2. Format and question wording, which are affected by whether the census uses interviewers or self-enumeration methods;

3. Layout and design of response areas, which are influenced by the need for good interviewer/respondent perception and the data capture method; and,

4. Whether the census uses a combination of short and long forms.

I.3.1.1 Respondent burden

61 Minimizing respondent burden will assist in obtaining accurate answers. The length of the form, the number and type of questions, and how easy the form is to complete all affect respondent burden. NSOs should keep these issues in mind when designing the census form. Lowering respondent burden is particularly important when using self-enumeration.

I.3.1.2 Question wording and format

62 The wording and format of questions will influence how well the form works. Issues that need to be taken into account when designing questions include:

1. Data needs of users. User meetings are very important in determining the content of the questionnaire. But, whether the data are to be keyed or scanned, the data needs of the users must fit in with the technology used for data capture. The number and display of questions will vary depending on the technology, but the absolute number of items (based partly on space and partly on enumerator/respondent fatigue) must be taken into account. Sometimes the questions can be formatted for the technological environment selected (even working around black boxes used in scanning sheets), but sometimes they cannot.

2. Level of accuracy and amount of detail required. Laying out the items on paper, under supervision from a potential supplier or by the suppliers themselves, will allow for an early view of what the questionnaire will look like using that particular technology. Decisions about acceptability can then be made. The users will determine some of the amount of detail possible, but the NSO must also make determinations about what is most appropriate for the statistical office.

3. Availability of the data from the respondent. The respondent must be able to provide the data to the enumerator. If the questionnaire is not laid out in a comfortable manner, enumerators and respondents may have problems obtaining the best results. The layout is particularly important during enumeration, of course, since everything else depends on it. NSOs need to ensure, both logistically and culturally, that the questionnaire is “user-friendly.”

4. Appropriate, easily-understood language for respondents and interviewers. The questions must be worded in a way to obtain the most appropriate responses. Sometimes the layout of the questionnaire interferes with this task. Also, instructions to the enumerators may need to appear on the questionnaire to elucidate any potential problems. If these instructions are not clear and placed properly, the census results could be affected.

5. Type of questionnaires and layout per languages of census. The types of questionnaires – housing, population, mortality, agriculture – could affect the layout of the whole instrument. The items must be laid out for ease of use by enumerators and respondents. NSOs must make decisions like whether the population items will all be on a separate page for each person or as groups of items for all household members as part of the questionnaire layout.

6. Data item definitions, standard question wording, and any other relevant information. If definitions are present on the questionnaire, they must be placed so they don’t interfere with the scanning; however, they must also be easily available to the enumerator or respondent, or queries might go unanswered. Standard question wording should be respected, as should other relevant displayed information.

7. Data processing system used. The data processing system used after capture could influence the layout of the questionnaire. Integrated systems, like CSPro, will have fewer demands on questionnaire format than multiple systems.

8. Sequencing or order of questions. The sequencing of the questions is very important. NSOs must be vigilant when an outside agency – usually the supplier – is laying out the questionnaire, to make certain that the order of the questions is not changed to fit the needs of the scanning. The sequencing will have been tested with a pilot survey and those results must be respected; NSOs do not want to experiment with the questionnaire itself.

9. Space required for each answer. Adequate space must be available for the answers. When written entries are expected, and these are to be coded, space for both the written entry and the coding box or circle must also be available.

I.3.1.3 Layout and design

63 The layout and design of the form depends on whether the questionnaire is to be keyed or scanned. Keyed forms allow considerable leeway in the placement of items and their response categories. Scanning is much more exacting and, in the case of image recognition, sometimes very confining in terms of spacing and layout. It is important that the design be appropriate for the technology, but also laid out in a way that allows for proper enumerator data collection.

64 The layout and design of the form will have a direct impact on how accurately interviewers or householders complete the form. Therefore, NSOs should give particular attention to graphic presentation; the placement and presentation of instructions; the use of space, layout and colours; and the wording used. Poor use of any form design element, whether language, question sequencing or layout, will create obstacles for the respondent and/or interviewer. NSOs must minimize these obstacles so that the form is filled in accurately, since the purpose of the census form is to obtain high-quality information.

I.3.1.4 Long forms and short forms

65 When countries use both short and long forms for their census, each form must be considered separately, and tested in the hot-house environment as well as in the field, with pilot surveys and targeted studies. Also, the forms must be integrated, both for ease of use in the field by the enumerators as well as in processing.

I.3.2 PROCESSING REQUIREMENTS FOR THE CENSUS QUESTIONNAIRE

66 Various data-capture components of processing systems, ranging from key entry to electronic imaging through scanners, will require markedly differing form design. Form design requirements and form size for differing technologies may vary greatly. Forms design should incorporate provisions for contingencies in data processing. Even a form designed for automatic data capture (e.g., imaging, character recognition) should include space and codes alongside response areas. If the intelligent character recognition (ICR) system fails, for example, keyers can more easily enter responses, if necessary, when these codes appear on the form.

I.3.2.1 Physical considerations

67 Data-capture requirements should not unduly influence the respondent’s perception of the form. When designing forms for more advanced data-capture methods, such as imaging, the respondents must be able to provide answers in a suitable format that the data-capture equipment can recognize. Self-enumerated forms will require extensive testing that includes processing live data from tests. Questionnaires should include as many pre-coded responses as possible. Open-ended questions should be limited to topics, such as occupation and industry, that censuses can only collect in this way.

68 The census questionnaire is the most important of all the forms, so it should be subject to intensive quality assurance checks in both the proof and the production stages. Other items are normally only subject to intensive checks in the proof stages and not in the production stage. This diminished checking reflects the lack of technical requirements for extreme precision in the physical placement of characters in these other items, as well as the high cost of production checks.

1. Format of census forms: The enumeration method will influence the choice of format for the form. For example, census forms can be household forms or individual personal forms or both: (1) A household form can appear as a roster, providing for answers from all members of an average-sized household on a single page, or (2) Another approach is to design the household forms as a booklet, with all of the personal questions asked first for person 1, then repeated for other persons in the household.

2. Printing specifications: Since the quality of the printing will have a large impact on the census data’s final quality, printing quality assurance must be closely monitored throughout the entire process, including checking of proofs and production runs. For example, the colors of the questionnaire are important. The color of the printing must be clear and easy on the enumerator’s eyes. But, that color and the background colour of the form need to be verified in cases where scanning requires dropout of the colour.

3. Checking of proofs: Materials will progress through several proof stages before final typesetting. The project leader responsible for forms design should check, verify, and authorize the proofs as correct at each stage. Staff not directly involved in the form design process can also provide additional checks to detect discrepancies. The final typeset proof used for printing is the responsibility of the printer in most situations. In these cases, the statistical agency should approve a sample of the typeset version before production printing starts.

4. Production runs: The NSO should select a sample of census forms as the printing process progresses for quality assurance checks. Issues such as resources available and the level of problems detected will affect the size of the sample selected and the sampling strategy. However, NSOs must allocate sufficient resources to ensure the printing quality. Otherwise, the census may incur significant costs in the processing phase to rectify mistakes resulting from printing errors.

69 Problems in the final census data may occur if staff cannot rectify these mistakes during processing. Sampling rates can be adjusted throughout the printing process, with higher rates expected at the beginning of the printing. In some cases, the printing technology includes the creation of new printing plates after completing some proportion of the work. Then, staff should employ a higher sampling rate after producing each new set of plates. The sample rate will be adjusted downwards when detected problems decrease.
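The adjustment of sampling rates described above can be made explicit. The following sketch is purely illustrative; the starting rate, floor, ceiling and defect-rate thresholds are assumptions for the example and would be set by the NSO's own quality assurance plan.

# Minimal sketch: adjust the print-QA sampling rate as defects are observed.
# The starting rate, floor, ceiling and thresholds are illustrative assumptions.

def next_sampling_rate(current_rate, forms_checked, defects_found,
                       floor=0.01, ceiling=0.20):
    defect_rate = defects_found / forms_checked if forms_checked else 0.0
    if defect_rate > 0.02:        # problems increasing: sample more heavily
        return min(ceiling, current_rate * 2)
    if defect_rate < 0.005:       # run is stable: sample less heavily
        return max(floor, current_rate / 2)
    return current_rate

rate = 0.10                       # e.g. check 10% of forms at the start of the run
rate = next_sampling_rate(rate, forms_checked=500, defects_found=1)  # -> 0.05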

I.3.2.1 Checking requirements for Questionnaire layout

70 NSOs should take a sample over the entire printing process from start to finish. Organizations cannot assume that this level of quality will necessarily continue just because the quality is good at the beginning of the print run. Quality assurance conducted at the printing plant will assist in the early detection of problems. However, agencies should not rely on the printers themselves to conduct the majority of quality assurance checks. NSOs need to carry out their own independent checks at all stages of all operations.

71 Examples of some of the checks on the forms include:

1. Horizontal and vertical trimming. If staff members do not trim the questionnaires properly, the forms will not go through the scanner as they should, leading to all sorts of errors. Several African countries have experienced this problem. Unfortunately, the only way to really test whether trimming is working properly is to fill in forms and run them through a scanner. If a scanner is already in place in the NSO, testing should be done there. If the scanner is not yet in the office, then arrangements should be made to run the questionnaires through a scanner at the vendor’s place of business.

2. Positioning or skew of response areas on the actual page. One of the problems with scanning in general is that the items must appear only on certain parts of the page. Monitoring boxes or other indicators keep the pages aligned for reading and transferring data. If these indicators are skewed, faulty reading will result.

3. Page numbering and correct order of pages. At least one staff member must be assigned to checking page numbering on the initial and subsequent drafts of the questionnaire. The vendor may get the pages out of order, and when this happens, the questionnaires become unusable. Quality control must be vigilant. The contract between the vendor and the NSO will state who is responsible for violations of paging and other aspects of the printing, but the NSO must take ultimate responsibility since it will be unable to collect the census at all with incorrectly paged forms.

4. Colour, background colour dropout, including any smudging. The nature of scanning requires certain colors for backdrop. Other colors either obscure the data or otherwise make the scanning more difficult or impossible. Orange seems to be a favorite color.

5. Strength of any binding. The binding is crucial. If the forms come apart in the field or in processing, without proper geographic identifiers, staff may not be able to put them back together. Even when the codes are reassigned, the time and effort that go into that process would be better spent on other activities. So, it is often better to pay extra for sufficient binding in the first place than to play catch-up later.

6. Paper size and weight. The paper size should be comfortable for the enumerators and for the processors. Even when data are scanned, verifiers sit at personal computers, so the paper must fit on the table or work station. Size and weight are particularly important in areas of high humidity or rainfall where additional climatic issues could damage the forms.

72 Particular attention should be given to any specialized printing requirements of the data-capture systems. A final check should be undertaken by processing a sample of forms through these systems to enable a comparison of actual versus expected results.

I.4 PROCESSING ENVIRONMENT

73 NSOs take censuses only rarely. But, censuses generate so much paper that most NSOs cannot provide regular office space for the questionnaires and the additional staff needed to process them. Hence, the census operations usually obtain outside space. Usually the NSO rents space for the period of the census. But, whether renting or using currently available space, a series of criteria should be considered.

I.4.1 CRITERIA FOR SPACE CONSIDERATIONS

In determining where that space should be, NSOs should take into account the following:

1. Enough space for storage of questionnaires. The primary consideration for off-site space is that the NSO have sufficient space for storing the questionnaires. The clerks must be able to easily find the questionnaires when they need them, and to return them to the appropriate areas after each operation. The absolute space must allow for easy retrieval and return, and the different operations must also have their own space for temporary storage as the forms are used.


2. Accessibility and space for delivery/discharging of questionnaires. Since the questionnaires will come in from the field, and since they will eventually be moved to a permanent storage facility, at least until they are no longer needed, space for delivery and discharging of forms is necessary. If it is too difficult to get the questionnaires into and out of the census processing office, they are likely to get mislaid or lost along the way. Adequate space for these operations is essential.

3. Manual operations area. The manual operations area usually refers to the space needed for check-in of the materials from the field. When the questionnaires are first off-loaded, they need to be checked in manually, so space must be provided for this operation. If the questionnaires and/or transmission boxes are bar-coded, a machine can scan the codes and record the information. Then the questionnaires and boxes can be tracked. If barcodes are attached at check-in, these can be used throughout the subsequent operations. But, even without barcodes, this operation is important so that tracking can be set up for the boxes and forms within the processing center.

4. Scanning area with room temperature control. Temperature control is important for the questionnaires and other materials as well as the staff working in the office. High humidity and temperature could damage the questionnaires. And, high humidity and temperature could also make the staff less effective in their work. Sweating on the forms will not improve their quality. Also, in countries that have very cold weather, the temperature in the processing center must be warm enough that staff is comfortable as they work.

5. Key from paper or key correction area. When staff key the data – from paper or image – they need personal computers from the very beginning of the operations. But, even with scanning, verification takes place on personal computers, so these machines must be available in sufficient quantity not to hold up the other operations. And, sufficient space must be available for them in the processing area. Security for both the machines and the questionnaires is a consideration as well.

6. Enough sanitary areas. The building must have restrooms and access to water. Assessments of how many restrooms are needed, and how accessible they are, should be made as the NSO decides where to do the processing.

7. Canteen or break area in case the processing centre is far from public facilities. When the processing is done far from other public facilities – as is often the case for security reasons – then adequate food and water need to be available. Break areas are also important so that the scanning operators and keyers can relax between shifts and during breaks.

8. Server rooms and local network installed. Because data capture will usually be from several or many machines, at least one server room is essential. The room has to be secure, and large enough to hold both the server and backup materials. If a local area network is used, space will be required for the wiring, any additional machines, etc.

9. Quality assurance space. A group will be providing quality assurance – whether by hand or machine checking every 5th or 10th form – or other kinds of quality assurance. Space must be made available for their desks and machines, entry and exit to the area, and so forth.

10. Electrical installation per requirements. Because various operations will require so many machines plugged in during the processing – scanners, servers, and personal computers – the processing space must have adequate electrical power for all the operations to take place. Information about the electrical requirements of the equipment should be reported by the companies bidding to carry out the processing.

11. Backup generators. In countries with electrical problems – which is most countries these days – backup generators need to be included. Sometimes the backup generators will provide only minimal backup, allowing just for an orderly shutdown. Other configurations will allow minimal processing to continue, and others will allow complete processing to continue. The benefits of the various amounts of backup generation will have to be weighed against the additional costs.

12. Post-capture process. After the processing is finished, operators must return the questionnaires to temporary storage. And, then space and logistics will be needed to get the questionnaires to permanent storage. Although electronic movement to a central office for continued processing, editing and tabulation, is usual for captured data, manual transfer might take place, and space will be required for that.

13. Parking area. In countries with comparatively large numbers of staff driving to the processing center, adequate parking spaces must be available for the cars, trucks, and vans.

14. Access to public transport. In countries where large parts of the staff travel by public transport, the processing center must be placed so that various forms of public transport – buses, trains, subways – serve the building. When stations are some distance from the processing center, provisions like shuttle buses may be needed as well. Secure paths from the stations to the processing center must be in place to provide adequate comfort for the workers.

15. Configuration for the night shift operations. When operations continue for two or more shifts, any additional provisions for night work need to be considered and implemented.

Figure: Example Layout of Functional Area for Key from Paper Operations

Figure: Example Layout of Functional Area for Scanning Operations

I.4.2 INFRASTRUCTURE AND SYSTEM

74 It is fundamental to understand how the acquisition of hardware and software fits into the overall census plan. Fully understanding the system requirements makes it easier to take acquisition decisions and to weigh the trade-off between functionality and cost. Decisions on factors such as how the NSO will capture the data and how it will edit, tabulate, disseminate, and store the data need to be made when developing the data-processing system. These decisions must be made early enough so that sufficient time is available for software and hardware evaluation and acquisition.

75 The budget available to the project is also a vital factor in making decisions about hardware and software. Costs of employing data-entry staff and the level of the computing infrastructure are also important considerations. A low budget project may not permit acquiring and deploying sophisticated state-of-the-art equipment. But, using less ambitious information technology may offer overall savings as well as greatly increase the use of the census output.


76 Before agencies start the formal processes of evaluating and acquiring software and hardware, they should take the opportunity to research and investigate other organizations’ experiences with similar systems. During this period, they may acquire versions of software and/or hardware to be used for testing purposes. This process will allow agencies to become familiar with, and better understand, the potential and/or limitations of particular systems.

I.5 SOFTWARE

I.5.1 EVALUATING SOFTWARE

77 NSOs must evaluate possible applications against set criteria before acquiring and installing software. What the software is being used for and the complexity of the software will determine which criteria are critical. The most important criteria involve ensuring that the applications meet the required specifications. Among the criteria:

1. The software is easy to learn and use. The software must be easy to learn, use and manipulate, or staff will not use it. The software should be incremental, so that once the first tasks are learned, later ones can build on them, and staff can “drop out” along the way once they have learned what they need for their specific tasks. For census purposes, subject matter specialists should not need to develop dictionaries and screens for keying or verification, but they should be able to make tables to assist in determining whether capture is working, and for later analysis.

2. The software serves as integrating tool that provides a common approach. The package should be integrated to allow for feedback among the elements, so that staff can make changes to the dictionary, screens, edits, and tabulations as needed within the overall census context.

3. An easy development environment exists for user interfaces. Although subject matter specialists may not need to develop checking programs from scratch, the software should allow them to make simple, small changes to the dictionary and other materials to assist in checking the data capture, whether in validation of the scanning or verification of the keying.

4. The software has an easy-to-use system development environment (workbench). Elements should include configuration management, testing and debugging facilities incorporating breakpoints and step-through capabilities. As noted above, the environment should be conducive to simple, quick changes, but also robust enough to accept large changes when needed.

5. The software has the ability to display required objects such as form images, if applicable. Error listings of various sorts must be possible, as well as what the actual data look like before, during, and after processing. The software should allow for transparent display.

6. The software has strategic value to the organization responsible for the census, or other elements of the national information technology infrastructure. The software should provide outputs compatible with other data compilations in the NSO’s statistical activities. Additions to the programming package inventory are only as valuable as their overall usefulness; if skills on other packages decline because this package is used instead, the package itself should take up the slack.

7. The software is compatible with current industry trends. The software should be compatible with current industry standards and be able to produce trends from previous censuses and surveys. The software should also be able to compile data for comparison with other sources, like vital records and migration statistics.

8. The NSO has current expertise in the product in the organization or externally. If the NSO has already used the supplier and the product, particularly if the supplier was also involved in the previous census, statements about improvements can assist in determining the likelihood of a good fit for the current census. If the previous experience was problematic, however, detailed improvements should be presented along with clear reasons why the NSO should go with the supplier again. If the NSO uses the product on a continuing basis, this analysis may not be needed.

9. Internal or external staff experienced with the product are readily available. The supplier should provide names of satisfied customers – both public and private sector agencies – to help the NSO in assessing the quality of the software and the level of assistance.

10. The required level of training and support is available. The supplier should provide training materials upfront for analysis. The NSO should also contact countries who have previously used the supplier to obtain testimony on the type and level of training, and what support is available.

11. The supplier provides adequate support. It is often difficult to tell, at purchase, how much support the supplier will provide. However, testimony from other countries should be provided – both from the supplier, and directly from the other countries – and the level of likely support assessed.

12. The supplier shows evidence of current strength and longer-term viability. The supplier should be able to show what censuses it has already supported, and produce recommendations from the countries having used the software. If possible, the software should have been used in consecutive operations, either two censuses, or a census and a survey, to show the stability of the package over time.

13. The software will be sourced locally or internationally. The software must be available to other local institutions as well as to international agencies if the data sets are to have maximum usability. If the software is not available to an agency trying to do comparative analysis across countries, the NSO may have to export the data into some other form. Some software, like CSPro, has the ability to export from ASCII (which is still the best standard – what you see is what you get) to other packages, like SPSS, SAS, and STATA.

14. The supplier is a well-recognized and used business with well known products. It is important that (1) the product is compatible with current industry trends, and (2) the supplier is financially secure.

I.5.2 TESTING THE SOFTWARE

78 Testing to evaluate the software should include at least the following steps:

1. Obtain test copies. The NSO should obtain test copies of the software. If the supplier is internationally known, like Microsoft, short-term use is usually provided online and without cost. CSPro, IMPS, and other public-use packages are free to download at any time. Others may cost considerable amounts just for testing, so NSOs should seriously consider whether they want to get involved in long-term commitments with companies that impose upfront charges from the very beginning.

2. Develop test prototypes, and test data packs to prove or disprove the software’s ability to satisfy key functionality requirements. Subject matter specialists and programmers should work together to develop a test of the software with both prototype and real-world situations, using the census questionnaire as the base for the tests.

3. Detail implications on and for the organization’s computing environment. The amount of impact on the current NSO processing system should be assessed to make sure that the software will not interfere with regular activities like trade statistics, other surveys, vital records, etc.


4. Get access to reference sites and demonstrations relating to the supplier and its products and gauge user satisfaction. This information can be augmented with access to bulletin boards and discussion sites, if internet access is available. But, it is important to know what is available without any human intervention, to assist when a contact person cannot be reached.

5. If it is a strategic product, ensure that a viable support mechanism exists and that the information quality and responsiveness are acceptable. Test human response systems to make sure they will work! Will anyone from the software provider be available on a regular basis to provide assistance when needed?

6. Conduct tests according to previously established criteria. The NSO should also check to make sure that data from the previous census can also be compiled for trends analysis. If the new software is incompatible with previous data, this could be a warning that the new data may also not be compatible with other systems.

7. Assess and document upgrade policy. The NSO should consider the “upgrade” policy to determine whether updates will be provided gratis or whether each will require additional payments. Some software companies will provide a certain number of upgrades for free, so the NSO must decide whether that limited number will be sufficient for the operations over the time of use.

8. Determine full costing. It is important to determine the total cost, as much as possible, of the software, with all of its add-ons. The NSO does not want to get in a position where it must pay more to maintain or to obtain more features for the software partway through the census.

9. Produce a report on the evaluation process. A report on the analysis of the various software systems should be prepared, with emphasis on what is important for supervisors to consider, both in the current situation, and for the future as well.

I.5.3 ACQUIRING THE SOFTWARE

79 Software for census use in association with selected hardware can be acquired in a number of ways, such as:

1. Purchasing complete off-the-shelf packages that require no further development. Some countries prefer to use off-the-shelf software, particularly if they already are using that software and many staff members already know how to use it. Sometimes the software is already set up for processing because it is the same software and same programs already used for surveys.

2. Purchasing packages that can be further developed for census-specific activities. Some countries prefer software that they already use, but the programs needed for processing the census are not in place. Then, the software is used, and new processing procedures are developed.

3. Contracting out the provision of specific functionality for parts of systems. In this case, the development of the processing is contracted out for those parts of the system that the NSO cannot do, using already established software.

4. Contracting for externally developed software for complete systems. In this case, the whole operation is contracted out, with little or no institutional development within the NSO.

5. Obtaining free software such as IMPS or CSPro. Here, established processing software like IMPS or CSPro is used.

80 If a software package is to be used as part of the data entry (whether as direct entry or in validation), editing, and tabulation and dissemination, then it must be evaluated according to its performance against the following criteria:


1. Country size. Although the size of the country should not make a difference, some packages, like CSPro, run very slowly on large data sets. The older DOS package, IMPS, runs much faster, but DOS is no longer supported, so this must be taken into account, particularly for production runs and especially for running tables. Small countries can use almost any package successfully.

2. Data entry. As noted elsewhere in this manual, data entry programs must be able to check for invalids and inconsistencies on entry. If a single package is to be used for all elements of the census, this facility is particularly important.

3. Editing. The package should be able to do editing, including imputation. As the “nearest neighbor” approach becomes increasingly used in census editing (see the UN Editing Handbook or the section on editing in this manual), it is even more important that NSOs make sure this facility is available if it is to be used. It is important that the package obtained include complete error listings. The listings should be of at least three forms: (1) summary statistics for each invalid and inconsistency check, (2) the ability to look at individual units before and after editing and the invalids and inconsistencies found, and (3) frequency distributions of the imputed items.

4. Fast tabulation. The package should be able to produce what might be called quick tables or fast tables. These tables should be produced quickly and easily by both programmers and subject matter specialists to assist in checking the editing process. Staff should be able to make tables on the unedited and the edited data to be sure that the edits are working correctly as the editing is being done.

5. Tabulation. The package should be able to make crosstabulations. The tabling methods should be simple and straight-forward, and especially for large countries, fairly rapid.

6. Dissemination. The package should be able to organize the tables and metadata into a table retrieval system or other organizing tool. Also, the package should allow easy movement into Excel, Word, and various mapping packages.

81 (a) Package software: The use of package software, as opposed to developing task-specific software, has become an established practice in many areas of the information systems industry. The major reasons include:

1. The reduced risk, cost and time-frame associated with the implementation of proved solutions to recognized business needs;

2. The reduced overhead involved in maintaining the resulting system by procuring packages from vendors committed to their ongoing maintenance; and,

Although the rationale for using package software is clear, many agencies have been disappointed with the results of package implementations.

82 The most frequently encountered problems are:

1. A mismatch between package functionality and agency requirements;

2. The level of customization required to ensure successful implementation;

3. Inflexibility of the package to meet the changing needs of the agency;

4. The level of maintenance required;

5. An inadequate level of vendor support;

6. Poor vendor choice;

7. The amount of effort required to interface a package to existing systems.

83 The above problems are almost always attributable to an inadequate analysis of business needs, or a poor procedure for the evaluation and selection of a package, or both. NSOs usually acquire off-the-shelf packages through direct negotiation with suppliers, after they conduct an evaluation study to determine that these products will fulfill the stated requirements.

84 Site license. The NSO must decide whether a site license is required or whether individual licenses would be more appropriate. NSO staff can often negotiate costs of software acquisition. Discounts may be available for higher-volume purchases. Staff should consider a license arrangement to allow many concurrent users as a cheaper alternative since fewer licenses need to be purchased than the total number of possible users. Other variants include differential pricing, that is, limited developers’ licenses and unlimited licenses for run-time access.

85 Contracting out specific functionality for parts of systems. NSOs using externally developed application-specific software must specify, develop, and control it. Therefore, these procedures should be subject to closely-monitored contracting conditions. These contracts are usually based on a formal request for tender or statement of requirements and may be linked to hardware acquisition. The NSO must have good contract management practices in place, so that the many benefits established in the planning process will not be lost in the execution.

86 Contracting out complete software systems. A simpler, but more expensive, method is to contract out the complete system to a supplier of specialized software. Broad requirements might be specified, such as “the requirement to deliver captured data from every form”, which leaves contractors to acquire and develop the software themselves. While this method is simpler for the organization, the total product will most likely be more expensive. Also, communication with the contractor has to be very good to ensure adequately detailed specifications, and the organization has less control over the process.

I.5.4 DEVELOPING SOFTWARE APPLICATIONS IN-HOUSE

87 If no suitable off-the-shelf software exists, the NSO might need to develop the required software in-house. The decision to take this action will depend on a number of factors, for example:

1. The budget available. Some software is free, like CSPro, because it is developed by a government agency and therefore must be distributed at no cost. Other software, like Supercalc, can be extremely expensive, and so NSOs must decide what software is best, given the total budget, and the line items for a particular piece of the census operations.

2. The technical skills available in the organization and the ability to retain those skills. Developing and keeping technical skills is a growing problem in the information technology industry. The current staff (and potential future staff) must be able to use the software, and the “learning curve” for mastering the basics and add-ons of the software must match the staff’s skills.

3. The timetable for development. Some software allows for rapid development of the programs for capture, editing, and tabulation. Other software takes longer to learn, but is more powerful, with means and standard deviations, regression, multi-variate analysis, etc.,

4. The complexity of the required software. The more complex the software, the longer it will take to master it. But, having mastered it, the software will probably provide greater flexibility and be more comprehensive.

88 NSOs must exercise the same strict controls over development issues (e.g., standards, tools used, training of staff and adherence to timetables) whether software is developed in-house or contracted out.

I.5.5 REGIONAL APPROACHES FOR CENSUS DATA PROCESSING

89 In order to share the available expertise in a group of countries and facilitate more cost-effective training, a common approach for census data processing can be adopted by a set of countries that have shared experiences in census taking. The Anglophone Caribbean islands adopted this common approach for their 1990 and 2000 rounds, and are implementing a similar system for the 2010 round. They have agreed to use harmonized questionnaires, manuals and forms, and software for data processing (CSPro). They also use this regional network to facilitate regional training and exchange expertise among countries.

I.6 HARDWARE

I.6.1 EVALUATING HARDWARE NEEDS

90 The requirements for evaluating hardware will depend on the nature of the hardware, its complexity, and any links with existing hardware and software. Strict evaluation criteria need to be drawn up before accepting the hardware for evaluation (and expected acquisition). Many of these criteria will be the same as those set out in the software section above. NSOs must draw up specifications before the evaluation takes place, and these specifications must clearly describe the requirements for the hardware. Suitable hardware should be acquired on the basis of a tender or, if only one possible supplier is available, by direct purchase.

91 The statistical office needs to set up a team to carry out the evaluation. The make-up of this team will depend on the complexity of the hardware, the different hardware configurations being evaluated, and the other resources available. The members of the evaluation team must know how to make valid, consistent, and unbiased assessments of the equipment – from a technical skills perspective – and be able to manage an objective evaluation process over time.

92 Technology changes rapidly, so updated or new hardware may become available after the evaluation is complete. If the technology does change, another full evaluation should be used to decide whether to implement the new hardware, rather than relying on vendor-promised performance improvements. NSOs should not assume that updated hardware will necessarily perform better or be more suited to the particular census application.

93 The evaluation should encompass a series of phases to ensure that the hardware is thoroughly assessed. NSOs must test the equipment’s operation in its production environment to be sure that it will perform properly.

94 Initial capital cost is only part of the cost of the hardware to the agency. It is one factor, but not the only one, nor necessarily the most important, in evaluating hardware. A relationship between savings and risk exists: cheaper equipment has the potential to cost more in the long term if user requirements are not met or if the equipment needs continuing maintenance and possible replacement before finishing the required job.

95 Product quality is paramount. Some vendors assemble hardware systems using several different off-the-shelf components. These elements require extensive testing, including testing of the integration of all the components into the system. Vendors must also guarantee a continuing supply of like products over time. A set of standards for delivering elements, and a rigorous management process, is essential regardless of whether there is a single supplier with a proprietary “box” or whether the box is built from modules.

96 The period and conditions of the warranty are important points in evaluating vendor hardware. The warranty must cover the time needed to carry out the census. Since census periods so often run longer than expected, time should be included in the initial estimate for overruns.

I.6.2 ACQUIRING HARDWARE

97 NSOs usually acquire hardware on a basis similar to that for acquiring software. When the hardware is new technology for the organization, a tender process normally ensures that the hardware is the best solution – technically and financially – for the organization. When compiling the request for tender, NSOs must carefully consider the legal requirements of the organization and government policies, including ethical and probity considerations. NSOs should use any existing system of panels of suppliers for specific, relevant types of hardware to purchase or lease the required hardware. Ethical and probity issues are of paramount importance in any acquisition process. These issues can cause delays or other problems if not handled properly.

98 Staff must make detailed requirements specifications before releasing the tender document or contacting panel suppliers. The specifications will form the basis of the evaluation criteria (see section 5 above). The organization must determine its real requirements and acquire hardware appropriate for the job. Pressure to buy older technology to save money may be counter-productive if the NSO needs to upgrade other components. On the other hand, the NSO should make every effort not to pay too much for hardware by buying equipment that delivers more performance and functionality than required. Careful planning gains the most benefit from hardware purchases.

99 NSOs should follow some basic rules for acquisitions:

1. Use requests for proposals or requests for tender to control the process. The NSO should prepare detailed proposals or requests for tender that take into account the current office space, staff, finances, and other factors. Proposals should be specific to the needs of the specific NSO and the specific census, but some parts should be general and flexible enough to allow for changes in the technology and other circumstances between the time of the bidding and the end of the processing.

2. Try to keep proposals simple. The proposals should be written in simple English (or the language of the country). Simple proposals are easier to follow and are less likely to have provisions that the scanning company could later use “against” the NSO during the census or afterwards. One African country found, to its dismay, that it had bought scanners, but the scanning software was proprietary, and it could only do follow-on surveys by using the same company and its software on the purchased hardware.

3. Purchase only what is required, but as much as possible, to encourage competitiveness in the evaluation process. The NSO should purchase only what is absolutely necessary and not try to be any “fancier” than needed to get the job done. Often, extra, “deluxe” systems have problems because they either have not been fully tested in the scanning bidder’s office, or too few countries have used them and so errors creep in.

4. Shortlist ruthlessly, focusing on the best technical solution and overall value for money. If the NSO does not have in-house capacity to assess the quality of the initial bids, outside help should be obtained. The criteria for short-listing should be explicit, with check lists, so that after the selection, paper trails exist for each bidder. Lawsuits and hurt feelings can be ameliorated with full documentation.

5. Negotiate the warranty period. Too often NSOs do not consider the need for immediate and long-term maintenance of the machines, once purchased. Warranty periods should be determined, and compared among the bidders.

6. Negotiate free training to be provided by the vendor. Every effort should be made to include free training as part of the package. Of course, the training will not be “free” if it is included as a separate line item in the bid, so NSOs must watch carefully to determine how the line items actually compare. But, if the bidder includes training, the NSO can determine whether in-house training or training at the scanning company’s office or training off-site is the best procedure, and include that in the determination of the successful bidder. In-house training has the advantage that many staff can be trained at once; training at the scanning company could provide more options, more different types of training, and could be an advantage if trained staff come back and offer subsequent training to the rest of the staff.


7. Consider the level of local maintenance support available. It is important that machines be maintained, particularly during the capture process. If the NSO cannot maintain the machines themselves, then they need to look outside the office for maintenance. If the bidder is foreign, then the NSO must determine if any local public or private-sector operation can provide maintenance. If no local agency is available, then the scanning company itself or another foreign agency would need to provide maintenance. This will be a major expense and must be taken into account when determining who will provide the scanning.

8. Consider the advantages and disadvantages of purchasing locally compared to internationally. As noted, not only is inside vs. outside maintenance important, but the purchase itself must be considered. International purchasing will require additional costs, at least for the transportation if nothing else. But most likely, other expenses connected with the purchase will be added on. So, if the purchase cannot be made locally, the additional expenses must be considered when a final decision is made.

9. Avoid being under any obligation to a vendor. It is almost impossible not to become “indebted” to the vendor once the agreement is signed. It is important to talk with other NSOs about their own experiences with particular vendors before purchases are made. If other countries have had problems with vendors either supplying everything promised, or slow in delivery, or slow in maintenance, then these issues should be taken into account as decisions about possible vendors are made. Once the decision is made, and the vendor selected, the NSO is basically tied to that vendor, and with it, any delays or cost over-runs. It is almost impossible to change vendors part-way through the process. So, be forewarned!

10. Consider ethics and probity issues at all stages. Good vendors will be “upfront” about previous and current issues with their scanners and software. NSOs should check with statistical offices in other countries about the professional ethics of the vendors being considered.

100 NSOs need to calculate accurate estimates of the equipment and manpower needed to capture and process the census information using both manual data entry and scanner technology. The calculation will depend on the target timeline, the estimate of the quantities to be processed or the limitation in resources (i.e., financial, personnel or infrastructure).

I.6.3 MANAGEMENT OF THE PRODUCTION

101 Production management during data processing projects needs to be carefully controlled since scales of work tend to be underestimated. Making a realistic estimate requires having the following information:

1. Number of data entry stations available for this work.
2. Number of shifts of data entry operators.
3. Number of productive hours on each shift.
4. Number of data entry operators on each shift.
5. Average number of key strokes per hour.
6. Number of questionnaires.
7. Average number of key strokes per questionnaire.
8. Percentage of verification to be done.

102 NSOs then must make some assumptions, based on the local situation and other factors which affect overall production, including:

1. Some percentage of the equipment will not be operational at any point in time because of mechanical breakdown or operator absence.

2. A large percentage of key strokes will be automated by recognition engines doing OCR/ICR.

3. Some percentage of the data will have to be rekeyed because of errors encountered in verification.


4. Keying of manual corrections during editing will be the equivalent of five percent of the original workload.5

103 Suppose the following case exists:

1. 10 data entry stations available for this work.
2. Two 8-hour shifts of data entry operators.
3. 6 productive hours per shift.
4. 10 operators on each shift.
5. Average of 8,000 strokes per hour.
6. 10,000 questionnaires.
7. 2,000 strokes per questionnaire.
8. 100 percent verification.

104 Making the assumptions listed above for other factors affecting production, the calculation in terms of days follows:

Number of work days = Total strokes / Strokes per work day

= (No. of questionnaires x strokes per questionnaire x verification factor x factor for rekeying of data entry errors x factor for keying of editing corrections) / (No. of stations x factor for station operational efficiency x shifts per station x productive hours per shift x strokes per hour)

Or

= (10,000 x 2,000 x 2 x 1.05 x 1.05) / (10 x 0.9 x 2 x 6 x 8,000)
= 44,100,000 strokes / 864,000 strokes per work day
= ± 51 work days

105 The equations simply say that the total processing cannot be finished in fewer than 51 work days, given the information available and the assumptions made. The office can determine a plan for data entry after establishing this quantity of work. Applying an 80 percent recognition factor to the above calculation, to account for the assistance that OCR and ICR bring to the keying process, the following would now hold true:

= (10,000 x 2,000 x 2 x 1.05 x 1.05 x 0.2) / (10 x 0.9 x 2 x 6 x 8,000)
= 8,820,000 strokes / 864,000 strokes per work day
= ± 10 work days

106 What would have taken nearly two months can now be achieved within two weeks. However, issues like badly designed forms and noise in the scanned images can increase the total number of strokes, thereby increasing the total amount of work to be done.
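The calculation above lends itself to a simple script so that planners can vary the assumptions and see the effect on the schedule. The following is a minimal sketch in Python; the parameter values are the illustrative figures from the example and should be replaced by each NSO's own estimates.

```python
# Minimal sketch of the work-day calculation described above.
# All parameter values are the illustrative figures from the example
# and should be replaced with the NSO's own estimates.

def work_days(questionnaires, strokes_per_questionnaire,
              verification_factor, rekey_factor, edit_factor,
              stations, station_efficiency, shifts,
              productive_hours, strokes_per_hour,
              recognition_share=0.0):
    """Return the estimated number of work days for data capture.

    recognition_share is the fraction of keystrokes automated by
    OCR/ICR (0.0 for pure manual keying, 0.8 in the example above).
    """
    total_strokes = (questionnaires * strokes_per_questionnaire *
                     verification_factor * rekey_factor * edit_factor *
                     (1.0 - recognition_share))
    strokes_per_day = (stations * station_efficiency * shifts *
                       productive_hours * strokes_per_hour)
    return total_strokes / strokes_per_day

# Manual keying: about 51 work days
print(round(work_days(10_000, 2_000, 2, 1.05, 1.05,
                      10, 0.9, 2, 6, 8_000)))

# With 80 percent of keystrokes handled by OCR/ICR: about 10 work days
print(round(work_days(10_000, 2_000, 2, 1.05, 1.05,
                      10, 0.9, 2, 6, 8_000,
                      recognition_share=0.8)))
```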

I.7 PERSONAL DATA ASSISTANTS (PDAS) AND USE OF THE INTERNET

107 NSOs started using computers in the 1950s to assist in capturing census data. From the 1950s through the 1970s, offices “punched” cards that were then read into mainframe computers for processing. In the 1980s, entry moved from punched cards to keying directly into the mainframe machine, and then later into Personal Computers – microcomputers. Keying was the main form of entry through the 1980s and 1990s, and, for many countries, into the 2000s. Scanning technology first became prominent for the 2000 round of censuses, and continues for the 2010 round. However, it is clear that the 2020 round will move to personal digital devices of one sort or another, and entry will involve some form of computer-assisted personal interviewing.

5 StatsSA puts these figures at 10 percent of equipment being nonoperational at any point, 80 percent of keystrokes automated, and 5 percent of data needing rekeying. All of the figures presented here come from StatsSA analysis. (Source is original draft of this paper)

I.7.1 USE OF HAND-HELD DEVICES FOR MAPPING

108 Personal Data Assistants come in several forms. Most now have Global Positioning System (GPS) capabilities. Individuals use these to find the addresses of friends or shops; they can also be used to locate units for enumeration. These devices serve several purposes. One, of course, is that NSOs can do prelisting, pointing and shooting at each unit to obtain its latitude and longitude (especially useful in Africa, where houses normally do not have street addresses). The information can be used to produce enumeration area maps, census district maps, and maps for other levels of the geographic hierarchy. These maps can then be produced electronically for the enumerators and census supervisors to use during enumeration.

109 It is important to note that several private-sector companies have now developed handheld devices, separate from PDAs, that only provide the enumerators with information about their geographic locations. These devices should be less expensive than a full PDA, and because they cannot link the geographic information with the confidential information provided by respondents, they are safer. However, they suffer from the fact that the enumerator has to copy all the digits of the latitude and longitude onto the questionnaire, which could easily introduce errors into the data set.

110 When the listing information is compiled, enumeration areas can be formed, and the geographic information collected together. The enumeration map can be provided either on paper or electronically, in a PDA, to guide the enumerator to the appropriate unit for enumeration. When a unit is fully enumerated, the PDA can send a message back to the district and central offices, and the census enumeration can then be “relatively” easily monitored, provided the system is set up beforehand and adequately tested.

111 Additionally, the GPS can assist countries which collect both short-form data (from all households) and long-form data (more detail, but from fewer households, presumably randomly or purposively selected). The computer can pre-select the units that are to receive the long form, and in the case of PDA enumeration, only the appropriate form would pop up when the enumerator starts enumerating at a unit.
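As a simple illustration of such pre-selection, the sketch below (in Python) draws a random subset of household identifiers to receive the long form. The sampling fraction, identifier format and fixed random seed are hypothetical choices for illustration only.

```python
import random

# Hypothetical list of household identifiers in one enumeration area.
households = [f"EA001-HH{n:03d}" for n in range(1, 101)]

# Assumed sampling fraction for the long form (for example, 10 percent).
LONG_FORM_FRACTION = 0.10

# Draw the long-form sample once, centrally, so the PDA application
# can simply look up which form to display for each unit.
random.seed(2020)  # fixed seed so the selection is reproducible
long_form_units = set(random.sample(households,
                                    int(len(households) * LONG_FORM_FRACTION)))

def form_for(unit_id):
    """Return which questionnaire the PDA should open for this unit."""
    return "long form" if unit_id in long_form_units else "short form"

print(form_for("EA001-HH007"))
```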

I.7.2 COMPUTER ASSISTED PERSONAL INTERVIEWING: AN INTRODUCTION

112 The problems with paper-based data collection. Paper-based data collection is, of course, the traditional method of obtaining census data, and the point of reference for the data collection methods described below. However, now that PDAs have become available, many of the drawbacks of paper have become even more evident. These include the following:

1. Paper-based data collection is slow and time-consuming. After respondents provide answers, the enumerator has to either check a box (not particularly labor intensive) or laboriously write out the information (very labor intensive).

2. Re-keying information is inefficient and increases chance of error. PDAs enter the information directly into the processing medium. Any of the other methods require either rekeying or some method of electronic capture, adding at least one more step to the process.

3. Paper-based forms increase chance of error. The more steps in the process – almost any process – the more opportunity for errors. So, when paper is used, and is handed from one person to another for manual checking, coding, additional checking and verification, the likelihood of errors occurring increases.

4. Submitting multiple forms for a single process. With paper, when quality control is not complete, entry could occur more than once, or not at all. And, these problems can occur at each step, from one process in the series to the next.

5. Cannot capture value-added data – GIS. The PDA automatically captures the geographic information from satellites. This information aids enumerators in both finding and not duplicating units, and helps the rest of the processing, since computers can be programmed to note whether the same unit is being processed more than once – or not at all – based on its geographic position.


6. Data integrity and authenticity. Curb-stoning and other enumerator-generated errors are likely to be reduced, since computer programs can be implemented to check for systematic problems in data entry.

113 Computer Assisted Personal Interviewing (CAPI) integrates computer hardware, software, networking, database management and human resources. CAPI collects, processes, transfers and consolidates face-to-face survey and census data. Many National Statistics Offices (NSOs) are implementing the CAPI system. Other Computer Assisted interviewing methods include Telephone Interviewing (CATI), Web Interviewing (CAWI), and Self-Interviewing (CASI). Each of these methods achieves specific survey objectives. CATI and CAWI collect data from difficult-to-reach respondents. CASI appears in surveys that include sensitive questions. Multi-channel census plans often include combinations of these methods. We focus on CAPI in this section.

114 CAPI implements the move from traditional paper questionnaires to electronic questionnaires, and demonstrates several advantages: improved data quality, increased timeliness, more effective management, rationalized costs, capacity building, and satisfying increasing demand for statistical information. A typical implementation of CAPI has the following objectives:

1. Using software and computer hardware to allow enumerators to collect data in a user-friendly operational environment;

2. Developing software applications adapted to mobile computer hardware while respecting content and census questionnaire design, integrating validity and logical controls, and facilitating coding of variables requiring classifications;

3. Offering rational management of field work (dealing with incomplete questionnaires, refusals, and absences, assessing productivity, and resolving technical issues);

4. Ensuring easy, quick and secure transfer of collected data;

5. Allowing timely dissemination of data; and,

6. Providing a cost-effective method.

I.7.3 ESSENTIAL FEATURES OF CAPI

115 Implementing several features can take fuller advantage of CAPI. These features include interviewing workflow, Geographic Information System (GIS) capabilities, data transfer and consolidation, operation management, data editing, tabulation, and customization of the mobile application.

I.7.3.1 Interviewing workflow

116 Enumerators should be able to interview in a smooth and user-friendly operational mode. Software applications should support:

1. Automatic sequencing (transfers and skips) between questions;
2. Easy choice of multiple answers;
3. Automatic online verification of answer validity and consistency checks within the record, with the possibility to correct or force invalid or erroneous answers;
4. Automatic calculation of variables (e.g., age);
5. Free text capture for open answers and notes;
6. Integration of classification dictionaries, with simple tools to access them online or afterwards; and,
7. Accessibility to online interviewer instructions.

117 Among the most time-consuming tasks in data collection is coding occupation and industry, places for migration, and field of study. These questions require the use of long standard classifications. The CAPI software application allows interviewers real-time access to integrated dictionaries. Interviewers can key in answers by:

1. Writing the corresponding code directly;
2. Browsing a tree of the classification by code and full label;
3. Searching by keyword or label; and,
4. Browsing common labeling and adding frequently used entries.

Of course, interviewers must record the exact answers as respondents provide them, for later verification of the interviewer's codification.
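To illustrate the kind of real-time keyword access to a classification dictionary described above, the sketch below searches a small occupation list by keyword. The codes and labels shown are illustrative examples only, and a real application would hold the full national or international classification.

```python
# Hypothetical extract of an occupation classification (code, label).
OCCUPATIONS = {
    "2221": "Nursing professionals",
    "2330": "Secondary education teachers",
    "6111": "Field crop and vegetable growers",
    "9621": "Messengers, package deliverers and luggage porters",
}

def search(keyword):
    """Return (code, label) pairs whose label contains the keyword."""
    keyword = keyword.strip().lower()
    return [(code, label) for code, label in OCCUPATIONS.items()
            if keyword in label.lower()]

# The interviewer types part of the respondent's answer...
for code, label in search("teach"):
    print(code, label)   # 2330 Secondary education teachers
```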

118 CAPI provides a Graphical User Interface (GUI), which can be in a single working language of the NSO, since only the interviewers use it. Often, countries have many different languages, so the interviewers may need to switch from one language to another depending on the respondent's language preference. The questions themselves, though, will appear in various languages or as enumerator translations.

119 Implementing the possibility of comparing a respondent's current answers online with previous answers during post-census surveys allows the quality of the data collected during the census to be assessed. Special care in the conceptual and development phases ensures quick responses from the software: download time of the application, passage from one question to the next, and execution of routines.

I.7.3.2 Other factors

120 GIS capabilities. Visualization of maps on the mobile device, coupled with Global Positioning System (GPS) enabled hardware, allows surveyors to locate housing units easily. Seamless geo-coding of buildings permits data transfer to a GIS server in real time. In addition, these features allow controllers and supervisors to manage their staff efficiently, that is, monitoring as well as assisting. Several third-party GIS solutions are available, including ArcPad and ArcGIS Mobile, both ESRI products. While ArcPad is suited to GIS-trained personnel, large-scale operations should use ArcGIS Mobile with non-GIS personnel. ArcGIS Mobile communicates with an ArcGIS server through a mobile, data-access web service. The connection can be wireless, using General Packet Radio Service (GPRS), through a Local Area Network (LAN), or with an ActiveSync connection. The connection to the internal network (using a dock with a LAN port or through a USB cable connected to a desktop PC running ActiveSync) allows download of GIS data to a mobile device. Surveyors in the field can use ArcGIS Mobile without an Internet connection to view maps. ArcGIS Mobile application deployment supports Windows Mobile 6. Applications can integrate the ArcGIS Mobile run time seamlessly into the CAPI application. Staff can use the ArcGIS Mobile Software Developer Kit (SDK) and runtime to build custom applications for specific features.

121 Data transmission and security. Collected individual data are personal and protected by statistical laws. NSOs must require a personal interviewer login and password for access to mobile devices, possibly including biometric access (fingerprints, for example) to computer hardware and software. NSOs should implement a secure, reliable and easy method to transmit data. Batch mode provides a way to reach the servers when NSOs cannot guarantee full access and confidentiality at all times and in all places. Depending on the communication infrastructure, available technologies include GPRS, Internet wireless connectivity (3G), encrypted modem, File Transfer Protocol (FTP), the Internet, Secure Digital Memory (SDM) cards and/or a synchronizer. Authentication, encrypting modems, and encrypted data increase the security and integrity of the data.
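As one simple illustration of encrypting a batch of collected data before transfer, the sketch below uses the third-party Python cryptography package. The file names are hypothetical, and the key-management arrangements (how keys are generated, distributed and stored) would have to follow the NSO's own security policy; this is a sketch of the idea, not a prescribed solution.

```python
# Minimal sketch: encrypt a batch of collected data before transfer.
# Requires the third-party "cryptography" package (pip install cryptography).
from cryptography.fernet import Fernet

# In practice the key would be generated centrally and installed on each
# device through a secure channel, never stored alongside the data.
key = Fernet.generate_key()
cipher = Fernet(key)

# Hypothetical batch of interview records serialized as bytes
# (in a real system this would be read from the device's data store).
batch = b"EA001;HH007;person 1;age 34\nEA001;HH007;person 2;age 7\n"

token = cipher.encrypt(batch)          # encrypted payload, safe to transmit
with open("ea001_interviews.enc", "wb") as f:
    f.write(token)

# The central office, holding the same key, recovers the original records:
assert Fernet(key).decrypt(token) == batch
```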

122 Data editing and tabulation. Controllers, supervisors, and subject-matter specialists should integrate consolidation, verification and correction of responses at all steps beyond interviewing. Availability of imputation tools is an asset. The application should be able to produce basic tabulations, as well as offering features to make dynamic tables.

123 Support structure and management. CAPI offers the possibility to computerize the whole process of data production, from data collection through tabulation. CAPI also allows follow-up of production and productivity, as well as data quality assurance. Management systems can implement the following features to assist:

(1) Electronic generation and transmission of workload (i.e., households to be visited by interviewers), including sampling when applicable (e.g., post-census survey);

(2) Tracking the status of surveyed households (awaiting interview, being interviewed, completed interview, uncompleted interview, transmitted data, controlled data, edited data...);


(3) Providing a support hot line number to assist with software, data transfer, hardware or questionnaire issues;

(4) Generating daily individual and aggregate statistics on productivity of interviewers, controllers and supervisors, and,

(5) Backup and recovery.

124 Customization. CAPI requires a designer component for customizing the mobile application. NSO staff should be able to implement and modify the work through a GUI. The elements include questionnaire item definitions, sections, questions in multiple languages, answer modalities, classifications, notes, validation rules, logical consistency checks, messages, skip patterns, variable calculations, and census personnel names and assignments. The NSO should need no higher-level programming language for the application designer component. Operators should transfer configuration files to the mobile device separately from data files. Automatic updates allow the application to be refreshed whenever the client establishes a connection with the server.
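As a purely illustrative sketch of what such a designer-exported configuration might contain, the structure below defines two questions with answer modalities, a validation rule and a skip pattern. The field names, codes and rules are hypothetical and do not correspond to any particular CAPI package.

```python
# Hypothetical questionnaire configuration, as a designer tool might export it.
# Field names and rules are illustrative only.
QUESTIONNAIRE = {
    "id": "census2020_household",
    "questions": [
        {
            "name": "age",
            "text": {"en": "How old is this person?", "fr": "Quel âge a cette personne ?"},
            "type": "numeric",
            "validation": lambda answers: 0 <= answers["age"] <= 120,
            "skip_to": lambda answers: "marital_status" if answers["age"] >= 10
                                        else "school_attendance",
        },
        {
            "name": "marital_status",
            "text": {"en": "What is this person's marital status?"},
            "type": "single_choice",
            "modalities": {1: "Never married", 2: "Married",
                           3: "Widowed", 4: "Divorced/separated"},
        },
    ],
}

# The run-time component would evaluate validation and skip rules as the
# interviewer enters each answer, for example:
answers = {"age": 7}
question = QUESTIONNAIRE["questions"][0]
print(question["validation"](answers))   # True: age is in range
print(question["skip_to"](answers))      # "school_attendance"
```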

I.7.4 CAPI COMPONENTS

125 The three main components of CAPI are: (1) software application, (2) computer hardware, and (3) human resources. Each of these components requires choices among several available options.

I.7.4.1. Software applications

126 Three strategic choices of software solutions are worth investigating:

1. Out-of-the-box commercial or public-domain software. CSPro, SPSS Entryware, and Blaise are among the several available CAPI-compatible software packages. CSPro 4.0 is free software for data entry, editing and tabulation that supports Personal Data Assistants (PDAs), also known as Pocket PCs (PPCs). SPSS Entryware supports PDAs, but requires licenses for the designer and run-time components for each PDA. Blaise is modular, extendable data entry software well suited for complex surveys, but requires a license and does not currently support PDAs.

2. In-house specific software, developed for collecting and organizing data for a specific census or survey. While this solution may be easier to develop, it may not allow last minute changes (e.g., questionnaire and validity rules) and may not be reusable for other statistical operations.

3. Software for use in the census as well as in other household or establishment surveys. National Statistical Systems can be cost-effective by integrating customizable solutions for collecting and processing census and survey data. The NSO can develop the package in-house or through a service provider, but the latter requires delivery of the source code to ensure maintenance. The main advantages of this alternative are that (1) the GUI allows quick and easy changes to the questionnaire and logical rules, (2) the solution can be customized for other surveys, and (3) it builds capacity and dependability within the NSO.

I.7.4.2. Computer hardware

127 The choice of hardware depends on the required CAPI features, field conditions, length of interview, complexity of questionnaire, and budget costs. Hardware choices include PDAs, tablets, notebooks and combinations of these. NSOs should make final choices only after testing on a pilot census. The following lessons come from interviewers, controllers, supervisors, and managers based on the experience of several NSOs:

128 Advantages of PDA over notebook:

(1) The small size of the equipment is well adapted to field conditions. Enumerators can conduct the interviews while standing, so they do not need desks, whereas notebooks take up more space.

(2) PDAs are lightweight, and so convenient for interviewers. The heavier weight of notebooks may cause discomfort for interviewers, especially in rural areas.


(3) PDAs provide the possibility of conducting interviews in difficult weather conditions, such as under a hot sun or in considerable wind, while notebooks are not adapted to harsh climatic conditions.

(4) PDAs cost less, although some PDAs may be more expensive than some notebooks.

129 Advantages of notebook over PDA:

(1) The bigger screen size of notebooks allows display of more information. The small screen of a PDA does not allow showing all information at once, requiring some scrolling or memorization of questions.

(2) The availability of keyboard in notebooks allows for quicker data collection, especially for open-ended questions. Most PDAs have no hard keyboard, and a virtual keyboard may be less comfortable to use, and, in some instances, may hide possible answers in a long list.

(3) Software responds quicker in notebooks because of larger processing power. Hence, execution and navigation between questions are faster. Some PDAs with low processing power and resource-demanding applications may show slower execution of commands and navigation between questions, potentially freezing the application, especially in large households.

130 Hardware specifications of PDA. Very large numbers of PDAs will be required to conduct a census within any short period of time. Hence, acquisition of new PDAs may represent a substantial portion of the census budget. The NSO may be able to negotiate directly with the manufacturer for specifically tailored PDAs that respond exactly to the required CAPI features. In order to implement the full features of CAPI, the following are the minimum required specifications for PDAs:

Processor: 624 MHz CPU
Operating System: Windows Mobile® 6.1 Professional
Memory: ROM: 256 MB, RAM: 128 MB
Display: 5” TFT-LCD flat touch-sensitive screen
Network: HSPA/WCDMA, Quad-band GSM/GPRS/EDGE
GPS: Internal GPS antenna
Connectivity: Bluetooth® 2.0, Wi-Fi®: IEEE 802.11 b/g

I.7.4.3. Human Resources

131 Human resources. The human element is crucial for successful CAPI operation, with several aspects stressed:

(1) Interviewer experience and expertise in conducting household enumeration;
(2) Knowledge of variables, modalities and classifications;
(3) Enumerators properly presenting themselves to the households;
(4) Mastery of computer hardware; and,
(5) Understanding the use of the data collection software.

The human resources component includes interviewers, functional and technical supervisors, and controllers.

132 Technical supervisors should be trained to install and maintain application software, to customize software applications, and to program validation and logic rules using the application script. Functional supervisors should be trained to use the software applications (designer and run-time components) in order to transfer configurations, enter survey personnel, perform sampling, conduct interviewing, ensure verification, control the import and export of data, and consolidate the data. NSOs should train controllers to use personal computers, and to use the appropriate software applications to generate interviewers' files, consolidate data, and import and export data.

133 NSOs should train interviewers to use:

(1) Personal computing: the operating system, the notion of file and directory, the hard drive, USB, SDM cards, the stylus, use of menus, and changing batteries;

(2) Data collection applications: executing applications, identification of survey households, entering household members, conducting surveys, and use of integrated classifications; and,

(3) The new e-questionnaire.


134 The human resources profiles required to collect data with CAPI are different from those for Paper and Pencil Interviewing (PAPI). The former requires readiness to learn how to use a mobile device and software applications. An initial training of 5 days on the use of personal computing, followed by 5 days of training on the use of the application, is typical. NSOs should administer tests to all staff, with additional training when necessary.

I.7.5 PDA CONCLUSIONS

135 Embracing technology can improve quality, save time, contain costs, and build capacities. Numerous countries use CAPI with PDAs. The NSO needs to plan the integration of this technology through rigorous methods, training of staff, and multiple-site testing. Partnership with the private sector builds capacity and provides the administration with a solution package for effective census planning as well as for surveys. PDAs have the following advantages:

1. Quality of collected data is greater: localisation of households and improved coverage, real-time verification of data validity and coherence, use of pick lists and integrated classifications, reduced non-response due to interviewer errors, a standard approach across interviewers, easy testing of the electronic questionnaire, and interviews that are more likely to be conducted in a professional manner, so respondents are more likely to provide sensitive information.

2. Timeliness of the production cycle improves since CAPI eliminates the data entry phase, reduces editing time, uses coding in real time, computes variables, and automatically transfers data.

3. Cost-effectiveness. This solution is potentially cost-effective. The high cost of acquiring the new technology (software, computer hardware and training) is offset by the elimination of other steps previously used in traditional processing methods. Printing, transportation, and large storage facilities for paper questionnaires are no longer required. Likewise, NSOs do not need to recruit data entry clerks. CAPI reduces data editing since it applies all validation rules and consistency-check rules in real time. Savings also may arise from shorter interviewing time.

4. Individual information is more secure, since only the interviewer sees the identifying answers, while processing of paper questionnaires involves many different individuals.

I.7.6 USE OF THE INTERNET IN DATA COLLECTION

136 Along with the newly introduced PDAs for data collection, even newer is Internet-based data collection. Already, some “cell phones” – iPods, iPads, and other similar devices – can enter data directly into the computer, although for many of them the keyboards are difficult to operate since they are so small. But they are inexpensive, and they are readily available since so many Africans already have them.

137 The introduction of netbooks could be even more important for census data collection because they have larger keyboards which are easier to use. Hence, questionnaires no longer have to fit a particular page size, nor do we have to worry about layout or marks to make sure data capture is efficient and fast.

138 The main problem with these new technologies, and with direct access to the Internet, is whether data can be encrypted as they are entered, and whether firewalls can be developed to safeguard the data once entered. It is very clear that this particular problem has yet to be resolved, and until it is, most African countries are unlikely to use this technology to enter their data.

I.8. RESOURCE AVAILABILITY: EXTERNAL CONSULTANTS AND OUTSOURCING


139 Many countries have increased use of external consultants (including consultants provided through international development programs) and outsourcing in recent years. These outside experts become part of the total development of systems and facilities for census use. Traditionally, many countries use other government agencies for forms printing and mapping services. However, NSOs are also outsourcing both information technology and non-information technology projects to other government agencies and/or private vendors. Many NSOs are also using more external consultants.

I.8.1 MANAGING OUTSIDE CONSULTANTS

140 The principles involved in managing these external consultants or the outsourcing vendors are the same in all cases, regardless of whether they are another government agency or a private vendor. Whether or not a project is information-technology based, the desired result is the same – that is, to achieve a successful census – with all requirements being met at the agreed cost and to the agreed timetable.

141 Use of external consultants and/or outsourcing depends on the requirements of the organization (including requirements for confidentiality and security), whether required skills exist in-house and whether projects can be outsourced cost-effectively. Outsourcing decisions should be made within the context of a larger organizational plan that identifies choices between (1) hiring and training staff, or (2) using external service providers to augment or replace resources for specific projects. No clear cut distinction between hiring consultants, the use of external service providers or outsourcing exists; quite often a system will contain elements of all of these working elements together with in-house resources.

142 The agency may have limited skills of the type needed for the implementation of a particular specialized system, or information technology may not be a core part of the business. If limited skills exist, the NSO may decide to have greater proportions of the work undertaken by resources outside the census agency. Instead of simply acquiring hardware and software for the processing system, NSOs would look for a total solution, with the successful tender taking responsibility for all information technology aspects of the processing system.

143 Bilateral agreements allow for the use of international consultants as technical advisers in many countries. In these cases, census managers should take advantage of the opportunity to assist in capacity building within the census agency. In some countries, agencies form tenders committees, consisting of the Ministry of Finance and a general control association, as well as the Statistical Office. The committee is usually responsible for calling tenders, requirements and conditions, evaluation of tenders and selecting the most suitable ones.

I.8.2 DIFFERING OBJECTIVES

144 Any external resource provider will inevitably have additional (or different) objectives from those of the statistical agency. For example:

1. A specialist mapping service provider may be more interested in producing maps to the highest standards of cartography than in offering a service that allows enumerators to locate dwellings effectively; or,

2. A private sector business will be obligated to provide a return to shareholders rather than satisfying the public policy needs that drive government agencies.

145 As a result of these differing objectives, in all cases where NSOs employ external resources, staff members need to carefully ensure that the selected external provider delivers a cost-effective solution that meets the census agency’s needs. The agency must carefully specify, plan, and monitor external service providers.

146 The NSO must also monitor the external agency's use of any information obtained while consulting. Confidentiality of all maps and data must be maintained, so external agencies cannot have access to


data or maps or other materials that could potentially breach confidentiality. Also, agreements need to be spelled out about what the external agency can have and cannot have access to during and after the census.

I.8.3 SPECIFICATIONS

147 Successful outsourcing initially requires the census agency to have a clear understanding of the requirements. NSOs must specify these requirements unambiguously to the service providers. If the agency cannot express its expectations and priorities clearly to service providers, the service providers may have difficulty achieving them. All parties also need to ensure that all sides fully understand any legal documents (e.g., conditions of tender). The laws, rules and procedures that apply in a country will determine the way NSOs pass these specifications on to the external providers. However, detailed written specifications should serve as a benchmark against which to measure performance later in the process.

I.8.4 PREPARATION OF SPECIFICATIONS

148 During the preparation of specifications for outsourcing or external service provision, staff should use about half of the time to establish:

1. The objectives of the project, 2. The outcome to be achieved, and 3. The procedures to follow in attaining that outcome.

The NSO must also specify the standards for the various outcomes (e.g., for a data entry operation, an allowable proportion of erroneous keystrokes could be specified).

149 The next largest amount of time should be spent on documenting precise price and payment terms where goods and/or services are to be purchased. The specifications must be designed to allow for requirements changing over the life of the project. The documents should include a clear method for both the census agency and the service provider to approve changes. Within this overall framework the specifications should:

(1) Clearly state the scope of the project;
(2) Identify the deliverables and the associated schedule of dates for completion of each deliverable (i.e., milestones);
(3) Identify key personnel by name and qualifications, and set out rules for their replacement, where necessary;
(4) Clearly define invoicing and payment specifications, as well as time-frames and methods for payment of penalties; and,
(5) Set out training programmes and documentation requirements.

I.8.5 MONITORING THE OUTSOURCED PROJECT OR PROCESS

150 NSOs must carefully monitor outsourced projects against the specifications. This monitoring must include early identification of problems, with milestones being important in this process. Particular care should be taken where the outsourced work is being developed and/or undertaken at a site some distance from the census office.

151 Regularly scheduled meetings and telephone and video conferences between census agency and service provider staff are essential. These meetings manage external relationships and ensure that expected contract results are achieved. Compliance with scheduled completion should be specified as a contract requirement, with listings of key attendees from all parties specified in the contract. The frequency of the meetings should be specified along with responsibility for recording and publishing decisions made or items agreed to.

152 A system of cascading meetings must be established with the project team staff meeting their counterparts frequently for routine monitoring. Occasional reports – those restricted to important issues requiring decisions – should be made less frequently, based on meetings of more senior staff. This key area might be neglected if not considered early enough in the progression of activities. Even if the requirements are clearly specified, problems could still arise during delivery, with potential outcomes achieved late or not at all.


153 Clear and open communication is critical in this element of managing a census. NSOs should take care to ensure that all negotiations with external providers have a degree of common sense. NSOs need to appreciate all viewpoints and constraints, as well as provide rigorous contract preparation. Specifications may include penalties for failure to meet deadlines or quality standards, but these penalties are rarely effective in a census context. Cash from penalty payments does not improve the quality of the census held on a date specified months or years later. Attention to detail in the specification documents is a major step towards achieving this goal. NSOs need to develop and manage a good working relationship with the service providers.

I.9 DATA PROCESSING IMPLEMENTATION

I.9.1 RECEIVING OF THE QUESTIONNAIRES

154 The importance of a tracking system. Even before the first questionnaires come in from the field, the processing center should build a tracking tool to monitor the movement of questionnaires from the regions to the Data Processing Center. This tracking system should be electronic, but a paper backup should be included as well, and printed periodically. This backup is needed particularly in countries without continuous electricity. Several software packages, from Microsoft and other companies, and within CSPro, allow excellent tracking systems to be developed easily and quickly. The Microsoft products must usually be bought, but CSPro is free and can be downloaded from the U.S. Census Bureau's website. However, a simple Excel spreadsheet can also be developed for the tracking, if staff members in a government office are more comfortable using that than other packages.

155 As noted, every operation must be thoroughly documented. The documentation assists in monitoring the flow of forms and information during the actual census or survey operation as well as being instructive when the next census or survey is being prepared. Check-in operations are often not thought through before the forms start coming in from the field, and many countries then have trouble keeping track of where the individual forms and groups of forms are located. This can cause real problems in finishing the census processing, particularly if some forms become temporarily or permanently lost. The procedures include:

1. Accounting for every questionnaire. The census processing office must account for each form received by registering it as soon as it comes in from the field. Sometimes forms come in as a whole EA – the field office has held back the forms until the whole EA is finished, checked, and recorded as complete. Sometimes, the forms come in on a flow basis, and so individual forms arrive separately in the processing office. In each case, as soon as possible after they arrive, the forms need to be marked, recorded, and stored in a logical fashion so that they are easily accessible when needed.

2. Accounting for every box. As noted, when the forms come in inside a box, each box must be accounted for as it is received. If the forms come in on a flow basis, then they must be put in the appropriate box when received. When all forms are reported as received, an operation to make sure all forms are present, and in increasing numerical order, should be put in place.

3. Accounting for every geographic area. An operation should make sure of the association between received boxes and the different geographical areas the processing center covers. If the country has a single processing center, then the whole country must be considered at once. When forms are placed in the wrong geographic area because of mis-coding of the geographic information, it is often very difficult to find and re-place the offending forms. Hence, the geography on the form, on maps, and from other agencies and surveys should be used to make sure that the questionnaires are in from the field, that the geography is correct, and that the forms can be found physically at any point in the processing.

4. Accounting for missing forms. When forms are missing, or evidence of mis-placed geography is found, the processing center should first assess whether the problem can be fixed within the


processing center. It is never a good idea to move forms around unnecessarily before electronic capture – whether scanning or keying. But, if the problems cannot be resolved in the processing center, it may be necessary to send forms back to the field offices for further clarification and work. Again, moving forms back to the field office should only be a last resort, and should only be used if the work is so bad that the questionnaires cannot be further processed without this movement.

5. Checking forms against the on-ground geography. The initial items recorded in the tracking system will be information about the questionnaire – the complete geography, possibly number of people in the unit. Whatever information is recorded, these data become the beginning of the reference data base, and it is then built from there.

6. Use of barcodes and geo-codes. If the questionnaires are barcoded, this information should be recorded in the reference database for each form. In some cases the barcode can serve as the point of reference, as a proxy for all the geography. Of course, barcodes must be unique to be used in this way. If the barcode number is associated with the geographic hierarchy, it will help keep the questionnaires organized.

a. Sometimes the original questionnaires are not barcoded, but they can receive bar codes in the processing office when they are received. This procedure also will aid in the “bookkeeping” for the census. But, of course, it is essential that a one-to-one relationship exist between the questionnaire and the barcode for this procedure to work. Many scanning companies provide this facility – some produce the barcodes on the questionnaires themselves, others provide facility for printing and pasting the barcodes on the questionnaires after the fact.

b. As noted, in developing the reference data base, it is essential that a link exist between the geocode and the barcode (if one exists) and the developing database.
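A minimal sketch of such a reference database, using Python's built-in sqlite3 module, follows. The table layout, geographic fields and barcode format are hypothetical and would need to match the country's own geographic hierarchy and numbering scheme.

```python
import sqlite3

# Hypothetical reference database for tracking questionnaires.
con = sqlite3.connect("census_tracking.db")
con.execute("""
    CREATE TABLE IF NOT EXISTS questionnaire (
        barcode      TEXT PRIMARY KEY,   -- unique barcode on the form
        region       TEXT NOT NULL,      -- geographic hierarchy fields
        district     TEXT NOT NULL,
        ea           TEXT NOT NULL,      -- enumeration area
        persons      INTEGER,            -- persons reported on the form
        received_at  TEXT                -- date/time of check-in
    )
""")

def check_in(barcode, region, district, ea, persons):
    """Register a form the moment it arrives from the field."""
    con.execute(
        "INSERT INTO questionnaire VALUES (?, ?, ?, ?, ?, datetime('now'))",
        (barcode, region, district, ea, persons),
    )
    con.commit()

check_in("0000123456", region="01", district="0103", ea="010304007", persons=5)
```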

I.9.2 DOCUMENT MANAGEMENT

156 Then, as the form goes through the rest of the stations in the processing center – first-order checking, then coding, then keying or scanning, then verification, and finally being closed up for storage – the barcode or geocode is used to update the information about the particular questionnaire.

1. Checking the forms in. When the form is first checked in, the date and time of check-in are recorded in the reference database. Presumably, the questionnaire is then immediately stored in a bin, so an additional entry would not be necessary. However, if an extra operation, going from field box to processing bin, is needed, then the date and time of that operation must be recorded in the processing database.

2. Check-out/check-in procedures. Then, for each operation, a check-out/check-in procedure from and to the storage area will track the form throughout the process. So, when the questionnaire is checked out for coding, the date and time are recorded. And then when the forms are returned, the date and time of return are recorded. This procedure would occur for each stage of the process.

3. The Audit Trail. So, we will have an audit trail where each form is recorded at each process. And, this audit trail will allow us to determine where each questionnaire is in the flow of the census.

4. Individual and summary statistics. We will also be able to get summary statistics from the processing database. We will know how far along the check-in procedure is, what percentage of questionnaires are coded, what percentage are scanned, etc.

5. Finding non-moving forms and areas. We will also be able to track those forms that are not moving or are exceptions to the processing rules within the different business operations areas.


Sometimes the questionnaires will simply not have arrived from the field, and so the processes will be slowed down. But, usually, some staff will slip or be absent or not be sufficiently aware of where a particular operation stands, and so the reference database will aid in pinpointing various problems that need to be addressed as the materials flow through the operations.

157 And, as noted previously, extract reports of the progress and failures of cases not passing each stage will be reported on a daily or even hourly basis. And, at the end of the processing, summary statistics for all operations can be obtained.
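Continuing the hypothetical sqlite3 reference database sketched in the previous section, a simple audit-trail table and progress query might look as follows; the stage names and table layout are illustrative only.

```python
import sqlite3

con = sqlite3.connect("census_tracking.db")  # same hypothetical database
con.execute("""
    CREATE TABLE IF NOT EXISTS movement (
        barcode   TEXT,     -- form being checked out or back in
        stage     TEXT,     -- e.g. 'coding', 'scanning', 'verification'
        action    TEXT,     -- 'check_out' or 'check_in'
        at        TEXT      -- date/time of the movement
    )
""")

def record(barcode, stage, action):
    """Add one audit-trail entry for a form."""
    con.execute("INSERT INTO movement VALUES (?, ?, ?, datetime('now'))",
                (barcode, stage, action))
    con.commit()

def percent_complete(stage, total_forms):
    """Share of forms that have been checked back in from a given stage."""
    (done,) = con.execute(
        "SELECT COUNT(DISTINCT barcode) FROM movement "
        "WHERE stage = ? AND action = 'check_in'", (stage,)).fetchone()
    return 100.0 * done / total_forms

record("0000123456", "coding", "check_out")
record("0000123456", "coding", "check_in")
print(percent_complete("coding", total_forms=20000))
```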

I.10 DATA CAPTURE

158 This section of the handbook briefly discusses each of the current methods of data capture. We have to say “current” methods because the technology is changing so rapidly that by the time this document is published, it is likely that newer, better methods of data capture will already be available. The current methods are also being improved very rapidly. However, at the time of this writing the major methods of data capture are (1) key from paper, (2) key from image, (3) Optical Character Recognition and Intelligent Character Recognition, and (4) Optical Mark Recognition. We describe these briefly below.

I.10.1 KEY FROM PAPER

159 The classical approach to data conversion of census and survey forms is key-from-paper. This process involves manual coding and then data entry. Verification follows each of these stages. A batch is typically a set of questionnaires from a single enumeration area. In most cases, a batch is the unit of work. This method of data conversion requires that:

(1) The census questionnaire fits on the workstations of the coders, keyboard operators, and verifiers. The layout should be appropriate to the size and shape of the questionnaire. The layout should have one individual’s details for a single page, or multiple individuals across two pages to enable easy keying and consistency verification.

(2) The forms must be very sturdy, since staff will handle them many times as they move through the various data conversion stages. Coders will use green or red pens to add the codes, so the paper must withstand this treatment.

(3) The coders and verifiers must have the code lists for occupation, industry, geography, levels of educational attainment, etc., to assist in preparing for the keying.

160 NSOs must track all questionnaires received from the field operations. The first step is recording initial from-the-field information in the storage management module of the data conversion process. Operational areas do not have to be physically adjacent. Forms should move from the storage areas to coding, then back to the storeroom, and then to data capture. The NSO should design a storeroom management system to assist the storeroom manager and staff in producing pick lists for coding, coding verification, data capture and verification.

161 Even before production transfers, that is, transfers of whole Enumeration Areas or District Office Area questionnaires, NSOs should use some type of sampling procedures to check coverage and completeness of the questionnaires from Enumeration Areas while still in the field. Regular staff who are trained in checking the questionnaires should visit the District Offices to make sure they are complete, and that the enumerators are following the skip patterns.

162 When the forms are transferred to the processing office, the first step is to receive the forms after the collection phase. This step determines the number of questionnaires having completed information. This tracking system will monitor the progress of the data conversion process. Typically, staff will count the forms for each enumerator area and then enter the number in the system. Clerks verify this number against the number given by the enumerator. When a large variance occurs – say, greater than 5 percent – then the clerk sends an alert to the relevant person for remedy.


163 Hand coding. The coding process requires staff to read descriptions written by enumerators on the form as responses to open-ended questions such as occupation and industry. The clerk looks up an entry on a code list, on paper or electronically, and then writes the appropriate code on the form. The whole design process should determine the amount of coding and levels of geography when needed. If a particular item, like occupation or industry or detailed places, has many digits, great care must be taken, and, especially if the entries are in a large book, sub-samples might be developed for the most common occurrences – like in health and education.

164 Automated coding. The alternative to looking up codes in a paper code list is to enter part of the description into a computer-based system and then select the appropriate code. Coders or keyers then enter the selected code into the database. This online coding can come either from internal lists on the computer or network, or directly from the internet where a constant connection is possible. This approach assumes that data entry operators can easily determine (or simply enter) all other data items directly. In other words, the items are ready to capture.

165 A coding verification process will determine the extent or level of quality of coding for open-ended data items. The coding verifier selects a sample of questionnaires within an enumeration area. The computer system generates the sample of questionnaires needing verification. For each of the sampled questionnaires, the verifier ascertains whether the description on the form and the code assigned from the code list match. Various computer packages assist in keeping tallies of matching and non-matching codes.

166 On completion of the coding verification, the verifier enters the tally results into the system. The system will then determine whether the batch has met the coding quality criteria. The error rate is determined as a simple percentage:

Error Rate = (A / (A + B)) * 100 where

A = Number of non-matching descriptions

B = Number of matching descriptions

167 A computer program can compare the error rates with the tolerance, usually in the range of 5 to 10 percent of questionnaires. If the coding error rate is within the tolerance, then the system marks the batch as ready for data entry. Otherwise, some type of remedial action will be needed. The coding error rate is also an indication of the effectiveness of the coding training. For those batches with error rates above the tolerance range, the respective coders should be re-trained to improve their coding accuracy.
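A minimal sketch of this error-rate check, assuming a tolerance of 5 percent, follows; the tally figures are invented for illustration.

```python
# Tally results entered by the coding verifier (illustrative figures).
non_matching = 12   # A: descriptions whose codes did not match
matching     = 288  # B: descriptions whose codes matched

error_rate = non_matching / (non_matching + matching) * 100  # = 4.0 percent

TOLERANCE = 5.0  # assumed tolerance, in percent

if error_rate <= TOLERANCE:
    print(f"Error rate {error_rate:.1f}%: batch ready for data entry")
else:
    print(f"Error rate {error_rate:.1f}%: remedial action and coder retraining needed")
```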

168 Because computers are tireless and persistent, it is important that the coder and the verifier not correct or amend information, even to ensure consistency. It is in this environment that throughput is high as forms will go through the stages in the shortest possible time. Information technologists do bulk editing for correctness and consistency after completion of data entry.

169 An example of this type of problem is a coder who finds an individual who is 4 years old and has a bachelor’s degree as educational attainment level. If the coder is editing for consistency then the normal case is to assume the age is correct and change the level of educational attainment to “pre-school”. The selection of pre-school assumes that children under 6 years of age have not started school. However, the computer editing program can look at other variables, like relationships within the household to determine which item is more likely to be in error, and to then correct that. So, staff members should do no editing, and let the record go through the coding and data entry without any amendments. Any bulk editing can be done post-data entry.
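As an illustration of how a bulk editing program might handle the age and educational attainment example above, the sketch below applies one simple consistency rule to a household record. The variable names, codes and imputation rule are hypothetical.

```python
# Hypothetical person records for one household.
household = [
    {"line": 1, "relationship": "head",  "age": 34, "education": "bachelor"},
    {"line": 2, "relationship": "child", "age": 4,  "education": "bachelor"},
]

def edit_age_education(person):
    """Resolve an age/education inconsistency for one person.

    Illustrative rule: a person under 6 cannot hold a bachelor's degree.
    The relationship to head is used to decide which item is more likely
    to be in error, rather than always overwriting the education item.
    """
    if person["age"] < 6 and person["education"] == "bachelor":
        if person["relationship"] == "child":
            # Age is plausible for a child of the head, so treat the
            # education entry as the error and impute "pre-school".
            person["education"] = "pre-school"
        else:
            # Otherwise the age itself is the more likely error; blank it
            # so it can be imputed from other household information.
            person["age"] = None
    return person

edited = [edit_age_education(p) for p in household]
print(edited[1])   # the child's education is imputed to "pre-school"
```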

Types of data capture, their acronyms, and descriptions:

Key from Image (KFI): Identical to the key-from-paper method, but with data entry done from a scanned image.
Optical Mark Recognition (OMR): Data are produced from response marks on the questionnaire during or after scanning.
Optical Character Recognition (OCR): OCR technology recognizes machine-printed characters on a questionnaire.
Intelligent Character Recognition (ICR): ICR technology recognizes handwritten characters on a questionnaire.
Intelligent Recognition (IR): IR technology recognizes handwritten and cursive characters on a questionnaire.

170 Data entry also requires two stages: (1) data entry and (2) capture verification. NSOs should think carefully about the various options covering the extent of editing during data entry. In many cases, the more rigorous the editing, the longer the duration of the data capturing activity, and the more likely keyers are to make errors. A later section in this handbook discusses the pros and cons of heads-down keying.

171 But a good rule of thumb, no matter which method of data capture is used, is that forms should spend as little time as possible in the data capturing process. Little iteration between data entry and quality assurance should occur. A quality assurance process determines areas of improvement for data-capturing personnel, and performance improves through re-training of data entry personnel. Bulk editing and correction of errors occur after completion of data conversion.

I.10.2 KEY FROM IMAGE (KFI)

172 Scanning of census and survey questionnaires became viable for data processing in the late 1990s with cheaper, readily available digital imaging technology. The introduction of digital imaging (scanning) provided various methods of data entry to replace the key-from-paper method. The method closest to key from paper is key-from-image. The actual process of Key-From-Image (KFI) is quite similar to that of key-from-paper in that data capture is still manual. However, the questionnaires are first scanned, so all of the information, including hand-written entries and stray marks, is captured. But then, instead of capturing from a paper form, capture is directly from an image. NSOs still need to design questionnaires for scanning interfaces and follow the manual preparation process for scanning and imaging.

I.10.2.1. Advantages

1. Preparatory time. Development of a Key-From-Image system takes minimal planning and implementation because of the basic scanning and capturing process. Once staff members design a Key-From-Image application, NSOs can add new questionnaires without impact on the scanning workflow. Modifications would only be required to the output database to ensure data capture into the correct fields and structures.

2. Online verification. A major advantage of Key-From-Image is that verification of questionnaires occurs at the time of data entry, so staff can easily discover errors and discrepancies. (As noted in the “checking” section below, each of the methods should have verification on entry, or soon after entry, including standard demographic testing.)

I.10.2.2 Disadvantages

1. Production time. No computer-aided recognition occurs in the Key-From-Paper process. Therefore, the keyer types the characters as displayed on the questionnaire. Later, independent checks are therefore impossible. That is, although second keyers can verify the originally keyed data, after capture and initial verification, it is not possible to go back to images to verify what was actually on the form. Staff would have to pull the forms from storage, if they still existed. In all the other capture methods, it is easier to return to the data on the page.

2. Keying errors. Keying errors will occur as the characters are keyed during manual capture. As capturers try to reach their targets under performance-based pay, they are under even more pressure to work quickly. In this environment, they are likely to produce more errors. Hence, the data set requires complete or substantial verification.

3. Entry clerk changes data due to tight validation. NSOs use tight validation when clerks can only input a predetermined set of values for an entry. Some census experts feel that keyers should be involved in making substantive changes to the data. Under this process, clerks will change any


inconsistent information to the easiest value a clerk can select following a set of rules. The program should have these rules embedded, or on sheets of paper, or clerks may be told to simply use intuition. However, in this process, invalid and out-of-range data are not consistently edited and corrected, causing data problems downstream. In fact, as noted above, keying clerks should not be making these decisions. Keyers should just key. A few exceptions might be obtaining the sex from the name, or the relationship to head from the person number on the questionnaire. Otherwise, it is far better to let the computer-editing program do the work. The computer does not tire, and follows the same rules each time a particular situation occurs. Keyers do not always do that.

4. Problems with clerks independently changing items. Data entry clerks independently changing content, when the system hampers their performance with constant error messages, negates the advantage of keying. Keyers should key, and should not make changes to the data during entry (except in a few cases, like deriving sex from the name).

I.10.2.3 Conclusions

173 The major advantages of key-from-image are higher quality (at least through the first years of this decade) and lower cost. However, OCR and ICR technologies have improved so much in recent years that most countries have moved away from key-from-image to a more hands-off approach. To do key-from-image, NSOs must implement quality assurance functions to ensure data entry is within the constraints of the questionnaire. NSOs should give careful attention to the indexing of the scanned forms to ensure they are readily accessible. In some cases, a unique barcode placed on the questionnaire indexes the forms. In other cases, a geographical key, usually built up from geographical information uniquely identifying the area from which the form originated, provides the indexing for storage. Either of these methods is acceptable, as long as any user can readily access the form at any stage of the downstream processes.
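As a rough illustration of the geographical-key approach, the sketch below builds a fixed-width index key from assumed province, district, enumeration-area, batch and form numbers; the field names and digit widths are invented for the example and are not a prescribed standard.

```python
def build_image_index_key(province, district, ea, batch, form_no):
    """Build a fixed-width index key for a scanned questionnaire image.

    The widths below (2-2-4-3-4 digits) are illustrative; each NSO would
    substitute the widths of its own geographic coding scheme.
    """
    return "{:02d}{:02d}{:04d}{:03d}{:04d}".format(
        province, district, ea, batch, form_no
    )

# Example: province 3, district 12, enumeration area 457, batch 21, form 36
key = build_image_index_key(3, 12, 457, 21, 36)
print(key)                                  # "031204570210036"
image_path = "images/{}.tif".format(key)    # any later process can retrieve the image
```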

174 Quality assurance measures in any manual data entry operation are vital. Usually 100 percent recapture and verification of data is undertaken; however, sample-based methods are also available. Images make management and control easier because the original questionnaires can be reviewed on screen. In key-from-paper operations, a massive logistical process occurs whenever the actual physical questionnaires must come out of storage, and in many cases questionnaires are lost; this risk is minimal with key-from-image processing.

I.10.3 OPTICAL CHARACTER RECOGNITION (OCR) /INTELLIGENT CHARACTER RECOGNITION (ICR)

175 Scanning has steadily become cheaper and more accessible with advances in the development of recognition algorithms. Optical Character Recognition (OCR) and Intelligent Character Recognition (ICR) technology have become the foundation of image and forms processing around the world. OCR technology recognizes machine-printed characters on a form, while ICR technology recognizes handwritten characters. OCR technology’s ability to read machine-printed characters is now almost perfect, as accuracy thresholds are usually between 99 and 100 percent. The key difference is that OCR is more accurate than ICR because of the large variation that occurs in handwriting. Nevertheless, ICR is a great advancement in character recognition, as virtually no limit exists on the types of data that can be collected and converted. NSOs must take care with recognition, editing and data confrontation to avoid problems. In this section, we will look at OCR and ICR in detail for use in data processing.

Figure: Examples of OCR and ICR; example of a form with OCR and ICR fields, with the actual recognition results (Source: Statistics South Africa)

Image snippet type    Actual value    Recognised data
ICR                   KGOTSONG        K?Ot5ON?
OCR                   099011          099011
OCR and ICR           3022009         302
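Most recognition engines report a confidence score for each recognised character, and NSOs typically route low-confidence characters to a key-from-image repair queue rather than accept them blindly. The sketch below illustrates that routing logic only; the 0.95 threshold, the function names and the way confidences are passed in are assumptions for the example, not the interface of any particular product.

```python
CONFIDENCE_THRESHOLD = 0.95   # illustrative cut-off; tuned during pilot testing

def route_field(field_name, recognised, confidences):
    """Accept a recognised field or send it to key-from-image repair.

    recognised  -- string returned by the OCR/ICR engine
    confidences -- one score per character, between 0.0 and 1.0
    """
    if recognised and all(c >= CONFIDENCE_THRESHOLD for c in confidences):
        return ("accept", recognised)
    # at least one doubtful character: a keyer sees the image snippet and retypes it
    return ("key_from_image", field_name)

# Example with assumed engine output for the ICR snippet shown above
print(route_field("place_name", "K?Ot5ON?",
                  [0.99, 0.40, 0.98, 0.97, 0.55, 0.98, 0.97, 0.42]))
print(route_field("ea_code", "099011", [0.99] * 6))
```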

I.10.3.1 Advantages

176 OCR and ICR show both advantages and disadvantages when compared to key from paper or key from image. Among the advantages are:

1. Recognition engines used with imaging can capture highly specialized data sets. Since enumerators write their numbers differently, depending on what school system taught them to print and how their writing changed over time, scanning equipment must be able to accept various ways of writing the numbers. Similarly, if specialized characters like hyphens, apostrophes, quotes, and other characters appear, the scanner needs to be able to interpret these as well. Most machines can accept these specialized sets and convert them into numerical codes.

2. OCR/ICR recognizes machine-printed or hand-printed characters. The machines must be able to read both machine-printed and hand-printed coding, and, in recent years, most scanning systems can do this with a large degree of accuracy. Hence, this readability is an advantage over keyed data.

3. Scanning and recognition allow efficient management and planning for the rest of the processing workload. The scanning should be set up to be efficient, and hence to obtain captured data quickly enough that the data set can move on to the rest of the processing. Most current scanning systems do provide data sets reasonably quickly.

4. Quick retrieval for editing and reprocessing. Most scanning systems do provide for quick retrieval for editing and continued processing.

I.10.3.2 Disadvantages

177 The disadvantages of Optical Character Recognition and Intelligent Character Recognition also fall into a number of categories. These include:

1. Technology is costly. Although the costs of scanning equipment have decreased considerably during the first decade of the 2000s, initial outlays, especially, continue to be very high. Maintenance costs are higher than with personal computers. Continued maintenance during and after the processing can be expensive.

2. May require significant manual intervention. When things go wrong (and with complicated machinery things inevitably will go wrong), someone will have to put the machine back in working order. If the machines are good and tested, and the climate is mild, then less maintenance will be required. But NSOs must build downtime for the systems into the schedule.

3. Additional workload for enumerators. ICR has severe limitations when it comes to human handwriting. Enumerators will need to spend additional time and effort making sure that the machines (or coders) can read their handwriting. Clerks and supervisors will need some time to make sure no stray marks appear on the forms and that the computer will easily be able to tell the difference between 1s and 7s, and between 2s and 3s. Otherwise, the edit could be much longer and more complicated.

4. Characters must be hand-printed or machine-printed, with separate characters in boxes. When characters are not in the boxes, they could be misread or not read at all. Staff will need to take extra care to make sure all entries are legible.

5. Ineffective when dealing with cursive characters. When enumerators do not print the responses, the machine may not be able to read the recorded data. Cursive writing will be particularly difficult to read.

I.10.3.3 Discussion

OCR and ICR have many advantages over keying (because they are much faster) and over OMR (because the number of response categories can be much greater). The biggest disadvantages are the need for additional recognition software and the added development costs. The issues include:

1. Issues corresponding to those of OMR, as discussed in the section below.

2. Algorithm development can be straightforward or problematic depending on the complexity of the form. Because the data must be converted, it is important that the methods of complete capture (and conversion into ASCII or another usable form) are developed early in the census so they do not hold up the processing.

3. Processing time must be considered, since it is related to the recognition engine. The more complicated the recognition, and the more checks, the longer it will take to obtain the results. But added checks often increase accuracy, so the balance between speed and accuracy must be optimized.

4. Development costs can be considerable, and the costs, both financial and in time, can mount quickly if care is not taken and if the work is not started early enough in the process.

I.10.4 OPTICAL MARK RECOGNITION (OMR)

178 Optical Mark Recognition (OMR) technology has a long history: Gustav Tauschek obtained a patent on Optical Character Recognition (OCR) in Germany in 1929. OMR technology allows an input device (the imaging scanner) to read and convert hand-drawn marks, such as small filled circles or squares on specially designed paper, into a form that machines can process. OMR captures data by using contrasting reflectivity at predetermined positions on a page. That is, it reads filled-in elements and ignores everything else.

179 The OMR software converts the information in the marks into numbers or letters and then transfers them into the computer. Two methods of applying OMR technology in data processing are (1) form-based OMR and (2) image-based OMR.

(1) Form-based OMR works with a specialized document that contains timing tracks (black boxes along the top or bottom edge of the form) that tell the scanner where to look for marks.

(2) Image-based OMR runs the scanned image through processing or interpreting engines to determine electronically which marks were made on the form.


180 In effect, form-based OMR does the ‘reading’ of data at scan time, while image-based OMR can create the data during any subsequent process. Because of this difference, form-based OMR does not allow fields to be added for interpretation after scanning, while image-based OMR allows added fields when required. With form-based OMR, images can still be saved during the scanning process, but any further verification or exceptions management would require a KFI process.
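As a minimal sketch of the image-based approach, the fragment below judges a tick box to be marked when the mean darkness of its predefined zone exceeds a threshold; the zone coordinates, the 0.35 threshold and the plain pixel grid are simplifying assumptions for illustration.

```python
def read_omr_item(pixels, zones, threshold=0.35):
    """Return the codes of the answer boxes judged to be filled.

    pixels    -- 2-D list of grey values, 0.0 (white) to 1.0 (black)
    zones     -- {code: (row0, row1, col0, col1)} rectangles for each tick box
    threshold -- mean darkness above which a box counts as marked (illustrative)
    """
    marked = []
    for code, (r0, r1, c0, c1) in zones.items():
        cells = [pixels[r][c] for r in range(r0, r1) for c in range(c0, c1)]
        if sum(cells) / len(cells) > threshold:
            marked.append(code)
    return marked

# Tiny example: a 4x8 "image" with only the second box filled in
img = [[0.0] * 8 for _ in range(4)]
for r in range(1, 3):
    for c in range(4, 7):
        img[r][c] = 0.9
sex_zones = {1: (1, 3, 0, 3), 2: (1, 3, 4, 7)}   # 1 = male box, 2 = female box
print(read_omr_item(img, sex_zones))              # [2]
```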

Figure: What OMR does. It reads mark information, and the marks have to be precise. (Source: UNSD presentation)

I.10.4.1 Advantages

181 Optical Mark Recognition (OMR) has both advantages and disadvantages. Among the advantages:

1. Recognition engine unneeded. Form-based OMR is a data collection technology that does not require a recognition engine. It is therefore fast, uses minimal processing power to process forms, and its costs are predictable and well defined.

2. Capture speed. OMR capture speeds are around 4,000 forms per hour, so a large volume of forms can be processed within a short period of time.

I.10.4.2 Disadvantages

182 The disadvantages of OMR include:

1. OMR cannot recognize hand-printed or machine-printed characters. Staff must recode all hand-printed and machine-printed characters in order for the OMR machine to read them. This is an extra operation that impedes the speed of the processing.

2. OMR scanning does not capture images of the forms, so electronic retrieval is not possible. Since OMR does not make images but only captures filled entries, it does not capture the actual written responses, and so verification against the form is impossible. Another operation (that is, another type of scanning) would have to capture these responses, and NSOs rarely do that, so valuable information is lost to the statistical organization and to the public.

3. Tick boxes may not be suitable for all types of questions. Some items, like place of birth, previous residence, industry, and occupation, are not compatible with OMR. In many countries, these items end up with very few general categories, or the processing drops them altogether. Many countries need this information in order to get a complete inventory of the industries and occupations of the population, but are unable to obtain it from this form of data capture.

I.10.4.3 Discussion

183 The great power of Optical Mark Recognition (OMR) is its ability to deliver census results more quickly than any other method. However, because of restrictions on the number of items, the number of response categories and the placement of the items, the system has considerable drawbacks. Among the important points to consider before implementing:

1. The entire process must be tested: (a) information capture; (b) recognition; and (c) verification of results. Pretesting with OMR can be even more crucial than with the other systems because of the relatively rigid nature of the questionnaire.

2. Questionnaire design and preparation is a critical aspect. The design will limit the number and kinds of questions that can be asked.

3. Forms must be easily scanned and in good condition at scan time; otherwise transcription will be required. In some areas of the world, keeping the forms pristine will be very difficult. Countries in tropical and sub-tropical climates are particularly susceptible to these problems.

4. Enumerators must take particular care in filling out questionnaires. Since the OMR machines read only what is in the circles or other geometric shapes, extreme care is needed to make sure the information is recorded accurately, and within the shapes on the questionnaire.

5. Completeness and consistency checks must be in place. More checks are needed than with the other systems because forms are more likely to come in crooked (not aligned properly), or to have marks that are not dark enough or that fall outside the box.

6. Care must be taken for the condition of the Questionnaire (dust, humidity, transportation, etc). As with the enumeration itself, all later activities must take into account the possible frailty of the questionnaires.

184 OMR scanning can be an extremely powerful tool for processing large surveys and censuses. The forms, however, must be carefully controlled and managed. To achieve high accuracy, well-structured design and good-quality printing of the forms are critical. Printing can be extremely costly and geographically limited, as fewer service providers work with OMR scanning. Although OMR data are relatively accurate, it is important to do detailed testing and constant review of the resulting data to ensure that the scanner is reading the correct fields. NSOs can do this work using various methods, such as an independent comparison of machine-read values versus KFI-based values from the same images. Images can easily assist in correcting exceptions.

I.11 SCANNING VS KEYING

185 Many countries are using scanning equipment, either optical mark reading (OMR) or optical character recognition (OCR). Each of these has advantages over keying when the operation is smooth and efficient and when the costs are not great. However, some countries, especially small or isolated countries, may not be able to afford the initial start-up costs or the continuing maintenance costs during and after enumeration. On the positive side, many countries use the scanners obtained for the census for a number of purposes, including other surveys and such administrative records as entry and exit forms. However, unlike keying where the skills transfer easily to other applications, the basic skills involved in feeding documents into a scanner transfer only if the same or similar machines are used.

186 One of the advantages of keying is that the skills learned during keying transfer to other activities in the national census/statistical offices and other government agencies. After the census develops expert data entry operators, these data entry operators then key data for various follow-up surveys. These surveys could include post-enumeration surveys (PES) and other surveys such as fertility or household income and expenditure surveys. Staff can also key administrative records, such as vital records and those for trade, immigration and emigration, and customs.


187 Therefore, when a country is deciding whether to use scanning equipment for its census, it should also decide whether the country will continue to use the machines. Multi-purpose machines continue to be useful long after the census. However, national census/statistical offices that key their data will find that the skills learned transfer and the machines continue to be useful either in the national statistical office, regional offices, or elsewhere in the Government. The continued use of the equipment should be factored in when making a decision about keying or scanning. Also, countries should consider out-sourcing the scanning which can be cheaper and more efficient than buying scanners and trying to do all of the work internally.

I.11.1. ENTERING THE DATA

I.11.1.1. Scanning

188 Traditionally, countries using optical or other scanning devices to capture their data would not normally correct their data as they are captured, although changes may depend on the skip patterns built into the system. Many of the newer scanners, however, can be programmed – like the PDAs described above or internet online entering – to change or correct individual items or, through inter-record or intra-record checking, pairs or more items. The fear is that disparate cases may not be covered and good data might be changed to bad or data might be changed without improving the quality.

189 Countries choosing to key their data, however, have several choices, depending on how quickly they need the data keyed and how much manual checking is needed. Each of the options depends on the requirements of the data capture teams, the skills of the data entry operators and the sophistication of the editing program. In large census operations, the biggest problem is getting the data keyed at all, so the method producing the fastest results should be used (although the fastest method is often not known even after the fact). The Principles and Recommendations note that countries achieve a quicker turnaround with sufficient machines, good training and expert data entry operators. Each of these requires adequate funding, however, which is not always available.

190 The quantity and type of data entry equipment required will depend on the method of data capture selected, the time available for this phase of the census, the size of the country, the degree of decentralization of the data capture operations and other factors. For keyboard data entry, the average input rates usually vary between 5,000 and 10,000 keystrokes per hour. Some operators stay well below that range, while others surpass it significantly. Among the factors that affect operator speed are (a) the supporting software and program; (b) the complexity of the operators’ tasks; (c) the ergonomic characteristics, reliability and speed of the equipment; (d) the question whether work is always available; (e) the training and aptitude of the recruited staff; and (f) the motivation of the workers.
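These keystroke rates translate directly into staffing estimates. The arithmetic below is illustrative only; the questionnaire volume, keystrokes per form, working hours and calendar time are assumed values that each office would replace with its own figures.

```python
# Illustrative staffing estimate for a keying operation (all inputs are assumptions)
households       = 2_500_000     # questionnaires to key
keystrokes_each  = 400           # average keystrokes per questionnaire
rate_per_hour    = 6_000         # mid-range operator speed from the text
hours_per_day    = 6.5           # productive hours per shift
calendar_days    = 90            # time allowed for data capture

total_keystrokes = households * keystrokes_each
operator_days    = total_keystrokes / (rate_per_hour * hours_per_day)
operators_needed = operator_days / calendar_days

print(round(operator_days), "operator-days")
print(round(operators_needed), "operators (before allowing for verification and leave)")
```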

I.11.1.2. Heads-down keying

191 Heads down keying takes two forms. The first is keying all data items as they are encountered with no skip patterns. In this case, keying proceeds more quickly since data entry operators do not have to stop when invalid or inconsistent information is encountered. It may also be more accurate since keyers perform the task more mechanically. The second form of heads-down keying entails stopping to check the questionnaires for invalid or inconsistent results, so the process will go more slowly and will require much more expertise on the part of the keying staff. The high price in terms of speed must be seriously considered. Paradoxically, accuracy may also be improved with this method if data entry operators find that the data were recorded correctly but were miscoded. Mis-keying itself may sometimes be immediately challenged because the editing package provides for automatic checking.

192 Heads-down keying without skip patterns. When all entries are keyed, or skipped manually, a particular rhythm can be maintained, and skip patterns will not discard valid but temporarily inconsistent information. For example, if a person is recorded as male, most data capture teams will require that the whole section on fertility be skipped. In this case, a data entry operator will key through (using the space bar or arrow keys to move through a male’s or young female’s record) because all fields will be blank. However, this procedure takes time, and the spacing may not be completely accurate. For example, the data entry operator might go too far or not far enough, and other items might be mis-keyed because they are improperly aligned. Keying all fields in this way means the information can be captured even when no skip patterns are present.

193 For example, when the data entry operator encounters an adult female with fertility (a female for whom such items as children ever born, children surviving or children born in the last year have been collected and coded), all items are keyed. If the fertility information is keyed, the computer editing program can determine which item or set of items is valid and which must be changed. When the edit determines that the person is an adult female, but the fertility information is blank, then dynamic imputation or other appropriate means has to be used to obtain fertility information for the tabulations. If the actual information has been lost because of the skip patterns, the data capture team must decide whether the loss is worth the increased efficiency and speed. If skip patterns are present, the data entry operators can still move backwards through the screens to the appropriate position for corrections. Although the data entry operators will waste some time spacing through items they do not key, with this form of data entry, inconsistencies between sex, age and fertility can be attacked during the edit rather than at the time of keying.
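As a rough sketch of the kind of sex-age-fertility rule such an edit applies (the minimum fertility age of 12 and the field names are assumptions for the example):

```python
MIN_FERTILITY_AGE = 12   # assumed lower limit; each country sets its own

def edit_fertility(person):
    """Flag or blank fertility items according to sex and age.

    person is a dict with keys 'sex' (1 male, 2 female), 'age',
    and 'children_ever_born' (None when not reported).
    Returns the action the edit would take, for illustration only.
    """
    eligible = person["sex"] == 2 and person["age"] >= MIN_FERTILITY_AGE
    has_fertility = person["children_ever_born"] is not None

    if not eligible and has_fertility:
        person["children_ever_born"] = None   # keyed through in error
        return "blanked"
    if eligible and not has_fertility:
        return "impute"                       # dynamic imputation needed
    return "consistent"

print(edit_fertility({"sex": 1, "age": 40, "children_ever_born": 3}))     # blanked
print(edit_fertility({"sex": 2, "age": 30, "children_ever_born": None}))  # impute
```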

194 Heads-down keying with skip patterns. A second method of heads-down keying involves keying with skip patterns in place. Again, if the data capture team requires skip patterns, usually to represent the way the enumerators collected the data, keying is easier and faster if the skip patterns are easy to follow and if the data entry operators learn the keying patterns quickly. If the skip patterns are very complicated, data entry operators may become confused and persistently key in the wrong places. The most efficient keying with skip patterns occurs when limited patterns that cover large parts of the record being keyed are used.

195 The data capture team will need to determine the appropriate skip patterns for their country’s census or survey. For example, it makes sense to skip all of the employment items for children, that is, persons below the country’s defined age for potential employment. Often, these are half of the population items, so it is efficient to skip them for children, except in special situations such as when a child’s age is borderline, or when the country is interested in child labour. The data capture team decides on an item-by-item basis which items will be included for which age groups. Staff can group the items to manage the skip patterns easily. It is not always easy to make clear-cut decisions about skip patterns. For example, consider the following sequence:

1. What is this person’s citizenship?
   - Born in this country (skip to item 3)
   - Naturalized
   - Not a citizen

2. What is this person’s year of entry?

3. NEXT ITEM

196 A skip pattern could skip from item 1 to item 3, that is, skip the item on year of entry, for persons born in the country. However, sometimes data entry operators violate the skip pattern, either because the enumerator or coder made a mistake, or because of miskeying. The many factors involved include the skill level of the data entry operators, the cultural circumstances, the layout of the questionnaire and the layout of the screens. The data capture team often works together to determine whether a skip pattern in a case such as this is reasonable.

I.11.1.3. Interactive keying

197 NSOs may use interactive keying during census input, but it is more appropriate for surveys, particularly for small surveys where allocated items could affect the results of the survey. Interactive keying may involve manual or automatic corrections, depending upon the information available to make changes or corrections.

198 Consider the case of a small survey. For small surveys, every response is important. If a country takes a 1 per cent sample survey, for example, each response represents 100 persons, housing units, or agricultural holdings. A few invalid or inconsistent cases could have a considerable impact on the results of the survey. In these cases, the demographers and other social scientists usually want to have considerable control over the processing.

199 Offices may establish control in several ways. The demographers and other specialists may key the data themselves, checking for extraneous, invalid or inconsistent responses as they go along, using the information as recorded on the data collection forms. They can often resolve conflicts, miscodes, or other inconsistencies immediately, while looking directly at the collected information. Sometimes they may opt to send incomplete or invalid questionnaires back to the field. This type of interactive keying gives the best results since the demographer also serves as data entry operator. However, it is by far the most expensive. Not many countries can afford this method.

200 The data capture teams may develop very detailed edit rules to determine what data entry operators must do when particular cases occur during keying. For each unresolved invalid code, they can decide what the data entry operator will key. The data capture team may resolve cases not covered by the detailed rules and may modify the rules (although at the risk of having inconsistencies between the first part and later parts of the keying).

201 Skip patterns, which play an important role in heads-down keying, are important here too. As with heads-down keying, data entry operators must be aware of, and learn, any skip patterns in use. As mentioned above, skip patterns can increase the speed of keying, but usually with some loss of quality. For interactive keying, a common rule of thumb is that the fewer the skips, the better the quality.

202 Testing the keying instructions. After developing keying instructions, actual data entry operators must test them before the office decides on the final operation, with or without heads-down keying. As the keying instructions are tested, operators can work the bugs out of the system so that optimum keying is obtained.

I.11.2 VERIFICATION

203 The national census/statistical office must also decide what level of verification is appropriate. For keyed data, many experts advise 100 percent verification. In this case, all items are rekeyed (or keyed over the existing information) to make certain that the data collected are the data that go into the machine for computer processing. Often, however, total verification is not practical, because the country does not have the time or the financial or human resources to rekey all of the data. The percentage of the sample verified should be larger for beginning keyers and smaller for more experienced keyers. In addition, if the tested error rate for the keying is very low, with the data entry operators making very few errors, then complete verification is probably not necessary.

204 All verification operations need to determine what information is required in order to optimize the process. Does the country want to track individual operators? Teams of keyers? How does it determine whether skills are being acquired, or maintained? The units of control could also be important, including daily, weekly or monthly reporting, to determine the flow of work and the skills gained.

205 Finally, it is very important that verification be independent: a different set of keyers should do the verification from those who did the data entry, or at least different parts of the same team. Different sets of keyers allow for independence in the operations and, therefore, better results.

206 For scanned data, some countries also perform verification to make sure that the scanning was comprehensive and complete. Scanning is still new enough that NSOs must thoroughly test the systems with pilot or pretest data. Even so, changes in paper quality, the actual printing of forms in various places, storage conditions and similar factors can cause problems that require verification.

207 If errors are systematic, edit programs can remove them, so keyers and verifiers should not be making judgments about correction. However, the keyers and verifiers are responsible for finding the errors. These errors could include inadequate testing of the scanning equipment causing systematic errors for certain items or combinations of items, confusion in reading certain digits (for example, interchanging 2s and 3s, or 8s and 9s), misread continuation check-off boxes, and so forth.

208 Misreading of continuation check-offs has been a continuing problem in recent years. Edits must address this problem. If the original and continuation forms are not contiguous, the edit must use other procedures during the structure edits to resolve the issues. As noted earlier, the NSO must have a completely sound, structured file before content editing begins. Techniques for verification are either dependent or independent.

209 Dependent verification. In dependent verification, data entry operators key over data previously keyed by other staff. When the key strokes differ, the software package informs the data entry operator. Then, depending on the program, the data entry operator either overrides the previous data or a note is made of the discrepancy. Since the keyers input the data from the original questionnaires, usually the data entry operators themselves can make informed decisions about whether the original keying was in error.

210 Independent verification. In independent verification, data entry operators rekey the data from scratch; they create a completely independent file of the keyed data, using the original questionnaires. An edit program then compares the two resulting files – the original keyed data set and the verification data set – to test for discrepancies. Presumably, some manual operation then rectifies invalid and inconsistent key strokes.
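A minimal sketch of that comparison step is shown below; the record layout (one dictionary of fields per form identifier) is an assumption made to keep the example short.

```python
def compare_keyings(original, verification):
    """List field-level discrepancies between two independently keyed files.

    original, verification -- {form_id: {field: value}} built from the two keyings
    """
    discrepancies = []
    for form_id, first in original.items():
        second = verification.get(form_id, {})
        for field, value in first.items():
            if second.get(field) != value:
                discrepancies.append((form_id, field, value, second.get(field)))
    return discrepancies

file_a = {"00017": {"sex": "1", "age": "34"}, "00018": {"sex": "2", "age": "29"}}
file_b = {"00017": {"sex": "1", "age": "43"}, "00018": {"sex": "2", "age": "29"}}
for form_id, field, a, b in compare_keyings(file_a, file_b):
    print(form_id, field, "first keying:", a, "second keying:", b)
```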

I.11.3 EDITING CONSIDERATIONS WITH SCANNED DATA

211 More and more countries now scan their data. In the early 2000s, many of these countries were surprised to find that scanning introduces different types of errors than keying. Part of the problem with editing scanned data involves lack of quality control during the scanning process. Because the technology was so new in the early part of the 2000s, many statistical offices did not have the background or facilities to develop appropriate quality control for all items, and many of those that did develop quality control procedures did not develop them for all items. So, some of the items at the end of the questionnaire, particularly the fertility items, ended up being invalid or inconsistent.

212 Of course, many of the inconsistencies found in the keyed data also occur in scanned data. This handbook primarily addresses problems in keyed data since as of this writing most surveys and many censuses continue to key their data. However, it is useful to take some space to discuss the special problems evolving from the use of scanned data.

213 Because scannable questionnaires require markers to assist the machine in reading them, questionnaires often display items in ways that cause problems for enumerators and respondents during data collection. NSOs must address these items systematically. Programmers should use the regular edits described in the text for items closely related to other items, like religion to ethnicity. However, NSOs must take particular care when items needed for planning and policy have problems. For example:

1. Sex. The item for sex usually does not have problems because it has only two valid values. But, as noted above, while keying usually restricts the keyer to entering only a 1 or a 2 (or a code for ‘unknown’), with scanning any value can appear in the columns for sex: other digits, alpha characters or stray marks. So, an edit needs to be added to what was once done for keyed data to account for these miscellaneous values.


2. Relationship. Relationship codes are a good illustration of the problem. A single-digit relationship code, as shown in the text, should not cause any problems. But if two digits are used, then sometimes a problem will occur in scanning when the first digit is either coded incorrectly or picked up incorrectly by the scanner. Normally, if codes 1 to 12 are used, the keyer will be restricted to keying these codes only, and the entry package will “complain” when an illegal code is entered. Scanning accepts almost anything (although newer scanning packages can “complain” as well). Erroneous codes must then be changed during the edit, or they will cause all sorts of problems during the tabulation stage.

3. Age. Age sometimes is an issue, particularly when three columns are used (to allow for people aged 100 or more), so a digit-by-digit analysis, looking separately at the ones, the tens, and the hundreds digits, may be needed to do a proper edit. Once the operations properly capture and set the age, the regular edit applies.

4. Age and Date of Birth. However, when both age and date of birth are present, conflicting information can cause problems when one item takes precedence over the other. Usually, subject-matter specialists prefer to use date of birth together with the census or survey reference date to produce an exact age (by subtraction) to compare with the reported age (see the sketch after this list). When one or several digits are missing, programmers must take care to use all remaining digits properly; they will then obtain the best estimate of the computed age for comparison. When the scan does not pick up a single digit, for example, the edit can consider this situation to provide a best guess of what would have been there. This type of problem does not usually occur with keying.

5. Fertility. The items with the largest problems resulting from scanning in the early 2000s involved fertility: both the numbers of children born and surviving and the children born in the last year or over time. For most countries, the main problem has been lack of quality control during scanning, resulting in strange values in the captured data. When a country’s file shows women with 17, 18, or 19 dead female children, for example, the data are useless for planning unless they are edited.

6. Mortality. Mortality information can also present problems in scanned data. For keyed data, if there is a series of items for deaths in the year before the census (sex and age of the deceased, whether the person died a natural death, and whether it was a maternal death), the keying proceeds even over erasures and strikeouts. However, scanners do not normally read erasures; the scanner will leave the information blank before continuing with the capture. The edit program must then move the information into the appropriate spaces for tabulation and subsequent analysis. Newer scanning operations can make these moves during or just after capture.
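The age versus date-of-birth comparison mentioned in item 4 above can be sketched as follows, assuming a hypothetical census reference date and complete dates of birth; handling of partially missing digits is left out for brevity.

```python
from datetime import date

REFERENCE_DATE = date(2020, 3, 15)    # assumed census night for the example

def computed_age(year, month, day):
    """Exact age at the reference date, derived from a complete date of birth."""
    had_birthday = (REFERENCE_DATE.month, REFERENCE_DATE.day) >= (month, day)
    return REFERENCE_DATE.year - year - (0 if had_birthday else 1)

def reconcile_age(reported_age, year, month, day):
    """Prefer the age derived from date of birth when the two disagree."""
    derived = computed_age(year, month, day)
    return derived if derived != reported_age else reported_age

print(reconcile_age(34, 1985, 7, 2))    # date of birth also gives 34 -> keep 34
print(reconcile_age(30, 1985, 7, 2))    # disagreement -> derived age 34 wins
```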

I.11.4 CONCLUSION

214 Unfortunately, each country’s problems depend on the particular programming and functioning of the individual scanners, so general guidelines are difficult to give. However, in all cases so far, the scanning problems have been systematic. That is, once staff determine the algorithm to alleviate the problems, fully edited data sets result.

I.12. RELATIONSHIP OF QUESTIONNAIRE FORMAT TO KEYING

215 The two most common questionnaire formats for population items in a census or survey are person pages and household pages.

216 Person pages contain one page or two facing pages of population information, with separate pages for each person. This method is useful because all of the information for one person appears on one page, making it easy to collect. Also, this format makes it easy to check for internal consistency during enumeration. Person pages may appear combined in a bound booklet for ease of handling in the field as in figure A.II.1.


Figure A.II.1. Sample questionnaire form with person pages

Person page for person X              Person page for person X+1

Item 1    Item 10                     Item 1    Item 10
Item 2    Item 11                     Item 2    Item 11
. . .     . . .                       . . .     . . .

217 Coding and keying for items on person pages is basically a mechanical operation, in which the coder/data entry operator is not expected to evaluate the validity of the information supplied but rather assign its appropriate code or keystroke. Figure A.II.2. illustrates the flow of information for a given person recorded on a single page. It is easier to enter data on a single page for that person than to key by turning pages. Programmers implement validity checks later during the computer edits.

Figure A.II.2. Example of flow within a questionnaire with person pages

Person page

Item 1    Item 11
Item 2    Item 12
Item 3    Item 13
etc.      etc.

218 Household pages have all information for a household on one page, if possible, or on a series of pages with all household members listed on each page. Listing the household members in this way is useful because the questionnaire saves space by not printing the items for each person. In addition, the enumerator can compare entries between household members as the data are collected.

Figure A.II.3. Sample questionnaire, household page with all persons on same page

Household page

PN    Item 1    Item 2    Item 3    Item 4    Etc.
1
2
3
4
5
.
.
.

219 A third method is to have separate forms for each person, with the enumerator then assembling a loose booklet during or after enumeration. This method is efficient since the enumerator collects only the exact number of forms (pages) necessary for the household. The disadvantage is that the forms may separate during transfer or other handling, creating many potential editing and coverage problems if the census office is unable to reassemble them for the correct household.

220 The physical size of questionnaire pages is also a consideration, not only for enumeration, but also for keying. During coding and keying, the document must lie flat on the surface of the worktable, and coders or data entry operators must be able to locate and handle items on the form easily.

221 When all information is on a single page, staff can easily key the household pages as well, and it will obviously be faster since the data entry operator does not have to turn pages. Figure A.II.4. illustrates the flow of information on a household page.

Figure A.II.4. Example of flow for a questionnaire with household pages, with multiple persons per page

Household page

PN    Item 1    Item 2    Item 3    etc.
1
2
3
4
5

222 Problems can occur in keying population or housing information that extends over more than one page. To solve the problems, the national statistical office is likely to take either of the two approaches outlined below:

1. Data may be entered one person at a time. The data entry operator may key the line of information for a person on the first page of the series of pages, and then turn to the second and subsequent pages. At the end of the first person’s pages, the data entry operator then turns back to the first of the household pages for that household, and keys the second person, third person and so forth. This type of keying works as long as the data entry operator can remain on the proper line throughout the keying. Although programmers can create computer-editing programs to disentangle information for erroneously keyed person items on another person’s line, the program itself is very difficult to prepare.

2. Data may be entered one page at a time. The data entry operator may key a whole page of information before moving on to the next page. Here, the data entry operator keys all information on the first page regardless of the number of persons. Then, the data entry operator turns the page and keys the next part of the information for all persons. Skip patterns may or may not be included here, depending on the type of keying (with or without computer editing). In any case, during the computer edit, programmers may have to assemble the records from the various sets of keyed data. The staff will have to deal with any miskeys of person numbers then.
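A rough sketch of the reassembly step for the second approach, assuming each page's keyed output carries the household identifier, the person number and whatever items were keyed from that page (the field names are illustrative):

```python
def assemble_persons(page_records):
    """Merge page-by-page keyed data back into one record per person.

    page_records -- iterable of dicts, each holding 'hh_id', 'person_no'
                    and whatever items were keyed from that page.
    """
    persons = {}
    for rec in page_records:
        key = (rec["hh_id"], rec["person_no"])
        person = persons.setdefault(
            key, {"hh_id": rec["hh_id"], "person_no": rec["person_no"]}
        )
        for field, value in rec.items():
            if field not in ("hh_id", "person_no"):
                person[field] = value
    return list(persons.values())

page1 = {"hh_id": "A01", "person_no": 1, "relation": 1, "sex": 1, "age": 40}
page2 = {"hh_id": "A01", "person_no": 1, "language": 1, "labour_force": "yes"}
print(assemble_persons([page1, page2]))
```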


223 In the following example (figure A.II.5), the household’s demographic information poses no unusual keying problems since the census obtained a response for all items for all persons.

Figure A.II.5. Example of a household page with multiple persons, without keying problems

Household page

PN    Relation              Sex    Age    Etc.
1     Head of household     M      40
2     Spouse                F      35
3     Child                 F      18
4     Child                 M      12
5     Sibling               M      35
6     Sibling of spouse     F      30
7     Sibling child         M      5
8     Sibling child         F      3
etc.

224 However, a second page for the same household (figure A.II.6) could present some keying problems. For example, if the country chooses to collect language use only for persons 5 years of age and over, that information for the eighth person, a 3-year-old, will be blank. The data entry operator should leave the cell blank for this child, or else the computer edit will have to attempt to correct it later.

225 Similarly, other items should be blank, such as the labour force items for persons under the minimum age for labour force participation, and the fertility items for females under the minimum age for fertility and for all males. In figure A.II.6, the data entry operator might incorrectly key person 6’s information on children ever born (in this case, 4) in person 5’s slot by mistake. The computer edit would then delete the male’s fertility items and impute fertility for the female, but it might not impute the correct value.

Figure A.II.6. Example of a household page with multiple persons, with potential keying problems

Household page 2

Person    Language      Labour force    Children ever born    etc.
1         Language 1    Yes
2         Language 1    No              3
3         Language 1    No              0
4         Language 1
5         Language 1    Yes
6         Language 1    No              4
7         Language 1
8
etc.

226 Many times a country must use a household form because of cost or space constraints. However, when the population is small, or the country can afford the additional expense, the form with person pages is likely to contain fewer matching errors through miskeying than occur with the household forms.

I.12.2 CODING

227 This section of the manual returns to coding in preparation for keying and scanning. A more complete discussion of coding appears in this handbook’s section on computer editing. The traditional method of coding write-in entries in censuses is to leave space for handwriting and then have a coder cross-reference the code list. As noted previously, the layout of the questionnaire, particularly when scanned, will determine where enumerators can write by hand, and how much they can write and still be legible.

228 Automated vs. Manual Coding.

229 As noted elsewhere, capturing using scanning may require segmentation of the text or guiding the handwriting in order to get good character recognition. Consequently, space for writing or classification may be insufficient in some cases. NSOs must decide how detailed the coding for selected items must be in order to obtain the information needed for planning and policy formation.

230 Whether the data are scanned or keyed, countries are now turning to automated coding to capture the codes for items like place names, occupation and industry. Many automated systems are now on the market, but most of them follow the same simple rules. As the keyer types the first few letters of the place or occupation or industry, the name of the place or the type of occupation or industry pops up on the screen. The keyer then confirms the entry, and the computer assigns the appropriate code and places it in the appropriate position on the person’s record.
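A minimal sketch of the lookup logic behind such a tool is shown below; the reference list of occupation titles and codes is purely illustrative and would be replaced by the full national or ISCO-based reference file.

```python
# Illustrative occupation lookup table; a real system would load the full
# national or ISCO-based reference file.
OCCUPATION_CODES = {
    "primary school teacher": "2341",
    "secondary school teacher": "2330",
    "subsistence crop farmer": "6310",
    "street food vendor": "5212",
}

def suggest_codes(typed_prefix, limit=5):
    """Return (title, code) pairs whose title starts with what the coder typed."""
    prefix = typed_prefix.strip().lower()
    matches = [(title, code) for title, code in OCCUPATION_CODES.items()
               if title.startswith(prefix)]
    return sorted(matches)[:limit]

print(suggest_codes("prim"))   # [('primary school teacher', '2341')]
print(suggest_codes("s"))      # the coder confirms one suggestion; its code is stored
```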

231 The current trend is to build automated coding tools. South Africa, for example, uses a knowledge reference database, built from previously captured surveys and censuses, as a lookup database to allow automated coding. The drawback is that the data set automatically carries forward any erroneous input from the past.

232 Coding considerations. When developing a coding scheme, census and survey staff must consider the return on each investment of time, energy and funds. The more complicated and comprehensive the set of codes, the longer the coding will take. Clearly, when the required tabulations need detail, like types of teachers, for example, the additional time needed to describe the occupation exactly is worth the extra effort. A more complete inventory of occupations results, and the government can use this additional information for both current and future planning.

233 Coding considerations are relatively minor for small countries or small surveys, since the amount of processing is much less than for a census. But when many variables are included and many of them need coding, or when the census covers a very large population, care must be taken to ensure that the coding does not take excessive amounts of time. Since each census or survey is different, the balance must be determined in each case: items need enough coding to provide the detail required, but not so much that processing holds up the data and hence delays planning and policy formation.

234 Software packages. Some software packages can easily accept and work with alphanumeric data, but others cannot, so the amount of coding will also depend on what the software can handle. In the current census round, most countries will use software that accepts alphanumerics. One problem that may occur with alphanumerics during keying is the need to reposition the hands between numeric and alpha data and back again; this constant repositioning will inevitably lead to mistakes. (On the other hand, the repositioning may decrease the likelihood of medical problems like carpal tunnel syndrome.)

235 A second significant problem is that most packages have difficulties categorizing and performing calculations (sums, percentages, medians, etc.) when non-numeric data are included. Sometimes, particularly with scanning, an entry program will automatically change alphanumeric data to all numerics. In other cases, programmers may have to write an on-entry program that changes the alpha characters to numerics before the editing can continue.

236 NSOs should avoid codes that are entirely alphabetic characters or a combination of alphabetic characters and numbers (alphanumerics). However, when this is impossible (for example, when there are exactly 11 categories and a single digit cannot hold them), a program should convert these codes to numerics as soon as possible after entry.
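As a hedged sketch of that conversion step, assuming a simple mapping table agreed with the subject-matter specialists (the letters, meanings and field name are invented for the example); the captured value is kept alongside the converted one so nothing is lost:

```python
# Illustrative mapping from alphabetic tenure codes to the numeric codes
# used downstream; the letters and meanings are assumptions for this example.
TENURE_MAP = {"O": 1, "R": 2, "F": 3, "S": 4}   # owned, rented, free, squatting

def convert_record(record, field, mapping, unknown=9):
    """Replace an alphabetic code with its numeric equivalent, keeping the original."""
    raw = str(record.get(field, "")).strip().upper()
    record[field + "_raw"] = raw                 # keep what was actually captured
    record[field] = mapping.get(raw, unknown)    # unmapped values become 'unknown'
    return record

print(convert_record({"tenure": "r"}, "tenure", TENURE_MAP))
# {'tenure': 2, 'tenure_raw': 'R'}
```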


237 When forms are scanned, alpha-numerics are not a great problem, but many computer packages, as noted, require considerable manipulation in their use and so any alpha or other non-numeric characters should be converted soon after entry (but keeping the original data stored some place) so that a smooth, efficient edit can be obtained.

238 In many cases, editing programs require that alpha characters appear between quotation marks, or in some other manner, in order to process them. While the quotation marks make it clear that these are, indeed, alphanumeric data, the entries should still be converted for later ease of use.

I.12.3 EDITING

239 Scanned data do not suffer as much from additional columns of information. That is, the scanning package will transfer the number of columns captured automatically: the enumerator writes down a value of 0 to 9 for a digit, and the program moves that number to the output data automatically; nothing is keyed, as in the case of key-from-paper or key-from-image. Keying requires the extra step of having a keyer sitting at a machine and physically doing the keying.

240 But the scanner has a disadvantage. With codes 1 through 9, the scanner may pick up the appropriate number, or it may pick up an alpha character, a blank or a stray mark and convert it to some readable character. This type of problem is not yet completely resolved in many current scanning software applications, and it is unlikely that software will cure it any time soon. So, while scanning is many times faster, enumerators must collect and record the information very carefully to make sure that they do not introduce additional errors.

241 In most cases, edits readily handle these issues, as described later. Traditionally, content editing has covered both invalid entries and inconsistent entries (for an individual or housing unit, and between individuals in a unit). So, when we find a woman aged 70 with a 3-year-old child, that is, inconsistent information across individuals in a unit, the edit will be about the same whether that inconsistency came from the respondent, the enumerator, the scanner or the keyer.

242 When an item uses two columns, for example relationship, scanning can introduce errors that would not be present if the variable used a single column. When two columns are used for an item, say codes 1 to 10, a whole new realm of errors is introduced, since any number between 11 and 99 could appear, as could various other characters (/, !, %, etc.). Instead of the legal values, you now have values coming in that could range anywhere from 0 to 99, as well as the aforementioned alpha characters, blanks and stray marks.

243 In most cases, the subject specialists provide the edit specifications for the item, but these values automatically increase the time and complexity of the edit, and could decrease the quality of the final data set.

244 When the editors receive a value of 13 for relationship, they must start making strategic decisions about what to do with this value. Was it meant to be 3, and the 1 is erroneous? Was it meant to be 10, and the 3 is wrong? Of course, real editing does not reduce us to using only this information in making a decision: we will be able to see whether the adjacent person could be a spouse or a child of this person, and use that information. We can use marital status. We can use work status (student or retired). However, if only one column is used, fewer edits will be needed because fewer problems of this sort will show up.

245 Some countries have up to 12 items of information on fertility (children by sex in the household, children elsewhere, children dead, and total children). This information should be self-coded – that is, the enumerators should be coding it as they collect the data.


246 An issue here is how many digits each of those items should have. When two columns are used, the number of boys in the house, for example, could be anywhere from 0 to 99; when only one column is used, the numbers can only range from 0 to 9. Therefore, if the statistical office actually thinks that a woman would have more than 9 male children in the house, with much frequency, then the variable needs two columns.

247 However, since it is extremely unlikely that a female would have more than 9 male children in the household, having two digits introduces a high probability of picking up stray marks or scanning misreads. The scanner could easily read a 9 for a 0, for example, and record 91 children instead of 01. Having only one digit, while losing the rare cases of more than 9 male children in the house (these would normally be coded 9), would not pick up these implausibly large numbers.

248 However, for total children in the house, total children elsewhere, total children dead and total children, two columns might be more appropriate. Hence, while we would only allow 9 male children in the house or away or dead, we would allow the sum of the male children to be more than 9, but not more than 99. The computer editing specifications should determine what the maximum number would be (12 or 15 or 19), but, again, within a logical maximum.

249 These decisions, of course, depend on the fertility levels in the country, and so no single statement can take care of all cases. Countries with very high fertility may need more digits; in countries with low fertility, one digit may be sufficient.
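A rough sketch of the consistency check implied by the preceding paragraphs, with an assumed logical maximum of 19 total children of one sex:

```python
MAX_TOTAL_CHILDREN = 19   # assumed logical maximum; set by the editing specifications

def check_children(in_house, elsewhere, dead, total):
    """Check that the reported child counts for one sex are internally consistent."""
    problems = []
    for name, value in (("in_house", in_house), ("elsewhere", elsewhere), ("dead", dead)):
        if not 0 <= value <= 9:
            problems.append("%s out of single-digit range" % name)
    if total != in_house + elsewhere + dead:
        problems.append("components do not sum to the total")
    if total > MAX_TOTAL_CHILDREN:
        problems.append("total exceeds the logical maximum")
    return problems or ["consistent"]

print(check_children(2, 1, 0, 3))    # ['consistent']
print(check_children(9, 1, 0, 91))   # sum and maximum problems flagged
```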

250 The following set of standard codes covers the majority of relationships for most countries:

• Head of household (or householder)
• Spouse
• Child
• Adopted or step-child
• Sibling
• Parent
• Grandchild
• Other relative
• Nonrelative

251 Some countries add a “0” code for head of household and can then add a 10th category to the others; hence, the relationship item can have 10 codes. In this case, however, the household head should always come first. In countries where the head might appear further down the list, the edit might misinterpret the code ‘0’ as head when the value is actually missing. Cases with no ‘0’ or multiple ‘0’s would require editing, either at entry or during the computer editing.

252 Many countries, particularly those experiencing the HIV/AIDS epidemic, need much more detailed information than these codes can provide. Specific information on child-in-law, parent-in-law, grandparent, niece and nephew, and so forth becomes crucial in analyzing the HIV/AIDS situation in countries where the epidemic is in place. In this situation, additional codes are required for the statistical office to carry out its mission, and so two-digit codes are required.

253 Once the subject matter specialists decide to use two columns, they may choose to assign significance to each of the columns. For example, the first column might describe the generation: codes 0 to 9 for the nuclear family; codes 10 through 19 for one generation down (niece, nephew, child-in-law, and so forth); and codes 20 to 29 for one generation up (aunt, uncle, parent-in-law, etc.).

254 For the other items, the main topics include place names, occupation, industry, field of studies (for educational attainment), and cause of death. NSOs must decide when and where to do the coding. Coding done in the field offices will almost certainly be less rigorous than coding done in the central office. In addition, coding for scanning machines, particularly if the NSO rented them and has to return them, could well be different from situations where staff can expend more time and effort and present more detail.

255 Certain social and economic variables should also use this type of coding. For a two-digit ethnicity, nationality or ancestry code, for example, the major tribal or ethnic grouping would be in the first digit and the minor tribal or ethnic grouping (like a sect) would be in the second digit.


256 For occupation and industry, many coding schemes already have a hierarchy built in. Most international coding schemes (ISCO, ISIC), from the United Nations agencies, the U.S. Census Bureau, and others, already have the levels embedded in the codes, so the statistical office does not have to do any additional work. The International Labour Organization is building an index of key words to increase accuracy in coding. When countries use custom coding, however, the first digit would be for the major occupation/industry group, the second digit for the minor group, and the third digit for the specific occupation or industry.

257 A set of common codes for closely related variables can reduce coding errors and assist the data processors during the edit. Common codes also allow the data capture logic, either in the scanning or in the keying, to use one entry to determine another. For example, in many countries, place codes (birthplace, parental birthplace, previous residence and workplace), language, ethnicity/race, and citizenship are very similar. A common coding scheme for “place” might be developed as three-digit codes, with the first digit representing the continent, the second the region, and the third the specific country.

258 The structure of coding can facilitate the coding process as well as later processing during editing, tabulation and analysis. For large countries with many immigrants or ethnic groups, codes based on continent, region and country, with different codes or digits assigned to each, would be preferable to a simple listing. National census/statistical offices can also use the country numerical codes developed by international organizations such as the United Nations Statistics Division (United Nations, 1999).
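A minimal sketch of decoding such a three-digit place code appears below; the digit assignments and lookup tables are invented for illustration and are not an agreed standard.

```python
# Illustrative digit assignments for a three-digit place-of-birth code:
# first digit = continent, second = region, third = country within the region.
CONTINENTS = {1: "Africa", 2: "Europe", 3: "Asia"}
REGIONS = {1: {1: "Eastern Africa", 2: "Western Africa"},
           2: {1: "Northern Europe", 2: "Southern Europe"}}

def decode_place(code):
    """Split a three-digit place code into its hierarchical parts."""
    code = "%03d" % int(code)
    continent, region, country = int(code[0]), int(code[1]), int(code[2])
    return {
        "continent": CONTINENTS.get(continent, "unknown"),
        "region": REGIONS.get(continent, {}).get(region, "unknown"),
        "country_code": country,   # resolved against the full country lookup list
    }

print(decode_place(123))   # Africa / Western Africa / country 3 in that region
```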

259 If a group of items on a questionnaire is not independent of each other, national census/survey staff probably should not ask all of them. The editing team must decide, on a case-by-case basis, when to use other items directly for assignment, and when to use other available variables.

260 When definitions differ between censuses (or between a census and a survey) for variables such as work or ethnicity, the national census/statistical office must decide how to take these changes into account, both for currently edited data and for datasets from the prior census, in order to show trends. If the original, unedited data are available, data processors can make changes to the appropriate edits and rerun all of them.

I.13 CONCLUSIONS

261 As noted elsewhere in this handbook, while census editing and tabulation (covered in the next sections) have not changed very much in the last couple of decades, we are in the middle of a revolution in the way we capture data. Here, we have covered the traditional methods of data capture through keyed entry, both from paper and from image, and the more recent innovations of optical and intelligent character recognition. We have touched on what will clearly be a major method of data capture in the 2020 census round, the PDA, but it is not yet clear how widely those instruments will be used in Africa. The next section of the Africa Census Handbook covers editing.
