Date post: | 27-Dec-2015 |
Copyright 2010, The World Bank Group. All Rights Reserved.
Business statistics surveys
3. Data processing
Business statistics and registers
Micro data file system
• A micro data file must be set up to store the survey data
• A number of clerical operations are necessary before the questionnaire address labels are prepared
• Among the most important are the removal of obvious duplications and the updating of recently reported address changes
• Checks and last-minute updates prior to dispatch prevent irritating respondents and help reduce respondent burden and non-response
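The clerical clean-up of the mailing list can be sketched in a few lines of Python. This is only an illustration: the record layout (`unit_id`, `name`, `address`) and the matching rule (identical name and address after normalizing case and spacing) are assumptions, not part of the course material.

```python
def normalize(text):
    """Lower-case and collapse whitespace so trivially different spellings match."""
    return " ".join(text.lower().split())


def apply_address_changes(records, changes):
    """Overwrite addresses for units with a recently reported change.

    `changes` maps a (hypothetical) unit id to its new address.
    """
    return [
        {**record, "address": changes.get(record["unit_id"], record["address"])}
        for record in records
    ]


def remove_duplicates(records):
    """Keep the first record for each (name, address) pair after normalization."""
    seen = set()
    unique = []
    for record in records:
        key = (normalize(record["name"]), normalize(record["address"]))
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Illustrative mailing list: units 1 and 2 are an obvious duplication.
mailing_list = [
    {"unit_id": 1, "name": "Acme Ltd", "address": "1 Main Street"},
    {"unit_id": 2, "name": "ACME  Ltd", "address": "1 main street"},
    {"unit_id": 3, "name": "Baker & Co", "address": "5 Mill Road"},
]
updated = apply_address_changes(mailing_list, {3: "9 Harbour Lane"})
clean = remove_duplicates(updated)
```

Only obvious duplications are caught this way; less obvious ones still need clerical review.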
Unit identification
• The business community is dynamic
• Discrepancies between the envisaged reporting unit and actual reality may be expected
• It is important to establish the cause of discrepancies
• Corrections and updates of units and their attributes should take place in close co-operation with SBR staff
Data entry modes
• Basically, five types of data entry occur:
– Electronic data interchange (EDI)
– Scanning
– Optical character recognition (OCR)
– ‘Heads-up’ data entry
– ‘Heads-down’ data entry
• Special data entry software is needed
• Which of the methods applies depends on labor resources, equipment and technological know-how
Primary checks
• When completed forms return to the NSO, the first thing to do is check whether they are (almost) blank
• Unusable forms can be considered as non-response or can be scheduled for follow-up
• Data entry should not wait until the entire collection process is completed
• Follow-up actions towards respondents reporting implausible data should be undertaken as soon as possible after return of the form
• Regardless of editing procedures, the raw files as submitted by respondents must also be kept
Types of checks
• Editing is the examination of data for error detection
• Only part of the errors made by respondents can be traced
• Data editing takes place during or after data entry
• Routing checks test whether all questions which should have been answered have in fact been answered
• Data validation checks test whether answers are permissible
• Relational checks, which test the consistency between items on the same questionnaire, are a powerful editing tool
• Exhaustive editing bears the risk of over-editing
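The three kinds of checks can be illustrated with a minimal Python sketch. The questionnaire items (`has_employees`, `employees`, `turnover`, `costs`, `profit`), the edit rule profit = turnover - costs and the tolerance are all hypothetical; a real survey would define its own edit rules.

```python
def check_record(record, tolerance=1):
    """Return a list of error messages for one questionnaire record."""
    errors = []

    # Routing check: the follow-up question must be answered
    # whenever the filter question says it applies.
    if record.get("has_employees") == "yes" and record.get("employees") is None:
        errors.append("routing: employees not answered")

    # Validation checks: each answer must lie in its permissible range.
    for item in ("turnover", "costs"):
        value = record.get(item)
        if value is not None and value < 0:
            errors.append(f"validation: {item} is negative")

    # Relational check: items on the same form must be mutually consistent.
    turnover, costs, profit = (record.get(k) for k in ("turnover", "costs", "profit"))
    if None not in (turnover, costs, profit):
        if abs(turnover - costs - profit) > tolerance:
            errors.append("relational: profit != turnover - costs")

    return errors


good = {"has_employees": "yes", "employees": 4,
        "turnover": 100, "costs": 60, "profit": 40}
bad = {"has_employees": "yes", "employees": None,
       "turnover": 100, "costs": 60, "profit": 50}
```

Here `check_record(good)` returns an empty list, while `check_record(bad)` reports one routing error and one relational error. The more rules are added, the greater the risk of over-editing.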
Organization of editing
• Not all editing strategies practiced are efficient
• Five alternatives exist, some of which may be combined:
– Paper and pencil
– Iteration of data entry and error lists
– Computer-assisted data entry and editing
– Automated editing
– Selective editing
Three stages of editing
In the process of editing, three stages can be discerned:
• Deterministic and stochastic methods are used to detect errors
Selective editing
• Selective editing comes down to detection of outliers
• It can take place during data entry or when most data have been collected
• Editing during data entry (input editing) has the advantage of timeliness
• Input editing is costly
• To reduce cost one must be selective
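One simple way to be selective is to send only outlying values to an editor. The sketch below flags outliers with a robust z-score based on the median and the median absolute deviation (MAD). The choice of this particular rule, the constant 0.6745 and the threshold 3.5 are assumptions made for illustration; the slides do not prescribe a detection method.

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Return one boolean per value: True if the value looks like an outlier."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # No spread around the median: this rule cannot rank the values.
        return [False] * len(values)
    # 0.6745 rescales the MAD so that the score is comparable
    # to a standard z-score under normality.
    return [abs(0.6745 * (v - med) / mad) > threshold for v in values]


reported_turnover = [100, 102, 98, 101, 99, 1000]
flags = flag_outliers(reported_turnover)
```

Only the last value is flagged here, so only that form would be passed on for manual review.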
Macro editing
• Macro-editing or aggregate editing is a way of selective editing focusing on output
• It systematizes what every statistical agency does before publication: verify whether publication figures look plausible
• To do this one may compare totals in publication cells with the same figures at time point t-1
• Selective editing is not without risks
• Bias may occur if, for instance, only large positive deviations from the expected value are corrected and large numbers of negative deviations (zeroes) are ignored
• False stability, caused by firms that return exactly the same answers on every occasion, can also damage the validity of publication figures
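The comparison with t-1 can be sketched as follows. The cell names, the 25% threshold and the decision to flag cells without a usable t-1 figure are illustrative assumptions. Deviations are tested in both directions so that negative movements are not ignored.

```python
def suspicious_cells(current, previous, max_rel_change=0.25):
    """Return publication cells whose total moved too much relative to t-1.

    The result maps each flagged cell to its relative change, or to None
    when there is no usable t-1 figure to compare with.
    """
    flagged = {}
    for cell, now in current.items():
        before = previous.get(cell)
        if before is None or before == 0:
            flagged[cell] = None  # cannot compare: always review
            continue
        change = (now - before) / before
        if abs(change) > max_rel_change:  # positive AND negative deviations
            flagged[cell] = change
    return flagged


totals_t = {"retail": 130.0, "construction": 98.0, "transport": 40.0, "mining": 12.0}
totals_t_minus_1 = {"retail": 100.0, "construction": 100.0, "transport": 80.0}
to_review = suspicious_cells(totals_t, totals_t_minus_1)
```

In this example the retail, transport and mining cells are flagged, and the construction cell passes.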
External consistency checks
• Thus far we focused on consistency checks between items from one and the same questionnaire
• However, checks against data from other surveys may also apply
• External consistency checks are an important means to reduce problems during the integration stage
• The applicability of external checks depends on the degree of coordination among surveys
Imputation
• Two types of missing data are usually distinguished: unit non-response and item non-response
• Imputation applies to item non-response
• Unit non-response is dealt with by reweighting
• There is a third manifestation of missing data, called intentional missing data
• Three types of item non-response may be distinguished:
– In the first type the missing values are completely at random
– In the second type, missingness does not depend on the value of the variable itself, but on the values of some other variable(s)
– In the third type, missingness depends on the value of the variable that is missing, e.g. high scores are more likely to be missing than small ones
Strategies to deal with non-response
• Two general strategies apply for dealing with item non-response
• The first strategy ignores the missing values
• This method is called complete case analysis
• In the second strategy estimates for the missing data are sought
• A drawback of complete case analysis is that, by deleting all cases with one or more missing values, the sample size can become very small
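How quickly complete case analysis shrinks the sample can be shown with a small Python sketch. The item names are hypothetical, and the formula for the expected share of complete forms assumes that items are missing independently of each other with the same probability, which is a simplification.

```python
def complete_cases(records):
    """Keep only the records in which every item has been answered."""
    return [r for r in records if all(v is not None for v in r.values())]


def expected_complete_share(item_missing_rate, n_items):
    """Expected share of complete forms, ASSUMING independent missingness."""
    return (1.0 - item_missing_rate) ** n_items


forms = [
    {"turnover": 120, "costs": 80, "employees": 4},
    {"turnover": None, "costs": 30, "employees": 2},
    {"turnover": 75, "costs": 60, "employees": None},
    {"turnover": 50, "costs": 45, "employees": 1},
]
usable = complete_cases(forms)  # two of the four forms survive

# With 20 items and 5% item non-response each, only about 36% of forms are complete.
share = expected_complete_share(0.05, 20)
```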
Imputation methods
• There are several imputation methods, ranging from very simple and intuitive to complicated statistical procedures
• The most important methods are:
– Subjective treatment: impute on the basis of values which appear reasonable
– Mean/mode imputation: impute the mean or the mode of a variable
– Post-stratification: divide the sample into strata and then impute the stratum mean, mode or median
– Cold deck imputation: find reasonable estimates for the missing values in another data set
– Hot deck imputation: find a donor case in the same data set
– Regression imputation: define predictor variables and estimate the missing value from them
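Three of the methods listed above can be sketched in Python. The record layout (`size`, `turnover`) is hypothetical, and the hot deck variant shown (a random donor from the same stratum) is only one of several possible donor rules. Each stratum is assumed to contain at least one observed value.

```python
import random
from statistics import mean


def mean_impute(records, item):
    """Mean imputation: fill every gap with the overall observed mean."""
    fill = mean([r[item] for r in records if r[item] is not None])
    return [{**r, item: fill if r[item] is None else r[item]} for r in records]


def _observed_by_stratum(records, item, stratum):
    """Group the observed values of `item` by stratum."""
    groups = {}
    for r in records:
        if r[item] is not None:
            groups.setdefault(r[stratum], []).append(r[item])
    return groups


def stratum_mean_impute(records, item, stratum):
    """Post-stratification: fill each gap with the mean of its own stratum."""
    groups = _observed_by_stratum(records, item, stratum)
    return [
        {**r, item: mean(groups[r[stratum]]) if r[item] is None else r[item]}
        for r in records
    ]


def hot_deck_impute(records, item, stratum, rng=None):
    """Hot deck: copy the value of a randomly chosen donor in the same stratum."""
    rng = rng or random.Random(0)  # fixed seed so the sketch is reproducible
    groups = _observed_by_stratum(records, item, stratum)
    return [
        {**r, item: rng.choice(groups[r[stratum]]) if r[item] is None else r[item]}
        for r in records
    ]


firms = [
    {"size": "small", "turnover": 10},
    {"size": "small", "turnover": 14},
    {"size": "small", "turnover": None},
    {"size": "large", "turnover": 200},
    {"size": "large", "turnover": None},
]
by_mean = mean_impute(firms, "turnover")
by_stratum = stratum_mean_impute(firms, "turnover", "size")
by_donor = hot_deck_impute(firms, "turnover", "size")
```

In this example the overall mean gives the small firm an implausibly high turnover, whereas the stratum mean and the hot deck donor keep the imputed value within the range observed for comparable firms.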
Intentional missing data
• Intentional missing data occur when it was decided to refrain from surveying certain variables
• This is done deliberately in order to better fit in with respondents' accounting systems
• Example: for the compilation of data on “fixed capital formation” the purchase value of assets is required
• However, enterprises that lease the assets acquired will not be able to supply the purchase value
• Therefore, the questionnaire mentions “lease amounts paid”, and the NSO imputes the purchase value by means of certain keys
• Intentional missing data will become increasingly common in business statistics as a way to reduce the reporting burden
Weighting
• Samples result in information for only part of the target population
• It is common practice for statistical offices to attach weights to the elements in a sample
• Objectives of weighting are:
1. Expand the sample to the population.
2. Cope with missing observations.
3. Increase precision by utilization of auxiliary information.
4. Achieve consistency with data from other sources.
• Weighting, i.e. the attribution of weights to sampled units, can in principle take place before data collection
• Reweighting always applies after data collection
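The first objective, expanding the sample to the population, can be sketched as follows. The sketch assumes a stratified design with simple random sampling within each stratum, so that every sampled unit gets the weight N_h / n_h (population count over sample count in its stratum). The strata, counts and variable names are illustrative.

```python
def expansion_weights(population_counts, sample):
    """Weight each sampled unit by N_h / n_h for its stratum h."""
    sample_counts = {}
    for unit in sample:
        sample_counts[unit["stratum"]] = sample_counts.get(unit["stratum"], 0) + 1
    return [
        population_counts[unit["stratum"]] / sample_counts[unit["stratum"]]
        for unit in sample
    ]


def estimate_total(sample, weights, item):
    """Expand the sample to the population: sum of weight times value."""
    return sum(w * unit[item] for unit, w in zip(sample, weights))


population_counts = {"small": 100, "large": 10}  # assumed known from the frame
sample = [
    {"stratum": "small", "turnover": 10},
    {"stratum": "small", "turnover": 14},
    {"stratum": "large", "turnover": 200},
    {"stratum": "large", "turnover": 220},
]
weights = expansion_weights(population_counts, sample)   # 50, 50, 5, 5
total = estimate_total(sample, weights, "turnover")      # 50*24 + 5*420 = 3300
```

These weights depend only on the design, which is why they can in principle be fixed before data collection.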
Reweighting
• Weights can be used to expand the sample to the population
• The other objectives (coping with missing observations, increasing precision, achieving consistency with data from other sources) are attained by adjusting the inclusion weights
• The adjustment procedure is called reweighting
• This is done on the basis of auxiliary information
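A minimal sketch of reweighting, assuming a post-stratification type of adjustment: within each group, the weights of the responding units are scaled so that they add up to a population count known from an auxiliary source. The slides do not name a specific adjustment method, so this choice, like the group names and counts, is an assumption.

```python
def reweight(respondents, known_counts, group="group", weight="weight"):
    """Scale weights within each group so that they sum to the known count."""
    weighted_counts = {}
    for r in respondents:
        weighted_counts[r[group]] = weighted_counts.get(r[group], 0.0) + r[weight]
    return [
        {**r, weight: r[weight] * known_counts[r[group]] / weighted_counts[r[group]]}
        for r in respondents
    ]


# Two small firms were sampled (weight 50 each) but only one responded.
respondents = [
    {"group": "small", "weight": 50.0, "turnover": 10},
    {"group": "large", "weight": 5.0, "turnover": 200},
    {"group": "large", "weight": 5.0, "turnover": 220},
]
known_counts = {"small": 100, "large": 10}  # auxiliary information
adjusted = reweight(respondents, known_counts)
new_weights = [r["weight"] for r in adjusted]  # 100, 5, 5
```

The responding small firm now also represents the non-responding one, and the adjusted weights in each group sum to the known population count.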
Frame errors and estimation
• Frame errors complicate the estimation process
• Four categories are relevant for business surveys:
– Undercoverage (missing units)
– Overcoverage (inclusion of non-population units)
– Duplicate or multiple listings
– Incorrect auxiliary information (size, activity, misconstruction of units, etc.)
• Undercoverage is perhaps the most serious problem
Seasonal adjustment
• Many economic time series show regularly recurring fluctuations
• This is most obvious for series published with a period of less than a year
• The fluctuations involved are called seasonal fluctuations
• Major causes are calendar effects, institutional effects and weather
• Series must be corrected for these seasonal fluctuations
Seasonal adjustment methods
• Adjustment methods presuppose that a series can be divided into three components:
– the trend and cycle
– the seasonal component
– the irregular component
• Decomposition gives an estimate of the seasonal factors, the trend-cycle and the irregular component
• There are several methods in two broad classes: Census methods (the X-11 family of the U.S. Census Bureau) and ‘model-based approaches’
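The three-component decomposition is usually written in one of two forms. The notation below is an assumption made for illustration (Y_t for the observed series, TC_t for the trend-cycle, S_t for the seasonal component, I_t for the irregular component); the slides do not state which form is intended.

```latex
\begin{align*}
Y_t &= TC_t + S_t + I_t && \text{(additive model)} \\
Y_t &= TC_t \times S_t \times I_t && \text{(multiplicative model)} \\
Y_t^{\mathrm{SA}} &= Y_t - S_t \quad \text{or} \quad Y_t^{\mathrm{SA}} = Y_t / S_t && \text{(seasonally adjusted series)}
\end{align*}
```

In both forms the seasonally adjusted series is obtained by removing the estimated seasonal component, leaving the trend-cycle and the irregular component.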
Disclosure control of tabular data
• Results of surveys among enterprises are usually published in the form of tables
• Microdata sets with data from enterprises are hardly ever published
• In tabular data, situations may occur in which information on an individual respondent can be deduced from the aggregated total
• This must be prevented by statistical disclosure control (SDC)
• There are three main methods:
1. Modification of the classification scheme,
2. Suppression of the sensitive cells, and
3. Rounding of cell values
Sensitive cells in tables
• Dominance rule: if the sum of the contributions of n or fewer respondents accounts for more than k % of the total cell value, then this cell value cannot be published
• The values n and k in this formulation are parameters whose values have to be chosen
• For example, one could choose n = 3 and k = 75
• The main idea behind this dominance rule is that if a cell value is dominated by the value of one respondent, then that respondent's contribution can be estimated fairly accurately
• If there are m respondents then m-1 of them can, by pooling their information, disclose information about the value of the data of the remaining respondent
• The value n should therefore be chosen larger than the maximum size of (imagined) coalitions of respondents
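The dominance rule translates directly into a short Python check. The sketch assumes that all contributions are non-negative, so that the "n or fewer respondents" with the largest combined share are simply the n largest contributors. The function name and the example cells are illustrative.

```python
def is_sensitive(contributions, n=3, k=75.0):
    """Dominance rule: True if the n largest contributions together
    account for more than k percent of the cell total."""
    total = sum(contributions)
    if total <= 0:
        return False  # empty or zero cell: the rule as sketched does not apply
    top = sum(sorted(contributions, reverse=True)[:n])
    return 100.0 * top / total > k


dominated_cell = [60, 20, 10, 5, 5]    # top 3 hold 90% of the total
balanced_cell = [30, 25, 20, 15, 10]   # top 3 hold exactly 75%
tiny_cell = [50, 50]                   # fewer than n respondents
```

With n = 3 and k = 75, `dominated_cell` is sensitive and `balanced_cell` is not, because 75% does not exceed the threshold. A cell with n or fewer respondents, such as `tiny_cell`, is always sensitive, since those respondents account for 100% of the total.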
Linked tables
• The previous sections dealt with the disclosure control of one single table only
• When a set of linked tables, i.e. tables with common variables stemming from the same microdata, is published, additional problems may arise
• It is possible that a table in itself does not contain any sensitive cells, but that by combining the information with information from other tables individual information can be disclosed
• One remedy is to delete one or more of the tables from the set of linked tables
• Another option is to protect the original microdata file against disclosure
Metadata
• Users are entitled to be informed about the characteristics of the product they receive
• Information should include all important elements of data content and data processing, including:
1. How are the concepts defined? A definition listing the components of the concept (inclusions and exclusions) is often more informative than a more theoretical definition
2. Which unit type is used and how is it defined?
3. Which classification rules have been applied?
4. How is the population delimited?
5. Which collection method has been used (paper, telephone etc.)?
6. How was non-response dealt with?
7. How have the data been edited? Etc.
Dissemination
• Interests of users do not necessarily coincide with the scopes covered by individual surveys
• There is a wide variety of user groups and a wide variety of areas of interest
• Publications may overlap
• There are many dissemination modes
• Electronic dissemination, particularly through NSO websites, has become dominant
Electronic dissemination
• The amount of statistical information available is immense
• Three types of clients can be distinguished:
1. The occasional client who wants some basic figures;
2. The client interested in a specific set of information on a regular basis;
3. The client with changing needs who requires large amounts of data.
Tabulations
• Statistical tables are the heart of a publication
• The first condition for each table is that the message it communicates can be easily understood
• The data should be presented clearly and the table title should describe the essential contents of the table
• The wording must be as informative as possible and easy to read and understand
• A table commonly consists of cells arranged in rows and columns
• In the case of a sample survey, the cell contents are usually estimates of totals or percentages of a predefined population
• Rounding is often carried out to remove irrelevant digits
• Precision indicators are an important issue for sample surveys