
Template for modules of the revised handbook

Method: Object Identifier Matching

0 General information

0.1 Module code

0.2 Version history

| Version | Date | Description of changes | Author | Institute |
|---|---|---|---|---|
| 1 | | First version | Leon Willenborg, Rob van de Laar | CBS (Netherlands) |

0.3 Template version and print date

Template version used: 1.0 p 3 d.d. 28-6-2011

Print date: 27-4-2012 13:46

Contents

General section – Method: Object identifier matching
1. Summary
2. General description of the method
3. Preparatory phase
4. Examples – not tool specific
5. Examples – tool specific
6. Glossary
7. Literature
Specific section – Method: Object identifier matching
A.1 Purpose of the method
A.2 Recommended use of the method
A.3 Possible disadvantages of the method
A.4 Variants of the method
A.5 Input data
A.6 Logical preconditions
A.7 Tuning parameters
A.8 Recommended use of the individual variants of the method
A.9 Output data
A.10 Properties of the output data
A.11 Unit of input data suitable for the method
A.12 User interaction - not tool specific
A.13 Logging indicators
A.14 Quality indicators of the output data
A.15 Actual use of the method
A.16 Interconnections with other modules

General section – Method: Object identifier matching

1. Summary

This method matches records in two data sets on the basis of a common key variable that identifies the units represented in the data sets. The scores on this key variable (in both data sets) are assumed to be of good quality, though not perfect.

2. General description of the method

Matching based on an object identifier variable is the simplest way to match. Both matching data sets contain the same unique object identifier, which is used as the matching key. The assumption is that the quality of the object identifier is sufficiently high; otherwise this matching method cannot be used effectively. Although we talk about an object identifier, it may actually consist of more than one variable; these are referred to as ‘key’ variables.

The basic principle is that a match is made if and only if a record from one dataset has exactly the same object identifier (key) value as another record from the second dataset. These types of matches are standard in databases, because database management packages contain functionality for this purpose. In database terminology, this involves an operation referred to as a ‘join’, or an ‘equi-join’. In the sense of Van de Laar (2008), joining is a procedure and not a method, because there are no approximations involved.
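The exact-equality principle described above can be sketched in a few lines of Python. This is only an illustration under assumed names: the function `equi_join`, the field names `beid`, `turnover` and `employees`, and the example records are hypothetical and do not come from any CBS system.

```python
def equi_join(ds1, ds2, key):
    """Return combined records for every pair with exactly equal key values."""
    # Index the second data set by key value so each lookup is a dict access.
    index = {}
    for rec in ds2:
        index.setdefault(rec[key], []).append(rec)
    joined = []
    for rec in ds1:
        for other in index.get(rec[key], []):
            # Merge the two records; on a name clash the ds1 value is kept.
            joined.append({**other, **rec})
    return joined


# Hypothetical example data, keyed by an eight-digit identifier.
ds_input1 = [{"beid": "10000001", "turnover": 250},
             {"beid": "10000002", "turnover": 90}]
ds_input2 = [{"beid": "10000001", "employees": 12},
             {"beid": "10000003", "employees": 4}]

joined = equi_join(ds_input1, ds_input2, "beid")
print(joined)
```

Only the record with identifier 10000001 occurs in both data sets, so it is the only one in the result. A database management package would express the same operation as a join on the key column.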

The exact matching, or ‘joining’ as it is defined above, describes an ideal situation. In practice this ideal situation may not exist, because some object identifier values may be erroneous. The present method is directed at precisely this non-ideal situation, as the ideal case is very simple.

2.1 Two steps

The assumption underlying this method is that the matching keys used in both data sets are ideally fault free. In practice it is enough if they are of good quality, that is, if the errors are such that matches can still be made using this method.

First step: Records from both data sets are matched on the basis of exact equality of the object identifier scores. In this version of the method it is assumed that each record of Ds-input1 has at most one match in the second dataset.

Second step: If some records of Ds-input1 are not matched, this may be due to errors in the object identifier values. In the second step an attempt is made to match the remaining records, still using the object identifier only. The errors in the object identifiers may be typing errors: a wrong character was typed, two neighbouring characters were interchanged, a character was omitted (or deleted), an extra character was typed, etc. With this in mind it may be possible to recover a missed match. The idea is to look among the unmatched records and find pairs that are close in terms of the Levenshtein (or Damerau-Levenshtein) distance. See also Example 4.5. Records of Ds-input1 that remain unmatched after the second step can still be part of the output dataset, with the added variables missing. Whether this is allowed depends on the variant of the method that is used (A.4).
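The two steps can be sketched as follows. This is a minimal illustration, not a prescribed implementation. It assumes that identifiers are unique strings within each data set, it uses the optimal string alignment distance (a restricted form of the Damerau-Levenshtein distance), and it accepts a second-step match only when exactly one remaining candidate lies within `max_distance`. That tie rule, the function names and the example identifiers are all assumptions made for this sketch.

```python
def osa_distance(a, b):
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                # Two neighbouring characters were interchanged.
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def two_step_match(keys1, keys2, max_distance=1):
    """Return {key from Ds-input1: matched key from Ds-input2}."""
    remaining2 = set(keys2)
    matches = {}
    # First step: exact equality of the object identifier.
    for k in keys1:
        if k in remaining2:
            matches[k] = k
            remaining2.discard(k)
    # Second step: for each unmatched key, accept a match only if exactly
    # one remaining candidate is within max_distance (assumed tie rule).
    for k in keys1:
        if k in matches:
            continue
        close = [c for c in remaining2 if osa_distance(k, c) <= max_distance]
        if len(close) == 1:
            matches[k] = close[0]
            remaining2.discard(close[0])
    return matches


# Hypothetical identifiers: the second one has two digits interchanged.
ds1_keys = ["12345678", "22334455", "99999999"]
ds2_keys = ["12345678", "22343455", "55555555"]
result = two_step_match(ds1_keys, ds2_keys)
print(result)
```

In this example the first identifier is matched in the first step, the second is recovered in the second step, and the third remains unmatched. In practice a second-step candidate would normally be confirmed against other attributes before it is accepted.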

3. Preparatory phase

[This is an optional, design related item, useful when considering the use of the described method in a practical situation. For instance about obtaining the required input data and auxiliary data of sufficient quality, and about the tuning, design time or run time, of the parameters or the required user interaction of the method. A formal list of these items can be found in the Specific description in the second part of the module.]

The quality of the object identifier scores in both data sets should be assessed, to see whether the object identifier matching method is applicable. If this seems to be the case, the first step can be attempted. Depending on the number of unmatched records, one has to decide whether to proceed with the second step of the method. If so, a suitable metric should be chosen, depending on the variables in the object identifier.

4. Examples – not tool specific

Most of the examples below refer to the CBS environment, but the issue at stake in each case can be generalized.

4.1 Example:

This example concerns the matching of enterprises from two statistics that are both based on the General Business Register (ABR). In both data sets, the unit (the enterprise) is identified by an eight-digit business identification number (a BEID). The BEID is the object identifier on which matching takes place. If the BEIDs of two records are the same, a match is made; if the BEIDs are not the same, the records are not matched. No account is taken of the fact that, during the processing procedure for the individual statistics, errors could have crept into the BEIDs. Checking for such errors is also often difficult because, in many cases, no other object characteristics, such as names and addresses, are present.

4.2 Example:

A variation on the first example is that data from the Tax Administration is matched with the BEIDs from the ABR. The one dataset from the ABR contains the BEID as the object identifier. The dataset from the Tax Administration contains the Tax Group (FE) as the object identifier. To match the two data sets, a ‘relationship or matching table’ is present, which indicates which FEs are associated with which BEIDs. In the same way as in the first example, the two data sets are matched, but with an extra step. There is a higher risk of incorrect matches here, because errors could have crept not only into the FEs or the BEIDs, but also into the registration of the relationship between the FEs and the BEIDs.
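The extra step through the relationship table can be sketched as below. The sketch assumes, for simplicity, that each FE is associated with at most one BEID. The function name, the field names (`beid`, `fe`, `sbi`, `vat`) and the data are hypothetical.

```python
def match_via_table(abr_records, tax_records, fe_to_beid):
    """Attach tax records to ABR records via an FE -> BEID relationship table."""
    abr_by_beid = {rec["beid"]: rec for rec in abr_records}
    matched, unmatched = [], []
    for tax in tax_records:
        # Extra step: translate the FE into a BEID before the exact match.
        beid = fe_to_beid.get(tax["fe"])
        abr = abr_by_beid.get(beid)
        if abr is None:
            unmatched.append(tax)
        else:
            matched.append({**abr, **tax})
    return matched, unmatched


# Hypothetical data: FE-2 has no entry in the relationship table.
abr = [{"beid": "10000001", "sbi": "4711"}]
tax = [{"fe": "FE-1", "vat": 1000}, {"fe": "FE-2", "vat": 50}]
table = {"FE-1": "10000001"}

matched, unmatched = match_via_table(abr, tax, table)
print(matched, unmatched)
```

An error in any of the three inputs (the FE, the BEID or the table entry) produces either a missed match or a mismatch, which is why this variation carries a higher risk than the first example.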

4.3 Example:

Matching based on a foreign key, for example, the SBI coding or size class coding in a record of a BEID. Indicators or averages from a dataset in which the SBI or size class is the object identifier can be matched with the records from the dataset with BEIDs. This occurs frequently in editing or imputation.

4.4 Example:

For privacy or data protection reasons external object identifiers can be replaced by secure internal object identifiers. The advantage is that it is impossible to link other external information to the records. This does prevent direct identification of units.

If $E$ is the set of external keys and $I$ the set of internally used keys, this replacement can be represented by a function $k : E \to I$, which should be injective.
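The injectivity requirement on the replacement function from external to internal keys can be illustrated with a small sketch. The sequential numbering and the `RIN` prefix are assumptions made purely for illustration; a real replacement scheme would use internal keys that cannot be predicted from the external ones.

```python
def build_key_map(external_keys):
    """Assign a distinct internal key to each distinct external key."""
    mapping = {}
    for ext in external_keys:
        if ext not in mapping:
            # Hypothetical internal key format, for illustration only.
            mapping[ext] = f"RIN{len(mapping) + 1:06d}"
    return mapping


def is_injective(mapping):
    """True if no two external keys share the same internal key."""
    values = list(mapping.values())
    return len(values) == len(set(values))


key_map = build_key_map(["A", "B", "A"])
print(key_map, is_injective(key_map))
```

If the function were not injective, two different units would receive the same internal key, and records of different units would be matched erroneously.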

4.5 Example

Suppose that the BEID is the object identifier, and that you also have a complete list of BEIDs with at least some information about the businesses concerned. If you come across a BEID that does not seem correct, then you could look ‘close by’ this number in the list. The idea here is that a mistake was made when copying the number, for example, two digits were interchanged, or a 5 was replaced by a 6 (or vice versa) or a 7 by a 1 (or vice versa), etc. If, for example, you search for all BEIDs with a Levenshtein distance of 1 or 2 (see Section 3.2) from the given BEID, and also compare the associated business attributes with the data in the dataset or register concerned, you could potentially find the correct BEID with the associated business attributes.
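A search for identifiers ‘close by’ a suspect BEID could look like the sketch below. It uses the plain Levenshtein distance, in which an interchange of two neighbouring digits counts as two edits, which is why a maximum distance of 2 is used. The register contents and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Plain Levenshtein distance: insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def nearby_ids(suspect, register, max_distance=2):
    """Return (distance, id) pairs within max_distance, closest first."""
    hits = [(levenshtein(suspect, rid), rid) for rid in register]
    return sorted(h for h in hits if 0 < h[0] <= max_distance)


# Hypothetical register of valid identifiers.
register = ["12345678", "12345687", "87654321"]
candidates = nearby_ids("12345679", register)
print(candidates)
```

The candidates returned are only suggestions. As the example describes, the associated business attributes still have to be compared before one of them is accepted as the correct BEID.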

5. Examples – tool specific

[This is an optional item. It provides simple and practical examples of the use of the method using syntax and options of some standard tool, including the name of the person who has provided the example, if this is not the author of the module.]

6. Glossary

[Only mention terms in this module-specific local glossary that are independent of a particular tool and with no SDMX equivalent. Copies of SDMX definitions from the Statistical Data and Metadata Exchange, or some other global glossary, can be included for the convenience of the reader. Local terms are marked by an asterisk (*)]

Term | Definition | Source of definition (link) | Synonyms (optional)

Atomic unit

See: Simple unit

Synonym: Simple unit

BEID

The unique identifier (at Statistics Netherlands) of so-called business units in the General Business Register (ABR). The business unit forms, together with the Enterprise Group and to a lesser extent the legal entity, the statistical framework on which the economic statistics of Statistics Netherlands are compiled.

Bipartite digraph

A digraph $G = (V, E)$ where $V$ is the set of points and $E$ is the collection of directed edges (arrows), in which the set of points $V$ is composed of two disjoint subsets $V_1$, $V_2$. Each arrow $a$ in $E$ in such a graph has the characteristic that one of its points lies in $V_1$ and the other in $V_2$. In this document, we deal with a special subclass of bipartite digraphs, namely those for which all arrows run from $V_1$ to $V_2$.

Bipartite graph

A graph $G = (V, E)$ where $V$ is the set of points and $E$ is the collection of edges, for which the set of points $V$ is composed of two disjoint subsets $V_1$, $V_2$. Each edge $e$ in such a graph has the characteristic that one of the points of $e$ lies in $V_1$ and the other in $V_2$.

Blocking variable

A variable that is used to partition matching data sets, that is, divide in a number of subfiles, with the intention of reducing the search space. If the blocking variable, for example, is a residential municipality for a matching problem where people are matched, then this means that only people living in the same municipality (at a certain time) will be matched.
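The reduction of the search space by a blocking variable can be sketched as follows. The field name `municipality`, the function names and the records are hypothetical and serve only to illustrate the idea.

```python
from collections import defaultdict


def block(records, blocking_variable):
    """Partition records into subfiles by the value of the blocking variable."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec[blocking_variable]].append(rec)
    return blocks


def candidate_pairs(ds1, ds2, blocking_variable):
    """Yield only those record pairs that fall in the same block."""
    blocks2 = block(ds2, blocking_variable)
    for rec1 in ds1:
        for rec2 in blocks2.get(rec1[blocking_variable], []):
            yield rec1, rec2


# Hypothetical person records with the residential municipality as block.
ds1 = [{"id": 1, "municipality": "Utrecht"},
       {"id": 2, "municipality": "Utrecht"},
       {"id": 3, "municipality": "Delft"}]
ds2 = [{"id": "a", "municipality": "Utrecht"},
       {"id": "b", "municipality": "Delft"},
       {"id": "c", "municipality": "Delft"}]

pairs = list(candidate_pairs(ds1, ds2, "municipality"))
print(len(pairs), "candidate pairs instead of", len(ds1) * len(ds2))
```

The price of blocking is that records with an erroneous value of the blocking variable end up in the wrong block and can no longer be matched.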

BSN

The acronym for Burgerservicenummer, the Dutch citizen service number (the citizen’s identification number).

Composite unit

A unit that is composed of units from a lower order. A household is an example of a composite unit; ‘persons’ are the simple units from which ‘households’ are composed.

Connected component (of a graph)

A maximal connected subgraph of a graph.

Connected graph

A graph in which all points are connected and therefore form a single component is called a connected graph.

Cut-off value

A value to limit the matching weights (upwards or downwards). As a result, it is possible, for example, to exclude pairs of records that have an overly high matching weight as candidate matches, because they are not sufficiently similar. Or, conversely, by increasing the cut-off value, it is possible to obtain more candidate matches, because records that are less similar are also considered as matching candidates.

Damerau-Levenshtein distance

DBMS

Database Management System. Examples of DBMSs are Oracle, MySQL, SQL Server, MS Access and PostgreSQL.

Deduplication

Removing, one by one, the duplicate records from a dataset, that is, records that occur multiple times and all relate to the same unit (in a certain period).

Degree

The degree of a point in a graph is the number of edges in the graph connected to this point.

Degree restriction

Limitation with respect to the degree of part of the points of a graph.

Deterministic matching

A matching technique that does not utilise a probability model. This is the case for joining, which is matching based on object identifiers. But if there are errors in the values of these keys, the resulting matches are likely to be incorrect. Applied in the context of matching with object characteristics, this concept is confusing and also usually not applicable. Even if a ‘deterministic matching rule’ is used, it is highly possible that matching errors will be made because errors and irregularities are present in the data. This matching method is therefore used as counterpart to ‘probabilistic matching’. This document avoids the use of this concept because it is confusing and can easily lead to misunderstandings.

Direct identifier

A variable that can be used to identify entities. This includes object identifiers, but also variables such as the BSN, name, address, etc., that can be used to directly identify entities, but possibly not uniquely. Some direct identifiers (such as the BSN) are suitable for use as object identifiers. Others (such as name, address, etc.) are suitable for use as object characteristics. See also: indirect identifier.

Dissimilarity measure

A measure to express the differences between two objects or entities. Somewhat similar to a metric. Antonym: similarity measure.

Distance function

See: Metric

Doublure records

Different records that refer to the same object at the same time or the same period.

DSC

Data Service Center. A service at CBS that in principle stores and makes available all the data sets that are produced during all the steps (interim storage points, ISPs) in the production process of statistics.

ETL

Extract Transform Load. A set of operations to make an external data set suitable for further processing, e.g. at a statistical office. These operations can be geared towards converting data formats, adapting new variables, converting the coding used in the data set to the coding used at the statistical office, etc.

False negative match

See: Missed match

False positive match

See: Mismatch

Feasible matching graph

A subgraph of an MC graph that satisfies the criteria that are established for the matching graph. These criteria relate at least to the maximum degree of the points or a part thereof (degree restrictions). The word ‘feasible’ is used in the sense of ‘feasible solution’.

Fellegi-Sunter method

Matching method described in Fellegi and Sunter (1969). See Appendix A for a short discussion of this method.

Foreign key

A key value that occurs in a record but is not suitable to identify the record itself. A foreign key is therefore located outside the key of a data set. The purpose of a foreign key is to make a match with a record in another data set which, for example, includes additional data based on that key. Example: A record for an enterprise, which is identified by a BEID, also includes, as a foreign key, a unique code for the region where the enterprise is active. In another data set, the code of the region is the object identifier, with additional data about the region, such as the number of residents, the average turnover in the region, the area of the region in square km, etc. Another example: a record with personal details, uniquely identified by a BSN, may contain a reference to the enterprise where the person works, for instance a code such as the BEID. Another data set, where the BEID is the key, contains data about that enterprise. A foreign key is often, but not necessarily, a reference to a unit type other than the one to which the record itself relates. Consider, for example, data for an employee with a reference to his/her supervisor. Both are of the type ‘person’ and both can be designated by a staff ID number.

Hamming distance

Distance between two records on a matching key, measured by counting the number of variables with different scores.
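As a small illustration of this definition, the sketch below counts the key variables on which two records differ. The variable names and the records are hypothetical.

```python
def hamming_distance(rec1, rec2, key_variables):
    """Number of matching-key variables on which the two records differ."""
    return sum(rec1[v] != rec2[v] for v in key_variables)


# Hypothetical records with a matching key of three variables.
rec_a = {"name": "Jansen", "birth_year": 1970, "postcode": "2511"}
rec_b = {"name": "Janssen", "birth_year": 1970, "postcode": "2512"}

distance = hamming_distance(rec_a, rec_b, ["name", "birth_year", "postcode"])
print(distance)
```

Note that this distance only registers whether the scores on a variable differ, not by how much. A string metric such as the Levenshtein distance could be used within a variable for that purpose.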

Incidence matrix

0-1 matrix $J$ that indicates for a graph $G = (V, E)$ what the relationship is between edges in $E$ and points in $V$. Suppose $|V| = n$, $|E| = m$ and $J$ is the $m \times n$ matrix where $J(i, j) = 1$ if point $j$ lies on edge $i$, and $J(i, j) = 0$ otherwise.
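The construction of an incidence matrix with one row per edge and one column per point can be sketched as follows. The point labels and edges are hypothetical. The column sums give the degree of each point, which ties in with the glossary entries on degree and degree restriction.

```python
def incidence_matrix(points, edges):
    """Build the |E| x |V| 0-1 matrix J with J[i][j] = 1 if point j lies on edge i."""
    column = {p: j for j, p in enumerate(points)}
    J = [[0] * len(points) for _ in edges]
    for i, (u, v) in enumerate(edges):
        J[i][column[u]] = 1
        J[i][column[v]] = 1
    return J


# Hypothetical graph with three points and two edges.
points = ["a", "b", "c"]
edges = [("a", "b"), ("b", "c")]

J = incidence_matrix(points, edges)
# The degree of a point is the sum of its column.
degrees = [sum(row[j] for row in J) for j in range(len(points))]
print(J, degrees)
```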

Indirect identifier

A variable that can be used to identify at least some entities in a population, but which is not a direct identifier. Examples are: place of residence, profession, age, gender. Indirect identifiers are candidates for object characteristics.

Variables that are neither direct nor indirect identifiers express, for example, views, opinions, beliefs, etc. Such variables are not suitable for use as secondary matching keys. The scores for units on such variables are generally not public knowledge, and they can also fluctuate over time.

Integer programming

A special case of linear programming, in which the variables that occur in the optimisation model are integers and not real numbers.

Interim storage point (ISP)

A point in the statistical process at CBS where certain data sets are well documented, stored in the DSC and made available for general use.

Joining

A form of matching used for databases, in which, for example, matching is based on matching keys being identical (equi-join).

Key

See Object identifier, Object characteristic

Levenshtein distance

Damerau-Levenshtein distance

Linear programming

Abbreviated as LP. This is the area where solutions are sought for problems with linear target functions that must be optimised under linear constraints. In this context, the variables are real-valued. Important subclasses are formed by problems in which all, or some, variables take on values in a finite set (such as {0,1} ) or a denumerable set (such as the integer numbers). In this case we are dealing with an important subclass of LP, namely integer programming (IP).

Matching

The process of bringing together data (represented in records) relating to units and spread over two data sets, based on common or very similar characteristics in the form of object identifier or object characteristic values. This matching can be simple, especially if common object identifiers are present in these data sets. It can also be more difficult, especially if only object characteristics are present, for which the scores can also contain errors, or when these variables are not completely identical.

Matching candidate digraph

A bipartite digraph that represents the possible matches between records from two data sets. The asymmetry in the digraph, for example, can be a result of the different times to which the matching data sets relate. The arrows indicate, for example, a possible development from a unit in one dataset to a unit in another dataset. The edges may or may not be assigned matching weights. A matching candidate digraph symbolises part of the constraints that exist for a matching problem. Abbreviated as MC digraph.

Matching candidate graph

A bipartite graph that represents the possible matches between records from two data sets. The edges may or may not be assigned matching weights. A matching candidate graph symbolises part of the constraints that exist for a matching problem. Abbreviated as MC graph.

Matching graph

Graph that is the result of a match. It is a subgraph of the matching candidate graph, with the same set of vertices but with fewer edges.

Matching key

One or multiple key variables that are used in two or more data sets to be matched, for example, to search for records from one dataset in records from another dataset. If the matching key is an object identifier variable, matching based on the similarity of the key will produce few problems as such. If, however, a matching key is used that consists of several object characteristic variables, then the matching will generally be more difficult due to errors (or other anomalies) in scores for these variables. However, errors can also occur in object identifiers.

Matching weight

For a graph $G = (V, E)$, a function $w : E \to [0, \infty)$ is a weight function, which associates a non-negative value with each edge of $G$. When matching, this weight expresses how well/poorly records match. It depends on the situation whether a higher/lower matching weight means that matching candidates fit together better/worse.

MC digraph

Matching candidate digraph (see the relevant description)

MC graph

Matching candidate graph (see the relevant description)

Metric

A metric $d$ for a set $X$ is defined as a function $d : X \times X \to [0, \infty)$, so a non-negative function, with the following properties:

1. $d(x, y) = 0$ if and only if $x = y$,

2. $d(x, y) = d(y, x)$ for all $x, y$ in $X$ (symmetry), and

3. $d(x, z) \le d(x, y) + d(y, z)$ for all $x, y, z$ in $X$ (triangle inequality).

Sometimes, instead of property 3, a stronger attribute applies:

4. $d(x, z) \le \max\{d(x, y), d(y, z)\}$.

A non-negative function $d$ that satisfies 1, 2 and 4 is called an ultra-metric.
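The properties of a metric and of an ultra-metric can be checked by brute force on a small finite set of points, as in the sketch below. The function names and the three example distance functions are chosen for illustration only. A check on a finite sample can refute a property, but it does not prove that the property holds on the whole set.

```python
from itertools import product


def _basic_properties(d, points):
    """Properties 1 and 2: d(x, y) = 0 iff x = y, and symmetry."""
    return all((d(x, y) == 0) == (x == y) and d(x, y) == d(y, x)
               for x, y in product(points, repeat=2))


def is_metric(d, points):
    """Properties 1, 2 and 3 (triangle inequality) on the given points."""
    return _basic_properties(d, points) and all(
        d(x, z) <= d(x, y) + d(y, z)
        for x, y, z in product(points, repeat=3))


def is_ultrametric(d, points):
    """Properties 1, 2 and 4 (the stronger inequality) on the given points."""
    return _basic_properties(d, points) and all(
        d(x, z) <= max(d(x, y), d(y, z))
        for x, y, z in product(points, repeat=3))


def abs_diff(x, y):
    return abs(x - y)


def squared_diff(x, y):
    return (x - y) ** 2


def discrete(x, y):
    return 0 if x == y else 1


sample = [0, 1, 2, 5]
print(is_metric(abs_diff, sample),
      is_metric(squared_diff, sample),
      is_ultrametric(discrete, sample))
```

The absolute difference satisfies the triangle inequality, the squared difference does not (the distance from 0 to 2 is 4, which exceeds 1 + 1), and the discrete distance is an example of an ultra-metric.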

Mismatch

A match that has been made erroneously (false positive match).

Missed match

A match that should have been made but was not (false negative match).

Object identifier

In database technology, the object identifier is the name for a variable or a combination of variables that satisfy the following requirements:

- the value of the variable (or the combination of variables) is unique in the table (or data set) and therefore unambiguously defines the record in which it occurs.

- the variable (or the combination of variables) is filled in everywhere and therefore cannot be empty.

- the combination of variables is minimal: by eliminating one of the variables, the record is no longer unambiguously defined.

If related tables refer to the table in which the variable (or combination) of variables occur, this is used to establish a relationship between tables.

Examples are the BSN and the RIN number for people, and the BEID for businesses. In statistical confidentiality, such variables are also called direct identifiers. Unfortunately, statistical security also refers to variables such as name, address, place of residence, etc. as direct identifiers. Such variables are called object characteristics in this document, however.

Probabilistic matching

Matching of the same units on the basis of scores for the matching variables that do not necessarily have to be the same. The differences of scores can have various causes:

1. There are observational or processing errors in the scores

2. The units in the two data sets were observed at different times, or

3. Matching variables in the different data sets are not defined exactly the same and possibly have other domains.

Record linkage

Another name for ‘matching’; see the relevant description.

Referential integrity

In a relational database, this is the basic principle that is required for internal consistency of the different tables in that database. This means that a table always has a key if it is referenced by another table in a key field, possibly a foreign key field. Database systems guarantee consistency and ensure that a transaction that violates the consistency cannot be performed. Example: there is a table (1) with regional data, identified by the postal code. In another table (2), the postal code is used to indicate the region in which someone lives. Referential integrity ensures that the postal codes in table 2 can always be found in table 1. Furthermore, the postal codes in table 1 may not be eliminated if these occur in table 2, either as a primary, secondary or foreign key.

Remainder

Records that cannot be matched when matching is performed on two data sets. In some cases, no remainder is desired, and it must be ‘eliminated’ by making extra matches.

RIN

Record Identification Number. An object identifier used by Statistics Netherlands to replace keys that are also known outside Statistics Netherlands (such as the BSN). The RIN is used for privacy reasons: a dataset in which RINs are used (and from which the original keys have been removed) cannot be matched with external data sets.

Object characteristic

A combination of variables that can be used in the identification of units, but which are not used as an object identifier. Often, this concerns variables (or a combination thereof) such as name, address, place of residence, date of birth, profession, education, gender, etc. None of these variables can identify the record by themselves, but the combination can be used as a proxy for an object identifier, if this is missing.

In statistical security, such variables are also called identifiers or indirect identifiers.

Primary key

See Object identifier

Similarity measure

A measure that indicates the extent to which two units are similar. This type of measure (or its complement: the dissimilarity measure) is also used in the multivariate analysis, for example, for clustering. See also dissimilarity measure.

SLA

Service Level Agreement. An agreement with clear arrangements between the supplier and the user of a service or product.

Simple unit

A unit that (for the matching problem in question) is not composed of units of a lower order, also called an atomic unit. For Statistics Netherlands, a person is a simple unit. For a doctor, a person could be a composite unit, such as when the doctor considers a person as a system of organs. Whether a unit is considered as simple or composite depends on the matching problem in question. Antonym: composite unit.

Soundex algorithm

Originally a phonetic algorithm to index names based on sound (in English). Later, a similar algorithm was developed for words in the Dutch language. Improvements of the Soundex algorithm for English include Metaphone and Double Metaphone.

Statistical matching

Matching records with information from units which do not necessarily have to be the same, but are similar. In terms of intention, this method deals with an entirely different problem than is discussed in this report. This is actually an imputation method. This method is not further discussed in this report for this reason.

Surjection

A function $f : X \to Y$ is a surjection if for each $y \in Y$ there is an $x \in X$ such that $y = f(x)$. This type of function is also called surjective.

Synthetic matching

See: Statistical matching

Type I error

See: Mismatch

Type II error

See: Missed match

UWV

The Dutch Employee Insurance Agency (Uitvoeringsinstituut WerknemersVerzekeringen).

Variable in an object identifier

One of the variables that together define an object identifier of a data set.

Weight

See: Matching weight

7. Literature

[All references should be written in English and should be publicly available.]

Van de Laar, Scholtus (2008) Standaardprocedures voor basisprocesstappen. Transformaties, Statistics Netherlands, The Hague.

Willenborg, L. & Scholtus, S. (2012) String edits with applications to matching and automatic coding, Report, Statistics Netherlands, The Hague.

De Jong, W.A.M. (1991), Technieken voor het koppelen van bestanden. Statistical studies, M41, SDU/publishers/ Statistics Netherlands publications, The Hague.

D’Orazio, M., Di Zio, M. and Scanu, M. (2006), Statistical matching. Wiley, New York.

Fellegi, I.P. and Sunter, A.B. (1969), A theory for record linkage. Journal of the American Statistical Association 64, 1183-1200.

Gartner (2007), Magic quadrant for data quality tools 2007. Gartner RAS, research note, June 2007.

Gill, L. (2001), Methods for Automatic Record Matching and Linking and their use in National Statistics. National Statistics Methodological series no. 25, Oxford University.

Herzog, T.N., Scheuren, F.J. and Winkler, W.E. (2007), Data quality and record linkage techniques. Springer.

ISAD (2008a), State of the art on statistical methodologies for the integration of surveys and administrative data. ESSnet Statistical Methodology project on the integration of survey and administrative data, a CENEX project.

ISAD (2008b), Recommendations on the use of methodologies for the integration of survey and administrative data. ESSnet Statistical Methodology project on the integration of survey and administrative data, a CENEX project.

Lenz, Rainer (2003), A graph theoretical approach to record linkage. Paper for the joint ECE/Eurostat worksession on statistical confidentiality 17-19 April 2003.

Mardia, K., Kent, J. and Bibby, J. (1982), Multivariate analysis. Academic Press.

Nemhauser, G.L and Wolsey, L.A. (1988), Integer and combinatorial optimization. Wiley Interscience.

Newcombe, H.B. (1988), Handbook of record linkage. Oxford University Press.

Newcombe, H.B., Kennedy, J.M., Axford, S.J. and James, A.P. (1959), Automatic linkage of vital records. Science 130, 954-959.

Papadimitriou, C.H. and Steiglitz, K. (1998), Combinatorial optimization. Dover.

Sankoff, D. and Kruskal, J.B. (eds.) (1983), Time warps, string edits, and macromolecules: the theory and practice of sequence comparison. Addison-Wesley.

Statistics New Zealand (2006), Data integration manual. Statistics New Zealand, Wellington.

Van de Laar, R. (2008), Conceptuele typering van processtappen naar businessfunctie. Internal report, Statistics Netherlands, Voorburg.

Wikipedia, article about the EM algorithm, http://en.wikipedia.org/wiki/EM_algorithm.

Wikipedia, article about Record linkage, http://en.wikipedia.org/wiki/Record_linkage.

Willenborg, L. and De Waal, T. (2000), Elements of statistical disclosure control. Lecture notes in statistics, Vol. 155, Springer.

Winkler, W.E. (1985), Exact matching lists of businesses: blocking, subfield identification and information theory. Proceedings of the Section on Survey Research Methods, American Statistical Association, pp. 438-443. Published in an extended version in Alvey, W. and Kilss, B. (eds.), Record Linkage Techniques - 1985, Proceedings of the Workshop on Exact Matching Methods, pp. 227-241.

Winkler, W.E. (2006a) Overview of Record Linkage and Current Research Directions. U.S. Bureau of the Census, Statistical Research Division Report Series, n.2006/2.

Specific section – Method: Object identifier matching

[The General description of the method, item 2, gives a more easily readable and accessible account of the more formal items covered in the following Specific section. The Specific section can be used to check if the General description (item 2) is complete and well balanced concerning several relevant aspects.]

A.1 Purpose of the method

[This is the purpose for which the method is used in exactly one process step. If variants of the method can be applied for different intentions, then write different modules, each module describing a different method in the context of each process step. The content of this item should enable a reader with a particular application in mind to decide whether this method is of interest to him/her. This is not a description of the method itself, but of its purpose. Also it should not contain recommendations on the optimal tuning of the method for a particular situation, as covered in item A.8 ‘Recommended use’.]

The purpose is adding variables to a micro data set Ds-input1 from a second micro data set Ds-input2 for the same objects in both data sets. Records from two micro data sets are combined using an object identifier variable.

A.2 Recommended use of the method

[This item contains recommendations on the optimal use of the method for a particular situation. This is apart from recommendations on the use of each individual variant of the method, as described in item A.4, and their optimal use, as explained in A.8.]

1. Use this method when object identifiers of good quality are available in both data sets to be matched.

A.3 Possible disadvantages of the method

[This item explicitly describes possibly undesirable side effects if the method is applied, and what type of error or practical disadvantage is to be expected then. These are unwanted properties (post-conditions) of the output data.]

1. If the quality of the object identifier values is not very high, type 1 or type 2 errors might occur.

A.4 Variants of the method

[Describe only variants of a method that are of practical relevance. Do not give variants of the method that are described in other modules, for they are considered separate methods. The variants in a first parameter are number 1.1, 1.2, etc, the variants of a second type in a second parameter, are 2.1, 2.2, etc. In item A.8 the recommended use of each variant is described for practical situations.]

1. The string metric used in the second step, which depends on the type of the object identifier, has the following possible variants:

1.1 The Levenshtein distance

1.2 The Damerau-Levenshtein distance

1.3 Other distances for other types of object identifier

2. Records in the output dataset Ds-output1:

2.1 Each record of Ds-input1 is part of Ds-output1 (left outer join), or only matching records occur in the output dataset.

2.2 Each record of Ds-input1 can occur more than once in Ds-output1, or can occur at most once in Ds-output1.

3. Duplicate records in Ds-input1 or in Ds-input2 can be allowed.

4. One or more blocking variables can be used to divide the datasets for matching.

5. The non-matching records of Ds-input1 and/or Ds-input2 can be delivered as separate output datasets Ds-output2 and Ds-output3.

6. Specification of the matching variables in Ds-input1 and Ds-input2.

7. Specification of the variables of Ds-input2 (and of Ds-input1) that will be part of Ds-output1.
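
The two string metrics named under variant 1 can be sketched as follows. This is a minimal illustration and not the implementation of any particular tool. The function names are hypothetical, and the second function implements the restricted form of the Damerau-Levenshtein distance (often called optimal string alignment), which counts a swap of two adjacent characters as one edit but does not edit a swapped pair again.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or keep)
        prev = curr
    return prev[-1]


def osa_distance(a: str, b: str) -> int:
    """Levenshtein distance extended with transposition of adjacent characters
    (restricted Damerau-Levenshtein, i.e. optimal string alignment)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]
```

For two identifiers that differ by one swapped pair of digits, such as "123456789" and "123465789", levenshtein returns 2 and osa_distance returns 1. The second metric is therefore the more forgiving one for typing errors of that kind.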

As another method (not described yet), in step 2 other variables could be considered as well, besides the object identifier. This would be a hybrid between the object identifier matching method in this module, and an object characteristic matching method, as described in the modules

A.5 Input data

[A description of the input data used by the method. For instance: the type of input data (e.g. micro, macro, longitudinal data), and the type of information used as input by the method, as regards content. This includes also auxiliary data, if applicable. Label the datasets as Ds-input1 etc.]

1. Ds-input1: the primary input data set. It is a micro dataset to which additional variables must be added.

2. Ds-input2: an input data set that contains the variables to be added to Ds-input1.

A.6 Logical preconditions

[These are conditions on the input data purely by the method, i.e. not restrictions implied by a tool or a process step, such as that an input data set has to be present. Items A.6.1 and A.6.2 contain, respectively, the types of missing values and the types of erroneous values that may be present in the input data. The items A.6.1 to A.6.4 are not a complete list, but can be extended when appropriate]

A.6.1 Missing values

1. The object identifier values used in the matching must not be missing.

A.6.2 Erroneous values

1. Errors in the object identifier are allowed, but it should still be possible to use them for matching with a distance metric.

A.6.3 Other quality related preconditions

1. In this version of the method it is assumed that each record of Ds-input1 has at most one exact match in the second dataset.
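
The preconditions in A.6.1 and A.6.3 can be checked before matching. The sketch below assumes that a data set is held as a list of dictionaries and that a missing identifier is stored as None or as an empty string. Both are assumptions made for illustration, as is the function name check_identifiers.

```python
from collections import Counter


def check_identifiers(records, key):
    """Return the records with a missing identifier and the identifier
    values that occur more than once in the data set."""
    missing = [r for r in records if r.get(key) in (None, "")]
    counts = Counter(r[key] for r in records if r.get(key) not in (None, ""))
    duplicates = sorted(value for value, n in counts.items() if n > 1)
    return missing, duplicates


# Hypothetical second input data set with one duplicate and one missing identifier.
ds_input2 = [
    {"id": "A1", "employees": 5},
    {"id": "A2", "employees": 7},
    {"id": "A1", "employees": 6},
    {"id": "", "employees": 2},
]
missing, duplicates = check_identifiers(ds_input2, "id")
```

Here missing holds the one record without an identifier and duplicates equals ["A1"]. If Ds-input2 passes this check with two empty results, no record of Ds-input1 can have more than one exact match in it.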

A.6.4 Other types of preconditions

A.7 Tuning parameters

[User-specified values that are used as auxiliary information in an implementation of the method. For instance: a user-specified cut-off value in an outlier detection method. The optimal values of these parameters in particular cases are for each variant specified in A.8]

A.8 Recommended use of the individual variants of the method

[Parameter values and tuning values (best practices) that are recommended in specific situations (e.g. input data sets, population units, unit properties) for the method. Refer explicitly to the described variants. When the method does not have parameters, no recommendations on the best parameter values can be made. When the method does not have any variants, no recommendations on the recommended use of a particular variant can be made.]

A.9 Output data

[A description of the output data generated by the method. For instance: the type of output data and the type of information in the output provided by the method, as regards content (so not a log file or quality indicators; these are specified in A.13 and A.14). Label the datasets as Ds-output1 etc.]

1. Ds-output1: a micro dataset containing all variables of Ds-input1, with variables added from Ds-input2.

2. Optional Ds-output2 containing all non-matching records from Ds-input1.

3. Optional Ds-output3 containing all non-matching records from Ds-input2.
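
How the three output data sets relate to the two inputs can be sketched for the exact-matching step alone. The code assumes list-of-dictionary data sets, unique identifiers in Ds-input2 and the left outer join variant. The function name and the example variables (turnover, employees) are hypothetical, and the distance-based second step is left out.

```python
def match_on_identifier(ds_input1, ds_input2, key, add_vars):
    """Left outer join of ds_input1 with ds_input2 on an exact identifier match.

    Returns (ds_output1, ds_output2, ds_output3): the enriched primary data set,
    the non-matching records of ds_input1 and those of ds_input2."""
    lookup = {rec[key]: rec for rec in ds_input2}  # assumes unique identifiers
    output1, output2, matched_keys = [], [], set()
    for rec in ds_input1:
        partner = lookup.get(rec[key])
        row = dict(rec)
        for var in add_vars:
            # Added variables stay empty (None) when no partner record is found.
            row[var] = partner[var] if partner is not None else None
        output1.append(row)
        if partner is None:
            output2.append(rec)
        else:
            matched_keys.add(rec[key])
    output3 = [rec for rec in ds_input2 if rec[key] not in matched_keys]
    return output1, output2, output3


ds_input1 = [{"id": "001", "turnover": 10}, {"id": "002", "turnover": 20}]
ds_input2 = [{"id": "002", "employees": 5}, {"id": "003", "employees": 7}]
ds_output1, ds_output2, ds_output3 = match_on_identifier(
    ds_input1, ds_input2, "id", ["employees"]
)
```

In this example Ds-output1 keeps both records of Ds-input1, with employees filled in only for identifier "002". Ds-output2 contains the record with identifier "001" and Ds-output3 the record with identifier "003".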

A.10 Properties of the output data

[These are the desired useful properties (post-conditions) of the output data sets on methodological grounds, i.e. not implied by a tool. A.3 mentions possible undesired side effects.]

1. The output data set contains all variables from Ds-input1, but with additional variables from Ds-input2, presumably for the same objects.

A.11 Unit of input data suitable for the method

[Please choose one of the following: Incremental processing, Processing groups of units, or Processing full data sets. For instance when adding or updating one record of input data, some methods must process the complete input data, other methods append or update the existing output data by one record.]

1. Processing full data sets (internally, blocking variables can divide a data set into smaller parts).
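
The effect of a blocking variable can be illustrated with a short sketch. It assumes list-of-dictionary data sets and a hypothetical blocking variable region. Only record pairs that agree on the blocking variable are offered for comparison.

```python
from collections import defaultdict


def candidate_pairs(ds_input1, ds_input2, block_var):
    """Yield the record pairs that agree on the blocking variable."""
    blocks = defaultdict(list)
    for rec in ds_input2:
        blocks[rec[block_var]].append(rec)
    for rec1 in ds_input1:
        # Records of ds_input1 are only compared within their own block.
        for rec2 in blocks.get(rec1[block_var], []):
            yield rec1, rec2


ds_input1 = [{"id": "1", "region": "N"}, {"id": "2", "region": "S"}]
ds_input2 = [
    {"id": "1", "region": "N"},
    {"id": "3", "region": "N"},
    {"id": "2", "region": "S"},
    {"id": "4", "region": "W"},
]
pairs = list(candidate_pairs(ds_input1, ds_input2, "region"))
```

With these data, blocking on region leaves 3 candidate pairs instead of the 8 pairs of the full comparison. The price is that a true match whose blocking variable is wrong in one of the data sets is never compared.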

A.12 User interaction - not tool specific

[The necessary not tool specific user interaction before, during, and after use of the method.]

1. Before matching, the tuning parameters must be set by analysing the results for different values.

2. No user interaction during matching.

3. After matching, the number of mismatches and the quality indicators (type 1 and type 2 errors) must be evaluated.

A.13 Logging indicators

[Indicators that may be used for logging]

1. Number of non-matching records from Ds-input1.

2. Number of non-matching records from Ds-input2.

3. Time used.

A.14 Quality indicators of the output data

[For instance: the precision of the output data set (in the form of a confidence interval) for a given precision of the input data set, or the number of significant digits in the output data, or a level of consistency achieved by the method.]

1. The number of mismatches and the number of missed matches can be used as quality indicators. The quality of the matching method can be assessed by inspecting the matches obtained for test files. This is labour-intensive: you must examine not only the matching candidates and the matches ultimately selected, but also any missed matches under various parameter settings. The quality indicators are influenced by the way that the weights are calculated, the use of cut-off values and the use of blocking variables to stratify large data sets.
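
Once the true matches in a test file have been established by inspection, the two indicators can be counted as in the sketch below. It assumes that a match is represented as a pair of record identifiers, one from each input data set. The function name and the example pairs are hypothetical.

```python
def matching_errors(found_pairs, true_pairs):
    """Count mismatches (pairs linked by the method but not true matches) and
    missed matches (true matches the method did not find)."""
    found, truth = set(found_pairs), set(true_pairs)
    mismatches = len(found - truth)
    missed_matches = len(truth - found)
    return mismatches, missed_matches


# Pairs selected by the method versus pairs verified by inspection.
found_pairs = [("a", "a"), ("b", "c")]
true_pairs = [("a", "a"), ("b", "b"), ("d", "d")]
mismatches, missed_matches = matching_errors(found_pairs, true_pairs)
```

In this example there is 1 mismatch, the pair ("b", "c"), and there are 2 missed matches. Repeating the count for several parameter settings shows how the two indicators trade off against each other.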

A.15 Actual use of the method

[Instance where this method is used in practice, i.e. a particular statistical process in a particular country in year yyyy. If possible, provide references to more detailed documentation of the application.]

A.16 Interconnections with other modules

[The links to other modules yield additional information of various types relevant to the method described in this module. They also indicate which information is covered by other modules and should not be included in this module.]

· Themes that refer explicitly to this module

1. Theme: Object Matching

2. Theme: Synthetic Matching

· Related methods described in other modules

1. Method: Unweighted matching of Object Characteristics

2. Method: Weighted matching of Object Characteristics

· Mathematical techniques used by the method described in this module

· GSBPM phases where the method described in this module is used

1. 5.1 Integrate data

· Tools that implement the method described in this module

[A description of the tools or their technical limitations or possibilities is not part of a method module. The links included here are meant for easy reference and as a summary of commonly used (standard) tools implementing the method]

· The Process step performed by the method

2. Adding variables to micro data set


