Basic information about record linkage
What is record linkage?
Record linkage refers to the process of recognizing pairs of records which represent identical observational units in two separate data files. It is the method of choice whenever one needs to join data files on the micro- or observational level.
Uses of Record Linkage
The canonical use of record linkage is to combine data from different sources for research hypothesis testing. Enhancing sample survey data with data from large administrative or epidemiological registers is a valuable example.
Other prominent uses involve the follow-up of cohorts, post-enumeration checks of census data, the update and deduplication of survey sampling frames, the retrospective creation of panel data, measuring population sizes by capture-recapture, and updating or extending register data.
Advantages of record linkage
Record linkage can offer several advantages: (i) It is very cheap compared to studies that involve direct data collection. (ii) When direct data collection is impossible for technical or ethical reasons, record linkage can sometimes serve as an alternative. (iii) It reduces the survey burden on respondents. (iv) Under appropriate circumstances, it may yield higher data quality than direct data collection. (v) It tends to yield larger sample sizes at equal cost.
When and why is record linkage difficult?
If the files include error-free and unique common identifiers such as a social security number, record linkage is a simple file merge operation which can be done by any standard database management system or data analysis software.
However, in most cases it is necessary to resort to a combination of ambiguous and error-prone identifiers such as surnames, given names, and address information. The notorious data quality problems of such identifiers usually yield a considerable number of unlinkable cases. In this situation, much more sophisticated techniques and specialised record linkage software are required.
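The simple case can be illustrated with a minimal sketch of an exact merge on a shared key. The records, the field names (`ssn`, `income`, `diagnosis`), and the helper `exact_link` are hypothetical and chosen only for illustration; the sketch assumes the key is unique and error-free in both files.

```python
# Hypothetical example records; field names are assumptions for illustration.
file_a = [
    {"ssn": "111-11-1111", "income": 42000},
    {"ssn": "222-22-2222", "income": 55000},
]
file_b = [
    {"ssn": "222-22-2222", "diagnosis": "J45"},
    {"ssn": "333-33-3333", "diagnosis": "E11"},
]


def exact_link(left, right, key):
    """Inner join of two lists of dicts on a unique, error-free key."""
    # Build a lookup table for the right-hand file (assumes the key is unique there).
    index = {rec[key]: rec for rec in right}
    # Keep only left-hand records whose key also occurs in the right-hand file.
    return [{**rec, **index[rec[key]]} for rec in left if rec[key] in index]


linked = exact_link(file_a, file_b, "ssn")
```

Only the record whose key occurs in both files ends up in `linked`; the other two records remain unlinked.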
Record Linkage Methodology
A typical record linkage process involves five basic steps in a standard sequence: preprocessing, blocking, comparison, classification, and fusion.
In the preprocessing step, the identifiers are prepared before performing the actual linking in order to get satisfactory linking results. Preprocessing techniques usually involve the split-up of compound identifiers (called "parsing"), the transformation of identifier values into a standard representation (called "standardization" or "normalization"), and standard data cleaning procedures such as checks for plausibility and consistency.
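Parsing and standardization could look roughly like the following sketch. The function names and the assumed input layout ("Surname, Given names") are illustrative assumptions; real preprocessing rules depend on the data and language at hand.

```python
import re
import unicodedata


def standardize(value):
    """Uppercase, strip accents and punctuation, collapse whitespace."""
    # Decompose accented characters so the accent marks can be dropped.
    decomposed = unicodedata.normalize("NFKD", value)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace anything that is not a letter, digit, or space with a space.
    cleaned = re.sub(r"[^A-Za-z0-9 ]+", " ", no_accents)
    return " ".join(cleaned.upper().split())


def parse_name(full_name):
    """Split 'Surname, Given names' into two fields (assumed input layout)."""
    surname, _, given = full_name.partition(",")
    return {"surname": standardize(surname), "given_name": standardize(given)}


parsed = parse_name("Núñez-Smith,  María José ")
```

Here `parsed` holds the surname and given name as separate, uppercased fields without accents, hyphens, or surplus blanks.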
Since the number of possible record pairs is often too large for all of them to be compared directly, in the blocking step the files are split into disjoint subsets of records (called "blocks"), and only the record pairs formed from corresponding blocks are compared. Usually, the blocks are formed by the exact agreement of one or more identifiers (called "blocking keys"). More advanced blocking techniques allow blocking keys that agree only partially.
In the comparison step, a similarity score is determined from the corresponding identifier values of the records. The similarity between identifier values is usually computed using string-similarity functions, most often an edit distance, the Jaro-Winkler similarity, or an n-gram similarity.
In the classification step, the similarity scores of the individual identifiers are aggregated into a total similarity score for each record pair. Then a decision on similarity thresholds has to be made: record pairs above an appropriate threshold are considered matching pairs, and record pairs below the threshold are considered non-matching pairs. The decision on the matching status of the record pairs can be based on different classifiers, for example classification trees, support vector machines, or some statistical decision rule. Most record linkage programs use a probabilistic decision rule based on a model formulated by Fellegi & Sunter (1969).
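Blocking on exact agreement of a key might be sketched as follows. The blocking key used here (first letter of the surname plus birth year) and the record layout are assumptions made for the example, not a recommendation.

```python
from collections import defaultdict
from itertools import product


def block_by(records, key_func):
    """Group records into blocks that share the same blocking key."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[key_func(rec)].append(rec)
    return blocks


def candidate_pairs(file_a, file_b, key_func):
    """Yield only those record pairs whose blocking keys agree exactly."""
    blocks_a = block_by(file_a, key_func)
    blocks_b = block_by(file_b, key_func)
    for key in blocks_a.keys() & blocks_b.keys():
        yield from product(blocks_a[key], blocks_b[key])


def blocking_key(rec):
    # Hypothetical blocking key: surname initial plus birth year.
    return (rec["surname"][:1], rec["birth_year"])


file_a = [
    {"id": "a1", "surname": "MEYER", "birth_year": 1970},
    {"id": "a2", "surname": "SMITH", "birth_year": 1985},
]
file_b = [
    {"id": "b1", "surname": "MEIER", "birth_year": 1970},
    {"id": "b2", "surname": "SMYTH", "birth_year": 1985},
    {"id": "b3", "surname": "JONES", "birth_year": 1970},
]

pairs = sorted((a["id"], b["id"]) for a, b in candidate_pairs(file_a, file_b, blocking_key))
```

In this toy example, blocking reduces the six possible record pairs to two candidate pairs. The price is that true matches whose blocking keys disagree are never compared.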
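Two of the similarity functions mentioned above can be sketched in a few lines. The version below uses the Levenshtein edit distance, scaled to the range 0 to 1, and a Dice coefficient over character bigrams. Both are common variants, but the scaling and the choice of n = 2 are assumptions of this sketch; Jaro-Winkler is not shown.

```python
def levenshtein(s, t):
    """Minimum number of insertions, deletions, and substitutions turning s into t."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def edit_similarity(s, t):
    """Edit distance rescaled so that 1.0 means identical strings."""
    if not s and not t:
        return 1.0
    return 1.0 - levenshtein(s, t) / max(len(s), len(t))


def bigram_similarity(s, t):
    """Dice coefficient over the sets of character bigrams of s and t."""
    grams_s = {s[i:i + 2] for i in range(len(s) - 1)}
    grams_t = {t[i:i + 2] for i in range(len(t) - 1)}
    if not grams_s and not grams_t:
        return 1.0 if s == t else 0.0
    return 2 * len(grams_s & grams_t) / (len(grams_s) + len(grams_t))
```

For the surname variants "MEYER" and "MEIER", the edit distance is 1 (one substitution), while only two of the four bigrams of each name are shared.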
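A probabilistic decision rule in the spirit of Fellegi & Sunter (1969) might be sketched as follows. For simplicity, the sketch reduces each comparison to binary agreement or disagreement instead of a graded similarity score. The m- and u-probabilities (the probability that an identifier agrees among true matches and among true non-matches, respectively) and both thresholds are invented for illustration; in practice they have to be estimated from the data.

```python
import math

# Hypothetical (m, u) probabilities per identifier; not estimated from real data.
M_U = {
    "surname": (0.95, 0.01),
    "given_name": (0.90, 0.05),
    "birth_year": (0.98, 0.02),
}


def match_weight(agreement):
    """Sum of log-likelihood-ratio weights over all compared identifiers."""
    total = 0.0
    for field, agrees in agreement.items():
        m, u = M_U[field]
        if agrees:
            total += math.log2(m / u)              # agreement weight (positive)
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight (negative)
    return total


def classify(weight, lower=0.0, upper=8.0):
    """Two-threshold rule; pairs in between are left for clerical review."""
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "possible match"


pair = {"surname": True, "given_name": True, "birth_year": False}
status = classify(match_weight(pair))
```

Setting `lower` equal to `upper` gives the single-threshold rule described above. With the assumed values, a pair that agrees on surname and given name but not on birth year falls between the two thresholds.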
In the fusion step, the record pairs considered as matching are merged into one "master" or compound record. If the records contain conflicting information on some common attributes, this may be a non-trivial task as well.
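One simple way to handle conflicting values is sketched below: missing values are filled from the other record, and in case of a real conflict the value from a preferred source is kept and the field is flagged. This rule, the function name `fuse`, and the example records are assumptions for illustration; other conflict-resolution rules (most recent value, most frequent value, manual review) are equally conceivable.

```python
def fuse(rec_a, rec_b, prefer="a"):
    """Merge two matched records; on conflict keep the preferred source's value."""
    first, second = (rec_a, rec_b) if prefer == "a" else (rec_b, rec_a)
    master = {}
    conflicts = []
    for field in sorted(first.keys() | second.keys()):
        v1, v2 = first.get(field), second.get(field)
        if v1 in (None, ""):
            # Preferred source has no value: fall back to the other record.
            master[field] = v2
        else:
            master[field] = v1
            if v2 not in (None, "") and v2 != v1:
                conflicts.append(field)  # flag for later inspection
    return master, conflicts


rec_a = {"surname": "MEYER", "city": "", "birth_year": 1970}
rec_b = {"surname": "MEIER", "city": "BERLIN", "phone": "030-1234"}
master, conflicts = fuse(rec_a, rec_b)
```

In this example the master record takes the city and phone number from the second record, keeps the surname from the first, and reports the surname as a conflicting field.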
Availability of Record Linkage Software
Traditionally, most record linkage systems have been commercial programs for business applications or special-purpose programs for use in official statistics or cancer registries. Recently, however, several freely available programs have emerged, mainly from the academic realm. Most prominent among them are Febrl, Link Plus, MTB, Relais, and Link King.
Basic Introductory Bibliography
- Christen, P. 2012. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Berlin: Springer.
- Elmagarmid, A., Ipeirotis, P. G., and Verykios, V. 2007. Duplicate record detection: a survey. IEEE Transactions on Knowledge and Data Engineering 19(1) 1–16.
- Fellegi, I. P. and Sunter, A. B. 1969. A theory for record linkage. Journal of the American Statistical Association 64(328) 1183–1210.
- Gill, L. E. 2001. Methods for Automatic Record Matching and Linkage and Their Use in National Statistics. Norwich: Office for National Statistics.
- Herzog, T. N., Scheuren, F. J., and Winkler, W. E. 2007. Data Quality and Record Linkage Techniques. New York: Springer.
- Judson, D. H. 2004. Computerized record linkage and statistical matching. K. Kempf-Leonard (ed.) Encyclopedia of Social Measurement. Amsterdam: Elsevier.
- Newcombe, H. B. 1988. Handbook of Record Linkage: Methods for Health and Statistical Studies, Administration, and Business. Oxford: Oxford University Press.
- Smith, M. E. 1984. Record linkage: present status and methodology. Journal of Clinical Computing 13(2–3) 52–69.
- Winkler, W. E. 1995. Matching and record linkage. B. G. Cox et al. (ed.) Business Survey Methods. New York: Wiley, pp. 355–384.