Preprocessing


Objectives
To determine the optimal degree of standardization for matching variables and blocking variables in preprocessing.

Description
Although preprocessing is generally regarded as essential for the success of record linkage, comparatively little research has been done on it. A major task in preprocessing is the standardization of identifier values. For example, the German umlauts ä, ö, ü are typically replaced by ae, oe, and ue. Another common operation is to remove titles from surname fields. However, standardizing identifiers has an often overlooked drawback: the removed or standardized part of an identifier value may be exactly what differentiates between false positive and true positive matches. That is, there is a balance between too little and too much standardization in matching variables. This balance is likely to be different for blocking variables.
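
The two standardization operations mentioned above, and the drawback they bring, can be illustrated with a minimal Python sketch. The function name standardize_surname, its two switches, and the short title list are hypothetical and chosen only for illustration; they are not taken from any particular record linkage tool.

```python
import re

# Assumption: only the three umlauts named in the text are handled.
UMLAUTS = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue"})

# Assumption: a tiny, illustrative title list; real data would need a curated one.
# The second \b prevents stripping "dr" from surnames such as "Drews".
TITLE_PATTERN = re.compile(r"\b(?:prof|dr)\b\.?\s*")


def standardize_surname(value, replace_umlauts=True, strip_titles=True):
    """Return a standardized surname; each switch controls one operation."""
    result = value.strip().lower()
    if strip_titles:
        result = TITLE_PATTERN.sub("", result)
    if replace_umlauts:
        result = result.translate(UMLAUTS)
    return result.strip()


# Full standardization maps two different raw values to the same string.
print(standardize_surname("Dr. Müller"))                      # mueller
print(standardize_surname("Mueller"))                         # mueller
# Keeping the title preserves information that may separate two persons.
print(standardize_surname("Dr. Müller", strip_titles=False))  # dr. mueller
```

With both switches on, "Dr. Müller" and "Mueller" become indistinguishable. That may be acceptable for a blocking variable, where a coarse key only has to bring candidate pairs together, but for a matching variable the discarded title or spelling could be what separates a true match from a false one. Making each operation switchable is one possible way to vary the degree of standardization per variable.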