Stepwise Variable Selection for Loglinear Mixture in Record Linkage

Rong Zhu, Jin Zhang, Da Zhang, Guohua Yan

Abstract

A model-building strategy is proposed to improve probabilistic matching in record linkage, with a focus on the two-component loglinear mixture model, in which one component describes the matched pairs and the other the unmatched pairs. In practice, the comparison attributes (i.e., covariates) often interact with each other, so the loglinear models for both matched and unmatched pairs contain interaction terms. However, the interaction patterns are often not the same in the two components. In particular, because the number of matched pairs is typically very small compared with the number of unmatched pairs, the model for matched pairs cannot be fitted with interactions of as high an order as the model for unmatched pairs. The proposed strategy attempts to avoid the underfitting and overfitting that result from subjective model specification; unlike subjective specification, it is data-driven. Starting from the model with no interactions, we add interactions sequentially to the two loglinear components using forward selection. To this end, we define alternatively climbing pathways through mixture families of two components with higher-order interactions. The mixture models expanded along a pathway are successively nested, so conventional tests for nested models can be applied. For parameter estimation in the mixture, a simplified method for the EM algorithm (including the choice of initial parameter values) is developed, which facilitates fitting the mixture model with existing packages and functions in statistical software such as R. A simulation study was conducted under various scenarios to assess the model selection approach, and the selected models were compared with the naive model that assumes field independence.
We apply this strategy to the record linkage case study from SSC 2006 and identify interactions among certain comparison attributes for both matched and unmatched pairs; these interaction patterns are not always the same for the two groups.
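
As a rough illustration of the starting point of the forward selection, the naive model with no interactions, the following Python sketch fits a two-component mixture by EM under field independence. It is not the simplified EM method developed in the paper, which works with loglinear components in R. The function name `em_independence_mixture`, the initial values (0.9 and 0.1 for the agreement probabilities, 0.1 for the matched proportion), and the simulated data are all assumptions made for this example.

```python
import math
import random


def em_independence_mixture(patterns, n_iter=100, p_init=0.1, eps=1e-6):
    """EM for a two-component mixture of binary agreement patterns,
    assuming the comparison fields are independent within each component.

    patterns: list of equal-length tuples of 0/1 (1 = the field agrees).
    Returns (p, m, u, g): matched proportion, per-field agreement
    probabilities for matched and unmatched pairs, and the posterior
    probability of being a match for each pair.
    """
    n, k = len(patterns), len(patterns[0])
    p = p_init          # assumed initial proportion of matched pairs
    m = [0.9] * k       # assumed initial P(agree | matched)
    u = [0.1] * k       # assumed initial P(agree | unmatched)
    g = [p] * n

    def clip(value):
        # Keep probabilities strictly inside (0, 1) so likelihoods stay positive.
        return min(max(value, eps), 1.0 - eps)

    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match.
        for i, x in enumerate(patterns):
            lm = p * math.prod(m[j] if x[j] else 1.0 - m[j] for j in range(k))
            lu = (1.0 - p) * math.prod(u[j] if x[j] else 1.0 - u[j] for j in range(k))
            g[i] = lm / (lm + lu)
        # M-step: weighted proportions under the independence model.
        sg = sum(g)
        p = clip(sg / n)
        for j in range(k):
            agree_m = sum(gi for gi, x in zip(g, patterns) if x[j])
            agree_u = sum(1.0 - gi for gi, x in zip(g, patterns) if x[j])
            m[j] = clip(agree_m / sg)
            u[j] = clip(agree_u / (n - sg))
    return p, m, u, g


# Simulated example (hypothetical parameters): 4 fields, about 10% matched pairs.
random.seed(0)
true_m = [0.9, 0.85, 0.9, 0.8]
true_u = [0.1, 0.15, 0.05, 0.2]
pairs = []
for _ in range(2000):
    probs = true_m if random.random() < 0.1 else true_u
    pairs.append(tuple(int(random.random() < q) for q in probs))

p_hat, m_hat, u_hat, post = em_independence_mixture(pairs)
print(round(p_hat, 3), [round(v, 2) for v in m_hat], [round(v, 2) for v in u_hat])
```

In the strategy described above, a fit of this kind would serve only as the baseline: interaction terms would then be added to one component at a time, and each expanded model compared with its nested predecessor using a conventional test such as a likelihood ratio test.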

Keywords

record linkage; loglinear mixture models; EM algorithm; model selection; alternatively climbing pathways
