Stepwise Variable Selection for Loglinear Mixture in Record Linkage
Keywords:
record linkage, loglinear mixture models, EM algorithm, model selection, alternatively climbing pathways
Abstract
A model-building strategy is proposed to improve probabilistic matching in record linkage, with a focus on the two-component loglinear mixture model, in which one component describes the matched pairs and the other the unmatched pairs. In practice, the comparison attributes (i.e., covariates) often interact with one another, which introduces interaction terms into the loglinear models for both matched and unmatched pairs. However, the interaction patterns are often not the same in the two components. In particular, because the number of matched pairs in a real application is very small compared with the number of unmatched pairs, the model for matched pairs cannot be fitted with interactions of the same high order as the model for unmatched pairs. The proposed strategy attempts to avoid both the underfitting and the overfitting that result from subjective model specification. Unlike subjective specification, this strategy is data-driven. Starting from the model with no interactions, we add interactions sequentially to the two loglinear components using forward selection. To this end, we define alternatively climbing pathways through families of two-component mixtures with higher-order interactions. The mixture models expanded along a pathway are successively nested; thus, conventional tests for nested models can be applied. For parameter estimation in the mixture, a simplified method for the EM algorithm (including the choice of initial parameter values) is developed, which facilitates fitting the mixture model with existing packages and functions in statistical software such as R. A simulation study was conducted under various scenarios to assess the model selection approach, and the selected models were compared with the naive model that assumes field independence.
We apply this strategy to the record linkage case study in SSC 2006 and identify interactions among certain comparison attributes for both matched and unmatched pairs; these interaction patterns are not always the same for the two groups.
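For orientation, the naive baseline mentioned in the abstract (a two-component mixture in which the comparison fields are independent within each component) can be fitted with a few lines of EM. The sketch below is only an illustration under stated assumptions, not the authors' method: it assumes binary agree/disagree comparison fields, uses pure Python rather than R, and omits the interaction terms and the forward selection along pathways that the paper proposes. The function names `simulate_pairs` and `em_independence`, the initial values, and the simulated parameters are all hypothetical.

```python
import random


def simulate_pairs(n, p, m, u, seed=0):
    """Simulate n binary agreement patterns (hypothetical test data).

    p: proportion of matched pairs
    m: per-field agreement probabilities among matched pairs
    u: per-field agreement probabilities among unmatched pairs
    """
    rng = random.Random(seed)
    patterns = []
    for _ in range(n):
        probs = m if rng.random() < p else u
        patterns.append(tuple(int(rng.random() < q) for q in probs))
    return patterns


def em_independence(patterns, n_iter=100, p_init=0.1, m_init=0.9, u_init=0.1):
    """EM for a two-component mixture with independent binary fields.

    Returns (p, m, u, g), where g[i] is the posterior probability
    that pair i is a match. Initial values are assumptions.
    """
    eps = 1e-9  # keeps probabilities strictly inside (0, 1)
    n, k = len(patterns), len(patterns[0])
    p, m, u = p_init, [m_init] * k, [u_init] * k
    g = [0.0] * n
    for _ in range(n_iter):
        # E-step: posterior match probability for every pair.
        for i, x in enumerate(patterns):
            lik_m, lik_u = p, 1.0 - p
            for j in range(k):
                lik_m *= m[j] if x[j] else 1.0 - m[j]
                lik_u *= u[j] if x[j] else 1.0 - u[j]
            g[i] = lik_m / (lik_m + lik_u)
        # M-step: weighted proportions of agreement in each component.
        total = sum(g)
        p = min(max(total / n, eps), 1.0 - eps)
        for j in range(k):
            agree_m = sum(gi * x[j] for gi, x in zip(g, patterns))
            agree_u = sum((1.0 - gi) * x[j] for gi, x in zip(g, patterns))
            m[j] = min(max(agree_m / max(total, eps), eps), 1.0 - eps)
            u[j] = min(max(agree_u / max(n - total, eps), eps), 1.0 - eps)
    return p, m, u, g


# Usage on simulated data with few matches relative to non-matches.
pairs = simulate_pairs(2000, 0.1, [0.95, 0.9, 0.9, 0.85], [0.05, 0.1, 0.1, 0.15])
p_hat, m_hat, u_hat, post = em_independence(pairs)
print(round(p_hat, 3), [round(v, 2) for v in m_hat], [round(v, 2) for v in u_hat])
```

Because the within-component independence assumption is exactly what the paper relaxes, a fit like this would serve only as the starting point from which interactions are added to either component.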
Published
2010-04-09
Section
Mathematical Statistics
License
Upon acceptance of an article by the European Journal of Pure and Applied Mathematics, the author(s) retain the copyright to the article. However, by submitting your work, you agree that the article will be published under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). This license allows others to copy, distribute, and adapt your work, provided proper attribution is given to the original author(s) and source. However, the work cannot be used for commercial purposes.
How to Cite
Stepwise Variable Selection for Loglinear Mixture in Record Linkage. (2010). European Journal of Pure and Applied Mathematics, 3(2), 141-162. https://www.ejpam.com/index.php/ejpam/article/view/642