Misspecified Multivariate Regression Models Using the Genetic Algorithm and Information Complexity as the Fitness Function

Hamparsum Bozdogan, J. Andrew Howe


Model misspecification is a major challenge faced by all statistical modeling techniques. Real-world multivariate data in high dimensions frequently exhibit higher kurtosis and heavier tails, asymmetry, or both. In this paper, we extend Akaike's AIC-type model selection criteria in two ways. We use Bozdogan's more encompassing notion of information complexity (ICOMP) for multivariate regression, so that certain types of model misspecification can be detected by the newly proposed criterion and researchers are protected against them. We do this by employing the "sandwich" or "robust" covariance matrix $\hat{F}^{-1}\hat{R}\hat{F}^{-1}$, which is computed with the sample kurtosis and skewness. Thus, even if the data modeled do not meet the standard Gaussian assumptions, an appropriate model can still be found. Theoretical results are then applied to multivariate regression models in subset selection of the best predictors in the presence of model misspecification by using the novel genetic algorithm (GA), with our extended ICOMP as the fitness function. We demonstrate the power of the confluence of these techniques on both simulated and real-world datasets. Our simulations are very challenging, combining multicollinearity, unnecessary variables, and redundant variables with asymmetrical or leptokurtic behavior. We also demonstrate the approach on the well-known body fat data. Our findings suggest that when data are overly peaked or skewed, both characteristics often seen in real data, ICOMP based on the sandwich covariance matrix should be used to drive model selection.
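To make the role of the sandwich covariance matrix concrete, the following is a minimal Python sketch, not the implementation used in the paper. It assumes a univariate Gaussian linear regression rather than the multivariate model, builds $\hat{F}$ from the observed information and $\hat{R}$ from the outer product of per-observation scores (instead of the closed forms in sample skewness and kurtosis), and scores a model as $-2\log L + 2\,C_1(\hat{F}^{-1}\hat{R}\hat{F}^{-1})$, where $C_1(\Sigma) = \frac{s}{2}\log\frac{\mathrm{tr}(\Sigma)}{s} - \frac{1}{2}\log|\Sigma|$. The function names `c1_complexity` and `icomp_sandwich` are hypothetical.

```python
import numpy as np


def c1_complexity(cov):
    """First-order maximal entropic complexity C1 of a covariance matrix."""
    s = cov.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * s * np.log(np.trace(cov) / s) - 0.5 * logdet


def icomp_sandwich(X, y):
    """Return (score, sandwich covariance) for a Gaussian linear model.

    Assumption: univariate response, parameters theta = (beta, sigma^2),
    fitted by maximum likelihood. Illustrative only.
    """
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / n  # ML estimate of the error variance

    # Maximized Gaussian log-likelihood.
    loglik = -0.5 * n * (np.log(2.0 * np.pi * sigma2) + 1.0)

    # F: observed information at the MLE (block diagonal for this model).
    F = np.zeros((p + 1, p + 1))
    F[:p, :p] = X.T @ X / sigma2
    F[p, p] = n / (2.0 * sigma2**2)

    # R: outer product of per-observation score vectors.
    score_beta = X * (resid / sigma2)[:, None]
    score_sigma2 = -0.5 / sigma2 + resid**2 / (2.0 * sigma2**2)
    scores = np.column_stack([score_beta, score_sigma2])
    R = scores.T @ scores

    F_inv = np.linalg.inv(F)
    cov = F_inv @ R @ F_inv  # sandwich covariance
    return -2.0 * loglik + 2.0 * c1_complexity(cov), cov


# Usage: compare two candidate predictor subsets under heavy-tailed errors.
rng = np.random.default_rng(0)
X_full = np.column_stack([np.ones(200), rng.normal(size=(200, 3))])
y_demo = X_full[:, :3] @ np.array([1.0, 2.0, -1.0]) + rng.standard_t(3, size=200)
for cols in ([0, 1, 2], [0, 1, 2, 3]):
    score, _ = icomp_sandwich(X_full[:, cols], y_demo)
    print(cols, round(float(score), 2))
```

In a subset-selection setting, a score of this kind would serve as the GA fitness value to be minimized over candidate predictor subsets; the loop above simply evaluates two subsets by hand.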


Misspecified multivariate regression models, Information complexity, Robust estimation, Genetic algorithm, Subset selection, Dimension reduction
