Semi-parametric mixed effects models for longitudinal data with applications in business and economics

Longitudinal data is becoming increasingly common in business, social sciences, and biological sciences due to the advantages it offers over cross-section data in modeling and incorporating heterogeneity among subjects and in being able to make causal inferences from observational data. Parametric models and methods are widely used for analyzing longitudinal data for continuous, discrete, and count data occurring in these disciplines. Some popular models are Gaussian, Logit, and Poisson fixed and random effects models. These models are unreliable in situations in which the link function is nonlinear and the form of nonlinearity is not known with certainty. This paper employs a semiparametric extension of fixed and random effects models called generalized additive mixed models (GAMMs) to analyze several longitudinal data sets. These semi-parametric models are flexible and robust extensions of generalized linear models. Following Wood [19], the GAMMs are represented using penalized regression splines and estimated by penalized regression methods treating the penalized component of each smooth as a random effect term and the unpenalized component as a fixed effect term. The degree of smoothness for the unknown functions in the linear predictor part of the GAMM is estimated as the variance parameter of the term. Applications of GAMMs studied include analysis of anti-social behavior, decision to use a professional tax preparer, and analysis of patent data on manufacturing firms. For each application, several GAMMs are compared with their parametric counterparts.


Introduction
Linear regression model is the workhorse of empirical research across many disciplines. Generalized linear models (GLMs) extend the linear regression model by allowing for response variables, which are bounded or discrete. These models are used for modeling continuous, categorical, count, and ordinal data on the response variable. These models relax the assumption that the response is normally distributed by allowing it to follow any distribution from the exponential family, such as normal, Poisson, binomial, gamma etc. Inference for GLMs is based on likelihood theory. Gaussian, Logit, and Poisson regression models are among the most widely used GLMs. Common applications of Logit models include analysis of brand choice data in marketing (Baltas [2] and Guadagni and Little [7]) and transportation choice data in economics (Greene [6] and Manski and McFadden [12]). Applications of Poisson regression models include analysis of data on patents, number of trips to a doctor's office, and number of shipping accidents. McCullagh and Nelder [11] provide an authoritative account of GLMs and Cameron and Trivedi [4] and Greene [6] provide econometric applications. These models are appropriate for cross-section data and do not account for heterogeneity among subjects. In order to account for heterogeneity among subjects, longitudinal data is used for which the generalized linear mixed model (GLMM) extension of GLMs is needed. In GLMMs, some of the unknown coefficients in the model linear predictor are treated as random variables. These random effects are viewed as having a covariance structure that itself depends on some unknown fixed parameters. This allows the use of more complex models for the random component of data, which leads to improvements in modeling over-dispersed and correlated data. Generalized additive models (GAM) developed by Hastie and Tibshirani [8], [9] and extended by Wood [19] among others, are a powerful semi-parametric generalization of GLMs in which part of the linear predictor is a sum of unknown smooth functions of explanatory variables. GAMs are very flexible and are very useful in the nonparametric exploration of continuous, discrete, and count data. Sapra [14] presented several applications of these models to cross-section data in business and economics and demonstrated that GAMs generally provided a better fit to data than GLMs. This paper presents econometric applications of the generalized additive mixed models (GAMM) extensions of the generalized linear mixed models (GLMMs) for longitudinal data, which includes the conventional random effects models and demonstrates that the GAMMs can overcome a serious weakness of the GLMMs: failing to identify the nonlinearities in the link function. The paper is organized as follows. To begin with, we introduce the generalized additive mixed model (GAMM) in section 2 and present the penalized regression method for the estimation of GAMMs in section 3. In the following sections, several econometric applications of GAMM are presented. These applications include a Gaussian GAMM for analysis of anti-social behavior among children in section 4, a GAMM Logit model for analysis of data on choice of a paid tax-preparer in section 5, and a GAMM Poisson regression model for analysis of patent data for manufacturing firms in section 6.

Generalized additive mixed models
GAMMs extend generalized additive models (GAMs) by including random effects to allow for heterogeneity and correlation among subjects. Generalized additive models (GAMs) are nonparametric generalized linear models. GAMs extend traditional linear models in another way, namely by allowing for a link between the nonlinear predictor f(x 1 ... x p ) and the expected value of y. This amounts to allowing for an alternative distribution for the underlying random variation besides just the normal distribution. While Gaussian models can be used in many statistical applications, these models may not be adequate for modeling discrete responses such as counts, or bounded responses such as proportions. Generalized linear models (GLMs) consist of a random component, an additive component, and a link function relating these two components. The response y, the random component, is assumed to have a density in the exponential family where θ is called the natural parameter and ϕ is the scale parameter. The normal, binomial, and Poisson distributions are all in this family. The GLM models (1) can be extended to generalized linear mixed models (GLMMs) by incorporating random effects into the GLMs. Suppose that observations of the ith of n units consist of a response variable y i and p covariates x i = (1, x 1i ,, . . ., x pi ) T associated with fixed effects and a q x 1 vector of covariates z i associated with random effects. Let y it denote the response of the ith subject at time t and let x 1it , x 2it … x pit be the associated covariates. Given a q x 1 vector u of random effects, the observations y it on the ith unit at time t are assumed to be conditionally independent with means E (y it |u i ) = μ it and variances Var(y it |u i ) = ϕm it -1 v(μ it ), where v(.) is a specified variance function, m it is a prior weight (e.g. a binomial denominator) and ϕ is a scale parameter, and follow a generalized additive model. Under the GLMMs, the mean μ it = E (y it │x 1it , x 2it , …, x pit , u i ) is linked to the linear predictor x it T β +z it T u i , through the link function The generalized additive mixed models (GAMMs) extend the GLMMs by linking the mean μ = E(y│x 1 , x 2 , …, x p ) to the nonlinear nonparametric predictor through the link function where s 1 (·)... s p (·) are smooth nonparametric functions. The most commonly used link function is the canonical link, for which η = θ. The random effects u are assumed to be distributed iid as N (0, D(ψ)), where ψ is a cx1 vector of variance components.

Estimation of GAMMs
Estimation of GAMMs consists in representing the GAMM as a GLMM with a variance component controlling the amount of smoothing for each additive component. The Bayesian model of spline smoothing introduced by Wahba [16] and Silverman [15] has led to the possibility of estimating the degree of smoothness of terms in a generalized additive model as variances of the wiggly components of the smooth terms treated as random effects. Several algorithms for GAMM estimation have exploited this connection (see Wang [17] and Ruppert et al. [13]). In the normal errors identity link case, estimation can be performed using general linear mixed effects modeling software such as the lme package in R. In the generalized case, only approximate inference is possible using the Penalized Quasi-Likelihood approach of Breslow and Clayton [3]. An advantage of this approach is that it allows correlated errors to be treated via random effects or the correlation structures available. However, using correlation structures beyond the strictly additive form requires using a GEE approach to fitting. Some details of how GAMs are represented as mixed models and estimated using maximum likelihood or penalized quasi-likelihood methods can be found in Wood [18], [19]. In addition, these methods obtain a posterior covariance matrix for the parameters of all the fixed effects and the smooth terms. A similar approach due to Lin & Zhang [10] obtains the covariance matrix of the data in the additive case (or pseudo-data in the generalized case) implied by the weights, correlation and random effects structure based on the estimates of the parameters of these terms, which is used to obtain the posterior covariance matrix of the fixed and smooth effects. The bases used to represent smooth terms in GAMMs are the same as those used in GAMs. The normal GAMMs can be described by the conditional density of responses given the random effects f (y|u) = exp{y T (Xβ +Zu) -1 T b(Xβ +Zu) + 1 T c(y)}, (4) And the probability density function of the random effects f (u) = (2π) -q/2 |D (ψ)| -1/2 exp (-1/2 u T D (ψ) -1 u T ).
(5) Following Ruppert et al [13], we assume that (4) represents a generalized semi-parametric additive model with D1+D2 predictors of which the first D1 predictors form the columns of X and enter the model linearly and the last D2 predictors, which form the columns of Z enter the model non-parametrically as p-th degree splines. For each of the last D2 predictors, the powers of degree 1 through p are columns of X, while the truncated power functions form columns of Z. Under these conditions, the GAMM in (4) and (5) is represented as a GLMM and the methods for estimation of GLMMs become available for GAMMs. Lin and Zhang [10] propose constructing nonparametric smoothing spline estimators of the s functions and then estimating the smoothing parameter λ and variance components ψ using marginal quasi-likelihood as follows.
The parameters in the model are (β, ψ) and the likelihood function is where J (β, ψ) = ∫ R q exp {y T (Xβ +Zu) -1 T b (Xβ +Zu) -1/2 u T D (ψ) -1 u}. (7) Maximization of L (β, ψ) is intractable due to the presence of the q-dimensional integral J (β, ψ) in equation (7). Several methods have been proposed for circumventing this problem. The penalized quasi-likelihood method maximizes the penalized log-likelihood log f (y|u) -1/2 u T D (ψ) -1 u (8) to obtain estimates of (β,u) for given ψ. Fixing (β,u) at their current values, Breslow and Clayton [3] suggest updating ψ at each stage of the iteration using maximum likelihood or restricted maximum likelihood applied to the pseudo-data. Alternatively, the variance components ψ can be estimated via cross-validation.

The generalized additive Gaussian model
The generalized additive Gaussian model assumes that the link functions are (9) for the identity link and (10) for the log link, where ~(0, ( )). u N D i

. Variable definitions and data description
The data are taken from Allison [1]. The sample is drawn from the National Longitudinal Survey of Youth (NLSY; Center for Human Resource Research, 2002). We use Allison's smaller sample of 581 children, which was drawn by the author from a much larger sample. These children were interviewed in 1990, 1992, and 1994, but we use the data in 1990 and 1994 only. The dependent variable is ANTI and all of the remaining variables are explanatory variables. The data are summarized in Table 1. ANTI = Anti-social behavior (scale ranges from 0 to 6) SELF = Self-esteem (scale ranges from 6 to 24) POV = 1 if family is in poverty, 0 otherwise. BLACK = 1 if child is BLACK, 0 otherwise HISPANIC = 1 if child is HISPANIC, 0 otherwise CHILDAGE = child's age in 1990 MARRIED = 1 if mother was currently married in 1990, otherwise 0 GENDER = 1 if female, 0 otherwise MOMAGE = Mother's age at birth of child MOMWORK = 1 if mother was employed in 1990, 0 otherwise TIME_2 = 1 if the year is 1992, 0 othewise TIME_3 = 1 if the year is 1994, 0 othewise MSELF and MPOV are the person-specific means for the variables SELF and POV respectively. DSELF and DPOV are deviations around the person-specific means for the variables SELF and POV respectively.

Models
The following models were fit to the data. Model 1 is a fixed effects linear regression model with ANTI as the dependent variable and SELF, POV, and TIME_2 and TIME_3 as the independent variables, which is a GLM with identity link. Given the nonlinearity of the link function in SELF displayed in the partial residual plots of SELF in Fig.1, Model 2 is a fixed effects GAM with the identity link, which introduces a nonparametric smooth term s (SELF). Model 3 is a random effects GLM with the identity link with ANTI as the dependent variable and SELF, POV, TIME_2, TIME_3, BLACK, HISPANIC, CHILDAGE, MARRIED, GENDER, MOMAGE, and MOMWORK as the independent variables. Model 4 is a hybrid GLM with the identity link, ANTI as the dependent variable and DSELF, DPOV, MSELF. MPOV, TIME_2, TIME_3, BLACK, HISPANIC, CHILDAGE, MARRIED, GENDER, MOMAGE, and MOMWORK as the independent variables. Due to the nonlinearity of the link function in DSELF, CHILDAGE, and MOMAGE displayed in the partial residual plots of these variables in Fig. 2, Model 5 is a hybrid GAMM with ANTI as the dependent variable, which includes parametric terms for DPOV, MSELF, MPOV, TIME_2, TIME_3, BLACK, HISPANIC, MARRIED, GENDER, and MOMWORK, and nonparametric terms for DSELF, CHILDAGE, and MOMAGE. Estimation and inference results are presented in tables 2 through 6.

Nonparametric exploration of nonlinearity in the link function
The following partial residual plots help us identify the nature of nonlinearity in the link functions.

Comparing the models
A comparison of models using the AICs presented in Table 7 suggests that Models 1 and 2, which employ ANTI as the response variable and SELF, TIME_2, and TIME_3 as explanatory variables have the lowest AICs among the models considered and are therefore the best models. Both models are fixed effects models of which Model 2 is a generalized additive fixed effects model with a nonparametric smooth term for the variable SELF. At the other extreme, Models 3 and 5 have the highest AIC, BIC, and deviance. This may suggest that a pure random effects model does not perform as well as a fixed effects model and that including nonparametric nonlinear terms in DSELF, CHILDAGE, and MOMAGE may lead to over-fitting. Following Allison [1], Models 4 and 5 employ a hybrid approach, which combines some of the merits of fixed effects and random effects models. Under this approach, the time-varying variables are transformed into deviations from their individual-specific means, but the response variable is left unchanged. Unlike fixed effects models, the time-invariant variables are included in the regression model. Additionally, variables, which are the individualspecific means for each of the time-varying variables, are also included. Instead of OLS, a random effects model is estimated. The correlation matrices display the estimated sampling correlations among the fixed-effects coefficient estimates, which are not usually of direct interest. Very large correlations, however, are indicative of an ill-conditioned model (Frees [5]). In all of the models, correlation matrices of parameter estimates display weak correlations between parameter estimates after conditioning on random effects confirming that random effects specification is desirable in all of the cases considered. Results presented in tables 2 through 6 may be summarized as follows. The variables SELF and TIME_3 are highly significant across all of the fixed and random effects models as are the variables DSELF and MSELF across all of the hybrid models. The variable GENDER is highly significant in models 3, 4, and 5 in which it is included. Surprisingly, POV is statistically insignificant in all of the models except Models 3 and 5, a random effect GLM and a hybrid GAMM respectively. Most importantly, MOMAGE and MOMWORK are highly significant in Model 5 only, which is the only semi-parametric model that captures the nonlinearity of the link function in these variables.

The generalized additive mixed Logit model
The generalized additive Logit model assumes that the link function is

An empirical application of GAM Logit model to choice of using a professional tax preparer
The dataset is from Frees [5]. Following Frees [5], we model choice of using a professional tax preparer (PREP) using demographic and economic characteristics of taxpayers. The data are from Statistics of Income (SOI) panel of Individual Returns. The SOI panel represents a simple random sample of unaudited individual income tax returns filed for tax years [1979][1980][1981][1982][1983][1984][1985][1986][1987][1988][1989][1990]. The data were compiled from a stratified probability sample of unaudited individual income tax returns filed by US taxpayers. The estimates obtained from these data are intended to represent all returns filed for the income tax years under review. All returns presented are subjected to sampling except tentative and amended returns. Following Frees [5], we use a balanced panel from 1983-1984 and 1986-1987 taxpayers included in the SOI panel, a 4% sample of this comprises our sample of 258 taxpayers. These years are chosen because they contain the interesting information on paid-preparer usage. Specifically, these data include line-item tax return data and a binary variable noting the presence of a paid tax preparer for years 1982-1984 and 1986-1987. The variable definitions are presented in Table 8 and summary statistics for the data are displayed in Table 9.

Models
The following models were fit to the data.

Comparing the models
Estimation results are presented in tables 10 through 13. A comparison of models using the AIC presented in Table 14 suggests that Models 1 and 4, which employ PREP as the response variable and TAX, AGE, DEPEND, EMP and LNTPI as explanatory variables have the lowest AICs among the models considered and are therefore the best models. Model 1 is a generalized linear random effects Logit model while Model 4 is a generalized additive mixed effects Logit model with a nonparametric smooth term for the variable LNTPI. At the other extreme, Models 2, a random effects Logit model, which includes LNTPI, MR and EMP has the highest AIC, BIC, and deviance. This may suggest that a pure random effects model, which omits AGE and replaces TAX with MR as the tax liability variable, does not perform as well as the other random effects GLMMs and GAMMs. Correlation matrices of parameter estimates display weak correlations between parameter estimates after conditioning on random effects confirming that random effects specification is desirable in all of the cases considered.
Examination of tables 10-13 suggests that the variable EMP is statistically significant at 1% significance level in all of the models. However, the tax liability variables TAX and MR are statistically insignificant across all models. The variable LNTPI is statistically insignificant in all of the models except Model 3. The variable AGE is statistically significant at 1% significance level in all of the models. The positive signs of the coefficient estimates are all expected in all of the models. For instance, the signs of AGE, DEPEND, EMP and LNTPI are positive in all of the models indicating that the odds of choosing a professional tax preparer are higher for a taxpayer who is 65 years or older or has dependents, is self-employed or whose total income increases than for a taxpayer who does not have these traits. The surprising statistical insignificance of tax liability variables TAX and MR in the GAMM models could be attributed to possible over-fitting in these models.

The generalized additive mixed effects poisson models
Poisson and Negative Binomial regression models are the most widely used count data models. In these models, the outcome, y it is a count variable. Generalized additive mixed extensions of these models replace ∑ x jit β j , the linear component of the model with an additive component ∑ f j (x jit ) in the link function and introduce a random effects term. We wish to model p (y it |x 1it , x 2it … x pit ), the probability of an event given variables x 1it , x 2it … x pit . The Poisson regression model assumes that the link function is linear: ln ...
The generalized additive mixed Poisson model assumes instead that where s 1 , s 2 ,…,s p are smooth nonparametric functions, which are estimated by maximizing a penalized quasi-log likelihood approach described in Section 3 above.  Table 15.

Models
The following models were fit to the data using PAT as the response variable. Model 1 is a generalized linear mixed effects Poisson regression model, which employs LOGR, LOGR1, LOGR2, LOGR3, LOGR4, and LOGR5 as the explanatory variables. Model 2 extends model 1 by including time dummies for four of the five years using Year1 as the reference year. Next, we explore nonlinearities in the link function non-parametrically and present partial residual plots.
Smooth nonparametric terms are included in the link functions if nonlinearities are confirmed through these plots. Given the nonlinearity of the Poisson link function in LOGR1 as displayed in Fig. 4, Model 3 was chosen to be a semiparametric Poisson regression model in which the link function is of the additive form and includes a nonparametric smooth term for LOGR1 and parametric linear terms for all other variables.

Comparing the models
As for the Gaussian and Logit GLMMs, correlation matrices of parameter estimates for Poisson GLMMs also display weak correlations between parameter estimates after conditioning on random effects confirming that random effects specification is desirable in all of the cases considered. A comparison of models using the AICs is presented in Table 19. Based on the AIC, BIC, and Deviance criteria in Table 19, Model 3, a GAMM model with the nonparametric term s(LOGR1) appears to be the best since it has the lowest AIC index among all models. At the other extreme, Model 1, a GLMM Poisson regression model with no time dummies, has the highest AIC and is consequently the poorest model. The estimation and significance testing results are presented in tables 16 through 18. The variables LOGR, LOGR2, and LOGR3 are statistically significant in Model 1, which does not include time dummies. However, in models 2 and 3, which include time dummies, LOGR, LOGR2, and LOGR5 are highly statistically significant, but LOGR3 is not. All of the time dummies are also highly significant across all models. The remaining variables are statistically insignificant across all of the models.

Conclusion
The paper has studied applications of generalized additive mixed models (GAMMs), including GAMM Gaussian, Logit, and Poisson regression models in business and economics. Unlike GLMs and GLMMs, these models allow us to explore the relationship between the response and multiple predictor variables non-parametrically in the presence of nonlinearities in the link function using longitudinal data. The applications studied ranged from the analysis of antisocial behavior to the choice of a professional tax-preparer to the number of patents issued to a manufacturing firm. In all of the empirical applications, the semi-parametric GAMMs generally performed better than the parametric generalized linear mixed models (GLMMs). The GAMMs employed in the paper are widely applicable to the analysis of continuous, count, and binary response with longitudinal data occurring frequently in business and social sciences. GAMMs offer important advantages over GLMMs, including extension of nonparametric regression to more than one regressor circumventing the curse of dimensionality, non-parametric exploration of nonlinearities and interactions among explanatory variables as well as accounting for correlation and over-dispersion among responses. Nevertheless, the GAMMs are not without drawbacks. The computational algorithms are complex due to the presence of multiple integrals in the likelihood function and difficulties in interpretations. Poor prediction performance due to over-fitting in some situations is also a drawback of GAMMs. These models are useful mainly when parametric GLMs and GLMMs provide an inadequate fit for the data.