Abstract
The genetic basis of complex traits involves the function of many genes with small effects as well as complex gene-gene and gene-environment interactions. As one of the major players in complex diseases, the role of gene-environment interactions has been increasingly recognized. Motivated by epidemiology studies to evaluate the joint effect of environmental mixtures, we developed a functional varying-index coefficient model (FVICM) to assess the combined effect of environmental mixtures and their interactions with genes, under a longitudinal design with quantitative traits. Built upon the previous work, we extend the FVICM model to accommodate binary longitudinal traits through the development of a generalized functional varying-index coefficient model (gFVICM). This model examines how the genetic effects on a disease trait are nonlinearly influenced by a combination of environmental factors. We derive an estimation procedure for the varying-index coefficient functions using quadratic inference functions combined with penalized splines. A hypothesis testing procedure is proposed to evaluate the significance of the nonparametric index functions. Extensive Monte Carlo simulations are conducted to evaluate the performance of the method under finite samples. The utility of the method is further demonstrated through a case study with a pain sensitivity dataset. SNPs were found to have their effects on blood pressure nonlinearly influenced by a combination of environmental factors.
Citation: Zhang J, Wang H, Cui Y (2025) Generalized functional varying-index coefficient model for dynamic synergistic gene-environment interactions with binary longitudinal traits. PLoS ONE 20(1): e0318103. https://doi.org/10.1371/journal.pone.0318103
Editor: Mahdi Roozbeh, Semnan University, IRAN, ISLAMIC REPUBLIC OF
Received: June 12, 2024; Accepted: January 9, 2025; Published: January 27, 2025
Copyright: © 2025 Zhang et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data for this study are within the paper, its Supporting information files, and publicly available from the Github repository (https://github.com/Honglang/gFVICM).
Funding: This work was supported in part by a grant (R21HG010073) from the National Institutes of Health (to Y. Cui), a grant (24TPA1288424) from the American Heart Association (to Y. Cui), and a grant (DMS-2212928) from the National Science Foundation (to H. Wang). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
1 Introduction
Longitudinal data analysis is very common in epidemiological studies when the response variables are measured over time on subjects. Numerous studies have demonstrated that longitudinal designs offer greater power in detecting genetic associations compared to cross-sectional designs [1–3]. On the other hand, there has been growing interest in understanding the role of interactions of genes with the environment (G × E) in human diseases, such as type 2 diabetes (e.g., Zimmet et al. [4]) and Parkinson’s disease (e.g., Ross and Smith [5]). In many studies, G × E interactions have traditionally been investigated using a single environment exposure model. Readers are referred to Miao et al. [6] for a comprehensive review of G × E studies. However, increasing evidence suggests that the risk of disease can be significantly influenced by simultaneous exposures to multiple environmental factors. Notably, the combined effect of these exposures often exceeds the sum of their individual effects [7, 8]. This has led to a growing interest in assessing the collective impact of environmental mixtures and exploring the mechanisms through which they interact with genes to influence disease risk. Some research has been done to assess nonlinear interactions between environmental mixtures and genes by applying some nonparametric or semiparametric models, such as the varying index coefficients model (VICM) proposed by [9], the generalized varying index coefficient models by Guo et al. [10], and the partial linear multi-varying index coefficients model (PLMVICM) by Liu et al. [11] and the generalized PLMVICM by Liu et al. [12]. However, these methods were developed for cross-sectional data. The extension of these modeling strategies to longitudinal data deserves further investigation.
Previously, we introduced a functional varying index coefficient model (FVICM) to assess nonlinear G × E interactions for a continuous longitudinal trait [13], such as blood pressure or heart rate measured over time. In practice, however, the response measured over time may be discrete, for example, a binary measure of disease status. In human genetics, many disease traits are binary in nature, with subjects classified as affected or unaffected (cases vs. controls). To investigate the nonlinear dynamic G × E interaction with environmental mixtures as a whole for a binary longitudinal trait, we propose the following generalized functional varying index coefficient model (gFVICM),
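$$ g\{E(Y_{ij} \mid X_{ij}, G_i)\} = m_0(\beta_0^{T} X_{ij}) + m_1(\beta_1^{T} X_{ij})\, G_i, \tag{1} $$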
where Yij is the trait of interest observed for the ith subject at the jth time point (i = 1, ⋯, N; j = 1, ⋯, ni); Xij is a p-dimensional vector of environmental variables, which can be time-variant or time-invariant; Gi denotes a genetic variable which does not depend on time; g(⋅) is a known link function, which can be the identity link for continuous traits and the logit link for binary traits; m0(⋅) and m1(⋅) are unknown nonparametric smooth functions to be estimated from the data; and β0 and β1 are p-dimensional vectors of index loading parameters. In this work, we consider a binary longitudinal response variable which takes the value of 1 or 0 at each time point. In model (1), the function $m_1(\beta_1^{T} X_{ij})$ captures the interaction effect between the environmental mixture and the genetic variable (e.g., a single nucleotide polymorphism (SNP)) on the risk of disease.
In this study, we formulate a statistical estimation and hypothesis testing procedure tailored for model (1), specifically for binary longitudinal traits. The generalized estimating equation (GEE) method, proposed by Liang and Zeger [14], has been widely used in longitudinal data analysis. However, the GEE method has several disadvantages that stem from some of its critical assumptions [15]. One disadvantage is that the consistency of GEE estimators is based on the consistency of the estimators for the nuisance correlation parameter [16, 17]. Another shortcoming of the GEE method is that model selection and hypothesis testing are complicated, because its estimation procedure does not involve an objective function. The quadratic inference function (QIF) approach proposed by Qu et al. [18] is one of the improvements of the GEE method. The QIF method avoids the need to estimate nuisance correlation parameters and has been demonstrated by Qu et al. [18] to be generally more efficient than the GEE method. Furthermore, because the QIF is based on an objective function that is asymptotically chi-square distributed, it naturally accommodates model selection criteria such as BIC. The same asymptotic property also allows us to conduct hypothesis tests. This motivates us to extend the QIF method to our model for estimation and hypothesis testing.
Penalized estimation in longitudinal modeling has been extensively studied; examples include the penalized GEE [19], penalized GEE with a semiparametric generalized mixed-effects partial linear model [20], penalized QIF for varying-coefficient partially linear models [21], and variable selection for nonparametric varying-coefficient models [22]. For methods involving varying-coefficient models, the coefficient functions typically depend on only one explanatory variable, which distinguishes them from the one proposed in model (1). In our proposed estimation procedure, we first use penalized splines [23] to approximate the nonparametric smooth functions ml(⋅), l = 0, 1. Then, we develop a profile estimation procedure in which the index loading parameters and spline coefficients are estimated iteratively under the QIF framework. In order to avoid overfitting and reduce the number of parameters in the spline approximation, we add a penalty to the objective function based on a BIC criterion. We establish the asymptotic normality of the resulting estimators under certain regularity conditions. In addition, we are interested in testing the linearity of the G × E interaction, i.e., the linearity of the function m1(⋅). The QIF can be regarded as an inference function which has properties similar to the likelihood ratio test. Based on that, we construct a testing procedure to assess the linearity of the coefficient (interaction) function, where the test statistic asymptotically follows a χ2 distribution.
The performance of the proposed procedure under finite samples is evaluated by Monte Carlo simulations. The application of the proposed method is demonstrated through the analysis of a pain sensitivity dataset with a binary response variable indicating whether a subject has hypertension or not (Yes = 1, No = 0). Theoretical results and proofs are provided in S1 File. Our method offers a novel approach to longitudinal G × E studies with binary traits, in which the focus is on evaluating the joint interaction between a genetic variant and a mixture of environmental exposures in affecting disease risk.
2 The model and estimation methods
2.1 The model
For a binary longitudinal disease trait, suppose the response yij, the p-dimensional covariate vector xij, and the SNP variable Gi are observed for the ith individual at the jth time point, where i = 1, ⋯, N; j = 1, ⋯, ni. In general, the number of potential environmental variables (p) that may interact with G to affect Y is not large. Any other covariates that do not interact with G can be modelled separately as a linear term outside of the m(⋅) function. Assume that observations from different subjects are independent, but those within the same subject are correlated. We also assume the model satisfies the first moment assumption, i.e.,
$$ E(y_{ij} \mid x_{ij}, G_i) = \mu_{ij} = g^{-1}\{m_0(\beta_0^{T} x_{ij}) + m_1(\beta_1^{T} x_{ij})\, G_i\}, $$
where g−1(⋅) is a given inverse link function. For binary responses, we use a logit link function and the model can be written as
$$ \operatorname{logit}(\mu_{ij}) = \log\frac{\mu_{ij}}{1 - \mu_{ij}} = m_0(\beta_0^{T} x_{ij}) + m_1(\beta_1^{T} x_{ij})\, G_i. \tag{2} $$
For the identifiability purpose, the constraints ‖β0‖ = ‖β1‖ = 1 are imposed, where the first elements of β0 and β1 are set to be positive.
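As a minimal sketch of how the mean in model (2) could be evaluated, the snippet below assumes the logit form logit(μij) = m0(β0ᵀxij) + m1(β1ᵀxij)Gi and enforces the identifiability constraint by rescaling each loading vector to unit norm with a positive first element. The function names and the example choices of m0 and m1 are illustrative assumptions and are not taken from the authors' released code.

```python
import math

def normalize_loading(beta):
    """Rescale beta to unit Euclidean norm; flip the sign if the first entry is negative."""
    norm = math.sqrt(sum(b * b for b in beta))
    if norm == 0.0:
        raise ValueError("loading vector must be nonzero")
    sign = -1.0 if beta[0] < 0 else 1.0
    return [sign * b / norm for b in beta]

def gfvicm_mean(x, g, beta0, beta1, m0, m1):
    """P(y = 1 | x, g) under the logit link: expit(m0(u0) + m1(u1) * g)."""
    u0 = sum(b * xi for b, xi in zip(beta0, x))   # index for the marginal effect
    u1 = sum(b * xi for b, xi in zip(beta1, x))   # index for the interaction effect
    eta = m0(u0) + m1(u1) * g
    return 1.0 / (1.0 + math.exp(-eta))

# Illustrative (hypothetical) loadings and coefficient functions.
beta0 = normalize_loading([1.0, 1.0, 1.0])
beta1 = normalize_loading([-2.0, 1.0, 2.0])   # sign is flipped so the first entry is positive
mu = gfvicm_mean([0.2, 0.5, 0.8], 1, beta0, beta1,
                 m0=lambda u: math.cos(math.pi * u),
                 m1=lambda u: 0.5 * u)
```

Without the norm and sign constraints, rescaling β together with a compensating change in m(⋅) would give the same fitted mean, which is why the constraint is needed before the loadings can be interpreted.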
2.2 Quadratic inference function for gFVICM
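Denote $u_{0ij} = \beta_0^{T} x_{ij}$ and $u_{1ij} = \beta_1^{T} x_{ij}$. First, the unknown coefficient functions m0(u0) and m1(u1) are approximated by a truncated power spline basis as
$$ m_l(u_l) \approx B(u_l)^{T} \gamma_l, \quad l = 0, 1, \tag{3} $$
where $B(u) = (1, u, \ldots, u^{q}, (u - \kappa_1)_+^{q}, \ldots, (u - \kappa_K)_+^{q})^{T}$ is a q-degree truncated power spline basis with K knots κ1, …, κK, $(u - \kappa)_+^{q} = \{\max(u - \kappa, 0)\}^{q}$, and γ0 and γ1 are (q + K + 1)-dimensional vectors of spline coefficients.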
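The truncated power basis can be sketched in a few lines. The snippet assumes the standard form B(u) = (1, u, …, u^q, (u − κ1)₊^q, …, (u − κK)₊^q), which has the q + K + 1 entries stated above; the function names and the example knots are hypothetical.

```python
def truncated_power_basis(u, knots, degree):
    """Return [1, u, ..., u^q, (u - k1)_+^q, ..., (u - kK)_+^q] as a list."""
    if degree < 1:
        raise ValueError("degree must be at least 1")
    poly = [u ** d for d in range(degree + 1)]            # global polynomial part
    trunc = [max(u - k, 0.0) ** degree for k in knots]    # one truncated term per knot
    return poly + trunc

def spline_value(u, knots, degree, gamma):
    """Evaluate the spline approximation B(u)' gamma."""
    basis = truncated_power_basis(u, knots, degree)
    if len(gamma) != len(basis):
        raise ValueError("gamma must have degree + len(knots) + 1 entries")
    return sum(b * c for b, c in zip(basis, gamma))

knots = [0.25, 0.5, 0.75]                 # K = 3 illustrative knots
basis = truncated_power_basis(0.6, knots, degree=2)
```

Each truncated term is zero to the left of its knot, so the coefficients attached to the knots only modify the curve locally; these are the coefficients that the penalty later shrinks.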
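A marginal approach such as the GEE assumes that the marginal mean μij is a function of the covariates through a link function, and that the variance of yij is a function of the mean, var(yij) = V(μij). The generalized estimating equation for longitudinal data is given as
$$ \sum_{i=1}^{N} \dot{\mu}_i^{T} V_i^{-1} (y_i - \mu_i) = 0, $$
where $y_i = (y_{i1}, \ldots, y_{in_i})^{T}$, μi = E(yi) is the mean function, and $\dot{\mu}_i = \partial \mu_i / \partial \theta^{T}$ is the first derivative of μi with respect to parameters θ = (βT, γT)T, with $\beta = (\beta_0^{T}, \beta_1^{T})^{T}$ and $\gamma = (\gamma_0^{T}, \gamma_1^{T})^{T}$. The covariance matrix Vi can be decomposed as $V_i = A_i^{1/2} R(\rho) A_i^{1/2}$, where Ai is a diagonal matrix containing the marginal variances, and R(ρ) is a common working correlation matrix parameterized by a small number of nuisance parameters ρ. Utilizing the spline approximation described in Eq (3), with B(⋅) the truncated power spline basis, the mean function can be expressed as
$$ \mu_{ij} = g^{-1}\{B(\beta_0^{T} x_{ij})^{T} \gamma_0 + B(\beta_1^{T} x_{ij})^{T} \gamma_1\, G_i\}, $$
and, by the chain rule, under the logit link the jth row of the first derivative of μi is
$$ \frac{\partial \mu_{ij}}{\partial \theta^{T}} = \mu_{ij}(1 - \mu_{ij}) \left( \dot{B}(u_{0ij})^{T} \gamma_0\, x_{ij}^{T},\; G_i\, \dot{B}(u_{1ij})^{T} \gamma_1\, x_{ij}^{T},\; B(u_{0ij})^{T},\; G_i\, B(u_{1ij})^{T} \right), $$
where $\dot{B}(u) = \partial B(u) / \partial u$ and $u_{lij} = \beta_l^{T} x_{ij}$, l = 0, 1.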
In the QIF method, the inverse of the working correlation matrix can be approximated by a linear combination of several basis matrices [18]. This can be represented as:
$$ R^{-1} \approx \sum_{k=1}^{h} a_k M_k, $$
where M1 is the identity matrix, and M2, …, Mh are predefined basis matrices. For instance:
- If the working correlation is exchangeable, R−1 ≈ a1M1 + a2M2, where M2 has zeros on the diagonal and ones on the off-diagonal.
- If the working correlation follows an AR(1) structure, R−1 ≈ a1M1 + a2M2, with M2 having ones on its two subdiagonals and zeros elsewhere.
The advantage of this approach is that it simplifies the estimation process by eliminating the need to directly estimate the nuisance parameters a1, …, ah.
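The basis matrices described above are simple to construct. The sketch below builds M1 and the two versions of M2, and, for the exchangeable case only, includes the closed-form coefficients a1 and a2 that make a1M1 + a2M2 the exact inverse of R = (1 − ρ)I + ρJ. These coefficients are shown purely as a check, since the QIF never needs them; all function names are hypothetical.

```python
def identity_basis(n):
    """M1: the n x n identity matrix."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def exchangeable_basis(n):
    """M2 for an exchangeable structure: 0 on the diagonal, 1 elsewhere."""
    return [[0.0 if i == j else 1.0 for j in range(n)] for i in range(n)]

def ar1_basis(n):
    """M2 for an AR(1) structure: 1 on the two subdiagonals, 0 elsewhere."""
    return [[1.0 if abs(i - j) == 1 else 0.0 for j in range(n)] for i in range(n)]

def exchangeable_inverse_coeffs(n, rho):
    """(a1, a2) such that a1*M1 + a2*M2 equals the inverse of (1-rho)*I + rho*J."""
    denom = (1.0 - rho) * (1.0 + (n - 1) * rho)
    return (1.0 + (n - 2) * rho) / denom, -rho / denom

n, rho = 4, 0.5
a1, a2 = exchangeable_inverse_coeffs(n, rho)
M1, M2 = identity_basis(n), exchangeable_basis(n)
R_inv = [[a1 * M1[i][j] + a2 * M2[i][j] for j in range(n)] for i in range(n)]
```

For AR(1), the two-matrix combination is only an approximation of the inverse, which is consistent with the "≈" used in the text.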
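Building on this concept, we can formulate the extended estimating function as follows:
$$ \bar{g}_N(\theta) = \frac{1}{N} \sum_{i=1}^{N} g_i(\theta) = \frac{1}{N} \sum_{i=1}^{N} \begin{pmatrix} \dot{\mu}_i^{T} A_i^{-1/2} M_1 A_i^{-1/2} (y_i - \mu_i) \\ \vdots \\ \dot{\mu}_i^{T} A_i^{-1/2} M_h A_i^{-1/2} (y_i - \mu_i) \end{pmatrix}. \tag{4} $$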
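Since the number of equations in the extended estimating function $\bar{g}_N$ of Eq (4), 2ph + 2(q + K + 1)h, exceeds the number of unknown parameters, 2p + 2(q + K + 1), we cannot solve for the estimators by merely setting each element to zero. To address this, we estimate the parameters by minimizing the following quadratic inference function:
$$ Q_N(\theta) = N\, \bar{g}_N(\theta)^{T} C_N(\theta)^{-1} \bar{g}_N(\theta), $$
where $C_N(\theta) = N^{-1} \sum_{i=1}^{N} g_i(\theta) g_i(\theta)^{T}$ serves as a consistent estimator for var(gi). By minimizing the quadratic inference function, we can derive the parameter estimates as follows:
$$ \hat{\theta} = \arg\min_{\theta} Q_N(\theta). $$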
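To make the construction concrete, here is a minimal NumPy sketch of the QIF objective for binary responses. It assumes the standard form QN = N ḡᵀ C⁻¹ ḡ with extended scores stacked over the basis matrices. For brevity it uses a plain logistic mean with a linear predictor rather than the spline-based gFVICM mean, simulates independent responses, and uses a pseudo-inverse to guard against a near-singular C. All of these are simplifying assumptions for illustration, and the function names are hypothetical.

```python
import numpy as np

def qif_objective(theta, subjects, mean_fn, basis_mats):
    """Q_N(theta) = N * gbar' C^{-1} gbar for binary responses.

    subjects: list of (y_i, Z_i); mean_fn(theta, Z_i) returns (mu_i, dmu_i),
    where dmu_i is the n_i x k Jacobian of mu_i with respect to theta.
    """
    scores = []
    for y, Z in subjects:
        mu, dmu = mean_fn(theta, Z)
        a_inv_sqrt = 1.0 / np.sqrt(mu * (1.0 - mu))     # diagonal of A_i^{-1/2}
        resid = a_inv_sqrt * (y - mu)                    # A_i^{-1/2} (y_i - mu_i)
        weighted = dmu * a_inv_sqrt[:, None]             # A_i^{-1/2} dmu_i
        scores.append(np.concatenate([weighted.T @ M @ resid for M in basis_mats]))
    S = np.asarray(scores)                               # N x (h*k) extended scores
    n_subj = S.shape[0]
    gbar = S.mean(axis=0)
    C = S.T @ S / n_subj                                 # sample second moment of g_i
    return float(n_subj * gbar @ np.linalg.pinv(C) @ gbar)

def logistic_mean(theta, Z):
    """Illustrative mean: mu = expit(Z theta), with its Jacobian."""
    mu = 1.0 / (1.0 + np.exp(-(Z @ theta)))
    return mu, (mu * (1.0 - mu))[:, None] * Z

rng = np.random.default_rng(0)
n_i = 4
idx = np.arange(n_i)
M1 = np.eye(n_i)
M2 = (np.abs(idx[:, None] - idx[None, :]) == 1).astype(float)   # AR(1)-type basis
theta_true = np.array([-0.5, 1.0])
subjects = []
for _ in range(60):
    Z = np.column_stack([np.ones(n_i), rng.uniform(size=n_i)])
    mu_i = logistic_mean(theta_true, Z)[0]
    subjects.append((rng.binomial(1, mu_i).astype(float), Z))
q_value = qif_objective(theta_true, subjects, logistic_mean, [M1, M2])
```

In practice the objective would be passed to a numerical optimizer over θ; the sketch only evaluates it at a single parameter value.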
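In order to avoid over-parameterization, we add a penalty term to the QIF to penalize the number of knots [24]. The penalized QIF is written as
$$ N^{-1} Q_N(\theta) + \lambda\, \theta^{T} D\, \theta, \tag{5} $$
where D is a diagonal matrix with 1 for parameters corresponding to spline coefficients associated with knots and 0 otherwise. Specifically, $D = \mathrm{diag}(0_{2p}^{T}, 0_{q+1}^{T}, 1_{K}^{T}, 0_{q+1}^{T}, 1_{K}^{T})$. Thus, the estimator is given by:
$$ \hat{\theta} = \arg\min_{\theta} \left\{ N^{-1} Q_N(\theta) + \lambda\, \theta^{T} D\, \theta \right\}. \tag{6} $$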
To determine the tuning parameter λ, we borrow the generalized cross-validation idea [24–26]. The generalized cross-validation statistic is defined as
$$ \mathrm{GCV}(\lambda) = \frac{N^{-1} Q_N}{(1 - N^{-1}\, \mathrm{df})^{2}}, $$
where the effective degree of freedom is given by $\mathrm{df} = \mathrm{tr}\{(\ddot{Q}_N + 2N\lambda D)^{-1} \ddot{Q}_N\}$, and $\ddot{Q}_N$ represents the second derivative of QN. The optimal tuning parameter λ is the one that minimizes GCV(λ). In practice, the desired value of λ can be found by a grid search over a predefined set of values. We also established the asymptotic properties of the estimators of the index loading parameters and the penalized spline regression coefficients; these are given in the online S1 File together with the proofs.
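A small sketch of the penalty bookkeeping follows. It assumes the parameter ordering θ = (β0, β1, γ0, γ1), a penalized objective of the form N⁻¹QN(θ) + λθᵀDθ, and a GCV statistic of the form N⁻¹QN / (1 − df/N)². These forms follow common penalized-QIF practice and are stated here as assumptions, and the helper names are hypothetical.

```python
def penalty_diagonal(p, degree, n_knots):
    """Diagonal of D for theta = (beta0, beta1, gamma0, gamma1).

    Only the spline coefficients attached to knots receive a 1.
    """
    per_spline = [0.0] * (degree + 1) + [1.0] * n_knots
    return [0.0] * (2 * p) + per_spline + per_spline

def penalized_qif(q_value, theta, d_diag, lam, n_subjects):
    """N^{-1} Q_N(theta) + lam * theta' D theta, with D diagonal."""
    if len(theta) != len(d_diag):
        raise ValueError("theta and the penalty diagonal must have equal length")
    penalty = sum(d * t * t for d, t in zip(d_diag, theta))
    return q_value / n_subjects + lam * penalty

def gcv(q_value, eff_df, n_subjects):
    """GCV statistic given Q_N and the effective degrees of freedom."""
    return (q_value / n_subjects) / (1.0 - eff_df / n_subjects) ** 2

d_diag = penalty_diagonal(p=3, degree=2, n_knots=3)
objective = penalized_qif(10.0, [1.0] * len(d_diag), d_diag, lam=0.5, n_subjects=100)
score = gcv(10.0, eff_df=5.0, n_subjects=100)
```

A grid search would refit the model for each candidate λ, recompute QN and the effective degrees of freedom, and keep the λ with the smallest GCV value.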
3 Model selection and hypothesis test
3.1 Model selection
Model selection is crucial in spline approximation, as including too many parameters can lead to overfitting. According to the theoretical property of the generalized method of moments estimator [27], under the assumption that E(g1) = 0 and the number of estimating equations exceeds the number of parameters, we have $Q_N(\hat{\theta}) \to \chi^2_{r-k}$ in distribution. Here, r is the dimension of $\bar{g}_N(\theta)$, k is the dimension of θ, and $\hat{\theta}$ is the estimator obtained by minimizing the QIF given a specific order and number of knots. This asymptotic property of the QIF allows for a goodness-of-fit test, which is useful in determining the appropriate order and number of knots for our model. However, multiple models that are not nested may pass the goodness-of-fit tests. Given that $Q_N(\hat{\theta})$ is asymptotically chi-square distributed, it is natural to extend the BIC to the QIF approach by replacing twice the negative log-likelihood function with the QIF objective function [28]. Specifically, the BIC criterion for a model with r estimating equations and k parameters is expressed as follows:
$$ \mathrm{BIC} = Q_N(\hat{\theta}) + (r - k) \log N. $$
The model with the lowest BIC is deemed to be the optimal choice. If we select h basis matrices in (4), then r − k = hk − k = (h − 1)k.
In our simulation and real data application, we determine the number of knots K and the optimal order q by exploring various combinations and selecting the one that minimizes the BIC criterion. The knots are evenly spaced across the range of the single index u = βTX.
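The selection step can be sketched as follows, assuming the BIC-type criterion Q + (r − k) log N with r = hk. The fitted QIF values in `candidates` are made-up placeholders for illustration only and are not results from the paper; the function names are hypothetical.

```python
import math

def n_parameters(p, degree, n_knots):
    """k = 2p loading parameters plus 2(q + K + 1) spline coefficients."""
    return 2 * p + 2 * (degree + n_knots + 1)

def qif_bic(q_value, n_equations, n_params, n_subjects):
    """BIC-type criterion: Q_N(theta_hat) + (r - k) * log(N)."""
    return q_value + (n_equations - n_params) * math.log(n_subjects)

def select_spline(candidates, p, h, n_subjects):
    """Pick the (degree, n_knots) pair with the smallest BIC."""
    def bic(item):
        (degree, n_knots), q_value = item
        k = n_parameters(p, degree, n_knots)
        return qif_bic(q_value, h * k, k, n_subjects)   # r = h * k equations
    return min(candidates.items(), key=bic)[0]

# Placeholder minimized QIF values, keyed by (degree q, number of knots K).
candidates = {(1, 2): 60.0, (2, 3): 20.0, (3, 5): 19.0}
best = select_spline(candidates, p=3, h=2, n_subjects=200)
```

With these placeholder values, the richest model lowers Q only slightly and is outweighed by its larger (r − k) log N term, so the intermediate model is chosen.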
3.2 Nonparametric goodness-of-fit test based on QIF
The QIF can also be regarded as an inference function since it has properties similar to the likelihood ratio test. Suppose that the d-dimensional parameter vector γ is partitioned into (ψ, ζ), where ψ is the parameter of interest with dimension d1, and ζ is the nuisance parameter with dimension d2 = d − d1. If we are interested in testing $H_0: \psi = \psi_0$, a theorem of Qu et al. [18] provides a way to conduct hypothesis testing in the QIF framework. The theorem states that, given that all required regularity conditions are satisfied and ψ has dimension d1, under the null hypothesis the test statistic
$$ Q_N(\psi_0, \tilde{\zeta}) - Q_N(\hat{\psi}, \hat{\zeta}) $$
is asymptotically chi-square distributed with d1 degrees of freedom, where
$$ \tilde{\zeta} = \arg\min_{\zeta} Q_N(\psi_0, \zeta), \qquad (\hat{\psi}, \hat{\zeta}) = \arg\min_{(\psi, \zeta)} Q_N(\psi, \zeta). \tag{7} $$
When there is no nuisance parameter, which is a special case of the condition in the theorem, $Q_N(\psi_0) - Q_N(\hat{\psi})$ has an asymptotic chi-square distribution with d degrees of freedom under the null hypothesis.
3.3 Test for linearity of the interaction function in gFVICM
In our proposed gFVICM model as outlined in Eq (1), a key focus is to test the form of the unspecified coefficient function. Specifically, we are interested in determining whether a linear function is sufficient to describe the G × E interaction. If we fail to reject the hypothesis that the coefficient function is linear, we should fit a parametric linear interaction model to further evaluate the presence of a linear G × E interaction. Conversely, if the hypothesis is rejected, it suggests the presence of a nonlinear G × E interaction. It is important to note that we cannot directly test for the zero effect of the function m1(⋅) because, under the null hypothesis m1(⋅) = 0, the index loading parameters become unidentifiable unless we impose the condition β0 = β1 = β, which is practically too restrictive. Let $\gamma_1 = (\gamma_{10}, \gamma_{11}, \ldots, \gamma_{1(q+K)})^{T}$. Using the truncated power spline basis, the coefficient function can be approximated as:
$$ m_1(u_1) \approx \gamma_{10} + \gamma_{11} u_1 + \cdots + \gamma_{1q} u_1^{q} + \sum_{k=1}^{K} \gamma_{1(q+k)} (u_1 - \kappa_k)_+^{q}. $$
Our objective is to test the linearity of m1(u1), which is equivalent to testing
$$ H_0: \gamma_{12} = \gamma_{13} = \cdots = \gamma_{1(q+K)} = 0. $$
Let $\tilde{\theta}$ be the estimator of the full parameter θ = (βT, γT)T under the null hypothesis with $\gamma_{12} = \cdots = \gamma_{1(q+K)} = 0$, and $\hat{\theta}$ the estimator of θ under the alternative. Then, following the theorem by Qu et al. [18], the test statistic
$$ T = Q_N(\tilde{\theta}) - Q_N(\hat{\theta}) $$
asymptotically follows a chi-square distribution with K + q − 1 degrees of freedom.
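Given the minimized QIF values under the null and the alternative, the linearity test reduces to a chi-square tail probability with K + q − 1 degrees of freedom. The sketch below uses only the standard library. The chi-square tail is computed with the recurrence sf(ν + 2) = sf(ν) + (x/2)^(ν/2) e^(−x/2) / Γ(ν/2 + 1), and the two QIF values in the example are hypothetical inputs, not results from the paper.

```python
import math

def chi2_sf(x, df):
    """Upper tail P(X > x) of a chi-square variable with integer df >= 1."""
    if df < 1 or int(df) != df:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        nu, tail = 2, math.exp(-half)                 # closed form for df = 2
    else:
        nu, tail = 1, math.erfc(math.sqrt(half))      # closed form for df = 1
    while nu < df:                                    # step up two df at a time
        tail += half ** (nu / 2.0) * math.exp(-half) / math.gamma(nu / 2.0 + 1.0)
        nu += 2
    return tail

def linearity_test(q_null, q_alt, n_knots, degree):
    """Return (statistic, df, p-value) for H0: m1 is linear."""
    stat = q_null - q_alt
    df = n_knots + degree - 1
    return stat, df, chi2_sf(stat, df)

# Hypothetical minimized QIF values under H0 and H1 with K = 3, q = 2.
stat, df, p_value = linearity_test(q_null=30.0, q_alt=20.0, n_knots=3, degree=2)
```

The null fit is obtained by fixing the quadratic and knot coefficients of m1 at zero, so the null objective can never be smaller than the alternative objective when both minimizations converge.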
4 Simulation study
The performance of the proposed method in finite samples was assessed through Monte Carlo simulation studies. Specifically, we examined the following logistic regression model:
where
We simulated a three-dimensional set of environmental variables X = (X1, X2, X3). For each subject i, the variables X1ij, X2ij, X3ij were independently drawn from a uniform distribution U(0, 1). We set the minor allele frequency (MAF) to pA = 0.1, 0.3, 0.5, assuming Hardy-Weinberg equilibrium. The SNP genotypes AA, Aa, and aa were simulated based on a multinomial distribution with probabilities $p_A^2$, 2pA(1 − pA), and $(1 - p_A)^2$, respectively. The genotype variable G was encoded as {0, 1, 2}, corresponding to the genotypes {aa, Aa, AA}, respectively. To generate correlated responses, we used the R package bindata developed by Leisch et al. [29] under an AR(1) correlation structure with correlation parameter ρ = 0.5. When using the function ‘rmvbin’ to generate correlated binary data, one must specify the marginal probabilities and the correlation structure.
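The covariate and genotype part of this design can be sketched with the Python standard library. The snippet draws genotypes under Hardy-Weinberg equilibrium and uniform environmental covariates. It does not generate the correlated binary responses, which in the paper are produced with the R package bindata. Function names and the seed are arbitrary illustrative choices.

```python
import random

def simulate_genotypes(n_subjects, maf, rng):
    """Draw G in {0, 1, 2} (copies of allele A) under Hardy-Weinberg equilibrium."""
    probs = [(1.0 - maf) ** 2, 2.0 * maf * (1.0 - maf), maf ** 2]   # aa, Aa, AA
    return rng.choices([0, 1, 2], weights=probs, k=n_subjects)

def simulate_environment(n_subjects, n_times, p, rng):
    """X[i][j] is a length-p list of independent U(0, 1) draws."""
    return [[[rng.random() for _ in range(p)] for _ in range(n_times)]
            for _ in range(n_subjects)]

rng = random.Random(2025)
genotypes = simulate_genotypes(20000, maf=0.3, rng=rng)
X = simulate_environment(n_subjects=5, n_times=10, p=3, rng=rng)
allele_freq = sum(genotypes) / (2.0 * len(genotypes))   # estimate of p_A
```

The sample allele frequency gives a quick check that the genotype coding matches the intended MAF.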
We set m0(u0) = cos(πu0) and m1(u1) = sin[π(u1 − A)/(B − A)] with and
. The true parameters were
and
. We generated 500 data sets, each with a sample size of N = 200 or 500 and ni = T = 10 or 20 time points per subject. For simplicity, we assumed all subjects were measured at the same number of time points, though this assumption is not required. The basis matrix M2 was set to have 1 on its two subdiagonals and 0 elsewhere. The number of knots and the order of the splines were determined based on the BIC criterion.
4.1 Performance of estimation
Tables 1 and 2 present the parameter estimation results for different sample sizes and measurement times, respectively. These tables report the average bias (Bias), the average of the estimated standard error (SE) derived from the theoretical results, the standard deviation of the 500 estimates (SD), and the estimated coverage probability (CP) at the 95% confidence level. The tables show that as the sample size increases, the performance of the estimation improves, evidenced by reduced bias, SD, and SE. Additionally, increasing the number of repeated measurements for each subject also enhances estimation accuracy, as illustrated by the comparison between Tables 1 and 2. For instance, the CP for β01 improves from 86.8% to 90% when the number of measurement times increases from 10 to 20, given a sample size of 200. Moreover, the estimation of the loading parameter β1 becomes more accurate as MAF pA increases. Conversely, the estimation of β0 tends to deteriorate with high pA. This is likely because we have less data information for accurately estimating the marginal effects m0(⋅) when pA is large.
Figs 1–4 illustrate the estimated functions m0(u0) and m1(u1) under varying sample sizes and time points. In these plots, the solid lines represent the estimated functions, while the dashed lines denote the true functions. The 95% confidence bands are indicated by the dotted-dash lines. The plots show that the estimated curves almost perfectly align with the true curves, demonstrating the high accuracy of the estimation method. Additionally, the confidence bands are particularly tight for larger sample sizes and a greater number of measurement times, indicating robust estimation. It is noteworthy that the estimation of the interaction effects m1(u1) improves with an increase in pA. Conversely, the estimation of the marginal effects m0(u0) becomes less accurate as pA increases. This observation aligns with the parameter estimation results presented in Tables 1 and 2.
Figs 1–4. The solid lines represent the estimated functions, while the dashed lines denote the true functions. The 95% confidence bands are illustrated by the dotted-dash lines.
4.2 Performance of hypothesis tests
We assessed the performance of the test for the nonparametric function under the null hypothesis H0: m1(u1) = δ0 + δ1u1, where δ0 and δ1 are constants, representing a linear G × E interaction. To evaluate the test’s power, we considered a sequence of alternative models indexed by τ, where τ controls the departure from linearity. When τ = 0, the test evaluates the false positive rate.
Fig 5 illustrates the empirical size (for τ = 0) and power (for τ > 0) of the test at the 5% significance level, based on 500 Monte Carlo simulations. The analysis was conducted for sample sizes N = 200 and 500, under different measurement times: T = 10 (left panel) and T = 20 (right panel), with MAF fixed at 0.3. When the sample size is N = 200, the empirical Type I error rate is relatively high. However, this rate decreases significantly as the sample size increases to N = 500. Additionally, the power of the test improves markedly with an increase in sample size from 200 to 500. These findings suggest that our method effectively controls false positive rates and exhibits adequate power to detect deviations from the linear function, particularly under larger sample sizes. Moreover, a comparison between the results for T = 10 and T = 20 shows that the testing power increases with more frequent measurements, indicating better performance in detecting nonlinearity with higher measurement frequency.
To assess how the values of MAF affect the testing performance, we plotted the power under different MAFs pA = 0.1, 0.3, 0.5 when N = 500, T = 10, which is shown in Fig 6. The power of the test increases significantly when MAF rises from 0.1 to 0.3. However, the power values are quite similar when pA = 0.3 and 0.5.
5 Real data application
We applied the proposed gFVICM model to a dataset from a study investigating the association between the A118G SNPs of the OPRM1 gene and sensitivity to experimental pain. A sample of 163 healthy volunteers was recruited for this study. For each volunteer, systolic blood pressure (SBP) and diastolic blood pressure (DBP) were measured at 6 Dobutamine dosage levels: 0 (baseline), 5, 10, 20, 30, and 40 mcg/min. Missing values were present in genotypes, covariates, and disease responses, with all missing rates falling below 10%. Given the small sample size of the study, we opted to impute the missing values before conducting the analysis rather than exclude the affected observations. Clinically, a person is said to be hypertensive if the individual’s SBP is greater than 140 mm Hg or DBP is greater than 90 mm Hg [30]. Thus, the response variable Y is a binary variable indicating whether a person has hypertension or not, i.e., Y = 1 for hypertension and Y = 0 for non-hypertension. The model with the mean given in (2) is applied to the data.
One longitudinal covariate X1 = dosage level, two time-invariant covariates X2 = age and X3 = BMI were included as the environmental factors in the model. The genetic variables were five SNPs located at codon 16, 27, 49, 389, and 492 in the gene. Our goal was to assess how a combination of age, BMI and dosage level modifies the effect of the SNP on the risk of hypertension. Specifically, we tested the hypothesis H0: m1(u1) = δ0 + δ1u1 with the corresponding p-value denoted as . P-values for testing the significance of the three index loading coefficients β1 = (β11, β12, β13) were also reported and labeled as
,
, and
, respectively, following the asymptotic property of the estimators. Additionally, our proposed model was compared with a generalized additive varying-coefficient model (gAVCM) formulated as
, where
and
are unknown functions of X1. To evaluate the relative benefits of our integrative analysis, we calculated the objective function QN for both models. The p-values for testing
in the gAVCM are also provided in the tables and are denoted by pgAVCM.
In Table 3, all five SNPs have p-values for the linearity test smaller than the significance level of 0.05, which means the functions capturing the G × E interactions are nonlinear for all five SNPs. The objective function QN shows that gFVICM provides a better fit to the data than gAVCM does. This demonstrates the advantage of the integrative analysis. Furthermore, the testing results for gAVCM indicate that the coefficients for interactions are not significant. The results imply that these SNP effects are potentially influenced by a mixture of environmental factors acting jointly, rather than by each factor separately. Fig 7 exhibits the fitted nonlinear curves along with the 95% confidence bands, indicating G × E interactions for each SNP.
The 95% confidence bands are shown as dashed lines.
Table 4 displays the estimated odds for different genotypes at different dosage levels. Since dosage level (X1) does not show significance for SNPs codon27 and codon492, we omit the estimated odds at different dosage levels for these two SNPs from the table. The changes in the odds demonstrate the interaction between the SNP and the environmental mixture at different dosage levels. For example, the odds for genotype AA in SNP codon16 do not change much as the dosage level increases, which means that the genetic effect of this genotype remains essentially the same when subjects are exposed to different Dobutamine dosage levels. For the other two genotypes, the odds increase until dosage level four, indicating increased blood pressure as the dosage increases from 0 mcg/min to 20 mcg/min. We can also see how the odds differ across dosage levels for individuals carrying different SNP genotypes. Using SNP codon389 as an example, individuals with the GG genotype consistently exhibit odds close to 1 across dosage levels, suggesting the absence of a SNP × dosage interaction affecting the risk of hypertension. Conversely, individuals with the CC or CG genotype show an increased risk of hypertension as the dosage level rises, followed by a decrease after reaching dosage level 4. This indicates varying genetic responses to different dosage levels and, consequently, a SNP × dosage interaction. Such results can help scientists better understand gene function and how different genotypes respond to the combined effect of the three variables on the risk of hypertension.
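Under the logit link, the odds for a subject with covariates x and genotype G are exp{m0(β0ᵀx) + m1(β1ᵀx)G}, so genotype-specific odds at each dosage level follow directly from the fitted functions. The sketch below illustrates the calculation. The coefficient functions, loadings, genotype coding, and covariate values are hypothetical placeholders rather than the fitted quantities behind Table 4.

```python
import math

def odds_by_genotype(x, beta0, beta1, m0, m1, genotypes=(0, 1, 2)):
    """Odds mu / (1 - mu) = exp(m0(u0) + m1(u1) * g) for each genotype code g."""
    u0 = sum(b * xi for b, xi in zip(beta0, x))
    u1 = sum(b * xi for b, xi in zip(beta1, x))
    return {g: math.exp(m0(u0) + m1(u1) * g) for g in genotypes}

# Placeholder inputs: x = (scaled dosage, scaled age, scaled BMI).
beta = [1.0, 0.0, 0.0]
table = {dose: odds_by_genotype([dose, 0.4, 0.6], beta, beta,
                                m0=lambda u: 0.0,
                                m1=lambda u: u)
         for dose in (0.0, 0.5, 1.0)}
```

In this toy setting, the odds for genotype code 0 stay at 1 for every dose while those for codes 1 and 2 grow with dose, which is the kind of pattern a SNP × dosage interaction produces.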
6 Discussion
In this paper, we introduced a generalized varying index coefficient modeling approach designed to evaluate the combined interaction effects of multiple environmental factors with a genetic factor. This model was inspired by empirical evidence and developed under a longitudinal design with a binary disease response. We developed a profile estimation procedure to estimate the index coefficients and nonparametric interaction functions iteratively. The estimation was conducted under the QIF framework. To estimate the nonparametric functions, we first approximated each function using a truncated power spline basis, then estimated the spline coefficients under the QIF framework. Furthermore, we proposed a hypothesis test to assess the linearity of the nonparametric interaction function. A simulation study was conducted to evaluate the finite sample performance of the estimation and testing procedures. The results indicate reasonable estimation performance of the method under different sample sizes and measurement times.
Our method was proposed to evaluate the joint interaction effect between genetic variants and multiple environmental variables as a whole. Compared to the generalized additive varying coefficient model (gAVCM), which models the G × E effect for each single environmental factor separately, our model presents two advantages: 1) it is biologically more attractive if there are synergistic effects between multiple exposures; and 2) it can potentially increase the testing power for detecting interactions since it can reduce multiple testing burden by treating multiple exposures as a single index variable. Although our method was motivated by a genetic association study, the developed model and inference procedures can be applied to other disciplines with the purpose to model the synergistic effect of multiple variables as a whole.
We applied our method to a real data set from a pain sensitivity study. The testing results indicate that the effects of all five SNPs on the risk of hypertension are nonlinearly moderated by the synergistic effect of the three variables, with dosage as a “time”-varying variable. These five SNPs were genotyped from a candidate gene which has been shown to be related to blood pressure changes [31]. Although the data were not generated to evaluate the genetic effect on hypertension, we applied the method to this data set to demonstrate its utility. The estimated odds of different genotypes for a particular SNP at different dosage levels do give insights into how the effect of the SNP is nonlinearly modulated by different levels of Dobutamine dosage. Of particular interest is SNP codon49, in which individuals carrying genotype GG show a consistently higher risk of developing hypertension regardless of the dosage change, indicating no SNP × dosage interaction. For the same SNP, individuals carrying genotype GA show a different pattern of developing hypertension as the dosage level increases. Such a dynamic change of genetic effect over different dosage levels cannot be revealed by a cross-sectional study, indicating the relative merit of a longitudinal design.
Other methods, such as random effects models [32] and transition models [33], are also options for longitudinal data analysis. Random effects models account for both fixed effects (which are common to all individuals) and random effects (which capture individual-specific variation) to handle the correlation between repeated measures within the same subject. Transition models focus on modeling the relationship between successive observations over time. They are well-suited for data where the primary interest is in understanding how the outcome at one time point depends on previous outcomes, as in Markov or autoregressive models. We used QIF instead of random effects or transition models in this work due to its robustness to correlation structure misspecification and its ability to provide more efficient parameter estimates.
We also recognize that the real data analysis is limited to a small number of SNPs, because we did not have access to a large-scale longitudinal GWAS dataset. When working with a large number of SNPs, it is crucial to control the false discovery rate (FDR) rigorously when correcting for multiple testing. Nevertheless, our method offers a novel strategy for conducting synergistic G × E studies within a longitudinal design, and it adds to the expanding toolkit available for G × E analysis. In addition, missing values are often reported in longitudinal studies. The proposed QIF framework can be extended to handle missing values, which we will evaluate in future studies.
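As one possible way to carry out the FDR control mentioned above, the sketch below implements the Benjamini-Hochberg step-up procedure in plain Python. This procedure is an illustrative choice rather than the one used in this study, and the function name `benjamini_hochberg` and the p-values in the example are hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns a list of booleans, in the input order, that is True where the
    corresponding hypothesis is rejected at FDR level `alpha`.
    """
    m = len(p_values)
    if m == 0:
        return []

    # Indices of the p-values sorted in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Largest rank k (1-based) such that p_(k) <= k * alpha / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            k_max = rank

    # Reject the k_max smallest p-values, even if some of them
    # individually exceed their own rank-specific threshold.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Hypothetical per-SNP p-values (made-up numbers for illustration only).
snp_p_values = [0.001, 0.025, 0.029, 0.035, 0.6]
print(benjamini_hochberg(snp_p_values, alpha=0.05))
```

In this made-up example the second p-value (0.025) exceeds its own threshold (0.02) but is still rejected, because larger p-values further down the ranking meet theirs; this is the step-up behavior that distinguishes the procedure from a rank-by-rank cutoff.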
Supporting information
S1 Data. Simulation and real data codes to replicate the results in the paper.
https://doi.org/10.1371/journal.pone.0318103.s001
(ZIP)
Acknowledgments
The authors express their gratitude to two anonymous reviewers for their valuable comments, which have significantly enhanced the quality and presentation of the manuscript.
References
- 1. Sitlani CM, Rice KM, Lumley T, et al. Generalized estimating equations for genome-wide association studies using longitudinal phenotype data. Statistics in Medicine. 2015;34:118–130. pmid:25297442
- 2. Furlotte NA, Eskin E, Eyheramendy S. Genome-wide association mapping with longitudinal data. Genetic Epidemiology. 2012;36:463–471.
- 3. Xu Z, Shen X, Pan W. Longitudinal analysis is more powerful than cross-sectional analysis in detecting genetic association with neuroimaging phenotypes. PLoS One. 2014;9(8):e102312. pmid:25098835
- 4. Zimmet P, Alberti K, Shaw J. Global and societal implications of the diabetes epidemic. Nature. 2001;414:782–787. pmid:11742409
- 5. Ross CA, Smith WW. Gene-environment interactions in Parkinson’s disease. Parkinsonism and Related Disorders. 2007;13:S309–S315. pmid:18267256
- 6. Miao J, Wu Y, Lu Q. Statistical methods for gene-environment interaction analysis. WIREs Computational Statistics. 2024;16:e1635. pmid:38699459
- 7. Carpenter DO, Arcaro K, Spink DC. Understanding the human health effects of chemical mixtures. Environmental Health Perspectives. 2002;110(suppl 1):25–42. pmid:11834461
- 8. Sexton K, Hattis D. Assessing cumulative health risks from exposure to environmental mixtures: three fundamental questions. Environmental Health Perspectives. 2007;115:825–832.
- 9. Ma S, Song P. Varying index coefficient models. Journal of the American Statistical Association. 2015;110:341–356.
- 10. Guo C, Yang H, Lv J. Generalized varying index coefficient models. Journal of Computational and Applied Mathematics. 2016;300:1–17.
- 11. Liu X, Cui Y, Li R. Partial linear varying multi-index coefficient model for integrative gene-environment interactions. Statistica Sinica. 2016;26:1037–1060. pmid:27667907
- 12. Liu X, Gao B, Cui Y. Generalized partial linear varying multi-index coefficient model for gene-environment interactions. Statistical Applications in Genetics and Molecular Biology. 2017;16(1):59–74. pmid:27988508
- 13. Zhang J, Liu X, Wang H, Cui Y. Functional varying-index coefficients model for dynamic synergistic gene-environment interactions. 2025; https://doi.org/10.1007/s12561-024-09472-3.
- 14. Liang KY, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika. 1986;73:12–22.
- 15. Song PXK, Jiang Z, Park E, Qu A. Quadratic inference functions in marginal models for longitudinal data. Statistics in Medicine. 2009;28:3683–3696. pmid:19757486
- 16. Crowder M. On consistency and inconsistency of estimating equations. Econometric Theory. 1986;3:305–330.
- 17. Crowder M. On the use of a working correlation matrix in using generalized linear models for repeated measures. Biometrika. 1995;82:407–410.
- 18. Qu A, Lindsay BG, Li B. Improving generalized estimation equations using quadratic inference functions. Biometrika. 2000;87:823–836.
- 19. Wang L, Zhou J, Qu A. Penalized generalized estimating equations for high-dimensional longitudinal data analysis. Biometrics. 2012;68:353–360. pmid:21955051
- 20. Taavoni M, Arashi M. High-dimensional generalized semiparametric model for longitudinal data. Statistics. 2021;55:831–850.
- 21. Tian R, Xue L, Liu C. Penalized quadratic inference functions for semiparametric varying coefficient partially linear models with longitudinal data. Journal of Multivariate Analysis. 2014;132:94–110.
- 22. Wang L, Li H, Huang JZ. Variable selection in nonparametric varying-coefficient models for analysis of repeated measurements. Journal of the American Statistical Association. 2008;103:1556–1569. pmid:20054431
- 23. Ruppert D, Carroll RJ. Spatially-adaptive penalties for spline fitting. Australian and New Zealand Journal of Statistics. 2000;42:205–223.
- 24. Qu A, Li R. Quadratic inference functions for varying coefficient models with longitudinal data. Biometrics. 2006;62:379–391. pmid:16918902
- 25. Ruppert D. Selecting the number of knots for penalized splines. Journal of Computational and Graphical Statistics. 2002;11:735–757.
- 26. Bai Y, Fung WK, Zhu Z. Penalized quadratic inference functions for single-index models with longitudinal data. Journal of Multivariate Analysis. 2009;100:152–161.
- 27. Hansen LP. Large sample properties of generalized method of moments estimators. Econometrica. 1982;50(4):1029–1054.
- 28. Wang L, Qu A. Consistent model selection and data-driven smooth tests for longitudinal data in the estimating equations approach. Journal of the Royal Statistical Society, Series B. 2009;71:177–190.
- 29. Leisch F, Weingessel A, Hornik K. On the generation of correlated artificial binary data. Working Paper Series, SFB “Adaptive Information Systems and Modelling in Economics and Management Science,” Vienna University of Economics. 1998.
- 30. James PA, Oparil S, Carter BL, et al. Evidence-based guideline for the management of high blood pressure in adults: report from the panel members appointed to the Eighth Joint National Committee (JNC 8). Journal of the American Medical Association. 2014;311(5):507–520. pmid:24352797
- 31. Johnson JA, Terra SG. Beta-adrenergic receptor polymorphisms: cardiovascular disease associations and pharmacogenetics. Pharmaceutical Research. 2002;19:1779–1787. pmid:12523655
- 32. Laird NM, Ware JH. Random-effects models for longitudinal data. Biometrics. 1982;38:963–974. pmid:7168798
- 33. Verbeke G, Fieuws S, Molenberghs G, Davidian M. The analysis of multivariate longitudinal data: a review. Statistical Methods in Medical Research. 2012;23:42–59. pmid:22523185