Generalized functional varying-index coefficient model for dynamic synergistic gene-environment interactions with binary longitudinal traits

Jingyi Zhang; Honglang Wang; Yuehua Cui

doi:10.1371/journal.pone.0318103

Abstract

The genetic basis of complex traits involves the function of many genes with small effects as well as complex gene-gene and gene-environment interactions. As one of the major players in complex diseases, the role of gene-environment interactions has been increasingly recognized. Motivated by epidemiology studies to evaluate the joint effect of environmental mixtures, we developed a functional varying-index coefficient model (FVICM) to assess the combined effect of environmental mixtures and their interactions with genes, under a longitudinal design with quantitative traits. Built upon the previous work, we extend the FVICM model to accommodate binary longitudinal traits through the development of a generalized functional varying-index coefficient model (gFVICM). This model examines how the genetic effects on a disease trait are nonlinearly influenced by a combination of environmental factors. We derive an estimation procedure for the varying-index coefficient functions using quadratic inference functions combined with penalized splines. A hypothesis testing procedure is proposed to evaluate the significance of the nonparametric index functions. Extensive Monte Carlo simulations are conducted to evaluate the performance of the method under finite samples. The utility of the method is further demonstrated through a case study with a pain sensitivity dataset. SNPs were found to have their effects on blood pressure nonlinearly influenced by a combination of environmental factors.

Citation: Zhang J, Wang H, Cui Y (2025) Generalized functional varying-index coefficient model for dynamic synergistic gene-environment interactions with binary longitudinal traits. PLoS ONE 20(1): e0318103. https://doi.org/10.1371/journal.pone.0318103

Editor: Mahdi Roozbeh, Semnan University, IRAN, ISLAMIC REPUBLIC OF

Received: June 12, 2024; Accepted: January 9, 2025; Published: January 27, 2025

Copyright: © 2025 Zhang et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All relevant data for this study are within the paper, its Supporting information files, and publicly available from the Github repository (https://github.com/Honglang/gFVICM).

Funding: This work was supported in part by a grant (R21HG010073) from the National Institutes of Health (to Y. Cui), a grant (24TPA1288424) from the American Heart Association (to Y. Cui), and a grant (DMS-2212928) from the National Science Foundation (to H. Wang). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

1 Introduction

Longitudinal data analysis is very common in epidemiological studies when the response variables are measured over time on subjects. Numerous studies have demonstrated that longitudinal designs offer greater power in detecting genetic associations compared to cross-sectional designs [1–3]. On the other hand, there has been growing interest in understanding the role of interactions of genes with the environment (G × E) in human diseases, such as type 2 diabetes (e.g., Zimmet et al. [4]) and Parkinson’s disease (e.g., Ross and Smith [5]). In many studies, G × E interactions have traditionally been investigated using a single environment exposure model. Readers are referred to Miao et al. [6] for a comprehensive review of G × E studies. However, increasing evidence suggests that the risk of disease can be significantly influenced by simultaneous exposures to multiple environmental factors. Notably, the combined effect of these exposures often exceeds the sum of their individual effects [7, 8]. This has led to a growing interest in assessing the collective impact of environmental mixtures and exploring the mechanisms through which they interact with genes to influence disease risk. Some research has been done to assess nonlinear interactions between environmental mixtures and genes by applying some nonparametric or semiparametric models, such as the varying index coefficients model (VICM) proposed by [9], the generalized varying index coefficient models by Guo et al. [10], and the partial linear multi-varying index coefficients model (PLMVICM) by Liu et al. [11] and the generalized PLMVICM by Liu et al. [12]. However, these methods were developed for cross-sectional data. The extension of these modeling strategies to longitudinal data deserves further investigation.

Previously, we introduced a functional varying index coefficient model (FVICM) to assess nonlinear G × E interactions for a continuous longitudinal trait [13]. Such a trait can be blood pressure or heart rate measured over time. In practice, however, the response measured over time may be discrete, for example, a binary measure of disease status. In human genetics, many disease traits are binary in nature, recorded as affected or unaffected (cases vs. controls). To investigate the nonlinear dynamic G × E interaction with environmental mixtures as a whole for a binary longitudinal trait, we propose the following generalized functional varying index coefficient model (gFVICM),

$$g\{E(Y_{ij} \mid X_{ij}, G_i)\} = m_0(\beta_0^T X_{ij}) + m_1(\beta_1^T X_{ij})\, G_i, \tag{1}$$

where Y_ij is the trait of interest observed for the ith subject at the jth time point (i = 1, ⋯, N; j = 1, ⋯, n_i); X_ij is a p-dimensional vector of environmental variables, which can be time-variant or time-invariant; G_i denotes a genetic variable which does not depend on time; g(⋅) is a known link function, which can be the identity link for continuous traits and the logit link for binary traits; m₀(⋅) and m₁(⋅) are unknown nonparametric smooth functions to be estimated from the data; and β₀ and β₁ are p-dimensional vectors of index loading parameters. In this work, we consider a binary longitudinal response variable which takes the value of 1 or 0 at each time point. In model (1), the function m₁(β₁ᵀX_ij) captures the interaction effect between environmental mixtures and the genetic variable (e.g., a single nucleotide polymorphism (SNP)) on the risk of disease.

In this study, we formulate a statistical estimation and hypothesis testing procedure tailored for model (1), specifically for binary longitudinal traits. The generalized estimating equation (GEE) method, proposed by Liang and Zeger [14], has been widely used in longitudinal data analysis. However, the GEE method has several disadvantages that stem from some of its critical assumptions [15]. One disadvantage is that the consistency of the GEE estimators depends on the consistency of the estimators for the nuisance correlation parameter [16, 17]. Another shortcoming is that model selection and hypothesis testing are complicated, because the GEE estimation procedure does not involve an objective function. The quadratic inference function (QIF) approach proposed by Qu et al. [18] is one of the improvements of the GEE method. The QIF method avoids the need to estimate nuisance correlation parameters and has been demonstrated by Qu et al. [18] to be generally more efficient than the GEE method. Furthermore, because the QIF is based on an objective function that is asymptotically chi-square distributed, it naturally accommodates model selection criteria such as BIC. The same asymptotic property also allows us to conduct hypothesis tests. This motivates us to extend the QIF method to our model for estimation and hypothesis testing.

Penalized estimation in longitudinal modeling has been extensively studied. Examples include the penalized GEE [19], the penalized GEE with a semiparametric generalized mixed-effects partial linear model [20], the penalized QIF for varying-coefficient partially linear models [21], and variable selection for nonparametric varying-coefficient models [22]. In methods involving varying-coefficient models, the coefficient functions typically depend on only one explanatory variable, which distinguishes them from the one proposed in model (1). In our proposed estimation procedure, we first use penalized splines [23] to approximate the nonparametric smooth functions m_l(⋅), l = 0, 1. Then, we develop a profile estimation procedure in which the index loading parameters and spline coefficients are estimated iteratively under the QIF framework. In order to avoid overfitting and reduce the number of parameters in spline approximation, we add a penalty to the objective function based on a BIC criterion. We establish the asymptotic normality of the resulting estimators under certain regularity conditions. In addition, we are interested in testing the linearity of the G × E interaction, i.e., the linearity of the function m₁(⋅). The QIF can be regarded as an inference function which has properties similar to the likelihood ratio test. Based on that, we construct a testing procedure to assess the linearity of the coefficient (interaction) function, where the test statistic asymptotically follows a χ² distribution.

The performance of the proposed procedure under finite samples is evaluated by Monte Carlo simulations. The application of the proposed method is demonstrated through the analysis of a pain sensitivity dataset with a binary response variable indicating whether a subject has hypertension or not (Yes = 1, No = 0). Theories and proofs are given in S1 File. Our method offers a novel approach to longitudinal G × E studies with binary traits, in which the focus is on evaluating how a genetic variant and a mixture of environmental exposures jointly interact to affect disease risk.

2 The model and estimation methods

2.1 The model

For a binary longitudinal disease trait, suppose the response y_ij, the p-dimensional covariate vector x_ij, and the SNP variable G_i are observed for the ith individual at the jth time point, where i = 1, ⋯, N; j = 1, ⋯, n_i. In general, the number of potential environmental variables (p) that may interact with G to affect Y is not large. Any other covariates that do not interact with G can be modelled separately as a linear term outside of the m(⋅) function. Assume that observations from different subjects are independent, but those within the same subject are correlated. We also assume the model satisfies the first moment assumption, i.e.,

$$\mu_{ij} = E(y_{ij} \mid x_{ij}, G_i) = g^{-1}\{m_0(\beta_0^T x_{ij}) + m_1(\beta_1^T x_{ij})\, G_i\},$$

where g⁻¹(⋅) is a given inverse link function. For binary responses, we use a logit link function and the model can be written as

$$\operatorname{logit}(\mu_{ij}) = \log\frac{\mu_{ij}}{1 - \mu_{ij}} = m_0(\beta_0^T x_{ij}) + m_1(\beta_1^T x_{ij})\, G_i. \tag{2}$$

For identifiability, the constraints ‖β₀‖ = ‖β₁‖ = 1 are imposed, and the first elements of β₀ and β₁ are constrained to be positive.
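As a rough illustration of the logit model and the identifiability constraint, the sketch below evaluates the success probability for a single observation. The function names (`normalize_index`, `gfvicm_prob`) and the toy choices m₀ = cos and m₁ = sin are assumptions made for this example only; they are not taken from the authors' software.

```python
import math

def normalize_index(beta):
    """Scale beta to unit Euclidean norm and flip its sign so the first element is positive."""
    norm = math.sqrt(sum(b * b for b in beta))
    if norm == 0.0:
        raise ValueError("beta must be nonzero")
    sign = -1.0 if beta[0] < 0 else 1.0
    return [sign * b / norm for b in beta]

def gfvicm_prob(x, g, beta0, beta1, m0, m1):
    """P(Y = 1 | x, g) under a logit link: logit(mu) = m0(beta0'x) + m1(beta1'x) * g."""
    u0 = sum(b * xi for b, xi in zip(beta0, x))  # index for the marginal effect
    u1 = sum(b * xi for b, xi in zip(beta1, x))  # index for the interaction effect
    eta = m0(u0) + m1(u1) * g
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical loadings, normalized to satisfy the constraints.
beta0 = normalize_index([3.0, 4.0, 0.0])
beta1 = normalize_index([-1.0, 2.0, 2.0])  # sign is flipped so the first element is positive

# One subject-visit with three exposures and one copy of the minor allele.
mu = gfvicm_prob([0.2, 0.5, 0.9], 1, beta0, beta1, math.cos, math.sin)
odds = mu / (1.0 - mu)  # odds of disease implied by the fitted probability
```

Normalizing before evaluation keeps (β, m) pairs comparable: without the unit-norm and sign constraints, rescaling β and the argument of m(⋅) would give the same fitted probabilities.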

2.2 Quadratic inference function for gFVICM

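Denote u₀ = β₀ᵀX and u₁ = β₁ᵀX. First, the unknown coefficient functions m₀(u₀) and m₁(u₁) are approximated by a truncated power spline basis as

$$m_0(u_0) \approx B(u_0)^T \gamma_0, \qquad m_1(u_1) \approx B(u_1)^T \gamma_1, \tag{3}$$

where $B(u) = (1, u, u^2, \ldots, u^q, (u - \kappa_1)_+^q, \ldots, (u - \kappa_K)_+^q)^T$ is a q-degree truncated power spline basis with K knots κ₁, …, κ_K, $(u - \kappa)_+ = \max(u - \kappa, 0)$, and γ₀ and γ₁ are (q + K + 1)-dimensional vectors of spline coefficients.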
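The spline approximation can be sketched in a few lines. The example assumes the usual truncated power basis (1, u, …, u^q, (u − κ₁)₊^q, …, (u − κ_K)₊^q) with q ≥ 1; the helper names and the knot values are hypothetical.

```python
def truncated_power_basis(u, knots, q):
    """Return [1, u, ..., u^q, (u-k_1)_+^q, ..., (u-k_K)_+^q] for degree q >= 1."""
    poly = [u ** d for d in range(q + 1)]
    trunc = [max(u - k, 0.0) ** q for k in knots]
    return poly + trunc

def spline_value(u, knots, q, gamma):
    """Evaluate the spline approximation B(u)' gamma."""
    basis = truncated_power_basis(u, knots, q)
    if len(gamma) != len(basis):
        raise ValueError("gamma must have q + K + 1 entries")
    return sum(b * c for b, c in zip(basis, gamma))

knots = [0.25, 0.5, 0.75]                        # K = 3 evenly spaced knots on (0, 1)
basis = truncated_power_basis(0.6, knots, q=2)   # length q + K + 1 = 6
# With all knot coefficients set to zero the spline reduces to a polynomial.
value = spline_value(0.6, knots, 2, [1.0, 2.0, 0.0, 0.0, 0.0, 0.0])
```

The last K entries of the basis are the ones tied to knots, which is why a penalty on those coefficients shrinks the fit toward a degree-q polynomial.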

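A marginal approach such as the GEE assumes that the marginal mean μ_ij is a function of the covariates through a link function, and that the variance of y_ij is a function of the mean, var(y_ij) = V(μ_ij). The generalized estimating equation for longitudinal data is given as

$$\sum_{i=1}^{N} \dot{\mu}_i^T V_i^{-1} (y_i - \mu_i) = 0,$$

where $y_i = (y_{i1}, \ldots, y_{in_i})^T$, μ_i = E(y_i) is the mean function, and $\dot{\mu}_i = \partial \mu_i / \partial \theta$ is the first derivative of μ_i with respect to the parameters θ = (β^T, γ^T)^T, with $\beta = (\beta_0^T, \beta_1^T)^T$ and $\gamma = (\gamma_0^T, \gamma_1^T)^T$. The covariance matrix V_i can be decomposed as $V_i = A_i^{1/2} R(\rho) A_i^{1/2}$, where A_i is a diagonal matrix containing the marginal variances, and R(ρ) is a common working correlation matrix parameterized by a small number of nuisance parameters ρ. Utilizing the spline approximation described in Eq (3), the mean function can be expressed as

$$\mu_{ij} = g^{-1}(\eta_{ij}), \qquad \eta_{ij} = B(\beta_0^T x_{ij})^T \gamma_0 + B(\beta_1^T x_{ij})^T \gamma_1\, G_i,$$

and, by the chain rule, the first derivative of μ_i has jth row

$$\dot{\mu}_{ij}^T = \frac{\partial g^{-1}(\eta_{ij})}{\partial \eta_{ij}} \left( B'(\beta_0^T x_{ij})^T \gamma_0\, x_{ij}^T,\; B'(\beta_1^T x_{ij})^T \gamma_1\, G_i\, x_{ij}^T,\; B(\beta_0^T x_{ij})^T,\; G_i\, B(\beta_1^T x_{ij})^T \right),$$

where $B'(u) = \partial B(u)/\partial u = (0, 1, 2u, \ldots, q u^{q-1}, q(u - \kappa_1)_+^{q-1}, \ldots, q(u - \kappa_K)_+^{q-1})^T$.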

In the QIF method, the inverse of the working correlation matrix can be approximated by a linear combination of several basis matrices [18]. This can be represented as

$$R^{-1} \approx \sum_{k=1}^{h} a_k M_k = a_1 M_1 + \cdots + a_h M_h,$$

where M₁ is the identity matrix, and M₂, …, M_h are predefined basis matrices. For instance:

If the working correlation is exchangeable, R⁻¹ ≈ a₁M₁ + a₂M₂, where M₂ has zeros on the diagonal and ones on the off-diagonal.
If the working correlation follows an AR(1) structure, R⁻¹ ≈ a₁M₁ + a₂M₂, with M₂ having ones on its two subdiagonals and zeros elsewhere.

The advantage of this approach is that it simplifies the estimation process by eliminating the need to directly estimate the nuisance parameters a₁, …, a_h.
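For the exchangeable case the linear-combination representation is exact, which the following sketch checks numerically. The helper names are hypothetical, and the closed-form coefficients a₁ and a₂ come from the standard inverse of a compound-symmetry matrix; in the QIF itself these coefficients never need to be computed.

```python
def identity(n):
    return [[1.0 if j == l else 0.0 for l in range(n)] for j in range(n)]

def off_diagonal_ones(n):
    """M2 for an exchangeable structure: zeros on the diagonal, ones elsewhere."""
    return [[0.0 if j == l else 1.0 for l in range(n)] for j in range(n)]

def subdiagonal_ones(n):
    """M2 for an AR(1) structure: ones on the two subdiagonals, zeros elsewhere."""
    return [[1.0 if abs(j - l) == 1 else 0.0 for l in range(n)] for j in range(n)]

def matmul(a, b):
    return [[sum(a[j][m] * b[m][l] for m in range(len(b))) for l in range(len(b[0]))]
            for j in range(len(a))]

n, rho = 4, 0.5
R = [[1.0 if j == l else rho for l in range(n)] for j in range(n)]  # exchangeable correlation

# Closed-form coefficients of R^{-1} = a1 * M1 + a2 * M2 for compound symmetry.
denom = (1.0 - rho) * (1.0 + (n - 1) * rho)
a1 = (1.0 + (n - 2) * rho) / denom
a2 = -rho / denom

M1, M2 = identity(n), off_diagonal_ones(n)
R_inv = [[a1 * M1[j][l] + a2 * M2[j][l] for l in range(n)] for j in range(n)]
product = matmul(R, R_inv)  # should be the identity matrix
```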

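Building on this concept, we can formulate the estimating function as follows:

$$\bar{g}_N(\theta) = \frac{1}{N} \sum_{i=1}^{N} g_i(\theta) = \frac{1}{N} \begin{pmatrix} \sum_{i=1}^{N} \dot{\mu}_i^T A_i^{-1/2} M_1 A_i^{-1/2} (y_i - \mu_i) \\ \vdots \\ \sum_{i=1}^{N} \dot{\mu}_i^T A_i^{-1/2} M_h A_i^{-1/2} (y_i - \mu_i) \end{pmatrix}. \tag{4}$$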

Since the number of equations in $\bar{g}_N$ (2ph + 2(q + K + 1)h) exceeds the number of unknown parameters (2p + 2(q + K + 1)), we cannot solve for the estimators by merely setting each element to zero. To address this, we estimate the parameters by minimizing the following quadratic inference function:

$$Q_N(\theta) = N\, \bar{g}_N(\theta)^T C_N(\theta)^{-1} \bar{g}_N(\theta),$$

where $C_N(\theta) = \frac{1}{N} \sum_{i=1}^{N} g_i(\theta) g_i(\theta)^T$ serves as a consistent estimator for var(g_i). By minimizing the quadratic inference function, we can derive the parameter estimates as follows:

$$\hat{\theta} = \arg\min_{\theta} Q_N(\theta).$$
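To make the construction concrete, the sketch below evaluates a QIF for a deliberately simplified marginal model, logit(μ_ij) = θ x_ij, with a single scalar parameter and h = 2 basis matrices (identity and the AR(1) subdiagonal matrix). This is not the gFVICM itself: the one-parameter model, the toy data (generated independently within subject), and the grid search are assumptions chosen so that the 2 × 2 matrix C_N can be inverted in closed form.

```python
import math
import random

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

def extended_score(theta, x, y):
    """g_i(theta): stacked scores for M1 = I and M2 = AR(1) subdiagonal basis."""
    n = len(y)
    mu = [expit(theta * xj) for xj in x]
    v = [m * (1.0 - m) for m in mu]  # marginal variances (diagonal of A_i)
    # (d mu / d theta)' A^{-1/2} has entries x_j * sqrt(v_j);
    # A^{-1/2} (y - mu) has entries (y_l - mu_l) / sqrt(v_l).
    left = [xj * math.sqrt(vj) for xj, vj in zip(x, v)]
    right = [(yj - mj) / math.sqrt(vj) for yj, mj, vj in zip(y, mu, v)]
    g1 = sum(l * r for l, r in zip(left, right))
    g2 = sum(left[j] * right[l] for j in range(n) for l in range(n) if abs(j - l) == 1)
    return g1, g2

def qif(theta, X, Y):
    """Q_N(theta) = N * gbar' C_N^{-1} gbar, with C_N the average of g_i g_i'."""
    N = len(Y)
    scores = [extended_score(theta, x, y) for x, y in zip(X, Y)]
    b1 = sum(s[0] for s in scores) / N
    b2 = sum(s[1] for s in scores) / N
    c11 = sum(s[0] * s[0] for s in scores) / N
    c12 = sum(s[0] * s[1] for s in scores) / N
    c22 = sum(s[1] * s[1] for s in scores) / N
    det = c11 * c22 - c12 * c12
    # Quadratic form using the closed-form inverse of a symmetric 2 x 2 matrix.
    return N * (c22 * b1 * b1 - 2.0 * c12 * b1 * b2 + c11 * b2 * b2) / det

# Toy data: 100 subjects, 5 visits each, true theta = 1 (made up for illustration).
rng = random.Random(1)
X = [[rng.uniform(-1.0, 1.0) for _ in range(5)] for _ in range(100)]
Y = [[1 if rng.random() < expit(1.0 * xj) else 0 for xj in x] for x in X]

grid = [-1.0 + 0.05 * i for i in range(81)]
theta_hat = min(grid, key=lambda t: qif(t, X, Y))  # crude grid minimizer of the QIF
```

With two estimating equations and one parameter the system is over-identified, so the QIF is generally positive at its minimizer rather than zero; this is the property that the goodness-of-fit and BIC arguments rely on.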

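In order to avoid over-parameterization, we add a penalty term to the QIF to penalize the number of knots [24]. The penalized QIF is written as

$$N^{-1} Q_N(\theta) + \lambda\, \theta^T D\, \theta, \tag{5}$$

where D is a diagonal matrix with 1 for parameters corresponding to spline coefficients associated with knots and 0 otherwise. Specifically, with θ ordered as $(\beta_0^T, \beta_1^T, \gamma_0^T, \gamma_1^T)^T$, $D = \operatorname{diag}(0_{2p}^T, 0_{q+1}^T, 1_K^T, 0_{q+1}^T, 1_K^T)$. Thus, the estimator is given by:

$$\hat{\theta} = \arg\min_{\theta} \left\{ N^{-1} Q_N(\theta) + \lambda\, \theta^T D\, \theta \right\}. \tag{6}$$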
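A minimal sketch of the penalty term follows. It assumes the parameter vector is stacked as (β₀, β₁, γ₀, γ₁) and that, within each γ block, the first q + 1 entries are polynomial coefficients and the last K are knot coefficients; both the ordering and the function names are assumptions for illustration.

```python
def penalty_diagonal(p, q, K):
    """Diagonal of D: 1 for knot-related spline coefficients, 0 for everything else."""
    spline_block = [0.0] * (q + 1) + [1.0] * K
    return [0.0] * (2 * p) + spline_block + spline_block

def ridge_penalty(theta, d, lam):
    """lambda * theta' D theta for a diagonal D stored as a list."""
    return lam * sum(di * t * t for di, t in zip(d, theta))

p, q, K = 3, 2, 3
d = penalty_diagonal(p, q, K)

# Hypothetical parameter vector with every entry equal to 0.5.
theta = [0.5] * len(d)
pen = ridge_penalty(theta, d, lam=0.1)  # only the 2K knot coefficients contribute
```

Because the index loadings and the polynomial part are left unpenalized, a large λ pulls both m₀ and m₁ toward degree-q polynomials without shrinking β₀ or β₁.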

To determine the tuning parameter λ, we borrow the generalized cross-validation idea [24–26]. The generalized cross-validation statistic is defined as

$$\mathrm{GCV}(\lambda) = \frac{N^{-1} Q_N(\hat{\theta}_\lambda)}{(1 - N^{-1}\, df)^2},$$

where the effective degree of freedom is given by $df = \operatorname{tr}\{(\ddot{Q}_N + 2N\lambda D)^{-1} \ddot{Q}_N\}$, and $\ddot{Q}_N$ represents the second derivative of Q_N. The optimal tuning parameter λ is the one that minimizes GCV(λ). In practice, the desired value of λ can be found by a grid search over a predefined set of candidate values. We also established the asymptotic properties of the estimators of the index loading parameters and the penalized spline regression coefficients, which are given in the online S1 File together with the proofs.

3 Model selection and hypothesis test

3.1 Model selection

Model selection is crucial in spline approximation, as including too many parameters can lead to overfitting. According to the theoretical property of the generalized method of moments estimator [27], under the assumption that E(g₁) = 0 and the number of estimating equations exceeds the number of parameters, we have

$$Q_N(\hat{\theta}) \to \chi^2_{r-k}$$

in distribution. Here, r is the dimension of $\bar{g}_N(\theta)$, k is the dimension of θ, and $\hat{\theta}$ is the estimator obtained by minimizing the QIF given a specific order and number of knots. This asymptotic property of the QIF allows for a goodness-of-fit test, which is useful in determining the appropriate order and number of knots for our model. However, multiple models that are not nested may pass the goodness-of-fit tests. Given that $Q_N(\hat{\theta})$ is asymptotically chi-square distributed, it is natural to extend the BIC to the QIF approach by replacing twice the negative log-likelihood function with the QIF objective function [28]. Specifically, the BIC criterion for a model with r estimating equations and k parameters is expressed as follows:

$$\mathrm{BIC} = Q_N(\hat{\theta}) + (r - k) \ln N.$$

The model with the lowest BIC is deemed to be the optimal choice. If we select h basis matrices in (4), then r − k = hk − k = (h − 1)k.

In our simulation and real data application, we determine the number of knots K and the optimal order q by exploring various combinations and selecting the one that minimizes the BIC criterion. The knots are evenly spaced across the range of the single index u = β^TX.
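The selection step can be sketched as follows, assuming the QIF-based BIC takes the form Q_N(θ̂) + (r − k) ln N with r = hk. The minimized QIF values in `candidates` are invented purely to show the mechanics and do not come from the paper.

```python
import math

def n_params(p, q, K):
    """k = 2p index loadings + 2(q + K + 1) spline coefficients."""
    return 2 * p + 2 * (q + K + 1)

def qif_bic(q_value, n_subjects, n_basis, k):
    """BIC for the QIF: Q_N(theta_hat) + (r - k) * ln(N), with r = h * k equations."""
    r = n_basis * k
    return q_value + (r - k) * math.log(n_subjects)

# Hypothetical minimized QIF values for candidate (order q, number of knots K) pairs.
candidates = {(1, 2): 45.0, (2, 2): 30.2, (2, 3): 28.9, (3, 3): 28.1}
N, h, p = 200, 2, 3

scores = {qK: qif_bic(Q, N, h, n_params(p, *qK)) for qK, Q in candidates.items()}
best = min(scores, key=scores.get)  # (q, K) pair with the smallest BIC
```

In this made-up example the richest model has the smallest QIF, but the (r − k) ln N penalty favors a smaller model, which is the trade-off the criterion is meant to capture.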

3.2 Nonparametric goodness-of-fit test based on QIF

The QIF can also be regarded as an inference function since it has properties similar to the likelihood ratio test. Suppose that the d-dimensional parameter vector γ is partitioned into (ψ, ζ), where ψ is the parameter of interest with dimension d₁, and ζ is the nuisance parameter with dimension d₂ = d − d₁. If we are interested in testing H₀: ψ = ψ₀, the test statistic asymptotically follows a chi-square distribution with d₁ degrees of freedom. Qu et al. [18] introduced a theorem that provides a way to conduct hypothesis testing in the QIF framework. The theorem states that, given that all required regularity conditions are satisfied and ψ has dimension d₁, under the null hypothesis, $Q_N(\psi_0, \tilde{\zeta}) - Q_N(\hat{\psi}, \hat{\zeta})$ is asymptotically chi-square distributed with d₁ degrees of freedom, where

$$\tilde{\zeta} = \arg\min_{\zeta} Q_N(\psi_0, \zeta), \qquad (\hat{\psi}, \hat{\zeta}) = \arg\min_{(\psi, \zeta)} Q_N(\psi, \zeta). \tag{7}$$

When there is no nuisance parameter, which is a special case of the condition in the theorem, $Q_N(\psi_0) - Q_N(\hat{\psi})$ has an asymptotic chi-square distribution with d degrees of freedom under the null hypothesis.

3.3 Test for linearity of the interaction function in gFVICM

In our proposed gFVICM model as outlined in Eq (1), a key focus is to test the form of the unspecified coefficient function. Specifically, we are interested in determining whether a linear function is sufficient to describe the G × E interaction. If we fail to reject the hypothesis that the coefficient function is linear, we should fit a parametric linear interaction model to further evaluate the presence of a linear G × E interaction. Conversely, if the hypothesis is rejected, it suggests the presence of a nonlinear G × E interaction. It is important to note that we cannot directly test for the zero effect of the function m₁(⋅) because, under the null hypothesis m₁(⋅) = 0, the index loading parameters become unidentifiable unless we impose the condition β₀ = β₁ = β, which is practically too restrictive. Let $\gamma_1 = (\gamma_{1,0}, \gamma_{1,1}, \ldots, \gamma_{1,q}, \gamma_{1,q+1}, \ldots, \gamma_{1,q+K})^T$. Using the truncated power spline basis, the coefficient function can be approximated as:

$$m_1(u_1) \approx \gamma_{1,0} + \gamma_{1,1} u_1 + \gamma_{1,2} u_1^2 + \cdots + \gamma_{1,q} u_1^q + \sum_{k=1}^{K} \gamma_{1,q+k} (u_1 - \kappa_k)_+^q.$$

Our objective is to test the linearity of m₁(u₁), which is equivalent to testing

$$H_0: \gamma_{1,2} = \gamma_{1,3} = \cdots = \gamma_{1,q+K} = 0.$$

Let $\tilde{\theta}$ be the estimator of the full parameter θ = (β^T, γ^T)^T under the null hypothesis, with $\gamma_{1,2} = \cdots = \gamma_{1,q+K} = 0$, and let $\hat{\theta}$ be the estimator of θ under the alternative. Then, following the theorem by Qu et al. [18], the test statistic $T = Q_N(\tilde{\theta}) - Q_N(\hat{\theta})$ asymptotically follows a chi-square distribution with K + q − 1 degrees of freedom.
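Given the two minimized QIF values, the p-value of the linearity test follows from the upper tail of a chi-square distribution with K + q − 1 degrees of freedom. The sketch below computes that tail probability with the standard library only, using the closed form for integer degrees of freedom; the QIF values `q_null` and `q_alt` are hypothetical numbers, not results from the paper.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with integer df >= 1."""
    if x <= 0.0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{i < df/2} (x/2)^i / i!
        term = math.exp(-half)
        total = term
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return total
    # Odd df: start from df = 1, where the tail equals erfc(sqrt(x/2)),
    # then apply Q(a + 1, z) = Q(a, z) + z^a * exp(-z) / Gamma(a + 1).
    total = math.erfc(math.sqrt(half))
    a = 0.5
    term = half ** a * math.exp(-half) / math.gamma(a + 1.0)
    for _ in range((df - 1) // 2):
        total += term
        a += 1.0
        term *= half / a
    return total

def linearity_test(q_null, q_alt, K, q):
    """Return (T, df, p-value) for T = Q_N(null fit) - Q_N(alternative fit)."""
    T = q_null - q_alt
    df = K + q - 1
    return T, df, chi2_sf(T, df)

# Hypothetical minimized QIF values under the linear and the spline model.
T, df, p_value = linearity_test(q_null=52.7, q_alt=40.1, K=3, q=2)
reject = p_value < 0.05
```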

4 Simulation study

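The performance of the proposed method in finite samples was assessed through Monte Carlo simulation studies. Specifically, we examined the following logistic regression model:

$$\operatorname{logit}\{P(Y_{ij} = 1 \mid X_{ij}, G_i)\} = m_0(u_{0,ij}) + m_1(u_{1,ij})\, G_i,$$

where $u_{0,ij} = \beta_0^T X_{ij}$ and $u_{1,ij} = \beta_1^T X_{ij}$.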

We simulated a three-dimensional set of environmental variables X = (X₁, X₂, X₃). For each subject i, the variables X_1ij, X_2ij, X_3ij were independently drawn from a uniform distribution U(0, 1). We set the minor allele frequency (MAF) to p_A = 0.1, 0.3, 0.5, assuming Hardy-Weinberg equilibrium. The SNP genotypes AA, Aa, and aa were simulated from a multinomial distribution with probabilities p_A², 2p_A(1 − p_A) and (1 − p_A)², respectively. The genotype variable G was encoded as {0, 1, 2}, corresponding to the genotypes {aa, Aa, AA}, respectively. To generate correlated responses, we used the R package bindata developed by Leisch et al. [29] under an AR(1) correlation structure with correlation parameter ρ = 0.5. When using the function ‘rmvbin’ to generate the correlated binary data, one must specify the marginal probabilities and the correlation structure.
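The covariate and genotype part of this design can be sketched as below. The function names and the seed are assumptions; the correlated binary responses, which the paper generates with `rmvbin` from the R package bindata, are not reproduced here.

```python
import random

def hwe_probs(maf):
    """Genotype probabilities for (aa, Aa, AA) under Hardy-Weinberg equilibrium."""
    return [(1.0 - maf) ** 2, 2.0 * maf * (1.0 - maf), maf ** 2]

def simulate_genotypes(n, maf, rng):
    """Draw n genotypes coded 0, 1, 2 (number of copies of the minor allele A)."""
    return rng.choices([0, 1, 2], weights=hwe_probs(maf), k=n)

def simulate_exposures(n_times, p, rng):
    """Draw p independent U(0, 1) exposures at each of n_times visits."""
    return [[rng.random() for _ in range(p)] for _ in range(n_times)]

rng = random.Random(2024)          # arbitrary seed, for reproducibility only
G = simulate_genotypes(20000, 0.3, rng)
X_subject = simulate_exposures(10, 3, rng)  # one subject: T = 10 visits, p = 3 exposures
mean_G = sum(G) / len(G)           # should be close to 2 * MAF under HWE
```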

We set m₀(u₀) = cos(πu₀) and m₁(u₁) = sin[π(u₁ − A)/(B − A)], where A and B are constants, and fixed the true index loading vectors β₀ and β₁. We generated 500 data sets for each combination of sample size N = 200 or 500 and number of time points n_i = T = 10 or 20. For simplicity, we assumed all subjects were measured at the same number of time points, though this assumption is not required. The basis matrix M₂ was set to have 1 on its two subdiagonals and 0 elsewhere. The number of knots and the order of the splines were determined based on the BIC criterion.

4.1 Performance of estimation

Tables 1 and 2 present the parameter estimation results for different sample sizes and measurement times, respectively. These tables report the average bias (Bias), the average of the estimated standard error (SE) derived from the theoretical results, the standard deviation of the 500 estimates (SD), and the estimated coverage probability (CP) at the 95% confidence level. The tables show that as the sample size increases, the performance of the estimation improves, evidenced by reduced bias, SD, and SE. Additionally, increasing the number of repeated measurements for each subject also enhances estimation accuracy, as illustrated by the comparison between Tables 1 and 2. For instance, the CP for β₀₁ improves from 86.8% to 90% when the number of measurement times increases from 10 to 20, given a sample size of 200. Moreover, the estimation of the loading parameter β₁ becomes more accurate as MAF p_A increases. Conversely, the estimation of β₀ tends to deteriorate with high p_A. This is likely because we have less data information for accurately estimating the marginal effects m₀(⋅) when p_A is large.
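The four summaries reported in the tables can be computed from the Monte Carlo output as in the sketch below. It assumes Wald-type intervals, estimate ± 1.96 × SE, for the coverage probability; the function name and the four toy replicates are hypothetical.

```python
import statistics

def mc_summary(estimates, std_errors, truth, z=1.96):
    """Bias, empirical SD, average estimated SE, and coverage of Wald intervals."""
    bias = statistics.fmean(estimates) - truth
    sd = statistics.stdev(estimates)          # spread of the replicate estimates
    se = statistics.fmean(std_errors)         # average model-based standard error
    covered = [abs(est - truth) <= z * s for est, s in zip(estimates, std_errors)]
    cp = sum(covered) / len(covered)
    return {"Bias": bias, "SD": sd, "SE": se, "CP": cp}

# Four made-up replicates of one loading parameter whose true value is 0.58.
summary = mc_summary([0.50, 0.62, 0.55, 0.71], [0.05, 0.05, 0.05, 0.05], truth=0.58)
```

Comparing SD with SE indicates whether the asymptotic standard errors are well calibrated; coverage below the nominal 95% level, as seen for some parameters at N = 200, typically accompanies an SE that underestimates the SD.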

Table 1. Simulation results under different MAFs (p_A = 0.1, 0.3, 0.5) and sample sizes (N = 200, 500), T = 10 and correlation ρ = 0.5.

https://doi.org/10.1371/journal.pone.0318103.t001

Table 2. Simulation results under different MAFs (p_A = 0.1, 0.3, 0.5) and sample sizes (N = 200, 500), T = 20 and correlation ρ = 0.5.

https://doi.org/10.1371/journal.pone.0318103.t002

Figs 1–4 illustrate the estimated functions m₀(u₀) and m₁(u₁) under varying sample sizes and time points. In these plots, the solid lines represent the estimated functions, while the dashed lines denote the true functions. The 95% confidence bands are indicated by the dotted-dash lines. The plots show that the estimated curves almost perfectly align with the true curves, demonstrating the high accuracy of the estimation method. Additionally, the confidence bands are particularly tight for larger sample sizes and a greater number of measurement times, indicating robust estimation. It is noteworthy that the estimation of the interaction effects m₁(u₁) improves with an increase in p_A. Conversely, the estimation of the marginal effects m₀(u₀) becomes less accurate as p_A increases. This observation aligns with the parameter estimation results presented in Tables 1 and 2.

Fig 1. The estimation of function m₀(⋅) for sample sizes N = 200 and 500 with T = 10 time points.

The solid lines represent the estimated functions, while the dashed lines denote the true functions. The 95% confidence bands are illustrated by the dotted-dash lines.

https://doi.org/10.1371/journal.pone.0318103.g001

Fig 2. The estimation of function m₀(⋅) for sample sizes N = 200 and 500 with T = 20 time points.

The solid lines represent the estimated functions, while the dashed lines denote the true functions. The 95% confidence bands are illustrated by the dotted-dash lines.

https://doi.org/10.1371/journal.pone.0318103.g002

Fig 3. The estimation of function m₁(⋅) for sample sizes N = 200 and 500 with T = 10 time points.

The solid lines represent the estimated functions, while the dashed lines denote the true functions. The 95% confidence bands are illustrated by the dotted-dash lines.

https://doi.org/10.1371/journal.pone.0318103.g003

Fig 4. The estimation of function m₁(⋅) for sample sizes N = 200 and 500 with T = 20 time points.

The solid lines represent the estimated functions, while the dashed lines denote the true functions. The 95% confidence bands are illustrated by the dotted-dash lines.

https://doi.org/10.1371/journal.pone.0318103.g004

4.2 Performance of hypothesis tests

We assessed the performance of the test for the nonparametric function under the null hypothesis H₀: m₁(u₁) = δ₀ + δ₁u₁, where u₁ = β₁ᵀX and δ₀ and δ₁ are constants, representing a linear G × E interaction. To evaluate the test’s power, we considered a sequence of alternative models indexed by a parameter τ that varies. When τ = 0, the model reduces to the null and the test evaluates the false positive rate.

Fig 5 illustrates the empirical size (for τ = 0) and power (for τ > 0) of the test at the 5% significance level, based on 500 Monte Carlo simulations. The analysis was conducted for sample sizes N = 200 and 500, under different measurement times: T = 10 (left panel) and T = 20 (right panel), with MAF fixed at 0.3. When the sample size is N = 200, the empirical Type I error rate is relatively high. However, this rate decreases significantly as the sample size increases to N = 500. Additionally, the power of the test improves markedly with an increase in sample size from 200 to 500. These findings suggest that our method effectively controls false positive rates and exhibits adequate power to detect deviations from the linear function, particularly under larger sample sizes. Moreover, a comparison between the results for T = 10 and T = 20 shows that the testing power increases with more frequent measurements, indicating better performance in detecting nonlinearity with higher measurement frequency.

Fig 5. The empirical size and power of testing the linearity of nonparametric function m₁(⋅) for sample sizes N = 200 and 500 with T = 10 (left) and 20 (right) time points.

https://doi.org/10.1371/journal.pone.0318103.g005

To assess how the values of MAF affect the testing performance, we plotted the power under different MAFs p_A = 0.1, 0.3, 0.5 when N = 500, T = 10, which is shown in Fig 6. The power of the test increases significantly when MAF rises from 0.1 to 0.3. However, the power values are quite similar when p_A = 0.3 and 0.5.

Fig 6. The empirical size and power of testing the linearity of nonparametric function m₁ under different MAFs for N = 500 and T = 10.

https://doi.org/10.1371/journal.pone.0318103.g006

5 Real data application

We applied the proposed gFVICM model to a dataset from a study investigating the association between the A118G SNPs of the OPRM1 gene and sensitivity to experimental pain. A sample of 163 healthy volunteers was recruited for this study. For each volunteer, Systolic Blood Pressure (SBP) and Diastolic Blood Pressure (DBP) were measured at 6 Dobutamine dosage levels: 0 (baseline), 5, 10, 20, 30 and 40 mcg/min. Missing values were present in genotypes, covariates, and disease responses, with all missing rates falling below 10%. Given the small sample size of the study, we opted to impute the missing values before conducting the analysis rather than exclude the affected observations. Clinically, a person is said to be hypertensive if the individual’s SBP is greater than 140 mm Hg or DBP is greater than 90 mm Hg [30]. Thus, the response variable Y is a binary variable indicating whether a person has hypertension or not, i.e., Y = 1 for hypertension and Y = 0 for non-hypertension. The model with the mean given in (2) is applied to the data.

One longitudinal covariate, X₁ = dosage level, and two time-invariant covariates, X₂ = age and X₃ = BMI, were included as the environmental factors in the model. The genetic variables were five SNPs located at codons 16, 27, 49, 389, and 492 in the gene. Our goal was to assess how a combination of age, BMI and dosage level modifies the effect of the SNP on the risk of hypertension. Specifically, we tested the hypothesis H₀: m₁(u₁) = δ₀ + δ₁u₁ for each SNP and reported the corresponding p-value. P-values for testing the significance of the three index loading coefficients β₁ = (β₁₁, β₁₂, β₁₃) were also reported, following the asymptotic property of the estimators. Additionally, our proposed model was compared with a generalized additive varying-coefficient model (gAVCM) whose coefficient functions are unknown functions of X₁. To evaluate the relative benefits of our integrative analysis, we calculated the objective function Q_N for both models. The p-values for testing the interaction in the gAVCM are also provided in the tables and are denoted by p_gAVCM.

In Table 3, all 5 SNPs have p-values for the linearity test smaller than the significance level 0.05, which means the functions capturing the G × E interactions are nonlinear for all 5 SNPs. The objective function Q_N shows that gFVICM provides a better fit to the data than gAVCM does. This demonstrates the advantage of the integrative analysis. Furthermore, the testing results for gAVCM indicate that the coefficients for interactions are not significant. The results imply that these SNP effects are potentially influenced by a mixture of environmental factors acting jointly, rather than by each factor separately. Fig 7 exhibits the fitted nonlinear curves along with the 95% confidence bands, indicating G × E interactions for each SNP.

Fig 7. Plot of the estimated nonparametric function m₁(u₁) for SNPs at codons 16, 27, 49, 389, and 492, represented by the solid curve.

The 95% confidence bands are shown as dashed lines.

https://doi.org/10.1371/journal.pone.0318103.g007

Table 3. List of SNPs showing MAF, alleles, p-values under different hypotheses, and Q_N.

https://doi.org/10.1371/journal.pone.0318103.t003

Table 4 displays the estimated odds for different genotypes at different dosage levels. Since dosage level (X₁) does not show significance for SNPs codon27 and codon492, the estimated odds at different dosage levels are not shown for these two SNPs. The changes in the odds demonstrate the interaction between the SNP and environmental mixtures at different dosage levels. For example, the odds for genotype AA in SNP codon16 do not change much as the dosage level increases, which means that the genetic effect of this genotype remains nearly the same when subjects are exposed to different Dobutamine dosage levels. For the other two genotypes, the odds increase until dosage level four, indicating increased blood pressure as the dosage increases from 0 mcg/min to 20 mcg/min. The odds also differ across dosage levels for individuals carrying different SNP genotypes. Using SNP codon389 as an example, individuals with the GG genotype consistently exhibit odds close to 1 across dosage levels, suggesting the absence of a SNP × dosage interaction affecting the risk of hypertension. Conversely, individuals with the CC or CG genotype show an increased risk of hypertension as the dosage level rises, followed by a decrease after reaching dosage level 4. This indicates varying genetic responses to different dosage levels and, consequently, a SNP × dosage interaction. Such results can help scientists better understand the gene function and how different genotypes respond to the combined effect of the three variables on the risk of hypertension.

Table 4. Estimated odds for different genotypes at each dosage level.

https://doi.org/10.1371/journal.pone.0318103.t004

6 Discussion

In this paper, we introduced a generalized varying index coefficient modeling approach designed to evaluate the combined interaction effects of multiple environmental factors with a genetic factor. This model was inspired by empirical evidence and developed under a longitudinal design with a binary disease response. We developed a profile estimation procedure to estimate the index coefficients and nonparametric interaction functions iteratively under the QIF framework. To estimate the nonparametric functions, we first approximated each function using a truncated power spline basis, then estimated the spline coefficients under the QIF framework. Furthermore, we proposed a hypothesis test to assess the linearity of the nonparametric interaction function. A simulation study was conducted to evaluate the finite sample performance of the estimation and testing procedures. The results indicate reasonable estimation performance of the method under different sample sizes and measurement times.

Our method was proposed to evaluate the joint interaction effect between genetic variants and multiple environmental variables as a whole. Compared to the generalized additive varying coefficient model (gAVCM), which models the G × E effect for each single environmental factor separately, our model presents two advantages: 1) it is biologically more attractive if there are synergistic effects between multiple exposures; and 2) it can potentially increase the testing power for detecting interactions since it can reduce multiple testing burden by treating multiple exposures as a single index variable. Although our method was motivated by a genetic association study, the developed model and inference procedures can be applied to other disciplines with the purpose to model the synergistic effect of multiple variables as a whole.

We applied our method to a real data set from a pain sensitivity study. The testing results indicate that the effects of all five SNPs on the risk of hypertension are nonlinearly moderated by the synergistic effect of the three variables, with dosage serving as the “time”-varying variable. These five SNPs were genotyped from a candidate gene that has been shown to be related to blood pressure changes [31]. Although the data were not generated to evaluate genetic effects on hypertension, we applied the method to this data set to demonstrate its utility. The estimated odds for the different genotypes of a particular SNP at different dosage levels give insight into how the effect of the SNP is nonlinearly modulated by the level of dobutamine dosage. Of particular interest is SNP codon49, for which individuals carrying genotype GG show a consistently higher risk of developing hypertension regardless of the dosage level, indicating no SNP × dosage interaction. For the same SNP, individuals carrying genotype GA show a different pattern of hypertension risk as the dosage level increases. Such dynamic changes of genetic effects over dosage levels cannot be revealed by a cross-sectional study, which indicates the relative merit of a longitudinal design.
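The relation between a fitted linear predictor and genotype-specific odds can be made explicit with a short worked formula. It assumes a logit link; η_ij(g) is illustrative notation for the linear predictor of a subject with genotype g at dosage level j, and μ_ij(g) is the corresponding probability of hypertension.

```latex
\[
\mathrm{odds}_{ij}(g) = \frac{\mu_{ij}(g)}{1 - \mu_{ij}(g)} = \exp\{\eta_{ij}(g)\},
\qquad
\mathrm{OR}_{ij}(g \text{ vs. } g') = \exp\{\eta_{ij}(g) - \eta_{ij}(g')\}
\]
```

In this notation, a genotype whose estimated odds stay flat across dosage levels has a linear predictor that does not change with dosage, while odds that rise or fall with dosage reflect a dosage-dependent genetic effect.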

Other methods, such as random effects models [32] and transition models [33], are also options for longitudinal data analysis. Random effects models include both fixed effects (common to all individuals) and random effects (capturing individual-specific variation) to handle the correlation between repeated measures within the same subject. Transition models focus on the relationship between successive observations over time. They are well suited to data where the primary interest is in how the outcome at one time point depends on previous outcomes, as in Markov or autoregressive models. We used QIF instead of random effects or transition models in this work because it is robust to misspecification of the correlation structure and can provide more efficient parameter estimates.
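For readers unfamiliar with QIF, the objective function for a binary response can be sketched as follows. This is a simplified illustration for a plain logistic marginal model with a parametric linear predictor, not the gFVICM estimator itself; the function name `qif_objective`, the balanced design, the exchangeable basis matrices, and the simulated data are all assumptions of the example. The pseudo-inverse is used here only as a numerical safeguard.

```python
import numpy as np

def qif_objective(beta, X, y, basis_mats):
    """QIF value N * g_bar' C^{-1} g_bar for a logistic marginal model.

    X: (N, T, p) covariates, y: (N, T) binary responses,
    basis_mats: list of (T, T) matrices whose span approximates the
    inverse working correlation matrix.
    """
    N = X.shape[0]
    mu = 1.0 / (1.0 + np.exp(-(X @ beta)))      # (N, T) marginal means
    v = mu * (1.0 - mu)                          # Bernoulli variances
    r = (y - mu) / np.sqrt(v)                    # A^{-1/2} (y - mu)
    blocks = []
    for M in basis_mats:
        # With the logit link D_i = A_i X_i, so each block of the extended
        # score is X_i' A_i^{1/2} M A_i^{-1/2} (y_i - mu_i).
        w = np.sqrt(v) * (r @ M.T)
        blocks.append(np.einsum("ntp,nt->np", X, w))
    g = np.hstack(blocks)                        # (N, K * p) subject scores
    g_bar = g.mean(axis=0)
    C = g.T @ g / N                              # sample covariance of scores
    return float(N * g_bar @ np.linalg.pinv(C) @ g_bar)

# Toy data (no within-subject correlation is induced; illustration only).
rng = np.random.default_rng(0)
N, T, p = 200, 4, 3
X = rng.normal(size=(N, T, p))
beta_true = np.array([0.5, -1.0, 0.25])
prob = 1.0 / (1.0 + np.exp(-(X @ beta_true)))
y = rng.binomial(1, prob).astype(float)

# Exchangeable working structure: identity plus off-diagonal ones.
M_exch = [np.eye(T), np.ones((T, T)) - np.eye(T)]
q_value = qif_objective(beta_true, X, y, M_exch)
```

Minimizing this objective over `beta` gives the QIF estimate. No nuisance correlation parameter has to be estimated, which is one reason the approach is less sensitive to the choice of working correlation.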

We also recognize that the present real data analysis is limited to a small number of SNPs, because our access to large-scale longitudinal GWAS datasets is restricted. When working with a large number of SNPs, it is crucial to apply a rigorous multiple testing correction, for example by controlling the false discovery rate (FDR). Nevertheless, our method offers a novel strategy for conducting synergistic G × E studies within a longitudinal design and adds to the expanding toolkit available for G × E analysis. In addition, missing values are common in longitudinal studies. Extending the method to handle missing values under the proposed QIF framework is possible and will be investigated in our future studies.
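One standard way to control the FDR across many SNP-level tests is the Benjamini-Hochberg step-up procedure, sketched below. The paper does not prescribe a specific FDR method, so this choice is an assumption, and the p-values in the example are made up for illustration.

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Return a boolean array marking hypotheses rejected at FDR level alpha."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)                         # indices of sorted p-values
    thresholds = alpha * np.arange(1, m + 1) / m  # i * alpha / m for rank i
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        # Step-up rule: reject every hypothesis up to the largest passing rank.
        k = np.max(np.nonzero(below)[0])
        reject[order[: k + 1]] = True
    return reject

# Hypothetical p-values from ten SNP-level linearity tests.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(pvals, alpha=0.05)
```

In this made-up example only the two smallest p-values fall below their rank-specific thresholds, so only those two tests are declared significant.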

Supporting information

S1 Data. Simulation and real data codes to replicate the results in the paper.

https://doi.org/10.1371/journal.pone.0318103.s001

(ZIP)

S1 File. Theorems and proofs.

https://doi.org/10.1371/journal.pone.0318103.s002

(PDF)

Acknowledgments

The authors express their gratitude to two anonymous reviewers for their valuable comments, which have significantly enhanced the quality and presentation of the manuscript.

References

1. Sitlani CM, Rice KM, Lumley T, et al. Generalized estimating equations for genome-wide association studies using longitudinal phenotype data. Statistics in Medicine. 2015;34:118–130. pmid:25297442
2. Furlotte NA, Eskin E, Eyheramendy S. Genome-wide association mapping with longitudinal data. Genetic Epidemiology. 2014;36:463–471.
3. Xu Z, Shen X, Pan W. Longitudinal analysis is more powerful than cross-sectional analysis in detecting genetic association with neuroimaging phenotypes. PLoS One. 2014;9(8):e102312. pmid:25098835
4. Zimmet P, Alberti K, Shaw J. Global and societal implications of the diabetes epidemic. Nature. 2001;414:782–787. pmid:11742409
5. Ross CA, Smith WW. Gene-environment interactions in Parkinson’s disease. Parkinsonism and Related Disorders. 2007;13:S309–S315. pmid:18267256
6. Miao J, Wu Y, Lu Q. Statistical methods for gene-environment interaction analysis. WIREs Computational Statistics. 2024;16:e1635. pmid:38699459
7. Carpenter DO, Arcaro K, Spink DC. Understanding the human health effects of chemical mixtures. Environmental Health Perspectives. 2002;110(suppl 1):25–42. pmid:11834461
8. Sexton K, Hattis D. Assessing cumulative health risks from exposure to environmental mixtures: three fundamental questions. Environmental Health Perspectives. 2007;115:825–832.
9. Ma S, Song P. Varying index coefficient models. Journal of the American Statistical Association. 2015;110:341–356.
10. Guo C, Yang H, Lv J. Generalized varying index coefficient models. Journal of Computational and Applied Mathematics. 2016;300:1–17.
11. Liu X, Cui Y, Li R. Partial linear varying multi-index coefficient model for integrative gene-environment interactions. Statistica Sinica. 2016;26:1037–1060. pmid:27667907
12. Liu X, Gao B, Cui Y. Generalized partial linear varying multi-index coefficient model for gene-environment interactions. Statistical Applications in Genetics and Molecular Biology. 2017;16(1):59–74. pmid:27988508
13. Zhang J, Liu X, Wang H, Cui Y. Functional varying-index coefficients model for dynamic synergistic gene-environment interactions. 2025; https://doi.org/10.1007/s12561-024-09472-3.
14. Liang KY, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika. 1986;73:12–22.
15. Song PXK, Jiang Z, Park E, Qu A. Quadratic inference functions in marginal models for longitudinal data. Statistics in Medicine. 2009;28:3683–3696. pmid:19757486
16. Crowder M. On consistency and inconsistency of estimating equations. Econometric Theory. 1986;3:305–330.
17. Crowder M. On the use of a working correlation matrix in using generalized linear models for repeated measures. Biometrika. 1995;82:407–410.
18. Qu A, Lindsay BG, Li B. Improving generalized estimation equations using quadratic inference functions. Biometrika. 2000;87:823–836.
19. Wang L, Zhou J, Qu A. Penalized generalized estimating equations for high-dimensional longitudinal data analysis. Biometrics. 2012;68:353–360. pmid:21955051
20. Taavoni M, Arashi M. High-dimensional generalized semiparametric model for longitudinal data. Statistics. 2021;55:831–850.
21. Tian R, Xue L, Liu C. Penalized quadratic inference functions for semiparametric varying coefficient partially linear models with longitudinal data. Journal of Multivariate Analysis. 2014;132:94–110.
22. Wang L, Li H, Huang JZ. Variable selection in nonparametric varying-coefficient models for analysis of repeated measurements. Journal of the American Statistical Association. 2008;103:1556–1569. pmid:20054431
23. Ruppert D, Carroll RJ. Spatially-adaptive penalties for spline fitting. Australian and New Zealand Journal of Statistics. 2000;42:205–223.
24. Qu A, Li R. Quadratic inference functions for varying coefficient models with longitudinal data. Biometrics. 2006;62:379–391. pmid:16918902
25. Ruppert D. Selecting the number of knots for penalized splines. Journal of Computational and Graphical Statistics. 2002;11:735–757.
26. Bai Y, Fung WK, Zhu Z. Penalized quadratic inference functions for single-index models with longitudinal data. Journal of Multivariate Analysis. 2009;100:152–161.
27. Hansen LP. Large sample properties of generalized method of moments estimators. Econometrica. 1982;50(4):1029–1054.
28. Wang L, Qu A. Consistent model selection and data-driven smooth tests for longitudinal data in the estimating equations approach. Journal of the Royal Statistical Society, Series B. 2009;71:177–190.
29. Leisch F, Weingessel A, Hornik K. On the generation of correlated artificial binary data. Working Paper Series, SFB “Adaptive Information Systems and Modelling in Economics and Management Science,” Vienna University of Economics. 1998.
30. James PA, Oparil S, Carter BL, et al. Evidence-based guideline for the management of high blood pressure in adults: report from the panel members appointed to the Eighth Joint National Committee (JNC 8). Journal of the American Medical Association. 2014;311(5):507–520. pmid:24352797
31. Johnson JA, Terra SG. Beta-adrenergic receptor polymorphisms: cardiovascular disease associations and pharmacogenetics. Pharmaceutical Research. 2002;19:1779–1787. pmid:12523655
32. Laird NM, Ware JH. Random-effects models for longitudinal data. Biometrics. 1982;38:963–974. pmid:7168798
33. Verbeke G, Fieuws S, Molenberghs G, Davidian M. The analysis of multivariate longitudinal data: a review. Statistical Methods in Medical Research. 2012;23:42–59. pmid:22523185
