Abstract
The omnibus test is commonly applied to evaluate the overall disparity between group means in ANOVA. Alternatively, linear contrasts are more informative for detecting specific patterns of mean differences that cannot be obtained via the omnibus test. This article concerns power and sample size calculations for contrast analysis with heterogeneous variances and budget concerns. Optimal allocation procedures for the Welch-Satterthwaite tests of standardized and unstandardized contrasts are presented to minimize the total sample size under designated allocation ratios, to meet a desirable power level at the least cost, and to attain the maximum power performance under a fixed cost. Currently available methods rely exclusively on a simple allocation formula and a direct rounding rule. The proposed allocation strategies combine the computing techniques of nonlinear optimization search and an iterative screening process. Numerical assessments based on a randomized controlled trial of the Overcoming Depression on the Internet program are conducted to demonstrate and confirm that the approximate procedures do not guarantee an optimal solution. The suggested approaches extend and outperform the existing findings in methodological soundness and overall performance. The corresponding computer algorithms are developed to implement the recommended power and sample size calculations for optimal contrast analysis.
Citation: Jan S-L, Shieh G (2019) Optimal contrast analysis with heterogeneous variances and budget concerns. PLoS ONE 14(3): e0214391. https://doi.org/10.1371/journal.pone.0214391
Editor: Seyedali Mirjalili, Griffith University, AUSTRALIA
Received: October 10, 2018; Accepted: March 12, 2019; Published: March 26, 2019
Copyright: © 2019 Jan, Shieh. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The summary statistics are available in the two articles: Clarke G, Eubanks D, Reid CK, et al. Overcoming Depression on the Internet (ODIN)(2): a randomized trial of a self-help depression skills program with reminders. Journal of Medical Internet Research 2005; 7: e16.
Funding: The authors received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Within the context of analysis of variance (ANOVA), the omnibus F test is widely used for detecting overall mean differences. Alternatively, many important research questions may be formulated as a linear combination of the population group means. Hence, a t test of an individual contrast provides much more information than an omnibus hypothesis in assessing particular relations between the mean effects. Comprehensive exposition and further information can be found in Kutner et al. [1] and Maxwell and Delaney [2]. However, it has been noted in many actual applications that the homogeneous-variance assumption of ANOVA is frequently violated. For example, Grissom [3], Rosopa, Schaffer, and Schroeder [4], and Ruscio and Roche [5] stressed that variances can be extremely different across treatment groups in clinical and psychological studies. To account for the impact of variance heterogeneity, the Welch–Satterthwaite procedure of Satterthwaite [6] and Welch [7] is commonly recommended as an alternative to the usual t test for detecting the substantive significance of a linear contrast. Accordingly, contrast analysis under heterogeneity of variance is a generalization of the well-known Behrens–Fisher problem of testing the difference between two population means with unequal variances.
The general guidelines of experimental design and statistical analysis suggest that even the renowned test procedures do not warrant correct detection of treatment differences that are strongly expected or theoretically supported (Ioannidis [8], Moher, Dulberg, & Wells [9]). To prevent mistakenly dismissing an important contrast effect and to provide profound implications for ANOVA research, the underlying issues of power and sample size calculations must also be considered. It is prudent to emphasize that the traditional power and sample size procedures do not consider the cost implications. Notably, Allison et al. [10] advocated designing statistically powerful studies while minimizing costs. On the other hand, Marcoulides [11] emphasized the notion of maximizing power in designing studies under budget constraints. Although power and sample size procedures are available for Welch’s test [12] of the difference between two means, Luh and Guo [13] noted that cost issues have not been incorporated in the Welch–Satterthwaite test for contrast analysis. Accordingly, Luh and Guo [13] described a formula for efficient sample size allocation in two scenarios. The first scenario is attaining a designated power with the minimum total cost, and the second scenario is maximizing the statistical power for a designated total cost. The suggested optimal sample sizes have a ratio that is proportional to the product of the ratio of contrast coefficients and the ratio of standard deviations divided by the square root of the ratio of unit sampling costs. The particular method is a direct extension of the optimal sample size formula in Dette and Munk [14] and Pentico [15] for detecting the difference between two means under the normality assumption.
A standard normal distribution can be viewed as a t distribution with an infinite number of degrees of freedom. Despite this large-sample argument, the importance of a Student's t distribution is well recognized in statistical applications, especially when the sample size is small. Note that the underlying notion of incorporating cost concerns is that time, money, and other resources are limited in all practical studies. The power and sample size procedures [16–20] stress the theoretical principles of the approximate degrees of freedom tests using estimated degrees of freedom. Specifically, power and sample size calculations for Welch's [21] omnibus test have been presented in Jan and Shieh [16] and Shieh and Jan [17, 18]. On the other hand, Shieh and Jan [19] considered the problem of power and sample size for the Welch–Satterthwaite test of linear contrasts, but they did not consider budget issues. The related results in Jan and Shieh [20] are restricted to designing 2 × 2 factorial studies while minimizing financial costs. Therefore, these optimal sample size methods did not cover all the cost schemes for contrast analysis. To our knowledge, there have been no other optimal cost and allocation investigations for the Welch–Satterthwaite test of contrast analysis except for Luh and Guo [13]. As a generalization of the results in [16–20], the present study focuses on optimal contrast analysis by implementing the distributional properties of the Welch–Satterthwaite t statistic in cost and allocation evaluations. Accordingly, this study not only contributes to the methodological development and understanding of the approximate degrees of freedom test procedure, but also facilitates pedagogical and numerical comparisons between the suggested approaches and the methods of Luh and Guo [13] for optimal sample size determination.
In addition to the unstandardized contrasts, effect size reporting and interpretation practices suggest that standardized effect sizes are useful when comparing results from multiple studies using measurement instruments whose raw units are not directly comparable (Fritz, Morris, & Richler [22], Lakens [23], Takeshima et al. [24]). Notably, standardized contrasts of treatment effects and corresponding effect sizes in ANOVA have been investigated by, among others, Olejnik and Algina [25], Rosenthal, Rosnow, and Rubin [26], and Steiger [27]. The prescribed sample size studies of linear contrasts did not include the more involved situation of standardized contrasts. Hence, it is of theoretical importance to extend the power and sample size calculations to hypothesis testing of standardized contrasts.
In view of the importance of methodological justification and computational support, this article aims to present a systematic and thorough discussion for the Welch-Satterthwaite tests of standardized and unstandardized contrasts. Optimal allocation approaches are presented to minimize the total sample size with the designated ratios, to meet a desirable power level for the least cost, and to attain the maximum power performance under a fixed cost. An Internet depression intervention example is employed to demonstrate the features of the suggested approaches. To facilitate the recommended procedures in planning research designs, computer algorithms are offered for optimal power and sample size calculations with the designated allocation and cost schemes.
Methods
Linear contrasts
Consider the one-way ANOVA model in which Yij denotes the jth value of the response variable from the ith treatment group and the observations are assumed to be independent and normally distributed:
Yij = μi + εij,  εij ~ N(0, σi²),  (1)

where μi and σi² are unknown parameters, i = 1, …, G (≥ 2) and j = 1, …, Ni. In addition to or instead of the omnibus test, questions regarding a particular pattern of group differences can be tested with a contrast of mean values.

A contrast is defined as a linear combination of mean parameters

ψ = Σ_{i=1}^{G} li μi,  (2)

where the li are the linear coefficients with Σ_{i=1}^{G} li = 0. With the model assumption defined in Eq 1, an unbiased estimator ψ̂ for the contrast ψ is of the form

ψ̂ = Σ_{i=1}^{G} li Ȳi,  (3)

where Ȳi = Σ_{j=1}^{Ni} Yij/Ni is the ith group sample mean and an unbiased estimator of μi for i = 1, …, G. Moreover, the contrast estimator ψ̂ given in Eq 3 has the distribution

ψ̂ ~ N(ψ, ω²),  (4)

where ω² = Σ_{i=1}^{G} li²σi²/Ni. An unbiased estimator ω̂² of ω² can be readily obtained by replacing the variance σi² in ω² with its unbiased estimator Si²:

ω̂² = Σ_{i=1}^{G} li²Si²/Ni,  (5)

where Si² = Σ_{j=1}^{Ni} (Yij − Ȳi)²/(Ni − 1) is the sample variance for i = 1, …, G.
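To make the estimation steps in Eqs 3 and 5 concrete, the contrast estimate and its variance estimate can be computed from group summary statistics alone. The following is a minimal Python sketch (the article's own programs are SAS/IML); the group variance values 100, 100, 100 in the usage example are hypothetical placeholders, not figures from the ODIN data.

```python
def contrast_estimates(l, ybar, s2, n):
    """Contrast point estimate (Eq 3) and its variance estimate (Eq 5)
    computed from group summary statistics."""
    psi_hat = sum(li * yi for li, yi in zip(l, ybar))
    omega2_hat = sum(li**2 * si / ni for li, si, ni in zip(l, s2, n))
    return psi_hat, omega2_hat

# Contrast {0.5, 0.5, -1} with illustrative 16-week means; the variances
# 100, 100, 100 are hypothetical placeholders.
psi_hat, omega2_hat = contrast_estimates(
    [0.5, 0.5, -1.0], [34.7, 32.3, 35.5], [100.0, 100.0, 100.0], [75, 80, 100])
```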
Test for difference.
To appraise a linear contrast of the mean effects in terms of the hypothesis

H0: ψ = ψ0 versus H1: ψ ≠ ψ0,  (6)

the test statistic is of the form

T = (ψ̂ − ψ0)/ω̂,  (7)

where ψ0 is a constant. The Welch–Satterthwaite procedure suggests that under the null hypothesis H0: ψ = ψ0, the quantity T has the convenient approximate distribution

T ~ t(ν),  (8)

where

ν = ω⁴ / Σ_{i=1}^{G} {(li²σi²/Ni)²/(Ni − 1)}

and t(ν) is a t distribution with degrees of freedom ν. For inferential purposes, the degrees of freedom ν is replaced by its counterpart ν̂ with direct substitution of Si² for σi² in ν, where

ν̂ = ω̂⁴ / Σ_{i=1}^{G} {(li²Si²/Ni)²/(Ni − 1)}.  (9)

The test rejects H0 at the significance level α if |T| > tν̂, α/2, where tν̂, α/2 is the upper 100(α/2) percentile of the t distribution t(ν̂).

Moreover, with the same theoretical arguments and analytic derivations, it can be shown that the statistic T has the general approximate distribution

T ~ t(ν, Δ),  (10)

where t(ν, Δ) is a noncentral t distribution with degrees of freedom ν and noncentrality parameter

Δ = (ψ − ψ0)/ω.  (11)

Also, the power function of the Welch–Satterthwaite test can be approximated by

π(Δ) = P{|t(ν, Δ)| > tν, α/2}.  (12)
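The power approximation of Eq 12 can be evaluated directly with noncentral t probabilities. A minimal Python sketch follows (the article's own implementation is in SAS/IML; here scipy's nct distribution supplies the noncentral t):

```python
from scipy import stats

def ws_power_two_sided(l, mu, sigma2, n, psi0=0.0, alpha=0.05):
    """Approximate power (Eq 12) of the two-sided Welch-Satterthwaite
    contrast test for given coefficients, means, variances, and sizes."""
    psi = sum(li * mi for li, mi in zip(l, mu))
    terms = [li**2 * s2 / ni for li, s2, ni in zip(l, sigma2, n)]
    omega2 = sum(terms)                                    # contrast variance (Eq 4)
    nu = omega2**2 / sum(t**2 / (ni - 1) for t, ni in zip(terms, n))  # df (Eq 8)
    delta = (psi - psi0) / omega2**0.5                     # noncentrality (Eq 11)
    tcrit = stats.t.ppf(1 - alpha / 2, nu)                 # upper alpha/2 percentile
    # P{|t(nu, delta)| > t_{nu, alpha/2}}
    return stats.nct.sf(tcrit, nu, delta) + stats.nct.cdf(-tcrit, nu, delta)
```

Under the null configuration (Δ = 0) the function returns the significance level, a convenient sanity check for any implementation.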
Test for noninferiority and superiority.
In addition to the two-sided test of difference for a contrast, it is of clinical importance to test the hypotheses for noninferiority and superiority between mean effects (Laster & Johnson [28], Mulla et al. [29], Piaggio et al. [30], Scott [31]). The problem of testing noninferiority and superiority can be unified by the following hypotheses when larger values of ψ are better:
H0: ψ ≤ ψ0 versus H1: ψ > ψ0,  (13)

where ψ0 is the noninferiority or superiority threshold (Fleming et al. [32], Gayet-Ageron et al. [33], Gayet-Ageron et al. [34], Gladstone & Vach [35], Wien [36]). When ψ0 < 0, rejection of the null hypothesis implies noninferiority relative to the reference margin, whereas rejection indicates superiority over the reference bound when ψ0 > 0. The upper one-sided test procedure rejects the null hypothesis at the significance level α if T > tν̂, α, and the associated power function is expressed as

πU(Δ) = P{t(ν, Δ) > tν, α}.  (14)
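For the noninferiority/superiority test, only the rejection region changes relative to the two-sided case; a corresponding self-contained sketch of Eq 14 (again Python/scipy rather than the article's SAS/IML) is:

```python
from scipy import stats

def ws_power_upper(l, mu, sigma2, n, psi0=0.0, alpha=0.05):
    """Approximate power (Eq 14) of the upper one-sided
    Welch-Satterthwaite contrast test."""
    psi = sum(li * mi for li, mi in zip(l, mu))
    terms = [li**2 * s2 / ni for li, s2, ni in zip(l, sigma2, n)]
    omega2 = sum(terms)
    nu = omega2**2 / sum(t**2 / (ni - 1) for t, ni in zip(terms, n))
    delta = (psi - psi0) / omega2**0.5
    # P{t(nu, delta) > t_{nu, alpha}}: one-sided rejection region only
    return stats.nct.sf(stats.t.ppf(1 - alpha, nu), nu, delta)
```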
Related points to consider on switching between superiority and non-inferiority can be found in the report of the Committee for Proprietary Medicinal Products [37], Ganju and Rom [38], Lewis [39], and Murray [40].
Standardized contrasts
The usual linear contrast has the advantage of interpretability because it is expressed in the original units of the response variable. Alternatively, standardized contrasts provide a natural interpretation of the net effect by expressing the magnitude of a treatment comparison relative to the variability of the response variable. A standardized contrast effect ψ* is defined as

ψ* = ψ/σq,  (15)

where σq = (Σ_{i=1}^{G} qi σi²)^{1/2}, qi = Ni/NT for i = 1, …, G, and NT = Σ_{i=1}^{G} Ni. To detect a standardized contrast effect, a slightly different statistic than T is considered:

T* = ψ̂/ω̂.  (16)

Also, T* has the general distribution

T* ~ t(ν, Δ*),  (17)

where Δ* = ψ/ω = ψ*(σq/ω).
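Assuming the sample-size-weighted standardizer implied by the text (σq² = Σ qi σi² with qi = Ni/NT), the standardized contrast ψ* of Eq 15 can be computed as in the following Python sketch:

```python
def standardized_contrast(l, mu, sigma2, n):
    """Standardized contrast psi* = psi / sigma_q (Eq 15), where sigma_q^2
    is the sample-size-weighted average of the group variances
    (weights qi = Ni / NT)."""
    NT = sum(n)
    sigma_q = sum((ni / NT) * s2 for ni, s2 in zip(n, sigma2)) ** 0.5
    psi = sum(li * mi for li, mi in zip(l, mu))
    return psi / sigma_q
```

With homogeneous unit variances the standardizer is 1 and ψ* reduces to ψ, which provides a quick consistency check.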
Test for difference.
For assessing the standardized contrast effects in terms of the hypothesis

H0: ψ* = ψ0* versus H1: ψ* ≠ ψ0*,  (18)

the test statistic T* has the null distribution

T* ~ t(ν, Δ0*),  (19)

where Δ0* = ψ0*(σq/ω) and ψ0* is a constant. The null hypothesis is rejected at the significance level α if T* < tα/2(ν̂, Δ̂0*) or T* > t1−α/2(ν̂, Δ̂0*), where tα/2(ν̂, Δ̂0*) and t1−α/2(ν̂, Δ̂0*) are the lower and upper 100(α/2) percentiles of the noncentral t distribution t(ν̂, Δ̂0*), respectively, and Δ̂0* is obtained from Δ0* by substituting the sample variances Si² for σi². The corresponding power function is

π*(Δ*) = P{t(ν, Δ*) < tα/2(ν, Δ0*)} + P{t(ν, Δ*) > t1−α/2(ν, Δ0*)}.  (20)
Test for noninferiority and superiority.
To perform the upper one-sided test for noninferiority and superiority in terms of

H0: ψ* ≤ ψ0* versus H1: ψ* > ψ0*,  (21)

the test procedure rejects H0 at the significance level α if T* > t1−α(ν̂, Δ̂0*). Accordingly, the power function is defined as

π*U(Δ*) = P{t(ν, Δ*) > t1−α(ν, Δ0*)}.  (22)
The reference values need to be prudently selected to reflect the planned tests of noninferiority or superiority with appropriate magnitude and sign.
Sample size calculations
During the planning stage of a research study, a question of essential interest is how many subjects are needed in order to have the desired power for conducting a scientifically meaningful analysis. To extend the applicability of contrast analysis, optimal sample size procedures are presented with respect to distinct allocation and cost concerns.
Sample size ratios are fixed.
For advance planning of unstandardized and standardized contrast analysis, the prescribed power functions π(Δ) and π*(Δ*) can be employed to calculate the sample sizes {Ni, i = 1, …, G} needed to attain the specified power 1 − β for the chosen significance level α, null values ψ0 and ψ0*, contrast coefficients {li, i = 1, …, G}, and parameter values {μi, σi², i = 1, …, G}. However, it is prudent to consider a design structure with a priori chosen sample size ratios {r1, …, rG}, where rj = Nj/Ng ≥ 1, j = 1, …, G, and the gth group has the smallest sample size Ng. Note that the sample size calculations in Shieh and Jan [19] are only applicable to the tests of difference for conventional linear contrasts; they did not consider the tests for standardized contrasts.
The cost and effort to treat a subject often vary across treatment groups, and it is sensible for researchers to take budget and resource constraints into account in research design. The total cost of an ANOVA study can be represented by the overhead and sampling costs through the following simple cost function:

CT = cO + Σ_{i=1}^{G} ci Ni,  (23)

where cO is the fixed overhead cost associated with the study and ci reflects the unit sampling cost of each subject in group i for i = 1, …, G. Apparently, the cost assessment reduces to the evaluation of the total number of subjects NT = Σ_{i=1}^{G} Ni when cO = 0 and ci = 1 for i = 1, …, G. Under cost and power considerations, the following two scenarios arise naturally in choosing the optimal sample sizes.
Target power is fixed and total cost needs to be minimized.
Despite the simple linear form of the objective cost function, the optimization process involves the designated power function as a nonlinear constraint. Thus, a closed form solution rarely exists for most situations. With the specifications of the significance level α, the desired power level 1 − β, the null effect size, contrast coefficients, and the model parameters of group means and variance components, the suggested approach is composed of two key steps.
First, the preliminary set of sample sizes {NPi, i = 1, …, G} for attaining the desired power while minimizing the total cost can be obtained with the NLPQN subroutine of the SAS/IML [41] package. However, the sample sizes are treated as continuous variables in the optimization process, so the resulting values are most likely not integers. In view of the discrete nature of sample sizes, a systematic evaluation is conducted to find the proper result in the second step. The screening process of Shieh and Jan [18] is extended to a wider range of sample size combinations. Specifically, power calculations and cost assessments are performed for a total of 4^G sample size sets {Ni, i = 1, …, G} with Ni = [NPi] − 1, [NPi], [NPi] + 1, or [NPi] + 2 for i = 1, …, G, where [M] denotes the integer part of M. Then, the optimal allocation is found through an inspection of the sample size combinations that attain the desired power while incurring the least cost. If more than one set yields the same least cost, the one giving the largest power is reported.
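The second (screening) step can be mimicked without SAS: starting from a continuous solution (supplied directly here, rather than produced by NLPQN), examine the 4^G integer neighbourhoods and keep the cheapest combination that still attains the target power. A hedged Python sketch, using the one-sided power of Eq 14 and a minimum of two subjects per group as an assumed side condition:

```python
from itertools import product
from math import floor
from scipy import stats

def power_upper(l, mu, sigma2, n, psi0, alpha):
    # Approximate power of the upper one-sided Welch-Satterthwaite test (Eq 14).
    psi = sum(li * mi for li, mi in zip(l, mu))
    terms = [li**2 * s2 / ni for li, s2, ni in zip(l, sigma2, n)]
    omega2 = sum(terms)
    nu = omega2**2 / sum(t**2 / (ni - 1) for t, ni in zip(terms, n))
    delta = (psi - psi0) / omega2**0.5
    return stats.nct.sf(stats.t.ppf(1 - alpha, nu), nu, delta)

def screen_min_cost(np_cont, l, mu, sigma2, psi0, alpha, target, costs, cO=0.0):
    """Screening step: examine Ni in {[NPi]-1, [NPi], [NPi]+1, [NPi]+2} for
    every group and keep the cheapest combination attaining the target power
    (ties broken by larger power)."""
    best = None
    for offsets in product((-1, 0, 1, 2), repeat=len(np_cont)):
        n = [floor(v) + o for v, o in zip(np_cont, offsets)]
        if min(n) < 2:                       # at least two subjects per group
            continue
        pw = power_upper(l, mu, sigma2, n, psi0, alpha)
        if pw < target:
            continue
        cost = cO + sum(c * ni for c, ni in zip(costs, n))
        if best is None or (cost, -pw) < (best[0], -best[1]):
            best = (cost, pw, n)
    return best                              # (cost, power, sizes) or None
```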
In contrast to the proposed thorough search, Luh and Guo [13] showed that the potential optimal sample size ratios for contrast analysis of unstandardized means are

γi = Ni/N1 = (|li|σi)/(|l1|σ1) · (c1/ci)^{1/2}, i = 2, …, G,  (24)

that is, proportional to the product of the ratio of contrast coefficients and the ratio of standard deviations divided by the square root of the ratio of unit sampling costs. Note that the allocation ratios are derived with the standard normal Z statistic for known variances, rather than the t statistic under the assumption of unknown variances.
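The allocation ratios of Eq 24 are simple to compute; the sketch below (Python, with illustrative inputs) uses absolute coefficients since only the magnitude of li matters for allocation:

```python
def luh_guo_ratios(l, sigma, costs):
    """Allocation ratios gamma_i = Ni/N1 of Eq 24:
    gamma_i = (|l_i| * sigma_i) / (|l_1| * sigma_1) * sqrt(c_1 / c_i)."""
    base = abs(l[0]) * sigma[0]
    return [abs(li) * si / base * (costs[0] / ci) ** 0.5
            for li, si, ci in zip(l, sigma, costs)]
```

For example, doubling a group's standard deviation while quadrupling its unit cost leaves its allocation ratio unchanged, since the cost enters only through a square root.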
Total cost is fixed and actual power needs to be maximized.
In addition to the prescribed design scheme, a problem of practical interest is to decide the best design in power performance when the total cost is fixed. Similar to the previous approach for optimal design, a two-step search procedure is performed. In this case, the specialized SAS/IML [41] NLPNRA subroutine is used to find the initial sample sizes {NP1, …, NPG} that maximize the nonlinear power function under the linear inequality cost constraint. The optimization algorithm treats the sample sizes as continuous measurements, and the computed outcomes {NP1, …, NPG} are very likely not integers. To give the correct optimal solution, power and cost appraisals are performed in the second step for a total of 4^G sample size combinations {N1, …, NG} with Ni = [NPi] − 1, [NPi], [NPi] + 1, or [NPi] + 2 for i = 1, …, G. Accordingly, the optimal allocation is obtained through a detailed comparison of the sample size configurations that yield the greatest power while remaining within the restricted budget.
In this case, Luh and Guo [13] suggested that the optimal sample size combination still has the allocation ratios given in Eq 24. The sample size of the first group is determined by N1 = (CT − cO)/Σ_{i=1}^{G} ci γi with γ1 = 1, and the other sample sizes are then computed with Ni = N1γi for i = 2, …, G. The sample sizes computed from the allocation ratios are unlikely to be whole numbers, so they are rounded up or down to the nearest integer and the outcomes are reported as the optimal sample sizes.
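Substituting Ni = N1γi into the cost function CT = cO + Σ ci Ni gives N1 = (CT − cO)/Σ ci γi, so the continuous allocation under a fixed budget can be sketched as below (the subsequent rounding to integers is applied separately, as described in the text):

```python
def sizes_for_budget(CT, cO, costs, gammas):
    """Continuous group sizes Ni = N1 * gamma_i that spend exactly the budget,
    with gamma_1 = 1; derived from CT = cO + sum(ci * N1 * gamma_i)."""
    N1 = (CT - cO) / sum(c * g for c, g in zip(costs, gammas))
    return [N1 * g for g in gammas]
```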
Results
To explicate the usefulness of the recommended exact approaches and associated computer programs, the overcoming depression on the Internet (ODIN) study of Clarke et al. [42] is employed for power and sample size calculations. This research was a three-arm randomized controlled trial with a usual-treatment control group and two ODIN intervention groups receiving reminders through postcards or brief telephone calls.
For demonstration, Luh and Guo [13] considered a comparison of the two intervention programs with the usual treatment without access to ODIN, with respect to the mental component summary scores at 16 weeks. The hypothesis testing is formulated as H0: ψ ≤ –4.2 versus H1: ψ > –4.2 with the linear coefficients {l1, l2, l3} = {0.5, 0.5, –1}. For the three study conditions of mail reminder, telephone reminder, and control group, the sample sizes are {N1, N2, N3} = {75, 80, 100}, with the corresponding sample means and variances reported in Clarke et al. [42]. From these summary statistics, the contrast effect estimate ψ̂, estimated variance ω̂², and approximate degrees of freedom ν̂ yield the observed test statistic T = 1.9927 and p-value = 0.0238. Hence, the test concludes that the contrast effect is significantly larger than –4.2 at α = 0.05.
For the purposes of power analysis and sample size determination, the abovementioned findings are employed to provide planning values of the model parameters and design characteristics for an upcoming Internet depression intervention study. With these parameter settings, contrast coefficients {0.5, 0.5, –1}, sample sizes {75, 80, 100}, and ψ0 = –4.2, the accompanying program shows that the attained powers for the two-sided and one-sided tests given in Eqs 6 and 13 are 0.5093 and 0.6335, respectively. The resulting powers are far less than the fairly common level of 0.80. Numerical computations reveal that balanced group sample sizes of 183 and 144 are required to achieve the target power of 0.80 for the two-sided and one-sided tests, respectively.
According to the cost-effectiveness study of Hollinghurst et al. [43], Luh and Guo [13] assumed the fixed overhead cost cO = 0 and the average operation costs {c1, c2, c3} = {20, 50, 100} as the unit sampling costs of the three treatment groups for a future depression study. For the prescribed test for noninferiority, the approximate method of Luh and Guo [13] reported that the sample sizes {173, 93, 153} are required to attain the power performance of 0.80 with the least cost. Therefore, the total sample size and total cost are NT = 419 and CT = 23,410, respectively. It is also of fundamental interest to consider the optimal design problem in which the total number of subjects needs to be minimized. Luh and Guo [13] showed that the minimum sample sizes to assure the same power level of 0.80 are {98, 84, 193} with NT = 375 and CT = 25,460. Alternatively, the proposed approach suggests that the optimal sample sizes {173, 93, 152} and {97, 83, 193} are required to attain the designated power 0.80 with the least total cost and the smallest total sample size, respectively. The total sample size and total cost are NT = 418 and CT = 23,310 under the cost minimization consideration, whereas the corresponding results are NT = 373 and CT = 25,390 when the minimum total sample size is desirable. The attained powers for the two sample size settings are 0.8000 and 0.8003, respectively, and they are nearly identical to the nominal level 0.80. For these two cases, Luh and Guo's [13] method consistently gave greater total costs and larger total sample sizes than the suggested algorithm.
For the scenario of finding the optimal allocation to maximize power performance when the total cost is fixed at 22,000, the sample sizes computed by Luh and Guo [13] are {162.43, 87.73, 143.65}. They suggested finding the appropriate sample sizes by rounding up or down to the nearest integer. Accordingly, their chosen sample sizes are {162, 87, 144} with the total cost CT = 21,990. When a computer is not available, the checking procedure entails laborious and tedious calculations, especially for four or more groups. Instead, the optimal sample size allocation computed by the proposed approach is {160, 88, 144}, which exactly meets the planned budget. Moreover, exact computation shows that the resulting power 0.7795 of the optimal structure is larger than the power 0.7793 attained by the prescribed sample sizes {162, 87, 144}. Hence, the proposed algorithm is superior to the approximate procedure of Luh and Guo [13]. Generally, the computations of optimal solutions can be simplified by the approximate methods without much loss in accuracy, especially when the sample sizes are large. However, the proposed approaches will produce more accurate results across all sample sizes.
To demonstrate the hypothesis testing, power computation, and sample size determination for the standardized contrasts, the comparison of mental component summary scores between the depression interventions is analyzed next. A working value of the standardizer σq follows from the definitions of the unstandardized contrast ψ and the standardized contrast ψ* together with the reported group variances. For simplicity's sake, the null standardized effect is set as ψ0* = –0.25. Then, the hypothesis testing in terms of the standardized measure is formulated as H0: ψ* ≤ −0.25 versus H1: ψ* > –0.25. With the given data, the computations show that the standardized test statistic T* = –1.8115, the critical value t0.95(200.4582, –3.9922) = –2.3390, and the p-value = 0.0149. It is concluded that the standardized effect is significantly larger than –0.25 at α = 0.05.
With the parameter settings {μ1, μ2, μ3} = {34.7, 32.3, 35.5} and the corresponding variance estimates of the ODIN data, the statistical power associated with the previous sample size combination {75, 80, 100} is 0.6988. For a balanced structure, it can be shown that the minimum sample size set {94, 94, 94} is necessary to attain the designated power of 0.8. In this case, the total sample size is NT = 282 and the total cost is CT = 15,980 for the fixed overhead cost cO = 0 and the average operation costs {c1, c2, c3} = {20, 50, 100}. To attain the designated power 0.80 with the minimum total cost, the suggested procedure yields the optimal sample size scheme {305, 7, 15} with the total sample size NT = 327 and total cost CT = 7,950. On the other hand, the suggested allocation {73, 62, 143} incurs the least total sample size NT = 278 with CT = 18,860. In this case, the balanced design is not the optimal solution under either the minimum total sample size or the minimum total cost consideration. When the maximum total cost is 22,000, the proposed optimal sample size structure is {815, 22, 46} with NT = 883 and CT = 22,000.
Note that all the numerical results of the optimal power and sample size procedures were computed with the supplemental SAS/IML algorithms. For ease of application, two different sets of computer programs are presented for the standardized and unstandardized contrast analysis.
Conclusions and discussion
The Welch–Satterthwaite statistic and the associated approximate t distribution have important utility in accommodating the impact of heterogeneity of variance in statistical inference. The technical account of diverse hypothesis-testing frameworks enhances the theoretical implications and practical usefulness of the Welch–Satterthwaite test for contrast analysis in the detection of difference or inferiority/superiority. Moreover, the integrated treatment of different contrast effect sizes facilitates the reporting and interpretation of important findings, whether in a standardized measure scaled by the associated variabilities and design characteristics or in a simple magnitude expressed in the same metric as the original units of analysis. One important implication of this research is that the essence of the Welch–Satterthwaite procedure is properly recognized in the related power and sample size calculations without resorting to a normal approximation. Nonlinear optimization routines and systematic numerical evaluations are synthesized to give optimal sample size allocations for contrast analysis. According to the analytic examination and numerical assessment, the suggested procedures outperform the existing sample size methods based on the normal approximation and integer rounding. Essentially, the collection of computer programs covers both the two-sided and one-sided hypothesis testing for the two distinct formulations of standardized and unstandardized contrasts. The presented appraisals of statistical power, sample size, and financial budget should be useful for researchers to justify their allocation strategy and project support in planning research designs.
The general formulation of a linear contrast of group means permits a wide range of research hypotheses to be tested in ANOVA. To enhance the usefulness of contrast analysis under heterogeneity of variance, this article addresses the problem of optimal sample size calculations for the Welch–Satterthwaite test with cost constraints. The present study has three essential features. First, the two-sided and one-sided test procedures are presented for both the standardized and unstandardized contrasts in ANOVA under the heterogeneous variances assumption. Second, optimal sample size approaches are proposed for two essential problems: when the target power is fixed and the total cost needs to be minimized, and when the total cost is fixed and the actual power needs to be maximized. Third, computer codes are presented to implement the power and sample size computations of the Welch–Satterthwaite procedures. In sum, this study contributes to the current literature on optimal research designs by alleviating the limitations of existing investigations and extending the usefulness of contrast analysis in ANOVA under variance heterogeneity.
Supporting information
S1 File. SAS/IML programs for performing the tests of linear contrast.
https://doi.org/10.1371/journal.pone.0214391.s001
(PDF)
S2 File. SAS/IML programs for performing the tests of standardized contrast.
https://doi.org/10.1371/journal.pone.0214391.s002
(PDF)
References
- 1.
Kutner M. H., Nachtsheim C. J., Neter J., & Li W. (2005). Applied linear statistical models (5th ed.). New York, NY: McGraw Hill.
- 2.
Maxwell S. E., & Delaney H. D. (2004). Designing experiments and analyzing data: A model comparison perspective (2nd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.
- 3. Grissom R. J. (2000). Heterogeneity of variance in clinical data. Journal of Consulting and Clinical Psychology, 68, 155–165. pmid:10710850
- 4. Rosopa P. J., Schaffer M. M., & Schroeder A. N. (2013). Managing heteroscedasticity in general linear models. Psychological Methods, 18, 335–351. pmid:24015776
- 5. Ruscio J., & Roche B. (2012). Variance heterogeneity in published psychological research: A review and a new index. Methodology, 8, 1–11.
- 6. Satterthwaite F. E. (1946). An approximate distribution of estimate of variance components. Biometrics Bulletin, 2, 110–114. pmid:20287815
- 7. Welch B. L. (1947). The generalization of Student's problem when several different population variances are involved. Biometrika, 34, 28–35. pmid:20287819
- 8. Ioannidis J. P. (2005). Why most published research findings are false. PLoS Medicine, 2, e124. pmid:16060722
- 9. Moher D., Dulberg C. S., & Wells G. A. (1994). Statistical power, sample size, and their reporting in randomized controlled trials. Journal of the American Medical Association, 272, 122–124. pmid:8015121
- 10. Allison D. B., Allison R. L., Faith M. S., Paultre F., & Pi-Sunyer X. (1997). Power and money: Designing statistically powerful studies while minimizing financial costs. Psychological Methods, 2, 20–33.
- 11. Marcoulides G. A. (1993). Maximizing power in generalizability studies under budget constraints. Journal of Educational Statistics, 18, 197–206.
- 12. Welch B. L. (1938). The significance of the difference between two means when the population variances are unequal. Biometrika, 29, 350–362.
- 13. Luh W. M., & Guo J. H. (2016). Sample size planning for the non-inferiority or equivalence of a linear contrast with cost considerations. Psychological Methods, 21, 13–34. pmid:26121080
- 14. Dette H., & Munk A. (1997). Optimum allocation of treatments for Welch’s test in equivalence assessment. Biometrics, 53, 1143–1150. pmid:9333344
- 15. Pentico D. W. (1981). On the determination and use of optimal sample sizes for estimating the difference in means. The American Statistician, 35, 41–42.
- 16. Jan S. L., & Shieh G. (2014). Sample size determinations for Welch’s test in one-way heteroscedastic ANOVA. British Journal of Mathematical and Statistical Psychology, 67, 72–93. pmid:23316952
- 17. Shieh G., & Jan S. L. (2013). Determining sample size with a given range of mean effects in one-way heteroscedastic analysis of variance. Journal of Experimental Education, 81, 281–294.
- 18. Shieh G., & Jan S. L. (2015). Optimal sample size allocation for Welch’s test in one-way heteroscedastic ANOVA. Behavior Research Methods, 47, 374–383. pmid:24903689
- 19. Shieh G., & Jan S. L. (2015). Power and sample size calculations for testing linear combinations of group means under variance heterogeneity with applications to meta and moderation analyses. Psicologica, 36, 367–390.
- 20. Jan S. L., & Shieh G. (2016). A systematic approach to designing statistically powerful heteroscedastic 2 × 2 factorial studies while minimizing financial costs. BMC Medical Research Methodology, 16, 114. pmid:27578357
- 21. Welch B. L. (1951). On the comparison of several mean values: An alternative approach. Biometrika, 38, 330–336.
- 22. Fritz C. O., Morris P. E., & Richler J. J. (2012). Effect size estimates: Current use, calculations, and interpretation. Journal of Experimental Psychology: General, 141, 2–18.
- 23. Lakens D. (2013). Calculating and reporting effect sizes to facilitate cumulative science: A practical primer for t-tests and ANOVAs. Frontiers in Psychology, 4, 863. pmid:24324449
- 24. Takeshima N., Sozu T., Tajika A., Ogawa Y., Hayasaka Y., & Furukawa T. A. (2014). Which is more generalizable, powerful and interpretable in meta-analyses, mean difference or standardized mean difference? BMC Medical Research Methodology, 14, 30. pmid:24559167
- 25. Olejnik S., & Algina J. (2000). Measures of effect size for comparative studies: Applications, interpretations, and limitations. Contemporary Educational Psychology, 25, 241–286. pmid:10873373
- 26. Rosenthal R., Rosnow R. L., & Rubin D. B. (2000). Contrasts and effect sizes in behavioral research: A correlational approach. New York: Cambridge University Press.
- 27. Steiger J. H. (2004). Beyond the F test: Effect size confidence intervals and tests of close fit in the analysis of variance and contrast analysis. Psychological Methods, 9, 164–182. pmid:15137887
- 28. Laster L. L., & Johnson M. F. (2003). Non‐inferiority trials: the ‘at least as good as’ criterion. Statistics in Medicine, 22, 187–200. pmid:12520556
- 29. Mulla S. M., Scott I. A., Jackevicius C. A., You J. J., & Guyatt G. H. (2012). How to use a noninferiority trial: Users’ guides to the medical literature. Journal of the American Medical Association, 308, 2605–2611. pmid:23268519
- 30. Piaggio G., Elbourne D. R., Altman D. G., Pocock S. J., Evans S. J., & Consort Group. (2006). Reporting of noninferiority and equivalence randomized trials: An extension of the CONSORT statement. Journal of the American Medical Association, 295, 1152–1160. pmid:16522836
- 31. Scott I. A. (2009). Non-inferiority trials: Determining whether alternative treatments are good enough. Medical Journal of Australia, 190, 326–330. pmid:19296815
- 32. Fleming T. R., Odem-Davis K., Rothmann M. D., & Li Shen Y. (2011). Some essential considerations in the design and conduct of non-inferiority trials. Clinical Trials, 8, 432–439. pmid:21835862
- 33. Gayet-Ageron A., Agoritsas T., Rudaz S., Courvoisier D., & Perneger T. (2015). The choice of the noninferiority margin in clinical trials was driven by baseline risk, type of primary outcome, and benefits of new treatment. Journal of Clinical Epidemiology, 68, 1144–1151. pmid:25716902
- 34. Gayet-Ageron A., Jannot A. S., Agoritsas T., Rudaz S., Combescure C., & Perneger T. (2016). How do researchers determine the difference to be detected in superiority trials? Results of a survey from a panel of researchers. BMC Medical Research Methodology, 16, 89. pmid:27473336
- 35. Gladstone B. P., & Vach W. (2014). Choice of non-inferiority (NI) margins does not protect against degradation of treatment effects on an average–an observational study of registered and published NI trials. PLoS ONE, 9, e103616. pmid:25080093
- 36. Wiens B. L. (2002). Choosing an equivalence limit for noninferiority or equivalence studies. Controlled Clinical Trials, 23, 2–14. pmid:11852160
- 37. Committee for Proprietary Medicinal Products (2001). Points to consider on switching between superiority and non-inferiority. British Journal of Clinical Pharmacology, 52, 223–228. pmid:11560553
- 38. Ganju J., & Rom D. (2017). Non-inferiority versus superiority drug claims: the (not so) subtle distinction. Trials, 18, 278. pmid:28619049
- 39. Lewis J. A. (2001). Switching between superiority and non‐inferiority: an introductory note. British Journal of Clinical Pharmacology, 52, 221. pmid:11560552
- 40. Murray G. D. (2001). Switching between superiority and non‐inferiority. British Journal of Clinical Pharmacology, 52, 219. pmid:11560551
- 41. SAS Institute Inc. (2014). SAS/IML User's Guide, Version 9.3. Cary, NC: SAS Institute Inc.
- 42. Clarke G., Eubanks D., Reid C. K., O’Connor E., DeBar L. L., Lynch F., … & Gullion C. (2005). Overcoming Depression on the Internet (ODIN)(2): a randomized trial of a self-help depression skills program with reminders. Journal of Medical Internet Research, 7, e16. pmid:15998607
- 43. Hollinghurst S., Peters T. J., Kaur S., Wiles N., Lewis G., & Kessler D. (2010). Cost-effectiveness of therapist-delivered online cognitive–behavioural therapy for depression: Randomised controlled trial. The British Journal of Psychiatry, 197, 297–304. pmid:20884953