Anderson-Darling and Watson tests for the geometric distribution with estimated probability of success

  • Héctor Francisco Coronel-Brizio

    Contributed equally to this work with: Héctor Francisco Coronel-Brizio, Alejandro Raúl Hernández-Montoya, Manuel Enrique Rodríguez-Achach, Horacio Tapia-McClung, Juan Evangelista Trinidad-Segovia

    Roles Conceptualization, Data curation, Formal analysis, Methodology, Writing – original draft

    hcoronel@uv.mx

    Affiliations Instituto de Investigaciones en Inteligencia Artificial, Universidad Veracruzana, Xalapa, Veracruz, México, Facultad de Física, Universidad Veracruzana, Xalapa, Veracruz, México

  • Alejandro Raúl Hernández-Montoya

    Roles Data curation, Funding acquisition, Investigation, Writing – review & editing

    Affiliations Instituto de Investigaciones en Inteligencia Artificial, Universidad Veracruzana, Xalapa, Veracruz, México, Facultad de Física, Universidad Veracruzana, Xalapa, Veracruz, México

  • Manuel Enrique Rodríguez-Achach

    Roles Investigation, Resources, Software

    Affiliation Unidad Experimental Marista (UNEXMAR), Universidad Marista de Mérida, Mérida, Yucatán, México

  • Horacio Tapia-McClung

    Roles Data curation, Resources, Validation, Writing – review & editing

    Affiliation Instituto de Investigaciones en Inteligencia Artificial, Universidad Veracruzana, Xalapa, Veracruz, México

  • Juan Evangelista Trinidad-Segovia

    Roles Methodology, Resources, Validation, Visualization, Writing – review & editing

    Affiliation Departamento de Economía y Empresa, Universidad de Almería (UAL), Almería, España

Abstract

This paper introduces two new goodness-of-fit tests for the geometric distribution based on discrete adaptations of the Watson W² and Anderson-Darling A² statistics, where the probability of success is unknown. Although these tests are widely applied to continuous distributions, their application in discrete models has been relatively unexplored. Our study addresses this need by developing a robust statistical framework specifically for discrete distributions, particularly the geometric distribution. We provide extensive tables of asymptotic critical values for these tests and demonstrate their practical relevance through a financial case study. Specifically, we apply these tests to analyze price runs derived from daily time series of NASDAQ, DJIA, Nikkei 225, and the Mexican IPC indices, covering the period from January 1, 2015, to December 31, 2022. This work broadens the range of available tools for assessing goodness-of-fit in discrete models, which are essential for applications in finance and beyond. The Python programs developed for this paper are available to the academic community.

1 Introduction

A test of fit is a statistical methodology to assess how well a given theoretical distribution matches a sample of data. See D’Agostino and Stephens [1] for a general discussion.

In particular, we are interested in tests of fit for the geometric distribution based on the discrete Watson W² and Anderson-Darling A² statistics. The geometric distribution models the number of Bernoulli trials needed to observe the first success and is the discrete analogue of the exponential distribution. It has applications in several areas: in hydrology, to analyze water deficit for water management; in robotics, for the location recognition of unmanned vehicles and for reliability studies; in medicine, to assess the risk of infection with SARS-CoV-2 [2–4]; and in ecology, to estimate the size of a population by means of a capture-recapture methodology [5, 6]. It has also been applied in statistical process control, to ensure the quality of a production process or service over time, and to handle the multiple-reader interference problem [7, 8], among others, including finance, to study the distribution of price “runs” (see [9] and references therein). Besides the well-known textbook applications of the geometric distribution, the discrete analogue of the exponential distribution, to phenomena such as radioactive decay or quantum tunneling, we can mention the following applications in the physical sciences: particle interactions in high-energy physics, photon detection in optics, physical chemistry, biochemistry and systems modeling [10–12].

1.1 Motivation

In our opinion, there is still work to be done in developing rigorous methods for assessing goodness-of-fit for discrete models, particularly for the geometric distribution. While tests such as Anderson-Darling and Watson have been developed primarily for continuous distributions, relatively few alternatives have been adapted to discrete cases. This study aims to contribute to this issue by adapting the Anderson-Darling and Watson tests to a discrete model, in our case the geometric distribution, and to provide a valuable tool for researchers working with this kind of data.

Our motivation for this work stems from a previous investigation into the geometric distribution’s application to the duration of “price runs” in financial indices. In our earlier research [9], we analyzed the behavior of price runs in daily financial indices such as the NASDAQ, DJIA, and Nikkei 225. Through this study, we identified the need for a rigorous statistical test that could better assess the fit of geometric distributions in real-world financial data. The results of that research emphasized the limitations of current goodness-of-fit tests for discrete models, prompting the development of the current study.

1.2 Problem statement

Let us consider a geometric random variable X with unknown parameter θ, where 0 < θ ≤ 1 is the probability of success on any given trial, and probability mass function

$$p_i(\theta) = P(X = i) = \theta(1-\theta)^{i-1}, \qquad i = 1, 2, \ldots, \tag{1}$$

which gives the probability that the first success occurs on the ith trial.

The “test of fit” problem consists in assessing whether a statistical model or theoretical distribution provides a good fit to the observed data. In the present work, given a random sample of n values from Eq (1), statistical tests of fit based on the discrete versions of the well-known Watson W² and Anderson-Darling A² statistics are developed. Tests of this type have already been constructed for some important discrete distributions, such as the Poisson distribution (Spinelli and Stephens [13]), the first-digit Benford distribution (Lesperance et al. [14]) and the discrete uniform distribution (Choulakian, Lockhart and Stephens; Lockhart, Spinelli and Stephens [15, 16]).

For the case that concerns us, interesting tests of fit for the geometric distribution have also been proposed. The test in [17], for example, is based on a characterization of this distribution in terms of the conditional expectation of the second order statistic given the value of the first order statistic; the asymptotic null distribution of the test statistic is estimated by means of the bootstrap technique. Reference [18] presents a test of fit for the geometric distribution designed to detect gradual departures of the observed data from the geometric distribution, rather than abrupt discrepancies. This statistical methodology is commonly referred to as a “smooth test of fit” and follows the work of Spinelli and Stephens [13].

Reference [19] introduces and compares, by means of simulation, different goodness-of-fit tests for the geometric distribution, including the Anderson-Darling and Cramér-von Mises tests. Finally, we must cite a more theoretical and recent work that generalizes the geometric distribution, presenting several mathematical properties of this generalization as well as estimators for the new model based on maximum likelihood, moments, and proportion estimation [20].

By means of Monte Carlo simulations, reference [21] introduces and compares the relative performance of several statistical tests for the geometric distribution, noting that the best-performing single statistic is the Anderson–Darling statistic.

The tests presented in this paper differ from those mentioned above in that, for the geometric case, we derive the asymptotic distributions of the W² and A² statistics from the asymptotic theory of test statistics (see section Asymptotic theory) and tabulate their percentage points in Tables 1 and 2, respectively, for different values of the parameter θ. We also give an explicit description of the procedure needed to apply the tests developed here (see section Test procedure). The aim is to test the null hypothesis that the sample under analysis was drawn from the geometric distribution. Finally, we make available the Python code needed to perform the corresponding calculations and tests.

Table 1. Asymptotic percentage points for the statistic $\hat{W}^2$ for selected values of the parameter θ and different significance levels α.

https://doi.org/10.1371/journal.pone.0315855.t001

Table 2. Asymptotic percentage points for the statistic $\hat{A}^2$ for selected values of the parameter θ and different significance levels α.

https://doi.org/10.1371/journal.pone.0315855.t002

By way of example, we apply this methodology to daily “runs” constructed from the daily time series of four financial market indices. These data sets were chosen because of their importance in economic studies and their broad availability.

1.3 Preliminary definitions

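Following [15], define $S_j = \sum_{i=1}^{j} o_i$, $T_j = \sum_{i=1}^{j} e_i$ and $Z_j = S_j - T_j$, where $o_i$ and $e_i = n p_i(\theta)$ are the observed and expected numbers of observations in cells i = 1, 2, …, respectively, with $p_i(\theta)$ given by Eq (1) and n being the total number of observations in our dataset.

By definition, $Z_j$ is the cumulative sum of the differences between the observed frequencies $o_i$ and the theoretically expected frequencies $e_i$ over all cells or categories from i = 1 up to i = j.

In this work we will consider the following definitions of the statistics:

$$W^2 = \frac{1}{n}\sum_{j=1}^{\infty} Z_j^2\, p_j(\theta), \tag{2}$$

$$A^2 = \frac{1}{n}\sum_{j=1}^{\infty} \frac{Z_j^2\, p_j(\theta)}{H_j(1-H_j)}, \tag{3}$$

where $H_j = \sum_{i=1}^{j} p_i(\theta) = 1-(1-\theta)^j$ denotes the theoretical distribution function.

In practice, we will work with a finite number of cells. If we denote by k the index corresponding to the last non-zero frequency cell, we define:

$$W^2(\theta) = \frac{1}{n}\sum_{j=1}^{k} Z_j^2\, p_j(\theta), \tag{4}$$

$$A^2(\theta) = \frac{1}{n}\sum_{j=1}^{k} \frac{Z_j^2\, p_j(\theta)}{H_j(1-H_j)}. \tag{5}$$

If the parameter θ is unknown, it will be replaced by its maximum likelihood estimator $\hat{\theta}$ to obtain the following expressions for the test statistics:

$$\hat{W}^2 = \frac{1}{n}\sum_{j=1}^{k} \hat{Z}_j^2\, p_j(\hat{\theta}), \tag{6}$$

$$\hat{A}^2 = \frac{1}{n}\sum_{j=1}^{k} \frac{\hat{Z}_j^2\, p_j(\hat{\theta})}{\hat{H}_j(1-\hat{H}_j)}, \tag{7}$$

where $\hat{Z}_j = \sum_{i=1}^{j}\left(o_i - n p_i(\hat{\theta})\right)$ and $\hat{H}_j = 1-(1-\hat{\theta})^j$.

In our case, $\hat{\theta} = 1/\bar{x}$, where $\bar{x}$ denotes the arithmetic mean of the sample values.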
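The statistics with estimated θ can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' program: the function name `geometric_cvm_statistics` is hypothetical, and it assumes the statistics take the form (1/n) Σ Z_j² p_j and (1/n) Σ Z_j² p_j / [H_j(1 − H_j)], summed over cells 1, …, k and evaluated at the estimate 1/x̄, with H_j = 1 − (1 − θ)^j.

```python
import numpy as np

def geometric_cvm_statistics(sample):
    """Return (theta_hat, W2_hat, A2_hat) for a sample of positive integers.

    Hypothetical helper: assumes the forms (1/n) sum Z_j^2 p_j and
    (1/n) sum Z_j^2 p_j / [H_j (1 - H_j)] with theta replaced by 1 / mean.
    A sample consisting only of ones gives theta_hat = 1 and is not handled.
    """
    x = np.asarray(sample, dtype=int)
    n = x.size
    theta_hat = 1.0 / x.mean()                  # MLE: inverse of the sample mean
    k = x.max()                                 # index of the last non-empty cell
    cells = np.arange(1, k + 1)
    o = np.bincount(x, minlength=k + 1)[1:]     # observed counts o_1, ..., o_k
    p = theta_hat * (1.0 - theta_hat) ** (cells - 1)   # p_i(theta_hat), Eq (1)
    Z = np.cumsum(o - n * p)                    # cumulative observed minus expected
    H = 1.0 - (1.0 - theta_hat) ** cells        # geometric distribution function
    W2 = np.sum(Z**2 * p) / n
    A2 = np.sum(Z**2 * p / (H * (1.0 - H))) / n
    return theta_hat, W2, A2

if __name__ == "__main__":
    print(geometric_cvm_statistics([1, 1, 2, 1, 3, 1, 2, 5, 1, 2]))
```

Because H_j(1 − H_j) never exceeds 1/4, the A² value computed this way is always at least four times the W² value, which is a quick sanity check on any implementation.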

2 Asymptotic theory

In this section, a brief summary of the distribution theory of the test statistics is given. For a more detailed description, the reader is referred, for example to [15].

2.1 Known θ

Following [13], the test statistics given by Eqs (4) and (5) can be expressed as:

$$W^2(\theta) = \frac{1}{n} Z^T D Z, \tag{8}$$

$$A^2(\theta) = \frac{1}{n} Z^T D G^{-1} Z, \tag{9}$$

where Z is the vector with entries $Z_j$, for j = 1, …, k; D is the diagonal matrix whose j-th diagonal entry is $p_j(\theta)$; and G is the diagonal matrix with elements $H_j(1 - H_j)$. The statistics have the general form:

$$Q_n = \frac{1}{n} Z^T V Z, \tag{10}$$

where V is a positive definite symmetric matrix. We have V = D for $W^2(\theta)$ and $V = DG^{-1}$ for $A^2(\theta)$.

Let us denote by o and e the vectors of observed and expected values, respectively, of the cells; that is, $o^T = [o_1, \ldots, o_k]$ and $e^T = [e_1, \ldots, e_k]$.

Since o has a multinomial distribution with parameter $P^T = [p_1(\theta), \ldots, p_k(\theta)]$, its mean vector and covariance matrix are $nP$ and $n(D - PP^T)$, and by the central limit theorem, $(o - e)/\sqrt{n}$ converges to a multivariate normal distribution with mean vector zero and covariance matrix $\Sigma_0 = D - PP^T$.

On the other hand, $Z = R(o - e)$, where R denotes the lower-triangular matrix with unit elements (also called the partial-sum matrix), and $Z/\sqrt{n}$ converges to a multivariate normal distribution with mean vector 0 and covariance matrix $\Sigma = R\Sigma_0R^T$, whose elements are given by $\min(H_i, H_j) - H_iH_j$.

By defining $X = V^{1/2} Z/\sqrt{n}$, X converges to a normal distribution with mean vector 0 and covariance matrix $\Sigma_X = V^{1/2}\Sigma V^{1/2}$.

Also, $Q_n$ in Eq (10) can be expressed as $Q_n = X^TX$, whose asymptotic distribution is that of

$$\sum_{i=1}^{k} \lambda_i \nu_i^2,$$

where $\nu_1, \ldots, \nu_k$ are independent standard normal random variables and $\lambda_1, \ldots, \lambda_k$ are the eigenvalues of $\Sigma_X$.
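For the known-θ case, the eigenvalues can be obtained numerically as in the following sketch. The function name and the truncation at k cells are illustrative choices, and the code assumes that Σ has elements min(H_i, H_j) − H_i H_j, that V = D for W² and V = DG⁻¹ for A², and that H_j = 1 − (1 − θ)^j.

```python
import numpy as np

def known_theta_eigenvalues(theta, k, statistic="W2"):
    """Eigenvalues of V^(1/2) Sigma V^(1/2) for cells 1..k (hypothetical helper)."""
    j = np.arange(1, k + 1)
    p = theta * (1.0 - theta) ** (j - 1)             # cell probabilities p_j(theta)
    surv = (1.0 - theta) ** j                        # 1 - H_j
    H = 1.0 - surv                                   # distribution function H_j
    Sigma = np.minimum.outer(H, H) - np.outer(H, H)  # min(H_i, H_j) - H_i H_j
    if statistic == "W2":
        v = p                                        # diagonal of V = D
    else:
        v = p / (H * surv)                           # diagonal of V = D G^{-1}
    root_v = np.sqrt(v)
    # V is diagonal, so V^(1/2) Sigma V^(1/2) is an elementwise rescaling of Sigma.
    Sigma_X = root_v[:, None] * Sigma * root_v[None, :]
    return np.sort(np.linalg.eigvalsh(Sigma_X))[::-1]  # largest eigenvalue first

if __name__ == "__main__":
    print(known_theta_eigenvalues(0.5, 30, "A2")[:5])
```

One check that follows from these definitions: for A² the eigenvalues sum to the trace Σ_j p_j = H_k, which tends to 1 as k grows.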

2.2 Estimated θ

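When the parameter θ is unknown, it must be replaced by its maximum likelihood estimator (or another efficient estimator) $\hat{\theta}$. In this case, we define the vector $\hat{Z}$ with elements $\hat{Z}_j = \sum_{i=1}^{j}\left(o_i - n p_i(\hat{\theta})\right)$, j = 1, …, k. Under the usual regularity conditions [22], for the case of a single parameter, $\hat{Z}/\sqrt{n}$ converges to a normal random vector with mean vector 0 and covariance matrix $\Sigma_u = \Sigma - bb^T/I(\theta)$, where $I(\theta) = 1/[\theta^2(1-\theta)]$ is the Fisher information per observation and, with $H_j = 1-(1-\theta)^j$, the j-th element of b is:

$$b_j = \frac{\partial H_j}{\partial \theta} = j(1-\theta)^{j-1}.$$

Again, the asymptotic distribution is that of

$$\sum_{i=1}^{k} \lambda_i \nu_i^2,$$

but now the eigenvalues are those of the matrix $V^{1/2}\Sigma_u V^{1/2}$, where V = D for the W² statistic and $V = DG^{-1}$ for the A² statistic.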
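The estimated-θ eigenvalues can be sketched in the same way. The correction used below is an assumption of this example: it takes the standard maximum-likelihood form Σ_u = Σ − b bᵀ / I(θ), with b_j = ∂H_j/∂θ = j(1 − θ)^(j−1) and Fisher information I(θ) = 1/[θ²(1 − θ)] for the geometric distribution on 1, 2, …. The function name is hypothetical.

```python
import numpy as np

def estimated_theta_eigenvalues(theta, k, statistic="W2"):
    """Eigenvalues of V^(1/2) Sigma_u V^(1/2) for cells 1..k (hypothetical helper).

    Assumes Sigma_u = Sigma - b b^T / I(theta), the usual covariance
    correction when a single parameter is estimated by maximum likelihood.
    """
    j = np.arange(1, k + 1)
    p = theta * (1.0 - theta) ** (j - 1)
    surv = (1.0 - theta) ** j                        # 1 - H_j
    H = 1.0 - surv
    Sigma = np.minimum.outer(H, H) - np.outer(H, H)  # known-theta covariance
    b = j * (1.0 - theta) ** (j - 1)                 # dH_j / dtheta
    fisher = 1.0 / (theta**2 * (1.0 - theta))        # information per observation
    Sigma_u = Sigma - np.outer(b, b) / fisher
    v = p if statistic == "W2" else p / (H * surv)   # diagonal of V
    root_v = np.sqrt(v)
    M = root_v[:, None] * Sigma_u * root_v[None, :]
    return np.sort(np.linalg.eigvalsh(M))[::-1]

if __name__ == "__main__":
    print(estimated_theta_eigenvalues(0.5, 30, "W2")[:5])
```

Estimating θ removes a rank-one term from the covariance, so the eigenvalue sum (and hence the mean of the limiting distribution) is smaller than in the known-θ case.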

2.3 Calculation of the asymptotic percentage points

In this work, the eigenvalues associated with each test statistic were obtained, and the asymptotic percentage points were found using a Python implementation of the Lindsay-Pilla-Basak method [23–25]. According to these references, the method is accurate to two or three decimal places. For those interested, an R implementation of this method also exists [25, 26]. The percentage points were calculated for increasing values of k until convergence was achieved. The final results are given in Tables 1 and 2.
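As a rough cross-check on percentage points of a weighted sum of chi-squared variables, one can simulate the sum directly. The sketch below is plain Monte Carlo, not the Lindsay-Pilla-Basak method, and is much less precise for the same effort. The function name, the default number of draws and the seed are arbitrary choices.

```python
import numpy as np

def weighted_chi2_percentage_points(eigenvalues, alphas=(0.10, 0.05, 0.01),
                                    n_draws=200_000, seed=0):
    """Upper-tail percentage points of sum_i lambda_i * nu_i^2 by simulation."""
    rng = np.random.default_rng(seed)
    total = np.zeros(n_draws)
    # Accumulate one eigenvalue at a time to keep memory use at O(n_draws).
    for lam in np.asarray(eigenvalues, dtype=float):
        total += lam * rng.standard_normal(n_draws) ** 2
    # The upper alpha point is the (1 - alpha) quantile of the simulated sums.
    return {a: float(np.quantile(total, 1.0 - a)) for a in alphas}

if __name__ == "__main__":
    # A single unit eigenvalue reduces to a chi-squared variable with 1 d.f.
    print(weighted_chi2_percentage_points([1.0]))
```

With a single unit eigenvalue the 5% point should land near 3.84, the familiar chi-squared value with one degree of freedom, which gives a quick way to confirm that the simulation is wired correctly.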

In order to investigate the speed of convergence of the empirical percentage points, 25000 samples were simulated for selected values of θ and n, and the empirical percentage points of each statistic were computed. The results, shown in Tables 3 and 4, indicate that the asymptotic points can be used with good accuracy for moderately large values of n (a usual case in practice). In these tables, the rows marked with infinity (∞) give the values calculated from the asymptotic distribution, that is, the distribution of the test statistics W² and A² as the sample size n approaches infinity. These asymptotic values provide a theoretical benchmark derived from the limiting behavior of the statistics as the number of observations grows large.
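A simulation of this kind can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the helper names are hypothetical, the statistics are assumed to have the form (1/n) Σ Z_j² p_j and (1/n) Σ Z_j² p_j / [H_j(1 − H_j)] evaluated at the estimate 1/x̄, and a sample made up only of ones is assigned the value zero by convention.

```python
import numpy as np

def _statistics(x):
    """(W2_hat, A2_hat) for one sample of positive integers (hypothetical helper)."""
    n = x.size
    theta_hat = 1.0 / x.mean()
    if theta_hat >= 1.0:          # every observation equals 1: degenerate fit,
        return 0.0, 0.0           # set to zero by a convention assumed here
    j = np.arange(1, x.max() + 1)
    p = theta_hat * (1.0 - theta_hat) ** (j - 1)
    surv = (1.0 - theta_hat) ** j                     # 1 - H_j
    Z = np.cumsum(np.bincount(x, minlength=j.size + 1)[1:] - n * p)
    w2 = np.sum(Z**2 * p) / n
    a2 = np.sum(Z**2 * p / ((1.0 - surv) * surv)) / n
    return w2, a2

def empirical_percentage_points(theta, n, alphas=(0.10, 0.05, 0.01),
                                n_sim=25_000, seed=0):
    """Simulated upper percentage points, returned as {alpha: (W2, A2)}."""
    rng = np.random.default_rng(seed)
    # Generator.geometric draws the number of trials to first success (1, 2, ...).
    stats = np.array([_statistics(rng.geometric(theta, size=n))
                      for _ in range(n_sim)])
    return {a: tuple(np.quantile(stats, 1.0 - a, axis=0)) for a in alphas}

if __name__ == "__main__":
    print(empirical_percentage_points(0.5, 200, n_sim=2_000))
```

Running it for increasing n at a fixed θ shows how quickly the simulated points settle, which is the comparison the tables summarize.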

Table 3. Empirical percentage points of the statistic $\hat{W}^2$ for selected values of the sample size n and the parameter θ, based on 25000 simulations and with different significance levels α.

https://doi.org/10.1371/journal.pone.0315855.t003

Table 4. Empirical percentage points of the statistic $\hat{A}^2$ for selected values of the sample size n and the parameter θ, based on 25000 simulations and with different significance levels α.

https://doi.org/10.1371/journal.pone.0315855.t004

3 Test procedure

In order to perform a test of fit for the geometric distribution (1), given n observed values x1, …, xn, the procedure is as follows:

  1. Compute the sample estimate of the parameter θ as the inverse of the sample mean; i.e., $\hat{\theta} = 1/\bar{x}$.
  2. Consider the cells 1, …, k, where k is the index of the last non-empty cell, and compute $p_i(\hat{\theta})$ using expression (1) for i = 1, …, k, replacing θ with the value $\hat{\theta}$.
  3. Compute the values of the statistics $\hat{W}^2$ and $\hat{A}^2$ using expressions (6) and (7) above.
  4. Refer to Tables 1 and/or 2 corresponding to the test statistic, entering the table at the estimated value $\hat{\theta}$.
  5. If the value of the test statistic exceeds the tabulated value for a given significance level α, the null hypothesis that the sample was drawn from the distribution (1) is rejected at that level.
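The five steps above can be strung together as in the sketch below. It is illustrative only: the function name is hypothetical, the statistics are assumed to be (1/n) Σ Z_j² p_j and (1/n) Σ Z_j² p_j / [H_j(1 − H_j)] evaluated at the estimate 1/x̄, and the critical value must be supplied by the caller after reading it from Table 1 or Table 2 at the estimated θ and the chosen α. The number used in the demonstration is a placeholder, not a tabulated value.

```python
import numpy as np

def geometric_fit_test(sample, critical_value, statistic="W2"):
    """Apply the test to a sample of positive integers (hypothetical helper)."""
    x = np.asarray(sample, dtype=int)
    n = x.size
    theta_hat = 1.0 / x.mean()                         # step 1: inverse sample mean
    j = np.arange(1, x.max() + 1)                      # step 2: cells 1, ..., k
    p = theta_hat * (1.0 - theta_hat) ** (j - 1)       #         p_i(theta_hat)
    o = np.bincount(x, minlength=j.size + 1)[1:]       # observed counts per cell
    Z = np.cumsum(o - n * p)                           # step 3: cumulative differences
    H = 1.0 - (1.0 - theta_hat) ** j
    weights = p if statistic == "W2" else p / (H * (1.0 - H))
    value = float(np.sum(Z**2 * weights) / n)
    # Steps 4 and 5: compare with the value read from the table by the caller.
    return {"theta_hat": theta_hat, "statistic": value,
            "reject": value > critical_value}

if __name__ == "__main__":
    sample = [1, 1, 2, 1, 3, 1, 2, 5, 1, 2]
    # 0.1 is a placeholder critical value, NOT taken from Table 1.
    print(geometric_fit_test(sample, critical_value=0.1, statistic="W2"))
```

Keeping the critical value as an argument leaves the table lookup, including any interpolation between tabulated values of θ, as an explicit decision of the analyst.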

4 Example

First, let us recall that if the price of a security or the value of a financial index moves in the same direction, either increasing or decreasing, for N consecutive days, we say that we have a bullish or bearish streak of N consecutive days, or a “run” of length N days.

In reference [9], an analysis of run lengths is presented, in which the simple geometric statistical model with θ = 1/2, the probability that the market goes up or down in a day, is compared empirically with daily data of financial indices. The same reference includes an extensive and up-to-date bibliography on classic and current research on runs.

Daily closing values of the NASDAQ, DJIA, Nikkei 225 and Mexican IPC stock indices from January 1, 2015 to December 31, 2022 were analyzed. For each of these daily time series, the lengths in days of the uninterrupted daily trends (upward and downward) were calculated. The runs data were constructed from free financial data downloaded from the Yahoo Finance website, https://finance.yahoo.com/. As mentioned before, our objective is to test the null hypothesis that the sample was drawn from a geometric distribution.
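One possible way to turn a series of closing values into run lengths is sketched below. The function name is hypothetical, and two choices are assumptions of this example rather than details taken from the paper: days on which the close is unchanged are dropped before counting, and the final run is kept even though the end of the data may cut it short.

```python
import numpy as np

def run_lengths(closes):
    """Lengths of consecutive same-direction daily moves (hypothetical helper)."""
    closes = np.asarray(closes, dtype=float)
    signs = np.sign(np.diff(closes))          # +1 for an up day, -1 for a down day
    signs = signs[signs != 0]                 # assumption: ignore unchanged days
    if signs.size == 0:
        return np.array([], dtype=int)
    # Positions where the direction differs from the previous day start a new run.
    change = np.flatnonzero(signs[1:] != signs[:-1]) + 1
    edges = np.concatenate(([0], change, [signs.size]))
    return np.diff(edges)

if __name__ == "__main__":
    closes = [10, 11, 12, 11, 10, 9, 10]      # two up days, three down, one up
    print(run_lengths(closes))                # run lengths 2, 3 and 1
```

The resulting array of positive integers is the kind of sample to which the tests of fit are applied, with upward and downward runs pooled together.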

Runs data for the four financial markets analyzed in this paper are shown in Table 5, together with summary statistics and the calculated values of the test statistics. In Table 5: a) Length refers to the duration in days of each run; uninterrupted trends lasting from one to thirteen days were observed. The number of observed runs of each length is shown for each market; for example, for the NASDAQ we observed 539 uninterrupted trends of length one day, 253 runs of length two days, etc. b) n indicates the total number of recorded runs; c) $\bar{x}$ denotes the average run length; d) is the estimate $\hat{\theta} = 1/\bar{x}$; e) gives the 95% confidence interval for θ; and finally, f) and g) show the values of W² and A² computed at the estimated θ, respectively, with their p-values denoted by pv. The calculated values of the test statistics are discussed in the next section, Results.

Table 5. Distribution of lengths of trends and summary statistics for all analyzed markets.

a) Observed run duration; b) total number of runs; c) observed average run length; d) estimated θ; e) 95% C.I. for θ; f) and g) estimated values of the W² and A² statistics, respectively, with corresponding p–values.

https://doi.org/10.1371/journal.pone.0315855.t005

5 Results

According to the estimated value of θ, from Tables 1 and 2, it was found that in all cases, the p-values do not support the rejection of the null hypothesis. It is then concluded that the lengths of the trends can be appropriately described by the geometric distribution.

Also, from Table 5, it can be seen that the 95% confidence intervals [0.490, 0.534], [0.484, 0.528] and [0.487, 0.531], corresponding to the NASDAQ, DJIA and Nikkei 225 series, respectively, all contain θ = 0.5, so the null hypothesis θ = 0.5 would not be rejected at the 5% significance level. For the Mexican IPC series, by contrast, the 95% confidence interval [0.454, 0.498] does not contain θ = 0.5, which would produce an approximate p-value of 2.5% in testing the null hypothesis θ = 0.5.
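The paper does not state how the confidence intervals for θ were computed. One common choice, assumed here purely for illustration, is the Wald interval based on the asymptotic variance θ²(1 − θ)/n of the maximum likelihood estimator of the geometric parameter. The function name and the input numbers below are hypothetical and are not taken from Table 5.

```python
import math

def wald_interval_theta(theta_hat, n, z=1.96):
    """Approximate confidence interval for the geometric parameter theta.

    Assumption: Wald interval with standard error theta_hat * sqrt((1 - theta_hat) / n),
    i.e. the square root of the inverse Fisher information divided by n.
    """
    se = theta_hat * math.sqrt((1.0 - theta_hat) / n)
    return theta_hat - z * se, theta_hat + z * se

if __name__ == "__main__":
    # Hypothetical inputs for illustration only.
    lower, upper = wald_interval_theta(0.5, 1000)
    print(round(lower, 4), round(upper, 4))
```

With roughly a thousand runs and θ near 0.5, this interval has a half-width of about 0.022, the same order as the widths of the intervals quoted above.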

6 Conclusion

Tests of fit for the geometric distribution based on the discrete versions of the Watson W² and Anderson-Darling A² statistics are developed, focusing on the case of an unknown probability of success. Formulas to compute these two statistics when the parameter θ is unknown are given in section Preliminary definitions by Eqs (6) and (7), respectively.

Additionally, we present moderately extensive tables of asymptotic percentage points for each of these two statistics. The procedure for fitting a geometric distribution to data and assessing the quality of the fit is given as a list of instructions in section Test procedure.

In addition, as an illustration of the proposed methodology, this research presents a comprehensive analysis of the fit of the geometric distribution to financial series of daily uninterrupted trends, or “runs”. The financial indices analyzed were the NASDAQ, DJIA, Nikkei 225 and IPC.

An interesting finding of our study is that the trend lengths across the four major financial indices analyzed here predominantly adhere to a geometric distribution. This alignment is particularly notable with an estimated parameter value of θ close to 0.5, suggesting a near-equal probability of trend direction changes in these financial markets. However, our analysis also reveals variations in specific cases, such as the Mexican IPC series, where θ diverges from 0.5. This divergence is not just a statistical anomaly but potentially indicates a higher or lower likelihood of prolonged trends in this market.

In reference [9], where signed trends were analyzed over a longer period than the one used here, a similar divergence in the value of θ and more prolonged trends were also observed for the IPC market. This behavior could be a consequence of the fact that the IPC is an emerging market, which is less mature and more volatile than the other markets analyzed. Further studies are necessary to fully understand these results.

Such insights could be instrumental for investors and market analysts in understanding market dynamics and making informed decisions.

While we think that our findings are robust for the data sets and time frame considered, they are not without limitations. The assumption of a constant θ value over time might oversimplify the complex and dynamic nature of financial markets. Additionally, the methodology’s reliance on historical data means it may not fully capture future market behaviors, especially in the face of unprecedented events or changes in market regulations.

Future research could expand upon this work in several ways. Firstly, exploring the variability of θ over different market conditions or time periods could provide a better understanding of market behavior. Secondly, applying these tests to a broader range of financial instruments, including emerging market indices and cryptocurrency markets, could validate the universality of the geometric distribution in financial trend analysis. Lastly, integrating machine learning techniques to predict changes in θ could offer groundbreaking tools for market prediction and investment strategy development.

In conclusion, our study contributes significantly to the statistical analysis of financial markets, offering a methodological framework that can be employed and expanded upon in various financial contexts. The insights gained highlight the intricate patterns underlying market trends and open avenues for future research to further understanding of these complexities.

Acknowledgments

The authors want to thank MSc Selene Jimenez for her LaTeX typesetting help on the manuscript.

References

  1. D’Agostino RB and Stephens MA (ed) (1986): Goodness-Of-Fit Techniques. Taylor & Francis Group. e-ISBN 9780203753064.
  2. Mathier L, Perreault L, and Bobe B (1992): The use of geometric and gamma-related distributions for frequency analysis of water deficit. Stochastic Hydrology and Hydraulics, 6, 239–254.
  3. Cao W, Huang X and Shu F (2020): Location recognition of unmanned vehicles based on visual semantic information and geometric distribution. Proc IMechE Part D: J Automobile Engineering 235(2-3), 552–563.
  4. Polymenis A (2021): An application of the geometric distribution for assessing the risk of infection with SARS-CoV-2 by location. Asian J of Medical Sci. 12(10), 8–12.
  5. Anan O, Böhning D and Maruotti A (2019): On the Turing estimator in capture-recapture count data under the geometric distribution. Metrika 82, 149–172.
  6. Niwitpong SA, Böhning D, van der Heijden PG and Holling H (2013): Capture-recapture estimation based upon the geometric distribution allowing for heterogeneity. Metrika 76, 495–519.
  7. Zhang J, Li E and Lin Z (2017): A Cramér-von Mises test-based distribution-free control chart for joint monitoring of location and scale. Computers & Industrial Engineering, 110:484–497.
  8. Duan L, Wang ZJ and Duan F (2016): Geometric distribution-based readers scheduling optimization algorithm using artificial immune system. Sensors 16(11), 1924. pmid:27854342
  9. Olivares-Sánchez HR, Rodríguez-Martínez CM, Coronel-Brizio HF, Scalas E, Seligman TH and Hernández-Montoya AR (2022): An empirical data analysis of “price runs” in daily financial indices: dynamically assessing market distributional geometric behavior. Plos ONE, 17(7):e0270492. pmid:35797336
  10. Antoni T et al (2005): Geometric structures in hadronic cores of extensive air showers observed by KASCADE. Phys. Rev. D 71(7) 072002.
  11. Nair RR, Wilk G and Włodarczyk Z (2023): Geometric Poisson distribution of photons produced in the ultrarelativistic hadronic collisions. Eur. Phys. J. A 59:203.
  12. Michel D (2013): Simply conceiving the Arrhenius law and absolute kinetic constants using the geometric distribution. Physica A: Statistical Mechanics and its Applications, 392(19), 4258–4264.
  13. Spinelli JJ and Stephens MA (1997): Cramér-von Mises tests of fit for the Poisson distribution. Can. Jour. Statist., 25, 2:257–268.
  14. Lesperance M, Reed WJ, Stephens MA, Tsao C and Wilton B (2016): Assessing Conformance with Benford’s Law: Goodness-Of-Fit Tests and Simultaneous Confidence Intervals. Plos ONE 11(3): e0151235. pmid:27018999
  15. Choulakian V, Lockhart RA and Stephens MA (1994): Cramér-von Mises statistics for discrete distributions. Can. Jour. Statist., 22:125–137.
  16. Lockhart RA, Spinelli JJ and Stephens MA (2007): Cramér-von Mises statistics for discrete distributions with unknown parameters. Can. Jour. Statist., 35,1:125–133.
  17. Jiménez-Gamero MD and Alba-Fernández MV (2021): A test for the geometric distribution based on linear regression of order statistics. Mathematics and Computers in Simulation, 186:103–123.
  18. Best DJ and Rayner JCW (1989): Goodness of Fit for the Geometric Distribution. Biom. J. 31:307–311.
  19. Ozonur D, Gokpinar E, Gokpinar F, Bayrak H and Gul HH (2013): Comparisons of the Goodness of Fit Tests for the Geometric Distribution. Gazi University Journal of Science 26(3), 369–375. ISSN 303-9709.
  20. Alosey ARE and Gemeay AM (2025): A Novel Version of Geometric Distribution: Method and Application. Computational Journal of Mathematical and Statistical Sciences 4(1), 1–16.
  21. Best DJ and Rayner JCW (2007): Tests of Fit for the Geometric Distribution. Communications in Statistics - Simulation and Computation, 32:4, 1065–1078.
  22. Bishop YM, Fienberg SE and Holland PW (1975): Discrete Multivariate Analysis. MIT Press. Cambridge, Mass. ISBN 978-0-387-72805-6.
  23. Lindsay BG, Pilla RS and Basak P (2000): Moment-based approximations of distributions using mixtures: Theory and applications. Annals of the Institute of Statistical Mathematics, 52(2):215–230.
  24. Bodenham DA (2016): R package momentchi2: Moment-Matching Methods for Weighted Sums of Chi-Squared Random Variables. Available at: https://cran.r-project.org/package=momentchi2
  25. Bodenham DA and Adams NM (2016): A comparison of efficient approximations for a weighted sum of chi-squared random variables. Statistics and Computing 26(4):917–928.
  26. R Core Team (2023): R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.