Abstract
Data scarcity and discontinuity are common in healthcare and epidemiological datasets, which are often needed to make informed decisions and forecast upcoming scenarios. To avoid these problems, such data are often processed as monthly/yearly aggregates, at which scale prevalent forecasting tools like Autoregressive Integrated Moving Average (ARIMA), Seasonal Autoregressive Integrated Moving Average (SARIMA), and TBATS often fail to provide satisfactory results. Artificial data synthesis methods have proven to be a powerful tool for tackling these challenges. This paper proposes a novel algorithm, the Stochastic Bayesian Downscaling (SBD) algorithm, based on the Bayesian approach, which can regenerate downscaled time series of varying time lengths from aggregated data, preserving most of the statistical characteristics and the aggregated sum of the original data. The paper presents two epidemiological time series case studies from Bangladesh (Dengue, COVID-19) to showcase the workflow of the algorithm. The case studies illustrate that the synthesized data agree with the original data regarding statistical properties, trend, seasonality, and residuals. Regarding forecasting performance, using the last 12 years of Dengue infection data in Bangladesh, we were able to decrease error terms by up to 72.76% using synthetic data over actual aggregated data.
Citation: Al Mobin M, Kamrujjaman M (2023) Downscaling epidemiological time series data for improving forecasting accuracy: An algorithmic approach. PLoS ONE 18(12): e0295803. https://doi.org/10.1371/journal.pone.0295803
Editor: Salim Heddam, University 20 Aout 1955 Skikda, Algeria, ALGERIA
Received: September 14, 2023; Accepted: November 29, 2023; Published: December 14, 2023
Copyright: © 2023 Al Mobin, Kamrujjaman. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the paper and its Supporting Information files.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Any process that derives high-resolution data from low-resolution variables is referred to as downscaling. It relies on dynamical or statistical approaches and is extensively utilized in meteorology, climatology, and remote sensing [1, 2]. Downscaling methods have been explored significantly in geology and climatology to enhance the output of existing models like the General Circulation Model (GCM) [3–8], Regional Climate Model (RCM) [9], Integrated Grid Modeling System (IGMS) [10], and System Advisor Model (SAM) [10], and to make them usable for forecasts over geographically significant regions and times. Several methods have been used to downscale these data, such as BCC/RCG-Weather Generators (BCC/RCG-WG) [11–13], the Statistical Downscaling Model (SDSM) [11, 14–19], and Bayesian Model Averaging (BMA) [20]. Machine learning methods have also been used, like the Genetic Algorithm (GA) [9], K-Nearest Neighbour Resampling (KNNR) [9], and Support Vector Machines (SVM) [11, 21–23]. Except for the machine learning algorithms, which are general-purpose methods finding applications in new domains, the rest of these methods are tailored to suit the outputs of the models mentioned above.
This class of methods has recently been applied to the disaggregation of spatial epidemiological data [24, 25]. Nevertheless, significant work has yet to be done on the temporal downscaling of epidemiological data. Existing temporal downscaling techniques are often classical interpolation techniques that do not do justice to aggregated data. This phenomenon is well illustrated with an example. Consider the monthly Dengue infection data of 2017 from Fig 1, which has been downscaled using linear interpolation by treating the aggregated value as the value at the end date of each month in Fig 2. In this case, the monthly aggregate of the downscaled data does not match the original aggregate. Downscaled data that differs from the original in such statistical measures will result in decisions and knowledge that can be far from the truth.
The monthly aggregate of DENV infection in Bangladesh in the year 2017. The data has been aggregated to a monthly scale to avoid the discontinuity observed at the daily scale.
The figure depicts the data downscaled using linear interpolation by treating the aggregated value as the value at the end date of each month, using the data illustrated in Fig 1. In this case, the monthly aggregate of the downscaled data does not match the original aggregate. Downscaled data that differs from the original in such statistical measures will result in decisions and knowledge that can be far from the truth.
The paper aims to achieve the following:
- To propose a novel algorithm named Stochastic Bayesian Downscaling (SBD) algorithm based on the Bayesian approach that can regenerate downscaled temporal time series of varying time lengths from aggregated data preserving most of the statistical characteristics and the aggregated sum of the original data.
- To present two downscaling case studies of epidemiological time series data (namely Dengue and COVID-19 data of Bangladesh) to showcase the workflow and efficacy of the algorithm.
- To present a comparison between the forecasting performance of aggregated data and algorithm-generated synthetic data, showcasing the improvement achieved (for synthetic data over aggregated data) in terms of a scale-independent error.
The paper is organized as follows. The Materials and methods section describes the data used in the paper and its sources, along with the methodology and the proposed SBD algorithm. The section titled “Comparison of the synthesized data with the real data” compares the synthesized data with the actual data in two different epidemiological cases (Dengue and COVID-19) in Bangladesh, shows how the SBD algorithm can generate a statistically accurate approximation of the actual data with very little input in both cases, and discusses the benchmark metrics used for evaluating the output. The section titled “Improvements in forecasting accuracy” demonstrates the improvement in forecasting accuracy when using synthesized data over aggregated data with a statistical forecasting toolbox in the dengue scenario of Bangladesh, using the last 12 years of monthly aggregated data, along with the forecasting model selection procedures and residuals. Finally, the Conclusion section gives an overview of the paper, its contribution to the existing literature, scopes for improvement, and fields of application of the SBD algorithm.
Materials and methods
Data
The dengue data from Bangladesh used in this paper are from January 2010 to July 2022 and are collected from DGHS [26], and IEDCR [27]. The COVID-19 data of Bangladesh are from 8 March 2020 to December 2020 and are collected from the WHO data repository [28].
Methodology
The SBD algorithm can be segmented into three sequential parts, as exhibited in Fig 3. Initially, the algorithm considers a prior distribution to generate synthetic downscaled data; the SBD algorithm takes the aggregated data as the prior distribution of the downscaled data. For example, if we have the monthly epidemiological data of dengue for the year 2017, then to attain the prior distribution for the downscaled data, we divide each monthly value by 30. This is illustrated in Figs 1 and 2: Fig 1 depicts the monthly distribution of DENV (Dengue virus) infection in Bangladesh for the year 2017, and Fig 2 represents the prior distribution obtained by the method described above.
The diagram depicts the flow of the proposed algorithm, which works in three unique phases. The first phase (Initial Data Generator) generates an initial approximation based on the prior distribution, the second phase (Overthrow Correction) removes any abrupt fluctuation introduced during the de-aggregation in the first step, and the final step (Volume Correction) rectifies any displacement of data point volume over the aggregation units caused by the second step, so that the aggregation of the downscaled data agrees with the initial data.
Based on the prior distribution, the initial statistical properties of the synthetic data are obtained, except for the standard deviation (σ). As σ is scale independent, the scaling method used to obtain the prior distribution from the monthly aggregate keeps σ identical to that of the monthly aggregate. To overcome this problem, we consider the relation in (1), where σ0 is the standard deviation used for the distribution fitted by the algorithm to generate the downscaled data, and σprior is the standard deviation of the obtained prior distribution. Later, in the section titled “Comparison of the synthesized data with the real data”, we will see that the initial assumption on the standard deviation in (1) is a good approximation for the downscaled data.
Initial data generation.
The “Initial Data Generator” phase takes the aggregated data, the length of the aggregation interval, and σ0, and produces initial downscaled data based on a “Distribution Generator”. Based on the prior distribution, a proper statistical probability distribution (PD) is selected to be fitted to generate the data. The “Distribution Generator” fits the selected PD to the prior distribution based on the statistical properties obtained in the initial phase. The challenge in this and every other step of the algorithm is ensuring that the synthetic data produced are non-negative integers, as we are dealing with epidemiological data. Thus, specific measures have been deployed to tackle these challenges:
- To ensure non-negativity, the generated values are passed through a transformation that maps negative draws to non-negative values.
- To ensure that the data points are integers irrespective of the selected PD, we round the data to the nearest integer and then add or subtract one at randomly selected data points in each aggregated unit until the synthesized data has the same sum as the aggregated unit.
Imposing these measures, the “Distribution Generator” generates a synthetic distribution for each aggregated unit, and looping over the entire aggregated timeline generates the initial distribution of the downscaled data with respect to the aggregated data. This initial distribution is a suitable approximation of the actual data but can be improved with further refinement. The synthetic data aggregates back exactly to the aggregated data from which it was generated.
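As an illustration, the generation step for a single aggregated unit can be sketched in R as follows; the function name, the normal underlying distribution, and the absolute-value transformation for non-negativity are illustrative assumptions, not fixed parts of the algorithm.

```r
# Sketch of the "Distribution Generator" for one aggregated unit: `total` is
# the aggregated count, `len` the number of days in the unit, `sigma` the
# standard deviation sigma_0. Assumes a normal underlying distribution.
distribution_generator <- function(total, len, sigma) {
  y <- abs(rnorm(len, mean = total / len, sd = sigma))  # draw, force non-negative
  y <- round(y)                                         # force integer counts
  d <- total - sum(y)                                   # volume lost/gained by rounding
  while (d != 0) {                                      # adjust random indices by 1
    i <- sample(len, 1)
    if (d > 0) { y[i] <- y[i] + 1; d <- d - 1 }
    else if (y[i] > 0) { y[i] <- y[i] - 1; d <- d + 1 }
  }
  y
}
```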
Overthrow correction.
This step is often necessary for time series with an abrupt change in gradient, or when the initial approximation contains abnormally large overthrows, since the approximations are probabilistic. For data with an abrupt change in gradient, the initial approximation is often left with a staircase-like structure, as exhibited in Fig 4. The problem can be corrected using the overthrow correction measure, which is demonstrated in Fig 5.
Initial approximation without overthrow correction exhibits a staircase-like property due to the high gradient change of the prior distribution.
Initial approximation with overthrow correction exhibits a much better approximation of the real-case scenario, preserving its original trend.
The overthrow correction step takes a tolerance δ, an iteration limit n, and a radius r of an open interval. The step first detects overthrows using the tolerance between neighboring points: if y_i − y_{i−1} > δ or y_i − y_{i+1} > δ, then y_i is an overthrow. After identifying an overthrow, we consider an open interval of radius r around the overthrow point and execute the distribution generator on that open interval. This redistributes the sample within the open interval, diminishing the overthrow to some extent. This process is iterated n times over the entire time series to ensure satisfactory results. The strength of the overthrow correction is dictated by the two parameters δ and n: it is directly proportional to n and inversely proportional to δ. Selecting the correct parameter values can ensure a good approximation of the real-life scenario.
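A minimal sketch of one such correction pass, reusing the distribution_generator() sketch from above; clipping the interval at the series boundary is our assumption.

```r
# Sketch of the overthrow-correction phase: detect points exceeding the
# tolerance `delta` relative to a neighbour, then regenerate the open
# interval of radius `r` around each such point; repeat `n` times.
overthrow_correction <- function(y, delta, n, r, sigma) {
  for (iter in seq_len(n)) {
    for (i in 2:(length(y) - 1)) {
      if (y[i] - y[i - 1] > delta || y[i] - y[i + 1] > delta) {
        lo <- max(1, i - r); hi <- min(length(y), i + r)   # clip at boundaries
        y[lo:hi] <- distribution_generator(sum(y[lo:hi]), hi - lo + 1, sigma)
      }
    }
  }
  y
}
```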
Volume correction.
The overthrow correction, being a local correction, disrupts the property of the synthesized time series that its aggregated sum equals the given aggregated distribution. The scenario is best illustrated in Table 1. This problem is addressed in this step: to keep the aggregated sum equal to the original data, we consider each aggregated unit and adjust its sum by adding or subtracting 1 at randomly chosen indices until the sums match.
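A sketch of this step, where `bounds` (an assumed bookkeeping structure) maps each aggregated unit to its daily indices:

```r
# Sketch of the volume-correction phase: for each aggregated unit, add or
# subtract 1 at random indices until the unit's sum matches v again.
volume_correction <- function(y, v, bounds) {
  for (j in seq_along(v)) {
    idx <- bounds[[j]]                      # daily indices of unit j
    d <- v[j] - sum(y[idx])
    while (d != 0) {
      i <- sample(idx, 1)
      if (d > 0) { y[i] <- y[i] + 1; d <- d - 1 }
      else if (y[i] > 0) { y[i] <- y[i] - 1; d <- d + 1 }
    }
  }
  y
}
```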
The Stochastic Bayesian Downscaling (SBD) algorithm.
The algorithm calls for a unique name; from now on, we shall address it as the Stochastic Bayesian Downscaling (SBD) algorithm. The structural parts of the algorithm have been discussed at length in the first three segments of this subsection. The pseudocode of the SBD algorithm is as follows:
Algorithm 1. Stochastic Bayesian Downscaling (SBD) Algorithm
Require: Aggregated value vector, v
Overthrow tolerance, δ
Iteration limit, n
Radius of the open interval, r
Standard deviation, σ
Ensure: downscaled time series, y
for elem in v do
  segment of y corresponding to elem = Distribution Generator(elem, σ)
end for
for i from 1 to n do
  find the vector of coordinates of overthrow points in y
  for elem in overthrow points do
    open interval of radius r centering elem = Distribution Generator(sum of the elements of the open interval, σ)
  end for
end for
for i from 1 to length of v do
  if v_i ≠ sum of the equivalent aggregate in y then
    d = v_i − sum of the equivalent aggregate in y
    while d ≠ 0 do
      if d > 0 then
        add 1 to a randomly chosen element of the equivalent aggregate in y; d = d − 1
      else
        subtract 1 from a randomly chosen non-zero element of the equivalent aggregate in y; d = d + 1
      end if
    end while
  end if
end for
Algorithm 2. Distribution Generator
Require: Total sum of the downscaled distribution, s
Standard deviation, σ
Ensure: Downscaled approximation over the length of the aggregate, y
y = fit the chosen distribution to the given downscaled time frame with standard deviation σ
if elems in y are negative then
  apply the non-negativity transformation
end if
if elems in y are not integers then
  round them to the nearest integer
end if
if sum of y ≠ s then
  d = s − sum of y
  while d ≠ 0 do
    if d > 0 then
      add 1 to a randomly chosen element of y; d = d − 1
    else
      subtract 1 from a randomly chosen non-zero element of y; d = d + 1
    end if
  end while
end if
The SBD algorithm depends heavily on random number generation and is therefore prone to producing non-reproducible results. Thus, seeding the random number generator is highly recommended to ensure reproducible results.
The novelty of the SBD algorithm is its consideration of the prior distribution as initialization and its deployment of the underlying distribution to generate synthesized downscaled data that is non-negative and conserves the aggregated values of the given data.
Comparison of the synthesized data with the real data
To determine the accuracy of the SBD algorithm, we test it against some real-world data. Here, we have taken 2020 COVID-19 data on infected individuals in Bangladesh and 2022 (January to July) Dengue data on infected individuals in Bangladesh. Both datasets are daily counts of newly infected individuals nationwide. We convert this data to monthly aggregates and feed the aggregated data to the algorithm to generate downscaled daily data; we can then compare the accuracy of the synthetic daily data against the actual daily data. To determine the accuracy of the approximation, we use two error measures and perform a component analysis on the real and synthetic data to see whether the synthetic data can approximate the underlying properties of the real data. For the component decomposition, we use the additive model

y_t = T_t + S_t + R_t,  (2)

where T_t, S_t, and R_t denote the trend, seasonal, and residual components, since the procured data has some zero values, for which the multiplicative model

y_t = T_t × S_t × R_t  (3)

is not suitable in this scenario.
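In R, the additive decomposition of (2) can be obtained with the built-in decompose(); the series names and the weekly frequency (reported later in the component analysis) are our assumptions.

```r
# Additive decomposition y_t = T_t + S_t + R_t for actual vs. synthetic data.
actual_ts    <- ts(actual_daily,    frequency = 7)  # assumed weekly seasonality
synthetic_ts <- ts(synthetic_daily, frequency = 7)
plot(decompose(actual_ts,    type = "additive"))
plot(decompose(synthetic_ts, type = "additive"))
```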
Error measures for benchmark
To compare the result with the real-world data, we use two error measures that describe the overall error of the approximation: the Mean Absolute Error (MAE) and the Root Mean Squared Error (RMSE).
Since many of the data points in both the actual and synthesized series are 0, the Mean Absolute Percentage Error (MAPE) and the Symmetric Mean Absolute Percentage Error (SMAPE) are undefined in this scenario.
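Both measures can be computed directly; `actual_daily` and `synthetic_daily` below stand for the aligned daily series (names illustrative).

```r
# Benchmark error measures between actual and synthesized daily series.
mae  <- mean(abs(actual_daily - synthetic_daily))       # Mean Absolute Error
rmse <- sqrt(mean((actual_daily - synthetic_daily)^2))  # Root Mean Squared Error
```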
Dengue
Preprocessing and result.
For this simulation, we took Bangladesh’s 2022 daily Dengue infection data from January to July. To feed this data into the SBD algorithm, we convert the daily data to a monthly aggregate, as illustrated in Fig 6. For the majority of the statistical work done in this paper, we have used R.
Monthly aggregate of 2022 Dengue data from January to July.
We feed in this data considering,
- Initial standard deviation, σ0, as defined in (1).
- Overthrow tolerance, δ = 0.6× (range of the initial distribution).
- Iteration limit, n = 100.
- Radius of open interval, r = 3.
- Underlying distribution taken to be normal.
and generate the synthesized data. Fig 7 illustrates the synthesized data, which can be said to be a good approximation of the actual data (Fig 8), given only the aggregated prior distribution.
SBD algorithm generated synthesized daily number of infected cases of Dengue in 2022 from January to July.
Daily number of infected cases of Dengue in 2022 from January to July.
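The run above corresponds to a call of the following shape, where sbd_downscale() is a hypothetical wrapper around Algorithm 1 (not a published function), and `prior`, `monthly_counts`, and `sigma0` are assumed objects.

```r
set.seed(2022)                        # seed the RNG for reproducible output
delta <- 0.6 * diff(range(prior))     # overthrow tolerance from the list above
synthetic_daily <- sbd_downscale(v = monthly_counts, delta = delta,
                                 n = 100, r = 3, sigma = sigma0,
                                 dist = "normal")
```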
Error metrics and statistical measures.
The calculated error measures are:
- MAE = 6.60664, which implies that the average absolute difference between the actual and synthesized data is 6.60664.
- RMSE = 12.64499 which implies that the standard deviation of the residuals/errors is 12.64499. The fact is well illustrated in Fig 14.
The error metrics show satisfactory results. Table 2 verifies that the synthesized data honours the aggregated sum of the prior distribution.
The total number of cases has been maintained in each phase. As discussed earlier, the initial distribution holds the monthly sums exactly; these get disrupted in the overthrow correction phase and are later restored in the volume correction phase.
We shall now explore the basic statistical properties of the synthetic data with respect to the actual data.
It is to be noted that the mean of the synthesized data equals that of the original data, although it was never supplied to the SBD algorithm, as illustrated in Table 3. As previously discussed, σ0 is a good approximation of the original σ. All the remaining measures are fairly close, but the maximum differs considerably. The maximum is hard to anticipate from aggregated data; hence it is an avenue that demands further exploration.
Component decomposition and comparison.
We now perform component decomposition of both the actual and synthetic data based on the model in (2). Component decomposition is not a benchmark for accuracy in itself, but the SBD algorithm aims to improve the outcome of forecasting techniques that are highly influenced by the components within a time series. Thus, comparing these components answers whether the component-based characteristics of the original time series are present in the synthesized data.
In the case of the trend component (Figs 9 and 10), the actual and synthesized data show similar results, and the trend of the actual data is well approximated by the trend of the synthesized data.
In the case of the seasonality component (Figs 11 and 12), both the actual and the synthesized data show major weekly and minor sub-weekly seasonality. The synthesized data’s seasonality approximates the actual data’s seasonality well.
Seasonality of the actual dengue data.
Seasonality of the synthetic dengue data.
In the case of the residual component (Figs 13 and 14), the actual and synthesized data show similar results. Although the residual of the synthetic data may look noisy at first glance, upon closer inspection it shows less deviation from the standard value than the actual data. The synthesized data’s residual approximates the actual data’s residual well.
Residual of the synthetic dengue data.
As mentioned earlier, the key takeaway from the discussion is that the SBD algorithm could generate an excellent approximation of the dengue data from the monthly aggregated data based on some statistical properties of the prior distribution. In the following section, we shall also test the SBD algorithm’s efficacy in another epidemiological scenario.
COVID-19
Preprocessing and result.
For this simulation, we took Bangladesh’s 2020 daily COVID-19 infection data from March to December [29, 30]. To feed this data into the SBD algorithm, we convert the daily data to a monthly aggregate, as illustrated in Fig 15.
Monthly aggregate of 2020 COVID-19 infected data of Bangladesh from March to December.
We feed in this data considering,
- Initial standard deviation, σ0, as defined in (1).
- Overthrow tolerance, δ = 0.2× (range of the initial distribution).
- Iteration limit, n = 100.
- Radius of open interval, r = 3.
- Underlying distribution taken to be normal.
and generate the synthesized data. Fig 16 illustrates the synthesized data, which can be said to be a good approximation of the actual data (Fig 17), given only the aggregated prior distribution.
Synthesized daily number of infected cases of COVID-19 in 2020 from March to December.
Daily number of infected cases of COVID-19 in 2020 from March to December.
Error metrics and statistical measures.
The calculated error measures are:
- MAE = 257.41806, which implies that the average absolute difference between the actual and synthesized data is 257.41806, which is reasonable considering that the mean of the data is 1717.424749.
- RMSE = 346.6241, which implies that the standard deviation of the residuals/errors is 346.6241. The fact is well illustrated in Fig 23.
It is to be noted that the error terms of this scenario must not be compared with those of the previous case, as they are on different scales. Compared to the scale of the data, the error metrics show satisfactory results. Table 4 verifies that the synthesized data honours the aggregated sum of the prior distribution.
We shall now explore the basic statistical properties of the synthetic data with respect to the actual data.
It is to be noted that the mean of the synthesized data equals that of the original data, although it was never supplied to the SBD algorithm, as illustrated in Table 5. As previously discussed, σ0 is a good approximation of the original σ. All the remaining measures are fairly close, but the maximum differs considerably. The maximum is hard to anticipate from aggregated data; hence it is an avenue that demands further exploration.
Component decomposition and comparison.
We now perform component decomposition of both the actual and synthetic data based on the model in (2). Component decomposition is in no way a benchmark for accuracy, but the SBD algorithm aims to improve the outcome of forecasting techniques that are highly influenced by the components within a time series. Thus, comparing these components answers whether the original time series’s component-based characteristics are present in the synthesized data.
In the case of the trend component (Figs 18 and 19), the actual and synthesized data show similar results, and the trend of the actual data is well approximated by the trend of the synthesized data.
In the case of the seasonality component (Figs 20 and 21), both the actual and synthesized data show a major weekly seasonality. The seasonality of the synthesized data approximates that of the actual data well.
Seasonality of the actual COVID-19 data.
Seasonality of the synthetic COVID-19 data.
In the case of the residual component (Figs 22 and 23), the actual and synthesized data show similar results. Although the residual of the synthetic data may look a bit noisy at first glance, upon closer inspection it shows less deviation from the standard value than the actual data. The residual of the synthesized data approximates the residual of the actual data well.
Residual of the synthetic COVID-19 data.
The key takeaway from the discussion above is that the algorithm could generate an excellent approximation of the COVID-19 data from the monthly aggregated data based on some statistical properties of the prior distribution. We shall also test the SBD algorithm’s efficacy in a forecasting scenario in the following section.
Improvements in forecasting accuracy
In this section, we forecast Dengue infection cases in Bangladesh using statistical forecasting tools. Statistical modelling is one of the helpful approaches that may be utilized for forecasting dengue outbreaks [31, 32]. Previous research carried out in China [33], India [34], Thailand [35], the West Indies [36], Colombia [37], and Australia [38] made substantial use of time series techniques in epidemiologic research on infectious diseases. A number of earlier studies looked at the Autoregressive Integrated Moving Average (ARIMA) model as a potential forecasting tool [39–44]. In addition, ARIMA models have seen widespread use for dengue forecasting [42, 45–47]. When establishing statistical forecasting models, these are frequently paired with Seasonal Autoregressive Integrated Moving Average (SARIMA) models, which have proven suitable for assessing time series data with ordinary or seasonal patterns [34, 36, 38, 48–50]. A dengue incidence forecasting model based on knowledge from previous outbreaks and environmental variables might therefore be an extremely helpful tool for anticipating the severity and frequency of potential epidemics.
The idea of modelling seasonality using Fourier coefficients, namely the Fourier ARIMA model, was introduced in [51, 52]:

y_t = δ0 + Σ_{k=1}^{K} [α_k sin(2πkt/ω_k) + β_k cos(2πkt/ω_k)] + η_t,  (4)

where δ0 is the constant term, ω_k is the periodicity of the data, α_k and β_k are the Fourier coefficients, and η_t is an ARIMA error process.
We aim to forecast the monthly data and the synthesized daily data using the forecasting techniques mentioned above and compare the forecast accuracy based on error measures. We use SARIMA and Fourier-ARIMA models to forecast the monthly and synthesized data, respectively. The model in each case is chosen based on the lowest value of Akaike’s Information Criterion (AIC), the corrected Akaike Information Criterion (AICc), and the Bayesian Information Criterion (BIC).
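Using the forecast package in R, the two model families can be fitted and ranked by these criteria along the following lines; the object names and the use of auto.arima() as the search routine are our assumptions.

```r
library(forecast)
# SARIMA candidates for the monthly aggregate, searched by corrected AIC.
fit_sarima <- auto.arima(monthly_ts, seasonal = TRUE, ic = "aicc")
# Fourier-ARIMA for the synthetic daily series: one sine/cosine pair as
# external regressors with non-seasonal ARIMA errors.
fit_fourier <- auto.arima(daily_ts, seasonal = FALSE, ic = "aicc",
                          xreg = fourier(daily_ts, K = 1))
AIC(fit_sarima); BIC(fit_sarima)     # criteria used to rank candidate models
AIC(fit_fourier); BIC(fit_fourier)
```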
Model selection method
The Box-Jenkins method is a generalized model selection pathway that works for time series irrespective of their stationarity or seasonality. The method is illustrated in Fig 24.
Error measures of model
The error measure used for comparison is the Mean Absolute Scaled Error (MASE), defined as

MASE = mean(|e_t|) / mean(|y_t − y_{t−1}|),

where e_t is the forecast error and the denominator is the in-sample mean absolute error of the one-step naive forecast. We use this metric as it is scale-independent and hence perfect for comparison [53, 54]. We could also have taken MAPE as a metric, but MAPE is undefined in such cases, as the data is populated with zero values. We also use RMSE and MAE to gauge the error in the forecast [55, 56].
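A direct implementation of this scale-free measure, using the in-sample naive forecast as the scaling term, might read:

```r
# MASE: forecast MAE scaled by the in-sample MAE of the one-step naive
# forecast on the training series (Hyndman & Koehler, 2006).
mase <- function(actual, predicted, train) {
  mean(abs(actual - predicted)) / mean(abs(diff(train)))
}
```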
Forecast on the aggregated data
The actual data is monthly Dengue infection data of Bangladesh from 2010 to July 2022. Following the Box-Jenkins method, we first check the stationarity of the data with the Augmented Dickey-Fuller (ADF) test. The ADF test returns a value of -4.7906 with p-value = 0.01, which implies that the data is stationary.
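A sketch of this check with the tseries package (the series name is illustrative):

```r
library(tseries)
adf.test(monthly_ts)  # reported: Dickey-Fuller = -4.7906, p-value = 0.01
```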
We run multiple SARIMA models and calculate their AIC, AICc, and BIC; the best model is chosen based on the minimum values of these criteria. We present the top 5 results in Table 6.
Here, the best model to use is SARIMA (1, 0, 0)(0, 1, 1)12. We fit the given model, which gives us the coefficients presented in Table 7:
To check the goodness of fit of the model, we use the Ljung-Box test, which returns p-value = 0.9996 > 0.05, i.e. we accept the null hypothesis: “the model does not show lack of fit / the residuals are not autocorrelated / the residuals are random white noise.”
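A sketch of this diagnostic; the lag choice is our assumption.

```r
# Ljung-Box test on the SARIMA residuals; p > 0.05 indicates no evidence of
# residual autocorrelation (lag = 24 is an assumed choice).
Box.test(residuals(fit_sarima), lag = 24, type = "Ljung-Box")
```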
With everything in place, we forecast the infection for the rest of the year, i.e. from August to December. The forecast is illustrated in Fig 25.
The figure illustrates the forecast generated by SARIMA (1, 0, 0)(0, 1, 1)12 from actual aggregated data.
To validate the goodness of the fit, we can analyze the model residuals, illustrated in Fig 26. Here, the top graph shows the residuals over the timeline of the original data. The bottom left graph represents the Autocorrelation Function (ACF) with respect to the lag of the data; almost all the values are within the significance level. The bottom right figure shows the distribution of the model’s residuals, which implies that the residuals are normally distributed with zero mean.
The figure illustrates, in the bottom left graph, that the ACF values for different choices of lag are all contained within the significance level (the dotted blue line) and, in the bottom right graph, that the residuals are normally distributed with mean about 0.
To calculate the accuracy of the given forecast, we calculate the aforementioned error measures presented in Table 8.
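With the forecast package, these measures are available from accuracy() on the fitted model; this sketch reports training-set measures for the assumed object name.

```r
library(forecast)
accuracy(fit_sarima)  # returns MAE, RMSE, and MASE, among other measures
```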
The error measures are acceptable given the magnitude of the data, but there is room for improvement, as shall be demonstrated in the following subsection.
Forecast on the synthesized data
The synthesized data is daily Dengue infection data of Bangladesh from 2010 to July 2022. Following the Box-Jenkins method, we first check the stationarity of the data with the Augmented Dickey-Fuller (ADF) test. The ADF test returns a value of -6.6531 with p-value = 0.01, which implies that the data is stationary.
We run multiple Fourier ARIMA models and calculate their AIC, AICc, and BIC. The best model is chosen based on the minimum values of these criteria. We present the top 5 results in Table 9. In each case, the Fourier term uses one pair of trigonometric terms, where each pair comprises a sine and a cosine term as defined in (4), and the periodicity of the Fourier term is taken to be 365.25. Prior to this, we applied a Box-Cox transformation with λ = 0.49.
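A sketch of the selected specification, assuming forecast::Arima() with Fourier regressors and the stated Box-Cox parameter; object names are illustrative.

```r
library(forecast)
# ARIMA(7,0,7) with one Fourier pair of period 365.25 and Box-Cox lambda = 0.49.
daily_ts <- ts(synthetic_daily, frequency = 365.25)
fit <- Arima(daily_ts, order = c(7, 0, 7),
             xreg = fourier(daily_ts, K = 1), lambda = 0.49)
fc <- forecast(fit, xreg = fourier(daily_ts, K = 1, h = 153))  # Aug-Dec horizon
```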
Here, the best model to use is ARIMA(7,0,7). We fit the given model, which gives us the coefficients in Table 10.
To check the goodness of fit of the model, we use the Ljung-Box test, which returns p-value = 0.07749 > 0.05, i.e. we accept the null hypothesis: “the model does not show lack of fit / the residuals are not autocorrelated / the residuals are random white noise”.
With everything in place, we forecast the infection for the rest of the year, i.e. from August to December. The forecast is illustrated in Fig 27.
The figure illustrates the forecast generated by ARIMA(7,0,7) from the synthesized data.
To validate the goodness of the fit, we can analyze the model residuals, illustrated in Fig 28. Here, the top graph shows the residuals over the timeline of the original data. The bottom left graph represents the Autocorrelation Function (ACF) with respect to the lag of the data; almost all the values are within the significance level. The bottom right figure shows the distribution of the model’s residuals, which implies that the residuals are normally distributed with zero mean.
The figure illustrates, in the bottom left graph, that the ACF values for different choices of lag are mostly contained within the significance level (the dotted blue line) and, in the bottom right graph, that the residuals are normally distributed with mean about 0.
To calculate the accuracy of the given forecast, we calculate the aforementioned error measures presented in Table 11.
The error measures are acceptable, given the magnitude of the data. Compared to the error measures for the actual data in Table 8, Table 11 shows clear improvement: comparing the MASE terms of the two tables shows about a 72.76% decrease in error when using the synthetic data over the actual data.
Conclusion
In this paper, a novel temporal downscaling algorithm named the Stochastic Bayesian Downscaling (SBD) algorithm has been proposed that can generate downscaled/de-aggregated time series data of varying time lengths from aggregated data. We have presented two case studies of Bangladesh, using Dengue data of 2022 ranging from January to July and COVID-19 data of 2020, to exhibit the workflow of the algorithm. In both case studies, the algorithm-generated synthetic data managed to replicate the mean of the actual data without ever being provided with it. The synthetic data also closely approximated the other statistical measures, except for the maximum value; a way around this issue remains an open research question. Finally, we have tested how classical statistical forecasting methods respond to the synthetic data relative to the actual aggregated data using monthly Dengue data of Bangladesh for the last 12 years. Our findings show that using synthetic data over actual aggregated data reduced the scale-free error measure by 72.76%.
The SBD algorithm presented in this paper is designed to handle integer data by imposing certain restrictions, but it can be generalized to handle real numbers by relaxing those restrictions. Hence, exploring diverse use cases in public health, epidemiology, economics, and finance is a future direction of research. In this paper, we have only studied how statistical forecasting models respond to synthetic data compared to actual data; repeating similar studies for the predictive class of machine learning models like Long Short-Term Memory (LSTM), XGBoost, etc., is a further scope of research.
Downscaling algorithms have predominantly been used in geology to facilitate the outputs of prevalent models in the field. Very few applications exist in epidemiology, and most involve spatial downscaling. This paper contributes to the current body of knowledge by proposing a parametric, probabilistic, one-dimensional downscaling algorithm for aggregated data in epidemiology that enables the existing classical statistical forecasting toolbox to generate better forecasts than the aggregated data allows. As forecasting models like ARIMA and SARIMA are sensitive to data discontinuity and outliers, the SBD algorithm can be implemented as a pre-model cleaning step to curate better results at a finer scale. Since the SBD algorithm can increase data volume significantly (e.g. downscaling monthly data to daily data increases the number of data points roughly 30-fold) while preserving key statistics and properties of the data, such downscaled data can open avenues for exploration using state-of-the-art neural network models, which often require large volumes of data to generate fruitful outcomes.
References
- 1. Ribalaygua J., Torres L., Pórtoles J., Monjo R., Gaitán E. & Pino M. Description and validation of a two-step analogue/regression downscaling method. Theoretical And Applied Climatology. 114, 253–269 (2013)
- 2. Peng J., Loew A., Merlin O. & Verhoest N. A review of spatial downscaling of satellite remotely sensed soil moisture. Reviews Of Geophysics. 55, 341–366 (2017)
- 3. Kim S., Kim J. & Bae D. Optimizing Parameters for the Downscaling of Daily Precipitation in Normal and Drought Periods in South Korea. Water. 14, 1108 (2022)
- 4. Bae D., Koike T., Awan J., Lee M. & Sohn K. Climate change impact assessment on water resources and susceptible zones identification in the Asian monsoon region. Water Resources Management. 29, 5377–5393 (2015)
- 5. Lee M., Im E. & Bae D. Impact of the spatial variability of daily precipitation on hydrological projections: A comparison of GCM-and RCM-driven cases in the Han River basin, Korea. Hydrological Processes. 33, 2240–2257 (2019)
- 6. Kim J., Im E. & Bae D. Intensified hydroclimatic regime in Korean basins under 1.5 and 2°C global warming. International Journal Of Climatology. 40, 1965–1978 (2020)
- 7. Gangopadhyay S., Clark M. & Rajagopalan B. Statistical downscaling using K-nearest neighbors. Water Resources Research. 41 (2005)
- 8. Fowler H., Blenkinsop S. & Tebaldi C. Linking climate change modelling to impacts studies: recent advances in downscaling techniques for hydrological modelling. International Journal Of Climatology: A Journal Of The Royal Meteorological Society. 27, 1547–1578 (2007)
- 9. Lee T. & Jeong C. Nonparametric statistical temporal downscaling of daily precipitation to hourly precipitation and implications for climate change scenarios. Journal Of Hydrology. 510 pp. 182–196 (2014)
- 10. Buster G., Rossol M., Maclaurin G., Xie Y. & Sengupta M. A physical downscaling algorithm for the generation of high-resolution spatiotemporal solar irradiance data. Solar Energy. 216 pp. 508–517 (2021)
- 11. Liu J., Yuan D., Zhang L., Zou X. & Song X. Comparison of three statistical downscaling methods and ensemble downscaling method based on Bayesian model averaging in upper Hanjiang River Basin, China. Advances In Meteorology. 2016 (2016)
- 12. Yaoming L., Qiang Z. & Deliang C. Stochastic modeling of daily precipitation in China. Journal Of Geographical Sciences. 14, 417–426 (2004)
- 13. Liao Y. Change of parameters of BCC/RCG-WG for daily non-precipitation variables in China: 1951–1978 and 1979–2007. Journal Of Geographical Sciences. 23, 579–594 (2013)
- 14. Dibike Y. & Coulibaly P. Hydrologic impact of climate change in the Saguenay watershed: comparison of downscaling methods and hydrologic models. Journal Of Hydrology. 307, 145–163 (2005)
- 15. Wetterhall F., Halldin S. & Xu C. Seasonality properties of four statistical-downscaling methods in central Sweden. Theoretical And Applied Climatology. 87, 123–137 (2007)
- 16. Khan M., Coulibaly P. & Dibike Y. Uncertainty analysis of statistical downscaling methods. Journal Of Hydrology. 319, 357–382 (2006)
- 17. Wilby R., Dawson C. & Barrow E. SDSM—a decision support tool for the assessment of regional climate change impacts. Environmental Modelling & Software. 17, 145–157 (2002)
- 18. Harpham C. & Wilby R. Multi-site downscaling of heavy daily precipitation occurrence and amounts. Journal Of Hydrology. 312, 235–255 (2005)
- 19. Wilby R. & Dettinger M. Streamflow changes in the Sierra Nevada, California, simulated using a statistically downscaled general circulation model scenario of climate change. Linking Climate Change To Land Surface Change. pp. 99–121 (2000)
- 20. Raftery A. & Zheng Y. Discussion: Performance of Bayesian model averaging. Journal Of The American Statistical Association. 98, 931–938 (2003)
- 21. Tripathi S., Srinivas V. & Nanjundiah R. Downscaling of precipitation for climate change scenarios: a support vector machine approach. Journal Of Hydrology. 330, 621–640 (2006)
- 22. Yu X. & Liong S. Forecasting of hydrologic time series with ridge regression in feature space. Journal Of Hydrology. 332, 290–302 (2007)
- 23. Ghosh S. & Mujumdar P. Statistical downscaling of GCM simulations to streamflow using relevance vector machine. Advances In Water Resources. 31, 132–146 (2008)
- 24. Matisziw T., Grubesic T. & Wei H. Downscaling spatial structure for the analysis of epidemiological data. Computers, Environment And Urban Systems. 32, 81–93 (2008)
- 25. Mahmud M., Kamrujjaman M., Adan M., Hossain M., Rahman M., Islam M., et al. Vaccine efficacy and sars-cov-2 control in california and us during the session 2020–2026: A modeling study. Infectious Disease Modelling. 7, 62–81 (2022) pmid:34869959
- 26. DGHS DENV Press Releases. (2022), https://dashboard.dghs.gov.bd/webportal/pages/heoc_dengue.php
- 27. IEDCR Dengue Surveillance Report. https://iedcr.gov.bd/surveillances/
- 28. WHO COVID-19 dashboard. (2022), https://covid19.who.int/data
- 29. Kamrujjaman M., Mahmud M. & Islam M. Coronavirus outbreak and the mathematical growth map of Covid-19. Annual Research & Review In Biology. pp. 72–78 (2020)
- 30. Islam M., Ira J., Kabir K. & Kamrujjaman M. Effect of lockdown and isolation to suppress the COVID-19 in Bangladesh: an epidemic compartments model. J Appl Math Comput. 4, 83–93 (2020)
- 31. Wong L., Shakir S., Atefi N. & AbuBakar S. Factors affecting dengue prevention practices: nationwide survey of the Malaysian public. PloS One. 10, e0122890 (2015) pmid:25836366
- 32. Husin N., Salim N. & Others Modeling of dengue outbreak prediction in Malaysia: a comparison of neural network and nonlinear regression model. 2008 International Symposium On Information Technology. 3 pp. 1–4 (2008)
- 33. Lu L., Lin H., Tian L., Yang W., Sun J. & Liu Q. Time series analysis of dengue fever and weather in Guangzhou, China. BMC Public Health. 9, 1–5 (2009) pmid:19860867
- 34. Bhatnagar S., Lal V., Gupta S., Gupta O. & Others Forecasting incidence of dengue in Rajasthan, using time series analyses. Indian Journal Of Public Health. 56, 281 (2012) pmid:23354138
- 35. Wongkoon S., Pollar M., Jaroensutasinee M. & Jaroensutasinee K. Predicting DHF incidence in Northern Thailand using time series analysis technique. International Journal Of Medical And Health Sciences. 1, 484–488 (2007)
- 36. Gharbi M., Quenel P., Gustave J., Cassadou S., Ruche G., Girdary L., et al. Time series analysis of dengue incidence in Guadeloupe, French West Indies: forecasting models using climate variables as predictors. BMC Infectious Diseases. 11, 1–13 (2011) pmid:21658238
- 37. Torres C., Barguil S., Melgarejo M. & Olarte A. Fuzzy model identification of dengue epidemic in Colombia based on multiresolution analysis. Artificial Intelligence In Medicine. 60, 41–51 (2014) pmid:24388398
- 38. Hu W., Clements A., Williams G. & Tong S. Dengue fever and El Nino/Southern Oscillation in Queensland, Australia: a time series predictive model. Occupational And Environmental Medicine. 67, 307–311 (2010) pmid:19819860
- 39. Hossian M. & Abdulla F. A Time Series analysis for the pineapple production in Bangladesh. Jahangirnagar University Journal Of Science. 38, 49–59 (2015)
- 40. Hossain M. & Abdulla F. Jute production in Bangladesh: a time series analysis. Journal Of Mathematics And Statistics. 11, 93–98 (2015)
- 41. Abdulla F. & Hossain M. Forecasting of Wheat Production in Kushtia District & Bangladesh by ARIMA Model: An Application of Box-Jenkin’s Method. Journal Of Statistics Applications & Probability. 4, 465 (2015)
- 42. Hossain M., Abdulla F. & Others Forecasting the tea production of Bangladesh: Application of ARIMA model. (2015)
- 43. Hossain M., Abdulla F. & Majumder A. Forecasting of banana production in Bangladesh. American Journal Of Agricultural And Biological Sciences. 11, 93–99 (2016)
- 44. Hossain M. & Abdulla F. Forecasting potato production in Bangladesh by ARIMA model. Journal Of Advanced Statistics. 1, 191–198 (2016)
- 45. Earnest A., Tan S., Wilder-Smith A. & Machin D. Comparing Statistical Models to Predict Dengue Fever Notifications. (2012)
- 46. Wu P., Guo H., Lung S., Lin C. & Su H. Weather as an effective predictor for occurrence of dengue fever in Taiwan. Acta Tropica. 103, 50–57 (2007) pmid:17612499
- 47. Eastin M., Delmelle E., Casas I., Wexler J. & Self C. Intra-and interseasonal autoregressive prediction of dengue outbreaks using local weather and regional climate for a tropical environment in Colombia. The American Journal Of Tropical Medicine And Hygiene. 91, 598 (2014) pmid:24957546
- 48. Luz P., Mendes B., Codeço C., Struchiner C., Galvani A. & Others Time series analysis of dengue incidence in Rio de Janeiro, Brazil. (American Society of Tropical Medicine, 2008)
- 49. Martinez E., Silva E. & Fabbro A. A SARIMA forecasting model to predict the number of cases of dengue in Campinas, State of São Paulo, Brazil. Revista Da Sociedade Brasileira De Medicina Tropical. 44 pp. 436–440 (2011) pmid:21860888
- 50. Brownlee J. Introduction to time series forecasting with python: how to prepare data and develop models to predict the future. (Machine Learning Mastery, 2017)
- 51. Nachane D. & Clavel J. Forecasting interest rates: a comparative assessment of some second-generation nonlinear models. Journal Of Applied Statistics. 35, 493–514 (2008)
- 52. Iwok I. & Udoh G. A Comparative Study between the ARIMA-Fourier Model and the Wavelet model. American Journal of Scientific and Industrial Research. 7 pp. 137–144 (2016)
- 53. Hyndman R. & Koehler A. Another look at measures of forecast accuracy. International Journal Of Forecasting. 22, 679–688 (2006)
- 54. Pontius R., Thontteh O. & Chen H. Components of information for multiple resolution comparison between maps that share a real variable. Environmental And Ecological Statistics. 15, 111–142 (2008)
- 55. Willmott C. & Matsuura K. On the use of dimensioned measures of error to evaluate the performance of spatial interpolators. International Journal Of Geographical Information Science. 20, 89–102 (2006)
- 56. Hyndman R. & Athanasopoulos G. Forecasting: principles and practice. (OTexts, 2018)