
A novel hybrid PSO-MIDAS model and its application to the U.S. GDP forecast

  • Feng Shen,

    Roles Conceptualization, Funding acquisition, Supervision, Writing – original draft, Writing – review & editing

    Affiliations School of Finance, Southwestern University of Finance and Economics, Chengdu, PR China, Engineering Research Center of Intelligent Finance, Southwestern University of Finance and Economics, Chengdu, PR China

  • Xiaodong Yan,

    Roles Data curation, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation School of Finance, Southwestern University of Finance and Economics, Chengdu, PR China

  • Yuhuang Shang

    Roles Conceptualization, Methodology, Supervision, Writing – original draft, Writing – review & editing

    syh@swufe.edu.cn

    Affiliation Institute of Chinese Financial Studies, Southwestern University of Finance and Economics, Chengdu, PR China

Abstract

In this study, the traditional lag structure selection method in the Mixed Data Sampling (MIDAS) regression model for forecasting GDP is replaced with a machine learning approach based on the particle swarm optimization (PSO) algorithm. The introduction of PSO automatically optimizes the MIDAS model's mixed-frequency lag structures, improving forecast accuracy and resolving the trade-off between "forecast accuracy" and "forecast cost". Diebold–Mariano test results based on U.S. macroeconomic data show that when the forecast horizon is large, the forecast accuracy of the PSO-MIDAS model is significantly better than that of the benchmark models. Empirical results show that, compared to the benchmark MIDAS model, the forecast accuracy of both the univariate and multivariate PSO-MIDAS models improves by an average of 10% when the forecast horizon exceeds 2 quarters, and the optimization effect is even greater relative to the other benchmark models. The innovative use of the PSO algorithm addresses the limitations of traditional lag structure selection methods and enhances the predictive potential of the MIDAS model.

1. Introduction

Economic growth is one of the ultimate goals of monetary policy. As a crucial macroeconomic fundamental indicator, gross domestic product (GDP) plays a vital role in portraying the economic situation and characterizing the macroeconomic cycle [1]. The future GDP trend is closely related to production planning, consumer decision-making, and policymaking, so it has received extensive attention from producers, consumers, and governments. In GDP growth forecasting, improving forecast accuracy often comes at a cost in time and computation. Therefore, beyond forecast timeliness, researchers pay close attention to how to achieve both "high accuracy" and "low cost".

In specific empirical studies, researchers have found that forecast accuracy depends on the choice of predictors, while forecast timeliness is mainly affected by the frequency of the data. According to the research of Clements et al. [2, 3] and Ghysels et al. [4], the choice of predictors and the richness of data frequency have a crucial impact on GDP forecasts.

To improve forecast accuracy, financial variables have gradually been taken into consideration alongside traditional macroeconomic variables in the selection of predictors. Stock et al. [5] and Ghysels et al. [4] are both concerned with financial variables' ability to predict macroeconomic aggregates. Financial variables contain expected information about future economic activity and are readily available, which is conducive to improving forecast accuracy. Stock et al. [5] pointed out that the predictive power of financial variables for macroeconomic variables is statistically significant. Other indicators can also serve as predictors. Lahiri et al. [6] studied the role of the monthly diffusion indices compiled by the Institute for Supply Management (ISM) in predicting the current quarter's GDP growth in the United States; they found that new data on the ISM index available at the beginning of the month can improve real-time forecasting.

As far as the richness of data frequency is concerned, high-frequency data has attracted increasing attention from researchers. For the forecast of low-frequency macroeconomic indicators, high-frequency data information is even more critical. For example, GDP is quarterly data; if we want to introduce high-frequency data to predict low-frequency GDP, we face a mixed-frequency modeling problem. It is worth noting that Ghysels et al. [7, 8] proposed the MIxed DAta Sampling (MIDAS) regression model, which embeds high-frequency time series lags of the explanatory variables into the regression model through a weighting scheme. The MIDAS model's main advantage is that it can include as much mixed-frequency information as possible and improve forecast accuracy when the data samples have mixed frequencies. Moreover, the MIDAS model can use the latest published high-frequency data to improve the timeliness of forecasts. The MIDAS model has been proved effective in many fields of finance and economics, such as price forecasting for financial instruments [9, 10], volatility forecasting [11, 12], and macroeconomic indicator forecasting [13, 14].

At present, the MIDAS model is widely used in macroeconomic forecasting. Clements et al. [2, 3] and Ghysels et al. [4] apply the MIDAS model to the empirical study of real-time forecasting of macroeconomic variables. Kuzin et al. [15] used the MIDAS model for real-time (nowcasting) and forward forecasting of quarterly GDP growth in the euro area and found that, compared with the mixed-frequency VAR model, the MIDAS model performs better when forecasting 4 to 5 months in advance. Furthermore, researchers have made many improvements to the MIDAS model in macroeconomic forecasting. Barsoum et al. [16] combined the unconstrained U-MIDAS model with Markov-switching and proposed a new MS-U-MIDAS model. Empirical research based on U.S. GDP growth data shows that the MS-U-MIDAS model's real-time nowcasting and forward forecasting capabilities are similar to, or better than, those of traditional MIDAS models. Qiu [17] proposes a new tree-based MIDAS model that introduces the regression tree (RT), bootstrap aggregating decision trees (BAG), and random forest (RF) algorithms into the MIDAS framework. The empirical results show that the tree-based MIDAS model generally improves forecast accuracy by a wide margin compared to existing MIDAS models in forecasting the U.S. Consumer Confidence Index. Recently, some researchers have begun to study the optimization of multivariate MIDAS models in price forecasting. Li et al. [18] integrate the extreme learning machine with the multivariate MIDAS model to predict natural gas prices. Wang and Kang [19] combine the multivariate MIDAS model with eXtreme Gradient Boosting (XGBoost) to forecast China's steam coal prices.

Although the MIDAS model provides us with a research paradigm for forecasting economic indicators with mixed-frequency data, we cannot yet intelligently choose the data richness in the MIDAS regression. The lag order of the predictor determines the amount of information: the greater the lag order, the more lagged data the predictor contributes to the regression, and the greater the amount of information contained in the model. Generally speaking, the forecast accuracy of the model increases as the richness of information increases. However, overly long lags may introduce redundancy and data noise, and ultimately harm forecast accuracy. Therefore, optimizing the lag structures to improve the MIDAS model's forecast accuracy is a real challenge. In the existing research, two traditional methods are generally used to select the maximum lag order: the expert experience method and the information criterion method. In Clements et al. [3], when the monthly data of predictors are matched with quarterly GDP data, the high-frequency lags are set to a multiple of 3 to facilitate comparison with the benchmark model. Andreou et al. [20] used the AIC criterion to determine the maximum lag order of the MIDAS model in the context of a daily-quarterly data mixture. These two methods are mature in determining the maximum lag order of a single-frequency model, but specific problems remain in determining the MIDAS model's mixed-frequency lag structures; the incompatibility between these traditional lag structure selection methods and the MIDAS model is described in Section 2. Given that these methods are not suitable for the MIDAS model, the simplest alternative is the exhaustive method: traverse all values of the mixed-frequency lag structure, estimate a MIDAS model for each, and finally pick the best lag structure. However, the exhaustive method's biggest problem is that the improvement in forecast accuracy comes at the expense of time and cost: "high accuracy" and "low cost" cannot be achieved simultaneously.

In response to this problem, this paper uses machine learning to replace the existing mixed-frequency lag structure selection methods. We use Particle Swarm Optimization (PSO) to improve the traditional MIDAS model and obtain a PSO-MIDAS model embedded with a machine learning algorithm. The main advantages of this model are: on the one hand, it significantly improves forecast accuracy by optimizing the mixed-frequency lag structures; on the other hand, it uses the particle swarm optimization algorithm to learn and self-correct during the optimization process, which significantly saves time and cost. Besides, the PSO-MIDAS model has universal applicability: different optimal values can be obtained for different data, which widens the model's range of application.

This paper's main contribution lies in model construction: the machine learning algorithm is embedded in the traditional MIDAS model, and a new PSO-MIDAS model is proposed. We link the lag structure and the validation-set forecast metric of the MIDAS model to the particle position and fitness function of the PSO algorithm, and we optimize the lag structure of the MIDAS model using the PSO algorithm. According to economic and financial theory [21], we select the three variables of U.S. industrial production (IP), non-farm payroll (NFP), and capacity utilization (CU) as predictors. This article compares the PSO-MIDAS model with the MIDAS, U-MIDAS and ADL models. The study finds that the PSO-MIDAS model improves forecast accuracy by an average of 10% relative to the benchmark MIDAS model, and the optimization effect is even greater relative to the other benchmark models. The Diebold–Mariano test results show that when the forecast horizon is large, the forecast accuracy of the PSO-MIDAS model is significantly better than that of the other benchmark models. This paper also extends the MIDAS model from univariate to multivariate form; the multivariate PSO-MV-MIDAS model's forecast accuracy remains better than that of the multivariate benchmark models. In general, the particle swarm optimization algorithm makes a significant contribution to improving the MIDAS model's forecast accuracy.

The rest of this article proceeds as follows. Section 2 introduces the theories and methods related to this research: the MIDAS model, leads and nowcasting, the determination of lag order, the PSO algorithm, and the PSO-MIDAS model. Section 3 introduces the empirical research design; Section 4 presents the empirical results and discussion; Section 5 concludes the paper.

2. Methodology

2.1. The MIDAS model

Before introducing the MIDAS model, we first introduce the Autoregressive Distributed Lag (ADL) model. Both the ADL model and the MIDAS model can forecast low-frequency variables through mixed-frequency data. Suppose we hope to forecast some low-frequency quarterly variable in the context of monthly-quarterly data mixtures. The predicted variable, such as quarterly real GDP growth h quarters ahead, is denoted by Y_{t+h}. The predictor variable, such as monthly industrial production in month i of quarter t, is denoted by x_{i,t} (i = 1, 2, 3). Note that the ADL model involves temporally aggregated series. We denote the quarterly aggregate of the predictor variable in quarter t as X_t. The aggregation scheme used here is averaging the monthly data, that is, X_t = (x_{1,t} + x_{2,t} + x_{3,t}) / 3. The regression model is as follows:

(1)  Y_{t+h} = \mu + \sum_{j=0}^{q_Y - 1} \rho_j Y_{t-j} + \sum_{j=0}^{q_X - 1} \beta_j X_{t-j} + \varepsilon_{t+h}

which includes q_Y lags of Y and q_X lags of X; \mu is the constant and \varepsilon_{t+h} is the random error term. This regression is parsimonious because only 1 + q_Y + q_X regression coefficients need to be estimated. Unlike the ADL model, the MIDAS model does not need an artificially chosen quarterly aggregation scheme for the monthly variable and directly incorporates the monthly high-frequency data into the model. Writing x^{(3)}_{t - k/3} for the monthly observation k months before the end of quarter t, we introduce the MIDAS model with h-step-ahead forecasts:

(2)  Y_{t+h} = \mu + \sum_{j=0}^{q_Y - 1} \rho_j Y_{t-j} + \beta \sum_{k=0}^{q_x - 1} w_k(\theta_h) x^{(3)}_{t - k/3} + \varepsilon_{t+h}

which includes q_Y lags of Y and q_x high-frequency lags of x^{(3)}. For the convenience of notation and explanation, we set q_x to 3 times q_X here; however, it is worth noting that the model can contain monthly high-frequency data of any positive integer lag order q_x. The weighting scheme w_k(\theta_h) assigns a different weight coefficient to each high-frequency lag. By introducing a weighting scheme to determine the mapping between the low-dimensional hyperparameter \theta_h and the high-dimensional weight coefficients w_k, the number of parameters to be estimated is dramatically reduced. It avoids the parameter proliferation caused by directly estimating a coefficient for each high-frequency lag.

The alternative weighting schemes include U-MIDAS (unrestricted MIDAS polynomial), the normalized Beta probability density function, the normalized exponential Almon lag polynomial, and the polynomial specification with step functions. Ghysels et al. [8] provided a detailed discussion of these weighting schemes. Following Ghysels et al. [8], we adopt the normalized Beta probability density function as the model weighting scheme, and the weighting scheme formula is as follows:

(3)  w_k(\theta_1, \theta_2) = f(z_k; \theta_1, \theta_2) / \sum_{j=0}^{q_x - 1} f(z_j; \theta_1, \theta_2),  where z_k = k / (q_x - 1)

with:

(4)  f(z; \theta_1, \theta_2) = z^{\theta_1 - 1} (1 - z)^{\theta_2 - 1} \Gamma(\theta_1 + \theta_2) / (\Gamma(\theta_1) \Gamma(\theta_2))

(5)  \Gamma(\theta) = \int_0^{\infty} e^{-z} z^{\theta - 1} dz

Therefore, we do not need to estimate the weight coefficient of each high-frequency lag; we only need to estimate \theta_1 and \theta_2 to determine w_k. The estimated parameter set of the model with the normalized Beta probability density function weighting scheme is then {\mu, \rho_0, ..., \rho_{q_Y - 1}, \beta, \theta_1, \theta_2}, a total of q_Y + 4 parameters, which can be estimated by nonlinear least squares (NLS).
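To make the weighting scheme concrete, here is a minimal Python sketch of the normalized Beta weights in Eqs (3)–(5). This is an illustrative stand-in, not the paper's own code (the authors work with the midasr package for R):

```python
from math import gamma

def beta_weights(q_x, theta1, theta2):
    """Normalized Beta-density weights for q_x high-frequency lags."""
    eps = 1e-12  # keep the grid strictly inside (0, 1) to avoid 0**negative
    z = [eps + (1 - 2 * eps) * k / (q_x - 1) for k in range(q_x)]
    const = gamma(theta1 + theta2) / (gamma(theta1) * gamma(theta2))
    f = [const * zi ** (theta1 - 1) * (1 - zi) ** (theta2 - 1) for zi in z]
    total = sum(f)
    return [fi / total for fi in f]  # weights sum to one by construction

# theta1 = 1 with theta2 > 1 yields monotonically declining weights, so the
# most recent monthly observations receive the largest coefficients.
w = beta_weights(12, 1.0, 5.0)
```

Whatever \theta_1 and \theta_2 the NLS step delivers, only those two hyperparameters pin down all q_x weights, which is the dimension reduction described above.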

In addition, we call the model with the unrestricted MIDAS polynomial weighting scheme the U-MIDAS model. The model is as follows:

(6)  Y_{t+h} = \mu + \sum_{j=0}^{q_Y - 1} \rho_j Y_{t-j} + \sum_{k=0}^{q_x - 1} \beta_k x^{(3)}_{t - k/3} + \varepsilon_{t+h}

The U-MIDAS model directly regresses Y_{t+h} on the low-frequency lags of Y and the high-frequency lags of x^{(3)}. The number of parameters to be estimated is 1 + q_Y + q_x, which can be estimated by the ordinary least squares (OLS) method. However, when q_Y and q_x are large, the U-MIDAS model suffers from parameter proliferation, reducing forecast accuracy.
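As a sketch of the parameter-proliferation point, the following Python fragment builds a U-MIDAS-style design matrix on synthetic data and estimates its 1 + q_Y + q_x coefficients by OLS. Timing conventions are simplified for illustration and the data are random, so this only demonstrates the regressor layout, not the paper's estimation:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 80                          # quarters in the estimation sample
q_Y, q_x = 1, 6                 # illustrative low- and high-frequency lag orders
y = rng.standard_normal(T)      # quarterly series, e.g. GDP growth
x = rng.standard_normal(3 * T)  # monthly series, three observations per quarter

rows, target = [], []
for t in range(2, T):           # skip early quarters lacking a full lag history
    lf = [y[t - 1 - j] for j in range(q_Y)]      # q_Y low-frequency lags
    hf = [x[3 * t + 2 - k] for k in range(q_x)]  # q_x monthly lags ending in quarter t
    rows.append([1.0] + lf + hf)                 # intercept + regressors
    target.append(y[t])

# OLS: one unrestricted coefficient per regressor, 1 + q_Y + q_x in total
coef, *_ = np.linalg.lstsq(np.array(rows), np.array(target), rcond=None)
```

Each extra high-frequency lag adds one free coefficient, which is exactly why a large q_x erodes U-MIDAS accuracy, while the Beta scheme keeps the parameter count fixed at q_Y + 4.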

2.2. Leads and nowcast

In mixed-frequency forecast research, because the publication dates of high-frequency data and low-frequency data are not synchronized, the data we collect is usually ragged-edge data with missing observations at the end of the sample. For example, in a particular calendar month, we can observe the current quarter’s monthly data, but we cannot observe the current quarter’s quarterly data. So, we need to extend the model to include high-frequency monthly data in the current quarter.

We follow the concept of MIDAS regression with leads proposed by Kuzin et al. [22] and Andreou et al. [20]. When our regression uses information between quarter t and t+1, we call it a regression with leads. For example, suppose we are one month into quarter t+1, hence at the end of January, April, July, or October, and our goal is to forecast the quarterly variable in quarter t+h. At this time, we will have one month of lead data, denoted by x^{(3)}_{t + 1/3}. Denote the number of leads by J_x, whose value range is (0, 1, 2). The extended model is as follows:

(7)  Y_{t+h} = \mu + \sum_{j=0}^{q_Y - 1} \rho_j Y_{t-j} + \beta \sum_{k=0}^{q_x + J_x - 1} w_k(\theta_h) x^{(3)}_{t + J_x/3 - k/3} + \varepsilon_{t+h}

When J_x = 0, model (7) degenerates to model (2). The calculation of the high-frequency lags also changes accordingly: the model not only includes the q_x lags before the end of quarter t but also includes the J_x lead months, so the total number of monthly terms is q_x + J_x. In particular, we call the forecast made by the model when h = 1 a nowcast, that is, forecasting the current quarter's quarterly variable by using the current quarter's monthly data. For more information about nowcasting and MIDAS regression with leads, see the detailed exposition in Andreou et al. [20].

2.3. The determination of lags

Before using the model for prediction, we need to determine the model's low-frequency lags q_Y and high-frequency lags q_x. These lag orders directly determine the richness of the high-frequency and low-frequency information included in the model and the number of estimated parameters. In the existing research, most scholars use expert experience or an information criterion to determine the model's lag structures.

Some researchers determine the model's lag structures based on their own research experience or expert advice. This is an empirical method grounded in a large number of prior studies. However, determining the model's lag structures by expert experience is highly subjective and prone to model specification errors. Different time series research objects and different data sets call for different lag structure choices, and subjective expert experience struggles to choose a reasonable lag structure for a new model and new data. Therefore, facing the more complex MIDAS model, expert experience cannot set the model's lag structure accurately.

The information criterion is the most commonly used method to determine the model's lag structure. Generally, three information criteria, AIC (Akaike information criterion), BIC (Schwarz-Bayesian information criterion) or HQC (Hannan–Quinn information criterion), are used to select models. The definitions of AIC, BIC and HQC are:

(8)  AIC = 2K - 2 ln(\hat{L})

(9)  BIC = K ln(N) - 2 ln(\hat{L})

(10)  HQC = 2K ln(ln(N)) - 2 ln(\hat{L})

K is the number of estimated parameters, N is the sample size, and \hat{L} is the maximum value of the likelihood function of the estimated model. We choose the model with the lowest information criterion value as the optimal model by calculating the AIC, BIC or HQC value of the model under different lag structures. The definitions of the three information criteria are similar: each is composed of a parameter-number penalty part and a maximum-likelihood part. The maximum-likelihood part ensures the best fit of the model within the sample. The parameter-number penalty part keeps the model as simple as possible and improves its generalization capability out of sample. In a single-frequency time series model, the lag order is proportional to the number of parameters K, so the information criterion can weigh the in-sample fit against out-of-sample generalization and choose the optimal lag order. However, because the information criterion only penalizes the model's total number of parameters, it cannot reflect the model's parameter structure. That is, the information criterion only examines the sum of q_Y and q_x and does not examine the structure between q_Y and q_x. Therefore, using the information criterion to determine the MIDAS model's lag structures has certain theoretical flaws.
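The three criteria in Eqs (8)–(10) are simple functions of the parameter count K, the sample size N, and the maximized log-likelihood; a short Python sketch with made-up numbers illustrates the fit-versus-parsimony trade-off:

```python
from math import log

def aic(K, loglik):
    return 2 * K - 2 * loglik                # Eq (8)

def bic(K, N, loglik):
    return K * log(N) - 2 * loglik           # Eq (9)

def hqc(K, N, loglik):
    return 2 * K * log(log(N)) - 2 * loglik  # Eq (10)

# Hypothetical comparison: a leaner model with a slightly worse fit can still
# win once the parameter penalty is taken into account.
lean = aic(K=5, loglik=-120.0)   # fewer parameters, worse fit
rich = aic(K=9, loglik=-118.5)   # more parameters, slightly better fit
```

For moderate sample sizes the per-parameter penalties order as AIC < HQC < BIC, so BIC tends to select the most parsimonious lag structure; but, as argued above, none of the three can distinguish how the total K is split between q_Y and q_x.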

Through theoretical analysis, both expert experience and the information criterion have specific defects in determining the MIDAS model's lag structures. However, determining q_Y and q_x is an essential part of model specification that directly affects the model's forecast capability, so this article proposes another way to determine the lag structures of the MIDAS model. Just as hyperparameters usually need to be tuned to improve the prediction accuracy of a machine learning model, we regard q_Y and q_x as the hyperparameters of the MIDAS model. By comparing the prediction accuracy of each possible model, we can find the optimal combination of q_Y and q_x. However, this process consumes a lot of time and computing resources when the ranges of q_Y and q_x are large. For example, when q_Y and q_x each range over 1 to 10, we need to estimate 100 models to find the optimal combination; when they range over 1 to 100, the number of models required increases to 10,000.

We usually compare forecast metrics to judge the forecast capability of a model. The root mean squared forecast error (RMSFE) is the square root of the mean squared forecast error and measures the magnitude of a typical forecasting "mistake". The calculation formula of the RMSFE is as follows:

(11)  RMSFE = sqrt( (1 / T_0) \sum_{t} (Y_{t+h} - \hat{Y}_{t+h|t})^2 )

where the sum runs over the T_0 forecast periods.

Y_{t+h} denotes the true value of the explained variable at period t+h, and \hat{Y}_{t+h|t} denotes the predicted value based on information available at period t. A model with a smaller RMSFE usually has a better forecast capability. We can find that the combination of q_Y and q_x has a complicated relationship with the model's RMSFE. Our goal is to minimize the model's RMSFE on the validation set to find the optimal lag structures. As long as we can find a suitable method to solve this minimization problem, we can find the optimal lag structures. Therefore, we introduce a method to solve this optimization problem in the next section.
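Eq (11) translates directly into code; a minimal sketch:

```python
from math import sqrt

def rmsfe(actual, predicted):
    """Root mean squared forecast error over an evaluation window."""
    sq_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(sq_errors) / len(sq_errors))

# A forecast that misses one of three periods by 2 units:
score = rmsfe([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])  # sqrt(4/3) ≈ 1.155
```

In the PSO-MIDAS setup described below, this quantity, computed on the validation set, serves as the objective to be minimized over (q_Y, q_x).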

2.4. Particle swarm optimization

Particle Swarm Optimization (PSO) is a self-organizing heuristic optimization algorithm proposed and developed by Kennedy and Eberhart [23, 24]. The invention of this algorithm is inspired by the swarm hunting behavior of birds and fish. It is a kind of swarm intelligent random optimization algorithm. Compared with other heuristic optimization algorithms, particle swarm optimization has high computational efficiency, robust parameter control, and easy implementation and application [25]. Besides, compared to other non-random optimization algorithms, the random strategy of the PSO algorithm allows for improved global optimization capabilities. It does not rely on the optimization problem’s strict mathematical properties, so it is widely used in nonlinear complex optimization problems.

The particle swarm algorithm realizes the search function by continuously iterating and updating the position P and velocity V of each particle in the particle swarm. In a D-dimensional search space, the position of the i-th particle at time t is a D-dimensional vector, Pi(t) = (pi1,t, pi2,t, ..., piD,t). Similarly, the velocity of the i-th particle at time t is also a D-dimensional vector, Vi(t) = (vi1,t, vi2,t, ..., viD,t). In order to find the optimal solution, each particle moves toward its historical optimal position pbest(i,t) and the group optimal position gbest(t) at its own speed. The calculation formulas of pbest(i,t) and gbest(t) are as follows:

(12)  pbest(i,t) = argmin_{k <= t} f(Pi(k))

(13)  gbest(t) = argmin_{i = 1,...,NP; k <= t} f(Pi(k))

i is the particle number, NP is the total number of particles in the particle swarm, t is the current iteration number, f(·) is the fitness function, and P is the particle's position. The following equations update the velocity V and position P of each particle:

(14)  Vi(t+1) = ω Vi(t) + c1 r1 (pbest(i,t) − Pi(t)) + c2 r2 (gbest(t) − Pi(t))

(15)  Pi(t+1) = Pi(t) + Vi(t+1)

V is the particle update speed, ω is the inertial weight that weighs the algorithm's global and local optimization capabilities, r1 and r2 are random variables subject to a uniform distribution on the interval [0,1], and c1 and c2 are acceleration coefficients. Normally, we add an upper bound to V to prevent particles from leaving the search space. The particle velocity update Eq (14) can be divided into three parts. The first part is the inertia part: particles retain part of the previous period's velocity to roam the entire search space. The second part is the "self-recognition" part: particles continue to approach their own historical optimal positions. The third part is the "group cooperation" part: particles continue to approach the group's optimal position. Inputting each particle's position into the fitness function f(·) yields the corresponding fitness value. The particle swarm algorithm aims to minimize the fitness function f(·) and obtain the global minimum fitness value and the corresponding particle position.

2.5. The PSO-MIDAS model

Given the significance of the selection of the lag structures in improving the forecast accuracy of the MIDAS model, in this section we apply the PSO algorithm to the optimization of the lag structures of the MIDAS model. Specifically, we set the lag structures of the MIDAS model as the particle's position Pi(t). The dimension of the particle's position is determined by the number of variables in the MIDAS model: the lag structure of a MIDAS model with N explanatory variables is (1+N)-dimensional, that is, one dimension for the autoregressive lags of the dependent variable plus one for each independent variable. Meanwhile, we set the fitness function f(·) as the RMSFE of the MIDAS model on the validation set. Taking the univariate model as an example: first divide the sample into a training set, validation set, and test set; input the two-dimensional lag structure (q_Y, q_x); estimate the parameters of the model on the training set; and then obtain the forecast metric of the model on the validation set. Substituting this process into the particle swarm optimization algorithm, we can find the MIDAS model with the best forecast metric on the validation set and expect that this model also has excellent prediction ability on the test set.

It is worth noting that we set the fitness function f(·) as the forecast metric of the MIDAS model on the validation set rather than the training set because time series prediction is time-sensitive. The validation set, being closer in time to the test set, is more timely than the training set, so the optimal lag structure can be extracted more effectively. Facing the complex nonlinear optimization problem of the MIDAS model's lag structures studied in this paper, we could use various heuristic algorithms besides the PSO algorithm, such as the well-known genetic algorithm and the simulated annealing algorithm. However, compared to the PSO algorithm, both of these alternatives consume more computing resources. Additionally, the genetic algorithm is better suited to discrete optimization problems than to the continuous optimization problem presented in this paper. In general, the introduction of the PSO algorithm saves the researcher's calculation and time costs while searching for the global optimum. The pseudo-code of the PSO-MIDAS model can be summarised as follows:

1: for i in 1 to NP do {for each particle in the swarm}

2: Randomly initialize particles’ positions Pi(0) and velocities Vi(0)

3: Initialize particles’ personal best pbest (i,0) and group best gbest (0)

4: end for

5: repeat

6: for i in 1 to NP do

7: Update particle's velocity using Eq (14)

8: Update particle's position using Eq (15)

9: if f(Pi(t)) < f(pbest(i,t)) then {minimization of f(·)}

10: Update particle's best-known position pbest(i,t+1) = Pi(t)

11: if f(pbest(i,t+1)) < f(gbest(t)) then {minimization of f(·)}

12: Update the group's best-known position gbest(t+1) = pbest(i,t+1)

13: end if

14: end if

15: end for

16: until [number of iterations T is met]

17: return gbest (T) {the optimal lag structure of the MIDAS model}
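The pseudo-code above can be sketched as a compact Python implementation. The fitness below is a hypothetical stand-in for the validation-set RMSFE of a fitted MIDAS model (its name and toy optimum are assumptions for illustration, not the paper's estimation code):

```python
import random

def pso_minimize(fitness, bounds, n_particles=20, n_iter=50,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize `fitness` over the box `bounds = [(lo, hi), ...]` with PSO."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                 # personal bests (pseudo-code lines 2-3)
    pbest_f = [fitness(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_f[i])
    gbest, gbest_f = pbest[g][:], pbest_f[g]    # group best
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # velocity update, Eq (14): inertia + self-recognition + cooperation
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # position update, Eq (15), clamped to the search box
                pos[i][d] = min(max(pos[i][d] + vel[i][d], bounds[d][0]), bounds[d][1])
            f = fitness(pos[i])
            if f < pbest_f[i]:                  # pseudo-code lines 9-10
                pbest[i], pbest_f[i] = pos[i][:], f
                if f < gbest_f:                 # pseudo-code lines 11-12
                    gbest, gbest_f = pos[i][:], f
    return gbest, gbest_f

# Stand-in fitness: in the real model this would estimate a MIDAS regression
# for the candidate lag structure and return its validation-set RMSFE.
def toy_fitness(p):
    q_Y, q_x = round(p[0]), round(p[1])         # lag orders are integers
    return (q_Y - 4) ** 2 + (q_x - 30) ** 2     # pretend the optimum is (4, 30)

best, best_f = pso_minimize(toy_fitness, [(1, 20), (1, 60)])
```

The bounds (1, 20) and (1, 60) mirror the 20-quarter/60-month search interval used in the empirical design, and the returned position should settle near the assumed optimum.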

3. Empirical experiment

3.1. Data collection

We collect U.S. quarterly real GDP data as the low-frequency data; the sample interval is from the first quarter of 1968 to the first quarter of 2020. Based on existing economic and financial theory and literature [21], we select the three variables of U.S. Industrial Production (IP), Non-Farm Payroll (NFP), and Capacity Utilization (CU) as the predictors. The first two predictors are components of the Conference Board Coincident Index. The three predictors are all monthly data used as high-frequency data in this article, and the sample interval is from January 1968 to March 2020. The U.S. GDP data come from the Bureau of Economic Analysis, the industrial production and capacity utilization data come from the Federal Reserve Board, and the Non-Farm Payroll data come from the Bureau of Labor Statistics. The descriptive statistics of the variables are shown in Table 1:

Fig 1 shows that the fluctuations in the quarterly GDP growth rate have significant periodicity. The GDP growth rate is positive in most cases, which means that the level of GDP generally rises. During the depression phase of the economic cycle, the GDP growth rate falls to negative values. The most recent trough was around the 2008 global financial crisis, and there was no sign of recovery until the end of 2009.

Fig 2 shows that the fluctuation of the monthly growth rate of industrial production (IP) is also cyclical, and the fluctuation range is relatively broad. The bottom of several fluctuations is around -5%, and after the 2008 financial crisis the negative growth rate was close to -15%. Compared with industrial production, the Non-Farm Payroll (NFP) monthly growth rate fluctuates more cyclically and moves consistently with the GDP growth rate, reflecting the close relationship between employment and output. This indicator's fluctuation range is relatively small, smaller than that of the GDP growth rate. The monthly growth rate of capacity utilization (CU) also fluctuates cyclically and widely; its historical fluctuations in the sample interval are mainly concentrated between -15% and 10%.

In summary, the three predictors are procyclical with the GDP growth rate and change in the same direction. The fluctuation range of NFP is relatively small, while CU fluctuates more widely.

3.2. Empirical design

The empirical design includes two parts. The first part explores the empirical results of the univariate PSO-MIDAS model; by comparing them with the benchmark models' results, we verify whether the optimization effect of the PSO algorithm on the univariate MIDAS model is significant. The second part further explores the empirical results of the multivariate PSO-MV-MIDAS model; similarly, by comparing with the benchmark models' results, we verify whether the PSO algorithm's optimization effect on the multivariate MV-MIDAS model is significant. Meanwhile, we explore whether the PSO algorithm's optimization effect differs between the univariate and multivariate MIDAS models. Before presenting the empirical results, we give specific explanations of the settings of the benchmark models, the MIDAS model, and the PSO algorithm.

The benchmark models include the ADL, U-MIDAS and ADL-MIDAS models (for convenience, the ADL-MIDAS model will be referred to as the MIDAS model below). The lag structures of the benchmark models are determined by the AIC information criterion. Both the PSO-MIDAS model and the PSO-MV-MIDAS model use the PSO algorithm to determine the MIDAS model's optimal lag structure. Each univariate model includes one of the three monthly variables (IP, NFP, or CU) as its only high-frequency explanatory variable, while the multivariate model includes all three high-frequency monthly variables as explanatory variables.

The forecasts of all models in this paper are out-of-sample rolling forecasts, where the ratio of the training set to the validation set and test set is 7:1:2. We use the PSO algorithm to find the optimal lag structure based on the model's RMSFE on the validation set. The model's RMSFE and MAPE on the test set are the averages of the 42 rolling forecast metrics. The forecast horizon h ranges over four quarters, that is, h = 1, 2, 3, 4. Each quarter contains three monthly leads, namely J_x = 0, 1, 2, so we study 12 scenarios in total: using the current quarter's data for real-time nowcasting and using data one to three quarters in advance for forward forecasting, covering a total period of 12 months. It is worth noting that, since the ADL model cannot include the lead information of high-frequency monthly data, we only examine the ADL model under the four forecast horizons without leads.
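The 7:1:2 split can be made concrete with a little arithmetic on the 1968Q1–2020Q1 sample (209 quarters). The rounding rule below is an assumption for illustration, but it reproduces the 42 test-set rolling forecasts mentioned above:

```python
n = 209                  # quarters from 1968Q1 through 2020Q1
train = round(0.7 * n)   # 7 parts of the sample
val = round(0.1 * n)     # 1 part
test = n - train - val   # 2 parts, one rolling forecast per test quarter

# One rolling window per test quarter: the estimation sample advances one
# quarter at a time and each step yields one out-of-sample forecast error.
windows = [(i, train + val + i) for i in range(test)]
```

Under this rounding, train/val/test come out to 146, 21, and 42 quarters; the exact windowing used by the authors may of course differ in detail.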

Fig 3 shows the flowchart of the PSO-MIDAS empirical design; the algorithm is implemented with the MIDAS toolbox for R (the midasr package) [26]. For the PSO algorithm's hyperparameters, we follow the settings of the Standard Particle Swarm Optimisation 2011 (SPSO 2011) proposed by Clerc et al. [27]; readers can find more details about these hyperparameter settings in the Standard Particle Swarm Optimisation [28]. The optimization interval of the PSO algorithm is set to 5 years (20 quarters, 60 months); that is, the low-frequency lags of GDP are searched over at most 20 quarters, and the high-frequency lags of the predictors over at most 60 months. The PSO algorithm hyperparameter settings used in this article are shown in Table 2.
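The lag search itself can be sketched as follows. This is a simplified inertia-weight PSO in Python using SPSO 2011's default coefficients, not the paper's exact R implementation (the full SPSO 2011 variant also uses rotation-invariant neighbourhood sampling); all function names are illustrative:

```python
import math
import random

# SPSO 2011 default coefficients (Clerc et al.)
W = 1 / (2 * math.log(2))   # inertia weight, ~0.721
C = 0.5 + math.log(2)       # cognitive and social coefficient, ~1.193

def as_lags(pos, bounds):
    """Round a continuous particle position to an integer lag pair within bounds."""
    return tuple(min(max(int(round(x)), lo), hi)
                 for x, (lo, hi) in zip(pos, bounds))

def pso_lag_search(fitness, bounds=((1, 20), (1, 60)),
                   n_particles=24, n_iter=50, seed=0):
    """Minimize a validation-set RMSE `fitness(lags)` over the mixed-frequency
    lag structure: up to 20 quarterly GDP lags and 60 monthly predictor lags."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_f = [fitness(as_lags(p, bounds)) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_f[i])
    gbest, gbest_f = pbest[g][:], pbest_f[g]
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                # Inertia + attraction to personal best + attraction to global best
                vel[i][d] = (W * vel[i][d]
                             + C * rng.random() * (pbest[i][d] - pos[i][d])
                             + C * rng.random() * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            f = fitness(as_lags(pos[i], bounds))
            if f < pbest_f[i]:
                pbest[i], pbest_f[i] = pos[i][:], f
                if f < gbest_f:
                    gbest, gbest_f = pos[i][:], f
    return as_lags(gbest, bounds), gbest_f
```

In the paper, `fitness` corresponds to refitting the MIDAS regression with the candidate lag structure and returning its validation-set RMSE.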

4. Results

4.1. Univariate PSO-MIDAS model

Univariate models use only one predictor variable as the explanatory variable. We therefore use the IP, NFP, and CU indicators to construct three PSO-MIDAS models and the corresponding benchmark models. In Table 3, the first three columns describe the correspondence between the forecast horizon h, the lead, and the forecast type. Using the current quarter's monthly data to forecast the current quarter's variable is called real-time forecasting; using earlier quarters' monthly data is forward forecasting in the usual sense. The reported result for the PSO-MIDAS model is the RMSFE of the out-of-sample rolling forecasts, while the MIDAS, U-MIDAS, and ADL models serve as benchmarks. We compare the PSO-MIDAS model's RMSFE with each benchmark model's RMSFE and report the ratio as the comparison result. A ratio below 100% means that the RMSFE of the PSO-MIDAS model is smaller than that of the benchmark model, so the PSO-MIDAS model forecasts better; a ratio above 100% indicates that the PSO-MIDAS model's forecast capability has not improved. The smaller the ratio, the stronger the PSO-MIDAS optimization effect. In the last row of the table, we report the average RMSFE ratio of the PSO-MIDAS model against each of the three benchmark models to evaluate its average optimization performance.
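The two quantities compared in Tables 3 through 13 can be stated concretely (an illustrative Python sketch; the function names are ours, not the paper's):

```python
import math

def rmsfe(actual, forecast):
    """Root Mean Squared Forecast Error over the rolling test forecasts."""
    n = len(actual)
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / n)

def rmsfe_ratio(pso_rmsfe, benchmark_rmsfe):
    """Comparison ratio in percent; values below 100% favor PSO-MIDAS."""
    return 100.0 * pso_rmsfe / benchmark_rmsfe
```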

Table 3. Comparison of the RMSFE of univariate mixed-frequency models: Based on IP indicators.

https://doi.org/10.1371/journal.pone.0315604.t003

Table 3 shows that when IP is the explanatory variable, the PSO-MIDAS model's forecast capability is better than that of the other benchmark models. First, the PSO-MIDAS model is significantly better than the U-MIDAS model: relative to the U-MIDAS model's RMSFE, the optimization effect of PSO-MIDAS is 37% on average. Second, the PSO-MIDAS model's forecast capability is also better than that of the ADL models, with an optimization effect of about 25%, indicating that the high-frequency data of the explanatory variable IP contribute to the GDP forecast. Finally, the PSO-MIDAS model's forecast results are better than the MIDAS model's: the RMSFE ratio is below 100%, and the improvement is about 15%. It is worth noting that as h grows, the RMSFE of each model also grows, reflecting that predicting the more distant future with the same information is harder than predicting the near future. However, when h is large, PSO-MIDAS improves forecast capability more strongly relative to the benchmark models. This suggests that at longer horizons the MIDAS model's predictive performance becomes more sensitive to the choice of lag structure, and the PSO-MIDAS model constructed in this paper resolves the problem of choosing the MIDAS model's mixed-frequency lag structure. The particle swarm algorithm improves the traditional MIDAS model by retaining adequate and sufficient high-frequency information while filtering out redundant data noise.

In Table 4, we replace the forecast metric in Table 3 with the MAPE (Mean Absolute Percentage Error) to evaluate the forecast accuracy of the model on the test set. The MAPE is calculated as
\[ \mathrm{MAPE} = \frac{100\%}{n}\sum_{t=1}^{n}\left|\frac{y_t-\hat{y}_t}{y_t}\right| \tag{16} \]
where \(y_t\) is the actual value, \(\hat{y}_t\) is the forecast, and \(n\) is the number of rolling forecasts.
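A direct transcription of the MAPE metric (illustrative Python; the paper's evaluation is carried out in R):

```python
def mape(actual, forecast):
    """Mean Absolute Percentage Error in percent (Eq. 16)."""
    assert len(actual) == len(forecast) and all(a != 0 for a in actual)
    return 100.0 * sum(abs((a - f) / a)
                       for a, f in zip(actual, forecast)) / len(actual)

# Three quarters, each with a 10% absolute percentage error:
mape([2.0, 1.0, 4.0], [1.8, 1.1, 4.4])  # ~10.0
```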

The comparison results are similar to those in Table 3. The optimization effects of the PSO-MIDAS model relative to the MIDAS, U-MIDAS, and ADL models are 13%, 32%, and 23%, respectively. We note that samples with larger forecast errors have a greater impact on the MAPE than on the RMSE, so the optimization effect of the PSO-MIDAS model measured by MAPE is slightly smaller. Moreover, we assess the significance of differences in forecast accuracy using the Diebold–Mariano test with Harvey et al.'s small-sample bias-corrected variance [29, 30]. The Diebold–Mariano test compares the accuracy of two forecasting methods by assessing whether the difference in their forecast errors is statistically significant. The results are reported in Table 5: the forecast accuracy of the PSO-MIDAS model is significantly better than that of the U-MIDAS model, and when h = 3, 4 it is also significantly better than that of the MIDAS and ADL models.
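The test statistic can be sketched as follows: an illustrative Python implementation of the Diebold–Mariano statistic under squared-error loss with the Harvey et al. small-sample correction (the paper's computations are done in R, and this function name is ours):

```python
import math

def dm_test(e1, e2, h=1):
    """Diebold-Mariano statistic with the Harvey et al. (1997) small-sample
    correction, under squared-error loss.
    e1, e2: forecast errors of the two competing models; h: forecast horizon.
    A positive statistic means model 2 forecasts more accurately."""
    n = len(e1)
    d = [a * a - b * b for a, b in zip(e1, e2)]  # loss differential
    dbar = sum(d) / n

    def autocov(k):
        return sum((d[t] - dbar) * (d[t - k] - dbar) for t in range(k, n)) / n

    # Long-run variance of dbar: autocovariances up to lag h-1.
    var = (autocov(0) + 2 * sum(autocov(k) for k in range(1, h))) / n
    dm = dbar / math.sqrt(var)
    # Harvey et al. small-sample correction factor.
    correction = math.sqrt((n + 1 - 2 * h + h * (h - 1) / n) / n)
    return correction * dm
```

The corrected statistic is compared against Student-t critical values with n-1 degrees of freedom rather than the normal distribution.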

Table 4. Comparison of the MAPE of univariate mixed-frequency models: Based on IP indicators.

https://doi.org/10.1371/journal.pone.0315604.t004

Table 5. Diebold–Mariano test results: Based on IP indicators.

https://doi.org/10.1371/journal.pone.0315604.t005

Table 6 shows that when NFP is the explanatory variable, the PSO-MIDAS model's RMSFE is better than that of the other benchmark models. Compared with the U-MIDAS model, the average optimization effect of PSO-MIDAS is about 9%; compared with the ADL and MIDAS models, it is about 23% and 15%, respectively. Although the PSO-MIDAS model with NFP as the explanatory variable generally forecasts better than the benchmark models, its optimization effect is weaker than that of the PSO-MIDAS model with IP as the explanatory variable. Tables 7 and 8 report the MAPE and Diebold–Mariano test results of the PSO-MIDAS model against the benchmark models, respectively. The Diebold–Mariano test results show that the optimization effect of PSO-MIDAS is not significant relative to U-MIDAS, while for h = 3, 4 the forecast accuracy of the PSO-MIDAS model is significantly better than that of the MIDAS and ADL models.

Table 6. Comparison of the RMSFE of univariate mixed-frequency models: Based on NFP indicators.

https://doi.org/10.1371/journal.pone.0315604.t006

Table 7. Comparison of the MAPE of univariate mixed-frequency models: Based on NFP indicators.

https://doi.org/10.1371/journal.pone.0315604.t007

Table 8. Diebold–Mariano test results: Based on NFP indicators.

https://doi.org/10.1371/journal.pone.0315604.t008

Furthermore, Table 9 shows that when CU is the explanatory variable, the RMSFE results of the PSO-MIDAS model are still better than those of the other benchmark models: the average optimization effect is 25% relative to the U-MIDAS model, 19% relative to the ADL model, and 12% relative to the MIDAS model. The MAPE and Diebold–Mariano test results in Tables 10 and 11 are similar to those of the PSO-MIDAS model with the IP variable. In conclusion, the PSO-MIDAS model's forecast accuracy is better than that of the benchmark univariate models, especially when the forecast horizon h is large. By embedding the particle swarm algorithm in the MIDAS model, we can efficiently determine the MIDAS model's lag structure and improve its prediction accuracy.

Table 9. Comparison of the RMSFE of univariate mixed-frequency models: Based on CU indicators.

https://doi.org/10.1371/journal.pone.0315604.t009

Table 10. Comparison of the MAPE of univariate mixed-frequency models: Based on CU indicators.

https://doi.org/10.1371/journal.pone.0315604.t010

Table 11. Diebold–Mariano test results: Based on CU indicators.

https://doi.org/10.1371/journal.pone.0315604.t011

4.2. Multivariate PSO-MV-MIDAS model

Both the univariate PSO-MIDAS model and the multivariate PSO-MV-MIDAS model optimize the selection of the MIDAS model's mixed-frequency lag structure through the PSO algorithm; however, the number of variables in the MIDAS model changes the PSO algorithm's optimization task. The univariate PSO-MIDAS model solves a two-dimensional optimization problem. The multivariate PSO-MV-MIDAS model incorporates three variables (IP, NFP, and CU); including the autoregressive term of the quarterly GDP growth rate, it solves a four-dimensional optimization problem.
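The jump from two to four dimensions matters because the number of candidate integer lag structures multiplies across dimensions. Under the 20-quarter/60-month bounds described above, a hypothetical exhaustive search would face (illustrative Python):

```python
# Univariate PSO-MIDAS: one quarterly GDP lag + one monthly lag -> 2-D.
univariate_bounds = [(1, 20), (1, 60)]
# Multivariate PSO-MV-MIDAS: GDP lag + one monthly lag per predictor
# (IP, NFP, CU) -> 4-D.
multivariate_bounds = [(1, 20)] + [(1, 60)] * 3

def grid_size(bounds):
    """Number of candidate integer lag structures for an exhaustive search."""
    size = 1
    for lo, hi in bounds:
        size *= hi - lo + 1
    return size

print(grid_size(univariate_bounds))    # 20 * 60   = 1,200 candidates
print(grid_size(multivariate_bounds))  # 20 * 60^3 = 4,320,000 candidates
```

The PSO algorithm instead evaluates only a few thousand candidates per search, which is why it scales to the multivariate case where refitting the model for every candidate would be infeasible.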

Tables 12 and 13 show that the PSO-MV-MIDAS model outperforms the other multivariate benchmark models in forecast accuracy, as measured by RMSFE and MAPE: the average optimization effect is 71% relative to the U-MV-MIDAS model, 34% relative to the MV-ADL model, and 12% relative to the MV-MIDAS model. As with the univariate PSO-MIDAS models, the PSO-MV-MIDAS model's optimization effect is larger when the forecast horizon h is larger. The Diebold–Mariano test results for the PSO-MV-MIDAS model against the multivariate benchmark models in Table 14 show that its forecast accuracy is significantly better when h = 2, 3, 4.

Table 12. Comparison of the RMSFE of multivariate mixed-frequency models: Based on IP, NFP, and CU indicators.

https://doi.org/10.1371/journal.pone.0315604.t012

Table 13. Comparison of the MAPE of multivariate mixed-frequency models: Based on IP, NFP, and CU indicators.

https://doi.org/10.1371/journal.pone.0315604.t013

Table 14. Diebold–Mariano test results: Based on IP, NFP, and CU indicators.

https://doi.org/10.1371/journal.pone.0315604.t014

Finally, to summarize the results, Fig 4 plots the average optimization effects of the univariate and multivariate PSO-MIDAS models relative to the benchmark models. When calculating the optimization effect of the PSO-MIDAS model, we exclude model results that are not significant in the Diebold–Mariano test. For both the univariate and multivariate PSO-MIDAS models, the optimization effect relative to the MIDAS model averages about 10%. Additionally, Fig 5 presents the GDP prediction results of the PSO-MIDAS model with a forecast horizon of 7 months (the Diebold–Mariano test results show that the univariate and multivariate PSO-MIDAS models outperform the benchmark MIDAS model at this horizon). It is worth noting that in multivariate models the total lag order grows exponentially, so the number of parameters to be estimated in the U-MV-MIDAS and MV-ADL models increases considerably, reducing those models' generalization ability. This explains why the PSO-MV-MIDAS model significantly outperforms both the U-MV-MIDAS and MV-ADL models.

Fig 4. The average optimization effects of the PSO-MIDAS against three benchmark models.

https://doi.org/10.1371/journal.pone.0315604.g004

Fig 5. Predicted GDP of the PSO-MIDAS model with a forecast horizon of 7 months.

https://doi.org/10.1371/journal.pone.0315604.g005

5. Conclusion

When using the MIDAS model for GDP growth forecasting, traditional model selection methods cannot effectively determine the mixed-frequency lag structure, which hurts forecast accuracy. In this paper, the PSO-MIDAS model was proposed to resolve the MIDAS model's mixed-frequency lag structure selection problem. The univariate PSO-MIDAS model was shown to outperform the benchmark MIDAS, U-MIDAS, and ADL models, and the multivariate PSO-MV-MIDAS model likewise outperformed the multivariate benchmark models. The Diebold–Mariano test results show that incorporating the PSO algorithm significantly enhances the forecasting ability of the MIDAS model, particularly at longer forecast horizons. Based on these findings, when applying the MIDAS model to macroeconomic forecasting we suggest using the PSO-MIDAS model when the forecast horizon exceeds 2 quarters, and the PSO-MV-MIDAS model when it exceeds 1 quarter. Regarding the choice of independent variables, the forecast performance of the PSO-MIDAS model with IP or CU as the independent variable is superior to that with NFP. Moreover, the standard deviation of the NFP variable is markedly smaller than that of the other variables, so we recommend using independent variables with larger standard deviations.

During the research, we also found some limitations of the PSO-MIDAS model, which suggest directions for future work. First, the effectiveness of the PSO-MIDAS model rests on the assumption that the time series behaves consistently across the validation set and the test set. If the underlying lag structure of the target series changes on the test set, the validation set can no longer identify the new optimal lag structure. For example, our model predicted poorly during the COVID-19 pandemic, possibly because the selected variables and their lag structures did not fully reflect the impact of the epidemic; the benchmark MIDAS model faces the same problem. Some studies have captured such inconsistency through methods such as regime switching, which inspires future improvements of PSO-MIDAS. Second, we also tried optimizing the PSO-MIDAS lag structure with the validation-set MAPE, but the resulting accuracy was worse than with the validation-set RMSE as the fitness function. This may be because the MAPE is more sensitive to larger forecast errors, leading to a less smooth fitness function and poorer PSO optimization. The choice of forecast metric for the fitness function is therefore another direction for future research. Third, the PSO-MIDAS model can be applied to other fields and other data frequency mixtures in future studies: including financial market variables would change the frequency mixture to monthly-daily or even quarterly-daily combinations, and exploring the optimization impact of the PSO-MIDAS model across different frequency mixtures is an intriguing topic.

Supporting information

S1 File. Data and code used in this article.

https://doi.org/10.1371/journal.pone.0315604.s001

(ZIP)

References

  1. Sun Z., Liu X., Guo H., A method for constructing the Composite Indicator of business cycles based on information granulation and Dynamic Time Warping, Knowledge-Based Syst. 101 (2016) 135–141. https://doi.org/10.1016/j.knosys.2016.03.013.
  2. Clements M.P., Galvão A.B., Macroeconomic Forecasting With Mixed-Frequency Data, J. Bus. Econ. Stat. 26 (2008) 546–554. https://doi.org/10.1198/073500108000000015.
  3. Clements M.P., Galvão A.B., Forecasting US output growth using leading indicators: an appraisal using MIDAS models, J. Appl. Econom. 24 (2009) 1187–1206. https://doi.org/10.1002/jae.1075.
  4. Ghysels E., Wright J.H., Forecasting Professional Forecasters, J. Bus. Econ. Stat. 27 (2009) 504–516. https://doi.org/10.1198/jbes.2009.06044.
  5. Stock J., Watson M., How Did Leading Indicator Forecasts Perform during the 2001 Recession?, Econ. Q. 89 (2003) 71–90.
  6. Lahiri K., Monokroussos G., Nowcasting US GDP: The role of ISM business surveys, Int. J. Forecast. 29 (2013) 644–658. https://doi.org/10.1016/j.ijforecast.2012.02.010.
  7. Ghysels E., Santa-Clara P., Valkanov R., The MIDAS Touch: Mixed Data Sampling Regression Models, Finance (2004).
  8. Ghysels E., Sinko A., Valkanov R., MIDAS Regressions: Further Results and New Directions, Econom. Rev. 26 (2007) 53–90. https://doi.org/10.1080/07474930600972467.
  9. Pan Y., Xiao Z., Wang X., Yang D., A multiple support vector machine approach to stock index forecasting with mixed frequency sampling, Knowledge-Based Syst. 122 (2017) 90–102. https://doi.org/10.1016/j.knosys.2017.01.033.
  10. Ghysels E., Santa-Clara P., Valkanov R., There is a risk-return trade-off after all, J. Financ. Econ. 76 (2005) 509–548. https://doi.org/10.1016/j.jfineco.2004.03.008.
  11. Girardin E., Joyeux R., Macro fundamentals as a source of stock market volatility in China: A GARCH-MIDAS approach, Econ. Model. 34 (2013) 59–68. https://doi.org/10.1016/j.econmod.2012.12.001.
  12. Xu Q., Bo Z., Jiang C., Liu Y., Does Google search index really help predicting stock market volatility? Evidence from a modified mixed data sampling model on volatility, Knowledge-Based Syst. 166 (2019) 170–185. https://doi.org/10.1016/j.knosys.2018.12.025.
  13. Monteforte L., Moretti G., Real-Time Forecasts of Inflation: The Role of Financial Variables, J. Forecast. 32 (2013) 51–61. https://doi.org/10.1002/for.1250.
  14. Li X., Shang W., Wang S., Ma J., A MIDAS modelling framework for Chinese inflation index forecast incorporating Google search data, Electron. Commer. Res. Appl. 14 (2015) 112–125. https://doi.org/10.1016/j.elerap.2015.01.001.
  15. Kuzin V., Marcellino M., Schumacher C., MIDAS vs. mixed-frequency VAR: Nowcasting GDP in the euro area, Int. J. Forecast. 27 (2011) 529–542. https://doi.org/10.1016/j.ijforecast.2010.02.006.
  16. Barsoum F., Stankiewicz S., Forecasting GDP growth using mixed-frequency models with switching regimes, Int. J. Forecast. 31 (2015) 33–50. https://doi.org/10.1016/j.ijforecast.2014.04.002.
  17. Qiu Y., Forecasting the Consumer Confidence Index with tree-based MIDAS regressions, Econ. Model. 91 (2020) 247–256. https://doi.org/10.1016/j.econmod.2020.06.003.
  18. Li L., Han C., Yao S., Ning L., Variable Weights Combination MIDAS Model Based on ELM for Natural Gas Price Forecasting, IEEE Access 10 (2022) 52075–52093.
  19. Wang C., Kang W., Forecasting China's Steam Coal Prices Using Dynamic Factors and Mixed-Frequency Data, Pol. J. Environ. Stud. 30 (2021) 4241–4254. https://doi.org/10.15244/pjoes/131856.
  20. Andreou E., Ghysels E., Kourtellos A., Should Macroeconomic Forecasters Use Daily Financial Data and How?, J. Bus. Econ. Stat. 31 (2013) 240–251. https://doi.org/10.1080/07350015.2013.767199.
  21. Clements M.P., Galvão A.B., Macroeconomic Forecasting with Mixed Frequency Data: Forecasting US Output Growth and Inflation, 2011. https://doi.org/10.2139/ssrn.878445.
  22. Kuzin V., Marcellino M., Schumacher C., Pooling versus model selection for nowcasting GDP with many predictors: empirical evidence for six industrialized countries, J. Appl. Econom. 28 (2013) 392–411. https://doi.org/10.1002/jae.2279.
  23. Eberhart R., Kennedy J., A new optimizer using particle swarm theory, in: MHS'95. Proc. Sixth Int. Symp. Micro Mach. Hum. Sci., 1995, pp. 39–43. https://doi.org/10.1109/MHS.1995.494215.
  24. Kennedy J., Eberhart R., Particle swarm optimization, in: Proc. ICNN'95 - Int. Conf. Neural Networks, 1995, pp. 1942–1948, vol. 4. https://doi.org/10.1109/ICNN.1995.488968.
  25. Bansal J.C., Singh P.K., Saraswat M., Verma A., Jadon S.S., Abraham A., Inertia Weight strategies in Particle Swarm Optimization, in: 2011 Third World Congr. Nat. Biol. Inspired Comput., 2011, pp. 633–640. https://doi.org/10.1109/NaBIC.2011.6089659.
  26. Ghysels E., Kvedaras V., Zemlys V., Mixed Frequency Data Sampling Regression Models: The R Package midasr, J. Stat. Softw. 72(4) (2016). https://doi.org/10.18637/jss.v072.i04.
  27. Clerc M., Zambrano-Bigiarini M., Rojas R., Standard Particle Swarm Optimisation 2011 at CEC-2013: A baseline for future PSO improvements, 2013. https://doi.org/10.1109/CEC.2013.6557848.
  28. Clerc M., Standard Particle Swarm Optimisation, 2012. hal-00764996.
  29. Diebold F.X., Mariano R.S., Comparing predictive accuracy, J. Bus. Econ. Stat. 13 (1995) 253–263.
  30. Harvey D., Leybourne S., Newbold P., Testing the equality of prediction mean squared errors, Int. J. Forecast. 13 (1997) 281–291.