Performance comparison of linear and non-linear feature selection methods for the analysis of large survey datasets

Olga Krakovska; Gregory Christie; Andrew Sixsmith; Martin Ester; Sylvain Moreno

doi:10.1371/journal.pone.0213584

Abstract

Large survey databases for aging-related analysis are often examined to discover key factors that affect a dependent variable of interest. Typically, this analysis is performed with methods assuming linear dependencies between variables. Such assumptions, however, do not hold in many cases, wherein data are linked by way of non-linear dependencies. This in turn requires the application of analytic methods that are more accurate in identifying potentially non-linear dependencies. Here, we objectively compared the feature selection performance of several frequently-used linear selection methods and three non-linear selection methods in the context of large survey data. These methods were assessed using both synthetic and real-world datasets, wherein relationships between the features and dependent variables were known in advance. We found that the non-linear methods offered better overall feature selection performance than the linear methods in all usage conditions. Moreover, the performance of the non-linear methods was more stable, being unaffected by the inclusion or exclusion of variables from the datasets. These properties make non-linear feature selection methods a potentially preferable tool for both hypothesis-driven and exploratory analyses of aging-related datasets.

Citation: Krakovska O, Christie G, Sixsmith A, Ester M, Moreno S (2019) Performance comparison of linear and non-linear feature selection methods for the analysis of large survey datasets. PLoS ONE 14(3): e0213584. https://doi.org/10.1371/journal.pone.0213584

Editor: Konstantinos C. Fragkos, University College London Hospitals NHS Foundation Trust, UNITED KINGDOM

Received: February 22, 2018; Accepted: February 25, 2019; Published: March 21, 2019

Copyright: © 2019 Krakovska et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: Data underlying the study are from a third party. A public use file of data is available from the Wisconsin Longitudinal Study, University of Wisconsin-Madison, 1180 Observatory Drive, Madison, Wisconsin 53706 and at http://www.ssc.wisc.edu/wlsresearch/data/ and Health and Retirement Study available from https://hrs.isr.umich.edu/data-products/access-to-public-data. The authors confirm they did not have any special access to this data.

Funding: This study was supported by grants from the Simon Fraser University Community Trust Endowment Fund and an AGE-WELL Catalyst grant to M.E. and S.M.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Within the field of statistical gerontology, there has been increasing use of large databases to explore relationships between key factors and some outcome variable(s) of interest (dependent variable(s)). Indeed, several survey initiatives have been set up to track the biological, social and lifestyle factors that affect health and quality of life throughout the lifespan, e.g. the Health and Retirement Study [1], the Wisconsin Longitudinal Study [2], the Canadian Longitudinal Study on Aging [3], and the National Population Health Survey [4]. These databanks are a valuable resource that can be used to identify and quantify the factors affecting health in aging. In turn, the results of these analyses can empower key stakeholders, including end users and policy makers, to make informed decisions for themselves and optimized decisions at higher levels, i.e. at the level of healthcare systems.

However, the use of these datasets presents some significant challenges if they are to be used optimally to provide us with convincing results, strong evidence, and useful information. For example, it is important that researchers identify and use only those variables within a database that are relevant to the outcome in question. Typically, quantitative analysis within gerontology has used linear methods (methods that assume linear relationships) as a means of simplifying data and identifying relevant variables [5–8]. However, the use of linear methods in databases where there are non-linear relationships can lead to misleading results [9]. A systematic review of 893 papers showed that 92% of the included papers using linear methods were unclear about the assumptions of the methods used [10]. The purpose of this paper is to provide a systematic evaluation of different approaches to feature selection. We will do this firstly by reviewing and discussing in more detail some of the key problems and limitations in the analysis of large survey databases, including variable selection when dealing with non-linear relationships. Secondly, we quantitatively compare a range of different linear and non-linear methods (by non-linear methods we mean methods that do not necessarily assume linear relationships) in order to evaluate their relative performance in terms of selecting relevant features from two example large survey databases.

Background

It is not uncommon for large survey databases to store dozens or hundreds of different measurements for each person (we refer to these measurements herein as features). Given their size and complexity, it is not usually practical for researchers to assess how all factors within a database interact to determine an outcome of interest (say, mortality rate). Instead, researchers will often select a handful of features and assess the predictive ability of these features using a variant of regression such as linear regression. Unfortunately, both of these operations—feature selection and prediction—are potentially problematic for the analysis of many large survey databases. Here, we outline two major issues inherent to this analytic technique and offer an alternative approach, which may be better suited for the analysis of data within these survey databases, when it is reasonable to assume non-linear relationships.

The first major issue pertains to the process of correctly distinguishing relevant features from irrelevant ones. In nearly all analyses of aging-related datasets, experimenters must identify and select features that are relevant to the dependent variables of interest and reject all other, irrelevant features. Broadly construed, this is typically done using one of two non-exclusive approaches. The first is to select features based on prior knowledge and one or more a priori hypotheses. We refer to this as model selection. For example, a researcher may be curious about the effects of alcohol consumption on mortality rates. The researcher could then select features that are relevant to the question of interest (e.g. number of alcoholic units consumed per week), along with other features that they believe may confound the results (e.g. education level), and ignore all other, presumably irrelevant features.

Although this practice is employed frequently (more than 50% of papers that analyzed “life activities” in the HRS [1] dataset in 2012–2017 used linear methods), it is potentially problematic for several reasons. Obviously, the predictive accuracy of a solution is only as good as the features selected to model it, and model selection can fail when relevant features are not selected for inclusion, irrelevant features are selected for inclusion, or both. Although prior knowledge can help guide this manual process, there is no guarantee that this knowledge will lead to the selection of all relevant features and the rejection of all irrelevant ones. In fact, as the number of features in a database increases, the likelihood of erroneous model selection approaches certainty, a problem referred to as the model problem. Model selection is also impractical for exploratory data analyses, in which researchers have weak (or no) a priori hypotheses or knowledge to guide selection. Finally, model selection is likely insufficient to eliminate the problem of multicollinearity, which occurs when one or more features are correlated with other features.

Rather than selecting features manually, researchers can also use statistical approaches that transform the original, higher-dimensional feature space into a lower-dimensional space. For example, exploratory factor analysis and principal components analysis [11, 12] explain patterns of inter-correlated data by way of a small number of underlying factors. Compared to manual feature selection, these factor analyses are agnostic to a priori hypotheses and are therefore more appropriate for exploratory data analyses. Moreover, because they are data-driven, they also minimize collinearity (and maximize parsimony) by explaining the greatest amount of variance in the original data with the fewest underlying factors. However, because these are transformational approaches, the computed factors represent a combination of the underlying, original features. In other words, a computed factor does not represent any one original feature in the dataset, but rather a complex combination of all features in the dataset. As a consequence, the interpretation of these results can be difficult and somewhat subjective.

Building atop this, a second major issue pertains to the process of simplifying datasets that contain non-linear relationships between variables. It is thought that many non-linear relationships exist linking variables within the health sciences[13–15]. These non-linear dependencies further exacerbate the challenge of correctly reducing the dimensionality of a dataset, as many linear methods can fail to adequately identify them. The performance of linear methods is also negatively impacted if datasets include extreme values and skewed distributions, both of which are, again, common in survey datasets.

Linear methods have been widely adopted for problems in data projection and dimensionality reduction. They remain the first choice in gerontology, even though they are not always optimal. Here, we evaluate the performance of the most frequently used linear methods as well as non-linear methods on survey datasets.

In applications to large survey datasets, identification of the relevant features is usually done by automatic feature selection. Automatic feature selection derives a simplified model from the statistical properties of the underlying data, only this time by selecting the original features in the dataset. This process, while powerful, comes at a combinatorial cost. A brute-force solution—that is, one that finds the best solution by systematically assessing all possible combinations of underlying features—is computationally unfeasible for large survey databanks, which can easily contain hundreds or thousands of features. Instead, an approximate solution must be estimated, typically using one of three broad categories of selection methods: filter, wrapper, and embedded[16, 17]. These methods differ in how they select relevant features from irrelevant ones and thus merit a brief introduction.

Filter methods are a pre-processing step that scores each feature using a statistical measure (e.g. correlation coefficient), ranks all features on this measure, and rejects features that fall below a cut-off criterion. Filter methods are by far the least computationally demanding, because they operate on each feature individually and ignore dependencies between them. However, this same approach means that filter methods do not solve the problem of multicollinearity, which can in turn lead to poor performance relative to other techniques. In recognition of this, one common application of filter methods is identifying relevant features for future modelling.
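As a minimal illustration of the score, rank and cut-off steps described above (not the procedure used in this study), the following sketch applies a Pearson-correlation filter to a toy dataset. The function names, the feature names and the cut-off of 0.3 are hypothetical; the filter methods evaluated in this paper use significance tests rather than a fixed threshold.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no information
    return sxy / math.sqrt(sxx * syy)


def filter_select(features, target, cutoff):
    """Score each feature on its own, rank by score, keep those above the cut-off.

    `features` maps a feature name to its column of values; `cutoff` is a
    hypothetical threshold on the absolute correlation.
    """
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [name for name in ranked if scores[name] >= cutoff]


rng = random.Random(0)
y = [rng.uniform(-15, 15) for _ in range(500)]
features = {
    "linear": [2 * v + rng.gauss(0, 1) for v in y],  # depends on y linearly
    "quadratic": [v * v for v in y],                 # depends on y, but not linearly
    "noise": [rng.uniform(0, 7) for _ in y],         # unrelated to y
}
selected = filter_select(features, y, cutoff=0.3)
```

In this toy example the correlation-based filter keeps the linearly related feature but discards the quadratic one together with the noise, which is the failure mode that motivates filters built on non-linear dependence measures.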

Wrapper methods are a category of approaches in which features are selected and assessed, in conjunction with other features, on their ability to account for the variance in the underlying data. An algorithm iteratively learns to select the combination of features that best explains the data. Common approaches include forward and backward selection, in which the algorithm starts with either none or all of the variables (respectively), and adds or removes variables until the model no longer improves. This approach is widely employed in both linear and non-linear methods. Given their iterative nature, wrapper methods are relatively expensive computationally, and the typical forward/backward selection methods have both been shown experimentally to be potentially problematic in terms of identifying the most relevant subset of features [18, 19]. There is also a risk that wrapper methods can overfit the data, meaning that the solution accounts for random noise and in actuality has relatively poor predictive performance when applied to new data on which it has not been trained.
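The forward variant of this procedure can be sketched as follows. This is an illustrative implementation under stated assumptions, not the code used in this study: the helper names are hypothetical, ordinary least squares is fitted through the normal equations, and the stopping rule is a Gaussian BIC (defined up to an additive constant).

```python
import math
import random


def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def rss(columns, y):
    """Residual sum of squares of an OLS fit of y on the columns plus an intercept."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(design[0])
    xtx = [[sum(row[a] * row[b] for row in design) for b in range(k)] for a in range(k)]
    xty = [sum(row[a] * t for row, t in zip(design, y)) for a in range(k)]
    beta = solve(xtx, xty)
    return sum((t - sum(w * v for w, v in zip(beta, row))) ** 2
               for row, t in zip(design, y))


def bic(columns, y):
    """Gaussian BIC up to an additive constant: n*ln(RSS/n) + k*ln(n)."""
    n = len(y)
    return n * math.log(rss(columns, y) / n) + (len(columns) + 1) * math.log(n)


def forward_select(features, y):
    """Greedily add the feature that lowers BIC the most; stop when none does."""
    chosen, best = [], bic([], y)
    while True:
        candidates = [name for name in features if name not in chosen]
        if not candidates:
            return chosen
        scored = {name: bic([features[c] for c in chosen + [name]], y)
                  for name in candidates}
        winner = min(scored, key=scored.get)
        if scored[winner] >= best:
            return chosen
        chosen.append(winner)
        best = scored[winner]


rng = random.Random(1)
n = 200
features = {name: [rng.uniform(0, 7) for _ in range(n)]
            for name in ("x1", "x2", "noise1", "noise2", "noise3")}
y = [3 * a - 2 * b + rng.gauss(0, 1) for a, b in zip(features["x1"], features["x2"])]
chosen = forward_select(features, y)
```

Backward selection follows the same pattern, starting from the full set of features and removing the feature whose deletion lowers the criterion the most.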

Lastly, embedded methods are similar to wrapper methods in that an algorithm iteratively learns to select the features that best contribute to the accuracy of the overall solution. They include interactions between features in generating the model, which typically makes them superior to filter methods for prediction, and less likely to overfit the data than wrapper methods. Although these methods are beyond the scope of the present study, embedded approaches have shown promise in other recent studies that have focused on the analysis of large datasets with multiple variable interactions [20].

The goal of the present study was to quantitatively compare the performance of different linear (i.e., those commonly used in the field of gerontology) and non-linear selection methods for the identification of relevant features within two main survey databases (i.e., the Wisconsin Longitudinal Study of Aging database and the Health and Retirement Study), with applications to non-linear associations in data. To do this, we compared the performance of several linear methods (regression) widely used in gerontology against non-linear (filter) feature selection methods using two main survey databases (WLS [2] and HRS [1]). Note here that by "linear feature selection methods" we mean methods that assume a linear functional relationship between features and target variables, while "non-linear methods" do not make this assumption. In order to validate our results, we further tested those methods using synthetic datasets. Although we did not expect linear and non-linear methods to differ in their ability to identify linearly dependent features, we did hypothesize that non-linear methods would be superior at identifying non-linearly dependent features. As a result, non-linear selection approaches may offer a more robust tool for feature identification, classification, prediction and machine learning applications for gerontology researchers.

Methods

The performance of a given statistical method depends on the underlying data to be analyzed. Therefore, an important preliminary step is to understand the properties of the data before commencing any analysis [21]. Here, we are interested in the extraction of relevant features from large social science datasets, which consist primarily of questionnaires filled out by respondents, their proxies or reviewers [22]. To make a questionnaire simpler for respondents, questions are routinely presented in multiple choice formats, which map continuous variables into discrete categories, with the number of categories typically ranging from three to seven. Respondents are only occasionally asked to provide an exact number in answer to a given question, and as a result the risk of erroneously splitting a response into categories is believed to be relatively high. For example, a respondent performing an activity five times per week may report it as either “daily” or “several times a week”.

Given this, it was important to understand how the various feature selection approaches (see next section) performed under these analysis conditions. Important parameters here include the level of noise obscuring the relationship between variables, the number of samples available for analysis, and the effects of discrete versus continuous variable representations. We therefore assessed performance in two ways. First, we constructed a series of synthetic datasets that mimic the noisy and non-linear nature of many survey datasets. Because the associations between variables were known in advance, we would be able to quantitatively gauge the performance of the different selection methods in identifying relevant features and discarding irrelevant ones (see ‘Synthetic Data’, below). Second, we further gauged the performance of the different selection methods using two representative datasets, the Wisconsin Longitudinal Study[23], and Health and Retirement Study[1]. Here, we relied on a priori knowledge to assess each method’s ability to identify previously-established dependencies between the variables within the dataset—namely, the effect of certain lifestyle activities on overall health (see ‘Representative Data’, below).

For both the synthetic and representative datasets, each feature was identified as either important or unimportant by each feature selection method. For linear methods, we assumed that a selected feature was important if the corresponding coefficient was not equal to zero at a .05 significance level. For the filter methods, we assumed that a selected feature was important if the feature and target variable were not independent at a .05 significance level. Finally, the performance of each selection method was computed using F₁ scores, which represent the harmonic mean of the precision and sensitivity (recall) of each selection method; as selection performance increases the F₁ score approaches 1, and as selection performance decreases the F₁ score approaches 0. To estimate the statistical significance of the difference between F₁ scores of different methods, we followed the methodology described in [24]. This method tests the null hypothesis that the results of two techniques do not really differ; thus, the responses produced by one of the techniques could have just as likely come from the other. We therefore shuffled the responses produced by one of the methods (but not the other), re-computed the F₁ score, and determined the likelihood that this shuffling procedure would create an F₁ score at least as large as the F₁ score derived from the original, unshuffled comparison.
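The scoring step can be illustrated with a short sketch. The F₁ computation follows directly from the definition above. The significance test shown here is one common approximate randomization variant, in which the two methods' decisions on each feature are swapped at random; it is an assumption made for illustration and may differ in detail from the procedure in [24]. All function and variable names are hypothetical.

```python
import random


def f1_score(selected, relevant):
    """F1 = harmonic mean of precision and recall over sets of features."""
    true_pos = len(selected & relevant)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(selected)
    recall = true_pos / len(relevant)
    return 2 * precision * recall / (precision + recall)


def randomization_p_value(flags_a, flags_b, truth, shuffles=2000, seed=0):
    """Approximate randomization test for the difference between two F1 scores.

    Each argument is a list of booleans with one entry per feature: whether
    method A selected it, whether method B selected it, whether it is relevant.
    """
    relevant = {i for i, flag in enumerate(truth) if flag}

    def f1(flags):
        return f1_score({i for i, flag in enumerate(flags) if flag}, relevant)

    observed = abs(f1(flags_a) - f1(flags_b))
    rng = random.Random(seed)
    at_least_as_large = 0
    for _ in range(shuffles):
        a, b = list(flags_a), list(flags_b)
        for i in range(len(a)):
            if rng.random() < 0.5:  # swap the two methods' decisions on feature i
                a[i], b[i] = b[i], a[i]
        if abs(f1(a) - f1(b)) >= observed:
            at_least_as_large += 1
    return (at_least_as_large + 1) / (shuffles + 1)


# Two of three relevant features found, plus one false positive.
example_f1 = f1_score({"x1", "x2", "n7"}, {"x1", "x2", "x3"})
```

In the example, precision and recall are both 2/3, so the F₁ score is also 2/3.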

Feature selection methods

Eight common linear selection methods were used. These included Ordinary Least Squares (OLS), a method of estimating parameters in linear regression [25]; two stepwise (wrapper-based) regression approaches, Forward (FLS) and Backward (BLS) selection, each with three different criteria [26]; and LASSO regression (LASSO) [27]. Forward selection involves starting with a model with zero variables and iteratively adding a new variable; if the variable results in a significant improvement in fit then it is included in the model. Backward selection is conceptually similar, but starts with all variables in the model and iteratively removes variables. We used three criteria frequently used in both backward and forward feature selection, namely Mallows' C_p criterion (BLS C_p and FLS C_p respectively) [28], adjusted R² (BLS R² and FLS R²) [29], and the Bayesian Information Criterion (BLS B and FLS B) [30]. Collectively, these selection methods address the problem of overfitting, and account for the number of explanatory variables relative to the number of data points in the model. The selected features are those included in the best model, which is in turn determined by the corresponding criterion. According to Mallows' C_p criterion, the best model is the simplest model for which the criterion's value is approximately equal to the number of features [28]. When using adjusted R², the model selected is the one that corresponds to the maximum adjusted R² value [29]. In contrast, when the Bayesian Information Criterion is employed, the model with the lowest criterion value is preferred [30].
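For reference, the three criteria can be written in terms of the residual sum of squares (RSS) of a candidate model. The sketch below uses the standard textbook forms, with p the number of predictors excluding the intercept; it is not taken from the software used in this study, and the candidate RSS values are made-up numbers chosen only to show how each criterion ranks models. Note that under this convention a well-fitting model has C_p close to p + 1, whereas other conventions count the intercept differently.

```python
import math


def adjusted_r2(rss, tss, n, p):
    """Adjusted R^2 for a model with p predictors plus an intercept."""
    return 1.0 - (rss / (n - p - 1)) / (tss / (n - 1))


def mallows_cp(rss, sigma2_full, n, p):
    """Mallows' C_p; sigma2_full is the error variance estimated from the full model."""
    return rss / sigma2_full - n + 2 * (p + 1)


def bic(rss, n, p):
    """Gaussian BIC up to an additive constant."""
    return n * math.log(rss / n) + (p + 1) * math.log(n)


# Hypothetical nested candidates: (label, RSS, number of predictors).
n, tss = 100, 500.0
candidates = [("1 feature", 300.0, 1), ("2 features", 200.0, 2), ("3 features", 198.0, 3)]
sigma2_full = 198.0 / (n - 3 - 1)  # residual variance of the largest model

best_by_r2 = max(candidates, key=lambda m: adjusted_r2(m[1], tss, n, m[2]))[0]
best_by_bic = min(candidates, key=lambda m: bic(m[1], n, m[2]))[0]
cp_values = {label: mallows_cp(r, sigma2_full, n, p) for label, r, p in candidates}
```

With these numbers, adjusted R² and BIC both prefer the two-feature model, and its C_p (about 3) is close to p + 1, while the one-feature model has a C_p far above its parameter count.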

We also included the least absolute shrinkage and selection operator (LASSO) over linear regression, which performs shrinking and variable selection simultaneously. The tuning parameter that controls the shrinking was chosen by 10-fold cross validation performed by the built-in cv.glmnet function from the R package "glmnet" [31].

The performance of these linear selection methods was contrasted against three non-linear methods. Three filter-based methods were tested: distance correlation (DC), the Hilbert-Schmidt Independence Criterion (HS), and Hoeffding’s test (HT) of independence. Here we included all features that were not statistically independent from the target variable at a .05 significance level. Further information on each selection method is as follows.

Distance correlation (DC) [32, 33] is a universal approach to check whether two variables are related, not necessarily linearly. It equals zero when the two variables are statistically independent, and equals one if one variable is a linear function of the other. To test for independence, we used permutation bootstrap with ≈500 replicates, implemented in the R package “energy” [34].
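To make the statistic concrete, the following sketch computes the sample distance correlation of two one-dimensional variables from double-centred pairwise distance matrices, together with a permutation p-value. It is a plain quadratic-time illustration written for this text, not the “energy” implementation; the function names, the 199 replicates and the toy data are assumptions.

```python
import math
import random


def centered_distances(values):
    """Pairwise |v_i - v_j| matrix, double-centred by row, column and grand means."""
    n = len(values)
    dist = [[abs(a - b) for b in values] for a in values]
    row_mean = [sum(row) / n for row in dist]  # equals the column means by symmetry
    grand = sum(row_mean) / n
    return [[dist[i][j] - row_mean[i] - row_mean[j] + grand for j in range(n)]
            for i in range(n)]


def dcov2(a, b, order=None):
    """Squared sample distance covariance; `order` optionally permutes b's sample."""
    n = len(a)
    idx = list(range(n)) if order is None else order
    return sum(a[i][j] * b[idx[i]][idx[j]] for i in range(n) for j in range(n)) / (n * n)


def distance_correlation(x, y):
    a, b = centered_distances(x), centered_distances(y)
    denom = math.sqrt(dcov2(a, a) * dcov2(b, b))
    return math.sqrt(max(dcov2(a, b), 0.0) / denom) if denom > 0 else 0.0


def dcor_permutation_p(x, y, replicates=199, seed=0):
    """Permutation p-value: shuffling y breaks any dependence on x."""
    a, b = centered_distances(x), centered_distances(y)
    observed = dcov2(a, b)
    rng = random.Random(seed)
    order = list(range(len(x)))
    hits = 0
    for _ in range(replicates):
        rng.shuffle(order)
        if dcov2(a, b, order) >= observed:
            hits += 1
    return (hits + 1) / (replicates + 1)


rng = random.Random(1)
x = [rng.uniform(-1, 1) for _ in range(100)]
y_quadratic = [v * v + rng.gauss(0, 0.05) for v in x]  # non-linear dependence
p_quadratic = dcor_permutation_p(x, y_quadratic)
dcor_linear = distance_correlation(x, [2 * v + 1 for v in x])
```

The quadratic relationship, which a Pearson correlation would largely miss on a symmetric range, is flagged as dependent, and an exact linear function yields a distance correlation of one.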

The Hilbert-Schmidt Independence Criterion (HS) is a non-parametric measure of dependence based on the eigenspectrum of covariance operators in reproducing kernel Hilbert spaces [35]. The corresponding mapping of the two variables is a function that equals zero when the variables are independent and is high when they are dependent. To test for independence, we used permutation bootstrap with ≈500 replicates [36], implemented in the R package “dHSIC” [37], and a Gaussian kernel. It is possible to tune the bandwidth parameter of the kernel to better identify different types of dependencies. For simplicity, we used a bandwidth parameter equal to one throughout this study.
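A minimal sketch of this statistic for two one-dimensional variables is given below. It uses the biased estimator trace(KHLH)/n², where K and L are Gaussian kernel matrices and H is the centring matrix, with a permutation p-value. The kernel parameterisation exp(−(x − x′)²/(2·bandwidth²)), the function names and the toy data are assumptions made for illustration; the “dHSIC” package may differ in its details.

```python
import math
import random


def centered_gram(values, bandwidth=1.0):
    """Gaussian kernel matrix of a 1-D sample, centred as H K H."""
    n = len(values)
    gram = [[math.exp(-((a - b) ** 2) / (2 * bandwidth ** 2)) for b in values]
            for a in values]
    row_mean = [sum(row) / n for row in gram]  # equals the column means by symmetry
    grand = sum(row_mean) / n
    return [[gram[i][j] - row_mean[i] - row_mean[j] + grand for j in range(n)]
            for i in range(n)]


def hsic_from_grams(k, l, order=None):
    """Biased HSIC estimate; `order` optionally permutes the second sample."""
    n = len(k)
    idx = list(range(n)) if order is None else order
    return sum(k[i][j] * l[idx[i]][idx[j]] for i in range(n) for j in range(n)) / (n * n)


def hsic_permutation_test(x, y, bandwidth=1.0, replicates=199, seed=0):
    """Return the HSIC statistic and a permutation p-value for independence."""
    k, l = centered_gram(x, bandwidth), centered_gram(y, bandwidth)
    observed = hsic_from_grams(k, l)
    rng = random.Random(seed)
    order = list(range(len(x)))
    hits = 0
    for _ in range(replicates):
        rng.shuffle(order)
        if hsic_from_grams(k, l, order) >= observed:
            hits += 1
    return observed, (hits + 1) / (replicates + 1)


rng = random.Random(2)
x = [rng.uniform(-2, 2) for _ in range(100)]
y = [v * v + rng.gauss(0, 0.2) for v in x]  # non-linear dependence on x
statistic, p_value = hsic_permutation_test(x, y, bandwidth=1.0)
```

A fixed bandwidth of one is sensitive to the scale of the variables, which is why tuning it can help to identify different types of dependence.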

Hoeffding’s test (HT) of independence is a non-parametric population-based test for statistical independence[38]. The test statistic depends on the rank order of the observations, with the P-values approximated by the linear interpolation table in Hollander and Wolfe [39]. We used Hoeffding’s test implementation in R package “Hmisc”[40].

Synthetic data

To test the ability of feature selection methods to identify relevant variables, we constructed synthetic datasets wherein a set of predictor variables were associated with a target variable, known a priori. The rest of the features within a given dataset were random. This can be described formally as follows: X^N = {X^N_1, …, X^N_K} are random variables and X^R = {X^R_1, …, X^R_R} are predictors with a known association with the continuous response Y. The goal is to identify which variables out of the set X = X^N∪X^R are identified as relevant for Y by different feature selection methods.

To do this, a target variable, y, was created by generating N random numbers from a uniform distribution, y~U[−15,15]. Sine, cosine and quadratic functions were used to model the relationships between the target and predictor variables. Specifically, we identified each function’s parameters so that the corresponding predictor variable ranged from zero to x_max, where x_max is a whole number, either 4 or 7. We then solved the inverse problem of finding the x_i corresponding to each y_i, x_i = f⁻¹(y_i). The values obtained from these functional relationships were rounded to the nearest whole number for discrete independent variables. We also set all negative values to 0, and all values exceeding x_max to x_max. We used the following combinations for {a,x_max} = [{0.1,7},{0.5,4}] for discrete variables, and {a,x_max} = {0.8,7} for continuous variables. Here p₁ and p₂ are random numbers, p₁ ∈ {0,1}, p₂ ∈ {0,1}.

We also used a second family of relationships, parameterized by {h,b,c,x_max}, whose values we likewise rounded to the nearest whole number for discrete independent variables. Again, all negative values were set to zero, and all values exceeding x_max were set to x_max. The parameters used for the discrete variables were {h,b,c,x_max} = [{5,−20,10,4},{−9,23,−5,4},{−9/7,73/7,−10,7}], and the parameters used for the continuous variables were {h,b,c,x_max} = [{5,−20,−10,4},{−10,60,−80,4}]. For datasets containing only discrete variables, the continuous variables were rounded to the nearest whole number.

Altogether, we constructed R = 9 variables with a known association with a dependent variable. If more than one solution existed, x_i was taken randomly, with equal probability, out of all the outcomes. Finally, we added uniform noise and rounded the resulting value to the closest integer in the corresponding range. We then added K = 80 random features. These random features were defined as follows. First, the range was defined such that each variable was between zero and x_max, where x_max is a random whole number between four and ten, with equal probabilities. Then, the feature was filled with a random whole number, with equal probability, between zero and x_max for discrete random features, and uniformly distributed for continuous random features.
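The construction of the random features and the noise step can be sketched as follows. This is an illustration only: the function names are hypothetical, and the assumption that the uniform noise is drawn from [−noise, +noise] is ours, since the text specifies only the noise levels. The split into 70 discrete and 10 continuous random features mirrors the experimental sets described in this section.

```python
import random


def random_feature(n, rng, discrete=True):
    """One irrelevant feature of length n with a randomly chosen range."""
    x_max = rng.randint(4, 10)  # whole number between four and ten, equal probability
    if discrete:
        return [rng.randint(0, x_max) for _ in range(n)]
    return [rng.uniform(0, x_max) for _ in range(n)]


def add_noise_and_round(values, noise, x_max, rng):
    """Add uniform noise, round to the closest integer, and clamp to [0, x_max].

    Assumption: the noise is uniform on [-noise, +noise].
    """
    noisy = (v + rng.uniform(-noise, noise) for v in values)
    return [min(max(round(v), 0), x_max) for v in noisy]


rng = random.Random(0)
n_samples = 500
# K = 80 random features: the first 70 discrete, the last 10 continuous.
random_block = [random_feature(n_samples, rng, discrete=(k < 70)) for k in range(80)]
noisy_predictor = add_noise_and_round([0, 1, 2, 3, 4, 4, 0], noise=1.0, x_max=4, rng=rng)
```

Python's built-in round resolves exact halves to the nearest even integer; this matters only for values that land exactly on .5 after the noise is added.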

We ran two sets of experiments. In the first set, we constructed only discrete variables, in which each and every x ∈ X was a whole number. In the other set, five variables were discrete, and four variables were continuous. Both sets also included 70 discrete random variables and 10 continuous random variables.

Each X_i ∈ X is a vector of length N. We investigated cases where N = 500, 750, 1000, 1250 and noise equal to .5 and 1.

Because we have a priori knowledge about whether each feature X_i ∈ X is related to Y, we can compare different feature selection methods. Thus, for each combination of N and noise we generated 200 synthetic datasets, applied a given method, and then investigated whether each feature was correctly identified as important or unimportant. We then computed F₁ scores, and checked whether the difference between the F₁ scores of different methods was significant.

Note that we needed non-linear relationships without a clear trend, and the selected relationships fulfill this purpose. At the same time, we are unable to mimic all potential relationships with a synthetic dataset, so we used representative databases to compare linear and non-linear methods on real data.

Representative data

The goal of our study was to test the selection performance of each method under typical usage conditions, in which researchers would attempt to identify relevant features within a large dataset. To do this, we used data from two longitudinal studies on aging in the USA: the Wisconsin Longitudinal Study of Aging database (WLS) [2] and the Health and Retirement Study (HRS) [1]. WLS is a long-term study of Wisconsin high school graduates of 1957, whose health has been tracked longitudinally, via multiple-choice surveys and interviews, for over 50 years. HRS is a longitudinal study on healthy retirement and aging, with the data collected through interviews and surveys.

WLS dataset.

We extracted health information along with several lifestyle activities from the WLS [2], as reported in the module Computer-Assisted Personal Interviewing (CAPI) 2011 wave, Mail: Internet module and Mail: Social and Civic Participation, available at wls_pub_13_04.sas7bdat. The target variable, health change, was computed as the difference in HUIM3 health index, a rating scale targeted at measuring general health, between 2011 and 2004. In total, this representative dataset contained 3,028 respondents with 52 independent variables apiece. We aimed to assess the performance of each method at identifying factors that are already known to influence health in old age. To that end, we identified six variables as potentially relevant to this health indicator based on prior research [41–43]: education level (equivalent years of regular education attained by 2011, denoted as “education”), alcohol use (number of alcohol symptoms, denoted as “alcohol”), tobacco use (including number of packs of cigarettes smoked per day, age of last cigarette smoked and number of years of regular smoking, denoted respectively as “tobacco”, “tobacco1” and “tobacco2”), and the respondent’s previous health score in 2004 (HUIM3 health index, denoted as “health”). We excluded 687 respondents who were missing data for any of these previously listed factors. Missing data were imputed with the median for all other lifestyle activities; in all cases, the missing/imputed data amounted to less than 15% of the data per activity.
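The exclusion and imputation steps described above can be sketched as follows. The record layout and field names are hypothetical, and computing each median after the exclusion step (rather than before it) is an assumption, since the order is not stated in the text.

```python
import statistics


def prepare(rows, required, optional):
    """Drop rows missing any required field, then median-impute optional fields.

    Rows are dictionaries in which a missing value is stored as None. The input
    rows are copied, so the original records are left untouched.
    """
    complete = [dict(row) for row in rows
                if all(row.get(key) is not None for key in required)]
    for key in optional:
        observed = [row[key] for row in complete if row.get(key) is not None]
        median = statistics.median(observed)
        for row in complete:
            if row.get(key) is None:
                row[key] = median
    return complete


# Hypothetical records: "education" and "health" must be present,
# "internet_hours" stands in for any other lifestyle activity.
rows = [
    {"education": 12, "health": 0.9, "internet_hours": 3},
    {"education": None, "health": 0.8, "internet_hours": 5},
    {"education": 16, "health": 0.7, "internet_hours": None},
    {"education": 14, "health": 0.6, "internet_hours": 7},
]
clean = prepare(rows, required=["education", "health"], optional=["internet_hours"])
```

Here the second record is excluded because a required factor is missing, and the missing activity value in the third record is replaced by the median of the remaining observed values.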

Data analysis was done on this dataset in three steps. First, we tested each method for feature selection on the complete set of data with all 51 independent variables, which represents analysis conditions wherein researchers do not have strong a priori knowledge to manually reduce a dataset. Second, we again tested each method for feature selection but on a smaller subset of data containing the six variables described above (“alcohol”, “education”, “tobacco”, “tobacco1”, “tobacco2”, “health”). This analysis was repeated on a smaller, third dataset that did not include the “health” variable. We then compared the performance of each selection method against its own performance on the smaller dataset in order to determine the influence of other variables on the feature selection performance of each method.

HRS dataset.

The HRS [1] dataset was used to target variables that preserve cognitive health in aging. Here we included individuals of about the same age as in the WLS [2] dataset, between 70 and 74 years old. Accordingly, we extracted cognitive health information along with life-activities data from the HRS, as reported in the modules Preload, Physical Health, Leave-behind questionnaires, and Cognition of the 2014 wave, and health-related variables from the module Physical Health of the 2008 wave. The target variable, cognitive health change, was computed as the difference between the total number of words correctly remembered by the respondents during the immediate and delayed recalls in 2014 and 2008.

In total, our second representative dataset contained information on 900 respondents characterized by 80 independent variables. Similarly to the WLS [2] case, we aimed to assess the performance of each method at identifying factors that are known to influence cognitive health in old age. Based on prior research, we identified four variables as potentially relevant to this health indicator [44–46]: education level (equivalent years of regular education attained by 2014, denoted as “education”), total alcohol use (total amount of alcohol consumed per week, denoted as “alcohol”), smoking (total number of cigarettes consumed per day, denoted as “smoking”), and level of physical activity (“physical activity”). We excluded 682 respondents who had missing data for any of these previously listed factors. We then imputed missing data with medians for all other variables; in all cases, the missing/imputed data amounted to less than 20% of the data per variable.