443
Views
0
CrossRef citations to date
0
Altmetric
Research Article

Wavelet Feature Screening

ORCID Icon, ORCID Icon & ORCID Icon
Received 23 Oct 2023, Accepted 09 Apr 2024, Published online: 20 May 2024

Abstract

An initial screening of which covariates are relevant is a common practice in high-dimensional regression models. The classic feature screening selects only a subset of covariates correlated with the response variable. However, many important features might have a relevant albeit highly nonlinear relation with the response. One screening approach that handles nonlinearity is to compute the correlation between the response and nonparametric functions of each covariate. Wavelets are powerful tools for nonparametric and functional data analysis but are still seldom used in the feature screening literature. We propose a wavelet feature screening method that can be easily implemented, and we prove that, under suitable conditions, it captures the true covariates with high probability. Simulation results also show that our approach outperforms other screening methods in highly nonlinear models. We apply feature screening to two datasets about ozone concentration and epilepsy. In both applications, the proposed method selects features that match findings in the literature of their respective research fields, illustrating the applicability of feature screening. Supplementary material for this article is available online.

1 Introduction

The number of explanatory variables is commonly much larger than the sample size in modern data science. In the current era of big data, many applications in biology have thousands of covariates, reaching even hundreds of thousands in genetic studies (Fan et al. Citation2020, sec. 1.1). A key objective in applications is to use all data to determine which features/covariates are associated with characteristics of interest, that is, the response variables. Some classic methods to deal with this type of problem where pn are sub-sampling covariates, ridge regression, and Lasso, for example (Hastie, Tibshirani, and Friedman Citation2009). However, if p is substantially larger than n, the performance of many variable selection methods may also be compromised (Fan and Lv Citation2008). In such scenarios, practitioners often do an initial screening to discard the covariates most likely irrelevant to the analysis. This approach works as an initial step in supervised learning. It reduces the problem’s dimension to only pnp covariates and facilitates the learning step of a regression or classification model.

Feature screening consists of computing a marginal utility measure for each feature and selecting the ones with the highest scores. The most popular utility measure is the Pearson correlation between the response variable and each covariate. Fan and Lv (Citation2008) studied this approach and derived its theoretical guarantees. Since Pearson correlation is tailored to linear relationships, further works proposed using nonparametric methods to capture nonlinear relations. Hall and Miller (Citation2009) proposed using a generalized correlation, where the utility measure is the Pearson correlation of the response variable and an estimated smooth function of the covariate. Likewise, Fan, Feng, and Song (Citation2011) proposed to fit regression splines of each feature to the response and use the norm of the fitted function as the utility score. Many other screening methods have been proposed since then; to name a few examples, we cite the following: model-free screening (Zhu et al. Citation2011; Xue and Liang Citation2017), screening based on distance-correlations (Li, Zhong, and Zhu Citation2012), screening for categorical response (Mai and Zou Citation2013), conditional marginal screening (Barut, Fan, and Verhasselt Citation2016), and screening for dependent data (Yousuf Citation2018). Fan et al. (Citation2020, chap. 8) provides a thorough literature review and further details about feature screening. Our article focuses on nonparametric utility measures (Hall and Miller Citation2009).

1.1 Generalized Correlation

Let Y be a response variable of interest and (X1, , Xp)Ta set of covariates to be used as proxies to Y. Assume we have a sample {(Xi.,Yi)}i=1n where Xi.=(Xi1,,Xip) and pn. We aim to reduce the data’s dimension by selecting a subset of covariates relevant to the study. Let X=(X1,,Xp) be the design matrix, where Xj=(X1j,,Xnj)T denotes the observations from the jth covariate.

The standard screening approach uses the Pearson correlation between Y and Xj as a utility measure. This approach is widely used by practitioners in all fields that require data analysis and counts with a solid theoretical foundation (Fan and Lv Citation2008). These two aspects explain the popularity of this screening approach. However, the Pearson correlation does not capture covariates with a highly nonlinear relation with the response variable. Therefore, Hall and Miller (Citation2009) proposes to tackle this problem by estimating the functional relations between Y and each Xj separately. They compute a generalized correlation as follows ρ(Xj,Y)=supfjHcov{fj(Xj),Y}var{fj(Xj)}var (Y),j=1,,p, where H is some functional space and fjH is the function that fits Xj to Y. When H consists of the space of linear functions, ρ(Xj,Y) becomes the Pearson correlation. Hall and Miller (Citation2009) uses cubic splines basis to represent fj . Given the sample {(Xi.,Yi)}i=1n, the generalized correlation can be estimated by ρ̂(Xj,Y)=supfjHi=1n{fj(Xij)f¯j}{YiY¯}i=1n{fj(Xij)f¯j}2i=1n{YiY¯}2,j=1,,p, where f¯j=i=1nfj(Xij)/n and Y¯=i=1nYi/n. The idea is to use ρ̂(Xj,Y) to measure the importance of Xj for fitting Y. The screening approach selects only pnp covariates with the highest ρ̂(Xj,Y) in absolute value.

The same screening results can be achieved using |||f̂j||||fj||| as utility measure, where ||·|| denotes some norm. For example, consider the case of Pearson correlation and let X and Y be two vectors with zero-mean and unit norm. Since ||XY||2=2X,Y, by maximizing the correlation X,Y we also minimize |||X||||Y|||2||XY||2, where the latter follows from the triangular inequality. Therefore, we will focus on the distance |||f̂j||||fj||| to devise a variable selection scheme.

1.2 Proposed Method

We use wavelet methods to estimate the function fj that links Y to Xj , j=1,,p. Wavelets form a basis for the space of square-integrable functions, similarly to cosines and sines in Fourier analysis. The main advantages of wavelets are their local adaptivity and regularization properties, making them a powerful tool to represent functional data (Morettin, Pinheiro, and Vidakovic Citation2017). Other properties are that wavelet coefficients can be quickly computed for equally spaced sampled observations (regular design), and such coefficients characterize quite general functional spaces, such as Hölder and Besov spaces, for example (Vidakovic Citation1999, p. 87). One challenge in our problem is that multiple covariates are generally irregularly spaced, a design for which often-used wavelet methods are unsuitable.

Algorithm 1

Warped wavelet screening

Input: response variable YRn, features X1,,XpRn, threshold ν>0, wavelet basis.

Output: Indices of screened features M̂{1,,p}.

1: for j=1,,p do

2: Sort (Yπ(1),Xj,π(1)),,(Yπ(n),Xj,π(n)) such that Xj,π(1)Xj,π(n), where {π(1),,π(n)} denotes a permutation of {1,,n}.

3: Apply the discrete wavelet transform to (Yπ(1),,Yπ(n)) using the chosen wavelet basis, obtaining wavelet coefficients {β̂k}.

4: Apply thresholding on the wavelet coefficients, obtaining {β˜k}.

5: Compute wj={kβ˜k2}1/2 using the thresholded coefficients.

6: end for

7: Set M̂={j;1jp,wj>ν}.

8: return M̂

Our main contribution is a feature screening method that uses warped wavelets, which were introduced by Kerkyacharian and Picard (Citation2004) to deal with random covariates. Our screening idea is described in Algorithm 1. It can be easily applied within any statistical software that computes wavelet transforms (e.g., R, Python, Matlab). The paradigm of wavelet feature screening has been suggested previously (Fan, Feng, and Song Citation2011, p. 552). The closest work to ours is the paper by Zhao, Chen, and Ogden (Citation2015), who proposes a screening step for a wavelet-based Lasso estimator of a linear regression model. However, this work assumes that predictors are sampled at equally spaced points, a condition hardly satisfied in practice. The current article is the first to propose a wavelet algorithm that handles irregular designs and to study its properties. The main component of our algorithm is the application of warped wavelets, whose details are presented in Section 2. Warped wavelet regression estimators adapt well to features with irregular sampling designs (Morettin and Porto Citation2022), which allows us to apply wavelets to run feature screening. Theoretical properties of the proposed screening approach are presented in Section 3, where we prove that Algorithm 1 can capture the true covariates with high probability.

Feature screening methods are commonly tested with simulations of scenarios where the link function fj is either smooth or has very few discontinuities. Differently, Section 4 presents simulations of challenging scenarios with highly nonlinear examples of fj . Our results show that model-free screening methods do not perform well when fj is very irregular, but Algorithm 1 can still detect true features in many such cases. The wavelet approach also outperforms splines for scenarios where fj contains local traits like peaks. This analysis illustrates the well-known advantage of wavelets for regression problems where the link function is irregular, containing spikes or discontinuities (Reiss et al. Citation2017, sec. 2.1).

Another contribution of this article is the application of feature screening to two scientific problems. The first uses ozone concentration data in the largest city in Brazil. Ozone is a pollutant related to various health problems; thus, monitoring which atmospheric features are related to ozone concentration is of practical interest (see Section 5). The second problem uses data of a study about temporal lobe epilepsy. These data were collected to understand how brain white matter measurements are related to brain degeneracy in epileptic patients (see Section 6). In both applications, we show that feature screening can detect covariates previously reported as important factors in their respective fields. The supplementary material contains an application with the cardiomyopathy data (Segal, Dahlquist, and Conklin Citation2003), a high-dimension example commonly used in the feature screening literature. Some final remarks in Section 7 close the article.

Notation: We use log for the natural logarithm. Vectors and matrices are represented by bold letters. The ith row and jth column of a matrix XRn×p are denoted as Xi.Rp and XjRn, respectively. The Euclidean norm of a vector xRn is denoted as ||x||=(i=1nxi2)1/2. For a random function f:RR, we define the L2 and L-norms in terms of the expected values fL22=E {|f(x)|2dx} and fL=E {maxxR|f(x)|}, respectively. These norms reduce to their usual definitions when f is not random.

2 Warped Wavelets

Kerkyacharian and Picard (Citation2004) introduces warped wavelets to allow randomly sampled covariates in wavelet regression estimators. We employ this method to perform feature screening, separately analyzing each pair (Xj,Y). The data for the jth covariate is {(Xij,Yi)}i=1n, j=1,,p. We fit the following model for each covariate: (1) Yi=fj(Xij)+ϵij,i=1,,n,(1) where ϵ1j,,ϵnjiidN(0,σ2), with independent Xij and ϵij . We assume that fjL2([a,b]), and one of our goals is to estimate the coefficients of its representation in a wavelet basis. For convenience, let us focus on the case where σ2=1, a = 0 and b = 1.

A common assumption in wavelet regression is that Xij=i/n, that is, the covariates are equally spaced in the support of fj . This assumption can be relaxed with warped wavelets, which allow Xij to be irregularly spaced on the interval [0,1]. Assume that Xij follows some continuous distribution with density gj and distribution function Gj . Without loss of generality, consider that X1j<X2j<<Xnj and denote their empirical distribution function by Ĝj. The idea of warped wavelets is to recover the usual regular design using the fact that Ĝj(Xij)=i/n. Plugging Xij=Ĝj1(Ĝj(Xij)) in (1), we have Yi=fj(Xij)+ϵij=fj(Ĝj1(Ĝj(Xij)))+ϵij=fj(Ĝj1(i/n))+ϵij, which resumes the classical equally spaced design when looking at the function composition fj(Ĝj1(·)). We shall focus on the estimators that take Gj as known to avoid some technicalities. Estimators that use the empirical distribution function Ĝj are also proposed by Kerkyacharian and Picard (Citation2004) and have similar properties to the estimator computed assuming Gj known. Kerkyacharian and Picard (Citation2004) show that both approaches attain (up to logarithmic factors) the minimax convergence rate from regular designs.

Another idea of warped wavelets is to take the wavelet decomposition of fj(Gj1(·)) instead of fj(·). Consider a wavelet basis {ψlk;l1,kZ} where ψ1,k is a scaling function (Härdle et al. Citation1998, Equationeq. 3.4). Expanding fj(Gj1(.)) in this basis we get fj(Gj1(y))=l1kZβlkψlk(y),or equivalentlyfj(x)=l1kZβlkψlk(Gj(x)), where the wavelet coefficients are given by (2) βlk=01ψlk(Gj(x))fj(x)gj(x)dx.(2)

Here we omit the index j from βlk , but later we shall write βjlk to let it clear it is a coefficient from fj . EquationEquation (2) reduces to the usual wavelet coefficient when gj(x) is the uniform density. In general, the properties of the warped wavelet transform depend on weights defined by gj(x) and how far they are from uniform weights. We shall assume that gj is bounded as follows:

  • C1There is a finite constant C > 0 such that C1<gj(x)<C for all x[0,1] and all j{1,,p}.

Under C1, the density gj in (4) defines a weighting scheme that belongs to a class known as Muckenhhoupt weights (Kerkyacharian and Picard Citation2004, sec. 4.1). This condition is sufficient to guarantee good convergence rates for warped wavelet estimators. Moreover, C1 has already been used for nonparametric feature screening (Fan, Feng, and Song Citation2011, condition B). We also assume that the following conditions hold:

  • C2The scaling function ψ1,k and wavelet functions ψlk , l0, have compact support.

  • C3The functions fj:[0,1]R, j=1,,p, are periodic and there is an absolute constant C > 0 such that ||fj||L<C for all j.

These are common assumptions for warped wavelets (Porto et al. Citation2016; Gómez, Porto, and Morettin Citation2021). C2 allows us to write the wavelet transform as finite sums, and it is satisfied by Daubechies wavelets, for example (Härdle et al. Citation1998, p. 127). The periodicity assumed in C3 permits us to apply the wavelet transform for functions with support in [0,1]. See (Vidakovic Citation1999, sec. 5.6) for other transforms of signals in an interval. This assumption also justifies periodization heuristics frequently used in practice to deal with boundary handling and sample sizes that are not a power of two (Ogden Citation1997). The assumption that fj is bounded is technical and is used later in Proposition 1.

We consider the warped wavelet estimator given by (3) f̂j(x)=l=1Jk=02l1β̂lkI{|β̂lk|κtn}ψlk(Gj(x)),tn=(lognn)1/2,(3)

where κ>0 is a tuning parameter and J satisfies 2JCtn1, for some absolute constant C > 0. The empirical coefficients are (4) β̂lk=1ni=1nψlk(Gj(Xij))Yi.(4)

Note that f̂j(x) in (3) is computed via hard thresholding. Other thresholding approaches can also be applied (Vidakovic Citation1999, chap. 6). See Fan and Li (Citation2001) for a detailed analysis of different thresholding options, such as SCAD and soft thresholding. The following result, due to (Kerkyacharian and Picard Citation2004, Prop. 3), will be useful to deduce theoretical guarantees for the warped wavelet feature screening.

Proposition 1.

(Kerkyacharian and Picard Citation2004, p. 1070) Assume that conditions C1–C3 hold and let α>1. Then, there exist constants Cα and Cα, and for any γ>0 there exists a constant κγ, with E(|β̂lkβlk|α)Cα1+fjLαnα/2,for 2ln,P(|β̂lkβlk|>κlognn)Cαnγα,for κκγ,2lnlogn.

3 Feature Screening via Warped Wavelet Regression

In this section, we use the properties of the warped wavelets estimator to establish theoretical guarantees for our Algorithm 1. Assume we can fit model (1) to all covariates but that fj is null for most of them, that is, only a few Xj are relevant to predict Y. Let βj={βjlk,0lJ,0k2l1} be the wavelet coefficients of fj as in (2). Notice that if fj is null, then βj=0. Therefore, we represent the set of relevant features in terms of their wavelet coefficients as M*={j;1jp,βj>0}.

Our goal is to find a set M̂{1,,p} that contains M* with high probability and has cardinality much lower than p. However, the functions fj are unknown, and we must use estimators f̂j to screen the relevant features. This approximation works well if the wavelet transform of f̂j is close enough to that of fj . Let β̂j be the wavelet coefficients of f̂j estimated as in (4). Then, the proposed set of screened variables is as follows: M̂={j;||β̂j||>νn,1jp}, where νn>0. The key point to recover jM* is that the energy of fj must be higher than the noise level, that is, ||β̂j|| has to be higher than the norm of wavelet coefficients of pure noise. In this case, we can screen the correct variables with high probability if we choose a suitable νn . The next proposition formalizes this idea.

Proposition 2.

Let κ, C and γ be defined as in Proposition 1, with γ>1/4. Assume that C1–C3 hold and that minjM*βj>2νn, where νn=2(J+1)/2κ((logn)/n)1/4. If M̂ is computed with this νn , then (5) P(M̂M*)1s2J+1Cn2γ,s=|M*|.(5)

Additionally, for fixed s and J, P(M̂M*)1 as n.

Proof.

See Appendix B. □

Another important aspect of any screening set is that its cardinality must be much lower than the dimension p. Proposition 3 shows that, on average, the number of elements on M̂ is much lower than p.

Proposition 3.

Assume that all conditions of Proposition 2 hold and that M̂ is defined with νn=2(J+1)/2κ((logn)/n)1/4. Then, the expected cardinality of M̂ satisfies E (|M̂|)s+2J+1C(ps)n2γ,where s=|M*|.

Proof.

See Appendix C. □

4 Simulation

This section presents simulation results of Algorithm 1. Similarly as in Example 4 of Hall and Miller (Citation2009), we simulate data from the model (6) Yi=j=156j5(f(Xij)+f(Xi,j+5))+ϵi,(6) where ϵi follows a standard normal distribution. The function f:[0,1]R is set as one of the test functions of Donoho and Johnstone (Citation1994) (see ) or the identity function f(x) = x. The pairs (Xij,Xi,j+5) are independently generated as (7) XijU(0,1),Xi,j+5=ρXij+Ui1ρ2ρ+1ρ2,j=1,,5,i=1,,n,(7) where ρ=0.85 and U1,,UniidUnif(0,1). This generating scheme guarantees that some covariates are correlated, with corr(Xij,Xi,j+5)=ρ. When true covariates Xj and Xj+5 are highly correlated, many variable selection methods tend to choose only one of them. This simulation study also checks the ability of screening methods to select both.

Fig. 1 Test functions of Donoho and Johnstone (Citation1994).

Fig. 1 Test functions of Donoho and Johnstone (Citation1994).

We set the dimension as p = 5000 and consider sample sizes n{128,256,512}. The relevant covariates are X1,,X10, that is, M*={1,,10}. We also generate {Xi,j}j=1115 correlated with {Xi,j}j=15 as outlined in (7). This way, we can analyze a case where nonsupport and support covariates are correlated. The remaining X16,,X5000 have all entries sampled independently from Unif(0,1). This section presents the numbers of true covariates detected by different methods among their 15 largest screening utility scores. In the supplementary material, we present the number of pairs of support covariates {(Xj,Xj+5)}j=15 that are selected together, and the number of selected nonsupport covariates {Xj}j=1115. All results are averages from 500 replicates.

Our Algorithm 1 is applied with the Haar wavelet basis and a single decomposition level, which are the default choices in our codes.Footnote1 This wavelet method is compared with the generalized correlation screening of Hall and Miller (Citation2009) implemented with cubic splines. We also apply the model-free methods SIRS (Zhu et al. Citation2011) and DC-SIS (Li, Zhong, and Zhu Citation2012), which are implemented in the R package VariableScreening (Li, Huang, and Dziak Citation2022).

Simulation results are presented in . We note that, albeit no method outperforms all others in all scenarios, the warped wavelets provided competitive results. When f is the Blocks or Heavisine functions, generalized correlation with splines gives the best results, and SIRS and DC-SIS have the best performances when f is the identity function. Apart from DC-SIS, all considered methods take on average only a few seconds to run. In general, wavelet methods have competitive results, outperforming the other approaches for the functions Bumps and Doppler, for example. This simulation highlights the good adaptation properties of wavelets for link functions fj with many localized traits like peaks and sharp oscillations.

Table 1 Average number of true features (#feat) selected among the 15 most relevant according to each screening method, and average runtime in seconds (time).

5 Application on Ozone Concentration

Ozone is a pollutant related to various respiratory problems, and its efficient monitoring greatly impacts public health (Zhang, Wei, and Fang Citation2019). However, various chemical and meteorological factors affect ozone concentration, making its monitoring a challenging task (Monks et al. Citation2015).

Our goal in this application is to identify the main factors that influence ozone concentration in São Paulo–Brazil (Santolaya et al. Citation2019) through wavelet feature screening. The relations between ozone level and related variables are known to be nonlinear for a long time (Lin, Trainer, and Liu Citation1988), which makes nonparametric screening methods appropriate tools for variable selection. We use the R package qualR (Gavidia-Calderón, Schuch, and de Fatima Andrade Citation2022) to collect data from the Pinheiros air quality station in São Paulo. The data consists of daily observations of 11 variables collected from January 2004 until December 2022. Due to many missing values, we analyze monthly averages and use only observations from April 2006 until September 2010. Moreover, we compute the cosine and sine of circular variables related to wind direction. The time series are displayed in . Notice that most time series present a seasonal pattern; thus, the observations are not independent.

Fig. 2 Time series of the meteorological variables recorded at Pinheiros air quality station in São Paulo. The abbreviations wd and gwd refer to wind direction and global wind direction, respectively.

Fig. 2 Time series of the meteorological variables recorded at Pinheiros air quality station in São Paulo. The abbreviations wd and gwd refer to wind direction and global wind direction, respectively.

The properties of feature screening methods discussed in this article assume independent observations. Thus, we computed differences in the variables shown in to reduce their dependence on past values. After plots of autocorrelations and partial autocorrelations, we employ the following:

  • Differences at lags 1 and 11 for the variables ozone, carbon monoxide, nitrogen monoxide, and temperature,

  • Differences at the lag 1 for the variables nitrogen dioxide, humidity, wind speed, cos (wd), and sin (wd), where wd stands for wind direction,

  • No transformation for the variables cos (gwd), and sin (gwd), where gwd stands for global wind direction.

The temporal dependence of all the variables was reduced considerably after these transformations, as shown in the supplementary material.

The next step uses the differenced time series to identify the most important features to model ozone concentration. We apply the generalized correlation with cubic splines, SIS, DC–SIS, and SIRS. The warped wavelet feature screening is applied using Daubechies’ extremal phase wavelets with filter number six.Footnote2 presents the five most relevant variables selected by each method. All methods identify the following features as most relevant: humidity, wind speed, and cos (wd). Hence, there is a consensus that humidity and wind direction and speed are relevant variables to model ozone concentration. This conclusion is in accordance with well-known results in the literature (Thompson et al. Citation2001, Sec 2.2.2). Wind direction and speed are associated with the transport of ozone in the atmosphere (Camalier, Cox, and Dolwick Citation2007), and weather factors such as humidity and temperature are frequently used to study peaks of ozone concentration (Oliveira, Drumond, and Rizzo Citation2022).

Table 2 The five most relevant variables according to different feature screening methods.

shows curves estimated using warped wavelets for some features. The vertical axis represents the differenced time series of ozone concentration minus its average. We subtract the average Y¯ of this variable before estimating the curves (Hastie, Tibshirani, and Friedman Citation2009, p. 298). The covariates are the differenced time series of features normalized to the interval (0, 1). We see in that Humidity, Wind Speed, and cos (wd) present a nonlinear relationship with the ozone concentration. Understanding the importance of these features and how they influence ozone concentration can aid scientists, for example, in developing simulation models (Sánchez-Ccoyllo et al. Citation2006; Franco et al. Citation2019).

Fig. 3 Plots of the response variable Y (differenced and centered ozone concentrations) and each covariate (differenced and normalized to the unit interval). The curves are warped wavelet regression estimates. The observations are considerably noisy, and only a few features display a noticeable effect on Y, like humidity and wind speed.

Fig. 3 Plots of the response variable Y (differenced and centered ozone concentrations) and each covariate (differenced and normalized to the unit interval). The curves are warped wavelet regression estimates. The observations are considerably noisy, and only a few features display a noticeable effect on Y, like humidity and wind speed.

6 Application on Epilepsy

Epilepsy is a brain disorder characterized by a predisposition and occurrence of multiple seizures (Fisher et al. Citation2014). A common form of this disorder is temporal lobe epilepsy (TLE), which can happen on one side of the brain. The progress of TLE is associated with abnormalities in the brain’s gray and white matter (Yasuda et al. Citation2010). Many studies focus on characteristics of white matter to study some aspects of TLE. One particular problem is to understand brain degeneration in epileptic patients. Comparisons with healthy controls have shown that epileptic patients seem to have older brains (Hwang et al. Citation2020). Further studies have been conducted to understand the relation between brain structural age and neurological measurements under the TLE disorder (Yasuda et al. Citation2023).

In this section, we apply feature screening on a dataset from a study about TLE. The data was collected by the Neuroimaging Laboratory at the University of Campinas-Brazil and consists of 88 features and ages obtained from 226 individuals (89 controls and 137 patients). The patients are divided by the side of their TLE: 76 have left TLE (LTLE) and 61 have right TLE (RTLE). The features are brain white matter measurements acquired via magnetic resonance imaging. We analyze pre-processed measurements in the interval [0,1]. Further details about the dataset can be checked in Yasuda et al. (Citation2023).

Our task is to identify which features can be used to model the patients’ age. The research hypothesis is that epileptic patients have feature values similar to older healthy controls. presents boxplots of the ages of each group of individuals. This figure shows that the distribution of ages is similar in the three groups. We applied feature screening methods separately for each group and checked which features were selected the most. Later, we analyze the estimated ages for two specific features.

Fig. 4 Boxplot of the observed ages for each group in the epilepsy study.

Fig. 4 Boxplot of the observed ages for each group in the epilepsy study.

We applied the generalized correlation with cubic splines, SIS, DC-SIS, and SIRS. Our Algorithm 1 is applied using Daubechies’ extremal phase wavelets with filter number six.Footnote3 presents the five features selected the most by each screening method. The methods have consistent results and identify similar sets of features. The features related to BODY and Genu are the most selected. These brain structures were also highlighted in previous studies (Yasuda et al. Citation2023, Tables 2 and 5).

Table 3 The five most relevant variables according to different feature screening methods for each group of participants in the epilepsy study.

shows that the two features selected the most for the three groups are BODY_FA and Genu_FA. presents scatterplots of these features versus the observed ages. The warped wavelet regression was applied to fit a smooth curve for each group of patients and control. Both plots in show that the curves of control individuals tend to be above the curves of RTLE and LTLE patients. This characteristic indicates that measurements of TLE patients are observed in controls with older ages, illustrating the degeneracy caused by the TLE disorder. Yasuda et al. (Citation2023) reached similar conclusions using linear regression. Hence, the feature screening and wavelet analysis corroborate their results.

Fig. 5 Scatterplot of observed ages versus the covariates BODY_FA (right) and Genu_FA (left). The symbols ·, ×, and + correspond to control, LTLE, and RTLE groups, respectively. The solid curves were estimated via warped wavelet regression for each group of patients.

Fig. 5 Scatterplot of observed ages versus the covariates BODY_FA (right) and Genu_FA (left). The symbols ·, ×, and + correspond to control, LTLE, and RTLE groups, respectively. The solid curves were estimated via warped wavelet regression for each group of patients.

7 Conclusion

Feature screening is a common practice in various scientific applications. Scientists are frequently looking for relevant relationships between variables of interest and candidate features, and the correlation is probably the most used statistic to screen such relevant relations. However, high dimensionality and nonlinearity are common characteristics in practice and pose challenges to classical approaches in data analysis. High-dimensional data increases the chance of observing spurious correlations, and many important features may have a nonlinear relation with the variable of interest. In this context, many screening metrics have been proposed in the literature to cope with large dimensions and allow for flexible relations between variables. We propose a wavelet approach for performing feature screening. Wavelet methods are well-known tools in the literature of sparse models and can estimate various types of curves. The proposed method is based on the warped wavelet regression, and we prove that it can achieve the sure screening property under suitable conditions. Simulation studies also show that the proposed method is competitive with other feature screening approaches in the literature. Additionally, the wavelet approach tends to outperform other methods when the link function between a feature and a response variable have sharp oscillations and peaks. Such an outcome agrees with the well-known ability of wavelets to adapt to signals with local nonlinearities. Our work also provides two applications in real data from relevant scientific problems: the relation of ozone concentration and atmospheric variables and the relation between brain white matter measurements and the age of patients with temporal lobe epilepsy. In both studies, the wavelet feature screening method provides results that are in line with other screening methods. Another application example in the supplementary material shows that wavelet feature screening also performs well in a case where pn. Overall, all the screening methods can select features that have relations reported in the literature of their respective fields. Therefore, we believe the warped wavelet screening method is a powerful tool for scientists dealing with many linear and nonlinear features.

Supplementary Materials

Full simulation results:The complete results of the simulation presented in Section 4.

Application in high-dimensional data:A feature screening application on the cardiomiopathy data analyzed by Hall and Miller (Citation2009) and Li, Zhong, and Zhu (Citation2012).

Summary of applications: R markdown files outlining the data analysis presented in Sections 5 and 6.

Codes:All R codes used for simulation studies and to analyze the datasets about ozone concentration and cardiomyopathy.

Supplemental material

supplement_waveletFeatureScreening.pdf

Download PDF (232.3 KB)

supplement (28).zip

Download Zip (12.2 MB)

Acknowledgments

We thank Paulo Artaxo and Luciana Rizzo for suggesting the analysis of ozone concentration data, and Clarissa Yasuda for suggesting the application about epilepsy. We also thank the Editor, Associate Editor, and two referees for all constructive comments about our article. We acknowledge Clarissa Yasuda and CEPID BRAINNFootnote4 (FAPESP 2013/07559-3) for generously providing the epilepsy data.

Disclosure Statement

The authors report that there are no competing interests to declare.

Additional information

Funding

This research was supported by the São Paulo Research Foundation (FAPESP) under Grants 2018/04654-9, 2021/04513-9 and 2023/02538-0; and the Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) under Grant 310991/2020-0. RF acknowledges the support provided by the Morá Miriam Rozen Gerber Fellowship for Brazilian postdocs at the Weizmann Institute of Science.

Notes

1 The supplementary material shows that similar results are obtained with different wavelet basis, and that Algorithm 1 outperforms a wavelet screening applied with the method of Kovac and Silverman (Citation2000), which is implemented as the irregwd function in the R package wavethresh (Nason et al. Citation2022).

2 See the wavethresh manual for details (Nason et al. Citation2022).

3 See the wavethresh manual for details (Nason et al. Citation2022).

References

  • Barut, E., Fan, J., and Verhasselt, A. (2016), “Conditional Sure Independence Screening,” Journal of the American Statistical Association, 111, 1266–1277. DOI: 10.1080/01621459.2015.1092974.
  • Camalier, L., Cox, W., and Dolwick, P. (2007), “The Effects of Meteorology on Ozone in Urban Areas and their Use in Assessing Ozone Trends,” Atmospheric Environment, 41, 7127–7137. DOI: 10.1016/j.atmosenv.2007.04.061.
  • Donoho, D. L., and Johnstone, J. M. (1994), “Ideal Spatial Adaptation by Wavelet Shrinkage,” Biometrika, 81, 425–455. DOI: 10.1093/biomet/81.3.425.
  • Fan, J., Feng, Y., and Song, R. (2011), “Nonparametric Independence Screening in Sparse Ultra-High-Dimensional Additive Models,” Journal of the American Statistical Association, 106, 544–557. DOI: 10.1198/jasa.2011.tm09779.
  • Fan, J., and Li, R. (2001), “Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties,” Journal of the American Statistical Association, 96, 1348–1360. DOI: 10.1198/016214501753382273.
  • Fan, J., Li, R., Zhang, C.-H., and Zou, H. (2020), Statistical Foundations of Data Science, Boca Raton: CRC Press.
  • Fan, J., and Lv, J. (2008), “Sure Independence Screening for Ultrahigh Dimensional Feature Space,” Journal of the Royal Statistical Society, Series B, 70, 849–911. DOI: 10.1111/j.1467-9868.2008.00674.x.
  • Fisher, R. S., Acevedo, C., Arzimanoglou, A., Bogacz, A., Helen Cross, J., Elger, C. E., et al. (2014), “ILAE Official Report: A Practical Clinical Definition of Epilepsy,” Epilepsia, 55, 475–482. DOI: 10.1111/epi.12550.
  • Franco, D. M. P., Andrade, M. d. F., Ynoue, R. Y., and Ching, J. (2019), “Effect of Local Climate Zone (LCZ) Classification on Ozone Chemical Transport Model Simulations in Sao Paulo, Brazil,” Urban Climate, 27, 293–313. DOI: 10.1016/j.uclim.2018.12.007.
  • Gavidia-Calderón, M., Schuch, D., and de Fatima Andrade, M. (2022), qualR: An R Package to Download São Paulo and Rio de Janeiro Air Pollution Data, R package version 0.9.7.
  • Gómez, L. M., Porto, R. F., and Morettin, P. A. (2021), “Nonparametric Regression with Warped Wavelets and Strong Mixing Processes,” Annals of the Institute of Statistical Mathematics, 73, 1203–1228. DOI: 10.1007/s10463-021-00789-0.
  • Hall, P., and Miller, H. (2009), “Using Generalized Correlation to Effect Variable Selection in Very High Dimensional Problems,” Journal of Computational and Graphical Statistics, 18, 533–550. DOI: 10.1198/jcgs.2009.08041.
  • Härdle, W., Kerkyacharian, G., Picard, D., and Tsybakov, A. (1998), Wavelets, Approximation, and Statistical Applications, New York: Springer.
  • Hastie, T., Tibshirani, R., and Friedman, J. (2009), The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Vol. 2), New York: Springer.
  • Hwang, G., Hermann, B., Nair, V. A., Conant, L. L., Dabbs, K., Mathis, J., et al. (2020), “Brain Aging in Temporal Lobe Epilepsy: Chronological, Structural, and Functional,” NeuroImage: Clinical, 25, 102183. DOI: 10.1016/j.nicl.2020.102183.
  • Kerkyacharian, G., and Picard, D. (2004), “Regression in Random Design and Warped Wavelets,” Bernoulli, 10, 1053–1105. DOI: 10.3150/bj/1106314850.
  • Kovac, A., and Silverman, B. W. (2000), “Extending the Scope of Wavelet Regression Methods by Coefficient-Dependent Thresholding,” Journal of the American Statistical Association, 95, 172–183. DOI: 10.1080/01621459.2000.10473912.
  • Li, R., Huang, L., and Dziak, J. (2022), VariableScreening: High-Dimensional Screening for Semiparametric Longitudinal Regression, R package version 0.2.1.
  • Li, R., Zhong, W., and Zhu, L. (2012), “Feature Screening via Distance Correlation Learning,” Journal of the American Statistical Association, 107, 1129–1139. DOI: 10.1080/01621459.2012.695654.
  • Lin, X., Trainer, M., and Liu, S. (1988), “On the Nonlinearity of the Tropospheric Ozone Production,” Journal of Geophysical Research, 93, 15879–15888.
  • Mai, Q., and Zou, H. (2013), “The Kolmogorov Filter for Variable Screening in High-Dimensional Binary Classification,” Biometrika, 100, 229–234. DOI: 10.1093/biomet/ass062.
  • Monks, P. S., Archibald, A. T., Colette, A., Cooper, O., Coyle, M., Derwent, R., et al. (2015), “Tropospheric Ozone and its Precursors from the Urban to the Global Scale from Air Quality to Short-Lived Climate Forcer,” Atmospheric Chemistry and Physics, 15, 8889–8973. DOI: 10.5194/acp-15-8889-2015.
  • Morettin, P. A., Pinheiro, A., and Vidakovic, B. (2017), Wavelets in Functional Data Analysis, Cham: Springer.
  • Morettin, P. A., and Porto, R. F. (2022), “Estimation of Nonparametric Regression Models by Wavelets,” São Paulo Journal of Mathematical Sciences, 16, 539–565. DOI: 10.1007/s40863-021-00240-5.
  • Nason, G., Barber, S., Downie, T., Frylewicz, P., Kovac, A., Ogden, T., and Silverman, B. (2022), Package ‘wavethresh’, R package version 4.7.2.
  • Ogden, R. T. (1997), Essential Wavelets for Statistical Applications and Data Analysis, Boston: Birkhäuser.
  • Oliveira, M., Drumond, A., and Rizzo, L. (2022), “Air Pollution Persistent Exceedance Events in the Brazilian Metropolis of Sao Paulo and Associated Surface Weather Patterns,” International Journal of Environmental Science and Technology, 19, 9495–9506. DOI: 10.1007/s13762-021-03778-1.
  • Porto, R., Morettin, P., Percival, D., and Aubin, E. (2016), “Wavelet Shrinkage for Regression Models with Random Design and Correlated Errors,” Brazilian Journal of Probability and Statistics, 30, 614–652. DOI: 10.1214/15-BJPS296.
  • Reiss, P. T., Goldsmith, J., Shang, H. L., and Ogden, R. T. (2017), “Methods for Scalar–on–Function Regression,” International Statistical Review, 85, 228–249. DOI: 10.1111/insr.12163.
  • Sánchez-Ccoyllo, O. R., Ynoue, R. Y., Martins, L. D., and Andrade, M. d. F. (2006), “Impacts of Ozone Precursor Limitation and Meteorological Variables on Ozone Concentration in Sao Paulo, Brazil,” Atmospheric Environment, 40, 552–562. DOI: 10.1016/j.atmosenv.2006.04.069.
  • Santolaya, C., Oliveira, M. C. Q. D., Rizzo, L. V., and Miraglia, S. G. E. K. (2019), “Contribution of Chemical and Meteorological Factors to Tropospheric Ozone Formation in São Paulo, Brazil,” Brazilian Journal of Environmental Sciences, 54, 90–104.
  • Segal, M. R., Dahlquist, K. D., and Conklin, B. R. (2003), “Regression Approaches for Microarray Data Analysis,” Journal of Computational Biology, 10, 961–980. DOI: 10.1089/106652703322756177.
  • Thompson, M. L., Reynolds, J., Cox, L. H., Guttorp, P., and Sampson, P. D. (2001), “A Review of Statistical Methods for the Meteorological Adjustment of Tropospheric Ozone,” Atmospheric Environment, 35, 617–630.
  • Vidakovic, B. (1999), Statistical Modeling by Wavelets, New York: Wiley.
  • Xue, J., and Liang, F. (2017), “A Robust Model-Free Feature Screening Method for Ultrahigh-Dimensional Data,” Journal of Computational and Graphical Statistics, 26, 803–813. DOI: 10.1080/10618600.2017.1328364.
  • Yasuda, C. L., Valise, C., Saúde, A. V., Pereira, A. R., Pereira, F. R., Costa, A. L. F., et al. (2010), “Dynamic Changes in White and Gray Matter Volume are Associated with Outcome of Surgical Treatment in Temporal Lobe Epilepsy,” Neuroimage, 49, 71–79. DOI: 10.1016/j.neuroimage.2009.08.014.
  • Yasuda, C. L., Ramalho Pimentel-Silva, L., Beltramini, G. C., Liu, M., de Campos, B. M., Coan, A. C., Beaulieu, C., Cendes, F., and Gross, D. W. (2023), “Brain Volumes and White Matter Diffusion Across the Adult Lifespan in Temporal Lobe Epilepsy,” Annals of Clinical and Translational Neurology, 10, 1106–1118. DOI: 10.1002/acn3.51793.
  • Yousuf, K. (2018), “Variable Screening for High Dimensional Time Series,” Electronical Journal of Statistics, 12, 667–702.
  • Zhang, J., Wei, Y., and Fang, Z. (2019), “Ozone Pollution: A Major Health Hazard Worldwide,” Frontiers in Immunology, 10, 2518. DOI: 10.3389/fimmu.2019.02518.
  • Zhao, Y., Chen, H., and Ogden, R. T. (2015), “Wavelet-based Weighted LASSO and Screening Approaches in Functional Linear Regression,” Journal of Computational and Graphical Statistics, 24, 655–675. DOI: 10.1080/10618600.2014.925458.
  • Zhu, L.-P., Li, L., Li, R., and Zhu, L.-X. (2011), “Model-Free Feature Screening for Ultrahigh-Dimensional Data,” Journal of the American Statistical Association, 106, 1464–1475. DOI: 10.1198/jasa.2011.tm10563.

Appendix A.

Auxiliary Lemma

Lemma 1.

Assume that C1–C3 hold. Let κ, C, and γ be defined in Proposition 1, with γ>1/2. Therefore, it follows that (8) P(maxjM*|||β̂j||||βj|||>νn)s2J+1Cn2γ(8) and (9) jM*P(||β̂j||>νn)(ps)2J+1Cn2γ,(9) where s=|M*| and νn=2(J+1)/2κ((logn)/n)1/4.

Proof.

Let us prove (8) first. By the inequality |||β̂j||||βj|||||β̂jβj|| and the union bound, it follows that P(maxjM*|||β̂j||||βj|||>νn)jM*P(||β̂jβj||>νn).

We shall use the following event inclusion (10) {||β̂jβj||>νn}={l=0Jk=02l1(β̂jlkβjlk)2>νn2}l=0Jk=02l1{(β̂jlkβjlk)2>νn22J+1}.(10)

Notice that (10) holds for all j=1,,p. Then, by the union bound and since νn2=2J+1κ(logn)/n, P(maxjM*|||β̂j||||βj|||>νn)jM*l=0Jk=02l1P(|β̂jlkβjlk|2>νn2/2J+1)=jM*l=0Jk=02l1P(|β̂jlkβjlk|>κlognn).

Recall that M* has cardinality s and notice that there are 2J+1 pairs (l, k) in the sum above. Hence, (8) follows by applying Proposition 1 with α = 2.

Inequality (9) can be proved similarly. Indeed, since βj=0 for jM*, then jM*P(||β̂j||>νn)jM*l=0Jk=02l1P(|β̂jlk|>κlognn)(ps)2J+1Cn2γ, where the first inequality is due to the union bound and (10), and the second inequality is due to Proposition 1. □

Appendix B.

Proof of Proposition 2

Proof.

Consider the event An={maxjM*|||β̂j||||βj|||νn}. Let us show that if minjM*||βj||>2νn, then An{M̂M*}. Consider that An holds. It implies that ||β̂j||||βj||νn for all jM*. Since ||βj||νnνn by assumption, then ||β̂j||νn, that is, all jM* also belong to M̂. Hence, P(M̂M*)1P(Anc). By (8) P(Anc)=P(maxjM*|||β̂j||||βj|||>νn)s2J+1Cn2γ, where s is the cardinality of M*. This concludes the proof of (5).

Recall from the comment after (3) that 2JCn/logn. Thus, P(Anc)2sCn1/22γ/logn, where C is an absolute constant. Since γ>1/4 by assumption, then P(Anc)0 as n with s and J fixed. This concludes the proof of Proposition 2. □

Appendix C.

Proof of Proposition 3

Proof.

Recall that s=|M*|. We can write the cardinality |M̂| as a sum of indicator functions, which leads to the following inequality: |M̂|=j=1pI{||β̂j||>νn}s+jM*I{||β̂j||>νn}.

Notice that βj0 for jM*. Then, it follows from (9) that E {jM*I{||β̂j||>νn}}=jM*P{||β̂j||>νn}(ps)2J+1Cn2γ.