Full article: Wavelet Feature Screening

Formulae display: $MathJax Logo$ ?Mathematical formulae have been encoded as MathML and are displayed in this HTML version using MathJax in order to improve their display. Uncheck the box to turn MathJax off. This feature requires Javascript. Click on a formula to zoom.

Abstract

An initial screening of which covariates are relevant is a common practice in high-dimensional regression models. The classic feature screening selects only a subset of covariates correlated with the response variable. However, many important features might have a relevant albeit highly nonlinear relation with the response. One screening approach that handles nonlinearity is to compute the correlation between the response and nonparametric functions of each covariate. Wavelets are powerful tools for nonparametric and functional data analysis but are still seldom used in the feature screening literature. We propose a wavelet feature screening method that can be easily implemented, and we prove that, under suitable conditions, it captures the true covariates with high probability. Simulation results also show that our approach outperforms other screening methods in highly nonlinear models. We apply feature screening to two datasets about ozone concentration and epilepsy. In both applications, the proposed method selects features that match findings in the literature of their respective research fields, illustrating the applicability of feature screening. Supplementary material for this article is available online.

Keywords:

1 Introduction

The number of explanatory variables is commonly much larger than the sample size in modern data science. In the current era of big data, many applications in biology have thousands of covariates, reaching even hundreds of thousands in genetic studies (Fan et al. Citation2020, sec. 1.1). A key objective in applications is to use all data to determine which features/covariates are associated with characteristics of interest, that is, the response variables. Some classic methods to deal with this type of problem where $p ≫ n$ are sub-sampling covariates, ridge regression, and Lasso, for example (Hastie, Tibshirani, and Friedman Citation2009). However, if p is substantially larger than n, the performance of many variable selection methods may also be compromised (Fan and Lv Citation2008). In such scenarios, practitioners often do an initial screening to discard the covariates most likely irrelevant to the analysis. This approach works as an initial step in supervised learning. It reduces the problem’s dimension to only $p_{n} ≪ p$ covariates and facilitates the learning step of a regression or classification model.

Feature screening consists of computing a marginal utility measure for each feature and selecting the ones with the highest scores. The most popular utility measure is the Pearson correlation between the response variable and each covariate. Fan and Lv (Citation2008) studied this approach and derived its theoretical guarantees. Since Pearson correlation is tailored to linear relationships, further works proposed using nonparametric methods to capture nonlinear relations. Hall and Miller (Citation2009) proposed using a generalized correlation, where the utility measure is the Pearson correlation of the response variable and an estimated smooth function of the covariate. Likewise, Fan, Feng, and Song (Citation2011) proposed to fit regression splines of each feature to the response and use the norm of the fitted function as the utility score. Many other screening methods have been proposed since then; to name a few examples, we cite the following: model-free screening (Zhu et al. Citation2011; Xue and Liang Citation2017), screening based on distance-correlations (Li, Zhong, and Zhu Citation2012), screening for categorical response (Mai and Zou Citation2013), conditional marginal screening (Barut, Fan, and Verhasselt Citation2016), and screening for dependent data (Yousuf Citation2018). Fan et al. (Citation2020, chap. 8) provides a thorough literature review and further details about feature screening. Our article focuses on nonparametric utility measures (Hall and Miller Citation2009).

1.1 Generalized Correlation

Let Y be a response variable of interest and ${(X_{1}, \dots, X_{p})}^{T}$ a set of covariates to be used as proxies to Y. Assume we have a sample ${(X_{i .}, Y_{i})}_{i = 1}^{n}$ where $X_{i .} = (X_{i 1}, \dots, X_{ip})$ and $p ≫ n$ . We aim to reduce the data’s dimension by selecting a subset of covariates relevant to the study. Let $X = (X_{1}, \dots, X_{p})$ be the design matrix, where $X_{j} = {(X_{1 j}, \dots, X_{nj})}^{T}$ denotes the observations from the jth covariate.

The standard screening approach uses the Pearson correlation between Y and $X_{j}$ as a utility measure. This approach is widely used by practitioners in all fields that require data analysis and counts with a solid theoretical foundation (Fan and Lv Citation2008). These two aspects explain the popularity of this screening approach. However, the Pearson correlation does not capture covariates with a highly nonlinear relation with the response variable. Therefore, Hall and Miller (Citation2009) proposes to tackle this problem by estimating the functional relations between Y and each X_j separately. They compute a generalized correlation as follows $ρ (X_{j}, Y) = sup_{f_{j} \in H} \frac{cov {f_{j} (X_{j}), Y}}{\sqrt{var {f_{j} (X_{j})} var (Y)}}, j = 1, \dots, p,$ where $H$ is some functional space and $f_{j} \in H$ is the function that fits X_j to Y. When $H$ consists of the space of linear functions, $ρ (X_{j}, Y)$ becomes the Pearson correlation. Hall and Miller (Citation2009) uses cubic splines basis to represent f_j . Given the sample ${(X_{i .}, Y_{i})}_{i = 1}^{n}$ , the generalized correlation can be estimated by $\begin{matrix} \hat{ρ} (X_{j}, Y) = sup_{f_{j} \in H} \frac{\sum_{i = 1}^{n} {f_{j} (X_{ij}) - {\bar{f}}_{j}} {Y_{i} - \bar{Y}}}{\sqrt{\sum_{i = 1}^{n} {f_{j} (X_{ij}) - {\bar{f}}_{j}}^{2} \sum_{i = 1}^{n} {Y_{i} - \bar{Y}}^{2}}}, \\ j = 1, \dots, p, \end{matrix}$ where ${\bar{f}}_{j} = \sum_{i = 1}^{n} f_{j} (X_{ij}) / n$ and $\bar{Y} = \sum_{i = 1}^{n} Y_{i} / n$ . The idea is to use $\hat{ρ} (X_{j}, Y)$ to measure the importance of $X_{j}$ for fitting Y. The screening approach selects only $p_{n} ≪ p$ covariates with the highest $\hat{ρ} (X_{j}, Y)$ in absolute value.

The same screening results can be achieved using $| | | {\hat{f}}_{j} | | - | | f_{j} | | |$ as utility measure, where $| | \cdot | |$ denotes some norm. For example, consider the case of Pearson correlation and let X and Y be two vectors with zero-mean and unit norm. Since $| | X - Y | |^{2} = 2 - 〈 X, Y 〉$ , by maximizing the correlation $〈 X, Y 〉$ we also minimize $| | | X | | - | | Y | | |^{2} \leq | | X - Y | |^{2}$ , where the latter follows from the triangular inequality. Therefore, we will focus on the distance $| | | {\hat{f}}_{j} | | - | | f_{j} | | |$ to devise a variable selection scheme.

1.2 Proposed Method

We use wavelet methods to estimate the function f_j that links Y to X_j , $j = 1, \dots, p$ . Wavelets form a basis for the space of square-integrable functions, similarly to cosines and sines in Fourier analysis. The main advantages of wavelets are their local adaptivity and regularization properties, making them a powerful tool to represent functional data (Morettin, Pinheiro, and Vidakovic Citation2017). Other properties are that wavelet coefficients can be quickly computed for equally spaced sampled observations (regular design), and such coefficients characterize quite general functional spaces, such as Hölder and Besov spaces, for example (Vidakovic Citation1999, p. 87). One challenge in our problem is that multiple covariates are generally irregularly spaced, a design for which often-used wavelet methods are unsuitable.

Algorithm 1

Warped wavelet screening

Input: response variable $Y \in R^{n}$ , features $X_{1}, \dots, X_{p} \in R^{n}$ , threshold $ν > 0$ , wavelet basis.

Output: Indices of screened features $\hat{M} \subset {1, \dots, p}$ .

1: for $j = 1, \dots, p$ do

2: Sort $(Y_{π (1)}, X_{j, π (1)}), \dots, (Y_{π (n)}, X_{j, π (n)})$ such that $X_{j, π (1)} \leq \dots \leq X_{j, π (n)}$ , where ${π (1), \dots, π (n)}$ denotes a permutation of ${1, \dots, n}$ .

3: Apply the discrete wavelet transform to $(Y_{π (1)}, \dots, Y_{π (n)})$ using the chosen wavelet basis, obtaining wavelet coefficients ${{\hat{β}}_{k}}$ .

4: Apply thresholding on the wavelet coefficients, obtaining ${{\tilde{β}}_{k}}$ .

5: Compute $w_{j} = {\sum_{k} {\tilde{β}}_{k}^{2}}^{1 / 2}$ using the thresholded coefficients.

6: end for

7: Set $\hat{M} = {j; 1 \leq j \leq p, w_{j} > ν}$ .

8: return $\hat{M}$

Our main contribution is a feature screening method that uses warped wavelets, which were introduced by Kerkyacharian and Picard (Citation2004) to deal with random covariates. Our screening idea is described in Algorithm 1. It can be easily applied within any statistical software that computes wavelet transforms (e.g., R, Python, Matlab). The paradigm of wavelet feature screening has been suggested previously (Fan, Feng, and Song Citation2011, p. 552). The closest work to ours is the paper by Zhao, Chen, and Ogden (Citation2015), who proposes a screening step for a wavelet-based Lasso estimator of a linear regression model. However, this work assumes that predictors are sampled at equally spaced points, a condition hardly satisfied in practice. The current article is the first to propose a wavelet algorithm that handles irregular designs and to study its properties. The main component of our algorithm is the application of warped wavelets, whose details are presented in Section 2. Warped wavelet regression estimators adapt well to features with irregular sampling designs (Morettin and Porto Citation2022), which allows us to apply wavelets to run feature screening. Theoretical properties of the proposed screening approach are presented in Section 3, where we prove that Algorithm 1 can capture the true covariates with high probability.

Feature screening methods are commonly tested with simulations of scenarios where the link function f_j is either smooth or has very few discontinuities. Differently, Section 4 presents simulations of challenging scenarios with highly nonlinear examples of f_j . Our results show that model-free screening methods do not perform well when f_j is very irregular, but Algorithm 1 can still detect true features in many such cases. The wavelet approach also outperforms splines for scenarios where f_j contains local traits like peaks. This analysis illustrates the well-known advantage of wavelets for regression problems where the link function is irregular, containing spikes or discontinuities (Reiss et al. Citation2017, sec. 2.1).

Another contribution of this article is the application of feature screening to two scientific problems. The first uses ozone concentration data in the largest city in Brazil. Ozone is a pollutant related to various health problems; thus, monitoring which atmospheric features are related to ozone concentration is of practical interest (see Section 5). The second problem uses data of a study about temporal lobe epilepsy. These data were collected to understand how brain white matter measurements are related to brain degeneracy in epileptic patients (see Section 6). In both applications, we show that feature screening can detect covariates previously reported as important factors in their respective fields. The supplementary material contains an application with the cardiomyopathy data (Segal, Dahlquist, and Conklin Citation2003), a high-dimension example commonly used in the feature screening literature. Some final remarks in Section 7 close the article.

Notation: We use $log$ for the natural logarithm. Vectors and matrices are represented by bold letters. The ith row and jth column of a matrix $X \in R^{n \times p}$ are denoted as $X_{i .} \in R^{p}$ and $X_{j} \in R^{n}$ , respectively. The Euclidean norm of a vector $x \in R^{n}$ is denoted as $| | x | | = {(\sum_{i = 1}^{n} x_{i}^{2})}^{1 / 2}$ . For a random function $f : R \to R$ , we define the L² and $L^{\infty}$ -norms in terms of the expected values ${‖ f ‖}_{L^{2}}^{2} = E {\int | f (x) |^{2} dx}$ and ${‖ f ‖}_{L^{\infty}} = E {{max}_{x \in R} | f (x) |}$ , respectively. These norms reduce to their usual definitions when f is not random.

2 Warped Wavelets

Kerkyacharian and Picard (Citation2004) introduces warped wavelets to allow randomly sampled covariates in wavelet regression estimators. We employ this method to perform feature screening, separately analyzing each pair $(X_{j}, Y)$ . The data for the jth covariate is ${(X_{ij}, Y_{i})}_{i = 1}^{n}, j = 1, \dots, p$ . We fit the following model for each covariate: (1) $Y_{i} = f_{j} (X_{ij}) + ϵ_{ij}, i = 1, \dots, n,$ (1) where $ϵ_{1 j}, \dots, ϵ_{nj} \overset{iid}{\sim} N (0, σ^{2})$ , with independent X_ij and ϵ_ij . We assume that $f_{j} \in L^{2} ([a, b])$ , and one of our goals is to estimate the coefficients of its representation in a wavelet basis. For convenience, let us focus on the case where $σ^{2} = 1$ , a = 0 and b = 1.

A common assumption in wavelet regression is that $X_{ij} = i / n$ , that is, the covariates are equally spaced in the support of f_j . This assumption can be relaxed with warped wavelets, which allow X_ij to be irregularly spaced on the interval $[0, 1]$ . Assume that X_ij follows some continuous distribution with density g_j and distribution function G_j . Without loss of generality, consider that $X_{1 j} < X_{2 j} < \dots < X_{nj}$ and denote their empirical distribution function by ${\hat{G}}_{j}$ . The idea of warped wavelets is to recover the usual regular design using the fact that ${\hat{G}}_{j} (X_{ij}) = i / n$ . Plugging $X_{ij} = {\hat{G}}_{j}^{- 1} ({\hat{G}}_{j} (X_{ij}))$ in (1), we have $\begin{matrix} Y_{i} = f_{j} (X_{ij}) + ϵ_{ij} = f_{j} ({\hat{G}}_{j}^{- 1} ({\hat{G}}_{j} (X_{ij}))) + ϵ_{ij} \\ = f_{j} ({\hat{G}}_{j}^{- 1} (i / n)) + ϵ_{ij}, \end{matrix}$ which resumes the classical equally spaced design when looking at the function composition $f_{j} ({\hat{G}}_{j}^{- 1} (\cdot))$ . We shall focus on the estimators that take G_j as known to avoid some technicalities. Estimators that use the empirical distribution function ${\hat{G}}_{j}$ are also proposed by Kerkyacharian and Picard (Citation2004) and have similar properties to the estimator computed assuming G_j known. Kerkyacharian and Picard (Citation2004) show that both approaches attain (up to logarithmic factors) the minimax convergence rate from regular designs.

Another idea of warped wavelets is to take the wavelet decomposition of $f_{j} (G_{j}^{- 1} (\cdot))$ instead of $f_{j} (\cdot)$ . Consider a wavelet basis ${ψ_{lk}; l \geq - 1, k \in Z}$ where $ψ_{- 1, k}$ is a scaling function (Härdle et al. Citation1998, Equationeq. 3.4(3) $\begin{matrix} {\hat{f}}_{j} (x) = \sum_{l = - 1}^{J} \sum_{k = 0}^{2^{l} - 1} {\hat{β}}_{lk} I {| {\hat{β}}_{lk} | \geq κ t_{n}} ψ_{lk} (G_{j} (x)), \\ t_{n} = {(\frac{log n}{n})}^{1 / 2}, \end{matrix}$ (3) ). Expanding $f_{j} (G_{j}^{- 1} (.))$ in this basis we get $\begin{matrix} f_{j} (G_{j}^{- 1} (y)) = \sum_{l \geq - 1} \sum_{k \in Z} β_{lk} ψ_{lk} (y), or equivalently \\ f_{j} (x) = \sum_{l \geq - 1} \sum_{k \in Z} β_{lk} ψ_{lk} (G_{j} (x)), \end{matrix}$ where the wavelet coefficients are given by (2) $β_{lk} = \int_{0}^{1} ψ_{lk} (G_{j} (x)) f_{j} (x) g_{j} (x) dx .$ (2)

Here we omit the index j from β_lk , but later we shall write β_jlk to let it clear it is a coefficient from f_j . EquationEquation (2)(2) $β_{lk} = \int_{0}^{1} ψ_{lk} (G_{j} (x)) f_{j} (x) g_{j} (x) dx .$ (2) reduces to the usual wavelet coefficient when $g_{j} (x)$ is the uniform density. In general, the properties of the warped wavelet transform depend on weights defined by $g_{j} (x)$ and how far they are from uniform weights. We shall assume that g_j is bounded as follows:

C1There is a finite constant C > 0 such that $C^{- 1} < g_{j} (x) < C$ for all $x \in [0, 1]$ and all $j \in {1, \dots, p}$ .

Under C1, the density g_j in (4) defines a weighting scheme that belongs to a class known as Muckenhhoupt weights (Kerkyacharian and Picard Citation2004, sec. 4.1). This condition is sufficient to guarantee good convergence rates for warped wavelet estimators. Moreover, C1 has already been used for nonparametric feature screening (Fan, Feng, and Song Citation2011, condition B). We also assume that the following conditions hold:

C2The scaling function $ψ_{- 1, k}$ and wavelet functions ψ_lk , $l \geq 0$ , have compact support.
C3The functions $f_{j} : [0, 1] \to R, j = 1, \dots, p$ , are periodic and there is an absolute constant C > 0 such that $| | f_{j} | |_{L^{\infty}} < C$ for all j.

These are common assumptions for warped wavelets (Porto et al. Citation2016; Gómez, Porto, and Morettin Citation2021). C2 allows us to write the wavelet transform as finite sums, and it is satisfied by Daubechies wavelets, for example (Härdle et al. Citation1998, p. 127). The periodicity assumed in C3 permits us to apply the wavelet transform for functions with support in $[0, 1]$ . See (Vidakovic Citation1999, sec. 5.6) for other transforms of signals in an interval. This assumption also justifies periodization heuristics frequently used in practice to deal with boundary handling and sample sizes that are not a power of two (Ogden Citation1997). The assumption that f_j is bounded is technical and is used later in Proposition 1.

We consider the warped wavelet estimator given by (3) $\begin{matrix} {\hat{f}}_{j} (x) = \sum_{l = - 1}^{J} \sum_{k = 0}^{2^{l} - 1} {\hat{β}}_{lk} I {| {\hat{β}}_{lk} | \geq κ t_{n}} ψ_{lk} (G_{j} (x)), \\ t_{n} = {(\frac{log n}{n})}^{1 / 2}, \end{matrix}$ (3)

where $κ > 0$ is a tuning parameter and J satisfies $2^{J} \leq C t_{n}^{- 1}$ , for some absolute constant C > 0. The empirical coefficients are (4) ${\hat{β}}_{lk} = \frac{1}{n} \sum_{i = 1}^{n} ψ_{lk} (G_{j} (X_{ij})) Y_{i} .$ (4)

Note that ${\hat{f}}_{j} (x)$ in (3) is computed via hard thresholding. Other thresholding approaches can also be applied (Vidakovic Citation1999, chap. 6). See Fan and Li (Citation2001) for a detailed analysis of different thresholding options, such as SCAD and soft thresholding. The following result, due to (Kerkyacharian and Picard Citation2004, Prop. 3), will be useful to deduce theoretical guarantees for the warped wavelet feature screening.

Proposition 1.

(Kerkyacharian and Picard Citation2004, p. 1070) Assume that conditions C1–C3 hold and let $α > 1$ . Then, there exist constants $C_{α}$ and $C_{α}^{'}$ , and for any $γ > 0$ there exists a constant $κ_{γ}$ , with $\begin{matrix} E (| {\hat{β}}_{lk} - β_{lk} |^{α}) \leq C_{α} \frac{1 + {‖ f_{j} ‖}_{L^{\infty}}^{α}}{n^{α / 2}}, for 2^{l} \leq n, \\ P (| {\hat{β}}_{lk} - β_{lk} | > κ \sqrt{\frac{log n}{n}}) \leq C_{α}^{'} n^{- γ α}, \\ for κ \geq κ_{γ}, 2^{l} \leq \sqrt{\frac{n}{log n}} . \end{matrix}$

3 Feature Screening via Warped Wavelet Regression

In this section, we use the properties of the warped wavelets estimator to establish theoretical guarantees for our Algorithm 1. Assume we can fit model (1) to all covariates but that f_j is null for most of them, that is, only a few X_j are relevant to predict Y. Let $β_{j} = {β_{jlk}, 0 \leq l \leq J, 0 \leq k \leq 2^{l} - 1}$ be the wavelet coefficients of f_j as in (2). Notice that if f_j is null, then $β_{j} = 0$ . Therefore, we represent the set of relevant features in terms of their wavelet coefficients as $M_{*} = {j; 1 \leq j \leq p, ‖ β_{j} ‖ > 0}$ .

Our goal is to find a set $\hat{M} \subset {1, \dots, p}$ that contains $M_{*}$ with high probability and has cardinality much lower than p. However, the functions f_j are unknown, and we must use estimators ${\hat{f}}_{j}$ to screen the relevant features. This approximation works well if the wavelet transform of ${\hat{f}}_{j}$ is close enough to that of f_j . Let ${\hat{β}}_{j}$ be the wavelet coefficients of ${\hat{f}}_{j}$ estimated as in (4). Then, the proposed set of screened variables is as follows: $\hat{M} = {j; | | {\hat{β}}_{j} | | > ν_{n}, 1 \leq j \leq p},$ where $ν_{n} > 0$ . The key point to recover $j \in M_{*}$ is that the energy of f_j must be higher than the noise level, that is, $| | {\hat{β}}_{j} | |$ has to be higher than the norm of wavelet coefficients of pure noise. In this case, we can screen the correct variables with high probability if we choose a suitable ν_n . The next proposition formalizes this idea.

Proposition 2.

Let κ, $C'$ and γ be defined as in Proposition 1, with $γ > 1 / 4$ . Assume that C1–C3 hold and that ${min}_{j \in M_{*}} ‖ β_{j} ‖ > 2 ν_{n}$ , where $ν_{n} = 2^{(J + 1) / 2} \sqrt{κ} {((log n) / n)}^{1 / 4}$ . If $\hat{M}$ is computed with this ν_n , then (5) $P (\hat{M} \supset M_{*}) \geq 1 - \frac{s 2^{J + 1} C'}{n^{2 γ}}, s = | M_{*} | .$ (5)

Additionally, for fixed s and J, $P (\hat{M} \supset M_{*}) \to 1$ as $n \to \infty$ .

Proof.

See Appendix B. □

Another important aspect of any screening set is that its cardinality must be much lower than the dimension p. Proposition 3 shows that, on average, the number of elements on $\hat{M}$ is much lower than p.

Proposition 3.

Assume that all conditions of Proposition 2 hold and that $\hat{M}$ is defined with $ν_{n} = 2^{(J + 1) / 2} \sqrt{κ} {((log n) / n)}^{1 / 4}$ . Then, the expected cardinality of $\hat{M}$ satisfies $E (| \hat{M} |) \leq s + \frac{2^{J + 1} C' (p - s)}{n^{2 γ}}, where s = | M_{*} | .$

Proof.

See Appendix C. □

4 Simulation

This section presents simulation results of Algorithm 1. Similarly as in Example 4 of Hall and Miller (Citation2009), we simulate data from the model (6) $Y_{i} = \sum_{j = 1}^{5} \frac{6 - j}{5} (f (X_{ij}) + f (X_{i, j + 5})) + ϵ_{i},$ (6) where ϵ_i follows a standard normal distribution. The function $f : [0, 1] \to R$ is set as one of the test functions of Donoho and Johnstone (Citation1994) (see ) or the identity function f(x) = x. The pairs $(X_{ij}, X_{i, j + 5})$ are independently generated as (7) $\begin{matrix} X_{ij} \sim U (0, 1), X_{i, j + 5} = \frac{ρ X_{ij} + U_{i} \sqrt{1 - ρ^{2}}}{ρ + \sqrt{1 - ρ^{2}}}, \\ j = 1, \dots, 5, i = 1, \dots, n, \end{matrix}$ (7) where $ρ = 0.85$ and $U_{1}, \dots, U_{n} \overset{iid}{\sim} Unif (0, 1)$ . This generating scheme guarantees that some covariates are correlated, with $corr (X_{ij}, X_{i, j + 5}) = ρ$ . When true covariates X_j and $X_{j + 5}$ are highly correlated, many variable selection methods tend to choose only one of them. This simulation study also checks the ability of screening methods to select both.

Fig. 1 Test functions of Donoho and Johnstone (Citation1994).

We set the dimension as p = 5000 and consider sample sizes $n \in {128, 256, 512}$ . The relevant covariates are $X_{1}, \dots, X_{10}$ , that is, $M_{*} = {1, \dots, 10}$ . We also generate ${X_{i, j}}_{j = 11}^{15}$ correlated with ${X_{i, j}}_{j = 1}^{5}$ as outlined in (7). This way, we can analyze a case where nonsupport and support covariates are correlated. The remaining $X_{16}, \dots, X_{5000}$ have all entries sampled independently from $Unif (0, 1)$ . This section presents the numbers of true covariates detected by different methods among their 15 largest screening utility scores. In the supplementary material, we present the number of pairs of support covariates ${(X_{j}, X_{j + 5})}_{j = 1}^{5}$ that are selected together, and the number of selected nonsupport covariates ${X_{j}}_{j = 11}^{15}$ . All results are averages from 500 replicates.

Our Algorithm 1 is applied with the Haar wavelet basis and a single decomposition level, which are the default choices in our codes.Footnote¹ This wavelet method is compared with the generalized correlation screening of Hall and Miller (Citation2009) implemented with cubic splines. We also apply the model-free methods SIRS (Zhu et al. Citation2011) and DC-SIS (Li, Zhong, and Zhu Citation2012), which are implemented in the R package VariableScreening (Li, Huang, and Dziak Citation2022).

Simulation results are presented in . We note that, albeit no method outperforms all others in all scenarios, the warped wavelets provided competitive results. When f is the Blocks or Heavisine functions, generalized correlation with splines gives the best results, and SIRS and DC-SIS have the best performances when f is the identity function. Apart from DC-SIS, all considered methods take on average only a few seconds to run. In general, wavelet methods have competitive results, outperforming the other approaches for the functions Bumps and Doppler, for example. This simulation highlights the good adaptation properties of wavelets for link functions f_j with many localized traits like peaks and sharp oscillations.

Table 1 Average number of true features (#feat) selected among the 15 most relevant according to each screening method, and average runtime in seconds (time).

Download CSV Display Table

5 Application on Ozone Concentration

Ozone is a pollutant related to various respiratory problems, and its efficient monitoring greatly impacts public health (Zhang, Wei, and Fang Citation2019). However, various chemical and meteorological factors affect ozone concentration, making its monitoring a challenging task (Monks et al. Citation2015).

Our goal in this application is to identify the main factors that influence ozone concentration in São Paulo–Brazil (Santolaya et al. Citation2019) through wavelet feature screening. The relations between ozone level and related variables are known to be nonlinear for a long time (Lin, Trainer, and Liu Citation1988), which makes nonparametric screening methods appropriate tools for variable selection. We use the R package qualR (Gavidia-Calderón, Schuch, and de Fatima Andrade Citation2022) to collect data from the Pinheiros air quality station in São Paulo. The data consists of daily observations of 11 variables collected from January 2004 until December 2022. Due to many missing values, we analyze monthly averages and use only observations from April 2006 until September 2010. Moreover, we compute the cosine and sine of circular variables related to wind direction. The time series are displayed in . Notice that most time series present a seasonal pattern; thus, the observations are not independent.

Fig. 2 Time series of the meteorological variables recorded at Pinheiros air quality station in São Paulo. The abbreviations wd and gwd refer to wind direction and global wind direction, respectively.

The properties of feature screening methods discussed in this article assume independent observations. Thus, we computed differences in the variables shown in to reduce their dependence on past values. After plots of autocorrelations and partial autocorrelations, we employ the following:

Differences at lags 1 and 11 for the variables ozone, carbon monoxide, nitrogen monoxide, and temperature,
Differences at the lag 1 for the variables nitrogen dioxide, humidity, wind speed, $cos$ (wd), and $sin$ (wd), where wd stands for wind direction,
No transformation for the variables $cos$ (gwd), and $sin$ (gwd), where gwd stands for global wind direction.

The temporal dependence of all the variables was reduced considerably after these transformations, as shown in the supplementary material.

The next step uses the differenced time series to identify the most important features to model ozone concentration. We apply the generalized correlation with cubic splines, SIS, DC–SIS, and SIRS. The warped wavelet feature screening is applied using Daubechies’ extremal phase wavelets with filter number six.Footnote² presents the five most relevant variables selected by each method. All methods identify the following features as most relevant: humidity, wind speed, and $cos$ (wd). Hence, there is a consensus that humidity and wind direction and speed are relevant variables to model ozone concentration. This conclusion is in accordance with well-known results in the literature (Thompson et al. Citation2001, Sec 2.2.2). Wind direction and speed are associated with the transport of ozone in the atmosphere (Camalier, Cox, and Dolwick Citation2007), and weather factors such as humidity and temperature are frequently used to study peaks of ozone concentration (Oliveira, Drumond, and Rizzo Citation2022).

Table 2 The five most relevant variables according to different feature screening methods.

Display Table

shows curves estimated using warped wavelets for some features. The vertical axis represents the differenced time series of ozone concentration minus its average. We subtract the average $\bar{Y}$ of this variable before estimating the curves (Hastie, Tibshirani, and Friedman Citation2009, p. 298). The covariates are the differenced time series of features normalized to the interval (0, 1). We see in that Humidity, Wind Speed, and $cos$ (wd) present a nonlinear relationship with the ozone concentration. Understanding the importance of these features and how they influence ozone concentration can aid scientists, for example, in developing simulation models (Sánchez-Ccoyllo et al. Citation2006; Franco et al. Citation2019).

Fig. 3 Plots of the response variable Y (differenced and centered ozone concentrations) and each covariate (differenced and normalized to the unit interval). The curves are warped wavelet regression estimates. The observations are considerably noisy, and only a few features display a noticeable effect on Y, like humidity and wind speed.

6 Application on Epilepsy

Epilepsy is a brain disorder characterized by a predisposition and occurrence of multiple seizures (Fisher et al. Citation2014). A common form of this disorder is temporal lobe epilepsy (TLE), which can happen on one side of the brain. The progress of TLE is associated with abnormalities in the brain’s gray and white matter (Yasuda et al. Citation2010). Many studies focus on characteristics of white matter to study some aspects of TLE. One particular problem is to understand brain degeneration in epileptic patients. Comparisons with healthy controls have shown that epileptic patients seem to have older brains (Hwang et al. Citation2020). Further studies have been conducted to understand the relation between brain structural age and neurological measurements under the TLE disorder (Yasuda et al. Citation2023).

In this section, we apply feature screening on a dataset from a study about TLE. The data was collected by the Neuroimaging Laboratory at the University of Campinas-Brazil and consists of 88 features and ages obtained from 226 individuals (89 controls and 137 patients). The patients are divided by the side of their TLE: 76 have left TLE (LTLE) and 61 have right TLE (RTLE). The features are brain white matter measurements acquired via magnetic resonance imaging. We analyze pre-processed measurements in the interval $[0, 1]$ . Further details about the dataset can be checked in Yasuda et al. (Citation2023).

Our task is to identify which features can be used to model the patients’ age. The research hypothesis is that epileptic patients have feature values similar to older healthy controls. presents boxplots of the ages of each group of individuals. This figure shows that the distribution of ages is similar in the three groups. We applied feature screening methods separately for each group and checked which features were selected the most. Later, we analyze the estimated ages for two specific features.

Fig. 4 Boxplot of the observed ages for each group in the epilepsy study.

We applied the generalized correlation with cubic splines, SIS, DC-SIS, and SIRS. Our Algorithm 1 is applied using Daubechies’ extremal phase wavelets with filter number six.Footnote³ presents the five features selected the most by each screening method. The methods have consistent results and identify similar sets of features. The features related to BODY and Genu are the most selected. These brain structures were also highlighted in previous studies (Yasuda et al. Citation2023, Tables 2 and 5).

Table 3 The five most relevant variables according to different feature screening methods for each group of participants in the epilepsy study.

Download CSV Display Table

shows that the two features selected the most for the three groups are BODY_FA and Genu_FA. presents scatterplots of these features versus the observed ages. The warped wavelet regression was applied to fit a smooth curve for each group of patients and control. Both plots in show that the curves of control individuals tend to be above the curves of RTLE and LTLE patients. This characteristic indicates that measurements of TLE patients are observed in controls with older ages, illustrating the degeneracy caused by the TLE disorder. Yasuda et al. (Citation2023) reached similar conclusions using linear regression. Hence, the feature screening and wavelet analysis corroborate their results.

Fig. 5 Scatterplot of observed ages versus the covariates BODY_FA (right) and Genu_FA (left). The symbols $\cdot$ , ×, and + correspond to control, LTLE, and RTLE groups, respectively. The solid curves were estimated via warped wavelet regression for each group of patients.

Fig. 5 Scatterplot of observed ages versus the covariates BODY_FA (right) and Genu_FA (left). The symbols ·, ×, and + correspond to control, LTLE, and RTLE groups, respectively. The solid curves were estimated via warped wavelet regression for each group of patients.

7 Conclusion

Feature screening is a common practice in various scientific applications. Scientists are frequently looking for relevant relationships between variables of interest and candidate features, and the correlation is probably the most used statistic to screen such relevant relations. However, high dimensionality and nonlinearity are common characteristics in practice and pose challenges to classical approaches in data analysis. High-dimensional data increases the chance of observing spurious correlations, and many important features may have a nonlinear relation with the variable of interest. In this context, many screening metrics have been proposed in the literature to cope with large dimensions and allow for flexible relations between variables. We propose a wavelet approach for performing feature screening. Wavelet methods are well-known tools in the literature of sparse models and can estimate various types of curves. The proposed method is based on the warped wavelet regression, and we prove that it can achieve the sure screening property under suitable conditions. Simulation studies also show that the proposed method is competitive with other feature screening approaches in the literature. Additionally, the wavelet approach tends to outperform other methods when the link function between a feature and a response variable have sharp oscillations and peaks. Such an outcome agrees with the well-known ability of wavelets to adapt to signals with local nonlinearities. Our work also provides two applications in real data from relevant scientific problems: the relation of ozone concentration and atmospheric variables and the relation between brain white matter measurements and the age of patients with temporal lobe epilepsy. In both studies, the wavelet feature screening method provides results that are in line with other screening methods. Another application example in the supplementary material shows that wavelet feature screening also performs well in a case where $p ≫ n$ . Overall, all the screening methods can select features that have relations reported in the literature of their respective fields. Therefore, we believe the warped wavelet screening method is a powerful tool for scientists dealing with many linear and nonlinear features.

Supplementary Materials

Full simulation results:The complete results of the simulation presented in Section 4.

Application in high-dimensional data:A feature screening application on the cardiomiopathy data analyzed by Hall and Miller (Citation2009) and Li, Zhong, and Zhu (Citation2012).

Summary of applications: R markdown files outlining the data analysis presented in Sections 5 and 6.

Codes:All R codes used for simulation studies and to analyze the datasets about ozone concentration and cardiomyopathy.

Supplemental material

supplement_waveletFeatureScreening.pdf

Download PDF (232.3 KB)

supplement (28).zip

Download Zip (12.2 MB)

Acknowledgments

We thank Paulo Artaxo and Luciana Rizzo for suggesting the analysis of ozone concentration data, and Clarissa Yasuda for suggesting the application about epilepsy. We also thank the Editor, Associate Editor, and two referees for all constructive comments about our article. We acknowledge Clarissa Yasuda and CEPID BRAINNFootnote⁴ (FAPESP 2013/07559-3) for generously providing the epilepsy data.

Disclosure Statement

The authors report that there are no competing interests to declare.

Additional information

Funding

This research was supported by the São Paulo Research Foundation (FAPESP) under Grants 2018/04654-9, 2021/04513-9 and 2023/02538-0; and the Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) under Grant 310991/2020-0. RF acknowledges the support provided by the Morá Miriam Rozen Gerber Fellowship for Brazilian postdocs at the Weizmann Institute of Science.

Notes

1 The supplementary material shows that similar results are obtained with different wavelet basis, and that Algorithm 1 outperforms a wavelet screening applied with the method of Kovac and Silverman (Citation2000), which is implemented as the irregwd function in the R package wavethresh (Nason et al. Citation2022).

2 See the wavethresh manual for details (Nason et al. Citation2022).

3 See the wavethresh manual for details (Nason et al. Citation2022).

4 https://www.brainn.org.br/

References

Barut, E., Fan, J., and Verhasselt, A. (2016), “Conditional Sure Independence Screening,” Journal of the American Statistical Association, 111, 1266–1277. DOI: 10.1080/01621459.2015.1092974.
PubMed Web of Science ®Google Scholar
Camalier, L., Cox, W., and Dolwick, P. (2007), “The Effects of Meteorology on Ozone in Urban Areas and their Use in Assessing Ozone Trends,” Atmospheric Environment, 41, 7127–7137. DOI: 10.1016/j.atmosenv.2007.04.061.
Web of Science ®Google Scholar
Donoho, D. L., and Johnstone, J. M. (1994), “Ideal Spatial Adaptation by Wavelet Shrinkage,” Biometrika, 81, 425–455. DOI: 10.1093/biomet/81.3.425.
Web of Science ®Google Scholar
Fan, J., Feng, Y., and Song, R. (2011), “Nonparametric Independence Screening in Sparse Ultra-High-Dimensional Additive Models,” Journal of the American Statistical Association, 106, 544–557. DOI: 10.1198/jasa.2011.tm09779.
PubMed Web of Science ®Google Scholar
Fan, J., and Li, R. (2001), “Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties,” Journal of the American Statistical Association, 96, 1348–1360. DOI: 10.1198/016214501753382273.
Web of Science ®Google Scholar
Fan, J., Li, R., Zhang, C.-H., and Zou, H. (2020), Statistical Foundations of Data Science, Boca Raton: CRC Press.
Google Scholar
Fan, J., and Lv, J. (2008), “Sure Independence Screening for Ultrahigh Dimensional Feature Space,” Journal of the Royal Statistical Society, Series B, 70, 849–911. DOI: 10.1111/j.1467-9868.2008.00674.x.
PubMed Web of Science ®Google Scholar
Fisher, R. S., Acevedo, C., Arzimanoglou, A., Bogacz, A., Helen Cross, J., Elger, C. E., et al. (2014), “ILAE Official Report: A Practical Clinical Definition of Epilepsy,” Epilepsia, 55, 475–482. DOI: 10.1111/epi.12550.
PubMed Web of Science ®Google Scholar
Franco, D. M. P., Andrade, M. d. F., Ynoue, R. Y., and Ching, J. (2019), “Effect of Local Climate Zone (LCZ) Classification on Ozone Chemical Transport Model Simulations in Sao Paulo, Brazil,” Urban Climate, 27, 293–313. DOI: 10.1016/j.uclim.2018.12.007.
Web of Science ®Google Scholar
Gavidia-Calderón, M., Schuch, D., and de Fatima Andrade, M. (2022), qualR: An R Package to Download São Paulo and Rio de Janeiro Air Pollution Data, R package version 0.9.7.
Google Scholar
Gómez, L. M., Porto, R. F., and Morettin, P. A. (2021), “Nonparametric Regression with Warped Wavelets and Strong Mixing Processes,” Annals of the Institute of Statistical Mathematics, 73, 1203–1228. DOI: 10.1007/s10463-021-00789-0.
Web of Science ®Google Scholar
Hall, P., and Miller, H. (2009), “Using Generalized Correlation to Effect Variable Selection in Very High Dimensional Problems,” Journal of Computational and Graphical Statistics, 18, 533–550. DOI: 10.1198/jcgs.2009.08041.
Web of Science ®Google Scholar
Härdle, W., Kerkyacharian, G., Picard, D., and Tsybakov, A. (1998), Wavelets, Approximation, and Statistical Applications, New York: Springer.
Google Scholar
Hastie, T., Tibshirani, R., and Friedman, J. (2009), The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Vol. 2), New York: Springer.
Google Scholar
Hwang, G., Hermann, B., Nair, V. A., Conant, L. L., Dabbs, K., Mathis, J., et al. (2020), “Brain Aging in Temporal Lobe Epilepsy: Chronological, Structural, and Functional,” NeuroImage: Clinical, 25, 102183. DOI: 10.1016/j.nicl.2020.102183.
PubMed Web of Science ®Google Scholar
Kerkyacharian, G., and Picard, D. (2004), “Regression in Random Design and Warped Wavelets,” Bernoulli, 10, 1053–1105. DOI: 10.3150/bj/1106314850.
Web of Science ®Google Scholar
Kovac, A., and Silverman, B. W. (2000), “Extending the Scope of Wavelet Regression Methods by Coefficient-Dependent Thresholding,” Journal of the American Statistical Association, 95, 172–183. DOI: 10.1080/01621459.2000.10473912.
Web of Science ®Google Scholar
Li, R., Huang, L., and Dziak, J. (2022), VariableScreening: High-Dimensional Screening for Semiparametric Longitudinal Regression, R package version 0.2.1.
Google Scholar
Li, R., Zhong, W., and Zhu, L. (2012), “Feature Screening via Distance Correlation Learning,” Journal of the American Statistical Association, 107, 1129–1139. DOI: 10.1080/01621459.2012.695654.
PubMed Web of Science ®Google Scholar
Lin, X., Trainer, M., and Liu, S. (1988), “On the Nonlinearity of the Tropospheric Ozone Production,” Journal of Geophysical Research, 93, 15879–15888.
Web of Science ®Google Scholar
Mai, Q., and Zou, H. (2013), “The Kolmogorov Filter for Variable Screening in High-Dimensional Binary Classification,” Biometrika, 100, 229–234. DOI: 10.1093/biomet/ass062.
Web of Science ®Google Scholar
Monks, P. S., Archibald, A. T., Colette, A., Cooper, O., Coyle, M., Derwent, R., et al. (2015), “Tropospheric Ozone and its Precursors from the Urban to the Global Scale from Air Quality to Short-Lived Climate Forcer,” Atmospheric Chemistry and Physics, 15, 8889–8973. DOI: 10.5194/acp-15-8889-2015.
Web of Science ®Google Scholar
Morettin, P. A., Pinheiro, A., and Vidakovic, B. (2017), Wavelets in Functional Data Analysis, Cham: Springer.
Google Scholar
Morettin, P. A., and Porto, R. F. (2022), “Estimation of Nonparametric Regression Models by Wavelets,” São Paulo Journal of Mathematical Sciences, 16, 539–565. DOI: 10.1007/s40863-021-00240-5.
Web of Science ®Google Scholar
Nason, G., Barber, S., Downie, T., Frylewicz, P., Kovac, A., Ogden, T., and Silverman, B. (2022), Package ‘wavethresh’, R package version 4.7.2.
Google Scholar
Ogden, R. T. (1997), Essential Wavelets for Statistical Applications and Data Analysis, Boston: Birkhäuser.
Google Scholar
Oliveira, M., Drumond, A., and Rizzo, L. (2022), “Air Pollution Persistent Exceedance Events in the Brazilian Metropolis of Sao Paulo and Associated Surface Weather Patterns,” International Journal of Environmental Science and Technology, 19, 9495–9506. DOI: 10.1007/s13762-021-03778-1.
Web of Science ®Google Scholar
Porto, R., Morettin, P., Percival, D., and Aubin, E. (2016), “Wavelet Shrinkage for Regression Models with Random Design and Correlated Errors,” Brazilian Journal of Probability and Statistics, 30, 614–652. DOI: 10.1214/15-BJPS296.
Web of Science ®Google Scholar
Reiss, P. T., Goldsmith, J., Shang, H. L., and Ogden, R. T. (2017), “Methods for Scalar–on–Function Regression,” International Statistical Review, 85, 228–249. DOI: 10.1111/insr.12163.
PubMed Web of Science ®Google Scholar
Sánchez-Ccoyllo, O. R., Ynoue, R. Y., Martins, L. D., and Andrade, M. d. F. (2006), “Impacts of Ozone Precursor Limitation and Meteorological Variables on Ozone Concentration in Sao Paulo, Brazil,” Atmospheric Environment, 40, 552–562. DOI: 10.1016/j.atmosenv.2006.04.069.
Web of Science ®Google Scholar
Santolaya, C., Oliveira, M. C. Q. D., Rizzo, L. V., and Miraglia, S. G. E. K. (2019), “Contribution of Chemical and Meteorological Factors to Tropospheric Ozone Formation in São Paulo, Brazil,” Brazilian Journal of Environmental Sciences, 54, 90–104.
Google Scholar
Segal, M. R., Dahlquist, K. D., and Conklin, B. R. (2003), “Regression Approaches for Microarray Data Analysis,” Journal of Computational Biology, 10, 961–980. DOI: 10.1089/106652703322756177.
PubMed Web of Science ®Google Scholar
Thompson, M. L., Reynolds, J., Cox, L. H., Guttorp, P., and Sampson, P. D. (2001), “A Review of Statistical Methods for the Meteorological Adjustment of Tropospheric Ozone,” Atmospheric Environment, 35, 617–630.
Web of Science ®Google Scholar
Vidakovic, B. (1999), Statistical Modeling by Wavelets, New York: Wiley.
Google Scholar
Xue, J., and Liang, F. (2017), “A Robust Model-Free Feature Screening Method for Ultrahigh-Dimensional Data,” Journal of Computational and Graphical Statistics, 26, 803–813. DOI: 10.1080/10618600.2017.1328364.
PubMed Web of Science ®Google Scholar
Yasuda, C. L., Valise, C., Saúde, A. V., Pereira, A. R., Pereira, F. R., Costa, A. L. F., et al. (2010), “Dynamic Changes in White and Gray Matter Volume are Associated with Outcome of Surgical Treatment in Temporal Lobe Epilepsy,” Neuroimage, 49, 71–79. DOI: 10.1016/j.neuroimage.2009.08.014.
PubMed Web of Science ®Google Scholar
Yasuda, C. L., Ramalho Pimentel-Silva, L., Beltramini, G. C., Liu, M., de Campos, B. M., Coan, A. C., Beaulieu, C., Cendes, F., and Gross, D. W. (2023), “Brain Volumes and White Matter Diffusion Across the Adult Lifespan in Temporal Lobe Epilepsy,” Annals of Clinical and Translational Neurology, 10, 1106–1118. DOI: 10.1002/acn3.51793.
PubMed Web of Science ®Google Scholar
Yousuf, K. (2018), “Variable Screening for High Dimensional Time Series,” Electronical Journal of Statistics, 12, 667–702.
Web of Science ®Google Scholar
Zhang, J., Wei, Y., and Fang, Z. (2019), “Ozone Pollution: A Major Health Hazard Worldwide,” Frontiers in Immunology, 10, 2518. DOI: 10.3389/fimmu.2019.02518.
PubMed Web of Science ®Google Scholar
Zhao, Y., Chen, H., and Ogden, R. T. (2015), “Wavelet-based Weighted LASSO and Screening Approaches in Functional Linear Regression,” Journal of Computational and Graphical Statistics, 24, 655–675. DOI: 10.1080/10618600.2014.925458.
Web of Science ®Google Scholar
Zhu, L.-P., Li, L., Li, R., and Zhu, L.-X. (2011), “Model-Free Feature Screening for Ultrahigh-Dimensional Data,” Journal of the American Statistical Association, 106, 1464–1475. DOI: 10.1198/jasa.2011.tm10563.
PubMed Web of Science ®Google Scholar

Appendix A.

Auxiliary Lemma

Lemma 1. Assume that C1–C3 hold. Let κ,

C'

, and γ be defined in Proposition 1, with

γ > 1 / 2

. Therefore, it follows that

(8)

P (max_{j \in M_{*}} | | | {\hat{β}}_{j} | | - | | β_{j} | | | > ν_{n}) \leq \frac{s 2^{J + 1} C'}{n^{2 γ}}

(8) and

(9)

\sum_{j \notin M_{*}} P (| | {\hat{β}}_{j} | | > ν_{n}) \leq \frac{(p - s) 2^{J + 1} C'}{n^{2 γ}},

(9) where

s = | M_{*} |

and

ν_{n} = 2^{(J + 1) / 2} \sqrt{κ} {((log n) / n)}^{1 / 4}

Proof.

Let us prove (8) first. By the inequality $| | | {\hat{β}}_{j} | | - | | β_{j} | | | \leq | | {\hat{β}}_{j} - β_{j} | |$ and the union bound, it follows that $P (max_{j \in M_{*}} | | | {\hat{β}}_{j} | | - | | β_{j} | | | > ν_{n}) \leq \sum_{j \in M_{*}} P (| | {\hat{β}}_{j} - β_{j} | | > ν_{n}) .$

We shall use the following event inclusion (10) $\begin{matrix} {| | {\hat{β}}_{j} - β_{j} | | > ν_{n}} = {\sum_{l = 0}^{J} \sum_{k = 0}^{2^{l} - 1} {({\hat{β}}_{jlk} - β_{jlk})}^{2} > ν_{n}^{2}} \\ \subset \cup_{l = 0}^{J} \cup_{k = 0}^{2^{l} - 1} {{({\hat{β}}_{jlk} - β_{jlk})}^{2} > \frac{ν_{n}^{2}}{2^{J + 1}}} . \end{matrix}$ (10)

Notice that (10) holds for all $j = 1, \dots, p$ . Then, by the union bound and since $ν_{n}^{2} = 2^{J + 1} κ \sqrt{(log n) / n}$ , $\begin{matrix} P (max_{j \in M_{*}} | | | {\hat{β}}_{j} | | - | | β_{j} | | | > ν_{n}) \\ \leq \sum_{j \in M_{*}} \sum_{l = 0}^{J} \sum_{k = 0}^{2^{l} - 1} P (| {\hat{β}}_{jlk} - β_{jlk} |^{2} > ν_{n}^{2} / 2^{J + 1}) \\ = \sum_{j \in M_{*}} \sum_{l = 0}^{J} \sum_{k = 0}^{2^{l} - 1} P (| {\hat{β}}_{jlk} - β_{jlk} | > κ \sqrt{\frac{log n}{n}}) . \end{matrix}$

Recall that $M_{*}$ has cardinality s and notice that there are $2^{J + 1}$ pairs (l, k) in the sum above. Hence, (8) follows by applying Proposition 1 with α = 2.

Inequality (9) can be proved similarly. Indeed, since $β_{j} = 0$ for $j \notin M_{*}$ , then $\begin{matrix} \sum_{j \notin M_{*}} P (| | {\hat{β}}_{j} | | > ν_{n}) \leq \sum_{j \notin M_{*}} \sum_{l = 0}^{J} \sum_{k = 0}^{2^{l} - 1} P (| {\hat{β}}_{jlk} | > κ \sqrt{\frac{log n}{n}}) \\ \leq \frac{(p - s) 2^{J + 1} C'}{n^{2 γ}}, \end{matrix}$ where the first inequality is due to the union bound and (10), and the second inequality is due to Proposition 1. □

Appendix B.

Proof of Proposition 2

Proof.

Consider the event $A_{n} = {{max}_{j \in M_{*}} | | | {\hat{β}}_{j} | | - | | β_{j} | | | \leq ν_{n}}$ . Let us show that if ${min}_{j \in M_{*}} | | β_{j} | | > 2 ν_{n}$ , then $A_{n} \subset {\hat{M} \supset M_{*}}$ . Consider that A_n holds. It implies that $| | {\hat{β}}_{j} | | \geq | | β_{j} | | - ν_{n}$ for all $j \in M_{*}$ . Since $| | β_{j} | | - ν_{n} \geq ν_{n}$ by assumption, then $| | {\hat{β}}_{j} | | \geq ν_{n}$ , that is, all $j \in M_{*}$ also belong to $\hat{M}$ . Hence, $P (\hat{M} \supset M_{*}) \geq 1 - P (A_{n}^{c})$ . By (8) $P (A_{n}^{c}) = P (max_{j \in M_{*}} | | | {\hat{β}}_{j} | | - | | β_{j} | | | > ν_{n}) \leq s 2^{J + 1} C' n^{- 2 γ},$ where s is the cardinality of $M_{*}$ . This concludes the proof of (5).

Recall from the comment after (3) that $2^{J} \leq C \sqrt{n / log n}$ . Thus, $P (A_{n}^{c}) \leq 2 sC n^{1 / 2 - 2 γ} / \sqrt{log n}$ , where C is an absolute constant. Since $γ > 1 / 4$ by assumption, then $P (A_{n}^{c}) \to 0$ as $n \to \infty$ with s and J fixed. This concludes the proof of Proposition 2. □

Appendix C.

Proof of Proposition 3

Proof.

Recall that $s = | M_{*} |$ . We can write the cardinality $| \hat{M} |$ as a sum of indicator functions, which leads to the following inequality: $| \hat{M} | = \sum_{j = 1}^{p} I {| | {\hat{β}}_{j} | | > ν_{n}} \leq s + \sum_{j \notin M_{*}} I {| | {\hat{β}}_{j} | | > ν_{n}} .$

Notice that $β_{j} \equiv 0$ for $j \notin M_{*}$ . Then, it follows from (9) that $E {\sum_{j \notin M_{*}} I {| | {\hat{β}}_{j} | | > ν_{n}}} = \sum_{j \notin M_{*}} P {| | {\hat{β}}_{j} | | > ν_{n}} \leq \frac{(p - s) 2^{J + 1} C'}{n^{2 γ}} .$

□

Wavelet Feature Screening

Abstract