
A Generalization Gap Estimation for Overparameterized Models via the Langevin Functional Variance

Pages 1287-1295 | Received 31 May 2022, Accepted 27 Mar 2023, Published online: 17 May 2023

Abstract

This article discusses the estimation of the generalization gap, the difference between generalization performance and training performance, for overparameterized models including neural networks. We first show that a functional variance, a key concept in defining a widely-applicable information criterion, characterizes the generalization gap even in overparameterized settings where a conventional theory cannot be applied. As the computation of the functional variance is expensive for overparameterized models, we propose an efficient approximation of the functional variance, the Langevin approximation of the functional variance (Langevin FV). This method leverages only the first-order gradient of the squared loss function, without referencing the second-order gradient; this ensures that the computation is efficient and the implementation is consistent with gradient-based optimization algorithms. We demonstrate the Langevin FV numerically by estimating the generalization gaps of overparameterized linear regression and nonlinear neural network models, each containing more than a thousand parameters. Supplementary materials for this article are available online.

1 Introduction

The great success of deep neural networks (LeCun, Bengio, and Hinton 2015; Goodfellow, Bengio, and Courville 2016) has been reported in many applied fields, such as natural language processing, image processing, and the natural sciences, and it has altered our understanding of generalization in data science. Generalization is a model’s ability to adapt properly to new data drawn from the same distribution as the training data. A classical discipline for generalization is the bias-variance dilemma: the richness of a model reduces the modeling bias but encourages fitting spurious patterns, leading to poor generalization performance. However, the modern practice of deep neural networks yields counterexamples to this discipline: deep neural networks are capable of exactly fitting the training data (known as interpolation) through the use of an overabundance of parameters (known as over-parameterization) while accurately predicting test data (Chiyuan et al. 2017; Belkin, Hsu, and Mitra 2018). Several such surprising phenomena have been reported: the double descent phenomenon (Belkin, Hsu, and Mitra 2018; Mario et al. 2020; Nakada and Imaizumi 2021; Hastie et al. 2022) and the multiple descent phenomenon (Adlam and Pennington 2020; d’Ascoli, Sagun, and Biroli 2020). These interesting but controversial phenomena give us a chance to rethink generalization in modern data science.

From a practical point of view, estimating the generalization performance helps us understand what is actually happening. A naive estimator of the generalization performance is the training performance. However, real-case studies using deep neural networks (Azulay and Weiss 2019; Zhang et al. 2021) suggest that there is a gap between the generalization performance and the training performance, a generalization gap. Double descent phenomena also imply the existence of a non-negligible generalization gap. This empirical evidence underscores the importance of accurately estimating the generalization gap.

Statistical tools for estimating the generalization gap have been reported in the literature on information criteria and optimism estimation. The well-known Akaike information criterion (AIC; Akaike 1974) estimates the generalization log-loss of the plug-in predictive density using a maximum likelihood estimator. The Takeuchi information criterion (TIC; Takeuchi 1976) is a modification of AIC to deal with model misspecification. The regularization information criterion (RIC; Shibata 1989) is an extension of TIC to accommodate maximum penalized likelihood estimation. Moody (1992) and Murata, Yoshizawa, and Amari (1994) generalized RIC to an arbitrary loss and advocated its use in nonlinear systems, such as neural networks. Mallows’ Cp (Mallows 1973) and Stein’s unbiased risk estimate (SURE; Stein 1981) offer an elegant estimation scheme for the generalization gap using the covariance between a predictor and its outcome in Gaussian models with the $\ell_2$ loss. Ramani, Blu, and Unser (2008) proposed an efficient Monte Carlo sampling-based method (Monte Carlo SURE) to estimate this covariance. Recently, new methods associated with Bayesian learning have been developed. The deviance information criterion (DIC; Spiegelhalter et al. 2002) and the widely-applicable information criterion (WAIC; Watanabe 2010) offer computationally efficient devices that estimate the generalization loss of Bayesian learning. In particular, Watanabe (2010) showed that in a statistical model with a fixed dimension, the generalization gap of Bayesian learning is asymptotically equal to the functional variance (FV), that is, the posterior variance of the log-likelihood.

However, studies on the data-driven measurement of the generalization gap in the overparameterized regime have been relatively scarce. Gao and Jojic (2016) extended Monte Carlo SURE to analyze the generalization gap in deep neural networks. Thomas et al. (2020) modified TIC for the same purpose. These studies empirically investigated the gaps on common datasets, such as MNIST and CIFAR-10, and successfully developed useful tools for understanding the generalization gap in the overparameterized regime. However, these approaches lack theoretical guarantees in overparameterized models, although they provide guarantees in statistical models with a fixed dimension; as a result, their theoretical applicability remains unclear. Furthermore, the computational costs of these approaches, in terms of both memory and speed, are relatively high: Gao and Jojic (2016) requires training the model several times, and Thomas et al. (2020) needs the second-order gradient (Hessian) of the loss function.

In this article, we focus on yet another tool to measure the generalization gap, the functional variance (FV) developed in Watanabe (2010). We present the following contributions:

  • Theoretical applicability: We prove that FV is an asymptotically unbiased estimator of the generalization gap of Bayesian learning for overparameterized linear regression models.

  • Computational efficiency: We propose a computationally efficient approximation of FV, Langevin FV (LFV), that leverages only the first-order gradient of the loss function.

We show the theoretical applicability of FV as a measure of the generalization gap in the overparameterized regime. Our theory employs the overparameterized linear regression model (see Belkin, Hsu, and Xu 2020; Bartlett et al. 2020; Hastie et al. 2022), but we emphasize that this linear model can be regarded as a linear approximation of nonlinear overparameterized models, including deep neural networks, and that such linearization has been widely employed to analyze the behavior of deep neural networks (see Jacot, Gabriel, and Hongler 2018; Arora et al. 2019; Chizat, Oyallon, and Bach 2019; Yang and Littwin 2021). One such theory is the neural tangent kernel (Jacot, Gabriel, and Hongler 2018), that is, an approximation of neural networks in a small region of the parameters around an estimate or an initial value. The conditions for the neural tangent kernel approximation have been investigated in Chizat, Oyallon, and Bach (2019), and the approximation is known to apply to any architecture of deep neural networks (Yang and Littwin 2021).

Although FV is theoretically favorable, it is defined through the full posterior covariance, and full Bayesian inference in the overparameterized regime is often prohibitive because of the computational burden. To compute the posterior efficiently, we employ Langevin dynamics (see Risken 1996), which sequentially adds a normal random perturbation to each update of gradient descent optimization and yields a stationary distribution approximating the posterior distribution (Cheng et al. 2018). This approach has several merits from the computational perspective; for example, the implementation is easy and consistent with the gradient-based optimization that is the de facto standard in deep neural network applications. Since the effect of linearization remains theoretically unclear and controversial in practice (see Lee et al. 2020), we confirm the applicability of the developed method to regression using neural networks with real datasets in Section 4.3.

The rest of the article is organized as follows. Section 2 provides our theoretical results on FV, Section 3 proposes Langevin FV, an efficient implementation of FV using Langevin dynamics, Section 4 presents numerical experiments, and Section 5 concludes the article. The supplementary material gives proofs of all theorems and descriptions of the source code to reproduce the experimental results.

2 Theoretical Results for the Functional Variance

In this section, we present theoretical results for the functional variance in overparameterized linear regression models. Throughout this article, $I_n$ denotes the $n \times n$ identity matrix for any $n \in \mathbb{N}$, and $\|\beta\|_2 = (\beta_1^2 + \beta_2^2 + \cdots + \beta_p^2)^{1/2}$ and $\|\beta\|_\infty = \max_{i=1,2,\ldots,p} |\beta_i|$ for any vector $\beta = (\beta_1, \beta_2, \ldots, \beta_p)^\top \in \mathbb{R}^p$.

2.1 Problem Setting

We begin by introducing the problem setting for the theory. We consider a linear regression model $y = X\beta_0 + \varepsilon$, where $y = (y_1, \ldots, y_n)^\top \in \mathbb{R}^n$ is a vector of observed outcomes, $X = (x_1, x_2, \ldots, x_n)^\top \in \mathbb{R}^{n \times p}$ is an $n \times p$ random design matrix, $\beta_0 \in \mathbb{R}^p$ is an unknown coefficient vector, and $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^\top \in \mathbb{R}^n$ is a vector of iid error terms with mean zero and variance $0 < \sigma_0^2 < \infty$. Our interest is the overparameterized situation, where the number of regressors $p$ is larger than the sample size $n$: $n \ll p$.

We take a quasi-Bayesian approach on the vector $\beta$. Working with the quasi-likelihood $f(y_i \mid x_i, \beta)$ given by a Gaussian distribution with mean $x_i^\top\beta$ and variance $\sigma_0^2$, and a Gaussian prior $N_p(0, \{\sigma_0^2/(\alpha n)\} I_p)$ on $\beta$ with mean $0$ and covariance matrix $\{\sigma_0^2/(\alpha n)\} I_p$, we obtain the quasi-posterior distribution of $\beta$:
(1) $\Pi_\alpha(\mathrm{d}\beta \mid y, X) = (2\pi)^{-p/2} (\det Q_\alpha)^{-1/2} \exp\{-(\beta - \hat\beta_\alpha)^\top Q_\alpha^{-1} (\beta - \hat\beta_\alpha)/2\}\, \mathrm{d}\beta$,
where $\hat\beta_\alpha$ is the maximum a posteriori estimate
(2) $\hat\beta_\alpha := (n^{-1} X^\top X + \alpha I_p)^{-1} n^{-1} X^\top y = \mathrm{argmin}_{\beta \in \mathbb{R}^p} \{ l_\alpha(\beta) := n^{-1}\|y - X\beta\|_2^2 + \alpha \|\beta\|_2^2 \}$
and $Q_\alpha$ is the matrix
(3) $Q_\alpha := n^{-1}\sigma_0^2 (n^{-1} X^\top X + \alpha I_p)^{-1}$.

We now analyze the Gibbs generalization gap
(4) $\Delta(\alpha; X) := \frac{1}{2\sigma_0^2}\big\{\mathbb{E}_{y,y^*}\big[\mathbb{E}_\beta[\|y^* - X\beta\|_2^2]\big] - \mathbb{E}_y\big[\mathbb{E}_\beta[\|y - X\beta\|_2^2]\big]\big\}$,
where $y^*$ is an independent copy of $y$ given $X$, $\mathbb{E}_\beta[\cdot]$ is the expectation with respect to the quasi-posterior distribution (1), and $\mathbb{E}_y[\cdot]$ (and $\mathbb{E}_{y,y^*}[\cdot]$) is the expectation with respect to $y$ (and $y, y^*$, respectively) given $X$. The Gibbs generalization gap considers the generalization gap for one stochastic sample from the quasi-posterior distribution used as an estimate of the parameters. To estimate the Gibbs generalization gap from the current observations, we focus on the functional variance
(5) $\mathrm{FV}(\alpha; X) := \sum_{i=1}^n \mathbb{V}_\beta[\log f(y_i \mid x_i, \beta)]$,
where $\mathbb{V}_\beta$ is the variance with respect to the quasi-posterior distribution (1).
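To make these quantities concrete, the following minimal sketch (ours, assuming NumPy; all variable names are illustrative) draws $\beta$ from the Gaussian quasi-posterior (1) with mean (2) and covariance (3), and evaluates FV (5) as the sum over $i$ of the posterior variances of the sample-wise Gaussian log-likelihoods.

```python
import numpy as np

def quasi_posterior(X, y, alpha, sigma2):
    """Mean (2) and covariance (3) of the Gaussian quasi-posterior (1)."""
    n, p = X.shape
    A = X.T @ X / n + alpha * np.eye(p)           # n^{-1} X'X + alpha I_p
    beta_hat = np.linalg.solve(A, X.T @ y / n)    # ridge estimate beta_hat_alpha in (2)
    Q = (sigma2 / n) * np.linalg.inv(A)           # covariance Q_alpha in (3)
    return beta_hat, Q

def functional_variance(X, y, alpha, sigma2, T=2000, seed=0):
    """Monte Carlo estimate of FV (5) using exact draws from the quasi-posterior."""
    rng = np.random.default_rng(seed)
    beta_hat, Q = quasi_posterior(X, y, alpha, sigma2)
    betas = rng.multivariate_normal(beta_hat, Q, size=T)        # T x p posterior draws
    # sample-wise log-likelihoods up to a constant (constants cancel in the variance)
    loglik = -(y[None, :] - betas @ X.T) ** 2 / (2.0 * sigma2)  # T x n
    return loglik.var(axis=0).sum()  # FV = sum_i Var_beta[log f(y_i | x_i, beta)]

# toy usage with an overparameterized linear model (p > n)
rng = np.random.default_rng(1)
n, p = 50, 75
X = rng.normal(size=(n, p)) / np.sqrt(n)
y = X @ rng.normal(size=p) + rng.normal(size=n)   # sigma_0^2 = 1
print(functional_variance(X, y, alpha=0.1, sigma2=1.0))
```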

2.2 Asymptotic Unbiasedness of the Functional Variance

Here we present our theoretical findings on the functional variance in the overparameterized regime. The following theorem shows that the functional variance FV(α;X) is an asymptotically unbiased estimator of the Gibbs generalization gap Δ(α;X) for the overparameterized linear regression with Gaussian covariates. Its proof is given in Supplement A.

Theorem 1.

Let $n, p \in \mathbb{N}$, and let $\Sigma$ be a $p \times p$ nonzero and nonnegative definite matrix. Let $x_1, x_2, \ldots, x_n \in \mathbb{R}^p$ be iid random vectors from the Gaussian distribution with mean zero and covariance matrix $\Sigma$. There exists an absolute constant $C > 0$ such that, for any $\varepsilon, \alpha > 0$,
$\mathbb{P}_X\big(|\mathbb{E}_y[\mathrm{FV}(\alpha;X)] - \Delta(\alpha;X)| > \varepsilon\big) \le C\Big(\frac{\xi^2}{\alpha^2}\frac{1}{n\varepsilon} + \frac{\xi^4}{\alpha^4}\frac{1}{n\varepsilon} + \frac{\xi^4 b^4}{\alpha^2 \sigma_0^4}\frac{1}{n\varepsilon^2} + \frac{1}{n}\Big),$
where $\xi := \mathrm{tr}\{\Sigma\}$, $b := p^{1/2}\|\beta_0\|_\infty$, and $\mathbb{P}_X$ denotes the probability with respect to $X$. Furthermore, under the conditions (C1) $\sup_{p \in \mathbb{N}} \xi < \infty$ and (C2) $\sup_{p \in \mathbb{N}} b < \infty$, we have $\mathbb{P}_X\big(|\mathbb{E}_y[\mathrm{FV}(\alpha;X)] - \Delta(\alpha;X)| \le C' a_n/\sqrt{n}\big) \to 1$ for some $C' > 0$ not depending on $n$ and $p$, with an arbitrary slowly increasing sequence $a_n \to \infty$.

Theorem 1 implies that FV successfully estimates the Gibbs generalization gap in the overparameterized regime, which supports the use of FV in this setting. Interestingly, its rate of convergence $(a_n/\sqrt{n})$ is not affected by the dimension $p$ but is affected by the value of $\sup_{p \in \mathbb{N}} \xi$: as $\sup_{p \in \mathbb{N}} \xi$ gets larger, the decay of the difference becomes slower. Furthermore, the result does not restrict the true distribution of the additive error to the Gaussian distribution; in contrast, some classical theories such as SURE require the error term to be Gaussian or of a related distribution.

Let us comment on the conditions in the theorem. Condition (C1) indicates the trace boundedness of $X^\top X$ (divided by $n$), because $\Sigma = \mathbb{E}[X^\top X]/n$. Both shallow and deep neural network models satisfy this condition in general settings (see, e.g., Karakida, Akaho, and Amari 2019). Condition (C2), together with Condition (C1), indicates that $n^{-1}\|\mu\|_2^2 \le n^{-1}\|X\|_F^2 \|\beta_0\|_2^2$ is upper-bounded by some constant $C > 0$ with probability approaching 1; that is, the average of the squared entries of the outcome expectation $\mu = X\beta_0$ is upper-bounded with high probability.

The main ingredient of the proof of Theorem 1 is an explicit identity for the difference between the generalization gap and FV, given in Lemma 2 below; its proof is presented in Supplement B. For two matrices $A = (a_{ij})$ and $B = (b_{ij})$ of the same size, $A \circ B = (a_{ij} b_{ij})$ denotes the Hadamard element-wise product, and we write $A^{\otimes 2} := A A^\top$.

Lemma 2.

We have
(6) $\Delta(\alpha; X) = \mathrm{tr}\{H_\alpha\}$ and
(7) $\mathbb{E}_y[\mathrm{FV}(\alpha;X)] - \Delta(\alpha;X) = -\frac{3}{2}\mathrm{tr}\{H_\alpha \circ H_\alpha\} + \mathrm{tr}\{H_\alpha \circ H_\alpha^2\} + \frac{1}{\sigma_0^2}\mathrm{tr}\big\{H_\alpha \circ \big((I_n - H_\alpha)(X\beta_0)\big)^{\otimes 2}\big\},$

where $H_\alpha$ is the regularized hat matrix
(8) $H_\alpha := n^{-1} X (n^{-1} X^\top X + \alpha I_p)^{-1} X^\top$.

Lemma 2 highlights the role of the regularized hat matrix $H_\alpha$ in evaluating the generalization gap and the residual. As in the fixed-dimension theory, the hat matrix controls the magnitude of the generalization gap. With singular values $s_1 \ge s_2 \ge \cdots \ge s_n \ge 0$ of the matrix $X$, under the conditions in Theorem 1, we have $\Delta(\alpha;X) = \mathrm{tr}\{H_\alpha\} = \sum_{i=1}^n \frac{s_i^2/n}{(s_i^2/n) + \alpha}$, which converges to some limit $\Delta^*(\alpha) \in [0, \infty)$ as $n \to \infty$, since $\mathrm{tr}\{H_\alpha\} \le \alpha^{-1} n^{-1} (\sum_{i=1}^n s_i^2) < \infty$ with probability approaching 1 as $n \to \infty$. Consider a simple case where $X$ has a fixed intrinsic dimension $p^* \ll n$ with growing $n, p$, namely $s_1 \ge s_2 \ge \cdots \ge s_{p^*} > 0 = s_{p^*+1} = s_{p^*+2} = \cdots = s_n$. Then, FV approaches the fixed intrinsic dimension $p^*$ (as $\alpha \to 0$), which coincides with the degrees of freedom (Mallows 1973; Ye 1998) and AIC (Akaike 1974) for linear regression equipped with $p^*$ regressors.
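As a quick numerical check of the identity $\Delta(\alpha;X) = \mathrm{tr}\{H_\alpha\} = \sum_{i=1}^n (s_i^2/n)/\{(s_i^2/n) + \alpha\}$, the following sketch (ours, assuming NumPy) compares the trace of the regularized hat matrix (8) with the singular-value expression.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, alpha = 50, 100, 0.1
X = rng.normal(size=(n, p))

# Regularized hat matrix (8): H_alpha = n^{-1} X (n^{-1} X'X + alpha I_p)^{-1} X'
H = X @ np.linalg.solve(X.T @ X / n + alpha * np.eye(p), X.T) / n
s = np.linalg.svd(X, compute_uv=False)          # singular values of X

print(np.trace(H))                              # Delta(alpha; X) = tr{H_alpha}
print(np.sum((s**2 / n) / (s**2 / n + alpha)))  # the same value via the singular values
```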

Remark 1

(Sample-wise and joint log-likelihoods). One may consider using the posterior variance of the joint log-likelihood, $\mathrm{JFV}(\alpha; X) := \mathbb{V}_\beta[\log f(y \mid X, \beta)]$, instead of FV (5), which uses the sample-wise log-likelihoods. In this case, we have the following expression for the difference between J-FV and the generalization gap; its proof is in Supplement C.

Proposition 1.

For any $\alpha > 0$, we have
$\mathbb{E}_y[\mathrm{JFV}(\alpha;X)] - \Delta(\alpha;X) = -\frac{3}{2}\mathrm{tr}\{H_\alpha^2\} + \mathrm{tr}\{H_\alpha^3\} + \frac{1}{\sigma_0^2}\mathrm{tr}\big\{H_\alpha \big((I_n - H_\alpha)(X\beta_0)\big)^{\otimes 2}\big\}.$

This exhibits an interesting correspondence between FV and J-FV (see Lemma 2 and Proposition 1): replacing the Hadamard products in $\mathbb{E}_y[\mathrm{FV}(\alpha;X)]$ with simple matrix products yields $\mathbb{E}_y[\mathrm{JFV}(\alpha;X)]$. Since the simple matrix products in $\mathbb{E}_y[\mathrm{JFV}(\alpha;X)]$ do not vanish as $n \to \infty$, J-FV is not an asymptotically unbiased estimator of the generalization gap.

Remark 2

(Extension of the theorem). Theorem 1 is further extended in Theorem 3, which proves the convergence of FV under milder conditions. Its proof is shown in Supplement A.

Theorem 3.

Let $n, p \in \mathbb{N}$ and assume the setting in Section 2.1. Let $X \in \mathbb{R}^{n \times p}$ be a random matrix and let $\mathbb{S}^{n-1} = \{u \in \mathbb{R}^n \mid \|u\|_2 = 1\}$ be the $(n-1)$-sphere. Let $q_1, q_2, \ldots, q_n$ be the marginal probability densities of the left-singular vectors $u_1, u_2, \ldots, u_n \in \mathbb{S}^{n-1}$ of $X$. Write $b := p^{1/2}\|\beta_0\|_\infty$, $\psi := \max_{i=1,2,\ldots,n} \max_{u \in \mathbb{S}^{n-1}} \{q_i(u)\} \int_{\mathbb{S}^{n-1}} \mathrm{d}v$, and $\eta := \frac{1}{n}\mathrm{tr}\{X^\top X\}$.

Then, there exists an absolute constant $C > 0$ such that, for any $\varepsilon, \alpha > 0$, we have
(9) $\mathbb{P}_X\big(|\mathbb{E}_y[\mathrm{FV}(\alpha;X)] - \Delta(\alpha;X)| > \varepsilon \,\big|\, \eta\big) \le C\Big(\frac{\psi\eta^2}{\alpha^2}\frac{1}{n\varepsilon} + \frac{\psi\eta^4}{\alpha^4}\frac{1}{n\varepsilon} + \frac{\psi\eta^4 b^4}{\alpha^2\sigma_0^4}\frac{1}{n\varepsilon^2}\Big),$
where $\mathbb{P}_X(\cdot \mid \eta)$ denotes the conditional probability of $X$ given $\eta$.

Theorem 1 (Gaussian covariates) follows from Theorem 3 with $\psi = 1$; see Proposition 7.1 of Eaton (1989), which proves that the left-singular vectors of a Gaussian design matrix follow the uniform distribution over the unit $(n-1)$-sphere $\mathbb{S}^{n-1}$, that is, $q_i(u) = \mathbb{1}_{\mathbb{S}^{n-1}}(u) / \int_{\mathbb{S}^{n-1}} \mathrm{d}v$ with $\mathbb{1}_{\mathbb{S}^{n-1}}(\cdot)$ the indicator function of $\mathbb{S}^{n-1}$.

Remark 3

(Definition of the generalization gap). Regarding the definition of the generalization gap $\Delta(\alpha;X)$, one may consider another design matrix $X^*$ for generating $y^*$. However, the covariate difference has little effect on the generalization gap (4), since we focus on the prediction of the conditional random variable $y \mid X$ and not the covariate $X$. For theoretical simplicity, we employ a single design matrix $X$ to generate both outcomes $y$ and $y^*$.

3 Langevin Functional Variance

Despite FV being theoretically attractive as proved in Theorem 3, there exist two computational difficulties:

  • (D1) generating samples from the quasi-posterior (1) requires the $p \times p$ matrix $Q_\alpha$, which is computationally intensive for overparameterized models with $n \ll p$, and

  • (D2) computing the quasi-posterior (1) is inconsistent with gradient-based optimization approaches, such as stochastic gradient descent (SGD), which are often used in optimizing overparameterized models.

To resolve these difficulties (D1) and (D2), we consider a Langevin approximation of the quasi-posterior in Section 3.1 and propose Langevin FV (LFV) for linear models in Section 3.2; we then extend it to nonlinear models in Section 3.3. We compare FV and the proposed LFV with existing estimators based on TIC (Takeuchi 1976) in Section 3.4.

3.1 Langevin Approximation of the Quasi-Posterior

A key idea of our algorithm is to approximate the quasi-posterior (1) via a Langevin process. Starting with the estimator $\gamma^{(1)} := \hat\beta_\alpha$, we stochastically update $\gamma^{(t)}$ by
(10) $\gamma^{(t+1)} = \gamma^{(t)} - \frac{1}{4}\delta\kappa_n \frac{\partial l_\alpha(\gamma^{(t)})}{\partial \gamma} + \delta^{1/2} e^{(t)} \quad (t = 1, 2, \ldots),$
where $\kappa_n := n/\sigma_0^2$ and $\delta > 0$ are user-specified parameters, $\{e^{(1)}, e^{(2)}, \ldots\}$ is a sequence of iid standard normal random vectors, and $l_\alpha$ is defined in (2). The distribution of the Langevin process (10) approximates the quasi-posterior, as discussed in the following. First, the Langevin process (10) is a discretization of the Ornstein-Uhlenbeck process $\mathrm{d}\tilde\gamma_\tau = -\frac{1}{2} Q_\alpha^{-1} (\tilde\gamma_\tau - \hat\beta_\alpha)\,\mathrm{d}\tau + \mathrm{d}e_\tau$ equipped with a Wiener process $\{e_\tau\}_{\tau \ge 0}$, that is, $e_\tau - e_{\tau'} \sim N_p(0, (\tau - \tau') I_p)$ for any $\tau > \tau' \ge 0$. The distribution of $\tilde\gamma_\tau$ with the initialization $\tilde\gamma_0 := \hat\beta_\alpha$ coincides with the quasi-posterior:
(11) $\mathcal{D}(\tilde\gamma_\tau) = N_p(\hat\beta_\alpha, Q_\alpha) \quad (\tau > 0).$

See Risken (1996), p. 156, for an example of the distribution (11). Next, with $n$ and $p$ fixed, for any sufficiently small $\phi > 0$, Theorem 2 of Cheng et al. (2018) evaluates the 1-Wasserstein distance between the distributions of the Ornstein-Uhlenbeck and Langevin processes as
(12) $W_1\big(\mathcal{D}(\tilde\gamma_{t\delta}), \mathcal{D}(\gamma^{(t)})\big) \le \phi$
for some $\delta = O(\phi^2/p)$ and $t \gtrsim p/\phi^2$; see Proposition 2 in Supplement D for details. Thus, the relations (11) and (12) imply that the distribution $\mathcal{D}(\gamma^{(t)})$ approximates the quasi-posterior $N_p(\hat\beta_\alpha, Q_\alpha)$.
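The following minimal sketch (ours, assuming NumPy; the step size $\delta$ and the number of iterations are illustrative choices) implements the Langevin update (10) for the ridge objective $l_\alpha$ in (2).

```python
import numpy as np

def langevin_samples(X, y, alpha, sigma2, delta, T, seed=0):
    """Iterate the Langevin update (10), started at the ridge estimate (2)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    kappa = n / sigma2                              # kappa_n = n / sigma_0^2
    gamma = np.linalg.solve(X.T @ X / n + alpha * np.eye(p),
                            X.T @ y / n)            # gamma^{(1)} = beta_hat_alpha
    samples = np.empty((T, p))
    for t in range(T):
        # gradient of l_alpha(gamma) = n^{-1} ||y - X gamma||_2^2 + alpha ||gamma||_2^2
        grad = -2.0 * X.T @ (y - X @ gamma) / n + 2.0 * alpha * gamma
        gamma = gamma - 0.25 * delta * kappa * grad + np.sqrt(delta) * rng.normal(size=p)
        samples[t] = gamma
    return samples
```

For a sufficiently small $\delta$, the empirical distribution of these iterates approximates the Gaussian quasi-posterior $N_p(\hat\beta_\alpha, Q_\alpha)$ in (11), so their empirical covariance can be compared against $Q_\alpha$ in (3) as a sanity check.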

This Langevin approximation resolves difficulties (D1) and (D2) since

  • (S1) the Langevin process (10) is computed using only the first-order gradient of the squared loss function $l_\alpha(\gamma)$ defined in (2), meaning that the large matrix $Q_\alpha$ is never explicitly computed, and

  • (S2) the Langevin process (10) is in the form of gradient descent up to the normal noise $\delta^{1/2} e^{(t)}$; this process can be implemented consistently with gradient-based algorithms, which are often used to optimize overparameterized models.

Remark 4

(Other approaches to posterior approximation). Besides Langevin dynamics, other approaches can be used to approximate the posterior distribution. One such approach is to exploit the stochastic noise in stochastic gradient descent (Sato and Nakagawa 2014; Mandt, Hoffman, and Blei 2017). Though powerful, this approach relies on a normal approximation of the stochastic gradient, which can hardly be expected to hold in the overparameterized regime and results in computational instability.

3.2 Langevin Functional Variance

Using the Langevin approximation (10), we propose LFV. For a time step $T \in \mathbb{N}$, let $\{\gamma^{(t)}\}_{t=1,2,\ldots,T}$ be the samples from the Langevin process, let $\hat{\mathbb{V}}_\gamma[\cdot]$ denote the empirical variance with respect to $\{\gamma^{(t)}\}_{t=1,2,\ldots,T}$, and let $\mu_i^{(t)} := x_i^\top \gamma^{(t)}$. Then, we define LFV as
(13) $\mathrm{LFV}(\alpha; X) := \sum_{i=1}^n \hat{\mathbb{V}}_\gamma[\log f(y_i \mid x_i, \gamma)] = \sum_{i=1}^n \frac{1}{T}\sum_{t=1}^T \bigg\{ \frac{1}{2\sigma_0^2}\big(y_i - \mu_i^{(t)}\big)^2 - \frac{1}{T}\sum_{t'=1}^T \frac{1}{2\sigma_0^2}\big(y_i - \mu_i^{(t')}\big)^2 \bigg\}^2.$

As shown in (12), the distribution of the Langevin process approaches the quasi-posterior; hence LFV is expected to approximate FV for a sufficiently large $T \in \mathbb{N}$. Thus, LFV is expected to inherit the properties of FV, such as the asymptotic unbiasedness for the generalization gap.

The distributional approximation of the high-dimensional vector $\gamma \in \mathbb{R}^p$ faces the curse of dimensionality. However, FV is a statistic taking values in $\mathbb{R}$; empirically, FV and LFV approximate the generalization gap $\Delta(\alpha;X)$ with a reasonable number of samples $T \in \mathbb{N}$, even if the dimension $p$ is relatively large. See the numerical experiments in Section 4.1.

Lastly, let us mention the computational time of FV and LFV. For each iteration, the number of operations to obtain the gradient in LFV is $O(p^2)$, while that to compute the covariance matrix in FV is $O(p^3)$; therefore, the computational time is expected to be reduced, as also demonstrated for linear models in Remark 5.
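A sketch of the LFV computation (13) is given below (ours, assuming NumPy and the `langevin_samples` helper sketched in Section 3.1; the values $\delta = 1/(10n)$ and $T = 15n$ in the commented usage follow the linear experiments of Section 4.1).

```python
import numpy as np

def langevin_fv(X, y, samples, sigma2):
    """LFV (13): sum over i of the empirical variance of the sample-wise losses."""
    mu = samples @ X.T                                # mu_i^{(t)} = x_i' gamma^{(t)}, shape (T, n)
    losses = (y[None, :] - mu) ** 2 / (2.0 * sigma2)  # (y_i - mu_i^{(t)})^2 / (2 sigma_0^2)
    return losses.var(axis=0).sum()                   # empirical variance over t, summed over i

# usage, combined with the Langevin sampler sketched in Section 3.1:
# samples = langevin_samples(X, y, alpha=0.1, sigma2=1.0,
#                            delta=1.0 / (10 * len(y)), T=15 * len(y))
# print(langevin_fv(X, y, samples, sigma2=1.0))
```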

3.3 Application to Nonlinear Models

This section provides a procedure for applying LFV (13) to a nonlinear overparameterized model $g_\theta(z_i)$ equipped with a parameter vector $\theta$ and an input vector $z_i$. The procedure is as follows: first, replacing $l_\alpha(\beta)$ in the Langevin process (10) with $\rho_\alpha(\theta) := \frac{1}{n}\sum_{i=1}^n (y_i - g_\theta(z_i))^2 + \alpha\|\theta\|_2^2$, the update (10) yields a sequence $\{\gamma^{(t)}\}_{t=1,2,\ldots}$; next, substituting $\mu_i^{(t)} := g_{\gamma^{(t)}}(z_i)$ into (13) gives LFV for the nonlinear model $g_\theta$. We emphasize that, while the exact form of the full quasi-posterior for nonlinear models is difficult to obtain and computing FV for nonlinear models is thus practically prohibitive, LFV can be applied even to such nonlinear models.

The gradient of the overparameterized model $g_\theta$ is compatible with the gradient of its linear approximation. This application works if the linear approximation of the nonlinear neural network is reasonable around the initial estimate; we remark again that this assumption is central in the neural tangent kernel literature (Jacot, Gabriel, and Hongler 2018; Arora et al. 2019).
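As an illustration of this procedure, the sketch below (ours, assuming NumPy) runs the Langevin update (10) with $\rho_\alpha(\theta)$ for the one-hidden-layer tanh network $g_\theta(z) = \langle\theta^{(2)}, \tanh(\theta^{(1)} z + \theta^{(0)})\rangle$ used in Section 4.2 and returns LFV (13); the hand-coded gradient is specific to this architecture, and in practice an automatic-differentiation framework would supply it.

```python
import numpy as np

def nn_forward(theta0, theta1, theta2, Z):
    """g_theta(z) = <theta2, tanh(theta1 z + theta0)>, evaluated on all rows of Z."""
    H = np.tanh(Z @ theta1.T + theta0)              # (n, M) hidden activations
    return H @ theta2, H

def nn_langevin_fv(Z, y, theta0, theta1, theta2, alpha, sigma2, delta, T, seed=0):
    """Langevin FV for the nonlinear model: update (10) with rho_alpha, then LFV (13)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    kappa = n / sigma2
    losses = np.empty((T, n))
    for t in range(T):
        g, H = nn_forward(theta0, theta1, theta2, Z)
        r = y - g                                   # residuals y_i - g_theta(z_i)
        # gradient of rho_alpha(theta) = n^{-1} sum_i (y_i - g_theta(z_i))^2 + alpha ||theta||_2^2
        dH = (1.0 - H**2) * theta2                  # tanh'(.) scaled by theta2, shape (n, M)
        g0 = -2.0 / n * (r[:, None] * dH).sum(axis=0) + 2.0 * alpha * theta0
        g1 = -2.0 / n * (r[:, None] * dH).T @ Z + 2.0 * alpha * theta1
        g2 = -2.0 / n * H.T @ r + 2.0 * alpha * theta2
        # Langevin update (10), applied block-wise to (theta0, theta1, theta2)
        theta0 = theta0 - 0.25 * delta * kappa * g0 + np.sqrt(delta) * rng.normal(size=theta0.shape)
        theta1 = theta1 - 0.25 * delta * kappa * g1 + np.sqrt(delta) * rng.normal(size=theta1.shape)
        theta2 = theta2 - 0.25 * delta * kappa * g2 + np.sqrt(delta) * rng.normal(size=theta2.shape)
        losses[t] = (y - nn_forward(theta0, theta1, theta2, Z)[0]) ** 2 / (2.0 * sigma2)
    return losses.var(axis=0).sum()                 # LFV (13) with mu_i^{(t)} = g_{gamma^{(t)}}(z_i)
```

In practice, a burn-in portion of the iterations would be discarded before taking the empirical variance, as done in the experiments of Sections 4.2 and 4.3.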

3.4 Comparison to the Existing Methods

Here, we compare FV and LFV to existing methods for estimating the generalization gap based on TIC (Takeuchi 1976). TIC estimates the generalization gap of regular statistical models by the quantity
(14) $\mathrm{tr}\{\hat F \hat G^{-1}\}$
equipped with the $p \times p$ matrices
$\hat F := \sum_{i=1}^n \frac{\partial \log f(y_i \mid x_i, \beta)}{\partial \beta} \frac{\partial \log f(y_i \mid x_i, \beta)}{\partial \beta^\top}\Big|_{\beta = \hat\beta}$ and $\hat G := -\sum_{i=1}^n \frac{\partial^2 \log f(y_i \mid x_i, \beta)}{\partial \beta \partial \beta^\top}\Big|_{\beta = \hat\beta}$,
where $\hat\beta$ is a maximum likelihood estimate. However, TIC cannot be applied to singular statistical models whose Hessian matrix $\hat G$ degenerates (i.e., the inverse of $\hat G$ does not exist), such as overparameterized models. To overcome this limitation of TIC, Thomas et al. (2020) replace the inverse matrix $\hat G^{-1}$ in (14) with the $\kappa$-generalized inverse matrix $\hat G_\kappa^+$; $\hat G_\kappa^+$ is defined as $V \Sigma_\kappa^+ V^\top$, where $V$ is the matrix of eigenvectors of $\hat G$ and $\Sigma_\kappa^+$ is the diagonal matrix whose $j$th diagonal entry is
$\lambda_j(\hat G_\kappa^+) := \begin{cases} 1/\lambda_j(\hat G) & (\lambda_j(\hat G) > \kappa) \\ 0 & (\lambda_j(\hat G) \le \kappa) \end{cases} \quad (j = 1, 2, \ldots, p),$
with $\{\lambda_j(\hat G) : j = 1, \ldots, p\}$ the eigenvalues of $\hat G$. This TIC modification of Thomas et al. (2020) is numerically examined in Section 4.1 for overparameterized linear regression; it rather estimates the number of nonzero eigenvalues of $\hat G = n^{-1} X^\top X$, that is, the number of nonzero singular values of $X$, but not the generalization gap $\Delta(\alpha;X)$.

Instead of modifying TIC, we can employ RIC proposed by Shibata (1989). RIC replaces the inverse Hessian $\hat G^{-1}$ in TIC with the regularized inverse matrix $(\hat G + \alpha I)^{-1}$. Moody (1992) and Murata, Yoshizawa, and Amari (1994) further generalized RIC to arbitrary loss functions and demonstrated its application to shallow neural network models with a small number of hidden units. RIC has much in common with FV; specifically, RIC is also an asymptotically unbiased estimator of the Gibbs generalization gap $\Delta(\alpha;X)$ for overparameterized linear regression, and numerically, RIC behaves similarly to FV (and LFV) in the experiments discussed in Section 4.1. However, RIC still requires the inverse of the $p \times p$ matrix $\hat G + \alpha I$, and its computational cost is intensive in the overparameterized setting ($p \gg n$).
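For reference, a sketch of these baseline penalties for the Gaussian linear model (ours, assuming NumPy; setting $\sigma_0^2 = 1$ and using the $n^{-1}$ normalization $\hat G = n^{-1}X^\top X$ stated above are our illustrative choices).

```python
import numpy as np

def tic_kappa(X, y, beta_hat, kappa):
    """TIC with the kappa-generalized inverse of G_hat (Thomas et al. 2020)."""
    n, p = X.shape
    r = y - X @ beta_hat
    F = (X * r[:, None] ** 2).T @ X / n          # F_hat = n^{-1} sum_i r_i^2 x_i x_i'
    G = X.T @ X / n                              # G_hat = n^{-1} X'X
    lam, V = np.linalg.eigh(G)
    inv_lam = np.where(lam > kappa, 1.0 / np.maximum(lam, kappa), 0.0)
    G_plus = (V * inv_lam) @ V.T                 # kappa-generalized inverse V Sigma_kappa^+ V'
    return np.trace(F @ G_plus)

def ric(X, y, beta_hat, alpha):
    """RIC-type penalty: G_hat^{-1} replaced by the regularized inverse (G_hat + alpha I)^{-1}."""
    n, p = X.shape
    r = y - X @ beta_hat
    F = (X * r[:, None] ** 2).T @ X / n
    G = X.T @ X / n
    return np.trace(np.linalg.solve(G + alpha * np.eye(p), F))
```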

4 Numerical Experiments

This section presents the numerical evaluation of the Langevin FV (13) using both linear and nonlinear models.

4.1 LFV and Baselines in Linear Models

We first evaluate LFV (13) and compare it with some baselines in linear overparameterized models. The set-up of the numerical experiments is summarized as follows:

  • Synthetic data generation: For $p = 1.5n$, matrices $U \in \mathbb{R}^{n \times n}$ and $V \in \mathbb{R}^{p \times n}$ are drawn from the uniform distributions over the sets of matrices with orthonormal columns satisfying $U^\top U = V^\top V = I_n$, respectively. Entries of $\beta_0 \in \mathbb{R}^p$ are iid from $N(0, 1/p)$. With given singular values $\{s_i\}_{i=1}^n$ and $S := \mathrm{diag}(s_1, \ldots, s_n)$, the design matrix $X := U S V^\top$ is computed. The outcome $y$ is generated 50 times from $N_n(X\beta_0, I_n)$, that is, $\sigma_0^2 = 1$ (a data-generation sketch is given after this list).

  • Singular values: We consider three different types of singular values: (i) $X$ has a fixed intrinsic dimension $d^* = 10$, that is, $s_1 = \cdots = s_{10} = n^{1/2}$ and $s_{11} = s_{12} = \cdots = s_n = 0$; (ii) $s_i = n^{1/2} i^{-1}$; and (iii) $s_i = n^{1/2} i^{-1/2}$. Theorem 3 proves the convergence of FV in settings (i) and (ii), since $\lim_{n \to \infty} n^{-1}\sum_{i=1}^n s_i^2 < \infty$ in (i) and (ii), whereas it does not prove the convergence in setting (iii), since $n^{-1}\sum_{i=1}^n s_i^2$ is unbounded as $n \to \infty$ in (iii).

  • Evaluation: We evaluate the following baselines and LFV over 50 experiments. Throughout these experiments, we employ $\alpha = 0.1$ for the ridge regularization (2).

    1. TIC penalty (Takeuchi 1976) is extended to the overparameterized setting (Thomas et al. 2020) by replacing the inverse of the Hessian matrix $\hat G$ in TIC with the $\kappa$-generalized inverse matrix $\hat G_\kappa^+$. See Section 3.4 for the definition of $\mathrm{TIC}(\kappa) := \mathrm{tr}\{\hat F \hat G_\kappa^+\}$.

    2. FV is empirically computed with T=15n samples of β generated from the quasi-posterior (1). Its expectation shown in Lemma 2 is also computed.

    3. LFV is computed with $\delta = 1/(10n)$ and $T = 15n$ Langevin samples $\gamma^{(t)}$ (see (10)).
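A sketch of the synthetic data generation described in the list above (ours, assuming NumPy; the QR factors of Gaussian matrices serve as the uniformly distributed orthonormal frames $U$ and $V$).

```python
import numpy as np

def make_linear_data(n, singular_values, seed=0):
    """Design X = U S V' with prescribed singular values, and y ~ N_n(X beta_0, I_n)."""
    rng = np.random.default_rng(seed)
    p = int(1.5 * n)
    U, _ = np.linalg.qr(rng.normal(size=(n, n)))        # n x n orthogonal
    V, _ = np.linalg.qr(rng.normal(size=(p, n)))        # p x n with orthonormal columns
    X = U @ np.diag(singular_values) @ V.T              # X := U S V'
    beta0 = rng.normal(scale=1.0 / np.sqrt(p), size=p)  # entries iid N(0, 1/p)
    y = X @ beta0 + rng.normal(size=n)                  # sigma_0^2 = 1
    return X, y, beta0

# singular-value settings (i)-(iii) from the list above
n = 100
i = np.arange(1, n + 1)
settings = {
    "(i)":   np.where(i <= 10, np.sqrt(n), 0.0),        # fixed intrinsic dimension 10
    "(ii)":  np.sqrt(n) / i,                            # s_i = n^{1/2} i^{-1}
    "(iii)": np.sqrt(n) / np.sqrt(i),                   # s_i = n^{1/2} i^{-1/2}
}
X, y, beta0 = make_linear_data(n, settings["(ii)"])
```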

The results are presented in Tables 1–3. Overall, FV and LFV estimated the generalization gap $\Delta(\alpha;X)$ (with $p = 1.5n$) more accurately than TIC, and their biases were not drastically different. TIC was entirely inaccurate in settings (ii) and (iii); together with the result in (i), this indicates that TIC estimates the number of nonzero eigenvalues of the Hessian matrix $\hat G = n^{-1} X^\top X$ (i.e., the number of nonzero singular values of $X$), which is generally different from the generalization gap $\Delta(\alpha;X)$.

Table 1 The generalization gap $\Delta(\alpha;X)$ and its estimates (TIC, FV, LFV) with standard deviations in setting (i), where $s_i = n^{1/2}$ for $i = 1, 2, \ldots, 10$ and 0 otherwise.

Table 2 The generalization gap $\Delta(\alpha;X)$ and its estimates (TIC, FV, LFV) with standard deviations in setting (ii), where $s_i = n^{1/2} i^{-1}$.

Table 3 The generalization gap $\Delta(\alpha;X)$ and its estimates (TIC, FV, LFV) with standard deviations in setting (iii), where $s_i = n^{1/2} i^{-1/2}$.

Remark 5

(Computational time). Let us mention the computational time of FV and LFV. We measured the computation time on the linear synthetic dataset with $n = 100$, $p = 2000$, $\alpha = 0.01$, $s_i = n^{1/2} i^{-1/2}$, and $T = 500$, repeated five times. Over the five runs, the processing time for LFV was 5.33 ± 0.09 sec and that for FV was 18.87 ± 0.18 sec, which confirms that LFV is indeed faster than FV. For these experiments, we used an AMD Ryzen 7 5700X processor; the computation was not parallelized for a fair comparison.

4.2 LFV for Nonlinear Models

This section presents the evaluation of the Langevin FV for nonlinear neural networks via synthetic-data experiments. We use the procedure in Section 3.3 for applying LFV to nonlinear models. The set-up is summarized as follows:

  • Synthetic data: Let $n \in \{50, 500, 1000\}$ and $\sigma^2 = 1$. For $i = 1, 2, \ldots, n$, we generate $z_i \overset{\mathrm{iid}}{\sim} N_d(0, I_d)$ and $\varepsilon_i \overset{\mathrm{iid}}{\sim} N(0, \sigma^2)$, and set $\mu_i := \mu(z_i)$ and $y_i := \mu_i + \varepsilon_i$ with the function $\mu(z) := 3\tanh(\langle z, \mathbf{1} \rangle / 2)$.

  • Neural network (NN): We employ a fully-connected one-hidden-layer NN, $g_\theta(z) := \langle \theta^{(2)}, \tanh(\theta^{(1)} z + \theta^{(0)}) \rangle$, with $M \in \{50, 100, 150\}$ hidden units, where $\theta = (\theta^{(0)}, \theta^{(1)}, \theta^{(2)}) \in \mathbb{R}^M \times \mathbb{R}^{M \times d} \times \mathbb{R}^M$ is a parameter vector (and thus the number of parameters is $p = M(d+2)$) and $\tanh(\cdot)$ applies the hyperbolic tangent function $\tanh(z) := (\exp(z) - \exp(-z))/(\exp(z) + \exp(-z))$ entry-wise. Let $\tilde\theta_0$ be a parameter satisfying $g_{\tilde\theta_0}(z) = \mu(z)$. For each experiment, we employ a true parameter $\theta_0$ whose entries are drawn independently from the normal distribution with mean given by the corresponding entry of $\tilde\theta_0$ and variance 0.01. Further, we initialize the parameter $\theta$ of the NN to be trained by the element-wise independent normal distribution whose mean is $\theta_0$ and whose element-wise variance is the larger value 1, and update $\theta$ by full-batch gradient descent with a learning rate of 0.1, a ridge-regularization coefficient $\alpha = 10^{-3}$, and $0.3n$ iterations.

  • Langevin FV: For each setting $(d, M) \in \{5, 10, 15\} \times \{50, 100, 150\}$, we take the average of LFV over 25 experiments. In each experiment, we randomly generate $\{(z_i, y_i)\}_{i=1}^n \subset \mathbb{R}^d \times \mathbb{R}$, train the NN $g_\theta$, and compute $T = 3000$ iterations of the Langevin process with $\delta = 10^{-5}$. We discard the first $0.1T$ iterations and employ the remaining $0.9T$ iterations to compute LFV.

  • Generalization gap $\tilde\Delta$: For each setting $(d, M) \in \{5, 10, 15\} \times \{50, 100, 150\}$, we take the average of the following generalization gap over 200 experiments:
$\tilde\Delta := \frac{1}{2\sigma_0^2}\bigg\{ \mathbb{E}_{y^*}\Big( \frac{1}{n}\sum_{i=1}^n \{y_i^* - g_{\hat\theta}(z_i)\}^2 \Big) - \frac{1}{n}\sum_{i=1}^n \{y_i - g_{\hat\theta}(z_i)\}^2 \bigg\}.$
In each experiment, we randomly generate $\{(z_i, y_i)\}_{i=1}^n \subset \mathbb{R}^d \times \mathbb{R}$, train the NN $g_\theta$, and evaluate $\tilde\Delta$ by leveraging the ground truth $\mu_i = \mu(z_i)$ and $\sigma_0^2 = 1$.

LFV and the generalization gap $\tilde\Delta$ for the nonlinear NN with $M$ hidden units are shown in Tables 4–6, where the LFV values for overparameterized models (i.e., $p = M(d+2) > n$) are shown in gray. We observed that LFV provides a rough estimate of the generalization gap even for nonlinear overparameterized NN models, although our overparameterized theories on FV and LFV are proved for linear models. Comparing the results for $n = 50$ (Table 4) and $n = 1000$ (Table 6) suggests that the bias of LFV decreases as the sample size grows.

Table 4 The generalization gap and LFV for the NN model with n = 50 and T = 3000.

Table 5 The generalization gap and LFV for the NN model with n = 500 and T = 3000.

Table 6 The generalization gap and LFV for the NN model with n = 1000 and T = 3000.

4.3 LFV for Nonlinear Models Using Real Datasets

In this section, we compare LFV to the cross-validation statistic using real datasets with small sample sizes; note that cross-validation becomes prohibitive as the sample size grows. We use the procedure in Section 3.3 for applying LFV to nonlinear models. The set-up is summarized as follows.

  • Real data: We collected the following seven regression datasets from the KEEL dataset repository (Alcalá-Fdez et al. 2011): “machineCPU” ($n = 208, d = 6$), “wankara” ($n = 320, d = 9$), “baseball” ($n = 336, d = 16$), “dee” ($n = 364, d = 6$), “autoMPG6” ($n = 391, d = 5$), “autoMPG8” ($n = 391, d = 7$), and “stock” ($n = 949, d = 9$). These datasets are selected so that our NN structure described below becomes overparameterized. For each dataset, we standardized (scaling and centering) the design matrix $X$ and the target variable $y$.

  • Neural network (NN): We employ the same architecture as the nonlinear NN in Section 4.2, with $M = 100$ hidden units. For the neural network training, we first initialize the parameter $\theta$ randomly from the normal distribution $N_p(0, d^{-1/2} I_p)$; we employ 20 different random initializations for each dataset experiment. We train the NN by full-batch gradient descent (using the entire dataset) with a learning rate of 0.01 and a ridge-regularization coefficient $\alpha = 0.1$. The gradient descent algorithm is terminated if $|\mathrm{loss}_t - \mathrm{loss}_{t-1}| / |\mathrm{loss}_{t-1}| < 10^{-6}$, where $\mathrm{loss}_t$ denotes the training loss at iteration $t$.

  • Cross-validation (CV): We divide each dataset into a training set (90%) and a test set (10%) uniformly at random. Starting from the NN parameters already trained with the entire dataset as described above, we train the NN by gradient descent with a learning rate of 0.001 using the divided training set. The gradient descent algorithm for the $j$th CV instance is terminated if $|\mathrm{loss}_{j,t} - \mathrm{loss}_{j,t-1}| / |\mathrm{loss}_{j,t-1}| < 10^{-6}$, where $\mathrm{loss}_{j,t}$ denotes the training loss at iteration $t$ using the training set of the $j$th CV instance. After the training, we compute the 10-fold cross-validation statistic
$\mathrm{CV} := \frac{1}{2\hat\sigma_0^2} \cdot \frac{1}{10} \sum_{j=1}^{10} \bigg\{ \frac{1}{n_{\mathrm{test}(j)}} \sum_{i=1}^{n_{\mathrm{test}(j)}} \big\{ y_{\mathrm{test}(j),i} - g_{\hat\theta_{\mathrm{train}(j)}}(x_{\mathrm{test}(j),i}) \big\}^2 \bigg\},$
where $\{(x_{\mathrm{test}(j),i}, y_{\mathrm{test}(j),i})\}_{i=1}^{n_{\mathrm{test}(j)}}$ denotes the $j$th test set, $\{(x_{\mathrm{train}(j),i}, y_{\mathrm{train}(j),i})\}_{i=1}^{n_{\mathrm{train}(j)}}$ denotes the $j$th training set, and $g_{\hat\theta_{\mathrm{train}(j)}}$ is the NN trained with the $j$th training set. Since the training of the NN was unstable on some of the training sets, we ignored a few of the 10 CV instances.

  • Langevin FV: After training the NN $g_\theta$, we compute $T = 10{,}000$ iterations of the Langevin process with $\delta = 5 \times 10^{-6} / \hat\sigma_0^2$, where $\hat\sigma_0^2 := (n-1)^{-1} \sum_{i=1}^n \{y_i - g_{\hat\theta}(x_i)\}^2$ denotes the estimate of $\sigma_0^2$ for each dataset. We discard the first 2500 iterations and compute LFV using the remaining 7500 iterations; we further compute
$\underbrace{\frac{1}{2\hat\sigma_0^2} \sum_{i=1}^n \{y_i - g_{\hat\theta}(x_i)\}^2}_{\text{training loss}} + \mathrm{LFV}(\alpha; X),$
which can be regarded as a Langevin variant of WAIC (Watanabe 2010), and compare it to the CV statistic approximating the generalization loss (a sketch of this criterion is given after this list).

  • Experimental setting: We compute the CV statistic and the training loss + LFV for each dataset; more precisely, we compute the average and standard deviation of these values over 20 different random initializations.
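A sketch of the criterion compared against the CV statistic (ours, assuming NumPy; `g_hat` is the vector of fitted values $g_{\hat\theta}(x_i)$ of the trained NN and `lfv` is the LFV value computed from the Langevin iterations).

```python
import numpy as np

def training_loss_plus_lfv(y, g_hat, lfv):
    """Training loss + LFV, the Langevin variant of WAIC used in Section 4.3."""
    sigma2_hat = np.sum((y - g_hat) ** 2) / (len(y) - 1)        # hat sigma_0^2
    train_loss = np.sum((y - g_hat) ** 2) / (2.0 * sigma2_hat)  # first term of the criterion
    return train_loss + lfv
```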

Results are shown in Table 7. Overall, the training loss + LFV roughly approximated the CV statistic; they are expected to be compatible, as both values approximate the generalization loss. Further, we computed the Spearman rank correlation between the averaged training loss + LFV and the averaged CV statistic (over 20 random initializations) across the seven datasets: the correlation was 0.75, while that between the training loss alone (without LFV) and the CV statistic was –0.75.

Table 7 CV statistic and training loss + LFV over 20 different initializations. $n$ denotes the number of samples, and $p = M(d+2)$ denotes the number of parameters in the NN.

In addition to its computational burden, the CV statistic became unstable even in these experiments, as the NNs trained on different CV instances fell into different local optima; we therefore ignored several CV instances for more stable computation. By contrast, LFV was computed more stably, as it requires only a single training of the NN.

Therefore, we expect that the proposed LFV can be used as a substitute for the cross-validation statistic on real datasets, even with larger sample sizes.

5 Conclusion

In this article, we considered a Gibbs generalization gap estimation method for overparameterized models. We proved that FV works as an asymptotically unbiased estimator even in the overparameterized setting. We proposed a Langevin approximation of FV for efficient computation and applied it to overparameterized linear regression and nonlinear neural network models.

Supplementary Materials

The supplementary material contains the proofs of the theorems and descriptions of the source code to reproduce the experimental results.


Acknowledgments

We thank the editor, the AE, and two anonymous reviewers for constructive comments and suggestions. We also thank Eiki Shimizu for suggesting several references, and Tetsuya Takabatake and Yukito Iba for helpful discussions.

Additional information

Funding

A. Okuno was supported by JST CREST (JPMJCR21N3) and JSPS KAKENHI (21K17718, 22H05106). K. Yano was supported by JST CREST (JPMJCR1763), JSPS KAKENHI (19K20222, 21H05205, 21K12067), and MEXT (JPJ010217).

References

  • Adlam, B., and Pennington, J. (2020), “The Neural Tangent Kernel in High Dimensions: Triple Descent and a Multi-Scale Theory of Generalization,” in Proceedings of the 37th International Conference on Machine Learning, pp. 74–84.
  • Akaike, H. (1974), “A New Look at the Statistical Model Identification,” IEEE Transactions on Automatic Control, 19, 716–723. DOI: 10.1109/TAC.1974.1100705.
  • Alcalá-Fdez, J., Fernández, A., Luengo, J., Derrac, J., García, S., Sánchez, L., and Herrera, F. (2011), “KEEL Data-Mining Software Tool: Data Set Repository, Integration of Algorithms and Experimental Analysis Framework,” Journal of Multiple-Valued Logic & Soft Computing, 17, 255–287.
  • Arora, S., Du, S. S., Hu, W., Li, Z., Salakhutdinov, R. R., and Wang, R. (2019), “On Exact Computation with an Infinitely Wide Neural Net,” in Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp. 8141–8150.
  • Azulay, A., and Weiss, Y. (2019), “Why Do Deep Convolutional Networks Generalize So Poorly to Small Image Transformations?” Journal of Machine Learning Research, 20, 1–25.
  • Bartlett, P. L., Long, P. M., Lugosi, G., and Tsigler, A. (2020), “Benign Overfitting in Linear Regression,” Proceedings of the National Academy of Sciences, 117, 30063–30070. DOI: 10.1073/pnas.1907378117.
  • Belkin, M., Hsu, D., and Mitra, P. (2018), “Overfitting or Perfect Fitting? Risk Bounds for Classification and Regression Rules that Interpolate,” in Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 2306–2317.
  • Belkin, M., Hsu, D., and Xu, J. (2020), “Two Models of Double Descent for Weak Features,” SIAM Journal on Mathematics of Data Science, 2, 1167–1180. DOI: 10.1137/20M1336072.
  • Cheng, X., Chatterji, N. S., Abbasi-Yadkori, Y., Bartlett, P. L., and Jordan, M. I. (2018), “Sharp Convergence Rates for Langevin Dynamics in the Nonconvex Setting,” arXiv:1805.01648.
  • Chiyuan, Z., Samy, B., Moritz, H., Benjamin, R., and Oriol, V. (2017), “Understanding Deep Learning Requires Rethinking Generalization,” in Proceedings of the 5th International Conference on Learning Representations.
  • Chizat, L., Oyallon, E., and Bach, F. (2019), “On Lazy Training in Differentiable Programming,” in Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp. 2937–2947.
  • d’Ascoli, S., Sagun, L., and Biroli, G. (2020), “Triple Descent and the Two Kinds of Overfitting: Where & Why Do They Appear?” in Proceedings of the 34th International Conference on Neural Information Processing Systems, pp. 3058–3069.
  • Eaton, M. L. (1989), “Group Invariance Applications in Statistics,” in Regional Conference Series in Probability and Statistics (Vol. 1), pp. i-v + 1–133, Institute of Mathematical Statistics.
  • Gao, T., and Jojic, V. (2016), “Degrees of Freedom in Deep Neural Networks,” in Proceedings of the 32nd Conference on Uncertainty in Artificial Intelligence, pp. 232–241.
  • Goodfellow, I., Bengio, Y., and Courville, A. (2016), Deep Learning, Cambridge, MA: MIT Press.
  • Hastie, T., Montanari, A., Rosset, S., and Tibshirani, R. J. (2022), “Surprises in High-Dimensional Ridgeless Least Squares Interpolation,” The Annals of Statistics, 50, 949–986. DOI: 10.1214/21-aos2133.
  • Jacot, A., Gabriel, F., and Hongler, C. (2018), “Neural Tangent Kernel: Convergence and Generalization in Neural Networks,” in Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 8580–8589.
  • Karakida, R., Akaho, S., and Amari, S.-i. (2019), “Universal Statistics of Fisher Information in Deep Neural Networks: Mean Field Approach,” in Proceedings of the 32nd International Conference on Artificial Intelligence and Statistics, pp. 1032–1041.
  • LeCun, Y., Bengio, Y., and Hinton, G. (2015), “Deep Learning,” Nature, 521, 436–444. DOI: 10.1038/nature14539.
  • Lee, J., Schoenholz, S., Pennington, J., Adlam, B., Xiao, L., Novak, R., and Sohl-Dickstein, J. (2020), “Finite Versus Infinite Neural Networks: An Empirical Study,” in Proceedings of the 34th Conference on Neural Information Processing Systems.
  • Mallows, C. L. (1973), “Some Comments on Cp,” Technometrics, 15, 661–675. DOI: 10.2307/1267380.
  • Mandt, S., Hoffman, M. D., and Blei, D. M. (2017), “Stochastic Gradient Descent as Approximate Bayesian Inference,” Journal of Machine Learning Research, 18, 1–35.
  • Mario, G., Arthur, J., Stefano, S., Franck, G., Levent, S., Stéphane, d., Giulio, B., Clément, H., and Matthieu, W. (2020), “Scaling Description of Generalization with Number of Parameters in Deep Learning,” Journal of Statistical Mechanics: Theory and Experiment, 023401.
  • Moody, J. E. (1992), “The Effective Number of Parameters: An Analysis of Generalization and Regularization in Nonlinear Learning Systems,” in Proceedings of the 4th International Conference on Neural Information Processing Systems, pp. 847–854.
  • Murata, N., Yoshizawa, S., and Amari, S.-I. (1994), “Network Information Criterion–Determining the Number of Hidden Units for an Artificial Neural Network Model,” IEEE Transactions on Neural Networks, 5, 865–872. DOI: 10.1109/72.329683.
  • Nakada, R., and Imaizumi, M. (2021), “Asymptotic Risk of Overparameterized Likelihood Models: Double Descent Theory for Deep Neural Networks,” arXiv:2103.00500.
  • Ramani, S., Blu, T., and Unser, M. (2008), “Monte-Carlo SURE: A Black-Box Optimization of Regularization Parameters for General Denoising Algorithms,” IEEE Transactions on Image Processing, 17, 1540–1554. DOI: 10.1109/TIP.2008.2001404.
  • Risken, H. (1996), Fokker-Planck Equation for Several Variables; Methods of Solution, Berlin, Heidelberg: Springer.
  • Sato, I., and Nakagawa, H. (2014), “Approximation Analysis of Stochastic Gradient Langevin Dynamics by Using Fokker-Planck Equation and Ito Process,” in Proceedings of the 31st International Conference on Machine Learning, pp. 982–990.
  • Shibata, R. (1976), “Selection of the Order of an Autoregressive Model by Akaike’s Information Criterion,” Biometrika, 63, 117–126. DOI: 10.1093/biomet/63.1.117.
  • Shibata, R. (1989), Statistical Aspects of Model Selection, pp. 215–240, New York: Springer.
  • Spiegelhalter, D. J., Best, N. G., Carlin, B. P., and van der Linde, A. (2002), “Bayesian Measures of Model Complexity and Fit,” (with Discussion), Journal of the Royal Statistical Society, Series B, 64, 583–639. DOI: 10.1111/1467-9868.00353.
  • Stein, C. M. (1981), “Estimation of the Mean of a Multivariate Normal Distribution,” The Annals of Statistics, 9, 1135–1151. DOI: 10.1214/aos/1176345632.
  • Takeuchi, K. (1976), “Distribution of Information Statistics and Validity Criteria of Models,” Mathematical Science, 153, 12–18.
  • Thomas, V., Pedregosa, F., Merriënboer, B., Manzagol, P.-A., Bengio, Y., and Le Roux, N. (2020), “On the Interplay Between Noise and Curvature and its Effect on Optimization and Generalization,” in Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, pp. 3503–3513.
  • Watanabe, S. (2010), “Asymptotic Equivalence of Bayes Cross Validation and Widely Applicable Information Criterion in Singular Learning Theory,” Journal of Machine Learning Research, 11, 3571–3594.
  • Yang, G., and Littwin, E. (2021), “Tensor Programs IIb: Architectural Universality Of Neural Tangent Kernel Training Dynamics,” in Proceedings of the 38th International Conference on Machine Learning, pp. 11762–11772.
  • Ye, J. (1998), “On Measuring and Correcting the Effects of Data Mining and Model Selection,” Journal of the American Statistical Association, 93, 120–131. DOI: 10.1080/01621459.1998.10474094.
  • Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. (2021), “Understanding Deep Learning (Still) Requires Rethinking Generalization,” Communication of the ACM, 64, 107–115. DOI: 10.1145/3446776.