
A Generalization Gap Estimation for Overparameterized Models via the Langevin Functional Variance

Pages 1287-1295 | Received 31 May 2022, Accepted 27 Mar 2023, Published online: 17 May 2023

Abstract

This article discusses the estimation of the generalization gap, the difference between generalization performance and training performance, for overparameterized models including neural networks. We first show that a functional variance, a key concept in defining a widely-applicable information criterion, characterizes the generalization gap even in overparameterized settings where a conventional theory cannot be applied. As the computation of the functional variance is expensive for overparameterized models, we propose an efficient approximation of the functional variance, the Langevin approximation of the functional variance (Langevin FV). This method leverages only the first-order gradient of the squared loss function, without referencing the second-order gradient; this ensures that the computation is efficient and the implementation is consistent with gradient-based optimization algorithms. We demonstrate the Langevin FV numerically by estimating the generalization gaps of overparameterized linear regression and nonlinear neural network models, each containing more than a thousand parameters. Supplementary materials for this article are available online.

1 Introduction

The great success of deep neural networks (LeCun, Bengio, and Hinton 2015; Goodfellow, Bengio, and Courville 2016) has been reported in many applied fields, such as natural language processing, image processing, and the natural sciences, and it has altered our understanding of generalization in data science. Generalization is a model’s ability to adapt properly to new data drawn from the same distribution as the training data. A classical discipline for generalization is the bias-variance dilemma: the richness of a model reduces the modeling bias but encourages fitting spurious patterns, leading to poor generalization performance. However, the modern practice of deep neural networks yields counterexamples to this discipline: deep neural networks are capable of exactly fitting the training data (known as interpolation) through the use of an overabundance of parameters (known as over-parameterization) while accurately predicting test data (Chiyuan et al. 2017; Belkin, Hsu, and Mitra 2018). Several such surprising phenomena have been reported: the double descent phenomenon (Belkin, Hsu, and Mitra 2018; Mario et al. 2020; Nakada and Imaizumi 2021; Hastie et al. 2022) and the multiple descent phenomenon (Adlam and Pennington 2020; d’Ascoli, Sagun, and Biroli 2020). These interesting but controversial phenomena give us a chance to rethink generalization in modern data science.

From a practical point of view, estimating the generalization performance helps us understand what is actually happening. A naive estimator of the generalization performance is the training performance. However, real-case studies using deep neural networks (Azulay and Weiss 2019; Zhang et al. 2021) suggest that there is a gap between the generalization performance and the training performance, a generalization gap. Double descent phenomena also imply the existence of a non-negligible generalization gap. This empirical evidence underscores the importance of accurately estimating the generalization gap.

Statistical tools for estimating the generalization gap have been reported in the literature on information criteria and optimism estimation. The well-known Akaike information criterion (AIC; Akaike 1974) estimates the generalization log-loss of the plug-in predictive density using a maximum likelihood estimator. The Takeuchi information criterion (TIC; Takeuchi 1976) is a modification of AIC to deal with model misspecification. The regularization information criterion (RIC; Shibata 1989) is an extension of TIC to accommodate maximum penalized likelihood estimation. Moody (1992) and Murata, Yoshizawa, and Amari (1994) generalized RIC to an arbitrary loss and advocated its use in nonlinear systems, such as neural networks. Mallows’ Cp (Mallows 1973) and Stein’s unbiased risk estimate (SURE; Stein 1981) offer an elegant estimation scheme for the generalization gap using the covariance between a predictor and its outcome in Gaussian models with the $\ell_2$ loss. Ramani, Blu, and Unser (2008) proposed an efficient Monte Carlo sampling-based method (Monte Carlo SURE) to estimate this covariance. Recently, new methods associated with Bayesian learning have been developed. The deviance information criterion (DIC; Spiegelhalter et al. 2002) and the widely-applicable information criterion (WAIC; Watanabe 2010) offer computationally efficient devices that estimate the generalization loss of Bayesian learning. In particular, Watanabe (2010) showed that in a statistical model with a fixed dimension, the generalization gap of Bayesian learning is asymptotically equal to the functional variance (FV), that is, the posterior variance of the log-likelihood.

However, studies on the data-driven measurement of the generalization gap in the overparameterized regime have been relatively scarce. Gao and Jojic (2016) extended Monte Carlo SURE to analyze the generalization gap in deep neural networks. Thomas et al. (2020) modified TIC for the same purpose. These studies empirically investigated the gaps on common datasets, such as MNIST and CIFAR-10, and successfully developed useful tools for understanding the generalization gap in the overparameterized regime. However, these approaches lack theoretical guarantees in overparameterized models, although they provide guarantees in statistical models with a fixed dimension; as a result, their theoretical applicability remains unclear. Furthermore, the computational costs of these approaches, in terms of both memory and speed, are relatively high: Gao and Jojic (2016) requires training the model several times, and Thomas et al. (2020) needs the second-order gradient (Hessian) of the loss function.

In this article, we focus on yet another tool to measure the generalization gap, the functional variance (FV) developed in Watanabe (2010). We present the following contributions:

  • Theoretical applicability: We prove that FV is an asymptotically unbiased estimator of the generalization gap of Bayesian learning for overparameterized linear regression models.

  • Computational efficiency: We propose a computationally efficient approximation of FV, Langevin FV (LFV), that leverages only the first-order gradient of the loss function.

We show the theoretical applicability of FV as a measure of the generalization gap in the overparameterized regime. Our theory employs the overparameterized linear regression model (see Belkin, Hsu, and Xu 2020; Bartlett et al. 2020; Hastie et al. 2022), but we emphasize that this linear model can be regarded as a linear approximation of nonlinear overparameterized models, including deep neural networks, and that such linearization has been widely employed to analyze the behavior of deep neural networks (see Jacot, Gabriel, and Hongler 2018; Arora et al. 2019; Chizat, Oyallon, and Bach 2019; Yang and Littwin 2021). One such theory is the neural tangent kernel (Jacot, Gabriel, and Hongler 2018), that is, an approximation of neural networks in a small region of the parameters around an estimate or an initial value. The conditions for the neural tangent kernel approximation have been investigated in Chizat, Oyallon, and Bach (2019), and the approximation is known to apply to any architecture of deep neural networks (Yang and Littwin 2021).

Although FV is theoretically favorable, it is defined through the full posterior covariance, and full Bayesian inference in the overparameterized regime is often prohibitive because of the computational burden. To compute the posterior efficiently, we employ Langevin dynamics (see Risken 1996), which sequentially adds a normal random perturbation to each update of gradient descent optimization and yields a stationary distribution approximating the posterior distribution (Cheng et al. 2018). This approach has several merits from the computational perspective; for example, the implementation is easy and consistent with the gradient-based optimization that is the de facto standard in deep neural network applications. Since the effect of linearization remains theoretically unclear and controversial in practice (see Lee et al. 2020), we confirm the applicability of the developed method to regression using neural networks with real datasets in Section 4.3.

The rest of the article is organized as follows. Section 2 provides our theoretical results on FV, Section 3 proposes Langevin FV, an efficient implementation of FV using Langevin dynamics, Section 4 presents numerical experiments, and Section 5 concludes the article. The supplementary material gives proofs of all theorems and descriptions of the source code to reproduce the experimental results.

2 Theoretical Results for the Functional Variance

In this section, we present theoretical results for the functional variance in overparameterized linear regression models. Throughout this article, $I_n$ denotes the $n \times n$ identity matrix for any $n \in \mathbb{N}$, and $\|\beta\|_2 = (\beta_1^2 + \beta_2^2 + \cdots + \beta_p^2)^{1/2}$ and $\|\beta\|_\infty = \max_{i=1,2,\ldots,p} |\beta_i|$ for any vector $\beta = (\beta_1, \beta_2, \ldots, \beta_p)^\top \in \mathbb{R}^p$.

2.1 Problem Setting

We begin by introducing the problem setting for the theory. We consider a linear regression model $y = X\beta_0 + \varepsilon$, where $y = (y_1, \ldots, y_n)^\top \in \mathbb{R}^n$ is a vector of observed outcomes, $X = (x_1, x_2, \ldots, x_n)^\top \in \mathbb{R}^{n \times p}$ is an $n \times p$ random design matrix, $\beta_0 \in \mathbb{R}^p$ is an unknown coefficient vector, and $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^\top \in \mathbb{R}^n$ is a vector of iid error terms with mean zero and variance $0 < \sigma_0^2 < \infty$. Our interest is the overparameterized situation, where the number of regressors $p$ is larger than the sample size $n$: $n \ll p$.

We take a quasi-Bayesian approach on the vector $\beta$. Working with the quasi-likelihood $f(y_i \mid x_i, \beta)$ given by a Gaussian distribution with mean $x_i^\top\beta$ and variance $\sigma_0^2$, and a Gaussian prior $N_p(0, \{\sigma_0^2/(\alpha n)\} I_p)$ on $\beta$ with mean $0$ and covariance matrix $\{\sigma_0^2/(\alpha n)\} I_p$, we obtain the quasi-posterior distribution of $\beta$:
(1) $\Pi_\alpha(\mathrm{d}\beta \mid y, X) = (2\pi)^{-p/2} (\det Q_\alpha)^{-1/2} \exp\{-(\beta - \hat\beta_\alpha)^\top Q_\alpha^{-1} (\beta - \hat\beta_\alpha)/2\}\, \mathrm{d}\beta$,
where $\hat\beta_\alpha$ is the maximum a posteriori estimate
(2) $\hat\beta_\alpha := (n^{-1} X^\top X + \alpha I_p)^{-1} n^{-1} X^\top y = \mathrm{argmin}_{\beta \in \mathbb{R}^p} \{ l_\alpha(\beta) := n^{-1}\|y - X\beta\|_2^2 + \alpha \|\beta\|_2^2 \}$
and $Q_\alpha$ is the matrix
(3) $Q_\alpha := n^{-1}\sigma_0^2 (n^{-1} X^\top X + \alpha I_p)^{-1}$.

We now analyze the Gibbs generalization gap
(4) $\Delta(\alpha; X) := \frac{1}{2\sigma_0^2}\big\{\mathbb{E}_{y,y^*}\big[\mathbb{E}_\beta[\|y^* - X\beta\|_2^2]\big] - \mathbb{E}_y\big[\mathbb{E}_\beta[\|y - X\beta\|_2^2]\big]\big\}$,
where $y^*$ is an independent copy of $y$ given $X$, $\mathbb{E}_\beta[\cdot]$ is the expectation with respect to the quasi-posterior distribution (1), and $\mathbb{E}_y[\cdot]$ (and $\mathbb{E}_{y,y^*}[\cdot]$) is the expectation with respect to $y$ (and $y, y^*$, respectively) given $X$. The Gibbs generalization gap considers the generalization gap for one stochastic sample from the quasi-posterior distribution used as an estimate of the parameters. To estimate the Gibbs generalization gap from the current observations, we focus on the functional variance
(5) $\mathrm{FV}(\alpha; X) := \sum_{i=1}^n \mathbb{V}_\beta[\log f(y_i \mid x_i, \beta)]$,
where $\mathbb{V}_\beta$ is the variance with respect to the quasi-posterior distribution (1).
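To make these quantities concrete, the following minimal sketch (ours, assuming NumPy; all variable names are illustrative) draws $\beta$ from the Gaussian quasi-posterior (1) with mean (2) and covariance (3), and evaluates FV (5) as the sum over $i$ of the posterior variances of the sample-wise Gaussian log-likelihoods.

```python
import numpy as np

def quasi_posterior(X, y, alpha, sigma2):
    """Mean (2) and covariance (3) of the Gaussian quasi-posterior (1)."""
    n, p = X.shape
    A = X.T @ X / n + alpha * np.eye(p)           # n^{-1} X'X + alpha I_p
    beta_hat = np.linalg.solve(A, X.T @ y / n)    # ridge estimate beta_hat_alpha in (2)
    Q = (sigma2 / n) * np.linalg.inv(A)           # covariance Q_alpha in (3)
    return beta_hat, Q

def functional_variance(X, y, alpha, sigma2, T=2000, seed=0):
    """Monte Carlo estimate of FV (5) using exact draws from the quasi-posterior."""
    rng = np.random.default_rng(seed)
    beta_hat, Q = quasi_posterior(X, y, alpha, sigma2)
    betas = rng.multivariate_normal(beta_hat, Q, size=T)        # T x p posterior draws
    # sample-wise log-likelihoods up to a constant (constants cancel in the variance)
    loglik = -(y[None, :] - betas @ X.T) ** 2 / (2.0 * sigma2)  # T x n
    return loglik.var(axis=0).sum()  # FV = sum_i Var_beta[log f(y_i | x_i, beta)]

# toy usage with an overparameterized linear model (p > n)
rng = np.random.default_rng(1)
n, p = 50, 75
X = rng.normal(size=(n, p)) / np.sqrt(n)
y = X @ rng.normal(size=p) + rng.normal(size=n)   # sigma_0^2 = 1
print(functional_variance(X, y, alpha=0.1, sigma2=1.0))
```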

2.2 Asymptotic Unbiasedness of the Functional Variance

Here we present our theoretical findings on the functional variance in the overparameterized regime. The following theorem shows that the functional variance FV(α;X) is an asymptotically unbiased estimator of the Gibbs generalization gap Δ(α;X) for the overparameterized linear regression with Gaussian covariates. Its proof is given in Supplement A.

Theorem 1.

Let $n, p \in \mathbb{N}$, and let $\Sigma$ be a $p \times p$ nonzero and nonnegative definite matrix. Let $x_1, x_2, \ldots, x_n \in \mathbb{R}^p$ be iid random vectors from the Gaussian distribution with mean zero and covariance matrix $\Sigma$. There exists an absolute constant $C > 0$ such that, for any $\varepsilon, \alpha > 0$,
$\mathbb{P}_X\big(|\mathbb{E}_y[\mathrm{FV}(\alpha;X)] - \Delta(\alpha;X)| > \varepsilon\big) \le C\Big(\frac{\xi^2}{\alpha^2}\frac{1}{n\varepsilon} + \frac{\xi^4}{\alpha^4}\frac{1}{n\varepsilon} + \frac{\xi^4 b^4}{\alpha^2 \sigma_0^4}\frac{1}{n\varepsilon^2} + \frac{1}{n}\Big),$
where $\xi := \mathrm{tr}\{\Sigma\}$, $b := p^{1/2}\|\beta_0\|_\infty$, and $\mathbb{P}_X$ denotes the probability with respect to $X$. Furthermore, under the conditions (C1) $\sup_{p \in \mathbb{N}} \xi < \infty$ and (C2) $\sup_{p \in \mathbb{N}} b < \infty$, we have $\mathbb{P}_X\big(|\mathbb{E}_y[\mathrm{FV}(\alpha;X)] - \Delta(\alpha;X)| \le C' a_n/\sqrt{n}\big) \to 1$ for some $C' > 0$ not depending on $n$ and $p$, with an arbitrary slowly increasing sequence $a_n \to \infty$.

Theorem 1 implies that FV successfully estimates the Gibbs generalization gap in the overparameterized regime, which supports the use of FV in this setting. Interestingly, its rate of convergence $(a_n/\sqrt{n})$ is not affected by the dimension $p$ but is affected by the value of $\sup_{p \in \mathbb{N}} \xi$: as $\sup_{p \in \mathbb{N}} \xi$ gets larger, the decay of the difference becomes slower. Furthermore, the result does not restrict the true distribution of the additive error to the Gaussian distribution; in contrast, some classical theories such as SURE require the error term to be Gaussian or of a related distribution.

Let us comment on the conditions in the theorem. Condition (C1) indicates the trace boundedness of $X^\top X$ (divided by $n$), because $\Sigma = \mathbb{E}[X^\top X]/n$. Both shallow and deep neural network models satisfy this condition in general settings (see, e.g., Karakida, Akaho, and Amari 2019). Condition (C2), together with Condition (C1), indicates that $n^{-1}\|\mu\|_2^2 \le n^{-1}\|X\|_F^2 \|\beta_0\|_2^2$ is upper-bounded by some constant $C > 0$ with probability approaching 1; that is, the average of the squared entries of the outcome expectation $\mu = X\beta_0$ is upper-bounded with high probability.

The main ingredient of the proof of Theorem 1 is an explicit identity for the difference between the generalization gap and FV, given in Lemma 2 below; its proof is presented in Supplement B. For two matrices $A = (a_{ij})$ and $B = (b_{ij})$ of the same size, $A \circ B = (a_{ij} b_{ij})$ denotes the Hadamard element-wise product, and we write $A^{\otimes 2} := A A^\top$.

Lemma 2.

We have
(6) $\Delta(\alpha; X) = \mathrm{tr}\{H_\alpha\}$ and
(7) $\mathbb{E}_y[\mathrm{FV}(\alpha;X)] - \Delta(\alpha;X) = -\frac{3}{2}\mathrm{tr}\{H_\alpha \circ H_\alpha\} + \mathrm{tr}\{H_\alpha \circ H_\alpha^2\} + \frac{1}{\sigma_0^2}\mathrm{tr}\big\{H_\alpha \circ \big((I_n - H_\alpha)(X\beta_0)\big)^{\otimes 2}\big\},$

where $H_\alpha$ is the regularized hat matrix
(8) $H_\alpha := n^{-1} X (n^{-1} X^\top X + \alpha I_p)^{-1} X^\top$.

Lemma 2 highlights the role of the regularized hat matrix $H_\alpha$ in evaluating the generalization gap and the residual. As in the fixed-dimension theory, the hat matrix controls the magnitude of the generalization gap. With singular values $s_1 \ge s_2 \ge \cdots \ge s_n \ge 0$ of the matrix $X$, under the conditions in Theorem 1, we have $\Delta(\alpha;X) = \mathrm{tr}\{H_\alpha\} = \sum_{i=1}^n \frac{s_i^2/n}{(s_i^2/n) + \alpha}$, which converges to some limit $\Delta^*(\alpha) \in [0, \infty)$ as $n \to \infty$, since $\mathrm{tr}\{H_\alpha\} \le \alpha^{-1} n^{-1} (\sum_{i=1}^n s_i^2) < \infty$ with probability approaching 1 as $n \to \infty$. Consider a simple case where $X$ has a fixed intrinsic dimension $p^* \ll n$ with growing $n, p$, namely $s_1 \ge s_2 \ge \cdots \ge s_{p^*} > 0 = s_{p^*+1} = s_{p^*+2} = \cdots = s_n$. Then, FV approaches the fixed intrinsic dimension $p^*$ (as $\alpha \to 0$), which coincides with the degrees of freedom (Mallows 1973; Ye 1998) and AIC (Akaike 1974) for linear regression equipped with $p^*$ regressors.
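As a quick numerical check of the identity $\Delta(\alpha;X) = \mathrm{tr}\{H_\alpha\} = \sum_{i=1}^n (s_i^2/n)/\{(s_i^2/n) + \alpha\}$, the following sketch (ours, assuming NumPy) compares the trace of the regularized hat matrix (8) with the singular-value expression.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, alpha = 50, 100, 0.1
X = rng.normal(size=(n, p))

# Regularized hat matrix (8): H_alpha = n^{-1} X (n^{-1} X'X + alpha I_p)^{-1} X'
H = X @ np.linalg.solve(X.T @ X / n + alpha * np.eye(p), X.T) / n
s = np.linalg.svd(X, compute_uv=False)          # singular values of X

print(np.trace(H))                              # Delta(alpha; X) = tr{H_alpha}
print(np.sum((s**2 / n) / (s**2 / n + alpha)))  # the same value via the singular values
```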

Remark 1

(Sample-wise and joint log-likelihoods). One may consider using the posterior variance of the joint log-likelihood, $\mathrm{JFV}(\alpha; X) := \mathbb{V}_\beta[\log f(y \mid X, \beta)]$, instead of FV (5), which uses the sample-wise log-likelihoods. In this case, we have the following expression for the difference between J-FV and the generalization gap; its proof is in Supplement C.

Proposition 1.

For any $\alpha > 0$, we have
$\mathbb{E}_y[\mathrm{JFV}(\alpha;X)] - \Delta(\alpha;X) = -\frac{3}{2}\mathrm{tr}\{H_\alpha^2\} + \mathrm{tr}\{H_\alpha^3\} + \frac{1}{\sigma_0^2}\mathrm{tr}\big\{H_\alpha \big((I_n - H_\alpha)(X\beta_0)\big)^{\otimes 2}\big\}.$

This exhibits an interesting correspondence between FV and J-FV (see Lemma 2 and Proposition 1): replacing the Hadamard products in $\mathbb{E}_y[\mathrm{FV}(\alpha;X)]$ with simple matrix products yields $\mathbb{E}_y[\mathrm{JFV}(\alpha;X)]$. Since the simple matrix products in $\mathbb{E}_y[\mathrm{JFV}(\alpha;X)]$ do not vanish as $n \to \infty$, J-FV is not an asymptotically unbiased estimator of the generalization gap.

Remark 2

(Extension of the theorem). Theorem 1 is further extended in Theorem 3, which proves the convergence of FV under milder conditions. Its proof is shown in Supplement A.

Theorem 3.

Let $n, p \in \mathbb{N}$ and assume the setting in Section 2.1. Let $X \in \mathbb{R}^{n \times p}$ be a random matrix and let $\mathbb{S}^{n-1} = \{u \in \mathbb{R}^n \mid \|u\|_2 = 1\}$ be the $(n-1)$-sphere. Let $q_1, q_2, \ldots, q_n$ be the marginal probability densities of the left-singular vectors $u_1, u_2, \ldots, u_n \in \mathbb{S}^{n-1}$ of $X$. Write $b := p^{1/2}\|\beta_0\|_\infty$, $\psi := \max_{i=1,2,\ldots,n} \max_{u \in \mathbb{S}^{n-1}} \{q_i(u)\} \int_{\mathbb{S}^{n-1}} \mathrm{d}v$, and $\eta := \frac{1}{n}\mathrm{tr}\{X^\top X\}$.

Then, there exists an absolute constant $C > 0$ such that, for any $\varepsilon, \alpha > 0$, we have
(9) $\mathbb{P}_X\big(|\mathbb{E}_y[\mathrm{FV}(\alpha;X)] - \Delta(\alpha;X)| > \varepsilon \,\big|\, \eta\big) \le C\Big(\frac{\psi\eta^2}{\alpha^2}\frac{1}{n\varepsilon} + \frac{\psi\eta^4}{\alpha^4}\frac{1}{n\varepsilon} + \frac{\psi\eta^4 b^4}{\alpha^2\sigma_0^4}\frac{1}{n\varepsilon^2}\Big),$
where $\mathbb{P}_X(\cdot \mid \eta)$ denotes the conditional probability of $X$ given $\eta$.

Theorem 1 (Gaussian covariates) follows from Theorem 3 with $\psi = 1$; see Proposition 7.1 of Eaton (1989), which proves that the left-singular vectors of a Gaussian design matrix follow the uniform distribution over the unit $(n-1)$-sphere $\mathbb{S}^{n-1}$, that is, $q_i(u) = \mathbb{1}_{\mathbb{S}^{n-1}}(u) / \int_{\mathbb{S}^{n-1}} \mathrm{d}v$ with $\mathbb{1}_{\mathbb{S}^{n-1}}(\cdot)$ the indicator function of $\mathbb{S}^{n-1}$.

Remark 3

(Definition of the generalization gap). Regarding the definition of the generalization gap $\Delta(\alpha;X)$, one may consider another design matrix $X^*$ for generating $y^*$. However, the covariate difference has little effect on the generalization gap (4), since we focus on the prediction of the conditional random variable $y \mid X$ and not the covariate $X$. For theoretical simplicity, we employ a single design matrix $X$ to generate both outcomes $y$ and $y^*$.

3 Langevin Functional Variance

Despite FV being theoretically attractive as proved in Theorem 3, there exist two computational difficulties:

  • (D1) generating samples from the quasi-posterior (1) requires the $p \times p$ matrix $Q_\alpha$, which is computationally intensive for overparameterized models with $n \ll p$, and

  • (D2) computing the quasi-posterior (1) is inconsistent with gradient-based optimization approaches, such as stochastic gradient descent (SGD), which are often used in optimizing overparameterized models.

To resolve these difficulties (D1) and (D2), we consider a Langevin approximation of the quasi-posterior in Section 3.1 and propose Langevin FV (LFV) for linear models in Section 3.2; we then extend it to nonlinear models in Section 3.3. We compare FV and the proposed LFV with existing estimators based on TIC (Takeuchi 1976) in Section 3.4.

3.1 Langevin Approximation of the Quasi-Posterior

A key idea of our algorithm is to approximate the quasi-posterior (1) via a Langevin process. Starting with the estimator $\gamma^{(1)} := \hat\beta_\alpha$, we stochastically update $\gamma^{(t)}$ by
(10) $\gamma^{(t+1)} = \gamma^{(t)} - \frac{1}{4}\delta\kappa_n \frac{\partial l_\alpha(\gamma^{(t)})}{\partial \gamma} + \delta^{1/2} e^{(t)} \quad (t = 1, 2, \ldots),$
where $\kappa_n := n/\sigma_0^2$ and $\delta > 0$ are user-specified parameters, $\{e^{(1)}, e^{(2)}, \ldots\}$ is a sequence of iid standard normal random vectors, and $l_\alpha$ is defined in (2). The distribution of the Langevin process (10) approximates the quasi-posterior, as discussed in the following. First, the Langevin process (10) is a discretization of the Ornstein-Uhlenbeck process $\mathrm{d}\tilde\gamma_\tau = -\frac{1}{2} Q_\alpha^{-1} (\tilde\gamma_\tau - \hat\beta_\alpha)\,\mathrm{d}\tau + \mathrm{d}e_\tau$ equipped with a Wiener process $\{e_\tau\}_{\tau \ge 0}$, that is, $e_\tau - e_{\tau'} \sim N_p(0, (\tau - \tau') I_p)$ for any $\tau > \tau' \ge 0$. The distribution of $\tilde\gamma_\tau$ with the initialization $\tilde\gamma_0 := \hat\beta_\alpha$ coincides with the quasi-posterior:
(11) $\mathcal{D}(\tilde\gamma_\tau) = N_p(\hat\beta_\alpha, Q_\alpha) \quad (\tau > 0).$

See Risken (1996), p. 156, for an example of the distribution (11). Next, with $n$ and $p$ fixed, for any sufficiently small $\phi > 0$, Theorem 2 of Cheng et al. (2018) evaluates the 1-Wasserstein distance between the distributions of the Ornstein-Uhlenbeck and Langevin processes as
(12) $W_1\big(\mathcal{D}(\tilde\gamma_{t\delta}), \mathcal{D}(\gamma^{(t)})\big) \le \phi$
for some $\delta = O(\phi^2/p)$ and $t \gtrsim p/\phi^2$; see Proposition 2 in Supplement D for details. Thus, the relations (11) and (12) imply that the distribution $\mathcal{D}(\gamma^{(t)})$ approximates the quasi-posterior $N_p(\hat\beta_\alpha, Q_\alpha)$.
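The following minimal sketch (ours, assuming NumPy; the step size $\delta$ and the number of iterations are illustrative choices) implements the Langevin update (10) for the ridge objective $l_\alpha$ in (2).

```python
import numpy as np

def langevin_samples(X, y, alpha, sigma2, delta, T, seed=0):
    """Iterate the Langevin update (10), started at the ridge estimate (2)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    kappa = n / sigma2                              # kappa_n = n / sigma_0^2
    gamma = np.linalg.solve(X.T @ X / n + alpha * np.eye(p),
                            X.T @ y / n)            # gamma^{(1)} = beta_hat_alpha
    samples = np.empty((T, p))
    for t in range(T):
        # gradient of l_alpha(gamma) = n^{-1} ||y - X gamma||_2^2 + alpha ||gamma||_2^2
        grad = -2.0 * X.T @ (y - X @ gamma) / n + 2.0 * alpha * gamma
        gamma = gamma - 0.25 * delta * kappa * grad + np.sqrt(delta) * rng.normal(size=p)
        samples[t] = gamma
    return samples
```

For a sufficiently small $\delta$, the empirical distribution of these iterates approximates the Gaussian quasi-posterior $N_p(\hat\beta_\alpha, Q_\alpha)$ in (11), so their empirical covariance can be compared against $Q_\alpha$ in (3) as a sanity check.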

This Langevin approximation resolves difficulties (D1) and (D2) since

  • (S1) the Langevin process (10) is computed using only the first-order gradient of the squared loss function $l_\alpha(\gamma)$ defined in (2), meaning that the large matrix $Q_\alpha$ is never explicitly computed, and

  • (S2) the Langevin process (10) is in the form of gradient descent up to the normal noise $\delta^{1/2} e^{(t)}$; this process can be implemented consistently with gradient-based algorithms, which are often used to optimize overparameterized models.

Remark 4

(Other approaches to posterior approximation). Besides Langevin dynamics, other approaches can be used to approximate the posterior distribution. One such approach is to exploit the stochastic noise in stochastic gradient descent (Sato and Nakagawa 2014; Mandt, Hoffman, and Blei 2017). Though powerful, this approach relies on a normal approximation of the stochastic gradient, which can hardly be expected to hold in the overparameterized regime and results in computational instability.

3.2 Langevin Functional Variance

Using the Langevin approximation (10), we propose LFV. For a time step $T \in \mathbb{N}$, let $\{\gamma^{(t)}\}_{t=1,2,\ldots,T}$ be the samples from the Langevin process, let $\hat{\mathbb{V}}_\gamma[\cdot]$ denote the empirical variance with respect to $\{\gamma^{(t)}\}_{t=1,2,\ldots,T}$, and let $\mu_i^{(t)} := x_i^\top \gamma^{(t)}$. Then, we define LFV as
(13) $\mathrm{LFV}(\alpha; X) := \sum_{i=1}^n \hat{\mathbb{V}}_\gamma[\log f(y_i \mid x_i, \gamma)] = \sum_{i=1}^n \frac{1}{T}\sum_{t=1}^T \bigg\{ \frac{1}{2\sigma_0^2}\big(y_i - \mu_i^{(t)}\big)^2 - \frac{1}{T}\sum_{t'=1}^T \frac{1}{2\sigma_0^2}\big(y_i - \mu_i^{(t')}\big)^2 \bigg\}^2.$

As shown in (12), the distribution of the Langevin process approaches the quasi-posterior; hence LFV is expected to approximate FV for a sufficiently large $T \in \mathbb{N}$. Thus, LFV is expected to inherit the properties of FV, such as the asymptotic unbiasedness for the generalization gap.

The distributional approximation of the high-dimensional vector $\gamma \in \mathbb{R}^p$ faces the curse of dimensionality. However, FV is a statistic taking values in $\mathbb{R}$; empirically, FV and LFV approximate the generalization gap $\Delta(\alpha;X)$ with a reasonable number of samples $T \in \mathbb{N}$, even if the dimension $p$ is relatively large. See the numerical experiments in Section 4.1.

Lastly, let us mention the computational time of FV and LFV. For each iteration, the number of operations to obtain the gradient in LFV is $O(p^2)$, while that to compute the covariance matrix in FV is $O(p^3)$; therefore, the computational time is expected to be reduced, as also demonstrated for linear models in Remark 5.
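A sketch of the LFV computation (13) is given below (ours, assuming NumPy and the `langevin_samples` helper sketched in Section 3.1; the values $\delta = 1/(10n)$ and $T = 15n$ in the commented usage follow the linear experiments of Section 4.1).

```python
import numpy as np

def langevin_fv(X, y, samples, sigma2):
    """LFV (13): sum over i of the empirical variance of the sample-wise losses."""
    mu = samples @ X.T                                # mu_i^{(t)} = x_i' gamma^{(t)}, shape (T, n)
    losses = (y[None, :] - mu) ** 2 / (2.0 * sigma2)  # (y_i - mu_i^{(t)})^2 / (2 sigma_0^2)
    return losses.var(axis=0).sum()                   # empirical variance over t, summed over i

# usage, combined with the Langevin sampler sketched in Section 3.1:
# samples = langevin_samples(X, y, alpha=0.1, sigma2=1.0,
#                            delta=1.0 / (10 * len(y)), T=15 * len(y))
# print(langevin_fv(X, y, samples, sigma2=1.0))
```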

3.3 Application to Nonlinear Models

This section provides a procedure for applying LFV (13) to a nonlinear overparameterized model $g_\theta(z_i)$ equipped with a parameter vector $\theta$ and an input vector $z_i$. The procedure is as follows: first, replacing $l_\alpha(\beta)$ in the Langevin process (10) with $\rho_\alpha(\theta) := \frac{1}{n}\sum_{i=1}^n (y_i - g_\theta(z_i))^2 + \alpha\|\theta\|_2^2$, the update (10) yields a sequence $\{\gamma^{(t)}\}_{t=1,2,\ldots}$; next, substituting $\mu_i^{(t)} := g_{\gamma^{(t)}}(z_i)$ into (13) gives LFV for the nonlinear model $g_\theta$. We emphasize that, while the exact form of the full quasi-posterior for nonlinear models is difficult to obtain and computing FV for nonlinear models is thus practically prohibitive, LFV can be applied even to such nonlinear models.

The gradient of the overparameterized model $g_\theta$ is compatible with the gradient of its linear approximation. This application works if the linear approximation of the nonlinear neural network is reasonable around the initial estimate; we remark again that this assumption is central in the neural tangent kernel literature (Jacot, Gabriel, and Hongler 2018; Arora et al. 2019).
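As an illustration of this procedure, the sketch below (ours, assuming NumPy) runs the Langevin update (10) with $\rho_\alpha(\theta)$ for the one-hidden-layer tanh network $g_\theta(z) = \langle\theta^{(2)}, \tanh(\theta^{(1)} z + \theta^{(0)})\rangle$ used in Section 4.2 and returns LFV (13); the hand-coded gradient is specific to this architecture, and in practice an automatic-differentiation framework would supply it.

```python
import numpy as np

def nn_forward(theta0, theta1, theta2, Z):
    """g_theta(z) = <theta2, tanh(theta1 z + theta0)>, evaluated on all rows of Z."""
    H = np.tanh(Z @ theta1.T + theta0)              # (n, M) hidden activations
    return H @ theta2, H

def nn_langevin_fv(Z, y, theta0, theta1, theta2, alpha, sigma2, delta, T, seed=0):
    """Langevin FV for the nonlinear model: update (10) with rho_alpha, then LFV (13)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    kappa = n / sigma2
    losses = np.empty((T, n))
    for t in range(T):
        g, H = nn_forward(theta0, theta1, theta2, Z)
        r = y - g                                   # residuals y_i - g_theta(z_i)
        # gradient of rho_alpha(theta) = n^{-1} sum_i (y_i - g_theta(z_i))^2 + alpha ||theta||_2^2
        dH = (1.0 - H**2) * theta2                  # tanh'(.) scaled by theta2, shape (n, M)
        g0 = -2.0 / n * (r[:, None] * dH).sum(axis=0) + 2.0 * alpha * theta0
        g1 = -2.0 / n * (r[:, None] * dH).T @ Z + 2.0 * alpha * theta1
        g2 = -2.0 / n * H.T @ r + 2.0 * alpha * theta2
        # Langevin update (10), applied block-wise to (theta0, theta1, theta2)
        theta0 = theta0 - 0.25 * delta * kappa * g0 + np.sqrt(delta) * rng.normal(size=theta0.shape)
        theta1 = theta1 - 0.25 * delta * kappa * g1 + np.sqrt(delta) * rng.normal(size=theta1.shape)
        theta2 = theta2 - 0.25 * delta * kappa * g2 + np.sqrt(delta) * rng.normal(size=theta2.shape)
        losses[t] = (y - nn_forward(theta0, theta1, theta2, Z)[0]) ** 2 / (2.0 * sigma2)
    return losses.var(axis=0).sum()                 # LFV (13) with mu_i^{(t)} = g_{gamma^{(t)}}(z_i)
```

In practice, a burn-in portion of the iterations would be discarded before taking the empirical variance, as done in the experiments of Sections 4.2 and 4.3.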

3.4 Comparison to the Existing Methods

Here, we compare FV and LFV to existing methods for estimating the generalization gap based on TIC (Takeuchi 1976). TIC estimates the generalization gap of regular statistical models by the quantity
(14) $\mathrm{tr}\{\hat F \hat G^{-1}\}$
equipped with the $p \times p$ matrices
$\hat F := \sum_{i=1}^n \frac{\partial \log f(y_i \mid x_i, \beta)}{\partial \beta} \frac{\partial \log f(y_i \mid x_i, \beta)}{\partial \beta^\top}\Big|_{\beta = \hat\beta}$ and $\hat G := -\sum_{i=1}^n \frac{\partial^2 \log f(y_i \mid x_i, \beta)}{\partial \beta \partial \beta^\top}\Big|_{\beta = \hat\beta}$,
where $\hat\beta$ is a maximum likelihood estimate. However, TIC cannot be applied to singular statistical models whose Hessian matrix $\hat G$ degenerates (i.e., the inverse of $\hat G$ does not exist), such as overparameterized models. To overcome this limitation of TIC, Thomas et al. (2020) replace the inverse matrix $\hat G^{-1}$ in (14) with the $\kappa$-generalized inverse matrix $\hat G_\kappa^+$; $\hat G_\kappa^+$ is defined as $V \Sigma_\kappa^+ V^\top$, where $V$ is the matrix of eigenvectors of $\hat G$ and $\Sigma_\kappa^+$ is the diagonal matrix whose $j$th diagonal entry is
$\lambda_j(\hat G_\kappa^+) := \begin{cases} 1/\lambda_j(\hat G) & (\lambda_j(\hat G) > \kappa) \\ 0 & (\lambda_j(\hat G) \le \kappa) \end{cases} \quad (j = 1, 2, \ldots, p),$
with $\{\lambda_j(\hat G) : j = 1, \ldots, p\}$ the eigenvalues of $\hat G$. This TIC modification of Thomas et al. (2020) is numerically examined in Section 4.1 for overparameterized linear regression; it rather estimates the number of nonzero eigenvalues of $\hat G = n^{-1} X^\top X$, that is, the number of nonzero singular values of $X$, but not the generalization gap $\Delta(\alpha;X)$.

Instead of modifying TIC, we can employ RIC proposed by Shibata (1989). RIC replaces the inverse Hessian $\hat G^{-1}$ in TIC with the regularized inverse matrix $(\hat G + \alpha I)^{-1}$. Moody (1992) and Murata, Yoshizawa, and Amari (1994) further generalized RIC to arbitrary loss functions and demonstrated its application to shallow neural network models with a small number of hidden units. RIC has much in common with FV; specifically, RIC is also an asymptotically unbiased estimator of the Gibbs generalization gap $\Delta(\alpha;X)$ for overparameterized linear regression, and numerically, RIC behaves similarly to FV (and LFV) in the experiments discussed in Section 4.1. However, RIC still requires the inverse of the $p \times p$ matrix $\hat G + \alpha I$, and its computational cost is intensive in the overparameterized setting ($p \gg n$).
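For reference, a sketch of these baseline penalties for the Gaussian linear model (ours, assuming NumPy; setting $\sigma_0^2 = 1$ and using the $n^{-1}$ normalization $\hat G = n^{-1}X^\top X$ stated above are our illustrative choices).

```python
import numpy as np

def tic_kappa(X, y, beta_hat, kappa):
    """TIC with the kappa-generalized inverse of G_hat (Thomas et al. 2020)."""
    n, p = X.shape
    r = y - X @ beta_hat
    F = (X * r[:, None] ** 2).T @ X / n          # F_hat = n^{-1} sum_i r_i^2 x_i x_i'
    G = X.T @ X / n                              # G_hat = n^{-1} X'X
    lam, V = np.linalg.eigh(G)
    inv_lam = np.where(lam > kappa, 1.0 / np.maximum(lam, kappa), 0.0)
    G_plus = (V * inv_lam) @ V.T                 # kappa-generalized inverse V Sigma_kappa^+ V'
    return np.trace(F @ G_plus)

def ric(X, y, beta_hat, alpha):
    """RIC-type penalty: G_hat^{-1} replaced by the regularized inverse (G_hat + alpha I)^{-1}."""
    n, p = X.shape
    r = y - X @ beta_hat
    F = (X * r[:, None] ** 2).T @ X / n
    G = X.T @ X / n
    return np.trace(np.linalg.solve(G + alpha * np.eye(p), F))
```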

4 Numerical Experiments

This section presents the numerical evaluation of the Langevin FV (13) using both linear and nonlinear models.

4.1 LFV and Baselines in Linear Models

We first evaluate LFV (13) and compare it with some baselines in linear overparameterized models. The set-up of the numerical experiments is summarized as follows:

  • Synthetic data generation: For $p = 1.5n$, matrices $U \in \mathbb{R}^{n \times n}$ and $V \in \mathbb{R}^{p \times n}$ are drawn from the uniform distributions over the sets of matrices with orthonormal columns satisfying $U^\top U = V^\top V = I_n$, respectively. Entries of $\beta_0 \in \mathbb{R}^p$ are iid from $N(0, 1/p)$. With given singular values $\{s_i\}_{i=1}^n$ and $S := \mathrm{diag}(s_1, \ldots, s_n)$, the design matrix $X := U S V^\top$ is computed. The outcome $y$ is generated 50 times from $N_n(X\beta_0, I_n)$, that is, $\sigma_0^2 = 1$ (a data-generation sketch is given after this list).

  • Singular values: We consider three different types of singular values: (i) $X$ has a fixed intrinsic dimension $d^* = 10$, that is, $s_1 = \cdots = s_{10} = n^{1/2}$ and $s_{11} = s_{12} = \cdots = s_n = 0$; (ii) $s_i = n^{1/2} i^{-1}$; and (iii) $s_i = n^{1/2} i^{-1/2}$. Theorem 3 proves the convergence of FV in settings (i) and (ii), since $\lim_{n \to \infty} n^{-1}\sum_{i=1}^n s_i^2 < \infty$ in (i) and (ii), whereas it does not prove the convergence in setting (iii), since $n^{-1}\sum_{i=1}^n s_i^2$ is unbounded as $n \to \infty$ in (iii).

  • Evaluation: We evaluate the following baselines and LFV over 50 experiments. Throughout these experiments, we employ $\alpha = 0.1$ for the ridge regularization (2).

    1. TIC penalty (Takeuchi 1976) is extended to the overparameterized setting (Thomas et al. 2020) by replacing the inverse of the Hessian matrix $\hat G$ in TIC with the $\kappa$-generalized inverse matrix $\hat G_\kappa^+$. See Section 3.4 for the definition of $\mathrm{TIC}(\kappa) := \mathrm{tr}\{\hat F \hat G_\kappa^+\}$.

    2. FV is empirically computed with T=15n samples of β generated from the quasi-posterior (1). Its expectation shown in Lemma 2 is also computed.

    3. LFV is computed with $\delta = 1/(10n)$ and $T = 15n$ Langevin samples $\gamma^{(t)}$ (see (10)).
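A sketch of the synthetic data generation described in the list above (ours, assuming NumPy; the QR factors of Gaussian matrices serve as the uniformly distributed orthonormal frames $U$ and $V$).

```python
import numpy as np

def make_linear_data(n, singular_values, seed=0):
    """Design X = U S V' with prescribed singular values, and y ~ N_n(X beta_0, I_n)."""
    rng = np.random.default_rng(seed)
    p = int(1.5 * n)
    U, _ = np.linalg.qr(rng.normal(size=(n, n)))        # n x n orthogonal
    V, _ = np.linalg.qr(rng.normal(size=(p, n)))        # p x n with orthonormal columns
    X = U @ np.diag(singular_values) @ V.T              # X := U S V'
    beta0 = rng.normal(scale=1.0 / np.sqrt(p), size=p)  # entries iid N(0, 1/p)
    y = X @ beta0 + rng.normal(size=n)                  # sigma_0^2 = 1
    return X, y, beta0

# singular-value settings (i)-(iii) from the list above
n = 100
i = np.arange(1, n + 1)
settings = {
    "(i)":   np.where(i <= 10, np.sqrt(n), 0.0),        # fixed intrinsic dimension 10
    "(ii)":  np.sqrt(n) / i,                            # s_i = n^{1/2} i^{-1}
    "(iii)": np.sqrt(n) / np.sqrt(i),                   # s_i = n^{1/2} i^{-1/2}
}
X, y, beta0 = make_linear_data(n, settings["(ii)"])
```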

The results are presented in Tables 1–3. Overall, FV and LFV estimated the generalization gap $\Delta(\alpha;X)$ (with $p = 1.5n$) more accurately than TIC, and their biases were not drastically different. TIC was entirely inaccurate in settings (ii) and (iii); together with the result in (i), this indicates that TIC estimates the number of nonzero eigenvalues of the Hessian matrix $\hat G = n^{-1} X^\top X$ (i.e., the number of nonzero singular values of $X$), which is generally different from the generalization gap $\Delta(\alpha;X)$.

Table 1 The generalization gap $\Delta(\alpha;X)$ and its estimates (TIC, FV, LFV) with standard deviations in setting (i), where $s_i = n^{1/2}$ for $i = 1, 2, \ldots, 10$ and 0 otherwise.

Table 2 The generalization gap $\Delta(\alpha;X)$ and its estimates (TIC, FV, LFV) with standard deviations in setting (ii), where $s_i = n^{1/2} i^{-1}$.

Table 3 The generalization gap $\Delta(\alpha;X)$ and its estimates (TIC, FV, LFV) with standard deviations in setting (iii), where $s_i = n^{1/2} i^{-1/2}$.

Remark 5

(Computational time). Let us mention the computational time of FV and LFV. We measured the computation time on the linear synthetic dataset with $n = 100$, $p = 2000$, $\alpha = 0.01$, $s_i = n^{1/2} i^{-1/2}$, and $T = 500$, repeated five times. Over the five runs, the processing time for LFV was 5.33 ± 0.09 sec and that for FV was 18.87 ± 0.18 sec, which confirms that LFV is indeed faster than FV. For these experiments, we used an AMD Ryzen 7 5700X processor; the computation was not parallelized for a fair comparison.

4.2 LFV for Nonlinear Models

This section presents the evaluation of the Langevin FV for nonlinear neural networks via synthetic-data experiments. We use the procedure in Section 3.3 for applying LFV to nonlinear models. The set-up is summarized as follows:

  • Synthetic data: Let $n \in \{50, 500, 1000\}$ and $\sigma^2 = 1$. For $i = 1, 2, \ldots, n$, we generate $z_i \overset{\mathrm{iid}}{\sim} N_d(0, I_d)$ and $\varepsilon_i \overset{\mathrm{iid}}{\sim} N(0, \sigma^2)$, and set $\mu_i := \mu(z_i)$ and $y_i := \mu_i + \varepsilon_i$ with the function $\mu(z) := 3\tanh(\langle z, \mathbf{1} \rangle / 2)$.

  • Neural network (NN): We employ a fully-connected one-hidden-layer NN, $g_\theta(z) := \langle \theta^{(2)}, \tanh(\theta^{(1)} z + \theta^{(0)}) \rangle$, with $M \in \{50, 100, 150\}$ hidden units, where $\theta = (\theta^{(0)}, \theta^{(1)}, \theta^{(2)}) \in \mathbb{R}^M \times \mathbb{R}^{M \times d} \times \mathbb{R}^M$ is a parameter vector (and thus the number of parameters is $p = M(d+2)$) and $\tanh(\cdot)$ applies the hyperbolic tangent function $\tanh(z) := (\exp(z) - \exp(-z))/(\exp(z) + \exp(-z))$ entry-wise. Let $\tilde\theta_0$ be a parameter satisfying $g_{\tilde\theta_0}(z) = \mu(z)$. For each experiment, we employ a true parameter $\theta_0$ whose entries are drawn independently from the normal distribution with mean given by the corresponding entry of $\tilde\theta_0$ and variance 0.01. Further, we initialize the parameter $\theta$ of the NN to be trained by the element-wise independent normal distribution whose mean is $\theta_0$ and whose element-wise variance is the larger value 1, and update $\theta$ by full-batch gradient descent with a learning rate of 0.1, a ridge-regularization coefficient $\alpha = 10^{-3}$, and $0.3n$ iterations.

  • Langevin FV: For each setting $(d, M) \in \{5, 10, 15\} \times \{50, 100, 150\}$, we take the average of LFV over 25 experiments. In each experiment, we randomly generate $\{(z_i, y_i)\}_{i=1}^n \subset \mathbb{R}^d \times \mathbb{R}$, train the NN $g_\theta$, and compute $T = 3000$ iterations of the Langevin process with $\delta = 10^{-5}$. We discard the first $0.1T$ iterations and employ the remaining $0.9T$ iterations to compute LFV.

  • Generalization gap $\tilde\Delta$: For each setting $(d, M) \in \{5, 10, 15\} \times \{50, 100, 150\}$, we take the average of the following generalization gap over 200 experiments:
$\tilde\Delta := \frac{1}{2\sigma_0^2}\bigg\{ \mathbb{E}_{y^*}\Big( \frac{1}{n}\sum_{i=1}^n \{y_i^* - g_{\hat\theta}(z_i)\}^2 \Big) - \frac{1}{n}\sum_{i=1}^n \{y_i - g_{\hat\theta}(z_i)\}^2 \bigg\}.$
In each experiment, we randomly generate $\{(z_i, y_i)\}_{i=1}^n \subset \mathbb{R}^d \times \mathbb{R}$, train the NN $g_\theta$, and evaluate $\tilde\Delta$ by leveraging the ground truth $\mu_i = \mu(z_i)$ and $\sigma_0^2 = 1$.

LFV and the generalization gap $\tilde\Delta$ for the nonlinear NN with $M$ hidden units are shown in Tables 4–6, where the LFV values for overparameterized models (i.e., $p = M(d+2) > n$) are shown in gray. We observed that LFV provides a rough estimate of the generalization gap even for nonlinear overparameterized NN models, although our overparameterized theories on FV and LFV are proved for linear models. Comparing the results for $n = 50$ (Table 4) and $n = 1000$ (Table 6) suggests that the bias of LFV decreases as the sample size grows.

Table 4 The generalization gap and LFV for the NN model with n = 50 and T = 3000.

Table 5 The generalization gap and LFV for the NN model with n = 500 and T = 3000.

Table 6 The generalization gap and LFV for the NN model with n = 1000 and T = 3000.

4.3 LFV for Nonlinear Models Using Real Datasets

In this section, we compare LFV to the cross-validation statistic using real datasets with small sample sizes; note that cross-validation becomes prohibitive as the sample size grows. We use the procedure in Section 3.3 for applying LFV to nonlinear models. The set-up is summarized as follows.

  • Real data: We collected the following seven regression datasets from the KEEL dataset repository (Alcalá-Fdez et al. 2011): “machineCPU” ($n = 208, d = 6$), “wankara” ($n = 320, d = 9$), “baseball” ($n = 336, d = 16$), “dee” ($n = 364, d = 6$), “autoMPG6” ($n = 391, d = 5$), “autoMPG8” ($n = 391, d = 7$), and “stock” ($n = 949, d = 9$). These datasets are selected so that our NN structure described below becomes overparameterized. For each dataset, we standardized (scaling and centering) the design matrix $X$ and the target variable $y$.

  • Neural network (NN): We employ the same architecture as the nonlinear NN in Section 4.2, with $M = 100$ hidden units. For the neural network training, we first initialize the parameter $\theta$ randomly from the normal distribution $N_p(0, d^{-1/2} I_p)$; we employ 20 different random initializations for each dataset experiment. We train the NN by full-batch gradient descent (using the entire dataset) with a learning rate of 0.01 and a ridge-regularization coefficient $\alpha = 0.1$. The gradient descent algorithm is terminated if $|\mathrm{loss}_t - \mathrm{loss}_{t-1}| / |\mathrm{loss}_{t-1}| < 10^{-6}$, where $\mathrm{loss}_t$ denotes the training loss at iteration $t$.

  • Cross-validation (CV): We divide each dataset into a training set (90%) and a test set (10%) uniformly at random. Starting from the NN parameters already trained with the entire dataset as described above, we train the NN by gradient descent with a learning rate of 0.001 using the divided training set. The gradient descent algorithm for the $j$th CV instance is terminated if $|\mathrm{loss}_{j,t} - \mathrm{loss}_{j,t-1}| / |\mathrm{loss}_{j,t-1}| < 10^{-6}$, where $\mathrm{loss}_{j,t}$ denotes the training loss at iteration $t$ using the training set of the $j$th CV instance. After the training, we compute the 10-fold cross-validation statistic
$\mathrm{CV} := \frac{1}{2\hat\sigma_0^2} \cdot \frac{1}{10} \sum_{j=1}^{10} \bigg\{ \frac{1}{n_{\mathrm{test}(j)}} \sum_{i=1}^{n_{\mathrm{test}(j)}} \big\{ y_{\mathrm{test}(j),i} - g_{\hat\theta_{\mathrm{train}(j)}}(x_{\mathrm{test}(j),i}) \big\}^2 \bigg\},$
where $\{(x_{\mathrm{test}(j),i}, y_{\mathrm{test}(j),i})\}_{i=1}^{n_{\mathrm{test}(j)}}$ denotes the $j$th test set, $\{(x_{\mathrm{train}(j),i}, y_{\mathrm{train}(j),i})\}_{i=1}^{n_{\mathrm{train}(j)}}$ denotes the $j$th training set, and $g_{\hat\theta_{\mathrm{train}(j)}}$ is the NN trained with the $j$th training set. Since the training of the NN was unstable on some of the training sets, we ignored a few of the 10 CV instances.

  • Langevin FV: After training the NN $g_\theta$, we compute $T = 10{,}000$ iterations of the Langevin process with $\delta = 5 \times 10^{-6} / \hat\sigma_0^2$, where $\hat\sigma_0^2 := (n-1)^{-1} \sum_{i=1}^n \{y_i - g_{\hat\theta}(x_i)\}^2$ denotes the estimate of $\sigma_0^2$ for each dataset. We discard the first 2500 iterations and compute LFV using the remaining 7500 iterations; we further compute
$\underbrace{\frac{1}{2\hat\sigma_0^2} \sum_{i=1}^n \{y_i - g_{\hat\theta}(x_i)\}^2}_{\text{training loss}} + \mathrm{LFV}(\alpha; X),$
which can be regarded as a Langevin variant of WAIC (Watanabe 2010), and compare it to the CV statistic approximating the generalization loss (a sketch of this criterion is given after this list).

  • Experimental setting: We compute the CV statistic and the training loss + LFV for each dataset; more precisely, we compute the average and standard deviation of these values over 20 different random initializations.
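A sketch of the criterion compared against the CV statistic (ours, assuming NumPy; `g_hat` is the vector of fitted values $g_{\hat\theta}(x_i)$ of the trained NN and `lfv` is the LFV value computed from the Langevin iterations).

```python
import numpy as np

def training_loss_plus_lfv(y, g_hat, lfv):
    """Training loss + LFV, the Langevin variant of WAIC used in Section 4.3."""
    sigma2_hat = np.sum((y - g_hat) ** 2) / (len(y) - 1)        # hat sigma_0^2
    train_loss = np.sum((y - g_hat) ** 2) / (2.0 * sigma2_hat)  # first term of the criterion
    return train_loss + lfv
```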

Results are shown in Table 7. Overall, the training loss + LFV roughly approximated the CV statistic; they are expected to be compatible, as both values approximate the generalization loss. Further, we computed the Spearman rank correlation between the averaged training loss + LFV and the averaged CV statistic (over 20 random initializations) across the seven datasets: the correlation was 0.75, while that between the training loss alone (without LFV) and the CV statistic was –0.75.

Table 7 CV statistic and training loss + LFV over 20 different initializations. $n$ denotes the number of samples, and $p = M(d+2)$ denotes the number of parameters in the NN.

In addition to its computational burden, the CV statistic became unstable even in these experiments, as the NNs trained on different CV instances fell into different local optima; we therefore ignored several CV instances for more stable computation. By contrast, LFV was computed more stably, as it requires only a single training of the NN.

Therefore, we expect that the proposed LFV can be used as a substitute for the cross-validation statistic on real datasets, even with larger sample sizes.

5 Conclusion

In this article, we considered a Gibbs generalization gap estimation method for overparameterized models. We proved that FV works as an asymptotically unbiased estimator even in the overparameterized setting. We proposed a Langevin approximation of FV for efficient computation and applied it to overparameterized linear regression and nonlinear neural network models.

Supplementary Materials

The supplementary material contains the proofs of the theorems and descriptions of the source code to reproduce the experimental results.


Acknowledgments

We thank the editor, the AE, and two anonymous reviewers for constructive comments and suggestions. We also thank Eiki Shimizu for suggesting several references, and Tetsuya Takabatake and Yukito Iba for helpful discussions.

Additional information

Funding

A. Okuno was supported by JST CREST (JPMJCR21N3) and JSPS KAKENHI (21K17718, 22H05106). K. Yano was supported by JST CREST (JPMJCR1763), JSPS KAKENHI (19K20222, 21H05205, 21K12067), and MEXT (JPJ010217).

References

  • Adlam, B., and Pennington, J. (2020), “The Neural Tangent Kernel in High Dimensions: Triple Descent and a Multi-Scale Theory of Generalization,” in Proceedings of the 37th International Conference on Machine Learning, pp. 74–84.
  • Akaike, H. (1974), “A New Look at the Statistical Model Identification,” IEEE Transactions on Automatic Control, 19, 716–723. DOI: 10.1109/TAC.1974.1100705.
  • Alcalá-Fdez, J., Fernández, A., Luengo, J., Derrac, J., García, S., Sánchez, L., and Herrera, F. (2011), “KEEL Data-Mining Software Tool: Data Set Repository, Integration of Algorithms and Experimental Analysis Framework,” Journal of Multiple-Valued Logic & Soft Computing, 17, 255–287.
  • Arora, S., Du, S. S., Hu, W., Li, Z., Salakhutdinov, R. R., and Wang, R. (2019), “On Exact Computation with an Infinitely Wide Neural Net,” in Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp. 8141–8150.
  • Azulay, A., and Weiss, Y. (2019), “Why Do Deep Convolutional Networks Generalize So Poorly to Small Image Transformations?” Journal of Machine Learning Research, 20, 1–25.
  • Bartlett, P. L., Long, P. M., Lugosi, G., and Tsigler, A. (2020), “Benign Overfitting in Linear Regression,” Proceedings of the National Academy of Sciences, 117, 30063–30070. DOI: 10.1073/pnas.1907378117.
  • Belkin, M., Hsu, D., and Mitra, P. (2018), “Overfitting or Perfect Fitting? Risk Bounds for Classification and Regression Rules that Interpolate,” in Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 2306–2317.
  • Belkin, M., Hsu, D., and Xu, J. (2020), “Two Models of Double Descent for Weak Features,” SIAM Journal on Mathematics of Data Science, 2, 1167–1180. DOI: 10.1137/20M1336072.
  • Cheng, X., Chatterji, N. S., Abbasi-Yadkori, Y., Bartlett, P. L., and Jordan, M. I. (2018), “Sharp Convergence Rates for Langevin Dynamics in the Nonconvex Setting,” arXiv:1805.01648.
  • Chiyuan, Z., Samy, B., Moritz, H., Benjamin, R., and Oriol, V. (2017), “Understanding Deep Learning Requires Rethinking Generalization,” in Proceedings of the 5th International Conference on Learning Representations.
  • Chizat, L., Oyallon, E., and Bach, F. (2019), “On Lazy Training in Differentiable Programming,” in Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp. 2937–2947.
  • d’Ascoli, S., Sagun, L., and Biroli, G. (2020), “Triple Descent and the Two Kinds of Overfitting: Where & Why Do They Appear?” in Proceedings of the 34th International Conference on Neural Information Processing Systems, pp. 3058–3069.
  • Eaton, M. L. (1989), “Group Invariance Applications in Statistics,” in Regional Conference Series in Probability and Statistics (Vol. 1), pp. i-v + 1–133, Institute of Mathematical Statistics.
  • Gao, T., and Jojic, V. (2016), “Degrees of Freedom in Deep Neural Networks,” in Proceedings of the 32nd Conference on Uncertainty in Artificial Intelligence, pp. 232–241.
  • Goodfellow, I., Bengio, Y., and Courville, A. (2016), Deep Learning, Cambridge, MA: MIT Press.
  • Hastie, T., Montanari, A., Rosset, S., and Tibshirani, R. J. (2022), “Surprises in High-Dimensional Ridgeless Least Squares Interpolation,” The Annals of Statistics, 50, 949–986. DOI: 10.1214/21-aos2133.
  • Jacot, A., Gabriel, F., and Hongler, C. (2018), “Neural Tangent Kernel: Convergence and Generalization in Neural Networks,” in Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 8580–8589.
  • Karakida, R., Akaho, S., and Amari, S.-i. (2019), “Universal Statistics of Fisher Information in Deep Neural Networks: Mean Field Approach,” in Proceedings of the 32nd International Conference on Artificial Intelligence and Statistics, pp. 1032–1041.
  • LeCun, Y., Bengio, Y., and Hinton, G. (2015), “Deep Learning,” Nature, 521, 436–444. DOI: 10.1038/nature14539.
  • Lee, J., Schoenholz, S., Pennington, J., Adlam, B., Xiao, L., Novak, R., and Sohl-Dickstein, J. (2020), “Finite Versus Infinite Neural Networks: An Empirical Study,” in Proceedings of the 34th Conference on Neural Information Processing Systems.
  • Mallows, C. L. (1973), “Some Comments on Cp,” Technometrics, 15, 661–675. DOI: 10.2307/1267380.
  • Mandt, S., Hoffman, M. D., and Blei, D. M. (2017), “Stochastic Gradient Descent as Approximate Bayesian Inference,” Journal of Machine Learning Research, 18, 1–35.
  • Mario, G., Arthur, J., Stefano, S., Franck, G., Levent, S., Stéphane, d., Giulio, B., Clément, H., and Matthieu, W. (2020), “Scaling Description of Generalization with Number of Parameters in Deep Learning,” Journal of Statistical Mechanics: Theory and Experiment, 023401.
  • Moody, J. E. (1992), “The Effective Number of Parameters: An Analysis of Generalization and Regularization in Nonlinear Learning Systems,” in Proceedings of the 4th International Conference on Neural Information Processing Systems, pp. 847–854.
  • Murata, N., Yoshizawa, S., and Amari, S.-I. (1994), “Network Information Criterion–Determining the Number of Hidden Units for an Artificial Neural Network Model,” IEEE Transactions on Neural Networks, 5, 865–872. DOI: 10.1109/72.329683.
  • Nakada, R., and Imaizumi, M. (2021), “Asymptotic Risk of Overparameterized Likelihood Models: Double Descent Theory for Deep Neural Networks,” arXiv:2103.00500.
  • Ramani, S., Blu, T., and Unser, M. (2008), “Monte-Carlo SURE: A Black-Box Optimization of Regularization Parameters for General Denoising Algorithms,” IEEE Transactions on Image Processing, 17, 1540–1554. DOI: 10.1109/TIP.2008.2001404.
  • Risken, H. (1996), Fokker-Planck Equation for Several Variables; Methods of Solution, Berlin, Heidelberg: Springer.
  • Sato, I., and Nakagawa, H. (2014), “Approximation Analysis of Stochastic Gradient Langevin Dynamics by Using Fokker-Planck Equation and Ito Process,” in Proceedings of the 31st International Conference on Machine Learning, pp. 982–990.
  • Shibata, R. (1976), “Selection of the Order of an Autoregressive Model by Akaike’s Information Criterion,” Biometrika, 63, 117–126. DOI: 10.1093/biomet/63.1.117.
  • Shibata, R. (1989), Statistical Aspects of Model Selection, pp. 215–240, New York: Springer.
  • Spiegelhalter, D. J., Best, N. G., Carlin, B. P., and van der Linde, A. (2002), “Bayesian Measures of Model Complexity and Fit,” (with Discussion), Journal of the Royal Statistical Society, Series B, 64, 583–639. DOI: 10.1111/1467-9868.00353.
  • Stein, C. M. (1981), “Estimation of the Mean of a Multivariate Normal Distribution,” The Annals of Statistics, 9, 1135–1151. DOI: 10.1214/aos/1176345632.
  • Takeuchi, K. (1976), “Distribution of Information Statistics and Validity Criteria of Models,” Mathematical Science, 153, 12–18.
  • Thomas, V., Pedregosa, F., Merriënboer, B., Manzagol, P.-A., Bengio, Y., and Le Roux, N. (2020), “On the Interplay Between Noise and Curvature and its Effect on Optimization and Generalization,” in Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, pp. 3503–3513.
  • Watanabe, S. (2010), “Asymptotic Equivalence of Bayes Cross Validation and Widely Applicable Information Criterion in Singular Learning Theory,” Journal of Machine Learning Research, 11, 3571–3594.
  • Yang, G., and Littwin, E. (2021), “Tensor Programs IIb: Architectural Universality Of Neural Tangent Kernel Training Dynamics,” in Proceedings of the 38th International Conference on Machine Learning, pp. 11762–11772.
  • Ye, J. (1998), “On Measuring and Correcting the Effects of Data Mining and Model Selection,” Journal of the American Statistical Association, 93, 120–131. DOI: 10.1080/01621459.1998.10474094.
  • Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. (2021), “Understanding Deep Learning (Still) Requires Rethinking Generalization,” Communication of the ACM, 64, 107–115. DOI: 10.1145/3446776.