Research Article

Quadratic Neural Networks for Solving Inverse Problems

Pages 112-135 | Received 26 Sep 2023, Accepted 30 Jan 2024, Published online: 22 Feb 2024

Abstract

In this paper we investigate the solution of inverse problems with neural network ansatz functions with generalized decision functions. The relevant observation for this work is that such functions can approximate typical test cases, such as the Shepp-Logan phantom, better than standard neural networks. Moreover, we show that the convergence analysis of numerical methods for solving inverse problems with shallow generalized neural network functions leads to more intuitive convergence conditions than for deep affine linear neural networks.

MATHEMATICS SUBJECT CLASSIFICATION:

1 Introduction

We consider an inverse problem, which consists in solving an operator equation
$$F(x)=y_0,\tag{1.1}$$
where $F:X\to Y$ denotes an operator mapping between function spaces $(X,\|\cdot\|)$ and $(Y,\|\cdot\|)$. For the numerical solution of equation (1.1) an appropriate set of ansatz functions $\mathcal{P}$ has to be selected, on which an approximate solution of equation (1.1) is calculated, for instance a function $p\in\mathcal{P}\subseteq X$ which solves
$$F(p)=y,\tag{1.2}$$
where $y$ is an approximation of $y_0$. The history of approximating the solution of equation (1.1) by discretization and regularization is long: for linear inverse problems this has been investigated for instance in [Citation1]. Much later, neural network ansatz functions have been used for solving equation (1.1); see for instance [Citation2]. In this paper we investigate solving equation (1.2) when $\mathcal{P}$ is a set of neural network functions with generalized decision functions. To spot the difference from classical neural network functions, we first recall their definition:

Definition 1.1

(Affine linear neural network functions). Let $m,n\in\mathbb{N}$. Vectors in $\mathbb{R}^m$ and $\mathbb{R}^n$ are denoted by $w=(w^{(1)},w^{(2)},\ldots,w^{(m)})^T$ and $x=(x_1,x_2,\ldots,x_n)^T$, respectively.

  • A shallow affine linear neural network (ALNN) is a function (here $m=n$)
$$x\in\mathbb{R}^n\mapsto p(x):=\Psi[\vec p](x):=\sum_{j=1}^N\alpha_j\,\sigma(w_j^Tx+\theta_j),\tag{1.3}$$
with $\alpha_j,\theta_j\in\mathbb{R}$ and $w_j\in\mathbb{R}^n$; $\sigma$ is an activation function, such as the sigmoid, ReLU$^k$ or softmax function. Moreover,
$$\vec p=[\alpha_1,\ldots,\alpha_N;\,w_1,\ldots,w_N;\,\theta_1,\ldots,\theta_N]\in\mathbb{R}^{n_*}\quad\text{with }n_*=(n+2)N\tag{1.4}$$
denotes the according parameterization of $p$. In this context the set of ALNNs is given by $\mathcal{P}:=\{p\text{ of the form }(1.3):\vec p\in\mathbb{R}^{n_*}\}$ (a numerical sketch follows after this definition).

  • More recently, deep affine linear neural network functions (DNNs) have become popular. An $(L+2)$-layer network looks as follows:
$$x\in\mathbb{R}^n\mapsto p(x):=\Psi[\vec p](x):=\alpha^T\sigma_L\big(p_L(\cdots(\sigma_1(p_1(x)))\cdots)\big),\tag{1.5}$$
where $p_l(x)=(w_{1,l},w_{2,l},\ldots,w_{N_l,l})^Tx+(\theta_{1,l},\theta_{2,l},\ldots,\theta_{N_l,l})^T$ with $\alpha\in\mathbb{R}^{N_L}$, $\theta_{j,l}\in\mathbb{R}$ and $w_{j,l}\in\mathbb{R}^{N_{l-1}}$ for all $l=1,\ldots,L$ and all $j=1,\ldots,N_l$. Here $N_l$ denotes the number of neurons in the $l$-th internal layer, $N_0=n$, and $\sigma_k$, $k=1,\ldots,L$, denote the activation functions of the layers; they are used as maps $\mathbb{R}^{N_l}\to\mathbb{R}^{N_l}$ by acting component-wise on vectors. Moreover, we denote the parametrizing vector by
$$\vec p=[\alpha;\,w_{1,1},\ldots,w_{N_L,L};\,\theta_{1,1},\ldots,\theta_{N_L,L}]\in\mathbb{R}^{n_*}.\tag{1.6}$$
A shallow linear neural network can also be called a 3-layer network. The notation 3-layer network is consistent with the literature because the input and output layers are counted as well. Therefore, a general $(L+2)$-layer network has only $L$ internal layers.
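For illustration, the following minimal NumPy sketch evaluates a shallow ALNN of the form (1.3); the sigmoid activation, the random parameter values and the input point are assumptions chosen only for this example.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def alnn(x, alpha, W, theta, activation=sigmoid):
    """Shallow affine linear network, equation (1.3):
    p(x) = sum_j alpha_j * sigma(w_j^T x + theta_j).
    x: (n,) input, alpha: (N,), W: (N, n) with rows w_j, theta: (N,)."""
    return activation(W @ x + theta) @ alpha

# Illustrative parameterization p = [alpha; w_1, ..., w_N; theta], cf. (1.4).
n, N = 2, 5                      # input dimension and number of neurons
rng = np.random.default_rng(0)
alpha, W, theta = rng.normal(size=N), rng.normal(size=(N, n)), rng.normal(size=N)

print(alnn(np.array([0.3, -0.7]), alpha, W, theta))
print("number of parameters n_* =", (n + 2) * N)   # (n+2)N as in (1.4)
```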

We demonstrate by example below (see Section 3.2) that the convergence analysis of an iterative Gauss-Newton method for solving a linear inverse problem restricted to a set of deep neural network ansatz functions is involved (see Example 3.4), while the analysis with shallow neural network ansatz functions leads to intuitive conditions (see also [Citation3]). Moreover, it is well known in regularization theory that the quality of approximation of the solution of equation (1.2) depends on the approximation qualities of the set $\mathcal{P}$: approximation of functions with shallow affine linear neural networks is a classical topic in machine learning and in approximation theory, see for instance [Citation4–11]. The central result concerns the universal approximation property. The universal approximation property has since been established for different classes of neural networks: examples are dropout neural networks (see [Citation12, Citation13]), convolutional neural networks (CNNs) (see for example [Citation14, Citation15]), recurrent neural networks (RNNs) (see [Citation16, Citation17]), networks with random nodes (see [Citation18]), with random weights and biases (see [Citation19, Citation20]), and with fixed neural network topology (see [Citation21]), to name but a few.

In this paper we follow up on this topic and analyze neural network functions with decision functions, which are higher order polynomials, as specified now:

Definition 1.2

(Shallow generalized neural network function). Let $n,m,N\in\mathbb{N}$. Moreover, let $f_j:\mathbb{R}^n\to\mathbb{R}^m$, $j=1,\ldots,N$, be vector-valued functions. Neural network functions associated to $\{f_j:j=1,\ldots,N\}$ are defined by
$$x\mapsto p(x)=\Psi[\vec p](x):=\sum_{j=1}^N\alpha_j\,\sigma\big(w_j^Tf_j(x)+\theta_j\big)\quad\text{with }\alpha_j,\theta_j\in\mathbb{R}\text{ and }w_j\in\mathbb{R}^m.\tag{1.7}$$

Again we denote the parameterization vector, containing all coefficients, by $\vec p$, and the set of all such functions by $\mathcal{P}$. We call
$$\mathcal{D}:=\{x\mapsto w^Tf_j(x)+\theta : w\in\mathbb{R}^m,\,\theta\in\mathbb{R},\,j=1,\ldots,N\}\tag{1.8}$$
the set of decision functions associated to equation (1.7). The composition of $\sigma$ with a decision function is called a neuron.

We discuss approximation properties of generalized neural networks (see Theorem 2.5), in particular of neural network functions whose decision functions are at most quadratic polynomials in $x$. This idea is not completely new: in [Citation22] neural networks with parabolic decision functions have been applied in a number of applications. In [Citation23], the authors proposed neural networks with radial decision functions and proved a universal approximation result for ReLU-activated deep quadratic networks.

Outline of this paper

The motivation for this paper is to solve equation (1.2) on a set of neural network functions $\mathcal{P}$ with iterative algorithms, such as for instance a Gauss-Newton method. Good approximation properties for functions of interest (such as the Shepp-Logan phantom) are provided by quadratic decision functions, which have already been suggested in [Citation9]. We concentrate on shallow neural networks because the resulting analysis of iterative methods for solving inverse problems is intuitive (see Section 3.2). We apply different approximation theorems (such as for instance the universal approximation theorem of [Citation6]) to cover all situations from Definition 2.1 below. Moreover, by constructing wavelet frames based on quadratic decision functions, we give an explicit convergence rate for shallow radial network approximations; see Section 4, with main result Theorem 4.6. In Appendix A the conditions of the approximation to the identity (AtI) from [Citation24], which are used for proving the convergence rates in Section 4, are verified.

2 Examples of networks with generalized decision functions

The versatility of the network defined in equation (1.7) is due to the flexibility in choosing the functions $f_j$, $j=1,\ldots,N$.

We give a few examples.

Definition 2.1

(Neural networks with generalized decision functions). We split them into the following categories:

General quadratic neural networks (GQNN): Let $m=n+1$ and
$$f_j(x)=\begin{pmatrix}f_j^{(1)}(x)\\ \vdots\\ f_j^{(n)}(x)\\ f_j^{(n+1)}(x)\end{pmatrix}=\begin{pmatrix}x_1\\ \vdots\\ x_n\\ x^TA_jx\end{pmatrix}\quad\text{for all }j=1,\ldots,N.$$

That is, $f_j$ is the graph of a quadratic function. Then equation (1.7) reads as follows:
$$x\mapsto p(x)=\Psi[\vec p](x):=\sum_{j=1}^N\alpha_j\,\sigma\big(w_j^Tx+x^TA_jx+\theta_j\big)\quad\text{with }\alpha_j,\theta_j\in\mathbb{R},\ w_j\in\mathbb{R}^n,\ A_j\in\mathbb{R}^{n\times n}.\tag{2.1}$$
$\vec p$ denotes the parameterization of $p$:
$$\vec p=[\alpha_1,\ldots,\alpha_N;\,w_1,\ldots,w_N;\,A_1,\ldots,A_N;\,\theta_1,\ldots,\theta_N]\in\mathbb{R}^{n_*}\quad\text{with }n_*=(n^2+n+2)N.\tag{2.2}$$

Note that in equation (2.1) all entries of the matrices $A_j$ are free parameters.

Constrained quadratic neural networks (CQNN): These are networks where the entries of the matrices $A_j$, $j=1,\ldots,N$, are constrained:

  1. Let
$$f(x)=f_j(x)=\begin{pmatrix}f_j^{(1)}(x)\\ \vdots\\ f_j^{(n)}(x)\\ f_j^{(n+1)}(x)\end{pmatrix}=\begin{pmatrix}x_1\\ \vdots\\ x_n\\ 0\end{pmatrix}\quad\text{for all }j=1,\ldots,N.$$
That is, $A_j=0$ for all $j=1,\ldots,N$. This set of CQNNs corresponds to the ALNNs defined in equation (1.3).

  2. Let $A_j\in\mathbb{R}^{n\times n}$, $j=1,\ldots,N$, be chosen and fixed. We denote by MCNN the family of functions
$$x\mapsto p(x)=\Psi[\vec p](x):=\sum_{j=1}^N\alpha_j\,\sigma\big(w_j^Tx+\xi_jx^TA_jx+\theta_j\big)\quad\text{with }\alpha_j,\theta_j,\xi_j\in\mathbb{R},\ w_j\in\mathbb{R}^n,\ A_j\in\mathbb{R}^{n\times n},\tag{2.3}$$
which is parameterized by the vector
$$\vec p=[\alpha_1,\ldots,\alpha_N;\,w_1,\ldots,w_N;\,\xi_1,\ldots,\xi_N;\,\theta_1,\ldots,\theta_N]\in\mathbb{R}^{n_*}\quad\text{with }n_*=(n+3)N.\tag{2.4}$$

In particular, choosing $A_j=I$ we get:
$$x\mapsto p(x)=\Psi[\vec p](x):=\sum_{j=1}^N\alpha_j\,\sigma\big(w_j^Tx+\xi_j\|x\|^2+\theta_j\big)\quad\text{with }\alpha_j,\theta_j\in\mathbb{R},\ w_j\in\mathbb{R}^n,\ \xi_j\in\mathbb{R}.\tag{2.5}$$

Radial quadratic neural networks (RQNNs): For $\xi_j\neq0$ the argument in equation (2.5) rewrites to
$$\nu_j(x):=\xi_j\|x\|^2+\hat w_j^Tx+\theta_j=\xi_j\|x-y_j\|^2+\kappa_j\tag{2.6}$$
with
$$y_j=-\frac{1}{2\xi_j}\hat w_j\quad\text{and}\quad\kappa_j=\theta_j-\frac{\|\hat w_j\|^2}{4\xi_j}.\tag{2.7}$$

We call the set of functions from equation (2.5) which satisfy
$$\xi_j\neq0\quad\text{and}\quad\kappa_j/\xi_j<0\quad\text{for all }j=1,\ldots,N,\tag{2.8}$$

RQNNs, radial neural network functions, because the level sets of $\nu_j$ are circles. These are radial basis functions (see for instance [Citation25] for a general overview). A numerical sketch of the quadratic constructions of this definition follows after the definition.

Sign based quadratic (SBQNN) and cubic neural networks (CUNN): Let m = n.

  1. (SBQNN): Let
$$f^{(i)}(x)=\operatorname{sgn}(x_i)\,x_i^2\quad\text{for }i=1,\ldots,n,\tag{2.9}$$
or alternatively $f^{(i)}(x)=\|x\|^2\operatorname{sgn}(x_i)$ for $i=1,\ldots,n$. In the first case, and in the second case similarly, we obtain the family of functions
$$\Psi[\vec p](x):=\sum_{j=1}^N\alpha_j\,\sigma\Big(\sum_{i=1}^nw_j^{(i)}\operatorname{sgn}(x_i)\,x_i^2+\theta_j\Big)\quad\text{with }\alpha_j,\theta_j\in\mathbb{R}\text{ and }w_j\in\mathbb{R}^n.\tag{2.10}$$
We call these functions signed squared neural networks. Note that here $\vec p\in\mathbb{R}^{n_*}$ with $n_*=(n+2)N$.

  2. (CUNN): Let
$$f^{(i)}(x)=x_i^3\quad\text{for }i=1,\ldots,n.\tag{2.11}$$
We obtain the family of functions
$$\Psi[\vec p](x):=\sum_{j=1}^N\alpha_j\,\sigma\Big(\sum_{i=1}^nw_j^{(i)}x_i^3+\theta_j\Big)\quad\text{with }\alpha_j,\theta_j\in\mathbb{R}\text{ and }w_j\in\mathbb{R}^n.\tag{2.12}$$
We call these functions cubic neural networks. Again $\vec p\in\mathbb{R}^{n_*}$ with $n_*=(n+2)N$.
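The following sketch, with an assumed sigmoid activation and arbitrary illustrative parameter values, evaluates a GQNN as in (2.1) and an RQNN as in (2.5), and numerically checks the reparameterization (2.6)-(2.7).

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def gqnn(x, alpha, W, A, theta):
    """General quadratic network, equation (2.1):
    p(x) = sum_j alpha_j * sigma(w_j^T x + x^T A_j x + theta_j)."""
    quad = np.array([x @ Aj @ x for Aj in A])
    return sigmoid(W @ x + quad + theta) @ alpha

def rqnn(x, alpha, W, xi, theta):
    """Radial quadratic network, equation (2.5): the special case A_j = xi_j * I."""
    return sigmoid(W @ x + xi * (x @ x) + theta) @ alpha

# Check the reparameterization (2.6)-(2.7): for xi != 0,
# xi*|x|^2 + w^T x + theta = xi*|x - y|^2 + kappa
# with y = -w/(2 xi) and kappa = theta - |w|^2/(4 xi).
rng = np.random.default_rng(1)
x, w = rng.normal(size=3), rng.normal(size=3)
xi, theta = -1.5, 0.8
y = -w / (2 * xi)
kappa = theta - (w @ w) / (4 * xi)
lhs = xi * (x @ x) + w @ x + theta
rhs = xi * np.sum((x - y) ** 2) + kappa
print(np.isclose(lhs, rhs))                    # True

# With A_j = xi * I the GQNN reduces to the RQNN form:
A = [xi * np.eye(3)]
alpha_demo = np.array([1.0])
print(np.isclose(gqnn(x, alpha_demo, w[None, :], A, np.array([theta])),
                 rqnn(x, alpha_demo, w[None, :], np.array([xi]), np.array([theta]))))
```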

Remark 1.

  1. It is obvious that general quadratic neural network functions and matrix constrained neural networks are more versatile than affine linear networks, and thus they satisfy the universal approximation property (see [Citation6]).

  2. The constraint in equation (2.8) does not allow a straightforward use of the universal approximation theorem. In fact we prove an approximation result indirectly via a convergence rates result (see Section 4).

  3. Sign based quadratic and cubic neural networks satisfy the universal approximation property, which is proven below in Corollary 2.6 by reducing it to the classical result from [Citation6].

In order to prove the universal approximation property of SBQNNs and CUNNs, we first review the universal approximation theorem as formulated in [Citation6]. It is based on discriminatory properties of the function $\sigma$.

Definition 2.2

(Discriminatory function). A function $\sigma:\mathbb{R}\to\mathbb{R}$ is called discriminatory (see [Citation6]) if every measure $\mu$ on $[0,1]^n$ which satisfies
$$\int_{[0,1]^n}\sigma(w^Tx+\theta)\,d\mu(x)=0\quad\text{for all }w\in\mathbb{R}^n\text{ and }\theta\in\mathbb{R}$$
is necessarily $\mu\equiv0$.

Example 2.3.

Note that every non-polynomial function is discriminatory (this follows from the results in [Citation8]). Therefore the choices of activation function in Definition 1.1 are discriminatory for the Lebesgue-measure.

With these basic concepts we are able to recall Cybenko’s universal approximation result.

Theorem 2.4.

[Citation6] Let $\sigma:\mathbb{R}\to\mathbb{R}$ be a continuous discriminatory function. Then, for every function $g\in C([0,1]^n)$ and every $\epsilon>0$, there exists a function
$$G_\epsilon(x)=\sum_{j=1}^N\alpha_j\,\sigma(w_j^Tx+\theta_j)\quad\text{with }N\in\mathbb{N},\ \alpha_j,\theta_j\in\mathbb{R},\ w_j\in\mathbb{R}^n,\tag{2.13}$$
satisfying $|G_\epsilon(x)-g(x)|<\epsilon$ for all $x\in[0,1]^n$.

In the following we formulate and prove a modification of Cybenko’s universal approximation result [Citation6] for shallow generalized neural networks as introduced in Definition 1.2.

Theorem 2.5

(Generalized universal approximation theorem). Let $\sigma:\mathbb{R}\to\mathbb{R}$ be a continuous discriminatory function and assume that the functions $f_j:[0,1]^n\to\mathbb{R}^m$, $j=1,\ldots,N$, are injective (this in particular means that $n\le m$) and continuous. We denote $f=(f_1,\ldots,f_N)$.

Then for every $g\in C([0,1]^n)$ and every $\epsilon>0$ there exists some function
$$\Psi_f(x):=\sum_{j=1}^N\alpha_j\,\sigma\big(w_j^Tf_j(x)+\theta_j\big)\quad\text{with }\alpha_j,\theta_j\in\mathbb{R}\text{ and }w_j\in\mathbb{R}^m,$$
satisfying $|\Psi_f(x)-g(x)|<\epsilon$ for all $x\in[0,1]^n$.

Proof.

We begin the proof by noting that since $x\mapsto f_j(x)$ is injective, the inverse function on the range of $f_j$ is well-defined, and we write $f_j^{-1}:f_j([0,1]^n)\subseteq\mathbb{R}^m\to[0,1]^n\subseteq\mathbb{R}^n$. The proof that $f_j^{-1}$ is continuous relies on the fact that the domain $[0,1]^n$ of $f_j$ is compact, see for instance [Citation26, Chapter XI, Theorem 2.1]. Then applying the Tietze-Urysohn-Brouwer extension theorem (see [Citation27]) to the continuous function $g\circ f_j^{-1}:f_j([0,1]^n)\to\mathbb{R}$, we see that this function can be extended continuously to $\mathbb{R}^m$. This extension will be denoted by $g^*:\mathbb{R}^m\to\mathbb{R}$.

We apply Theorem 2.4 to conclude that there exist $\alpha_j,\theta_j\in\mathbb{R}$ and $w_j\in\mathbb{R}^m$, $j=1,\ldots,N$, such that $\Psi^*(z):=\sum_{j=1}^N\alpha_j\,\sigma(w_j^Tz+\theta_j)$ for all $z\in\mathbb{R}^m$ satisfies
$$|\Psi^*(z)-g^*(z)|<\epsilon\quad\text{for all }z\in[0,1]^m.\tag{2.14}$$

Then, because $f_j$ maps into $\mathbb{R}^m$, we conclude, in particular, that $\Psi^*(f_j(x))=\sum_{j=1}^N\alpha_j\,\sigma(w_j^Tf_j(x)+\theta_j)$ and $|\Psi^*(f_j(x))-g(x)|=|\Psi^*(f_j(x))-g^*(f_j(x))|<\epsilon$.

Therefore $\Psi_f(\cdot):=\Psi^*(f_j(\cdot))$ satisfies the claimed assertions. □

It is obvious that full variability in $w$ is the key to linking our proof with the universal approximation theorem: $w_j$ and $\theta_j$, $j=1,\ldots,N$, must be allowed to vary over all of $\mathbb{R}^n$ and $\mathbb{R}$, respectively. RQNNs are constrained by $\theta_j<\|w_j\|^2/(4\xi_j)$ (see equation (2.7)), and thus Theorem 2.5 does not apply. Interestingly, Theorem 4.6 applies and allows approximating functions in $\mathcal{L}^1(\mathbb{R}^n)$ (even with rates).

Corollary 2.6

(Universal approximation properties of SBQNNs and CUNNs). Let the discriminatory function $\sigma:\mathbb{R}\to\mathbb{R}$ be Lipschitz continuous. Then all families of neural network functions from Definition 2.1 satisfy the universal approximation property on $[0,1]^n$.

Proof.

The proof follows from Theorem 2.5 by noting that all functions $f_j$, $j=1,\ldots,N$, defined in Definition 2.1 are injective. □

3 Motivation

3.1 Motivation 1: the Shepp-Logan phantom

Almost all tomographic algorithms are tested with the Shepp-Logan phantom (see [Citation28]). It can be represented exactly with 10 characteristic functions of ellipses, and therefore it is much better approximated by a GQNN than by an ALNN or a DNN, respectively. This observation extends immediately to all functions that contain “localized” features, because of the compactness of the characteristic sets. In this sense sparse approximations with ALNNs and DNNs would be more costly in practice.
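As an illustration of this point, the sketch below shows how a single quadratic neuron with a steep sigmoid approximates the characteristic function of one ellipse, i.e. one building block of the Shepp-Logan phantom; the sharpness parameter, the ellipse axes and the grid are assumptions chosen only for the example.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# A single quadratic neuron sigma(w^T x + x^T A x + theta), cf. (2.1), with a
# steep sigmoid approximates the indicator of the ellipse {x : x^T A x + w^T x + theta >= 0}.
s = 50.0                                      # sharpness of the sigmoid (assumption)
A = -s * np.diag([1 / 0.6**2, 1 / 0.3**2])    # ellipse semi-axes 0.6 and 0.3 (assumption)
w = np.zeros(2)                               # centered ellipse
theta = s

xx, yy = np.meshgrid(np.linspace(-1, 1, 201), np.linspace(-1, 1, 201))
X = np.stack([xx.ravel(), yy.ravel()], axis=1)
vals = sigmoid(np.einsum('ij,jk,ik->i', X, A, X) + X @ w + theta)
inside = xx**2 / 0.6**2 + yy**2 / 0.3**2 <= 1.0   # exact characteristic function
diff = np.abs(vals.reshape(xx.shape) - inside)
print("fraction of grid points with error > 0.1:", np.mean(diff > 0.1))
```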

3.2 Motivation 2: the Gauss-Newton iteration

Let us assume that $F:X\to Y$ is a linear operator. We consider the solution $p$ of equation (1.2) to be an element of a set of ALNNs or DNNs, as defined in Definition 1.1. Therefore $p$ can be parameterized by a vector $\vec p$ as in equation (1.4) (for shallow linear neural networks) or equation (1.6) (for deep neural networks). For GQNNs the adequate parameterization can be read off from Definition 2.1. In other words, we can write every sought function $p$ via an operator $\Psi$ which maps a parameter vector $\vec p\in\mathbb{R}^{n_*}$ to a function in $X$, i.e. $x\mapsto p(x)=\Psi[\vec p](x)$.

Therefore, we aim at solving the nonlinear operator equation
$$N(\vec p):=(F\circ\Psi)[\vec p]=y.\tag{3.1}$$

For ALNNs we have proven in [Citation3] that the Gauss-Newton method (see equation (3.7)) is locally, quadratically convergent under conditions which guarantee that during the iteration the gradients of $\Psi$ do not degenerate, i.e. that $\mathcal{P}$ is a finite-dimensional manifold. Let us now calculate derivatives of the radial neural network operators $\Psi$ (see equation (2.5)), which are the basis for proving that the Gauss-Newton method with RQNNs is also convergent:

Example 3.1.

Let $\sigma:\mathbb{R}\to\mathbb{R}$ be a continuous activation function whose derivatives up to order 2 are uniformly bounded, i.e., in formulas,
$$\sigma\in C^2(\mathbb{R};\mathbb{R})\cap B^2(\mathbb{R};\mathbb{R}),\tag{3.2}$$
such as the sigmoid function. Then the derivatives of a radial neural network (RQNN) $\Psi$ as in equation (2.5) with respect to the coefficients of $\vec p$ in equation (2.4) are given by the following formulas, where $\nu_s$ is defined in equation (2.6):

  • Derivative with respect to $\alpha_s$, $s=1,\ldots,N$:
$$\frac{\partial\Psi}{\partial\alpha_s}[\vec p](x)=\sigma(\nu_s)\quad\text{for }s=1,\ldots,N.\tag{3.3}$$

  • Derivative with respect to $w_s^{(t)}$, where $s=1,\ldots,N$ and $t=1,\ldots,n$:
$$\frac{\partial\Psi}{\partial w_s^{(t)}}[\vec p](x)=\sum_{j=1}^N\alpha_j\,\sigma'(\nu_j)\,\delta_{s=j}\,x_t=\alpha_s\,\sigma'(\nu_s)\,x_t.\tag{3.4}$$

  • Derivative with respect to $\theta_s$, where $s=1,\ldots,N$:
$$\frac{\partial\Psi}{\partial\theta_s}[\vec p](x)=\sum_{j=1}^N\alpha_j\,\sigma'(\nu_j)\,\delta_{s=j}=\alpha_s\,\sigma'(\nu_s).\tag{3.5}$$

  • Derivative with respect to $\xi_s$:
$$\frac{\partial\Psi}{\partial\xi_s}[\vec p](x)=\alpha_s\,\sigma'(\nu_s)\,\|x\|^2.\tag{3.6}$$
Note that all the derivatives above are functions in $X=L^2([0,1]^n)$.
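For illustration, the following sketch assembles the derivatives (3.3)-(3.6) into a Jacobian matrix of $\Psi$ with respect to the parameters. The point-evaluation discretization of $X=L^2([0,1]^n)$, the sigmoid activation and the rank check are assumptions made only for this example.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def dsigmoid(s):
    return sigmoid(s) * (1.0 - sigmoid(s))

def rqnn_jacobian(X, alpha, W, xi, theta):
    """Derivatives (3.3)-(3.6) of an RQNN (2.5) with sigmoid activation,
    evaluated at sample points X (shape (M, n)).
    Returns a matrix of shape (M, (n+3)N), columns ordered [alpha, W, xi, theta]."""
    nu = X @ W.T + xi * np.sum(X**2, axis=1, keepdims=True) + theta    # nu_j(x), shape (M, N)
    d_alpha = sigmoid(nu)                                              # (3.3)
    d_W = (alpha * dsigmoid(nu))[:, :, None] * X[:, None, :]           # (3.4): alpha_s sigma'(nu_s) x_t
    d_xi = alpha * dsigmoid(nu) * np.sum(X**2, axis=1, keepdims=True)  # (3.6)
    d_theta = alpha * dsigmoid(nu)                                     # (3.5)
    M, N = nu.shape
    return np.hstack([d_alpha, d_W.reshape(M, -1), d_xi, d_theta])

# Illustrative sizes and parameters.
rng = np.random.default_rng(2)
n, N, M = 2, 4, 50
X = rng.uniform(size=(M, n))
alpha, W = rng.normal(size=N), rng.normal(size=(N, n))
xi, theta = -np.ones(N), rng.normal(size=N)
J = rqnn_jacobian(X, alpha, W, xi, theta)
print(J.shape, "rank:", np.linalg.matrix_rank(J))  # rank relates to linear independence of (3.3)-(3.6)
```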

For formulating a convergence result for the Gauss-Newton method (see equation (3.7)) we need to specify the following assumptions, which were first postulated in [Citation3] to prove local convergence of a Gauss-Newton method on a set of shallow affine linear network functions. Here we verify convergence of the Gauss-Newton method on a set of RQNNs, as defined in equation (2.5). The proof is completely analogous to the one in [Citation3].

Assumption 3.2.

  • $F:X=L^2([0,1]^n)\to Y$ is a linear, bounded operator with trivial nullspace and dense range.

  • $\Psi:\mathcal{D}(\Psi)\subseteq\mathbb{R}^{(n+3)N}\to X$ is a shallow RQNN operator generated by a strictly monotonic activation function $\sigma$ which satisfies equation (3.2), such as a sigmoid function.

  • All derivatives in equations (3.3)–(3.6) are locally linearly independent functions.

Now, we recall a local convergence result:

Theorem 3.3

(Local convergence of the Gauss-Newton method with RQNNs). Let $N$, as in equation (3.1), be the composition of a linear operator $F$ and the RQNN network operator $\Psi$ as defined in equation (2.5), which satisfy Assumption 3.2. Moreover, let $\vec p_0\in\mathcal{D}(\Psi)$, which is open, be the starting point of the Gauss-Newton iteration
$$\vec p_{k+1}=\vec p_k-N'(\vec p_k)^\dagger\big(N(\vec p_k)-y\big),\quad k\in\mathbb{N}_0,\tag{3.7}$$
where $N'(\vec p_k)^\dagger$ denotes the Moore-Penrose inverse of $N'(\vec p_k)$ (see Note 1).

Moreover, let $\vec p\in\mathcal{D}(\Psi)$ be a solution of equation (3.1), i.e.
$$N(\vec p)=y.\tag{3.8}$$

Then the Gauss-Newton iterates converge locally, that is, if $\vec p_0$ is sufficiently close to $\vec p$, and quadratically.
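A schematic numerical sketch of the iteration (3.7) is given below; it is not the authors' implementation. The discretization of $F$ as a matrix, the point evaluation of $\Psi$, the finite-difference Jacobian and all concrete sizes are assumptions chosen only for illustration.

```python
import numpy as np

def gauss_newton(Nfun, Njac, p0, y, iters=10):
    """Gauss-Newton iteration (3.7): p_{k+1} = p_k - N'(p_k)^+ (N(p_k) - y),
    with the Moore-Penrose inverse realized by numpy's pinv."""
    p = p0.copy()
    for _ in range(iters):
        p = p - np.linalg.pinv(Njac(p)) @ (Nfun(p) - y)
    return p

def sigmoid(s): return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(3)
n, N, M = 1, 2, 40
X = np.linspace(0.0, 1.0, M)[:, None]          # sample points in [0,1] (assumption)
F = np.tril(np.ones((M, M))) / M               # e.g. a discretized integration operator (assumption)

def unpack(p):
    return p[:N], p[N:N + N * n].reshape(N, n), p[N + N * n:N + N * n + N], p[-N:]

def Psi(p):                                    # point-evaluated RQNN, cf. (2.5)
    a, W, xi, th = unpack(p)
    return sigmoid(X @ W.T + xi * np.sum(X**2, axis=1, keepdims=True) + th) @ a

def Nfun(p):
    return F @ Psi(p)

def Njac(p, h=1e-6):                           # forward-difference Jacobian (sketch only)
    J = np.empty((M, p.size))
    for i in range(p.size):
        e = np.zeros(p.size); e[i] = h
        J[:, i] = (Nfun(p + e) - Nfun(p)) / h
    return J

p_true = rng.normal(size=(n + 3) * N)
y = Nfun(p_true)                               # data attained in the network range, cf. (3.8)
p_rec = gauss_newton(Nfun, Njac, p_true + 0.05 * rng.normal(size=p_true.size), y)
print("parameter error:", np.linalg.norm(p_rec - p_true))
```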

Remark 2.

The essential condition in Theorem 3.3 is that all derivatives of $\Psi$ with respect to $\vec p$ are linearly independent, which is a nontrivial research question. In [Citation3] we studied convergence of the Gauss-Newton method on a set of shallow affine linear neural networks, where the convergence result required linear independence of the derivatives specified in equations (3.3)–(3.5). Here, for RQNNs, in addition, linear independence of the second order moment function in equation (3.6) needs to hold. Even linear independence of neural network functions (and of their derivatives) is still a challenging research topic (see [Citation30]).

Under our assumptions the ill-posedness of $F$ does not affect convergence of the Gauss-Newton method: the generalized inverse completely annihilates $F$ in the iteration. The catch, however, is that the data have to be attained in a finite-dimensional space spanned by neurons (that is, equation (3.8) holds). This assumption is in fact restrictive: numerical tests show that convergence of Gauss-Newton methods becomes arbitrarily slow if it is violated. Here the ill-posedness of $F$ enters. The proof of Theorem 3.3 is based on the affine covariant condition (see for instance [Citation31]).

In the following we show the complexity of the convergence condition of a Gauss-Newton method in the case when affine linear deep neural network functions are used for coding.

Example 3.4

(Convergence conditions of deep neural networks). We restrict our attention to the simple case $n=1$. We consider a 4-layer DNN (consisting of two internal layers) with $\sigma_1=\sigma_2=\sigma$, which reads as follows:
$$x\mapsto\Psi[\vec p](x):=\sum_{j_2=1}^{N_2}\alpha_{j_2,2}\,\sigma\Big(w_{j_2,2}\Big(\sum_{j_1=1}^{N_1}\alpha_{j_1,1}\,\sigma(w_{j_1,1}x+\theta_{j_1,1})\Big)+\theta_{j_2,2}\Big).\tag{3.9}$$

Now we calculate, by the chain rule, $\frac{\partial\Psi}{\partial w_{1,1}}[\vec p](x)$. For this purpose we define
$$w_{1,1}\mapsto\rho(w_{1,1}):=\sum_{j_1=1}^{N_1}\alpha_{j_1,1}\,\sigma(w_{j_1,1}x+\theta_{j_1,1}).$$

With this definition we rewrite equation (3.9) as
$$x\mapsto\Psi[\vec p](x):=\sum_{j_2=1}^{N_2}\alpha_{j_2,2}\,\sigma\big(w_{j_2,2}\,\rho(w_{1,1})+\theta_{j_2,2}\big).$$

Since
$$\rho'(w_{1,1})=\alpha_{1,1}\,\sigma'(w_{1,1}x+\theta_{1,1})\,x,$$

we therefore get
$$\frac{\partial\Psi}{\partial w_{1,1}}=\sum_{j_2=1}^{N_2}\alpha_{j_2,2}\,\sigma'\big(w_{j_2,2}\,\rho(w_{1,1})+\theta_{j_2,2}\big)\,w_{j_2,2}\,\rho'(w_{1,1})=\alpha_{1,1}\,x\sum_{j_2=1}^{N_2}\alpha_{j_2,2}\,w_{j_2,2}\,\sigma'\big(w_{j_2,2}\,\rho(w_{1,1})+\theta_{j_2,2}\big)\,\sigma'(w_{1,1}x+\theta_{1,1}).\tag{3.10}$$

Note that $\Psi[\vec p]$ is a function of $x$ depending on the parameter vector $\vec p$, which contains $w_{1,1}$ as one component.

The complicating issue in a potential convergence analysis concerns the last identity, equation (3.10), where evaluations of $\sigma'$ at different arguments appear simultaneously, which makes it complicated to analyze and interpret linear independence of the derivatives of $\Psi$ with respect to the single elements of $\vec p$. Note that for proving convergence of the Gauss-Newton method, according to equation (3.10), we need to verify linear independence of the functions
$$x\mapsto\sigma'\big(w_{j_2,2}\,\rho(w_{1,1})+\theta_{j_2,2}\big)\,\sigma'(w_{1,1}x+\theta_{1,1})=\sigma'\Big(w_{j_2,2}\Big(\sum_{j_1=1}^{N_1}\alpha_{j_1,1}\,\sigma(w_{j_1,1}x+\theta_{j_1,1})\Big)+\theta_{j_2,2}\Big)\,\sigma'(w_{1,1}x+\theta_{1,1}).$$

In other words, the potential manifold of deep neural networks is extremely complex.
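The following small numerical check compares the chain-rule formula (3.10) with a central finite-difference quotient; all concrete parameter values and the sigmoid activation are assumptions made only for illustration.

```python
import numpy as np

def sigmoid(s): return 1.0 / (1.0 + np.exp(-s))
def dsigmoid(s): return sigmoid(s) * (1.0 - sigmoid(s))

# Illustrative parameters of the 4-layer network (3.9) with n = 1.
rng = np.random.default_rng(4)
N1, N2 = 3, 2
a1, w1, t1 = rng.normal(size=N1), rng.normal(size=N1), rng.normal(size=N1)
a2, w2, t2 = rng.normal(size=N2), rng.normal(size=N2), rng.normal(size=N2)
x = 0.4

def Psi(w11):
    w1_loc = w1.copy(); w1_loc[0] = w11
    rho = np.sum(a1 * sigmoid(w1_loc * x + t1))
    return np.sum(a2 * sigmoid(w2 * rho + t2))

# Chain-rule formula (3.10):
rho = np.sum(a1 * sigmoid(w1 * x + t1))
analytic = a1[0] * x * np.sum(a2 * w2 * dsigmoid(w2 * rho + t2)) * dsigmoid(w1[0] * x + t1[0])
h = 1e-6
numeric = (Psi(w1[0] + h) - Psi(w1[0] - h)) / (2 * h)
print(analytic, numeric)   # should agree up to finite-difference error
```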

The conclusion of this section is that the analysis of iterative regularization methods, like a Gauss-Newton method, gets significantly more complicated if deep neural networks are used as ansatz functions. In contrast, using neural networks with higher order neurons (such as radial ones) results in transparent moment conditions. So the research question discussed in the following is whether shallow higher order neural networks have similar approximation properties as deep neural network functions, which would reveal clear benefits of such networks for analyzing iterative methods for solving ill-posed problems.

4 Convergence rates for universal approximation of RQNNs

In the following we prove convergence rates for RQNNs (as defined in equation (2.5)) in the $\mathcal{L}^1$-norm. To be precise, we specify a subclass of RQNNs for which we prove convergence rates results. This is a much finer result than the standard universal approximation result, Theorem 2.5, since it operates on a subclass and provides convergence rates. However, it is the only approximation result we can provide so far. We recall that the constraints $\xi_j\neq0$ and $\kappa_j/\xi_j<0$ (see equation (2.7)) induce that the level sets of the neurons are compact (circles) and not unbounded, as they are for shallow affine linear neurons. Therefore we require a different analysis (and different spaces) than in the most advanced and optimal convergence rates results for affine linear neural networks, as for instance in [Citation10, Citation11]. In fact the analysis is closer to an analysis of compactly supported wavelets.

The convergence rate will be expressed in the following norm:

Definition 4.1.

$\mathcal{L}^1$ denotes the space of square integrable functions on $[0,1]^n$ which satisfy [Citation32, (1.10)]
$$\|f\|_{\mathcal{L}^1}:=\inf\Big\{\sum_{g\in\mathcal{D}}|c_g| : f=\sum_{g\in\mathcal{D}}c_g\,g\Big\}<\infty.$$

Here the $c_g$ are the coefficients of a wavelet expansion and $\mathcal{D}$ is a countable, general set of wavelet functions. A discrete wavelet basis (see, for example, [Citation33–36]) is a (not necessarily orthonormal) basis of the space of square integrable functions, where the basis elements are given by dilating and translating a mother wavelet function.

Remark 3.

Notice that the notation $\mathcal{L}^1$ does not refer to the common $L^1$-function space of absolutely integrable functions, and it depends on the choice of the wavelet system. For more properties and details on this space see [Citation37, Remark 3.11].

Definition 4.2

(Circular scaling function and wavelets). Let $r>0$ be a fixed constant and $\sigma$ be a discriminatory function as defined in Definition 2.2 such that $\big|\int_{\mathbb{R}^n}\sigma(r^2-\|x\|^2)\,dx\big|<\infty$. Then let
$$x\in\mathbb{R}^n\mapsto\varphi(x):=C_n\,\sigma(r^2-\|x\|^2),\tag{4.1}$$
where $C_n$ is a normalizing constant such that $\int_{\mathbb{R}^n}\varphi(x)\,dx=1$.

Then we define for $k\in\mathbb{Z}$ the radial scaling functions and wavelets
$$(x,y)\in\mathbb{R}^n\times\mathbb{R}^n\mapsto S_k^C(x,y):=2^k\,\varphi\big(2^{k/n}(x-y)\big)\quad\text{and}\quad(x,y)\in\mathbb{R}^n\times\mathbb{R}^n\mapsto\psi_k^C(x,y):=2^{-k/2}\big(S_k^C(x,y)-S_{k-1}^C(x,y)\big).\tag{4.2}$$

Often $y\in\mathbb{R}^n$ is considered a parameter and we write synonymously
$$x\mapsto S_{k,y}^C(x)=S_k^C(x,y)\quad\text{and}\quad x\mapsto\psi_{k,y}^C(x)=\psi_k^C(x,y).\tag{4.3}$$

In particular, this notation means that $S_{k,y}^C$ and $\psi_{k,y}^C$ are considered solely as functions of the variable $x$.

We consider the following subset of RQNNs:
$$\mathcal{S}_d^C:=\{S_{k,y}^C : k\in\mathbb{Z}\text{ and }y\in2^{-k/n}\mathbb{Z}^n\}.\tag{4.4}$$

This is different from standard universal approximation theorems, where an uncountable number of displacements is considered. Moreover, we define the discrete wavelet space
$$\mathcal{W}_d^C:=\big\{\psi_{k,y}^C:=2^{-k/2}\big(S_{k,y}^C-S_{k-1,y}^C\big) : k\in\mathbb{Z}\text{ and }y\in2^{-k/n}\mathbb{Z}^n\big\}.\tag{4.5}$$
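For illustration, the sketch below constructs $\varphi$ from (4.1) and the kernels and wavelets from (4.2) for a sigmoid $\sigma$; the dimension, the radius $r$, and the quadrature used to estimate the normalizing constant $C_n$ are assumptions made only for this example.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

n, r = 2, 1.0                        # dimension and radius (assumptions)

# Normalizing constant C_n in (4.1), estimated by quadrature on a large box
# (sigma(r^2 - |x|^2) decays rapidly, so the truncation error is small).
L, m = 8.0, 400
grid = np.linspace(-L, L, m)
dx = grid[1] - grid[0]
xx, yy = np.meshgrid(grid, grid)
C_n = 1.0 / (np.sum(sigmoid(r**2 - (xx**2 + yy**2))) * dx**2)

def phi(x):                          # equation (4.1)
    return C_n * sigmoid(r**2 - np.sum(x**2, axis=-1))

def S(k, x, y):                      # scaling kernel, equation (4.2)
    return 2.0**k * phi(2.0**(k / n) * (x - y))

def psi(k, x, y):                    # wavelet, equation (4.2)
    return 2.0**(-k / 2) * (S(k, x, y) - S(k - 1, x, y))

# sanity check: the kernels integrate to (approximately) one for every k
pts = np.stack([xx, yy], axis=-1)
for k in [0, 1, 2]:
    print(k, np.sum(S(k, pts, np.zeros(2))) * dx**2)
print("psi_1,0 at the origin:", psi(1, np.zeros(2), np.zeros(2)))
```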

According to Theorem A.3, in order to prove the following approximation result we only need to verify that $\mathcal{S}_d^C$ satisfies the conditions of a symmetric AtI and the double Lipschitz condition (see Definition A.1), which originate from [Citation24, Def. 3.4].

Corollary 4.3.

$\mathcal{W}_d^C$ is a frame and for every function $f\in\mathcal{L}^1(\mathbb{R}^n)$ there exists a linear combination of $N$ elements of $\mathcal{W}_d^C$, denoted by $f_N$, satisfying
$$\|f-f_N\|_{L^2}\le\frac{\|f\|_{\mathcal{L}^1}}{(N+1)^{1/2}}.\tag{4.6}$$

For proving the approximation to the identity (AtI) property of the functions $S_k^C$, $k\in\mathbb{Z}$, we use the following basic inequality.

Lemma 4.4.

Let $h:\mathbb{R}^n\to\mathbb{R}$ be a twice differentiable function which can be expressed in the following way: $h(x)=h_s(\|x\|^2)$ for all $x\in\mathbb{R}^n$.

Then the spectral norm of the Hessian of $h$ can be estimated as follows (see Note 2):
$$\|\nabla^2h(x)\|\le\max\big\{\,|4\|x\|^2h_s''(\|x\|^2)+2h_s'(\|x\|^2)|,\ |2h_s'(\|x\|^2)|\,\big\}.\tag{4.7}$$

Proof.

Since $\nabla^2h(x)$ is a symmetric matrix, its operator norm is equal to its spectral radius, namely the largest absolute value of an eigenvalue. By routine calculation we see that
$$\partial_{x_i}\partial_{x_j}h(x)=4x_ix_j\,h_s''(\|x\|^2)+2\delta_{ij}\,h_s'(\|x\|^2),$$
that is, $\nabla^2h(x)=4h_s''(\|x\|^2)\,C+2h_s'(\|x\|^2)\,I$ with $C:=xx^T$. Hence, if $z$ is an eigenvector of $\nabla^2h(x)$ with eigenvalue $\lambda$, then
$$4h_s''(\|x\|^2)\,Cz=\big(\lambda-2h_s'(\|x\|^2)\big)z.$$

In other words, $\frac{\lambda-2h_s'(\|x\|^2)}{4h_s''(\|x\|^2)}$ is an eigenvalue of $C$. Moreover, $C=xx^T$ is a rank one matrix and thus its spectral values are $0$ with multiplicity $(n-1)$ and $\|x\|^2$. This in turn shows that the eigenvalues of the Hessian are $2h_s'(\|x\|^2)$ (with multiplicity $n-1$) and $4\|x\|^2h_s''(\|x\|^2)+2h_s'(\|x\|^2)$, which proves equation (4.7). □
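The eigenvalue structure used in this proof can be verified numerically for one radial test function; the concrete choice $h_s(t)=\sigma(r^2-t)$ with a sigmoid $\sigma$ and the finite-difference Hessian are assumptions made only for illustration.

```python
import numpy as np

def sigmoid(s): return 1.0 / (1.0 + np.exp(-s))
def dsigmoid(s): return sigmoid(s) * (1.0 - sigmoid(s))
def d2sigmoid(s): return dsigmoid(s) * (1.0 - 2.0 * sigmoid(s))

# Radial test function h(x) = h_s(|x|^2) with h_s(t) = sigma(r^2 - t) (assumption).
r = 1.0
hs   = lambda t: sigmoid(r**2 - t)
dhs  = lambda t: -dsigmoid(r**2 - t)
d2hs = lambda t: d2sigmoid(r**2 - t)
h = lambda x: hs(np.sum(x**2))

def hessian_fd(f, x, eps=1e-4):
    """Second-order central finite-difference Hessian."""
    n = x.size
    H = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = np.zeros(n), np.zeros(n)
            ei[i], ej[j] = eps, eps
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * eps**2)
    return H

x = np.array([0.4, -0.2, 0.7])
t = np.sum(x**2)
eig_fd = np.sort(np.linalg.eigvalsh(hessian_fd(h, x)))
eig_formula = np.sort(np.array([2 * dhs(t)] * (x.size - 1) + [4 * t * d2hs(t) + 2 * dhs(t)]))
print(eig_fd)
print(eig_formula)   # the two lists should agree up to finite-difference error
```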

In the following lemma we prove that the kernels $(S_k^C)_{k\in\mathbb{Z}}$ form an AtI.

Lemma 4.5.

Let $r>0$ be fixed. Suppose that the activation function $\sigma:\mathbb{R}\to\mathbb{R}$ is monotonically increasing and satisfies, for the $i$-th derivative ($i=0,1,2$),
$$|\sigma^{(i)}(r^2-t^2)|\le C_\sigma\,(1+|t|^n)^{-1-(2i+1)/n}\quad\text{for all }t\in\mathbb{R}.\tag{4.8}$$

Then the kernels $(S_k^C)_{k\in\mathbb{Z}}$ as defined in equation (4.2) form an AtI as defined in Definition A.1, which also satisfies the double Lipschitz condition, equation (A.4).

Proof.

We verify the three conditions from Definition A.1 as well as equation (A.4). First of all, we note that
$$|\sigma^{(i)}(r^2-\|x\|^2)|\le C_\sigma\,(1+\|x\|^n)^{-1-(2i+1)/n}\quad\text{for all }x\in\mathbb{R}^n.\tag{4.9}$$

  • Verification of (i) in Definition A.1: Equations (4.1) and (4.8) imply that
$$0\le\varphi(x-y)=C_n\,\sigma(r^2-\|x-y\|^2)\le C_\sigma C_n\,(1+\|x-y\|^n)^{-1-1/n}\quad\text{for all }x,y\in\mathbb{R}^n.\tag{4.10}$$
Therefore
$$S_k^C(x,y)=2^k\varphi\big(2^{k/n}(x-y)\big)\le C_\sigma C_n\,2^k\,(1+2^k\|x-y\|^n)^{-1-1/n}=C_\sigma C_n\,\frac{2^{-k/n}}{(2^{-k}+\|x-y\|^n)^{1+1/n}}.$$
Thus (i) in Definition A.1 holds with $\epsilon=1/n$, $C_\rho=1$ and $C=C_nC_\sigma$.

  • Verification of (ii) in Definition A.1 with $C_\rho=1$ and $C_A=2^{-n}$: Because $\sigma$ is monotonically increasing, it follows from equation (4.1), the fact that $S_0^C(x,y)=\varphi(x-y)$ (see equation (4.2)) and Definition 4.2 that
$$F_y(x):=\nabla_x\big(S_0^C(x,y)\big)=-2C_n\,(x-y)\,\sigma'(r^2-\|x-y\|^2)\quad\text{for all }y\in\mathbb{R}^n.$$
Then equation (4.9) implies that
$$\|F_y(x)\|\le2C_nC_\sigma\,(1+\|x-y\|^n)^{-1-3/n}\,\|x-y\|\le2C_nC_\sigma\,(1+\|x-y\|^n)^{-1-3/n}(1+\|x-y\|^n)^{1/n}=2C_nC_\sigma\,(1+\|x-y\|^n)^{-1-2/n}.$$
From the definition of $S_k^C(x,y)$ it follows that
$$\nabla_x\big(S_k^C(x,y)\big)=\nabla_x\big(2^k\varphi(2^{k/n}(x-y))\big)=2^k\,\nabla_xS_0^C(2^{k/n}x,2^{k/n}y)=2^{k+k/n}\,F_{2^{k/n}y}(2^{k/n}x),$$
and hence
$$\|\nabla_x\big(S_k^C(x,y)\big)\|\le2^{-k/n}\,C_nC_\sigma\,(2^{-k}+\|x-y\|^n)^{-1-2/n}.\tag{4.11}$$
From the mean value theorem it therefore follows, together with equations (4.11) and (A.6), that
$$\begin{aligned}\frac{|S_k^C(x,y)-S_k^C(x',y)|}{\|x-x'\|}&\le\max\big\{\|\nabla_x(S_k^C(z,y))\| : z=tx+(1-t)x',\ t\in[0,1]\big\}\\&\le2^{-k/n}C_nC_\sigma\,\max\big\{(2^{-k}+\|z-y\|^n)^{-1-2/n} : z=x+t(x'-x),\ t\in[0,1]\big\}\\&=2^{-k/n}C_nC_\sigma\,\big(2^{-k}+\min\{\|z-y\|^n : z=x+t(x'-x),\ t\in[0,1]\}\big)^{-1-2/n}.\end{aligned}\tag{4.12}$$
Applying equation (A.6) and noting that $(1-2^{-n})2^n\ge1$ gives
$$\begin{aligned}\frac{|S_k^C(x,y)-S_k^C(x',y)|}{\|x-x'\|}&\le2^{-k/n}C_nC_\sigma\big((1-2^{-n})2^{-k}+2^{-n}\|x-y\|^n\big)^{-1-2/n}\\&=2^{-k/n}(2^n)^{1+2/n}C_nC_\sigma\big((1-2^{-n})2^n2^{-k}+\|x-y\|^n\big)^{-1-2/n}\\&\le2^{-k/n}2^{n+2}C_nC_\sigma\big(2^{-k}+\|x-y\|^n\big)^{-1-2/n}.\end{aligned}$$
Therefore (ii) is satisfied with $C_\rho=1$, $\zeta=1/n$, $\epsilon=1/n$ and $C=2^{n+2}C_nC_\sigma$.

  • Verification of (iii) in Definition A.1: From the definition of $S_k^C$ (see equation (4.2)) it follows for every $k\in\mathbb{Z}$ and $y\in\mathbb{R}^n$ that
$$1=\int_{\mathbb{R}^n}S_k^C(x,y)\,dx=\int_{\mathbb{R}^n}2^k\varphi\big(2^{k/n}(x-y)\big)\,dx.$$

  • Verification of the double Lipschitz condition, equation (A.4), in Definition A.1: By using the integral version of the mean value theorem, we have
$$\begin{aligned}&S_k^C(x,y)-S_k^C(x',y)-S_k^C(x,y')+S_k^C(x',y')\\&\quad=S_k^C(x,y)-S_k^C(x,y')-\big(S_k^C(x',y)-S_k^C(x',y')\big)\\&\quad=\int_0^1\big\langle\nabla_yS_k^C(x,y'+t(y-y')),\,y-y'\big\rangle\,dt-\int_0^1\big\langle\nabla_yS_k^C(x',y'+t(y-y')),\,y-y'\big\rangle\,dt\\&\quad=\int_0^1\!\!\int_0^1\big\langle\nabla^2_{xy}S_k^C(x'+s(x-x'),y'+t(y-y'))\,(x-x'),\,y-y'\big\rangle\,dt\,ds.\end{aligned}$$
Following this identity, we get
$$\frac{|S_k^C(x,y)-S_k^C(x',y)-S_k^C(x,y')+S_k^C(x',y')|}{\|x-x'\|\,\|y-y'\|}\le\max_{z\in I}\frac{\|\nabla_y\big(S_k^C(x,z)-S_k^C(x',z)\big)\|}{\|x-x'\|}\le\max_{(z',z)\in Q}\|\nabla^2_{xy}S_k^C(z',z)\|,\tag{4.13}$$
where
$$I:=\{z=ty+(1-t)y' : t\in[0,1]\},\qquad Q:=\{(z'=t_xx+(1-t_x)x',\,z=t_yy+(1-t_y)y') : t_x,t_y\in[0,1]\},$$
and $\|\nabla^2_{xy}S_k^C(z',z)\|$ denotes again the spectral norm of $\nabla^2_{xy}S_k^C(z',z)$.

Now we estimate the right hand side of equation (4.13): From the definition of $S_k^C$, equation (4.2), and the definition of $\varphi$, equation (4.1), it follows with the abbreviation $\omega:=2^{k/n}(z'-z)$ that
$$\|\nabla^2_{xy}S_k^C(z',z)\|=2^k\,\big\|\nabla^2\big(\varphi\circ(2^{k/n}\,\cdot\,)\big)(z'-z)\big\|=2^{k(1+2/n)}\,\|\nabla^2\varphi(\omega)\|.$$

Application of Lemma 4.4 with $x\mapsto h(x)=\varphi(x)$ and $t\mapsto h_s(t)=C_n\sigma(r^2-t)$ (note that $h_s'(t)=-C_n\sigma'(r^2-t)$ and $h_s''(t)=C_n\sigma''(r^2-t)$) shows that
$$\|\nabla^2\varphi(\omega)\|\le C_n\max\big\{|4\|\omega\|^2\sigma''(r^2-\|\omega\|^2)-2\sigma'(r^2-\|\omega\|^2)|,\ |2\sigma'(r^2-\|\omega\|^2)|\big\}\le2^2C_n\max\big\{2\|\omega\|^2|\sigma''(r^2-\|\omega\|^2)|,\ |\sigma'(r^2-\|\omega\|^2)|\big\}.\tag{4.14}$$

Thus from equation (4.9) it follows that
$$\|\nabla^2_{xy}S_k^C(z',z)\|\le2^2\,2^{k(1+2/n)}C_nC_\sigma\max\big\{2\|\omega\|^2(1+\|\omega\|^n)^{-1-5/n},\ (1+\|\omega\|^n)^{-1-3/n}\big\}\le2^3\,2^{k(1+2/n)}C_nC_\sigma(1+\|\omega\|^n)^{-1-3/n}\le2^{-k/n}2^3C_nC_\sigma\big(2^{-k}+\|z'-z\|^n\big)^{-1-3/n}.$$

In the next step we note that from equation (A.7) it follows that
$$\|z'-z\|^n\ge3^{-n}\|x-y\|^n-3^{-n}2^{1-k}.\tag{4.15}$$

Thus, because $(1-2\cdot3^{-n})3^n\ge1$, we get
$$\begin{aligned}\frac{|S_k^C(x,y)-S_k^C(x',y)-S_k^C(x,y')+S_k^C(x',y')|}{\|x-x'\|\,\|y-y'\|}&\le2^{-k/n}2^3C_nC_\sigma\big((1-2\cdot3^{-n})2^{-k}+3^{-n}\|x-y\|^n\big)^{-1-3/n}\\&=2^{-k/n}2^3(3^n)^{1+3/n}C_nC_\sigma\big((1-2\cdot3^{-n})3^n2^{-k}+\|x-y\|^n\big)^{-1-3/n}\\&\le2^{-k/n}2^3\,3^{n+3}C_nC_\sigma\big(2^{-k}+\|x-y\|^n\big)^{-1-3/n}.\end{aligned}$$

Therefore the double Lipschitz condition (A.4) is satisfied with $C_\rho=1$, $\tilde C=2^3\,3^{n+3}C_nC_\sigma$, $\zeta=1/n$, and $\epsilon=1/n$. □

We have shown with Lemma 4.5 that the functions $\{S_k^C:k\in\mathbb{Z}\}$ form an AtI which satisfies the double Lipschitz condition (see Definition A.1). Moreover, for activation functions like the sigmoid function, it follows that
$$\lim_{t\to\pm\infty}(1+|t|^n)^{1+(2i+1)/n}\,|\sigma^{(i)}(r^2-t^2)|=0,\tag{4.16}$$
which shows equation (4.8). The limit result is an easy consequence of the fact that $t\mapsto e^{-t^2}$ and its derivatives decay to zero faster than the functions $t\mapsto(1+|t|^n)^{-1-(2i+1)/n}$. Therefore, it follows from Theorem A.3 that the set $\mathcal{W}_d^C$ is a wavelet frame and satisfies the estimate in equation (A.14). We summarize this important result:

Theorem 4.6

($\mathcal{L}^1$-convergence of radial wavelets). Let $\sigma$ be an activation function that satisfies the conditions in Lemma 4.5. Then, for every function $f\in\mathcal{L}^1(\mathbb{R}^n)$ (see Definition 4.1 for a definition of the space) and every $N\in\mathbb{N}$, there exists a function
$$f_N\in\operatorname{span}_N(\mathcal{W}_d^C)\subseteq\mathcal{L}^1(\mathbb{R}^n),$$

where $\operatorname{span}_N$ denotes linear combinations of at most $N$ terms in the set, such that
$$\|f-f_N\|_{L^2}\le\frac{\|f\|_{\mathcal{L}^1}}{(N+1)^{1/2}}.\tag{4.17}$$

Using Theorem 4.6 we are able to formulate the main result of this paper:

Corollary 4.7

($\mathcal{L}^1$-convergence of RQNNs). Let the assumptions of Theorem 4.6 be satisfied. Then, for every function $f\in\mathcal{L}^1(\mathbb{R}^n)$ (see Definition 4.1) and every $N\in\mathbb{N}$, there exists a parametrization vector
$$\vec p=[\alpha_1,\ldots,\alpha_{2N};\,w_1,\ldots,w_{2N};\,\xi_1,\ldots,\xi_{2N};\,\theta_1,\ldots,\theta_{2N}]$$
such that
$$\|f-\Psi[\vec p]\|_{L^2}\le\frac{\|f\|_{\mathcal{L}^1}}{(N+1)^{1/2}},\tag{4.18}$$
where $\Psi[\vec p]$ is an RQNN as in equation (2.5).

Proof.

From Theorem 4.6 it follows that there exists $f_N\in\operatorname{span}_N(\mathcal{W}_d^C)$ which satisfies equation (4.17). By definition,
$$f_N(x)=\sum_{k\in\mathbb{Z}}\sum_{j\in\mathbb{Z}^n}\chi_{f,N}(k,j)\,\beta_{k,j}\,\psi^C_{k,2^{-k/n}j}(x),$$
where $\chi_{f,N}$ denotes the characteristic function of a set of indices of size $N$. Using the definition of $\psi_k^C$ we get from equations (4.2) and (4.1)
$$\begin{aligned}f_N(x)&=\sum_{k\in\mathbb{Z}}\sum_{j\in\mathbb{Z}^n}\chi_{f,N}(k,j)\,\beta_{k,j}\,2^{-k/2}\big(S^C_{k,2^{-k/n}j}(x)-S^C_{k-1,2^{-k/n}j}(x)\big)\\&=\sum_{k\in\mathbb{Z}}\sum_{j\in\mathbb{Z}^n}\chi_{f,N}(k,j)\,\beta_{k,j}\,2^{-k/2}\Big(2^k\varphi\big(2^{k/n}(x-2^{-k/n}j)\big)-2^{k-1}\varphi\big(2^{(k-1)/n}(x-2^{-k/n}j)\big)\Big)\\&=\sum_{k\in\mathbb{Z}}\sum_{j\in\mathbb{Z}^n}\chi_{f,N}(k,j)\,C_n\beta_{k,j}\,2^{-k/2}\Big(2^k\sigma\big(r^2-4^{k/n}\|x-2^{-k/n}j\|^2\big)-2^{k-1}\sigma\big(r^2-4^{(k-1)/n}\|x-2^{-k/n}j\|^2\big)\Big).\end{aligned}$$

The arguments where $\sigma$ is evaluated are in general different, and thus we get
$$f_N(x)=\sum_{k\in\mathbb{Z}}\sum_{j\in\mathbb{Z}^n}\underbrace{\chi_{f,N}(k,j)\,C_n\beta_{k,j}\,2^{k/2}}_{=\alpha_{\vec j}}\,\sigma\big(\underbrace{r^2}_{=\theta_{\vec j}}\underbrace{-\,4^{k/n}}_{=\xi_{\vec j}}\|x-\underbrace{2^{-k/n}j}_{=y_{\vec j}}\|^2\big)+\sum_{k\in\mathbb{Z}}\sum_{j\in\mathbb{Z}^n}\underbrace{\big(-\chi_{f,N}(k,j)\,C_n\beta_{k,j}\,2^{k/2-1}\big)}_{=\alpha_{\vec j}}\,\sigma\big(\underbrace{r^2}_{=\theta_{\vec j}}\underbrace{-\,4^{(k-1)/n}}_{=\xi_{\vec j}}\|x-\underbrace{2^{-k/n}j}_{=y_{\vec j}}\|^2\big),$$
from which the coefficients can be read off. Note that for the sake of simplicity we have not explicitly stated the exact index in the subscript of the coefficients $(\alpha,\theta,\xi,y)$; in other words, we use the formal relation $\vec j=(k,j)$ to indicate the dependencies. This shows the claim if we identify $\theta_{\vec j}=r^2$ and $w_{\vec j}^T=2^{1+2k/n}\,y_{\vec j}^T$. □
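The identification used in this proof can be checked numerically: the sketch below converts one wavelet term into the corresponding pair of RQNN neurons and evaluates both expressions. The concrete values of $k$, $j$, $\beta$, $r$ and the choice $C_n=1$ are assumptions made only for illustration.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

n, r, C_n = 2, 1.0, 1.0   # C_n = 1 for illustration only; in (4.1) it normalizes phi

def wavelet_term(x, k, j, beta):
    """One term beta * psi^C_{k, 2^{-k/n} j}(x) of the expansion of f_N."""
    y = 2.0**(-k / n) * j
    S = lambda kk: 2.0**kk * C_n * sigmoid(r**2 - 2.0**(2 * kk / n) * np.sum((x - y)**2))
    return beta * 2.0**(-k / 2) * (S(k) - S(k - 1))

def rqnn_pair(x, k, j, beta):
    """The same term written as two RQNN neurons, as in the proof:
    alpha * sigma(xi |x|^2 + w^T x + theta) with xi = -4^{k/n}, y = 2^{-k/n} j,
    w = 2^{1+2k/n} y, theta = r^2 - 4^{k/n}|y|^2 (and k-1 in the second neuron)."""
    y = 2.0**(-k / n) * j
    out = 0.0
    for kk, a in [(k, C_n * beta * 2.0**(k / 2)), (k - 1, -C_n * beta * 2.0**(k / 2 - 1))]:
        xi = -4.0**(kk / n)
        w = -2.0 * xi * y
        theta = r**2 + xi * np.sum(y**2)
        out += a * sigmoid(xi * np.sum(x**2) + w @ x + theta)
    return out

x, k, j, beta = np.array([0.3, -0.5]), 1, np.array([2.0, -1.0]), 0.7
print(wavelet_term(x, k, j, beta), rqnn_pair(x, k, j, beta))   # the two values should coincide
```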

Remark 4.

Our results in Section 4 are proved using methods similar to those in [Citation37], but the results there cannot be used directly, because the main ingredient of the proof is the construction of localizing functions as wavelets. This task requires two layers of affine linear decision functions to accomplish, since linear combinations of functions of the form $x\mapsto\varphi(x):=C_n\sigma(w^Tx+\theta)$ are not localized; they cannot satisfy (iii) in Definition A.1, as they are not integrable on $\mathbb{R}^n$. In comparison, some quadratic decision functions, such as those used in Section 4, naturally give localized functions and therefore can work as wavelets on their own.

We mention that there exists a zoo of approximation rate results for neural network functions (see for instance [Citation10, Citation11], where best-approximation results in dependence on the space dimension $n$ have been proven). We have based our analysis on the convergence rates results of [Citation37] and [Citation24]. This choice is motivated by the fact that the AtI property from [Citation24] is universal and allows a natural extension of their results to quadratic neural network functions. The motivation here is not to give the optimal convergence rates result for quadratic neural networks, but to verify that convergence rates results are possible. We believe that the AtI is a very flexible and elegant tool.

5 Conclusion

In this paper we studied generalized neural network functions for solving inverse problems. We proved a universal approximation theorem and we proved convergence rates for radial neural networks, which are of the same order as for classical affine linear neural network functions, but with a vastly reduced number of elements (in particular it spares one layer). The paper presents a proof of concept and thus we restrict attention to radial neural networks, although generalizations to higher order neural networks (such as cubic ones) are quite straightforward.

We also remark on the convergence of the actual training process of a neural network, i.e., the optimization of the parameters of each decision function. This is usually done by gradient descent or Gauss-Newton-type methods. From the analysis in [Citation3], the convergence conditions of Newton's method become very complicated for DNNs, but are transparent for quadratic neural networks, which is in fact a matter of the complicated chain rule when differentiating DNNs with respect to the parameters. For such an analysis there is a clear preference for increasing the degree of the polynomials rather than the number of layers.

Acknowledgments

The authors would like to thank some referees for their valuable suggestions and their patience.

Additional information

Funding

This research was funded in whole, or in part, by the Austrian Science Fund (FWF) 10.55776/P34981 (OS & LF) – New Inverse Problems of Super-Resolved Microscopy (NIPSUM), SFB 10.55776/F68 (OS) “Tomography Across the Scales,” project F6807-N36 (Tomography with Uncertainties), and 10.55776/T1160 (CS) “Photoacoustic Tomography: Analysis and Numerics.” For open access purposes, the author has applied a CC BY public copyright license to any author-accepted manuscript version arising from this submission. The financial support by the Austrian Federal Ministry for Digital and Economic Affairs, the National Foundation for Research, Technology and Development and the Christian Doppler Research Association is gratefully acknowledged.

Notes

1 For a detailed exposition on generalized inverses see [Citation29].

2 In the following, $\nabla$ and $\nabla^2$ (without subscripts) always denote derivatives with respect to an $n$-dimensional variable such as $x$; $'$ and $''$ denote derivatives of a one-dimensional function.

References

  • Natterer, F. (1977). Regularisierung schlecht gestellter Probleme durch Projektionsverfahren. Numerische Mathematik 28(3):329–341. DOI: 10.1007/BF01389972.
  • Obmann, D., Schwab, J., Haltmeier, M. (2021). Deep synthesis network for regularizing inverse problems. Inverse Problems 37(1):015005. DOI: 10.1088/1361-6420/abc7cd.
  • Scherzer, O., Hofmann, B., Nashed, Z. (2023). Gauss–Newton method for solving linear inverse problems with neural network coders. Sampling Theory Signal Process. Data Anal. 21(2):25. DOI: 10.1007/s43670-023-00066-6.
  • Pinkus, A. (1999). Approximation theory of the MLP model in neural networks. Acta Numerica 8:143–195. DOI: 10.1017/S0962492900002919.
  • Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Inf. Theory 39(3):930–945. DOI: 10.1109/18.256500.
  • Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Math. Control Signals Syst. 2(4):303–314. DOI: 10.1007/BF02551274.
  • Hornik, K., Stinchcombe, M., White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Netw. 2(5):359–366. DOI: 10.1016/0893-6080(89)90020-8.
  • Leshno, M., Lin, V. Y., Pinkus, A., Schocken, S. (1993). Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Netw. 6(6):861–867. DOI: 10.1016/s0893-6080(05)80131-5.
  • Mhaskar, H. N. (1993). Approximation properties of a multilayered feedforward artificial neural network. Adv. Comput. Math. 1(1):61–80. DOI: 10.1007/BF02070821.
  • Siegel, J. W., Xu, J. (2022). Sharp bounds on the approximation rates, metric entropy, and n-Widths of shallow neural networks. Found. Comput. Math. DOI: 10.1007/s10208-022-09595-3.
  • Siegel, J. W., Xu, J. (2023). Characterization of the variation spaces corresponding to shallow neural networks. Constr. Approx. 57(3):1109–1132. DOI: 10.1007/s00365-023-09626-4.
  • Wager, S., Wang, S., Liang, P. (2013). Dropout training as adaptive regularization. In: Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 1. Curran Associates Inc., pp. 351–359.
  • Manita, O., Peletier, M., Portegies, J., Sanders, J., Senen-Cerda, A. (2022). Universal approximation in dropout neural networks. J. Mach. Learn. Res. 23(19):1–46.
  • Zhou, D.-X. (2018). Deep distributed convolutional neural networks: universality. Anal. Appl. 16(6):895–919. DOI: 10.1142/S0219530518500124.
  • Zhou, D.-X. (2020). Universality of deep convolutional neural networks. Appl. Comput. Harmon. Anal. 48(2):787–794. DOI: 10.1016/j.acha.2019.06.004.
  • Schäfer, A. M., Zimmermann, H.-G. (2007). Recurrent neural networks are universal approximators. Int. J. Neural Syst. 17(4):253–263. DOI: 10.1142/s0129065707001111.
  • Hammer, B. (2000). Learning with Recurrent Neural Networks. Lecture Notes in Control and Information Sciences. London: Springer.
  • White, H. (1989). An additional hidden unit test for neglected nonlinearity in multilayer feedforward networks. In: International Joint Conference on Neural Networks. DOI: 10.1109/ijcnn.1989.118281.
  • Pao, Y.-H., Park, G.-H., Sobajic, D. J. (1994). Learning and generalization characteristics of the random vector functional-link net. Neurocomputing 6(2):163–180. DOI: 10.1016/0925-2312(94)90053-1.
  • Igelnik, B., Pao, Y.-H. (1995). Stochastic choice of basis functions in adaptive function approximation and the functional-link net. IEEE Trans. Neural Netw. 6(6):1320–1329. DOI: 10.1109/72.471375.
  • Gelenbe, E., Mao, Z.-W., Li, Y.-D. (1999). Approximation by random networks with bounded number of layers. In: Neural Networks for Signal Processing IX: Proceedings of the 1999 IEEE Signal Processing Society Workshop (Cat. No.98TH8468). DOI: 10.1109/nnsp.1999.788135.
  • Tsapanos, N., Tefas, A., Nikolaidis, N., Pitas, I. (2019). Neurons With paraboloid decision boundaries for improved neural network classification performance. IEEE Trans Neural Netw Learn Syst 30(1):284–294. DOI: 10.1109/tnnls.2018.2839655.
  • Fan, F., Xiong, J., Wang, G. (2020). Universal approximation with quadratic deep networks. Neural Netw. 124:383–392. DOI: 10.1016/j.neunet.2020.01.007.
  • Deng, D., Han, Y. (2009). Harmonic Analysis on Spaces of Homogeneous Type. Berlin, Heidelberg: Springer. DOI: 10.1007/978-3-540-88745-4.
  • Buhmann, M. D. (2003). Radial Basis Functions. Cambridge: Cambridge University Press. DOI: 10.1017/cbo9780511543241.
  • Dugundji, J. (1978). Topology, 2nd ed. Boston, MA: Allyn and Bacon, Inc.
  • Kelley, J. L. (1955). General Topology. Toronto-New York-London: D. Van Nostrand Company.
  • Shepp, L. A., Logan, B. F. (1974). The Fourier reconstruction of a head section. IEEE Trans. Nucl. Sci. 21(3):21–43. DOI: 10.1109/TNS.1974.6499235.
  • Nashed, M., ed. (1976). Generalized Inverses and Applications. New York: Academic Press [Harcourt Brace Jovanovich Publishers], pp. xiv + 1054.
  • Lamperski, A. (2022). Neural network independence properties with applications to adaptive control. In: 2022 IEEE 61st Conference on Decision and Control (CDC). DOI: 10.1109/CDC51059.2022.9992994.
  • Deuflhard, P., Hohmann, A. (1991). Numerical Analysis. A First Course in Scientific Computation. Berlin: De Gruyter.
  • Barron, A. R., Cohen, A., Dahmen, W., DeVore, R. A. (2008). Approximation and learning by greedy algorithms. Ann. Stat. 36(1):64–94. DOI: 10.1214/009053607000000631.
  • Daubechies, I. (1992). Ten Lectures on Wavelets. Philadelphia, PA: SIAM, xx + 357. DOI: 10.1137/1.9781611970104.
  • Graps, A. (1995). An introduction to wavelets. IEEE Comput. Sci. Eng. 2(2):50–61. DOI: 10.1109/99.388960.
  • Louis, A., Maass, P., Rieder, A. (1998). Wavelets. Theorie und Anwendungen, 2nd ed. Stuttgart: Teubner.
  • Chui, C. (1992). An Introduction to Wavelets, Vol. 1. New York: Academic Press.
  • Shaham, U., Cloninger, A., Coifman, R. R. (2018). Provable approximation properties for deep neural networks. Appl. Comput. Harmon. Anal. 44(3):537–557. DOI: 10.1016/j.acha.2016.04.003.

Appendix A. Approximation to the identity (AtI)

Definition A.1

(Approximation to the identity [Citation24]). A sequence of symmetric kernel functions $(S_k:\mathbb{R}^n\times\mathbb{R}^n\to\mathbb{R})_{k\in\mathbb{Z}}$ is said to be an approximation to the identity (AtI) if there exists a quintuple $(\epsilon,\zeta,C,C_\rho,C_A)$ of positive numbers satisfying the constraints
$$0<\epsilon\le\tfrac1n,\qquad0<\zeta\le\tfrac1n\qquad\text{and}\qquad C_A<1\tag{A.1}$$
such that the following three conditions are satisfied for all $k\in\mathbb{Z}$:

  (i) $|S_k(x,y)|\le C\,\dfrac{2^{-k\epsilon}}{(2^{-k}+C_\rho\|x-y\|^n)^{1+\epsilon}}$ for all $x,y\in\mathbb{R}^n$;

  (ii) $|S_k(x,y)-S_k(x',y)|\le C\left(\dfrac{C_\rho\|x-x'\|^n}{2^{-k}+C_\rho\|x-y\|^n}\right)^{\zeta}\dfrac{2^{-k\epsilon}}{(2^{-k}+C_\rho\|x-y\|^n)^{1+\epsilon}}$ for all triples $(x,x',y)\in\mathbb{R}^n\times\mathbb{R}^n\times\mathbb{R}^n$ which satisfy
$$C_\rho\|x-x'\|^n\le C_A\big(2^{-k}+C_\rho\|x-y\|^n\big);\tag{A.2}$$

  (iii) $\int_{\mathbb{R}^n}S_k(x,y)\,dy=1$ for all $x\in\mathbb{R}^n$.

Moreover, we say that the AtI satisfies the double Lipschitz condition if there exists a triple $(\tilde C,\tilde C_A,\zeta)$ of positive constants satisfying
$$\tilde C_A<\tfrac12,\tag{A.3}$$
such that for all $k\in\mathbb{Z}$
$$|S_k(x,y)-S_k(x',y)-S_k(x,y')+S_k(x',y')|\le\tilde C\left(\frac{C_\rho\|x-x'\|^n}{2^{-k}+C_\rho\|x-y\|^n}\right)^{\zeta}\left(\frac{C_\rho\|y-y'\|^n}{2^{-k}+C_\rho\|x-y\|^n}\right)^{\zeta}\frac{2^{-k\epsilon}}{(2^{-k}+C_\rho\|x-y\|^n)^{1+\epsilon}}\tag{A.4}$$
for all quadruples $(x,x',y,y')\in\mathbb{R}^n\times\mathbb{R}^n\times\mathbb{R}^n\times\mathbb{R}^n$ which satisfy
$$C_\rho\max\{\|x-x'\|^n,\|y-y'\|^n\}\le\tilde C_A\big(2^{-k}+C_\rho\|x-y\|^n\big).\tag{A.5}$$

The constraints in condition (ii) and in equation (A.5) are essential for our analysis. We now characterize geometric properties of these constrained sets:

Lemma A.2.

Let $C_\rho=1$, $C_A=2^{-n}$ and $\tilde C_A=3^{-n}$.

  • Every triple $(x,x',y)$ which satisfies equation (A.2) and for which $\|x-y\|^n\ge2^{-k}$ satisfies, for all $t\in[0,1]$,
$$\|x+t(x'-x)-y\|^n\ge2^{-n}\|x-y\|^n-2^{-n}2^{-k}.\tag{A.6}$$

  • Every quadruple $(x,x',y,y')$ which satisfies equation (A.5) and for which $\|x-y\|^n\ge2^{-k}$ satisfies
$$\|x+t_x(x'-x)-y-t_y(y'-y)\|^n\ge3^{-n}\|x-y\|^n-3^{-n}2^{1-k}\tag{A.7}$$
for all $t_x,t_y\in[0,1]$.

Proof.

  • With the concrete choice of the parameters $C_A$ and $C_\rho$, equation (A.2) reads as follows:
$$\|x-x'\|^n\le2^{-n}\big(2^{-k}+\|x-y\|^n\big).\tag{A.8}$$
Since we assume that $\|x-y\|^n\ge2^{-k}$, it follows from equation (A.8) that $\|x-x'\|^n\le2^{1-n}\|x-y\|^n\le\|x-y\|^n$; in particular $\|x-y\|-\|x-x'\|\ge0$. We apply Jensen's inequality, which states that for $a,b\ge0$
$$a^n+b^n\ge2^{1-n}(a+b)^n.\tag{A.9}$$
We use $a=\|x+t(x'-x)-y\|$ and $b=t\|x'-x\|$, which then (along with the triangle inequality) gives
$$\|x+t(x'-x)-y\|^n+t^n\|x'-x\|^n\ge2^{1-n}\|x-y\|^n.$$
In other words, it follows from equation (A.8) that
$$\|x+t(x'-x)-y\|^n\ge2^{1-n}\|x-y\|^n-t^n\|x-x'\|^n\ge2^{1-n}\|x-y\|^n-\|x-x'\|^n\ge2^{1-n}\|x-y\|^n-2^{-n}\big(2^{-k}+\|x-y\|^n\big)=2^{-n}\|x-y\|^n-2^{-n}2^{-k}.$$

  • With the concrete choice of the parameters $\tilde C_A$ and $C_\rho$, equation (A.5) reads as follows:
$$\max\{\|x-x'\|^n,\|y-y'\|^n\}\le3^{-n}\big(2^{-k}+\|x-y\|^n\big).\tag{A.10}$$
Since we assume that $\|x-y\|^n\ge2^{-k}$, it follows from equation (A.10) that $\max\{\|x-x'\|^n,\|y-y'\|^n\}\le2\cdot3^{-n}\|x-y\|^n$; this in particular shows that $\|x-y\|-\|x-x'\|-\|y-y'\|\ge0$. We apply Jensen's inequality, which states that for $a,b,c\ge0$
$$a^n+b^n+c^n\ge3^{1-n}(a+b+c)^n.\tag{A.11}$$
We use $a=\|x+t_x(x'-x)-y-t_y(y'-y)\|$, $b=t_x\|x'-x\|$ and $c=t_y\|y'-y\|$, which then (along with the triangle inequality) gives
$$\|x+t_x(x'-x)-y-t_y(y'-y)\|^n+t_x^n\|x'-x\|^n+t_y^n\|y'-y\|^n\ge3^{1-n}\|x-y\|^n.$$
In other words, it follows from equation (A.10) that
$$\|x+t_x(x'-x)-y-t_y(y'-y)\|^n\ge3^{1-n}\|x-y\|^n-t_x^n\|x-x'\|^n-t_y^n\|y-y'\|^n\ge3^{1-n}\|x-y\|^n-\|x-x'\|^n-\|y-y'\|^n\ge3^{1-n}\|x-y\|^n-2\cdot3^{-n}\big(2^{-k}+\|x-y\|^n\big)=3^{-n}\|x-y\|^n-3^{-n}2^{1-k}.\qquad\square$$

The approximation to the identity in Definition A.1 can be used to construct wavelet frames that approximate arbitrary functions in $\mathcal{L}^1(\mathbb{R}^n)$, as shown in the theorem below.

Remark 5.

When $\|x-y\|^n<2^{-k}$, equation (A.6) also holds, since its right hand side is then negative and the inequality is trivial.

Theorem A.3

([Citation37]). Let $(S_k:\mathbb{R}^n\times\mathbb{R}^n\to\mathbb{R})_{k\in\mathbb{Z}}$ be a symmetric AtI which satisfies the double Lipschitz condition (see equation (A.4)). Let
$$\psi_{k,y}(x):=2^{-k/2}\big(S_k(x,y)-S_{k-1}(x,y)\big)\quad\text{for all }x,y\in\mathbb{R}^n\text{ and }k\in\mathbb{Z}.\tag{A.12}$$

Then the set of functions
$$\mathcal{W}:=\{x\mapsto\psi_{k,b}(x) : k\in\mathbb{Z},\ b\in2^{-k/n}\mathbb{Z}^n\}\tag{A.13}$$
is a frame, and for every function $f\in\mathcal{L}^1(\mathbb{R}^n)$ there exists a linear combination of $N$ elements of $\mathcal{W}$, denoted by $f_N$, satisfying
$$\|f-f_N\|_{L^2}\le\frac{\|f\|_{\mathcal{L}^1}}{(N+1)^{1/2}}.\tag{A.14}$$