Optimization
A Journal of Mathematical Programming and Operations Research
Volume 73, 2024 - Issue 4
Research Article

Generalized left-localized Cayley parametrization for optimization with orthogonality constraints

Pages 1113-1159 | Received 14 Oct 2021, Accepted 18 Oct 2022, Published online: 15 Nov 2022

Abstract

We present a reformulation of optimization problems over the Stiefel manifold by using a Cayley-type transform, named the generalized left-localized Cayley transform, for the Stiefel manifold. The reformulated optimization problem is defined over a vector space, whereby we can directly apply powerful computational arts designed for optimization over a vector space. The proposed Cayley-type transform enjoys several key properties which are useful to (i) study relations between the original problem and the proposed problem; (ii) check conditions that guarantee the global convergence of optimization algorithms. Numerical experiments demonstrate that the proposed algorithm outperforms standard algorithms designed with a retraction on the Stiefel manifold.

1. Introduction

The Stiefel manifold St(p,N) := {U ∈ R^{N×p} : U^T U = I_p} is defined for (p,N) ∈ N × N with p ≤ N, where I_p is the p×p identity matrix (see Appendix 1 for basic facts on St(p,N)).

We consider an orthogonal constraint optimization problem formulated as:

Problem 1.1

For a given continuous function f : R^{N×p} → R, (1) find U⋆ ∈ argmin_{U∈St(p,N)} f(U), where the existence of a minimizer in (Equation1) is automatically guaranteed by the compactness of St(p,N) and the continuity of f over the Np-dimensional Euclidean space R^{N×p}.

This problem belongs to the so-called Riemannian optimization problems (see [Citation1] and Appendix 2), and has rich applications, in the case p ≪ N in particular, in data sciences including signal processing and machine learning, as remarked recently in [Citation2,Citation3]. These applications include, e.g. the nearest low-rank correlation matrix problem [Citation4–6], the nonlinear eigenvalue problem [Citation7–9], sparse principal component analysis [Citation10–12], 1-bit compressed sensing [Citation13,Citation14], the joint diagonalization problem for independent component analysis [Citation15–17] and enhancement of the generalization performance of deep neural networks [Citation18,Citation19]. However, Problem 1.1 has inherent difficulties regarding the severe nonlinearity of St(p,N) as an instance of general nonlinear Riemannian manifolds.

Minimization of a continuous f : R^{N×N} → R over the orthogonal group O(N) := St(N,N) is a special instance of Problem 1.1 with p = N. This problem can be separated into two optimization problems over the special orthogonal group SO(N) := {U ∈ O(N) : det(U) = 1}, namely (2) find U_1⋆ ∈ argmin_{U∈SO(N)} f(U) and, with an arbitrarily chosen Q ∈ O(N)∖SO(N), find U_2⋆ ∈ argmin_{U∈SO(N)} f(QU), because O(N) is the disjoint union of SO(N) and O(N)∖SO(N) = {U ∈ O(N) : det(U) = −1} = {QU ∈ O(N) : U ∈ SO(N)}. For the problem in (Equation2), the Cayley transform (3) φ : SO(N)∖E_{N,N} → Q_{N,N} : U ↦ (I − U)(I + U)^{-1} and its inversion mappingFootnote1 (4) φ^{-1} : Q_{N,N} → SO(N)∖E_{N,N} : V ↦ (I − V)(I + V)^{-1} = 2(I + V)^{-1} − I have been utilized in [Citation18,Citation20,Citation21] because φ translates a subset SO(N)∖E_{N,N} (= O(N)∖E_{N,N} [see (A3)]) of SO(N) into the vector space Q_{N,N} := {V ∈ R^{N×N} : V^T = −V} of all skew-symmetric matrices, where E_{N,N} := {U ∈ O(N) : det(I + U) = 0} is called, in this paper, the singular-point set of φ. More precisely, this is because φ is a diffeomorphism between the dense subsetFootnote2 SO(N)∖E_{N,N} of SO(N) and Q_{N,N}.

The Cayley transform pair φ and φ^{-1} can be modified with an arbitrarily chosen S ∈ SO(N) as (5) φ_S : O(N)∖E_{N,N}(S) → Q_{N,N}(S) := Q_{N,N} : U ↦ φ(S^T U) = (I − S^T U)(I + S^T U)^{-1} and (6) φ_S^{-1} : Q_{N,N}(S) → O(N)∖E_{N,N}(S) : V ↦ Sφ^{-1}(V) = S(I − V)(I + V)^{-1}, where E_{N,N}(S) := {U ∈ O(N) : det(I + S^T U) = 0} is the singular-point set of φ_S. These mappings are also diffeomorphisms between their domains and images. With the aid of φ_SFootnote3 with S ∈ SO(N), the following Problem 1.2 was considered in [Citation20] as a relaxation of the problem in (Equation2).

Problem 1.2

For a given continuous function f : R^{N×N} → R, choose S ∈ SO(N) and ε > 0 arbitrarily. Then, (7) find V⋆ ∈ Q_{N,N}(S) such that f∘φ_S^{-1}(V⋆) < min f(SO(N)) + ε.

Remark 1.3

  1. (The existence of V⋆ in Problem 1.2). The existence of V⋆ satisfying (Equation7) is guaranteed because φ_S^{-1}(Q_{N,N}) = SO(N)∖E_{N,N}(S) is a dense subset of SO(N) for any S ∈ SO(N) [Citation20] (see Fact A.3) and f∘φ_S^{-1} is continuous.

  2. (Left-localized Cayley transform). We call φ_S in (Equation5) the left-localized Cayley transform centred at S ∈ O(N) because S is multiplied from the left in φ_S^{-1}(V) = Sφ^{-1}(V) in (Equation6), and φ_S(S) = 0. Although Q_{N,N}(S) in (Equation5) is the common set Q_{N,N} for all S ∈ O(N), we distinguish Q_{N,N}(S) for each S ∈ O(N) as the domain of the parametrization φ_S^{-1} of the particular subset O(N)∖E_{N,N}(S) ⊂ O(N).

We note that Problem 1.2 is a realistic relaxation of the problem in (Equation2) as long as our target is to approximate a solution to (Equation2) algorithmically, because SO(N)∖E_{N,N}(S) = φ_S^{-1}(Q_{N,N}(S)) is dense in SO(N). In reality, with a digital computer we can handle only a small subset of the rational numbers Q, which is dense in R, due to the limitation of numerical precision. This situation implies that it is reasonable to consider an approximation of SO(N) within its dense subset SO(N)∖E_{N,N}(S).

For Problem 1.2, we can enjoy various arts of optimization over a vector space, e.g. the gradient descent method and Newton's method, because QN,N(S) is a vector space. Thanks to the homeomorphism of φS, we can estimate a solution to the problem in (Equation2) by applying φS1 to a solution of Problem 1.2 with a sufficiently small ϵ>0. We call this strategy via Problem 1.2 a Cayley parametrization (CP) strategy for the problem in (Equation2). The CP strategy has a notable advantage over the standard optimization strategies [Citation1], called the retraction-based strategies, in view that many powerful computational arts designed for optimization over a single vector space can be directly plugged into the CP strategy. We will discuss the details in Remark 3.6.

In this paper, we address a natural question regarding a possible extension of the CP strategy to Problem 1.1 for general p < N: can we parameterize a dense subset of St(p,N) even with p < N in terms of a single vector space? To answer this question positively, we propose the Generalized Left-Localized Cayley Transform (G-L2CT): Φ_S : St(p,N)∖E_{N,p}(S) → Q_{N,p}(S) : U ↦ [A_S(U), −B_S^T(U); B_S(U), 0], with S ∈ O(N), as an extension of the left-localized Cayley transform φ_S in (Equation5), where A_S(U) ∈ Q_{p,p} and B_S(U) ∈ R^{(N−p)×p} are determined by a centre point S ∈ O(N) (see (Equation11) and (Equation12) in Definition 2.1). The set E_{N,p}(S) := {U ∈ St(p,N) : det(I_p + S_le^T U) = 0} is called the singular-point set of Φ_S (see the notation at the end of this section), and Q_{N,p}(S) is a linear subspace of Q_{N,N}(S) (see (Equation9)). For any S ∈ O(N), we will show several key properties, e.g. (i) Φ_S is a diffeomorphism between St(p,N)∖E_{N,p}(S) and the vector space Q_{N,p}(S) with the inversion mapping Φ_S^{-1} : Q_{N,p}(S) → St(p,N)∖E_{N,p}(S) (see Proposition 2.2); (ii) St(p,N)∖E_{N,p}(S) is a dense subset of St(p,N) for p < N (see Theorem 2.3(b)). Therefore, the proposed Φ_S and Φ_S^{-1} have the inherent properties desired for applications of the CP strategy to Problem 1.1.

To extend the CP strategy to Problem 1.1 for p<N, we consider Problem 1.4 below, which can be seen as an extension of Problem 1.2. For the same reason as in Remark 1.3(a), the existence of V achieving (Equation8) is guaranteed by the denseness of St(p,N)EN,p(S) in St(p,N) (see Lemma 2.6).

Problem 1.4

For a given continuous function f : R^{N×p} → R with p < N, choose S ∈ O(N) and ε > 0 arbitrarily. Then, (8) find V⋆ ∈ Q_{N,p}(S) such that f∘Φ_S^{-1}(V⋆) < min f(St(p,N)) + ε.

Under a smoothness assumption on general f, a realistic goal for Problem 1.1 is to find a stationary point U⋆ ∈ St(p,N) of f because Problem 1.1 is a non-convex optimization problem (see, e.g. [Citation1,Citation22,Citation23]) and any local minimizer must be a stationary point [Citation22,Citation23]. In Lemma 3.4, we present a characterization of a stationary point U⋆ ∈ St(p,N) of f over St(p,N), with S ∈ O(N) satisfying U⋆ ∈ St(p,N)∖E_{N,p}(S), in terms of a stationary point V⋆ ∈ Q_{N,p}(S) of f∘Φ_S^{-1} over the vector space Q_{N,p}(S), i.e. ∇(f∘Φ_S^{-1})(V⋆) = 0. To approximate a stationary point of f over St(p,N), we also consider the following problem:

Problem 1.5

For a continuously differentiable function f : R^{N×p} → R with p < N, choose S ∈ O(N) and ε > 0 arbitrarily. Then, find V⋆ ∈ Q_{N,p}(S) such that ‖∇(f∘Φ_S^{-1})(V⋆)‖_F < ε.

For Problem 1.5, we can apply many powerful arts for searching a stationary point of a non-convex function over a vector space.

Numerical experiments in Section 4 demonstrate that the proposed CP strategy outperforms the standard algorithms designed with a retraction on St(p,N) (see Appendix 2) in the scenario of a certain eigenbasis extraction problem.

Notation. N and R denote the set of all positive integers and the set of all real numbers respectively. For general nN, we use In for the identity matrix in Rn×n, but for simplicity, we use I for the identity matrix in RN×N. For pN, IN×pRN×p denotes the matrix of the first p columns of I. For a matrix XRn×m, [X]ij (1in,1jm) denotes the (i,j) entry of X, and XT denotes the transpose of X. For a square matrix X:=[X11Rp×pX12Rp×(Np)X21R(Np)×pX22R(Np)×(Np)]RN×N,we use the notation [[X]]ij:=Xij for i,j{1,2}. For URN×p, the matrices UupRp×p and UloR(Np)×p respectively denote the upper and the lower block matrices of U=[UupTUloT]T. For SRN×N, the matrices SleRN×p and SriRN×(Np) respectively denote the left and right block matrices of S=[SleSri]. For a matrix XRn×n, Skew(X)=(XXT)/2 denotes the skew-symmetric component of X. For square matrices XiRni×ni(1ik), diag(X1,X2,,Xk)R(i=1kni)×(i=1kni) denotes the block diagonal matrix with diagonal blocks X1,X2,,Xk. For a given matrix, 2 and F denote the spectral norm and the Frobenius norm respectively. The functions σmax() and σmin() denote respectively the largest and the nonnegative smallest singular values of a given matrix. The function λmax() denotes the largest eigenvalue of a given symmetric matrix. For a vector space X of matrices, BX(X,ϵ):={XXXXF<ϵ} denotes an open ball centred at XX with radius ϵ>0. To distinguish from the symbol for the orthogonal group O(N), the symbol o() is used in place of the standard big O notation for computational complexity.

2. Generalized left-localized Cayley transform (G-L2CT)

2.1. Definition and properties of G-L2CT

In this subsection, we introduce the Generalized Left-Localized Cayley Transform (G-L2CT) for the parametrization of St(p,N) as a natural extension of φS in (Equation5). Indeed, the G-L2CT inherits key properties satisfied by φS (see Proposition 2.2 and Theorem 2.3).

Definition 2.1

Generalized left-localized Cayley transform

For p, N ∈ N satisfying p ≤ N, let S ∈ O(N), E_{N,p}(S) := {U ∈ St(p,N) : det(I_p + S_le^T U) = 0}, and (9) Q_{N,p}(S) := Q_{N,p} := {[A, −B^T; B, 0] : A^T = −A ∈ R^{p×p}, B ∈ R^{(N−p)×p}} ⊂ Q_{N,N}. The generalized left-localized Cayley transform centred at S is defined by (10) Φ_S : St(p,N)∖E_{N,p}(S) → Q_{N,p}(S) : U ↦ [A_S(U), −B_S^T(U); B_S(U), 0] with (11) A_S(U) := 2(I_p + S_le^T U)^{-T} Skew(U^T S_le)(I_p + S_le^T U)^{-1} ∈ Q_{p,p}, (12) B_S(U) := −S_ri^T U(I_p + S_le^T U)^{-1} ∈ R^{(N−p)×p}, where we call S the centre point of Φ_S, and E_{N,p}(S) the singular-point set of Φ_S.
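As a quick illustration of Definition 2.1, the following MATLAB sketch evaluates Φ_S(U) directly from (Equation10)–(Equation12); the function name glcayley is ours (not from the paper) and the code is only a naive implementation that does not exploit the cheaper expressions of Section 2.2.

```matlab
% Minimal sketch of the G-L2CT Phi_S in Definition 2.1 (function name is
% ours). U in St(p,N), S in O(N), and U must avoid E_{N,p}(S).
function V = glcayley(U, S)
[N, p] = size(U);
Sle = S(:, 1:p);                      % left block S_le of S
Sri = S(:, p+1:N);                    % right block S_ri of S
C   = eye(p) + Sle' * U;              % I_p + S_le^T U (nonsingular by assumption)
K   = (U' * Sle - Sle' * U) / 2;      % Skew(U^T S_le)
A   = 2 * (C' \ K) / C;               % A_S(U) in (11)
B   = -(Sri' * U) / C;                % B_S(U) in (12)
V   = [A, -B'; B, zeros(N - p)];      % Phi_S(U) in (10), an element of Q_{N,p}(S)
end
```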

Proposition 2.2

Inversion of G-L2CT

The mapping Φ_S with S ∈ O(N) is a diffeomorphism between the subset St(p,N)∖E_{N,p}(S) ⊂ St(p,N) and Q_{N,p}(S).Footnote4 The inversion mapping is given, in terms of φ_S^{-1} in (Equation6), by (13) Φ_S^{-1} : Q_{N,p}(S) → St(p,N)∖E_{N,p}(S) : V ↦ Ξ∘φ_S^{-1}(V) = S(I − V)(I + V)^{-1} I_{N×p}, where Ξ : O(N) → St(p,N) : U ↦ U I_{N×p}. Moreover, for V ∈ Q_{N,p}(S), we have the following expressions: (14) Φ_S^{-1}(V) = Ξ∘φ_S^{-1}(V) = 2S(I + V)^{-1} I_{N×p} − S I_{N×p} (15) = 2S[M^{-1}; −[[V]]_21 M^{-1}] − S_le = 2(S_le − S_ri[[V]]_21)M^{-1} − S_le, where M := I_p + [[V]]_11 + [[V]]_21^T [[V]]_21 ∈ R^{p×p} is the Schur complement matrix of I + V ∈ R^{N×N} (see Fact A.6).

Proof.

See Appendix 3.
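The next MATLAB sketch evaluates Φ_S^{-1}(V) via the expression (Equation15) with the Schur complement matrix M; the function name glcayley_inv is again ours and this is a plain, unoptimized illustration. For V = glcayley(U, S), one can check numerically that norm(glcayley_inv(V, S, p) - U, 'fro') and norm(U'*U - eye(p), 'fro') are at the level of rounding errors.

```matlab
% Minimal sketch of Phi_S^{-1} via (15); M is the p-by-p Schur complement
% matrix of I+V. Helper name (glcayley_inv) is ours.
function U = glcayley_inv(V, S, p)
N   = size(S, 1);
Sle = S(:, 1:p);  Sri = S(:, p+1:N);
A   = V(1:p, 1:p);                    % [[V]]_11
B   = V(p+1:N, 1:p);                  % [[V]]_21
M   = eye(p) + A + B' * B;            % Schur complement matrix of I+V
U   = 2 * (Sle - Sri * B) / M - Sle;  % Phi_S^{-1}(V) in (15)
end
```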

Theorem 2.3

Denseness of St(p,N)EN,p(S)

Let SO(N) and p<N. Then, the following hold:

  1. St(p,N)∖E_{N,p}(S) = Ξ(O(N)∖E_{N,N}(S)), i.e. Φ_S^{-1}(Q_{N,p}) = Ξ∘φ_S^{-1}(Q_{N,p}) = Ξ∘φ_S^{-1}(Q_{N,N}), where Ξ is defined as in Proposition 2.2.

  2. St(p,N)∖E_{N,p}(S) is an open dense subset of St(p,N) (see Fact A.1(a) for the topology of St(p,N)).

  3. For S_1, S_2 ∈ O(N), the subset Δ(S_1,S_2) := (St(p,N)∖E_{N,p}(S_1)) ∩ (St(p,N)∖E_{N,p}(S_2)) is a nonempty open dense subset of St(p,N).

  4. Let g : Q_{N,p}(S) → R : V ↦ det(I_p + S_le^T Φ_S^{-1}(V)). Then, g is a positive-valued function and lim_{V∈Q_{N,p}(S), ‖V‖_2→∞} g(V) = 0. Conversely, if a sequence (V_n)_{n=0}^∞ ⊂ Q_{N,p}(S) satisfies lim_{n→∞} g(V_n) = 0, then lim_{n→∞} ‖V_n‖_2 = ∞.

Proof.

See Appendix 4.

Proposition 2.4

Properties of G-L2CT in view of the manifold theory

      

  1. (Chart). For S ∈ O(N), the ordered pair (St(p,N)∖E_{N,p}(S), Φ_S) is a chart of St(p,N), i.e. (i) St(p,N)∖E_{N,p}(S) is an open subset of St(p,N); (ii) Φ_S is a homeomorphism between St(p,N)∖E_{N,p}(S) and the (Np − p(p+1)/2)-dimensional Euclidean space Q_{N,p}(S).

  2. (Smooth atlas). The set {(St(p,N)∖E_{N,p}(S), Φ_S)}_{S∈O(N)} is a smooth atlas of St(p,N), i.e. (i) ⋃_{S∈O(N)}(St(p,N)∖E_{N,p}(S)) = St(p,N); (ii) for every pair S_1, S_2 ∈ O(N), Φ_{S_2}∘Φ_{S_1}^{-1} is smooth over Φ_{S_1}(Δ(S_1,S_2)), where Δ(S_1,S_2) has been defined in Theorem 2.3(c).

Proof.

(a) (i) See Theorem 2.3(b). (ii) From Proposition 2.2, ΦS is a homeomorphism between St(p,N)EN,p(S) and QN,p(S). Clearly the dimension of the vector space QN,p(S) is Npp(p+1)/2.

(b) (i) Recall that St(p,N)=SO(N){SIN×p}=SO(N){Sle}=SO(N){ΦS1(0)}. (ii) See Proposition 2.2.

Remark 2.5

  1. (Relation between the Cayley transform-based retraction and ΦS1). By using φ1 in (Equation4), the Cayley transform-based retraction has been utilized for Problem 1.1, e.g. [Citation22,Citation24,Citation25] (see Appendix 2 for the retraction-based strategy). The Cayley transform-based retraction can be expressed by using the proposed ΦS1 (see (Equation31)). In Section 3.4, we will clarify a diffeomorphic property of this retraction through ΦS1.

  2. (Parametrization of St(p,N) with Φ_S). By St(p,N)∖E_{N,p}(S) (= Φ_S^{-1}(Q_{N,p}(S))) ⊊ St(p,N), for a given pair of U ∈ St(p,N) and S ∈ O(N), the inclusion U ∈ St(p,N)∖E_{N,p}(S) is not guaranteed in general. However, Proposition 2.4(b) ensures the existence of S ∈ O(N) satisfying U ∈ St(p,N)∖E_{N,p}(S). Indeed, we can construct such an S by using a singular value decomposition of U_up ∈ R^{p×p}, as shown later in Theorem 2.7. This fact tells us that the availability of general S ∈ O(N) realizes an overall parametrization of St(p,N) with Φ_S^{-1}. We note that a naive idea of using Φ_I^{-1}, i.e. the special case of Φ_S^{-1} with S = I, in optimization over St(p,N) has been reported briefly in [Citation26], which can be seen as an extension of the Cayley parametrization in [Citation20] for optimization over O(N).

  3. (On the choice of Ξ : O(N) → St(p,N) for Φ_S^{-1} = Ξ∘φ_S^{-1} in Proposition 2.2). Since Ξ defined in Proposition 2.2 selects the first p column vectors of an orthogonal matrix, Φ_S^{-1}(V) = Ξ∘φ_S^{-1}(V) for V ∈ Q_{N,p}(S) can be regarded as the matrix of the first p column vectors of the orthogonal matrix φ_S^{-1}(V). Proposition 2.2 guarantees that the matrices Φ_S^{-1}(V) of the first p column vectors of φ_S^{-1}(V) do not overlap over V ∈ Q_{N,p}(S). Although there are many other selection rules Ξ_Ū : O(N) → St(p,N) : U ↦ UŪ with Ū ∈ {Ū ∈ St(p,N) : [Ū]_ij ∈ {0,1}, 1 ≤ i ≤ N, 1 ≤ j ≤ p} of p column vectors from φ_S^{-1}(V), Ξ_Ū∘φ_S^{-1} cannot necessarily parameterize St(p,N) without any overlap, as shown below. For simplicity, assume 2p < N. Consider Ū satisfying Ū_up = 0 (Ū := [0 I_p]^T is such a typical instance). Then, we can verify that Ξ_Ū∘φ_S^{-1} is not an injection on Q_{N,p} (see Appendix 5). Note that an idea of using Ξ_Ū∘φ_S^{-1} only with S = I has been considered in [Citation26]. However, for the parametrization of St(p,N), it seems to suggest the special selection Ū = I_{N×p}, which corresponds to Φ_I^{-1}.

By using Theorem 2.3, we deduce Lemma 2.6, which guarantees the existence of a solution to Problem 1.4 for any ϵ>0. Theorem 2.3 will also be used in Lemma 3.5 to ensure the existence of a solution to Problem 1.5.

Lemma 2.6

Let f : R^{N×p} → R be continuous with p < N and S ∈ O(N). Then, it holds that (16) min_{U∈St(p,N)} f(U) = inf_{U∈St(p,N)∖E_{N,p}(S)} f(U) = inf_{V∈Q_{N,p}(S)} f∘Φ_S^{-1}(V).

Proof.

The second equality in (Equation16) is verified from the homeomorphism of Φ_S^{-1}. Let U⋆ ∈ St(p,N) be a global minimizer of f over St(p,N), i.e. f(U⋆) = min f(St(p,N)). From the denseness of St(p,N)∖E_{N,p}(S) in St(p,N) (see Theorem 2.3(b)), there exists a sequence (U_n)_{n=0}^∞ ⊂ St(p,N)∖E_{N,p}(S) satisfying lim_{n→∞} U_n = U⋆. The continuity of f yields lim_{n→∞} f(U_n) = f(U⋆), i.e. inf_{U∈St(p,N)∖E_{N,p}(S)} f(U) = min f(St(p,N)).

2.2. Computational complexities of Φ_S and Φ_S^{-1} with S ∈ O_p(N)

From the expressions in (Equation10)–(Equation12) and (Equation15), both Φ_S and Φ_S^{-1} with general S ∈ O(N) require o(N²p + p³) flops (FLoating-point OPerationS [not 'FLoating point Operations Per Second']), which are dominated by the matrix multiplications S_ri^T U in (Equation12) and S_ri[[V]]_21 in (Equation15) respectively. However, if we employ a special centre point (17) S ∈ O_p(N) := {diag(T, I_{N−p}) : T ∈ O(p)} ⊂ O(N), then the complexities of Φ_S and Φ_S^{-1} can be reduced to o(Np² + p³) flops. Indeed, for T ∈ O(p) and U ∈ St(p,N), we have [diag(T, I_{N−p})]_le^T U = T^T U_up and [diag(T, I_{N−p})]_ri^T U = U_lo. Hence Φ_{diag(T,I_{N−p})}(U) requires Np² + o(p³) flops due to A_{diag(T,I_{N−p})}(U) = 2(I_p + T^T U_up)^{-T} Skew(U_up^T T)(I_p + T^T U_up)^{-1} ∈ R^{p×p}, B_{diag(T,I_{N−p})}(U) = −U_lo(I_p + T^T U_up)^{-1} ∈ R^{(N−p)×p}. Moreover, for V ∈ Q_{N,p}(diag(T, I_{N−p})) and M := I_p + [[V]]_11 + [[V]]_21^T [[V]]_21 ∈ R^{p×p}, it follows from [diag(T, I_{N−p})]_ri[[V]]_21 = [0_p; [[V]]_21] and (Equation15) that Φ_{diag(T,I_{N−p})}^{-1}(V) = [2TM^{-1} − T; −2[[V]]_21 M^{-1}], which requires 2Np² + o(p³) flops.
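The reduced-complexity formulas above can be sketched in MATLAB as follows (function names are ours); only p-by-p linear solves and N×p-by-p products remain.

```matlab
% Sketch of Phi_S and Phi_S^{-1} for S = diag(T, I_{N-p}) with T in O(p);
% helper names are ours, not code from the paper.
function V = glcayley_op(U, T)
[N, p] = size(U);
Uup = U(1:p, :);  Ulo = U(p+1:N, :);
C   = eye(p) + T' * Uup;                            % I_p + T^T U_up
A   = 2 * (C' \ ((Uup' * T - T' * Uup) / 2)) / C;   % A_{diag(T,I)}(U)
B   = -Ulo / C;                                     % B_{diag(T,I)}(U)
V   = [A, -B'; B, zeros(N - p)];
end

function U = glcayley_op_inv(V, T)
p = size(T, 1);  N = size(V, 1);
A = V(1:p, 1:p);  B = V(p+1:N, 1:p);
M = eye(p) + A + B' * B;                 % Schur complement matrix of I+V
U = [2 * (T / M) - T; -2 * B / M];       % [2TM^{-1} - T; -2[[V]]_21 M^{-1}]
end
```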

For a given U ∈ St(p,N), Theorem 2.7 below presents a way to select T ∈ O(p) satisfying U ∈ St(p,N)∖E_{N,p}(diag(T, I_{N−p})), where T is designed with a singular value decomposition of U_up ∈ R^{p×p}, thus requiring at most o(p³) flops.

Theorem 2.7

Parametrization of St(p,N) by ΦS with SOp(N)

Let U = [U_up^T, U_lo^T]^T ∈ St(p,N), and let U_up = Q_1ΣQ_2^T be a singular value decomposition of U_up ∈ R^{p×p}, where Q_1, Q_2 ∈ O(p) and Σ ∈ R^{p×p} is a diagonal matrix with non-negative entries. Define S := diag(T, I_{N−p}) ∈ O_p(N) with T := Q_1Q_2^T ∈ O(p). Then, the following hold:

  1. det(I_p + S_le^T U) ≥ 1 and U ∈ St(p,N)∖E_{N,p}(S).

  2. (18) Φ_S(U) = [0, Q_2(I_p+Σ)^{-1}Q_2^T U_lo^T; −U_lo Q_2(I_p+Σ)^{-1}Q_2^T, 0], where ‖B_S(U)‖_2 = ‖U_lo Q_2(I_p+Σ)^{-1}Q_2^T‖_2 ≤ 1.

Proof.

(a) By S_le^T U = T^T U_up = Q_2ΣQ_2^T, it holds that det(I_p + S_le^T U) = det(Q_2(I_p + Σ)Q_2^T) = det(I_p + Σ) ≥ 1, which implies U ∈ St(p,N)∖E_{N,p}(S) by Definition 2.1.

(b) Substituting S_le^T U = Q_2ΣQ_2^T and S_ri^T U = U_lo into (Equation11) and (Equation12), we obtain (Equation18). From (Equation12), ‖B_S(U)‖_2 is bounded above as (19) ‖B_S(U)‖_2 = ‖S_ri^T U(I_p + S_le^T U)^{-1}‖_2 ≤ ‖S_ri‖_2‖U‖_2‖(I_p + Q_2ΣQ_2^T)^{-1}‖_2 = ‖Q_2(I_p+Σ)^{-1}Q_2^T‖_2 = ‖(I_p+Σ)^{-1}‖_2 ≤ 1.
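Theorem 2.7 suggests the following centre-point selection, sketched here in MATLAB with our own function name (a hedged illustration, not the paper's code):

```matlab
% Sketch of the selection in Theorem 2.7: T = Q1*Q2' from an SVD of U_up
% places U outside the singular-point set E_{N,p}(diag(T, I_{N-p})).
function S = centre_from_svd(U, N)
p = size(U, 2);
[Q1, ~, Q2] = svd(U(1:p, :));        % U_up = Q1 * Sigma * Q2'
T = Q1 * Q2';
S = blkdiag(T, eye(N - p));          % S = diag(T, I_{N-p}) in O_p(N)
end
```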

Remark 2.8

Comparisons to commonly used retractions of St(p,N)

The computational complexity 2Np2+o(p3) flops for ΦS1 with SOp(N) is competitive to that for commonly used retractions, which map a tangent vector to a point in St(p,N) (for the retraction-based strategy, see Appendix 2). Indeed, retractions based on QR decomposition, the polar decomposition [Citation1] and the Cayley transform [Citation22] require respectively 2Np2+o(p3) flops, 3Np2+o(p3) flops and 6Np2+o(p3) flops [Citation24, Table ].

Table 1. Performance of each algorithm applied to Problem 4.1.

2.3. Gradient of function after the Cayley parametrization

For the applications of ΦS (G-L2CT) with SO(N) to Problems 1.4 and 1.5, we present an expression of the gradient of fΦS1 denoted by (fΦS1) (Proposition 2.9) and its useful properties (Proposition 2.10, Remark 2.11 and also Proposition A.11).

Proposition 2.9

Gradient of function after the Cayley parametrization

For a differentiable function f : R^{N×p} → R and S ∈ O(N), the function f_S := f∘Φ_S^{-1} : Q_{N,p}(S) → R is differentiable with (20) (∀V ∈ Q_{N,p}(S)) ∇f_S(V) = 2Skew(W_S^f(V)) = W_S^f(V) − W_S^f(V)^T ∈ Q_{N,p}(S), where (21) W_S^f(V) := [[[W̄_S^f(V)]]_11, [[W̄_S^f(V)]]_12; [[W̄_S^f(V)]]_21, 0] ∈ R^{N×N} and (22) W̄_S^f(V) := (I + V)^{-1} I_{N×p} ∇f(Φ_S^{-1}(V))^T S(I + V)^{-1} (23) = [M^{-1}∇f(U)^T(S_le − S_ri[[V]]_21)M^{-1}, M^{-1}∇f(U)^T((S_le − S_ri[[V]]_21)M^{-1}[[V]]_21^T + S_ri); −[[V]]_21M^{-1}∇f(U)^T(S_le − S_ri[[V]]_21)M^{-1}, −[[V]]_21M^{-1}∇f(U)^T((S_le − S_ri[[V]]_21)M^{-1}[[V]]_21^T + S_ri)] ∈ R^{N×N} in terms of U := Φ_S^{-1}(V) ∈ St(p,N)∖E_{N,p}(S) and M := I_p + [[V]]_11 + [[V]]_21^T [[V]]_21 ∈ R^{p×p}. In particular, by S_le = Φ_S^{-1}(0) in (Equation13), (24) ∇f_S(0) = [∇f(S_le)^T S_le − S_le^T∇f(S_le), ∇f(S_le)^T S_ri; −S_ri^T∇f(S_le), 0].

Proof.

See Appendix 6.
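A direct (non-optimized) MATLAB sketch of the gradient formula (Equation20)–(Equation22) follows; the helper name is ours, and it forms (I+V)^{-1} explicitly instead of using the cheaper block expression (Equation23).

```matlab
% Sketch of grad(f o Phi_S^{-1})(V) via (20)-(22); G is the Euclidean
% gradient of f at U = Phi_S^{-1}(V). Helper name (glcayley_grad) is ours.
function gradV = glcayley_grad(V, S, G, p)
N    = size(S, 1);
J    = (eye(N) + V) \ eye(N);       % (I + V)^{-1}
Wbar = J(:, 1:p) * (G' * S) * J;    % (I+V)^{-1} I_{Nxp} grad f(U)^T S (I+V)^{-1}
W    = Wbar;
W(p+1:N, p+1:N) = 0;                % zero the (2,2) block, as in (21)
gradV = W - W';                     % 2*Skew(W_S^f(V)) in (20)
end
```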

Proposition 2.10

Transformation formula for gradients of function

For S1,S2O(N), suppose that V1QN,p(S1) and V2QN,p(S2) satisfy ΦS11(V1)=ΦS21(V2). Then, for a differentiable function f:RN×pR, the following hold:

  1. X:=[φS11(V1)]riT[φS21(V2)]riO(Np) is guaranteed. Moreover, by using GS1,S2(V1,V2):=(I+V1)1[Ip00X](I+V2)×(fS2(V2)[00[[V2]]21INp]fS2(V2)[0[[V2]]21T0INp])×(I+V2)T[Ip00XT](I+V1)TQN,N, we have (25) fS1(V1)=[[[GS1,S2(V1,V2)]]11[[GS1,S2(V1,V2)]]12[[GS1,S2(V1,V2)]]210]QN,p(S1)=GS1,S2(V1,V2)[000INp]GS1,S2(V1,V2)[000INp].(25)

  2. ‖∇f_{S_1}(V_1)‖_F ≤ 2(1 + ‖V_2‖_2²)‖∇f_{S_2}(V_2)‖_F.

  3. fS1(V1)=0 if and only if fS2(V2)=0.

Proof.

See Appendix 7.

Remark 2.11

  1. (Computational complexity for (fΦS1) with SOp(N) in (Equation17)). Let S:=diag(T,INp)Op(N) with TO(p) and VQN,p(S). From (Equation21) and (Equation23), computation of (fΦS1)(V)(=2Skew(WSf(V))) requires at most 5Np2+o(p3) flops due to WSf(V)=[M1f(U)T[T[[V]]21]M1[[V]]21M1∇f(U)T[T[[V]]21]M1M1∇f(U)T([T[[V]]21]M1[[V]]21T+[0INp])0M1∇f(U)T[T[[V]]21]M1[[V]]21M1∇f(U)T[T[[V]]21]M1], where U:=ΦS1(V)St(p,N)EN,p(S) and M:=Ip+[[V]]11+[[V]]21T[[V]]21Rp×p.

  2. (Relation of gradients after Cayley parametrization). Proposition 2.10 illustrates the relations of two gradients after Cayley parameterization with different centre points. These relations will be used in Lemmas 3.4 and 3.5 to characterize the first-order optimality condition with the proposed Cayley parametrization.

  3. (Useful properties of the gradient after Cayley parametrization). In Appendix 8, we present useful properties (i) Lipschitz continuity; (ii) the boundedness; (iii) the variance bounded; of (fΦS1) for the minimization of fΦS1 over QN,p(S). These properties have been exploited in distributed optimization and stochastic optimization, e.g. [Citation27–32].

3. Optimization over the Stiefel manifold with the Cayley parametrization

3.1. Optimality condition via the Cayley parametrization

We present simple characterizations of (i) local minimizer, and (ii) stationary point, of a real valued function over St(p,N) in terms of ΦS.

Let X be a vector space of matrices. A point X⋆ ∈ Y ⊂ X is said to be a local minimizer of J : X → R over Y ⊂ X if there exists ε > 0 satisfying J(X⋆) ≤ J(X) for all X ∈ B_X(X⋆, ε) ∩ Y. Under a smoothness assumption on J, X⋆ ∈ X is said to be a stationary point of J over the vector space X if ∇J(X⋆) = 0. For a smooth function f : R^{N×p} → R, U⋆ ∈ St(p,N) is said to be a stationary point of f over St(p,N) [Citation22,Citation23] if U⋆ satisfies the following conditions: (26) (I − U⋆U⋆^T)∇f(U⋆) = 0 and U⋆^T∇f(U⋆) − ∇f(U⋆)^TU⋆ = 0. The above conditions ∇J(X⋆) = 0 and (Equation26) are called first-order optimality conditions because they are respectively necessary conditions for X⋆ to be a local minimizer of J over X (see, e.g. [Citation33, Theorem 2.2]), and for U⋆ to be a local minimizer of f over St(p,N) (see [Citation23, Definition 2.1, Remark 2.3] and [Citation22, Lemma 1]).
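For reference, the first-order condition (Equation26) can be checked numerically by measuring the residuals of its two equalities; the sketch below uses our own helper name and is only an illustration.

```matlab
% Sketch: residuals of the first-order optimality condition (26) at U,
% given the Euclidean gradient G = grad f(U). Helper name is ours.
function ok = is_stationary_st(U, G, tol)
r1 = norm((eye(size(U, 1)) - U * U') * G, 'fro');   % (I - U U^T) grad f(U)
r2 = norm(U' * G - G' * U, 'fro');                  % U^T grad f(U) - grad f(U)^T U
ok = max(r1, r2) < tol;
end
```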

In Lemma 3.1 below, we characterize a local minimizer of f over St(p,N) as a local minimizer of fΦS1 with a certain SO(N) over the vector space QN,p(S).

Lemma 3.1

Equivalence of local minimizers in the two senses

Let f : R^{N×p} → R be continuous. Let U ∈ St(p,N) and S ∈ O(N) satisfy U ∈ St(p,N)∖E_{N,p}(S). Then, U is a local minimizer of f over St(p,N) if and only if V := Φ_S(U) ∈ Q_{N,p}(S) is a local minimizer of f∘Φ_S^{-1} over the vector space Q_{N,p}(S).

Proof.

Let U be a local minimizer of f over St(p,N) and ϵ>0 satisfy f(U)f(U) for all UBRN×p(U,ϵ)St(p,N)=:N1(U)St(p,N)EN,p(S). From the homeomorphism of ΦS in Proposition 2.2, N2(V):=ΦS(N1(U))QN,p(S) is a nonempty open subset of QN,p(S) containing V. Then, there exists ϵ^>0 satisfying BQN,p(S)(V,ϵ^)N2(V). Since fΦS1(BQN,p(S)(V,ϵ^))fΦS1(N2(V))=f(N1(U)), we obtain f(U)=fΦS1(V)fΦS1(V) for all VBQN,p(S)(V,ϵ^), implying thus V is a local minimizer of fΦS1 over QN,p(S). In a similar way, we can prove its converse.

Under a special assumption on f in Theorem 3.2 below, yet found especially in many data science scenarios (see Remark 3.3), we can characterize a global minimizer of Problem 1.1 via fΦS1 with any SO(N). In this case, a global minimizer VQN,p(S) of fΦS1 is guaranteed to exist in the unit ball {VQN,p(S)V21}.

Theorem 3.2

Let S ∈ O(N). Assume that f : R^{N×p} → R is continuous and right orthogonal invariant, i.e. f(U) = f(UQ) for U ∈ St(p,N) and Q ∈ O(p). Then, there exists a global minimizer V⋆ ∈ Q_{N,p}(S) of f∘Φ_S^{-1} achieving f∘Φ_S^{-1}(V⋆) = min_{U∈St(p,N)} f(U), ‖[[V⋆]]_21‖_2 ≤ 1 and ‖V⋆‖_2 ≤ 1.

Proof.

Let USt(p,N) be a global minimizer of f over St(p,N), and SleTU=Q1ΣQ2T be a singular value decomposition with Q1,Q2O(p) and nonnegative-valued diagonal matrix ΣRp×p. Then, we obtain U:=UQSt(p,N)EN,p(S) with Q:=Q2Q1TO(p) by |det(Ip+SleTU)|=|det(Ip+Q1ΣQ2TQ2Q1T)|=|det(Ip+Σ)|1. The right orthogonal invariance of f ensures f(U)=f(U)=fΦS1(V) with V:=ΦS(U).

Substituting SleTU=Q1ΣQ1T into (Equation11) and (Equation12), we obtain [[V]]11=AS(U)=0 and [[V]]21=BS(U)=SriTUQ(Ip+Q1ΣQ1T)1=SriTUQ2(Ip+Σ)1Q1T. In a similar manner to (Equation19), the last equality implies [[V]]2121. The last statement is verified by V22=λmax([0[[V]]21T[[V]]210][0[[V]]21T[[V]]210])=λmax([[[V]]21T[[V]]2100[[V]]21[[V]]21T])=λmax([[V]]21T[[V]]21)=[[V]]21221.

Remark 3.3

Right orthogonal invariance

Under the right orthogonal invariance of f, Problem 1.1 arises in, e.g. low-rank matrix completion [Citation34,Citation35], eigenvalue problems [Citation1,Citation22,Citation24,Citation36], and optimal H2 model reduction [Citation3,Citation37]. These applications can be formulated as optimization problems over the Grassmann manifold Gr(p,N), which is the set of all p-dimensional subspace of RN. Practically, Gr(p,N) is represented numerically by {[U]USt(p,N)}, where [U]:={UQSt(p,N)QO(p)} is an equivalence class, because the column space of USt(p,N) equals that of UQSt(p,N) for all QO(p). Since the value of the right orthogonal invariant f depends only on the equivalence class [U], Problem 1.1 of such f can be regarded as an optimization problem over Gr(p,N).

In Lemma 3.4 below, we characterize a stationary point of f over St(p,N) by a stationary point of fΦS1 with a certain SO(N) over the vector space QN,p(S). Moreover, Lemma 3.5 ensures the existence of solutions to Problem 1.5 with any ϵ>0. Therefore, we can approximate a stationary point of f over St(p,N) by solving Problem 1.5 with a sufficiently small ϵ>0.

Lemma 3.4

First-order optimality condition

Let f : R^{N×p} → R be differentiable. Let U ∈ St(p,N) and S ∈ O(N) satisfy U ∈ St(p,N)∖E_{N,p}(S). Then, the first-order optimality condition in (Equation26) can be stated equivalently as (27) ∇f_S(Φ_S(U)) = 0, where f_S := f∘Φ_S^{-1}.

Proof.

Let U⊥ ∈ St(N−p,N) satisfy U^TU⊥ = 0. Then, we have U = Φ_{[U U⊥]}^{-1}(0). For S ∈ O(N) satisfying U ∈ St(p,N)∖E_{N,p}(S) and V := Φ_S(U) ∈ Q_{N,p}(S), i.e. U = Φ_S^{-1}(V), Proposition 2.10(c) asserts that ∇f_S(V) = 0 if and only if ∇f_{[U U⊥]}(0) = 0. To prove the equivalence between (Equation26) and (Equation27), it is sufficient to show the equivalence between the condition in (Equation26) and ∇f_{[U U⊥]}(0) = 0. By (Equation24), we have ∇f_{[U U⊥]}(0) = [∇f(U)^TU − U^T∇f(U), ∇f(U)^TU⊥; −U⊥^T∇f(U), 0], which yields [[∇f_{[U U⊥]}(0)]]_11 = 0 if and only if the second condition in (Equation26) holds true.

In the following, we show the equivalence of U⊥^T∇f(U) = 0 and (I − UU^T)∇f(U) = 0. By noting [U U⊥][U U⊥]^T = UU^T + U⊥U⊥^T = I, the equality U⊥^T∇f(U) = 0 implies 0 = U⊥U⊥^T∇f(U) = (I − UU^T)∇f(U). Conversely, (I − UU^T)∇f(U) = 0 implies 0 = U⊥^T(I − UU^T)∇f(U) = U⊥^T∇f(U).

Lemma 3.5

Let f : R^{N×p} → R be continuously differentiable with p < N and S ∈ O(N). Then, inf_{V∈Q_{N,p}(S)} ‖∇(f∘Φ_S^{-1})(V)‖_F = 0.

Proof.

Let U⋆ ∈ St(p,N) be a global minimizer of f over St(p,N), and let S⋆ ∈ O(N) satisfy U⋆ ∈ St(p,N)∖E_{N,p}(S⋆). Then, U⋆ is a stationary point of f over St(p,N), and we have ‖∇(f∘Φ_{S⋆}^{-1})(V⋆)‖_F = 0 with V⋆ := Φ_{S⋆}(U⋆) ∈ Q_{N,p}(S⋆) from Lemma 3.4.

Theorem 2.3(c) ensures the denseness of Δ(S, S⋆) := (St(p,N)∖E_{N,p}(S)) ∩ (St(p,N)∖E_{N,p}(S⋆)) in St(p,N). Then, we obtain a sequence (U_n)_{n=0}^∞ of Δ(S, S⋆) converging to U⋆. Let (V_n)_{n=0}^∞ and (V_n^⋆)_{n=0}^∞ be the sequences given by V_n := Φ_S(U_n) ∈ Q_{N,p}(S) and V_n^⋆ := Φ_{S⋆}(U_n) ∈ Q_{N,p}(S⋆). The continuity of Φ_{S⋆} yields lim_{n→∞} V_n^⋆ = V⋆, implying the boundedness of (V_n^⋆)_{n=0}^∞. From Φ_S^{-1}(V_n) = U_n = Φ_{S⋆}^{-1}(V_n^⋆) and Proposition 2.10(b), we have 0 ≤ ‖∇(f∘Φ_S^{-1})(V_n)‖_F ≤ 2(1 + ‖V_n^⋆‖_2²)‖∇(f∘Φ_{S⋆}^{-1})(V_n^⋆)‖_F. The right-hand side of this inequality converges to zero from the boundedness of (V_n^⋆)_{n=0}^∞ and ‖∇(f∘Φ_{S⋆}^{-1})(V⋆)‖_F = 0. Therefore, we have lim_{n→∞} ‖∇(f∘Φ_S^{-1})(V_n)‖_F = 0, which completes the proof.

3.2. Basic framework to incorporate optimization techniques designed over a vector space with the Cayley parametrization

We illustrate a general scheme of the Cayley parametrization strategy in Algorithm 1,Footnote5 where U_0 ∈ St(p,N) is an initial estimate for a solution to Problem 1.1 with p < N, S ∈ O(N) is a centre point for the parametrization of a dense subset St(p,N)∖E_{N,p}(S) ⊂ St(p,N) in terms of the vector space Q_{N,p}(S), and a mapping A : Q_{N,p}(S) → Q_{N,p}(S) is a certain update rule for decreasing the value of f∘Φ_S^{-1}. In principle, we can employ any optimization update scheme over a vector space as A, which is a notable advantage of the proposed strategy over the standard strategy (see Remark 3.6). As a simplest example, we will employ, in Section 4, a gradient descent-type update scheme A_GDM : Q_{N,p}(S) → Q_{N,p}(S) : V ↦ V − γ∇(f∘Φ_S^{-1})(V) with a stepsize γ > 0 determined by a certain line-search algorithm (see, e.g. [Citation33]); a hedged code sketch of this scheme is given after the next paragraph.

To parameterize U_0 ∈ St(p,N) by Φ_S^{-1}, S ∈ O(N) must be chosen to satisfy U_0 ∈ St(p,N)∖E_{N,p}(S). An example of such a selection of S for a given U_0 is S := diag(Q_1Q_2^T, I_{N−p}) ∈ O_p(N), obtained from a singular value decomposition [U_0]_up = Q_1ΣQ_2^T ∈ R^{p×p} with Q_1, Q_2 ∈ O(p) and a diagonal matrix Σ ∈ R^{p×p} with non-negative entries (see Theorem 2.7).
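Putting the pieces together, a rough MATLAB sketch of Algorithm 1 with A_GDM and the Armijo backtracking rule of Algorithm 2 is given below; it reuses the illustrative helpers sketched in Section 2 (glcayley, glcayley_inv, glcayley_grad, centre_from_svd), which are our own names and not code from the paper.

```matlab
% Rough sketch of Algorithm 1 + A_GDM with Armijo backtracking (Algorithm 2).
% f and gradf are function handles on R^{N x p}; helper names are ours.
function U = cp_gdm(f, gradf, U0, maxit, gamma0, rho, c)
[N, p] = size(U0);
S = centre_from_svd(U0, N);              % U0 avoids E_{N,p}(S) (Theorem 2.7)
V = glcayley(U0, S);                     % initial point in Q_{N,p}(S)
for n = 1:maxit
    U = glcayley_inv(V, S, p);
    D = glcayley_grad(V, S, gradf(U), p);        % grad of f o Phi_S^{-1} at V
    gamma = gamma0;                              % backtracking (Armijo rule)
    while f(glcayley_inv(V - gamma * D, S, p)) > f(U) - c * gamma * norm(D, 'fro')^2
        gamma = rho * gamma;
    end
    V = V - gamma * D;                           % A_GDM update
end
U = glcayley_inv(V, S, p);
end
```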

Remark 3.6

Comparison to the retraction-based strategy

As reported in [Citation1,Citation3,Citation22,Citation24,Citation25,Citation41–52], Problem 1.1 has been tackled with a retraction R : TSt(p,N) := ⋃_{U∈St(p,N)}({U} × T_USt(p,N)) → St(p,N) : (U,V) ↦ R_U(V) (see, e.g. [Citation1]) by exploiting only a local diffeomorphismFootnote6 of each R_U between a sufficiently small neighbourhood of 0 ∈ T_USt(p,N), in the tangent space T_USt(p,N) to St(p,N) at U ∈ St(p,N), and its image in St(p,N) (see Appendix 2 for its basic idea). At the nth iteration, these retraction-based strategies decrease the time-varying function f∘R_{U_n} at 0 ∈ T_{U_n}St(p,N) over the time-varying vector space T_{U_n}St(p,N), where U_n ∈ St(p,N) is the nth estimate of a solution. Many computational mechanisms for finding a descent direction D_n ∈ T_{U_n}St(p,N) in the tangent space T_{U_n}St(p,N) have been motivated by standard ideas for optimization over a fixed vector space. To achieve fast convergence in optimization over a vector space, many researchers have tried to utilize the past updating directions for estimating a current descent direction, e.g. the conjugate gradient method, quasi-Newton methods and Nesterov's accelerated gradient method [Citation27,Citation28,Citation33,Citation53]. However, in the retraction-based strategy, since the past updating directions (D_k)_{k=0}^{n−1} no longer live in the current tangent space T_{U_n}St(p,N), we cannot directly utilize (D_k)_{k=0}^{n−1} for estimating a new descent direction D_n ∈ T_{U_n}St(p,N). To exploit the past updating directions with a retraction, those directions must be translated into the current tangent space with certain mappings, e.g. a vector transport [Citation1] or the inversion mapping of a retraction [Citation25].

On the other hand, Algorithm 1 decreases the fixed cost function f∘Φ_S^{-1} with a fixed S ∈ O(N) over the fixed vector space Q_{N,p}(S) throughout the process of Algorithm 1 by exploiting the diffeomorphism Φ_S^{-1} between Q_{N,p}(S) and an open dense subset St(p,N)∖E_{N,p}(S) of St(p,N) (see Proposition 2.2 and Theorem 2.3(b)). Since every past updating direction lives in the same vector space Q_{N,p}(S), we can utilize the past updating directions without any additional computation such as a vector transport or the inversion mapping of a retraction. Therefore, we can transplant powerful computational arts, e.g. [Citation27–33], designed for optimization over a vector space into the proposed strategy. For many such algorithms, Proposition A.11 is useful for checking whether the conditions on the cost function required for global convergence hold true or not.

3.3. Singular-point issue in the Cayley parametrization strategy

The numerical performance of Algorithm 1 heavily depends on the tuning of S ∈ O(N) in general. If we choose S such that a minimizer U⋆ ∈ St(p,N) of Problem 1.1 is close to the singular-point set E_{N,p}(S), then a risk of slow convergence of Algorithm 1 arises due to an insensitivity of Φ_S^{-1} to changes around Φ_S(U⋆) in the vector space Q_{N,p}(S). In the case p = N, this risk has been reported in [Citation20,Citation21]. We can see this insensitivity of Φ_S^{-1} via Proposition 3.7 below.

Proposition 3.7

The mobility of ΦS1

Let p, N ∈ N satisfy p < N, S ∈ O(N), V ∈ Q_{N,p}(S), and let E ∈ Q_{N,p}(S) satisfy ‖E‖_F = 1. Then, we have ‖Φ_S^{-1}(V + τE) − Φ_S^{-1}(V)‖_F ≤ τ r(V), where (28) r(V) := 2√(1 + ‖[[V]]_21‖_2²)/(1 + σ_min²([[V]]_21)). We call r : Q_{N,p}(S) → R the mobility of Φ_S^{-1}, which is bounded as (29) r(V) ≥ 2(1 + ‖[[V]]_21‖_2²)^{-1/2}, where the equality holds when σ_min([[V]]_21) = σ_max([[V]]_21) (= ‖[[V]]_21‖_2).

Proof.

See Appendix 9.

To interpret the result in Proposition 3.7, we consider two simple examples. Under the condition σ_min([[V]]_21) = σ_max([[V]]_21) (= ‖[[V]]_21‖_2), we observe from (Equation29) that the mobility r(V) becomes small as ‖[[V]]_21‖_2 increases. On the other hand, because r(V) = 2 is achieved by ‖[[V]]_21‖_2 = 0 from (Equation28), [[V]]_21 around zero does not lead to a small r(V).

These tendencies can be observed numerically in Figure 1, where the plot shows the norm ‖[[V]]_21‖_2 on the horizontal axis versus the values ‖Φ_S^{-1}(V+E) − Φ_S^{-1}(V)‖_F and r(V), with randomly chosen V, E ∈ Q_{N,p}(S) satisfying ‖E‖_F = 1, on the vertical axis for each N ∈ {500, 1000, 2000} and p = 10. From this figure, we observe that the mobility r(V) decreases and Φ_S^{-1} becomes insensitive as ‖[[V]]_21‖_2 increases.


Figure 1. The average values of the change ‖ΦS−1(V+E)−ΦS−1(V)‖F and the mobility r(V) for each ‖[[V]]21‖2 over 10 trials in the case N={500,1000,2000} and p = 10. In each trial, we generate V~,E~∈RN×N of which each entry is uniformly chosen from [−0.5,0.5] except for the (N−p)-by-(N−p) right lower block matrix. Then, with E:=Skew(E~)/‖Skew(E~)‖F∈QN,p satisfying ‖E‖F=1, we evaluate ‖ΦS−1(V+E)−ΦS−1(V)‖F and r(V) at V∈QN,p with [[V]]11=[[Skew(V~)]]11 and [[V]]21=c[[Skew(V~)]]21 by changing c∈[0,5/‖[[Skew(V~)]]21‖2].

This insensitivity of Φ_S^{-1} at points distant from zero causes a risk of slow convergence of Algorithm 1 even if the current estimate V_n ∈ Q_{N,p}(S) is not sufficiently close to a solution V⋆ ∈ Q_{N,p}(S) of Problem 1.4 or Problem 1.5. Since Theorem 2.3(d) implies that ‖Φ_S(U)‖_2 increases as U ∈ St(p,N)∖E_{N,p}(S) approaches E_{N,p}(S), this risk of slow convergence, which we call the singular-point issue, can arise in a case where a global minimizer U⋆ ∈ St(p,N) stays around E_{N,p}(S). In Section 4.2, we will see that the numerical performance of Algorithm 1 employing the gradient descent-type method tends to deteriorate as U⋆ approaches E_{N,p}(S).

To remedy the singular-point issue in Algorithm 1, it is advisable to use S such that Φ_S(U⋆) is close to zero in Q_{N,p}(S). Although, for general f, we cannot determine in advance of the minimization whether Φ_S(U⋆) is close to zero for a given S, Theorem 3.2 guarantees, under the right orthogonal invariance of f, the existence of a global minimizer U⋆ satisfying ‖[[Φ_S(U⋆)]]_21‖_2 ≤ 1 for every S ∈ O(N). In this case, by r(Φ_S(U⋆)) ≥ √2 in (Equation29) and the continuity of r, the mobility r of Φ_S^{-1} can be maintained in a neighbourhood of Φ_S(U⋆), to which a point sequence (V_n)_{n=0}^∞ generated by Algorithm 1 is desired to approach. Therefore, we do not need to be nervous about the influence of the singular-point set around Φ_S(U⋆).

For general f, to remedy the singular-point issue, we reported briefly in [Citation38,Citation39] that this issue can be avoided by a Cayley parametrization-type strategy for Problem 3.8 below, which updates not only V_n ∈ Q_{N,p} but also a preferable centre point S_n ∈ O(N) strategically. Since a full treatment requires considerable space, we will present the detailed discussion on another occasion.

Problem 3.8

For a given continuous function f : R^{N×p} → R, choose ε > 0 arbitrarily. Then, find (V⋆, S⋆) ∈ Q_{N,p} × O(N) such that f∘Φ_{S⋆}^{-1}(V⋆) < min f(St(p,N)) + ε.

3.4. Relation between the Cayley transform-based retraction and ΦS1

The proposed ΦS1 can be regarded as another form of the Cayley transform-based retraction for St(p,N). By using the inversion φ1 in (Equation4), the Cayley transform-based retraction RCay:TSt(p,N)St(p,N):(U,V)RUCay(V) was introduced explicitly in [Citation22,Citation24], where the tangent bundle TSt(p,N)={{U}×TUSt(p,N))USt(p,N)} is defined with the tangent space TUSt(p,N) to St(p,N) at USt(p,N) (see Fact A.1(d)). For USt(p,N), RUCay can be expressed with PU:=IUUT/2RN×N as (30) RUCay:TUSt(p,N)St(p,N):Vφ1(Skew(UVTPU))U.(30) By passing through the linear mapping Ψ[UU]:TUSt(p,N)QN,p([UU]):V12[UTV(UTV)TUTV0]with USt(Np,N) satisfying UTU=0, we have the following relation (31) (VTUSt(p,N))Φ[UU]1Ψ[UU](V)=RUCay(V).(31) This relation can be verified specially with S:=[UU]O(N) by (VQN,p(S))ΦS1(V)=S(IV)(I+V)1IN×p=(ISVS1)(I+SVS1)1SIN×p=(ISVST)(I+SVST)1U=φ1(SVST)Uand (VTUSt(p,N))SΨS(V)ST=12[UU][UTVVTUUTV0][UTUT]=12(UUTVUT+UUTVUTUVTUUT)=12(UUTVUT+(IUUT)VUTUVT(IUUT))(UUT+UUT=I)=12(UVTVUTUVTUUT)=12(UVTVUT12U(VTUUTV)UT)(VTUSt(p,N)UTV+VTU=0)=Skew(UVT12UVTUUT)=Skew(UVT(I12UUT))=Skew(UVTPU).Through the relation in (Equation31), we obtain a diffeomorphic property of RUCay in the following. The linear mapping ΨS is a bijection between TUSt(p,N) and QN,p(S) with its inversion mapping ΨS1:QN,p(S)TUSt(p,N):V2SVIN×p. From ΨS(TUSt(p,N))=QN,p(S), (Equation31) and Proposition 2.2, RUCay is a diffeomorphic between TUSt(p,N) and a subset St(p,N)EN,p(S) of St(p,N). Clearly, the inversion mapping of RUCay is given by RUCay1:St(p,N)EN,p(S)TUSt(p,N):UΨS1ΦS(U).
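As a hedged numerical illustration of the relation (Equation31), the composed map Φ_S^{-1}∘Ψ_S can be evaluated as below; the function name is ours, and glcayley_inv is the illustrative sketch from Section 2.1.

```matlab
% Sketch of (31): evaluating the Cayley transform-based retraction at U
% through Psi_S and Phi_S^{-1} with S = [U, Uperp]; names are ours.
function Unew = cay_retr_via_cp(U, Uperp, D)
% U in St(p,N), Uperp in St(N-p,N) with U'*Uperp = 0, D in T_U St(p,N).
p = size(U, 2);
q = size(Uperp, 2);
S = [U, Uperp];
W = 0.5 * [U' * D, -(Uperp' * D)'; Uperp' * D, zeros(q)];   % Psi_S(D)
Unew = glcayley_inv(W, S, p);                                % Phi_S^{-1}(Psi_S(D))
end
```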

We present an explicit formula for RUCay1. From Definition 2.1, we have (32) (USt(p,N)EN,p(S))RUCay1(U)=2S[AS(U)BS(U)BS(U)0]IN×p=2[UU][AS(U)BS(U)]=2UAS(U)2UBS(U).(32) From (Equation11) and (Equation12), each term in (Equation32) is evaluated as 2UAS(U)=4U(Ip+UTU)1Skew(UTU)(Ip+UTU)1=2U(Ip+UTU)1((Ip+UTU)(Ip+UTU))(Ip+UTU)1=2U(Ip+UTU)12U(Ip+UTU)1,2UBS(U)=2UUTU(Ip+UTU)1=2(IUUT)U(Ip+UTU)1.By substituting these equalities into (Equation32), we have (33) RUCay1(U)=2U(Ip+UTU)12U(Ip+UTU)1+2(IUUT)U(Ip+UTU)1=2U(Ip+UTU)1+2U(Ip+UTU)12U(Ip+UTU)(Ip+UTU)1=2U(Ip+UTU)1+2U(Ip+UTU)12U.(33) Although the expression (Equation33) of RUCay1 has been given by [Citation25,Citation54], our discussion via (Equation31) presents much more comprehensive information about RUCay. In [Citation25,Citation54], it has been reported that a certain restriction of RUCay to a sufficiently small open neighbourhood of 0TUSt(p,N) is invertible with RUCay1. Meanwhile, we clarify that RUCay is invertible on TUSt(p,N) entirely by passing through ΦS1. The following proposition summarizes the above discussion.

Proposition 3.9

For U ∈ St(p,N), let U⊥ ∈ St(N−p,N) satisfy U^TU⊥ = 0. Then, the Cayley transform-based retraction R_U^Cay in (Equation30) [Citation22,Citation24] is a diffeomorphism between T_USt(p,N) and St(p,N)∖E_{N,p}(S), and its inversion mapping R_U^{Cay−1} is given by (Equation33), where S := [U U⊥]. In addition, for p < N, the image St(p,N)∖E_{N,p}(S) of R_U^Cay is an open dense subset of St(p,N) (see Theorem 2.3(b)).

Remark 3.10

Minimization of fRUCay with a fixed U

By using the Cayley transform-based retraction R_U^Cay, the Cayley parametrization strategy in Algorithm 1 can be modified into the minimization of f∘R_U^Cay with a fixed U ∈ St(p,N) over T_USt(p,N). The explicit formula for the gradient of f∘R_U^Cay is given in Appendix 10. Compared with the minimization of f∘R_U^Cay over T_USt(p,N), advantages of the minimization of f∘Φ_S^{-1} with S ∈ O_p(N) over Q_{N,p}(S) are as follows.

  1. The complexity of 2Np² + o(p³) flops for Φ_S^{-1} with S ∈ O_p(N) is lower than the 6Np² + o(p³) flops of R_U^Cay (see Remark 2.8). In a case where we employ the gradient descent-type method for the minimization of f∘Φ_S^{-1} and f∘R_U^Cay, this difference in the constant factor affects the run time in practice because Φ_S^{-1} and R_U^Cay are used to estimate a stepsize many times within a line-search algorithm, e.g. the backtracking algorithm (Algorithm 2), in each iteration (see, e.g. [Citation33]).

  2. R_U^Cay has been implemented with the aid of the Sherman-Morrison-Woodbury formula (see Fact A.7) to reduce the complexity of the matrix inversion, which can cause deterioration of the orthogonal feasibility due to the numerical instability of that formula [Citation22]. On the other hand, Φ_S^{-1} does not use the formula, and thus is numerically more stable than R_U^Cay. This will be demonstrated numerically in Section 4. Indeed, for V ∈ Q_{N,p}(S), the condition number κ(M) := ‖M‖_2‖M^{-1}‖_2 of M := I_p + [[V]]_11 + [[V]]_21^T[[V]]_21 in (Equation15) is bounded above byFootnote7 1 + ‖[[V]]_11‖_2 + ‖[[V]]_21‖_2², implying that M hardly becomes ill-conditioned as long as ‖V‖_2 is not very large (which is the usual case, e.g. in applications of G-L2CT to optimization of right orthogonal invariant functions [see Theorem 3.2]).

4. Numerical experiments

We illustrate the performance of the proposed CP strategy in Algorithm 1 by numerical experiments. To demonstrate the effectiveness of the proposed formulation in Problem 1.4 in a simple situation, we implemented Algorithm 1 with a gradient descent-type update scheme A_GDM : Q_{N,p}(S) → Q_{N,p}(S) : V ↦ V − γ∇f_S(V) in MATLAB, where f_S := f∘Φ_S^{-1}. In A_GDM, for a given V ∈ Q_{N,p}(S), we use a stepsize γ > 0 satisfying the so-called Armijo rule, generated by the backtracking algorithm (see, e.g. [Citation33]) with predetermined γ_initial > 0 and ρ, c ∈ (0,1) (see Algorithm 2). The Armijo rule has been utilized to design a stepsize that decreases the function value sufficiently in numerical optimization. All the experiments were performed on a MacBook Pro (13-inch, 2017) with an Intel Core i5-7360U and 16 GB of RAM.

4.1. Comparison to the retraction-based strategy

We compared Algorithm 1 + A_GDM (abbreviated as GDM+CP) and three retraction-based strategies [Citation1] with the steepest descent solver implemented in Manopt [Citation55] in the scenario of the eigenbasis extraction problem below. Since the Cayley transform-based retraction R^Cay in (Equation30) can also be utilized for a parametrization of a subset of St(p,N) (see Section 3.4 and Proposition 3.9), to see differences in performance between Φ_S^{-1} and R_U^Cay, we also compared the proposed GDM+CP with its modified version in which Φ_S^{-1} is replaced by R_U^Cay (abbreviated as GDM+CP-retraction), i.e. Algorithm 3 + Â_GDM : V ↦ V − γ∇(f∘R_U^Cay)(V) for the minimization of f∘R_U^Cay with a fixed U ∈ St(p,N) over T_USt(p,N).

Problem 4.1

Eigenbasis extraction problem (e.g. [Citation1,Citation22,Citation24])

For a given symmetric matrix A ∈ R^{N×N}, (34) find U⋆ ∈ argmin_{U∈St(p,N)} f(U) (:= −Tr(U^TAU)). Any solution U⋆ of Problem 4.1 is an orthonormal eigenbasis associated with the p largest eigenvalues of A. In our experiment, we used A := Ã^TÃ ∈ R^{N×N} with a randomly chosen Ã ∈ R^{N×N} of which each entry is sampled from the standard normal distribution N(0,1). Note that f is right orthogonal invariant, and thus we can exploit Theorem 3.2 for GDM+CP.
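For instance, under the above setting, Problem 4.1 can be fed to the cp_gdm sketch from Section 3.2 as follows (an illustrative script with our own helper names, not the exact experimental code).

```matlab
% Hedged example: Problem 4.1 with f(U) = -Tr(U'AU) and Euclidean gradient
% -2AU, solved with the cp_gdm sketch above.
N = 1000;  p = 10;
Atil = randn(N);  A = Atil' * Atil;        % random symmetric A = Atil'*Atil
f     = @(U) -trace(U' * A * U);
gradf = @(U) -2 * A * U;
U0 = orth(rand(N, p));                     % initial point on St(p,N)
U  = cp_gdm(f, gradf, U0, 5000, 1e-3, 0.5, 2^(-13));
fprintf('feasibility: %.2e\n', norm(eye(p) - U' * U, 'fro'));
```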

For the retraction-based strategies, we employed three retractions: (i) the Cayley transform-based retraction (abbreviated as GDM+Cayley) [Citation22]; (ii) the QR decomposition-based retraction (GDM+QR) [Citation1]; (iii) the polar decomposition-based retraction (GDM+polar) [Citation1]. In the steepest descent solver in Manopt, we calculated a stepsize for the current estimate U ∈ St(p,N) with Algorithm 2 after replacing the criterion f_S(V − γ∇f_S(V)) > f_S(V) − cγ‖∇f_S(V)‖_F² by f∘R_U(−γ grad f(U)) > f(U) − cγ‖grad f(U)‖_F² (see, e.g. [Citation3, Algorithm 3.1]), where grad f(U) = P_{T_USt(p,N)}(∇f(U)) ∈ T_USt(p,N) is the Riemannian gradient of f at U ∈ St(p,N) (for the projection mapping P_{T_USt(p,N)}, see Fact A.1(d)).

For an initial point U_0 ∈ St(p,N), we used the centre point S := diag(Q_1Q_2^T, I_{N−p}) ∈ O_p(N) for GDM+CP, obtained from a singular value decomposition [U_0]_up = Q_1ΣQ_2^T with Q_1, Q_2 ∈ O(p) and a nonnegative-valued diagonal matrix Σ ∈ R^{p×p} (see Theorem 2.7). For GDM+CP-retraction, we used the fixed point U := U_0 for the minimization of f∘R_U^Cay. We note that the choice U := U_0 is reasonable because the procedure of Algorithm 3(U_0, U_0, Â_GDM), which tries to decrease f∘R_{U_0}^Cay from the initial point R_{U_0}^{Cay−1}(U_0) = 0 ∈ T_{U_0}St(p,N), is the same as the procedure of GDM+Cayley in the first iteration. The explicit formula for the gradient of f∘R_U^Cay is given in Appendix 10.

For all five algorithms, we used the default parameters of Manopt, ρ = 0.5 and c = 2^{-13}, for Algorithm 2. We employed several initial stepsizes γ_initial ∈ {0.1, 0.01, 0.001}. We generated an initial point U_0 ∈ St(p,N) by using 'orth(rand(N,p))' in MATLAB.

For each algorithm, we stopped the update at the nth iteration when it achieved the following conditions (used in [Citation25]), with D_n := ∇f_S(V_n), ∇(f∘R_{U_0}^Cay)(V_n) or grad f(U_n): (35) n ≥ 5000 or ‖D_n‖_F/‖D_0‖_F ≤ 10^{-10} or |f(U_n) − f(U_{n−1})|/|f(U_n)| ≤ 10^{-20}. Table 1 illustrates average results over 10 trials of each algorithm employing the initial stepsize γ_initial ∈ {0.1, 0.01, 0.001} with the shortest CPU time to reach the stopping criteria in the scenario of Problem 4.1 with (N,p) ∈ {1000, 2000} × {10, 50}. In the table, 'fval' means the value f(U) at the output U ∈ St(p,N), 'fval-optimal' means f(U) − f(U⋆) with the global minimizer U⋆ ∈ St(p,N) obtained by the eigenvalue decomposition of A, 'feasi' means the feasibility ‖I_p − U^TU‖_F, 'nrmg' means the norm ‖∇f_S(Φ_S(U))‖_F, ‖∇(f∘R_{U_0}^Cay)(R_{U_0}^{Cay−1}(U))‖_F or ‖grad f(U)‖_F, 'itr' means the number of iterations, and 'time' means the CPU time (s). Figure 2 shows the convergence history of the algorithms for each problem size. The plots show CPU time on the horizontal axis versus the value f(U) − f(U⋆) on the vertical axis.


Figure 2. Convergence histories of each algorithm applied to Problem 4.1 regarding the value f(U)−f(U⋆) at CPU time for each problem size. Markers are put at every 250 iterations.

We observe that the proposed GDM+CP reaches the stopping criteria with the shortest CPU time among all five algorithms for every problem size. Possible reasons for the superiority of the proposed Cayley parametrization strategy to the retraction-based strategy are as follows.

  1. The Cayley parametrization strategy exploits the diffeomorphic property of ΦS1 between a vector space and an open dense subset of St(p,N) while the retraction-based strategy exploits only a local diffeomorphic property around 0 of retractions (see Remark 3.6).

  2. For Problem 4.1, a global minimizer V⋆ ∈ Q_{N,p}(S) of f∘Φ_S^{-1} existsFootnote8 within the unit ball {V ∈ Q_{N,p}(S) : ‖V‖_2 ≤ 1} due to the right orthogonal invariance of f in (Equation34) and Theorem 3.2. In comparison, the existence of a global minimizer, say V⋆ ∈ T_{U_n}St(p,N), of f∘R_{U_n} over T_{U_n}St(p,N) is not guaranteed for a general retraction R. Even if such a V⋆ exists, it is not guaranteed that R_{U_n}(V⋆) ∈ St(p,N) is a global minimizer of f over St(p,N) because R_{U_n}(T_{U_n}St(p,N)) is not necessarily dense in St(p,N).

Additionally, GDM+CP can keep the feasibility at the same level as GDM+QR and GDM+polar, and better than GDM+Cayley. These observations imply that the proposed CP strategy outperforms the retraction-based strategy. Moreover, it is expected that the proposed CP strategy achieves fast convergence to a solution for Problem 1.1 when we plug more powerful computational arts designed for optimization over a vector space into the CP strategy (see also Remark 3.6).

As shown in Propositions 2.2 and 3.9, both Φ_S^{-1} and R_{U_0}^Cay parameterize open dense subsets of St(p,N). However, we observe that (i) the proposed GDM+CP for the minimization of f∘Φ_S^{-1} converges faster than GDM+CP-retraction for the minimization of f∘R_{U_0}^Cay; (ii) the orthogonal feasibility of GDM+CP-retraction deteriorates compared with GDM+CP. We believe that these performance differences are caused respectively by (i) the computational complexity of Φ_S^{-1} being lower than that of R_{U_0}^Cay, and (ii) the calculations of R_{U_0}^Cay and ∇(f∘R_{U_0}^Cay) requiring the Sherman-Morrison-Woodbury formula for matrix inversions in order to achieve comparable computational complexities, a formula known to suffer from numerical instability [Citation22] (see Remark 3.10).

Moreover, although GDM+CP-retraction reaches the stopping criteria without achieving the same level of final cost value as the others,Footnote9 GDM+CP-retraction performs the same as or better than GDM+Cayley in view of the convergence history in Figure 2 at every time. This indicates the efficacy of parametrizing St(p,N) by a vector space in the reformulation of Problem 1.1, because GDM+CP-retraction and GDM+Cayley use the same Cayley transform-based retraction.

Finally, we remark that if γ_initial is set too large, the numerical performance of the proposed GDM+CP can deteriorate because a generated sequence (V_n)_{n=0}^∞ ⊂ Q_{N,p}(S) can move away from 0 ∈ Q_{N,p}(S) quickly, which induces the insensitivity of Φ_S^{-1} (see Section 3.3). This tendency can be observed from Figure 3, which illustrates average convergence histories over 10 trials of GDM+CP with each stepsize γ_initial ∈ {0.1, 0.01, 0.001, 0.0001} in the scenario of Problem 4.1. Figure 3 shows that GDM+CP with γ_initial = 0.001 has the best performance among the four settings. This observation indicates that we need not set γ_initial large for GDM+CP. Not surprisingly, we also see that too small a γ_initial causes slow convergence of GDM+CP because each iteration moves only a little along −∇f_S(V_n).


Figure 3. Convergence histories of GDM+CP with each γinitial applied to Problem 4.1 (N = 1000, p = 10) regarding the value f(U)−f(U⋆) at CPU time for each problem size. Markers are put at every 250 iterations.

4.2. Singular-point issue

In this subsection, we tested how much singular points influence the performance of the proposed CP strategy. As mentioned in Section 3.3, a risk of slow convergence of Algorithm 1 can arise in a case where a global minimizer U⋆ ∈ St(p,N) of Problem 1.1 is close to the singular-point set E_{N,p}(S). To see such an influence, we compared CP strategies with several centre points S on a toy instance of Problem 1.1: the minimization of f(U) := (1/2)‖U − U⋆‖_F² with a given U⋆ ∈ St(p,N). Clearly, its solution is U⋆.

In this experiment, we used centre points S(θ) := diag(R(θ), I_{N−2}) ∈ O(N) (θ = π/1000, π/4, π/2, π), the global minimizer U⋆ := [S(π)]_le and an initial point U_0 := [S(π/4)]_le, where R(θ) := [cos θ, −sin θ; sin θ, cos θ] ∈ SO(2) is a rotation matrix. From [S(θ)]_le^T U⋆ = diag(R(θ)^T R(π), I_{p−2}), we have det(I_p + [S(θ)]_le^T U⋆) = 2^{p−1}(1 − cos θ). Therefore, E_{N,p}(S(θ)) = {U ∈ St(p,N) : det(I_p + [S(θ)]_le^T U) = 0} approaches U⋆ as θ → 0, and E_{N,p}(S(π)) is farthest from U⋆.
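A hedged MATLAB sketch of this setup follows (variable names are ours; we read the centre points as diag(R(θ), I_{N−2}) so that the dimensions are consistent).

```matlab
% Sketch of the toy problem in Section 4.2: f(U) = 0.5*||U - Ustar||_F^2
% with centre points S(theta) = diag(R(theta), I_{N-2}); names are ours.
N = 1000;  p = 10;
R     = @(t) [cos(t), -sin(t); sin(t), cos(t)];      % rotation in SO(2)
Smat  = @(t) blkdiag(R(t), eye(N - 2));              % centre point S(theta)
Spi   = Smat(pi);     Ustar = Spi(:, 1:p);           % global minimizer [S(pi)]_le
S0    = Smat(pi/4);   U0    = S0(:, 1:p);            % initial point [S(pi/4)]_le
f     = @(U) 0.5 * norm(U - Ustar, 'fro')^2;
gradf = @(U) U - Ustar;
% Running GDM+CP with the centre point S(theta): theta close to 0 places
% Ustar near the singular-point set E_{N,p}(S(theta)) and slows convergence.
```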

We used the stopping criteria (Equation35), and the parameters ρ = 0.5, c = 2^{-13} and γ_initial = 0.1 for Algorithm 2 to determine a stepsize γ > 0.

Table 2 illustrates average results over 10 trials of each algorithm with N = 1000 and p = 10 in this scenario. Figure 4 shows the convergence history of the algorithms. The plot shows CPU time on the horizontal axis versus the value f(U) on the vertical axis.


Figure 4. Convergence histories of each algorithm applied to Problem 1.1 with f(U):=12‖U−U⋆‖F2, and N = 1000, p = 10 regarding the value f(U) at CPU time. Markers are put at every 250 iterations.

Table 2. Performance of each algorithm applied to Problem 1.1 with f(U) := (1/2)‖U − U⋆‖_F².

From Figure 4, we observe that GDM+CP with S(π) is the fastest among all algorithms. On the other hand, U_n generated by GDM+CP with S(π/1000) does not approach the global minimizer U⋆. This implies that the convergence speed of GDM+CP tends to become slower as θ → 0, or equivalently as U⋆ approaches the singular-point set.

These observations show that the performance of the proposed Algorithm 1 depends heavily on the tuning of S, as mentioned in Section 3.3. Since we cannot tell whether a solution U⋆ is distant from E_{N,p}(S) or not in advance of running the algorithms, it is desirable to circumvent the influence of this singular-point issue. In [Citation38,Citation39], we presented preliminary reports on a CP strategy with an adaptive centre-point updating scheme that avoids the singular-point issue by considering Problem 3.8 instead of Problem 1.4.

5. Conclusion

We presented a generalization of the Cayley transform for the Stiefel manifold that parameterizes a dense subset of the Stiefel manifold in terms of a single vector space. The proposed Cayley transform is a diffeomorphism between a dense subset of the Stiefel manifold and a vector space. Thanks to this diffeomorphic property, we proposed a new reformulation of the optimization problem over the Stiefel manifold into which optimization techniques designed over a vector space can be transplanted. Numerical experiments have shown that the proposed algorithm outperforms the standard algorithms designed with a retraction on the Stiefel manifold in a simple setting.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

This work was supported in part by JSPS Grants-in-Aid [19H04134], by JSPS Grants-in-Aid [21J21353] and by JST SICORP [grant number JPMJSC20C6].

Notes

1 φ1 is well-defined over QN,N because all eigenvalues of VQN,N are pure imaginary. For the second expression in (Equation4), see the beginning of Appendix 3.

2 The closure of SO(N)EN,N is equal to SO(N). For every USO(N), we can approximate it by some sequence (Un)n=1 of SO(N)EN,N with any accuracy, i.e. limnUn=U.

3 The domain of φS with SSO(N) is a subset O(N)EN,N(S)=SO(N)EN,N(S) of SO(N).

4 As in (Equation9), QN,p(S) is the common set QN,p for every SO(N). However, we distinguish QN,p(S) for each SO(N) as a parametrization of the particular subset St(p,N)EN,p(S) of St(p,N) (see also Remark 1.3(b)).

5 Algorithm 1 can serve as a central building block in our further advanced Cayley parametrization strategies, reported partially in [Citation38–40].

6 The local diffeomorphism of RU around 0TUSt(p,N) can be verified with the inverse function theorem and the condition (ii) in Definition B.1.

7 Let Ip+[[V]]21T[[V]]21=Q(Ip+Σ)QT be the eigenvalue decomposition with QO(N) and a nonnegative-valued diagonal matrix ΣRp×p. From (I2) in Appendix 9, we have M12(Ip+Σ)1F=(1+σmin2([[V]]21))11. Thus, we have κ(M)M21+[[V]]112+[[V]]2122.

8 From the relation minUSt(p,N)f(U)=infVQN,p(S)fΦS1(V) in Lemma 2.6, ΦS1(V)St(p,N) is also a global minimizer of f over St(p,N).

9 We note that this early stopping of GDM+CP-retraction can be caused by the instability [Citation22] of the Sherman-Morrison-Woodbury formula used in RU0Cay and (fRU0Cay).

10 The subspace W1:={UΩRN×pΩT=ΩRp×p}RN×p is an orthogonal complement to the subspace W2:={UKRN×pKR(Np)×p}RN×p with the inner product X,Y=Tr(XTY)(X,YRN×p). The tangent space TUSt(p,N) can be decomposed as W1W2 with the direct sum ⊕. In view of the orthogonal decomposition, the first term and the second term in the right-hand side of (EquationA1) can be regarded respectively as the orthogonal projection of X onto W1 and W2.

11 The exponential mapping ExpU:TUSt(p,N)St(p,N) at USt(p,N) is defined as a mapping that assigns a given direction DTUSt(p,N) to a point on the geodesic of St(p,N) with the initial velocity D. The exponential mapping is also a special instance of retractions of St(p,N). However, due to its high computational complexity, computationally simpler retractions have been used extensively for Problem 1.1 [Citation1].

References

  • Absil PA, Mahony R, Sepulchre R. Optimization algorithms on matrix manifolds. Princeton (NJ): Princeton University Press; 2008.
  • Manton JH. Geometry, manifolds, and nonconvex optimization: how geometry can help optimization. IEEE Signal Process Mag. 2020;37(5):109–119.
  • Sato H. Riemannian optimization and its applications. Switzerland: Springer International Publishing; 2021.
  • Pietersz R, Groenen PJF. Rank reduction of correlation matrices by majorization. Quant Finance. 2004;4(6):649–662.
  • Grubišić I, Pietersz R. Efficient rank reduction of correlation matrices. Linear Algebra Appl. 2007;422(2–3):629–653.
  • Zhu X. A feasible filter method for the nearest low-rank correlation matrix problem. Numer Algorithms. 2015;69(4):763–784.
  • Bai Z, Sleijpen G, van der Vorst H. Nonlinear eigenvalue problems. In: Bai Z, Demmel J, Dongarra J, Ruhe A, van der Vorst H, editors, Templates for the solution of algebraic eigenvalue problems. Philadelphia (PA): SIAM; 2000. p. 281–314.
  • Yang C, Meza JC, Wang LW. A constrained optimization algorithm for total energy minimization in electronic structure calculations. J Comput Phys. 2006;217(2):709–721.
  • Zhao Z, Bai Z, Jin X. A Riemannian Newton algorithm for nonlinear eigenvalue problems. Comput Optim Appl. 2015;36(2):752–774.
  • Zou H, Hastie T, Tibshirani R. Sparse principal component analysis. J Comput Graph Stat. 2006;15(2):265–286.
  • Journée M, Nesterov Y, Richtárik P, et al. Generalized power method for sparse principal component analysis. J Mach Learn Res. 2010;11(15):517–553.
  • Lu Z, Zhang Y. An augmented Lagrangian approach for sparse principal component analysis. Math Program. 2012;135(1–2):149–193.
  • Boufounos PT, Baraniuk RG. 1-bit compressive sensing. In: Annual Conference on Information Sciences and Systems. IEEE; 2008. p. 16–21.
  • Laska JN, Wen Z, Yin W, et al. Trust, but verify: fast and accurate signal recovery from 1-bit compressive measurements. IEEE Trans Signal Process. 2011;59(11):5289–5301.
  • Joho M, Mathis H. Joint diagonalization of correlation matrices by using gradient methods with application to blind signal separation. In: Sensor Array and Multichannel Signal Processing Workshop Proceedings. IEEE; 2002. p. 273–277.
  • Theis FJ, Cason TP, Absil PA. Soft dimension reduction for ICA by joint diagonalization on the Stiefel manifold. In: International Symposium on Independent Component Analysis and Blind Signal Separation. Springer; 2009. p. 354–361.
  • Sato H. Riemannian Newton-type methods for joint diagonalization on the Stiefel manifold with application to independent component analysis. Optimization. 2017;66(12):2211–2231.
  • Helfrich K, Willmott D, Ye Q. Orthogonal recurrent neural networks with scaled Cayley transform. In: International Conference on Machine Learning. PMLR; Vol. 80; 2018. p. 1969–1978.
  • Bansal N, Chen X, Wang Z. Can we gain more from orthogonality regularizations in training deep networks? In: Advances in neural information processing systems. Curran Associates Inc.; 2018. p. 4266–4276.
  • Yamada I, Ezaki T. An orthogonal matrix optimization by dual Cayley parametrization technique. In: 4th International Symposium on Independent Component Analysis and Blind, Signal Separation; 2003. p. 35–40.
  • Hori G, Tanaka T. Pivoting in Cayley transform-based optimization on orthogonal groups. In: Proceedings of the Second APSIPA Annual Summit and Conference; 2010. p. 181–184.
  • Wen Z, Yin W. A feasible method for optimization with orthogonality constraints. Math Program. 2013;142(1–2):397–434.
  • Gao B, Liu X, Chen X, et al. A new first-order algorithmic framework for optimization problems with orthogonality constraints. SIAM J Optim. 2018;28(1):302–332.
  • Zhu X. A Riemannian conjugate gradient method for optimization on the Stiefel manifold. Comput Optim Appl. 2017;67(1):73–110.
  • Zhu X, Sato H. Riemannian conjugate gradient methods with inverse retraction. Comput Optim Appl. 2020;77(3):779–810.
  • Fraikin C, Hüper K, Dooren PV. Optimization over the Stiefel manifold. In: Proceedings in Applied Mathematics and Mechanics. Wiley; Vol. 7; 2007.
  • Reddi SJ, Hefny A, Sra S, et al. Stochastic variance reduction for nonconvex optimization. In: International Conference on Machine Learning. PMLR; Vol. 48; 2016. p. 314–323.
  • Ghadimi S, Lan G. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Math Program. 2016;156:59–99.
  • Allen-Zhu Z. Natasha 2: faster non-convex optimization than SGD. In: Advances in neural information processing systems. Curran Associates Inc.; 2018. p. 2680–2691.
  • Ward R, Wu X, Bottou L. AdaGrad stepsizes: sharp convergence over nonconvex landscapes. In: International Conference on Machine Learning. PMLR; Vol. 97; 2019. p. 6677–6686.
  • Chen X, Liu S, Sun R, et al. On the convergence of a class of Adam-type algorithms for non-convex optimization. arXiv preprint arXiv:1808.02941; 2018.
  • Tatarenko T, Touri B. Non-convex distributed optimization. IEEE Trans Automat Control. 2017;62(8):3744–3757.
  • Nocedal J, Wright S. Numerical optimization. 2nd ed. New York (NY): Springer; 2006.
  • Boumal N, Absil PA. Low-rank matrix completion via preconditioned optimization on the Grassmann manifold. Linear Algebra Appl. 2015;475:200–239.
  • Pitaval RA, Dai W, Tirkkonen O. Convergence of gradient descent for low-rank matrix approximation. IEEE Trans Inf Theory. 2015;61(8):4451–4457.
  • Sato H, Iwai T. Optimization algorithms on the Grassmann manifold with application to matrix eigenvalue problems. Jpn J Ind Appl Math. 2014;31(2):355–400.
  • Xu Y, Zeng T. Fast optimal H2 model reduction algorithms based on Grassmann manifold optimization. Int J Numer Anal Model. 2013;10(4):972–991.
  • Kume K, Yamada I. Adaptive localized Cayley parametrization technique for smooth optimization over the Stiefel manifold. In: European Signal Processing Conference. EURASIP; 2019. p. 500–504.
  • Kume K, Yamada I. A Nesterov-type acceleration with adaptive localized Cayley parametrization for optimization over the Stiefel manifold. In: European Signal Processing Conference. EURASIP; 2020. p. 2105–2109.
  • Kume K, Yamada I. A global Cayley parametrization of Stiefel manifold for direct utilization of optimization mechanisms over vector spaces. In: International Conference on Acoustics, Speech, and Signal Processing. IEEE; 2021. p. 5554–5558.
  • Edelman A, Arias TA, Smith ST. The geometry of algorithms with orthogonality constraints. SIAM J Matrix Anal Appl. 1998;20(2):303–353.
  • Nikpour M, Manton JH, Hori G. Algorithms on the Stiefel manifold for joint diagonalisation. In: International Conference on Acoustics, Speech, and Signal Processing. IEEE; Vol. 2. 2002. p. 1481–1484.
  • Nishimori Y, Akaho S. Learning algorithms utilizing quasi-geodesic flows on the Stiefel manifold. Neurocomputing. 2005;67:106–135.
  • Absil PA, Baker CG, Gallivan KA. Trust-region methods on Riemannian manifolds. Found Comut Math. 2007;7(3):303–330.
  • Abrudan TE, Eriksson J, Koivunen V. Steepest descent algorithms for optimization under unitary matrix constraint. IEEE Trans Signal Process. 2008;56(3):1134–1147.
  • Absil PA, Malick J. Projection-like retractions on matrix manifolds. SIAM J Optim. 2012;22(1):135–158.
  • Ring W, Wirth B. Optimization methods on Riemannian manifolds and their application to shape space. SIAM J Optim. 2012;22(2):596–627.
  • Huang W, Gallivan KA, Absil PA. A Broyden class of quasi-Newton methods for Riemannian optimization. SIAM J Optim. 2015;25(3):1660–1685.
  • Jiang B, Dai YH. A framework of constraint preserving update schemes for optimization on Stiefel manifold. Math Program. 2015;153(2):535–575.
  • Manton JH. A framework for generalising the Newton method and other iterative methods from Euclidean space to manifolds. Numer Math. 2015;129:91–125.
  • Sato H, Iwai T. A new, globally convergent Riemannian conjugate gradient method. Optimization. 2015;64(4):1011–1031.
  • Kasai H, Mishra B. Inexact trust-region algorithms on Riemannian manifolds. In: Advances in neural information processing systems. Curran Associates Inc.; 2018. p. 4254–4265.
  • Nesterov Y. A method for solving the convex programming problem with convergence rate O(1/k2). Dokl Akad Nauk SSSR. 1983;269:543–547.
  • Siegel JW. Accelerated optimization with orthogonality constraints. J Comput Math. 2020;39(2):207–226.
  • Boumal N, Mishra B, Absil PA, et al. Manopt, a Matlab toolbox for optimization on manifolds. J Mach Learn Res. 2014;15:1455–1459.
  • Satake I. Linear algebra. New York (NY): Marcel Dekker Inc.; 1975.
  • Van den Bos A. Parameter estimation for scientists and engineers. New York (NY): Wiley; 2007.
  • Horn RA, Johnson CR. Matrix analysis. 2nd ed. Cambridge: Cambridge University Press; 2012.

Appendix 1.

Basic facts on the Stiefel manifold, the Cayley transform and tools for matrix analysis

In this section, we summarize basic properties of St(p,N) and of the Cayley transform, together with elementary tools for matrix analysis.

Fact A.1

Stiefel manifold [Citation1,Citation41]

  1. The Stiefel manifold St(p,N) is an embedded submanifold of RN×p. The topology O(St(p,N)) of St(p,N), i.e. the family of all open subsets, consists of all unions of sets in {St(p,N)∩BRN×p(X,r)∣X∈RN×p, r>0}.

  2. The dimension of St(p,N) is Np−p(p+1)/2, i.e. every point U∈St(p,N) has an open neighbourhood N(U)⊂St(p,N) such that there exists a homeomorphism ϕ:N(U)→RNp−p(p+1)/2 between N(U) and some open subset of RNp−p(p+1)/2.

  3. The Stiefel manifold St(p,N) is compact. Moreover, St(p,N) with p<N is connected, while O(N):=St(N,N) is disconnected: it is the disjoint union of the connected subsets SO(N):={U∈O(N)∣det(U)=1} and O(N)∖SO(N).

  4. The tangent space to St(p,N) at U∈St(p,N) is expressed as TUSt(p,N)={V∈RN×p∣U^TV+V^TU=0}={UΩ+U⊥K∈RN×p∣Ω^T=−Ω∈Rp×p, K∈R(N−p)×p} in terms of an arbitrarily chosen U⊥∈St(N−p,N) satisfying U^TU⊥=0∈Rp×(N−p) (see, e.g. [Citation1, Example 3.5.2]). The projection mapping PTUSt(p,N):RN×p→TUSt(p,N) onto TUSt(p,N) is given by (see footnote 10 and, e.g. [Citation1, Example 3.6.2]) (A1) PTUSt(p,N)(X):=argminZ∈TUSt(p,N)‖X−Z‖F=(1/2)U(U^TX−X^TU)+(I−UU^T)X for X∈RN×p.(A1)
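The projection formula (EquationA1) is easy to check numerically. The following sketch (an illustrative NumPy script of ours, not taken from the paper's experiments; all function names are ours) projects a random matrix onto TUSt(p,N) and verifies the tangency condition U^TV+V^TU=0.

import numpy as np

def stiefel_point(N, p, rng):
    # A random point of St(p, N): the Q-factor of a Gaussian N x p matrix.
    Q, _ = np.linalg.qr(rng.standard_normal((N, p)))
    return Q

def proj_tangent(U, X):
    # Orthogonal projection of X onto T_U St(p, N), cf. (A1):
    # P(X) = U (U^T X - X^T U) / 2 + (I - U U^T) X.
    UtX = U.T @ X
    return U @ (UtX - UtX.T) / 2 + X - U @ UtX

rng = np.random.default_rng(0)
N, p = 8, 3
U = stiefel_point(N, p, rng)
V = proj_tangent(U, rng.standard_normal((N, p)))
# V is tangent at U iff U^T V + V^T U = 0 (Fact A.1, item 4).
print(np.linalg.norm(U.T @ V + V.T @ U))   # ~1e-16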

Fact A.2

Commutativity of the Cayley transform pair, e.g. [Citation56]

The Cayley transform φ in (Equation3) and its inversion φ−1 in (Equation4) can be expressed as (A2) φ(U)=(I−U)(I+U)−1=(I+U)−1(I−U) for U∈O(N)∖EN,N, and φ−1(V)=(I−V)(I+V)−1=(I+V)−1(I−V) for V∈QN,N.(A2)
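As a quick numerical illustration of Fact A.2 (a NumPy sketch of ours, not part of the paper; the names cayley and cayley_inv are ours), one can generate U∈SO(N)∖EN,N by applying φ−1 to a random skew-symmetric matrix and then check the skew-symmetry of φ(U), the identity φ∘φ−1=id on QN,N, and the commutativity of the two factors in (EquationA2).

import numpy as np

def cayley(U):
    # phi(U) = (I - U)(I + U)^{-1}, defined whenever det(I + U) != 0.
    I = np.eye(U.shape[0])
    return (I - U) @ np.linalg.inv(I + U)

def cayley_inv(V):
    # phi^{-1}(V) = (I - V)(I + V)^{-1} = 2(I + V)^{-1} - I.
    I = np.eye(V.shape[0])
    return (I - V) @ np.linalg.inv(I + V)

rng = np.random.default_rng(1)
N = 5
A = rng.standard_normal((N, N))
K = A - A.T                        # K in Q_{N,N} (skew-symmetric)
U = cayley_inv(K)                  # a point of SO(N) \ E_{N,N}
I = np.eye(N)

print(np.linalg.norm(U.T @ U - I))                                  # U is orthogonal
print(np.linalg.norm(cayley(U) + cayley(U).T))                      # phi(U) is skew-symmetric
print(np.linalg.norm(cayley(cayley_inv(K)) - K))                    # phi(phi^{-1}(K)) = K
print(np.linalg.norm(cayley(U) - np.linalg.inv(I + U) @ (I - U)))   # commutativity in (A2)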

Fact A.3

Denseness of O(N)EN,N(S); see [Citation20] for S=I

For S∈O(N), define O(N,S):={U∈O(N)∣det(U)=det(S)}, i.e. O(N,S)=SO(N) if det(S)=1, and O(N,S)=O(N)∖SO(N) if det(S)=−1. Then, for S∈O(N) and EN,N(S) defined just after (Equation6), O(N)∖EN,N(S) is a dense subset of O(N,S), i.e. the closure of O(N)∖EN,N(S) is O(N,S).

Proof.

It suffices to show for UO(N,S) that there exists a sequence (Un)n=0O(N)EN,N(S) such that limnUn=U.

Let UO(N,S). Then, STU can be expressed as STU=QTΛQ with some QO(N) and (A3) Λ=diag(Ik1,Ik2,R(θ1),R(θ2),,R(θm))O(N)(A3) (see [Citation56, IV. §5]), where k1,k2,mN{0} satisfy k1+k2+2m=N, and R(θ):=[cos(θ)sin(θ)sin(θ)cos(θ)]SO(2) (θ[0,2π){0,π}). The relation det(U)=det(S)=det(ST) ensures STUSO(N), thus the number k2 must be even. Define Un=SQTΛ(π+1/n)QO(N) (nN), where Λ(π+1/n)SO(N) is given by replacing each diagonal block matrix I2SO(2) in Ik2 [in (EquationA3)] with R(π+1/n). From det(I2+R(π+1/n))0 for nN, we have UnO(N)EN,N(S) and limnUn=U, which implies O(N)EN,N(S) is dense in O(N,S).

Lemma A.4

Matrix norms

  1. For A∈Rl×m and B∈Rm×n, it holds ‖AB‖F≤‖A‖2‖B‖F and ‖AB‖F≤‖A‖F‖B‖2.

  2. For V∈QN,N:={V∈RN×N∣V^T=−V}, we have σi(I+V)≥1 (1≤i≤N), ‖I+V‖2^2=1+‖V‖2^2 and ‖(I+V)−1‖2≤1, where σi(⋅) stands for the ith largest singular value of a given matrix.

  3. For V1,V2∈QN,N, ‖(I+V1)−1−(I+V2)−1‖F≤‖V1−V2‖F.

  4. det(I+V)>0 for all V∈QN,N.

  5. For V∈QN,N, it holds (1+‖V‖2^2)^{1/2}≤det(I+V)≤(1+‖V‖2^2)^{N/2}.

Proof.

(a) Let biRm be the ith column vector of B. Then, it holds ABF2=i=1nAbi2=bi0Abibi2bi2i=1nA22bi2=A22BF2,where stands for the Euclidean norm for vectors. Thus, we have ABFA2BF. By taking the transpose of AB in the previous inequality, we have ABF=BTATFBT2ATF=AFB2.

(b) For 1iN, let λi(Y) be the ith largest eigenvalue of a symmetric matrix YRN×N. Then, we have the expression σi(I+V)=λi((I+V)T(I+V))=λi(I+VTV)=1+σi2(V)1(1iN), which asserts I+V22=σ12(I+V)=1+σ12(V)=1+V22 and (I+V)12=σN1(I+V)1.

(c) By (a) and (b), (I+V1)1(I+V2)1F=(I+V1)1((I+V2)(I+V1))(I+V2)1F(I+V1)12(I+V2)12V1V2FV1V2F.

(d) The nonsingularity of I+V (see (b)) yields det(I+V)≠0 for every V∈QN,N, and det(I+0)=1. Since det(I+⋅) is continuous and QN,N is connected, det(I+V) is positive for all V∈QN,N.

(e) Let I+V=Q1ΣQ2T be a singular value decomposition with Q1,Q2O(N) and a nonnegative diagonal matrix ΣRN×N. Then, we obtain |det(I+V)|=|det(Q1ΣQ2T)|=det(Σ)=i=1Nσi(I+V), implying thus det(I+V)=i=1Nσi(I+V) by (d). Moreover by (b), we have det(I+V)σ1(I+V)=I+V2=1+V22 and det(I+V)σ1N(I+V)=I+V2N=(1+V22)N/2.
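The bounds in Lemma A.4(b) and (e) can also be confirmed numerically; the following sketch (an illustrative script of ours, using one random dense skew-symmetric V, with tolerances chosen only to absorb rounding) checks them for a single sample.

import numpy as np

rng = np.random.default_rng(2)
N = 6
A = rng.standard_normal((N, N))
V = A - A.T                                     # V in Q_{N,N}
I = np.eye(N)

s = np.linalg.svd(I + V, compute_uv=False)      # singular values of I + V
v2 = np.linalg.norm(V, 2)

print(s.min() >= 1 - 1e-12)                                    # (b): sigma_i(I + V) >= 1
print(abs(s.max() ** 2 - (1 + v2 ** 2)))                       # (b): ||I + V||_2^2 = 1 + ||V||_2^2
print(np.linalg.norm(np.linalg.inv(I + V), 2) <= 1 + 1e-12)    # (b): ||(I + V)^{-1}||_2 <= 1
d = np.linalg.det(I + V)
print(np.sqrt(1 + v2 ** 2) <= d <= (1 + v2 ** 2) ** (N / 2))   # (e)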

Fact A.5

Derivative of matrix functions (see, e.g. [Citation57, Appendix D])

Let DR be an open interval. Then, the following hold:

  1. Let X:RRN×M and Y:RRM×L be differentiable on D. Then, ddtX(t)Y(t)=(ddtX(t))Y(t)+X(t)(ddtY(t)).

  2. Let X:RRN×N be differentiable and invertible on D. Then, ddtX1(t)=X1(t)(ddtX(t))X1(t).

Fact A.6

The Schur complement formula [Citation58, Sec. 0.8.5]

  Let [[X]]22∈R(N−p)×(N−p) be a nonsingular block matrix of X∈RN×N. Define the Schur complement matrix of X by M:=[[X]]11−[[X]]12[[X]]22−1[[X]]21. Then, M is nonsingular if and only if X is nonsingular, and the inversion X−1 can be expressed as X−1=[M−1, −M−1[[X]]12[[X]]22−1; −[[X]]22−1[[X]]21M−1, [[X]]22−1+[[X]]22−1[[X]]21M−1[[X]]12[[X]]22−1] (blocks listed row by row). Moreover, it holds det(X)=det([[X]]22)det(M).

Fact A.7

The Sherman-Morrison-Woodbury formula [Citation58, Sec. 0.7.4]

  For nonsingular matrices A∈RN×N, R∈Rp×p, and rectangular matrices X∈RN×p, Y∈Rp×N, let B=A+XRY∈RN×N. If B and R−1+YA−1X are nonsingular, then B−1=A−1−A−1X(R−1+YA−1X)−1YA−1.
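A short numerical check of Fact A.7 is given below (an illustrative NumPy sketch of ours; the matrix sizes and the well-conditioned random data are arbitrary choices, not from the paper).

import numpy as np

rng = np.random.default_rng(3)
N, p = 7, 2
A = np.eye(N) + 0.1 * rng.standard_normal((N, N))   # a well-conditioned nonsingular A
R = np.eye(p) + 0.1 * rng.standard_normal((p, p))   # a nonsingular R
X = rng.standard_normal((N, p))
Y = rng.standard_normal((p, N))

B = A + X @ R @ Y
Ainv = np.linalg.inv(A)
# Sherman-Morrison-Woodbury: B^{-1} = A^{-1} - A^{-1} X (R^{-1} + Y A^{-1} X)^{-1} Y A^{-1}
Binv = Ainv - Ainv @ X @ np.linalg.inv(np.linalg.inv(R) + Y @ Ainv @ X) @ Y @ Ainv
print(np.linalg.norm(Binv - np.linalg.inv(B)))       # ~1e-14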

Appendix 2.

Retraction-based strategy for optimization over St(p,N)

We summarize a standard strategy for optimization over St(p,N).

Definition A.8

Retraction [Citation1]

The set of mappings RU:TUSt(p,N)St(p,N):DRU(D) defined at each USt(p,N) is called a retraction of St(p,N) if it satisfies (i) RU(0)=U; (ii) ddt|t=0RU(tD)=D for all USt(p,N) and DTUSt(p,N).

Retractions serve as certain approximations of the exponential mapping ExpU (see footnote 11). Many examples of retractions for St(p,N) are known, e.g. with the QR decomposition, with the polar decomposition and with the Euclidean projection [Citation1,Citation45,Citation46], as well as with the Cayley transform [Citation22,Citation45].

Since St(p,N) is a Riemannian manifold, Problem 1.1 has been tackled with retractions as an application of the standard strategies for optimization over Riemannian manifolds. In such a retraction-based strategy for St(p,N) [Citation1,Citation22,Citation24,Citation41–52], the computation for updating the estimate Un∈St(p,N) to Un+1∈St(p,N) at the nth iteration is decomposed into: (i) determine a search direction Dn∈TUnSt(p,N); (ii) assign RUn(Dn)=RUn(0+Dn)∈St(p,N) to the new estimate Un+1 (a minimal numerical sketch of one such iteration is given below). Along this strategy, optimization algorithms designed originally over a single vector space have been extended, through the tangent spaces, to St(p,N) by using additional tools, e.g. a vector transport [Citation1] and the inversion mapping of retractions [Citation25], if necessary. Such extensions have been made for many schemes, e.g. the gradient descent method [Citation41–43,Citation45], the conjugate gradient method [Citation24,Citation25,Citation47,Citation51], Newton's method [Citation41,Citation50], quasi-Newton's method [Citation47,Citation48], the Barzilai–Borwein method [Citation22,Citation49] and the trust-region method [Citation44,Citation52].
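To make the two-step update concrete, the following sketch (ours, not from the paper's experiments) runs gradient descent with a QR-decomposition-based retraction for the hypothetical test function f(U)=−tr(U^TCU)/2 with a symmetric C of our choosing; the Riemannian search direction is obtained by projecting the Euclidean gradient onto the tangent space via (EquationA1), and the step size is a fixed value chosen only for illustration.

import numpy as np

def proj_tangent(U, X):
    # Orthogonal projection onto T_U St(p, N), cf. (A1).
    UtX = U.T @ X
    return U @ (UtX - UtX.T) / 2 + X - U @ UtX

def retract_qr(U, D):
    # QR-decomposition-based retraction: the Q-factor of U + D.
    Q, R = np.linalg.qr(U + D)
    return Q * np.sign(np.diag(R))       # normalize so that diag(R) > 0

rng = np.random.default_rng(4)
N, p = 20, 4
C = rng.standard_normal((N, N)); C = C + C.T
C /= np.linalg.norm(C, 2)                # scale so that the Euclidean gradient is 1-Lipschitz
f      = lambda U: -0.5 * np.trace(U.T @ C @ U)
grad_f = lambda U: -C @ U                # Euclidean gradient of f

U, _ = np.linalg.qr(rng.standard_normal((N, p)))
step = 0.2
for _ in range(300):
    D = -proj_tangent(U, grad_f(U))      # (i) search direction in T_{U_n} St(p, N)
    U = retract_qr(U, step * D)          # (ii) U_{n+1} = R_{U_n}(D_n) stays on St(p, N)
print(f(U), np.linalg.norm(U.T @ U - np.eye(p)))   # f typically decreases; feasibility is kept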

Appendix 3.

Proof of Proposition 2.2

The second equality in (Equation14) is verified by (IV)(I+V)1=(2I(I+V))(I+V)1=2(I+V)1I. Fact A.6 and [[I+V]]22=INp guarantee the non-singularity of M:=Ip+[[V]]11+[[V]]21T[[V]]21 and (A4) (I+V)1=[M1M1[[V]]21T[[V]]21M1INp[[V]]21M1[[V]]21T](A4) which implies (I+V)1IN×p=[Ip[[V]]21T]TM1 and the expressions of ΞφS1 in (Equation15).

We show ΦS−1=ΥS:=Ξ∘φS−1 on QN,p(S) in the following four steps.

(I) Proof of ΥS(QN,p(S)):={ΥS(V)VQN,p(S)}St(p,N)EN,p(S). For every VQN,p(S), (EquationA2) ensures ΥS(V)TΥS(V)=IN×pT(I+V)T(IV)TSTS(IV)(I+V)1IN×p=IN×pT(IV)1(I+V)(I+V)1(IV)IN×p=Ip,thus ΥS(V)St(p,N). ΥS(V)EN,p(S) is confirmed by the expression in (Equation15), i.e. (A5) Ip+SleTΥS(V)=Ip+SleT(2(SleSri[[V]]21)M1Sle)=Ip+2M1Ip=2M1,(SleTSle=IpandSleTSri=0fromSTS=I)(A5) and det(Ip+SleTΥS(V))=2p/det(M)0.

(II) Proof of ΥSΦS(U)=U(USt(p,N)EN,p(S)). Let USt(p,N)EN,p(S) and V:=ΦS(U) in (Equation10). Then, by Ip+AS(U)+BST(U)BS(U)=Ip+2(Ip+SleTU)TSkew(UTSle)(Ip+SleTU)1+(Ip+SleTU)TUTSriSriTU(Ip+SleTU)1=(Ip+SleTU)T((Ip+SleTU)T(Ip+SleTU)+(UTSleSleTU)+UTSriSriTU)(Ip+SleTU)1=(Ip+SleTU)T(Ip+2UTSle+UTSleSleTU+UTSriSriTU)(Ip+SleTU)1=(Ip+UTSle)1(2Ip+2UTSle)(Ip+SleTU)1=2(Ip+SleTU)1,(SST=SleSleT+SriSriT=I)we deduce with (Equation15) ΥS(V)=2(SleSriBS(U))(Ip+AS(U)+BST(U)BS(U))1Sle=(Sle+SriSriTU(Ip+SleTU)1)(Ip+SleTU)Sle=(SleSleT+SriSriT)U=U.

(III) Proof of ΦSΥS(V)=V(VQN,p(S)). Let VQN,p(S) and U:=ΥS(V)=(15)2(SleSri[[V]]21)M1Sle with M:=Ip+[[V]]11+[[V]]21T[[V]]21. It suffices to show ASΥS(V)=[[V]]11 and BSΥS(V)=[[V]]21. Then, by the definition of ΦS in (Equation10), (Equation11) and (Equation12), and by SleTU=SleT(2(SleSri[[V]]21)M1Sle)=2M1Ip(SleTSle=Ip,SleTSri=0)SriTU=SriT(2(SleSri[[V]]21)M1Sle)=2[[V]]21M1(SriTSri=INp,SriTSle=0),each block matrix in (Equation10) can be evaluated as AS(U)=(Ip+2M1Ip)T((2M1Ip)T(2M1Ip))(Ip+2M1Ip)1=21MT(MTM1)M=21(MMT)=21((Ip+[[V]]11+[[V]]21T[[V]]21)(Ip+[[V]]11T+[[V]]21T[[V]]21))=[[V]]11([[V]]11T=[[V]]11),BS(U)=(2[[V]]21M1)(Ip+2M1Ip)1=2[[V]]21M1(2M1)1=[[V]]21,which implies ΦSΥS(V)=V.

(IV) Proof that ΦS and ΦS−1 are diffeomorphisms. From (II) and (III), we have seen ΦS−1=ΥS; both ΦS and ΦS−1 are homeomorphisms between their domains and images, and they consist of finitely many matrix additions, matrix multiplications and matrix inversions, which are all smooth. Therefore, ΦS and ΦS−1 are diffeomorphisms between their domains and images.
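The closed form ΦS−1(V)=2S(I+V)−1IN×p−SIN×p in (Equation14) is also easy to test numerically. The sketch below (ours; the sampling of V∈QN,p(S) as a skew-symmetric matrix with vanishing lower-right (N−p)×(N−p) block follows the definition of QN,p(S), and the helper names are ours) confirms that ΦS−1(V) has orthonormal columns and that ΦS−1(0)=Sle.

import numpy as np

def phi_S_inv(S, V, p):
    # Phi_S^{-1}(V) = 2 S (I + V)^{-1} I_{Nxp} - S I_{Nxp}, cf. (14).
    N = S.shape[0]
    I_Np = np.eye(N)[:, :p]
    return 2.0 * S @ np.linalg.solve(np.eye(N) + V, I_Np) - S @ I_Np

def random_Qnp(rng, N, p):
    # A random element of Q_{N,p}(S): skew-symmetric, zero (N-p)x(N-p) lower-right block.
    V = np.zeros((N, N))
    A = rng.standard_normal((p, p));     V[:p, :p] = A - A.T
    B = rng.standard_normal((N - p, p)); V[p:, :p] = B; V[:p, p:] = -B.T
    return V

rng = np.random.default_rng(5)
N, p = 9, 3
S, _ = np.linalg.qr(rng.standard_normal((N, N)))          # S in O(N)
V = random_Qnp(rng, N, p)

U = phi_S_inv(S, V, p)
print(np.linalg.norm(U.T @ U - np.eye(p)))                 # ~1e-15: U lies on St(p, N)
print(np.linalg.norm(phi_S_inv(S, np.zeros((N, N)), p) - S[:, :p]))   # Phi_S^{-1}(0) = S_le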

Appendix 4.

Proof of Theorem 2.3

(a) From the definition of φS1 in (Equation6), ΦS1 is the restriction of ΞφS1 to QN,p(S), which implies St(p,N)EN,p(S)=Prop.2.2ΦS1(QN,p(S))=ΞφS1(QN,p(S)). Thus, it suffices to show for every VQN,N(S) that there exists V^QN,p(S) satisfying ΦS1(V^)=ΞφS1(V), which is verified by the following lemma.

Lemma A.9

Let SO(N) and V=[ABTBC]QN,N(S) with AQp,p, BR(Np)×p, and CQNp,Np. Define (A6) V^:=[A^:=ABT(INp+C)TC(INp+C)1BB^TB^:=(INp+C)1B0Np]RN×N.(A6) Then, V^QN,p(S) and ΦS1(V^)=ΞφS1(V)

Proof.

From the skew-symmetries of A and C, we have A^T=ATBT(INp+C)TCT(INp+C)1B=A+BT(INp+C)TC(INp+C)1B=A^, thus V^QN,p(S).

By letting M:=Ip+A+BT(INp+C)1BRp×p, Fact A.6 yields (I+V)1=[M1(INp+C)1BM1M1BT(INp+C)1(INp+C)1(INp+C)1BM1BT(INp+C)1] from the non-singularities of I+V and INp+C (see Lemma A.4(b)). The expressions in (Equation6) and (Equation4) assert that ΞφS1(V)=S(IV)(I+V)1IN×p=2S(I+V)1IN×pSIN×p=2S[(Ip+A+BT(INp+C)1B)1(INp+C)1B(Ip+A+BT(INp+C)1B)1]SIN×p.On the other hand, from (Equation15), we obtain ΦS1(V^)=2S[(Ip+A^+B^TB^)1B^(Ip+A^+B^TB^)1]SIN×p.Clearly to get ΞφS1(V)=ΦS1(V^), it suffices to show A+BT(INp+C)1B=A^+B^TB^ because (INp+C)1B=B^ holds automatically by the definition of B^ in (EquationA6). The equation A+BT(INp+C)1B=A^+B^TB^ is verified by CT=C and by A^+B^TB^=ABT(INp+C)TC(INp+C)1B+BT(INp+C)T(INp+C)1B=A+BT(INpC)1(INpC)(INp+C)1B=A+BT(INp+C)1B.

(b) (Openness) By the continuity of g:RN×p→R:X↦det(Ip+Sle^TX), the preimage g−1({0}) is closed in RN×p. Since EN,p(S)=g−1({0})∩St(p,N) is closed in St(p,N), St(p,N)∖EN,p(S) is open in St(p,N).

(Denseness) It suffices to show, for every UEN,p(S), there exists a sequence (Un)n=0St(p,N)EN,p(S) such that limnUn=U. Let U=Ξ(U~)EN,p(S)St(p,N) with U~:=[UU]O(N), where USt(Np,N) satisfies UTU=0. Then, Ξ(O(N)EN,N(S))=St(p,N)EN,p(S) (see (a)) ensures U~EN,N(S). By using the denseness of O(N)EN,N(S) in O(N,S) (see Fact A.3), we can construct a sequence (U~n)n=0O(N)EN,N(S) such that limnU~n=U~. Moreover by defining (Un)n=0:=(Ξ(U~n))n=0Ξ(O(N)EN,N(S))=(a)St(p,N)EN,p(S), the continuity of Ξ yields limnUn=limnΞ(U~n)=Ξ(U~)=U.

(c) St(p,N)EN,p(Si)(i=1,2) are open dense subsets of St(p,N) from Theorem 2.3(b). The openness of Δ(S1,S2) is clear. To show the denseness of Δ(S1,S2) in St(p,N), choose USt(p,N) and ϵ>0 arbitrarily. By the open denseness of St(p,N)EN,p(S1), there exist U1BSt(p,N)(U,ϵ)St(p,N)EN,p(S1) and ϵ1>0 satisfying BSt(p,N)(U1,ϵ1)BSt(p,N)(U,ϵ)St(p,N)EN,p(S1), where BSt(p,N)(U,ϵ):=BRN×p(U,ϵ)St(p,N). The denseness of St(p,N)EN,p(S2) in St(p,N) yields the existence of U2BSt(p,N)(U1,ϵ1)St(p,N)EN,p(S2), from which we obtain U2BSt(p,N)(U1,ϵ1)St(p,N)EN,p(S2)BSt(p,N)(U,ϵ)St(p,N)EN,p(S1)St(p,N)EN,p(S2)=BSt(p,N)(U,ϵ)Δ(S1,S2).

(d) From (EquationA5), we have Ip+SleTΦS1(V)=2M1 for VQN,p(S), where M:=Ip+[[V]]11+[[V]]21T[[V]]21Rp×p is the Schur complement matrix of I+VRN×N. Fact A.6 yields g(V)=det(2M1)=2p(det(M))1=2p(det(I+V))1 due to [[I+V]]22=INp. Lemma A.4(d) ensures g(V)>0. By Lemma A.4(e), we have det(I+V)1+V22 as V2, implying thus limVQN,p(S)V2g(V)=0.

Assume that (Vn)n=0QN,p(S) satisfies limng(Vn)=0. By 0<det(I+Vn)(1+Vn22)N/2 in Lemma A.4(e), we have g(Vn)=2p(det(I+Vn))12p/(1+Vn22)N/2. The assumption asserts Vn2 as n.

Appendix 5.

On the choice of Ξ:O(N)→St(p,N) for ΦS−1 in Proposition 2.2

For 2p<N, let USt(p,N) and USt(Np,N) satisfy UTU=0, and S:=[UU]O(N). From U=SIN×p, we have (A7) (VQN,p)ΞUφS1(V)=S(IV)(I+V)1SIN×p=S(IV)S(I+STVS)1IN×p=SS(ISTVS)(I+STVS)1IN×p=ΞφSS1(STVS).(A7) From STVSQN,N, Theorem 2.3(a) ensures ΞUφS1(QN,p)ΞφSS1(QN,N)=St(p,N)EN,p(SS).

In the following, let us consider the case of Uup=0Rp×p to show that ΞUφS1 is not injective on QN,p. Since ΞUφS1 does not depend on U, we can assume, without loss of generality, S=[0Ip0Z0Z], U=[0Z] and U=[Ip00Z] with ZSt(p,Np) and ZSt(N2p,Np) satisfying ZTZ=0. We have (A8) (VQN,p)STVS=[0ZTIp00ZT][[[V]]11[[V]]21T[[V]]210][0Ip0Z0Z]=[ZT[[V]]210[[V]]11[[V]]21TZT[[V]]210][0Ip0Z0Z]=[0ZT[[V]]210[[V]]21TZ[[V]]11[[V]]21TZ0ZT[[V]]210].(A8)

Now, by using αR{0}, define V(α)QN,p as [[V(α)]]11=0 and [[V(α)]]21=α[0(Np)×pZ][0p×(N2p)Ip]T, where [[V(α)]]210 is guaranteed by ZSt(N2p,Np) and 0<N−2p. Then, ZT[[V(α)]]21=0 and (EquationA8) with V=V(α) yield (A9) STV(α)S=[00000[[V(α)]]21TZ0ZT[[V(α)]]210]=:[ABTBC]QN,N,(A9) where A=0Qp,p, B=0R(Np)×p and C=[0[[V(α)]]21TZZT[[V(α)]]210]QNp,Np.

Finally, by applying Lemma A.9 to (EquationA9) and (EquationA7), we deduce ΞUφS1(V(α))=(A7)ΞφSS1(STV(α)S)=LemmaA.9ΦSS1(0)=SSIN×p=SU for all αR{0}. This implies that infinitely many V(α)QN,p(αR{0}) achieve ΞUφS1(V(α))=SU, and clearly ΞUφS1 is not injective.

Appendix 6.

Proof of Proposition 2.9

The differentiability of fΦS1 is verified by the differentiabilities of f and ΦS1. Let V,DQN,p(S). From the chain rule, we obtain ddt(fΦS1)(V+tD)|t=0=Tr(f(U)TddtΦS1(V+tD)|t=0).Moreover, by ΦS1(V)=2S(I+V)1IN×pSIN×p and Fact A.5, we deduce ddtΦS1(V+tD)|t=0=2Sddt(I+V+tD)1|t=0IN×p=2S(I+V)1D(I+V)1IN×p.Therefore, we have ddt(fΦS1)(V+tD)|t=0=Tr(2f(U)TS(I+V)1D(I+V)1IN×p)=Tr(2(I+V)1IN×p∇f(U)TS(I+V)1D)=Tr(2W¯Sf(V)D),where W¯Sf(V) is defined in (Equation22). Furthermore, we have Tr(2W¯Sf(V)D)=Tr(2Skew(W¯Sf(V))D)=Tr(2Skew(WSf(V))D), where the first equality follows by Tr(XTD)=12(Tr(XTD)+Tr(XDT))=12(Tr(XD)Tr(XD))=0 for any symmetric matrix XRN×N and the second equality follows by (Equation21) and [[D]]22=0. Therefore, we obtain (A10) (DQN,p(S))ddt(fΦS1)(V+tD)|t=0=Tr(2Skew(WSf(V))D).(A10)

On the other hand, by letting (fΦS1)(V)QN,p(S) be the gradient of fΦS1 at V, it follows (A11) (DQN,p(S))ddt(fΦS1)(V+tD)|t=0=Tr((fΦS1)(V)TD).(A11) By noting 2Skew(WSf(V))QN,p(S), (EquationA10) and (EquationA11) imply (fΦS1)(V)=2Skew(WSf(V))T=2Skew(WSf(V)). By applying (EquationA4) to (Equation22), the expression (Equation23) is derived as W¯Sf(V)=(I+V)1IN×pf(U)TS(I+V)1=[M1[[V]]21M1]∇f(U)T[(SleSri[[V]]21)M1(SleSri[[V]]21)M1[[V]]21T+Sri].

By substituting V=0 into (Equation22), and by ΦS1(0)=SIN×p=Sle, we deduce   (SO(N))W¯Sf(0)=IN×p∇f(ΦS1(0))TS=[∇f(Sle)TSle∇f(Sle)TSri00]=(21)WSf(0)  and fS(0)=(20)2Skew(WSf(0))=[f(Sle)TSleSleT∇f(Sle)∇f(Sle)TSriSriT∇f(Sle)0].
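The directional derivative of ΦS−1 used in the computation above, i.e. d/dt ΦS−1(V+tD)|t=0=−2S(I+V)−1D(I+V)−1IN×p (the sign follows from Fact A.5(b)), can be verified by a central finite difference. The following sketch (ours, reusing the illustrative sampling of QN,p(S) from the previous sketch) does so for one random pair (V,D).

import numpy as np

def phi_S_inv(S, V, p):
    # Phi_S^{-1}(V) = 2 S (I + V)^{-1} I_{Nxp} - S I_{Nxp}
    N = S.shape[0]
    I_Np = np.eye(N)[:, :p]
    return 2.0 * S @ np.linalg.solve(np.eye(N) + V, I_Np) - S @ I_Np

def random_Qnp(rng, N, p):
    # Skew-symmetric matrix with zero (N-p)x(N-p) lower-right block.
    V = np.zeros((N, N))
    A = rng.standard_normal((p, p));     V[:p, :p] = A - A.T
    B = rng.standard_normal((N - p, p)); V[p:, :p] = B; V[:p, p:] = -B.T
    return V

rng = np.random.default_rng(6)
N, p = 8, 3
S, _ = np.linalg.qr(rng.standard_normal((N, N)))
V, D = random_Qnp(rng, N, p), random_Qnp(rng, N, p)

I_N, I_Np = np.eye(N), np.eye(N)[:, :p]
inv = np.linalg.inv(I_N + V)
analytic = -2.0 * S @ inv @ D @ inv @ I_Np                  # d/dt Phi_S^{-1}(V + tD) at t = 0
t = 1e-6
numeric = (phi_S_inv(S, V + t * D, p) - phi_S_inv(S, V - t * D, p)) / (2 * t)
print(np.linalg.norm(analytic - numeric))                   # small (finite-difference error)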

Appendix 7.

Proof of Proposition 2.10

(I) Proof of Proposition 2.10(a). We need the following lemma to show Proposition 2.10(a). Figure A1 illustrates the relation between the lemma and Proposition 2.10(a).

Figure A1. A flow chart representing the overview of the proof of Proposition 2.10(a). The goal is to derive a transformation formula from ∇fS2(V2) to ∇fS1(V1) under ΦS1−1(V1)=ΦφS1−1(V1)−1(0)=ΦφS2−1(V2)−1(0)=ΦS2−1(V2).

Lemma A.10

Let f:RN×p→R be a differentiable function, and let S∈O(N), V∈QN,p(S) and S′:=φS−1(V)∈O(N) in (Equation6), implying thus ΦS−1(V)=φS−1(V)IN×p=S′IN×p=ΦS′−1(0). Then, the following hold:

  1. For WSf(V) in (Equation21) and W¯Sf(V) in (Equation22), we have W¯Sf(V)=(I+V)1WSf(0)(I+V)T and (A13) WSf(V)=(I+V)1WSf(0)(I+V)T[000INp](I+V)1WSf(0)(I+V)T[000INp].(A13)

  2. The gradients of fS:=fΦS1 and fS:=fΦS1 satisfy (A14) fS(V)=(I+V)1fS(0)(I+V)T[000INp](I+V)1fS(0)(I+V)T[000INp](A14) and (A15) fS(0)=(I+V)(fS(V)[00[[V]]21INp]fS(V)[0[[V]]21T0INp])(I+V)T.(A15)

  3. If S^,SˇO(N) satisfy ΦS^1(0)=ΦSˇ1(0)St(p,N), i.e. S^le=Sˇle in (Equation13), then we have (A16) fS^(0)=[Ip00Y]fSˇ(0)[Ip00YT],(A16) where Y:=S^riTSˇriO(Np).

Proof.

(a) Combining ΦS1(V)=ΦS1(0) and WSf(0)=(A12)W¯Sf(0)=(22)IN×pf(ΦS1(0))TS=IN×p∇f(ΦS1(V))TS, we obtain (A17) (I+V)1WSf(0)(I+V)T=(I+V)1IN×pf(ΦS1(V))TS(IT+VT)1=(6)(I+V)1IN×p∇f(ΦS1(V))TS(IV)(I+V)1(IV)1=(A2)(I+V)1IN×p∇f(ΦS1(V))TS(I+V)1(IV)(IV)1=(I+V)1IN×p∇f(ΦS1(V))TS(I+V)1=(22)W¯Sf(V).(A17) The relation (EquationA13) is obtained by substituting (EquationA17) to an alternative expression of (Equation21): WSf(V)=W¯Sf(V)[000INp]W¯Sf(V)[000INp].

(b) (EquationA14) is confirmed by applying (EquationA13) to (Equation20) as fS(V)=(20)WSf(V)WSf(V)T=(A13)(I+V)1(WSf(0)WSf(0)T)(I+V)T[000INp](I+V)1(WSf(0)WSf(0)T)(I+V)T[000INp]=(20)(I+V)1fS(0)(I+V)T[000INp](I+V)1fS(0)(I+V)T[000INp].

To derive (EquationA15) from (EquationA14), let first fS(0)=[EQp,pFTFR(Np)×p0Np]QN,p(S) and apply (EquationA4) with M:=Ip+[[V]]11+[[V]]21T[[V]]21Rp×p as (A18) (I+V)1fS(0)(I+V)T=[M1M1[[V]]21T[[V]]21M1INp[[V]]21M1[[V]]21T][EFTF0]×[MTMT[[V]]21T[[V]]21MTINp[[V]]21MT[[V]]21T]=[M1(E+[[V]]21TF)M1FT[[V]]21M1(E+[[V]]21TF)+F[[V]]21M1FT]×[MTMT[[V]]21T[[V]]21MTINp[[V]]21MT[[V]]21T]=[M1GMT[[V]]21M1GMT+FMTM1GMT[[V]]21TM1FT[[V]]21M1GMT[[V]]21T+[[V]]21M1FTFMT[[V]]21T],(A18) where G:=E+[[V]]21TFFT[[V]]21Rp×p satisfies GT=G. By substituting (EquationA18) to (EquationA14), we obtain fS(V)=[M1GMTM1GMT[[V]]21TM1FT[[V]]21M1GMT+FMT0Np]and [00[[V]]21INp]fS(V)[0[[V]]21T0INp]=[000[[(I+V)1fS(0)(I+V)T]]22],from which we obtain fS(V)=(I+V)1fS(0)(I+V)T+[00[[V]]21INp]fS(V)[0[[V]]21T0INp].

(c) From S^le=ΦS^1(0)=ΦSˇ1(0)=Sˇle=:U, and S^TSˇ=[Ip00S^riTSˇri]=[Ip00Y]O(N),we see YO(Np) and Sˇri=S^riY by S^riY=S^riS^riTSˇri=(IS^leS^leT)Sˇri=(ISˇleSˇleT)Sˇri=SˇriSˇriTSˇri=Sˇri.

Thus, it follows from Sˇri=S^riY and U=S^le=Sˇle that fS^(0)=(24)[f(U)TS^leS^leT∇f(U)∇f(U)TS^riS^riT∇f(U)0]=[∇f(U)TSˇleSˇleT∇f(U)∇f(U)TSˇriYTYSˇriT∇f(U)0]=[Ip00Y][∇f(U)TSˇleSˇleT∇f(U)∇f(U)TSˇriSˇriT∇f(U)0][Ip00YT]=(24)[Ip00Y]fSˇ(0)[Ip00YT].

Return to the proof of Proposition 2.10(a). Let S1^:=φS11(V1)O(N) and Sˇ2:=φS21(V2)O(N). Since S1^le=ΦS1^1(0)=ΦS11(V1)=ΦS21(V2)=ΦSˇ21(0)=S2^le, Lemma A.10(c) implies X=S1^riTSˇ2riO(Np). Moreover from Lemma A.10, we have the relations fS1(V1)=(A14)(I+V1)1fS1^(0)(I+V1)T[000INp](I+V1)1fS1^(0)(I+V1)T[000INp],fS1^(0)=(A16)[Ip00X]fSˇ2(0)[Ip00XT],fSˇ2(0)=(A15)(I+V2)×(fS2(V2)[00[[V2]]21INp]fS2(V2)[0[[V2]]21T0INp])(I+V2)T.Finally by substituting the second and last relations into the first relation, we complete the proof.

(II) Proof of Proposition 2.10(b) and (c). From Proposition 2.10(a), Lemma A.4(a) and (b), we obtain fS1(V1)F=GS1,S2(V1,V2)[000INp]GS1,S2(V1,V2)[000INp]FGS1,S2(V1,V2)F(I+V1)122Ip00X22I+V222×fS2(V2)[00[[V2]]21INp]fS2(V2)[0[[V2]]21T0INp]FI+V222fS2(V2)[00[[V2]]21INp]fS2(V2)[0[[V2]]21T0INp]F(LemmaA.4(b))I+V222(1+00[[V2]]21INp22)fS2(V2)F2I+V222fS2(V2)F,where the last inequality is derived by 00[[V2]]21INp2=1 from the fact that each eigenvalue of a triangular matrix equals its diagonal entry. Finally by applying Lemma A.4(b) again, we obtain Proposition 2.10(b), which implies Proposition 2.10(c).

Appendix 8.

Useful properties of ∇(f∘ΦS−1) for optimization

The properties of ∇(f∘ΦS−1) in the following Proposition A.11 are useful for transplanting powerful computational arts designed for optimization over a vector space into the minimization of f∘ΦS−1 over QN,p(S). Indeed, the Lipschitz continuity of the gradient is one of the commonly used assumptions in optimization over a vector space (see, e.g. [Citation27–32]). The boundedness of the gradient is a key property for distributed optimization and stochastic optimization over a vector space (see, e.g. [Citation30–32]). Bounded variance of the gradient is also commonly assumed in stochastic optimization over a vector space (see, e.g. [Citation28–30]).

Proposition A.11

Bounds for the gradient after Cayley parametrization

Let f:RN×pR be continuously differentiable. Then, for any SO(N), the following hold:

  1. (Lipschitz continuity). If (L>0,U1,U2St(p,N))∇f(U1)∇f(U2)FLU1U2Fand μmaxUSt(p,N)∇f(U)2, then the gradient of fS:=fΦS1 satisfies (A19) (V1,V2QN,p(S))fS(V1)fS(V2)F4(μ+L)V1V2F.(A19)

  2. (Boundedness). (A20) (VQN,p(S))fS(V)F2maxUSt(p,N)∇f(U)F.(A20)

  3. (Variance boundedness). Suppose fξ:RN×pR is indexed with realizations of a random variable ξ and continuously differentiable for each realization. If there exists σ0 and f satisfies (USt(p,N)){Eξ[fξ(U)]=f(U),Eξ[fξ(U)]=∇f(U),Eξ[fξ(U)∇f(U)F2]σ2,we have (A21) (VQN,p(S))Eξ[fSξ(V)fS(V)F2]4σ2.(A21)

Proof.

The existence of the maximum of ‖∇f(⋅)‖ over St(p,N) is guaranteed by the compactness of St(p,N) and the continuities of ‖⋅‖ and ∇f. We divide the proof of (a)–(c) as follows. Recall that W¯Sf(V) and WSf(V) for S∈O(N) were respectively defined as (Equation22) and (Equation21), and we have fS(V):=(f∘ΦS−1)(V)=2Skew(WSf(V)) (see Proposition 2.9). In the following, we use two properties of Skew: (i) ‖Skew(X)‖F≤‖X‖F for X∈RN×N; (ii) the linearity of Skew.

(I) Proof of Proposition A.11(a). First, we introduce a useful inequality.

Lemma A.12

Lipschitz continuity of ΦS1

For every S∈O(N), ΦS−1 is Lipschitz continuous over QN,p(S) with Lipschitz constant 2, i.e. ‖ΦS−1(V1)−ΦS−1(V2)‖F≤2‖V1−V2‖F for all V1,V2∈QN,p(S).

Proof.

From (Equation14) and Lemma A.4(a) and (c), we have ΦS1(V1)ΦS1(V2)F=2S((I+V1)1(I+V2)1)IN×pF2S2(I+V1)1(I+V2)1FIN×p22(I+V1)1(I+V2)1F2V1V2F.

Return to the proof of Proposition A.11(a). From (Equation20), (Equation21) in Proposition 2.9, we have (A22) fS(V1)fS(V2)F=2Skew(WSf(V1)WSf(V2))F2WSf(V1)WSf(V2)F2W¯Sf(V1)W¯Sf(V2)F.(A22) Moreover, from (Equation22), for all V1,V2QN,p(S) with U1:=ΦS1(V1),U2:=ΦS1(V2)St(p,N)EN,p(S), we deduce (A23) W¯Sf(V1)W¯Sf(V2)F=(I+V1)1IN×pf(U1)TS(I+V1)1(I+V2)1IN×p∇f(U2)TS(I+V2)1F(I+V1)1IN×p∇f(U1)TS(I+V1)1(I+V2)1IN×p∇f(U1)TS(I+V1)1F+(I+V2)1IN×p∇f(U1)TS(I+V1)1(I+V2)1IN×p∇f(U2)TS(I+V1)1F+(I+V2)1IN×p∇f(U2)TS(I+V1)1(I+V2)1IN×p∇f(U2)TS(I+V2)1F.(A23) The first term in the right-hand side of (EquationA23) can be bounded as (I+V1)1IN×pf(U1)TS(I+V1)1(I+V2)1IN×p∇f(U1)TS(I+V1)1F=((I+V1)1(I+V2)1)IN×p∇f(U1)TS(I+V1)1FIN×p2∇f(U1)2S2(I+V1)12(I+V1)1(I+V2)1F( Lemma A.4(a))∇f(U1)2V1V2FμV1V2F.( Lemma A.4(b) and (c))Similarly the last term in (EquationA23) can be bounded above by μV1V2F. The second term in (EquationA23) can be evaluated as (I+V2)1IN×pf(U1)TS(I+V1)1(I+V2)1IN×p∇f(U2)TS(I+V1)1F=(I+V2)1IN×p(∇f(U1)∇f(U2))TS(I+V1)1F(I+V2)12IN×p2S2(I+V1)12∇f(U1)∇f(U2)F( Lemma A.4(a))∇f(U1)∇f(U2)FLU1U2F( Lemma A.4(b) and Lipschitz continuity of ∇f)=LΦS1(V1)ΦS1(V2)F2LV1V2F( Lemma A.12).Therefore, the left-hand side of (EquationA23) is bounded as (V1,V2QN,p(S))W¯Sf(V1)W¯Sf(V2)F2(μ+L)V1V2F,which is combined with (EquationA22) to get (EquationA19).

(II) Proof of Proposition A.11(b). From (Equation20), (Equation21) in Proposition 2.9, we have fS(V)F=2Skew(WSf(V))F2WSf(V)F2W¯Sf(V)F.By using Lemma A.4(a) and (b), we get W¯Sf(V)F=(I+V)1IN×pf(ΦS1(V))TS(I+V)1F(I+V)122IN×p2S2∇f(ΦS1(V))F∇f(ΦS1(V))FmaxUSt(p,N)∇f(U)F,which implies (EquationA20).

(III) Proof of Proposition A.11(c). From (Equation20), (Equation21) in Proposition 2.9, we obtain, for each ξ, fSξ(V)fS(V)F2=4Skew(WSfξ(V)WSf(V))F24WSfξ(V)WSf(V)F24W¯Sfξ(V)W¯Sf(V)F2=4(I+V)1IN×p(fξ(ΦS1(V))f(ΦS1(V)))S(I+V)1F24(I+V)124IN×p22S22fξ(ΦS1(V))∇f(ΦS1(V))F2( Lemma A.4(a))4fξ(ΦS1(V))∇f(ΦS1(V))F2.( Lemma A.4(b)) By taking the expectation of both sides, we get (EquationA21).

Appendix 9.

Proof of Proposition 3.7

Application of (Equation14) to U(τ):=ΦS1(V+τE)=2S(I+V+τE)1IN×pSIN×p yields U(τ)U(0)=2S((I+V+τE)1(I+V)1)IN×p=2τS(I+V+τE)1E(I+V)1IN×p (for the 2nd equality, see the proof of Lemma A.4(c)) and   U(τ)U(0)F=2τS(I+V+τE)1E(I+V)1IN×pF2τS2(I+V+τE)12EF(I+V)1IN×p2( Lemma A.4(a))2τ(I+V)1IN×p22τIpV21T2M12,  where M=Ip+[[V]]11+[[V]]21T[[V]]21Rp×p, the second last inequality is derived by Lemma A.4(b), and the last inequality is derived by (EquationA4).

To evaluate the norm M12, let Ip+[[V]]21T[[V]]21=Q(Ip+Σ)QT be the eigenvalue decomposition, where QO(p) is an orthogonal matrix and ΣRp×p is a diagonal matrix whose entries are non-negative. Then, we have M=Q(Ip+Σ)1/2(Ip+(Ip+Σ)1/2QT[[V]]11Q(Ip+Σ)1/2)(Ip+Σ)1/2QT.The norm M12=(Ip+[[V]]11+[[V]]21T[[V]]21)12 is bounded above as (A25) M12=Q(Ip+Σ)1/2(Ip+(Ip+Σ)1/2QT[[V]]11Q(Ip+Σ)1/2)1(Ip+Σ)1/2QT2(Ip+Σ)1/222(Ip+(Ip+Σ)1/2QT[[V]]11Q(Ip+Σ)1/2)12(Ip+Σ)12,(A25) where the last inequality is derived from the skew-symmetry of (Ip+Σ)1/2QT[[V]]11Q(Ip+Σ)1/2 and Lemma A.4(b). Moreover, by (Ip+Σ)12=(1+σmin2([[V]]21))1, we have M12(1+σmin2([[V]]21))1. Furthermore, from the definition of the spectral norm, we have Ip[[V]]21T2=λmax(Ip+[[V]]21T[[V]]21)=1+[[V]]2122. By substituting these relations into (EquationA24), we completed the proof of (Equation28). The equation (Equation29) is verified by σmin([[V]]21)σmax([[V]]21)=[[V]]212.

Appendix 10.

Gradient of fRUCay

Proposition A.13

Let USt(p,N) and USt(Np,N) satisfy UTU=0. For a differentiable function f:RN×pR, the Cayley transform-based retraction RCay in (Equation30), and USt(p,N), the function fRUCay:TUSt(p,N)R is differentiable with (VTUSt(p,N))(fRUCay)(V)=2PUSkew(ZU∇f(RUCay(V))TZ)Uwhere PU:=IUUT/2RN×N and Z:=(I+Skew(UVTPU))1RN×N. The matrix Z can be expressed as Z=IA(I2p+BTA)1BT with A=[UPUV/2]RN×2p and B=[PUV/2U]RN×2p.

Proof.

Let V,DTUSt(p,N). From the chain rule and Fact A.5, we obtain (A26) d(fRUCay)dt(V+tD)|t=0=Tr(f(RUCay(V))TdRUCaydt(V+tD)|t=0)=2Tr(∇f(RUCay(V))TZSkew(UDTPU)ZU)=Tr(UTZU∇f(RUCay(V))TZPUD)Tr(DTPUZU∇f(RUCay(V))TZU)=Tr(UTZU∇f(RUCay(V))TZPUD)Tr(UTZT∇f(RUCay(V))UTZTPUD)=Tr(2UTSkew(ZU∇f(RUCay(V))TZ)PUD)=Tr((2PUSkew(ZU∇f(RUCay(V))TZ)U)TD)(A26) due to RUCay(V)=2(I+Skew(UVTPU))1UU (see (Equation30) and (Equation4)) and dRUCaydt(V+tD)|t=0=2(I+Skew(UVTPU))1Skew(UDTPU)(I+Skew(UVTPU))1U=2ZSkew(UDTPU)ZU.For W:=2PUSkew(ZUf(RUCay(V))TZ)URN×p, we have UTW+WTU=0 because UTW=UTSkew(ZUf(RUCay(V))TZ)U is skew-symmetric. Fact A.1(d) yields WTUSt(p,N).

On the other hand, we obtain (A27) d(fRUCay)dt(V+tD)|t=0=Tr((fRUCay)(V)TD).(A27) From (EquationA26), (EquationA27) and WTUSt(p,N), it holds (fRUCay)(V)=W.

In the following, let us consider the expression of Z along the discussion in [Citation22, Lemma 4]. From I+Skew(UVTPU)=I+ABT, we have Z=(I+ABT)1. Then, applying the Sherman-Morrison-Woodbury formula (see Fact A.7) to Z, we obtain Z=IA(I2p+BTA)1BT.