A Spectral Gradient Projection Method for the Positive Semi-definite Procrustes Problem

This paper addresses the positive semi-definite Procrustes problem (PSDP). The PSDP corresponds to a least squares problem over the set of symmetric and positive semi-definite matrices. Problems of this kind appear in many applications, such as structural analysis and signal processing, among others. A non-monotone spectral projected gradient algorithm is proposed to obtain a numerical solution of the PSDP. The proposed algorithm employs Zhang and Hager's non-monotone technique in combination with the Barzilai-Borwein step size to accelerate convergence. Some theoretical results are presented. Finally, numerical experiments are performed to demonstrate the effectiveness and efficiency of the proposed method, and comparisons are made with other state-of-the-art algorithms.


Introduction
The positive semi-definite Procrustes problem (PSDP) is defined as follows: given two rectangular matrices A, B ∈ R^{n×m}, we want to find a symmetric and positive semi-definite matrix X* ∈ S+(n) that solves the optimization problem

min_{X ∈ S+(n)} F(X) := ½ ||XA − B||²_F,   (1)

where ||Z||_F denotes the Frobenius norm of Z ∈ R^{r×k} and S+(n) represents the set of symmetric and positive semi-definite n-by-n matrices with real entries, i.e.

S+(n) = { X ∈ R^{n×n} : X = X⊤ and v⊤Xv ≥ 0 for all v ∈ R^n }.   (2)
Problem (1) arises frequently in different applications such as analysis of structures [5,17], signal processing [14], and estimation of correlation matrices [2], among others. It is well known that the feasible set S+(n) of problem (1) is a closed convex cone of dimension n(n + 1)/2 [9]. From this result and the continuity of the objective function F, the existence of at least one global minimizer of problem (1) is guaranteed. Additionally, if A is full rank, then the solution of (1) is unique; for more details see [17]. In addition, since F is convex, (1) is a convex minimization problem, which is relatively easy to solve. There are also particular cases of problem (1) that admit an analytical formula for the solution, for example, when A = I_m [9], when rank(A) = 1 [8], or when X is restricted to be a diagonal matrix [8].
Although there are particular cases where problem (1) has a closed-form solution, in most cases, such as in real applications, a solution can only be obtained computationally, by an iterative method that generates a sequence of feasible points converging to a local minimizer. However, designing efficient algorithms that generate a feasible sequence of points is generally a difficult task, because it usually requires a projection operator, which in turn requires computing spectral decompositions; this is computationally expensive in large-scale settings.
Due to the vast number of applications captured by problem (1), many researchers have studied it, both from a theoretical point of view and through the design of new efficient approaches. In [9,10] the authors study a problem related to (1) theoretically, deriving general analytical formulas for the solution in some particular cases; additionally, they provide necessary and sufficient conditions to guarantee the existence of the solution. On the other hand, the gradient projection method was implemented by Gillis et al. in [8]. Specifically, they propose an algorithm called "FGM", an accelerated version of the classical gradient projection method that uses Nesterov's acceleration technique [11]; in essence, FGM is an implementation of the algorithm that appears in [11]. In [8], a method called "AN-FGM" is also proposed: a semi-analytic approach that reduces problem (1) to the case when A is diagonal and then uses FGM to address this easier problem. This approach is quite efficient for problems where A is ill-conditioned. Another alternative for computing a numerical solution of problem (1) is studied in [1], where the authors propose an algorithm called "Parallel tangents", based on the gradient projection method with an over-relaxation step; one drawback of this method is that it does not guarantee optimal convergence. Finally, two algorithms designed to solve convex optimization problems over S+(n), "SDPT3" [16] and "QSDP" [15], can also be used to solve problem (1).
In this work, we study the numerical behaviour of a spectral gradient projection method for the positive semi-definite Procrustes problem from a practical point of view. In particular, we adopt a gradient projection scheme with the non-monotone globalization technique proposed by Zhang and Hager in [18], in combination with the step sizes proposed by Barzilai and Borwein in [3]. Subsequently, we present computational experiments illustrating the effectiveness of the proposed method on PSDP problems under various conditionings of A. Our main contribution is the implementation of an efficient gradient projection procedure in MATLAB, and the numerical comparison of its performance against other state-of-the-art algorithms.
The rest of this work is organized as follows. In the next section, some notations and tools needed for a good understanding of the article are introduced. In Section 3, the update formula of our proposed method is presented. Subsection 3.1 addresses the selection of the step size of the proposed method and describes the non-monotone globalization technique used to regulate it; we conclude this subsection by presenting our proposal. A derivation of the proposed method from the algorithm presented in [7] is discussed in Subsection 3.2. In Section 4, several numerical experiments are carried out to demonstrate the effectiveness and efficiency of our procedure. Finally, conclusions are presented in Section 5.

Notations and Important Tools
In this section, we present some fundamental concepts and tools for a good understanding of this work. Let us denote by ⟨A, B⟩ := Tr(A⊤B) the Frobenius inner product of two matrices A, B ∈ R^{n×m}, where Tr(·) denotes the trace operator, so that ||A||_F = √⟨A, A⟩. Another tool that we use is the projection operator onto the feasible set S+(n), which is defined below.
Definition 1: Let X ∈ R^{n×n} be a real square matrix. The projection operator π : R^{n×n} → S+(n) onto S+(n) is defined by

π(X) = arg min_{P ∈ S+(n)} ||P − X||_F.   (3)
Note that the projection of an arbitrary matrix X ∈ R^{n×n} is defined through an optimization problem; however, problem (3) has a closed-form solution. This fact is established below.
Proposition 1: Let X ∈ R^{n×n} be a real square matrix. Then π(X) is well-defined. Moreover, consider the symmetric part of X, that is, X_sym = ½(X⊤ + X), and let X_sym = VΣV⊤ be the spectral decomposition of X_sym. Then

π(X) = V Σ₊ V⊤,

where Σ₊ is obtained from Σ by replacing each negative eigenvalue with zero, i.e., (Σ₊)_ii = max{Σ_ii, 0}.

Proof. The proof of this proposition appears in [9].
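Proposition 1 can be implemented in a few lines. The sketch below is a Python/NumPy stand-in for the MATLAB routines described later; the function name proj_psd is our own.

```python
import numpy as np

def proj_psd(X):
    """Projection onto S+(n) via Proposition 1: symmetrize X,
    then zero out the negative eigenvalues of its symmetric part."""
    X_sym = 0.5 * (X + X.T)
    w, V = np.linalg.eigh(X_sym)              # spectral decomposition V diag(w) V^T
    return (V * np.maximum(w, 0.0)) @ V.T     # replace negative eigenvalues by zero
```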

A Feasible Update Scheme
Since F is smooth, a natural idea is to compute the next iterate as Y(τ) = X − τ∇F(X), where X ∈ S+(n) is the previous iterate and τ > 0 represents the step size. The drawback of this approach is that the new point Y(τ) may not satisfy the constraints of problem (1). To overcome this issue, we consider the well-known projected gradient method [4], which computes the new iterate Z(τ) as a point on the curve

Z(τ) = π(X − τ∇F(X)), τ > 0.   (4)

Observe that equation (4) guarantees that the new iterate preserves feasibility. On the other hand, there are different techniques to select the step size τ. The condition usually used for the gradient projection method is known as "Armijo's condition on the projection arc" [4]. This condition chooses the step size τ_k, at the k-th iteration, as the largest positive number τ that verifies the inequality

F(Z_k(τ)) ≤ F(X_k) + σ ⟨∇F(X_k), Z_k(τ) − X_k⟩,   (5)

where σ ∈ (0, 1) and Z_k(τ) = π(X_k − τ∇F(X_k)). The Armijo condition (5) is used in combination with a heuristic called backtracking in order to find an appropriate step size satisfying (5); for more details about the backtracking strategy see [4,12].
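As a concrete illustration, one projected-gradient step with backtracking on the projection arc can be sketched as follows in Python/NumPy, for the objective F(X) = ½||XA − B||²_F, whose gradient is ∇F(X) = (XA − B)A⊤. The default values of tau0, sigma and beta are illustrative, not the ones used in our experiments.

```python
import numpy as np

def proj_psd(X):
    """Projection onto S+(n): symmetrize, clip negative eigenvalues."""
    Xs = 0.5 * (X + X.T)
    w, V = np.linalg.eigh(Xs)
    return (V * np.maximum(w, 0.0)) @ V.T

def armijo_arc_step(X, A, B, tau0=1.0, sigma=1e-4, beta=0.5, max_back=40):
    """One projected-gradient step chosen by backtracking until the Armijo
    condition on the projection arc holds, for F(X) = 0.5*||X A - B||_F^2."""
    F = lambda M: 0.5 * np.linalg.norm(M @ A - B, 'fro') ** 2
    G = (X @ A - B) @ A.T                     # gradient of F at X
    tau = tau0
    for _ in range(max_back):
        Z = proj_psd(X - tau * G)
        # Armijo condition on the arc: F(Z) <= F(X) + sigma*<G, Z - X>
        if F(Z) <= F(X) + sigma * np.sum(G * (Z - X)):
            break
        tau *= beta                           # backtracking
    return Z, tau
```

Since X is feasible and the inner product ⟨∇F(X), Z(τ) − X⟩ is negative whenever X is not stationary, the accepted step strictly decreases F.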
Note that if we ensure that ⟨∇F(X_k), Z_k(τ) − X_k⟩ < 0 for all k, then we obtain a sequence {X_k} of points such that the corresponding sequence {F(X_k)} is monotonically decreasing. With the purpose of accelerating the convergence of the gradient projection scheme (4), we adopt the non-monotone globalization technique proposed by Zhang and Hager in [18], combined with the Barzilai-Borwein step sizes [3], which usually accelerate the convergence of gradient-based methods. This strategy is described in the next subsection.

The Barzilai-Borwein Step Sizes.
In this subsection, we focus on a non-monotone strategy for the step size selection and present the proposed algorithm in detail.
It is well known that the Barzilai-Borwein step sizes [3] can improve the performance of gradient-based algorithms without increasing the computational cost of the procedure too much. Typically, this technique considers the steepest descent method and proposes to choose either of the two step sizes presented below at the k-th iteration,

τ_k^{BB1} = ⟨S_{k−1}, S_{k−1}⟩ / ⟨S_{k−1}, Y_{k−1}⟩,  τ_k^{BB2} = ⟨S_{k−1}, Y_{k−1}⟩ / ⟨Y_{k−1}, Y_{k−1}⟩,   (6)

where S_{k−1} = X_k − X_{k−1} and Y_{k−1} = ∇F(X_k) − ∇F(X_{k−1}). Since the values τ_k^{BB1} and τ_k^{BB2} (BB-steps) can be negative, we use their absolute values to avoid negative step sizes, which would lead to growth in the objective function; for more details see [3,13]. Since the BB-steps do not necessarily decrease the objective function value at each iteration, they can invalidate convergence. However, this issue can be overcome by using a globalization technique, which guarantees global convergence by regulating the step sizes in (6), see [6,13]. With these considerations in mind, we adopt in our proposed algorithm a non-monotone line search based on the strategy in [18]. Specifically, the iterates are recursively updated as

X_{k+1} = Z_k(δ^h τ_k) = π(X_k − δ^h τ_k ∇F(X_k)),   (7)

where δ ∈ (0, 1), τ_k is one of the BB-steps in (6) and h is the smallest non-negative integer that verifies the condition

F(Z_k(δ^h τ_k)) ≤ C_k + σ ⟨∇F(X_k), Z_k(δ^h τ_k) − X_k⟩,

where C_{k+1} is formed by the convex combination of C_k and F(X_{k+1}) given by C_{k+1} = (γ Q_k C_k + F(X_{k+1})) / Q_{k+1}, where Q_{k+1} = γQ_k + 1, with Q_0 = 1 and C_0 = F(X_0). The proposed non-monotone gradient projection method to deal with the numerical solution of problem (1) is summarized in Algorithm 1.
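The pieces above can be assembled into a compact sketch of the whole method. The following Python/NumPy code (a stand-in for our MATLAB implementation; parameter values gamma, sigma, delta and the alternation rule between the two BB-steps are illustrative assumptions) combines the projection arc, the BB step sizes (6) and the Zhang-Hager reference value C_k.

```python
import numpy as np

def proj_psd(X):
    """Projection onto S+(n): symmetrize, clip negative eigenvalues."""
    Xs = 0.5 * (X + X.T)
    w, V = np.linalg.eigh(Xs)
    return (V * np.maximum(w, 0.0)) @ V.T

def opt_psdp(A, B, X0, gamma=0.85, sigma=1e-4, delta=0.5, tol=1e-8, max_iter=1000):
    """Sketch of the proposed method for min_{X in S+(n)} 0.5*||X A - B||_F^2:
    projected gradient with alternating Barzilai-Borwein steps, globalized by
    the Zhang-Hager non-monotone line search."""
    F = lambda M: 0.5 * np.linalg.norm(M @ A - B, 'fro') ** 2
    grad = lambda M: (M @ A - B) @ A.T
    X, G = X0, grad(X0)
    C, Q, tau = F(X0), 1.0, 1.0        # C_0 = F(X_0), Q_0 = 1
    for k in range(max_iter):
        # non-monotone search on the projection arc: shrink t until
        # F(Z) <= C_k + sigma * <grad F(X_k), Z - X_k>
        t = tau
        while True:
            Z = proj_psd(X - t * G)
            D = Z - X
            if F(Z) <= C + sigma * np.sum(G * D) or t < 1e-12:
                break
            t *= delta
        if np.linalg.norm(D, 'fro') < tol:
            X = Z
            break
        Gnew = grad(Z)
        S, Y = D, Gnew - G
        sy = np.sum(S * Y)
        # alternate the two BB step sizes in (6), taking absolute values
        if k % 2 == 0:
            tau = abs(np.sum(S * S) / sy) if sy != 0 else 1.0
        else:
            yy = np.sum(Y * Y)
            tau = abs(sy / yy) if yy != 0 else 1.0
        tau = min(max(tau, 1e-10), 1e10)   # safeguard the step size
        # C_{k+1} = (gamma*Q_k*C_k + F(X_{k+1})) / Q_{k+1}, Q_{k+1} = gamma*Q_k + 1
        Qn = gamma * Q + 1.0
        C = (gamma * Q * C + F(Z)) / Qn
        Q = Qn
        X, G = Z, Gnew
    return X
```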
Remark 1: Note that if we select γ = 0 in the previous algorithm, then Algorithm 1 reduces to the classical gradient projection method. Observe also that Algorithm 1 can be used to minimize any smooth objective function over the matrix set S+(n); however, the interest of this work is focused on the particular problem (1).
[Algorithm 1: OptPSDP, the proposed non-monotone spectral gradient projection method (pseudocode box; closing lines: 9: k = k + 1; 10: end while; 11: X* = X_k).]

Note that step 5 of Algorithm 1 is the step that requires the most computational effort, because it uses the projection operator defined in (3), which requires computing a spectral decomposition; this is computationally expensive. To avoid computing such a spectral decomposition at every iteration, we propose the following idea: first note that if the symmetric part of Y_k = X_k − τ_k∇F(X_k) is positive definite, then this matrix coincides with its projection onto S+(n), so that X_{k+1} = Z_k(τ_k) = ½(Y_k + Y_k⊤), according to (4). Thus, we propose to use the Cholesky factorization to make Algorithm 1 more efficient. Specifically, in step 5 we attempt to compute the Cholesky factorization of ½(Y_k + Y_k⊤); if no error is generated, then X_{k+1} is updated as X_{k+1} = ½(Y_k + Y_k⊤); otherwise, X_{k+1} = Z_k(τ_k) is computed using the projection operator. In Section 4, we demonstrate numerically the efficiency of this strategy on some numerical tests.
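The Cholesky test described above can be sketched as follows; this is a Python/NumPy stand-in for the MATLAB implementation (np.linalg.cholesky raises an error exactly when the matrix is not positive definite, mirroring MATLAB's chol error flag), and the function name cheap_update is our own.

```python
import numpy as np

def cheap_update(Y):
    """Attempt a Cholesky factorization of the symmetric part of
    Y = X_k - tau_k * grad F(X_k); if it succeeds, the matrix is positive
    definite and equals its own projection onto S+(n), so the expensive
    spectral decomposition is skipped."""
    Ys = 0.5 * (Y + Y.T)
    try:
        np.linalg.cholesky(Ys)             # raises LinAlgError if not PD
        return Ys                          # already feasible: no projection needed
    except np.linalg.LinAlgError:
        w, V = np.linalg.eigh(Ys)          # fall back to the projection (3)
        return (V * np.maximum(w, 0.0)) @ V.T
```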

Another Point of View of Algorithm 1.
In this subsection we derive Algorithm 1 from an algorithm recently proposed by Francisco et al. in [7]. In addition, we establish a convergence result for Algorithm 1.
In [7] the authors propose a globally convergent non-monotone algorithm to numerically solve the optimization problem

min f(x), s.t. x ∈ Ω,   (8)

where Ω is a closed subset of R^n and f : R^n → R is a continuously differentiable function on an open set Ω̃ such that Ω ⊂ Ω̃. The algorithm proposed by Francisco et al. builds a sequence of iterates as follows: given the current point x_k ∈ Ω, a positive scalar ρ_k > 0 and two symmetric matrices A_k, B_k, with A_k positive definite, the next trial point x_{k+1} is computed as the minimizer over Ω of the quadratic model

m_k(x) = f(x_k) + ⟨∇f(x_k), x − x_k⟩ + ½ ⟨(B_k + ρ_k A_k)(x − x_k), x − x_k⟩,   (9)

where ρ_k acts as a regularization parameter. This method is based on the ideas of trust-region methods [12] and the well-known Levenberg-Marquardt method [12]. The authors of [7] combine these ideas with the non-monotone technique proposed by Zhang and Hager [18], and thus obtain a very general method for solving the non-linear optimization problem (8).
The rest of this subsection is dedicated to showing that Algorithm 1 can be seen as a particular case of the algorithm proposed in [7]. To this end, it suffices to show that the update formula (4) of our proposal is equivalent to minimizing a quadratic model over S+(n), since the non-monotone strategy for choosing the step size is the same for both algorithms.
Proposition 2: Let X_k ∈ S+(n) be the point generated by Algorithm 1 at the k-th iteration. If τ > 0, then Z_k(τ) = π(X_k − τ∇F(X_k)) is the minimizer over the set S+(n) of the quadratic model

Q_k(X) = F(X_k) + ⟨∇F(X_k), X − X_k⟩ + (1/(2τ)) ||X − X_k||²_F.   (10)
Proof. Since Z_k(τ) = π(X_k − τ∇F(X_k)), then Z_k(τ) is a solution of

min_{X ∈ S+(n)} J(X) := ||X − (X_k − τ∇F(X_k))||²_F.

From the definition of J(X) and using trace properties we have

J(X) = ||X − X_k||²_F + 2τ ⟨∇F(X_k), X − X_k⟩ + τ² ||∇F(X_k)||²_F.

Now, since τ²||∇F(X_k)||²_F is constant, minimizing J(·) is equivalent to minimizing the function Ĵ(·) given by

Ĵ(X) = ||X − X_k||²_F + 2τ ⟨∇F(X_k), X − X_k⟩,

or equivalently,

Ĵ(X) = 2τ [ ⟨∇F(X_k), X − X_k⟩ + (1/(2τ)) ||X − X_k||²_F ].

Then, since τ and F(X_k) are constant for the optimization process over S+(n), minimizing Ĵ(·) over S+(n) is equivalent to minimizing the quadratic function Q_k(X) defined in (10) over the set S+(n), which completes the proof.
Note that Proposition 2 shows that Algorithm 1 is a particular case of the algorithm proposed by Francisco et al. [7], obtained by taking ρ_k = 1, B_k equal to the null matrix and A_k = (1/τ_k) I_n at each iteration. This result implies that Algorithm 1 is globally convergent, which is established in Theorem 1.
Theorem 1: Let {X_k} be a sequence generated by Algorithm 1 and assume that γ < 1. Then every accumulation point of {X_k} is a stationary point of problem (1).
Numerical Experiments

In the rest of this section, we denote by "Nitr" the average number of iterations, by "Nfe" the average number of function evaluations, by "Time" the average execution time in seconds, by "Fval" the average value of the objective function at the point X̄, which denotes the optimum estimated by each algorithm, by "Error" the average global error ||X* − X̄||_F, where X* denotes the global optimum of each PSDP problem, and finally by "XErr" the average error ||X̄ − X_k||_F, where X_k is the penultimate point generated by each algorithm. In addition, we denote by Grad the classical gradient projection method proposed in [8], by FGM the accelerated gradient projection method proposed in [8], by ParTan the parallel tangents method introduced in [1], and by OptPSDP our proposal.
For the numerical experiments, we consider problem (1) where the matrix A ∈ R^{n×m} is built as A = PΛQ⊤, where P ∈ R^{n×n} and Q ∈ R^{m×m} are randomly generated orthogonal matrices and Λ ∈ R^{n×m} is a diagonal matrix defined as explained below. The starting point X_0 was generated as X_0 = π(X̃_0), where X̃_0 was randomly generated. In order to monitor the behavior of the algorithms, the optimal solution is generated as X* = π(X̃), where X̃ ∈ R^{n×n} was randomly generated. Then, the matrix B ∈ R^{n×m} was taken as B = X*A; in this way, X* is a global optimum of problem (1) with optimal value zero, i.e. F(X*) = 0. All random values were generated following a standard normal distribution using the randn function of MATLAB.
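The construction above can be sketched as follows in Python/NumPy (a stand-in for the MATLAB code; the function name make_psdp_instance and its signature are illustrative). A random orthogonal matrix is obtained as the Q factor of the QR factorization of a standard normal matrix.

```python
import numpy as np

def make_psdp_instance(n, m, lam, seed=0):
    """Synthetic PSDP instance: A = P Lam Q^T with random orthogonal P, Q and
    prescribed diagonal lam, and B = X* A so the known optimum X* has F(X*) = 0."""
    rng = np.random.default_rng(seed)
    P, _ = np.linalg.qr(rng.standard_normal((n, n)))   # random orthogonal P
    Q, _ = np.linalg.qr(rng.standard_normal((m, m)))   # random orthogonal Q
    k = min(n, m)
    Lam = np.zeros((n, m))
    Lam[:k, :k] = np.diag(np.asarray(lam)[:k])
    A = P @ Lam @ Q.T
    # X* is the projection onto S+(n) of a random symmetric matrix
    Zr = rng.standard_normal((n, n))
    Zs = 0.5 * (Zr + Zr.T)
    w, V = np.linalg.eigh(Zs)
    Xstar = (V * np.maximum(w, 0.0)) @ V.T
    B = Xstar @ A
    return A, B, Xstar
```

By construction, the singular values of A are exactly the prescribed diagonal entries, so the conditioning of A is fully controlled by lam.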
In addition, we consider the following three distributions for the entries of Λ.

Problem 1: The diagonal entries of Λ are generated by a normal distribution truncated to the interval [10,12].

Problem 2: The diagonal of Λ is given by λ_ii = i + 2r_i, where r_i is randomly generated from the uniform distribution on the interval [0,1].
Problem 3: Each diagonal element of Λ is generated as λ_ii = 1 + 99(i−1)/(m+1) + 2r_i, where r_i is randomly generated from the uniform distribution on the interval [0,1].
Observe that if Λ is generated following the structure of Problem 1, then A is a well-conditioned matrix, while if it is generated by the diagonal structures described in Problems 2 and 3, then A is an ill-conditioned matrix. In order to study the numerical behavior and performance of all methods, we consider several sizes of PSDP problems and different condition numbers of A. In all tables, we present the averages of the comparison values obtained by each algorithm over a total of 50 independent instances.
In the first experiment, we study the efficiency of the proposed method on well-conditioned PSDP problems. Table 1 summarizes the numerical results of this comparison. From Table 1 we observe that the methods that converge fastest are ParTan and OptPSDP. In addition, if A is rectangular, then the most efficient method in terms of CPU time is ParTan; however, our proposal is clearly more efficient for problems where A is square. According to the error XErr, all algorithms reach an order of less than 1e-5, and additionally, the value of Fval is close to zero for all algorithms.
In Tables 2 and 3 we present the results obtained by the four procedures on ill-conditioned PSDP problems. These tables clearly show that the Grad algorithm obtains the worst results, since it sometimes reaches the maximum number of iterations allowed and it is the slowest in terms of CPU time.
On the other hand, we observe that the FGM, ParTan and OptPSDP methods show similar performance, both in the number of iterations and in execution time, when m = n. However, when A is a rectangular matrix, the most efficient method is ParTan. In spite of this, all methods reach convergence, since all obtain small values of XErr.
For the fourth experiment group, the PSDP problems were constructed with randomly generated synthetic data as explained at the beginning of this section; however, the optimal matrix X* was built as follows. First, a matrix M ∈ R^{n×n} is randomly generated with entries following a standard normal distribution; then V is obtained as the orthogonal factor of the QR factorization of M. From this matrix, we set X* = V⊤ΣV, where Σ ∈ R^{n×n} is a diagonal matrix whose diagonal elements were generated by Σ(1,1) = Σ(2,2) = 0 and Σ(i,i) = rand for all i ∈ {3, 4, ..., n}, using MATLAB notation. Thus, the optimal solution of the generated PSDP is a symmetric and positive semi-definite matrix with exactly two eigenvalues equal to zero and n − 2 strictly positive eigenvalues.
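The rank-deficient optimum just described can be sketched as follows (Python/NumPy stand-in for the MATLAB construction; the function name rank_deficient_optimum is our own):

```python
import numpy as np

def rank_deficient_optimum(n, seed=0):
    """Optimum X* = V^T Sigma V where V is the orthogonal factor of the QR
    factorization of a random matrix, Sigma(1,1) = Sigma(2,2) = 0 and the
    remaining diagonal entries of Sigma are uniform on [0,1)."""
    rng = np.random.default_rng(seed)
    M = rng.standard_normal((n, n))
    V, _ = np.linalg.qr(M)             # orthogonal factor of M
    d = rng.random(n)                  # Sigma(i,i) = rand for i = 3, ..., n
    d[0] = d[1] = 0.0                  # exactly two zero eigenvalues
    return V.T @ np.diag(d) @ V
```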
The numerical results corresponding to this fourth experiment are shown in Table 4. The table shows that the ParTan algorithm obtained the best performance in terms of the number of iterations in almost all experiments. In addition, we observe that our OptPSDP is the most efficient procedure in terms of CPU time on both well-conditioned and ill-conditioned problems. From all the experiments performed, we conclude that our proposal is a competitive alternative for solving problem (1) under different conditioning and scale situations of A.

Conclusions
Problem (1) has a wide range of applications in the fields of structural analysis, physical problems, signal processing, and estimation of correlation matrices, among others. To address this problem, we designed and implemented an efficient and globally convergent algorithm that preserves feasibility at each iteration. Our proposal is based on the gradient projection method, and we incorporate a non-monotone strategy in combination with the Barzilai-Borwein step sizes in order to accelerate convergence. The bottleneck of the proposed algorithm is the computation of the projection operator, which is computationally expensive. To improve the efficiency of our algorithm, we presented a strategy based on the Cholesky factorization that reduces the number of projections; this technique can be a good alternative for dealing with large-scale problems. Some theoretical results were presented. Finally, the numerical experiments show that the performance of the resulting algorithm is quite competitive with state-of-the-art methods.