Upscaling Vector Approximate Message Passing

In this paper we consider the problem of recovering a signal x of size N from noisy and compressed measurements y = Ax + w of size M, where the measurement matrix A is right-orthogonally invariant (ROI). Vector Approximate Message Passing (VAMP) demonstrates excellent reconstruction results, even for highly ill-conditioned matrices A, in relatively few iterations. However, performing each iteration is challenging from either a computational or a memory point of view. On the other hand, the recently proposed Conjugate Gradient (CG) Expectation Propagation (CG-EP) framework sacrifices some performance for efficiency, but requires access to the exact singular spectrum of A. In this work we develop a CG-VAMP algorithm that does not require such information, is feasible to implement, and converges to a neighborhood of the original VAMP.


INTRODUCTION
Consider the recovery of a random signal x ∈ R^N from the general linear model

y = Ax + w, (1)

where y ∈ R^M is the measurement vector, w ∈ R^M is a zero-mean i.i.d. Gaussian noise vector w ∼ N(0, v_w I_M) and A ∈ R^{M×N} is a measurement matrix. We consider the compressed sensing scenario M << N with the ratio M/N = δ = O(1) fixed. In this work we assume that x has a finite 4th order moment and variance v_x = lim_{N→∞} (1/N) x^T x. Additionally, following [6], we assume that in the SVD A = USV^T the matrix V is Haar distributed and therefore A is a right-orthogonally invariant (ROI) matrix.
The Vector Approximate Message Passing (VAMP) algorithm [6], [2] is an iterative algorithm that estimates the signal x from (1) and can be derived from the Expectation Propagation (EP) framework. VAMP can achieve Bayes-optimal reconstruction results [6] when A is ROI and the two estimators involved in VAMP are chosen to be Bayes optimal. While the first estimator, the denoiser g_B, tries to estimate the signal x from a zero-mean i.i.d. Gaussian channel, the second estimator, the LMMSE estimator g_A, estimates an intrinsic i.i.d. Gaussian signal x_0 substituted in (1) from y. The challenge in implementing VAMP is the design of these estimators and the computation of their parameters, divergences and Mean Squared Error (MSE) estimates, which ensure correct and efficient operation of the algorithm.

[This work was supported by the ERC project C-SENSE (ERC-ADG-2015-694888). MD is also supported by a Royal Society Wolfson Research Merit Award. We would like to thank James Hopgood and Alessandro Perelli for useful discussions.]
Designing good denoisers g_B and computing the corresponding parameters is a relatively well-studied problem (see [3] and [7] and references therein). In contrast, there has been much less focus on efficiently implementing the LMMSE estimator (with the exceptions of [11] and [8]). The LMMSE estimator operating in the VAMP algorithm requires computing a linear mapping that involves inverting either an N×N [6] or an M×M [10] matrix. Inversion of such a matrix has complexity O(N^3) and is not feasible for large-scale problems. This problem persists even when fast implementations of the measurement matrix A exist. In the original VAMP [6], this computation was addressed by precomputing the SVD of the measurement matrix A, reducing the required computation to order O(N^2). However, when both N and M are on the order of hundreds of thousands or more, as in CT imaging, storing the matrix of singular vectors becomes impossible. In such applications even the measurement matrix A might not be stored, but only applied on-the-fly.
The computational/memory problem was partially addressed in [11], where the authors suggested approximating the LMMSE solution with the Conjugate Gradient (CG) algorithm, which only requires matrix-vector products with AA^T, and proposed a rigorous method for estimating the key parameters of the algorithm. However, estimating those parameters requires the moments of the singular spectrum of AA^T of order up to 2i + 2, where i is the number of CG iterations. Even if these moments were available (which in general they are not), for highly ill-conditioned A the higher order moments of AA^T cannot be manipulated reliably on devices with finite precision arithmetic. A final weakness of the CG approximation to the EP algorithm (CG-EP) is that it has weaker convergence guarantees due to the LMMSE approximation, which can result in inferior phase transition performance.
The contributions of this paper are twofold. First, we develop a practical method of estimating the key parameters of the CG-EP algorithm without referring to the singular moments of A. Second, the proposed method is used to construct an auto-tuned CG algorithm that computes its own divergence and the MSE of the produced estimate, and chooses the number of iterations that ensures consistent progression of the overall CG-VAMP algorithm. The per-iteration cost of the proposed CG algorithm is only about twice that of the CG algorithm in [11].
The paper is organized as follows. In the second section we discuss the key ingredients and limitations of the VAMP algorithm and consider the main constraints of its approximated version, the CG-EP algorithm. In the third section we introduce the main theoretical tool for the practical implementation of the CG-VAMP algorithm. In the following section the proposed tool is used to construct the auto-tuned CG algorithm that ensures a certain per-iteration progression of the CG-VAMP algorithm. The last section is dedicated to numerical experiments with the proposed methods.

EXPECTATION PROPAGATION BASED ALGORITHMS
For the measurement system (1) with ROI A, iterative recovery of the signal x can be addressed by a family of algorithms derived from Expectation Propagation (EP). Vector Approximate Message Passing (VAMP) is one of the first EP-motivated inference algorithms designed for recovery of x from measurements (1). Although in the following we concentrate on the VAMP algorithm, we note that VAMP [6], the exact EP for ROI measurements [10] and CG-EP [11] are all instances of the general EP-based framework shown in Algorithm 1.
The EP-based framework of Algorithm 1 represents iterative passing of the messages x^t_{A→B} and x^t_{B→A} between two blocks: Block A and Block B. Under some mild conditions on g_A and g_B [2] and for ROI matrices A, it can be shown that in the LSL these messages are equivalent to the true signal x corrupted by additive noise

x^t_{A→B} = x + h^t, (2)
x^t_{B→A} = x + q^t. (3)

Notably, due to the Haar matrix V and the choice of the scalars γ^t_A and γ^t_B, the noise vector h^t can be shown to be equivalent to a zero-mean i.i.d. Gaussian vector that is independent of x [10]. The variances v^t_{A→B} and v^t_{B→A} of the noise vectors h^t and q^t respectively define the current state of the algorithm. At each iteration t, the functions g_A and g_B attempt to reduce the variances of these noise terms.
Calculation of the two messages requires the computation of two pairs of scalars: γ^t_A and γ^t_B, and ṽ^t_{A→B} and ṽ^t_{B→A}. Here, the scalars γ^t_A and γ^t_B correspond to (the inverse of) the divergences of the functions g_A and g_B and allow the progress of the algorithm to be described by a simple one-dimensional State Evolution (SE). Additionally, the functions f_A and f_B produce estimates ṽ^t_{A→B} and ṽ^t_{B→A} of the true variances v^t_{A→B} and v^t_{B→A}.
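The two-block structure described above can be sketched as a generic EP loop. The sketch below uses one common extrinsic parameterization in which the divergence α plays the role of 1/γ; the callables g_A and g_B and their (estimate, divergence) interface are illustrative assumptions, not the paper's exact Algorithm 1.

```python
import numpy as np

def extrinsic(r, v, x_hat, alpha):
    """Generic EP extrinsic update: remove the contribution of the
    incoming message r from the estimate x_hat. Here alpha is the
    divergence of the estimator (the paper's gamma corresponds to its
    inverse); this parameterization is a common convention, assumed here."""
    r_out = (x_hat - alpha * r) / (1.0 - alpha)
    v_out = v * alpha / (1.0 - alpha)
    return r_out, v_out

def ep_loop(g_A, g_B, r0, v0, n_iter=20):
    """Two-block message passing: g_A and g_B each map an incoming
    message (r, v) to a pair (estimate, divergence)."""
    r_BA, v_BA = r0, v0                      # message Block B -> Block A
    for _ in range(n_iter):
        x_hat_A, a_A = g_A(r_BA, v_BA)       # e.g. (CG-approximated) LMMSE
        r_AB, v_AB = extrinsic(r_BA, v_BA, x_hat_A, a_A)
        x_hat_B, a_B = g_B(r_AB, v_AB)       # e.g. a denoiser
        r_BA, v_BA = extrinsic(r_AB, v_AB, x_hat_B, a_B)
    return x_hat_B
```

In this form, Block A and Block B only exchange the pairs (r, v), which is what makes the one-dimensional State Evolution description possible.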

Algorithm 1: EP-based Algorithm

In the VAMP version of Algorithm 1, the function g_B is chosen to estimate x from the message x^t_{A→B}, which from (2) allows g_B to be interpreted as a denoiser. Today, there are numerous Plug-and-Play methods for denoising a signal observed through a Gaussian channel, including BM3D [1], convolutional denoisers/neural networks [2], etc. Additionally, the divergence and MSE estimates for these denoisers can be computed using Monte-Carlo SURE techniques [5], [9].
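For a black-box denoiser, the divergence needed for γ^t_B can be approximated with a single-probe Monte-Carlo SURE estimate in the spirit of [5], [9]. The function below is a minimal sketch; the denoiser interface denoiser(r, v) is an assumption.

```python
import numpy as np

def mc_divergence(denoiser, r, v, eps=1e-3, seed=None):
    """Single-probe Monte-Carlo divergence estimate:
        div(g) ~= eta^T (g(r + eps*eta, v) - g(r, v)) / (eps * N)
    with eta an i.i.d. Gaussian probe. Averaging over several probes
    reduces the variance of the estimate."""
    rng = np.random.default_rng(seed)
    eta = rng.standard_normal(r.shape)
    delta = denoiser(r + eps * eta, v) - denoiser(r, v)
    return float(eta @ delta) / (eps * r.size)
```

For a soft-thresholding denoiser, for instance, this recovers approximately the fraction of entries whose magnitude exceeds the threshold.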
A completely different picture is observed in Block A. In VAMP [6] and in the exact EP for ROI matrices A [10], the function g_A is the linear mapping

g_A(x^t_{B→A}) = x^t_{B→A} + ṽ^t_{B→A} A^T W_t^{-1} (y − A x^t_{B→A}), (4)

where the matrix W_t is

W_t = ṽ_w I_M + ṽ^t_{B→A} AA^T (5)

and ṽ_w is an estimate of the variance of the measurement noise w in (1). For this choice of g_A, the scalar γ^t_A corresponds to the divergence of the mapping (4), with a closed form given in (6). Additionally, in the LSL, an unbiased estimate ṽ^t_{A→B} of v^t_{A→B} would have the following simple form [10]

ṽ^t_{A→B} = γ^t_A − ṽ^t_{B→A}.

Despite the potential optimality of the resulting algorithm, computing the mapping (4) and the corresponding divergence γ^t_A (6) is costly from either a computational or a memory point of view. In [11] the authors addressed this by approximating (4) using the CG algorithm shown in Algorithm 2. The resulting CG-EP algorithm involves a function g_A corresponding to i[t] iterations of the CG algorithm that approximates the solution µ^t of the linear system

W_t µ^t = z^t, (7)

where z^t = y − A x^t_{B→A}. It was shown that calculation of the two scalars γ^{t,i[t]}_A and ṽ^t_{A→B} requires the moments ξ_j of the LSL density function of the eigenvalues of AA^T [11], of order up to j = 2i + 2. In practice, the suggested method for computing γ^{t,i[t]}_A and ṽ^t_{A→B} might be impossible to apply for the reasons discussed before.
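For concreteness, a standard CG iteration for a system of the form (7), touching A only through products with AA^T, might look as follows. This is a generic textbook CG sketch, not a reproduction of the paper's Algorithm 2.

```python
import numpy as np

def cg_solve(matvec_AAT, v_w, v_BA, z, n_iter):
    """Approximate the solution of (v_w*I + v_BA*AA^T) mu = z with
    n_iter Conjugate Gradient iterations. matvec_AAT(v) must return
    (A A^T) v, so A itself can be applied on-the-fly."""
    def W(v):
        return v_w * v + v_BA * matvec_AAT(v)
    mu = np.zeros_like(z)
    r = z.copy()                  # residual r_0 = z - W mu_0
    p = r.copy()                  # search direction
    rs = float(r @ r)
    for _ in range(n_iter):
        Wp = W(p)
        alpha = rs / float(p @ Wp)
        mu += alpha * p
        r -= alpha * Wp
        rs_new = float(r @ r)
        p = r + (rs_new / rs) * p
        rs = rs_new
    return mu
```

Each iteration costs one product with AA^T plus O(M) vector operations, which is what makes the approach attractive when A is only available as an operator.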

We consider estimation of γ^{t,i[t]}_A and ṽ^t_{A→B} in the LSL, where we expect high accuracy of the Strong Law of Large Numbers (for details, see Theorem 1 in [10]). Although in general CG iterations constitute nonlinear mappings, as shown in [9], in the LSL the CG approximation of (7) with i[t] iterations, g^{i[t]}_A, acts as a linear mapping of the form

µ^t = U H_t U^T z^t, (8)

where U is the matrix of left-singular vectors of A and H_t is a diagonal matrix with entries depending only on the singular spectrum of A, v_w and v^t_{B→A}. Note that (8) does not depend on the vector being mapped (z^t = y − A x^t_{B→A}), but only on the variances of the vectors w and q^t involved in y and x^t_{B→A} respectively. Additionally, by using the definition of the vector y from (1) and the message x^t_{B→A} from (3), the vector z^t can be rewritten as

z^t = y − A x^t_{B→A} = w − A q^t = w − UΣV^T q^t, (9)

which is independent of the signal x. Furthermore, it was proved in [6] and [10] that the vector

b^t = V^T q^t (10)

converges to an i.i.d. Gaussian vector and is independent of w, U and Σ. By using (8), (9) and (10) we conclude that the output of the CG algorithm,

µ^t = U H_t (U^T w − Σ b^t),

corresponds to a linear mapping of a Gaussian vector. We therefore propose the following method for estimating γ^{t,i[t]}_A and ṽ^t_{A→B}. The first step is to generate two synthetic vectors ẇ ∼ N(0, ṽ_w I_M) and q̇^t ∼ N(0, ṽ^t_{B→A} I_N) and to set ẏ = ẇ and ẋ^t_{B→A} = q̇^t, where the dot highlights that we have effectively set the synthetic signal ẋ = 0. We then form the corresponding synthetic vector

ż^t = ẏ − A ẋ^t_{B→A}

and execute the CG algorithm with the original matrix W_t on the synthesized vector ż^t, obtaining the output µ̇^t and the divergence γ̇^{t,i[t]}_A (11), where in (a) we used the Strong Law of Large Numbers, in (b) we used (10) and Stein's Lemma, and in (c) we used (8).

Since the divergence γ̇^{t,i[t]}_A for the synthesized data is only a function of A, v_w and v^t_{B→A}, in the limit the two divergences coincide,

lim_{N→∞} γ̇^{t,i[t]}_A = lim_{N→∞} γ^{t,i[t]}_A,

and we therefore propose γ̇^{t,i[t]}_A as the estimator of γ^{t,i[t]}_A. Calculation of an estimate of the variance v^t_{A→B} proceeds in a similar way. By the definition of v^t_{A→B} we have a limit expression (12) which depends only on the singular spectrum of A, v_w and v^t_{B→A}. Therefore, substituting the synthesized data into (12) and calculating the resulting norm will produce the same result in the LSL. Thus we propose the corresponding estimator (13) of the variance v^t_{A→B}.
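The synthetic-data construction above can be sketched as follows. Since the synthetic signal is ẋ = 0, the error of any estimate formed from the synthetic run is the estimate itself, so its statistics can be read off directly. The code is an illustrative stand-in for the estimators (11)-(13), not their exact form: cg is any routine approximating W_t^{-1} ż, and the final LMMSE-style update mirroring (4) is an assumption.

```python
import numpy as np

def synthetic_probe(A, v_w, v_BA, cg, seed=None):
    """Run the Block A mapping on synthetic data with the true signal
    set to 0 and matched noise variances. Returns an empirical estimate
    of the MSE of the Block A estimate; the divergence can be probed
    from the same synthetic run in a similar way."""
    M, N = A.shape
    rng = np.random.default_rng(seed)
    w_dot = np.sqrt(v_w) * rng.standard_normal(M)    # synthetic noise w
    q_dot = np.sqrt(v_BA) * rng.standard_normal(N)   # synthetic q^t
    z_dot = w_dot - A @ q_dot                        # (9) with x = 0
    mu_dot = cg(z_dot)                               # approx. W_t^{-1} z_dot
    # LMMSE-style estimate in the form of (4) on the synthetic data;
    # with x = 0 its error equals the estimate itself.
    x_hat_dot = q_dot + v_BA * (A.T @ mu_dot)
    return float(x_hat_dot @ x_hat_dot) / N
```

Because the statistics of the synthetic run depend only on A, v_w and v^t_{B→A}, this empirical quantity concentrates, in the LSL, on the same limit as the corresponding quantity for the real data.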

AUTO-TUNED CG
By approximating the exact LMMSE solution with the CG algorithm, the resulting CG-VAMP sacrifices convergence rates and guarantees. As proved for VAMP in [6] and for CG-EP in [11], the progression and the fixed points of the two algorithms can be described through a one-dimensional State Evolution (SE). Because the LMMSE estimator is the optimal g_A [6] and the SE is 1D, the SE variance of CG-VAMP at each iteration is lower bounded by that of the exact VAMP [4]. Moreover, by choosing an insufficient number of CG iterations, the variance at iteration t + 1 may not reduce below the variance at iteration t, i.e. v^{t+1} ≥ v^t, which is equivalent to the algorithm converging to a new spurious fixed point.
To ensure progression of Block A, the proposed CG algorithm uses the following simple rule:

ṽ^t_{A→B}(i) ≤ c ṽ^{t−1}_{A→B}, (14)

where the scalar c represents a bound on the expected variance drop in VAMP after each iteration t (until it is in the neighborhood of the VAMP fixed point). Once the inequality (14) is met, or after a maximum number of iterations, the CG algorithm is terminated.
In the proposed CG algorithm, the variance ṽ^t_{A→B}(i) is estimated using (13), which requires the calculation of γ̃^{t,i[t]}_A first. Using (11), and noting that the matrix-vector product A q^t needs to be computed only once, the calculation of γ̃^{t,i[t]}_A requires only O(N) additional computations per iteration i. On the other hand, direct evaluation of (13) requires computing A^T µ^i_t and would double the complexity. Instead, by expanding the norm and using the definition of γ̃^{t,i[t]}_A, we arrive at the equivalent estimator (15). Next, for the CG with synthesized data we reuse the scalars α^j_t and β^j_t from the CG with the real data, and avoid computing the residual vector r_{i+1} in Algorithm 2 by using the second option for calculating the vector p_{i+1} on line 6. Now, since the matrix-vector product AA^T µ̇^{i+1}_t is already a part of the CG algorithm, computing (15) costs only O(N) computations.
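The stopping rule (14) then amounts to a simple control loop around the CG iteration. In this sketch, step advances the CG state by one inner iteration and variance_estimate returns the current ṽ^t_{A→B}(i); both callables are hypothetical interfaces, not the paper's exact pseudocode.

```python
def auto_tuned_cg_iterations(step, variance_estimate, v_prev, c=0.9, max_iter=500):
    """Run inner CG iterations until the estimated extrinsic variance
    satisfies rule (14), v(i) <= c * v_prev, or max_iter is reached.
    Returns the number of CG iterations performed."""
    for i in range(1, max_iter + 1):
        step()
        if variance_estimate() <= c * v_prev:
            break
    return i
```

Because the variance estimate is updated at O(N) cost per inner iteration, checking the rule after every CG step adds negligible overhead.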

SIMULATION RESULTS
In this section we empirically investigate the quality of the estimate ṽ^t_{A→B} suggested in (15), and the performance of the auto-tuned CG-VAMP in recovering x from a highly ill-conditioned measurement system (1). We consider a Bernoulli-Gaussian signal x of size N = 2^14 in the same settings as in the original VAMP paper [6]. We consider a measurement matrix A with geometric singular values, which is one of the worst cases, and condition number 10^4. The compression rate is chosen to be δ = 0.5. The auto-tuned CG algorithm iterates until the inequality (14) is met, with the scalar c = 0.9. The total number of outer iterations t was chosen to be 50 and we average the results over 100 realizations. Figure 1(a) plots the change of ṽ^t_{A→B} over iterations t. The main line shows the mean, while the shaded region shows the standard deviation of the variance over the 100 realizations. From the plot we observe a consistent reduction of the variance after each iteration t, as prescribed by the inequality (14).
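An ill-conditioned ROI matrix of this kind can be generated by the standard construction below (a sketch; normalization conventions may differ from those used to produce the figures).

```python
import numpy as np

def haar_orthogonal(n, rng):
    """Haar-distributed orthogonal matrix via QR of a Gaussian matrix;
    the sign correction makes the distribution exactly Haar."""
    Q, R = np.linalg.qr(rng.standard_normal((n, n)))
    return Q * np.sign(np.diag(R))

def geometric_roi_matrix(M, N, cond=1e4, seed=None):
    """A = U S V^T with geometrically spaced singular values
    (s_1 / s_M = cond) and Haar-distributed U and V, so A is ROI."""
    rng = np.random.default_rng(seed)
    U = haar_orthogonal(M, rng)
    V = haar_orthogonal(N, rng)
    s = cond ** (-np.arange(M) / (M - 1))  # geometric decay from 1 to 1/cond
    return U @ (s[:, None] * V[:, :M].T)
```

For the dimensions used in the paper's experiments one would not form U and V explicitly; the construction here is only meant to reproduce the spectrum at small scale.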
Figure 1(b) illustrates the mean and the standard deviation of the normalized squared difference between the variance estimated using (15) and the true variance v^t_{A→B}. As we see, even for moderately large systems the quality of the estimator (15) is satisfactory, and it only improves as the algorithm progresses.
Figure 2 shows the average and the standard deviation of the number of CG iterations needed to satisfy the inequality (14) at each iteration of CG-VAMP. As we see, the required number of CG iterations varies as CG-VAMP progresses, which confirms the need for a tool that adaptively chooses the number of CG iterations.

CONCLUSIONS
In this work we developed a practical implementation of approximated VAMP that requires neither inversion of large matrices, nor precomputation of the SVD of A, nor information about the singular spectrum of A. The CG-VAMP algorithm is guaranteed to consistently reduce the MSE of the intrinsic measurements, provided the exact VAMP is capable of doing so. The proposed algorithm relies on the strong statistical properties of the Haar matrix V; with more practical measurement operators A, such as FFT matrices or CT operators, one might employ damping to increase stability, as suggested in [8]. On the other hand, these types of matrices have fast matrix-vector product implementations, and therefore the per-iteration complexity of CG-VAMP would reduce significantly. Analysis of the performance of the proposed CG-VAMP algorithm with structured A is left for future work.