Dictionary Learning for Sparse Approximations With the Majorization Method

In order to find sparse approximations of signals, an appropriate generative model for the signal class has to be known. If the model is unknown, it can be adapted using a set of training samples. This paper presents a novel method for dictionary learning and extends the learning problem by introducing different constraints on the dictionary. The convergence of the proposed method to a fixed point is guaranteed, unless the accumulation points form a continuum. This holds for different sparsity measures. The majorization method is an optimization method that substitutes the original objective function with a surrogate function that is updated in each optimization step. This method has been used successfully in sparse approximation and statistical estimation [ e.g., expectation-maximization (EM)] problems. This paper shows that the majorization method can be used for the dictionary learning problem too. The proposed method is compared with other methods on both synthetic and real data and different constraints on the dictionary are compared. Simulations show the advantages of the proposed method over other currently available dictionary learning methods not only in terms of average performance but also in terms of computation time.


I. INTRODUCTION
O RTHOGONAL function representations, introduced in the nineteenth century, are still a powerful tool in signal analysis.These representations have unique characteristics that make them suitable for many signal processing applications.In the last two decades, many researchers have tried to extend this idea to non-orthogonal and overcomplete representations [1], [2].The overcomplete representation problem with the associated underdetermined linear system does not have a unique solution.The method of frames finds the minimum mean square solution and leads to representations where most of the coefficients are non-zero.Minimum mean square representations are desirable for some applications (e.g.robust transform coding in the presence of noise or erasure [3]) while there are other applications where sparsity of the representation is more desirable, e.g. in Compressed Sensing [4].
Let y ∈ R d and x ∈ R N be the input signal and the coefficient vector respectively.The sparsest representation would be, where D is a d × N matrix, often called dictionary and ||.|| 0 is the sparsity measure that counts the number of nonzero coefficients.This formulation can be relaxed to sparse approximations by using ||y − Dx|| 2 ≤ ǫ with a small constant ǫ.Unfortunately finding the solutions to the above combinatorial problems is not easy in general [5].Many approximations/relaxations have been presented to find acceptable solutions, e.g.[6], [7].
These methods are more successful at finding a sparse x, when there is a suitable dictionary for the given signal.A simple method for dictionary generation is to add two or more orthogonal bases.Block-wise orthogonality can then be exploited to find the sparse approximation [8].This also makes it easier to analyze the performance of sparse approximation methods [9], [10].Another way to design a dictionary is to sample the parameters of an analytic function.For example a famous dictionary that has been used for overcomplete audio and image representations, is the Gabor dictionary [6].These designed dictionaries are efficient when we have some a priori information about the signal's generative model.Alternatively it is possible to adapt the dictionary to a given source using a set of training samples (Y = {y (i) : 1 ≤ i ≤ L} where L is the number of training samples).Dictionary learning is the process of finding a dictionary D in which a given set of training samples has sparse representations (or approximations) X = {x (i) : 1 ≤ i ≤ L}.Different methods have been proposed to learn dictionaries [11]- [15].These methods are generally based on alternating minimization.In one step, a sparse approximation/representation algorithm finds sparse representations of the training samples with a fixed dictionary.In the other step, the dictionary D is updated to decrease the average approximation error while X (or the sparsity of X [15]) remains fixed.Because the objective functions are nonconvex based on the pair of parameters (D,X ), these methods generally only find a local minimum and different initial value for D (or X ), lead to different solutions.Nevertheless, in practice, good results have been reported [16], [17].The proposed method in this paper uses a general formulation of alternating minimization.Therefore like other methods, we only expect to find local minima in general.

Contributions of the paper
This paper introduces a new algorithm for constrained dictionary learning which is very flexible and can use different constraints on the dictionary.The given method uses convex admissible sets whose boundaries are the same as the most frequently used admissible sets, however these convex sets allow the algorithm to generate a sequence throughout the sets (and not only on their boundaries).An advantage of the given method is that it optimizes a joint parameter objective function of the sparse coefficient matrix and the dictionary.In this framework, it is possible to choose a better path from the initial to the learnt dictionary by reducing the objective in different directions (coefficients or dictionary) in a cyclic way.This prevents oscillations of the sequence of updates around the optimal path and makes the algorithm more suitable for large scale problems, for which the calculation of sparse approximations of the training samples is often impossible.
Another advantage of the proposed algorithm is that we can impose a tighter constraint on the dictionary.For example, when a minimum size dictionary is required or when the optimum size of the dictionary is unknown, we can impose an additional penalty on the number of the atoms in the dictionary.
Numerical results show that the algorithm is faster than (or at least as fast as) most of the available dictionary learning methods.
Finally we show that the new algorithm is not only stable but also converges to a fixed point or its accumulation points form a continuum (in contrast to most of the dictionary learning methods, for which so far only stability has been shown).

Organization of the paper
An overview of previous dictionary learning methods is presented in Section II.Section III introduces the dictionary learning framework used in this paper.We introduce two new admissible sets for the dictionaries.Then, in Section III-A, we introduce the majorization method which is used in the matrix valued sparse approximation (III-B) and the dictionary update (III-C1, III-C2) steps.We introduce a new objective function to penalize the size of the dictionary in Section III-D.By minimization of the new objective function with the majorization minimization method, we find a minimum size dictionary.The different dictionary update methods are examined in the simulation section using training samples generated synthetically or sampled from an audio signal.After concluding the paper we present a convergence proof of the algorithm in Appendix B.

Notation
In this paper we use the following conventions.We use small and capital bold face characters for vector and matrix valued parameters respectively.In an iterative algorithm, the value of a parameter in the k th iteration is distinguished by using the iteration number in square brackets, e.g.D [k] .We use a similar notation for a countable series.When a parameter appears with a hat, it shows the current value of that parameter.In the majorization method we introduce an auxiliary parameter which is distinguished with a double dagger superscript, e.g.X ‡ .In dictionary learning, we have a set of training signals y (i) , where i is the signal index.Similarly, the associated coefficient vectors are x (i)

II. DICTIONARY LEARNING METHODS
In traditional dictionary learning, one often starts with some initial dictionary and finds sparse approximations of the set of training signals while keeping the dictionary fixed.This is followed by a second step in which the sparse coefficients are kept fixed and the dictionary is optimized.This algorithm runs for a specific number of alternating optimizations or until a specific approximation error is reached.Most of these algorithms have been derived for dictionary learning in a noisy sparse approximation setting.Recently some researchers have considered dictionary learning for exact sparse representations [18], [19].Like most other researchers, we consider dictionary learning for sparse approximation.

A. Sparse Approximation
Given a set of training samples y (i) , ∀i : 1 ≤ i ≤ L and a dictionary D, sparse approximations are often found by 1 , An alternative to minimizing (2) individually on each vector is to find a joint sparse approximation of the matrix Y = [y (1) y (2) ... y (L) ] by employing a sparsity measure in matrix form.The sparse matrix approximation problem can be formulated as, where J p,q (X) is defined as [20], For example, ||X|| F = J 1/2 2,2 (X) would be the Frobenius-norm.When p = q all elements in X are treated equally.
In this paper we use p = 1, so that J p,p is convex.Extending the algorithm to 0 < p < 1 is possible by using the majorization method proposed in [21].However the convergence of the algorithm in this setting has not yet been proven [21], [22].

B. Dictionary Update
The second step in dictionary learning is the optimization of the dictionary based on the current sparse approximation.The cost function in (3) can be thought of as an objective function with two parameters, Without additional constraints on the dictionary, minimizing the above objective function is an ill-posed problem.By constraining the norm of D we can solve the scale ambiguity2 of the problem.Dictionaries with fixed column-norms or fixed Frobenius-norm have been used in different papers (for example [13] and [23]).We will use more general convex admissible sets defined in (7) and ( 8) below.

C. Previously Suggested Dictionary Update Methods
In the Method of Optimal Directions (MOD) [13] the best dictionary D is found using the pseudo inverse of X, followed by re-normalization of each atom.The Maximum Likelihood based Dictionary Learning algorithm [11], is similar to MOD but uses gradient optimization.In general, if the update is done iteratively, the best possible dictionary is typically calculated without any constraint.This update is then followed by normalization of the atoms.This normalization step can increase the total approximation error.
Kreutz-Delgado et al. [23] presented a dictionary learning method based on Maximum a Posteriori estimation (from now called MAP-DL 3 ).By the use of an iterative method they estimate a dictionary that is consistent with a Bayesian model [23].However, as reported in [15], when a fixed column-norm constraint is used, the algorithm updates atom by atom, making the method too slow for many applications.
The K-SVD method presented in [15] is fundamentally different from these methods.Instead of keeping the sparse coefficients fixed in the dictionary update step, only the support of the coefficient vectors (the positions of the non-zero coefficients) is kept fixed.Updates for each atom are found as the best normalized elementary function that matches the error (calculated after representing the signals with all atoms except the currently selected atom).
The formulation of the problem in this paper has several similarities with MAP-DL.However, our approach to solve this problem is based on a joint objective function for both the sparse approximation and the dictionary, which is good because we can develop a uniform approach for the updates and we have the flexibility to be able to switch between updating parameters easily.Furthermore, we use a different class of constraints on the desired dictionaries.In this setting, we will show a basic convergence proof.Our simulations furthermore show faster convergence for the proposed approach.Moreover, we can optimize the joint parameter objective function more wisely (see section III-E) and thereby increase the observed speed of convergence even further.

III. DICTIONARY LEARNING WITH THE MAJORIZATION METHOD
We consider the dictionary learning problem as the following constrained optimization problem, where D is an admissible set of dictionaries.As noted in [23], two typical constraints are the unit Frobenius-norm and the unit column-norm constraints, both of which lead to nonconvex solution sets.Instead of using these constraints in the algorithm derived below, we use the convex relaxed version of these constrained sets.These are the convex sets of matrices with bounded Frobenius norm, where c F is a constant and the convex set of matrices with bounded column norm, where d i is the i th column of the dictionary D and c C is a constant.Note that when the sparsity measure in the sparse approximation step penalizes coefficients based on their magnitudes (e.g.l p : 0 < p ≤ 1), it is easy to show that the solution of ( 6) is on the boundary of these convex admissible sets.However, the convex admissible sets also allow the optimization algorithm to "pass through" these admissible sets while the traditional non-convex sets only allow the algorithm to move along the boundary of these sets.We use the block relaxation technique (see for example [24]) to solve (6), where p = 1, that is, in one step we fix D and minimize the objective based on X, while in the other step we minimize the objective based on D with X fixed.This alternating minimization continues until the algorithm converges to an accumulation point.For a fixed dictionary, ℓ 1 penalized sparse approximation is a convex optimization problem and using convex dictionary admissible sets also turns the dictionary update into a convex optimization problem.Whilst this allows us to find the optimum update in each step, (5) is not convex as a function of the pair (X, D), and alternating optimization is not guaranteed to find a global optimum.
Various methods have been presented to solve the ℓ 1 penalized sparse approximation [7], [25], [26].We choose an Iterative Thresholding (IT) approach, which is a majorization minimization algorithm (see next subsection), which can be extended to the sparse approximation problem in matrix form (see III-B).

A. Majorization Minimization Method
Optimization of the problem in (6) with respect to any one of the parameters is challenging.We here use a technique called the "majorization method" [24], [27].In the majorization method, the objective function is replaced by a surrogate objective function which majorizes it and can be easily minimized.Here we are particularly interested in surrogate functions in which the parameters are decoupled, so that the surrogate function can be minimized element-wise.
A function ψ majorizes φ when it satisfies the following conditions, where Υ is the parameter space.The surrogate function has an additional parameter ξ.At each iteration we first choose this parameter as the current value of ω and find the optimal update for ω.
We then update ξ with ω new .The algorithm continues until we find an accumulation point.In practice the algorithm is terminated when the distance between ω and ω new is less than some threshold.This iterative method can be viewed as a block-relaxed minimization of the joint objective ψ(ω, ξ) [24].In one step, we find the minimum of ψ based on ω.In the next step we minimize the objective based on ξ.
We use this interpretation of the majorization method to show the convergence of the proposed method in Appendix B.
There are different ways to derive a surrogate function.Jensen's inequality and Taylor series have often been used for this purpose [28] [29].The Taylor series of a differentiable function φ(ω) is, and we can define ψ(ω, ξ) (which satisfies ( 9)) as follows, Then, at each iteration, φ(ω new ) ≤ ψ(ω new , ω) ≤ ψ(ω, ω) = φ(ω), hence φ does not increase.Conditions for which these algorithms converge have been presented in [24] and [29].The convergence of this method for sparse approximation is shown in [26].A similar analysis can be derived for the iterative method in the dictionary update step.
In the next sections we show how we can use the majorization method to optimize the objective introduced in (6) based on X (Section III-B) or D (Sections III-C and III-D) using different constraints.Updating the coefficient or the dictionary matrices always reduces the joint objective function or keeps it X [n] = S λ (A) 5: end for 6: output: X t+1 = X [KX ]   at the same value.The fact that the objective function is lowerbounded is sufficient to show stability of the updating process in the sense of Lyapunov (Lyapunov second theorem) [30].We also provide a basic convergence proof for the proposed algorithm in Appendix B.

B. Matrix Valued Sparse Approximation
We begin by showing how the majorization method is used for the first step of the alternating minimization: matrix valued sparse approximation.The updating formula derived here is used in the generalized block relaxation method derived later in this section.For fixed D, we use the matrix form of the Taylor series inequality (13), see Appendix A, to derive the following majorizing function, where and c X > ||D T D|| is a constant, where ||.|| is defined as the spectral norm [31].This type of majorization has already been used for sparse approximation with vector valued coefficients [26], [32], [33].Φ(D, X) in ( 6) has two terms, ||Y − DX|| 2 F and λJ p,p (X).Therefore a function majorizing Φ(D, X) is, Let ).It can be shown that the optimum of the surrogate objective (16), where p = 1, is found by shrinking elements in A [26], [34], that is, The matrix A is the modified Landweber update [35], (which is a gradient descent update) of the matrix valued coefficients.This iterative update continues until X [n] converges to the optimum solution.The pseudocode for this coefficient update is presented in Algorithm 1.The operator S λ is the shrinkage operator defined in (17).

C. Dictionary Update
In the second step of the alternating minimization, we minimize the objective function with respect to D keeping X fixed.This constrained minimization problem can be solved using several methods.Among these, fixed-point iteration and iterative gradient projection methods have been suggested for the dictionary updates in [23], [11].In this paper we derive a majorization method for the dictionary update.
The quadratic part of the objective function in (6) has a bounded curvature when minimizing over D. So again using the Taylor series, the majorizing function is as follows, where 1] is the dictionary found in the previous step, and c D > ||X T X|| is a constant.When X changes in the sparse approximation step, this spectral norm needs to be recalculated.We know that the spectral norm of a Hermitian matrix is its largest eigenvalue and various efficient methods have been presented to calculate it [36].This majorizing function can be used with different constraints.In the following two subsections we derive the optimum of ( 18) under bounded Frobenius and column-norm constraints.
1) Constrained Frobenius-Norm Dictionaries: An advantage of using a constraint on the Frobenius-norm of the dictionary is that the learnt dictionary can have columns with different norms.Such dictionaries can then be used in the weighted-pursuit framework [37], where atoms with large norms have more chance to appear in the approximations.It has been shown that the average performance of sparse approximation increases when the weights are chosen correctly for the class of signals under study [37].
In the dictionary update step, with the help of a Lagrangian multiplier γ, we turn (6) into an unconstrained optimization problem, min where φ γ (D, X), for p = 1, is now defined as, Fixing X, the solution to this minimization problem is a global minimum if the solution satisfies the K.K.T conditions [38, γ must be zero.The majorizing function is generated by adding π D to the objective function, X has here been omitted from the list of parameters because it is assumed fixed in the dictionary update step.The optimum of this function is at a point with zero gradient, D [n] = P(B) 5: end for 6: output: D t+1 = D [KD ]   By solving the above equation we find the optimal dictionary, where B is defined as B has again the same role as the Landweber update.To satisfy the K.K.T. conditions, a non-negative γ has to be found such that γ(||D The pseudocode for this dictionary update is presented in Algorithm 2.Here P is the operator P F cF presented in (24).In the following, we show that the dictionary updates, subject to the constraints on the column-norms or the joint sparsity (see below) of the dictionaries, have similar algorithms, but with the different operators for P.
If we use an equality in the definition of (7), i.e. we demand a fixed Frobenius-norm, γ can become negative.In this case the decision criteria of (24) becomes an equality (||B|| F = c 1/2 F ).
2) Constrained Column-Norm Dictionaries: Another often used admissible set in dictionary learning is the set of fixed or unit column norm matrices.Instead a bound on the column norms of the dictionary can be used to get a convex admissible set.To make (6) an un-constrained optimization problem we need N Lagrangian multipliers (equal to the number of constraints), min where φ Γ (D, X), for p = 1, is now defined as, ) With this formulation, the K.K.T conditions are, This means that for each i when d T i d i is not equal to c C , γ i should be zero.( 25) can be rewritten as where Γ is a diagonal matrix with the γ i as the i th diagonal element.By adding π D , we get the majorizing function, The gradient is again set to zero and the optimum solution is found to be, where B has the same definition as introduced in (23).All γ i are non-negative and ( 1 cD Γ + I) is an (invertible) diagonal matrix.In equation (30), by changing γ i , we multiply the corresponding column of B by a scalar.We start by setting all γ i = 0.For any columns of D * 0 = B for which the norm is more than c , we find the smallest value of γ i which scales down that column to have the largest acceptable norm (c 1/2 C ).
where d j and b j are the j th columns of D and B respectively.Alternatively, we can use a fixed column-norm constraint Here the algorithm may find a Γ in which some of the γ i are negative.The dictionary update can then be found by a similar operator as (31) but with equality in the decision criteria (||b j || 2 = c 1/2 C ) or simply by When the norm of any columns of B is zero, we have some ambiguity in the update formula.In this case we can shrink the size of the dictionary by deleting this atom or keep the size fixed by introducing a random atom to the dictionary.In practice we have not encountered such an ambiguity.

D. Jointly Sparse Dictionaries
The majorization approach to dictionary learning is extremely flexible.To demonstrate this, we introduce an additional constraint that encourages dictionary size reduction.In some applications there is a benefit in using a smaller dictionary.One of these benefits could be in coding, where the coding cost increases when the size of the dictionary grows.To shrink the dictionary size during learning, we introduce the following additional constraint on the number of atoms in the dictionary.min where ||.|| 0 is an operator that counts the number of non-zero elements.Because φ θ,0,∞ is non-convex and non-continuous, we replace the objective function with a relaxed version as follows, min This objective is convex when X is fixed.For fixed X, to minimize over D, the joint sparsity penalty is again decoupled by adding π D , (defined above), to the objective function By separating the terms depending on D, the surrogate cost can be written as, where B is defined in (23).The dictionary constraint is again introduced into the objective function using Lagrangian multiplier(s).Let d j and b j be the j th columns of D and B respectively.The objective function, using the bounded column-norm (8), can be written as, where ψ α q (v, w) = (w − v) 2 + α||v|| q , τ j = (1 + γ j /c D ) 1/2 and γ j are the Lagrangian multipliers.To minimize (36), we can minimize the first term by minimizing ψ α q for each d j independently.With the help of two lemmas presented in [39], we can find the optimum of ψ α q based on d j for q = 1, 2 and ∞.The minimum of ψ α q (v, w) based on v [39, Lemma 4.1] is, where P q ′ α is the orthogonal projection onto the dual norm ball with radius w and the dual norm is defined as ||.|| q ′ with 1/q ′ + 1/q = 1.This minimization problem can be solve analytically for some q [39, Lemma 4.2].In this paper we derive the dictionary update formula for q = 2. Interested readers can derive the update formulas when q = 1 or q = ∞ in the same way.We have where τ = {τ j } 1≤j≤N .When all γ j are non-negative, for any inadmissible b * j with τ j = 1 (γ j = 0), one can decrease by increasing τ j to satisfy the K.K.T conditions.Let S J θ c D (B) := B * τ =1 for any B found by (23).The dictionary update is therefore done by When we are looking for a bounded Frobenius-norm dictionary, the dictionary update could be derived, using a similar approach, by Algorithm 3 : DL(X 0 , D 0 ) D t+1 = DU (X t+1 , D t ) 4: end for 5: output: D T

E. Generalized block relaxation method for dictionary learning
In the previous subsections we presented a block relaxation method to optimize X and D iteratively.In each step, we used an iterative method to find the optimum solution based on one variable while keeping the other variable fixed.The pseudocode for dictionary learning in this framework is presented in Algorithm 3.
Because the joint objective function does not have a fixed bounded curvature, we could not use the majorization method for both parameters jointly.On the other hand, this alternating optimization decreases the rate of convergence as it often oscillates around the optimal path.Instead of fully optimizing with respect to a single parameter in each step, the generalized block relaxation method updates each variable at a time and reduces the objective function, using for example a cyclic selection or any other periodic selection of the parameters.A simple way to choose which parameter to update is to calculate the update based on each parameter and then choose the parameter that decrease the objective function the most.A drawback of this type of parameter selection is that it doubles the computational cost.Another technique is to alternatively update each parameter.For dictionary learning, we found that using more coefficient updates than dictionary updates is in general more beneficial.So one can use p updates of X followed by q updates of D when p ≥ q.
A more complete explanation and a basic convergence proof for the generalized block relaxed dictionary learning algorithm are provided in Appendix B. It is easy to show that the block relaxation method is a special case of the generalized block relaxation method.Therefore convergence of the block relaxation method (alternating minimization) for the dictionary learning follows as a corollary of this result.

IV. SIMULATIONS
We evaluate the proposed method with synthetic and real data.Using synthetic data with random dictionaries helps us to examine the ability of the proposed methods to recover dictionaries exactly (to within an acceptable squared error).We generated the synthetic data and dictionaries as proposed in [23] and [15].To evaluate the performance on real data, we chose audio signals, which have been shown to have some sparse structure.We then used the learnt dictionary for audio coding and show some improvements in Rate-Distortion performance compared to coding with classical dictionaries.

A. Synthetic Data
A 20 × 40 matrix D was generated by normalizing a matrix with i.i.d.uniform random entries.The number of A comparison of the dictionary recovery success rates using different dictionary learning methods under a column-norm constraint.non-zero elements in each of the coefficient vectors was selected between 3 and 7.The locations of the non-zero coefficients were selected uniformly at random.We generated 1280 training samples where the absolute values of the nonzero coefficients were selected uniformly between 0.2 and 1.In the setting for exact dictionary recovery [15], [23] and under a mild condition, the constrained column-norm dictionary and the K-sparse signals are the global solutions of the dictionary learning problem based on exact sparse representations and the ℓ 1 based exact sparse representation problems, respectively (see for example [19]).The proposed algorithm as well as the other dictionary learning algorithms discussed, are proposed for sparse approximations, that is, they allow approximation error when calculating the sparse coefficients.To adapt the algorithm to this problem, we assumed that the sparse approximation finds the correct support in each step.Once the support has been identified, we find the best approximation by projecting onto the selected sub-space.This is called debiasing.
We here compare the majorization based dictionary learning Fig. 3.A comparison of the dictionary recovery success rates using MM and MAP dictionary learning methods under a Frobenius norm constraint: 1: Desired dictionary had fixed Frobenius-norm.2: Desired dictionary had fixed column-norms.algorithm to MOD, K-SVD and MAP-DL.The stopping criteria for IT was the distance between two consecutive iterations (δ = 3×10 −4 ) and λ was set to 0.4.The termination conditions for the iterative dictionary learning methods (majorization method for dictionary learning (MM-DL) and MAP-DL) was set to (||D [n] We started from a normalized random D and used 1000 iterations.The learning parameter (γ) in MAP-DL was selected as described in [23] and we down-scaled γ by a factor of 2 −j (j > 1) when the algorithm was diverging.To allow a fair comparison, we repeated the simulations 5 times.If the squared error between a learnt and true dictionary element was below 0.01, it was classified as correctly identified.The average percentages and standard deviations are shown in Figure 1.It can be seen that in all cases, MM-DL with fixed column-norm and K-SVD recovered nearly the same number of atoms and performed better than the other methods (although, for the signals with less than 6 non-zero coefficients, MM-DL recovered all desired atoms, performance of K-SVD was very close to it).The debiasing process creates some ambiguities in dictionary learning when using the boundednorm constraints as they reduce the effect of the coefficient magnitudes in the sparsity measure.Therefore, we observe atoms which do not have a boundary norm (here, unit norm), even after 1000 iterations.In this case, we get better results using a fixed column-norm admissible set which resolves this ambiguity.The MAP-DL algorithm did not perform well in this simulation.We guess the reason for this is slow convergence of the approach and the use of more iterations might improve the performance.
In Fig. 2 we compare the computation time of the algorithms for the above simulations.Simulations ran on the Intel Xeon 2.66 GHz dual-core processor machine and both cores were used by Matlab.In this graph the total execution time of the algorithms (sparse approximations plus dictionary updates for 1000 iterations) is shown.MOD was fastest followed by our MM-DL.
We have a larger admissible set when fixing the Frobeniusnorm of the dictionary, which makes the problem of exact recovery more complicated and we expect to observe worse performance in terms of exact atom recovery.To test this, we started with a normalized random dictionary, normalized either to have fixed Frobenius-norm or fixed column-norm.The simulations were repeated for 5 trials and the averages and standard deviations of the atom recovery are shown in Fig. 3.In these simulations MM-DL performed slightly better than MAP-DL.The other observation in this figure is that when the desired dictionaries have equal column-norms, performance of the algorithms increase but do not reach the performance observed when using the more restricted (and appropriate) admissible set.Computation times of the algorithms, on the machine described formerly, are shown in Fig. 4.
In the next experiment we assume that the desired dictionary size is unknown but bounded.We generated the data as in the previous experiments but the simulations were started with four times overcomplete dictionaries (two times larger than the desired dictionary size).The dictionary updates were based on the joint sparsity objective function (33) (with θ = 0.05, p = 1 and q = 2).The average percentage of exact atom recovery for 5 trials are shown in Fig. 5 and 6.We plotted the percentage of the exact recovery of the original atoms, regardless of the learnt dictionary size.In the lower plot, we show the size of dictionary after 1000 iterations.With this θ we identified the size correctly but for less sparse signals (higher k) we got less accurate results.The overall performance of the algorithm is determined by the correct choice of θ.By increasing θ we find smaller dictionaries and vice versa.

B. Dictionary Learning for Sparse Audio Coding
In this section we demonstrate the performance of the proposed dictionary learning method on audio signals and thus show that our method is applicable to large dictionary learning problems.An audio sample of more than 8 hours was recorded from BBC radio 3, which plays mostly classical music.
In the first experiment we used bounded column-norm and bounded Frobenius-norm dictionary admissible sets.The audio  sample was summed to mono and down-sampled by a factor of 4. From this 12kHz audio signal, we randomly took 4096 blocks of 256 samples each.The set of dictionaries with the column-norms bounded by c C is a subset of the set of bounded Frobenius-norm dictionaries, when c F = N c C .We chose dictionary admissible sets with column-norms and Frobeniusnorms bounded by c C = 1 and c F = N respectively.We initialized the dictionary with a 2 times overcomplete random dictionary and used 1000 iterations.The objective function against iteration, for two different values of λ, are shown in Fig. 7.This figure shows that the optimal bounded Frobeniusnorm dictionaries are better solutions for the objective functions.
As a second experiment, we looked at an audio coding example.We used the proposed method with the bounded Frobenius-norm constraint to learn a dictionary based on a training set of 8192 blocks, each 1024 samples long.In this experiment we want to learn the dictionary for a larger block length than the previous experiment.The convergence of the traditional block relaxation method for a problem with this  size is very slow.Therefore we run the simulations with the generalized block relaxation method and a joint sparsity constraint on the dictionary to encourage shrinkage of the dictionary.This shrinkage makes the algorithm faster in later iterations.Even though the recorded audio had 48k samples per second, the audio had a maximum frequency of 16kHz.Therefore we downsampled the original audio by a factor of 3/2 without any degradation in the audio fidelity.It has been shown that audio can be modeled reasonably well using tonal, transient and noisy residual components [40].We chose a 2 times overcomplete sinusoid dictionary (frequency oversampled DCT) as the initialization point and ran the simulations with different lambda values for 5000 iterations of alternative optimization of (41), which took approximately 8 hours for each λ, running on the machine mentioned in the previous subsection.
A subset of the learnt atoms (λ = .01,θ = .01),which is selected by uniformly sampling the atom indices, is shown in Fig. 8.These atoms are shown in the time and frequency domain in the left and middle windows respectively.The norms of the selected atoms are shown in the right window.The number of appearances of each atom, which are sorted based on their ℓ 2 norms, are shown in Fig. 9.To design an efficient encoder we only used atoms that were used frequently in the representations.Therefore we were able to further shrink the dictionary size.In this test we chose a threshold of 40 appearances (out of 8192) as the selection criteria.This dictionary was used to find the sparse approximations of 4096 different random blocks, each of 1024 samples, from the same data set.We then encoded the location (significant bit map) and magnitude of the non-zero coefficients separately.In this paper we used a uniform scalar quantizer with a double zero bin size to code the magnitude.We estimated the entropy of the coefficients to approximate the required coding cost.To encode the significant bit map, we assumed an i.i.d.distribution for the location of the non-zero atoms.The same coding strategy was used to code sparse approximations with a two times frequency overcomplete DCT (the initial dictionary used for learning ) followed by shrinking based on the number of appearances.For reference we calculated the rate-distortation of the DCT coefficient encoding of the same data, using the same method of significant bitmap and non-zero coefficients coding.The performance is compared in Fig. 10.In the sparse coding methods, the convex hulls of the rate-distortion performances calculated with different dictionaries, each optimized and shrunk for different bit-rate, are shown in this figure.Using the learnt dictionaries for sparse approximation is superior to using the DCT or overcomplete DCT for the range of bit-rates shown.
It would be nice to compare these real data experiments with K-SVD, which is shown to perform well in dictionary learning for medium size problems.However, we found K-SVD to be too slow on problems of this size.For example, one sparse Sparse coding with two times overcomplete DCT Fig. 10.Estimated Rate-Distortion for the audio coding example using the learnt dictionary, the shrunk 2 times overcomplete DCT dictionary and the DCT.
approximations of the signals, using a fast implementation of OMP [41], and one dictionary update approximately took 10 hours and this has to be repeated for a reasonable number of iterations, e.g.1000 iterations!

V. CONCLUSIONS
We have presented a new algorithm for dictionary learning and have shown its advantages with different experiments and for different data sets.The proposed method is very flexible in using different constraints on the dictionaries.Because the problem of dictionary learning is considered in a more general form (bounded norm for dictionaries), better results were possible.
While some of the other methods are based on atom-wise dictionary update (K-SVD, MAP-DL with unit column-norm a priori information), the proposed method updates the whole dictionary at once.Although the computational complexity of each iteration of the given algorithm is roughly cubic, we found that the algorithm is much faster for large scale problems than, for example, K-SVD (which has a higher order of complexity).
The given method solves the dictionary learning problem in a unified framework.This unified framework provides extra flexibility to update the coefficients and the dictionary in a more efficient way.Furthermore, we showed the convergence of the method to a set of fixed points in this framework.
Finally we have shown that the constrained Frobeniusnorm can increase the performance of dictionary learning by increasing the possible solution set.Audio coding with the learnt dictionary showed a superior rate-distortion performance over traditional orthogonal transform coding and overcomplete sparse coding with an oversampled DCT.

APPENDIX A MATRIX FORM OF THE MAJORIZING FUNCTION
We can use the Taylor series to majorize the quadratic term of the objective function which has a bounded curvature.The Taylor series in matrix form [42,Appendix D 1.7] is given by, where are the directional first and second derivatives of f at V in the U − V direction.The directional derivatives are defined by, For a bounded curvature objective function we have, where

APPENDIX B CONVERGENCE STUDY OF THE ALGORITHM
In the first step of analyzing an iterative algorithm, we need to show the boundedness of the solutions (or the stability of the algorithm).The stability of the algorithms, in which a positive objective is reduced in each iteration, is guaranteed using Lyapunov's second theorem.For example the stability of the MAP-DL is guaranteed when a suitable step size is chosen (to the authors knowledge, no analytical study has been done on how to choose this step size).The convergence of the alternating (gradient) projection based methods essentially depends on the admissible sets (and the gradient step size).In the dictionary learning problem with the admissible sets given by [13] [11], the convergence of the algorithm is not guaranteed.In K-SVD, one needs to find the sparse approximations based on the ℓ 0 sparsity measure for which no efficient algorithm exists so that the stability analysis is challenging.In practice we observed that in MOD and K-SVD, when the solution sequence enters a neighborhood of a local minimum, the objective increases in some iterations.Therefore, it does not converge monotonically to the solution.
The next step is to show the convergence of the algorithm to a fixed point or a set of fixed points.The authors in [23] referred to the convergence of the gradient flow method to show the convergence of the MAP-DL.Although this statement is completely correct, it requires the use of an arbitrary small step size which is practically impossible.
The stability of dictionary learning based on the majorization method has already been proven by the fact that we reduce the objective in each step.Here, we show the convergence to a set of fixed points.Our dictionary learning framework can be viewed as a generalized block-relaxed minimization scheme applied to an augmented objective function.Specifically, we combine two majorizing objectives, (15) and (18), where X ‡ and D ‡ are two auxiliary parameters corresponding to X and D respectively.c D and c X have been chosen to be larger than the spectral norms of X ‡T X ‡ and D ‡T D ‡ respectively.This augmented objective function does not majorize the joint objective, however when (D, D ‡ | D ‡ =D ) or (X, X ‡ | X ‡ =X ) are fixed, (41) majorizes the original joint objective based on the other pair of parameters.When the optimization method is viewed in the block relaxation framework, the optimum of X ‡ or D ‡ is easily found by X or D respectively.This corresponds to the parameter update in the standard majorization method [29].Therefore any sequence of updates is acceptable, given each update of D (or X) is followed by an update based on D ‡ (or X ‡ ) respectively.Such a block-relaxed sequential constrained minimization is not in general guaranteed to converge (see [24] for some counter examples).To study the convergence of our algorithm, we need to do a little more work.In the next subsection, we introduce some theoretical analysis of the generalized block relaxation method.We then analyze the proposed algorithm for dictionary learning, based on the given theoretical analysis.
We now need to introduce point to set maps, Definition B.1 (Point to set map).Let Υ be an arbitrary set and let Γ be the set of all subsets of Υ.A map ∆ : Υ → Γ is a point to set map (see for example [43]).
In the block relaxation technique a set of point to set maps ∆ i : Υ → Γ are defined as ∆ i ( ω) = {ω ∈ Υ : ∀j = i ω j = ω j } where ω = ( ω 1 , ω 2 , ..., ω p ) is the current value of the parameters.These point to set maps keep all the blocks of parameters fixed apart from the i th block.
By starting from ω [0] , the set of possible solutions Λ in the minimization problem is defined as, Λ = {ω ∈ Υ : η(ω) ≤ η(ω [0] )}.For any ω ∈ Λ in each block update we minimize the objective for the selected parameters.This gives us the following updating operator: In general this updating operator is a point to set map and we can choose an update parameter within the resulting set.In our case, the objective function always has a unique minimizer and the updating operators are point-to-point mappings.To use a set of updating operators, we also need to have an operator selector.
Definition B.2 (Operator selector).s(k) : N → P which This operator can choose the updating operator by sequentially selecting (circular) or free steering through the available operators.By using the updating operators defined in (42) and an update selector s(k), we can summarize the (generalized) block relaxed minimization by the following algorithm.Algorithm B.1.Let ω [0] be a given starting point, then {ω [k] } k∈N is the sequence of updates given by ω [k+1] ∈ U s(k) {ω [k] } and stop when ∀i ∈ P : ω = U i { ω} When the updating operator is injective, ω [k+1] = U s(k) {ω [k] }, to analyze the sequence generated by Algorithm B.1, we need to introduce some characteristics of the infinite series.

Definition B.3 (Asymptotically regularity). A sequence {α
|| .|| is a norm defined in the solution space.An operator is called asymptotically regular when the series generated by the sequential use of that operator is asymptotically regular.

Definition B.4 (Essentially periodic
).An infinite sequence {α [n] } n∈N drawn from a finite alphabet P = {A i : 1 ≤ i ≤ p} is essentially periodic, with a period m ∈ N, m ≥ p when ∀j ∈ N, ∀A i ∈ P, ∃n ∈ [jm + 1, (j + 1)m] and α The sequence of {ω [k] } of the Algorithm B.1 is asymptotically regular when ∆ i and η satisfy the following hypotheses [44], Hypotheses B.1.For all i ∈ P and η : Υ → R, We now study the accumulation points of Algorithm B.1, when the Hypotheses B.1 are satisfied.From basic mathematical analysis, we know that any bounded sequence has at least one accumulation point (Bolzano-Weierstrass Theorem [45,Theorem 4.1]).As Λ is closed, the accumulation points of {ω [n] } are in Λ.In a normed space, the following lemma guarantees that the sequence {ω [n] } n∈N generated by Algorithm B.1 will stay arbitrarily close to the accumulation points, when n > N for some N .Lemma B.1.Let {ω [n] } n∈N be a bounded asymptotically regular sequence and T be the set of its accumulation points then, ∀ǫ > 0, ∃N ∈ N, f or n > N, ∃t ∈ T, ||ω [n] − t|| < ǫ Proof: Let S be an ǫ-neighborhood of T and S c be its complement in the admissible set.As the admissible set is compact, S c is also compact.Because S is a neighborhood of T there is no accumulation point t in S c .If {ω [n] } has infinitely many points in S c , then it has a converging subsequence and at least one accumulation point in S c .This contradicts the fact that there is no accumulation point in S c .Therefore ∃N : ω [n] ∈ S, ∀n > N .On the other hand ǫ-neighborhood implies that for all n > N , ∃t ∈ T : ||ω [n] − t|| < ǫ.
In the next subsection we show asymptotic regularity of the generalized block relaxation method for dictionary learning.This is followed by showing the convergence of the proposed method to a set of fixed points.

B. Convergence study of the generalized block-relaxed dictionary learning
In dictionary learning, there are two parameters, coefficient matrix and dictionary.In generalized block-relaxed dictionary learning (41), we have four parameters.We mentioned that the augmented function (41) majorizes ( 6) only when one pair of parameter blocks ( (D, D ‡ | D ‡ =D ) or (X, X ‡ | X ‡ =X ) ) is fixed.Therefore ∆ X : X ∈ {D, X, D ‡ , X ‡ } are the point to set maps which fix all parameters but X (from now on we use this indexing for the point to set maps).
Proposition B.2.The generalized block-relaxed minimization of (41) is asymptotically regular when the updates of D and X are followed by updating of D ‡ and X ‡ respectively.
Proof: To show the asymptotic regularity we show that all the hypotheses in Hypotheses B.1 are satisfied.∆ X : X ∈ {D, X, D ‡ , X ‡ } are self contained, i.e.X ∈ ∆ X { X }, and continuous.Therefore they satisfy the first two hypotheses.The minimum of (41) based on each parameter is unique (the sparse approximation minimum is reached using soft shrinkage (17) over A and the dictionary update is reached by one of the operators introduced in ( 24), (31) or (38) over B ). ( 41) is strictly convex based on X ‡ or D ‡ when all other parameters are fixed.Therefore minimization based on D ‡ or X ‡ has a unique solution.Surrogate objective function (41) is a continuous function.When a mapping is continuous, its epigraph Λ is a closed set [38,Theorem7.1].As the admissible set is a closed set, the intersection of Λ and this set, which is the possible solution set, is closed.On the other hand there is no infinitely large point in Λ (maximum value of ||D|| F and J 1,1 (X) are bounded based on the dictionary constraints and φ(D [0] , X [0] )/λ respectively).In an Euclidean space boundedness and closedness are sufficient for a set to be compact.Therefore the hypothesis is satisfied and the sequence of (D, X, D ‡ , X ‡ ) [i] : i ∈ N is asymptotically regular [44].
Finally we present a Proposition which shows the convergence of the proposed algorithm.Proposition B.3.Generalized block-relaxed dictionary learning converges to a single fixed point (D * , X * ) or gets arbitrary close to a continuum of accumulation points, where each accumulation point satisfies: • ψ(D * , X * , D * , X * ) ≤ ψ(D * , X, D * , X * ) : ∀X • ψ(D * , X * , D * , X * ) ≤ ψ(D, X * , D * , X * ) : ∀D ∈ D Proof: Due to Proposition B.2, the sequence generated by generalized block-relaxed dictionary learning is asymptotically regular.Due to Theorem B.1 and Lemma B.1, the algorithm converges either to a fixed point or gets arbitrary close to a continuum of accumulation points.Because any accumulation point of the algorithm is a fixed point for all U i : ∀i ∈ P [44, Theorem 15], X * is the best coefficient matrix using dictionary D * and D * is the best admissible dictionary, using X * as the sparse representation.
Fig.1.A comparison of the dictionary recovery success rates using different dictionary learning methods under a column-norm constraint.

Fig. 2 .
Fig.2.A comparison of the computation costs of the dictionary learning methods under a column-norm constraint.
of non−zero elements in each coefficient vector) Average percents of exact recovery after 1000 iterations MM bounded norm #1 MM bounded norm #2 MAP−DL #1 MAP−DL #2

Fig. 4 .
Fig. 4. A comparison of the computation costs of the dictionary learning methods under a Frobenius norm constraint.

Fig. 5 .
Fig.5.Dictionary recovery success rates under a column-norm constraint and joint sparsity penalty.

Fig. 6 .
Fig.6.Dictionary recovery success rates under a Frobenius norm constraint and joint sparsity penalty.

Fig. 8 .
Fig. 8.A selection of learnt atoms in time (left) and frequency (middle) domain.Their norms are shown in the right panel.

Fig. 9 .
Fig. 9. Number of appearances of the learnt atoms in the representations of the training samples (of size 8192).

Theorem B. 1 .
[44, Theorem 15]  Let the update selector, s(k), be essentially periodic and ∆ i and η satisfy Hypotheses B.1.Every accumulation point ω * of {ω[n] }, generated by Algorithm B.1, satisfies ω * = U i {ω * } for any i ∈ PThe set of accumulation points T belongs to a level set of η.If η is continuous, T is closed and as Λ is bounded and T ⊆ Λ, T is bounded.Therefore T is compact.Proposition B.1.[29, Proposition 10.3.1]If a bounded sequence {ω[n] } n∈N is asymptotically regular, then its set of accumulation points is connected.If this set is finite, then it reduces to a single point.
. In this paper we use different norms for vectors and matrices.||.|| and ||.|| F are spectral and Frobenius norm in the Euclidean vector space respectively.||.|| p