A Simple Algorithm for Nuclear Norm Regularized Problems

Martin Jaggi (jaggi@inf.ethz.ch)
Marek Sulovský (smarek@inf.ethz.ch)
Institute of Theoretical Computer Science, ETH Zurich, CH-8092 Zurich, Switzerland

Abstract

Optimization problems with a nuclear norm regularization, such as e.g. low norm matrix factorizations, have seen many applications recently. We propose a new approximation algorithm building upon the recent sparse approximate SDP solver of (Hazan, 2008). The experimental efficiency of our method is demonstrated on large matrix completion problems such as the Netflix dataset. The algorithm comes with strong convergence guarantees, and can be interpreted as a first theoretically justified variant of Simon-Funk-type SVD heuristics. The method is free of tuning parameters, and very easy to parallelize.

1. Introduction

This paper considers large scale convex optimization problems with a nuclear norm regularization, as for instance low norm matrix factorizations. Such formulations occur in many machine learning and compressed sensing applications such as dimensionality reduction, matrix classification, multi-task learning and matrix completion (Srebro et al., 2004; Candes & Tao, 2009). Matrix completion by means of matrix factorizations of either low rank or low norm has gained a lot of attention in the area of recommender systems (Koren et al., 2009) with the recently ended Netflix Prize competition.

Our new method builds upon the recent first-order optimization scheme for semi-definite programs (SDPs) of (Hazan, 2008) and has strong convergence guarantees.

We consider the following convex optimization problems over matrices: the trade-off variant

$$\min_{X \in \mathbb{R}^{n \times m}} f(X) + \mu \|X\|_* \qquad (1)$$

and the corresponding constrained variant

$$\min_{X \in \mathbb{R}^{n \times m},\; \|X\|_* \le t/2} f(X) \qquad (2)$$

where f(X) is any differentiable convex function (usually called the loss function), and ||.||_* is the nuclear norm of a matrix, also known as the trace norm (the sum of the singular values, or equivalently the l1-norm of the spectrum). Here µ > 0 and t > 0 are given parameters, usually called the regularization parameters.

When choosing f(X) := ||A(X) − b||²₂ for some linear map A : R^{n×m} → R^p, formulation (1) is the matrix generalization of the problem min_{x ∈ R^n} ||Ax − b||²₂ + µ||x||₁, the important l1-regularized least squares problem, also known as the basis pursuit de-noising problem in the compressed sensing literature. The analogous vector variant of (2) is the Lasso problem (Tibshirani, 1996), which is min_{x ∈ R^n} { ||Ax − b||²₂ : ||x||₁ ≤ t }.
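To make the two formulations concrete, here is a small numpy sketch of the quantities involved; the toy loss f (squared error on a few observed entries) and all data are stand-ins for illustration, not part of the paper:

```python
import numpy as np

def nuclear_norm(X):
    # Sum of singular values: the trace norm, i.e. the l1-norm of the spectrum.
    return np.linalg.svd(X, compute_uv=False).sum()

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 4))
P = [(0, 1), (2, 3), (4, 0)]        # observed positions (hypothetical)
y = np.array([1.0, -0.5, 2.0])      # observed values (hypothetical)

def f(X):
    # A differentiable convex loss: squared error on the observed entries.
    return sum((X[i, j] - yij) ** 2 for (i, j), yij in zip(P, y))

mu, t = 0.1, 3.0
objective_1 = f(X) + mu * nuclear_norm(X)     # trade-off variant (1)
feasible_2 = nuclear_norm(X) <= t / 2         # constraint of variant (2)
```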
Recently, (Toh & Yun, 2009; Liu et al., 2009) and (Ji & Ye, 2009) independently proposed algorithms that obtain an ε-accurate solution to (1) in O(1/√ε) steps, improving upon the algorithm of (Cai et al., 2008). More recently, (Mazumder et al., 2009) and (Ma et al., 2009) also proposed algorithms in this line of so-called singular value thresholding methods, but these cannot guarantee a convergence speed. Each step of all those algorithms requires the computation of the singular value decomposition (SVD) of a matrix of the same size as the solution matrix, which is expensive even with the currently available fast methods such as PROPACK. Both (Toh & Yun, 2009) and (Ji & Ye, 2009) show that the primal error of their algorithm is smaller than ε after O(1/√ε) steps, using an analysis in the spirit of (Nesterov, 1983).

We present a much simpler algorithm to solve problems of the form (2), which does not need any SVD computations. We achieve this by transforming the problem into a convex optimization problem over positive semi-definite matrices, and then applying the approximate SDP solver of Hazan (2008). Hazan's algorithm can be interpreted as the generalization of the coreset approach to problems on symmetric matrices. The algorithm has a strong approximation guarantee, in the sense of obtaining an ε-small primal-dual error (not only a small primal error). With the resulting approximate solution X, our algorithm also gives a matrix factorization X = UV^T of rank O(1/ε) (with the desired bounded nuclear norm). Compared to (Nesterov, 1983), a moderately increased total number of steps, O(1/ε), is needed; this is the price for the very severe simplification of each individual step of our method on the one hand, and for the improved (low) rank on the other hand.

We demonstrate that on standard datasets our new algorithm improves upon the state-of-the-art nuclear norm methods, and scales to large problems such as matrix factorizations on the Netflix dataset. Furthermore, the algorithm is easy to implement and parallelize, as it only uses the power method (or Lanczos steps) to approximate the largest eigenvalue of a matrix.

Our method can also be interpreted as a modified, theoretically justified variant of Simon Funk's popular SVD heuristic (Webb, 2006), making it suitable for low norm matrix factorization. To our knowledge this is the first guaranteed convergence result for this class of SVD-like gradient descent algorithms. Unlike most other comparable algorithms, our general method is free of tuning parameters (apart from the regularization parameter).

Notation. For arbitrary real matrices, the standard inner product is defined as ⟨A, B⟩ := Tr(A^T B), and the (squared) Frobenius matrix norm ||A||²_Fro := ⟨A, A⟩ is the sum of all squared entries of the matrix. By S^{d×d} we denote the set of symmetric d × d matrices. A ∈ R^{d×d} is called positive semi-definite (PSD), written as A ⪰ 0, iff v^T A v ≥ 0 for all v ∈ R^d.

2. Hazan's Algorithm

Our main ingredient is the following simple gradient-descent-type algorithm of (Hazan, 2008), which obtains sparse solutions to any convex optimization problem of the form

$$\min_{Z \in S} f(Z), \qquad (3)$$

where S := { Z ∈ S^{d×d} | Z ⪰ 0, Tr(Z) = 1 } is the set of PSD matrices of unit trace. The set S is sometimes called the spectrahedron, and is a generalization of the unit simplex to the space of symmetric matrices. The algorithm guarantees an ε-small primal-dual error after at most O(1/ε) iterations, where each iteration only involves the calculation of a single approximate eigenvector of a matrix M ∈ S^{d×d}. In practice, for example, the Lanczos method or the power method can be used.

Algorithm 1: Hazan's Algorithm
  Input: Convex f with curvature constant C_f, target accuracy ε.
  Initialize Z^(1) := v₀v₀^T for an arbitrary unit vector v₀.
  for k = 1 to 4C_f/ε do
    Compute v_k := ApproxEV(−∇f(Z^(k)), C_f/k²).
    Let α_k := 1/k.
    Set Z^(k+1) := Z^(k) + α_k (v_k v_k^T − Z^(k)).
  end for

Here ApproxEV(M, ε′) is a sub-routine that delivers an approximate largest eigenvector of a matrix M with the desired accuracy ε′, meaning a unit length vector v such that v^T M v ≥ λ_max(M) − ε′. Note that as our convex function f takes a symmetric matrix Z as an argument, its gradient ∇f(Z) is a symmetric matrix.

The actual running time for a given convex function f : S^{d×d} → R depends on its curvature constant C_f (also called the modulus of convexity), defined as

$$C_f := \sup_{\substack{Z, V \in S,\ \alpha \in \mathbb{R},\\ Z' = Z + \alpha(V - Z)}} \frac{1}{\alpha^2}\left( f(Z') - f(Z) - \langle Z' - Z, \nabla f(Z) \rangle \right),$$

which turns out to be small for many applications.¹

The algorithm can be seen as a matrix generalization of the sparse greedy approximation algorithm of (Clarkson, 2008) for vectors in the unit simplex, called the coreset method, which has seen many successful applications in a variety of areas ranging from clustering to support vector machine training, smallest enclosing ball/ellipsoid, boosting and others. Here sparsity just gets replaced by low rank. The same Algorithm 1 with a well-crafted function f can also be used to solve arbitrary SDPs in feasibility form.

¹ An overview of values of C_f for several classes of functions f can be found in (Clarkson, 2008).
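As a concrete reference, the following is a minimal numpy sketch of Algorithm 1. The dense storage of Z and the fixed number of power iterations are illustrative simplifications (the paper's analysis asks ApproxEV for accuracy C_f/k², and a production version would keep Z in factored form and use Lanczos):

```python
import numpy as np

def approx_ev(M, n_iters=100):
    # Power method: approximate largest eigenvector of a symmetric matrix M.
    # Assumes the largest eigenvalue is also the largest in absolute value;
    # otherwise shift the diagonal first (see Section 4.1).
    v = np.random.default_rng(0).standard_normal(M.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(n_iters):
        v = M @ v
        v /= np.linalg.norm(v)
    return v

def hazan(grad_f, d, C_f, eps):
    # Algorithm 1 on the spectrahedron {Z symmetric PSD, Tr(Z) = 1}.
    v0 = np.ones(d) / np.sqrt(d)
    Z = np.outer(v0, v0)
    for k in range(1, int(np.ceil(4 * C_f / eps)) + 1):
        v = approx_ev(-grad_f(Z))
        alpha = 1.0 / k
        Z = Z + alpha * (np.outer(v, v) - Z)
    return Z
```

Each iterate stays feasible by construction: it is a convex combination of the previous iterate and the rank-1 matrix vv^T, both of which are PSD with unit trace.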
3. Transformation to Convex Problems

3.1. Motivation: Formulating Matrix Factorizations as Convex Optimization Problems

Approximate matrix factorization refers to the setting of approximating a given matrix Y ∈ R^{n×m} (typically given only partially) by a product X = UV^T, under an additional low rank or low norm constraint, such that some error function f(X) is minimized. Most of the currently known gradient-descent-type algorithms for matrix factorization suffer from the following problem: even if the loss function f(X) is convex in X, the same function expressed as a function f(UV^T) of both factor variables U and V usually becomes non-convex (consider for example U, V ∈ R^{1×1} together with the identity function f(x) = x). Therefore many of the popular methods, such as for example (Rennie & Srebro, 2005; Lin, 2007), can get stuck in local minima and so are neither theoretically nor practically well justified; see also (DeCoste, 2006).

These shortcomings can be overcome as follows: one can equivalently transform any low-norm matrix factorization problem (which is usually not convex in its two factor variables) into an optimization problem over symmetric matrices. For any function f on R^{n×m}, the optimization problem

$$\min_{U \in \mathbb{R}^{n \times r},\ V \in \mathbb{R}^{m \times r}} f(UV^T) \quad \text{s.t.} \quad \|U\|_{Fro}^2 + \|V\|_{Fro}^2 = t \qquad (4)$$

is equivalent to

$$\min_{\substack{Z \in S^{(n+m) \times (n+m)},\ \mathrm{rank}(Z) \le r}} \hat{f}(Z) \quad \text{s.t.} \quad Z \succeq 0,\ \mathrm{Tr}(Z) = t, \qquad (5)$$

where "equivalent" means that for any feasible solution of each problem, there is a feasible solution of the other problem with the same objective value. Here f̂ is the same function as f, just acting on the corresponding off-diagonal rectangle of the larger symmetric matrices Z ∈ S^{(n+m)×(n+m)}; formally,

$$\hat{f}(Z) = \hat{f}\begin{pmatrix} Z_1 & Z_2 \\ Z_2^T & Z_3 \end{pmatrix} := f(Z_2).$$

The equivalence holds simply because every PSD matrix Z can be written as some product

$$Z = \begin{pmatrix} U \\ V \end{pmatrix}\begin{pmatrix} U^T & V^T \end{pmatrix} = \begin{pmatrix} UU^T & UV^T \\ VU^T & VV^T \end{pmatrix}$$

for U ∈ R^{n×r} and V ∈ R^{m×r}, when r ≥ rank(Z). On the other hand, any arbitrary real-valued matrices U, V give rise to a PSD matrix Z of this form. Furthermore, Tr(Z) = Tr(UU^T) + Tr(VV^T) = ||U||²_Fro + ||V||²_Fro holds by definition.

The main advantage of this reformulation is that if the rank r := n + m is not restricted, the new problem (5) is a convex problem over a nice, well-studied convex domain (the cone of PSD matrices of fixed trace), whereas the original formulation (4) is usually not convex in both arguments U and V.

3.2. Nuclear Norm Regularized Problems

In the same spirit, we obtain that any nuclear norm regularized problem of the form (2) is equivalent to the convex problem given by Corollary 2 below.

Lemma 1. For any non-zero matrix X ∈ R^{n×m} and t ∈ R, we have ||X||_* ≤ t/2 iff there exist symmetric matrices A ∈ S^{n×n} and B ∈ S^{m×m} such that

$$\begin{pmatrix} A & X \\ X^T & B \end{pmatrix} \succeq 0 \quad \text{and} \quad \mathrm{Tr}(A) + \mathrm{Tr}(B) = t.$$

Proof. This is a slight variation of the argument of (Fazel et al., 2001; Srebro et al., 2004). (⇒) From the characterization ||X||_* = min_{UV^T = X} ½(||U||²_Fro + ||V||²_Fro) we get that there exist U, V with UV^T = X such that ||U||²_Fro + ||V||²_Fro = Tr(UU^T) + Tr(VV^T) ≤ t; in other words we have found a matrix

$$\begin{pmatrix} UU^T & X \\ X^T & VV^T \end{pmatrix}$$

of trace, say, s ≤ t. If s < t, we add (t − s) to the top-left entry, i.e. we add the PSD matrix (t − s)e₁e₁^T (which again gives a PSD matrix). (⇐) As the matrix is symmetric and PSD, it can be (Cholesky) factorized as (U; V)(U; V)^T such that UV^T = X and t = Tr(UU^T) + Tr(VV^T) = ||U||²_Fro + ||V||²_Fro, and therefore ||X||_* ≤ t/2.
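A quick numerical check of Lemma 1 (illustrative only; the witness matrices A = U diag(σ) U^T and B = V diag(σ) V^T built from the SVD of X achieve Tr(A) + Tr(B) = 2||X||_*):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((4, 3))
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Witness matrices for Lemma 1 with t = 2 ||X||_*.
A = U @ np.diag(s) @ U.T
B = Vt.T @ np.diag(s) @ Vt
W = np.block([[A, X], [X.T, B]])   # W = [U; V] diag(s) [U; V]^T, hence PSD

assert np.all(np.linalg.eigvalsh(W) >= -1e-9)
assert np.isclose(np.trace(A) + np.trace(B), 2 * s.sum())
```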
Corollary 2. Any nuclear norm regularized problem of the form (2) is equivalent to

$$\min_{\substack{Z \in S^{(n+m) \times (n+m)},\\ Z \succeq 0,\ \mathrm{Tr}(Z) = t}} \hat{f}(Z). \qquad (6)$$

Note that both transformations in this section are equivalent formulations and not just relaxations. As already mentioned above, an explicit factorization of any feasible solution to (5) or (6) can, if needed, always be directly obtained since Z ⪰ 0. Alternatively, algorithms for solving the transformed problem (5) or (6) can directly maintain the approximate solution Z in a factorized representation, as achieved for example by Hazan's algorithm.

3.3. Two Variants of Regularization

The two original problem formulations (1) and (2) are very closely related, and are used interchangeably in many applications: if X* is an optimal solution to the trade-off variant (1), then the same solution is also optimal for (2) when using the value ||X*||_* as the norm constraint. On the other hand, (1) is just the Lagrangian version of (2), with µ being the Lagrange multiplier belonging to the single constraint. This is the same change in formulation as when going from the regularized least squares formulation (the vector analogue of (1)) to the Lasso problem corresponding to (2), and vice versa.

4. Solving Nuclear Norm Regularized Problems

By the equivalent reformulation of the previous section, as in Corollary 2, we can now solve both general nuclear norm regularized problems and low norm matrix factorizations by using Hazan's algorithm.

Algorithm 2: Nuclear Norm Regularized Solver
  1. Consider the transformed problem for f̂ given by Corollary 2.
  2. Adjust the function f̂ by re-scaling all matrix entries by 1/t.
  3. Run Algorithm 1 for f̂(Z).
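In code, Algorithm 2 is a thin wrapper around the earlier sketch of Algorithm 1; the gradient of f̂ lives on the off-diagonal blocks. This sketch reuses the hypothetical hazan helper from above, and handles the 1/t rescaling in simplified form, which is harmless here since scaling the gradient does not change the eigenvector directions:

```python
import numpy as np

def nuclear_norm_solver(grad_f, n, m, t, C_f, eps):
    # Solve min f(X) s.t. ||X||_* <= t/2 via the transformation of Corollary 2.
    def grad_f_hat(Z):
        # Z is (n+m)x(n+m) with unit trace; X = t * Z_2 is its scaled
        # off-diagonal block. The gradient has the bipartite block form
        # [[0, G], [G^T, 0]] with G = grad f(X) (see Section 4.1).
        G = grad_f(t * Z[:n, n:])
        H = np.zeros((n + m, n + m))
        H[:n, n:] = G
        H[n:, :n] = G.T
        return H

    Z = hazan(grad_f_hat, n + m, C_f, eps)   # Algorithm 1 from the earlier sketch
    return t * Z[:n, n:]                     # recover X, with ||X||_* <= t/2
```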
The following theorem shows that Algorithm 2 runs in time linear in the number N_f of non-zero entries of the gradient ∇f. This makes it very attractive in particular for recommender systems applications and matrix completion, where ∇f is a sparse matrix (with the same sparsity pattern as the observed entries).

Theorem 3. Algorithm 2 obtains an approximate solution of primal-dual error ≤ ε for problems of the form (2) after at most 4C_f/ε many steps (or, in other words, approximate eigenvector computations). In the k-th call of ApproxEV(), it is sufficient to perform O(k) iterations of the Lanczos method. The overall running time is then O(N_f/ε²), or equivalently O(1/ε²) many sparse matrix-vector multiplications.

Proof. We use Corollary 2 and then rescale all matrix entries by 1/t. The running time then follows from Theorem 2 of (Hazan, 2008).

The fact that each iteration of our algorithm is computationally very cheap (consisting only of the computation of an approximate eigenvector) strongly contrasts with the existing "singular value thresholding" methods, which in each step need to compute an entire SVD. A single such incomplete SVD computation (first k singular vectors) amounts to the same computational cost as an entire run of our algorithm (for k steps). Furthermore, those existing methods which come with a theoretical guarantee, i.e. (Toh & Yun, 2009; Liu et al., 2009; Ji & Ye, 2009; Ma et al., 2009), assume in their analysis that all SVDs used during the algorithm are exact, which is not feasible in practice. By contrast, our analysis is rigorous even if the computed eigenvectors are only ε′-approximate.

Another nice property of Hazan's method is that the returned solution is guaranteed to be simultaneously of low rank (k after k steps), and that, by incrementally adding the rank-1 matrices v_k v_k^T, the algorithm automatically maintains a matrix factorization of the approximate solution.

Also, Hazan's algorithm is designed to automatically stay within the feasible region S, whereas most of the existing approximate SDP-like methods need a projection step to get back to the feasible region (as e.g. (Lin, 2007; Liu et al., 2009)), which makes both their theoretical analysis and implementation much more complicated.

4.1. The Structure of the Eigenvalue Problem

For the actual computation of the approximate largest eigenvector in ApproxEV(−∇f̂(Z^(k)), C_f/k²), either the Lanczos method or the power method (as in PageRank, see e.g. (Berkhin, 2005)) can be used. Both methods are known to scale well to very large problems and can be parallelized easily, as each iteration consists of just one matrix-vector multiplication. However, we have to be careful to obtain the eigenvector for the largest eigenvalue, which is not necessarily the principal one (the largest in absolute value). In that case the spectrum can be shifted by adding an appropriate constant to the diagonal of the matrix. (Hazan, 2008) made use of the fact that the Lanczos method, which is theoretically better understood, provably obtains the required approximation quality in a bounded number of steps if the matrix is PSD (Arora et al., 2005).

For an arbitrary loss function f, the gradient −∇f̂(Z), which is the matrix whose largest eigenvector we have to compute in the algorithm, is always a symmetric matrix of the block form

$$\nabla \hat{f}(Z) = \begin{pmatrix} 0 & G \\ G^T & 0 \end{pmatrix} \quad \text{for } G = \nabla f(Z_2), \text{ when } Z = \begin{pmatrix} Z_1 & Z_2 \\ Z_2^T & Z_3 \end{pmatrix}.$$

In other words, ∇f̂(Z) is the adjacency matrix of a weighted bipartite graph. One vertex class corresponds to the n rows of the original matrix Z₂ (users in recommender systems), the other class corresponds to the m columns (items or movies). The spectrum of ∇f̂ is always symmetric: whenever (v; w) is an eigenvector for some eigenvalue λ, then (v; −w) is an eigenvector for −λ. Hence, we have exactly the same setting as in the established Hubs and Authorities (HITS) model (Kleinberg, 1999). The first part of any eigenvector is always an eigenvector of the hub matrix GG^T, and the second part is an eigenvector of the authority matrix G^T G.
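Thanks to this block structure, a power or Lanczos iteration never has to materialize the (n+m)×(n+m) gradient; multiplying by [[0, G], [G^T, 0]] only needs the sparse n×m block G. A minimal sketch:

```python
import numpy as np

def bipartite_matvec(G, x):
    # Multiply [[0, G], [G^T, 0]] by x = (v; w) without forming the big matrix:
    # the result is simply (G w; G^T v). G may be any (sparse) n x m matrix.
    n, m = G.shape
    v, w = x[:n], x[n:]
    return np.concatenate([G @ w, G.T @ v])
```

One power iteration on the big matrix therefore costs exactly one multiplication by G and one by G^T, i.e. O(N_f) operations when G is sparse.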
Repeated squaring. In the special case that the matrix X is very rectangular (n ≪ m or n ≫ m), one of the two matrices G^T G or GG^T is very small. Then it is known that one can obtain an exponential speed-up in the power method by repeatedly squaring the smaller of the two matrices. In other words, we can perform O(log(1/ε)) many matrix-matrix multiplications instead of O(1/ε) matrix-vector multiplications.

4.2. Application to Matrix Completion and Low Norm Matrix Factorizations

For matrix factorization problems as they appear for example in recommender systems (Koren et al., 2009), our algorithm is particularly suitable, as it retains the sparsity of the observations and constructs the solution in a factorized way. In the setting of a partially observed matrix such as in the Netflix case, the loss function f(X) only depends on the observed positions, which are very sparse, so ∇f(X), which is all we need for our algorithm, is also sparse.

We again suppose that we want to approximate a partially given matrix Y (let P be the set of known entries of the matrix) by a product X = UV^T such that some convex loss function f(X) is minimized. By T we denote the unknown test entries of the matrix we want to predict. Our algorithm applies to any convex loss function on a low norm matrix factorization problem; we mention two cases in particular.

Our algorithm directly applies to Maximum Margin Matrix Factorization (MMMF) (Srebro et al., 2004), whose original (soft margin) formulation is the trade-off formulation (1) with f(X) := Σ_{ij∈P} |X_ij − y_ij| being the hinge or l1-loss. Because this is not differentiable, the authors recommend using the differentiable smoothed hinge loss instead.

When using the standard squared loss function f(X) := Σ_{ij∈P} (X_ij − y_ij)², the problem is known as Regularized Matrix Factorization (Wu, 2007), and our algorithm directly applies. This loss function is widely used in practice, has a nice gradient structure, and is just the natural matrix generalization of the l2-loss (notice the analogous Lasso and regularized least squares formulations). The same function is known as the rooted mean squared error, which was the quality measure used in the Netflix competition. We write RMSE_train and RMSE_test for the rooted error on the training ratings P and the test ratings T respectively.

Running time and memory. From Theorem 3 we obtain that the running time of our Algorithm 2 is linear in the size of the input: each matrix-vector multiplication in the Lanczos or power method costs exactly |P| (the number of observed positions of the matrix) operations, and we know that in total we need at most O(1/ε²) many such matrix-vector multiplications. The same holds for the memory requirement: there is no need to store the entire factorization of X^(k) (meaning all the vectors v_k); instead we only update and store the prediction values X_ij^(k) for ij ∈ P ∪ T in each step. These values, together with the known ratings y_ij, determine the sparse gradient matrix ∇f(X^(k)) during the algorithm. Therefore, the total memory requirement is only |P ∪ T| (the size of the output) plus the size n + m of a single feature vector v_k.
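A sketch of this bookkeeping for the squared loss (hypothetical helper names; X_vals holds one prediction value per observed position in P, and v is the current (n+m)-dimensional feature vector; entries in T are maintained the same way for prediction but do not enter the gradient):

```python
import numpy as np

def rank1_update(X_vals, positions, v, alpha, n):
    # The step Z <- Z + alpha (v v^T - Z) acts entrywise on the stored
    # predictions: X_ij <- (1 - alpha) X_ij + alpha * v_i * v_{n+j}.
    for idx, (i, j) in enumerate(positions):
        X_vals[idx] = (1 - alpha) * X_vals[idx] + alpha * v[i] * v[n + j]
    return X_vals

def sparse_gradient_values(X_vals, y_vals):
    # Gradient of 1/2 sum_{ij in P} (X_ij - y_ij)^2: one non-zero per
    # observed position, so O(|P|) memory in total.
    return X_vals - y_vals
```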
The constant C_f in the running time of Hazan's algorithm.

Lemma 4. For the squared error f(X) = ½ Σ_{ij∈P} (X_ij − y_ij)², it holds that C_f ≤ 1.

Proof. It is known that the constant C_f is upper bounded by the largest eigenvalue of the Hessian ∇²f̂(Z) (here we consider f̂ as a function on vectors). One can directly compute that the diagonal entries of ∇²f̂(Z) are 1 at the entries corresponding to P, and zero everywhere else; hence C_f ≤ 1.

4.3. Two Improved Variants of Algorithm 1

The optimum on the line segment. Instead of fixing the step width to α_k := 1/k in Algorithm 1, the α_k ∈ [0, 1] of best improvement in the objective function f can be found by line search. (Hazan, 2008) has already proposed binary search to find better values for α_k. In many cases, however, we can even compute the optimum analytically in a straightforward manner: consider f_α := f(Z^{(k+1)}_{(α)}) = f(Z^(k) + α(v_k v_k^T − Z^(k))) and solve

$$0 = \frac{\partial}{\partial \alpha} f_\alpha = \left\langle \nabla f\big(Z^{(k+1)}_{(\alpha)}\big),\ v_k v_k^T - Z^{(k)} \right\rangle. \qquad (7)$$

If this equation can be solved for α, then the optimal such α_k can directly be used as the step size, and the convergence guarantee of Theorem 3 still holds.

For the squared error f(X) = ½ Σ_{ij∈P} (X_ij − y_ij)², when we write v̄ for the approximate eigenvector v_k in step k, the optimality condition (7) is equivalent to

$$\alpha_k = \frac{\sum_{ij \in P} (X_{ij} - y_{ij})(X_{ij} - \bar{v}_i \bar{v}_j)}{\sum_{ij \in P} (X_{ij} - \bar{v}_i \bar{v}_j)^2}. \qquad (8)$$
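Equation (8) in code, matching the sparse representation sketched above (a small illustration; v is the (n+m)-dimensional approximate eigenvector, so the entry paired with X_ij is v[i] * v[n + j]):

```python
import numpy as np

def optimal_step_size(X_vals, y_vals, positions, v, n):
    # Closed-form line search (8) for the squared loss.
    r = np.array([x - v[i] * v[n + j]
                  for x, (i, j) in zip(X_vals, positions)])
    den = np.dot(r, r)
    alpha = np.dot(X_vals - y_vals, r) / den if den > 0 else 0.0
    return float(np.clip(alpha, 0.0, 1.0))   # stay on the line segment
```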
Immediate feedback in the power method. As a second small improvement, we propose a heuristic to speed up the eigenvector computation in ApproxEV(−∇f(Z^(k)), ε′): instead of multiplying the current candidate vector v_k with the matrix ∇f(Z^(k)) in each power iteration, we multiply with ½(∇f(Z^(k)) + ∇f(Z^(k) + (1/k) v_k v_k^T)), i.e. the average of the old and the new gradient. This means we immediately take into account the effect of the new feature vector v_k. This heuristic (which unfortunately does not fall under our current theoretical guarantee) is inspired by stochastic gradient descent as in Simon Funk's method, which we describe in the following.

4.4. Relation to Simon Funk's SVD Method

Interestingly, our proposed framework can also be seen as a theoretically justified variant of Simon Funk's (Webb, 2006) and related approximate SVD methods, which were used as a building block by most of the teams participating in the Netflix competition (including the winning team). Those methods have been further investigated by (Paterek, 2007; Takács et al., 2009) and also by (Kurucz et al., 2007), who already proposed a heuristic using the HITS formulation. These approaches are algorithmically extremely similar to our method, although they are aimed at a slightly different optimization problem and do not directly guarantee a bounded nuclear norm. Very recently, (Salakhutdinov & Srebro, 2010) observed that Funk's algorithm can be seen as stochastic gradient descent to optimize (1) when the regularization term is replaced by a weighted variant of the nuclear norm.

Simon Funk's method considers the standard squared loss function f(X) = ½ Σ_{ij∈P} (X_ij − y_ij)², and finds the new rank-1 estimate (or feature) v by iterating v := v + λ(−∇f̂(Z)v − Kv), or equivalently

$$v := \lambda \left( -\nabla \hat{f}(Z) + \left(\tfrac{1}{\lambda} - K\right) I \right) v, \qquad (9)$$

a fixed number of times. Here λ is a small fixed constant called the learning rate, and a decay rate K > 0 is additionally used for regularization, i.e. to penalize the magnitude of the resulting feature v. Clearly this matrix multiplication formulation (9) is equivalent to a step of the power method applied within our framework², and for small enough learning rates λ the resulting feature vector will converge to the largest eigenvector of −∇f̂(Z).

However, in Funk's method, the magnitude of each new feature strongly depends on the starting vector v₀, the number of iterations, the learning rate λ as well as the decay K, making the convergence very sensitive to these parameters. This might be one of the reasons that so far no results on the convergence speed could be obtained. Our method is free of these parameters; the k-th new feature vector is always a unit vector scaled by 1/√k. Also, we keep the Frobenius norm ||U||²_Fro + ||V||²_Fro of the obtained factorization exactly fixed during the algorithm, whereas in Funk's method (which has a different optimization objective) this norm strictly increases with every newly added feature.

Our described framework therefore gives a solid theoretical foundation for a modified variant of the experimentally successful method (Webb, 2006) and its related variants such as (Kurucz et al., 2007; Paterek, 2007; Takács et al., 2009), with proved approximation quality and running time.

² Another difference of our method to Simon Funk's lies in the stochastic gradient descent nature of the latter, i.e. "immediate feedback": during each matrix multiplication, it already takes the modified current feature v into account when calculating the loss f̂(Z), whereas our Algorithm 1 alters Z only after the eigenvector computation is finished.
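The equivalence between Funk's additive update and the shifted power-iteration step (9) is plain linear algebra, and easy to check numerically (here M is a random symmetric stand-in for −∇f̂(Z), with hypothetical rate values):

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((6, 6))
M = (M + M.T) / 2                  # symmetric stand-in for -grad f_hat(Z)
v = rng.standard_normal(6)
lam, K = 0.01, 0.02                # learning rate and decay (hypothetical)

funk_step = v + lam * (M @ v - K * v)                            # v + lambda(Mv - Kv)
shifted_power_step = lam * (M + (1 / lam - K) * np.eye(6)) @ v   # step (9)
assert np.allclose(funk_step, shifted_power_step)
```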
5. Experimental Results

We run our algorithm on the following standard datasets³ for matrix completion problems, using the squared error function.

  dataset          #ratings   n        m
  MovieLens 100k   10^5       943      1682
  MovieLens 1M     10^6       6040     3706
  MovieLens 10M    10^7       69878    10677
  Netflix          10^8       480189   17770

Any eigenvector method can be used as a black box in our algorithm. To keep the experiments simple, we used the power method⁴, and performed 0.2·k power iterations in step k. If not stated otherwise, the only optimization we used is the improvement by averaging the old and new gradient, as explained in Section 4.3. All results were obtained by our (single-thread) implementation in Java 6 on a 2.4 GHz Intel C2D laptop.

³ See www.grouplens.org and archive.ics.uci.edu/ml.
⁴ We used the power method starting with the uniform unit vector. Half of the approximate eigenvalue corresponding to the previously obtained feature v_{k−1} was added to the matrix diagonal to ensure good convergence.

Sensitivity. The generalization performance of our method is relatively stable under different choices of the regularization parameter, see Figure 1.

[Figure 1. Sensitivity of the method on the choice of the regularization parameter t in (2), on MovieLens 1M. The plot shows RMSE_test at k = 1000 as a function of the trace regularization parameter t.]

In all the following experiments we pre-normalized all training ratings to the simple average (µ_i + µ_j)/2 of the user and movie mean values, for the sake of being consistent with the comparable literature. For MovieLens 10M, we used partition r_b provided with the dataset (10 test ratings per user). The regularization parameter t was set to 48333. We obtained an RMSE_test of 0.8617 after k = 400 steps, in a total running time of 52 minutes (16291 matrix multiplications). Our best RMSE_test value was 0.8573, compared to 0.8543 obtained by (Lawrence & Urtasun, 2009) using their non-linear improvement of MMMF.

Movielens. Table 1 reports the running times of our algorithm on the three MovieLens datasets. Our algorithm gives an about 5.6-fold speed increase over (Toh & Yun, 2009), which is a very similar method to (Ji & Ye, 2009). (Toh & Yun, 2009) already improves upon the "singular value thresholding" methods of (Cai et al., 2008) and (Ma et al., 2009). For MMMF, (Rennie & Srebro, 2005) report an optimization time of about 5 hours on the 1M dataset, but use the different smoothed hinge loss function, so the results cannot be directly compared. (Ma et al., 2009), (Srebro & Jaakkola, 2003) and (Ji & Ye, 2009) only obtained results on much smaller datasets.

Table 1. Running times t_our (in seconds) of our algorithm on the three MovieLens datasets compared to the reported timings t_TY of (Toh & Yun, 2009). The ratings {1, ..., 5} were used as-is and not normalized to any user and/or movie means. In accordance with (Toh & Yun, 2009), 50% of the ratings were used for training, the others were used as the test set. Here NMAE is the mean absolute error, times 1/(5−1), over the total set of ratings; k is the number of iterations of our algorithm; #mm is the total number of sparse matrix-vector multiplications performed; and tr is the used trace parameter t in (2). They used Matlab/PROPACK on an Intel Xeon 3.20 GHz processor.

         NMAE    t_TY    t_our    k    #mm   tr
  100k   0.205   7.39    0.156    15   33    9975
  1M     0.176   24.5    1.376    35   147   36060
  10M    0.164   202     36.10    65   468   281942

Algorithm variants. Comparing the proposed algorithm variants from Section 4.3, Figure 2 demonstrates moderate improvements compared to our original Algorithm 2.

[Figure 2. Improvements for the two algorithm variants described in Section 4.3, when running on MovieLens 10M (partition r_b). The plot shows RMSE_train and RMSE_test over the iterations k, for the fixed step size 1/k, the best step size on the line segment, and the gradient interpolation variant.]
Netflix. Table 2 shows an about 13-fold speed increase of our method over the "Hard Impute" singular value thresholding algorithm of (Mazumder et al., 2009) on the Netflix dataset, where they used Matlab/PROPACK on an Intel Xeon 3 GHz processor.

Table 2. Running times t_our (in hours) of our algorithm on the Netflix dataset compared to the reported timings t_M of (Mazumder et al., 2009).

  RMSE_test   t_M    t_our   k     #mm    tr
  0.986       3.3    0.144   20    50     99592
  0.977       5.8    0.306   30    109    99592
  0.965       6.6    0.504   40    185    99592
  0.9478      n.a.   13.6    200   4165   99592

Note that the primary goal of this experimental section is not to compete with the prediction quality of the best engineered recommender systems (which are usually ensemble methods). We just demonstrate that our method solves nuclear norm regularized problems of the form (2) on large sample datasets, obtaining strong performance improvements.

6. Conclusion

We have introduced a new method to solve arbitrary convex problems with a nuclear norm regularization, which is simple to implement and parallelize, and scales very well. The method is parameter-free and comes with a convergence guarantee. This is, to our knowledge, the first guaranteed convergence speed result for the class of Simon-Funk-type algorithms.

Further interesting questions include whether a similar algorithm could be used if a strict low-rank constraint as in (4), (5) is simultaneously applied; this corresponds to fixing the sparsity of a solution in the coreset setting. Also, it remains to investigate whether our algorithm can be applied to other matrix factorization problems such as (potentially only partially observed) kernel matrices, as e.g. in PSVM (Chang et al., 2007), PCA or [p]LSA, because our method could exploit the even simpler form of ∇f for symmetric matrices.

7. Acknowledgements

We would like to thank Arkadi Nemirovski and Elad Hazan for helpful discussions. M. Jaggi is supported by a Google Research Award and the Swiss National Science Foundation (SNF Project 20PA21-121957). M. Sulovský is supported by the Swiss National Science Foundation (SNF Project 200020-125027).

References

Arora, S, Hazan, E, and Kale, S. Fast algorithms for approximate semidefinite programming using the multiplicative weights update method. FOCS, pp. 339–348, 2005.

Berkhin, P. A survey on PageRank computing. Internet Mathematics, 2(1):73, 2005.

Cai, J-F, Candes, E, and Shen, Z. A singular value thresholding algorithm for matrix completion. arXiv, math.OC, 2008.

Candes, E and Tao, T. The power of convex relaxation: Near-optimal matrix completion. IEEE Trans. Inform. Theory (to appear). arXiv, cs.IT, 2009.

Chang, E, Zhu, K, Wang, H, Bai, H, Li, J, Qiu, Z, and Cui, H. PSVM: Parallelizing support vector machines on distributed computers. NIPS, 20:257–264, 2007.

Clarkson, K. Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. SODA, 2008.

DeCoste, D. Collaborative prediction using ensembles of maximum margin matrix factorizations. ICML, 2006.

Fazel, M, Hindi, H, and Boyd, S. A rank minimization heuristic with application to minimum order system approximation. Proc. American Control Conference, 6:4734–4739, 2001.

Hazan, E. Sparse approximate solutions to semidefinite programs. LATIN, pp. 306–316, 2008.

Ji, S and Ye, J. An accelerated gradient method for trace norm minimization. ICML, 2009.

Kleinberg, J. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5), 1999.

Koren, Y, Bell, R, and Volinsky, C. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, 2009.

Kurucz, M, Benczur, A, and Csalogany, K. Methods for large scale SVD with missing values. SIGKDD, 2007.

Lawrence, N and Urtasun, R. Non-linear matrix factorization with Gaussian processes. ICML, 2009.

Lin, C-J. Projected gradient methods for nonnegative matrix factorization. Neural Computation, 19(10):2756–2779, 2007.

Liu, Y-J, Sun, D, and Toh, K-C. An implementable proximal point algorithmic framework for nuclear norm minimization. Optimization Online, 2009.

Ma, S, Goldfarb, D, and Chen, L. Fixed point and Bregman iterative methods for matrix rank minimization. Mathematical Programming, 2009.

Mazumder, R, Hastie, T, and Tibshirani, R. Spectral regularization algorithms for learning large incomplete matrices. Submitted to JMLR, 2009.

Nesterov, Y. A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady, 27:372–376, 1983.

Paterek, A. Improving regularized singular value decomposition for collaborative filtering. SIGKDD, 2007.

Rennie, J and Srebro, N. Fast maximum margin matrix factorization for collaborative prediction. ICML, 2005.

Salakhutdinov, R and Srebro, N. Collaborative filtering in a non-uniform world: Learning with the weighted trace norm. arXiv, cs.LG, Feb 2010.

Srebro, N, Rennie, J, and Jaakkola, T. Maximum-margin matrix factorization. NIPS, 17:1329–1336, 2004.

Srebro, N and Jaakkola, T. Weighted low-rank approximations. ICML, 2003.
Takács, G, Pilászy, I, Németh, B, and Tikk, D. Scalable collaborative filtering approaches for large recommender systems. JMLR, 10, 2009.

Tibshirani, R. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, pp. 267–288, 1996.

Toh, K and Yun, S. An accelerated proximal gradient algorithm for nuclear norm regularized least squares problems. Optimization Online, 2009.

Webb, B. Netflix update: Try this at home. Simon Funk's personal blog, 2006. http://sifter.org/~simon/journal/20061211.html.

Wu, M. Collaborative filtering via ensembles of matrix factorizations. SIGKDD, 2007.