{"id":44,"date":"2016-01-06T02:25:45","date_gmt":"2016-01-06T02:25:45","guid":{"rendered":"http:\/\/www.aarondefazio.com\/tangentially\/?p=44"},"modified":"2016-01-27T04:29:44","modified_gmt":"2016-01-27T04:29:44","slug":"recent-work-on-minimising-finite-sums","status":"publish","type":"post","link":"https:\/\/www.aarondefazio.com\/tangentially\/?p=44","title":{"rendered":"Recent work on minimising finite sums"},"content":{"rendered":"<p>\nTo me, the most interesting recent development in convex optimisation is the<br \/>\ndiscovery of a family of methods that exploit a finite-sum objective structure<br \/>\nto theoretically and practically converge faster than was previously thought possible.\n<\/p>\n<p>\nThe two early methods in this class were <a href=\"http:\/\/arxiv.org\/abs\/1309.2388\">SAG<\/a><br \/>\n and <a href=\"http:\/\/www.jmlr.org\/papers\/volume14\/shalev-shwartz13a\/shalev-shwartz13a.pdf\">SDCA<\/a>.<\/p>\n<p>I spent time during my thesis working on these methods,<br \/>\nsee my publications on <a href=\"http:\/\/www.aarondefazio.com\/adefazio-nips2014.pdf\">SAGA<\/a><br \/>\n(an improved version of SAG)<br \/>\nand <a href=\"http:\/\/www.aarondefazio.com\/adefazio-icml2014.pdf\">Finito<\/a>.\n<\/p>\n<p>\nI would like to compile here a list of other papers broadly in this area with some brief comments on each. If you feel I&#8217;ve left anything out or you have a correction of some kind, please leave a comment and I&#8217;ll see to it.\n<\/p>\n<h3>SVRG<\/h3>\n<p>The <a href=\"http:\/\/www.stat.rutgers.edu\/home\/tzhang\/papers\/nips13-svrg.pdf\">SVRG method<\/a> really pushed<br \/>\nthe class of finite sum methods forward. It uses a clunky double-loop construction, but the theoretical analysis<br \/>\nis really simple compared to SAG and SDCA, and it&#8217;s really the core building block of the SAGA method mentioned above.<br \/>\nIt can also be readily applied to non-convex problems unlike the other methods mentioned here. 
It makes an interesting<br \/>\ntrade-off: it avoids storing gradients for each term, at the cost of 2-3x more gradient calculations in total.<br \/>\nThis trade-off is bad for logistic regression and most convex problems, but for neural networks it may help. I haven&#8217;t looked to see if it&#8217;s seen any use on neural networks<br \/>\nin practice; maybe a topic for a future post.<\/p>\n<h3> Distributed variants <\/h3>\n<p>\nThere has been quite a lot of interest in distributed versions. I still believe that variants of distributed gradient<br \/>\nLBFGS are the fastest methods for distributed convex optimisation at large scale.<br \/>\nThe way that SAG and related methods<br \/>\nexploit the objective structure is not clearly extendable to the distributed setting. In contrast,<br \/>\ndistributed gradient LBFGS is an embarrassingly parallel algorithm. LBFGS doesn&#8217;t work well for non-convex problems though, and that&#8217;s where most of the interest in distributed optimisation is at the moment.\n<\/p>\n<p>\nFor regular SGD, the two best approaches I&#8217;ve seen for distribution are <a href=\"http:\/\/research.google.com\/archive\/large_deep_networks_nips2012.html\">Downpour<\/a> and soft penalty methods like <a href=\"http:\/\/arxiv.org\/abs\/1412.6651\">Elastic Averaging SGD<\/a>. The soft penalty approach might work well with SAGA; it seems like a good topic for future work.<\/p>\n<p>\nThe <a href=\"http:\/\/arxiv.org\/abs\/1506.04216\">DSA<\/a> method takes the approach of combining the EXTRA method (an ADMM variant) with<br \/>\nSAGA. It seems like a good idea to me in principle. From their paper, it sounds like it might be the fastest<br \/>\nmethod for some network topologies. 
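As a point of reference for what these distributed variants are building on, here is the serial SAGA update, again as my own toy sketch on least squares rather than code from any of these papers; the 1\/(3 L_max) step size follows the constant suggested in the SAGA paper:

```python
import numpy as np

def saga(grad_i, x0, n, step, n_iter=5000, seed=0):
    """Minimise (1/n) * sum_i f_i(x) with the SAGA update.

    Stores one past gradient per term; each step combines one fresh
    gradient with the stored table average as a control variate.
    """
    rng = np.random.default_rng(seed)
    x = x0.copy()
    table = np.array([grad_i(i, x0) for i in range(n)])  # stored per-term gradients
    table_avg = table.mean(axis=0)
    for _ in range(n_iter):
        i = rng.integers(n)
        g_new = grad_i(i, x)
        # Unbiased variance-reduced estimate: fresh gradient, minus the
        # stored gradient for this term, plus the running table average.
        x = x - step * (g_new - table[i] + table_avg)
        table_avg += (g_new - table[i]) / n  # keep the average in sync
        table[i] = g_new
    return x

# Toy least-squares problem: f_i(x) = 0.5 * (a_i . x - b_i)^2
rng = np.random.default_rng(1)
n, d = 50, 4
A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d)
step = 1.0 / (3 * np.max(np.sum(A**2, axis=1)))  # 1/(3 L_max)
x = saga(lambda i, z: (A[i] @ z - b[i]) * A[i], np.zeros(d), n, step)
x_star = np.linalg.lstsq(A, b, rcond=None)[0]
```

The gradient table is exactly the state that is awkward to maintain consistently across machines, which is why the distributed extensions are non-trivial.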
They also try the obvious way of distributing SAG, which they include in their plots.<br \/>\nThe &#8220;obvious&#8221; way doesn&#8217;t actually work (in the sense that it doesn&#8217;t converge linearly, losing all the advantages it has over regular SGD),<br \/>\nso comparing against it is kind of silly. In general I don&#8217;t find their results convincing, although I&#8217;m no expert<br \/>\n on distributed optimisation.\n <\/p>\n<p>\nA few papers cover distributed SDCA, like <a href=\"http:\/\/papers.nips.cc\/paper\/5114-trading-computation-for-communication-distributed-stochastic-dual-coordinate-ascent\">this one<\/a><br \/>\nby Tianbao Yang. It compares against ADMM, which is horribly slow compared to distributed gradient LBFGS, so I&#8217;m not really convinced<br \/>\nby their results either.\n<\/p>\n<p>\nThe <a href=\"http:\/\/arxiv.org\/abs\/1507.07595\">distributed SVRG method<\/a> is another promising approach.<br \/>\nThe SVRG method is inherently double-loop, with the inner loop looking like it could potentially be parallelised. I haven&#8217;t<br \/>\nlooked at the details in this paper, but it concerns me that they compare against methods that I would not consider standard.<br \/>\nI would like to see comparisons against one of the many distributed SGD methods, or DG-LBFGS, or even just a wall-clock<br \/>\ncomparison against non-distributed SVRG or SAGA.\n<\/p>\n<p>\n<a href=\"http:\/\/arxiv.org\/abs\/1506.06840\">Another NIPS 2015 paper<\/a> from some well-respected Carnegie Mellon researchers<br \/>\nalso addresses the asynchronous (i.e. multicore rather than distributed) SVRG\/SAGA problem. They cite my thesis,<br \/>\nwhich immediately gives them bonus points in my mind.<br \/>\nThey start by generalising SAGA\/SVRG into a meta-algorithm which covers both cases. The construction is fairly obvious,<br \/>\nbut it&#8217;s nice to see it worked out with all the details. 
They then give an asynchronous SVRG method, which appears reasonable.<br \/>\nThe analysis uses techniques from the Hogwild! paper, which is one of the better asynchronous SGD methods.<br \/>\nThey talk about the applicability of their framework to both SVRG and SAGA, but they strangely don&#8217;t talk at all about<br \/>\nan asynchronous SAGA. I have a hunch the technique they use for asynchronous SVRG would not work well with SAGA,<br \/>\nbut I could be wrong. They have an excellent experiments section, with actually legible plots!! (I wish we lived in a world<br \/>\nwhere this didn&#8217;t deserve exclamation marks). In particular they show a solid speedup, mostly linear in the<br \/>\nnumber of cores, although the slope is not a perfect 1:1 rate.<br \/>\nI have to point out that they actually compare against SGD-based methods, which is great. Most finite-sum papers skip doing this<br \/>\nbecause it&#8217;s a pain to implement a fast SGD method. There are simply too many possible variants that reviewers will inevitably<br \/>\nsuggest you try.\n<\/p>\n<h3> Accelerated variants <\/h3>\n<p>\nThe paper <a href=\"http:\/\/arxiv.org\/abs\/1506.02186\">A Universal Catalyst for First-Order Optimization<\/a> (NIPS)<br \/>\nby Lin, Mairal and Harchaoui is the clearest approach.<br \/>\nIn general, virtually any optimisation method can<br \/>\nbe accelerated by using it to solve the proximal operator within the<br \/>\naccelerated proximal-point algorithm framework (see Salzo and Villa<br \/>\n2012, based on the proximal point method from Rockafellar).<br \/>\nLin et al. take this approach, giving a double-loop method, where any of the finite-sum methods mentioned can be used in the<br \/>\ninner loop. 
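The outer acceleration loop is simple to sketch. The following is my own illustration of the accelerated proximal-point idea, not the authors' exact algorithm: plain gradient descent stands in for the inner finite-sum solver, and the momentum constant is the standard Nesterov choice for the strongly convex case:

```python
import numpy as np

def catalyst(grad_f, x0, kappa, mu, n_outer=50, n_inner=200, inner_step=0.01):
    """Accelerated proximal-point outer loop (Catalyst-style sketch).

    Each outer step approximately solves the proximal subproblem
        argmin_x f(x) + (kappa/2) * ||x - y||^2
    with plain gradient descent standing in for the inner solver
    (SAGA, SVRG, ... in the actual paper).
    """
    x_prev = x0.copy()
    y = x0.copy()
    # Momentum from the strongly convex case, with q = mu/(mu + kappa).
    q = mu / (mu + kappa)
    beta = (1 - np.sqrt(q)) / (1 + np.sqrt(q))
    for _ in range(n_outer):
        # Inner loop: gradient descent on the kappa-regularised subproblem,
        # warm-started at the extrapolated point y.
        x = y.copy()
        for _ in range(n_inner):
            x = x - inner_step * (grad_f(x) + kappa * (x - y))
        y = x + beta * (x - x_prev)  # Nesterov-style extrapolation
        x_prev = x
    return x_prev

# Toy strongly convex quadratic: f(x) = 0.5 x^T H x - b^T x
H = np.diag([1.0, 10.0, 100.0])
b = np.array([1.0, 1.0, 1.0])
x = catalyst(lambda z: H @ z - b, np.zeros(3), kappa=1.0, mu=1.0,
             inner_step=1.0 / 101.0)  # 1/L for the regularised subproblem
x_star = np.linalg.solve(H, b)
```

The double-loop structure is visible here: kappa trades off how well-conditioned the subproblems are against how much acceleration the outer loop provides.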
I have suspected this kind of approach could be applied to SAGA for a while now,<br \/>\nso it&#8217;s nice to see this paper working out the details.\n<\/p>\n<p>\nThe <a href=\"http:\/\/arxiv.org\/abs\/1305.2581\">paper on accelerated SDCA<\/a> also takes the double-loop approach.<br \/>\nBoth approaches have convergence rates with large hidden logarithmic factors, which<br \/>\nare clearly not artefacts of the proof, but rather inherent in the double-loop approach.\n<\/p>\n<p>\n<a href=\"http:\/\/arxiv.org\/abs\/1506.03016\">Another double-loop paper<\/a><br \/>\ntakes a mirror descent style approach. I&#8217;m not sure if this is the same as the proximal point version; at<br \/>\nleast on the surface it appears to be a different method. I like the plots in this paper: although they are too small and<br \/>\nnot well typeset, they do show the differences between their method and SAGA\/SVRG fairly clearly.<br \/>\nIt seems the general trend is that the accelerated methods take 50-100 iterations before they overtake SAGA\/SVRG,<br \/>\nwhich matches my intuition. For most practical problems I would expect fewer than 100 iterations to be used; typically<br \/>\nthe test set error is completely minimised by then.\n<\/p>\n<p>\nIt&#8217;s not clear to me if a non-double-loop approach is possible for purely primal stochastic finite-sum methods.<br \/>\nThe obvious approaches don&#8217;t work, as the noise from the stochasticity is far too large compared to the amount of error that<br \/>\nvariants of Nesterov&#8217;s accelerated method can handle.\n<\/p>\n<p>\nThere is one non-double-loop accelerated approach I know of, in <a href=\"http:\/\/arxiv.org\/abs\/1507.02000\">this draft technical report<\/a><br \/>\nfrom U. Florida. It takes a primal-dual approach. Guanghui Lan talked about this method in his OPT2015 workshop talk, and it sounds extremely promising. It&#8217;s still an early draft; it has no experiments section yet. 
I&#8217;ll be keeping an eye out for the published version in the future.\n<\/p>\n<h3> Online learning variants <\/h3>\n<p>The <a href=\"http:\/\/arxiv.org\/abs\/1506.03662\">&#8220;Neighborhood Watch&#8221; method<\/a> (NIPS 2015)<br \/>\ntakes a hybrid approach between SAGA and classical SGD. Essentially, they consider storing fewer past gradients by using the gradient at neighbouring points as a surrogate.<br \/>\nAlthough the method they propose is not surprising, it&#8217;s not at all clear a priori that the technique would work in practice, so it&#8217;s nice to see some experiments and theory verifying it. The paper is interesting from a theoretical point of view, as they use a different Lyapunov function from that used by SAGA. It seems simpler.<\/p>\n<h3>Lower complexity bounds<\/h3>\n<p>\nAgarwal and Bottou have <a href=\"http:\/\/arxiv.org\/abs\/1410.0723\">an ICML 2015 paper<\/a> directly<br \/>\naddressing the class of finite sums. They recently updated their arXiv version to note that their<br \/>\nproof only covers deterministic methods, which is not ideal given that all the known fast finite-sum methods are stochastic!<br \/>\nTheir difficulties are not surprising: the lower complexity analysis for finite-sum methods is extremely involved; I was not able to make any<br \/>\nprogress on the problem after spending a few months on it during my PhD. Looking at the construction they use, I don&#8217;t see how<br \/>\nthey can handle the stochastic case without large changes.<br \/>\nEven with the caveat above, this paper is still interesting, particularly for the discussion in section 3.\n<\/p>\n<p>\nThey don&#8217;t really discuss the core question (at least in my mind): does there<br \/>\nexist a deterministic finite-sum method with comparable<br \/>\nconvergence rates to SAG-type methods?<br \/>\nI personally won&#8217;t be happy until we have lower and upper complexity bounds that give a tight answer<br \/>\nto this question. 
I would be really surprised if there existed a fast deterministic method.\n<\/p>\n<p>\nOn a somewhat less related but interesting note,<br \/>\nArjevani, Shalev-Shwartz &amp; Shamir have <a href=\"http:\/\/arxiv.org\/abs\/1503.06833\">put up on arXiv<\/a><br \/>\na paper proving some new lower complexity bounds for general strongly convex optimisation.<br \/>\nIt&#8217;s a long paper with a large number of results, but the one I find most interesting is their core result:<br \/>\na lower-complexity bound, which shows that no optimisation method in the class they consider,<br \/>\nno matter how clever, can converge any faster than the bound they state.<br \/>\nLower complexity bounds can be hard to grasp if you are used to upper bounds, like those proving the worst-case convergence<br \/>\nof an optimisation method. The inequalities are in the opposite direction.\n<\/p>\n<p>\nThe restricted class they consider is interesting and covers the core<br \/>\nmethods for strongly convex optimisation, namely gradient descent and the simplest accelerated variant.<br \/>\nThey avoid the first\/second order oracle class used in the early lower-complexity bounds literature, which is why<br \/>\nthey are able to give an improved bound. The key improvement is the removal of the dependence on problem dimension from the bound.<br \/>\nPrevious lower complexity bounds applied to any first-order method, with a proof relying on analysis of quadratic<br \/>\nproblems for tractability reasons. Since the well-known conjugate gradient method<br \/>\ncan find an exact optimum of a quadratic problem in at most p iterations, for a p-dimensional problem, this dimension constant<br \/>\np appears in the old lower bounds.\n<\/p>\n<p>\nThe key idea of Arjevani et al. 
is limiting the class of optimisation methods to steps that are a linear transformation<br \/>\nof the last rho steps (x_k and x_{k-1}) for rho typically 2 or 3.<br \/>\nThe transformation itself can&#8217;t depend on the x values, so line searches are ruled<br \/>\nout. The conjugate gradient method effectively does such a line search (although it uses a closed-form solution for it),<br \/>\nso it doesn&#8217;t fall within their restricted class.\n<\/p>\n<h3>Alternative sampling schemes<\/h3>\n<p>\nIt&#8217;s quite obvious that the uniform sampling scheme used by the standard finite-sum<br \/>\nmethods can be improved if we know (or can at least approximate) the imbalance between the terms in the sum.<br \/>\nBy imbalance, I&#8217;m referring to the difference in the Lipschitz constants between the terms. Terms with large constants<br \/>\nwill generally have a disproportionate contribution to the full gradient.<br \/>\nIn general we want methods whose theoretical convergence rate depends on the<br \/>\naverage Lipschitz constant of the terms, rather than the max; it is the max that appears in the known rates for SAG and other primal methods.\n<\/p>\n<p>The <a href=\"http:\/\/arxiv.org\/abs\/1504.04406\">paper<\/a> by Schmidt et al. (AISTATS 2015) covers this problem<br \/>\nfor SAGA\/SAG style methods. I contributed a little to the theory for this paper. They show quite astounding practical results<br \/>\nfor conditional random field (CRF) problems, where weighted sampling seems crucial.<\/p>\n<p>\nFor SDCA, <a href=\"http:\/\/arxiv.org\/abs\/1401.2753\">this<\/a> preprint covers the problem to some degree.<br \/>\nI&#8217;ve only skimmed this paper. The practical results are not super-impressive. I believe this is because they<br \/>\nonly test on linear learners, which are too simple to really see the benefit.<\/p>\n<p>In general, weighted sampling for coordinate descent is a well-studied problem,<br \/>\nso applying those results to dual coordinate descent (i.e. 
SDCA) is an obvious research direction. I have a simple weighted sampling proof for a minor variant of SDCA in my thesis.\n<\/p>\n<p>\nThere are two NIPS 2015 papers on the general topic, both with Tong Zhang as a co-author! <a href=\"http:\/\/papers.nips.cc\/paper\/5913-local-smoothness-in-variance-reduced-optimization\">Local Smoothness in Variance Reduced Optimization<\/a> by Vainsencher, Liu and Zhang and <a href=\"http:\/\/papers.nips.cc\/paper\/5926-quartz-randomized-dual-coordinate-ascent-with-arbitrary-sampling\">Quartz: Randomized Dual Coordinate Ascent with Arbitrary Sampling<\/a> by Qu, Richtarik and Zhang.<br \/>\nQuartz is a variant of SDCA with an interpolation-with-previous-value update for the primal estimate of x instead of a direct update. They say this leads to a better theoretical convergence rate. The main novelty is a proof that works for sampling schemes like mini-batching, importance sampling and uniform sampling. The practical performance is not as good as SDCA, except when they are able to exploit sparsity together with mini-batching, as the update is more conservative. I like that they didn&#8217;t introduce another acronym. It&#8217;s becoming impossible to communicate finite-sum methods to people outside the field due to all the method names sounding similar.\n<\/p>\n<p>\nThe Local-smoothness paper discusses the particular problem of adapting the sampling scheme at run-time, so that the weighting doesn&#8217;t need to be specified up front. This is practically a necessity for good performance, and unlike previous work on adaptivity for SAG, they do establish some theory covering it.<br \/>\nI like their discussion of why it can be effective. They point out that for the smoothed hinge loss, data points whose activation is in the linear parts of the hinge can almost be ignored by the sampler once we get close to the solution.\n<\/p>\n<h3> The future <\/h3>\n<p>\nThere are still lots of interesting avenues of research open for finite-sum methods. 
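On the sampling-scheme side discussed above, the variance argument behind Lipschitz-weighted sampling is easy to check numerically. This is my own toy illustration (not from any of the papers): sample term i with probability p_i proportional to L_i and reweight by 1\/(n p_i) to keep the gradient estimate unbiased, then compare second moments against uniform sampling:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
A = rng.normal(size=(n, d))
A[:10] *= 10.0                      # a handful of terms dominate the sum
b = A @ rng.normal(size=d)
x = rng.normal(size=d)              # evaluate the estimators at an arbitrary point

grads = (A @ x - b)[:, None] * A    # per-term gradients g_i, one per row
L = np.sum(A**2, axis=1)            # L_i = ||a_i||^2 for squared-loss terms

def second_moment(probs):
    # E ||g_i / (n p_i)||^2 for the unbiased estimator that samples term i
    # with probability p_i and reweights the gradient by 1 / (n p_i).
    return np.sum(np.sum(grads**2, axis=1) / probs) / n**2

uniform = np.full(n, 1.0 / n)
weighted = L / L.sum()              # Lipschitz-weighted sampling distribution
```

On this imbalanced problem the weighted estimator has a much smaller second moment (and hence variance) than the uniform one, which is the effect the weighted-sampling papers exploit.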
Do practical accelerated variants of any of these methods exist? Ideally without lots of parameters that require tuning. Are other variance reduction approaches possible (such as adding a second set of control variates, or using multiplicative instead of additive control variates)?\n <\/p>\n<p>\nI would like to see some solid C\/C++ implementations of the best methods so that they can be more widely used. There are some reasonable Python implementations in the <a href=\"https:\/\/github.com\/mblondel\/lightning\">Lightning<\/a> package, but they need adaptive step sizes and importance sampling to really be production-ready. Multithreading would also be great. It would also be nice to see SAGA implemented in TensorFlow, Vowpal Wabbit and LibLinear, for example.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>To me, the most interesting recent development in convex optimisation is the discovery of a family of methods that exploit a finite-sum objective structure to theoretically and practically converge faster than was previously thought possible. The two early methods in this class were SAG and SDCA. 
I spent time during my thesis working on these &hellip; <a href=\"https:\/\/www.aarondefazio.com\/tangentially\/?p=44\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Recent work on minimising finite sums<\/span> <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aarondefazio.com\/tangentially\/index.php?rest_route=\/wp\/v2\/posts\/44"}],"collection":[{"href":"https:\/\/www.aarondefazio.com\/tangentially\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aarondefazio.com\/tangentially\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aarondefazio.com\/tangentially\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aarondefazio.com\/tangentially\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=44"}],"version-history":[{"count":5,"href":"https:\/\/www.aarondefazio.com\/tangentially\/index.php?rest_route=\/wp\/v2\/posts\/44\/revisions"}],"predecessor-version":[{"id":53,"href":"https:\/\/www.aarondefazio.com\/tangentially\/index.php?rest_route=\/wp\/v2\/posts\/44\/revisions\/53"}],"wp:attachment":[{"href":"https:\/\/www.aarondefazio.com\/tangentially\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=44"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aarondefazio.com\/tangentially\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=44"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aarondefazio.com\/tangentially\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=44"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}