{"id":48,"date":"2016-03-20T02:06:20","date_gmt":"2016-03-20T02:06:20","guid":{"rendered":"http:\/\/www.aarondefazio.com\/tangentially\/?p=48"},"modified":"2016-03-20T02:06:20","modified_gmt":"2016-03-20T02:06:20","slug":"finally-an-alternative-to-sgd-stochastic-variance-reduction-methods","status":"publish","type":"post","link":"https:\/\/www.aarondefazio.com\/tangentially\/?p=48","title":{"rendered":"Finally an alternative to SGD: Stochastic variance reduction methods"},"content":{"rendered":"<p><!--l. 15--><\/p>\n<p class=\"noindent\" >This post describes the newly developed class of variance reduction techniques in<br \/>\nstochastic optimisation, which are fast and simple alternatives to the traditional<br \/>\nstochastic gradient descent (SGD) method. I give a light introduction here, focused<br \/>\non actually applying the methods in practice.\n<\/p>\n<p><!--l. 22--><\/p>\n<p class=\"noindent\" >\n<h3 class=\"likesectionHead\"><a \n id=\"x1-2000\"><\/a>But first: Ye Old SGD<\/h3>\n<p><!--l. 24--><\/p>\n<p class=\"noindent\" >Quite possibly the most important algorithm in machine learning, SGD is simply a<br \/>\nmethod for optimizing objective functions that consist of a sum or average of<br \/>\nterms:\n<\/p>\n<div class=\"math-display\"><!--l. 27--><math \n xmlns=\"http:\/\/www.w3.org\/1998\/Math\/MathML\"  \ndisplay=\"block\" ><mrow \n>\n                                    <mi \n>f<\/mi><mrow ><mo \nclass=\"MathClass-open\">(<\/mo><mrow><mi \n>x<\/mi><\/mrow><mo \nclass=\"MathClass-close\">)<\/mo><\/mrow> <mo \nclass=\"MathClass-rel\">=<\/mo> <mfrac><mrow \n><mn>1<\/mn><\/mrow> \n<mrow \n><mi \n>n<\/mi><\/mrow><\/mfrac><munderover accentunder=\"false\" accent=\"false\"><mrow  \n><mo mathsize=\"big\" \n>\u2211<\/mo>\n  <\/mrow><mrow \n><mi \n>i<\/mi><mo \nclass=\"MathClass-rel\">=<\/mo><mn>1<\/mn><\/mrow><mrow \n><mi \n>n<\/mi><\/mrow><\/munderover \n><msub><mrow \n><mi \n>f<\/mi><\/mrow><mrow \n>\n<mi \n>i<\/mi><\/mrow><\/msub \n><mrow ><mo \nclass=\"MathClass-open\">(<\/mo><mrow><mi \n>x<\/mi><\/mrow><mo \nclass=\"MathClass-close\">)<\/mo><\/mrow><mo \nclass=\"MathClass-punc\">.<\/mo>\n<\/mrow><\/math><\/div>\n<p><!--l. 29--><\/p>\n<p class=\"nopar\" >\n<p><!--l. 32--><\/p>\n<p class=\"indent\" >   The SGD method maintains an estimate<br \/>\n<!--l. 32--><math \n xmlns=\"http:\/\/www.w3.org\/1998\/Math\/MathML\"  \ndisplay=\"inline\" ><msub><mrow \n><mi \n>x<\/mi><\/mrow><mrow \n><mi \n>k<\/mi><\/mrow><\/msub \n><\/math> of the solution<br \/>\nat step <!--l. 33--><math \n xmlns=\"http:\/\/www.w3.org\/1998\/Math\/MathML\"  \ndisplay=\"inline\" ><mi \n>k<\/mi><\/math>,<br \/>\nand at each step:<\/p>\n<p><!--l. 45--><\/p>\n<div class=\"minipage\" style=\"border: 1px solid black;\">\n<ol  class=\"enumerate1\" >\n<li \n  class=\"enumerate\" id=\"x1-2002x1\"><span \nclass=\"ecbx-1000\">Sample: <\/span>Pick an index <!--l. 45--><math \n xmlns=\"http:\/\/www.w3.org\/1998\/Math\/MathML\"  \ndisplay=\"inline\" ><mi \n>j<\/mi><\/math><br \/>\n      uniformly at random (a datapoint index).\n        <\/li>\n<li \n  class=\"enumerate\" id=\"x1-2004x2\"><span \nclass=\"ecbx-1000\">Step: <\/span>Update <!--l. 45--><math \n xmlns=\"http:\/\/www.w3.org\/1998\/Math\/MathML\"  \ndisplay=\"inline\" ><mi \n>x<\/mi><\/math>:\n<div class=\"math-display\"><!--l. 
45--><math \n xmlns=\"http:\/\/www.w3.org\/1998\/Math\/MathML\"  \ndisplay=\"block\" ><mrow \n>\n                                     <msub><mrow \n><mi \n>x<\/mi><\/mrow><mrow \n><mi \n>k<\/mi><mo \nclass=\"MathClass-bin\">+<\/mo><mn>1<\/mn><\/mrow><\/msub \n> <mo \nclass=\"MathClass-rel\">=<\/mo> <msub><mrow \n><mi \n>x<\/mi><\/mrow><mrow \n><mi \n>k<\/mi><\/mrow><\/msub \n> <mo \nclass=\"MathClass-bin\">\u2212<\/mo> <mi \n>\u03b3<\/mi><msubsup><mrow \n><mi \n>f<\/mi><\/mrow><mrow \n><mi \n>j<\/mi><\/mrow><mrow \n><mi \n>\u2032<\/mi><\/mrow><\/msubsup \n><mrow ><mo \nclass=\"MathClass-open\">(<\/mo><mrow><msub><mrow \n><mi \n>x<\/mi><\/mrow><mrow \n>\n<mi \n>k<\/mi><\/mrow><\/msub \n><\/mrow><mo \nclass=\"MathClass-close\">)<\/mo><\/mrow><mo \nclass=\"MathClass-punc\">.<\/mo>\n<\/mrow><\/math><\/div>\n<p>      <!--l. 45--><\/p>\n<p class=\"nopar\" >\n<\/li>\n<\/ol><\/div>\n<p><\/span><br \/>\n<!--l. 47--><\/p>\n<p class=\"indent\" >   At it&#x2019;s most basic, this update is just an approximation of the<br \/>\nclassical gradient descent technique, where we take the non-random step<br \/>\n<!--l. 48--><math \n xmlns=\"http:\/\/www.w3.org\/1998\/Math\/MathML\"  \ndisplay=\"inline\" ><msub><mrow \n><mi \n>x<\/mi><\/mrow><mrow \n><mi \n>k<\/mi><mo \nclass=\"MathClass-bin\">+<\/mo><mn>1<\/mn><\/mrow><\/msub \n> <mo \nclass=\"MathClass-rel\">=<\/mo> <msub><mrow \n><mi \n>x<\/mi><\/mrow><mrow \n><mi \n>k<\/mi><\/mrow><\/msub \n> <mo \nclass=\"MathClass-bin\">\u2212<\/mo> <mi \n>\u03b3<\/mi><msup><mrow \n><mi \n>f<\/mi><\/mrow><mrow \n><mi \n>\u2032<\/mi><\/mrow><\/msup \n><mrow ><mo \nclass=\"MathClass-open\">(<\/mo><mrow><msub><mrow \n><mi \n>x<\/mi><\/mrow><mrow \n><mi \n>k<\/mi><\/mrow><\/msub \n><\/mrow><mo \nclass=\"MathClass-close\">)<\/mo><\/mrow><mo \nclass=\"MathClass-punc\">.<\/mo><\/math> The step size<br \/>\nparameter <!--l. 49--><math \n xmlns=\"http:\/\/www.w3.org\/1998\/Math\/MathML\"  \ndisplay=\"inline\" ><mi \n>\u03b3<\/mi><\/math><br \/>\nhere is just a constant, which depending on the problem may need to be adjusted<br \/>\ndown at each step. Variance reduction methods build upon SGD by modifying the<br \/>\ncore update in various ways.\n<\/p>\n<p><!--l. 55--><\/p>\n<p class=\"noindent\" >\n<h3 class=\"likesectionHead\"><a \n id=\"x1-3000\"><\/a>SAGA<\/h3>\n<p><!--l. 57--><\/p>\n<p class=\"noindent\" >The <a \nhref=\"http:\/\/arxiv.org\/abs\/1407.0202\">SAGA method<\/a> is a simple modi\ufb01cation of SGD, where we remember<br \/>\nthe <span \nclass=\"ecbx-1000\">last gradient evaluated for each datapoint<\/span>, and use this<br \/>\nto correct the SGD step to reduce its variance. These past gradients are stored in a table. We denote the previous gradient for datapoint j as <math xmlns=\"http:\/\/www.w3.org\/1998\/Math\/MathML\" display=\"inline\"><msubsup><mrow><mi>f<\/mi><\/mrow><mrow><mi>j<\/mi><\/mrow><mrow><mi>\u2032<\/mi><\/mrow><\/msubsup><mrow><mo class=\"MathClass-open\">(<\/mo><mrow><msubsup><mrow><mi>\u03d5<\/mi><\/mrow><mrow><mi>j<\/mi><\/mrow><mrow><mi>k<\/mi><\/mrow><\/msubsup><\/mrow><mo class=\"MathClass-close\">)<\/mo><\/mrow><\/math>.<br \/>\nThe core step is as<\/p>\n<p>follows:<\/p>\n<div class=\"minipage\"  style=\"border: 1px solid black;\">\n<ol  class=\"enumerate1\" >\n<li \n  class=\"enumerate\" id=\"x1-3002x1\"><span \nclass=\"ecbx-1000\">Sample: <\/span>Pick an index <!--l. 
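As a concrete illustration, here is a minimal NumPy sketch of that loop. The `grad_i(x, j)` function, which returns $f'_j(x)$, is a hypothetical stand-in for whatever gradient oracle your problem provides; it is not part of the original post.

```python
import numpy as np

def sgd(grad_i, x0, n, gamma=0.01, steps=10000):
    """Minimal SGD loop. grad_i(x, j) should return f'_j(x),
    the gradient of the j-th term; x0 is the starting point."""
    x = x0.astype(float).copy()
    for _ in range(steps):
        j = np.random.randint(n)      # 1. Sample: a datapoint index, uniformly
        x = x - gamma * grad_i(x, j)  # 2. Step: move against the sampled gradient
    return x
```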
SAGA

The SAGA method (http://arxiv.org/abs/1407.0202) is a simple modification of SGD, where we remember the last gradient evaluated for each datapoint, and use this to correct the SGD step to reduce its variance. These past gradients are stored in a table. We denote the previous gradient for datapoint $j$ as $f'_j(\phi_j^k)$. The core step is as follows:

1. Sample: Pick an index $j$ uniformly at random.
2. Step: Update $x$ using $f'_j(x^k)$, $f'_j(\phi_j^k)$ and the table average:
$$ x^{k+1} = x^k - \gamma\left[ f'_j(x^k) - f'_j(\phi_j^k) + \frac{1}{n}\sum_{i=1}^{n} f'_i(\phi_i^k) \right]. \qquad (1) $$
3. Update the table: Denote $\phi_j^{k+1} = x^k$, and store $f'_j(\phi_j^{k+1})$ in the table. All other entries in the table remain unchanged. The quantity $\phi_j^{k+1}$ is not explicitly stored.

Source code for SAGA and the accelerated variant Point-SAGA is available on github (https://github.com/adefazio/point-saga); a minimal sketch is also given after the list below. Some points about the algorithm:

- Storing the past gradients is typically fast and low cost. For logistic regression or least squares, it reduces to storing a single real value per datapoint instead of a full gradient. Similar reductions apply to neural networks. When the storage cost is high, the SVRG method described below can be used instead.
- The table average $\frac{1}{n}\sum_{i=1}^{n} f'_i(\phi_i^k)$ is of course cached rather than recalculated at each step.
- Notice that there are really only two extra terms on top of the usual SGD step. The two terms also trivially cancel when you take the expectation with respect to $j$, so the expected step direction is a gradient step, just like with SGD.
- The step above assumes you already have a table of gradients. If you are just starting the algorithm, do the first pass over the data in order (instead of at random), and use the table average over only the data seen so far in the step. Alternatively, just do regular SGD steps during the first pass.
- The requirement to sample datapoints randomly instead of doing in-order passes turns out to be absolutely crucial for the method to work.
- Implementing the step efficiently when the gradients are sparse is a little subtle, but possible. See the SAGA paper for details and example code.
- SAGA supports non-differentiable regularisers through the use of proximal operators. See the paper for details.
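To make the bookkeeping concrete, here is a minimal dense-gradient sketch of update (1), using the same hypothetical `grad_i(x, j)` interface as before. It is illustrative only, not the tuned implementation from the linked repository, and it initialises the table with a simple full pass at the starting point rather than the in-order warm-up pass described above.

```python
import numpy as np

def saga(grad_i, x0, n, gamma=0.01, steps=10000):
    """Minimal dense SAGA sketch. grad_i(x, j) returns f'_j(x).
    The table holds one full gradient per datapoint; for models like
    logistic regression this can be compressed to one scalar each."""
    x = x0.astype(float).copy()
    # Fill the table once at the starting point (a simplification of
    # the warm-start strategies described in the list above).
    table = np.array([grad_i(x, i) for i in range(n)])
    avg = table.mean(axis=0)                  # cached table average
    for _ in range(steps):
        j = np.random.randint(n)              # sample uniformly at random
        g = grad_i(x, j)                      # f'_j(x^k)
        x = x - gamma * (g - table[j] + avg)  # the SAGA step (1)
        avg = avg + (g - table[j]) / n        # keep the cached average in sync
        table[j] = g                          # update the table entry for j
    return x
```

Note how the cached average is updated in O(d) time using the old and new table entries, rather than re-summing the whole table, matching the caching point above.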
Why variance reduction methods?

VR methods like SAGA have entirely different convergence properties than SGD. On strongly convex problems they are able to converge linearly (the error after $k$ steps is $O(\rho^k)$ for some $\rho < 1$), instead of the much slower $O(1/k)$ convergence possible with the SGD method (which requires careful iterate averaging). Linear convergence from a stochastic method is quite remarkable; until recently no such methods were known. It's also fast linear convergence: compared to the theoretical rate for regular gradient descent, it makes 1/3 of the progress per step, but each step is only a single datapoint gradient evaluation rather than a full batch gradient. For large datasets that can be millions of times faster. A schematic illustration of the practical speedup is below, labeled as "incremental gradient":

[Convergence schematic: http://aarondefazio.com/thesis-schematic.png]
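For a rough feel of the gap between the two rates, here is an illustrative back-of-the-envelope calculation; the linear rate $\rho = 0.9$ is an arbitrary choice and constants are omitted in both cases.

```python
import math

eps = 1e-6   # target accuracy
rho = 0.9    # illustrative linear rate: error ~ rho^k

# Linear rate: solve rho^k = eps  =>  k = log(1/eps) / log(1/rho)
linear_steps = math.log(1 / eps) / math.log(1 / rho)

# Sublinear O(1/k) rate: solve 1/k = eps  =>  k = 1/eps
sublinear_steps = 1 / eps

print(round(linear_steps))     # ~131 steps
print(round(sublinear_steps))  # 1,000,000 steps
```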
Other variance reduction techniques

SAGA is perhaps the most straightforward of the variance reduction techniques. The other VR methods have advantages in certain situations; there is no clear best method in all cases. We list the main methods here; each has several variants as well:

SAG (http://arxiv.org/abs/1309.2388) is the earliest of these VR methods. It doesn't actually take the gradient direction in expectation; rather, it introduces a small bias in exchange for potentially lower variance in the gradient estimate. Unfortunately SAG is hard to extend from the point of view of the theory.

SVRG (http://www.stat.rutgers.edu/home/tzhang/papers/nips13-svrg.pdf) is the direct predecessor of SAGA that makes a different trade-off. It avoids storing gradients, but at the expense of computing two gradients per step instead of one. It also has a more complex double-loop formulation, where the outer loop requires a recalculation of the full-data gradient (see the sketch after this list). In most cases the extra computation makes it slower than SAGA, but it does have some applications.

Finito/MISO (http://www.aarondefazio.com/adefazio-icml2014.pdf) take a step derived indirectly from the dual formulation of the problem, while avoiding any dual computation. They don't work efficiently on sparse problems, but are perhaps the fastest methods on dense problems.

SDCA (http://www.jmlr.org/papers/volume14/shalev.../shalev-shwartz13a.pdf) is a purely dual method. It can be applied to some non-differentiable problems, unlike the other methods here, but it is complex to apply efficiently to other problems such as logistic regression. It also has the nice property of avoiding tunable parameters on problems with L2 regularisation. For practical application, though, it crucially requires a non-degenerate amount of strong convexity.
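To make SVRG's double-loop structure concrete, here is a minimal sketch using the same hypothetical `grad_i(x, j)` interface as the earlier snippets. The inner-loop length `m` is a tunable parameter (taking it proportional to $n$ is a common choice); this is a sketch under those assumptions, not a reference implementation.

```python
import numpy as np

def svrg(grad_i, x0, n, gamma=0.01, epochs=50, m=None):
    """Minimal SVRG sketch. No gradient table is stored, but each inner
    step costs two gradient evaluations, and each outer iteration
    recalculates the full-data gradient at a snapshot point."""
    m = 2 * n if m is None else m  # inner-loop length (a tunable choice)
    x = x0.astype(float).copy()
    for _ in range(epochs):
        snap = x.copy()            # outer loop: take a snapshot of the iterate
        full = np.mean([grad_i(snap, i) for i in range(n)], axis=0)
        for _ in range(m):
            j = np.random.randint(n)
            # Two gradients per step: at the iterate and at the snapshot.
            x = x - gamma * (grad_i(x, j) - grad_i(snap, j) + full)
    return x
```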