All posts by Aaron Defazio

Weighted random sampling with replacement with dynamic weights

Weighted random sampling from a set is a common problem in applications, and in general library support for it is good when you can fix the weights in advance. In applications it is more common to want to change the weight of each instance right after you sample it though. This seemingly simple operation doesn't seem to be supported in any of the random number libraries I've looked at. If you try googling for a solution, you find lots of papers and stack overflow posts on reservoir sampling, but nothing useful for solving the problem. After digging through Knuth and reading old paywalled papers, I've managed to piece together an approach that is extremely fast and easy to implement. This post details that method and provides a simple Python implementation. I have made a fast Cython version availiable on github also. First some notation. We want to sample an index 0 to N-1, according to an array of weights w[i]. These weights form the unnormalized probability distribution we want to sample from, i.e. each instance i should have probability w[i]/sum(w) of being chosen. Ideally we want an algorithm that gives constant time sampling and constant time weight mutation operations.

The simple slow approach: rejection sampling

Normally I avoid wasting time on approaches that don't work well in practice, however the simple rejection sampling approach to the problem turns out to be the vital building block of the algorithms that do work. The rejection sampling approach is only a few lines of Python: The idea is simple. We sample uniformly from the indices, then do a rejection sampling correction to account for the actual non-uniformity of the data. The best way to visualise what it's doing is to consider it as picking a point in 2 dimensions uniformly, then doing a reject or accept operation based on if the point is under the "graph" of the distribution or not. This is shown schematically as the yellow shaded regions below. The general idea of rejection sampling and this graph interpretation is quite subtle, and I'm not going to attempt to explain it here. If you just stare at it for a while you should be able to convince your self that it works. Unfortunately, rejection sampling like this is not practical when the weights are uneven. The rejection probability depends on the magnitude of the largest weight compared to the average, and that can be very large. Imagine if the first eight boxes above where %1 full and the last 100% full. It's going to reject roughly 80% of the time. It can of course be much worse when n is larger.

Making it practical

Rejection sampling can be very fast when all the weights are similar. For instance, suppose all the weights are in the interval [1,2]. Then the acceptance probability w[i]/w_max is always more than half, and the expected number of loops until acceptance is at most 2. It's not only expected constant time, but it's fast in practice as well. More generally, this is true whenever the interval is of the form [2^i, 2^(i+1)]. This leads to the idea used by most of the practical methods: group the data into "levels" where each level is an interval of that form. We can then sample a level, followed by sampling within the level with rejection sampling. Matias et al. (2003) show that the level sampling can be done in O(log*(n)) time, where log* is the iterated logarithm, a slowly growing function which is always no more than 5. Effectively it is an expected constant time sampling scheme. We don't recommend using the Matias scheme though, as its practical performance is hampered by its complexity. Instead, we suggest using a method that has a (weak) dependence on the size of the weights, which we detail below. Consider the intervals [2^i, 2^(i+1)]. The number of unique intervals ("levels") we need to consider is just the log2 of the ratio of the largest and smallest weights, which on practical problems won't be more than 20, corresponding to a 1 to 1 million difference. The extra overhead of the Matis method is not worth it to reduce the constant from 20 to 5. In practice the Matis method requires building a tree structure and updating it whenever the weights change, which is way slower than traversing a 20 element sequential array. Lets be a little more concrete. The algorithm will maintain a list of levels, in order of largest to smallest. Each level i consists of a list of the elements that fall within that levels range: [2^i, 2^(i+1)]. The list for each level need not be sorted or otherwise ordered. To sample an instance from the set, we sample a level, then we perform rejection sampling within that level. In Python, it looks like: The full class including weight updating is on github. Notice that we use a linear-time algorithm for sampling the levels (A cumulative distribution table lookup). Alternative methods could be used, such as a balanced binary tree or Walker's algorithm (see below). The only difficulty is that we want to change the weight of an element potentially after every sample, so any pre-computation needed for the level sampling needs to be fast. It doesn't seem worth it in practice to use something more complicated here since the number of levels is so small (as mentioned above usually less than 20).


A few small changes are possible to improve the usability and performance. The rejection sampling actually only needs a single random sample instead of 2. We can just take a U[0,1] sample, then multiply by level_size. The integer part is the idx_in_level and the remainder is the u_lvl part. When updating weights, we need to delete elements potentially from the middle of a levels index list. For example, imagine we need to move element 6 from the [2,4] bucket to the [1,2] bucket in the above diagram. We need to store the indices in a contiguous array for fast O(1) lookup, so initially this would look like a problem. However, since the order within the lists doesn't matter, we can actually take the last element of the list, move it to the location we want to delete from, then delete from the end of the list. In the sample code given, we keep a level_max array which just contains the upper bounds for each level. With a little extra code we could instead change this to be the largest element that has been in that bucket so far. This could lower the rejection rate a little at the cost of a few more operations maintaining the level_max array.

Other methods

Marsaglia et al. 2004 describe an early method that also treats the data in levels, but instead of rejection sampling it spreads each datapoint's probability mass across multiple levels. I believe updating the weights under their scheme would be quite complex, although in big O notation it should be roughly the same as the algorithm I described above.

Walker's alias method

This post is concerned with methods can be used when we wish to modify the weights dynamically as the algorithm runs. However, I would be remise to not mention the alias method, which given a fixed set of weights at initialization time constructs a table that allows extremely efficient constant time sampling without any of the complexity of the other approaches I mentioned. The key idea is simple. Take a table like that used in the rejection sampling method. Instead of normalizing each bucket by w_max, normalize by w(sum)/n. When doing this, some buckets will overflow, as the probability within them could be larger than w(sum)/n. So we take the excess, and spread it among those buckets that are not full. It's not too difficult to do this in such a way that each bucket is exactly full, and contains only two indices within it. Now when we go to sample, we pick a bucket uniformly as before, but within the bucket there is no longer a rejection option, just two indices to choose from. Unfortunately, there is no fast way to modify the table built by Walker's method when we change just a single weight. I've not been able to find any methods in the literature that manage to do so efficiently. Given it's simplicity, there are a few implementations of Walker's method out there, for example a Python version here.

A NIPS 2015 reading list

I went over the entire list of NIPS 2015 papers so that you don't have to. You're welcome. Below is the subset of papers that caught my eye.

Variance Reduction

StopWasting My Gradients: Practical SVRG Another great practical paper from Mark Schmidt and the rest of his group. The techniques they describe help speedup SVRG, although your still probably better off using SAGA.

On Variance Reduction in Stochastic Gradient Descent and its Asynchronous Variants Great typesetting and plots. They show a good speedup on SVRG on a multicore machine. Would be good to have async-SAGA plots though. Speedup varies from 0.3 to 1 per additional core.

Local Smoothness in Variance Reduced Optimization Essentially theory and experiements for adaptive importance sampling for SVRG and SDCA. Looks good. It's hard to say how robust it is and how easy to implement it will be.

Variance Reduced Stochastic Gradient Descent with Neighbors Good theory, some improvements on the basic SAGA theory. The N-SAGA method looks like it would be hard to implement fast though.

Quartz: Randomized Dual Coordinate Ascent with Arbitrary Sampling A SDCA variant with a more conservative step and corresponding improved theory. They give a proof that applies for almost any sampling scheme, nice. Could be faster than SDCA for sparse problems if implemented carefully.

A Universal Catalyst for First-Order Optimization Uses the accelerated proximal-point algorithm as a "meta-algorithm" to speedup any of the finite-sum methods. The double loop construction they use introduces a logarithmic term in the convergence rate which is unfortunate. Probably not ideal in practice due to the additional tuning required, but it's hard to say.

Data science

Hidden Technical Debt in Machine Learning Systems This is based on a previous workshop paper. I'm not really impressed with the paper, it just doesn't go into enough detail to actually be useful. It's practically an extended abstract.

Precision-Recall-Gain Curves: PR Analysis Done Right I really like this paper. It correctly points out that PR curves are pretty bad, and suggests a better improvement. For the kind of problems I'm interested in ROC curves are still the better choice though.

Distributed Optimization

Deep learning with Elastic Averaging SGD The basic idea here is classical: distribute an optimisation problem by splitting it into parts (1 per machine), and enforce consistancy between the machines using a quadratic penality function. It's a "soft" penality, in the sense it doesn't enforce exact equality. It's interesting that they actually give each machine access to the whole dataset, whereas normally each machine is given a fixed subset of the data. I believe this means that on convex problems it will still converge to the exact solution, which is not true of a soft-penalty otherwise. They also show how important a momentum based updated is. Good paper.

Online/Stochastic Optimization

Probabilistic Line Searches for Stochastic Optimization This paper was present orally at the conference as well. They gave an execelent presentation. The basic idea is to use bayesian optimization techniques for line searches. The details are non-trivial, and they have good plots to go along with it. I suspect however that their method is too slow in practice for most applications.

Beyond Convexity: Stochastic Quasi-Convex Optimization They give an interesting and non-obvious definition of local-quasi-convexity that they use in their proof of convergence for the stochastic normalized gradient descent method. It's really a theory paper.

Online Gradient Boosting I'm no expert on boosting, but it looks interesting. I'll stick with offline boosting for solving practical problems for now.

Fast Rates for Exp-concave Empirical Risk Minimization Some good results here. The exp-concave subproblem of ERM is interesting; I'm a strong believer in narrowing down your problem class until you can prove something. They get an expected loss convergence rate like O(d/n) for d dimensions and n steps. This is better than the general ERM case, and better than past results. I'll probably read through the proof in detail at some point.

SGD Algorithms based on Incomplete U-statistics: Large-Scale Minimization of Empirical Risk The U-statistics terminology is new to me. They talk about direct minimisation of AUC instead of the usual ERM losses, which is an approach I find interesting. I'm surprised there hasn't been more research on this, from what they say in the paper. In the past I've never achieved better results out of direct AUC minimisation compared to logistic-loss though.


Policy Evaluation Using the Ω-Return Not really my area, so I can't comment in detail. Looks interesting though.

Bandit Smooth Convex Optimization: Improving the Bias-Variance Tradeoff Always nice to see improved bounds.

The Pareto Regret Frontier for Bandits


Frank-Wolfe Bayesian Quadrature: Probabilistic Integration with Theoretical Guarantees The Frank-Wolfe method is popping up everywhere these days. Apparently it allows you to get convergence rates for Bayesian quadrature now. They appear to be applying it in function space, which is really cool. Nice visualisations as well.

A Universal Primal-Dual Convex Optimization Framework

On the Global Linear Convergence of Frank-Wolfe Optimization Variants Simon Lacoste-Julien is quickly becoming the Frank-Wolfe expert. I like the consideration of the structure of the constraint set through a notion similar to a condition number. Great diagrams as well.


Smooth and Strong: MAP Inference with Linear Convergence I don't fully understand the novelty in this paper, it looks like an application of standard duality tricks for smoothing. If anybody has insight here please let me know in the comments.

Reflection, Refraction, and Hamiltonian Monte Carlo A way of improving traditional HMC in some cases by better handling innate structure.

Optimization Monte Carlo: Efficient and Embarrassingly Parallel Likelihood-Free Inference

A Complete Recipe for Stochastic Gradient MCMC

Other Optimization

HONOR: Hybrid Optimization for NOn-convex Regularized problems Basically they switch between Quasi-newton and gradient steps depending on the local structure. Feels a little adhoc. They also only compare against GIST, a method they also invented. I would like to see a more general comparison.

Convergence rates of sub-sampled Newton methods The approach they take to sub-sampled approximations is interesting. They handle noise in the Hessian from the approximation by setting the smaller eigenvalues not to zero but to the value of the kth top eigenvalue, for some k. Noise should have a bigger relative effect on the lesser eigenvalues, so that looks reasonable. I'm not super-convinced by the plots though, I would like to see results on a few other problems. Also, fonts on plots are too small.


Solving Random Quadratic Systems of Equations Is Nearly as Easy as Solving Linear Systems Quite an interesting signal-recovery problem solved here. Related to compressed sensing where you assume some kind of structure on the design matrix (i.e. iid sampled from a gaussian or binomial).

Newton-Stein Method: A Second Order Method for GLMs via Stein's Lemma An alternative hessian approximation. They make use of sub-sampling in a similar way to the "Convergence rates of sub-sampled Newton methods" paper above, citing a pre-print of it. I like how they use estimates of the covariance matrix of the data to estimate the Hessian. Plot font is too small. They use the wrong citation style for NIPS, not sure how that got through review.

Learning with Symmetric Label Noise: The Importance of Being Unhinged

Recent work on minimising finite sums

To me, the most interesting recent development in convex optimisation is the discovery of a family of methods that exploit a finite-sum objective structure to theoretically and practically converge faster than was previously thought possible.

The two early methods in this class were SAG and SDCA. I spent time during my thesis working on these methods, see my publications on SAGA (an improved version of SAG) and Finito.

I would like to compile here a list of other papers broadly in this area with some brief comments on each. If you feel I've left anything out or you have a correction of some kind, please leave a comment and I'll see to it.


The SVRG method really pushed the class of finite sum methods forward. It uses a clunky double-loop construction, but the theoretical analysis is really simple compared to SAG and SDCA, and it's really the core building block of the SAGA method mentioned above. It can also be readily applied to non-convex problems unlike the other methods mentioned here. It makes an interesting trade-off: it avoids storing gradients for each term, with the downside of 2-3x more gradient calculations total. This trade-off is bad for logistic regression and most convex problems, but for neural networks it may help. I haven't looked to see if its seen any use on neural networks in practice, maybe a topic for a future post.

Distributed variants

There has been quite a lot of interest in distributed versions. I still believe that variants of distributed gradient LBFGS are still the fastest method for distributed convex optimisation at large scale. The way that SAG and related methods exploit the objective structure is not clearly extendable to the distributed setting. In constrast, distributed gradient LBFGS is a embarrassingly-parallel algorithm. LBFGS doesn't work well for non-convex problems though, and thats where most of the interest in distribution optimisation is at the moment.

For regular SGD, the two best approaches I've seen for distribution are Downpour and soft penality methods like Elastic Averaging SGD. The soft penality approach might work well with SAGA; It seems like a good topic for future work.

The DSA method take the approach of combining the EXTRA method (a ADMM variant) with SAGA. It seems like a good idea to me in principle. From their paper, it sounds like it might be the fastest method for some network topologies. They also try the obvious way of distributing SAG, which they include in their plots. The "obvious" way doesn't actually work (in the sense it doen't converge linearly, losing all advantages it has over regular SGD), so comparing against it is kind of silly. In general I don't find their results convincing, although I'm not expert on distributed optimisation.

A few papers cover distributed SDCA, like this one by Tianbao Yang. It compares against ADMM, which is horribly slow compared to distributed gradient LBFGS, so I'm not really convinced by their results either.

The distributed SVRG method is a promising method I've seen. The SVRG method is inherently double-loop, with the inner loop looking like it could be potentially parallelised. I haven't looked at the details in this paper, but it concerns me that compare against methods that I would not consider standard. I would like to see comparisons against one of the many distributed SGD methods, or DG-LBFGS, or even just a wall-clock comparison against non-distributed SVRG or SAGA.

Another NIPS 2015 paper from some well respected Carnegie Mellon researchers also addresses the asynchronous (i.e. multicore rather than distributed) SVRG/SAGA problem. They cite my thesis, which immediately gives them bonus points in my mind. They start with generalising SAGA/SVRG into a meta algorithm which covers both cases. The construction is fairly obvious, but it's nice to see it worked out with all the details. They then give an asynchronous SVRG method, which appears reasonable. The analysis uses techniques from the Hogwild! paper, which is one of the better asynchronous SGD methods. They talk about the applicability of their framework to both SVRG and SAGA, but they strangely don't talk at all about an asynchronous SAGA. I have a hunch the technique they use for asynchronous SVRG would not work well with SAGA, but I could be wrong. They have an excellent experiments section, with actually legible plots!! (I wish we lived in a world where this didn't deserve exclamation marks). In particular they show a solid speedup, with a mostly linear speedup with the number of cores, although the slope is not the perfect 1:1 rate. I have to point out that they actually compare against SGD based methods, which is great. Most finite-sum papers skip doing this because it's a pain to implement a fast SGD method. There are simply too many possible variants that reviewers will inevitably suggest you try.

Accelerated variants

The paper A Universal Catalyst for First-Order Optimization (NIPS) by Lin, Mairal and Harchaoui is the clearest approach. In general, virtually any optimisation method can be accelerated by using it to solve the proximal operator within the accelerated proximal-point algorithm framework (see Salzo and Villa 2012, based on the proximal point method from Rockafellar). Lin et. al. take this approach, giving a double loop method, where any of the finite-sum methods mentioned can be used in the inner loop. I have suspected this kind of approach could be applied to SAGA for a while now, so it's nice to see this paper working out the details.

The paper on accelerated SDCA also takes the double loop approach. Both approaches have convergence rates with large hidden logarithmic factors, which are clearly not artefacts of the proof, but rather inherent in the double loop approach.

Another double loop paper takes a mirror descent style approach. I'm not sure if this is the same as the proximal point version, at least on the surface it appears to be a different method. I like the plots in this paper, they are too small and not well typeset, but they do show the differences between their method and SAGA/SVRG fairly clearly. It seems the general trend is that the accelerated methods take 50-100 iterations before they overtake SAGA/SVRG, which matches my intuition. For most practical problems I would expect less than 100 iterations to be used; typically the test set error is completely minimised by then.

It's not clear to me if a non-double loop approach is possible for a purely primal stochastic finte-sum methods. The obvious approaches don't work, as the noise form the stochasticity is far too large compared to the amount of error that variants of Nesterov's accelerated method can handle.

There is one non-double loop accelerated approach I know of, in this draft technical report from U. Florida. It takes a primal-dual approach. Guanghui Lan talked about this method in his OPT2015 workshop talk, and it sounds extremely promising. It's still an early draft; It has no experiments section yet. I'll be keeping an eye out for the published version in the future.

Online learning variants

The "Neighborhood Watch" method (NIPS 2015), take a hybrid approach between SAGA and classical SGD. Essentially they consider storing fewer past gradients by using the gradient at neighbouring points as a surrogate. Although the method they propose is not surprising, it's not at all clear a priori that the technique they propose would work in practice, so it's nice to see some experiments and theory verifying it. The paper is interesting from a theoretical point of view, as they use a different Lyapunov function from that used by SAGA. It seems simpler.

Lower complexity bounds

Agarwal and Bottou have an ICML 2015 paper directly addressing the class of finite sums. They recently updated their arXiv version to note that their proof only covers deterministic methods, which is not ideal given all the known fast finite-sum methods are stochastic! Their difficulties are not surprising, the lower complexity analysis for finite sum methods is extremely complex; I was not able to make any progress on the problem after spending a few months on it during my PhD. Looking at the construction they use, I don't see how they can handle the stochastic case without large changes. Even with the cavet above, this paper is still interesting, particularly for the discussion in section 3.

They don't really discuss the core question (at least in my mind): Does there exist a deterministic finite-sum method with comparable convergence rates to SAG type methods? I personally won't be happy until we have lower and upper complexity bounds that give a tight answer to this question. I would be really surprised if there existed a fast deterministic method.

On a somewhat less related but interesting note, Arjevani, Shalev-Shwartz & Ohad Shamir have put up on arXiv a paper proving some new lower complexity bounds for general strongly convex optimisation. It's a long paper with a large number of results, but the one I find interesting is their core result. The key result is a lower-complexity bound; that is they show that no optimisation method in the class they consider, no matter how clever, can converge any faster than the bound they state. Lower complexity bounds can be hard to grasp if you are used to upper bounds, like those proving the worst-case convergence of a optimisation method. The inequalities are in the opposite direction.

The restricted class they consider is interesting and covers the core methods for strongly convex optimisation, namely gradient-descent and the simplest accelerated variant. They avoid the first/second order oracle class used in the early lower-complexity bounds literature, which is why they are able to give an improved bound. The key result is the removal of dependence on problem dimension from the bound. Previous lower complexity bounds applied to any first-order method, with a proof relying on analysis of quadratic problems for tractability reasons. Since the well known conjugate gradient method can find an exact optimum of a quadratic problem in O(p) time, for a p-dimensional problem, this dimension constant p appears in the old lower bounds.

The key idea of Arjevani et. al. is limiting the class of optimisation methods to steps that are a linear transformation of the last rho steps (x_k and x_{k-1}) for rho typically 2 or 3. The transformation its self can't depend on the x values, so line searches are ruled out. The conjugate gradients method effectively does such a line-search (although it uses a closed form solution for it), so it doesn't fall within their restricted class.

Alternative sampling schemes

It's quite obvious that the uniform sampling scheme used by the standard finite sum methods can be improved if we know (or can at least approximate) the imbalance between the terms in the sum. By imbalance, I'm referring to the difference in the Lipschitz constants between each term. Terms with large constants will generally have a disproportionate contribution to the full gradient. In general we want methods where the theoretical convergence rate depends on the average Lipschitz constant of the terms, rather than the max, which is the case for SAG and other primal methods.

The paper by Schmidt et. al. (AISTATS 2015) covers this problem for SAGA/SAG style methods. I contributed a little to the theory for this paper. They show quite astounding practical results for conditional random field (CRF) problems, where weighted sampling seems crucial.

For SDCA, this preprint covers the problem to some degree. I've only skimmed this paper. The practical results are not super-impressive. I believe this is because they only test on linear learners, which are too simple to really see the benefit. In general, weighted sampling for coordinate descent is a well studied problem so applying those results to dual coordinate descent (i.e. SDCA) is a obvious research direction. I have a simple weighted sampling proof for a minor variant of SDCA in my thesis.

There are two NIPS 2015 papers on the general topic, both with Tong Zhang as a co-author! Local Smoothness in Variance Reduced Optimization by Vainsencher, Liu and Zhang and Quartz: Randomized Dual Coordinate Ascent with Arbitrary Sampling by Qu, Richtarik and Zhang. Quartz is a variant of SDCA with a interpolation-with-previous-value update for the primal estimate of x instead of a direct update. They say this leads to a better theoretical convergence rate. The main novelty is a proof that works for sampling schemes like mini-batch, importance sampling and uniform sampling. The practical performance is not as good as SDCA except when they are able to exploit sparsity together with mini-batching, as the update is more conservative. I like that they didn't introduce another acroynm. It's becoming impossible to communicate finite sum methods to people outside the field due to all the method names sounding similar.

The Local-smoothness paper discusses the particular problem of adapting the sampling scheme at run-time, so that the weighting doesn't need to be specified up front. This is practically a necessity for good performance, and they do establish some theory covering it unlike previous work on adaptivity for SAG. I like their discussion of why it can be effective. They point out that for smoothed-hinge loss, data-points whose activation is in the linear parts of the hinge can almost be ignored by the sampler once we get close to the solution.

The future

There is still lots of interesting avenues of research open for finite sum methods. Do practical accelerated variants of any of the methods exist? Ideally without lots of parameters that require tuning. Are other variance reduction approaches possible (like adding a second set of control variates, or by using multiplicative instead of additive control?).

I would like to see some solid c/c++ implementations of the best methods so that they can be more widely used. There are some reasonable python implementations in the Lightning package, but they need adaptive step sizes and importance sampling to really be production ready. Multithreading would also be great. It would also be nice to see SAGA implemented in TensorFlow, Vowpal Wabbit and LibLinear for example.

The Many Ways to Analyse Gradient Descent: Part 2

The previous post detailed a bunch of different ways of proving the convergence rate of gradient descent:

xk+1 = xk αf(x k),

for strongly convex problems. This post considers the non-strongly convex, but still convex case.

Rehash of Basic Lemmas

These hold for any x and y. L the Lipschitz smoothness constant. These are completely standard, see Nesterov’s book [2] for proofs. We use the notation x for an arbitrary minimizer of f.

f(y) f(x) + f(x),y x + L 2 x y2. (1)
f(y) f(x) + f(x),y x + 1 2L f(x) f(y)2. (2)
f(x) f(y),x y 1 L f(x) f(y)2. (3)

1 Proximal Style Convergence Proof

The following argument gives a proof of convergence that is well suited to modification for proving the convergence of proximal gradient methods. We start by proving a useful lemma:

Lemma 1. For any xk and y, when xk+1 = xk 1 Lf(x k):

2 L f(y) f(xk+1) y xk+1 2 x k y2.

Proof. We start with the Lipschitz upper bound around xk of xk+1:

f(xk+1) f(xk) + f(x k),xk+1 xk + L 2 xk+1 xk 2.

Now we bound f(xk) using the negated convexity lower bound of y around x (i.e. f(y) f(xk) + f(xk),y xk):

f(xk+1) f(y) + f(x k),xk+1 xk + xk y + L 2 xk+1 xk 2.

Negating, rearranging and multiplying through by 2 L gives:

2 L f(y) f(xk+1) 2 L f(x k),xk+1 y + xk+1 xk 2.

Now we replace f(xk) using xk+1 = xk 1 Lf(x k):

2 L f(y) f(xk+1) 2 y xk+1,xk xk+1 xk xk+1 2 = 2 y xk + xk xk+1,xk xk+1 xk xk+1 2 = 2 y xk,xk xk+1 + xk xk+1 .2

Now we complete the square using the quadratic y xk + xk xk+1 2 = y xk 2 + 2 y xk,xk xk+1 + xk xk+1 2. So we have:

2 L f(y) f(xk+1) y xk + xk xk+1 y xk 2 = y xk+1 y xk 2.

Using this lemma, the proof is quite simple. We apply it with y = x:

xk+1 x2 x k x2 2 L f(xk+1) f(x).

Now we sum this between 0 and k 1. The left hand side telescopes:

xk x2 x 0 x2 2 L r=0k1 f(x r+1) f(x).

Now we use the fact that gradient descent is a descent method, which implies that f(xk) f(xr+1) for all r k 1. So:

xk x2 x 0 x2 2k L f(xk) f(x).

Now we just drop the xk x2 term since it is positive and small. Leaving:

f(xk) f(x) L 2k x0 x2.


As far as I know, this proof is fairly modern [1]. Notice that unlike the strongly convex case, the quantity we are bounding (f(xk) f(x)) does not appear on both sides of the bound. Unfortunately, without strong convexity there is necessarily a looseness to the bounds, and this takes the form of bounding function value by distance to solution, with a large wiggle-factor. One thing that is perhaps a little confusing is the use of distance to solution x x, when it is not unique, as there are potentially multiple minimizers for non-strongly convex problems. The bound in fact holds for any chosen minimizer x. I found this to be a little confusing at first.

2 Older Style Proof

This proof is from Nesterov [2]. I’m not sure of the original source for it.

We start with the function value descent equation, using w := α 1 1 2αL :

f(xk+1) f(xk) w f(x k)2.

We introduce the simplified notation Δk = f(xk) f(x) so that we have

Δk+1 Δk w f(x k)2. (4)

Now using the convexity lower bound around xk evaluated at x, namely:

Δk f(x k),xk x,

and applying Cauchy-Schwarz (note the spelling! there is no “t” in Schwarz) to it:

Δk xk x f(x k) x0 x f(x k).

The last line is because gradient descent method descends in iterate distance each step. We now introduce the additional notation r0 = x0 x. Using this notation and rearranging gives:

f(x k) Δkr0.

We plug this into the function descent equation (Eq 4) above to get:

Δk+1 Δk w r02Δk2.

We now divide this through by Δk+1:

1 Δk Δk+1 w r02 Δk2 Δk+1

Then divide through by Δk also:

1 Δk 1 Δk+1 w r02 Δk Δk+1.

Now we use the fact that gradient descent is a descent method again, which implies that Δk Δk+1 1, so:

1 Δk 1 Δk+1 w r02.

1 Δk+1 1 Δk + w r02.

We then chain this inequality for each k:

1 Δk+1 1 Δk + w r02 1 Δk1 + 2 w r02 1 Δ0 + w r02(k + 1)

1 Δk+1 1 Δ0 + w r02(k + 1).

To get the final convergence rate we invert both sides:

f(xk) f(x) f(x0) f(x) x 0 x2 x0 x2 + w f(x0) f(x)k.

This is quite a complex expression. To simplify even further, we can get rid of the f(x0) f(x) terms on the right hand side using the Lipschitz upper bound about x:

f(x0) f(x) L 2 x x2.

Plugging in the step size α = 1 L gives w = 1 2L, yielding the following simpler convergence rate:

f(xk) f(x) 2L x0 x2 k + 4 .

Compared to the rate from the previous proof, f(xk) f(x) L 2k x0 x2, this is slightly better at k = 1, and worse thereafter.


I don’t like this proof. It’s feels like a random sequence of steps when you first look at it. The way the proof uses inverse quantities like 1 Δk is also confusing. The key equation is really the direct bound on Δ:

Δk+1 Δk w r02Δk2.

Often this is the kind of equation you encounter when proving the properties of dual methods for example. Equations of this kind can be encountered when applying proximal methods to non-differentiable functions also. It is also quite a clear statement about what is going on in terms of per-step convergence, a property that is less clear in the previous proof.

3 Gradient Concentration

When we don’t even have convexity, just Lipschitz smoothness, we can still prove something about convergence of the gradient norm. The Lipschitz upper bound holds without the requirement of convexity:

f(y) f(x) + f(x),y x + L 2 x y2.

Recall that from minimizing this bound with respect to y we can prove the equation:

f(xk) f(xk+1) 1 2L f(x k)2.

Examine this equation carefully. We have a bound on each gradient encountered during the optimization in terms of the difference in function values between steps. The sequence of function values is bounded below, so in fact we have a hard bound on the sum of the encountered gradient norms. Effectively, we chain (telescope) the above inequality over steps:

f(xk1) f(xk) + f(xk) f(xk+1) 1 2L f(x k)2 + 1 2L f(x k1)2.


f(x0) f(xk+1) 1 2L ik f(x i)2.

Now since f(xk+1) f(x):

ik f(x i)2 2L f(x 0) f(x).

Now to make this bound a little more concrete, we can put it in terms of the gradient gk with the smallest norm seen during the minimization ( gk gi for all i), so that ik f(xi)2 k gk 2, so:

gk 2 2L k f(x0) f(x).


Notice that the core technique used in this proof is the same as the last 2 proofs. We have a single step inequality bounding one of the quantities we care about. By summing that inequality over each step of minimization, one side of the inequality telescopes. We get an equality saying the the sum of the k versions of that quantity (one from each step) is less then some fixed constant independent of k, for any k. The convergence rate is thus of the form 1k, because the summation of the k quantities fits in a fixed bound.

Almost any proof for an optimization method that applies in the non-convex case uses a similar proof technique. There is just not that many assumptions to work with, so the options are limited.


[1]   Amir Beck and Marc Teboulle. Gradient-based algorithms with applications to signal recovery problems. Convex Optimization in Signal Processing and Communications, 2009.

[2]   Yu. Nesterov. Introductory Lectures On Convex Programming. Springer, 1998.

The Many Ways to Analyse Gradient Descent

Consider the classical gradient descent method:
xk+1 = xk αf(x k).

It’s a thing of beauty isn’t it? While it’s not used directly in practice any more, the proof techniques used in its analysis are the building blocks behind the theory of more advanced optimization methods. I know of 8 different ways of proving its convergence rate. Each of the proof techniques are interesting in their own right, but most books on convex optimization give just a single proof of convergence, then move onto greater things. But to do research in modern convex optimization you should know them all.

The purpose of this series of posts is to detail each of these proof techniques and what applications they have to more advanced methods. This post will cover the proofs under strong convexity assumptions, and the next post will cover the non-strongly convex case. Unlike most proofs in the literature, we will go into detail of every step, so that these proofs can be used as a reference (don’t cite this post directly though, cite the original source preferably, or the technical notes version). If you are aware of any methods I’ve not covered, please leave a comment with a reference so I can update this post.

For most of the proofs we end with a statement like Ak+1 (1 γ)Ak, where Ak is some quantity of interest, like distance to solution or function value sub-optimality. A full proof requires chaining these inequalities for each k, giving something of the form Ak (1 γ)kA0. We leave this step as a given.

Basic lemmas

These hold for any x and y. Here μ is the strong convexity constant and L the Lipschitz smoothness constant. These are completely standard, see Nesterov’s book [7] for proofs. We use the notation x for the unique minimizer of f (for strongly convex problems).

f(y) f(x) + f(x),y x + L 2 x y2. (1)
f(y) f(x) + f(x),y x + μ 2 x y2. (2)
f(y) f(x) + f(x),y x + 1 2L f(x) f(y)2. (3)
f(y) f(x) + f(x),y x + 1 2μ f(x) f(y)2. (4)
f(x) f(y),x y 1 L f(x) f(y)2. (5)
f(x) f(y),x y μ x y2. (6)

1 Function Value Descent

There is a very simple proof involving just the function values. We start by showing that the function value descent is controlled by the gradient norm:

Lemma 1. For any given α, the change in function value between steps can be bounded as follows:

f(xk) f(xk+1) α 1 1 2αL f(x k)2,

in particular, if α = 1 L we have f(xk) f(xk+1) 1 2L f(x k)2.

Proof. We start with (1), the Lipschitz upper bound about xk:

f(xk+1) f(xk) + f(x k),xk+1 xk + L 2 xk+1 xk 2.

Now we plug in the step equation xk+1 xk = αf(xk) :

f(xk+1) f(xk) α f(x k)2 + α2L 2 f(x k)2,

Negating and rearranging gives:

f(xk) f(xk+1) α 1 1 2αL f(x k)2.

Now since we are considering strongly convex problems, we actually have found a bound on the gradient norm in terms of function value. We apply (4): f(y) f(x) + f(x),y x + 1 2μ f(x) f(y)2 using x = x, y = xk:

f(xk) f(x) + 1 2μ f(x k)2,

f(x k) 2μ f(xk) f(x).

So combining these two results:

f(xk) f(xk+1) 1 2L f(x k)2 μ L f(xk) f(x).

We then negate, add & subtract f(x), then rearrange:

f(xk+1) f(xk) μ L f(xk) f(x),

f(xk+1) f(x) f(x k) + f(x) μ L f(xk) f(x),

f(xk+1) f(x) 1 μ L f(xk) f(x).

Note that this function value style proof requires the step size α = 1 L or smaller, instead of α = 2 μ+L, which we shall see gives the fastest convergence when using some of the other proof techniques below.


This proof (when α = 1 L is used) treats gradient descent as an upper bound minimization scheme. Such methods, sometimes known under the Majorization-Minimization nomenclature [3], are quite widespread in optimization. They can be applied to non-convex problems even, although the convergence rates in that case are necessarily weak. Likewise this proof gives the weakest convergence rate of the proof techniques presented in this post, but it is perhaps the simplest. Upper bound minimization techniques have recently seen interesting applications in 2nd order optimization, in the form of Nesterov’s cubicly regularized Newton’s method [9]. For stochastic optimization, the MISO method is also a upper bound minimization scheme [6]. For non-smooth problems, an interesting application of the MM approach is in minimizing convex problems with non-convex regularizers of the form λlog x + 1, in the form of reweighted L1 regularization [5].

2 Iterate Descent

There is also a simple proof involving just the distance of the iterates xk to the solution. Using the definition of the step xk+1 xk = αf(xk):

xk+1 x2 = x k αf(x k) x2 = xk x2 2α f(x k),xk x + α2 f(x k)2.

We now apply both the inner product bounds (5) f(x) f(y),x y 1 L f(x) f(y)2and (6) f(x) f(y),x y μ x y2 , in the following negated forms, using f(x) = 0:

f(x k),xk x1 L f(x k)2,

f(x k),xk xμ x k x2.

The inner product term has a weight 2α, and we apply each of these with weight α, giving:

xk+1 x2 1 αμ x k x2 + α α 1 L f(x k)2.

Now if we take α = 1 L, then the last term cancels and we have:

xk+1 x2 1 μ L xk x2.

This proof is not as tight as possible. Instead of splitting the inner product term and applying both bounds (5) and (6), we can apply the following stronger combined bound from Nesterov’s Book [7]:

f(x) f(y),x y μL μ + L x y2 + 1 μ + L f(x) f(y)2. (7)

Doing so yields:

xk+1 x2 1 2αμL μ + L xk x2 + α α 2 μ + L f(x k)2.

Now clearly to cancel out the gradient norm term we can take α = 2 μ+L, which yields the convergence rate:

xk+1 x2 1 4μL μ + L2 xk x2 1 4μ L xk x2.


This proof technique is the building block of the standard stochastic gradient descent (SGD) proof. The above proof is mostly based on Nesterov’s book, I’m not sure what the original citation is. It has a nice geometric interpretation, as the bound on the inner product term f(xk),xk x can easily be illustrated in 2 dimensions, say on a white-board. It’s effectively a statement on the angles that gradients in convex problems can take. To get the strongest bound using this technique, the complex bound in Equation 7 has to be used. That stronger bound is not really straight-forward, and perhaps too technical (in my opinion) to use in a textbook proof of the convergence rate.

3 Using the Second Fundamental Theorem of Calculus

Recall the second fundamental theorem of calculus:

f(y) = f(x) +xyf(z)dz.

This can be applied along intervals in higher dimensions. The case we care about is applying it to the first derivatives of f, giving an integral involving the Hessian:

f(y) = f(x) +01 fx + τ(y x),y xdτ.

We abuse the angle bracket notation here to apply to matrix-vector products as well as the usual dot-product. Using this result gives an interesting proof of convergence of gradient descent that doesn’t rely on the usual convexity lemmas. This proof bounds the distance to solution, just like the previous proof.

Lemma 2. For any positive t:

x y + t f(y) f(x) max 1 tL, 1 tμ x y.

Proof. We start by applying the second fundamental theorem of calculus in the above form:

x y + t f(y) f(x) = x y + t01 fx + τ(y x),y xdτ = 01 tfx + τ(y x) I,y xdτ 01 tfx + τ(y x) I,y xdτ 01 tfx + τ(y x) I x ydτ maxz tf(z) I x y.

Now we examine the eigenvalues of f(z). the minimum one is at least μ and the maximum at most L. An examination of the possible range of the eigenvalues of (tf(z) I) gives max 1 tL, 1 tμ. □

Using this lemma gives a simple proof along the lines of the iterate descent proof.

First, note that xk+1 x is in the right form for direct application of this lemma after substituting in the step equation:

xk+1 x = x k x + α f(x k) f(x) max 1 αL, 1 αμ xk x.

Note we introduced f(x) for “free”, as it’s of course equal to zero. The next step is optimize this bound in terms of α. Note that L is always larger than μ, so we take the 1 αL absolute value as negative, and the other positive, and match their magnitudes:

1 + αL = 1 αμ,

α(L + μ) = 2,

α = 2 L + μ.

Which gives the convergence rate:

xk+1 x L μ L + μ xk x.

Note that this rate is in terms of the distance to solution directly, rather than its square like in the previous proof. Converting to squared norm gives the same rate as before.


This proof technique has a linear-algebra feel to it, and is perhaps most comfortable to people with that background. The absolute values make it ugly in my opinion though. This proof technique is the building block used in the standard proof of the convergence of the heavy ball method for strongly convex problems [10]. It doesn’t appear to have many other applications, and so is probably the least seen of the techniques in this document. The main use of this kind of argument is in lower complexity bounds, where we often do some sort of eigenvalue analysis.

4 Lyapunov Style

The above results prove convergence of either the iterates or the function value separately. There is an interesting proof involving the sum of the two quantities. First we start with with the iterate convergence:

xk+1 x2 = x k x αf(x k)2 = xk x2 2α f(x k),xk x + α2 f(x k)2.

Now we use the function descent amount equation (Lemma 1) to bound the gradient norm term: 1 c f(x k)2 f(x k) f(xk+1) , where we have defined c = 1 α 1 1 2αL:

xk+1 x2 x k x2 + cα2 f(x k) f(xk+1) 2α f(x k),xk x.

Now we use the strong convexity lower bound (2) in a rearranged form:

f(x k),x x k f(x) f(x k) μ 2 xk x2,

to simplify:

xk+1 x2 1 αμ x k x2 + cα2 f(x k) f(xk+1) + 2α f(x) f(x k).

Now rearranging further:

xk+1 x2+cα2 f(x k+1) f(x) 1 αμ x k x2+cα2 2α f(x k) f(x).

Now this equation gives a descent rate for the weighted sum of xk x2 and f(xk) f(x). The best rate is given by matching the two convergence rates, that of the iterate distance terms:

1 αkμ,

and that of the function value terms, which changes from cα2 to cα2 2α:

cα2 2αk cα2 = 1 2 cα = 1 2 1 1 2αL = αL 1.

Matching these two rates:

1 αμ = αL 1,

2 = α(μ + L),

α = 2 μ + L.

Using this derived value for α gives a convergence rate of 1 2μ μ+L. I.e.

xk+1 x2+cα2 f(x k+1) f(x) 1 2μ μ + L xk x2 + cα2 f(x k) f(x).

and therefore after k steps:

xk x2 + cα2 f(x k) f(x) 1 2μ μ + Lk x 0 x2 + cα2 f(x 0) f(x).

The constants can be simplified to:

cα2 = α2 α 1 1 2αL = α 1 1 2αL = α 1 L μ+L = α μ μ+L = 2 μ.

Now we use: f(x0) f(x) L 2 x0 x2 on the right, and we just drop the function value term altogether on the left:

xk x2 1 2μ μ + Lkμ + L μ x0 x2.

If we instead use the more robust step size 1 L, which doesn’t require knowledge of μ, then a simple calculation shows that we instead get c = 2L, and so:

xk x2 1 μ Lk x 0 x2 + 2 L f(x0) f(x), 1 μ Lk2 x 0 x2.

The right hand side is obviously a much tighter bound then when 2(μ + L) is used, but the geometric rate is roughly twice as slow.


This proof technique has seen a lot of application lately. It is used for the SAGA [2]and SVRG [4] methods, and can be applied to accelerated method even, such as the accelerated coordinate descent theory [8]. The Lyapunov function analysis technique is of great general utility, and so it is worth studying carefully. It is covered perhaps best in Polyak’s book [10].

5 Gradient Norm Descent

In the strongly convex case, it is actually possible to show that the gradient norm decreases at least linearly as well as the function value and iterates. This requires a fixed step size of α = 1 L, as it is not true when line searches are used.

Lemma 3. For α = 1 L:

xk+2 xk+1 2 1 μ L xk+1 xk 2.

Note that xk+2 xk+1 2 = 1 L2 f(xk+1)2 and xk+1 xk 2 = 1 L2 f(xk)2.

Proof. We start by expanding in terms of the step equation xk+1 = xk αf(xk).

xk+2 xk+1 2 = x k+1 αf(x k+1) xk + αf(x k)2 = xk+1 xk 2 + α2 f(x k+1) f(x k)2 +2α f(x k) f(x k+1),xk+1 xk .

Now applying both inner product bounds (5) and (6):

wk+1 wk 2 1 αμ x k+1 xk 2 + α α 1 L f(x k+1) f(x k)2.

So for α = 1 L this simplifies to:

xk+2 xk+1 2 1 μ L xk+1 xk 2.

Chaining this result (Lemma 3) over k gives:

xk+1 xk 2 1 μ Lk x 1 x0 2 = 1 μ Lk 1 L2 f(x 0)2.

We now use f(xk) f(x) 1 2μ f(x k)2 = L2 2μ xk+1 xk 2 :

f(xk) f(x) 1 μ Lk 1 2μ f(x 0)2.


This technique is probably the weirdest of those listed here. It has seen application in proving the convergence rate of MISO under some different stochastic orderings [1]. While clearly a primal result, this proof has some components normally seen in the proof for a dual method. The gradient f(xk) is effectively the dual iterate. Another interesting property is that the portion of the proof concerning the gradient’s convergence uses the strong convexity between xk+1 and xk, whereas the other proofs considered all use the degree of strong convexity between xk and x.

This proof technique can’t work when line searches are used, as bounding the inner product:

α f(x k) f(x k+1),xk+1 xk ,

would fail if α changed between steps, as it would become αkf(xk) αk+1f(xk+1),xk+1 xk, which is a weird expression to work with.


[1]    Aaron Defazio. New Optimization Methods for Machine Learning. PhD thesis, Australian National University, 2014.

[2]    Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. Advances in Neural Information Processing Systems 27 (NIPS 2014), 2014.

[3]    David R. Hunter and Kenneth Lange. Quantile regression via an mm algorithm. Journal of Computational and Graphical Statistics, 9, 2000.

[4]    Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. NIPS, 2013.

[5]    Qiang Liu and Alexander Ihler. Learning scale free networks by reweighted l1 regularization. AISTATS, 2011.

[6]    Julien Mairal. Incremental majorization-minimization optimization with application to large-scale machine learning. Technical report, INRIA Grenoble RhÃŽne-Alpes / LJK Laboratoire Jean Kuntzmann, 2014.

[7]    Yu. Nesterov. Introductory Lectures On Convex Programming. Springer, 1998.

[8]    Yu. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. Technical report, CORE, 2010.

[9]    Yu. Nesterov and B.T. Polyak. Cubic regularization of newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.

[10]    Boris Polyak. Introduction to Optimization. Optimization Software, Inc., Publications Division., 1987.

The NIPS Consistency Experiment — My Experience

This year the NIPS (Neural Information Processing Systems) conference organisers decided to run an experiment on the consistency of paper reviews. They selected 10% of papers to be reviewed twice. Different area chairs and a different set of 3 reviewers were chosen for those papers.
Luckily for me, my paper with Francis Bach and Simon Lacoste-Julien was one of those 10%. My paper was initially submitted as Paper ID 867. They essentially created a duplicated paper id for me, #1860, which contained the second set of reviews.
This duplication of reviews was particularly interesting in my case. There was a very large discrepancy between the two sets of reviews. I won’t know if this is representative of the consistency of other reviews until NIPS releases the statistics from there experiment.
For reference, the two sets of reviews gave the following scores, before rebuttal:
Set 1 review 1: Quality 9, impact 2 (high) , confidence 4 (confident but not certain)
Set 1 review 2: Quality 6, impact 1 (incremental), confidence 3 (fairly confident)
Set 1 review 3: Quality 6, impact 1 (incremental), confidence 5 (Absolutely certain)
Set 2 review 1: Quality 5, impact 1 (incremental), confidence 5 (Absolutely certain)
Set 2 review 2: Quality 3, impact 1 (incremental), confidence 4 (Confident)
Set 2 review 3: Quality 6, impact 1 (incremental), confidence 5 (Absolutely certain)
Generally for NIPS a 9/6/6 in quality gives a high change of acceptance, where as a 5/3/6 is a certain non-acceptance. So one set of reviews was a clear accept and the second a clear reject! The meta reviews were as follows:
The paper introduces a new incremental gradient method that allows adaptation to the level of convexity in the input. The paper has a nice discussion of related methods, and it has a simpler proof that will be of interest to researchers. Recommendation: Accept.
Unfortunately, the scores are too low for acceptance to NIPS, and none of the reviewers were willing to argue for acceptance of the paper. The reviewers discussed the paper after the author rebuttal, and all reviewers ultimately felt that the paper could use some additional polish before publishing. Please do keep in mind the various criticisms of the reviewers when submitting to another venue.
The paper we submitted was fairly rough in its initial state, and the reviewers suggested lots of improvements. Particularly the Set 2/review 1, which was the most in depth review. I generally agree with the second meta-review in that the paper needed addition polish, which we have done for the camera ready.
In the end the paper was accepted. I suspect most papers with this kind of accept/reject split would be accepted, as it would just seem unfair if it were not.
The issue of consistency in paper reviews is clear to any body who has ever resubmitted a rejected paper to a different venue. It feels like luck of a draw to a degree. There is no easy solutions to this, so I’ll be interested to see if NIPS changes there process in future years, and what changes they make.