The Many Ways to Analyse Gradient Descent: Part 2

The previous post detailed a bunch of different ways of proving the convergence rate of gradient descent:

xk+1 = xk αf(x k),

for strongly convex problems. This post considers the non-strongly convex, but still convex case.

Rehash of Basic Lemmas

These hold for any x and y. L the Lipschitz smoothness constant. These are completely standard, see Nesterov’s book [2] for proofs. We use the notation x for an arbitrary minimizer of f.

f(y) f(x) + f(x),y x + L 2 x y2. (1)
f(y) f(x) + f(x),y x + 1 2L f(x) f(y)2. (2)
f(x) f(y),x y 1 L f(x) f(y)2. (3)

1 Proximal Style Convergence Proof

The following argument gives a proof of convergence that is well suited to modification for proving the convergence of proximal gradient methods. We start by proving a useful lemma:

Lemma 1. For any xk and y, when xk+1 = xk 1 Lf(x k):

2 L f(y) f(xk+1) y xk+1 2 x k y2.

Proof. We start with the Lipschitz upper bound around xk of xk+1:

f(xk+1) f(xk) + f(x k),xk+1 xk + L 2 xk+1 xk 2.

Now we bound f(xk) using the negated convexity lower bound of y around x (i.e. f(y) f(xk) + f(xk),y xk):

f(xk+1) f(y) + f(x k),xk+1 xk + xk y + L 2 xk+1 xk 2.

Negating, rearranging and multiplying through by 2 L gives:

2 L f(y) f(xk+1) 2 L f(x k),xk+1 y + xk+1 xk 2.

Now we replace f(xk) using xk+1 = xk 1 Lf(x k):

2 L f(y) f(xk+1) 2 y xk+1,xk xk+1 xk xk+1 2 = 2 y xk + xk xk+1,xk xk+1 xk xk+1 2 = 2 y xk,xk xk+1 + xk xk+1 .2

Now we complete the square using the quadratic y xk + xk xk+1 2 = y xk 2 + 2 y xk,xk xk+1 + xk xk+1 2. So we have:

2 L f(y) f(xk+1) y xk + xk xk+1 y xk 2 = y xk+1 y xk 2.

Using this lemma, the proof is quite simple. We apply it with y = x:

xk+1 x2 x k x2 2 L f(xk+1) f(x).

Now we sum this between 0 and k 1. The left hand side telescopes:

xk x2 x 0 x2 2 L r=0k1 f(x r+1) f(x).

Now we use the fact that gradient descent is a descent method, which implies that f(xk) f(xr+1) for all r k 1. So:

xk x2 x 0 x2 2k L f(xk) f(x).

Now we just drop the xk x2 term since it is positive and small. Leaving:

f(xk) f(x) L 2k x0 x2.


As far as I know, this proof is fairly modern [1]. Notice that unlike the strongly convex case, the quantity we are bounding (f(xk) f(x)) does not appear on both sides of the bound. Unfortunately, without strong convexity there is necessarily a looseness to the bounds, and this takes the form of bounding function value by distance to solution, with a large wiggle-factor. One thing that is perhaps a little confusing is the use of distance to solution x x, when it is not unique, as there are potentially multiple minimizers for non-strongly convex problems. The bound in fact holds for any chosen minimizer x. I found this to be a little confusing at first.

2 Older Style Proof

This proof is from Nesterov [2]. I’m not sure of the original source for it.

We start with the function value descent equation, using w := α 1 1 2αL :

f(xk+1) f(xk) w f(x k)2.

We introduce the simplified notation Δk = f(xk) f(x) so that we have

Δk+1 Δk w f(x k)2. (4)

Now using the convexity lower bound around xk evaluated at x, namely:

Δk f(x k),xk x,

and applying Cauchy-Schwarz (note the spelling! there is no “t” in Schwarz) to it:

Δk xk x f(x k) x0 x f(x k).

The last line is because gradient descent method descends in iterate distance each step. We now introduce the additional notation r0 = x0 x. Using this notation and rearranging gives:

f(x k) Δkr0.

We plug this into the function descent equation (Eq 4) above to get:

Δk+1 Δk w r02Δk2.

We now divide this through by Δk+1:

1 Δk Δk+1 w r02 Δk2 Δk+1

Then divide through by Δk also:

1 Δk 1 Δk+1 w r02 Δk Δk+1.

Now we use the fact that gradient descent is a descent method again, which implies that Δk Δk+1 1, so:

1 Δk 1 Δk+1 w r02.

1 Δk+1 1 Δk + w r02.

We then chain this inequality for each k:

1 Δk+1 1 Δk + w r02 1 Δk1 + 2 w r02 1 Δ0 + w r02(k + 1)

1 Δk+1 1 Δ0 + w r02(k + 1).

To get the final convergence rate we invert both sides:

f(xk) f(x) f(x0) f(x) x 0 x2 x0 x2 + w f(x0) f(x)k.

This is quite a complex expression. To simplify even further, we can get rid of the f(x0) f(x) terms on the right hand side using the Lipschitz upper bound about x:

f(x0) f(x) L 2 x x2.

Plugging in the step size α = 1 L gives w = 1 2L, yielding the following simpler convergence rate:

f(xk) f(x) 2L x0 x2 k + 4 .

Compared to the rate from the previous proof, f(xk) f(x) L 2k x0 x2, this is slightly better at k = 1, and worse thereafter.


I don’t like this proof. It’s feels like a random sequence of steps when you first look at it. The way the proof uses inverse quantities like 1 Δk is also confusing. The key equation is really the direct bound on Δ:

Δk+1 Δk w r02Δk2.

Often this is the kind of equation you encounter when proving the properties of dual methods for example. Equations of this kind can be encountered when applying proximal methods to non-differentiable functions also. It is also quite a clear statement about what is going on in terms of per-step convergence, a property that is less clear in the previous proof.

3 Gradient Concentration

When we don’t even have convexity, just Lipschitz smoothness, we can still prove something about convergence of the gradient norm. The Lipschitz upper bound holds without the requirement of convexity:

f(y) f(x) + f(x),y x + L 2 x y2.

Recall that from minimizing this bound with respect to y we can prove the equation:

f(xk) f(xk+1) 1 2L f(x k)2.

Examine this equation carefully. We have a bound on each gradient encountered during the optimization in terms of the difference in function values between steps. The sequence of function values is bounded below, so in fact we have a hard bound on the sum of the encountered gradient norms. Effectively, we chain (telescope) the above inequality over steps:

f(xk1) f(xk) + f(xk) f(xk+1) 1 2L f(x k)2 + 1 2L f(x k1)2.


f(x0) f(xk+1) 1 2L ik f(x i)2.

Now since f(xk+1) f(x):

ik f(x i)2 2L f(x 0) f(x).

Now to make this bound a little more concrete, we can put it in terms of the gradient gk with the smallest norm seen during the minimization ( gk gi for all i), so that ik f(xi)2 k gk 2, so:

gk 2 2L k f(x0) f(x).


Notice that the core technique used in this proof is the same as the last 2 proofs. We have a single step inequality bounding one of the quantities we care about. By summing that inequality over each step of minimization, one side of the inequality telescopes. We get an equality saying the the sum of the k versions of that quantity (one from each step) is less then some fixed constant independent of k, for any k. The convergence rate is thus of the form 1k, because the summation of the k quantities fits in a fixed bound.

Almost any proof for an optimization method that applies in the non-convex case uses a similar proof technique. There is just not that many assumptions to work with, so the options are limited.


[1]   Amir Beck and Marc Teboulle. Gradient-based algorithms with applications to signal recovery problems. Convex Optimization in Signal Processing and Communications, 2009.

[2]   Yu. Nesterov. Introductory Lectures On Convex Programming. Springer, 1998.

5 thoughts on “The Many Ways to Analyse Gradient Descent: Part 2”

  1. Hi Aaron, nice post. Do you know if the inequality of lemma 1 is sharp? I mean if there is an example of a convex and L-smooth function f and a suitably chosen y, such that we have equality in the place of inequality.

  2. Hi Aaron, nice post. Do you know any case where the inequality of lemma 1 is sharp? I mean an example of a convex and L-smooth function f and a suitably chosen y, such that we have equality in the place of the inequality.

    1. Essentially all of the inequalities are sharp for 1D quadratics, or higher dimensional quadratics along the eigen-direction corresponding to the largest eigenvalue for the L formulas and the smallest for the mu formulas.

  3. Hi Aaron,
    Could you provide me with a reference to the gradient concentration technique. Any paper where such a technique has been used for non-convex case. You may email me at the address.

Leave a Reply

Your email address will not be published. Required fields are marked *