# The Many Ways to Analyse Gradient Descent: Part 2

The previous post detailed a bunch of different ways of proving the convergence rate of gradient descent:

${x}_{k+1}={x}_{k}-\alpha {f}^{\prime }\left({x}_{k}\right),$

for strongly convex problems. This post considers the non-strongly convex, but still convex case.
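To fix notation concretely, here is a minimal sketch of this iteration in Python. The quadratic objective, its constants, and the iteration count are illustrative choices of mine, not anything from the post:

```python
import numpy as np

# Illustrative smooth convex objective: f(x) = 0.5 * x^T A x, minimized at x* = 0.
A = np.array([[2.0, 0.0], [0.0, 0.5]])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

L = np.linalg.eigvalsh(A).max()  # Lipschitz constant of the gradient
alpha = 1.0 / L                  # the step size used throughout the post

x = np.array([1.0, 1.0])
for k in range(100):
    x = x - alpha * grad(x)      # x_{k+1} = x_k - alpha * f'(x_k)

print(f(x))  # the function value approaches the minimum value of 0
```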

### Rehash of Basic Lemmas

These hold for any $x$ and $y$, where $L$ is the Lipschitz smoothness constant of $f$. They are completely standard; see Nesterov’s book for proofs. We use the notation ${x}^{\ast }$ for an arbitrary minimizer of $f$.

 $f\left(y\right)\le f\left(x\right)+⟨{f}^{\prime }\left(x\right),y-x⟩+\frac{L}{2}{∥x-y∥}^{2}.$ (1)
 $f\left(y\right)\ge f\left(x\right)+⟨{f}^{\prime }\left(x\right),y-x⟩+\frac{1}{2L}{∥{f}^{\prime }\left(x\right)-{f}^{\prime }\left(y\right)∥}^{2}.$ (2)
 $⟨{f}^{\prime }\left(x\right)-{f}^{\prime }\left(y\right),x-y⟩\ge \frac{1}{L}{∥{f}^{\prime }\left(x\right)-{f}^{\prime }\left(y\right)∥}^{2}.$ (3)
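These inequalities are easy to sanity-check numerically. The sketch below evaluates all three on random point pairs for a small quadratic, taking L as the largest eigenvalue of the Hessian (the specific test function is my own choice, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[3.0, 1.0], [1.0, 2.0]])  # positive definite Hessian
f = lambda x: 0.5 * x @ A @ x
g = lambda x: A @ x
L = np.linalg.eigvalsh(A).max()

for _ in range(1000):
    x, y = rng.standard_normal(2), rng.standard_normal(2)
    # (1) Lipschitz upper bound
    assert f(y) <= f(x) + g(x) @ (y - x) + L / 2 * np.sum((x - y) ** 2) + 1e-9
    # (2) lower bound involving the gradient difference
    assert f(y) >= f(x) + g(x) @ (y - x) + np.sum((g(x) - g(y)) ** 2) / (2 * L) - 1e-9
    # (3) co-coercivity of the gradient
    assert (g(x) - g(y)) @ (x - y) >= np.sum((g(x) - g(y)) ** 2) / L - 1e-9
print("all three lemmas hold on random samples")
```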

### 1 Proximal Style Convergence Proof

The following argument gives a proof of convergence that is well suited to modiﬁcation for proving the convergence of proximal gradient methods. We start by proving a useful lemma:

Lemma 1. For any ${x}_{k}$ and $y$, when ${x}_{k+1}={x}_{k}-\frac{1}{L}{f}^{\prime }\left({x}_{k}\right)$:

$\frac{2}{L}\left[f\left(y\right)-f\left({x}_{k+1}\right)\right]\ge {∥y-{x}_{k+1}∥}^{2}-{∥{x}_{k}-y∥}^{2}.$

Proof. We start with the Lipschitz upper bound around ${x}_{k}$ of ${x}_{k+1}$:

$f\left({x}_{k+1}\right)\le f\left({x}_{k}\right)+⟨{f}^{\prime }\left({x}_{k}\right),{x}_{k+1}-{x}_{k}⟩+\frac{L}{2}{∥{x}_{k+1}-{x}_{k}∥}^{2}.$

Now we bound $f\left({x}_{k}\right)$ using the convexity lower bound around ${x}_{k}$ evaluated at $y$ (i.e. $f\left(y\right)\ge f\left({x}_{k}\right)+⟨{f}^{\prime }\left({x}_{k}\right),y-{x}_{k}⟩$), rearranged into an upper bound on $f\left({x}_{k}\right)$:

$f\left({x}_{k+1}\right)\le f\left(y\right)+⟨{f}^{\prime }\left({x}_{k}\right),{x}_{k+1}-{x}_{k}+{x}_{k}-y⟩+\frac{L}{2}{∥{x}_{k+1}-{x}_{k}∥}^{2}.$

Negating, rearranging and multiplying through by $\frac{2}{L}$ gives:

$\frac{2}{L}\left[f\left(y\right)-f\left({x}_{k+1}\right)\right]\ge \frac{2}{L}⟨{f}^{\prime }\left({x}_{k}\right),y-{x}_{k+1}⟩-{∥{x}_{k+1}-{x}_{k}∥}^{2}.$

Now we replace ${f}^{\prime }\left({x}_{k}\right)$ using ${x}_{k+1}={x}_{k}-\frac{1}{L}{f}^{\prime }\left({x}_{k}\right)$:

$\begin{array}{rcll}\frac{2}{L}\left[f\left(y\right)-f\left({x}_{k+1}\right)\right]& \ge & 2⟨y-{x}_{k+1}\phantom{\rule{0.3em}{0ex}},{x}_{k}-{x}_{k+1}⟩-{∥{x}_{k}-{x}_{k+1}∥}^{2}& \text{}\\ & =& 2⟨y-{x}_{k}+{x}_{k}-{x}_{k+1}\phantom{\rule{0.3em}{0ex}},{x}_{k}-{x}_{k+1}⟩-{∥{x}_{k}-{x}_{k+1}∥}^{2}& \text{}\\ & =& 2⟨y-{x}_{k},{x}_{k}-{x}_{k+1}⟩+{∥{x}_{k}-{x}_{k+1}∥}^{2}.& \text{}\end{array}$

Now we complete the square using the quadratic ${∥y-{x}_{k}+{x}_{k}-{x}_{k+1}∥}^{2}={∥y-{x}_{k}∥}^{2}+2⟨y-{x}_{k},{x}_{k}-{x}_{k+1}⟩+{∥{x}_{k}-{x}_{k+1}∥}^{2}$. So we have:

$\begin{array}{rcll}\frac{2}{L}\left[f\left(y\right)-f\left({x}_{k+1}\right)\right]& \ge & {∥y-{x}_{k}+{x}_{k}-{x}_{k+1}∥}^{2}-{∥y-{x}_{k}∥}^{2}& \text{}\\ & =& {∥y-{x}_{k+1}∥}^{2}-{∥y-{x}_{k}∥}^{2}.& \text{}\end{array}$
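As a sanity check, Lemma 1 can be verified numerically. The sketch below tests it on random points for a small quadratic (the test function and constants are my own assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
A = np.diag([4.0, 1.0])          # convex quadratic test function
f = lambda x: 0.5 * x @ A @ x
g = lambda x: A @ x
L = 4.0                          # largest eigenvalue of A

for _ in range(1000):
    xk, y = rng.standard_normal(2), rng.standard_normal(2)
    xk1 = xk - g(xk) / L         # one gradient step from x_k
    lhs = 2 / L * (f(y) - f(xk1))
    rhs = np.sum((y - xk1) ** 2) - np.sum((xk - y) ** 2)
    assert lhs >= rhs - 1e-9     # Lemma 1
print("Lemma 1 verified on random samples")
```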

Using this lemma, the proof is quite simple. We apply it with $y={x}^{\ast }$:

${∥{x}_{k+1}-{x}^{\ast }∥}^{2}-{∥{x}_{k}-{x}^{\ast }∥}^{2}\le -\frac{2}{L}\left[f\left({x}_{k+1}\right)-f\left({x}^{\ast }\right)\right].$

Now we sum this between $0$ and $k-1$. The left hand side telescopes:

${∥{x}_{k}-{x}^{\ast }∥}^{2}-{∥{x}_{0}-{x}^{\ast }∥}^{2}\le -\frac{2}{L}\sum _{r=0}^{k-1}\left[f\left({x}_{r+1}\right)-f\left({x}^{\ast }\right)\right].$

Now we use the fact that gradient descent is a descent method, which implies that $f\left({x}_{k}\right)\le f\left({x}_{r+1}\right)$ for all $r\le k-1$. So:

${∥{x}_{k}-{x}^{\ast }∥}^{2}-{∥{x}_{0}-{x}^{\ast }∥}^{2}\le -\frac{2k}{L}\left[f\left({x}_{k}\right)-f\left({x}^{\ast }\right)\right].$

Now we just drop the ${∥{x}_{k}-{x}^{\ast }∥}^{2}$ term, since it is non-negative, leaving:

$f\left({x}_{k}\right)-f\left({x}^{\ast }\right)\le \frac{L}{2k}{∥{x}_{0}-{x}^{\ast }∥}^{2}.$
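This final bound can be observed holding along an actual run of gradient descent. A sketch, on a quadratic of my own choosing with minimizer at zero:

```python
import numpy as np

A = np.diag([2.0, 0.1])            # ill-conditioned quadratic, minimizer x* = 0
f = lambda x: 0.5 * x @ A @ x      # so f(x*) = 0
L = 2.0

x0 = np.array([1.0, 1.0])
x = x0.copy()
r0_sq = np.sum(x0 ** 2)            # ||x_0 - x*||^2

for k in range(1, 201):
    x = x - (A @ x) / L            # step size 1/L
    bound = L * r0_sq / (2 * k)
    assert f(x) <= bound + 1e-12   # f(x_k) - f(x*) <= L ||x_0 - x*||^2 / (2k)
print("rate bound holds for 200 iterations")
```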

As far as I know, this proof is fairly modern. Notice that unlike the strongly convex case, the quantity we are bounding, $f\left({x}_{k}\right)-f\left({x}^{\ast }\right)$, does not appear on both sides of the bound. Unfortunately, without strong convexity there is necessarily some looseness in the bounds, and it takes the form of bounding the function value by the distance to the solution, with a large wiggle-factor. One thing that is perhaps a little confusing is the use of the distance to solution $x-{x}^{\ast }$ when the minimizer is not unique, as there are potentially multiple minimizers for non-strongly convex problems. The bound in fact holds for any chosen minimizer ${x}^{\ast }$.

### 2 Older Style Proof

This proof is from Nesterov. I’m not sure of the original source for it.

We start with the function value descent equation, which follows from plugging the step ${x}_{k+1}={x}_{k}-\alpha {f}^{\prime }\left({x}_{k}\right)$ into the Lipschitz upper bound (1). Using $w:=\alpha \left(1-\frac{1}{2}\alpha L\right)$:

$f\left({x}_{k+1}\right)\le f\left({x}_{k}\right)-w{∥{f}^{\prime }\left({x}_{k}\right)∥}^{2}.$

We introduce the simpliﬁed notation ${\Delta }_{k}=f\left({x}_{k}\right)-f\left({x}^{\ast }\right)$ so that we have

 ${\Delta }_{k+1}\le {\Delta }_{k}-w{∥{f}^{\prime }\left({x}_{k}\right)∥}^{2}.$ (4)

Now using the convexity lower bound around ${x}_{k}$ evaluated at ${x}^{\ast }$, namely:

${\Delta }_{k}\le ⟨{f}^{\prime }\left({x}_{k}\right),{x}_{k}-{x}^{\ast }⟩,$

and applying Cauchy-Schwarz (note the spelling! there is no “t” in Schwarz) to it:

$\begin{array}{rcll}{\Delta }_{k}& \le & ∥{x}_{k}-{x}^{\ast }∥∥{f}^{\prime }\left({x}_{k}\right)∥& \text{}\\ & \le & ∥{x}_{0}-{x}^{\ast }∥∥{f}^{\prime }\left({x}_{k}\right)∥.& \text{}\end{array}$

The last line follows because gradient descent also decreases the distance to the minimizer at each step. We now introduce the additional notation ${r}_{0}=∥{x}_{0}-{x}^{\ast }∥$. Using this notation and rearranging gives:

$-∥{f}^{\prime }\left({x}_{k}\right)∥\le -{\Delta }_{k}∕{r}_{0}.$

We plug this into the function descent equation (Eq 4) above to get:

${\Delta }_{k+1}\le {\Delta }_{k}-\frac{w}{{r}_{0}^{2}}{\Delta }_{k}^{2}.$

We now divide this through by ${\Delta }_{k+1}$:

$1\le \frac{{\Delta }_{k}}{{\Delta }_{k+1}}-\frac{w}{{r}_{0}^{2}}\frac{{\Delta }_{k}^{2}}{{\Delta }_{k+1}}.$

Then divide through by ${\Delta }_{k}$ also:

$\frac{1}{{\Delta }_{k}}\le \frac{1}{{\Delta }_{k+1}}-\frac{w}{{r}_{0}^{2}}\frac{{\Delta }_{k}}{{\Delta }_{k+1}}.$

Now we use the fact that gradient descent is a descent method again, which implies that $\frac{{\Delta }_{k}}{{\Delta }_{k+1}}\ge 1,$ so:

$\frac{1}{{\Delta }_{k}}\le \frac{1}{{\Delta }_{k+1}}-\frac{w}{{r}_{0}^{2}}.$

$\therefore \frac{1}{{\Delta }_{k+1}}\ge \frac{1}{{\Delta }_{k}}+\frac{w}{{r}_{0}^{2}}.$

We then chain this inequality for each $k$:

$\frac{1}{{\Delta }_{k+1}}\ge \frac{1}{{\Delta }_{k}}+\frac{w}{{r}_{0}^{2}}\ge \frac{1}{{\Delta }_{k-1}}+2\frac{w}{{r}_{0}^{2}}\ge \cdots \ge \frac{1}{{\Delta }_{0}}+\frac{w}{{r}_{0}^{2}}\left(k+1\right)$

$\therefore \frac{1}{{\Delta }_{k+1}}\ge \frac{1}{{\Delta }_{0}}+\frac{w}{{r}_{0}^{2}}\left(k+1\right).$

To get the final convergence rate we invert both sides and shift the index from $k+1$ to $k$:

$f\left({x}_{k}\right)-f\left({x}^{\ast }\right)\le \frac{\left[f\left({x}_{0}\right)-f\left({x}^{\ast }\right)\right]{∥{x}_{0}-{x}^{\ast }∥}^{2}}{{∥{x}_{0}-{x}^{\ast }∥}^{2}+w\left[f\left({x}_{0}\right)-f\left({x}^{\ast }\right)\right]k}.$

This is quite a complex expression. To simplify even further, we can get rid of the $f\left({x}_{0}\right)-f\left({x}^{\ast }\right)$ terms on the right hand side using the Lipschitz upper bound about ${x}^{\ast }$:

$f\left({x}_{0}\right)-f\left({x}^{\ast }\right)\le \frac{L}{2}{∥{x}_{0}-{x}^{\ast }∥}^{2}.$

Plugging in the step size $\alpha =\frac{1}{L}$ gives $w=\frac{1}{2L}$, and combining these yields the following simpler convergence rate:

$f\left({x}_{k}\right)-f\left({x}^{\ast }\right)\le \frac{2L{∥{x}_{0}-{x}^{\ast }∥}^{2}}{k+4}.$

Compared to the rate from the previous proof, $f\left({x}_{k}\right)-f\left({x}^{\ast }\right)\le \frac{L}{2k}{∥{x}_{0}-{x}^{\ast }∥}^{2}$, this is slightly better at $k=1$, and worse thereafter.
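The comparison is easy to verify by tabulating both bounds, here expressed in units of $L{∥{x}_{0}-{x}^{\ast }∥}^{2}$:

```python
# Compare the two convergence bounds, in units of L * ||x_0 - x*||^2.
bound_proximal = lambda k: 1.0 / (2 * k)   # L r0^2 / (2k), from the first proof
bound_older = lambda k: 2.0 / (k + 4)      # 2 L r0^2 / (k + 4), from this proof

for k in [1, 2, 5, 10, 100]:
    print(k, bound_proximal(k), bound_older(k))
# At k = 1 the second bound (2/5) beats the first (1/2); for k >= 2 it is worse.
```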

I don’t like this proof. It feels like a random sequence of steps when you first look at it. The way the proof uses inverse quantities like $\frac{1}{{\Delta }_{k}}$ is also confusing. The key equation is really the direct bound on $\Delta$:

${\Delta }_{k+1}\le {\Delta }_{k}-\frac{w}{{r}_{0}^{2}}{\Delta }_{k}^{2}.$

This is the kind of equation you often encounter when proving properties of dual methods, for example, and equations of this kind can also be encountered when applying proximal methods to non-differentiable functions. It is also quite a clear statement of what is going on in terms of per-step convergence, a property that is less clear in the previous proof.
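To get a feel for why a recurrence of this form forces a 1/k rate, we can simulate it with the inequality taken to be tight (the constant c, standing in for w divided by the squared initial distance, is an arbitrary illustrative value):

```python
# Simulate Delta_{k+1} = Delta_k - c * Delta_k^2 with the inequality tight.
c = 0.5       # stands in for w / r_0^2; arbitrary illustrative value
delta = 1.0   # Delta_0
for k in range(1, 1001):
    delta = delta - c * delta ** 2
    # The analysis predicts Delta_k <= Delta_0 / (1 + c * Delta_0 * k).
    assert delta <= 1.0 / (1 + c * k) + 1e-12
print(delta)  # decays like 1/k rather than geometrically
```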

### 3 The Non-Convex Case

When we don’t even have convexity, just Lipschitz smoothness, we can still prove something about the convergence of the gradient norm. The Lipschitz upper bound holds without the requirement of convexity:

$f\left(y\right)\le f\left(x\right)+⟨{f}^{\prime }\left(x\right),y-x⟩+\frac{L}{2}{∥x-y∥}^{2}.$

Recall that minimizing this bound with respect to $y$ (the minimizer being exactly the gradient step $y=x-\frac{1}{L}{f}^{\prime }\left(x\right)$) proves the equation:

$f\left({x}_{k}\right)-f\left({x}_{k+1}\right)\ge \frac{1}{2L}{∥{f}^{\prime }\left({x}_{k}\right)∥}^{2}.$

Examine this equation carefully. We have a bound on each gradient encountered during the optimization in terms of the difference in function values between steps. The sequence of function values is bounded below, so in fact we have a hard bound on the sum of the squared norms of the encountered gradients. Effectively, we chain (telescope) the above inequality over steps:

$f\left({x}_{k-1}\right)-f\left({x}_{k}\right)+f\left({x}_{k}\right)-f\left({x}_{k+1}\right)\ge \frac{1}{2L}{∥{f}^{\prime }\left({x}_{k}\right)∥}^{2}+\frac{1}{2L}{∥{f}^{\prime }\left({x}_{k-1}\right)∥}^{2}.$

$...$

$f\left({x}_{0}\right)-f\left({x}_{k+1}\right)\ge \frac{1}{2L}\sum _{i=0}^{k}{∥{f}^{\prime }\left({x}_{i}\right)∥}^{2}.$

Now since $f\left({x}_{k+1}\right)\ge f\left({x}^{\ast }\right)$:

$\sum _{i=0}^{k}{∥{f}^{\prime }\left({x}_{i}\right)∥}^{2}\le 2L\left(f\left({x}_{0}\right)-f\left({x}^{\ast }\right)\right).$

Now to make this bound a little more concrete, we can state it in terms of the gradient ${g}_{k}$ with the smallest norm seen during the minimization ($∥{g}_{k}∥\le ∥{g}_{i}∥$ for all $i\le k$), so that ${\sum }_{i=0}^{k}{∥{f}^{\prime }\left({x}_{i}\right)∥}^{2}\ge k{∥{g}_{k}∥}^{2}$, giving:

${∥{g}_{k}∥}^{2}\le \frac{2L}{k}\left(f\left({x}_{0}\right)-f\left({x}^{\ast }\right)\right).$
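Here is a sketch of this gradient-norm bound on a genuinely non-convex function. The test function is my own choice: f(x) = x² + 3 sin²(x), whose second derivative ranges over [−4, 8], so it is non-convex but L-smooth with L = 8, and its global minimum value is 0 at x = 0:

```python
import numpy as np

f = lambda x: x ** 2 + 3 * np.sin(x) ** 2     # non-convex, global minimum f* = 0
g = lambda x: 2 * x + 3 * np.sin(2 * x)       # f''(x) = 2 + 6 cos(2x), so L = 8
L = 8.0

x = 2.5
f0 = f(x)
min_grad_sq = np.inf
for k in range(1, 501):
    min_grad_sq = min(min_grad_sq, g(x) ** 2)  # smallest squared gradient norm so far
    x = x - g(x) / L                           # gradient step with alpha = 1/L
    # The smallest gradient seen obeys ||g_k||^2 <= (2L/k)(f(x_0) - f*).
    assert min_grad_sq <= 2 * L * f0 / k + 1e-9
print("gradient norm bound holds for 500 iterations")
```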

Notice that the core technique used in this proof is the same as in the previous two proofs. We have a single step inequality bounding one of the quantities we care about. By summing that inequality over each step of the minimization, one side of the inequality telescopes. We get an inequality saying that the sum of the $k$ versions of that quantity (one from each step) is less than some fixed constant independent of $k$, for any $k$. The convergence rate is thus of the form $1∕k$, because the summation of the $k$ quantities fits in a fixed bound.

Almost any proof for an optimization method that applies in the non-convex case uses a similar technique. There are just not that many assumptions to work with, so the options are limited.

### References

   Amir Beck and Marc Teboulle. Gradient-based algorithms with applications to signal recovery problems. Convex Optimization in Signal Processing and Communications, 2009.

   Yu. Nesterov. Introductory Lectures On Convex Programming. Springer, 1998.

## 5 thoughts on “The Many Ways to Analyse Gradient Descent: Part 2”

1. Foivos says:

Hi Aaron, nice post. Do you know if the inequality of lemma 1 is sharp? I mean if there is an example of a convex and L-smooth function f and a suitably chosen y, such that we have equality in the place of inequality.

1. Aaron Defazio says:

Essentially all of the inequalities are sharp for 1D quadratics, or higher dimensional quadratics along the eigen-direction corresponding to the largest eigenvalue for the L formulas and the smallest for the mu formulas.

3. Ravi says:

Hi Aaron,
Could you provide me with a reference to the gradient concentration technique. Any paper where such a technique has been used for non-convex case. You may email me at the address.

1. Aaron Defazio says:

Nesterov’s book contains the proof. “Introductory Lectures On Convex Programming”.