The Many Ways to Analyse Gradient Descent: Part 2

The previous post detailed a bunch of different ways of proving the convergence rate of gradient descent:

xk+1 = xk αf(x k),

for strongly convex problems. This post considers the non-strongly convex, but still convex case.

Rehash of Basic Lemmas

These hold for any x and y. L the Lipschitz smoothness constant. These are completely standard, see Nesterov’s book [2] for proofs. We use the notation x for an arbitrary minimizer of f.

f(y) f(x) + f(x),y x + L 2 x y2. (1)
f(y) f(x) + f(x),y x + 1 2L f(x) f(y)2. (2)
f(x) f(y),x y 1 L f(x) f(y)2. (3)

1 Proximal Style Convergence Proof

The following argument gives a proof of convergence that is well suited to modification for proving the convergence of proximal gradient methods. We start by proving a useful lemma:

Lemma 1. For any xk and y, when xk+1 = xk 1 Lf(x k):

2 L f(y) f(xk+1) y xk+1 2 x k y2.

Proof. We start with the Lipschitz upper bound around xk of xk+1:

f(xk+1) f(xk) + f(x k),xk+1 xk + L 2 xk+1 xk 2.

Now we bound f(xk) using the negated convexity lower bound of y around x (i.e. f(y) f(xk) + f(xk),y xk):

f(xk+1) f(y) + f(x k),xk+1 xk + xk y + L 2 xk+1 xk 2.

Negating, rearranging and multiplying through by 2 L gives:

2 L f(y) f(xk+1) 2 L f(x k),xk+1 y + xk+1 xk 2.

Now we replace f(xk) using xk+1 = xk 1 Lf(x k):

2 L f(y) f(xk+1) 2 y xk+1,xk xk+1 xk xk+1 2 = 2 y xk + xk xk+1,xk xk+1 xk xk+1 2 = 2 y xk,xk xk+1 + xk xk+1 .2

Now we complete the square using the quadratic y xk + xk xk+1 2 = y xk 2 + 2 y xk,xk xk+1 + xk xk+1 2. So we have:

2 L f(y) f(xk+1) y xk + xk xk+1 y xk 2 = y xk+1 y xk 2.

Using this lemma, the proof is quite simple. We apply it with y = x:

xk+1 x2 x k x2 2 L f(xk+1) f(x).

Now we sum this between 0 and k 1. The left hand side telescopes:

xk x2 x 0 x2 2 L r=0k1 f(x r+1) f(x).

Now we use the fact that gradient descent is a descent method, which implies that f(xk) f(xr+1) for all r k 1. So:

xk x2 x 0 x2 2k L f(xk) f(x).

Now we just drop the xk x2 term since it is positive and small. Leaving:

f(xk) f(x) L 2k x0 x2.

Comments

As far as I know, this proof is fairly modern [1]. Notice that unlike the strongly convex case, the quantity we are bounding (f(xk) f(x)) does not appear on both sides of the bound. Unfortunately, without strong convexity there is necessarily a looseness to the bounds, and this takes the form of bounding function value by distance to solution, with a large wiggle-factor. One thing that is perhaps a little confusing is the use of distance to solution x x, when it is not unique, as there are potentially multiple minimizers for non-strongly convex problems. The bound in fact holds for any chosen minimizer x. I found this to be a little confusing at first.

2 Older Style Proof

This proof is from Nesterov [2]. I’m not sure of the original source for it.

We start with the function value descent equation, using w := α 1 1 2αL :

f(xk+1) f(xk) w f(x k)2.

We introduce the simplified notation Δk = f(xk) f(x) so that we have

Δk+1 Δk w f(x k)2. (4)

Now using the convexity lower bound around xk evaluated at x, namely:

Δk f(x k),xk x,

and applying Cauchy-Schwarz (note the spelling! there is no “t” in Schwarz) to it:

Δk xk x f(x k) x0 x f(x k).

The last line is because gradient descent method descends in iterate distance each step. We now introduce the additional notation r0 = x0 x. Using this notation and rearranging gives:

f(x k) Δkr0.

We plug this into the function descent equation (Eq 4) above to get:

Δk+1 Δk w r02Δk2.

We now divide this through by Δk+1:

1 Δk Δk+1 w r02 Δk2 Δk+1

Then divide through by Δk also:

1 Δk 1 Δk+1 w r02 Δk Δk+1.

Now we use the fact that gradient descent is a descent method again, which implies that Δk Δk+1 1, so:

1 Δk 1 Δk+1 w r02.

1 Δk+1 1 Δk + w r02.

We then chain this inequality for each k:

1 Δk+1 1 Δk + w r02 1 Δk1 + 2 w r02 1 Δ0 + w r02(k + 1)

1 Δk+1 1 Δ0 + w r02(k + 1).

To get the final convergence rate we invert both sides:

f(xk) f(x) f(x0) f(x) x 0 x2 x0 x2 + w f(x0) f(x)k.

This is quite a complex expression. To simplify even further, we can get rid of the f(x0) f(x) terms on the right hand side using the Lipschitz upper bound about x:

f(x0) f(x) L 2 x x2.

Plugging in the step size α = 1 L gives w = 1 2L, yielding the following simpler convergence rate:

f(xk) f(x) 2L x0 x2 k + 4 .

Compared to the rate from the previous proof, f(xk) f(x) L 2k x0 x2, this is slightly better at k = 1, and worse thereafter.

Comments

I don’t like this proof. It’s feels like a random sequence of steps when you first look at it. The way the proof uses inverse quantities like 1 Δk is also confusing. The key equation is really the direct bound on Δ:

Δk+1 Δk w r02Δk2.

Often this is the kind of equation you encounter when proving the properties of dual methods for example. Equations of this kind can be encountered when applying proximal methods to non-differentiable functions also. It is also quite a clear statement about what is going on in terms of per-step convergence, a property that is less clear in the previous proof.

3 Gradient Concentration

When we don’t even have convexity, just Lipschitz smoothness, we can still prove something about convergence of the gradient norm. The Lipschitz upper bound holds without the requirement of convexity:

f(y) f(x) + f(x),y x + L 2 x y2.

Recall that from minimizing this bound with respect to y we can prove the equation:

f(xk) f(xk+1) 1 2L f(x k)2.

Examine this equation carefully. We have a bound on each gradient encountered during the optimization in terms of the difference in function values between steps. The sequence of function values is bounded below, so in fact we have a hard bound on the sum of the encountered gradient norms. Effectively, we chain (telescope) the above inequality over steps:

f(xk1) f(xk) + f(xk) f(xk+1) 1 2L f(x k)2 + 1 2L f(x k1)2.

...

f(x0) f(xk+1) 1 2L ik f(x i)2.

Now since f(xk+1) f(x):

ik f(x i)2 2L f(x 0) f(x).

Now to make this bound a little more concrete, we can put it in terms of the gradient gk with the smallest norm seen during the minimization ( gk gi for all i), so that ik f(xi)2 k gk 2, so:

gk 2 2L k f(x0) f(x).

Comments

Notice that the core technique used in this proof is the same as the last 2 proofs. We have a single step inequality bounding one of the quantities we care about. By summing that inequality over each step of minimization, one side of the inequality telescopes. We get an equality saying the the sum of the k versions of that quantity (one from each step) is less then some fixed constant independent of k, for any k. The convergence rate is thus of the form 1k, because the summation of the k quantities fits in a fixed bound.

Almost any proof for an optimization method that applies in the non-convex case uses a similar proof technique. There is just not that many assumptions to work with, so the options are limited.

References

[1]   Amir Beck and Marc Teboulle. Gradient-based algorithms with applications to signal recovery problems. Convex Optimization in Signal Processing and Communications, 2009.

[2]   Yu. Nesterov. Introductory Lectures On Convex Programming. Springer, 1998.

The Many Ways to Analyse Gradient Descent

Consider the classical gradient descent method:
xk+1 = xk αf(x k).

It’s a thing of beauty isn’t it? While it’s not used directly in practice any more, the proof techniques used in its analysis are the building blocks behind the theory of more advanced optimization methods. I know of 8 different ways of proving its convergence rate. Each of the proof techniques are interesting in their own right, but most books on convex optimization give just a single proof of convergence, then move onto greater things. But to do research in modern convex optimization you should know them all.

The purpose of this series of posts is to detail each of these proof techniques and what applications they have to more advanced methods. This post will cover the proofs under strong convexity assumptions, and the next post will cover the non-strongly convex case. Unlike most proofs in the literature, we will go into detail of every step, so that these proofs can be used as a reference (don’t cite this post directly though, cite the original source preferably, or the technical notes version). If you are aware of any methods I’ve not covered, please leave a comment with a reference so I can update this post.

For most of the proofs we end with a statement like Ak+1 (1 γ)Ak, where Ak is some quantity of interest, like distance to solution or function value sub-optimality. A full proof requires chaining these inequalities for each k, giving something of the form Ak (1 γ)kA0. We leave this step as a given.

Basic lemmas

These hold for any x and y. Here μ is the strong convexity constant and L the Lipschitz smoothness constant. These are completely standard, see Nesterov’s book [7] for proofs. We use the notation x for the unique minimizer of f (for strongly convex problems).

f(y) f(x) + f(x),y x + L 2 x y2. (1)
f(y) f(x) + f(x),y x + μ 2 x y2. (2)
f(y) f(x) + f(x),y x + 1 2L f(x) f(y)2. (3)
f(y) f(x) + f(x),y x + 1 2μ f(x) f(y)2. (4)
f(x) f(y),x y 1 L f(x) f(y)2. (5)
f(x) f(y),x y μ x y2. (6)

1 Function Value Descent

There is a very simple proof involving just the function values. We start by showing that the function value descent is controlled by the gradient norm:

Lemma 1. For any given α, the change in function value between steps can be bounded as follows:

f(xk) f(xk+1) α 1 1 2αL f(x k)2,

in particular, if α = 1 L we have f(xk) f(xk+1) 1 2L f(x k)2.

Proof. We start with (1), the Lipschitz upper bound about xk:

f(xk+1) f(xk) + f(x k),xk+1 xk + L 2 xk+1 xk 2.

Now we plug in the step equation xk+1 xk = αf(xk) :

f(xk+1) f(xk) α f(x k)2 + α2L 2 f(x k)2,

Negating and rearranging gives:

f(xk) f(xk+1) α 1 1 2αL f(x k)2.

Now since we are considering strongly convex problems, we actually have found a bound on the gradient norm in terms of function value. We apply (4): f(y) f(x) + f(x),y x + 1 2μ f(x) f(y)2 using x = x, y = xk:

f(xk) f(x) + 1 2μ f(x k)2,

f(x k) 2μ f(xk) f(x).

So combining these two results:

f(xk) f(xk+1) 1 2L f(x k)2 μ L f(xk) f(x).

We then negate, add & subtract f(x), then rearrange:

f(xk+1) f(xk) μ L f(xk) f(x),

f(xk+1) f(x) f(x k) + f(x) μ L f(xk) f(x),

f(xk+1) f(x) 1 μ L f(xk) f(x).

Note that this function value style proof requires the step size α = 1 L or smaller, instead of α = 2 μ+L, which we shall see gives the fastest convergence when using some of the other proof techniques below.

Comments

This proof (when α = 1 L is used) treats gradient descent as an upper bound minimization scheme. Such methods, sometimes known under the Majorization-Minimization nomenclature [3], are quite widespread in optimization. They can be applied to non-convex problems even, although the convergence rates in that case are necessarily weak. Likewise this proof gives the weakest convergence rate of the proof techniques presented in this post, but it is perhaps the simplest. Upper bound minimization techniques have recently seen interesting applications in 2nd order optimization, in the form of Nesterov’s cubicly regularized Newton’s method [9]. For stochastic optimization, the MISO method is also a upper bound minimization scheme [6]. For non-smooth problems, an interesting application of the MM approach is in minimizing convex problems with non-convex regularizers of the form λlog x + 1, in the form of reweighted L1 regularization [5].

2 Iterate Descent

There is also a simple proof involving just the distance of the iterates xk to the solution. Using the definition of the step xk+1 xk = αf(xk):

xk+1 x2 = x k αf(x k) x2 = xk x2 2α f(x k),xk x + α2 f(x k)2.

We now apply both the inner product bounds (5) f(x) f(y),x y 1 L f(x) f(y)2and (6) f(x) f(y),x y μ x y2 , in the following negated forms, using f(x) = 0:

f(x k),xk x1 L f(x k)2,

f(x k),xk xμ x k x2.

The inner product term has a weight 2α, and we apply each of these with weight α, giving:

xk+1 x2 1 αμ x k x2 + α α 1 L f(x k)2.

Now if we take α = 1 L, then the last term cancels and we have:

xk+1 x2 1 μ L xk x2.

This proof is not as tight as possible. Instead of splitting the inner product term and applying both bounds (5) and (6), we can apply the following stronger combined bound from Nesterov’s Book [7]:

f(x) f(y),x y μL μ + L x y2 + 1 μ + L f(x) f(y)2. (7)

Doing so yields:

xk+1 x2 1 2αμL μ + L xk x2 + α α 2 μ + L f(x k)2.

Now clearly to cancel out the gradient norm term we can take α = 2 μ+L, which yields the convergence rate:

xk+1 x2 1 4μL μ + L2 xk x2 1 4μ L xk x2.

Comments

This proof technique is the building block of the standard stochastic gradient descent (SGD) proof. The above proof is mostly based on Nesterov’s book, I’m not sure what the original citation is. It has a nice geometric interpretation, as the bound on the inner product term f(xk),xk x can easily be illustrated in 2 dimensions, say on a white-board. It’s effectively a statement on the angles that gradients in convex problems can take. To get the strongest bound using this technique, the complex bound in Equation 7 has to be used. That stronger bound is not really straight-forward, and perhaps too technical (in my opinion) to use in a textbook proof of the convergence rate.

3 Using the Second Fundamental Theorem of Calculus

Recall the second fundamental theorem of calculus:

f(y) = f(x) +xyf(z)dz.

This can be applied along intervals in higher dimensions. The case we care about is applying it to the first derivatives of f, giving an integral involving the Hessian:

f(y) = f(x) +01 fx + τ(y x),y xdτ.

We abuse the angle bracket notation here to apply to matrix-vector products as well as the usual dot-product. Using this result gives an interesting proof of convergence of gradient descent that doesn’t rely on the usual convexity lemmas. This proof bounds the distance to solution, just like the previous proof.

Lemma 2. For any positive t:

x y + t f(y) f(x) max 1 tL, 1 tμ x y.

Proof. We start by applying the second fundamental theorem of calculus in the above form:

x y + t f(y) f(x) = x y + t01 fx + τ(y x),y xdτ = 01 tfx + τ(y x) I,y xdτ 01 tfx + τ(y x) I,y xdτ 01 tfx + τ(y x) I x ydτ maxz tf(z) I x y.

Now we examine the eigenvalues of f(z). the minimum one is at least μ and the maximum at most L. An examination of the possible range of the eigenvalues of (tf(z) I) gives max 1 tL, 1 tμ. □

Using this lemma gives a simple proof along the lines of the iterate descent proof.

First, note that xk+1 x is in the right form for direct application of this lemma after substituting in the step equation:

xk+1 x = x k x + α f(x k) f(x) max 1 αL, 1 αμ xk x.

Note we introduced f(x) for “free”, as it’s of course equal to zero. The next step is optimize this bound in terms of α. Note that L is always larger than μ, so we take the 1 αL absolute value as negative, and the other positive, and match their magnitudes:

1 + αL = 1 αμ,

α(L + μ) = 2,

α = 2 L + μ.

Which gives the convergence rate:

xk+1 x L μ L + μ xk x.

Note that this rate is in terms of the distance to solution directly, rather than its square like in the previous proof. Converting to squared norm gives the same rate as before.

Comments

This proof technique has a linear-algebra feel to it, and is perhaps most comfortable to people with that background. The absolute values make it ugly in my opinion though. This proof technique is the building block used in the standard proof of the convergence of the heavy ball method for strongly convex problems [10]. It doesn’t appear to have many other applications, and so is probably the least seen of the techniques in this document. The main use of this kind of argument is in lower complexity bounds, where we often do some sort of eigenvalue analysis.

4 Lyapunov Style

The above results prove convergence of either the iterates or the function value separately. There is an interesting proof involving the sum of the two quantities. First we start with with the iterate convergence:

xk+1 x2 = x k x αf(x k)2 = xk x2 2α f(x k),xk x + α2 f(x k)2.

Now we use the function descent amount equation (Lemma 1) to bound the gradient norm term: 1 c f(x k)2 f(x k) f(xk+1) , where we have defined c = 1 α 1 1 2αL:

xk+1 x2 x k x2 + cα2 f(x k) f(xk+1) 2α f(x k),xk x.

Now we use the strong convexity lower bound (2) in a rearranged form:

f(x k),x x k f(x) f(x k) μ 2 xk x2,

to simplify:

xk+1 x2 1 αμ x k x2 + cα2 f(x k) f(xk+1) + 2α f(x) f(x k).

Now rearranging further:

xk+1 x2+cα2 f(x k+1) f(x) 1 αμ x k x2+cα2 2α f(x k) f(x).

Now this equation gives a descent rate for the weighted sum of xk x2 and f(xk) f(x). The best rate is given by matching the two convergence rates, that of the iterate distance terms:

1 αkμ,

and that of the function value terms, which changes from cα2 to cα2 2α:

cα2 2αk cα2 = 1 2 cα = 1 2 1 1 2αL = αL 1.

Matching these two rates:

1 αμ = αL 1,

2 = α(μ + L),

α = 2 μ + L.

Using this derived value for α gives a convergence rate of 1 2μ μ+L. I.e.

xk+1 x2+cα2 f(x k+1) f(x) 1 2μ μ + L xk x2 + cα2 f(x k) f(x).

and therefore after k steps:

xk x2 + cα2 f(x k) f(x) 1 2μ μ + Lk x 0 x2 + cα2 f(x 0) f(x).

The constants can be simplified to:

cα2 = α2 α 1 1 2αL = α 1 1 2αL = α 1 L μ+L = α μ μ+L = 2 μ.

Now we use: f(x0) f(x) L 2 x0 x2 on the right, and we just drop the function value term altogether on the left:

xk x2 1 2μ μ + Lkμ + L μ x0 x2.

If we instead use the more robust step size 1 L, which doesn’t require knowledge of μ, then a simple calculation shows that we instead get c = 2L, and so:

xk x2 1 μ Lk x 0 x2 + 2 L f(x0) f(x), 1 μ Lk2 x 0 x2.

The right hand side is obviously a much tighter bound then when 2(μ + L) is used, but the geometric rate is roughly twice as slow.

Comments

This proof technique has seen a lot of application lately. It is used for the SAGA [2]and SVRG [4] methods, and can be applied to accelerated method even, such as the accelerated coordinate descent theory [8]. The Lyapunov function analysis technique is of great general utility, and so it is worth studying carefully. It is covered perhaps best in Polyak’s book [10].

5 Gradient Norm Descent

In the strongly convex case, it is actually possible to show that the gradient norm decreases at least linearly as well as the function value and iterates. This requires a fixed step size of α = 1 L, as it is not true when line searches are used.

Lemma 3. For α = 1 L:

xk+2 xk+1 2 1 μ L xk+1 xk 2.

Note that xk+2 xk+1 2 = 1 L2 f(xk+1)2 and xk+1 xk 2 = 1 L2 f(xk)2.

Proof. We start by expanding in terms of the step equation xk+1 = xk αf(xk).

xk+2 xk+1 2 = x k+1 αf(x k+1) xk + αf(x k)2 = xk+1 xk 2 + α2 f(x k+1) f(x k)2 +2α f(x k) f(x k+1),xk+1 xk .

Now applying both inner product bounds (5) and (6):

wk+1 wk 2 1 αμ x k+1 xk 2 + α α 1 L f(x k+1) f(x k)2.

So for α = 1 L this simplifies to:

xk+2 xk+1 2 1 μ L xk+1 xk 2.

Chaining this result (Lemma 3) over k gives:

xk+1 xk 2 1 μ Lk x 1 x0 2 = 1 μ Lk 1 L2 f(x 0)2.

We now use f(xk) f(x) 1 2μ f(x k)2 = L2 2μ xk+1 xk 2 :

f(xk) f(x) 1 μ Lk 1 2μ f(x 0)2.

Comments

This technique is probably the weirdest of those listed here. It has seen application in proving the convergence rate of MISO under some different stochastic orderings [1]. While clearly a primal result, this proof has some components normally seen in the proof for a dual method. The gradient f(xk) is effectively the dual iterate. Another interesting property is that the portion of the proof concerning the gradient’s convergence uses the strong convexity between xk+1 and xk, whereas the other proofs considered all use the degree of strong convexity between xk and x.

This proof technique can’t work when line searches are used, as bounding the inner product:

α f(x k) f(x k+1),xk+1 xk ,

would fail if α changed between steps, as it would become αkf(xk) αk+1f(xk+1),xk+1 xk, which is a weird expression to work with.

References

[1]    Aaron Defazio. New Optimization Methods for Machine Learning. PhD thesis, Australian National University, 2014.

[2]    Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. Advances in Neural Information Processing Systems 27 (NIPS 2014), 2014.

[3]    David R. Hunter and Kenneth Lange. Quantile regression via an mm algorithm. Journal of Computational and Graphical Statistics, 9, 2000.

[4]    Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. NIPS, 2013.

[5]    Qiang Liu and Alexander Ihler. Learning scale free networks by reweighted l1 regularization. AISTATS, 2011.

[6]    Julien Mairal. Incremental majorization-minimization optimization with application to large-scale machine learning. Technical report, INRIA Grenoble RhÃŽne-Alpes / LJK Laboratoire Jean Kuntzmann, 2014.

[7]    Yu. Nesterov. Introductory Lectures On Convex Programming. Springer, 1998.

[8]    Yu. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. Technical report, CORE, 2010.

[9]    Yu. Nesterov and B.T. Polyak. Cubic regularization of newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.

[10]    Boris Polyak. Introduction to Optimization. Optimization Software, Inc., Publications Division., 1987.

The NIPS Consistency Experiment — My Experience

This year the NIPS (Neural Information Processing Systems) conference organisers decided to run an experiment on the consistency of paper reviews. They selected 10% of papers to be reviewed twice. Different area chairs and a different set of 3 reviewers were chosen for those papers.
Luckily for me, my paper with Francis Bach and Simon Lacoste-Julien was one of those 10%. My paper was initially submitted as Paper ID 867. They essentially created a duplicated paper id for me, #1860, which contained the second set of reviews.
This duplication of reviews was particularly interesting in my case. There was a very large discrepancy between the two sets of reviews. I won’t know if this is representative of the consistency of other reviews until NIPS releases the statistics from there experiment.
For reference, the two sets of reviews gave the following scores, before rebuttal:
Set 1 review 1: Quality 9, impact 2 (high) , confidence 4 (confident but not certain)
Set 1 review 2: Quality 6, impact 1 (incremental), confidence 3 (fairly confident)
Set 1 review 3: Quality 6, impact 1 (incremental), confidence 5 (Absolutely certain)
--
Set 2 review 1: Quality 5, impact 1 (incremental), confidence 5 (Absolutely certain)
Set 2 review 2: Quality 3, impact 1 (incremental), confidence 4 (Confident)
Set 2 review 3: Quality 6, impact 1 (incremental), confidence 5 (Absolutely certain)
--
Generally for NIPS a 9/6/6 in quality gives a high change of acceptance, where as a 5/3/6 is a certain non-acceptance. So one set of reviews was a clear accept and the second a clear reject! The meta reviews were as follows:
The paper introduces a new incremental gradient method that allows adaptation to the level of convexity in the input. The paper has a nice discussion of related methods, and it has a simpler proof that will be of interest to researchers. Recommendation: Accept.
Unfortunately, the scores are too low for acceptance to NIPS, and none of the reviewers were willing to argue for acceptance of the paper. The reviewers discussed the paper after the author rebuttal, and all reviewers ultimately felt that the paper could use some additional polish before publishing. Please do keep in mind the various criticisms of the reviewers when submitting to another venue.
The paper we submitted was fairly rough in its initial state, and the reviewers suggested lots of improvements. Particularly the Set 2/review 1, which was the most in depth review. I generally agree with the second meta-review in that the paper needed addition polish, which we have done for the camera ready.
In the end the paper was accepted. I suspect most papers with this kind of accept/reject split would be accepted, as it would just seem unfair if it were not.
The issue of consistency in paper reviews is clear to any body who has ever resubmitted a rejected paper to a different venue. It feels like luck of a draw to a degree. There is no easy solutions to this, so I’ll be interested to see if NIPS changes there process in future years, and what changes they make.

Writing Your Thesis in LyX — A Setup Guide

I am just in the process of finishing writing my PhD, which I wrote entirely in LyX. The productivity advantages of using LyX over LaTeX are too large to ignore, which is why I went with LyX, and why you should too. In this post I will go over the process I went through to get LyX producing documents conforming with my university’s thesis formatting guidelines.
If your anything like me, you have a mixture of past papers you have written in LaTeX, as well as a bunch of notes and drafts in LyX. The university provides a thesis template in LaTeX which the recommend you use. Fortunately, it is actually not too difficult to convert such a template into a working LyX document. Likewise, the papers in LaTeX can also be imported into LyX.
My LyX thesis template is available for download here: template-thesis.zip. The source code is also on github at https://github.com/adefazio/lyx-thesis-template.

Creating a preamble

We will get LyX to format our document correctly using the thesis style by overriding most things using a preamble file containing TeX commands. The advantage of this approach is we can pretty much copy and paste the LaTeX from the thesis template provided by the university.
I used two preamble files. The primary file contains the usual TeX package statements, includes and the like. It points to the ANU thesis style file as well:
​
\usepackage{svn-multi}
\svnid{$Id$}
\usepackage[hyperindex=true,
			bookmarks=true,
            colorlinks=false,
            pdfborder=0,
            pagebackref=false,
            citecolor=blue,
            plainpages=false,
            pdfpagelabels,
            pagebackref=true,
            hyperfootnotes=false]{hyperref}
\usepackage[all]{hypcap}
\usepackage[palatino]{anuthesis}
\usepackage{afterpage}
\usepackage{graphicx}
\usepackage{thesis}
\usepackage[normalem]{ulem}
\usepackage[table]{xcolor}
\usepackage{makeidx}
\usepackage{cleveref}
\usepackage[centerlast]{caption}
\usepackage{float}
\urlstyle{sf}
\renewcommand{\sfdefault}{uop}
\usepackage[T1]{fontenc}
\usepackage[scaled]{beramono}
\usepackage{pifont}
\usepackage{rotating}
\usepackage{algorithmic}
​
\usepackage{multirow}
​
%%%% Old macros file includes
\usepackage{booktabs}
\usepackage{relsize}
\usepackage{xspace}
\usepackage{subfig}
\usepackage{listings}
%%%%%%%%
This is not exactly the same as the preamble in the ANU provided style files. Several packages are already automatically imported by LyX, so they don’t need to be included here.
The default style uses the traditional American indenting of the first line of each paragraph. I think that looks old-fashioned, so I change it to just put a little but more padding between paragraphs instead:
\setlength{\parindent}{0cm}
\setlength{\parskip}{4mm plus2mm minus3mm}
The second file I use is main-preamble.tex, which contains the title and author information:
\makeatletter
\AtBeginDocument{
  \hypersetup{
    pdftitle = {\@title},
    pdfauthor = {\@author}
  }
}
\makeatother
​
\title{Advances in stuff}
\author{John Smith}
\date{\today}
​
\renewcommand{\thepage}{\roman{page}}

Setting up chapters as child documents

You probably want to setup your thesis so that each chapter is in a separate document. In LaTeX you would have each chapter imported into your main TeX file using \input{}. In LyX, this is done using child documents. As an example, the main document for my thesis looks like the following:
figure main-example.png

Setting up the main document

Create a new document, and go to Document->Settings->LaTeX Preamble, put in the preamble files we created above:
\input{general-preamble}
\input{main-preamble} 
Each of the child documents is included via Insert->File->LyX Document ... The “include type” needs to be set to “include” for it to work correctly. The \mainmatter command signifies the switch over from page numbering using Roman numerals (for the introductory material) to Arabic numerals (for the thesis proper). It is inserted using the “TeX Code” insertion (Ctrl-L), which just directly inserts LaTeX commands into the document.
The Append is started with Document->Start Appendix Here, which should probably be in the insert menu of LyX instead. The bibliography is created with Insert->List/TOC->BibTex bibliography.., which lets you select a BibTex .bib file to use for your citations. LyX supports using multiple different bib files, which would appear useful if your combining multiple papers you have written into a thesis. A word of caution: if there are repeated entries between the bib files you will run into various hard to debug problems in LyX, especially if the entries only differ in capitalisation. I would suggest taking all the bib files and merging them using a command line tool into one file. I used the command:
bibtool -s -d *.bib > all.bib
LyX obeys most of the styling information specified by the preamble we created above. However, there are a few things that it overrides. Follow steps 3-7 from the next section to fix these up.

Setting up the child documents

In my case I put each child document in it’s own folder. This is not necessary, but it seemed like a good idea to keep the child documents in a folder together with any figures used by that document. I will walk through the creation of a single child document here. I created all the chapters initially by just copying the first one I created, but you could do it in a more organised way using LyX templates. The steps for creating the chapter are:
  1. Create a new empty LyX document and save it to a folder with the chapter within your thesis directory.
  2. Add the document to your main document using the include procedure described above.
  3. Open up the settings of the chapter document, and set the document class to Book, and in the master document field, point it at your main.lyx document
    figure DocClassChapters.png
  4. On the font section, change the default family base font size to what ever your university requires. This was different than the default for me.
  5. Tick the Two-sided document check-box under Page Layout. This sets it so that the margin is wider on the outside-edge of each page, so it looks right when bound as a book.
  6. On the Language section, change it to English (Australian) or English (UK) if your not in the ’ol US of A.
  7. Your university almost certain specifies a particular citation style. Specify that in the Bibliography section. My university (ANU) recommends a Author-year style:
    figure bibsetup.png
  8. In the LaTeX Preamble, add the command \input{../general-preamble}, which just points to the previously created preamble file. The main-preamble file is not used by the child documents.
  9. If your using sub-folders for each chapter, place a copy of the thesis.sty and (for ANU) the anuthesis.sty files in the subfolder as well.
The above setup will mean that you can view each chapter separately in LyX using the eye button figure EyeIcon.png in the toolbar. The chapter can be viewed within the whole-thesis PDF using the figure MasterIcon.png button. When viewing separately, the bibliography references will display as question marks (?), whereas they will be displayed correctly in the whole-thesis PDF. There are some suggested work-arounds for this issue on on the LyX Wiki, but I couldn’t get them to work.

Other productivity enhancements

There is a few additional setup steps you should go through in LyX if you haven’t already. These are not required, but they will generally increase your productivity.
  • Setup Forward and Reverse Search
  • Setup groups of chapters as LyX branches to make sending subsets of your thesis easier
  • Setup math macros
  • Add keyboard shortcuts for citations and cross-references.

Math Macros: The best feature of Lyx you’re not using

LyX is a brilliant tool for writing in the mathematical sciences. It provides a WYSIWYG style editor on top of LaTeX with just the right amount of abstraction for getting-stuff-done. With the proper setup, typing math into LyX is easier, faster and less error prone then using LaTeX directly, and it has virtually no down sides.
Part of the setup is the use of use of LyX math macros. Math macros allow you to define new commands that you can use within LyX math mode. When writing pure LaTeX, most people include a set of standard macros in the header of the LaTeX documents. For example, a common one is a shortcut for argmin:
\newcommand{\argmin}{\operatornamewithlimits{argmin}}
LyX math macros allow you to do the same thing within LyX.
To create the equivalent macro in LyX, create a new math macro
figure InsertMathMacro.png
This will insert the following into your document:
figure Macrobox.png
This won’t appear when you render out your document, so you can put it anywhere you like. To create the macro, just fill out the 3 parts. The first part is what you will type when using the macro, the second part is the LaTeX that will be output, and the third part controls what it looks like on screen within LyX. For most macros the TeX and LyX sections should be the same, for the argmin example we want the TeX code to produce a “mathop”, which LyX can’t natively display. So we put a \TeXtrm{argmin} in the LyX part. The filled in macro is thus:
figure argmin.png
Note that for this to work you have to have amsmath package imported. To set that up, go Document -> Settings..., then select the radio buttons under the math options section:
figure amsmath.png
To use the macro, just type \argmin within a LyX math box, then press space. It should insert the macro seemlessly. Within LyX the subscripts will look like:
but when you render it out to a pdf it will be typeset correctly as:

A few useful macros

I have the following macros setup in a lyx-math-macros.lyx file:
figure AFewMacros.png
These macros illustrate the parameters feature as well. To add parameters to a macro, use the macro bar that appears at the bottom of the screen when editing the macro.
I have a lyx template setup that I use for new documents. It includes a LyX sub-document include statement figure Include.png which gives access to the macros.

How it’s implemented

If you convert your LyX document to a pdfLaTeX document, you’ll see the following command is generated in the preamble
\global\long\def\argmin{\operatornamewithlimits{argmin}}
This uses lower level LaTeX commands then \newcommand, but it has a similar effect . At the call site you’ll see something like:
{argmin}_{x\in \mathbb{R}}
Other then the extra {} brackets, this is as simple as you could want.