October | 2023 | Tangentially / A Machine Learning Blog

Konstantin and I were honored with an ICML Outstanding paper award for our paper “Learning-Rate-Free Learning by D-Adaptation”. Besides the award, our implementation has already been downloaded nearly a million times over the last year (most before the award), and is seeing wide-spread use in training and fine-tuning deep learning models, particularly LoRAs.

Read on if you would like to know more about our work and the impact it is already having as a practical solution that removes the need to tune the baseline learning rate.

Adapting to D

Conference awards often elevate papers that take interesting new approaches to old problems, and this is certainly the case here. Our approach sounds like it wouldn’t work: we take an upper bound that would normally be used to show convergence, and we invert it, to instead give a lower bound $d_{k}$ on a key quantity $D = ∥ x_{0} - x_{*} ∥$ . Since $D$ plays the key role of the numerator of the step size in convex Lipscthiz optimization, we can use this bound $d_{k}$ to define a step-size:

γ_{k} = \frac{d_{k}}{\sqrt{\sum_{i}^{k} {∥ g_{i} ∥}^{2}}} .

Using this step-size in a plug-and-play style gives the D-Adaptation method. Simple!

The particular bound we use is a slight refinement of existing convergence bounds:

\sum_{k = 0}^{n} d_{k} (f (x_{k}) - f_{*}) \leq D ∥ \sum_{k = 0}^{n} g_{k} ∥ + \frac{1}{2} \sum_{k = 0}^{n} γ_{k} d_{k}^{2} {∥ g_{k} ∥}^{2} - \frac{1}{2} γ_{n + 1} {∥ \sum_{k = 0}^{n} g_{k} ∥}^{2},

and the rearrangement is simple. We first use that $f (x_{k}) - f_{*} \geq 0$ , to give

0 \leq D ∥ \sum_{k = 0}^{n} g_{k} ∥ + \frac{1}{2} \sum_{k = 0}^{n} γ_{k} d_{k}^{2} {∥ g_{k} ∥}^{2} - \frac{1}{2} γ_{n + 1} {∥ \sum_{k = 0}^{n} g_{k} ∥}^{2},

Then pull $D$ over to the left:

D \geq d_{k} = \frac{γ_{n + 1} {∥ \sum_{k = 0}^{n} g_{k} ∥}^{2} - \sum_{k = 0}^{n} γ_{k} d_{k}^{2} {∥ g_{k} ∥}^{2}}{2 ∥ \sum_{k = 0}^{n} g_{k} ∥} .

Why does this look crazy? Well for one thing, we have no reason to expect $d_{k}$ to be positive, in fact, it usually isn’t. The solution to the positivity issue turns out to be simple. Instead of using the bound directly, we define $d_{k}$ to be the max over the bounds over all previous steps. To “bootstrap” the process, we start with $d_{0} = 1 0^{- 8}$ or a similar small positive value. Given the seeming looseness of this approach, we wouldn’t expect this lower bound to provide a close estimate to $D$ , but in fact it does! In practice, the $d$ sequence grows exponentially, from arbitrarily small initial $d_{0}$ to values as good as hand tuned choices. We are even able to show that in theory, it asymptotically approaches the true D to within a factor of 0.366.

Convergence properties

“Non-asymptotic results tell you ’what’s really going on’, but asymptotics tell you what’s *really* going on” - Sam Power

Our key theory result is simple: we show that for a sufficiently large number of training steps $n$ , our approach yields convergence as fast as if you had of set the step size using knowledge of the true D from the beginning:

f ({\hat{x}}_{n}) - f (x_{*}) = (\frac{DG}{\sqrt{n + 1}}),

where ${\hat{x}}_{n}$ is the average iterate, and $G$ is a bound on the gradient norms. This is an interesting result for a number of reasons. Firstly, it’s the first such “asymptotic” result for this problem class to have no additional log factors - you pay no penalty for not knowing D! Other methods for the problem class we consider: convex, non-smooth Lipschitz functions, require knowledge of other unknowns, such as $f (x_{*})$ or require line searches, back-tracking or other complexities. A chief advantage of our method is it’s so damn simple.

The major down-side of this asymptotic result is that we don’t know how many optimization steps need to be taken, $n$ , before this asymptotic behavior kicks in. Fortunately, in practice, it kicks in after less than a few hundred steps, typically 1-2% of the total training time at most. This corresponds to when the $d_{k}$ estimate becomes within a constant factor of the true $D$ . We see this across more than 20 test problems, across deep-learning problems as well as simpler logistic regression problems.

The asymptotic bound we prove appears to be the true indictor of the behavior of our method in practice, although we give a non-asymptotic bound as well, which shows that the convergence is never worse than a log factor off the best-possible rate, even when $n$ is small. Approaches from the “parameter-free” literature such as coin-betting give slightly better non-asymptotic rates for this class (sqrt(log) factor difference), however we have developed an improved variant of D-Adaptation that matches the non-asymptotic rates of those parameter-free methods, known as Prodigy (https://arxiv.org/pdf/2306.06101.pdf). Prodigy seems to be an overall improvement over D-Adaptation in practice also, so we recommend its use as the primary variant of the method.

More on non-asymptotics

The non-asymptotic rate given by approaches such as coin betting for the average iterate is:

f ({\hat{x}}_{n}) - f (x_{*}) = (\frac{DG \sqrt{\log (1 + D ∕ d_{0})}}{\sqrt{n + 1}}) .

Essentially you pay a $\sqrt{\log (1 + D ∕ d_{0})}$ penalty for not having knowledge of $D$ before-hand. There is a much simpler approach that gives essentially the same rate for this complexity class: Grid Search! In fact, the simple approach of running $R = \log (D_{\max} ∕ D_{\min})$ runs with a log-spaced grid of D values (i.e. $1 0^{- 8}, 1 0^{- 7}, 1 0^{- 6}, \dots$ ) bracketing the true D. Since our total step-budget is $n + 1$ , each run must be of length (n+1)/R, and so taking the best $\hat{x}$ by function value gives a bound:

\begin{array}{l} f ({\hat{x}}_{n}) - f (x_{*}) & = (\frac{DG}{\sqrt{(n + 1) ∕ R}}) \\ = (\frac{DG \sqrt{R}}{\sqrt{n + 1}}) \\ = (\frac{DG \sqrt{\log (D_{\max} ∕ D_{\min})}}{\sqrt{n + 1}}) . \end{array}

Often in computer-science research we ignore log factors as unimportant, or “not-real”, but in this case, the log factor is very real! It corresponds to the cost of running multiple training runs and throwing away all but the best run.

Given this is the same sqrt(log) dependence as for coin betting, what differentiates parameter-free approaches such as coin-betting? They are developed for a much more general class of problems, Online Linear Optimization (OLO), and so when applied to convex non-smooth optimization as we do here, they are not able to make use of the additional problem structure. Note that the grid search method above obviously can’t achieve a sqrt-log-factor free rate asymptotically, since you have to actually perform each of the $R$ training runs. It’s unclear if coin-betting approaches can avoid the log factor asymptotically, this is an interesting direction for future research. Lower bounds are known for the online linear optimization setting that show that any non-asymptotic rate must include this sqrt-log factor.

Why previous parameter-free methods don’t work as well

A major component of our work was the effort it took to go from a theoretically nice algorithm to a practical method that works on deep learning problems. We had to identify why other learning-rate free methods fail, and we had to adapt our method to work on top of modern optimization methods such as Adam.

The key insight turned out to be SCHEDULING. By scheduling, I mean a sequence of constants that multiply the learning rate, that includes potentially warm-up at the beginning of training, and annealing of the learning rate towards the end of training.

Using theoretically motivated schedules such as $1 ∕ \sqrt{k + 1}$ learning rate decay turn out to work terribly in practice almost across the board on deep learning problems. Existing coin-betting approaches prior to our work all relied on very poorly performing schedules. This is why the COCOB method we tested as a baseline under-performed The schedule is an implicit part of the method, arising directly from the mathematical formulation, and so its difficult to apply an arbitrary schedule on top of it, in contrast, D-Adaptation natively supports the use of arbitrary schedules.

Using existing empirically motivated schedules, such as those already widely in use in deep learning practice, leads to huge improvements in performance. In collaboration with Ashok and Harsh, I have also been working on incorporating schedule support into practical parameter-free methods, giving a method we call Mechanic (https://arxiv.org/abs/2306.00144).

Practical performance of D-Adaptation

Our Adam variant of D-Adaptation is a drop-in replacement for the PyTorch Adam class, requiring no other changes to the code-base. We tested across a large selection of modern deep learning problems, including auto-regressive language models, image-classification, image-2-image tasks, object recognition, recommendation systems, masked-language models and more. In fact, we performed the largest comparison performed for any optimization paper for a new method, outside of 10+ author papers from Google.

D-Adaptation matched the final test performance of laboriously hand-tuned learning rates on 9 of 10 deep learning problems, and all 12 logistic regression problems tested. We have also received overwhelming feedback from the ML community that the method “just-works” and is a true viable replacement for hand-tuned learning rates.

ICML is not a theoretically focused venue; reviewers highly value papers whose contribution have clear practical impact together with strong theory. Before D-Adaptation, there were no methods that could be “dropped-in” to a code base without other changes to provide adaptivity to the learning rate that actually worked well in practice. The ease of use of our method and its clear practical utility were recognized by our reviewers, with final review scores of 9, 8 and 6. We were fortunate to have highly communicative reviewers, who suggested improvements, pointed out errors (even reviewing the appendix carefully!) and overall greatly improved our work. A triumph of peer review!

Limitations

D-Adapt Adam uses an additional memory buffer on top of regular Adam, and so it has overall higher memory use. It can also fail on problems where training is very unstable. We experienced issues with it’s convergence on ViT (vision transformer) training for example.

Since setting the learning rate globally with D-Adaptation requires aggregating dot products and norms from across all parameters, it will requiring syncing information across nodes when using sharded data-parallel training, which may have a performance hit. We found that applying it separately at the block or node level didn’t work well, the global syncing appears to be necessary.

Our theory also has limitations. It only applies in the deterministic setting, which is much more limited than the stochastic or online settings that other methods such as coin-betting or DoG have theory for. Given the method seems to work well for stochastic problems, I see developing theory for this case as an interesting research problem.

D-Adaptation MUST be used with a schedule. I have seen several comparisons against D-Adaptation that don’t mention the use of schedules, and likely omitted them. This is crucial for the performance of the method. We recommend the use of the linear decay schedule with potentially a warmup (5%?) if warmup is customary for the problem being tackled.

What’s Next

It’s rather annoying that we have to choose a schedule to use with D-Adaptation.... surely there is some way to determine data-adaptive schedules automatically? Stay tuned for a solution!

Tangentially / A Machine Learning Blog

Monthly Archives: October 2023

D-Adaptatation Wins an Outstanding Paper Award at ICML 2023!