{"id":127,"date":"2023-10-12T16:48:46","date_gmt":"2023-10-12T20:48:46","guid":{"rendered":"https:\/\/www.aarondefazio.com\/tangentially\/?p=127"},"modified":"2023-10-18T11:26:17","modified_gmt":"2023-10-18T15:26:17","slug":"d-adaptatation-wins-an-outstanding-paper-award-at-icml-2023","status":"publish","type":"post","link":"https:\/\/www.aarondefazio.com\/tangentially\/?p=127","title":{"rendered":"D-Adaptatation Wins an Outstanding Paper Award at ICML 2023!"},"content":{"rendered":"<p class='noindent'>Konstantin and I were honored with an ICML Outstanding paper award for our<br \/>\npaper \u201cLearning-Rate-Free Learning by D-Adaptation\u201d. Besides the award, our<br \/>\nimplementation has already been downloaded nearly a million times over the last<br \/>\nyear (most before the award), and is seeing wide-spread use in training and<br \/>\nfine-tuning deep learning models, particularly LoRAs.\n<\/p>\n<p><!-- l. 22 --><\/p>\n<p class='indent'>   Read on if you would like to know more about our work and the impact it is<br \/>\nalready having as a practical solution that removes the need to tune the baseline<br \/>\nlearning rate.\n<\/p>\n<p><!-- l. 26 --><\/p>\n<p class='noindent'>\n<h4 class='likesubsectionHead'><a id='x1-2000'><\/a>Adapting to D<\/h4>\n<p><!-- l. 28 --><\/p>\n<p class='noindent'>Conference awards often elevate papers that take interesting new approaches to<br \/>\nold problems, and this is certainly the case here. Our approach sounds like<br \/>\nit wouldn\u2019t work: we take an upper bound that would normally be used<br \/>\nto show convergence, and we invert it, to instead give a lower bound<br \/>\n<!-- l. 32 --><math xmlns='http:\/\/www.w3.org\/1998\/Math\/MathML' display='inline'><msub><mrow><mi>d<\/mi><\/mrow><mrow><mi>k<\/mi><\/mrow><\/msub><\/math> on a key<br \/>\nquantity <!-- l. 32 --><math xmlns='http:\/\/www.w3.org\/1998\/Math\/MathML' display='inline'><mi>D<\/mi> <mo class='MathClass-rel'>=<\/mo> <mrow><mo form='prefix' fence='true'> \u2225<\/mo><mrow><msub><mrow><mi>x<\/mi><\/mrow><mrow><mn>0<\/mn><\/mrow><\/msub> <mo class='MathClass-bin'>\u2212<\/mo> <msub><mrow><mi>x<\/mi><\/mrow><mrow><mo class='MathClass-bin'>\u2217<\/mo><\/mrow><\/msub><\/mrow><mo form='postfix' fence='true'>\u2225<\/mo><\/mrow><\/math>.<br \/>\nSince <!-- l. 33 --><math xmlns='http:\/\/www.w3.org\/1998\/Math\/MathML' display='inline'><mi>D<\/mi><\/math> plays<br \/>\nthe key role of the numerator of the step size in convex Lipscthiz optimization, we can use<br \/>\nthis bound <!-- l. 34 --><math xmlns='http:\/\/www.w3.org\/1998\/Math\/MathML' display='inline'><msub><mrow><mi>d<\/mi><\/mrow><mrow><mi>k<\/mi><\/mrow><\/msub><\/math><br \/>\nto define a step-size: <\/p>\n<table class='equation-star'>\n<tr>\n<td>\n<!-- l. 36 --><math display='block' xmlns='http:\/\/www.w3.org\/1998\/Math\/MathML' class='equation'>\n                         <msub><mrow><mi>\u03b3<\/mi><\/mrow><mrow><mi>k<\/mi><\/mrow><\/msub> <mo class='MathClass-rel'>=<\/mo>       <mfrac><mrow><msub><mrow><mi>d<\/mi><\/mrow><mrow><mi>k<\/mi><\/mrow><\/msub><\/mrow> \n<mrow><msqrt><mrow><munderover accent='false' accentunder='false'><mrow><mo>\u2211<\/mo>\n  <\/mrow><mrow><mi>i<\/mi><\/mrow><mrow><mi>k<\/mi><\/mrow><\/munderover><msup><mrow> <mrow><mo form='prefix' fence='true'> \u2225<\/mo><mrow><msub><mrow><mi>g<\/mi><\/mrow><mrow><mi>i<\/mi><\/mrow><\/msub><\/mrow><mo form='postfix' fence='true'>\u2225<\/mo><\/mrow> <\/mrow><mrow><mn>2<\/mn><\/mrow><\/msup><\/mrow><\/msqrt><\/mrow><\/mfrac><mo class='MathClass-punc'>.<\/mo>\n<\/math><\/td>\n<\/tr>\n<\/table>\n<p><!-- l. 40 --><\/p>\n<p class='indent'>   Using this step-size in a plug-and-play style gives the D-Adaptation method.<br \/>\nSimple!\n<\/p>\n<p><!-- l. 43 --><\/p>\n<p class='indent'>   The particular bound we use is a slight refinement of existing convergence<br \/>\nbounds: <\/p>\n<table class='equation-star'>\n<tr>\n<td>\n<p><!-- l. 45 --><math display='block' xmlns='http:\/\/www.w3.org\/1998\/Math\/MathML' class='equation'>\n<munderover accent='false' accentunder='false'><mrow><mo>\u2211<\/mo>\n   <\/mrow><mrow><mi>k<\/mi><mo class='MathClass-rel'>=<\/mo><mn>0<\/mn><\/mrow><mrow><mi>n<\/mi><\/mrow><\/munderover><msub><mrow><mi>d<\/mi><\/mrow><mrow>\n<mi>k<\/mi><\/mrow><\/msub> <mrow><mo form='prefix' fence='true'> (<\/mo><mrow><mi>f<\/mi><mo class='MathClass-open'>(<\/mo><msub><mrow><mi>x<\/mi><\/mrow><mrow><mi>k<\/mi><\/mrow><\/msub><mo class='MathClass-close'>)<\/mo> <mo class='MathClass-bin'>\u2212<\/mo> <msub><mrow><mi>f<\/mi><\/mrow><mrow><mo class='MathClass-bin'>\u2217<\/mo><\/mrow><\/msub><\/mrow><mo form='postfix' fence='true'>)<\/mo><\/mrow><mo class='MathClass-rel'>\u2264<\/mo> <mi>D<\/mi> <mrow><mo form='prefix' fence='true'> \u2225<\/mo><mrow><munderover accent='false' accentunder='false'><mrow><mo>\u2211<\/mo>\n<\/mrow><mrow><mi>k<\/mi><mo class='MathClass-rel'>=<\/mo><mn>0<\/mn><\/mrow><mrow><mi>n<\/mi><\/mrow><\/munderover><msub><mrow><mi>g<\/mi><\/mrow><mrow>\n<mi>k<\/mi><\/mrow><\/msub><\/mrow><mo form='postfix' fence='true'>\u2225<\/mo><\/mrow><mo class='MathClass-bin'>+<\/mo><mfrac><mrow><mn>1<\/mn><\/mrow>\n<mrow><mn>2<\/mn><\/mrow><\/mfrac><munderover accent='false' accentunder='false'><mrow><mo>\u2211<\/mo>\n  <\/mrow><mrow><mi>k<\/mi><mo class='MathClass-rel'>=<\/mo><mn>0<\/mn><\/mrow><mrow><mi>n<\/mi><\/mrow><\/munderover><msub><mrow><mi>\u03b3<\/mi><\/mrow><mrow>\n<mi>k<\/mi><\/mrow><\/msub><msubsup><mrow><mi>d<\/mi><\/mrow><mrow><mi>k<\/mi><\/mrow><mrow><mn>2<\/mn><\/mrow><\/msubsup><msup><mrow> <mrow><mo form='prefix' fence='true'> \u2225<\/mo><mrow><msub><mrow><mi>g<\/mi><\/mrow><mrow>\n<mi>k<\/mi><\/mrow><\/msub><\/mrow><mo form='postfix' fence='true'>\u2225<\/mo><\/mrow> <\/mrow><mrow><mn>2<\/mn><\/mrow><\/msup><mo class='MathClass-bin'>\u2212<\/mo><mfrac><mrow><mn>1<\/mn><\/mrow>\n<mrow><mn>2<\/mn><\/mrow><\/mfrac><msub><mrow><mi>\u03b3<\/mi><\/mrow><mrow><mi>n<\/mi><mo class='MathClass-bin'>+<\/mo><mn>1<\/mn><\/mrow><\/msub><msup><mrow> <mrow><mo form='prefix' fence='true'> \u2225<\/mo><mrow><munderover accent='false' accentunder='false'><mrow><mo>\u2211<\/mo>\n<\/mrow><mrow><mi>k<\/mi><mo class='MathClass-rel'>=<\/mo><mn>0<\/mn><\/mrow><mrow><mi>n<\/mi><\/mrow><\/munderover><msub><mrow><mi>g<\/mi><\/mrow><mrow>\n<mi>k<\/mi><\/mrow><\/msub><\/mrow><mo form='postfix' fence='true'>\u2225<\/mo><\/mrow> <\/mrow><mrow><mn>2<\/mn><\/mrow><\/msup><mo class='MathClass-punc'>,<\/mo>\n<\/math><\/td>\n<\/tr>\n<\/table>\n<p><!-- l. 48 --><\/p>\n<p class='indent'>   and the rearrangement is simple. We first use that<br \/>\n<!-- l. 48 --><math xmlns='http:\/\/www.w3.org\/1998\/Math\/MathML' display='inline'><mi>f<\/mi><mo class='MathClass-open'>(<\/mo><msub><mrow><mi>x<\/mi><\/mrow><mrow><mi>k<\/mi><\/mrow><\/msub><mo class='MathClass-close'>)<\/mo> <mo class='MathClass-bin'>\u2212<\/mo> <msub><mrow><mi>f<\/mi><\/mrow><mrow><mo class='MathClass-bin'>\u2217<\/mo><\/mrow><\/msub><mo class='MathClass-rel'>\u2265<\/mo> <mn>0<\/mn><\/math>, to<br \/>\ngive <\/p>\n<table class='equation-star'>\n<tr>\n<td>\n<!-- l. 50 --><math display='block' xmlns='http:\/\/www.w3.org\/1998\/Math\/MathML' class='equation'>\n      <mn>0<\/mn> <mo class='MathClass-rel'>\u2264<\/mo> <mi>D<\/mi> <mrow><mo form='prefix' fence='true'> \u2225<\/mo><mrow><munderover accent='false' accentunder='false'><mrow><mo>\u2211<\/mo>\n  <\/mrow><mrow><mi>k<\/mi><mo class='MathClass-rel'>=<\/mo><mn>0<\/mn><\/mrow><mrow><mi>n<\/mi><\/mrow><\/munderover><msub><mrow><mi>g<\/mi><\/mrow><mrow>\n<mi>k<\/mi><\/mrow><\/msub><\/mrow><mo form='postfix' fence='true'>\u2225<\/mo><\/mrow> <mo class='MathClass-bin'>+<\/mo> <mfrac><mrow><mn>1<\/mn><\/mrow> \n<mrow><mn>2<\/mn><\/mrow><\/mfrac><munderover accent='false' accentunder='false'><mrow><mo>\u2211<\/mo>\n  <\/mrow><mrow><mi>k<\/mi><mo class='MathClass-rel'>=<\/mo><mn>0<\/mn><\/mrow><mrow><mi>n<\/mi><\/mrow><\/munderover><msub><mrow><mi>\u03b3<\/mi><\/mrow><mrow>\n<mi>k<\/mi><\/mrow><\/msub><msubsup><mrow><mi>d<\/mi><\/mrow><mrow><mi>k<\/mi><\/mrow><mrow><mn>2<\/mn><\/mrow><\/msubsup><msup><mrow> <mrow><mo form='prefix' fence='true'> \u2225<\/mo><mrow><msub><mrow><mi>g<\/mi><\/mrow><mrow>\n<mi>k<\/mi><\/mrow><\/msub><\/mrow><mo form='postfix' fence='true'>\u2225<\/mo><\/mrow> <\/mrow><mrow><mn>2<\/mn><\/mrow><\/msup> <mo class='MathClass-bin'>\u2212<\/mo><mfrac><mrow><mn>1<\/mn><\/mrow> \n<mrow><mn>2<\/mn><\/mrow><\/mfrac><msub><mrow><mi>\u03b3<\/mi><\/mrow><mrow><mi>n<\/mi><mo class='MathClass-bin'>+<\/mo><mn>1<\/mn><\/mrow><\/msub><msup><mrow> <mrow><mo form='prefix' fence='true'> \u2225<\/mo><mrow><munderover accent='false' accentunder='false'><mrow><mo>\u2211<\/mo>\n  <\/mrow><mrow><mi>k<\/mi><mo class='MathClass-rel'>=<\/mo><mn>0<\/mn><\/mrow><mrow><mi>n<\/mi><\/mrow><\/munderover><msub><mrow><mi>g<\/mi><\/mrow><mrow>\n<mi>k<\/mi><\/mrow><\/msub><\/mrow><mo form='postfix' fence='true'>\u2225<\/mo><\/mrow> <\/mrow><mrow><mn>2<\/mn><\/mrow><\/msup><mo class='MathClass-punc'>,<\/mo>\n<\/math><\/td>\n<\/tr>\n<\/table>\n<p><!-- l. 53 --><\/p>\n<p class='indent'>   Then pull <!-- l. 53 --><math xmlns='http:\/\/www.w3.org\/1998\/Math\/MathML' display='inline'><mi>D<\/mi><\/math><br \/>\nover to the left: <\/p>\n<table class='equation-star'>\n<tr>\n<td>\n<!-- l. 54 --><math display='block' xmlns='http:\/\/www.w3.org\/1998\/Math\/MathML' class='equation'>\n            <mi>D<\/mi> <mo class='MathClass-rel'>\u2265<\/mo> <msub><mrow><mi>d<\/mi><\/mrow><mrow><mi>k<\/mi><\/mrow><\/msub> <mo class='MathClass-rel'>=<\/mo> <mfrac><mrow><msub><mrow><mi>\u03b3<\/mi><\/mrow><mrow><mi>n<\/mi><mo class='MathClass-bin'>+<\/mo><mn>1<\/mn><\/mrow><\/msub><msup><mrow> <mrow><mo form='prefix' fence='true'> \u2225<\/mo><mrow><munderover accent='false' accentunder='false'><mrow><mo>\u2211<\/mo>\n  <\/mrow><mrow><mi>k<\/mi><mo class='MathClass-rel'>=<\/mo><mn>0<\/mn><\/mrow><mrow><mi>n<\/mi><\/mrow><\/munderover><msub><mrow><mi>g<\/mi><\/mrow><mrow><mi>k<\/mi><\/mrow><\/msub><\/mrow><mo form='postfix' fence='true'>\u2225<\/mo><\/mrow> <\/mrow><mrow><mn>2<\/mn><\/mrow><\/msup> <mo class='MathClass-bin'>\u2212<\/mo><munderover accent='false' accentunder='false'><mrow><mo>\u2211<\/mo>\n  <\/mrow><mrow><mi>k<\/mi><mo class='MathClass-rel'>=<\/mo><mn>0<\/mn><\/mrow><mrow><mi>n<\/mi><\/mrow><\/munderover><msub><mrow><mi>\u03b3<\/mi><\/mrow><mrow><mi>k<\/mi><\/mrow><\/msub><msubsup><mrow><mi>d<\/mi><\/mrow><mrow><mi>k<\/mi><\/mrow><mrow><mn>2<\/mn><\/mrow><\/msubsup><msup><mrow> <mrow><mo form='prefix' fence='true'> \u2225<\/mo><mrow><msub><mrow><mi>g<\/mi><\/mrow><mrow><mi>k<\/mi><\/mrow><\/msub><\/mrow><mo form='postfix' fence='true'>\u2225<\/mo><\/mrow> <\/mrow><mrow><mn>2<\/mn><\/mrow><\/msup><\/mrow> \n                     <mrow><mn>2<\/mn> <mrow><mo form='prefix' fence='true'> \u2225<\/mo><mrow><munderover accent='false' accentunder='false'><mrow><mo>\u2211<\/mo>\n  <\/mrow><mrow><mi>k<\/mi><mo class='MathClass-rel'>=<\/mo><mn>0<\/mn><\/mrow><mrow><mi>n<\/mi><\/mrow><\/munderover><msub><mrow><mi>g<\/mi><\/mrow><mrow><mi>k<\/mi><\/mrow><\/msub><\/mrow><mo form='postfix' fence='true'>\u2225<\/mo><\/mrow> <\/mrow><\/mfrac>                 <mo class='MathClass-punc'>.<\/mo>\n<\/math><\/td>\n<\/tr>\n<\/table>\n<p><!-- l. 58 --><\/p>\n<p class='indent'>   Why does this look crazy? Well for one thing, we have no reason to expect<br \/>\n<!-- l. 59 --><math xmlns='http:\/\/www.w3.org\/1998\/Math\/MathML' display='inline'><msub><mrow><mi>d<\/mi><\/mrow><mrow><mi>k<\/mi><\/mrow><\/msub><\/math><br \/>\nto be positive, in fact, it usually isn\u2019t. The solution to the positivity issue<\/p>\n<p>turns out to be simple. Instead of using the bound directly, we define<br \/>\n<!-- l. 61 --><math xmlns='http:\/\/www.w3.org\/1998\/Math\/MathML' display='inline'><msub><mrow><mi>d<\/mi><\/mrow><mrow><mi>k<\/mi><\/mrow><\/msub><\/math> to be<br \/>\nthe max over the bounds over all previous steps. To \u201cbootstrap\u201d the process, we start<br \/>\nwith <!-- l. 63 --><math xmlns='http:\/\/www.w3.org\/1998\/Math\/MathML' display='inline'><msub><mrow><mi>d<\/mi><\/mrow><mrow><mn>0<\/mn><\/mrow><\/msub> <mo class='MathClass-rel'>=<\/mo> <mn>1<\/mn><msup><mrow><mn>0<\/mn><\/mrow><mrow><mo class='MathClass-bin'>\u2212<\/mo><mn>8<\/mn><\/mrow><\/msup><\/math><br \/>\nor a similar small positive value. Given the seeming looseness of this<br \/>\napproach, we wouldn\u2019t expect this lower bound to provide a close estimate to<br \/>\n<!-- l. 65 --><math xmlns='http:\/\/www.w3.org\/1998\/Math\/MathML' display='inline'><mi>D<\/mi><\/math>, but in fact it does!<br \/>\nIn practice, the <!-- l. 66 --><math xmlns='http:\/\/www.w3.org\/1998\/Math\/MathML' display='inline'><mi>d<\/mi><\/math><br \/>\nsequence grows exponentially, from arbitrarily small initial<br \/>\n<!-- l. 67 --><math xmlns='http:\/\/www.w3.org\/1998\/Math\/MathML' display='inline'><msub><mrow><mi>d<\/mi><\/mrow><mrow><mn>0<\/mn><\/mrow><\/msub><\/math> to<br \/>\nvalues as good as hand tuned choices. We are even able to show that in theory, it<br \/>\nasymptotically approaches the true D to within a factor of 0.366.\n<\/p>\n<p><!-- l. 71 --><\/p>\n<p class='noindent'>\n<h3 class='likesectionHead'><a id='x1-3000'><\/a>Convergence properties<\/h3>\n<p><!-- l. 72 --><\/p>\n<p class='noindent'>\n<blockquote class='quotation'><p>\n     <!-- l. 73 --><\/p>\n<p class='indent'>     \u201cNon-asymptotic results tell you \u2019what\u2019s really going on\u2019, but<br \/>\n     asymptotics tell you what\u2019s *really* going on\u201d &#8211; <a href='https:\/\/x.com\/sp_monte_carlo\/status\/1683529829214240771?s=20'>Sam Power<\/a><\/p>\n<\/blockquote>\n<p><!-- l. 76 --><\/p>\n<p class='noindent'>Our key theory result is simple: we show that for a sufficiently large number of training<br \/>\nsteps <!-- l. 77 --><math xmlns='http:\/\/www.w3.org\/1998\/Math\/MathML' display='inline'><mi>n<\/mi><\/math>,<br \/>\nour approach yields convergence as fast as if you had of set the step size using<br \/>\nknowledge of the true D from the beginning: <\/p>\n<table class='equation-star'>\n<tr>\n<td>\n<!-- l. 80 --><math display='block' xmlns='http:\/\/www.w3.org\/1998\/Math\/MathML' class='equation'>\n                    <mi>f<\/mi><mo class='MathClass-open'>(<\/mo><msub><mrow><mover accent='true'><mrow><mi>x<\/mi><\/mrow><mo accent='true'>^<\/mo><\/mover><\/mrow><mrow><mi>n<\/mi><\/mrow><\/msub><mo class='MathClass-close'>)<\/mo> <mo class='MathClass-bin'>\u2212<\/mo> <mi>f<\/mi><mo class='MathClass-open'>(<\/mo><msub><mrow><mi>x<\/mi><\/mrow><mrow><mo class='MathClass-bin'>\u2217<\/mo><\/mrow><\/msub><mo class='MathClass-close'>)<\/mo> <mo class='MathClass-rel'>=<\/mo> <mi mathvariant='bold-script'><\/mi><mrow><mo form='prefix' fence='true'> (<\/mo><mrow> <mfrac><mrow><mi mathvariant='italic'>DG<\/mi><\/mrow>\n<mrow><msqrt><mrow><mi>n<\/mi> <mo class='MathClass-bin'>+<\/mo> <mn>1<\/mn><\/mrow><\/msqrt><\/mrow><\/mfrac> <\/mrow><mo form='postfix' fence='true'>)<\/mo><\/mrow><mo class='MathClass-punc'>,<\/mo>\n<\/math><\/td>\n<\/tr>\n<\/table>\n<p><!-- l. 83 --><\/p>\n<p class='indent'>   where <!-- l. 83 --><math xmlns='http:\/\/www.w3.org\/1998\/Math\/MathML' display='inline'><msub><mrow><mover accent='true'><mrow><mi>x<\/mi><\/mrow><mo accent='true'>^<\/mo><\/mover><\/mrow><mrow><mi>n<\/mi><\/mrow><\/msub><\/math> is the<br \/>\naverage iterate, and <!-- l. 83 --><math xmlns='http:\/\/www.w3.org\/1998\/Math\/MathML' display='inline'><mi>G<\/mi><\/math><br \/>\nis a bound on the gradient norms. This is an interesting result for a<br \/>\nnumber of reasons. Firstly, it\u2019s the first such \u201casymptotic\u201d result for this<\/p>\n<p>problem class to have no additional log factors &#8211; you pay no penalty for<br \/>\nnot knowing D! Other methods for the problem class we consider: convex,<br \/>\nnon-smooth Lipschitz functions, require knowledge of other unknowns, such as<br \/>\n<!-- l. 89 --><math xmlns='http:\/\/www.w3.org\/1998\/Math\/MathML' display='inline'><mi>f<\/mi><mo class='MathClass-open'>(<\/mo><msub><mrow><mi>x<\/mi><\/mrow><mrow><mo class='MathClass-bin'>\u2217<\/mo><\/mrow><\/msub><mo class='MathClass-close'>)<\/mo><\/math> or<br \/>\nrequire line searches, back-tracking or other complexities. A chief advantage of our<br \/>\nmethod is it\u2019s so damn simple.\n<\/p>\n<p><!-- l. 93 --><\/p>\n<p class='indent'>   The major down-side of this asymptotic result is that we<br \/>\ndon\u2019t know how many optimization steps need to be taken,<br \/>\n<!-- l. 94 --><math xmlns='http:\/\/www.w3.org\/1998\/Math\/MathML' display='inline'><mi>n<\/mi><\/math>, before<br \/>\nthis asymptotic behavior kicks in. Fortunately, in practice, it kicks in after less than a<br \/>\nfew hundred steps, typically 1-2% of the total training time at most. This corresponds to<br \/>\nwhen the <!-- l. 97 --><math xmlns='http:\/\/www.w3.org\/1998\/Math\/MathML' display='inline'><msub><mrow><mi>d<\/mi><\/mrow><mrow><mi>k<\/mi><\/mrow><\/msub><\/math><br \/>\nestimate becomes within a constant factor of the true<br \/>\n<!-- l. 98 --><math xmlns='http:\/\/www.w3.org\/1998\/Math\/MathML' display='inline'><mi>D<\/mi><\/math>. We<br \/>\nsee this across more than 20 test problems, across deep-learning problems as well as<br \/>\nsimpler logistic regression problems.\n<\/p>\n<p><!-- l. 102 --><\/p>\n<p class='indent'>   The asymptotic bound we prove appears to be the true indictor of the behavior of<br \/>\nour method in practice, although we give a non-asymptotic bound as well, which shows<br \/>\nthat the convergence is never worse than a log factor off the best-possible rate, even<br \/>\nwhen <!-- l. 105 --><math xmlns='http:\/\/www.w3.org\/1998\/Math\/MathML' display='inline'><mi>n<\/mi><\/math><br \/>\nis small. Approaches from the \u201cparameter-free\u201d literature such as coin-betting give<br \/>\nslightly better non-asymptotic rates for this class (sqrt(log) factor difference),<br \/>\nhowever we have developed an improved variant of D-Adaptation that matches the<br \/>\nnon-asymptotic rates of those parameter-free methods, known as Prodigy<br \/>\n(<a href='https:\/\/arxiv.org\/pdf\/2306.06101.pdf'>https:\/\/arxiv.org\/pdf\/2306.06101.pdf<\/a>). Prodigy seems to be an overall improvement<br \/>\nover D-Adaptation in practice also, so we recommend its use as the primary variant<br \/>\nof the method.\n<\/p>\n<p><!-- l. 114 --><\/p>\n<p class='noindent'>\n<h4 class='likesubsectionHead'><a id='x1-4000'><\/a>More on non-asymptotics<\/h4>\n<p><!-- l. 116 --><\/p>\n<p class='noindent'>The non-asymptotic rate given by approaches such as coin betting for the average<br \/>\niterate is: <\/p>\n<table class='equation-star'>\n<tr>\n<td>\n<p><!-- l. 118 --><math display='block' xmlns='http:\/\/www.w3.org\/1998\/Math\/MathML' class='equation'>\n               <mi>f<\/mi><mo class='MathClass-open'>(<\/mo><msub><mrow><mover accent='true'><mrow><mi>x<\/mi><\/mrow><mo accent='true'>^<\/mo><\/mover><\/mrow><mrow><mi>n<\/mi><\/mrow><\/msub><mo class='MathClass-close'>)<\/mo> <mo class='MathClass-bin'>\u2212<\/mo> <mi>f<\/mi><mo class='MathClass-open'>(<\/mo><msub><mrow><mi>x<\/mi><\/mrow><mrow><mo class='MathClass-bin'>\u2217<\/mo><\/mrow><\/msub><mo class='MathClass-close'>)<\/mo> <mo class='MathClass-rel'>=<\/mo> <mi mathvariant='bold-script'><\/mi><mrow><mo form='prefix' fence='true'> (<\/mo><mrow><mfrac><mrow><mi mathvariant='italic'>DG<\/mi><msqrt><mrow><mi class='qopname'>log<\/mi><mo> \u2061<!-- FUNCTION APPLICATION --> <\/mo> <!-- nolimits --> <mrow><mo form='prefix' fence='true'> (<\/mo><mrow><mn>1<\/mn> <mo class='MathClass-bin'>+<\/mo> <mi>D<\/mi><mo class='MathClass-bin'>\u2215<\/mo><msub><mrow><mi>d<\/mi><\/mrow><mrow><mn>0<\/mn> <\/mrow> <\/msub><\/mrow><mo form='postfix' fence='true'>)<\/mo><\/mrow><\/mrow><\/msqrt><\/mrow>\n          <mrow><msqrt><mrow><mi>n<\/mi> <mo class='MathClass-bin'>+<\/mo> <mn>1<\/mn><\/mrow><\/msqrt><\/mrow><\/mfrac>         <\/mrow><mo form='postfix' fence='true'>)<\/mo><\/mrow><mo class='MathClass-punc'>.<\/mo>\n<\/math><\/td>\n<\/tr>\n<\/table>\n<p><!-- l. 121 --><\/p>\n<p class='indent'>   Essentially you pay a <!-- l. 121 --><math xmlns='http:\/\/www.w3.org\/1998\/Math\/MathML' display='inline'><msqrt><mrow><mi class='qopname'>log<\/mi><mo> \u2061<!-- FUNCTION APPLICATION --> <\/mo> <!-- nolimits --><mo class='MathClass-open'>(<\/mo><mn>1<\/mn> <mo class='MathClass-bin'>+<\/mo> <mi>D<\/mi><mo class='MathClass-bin'>\u2215<\/mo><msub><mrow><mi>d<\/mi><\/mrow><mrow><mn>0<\/mn> <\/mrow> <\/msub> <mo class='MathClass-close'>)<\/mo><\/mrow><\/msqrt><\/math> penalty<br \/>\nfor not having knowledge of <!-- l. 122 --><math xmlns='http:\/\/www.w3.org\/1998\/Math\/MathML' display='inline'><mi>D<\/mi><\/math><br \/>\nbefore-hand. There is a much simpler approach that gives essentially the same rate<br \/>\nfor this complexity class: Grid Search! In fact, the simple approach of running<br \/>\n<!-- l. 124 --><math xmlns='http:\/\/www.w3.org\/1998\/Math\/MathML' display='inline'><mi>R<\/mi> <mo class='MathClass-rel'>=<\/mo><mi class='qopname'> log<\/mi><mo> \u2061<!-- FUNCTION APPLICATION --> <\/mo><!-- nolimits --><mo class='MathClass-open'>(<\/mo><msub><mrow><mi>D<\/mi><\/mrow><mrow><mi class='qopname'>max<\/mi><mo> \u2061<!-- FUNCTION APPLICATION --> <\/mo><\/mrow><\/msub><mo class='MathClass-bin'>\u2215<\/mo><msub><mrow><mi>D<\/mi><\/mrow><mrow><mi class='qopname'>min<\/mi><mo> \u2061<!-- FUNCTION APPLICATION --> <\/mo><\/mrow><\/msub><mo class='MathClass-close'>)<\/mo><\/math> runs with a log-spaced<br \/>\ngrid of D values (i.e. <!-- l. 125 --><math xmlns='http:\/\/www.w3.org\/1998\/Math\/MathML' display='inline'><mn>1<\/mn><msup><mrow><mn>0<\/mn><\/mrow><mrow><mo class='MathClass-bin'>\u2212<\/mo><mn>8<\/mn><\/mrow><\/msup><mo class='MathClass-punc'>,<\/mo><mn>1<\/mn><msup><mrow><mn>0<\/mn><\/mrow><mrow><mo class='MathClass-bin'>\u2212<\/mo><mn>7<\/mn><\/mrow><\/msup><mo class='MathClass-punc'>,<\/mo><mn>1<\/mn><msup><mrow><mn>0<\/mn><\/mrow><mrow><mo class='MathClass-bin'>\u2212<\/mo><mn>6<\/mn><\/mrow><\/msup><mo class='MathClass-punc'>,<\/mo><mi class='MathClass-op'>\u2026<\/mi><mo> \u2061<!-- FUNCTION APPLICATION --><\/mo><mspace width='0.17em' class='thinspace'><\/mspace><\/math>)<br \/>\nbracketing the true D. Since our total step-budget is<br \/>\n<!-- l. 126 --><math xmlns='http:\/\/www.w3.org\/1998\/Math\/MathML' display='inline'><mi>n<\/mi> <mo class='MathClass-bin'>+<\/mo> <mn>1<\/mn><\/math>,<br \/>\neach run must be of length (n+1)\/R, and so taking the best<br \/>\n<!-- l. 127 --><math xmlns='http:\/\/www.w3.org\/1998\/Math\/MathML' display='inline'><mover accent='true'><mrow><mi>x<\/mi><\/mrow><mo accent='true'>^<\/mo><\/mover><\/math> by<br \/>\nfunction value gives a bound:\n<\/p>\n<p><!-- tex4ht:inline --><!-- l. 133 --><math xmlns='http:\/\/www.w3.org\/1998\/Math\/MathML' display='block'><mtable columnalign='left' class='align-star'>\n             <mtr><mtd columnalign='right' class='align-odd'><mi>f<\/mi><mo class='MathClass-open'>(<\/mo><msub><mrow><mover accent='true'><mrow><mi>x<\/mi><\/mrow><mo accent='true'>^<\/mo><\/mover><\/mrow><mrow><mi>n<\/mi><\/mrow><\/msub><mo class='MathClass-close'>)<\/mo> <mo class='MathClass-bin'>\u2212<\/mo> <mi>f<\/mi><mo class='MathClass-open'>(<\/mo><msub><mrow><mi>x<\/mi><\/mrow><mrow><mo class='MathClass-bin'>\u2217<\/mo><\/mrow><\/msub><mo class='MathClass-close'>)<\/mo><\/mtd>             <mtd class='align-even'> <mo class='MathClass-rel'>=<\/mo> <mi mathvariant='bold-script'><\/mi><mrow><mo form='prefix' fence='true'> (<\/mo><mrow>   <mfrac><mrow><mi mathvariant='italic'>DG<\/mi><\/mrow>\n<mrow><msqrt><mrow><mo class='MathClass-open'>(<\/mo><mi>n<\/mi> <mo class='MathClass-bin'>+<\/mo> <mn>1<\/mn><mo class='MathClass-close'>)<\/mo><mo class='MathClass-bin'>\u2215<\/mo><mi>R<\/mi><\/mrow><\/msqrt><\/mrow><\/mfrac> <\/mrow><mo form='postfix' fence='true'>)<\/mo><\/mrow><mspace width='2em'><\/mspace><\/mtd>                      <mtd columnalign='right' class='align-label'><\/mtd>             <mtd class='align-label'>\n             <mspace width='2em'><\/mspace><\/mtd><\/mtr><mtr><mtd columnalign='right' class='align-odd'><\/mtd>                          <mtd class='align-even'> <mo class='MathClass-rel'>=<\/mo> <mi mathvariant='bold-script'><\/mi><mrow><mo form='prefix' fence='true'> (<\/mo><mrow><mfrac><mrow><mi mathvariant='italic'>DG<\/mi><msqrt><mrow><mi>R<\/mi><\/mrow><\/msqrt><\/mrow>\n<mrow><msqrt><mrow><mi>n<\/mi> <mo class='MathClass-bin'>+<\/mo> <mn>1<\/mn><\/mrow><\/msqrt><\/mrow><\/mfrac> <\/mrow><mo form='postfix' fence='true'>)<\/mo><\/mrow><mspace width='2em'><\/mspace><\/mtd>                          <mtd columnalign='right' class='align-label'><\/mtd>             <mtd class='align-label'>\n             <mspace width='2em'><\/mspace><\/mtd><\/mtr><mtr><mtd columnalign='right' class='align-odd'><\/mtd>                          <mtd class='align-even'> <mo class='MathClass-rel'>=<\/mo> <mi mathvariant='bold-script'><\/mi><mrow><mo form='prefix' fence='true'> (<\/mo><mrow><mfrac><mrow><mi mathvariant='italic'>DG<\/mi><msqrt><mrow><mi class='qopname'>log<\/mi><mo> \u2061<!-- FUNCTION APPLICATION --> <\/mo> <!-- nolimits --> <mrow><mo form='prefix' fence='true'> (<\/mo><mrow><msub><mrow><mi>D<\/mi><\/mrow><mrow><mi class='qopname'>max<\/mi><mo> \u2061<!-- FUNCTION APPLICATION --> <\/mo> <\/mrow> <\/msub> <mo class='MathClass-bin'>\u2215<\/mo><msub><mrow><mi>D<\/mi><\/mrow><mrow><mi class='qopname'>min<\/mi><mo> \u2061<!-- FUNCTION APPLICATION --> <\/mo> <\/mrow> <\/msub><\/mrow><mo form='postfix' fence='true'>)<\/mo><\/mrow><\/mrow><\/msqrt><\/mrow>\n            <mrow><msqrt><mrow><mi>n<\/mi> <mo class='MathClass-bin'>+<\/mo> <mn>1<\/mn><\/mrow><\/msqrt><\/mrow><\/mfrac>          <\/mrow><mo form='postfix' fence='true'>)<\/mo><\/mrow><mo class='MathClass-punc'>.<\/mo><mspace width='2em'><\/mspace><\/mtd>             <mtd columnalign='right' class='align-label'><\/mtd>             <mtd class='align-label'>\n   <mspace width='2em'><\/mspace><\/mtd><\/mtr><\/mtable><\/math><br \/>\n<!-- l. 135 --><\/p>\n<p class='noindent'>Often in computer-science research we ignore log factors as unimportant, or<br \/>\n\u201cnot-real\u201d, but in this case, the log factor is very real! It corresponds to the<br \/>\ncost of running multiple training runs and throwing away all but the best<br \/>\nrun.<\/p>\n<p><!-- l. 140 --><\/p>\n<p class='indent'>   Given this is the same sqrt(log) dependence as for coin betting, what<br \/>\ndifferentiates parameter-free approaches such as coin-betting? They are developed for<br \/>\na much more general class of problems, Online Linear Optimization (OLO),<br \/>\nand so when applied to convex non-smooth optimization as we do here,<br \/>\nthey are not able to make use of the additional problem structure. Note<br \/>\nthat the grid search method above obviously can\u2019t achieve a sqrt-log-factor<br \/>\nfree rate asymptotically, since you have to actually perform each of the<br \/>\n<!-- l. 147 --><math xmlns='http:\/\/www.w3.org\/1998\/Math\/MathML' display='inline'><mi>R<\/mi><\/math><br \/>\ntraining runs. It\u2019s unclear if coin-betting approaches can avoid the log factor<br \/>\nasymptotically, this is an interesting direction for future research. Lower bounds are<br \/>\nknown for the online linear optimization setting that show that any non-asymptotic<br \/>\nrate must include this sqrt-log factor.\n<\/p>\n<p><!-- l. 153 --><\/p>\n<p class='noindent'>\n<h3 class='likesectionHead'><a id='x1-5000'><\/a>Why previous parameter-free methods don\u2019t work as well<\/h3>\n<p><!-- l. 155 --><\/p>\n<p class='noindent'>A major component of our work was the effort it took to go from a theoretically nice<br \/>\nalgorithm to a practical method that works on deep learning problems. We<br \/>\nhad to identify why other learning-rate free methods fail, and we had to<br \/>\nadapt our method to work on top of modern optimization methods such as<br \/>\nAdam.\n<\/p>\n<p><!-- l. 161 --><\/p>\n<p class='indent'>   The key insight turned out to be SCHEDULING. By scheduling, I mean a<br \/>\nsequence of constants that multiply the learning rate, that includes potentially<br \/>\nwarm-up at the beginning of training, and annealing of the learning rate towards the<br \/>\nend of training.\n<\/p>\n<p><!-- l. 166 --><\/p>\n<p class='indent'>   Using theoretically motivated schedules such as<br \/>\n<!-- l. 166 --><math xmlns='http:\/\/www.w3.org\/1998\/Math\/MathML' display='inline'><mn>1<\/mn><mo class='MathClass-bin'>\u2215<\/mo><msqrt><mrow><mi>k<\/mi> <mo class='MathClass-bin'>+<\/mo> <mn>1<\/mn><\/mrow><\/msqrt><\/math><br \/>\nlearning rate decay turn out to work terribly in practice almost across the board on<br \/>\ndeep learning problems. Existing coin-betting approaches prior to our work all relied<br \/>\non very poorly performing schedules. This is why the COCOB method we tested as a<br \/>\nbaseline under-performed The schedule is an implicit part of the method, arising<br \/>\ndirectly from the mathematical formulation, and so its difficult to apply an arbitrary<br \/>\nschedule on top of it, in contrast, D-Adaptation natively supports the use of arbitrary<br \/>\nschedules.\n<\/p>\n<p><!-- l. 176 --><\/p>\n<p class='indent'>   Using existing empirically motivated schedules, such as those already widely in<br \/>\nuse in deep learning practice, leads to huge improvements in performance. In<br \/>\ncollaboration with Ashok and Harsh, I have also been working on incorporating<br \/>\nschedule support into practical parameter-free methods, giving a method we call<br \/>\nMechanic (<a href='https:\/\/arxiv.org\/abs\/2306.00144'>https:\/\/arxiv.org\/abs\/2306.00144<\/a>).\n<\/p>\n<p><!-- l. 182 --><\/p>\n<p class='noindent'>\n<h3 class='likesectionHead'><a id='x1-6000'><\/a>Practical performance of D-Adaptation<\/h3>\n<p><!-- l. 184 --><\/p>\n<p class='noindent'>Our Adam variant of D-Adaptation is a drop-in replacement for the PyTorch Adam<br \/>\nclass, requiring no other changes to the code-base. We tested across a large selection<br \/>\nof modern deep learning problems, including auto-regressive language models,<br \/>\nimage-classification, image-2-image tasks, object recognition, recommendation<br \/>\nsystems, masked-language models and more. In fact, we performed the largest<br \/>\ncomparison performed for any optimization paper for a new method, outside of 10+<br \/>\nauthor papers from Google.\n<\/p>\n<p><!-- l. 193 --><\/p>\n<p class='indent'>   D-Adaptation matched the final test performance of laboriously hand-tuned<br \/>\nlearning rates on 9 of 10 deep learning problems, and all 12 logistic regression<br \/>\nproblems tested. We have also received overwhelming feedback from the ML<br \/>\ncommunity that the method \u201cjust-works\u201d and is a true viable replacement for<br \/>\nhand-tuned learning rates.\n<\/p>\n<p><!-- l. 199 --><\/p>\n<p class='indent'>   ICML is not a theoretically focused venue; reviewers highly value papers<br \/>\nwhose contribution have clear practical impact together with strong theory.<br \/>\nBefore D-Adaptation, there were no methods that could be \u201cdropped-in\u201d to<br \/>\na code base without other changes to provide adaptivity to the learning<br \/>\nrate that actually worked well in practice. The ease of use of our method<br \/>\nand its clear practical utility were recognized by our reviewers, with final<br \/>\nreview scores of 9, 8 and 6. We were fortunate to have highly communicative<br \/>\nreviewers, who suggested improvements, pointed out errors (even reviewing the<br \/>\nappendix carefully!) and overall greatly improved our work. A triumph of peer<br \/>\nreview!\n<\/p>\n<p><!-- l. 210 --><\/p>\n<p class='noindent'>\n<h3 class='likesectionHead'><a id='x1-7000'><\/a>Limitations<\/h3>\n<p><!-- l. 212 --><\/p>\n<p class='noindent'>D-Adapt Adam uses an additional memory buffer on top of regular Adam, and so it<br \/>\nhas overall higher memory use. It can also fail on problems where training is very<br \/>\nunstable. We experienced issues with it\u2019s convergence on ViT (vision transformer)<br \/>\ntraining for example.\n<\/p>\n<p><!-- l. 217 --><\/p>\n<p class='indent'>   Since setting the learning rate globally with D-Adaptation requires aggregating<br \/>\ndot products and norms from across all parameters, it will requiring syncing<br \/>\ninformation across nodes when using sharded data-parallel training, which may have<br \/>\na performance hit. We found that applying it separately at the block or node level<br \/>\ndidn\u2019t work well, the global syncing appears to be necessary.\n<\/p>\n<p><!-- l. 224 --><\/p>\n<p class='indent'>   Our theory also has limitations. It only applies in the deterministic setting, which<br \/>\nis much more limited than the stochastic or online settings that other methods such<br \/>\nas coin-betting or DoG have theory for. Given the method seems to work well for<br \/>\nstochastic problems, I see developing theory for this case as an interesting research<br \/>\nproblem.\n<\/p>\n<p><!-- l. 230 --><\/p>\n<p class='indent'>   D-Adaptation MUST be used with a schedule. I have seen several comparisons<br \/>\nagainst D-Adaptation that don\u2019t mention the use of schedules, and likely omitted<br \/>\nthem. This is crucial for the performance of the method. We recommend the use of<\/p>\n<p>the linear decay schedule with potentially a warmup (5%?) if warmup is customary<br \/>\nfor the problem being tackled.\n<\/p>\n<p><!-- l. 236 --><\/p>\n<p class='noindent'>\n<h3 class='likesectionHead'><a id='x1-8000'><\/a>What\u2019s Next<\/h3>\n<p><!-- l. 238 --><\/p>\n<p class='noindent'>It\u2019s rather annoying that we have to choose a schedule to use with D-Adaptation&#8230;.<br \/>\nsurely there is some way to determine data-adaptive schedules automatically? Stay<br \/>\ntuned for a solution! <\/p>\n","protected":false},"excerpt":{"rendered":"<p>Konstantin and I were honored with an ICML Outstanding paper award for our paper \u201cLearning-Rate-Free Learning by D-Adaptation\u201d. Besides the award, our implementation has already been downloaded nearly a million times over the last year (most before the award), and is seeing wide-spread use in training and fine-tuning deep learning models, particularly LoRAs. Read on &hellip; <a href=\"https:\/\/www.aarondefazio.com\/tangentially\/?p=127\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">D-Adaptatation Wins an Outstanding Paper Award at ICML 2023!<\/span> <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aarondefazio.com\/tangentially\/index.php?rest_route=\/wp\/v2\/posts\/127"}],"collection":[{"href":"https:\/\/www.aarondefazio.com\/tangentially\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aarondefazio.com\/tangentially\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aarondefazio.com\/tangentially\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aarondefazio.com\/tangentially\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=127"}],"version-history":[{"count":6,"href":"https:\/\/www.aarondefazio.com\/tangentially\/index.php?rest_route=\/wp\/v2\/posts\/127\/revisions"}],"predecessor-version":[{"id":133,"href":"https:\/\/www.aarondefazio.com\/tangentially\/index.php?rest_route=\/wp\/v2\/posts\/127\/revisions\/133"}],"wp:attachment":[{"href":"https:\/\/www.aarondefazio.com\/tangentially\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=127"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aarondefazio.com\/tangentially\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=127"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aarondefazio.com\/tangentially\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=127"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}