Publications | Aaron Defazio

Why Gradients Rapidly Increase Near the End of Training

Aaron Defazio
Preprint (2025)

[arXiv]
During long-duration Large Language Model (LLM) training runs the gradient norm increases rapidly near the end of training. In this short note, we show that this increase is due to an unintended interaction between weight decay, normalization layers, and the learning rate schedule. We propose a simple correction that fixes this behavior while also resulting in lower loss values throughout training.

@misc{defazio2025gradientsrapidlyincreasenear, title={Why Gradients Rapidly Increase Near the End of Training}, author={Aaron Defazio}, year={2025}, eprint={2506.02285}, archivePrefix={arXiv}, primaryClass={cs.LG}, url={https://arxiv.org/abs/2506.02285}, }

Purifying Shampoo: Investigating Shampoo's Heuristics by Decomposing its Preconditioner

Runa Eschenhagen, Aaron Defazio, Tsung-Hsien Lee, Richard E. Turner, Hao-Jun Michael Shi
Preprint (2025)

[arXiv]
The recent success of Shampoo in the AlgoPerf contest has sparked renewed interest in Kronecker-factorization-based optimization algorithms for training neural networks. Despite its success, Shampoo relies heavily on several heuristics such as learning rate grafting and stale preconditioning to achieve performance at-scale. These heuristics increase algorithmic complexity, necessitate further hyperparameter tuning, and lack theoretical justification. This paper investigates these heuristics from the angle of Frobenius norm approximation to full-matrix Adam and decouples the preconditioner's eigenvalues and eigenbasis updates. We show that grafting from Adam mitigates the staleness and mis-scaling of the preconditioner's eigenvalues and how correcting the eigenvalues directly can eliminate the need for learning rate grafting. To manage the error induced by infrequent eigenbasis computations, we propose an adaptive criterion for determining the eigenbasis computation frequency motivated by terminating a warm-started QR algorithm. This criterion decouples the update frequency of different preconditioner matrices and enables us to investigate the impact of approximation error on convergence. These practical techniques offer a principled angle towards removing Shampoo's heuristics and developing improved Kronecker-factorization-based training algorithms.

@misc{eschenhagen2025purifyingshampooinvestigatingshampoos, title={Purifying Shampoo: Investigating Shampoo's Heuristics by Decomposing its Preconditioner}, author={Runa Eschenhagen and Aaron Defazio and Tsung-Hsien Lee and Richard E. Turner and Hao-Jun Michael Shi}, year={2025}, eprint={2506.03595}, archivePrefix={arXiv}, primaryClass={cs.LG}, url={https://arxiv.org/abs/2506.03595}, }

PARQ: Piecewise-Affine Regularized Quantization

Lisa Jin, Jianhao Ma, Zechun Liu, Andrey Gromov, Aaron Defazio, Lin Xiao
ICML (2025)

[arXiv]
We develop a principled method for quantization-aware training (QAT) of large-scale machine learning models. Specifically, we show that convex, piecewise-affine regularization (PAR) can effectively induce the model parameters to cluster towards discrete values. We minimize PAR-regularized loss functions using an aggregate proximal stochastic gradient method (AProx) and prove that it has last-iterate convergence. Our approach provides an interpretation of the straight-through estimator (STE), a widely used heuristic for QAT, as the asymptotic form of PARQ. We conduct experiments to demonstrate that PARQ obtains competitive performance on convolution- and transformer-based vision tasks.

@inproceedings{ jin2025parq, title={{PARQ}: Piecewise-Affine Regularized Quantization}, author={Lisa Jin and Jianhao Ma and Zechun Liu and Andrey Gromov and Aaron Defazio and Lin Xiao}, booktitle={Forty-second International Conference on Machine Learning}, year={2025}, url={https://openreview.net/forum?id=8PCxOlwbIn} }

The Road Less Scheduled

Aaron Defazio, Xingyu (Alice) Yang, Harsh Mehta, Konstantin Mishchenko, Ahmed Khaled, Ashok Cutkosky
Neurips Oral (2024)

[arXiv] [code]
Existing learning rate schedules that do not require specification of the optimization stopping step T are greatly out-performed by learning rate schedules that depend on T. We propose an approach that avoids the need for this stopping time by eschewing the use of schedules entirely, while exhibiting state-of-the-art performance compared to schedules across a wide family of problems ranging from convex problems to large-scale deep learning problems. Our Schedule-Free approach introduces no additional hyper-parameters over standard optimizers with momentum. Our method is a direct consequence of a new theory we develop that unifies scheduling and iterate averaging. An open source implementation of our method is available

@misc{defazio2024roadscheduled, title={The Road Less Scheduled}, author={Aaron Defazio and Xingyu and Yang and Harsh Mehta and Konstantin Mishchenko and Ahmed Khaled and Ashok Cutkosky}, year={2024}, eprint={2405.15682}, archivePrefix={arXiv}, primaryClass={cs.LG}, url={https://arxiv.org/abs/2405.15682}, }

Directional Smoothness and Gradient Methods: Convergence and Adaptivity

Aaron Mishkin, Ahmed Khaled, Yuanhao Wang, Aaron Defazio, Robert M. Gower
ICML (2024)

[arXiv] [publication]
We develop new sub-optimality bounds for gradient descent (GD) that depend on the conditioning of the objective along the path of optimization, rather than on global, worst-case constants. Key to our proofs is directional smoothness, a measure of gradient variation that we use to develop upper-bounds on the objective. Minimizing these upper-bounds requires solving implicit equations to obtain a sequence of strongly adapted step-sizes; we show that these equations are straightforward to solve for convex quadratics and lead to new guarantees for two classical step-sizes. For general functions, we prove that the Polyak step-size and normalized GD obtain fast, path-dependent rates despite using no knowledge of the directional smoothness. Experiments on logistic regression show our convergence guarantees are tighter than the classical theory based on L-smoothness.

@inproceedings{ mishkin2024directional, title={Directional Smoothness and Gradient Methods: Convergence and Adaptivity}, author={Aaron Mishkin and Ahmed Khaled and Yuanhao Wang and Aaron Defazio and Robert M. Gower}, booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems}, year={2024}, url={https://openreview.net/forum?id=m9WZrEXWl5} }

Optimal Linear Decay Learning Rate Schedules and Further Refinements

Aaron Defazio, Ashok Cutkosky, Harsh Mehta, Konstantin Mishchenko
Preprint (2023)

[arXiv] [code]
Learning rate schedules used in practice bear little resemblance to those recommended by theory. We close much of this theory/practice gap, and as a consequence are able to derive new problem-adaptive learning rate schedules. Our key technical contribution is a refined analysis of learning rate schedules for a wide class of optimization algorithms (including SGD). In contrast to most prior works that study the convergence of the average iterate, we study the last iterate, which is what most people use in practice. When considering only worst-case analysis, our theory predicts that the best choice is the linear decay schedule: a popular choice in practice that sets the stepsize proportionally to 1 - t/T, where t is the current iteration and T is the total number of steps. To go beyond this worst-case analysis, we use the observed gradient norms to derive schedules refined for any particular task. These refined schedules exhibit learning rate warm-up and rapid learning rate annealing near the end of training. Ours is the first systematic approach to automatically yield both of these properties. We perform the most comprehensive evaluation of learning rate schedules to date, evaluating across 10 diverse deep learning problems, a series of LLMs, and a suite of logistic regression problems. We validate that overall, the linear-decay schedule matches or outperforms all commonly used default schedules including cosine annealing, and that our schedule refinement method gives further improvements.

@misc{defazio2023when, title={Optimal Linear Decay Learning Rate Schedules and Further Refinements}, author={Aaron Defazio and Ashok Cutkosky and Harsh Mehta and Konstantin Mishchenko}, year={2023}, eprint={2310.07831}, archivePrefix={arXiv}, primaryClass={cs.LG} }

Prodigy: An Expeditiously Adaptive Parameter-Free Learner

Konstantin Mishchenko, Aaron Defazio
ICML 2024

[arXiv] [publication] [code]
We consider the problem of estimating the learning rate in adaptive methods, such as Adagrad and Adam. We describe two techniques, Prodigy and Resetting, to provably estimate the distance to the solution D, which is needed to set the learning rate optimally. Our techniques are modifications of the D-Adaptation method for learning-rate-free learning. Our methods improve upon the convergence rate of D-Adaptation by a factor of O(sqrt(log(D/d0))), where d0 is the initial estimate of D. We test our methods on 12 common logistic-regression benchmark datasets, VGG11 and ResNet-50 training on CIFAR10, ViT training on Imagenet, LSTM training on IWSLT14, DLRM training on Criteo dataset, VarNet on Knee MRI dataset, as well as RoBERTa and GPT transformer training on BookWiki. Our experimental results show that our approaches consistently outperform D-Adaptation and reach test accuracy values close to that of hand-tuned Adam.

@misc{mishchenko2023prodigy, title={Prodigy: An Expeditiously Adaptive Parameter-Free Learner}, author={Konstantin Mishchenko and Aaron Defazio}, year={2023}, eprint={2306.06101}, archivePrefix={arXiv}, primaryClass={cs.LG} }

MoMo: Momentum Models for Adaptive Learning Rates

Fabian Schaipp, Ruben Ohana, Michael Eickenberg, Aaron Defazio, Robert M. Gower
ICML (2024)

[arXiv]
Training a modern machine learning architecture on a new task requires extensive learning-rate tuning, which comes at a high computational cost. Here we develop new adaptive learning rates that can be used with any momentum method, and require less tuning to perform well. We first develop MoMo, a Momentum Model based adaptive learning rate for SGD-M (Stochastic gradient descent with momentum). MoMo uses momentum estimates of the batch losses and gradients sampled at each iteration to build a model of the loss function. Our model also makes use of any known lower bound of the loss function by using truncation, e.g. most losses are lower-bounded by zero. We then approximately minimize this model at each iteration to compute the next step. We show how MoMo can be used in combination with any momentum-based method, and showcase this by developing MoMo-Adam - which is Adam with our new model-based adaptive learning rate. Additionally, for losses with unknown lower bounds, we develop on-the-fly estimates of a lower bound, that are incorporated in our model. Through extensive numerical experiments, we demonstrate that MoMo and MoMo-Adam improve over SGD-M and Adam in terms of accuracy and robustness to hyperparameter tuning for training image classifiers on MNIST, CIFAR10, CIFAR100, Imagenet, recommender systems on the Criteo dataset, and a transformer model on the translation task IWSLT14.

@misc{schaipp2023momo, title={MoMo: Momentum Models for Adaptive Learning Rates}, author={Fabian Schaipp and Ruben Ohana and Michael Eickenberg and Aaron Defazio and Robert M. Gower}, year={2023}, eprint={2305.07583}, archivePrefix={arXiv}, primaryClass={cs.LG} }

Mechanic: A Learning Rate Tuner

Ashok Cutkosky, Aaron Defazio, Harsh Mehta
NeurIPS (2023)

[arXiv] [publication] [code]
We introduce a technique for tuning the learning rate scale factor of any base optimization algorithm and schedule automatically, which we call mechanic. Our method provides a practical realization of recent theoretical reductions for accomplishing a similar goal in online convex optimization. We rigorously evaluate mechanic on a range of large scale deep learning tasks with varying batch sizes, schedules, and base optimization algorithms. These experiments demonstrate that depending on the problem, mechanic either comes very close to, matches or even improves upon manual tuning of learning rates.

@ARTICLE{cutkosky2023mechanic, title={Mechanic: A Learning Rate Tuner}, author={Ashok Cutkosky and Aaron Defazio and Harsh Mehta}, year={2023}, journal={Advances in Neural Information Processing Systems 37 (NeurIPS 2023)}, }

Learning-Rate-Free Learning by D-Adaptation

Aaron Defazio, Konstantin Mishchenko
ICML (2023) Outstanding Paper Award

[arXiv] [publication] [code]
The speed of gradient descent for convex Lipschitz functions is highly dependent on the choice of learning rate. Setting the learning rate to achieve the optimal convergence rate requires knowing the distance D from the initial point to the solution set. In this work, we describe a single-loop method, with no back-tracking or line searches, which does not require knowledge of D yet asymptotically achieves the optimal rate of convergence for the complexity class of convex Lipschitz functions. Our approach is the first parameter-free method for this class without additional multiplicative log factors in the convergence rate. We present extensive experiments for SGD and Adam variants of our method, where the method automatically matches hand-tuned learning rates across more than a dozen diverse machine learning problems, including large-scale vision and language problems. Our method is practical, efficient and requires no additional function value or gradient evaluations each step. An open-source implementation is available.

@ARTICLE{defazio2023dadapt, author = {Aaron Defazio and Konstantin Mishchenko}, title = {Learning-Rate-Free Learning by D-Adaptation}, journal = {The 40th International Conference on Machine Learning (ICML 2023)}, year = {2023} }

Adaptivity without Compromise: A Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic Optimization

Aaron Defazio, Samy Jelassi
Journal of Machine Learning Research (2022)

[PDF] [publication] [code]
We introduce MADGRAD, a novel optimization method in the family of AdaGrad adaptive gradient methods. MADGRAD shows excellent performance on deep learning optimization problems from multiple fields, including classification and image-to-image tasks in vision, and recurrent and bidirectionally-masked models in natural language processing. For each of these tasks, MADGRAD matches or outperforms both SGD and ADAM in test set performance, even on problems for which adaptive methods normally perform poorly.

@article{defazio2022adaptivity, author = {Aaron Defazio and Samy Jelassi}, title = {Adaptivity without Compromise: A Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic Optimization}, journal = {Journal of Machine Learning Research}, year = {2022}, volume = {23}, pages = {1-34}, }

Grad-GradaGrad? A Non-Monotone Adaptive Stochastic Gradient Method

Aaron Defazio, Baoyu Zhou, Lin Xiao

[PDF]
The classical AdaGrad method adapts the learning rate by dividing by the square root of a sum of squared gradients. Because this sum on the denominator is increasing, the method can only decrease step sizes over time, and requires a learning rate scaling hyper-parameter to be carefully tuned. To overcome this restriction, we introduce GradaGrad, a method in the same family that naturally grows or shrinks the learning rate based on a different accumulation in the denominator, one that can both increase and decrease. We show that it obeys a similar convergence rate as AdaGrad and demonstrate its non-monotone adaptation capability with experiments.

@article{defazio2022scaling, @misc{https://doi.org/10.48550/arxiv.2206.06900, doi = {10.48550/ARXIV.2206.06900}, url = {https://arxiv.org/abs/2206.06900}, author = {Defazio, Aaron and Zhou, Baoyu and Xiao, Lin}, keywords = {Machine Learning (cs.LG), Optimization and Control (math.OC), Machine Learning (stat.ML), FOS: Computer and information sciences, FOS: Computer and information sciences, FOS: Mathematics, FOS: Mathematics}, title = {Grad-GradaGrad? A Non-Monotone Adaptive Stochastic Gradient Method}, publisher = {arXiv}, year = {2022} }

A scaling calculus for the design and initialization of ReLU networks

Aaron Defazio, Léon Bottou
Neural Computing and Applications (2022)

[PDF] [publication]
In this work, we describe a set of rules for the design and initialization of well-conditioned neural networks, guided by the goal of naturally balancing the diagonal blocks of the Hessian at the start of training. Our design principle balances multiple sensible measures of the conditioning of neural networks. We prove that for a ReLU-based deep multilayer perceptron, a simple initialization scheme using the geometric mean of the fan-in and fan-out satisfies our scaling rule. For more sophisticated architectures, we show how our scaling principle can be used to guide design choices to produce well-conditioned neural networks, reducing guess-work.

@article{defazio2022scaling, abstract = {We propose a system for calculating a ``scaling constant''for layers and weights of neural networks. We relate this scaling constant to two important quantities that relate to the optimizability of neural networks, and argue that a network that is ``preconditioned''via scaling, in the sense that all weights have the same scaling constant, will be easier to train. This scaling calculus results in a number of consequences, among them the fact that the geometric mean of the fan-in and fan-out, rather than the fan-in, fan-out, or arithmetic mean, should be used for the initialization of the variance of weights in a neural network. Our system allows for the off-line design \& engineering of ReLU (Rectified Linear Unit) neural networks, potentially replacing blind experimentation. We verify the effectiveness of our approach on a set of benchmark problems.}, author = {Defazio, Aaron and Bottou, L{\'e}on}, doi = {10.1007/s00521-022-07308-z}, isbn = {1433-3058}, journal = {Neural Computing and Applications}, title = {A scaling calculus for the design and initialization of ReLU networks}, url = {https://doi.org/10.1007/s00521-022-07308-z}, year = {2022}, }

Momentum via Primal Averaging: Theoretical Insights and Learning Rate Schedules for Non-Convex Optimization

Aaron Defazio

[arXiv]
Momentum methods are now used pervasively within the machine learning community for training non-convex models such as deep neural networks. Empirically, they out perform traditional stochastic gradient descent (SGD) approaches. In this work we develop a Lyapunov analysis of SGD with momentum (SGD+M), by utilizing a equivalent rewriting of the method known as the stochastic primal averaging (SPA) form. This analysis is much tighter than previous theory in the non-convex case, and due to this we are able to give precise insights into when SGD+M may out-perform SGD, and what hyper-parameter schedules will work and why.

@misc{defazio2020mom, title={Momentum via Primal Averaging: Theoretical Insights and Learning Rate Schedules for Non-Convex Optimization}, author={Aaron Defazio}, year={2020}, eprint={2010.00406}, archivePrefix={arXiv}, primaryClass={cs.LG} }

Stochastic Polyak Stepsize with a Moving Target

Robert M. Gower, Aaron Defazio, Michael Rabbat

[arXiv]
We propose a new stochastic gradient method called MOTAPS (Moving Targetted Polyak Stepsize) that uses recorded past loss values to compute adaptive stepsizes. MOTAPS can be seen as a variant of the Stochastic Polyak (SP) which is also a method that also uses loss values to adjust the stepsize. The downside to the SP method is that it only converges when the interpolation condition holds. MOTAPS is an extension of SP that does not rely on the interpolation condition. The MOTAPS method uses n auxiliary variables, one for each data point, that track the loss value for each data point. We provide a global convergence theory for SP, an intermediary method TAPS, and MOTAPS by showing that they all can be interpreted as a special variant of online SGD. We also perform several numerical experiments on convex learning problems, and deep learning models for image classification and language translation. In all of our tasks we show that MOTAPS is competitive with the relevant baseline method.

@misc{gower2021stochastic, title={Stochastic Polyak Stepsize with a Moving Target}, author={Robert M. Gower and Aaron Defazio and Michael Rabbat}, year={2021}, eprint={2106.11851}, archivePrefix={arXiv}, primaryClass={cs.LG} }

Compressed Sensing with a Jackknife, a Bootstrap, and Visualization

Aaron Defazio, Mark Tygert, Rachel Ward, Jure Zbontar
Journal of Data Science, Statistics, and Visualisation (JDSSV 2022)

[PDF] [Publication]
Compressed sensing proposes to reconstruct more degrees of freedom in a signal than the number of values actually measured (based on a potentially un-justified regularizer or prior distribution). Compressed sensing therefore risks introducing errors — inserting spurious artifacts or masking the abnormalities that medical imaging seeks to discover. Estimating errors using the standard statistical tools of a jackknife and a bootstrap yields error “bars” in the form of full images that are remarkably qualitatively representative of the actual errors (at least when evaluated and validated on data sets for which the ground truth and hence the actual error is available). These images show the structure of possible errors — without recourse to measuring the entire ground truth directly — and build confidence in regions of the images where the estimated errors are small. Further visualizations and summary statistics can aid in the interpretation of such error estimates. Visualizations include suitable colorizations of the reconstruction, as well as the obvious “correction” of the reconstruction by subtracting off the error estimates. The canonical summary statistic would be the root-mean-square of the error estimates. Unfortunately, colorizations appear likely to be too distracting for actual clinical practice in medical imaging, and the root-mean-square gets swamped by background noise in the error estimates. Fortunately, straightforward displays of the error estimates and of the “corrected” reconstruction are illuminating, and the root-mean-square improves greatly af- ter mild blurring of the error estimates; the blurring is barely perceptible to the human eye yet smooths away background noise that would otherwise overwhelm the root-mean-square.

@misc{defazio2022bootstrap, title={Compressed Sensing with a Jackknife, a Bootstrap, and Visualization}, author={Aaron Defazio and Mark Tygert and Rachel Ward and Jure Zbontar}, Journal = {Journal of Data Science, Statistics, and Visualisation}, Year = {2022}} }

The Power of Factorial Powers: New Parameter settings for (Stochastic) Optimization

Aaron Defazio, Robert M. Gower
Asian Conference on Machine Learning (ACML 2021)

[arXiv] [publication]
The convergence rates for convex and non-convex optimization methods depend on the choice of a host of constants, including step sizes, Lyapunov function constants and momentum constants. In this work we propose the use of factorial powers as a flexible tool for defining constants that appear in convergence proofs. We list a number of remarkable properties that these sequences enjoy, and show how they can be applied to convergence proofs to simplify or improve the convergence rates of the momentum method, accelerated gradient and the stochastic variance reduced method (SVRG).

@InProceedings{defazio2021factorial, title = {The Power of Factorial Powers: New Parameter settings for (Stochastic) Optimization}, author = {Defazio, Aaron and Gower, Robert M.}, booktitle = {Proceedings of The 13th Asian Conference on Machine Learning}, pages = {49--64}, year = {2021}, editor = {Balasubramanian, Vineeth N. and Tsang, Ivor}, volume = {157}, series = {Proceedings of Machine Learning Research}, month = {17--19 Nov}, publisher = {PMLR}, pdf = {https://proceedings.mlr.press/v157/defazio21a/defazio21a.pdf}, url = {https://proceedings.mlr.press/v157/defazio21a.html}, }

On the (asymptotic) convergence of Stochastic Gradient Descent and Stochastic Heavy Ball

Othmane Sebbouh, Robert M. Gower, Aaron Defazio
Conference on Learning Theory 2021

[arXiv]
We study stochastic gradient descent (SGD) and the stochastic heavy ball method (SHB, otherwise known as the momentum method) for the general stochastic approximation problem. For SGD, in the convex and smooth setting, we provide the first almost sure asymptotic convergence rates for a weighted average of the iterates . More precisely, we show that the convergence rate of the function values is arbitrarily close to o(1/sqrt(k)), and is exactly o(1/k) in the so-called overparametrized case. We show that these results still hold when using stochastic line search and stochastic Polyak stepsizes, thereby giving the first proof of convergence of these methods in the non-overparametrized regime. Using a substantially different analysis, we show that these rates hold for SHB as well, but at the last iterate. This distinction is important because it is the last iterate of SGD and SHB which is used in practice. We also show that the last iterate of SHB converges to a minimizer almost surely. Additionally, we prove that the function values of the deterministic HB converge at a o(1/k) rate, which is faster than the previously known O(1/k). Finally, in the nonconvex setting, we prove similar rates on the lowest gradient norm along the trajectory of SGD.

@InProceedings{sebbouh2021convergence, title = {On the (asymptotic) convergence of Stochastic Gradient Descent and Stochastic Heavy Ball}, author = {Othmane Sebbouh and Robert M. Gower and Aaron Defazio}, booktitle = {Conference on Learning Theory, COLT 2021}, year = {2021}, publisher = {PMLR}, series = {Proceedings of Machine Learning Research}, }

MRI Banding Removal via Adversarial Training

Aaron Defazio, Tullie Murrell, Michael P. Recht
NeurIPS 2020

[arXiv]
MRI images reconstructed from sub-sampled Cartesian data using deep learning techniques often show a characteristic banding (sometimes described as streaking), which is particularly strong in low signal-to-noise regions of the reconstructed image. In this work, we propose the use of an adversarial loss that penalizes banding structures without requiring any human annotation. Our technique greatly reduces the appearance of banding, without requiring any additional computation or post-processing at reconstruction time. We report the results of a blind comparison against a strong baseline by a group of expert evaluators (board-certified radiologists), where our approach is ranked superior at banding removal with no statistically significant loss of detail.

@misc{defazio2020mri, title={MRI Banding Removal via Adversarial Training}, author={Aaron Defazio and Tullie Murrell and Michael P. Recht}, year={2020}, eprint={2001.08699}, archivePrefix={arXiv}, primaryClass={eess.IV} }

End-to-End Variational Networks for Accelerated MRI Reconstruction

Anuroop Sriram, Jure Zbontar, Tullie Murrell, Aaron Defazio, C. Lawrence Zitnick, Nafissa Yakubova, Florian Knoll, Patricia Johnson
MICCAI 2020

[arXiv]
The slow acquisition speed of magnetic resonance imaging (MRI) has led to the development of two complementary methods: acquiring multiple views of the anatomy simultaneously (parallel imaging) and acquiring fewer samples than necessary for traditional signal processing methods (compressed sensing). While the combination of these methods has the potential to allow much faster scan times, reconstruction from such undersampled multi-coil data has remained an open problem. In this paper, we present a new approach to this problem that extends previously proposed variational methods by learning fully end-to-end. Our method obtains new state-of-the-art results on the fastMRI dataset for both brain and knee MRIs.

@misc{sriram2020endtoend, title={End-to-End Variational Networks for Accelerated MRI Reconstruction}, author={Anuroop Sriram and Jure Zbontar and Tullie Murrell and Aaron Defazio and C. Lawrence Zitnick and Nafissa Yakubova and Florian Knoll and Patricia Johnson}, year={2020}, eprint={2004.06688}, archivePrefix={arXiv}, primaryClass={eess.IV} }

Using Deep Learning to Accelerate Knee MRI at 3T: Results of an Interchangeability Study

Recht, Michael P. and Zbontar, Jure and Sodickson, Daniel K. and Knoll, Florian and Yakubova, Nafissa and Sriram, Anuroop and Murrell, Tullie and Defazio, Aaron and Rabbat, Michael and Rybak, Leon and Kline, Mitchell and Ciavarra, Gina and Alaia, Erin F. and Samim, Mohammad and Walter, William R. and Lin, Dana and Lui, Yvonne W. and Muckley, Matthew and Huang, Zhengnan and Johnson, Patricia and Stern, Ruben and Zitnick, C. Lawrence
American Journal of Roentgenology

[publication]

Objective

Deep Learning (DL) image reconstruction has the potential to disrupt the current state of MR imaging by significantly decreasing the time required for MR exams. Our goal was to use DL to accelerate MR imaging in order to allow a 5-minute comprehensive examination of the knee, without compromising image quality or diagnostic accuracy.

Methods

A DL model for image reconstruction using a variational network was optimized. The model was trained using dedicated multi-sequence training, in which a single reconstruction model was trained with data from multiple sequences with different contrast and orientations. Following training, data from 108 patients were retrospectively undersampled in a manner that would correspond with a net 3.49-fold acceleration of fully-sampled data acquisition and 1.88-fold acceleration compared to our standard two-fold accelerated parallel acquisition. An interchangeability study was performed, in which the ability of 6 readers to detect internal derangement of the knee was compared for the clinical and DL-accelerated images.

Results

The study demonstrated a high degree of interchangeability between standard and DL-accelerated images. In particular, results showed that interchanging the sequences would result in discordant clinical opinions no more than 4% of the time for any feature evaluated. Moreover, the accelerated sequence was judged by all six readers to have better quality than the clinical sequence.

Conclusions

An optimized DL model allowed for acceleration of knee images which performed interchangeably with standard images for the detection of internal derangement of the knee. Importantly, readers preferred the quality of accelerated images to that of standard clinical images.

@article{fastmri2020, Author = {Recht, Michael P. and Zbontar, Jure and Sodickson, Daniel K. and Knoll, Florian and Yakubova, Nafissa and Sriram, Anuroop and Murrell, Tullie and Defazio, Aaron and Rabbat, Michael and Rybak, Leon and Kline, Mitchell and Ciavarra, Gina and Alaia, Erin F. and Samim, Mohammad and Walter, William R. and Lin, Dana and Lui, Yvonne W. and Muckley, Matthew and Huang, Zhengnan and Johnson, Patricia and Stern, Ruben and Zitnick, C. Lawrence}, Journal = {American Journal of Roentgenology}, Month = {2020/07/09}, Title = {Using Deep Learning to Accelerate Knee MRI at 3T: Results of an Interchangeability Study}, Year = {2020}}

GrappaNet: Combining Parallel Imaging With Deep Learning for Multi-Coil MRI Reconstruction

Anuroop Sriram, Jure Zbontar, Tullie Murrell, C. Lawrence Zitnick, Aaron Defazio, Daniel K. Sodickson;

[arXiv] [publication]
Magnetic Resonance Image (MRI) acquisition is an inherently slow process which has spurred the development of two different acceleration methods: acquiring multiple correlated samples simultaneously (parallel imaging) and acquiring fewer samples than necessary for traditional signal processing methods (compressed sensing). Both methods provide complementary approaches to accelerating MRI acquisition. In this paper, we present a novel method to integrate traditional parallel imaging methods into deep neural networks that is able to generate high quality reconstructions even for high acceleration factors. The proposed method, called GrappaNet, performs progressive reconstruction by first mapping the reconstruction problem to a simpler one that can be solved by a traditional parallel imaging methods using a neural network, followed by an application of a parallel imaging method, and finally fine-tuning the output with another neural network. The entire network can be trained end-to-end. We present experimental results on the recently released fastMRI dataset and show that GrappaNet can generate higher quality reconstructions than competing methods for both 4x and 8x acceleration.

@InProceedings{Sriram_2020_CVPR, author = {Sriram, Anuroop and Zbontar, Jure and Murrell, Tullie and Zitnick, C. Lawrence and Defazio, Aaron and Sodickson, Daniel K.}, title = {GrappaNet: Combining Parallel Imaging With Deep Learning for Multi-Coil MRI Reconstruction}, booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, month = {June}, year = {2020} }

Advancing machine learning for MR image reconstruction with an open competition: Overview of the 2019 fastMRI challenge

Knoll, Florian and Zbontar, Jure and Sriram, Anuroop and Muckley, Matthew J. and Bruno, Mary and Defazio, Aaron and Parente, Marc and Geras, Krzysztof J. and Katsnelson, Joe and Chandarana, Hersh and Zhang, Zizhao and Drozdzalv, Michal and Romero, Adriana and Rabbat, Michael and Vincent, Pascal and Pinkerton, James and Wang, Duo and Yakubova, Nafissa and Owens, Erich and Zitnick, C. Lawrence and Recht, Michael P. and Sodickson, Daniel K. and Lui, Yvonne W.
Magnetic Resonance in Medicine

[publication] [arXiv] [Code]

Purpose

To advance research in the field of machine learning for MR image reconstruction with an open challenge.

Methods

We provided participants with a dataset of raw k‐space data from 1,594 consecutive clinical exams of the knee. The goal of the challenge was to reconstruct images from these data. In order to strike a balance between realistic data and a shallow learning curve for those not already familiar with MR image reconstruction, we ran multiple tracks for multi‐coil and single‐coil data. We performed a two‐stage evaluation based on quantitative image metrics followed by evaluation by a panel of radiologists. The challenge ran from June to December of 2019.

Results

We received a total of 33 challenge submissions. All participants chose to submit results from supervised machine learning approaches.

Conclusions

The challenge led to new developments in machine learning for image reconstruction, provided insight into the current state of the art in the field, and highlighted remaining hurdles for clinical adoption.

@article{doi:10.1002/mrm.28338, Author = {Knoll, Florian and Murrell, Tullie and Sriram, Anuroop and Yakubova, Nafissa and Zbontar, Jure and Rabbat, Michael and Defazio, Aaron and Muckley, Matthew J. and Sodickson, Daniel K. and Zitnick, C. Lawrence and Recht, Michael P.}, Journal = {Magnetic Resonance in Medicine}, Title = {Advancing machine learning for MR image reconstruction with an open competition: Overview of the 2019 fastMRI challenge},

fastMRI: A Publicly Available Raw k-Space and DICOM Dataset of Knee Images for Accelerated MR Image Reconstruction Using Machine Learning

Knoll, Florian and Zbontar, Jure and Sriram, Anuroop and Muckley, Matthew J. and Bruno, Mary and Defazio, Aaron and Parente, Marc and Geras, Krzysztof J. and Katsnelson, Joe and Chandarana, Hersh and Zhang, Zizhao and Drozdzalv, Michal and Romero, Adriana and Rabbat, Michael and Vincent, Pascal and Pinkerton, James and Wang, Duo and Yakubova, Nafissa and Owens, Erich and Zitnick, C. Lawrence and Recht, Michael P. and Sodickson, Daniel K. and Lui, Yvonne W.
Radiology: Artificial Intelligence

[publication] [Code]
A publicly available dataset containing k-space data as well as Digital Imaging and Communications in Medicine image data of knee images for accelerated MR image reconstruction using machine learning is presented.

@article{doi:10.1148/ryai.2020190007, Author = {Knoll, Florian and Zbontar, Jure and Sriram, Anuroop and Muckley, Matthew J. and Bruno, Mary and Defazio, Aaron and Parente, Marc and Geras, Krzysztof J. and Katsnelson, Joe and Chandarana, Hersh and Zhang, Zizhao and Drozdzalv, Michal and Romero, Adriana and Rabbat, Michael and Vincent, Pascal and Pinkerton, James and Wang, Duo and Yakubova, Nafissa and Owens, Erich and Zitnick, C. Lawrence and Recht, Michael P. and Sodickson, Daniel K. and Lui, Yvonne W.}, Journal = {Radiology: Artificial Intelligence}, Title = {fastMRI: A Publicly Available Raw k-Space and DICOM Dataset of Knee Images for Accelerated MR Image Reconstruction Using Machine Learning}, Year = {2020}}

On the Curved Geometry of Accelerated Optimization

Aaron Defazio.
NeurIPS 2019

[arXiv]
In this work we propose a differential geometric motivation for Nesterov's accelerated gradient method (AGM) for strongly-convex problems. By considering the optimization procedure as occurring on a Riemannian manifold with a natural structure, The AGM method can be seen as the proximal point method applied in this curved space. This viewpoint can also be extended to the continuous time case, where the accelerated gradient method arises from the natural block-implicit Euler discretization of an ODE on the manifold. We provide an analysis of the convergence rate of this ODE for quadratic objectives.

@ARTICLE{adefazio-curvedgeom2019, author = {Aaron Defazio}, title = {On the Curved Geometry of Accelerated Optimization}, journal = {Advances in Neural Information Processing Systems 33 (NIPS 2019)}, year = {2019} }

On the Ineffectiveness of Variance Reduced Optimization for Deep Learning

Aaron Defazio, Léon Bottou
NeurIPS 2019

[arXiv] [Code]
The application of stochastic variance reduction to optimization has shown remarkable recent theoretical and practical success. The applicability of these techniques to the hard non-convex optimization problems encountered during training of modern deep neural networks is an open problem. We show that naive application of the SVRG technique and related approaches fail, and explore why.

@ARTICLE{adefazio-varred2019, author = {Aaron Defazio and L{\'{e}}on Bottou}, title = {On the Ineffectiveness of Variance Reduced Optimization for Deep Learning}, journal = {Advances in Neural Information Processing Systems 33 (NIPS 2019)}, year = {2019} }

fastMRI: An Open Dataset and Benchmarks for Accelerated MRI

Jure Zbontar and Florian Knoll and Anuroop Sriram and Matthew J. Muckley and Mary Bruno and Aaron Defazio and Marc Parente and Krzysztof J. Geras and Joe Katsnelson and Hersh Chandarana and Zizhao Zhang and Michal Drozdzal and Adriana Romero and Michael Rabbat and Pascal Vincent and James Pinkerton and Duo Wang and Nafissa Yakubova and Erich Owens and C. Lawrence Zitnick and Michael P. Recht and Daniel K. Sodickson and Yvonne W. Lui

[arXiv] [Code]
Accelerating Magnetic Resonance Imaging (MRI) by taking fewer measurements has the potential to reduce medical costs, minimize stress to patients and make MRI possible in applications where it is currently prohibitively slow or expensive. We introduce the fastMRI dataset, a large-scale collection of both raw MR measurements and clinical MR images, that can be used for training and evaluation of machine-learning approaches to MR image reconstruction. By introducing standardized evaluation criteria and a freely-accessible dataset, our goal is to help the community make rapid advances in the state of the art for MR image reconstruction. We also provide a self-contained introduction to MRI for machine learning researchers with no medical imaging background.

@inproceedings{fastMRI2018, title={{fastMRI}: An Open Dataset and Benchmarks for Accelerated {MRI}}, author={Jure Zbontar and Florian Knoll and Anuroop Sriram and Matthew J. Muckley and Mary Bruno and Aaron Defazio and Marc Parente and Krzysztof J. Geras and Joe Katsnelson and Hersh Chandarana and Zizhao Zhang and Michal Drozdzal and Adriana Romero and Michael Rabbat and Pascal Vincent and James Pinkerton and Duo Wang and Nafissa Yakubova and Erich Owens and C. Lawrence Zitnick and Michael P. Recht and Daniel K. Sodickson and Yvonne W. Lui}, journal = {ArXiv e-prints}, archivePrefix = "arXiv", eprint = {1811.08839}, year={2018} }

Controlling Covariate Shift using Equilibrium Normalization of Weights

Aaron Defazio, Léon Bottou

[arXiv]
We introduce a new normalization technique that exhibits the fast convergence properties of batch normalization using a transformation of layer weights instead of layer outputs. The proposed technique keeps the contribution of positive and negative weights to the layer output in equilibrium. We validate our method on a set of standard benchmarks including CIFAR-10/100, SVHN and ILSVRC 2012 ImageNet.

@ARTICLE{adefazio-equinorm2018, author = {Aaron Defazio and L{\'{e}}on Bottou}, title = {Controlling Covariate Shift using Equilibrium Normalization of Weights}, journal = {ArXiv e-prints}, archivePrefix = "arXiv", year = {2018} }

A Simple Practical Accelerated Method for Finite Sums

Aaron Defazio
NIPS 2016

[PDF] [Code]
We describe a novel optimization method for finite sums (such as empirical risk minimization problems) building on the recently introduced SAGA method. Our method achieves an accelerated convergence rate on strongly convex smooth prob- lems. Our method has only one parameter (a step size), and is radically simpler than other accelerated methods for finite sums. Additionally it can be applied when the terms are non-smooth, yielding a method applicable in many areas where operator splitting methods would traditionally be applied.

@ARTICLE{adefazio-nips2016, author = {Aaron Defazio}, title = {A Simple Practical Accelerated Method for Finite Sums}, journal = {Advances in Neural Information Processing Systems 29 (NIPS 2016)}, year = {2016} }

Non-Uniform Stochastic Average Gradient Method for Training Conditional Random Fields

M. Schmidt, R. Babanezhad, M.O. Ahmed, A. Defazio, A. Clifton & A. Sarkar
AISTATS 2015

[PDF]
We apply stochastic average gradient (SAG) algorithms for training conditional random fields (CRFs). We describe a practical implementation that uses structure in the CRF gradient to reduce the memory requirement of this linearly-convergent stochastic gradient method, propose a non-uniform sampling scheme that substantially improves practical performance, and analyze the rate of convergence of the SAGA variant under non-uniform sampling. Our experimental results reveal that our method often significantly outperforms existing methods in terms of the training objective, and performs as well or better than optimally-tuned stochastic gradient methods in terms of test error.

@ARTICLE{mschmidt-aistats2015, author = {Mark Schmidt and Reza Babanezhad and Mohamed Osama Ahmed and Aaron Defazio and Ann Clifton and Anoop Sarkar }, title = {Non-Uniform Stochastic Average Gradient Method for Training Conditional Random Fields}, journal = {18th International Conference on Artificial Intelligence and Statistics (AISTATS 2015)}, year = {2015} }

SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives

A. Defazio, F. Bach & S. Lacoste-Julien.
NIPS 2014

[PDF]
In this work we introduce a new optimisation method called SAGA in the spirit of SAG, SDCA, MISO and SVRG, a set of recently proposed incremental gradient algorithms with fast linear convergence rates. SAGA improves on the theory behind SAG and SVRG, with better theoretical convergence rates, and has support for composite objectives where a proximal operator is used on the regulariser. Unlike SDCA, SAGA supports non-strongly convex problems directly, and is adaptive to any inherent strong convexity of the problem. We give experimental results showing the effectiveness of our method.

@ARTICLE{adefazio-nips2014, author = {Aaron Defazio and Francis Bach and Simon Lacoste-Julien}, title = {SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives}, journal = {Advances in Neural Information Processing Systems 27 (NIPS 2014)}, year = {2014} }

Finito: A Faster, Permutable Incremental Gradient Method for Big Data Problems

A. Defazio, T. Caetano & J. Domke.
ICML 2014

[PDF] [Appendix]
Recent advances in optimization theory have shown that smooth strongly convex finite sums can be minimized faster than by treating them as a black box "batch" problem. In this work we introduce a new method in this class with a theoretical convergence rate four times faster than existing methods, for sums with sufficiently many terms. This method is also amendable to a sampling without replacement scheme that in practice gives further speed-ups. We give empirical results showing state of the art performance.

@ARTICLE{adefazio-icml2014, author = {Aaron Defazio and Tiberio Caetano and Justin Domke}, title = {Finito: A Faster, Permutable Incremental Gradient Method for Big Data Problems}, journal = {The 31st International Conference on Machine Learning (ICML 2014)}, year = {2014} }

A Convex Formulation for Learning Scale-Free Networks via Submodular Relaxation

A. Defazio & T. Caetano.
NIPS 2012

[PDF] [Appendix] [Code]
A key problem in statistics and machine learning is the determination of network structure from data. We consider the case where the structure of the graph to be reconstructed is known to be scale-free. We show that in such cases it is natural to formulate structured sparsity inducing priors using submodular functions, and we use their Lovasz extension to obtain a convex relaxation. For tractable classes such as Gaussian graphical models, this leads to a convex optimization problem that can be efficiently solved. We show that our method results in an improvement in the accuracy of reconstructed networks for synthetic data. We also show how our prior encourages scale-free reconstructions on a bioinfomatics dataset.

@ARTICLE{adefazio-nips2012, author = {Aaron Defazio and Tiberio Caetano}, title = {A Convex Formulation for Learning Scale-Free Networks via Submodular Relaxation}, journal = {Advances in Neural Information Processing Systems 25 (NIPS 2012)}, year = {2012} }

A Graphical Model Formulation of Collaborative Filtering Neighbourhood Methods
with Fast Maximum Entropy Training

A. Defazio & T. Caetano
ICML 2012

[PDF]
Item neighbourhood methods for collaborative filtering learn a weighted graph over the set of items, where each item is connected to those it is most similar to. The prediction of a user's rating on an item is then given by that rating of neighbouring items, weighted by their similarity. This paper presents a new neighbourhood approach which we call item fields, whereby an undirected graphical model is formed over the item graph. The resulting prediction rule is a simple generalization of the classical approaches, which takes into account non-local information in the graph, allowing its best results to be obtained when using drastically fewer edges than other neighbourhood approaches. A fast approximate maximum entropy training method based on the Bethe approximation is presented which utilizes a novel decomposition into tractable sub-problems. When using precomputed sufficient statistics on the Movielens dataset, our method outperforms maximum likelihood approaches by two orders of magnitude.

@ARTICLE{adefazio-icml2012, author = {Aaron Defazio and Tiberio Caetano}, title = {A Graphical Model Formulation of Collaborative Filtering Neighbourhood Methods with Fast Maximum Entropy Training}, journal = {The 29th International Conference on Machine Learning (ICML 2012)}, year = {2012} }

Offset Sampling Improves Deep Learning based Accelerated MRI Reconstructions by Exploiting Symmetry

Aaron Defazio

[arXiv]
Deep learning approaches to accelerated MRI take a matrix of sampled Fourier-space lines as input and produce a spatial image as output. In this work we show that by careful choice of the offset used in the sampling procedure, the symmetries in k-space can be better exploited, producing higher quality reconstructions than given by standard equally-spaced samples or randomized samples motivated by compressed sensing.

@misc{defazio2019offset, title={Offset Sampling Improves Deep Learning based Accelerated MRI Reconstructions by Exploiting Symmetry}, author={Aaron Defazio}, year={2019}, eprint={1912.01101}, archivePrefix={arXiv}, primaryClass={eess.IV} }

A Comparison of Learning Algorithms on the Arcade Learning Environment

A. Defazio & T. Graepel

[PDF]
Reinforcement learning agents have traditionally been evaluated on small toy problems. With advances in computing power and the advent of the Arcade Learning Environment, it is now possible to evaluate algorithms on diverse and difficult problems within a consistent framework. We discuss some challenges posed by the arcade learning environment which do not manifest in simpler environments. We then provide a comparison of model-free, linear learning algorithms on this challenging problem set.

@TECHREPORT{adefazio-rl2014, author = {Aaron Defazio and Thore Graepel}, title = {A Comparison of Learning Algorithms on the Arcade Learning Environment}, institution = {Australian National University}, year = {2014} }

Linear programming in low dimensions is easy

[PDF]

A complete guide to the Bayes factor test

[PDF] [BLOG]

How to do A/B testing with early stopping correctly

[PDF] [BLOG]

Weighted random sampling with replacement with dynamic weights

[PDF] [BLOG]

New Optimisation Methods for Machine Learning

[PDF] Supervised by Tiberio Caetano
In this work we introduce several new optimisation methods for problems in machine learning. Our algorithms broadly fall into two categories: optimisation of finite sums and of graph structured objectives. The finite sum problem is simply the minimisation of objective functions that are naturally expressed as a summation over a large number of terms, where each term has a similar or identical weight. Such objectives most often appear in machine learning in the empirical risk minimisation framework in the non-online learning setting. The second category, that of graph structured objectives, consists of objectives that result from applying maximum likelihood to Markov random field models. Unlike the finite sum case, all the non-linearity is contained within a partition function term, which does not readily decompose into a summation.
For the finite sum problem, we introduce the Finito and SAGA algorithms, as well as variants of each. The Finito algorithm is best suited to strongly convex problems where the number of terms is of the same order as the condition number of the problem. We prove the fast convergence rate of Finito for strongly convex problems and demonstrate its state-of-the-art empirical performance on 5 datasets.
The SAGA algorithm we introduce is complementary to the Finito algorithm. It is more generally applicable, as it can be applied to problems without strong convexity, and to problems that have a non-differentiable regularisation term. In both cases we establish strong convergence rate proofs. It is also better suited to sparser problems than Finito. The SAGA method has a broader and simpler theory than any existing fast method for the problem class of finite sums, in particular it is the first such method that can provably be applied to non-strongly convex problems with non-differentiable regularisers without introduction of additional regularisation.
For graph-structured problems, we take three complementary approaches. We look at learning the parameters for a fixed structure, learning the structure independently, and learning both simultaneously. Specifically, for the combined approach, we introduce a new method for encouraging graph structures with the “scale-free” property. For the structure learning problem, we establish SHORTCUT, a O(n^2.5) expected time approximate structure learning method for Gaussian graphical models. For problems where the structure is known but the parameters unknown, we introduce an approximate maximum likelihood learning algorithm that is capable of learning a useful subclass of Gaussian graphical models.
Our thesis as a whole introduces a new suit of techniques for machine learning practitioners that increases the size and type of problems that can be efficiently solved. Our work is backed by extensive theory, including proofs of convergence for each method discussed.

@PHDTHESIS{adefazio-thesis2014, author = {Aaron Defazio}, title = {New Optimisation Methods for Machine Learning}, school = {Australian National University}, year = {2014}, note = {http://www.aarondefazio.com/pubs.html} }

Network Topology Tomography

[PDF] Supervised by Tiberio Caetano

@MASTERSTHESIS{adefazio-honours2010, author = {Aaron Defazio}, title = {Network Topology Tomography}, school = {Australian National University (ANU)}, year = {2010}, type = {Undergraduate Honors Thesis} }

Publications / Preprints

Objective

Methods

Results

Conclusions

Purpose

Methods

Results

Conclusions

Tech Reports

Articles

PhD Thesis

Honours Research