Publications & Tech Reports

### Publications / Preprints

When, Why and How Much? Adaptive Learning Rate Scheduling by Refinement

[arXiv]
[code]
Learning rate schedules used in practice bear little resemblance to those recommended by theory. We close much of this theory/practice gap, and as a consequence are able to derive new

*problem-adaptive*learning rate schedules. Our key technical contribution is a refined analysis of learning rate schedules for a wide class of optimization algorithms (including SGD). In contrast to most prior works that study the convergence of the average iterate, we study the last iterate, which is what most people use in practice. When considering only worst-case analysis, our theory predicts that the best choice is the

*linear decay*schedule: a popular choice in practice that sets the stepsize proportionally to 1 - t/T, where t is the current iteration and T is the total number of steps. To go beyond this worst-case analysis, we use the observed gradient norms to derive schedules

*refined*for any particular task. These refined schedules exhibit learning rate warm-up and rapid learning rate annealing near the end of training. Ours is the first systematic approach to

*automatically*yield both of these properties. We perform the most comprehensive evaluation of learning rate schedules to date, evaluating across 10 diverse deep learning problems, a series of LLMs, and a suite of logistic regression problems. We validate that overall, the linear-decay schedule matches or outperforms all commonly used default schedules including cosine annealing, and that our schedule refinement method gives further improvements.

@misc{defazio2023when,
title={When, Why and How Much? Adaptive Learning Rate Scheduling by Refinement},
author={Aaron Defazio and Ashok Cutkosky and Harsh Mehta and Konstantin Mishchenko},
year={2023},
eprint={2310.07831},
archivePrefix={arXiv},
primaryClass={cs.LG}
}

Prodigy: An Expeditiously Adaptive Parameter-Free Learner

[arXiv]
[publication]
[code]
We consider the problem of estimating the learning rate in adaptive methods, such as Adagrad and Adam. We describe two techniques, Prodigy and Resetting, to provably estimate the distance to the solution D, which is needed to set the learning rate optimally. Our techniques are modifications of the D-Adaptation method for learning-rate-free learning. Our methods improve upon the convergence rate of D-Adaptation by a factor of O(sqrt(log(D/d0))), where d0 is the initial estimate of D. We test our methods on 12 common logistic-regression benchmark datasets, VGG11 and ResNet-50 training on CIFAR10, ViT training on Imagenet, LSTM training on IWSLT14, DLRM training on Criteo dataset, VarNet on Knee MRI dataset, as well as RoBERTa and GPT transformer training on BookWiki. Our experimental results show that our approaches consistently outperform D-Adaptation and reach test accuracy values close to that of hand-tuned Adam.

@misc{mishchenko2023prodigy,
title={Prodigy: An Expeditiously Adaptive Parameter-Free Learner},
author={Konstantin Mishchenko and Aaron Defazio},
year={2023},
eprint={2306.06101},
archivePrefix={arXiv},
primaryClass={cs.LG}
}

MoMo: Momentum Models for Adaptive Learning Rates

[arXiv]
Training a modern machine learning architecture on a new task requires extensive learning-rate tuning, which comes at a high computational cost. Here we develop new adaptive learning rates that can be used with any momentum method, and require less tuning to perform well. We first develop MoMo, a Momentum Model based adaptive learning rate for SGD-M (Stochastic gradient descent with momentum). MoMo uses momentum estimates of the batch losses and gradients sampled at each iteration to build a model of the loss function. Our model also makes use of any known lower bound of the loss function by using truncation, e.g. most losses are lower-bounded by zero. We then approximately minimize this model at each iteration to compute the next step. We show how MoMo can be used in combination with any momentum-based method, and showcase this by developing MoMo-Adam - which is Adam with our new model-based adaptive learning rate. Additionally, for losses with unknown lower bounds, we develop on-the-fly estimates of a lower bound, that are incorporated in our model. Through extensive numerical experiments, we demonstrate that MoMo and MoMo-Adam improve over SGD-M and Adam in terms of accuracy and robustness to hyperparameter tuning for training image classifiers on MNIST, CIFAR10, CIFAR100, Imagenet, recommender systems on the Criteo dataset, and a transformer model on the translation task IWSLT14.

@misc{schaipp2023momo,
title={MoMo: Momentum Models for Adaptive Learning Rates},
author={Fabian Schaipp and Ruben Ohana and Michael Eickenberg and Aaron Defazio and Robert M. Gower},
year={2023},
eprint={2305.07583},
archivePrefix={arXiv},
primaryClass={cs.LG}
}

Mechanic: A Learning Rate Tuner

[arXiv]
[publication]
[code]
We introduce a technique for tuning the learning rate scale factor of any base optimization algorithm and schedule automatically, which we call mechanic. Our method provides a practical realization of recent theoretical reductions for accomplishing a similar goal in online convex optimization. We rigorously evaluate mechanic on a range of large scale deep learning tasks with varying batch sizes, schedules, and base optimization algorithms. These experiments demonstrate that depending on the problem, mechanic either comes very close to, matches or even improves upon manual tuning of learning rates.

@ARTICLE{cutkosky2023mechanic,
title={Mechanic: A Learning Rate Tuner},
author={Ashok Cutkosky and Aaron Defazio and Harsh Mehta},
year={2023},
journal={Advances in Neural Information Processing Systems 37 (NeurIPS 2023)},
}

Learning-Rate-Free Learning by D-Adaptation

[arXiv]
[publication]
[code]
The speed of gradient descent for convex Lipschitz functions is highly dependent on the choice of learning rate. Setting the learning rate to achieve the optimal convergence rate requires knowing the distance D from the initial point to the solution set. In this work, we describe a single-loop method, with no back-tracking or line searches, which does not require knowledge of D yet asymptotically achieves the optimal rate of convergence for the complexity class of convex Lipschitz functions. Our approach is the first parameter-free method for this class without additional multiplicative log factors in the convergence rate. We present extensive experiments for SGD and Adam variants of our method, where the method automatically matches hand-tuned learning rates across more than a dozen diverse machine learning problems, including large-scale vision and language problems. Our method is practical, efficient and requires no additional function value or gradient evaluations each step. An open-source implementation is available.

@ARTICLE{defazio2023dadapt,
author = {Aaron Defazio and Konstantin Mishchenko},
title = {Learning-Rate-Free Learning by D-Adaptation},
journal = {The 40th International Conference on Machine Learning (ICML 2023)},
year = {2023}
}

Adaptivity without Compromise: A Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic Optimization

[PDF]
[publication]
[code]
We introduce MADGRAD, a novel optimization method in the family of AdaGrad adaptive gradient methods. MADGRAD shows excellent performance on deep learning optimization problems from multiple fields, including classification and image-to-image tasks in vision, and recurrent and bidirectionally-masked models in natural language processing. For each of these tasks, MADGRAD matches or outperforms both SGD and ADAM in test set performance, even on problems for which adaptive methods normally perform poorly.

@article{defazio2022adaptivity,
author = {Aaron Defazio and Samy Jelassi},
title = {Adaptivity without Compromise: A Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic Optimization},
journal = {Journal of Machine Learning Research},
year = {2022},
volume = {23},
pages = {1-34},
}

Grad-GradaGrad? A Non-Monotone Adaptive Stochastic Gradient Method

[PDF]
The classical AdaGrad method adapts the learning rate by dividing by the square root of a sum of squared gradients. Because this sum on the denominator is increasing, the method can only decrease step sizes over time, and requires a learning rate scaling hyper-parameter to be carefully tuned. To overcome this restriction, we introduce GradaGrad, a method in the same family that naturally grows or shrinks the learning rate based on a different accumulation in the denominator, one that can both increase and decrease. We show that it obeys a similar convergence rate as AdaGrad and demonstrate its non-monotone adaptation capability with experiments.

@article{defazio2022scaling,
@misc{https://doi.org/10.48550/arxiv.2206.06900,
doi = {10.48550/ARXIV.2206.06900},
url = {https://arxiv.org/abs/2206.06900},
author = {Defazio, Aaron and Zhou, Baoyu and Xiao, Lin},
keywords = {Machine Learning (cs.LG), Optimization and Control (math.OC), Machine Learning (stat.ML), FOS: Computer and information sciences, FOS: Computer and information sciences, FOS: Mathematics, FOS: Mathematics},
title = {Grad-GradaGrad? A Non-Monotone Adaptive Stochastic Gradient Method},
publisher = {arXiv},
year = {2022}
}

A scaling calculus for the design and initialization of ReLU networks

[PDF]
[publication]
In this work, we describe a set of rules for the design and initialization of well-conditioned neural networks, guided by the goal of naturally balancing the diagonal blocks of the Hessian at the start of training. Our design principle balances multiple sensible measures of the conditioning of neural networks. We prove that for a ReLU-based deep multilayer perceptron, a simple initialization scheme using the geometric mean of the fan-in and fan-out satisfies our scaling rule. For more sophisticated architectures, we show how our scaling principle can be used to guide design choices to produce well-conditioned neural networks, reducing guess-work.

@article{defazio2022scaling,
abstract = {We propose a system for calculating a ``scaling constant''for layers and weights of neural networks. We relate this scaling constant to two important quantities that relate to the optimizability of neural networks, and argue that a network that is ``preconditioned''via scaling, in the sense that all weights have the same scaling constant, will be easier to train. This scaling calculus results in a number of consequences, among them the fact that the geometric mean of the fan-in and fan-out, rather than the fan-in, fan-out, or arithmetic mean, should be used for the initialization of the variance of weights in a neural network. Our system allows for the off-line design \& engineering of ReLU (Rectified Linear Unit) neural networks, potentially replacing blind experimentation. We verify the effectiveness of our approach on a set of benchmark problems.},
author = {Defazio, Aaron and Bottou, L{\'e}on},
doi = {10.1007/s00521-022-07308-z},
isbn = {1433-3058},
journal = {Neural Computing and Applications},
title = {A scaling calculus for the design and initialization of ReLU networks},
url = {https://doi.org/10.1007/s00521-022-07308-z},
year = {2022},
}

Momentum via Primal Averaging: Theoretical Insights and Learning Rate Schedules for Non-Convex Optimization

[arXiv]
Momentum methods are now used pervasively within the machine learning community for training non-convex models such as deep neural networks. Empirically, they out perform traditional stochastic gradient descent (SGD) approaches. In this work we develop a Lyapunov analysis of SGD with momentum (SGD+M), by utilizing a equivalent rewriting of the method known as the stochastic primal averaging (SPA) form. This analysis is much tighter than previous theory in the non-convex case, and due to this we are able to give precise insights into when SGD+M may out-perform SGD, and what hyper-parameter schedules will work and why.

@misc{defazio2020mom,
title={Momentum via Primal Averaging: Theoretical Insights and Learning Rate Schedules for Non-Convex Optimization},
author={Aaron Defazio},
year={2020},
eprint={2010.00406},
archivePrefix={arXiv},
primaryClass={cs.LG}
}

Stochastic Polyak Stepsize with a Moving Target

[arXiv]
We propose a new stochastic gradient method called MOTAPS (Moving Targetted Polyak Stepsize) that uses recorded past loss values to compute adaptive stepsizes. MOTAPS can be seen as a variant of the Stochastic Polyak (SP) which is also a method that also uses loss values to adjust the stepsize. The downside to the SP method is that it only converges when the interpolation condition holds. MOTAPS is an extension of SP that does not rely on the interpolation condition. The MOTAPS method uses n auxiliary variables, one for each data point, that track the loss value for each data point. We provide a global convergence theory for SP, an intermediary method TAPS, and MOTAPS by showing that they all can be interpreted as a special variant of online SGD. We also perform several numerical experiments on convex learning problems, and deep learning models for image classification and language translation. In all of our tasks we show that MOTAPS is competitive with the relevant baseline method.

@misc{gower2021stochastic,
title={Stochastic Polyak Stepsize with a Moving Target},
author={Robert M. Gower and Aaron Defazio and Michael Rabbat},
year={2021},
eprint={2106.11851},
archivePrefix={arXiv},
primaryClass={cs.LG}
}

Compressed Sensing with a Jackknife, a Bootstrap, and Visualization

[PDF]
[Publication]
Compressed sensing proposes to reconstruct more degrees of freedom in a signal than the number of values actually measured (based on a potentially un-justified regularizer or prior distribution). Compressed sensing therefore risks introducing errors — inserting spurious artifacts or masking the abnormalities that medical imaging seeks to discover. Estimating errors using the standard statistical tools of a jackknife and a bootstrap yields error “bars” in the form of full images that are remarkably qualitatively representative of the actual errors (at least when evaluated and validated on data sets for which the ground truth and hence the actual error is available). These images show the structure of possible errors — without recourse to measuring the entire ground truth directly — and build confidence in regions of the images where the estimated errors are small. Further visualizations and summary statistics can aid in the interpretation of such error estimates. Visualizations include suitable colorizations of the reconstruction, as well as the obvious “correction” of the reconstruction by subtracting off the error estimates. The canonical summary statistic would be the root-mean-square of the error estimates. Unfortunately, colorizations appear likely to be too distracting for actual clinical practice in medical imaging, and the root-mean-square gets swamped by background noise in the error estimates. Fortunately, straightforward displays of the error estimates and of the “corrected” reconstruction are illuminating, and the root-mean-square improves greatly af- ter mild blurring of the error estimates; the blurring is barely perceptible to the human eye yet smooths away background noise that would otherwise overwhelm the root-mean-square.

@misc{defazio2022bootstrap,
title={Compressed Sensing with a Jackknife, a Bootstrap, and Visualization},
author={Aaron Defazio and Mark Tygert and Rachel Ward and Jure Zbontar},
Journal = {Journal of Data Science, Statistics, and Visualisation},
Year = {2022}}
}

The Power of Factorial Powers: New Parameter settings for (Stochastic) Optimization

[arXiv]
[publication]
The convergence rates for convex and non-convex optimization methods depend on the choice of a host of constants, including step sizes, Lyapunov function constants and momentum constants. In this work we propose the use of factorial powers as a flexible tool for defining constants that appear in convergence proofs. We list a number of remarkable properties that these sequences enjoy, and show how they can be applied to convergence proofs to simplify or improve the convergence rates of the momentum method, accelerated gradient and the stochastic variance reduced method (SVRG).

@InProceedings{defazio2021factorial,
title = {The Power of Factorial Powers: New Parameter settings for (Stochastic) Optimization},
author = {Defazio, Aaron and Gower, Robert M.},
booktitle = {Proceedings of The 13th Asian Conference on Machine Learning},
pages = {49--64},
year = {2021},
editor = {Balasubramanian, Vineeth N. and Tsang, Ivor},
volume = {157},
series = {Proceedings of Machine Learning Research},
month = {17--19 Nov},
publisher = {PMLR},
pdf = {https://proceedings.mlr.press/v157/defazio21a/defazio21a.pdf},
url = {https://proceedings.mlr.press/v157/defazio21a.html},
}

On the (asymptotic) convergence of Stochastic Gradient Descent and Stochastic Heavy Ball

[arXiv]
We study stochastic gradient descent (SGD) and the stochastic heavy ball method (SHB, otherwise known as the momentum method) for the general stochastic approximation problem. For SGD, in the convex and smooth setting, we provide the first almost sure asymptotic convergence rates for a weighted average of the iterates . More precisely, we show that the convergence rate of the function values is arbitrarily close to o(1/sqrt(k)), and is exactly o(1/k) in the so-called overparametrized case. We show that these results still hold when using stochastic line search and stochastic Polyak stepsizes, thereby giving the first proof of convergence of these methods in the non-overparametrized regime. Using a substantially different analysis, we show that these rates hold for SHB as well, but at the last iterate. This distinction is important because it is the last iterate of SGD and SHB which is used in practice. We also show that the last iterate of SHB converges to a minimizer almost surely. Additionally, we prove that the function values of the deterministic HB converge at a o(1/k) rate, which is faster than the previously known O(1/k). Finally, in the nonconvex setting, we prove similar rates on the lowest gradient norm along the trajectory of SGD.

@InProceedings{sebbouh2021convergence,
title = {On the (asymptotic) convergence of Stochastic Gradient Descent and Stochastic Heavy Ball},
author = {Othmane Sebbouh and Robert M. Gower and Aaron Defazio},
booktitle = {Conference on Learning Theory, COLT 2021},
year = {2021},
publisher = {PMLR},
series = {Proceedings of Machine Learning Research},
}

MRI Banding Removal via Adversarial Training

[arXiv]
MRI images reconstructed from sub-sampled Cartesian data using deep learning techniques often show a characteristic banding (sometimes described as streaking), which is particularly strong in low signal-to-noise regions of the reconstructed image. In this work, we propose the use of an adversarial loss that penalizes banding structures without requiring any human annotation. Our technique greatly reduces the appearance of banding, without requiring any additional computation or post-processing at reconstruction time. We report the results of a blind comparison against a strong baseline by a group of expert evaluators (board-certified radiologists), where our approach is ranked superior at banding removal with no statistically significant loss of detail.

@misc{defazio2020mri,
title={MRI Banding Removal via Adversarial Training},
author={Aaron Defazio and Tullie Murrell and Michael P. Recht},
year={2020},
eprint={2001.08699},
archivePrefix={arXiv},
primaryClass={eess.IV}
}

End-to-End Variational Networks for Accelerated MRI Reconstruction

[arXiv]
The slow acquisition speed of magnetic resonance imaging (MRI) has led to the development of two complementary methods: acquiring multiple views of the anatomy simultaneously (parallel imaging) and acquiring fewer samples than necessary for traditional signal processing methods (compressed sensing). While the combination of these methods has the potential to allow much faster scan times, reconstruction from such undersampled multi-coil data has remained an open problem. In this paper, we present a new approach to this problem that extends previously proposed variational methods by learning fully end-to-end. Our method obtains new state-of-the-art results on the fastMRI dataset for both brain and knee MRIs.

@misc{sriram2020endtoend,
title={End-to-End Variational Networks for Accelerated MRI Reconstruction},
author={Anuroop Sriram and Jure Zbontar and Tullie Murrell and Aaron Defazio and C. Lawrence Zitnick and Nafissa Yakubova and Florian Knoll and Patricia Johnson},
year={2020},
eprint={2004.06688},
archivePrefix={arXiv},
primaryClass={eess.IV}
}

Using Deep Learning to Accelerate Knee MRI at 3T: Results of an Interchangeability Study

[publication]
##### Objective

Deep Learning (DL) image reconstruction has the potential to disrupt the current state of MR imaging by significantly decreasing the time required for MR exams. Our goal was to use DL to accelerate MR imaging in order to allow a 5-minute comprehensive examination of the knee, without compromising image quality or diagnostic accuracy.##### Methods

A DL model for image reconstruction using a variational network was optimized. The model was trained using dedicated multi-sequence training, in which a single reconstruction model was trained with data from multiple sequences with different contrast and orientations. Following training, data from 108 patients were retrospectively undersampled in a manner that would correspond with a net 3.49-fold acceleration of fully-sampled data acquisition and 1.88-fold acceleration compared to our standard two-fold accelerated parallel acquisition. An interchangeability study was performed, in which the ability of 6 readers to detect internal derangement of the knee was compared for the clinical and DL-accelerated images.##### Results

The study demonstrated a high degree of interchangeability between standard and DL-accelerated images. In particular, results showed that interchanging the sequences would result in discordant clinical opinions no more than 4% of the time for any feature evaluated. Moreover, the accelerated sequence was judged by all six readers to have better quality than the clinical sequence.##### Conclusions

An optimized DL model allowed for acceleration of knee images which performed interchangeably with standard images for the detection of internal derangement of the knee. Importantly, readers preferred the quality of accelerated images to that of standard clinical images.@article{fastmri2020,
Author = {Recht, Michael P. and Zbontar, Jure and Sodickson, Daniel K. and Knoll, Florian and Yakubova, Nafissa and Sriram, Anuroop and Murrell, Tullie and Defazio, Aaron and Rabbat, Michael and Rybak, Leon and Kline, Mitchell and Ciavarra, Gina and Alaia, Erin F. and Samim, Mohammad and Walter, William R. and Lin, Dana and Lui, Yvonne W. and Muckley, Matthew and Huang, Zhengnan and Johnson, Patricia and Stern, Ruben and Zitnick, C. Lawrence},
Journal = {American Journal of Roentgenology},
Month = {2020/07/09},
Title = {Using Deep Learning to Accelerate Knee MRI at 3T: Results of an Interchangeability Study},
Year = {2020}}

GrappaNet: Combining Parallel Imaging With Deep Learning for Multi-Coil MRI Reconstruction

[arXiv]
[publication]
Magnetic Resonance Image (MRI) acquisition is an inherently slow process which has spurred the development of two different acceleration methods: acquiring multiple correlated samples simultaneously (parallel imaging) and acquiring fewer samples than necessary for traditional signal processing methods (compressed sensing). Both methods provide complementary approaches to accelerating MRI acquisition. In this paper, we present a novel method to integrate traditional parallel imaging methods into deep neural networks that is able to generate high quality reconstructions even for high acceleration factors. The proposed method, called GrappaNet, performs progressive reconstruction by first mapping the reconstruction problem to a simpler one that can be solved by a traditional parallel imaging methods using a neural network, followed by an application of a parallel imaging method, and finally fine-tuning the output with another neural network. The entire network can be trained end-to-end. We present experimental results on the recently released fastMRI dataset and show that GrappaNet can generate higher quality reconstructions than competing methods for both 4x and 8x acceleration.

@InProceedings{Sriram_2020_CVPR,
author = {Sriram, Anuroop and Zbontar, Jure and Murrell, Tullie and Zitnick, C. Lawrence and Defazio, Aaron and Sodickson, Daniel K.},
title = {GrappaNet: Combining Parallel Imaging With Deep Learning for Multi-Coil MRI Reconstruction},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2020}
}

Advancing machine learning for MR image reconstruction with an open competition: Overview of the 2019 fastMRI challenge

[publication]
[arXiv]
[Code]
##### Purpose

To advance research in the field of machine learning for MR image reconstruction with an open challenge.##### Methods

We provided participants with a dataset of raw k‐space data from 1,594 consecutive clinical exams of the knee. The goal of the challenge was to reconstruct images from these data. In order to strike a balance between realistic data and a shallow learning curve for those not already familiar with MR image reconstruction, we ran multiple tracks for multi‐coil and single‐coil data. We performed a two‐stage evaluation based on quantitative image metrics followed by evaluation by a panel of radiologists. The challenge ran from June to December of 2019.##### Results

We received a total of 33 challenge submissions. All participants chose to submit results from supervised machine learning approaches.##### Conclusions

The challenge led to new developments in machine learning for image reconstruction, provided insight into the current state of the art in the field, and highlighted remaining hurdles for clinical adoption.@article{doi:10.1002/mrm.28338,
Author = {Knoll, Florian and Murrell, Tullie and Sriram, Anuroop and Yakubova, Nafissa and Zbontar, Jure and Rabbat, Michael and Defazio, Aaron and Muckley, Matthew J. and Sodickson, Daniel K. and Zitnick, C. Lawrence and Recht, Michael P.},
Journal = {Magnetic Resonance in Medicine},
Title = {Advancing machine learning for MR image reconstruction with an open competition: Overview of the 2019 fastMRI challenge},

fastMRI: A Publicly Available Raw k-Space and DICOM Dataset of Knee Images for Accelerated MR Image Reconstruction Using Machine Learning

[publication]
[Code]
A publicly available dataset containing k-space data as well as Digital Imaging and Communications in Medicine image data of knee images for accelerated MR image reconstruction using machine learning is presented.

@article{doi:10.1148/ryai.2020190007,
Author = {Knoll, Florian and Zbontar, Jure and Sriram, Anuroop and Muckley, Matthew J. and Bruno, Mary and Defazio, Aaron and Parente, Marc and Geras, Krzysztof J. and Katsnelson, Joe and Chandarana, Hersh and Zhang, Zizhao and Drozdzalv, Michal and Romero, Adriana and Rabbat, Michael and Vincent, Pascal and Pinkerton, James and Wang, Duo and Yakubova, Nafissa and Owens, Erich and Zitnick, C. Lawrence and Recht, Michael P. and Sodickson, Daniel K. and Lui, Yvonne W.},
Journal = {Radiology: Artificial Intelligence},
Title = {fastMRI: A Publicly Available Raw k-Space and DICOM Dataset of Knee Images for Accelerated MR Image Reconstruction Using Machine Learning},
Year = {2020}}

On the Curved Geometry of Accelerated Optimization

[arXiv]
In this work we propose a differential geometric motivation for Nesterov's accelerated gradient method (AGM) for strongly-convex problems. By considering the optimization procedure as occurring on a Riemannian manifold with a natural structure, The AGM method can be seen as the proximal point method applied in this curved space. This viewpoint can also be extended to the continuous time case, where the accelerated gradient method arises from the natural block-implicit Euler discretization of an ODE on the manifold. We provide an analysis of the convergence rate of this ODE for quadratic objectives.

@ARTICLE{adefazio-curvedgeom2019,
author = {Aaron Defazio},
title = {On the Curved Geometry of Accelerated Optimization},
journal = {Advances in Neural Information Processing Systems 33 (NIPS 2019)},
year = {2019}
}

On the Ineffectiveness of Variance Reduced Optimization for Deep Learning

[arXiv]
[Code]
The application of stochastic variance reduction to optimization has shown remarkable recent theoretical and practical success. The applicability of these techniques to the hard non-convex optimization problems encountered during training of modern deep neural networks is an open problem. We show that naive application of the SVRG technique and related approaches fail, and explore why.

@ARTICLE{adefazio-varred2019,
author = {Aaron Defazio and L{\'{e}}on Bottou},
title = {On the Ineffectiveness of Variance Reduced Optimization for Deep Learning},
journal = {Advances in Neural Information Processing Systems 33 (NIPS 2019)},
year = {2019}
}

fastMRI: An Open Dataset and Benchmarks for Accelerated MRI

[arXiv]
[Code]
Accelerating Magnetic Resonance Imaging (MRI) by taking fewer measurements has the potential to reduce medical costs, minimize stress to patients and make MRI possible in applications where it is currently prohibitively slow or expensive. We introduce the fastMRI dataset, a large-scale collection of both raw MR measurements and clinical MR images, that can be used for training and evaluation of machine-learning approaches to MR image reconstruction. By introducing standardized evaluation criteria and a freely-accessible dataset, our goal is to help the community make rapid advances in the state of the art for MR image reconstruction. We also provide a self-contained introduction to MRI for machine learning researchers with no medical imaging background.

@inproceedings{fastMRI2018,
title={{fastMRI}: An Open Dataset and Benchmarks for Accelerated {MRI}},
author={Jure Zbontar and Florian Knoll and Anuroop Sriram and Matthew J. Muckley and Mary Bruno and Aaron Defazio and Marc Parente and Krzysztof J. Geras and Joe Katsnelson and Hersh Chandarana and Zizhao Zhang and Michal Drozdzal and Adriana Romero and Michael Rabbat and Pascal Vincent and James Pinkerton and Duo Wang and Nafissa Yakubova and Erich Owens and C. Lawrence Zitnick and Michael P. Recht and Daniel K. Sodickson and Yvonne W. Lui},
journal = {ArXiv e-prints},
archivePrefix = "arXiv",
eprint = {1811.08839},
year={2018}
}

Controlling Covariate Shift using Equilibrium Normalization of Weights

[arXiv]
We introduce a new normalization technique that exhibits the fast convergence properties of batch normalization using a transformation of layer weights instead of layer outputs. The proposed technique keeps the contribution of positive and negative weights to the layer output in equilibrium. We validate our method on a set of standard benchmarks including CIFAR-10/100, SVHN and ILSVRC 2012 ImageNet.

@ARTICLE{adefazio-equinorm2018,
author = {Aaron Defazio and L{\'{e}}on Bottou},
title = {Controlling Covariate Shift using Equilibrium Normalization of Weights},
journal = {ArXiv e-prints},
archivePrefix = "arXiv",
year = {2018}
}

A Simple Practical Accelerated Method for Finite Sums

[PDF]
[Code]
We describe a novel optimization method for finite sums (such as empirical risk minimization problems) building on the recently introduced SAGA method. Our method achieves an accelerated convergence rate on strongly convex smooth prob- lems. Our method has only one parameter (a step size), and is radically simpler than other accelerated methods for finite sums. Additionally it can be applied when the terms are non-smooth, yielding a method applicable in many areas where operator splitting methods would traditionally be applied.

@ARTICLE{adefazio-nips2016,
author = {Aaron Defazio},
title = {A Simple Practical Accelerated Method for Finite Sums},
journal = {Advances in Neural Information Processing Systems 29 (NIPS 2016)},
year = {2016}
}

Non-Uniform Stochastic Average Gradient Method for
Training Conditional Random Fields

[PDF]
We apply stochastic average gradient (SAG) algorithms for training conditional random fields (CRFs). We describe a practical implementation that uses structure in the CRF gradient to reduce the memory requirement of this linearly-convergent stochastic gradient method, propose a non-uniform sampling scheme that substantially improves practical performance, and analyze the rate of convergence of the SAGA variant under non-uniform sampling. Our experimental results reveal that our method often significantly outperforms existing methods in terms of the training objective, and performs as well or better than optimally-tuned stochastic gradient methods in terms of test error.

@ARTICLE{mschmidt-aistats2015,
author = {Mark Schmidt and Reza Babanezhad and Mohamed Osama Ahmed and
Aaron Defazio and Ann Clifton and Anoop Sarkar
},
title = {Non-Uniform Stochastic Average Gradient Method for Training Conditional Random Fields},
journal = {18th International Conference on Artificial Intelligence and Statistics (AISTATS 2015)},
year = {2015}
}

SAGA: A Fast Incremental Gradient Method With Support for
Non-Strongly Convex Composite Objectives

[PDF]
In this work we introduce a new optimisation method called SAGA in the spirit of SAG, SDCA, MISO and SVRG, a set of recently proposed incremental gradient algorithms with fast linear convergence rates. SAGA improves on the theory behind SAG and SVRG, with better theoretical convergence rates, and has support for composite objectives where a proximal operator is used on the regulariser. Unlike SDCA, SAGA supports non-strongly convex problems directly, and is adaptive to any inherent strong convexity of the problem. We give experimental results showing the effectiveness of our method.

@ARTICLE{adefazio-nips2014,
author = {Aaron Defazio and Francis Bach and Simon Lacoste-Julien},
title = {SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly
Convex Composite Objectives},
journal = {Advances in Neural Information Processing Systems 27 (NIPS 2014)},
year = {2014}
}

Finito: A Faster, Permutable Incremental Gradient Method
for Big Data Problems

[PDF]
[Appendix]
Recent advances in optimization theory have shown that smooth strongly convex finite sums can be minimized faster than by treating them as a black box "batch" problem. In this work we introduce a new method in this class with a theoretical convergence rate four times faster than existing methods, for sums with sufficiently many terms. This method is also amendable to a sampling without replacement scheme that in practice gives further speed-ups. We give empirical results showing state of the art performance.

@ARTICLE{adefazio-icml2014,
author = {Aaron Defazio and Tiberio Caetano and Justin Domke},
title = {Finito: A Faster, Permutable Incremental Gradient Method for Big
Data Problems},
journal = {The 31st International Conference on Machine Learning (ICML 2014)},
year = {2014}
}

A Convex Formulation for Learning Scale-Free Networks
via Submodular Relaxation

[PDF]
[Appendix]
[Code]
A key problem in statistics and machine learning is the determination of network structure from data. We consider the case where the structure of the graph to be reconstructed is known to be scale-free. We show that in such cases it is natural to formulate structured sparsity inducing priors using submodular functions, and we use their Lovasz extension to obtain a convex relaxation. For tractable classes such as Gaussian graphical models, this leads to a convex optimization problem that can be efficiently solved. We show that our method results in an improvement in the accuracy of reconstructed networks for synthetic data. We also show how our prior encourages scale-free reconstructions on a bioinfomatics dataset.

@ARTICLE{adefazio-nips2012,
author = {Aaron Defazio and Tiberio Caetano},
title = {A Convex Formulation for Learning Scale-Free Networks via Submodular
Relaxation},
journal = {Advances in Neural Information Processing Systems 25 (NIPS 2012)},
year = {2012}
}

A Graphical Model Formulation of Collaborative Filtering
Neighbourhood Methods

with Fast Maximum Entropy Training

[PDF]
with Fast Maximum Entropy Training

Item neighbourhood methods for collaborative filtering learn a weighted graph over the set of items, where each item is connected to those it is most similar to. The prediction of a user's rating on an item is then given by that rating of neighbouring items, weighted by their similarity. This paper presents a new neighbourhood approach which we call item fields, whereby an undirected graphical model is formed over the item graph. The resulting prediction rule is a simple generalization of the classical approaches, which takes into account non-local information in the graph, allowing its best results to be obtained when using drastically fewer edges than other neighbourhood approaches. A fast approximate maximum entropy training method based on the Bethe approximation is presented which utilizes a novel decomposition into tractable sub-problems. When using precomputed sufficient statistics on the Movielens dataset, our method outperforms maximum likelihood approaches by two orders of magnitude.

@ARTICLE{adefazio-icml2012,
author = {Aaron Defazio and Tiberio Caetano},
title = {A Graphical Model Formulation of Collaborative Filtering Neighbourhood
Methods with Fast Maximum Entropy Training},
journal = {The 29th International Conference on Machine Learning (ICML 2012)},
year = {2012}
}

### Tech Reports

Offset Sampling Improves Deep Learning based Accelerated MRI Reconstructions by Exploiting Symmetry

[arXiv]
Deep learning approaches to accelerated MRI take a matrix of sampled Fourier-space lines as input and produce a spatial image as output. In this work we show that by careful choice of the offset used in the sampling procedure, the symmetries in k-space can be better exploited, producing higher quality reconstructions than given by standard equally-spaced samples or randomized samples motivated by compressed sensing.

@misc{defazio2019offset,
title={Offset Sampling Improves Deep Learning based Accelerated MRI Reconstructions by Exploiting Symmetry},
author={Aaron Defazio},
year={2019},
eprint={1912.01101},
archivePrefix={arXiv},
primaryClass={eess.IV}
}

A Comparison of Learning Algorithms on the Arcade Learning Environment

[PDF]
Reinforcement learning agents have traditionally been evaluated on small toy problems. With advances in computing power and the advent of the Arcade Learning Environment, it is now possible to evaluate algorithms on diverse and difficult problems within a consistent framework. We discuss some challenges posed by the arcade learning environment which do not manifest in simpler environments. We then provide a comparison of model-free, linear learning algorithms on this challenging problem set.

@TECHREPORT{adefazio-rl2014,
author = {Aaron Defazio and Thore Graepel},
title = {A Comparison of Learning Algorithms on the Arcade Learning Environment},
institution = {Australian National University},
year = {2014}
}

### Articles

Linear programming in low dimensions is easy

[PDF]
### PhD Thesis

New Optimisation Methods for Machine Learning

[PDF]
Supervised by Tiberio Caetano
In this work we introduce several new optimisation methods for problems in machine learning. Our algorithms broadly fall into two categories: optimisation of finite sums and of graph structured objectives. The finite sum problem is simply the minimisation of objective functions that are naturally expressed as a summation over a large number of terms, where each term has a similar or identical weight. Such objectives most often appear in machine learning in the empirical risk minimisation framework in the non-online learning setting. The second category, that of graph structured objectives, consists of objectives that result from applying maximum likelihood to Markov random field models. Unlike the finite sum case, all the non-linearity is contained within a partition function term, which does not readily decompose into a summation.

For the finite sum problem, we introduce the Finito and SAGA algorithms, as well as variants of each. The Finito algorithm is best suited to strongly convex problems where the number of terms is of the same order as the condition number of the problem. We prove the fast convergence rate of Finito for strongly convex problems and demonstrate its state-of-the-art empirical performance on 5 datasets.

The SAGA algorithm we introduce is complementary to the Finito algorithm. It is more generally applicable, as it can be applied to problems without strong convexity, and to problems that have a non-differentiable regularisation term. In both cases we establish strong convergence rate proofs. It is also better suited to sparser problems than Finito. The SAGA method has a broader and simpler theory than any existing fast method for the problem class of finite sums, in particular it is the first such method that can provably be applied to non-strongly convex problems with non-differentiable regularisers without introduction of additional regularisation.

For graph-structured problems, we take three complementary approaches. We look at learning the parameters for a fixed structure, learning the structure independently, and learning both simultaneously. Specifically, for the combined approach, we introduce a new method for encouraging graph structures with the “scale-free” property. For the structure learning problem, we establish SHORTCUT, a O(n^2.5) expected time approximate structure learning method for Gaussian graphical models. For problems where the structure is known but the parameters unknown, we introduce an approximate maximum likelihood learning algorithm that is capable of learning a useful subclass of Gaussian graphical models.

Our thesis as a whole introduces a new suit of techniques for machine learning practitioners that increases the size and type of problems that can be efficiently solved. Our work is backed by extensive theory, including proofs of convergence for each method discussed.

@PHDTHESIS{adefazio-thesis2014,
author = {Aaron Defazio},
title = {New Optimisation Methods for Machine Learning},
school = {Australian National University},
year = {2014},
note = {http://www.aarondefazio.com/pubs.html}
}

### Honours Research

Network Topology Tomography

[PDF]
Supervised by Tiberio Caetano
@MASTERSTHESIS{adefazio-honours2010,
author = {Aaron Defazio},
title = {Network Topology Tomography},
school = {Australian National University (ANU)},
year = {2010},
type = {Undergraduate Honors Thesis}
}