Everything is Easy in Low Dimensions (Part 1): Linear Programming

I recently came across a remarkable randomized algorithm due to Seidel (1991) for linear programming in 2 dimensions that takes only O(n) expected time for n constraints. It’s definitely worth a close inspection, so I thought I would write a short article about it.

Note that my presentation of this algorithm assumes that there are no redundant or collinear constraints, that the solution is bounded, and that the solution for each subproblem encountered is unique. These assumptions are easy to remove without slowing the algorithm down, but they complicate the presentation a little.

FUNCTION lpsolve

INPUT: Set H of constraints and a vector c.

OUTPUT: A point v in the polytope defined by H minimizing $c^T v$.

BASE CASE: If |H| = 2 then return the unique point where the two constraint boundaries intersect.

NON-BASE CASE:

  1. Pick a constraint B uniformly at random from H.
  2. v = lpsolve(H \ {B}, c).
  3. If v is in the half-plane defined by B, RETURN v. Otherwise:
  4. RETURN the solution of the 1D LP formed by projecting H and c onto the boundary of B.

This algorithm is quite remarkable in its simplicity. It takes the form of a recursive algorithm: we solve a base case, then add constraints back one at a time. Every time we add a constraint, we check whether it changes the solution. Only when the old solution v violates the added constraint do we have to do any work, namely solving a simple 1D LP, which trivially takes O(m) time for m constraints.
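The recursion above can be sketched in code. The following is a minimal Python sketch under the same simplifying assumptions (non-degenerate, bounded problem). It deviates from the pseudocode in one standard practical detail: the first two constraints are kept fixed as the base case, so they should be chosen to bound the optimum (e.g. a large bounding box).

```python
import random

# Constraints are tuples (a, b, rhs), meaning a*x + b*y <= rhs.
# The objective c = (cx, cy): minimize cx*x + cy*y.

def intersect(h1, h2):
    """Intersection of the boundary lines of two constraints (Cramer's rule)."""
    (a1, b1, r1), (a2, b2, r2) = h1, h2
    det = a1 * b2 - a2 * b1  # assumed nonzero (no parallel constraints)
    return ((r1 * b2 - r2 * b1) / det, (a1 * r2 - a2 * r1) / det)

def lpsolve(H, c, eps=1e-9):
    if len(H) == 2:
        return intersect(H[0], H[1])
    # Pick a random constraint B; the first two constraints are never
    # removed, as they are assumed to bound the optimum.
    i = random.randrange(2, len(H))
    B = H[i]
    v = lpsolve(H[:i] + H[i + 1:], c)
    a, b, rhs = B
    if a * v[0] + b * v[1] <= rhs + eps:
        return v  # old optimum still satisfies B
    # Otherwise the optimum lies on the boundary of B: solve a 1D LP
    # along the line a*x + b*y = rhs, parametrized as p0 + t*d.
    d = (-b, a)
    p0 = (rhs / a, 0.0) if abs(a) >= abs(b) else (0.0, rhs / b)
    lo, hi = float("-inf"), float("inf")
    for (ai, bi, ri) in H:
        if (ai, bi, ri) == B:
            continue
        slope = ai * d[0] + bi * d[1]
        off = ri - (ai * p0[0] + bi * p0[1])
        if slope > eps:
            hi = min(hi, off / slope)   # upper bound on t
        elif slope < -eps:
            lo = max(lo, off / slope)   # lower bound on t
    t = lo if c[0] * d[0] + c[1] * d[1] > 0 else hi
    return (p0[0] + t * d[0], p0[1] + t * d[1])
```

For example, minimizing y subject to y ≥ x, y ≥ −x and x − 2y ≤ −1, i.e. `lpsolve([(1, -1, 0), (-1, -1, 0), (1, -2, -1)], (0, 1))`, returns (−1/3, 1/3), the intersection of the second and third constraint boundaries.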

If we had to solve the 1D LP subproblem at every step, the algorithm would have quadratic running time ($\sum_{i=1}^{n} i = O(n^2)$). We avoid this because the constraint B we recurse on at each step is chosen randomly. Clearly we only have to solve the 1D LP if the randomly chosen constraint B forms part of the new solution to the problem (ignoring degenerate solutions), and since a solution is the intersection of 2 constraints, there is only a 2 in |H| chance that we randomly pick such a constraint.

It follows that the expected cost of the 1D step at each level of the recursion is $|H| \cdot \frac{2}{|H|} = 2$, and we recurse n times, giving us an O(n) running time in expectation.
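Written as a recurrence, with T(n) the expected cost on n constraints, the argument above is:

```latex
T(n) \;\le\; T(n-1) \;+\; O(1) \;+\; \frac{2}{n}\cdot O(n)
     \;=\; T(n-1) + O(1)
\qquad\Longrightarrow\qquad T(n) = O(n).
```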

In the original presentation by Seidel, the algorithm is given in a general form, usable in higher dimensions as well. The difficulty is that the running time quickly gets out of hand as the dimension d increases. The main obstacle is step 4 above: you essentially need to solve a (d−1)-dimensional LP, which in higher dimensions is only slightly easier than your original LP.

On a side note, there is an interesting discussion of this algorithm and related ones in the Wikipedia page for LP-type problems.

References

Seidel, Raimund (1991), "Small-dimensional linear programming and convex hulls made easy", Discrete and Computational Geometry, 6 (5): 423–434

A Complete Guide to the Bayes Factor Test

The Bayes factor test is an interesting thing. Some Bayesians advocate it unequivocally, whereas others reject the notion of testing altogether, Bayesian or otherwise. This post takes a critical look at the Bayes factor, attempting to tease apart the ideas to get to the core of what it’s really doing. If you're used to frequentist tests and want to understand Bayes factors, this post is for you.

The Bayes factor test goes all the way back to Jeffreys’ early book on the Bayesian approach to statistics [Jeffreys1939]. Evidence for an alternative hypothesis H1 against a null hypothesis H0 is summarized by a quantity known as the Bayes factor. The Bayes factor is just the ratio of the marginal likelihoods of the data under the two hypotheses, with any nuisance parameters integrated out:

$$B_{10} := \frac{p(D|H_1)}{p(D|H_0)}.$$

If this ratio is large, we can conclude that there is strong evidence for the alternative hypothesis. In contrast, if the inverse of this ratio is large, we have evidence supporting the null hypothesis. The definition of what is strong evidence is subjective, but usually a Bayes factor of 10 or more is considered sufficient.
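As a toy illustration of the definition (not Jeffreys' t-test, which involves integrating out nuisance parameters), consider two point hypotheses about the mean of normal data with known unit variance. Here the marginal likelihoods reduce to plain likelihoods and the Bayes factor can be computed directly; the setup and names are my own:

```python
import math

def normal_logpdf(x, mu, sigma=1.0):
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

def bayes_factor_10(data, mu0=0.0, mu1=1.0):
    """BF_10 for two point hypotheses H0: mu = mu0 vs H1: mu = mu1.
    With no nuisance parameters, the marginal likelihoods are just
    likelihoods, so no integration is needed."""
    ll0 = sum(normal_logpdf(x, mu0) for x in data)
    ll1 = sum(normal_logpdf(x, mu1) for x in data)
    return math.exp(ll1 - ll0)

print(bayes_factor_10([0.9, 1.1, 1.2, 0.8]))  # e^2 ~ 7.4: moderate evidence for H1
```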

The standard Bayes factor test against the point-null

Although Bayes factors are sometimes used for testing simple linear regression models against more complex ones, by far the most common test in practice is the analogue to the frequentist t-test, the Bayes factor t-test. Under the assumption of normality with unknown variance, it tests a null hypothesis of zero mean against non-zero mean.

This test is implemented in the BayesFactor R package with the ttestBF method. For the pairwise case it can be invoked with bf <- ttestBF(x = xdata, y=ydata). For a non-pairwise test just x can be passed in.

Statistical details of the test To apply the test we must formally define the two hypotheses in the form of statistical models. Jeffreys’ book is very pragmatic, and the modelling choices for this test reflect that. Since this is a t-test, we of course assume a normal distribution for the data under both hypotheses, so it just remains to define the priors on its parameters in both cases.

We have essentially 3 parameters to place priors on: the mean μ under the alternative, the variance under the alternative σ1², and the variance under the null σ0². The mean under the null is fixed at 0. We have to be extremely careful about the priors we place on these parameters: reasonable-seeming choices can easily lead to divergent integrals or nonsensical results. We discuss this further below.

Variance Because the variance appears in both the numerator and denominator of the Bayes factor (the alternative and the null), we can use a “non-informative” improper prior of the form p(σ²) ∝ 1/σ² for both. This prior leaves each marginal likelihood defined only up to an arbitrary constant, but that constant cancels between the numerator and denominator, giving a finite, well-defined value.

Mean The choice of a prior for the mean is a little more subjective. We can’t use an improper prior here, as the mean prior appears only in the numerator, and without the convenient cancellation with factors in the denominator the Bayes factor won’t be well defined. Jeffreys argues for a Cauchy prior, which is just about the vaguest prior you can use that still gives a convergent integral. In his formulation the scale parameter of the Cauchy is simply fixed to the standard deviation. The BayesFactor package instead defaults to √2/2 ≈ 0.707 times the standard deviation, but this can be overridden.

What are the downsides compared to a frequentist t-test? The main downside is that, by some frequentist measures, the BF test is not as good as the tests designed to be optimal with respect to those measures. In particular, consider the standard measure of power at γ: the probability of picking up an effect of size γ under repeated replications of the experiment. Depending on the effect size and the number of samples n, a Bayesian t-test often requires 2-4 times as much data to match the power of a frequentist t-test.

Implementation details The Bayes factor requires computing marginal likelihoods, which is a problem quite distinct from the usual posterior expectations we compute when performing Bayesian estimation rather than hypothesis testing. The marginal likelihood is just an integral over the parameter space, which can be computed using numerical integration when the parameter space is small. For this t-test example, the BayesFactor package uses Gaussian quadrature for the alternative hypothesis. The null hypothesis doesn’t require any integration since it consists of a single point.
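The package's internals are its own, but the idea is easy to sketch. Here is a known-variance analogue (a Cauchy-prior z-test) where the alternative's marginal likelihood is computed by numerical quadrature; the function names and setup are my own, not the package's:

```python
import math
from scipy.integrate import quad

def marginal_likelihood_h1(xbar, n, sigma=1.0, scale=1.0):
    """p(xbar | H1) = integral of N(xbar; mu, sigma^2/n) * Cauchy(mu; 0, scale) dmu."""
    se = sigma / math.sqrt(n)
    def integrand(mu):
        lik = math.exp(-0.5 * ((xbar - mu) / se) ** 2) / (se * math.sqrt(2 * math.pi))
        prior = scale / (math.pi * (mu**2 + scale**2))
        return lik * prior
    # Split at xbar so the sharp likelihood peak sits at an interval endpoint,
    # where adaptive quadrature handles it well.
    left, _ = quad(integrand, -math.inf, xbar)
    right, _ = quad(integrand, xbar, math.inf)
    return left + right

def bf10(xbar, n, sigma=1.0):
    se = sigma / math.sqrt(n)
    # The null is a single point, so its "marginal" is just the likelihood at mu = 0.
    p_d_h0 = math.exp(-0.5 * (xbar / se) ** 2) / (se * math.sqrt(2 * math.pi))
    return marginal_likelihood_h1(xbar, n, sigma) / p_d_h0

print(bf10(0.0, 100))  # < 1: a sample mean at the null favors H0
print(bf10(0.5, 100))  # >> 1: a five-standard-error effect strongly favors H1
```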

Jeffreys lived in the statistical era of tables and hand computation, so he derived several formulas for approximating the Bayes factor. The following formula in terms of the t-statistic $t = \bar{x}\sqrt{(n-1)/s^2}$ works well for n in the hundreds or thousands:

$$B_{10} \approx \sqrt{\frac{2}{\pi n}}\, e^{t^2/2}.$$

The sensitivity to priors

There are two types of priors that appear in the application of Bayes factors: the priors on the hypotheses, p(H0) and p(H1), and the priors on the parameters that appear in the marginal likelihoods, p(θ0|H0) and p(θ1|H1). Your prior belief ratio p(H1)/p(H0) is updated into your posterior beliefs using the simple equation:

$$\frac{p(H_1|D)}{p(H_0|D)} = \frac{\int p(D|\theta_1)\, p(\theta_1|H_1)\, d\theta_1}{\int p(D|\theta_0)\, p(\theta_0|H_0)\, d\theta_0} \cdot \frac{p(H_1)}{p(H_0)}.$$

The ratio p(H1)/p(H0) is not formally considered part of the Bayes factor, but it plays an important role. It encodes your prior beliefs about the hypotheses. If you are using Bayes factors in a subjective fashion, to update your beliefs in a hypothesis after seeing data, then it is crucial to properly encode your prior belief in which hypothesis is more likely to be true. In practice, however, this ratio is almost always taken to be 1, following the ideals of objective Bayesian reasoning, where one strives to introduce as little prior knowledge as possible into the problem. If you're writing a scientific paper, it’s reasonable to use 1, as anybody reading the paper can apply their own ratio as a correction to your Bayes factor, reflecting their own prior beliefs.

The prior ratio of the hypotheses plays an interesting part in the calibration of the Bayes factor test. One way of testing Bayes factors is to simulate data from the priors and compare the empirical Bayes factors over many simulations with the true ratios. This involves sampling a hypothesis, H0 or H1, then sampling D from it. If the probabilities of sampling H0 and H1 at that first stage are not equal, then the observed Bayes factor will be off by the corresponding factor p(H1)/p(H0). This is somewhat of a tautology, as we are basically just numerically verifying the Bayes factor equation as written. Nevertheless, it shows that if the priors on the hypotheses don’t reflect reality, then the Bayes factor won’t either. Blind application of a Bayes factor test won’t automatically give you interpretable “beliefs”.

Parameter priors

The dependence of Bayes factors on the parameter priors p(θ1|H1) and p(θ0|H0) is really at the core of understanding Bayes factors. It is a far more concrete problem than the choice of hypothesis priors, as there are no clear “objective” choices we can make. Additionally, intuitions from Bayesian estimation problems do not carry over to hypothesis testing.

Consider the simple BF t-test given above. We described the test using the Cauchy prior on the location parameter of the Gaussian. In practice, people often use large vague Gaussian priors instead when doing Bayesian estimation (as opposed to hypothesis testing), as they tend to be more numerically stable than the Cauchy. However, it has been noted by several authors [Berger and Pericchi2001] that simply replacing the Cauchy with a wide Gaussian in the BF t-test causes serious problems. Suppose we use a standard deviation of ν in this wide prior (ν large). It turns out that the Bayes factor scales asymptotically like:

$$\frac{1}{\nu\sqrt{n}}\, e^{t^2/2},$$

where $t = \bar{x}\sqrt{(n-1)/s^2}$ is the standard t-statistic of the data. The large ν that was chosen with the intent of being ’non-informative’ (i.e. minimally affecting the result) turns out to directly divide the BF!
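This 1/ν scaling is easy to verify numerically in a known-variance analogue, where an N(0, ν²) prior on the mean gives a closed-form marginal likelihood (the numbers below are illustrative, not from any real dataset):

```python
import math

def bf10_gaussian_prior(xbar, n, nu, sigma=1.0):
    """BF_10 for H0: mu = 0 vs H1: mu ~ N(0, nu^2), with sigma known.
    Marginally, xbar ~ N(0, nu^2 + sigma^2/n) under H1."""
    v0 = sigma**2 / n
    v1 = nu**2 + v0
    num = math.exp(-0.5 * xbar**2 / v1) / math.sqrt(2 * math.pi * v1)
    den = math.exp(-0.5 * xbar**2 / v0) / math.sqrt(2 * math.pi * v0)
    return num / den

# Widening the "non-informative" prior by a factor of 10
# divides the Bayes factor by roughly 10:
b1 = bf10_gaussian_prior(xbar=0.2, n=100, nu=100.0)
b2 = bf10_gaussian_prior(xbar=0.2, n=100, nu=1000.0)
print(b1 / b2)  # ~10
```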

This is the opposite of the situation in estimation, where as ν increases the prior behaves more and more like the improper flat prior p(μ) ∝ 1. We can’t use the flat prior in the BF t-test, but if we assume that the sample standard deviation s matches the true standard deviation, we can perform a BF z-test instead. Using a flat prior on μ, we get a BF of:

$$s\sqrt{\frac{2\pi}{n}}\, e^{z^2/2}.$$

Because the flat prior appears only in the numerator, there is no clear choice of scaling for it, so this expression is not in any sense calibrated (several approaches to calibration appear in the literature; see Robert [1993]).

It is interesting to see what happens when we apply the Cauchy prior, as used in the standard form of the test. In that case we see approximate scaling like [Robert et al.2009]:

$$\sqrt{\frac{2}{\pi n}}\, e^{t^2/2}.$$

A similar formula appears when the data standard deviation is assumed to be known as well:

$$\sqrt{\frac{2}{\pi n}}\left(1 + \frac{t^2}{n}\right) e^{t^2/2}.$$

The $1 + t^2/n$ part can also be written $1 + (\bar{x}/s)^2$ in terms of the known standard deviation and the empirical mean, and it is clearly constant asymptotically; the expression as a whole scales similarly to the unknown-variance case.

Lindley’s paradox

Lindley’s paradox refers to a remarkable difference in the behavior of Bayes factor tests compared to frequentist tests as the sample size increases. Under a frequentist t-test, we compute the t statistic and compare it against a threshold that depends on n and decreases as n increases (recall the definition of the t statistic: $t = \bar{x}\sqrt{(n-1)/s^2}$). In contrast, the effective threshold used by the BF t-test actually increases with n, although slowly, roughly like $\sqrt{\log n}$. This difference is illustrated in the following plot lifted from Rouder et al. [2009]:

[Figure: effective rejection thresholds for the frequentist and Bayes factor t-tests as a function of sample size, from Rouder et al. (2009)]

The effect of this difference is that for large sample sizes, a frequentist test can reject the null with small p (say p = 0.05), while a BF t-test run on the same data can strongly support the null hypothesis! As a concrete example, if we generate n = 50,000 synthetic data points with mean $2/\sqrt{50000}$ and variance 1, then the BF t-test yields 0.028 for the alternative hypothesis (i.e. roughly 36:1 odds in favor of the null), whereas a frequentist t-test rejects the null with p = 0.0478 (two-sided test).
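The disagreement can be reproduced, at least qualitatively, without simulating any data at all, by plugging a small fixed effect into the test statistic and using Jeffreys' large-n approximation from earlier (treating s as known, so t coincides with the z-statistic; the exact BF t-test would give slightly different numbers):

```python
import math

n = 50_000
xbar, s = 2 / math.sqrt(n), 1.0      # a tiny true effect
t = xbar * math.sqrt(n) / s          # = 2.0 (s treated as known, so t ~ z)

# Two-sided p-value from the normal CDF, Phi(x) = 0.5 * (1 + erf(x / sqrt(2))):
p_value = 2 * (1 - 0.5 * (1 + math.erf(t / math.sqrt(2))))

# Jeffreys' large-n approximation to the Bayes factor:
bf10 = math.sqrt(2 / (math.pi * n)) * math.exp(t**2 / 2)

print(p_value, bf10)  # p ~ 0.046 ("significant"), BF10 ~ 0.026 (~38:1 *for* the null)
```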

It’s all well and good to argue the philosophical differences between Bayesian and frequentist approaches, but when they disagree so strongly, it becomes a practical issue. The key issue is the handling of small effects. As n increases and s² stays fixed, a fixed t value of say 2 represents a smaller and smaller measured effect, namely:

$$\bar{x} \approx \frac{2}{\sqrt{n}}.$$

Detecting small effects requires a large sample size, and the Bayesian t-test greatly prefers the null hypothesis over that of a small effect. This is usually described positively as an Occam’s razor effect, but it has real consequences when we are attempting to find true small effects. Essentially, the BF test will require much more data to pick up a small effect than a corresponding frequentist test. Often this is viewed from the opposite side: that frequentist tests too easily reject the null hypothesis at large sample sizes. This is more of an issue when testing more complex models, such as linear regressions with multiple coefficients. Unless the data was actually generated by a linear equation with Gaussian noise, large samples will inevitably reject simple null models. This is in contrast to simple point-null hypothesis testing, where we can appeal to asymptotic normality.

Possible changes from the default Bayes factor test

Here we attempt to answer the following question: are the properties of the default Bayes factor test indicative of BF tests in general? As we see it, there are two main ways to modify the test: We can use a non-point hypothesis as the null, or we can change the alternative hypothesis in some fashion. We discuss these possibilities and their combination separately.

Changing the alternative hypothesis

The default Bayes factor t-test is formulated from an objectivist point of view: it’s designed so that the data can “speak for itself”, and the priors used are essentially as vague as possible while still yielding a usable test. What if we instead choose the prior for the alternative that gives the strongest evidence for the alternative? This obviously violates the likelihood principle, as we choose the prior after seeing the data, but it is nevertheless instructive in that it indicates how large the difference between point NHST tests and BF tests is.

A reasonable class of priors to consider is those that are non-increasing in |θ|, as they have a single mode at θ = 0. We discuss other priors below. Table 1 shows a comparison of p-values, likelihood ratios from the Neyman-Pearson likelihood ratio test, and a bound on BFs over all priors of this form. It is immediately obvious that, according to this bound, any reasonable choice of prior yields much less evidence for the alternative than that suggested by the p-value. For example, the difference at p = 0.001 is 18-fold, and the difference is larger for smaller p.







t-statistic   1.645     1.960     2.576     3.291
p-value       0.1       0.05      0.01      0.001
LR            0.258:1   0.146:1   0.036:1   0.0044:1
BF bound      0.644:1   0.409:1   0.123:1   0.018:1

Table 1: Comparison of methods for point-null normal data tests, reproduced from Berger and Wolpert [1988].

Although we disregarded multi-modal alternative priors above, they actually form an interesting class of tests. In terms of conservativeness, their BF is easily seen to be bounded by the likelihood ratio obtained using a prior concentrated on the maximum likelihood solution θ̂:

$$\frac{p(D\,|\,\theta = 0)}{p(D\,|\,\theta = \hat{\theta})}.$$

This ratio is simply the Neyman-Pearson likelihood ratio for composite hypotheses, and its values are also shown in Table 1. These values are still more conservative than p-values by a factor of 2-5. Note that in a Neyman-Pearson test such as the frequentist t-test, this ratio is not read off directly as an odds ratio as we are doing here; that’s why this NP odds ratio doesn’t agree directly with the p-value.
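For the normal known-variance case this bound is easy to compute by hand: with $\bar{x} \sim N(\theta, \sigma^2/n)$, the likelihood is maximized at $\hat{\theta} = \bar{x}$, and the ratio collapses to $e^{-z^2/2}$, which reproduces the LR row of Table 1:

```python
import math

# For xbar ~ N(theta, sigma^2/n), p(D|theta=0) / p(D|theta=MLE) = exp(-z^2/2),
# where z = xbar * sqrt(n) / sigma.
for z in (1.645, 1.960, 2.576, 3.291):
    print(f"z = {z}: LR = {math.exp(-z**2 / 2):.4f}")
# Prints 0.2585, 0.1463, 0.0363, 0.0044 -- matching the LR row of Table 1.
```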

The interesting aspect of multi-modal priors is that they allow you to formulate a test that is more powerful for finding evidence of the null hypothesis. It can be shown that under the default BF t-test, evidence for the null accumulates very slowly when the null is true, whereas evidence for the alternative accumulates exponentially fast when the alternative is true. Multi-modal priors can be formulated that fix this imbalance [Johnson and Rossell2010]. One simple case is a normal or Cauchy prior with a notch of low probability around the null θ = 0.

If we go a step further and consider the limits of sequences of improper priors, the Bayes factor can take essentially any value. Some particular “reasonable” sequences actually yield values very similar to the t-test’s p-values [Robert1993] as their posterior probabilities, at least between 0.1 and 0.01. Their construction also places 0 probability mass in the neighborhood of the null point in the limit, so it is similar to the multi-modal priors discussed in the previous paragraph. Although this improper prior does avoid Lindley’s paradox, it is more of a theoretical construction; it is not recommended in practice, even by its discoverer [Robert1993].

Consequences for Lindley’s paradox

As discussed above, when we restrict ourselves to well-behaved proper priors, “Lindley’s paradox” as such cannot be rectified by a careful choice of proper prior on the alternative hypothesis. When one takes a step back and considers the consequences of this, Lindley’s paradox becomes much clearer as a statement about objective priors rather than about BF tests in general.

The central idea is this: it requires a lot of evidence to detect small effects when you use vague priors. The idea of running a long experiment while using a vague prior is nonsensical from an experimental design point of view: large amounts of data are useful for picking up small effects, and if you expect such a small effect you should encode this sensibly in your prior. The default Cauchy prior used in the BF t-test essentially says that you expect a 50% chance that the absolute effect size (Cohen’s d) is larger than 1, which is completely implausible in many settings. For instance, if you are trying to optimize conversion rates on a website with, say, a 10% base conversion rate, an effect size of “1” is a 4x increase in sales! Some software defaults to 0.707 instead of 1 for the Cauchy scale parameter, but the effect is largely the same.

Experimental design considerations are important in the choice of a prior, and the Bayesian justification of the Bayes factor depends on our priors being reasonable. The choice of a particular BF test is not a full experimental design, just as for a frequentist it is not sufficient to decide on a particular test without regard to the eventual sample size. It is necessary to examine other properties of the test, to determine whether it will behave reasonably in the experimental setting in which we will apply it. So in effect, Lindley’s paradox is just the statement that a poorly designed Bayesian experiment won’t agree with a less poorly designed frequentist experiment.

Put another way, if the Bayes factor is comparing two hypotheses, both of which are unlikely under the data, you won’t get reasonable results. You can’t identify this from examining the Bayes factor; you have to look at the experiment holistically. The extreme case is when neither hypothesis includes the true value of θ. In such cases, the BF will generally not converge to any particular value, and may show strong evidence in either direction.

An example

It is illustrative to discuss an example of this problem from the literature: a high-profile case where the Bayesian and frequentist results are very different under a default prior, and more reasonable under a subjective prior. One such case is the study by Bem [2011] on precognition. The subject matter is controversial; Bem gives results for 9 experiments, in which 8 show evidence for the existence of precognition at the p = 0.05 level! This extraordinary claim led to extensive examination of the results from the Bayesian viewpoint.

In Wagenmakers et al. [2011a], the raw data of Bem [2011] is analyzed using the default Bayesian t-test. They report Bayes factors towards the alternative (in decreasing order) of 5.88, 1.81, 1.63, 1.05, 0.88, 0.58, 0.47, 0.32, 0.29, 0.13. So only one experiment showed reasonable evidence of precognition (5.88:1 in favor), and combining the results gives 19:1 odds against an effect.

In contrast, Bem et al. [2011] performed a Bayes factor analysis using informative priors on effect sizes, with values considered reasonable under non-precognitive circumstances. The 3 largest Bayes factors for the existence of precognition they obtained were 10.1, 5.3 and 4.9, indicating substantial evidence!

The interesting thing here is that the vague prior used by Wagenmakers et al. [2011a] placed a large prior probability on implausibly large effect sizes. Even strong proponents of ESP don’t believe in such large effects, as they would have been easily detected in other past similar experiments. Somewhat counter-intuitively, this belief that ESP effects are strong if they exist at all yields a weaker posterior belief in the existence of ESP!

It should be noted that Wagenmakers et al. [2011b] responded to this criticism, but somewhat unsatisfactorily in my opinion. Of course, even with the use of careful subjective Bayesian priors on the alternative hypothesis, these experiments are not sufficient to convince a skeptic (such as myself) of the existence of ESP! If you factor in the prior p(H1)/p(H0) and update your beliefs correctly, you may go from a 1-in-a-million belief in ESP existing to a 1-in-76 belief (if you use the naive combined BF of 13,669 from Bem et al. [2011]).

Most reasonable people would consider the chance that a systematic unintentional error was made during the experiments to be greater than 1 in a thousand, and so not update their beliefs quite as severely. Regardless, several pre-registered replications of Bem’s experiments have failed to show any evidence of precognition [Ritchie et al.2012].

Recommendations when using point-nulls

The general consensus in the literature is that single Bayes factors should not be reported. Instead the BFs corresponding to several different possible priors for the alternative hypothesis should be reported. Only if the conclusions are stable for a wide range of priors can they be reported as definitive. If the results are not stable, then it must be argued why a particular prior, chosen subjectively, is the best choice.

Bounds giving the largest or smallest possible Bayes factor over reasonable classes of priors, such as the bounds given in Berger and Jefferys [1991], can also be reported. The consideration of multiple possible priors is generally called “robust” Bayesian analysis.

Non-point nulls

As we showed above, essentially any point-null Bayes factor t-test is much more conservative than a frequentist t-test on the same data. But is this indicative of BF tests in general? It turns out that Bayes factor tests that don’t use point-nulls behave much more like their frequentist counterparts [Casella and Berger1987].

But first, let’s just point out that point-null hypotheses are fundamentally weird things. When we start assigning positive probability mass to single values, we move from the comfortable realm of simple distribution functions we can plot on a whiteboard into the terra pericolosa that is measure theory. As anybody who has taken a measure theory course knows, common-sense reasoning can easily go astray even on simple-sounding problems. The “surprising” Occam’s razor effect seen in point-null BF tests is largely attributable to this. It can be argued that any prior that concentrates mass on a single point is not “impartial”, effectively favoring that point disproportionately [Casella and Berger1987].

One-sided tests

The simplest alternative to a point-null test is a positive-versus-negative effect test, where the null is taken to be the entire negative (or positive) half of the real line. This test is very similar in effect to a frequentist one-sided test. Unlike the point-null case, the alternative and null hypothesis spaces have the same dimension; in practice this means we can use the same unnormalized priors on both without running into issues. This test can be performed using the BayesFactor R package with the ttestBF method, although somewhat indirectly, by taking the ratio of the Bayes factors for the two half-lines (each computed against the point null):

bfInterval <- ttestBF(x = ldata, y = rdata, nullInterval = c(-Inf, 0))
bfNonPoint <- bfInterval[2] / bfInterval[1]
print(bfNonPoint)

Applying one-sided tests in the frequentist setting is routine (in fact, often advocated [Cho and Abe2013]); it is surprising, then, that their theoretical properties are very different from two-sided tests. If we look at both Bayesian and frequentist approaches from the outside, using decision theory (with quadratic loss), we see a fundamental difference. In the one-sided case, the Fisher p-value approach obeys a weak kind of optimality known as admissibility [Hwang et al.1992]: no approach strictly dominates it. In contrast, in the two-sided case neither the frequentist nor the standard Bayesian approach is admissible, and neither dominates the other!

What does this tell us about sequential testing settings? The use of Bayes factor tests is sometimes advocated in the sequential setting, as their interpretation is not affected by early stopping. “Sequential testing” is when a test of some form is applied multiple times during an experiment, with the experiment potentially stopping early if the test indicates strong evidence of an effect. Using frequentist tests sequentially is problematic: the false-positive rate is amplified, and special considerations must be made to account for this. See my previous post for details on how to perform frequentist tests correctly in this setting.

The interpretation of the Bayes factor, in contrast, is unaffected by early stopping. Naive application of a point-null BF test does seem to perform reasonably in a sequential setting, as its naturally conservative nature results in few false positives. However, as we pointed out above, one-sided non-point null tests are equally valid, and they are not particularly conservative. Applying one sequentially gives you high false-positive rates, comparable to the frequentist test. The false-positive rate is a frequentist property, and as such it is quite unrelated to the Bayesian interpretation of the posterior distribution.

Indeed, in most sequential A/B testing situations, such as website design tests, the one-sided non-point test seems the most reasonable, as the two hypotheses correspond more directly to the two courses of action that will be taken after the test concludes: switch all customers to design A or to design B.

So, should I use a point null hypothesis?

Here we move outside the realm of mathematics and meander into the realm of philosophy. You have to consider the purpose of your hypothesis test. Point-nulls enforce a strong Occam’s razor effect, and depending on the purpose of your test, this may or may not be appropriate.

The examples and motivation used for null hypothesis testing in the original formulation of the problem by Jeffreys came from physics. In physics problems, testing for absolute truth is a much clearer proposition: either an exotic particle exists or it doesn’t; there is no in-between. This is a far cry from the psychology, marketing and economics problems more commonly dealt with today, where we always expect some sort of effect, albeit potentially small and possibly in the opposite direction from what we guessed a priori.

You need a real belief in the possibility of a zero effect to justify a point-null test. This is a real problem in practice. In contrast, by using non-point null tests you will be in closer agreement with frequentist results, and you will need less data to draw conclusions, so there are many advantages to avoiding point-null tests.

It is interesting to contrast modern statistical practice in high energy physics (HEP) with that of other areas where statistics is applied. In HEP fields it is common practice to seek “five sigma” of evidence before making a claim of a discovery. In contrast to the usual p ≤ 0.05 requirements, the difference is vast: five sigma corresponds to roughly a one-in-a-million p-value. However, if you view the five-sigma approach as a heuristic approximation to the point-null BF test, the results are much more comparable. The equivalent t statistic needed for rejection of the point null under the Bayesian test scales with the amount of data. HEP experiments often average over millions of data points, and it is in that multi-million region where the rejection threshold, rephrased in terms of the t statistic, is roughly five.

Concluding guidelines for applying Bayes factor tests

  • Don’t use point-null Bayesian tests unless the point null hypothesis is physically plausible.
  • Always perform a proper experimental design, including the consideration of frequentist implications of your Bayesian test.
  • Try to use an informative prior if you are using a point null test; non-informative priors are both controversial and extremely conservative. Ideally show the results under multiple reasonable priors.
  • Don’t blindly apply the Bayes factor test in sequential testing situations. You need to use decision theory.

References

   Daryl J. Bem. Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect. Journal of Personality and Social Psychology, 2011.

   Daryl J. Bem, Jessica Utts, and Wesley O. Johnson. Must psychologists change the way they analyze their data? A response to Wagenmakers, Wetzels, Borsboom, & van der Maas. 2011.

   James O. Berger and William H. Jefferys. The application of robust Bayesian analysis to hypothesis testing and Occam’s razor. Technical report, Purdue University, 1991.

   James O. Berger and Luis R. Pericchi. Objective Bayesian methods for model selection: Introduction and comparison. IMS Lecture Notes - Monograph Series, 2001.

   J. O. Berger and R. L. Wolpert. The Likelihood Principle. Lecture Notes - Monograph Series. Institute of Mathematical Statistics, 1988. ISBN 9780940600133. URL https://books.google.com.au/books?id=7fz8JGLmWbgC.

   George Casella and Roger L. Berger. Reconciling Bayesian and frequentist evidence in the one-sided testing problem. Journal of the American Statistical Association, 82(397):106–111, 1987. doi: 10.1080/01621459.1987.10478396.

   Hyun-Chul Cho and Shuzo Abe. Is two-tailed testing for directional research hypotheses tests legitimate? Journal of Business Research, 66(9):1261–1266, 2013. doi: 10.1016/j.jbusres.2012.02.023. URL http://www.sciencedirect.com/science/article/pii/S0148296312000550.

   Jiunn Tzon Hwang, George Casella, Christian Robert, Martin T. Wells, and Roger H. Farrell. Estimation of accuracy in testing. Annals of Statistics, 20(1):490–509, 1992. doi: 10.1214/aos/1176348534.

   Harold Jeffreys. Theory of Probability. 1939.

   V. E. Johnson and D. Rossell. On the use of non-local prior densities in Bayesian hypothesis tests. Journal of the Royal Statistical Society, 2010.

   Stuart J. Ritchie, Richard Wiseman, and Christopher C. French. Failing the future: Three unsuccessful attempts to replicate Bem’s ‘retroactive facilitation of recall’ effect. PLoS ONE, 2012.

   Christian P. Robert. A note on the Jeffreys-Lindley paradox. Statistica Sinica, 1993.

   Christian P. Robert, Nicolas Chopin, and Judith Rousseau. Harold Jeffreys’s Theory of Probability revisited. Statistical Science, 2009.

   Jeffrey N. Rouder, Paul L. Speckman, Dongchu Sun, Richard D. Morey, and Geoffrey Iverson. Bayesian t tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin & Review, 2009.

   Eric-Jan Wagenmakers, Ruud Wetzels, Denny Borsboom, and Han van der Maas. Why psychologists must change the way they analyze their data: The case of psi. Journal of Personality and Social Psychology, 2011a.

   Eric-Jan Wagenmakers, Ruud Wetzels, Denny Borsboom, and Han van der Maas. Why psychologists must change the way they analyze their data: The case of psi: Clarifications for Bem, Utts, and Johnson (2011). Journal of Personality and Social Psychology, 2011b.

How to do A/B testing with early stopping correctly

The amount of confusion on the internet about how to run a simple A/B test is amazing. The crux of the matter is this: if you are going to look at the results of your experiment as it runs, you cannot just repeatedly apply a 5% significance level t-test. Doing so WILL lead to false-positive rates way above 5%, usually on the order of 17-30%.

Most discussions of A/B testing do recognize this problem; however, the solutions they suggest are simply wrong. I discuss a few of these misguided approaches below. It turns out that easy-to-use, practical solutions were worked out by clinical statisticians decades ago, in papers with many thousands of citations. They call the setting “Group Sequential designs” [Bartroff et al.2013]. At the bottom of this post I give tables and references so you can use group sequential designs in your own experiments.

What's wrong with just testing as you go?

The key problem is this: whenever you look at the data (whether you run a formal statistical test or just eyeball it) you are making a decision about the effectiveness of the intervention. If the results look positive you may decide to stop the experiment early, and you will almost certainly stop it early if it looks like the intervention is giving very bad results.

Say you run a p = 0.05 statistical test at each step. On the very first step you have a 5% chance of a false positive (i.e. the intervention had no effect but it looks like it did). On each subsequent test, you have a non-zero additional probability of picking up a false positive. It’s obviously not an additional 5% each time, as the test statistics are highly correlated, but it turns out that the false positives accumulate rapidly. Looking at the results each day of a 10 day experiment with say 1000 data points per day will give you an accumulated false positive rate of about 17% (according to a simple numerical simulation) if you stop the experiment at the first p < 0.05 result you see.
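This inflation is easy to reproduce numerically. The sketch below is my own toy simulation (not the exact one behind the figures): each observation is standard normal under the null, and we peek once per day with a plain |z| > 1.96 test.

```python
import numpy as np

def peeking_false_positive_rate(n_days=10, per_day=1000, n_sims=20000, seed=0):
    """Fraction of zero-effect experiments that show |z| > 1.96 at some daily peek."""
    rng = np.random.default_rng(seed)
    # Daily sums of per_day standard-normal observations (null: no effect).
    daily = rng.standard_normal((n_sims, n_days)) * np.sqrt(per_day)
    cum = np.cumsum(daily, axis=1)              # running sum after each day
    n = per_day * np.arange(1, n_days + 1)      # samples seen so far
    z = cum / np.sqrt(n)                        # z-score at each daily peek
    return float(np.mean(np.any(np.abs(z) > 1.96, axis=1)))

rate = peeking_false_positive_rate()
print(rate)  # roughly 0.17-0.20, far above the nominal 0.05
```

The more often you peek, the worse the accumulated rate gets.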



Figure 1: 100 simulations of 0-effect (the null hypothesis) with no early stopping. Notice that at any point in time, roughly 5% of the paths are outside the boundaries (A 5% false positive rate is expected).



Figure 2: Paths terminated when they cross the boundary on the next step. 12 of 100 simulations are stopped, giving a very high 12% false positive rate!

The error-spending technique

The key idea of “Group Sequential” testing is that, under the usual assumption that our test statistic is normally distributed, it is possible to compute the false-positive probabilities at each stage exactly using dynamic programming. Since we can compute these probabilities, we can also adjust the test’s false-positive chance at every step so that the total false-positive rate is below the threshold α we want. This is known as the error-spending approach. Say we want to run the test daily for a 30 day experiment. We provide a 30 element sequence αi of per-step error chances, which sums to less than say α = 0.05. Using dynamic programming, we can then determine a sequence of thresholds, one per day, to use with the standard z-score test. The big advantage of this approach is that it is a natural extension of the usual frequentist testing paradigm.
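Production tools (ldbounds, GroupSeq) do this calibration with numerical integration, but the idea can be sketched with plain Monte Carlo. This is my own illustrative sketch, and the equal per-stage spending is just an arbitrary choice:

```python
import numpy as np

def error_spending_thresholds(alphas, n_sims=400000, seed=1):
    """Monte Carlo sketch of error spending: find two-sided z thresholds so that
    roughly an alphas[i] fraction of null paths first crosses at stage i."""
    rng = np.random.default_rng(seed)
    k = len(alphas)
    # Cumulative z-scores of a null experiment with equal data per stage.
    z = np.cumsum(rng.standard_normal((n_sims, k)), axis=1) / np.sqrt(np.arange(1, k + 1))
    alive = np.ones(n_sims, dtype=bool)
    thresholds = []
    for i, a in enumerate(alphas):
        # Quantile among still-running paths such that a fraction `a` of ALL
        # paths crosses for the first time at this stage.
        q = 1.0 - a * n_sims / alive.sum()
        c = float(np.quantile(np.abs(z[alive, i]), q))
        thresholds.append(c)
        alive &= np.abs(z[:, i]) <= c
    return thresholds

# Spend alpha = 0.05 evenly over 5 daily tests:
print(error_spending_thresholds([0.01] * 5))
```

The first threshold is just the two-sided z value for its αi (about 2.576 for 0.01); later thresholds account for the correlation with earlier peeks.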

The error spending approach has a lot of flexibility, as the error chances can be distributed arbitrarily between the days. Several standard allocations are commonly used in the research literature.

Of course, libraries are available for computing these boundaries in several languages. Tables for the most common settings are also available. I’ve duplicated a few of these tables below for ease of reference.

Recommended approach

  1. Choose an α-spending function, from those discussed below. The Pocock approach is the simplest.
  2. Fix the maximum length of your experiment, and the number of times you wish to run the statistical test. E.g. daily for a 30 day experiment.
  3. Look up, in the tables below (or calculate using the ldbounds or GroupSeq R packages), the z-score thresholds for each of the tests you will run. For example, for the Pocock approach, α = 0.01, 10 tests max, use z = 3.117 for every test. For the O’Brien-Fleming approach it will be the sequence {8,5.82,4.95,4.29,3.8,3.46,3.19,2.98,2.80,2.66}.
  4. Start the experiment.
  5. For each test during the course of the experiment, compute the z-score and compare it to the threshold, and optionally stop the experiment if the score exceeds the threshold given in the lookup table.
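The monitoring loop itself is trivial. A minimal sketch, where the threshold sequences are the ones quoted above and the z-score sequence is invented for illustration:

```python
def first_stop(z_scores, thresholds):
    """Return the index of the first test whose |z| exceeds its threshold, else None."""
    for k, (z, c) in enumerate(zip(z_scores, thresholds)):
        if abs(z) > c:
            return k
    return None

# Pocock thresholds for alpha = 0.01, 10 tests: the same value every time.
pocock = [3.117] * 10
# O'Brien-Fleming style thresholds for the same setting:
obf = [8.0, 5.82, 4.95, 4.29, 3.8, 3.46, 3.19, 2.98, 2.80, 2.66]

zs = [0.5, 1.1, 2.2, 2.9, 3.3, 3.4, 3.2, 3.0, 2.9, 2.8]
print(first_stop(zs, pocock))  # 4 (the fifth test: z = 3.3 > 3.117)
print(first_stop(zs, obf))     # 6 (the seventh test: z = 3.2 > 3.19)
```

Note how the O’Brien-Fleming boundaries make early stopping much harder than Pocock’s constant boundary.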

Real valued example

The simplest case of A/B testing is when each observation for each user is a numerical value, such as a count of activity that day, the sales total, or some other usage statistic. In this case, when we have n_a data points from group a and n_b from group b, with means μ_a, μ_b and standard deviations σ_a, σ_b, the z-score is:

z = (μ_a − μ_b) / √(σ_a²/n_a + σ_b²/n_b).

For the kth test, just compare the z value here to the kth threshold from the appropriate table below; if it exceeds the threshold you may stop the experiment.
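As a function (a direct transcription of the formula; the numbers in the usage line are made up):

```python
import math

def z_score_means(mu_a, sigma_a, n_a, mu_b, sigma_b, n_b):
    """Two-sample z-score for comparing group means with known variances."""
    return (mu_a - mu_b) / math.sqrt(sigma_a**2 / n_a + sigma_b**2 / n_b)

# Group a averages 10.2, group b 10.0, both with sd 2.0 and 1000 observations:
z = z_score_means(10.2, 2.0, 1000, 10.0, 2.0, 1000)
print(round(z, 3))  # 2.236
```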

Conversions example

If in our A/B test we are testing whether the click-through rate or purchase chance improves, we use a slightly different formula. Let p be the click-through proportion and n the total views in each group (i.e. n_a p_a = conversions in group A). Then

z = (p_a − p_b) / √(p_a(1 − p_a)/n_a + p_b(1 − p_b)/n_b).

As above, for the kth test, just compare the z value here to the kth entry of the appropriate table below; if it exceeds the threshold you may stop the experiment.
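The same formula as a function over raw counts (the counts in the usage line are invented):

```python
import math

def z_score_conversions(conv_a, n_a, conv_b, n_b):
    """Two-sample z-score for comparing conversion proportions."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    var = p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b
    return (p_a - p_b) / math.sqrt(var)

# 1100 conversions out of 10000 views vs 1000 out of 10000:
z = z_score_conversions(1100, 10000, 1000, 10000)
print(round(z, 3))  # 2.307
```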

Assumptions

  • We are assuming the test statistic is normally distributed, with known variance. If you did STATS101, you are probably ok with this assumption, but those of you who did STATS102 are no doubt objecting. Shouldn’t you be using a t-test or chi-square test? What about the exact binomial test for conversions? The fact is that unless you are getting only a tiny amount of data between tests, you just don’t need to worry about this. If you are getting tiny amounts of data, you can’t hope to pick up any difference between the A/B groups over short time scales. Just run your test less regularly!
    The effect sizes we hope to see in A/B testing are much smaller than those in other areas of statistics such as clinical trials, so we usually can’t expect to get statistical significance in the small-data regime where the normality approximation breaks down.
  • Fixed maximum experiment length. This is an annoying problem. Often we would like to just keep running the experiment until we see some sort of effect. The difficulty is that we have to spread the chance of false positives across the duration of the experiment. We could potentially use an error-spending function that drops off rapidly, so that the total error probability is summable. However, I would not recommend this. Instead, do a power calculation to determine the length of the experiment needed to pick up the minimum effect size that you would care about in practice, then use that length for the experiment.
    In practice you will always see some sort of effect/uplift if you run a long enough experiment. There is no real-world situation where the A/B groups behave exactly the same.
  • Equally distributed tests. We are assuming above that we gather the exact same amount of data between tests. Fortunately this assumption is not crucial to the correctness of the test, and it is actually quite easy to correct for differing intervals using the R software discussed. Say you are in the common situation where you see much less data over the weekend than on a weekday. The ldbounds R package lets you calculate the sequence of z-score boundaries for variable intervals: you just feed in the time between tests as a normalized sequence that sums to 1. For example, for a 7 day experiment starting on Monday, where the expected volume of data is halved on Saturday and Sunday, you would use something like:
    b <- bounds(cumsum(c(rep(1/6, 5), 1/12, 1/12)), iuse=c(1, 1), alpha=c(0.025, 0.025)) 
    print(b$upper.bounds) 
    [1] 5.366558 3.710552 2.969718 2.538649 2.252156 2.178169 2.088238

    The iuse argument controls the α-spending function used (here the O’Brien-Fleming approach) and alpha gives the false-positive rate on either side (we are doing a two-sided test with a total false-positive rate of 0.05 here).

  • Two-sided tests. The approach described above is that of a two-sided test. I allow for the effect of the intervention to be both potentially positive and negative. This is obviously more conservative then assuming it can only be positive, but I think in practice you should always use two-sided tests for A/B experiments. It’s hard to justify the assumption that the intervention can’t possibly have a negative effect!
  • Why should I use this approach when Wald’s SPRT test is optimal? Wald’s theory of sequential testing is based around testing a single point hypothesis against another point hypothesis. While this approach is nearly as good as possible for the simple point-vs-point case, there is no clear generalization to the case of testing a point hypothesis of zero effect against the “composite” hypothesis of non-zero effect. In Wald’s 1945 book, several approaches are suggested, but they have known deficiencies [Sec 5.2 Tartakovsky et al.2013]. Lorden’s 2-SPRT [Lorden1976] is the modern SPRT variant that is most commonly used. It requires setting an indifference region around 0 instead of fixing the measurement points in advance, so the experimental design decisions are a little different. I intend to write more about the 2-SPRT in the future.

Bayesian approaches

Some people attempt to perform Bayesian tests by putting a prior on their data and computing “credible intervals” (or equivalently posterior probabilities) instead of confidence intervals. The classical “likelihood principle” implies that Bayesian statistical quantities do not change under sequential testing. However, under the Bayesian methodology estimation is fundamentally different from hypothesis testing, and trying to convert interval estimates into tests doesn’t work. The intuition is fairly simple: if Bayesian tests are immune to the sequential testing problem but frequentist tests are not, then credible intervals, which agree with frequentist confidence intervals on simple problems, cannot be the basis of a valid Bayesian test.

In general the best approach is to combine the Bayesian estimates with decision theory. This involves assigning a cost to each datapoint gathered, as well as determining the cost due to the test coming to the wrong solution.

I’m a strong proponent of Bayesian statistics, but I find it next to impossible in practice to determine these two costs. The frequentist approach of controlling false-positive rates is much easier to apply in a business context. The main exception is for per-user personalization or advert placement, where the business decisions have to be made automatically. In that case the techniques studied for multi-armed bandits or contextual bandits are natural automatic applications of the decision theory approach. They are far outside the scope of this post however.

The other Bayesian approach

The most commonly suggested Bayesian approach is the Bayes Factor (BF) test, a potential drop-in replacement for frequentist tests. Indeed, A. Wald (the progenitor of sequential testing) suggested a similar approach to a BF test way back in his 1945 book [Sec. 4.1.3, Wald1945].

Unfortunately the Bayes factor test often gives counter-intuitive and sometimes radically different results from frequentist tests. This is known as Lindley’s paradox [Lindley1957]. This doesn’t invalidate BF results, but you do need to fully understand the implications to use them in practice. I intend to write another post going into detail about Bayes factor tests, and their pitfalls and advantages over frequentist approaches.

Interactive tool

The GroupSeq R package provides a very easy to use GUI for calculating the stopping boundaries for most of the above settings. It can also do power calculations and stopping intervals. I’ve put a few screenshots of it in action below.

[Screenshots of the GroupSeq GUI]

Power calculations

Power calculations are typically used with frequentist experiments to determine the length of the experiment required to pick up effects of a certain magnitude. Power calculations can still be done for group sequential tests, using the R tools mentioned above. Like with the z-score boundary calculations, multiple integrals need to be evaluated numerically, so it is somewhat of a black box.

Power calculations can also be used to select from the various error spending approaches above. Just run the power calculation for each approach, then choose the one that gives the smallest required experiment length.

Pocock 1977 tables (for two-sided tests)

Pocock’s approach uses the exact same z-score threshold at every test. Values for larger N (the total number of tests) or irregularly spaced test intervals may be calculated using the GroupSeq R package. The per-test α values here illustrate how much more conservative you need to be: for example, when running 10 tests during the experiment, you effectively have to run each z-test at the α = 0.01 level to get an overall 0.05 false positive rate.






           For α = 0.05          For α = 0.01
  N     test α   z-score      test α   z-score
  1     0.05     1.960        0.01     2.576
  2     0.0294   2.178        0.0056   2.772
  3     0.0221   2.289        0.0041   2.873
  4     0.0182   2.361        0.0033   2.939
  5     0.0158   2.413        0.0028   2.986
  6     0.0142   2.453        0.0025   3.023
  8     0.0120   2.512        0.0021   3.078
 10     0.0106   2.555        0.0018   3.117
 12     0.0097   2.585        0.0016   3.147
 15     0.0086   2.626        0.0015   3.182
 20     0.0075   2.672        0.0013   3.224





O’Brien-Fleming style tables (for two-sided tests)

This approach is essentially the most natural generalization of Wald’s SPRT test to the group-sequential case with composite hypotheses. It is much less likely to stop at the early stages than the other approaches, but more likely to stop in the later tests. Instead of reporting the tables from their paper, I give the stopping boundaries for a version of their approach [O’Brien and Fleming1979] adapted to the error-spending algorithm implemented in the ldbounds library.

In the computation below we cap the z-score stopping boundaries at 8, as this is the default for the ldbounds library.




  N   α      z-scores (two-sided tests)
  1   0.05   1.96
  1   0.01   2.576
  2   0.05   2.963, 1.969
  2   0.01   3.801, 2.578
  3   0.05   3.710, 2.511, 1.993
  3   0.01   4.723, 3.246, 2.589
  4   0.05   4.333, 2.963, 2.359, 2.014
  4   0.01   5.493, 3.802, 3.044, 2.603
  5   0.05   4.877, 3.357, 2.680, 2.290, 2.031
  5   0.01   6.168, 4.286, 3.444, 2.947, 2.615
  6   0.05   5.367, 3.711, 2.970, 2.539, 2.252, 2.045
  6   0.01   6.776, 4.740, 3.803, 3.257, 2.892, 2.626
  8   0.05   6.232, 4.331, 3.481, 2.980, 2.644, 2.401, 2.215, 2.066
  8   0.01   8.000, 8.000, 4.442, 3.807, 3.381, 3.071, 2.833, 2.643
 10   0.05   6.991, 4.899, 3.930, 3.367, 2.989, 2.715, 2.504, 2.336, 2.197, 2.081
 10   0.01   8.000, 8.000, 5.071, 4.289, 3.812, 3.463, 3.195, 2.981, 2.804, 2.656
 15   0.05   8.000, 8.000, 4.900, 4.189, 3.722, 3.382, 3.120, 2.910, 2.738, 2.593,
             2.469, 2.361, 2.266, 2.182, 2.107
 15   0.01   8.000, 8.000, 8.000, 8.000, 4.731, 4.295, 3.964, 3.699, 3.481, 3.297,
             3.139, 3.002, 2.881, 2.774, 2.678
 20   0.05   8.000, 8.000, 8.000, 4.916, 4.337, 3.942, 3.638, 3.394, 3.193, 3.024,
             2.880, 2.754, 2.643, 2.545, 2.457, 2.378, 2.305, 2.239, 2.179, 2.123
 20   0.01   8.000, 8.000, 8.000, 8.000, 8.000, 5.076, 4.615, 4.302, 4.050, 3.837,
             3.653, 3.494, 3.353, 3.229, 3.117, 3.016, 2.925, 2.841, 2.764, 2.693
 25   0.05   8.000, 8.000, 8.000, 8.000, 4.919, 4.436, 4.093, 3.820, 3.594, 3.404,
             3.241, 3.100, 2.975, 2.865, 2.766, 2.676, 2.595, 2.521, 2.452, 2.389,
             2.331, 2.277, 2.226, 2.179, 2.134
 25   0.01   8.000, 8.000, 8.000, 8.000, 8.000, 8.000, 5.517, 4.878, 4.565, 4.315,
             4.107, 3.928, 3.769, 3.629, 3.504, 3.390, 3.287, 3.193, 3.107, 3.027,
             2.953, 2.884, 2.820, 2.760, 2.703
 30   0.05   8.000, 8.000, 8.000, 8.000, 8.000, 4.933, 4.505, 4.204, 3.956, 3.748,
             3.568, 3.413, 3.276, 3.154, 3.045, 2.946, 2.857, 2.775, 2.700, 2.631,
             2.566, 2.506, 2.451, 2.398, 2.349, 2.303, 2.260, 2.219, 2.180, 2.143
 30   0.01   8.000, 8.000, 8.000, 8.000, 8.000, 8.000, 8.000, 8.000, 4.986, 4.731,
             4.510, 4.318, 4.144, 3.991, 3.853, 3.728, 3.615, 3.512, 3.416, 3.329,
             3.247, 3.172, 3.101, 3.035, 2.973, 2.914, 2.859, 2.807, 2.758, 2.711
 50   0.05   8.000, 8.000, 8.000, 8.000, 8.000, 8.000, 8.000, 8.000, 5.185, 4.919,
             4.669, 4.454, 4.279, 4.118, 3.976, 3.848, 3.731, 3.625, 3.526, 3.436,
             3.352, 3.274, 3.201, 3.132, 3.068, 3.008, 2.951, 2.898, 2.847, 2.798,
             2.752, 2.709, 2.667, 2.627, 2.589, 2.552, 2.517, 2.484, 2.452, 2.421,
             2.391, 2.362, 2.334, 2.307, 2.281, 2.256, 2.232, 2.208, 2.186, 2.164
 50   0.01   8.000, 8.000, 8.000, 8.000, 8.000, 8.000, 8.000, 8.000, 8.000, 8.000,
             8.000, 8.000, 8.000, 8.000, 5.173, 4.938, 4.739, 4.589, 4.452, 4.339,
             4.230, 4.133, 4.040, 3.954, 3.874, 3.797, 3.726, 3.658, 3.594, 3.533,
             3.475, 3.419, 3.367, 3.316, 3.268, 3.222, 3.178, 3.135, 3.095, 3.055,
             3.018, 2.981, 2.946, 2.912, 2.880, 2.848, 2.817, 2.788, 2.759, 2.731
 60   0.05   8.000, 8.000, 8.000, 8.000, 8.000, 8.000, 8.000, 8.000, 8.000, 8.000,
             5.292, 4.977, 4.718, 4.525, 4.372, 4.229, 4.101, 3.984, 3.876, 3.776,
             3.684, 3.598, 3.518, 3.443, 3.373, 3.307, 3.244, 3.185, 3.129, 3.076,
             3.025, 2.977, 2.932, 2.888, 2.846, 2.806, 2.767, 2.730, 2.695, 2.661,
             2.628, 2.596, 2.566, 2.536, 2.508, 2.480, 2.453, 2.427, 2.402, 2.378,
             2.355, 2.332, 2.309, 2.288, 2.267, 2.246, 2.227, 2.207, 2.188, 2.170
 60   0.01   8.000, 8.000, 8.000, 8.000, 8.000, 8.000, 8.000, 8.000, 8.000, 8.000,
             8.000, 8.000, 8.000, 8.000, 8.000, 8.000, 8.000, 5.036, 4.902, 4.778,
             4.662, 4.539, 4.439, 4.345, 4.258, 4.170, 4.092, 4.019, 3.947, 3.880,
             3.817, 3.756, 3.698, 3.643, 3.590, 3.539, 3.491, 3.444, 3.399, 3.356,
             3.315, 3.275, 3.236, 3.199, 3.163, 3.128, 3.095, 3.062, 3.030, 3.000,
             2.970, 2.941, 2.913, 2.886, 2.859, 2.834, 2.808, 2.784, 2.760, 2.737




References

   Jay Bartroff, Tze Leung Lai, and Mei-Chiung Shih. Sequential Experimentation in Clinical Trials. Springer, 2013.

   D. V. Lindley. A statistical paradox. Biometrika, 44:187–192, 1957.

   Gary Lorden. 2-SPRT’s and the modified Kiefer-Weiss problem of minimizing an expected sample size. The Annals of Statistics, 4:281–291, 1976.

   Peter C. O’Brien and Thomas R. Fleming. A multiple testing procedure for clinical trials. Biometrics, 35, 1979.

   Stuart J. Pocock. Group sequential methods in the design and analysis of clinical trials. Biometrika, 64, 1977.

   Alexander Tartakovsky, Igor Nikiforov, and Michèle Basseville. Sequential Analysis: Hypothesis Testing and Changepoint Detection. CRC Press, 2013.

   A. Wald. Sequential tests of statistical hypotheses. Ann. Math. Statist., 16(2):117–186, 06 1945. doi: 10.1214/aoms/1177731118. URL http://dx.doi.org/10.1214/aoms/1177731118.

Machine Learning in Haskell!

A colleague of mine has recently released a deep learning library for Haskell called Grenade. Probably the most interesting aspect is the use of dependent typing and other sophisticated functional programming techniques which are not normally seen in machine learning libraries. At the moment the library is at the proof-of-concept stage. You can basically use any combination of convolutional and fully connected layers, but training is single-threaded using BLAS.

A curated list of interesting ICML 2016 papers

I've gone through the hundreds of ICML 2016 papers and curated a subset that looked interesting to me. In no particular order:

Faster Convex Optimization: Simulated Annealing with an Efficient Universal Barrier

Jacob Abernethy, Elad Hazan

Variance Reduction for Faster Non-Convex Optimization

Zeyuan Allen-Zhu, Elad Hazan

Improved SVRG for Non-Strongly-Convex or Sum-of-Non-Convex Objectives

Zeyuan Allen-Zhu, Yang Yuan

Even Faster Accelerated Coordinate Descent Using Non-Uniform Sampling

Zeyuan Allen-Zhu, Zheng Qu, Peter Richtarik, Yang Yuan

Deep Speech 2 : End-to-End Speech Recognition in English and Mandarin

Dario Amodei, Rishita Anubhai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, JingDong Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, Erich Elsen, Jesse Engel, Linxi Fan, Christopher Fougner, Awni Hannun, Billy Jun, Tony Han, Patrick LeGresley, Xiangang Li, Libby Lin, Sharan Narang, Andrew Ng, Sherjil Ozair, Ryan Prenger, Sheng Qian, Jonathan Raiman, Sanjeev Satheesh, David Seetapun, Shubho Sengupta, Chong Wang, Yi Wang, Zhiqian Wang, Bo Xiao, Yan Xie, Dani Yogatama, Jun Zhan, Zhenyao Zhu

On the Iteration Complexity of Oblivious First-Order Optimization Algorithms

Yossi Arjevani, Ohad Shamir

Black-box Optimization with a Politician

Sebastien Bubeck, Yin Tat Lee

Importance Sampling Tree for Large-scale Empirical Expectation

Olivier Canevet, Cijo Jose, Francois Fleuret

CryptoNets: Applying Neural Networks to Encrypted Data with High Throughput and Accuracy

Ran Gilad-Bachrach, Nathan Dowlin, Kim Laine, Kristin Lauter, Michael Naehrig, John Wernsing

Solving Ridge Regression using Sketched Preconditioned SVRG

Alon Gonen, Francesco Orabona, Shai Shalev-Shwartz

Variance-Reduced and Projection-Free Stochastic Optimization

Elad Hazan, Haipeng Luo

On Graduated Optimization for Stochastic Non-Convex Problems

Elad Hazan, Kfir Yehuda Levy, Shai Shalev-Shwartz

Doubly Robust Off-policy Value Evaluation for Reinforcement Learning

Nan Jiang, Lihong Li

Stochastic Variance Reduced Optimization for Nonconvex Sparse Learning

Xingguo Li, Tuo Zhao, Raman Arora, Han Liu, Jarvis Haupt

A Variational Analysis of Stochastic Gradient Algorithms

Stephan Mandt, Matthew Hoffman, David Blei

Stochastic Variance Reduction for Nonconvex Optimization

Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, Alex Smola

A Superlinearly-Convergent Proximal Newton-type Method for the Optimization of Finite Sums

Anton Rodomanov, Dmitry Kropotov

SDCA without Duality, Regularization, and Individual Convexity

Shai Shalev-Shwartz

Training Neural Networks Without Gradients: A Scalable ADMM Approach

Gavin Taylor, Ryan Burmeister, Zheng Xu, Bharat Singh, Ankit Patel, Tom Goldstein

Finally an alternative to SGD: Stochastic variance reduction methods

This post describes the newly developed class of variance reduction techniques in stochastic optimisation, which are fast and simple alternatives to the traditional stochastic gradient descent (SGD) method. I give a light introduction here, focused on actually applying the methods in practice.

But first: Ye Old SGD

Quite possibly the most important algorithm in machine learning, SGD is simply a method for optimizing objective functions that consist of a sum or average of terms:

f(x) = (1/n) Σ_{i=1}^n f_i(x).

The SGD method maintains an estimate x_k of the solution at step k, and at each step:

  1. Sample: Pick an index j uniformly at random (a datapoint index).
  2. Step: Update x:
    x_{k+1} = x_k − γ ∇f_j(x_k).

At its most basic, this update is just an approximation of the classical gradient descent technique, where we take the non-random step x_{k+1} = x_k − γ ∇f(x_k). The step size parameter γ here is just a constant, which depending on the problem may need to be adjusted down at each step. Variance reduction methods build upon SGD by modifying the core update in various ways.
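A minimal sketch of SGD on a synthetic least-squares problem (the data, step size, and epoch count are arbitrary choices for illustration):

```python
import numpy as np

def sgd_least_squares(A, b, gamma=0.01, epochs=50, seed=0):
    """SGD on f(x) = (1/n) * sum_i 0.5 * (a_i . x - b_i)^2."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(epochs * n):
        j = rng.integers(n)                    # sample a datapoint index
        grad_j = (A[j] @ x - b[j]) * A[j]      # gradient of the j-th term
        x -= gamma * grad_j                    # SGD step
    return x

rng = np.random.default_rng(42)
A = rng.standard_normal((100, 5))
x_true = rng.standard_normal(5)
b = A @ x_true                                 # noiseless synthetic targets
x_hat = sgd_least_squares(A, b)
print(np.linalg.norm(x_hat - x_true))          # close to zero
```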

SAGA

The SAGA method is a simple modification of SGD, where we remember the last gradient evaluated for each datapoint, and use this to correct the SGD step to reduce its variance. These past gradients are stored in a table. We denote the previous gradient for datapoint j as ∇f_j(φ_j^k). The core step is as follows:

  1. Sample: Pick an index j uniformly at random.
  2. Step: Update x using ∇f_j(x_k), ∇f_j(φ_j^k) and the table average:
    x_{k+1} = x_k − γ [ ∇f_j(x_k) − ∇f_j(φ_j^k) + (1/n) Σ_{i=1}^n ∇f_i(φ_i^k) ].   (1)

  3. Update the table: Set φ_j^{k+1} = x_k, and store ∇f_j(φ_j^{k+1}) in the table. All other entries in the table remain unchanged. (The point φ_j^{k+1} itself is not explicitly stored, only its gradient.)

Source code for SAGA and the accelerated variant point-saga is available on github. Some points about the algorithm:

  • Storing the past gradients is typically fast and low cost. For logistic regression or least squares, it reduces to storing a single real value per datapoint instead of a full gradient. Similar reductions apply to neural networks. When the storage cost is high, the SVRG method described below can be used instead.
  • The table average (1/n) Σ_{i=1}^n ∇f_i(φ_i^k) is of course cached rather than recalculated at each step.
  • Notice that there are really only two extra terms on top of the usual SGD step, and they cancel in expectation with respect to j. So the expected step direction is a gradient step, just like with SGD.
  • The step above assumes you already have a table of gradients. If you are just starting the algorithm, do the first pass over the data in-order (instead of random), and use the table average over only the data seen so far in the step. Alternatively, just do regular SGD steps during the first pass.
  • The requirement for randomly sampling datapoints instead of doing in-order passes turns out to be absolutely crucial for it to work.
  • Implementing the step efficiently when the gradients are sparse is a little subtle, but possible. See the SAGA paper for details and example code.
  • SAGA supports non-differentiable regularisers through the use of proximal operators. See the paper for details.
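Putting the pieces together, here is a minimal dense-data sketch of SAGA on a synthetic least-squares problem. It is my own illustration, not the optimized implementation from the paper, and for simplicity it initializes the gradient table with one full pass at the starting point rather than the in-order first pass described above:

```python
import numpy as np

def saga_least_squares(A, b, gamma=0.01, epochs=50, seed=0):
    """SAGA on f(x) = (1/n) * sum_i 0.5 * (a_i . x - b_i)^2.
    For least squares only the scalar residual a_i . x - b_i needs storing."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    residuals = A @ x - b             # one stored scalar per datapoint
    table_avg = (residuals @ A) / n   # cached average of the stored gradients
    for _ in range(epochs * n):
        j = rng.integers(n)
        new_res = A[j] @ x - b[j]
        # SAGA step: new gradient - old stored gradient + table average.
        x -= gamma * ((new_res - residuals[j]) * A[j] + table_avg)
        # Update the table entry and its cached average.
        table_avg += ((new_res - residuals[j]) / n) * A[j]
        residuals[j] = new_res
    return x

rng = np.random.default_rng(42)
A = rng.standard_normal((100, 5))
x_true = rng.standard_normal(5)
b = A @ x_true
x_hat = saga_least_squares(A, b)
print(np.linalg.norm(x_hat - x_true))  # close to zero
```

Note how storing just the residual per datapoint realizes the single-real-per-datapoint storage reduction mentioned above.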

Why variance reduction methods?

VR methods like SAGA have entirely different convergence properties than SGD. On strongly convex problems they converge linearly: reaching accuracy ε takes O(log(1/ε)) iterations, instead of the much slower O(1/ε) iterations required by the SGD method (which additionally requires careful iterate averaging). Linear convergence for a stochastic method is quite remarkable; until recently no such methods were known. It’s also fast linear convergence: compared to the theoretical rate for regular gradient descent, each step makes about 1/3 the progress, but costs only a single datapoint evaluation rather than a full batch gradient. For large datasets that can be millions of times faster. A schematic illustration of the practical speedup is below, labeled as "incremental gradient":

Convergence Schematic

Other variance reduction techniques

SAGA is perhaps the most straightforward of the variance reduction techniques. The other VR methods have advantages in certain situations; there is no clear best method in all cases. We list the main methods here; each has several variants as well:

SAG
is the earliest of these VR methods. Its step doesn’t follow the gradient direction in expectation; instead it accepts a small bias in exchange for potentially lower variance in the gradient estimate. Unfortunately SAG is hard to extend from the point of view of the theory.
SVRG
is the direct predecessor of SAGA that makes a different trade-off. It avoids storing gradients, but at the expense of computing two gradients per step instead of one. It also has a more complex double loop formulation, where the outer loop requires a recalculation of the full-data gradient. In most cases the extra computation makes it slower than SAGA, but it does have some applications.
Finito/Miso
take a step derived indirectly from the dual formulation of the problem, while avoiding any dual computation. They don’t work efficiently on sparse problems, but are perhaps the fastest methods on dense problems.
SDCA
is a purely dual method. It can be applied to some non-differentiable problems, unlike the other methods here, but it is complex to apply efficiently to other problems such as logistic regression. It also has the nice property of avoiding tunable parameters on problems with L2 regularisation. For practical application it crucially requires a non-degenerate amount of strong convexity though.

Weighted random sampling with replacement with dynamic weights

Weighted random sampling from a set is a common problem in applications, and library support for it is generally good when you can fix the weights in advance. In applications, though, it is more common to want to change the weight of each instance right after you sample it. This seemingly simple operation doesn't seem to be supported in any of the random number libraries I've looked at.

If you try googling for a solution, you find lots of papers and stack overflow posts on reservoir sampling, but nothing useful for solving this problem. After digging through Knuth and reading old paywalled papers, I've managed to piece together an approach that is extremely fast and easy to implement. This post details that method and provides a simple Python implementation. I have made a fast Cython version available on github also.

First some notation. We want to sample an index 0 to N-1, according to an array of weights w[i]. These weights form the unnormalized probability distribution we want to sample from, i.e. each instance i should have probability w[i]/sum(w) of being chosen. Ideally we want an algorithm that gives constant time sampling and constant time weight mutation operations.

The simple slow approach: rejection sampling

Normally I avoid wasting time on approaches that don't work well in practice, however the simple rejection sampling approach to the problem turns out to be the vital building block of the algorithms that do work.

The rejection sampling approach is only a few lines of Python:
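A minimal version of the rejection sampler (the function and variable names here are my own, not necessarily the original post's):

```python
import random

def rejection_sample(weights):
    """Sample an index i with probability weights[i] / sum(weights)."""
    w_max = max(weights)
    while True:
        i = random.randrange(len(weights))          # pick an index uniformly
        if random.random() * w_max <= weights[i]:   # accept with prob w[i]/w_max
            return i
```

Given index i, the acceptance probability is w[i]/w_max, so the overall probability of returning i is proportional to w[i].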

The idea is simple. We sample uniformly from the indices, then do a rejection sampling correction to account for the actual non-uniformity of the data. The best way to visualise what it's doing is to consider it as picking a point in 2 dimensions uniformly, then doing a reject-or-accept operation based on whether the point is under the "graph" of the distribution or not. This is shown schematically as the yellow shaded regions below.

The general idea of rejection sampling and this graph interpretation is quite subtle, and I'm not going to attempt to explain it here. If you just stare at it for a while you should be able to convince yourself that it works.

Unfortunately, rejection sampling like this is not practical when the weights are uneven. The rejection probability depends on the magnitude of the largest weight compared to the average, and that can be very large. Imagine if the first eight boxes above were 1% full and the last 100% full. It's going to reject roughly 80% of the time. It can of course be much worse when n is larger.

Making it practical

Rejection sampling can be very fast when all the weights are similar. For instance, suppose all the weights are in the interval [1,2]. Then the acceptance probability w[i]/w_max is always more than half, and the expected number of loops until acceptance is at most 2. It's not only expected constant time, but it's fast in practice as well. More generally, this is true whenever the interval is of the form [2^i, 2^(i+1)].

This leads to the idea used by most of the practical methods: group the data into "levels", where each level is an interval of that form. We can then sample a level, followed by sampling within the level with rejection sampling. Matis et al. (2003) show that the level sampling can be done in O(log*(n)) time, where log* is the iterated logarithm, a slowly growing function that is at most 5 for any practical n. Effectively it is an expected constant-time sampling scheme.

We don't recommend using the Matis scheme though, as its practical performance is hampered by its complexity. Instead, we suggest using a method that has a (weak) dependence on the size of the weights, which we detail below.

Consider the intervals [2^i, 2^(i+1)]. The number of unique intervals ("levels") we need to consider is just the log2 of the ratio of the largest and smallest weights, which on practical problems won't be more than 20, corresponding to a 1 to 1 million difference. The extra overhead of the Matis method is not worth it to reduce the constant from 20 to 5. In practice the Matis method requires building a tree structure and updating it whenever the weights change, which is way slower than traversing a 20 element sequential array.

Let's be a little more concrete. The algorithm maintains a list of levels, ordered from largest to smallest. Each level i consists of a list of the elements that fall within that level's range: [2^i, 2^(i+1)]. The list for each level need not be sorted or otherwise ordered. To sample an element from the set, we first sample a level, then perform rejection sampling within that level. In Python, it looks like:
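A sketch along these lines, with class and attribute names of my own choosing (the real implementation linked from this post differs in its details):

```python
import math
import random

class LevelSampler:
    """Level-based weighted sampling: group weights into power-of-two
    levels, sample a level by total weight, then rejection-sample
    within the level (acceptance probability always > 1/2)."""

    def __init__(self, weights):
        self.weights = list(weights)
        self.levels = {}        # level index -> list of element indices
        self.level_weight = {}  # level index -> total weight in that level
        for i, w in enumerate(self.weights):
            lvl = math.frexp(w)[1]  # w lies in [2^(lvl-1), 2^lvl)
            self.levels.setdefault(lvl, []).append(i)
            self.level_weight[lvl] = self.level_weight.get(lvl, 0.0) + w

    def sample(self):
        # Sample a level proportional to its total weight, by a linear
        # scan over the (small) set of levels.
        r = random.random() * sum(self.level_weight.values())
        for lvl, lw in self.level_weight.items():
            if r < lw:
                break
            r -= lw
        # Rejection sampling within the chosen level: each weight is at
        # least half the level's upper bound 2^lvl, so this terminates
        # after an expected <= 2 iterations.
        members = self.levels[lvl]
        level_max = 2.0 ** lvl
        while True:
            i = members[random.randrange(len(members))]
            if random.random() * level_max < self.weights[i]:
                return i
```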

The full class, including weight updating, is on GitHub. Notice that we use a linear-time algorithm for sampling the levels (a cumulative distribution table lookup). Alternative methods could be used, such as a balanced binary tree or Walker's alias method (see below). The only difficulty is that we may want to change the weight of an element after every sample, so any pre-computation needed for the level sampling must be fast. It doesn't seem worth using anything more complicated here in practice, since the number of levels is so small (usually less than 20, as mentioned above).

Enhancements

A few small changes can improve usability and performance. The rejection sampling actually needs only a single random sample instead of two. We can take a U[0,1] sample and multiply by level_size: the integer part is idx_in_level and the remainder is u_lvl.
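As a sketch of this trick (the helper name `split_sample` is mine):

```python
import random

def split_sample(level_size):
    """One U[0,1) draw supplies both values: the integer part picks the
    slot within the level, the fractional part is the (independent)
    uniform height used for the accept/reject test."""
    u = random.random() * level_size
    idx_in_level = int(u)      # uniform in 0 .. level_size-1
    u_lvl = u - idx_in_level   # uniform in [0, 1), independent of idx
    return idx_in_level, u_lvl
```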

When updating weights, we may need to delete an element from the middle of a level's index list. For example, imagine we need to move element 6 from the [2,4] bucket to the [1,2] bucket in the diagram above. We store the indices in a contiguous array for fast O(1) lookup, so at first this looks like a problem. However, since the order within the lists doesn't matter, we can move the last element of the list into the slot we want to delete from, then delete from the end of the list.
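The swap-with-last deletion can be sketched as follows (helper name mine):

```python
def swap_remove(lst, pos):
    """Delete lst[pos] in O(1) by overwriting it with the last element,
    then popping the end. Valid because order within a level's index
    list doesn't matter. Works even when pos is the last position."""
    lst[pos] = lst[-1]
    lst.pop()
```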

In the sample code given, we keep a level_max array which just contains the upper bounds for each level. With a little extra code we could instead change this to be the largest element that has been in that bucket so far. This could lower the rejection rate a little at the cost of a few more operations maintaining the level_max array.

Other methods

Marsaglia et al. 2004 describe an early method that also arranges the data into levels, but instead of rejection sampling it spreads each datapoint's probability mass across multiple levels. I believe updating the weights under their scheme would be quite complex, although in big-O terms it should be roughly the same as the algorithm I described above.

Walker's alias method

This post is concerned with methods that can be used when we wish to modify the weights dynamically as the algorithm runs. However, I would be remiss not to mention the alias method, which, given a fixed set of weights at initialization time, constructs a table that allows extremely efficient constant-time sampling without any of the complexity of the other approaches I mentioned.

The key idea is simple. Take a table like the one used in the rejection sampling method. Instead of normalizing each bucket by w_max, normalize by sum(w)/n. When doing this, some buckets will overflow, as the probability within them can be larger than sum(w)/n. So we take the excess and spread it among the buckets that are not full. It's not too difficult to do this in such a way that each bucket is exactly full and contains at most two indices. Now when we go to sample, we pick a bucket uniformly as before, but within the bucket there is no longer a rejection option, just two indices to choose from.
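A sketch of the table construction, using Vose's well-known variant of the alias method (function names are mine, and this is not necessarily the exact construction used in the implementation linked below):

```python
import random

def build_alias_table(w):
    """Build the alias table: prob[i] is the probability of keeping
    index i within bucket i; otherwise we return alias[i]."""
    n = len(w)
    total = sum(w)
    prob = [wi * n / total for wi in w]  # rescaled so the average is 1
    alias = [0] * n
    small = [i for i, p in enumerate(prob) if p < 1.0]
    large = [i for i, p in enumerate(prob) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        alias[s] = l               # s's spare capacity is filled by l's excess
        prob[l] -= 1.0 - prob[s]   # remove the mass we moved into bucket s
        (small if prob[l] < 1.0 else large).append(l)
    return prob, alias

def alias_sample(prob, alias):
    """Constant-time sampling: uniform bucket, then one of two indices."""
    i = random.randrange(len(prob))
    return i if random.random() < prob[i] else alias[i]
```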

Unfortunately, there is no fast way to modify the table built by Walker's method when we change just a single weight. I've not been able to find any methods in the literature that manage to do so efficiently. Given its simplicity, there are a few implementations of Walker's method out there, for example a Python version here.

A NIPS 2015 reading list

I went over the entire list of NIPS 2015 papers so that you don't have to. You're welcome. Below is the subset of papers that caught my eye.

Variance Reduction

Stop Wasting My Gradients: Practical SVRG Another great practical paper from Mark Schmidt and the rest of his group. The techniques they describe help speed up SVRG, although you're still probably better off using SAGA.

On Variance Reduction in Stochastic Gradient Descent and its Asynchronous Variants Great typesetting and plots. They show a good speedup for SVRG on a multicore machine. It would be good to have async-SAGA plots though. Speedup varies from 0.3 to 1 per additional core.

Local Smoothness in Variance Reduced Optimization Essentially theory and experiments for adaptive importance sampling for SVRG and SDCA. Looks good. It's hard to say how robust it is and how easy it will be to implement.

Variance Reduced Stochastic Gradient Descent with Neighbors Good theory, some improvements on the basic SAGA theory. The N-SAGA method looks like it would be hard to implement fast though.

Quartz: Randomized Dual Coordinate Ascent with Arbitrary Sampling An SDCA variant with a more conservative step and correspondingly improved theory. They give a proof that applies to almost any sampling scheme, which is nice. Could be faster than SDCA for sparse problems if implemented carefully.

A Universal Catalyst for First-Order Optimization Uses the accelerated proximal-point algorithm as a "meta-algorithm" to speed up any of the finite-sum methods. The double-loop construction they use introduces an unfortunate logarithmic term in the convergence rate. Probably not ideal in practice due to the additional tuning required, but it's hard to say.

Data science

Hidden Technical Debt in Machine Learning Systems This is based on a previous workshop paper. I'm not really impressed with the paper; it just doesn't go into enough detail to actually be useful. It's practically an extended abstract.

Precision-Recall-Gain Curves: PR Analysis Done Right I really like this paper. It correctly points out that PR curves are pretty bad and suggests a better alternative. For the kind of problems I'm interested in, ROC curves are still the better choice though.

Distributed Optimization

Deep learning with Elastic Averaging SGD The basic idea here is classical: distribute an optimisation problem by splitting it into parts (one per machine), and enforce consistency between the machines using a quadratic penalty function. It's a "soft" penalty, in the sense that it doesn't enforce exact equality. It's interesting that they actually give each machine access to the whole dataset, whereas normally each machine is given a fixed subset of the data. I believe this means that on convex problems it will still converge to the exact solution, which is not true of a soft penalty otherwise. They also show how important a momentum-based update is. Good paper.

Online/Stochastic Optimization

Probabilistic Line Searches for Stochastic Optimization This paper was also presented orally at the conference. They gave an excellent presentation. The basic idea is to use Bayesian optimization techniques for line searches. The details are non-trivial, and they have good plots to go along with it. I suspect, however, that their method is too slow in practice for most applications.

Beyond Convexity: Stochastic Quasi-Convex Optimization They give an interesting and non-obvious definition of local-quasi-convexity that they use in their proof of convergence for the stochastic normalized gradient descent method. It's really a theory paper.

Online Gradient Boosting I'm no expert on boosting, but it looks interesting. I'll stick with offline boosting for solving practical problems for now.

Fast Rates for Exp-concave Empirical Risk Minimization Some good results here. The exp-concave subproblem of ERM is interesting; I'm a strong believer in narrowing down your problem class until you can prove something. They get an expected loss convergence rate like O(d/n) for d dimensions and n steps. This is better than the general ERM case, and better than past results. I'll probably read through the proof in detail at some point.

SGD Algorithms based on Incomplete U-statistics: Large-Scale Minimization of Empirical Risk The U-statistics terminology is new to me. They talk about direct minimisation of AUC instead of the usual ERM losses, which is an approach I find interesting. I'm surprised there hasn't been more research on this, from what they say in the paper. In the past I've never achieved better results out of direct AUC minimisation compared to logistic-loss though.

Bandits

Policy Evaluation Using the Ω-Return Not really my area, so I can't comment in detail. Looks interesting though.

Bandit Smooth Convex Optimization: Improving the Bias-Variance Tradeoff Always nice to see improved bounds.

The Pareto Regret Frontier for Bandits

Frank-Wolfe

Frank-Wolfe Bayesian Quadrature: Probabilistic Integration with Theoretical Guarantees The Frank-Wolfe method is popping up everywhere these days. Apparently it allows you to get convergence rates for Bayesian quadrature now. They appear to be applying it in function space, which is really cool. Nice visualisations as well.

A Universal Primal-Dual Convex Optimization Framework

On the Global Linear Convergence of Frank-Wolfe Optimization Variants Simon Lacoste-Julien is quickly becoming the Frank-Wolfe expert. I like the consideration of the structure of the constraint set through a notion similar to a condition number. Great diagrams as well.

Inference/Sampling

Smooth and Strong: MAP Inference with Linear Convergence I don't fully understand the novelty in this paper; it looks like an application of standard duality tricks for smoothing. If anybody has insight here, please let me know in the comments.

Reflection, Refraction, and Hamiltonian Monte Carlo A way of improving traditional HMC in some cases by better handling innate structure.

Optimization Monte Carlo: Efficient and Embarrassingly Parallel Likelihood-Free Inference

A Complete Recipe for Stochastic Gradient MCMC

Other Optimization

HONOR: Hybrid Optimization for NOn-convex Regularized problems Basically they switch between quasi-Newton and gradient steps depending on the local structure. Feels a little ad hoc. They also only compare against GIST, a method they also invented. I would like to see a more general comparison.

Convergence rates of sub-sampled Newton methods The approach they take to sub-sampled approximations is interesting. They handle noise in the Hessian from the approximation by setting the smaller eigenvalues not to zero but to the value of the kth top eigenvalue, for some k. Noise should have a bigger relative effect on the lesser eigenvalues, so that looks reasonable. I'm not super-convinced by the plots though; I would like to see results on a few other problems. Also, the fonts on the plots are too small.

Theory

Solving Random Quadratic Systems of Equations Is Nearly as Easy as Solving Linear Systems Quite an interesting signal-recovery problem solved here. Related to compressed sensing, where you assume some kind of structure on the design matrix (i.e. entries sampled iid from a Gaussian or binomial distribution).

Newton-Stein Method: A Second Order Method for GLMs via Stein's Lemma An alternative hessian approximation. They make use of sub-sampling in a similar way to the "Convergence rates of sub-sampled Newton methods" paper above, citing a pre-print of it. I like how they use estimates of the covariance matrix of the data to estimate the Hessian. Plot font is too small. They use the wrong citation style for NIPS, not sure how that got through review.

Learning with Symmetric Label Noise: The Importance of Being Unhinged

Recent work on minimising finite sums

To me, the most interesting recent development in convex optimisation is the discovery of a family of methods that exploit a finite-sum objective structure to theoretically and practically converge faster than was previously thought possible.

The two early methods in this class were SAG and SDCA. I spent time during my thesis working on these methods, see my publications on SAGA (an improved version of SAG) and Finito.

I would like to compile here a list of other papers broadly in this area with some brief comments on each. If you feel I've left anything out or you have a correction of some kind, please leave a comment and I'll see to it.

SVRG

The SVRG method really pushed the class of finite-sum methods forward. It uses a clunky double-loop construction, but its theoretical analysis is really simple compared to SAG and SDCA, and it's really the core building block of the SAGA method mentioned above. It can also be readily applied to non-convex problems, unlike the other methods mentioned here. It makes an interesting trade-off: it avoids storing gradients for each term, at the cost of 2-3x more gradient calculations in total. This trade-off is bad for logistic regression and most convex problems, but for neural networks it may help. I haven't checked whether it has seen any use on neural networks in practice; maybe a topic for a future post.
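For concreteness, a sketch of the basic SVRG update (names, defaults, and structure are illustrative, not taken from any particular implementation):

```python
import numpy as np

def svrg(grad_i, x0, n, step, outer_iters=20, inner_iters=None):
    """Sketch of SVRG for minimizing (1/n) * sum_i f_i(x).

    grad_i(i, x) returns the gradient of the i-th term at x.
    """
    x = x0.copy()
    m = inner_iters or 2 * n
    for _ in range(outer_iters):
        snapshot = x.copy()
        # Full gradient at the snapshot: the extra per-epoch pass that
        # replaces SAGA's stored table of per-term gradients.
        full_grad = sum(grad_i(i, snapshot) for i in range(n)) / n
        for _ in range(m):
            i = np.random.randint(n)
            # Variance-reduced estimate: unbiased, and its variance
            # shrinks as x and snapshot approach the optimum.
            g = grad_i(i, x) - grad_i(i, snapshot) + full_grad
            x -= step * g
    return x
```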

Distributed variants

There has been quite a lot of interest in distributed versions. I still believe that variants of distributed gradient LBFGS are the fastest methods for distributed convex optimisation at large scale. The way that SAG and related methods exploit the objective structure does not clearly extend to the distributed setting. In contrast, distributed gradient LBFGS is an embarrassingly parallel algorithm. LBFGS doesn't work well for non-convex problems though, and that's where most of the interest in distributed optimisation is at the moment.

For regular SGD, the two best approaches I've seen for distribution are Downpour and soft-penalty methods like Elastic Averaging SGD. The soft-penalty approach might work well with SAGA; it seems like a good topic for future work.

The DSA method takes the approach of combining the EXTRA method (an ADMM variant) with SAGA. It seems like a good idea to me in principle. From their paper, it sounds like it might be the fastest method for some network topologies. They also try the obvious way of distributing SAG, which they include in their plots. The "obvious" way doesn't actually work (in the sense that it doesn't converge linearly, losing all advantages it has over regular SGD), so comparing against it is kind of silly. In general I don't find their results convincing, although I'm no expert on distributed optimisation.

A few papers cover distributed SDCA, like this one by Tianbao Yang. It compares against ADMM, which is horribly slow compared to distributed gradient LBFGS, so I'm not really convinced by their results either.

The distributed SVRG method is a promising one. The SVRG method is inherently double-loop, with an inner loop that looks like it could potentially be parallelised. I haven't looked at the details in this paper, but it concerns me that they compare against methods that I would not consider standard. I would like to see comparisons against one of the many distributed SGD methods, or DG-LBFGS, or even just a wall-clock comparison against non-distributed SVRG or SAGA.

Another NIPS 2015 paper from some well respected Carnegie Mellon researchers also addresses the asynchronous (i.e. multicore rather than distributed) SVRG/SAGA problem. They cite my thesis, which immediately gives them bonus points in my mind. They start with generalising SAGA/SVRG into a meta algorithm which covers both cases. The construction is fairly obvious, but it's nice to see it worked out with all the details. They then give an asynchronous SVRG method, which appears reasonable. The analysis uses techniques from the Hogwild! paper, which is one of the better asynchronous SGD methods. They talk about the applicability of their framework to both SVRG and SAGA, but they strangely don't talk at all about an asynchronous SAGA. I have a hunch the technique they use for asynchronous SVRG would not work well with SAGA, but I could be wrong. They have an excellent experiments section, with actually legible plots!! (I wish we lived in a world where this didn't deserve exclamation marks). In particular they show a solid speedup, with a mostly linear speedup with the number of cores, although the slope is not the perfect 1:1 rate. I have to point out that they actually compare against SGD based methods, which is great. Most finite-sum papers skip doing this because it's a pain to implement a fast SGD method. There are simply too many possible variants that reviewers will inevitably suggest you try.

Accelerated variants

The paper A Universal Catalyst for First-Order Optimization (NIPS) by Lin, Mairal and Harchaoui is the clearest approach. In general, virtually any optimisation method can be accelerated by using it to solve the proximal operator within the accelerated proximal-point algorithm framework (see Salzo and Villa 2012, based on the proximal point method from Rockafellar). Lin et al. take this approach, giving a double-loop method where any of the finite-sum methods mentioned can be used in the inner loop. I have suspected for a while that this kind of approach could be applied to SAGA, so it's nice to see this paper working out the details.

The paper on accelerated SDCA also takes the double loop approach. Both approaches have convergence rates with large hidden logarithmic factors, which are clearly not artefacts of the proof, but rather inherent in the double loop approach.

Another double-loop paper takes a mirror-descent style approach. I'm not sure if this is the same as the proximal-point version; at least on the surface it appears to be a different method. The plots in this paper are too small and not well typeset, but they do show the differences between their method and SAGA/SVRG fairly clearly. The general trend seems to be that the accelerated methods take 50-100 iterations before they overtake SAGA/SVRG, which matches my intuition. For most practical problems I would expect fewer than 100 iterations to be used; typically the test set error is completely minimised by then.

It's not clear to me whether a non-double-loop approach is possible for purely primal stochastic finite-sum methods. The obvious approaches don't work, as the noise from the stochasticity is far too large compared to the amount of error that variants of Nesterov's accelerated method can handle.

There is one non-double-loop accelerated approach I know of, in this draft technical report from U. Florida. It takes a primal-dual approach. Guanghui Lan talked about this method in his OPT2015 workshop talk, and it sounds extremely promising. It's still an early draft; it has no experiments section yet. I'll be keeping an eye out for the published version.

Online learning variants

The "Neighborhood Watch" method (NIPS 2015) takes a hybrid approach between SAGA and classical SGD. Essentially they consider storing fewer past gradients by using the gradient at neighbouring points as a surrogate. Although the method they propose is not surprising, it's not at all clear a priori that the technique would work in practice, so it's nice to see some experiments and theory verifying it. The paper is interesting from a theoretical point of view, as they use a different Lyapunov function from the one used by SAGA. It seems simpler.

Lower complexity bounds

Agarwal and Bottou have an ICML 2015 paper directly addressing the class of finite sums. They recently updated their arXiv version to note that their proof only covers deterministic methods, which is not ideal given that all the known fast finite-sum methods are stochastic! Their difficulties are not surprising; the lower complexity analysis for finite-sum methods is extremely complex. I was not able to make any progress on the problem after spending a few months on it during my PhD. Looking at the construction they use, I don't see how they can handle the stochastic case without large changes. Even with the caveat above, this paper is still interesting, particularly for the discussion in section 3.

They don't really discuss the core question (at least in my mind): Does there exist a deterministic finite-sum method with comparable convergence rates to SAG type methods? I personally won't be happy until we have lower and upper complexity bounds that give a tight answer to this question. I would be really surprised if there existed a fast deterministic method.

On a somewhat less related but interesting note, Arjevani, Shalev-Shwartz and Shamir have put up on arXiv a paper proving some new lower complexity bounds for general strongly convex optimisation. It's a long paper with a large number of results, but the one I find most interesting is their core result: a lower-complexity bound, showing that no optimisation method in the class they consider, no matter how clever, can converge faster than the bound they state. Lower complexity bounds can be hard to grasp if you are used to upper bounds, like those proving the worst-case convergence of an optimisation method; the inequalities are in the opposite direction.

The restricted class they consider is interesting and covers the core methods for strongly convex optimisation, namely gradient-descent and the simplest accelerated variant. They avoid the first/second order oracle class used in the early lower-complexity bounds literature, which is why they are able to give an improved bound. The key result is the removal of dependence on problem dimension from the bound. Previous lower complexity bounds applied to any first-order method, with a proof relying on analysis of quadratic problems for tractability reasons. Since the well known conjugate gradient method can find an exact optimum of a quadratic problem in O(p) time, for a p-dimensional problem, this dimension constant p appears in the old lower bounds.

The key idea of Arjevani et al. is limiting the class of optimisation methods to steps that are a linear transformation of the last rho iterates (x_k and x_{k-1}) for rho typically 2 or 3. The transformation itself can't depend on the x values, so line searches are ruled out. The conjugate gradient method effectively does such a line search (although it uses a closed-form solution for it), so it doesn't fall within their restricted class.
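As an illustration (my notation, not the paper's): for rho = 2 the restricted class contains steps of the form

```latex
x_{k+1} = x_k - \alpha \nabla f(x_k) + \beta \left( x_k - x_{k-1} \right),
```

with the coefficients $\alpha$ and $\beta$ fixed in advance. Gradient descent is the special case $\beta = 0$, and heavy-ball/accelerated variants correspond to $\beta > 0$; making $\alpha$ depend on the observed iterates (a line search) is exactly what the class forbids.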

Alternative sampling schemes

It's quite obvious that the uniform sampling scheme used by the standard finite sum methods can be improved if we know (or can at least approximate) the imbalance between the terms in the sum. By imbalance, I'm referring to the difference in the Lipschitz constants between each term. Terms with large constants will generally have a disproportionate contribution to the full gradient. In general we want methods where the theoretical convergence rate depends on the average Lipschitz constant of the terms, rather than the max, which is the case for SAG and other primal methods.

The paper by Schmidt et al. (AISTATS 2015) covers this problem for SAGA/SAG-style methods. I contributed a little to the theory for this paper. They show quite astounding practical results for conditional random field (CRF) problems, where weighted sampling seems crucial.

For SDCA, this preprint covers the problem to some degree. I've only skimmed it. The practical results are not super-impressive; I believe this is because they only test on linear learners, which are too simple to really show the benefit. In general, weighted sampling for coordinate descent is a well-studied problem, so applying those results to dual coordinate descent (i.e. SDCA) is an obvious research direction. I have a simple weighted-sampling proof for a minor variant of SDCA in my thesis.

There are two NIPS 2015 papers on the general topic, both with Tong Zhang as a co-author! Local Smoothness in Variance Reduced Optimization by Vainsencher, Liu and Zhang, and Quartz: Randomized Dual Coordinate Ascent with Arbitrary Sampling by Qu, Richtarik and Zhang. Quartz is a variant of SDCA with an interpolation-with-previous-value update for the primal estimate of x instead of a direct update. They say this leads to a better theoretical convergence rate. The main novelty is a proof that works for sampling schemes like mini-batch, importance sampling and uniform sampling. The practical performance is not as good as SDCA, except when they are able to exploit sparsity together with mini-batching, as the update is more conservative. I like that they didn't introduce yet another acronym. It's becoming impossible to communicate finite-sum methods to people outside the field, as all the method names sound similar.

The Local-smoothness paper discusses the particular problem of adapting the sampling scheme at run-time, so that the weighting doesn't need to be specified up front. This is practically a necessity for good performance, and they do establish some theory covering it unlike previous work on adaptivity for SAG. I like their discussion of why it can be effective. They point out that for smoothed-hinge loss, data-points whose activation is in the linear parts of the hinge can almost be ignored by the sampler once we get close to the solution.

The future

There are still lots of interesting avenues of research open for finite-sum methods. Do practical accelerated variants of any of the methods exist? Ideally without lots of parameters that require tuning. Are other variance reduction approaches possible (like adding a second set of control variates, or using multiplicative instead of additive control variates)?

I would like to see some solid C/C++ implementations of the best methods so that they can be more widely used. There are some reasonable Python implementations in the Lightning package, but they need adaptive step sizes and importance sampling to really be production-ready. Multithreading would also be great. It would also be nice to see SAGA implemented in TensorFlow, Vowpal Wabbit and LibLinear, for example.