Not All Winners Are Winners
A really phenomenal white paper titled “Most Winning A/B Test Results Are Illusory,” published by Qubit, was making the rounds a few weeks ago. If A/B testing is part of your daily life, then you really must take a few minutes to read through it. The gist of the paper is that many people call their A/B tests way too soon, which can cause the inferior variant to appear superior because of simple bad luck.
To understand how this can happen, imagine you have two fair coins (50% chance of landing on heads). You want to see whether your left or right hand is superior at getting heads, so you will know which hand to use when making a heads/tails bet in the future. You flip the coin in each hand 16 times, and you get these results:
Since we know the coin is fair, we know that getting 11 heads and 5 tails is just as likely as getting 11 tails and 5 heads. However, if we plug this result into a t-test to calculate our confidence, we find that we’re 96.6% certain that our right hand is superior at flipping the coin! We know this is absurd: because the coins are fair, we could arbitrarily relabel heads as tails, and vice versa, and the same test would then favor the left hand.
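To make the arithmetic concrete, here is a minimal sketch of that calculation using only the Python standard library. It assumes the left hand produced 5 heads and the right hand 11 heads in 16 flips each, and that the test is a pooled two-sample t-test reported as a two-sided confidence. Both are assumptions chosen to match the numbers quoted above, not details confirmed by the example. The function names are made up for this illustration.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    norm = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return norm * (1 + x * x / df) ** (-(df + 1) / 2)

def central_probability(t, df, steps=10_000):
    """P(-|t| < T < |t|), integrated numerically with Simpson's rule."""
    a, b = -abs(t), abs(t)
    h = (b - a) / steps
    total = t_pdf(a, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(a + i * h, df)
    return total * h / 3

def t_test_confidence(heads_left, heads_right, flips):
    """Two-sided confidence that the hands differ (same flips per hand)."""
    p_left, p_right = heads_left / flips, heads_right / flips
    # Unbiased sample variance of a 0/1 outcome.
    var_left = p_left * (1 - p_left) * flips / (flips - 1)
    var_right = p_right * (1 - p_right) * flips / (flips - 1)
    std_err = math.sqrt((var_left + var_right) / flips)
    t = (p_right - p_left) / std_err
    return central_probability(t, df=2 * flips - 2)

# ASSUMED data: 5 heads (left hand) vs. 11 heads (right hand), 16 flips each.
confidence = t_test_confidence(5, 11, 16)
print(f"Confidence that the hands differ: {confidence:.1%}")
```

Under those assumptions the result should land close to the 96.6% quoted above, even though both coins are fair by construction.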
If our coin flipping example were an A/B test, we would have gone ahead with the “right hand” variant. This wouldn’t have been a major loss, but it wouldn’t have been a win, either. The scary part is that this same thing can happen when the variant is actually worse! This means you can move forward with a “winning” variant, and watch your conversion rate drop!
Doing It Right
So what’s the problem? Why is this happening?
A/B tests are designed to imitate scientific experiments, but most marketers running A/B tests do not live in a world that is anything like a lab at a university. The stumbling point is that people running A/B tests are supposed to wait and not peek at the results until the test is done, but many marketers won’t do that.
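A small simulation can show why peeking matters. The sketch below runs A/A tests, where both arms have the same true 5% conversion rate, so every "significant" result is a false alarm. It compares stopping at the first significant peek with testing once at the end. The pooled z-test, the peek interval of 50 visitors, and the other parameters are illustrative assumptions.

```python
import math
import random

def z_score(conv_a, conv_b, n):
    """Pooled two-proportion z statistic with n visitors in each arm."""
    pooled = (conv_a + conv_b) / (2 * n)
    std_err = math.sqrt(pooled * (1 - pooled) * 2 / n)
    return 0.0 if std_err == 0 else (conv_b - conv_a) / n / std_err

def false_positive_rates(rate=0.05, n_per_arm=1000, peek_every=50,
                         experiments=500, z_crit=1.96, seed=1):
    """Return (rate with peeking, rate with one final test) for A/A tests."""
    rng = random.Random(seed)
    peeking_hits = final_hits = 0
    for _ in range(experiments):
        conv_a = conv_b = 0
        flagged_on_peek = False
        for i in range(1, n_per_arm + 1):
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            # A peeker would have stopped at the first significant look.
            if i % peek_every == 0 and abs(z_score(conv_a, conv_b, i)) > z_crit:
                flagged_on_peek = True
        peeking_hits += flagged_on_peek
        final_hits += abs(z_score(conv_a, conv_b, n_per_arm)) > z_crit
    return peeking_hits / experiments, final_hits / experiments

peeking_rate, final_rate = false_positive_rates()
print(f"False positives with peeking: {peeking_rate:.1%}")
print(f"False positives with one final test: {final_rate:.1%}")
```

Because the final look is also one of the peeks, the peeking rate can never be lower than the single-test rate. In practice it tends to be several times higher.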
Let’s outline the classic setup for running an experiment:
1. Decide the minimum improvement you care about. (Do you care if a variant results in an improvement of less than 10%?)
2. Determine how many samples you need in order to know within a tolerable percentage of certainty that the variant is better than the original by at least the amount you decided in step 1.
3. Start your test but DO NOT look at the results until you have the number of samples you determined you need in step 2.
4. Set a certainty of improvement that you want to use to determine if the variant is better (usually 95%).
5. After you have seen the number of samples decided in step 2, put your results into a t-test (or other favorite significance test) and see if your confidence is greater than the threshold set in step 4.
6. If the results of step 5 indicate that your variant is better, go with it. Otherwise, keep the original.
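For a rough feel of what the second step produces, here is a sketch of the standard normal-approximation sample size formula for comparing two conversion rates. The 80% power default and the 5% baseline rate are assumptions for illustration, and a dedicated calculator may use a slightly different formula.

```python
import math
from statistics import NormalDist

def sample_size_per_variant(base_rate, min_relative_lift,
                            confidence=0.95, power=0.80):
    """Visitors needed in EACH arm to detect the given relative lift."""
    p1 = base_rate
    p2 = base_rate * (1 + min_relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# Assumed inputs: 5% baseline, smallest lift worth detecting is 10% relative.
n = sample_size_per_variant(0.05, 0.10)
print(f"Visitors needed per variant: {n:,}")
```

With these inputs the answer is on the order of 31,000 visitors per variant. Halving the minimum lift roughly quadruples that number, which is why the first step matters so much.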
I recommend you play with the sample size calculator in step 2. If these steps seem straightforward to you, and the sample sizes you come up with seem easily achievable, then you can stop reading here and go get better results from your A/B tests. This approach works and will give you good results.
If, however, you read the above and thought “I also should eat more veggies and work out more…” then read on!
Marketers are NOT Scientists
I believe the reason marketers tend not to follow through with proper methodology when it comes to A/B testing has less to do with ignorance of the procedure and more to do with real world rewards and constraints. For scientists working in a lab, the most important thing is that the results must be correct. Running a test takes a relatively small chunk of time, while getting an incorrect answer that eventually finds its way into publication can have consequences that range from being embarrassing to costing lives.
Marketers have almost the opposite pressures. Management wants results as soon as possible, and you may have a long list of features and designs waiting to be tested, so you don’t want to waste time testing minor improvements if someone has something that could be a major improvement. Most important: marketers are concerned with growth! Being correct is useful only insofar as it leads to growth.
So, now, we have the question: “Is there a way to run A/B tests that acknowledges the world marketers have to exist in?”
Whenever I’m studying interesting questions involving probabilities that don’t have an obvious analytical solution, I turn to Monte Carlo simulations! A Monte Carlo simulation is simply a way for us to answer questions by running simulations enough times to get an answer. All we have to do is model our problem. Then, we can model different strategies and see how they perform.
For our A/B testing model, we’re going to make some assumptions. In this case, we’re going to have a page that starts with a 5% conversion rate. We then assume that variants can have conversion rates that are normally distributed around 5%. In practical terms, this means that any given variant is equally likely to be better or worse than the original, and that small improvements are much more common than really large ones.
Finally, we address perhaps the most important constraint: each strategy gets only a total of 1 million observations. As you collect more data, you get more certain; but if you need 100,000 results to be certain, then how many tests have you wasted? No one has unlimited visitors to sample from. In our model, the more careful testing strategy might be penalized because it wastes too much time on poor performers and never gets to a really good variant.
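The model's assumptions can be sketched in a few lines. The spread of the variant distribution is not stated here, so the standard deviation of 0.01 below is purely an assumption, as are the constant and function names.

```python
import random

BASE_RATE = 0.05      # conversion rate of the original page
VARIANT_SD = 0.01     # ASSUMPTION: spread of variant quality
BUDGET = 1_000_000    # total observations each strategy may spend

def draw_variant_rate(rng):
    """True conversion rate of a new variant, clipped to a valid probability."""
    return min(max(rng.gauss(BASE_RATE, VARIANT_SD), 0.0), 1.0)

rng = random.Random(42)
rates = [draw_variant_rate(rng) for _ in range(10_000)]

# Roughly half of all variants beat the original...
share_better = sum(r > BASE_RATE for r in rates) / len(rates)
# ...but only a few are large wins (here: at least 40% relative lift).
share_big_win = sum(r > BASE_RATE * 1.4 for r in rates) / len(rates)

print(f"Variants better than the original: {share_better:.1%}")
print(f"Variants with a 40%+ lift: {share_big_win:.1%}")
```

A wider or narrower spread changes how often big wins appear, so any conclusions drawn from the simulation are sensitive to this choice.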
The Scientist and The Impatient Marketer
Let’s start by modeling the strategy of “The Scientist.” This strategy follows all of the steps for proper testing outlined above. We can see the results of a single simulation run below:
What we see is quite clear. The Scientist improves steadily and stays at a good conversion rate until another improvement is found, rarely, if ever, choosing the inferior variant by mistake. After 1,000,000 people, The Scientist has run around 20 tests and has bumped the conversion rate from 5% to 6.7%.
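A single test run the way The Scientist would run it might look like the sketch below: collect the planned sample, test once, and switch only if the variant is ahead with enough confidence. The pooled two-proportion z-test and the sample size of 31,000 per arm are assumptions for illustration, not details of the simulation described here.

```python
import math
import random
from statistics import NormalDist

def confidence_of_difference(conv_a, conv_b, n):
    """Two-sided confidence from a pooled z-test with n visitors per arm."""
    pooled = (conv_a + conv_b) / (2 * n)
    std_err = math.sqrt(pooled * (1 - pooled) * 2 / n)
    if std_err == 0:
        return 0.0
    z = abs(conv_b - conv_a) / n / std_err
    return 2 * NormalDist().cdf(z) - 1

def scientist_test(rate_a, rate_b, n_per_arm, threshold=0.95, rng=random):
    """Run to the planned sample size, test once, return the rate to keep."""
    conv_a = sum(rng.random() < rate_a for _ in range(n_per_arm))
    conv_b = sum(rng.random() < rate_b for _ in range(n_per_arm))
    confident = confidence_of_difference(conv_a, conv_b, n_per_arm) > threshold
    return rate_b if confident and conv_b > conv_a else rate_a

# Original converts at 5%; this hypothetical variant truly converts at 6%.
kept = scientist_test(0.05, 0.06, n_per_arm=31_000, rng=random.Random(7))
print(f"Conversion rate of the page kept: {kept:.0%}")
```

Each such test costs 62,000 observations, so only about 16 of them fit into a budget of one million. That is the price of rarely being wrong.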
Now, let’s look at a strategy we’ll call “The Impatient Marketer.” The Impatient Marketer is an extreme case of sloppy A/B testing, but it is an important step toward understanding how we can model a strategy for marketers that is both sane and provides good results. The Impatient Marketer checks constantly (as opposed to waiting), stops the test as soon as it reaches 95% confidence, and gets bored after 500 observations, at which point the test is stopped in favor of the original.
Here we see something very different from The Scientist. The Impatient Marketer has results all over the map. Many tests are inferior to their predecessor and many are worse than the first page!
But there are some pluses here as well. In this case, The Impatient Marketer reached a peak of 7.8% conversion and still ended close to The Scientist at 6.3%! It’s also worth noting that if this simulation is run over and over again, we find that The Impatient Marketer consistently does better than the baseline.
Now, let’s make The Impatient Marketer a little less impatient and a little more realistic. Our new strategy is “The Realist.” The Realist wants results fast, but doesn’t want to make a lot of mistakes, and also doesn’t want to follow a 6-step process for each test. The Realist waits until 99% confidence to make the call, but will wait for only 2,000 observations. This strategy is very simple, but much less reckless than that of The Impatient Marketer.
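Both The Impatient Marketer and The Realist follow the same rule with different settings, so one function can sketch them both. The details below are assumptions: a pooled z-test checked after every pair of visitors, a 50/50 traffic split, the observation cap counted across both arms, and variants drawn with a standard deviation of 0.01. The budget is kept small so that the example runs quickly.

```python
import math
import random
from statistics import NormalDist

def z_statistic(conv_a, conv_b, n):
    """Pooled two-proportion z statistic with n visitors in each arm."""
    pooled = (conv_a + conv_b) / (2 * n)
    std_err = math.sqrt(pooled * (1 - pooled) * 2 / n)
    return 0.0 if std_err == 0 else (conv_b - conv_a) / n / std_err

def early_stopping_strategy(confidence, max_obs, budget,
                            base_rate=0.05, variant_sd=0.01, seed=0):
    """Return [(true rate of the page kept, visitors spent)] for each test."""
    rng = random.Random(seed)
    z_needed = NormalDist().inv_cdf((1 + confidence) / 2)
    current, spent, path = base_rate, 0, []
    while spent + max_obs <= budget:
        variant = min(max(rng.gauss(base_rate, variant_sd), 0.0), 1.0)
        conv_a = conv_b = n = 0
        winner = current  # keep the current page if nothing is conclusive
        while 2 * n < max_obs:
            n += 1
            conv_a += rng.random() < current
            conv_b += rng.random() < variant
            z = z_statistic(conv_a, conv_b, n)
            if abs(z) >= z_needed:  # peek after every pair of visitors
                winner = variant if z > 0 else current
                break
        spent += 2 * n
        path.append((winner, 2 * n))
        current = winner
    return path

impatient = early_stopping_strategy(0.95, 500, budget=50_000)
realist = early_stopping_strategy(0.99, 2_000, budget=50_000)
print(f"Impatient Marketer: {len(impatient)} tests, ends at {impatient[-1][0]:.2%}")
print(f"Realist: {len(realist)} tests, ends at {realist[-1][0]:.2%}")
```

The only differences between the two strategies are the confidence threshold and the patience limit, which makes it easy to try other combinations.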
In this sample run, The Realist is doing much better than The Impatient Marketer. The Realist occasionally does make a wrong choice, but only very briefly drops below the original. The Realist ends at 6.3% but has spent a lot of time with a variant that achieved 7.4%. Because The Realist is always trying out new ideas, this strategy is able to sometimes find better variants that The Scientist never gets to!
In the above images, all we have is a single sample path. How do we judge how well each strategy performs? Maybe The Scientist does even better, or maybe The Impatient Marketer’s gains make up for the losses?
The way we’ll approach this is by measuring the area under the curve. If you imagine just sticking to the original, there would be a straight line at 0.05 across the entire plot, giving an area of 0.05 x 1,000,000 = 50,000. If we measure the area under each strategy’s curve, then we can compare them. And, to get a fair assessment, we’ll simulate this process thousands of times and take the average. After we do that, here are our results:
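As a sketch of the scoring, the area is just the conversion rate multiplied by the number of visitors who saw that rate, summed over the run. The two-segment path below is hypothetical and only illustrates the arithmetic.

```python
def area_under_curve(path):
    """Sum of rate x visitors over (conversion_rate, visitors) segments."""
    return sum(rate * visitors for rate, visitors in path)

# Sticking with the original page for the whole budget.
baseline_area = area_under_curve([(0.05, 1_000_000)])

# HYPOTHETICAL run: 400,000 visitors at 5%, then 600,000 at 6%.
example_area = area_under_curve([(0.05, 400_000), (0.06, 600_000)])

print(f"Baseline area: {baseline_area:,.0f}")
print(f"Example area: {example_area:,.0f}")
print(f"Gain over baseline: {example_area / baseline_area - 1:.0%}")
```

Averaging this score over many simulated runs gives one number per strategy, which can be read as the expected number of conversions from the million visitors.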
There are a couple of really fascinating results here. Perhaps most remarkable, The Impatient Marketer does surprisingly well! Of course, on closer inspection, The Impatient Marketer runs an unrealistic number of A/B tests. However, if you have a low-traffic site that simply will never see a well-designed test converge, there’s definitely a useful insight here: A/B testing is useful even if you don’t have much data, but you have to continuously run tests to avoid getting stuck too long at a poor conversion rate.
But most interesting to everyday marketers running A/B tests is that The Realist and The Scientist do about the same in the long run! It is important to note that these conclusions are known to hold only under the assumptions of our model. Still, the takeaway is important: if you’re thoughtful, you can make tradeoffs in your testing methodology and still get great results.
The biggest assumption in our model is that these tests are running back to back without breaks. Veering away from classical tests works only if acting on inferior information is made up for by always having another test ready to go.
If you want to end your test early because a design for the next round is ready, go for it! If other office pressure is making you want to end a test early, feel free to stop, but make sure you have another test ready to go. Additionally, if you have good cause for stopping early, lean toward being more conservative with your results. You assume a lot of risk if you go with a variant that isn’t a clear winner.
Conversely, if you have no pressure to stop early, stick with the traditional testing setup outlined above! Run the sample size calculator and see if the number of samples needed is in a range you can gather in a reasonable time frame. If so, there’s no reason to break what works; and, in fact, you may find your time best spent exploring other, mathematically sound, approaches to running tests.
In all of our models, being vigilant and continuously running tests is the most reliable way to offset limitations in the testing methodology.