# Most of Your A/B Test Results Are Illusory and That’s Okay

## Not All Winners Are Winners

A really phenomenal white paper titled “Most Winning A/B Test Results Are Illusory,” published by Qubit, was making the rounds a few weeks ago. If A/B testing is part of your daily life, then you really must take a few minutes to read through it. The gist of the paper is that many people call their A/B tests way too soon, which can cause the inferior variant to appear superior because of simple bad luck.

To understand how this can happen, imagine you have two fair coins (50% chance of landing on heads). You want to see whether your left or right hand is superior at getting heads, so you will know which hand to use when making a heads/tails bet in the future. You flip the coin in each hand 16 times, and you get these results:

Since we know the coin is fair, we know that getting 11 heads and 5 tails is just as likely as getting 11 tails and 5 heads. However, if we plug this result into a t-test to calculate our confidence, we find that we’re 96.6% certain that our right hand is superior at flipping the coin! Now, we know this is absurd since, in our example, knowing that the coin is fair, we could arbitrarily say that heads were tails, and vice versa.
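To make the arithmetic concrete, here is a minimal sketch of that calculation using only the Python standard library. It assumes an equal-variance, two-sided, two-sample t-test on 16 flips per hand, with 11 heads in one hand and 5 in the other; the exact test behind the figure quoted above is not stated, so treat the function names and the choice of test as illustrative assumptions.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_confidence(t, df, steps=10_000):
    """1 - p for a two-sided t-test, via Simpson's rule on [0, |t|]."""
    t = abs(t)
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 2 * total * h / 3  # probability mass inside [-|t|, |t|]

def coin_confidence(heads_a, heads_b, n):
    """Equal-variance two-sample t-test on two sets of n 0/1 outcomes (assumed test)."""
    mean_a, mean_b = heads_a / n, heads_b / n
    # Unbiased sample variance of 0/1 data: p(1-p) * n/(n-1)
    var_a = mean_a * (1 - mean_a) * n / (n - 1)
    var_b = mean_b * (1 - mean_b) * n / (n - 1)
    se = math.sqrt((var_a + var_b) / n)
    t = (mean_a - mean_b) / se
    return two_sided_confidence(t, df=2 * n - 2)

confidence = coin_confidence(11, 5, 16)
print(f"{confidence:.1%}")
```

Under these assumptions the result lands between 96% and 97%, which is the same ballpark as the figure above, even though both coins are fair by construction.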

If our coin flipping example were an A/B test, we would have gone ahead with the “right hand” variant. This wouldn’t have been a major loss, but it wouldn’t have been a win, either. The scary part is that this same thing can happen when the variant is actually worse! This means you can move forward with a “winning” variant, and watch your conversion rate drop!

## Doing It Right

So what’s the problem? Why is this happening?

A/B tests are designed to imitate scientific experiments, but most marketers running A/B tests do not live in a world that is anything like a lab at a university. The stumbling point is that people running A/B tests are supposed to wait and not peek at the results until the test is done, but many marketers won’t do that.

Let’s outline the classic setup for running an experiment:

1. Decide the minimum improvement you care about. (Do you care if a variant results in an improvement of less than 10%?)
2. Determine how many samples you need in order to know within a tolerable percentage of certainty that the variant is better than the original by at least the amount you decided in step 1.
3. Start your test but DO NOT look at the results until you have the number of examples you determined you need in step 2.
4. Set a certainty of improvement that you want to use to determine if the variant is better (usually 95%).
5. After you have seen the observations decided in step 2, then put your results into a t-test (or other favorite significance test) and see if your confidence is greater than the threshold set in step 4.
6. If the results of step 5 indicate that your variant is better, go with it. Otherwise, keep the original.
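For a feel of what step 2 involves, here is a rough sketch of one common sample size formula for comparing two conversion rates. It is not necessarily what any particular calculator uses. The 5% baseline, the 80% power, and the function name are assumptions chosen for illustration; the 10% minimum improvement and 95% certainty come from the steps above.

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline, relative_lift, alpha=0.05, power=0.80):
    """Visitors needed per variant for a two-sided test of two proportions."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)  # smallest improvement worth detecting
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 95% certainty -> about 1.96
    z_beta = NormalDist().inv_cdf(power)           # ASSUMPTION: 80% power
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)

n = sample_size_per_variant(baseline=0.05, relative_lift=0.10)
print(n)
```

With these inputs the answer is on the order of 30,000 visitors per variant, which hints at why waiting for a properly sized test can feel expensive. Raising the minimum improvement you care about shrinks the requirement quickly.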

I recommend you play with the sample size calculator in step 2. If these steps seem straightforward to you, and the sample sizes you come up with seem easily achievable, then you can stop reading here and go get better results from your A/B tests. This approach works and will give you good results.

If, however, you read the above and thought “I also should eat more veggies and work out more…” then read on!

## Marketers are NOT Scientists

I believe the reason marketers tend not to follow through with proper methodology when it comes to A/B testing has less to do with ignorance of the procedure and more to do with real world rewards and constraints. For scientists working in a lab, the most important thing is that the results must be correct. Running a test takes a relatively small chunk of time, while getting an incorrect answer that eventually finds its way into publication can have consequences that range from being embarrassing to costing lives.

Marketers have almost the opposite pressures. Management wants results as soon as possible, but you may have a long list of features and designs waiting to be tested, and you don’t want to waste time testing minor improvements if someone has something that could be a major improvement. Most important: marketers are concerned with growth! Being correct is useful only insofar as it leads to growth.

So, now, we have the question: “Is there a way to run A/B tests that acknowledges the world marketers have to exist in?”

## Simulating Strategies

Whenever I’m studying interesting questions involving probabilities that don’t have an obvious analytical solution, I turn to Monte Carlo simulations! A Monte Carlo simulation is simply a way for us to answer questions by running simulations enough times to get an answer. All we have to do is model our problem. Then, we can model different strategies and see how they perform.

For our A/B testing model, we’re going to make some assumptions. In this case, we’re going to have a page that starts with a 5% conversion rate. We then assume that variants can have conversion rates that are normally distributed around 5%. In practical terms, this means that any given variant is equally likely to be better or worse than the original, and that small improvements are much more common than really large ones.

Finally, we address perhaps the most important constraint: each strategy gets only a total of 1 million observations. As you collect more data, you get more certain; but if you need 100,000 results to be certain, then how many tests have you wasted? No one has unlimited visitors to sample from. In our model, the more careful testing strategy might be penalized because it wastes too much time on poor performers and never gets to a really good variant.
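A minimal sketch of these modeling assumptions might look like the following. The 5% baseline and the 1 million visitor budget come from the description above; the standard deviation of the variant distribution is not given, so `VARIANT_SD` is a placeholder assumption, and the helper names are hypothetical.

```python
import random

BASELINE_RATE = 0.05        # starting conversion rate from the text
TOTAL_VISITORS = 1_000_000  # budget shared by every test a strategy runs
VARIANT_SD = 0.01           # ASSUMPTION: spread of variant rates, not given in the text

def draw_variant_rate(rng, mean=BASELINE_RATE, sd=VARIANT_SD):
    """True conversion rate of a new variant, normally distributed around the mean."""
    return min(max(rng.gauss(mean, sd), 0.0), 1.0)  # clamp to a valid probability

def count_conversions(rate, visitors, rng):
    """Simulate independent visits and count how many of them convert."""
    return sum(rng.random() < rate for _ in range(visitors))

rng = random.Random(42)
variant_rates = [draw_variant_rate(rng) for _ in range(10_000)]
share_better = sum(r > BASELINE_RATE for r in variant_rates) / len(variant_rates)
print(f"share of variants better than baseline: {share_better:.2f}")
```

Because the distribution is symmetric around the baseline, about half of the drawn variants beat it, and most of those beat it by only a small margin.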

## The Scientist and The Impatient Marketer

Let’s start by modeling the strategy of “The Scientist.” This strategy follows all of the steps for proper testing outlined above. We can see the results of a single simulation run below:

What we see is quite clear. The Scientist improves steadily and stays at a good conversion rate until another improvement is found, rarely, if ever, choosing the inferior variant by mistake. After 1,000,000 visitors, The Scientist has run around 20 tests and has bumped the conversion rate from 5% to 6.7%.

Now, let’s look at a strategy we’ll call “The Impatient Marketer.” The Impatient Marketer is an extreme case of sloppy A/B testing, but it is an important step toward understanding how we can model a strategy for marketers that is both sane and provides good results. The Impatient Marketer checks constantly (as opposed to waiting), stops the test as soon as it reaches 95% confidence, and gets bored after 500 observations, at which point the test is stopped in favor of the original.

Here we see something very different from The Scientist. The Impatient Marketer has results all over the board. Many tests are inferior to their predecessor and many are worse than the first page!

But there are some pluses here as well. In this case, The Impatient Marketer reached a peak of 7.8% conversion and still ended close to The Scientist at 6.3%! It’s also worth noting that if this simulation is run over and over again, we find that The Impatient Marketer consistently does better than the baseline.

## The Realist

Now, let’s make The Impatient Marketer a little less impatient and a little more realistic. Our new strategy is “The Realist.” The Realist wants results fast, but doesn’t want to make a lot of mistakes, and also doesn’t want to follow a 6-step process for each test. The Realist waits until 99% confidence to make the call, but will wait for only 2,000 observations. This strategy is very simple, but much less reckless than that of The Impatient Marketer.

In this sample run, The Realist is doing much better than The Impatient Marketer. The Realist occasionally does make a wrong choice, but only very briefly drops below the original. The Realist ends at 6.3% but has spent a lot of time with a variant that achieved 7.4%. Because The Realist is always trying out new ideas, this strategy is able to sometimes find better variants that The Scientist never gets to!
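The two early-stopping strategies differ only in two numbers: the confidence threshold and the patience limit. Here is a rough sketch of such a rule. It swaps in a pooled two-proportion z-test for the significance check, checks every 50 visitors per arm rather than constantly, and counts the observation limit per arm; all three are assumptions made for illustration, as are the function names.

```python
import math
import random
from statistics import NormalDist

def confidence_b_differs(conv_a, conv_b, n):
    """Two-sided confidence from a pooled two-proportion z-test (n visitors per arm)."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        return 0.0  # no conversions (or only conversions) so far: nothing to compare
    z = abs(conv_b - conv_a) / n / se
    return 2 * NormalDist().cdf(z) - 1

def run_early_stopping_test(rate_a, rate_b, rng, confidence, max_obs, check_every=50):
    """Return True if variant B is adopted, False if the original A is kept."""
    conv_a = conv_b = 0
    n = 0
    while n < max_obs:
        for _ in range(check_every):
            conv_a += rng.random() < rate_a
            conv_b += rng.random() < rate_b
        n += check_every
        if confidence_b_differs(conv_a, conv_b, n) >= confidence:
            return conv_b > conv_a  # stop as soon as the threshold is crossed
    return False  # ran out of patience: stick with the original

IMPATIENT = dict(confidence=0.95, max_obs=500)
REALIST = dict(confidence=0.99, max_obs=2000)

rng = random.Random(7)
trials = 500
# Both arms are truly identical here, so every adopted "winner" is a false positive.
false_wins = {
    name: sum(run_early_stopping_test(0.05, 0.05, rng, **params) for _ in range(trials)) / trials
    for name, params in [("impatient", IMPATIENT), ("realist", REALIST)]
}
print(false_wins)
```

The printed rates show how often each rule adopts a variant that is no better than the original. Changing `rate_b` lets you see how quickly a genuinely better or worse variant gets called under each setting.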

## Measuring Strategies

In the above images, all we have is a single sample path. How do we judge how well each strategy performs? Maybe The Scientist does even better, or maybe The Impatient Marketer’s gains make up for the losses?

The way we’ll approach this is by measuring the area under the curve. If you imagine just sticking to the original, there would be a straight line at 0.05 across the entire plot, giving an area of 0.05 x 1,000,000 = 50,000. If we measure the area under each strategy’s curve in the same way, then we can compare them. And, to get a fair assessment, we’ll simulate this process thousands of times and take the average. After we do that, here are our results:

There are a couple of really fascinating results here. Perhaps most remarkable, The Impatient Marketer does surprisingly well! Now, of course, if you actually look, The Impatient Marketer does an unrealistic number of A/B tests. However, if you have a low-traffic site that simply will never see a well-designed test converge, there’s definitely a useful insight here: A/B testing is useful even if you don’t have much data, but you have to continuously run tests to avoid getting stuck too long at a poor conversion rate.

But most interesting to everyday marketers running A/B tests is that The Realist and The Scientist do about the same in the long run! Now, it is important to note that these conclusions hold true only under the assumptions of our model. Still, there is an important takeaway: if you’re thoughtful, you can make tradeoffs in your testing methodology and still get great results.

## Takeaways

The biggest assumption in our model is that these tests are running back to back without breaks. Veering away from classical tests works only if acting on inferior information is made up for by always having another test ready to go.

If you want to end your test early because a design for the next round is ready, go for it! If other office pressure is making you want to end a test early, feel free to stop, but make sure you have another test ready to go. Additionally, if you have good cause for stopping early, lean toward being more conservative with your results. You assume a lot of risk if you go with a variant that isn’t a clear winner.

Conversely, if you have no pressure to stop early, stick with the traditional testing setup outlined above! Run the sample size calculator and see if the number of samples needed is in a range you can gather in a reasonable time frame. If so, there’s no reason to break what works; and, in fact, you may find your time best spent exploring other, mathematically sound, approaches to running tests.

In all of our models, being vigilant and continuously running tests is a reliable way to minimize the impact of any limitations in the testing methodology.

**About the Author:** Will Kurt is a Data Scientist/Growth Engineer at KISSmetrics. You can reach out to him on twitter @willkurt and see what he’s hacking on at github.com/willkurt.

Awesome post guys, thanks. I have often wondered about the most effective way to handle the statistics of split testing. We as an agency often work with businesses with relatively little traffic, so taking the ideal scientific approach isn’t always viable.

Of course, one great way to speed up the process is by testing big ideas and setting a threshold of at least a 50% improvement. The wins don’t come as often, but we find the pay off is worth it when they do.

This also means that you can drop unsuccessful variations faster because you don’t need to be sure whether they beat the original – just that they haven’t beaten the original by at least 50%.

It’s good to know that the realist still gets good results in the long-run anyway. Post Tweeted.

Mark, thanks for the great feedback and for sharing your experience. We look forward to hearing more from you :)

Hello Neil Patel,

Nice post as described properly. Do you think that A/B test is more important for small business marketing..??

I think it is just as important for large businesses. It’s all about seeing what works best.

I really have problems when it comes down to split testing. Normally I would rank a website in the SERPS but I really suck at making money from it. I am using VWO btw, do you think that is a good idea?

In my simulations one of the biggest things I found was that being more conservative with what you call a winning variant tends to help, especially if you’re not willing/able to wait for a test to run a long time. So if the tool you’re using provides any sort of probability of improvement wait until it is 99% rather than 95%. Variants that are truly superior will show high significance early on, even if you’re not following a strict testing methodology.

Thanks for the question!

I read first A/B Test on Moz about Social Signals and after that… more & more..

A/B testing really is a surprisingly deep topic. I plan to have a lot more posts in the future exploring some of the more interesting questions A/B testing brings up! Thanks for your comment!

Abdul, awesome! It’s a growing field :)

We at http://www.schoolgennie.com are doing similar tests. This article provides an in-depth analysis of A/B testing. Next month we are going to run tests with some software tools. I would appreciate it if you could write something on choosing good tools.

It would definitely be interesting to apply this simulation method to the variety of existing tools out there! There are many subtleties in the methods used to calculate results, and I myself am very curious what impact these different methods have on decision making. I’ll definitely look into this for a possible future post!

Thanks for your feedback!

Pardeep, will definitely keep you in mind for our next post :)

Great post and very interesting analysis. As a trained astrophysicist working in online marketing I find the application of scientific rigour an interesting challenge. A third way is definitely worth some consideration. Thanks!

Thanks Reece!

A/B Testing is one of those great problems whose initial phrasing “Is B greater than A?” seems trivial to solve and then opens up a really fascinating range of trickier problems. I plan to continue exploring these problems in future posts and look forward to more feedback!

Reece, thanks for the feedback. We can definitely use some tips from you :)

Brilliant post, thanks so much for putting it together.

The problem I’ve seen with many marketers is they just look at the averages. You can’t measure the “true” conversion rate, you can only measure the mean +/- the margin of error. This means you can’t say, “We are 95% confident that the conversion rate is X%”, you can only say, “We are 95% confident that the conversion rate is X% +/- Y%”. Many marketers drop that last part (margin of error), and simply stick to the averages – this is bad!

To highlight how big of a problem this can be, let’s relate it to the coin flip example you gave. Let’s say that “heads” equates to a “successful conversion”, while “tails” equates to “no conversion”. Based on this, the right hand’s conversion rate is 31% (5/16), while the left hand’s conversion rate is 69% (11/16) – as you’ve noted, the t-test comes back significant, even though we know there really isn’t any difference. The test has misled you. If you look past the averages, and take into account the margin of error, you see a different story, one that likely will make you question your confidence in the test results, and have you wondering, “What’s going on here?”.

The winner (left hand), whose conversion rate of 69% was 120% higher (relative) when compared to the right hand’s conversion rate of 31%, has a 95% confidence interval of 44% (lower) to 85.8% (upper). Yes, the average conversion rate is 69%, but the 95% confidence is +/- 42% around the average. Wow. You are suddenly a lot less confident in your measurement. What if a weatherman said they were 95% confident that the max temp for today would be between 44 degrees and 86 degrees? I wouldn’t even know what to do – do you wear shorts and sandals, or put on a cold-weather coat?

It is important that marketers provide a complete picture of what’s going on, and make decisions accordingly.

Thanks again for a great post!

Brian (twitter.com/cometrico)

Whoops, slight correction (it won’t let me edit the comment) – it should read there is a 42% difference between upper and lower confidence interval.

Brian

Thanks for your feedback!

I’ve noticed this issue myself, many people tend to misunderstand what a mean truly represents. Your comment makes this point quite well. The problem I’ve found is that when presented with the 95% confidence interval people see this as adding uncertainty to the decision making process and things become more confusing for them. I think one solution to this issue would be to simulate decision making processes that take this into account. Then come up with an easy set of rules that incorporate this knowledge into the process yielding better results.

I hope to address this in a future post, and also plan on doing a series of posts laying out the math in clear terms so that people can become more aware of how important this issue is. Thanks again!

Brian, thanks for this in-depth feedback. I like the way you broke everything down in an easy to digest manner. Looking forward to hearing more of the same from you :)

The first use of the word ‘peak’ should read ‘peek’ though to be fair the unintended typo kinda makes sense, in that it’s tempting to jump on the first apparently significant result (hence ‘a peak’) rather than see out the experiment.

Thanks for the catch! I also spent a lot of time looking for the peak of probability density functions so perhaps my brain has that spelling hard wired when discussing A/B tests!

Stuart, thanks for catching that :)

I wouldn’t really call the whitepaper “phenomenal” since this is statistical ABC. However, given that online marketers are generally oblivious to the scientific method and to error statistics as the basis of the empirical part of it, it certainly deserves appraisal. In fact, we are also trying to educate online marketers about the very same things with the toolkit we’ve just launched so I myself hold this cause most dearly.

I’d like to note one critical mistake in the paper: “Rather than a scattergun approach, it’s best to perform a small number of focused and well-grounded tests, all of which have adequate statistical power.”. While I can agree with that as preferable to running a hundred ungrounded tests that tell you nothing, it sounds to me as if he says – doing many tests causes troubles with their power, so you better stick to doing just a few.

The problem with that is that #1 even if you do just a few tests with a couple of variations each at a time, you can still run into the exact same problem if you are testing similar changes each time. #2 if you have grounds for doing multiple tests with many variations at once, you should definitely do it. Just make sure that you adjust your significance levels so that you account for the multiplicity problem.

For #2 I’ve found that the False Discovery Rate method by Hochberg, Benjamini and Yekutieli works best for real-world questions and I’ve built a multiple-comparison statistical significance calculator around that. You can check it out here: https://www.analytics-toolkit.com/statistical-calculators/

Back to your post: you say “Marketers are NOT Scientists” and I regretfully agree. But then you go on and it seems to me that you suggest that the problems marketers face are somehow different than those faced by scientists and I completely disagree if that is indeed the case. The problems are exactly the same and require exactly the same approach. The scientific method is the only framework that allows us to tell true from false with regards to ANYTHING in the real world.

Thus, when you ask “Is there a way to run A/B tests that acknowledges the world marketers have to exist in?” my answer would be – yes, do your science right and simply pick a significance level that reflects the cost/benefit calculations that you work with. 95% is not some magical number. Many scientific tests are in fact aiming for a 99.99% confidence level. At the same time, for a bunch of marketers being 90% confident that they are not committing a type I error may be perfectly fine.

I argue that adjusting the significance level is all that needs to be done. No need for complex strategies and Monte Carlo simulations.

Having said that – your post only shows a few sample runs and compares them. However, what is the statistical power of such a comparison? It seems very flimsy to me, or maybe I’m missing your point?

Would love to hear back from you, Will.

Hey Geo!

The problem with just focusing on significance levels alone is that it accounts for neither the magnitude of effect nor the costs associated with gathering samples, and certainly not the challenges of reasoning about statistical significance for people with little stats background. The reason for the MC simulations is that I wanted to compare strategies even if they’re not phrased in terms of traditional experiment design. Certainly tuning the parameters of The Scientist strategy could give you nearly identical decision making as The Realist or Impatient Marketer, but what I wanted to test is precisely a “rule of thumb” approach. As other comments have mentioned, many people don’t really understand the confidence interval in relation to the observed mean. Rather than increase the cognitive overhead of experiment design for people in this group, I would much rather arrive at a set of useful heuristics, even if these heuristics don’t lead to a truly optimal approach. Also, for the results table each strategy was run 3,000 times over 1,000,000 samples. Even though the exact means contain some variation at only 3,000 runs, the relative ordering of the strategies remained consistent (and the ordering is all I am interested in for comparing strategies).

Hey Will,

“The problem with just focusing on significance levels alone is that it accounts for neither the magnitude of effect nor the costs associated with gathering samples, and certainly not the challenges of reasoning about statistical significance for people with little stats background.”

Well, these are not statistical issues, but test design issues. I’m very tempted to quote from an excellent article called “Error Statistics” (2010):

“Criticism (#3) is often phrased as “statistical significance is not substantive significance”. What counts as substantive significance is a matter of the context. What the severity construal of tests will do is tell us which discrepancies are and are not indicated, thereby avoiding the confusion between statistical and substantive significance.”

It’s mathematically much easier (and thus in practice – cheaper) to achieve a significant result when the substantive significance is higher. Thus, you quickly arrive at the result and the cost of gathering samples is lower and you can quickly start receiving the benefits. “higher” and “lower” here are in comparison to a situation where you have little substantive significance: you would need a larger sample size and the cost of that would be higher. However the cost of gathering that vs the benefits of switching before achieving significance is really the same as in the first case (given same significance level), due to the much lower magnitude of the expected result.

Thus I think choosing a properly strict significance level ensures you don’t overspend on sample gathering and in reality balances between your cost and your missed potential benefits.

As far as “challenges of reasoning about statistical significance”, my prediction is that people with “little stats background” will start seeing their positions in peril as the companies they work for start losing money because of their non-empirical approach (compared to the new wave of internet marketers, which, I believe, will bring to the scene a much more scientific approach). It’s all based upon the simple fact that I’ll restate: “The scientific method is the only framework that allows us to tell true from false with regards to ANYTHING in the real world.”.

Now, with that in mind, if you look at your “Realist” strategy – what you’ve done by setting a threshold at 2000 observations and having the distribution the same as with the other two strategies is in fact almost the same as setting a confidence level. I’m a bit too lazy to do the proper math right now, but with such a sample size I’m estimating that it has enough power to detect 25-30% uplift from the 5% conversion rate base with ~95% certainty. And this power only grows as the marketer achieves a better “base” conversion rate. Yes, the other condition – the 99% confidence level, with or without power, is something that skews this but still, in effect, by imposing this threshold you are trying to do and actually doing what should be done by setting up a proper significance level.

Let me know if my reasoning is incorrect, I’d love to be schooled on such matters as I deem them of highest importance.

I think I might have been too vague in my previous comment. To put it simply: having a fixed threshold after which to stop is a rudimentary approach to statistical power at a given significance level. Since statistical power depends on the chosen significance level, the best way to achieve what the rudimentary approach of the “Realist” is trying to achieve would be to adjust the significance level so that it isn’t too stringent. “Too stringent” here can only be defined by the person doing the test, based on the expected cost/benefit tradeoff.

Geo, thanks for the clarification. We look forward to hearing more from you :)

This is why I’ve started using Bayesian Statistics. If you handed me two coins that looked identical and felt like other coins I’ve seen in the past, I would have a high degree of confidence that they are identical and my prior would represent that. 16 iterations with one having 11 heads and the other 11 tails would not cause me to think one is different.

For different A/B tests, I have different standards for what is a winner. In one instance I wanted to change the structure of a landing page so that I could run more tests without the need for dev resources. My criteria for this test was to be reasonably certain that the B version was not significantly worse. Bayesian Statistics is much better at answering these questions.

From a philosophical standpoint I am in full agreement with your main point. I’m a former scientist, yet I know that my goal is not to publish results, it is to maximize the profits. My time and opportunity costs have to be factored into my decisions.

Ryan, thanks for these great insights. If you have any additional points we are all ears :)

Ryan, awesome. I think you discussed the same topic on the other blog. Would definitely love to hear more about it from you :)

Reilly from Qubit here–and you’re absolutely right. Qubit’s model employs Bayesian Priors, as opposed to many of the other models out there. Good call on that one!

“Reasonably certain” is rather subjective, isn’t it? It’s all about tolerance, and what the costs of making an “incorrect” decision are. If your cost of being wrong is low, then sure you can set a lower level on your confidence interval / statistical significance level, requiring less data / time. I like Bayesian Stats due to the fact that where the frequentist approach (through significance testing) can only provide you with whether a treatment was “better”, “worse” or “different”, a Bayesian approach (through use of Monte Carlo simulations) can give you the probability that variation A beats B based on the data you have – the problem with this, though, is your results can vary significantly based upon the amount of data you collected.

The coin example sounds great in theory, because a fair coin has two sides with equal probability of landing on each side – this doesn’t translate as well in a practical sense of website optimization. The problem with priors in the real world (such as a business setting) is they are prone to bias by whoever is setting the prior.

You hit the nail on the head here with, “… goal is not to publish results, it is to maximize the profits” – so true. I feel if more people grasped this we’d see a push for the major tools to adopt and offer a multi-armed bandit approach to minimize regret.

Someone essentially lend a hand to make significantly posts I would state. This is the first time I frequented your web page and so far? I amazed with the analysis you made to create this particular post extraordinary. Fantastic task!

Materi, glad you liked the post. Appreciate the feedback :)