Kissmetrics Blog

A blog about analytics, marketing and testing

Why Website Test Results Don’t Always Add Up & What To Do About It

If you do enough A/B testing, I promise that you will eventually have some variation of this problem:

You run a test. You see a 10% increase in conversion. You run a different, unrelated test. You see a 20% increase in conversion. You roll both winning branches out to 100% of your customers. You donʼt see a 30% increase in conversion.

Why? In every world Iʼve ever inhabited, 10 plus 20 equals 30, right? Youʼve proven that both changes youʼve made are improvements. Why arenʼt you seeing the expected overall increase in conversions when you roll them both out?

There are lots of different reasons this can happen. Here are a few:

The Changes Affected the Same Group of People

You may be causing some problems for some of your users. I know, itʼs a chilling thought. But the truth is, something youʼre doing is probably keeping some of your users from doing something you want them to do.

The interesting thing is that there may be a lot of different ways to fix this problem youʼre causing. For example, if itʼs impossible for some people to purchase products on your site because you only accept Visa, you might experiment with also offering American Express or MasterCard.

So, letʼs say that you run two separate tests, one that adds AmEx, and one that adds MasterCard. You run them as separate tests because you want to gauge which is most attractive to your users. Or because you just love running tests.

Now letʼs say that each new payment method gives you a positive increase in revenue of 20%. Youʼd think that, if you combined them, youʼd get a revenue increase of 40%, right? Nope.

Imagine somebody who doesnʼt have a Visa, but does have both a MasterCard and an AmEx. Their problem (that they canʼt purchase because they donʼt have Visa) goes away regardless of which new payment method you end up implementing. Implementing both doesnʼt mean theyʼll spend twice as much.
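To make the overlap concrete, here is a minimal sketch in Python. The customer names and the cards they hold are made up purely for illustration; the point is only that the people helped by each change can be the same people.

```python
# Hypothetical customers and the cards each one holds (illustrative data only).
customers = {
    "ana": {"visa"},
    "ben": {"mastercard", "amex"},  # helped by either new payment method
    "chloe": {"amex"},
    "dev": {"mastercard"},
    "eli": {"visa", "amex"},
    "fay": set(),  # holds none of the three cards
}


def can_buy(accepted):
    """Return the names of customers holding at least one accepted card."""
    return {name for name, cards in customers.items() if cards & accepted}


baseline = can_buy({"visa"})
with_amex = can_buy({"visa", "amex"})
with_mastercard = can_buy({"visa", "mastercard"})
with_both = can_buy({"visa", "amex", "mastercard"})

# New buyers gained by each change, measured against the Visa-only baseline.
gain_amex = len(with_amex - baseline)
gain_mastercard = len(with_mastercard - baseline)
gain_both = len(with_both - baseline)

print(gain_amex, gain_mastercard, gain_both)  # 2 2 3
```

Each test on its own gains two buyers, but shipping both gains three, not four, because one customer is counted in both tests.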

Want to Avoid This?

If youʼre running several different tests that all act on the same metric, youʼre almost certainly going to hit this problem. Itʼs enough to be aware that itʼs happening and not to plan on test results adding up exactly.

Other Changes Are Hurting You

It seems like it would go without saying, but A/B tests only test A against B. They donʼt take into account other changes you may be making at the same time.

For example, letʼs say you run several tests to improve your registration flow. In each test, youʼre seeing statistically significant improvements in registrations. However, when you eventually release all the changes out to 100% of your new users, youʼre just not seeing as much improvement in the actual number of people who are registering as you expected.

The first question you should ask yourself is, “What else did I change?” So many things can affect registrations. For example, maybe youʼve changed the type of user youʼre acquiring, and the new users are less likely to register. Maybe something you did is slowing down your page load, and thatʼs causing fewer people to make it through registration. Maybe youʼre running another test that is having a seriously negative impact.

Want to Avoid This?

Make sure you know everything that might have an impact on the key metrics youʼre measuring, and test all of it.

If youʼre only testing your major features, youʼll never know if some ʻtrivial changeʼ you pushed out without thinking about it is actually counteracting the benefits you expect to receive from an experiment.

The Changes Interact

A/B tests are perfect for what I call ʻone variable changesʼ. For example, theyʼre great if youʼre testing a landing page, and you want to see if you get more conversions with a blue button or a red button. Things like this can have a surprising impact.

The problem comes when you try to merge the outcomes of two completely separate tests. Imagine a stupidly simple scenario in which you had two tests running:

  • The first tests a blue button vs. a red button on a white background.
  • The second tests a white background vs. a blue background with a red button.

Now, imagine that, in the first test, the blue button won, while in the second test, the blue background won. If you tried to merge the results of both tests, youʼd end up with a blue button on a blue background, which Iʼm going to guess isnʼt going to convert terribly well, since the button will be invisible.

Obviously, those are terribly designed tests, and you wouldnʼt just merge them, but the point is that all sorts of different experiments can combine in surprising ways. Certain messaging might be effective when presented with a particular image, while it might be awful with another image. Regardless of how each test performs independently, they might combine very poorly.
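The button example can be sketched with a few numbers. The conversion rates below are invented to show what an interaction looks like; they are not measurements from any real test.

```python
# Hypothetical conversion rates for each (button, background) combination.
# These numbers are invented for illustration only.
rates = {
    ("red", "white"): 0.050,   # the original design (control in both tests)
    ("blue", "white"): 0.060,  # winner of the button test
    ("red", "blue"): 0.058,    # winner of the background test
    ("blue", "blue"): 0.010,   # merged winners: the button is invisible
}

control = rates[("red", "white")]
button_lift = rates[("blue", "white")] - control
background_lift = rates[("red", "blue")] - control

# What you would expect if the two effects simply added up.
additive_prediction = control + button_lift + background_lift
actual_combined = rates[("blue", "blue")]

# A non-zero interaction means the tests cannot be merged by addition.
interaction = actual_combined - additive_prediction

print(f"predicted {additive_prediction:.3f}, actual {actual_combined:.3f}")
```

Neither test ever showed anyone the blue-on-blue combination, so neither test could have measured how badly it performs. That is why the merged design needs its own test against the original.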

Want to Avoid This?

Be aware of potentially conflicting changes, and make sure that youʼre always testing the final version of designs against the original, even if youʼve tested each element on its own.

Your Test Wasnʼt Statistically Significant

If youʼve done any A/B testing, youʼve probably gotten excited or sad about early results only to see them change wildly over time. The problem here is frequently a lack of statistical significance.

When youʼre working with very small numbers, the behavior of even one user can drastically throw off your metrics.

Consider what happens if you have only two users. If one person buys and the other doesnʼt, youʼve got a conversion rate of 50%. Not bad. But if both people buy, youʼve got a conversion rate of 100%. Thatʼs remarkable! And totally unsustainable!

Want to Avoid This?

You need to use a big enough sample size to make sure that your results are significant. And, this is important, you need to determine the sample size ahead of time in order to avoid something called repeated significance testing errors. I wonʼt try to describe it here, but feel free to read more about it if you donʼt believe me. Just be prepared to learn a little math.
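As a rough sketch of what deciding ahead of time can look like, the function below uses the common normal-approximation formula for comparing two conversion rates. The function name, the 5% significance level, and the 80% power are assumptions chosen for the example, not recommendations from this post, and a real test plan may call for a different method.

```python
import math
from statistics import NormalDist


def sample_size_per_group(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate users needed in EACH group to detect the given lift.

    Uses the two-sided, two-proportion normal approximation:
    n = (z_alpha + z_power)^2 * (p1(1-p1) + p2(1-p2)) / (p2 - p1)^2
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


# Assumed scenario: a 5% baseline conversion rate.
n_small_lift = sample_size_per_group(0.05, 0.10)  # detect a 10% relative lift
n_large_lift = sample_size_per_group(0.05, 0.50)  # detect a 50% relative lift

print(n_small_lift, n_large_lift)
```

Under these assumptions, detecting a 10% relative lift takes roughly 31,000 users per group, while a 50% lift takes only around 1,500. The smaller the change you hope to detect, the more users you need, which is why the number has to be fixed before the test starts rather than when the results look good.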

Another big problem is that it can be really tough to find the right sample size because of the natural variance of whatever youʼre trying to measure. One way to get an idea of your variance is to run an A/A test. This means that you split your users into two different groups and show them both the same thing.

Youʼd imagine that youʼd get exactly the same results from each group, but youʼll notice that there will be natural differences between the two groups. If this difference is very large, that means that youʼll most likely need a larger sample size to get statistically significant results.
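If you want a feel for this before running a real A/A test, you can simulate one. The sketch below assumes every user converts independently at the same made-up rate of 5%; the group sizes, run count, and seed are arbitrary choices for the example.

```python
import random


def simulate_aa_gap(users_per_group, true_rate, rng):
    """Simulate one A/A test and return the absolute gap in conversion rate."""
    conversions_a = sum(rng.random() < true_rate for _ in range(users_per_group))
    conversions_b = sum(rng.random() < true_rate for _ in range(users_per_group))
    return abs(conversions_a - conversions_b) / users_per_group


def typical_gap(users_per_group, true_rate=0.05, runs=100, seed=42):
    """Average gap between two identical groups over many simulated A/A tests."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    gaps = [simulate_aa_gap(users_per_group, true_rate, rng) for _ in range(runs)]
    return sum(gaps) / runs


small_groups = typical_gap(100)
large_groups = typical_gap(5_000)

print(f"typical gap with 100 users per group:   {small_groups:.4f}")
print(f"typical gap with 5,000 users per group: {large_groups:.4f}")
```

Both groups see exactly the same thing, yet the small groups disagree with each other far more than the large ones do. Any lift you measure in a real test has to be clearly bigger than this background noise before you can trust it.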

As a note to all the math majors out there, yes, I know that this is not a very good explanation of statistical variance, and you probably have something written in Greek that will explain it all much better. But the truth is, just this little bit of information can help you plan a much more predictive test.

Your Metrics Are Wrong

Here is the dirty little secret of gathering metrics: itʼs really kind of hard to get right. If anything about your metrics is confusing, itʼs very possible that the numbers youʼre collecting are just wrong.

I canʼt tell you the number of times a new feature that should have performed well has failed, and we have traced the failure to a small bug that only affected the way the events were being recorded.

Want to Avoid This?

Use your common sense. If youʼve done your homework, and you really believe something should be doing better or worse than it is, do a little digging.

Look for things like race conditions in the code that could be breaking the information gathering system. Also, make sure that youʼre recording everywhere an event occurs. If someone can purchase from multiple places in your product, make sure youʼre recording a ʻpurchaseʼ event everywhere it can happen.
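One way to make a missed event less likely is to route every purchase through a single function that does the recording. This is only a sketch: the function names are hypothetical, and the in-memory list stands in for whatever analytics system you actually use.

```python
recorded_events = []  # stand-in for a real analytics backend (assumption)


def record_event(name, user_id, source):
    """Append one event to the stand-in analytics store."""
    recorded_events.append({"event": name, "user_id": user_id, "source": source})


def complete_purchase(user_id, source):
    """The single code path that every purchase entry point calls."""
    # Payment handling would go here; it is omitted from this sketch.
    record_event("purchase", user_id, source)


def buy_from_product_page(user_id):
    complete_purchase(user_id, source="product_page")


def buy_from_cart(user_id):
    complete_purchase(user_id, source="cart")


buy_from_product_page(user_id=1)
buy_from_cart(user_id=2)

print(len(recorded_events))  # 2
```

Because neither entry point records the event itself, adding a third place to buy cannot silently skip the ʻpurchaseʼ event as long as it calls the same function.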

Still Not Adding Up?

There will be times when you do all of these things right, and you still get surprising results from your experiments. That means itʼs time to observe some users and find out whatʼs really going on.

Qualitative research is wonderful for understanding why your users are doing what theyʼre doing, especially when theyʼre doing something surprising! But thatʼs a different blog post.

About the Author: Laura Klein is a user experience and research expert in Silicon Valley, where she teaches startups how to make their products more appealing and easier to use. She blogs about UX, metrics, customer development, and startups at Users Know. You can also follow Laura on Twitter.

  1. Don’t you think that sometimes it makes no sense to invest money in usability, since it doesn’t bring you the required return, especially if you consider the time lost?

    • Working for a dedicated hosting service, you sell to geeks. Geeks are not the focus of usability, so why would you worry about usability. If you were selling to consumers, then usability matters. Simpler is only better when the population you are selling to values simpler. Geeks don’t value simpler. Usability is a retail concept. Usability is not a universal concept.

      That said, it should be noted that programmers don’t keep their carrier from their carried separate. Requirements elicitors don’t capture meaning. Without that meaning, the model is going to cause the view to fail. Fresh paint on the view won’t fix the model. We deliver average functionality. That means that usability is compromised for everyone.

      What are you losing time against? Most programming is churn. It doesn’t bring a return. The best websites build and forget. Then, they let that website generate $$$ for years and years without lifting a finger.

  2. “Your Test Wasnʼt Statistically Significant”

    As a rule of thumb, what’s the minimum number you see as likely to yield meaningful stats?

    • For A/B testing, you ideally want 1,000 visitors over the course of the test. 500 to page A, 500 to page B. It’s obviously nice to have more, but 1,000 is a good number to shoot for.

      • I think how many sample points or visitors you need also depends on the size of the change you are trying to detect. For example, if a change to the page should have a 50% impact, that is a lot easier to detect than a change with a 5% impact, given the same amount of randomness in the sample.

  3. Chris Neumann Sep 05, 2011 at 10:31 pm

    The other theory I have on this is that you’re generally pushing the people on the margin through the funnel. So, if you have 100 visitors and you ordered them by how likely they are to adopt your solution, then the 40th place person on that list is less likely than the 30th person on that list to adopt. For that reason, you have to do a lot of tests to get the 40th place person to sign up.

