When NOT to A/B Test Your Mobile App

A/B testing might be the single most effective way to turn a good app into an amazing app. However, it’s also a subtle way to lead yourself into THINKING you’re improving your app when in fact your test results are full of false positives or you’re spending precious time testing when you could be doing something else.

Don’t get me wrong. A/B testing can be outstandingly effective at increasing user conversions and your bottom line (which is why all the big guys such as Facebook, LinkedIn, and Etsy are A/B testing constantly), but there’s a time to test and there’s a time to just implement changes. You should skip A/B testing…

1. When being first is more important than being optimized

You don’t have to be a genius data scientist to A/B test well, but it’s not a trivial task either. First of all, A/B testing can take time. You need to plan the test, program the different variants, push out a new version of your app through the App Store or Play Store, and wait while users engage with your app long enough to give you clear results.

A/B testing platforms with a visual editor can help reduce the time needed for programming and can eliminate app approval red tape, but to some extent planning and waiting are unavoidable. Of course, if you already have a stable user base and there’s no immediate urgency to make certain changes, the time it takes to A/B test is completely worth it.

Nevertheless, there are situations when time is your most important resource. A common scenario is when getting to market first gives you a significant competitive advantage, whether that’s launching a new feature or shipping a large design change.

KAYAK ran into this exact situation when Apple announced iOS 7. Apple released all of its new developer resources in a single day, but it was up to developers to decide when their apps would adopt the new design.

KAYAK is a company that tests A LOT. A data-driven, experimental mentality is core to their corporate culture, but this was a time when they chose to implement a large suite of design changes outright rather than test each detail. And it paid off.

[Image: the KAYAK app]

According to their Director of Engineering for Mobile at the time, Vinayak Ranade, “If we had spent time incrementally testing every single change we’d made and redesigned, we would never have made it. And a lot of companies did do that and they were three months late to the game.”

2. When you are fairly certain the hypothesis is wrong or you have no hypothesis at all

It’s easy to focus so hard on developing an experimental culture that you start testing everything. Literally everything. Even when you have no clue what exactly you’re testing and even when you already know the change probably would not be helpful.

While it’s great to test anything that could have a positive impact on your bottom line, it’s important to remember that the more you test, the more likely it is that you’ll get false positive results. The typical threshold for statistical significance is 95%. Roughly speaking, that means you run a test until there is less than a 5% chance that a difference as large as the one you observed would have shown up purely by chance if the change actually had no effect.

A 95% significance threshold is the scientific norm; it’s the same rigorous standard applied to FDA clinical trials. And yet it still means that if you ran 100 A/B tests of changes that actually do nothing, you should expect roughly 5 of them to come back as statistically significant “improvements” purely by chance.
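To put rough numbers on that, here’s a quick sketch in R, assuming the worst case where none of the changes you test actually does anything:

# Rough false-positive arithmetic at a 5% significance level.
# Worst-case assumption: none of the tested changes has any real effect,
# so every "statistically significant win" is a false positive.
alpha   <- 0.05
n_tests <- 100

alpha * n_tests                 # expected false positives: 5
1 - (1 - alpha)^n_tests         # ~99.4% chance of at least one false positive
1 - pbinom(4, n_tests, alpha)   # ~56% chance of five or more false positives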

There is no way to completely avoid this, but there are ways to reduce the number of false positives you get. The best thing to do is to test with care: make sure you know what you’re testing and have a solid hypothesis for how the change could improve your bottom line.

If you’re testing a button color, why do you think green will be better than blue? Are you randomly testing colors or do you think a certain contrast between the button and background colors will make the button more noticeable to customers?

Creating a good hypothesis and planning the test(s) to prove the hypothesis will give your tests direction and yield actionable insights that are less likely to be due just to chance.

Likewise, A/B testing should be skipped in situations where you know that an idea almost certainly will improve your app and the risks associated with blindly implementing the idea are low.

For example, Robot Invader, the makers of Wind-up Knight and Rise of the Blobs, consistently asks beta users for feedback. After playing the beta version of their newest game, Wind-up Knight 2, several players thought there wasn’t enough congratulatory “glitter” after completing achievements.

The recommendation from users was that more pomp and circumstance be added so that players would feel rewarded after accomplishing certain tasks and be more aware of the new features they just unlocked. The downsides of implementing something like this are close to zero, and the likely impact is positive.

There is no reason to spend time and resources to test something that probably is good and has low risk. Jumping to implementation is perfectly advisable.


Screenshot of me completing a level PERFECTLY on Wind-up Knight 2. Yea, I’m bragging.

3. When you don’t have enough users

As with any scientific experiment, you need to have enough data points to gather statistically significant results. This means you need to have a minimum number of users participating in each test. Depending on how you structure the test (how many variants) and what your expected results are (a small improvement off of an already high conversion rate or a large improvement off of a low conversion rate), you might need thousands of users to get statistically significant results. Since not everyone has Google’s scale, the key is prioritization.

If you don’t have many users, you might want to first focus your time on activities that will bring in more users. This could be marketing or even pivoting your app to build up the features that customers are actually using. Once you have enough users to start optimizing, you might have only enough users to run one test at a time.

In this case, it’s really important to first test the ideas that are most likely to have a big impact but too risky to jump straight to implementation. Examples of risky yet likely impactful ideas are changes to in-app purchases, login screens, page flow, and algorithms related to app logic (e.g., how recommendations are surfaced or how search queries are answered).

Here’s a simple chart to estimate how many users you need to get statistically significant (95%) results when doing an A/B test with two different variants (an A and a B). The number of users you need depends on your conversion rate (existing conversion rate of variant A) and how much better you expect the new variant to be (predicted increase in conversion rate of your new variant B).

[Chart: number of users needed for 95% statistical significance, by current conversion rate (variant A) and predicted relative increase in conversion rate (variant B)]

Example: If your current conversion rate is 5% and you predict that it’ll increase by 15% with your new variant, you’re expecting your new conversion rate to be 5.75%. For this test, you’ll probably need around 9,200 users to get statistically significant results. That is, 4,600 users for variant A (your current version) and 4,600 users for your variant B (your new version).
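If you’d rather compute an estimate than read it off a chart, here’s a minimal sketch in R. The first formula is a significance-only estimate (one-tailed, 95% confidence, using the baseline variance), which is one plausible way to land near the ~4,600-per-variant figure above; the second call is R’s built-in power.prop.test, which additionally requires 80% statistical power and therefore asks for considerably more users (the comment thread below digs into exactly this trade-off):

# Worked example above: 5% baseline conversion, 15% predicted relative lift
p1    <- 0.05
p2    <- p1 * 1.15          # 0.0575
delta <- p2 - p1            # 0.0075

# (a) Significance-only estimate: one-tailed, 95% confidence, baseline variance
z <- qnorm(0.95)                      # ~1.645
2 * z^2 * p1 * (1 - p1) / delta^2     # ~4,569 per variant (roughly the ~4,600 above)

# (b) Conventional powered estimate: 95% significance AND 80% power
power.prop.test(p1 = p1, p2 = p2, sig.level = 0.05,
                power = 0.80, alternative = "one.sided")   # ~11,180 per variant, ~22,400 total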

When you’re low on users, you also must watch the funnel: the higher up you test, the faster you’ll get results. If 1,000 daily users land on your app’s login screen but only 100 make it to checkout, with all else being equal, a test of the login screen can produce results up to 10 times faster than a test of the checkout screen, simply due to the volume of users.
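To make that funnel math concrete, here’s a tiny sketch using the hypothetical traffic numbers above and the ~9,200-user estimate from the earlier example:

# Same test, run at two different stages of the funnel
daily_login    <- 1000   # users reaching the login screen each day
daily_checkout <- 100    # users reaching checkout each day
n_needed       <- 9200   # total users required for the test (earlier example)

n_needed / daily_login      # ~9 days of traffic if you test the login screen
n_needed / daily_checkout   # ~92 days if you test the checkout screen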

4. When what you’re testing contradicts your brand or design principles

While we want to test as much as possible, there are some things that are hard or unwise to test. A/B testing a new logo after your company has been established for years can cause brand confusion among your customers. You might get more conversions in the short term as the change catches people’s eyes, but testing radical changes to your brand can be damaging in the long term.

This especially applies to design elements. An unusually large or off-color button might get more clicks because it stands out so much, but it could also change how your users see your brand. Making an otherwise elegant app less elegant might not noticeably hurt engagement at first, but you could lose customers over time.

Similarly, some changes to price are really difficult to test (not to mention frowned upon by Apple). We test with the assumption that test results are reproducible and externally valid. In other words, testing one random group of people will produce the same results as testing another random group of people.

Logos and sometimes prices are not like that because customers talk to each other. If it’s highly publicized in the media that it costs $9.99 to unlock the full features of your app, it’s probably not a good idea to show a different price to some users. They might have read the article that promised $9.99 and be much more likely to upgrade if their version is cheaper or much less likely to upgrade if they see a higher price.

Either way, your results could be entirely biased and inaccurate, not to mention that huge PR mess you just got yourself into: once pricing goes public, all tests are off.

Summary

All in all, A/B testing your native mobile app is challenging but well worth the effort because it will help your good app become great. However, experienced testers test with caution:

  • Don’t sacrifice speed for optimization when time is your most important resource.
  • Test frequently and continuously but avoid over-testing and aimless testing. Have concrete hypotheses in mind and plan your tests to prove or disprove them.
  • Make sure you have a sufficient number of users to reach statistical significance on each test. If you don’t have many users, prioritize tests so that you don’t spread your users too thin across them.
  • Do not pit intelligent design against evolution through testing. New ideas being tested should mesh with your overall brand, look, and feel.

About the author: Lynn Wang is the head of marketing at Apptimize, an A/B testing platform for iOS and Android apps designed for mobile product managers and developers alike. Apptimize features a visual editor that enables real-time A/B testing without needing app store approvals. It has a programmatic interface that allows developers to test anything they can code. Sign up for Apptimize for free today or read more about mobile A/B testing on their blog.

  1. Hello, thank you guys, really helpful article. I’m almost a complete newbie in the mobile app market, so it’s great to have a good resource for planning ahead.

  2. “Example: If your current conversion rate is 5% and you predict that it’ll increase by 15% with your new variant, you’re expecting your new conversion rate to be 5.75%. For this test, you’ll probably need around 9,200 users to get statistically significant results. That is, 4,600 users for variant A (your current version) and 4,600 users for your variant B (your new version).”

    How are you calculating that? Doing a rough calculation in R using the same 5% base & 15% improvement (relative), I’m seeing that to have 95% statistical significance and 80% power requires about ~11,200 users per variation, for a total of ~22,400.

    • Great question, Brian!

      A variant vs. baseline conversion test is a simple 2-sample binomial problem (http://www.cliffsnotes.com/math/statistics/univariate-inferential-tests/test-for-comparing-two-proportions), and because you have a pre-defined hypothesis (the variant is better than the baseline), you want to aim for a one-tailed test.

      For 5% baseline conversion and 5.75% variant conversion (15% lift), you can take a sample size of ~9,200 (to be precise, 4,569 per variant) and plug in the numbers:
      - Pooled Sample Proportion: 5.4%
      - Z-Stat = -1.58
      - Significance = 94.41% (off because of the discrete conversion numbers)
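
      Written out in R, that calculation looks roughly like this (assuming observed rates of exactly 5% and 5.75%):

      n  <- 4569                                       # users per variant
      p1 <- 0.05                                       # baseline (variant A) conversion
      p2 <- 0.0575                                     # variant B conversion (15% relative lift)
      p_pooled <- (p1 + p2) / 2                        # 5.375% -- the ~5.4% pooled proportion above
      se <- sqrt(p_pooled * (1 - p_pooled) * (2 / n))  # pooled standard error
      z  <- (p2 - p1) / se                             # ~1.59, the Z-stat above (up to sign and rounding)
      pnorm(z)                                         # one-tailed significance, ~94.4%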

      Going from a hypothesis (baseline conversion, desired lift) to the required sample size requires some algebra work, but the fundamental approach is the same.

      I can probably answer your question more precisely if you tell me what method you’re using to reach your numbers (possibly a different distribution?). Feel free to reach out to me at lynn@apptimize.com if you have more questions!

      • Hi Lynn,

        With that sample size (4,569 per variant), the test is underpowered. I’ve used R to calculate this as follows: power.prop.test(n = 4569, p1 = 0.05, p2 = 0.0575, alternative = "one.sided", sig.level = 0.05)

        Based on this, the statistical power of the test is ~47.79%; in other words, the probability that you haven’t committed a Type II error (false negative) is only ~47.79%.

  3. Hi Brian,

    I see what you’re saying! We actually purposefully create user estimates based on statistical significance, but not statistical power. Though this is not the same method used in academia (where 80% power is standard for publication), we find that statistical power is often secondary when dealing with user-constrained businesses that have a lot of ideas to test.

    The 80% power approach is the most rigorous, because it protects against false negatives, but it’s not often used when you have limited users and lots of ideas to test, for three main reasons:

    - Analyst Observation: as you approach the 95% significance level, even if there is a potential “false negative” situation (a good answer, but with <95% significance), a testing analyst is typically able to recognize whether it’s worth continuing the test, which is always an option. On the other hand, if results are positive and significant (which can happen even below 80% power), it’s time to take action and move on to the next test

    - Low Cost of False Negatives: high-innovation apps typically have a low cost for false negatives – for the vast majority of tests you can always move on to the next idea. Of course, when the stakes are higher (say you're testing a radically new flow that took you weeks to build) it's worth designing for power

    - Opportunity Cost of Testing: most companies, even with millions of downloads, have more testable ideas than users on which to test them. This often means that you're better off testing three ideas at 50% power than one at 80%

    There are advantages to both approaches – it really depends on how constrained you are by your ideas-to-users ratio. Additionally, the estimates I gave in the post are estimates of how many people are necessary to run a test. We often recommend that customers continue tests even after statistical significance has been reached, because there are other factors that contribute to erroneous results (primarily, not running the test for enough time can give you a biased sample of users).
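
    To put rough numbers on that trade-off, here is a quick power.prop.test sketch using the same 5% baseline / 15% relative lift example from the post:

    # Users needed per variant at different power targets (one-tailed, 95% significance)
    n_80 <- power.prop.test(p1 = 0.05, p2 = 0.0575, sig.level = 0.05,
                            power = 0.80, alternative = "one.sided")$n   # ~11,180 per variant
    n_50 <- power.prop.test(p1 = 0.05, p2 = 0.0575, sig.level = 0.05,
                            power = 0.50, alternative = "one.sided")$n   # ~4,890 per variant
    ceiling(c(n_80, n_50))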

  4. Thanks, I know mobile is more and more important, so good to know this. Will work on it!

  5. Glad you enjoy it. Thanks for the great feedback :)
