Supercharge Your Testing With The New KISSmetrics A/B Test Report

We have a very cool feature coming out today in KISSmetrics: our new A/B Test Report! In the past, if you wanted to analyze an A/B test using your KISSmetrics data, you first had to run a funnel report and then manually enter the data into whatever external tools you use for your analysis. Now you can do all of your A/B test reporting from within KISSmetrics!

Turning your KISSmetrics data into insights

Let me give you a quick walkthrough of what you’ll find.

1-report-ab-test

To start out we have to choose a target event which we’ll consider our “conversion event”, and the KISSmetrics property that will serve as our “experiment”.

2-report-configuration

For this test we’re going to use the ‘Signed up’ event as our conversion event. The really cool thing to note here is that any event being tracked in KISSmetrics can be used as a conversion event!

3-report-configuration

Next we have to select which experiment we’re going to look at. In this case we’re going to use the property associated with an Optimizely test we’re running. This is a great example of the fact that you don’t have to give up on using your favorite A/B testing tools in order to get all your data and reports in one place.

4-select-a-baseline

A/B testing is all about making a comparison, so all of your results are given in relation to a baseline. Normally this would be your control or original variant, but our reporting tool lets you choose any of the values of your property to serve as a baseline. It’s also important to point out that while this example has only 2 values for the property, this report will work with any number of values that a property might have. Testing more than one variant at a time is no problem.

5-variation-1-the-winner

When you run the report it gives you a nice summary of the results. Notice at the top we provide you with a simple explanation of your results. In this case there is enough data and strong enough significance to make the call.

We also give you all the key summary data you’ll need to understand your results:

  • How long the test ran
  • The number of people in the test
  • Total conversions
  • The improvement you’re likely to see
  • How sure you can be of seeing an improvement
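The article does not say how the last two figures are calculated. As a rough sketch only, here is one common way to get a relative improvement and a certainty of improvement from raw counts, using a one-sided two-proportion z-test with a normal approximation. The function names and the example counts are hypothetical, and the report itself may use a different method.

```python
from math import erf, sqrt


def relative_improvement(base_conv, base_n, var_conv, var_n):
    """Relative lift of the variant's conversion rate over the baseline's."""
    base_rate = base_conv / base_n
    var_rate = var_conv / var_n
    return (var_rate - base_rate) / base_rate


def certainty_of_improvement(base_conv, base_n, var_conv, var_n):
    """Approximate probability that the variant beats the baseline.

    Assumption: normal approximation to the difference of two proportions.
    This is an illustration, not the report's documented method.
    """
    base_rate = base_conv / base_n
    var_rate = var_conv / var_n
    std_err = sqrt(base_rate * (1 - base_rate) / base_n
                   + var_rate * (1 - var_rate) / var_n)
    if std_err == 0:
        # No variation in either group, so the data cannot separate them.
        return 0.5
    z = (var_rate - base_rate) / std_err
    # Standard normal CDF evaluated at z.
    return 0.5 * (1 + erf(z / sqrt(2)))


# Hypothetical counts: 1,000 people per variant, 100 vs. 130 conversions.
improvement = relative_improvement(100, 1000, 130, 1000)
certainty = certainty_of_improvement(100, 1000, 130, 1000)
print(f"improvement: {improvement:.1%}, certainty: {certainty:.1%}")
```

With these made-up counts the variant shows a 30% relative improvement, and the certainty lands a little above 98%.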

Exploring Your Data

6-improvement-over-time

For those looking for information beyond a simple summary we have more for you!

First, the report provides a visual history of the estimated improvement as more data is collected. This is important because looking at significance alone can be deceptive: in the early stages of a test it is not unlikely that the inferior variant will temporarily look like the winning one. We mark the point at which the results achieve what is commonly considered statistical significance with a trophy icon, and shade the timeline after that point.

Looking at this history you can clearly see how much your improvement is jumping around, and use that to get a better intuition of how trustworthy your results are. The more stable your estimate of improvement, the more likely it is to be accurate.
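To get a feel for why the early part of the timeline jumps around, here is a small simulation sketch. It assumes made-up "true" conversion rates and equal daily traffic to both variants, and records the estimated improvement after each day. Everything in it (names, rates, traffic) is hypothetical and only illustrates the idea.

```python
import random


def simulate_improvement_history(base_rate, variant_rate, people_per_day,
                                 days, seed=0):
    """Return the cumulative estimated relative improvement after each day."""
    rng = random.Random(seed)
    base_conv = var_conv = people = 0
    history = []
    for _ in range(days):
        people += people_per_day
        base_conv += sum(rng.random() < base_rate
                         for _ in range(people_per_day))
        var_conv += sum(rng.random() < variant_rate
                        for _ in range(people_per_day))
        if base_conv == 0:
            # Improvement is undefined until the baseline has a conversion.
            history.append(None)
        else:
            # Both groups have the same size, so the ratio of conversion
            # counts equals the ratio of conversion rates.
            history.append((var_conv - base_conv) / base_conv)
    return history


def spread(values):
    """Range (max minus min) of the defined estimates in a window."""
    defined = [v for v in values if v is not None]
    return max(defined) - min(defined)


history = simulate_improvement_history(0.10, 0.12, people_per_day=200, days=30)
print("spread over first 10 days:", spread(history[:10]))
print("spread over last 10 days: ", spread(history[-10:]))
```

Comparing the spread of the first few days against the last few is one simple way to put a number on how stable the estimate is.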

7-variation-winner

Finally, for anyone that is looking to do more analysis on their own, our report provides all the data you’ll need. This gives you much of the information found in our typical funnel report plus you get the likely improvement of each variant as well as the certainty that there is an improvement.

KISSmetrics is about People!

The focus of KISSmetrics is people, and so we have also built in the ability to explore the individual people that are going through your test.

8-ab-test-metrics

Click on any of the points in the improvement timeline and you’ll be presented with links to run a People Report so you can see who in each variant converted, or simply everyone who passed through that variant.

Changing Your Conversion Event

Now let’s go back to the fact that you can leverage all of your KISSmetrics data in this report. We have our result for conversion to signups, and things look great, but maybe we should also see how the results look if we swap out ‘Signed up’ for a more important conversion event, ‘Received data’.

9-report-configuration

All we have to do is go back to the top of the report, change our conversion event and rerun our report!

10-there-is-no-clear-winner

Now we can see that if ‘Received data’ is the conversion event we really care about, we don’t have enough data to call our test. In this case the results are looking pretty close even after over 40,000 observations. The report is letting us know that maybe in this case it is best just to stop the test and stick to our original. Of course we’re free to continue running our test, and eventually we should reach a conclusion; however, it is unlikely we’ll see the real gains we’re looking for.

Comparing Multiple Variants

To highlight a few more features let’s go back and look at results from a test that we ran a few months back, long before work was even started on this report! Because of the way we can make use of any of our KISSmetrics data – the event/property combination we’re investigating can be a test we ran long ago, or even a combination that was never even thought of as being an A/B test in the first place!

11-experiment-variations

In this report we’re comparing 3 variations against the original page. Here we can see how the report handles this. We get 3 lines, one for the comparison of each variant against the baseline. The report tells you which of the 3 variants is superior.
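One way to picture a multi-variant comparison is as a series of separate A/B tests, each variant against the chosen baseline, with the winner being the variant that clears a certainty threshold and shows the largest improvement. The sketch below works under that assumption, using a normal-approximation certainty, a 95% threshold, and invented counts. All names are hypothetical, and it applies no correction for multiple comparisons.

```python
from math import erf, sqrt


def compare_to_baseline(results, baseline):
    """Compare every other variant to the baseline.

    results maps a variant name to (conversions, people).
    """
    base_conv, base_n = results[baseline]
    base_rate = base_conv / base_n
    rows = {}
    for name, (conv, n) in results.items():
        if name == baseline:
            continue
        rate = conv / n
        std_err = sqrt(base_rate * (1 - base_rate) / base_n
                       + rate * (1 - rate) / n)
        z = (rate - base_rate) / std_err
        rows[name] = {
            "improvement": (rate - base_rate) / base_rate,
            # One-sided certainty that this variant beats the baseline.
            "certainty": 0.5 * (1 + erf(z / sqrt(2))),
        }
    return rows


def pick_winner(rows, threshold=0.95):
    """Highest-improvement variant among those at or above the threshold."""
    significant = {k: v for k, v in rows.items()
                   if v["certainty"] >= threshold}
    if not significant:
        return None
    return max(significant, key=lambda k: significant[k]["improvement"])


# Invented counts for an original page and three variants.
results = {
    "original": (100, 1000),
    "variant 1": (120, 1000),
    "variant 2": (150, 1000),
    "variant 3": (145, 1000),
}
print(pick_winner(compare_to_baseline(results, "original")))   # variant 2
print(pick_winner(compare_to_baseline(results, "variant 2")))  # None
```

Switching the `baseline` argument mirrors changing the baseline in the report: every improvement and certainty is then expressed relative to the new choice.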

This is great, but it looks like the variants are all doing well against the original. Let’s see how they do against each other.

12-experiment-variations

This time you can see we’ve switched our baseline from the original to variant 2, the clearly superior variant. We’ve also deselected the original so that it no longer appears in our visualization. In our data table all of our improvements and certainties are expressed in terms of our new baseline. Right away we can see that variant 2 and variant 3 are actually very close, certainly too close to call. This is extremely useful to know. Maybe your design team prefers variant 3; now you know you’re free to make that choice with likely little or no loss in conversion.

The new A/B Test report will open up many new ways to explore and gain insights from your existing KISSmetrics data. I hope you find it as exciting and useful as I have!

About the Author: Will Kurt is a Data Scientist/Growth Engineer at KISSmetrics. You can reach out to him on twitter @willkurt and see what he’s hacking on at github.com/willkurt.

  1. Amazing! This is something I’ve wanted to see built into Google Analytics for such a long time I can’t even begin to tell you. You might just win me over as a client with this one :-)

    One thing I’m curious about – what algorithm do you use to correct for multiple comparisons errors – when you calculate certainty for several variations?

    Also – how do you cope with the “peeking” problem, which introduces basically the same type of error as the multiple comparison issue (type I)?

    • Hey Geo!

      First off thanks for your feedback, these are all great questions!

      For multiple variants the key is that we focus on what the user has selected as a baseline, we then essentially treat it as a series of A/B tests against the baseline. The way we decide an overall winner is the one that has achieved ‘significance’ (I know this can be problematic and I’ll touch on this in a bit) and has the highest likely improvement over the baseline. To get more fine tuned results among the variants you can simply change the baseline to one of the variants and compare from there.

      Now as for ‘peeking’ we don’t do anything to stop it, but I actually have put a lot of thought into this. If you have the chance I recommend reading through this post I put up earlier this week: http://blog.kissmetrics.com/your-ab-tests-are-illusory/ it covers my thoughts on the ‘peeking problem’. But the short answer is: to avoid error requires more observations, but the reason most people call tests early is because getting more observations is costly for them. While stopping as soon as a test reaches 95% certainty will lead to an increase in type 1 error, I feel the issue isn’t really between users that stop early or wait, but people who don’t even wait until 95% to call a test.

      That said I view this as an evolving report. I believe (and user feedback has shown) that simply providing more data to the user is confusing, which in turn can lead to poorer decision making. What I would love to see happen is to study the efficacy of different decision making strategies and build them into this report. In the meantime making it so the basic tools for decision making are available is a big first step.

      Thanks again for your questions and I definitely hope we can win you over as a client!

      • Thank you for your reply, Will.

        I understand what you are doing there with the baseline change and the changing tails of the significance test, clever :-)

        However, I still haven’t understood how you cope with the multiple comparisons problem. The problem is that if you choose a 95% significance level and do more than one pairwise comparison, then you’ve actually increased the chance that one or more of those comparisons will yield a false positive at that level (type I error).

        If you do 13 pairwise tests against a baseline and you aim for a 95% significance level, then you have statistically about a 50% chance that at least one of those tests will report a significant result at the 95% level when it is not really significant (a false positive). So, the reported 95% significance is not really 95%, but much lower. Thus the need to adjust the significance values for those pairwise comparisons based on the number of tests conducted and their results.

        Looking at the screenshots posted and doing some math I can’t see any p-value adjustment being applied, but I might be simply unfamiliar with the method you are applying, so that’s why I asked.

        Now, about the “peeking” problem: if I am to rephrase your point: people still don’t get statistical significance and I’m expecting them to worry about multiple comparisons :-) Point taken! I hope features like these can help change this situation and I salute your efforts since statistical significance is what makes or breaks data-driven decision making.

        Let me know if what I wrote above makes sense.

      • Hey Geo!

        I definitely get your point now! In the current iteration we do not adjust p-values for multiple variants, but I did see you mentioned the False Discovery Method in another post, which I will definitely look into. Thanks for your feedback!

  2. Dear Concern,
    Do you have any free tools that I can use to verify how much profit I am going to make when purchasing from Kissmetrics?

    Thanks,
    Evan Webb

  3. Is there any chance that using KissMetrics rather than Google Analytics will have an impact on our SEO? Google Analytics allows for duplicate content while testing and handles it to avoid penalties. How would KissMetrics protect us against the penalties of duplicate content?

  4. Thank you for sharing ..!!
    Until today I didn’t understand A/B testing. After reading this article I completely understand what A/B testing is and how to do it properly.

  5. Wow.. Amazing KISSmetrics free trial. I will ask my officials to purchase your complete package.

  6. For some reason it won’t let me reply to a threaded comment, but in regards to the back and forth with Geo and Will:

    “So, the reported 95% significance is not really 95%, but much lower.”

    To clarify, the reported significance (95%) would be correct for the individual comparison, though you are correct that the probability of committing at least one Type I error within the experiment as a whole will be above 5%.

    Your ~50% figure is based on the Family Wise Error Rate, which is really quite stringent as you are trying to control the probability of making at least one false discovery among all comparisons. In a business setting such as with A/B testing, the tolerance for “risk” and making incorrect decisions can be a bit higher than say, in the medical field, where the cost of a false positive could mean doing harm to someone by giving them the wrong treatment. As a result, it’s probably more practical here to manage using the False Discovery Rate.
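For readers who want to check the numbers in this exchange, here is a small sketch. It computes the family-wise error rate under the simplifying assumption that the tests are independent (comparisons that share a baseline are not strictly independent), and shows the Benjamini-Hochberg procedure as one standard way to control the False Discovery Rate. The function names and the example p-values are made up for illustration. Nothing here describes what the report does.

```python
def family_wise_error_rate(num_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** num_tests


def benjamini_hochberg(p_values, fdr=0.05):
    """Return the indices of the hypotheses rejected at the given FDR."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        # Track the largest rank k with p_(k) <= (k / m) * fdr.
        if p_values[i] <= rank / m * fdr:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


# 13 tests at the 95% level: roughly a 49% chance of a false positive.
print(round(family_wise_error_rate(13), 3))

# Made-up p-values for four variants compared against a baseline.
print(benjamini_hochberg([0.001, 0.02, 0.04, 0.30]))
```

With the made-up p-values, the first two comparisons are kept as discoveries. The third (p = 0.04) would pass an unadjusted 5% test but does not survive the adjustment.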

  7. Jenifar Lopez May 28, 2014 at 9:39 pm

    I would like to check first with a free trial how much I would benefit after buying this one. Is there any chance to evaluate the performance of KISSmetrics? Thanks!!

    • Jenifar, free trials are great at determining how effective a product is. If you check out our case studies and testimonials you’ll see that people have benefited greatly from KISS.

  8. You have done a good job. Keep it up.
