A/B testing is a powerful way to improve online experiences, but it can also be frustrating. Three common problems are the low success rate of A/B tests, the difficulty of coming up with good ideas to test, and the challenge of separating causation from correlation.

Adding human insight can help with all three problems, but it’s not always obvious how to bring the two methods together. In this article we’ll show you what to do: when to use human insight in your A/B testing program, the specific steps you should take, and the sort of results you should expect.

The power of A/B testing

Let’s start with a definition: A/B tests are online controlled experiments in which you expose people to two or more randomly chosen variants of an online experience. Usually one variant is the current version of the experience, and the other is a modified version. You track which version performs better at getting people to do what you want them to do. Most commonly, A/B tests are used to optimize user interface elements on websites (for example, the color of the “buy now” button), but they can be used to test almost anything you can present online – images, videos, text, product features, etc.
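To make the mechanics concrete, here is a minimal sketch in Python of how variant assignment and tracking might work. It isn’t tied to any particular testing platform; the experiment name, the even split, and the helper functions are illustrative assumptions.

```python
# Minimal sketch of A/B assignment and tracking. Not any platform's real API;
# the experiment name and 50/50 split are made-up examples.
import hashlib
from collections import Counter

def assign_variant(user_id: str, experiment: str, variants=("control", "treatment")) -> str:
    """Deterministically bucket a user so they always see the same variant."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

exposures, conversions = Counter(), Counter()

def record_exposure(user_id: str) -> str:
    variant = assign_variant(user_id, "buy-now-button-color")
    exposures[variant] += 1
    return variant

def record_conversion(user_id: str) -> None:
    # Credit the conversion to whichever variant this user was shown.
    conversions[assign_variant(user_id, "buy-now-button-color")] += 1

# Conversion rate per variant = conversions / exposures; the "winner" is the
# variant with the higher rate, once the difference is statistically significant.
```

Hashing the user ID makes the assignment deterministic, so a returning visitor always sees the same variant for the life of the experiment.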

A good A/B testing program can help an online business increase its profits by millions of dollars. A/B tests are broadly used; a group of thirteen online companies including Google, Facebook, and Microsoft reported that they ran more than 100,000 A/B tests in 2018 alone.

Limitations of A/B testing

Despite their popularity, A/B tests can be a source of frustration for many companies. The testing process is expensive and time-consuming, and it’s hard to predict when you’ll see benefits from it, which can produce management pushback against the whole program. Three problems stand out:

  • The success rate for A/B tests is quite low
  • It’s often very difficult to come up with high-quality ideas to test
  • It can be very difficult to tell the difference between correlation and causation, making it hard to decide what to do next

Here are some details on the problems:

Most A/B tests fail

One frustrating fact about A/B tests is that most of them fail to produce a statistically significant improvement. In other words, either the current version wins or there’s no significant difference between the variants. There is no consensus on the overall success rate industry-wide, but online reports estimate it at between 10% and 25%, with most guesses clustering closer to 10%. That means 75% to 90% of A/B tests don’t provide benefits to the companies running them. This can produce a lot of impatience and frustration in the company doing the testing, especially if you’re testing a feature that has relatively low traffic, in which case a single test can take many weeks to produce a statistically significant result.
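To see why low traffic stretches a test out, here is a rough back-of-the-envelope calculation using the standard two-proportion sample-size approximation (95% confidence, 80% power). The baseline conversion rate, hoped-for lift, and traffic figure below are made-up numbers for illustration, not data from any study cited here.

```python
# Rough estimate of how long a test takes to reach significance.
# Baseline rate, lift, and traffic are illustrative assumptions.
from math import sqrt, ceil

def sample_size_per_variant(p_control, p_variant, z_alpha=1.96, z_beta=0.84):
    """Visitors needed in EACH variant for ~95% confidence and ~80% power."""
    p_bar = (p_control + p_variant) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p_control * (1 - p_control)
                                 + p_variant * (1 - p_variant))) ** 2
    return ceil(numerator / (p_control - p_variant) ** 2)

# Example: a 3% baseline conversion rate and a hoped-for lift to 3.3%.
n = sample_size_per_variant(0.03, 0.033)   # about 53,000 visitors per variant
daily_visitors = 1_000                      # a low-traffic feature (assumed)
days = 2 * n / daily_visitors               # two variants share the traffic
print(n, round(days))                       # about 106 days, roughly 15 weeks
```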

It’s hard to form good hypotheses to test

The second frustration with A/B tests is that it can be very hard to come up with good variants to test. In part, that’s because the most obvious fixes have already been turned into best practices (for example, if you want to create a checkout flow, it’s not a bad idea to copy what Amazon does). But A/B testing teams can also become focused on making small changes to a website more or less at random, which may not lead to useful insights. As UX researcher Shanshan Ma wrote:

“You need to develop ideas for alternative design directions for various pages. Is the layout going to be different? Should you try a different picture? Use a different font size? Although ideas may be flying about, there is not necessarily any guidance for how to come up with a winning alternative. What if layout A works better with picture B, but layout B works better with picture A?….You can confront a real quandary when all of the alternatives you’ve tested work equally well or badly. It may be that none of the alternatives you’ve tested allow you to discover the factor that would have the most impact on the effectiveness of a page design, so the optimal solution remains undiscovered even after the A/B testing.”

The underlying problem is that most of us are surprisingly poor at thinking up things that customers will respond to. Even experts in online optimization report that their guesses about how customers will react to something are wrong up to 90% of the time.

Researchers at Microsoft reported:

“One interesting statistic about innovation is how poor we are at assessing the values of our ideas. Features are built because teams believe they are useful, yet we often joke that our job, as the team that builds the experimentation platform, is to tell our clients that their new baby is ugly, as the reality is that most ideas fail to move the metrics they were designed to improve.”

It’s hard to separate causation from correlation

An A/B test can tell you which variant performs better, but it can’t tell you why. Your reaction to that might be, “who cares, as long as it works?” But if you don’t understand why a test worked, you may make follow-on mistakes. One notorious example from history was cited by Harvard Business Review in its coverage of A/B testing:

“Between 1500 and 1800, about two million sailors died of scurvy. Today we know that scurvy is caused by a lack of vitamin C in the diet, which sailors experienced because they didn’t have adequate supplies of fruit on long voyages. In 1747, Dr. James Lind, a surgeon in the Royal Navy, decided to do an experiment to test six possible cures….The experiment showed that citrus fruits could prevent scurvy, though no one knew why. Lind mistakenly believed that the acidity of the fruit was the cure and tried to create a less-perishable remedy by heating the citrus juice into a concentrate, which destroyed the vitamin C. It wasn’t until 50 years later, when unheated lemon juice was added to sailors’ daily rations, that the Royal Navy finally eliminated scurvy among its crews.”

Most modern A/B tests don’t kill thousands of sailors, but if misinterpreted they can hurt companies. For example, the optimization agency Conversion Sciences reported an A/B test in which a video explainer on a commerce site produced more conversions than an animation. Their initial conclusion was that video converts better than animation, and they started to plan for how they would iterate on the video. But when they dug further, they found that very few people were actually watching the video. It won not because video was more effective, but because something about the animation had been driving away customers – maybe it was slow to load, or its initial appearance was unattractive. Without understanding the cause of the win, it was impossible for them to understand what they should do next.

This sort of problem, in which causation is misunderstood or the test results are driven by an unidentified bug, is widely cited in the literature on A/B testing.

How human insight can improve A/B test results

There are many case studies showing benefits from mixing human insight and A/B tests. For example, commerce vendor Volusion reported a 10% increase in conversion from combining the two methods. But there’s not much information on exactly how to create that mix. The best how-to article we could find online was written in 2011 by Shanshan Ma, who is currently a UX research manager at ServiceNow. Writing for the excellent online magazine UX Matters, Shanshan explained that you should run a human insight study at the start of the A/B testing process to form better ideas about what to test. We agree strongly. We’ve also found two other areas in which human insight and A/B testing combine well. Here’s how we recommend you mix the two methods:

We recommend using human insight at three points in the A/B testing process: first, very early, to generate hypotheses to test; second, just before you launch the test, to validate and prioritize the variants; and third, after testing is complete, to verify causation.

1. Generate hypotheses to test

At the very start of your testing process, use human insight tests to help you create good hypotheses. The goal at this point is to inform your intuition by watching customers interact with the current version of the experience you want to optimize. Have them go through the experience while speaking their thoughts aloud. 

For example, if you’re optimizing the checkout flow on a website, have them make a pretend purchase: go through checkout and stop just before confirming the order. (Or, better yet, provide them with a dummy set of login credentials and a credit card number so they can complete the full checkout without actually ordering anything.)

If you’re evaluating written text, images, videos, and the like, save the material in a public-facing document. Have your participants go to the document during the test, read or watch it, and comment on it as they do so. Be sure to set the document to public access before you start the test, or they won’t be able to reach it. (Some human insight solutions will securely host the material you want people to review. That’s even better because it’s more convenient for you and keeps the material out of general public view.)

Good questions to ask during the test:

  • As you (read/view) the material, tell us what you’re thinking and why. What stands out to you?
  • How did you feel when interacting with this process (or document)?
  • What did you like most?
  • What did you like least?
  • How did you feel about the (colors, images, text, font, etc)?
  • Did anything confuse you? If so, why?
  • Was there additional information you wanted that you didn’t get?
  • If this had been real life, what would you do next? Why?
  • If you had a magic wand and could change anything, what would you do to improve this (page, image, video, etc)?

2. Validate and prioritize variants

Once you’ve formed a hypothesis about what could be improved, you’ll generally be able to think of several variants that might solve the problem. You could test them all simultaneously in a multi-variant (A/B/n) test, but that slows down the testing process because every additional variant means a longer wait for statistically significant results.
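To put rough numbers on that trade-off, here is a small sketch that reuses the made-up traffic and sample-size figures from the earlier calculation; it ignores multiple-comparison corrections, which stretch the timeline even further.

```python
# Rough sketch of why adding variants stretches the timeline: each arm still
# needs the full sample, and the available traffic is split across more arms.
# Figures are illustrative assumptions carried over from the earlier sketch.
daily_visitors = 1_000
visitors_needed_per_variant = 53_000

for num_variants in (2, 3, 5):
    days = num_variants * visitors_needed_per_variant / daily_visitors
    print(f"{num_variants} variants: about {days / 7:.0f} weeks to reach significance")
# 2 variants: ~15 weeks, 3 variants: ~23 weeks, 5 variants: ~38 weeks
```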

As an alternative, you can put several variants into a human insight test, and get an evaluation of them in less than a day. You’re spending a few hours now to save weeks of testing later. Have the participants look at each variant, think aloud as they do it, and then have them answer some questions. The results of this test will help you in several ways:

  • You can identify unexpected problems that might have caused a variant to fail (for example, perhaps a color you used falls flat, or a word you chose aggravates some people). These drawbacks are easy to fix before you go to the A/B test.
  • You may find that one element of a particular variant is well received, while a different element of another variant is liked. You can combine the best of both into a new variant that may perform best of all.
  • You will usually get a feel for which variant is most attractive to people. Run your A/B test on this version first and you’ll maximize your chances of getting a win on the first try.

Keep in mind: Your goal here is not to substitute the human insight test for the A/B test; it’s to use human insight to help you make a better-informed hypothesis about which variant to test. Think of this stage as a brainstorming process between you and the participants.

Good questions to ask in the test:

  • In your own words, describe what this (page/image/video/etc) is communicating
  • What about it did you like the most, and why?
  • What about it did you dislike the most, and why?
  • What, if anything, was unclear or confusing?
  • Please rate your level of agreement with the following statements (Rating scale [Strongly disagree – Strongly agree])
    • “The (page/image/video/etc) is easy to understand.” Please explain your rating.
    • “The (page/image/video/etc) compels me to take action.” Please explain your rating.
  • What additional feedback, if any, do you have about what is being communicated on this page?

After they have seen both the control and the variant(s), ask:

  • In your own words, describe the difference between the options you just saw. Don’t use language from the page itself.
  • Explain what about Option A is compelling to you. (repeat for option B, C, etc)
  • Which option did you prefer? Please explain your answer.

3. Verify causation

After you have run the A/B test, sometimes the winner will be a huge surprise, or it may be very difficult to understand why people reacted the way they did. These results are sometimes genuine breakthroughs, but sometimes they turn out to be caused by unrelated problems you’re not aware of. Before you act on a surprise finding, it’s good practice to dig into the motivations and reactions of people interacting with the winning variant.

Run a human insight test, take participants to the experience that won, and have them talk you through exactly what they are thinking as they experience it. Does it meet their expectations? Does it surprise them in any way? Are they confused about anything? Also ask what they would do next, and why. Emphasize that they should explain their motivations and assumptions out loud. 

Also, observe what they do. For example, if your winning variant has a video on the screen, do they watch the video? If so, do they have good feedback about it? It’s just as important to notice what they don’t do or say as it is to note what they actually do.

You won’t always be able to figure out why a particular variant won, but the more you understand, the better hypotheses you’ll be able to create for your next round of experiments.

Good questions to ask in the test:

  • Describe what this (page/image/document/etc) is about. Use your own words.
  • How would you rate your overall emotional reaction to it? Please explain your answer. (rating scale, from negative to positive)
  • What did you like most about it, and why?
  • What did you like least about it, and why?
  • Was there anything that confused you? If so, why did it confuse you?
  • Do you have unanswered questions about anything you see?
  • Did anything surprise you? If so, what was it and why?
  • What do you think will happen if you click (or proceed, or whatever the next step is)?
  • If you were interacting with this in real life, what barriers, if any, would keep you from clicking (or proceeding, or whatever the next step is)?

Conclusion

A/B testing is never going to be a perfect process, but we can make it better. Think of an A/B testing program as a conversation with the market in which you propose things by launching variants, and the customers reply to you through their behavior. If you look at it that way, it’s a very stilted and serialized conversation: an endless series of “how about this?” questions that gradually meander toward the truth. 

Adding human insight makes the conversation richer and faster. Rather than asking one semi-random question at a time, you can brainstorm with the customers up front, use quick feedback to create and prioritize your tests, and dig into the motivations and perceptions that drive customer behavior. This does require you to slightly modify your A/B testing process, but the reward is more effective testing and a much shorter learning curve.

What do you think? We welcome your feedback. Feel free to use the comments below to post questions and your own thoughts on how to optimize A/B tests through human insight.

Photo by Jason Dent on Unsplash

The opinions expressed in this publication are those of the authors. They do not necessarily reflect the opinions or views of UserTesting or its affiliates.