Beyond 95%: How to Select the Right Confidence Level for A/B Testing
- Alex Shleizer
- Jun 8, 2023
- 5 min read
Suppose you've conducted an A/B test between two different landing pages and now have the results in hand. The next step involves determining whether these results are statistically significant. Numerous online tools offer the ability to test for significance based on your data, often presenting three primary confidence levels: 90%, 95%, and 99%. But why are these the preferred choices? And more importantly, which one should you opt for?

In this article, I delve into these pertinent questions and argue that, in many instances, it may be more practical to use a confidence level lower than 90% when interpreting A/B test results. Moreover, I will provide some heuristics to help you select an appropriate confidence level for your situation.
Understanding Confidence Levels
(Feel free to skip this part if you’re already familiar with confidence levels)
To better understand confidence levels, let's use an analogy. Imagine you're in a colossal casino with countless slot machines. Rumor has it that some machines are luckier than others. Your mission is to determine which machines are indeed worth playing. To do this, you decide to test two machines, Slot A and Slot B. After a few rounds on each machine, it appears that Slot B is delivering more payouts than Slot A. You're ready to decide which slot to continue playing, but how certain are you?
This is where the concept of "confidence levels" comes into play.
A high confidence level, like 95%, essentially implies, "I won't play Slot B unless I'm 95% sure it offers more payouts." A higher confidence level means you would need to play both slot machines more times to be sure that Slot B isn't merely experiencing a lucky streak. However, each round incurs time and monetary costs. Thus, there's a clear trade-off: more rounds yield higher confidence in your decision but at the expense of additional resources.
On the other hand, lowering the confidence level to, say, 80% means you'll need fewer rounds (less cost) before choosing Slot B. However, this increases the likelihood of choosing a machine that seemed luckier due to mere chance (false positive) but actually isn't. And if you choose Slot A, while Slot B is truly the better option, you've encountered a false negative - a missed opportunity.
In essence, confidence levels in A/B testing help us determine how certain we need to be before making a decision. The goal is to strike a balance between the costs of more testing (playing more rounds) and the risk of making an incorrect choice due to insufficient data (choosing the wrong slot machine).
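To make this trade-off concrete, here is a minimal simulation sketch in Python. The payout rates (10% for Slot A, 12% for Slot B), the number of rounds, and the function name are made-up illustrations, not figures from the analogy above; the point is simply that a lower confidence level spots the better machine more often with the same number of rounds, at the cost of more false alarms when the machines are actually identical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def share_declared_winner(p_a, p_b, rounds, confidence, n_trials=5_000):
    """How often Slot B is declared the winner at a given confidence level
    (one-tailed pooled two-proportion z-test on the observed payout counts)."""
    alpha = 1 - confidence
    wins = 0
    for _ in range(n_trials):
        a = rng.binomial(rounds, p_a)   # payouts observed on Slot A
        b = rng.binomial(rounds, p_b)   # payouts observed on Slot B
        p_pool = (a + b) / (2 * rounds)
        se = np.sqrt(p_pool * (1 - p_pool) * 2 / rounds)
        if se > 0 and stats.norm.sf((b - a) / (rounds * se)) < alpha:
            wins += 1
    return wins / n_trials

# Slot B really is better: a lower confidence level finds the true winner more often
print(share_declared_winner(0.10, 0.12, rounds=500, confidence=0.95))
print(share_declared_winner(0.10, 0.12, rounds=500, confidence=0.80))

# The slots are identical: a lower confidence level produces more false positives
print(share_declared_winner(0.10, 0.10, rounds=500, confidence=0.95))
print(share_declared_winner(0.10, 0.10, rounds=500, confidence=0.80))
```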
Debunking the 95% Confidence Level Standard
A quick Google search will reveal numerous recommendations advocating a 95% confidence level. Practically, this means that if you have a control and a test, you will declare the test the winner only if there's a 5% or less chance that it outperformed the control due to random luck.
However, this implies attributing significant weight to the control, a bias that isn't always justified! To illustrate this, consider a numerical example:
Assume you have a landing page with a 1% conversion rate in an expensive sector where each click costs $30. After running the test for some time, you obtain the following results:
| Variation | Clicks | Conversions | CVR | CPC | Cost |
|-----------|--------|-------------|------|------|---------|
| Control | 1,000 | 10 | 1% | $30 | $30,000 |
| Test | 1,000 | 12 | 1.2% | $30 | $30,000 |
Now if you plug these numbers into the A/B test calculator linked earlier, you will get the answer that this result is not statistically significant using a confidence level of 95%.
However, let's operate under the assumption that the test variation does indeed deliver a solid 20% improvement over the control. The stark reality is that confirming this would require an investment of more than $60,000! How much more? Even if I increased the spending tenfold, bringing 10,000 clicks to each variation, the results would still be statistically insignificant, even after expending a total of $600,000! From simulations, it seems around 25,000 clicks per variation (roughly $1.5 million in total) are required to recognize a variation that is 20% better when applying a 95% confidence level (two-tailed test) in a low-conversion-rate situation - a scenario quite common in many sectors.
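If you want to check these figures yourself, one common way to run the numbers is a pooled two-proportion z-test. The sketch below is only an approximation of what online calculators do, so expect slightly different p-values depending on the tool:

```python
import numpy as np
from scipy import stats

def two_tailed_pvalue(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-test, two-tailed."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * stats.norm.sf(abs(z))

# The table above: 1,000 clicks per variation, 1% vs 1.2% CVR
print(two_tailed_pvalue(10, 1_000, 12, 1_000))        # ~0.67, nowhere near significant

# Keep the same underlying rates and scale up the traffic
for n in (10_000, 25_000):
    p = two_tailed_pvalue(round(0.01 * n), n, round(0.012 * n), n)
    print(n, round(p, 3))                             # ~0.17 at 10k clicks, ~0.03 at 25k
```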
Often, what transpires is that people neither wait nor expend so much. They simply declare the test variation a loser and proceed, inadvertently missing out on the opportunity to attain a 20% increase.
Opting for a 95% confidence level essentially implies a substantial status quo bias. You are demanding the test to decisively outperform the control before pronouncing it as the winner, yet you ask much less of the control version.
The 95% number has its origins in the scientific realm - but business is not science
The 95% confidence level is frequently used as a standard in scientific research. This high level of confidence is justifiable when conducting scientific research, where findings declared as "scientific truths" are taught in schools, broadcasted in the media, and may have policy implications. It's entirely sensible to be cautious and conservative when accepting new findings, preferring to err on the side of rejecting real findings (false negatives) over accepting false ones (false positives).
However, this stringent standard isn't always applicable in business-related scenarios. For instance, if you select a landing page that performs "just as well" as the control, no harm is done. And if it performs somewhat worse, the incurred damage is relatively symmetrical to overlooking a superior landing page. Hence, you typically wouldn't want to lean so heavily towards the control, and you wouldn't want to consistently apply the high confidence levels used in scientific research.
Selecting Confidence Levels for an A/B Test
My proposed heuristic for selecting the confidence level for statistical tests involves asking yourself: "How confident am I that the control is good?" Choose a lower value if your confidence is low and a higher value if your confidence is high.
Consider this scenario: you've launched a new website and a week later created an A/B test for the landing page. This situation warrants little to no confidence in the control, as the test might have benefitted from lessons learned during the initial design phase. Here, I'd forgo running a statistical confidence analysis and simply choose the version with the higher conversion rate.
Contrast this with another scenario where your website version has been operating successfully for 10 years, consistently outperforming numerous A/B tests. In this situation, a status quo bias is sensible, and you'd want any new version to convincingly outperform before replacing the proven design. Here, I might choose a 90%-95% confidence level.
For a version that has passed a few A/B tests, a confidence level of 60% might be appropriate. Here's a rough guide to selecting confidence levels based on your subjective confidence. These aren't hard and fast rules but provide a more nuanced approach than blindly adhering to a 95% standard.
| Your confidence in the control compared to the test | Statistical confidence level for the test |
|------------------------------------------------------|-------------------------------------------|
| Nonexistent to low | No statistical analysis. Just pick the winner based on higher CVR |
| Low to medium | 60% |
| Medium to high | 80% |
| High to very high | 90%-95% |
A practical note: most online calculators don't support confidence levels lower than 90%, but this calculator lets you get a Bayesian estimate rather than a z-test one. You can use the Bayesian probability that the test variation is better and check whether it exceeds your selected confidence level.
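If your preferred tool doesn't go below 90%, you can also approximate the Bayesian comparison yourself. The sketch below is a minimal Monte Carlo version assuming uniform Beta(1, 1) priors on both conversion rates; the function name and the 60% threshold are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def prob_test_beats_control(conv_c, n_c, conv_t, n_t, samples=200_000):
    """Monte Carlo estimate of P(test CVR > control CVR),
    assuming uniform Beta(1, 1) priors on both conversion rates."""
    control = rng.beta(1 + conv_c, 1 + n_c - conv_c, samples)
    test = rng.beta(1 + conv_t, 1 + n_t - conv_t, samples)
    return (test > control).mean()

# The landing-page example from earlier: 10/1,000 vs 12/1,000
p = prob_test_beats_control(conv_c=10, n_c=1_000, conv_t=12, n_t=1_000)
print(f"P(test > control) ≈ {p:.2f}")   # roughly 0.66

# Compare against the confidence level you picked from the table, e.g. 60%
print("ship the test variation" if p > 0.60 else "keep the control")
```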
Another thing to consider is that if deploying new variations is risky or expensive, it makes sense to be more conservative and choose a higher confidence level to avoid deploying variations that are not actually better.
In conclusion, the 95% confidence level is not a magic number, and mindlessly following it may lead to missed opportunities in a business setting. Confidence levels should be selected based on a careful consideration of your specific context and the level of confidence you have in your control. By better understanding and thoughtfully choosing your confidence level, you can make more effective and resource-efficient decisions in your A/B testing.