The vibes about A/B testing are wrong

Why the backlash is a big mistake

Duncan Gilchrist
Jeremy Hermann
March 11, 2025

There's a wave of anti-A/B testing sentiment in the air. More leaders are talking about the need for taste-based decision-making; many are actively deriding testing as a crutch that stifles creativity and sidesteps leadership. 

They point to tweetable statements like:

  • Linear’s Karri Saarinen: “We don’t do A/B tests. We validate ideas and assumptions that are driven by taste and opinions, rather than the other way around.”
  • Intercom’s Des Traynor: “Every product company that focusses on A/B tests … to boost conversion are duping themselves”
  • Tinder's Brian Norgard: “A/B testing culture guiding product is a clear signal the company is out of new ideas.”
  • Shopify’s Tobi Lütke: “There needs to be more acceptance in business of unquantifiable things.” 
  • Airbnb’s Brian Chesky: “A/B testing is abdicating your responsibility to the users.” (This one was taken pretty wildly out of context — but I’ve seen it quoted as such, so needs to be acknowledged.)

As a data leader who has led large-scale testing initiatives, here’s my (Duncan’s) hot take: of course taste is important for decision making; you were kidding yourself if you ever thought otherwise. 

But that doesn’t mean you should throw the baby out with the bath water. A/B testing is an invaluable tool that enables your company to really learn and scale, by systematically separating the wheat from the chaff. You’re also kidding yourself if you think otherwise.

Let’s start by unpacking what’s gone wrong with experimentation.

Where A/B testing goes wrong

Testing gets a bad rap for two reasons: 

  1. tests are being used as a substitute for strategy, and
  2. data teams aren’t doing testing right.

How testing gets misappropriated for strategy

In the early days at Uber, the company was laser-focused on driving trip volume. This was beautiful in how easy it was to measure and to communicate. But here’s the problem: “more trips” isn’t a strategy; it’s a metric.

Focusing on trips led us to double down on creating trips at any cost, for example with short and cheap trips in products like Uber Pool. That’s different from building repeated, high-quality experiences, and led to a tradeoff that nobody intended to make: short-term growth vs. long-term customer value. 

The problem wasn’t our testing — it was the lack of coherent strategy. When this happens in your organization, data leaders need to have a hard conversation. If you’re being asked to ‘test what features users want’, that’s a sign to push back. What kind of experience are teams trying to create? What’s the long-term product vision? 

It may feel uncomfortably close to saying, “You need to do your job first, so I can do mine.” But the strategy needs to come first, and then A/B testing can help you test your path toward it. 

There’s a helpful analogy here: A/B testing isn’t going to tell you what hill to climb. Rather, once you pick a hill, A/B testing will help you find your way to the top. 

How teams screw up A/B testing

I’ve also seen the call coming from inside the house — tests are developed, run, or reported on poorly. The problem is that if testing isn’t executed well, then business leaders will justifiably ask, what’s the point?

There are a few common ways teams fall down on testing:

  1. Not accounting for seasonality

Teams sometimes run tests during atypical time periods and fail to contextualize the results accordingly. For example, a promotion test that happens to fall over a holiday might show spectacular conversion rates that would be impossible to maintain year-round. If you then extrapolate those results as if they're representative of normal conditions, business stakeholders immediately start questioning your judgement.

  2. Missing long-term effects

Short-term metrics frequently hide longer-term consequences that aren't captured in the initial testing window. A feature might drive an immediate conversion uplift of 15% — while simultaneously increasing negative reviews or return rates that only become apparent weeks later. Without proper longer-term measurement, these tests can lead to features that optimize immediate results while quietly eroding brand equity and customer lifetime value.

  3. The sum of the parts is too big

When multiple teams run parallel experiments, each claiming significant improvements, the reported combined impact often exceeds what seems possible. I’ve seen it firsthand: six teams each claim 5% improvements, while the entire business only grew 20% during that period. There are lots of reasons this can happen (we won’t unpack those here), and it’s critical that data leaders be careful about rolling all of these effects together. Again, claiming nonsensical victories creates serious credibility issues. 
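
To see why this undermines trust, it helps to do the arithmetic. Here’s a minimal sketch in Python using the illustrative numbers from the example above (six teams, 5% claimed lifts, 20% actual growth); the point is simply that the claims don’t reconcile with reality.

```python
# Illustrative arithmetic for the "sum of the parts" problem, using the
# numbers from the example above: six teams each claiming a 5% lift while
# the business as a whole grew 20% over the same period.
claimed_lifts = [0.05] * 6          # each team's reported relative improvement
actual_growth = 0.20                # observed growth of the whole business

naive_sum = sum(claimed_lifts)      # simply adding the claims

compounded = 1.0                    # stacking the claims multiplicatively
for lift in claimed_lifts:
    compounded *= 1.0 + lift
compounded -= 1.0

print(f"Sum of claimed lifts:     {naive_sum:.1%}")      # 30.0%
print(f"Compounded claimed lifts: {compounded:.1%}")     # 34.0%
print(f"Actual business growth:   {actual_growth:.1%}")  # 20.0%
```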

Six common-sense practices for strong A/B testing

After seeing both the successes and failures of testing programs firsthand, I’ve found these six best practices make the difference between testing that drives good decisions and testing that drives skepticism. 

  1. Define strong success metrics

When your experiments don’t have predefined success criteria, you’ll likely end up cherry-picking whatever looks good in the data. Establish 1-2 primary metrics that directly tie to your hypothesis, along with several secondary metrics to catch potential negative impacts in other areas. 

So if your primary metric is conversion rate, track secondary metrics like time-on-page, user satisfaction scores, and 30-day retention to ensure you're not creating downstream problems. 

  2. Write down a clear hypothesis

Vague experimentation without clear direction wastes resources and creates confusion about what insights to extract. Instead of approaching tests with a generic "Let's see what happens if we change X" mindset, frame each experiment with a specific hypothesis: "We believe changing X will improve Y because Z." 

For example, "We believe showing fewer search results per page will increase conversion because it reduces cognitive load for customers." This structure forces teams to articulate their reasoning and creates natural guardrails for interpretation when results come in.

  3. Run tests long enough

The pressure to move quickly often leads teams to cut testing windows short, missing critical medium and long-term effects. Make sure your tests have sufficient time to capture the full impact of your changes, particularly for features that might influence customer behavior patterns over time. 

Consider that a pricing change might show an immediate uptick in conversions — but lead to decreased customer lifetime value that only becomes apparent after a few months. 
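
“Long enough” is partly a sample-size question. As a rough illustration, here is a standard two-proportion power calculation (not something specific to any one team’s setup; the baseline rate, target lift, and traffic figures below are hypothetical) that estimates the users needed per arm and backs out a minimum duration from your traffic:

```python
# Rough sample-size and duration estimate using the standard two-proportion
# power formula. The baseline rate, target lift, and traffic figures are
# hypothetical and only illustrate the mechanics.
from statistics import NormalDist
import math

def users_per_arm(p_control: float, p_variant: float,
                  alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed per arm for a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    variance = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p_variant - p_control) ** 2)

baseline = 0.052                    # current conversion rate
target = baseline * 1.05            # want to detect a 5% relative lift
n = users_per_arm(baseline, target)

daily_traffic_per_arm = 8_000       # hypothetical traffic per arm
print(f"Users per arm: {n:,}")                                            # ~117,000
print(f"Minimum duration: ~{math.ceil(n / daily_traffic_per_arm)} days")  # ~15 days
```

Even then, this only sizes the window needed to detect the immediate effect; the longer-term impacts described above still require measurement beyond the initial test window.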

  4. Share both positive and negative results

The nature of experimentation means some hypotheses won’t pan out. But don’t sweep failed tests under the rug — create a culture where every test is valued as a learning opportunity. 

For example: “Our test to simplify the checkout flow actually decreased conversion by 2%, teaching us that users value security indicators more than we expected.” This transparency builds credibility with stakeholders and creates institutional knowledge that prevents teams from repeatedly testing bad ideas.

As Ramesh Johari explained on our podcast, High Signal, this is critical to becoming what he calls “a self-learning organization”.

  5. Haircut appropriately

No test exists in a perfect vacuum, and pretending otherwise undermines trust. Acknowledge when a test might be impacted by external factors, and apply appropriate "haircuts" to results when reporting up the chain. 

For instance, "We saw a 10% improvement, but since it was during our peak season, we're conservatively estimating a 5% annual impact." This honest approach builds confidence in your reporting and establishes your team as trustworthy partners rather than metric chasers who don’t understand business context.

  6. Run your program tightly

When different teams use different methodologies and reporting approaches, it’s impossible to compare results across experiments — and easy for the entire testing program to lose credibility. Instead, ensure all teams use the same standardized approach to measuring and reporting impact. 

For example, require every test report to include both the relative improvement (“conversion improved by about 12%”) and the absolute change (“from 5.2% to 5.8%”) so business stakeholders interpret results consistently.
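
One way to enforce that standard is to generate the reported line from the raw rates themselves, so the relative and absolute figures can never drift apart. A minimal sketch (the helper and numbers are illustrative, mirroring the example above):

```python
# Minimal, illustrative reporting helper: both the relative lift and the
# absolute change are computed from the same raw rates, so the two figures
# can never be quoted inconsistently. Numbers mirror the example above.
def format_result(metric: str, control_rate: float, variant_rate: float) -> str:
    absolute_pp = (variant_rate - control_rate) * 100        # percentage points
    relative = (variant_rate - control_rate) / control_rate  # relative lift
    return (f"{metric}: {relative:+.1%} relative "
            f"({control_rate:.1%} -> {variant_rate:.1%}, {absolute_pp:+.1f} pp)")

print(format_result("conversion", 0.052, 0.058))
# conversion: +11.5% relative (5.2% -> 5.8%, +0.6 pp)
```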

Running a tight, high-quality testing program is such an easy way to drive significant impact and credibility in your organization.

Striking the balance: taste for strategy, testing for tactics

Back at Uber, when the company finally figured out that we needed to be more deliberate about our strategy, we had a realization: a trip isn’t just a trip. Not all trips are created equal, and we shouldn’t indiscriminately optimize for trips alone. We needed to define a comprehensive strategy, and then a basket of metrics that reflected those goals. 

Leaders sometimes complain that A/B testing encourages focusing on the numbers rather than the big picture. But focusing on the numbers is exactly the point of testing; it just needs to happen within the right context:

  • Strategy and big directional decisions based on taste, vision, and leadership judgment
  • Execution and optimization based on rigorous testing and data

Once you’re using the right metrics, A/B testing can do what it’s supposed to: enable more reliable (and much easier) tactical decision-making. It allows teams to quickly, independently, and consistently learn what works and make decisions accordingly — allowing a large organization to get far more done.

So the next time you hear leaders criticizing A/B testing, listen carefully to what they’re actually saying. It doesn’t need to be an OR; it should be an AND. Testing is a compass, not a roadmap. Pick the right hill, and use testing to chart the fastest course.
