A MIXPANEL GUIDE
The New Testing Paradigm
SECTION 01
Why experimentation matters more than ever
Product teams can build faster than ever before. AI has lowered the cost of generating code, compressing the time from idea to first implementation. But the full development cycle (review, QA, debugging, and confident deployment) is a more complicated story. While building is easier now, validating whether you’re making the right decisions hasn’t caught up.
The teams that thrive today aren’t the ones shipping the most. They’re the ones shipping the right things at a faster pace. User attention is limited: ship a new onboarding flow, a new UI, and a few new features in a couple of weeks, and see if the people who love your app are willing to keep up. More importantly, changing too much too fast without validation risks wasted time, bad customer experiences, and general chaos.
The solution isn’t to slow down. It’s to change how you ship.
At Mixpanel, we believe testing is how you bridge the gap between accelerated development cycles and staying grounded in customer needs. But to get there, testing itself needs to change, too.
The velocity gap
There are two kinds of velocity that product teams need to track, but most only focus on one. Deployment velocity is how quickly you can ship changes. AI has accelerated the code generation part of this significantly: Engineers produce first drafts faster, and prototypes get built in hours instead of days.
But recent research suggests the picture gets murkier downstream. AI-generated code tends to introduce more defects, require more review cycles, and accumulate technical debt faster than code written with more deliberate oversight. The result is that teams can get from idea to a working MVP faster, but they often have less confidence in what they’re shipping.
Intelligence velocity is how quickly you can validate whether any new code has the expected effect, and this part of the equation is lagging behind. Most teams still rely on the same instrumentation, the same analytics review process, and the same post-launch retrospective they’ve always used.
When deployment velocity outpaces intelligence velocity, unvalidated debt accumulates. Features ship without clear success criteria. Metrics move, but no one is sure why. Teams optimize for output over outcome.
In other words, AI speeds up output while validation struggles to keep pace. Experimentation, reimagined for AI-assisted development, is how you can start to bring output and validation back into alignment.
[Diagram: The Old Model vs. The New Model]
The shift: From testing to intelligence
For a long time, running an experiment meant something linear: have an idea, set up a test, wait for results, ship the winner. The process was largely manual and the tools required were often separate from each other. The cycle was episodic: after each experiment, you’d reset and start over.
That model made sense when the pace of development was slower. It doesn’t anymore.
Experimentation in 2026 and beyond looks different. It isn’t a siloed, occasional step. It’s an automated part of the process. It starts from real user behavior, not assumptions. It relies on shared metrics that persist across tests, not definitions that get recreated every time. And it produces intelligence that feeds forward into the next decision, rather than sitting in a retrospective doc no one reads.
This is the shift: from isolated tests to a connected intelligence system. The rest of this guide is about what that shift looks like in practice, and how to build it on your team.
The humbling part is that most ideas don’t survive contact with real users.
When Microsoft’s experimentation team analyzed years of well-designed, well-executed A/B tests, they found that only about one-third actually improved the metrics they were built to move. The rest were flat or negative. Similar patterns show up across major experimentation programs. Even experienced teams are wrong about what will work most of the time, which is exactly why the connected intelligence loop matters. You can’t reason your way to the winners. You have to test.
What’s at stake
The risk of skipping this isn’t a few wasted experiments. It’s compounding uncertainty at scale. Every feature you ship without a hypothesis is a bet you’re making blind. In an AI-assisted workflow, where the cost of developing a new feature has dropped dramatically, the number of those bets compounds as more ideas become code. Without a systematic way to validate which changes are working, the signal-to-noise ratio deteriorates.
The teams building an intelligence advantage now, teams that systematically test, analyze, and feed that knowledge forward, are building something competitors can’t easily replicate. Everyone has access to the same AI tools. What you can’t copy is someone else’s validated understanding of their users, accumulated with each experiment.
AI can help you generate code faster. It can’t validate that what you’re building is actually what customers want.
SECTION 02
Why experimentation must be paired with analytics
Experiments answer a specific, powerful question: What happens when you intervene? Run the test, observe the outcome, and you have evidence that a specific change caused a specific result.
But experiments can’t tell you everything. They can confirm a hypothesis, but they can’t generate one. They can tell you that variant B won, but they can’t tell you why.
That’s where analytics shine. Teams that run experimentation and analytics in separate tools, one for experiments and one for data, miss the compounding benefit of running them together.
From observation to hypothesis
Before you can design a meaningful test, you need to understand what users are actually doing: where they drop off, what sequences of actions precede conversion, which cohorts behave differently from the average.
That picture comes from analytics. It’s the difference between guessing what to test and knowing what to test.
A team that looks at behavioral data before forming a hypothesis will write a sharper, more precisely-scoped experiment. They’ll target the users most likely to be affected. They’ll define success in terms of behaviors that matter, not proxy metrics that are easy to measure.
Behavioral precision (knowing which users to include based on what they’ve actually done) is one of the highest-leverage improvements most teams can make to their experimentation programs. It improves signal quality, reduces noise, and forces the team to articulate a clear, user-grounded reason for running the test in the first place.
Measurement is often overlooked and is often the most impactful. In our product teams at Airbnb, there’s usually half a dozen engineers and a data scientist, and maybe a designer. The role of the data scientist is to help the product manager value and prioritize opportunity sizes and hypotheses right at the start of the product lifecycle.
Alok Gupta, Director of Data Science at Airbnb
Metrics as shared infrastructure
One of the most common sources of friction in experimentation is metrics that don’t travel. A team runs a test, defines what success looks like for that particular experiment, and then two months later another team runs a related test with a slightly different definition. The results can’t be compared. Intelligence doesn’t accumulate.
Strong experimentation programs treat metrics as durable, reusable infrastructure, not one-off definitions. When your primary metrics are standardized across experiments, results become comparable. Teams can build on each other’s findings instead of starting from scratch every time.
This is about trust, not just tidiness. Experiments are only as credible as the metrics they’re measured against. If two teams define the same metric differently, neither team can be confident in their results. Over time, that inconsistency erodes the entire program’s value.
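One way to picture metrics as shared infrastructure is a small registry that every experiment reads from, rather than each team writing its own definition. The sketch below is illustrative only: `MetricDefinition`, `MetricRegistry`, and the event name are hypothetical and are not part of any Mixpanel API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricDefinition:
    """One shared definition of a metric (all field names are hypothetical)."""
    name: str
    event: str        # the event that counts as a conversion
    window_days: int  # how long after exposure the event may occur


class MetricRegistry:
    """Holds one definition per metric name so experiments cannot silently diverge."""

    def __init__(self) -> None:
        self._metrics = {}

    def register(self, metric: MetricDefinition) -> None:
        existing = self._metrics.get(metric.name)
        if existing is not None and existing != metric:
            # Two teams defining the same name differently is the
            # inconsistency that makes results incomparable.
            raise ValueError(f"conflicting definition for metric {metric.name!r}")
        self._metrics[metric.name] = metric

    def get(self, name: str) -> MetricDefinition:
        return self._metrics[name]


registry = MetricRegistry()
registry.register(
    MetricDefinition("activation", event="direct_deposit_set_up", window_days=30)
)

# A later experiment reuses the shared definition instead of writing its own.
activation = registry.get("activation")
```

Rejecting a conflicting redefinition at registration time is a design choice in this sketch: it surfaces the disagreement before any experiment runs, rather than after two sets of results fail to line up.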
Behavioral targeting: Running more precise experiments
Most standalone experimentation platforms limit targeting to runtime attributes: things you know about a user at the moment they hit your product, such as their OS, their country, and whether they’re on mobile or desktop. That’s a shallow slice of what actually matters.
When experimentation is connected to your analytics, the targeting model changes entirely. Instead of asking “Who is this user?” you can ask “What has this user done?” You can define experiment audiences based on behavioral sequences: for example, people who completed onboarding but never invited a teammate, accounts that engaged with a feature three times in their first week and then stopped, or users who have seen a particular screen but never converted.
This matters for two reasons. First, it makes experiments more precise. When you run a test on a broad population that includes users for whom the change is irrelevant, you’re diluting the signal. Behavioral targeting concentrates the experiment on the users where the hypothesis actually applies.
Second, it closes the loop between opportunity identification and experiment design. In a connected system, the same behavioral data you used to spot the opportunity becomes the definition of the experiment audience. You’re targeting the exact users your hypothesis was built around.
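To make behavioral targeting concrete, here is a minimal sketch of the first audience mentioned above: users who completed onboarding but never invited a teammate. The event log, event names, and the `behavioral_cohort` helper are all hypothetical; a real platform would evaluate this against its own event store.

```python
from collections import defaultdict

# Hypothetical event log: (user_id, event_name) pairs.
events = [
    ("u1", "onboarding_completed"),
    ("u1", "teammate_invited"),
    ("u2", "onboarding_completed"),
    ("u3", "signup_started"),
    ("u4", "onboarding_completed"),
]


def behavioral_cohort(events, did, never_did):
    """Return users who performed `did` at least once and never performed `never_did`."""
    seen = defaultdict(set)
    for user_id, event_name in events:
        seen[user_id].add(event_name)
    return sorted(
        user_id
        for user_id, names in seen.items()
        if did in names and never_did not in names
    )


audience = behavioral_cohort(
    events, did="onboarding_completed", never_did="teammate_invited"
)
print(audience)  # ['u2', 'u4']
```

The same definition can serve twice: once in analytics to size the opportunity, and again as the experiment audience, which is the loop-closing property described above.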
The connected loop
The clearest mental model for experimentation in the AI era is a loop, not a line.
It starts with observation: What are users doing, and where are the opportunities?
It moves to hypothesis: Based on what we observed, what change might improve things, and for whom?
Then to experiment design and execution: How can we design a test with enough statistical power to support or refute our hypothesis?
Then to analysis: Not just, “Did it work?” but “Why, and what does that tell us?”
And finally, back to observation, informed by everything you just learned.
[Diagram: The Connected Loop for Experimentation]
In the fragmented model, this loop is broken at every step. Data lives in separate tools. Metrics get redefined. Results don’t carry forward. Each experiment starts from scratch.
In the connected model, the loop closes. Behavioral data informs hypotheses. Experiments run using shared metrics. Analysis happens in context. And intelligence accumulates, experiment after experiment, into something that compounds.
That compounding is the advantage, and it only happens when analytics and experimentation are built to work together.
Experiments tell you what happened. Analytics tell you why. You need both.
SECTION 03
What an experimentation workflow looks like in the AI era
Effective experimentation isn’t about running more tests; it’s about running better ones with a repeatable process that produces reliable outcomes at every stage.
The six stages below describe a modern experimentation workflow. They’re tool-agnostic by design. What you’ll notice is that each stage feeds naturally into the next. This is a system, not a checklist.
[Diagram: Modern Experimentation Workflow, 6 Stages]
Stage 01: Identify opportunities from user behavior
Good experiments start from evidence, not intuition. Before writing a hypothesis, spend time looking at your behavioral data.
Where are users dropping off?
Which paths precede conversion?
Are there segments behaving differently from your average user?
THE QUESTION YOU’RE TRYING TO ANSWER
Where is the gap between what users are doing and what you want them to do?
This stage is often skipped. Teams jump straight to ideas, but ideas without behavioral grounding tend to produce broad, noisy experiments, tested on the wrong users, measuring the wrong things. Grounding your opportunity in real behavior gives you a sharper target from the start.
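As a rough illustration of looking for the gap, the sketch below computes step-to-step conversion for a funnel and picks out the weakest step. The funnel steps and counts are invented for the example, and `funnel_report` is a hypothetical helper, not a product feature.

```python
def funnel_report(step_counts):
    """Step-to-step conversion rates for an ordered list of (step, users) pairs."""
    report = []
    for (prev_name, prev_n), (name, n) in zip(step_counts, step_counts[1:]):
        report.append((f"{prev_name} -> {name}", n / prev_n))
    return report


# Hypothetical counts of users reaching each step.
funnel = [
    ("signup", 1000),
    ("onboarding_completed", 620),
    ("teammate_invited", 150),
    ("first_report", 120),
]

report = funnel_report(funnel)
worst_step, worst_rate = min(report, key=lambda item: item[1])
print(f"largest drop-off: {worst_step} ({worst_rate:.0%} continue)")
```

In this made-up data, the transition into inviting a teammate loses the most users, which would make it a natural place to look for a hypothesis.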
AI AT THIS STAGE
Today, surfacing a testable opportunity requires manual investigation: pulling funnel reports, reviewing session replays, hunting for the anomaly that signals something worth testing. AI changes this by making the process proactive rather than reactive.
Experimentation platforms are increasingly building always-on monitoring that flags behavioral anomalies and surfaces experiment candidates automatically, drawing on signals across analytics, session replay, and product usage data. The next frontier is a continuously updated queue of evidence-backed opportunities, ranked by potential impact—so teams spend less time looking for what to test and more time designing good tests.
Stage 02: Form a clear, testable hypothesis
A hypothesis isn’t an idea. It’s a structured prediction. If we make this specific change for these specific users, we expect to see this specific outcome as measured by this metric.
Every part of that structure matters. A specific change keeps the experiment focused. Specific users ensure you’re testing on the cohort most likely to be affected. A specific outcome, measured by a specific metric, makes success objectively evaluable.
WEAK VS. STRONG HYPOTHESIS
Weak: We think the new onboarding will perform better.
Strong: Adding a direct deposit prompt to the second onboarding step will increase the rate of users who activate direct deposit within 30 days, without decreasing 7-day retention.
The difference is testability. If you can’t fail the hypothesis, it’s not a hypothesis.
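The structure above can be sketched as a small record that refuses to hide missing pieces. This is an illustration under assumptions: the `Hypothesis` class, its field names, the metric identifiers, and the audience wording for the strong example are hypothetical additions, not taken from any tool.

```python
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    """Structured prediction: change, audience, expected outcome, metric."""
    change: str
    audience: str
    expected_outcome: str
    primary_metric: str
    guardrail_metrics: list = field(default_factory=list)

    def missing_elements(self):
        """Names of required elements that were left blank."""
        required = {
            "change": self.change,
            "audience": self.audience,
            "expected_outcome": self.expected_outcome,
            "primary_metric": self.primary_metric,
        }
        return [name for name, value in required.items() if not value.strip()]


weak = Hypothesis(
    change="new onboarding",
    audience="",
    expected_outcome="performs better",
    primary_metric="",
)

strong = Hypothesis(
    change="Add a direct deposit prompt to the second onboarding step",
    audience="New users going through onboarding",  # assumed for illustration
    expected_outcome="More users activate direct deposit within 30 days",
    primary_metric="direct_deposit_activation_30d",
    guardrail_metrics=["retention_7d"],
)

print(weak.missing_elements())    # ['audience', 'primary_metric']
print(strong.missing_elements())  # []
```

A check like this only catches structural gaps. Whether the metric truly maps to the behavior being changed still takes human judgment.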
AI AT THIS STAGE
Most hypothesis errors are structural: Teams omit a required element, choose a metric that doesn’t map cleanly to the behavior they’re changing, or frame the intervention too broadly to yield a clean result. AI can catch these issues before they cost you a run cycle.
Emerging tooling is embedding setup guidance directly into the experiment workflow—flagging weak hypothesis structure, suggesting better-matched success metrics, and searching prior experiment history so teams aren’t unknowingly re-asking questions that have already been answered.
The goal is making good hypothesis hygiene the default, not something that requires a statistician to audit.
Stage 03: Design experiments with meaningful success metrics
Experiment design is where most teams leave value on the table. Before launching, confirm that:
The primary metric is well-defined and tracks directly to the hypothesis. One primary metric per experiment, nothing more.
Guardrail metrics are identified. These are the things you can’t afford to break. If your primary metric improves but a key downstream behavior drops sharply, that’s not a win.
Sample size and run duration are calculated upfront based on your baseline, minimum detectable effect (MDE), and desired power level. Running experiments “until they look good” is how false positives happen.
Audience is scoped appropriately. The experiment should target the users most likely to be affected. Broad rollouts dilute signal. Targeted experiments produce cleaner results.
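For the sample size item above, a minimal sketch of the upfront calculation follows. It assumes a conversion-rate metric, a two-sided two-proportion z-test, and an MDE expressed as an absolute lift; the traffic figure is hypothetical. Other metric types and test designs need different formulas.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline, mde_abs, alpha=0.05, power=0.80):
    """Users needed in each arm of a two-sided, two-proportion z-test.

    baseline: control conversion rate, e.g. 0.10
    mde_abs:  minimum detectable effect as an absolute lift, e.g. 0.02
    """
    p1 = baseline
    p2 = baseline + mde_abs
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)


n = sample_size_per_variant(baseline=0.10, mde_abs=0.02)

daily_users_per_variant = 500  # hypothetical traffic
days = math.ceil(n / daily_users_per_variant)
print(f"{n} users per variant, about {days} days at current traffic")
```

Halving the MDE roughly quadruples the required sample, which is why the MDE is worth agreeing on before launch rather than after.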
AI AT THIS STAGE
The hardest design decisions (how long to run, how much traffic to allocate, whether the potential lift justifies the opportunity cost) are currently left to manual calculation or intuition. Underpowered experiments, wasted traffic, and tests that can’t detect real effects are usually the result.
AI is beginning to automate these decisions: recommending minimum detectable effects based on historical baselines, estimating the traffic cost of running an experiment before it launches, and enabling roadmap-level cost-benefit analysis across competing test ideas. Good experiment design shouldn’t require a statistics degree to get right.
Stage 04: Run experiments responsibly
Responsible execution comes down to three things.
Validate your tracking before launch. Confirm events are firing correctly, variants are assigned as expected, and your metrics are capturing what you intend. A tracking error caught before launch is a five-minute fix. The same error caught a week into the experiment requires starting over.
Don’t peek, unless your platform is built for it. If you’re using a fixed-horizon frequentist test, wait until you’ve hit your pre-planned sample size before evaluating results. Sequential and Bayesian methods can support continuous monitoring, but only if your platform is designed for that.
Let experiments run to completion. Early results are often misleading, especially with day-of-week effects. Let the data accumulate before you draw conclusions.
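The cost of peeking is easy to see in a simulation. The sketch below runs A/A tests, where both arms share the same true conversion rate, and compares two habits: evaluating once at the planned sample size versus declaring a winner the first time any interim look crosses the significance threshold. All parameters are arbitrary choices for illustration.

```python
import math
import random
from statistics import NormalDist


def z_significant(conv_a, conv_b, n, alpha=0.05):
    """Two-sided two-proportion z-test with n users per arm."""
    pooled = (conv_a + conv_b) / (2 * n)
    if pooled in (0.0, 1.0):
        return False  # no variance yet, nothing to test
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    z = (conv_b / n - conv_a / n) / se
    return abs(z) > NormalDist().inv_cdf(1 - alpha / 2)


def simulate_aa_tests(runs=300, n_per_arm=1000, looks=10, rate=0.10, seed=7):
    """Simulate A/A tests (no true difference) and count false positives."""
    rng = random.Random(seed)
    checkpoint = n_per_arm // looks
    fixed_hits = peeking_hits = 0
    for _ in range(runs):
        conv_a = conv_b = 0
        flagged_at_any_look = False
        for i in range(1, n_per_arm + 1):
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            if i % checkpoint == 0 and z_significant(conv_a, conv_b, i):
                flagged_at_any_look = True  # a peeker would have stopped here
        fixed_hits += z_significant(conv_a, conv_b, n_per_arm)
        peeking_hits += flagged_at_any_look
    return fixed_hits / runs, peeking_hits / runs


fixed_rate, peeking_rate = simulate_aa_tests()
print(f"fixed horizon: {fixed_rate:.1%}, peeking: {peeking_rate:.1%}")
```

Because the final look is one of the checkpoints, the peeking rate can never be lower than the fixed-horizon rate, and in runs like this it typically lands well above the nominal 5%.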
AI AT THIS STAGE
The operational overhead of running experiments (provisioning flags, monitoring for tracking errors, routing traffic, managing experiment lifecycle) is friction that has nothing to do with the quality of the underlying thinking. In high-velocity development environments, that friction is increasingly the bottleneck.
The next generation of experimentation tooling is designed to eliminate it: no-code interfaces for non-technical teams, dynamic traffic allocation that adjusts as data accumulates, automated flag lifecycle management that retires stale experiments without manual cleanup, and natural language interfaces so experiment management can happen from anywhere.
The goal is making fast, rigorous experimentation accessible to every team, not just the ones with dedicated engineering support.
Stage 05: Analyze results in context
When the experiment concludes, the first question is: Did variant B win? That’s the right starting point. But stopping there leaves most of the learning on the table. The more useful questions:
Did it win or lose, and why? Can you identify which behaviors or segments drove the result?
Was the effect consistent across groups, or did it help some users while hurting others?
What happened downstream? Experiments often have effects beyond the thing you’re directly testing.
What does this tell you about your users: what they value, where they struggle, what they need?
This kind of investigation requires the ability to move fluidly from experiment results to behavioral exploration: segmenting by cohort, comparing downstream events, and understanding the sequence of actions that preceded outcomes. Teams that can do this fast learn faster than teams that can’t.
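A small sketch shows why the segment question matters. In the invented data below, the overall result is flat, yet the change helps one segment and hurts another. The rows and the `lift_by_segment` helper are hypothetical, and real segments need enough users for the differences to be trustworthy.

```python
from collections import defaultdict

# Hypothetical per-user results: (segment, variant, converted).
rows = [
    ("new", "control", 0), ("new", "control", 1),
    ("new", "treatment", 1), ("new", "treatment", 1),
    ("returning", "control", 1), ("returning", "control", 1),
    ("returning", "treatment", 0), ("returning", "treatment", 1),
]


def conversion_rate(rows, variant):
    """Overall conversion rate for one variant."""
    outcomes = [converted for _, v, converted in rows if v == variant]
    return sum(outcomes) / len(outcomes)


def lift_by_segment(rows):
    """Absolute conversion difference (treatment minus control) per segment.

    Assumes every segment has at least one user in each variant.
    """
    counts = defaultdict(lambda: {"control": [0, 0], "treatment": [0, 0]})
    for segment, variant, converted in rows:
        counts[segment][variant][0] += converted  # conversions
        counts[segment][variant][1] += 1          # users
    lifts = {}
    for segment, by_variant in counts.items():
        c_conv, c_n = by_variant["control"]
        t_conv, t_n = by_variant["treatment"]
        lifts[segment] = t_conv / t_n - c_conv / c_n
    return lifts


overall_lift = conversion_rate(rows, "treatment") - conversion_rate(rows, "control")
lifts = lift_by_segment(rows)
print(overall_lift)  # 0.0
print(lifts)         # {'new': 0.5, 'returning': -0.5}
```

Slicing by many segments after the fact also raises the odds of finding a spurious difference, so segment findings are best treated as hypotheses for the next test.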
AI AT THIS STAGE
Statistical significance tells you an effect is unlikely to be noise. It doesn’t tell you what drove it, who it affected most, or what to test next. Most post-experiment analysis stops at the headline result and leaves the deeper signal undiscovered.
AI is beginning to surface that signal automatically: identifying which user cohorts drove a result and hypothesizing why, recommending follow-up experiments based on what the data implies, and forecasting downstream business impact beyond the primary metric.
The shift is from analysis as a retrospective exercise to analysis as the starting point for the next decision.
Stage 06: Turn results into the next decision
This final stage is the one most teams skip, and it’s the one that compounds the most value over time. Every experiment result should be documented with its hypothesis, methodology, outcome, and what it implies for future decisions. That documentation is how you prevent the same question from being asked twice, build on prior intelligence instead of starting from scratch, and gradually develop a richer understanding of your users.
Results should feed directly into roadmap prioritization. A confirmed lift is evidence to invest further. A null result is evidence to shift direction. A failure that reveals a user insight is often more valuable than a win that confirms what you already knew.
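The documentation habit described above can be as simple as a structured record plus a search step that runs before a new hypothesis is written. The sketch below is a hypothetical illustration: the `ExperimentRecord` fields mirror the four items named above, and the two log entries are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentRecord:
    name: str
    hypothesis: str
    methodology: str
    outcome: str      # "lift", "null", or "negative"
    implication: str  # what this means for future decisions


def find_prior_experiments(log, keyword):
    """Case-insensitive keyword search over experiment names and hypotheses."""
    needle = keyword.lower()
    return [
        record
        for record in log
        if needle in record.name.lower() or needle in record.hypothesis.lower()
    ]


# Invented entries for illustration.
log = [
    ExperimentRecord(
        name="onboarding-prompt-v1",
        hypothesis="A prompt in onboarding step two increases activation",
        methodology="50/50 split, fixed horizon",
        outcome="lift",
        implication="Invest further in onboarding guidance",
    ),
    ExperimentRecord(
        name="pricing-page-layout",
        hypothesis="A simpler pricing table increases upgrades",
        methodology="50/50 split, fixed horizon",
        outcome="null",
        implication="Layout is not the constraint; look at plan fit instead",
    ),
]

related = find_prior_experiments(log, "onboarding")
print([record.name for record in related])  # ['onboarding-prompt-v1']
```

Recording the implication alongside the outcome is the part that feeds the roadmap: a null result with a clear implication is still a decision input.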
AI AT THIS STAGE
The long-term value of experimentation compounds. Each test should make the next one smarter: surfacing relevant prior experiments before a new hypothesis is formed, flagging related questions that remain unanswered, and connecting current results back to the opportunities identified at stage one.
AI is making this kind of institutional memory tractable. Natural language search across experiment history, AI-generated ideation grounded in prior results, and intelligent prioritization that deprioritizes already-answered questions all point toward the same goal: a system that gets smarter with every test, turning experimentation into a durable organizational capability rather than a series of isolated events.
Treating each experiment as a standalone event breaks the loop. Treating each one as a data point in an ongoing investigation is what builds the intelligence advantage.
SECTION 04
Creating a culture of experimentation: why tools alone aren’t enough
The hardest part of building an experimentation program isn’t the tooling. It’s the people. Getting your stack connected is a few weeks of work. Changing how a team thinks about evidence, risk, and the meaning of failure takes longer. Without that shift, even the best infrastructure produces underwhelming results, because the culture around it hasn’t changed.
Organizational culture: Intelligence has to be the goal
In most organizations, experiments are run to confirm ideas. That’s a problem. If your team treats experiments as a performance review, run to prove that the work they did was valuable, you’ve already forfeited the value. Teams that need their experiments to succeed will interpret results charitably, stop tests early when things look promising, and avoid testing ideas that feel uncertain. Over time, this produces a portfolio of experiments that mostly confirm what teams already believed and offer little genuine discovery.
The secret sauce that turns the raw ingredients of experimentation into supercharged product innovation is culture.
Netflix
How high-performing cultures approach testing differently
The most effective organizations share a few common attitudes toward experiment results:
A negative result is useful.
An inconclusive result often means the hypothesis was too broad.
Killing a feature early because the data supports that decision is celebrated, not punished.
The accessibility problem
There’s a version of experimentation culture that sounds sophisticated but actually centralizes control: a small team of analysts who run all the experiments while product managers and engineers wait in queue. This model doesn’t scale and creates the wrong incentives. The teams that build durable experimentation cultures make it accessible. That means:
Product managers can form and launch experiments without filing a ticket.
Engineers can review results without writing SQL.
Designers can explore segment-level outcomes without needing an analyst to pull the data.
This isn’t about cutting corners on rigor. It’s about removing friction from the places where friction kills momentum. If running an experiment takes two weeks of coordination, most ideas will never get tested. If it takes an afternoon, more teams will be comfortable running experiments, you’ll learn more per sprint, and you’ll make better decisions as a result.
By running 25,000 tests a year, Booking.com transformed itself from a small startup into the world’s largest accommodation platform. Today scaling up an organization’s experimentation capabilities is critical, but many firms struggle to do it, not because of technology but because of culture.
Stefan Thomke, Harvard Business Review
Operational culture: Experimentation as infrastructure
Beyond mindset, there’s the question of process. Where does experimentation live in your product development cycle? If the answer is “we run experiments after we ship,” you’re starting too late. The most effective teams embed experimentation directly into their delivery pipeline. Changes are deployed to a small percentage of users first. Data validates the hypothesis. Only then does the code get promoted to a full rollout. Experimentation becomes the mechanism that gates forward progress, not an optional step that happens if there’s time.
This approach changes the risk profile of shipping entirely. Instead of committing a full rollout and discovering problems at 100%, you’re discovering problems at 5% and deciding whether to proceed. At scale, this saves enormous amounts of time and user trust.
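A gated rollout like this can be sketched in two pieces: a deterministic way to expose a fixed share of users, and a rule for what happens next. Everything here is an assumption for illustration, including the flag name, the hashing scheme, and the promote/hold/roll-back rule; real feature-flag systems have their own mechanisms.

```python
import hashlib


def in_rollout(user_id: str, flag: str, percent: float) -> bool:
    """Deterministically place a user in the first `percent` of 10,000 buckets."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10_000
    return bucket < percent * 100


def next_rollout_percent(current_percent, primary_lift_confirmed, guardrails_ok):
    """Roll back on a guardrail breach, promote on a confirmed lift, else hold."""
    if not guardrails_ok:
        return 0
    if primary_lift_confirmed:
        return 100
    return current_percent


# The same user always gets the same answer for the same flag.
exposed = in_rollout("user-42", "new_onboarding", percent=5)

# Share of a synthetic user base that lands in the 5% exposure group.
share = sum(in_rollout(f"user-{i}", "new_onboarding", 5) for i in range(10_000)) / 10_000
print(f"exposed share: {share:.1%}")

print(next_rollout_percent(5, primary_lift_confirmed=True, guardrails_ok=True))   # 100
print(next_rollout_percent(5, primary_lift_confirmed=False, guardrails_ok=True))  # 5
print(next_rollout_percent(5, primary_lift_confirmed=True, guardrails_ok=False))  # 0
```

Hashing the flag name together with the user ID keeps assignment stable across sessions while giving each flag an independent split of the user base.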
Your intelligence loop is your competitive advantage
Since every team has access to the same AI tools, the differentiator isn’t speed. Speed is table stakes. What you can’t copy from a competitor is their validated understanding of their users, accumulated through hundreds of experiments, each one sharpening the team’s mental model of what works, what doesn’t, and why.
[Diagram: Experimentation Maturity Spectrum]
The question to ask isn’t “Are we doing experiments?” It’s “Are we building the kind of program that compounds?” That means each experiment feeds the next, intelligence is documented and shared, hypotheses get sharper over time, and the loop closes faster with every iteration.
AI is the engine. Experimentation is the steering wheel. An engine without a steering wheel is a faster way to crash.
SECTION 05
Experimentation in practice
The concepts in the previous sections aren’t theoretical. Teams across industries are putting them to work, with measurable results.
Step: Tightened its testing loop. Primary accounts grew 14%.
Step is a fintech platform designed to help teens and young adults manage money and build financial independence. The team wanted to strengthen their experimentation process to ensure product changes were actually driving growth, and that they weren’t just shipping features and hoping.
They used Mixpanel to centralize testing, speed up analysis, and tighten decision-making. A key experiment redesigned the onboarding experience to guide more users toward setting up direct deposit. The hypothesis was clear, the metric was specific, and the result was unambiguous: a 14% increase in customers who made Step their primary bank account.
RESULT
14% increase in customers who made Step their primary bank account
Beyond the lift, Step built a more disciplined approach to experimentation, where every change had a hypothesis, every hypothesis had a metric, and every metric was shared across the team. Each experiment fed the next. The intelligence loop closed.
Kolon Mall: One platform, one playbook.
Kolon Mall, a South Korean ecommerce platform, faced a challenge familiar to many product teams: They were running experiments, but the process was fragmented. Different teams used different tools. Metrics weren’t standardized. Results were hard to compare and harder to act on.
The shift they made was about establishing experimentation as a shared organizational capability. They standardized on a single platform, defined reusable metrics, and created a process where learnings from each experiment fed directly into the next planning cycle. The new culture resulted in more teams participating and more hypotheses being tested. And because the infrastructure was consistent, intelligence accumulated in a way it couldn’t before.
RESULT
A scalable culture of experimentation
How Mixpanel approaches experimentation
The ideas in this guide (behavioral precision, shared metrics, fast diagnostics, and connected analytics) describe how Mixpanel’s Experimentation product is built.
Mixpanel Experimentation integrates directly with our behavioral analytics platform. Experiment audiences are defined based on real user behavior (specific actions, sequences, and states), using the same cohort definitions you already use in analytics. Success metrics are drawn from Mixpanel’s shared metric system, so they’re consistent and reusable across teams. And when an experiment concludes, analysis happens in context: You can move from “did it work?” to “why?” without leaving the platform or exporting data.
Mixpanel is also making this all AI-native. Experiments can be created, monitored, and managed through natural language, via Mixpanel Agent or through MCP integrations that connect your experimentation program directly to the tools your team already works in.
At the analysis layer, AI surfaces what drove an outcome automatically: which cohorts responded, what the data implies for the next test, and how results project downstream to retention and revenue. As your program grows, the system compounds, learning from your experiment history to generate better-targeted hypotheses and surface opportunities you haven’t thought to look for yet.
The goal: Make the new testing paradigm described in this guide as fast and frictionless as possible for product managers, analysts, and engineers alike.
A new opportunity, and a new challenge
The opportunity is real. Teams that get this right will ship faster and with more confidence, accumulate a genuine understanding of their users that competitors can’t replicate, and make better decisions with every cycle.
But getting there isn’t a tooling decision. It requires changing how your team thinks about evidence: what counts as a reason to ship and what counts as a reason to stop. It requires building experimentation into the delivery pipeline rather than bolting it on afterward. And it requires connecting the data, the metrics, and the analysis into a system that actually compounds, rather than a collection of tools that each solve one piece of the problem in isolation.
AI is making parts of this easier. It isn’t making the thinking easier. The teams that pull ahead will be the ones that do that work deliberately. The ones that don’t will get better at producing the wrong things.