A/B Test Analysis

Analyze A/B test results with statistical significance, sample size validation, confidence intervals, and ship/extend/stop recommendations.

Source: phuryn/pm-skills

What Is This?

Overview

A/B test analysis is the process of evaluating experiment results using statistical methods to determine whether observed differences between variants are meaningful or simply due to random chance. This skill applies statistical significance testing, confidence interval calculation, and sample size validation to raw experiment data, then translates those findings into actionable product decisions.

The analysis workflow covers the full lifecycle of experiment evaluation. It begins with validating that the test collected sufficient data to detect a meaningful effect, then calculates whether the observed difference crosses the threshold for statistical significance. From there, it produces confidence intervals that describe the plausible range of the true effect size, and finally delivers a clear recommendation to ship the winning variant, extend the test to gather more data, or stop the experiment entirely.

This skill is particularly valuable because it reduces guesswork in product decisions. Without rigorous statistical analysis, teams often ship variants based on noisy data, misread small fluctuations as meaningful improvements, or end tests before they reach reliable conclusions.

Who Should Use This

Product managers who need to interpret experiment results and make shipping decisions without relying solely on data science teams
Growth engineers running split tests on conversion funnels, onboarding flows, or feature rollouts
Data analysts who want a structured framework for communicating test results to non-technical stakeholders
UX researchers evaluating behavioral differences between design variants

Why Use It?

Problems It Solves

Prevents shipping variants based on results that are not statistically significant, which wastes engineering resources and can harm key metrics
Catches underpowered tests before conclusions are drawn, avoiding false negatives where real improvements go undetected
Standardizes how teams interpret p-values, confidence intervals, and effect sizes across different experiments
Reduces the time analysts spend writing custom statistical code for each new experiment
Provides non-technical stakeholders with plain-language recommendations tied to quantitative evidence

Core Highlights

Calculates statistical significance using z-tests or t-tests depending on sample characteristics
Validates sample size against minimum detectable effect thresholds before drawing conclusions
Generates confidence intervals for conversion rate differences and relative lift
Produces explicit ship, extend, or stop recommendations with supporting rationale
Handles both one-tailed and two-tailed test configurations
Supports common metrics including conversion rates, click-through rates, and revenue per user
Flags potential issues such as novelty effects, sample ratio mismatch, and insufficient runtime
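
One of the checks listed above, sample ratio mismatch, can be sketched as a chi-square goodness-of-fit test on the visitor counts. This is a minimal illustration and not the skill's own implementation: the function name `srm_check`, the 50/50 expected split, and the 0.001 alert threshold are all assumptions chosen for the example.

```python
from math import erfc, sqrt


def srm_check(n_control, n_variant, expected_ratio=0.5, alpha=0.001):
    """Chi-square test (1 degree of freedom) for sample ratio mismatch.

    expected_ratio is the share of traffic the control was supposed to get.
    alpha is an assumed alert threshold; stricter than 0.05 to limit false alarms.
    """
    total = n_control + n_variant
    expected_control = total * expected_ratio
    expected_variant = total * (1 - expected_ratio)
    chi2 = ((n_control - expected_control) ** 2 / expected_control
            + (n_variant - expected_variant) ** 2 / expected_variant)
    # For 1 degree of freedom, the chi-square tail probability
    # equals erfc(sqrt(chi2 / 2)).
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value, p_value < alpha


# Visitor counts from the Basic Usage input (10,000 vs 9,980).
chi2, p_value, mismatch = srm_check(10000, 9980)
print(f"chi2={chi2:.4f}, p={p_value:.3f}, mismatch={mismatch}")
```

A flagged mismatch suggests a problem in assignment or logging, so the conversion comparison should be treated with caution until the cause is found.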

How to Use It?

Basic Usage

Provide the skill with your experiment data in a structured format. A minimal input includes the number of visitors and conversions for each variant.

Control:   visitors=10000, conversions=320
Variant A: visitors=9980,  conversions=374

Significance threshold: 95%
Minimum detectable effect: 10%

The skill will calculate the conversion rates, the absolute and relative difference, the p-value, and the confidence interval, then return a recommendation.
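
The calculation described above can be sketched with a two-proportion z-test using only the Python standard library. This is one common way to do it, not necessarily the method the skill uses internally; the function name `two_proportion_ztest` and the choice of a pooled standard error for the test and an unpooled one for the interval are assumptions of this example.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ztest(n_control, x_control, n_variant, x_variant,
                         confidence=0.95):
    """Two-sided z-test and confidence interval for a difference in rates."""
    p_control = x_control / n_control
    p_variant = x_variant / n_variant
    diff = p_variant - p_control

    # Pooled standard error: used for the test under the null of no difference.
    pooled = (x_control + x_variant) / (n_control + n_variant)
    se_pooled = sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_variant))
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Unpooled standard error: used for the interval around the observed diff.
    se_unpooled = sqrt(p_control * (1 - p_control) / n_control
                       + p_variant * (1 - p_variant) / n_variant)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

    return {
        "control_rate": p_control,
        "variant_rate": p_variant,
        "absolute_diff": diff,
        "relative_lift": diff / p_control,
        "z": z,
        "p_value": p_value,
        "ci": (diff - z_crit * se_unpooled, diff + z_crit * se_unpooled),
    }


# Input from the example above.
result = two_proportion_ztest(10000, 320, 9980, 374)
for key, value in result.items():
    print(key, value)
```

For this input the interval for the absolute difference lies entirely above zero, so the result is significant at the 95% level. Significance alone does not establish that the test was adequately powered for a 10% minimum detectable effect.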

Specific Scenarios

Scenario 1: Underpowered test with early results

A team ran a test for three days and observed a 15% lift. The skill detects that only 1,200 users were exposed per variant, which is below the sample size required for the stated minimum detectable effect (MDE). The recommendation is to extend the test rather than ship based on preliminary data.
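
The sample size check in Scenario 1 can be approximated with the standard normal-approximation formula for comparing two proportions. The scenario does not state a baseline rate, an MDE value, or a power target, so the 3.2% baseline, 10% relative MDE, and 80% power below are assumptions for illustration only, as is the function name `required_sample_size`.

```python
from math import ceil
from statistics import NormalDist


def required_sample_size(baseline_rate, relative_mde, alpha=0.05, power=0.80):
    """Approximate users needed per variant for a two-sided test.

    relative_mde is the smallest relative lift worth detecting,
    e.g. 0.10 for a 10% lift over the baseline rate.
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / (p2 - p1) ** 2)


# Assumed inputs: 3.2% baseline, 10% relative MDE, 1,200 users per variant.
needed = required_sample_size(0.032, 0.10)
is_underpowered = 1200 < needed
print(f"required per variant: {needed}, underpowered: {is_underpowered}")
```

Under these assumed inputs the requirement is on the order of 50,000 users per variant, far above the 1,200 observed, which is why an early 15% lift is not enough to ship.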

Scenario 2: Statistically significant negative result

The variant shows a statistically significant drop in conversion rate at 97% confidence. The skill recommends stopping the experiment and reverting to the control, with a confidence interval showing the true effect is between -8% and -3%.

Real-World Examples

A checkout flow redesign test ran for two weeks across 50,000 users per variant. The analysis confirmed significance at 99% confidence with a 6.2% lift in completed purchases, producing a clear ship recommendation with an estimated annual revenue impact.

A notification timing experiment showed a 4% improvement that did not reach the 95% significance threshold after four weeks. The skill recommended stopping the test, noting that the confidence interval included zero, so the data could not distinguish the observed effect from noise.
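
The scenarios and examples above suggest a simple decision rule, sketched below. The source does not spell out the skill's exact logic, so the function `recommend`, its inputs, and the order of the checks are hypothetical and only meant to be consistent with the four cases described.

```python
def recommend(n_per_variant, required_n, ci_low, ci_high):
    """Return (decision, reason) from sample size and the CI of the difference.

    ci_low and ci_high bound the difference (variant minus control).
    The rule order is an assumption, not the skill's documented behavior.
    """
    if ci_high < 0:
        return "stop", "the whole interval is below zero; the variant is harmful"
    if n_per_variant < required_n:
        return "extend", "sample size is below the requirement for the MDE"
    if ci_low > 0:
        return "ship", "the whole interval is above zero with adequate sample size"
    return "stop", "adequately powered, but the interval includes zero"


# Illustrative inputs loosely modeled on the cases above (values are made up).
print(recommend(1200, 50000, -0.01, 0.04))    # early, underpowered test
print(recommend(50000, 40000, -0.08, -0.03))  # significant negative result
print(recommend(50000, 40000, 0.01, 0.03))    # clear positive result
print(recommend(50000, 40000, -0.01, 0.02))   # inconclusive after full run
```

Checking for harm first reflects a cautious design choice: a significantly negative result is treated as a reason to stop even before the planned sample size is reached.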

When to Use It?

Use Cases

Evaluating feature flag experiments before a full rollout
Analyzing pricing page variant tests for subscription products
Reviewing onboarding flow experiments tied to activation metrics
Interpreting email campaign split tests on open and click rates
Assessing search ranking algorithm changes using interleaving or bucket tests
Validating that a performance optimization produced a measurable improvement in user behavior
Reviewing multivariate test results where multiple elements changed simultaneously

Important Notes

Requirements

Raw experiment data must include visitor counts and conversion counts for each variant
The desired significance threshold and minimum detectable effect must be specified before analysis
Tests should have run long enough to capture at least one full business cycle to avoid day-of-week bias
