There is a lot of finer detail in setting up, tracking, and interpreting the results of experiments. Setting up a good experiment is more than just creating an A and a B variant and looking at what happens to your metrics. A degree of scientific rigor is necessary to come to the right conclusions: you need to understand who should be in your experiment, how to measure its success, and how to avoid the different pitfalls of A/B testing.
A large part of setting up an experiment involves understanding which population you are going to treat, how you are going to split it into control and variant groups, and how far down the funnel you are going to measure users' actions with respect to the experiment setup.
Understanding who should be exposed to your experiment is something you should have established through prior data or through the generation of your hypothesis. From your available data, you should already have pre-defined eligibility rules or specific segments that you want to run your experiment against.
In the picture above, users are split into equal group sizes based on some identifier and a simple heuristic; this logic is at the core of setting up the experiment. The identifier is usually generated through random number generation and can, depending on the setup, be re-generated to match the identifier of previously running experiments.
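As a minimal sketch of that split, the snippet below deterministically assigns a user to control or a variant by hashing the user identifier together with a hypothetical experiment name (the function and experiment names are illustrative, not taken from any specific framework):

```python
import hashlib

def assign_group(user_id: str, experiment_name: str, n_groups: int = 2) -> str:
    """Deterministically map a user to a group for a given experiment.

    Hashing the user id together with the experiment name plays the role of
    the random identifier described above: the split is stable across
    sessions but independent between experiments.
    """
    key = f"{experiment_name}:{user_id}".encode("utf-8")
    bucket = int(hashlib.md5(key).hexdigest(), 16) % n_groups
    return "control" if bucket == 0 else f"variant_{bucket}"

# Example: a 50/50 split between control and one variant
print(assign_group("user-42", "new-checkout-flow"))
```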
For a given group in an experiment, the population can then be split further. Allocated users would, for instance, contain all users assigned to the group, regardless of whether they could have seen the specific experiment. An exposed user is one that could have seen the experiment; exposure is normally tracked at the point where the A and B variants diverge. A treated user is one that performed the desired action.
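To make the distinction concrete, a hypothetical event log can be filtered into these three nested populations (the event names here are purely illustrative):

```python
# Hypothetical event log: (user_id, event)
events = [
    ("u1", "assigned"), ("u1", "saw_new_checkout"), ("u1", "purchased"),
    ("u2", "assigned"), ("u2", "saw_new_checkout"),
    ("u3", "assigned"),
]

allocated = {u for u, e in events if e == "assigned"}          # assigned to the group
exposed   = {u for u, e in events if e == "saw_new_checkout"}  # reached the point where variants diverge
treated   = {u for u, e in events if e == "purchased"}         # performed the desired action

print(len(allocated), len(exposed), len(treated))  # 3 2 1
```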
Metrics are at the core of setting up the experiment. It is important to plan which key metrics to track for optimization purposes and to make sure logging is also implemented to track secondary user actions.
Primary metrics should be aligned with the intent of the experiment. The purpose of tracking primary metrics is to have a set of KPIs against which we intend to determine the success of the experiment and track significance. Their number should be limited in order to reduce the chance of false positives/negatives as well as to limit the number of trade-off decisions that will need to be made.
Examples of primary metrics for an e-commerce website: add to cart, quantity added to cart, quantity purchased, average order value.
Secondary metrics, on the other hand, provide insights into user behavior that can help shape the next iterations of the experiment or inform a development roadmap.
Examples of secondary metrics for an e-commerce website: clicks to a specific product list page, cart views.
Significance helps you understand how likely you are to make a wrong judgment call. It is established from the size of the observed effect, the variability of the metric, and the sample size: the larger the sample and the (relative) size of the effect, the easier it is to establish statistical significance.
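As a rough sketch, the significance of a difference in conversion rates between control and variant can be checked with a two-proportion z-test (the counts below are made up for illustration):

```python
from math import sqrt
from scipy.stats import norm

# Made-up results: conversions / users exposed in each group
conv_a, n_a = 480, 10_000   # control
conv_b, n_b = 540, 10_000   # variant

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))

z = (p_b - p_a) / se
p_value = 2 * (1 - norm.cdf(abs(z)))  # two-sided test

print(f"uplift: {(p_b - p_a) / p_a:.1%}, z = {z:.2f}, p = {p_value:.3f}")
```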
It is important to plan, prior to launching your experiment, what sample size is required, in order to avoid confirmation bias. Online calculators exist to compute both the power and the necessary sample size for experiments.
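A minimal version of such a calculation for a conversion-rate metric, using the standard two-proportion formula (the baseline rate and target uplift below are assumptions chosen for illustration):

```python
from math import ceil, sqrt
from scipy.stats import norm

def sample_size_per_group(p_baseline, p_target, alpha=0.05, power=0.8):
    """Approximate users needed per group to detect p_baseline -> p_target."""
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided significance threshold
    z_beta = norm.ppf(power)            # desired statistical power
    variance = p_baseline * (1 - p_baseline) + p_target * (1 - p_target)
    n = (z_alpha + z_beta) ** 2 * variance / (p_baseline - p_target) ** 2
    return ceil(n)

# e.g. detecting an uplift from a 5% to a 5.5% conversion rate
print(sample_size_per_group(0.05, 0.055))  # roughly 31,000 users per group
```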
Significance is harder to establish over an allocated population than over an exposed population because the relative effect is lower: the impact of the treatment is diluted across a larger population, much of which never saw it.
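A quick worked example of that dilution, with purely illustrative numbers: if only a fraction of allocated users are actually exposed, the same uplift among exposed users translates into a much smaller relative effect over the allocated population.

```python
exposed_rate = 0.20      # only 20% of allocated users reach the changed screen
uplift_exposed = 0.05    # 5% relative uplift measured on exposed users

uplift_allocated = uplift_exposed * exposed_rate
print(f"{uplift_allocated:.1%} relative uplift over the allocated population")  # 1.0%
```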
Some effects can only be analyzed over the long run; for example, an uplift in the quantity of an item sold needs to be weighed against both ordering frequency and quantity per order.
Other effects change over time due to novelty effects or learning curves. In these cases, it is important to define a holdout, i.e., a sample of the population that will not be exposed to your treatment over an extended period of time.
Tracking the performance of metrics at a certain degree of significance allows you to make mostly right judgment calls, yet sometimes it is not feasible to obtain a significant metric uplift. Tracking these uplifts with confidence intervals allows you to make better-informed decisions.
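A sketch of reporting the uplift with a confidence interval rather than a bare point estimate, reusing the same made-up counts as in the significance example above:

```python
from math import sqrt
from scipy.stats import norm

conv_a, n_a = 480, 10_000   # control
conv_b, n_b = 540, 10_000   # variant
p_a, p_b = conv_a / n_a, conv_b / n_b

# 95% confidence interval for the absolute difference in conversion rate
se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
z = norm.ppf(0.975)
diff = p_b - p_a
low, high = diff - z * se, diff + z * se

print(f"uplift: {diff:.2%} (95% CI: {low:.2%} to {high:.2%})")
```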
Some specific effects and scenarios are not well suited to being tracked through a normal A/B test and need special attention as to how to cater for them. These include:
Network effects: for instance, in the case of a referral program, the referrer and referee could be split across test and control groups, causing spillover between the control and variant groups.
Novelty effects: prompts and CTAs tend to exhibit novelty effects; if their performance is not measured over the long term using a holdout, wrong attribution and/or customer fatigue can result.
What-if scenarios: if you are looking to understand the impact of not having launched a product (i.e., what if we hadn't launched this product), for instance a subscription offering on a website, an A/B test wouldn't be the right fit.
When running A/B test experiments, a number of challenges can arise that need proper thought on how to deal with them. Some examples: