A/B Testing At Scale

I recently gave a mini-lecture on this topic in my Knowledge Discovery and Data Mining course and am sharing the material here in the form of a post. Hope you like it!

A/B Testing

Also called Bucket Testing, Split Testing, or Controlled Experiments.
An experiment in which multiple variants of a product (most often a webpage) are shown to users at random, and the response is statistically analysed to determine which variant better serves a given goal.
A few of the companies that use A/B testing: Amazon, eBay, Etsy, Meta, Google, Groupon, LinkedIn, Microsoft, Netflix, and Yahoo.

A Simple A/B Test Diagram

Consists of two variants: Control (A) and Treatment (B).
Keywords in A/B Testing:
- Overall Evaluation Criterion (OEC)
- Parameter
- Variant
- Randomization unit

Overall Evaluation Criterion (OEC)

This is the criterion we evaluate through our experiment. Often, this is the Response or Dependent variable (Y column).
It could be active days per user, number of registrations for a service, etc.
Experiments may have multiple objectives, in which case the analysis might use a scorecard approach.
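To make the "statistically analysed" part concrete, here is a minimal sketch that compares a conversion-rate OEC between Control and Treatment with a two-sided two-proportion z-test. The choice of test, the function name, and the numbers are my assumptions for illustration; real platforms typically use more elaborate analyses.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates.

    conv_a / n_a: conversions and users in Control (A)
    conv_b / n_b: conversions and users in Treatment (B)
    Returns (absolute lift, z statistic, p-value).
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_b - p_a, z, p_value

# Hypothetical counts, for illustration only.
lift, z, p_value = two_proportion_z_test(200, 10_000, 250, 10_000)
print(f"lift={lift:.4f}, z={z:.2f}, p={p_value:.4f}")
```

A small p-value suggests the observed difference in the OEC is unlikely to be due to chance alone, under the assumptions of the test.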

Parameter

Controllable experimental variables that might influence the OEC or other metrics of interest.
Also called Factors or Variables. Synonymous with the attribute columns.
Our simple A/B test contains a single parameter. Such a test is called univariate.
There are also multivariate tests (MVTs), where multiple parameters (such as both font color and font size) are evaluated simultaneously to measure their effect on the OEC.

Variant

The user experience being tested.
In our simple test, A and B are the two variants, usually called the Control variant and the Treatment variant.

Randomization Unit

The unit (e.g., a user) to which a pseudo-randomization process (e.g., hashing) is applied in order to map it to a variant. Randomizing at this level keeps the comparison statistically unbiased.
If the user is the randomization unit (as in most cases), a user should consistently see the same experience.
The assignment of a user to a variant should not tell you anything about the assignment of a different user to its variant.
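A minimal sketch of hash-based assignment, assuming the user is the randomization unit. The function name, the key format, and the choice of SHA-256 are my assumptions and not any particular company's implementation. Hashing the experiment ID together with the user ID gives a stable assignment within an experiment while keeping assignments across experiments unrelated.

```python
import hashlib

def assign_variant(user_id, experiment_id, variants=("control", "treatment")):
    """Deterministically map a user to a variant for one experiment."""
    # Salting with the experiment ID means the same user can land in
    # different variants in different experiments.
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Interpret the first 8 bytes as an integer and bucket it.
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]

# The same user always gets the same experience in a given experiment.
first = assign_variant("user-42", "homepage-redesign")
second = assign_variant("user-42", "homepage-redesign")
print(first, first == second)
```

Because the assignment is a pure function of the IDs, no lookup table is needed to keep a user's experience consistent.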

Why use A/B Testing?

Correlation: Relationship between variables.
Causality: Changes in one variable bring about changes in the other. A controlled experiment can establish causality rather than just correlation.
Sensitivity: Able to detect small changes that are harder to detect with other techniques, such as changes over time.
Detect unexpected changes: Many experiments uncover surprising impacts on other metrics, such as increased crashes/errors or clicks cannibalized from other features.

A/B Testing at Scale

To keep up with innovation, companies want to be able to experiment with as many ideas as possible simultaneously.
- When LinkedIn started, the platform supported about 50 experiments per day. Today, that number has increased to more than 400. The number of metrics supported has also grown from 60 to more than 1000.
- At Microsoft’s Bing, the use of controlled experiments has grown exponentially over time, with over 200 concurrent experiments now running on any given day.
The areas experimented on are extremely diverse, ranging from visual changes on the home page, to improvements to the job recommendation algorithm (LinkedIn) or the search result algorithm (Bing), to personalizing the subject line of user emails.

Exponential Growth in Experimentation over Time: Bing

Problems with Scaling A/B Tests

Large scale experiments can have multiple dimensions, including the number of users and the number of experiments.
If we use a multivariate test with N parameters, there would be N simultaneous experiments, where each experiment would modify a different parameter.
However, a multivariate test is not feasible in complex environments, since not all parameters are independent and some values of one parameter might not work with the values of another (e.g., blue text color on a blue background).
In practice, most companies use an overlapping experiment infrastructure.
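A small sketch of why parameter interactions make multivariate tests awkward: enumerating every combination of parameter values and filtering out the ones that cannot be shown together. The parameter names, values, and the validity rule are hypothetical, chosen to mirror the blue-on-blue example.

```python
from itertools import product

# Hypothetical parameters and their candidate values.
parameters = {
    "text_color": ["black", "blue"],
    "background_color": ["white", "blue"],
    "font_size": ["small", "large"],
}

def is_valid(combo):
    """Reject combinations that are unusable, e.g. blue text on blue."""
    return combo["text_color"] != combo["background_color"]

def enumerate_combinations(params):
    """Return every combination of values as a list of dicts."""
    names = list(params)
    return [dict(zip(names, values)) for values in product(*params.values())]

all_combos = enumerate_combinations(parameters)
valid_combos = [c for c in all_combos if is_valid(c)]
print(len(all_combos), len(valid_combos))
```

The number of combinations is the product of the number of values per parameter, so it grows quickly as parameters are added, and each dependency between parameters needs its own rule.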

Overlapping Experiment Infrastructure

The main idea here is to partition parameters into N subsets.
Each subset is associated with a layer of experiments.
Each test request would trigger at most N experiments simultaneously (one experiment per layer).
Each experiment can only modify parameters associated with its layer (i.e., in that subset), and the same parameter cannot be associated with multiple layers.
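The layer idea can be sketched as follows. This is a simplified illustration under my own assumptions: the layer names, parameters, and experiment names are made up, and every request is placed in exactly one experiment per layer, whereas a real system would also allow a request to be in no experiment for a given layer.

```python
import hashlib

# Hypothetical layers: each owns a disjoint subset of parameters.
LAYERS = {
    "ui_layer": {
        "parameters": {"font_color", "font_size"},
        "experiments": ["ui_control", "ui_blue_font", "ui_large_font"],
    },
    "ranking_layer": {
        "parameters": {"ranking_model"},
        "experiments": ["rank_control", "rank_new_model"],
    },
}

def validate_layers(layers):
    """Ensure no parameter is associated with more than one layer."""
    owner = {}
    for layer_name, layer in layers.items():
        for param in layer["parameters"]:
            if param in owner:
                raise ValueError(
                    f"{param!r} is in both {owner[param]!r} and {layer_name!r}"
                )
            owner[param] = layer_name

def experiments_for_request(user_id, layers):
    """Pick one experiment per layer, hashing independently per layer."""
    assignments = {}
    for layer_name, layer in layers.items():
        key = f"{layer_name}:{user_id}".encode("utf-8")
        bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
        experiments = layer["experiments"]
        assignments[layer_name] = experiments[bucket % len(experiments)]
    return assignments

validate_layers(LAYERS)
print(experiments_for_request("user-42", LAYERS))
```

Salting the hash with the layer name keeps a user's assignment in one layer unrelated to their assignment in another, so experiments in different layers can overlap on the same traffic.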

Challenges of A/B Testing at Scale

Cache fragmentation increases, which lowers cache hit rates and increases latency.
Experiments have the potential to degrade the user experience and cause user abandonment.
False positives arise from experimental design issues, data issues, biased analyses, or simply chance.
The risk of interactions between different treatments grows as the number of parallel experiments increases.
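To see why false positives "simply by chance" become a real concern at scale, here is a back-of-the-envelope sketch. It assumes the tests are independent and that no treatment has a true effect, which is a simplification; the function name and the default significance level are my choices.

```python
def prob_at_least_one_false_positive(num_tests, alpha=0.05):
    """Chance that at least one of num_tests null comparisons looks
    significant, assuming independent tests at significance level alpha."""
    return 1 - (1 - alpha) ** num_tests

for num_tests in (1, 14, 200):
    p = prob_at_least_one_false_positive(num_tests)
    print(f"{num_tests:>3} tests -> {p:.4f}")
```

Under these assumptions, the chance of at least one spurious "win" passes 50% at just 14 comparisons, which is why large platforms need explicit control of false discoveries.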

Combatting the Challenges

Latency can be handled by setting benchmarks and using feature selection to reduce the overhead, though this is not always feasible.
A well-structured A/B testing framework can analyse the OEC quickly and abandon an experiment before any noticeable degradation occurs.
An Empirical Bayesian False Discovery Rate control algorithm can be used to identify the cases that are most likely to be true positives.
To prevent interactions between treatments, the experiment system uses a set of constraints to ensure that conflicting experiments do not run together.
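As an illustration of false discovery rate control, here is a sketch of the classic Benjamini-Hochberg step-up procedure. Note that this is not the Empirical Bayesian algorithm mentioned above; I am substituting a simpler, well-known method to show the general idea of deciding which results to trust across many tests. The p-values are made up.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns a list of booleans, aligned with p_values, marking which
    hypotheses are rejected (treated as discoveries) at the given FDR.
    """
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k / m) * fdr.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected

# Hypothetical p-values from ten metric comparisons.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(p_values))
```

With these numbers only the first two comparisons survive, even though five of them are below 0.05 on their own.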

Conclusion

MORE… BETTER… FASTER…

A/B testing at scale helps in running more experiments, running them better, and getting results faster.

More: The platform needs to scale to handle not only today’s data volume but also tomorrow’s.
Better: Fewer misconfigured (logging issues, weird error cases), forgotten (start experiments and then forget to analyze them), or unclear experiments (what exactly are you measuring here/what filters are you using).
Faster: Pushing out new experiments, implementing new features, running experiments, analysing the results – all concurrently.

References

Alex Deng, Pavel Dmitriev, Somit Gupta, Ron Kohavi, Paul Raff, and Lukas Vermeer. 2017. A/B Testing at Scale: Accelerating Software Innovation. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’17). Association for Computing Machinery, New York, NY, USA, 1395–1397. DOI:https://doi.org/10.1145/3077136.3082060
Fabijan, A., Dmitriev, P., Olsson, H. H., & Bosch, J. (2018, August). Online controlled experimentation at scale: an empirical survey on the current state of A/B testing. In 2018 44th Euromicro Conference on Software Engineering and Advanced Applications (SEAA) (pp. 68-72). IEEE.
Xu, Y., Chen, N., Fernandez, A., Sinno, O., & Bhasin, A. (2015, August). From infrastructure to culture: A/B testing challenges in large scale social networks. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 2227-2236).
Kohavi, R., Deng, A., Frasca, B., Walker, T., Xu, Y., & Pohlmann, N. (2013, August). Online controlled experiments at large scale. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 1168-1176).
Tang, D., Agarwal, A., O’Brien, D., & Meyer, M. (2010, July). Overlapping experiment infrastructure: More, better, faster experimentation. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 17-26).
A / B Testing: The Most Powerful Way to Turn Clicks Into Customers. (2013). Wiley.
Ron Kohavi. 2015. Online Controlled Experiments: Lessons from Running A/B/n Tests for 12 Years. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’15). Association for Computing Machinery, New York, NY, USA, 1. DOI:https://doi.org/10.1145/2783258.2785464
Ivaniuk, A. (2020). A/B testing at LinkedIn: Assigning variants at scale. LinkedIn. https://engineering.linkedin.com/blog/2020/a-b-testing-variant-assignment
Blog, N. T. (2022, January 10). What is an A/B Test? – Netflix TechBlog. Medium. https://netflixtechblog.com/what-is-an-a-b-test-b08cc1b57962
