Analysis Bayesian Approach¶

This tutorial shows how to perform post-test analysis of an A/B test experiment with two variants, so called control and treatment groups, using bayesian statistics. It handles both the case of means comparison and conversions comparison.

Let’s import first the tools needed.

[1]:

import numpy as np
import pandas as pd
from abexp.core.analysis_bayesian import BayesianAnalyzer
from abexp.core.analysis_bayesian import BayesianGLMAnalyzer
import warnings
warnings.filterwarnings('ignore')

Compare means¶

Here we want to compare the average revenue per user of the control group versus the treatment group.

[2]:

# Revenue for users
np.random.seed(42)
revenue_contr = np.random.randint(low=400, high=500, size=10000)
revenue_treat = np.random.randint(low=500, high=700, size=10000)

[3]:

# Define the analyzer
analyzer = BayesianAnalyzer()

[4]:

prob, lift, diff_means, ci = analyzer.compare_mean(obs_contr=revenue_contr, obs_treat=revenue_treat)

logp = -1.18e+05, ||grad|| = 3.0081e+10: 100%|██████████| 22/22 [00:00<00:00, 773.97it/s]
Multiprocess sampling (4 chains in 4 jobs)
CompoundStep
>Metropolis: [nu_minus_one]
>Metropolis: [std_treat]
>Metropolis: [std_contr]
>Metropolis: [mean_treat]
>Metropolis: [mean_contr]
Sampling 4 chains, 0 divergences: 100%|██████████| 202000/202000 [02:51<00:00, 1181.01draws/s]
The rhat statistic is larger than 1.4 for some parameters. The sampler did not converge.
The estimated number of effective samples is smaller than 200 for some parameters.

[5]:

print('Probability that mean revenue(treatment) is greater than mean revenue(control) = {:.2%}'.format(prob))

Probability that mean revenue(treatment) is greater than mean revenue(control) = 94.79%

[6]:

print('Lift between treatment and control = {:.2%}'.format(lift))

Lift between treatment and control = 33.20%

The result of bayesian A/B testing is the probability that the treatment group perform better than the control group i.e. highest mean revenue per user value in the current example. This is a very intuitive way of doing A/B testing because it does not introduce any statistical measures (e.g. p-value) which are more difficult to be interpreted by non statisticians.

We can set an arbitrary threshold to define how to consider the outcome of the bayesian test, e.g. if prob \(>\) 90% we can conclude to a significative effect of the treatment on the mean revenue per user when compare to the control group.

Compare proportions¶

[7]:

# Number of users that made a purchase
purchase_contr = 470
purchase_treat = 500

# Total number of users
total_usr_treat = 5000
total_usr_contr = 5000

[8]:

prob, lift = analyzer.compare_conv(conv_contr=purchase_contr,
                                   conv_treat=purchase_treat,
                                   nobs_contr=total_usr_treat,
                                   nobs_treat=total_usr_contr)

[9]:

print('Probability that purchase(treatment) is greater than purchase proportion(control) = {:.2%}'.format(prob))

Probability that purchase(treatment) is greater than purchase proportion(control) = 84.45%

[10]:

print('Lift between treatment and control = {:.2%}'.format(lift))

Lift between treatment and control = 6.37%

Bayesian GLM¶

Here we want to compare the average revenue per user of the control group versus the treatment group. We are also interested to differentiate the results based on some categorical features of the input samples (i.e. seniority_level, country).

[11]:

# Define the analyzer
analyzer = BayesianGLMAnalyzer()

Multivariate Regression

[12]:

df = pd.DataFrame([[1, 4, 35],
                   [0, 4, 5],
                   [1, 3, 28],
                   [0, 1, 5],
                   [0, 2, 1],
                   [1, 0, 1.5]], columns=['group', 'seniority_level', 'revenue'])

[13]:

stats = analyzer.multivariate_regression(df, 'revenue')
stats

Auto-assigning NUTS sampler...
Initializing NUTS using adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [lam, seniority_level, group, Intercept]
Sampling 4 chains, 0 divergences: 100%|██████████| 8000/8000 [00:03<00:00, 2035.12draws/s]
The number of effective samples is smaller than 25% for some parameters.

[13]:

	mean	std	min	25%	50%	75%	max	Prob<0	Prob>0
Intercept	1.048460	2.940644	-13.254892	-0.372376	0.967242	2.372862	26.860366	0.30325	0.69675
group	0.576785	0.551946	-1.425842	0.195678	0.572784	0.957911	2.738990	0.14725	0.85275
seniority_level	1.646575	1.287070	-2.438778	0.817672	1.352801	2.257462	8.219804	0.05050	0.94950
lam	0.774718	1.390844	0.001202	0.101534	0.296813	0.821106	16.358989	0.00000	1.00000

In the last column Prob>0, the table above shows that there is there is 85.27% of probability that revenue of group 1 is greater than group 2. Moreover it also shows that there is94.95% of probability that seniority level is positively associated to revenue.

For the sake of providing a general summary of statistics the table also shows: the intercept and lambda (lam) of the regression model.

Hierarchical regression

If your are not familiar with hierarchical regression have a look at the blog https://twiecki.io/blog/2014/03/17/bayesian-glms-3/.

[14]:

df = pd.DataFrame([[0, 5,   'USA'],
                   [0, 5,   'USA'],
                   [0, 100, 'Italy'],
                   [1, 100, 'USA'],
                   [1, 100, 'USA'],
                   [1, 100, 'France']], columns=['group', 'revenue', 'country'])

[15]:

stats = analyzer.hierarchical_regression(df, group_col='group', cat_col='country', kpi_col='revenue')
stats

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [eps, beta, alpha, sigma_beta, sigma_alpha, mu_beta, mu_alpha]
Sampling 4 chains, 816 divergences: 100%|██████████| 6000/6000 [02:10<00:00, 45.87draws/s]
There were 52 divergences after tuning. Increase `target_accept` or reparameterize.
There were 364 divergences after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.2979906043312202, but should be close to 0.8. Try to increase the number of tuning steps.
There were 75 divergences after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.6628490775514363, but should be close to 0.8. Try to increase the number of tuning steps.
There were 325 divergences after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.7113696800957767, but should be close to 0.8. Try to increase the number of tuning steps.
The chain reached the maximum tree depth. Increase max_treedepth, increase target_accept or reparameterize.
The rhat statistic is larger than 1.4 for some parameters. The sampler did not converge.
The estimated number of effective samples is smaller than 200 for some parameters.

[15]:

	mean	std	min	25%	50%	75%	max	Prob<0	Prob>0
mu_alpha	-0.028085	0.989639	-3.581447	-0.695825	-0.132219	0.688185	3.598191	0.54100	0.45900
mu_beta	0.176766	0.993789	-3.468508	-0.487023	0.309218	0.832437	3.588725	0.39750	0.60250
alpha__USA	14.074894	37.636252	-171.899366	-0.990796	0.317332	11.625923	240.521179	0.45875	0.54125
alpha__Italy	32.564691	46.492324	-57.351711	-0.532305	0.945736	99.803488	163.613053	0.39150	0.60850
alpha__France	2.547504	6.700164	-40.234538	-0.467854	1.040751	4.971800	91.083058	0.35550	0.64450
beta__USA	22.419341	43.726614	-140.604607	-0.145441	1.603786	33.143822	272.022584	0.26150	0.73850
beta__Italy	-1.967748	58.002111	-484.885230	-3.517865	0.349032	3.400547	481.391653	0.44850	0.55150
beta__France	34.939470	45.972820	-86.950038	-0.048646	1.928143	94.856067	208.532713	0.25650	0.74350
sigma_alpha	26.197334	42.125100	0.190135	0.528937	1.937846	51.083900	458.640177	0.00000	1.00000
sigma_beta	36.309637	54.466205	0.075608	0.989605	5.203234	59.455603	434.367847	0.00000	1.00000
eps	60.218967	46.760094	0.103970	0.664053	67.356771	99.604387	282.430219	0.00000	1.00000

In the table above we will focus on the beta parameters which represents the coefficients of the hierarchical regression. In the last column Prob>0, the table shows per each country the probability that revenue of group 1 is greater than group 2. In this way we can have an idea of the country in which the treatment was more effective.

For the sake of providing a general summary of statistics the table also shows: the alpha parameters which are the intercepts of the hierarchical regression; mu, sigma and eps which are the hyperpriors of the regression.