Analysis Bayesian Approach

This tutorial shows how to perform the post-test analysis of an A/B test experiment with two variants, the so-called control and treatment groups, using Bayesian statistics. It covers both the comparison of means and the comparison of conversions.

Let’s first import the tools needed.

[1]:
import numpy as np
import pandas as pd
from abexp.core.analysis_bayesian import BayesianAnalyzer
from abexp.core.analysis_bayesian import BayesianGLMAnalyzer
import warnings
warnings.filterwarnings('ignore')

Compare means

Here we want to compare the average revenue per user of the control group versus the treatment group.

[2]:
# Revenue for users
np.random.seed(42)
revenue_contr = np.random.randint(low=400, high=500, size=10000)
revenue_treat = np.random.randint(low=500, high=700, size=10000)
[3]:
# Define the analyzer
analyzer = BayesianAnalyzer()
[4]:
prob, lift, diff_means, ci = analyzer.compare_mean(obs_contr=revenue_contr, obs_treat=revenue_treat)
logp = -1.18e+05, ||grad|| = 3.0081e+10: 100%|██████████| 22/22 [00:00<00:00, 773.97it/s]
Multiprocess sampling (4 chains in 4 jobs)
CompoundStep
>Metropolis: [nu_minus_one]
>Metropolis: [std_treat]
>Metropolis: [std_contr]
>Metropolis: [mean_treat]
>Metropolis: [mean_contr]
Sampling 4 chains, 0 divergences: 100%|██████████| 202000/202000 [02:51<00:00, 1181.01draws/s]
The rhat statistic is larger than 1.4 for some parameters. The sampler did not converge.
The estimated number of effective samples is smaller than 200 for some parameters.
[5]:
print('Probability that mean revenue(treatment) is greater than mean revenue(control) = {:.2%}'.format(prob))
Probability that mean revenue(treatment) is greater than mean revenue(control) = 94.79%
[6]:
print('Lift between treatment and control = {:.2%}'.format(lift))
Lift between treatment and control = 33.20%

The result of Bayesian A/B testing is the probability that the treatment group performs better than the control group, i.e. that it has a higher mean revenue per user in the current example. This is an intuitive way of doing A/B testing because it does not rely on statistical measures such as the p-value, which are harder for non-statisticians to interpret.
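
To make the meaning of this probability concrete, here is a minimal sketch of how such a figure can be derived once posterior draws of the two group means are available. It is not abexp's internal implementation: the draws are simulated from normal distributions as a stand-in for sampler output, and the names `post_mean_contr` and `post_mean_treat` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for posterior draws of each group's mean. These are assumed
# values for illustration, not real sampler output.
post_mean_contr = rng.normal(loc=450.0, scale=3.0, size=100_000)
post_mean_treat = rng.normal(loc=455.0, scale=3.0, size=100_000)

# Probability that the treatment mean exceeds the control mean:
# the share of paired draws in which treatment > control.
prob = np.mean(post_mean_treat > post_mean_contr)

# Relative lift computed from the posterior means
# (one of several reasonable definitions of lift).
lift = (post_mean_treat.mean() - post_mean_contr.mean()) / post_mean_contr.mean()

print('P(treatment > control) = {:.2%}'.format(prob))
print('Lift = {:.2%}'.format(lift))
```

The probability is simply a count over posterior draws, which is why it can be read directly as "how likely is it that treatment beats control".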

We can set an arbitrary threshold to decide how to interpret the outcome of the Bayesian test, e.g. if prob \(>\) 90% we can conclude that the treatment has a significant effect on the mean revenue per user compared to the control group.
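
The threshold rule can be written as a small helper. This is only a sketch: the function name `interpret_bayesian_test` is hypothetical, and the symmetric treatment of low probabilities (declaring control better when prob falls below 1 - threshold) is an assumption added here, not something the text prescribes.

```python
def interpret_bayesian_test(prob, threshold=0.90):
    """Map P(treatment > control) to a verdict.

    The symmetric lower bound (1 - threshold) is an assumption of this sketch.
    """
    if not 0.0 <= prob <= 1.0:
        raise ValueError('prob must lie in [0, 1]')
    if prob > threshold:
        return 'treatment better'
    if prob < 1.0 - threshold:
        return 'control better'
    return 'inconclusive'


# 0.9479 is the probability obtained in the means comparison above;
# 0.50 is a made-up value showing the inconclusive branch.
verdict_means = interpret_bayesian_test(0.9479)
verdict_coin_flip = interpret_bayesian_test(0.50)
print(verdict_means)
print(verdict_coin_flip)
```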

Compare proportions

[7]:
# Number of users that made a purchase
purchase_contr = 470
purchase_treat = 500

# Total number of users
total_usr_treat = 5000
total_usr_contr = 5000
[8]:
prob, lift = analyzer.compare_conv(conv_contr=purchase_contr,
                                   conv_treat=purchase_treat,
                                   nobs_contr=total_usr_contr,
                                   nobs_treat=total_usr_treat)
[9]:
print('Probability that purchase proportion(treatment) is greater than purchase proportion(control) = {:.2%}'.format(prob))
Probability that purchase proportion(treatment) is greater than purchase proportion(control) = 84.45%
[10]:
print('Lift between treatment and control = {:.2%}'.format(lift))
Lift between treatment and control = 6.37%
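
For intuition, the comparison of proportions can be sketched with a conjugate Beta-Binomial model in plain NumPy. This is an illustrative assumption, not necessarily the model abexp uses: each conversion rate gets a uniform Beta(1, 1) prior, and lift is taken as the ratio of posterior means minus one.

```python
import numpy as np

rng = np.random.default_rng(42)

# Same counts as in the example above.
purchase_contr, total_usr_contr = 470, 5000
purchase_treat, total_usr_treat = 500, 5000

# With a Beta(1, 1) prior, the posterior of a conversion rate is
# Beta(1 + successes, 1 + failures).
n_draws = 200_000
post_contr = rng.beta(1 + purchase_contr,
                      1 + total_usr_contr - purchase_contr, size=n_draws)
post_treat = rng.beta(1 + purchase_treat,
                      1 + total_usr_treat - purchase_treat, size=n_draws)

# Share of draws in which the treatment rate exceeds the control rate.
prob = np.mean(post_treat > post_contr)

# Relative lift based on the posterior means (an assumed definition).
lift = post_treat.mean() / post_contr.mean() - 1

print('P(treatment > control) = {:.2%}'.format(prob))
print('Lift = {:.2%}'.format(lift))
```

Under these assumptions the sketch gives figures in the same region as those printed above, although the library's actual priors and sampler may differ.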

Bayesian GLM

Here we want to compare the average revenue per user of the control group versus the treatment group. We are also interested in breaking down the results by some categorical features of the input samples (e.g. seniority_level, country).

[11]:
# Define the analyzer
analyzer = BayesianGLMAnalyzer()

Multivariate Regression

[12]:
df = pd.DataFrame([[1, 4, 35],
                   [0, 4, 5],
                   [1, 3, 28],
                   [0, 1, 5],
                   [0, 2, 1],
                   [1, 0, 1.5]], columns=['group', 'seniority_level', 'revenue'])
[13]:
stats = analyzer.multivariate_regression(df, 'revenue')
stats
Auto-assigning NUTS sampler...
Initializing NUTS using adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [lam, seniority_level, group, Intercept]
Sampling 4 chains, 0 divergences: 100%|██████████| 8000/8000 [00:03<00:00, 2035.12draws/s]
The number of effective samples is smaller than 25% for some parameters.
[13]:
                     mean       std         min        25%       50%       75%        max   Prob<0   Prob>0
Intercept        1.048460  2.940644  -13.254892  -0.372376  0.967242  2.372862  26.860366  0.30325  0.69675
group            0.576785  0.551946   -1.425842   0.195678  0.572784  0.957911   2.738990  0.14725  0.85275
seniority_level  1.646575  1.287070   -2.438778   0.817672  1.352801  2.257462   8.219804  0.05050  0.94950
lam              0.774718  1.390844    0.001202   0.101534  0.296813  0.821106  16.358989  0.00000  1.00000

In the last column, Prob>0, the table above shows that there is a probability of about 85.3% that the revenue of group 1 is greater than that of group 0. It also shows that there is a 94.95% probability that seniority level is positively associated with revenue.

To give a general summary of statistics, the table also shows the intercept and lambda (lam) of the regression model.
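
The layout of the summary table can be reproduced from raw posterior draws with pandas. The sketch below is an assumption about how such a table could be built, not abexp's code: the draws are simulated, and only the column layout mirrors the output above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Stand-in posterior draws for two coefficients (assumed values for
# illustration, not real sampler output).
draws = pd.DataFrame({
    'group': rng.normal(loc=0.6, scale=0.55, size=4000),
    'seniority_level': rng.normal(loc=1.6, scale=1.3, size=4000),
})

# describe() gives count, mean, std, min, quartiles and max per column.
# Transpose so each coefficient is a row, and drop the 'count' column.
summary = draws.describe().T.drop(columns='count')

# Share of draws below / above zero for each coefficient.
summary['Prob<0'] = (draws < 0).mean()
summary['Prob>0'] = (draws > 0).mean()

print(summary)
```

Read this way, Prob>0 for a coefficient is the fraction of posterior draws in which that coefficient is positive.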

Hierarchical regression

If you are not familiar with hierarchical regression, have a look at the blog post https://twiecki.io/blog/2014/03/17/bayesian-glms-3/.

[14]:
df = pd.DataFrame([[0, 5,   'USA'],
                   [0, 5,   'USA'],
                   [0, 100, 'Italy'],
                   [1, 100, 'USA'],
                   [1, 100, 'USA'],
                   [1, 100, 'France']], columns=['group', 'revenue', 'country'])

[15]:
stats = analyzer.hierarchical_regression(df, group_col='group', cat_col='country', kpi_col='revenue')
stats
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [eps, beta, alpha, sigma_beta, sigma_alpha, mu_beta, mu_alpha]
Sampling 4 chains, 816 divergences: 100%|██████████| 6000/6000 [02:10<00:00, 45.87draws/s]
There were 52 divergences after tuning. Increase `target_accept` or reparameterize.
There were 364 divergences after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.2979906043312202, but should be close to 0.8. Try to increase the number of tuning steps.
There were 75 divergences after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.6628490775514363, but should be close to 0.8. Try to increase the number of tuning steps.
There were 325 divergences after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.7113696800957767, but should be close to 0.8. Try to increase the number of tuning steps.
The chain reached the maximum tree depth. Increase max_treedepth, increase target_accept or reparameterize.
The rhat statistic is larger than 1.4 for some parameters. The sampler did not converge.
The estimated number of effective samples is smaller than 200 for some parameters.
[15]:
                    mean        std          min        25%        50%        75%         max   Prob<0   Prob>0
mu_alpha       -0.028085   0.989639    -3.581447  -0.695825  -0.132219   0.688185    3.598191  0.54100  0.45900
mu_beta         0.176766   0.993789    -3.468508  -0.487023   0.309218   0.832437    3.588725  0.39750  0.60250
alpha__USA     14.074894  37.636252  -171.899366  -0.990796   0.317332  11.625923  240.521179  0.45875  0.54125
alpha__Italy   32.564691  46.492324   -57.351711  -0.532305   0.945736  99.803488  163.613053  0.39150  0.60850
alpha__France   2.547504   6.700164   -40.234538  -0.467854   1.040751   4.971800   91.083058  0.35550  0.64450
beta__USA      22.419341  43.726614  -140.604607  -0.145441   1.603786  33.143822  272.022584  0.26150  0.73850
beta__Italy    -1.967748  58.002111  -484.885230  -3.517865   0.349032   3.400547  481.391653  0.44850  0.55150
beta__France   34.939470  45.972820   -86.950038  -0.048646   1.928143  94.856067  208.532713  0.25650  0.74350
sigma_alpha    26.197334  42.125100     0.190135   0.528937   1.937846  51.083900  458.640177  0.00000  1.00000
sigma_beta     36.309637  54.466205     0.075608   0.989605   5.203234  59.455603  434.367847  0.00000  1.00000
eps            60.218967  46.760094     0.103970   0.664053  67.356771  99.604387  282.430219  0.00000  1.00000

In the table above we focus on the beta parameters, which represent the coefficients of the hierarchical regression. In the last column, Prob>0, the table shows for each country the probability that the revenue of group 1 is greater than that of group 0. This gives an idea of the countries in which the treatment was more effective.
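
Ranking the countries by this probability takes a few lines of pandas. The sketch below rebuilds only the Prob>0 column from the table above and assumes that the returned `stats` object is a DataFrame indexed by parameter name, as the rendered output suggests.

```python
import pandas as pd

# Prob>0 values copied from the table above. The index layout is an
# assumption based on how the output is rendered.
stats = pd.DataFrame(
    {'Prob>0': [0.54125, 0.60850, 0.64450, 0.73850, 0.55150, 0.74350]},
    index=['alpha__USA', 'alpha__Italy', 'alpha__France',
           'beta__USA', 'beta__Italy', 'beta__France'],
)

# Keep only the beta rows and strip the prefix to get the country names.
betas = stats[stats.index.str.startswith('beta__')].copy()
betas.index = betas.index.str.replace('beta__', '', regex=False)
betas.index.name = 'country'

# Countries ordered from the most to the least likely positive effect.
ranking = betas['Prob>0'].sort_values(ascending=False)
print(ranking)
```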

To give a general summary of statistics, the table also shows the alpha parameters, which are the intercepts of the hierarchical regression, and mu, sigma and eps, which are the hyperpriors of the regression.
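
The sampler logs above report an rhat statistic larger than 1.4 together with many divergences, which means the chains did not agree with each other on this tiny toy dataset and the numbers should be read as illustrative. As a rough guide to what rhat measures, here is a sketch of the classic Gelman-Rubin statistic in NumPy. It is the simple non-split version; the sampler's own diagnostic may use a more refined variant, and the function name `gelman_rubin` is hypothetical.

```python
import numpy as np


def gelman_rubin(chains):
    """Classic (non-split) R-hat for an array of shape (n_chains, n_draws)."""
    chains = np.asarray(chains, dtype=float)
    n_chains, n_draws = chains.shape

    # Average variance within each chain.
    within = chains.var(axis=1, ddof=1).mean()

    # Variance between the chain means, scaled by the chain length.
    between = n_draws * chains.mean(axis=1).var(ddof=1)

    # Pooled estimate of the posterior variance.
    pooled = (n_draws - 1) / n_draws * within + between / n_draws
    return np.sqrt(pooled / within)


rng = np.random.default_rng(0)

# Four chains sampling the same distribution: R-hat should be close to 1.
mixed = rng.normal(size=(4, 1000))

# Four chains stuck around different values: R-hat should be well above 1.
stuck = mixed + 3.0 * np.arange(4)[:, None]

rhat_mixed = gelman_rubin(mixed)
rhat_stuck = gelman_rubin(stuck)
print('R-hat (mixed chains): {:.3f}'.format(rhat_mixed))
print('R-hat (stuck chains): {:.3f}'.format(rhat_stuck))
```

Values close to 1 indicate that the chains explore the same distribution, while large values, as in the logs above, suggest running more tuning steps or using more data before trusting the summary.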