Contextual Multi-Armed Bandit

For the contextual multi-armed bandit (cMAB), where user information (context) is available, we implemented a generalisation of the Thompson sampling algorithm (Agrawal and Goyal, 2014) based on PyMC.

This notebook shows an example of how to use the class CmabBernoulli, which implements the algorithm above.

[1]:
import numpy as np

from pybandits.cmab import CmabBernoulli
from pybandits.model import BayesianLogisticRegression, BnnLayerParams, BnnParams, StudentTArray
/home/runner/.cache/pypoetry/virtualenvs/pybandits-vYJB-miV-py3.10/lib/python3.10/site-packages/pydantic/_migration.py:283: UserWarning: `pydantic.generics:GenericModel` has been moved to `pydantic.BaseModel`.
  warnings.warn(f'`{import_path}` has been moved to `{new_location}`.')
[2]:
n_samples = 1000
n_features = 5

First, we need to define the input context matrix \(X\) of size (\(n\_samples, n\_features\)) and the mapping of possible actions \(a_i \in A\) to their associated model.

[3]:
# context
X = 2 * np.random.random_sample((n_samples, n_features)) - 1  # random floats in the half-open interval [-1, 1)
print("X: context matrix of shape (n_samples, n_features)")
print(X[:10])
X: context matrix of shape (n_samples, n_features)
[[ 0.5285175  -0.05228887  0.19835608  0.32689102  0.19719533]
 [-0.3132468  -0.31361116 -0.29442892  0.41919992  0.54720322]
 [-0.77693054  0.95649133  0.49386009 -0.35104945 -0.7350991 ]
 [-0.99822708 -0.21798427  0.54473838  0.04482453  0.76356762]
 [-0.38404736  0.41135912  0.03980305 -0.78629694 -0.47528718]
 [ 0.84458725 -0.1103139  -0.32329648 -0.85923443 -0.02208486]
 [-0.24621794 -0.84101825  0.33660977 -0.16579716  0.83782307]
 [ 0.92155154 -0.8990105  -0.35697737  0.65098417 -0.526555  ]
 [ 0.07155641  0.08588879  0.3220394   0.22821753  0.56524977]
 [-0.90299882  0.87860681 -0.2439853   0.27823644  0.62557461]]
[4]:
# define action model
bias = StudentTArray.cold_start(mu=1, sigma=2, shape=1)
weight = StudentTArray.cold_start(shape=(n_features, 1))
layer_params = BnnLayerParams(weight=weight, bias=bias)
model_params = BnnParams(bnn_layer_params=[layer_params])

actions = {
    "a1": BayesianLogisticRegression(model_params=model_params),
    "a2": BayesianLogisticRegression(model_params=model_params),
}

We can now initialize the bandit, given the mapping of actions \(a_i\) to their models.

[5]:
# init contextual Multi-Armed Bandit model
cmab = CmabBernoulli(actions=actions)

The predict function below returns the action selected by the bandit at time \(t\): \(a_t = \arg\max_k P(r=1|\beta_k, x_t)\), where \(\beta_k\) are the coefficients sampled for action \(k\). The bandit selects one action for each sample (row) of the context matrix \(X\).

[6]:
# predict action
pred_actions, _, _ = cmab.predict(X)
print("Recommended action: {}".format(pred_actions[:10]))
/home/runner/.cache/pypoetry/virtualenvs/pybandits-vYJB-miV-py3.10/lib/python3.10/site-packages/pymc/data.py:384: FutureWarning: Data is now always mutable. Specifying the `mutable` kwarg will raise an error in a future release
  warnings.warn(
/home/runner/.cache/pypoetry/virtualenvs/pybandits-vYJB-miV-py3.10/lib/python3.10/site-packages/pytensor/link/c/cmodule.py:2968: UserWarning: PyTensor could not link to a BLAS installation. Operations that might benefit from BLAS will be severely degraded.
This usually happens when PyTensor is installed via pip. We recommend it be installed via conda/mamba/pixi instead.
Alternatively, you can use an experimental backend such as Numba or JAX that perform their own BLAS optimizations, by setting `pytensor.config.mode == 'NUMBA'` or passing `mode='NUMBA'` when compiling a PyTensor function.
For more options and details see https://pytensor.readthedocs.io/en/latest/troubleshooting.html#how-do-i-configure-test-my-blas-library
  warnings.warn(
Sampling: [bias_0, out, weight_0]
Sampling: [bias_0, out, weight_0]
Recommended action: ['a2', 'a2', 'a1', 'a1', 'a1', 'a1', 'a1', 'a1', 'a1', 'a2']
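
To make the selection rule \(a_t = \arg\max_k P(r=1|\beta_k, x_t)\) concrete, here is a minimal NumPy sketch of Thompson sampling with a logistic reward model. It is an illustration only, not the pybandits implementation: the function `thompson_select`, the `posteriors` dictionary, and the use of independent Gaussian draws for the coefficients are all assumptions made for this example (the notebook itself uses Student-t parameters sampled through PyMC).

```python
import numpy as np


def sigmoid(z):
    """Logistic function mapping real-valued logits to probabilities."""
    return 1.0 / (1.0 + np.exp(-z))


def thompson_select(X, posteriors, rng):
    """Select one action per row of X with Thompson sampling.

    posteriors maps an action name to (mu, sigma): hypothetical per-coefficient
    mean and standard deviation, with the bias stored at index 0.
    """
    names = list(posteriors)
    probs = np.empty((X.shape[0], len(names)))
    for j, name in enumerate(names):
        mu, sigma = posteriors[name]
        # Draw an independent coefficient vector for every sample (row of X).
        beta = rng.normal(mu, sigma, size=(X.shape[0], mu.shape[0]))
        logits = beta[:, 0] + np.sum(X * beta[:, 1:], axis=1)
        probs[:, j] = sigmoid(logits)
    # Pick the action with the highest sampled reward probability.
    chosen = [names[k] for k in probs.argmax(axis=1)]
    return chosen, probs


rng = np.random.default_rng(0)
X_demo = 2 * rng.random((4, 5)) - 1  # small stand-in for the context matrix
demo_posteriors = {
    "a1": (np.zeros(6), np.ones(6)),  # assumed posterior: 1 bias + 5 weights
    "a2": (np.zeros(6), np.ones(6)),
}
demo_actions, demo_probs = thompson_select(X_demo, demo_posteriors, rng)
print(demo_actions)
```

Because the coefficients are re-drawn for every sample, two identical contexts can receive different actions. This randomness is what drives exploration while the posteriors are still wide.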

Now, we observe the rewards and the context from the environment. In this example, both are randomly simulated.

[7]:
# simulate reward from environment
simulated_rewards = np.random.randint(2, size=n_samples).tolist()
print("Simulated rewards: {}".format(simulated_rewards[:10]))
Simulated rewards: [0, 0, 0, 0, 1, 1, 0, 0, 1, 0]
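
The rewards above are independent of both the context and the chosen action, so the bandit has nothing to learn from them. As a hedged alternative, the sketch below simulates rewards that do depend on the context through a logistic model. The coefficients in `true_beta` and all `*_demo` names are hypothetical values invented for this illustration, and the selected actions are drawn at random so that the block runs on its own.

```python
import numpy as np

rng = np.random.default_rng(42)
n_demo, n_feat = 1000, 5
X_demo = 2 * rng.random((n_demo, n_feat)) - 1
# Stand-in for the actions returned by the bandit.
demo_actions = rng.choice(["a1", "a2"], size=n_demo).tolist()

# Hypothetical ground-truth coefficients, one vector per action.
true_beta = {
    "a1": np.array([2.0, 0.0, 0.0, 0.0, 0.0]),
    "a2": np.array([-2.0, 0.0, 0.0, 0.0, 0.0]),
}

# Reward probability of each sample under the action selected for it.
logits = np.array([X_demo[i] @ true_beta[a] for i, a in enumerate(demo_actions)])
p_reward = 1.0 / (1.0 + np.exp(-logits))

# Bernoulli draw per sample, returned as a plain list of 0/1 rewards.
structured_rewards = rng.binomial(1, p_reward).tolist()
print("Structured rewards: {}".format(structured_rewards[:10]))
```

With rewards generated this way, action "a1" pays off more often when the first feature is positive and "a2" when it is negative, which gives the per-action models a signal to recover.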

Finally, we update the model by providing, for each sample: (i) its context \(x_t\), (ii) the action \(a_t\) selected by the bandit, and (iii) the corresponding reward \(r_t\).

[8]:
# update model
cmab.update(context=X, actions=pred_actions, rewards=simulated_rewards)
Initializing NUTS using adapt_diag...
Sequential sampling (2 chains in 1 job)
NUTS: [weight_0, bias_0]
Sampling 2 chains for 500 tune and 1_000 draw iterations (1_000 + 2_000 draws total) took 1 seconds.
/home/runner/.cache/pypoetry/virtualenvs/pybandits-vYJB-miV-py3.10/lib/python3.10/site-packages/arviz/stats/diagnostics.py:596: RuntimeWarning: invalid value encountered in scalar divide
  (between_chain_variance / within_chain_variance + num_samples - 1) / (num_samples)
There were 2000 divergences after tuning. Increase `target_accept` or reparameterize.
We recommend running at least 4 chains for robust computation of convergence diagnostics
The rhat statistic is larger than 1.01 for some parameters. This indicates problems during sampling. See https://arxiv.org/abs/1903.08008 for details
The effective sample size per chain is smaller than 100 for some parameters.  A higher number is needed for reliable rhat and ess computation. See https://arxiv.org/abs/1903.08008 for details
Initializing NUTS using adapt_diag...
Sequential sampling (2 chains in 1 job)
NUTS: [weight_0, bias_0]
Sampling 2 chains for 500 tune and 1_000 draw iterations (1_000 + 2_000 draws total) took 1 seconds.
/home/runner/.cache/pypoetry/virtualenvs/pybandits-vYJB-miV-py3.10/lib/python3.10/site-packages/arviz/stats/diagnostics.py:596: RuntimeWarning: invalid value encountered in scalar divide
  (between_chain_variance / within_chain_variance + num_samples - 1) / (num_samples)
There were 2000 divergences after tuning. Increase `target_accept` or reparameterize.
We recommend running at least 4 chains for robust computation of convergence diagnostics
The rhat statistic is larger than 1.01 for some parameters. This indicates problems during sampling. See https://arxiv.org/abs/1903.08008 for details
The effective sample size per chain is smaller than 100 for some parameters.  A higher number is needed for reliable rhat and ess computation. See https://arxiv.org/abs/1903.08008 for details
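
The sampler log above shows two NUTS runs, which is consistent with one posterior fit per action. A plausible reading, stated here as an assumption rather than a description of the pybandits internals, is that each action's model is fitted only on the samples for which that action was selected. The helper `split_by_action` below is a hypothetical illustration of that partitioning in plain NumPy.

```python
import numpy as np


def split_by_action(context, actions, rewards):
    """Group contexts and rewards by the action that was selected for each row."""
    actions_arr = np.asarray(actions)
    rewards_arr = np.asarray(rewards)
    batches = {}
    for name in np.unique(actions_arr):
        mask = actions_arr == name  # rows where this action was played
        batches[str(name)] = (context[mask], rewards_arr[mask])
    return batches


rng = np.random.default_rng(1)
X_demo = 2 * rng.random((6, 5)) - 1
demo_actions = ["a2", "a2", "a1", "a1", "a1", "a2"]
demo_rewards = [0, 0, 0, 0, 1, 1]

demo_batches = split_by_action(X_demo, demo_actions, demo_rewards)
for name, (X_a, r_a) in demo_batches.items():
    print(name, X_a.shape, r_a.tolist())
```

Inspecting the per-action batch sizes in this way can also help when interpreting sampler diagnostics, since an action that was rarely selected contributes few observations to its fit.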