Contextual Multi-Armed Bandit

For the contextual multi-armed bandit (cMAB), where user information (context) is available, we implemented a generalisation of the Thompson sampling algorithm (Agrawal and Goyal, 2014) based on NumPyro.
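
Before turning to the pybandits API, the core idea can be sketched in plain NumPy in the style of linear Thompson sampling: keep a posterior over per-action weights, draw one weight vector per action each round, play the action whose draw scores the context highest, and update the chosen action's posterior. Everything below (`true_w`, the Gaussian posterior, the sigmoid reward model) is an illustrative assumption, not pybandits' NumPyro-based implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_rounds = 5, 200
arms = ["a1", "a2"]

# Per-arm Gaussian posterior over linear weights: N(mu, B^{-1})
B = {a: np.eye(n_features) for a in arms}    # precision matrix
f = {a: np.zeros(n_features) for a in arms}  # running sum of r * x
mu = {a: np.zeros(n_features) for a in arms}  # posterior mean

# Hidden per-arm weights used only to simulate rewards
true_w = {a: rng.normal(size=n_features) for a in arms}

for t in range(n_rounds):
    x = rng.uniform(-1, 1, size=n_features)
    # Thompson sampling: one posterior draw per arm, then argmax
    sampled = {a: rng.multivariate_normal(mu[a], np.linalg.inv(B[a])) for a in arms}
    a_t = max(arms, key=lambda a: sampled[a] @ x)
    # Simulated Bernoulli reward with probability sigmoid(true_w . x)
    p = 1.0 / (1.0 + np.exp(-(true_w[a_t] @ x)))
    r = rng.binomial(1, p)
    # Bayesian linear-regression update for the chosen arm only
    B[a_t] += np.outer(x, x)
    f[a_t] += r * x
    mu[a_t] = np.linalg.solve(B[a_t], f[a_t])
```

pybandits replaces the conjugate Gaussian update above with a Bayesian neural network fitted via NumPyro, but the sample-select-update loop is the same.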

The following notebook contains an example of usage of the class Cmab, which implements the algorithm above.

[1]:
import numpy as np

from pybandits.cmab import CmabBernoulli
from pybandits.model import BayesianNeuralNetwork, BnnLayerParams, BnnParams, FeaturesConfig, StudentTArray
[2]:
n_samples = 1000
n_features = 5

First, we need to define the input context matrix \(X\) of size (\(n\_samples, n\_features\)) and the mapping of possible actions \(a_i \in A\) to their associated model.

[3]:
# context
X = 2 * np.random.random_sample((n_samples, n_features)) - 1  # random float in the interval (-1, 1)
print("X: context matrix of shape (n_samples, n_features)")
print(X[:10])
X: context matrix of shape (n_samples, n_features)
[[ 0.57721501 -0.10491495  0.32741551 -0.64445469  0.59615259]
 [ 0.80096156  0.30799961 -0.11300114 -0.38609507 -0.01894877]
 [ 0.68199918 -0.7524287   0.4108654   0.82683007  0.01918578]
 [ 0.02820283 -0.18025311  0.12685239 -0.40028207 -0.54182886]
 [-0.05421174 -0.12169021 -0.38969847  0.00909368 -0.10272883]
 [-0.99113276  0.04352721  0.22339235  0.94559409 -0.02681423]
 [ 0.20475381 -0.9097193  -0.7052359   0.88276448  0.48553881]
 [-0.37192289  0.33167246 -0.23457425 -0.90863326  0.62528002]
 [ 0.2682564  -0.41391854  0.4441127  -0.79089207 -0.99060307]
 [-0.04413392  0.86481794 -0.87721351  0.35099491 -0.54559669]]
[4]:
# define action model
bias = StudentTArray.cold_start(mu=1, sigma=2, shape=1)
weight = StudentTArray.cold_start(shape=(n_features, 1))
layer_params = BnnLayerParams(weight=weight, bias=bias)
model_params = BnnParams(bnn_layer_params=[layer_params])
feature_config = FeaturesConfig(n_features=n_features)

update_method = "VI"
update_kwargs = {"num_steps": 100, "batch_size": 128, "optimizer_type": "adam"}

actions = {
    "a1": BayesianNeuralNetwork(
        model_params=model_params,
        feature_config=feature_config,
        update_method=update_method,
        update_kwargs=update_kwargs,
    ),
    "a2": BayesianNeuralNetwork(
        model_params=model_params,
        feature_config=feature_config,
        update_method=update_method,
        update_kwargs=update_kwargs,
    ),
}
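
The single-layer network configured above maps a context vector to a logistic output via one `weight` matrix of shape `(n_features, 1)` and one `bias` of shape `(1,)`. A rough NumPy sketch of that forward pass for a single posterior draw (the Student-t draws here, with an assumed 5 degrees of freedom and the `mu=1, sigma=2` scale from the bias prior, merely stand in for samples from the fitted posterior):

```python
import numpy as np

rng = np.random.default_rng(2)
n_features = 5

# One illustrative draw of the layer parameters, matching the shapes
# passed to StudentTArray.cold_start above
weight = rng.standard_t(df=5, size=(n_features, 1))          # (n_features, 1)
bias = 1.0 + 2.0 * rng.standard_t(df=5, size=(1,))           # (1,)

x = rng.uniform(-1, 1, size=(1, n_features))                 # one context row
logit = x @ weight + bias                                    # shape (1, 1)
p = 1.0 / (1.0 + np.exp(-logit))                             # P(r=1 | weight, bias, x)
```

Note that `a1` and `a2` are built from the same prior configuration here; each action still maintains its own independent posterior once updates begin.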

We can now initialise the bandit given the mapping of actions \(a_i\) to their models.

[5]:
# init contextual Multi-Armed Bandit model
cmab = CmabBernoulli(actions=actions)

The predict function below returns the action selected by the bandit at time \(t\): \(a_t = \operatorname{argmax}_k P(r=1|\beta_k, x_t)\). The bandit selects one action for each sample of the context matrix \(X\).
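
The selection rule can be sketched as follows, using fixed weight vectors as stand-ins for the posterior draws a Bayesian model would provide (the weights `beta` and the two-action setup are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
n_samples, n_features = 10, 5
X = rng.uniform(-1, 1, size=(n_samples, n_features))

# One posterior draw of weights per action (illustrative stand-ins)
beta = {"a1": rng.normal(size=n_features), "a2": rng.normal(size=n_features)}

# P(r=1 | beta_k, x_t) for every sample and action, then argmax per row
probs = np.column_stack([1 / (1 + np.exp(-(X @ beta[a]))) for a in ("a1", "a2")])
pred_actions = [("a1", "a2")[k] for k in probs.argmax(axis=1)]
```

Because Thompson sampling draws fresh parameters from the posterior at each call, repeated predictions on the same context can select different actions; the sketch above fixes the draw for clarity.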

[6]:
# predict action
pred_actions, _, _ = cmab.predict(X)
print("Recommended action: {}".format(pred_actions[:10]))
Recommended action: ['a2', 'a2', 'a2', 'a1', 'a2', 'a1', 'a2', 'a1', 'a2', 'a2']

Now, we observe the rewards from the environment. In this example, the rewards are randomly simulated.

[7]:
# simulate reward from environment
simulated_rewards = np.random.randint(2, size=n_samples).tolist()
print("Simulated rewards: {}".format(simulated_rewards[:10]))
Simulated rewards: [1, 0, 1, 0, 1, 1, 1, 0, 1, 0]

Finally, we update the model by providing, for each sample: (i) its context \(x_t\), (ii) the action \(a_t\) selected by the bandit, and (iii) the corresponding reward \(r_t\).

[8]:
# update model
cmab.update(context=X, actions=pred_actions, rewards=simulated_rewards)