# Contextual Multi-Armed Bandit

For the contextual multi-armed bandit (cMAB), where user information (context) is available, we implemented a generalisation of the Thompson sampling algorithm ([Agrawal and Goyal, 2014](https://arxiv.org/pdf/1209.3352.pdf)) based on PyMC3.

![title](img/cmab.png)

This notebook shows an example of how to use the class `CmabBernoulli`, which implements the algorithm above.

In [None]:
import numpy as np

from pybandits.cmab import CmabBernoulli
from pybandits.model import BayesianLogisticRegression, StudentT

In [None]:
n_samples = 1000
n_features = 5

First, we need to define the input context matrix $X$ of size ($n\_samples, n\_features$) and the mapping of possible actions $a_i \in A$ to their associated model.

In [None]:
# context
X = 2 * np.random.random_sample((n_samples, n_features)) - 1  # random floats in the interval [-1, 1)
print("X: context matrix of shape (n_samples, n_features)")
print(X[:10])

In [None]:
# define action model
actions = {
    "a1": BayesianLogisticRegression(alpha=StudentT(mu=1, sigma=2), betas=n_features * [StudentT()]),
    "a2": BayesianLogisticRegression(alpha=StudentT(mu=1, sigma=2), betas=n_features * [StudentT()]),
}

We can now initialize the bandit with the mapping of actions $a_i$ to their models.

In [None]:
# init contextual Multi-Armed Bandit model
cmab = CmabBernoulli(actions=actions)

The predict function below returns the action selected by the bandit at time $t$: $a_t = \operatorname{argmax}_k P(r=1 \mid \beta_k, x_t)$. The bandit selects one action for each sample (row) of the context matrix $X$.

In [None]:
# predict action
pred_actions, _, _ = cmab.predict(X)
print("Recommended action: {}".format(pred_actions[:10]))
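
The selection rule above can be illustrated with a minimal NumPy sketch. It is not the PyMC3-based implementation used by the library: it assumes an independent Gaussian posterior over each action's coefficients (intercept first), and the names `thompson_select`, `post_mean` and `post_std` are hypothetical. For each action it draws coefficients from the posterior, computes the reward probability with a logistic link, and picks the action with the highest sampled probability.

```python
import numpy as np

rng = np.random.default_rng(0)


def thompson_select(X, post_mean, post_std, rng):
    """Pick one action per row of X from sampled reward probabilities.

    post_mean / post_std map each action to an array of shape
    (n_features + 1,), intercept first. The independent Gaussian
    posterior is an assumption made for illustration only.
    """
    actions = list(post_mean)
    n_samples, n_features = X.shape
    probs = np.empty((n_samples, len(actions)))
    for j, a in enumerate(actions):
        # one posterior draw of the coefficients for every sample
        draws = rng.normal(post_mean[a], post_std[a], size=(n_samples, n_features + 1))
        logits = draws[:, 0] + np.sum(draws[:, 1:] * X, axis=1)
        probs[:, j] = 1.0 / (1.0 + np.exp(-logits))  # P(r=1 | beta_k, x_t)
    best = probs.argmax(axis=1)
    return [actions[i] for i in best], probs


X_demo = 2 * rng.random((8, 3)) - 1
post_mean = {"a1": np.zeros(4), "a2": np.array([0.5, 0.0, 0.0, 0.0])}
post_std = {"a1": np.ones(4), "a2": np.ones(4)}
chosen, probs = thompson_select(X_demo, post_mean, post_std, rng)
print(chosen)
```

Because the coefficients are sampled rather than fixed at their mean, an action with an uncertain posterior is still selected from time to time, which is how Thompson sampling balances exploration and exploitation.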

Next, we observe the rewards and the context from the environment. In this example, both are simulated at random.

In [None]:
# simulate reward from environment
simulated_rewards = np.random.randint(2, size=n_samples)
# simulate context from environment
simulated_context = 2 * np.random.random_sample((n_samples, n_features)) - 1  # random floats in the interval [-1, 1)
print("Simulated rewards: {}".format(simulated_rewards[:10]))
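
Uniformly random rewards carry no signal for the bandit to learn from. As an alternative, the sketch below simulates rewards that depend on both the context and the chosen action. It assumes a hypothetical environment in which each action has a "true" coefficient vector and the reward is Bernoulli with probability given by a logistic link; `simulate_rewards` and `true_betas` are illustrative names, not part of the library.

```python
import numpy as np

rng = np.random.default_rng(42)


def simulate_rewards(context, chosen_actions, true_betas, rng):
    """Draw rewards with P(r=1) = sigmoid(x . beta_a) for the chosen action a."""
    # stack the coefficient vector of the action chosen for each sample
    betas = np.stack([true_betas[a] for a in chosen_actions])  # (n_samples, n_features)
    logits = np.sum(context * betas, axis=1)
    p = 1.0 / (1.0 + np.exp(-logits))
    rewards = (rng.random(len(p)) < p).astype(int)
    return rewards, p


demo_context = 2 * rng.random((6, 5)) - 1
demo_actions = ["a1", "a2", "a1", "a2", "a2", "a1"]
# hypothetical ground truth: a1 pays off for positive features, a2 for negative ones
true_betas = {"a1": np.full(5, 1.0), "a2": np.full(5, -1.0)}
demo_rewards, demo_p = simulate_rewards(demo_context, demo_actions, true_betas, rng)
print(demo_rewards)
```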

Finally, we update the model by providing, for each sample: (i) its context $x_t$, (ii) the action $a_t$ selected by the bandit, and (iii) the corresponding reward $r_t$.

In [None]:
# update model
cmab.update(context=X, actions=pred_actions, rewards=simulated_rewards)
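
To make the structure of the update concrete, the sketch below groups the observations by selected action, under the assumption that each action's model is refit only on the samples for which that action was chosen. The helper `split_by_action` is hypothetical and only illustrates the data each model would see; it does not reproduce the library's internals.

```python
import numpy as np


def split_by_action(context, actions, rewards):
    """Return {action: (context rows, rewards)} for the samples where it was chosen."""
    actions = np.asarray(actions)
    rewards = np.asarray(rewards)
    return {
        str(a): (context[actions == a], rewards[actions == a])
        for a in np.unique(actions)
    }


toy_context = np.arange(8, dtype=float).reshape(4, 2)
toy_actions = ["a1", "a2", "a1", "a1"]
toy_rewards = [1, 0, 0, 1]
batches = split_by_action(toy_context, toy_actions, toy_rewards)
for name, (ctx, rew) in batches.items():
    print(name, ctx.shape, rew)
```

An action that was never selected in a batch receives no new observations, so its posterior would stay unchanged under this assumption.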