Stochastic Multi-Armed Bandit

For the stochastic multi-armed bandit (sMAB), we implemented a Bernoulli multi-armed bandit based on the Thompson sampling algorithm (Agrawal and Goyal, 2012).

The following notebook shows how to use the class Smab, which implements the algorithm above.

[1]:
import numpy as np
import random
from pybandits.core.smab import Smab

First, we need to define the list of possible actions \(a_i \in A\) and the prior parameters \(\alpha, \beta\) of each Beta distribution. By setting them all to 1, all actions have the same probability of being selected by the bandit before the first update.

[2]:
# define actions
action_ids = ['Action A', 'Action B', 'Action C']
[3]:
# define beta priors parameters
success_priors = {'Action A': 1,
                  'Action B': 1,
                  'Action C': 1}

failure_priors = {'Action A': 1,
                  'Action B': 1,
                  'Action C': 1}

We can now initialize the bandit with the list of actions \(a_i\) and the success/failure Beta prior parameters \(\alpha, \beta\).

[4]:
# init stochastic Multi-Armed Bandit model
smab = Smab(action_ids=action_ids,
            success_priors=success_priors,
            failure_priors=failure_priors)

The predict function below returns the action selected by the bandit at time \(t\): \(a_t = \operatorname{argmax}_k \theta_k^t\), where \(\theta_k^t\) is the sample drawn from Beta distribution \(k\) at time \(t\). The bandit selects one action at a time when n_samples=1, or a batch of actions when n_samples>1.

[5]:
# predict actions
pred_actions, _ = smab.predict(n_samples=1000)
print('Recommended actions: {}'.format(pred_actions[:10]))
Recommended actions: ['Action C', 'Action C', 'Action C', 'Action B', 'Action B', 'Action C', 'Action B', 'Action C', 'Action A', 'Action B']
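Under the hood, each prediction draws one sample per action from its Beta distribution and picks the argmax, as in the formula above. A minimal NumPy sketch of that selection rule (the helper thompson_select is illustrative, not part of the pybandits API; the uniform priors \(\alpha = \beta = 1\) match those defined earlier):

```python
import numpy as np

rng = np.random.default_rng(0)

# Beta parameters per action; alpha = beta = 1 are the uniform priors above.
alphas = {'Action A': 1, 'Action B': 1, 'Action C': 1}
betas = {'Action A': 1, 'Action B': 1, 'Action C': 1}

def thompson_select(alphas, betas, rng):
    # draw theta_k ~ Beta(alpha_k, beta_k) for each action k,
    # then return the action with the largest sample
    samples = {a: rng.beta(alphas[a], betas[a]) for a in alphas}
    return max(samples, key=samples.get)

# with uniform priors, each action is selected roughly equally often
selected = [thompson_select(alphas, betas, rng) for _ in range(1000)]
```

Because every action starts from the same uniform prior, the three actions are sampled with roughly equal frequency, matching the mixed recommendations printed above.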

Now, we observe the rewards from the environment. In this example, the rewards are randomly simulated.

[6]:
# simulate rewards from environment
n_successes, n_failures = {}, {}
for a in action_ids:
    n_successes[a] = random.randint(0, pred_actions.count(a))
    n_failures[a] = pred_actions.count(a) - n_successes[a]
    print('{}: n_successes={}, n_failures={}'.format(a, n_successes[a], n_failures[a]))
Action A: n_successes=285, n_failures=31
Action B: n_successes=123, n_failures=210
Action C: n_successes=261, n_failures=90

Finally, we update the model by providing, for each action, the number of successes \(S_i\) and the number of failures \(F_i\).

[7]:
# update model
for a in action_ids:
    smab.update(action_id=a, n_successes=n_successes[a], n_failures=n_failures[a])
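This update amounts to conjugate Beta updating: \(\alpha_i \leftarrow \alpha_i + S_i\) and \(\beta_i \leftarrow \beta_i + F_i\). A sketch of the resulting posterior means, using the simulated counts printed above (the update rule is standard Beta-Bernoulli conjugacy; the dictionaries below are illustrative, not pybandits internals):

```python
# Conjugate Beta update: posterior mean = (alpha + S) / (alpha + S + beta + F).
# Success/failure counts are the simulated rewards printed above.
priors = {'Action A': (1, 1), 'Action B': (1, 1), 'Action C': (1, 1)}
successes = {'Action A': 285, 'Action B': 123, 'Action C': 261}
failures = {'Action A': 31, 'Action B': 210, 'Action C': 90}

posterior_means = {}
for a, (alpha, beta) in priors.items():
    alpha_post = alpha + successes[a]   # alpha_i <- alpha_i + S_i
    beta_post = beta + failures[a]      # beta_i <- beta_i + F_i
    posterior_means[a] = alpha_post / (alpha_post + beta_post)
```

With these counts, Action A ends up with the highest posterior mean, so subsequent predict calls will favor it while still occasionally exploring the others.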