Simulation sMAB

This notebook shows a simulation framework for the stochastic multi-armed bandit (sMAB). It allows you to study the behaviour of the bandit algorithm, evaluate results, and run experiments on simulated data under different reward and action settings.

[1]:
import numpy as np
from pybandits.core.smab import Smab
from pybandits.utils.simulation_smab import SimulationSmab

First, we need to initialize the sMAB as shown in the previous notebook.

[2]:
# define actions
action_ids = ['Action A', 'Action B', 'Action C']
[3]:
# init stochastic Multi-Armed Bandit model
smab = Smab(action_ids=action_ids)

To initialize SimulationSmab we need to define (i) the number of updates in the simulation, (ii) the number of samples per batch to consider at each iteration of the simulation, and (iii) the probability of a positive reward for each action, i.e. the ground truth (‘Action A’: 0.6 means that if the bandit selects ‘Action A’, the environment will return a positive reward with 60% probability).

Data are processed in batches of size n >= 1. For each sample in a batch, the sMAB selects an action and collects the corresponding simulated reward. Then, the prior parameters are updated based on the rewards returned for the recommended actions.
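
To make the role of the ground truth concrete, here is a minimal sketch (not the library's internal code) of how a simulated reward can be drawn for a recommended action: a single Bernoulli draw with the action's ground-truth probability. The probs_reward values and the helper simulate_rewards are illustrative only.

import numpy as np

# hypothetical ground-truth probabilities, matching the example below
probs_reward = {'Action A': 0.6, 'Action B': 0.5, 'Action C': 0.8}
rng = np.random.default_rng(seed=42)

def simulate_rewards(selected_actions, probs_reward, rng):
    # one Bernoulli draw per recommended action: 1 with the action's
    # ground-truth probability, 0 otherwise
    return [int(rng.random() < probs_reward[a]) for a in selected_actions]

simulate_rewards(['Action A', 'Action C', 'Action C'], probs_reward, rng)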

[4]:
# init simulation
sim = SimulationSmab(smab=smab,
                     n_updates=20,
                     batch_size=2000,
                     probs_reward={'Action A': 0.6, 'Action B': 0.5, 'Action C': 0.8},
                     verbose=True)

Now, we can start the simulation process by executing run(), which performs the following steps (a conceptual sketch of this loop is shown after the pseudocode):

For i = 0 to n_updates:
    Consider batch[i] of observations.
    The sMAB selects the best action, i.e. the action with the highest sampled reward probability, for each sample in batch[i].
    Rewards are returned for each recommended action.
    Prior parameters are updated based on the recommended actions and the returned rewards.
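
As an illustration only (this is not the SimulationSmab implementation), the loop above can be mimicked with a self-contained Beta-Bernoulli Thompson sampler written directly in numpy: per batch, an action is selected for every sample by drawing from each action's Beta posterior, rewards are simulated from the assumed ground-truth probabilities, and the posteriors are updated once the batch is complete. All names and settings below are assumptions mirroring the example in this notebook.

import numpy as np

rng = np.random.default_rng(0)

# hypothetical settings mirroring the example above
actions = ['Action A', 'Action B', 'Action C']
probs_reward = {'Action A': 0.6, 'Action B': 0.5, 'Action C': 0.8}
n_updates, batch_size = 20, 2000

# Beta(1, 1) priors per action, stored as [alpha, beta] = [1 + successes, 1 + failures]
params = {a: [1, 1] for a in actions}

for _ in range(n_updates):
    batch_actions, batch_rewards = [], []
    for _ in range(batch_size):
        # Thompson sampling: one draw per posterior, pick the argmax
        sampled = {a: rng.beta(params[a][0], params[a][1]) for a in actions}
        best = max(sampled, key=sampled.get)
        batch_actions.append(best)
        # the simulated environment returns a Bernoulli reward
        batch_rewards.append(int(rng.random() < probs_reward[best]))
    # update the Beta posteriors once per batch
    for a, r in zip(batch_actions, batch_rewards):
        params[a][0] += r
        params[a][1] += 1 - r

# posterior means after the simulation
print({a: params[a][0] / (params[a][0] + params[a][1]) for a in actions})

With 20 updates of 2000 samples each, such a sampler concentrates on the action with the highest ground-truth probability, which is consistent with the selection counts reported below.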

Finally, we can visualize the results of the simulation. In this case, ‘Action C’ was recommended most often, since its ground-truth probability of returning a positive reward is 80%. The summary statistics printed below can also be recomputed from the per-sample results, as sketched after the output.

[5]:
# run simulation
sim.run()
Simulation results (first 10 observations):
      action  reward
0  Action B     0.0
1  Action C     1.0
2  Action C     0.0
3  Action A     1.0
4  Action B     1.0
5  Action C     1.0
6  Action A     1.0
7  Action A     1.0
8  Action B     0.0
9  Action B     0.0

Count of actions selected by the bandit:
 {'Action C': 38670, 'Action B': 683, 'Action A': 647}

Observed proportion of positive rewards for each action:
 {'Action A': 0.6120556414219475, 'Action B': 0.4978038067349927, 'Action C': 0.7995603827256271}
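
For reference, here is a small sketch of how the two summaries above can be derived from a per-sample results table. The results DataFrame below is a hand-made stand-in with the same ‘action’ and ‘reward’ columns shown in the printed output, not an object returned by SimulationSmab.

import pandas as pd

# stand-in for the per-sample simulation results shown above
results = pd.DataFrame({'action': ['Action B', 'Action C', 'Action C', 'Action A'],
                        'reward': [0.0, 1.0, 0.0, 1.0]})

# count of actions selected by the bandit
print(results['action'].value_counts().to_dict())

# observed proportion of positive rewards for each action
print(results.groupby('action')['reward'].mean().to_dict())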