Simulation cMAB

This notebook shows a simulation framework for the contextual multi-armed bandit (cMAB). It allows us to study the behaviour of the bandit algorithm, to evaluate results and to run experiments on simulated data under different context, reward and action settings.

[1]:
import pandas as pd
import numpy as np
from sklearn.datasets import make_classification
from pybandits.utils.simulation_cmab import SimulationCmab
from pybandits.core.cmab import Cmab

First, we need to define the simulation parameters: (i) the number of samples per batch to consider at each iteration of the simulation, (ii) the number of sample groups (we assume groups of samples whose features come from the same distribution), (iii) the number of updates in the simulation, (iv) the list of possible actions, and (v) the number of features in the context matrix.

Data are processed in batches of size n>=1. For each batch of simulated samples, the cMAB selects one action per sample and collects the corresponding simulated reward. Then, the prior parameters are updated based on the rewards returned for the recommended actions.

[2]:
# simulation parameters
batch_size = 100
n_groups = 3
n_updates = 5
n_jobs = 1
actions_ids = ['action A', 'action B', 'action C']
n_features = 5
verbose = True

Here, we initialize the context matrix \(X\) and the groups of samples. Samples that belong to the same group have features drawn from the same distribution.

[3]:
# init context matrix and groups
X, group = make_classification(n_samples=batch_size*n_updates,
                               n_features=n_features,
                               n_informative=n_features,
                               n_redundant=0,
                               n_classes=n_groups)
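
As a quick optional check, we can verify that the generated samples are roughly balanced across the three groups:

# optional check: number of simulated samples per group
unique_groups, counts = np.unique(group, return_counts=True)
print(dict(zip(unique_groups, counts)))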

We need to define the probability of a positive reward for each action/group pair, i.e. the ground truth (e.g. ‘action B’: 0.8 for group ‘0’ means that if the bandit selects ‘action B’ for samples that belong to group ‘0’, then the environment returns a positive reward with 80% probability).

[4]:
# init probability of rewards from the environment
prob_rewards = pd.DataFrame([[0.05, 0.80, 0.05],
                             [0.80, 0.05, 0.05],
                             [0.80, 0.05, 0.80]], columns=actions_ids, index=range(n_groups))
print('Probability of positive reward for each group/action:')
prob_rewards
Probability of positive reward for each group/action:
[4]:
action A action B action C
0 0.05 0.80 0.05
1 0.80 0.05 0.05
2 0.80 0.05 0.80
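
To make the reward mechanism concrete, a positive reward can be drawn as a Bernoulli sample whose probability is the ground-truth entry for the sample's group and the selected action. A minimal sketch, purely illustrative (this is not how SimulationCmab is invoked):

# illustrative only: simulate the environment's reward for one sample of group 0
# when the bandit selects 'action B' (probability taken from prob_rewards)
simulated_reward = np.random.binomial(n=1, p=prob_rewards.loc[0, 'action B'])
print(simulated_reward)  # 1 with probability 0.80, otherwise 0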

We initialize the Cmab as shown in the previous notebook, and the SimulationCmab with the parameters defined above.

[5]:
# init contextual Multi-Armed Bandit model
cmab = Cmab(n_features=n_features, actions_ids=actions_ids, n_jobs=n_jobs)
[6]:
# init simulation
sim = SimulationCmab(cmab=cmab, X=X,
                     group=group,
                     batch_size=batch_size,
                     n_updates=n_updates,
                     prob_rewards=prob_rewards,
                     verbose=verbose)
Setup simulation  completed.
Simulated input probability rewards:
        action A  action B  action C
group
0      0.041176  0.835294  0.052941
1      0.819277  0.036145  0.054217
2      0.786585  0.042683  0.817073

Now, we can start the simulation process by executing run(), which performs the following steps:

For i = 1 to n_updates:
    Extract batch[i] of samples from X
    The model recommends to each sample in batch[i] the action with the highest predicted reward probability, and the corresponding simulated rewards are collected
    The model priors are updated using the recommended actions and the returned rewards
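
For reference, one iteration of this loop could be written by hand roughly as follows. This is only a sketch: the predict()/update() calls follow the Cmab usage from the previous notebook, their exact signatures and return values should be checked against the library, and SimulationCmab performs these steps (including the reward simulation) internally.

# sketch of a single simulation step; the Cmab method names are assumed from the previous notebook
X_batch, group_batch = X[:batch_size], group[:batch_size]        # extract one batch of contexts
actions_batch = cmab.predict(X_batch)                             # one recommended action per sample
rewards_batch = [np.random.binomial(1, prob_rewards.loc[g, a])   # simulated Bernoulli rewards
                 for g, a in zip(group_batch, actions_batch)]
cmab.update(X_batch, actions_batch, rewards_batch)                # update the model priors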

Finally, we can inspect the results of the simulation. As expected from the ground truth, ‘action B’ was the action recommended most often for samples that belong to group ‘0’, ‘action A’ for group ‘1’, and both ‘action A’ and ‘action C’ for group ‘2’.

[7]:
sim.run()
Iteration #1
Start predict batch 1 ...
Start update batch 1 ...

Auto-assigning NUTS sampler...
Initializing NUTS using adapt_diag...
Sequential sampling (2 chains in 1 job)
NUTS: [beta4, beta3, beta2, beta1, beta0, alpha]
Sampling 2 chains for 500 tune and 1_000 draw iterations (1_000 + 2_000 draws total) took 11 seconds.
The number of effective samples is smaller than 25% for some parameters.
Auto-assigning NUTS sampler...
Initializing NUTS using adapt_diag...
Sequential sampling (2 chains in 1 job)
NUTS: [beta4, beta3, beta2, beta1, beta0, alpha]
Sampling 2 chains for 500 tune and 1_000 draw iterations (1_000 + 2_000 draws total) took 10 seconds.
The number of effective samples is smaller than 25% for some parameters.
Auto-assigning NUTS sampler...
Initializing NUTS using adapt_diag...
Sequential sampling (2 chains in 1 job)
NUTS: [beta4, beta3, beta2, beta1, beta0, alpha]
Sampling 2 chains for 500 tune and 1_000 draw iterations (1_000 + 2_000 draws total) took 4 seconds.
Iteration #2
Start predict batch 2 ...
Start update batch 2 ...

Auto-assigning NUTS sampler...
Initializing NUTS using adapt_diag...
Sequential sampling (2 chains in 1 job)
NUTS: [beta4, beta3, beta2, beta1, beta0, alpha]
Sampling 2 chains for 500 tune and 1_000 draw iterations (1_000 + 2_000 draws total) took 9 seconds.
Auto-assigning NUTS sampler...
Initializing NUTS using adapt_diag...
Sequential sampling (2 chains in 1 job)
NUTS: [beta4, beta3, beta2, beta1, beta0, alpha]
Sampling 2 chains for 500 tune and 1_000 draw iterations (1_000 + 2_000 draws total) took 5 seconds.
Auto-assigning NUTS sampler...
Initializing NUTS using adapt_diag...
Sequential sampling (2 chains in 1 job)
NUTS: [beta4, beta3, beta2, beta1, beta0, alpha]
Sampling 2 chains for 500 tune and 1_000 draw iterations (1_000 + 2_000 draws total) took 3 seconds.
Iteration #3
Start predict batch 3 ...
Start update batch 3 ...

Auto-assigning NUTS sampler...
Initializing NUTS using adapt_diag...
Sequential sampling (2 chains in 1 job)
NUTS: [beta4, beta3, beta2, beta1, beta0, alpha]
Sampling 2 chains for 500 tune and 1_000 draw iterations (1_000 + 2_000 draws total) took 9 seconds.
The number of effective samples is smaller than 25% for some parameters.
Auto-assigning NUTS sampler...
Initializing NUTS using adapt_diag...
Sequential sampling (2 chains in 1 job)
NUTS: [beta4, beta3, beta2, beta1, beta0, alpha]
Sampling 2 chains for 500 tune and 1_000 draw iterations (1_000 + 2_000 draws total) took 4 seconds.
Auto-assigning NUTS sampler...
Initializing NUTS using adapt_diag...
Sequential sampling (2 chains in 1 job)
NUTS: [beta4, beta3, beta2, beta1, beta0, alpha]
Sampling 2 chains for 500 tune and 1_000 draw iterations (1_000 + 2_000 draws total) took 3 seconds.
Iteration #4
Start predict batch 4 ...
Start update batch 4 ...

Auto-assigning NUTS sampler...
Initializing NUTS using adapt_diag...
Sequential sampling (2 chains in 1 job)
NUTS: [beta4, beta3, beta2, beta1, beta0, alpha]
Sampling 2 chains for 500 tune and 1_000 draw iterations (1_000 + 2_000 draws total) took 4 seconds.
Auto-assigning NUTS sampler...
Initializing NUTS using adapt_diag...
Sequential sampling (2 chains in 1 job)
NUTS: [beta4, beta3, beta2, beta1, beta0, alpha]
Sampling 2 chains for 500 tune and 1_000 draw iterations (1_000 + 2_000 draws total) took 3 seconds.
Auto-assigning NUTS sampler...
Initializing NUTS using adapt_diag...
Sequential sampling (2 chains in 1 job)
NUTS: [beta4, beta3, beta2, beta1, beta0, alpha]
Sampling 2 chains for 500 tune and 1_000 draw iterations (1_000 + 2_000 draws total) took 3 seconds.
Iteration #5
Start predict batch 5 ...
Start update batch 5 ...

Auto-assigning NUTS sampler...
Initializing NUTS using adapt_diag...
Sequential sampling (2 chains in 1 job)
NUTS: [beta4, beta3, beta2, beta1, beta0, alpha]
Sampling 2 chains for 500 tune and 1_000 draw iterations (1_000 + 2_000 draws total) took 3 seconds.
Auto-assigning NUTS sampler...
Initializing NUTS using adapt_diag...
Sequential sampling (2 chains in 1 job)
NUTS: [beta4, beta3, beta2, beta1, beta0, alpha]
Sampling 2 chains for 500 tune and 1_000 draw iterations (1_000 + 2_000 draws total) took 4 seconds.
Auto-assigning NUTS sampler...
Initializing NUTS using adapt_diag...
Sequential sampling (2 chains in 1 job)
NUTS: [beta4, beta3, beta2, beta1, beta0, alpha]
Sampling 2 chains for 500 tune and 1_000 draw iterations (1_000 + 2_000 draws total) took 3 seconds.
Simulation results (first 10 observations):
      action  reward  group  selected_prob_reward  max_prob_reward  regret  \
0  action C     0.0      1                  0.05              0.8    0.75
1  action C     1.0      2                  0.80              0.8    0.00
2  action B     1.0      0                  0.80              0.8    0.00
3  action C     0.0      1                  0.05              0.8    0.75
4  action C     0.0      1                  0.05              0.8    0.75
5  action B     1.0      0                  0.80              0.8    0.00
6  action A     0.0      0                  0.05              0.8    0.75
7  action C     0.0      2                  0.80              0.8    0.00
8  action C     0.0      1                  0.05              0.8    0.75
9  action C     1.0      2                  0.80              0.8    0.00

   cum_regret
0        0.75
1        0.75
2        0.75
3        1.50
4        2.25
5        2.25
6        3.00
7        3.00
8        3.75
9        3.75

Count of actions selected by the bandit:
 {'group 0': {'action B': 85, 'action A': 53, 'action C': 32}, 'group 1': {'action A': 109, 'action C': 31, 'action B': 26}, 'group 2': {'action A': 70, 'action C': 59, 'action B': 35}}

Observed proportion of positive rewards for each action:
 {'group 0': {'action B': 0.788235294117647, 'action A': 0.03773584905660377, 'action C': 0.03125}, 'group 1': {'action A': 0.7981651376146789, 'action B': 0.07692307692307693, 'action C': 0.03225806451612903}, 'group 2': {'action A': 0.7142857142857143, 'action C': 0.8305084745762712, 'action B': 0.02857142857142857}}
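
To visualize how quickly the bandit converges, the cumulative regret from the per-sample results shown above can be plotted. The sketch below assumes the results table is available as a pandas DataFrame, here called results_df (a hypothetical name; check how SimulationCmab exposes its results):

import matplotlib.pyplot as plt

# 'results_df' is a hypothetical handle to the per-sample results table printed above
# (columns: action, reward, group, selected_prob_reward, max_prob_reward, regret, cum_regret)
ax = results_df['cum_regret'].plot()
ax.set_xlabel('sample index')
ax.set_ylabel('cumulative regret')
ax.set_title('cMAB simulation: cumulative regret')
plt.show()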