pybandits.core
pybandits.core.smab
- class pybandits.core.smab.Smab(action_ids, success_priors=None, failure_priors=None, random_seed=None)
Bases: object
Stochastic Multi-Armed Bandit for Bernoulli bandits with Thompson Sampling.
- Parameters
action_ids (List[str]) – List of possible actions
success_priors (Dict[str, int]) – Dictionary containing the prior number of positive feedback (successes) for each action: keys are action IDs and values are success counts. If None, each action’s prior is set to 1 by default. Success counts must be positive integers.
failure_priors (Dict[str, int]) – Dictionary containing the prior number of negative feedback (failures) for each action: keys are action IDs and values are failure counts. If None, each action’s prior is set to 1 by default. Failure counts must be positive integers.
random_seed (int) – Seed for random state. If specified, the model outputs deterministic results.
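A minimal instantiation sketch based on the constructor above; the action names and prior counts are illustrative:

```python
from pybandits.core.smab import Smab

# Two-armed bandit with informative priors: "action A" starts with
# 10 prior successes and 4 prior failures, "action B" with 3 and 7.
smab = Smab(
    action_ids=["action A", "action B"],
    success_priors={"action A": 10, "action B": 3},
    failure_priors={"action A": 4, "action B": 7},
    random_seed=42,
)
```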
- batch_update(batch)
This method updates the SMAB for several action IDs at once, iterating over the batch.
- Parameters
batch (List[dict]) – List of dicts in the form [{‘action_id’: <str>, ‘n_successes’: <int>, ‘n_failures’:<int>}]
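A usage sketch of the batch format above; the feedback counts are illustrative:

```python
from pybandits.core.smab import Smab

smab = Smab(action_ids=["action A", "action B"])

# Apply accumulated feedback for several actions in a single call.
smab.batch_update([
    {"action_id": "action A", "n_successes": 12, "n_failures": 3},
    {"action_id": "action B", "n_successes": 5, "n_failures": 10},
])
```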
- predict(n_samples=1, forbidden_actions=None)
Predict the best actions by randomly drawing samples from a beta distribution for each possible action. The action with the highest value is recommended to the user as the ‘best action’ considering current information. The Beta distributions’ alpha and beta parameters for each action are its associated counts of success and failure, respectively.
- Parameters
n_samples (int) – Number of samples to predict (default 1).
forbidden_actions (List[str]) – List of forbidden actions. If specified, the model discards the forbidden_actions and only considers the remaining allowed_actions. By default, the model considers all actions as allowed_actions. Note that actions = allowed_actions ∪ forbidden_actions.
- Returns
best_actions (list) – The best actions according to the model, i.e. the actions whose distributions gave the greatest sample.
probs (list) – The probabilities to get a positive reward for each action.
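A sketch of drawing several recommendations while excluding one action (default priors are assumed):

```python
from pybandits.core.smab import Smab

smab = Smab(action_ids=["action A", "action B", "action C"], random_seed=42)

# Draw 3 recommendations, never proposing "action C".
best_actions, probs = smab.predict(n_samples=3, forbidden_actions=["action C"])
# best_actions: one recommended action per sample, e.g. ['action A', 'action B', 'action A']
# probs: probabilities of a positive reward for each action (see Returns above)
```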
- update(action_id, n_successes, n_failures)
This method updates the SMAB with feedback for a given action. The action’s associated success (resp. failure) counter is incremented by the number of successes (resp. failures) received.
- Parameters
action_id (str) – The ID of the action to update.
n_successes (int) – The number of successes received for action_id.
n_failures (int) – The number of failures received for action_id.
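A sketch of one recommend-then-update round; the observed feedback counts are placeholders for real user feedback:

```python
from pybandits.core.smab import Smab

smab = Smab(action_ids=["action A", "action B"], random_seed=42)

# Recommend an action, collect feedback, then feed it back to the bandit.
best_actions, _ = smab.predict(n_samples=1)
chosen = best_actions[0]

# Suppose the chosen action received 7 positive and 3 negative feedbacks.
smab.update(action_id=chosen, n_successes=7, n_failures=3)
```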
pybandits.core.cmab
- class pybandits.core.cmab.Cmab(n_features, actions_ids, params_sample=None, n_jobs=1, mu_alpha=None, mu_betas=None, sigma_alpha=None, sigma_betas=None, nu_alpha=None, nu_betas=None, random_seed=None)
Bases: object
Contextual Multi-Armed Bandit with binary rewards. It is based on Thompson Sampling with Bayesian logistic regression. It assumes a prior distribution over the parameters and computes the posterior reward distributions by applying Bayes’ theorem via Markov Chain Monte Carlo (MCMC) simulation.
- Parameters
n_features (int) – The number of contextual features.
actions_ids (list of strings with length = n_actions) – List of action names.
params_sample (dict) – Sampling parameters for pm.sample function from pymc3.
n_jobs (int) – The number of jobs to run in parallel. If n_jobs > 1, both the update() and predict() functions will be run with parallelization via the multiprocessing package.
mu_alpha (dict) – Mu (location) parameters for alpha prior Student’s t distribution. By default all mu=0. The keys of the dict must be the actions_ids, and the values are floats. e.g. mu={‘action1’: 0., ‘action2’: 0.} with n_actions=2.
mu_betas (dict) – Mu (location) parameters for betas prior Student’s t distributions. By default all mu=0. The keys of the dict must be the actions_ids, and the values are lists of floats with length=n_features. e.g. mu={‘action1’: [0., 0., 0.], ‘action2’: [0., 0., 0.]} with n_actions=2 and n_features=3.
sigma_alpha (dict) – Sigma (scale) parameters for alpha prior Student’s t distribution. By default all sigma=10. The keys of the dict must be the actions_ids, and the values are floats. e.g. sigma={‘action1’: 10., ‘action2’: 10.} with n_actions=2.
sigma_betas (dict) – Sigma (scale) parameters for betas prior Student’s t distributions. By default all sigma=10. The keys of the dict must be the actions_ids, and the values are lists of floats with length=n_features. e.g. sigma={‘action1’: [10., 10., 10.], ‘action2’: [10., 10., 10.]} with n_actions=2 and n_features=3.
nu_alpha (dict) – Nu (normality) parameters for alpha prior Student’s t distribution. By default all nu=5. The keys of the dict must be the actions_ids, and the values are floats. e.g. nu={‘action1’: 5., ‘action2’: 5.} with n_actions=2.
nu_betas (dict) – Nu (normality) parameters for betas prior Student’s t distributions. By default all nu=5. The keys of the dict must be the actions_ids, and the values are lists of floats with length=n_features. e.g. nu={‘action1’: [5., 5., 5.], ‘action2’: [5., 5., 5.]} with n_actions=2 and n_features=3.
random_seed (int) – Seed for random state. If specified, the model outputs deterministic results.
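A minimal instantiation sketch based on the constructor above; the prior locations shown simply restate the defaults to illustrate the expected dict format:

```python
from pybandits.core.cmab import Cmab

# Bandit over 2 actions with 3 contextual features.
cmab = Cmab(
    n_features=3,
    actions_ids=["action A", "action B"],
    mu_alpha={"action A": 0.0, "action B": 0.0},                          # default locations
    mu_betas={"action A": [0.0, 0.0, 0.0], "action B": [0.0, 0.0, 0.0]},  # one value per feature
    random_seed=42,
)
```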
- fast_predict(X)
Compute the posterior reward probability for all actions by sampling coefficients from the Student’s t distributions, as a faster real-time alternative to fast_sample_posterior_predictive. It returns the action with the highest probability.
- Parameters
X (array_like of shape (n_samples, n_features)) – Matrix with contextual features.
- Returns
best_actions (list of len (n_samples)) – Best action for each sample, i.e. the action with the highest probability of a positive reward.
probs (array_like of shape (n_actions, n_samples)) – Reward probability for each action-sample pair
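A usage sketch with a random context matrix; shapes follow the Parameters above:

```python
import numpy as np
from pybandits.core.cmab import Cmab

cmab = Cmab(n_features=3, actions_ids=["action A", "action B"], random_seed=42)

X = np.random.randn(4, 3)                  # 4 samples, 3 contextual features
best_actions, probs = cmab.fast_predict(X)
# best_actions: one recommended action per sample (list of length 4)
# probs: reward probability for each action-sample pair, shape (n_actions, n_samples)
```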
- predict(X)
Generate posterior predictive probability from a model given the trace and the context X. It returns the action with the highest probability. If n_jobs > 1, the prediction for each model action will be run in parallel.
- Parameters
X (array_like of shape (n_samples, n_features)) – Matrix with contextual features.
- Returns
best_actions (list of len (n_samples)) – Best action for each sample, i.e. the action with the highest probability of a positive reward.
probs (array_like of shape (n_actions, n_samples)) – Reward probability for each action-sample pair
- update(X, actions, rewards)
Update the internal state of the models. Compute posterior distributions using the new data set (actions, rewards and context). If n_jobs > 1, the model of each action will be updated in parallel.
- Parameters
X (array_like of shape (n_samples, n_features)) – Matrix with contextual features.
actions (array_like of shape (n_samples,)) – Array of recommended actions, one per sample.
rewards (array_like of shape (n_samples,)) – Array of binary rewards (0 or 1), one per sample.
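A sketch of one update/predict cycle; the rewards here are random stand-ins for real binary feedback:

```python
import numpy as np
from pybandits.core.cmab import Cmab

cmab = Cmab(n_features=3, actions_ids=["action A", "action B"], random_seed=42)

# A batch of 100 observed interactions: context, recommended action, binary reward.
X = np.random.randn(100, 3)
best_actions, _ = cmab.fast_predict(X)      # recommendations for the batch
rewards = np.random.randint(2, size=100)    # stand-in for observed feedback

# Refit the per-action posteriors via MCMC.
cmab.update(X, actions=best_actions, rewards=rewards)

# Posterior predictive recommendations given the updated trace.
best_actions, probs = cmab.predict(X)
```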
- pybandits.core.cmab.check_context_matrix(context, n_features)
Check context matrix
- Parameters
context (array_like of shape (n_samples, n_features)) – Matrix with contextual features.
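A minimal sketch; it is assumed that the function raises an error when the matrix is inconsistent with n_features:

```python
import numpy as np
from pybandits.core.cmab import check_context_matrix

X = np.random.randn(10, 3)
check_context_matrix(X, n_features=3)   # passes when the shape is consistent
```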
pybandits.utils
pybandits.utils.simulation_smab
- class pybandits.utils.simulation_smab.SimulationSmab(smab, n_updates=10, batch_size=100, probs_reward=None, save=False, path='', random_seed=None, verbose=False)
Bases: object
Simulate environment for stochastic multi-armed bandits.
This class performs simulation of stochastic Multi-Armed Bandits (sMAB). Data are processed in batches of size n>=1. For each batch of simulated samples, the sMAB selects one action and collects the corresponding simulated reward for each sample. Then, prior parameters are updated based on the rewards returned for the recommended actions.
- Parameters
smab (pybandits.core.smab.Smab) – Stochastic multi-armed bandit model.
n_updates (int, default=10) – The number of updates (i.e. batches of samples) in the simulation.
batch_size (int, default=100) – The number of samples per batch.
probs_reward (dict, default=None) – The reward probability for the different actions. If None, probabilities are set to 0.5. The keys of the dict must match the smab action_ids, and the values are floats in the interval [0, 1]. e.g. probs_reward={‘action A’: 0.6, ‘action B’: 0.8, ‘action C’: 1.}
save (bool, default=False) – Boolean flag to save the results.
path (string, default='') – Path where results are saved if save=True
random_seed (int, default=None) – Seed for random state. If specified, the model outputs deterministic results.
verbose (bool, default=False) – Enable verbose output. If True, detailed logging information about the simulation is provided.
- get_count_selected_actions()
Get the count of actions selected by the bandit at the end of the process.
- Returns
Dictionary with keys=action_ids and values=count of recommended actions.
- Return type
dict
- get_cumulative_proportions()
Get (i) the cumulative action proportions and (ii) the cumulative reward proportions per action.
- Returns
Dictionary with keys=(actions, reward) and values=(cumulative action proportions, cumulative reward proportions per action)
- Return type
dict
- get_proportion_positive_reward()
Get the observed proportion of positive rewards for each action at the end of the simulation process.
- Returns
Dictionary with keys=action_ids and values=proportion of positive rewards for each action.
- Return type
dict
- run()
Start the simulation process. It consists of the following steps:
- for i=0 to n_updates
Consider batch[i] of observations.
The sMAB selects the best action, i.e. the action with the highest sampled reward probability, for each sample in batch[i].
Rewards are returned for each recommended action.
Prior parameters are updated based on the recommended actions and the returned rewards.
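An end-to-end sketch, from constructing the simulator to reading the summary dictionaries; the action names and reward probabilities are illustrative:

```python
from pybandits.core.smab import Smab
from pybandits.utils.simulation_smab import SimulationSmab

smab = Smab(action_ids=["action A", "action B", "action C"])
sim = SimulationSmab(
    smab=smab,
    n_updates=20,
    batch_size=50,
    probs_reward={"action A": 0.6, "action B": 0.8, "action C": 1.0},
    random_seed=42,
)

sim.run()

counts = sim.get_count_selected_actions()          # {action_id: count of recommendations}
positives = sim.get_proportion_positive_reward()   # {action_id: proportion of positive rewards}
cumulative = sim.get_cumulative_proportions()      # {'actions': ..., 'reward': ...}
```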
pybandits.utils.simulation_cmab
- class pybandits.utils.simulation_cmab.SimulationCmab(cmab, X, batch_size=100, n_updates=10, group=None, prob_rewards=None, save=False, path='', random_seed=None, verbose=False)
Bases: object
Simulate environment for contextual multi-armed bandit models.
This class simulates the information required by the contextual bandit. Generated data are processed by the bandit in batches of size n>=1. For each batch of samples, actions are recommended by the bandit and the corresponding simulated rewards are collected. Bandit policy parameters are then updated based on the rewards returned for the recommended actions.
- Parameters
cmab (pybandits.core.cmab.Cmab) – Contextual multi-armed bandit model
X (array_like of shape (n_samples, n_features)) – Context matrix of sample features.
batch_size (int, default=100) – The number of samples per batch.
n_updates (int, default=10) – The number of updates in the simulation.
group (list of int with length=n_samples) – Group to which each sample belongs. Samples that belong to the same group have features drawn from the same distribution and share the same probability of receiving positive/negative feedback from each action.
prob_rewards (pd.DataFrame of shape (n_groups, n_actions)) – Matrix of positive reward probabilities for each group-action combination. If None, all probabilities are set to 0.5.
save (bool, default=False) – Boolean flag to save the results.
path (string, default='') – Path where results are saved if save=True
verbose (bool, default=False) – Enable verbose output. If True, detailed logging information about the simulation is produced.
random_seed (int, default=None) – Seed for random state. If specified, the model outputs deterministic results.
- get_count_selected_actions()
Get the proportion of recommended actions per group at the end of the process.
- Returns
df – Matrix of the proportion of recommended actions per group.
- Return type
pandas DataFrame
- get_cumulative_proportions(path='')
- Plot the results of the simulation. It creates two plots per group, which display:
the cumulative proportion of actions,
the cumulative proportion of rewards.
- Parameters
path (str, default='') – Path in which plot figures are saved.
- get_proportion_positive_reward()
Get the proportion of positive rewards per group/action at the end of the process.
- Returns
df – Matrix of the proportion of positive rewards per group/action.
- Return type
pandas DataFrame
- run()
Start the simulation process. It consists of the following steps:
- for i=0 to n_updates
Extract batch[i] of samples from X.
The model recommends the best action, i.e. the action with the highest reward probability, for each simulated sample in batch[i], and the corresponding simulated rewards are collected.
Model priors are updated using the recommended actions and the returned rewards.
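An end-to-end sketch with two groups of simulated samples; the group assignments, reward probabilities, and the use of action IDs as column labels of prob_rewards are illustrative assumptions:

```python
import numpy as np
import pandas as pd
from pybandits.core.cmab import Cmab
from pybandits.utils.simulation_cmab import SimulationCmab

n_samples, n_features = 1000, 3
actions_ids = ["action A", "action B"]

X = np.random.randn(n_samples, n_features)           # context matrix
group = list(np.random.randint(2, size=n_samples))   # group label per sample (2 groups)
prob_rewards = pd.DataFrame(                          # shape (n_groups, n_actions)
    [[0.1, 0.9],
     [0.8, 0.2]],
    columns=actions_ids,
)

cmab = Cmab(n_features=n_features, actions_ids=actions_ids, random_seed=42)
sim = SimulationCmab(
    cmab=cmab, X=X, batch_size=100, n_updates=10,
    group=group, prob_rewards=prob_rewards, random_seed=42,
)

sim.run()
selected = sim.get_count_selected_actions()        # proportion of recommended actions per group
positives = sim.get_proportion_positive_reward()   # proportion of positive rewards per group/action
```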