Introduction

This notebook demonstrates the use of offline policy evaluation (OPE) for multi-armed bandits (MABs).

Objectives

Evaluation:

Evaluate the performance of a MAB using multiple offline policy estimators.

[1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

from pybandits.cmab import CmabBernoulliCC
from pybandits.offline_policy_evaluator import OfflinePolicyEvaluator

%load_ext autoreload
%autoreload 2

Generate data

We first generate a synthetic logged data set. For each sample we record the batch index, the action chosen by the logging policy, its cost, a group label, the observed binary reward, the action's true expected reward, a three-dimensional context vector, and the logging policy's propensity score. The data set is later split into training and test portions by the evaluator (via the split_prop argument).

[2]:
n_samples = 1000
n_actions = 2
n_batches = 3
n_rewards = 1
n_groups = 2
n_features = 3
[3]:
# Actions taken by the logging policy, spread uniformly over the batches
unique_actions = [f"a{i}" for i in range(n_actions)]
action_ids = np.random.choice(unique_actions, n_samples * n_batches)
batches = [i for i in range(n_batches) for _ in range(n_samples)]
# Observed binary rewards (random) and the per-action true expected rewards
rewards = [np.random.randint(2, size=(n_samples * n_batches)) for _ in range(n_rewards)]
action_true_rewards = {(a, r): np.random.rand() for a in unique_actions for r in range(n_rewards)}
true_rewards = [
    np.array([action_true_rewards[(a, r)] for a in action_ids]).reshape(n_samples * n_batches) for r in range(n_rewards)
]
# Group labels, per-action costs, contextual features, and logging-policy propensity scores
groups = np.random.randint(n_groups, size=n_samples * n_batches)
action_costs = {action: np.random.rand() for action in unique_actions}
costs = np.array([action_costs[a] for a in action_ids])
context = np.random.rand(n_samples * n_batches, n_features)
action_propensity_score = {action: np.random.rand() for action in unique_actions}
propensity_score = np.array([action_propensity_score[a] for a in action_ids])
# Assemble the logged data into a single DataFrame
df = pd.DataFrame(
    {
        "batch": batches,
        "action_id": action_ids,
        "cost": costs,
        "group": groups,
        **{f"reward_{r}": rewards[r] for r in range(n_rewards)},
        **{f"true_reward_{r}": true_rewards[r] for r in range(n_rewards)},
        **{f"context_{i}": context[:, i] for i in range(n_features)},
        "propensity_score": propensity_score,
    }
)
contextual_features = [col for col in df.columns if col.startswith("context")]
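
As an optional, illustrative sanity check (not part of the original notebook), we can confirm that the logged data has the expected shape and that each action appears in every batch with a reward rate close to 0.5, since the rewards above are drawn uniformly at random:

# Illustrative check of the logged data set generated above
print(df.shape)  # (n_samples * n_batches, number of logged columns)
print(df.groupby(["batch", "action_id"])["reward_0"].agg(["mean", "count"]))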

Generate Model

Using the cold_start method of CmabBernoulliCC, we can create a model to be used for offline policy evaluation.

[4]:
# Map each action to its cost, as logged in the data
action_ids_cost = {action_id: df["cost"][df["action_id"] == action_id].iloc[0] for action_id in unique_actions}

mab = CmabBernoulliCC.cold_start(action_ids_cost=action_ids_cost, n_features=len(contextual_features))
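
Before any update, the cold-start bandit has no information beyond the action costs, so its recommendations on the logged contexts are essentially uninformed. As a hedged sketch, assuming the model exposes the usual pybandits predict(context=...) interface whose first return element holds the recommended action per sample, we could inspect how it would act on the logged contexts:

# Hedged sketch: how the cold-start model would act on the logged contexts.
# Assumes predict(context=...) returns the recommended actions as its first element.
recommended_actions = mab.predict(context=df[contextual_features].values)[0]
print(pd.Series(recommended_actions).value_counts())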

OPE

Given the model and the logged data produced by the logging policy, we can either evaluate the model directly, or first update it with the logging policy data and then evaluate it.

[5]:
evaluator = OfflinePolicyEvaluator(
    logged_data=df,
    split_prop=0.5,
    n_trials=10,
    fast_fit=True,
    scaler=MinMaxScaler(),
    ope_estimators=None,
    verbose=True,
    propensity_score_model_type="batch_empirical",
    expected_reward_model_type="gbm",
    importance_weights_model_type="logreg",
    batch_feature="batch",
    action_feature="action_id",
    reward_feature="reward_0",
    true_reward_feature="true_reward_0",
    contextual_features=contextual_features,
    group_feature="group",
    cost_feature="cost",
    propensity_score_feature="propensity_score",
)
100%|██████████| 2/2 [00:00<00:00, 290.68it/s]
2025-08-14 08:29:30.635 | INFO     | pybandits.offline_policy_evaluator:_estimate_propensity_score:752 - Data batch-empirical estimation of propensity score.
2025-08-14 08:29:30.643 | INFO     | pybandits.offline_policy_evaluator:_estimate_expected_reward:802 - Data prediction of expected reward based on gbm model.
[6]:
evaluator.evaluate(mab=mab, visualize=True, n_mc_experiments=1000)
2025-08-14 08:29:30.953 | INFO     | pybandits.offline_policy_evaluator:estimate_policy:898 - Data prediction of expected policy based on Monte Carlo experiments.
1000it [00:17, 56.47it/s]
2025-08-14 08:29:48.870 | INFO     | pybandits.offline_policy_evaluator:_estimate_importance_weight:841 - Data prediction of importance weights based on logreg model.
2025-08-14 08:29:48.955 | INFO     | pybandits.offline_policy_evaluator:evaluate:971 - Offline Policy Evaluation for reward_0.
/home/runner/.cache/pypoetry/virtualenvs/pybandits-vYJB-miV-py3.10/lib/python3.10/site-packages/scipy/stats/_resampling.py:147: RuntimeWarning: invalid value encountered in scalar divide
  a_hat = 1/6 * sum(nums) / sum(dens)**(3/2)
/home/runner/work/pybandits/pybandits/pybandits/offline_policy_estimator.py:138: DegenerateDataWarning: The BCa confidence interval cannot be calculated. This problem is known to occur when the distribution is degenerate or the statistic is np.min.
  bootstrap_result = bootstrap(
[Bokeh visualization displayed here]
[6]:
value lower upper std estimator objective
0 0.513582 0.481212 0.546771 0.016754 b-ipw reward_0
1 0.502123 0.496071 0.508091 0.003047 dm reward_0
2 0.509824 0.477889 0.542222 0.016245 dr reward_0
3 0.502123 0.496112 0.508018 0.003031 dros-opt reward_0
4 0.509824 0.478415 0.541828 0.016184 dros-pess reward_0
5 0.511227 0.478210 0.543671 0.016749 ipw reward_0
6 0.000000 NaN NaN 0.000000 rep reward_0
7 0.509824 0.478482 0.541688 0.016305 sndr reward_0
8 0.511211 0.478581 0.544878 0.016854 snips reward_0
9 0.509824 0.478829 0.543385 0.016579 sg-dr reward_0
10 0.511227 0.477929 0.544147 0.016723 sg-ipw reward_0
11 0.502123 0.496215 0.508104 0.003033 switch-dr reward_0
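
Since the cell output above is a pandas DataFrame, the estimates can also be consumed programmatically rather than read off the table. A minimal sketch, assuming evaluate returns the frame shown above and that the Bokeh plot can be switched off via visualize:

# Hedged sketch: access the OPE results programmatically.
# Assumes `evaluate` returns the DataFrame shown above; `visualize=False` skips the plot.
results = evaluator.evaluate(mab=mab, visualize=False, n_mc_experiments=1000)
dr_row = results.set_index("estimator").loc["dr"]  # e.g. the doubly-robust estimate
print(dr_row[["value", "lower", "upper", "std"]])
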
[7]:
evaluator.update_and_evaluate(mab=mab, visualize=True, n_mc_experiments=1000)
2025-08-14 08:29:50.156 | INFO     | pybandits.offline_policy_evaluator:_update_mab:1050 - Offline policy update for <class 'pybandits.cmab.CmabBernoulliCC'>.
2025-08-14 08:29:57.570 | INFO     | pybandits.offline_policy_evaluator:estimate_policy:898 - Data prediction of expected policy based on Monte Carlo experiments.
1000it [00:20, 48.55it/s]
2025-08-14 08:30:18.388 | INFO     | pybandits.offline_policy_evaluator:_estimate_importance_weight:841 - Data prediction of importance weights based on logreg model.
2025-08-14 08:30:18.483 | INFO     | pybandits.offline_policy_evaluator:evaluate:971 - Offline Policy Evaluation for reward_0.
[Bokeh visualization displayed here]
[7]:
value lower upper std estimator objective
0 0.535735 0.483194 0.590617 0.027608 b-ipw reward_0
1 0.502773 0.496870 0.508893 0.003065 dm reward_0
2 0.506046 0.462949 0.547684 0.021663 dr reward_0
3 0.502773 0.496869 0.508908 0.003065 dros-opt reward_0
4 0.506046 0.464324 0.549528 0.021540 dros-pess reward_0
5 0.515309 0.463361 0.567612 0.026450 ipw reward_0
6 0.502907 0.436047 0.572674 0.034881 rep reward_0
7 0.506010 0.463446 0.547685 0.021405 sndr reward_0
8 0.509588 0.459913 0.562377 0.026216 snips reward_0
9 0.506046 0.463748 0.550405 0.022042 sg-dr reward_0
10 0.515309 0.464383 0.568361 0.026658 sg-ipw reward_0
11 0.502773 0.496745 0.508664 0.003049 switch-dr reward_0
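
To see how updating the bandit with the logged data changes the estimates, the two result frames can be merged on the estimator name. This is an illustrative sketch under the same assumption that both calls return the DataFrames shown above (note that calling update_and_evaluate again updates the bandit a second time):

# Hedged sketch: compare estimates before and after the offline update.
results_cold = evaluator.evaluate(mab=mab, visualize=False, n_mc_experiments=1000)
results_updated = evaluator.update_and_evaluate(mab=mab, visualize=False, n_mc_experiments=1000)
comparison = results_cold.merge(results_updated, on=["estimator", "objective"], suffixes=("_cold", "_updated"))
print(comparison[["estimator", "value_cold", "value_updated"]])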