Introduction

This notebook demonstrates the use of offline policy evaluation (OPE) for multi-armed bandits (MABs).

Objectives

Evaluation: evaluate the performance of a MAB using multiple offline policy estimators.

[1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

from pybandits.cmab import CmabBernoulliCC
from pybandits.offline_policy_evaluator import OfflinePolicyEvaluator

%load_ext autoreload
%autoreload 2

Generate data

We first generate a synthetic logged data set. Each record consists of a batch index, a randomly selected action together with its cost and propensity score, a binary reward and the action's true expected reward, a group assignment, and a three-dimensional context vector. The evaluator later splits this data set into training and test portions (via split_prop).

[2]:
n_samples = 1000
n_actions = 2
n_batches = 3
n_rewards = 1
n_groups = 2
n_features = 3
[3]:
unique_actions = [f"a{i}" for i in range(n_actions)]
action_ids = np.random.choice(unique_actions, n_samples * n_batches)
batches = [i for i in range(n_batches) for _ in range(n_samples)]
rewards = [np.random.randint(2, size=(n_samples * n_batches)) for _ in range(n_rewards)]
action_true_rewards = {(a, r): np.random.rand() for a in unique_actions for r in range(n_rewards)}
true_rewards = [
    np.array([action_true_rewards[(a, r)] for a in action_ids]).reshape(n_samples * n_batches) for r in range(n_rewards)
]
groups = np.random.randint(n_groups, size=n_samples * n_batches)
action_costs = {action: np.random.rand() for action in unique_actions}
costs = np.array([action_costs[a] for a in action_ids])
context = np.random.rand(n_samples * n_batches, n_features)
action_propensity_score = {action: np.random.rand() for action in unique_actions}
propensity_score = np.array([action_propensity_score[a] for a in action_ids])
df = pd.DataFrame(
    {
        "batch": batches,
        "action_id": action_ids,
        "cost": costs,
        "group": groups,
        **{f"reward_{r}": rewards[r] for r in range(n_rewards)},
        **{f"true_reward_{r}": true_rewards[r] for r in range(n_rewards)},
        **{f"context_{i}": context[:, i] for i in range(n_features)},
        "propensity_score": propensity_score,
    }
)
contextual_features = [col for col in df.columns if col.startswith("context")]
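
As a quick, purely illustrative sanity check (plain pandas, not part of the PyBandits API), we can summarize the logged data per action to see the empirical reward rates, costs, and propensity scores:

[ ]:
# Hypothetical inspection cell: per-action summary of the simulated logged data.
df.groupby("action_id")[["reward_0", "true_reward_0", "cost", "propensity_score"]].mean()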

Generate model

Using the cold_start method of CmabBernoulliCC, we can create a model to be used for offline policy evaluation.

[4]:
action_ids_cost = {action_id: df["cost"][df["action_id"] == action_id].iloc[0] for action_id in unique_actions}

mab = CmabBernoulliCC.cold_start(action_ids_cost=action_ids_cost, n_features=len(contextual_features))

OPE

Given the model and the logged data collected by the logging policy, we can either evaluate the model directly on that data (evaluate), or first update the model with the logged data and then evaluate it (update_and_evaluate).

[5]:
evaluator = OfflinePolicyEvaluator(
    logged_data=df,
    split_prop=0.5,
    n_trials=10,
    fast_fit=True,
    scaler=MinMaxScaler(),
    ope_estimators=None,
    verbose=True,
    propensity_score_model_type="batch_empirical",
    expected_reward_model_type="gbm",
    importance_weights_model_type="logreg",
    batch_feature="batch",
    action_feature="action_id",
    reward_feature="reward_0",
    true_reward_feature="true_reward_0",
    contextual_features=contextual_features,
    group_feature="group",
    cost_feature="cost",
    propensity_score_feature="propensity_score",
)
100%|██████████| 2/2 [00:00<00:00, 308.56it/s]
2025-07-16 10:43:11.795 | INFO     | pybandits.offline_policy_evaluator:_estimate_propensity_score:730 - Data batch-empirical estimation of propensity score.
2025-07-16 10:43:11.803 | INFO     | pybandits.offline_policy_evaluator:_estimate_expected_reward:780 - Data prediction of expected reward based on gbm model.
[6]:
evaluator.evaluate(mab=mab, visualize=True, n_mc_experiments=1000)
2025-07-16 10:43:12.147 | INFO     | pybandits.offline_policy_evaluator:estimate_policy:876 - Data prediction of expected policy based on Monte Carlo experiments.
1000it [00:16, 61.82it/s]
2025-07-16 10:43:28.531 | INFO     | pybandits.offline_policy_evaluator:_estimate_importance_weight:819 - Data prediction of importance weights based on logreg model.
2025-07-16 10:43:28.610 | INFO     | pybandits.offline_policy_evaluator:evaluate:949 - Offline Policy Evaluation for reward_0.
/home/runner/.cache/pypoetry/virtualenvs/pybandits-vYJB-miV-py3.10/lib/python3.10/site-packages/scipy/stats/_resampling.py:147: RuntimeWarning: invalid value encountered in scalar divide
  a_hat = 1/6 * sum(nums) / sum(dens)**(3/2)
/home/runner/work/pybandits/pybandits/pybandits/offline_policy_estimator.py:116: DegenerateDataWarning: The BCa confidence interval cannot be calculated. This problem is known to occur when the distribution is degenerate or the statistic is np.min.
  bootstrap_result = bootstrap(
[Interactive Bokeh visualization of the estimates rendered here]
[6]:
value lower upper std estimator objective
0 0.503952 0.471031 0.537714 0.016862 b-ipw reward_0
1 0.504058 0.498670 0.509589 0.002801 dm reward_0
2 0.507871 0.475852 0.539793 0.016324 dr reward_0
3 0.504058 0.498674 0.509451 0.002757 dros-opt reward_0
4 0.507871 0.474337 0.539798 0.016360 dros-pess reward_0
5 0.507603 0.474846 0.541772 0.017069 ipw reward_0
6 0.000000 NaN NaN 0.000000 rep reward_0
7 0.507869 0.476486 0.540600 0.016358 sndr reward_0
8 0.507282 0.473912 0.540733 0.017050 snips reward_0
9 0.507871 0.475961 0.540780 0.016357 sg-dr reward_0
10 0.507603 0.474857 0.540881 0.016949 sg-ipw reward_0
11 0.504058 0.498776 0.509534 0.002775 switch-dr reward_0
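The output labelled [6] suggests that evaluate returns the results as a pandas DataFrame, one row per estimator. As a purely illustrative post-processing step using plain pandas (not part of the PyBandits API), we could rank the estimators by the width of their bootstrap confidence intervals; the variable name results below is hypothetical:

[ ]:
# Hypothetical post-processing, assuming the table above was stored, e.g.
#   results = evaluator.evaluate(mab=mab, visualize=True, n_mc_experiments=1000)
results["ci_width"] = results["upper"] - results["lower"]  # width of the bootstrap CI per estimator
results.sort_values("ci_width")[["estimator", "value", "lower", "upper", "ci_width"]]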
[7]:
evaluator.update_and_evaluate(mab=mab, visualize=True, n_mc_experiments=1000)
2025-07-16 10:43:29.805 | INFO     | pybandits.offline_policy_evaluator:_update_mab:1028 - Offline policy update for <class 'pybandits.cmab.CmabBernoulliCC'>.
2025-07-16 10:43:46.819 | INFO     | pybandits.offline_policy_evaluator:estimate_policy:876 - Data prediction of expected policy based on Monte Carlo experiments.
1000it [00:18, 52.84it/s]
2025-07-16 10:44:05.967 | INFO     | pybandits.offline_policy_evaluator:_estimate_importance_weight:819 - Data prediction of importance weights based on logreg model.
2025-07-16 10:44:06.056 | INFO     | pybandits.offline_policy_evaluator:evaluate:949 - Offline Policy Evaluation for reward_0.
[Interactive Bokeh visualization of the estimates rendered here]
[7]:
value lower upper std estimator objective
0 0.505496 0.454257 0.560382 0.026886 b-ipw reward_0
1 0.504393 0.498935 0.509760 0.002780 dm reward_0
2 0.513988 0.467177 0.558131 0.022825 dr reward_0
3 0.504393 0.498910 0.509909 0.002810 dros-opt reward_0
4 0.513988 0.469365 0.557159 0.022540 dros-pess reward_0
5 0.511271 0.458302 0.565420 0.027439 ipw reward_0
6 0.493827 0.358025 0.666268 0.076405 rep reward_0
7 0.514003 0.471214 0.558653 0.022219 sndr reward_0
8 0.512051 0.460596 0.567523 0.027174 snips reward_0
9 0.513988 0.469818 0.558791 0.022583 sg-dr reward_0
10 0.511271 0.459634 0.567058 0.027439 sg-ipw reward_0
11 0.504393 0.498951 0.510055 0.002840 switch-dr reward_0
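
To compare the cold-start evaluation with the post-update evaluation side by side, the two result tables can be merged on the estimator name. This is a minimal sketch, assuming both outputs above were stored as pandas DataFrames; the variable names results_cold and results_updated are hypothetical:

[ ]:
# Hypothetical comparison, assuming the two result tables were stored, e.g.
#   results_cold = evaluator.evaluate(mab=mab, visualize=True, n_mc_experiments=1000)
#   results_updated = evaluator.update_and_evaluate(mab=mab, visualize=True, n_mc_experiments=1000)
comparison = results_cold.merge(
    results_updated, on=["estimator", "objective"], suffixes=("_cold", "_updated")
)
comparison["delta"] = comparison["value_updated"] - comparison["value_cold"]
comparison[["estimator", "value_cold", "value_updated", "delta"]].sort_values("delta")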