Offline Policy Evaluation

Introduction

This notebook demonstrates the use of offline policy evaluation (OPE) for multi-armed bandits (MABs).

Objectives

Evaluation:

Evaluate the performance of a MAB using multiple offline policy estimators.

[1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

from pybandits.cmab import CmabBernoulliCC
from pybandits.offline_policy_evaluator import OfflinePolicyEvaluator

%load_ext autoreload
%autoreload 2

Generate data

We first generate a synthetic logged data set produced by a logging policy. Each sample contains a batch index, the selected action and its cost, a group label, the observed binary reward, the action's true expected reward, the contextual features, and the logging policy's propensity score for the selected action. The OfflinePolicyEvaluator will later split this data internally (via split_prop) into a training set and a test set.

[2]:
n_samples = 1000
n_actions = 2
n_batches = 3
n_rewards = 1
n_groups = 2
n_features = 3
[3]:
unique_actions = [f"a{i}" for i in range(n_actions)]
action_ids = np.random.choice(unique_actions, n_samples * n_batches)
batches = [i for i in range(n_batches) for _ in range(n_samples)]
rewards = [np.random.randint(2, size=(n_samples * n_batches)) for _ in range(n_rewards)]
action_true_rewards = {(a, r): np.random.rand() for a in unique_actions for r in range(n_rewards)}
true_rewards = [
    np.array([action_true_rewards[(a, r)] for a in action_ids]).reshape(n_samples * n_batches) for r in range(n_rewards)
]
groups = np.random.randint(n_groups, size=n_samples * n_batches)
action_costs = {action: np.random.rand() for action in unique_actions}
costs = np.array([action_costs[a] for a in action_ids])
context = np.random.rand(n_samples * n_batches, n_features)
action_propensity_score = {action: np.random.rand() for action in unique_actions}
propensity_score = np.array([action_propensity_score[a] for a in action_ids])
df = pd.DataFrame(
    {
        "batch": batches,
        "action_id": action_ids,
        "cost": costs,
        "group": groups,
        **{f"reward_{r}": rewards[r] for r in range(n_rewards)},
        **{f"true_reward_{r}": true_rewards[r] for r in range(n_rewards)},
        **{f"context_{i}": context[:, i] for i in range(n_features)},
        "propensity_score": propensity_score,
    }
)
contextual_features = [col for col in df.columns if col.startswith("context")]
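
As an optional sanity check (not part of the original notebook), the schema of the logged data can be inspected before it is passed to the evaluator:

# Illustrative inspection of the synthetic logged data.
# Each row is one logged interaction: batch index, chosen action and its cost, group label,
# observed binary reward, the action's true expected reward, the context vector,
# and the logging policy's propensity score for the chosen action.
print(df.shape)  # (n_samples * n_batches, 10) -> (3000, 10)
print(df["action_id"].value_counts())
print(df.head())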

Generate Model

Using the cold_start method of CmabBernoulliCC, we can create a model to be used for offline policy evaluation.

[4]:
action_ids_cost = {action_id: df["cost"][df["action_id"] == action_id].iloc[0] for action_id in unique_actions}

mab = CmabBernoulliCC.cold_start(action_ids_cost=action_ids_cost, n_features=len(contextual_features))
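
If desired, the per-action costs passed to cold_start can be printed to confirm what the cost-control bandit was initialized with (purely illustrative; the values are random per run):

print(action_ids_cost)  # e.g. {'a0': ..., 'a1': ...}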

OPE

Given the model and the OPE data collected by the logging policy, we can either evaluate the model as-is on the logged data (evaluate), or first update it with the logged data and then evaluate it (update_and_evaluate).

[5]:
evaluator = OfflinePolicyEvaluator(
    logged_data=df,
    split_prop=0.5,
    n_trials=10,
    fast_fit=True,
    scaler=MinMaxScaler(),
    ope_estimators=None,
    verbose=True,
    propensity_score_model_type="batch_empirical",
    expected_reward_model_type="gbm",
    importance_weights_model_type="logreg",
    batch_feature="batch",
    action_feature="action_id",
    reward_feature="reward_0",
    true_reward_feature="true_reward_0",
    contextual_features=contextual_features,
    group_feature="group",
    cost_feature="cost",
    propensity_score_feature="propensity_score",
)
100%|██████████| 2/2 [00:00<00:00, 271.41it/s]
2025-12-26 12:35:34.561 | INFO     | pybandits.offline_policy_evaluator:_estimate_propensity_score:752 - Data batch-empirical estimation of propensity score.
2025-12-26 12:35:34.569 | INFO     | pybandits.offline_policy_evaluator:_estimate_expected_reward:802 - Data prediction of expected reward based on gbm model.
[6]:
evaluator.evaluate(mab=mab, visualize=True, n_mc_experiments=1000)
2025-12-26 12:35:34.888 | INFO     | pybandits.offline_policy_evaluator:estimate_policy:898 - Data prediction of expected policy based on Monte Carlo experiments.
1000it [00:16, 61.94it/s]
2025-12-26 12:35:51.251 | INFO     | pybandits.offline_policy_evaluator:_estimate_importance_weight:841 - Data prediction of importance weights based on logreg model.
2025-12-26 12:35:51.327 | INFO     | pybandits.offline_policy_evaluator:evaluate:971 - Offline Policy Evaluation for reward_0.
/home/runner/.cache/pypoetry/virtualenvs/pybandits-vYJB-miV-py3.10/lib/python3.10/site-packages/scipy/stats/_resampling.py:147: RuntimeWarning: invalid value encountered in scalar divide
  a_hat = 1/6 * sum(nums) / sum(dens)**(3/2)
/home/runner/work/pybandits/pybandits/pybandits/offline_policy_estimator.py:138: DegenerateDataWarning: The BCa confidence interval cannot be calculated. This problem is known to occur when the distribution is degenerate or the statistic is np.min.
  bootstrap_result = bootstrap(
(An interactive Bokeh visualization is rendered here; it is not included in this text export.)
[6]:
value lower upper std estimator objective
0 0.489862 0.456674 0.522372 0.016701 b-ipw reward_0
1 0.515219 0.509716 0.520946 0.002870 dm reward_0
2 0.491414 0.459075 0.522584 0.016344 dr reward_0
3 0.515219 0.509512 0.520835 0.002877 dros-opt reward_0
4 0.491414 0.459531 0.523425 0.016349 dros-pess reward_0
5 0.488835 0.457371 0.522736 0.016535 ipw reward_0
6 0.000000 NaN NaN 0.000000 rep reward_0
7 0.491375 0.458642 0.522875 0.016390 sndr reward_0
8 0.489635 0.457615 0.522625 0.016591 snips reward_0
9 0.491414 0.459919 0.524346 0.016368 sg-dr reward_0
10 0.488835 0.457317 0.521283 0.016369 sg-ipw reward_0
11 0.515219 0.509526 0.520710 0.002872 switch-dr reward_0
[7]:
evaluator.update_and_evaluate(mab=mab, visualize=True, n_mc_experiments=1000)
2025-12-26 12:35:52.526 | INFO     | pybandits.offline_policy_evaluator:_update_mab:1050 - Offline policy update for <class 'pybandits.cmab.CmabBernoulliCC'>.
2025-12-26 12:35:59.517 | INFO     | pybandits.offline_policy_evaluator:estimate_policy:898 - Data prediction of expected policy based on Monte Carlo experiments.
1000it [00:15, 62.77it/s]
2025-12-26 12:36:15.660 | INFO     | pybandits.offline_policy_evaluator:_estimate_importance_weight:841 - Data prediction of importance weights based on logreg model.
2025-12-26 12:36:15.736 | INFO     | pybandits.offline_policy_evaluator:evaluate:971 - Offline Policy Evaluation for reward_0.
(An interactive Bokeh visualization is rendered here; it is not included in this text export.)
[7]:
value lower upper std estimator objective
0 0.507254 0.465710 0.549189 0.021338 b-ipw reward_0
1 0.500674 0.495074 0.506353 0.002895 dm reward_0
2 0.498864 0.458973 0.538123 0.020057 dr reward_0
3 0.500674 0.495049 0.506347 0.002868 dros-opt reward_0
4 0.498864 0.459306 0.537497 0.019707 dros-pess reward_0
5 0.504132 0.458905 0.550335 0.023217 ipw reward_0
6 0.571429 0.142857 1.428571 0.282560 rep reward_0
7 0.498874 0.459908 0.538285 0.019977 sndr reward_0
8 0.501551 0.457844 0.547851 0.023137 snips reward_0
9 0.498864 0.459652 0.538631 0.019935 sg-dr reward_0
10 0.504132 0.460232 0.552106 0.023486 sg-ipw reward_0
11 0.500674 0.495120 0.506324 0.002858 switch-dr reward_0
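
As a possible follow-up (not shown in the original notebook, and assuming that evaluate and update_and_evaluate return the result DataFrames displayed above), the two runs can be captured and compared per estimator:

# Illustrative comparison of the evaluation before and after the offline update.
# visualize=False is assumed to skip the Bokeh plot; the call signatures otherwise
# match the cells above.
results_eval = evaluator.evaluate(mab=mab, visualize=False, n_mc_experiments=1000)
results_updated = evaluator.update_and_evaluate(mab=mab, visualize=False, n_mc_experiments=1000)
comparison = results_eval.merge(
    results_updated, on=["estimator", "objective"], suffixes=("_eval", "_update_and_eval")
)
print(comparison[["estimator", "value_eval", "value_update_and_eval"]])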