Introduction
This notebook demonstrates offline policy evaluation (OPE) for multi-armed bandits (MABs).
Objectives
Evaluation:
Evaluate the performance of a MAB using multiple offline policy estimators.
[1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from pybandits.cmab import CmabBernoulliCC
from pybandits.offline_policy_evaluator import OfflinePolicyEvaluator
%load_ext autoreload
%autoreload 2
Generate data
We first generate a synthetic logged data set. Each sample contains the action chosen by the logging policy, a binary observed reward, the action's true expected reward, a three-dimensional context, a group assignment, the action's cost, and the logging policy's propensity score; the samples are organized into batches. The OfflinePolicyEvaluator below splits this logged data internally (see its split_prop argument).
[2]:
n_samples = 1000  # samples per batch
n_actions = 2  # number of actions (arms)
n_batches = 3  # number of logged batches
n_rewards = 1  # number of reward objectives
n_groups = 2  # number of sample groups
n_features = 3  # number of contextual features
[3]:
unique_actions = [f"a{i}" for i in range(n_actions)]
action_ids = np.random.choice(unique_actions, n_samples * n_batches)
batches = [i for i in range(n_batches) for _ in range(n_samples)]
rewards = [np.random.randint(2, size=(n_samples * n_batches)) for _ in range(n_rewards)]
action_true_rewards = {(a, r): np.random.rand() for a in unique_actions for r in range(n_rewards)}
true_rewards = [
np.array([action_true_rewards[(a, r)] for a in action_ids]).reshape(n_samples * n_batches) for r in range(n_rewards)
]
groups = np.random.randint(n_groups, size=n_samples * n_batches)
action_costs = {action: np.random.rand() for action in unique_actions}
costs = np.array([action_costs[a] for a in action_ids])
context = np.random.rand(n_samples * n_batches, n_features)
action_propensity_score = {action: np.random.rand() for action in unique_actions}
propensity_score = np.array([action_propensity_score[a] for a in action_ids])
df = pd.DataFrame(
{
"batch": batches,
"action_id": action_ids,
"cost": costs,
"group": groups,
**{f"reward_{r}": rewards[r] for r in range(n_rewards)},
**{f"true_reward_{r}": true_rewards[r] for r in range(n_rewards)},
**{f"context_{i}": context[:, i] for i in range(n_features)},
"propensity_score": propensity_score,
}
)
contextual_features = [col for col in df.columns if col.startswith("context")]
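The resulting logged data frame has one row per sample, i.e. n_samples * n_batches = 3000 rows, with one column per field defined above:

print(df.shape)  # (3000, 10)
df.head()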
Generate Model
Using the cold_start method of CmabBernoulliCC, we can create a model to be used for offline policy evaluation.
[4]:
# cost of each action, taken from the logged data
action_ids_cost = {action_id: df["cost"][df["action_id"] == action_id].iloc[0] for action_id in unique_actions}
# cold-start model: built from scratch, without any prior interaction data
mab = CmabBernoulliCC.cold_start(action_ids_cost=action_ids_cost, n_features=len(contextual_features))
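For illustration, action_ids_cost simply maps each unique action to the cost recorded for it in the logged data; CmabBernoulliCC uses these costs for cost control:

# one cost entry per unique action, taken from the logged data
assert set(action_ids_cost) == set(unique_actions)
print(action_ids_cost)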
OPE
Given the model and the logged data produced by the logging policy, we can either evaluate the model as-is, or first update it with the logged data and then evaluate it.
[5]:
evaluator = OfflinePolicyEvaluator(
    logged_data=df,
    split_prop=0.5,
    n_trials=10,
    fast_fit=True,
    scaler=MinMaxScaler(),
    ope_estimators=None,  # None: use the default set of estimators
    verbose=True,
    propensity_score_model_type="batch_empirical",
    expected_reward_model_type="gbm",
    importance_weights_model_type="logreg",
    # column mapping for the logged data
    batch_feature="batch",
    action_feature="action_id",
    reward_feature="reward_0",
    true_reward_feature="true_reward_0",
    contextual_features=contextual_features,
    group_feature="group",
    cost_feature="cost",
    propensity_score_feature="propensity_score",
)
100%|██████████| 2/2 [00:00<00:00, 308.56it/s]
2025-07-16 10:43:11.795 | INFO | pybandits.offline_policy_evaluator:_estimate_propensity_score:730 - Data batch-empirical estimation of propensity score.
2025-07-16 10:43:11.803 | INFO | pybandits.offline_policy_evaluator:_estimate_expected_reward:780 - Data prediction of expected reward based on gbm model.
[6]:
evaluator.evaluate(mab=mab, visualize=True, n_mc_experiments=1000)
2025-07-16 10:43:12.147 | INFO | pybandits.offline_policy_evaluator:estimate_policy:876 - Data prediction of expected policy based on Monte Carlo experiments.
1000it [00:16, 61.82it/s]
2025-07-16 10:43:28.531 | INFO | pybandits.offline_policy_evaluator:_estimate_importance_weight:819 - Data prediction of importance weights based on logreg model.
2025-07-16 10:43:28.610 | INFO | pybandits.offline_policy_evaluator:evaluate:949 - Offline Policy Evaluation for reward_0.
/home/runner/.cache/pypoetry/virtualenvs/pybandits-vYJB-miV-py3.10/lib/python3.10/site-packages/scipy/stats/_resampling.py:147: RuntimeWarning: invalid value encountered in scalar divide
a_hat = 1/6 * sum(nums) / sum(dens)**(3/2)
/home/runner/work/pybandits/pybandits/pybandits/offline_policy_estimator.py:116: DegenerateDataWarning: The BCa confidence interval cannot be calculated. This problem is known to occur when the distribution is degenerate or the statistic is np.min.
bootstrap_result = bootstrap(
[6]:
|    | value    | lower    | upper    | std      | estimator | objective |
|----|----------|----------|----------|----------|-----------|-----------|
| 0  | 0.503952 | 0.471031 | 0.537714 | 0.016862 | b-ipw     | reward_0  |
| 1  | 0.504058 | 0.498670 | 0.509589 | 0.002801 | dm        | reward_0  |
| 2  | 0.507871 | 0.475852 | 0.539793 | 0.016324 | dr        | reward_0  |
| 3  | 0.504058 | 0.498674 | 0.509451 | 0.002757 | dros-opt  | reward_0  |
| 4  | 0.507871 | 0.474337 | 0.539798 | 0.016360 | dros-pess | reward_0  |
| 5  | 0.507603 | 0.474846 | 0.541772 | 0.017069 | ipw       | reward_0  |
| 6  | 0.000000 | NaN      | NaN      | 0.000000 | rep       | reward_0  |
| 7  | 0.507869 | 0.476486 | 0.540600 | 0.016358 | sndr      | reward_0  |
| 8  | 0.507282 | 0.473912 | 0.540733 | 0.017050 | snips     | reward_0  |
| 9  | 0.507871 | 0.475961 | 0.540780 | 0.016357 | sg-dr     | reward_0  |
| 10 | 0.507603 | 0.474857 | 0.540881 | 0.016949 | sg-ipw    | reward_0  |
| 11 | 0.504058 | 0.498776 | 0.509534 | 0.002775 | switch-dr | reward_0  |
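To make the estimator column concrete, below is a minimal NumPy sketch of three of the listed estimators: inverse probability weighting (ipw), its self-normalized variant (snips), and the doubly robust estimator (dr). It is a simplified illustration rather than PyBandits' implementation, and assumes arrays of logged rewards, logging-policy propensity scores, the evaluated policy's probabilities for the logged actions, and reward-model predictions.

import numpy as np

def ipw(reward, logging_propensity, eval_prob):
    # reweight each logged reward by the ratio pi_e(a|x) / pi_b(a|x)
    w = eval_prob / logging_propensity
    return np.mean(w * reward)

def snips(reward, logging_propensity, eval_prob):
    # self-normalized IPW: normalize by the sum of the importance weights
    w = eval_prob / logging_propensity
    return np.sum(w * reward) / np.sum(w)

def dr(reward, logging_propensity, eval_prob, q_logged_action, q_eval_policy):
    # doubly robust: model-based estimate plus an importance-weighted correction of its residual
    w = eval_prob / logging_propensity
    return np.mean(q_eval_policy + w * (reward - q_logged_action))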
[7]:
evaluator.update_and_evaluate(mab=mab, visualize=True, n_mc_experiments=1000)
2025-07-16 10:43:29.805 | INFO | pybandits.offline_policy_evaluator:_update_mab:1028 - Offline policy update for <class 'pybandits.cmab.CmabBernoulliCC'>.
2025-07-16 10:43:46.819 | INFO | pybandits.offline_policy_evaluator:estimate_policy:876 - Data prediction of expected policy based on Monte Carlo experiments.
1000it [00:18, 52.84it/s]
2025-07-16 10:44:05.967 | INFO | pybandits.offline_policy_evaluator:_estimate_importance_weight:819 - Data prediction of importance weights based on logreg model.
2025-07-16 10:44:06.056 | INFO | pybandits.offline_policy_evaluator:evaluate:949 - Offline Policy Evaluation for reward_0.
[7]:
|    | value    | lower    | upper    | std      | estimator | objective |
|----|----------|----------|----------|----------|-----------|-----------|
| 0  | 0.505496 | 0.454257 | 0.560382 | 0.026886 | b-ipw     | reward_0  |
| 1  | 0.504393 | 0.498935 | 0.509760 | 0.002780 | dm        | reward_0  |
| 2  | 0.513988 | 0.467177 | 0.558131 | 0.022825 | dr        | reward_0  |
| 3  | 0.504393 | 0.498910 | 0.509909 | 0.002810 | dros-opt  | reward_0  |
| 4  | 0.513988 | 0.469365 | 0.557159 | 0.022540 | dros-pess | reward_0  |
| 5  | 0.511271 | 0.458302 | 0.565420 | 0.027439 | ipw       | reward_0  |
| 6  | 0.493827 | 0.358025 | 0.666268 | 0.076405 | rep       | reward_0  |
| 7  | 0.514003 | 0.471214 | 0.558653 | 0.022219 | sndr      | reward_0  |
| 8  | 0.512051 | 0.460596 | 0.567523 | 0.027174 | snips     | reward_0  |
| 9  | 0.513988 | 0.469818 | 0.558791 | 0.022583 | sg-dr     | reward_0  |
| 10 | 0.511271 | 0.459634 | 0.567058 | 0.027439 | sg-ipw    | reward_0  |
| 11 | 0.504393 | 0.498951 | 0.510055 | 0.002840 | switch-dr | reward_0  |
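Finally, a short sketch of how the two evaluations could be compared side by side, assuming that evaluate and update_and_evaluate return the result tables shown above as pandas DataFrames:

results_no_update = evaluator.evaluate(mab=mab, visualize=False, n_mc_experiments=1000)
results_with_update = evaluator.update_and_evaluate(mab=mab, visualize=False, n_mc_experiments=1000)
# align the two result tables by estimator and objective
comparison = results_no_update.merge(
    results_with_update, on=["estimator", "objective"], suffixes=("_no_update", "_with_update")
)
print(comparison[["estimator", "value_no_update", "value_with_update"]])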