### Introduction

This notebook demonstrates offline policy evaluation (OPE) for multi-armed bandits (MABs).

### Objectives

#### Evaluation:

Evaluate the performance of a MAB using multiple offline policy estimators.
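
To make "offline policy estimator" concrete, here is a minimal NumPy sketch of two common estimators, inverse propensity scoring (IPS) and its self-normalized variant (SNIPS). It is independent of `pybandits`: the function names, the toy logging and target policies, and the reward means are all assumptions chosen for illustration, not the estimators the library necessarily runs.

```python
import numpy as np


def ips_estimate(rewards, logging_propensities, target_propensities):
    """IPS estimate of the target policy's mean reward from logged data."""
    weights = target_propensities / logging_propensities
    return float(np.mean(weights * rewards))


def snips_estimate(rewards, logging_propensities, target_propensities):
    """Self-normalized IPS: divides by the sum of weights instead of n."""
    weights = target_propensities / logging_propensities
    return float(np.sum(weights * rewards) / np.sum(weights))


# Toy logged data (all values below are illustrative assumptions).
rng = np.random.default_rng(0)
n = 50_000
logging_probs = np.array([0.8, 0.2])  # logging policy: P(action 0), P(action 1)
true_means = np.array([0.3, 0.6])     # true Bernoulli reward mean per action
actions = rng.choice(2, size=n, p=logging_probs)
rewards = rng.binomial(1, true_means[actions])

# Target policy to evaluate offline: prefers action 1.
target_probs = np.array([0.1, 0.9])

ips = ips_estimate(rewards, logging_probs[actions], target_probs[actions])
snips = snips_estimate(rewards, logging_probs[actions], target_probs[actions])
true_value = float(target_probs @ true_means)

print(f"IPS={ips:.3f}  SNIPS={snips:.3f}  true value={true_value:.3f}")
```

Both estimators reweight logged rewards by how much more (or less) likely the target policy is to choose the logged action than the logging policy was. SNIPS trades a small bias for lower variance when the weights are large.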

In [None]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

from pybandits.cmab import CmabBernoulliCC
from pybandits.offline_policy_evaluator import OfflinePolicyEvaluator

%load_ext autoreload
%autoreload 2

## Generate data

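We first generate a synthetic logged data set with binary rewards and a three-dimensional feature space (`n_features = 3`). Because the rewards are drawn independently of the features, the data is not linearly separable.
The data set is later split into a training data set and a test data set.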

In [None]:
n_samples = 1000
n_actions = 2
n_batches = 3
n_rewards = 1
n_groups = 2
n_features = 3

In [None]:
unique_actions = [f"a{i}" for i in range(n_actions)]
action_ids = np.random.choice(unique_actions, n_samples * n_batches)
batches = [i for i in range(n_batches) for _ in range(n_samples)]
rewards = [np.random.randint(2, size=(n_samples * n_batches)) for _ in range(n_rewards)]
action_true_rewards = {(a, r): np.random.rand() for a in unique_actions for r in range(n_rewards)}
true_rewards = [
    np.array([action_true_rewards[(a, r)] for a in action_ids]).reshape(n_samples * n_batches) for r in range(n_rewards)
]
groups = np.random.randint(n_groups, size=n_samples * n_batches)
action_costs = {action: np.random.rand() for action in unique_actions}
costs = np.array([action_costs[a] for a in action_ids])
context = np.random.rand(n_samples * n_batches, n_features)
action_propensity_score = {action: np.random.rand() for action in unique_actions}
propensity_score = np.array([action_propensity_score[a] for a in action_ids])
df = pd.DataFrame(
    {
        "batch": batches,
        "action_id": action_ids,
        "cost": costs,
        "group": groups,
        **{f"reward_{r}": rewards[r] for r in range(n_rewards)},
        **{f"true_reward_{r}": true_rewards[r] for r in range(n_rewards)},
        **{f"context_{i}": context[:, i] for i in range(n_features)},
        "propensity_score": propensity_score,
    }
)
contextual_features = [col for col in df.columns if col.startswith("context")]
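
In the cell above, actions are sampled uniformly by `np.random.choice`, while the `propensity_score` column is drawn at random per action, so the two need not agree. As a sanity check, the sketch below computes the empirical share of each action within each batch on a small stand-in frame. The frame `toy`, the column `empirical_propensity`, and the seed are hypothetical. Reading the `"batch_empirical"` option used later as this kind of per-batch frequency is only an assumption, not something this notebook confirms.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_samples, n_batches = 1000, 3
unique_actions = ["a0", "a1"]

# Stand-in for the logged data: uniform action choice within each batch.
toy = pd.DataFrame(
    {
        "batch": np.repeat(np.arange(n_batches), n_samples),
        "action_id": rng.choice(unique_actions, n_samples * n_batches),
    }
)

# Share of each action within each batch (each row sums to 1).
table = pd.crosstab(toy["batch"], toy["action_id"], normalize="index")

# The same quantity attached to every logged row.
pair_counts = toy.groupby(["batch", "action_id"])["action_id"].transform("count")
batch_counts = toy.groupby("batch")["action_id"].transform("count")
toy["empirical_propensity"] = pair_counts / batch_counts

print(table)
```

With uniform sampling, every entry of `table` should be close to `1 / n_actions`.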

## Generate Model

Using the `cold_start` method of `CmabBernoulliCC`, we create a model to be used for offline policy evaluation.

In [None]:
action_ids_cost = {action_id: df["cost"][df["action_id"] == action_id].iloc[0] for action_id in unique_actions}

mab = CmabBernoulliCC.cold_start(action_ids_cost=action_ids_cost, n_features=len(contextual_features))

## Offline Policy Evaluation (OPE)

Given the model and the logged data collected under the logging policy, we can either evaluate the model as is, or first update it with the logged data and then evaluate it.

In [None]:
evaluator = OfflinePolicyEvaluator(
    logged_data=df,
    split_prop=0.5,
    n_trials=10,
    fast_fit=True,
    scaler=MinMaxScaler(),
    ope_estimators=None,
    verbose=True,
    propensity_score_model_type="batch_empirical",
    expected_reward_model_type="gbm",
    importance_weights_model_type="logreg",
    batch_feature="batch",
    action_feature="action_id",
    reward_feature="reward_0",
    true_reward_feature="true_reward_0",
    contextual_features=contextual_features,
    group_feature="group",
    cost_feature="cost",
    propensity_score_feature="propensity_score",
)

In [None]:
evaluator.evaluate(mab=mab, visualize=True, n_mc_experiments=1000)

In [None]:
evaluator.update_and_evaluate(mab=mab, visualize=True, n_mc_experiments=1000)
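
The difference between evaluating a cold-start model and evaluating it after an update on logged data can be illustrated with a toy example that does not use `pybandits` at all. The sketch below assumes a non-contextual Beta-Bernoulli Thompson sampling policy, a uniform logging policy, and an IPS estimate on a held-out half of the data. The helper names, the reward means, and the 50/50 split are hypothetical, and the library's internals may differ.

```python
import numpy as np

rng = np.random.default_rng(7)
n_actions, n = 2, 20_000
true_means = np.array([0.2, 0.7])  # assumed Bernoulli reward mean per action

# Logged data from a uniform logging policy.
actions = rng.integers(n_actions, size=n)
rewards = rng.binomial(1, true_means[actions])
logging_propensity = np.full(n, 1.0 / n_actions)

# First half is used for the update, second half for the evaluation.
train, test = slice(0, n // 2), slice(n // 2, n)


def action_probabilities(alpha, beta, n_mc=1000):
    """Monte Carlo estimate of how often Thompson sampling picks each action."""
    samples = rng.beta(alpha, beta, size=(n_mc, len(alpha)))
    winners = samples.argmax(axis=1)
    return np.bincount(winners, minlength=len(alpha)) / n_mc


def ips_value(policy_probs, idx):
    """IPS estimate of a policy's mean reward on the selected logged rows."""
    weights = policy_probs[actions[idx]] / logging_propensity[idx]
    return float(np.mean(weights * rewards[idx]))


# Cold start: uninformative Beta(1, 1) prior on every action.
prior_alpha = np.ones(n_actions)
prior_beta = np.ones(n_actions)
cold_probs = action_probabilities(prior_alpha, prior_beta)

# Update: add observed successes and failures from the training half.
successes = np.bincount(actions[train], weights=rewards[train], minlength=n_actions)
pulls = np.bincount(actions[train], minlength=n_actions)
updated_probs = action_probabilities(prior_alpha + successes, prior_beta + pulls - successes)

cold_value = ips_value(cold_probs, test)
updated_value = ips_value(updated_probs, test)
print(f"cold start: {cold_value:.3f}  after update: {updated_value:.3f}")
```

In this toy setting the cold-start policy picks actions roughly at random, whereas the updated policy concentrates on the better action, so its estimated value is higher.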