{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### Introduction\n", "\n", "This notebook demonstrates the use of offline policy evaluation for MABs.\n", "\n", "### Objectives\n", "\n", "#### Evaluation:\n", "\n", "Evaluate the performance of a MAB using multiple offline policy estimators." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "from sklearn.preprocessing import MinMaxScaler\n", "\n", "from pybandits.cmab import CmabBernoulliCC\n", "from pybandits.offline_policy_evaluator import OfflinePolicyEvaluator\n", "\n", "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generate data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We first generate a binarly labeled data set, with a two dimensional feature space, and is not lineraly seprabale.\n", "We then split the data set to a training data setm and a test data set." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "n_samples = 1000\n", "n_actions = 2\n", "n_batches = 3\n", "n_rewards = 1\n", "n_groups = 2\n", "n_features = 3" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "unique_actions = [f\"a{i}\" for i in range(n_actions)]\n", "action_ids = np.random.choice(unique_actions, n_samples * n_batches)\n", "batches = [i for i in range(n_batches) for _ in range(n_samples)]\n", "rewards = [np.random.randint(2, size=(n_samples * n_batches)) for _ in range(n_rewards)]\n", "action_true_rewards = {(a, r): np.random.rand() for a in unique_actions for r in range(n_rewards)}\n", "true_rewards = [\n", " np.array([action_true_rewards[(a, r)] for a in action_ids]).reshape(n_samples * n_batches) for r in range(n_rewards)\n", "]\n", "groups = np.random.randint(n_groups, size=n_samples * n_batches)\n", "action_costs = {action: np.random.rand() for action in unique_actions}\n", "costs = np.array([action_costs[a] for a in action_ids])\n", "context = np.random.rand(n_samples * n_batches, n_features)\n", "action_propensity_score = {action: np.random.rand() for action in unique_actions}\n", "propensity_score = np.array([action_propensity_score[a] for a in action_ids])\n", "df = pd.DataFrame(\n", " {\n", " \"batch\": batches,\n", " \"action_id\": action_ids,\n", " \"cost\": costs,\n", " \"group\": groups,\n", " **{f\"reward_{r}\": rewards[r] for r in range(n_rewards)},\n", " **{f\"true_reward_{r}\": true_rewards[r] for r in range(n_rewards)},\n", " **{f\"context_{i}\": context[:, i] for i in range(n_features)},\n", " \"propensity_score\": propensity_score,\n", " }\n", ")\n", "contextual_features = [col for col in df.columns if col.startswith(\"context\")]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generate Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the cold_start method of CmabBernoulliCC, we can create a model to be used for offline policy evaluation." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "action_ids_cost = {action_id: df[\"cost\"][df[\"action_id\"] == action_id].iloc[0] for action_id in unique_actions}\n", "\n", "mab = CmabBernoulliCC.cold_start(action_ids_cost=action_ids_cost, n_features=len(contextual_features))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## OPE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Given the model and the OPE data from the logging policy, we can either evaluate the model using the logging policy, or update it with the logging policy data prior to the evaluation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "evaluator = OfflinePolicyEvaluator(\n", " logged_data=df,\n", " split_prop=0.5,\n", " n_trials=10,\n", " fast_fit=True,\n", " scaler=MinMaxScaler(),\n", " ope_estimators=None,\n", " verbose=True,\n", " propensity_score_model_type=\"batch_empirical\",\n", " expected_reward_model_type=\"gbm\",\n", " importance_weights_model_type=\"logreg\",\n", " batch_feature=\"batch\",\n", " action_feature=\"action_id\",\n", " reward_feature=\"reward_0\",\n", " true_reward_feature=\"true_reward_0\",\n", " contextual_features=contextual_features,\n", " group_feature=\"group\",\n", " cost_feature=\"cost\",\n", " propensity_score_feature=\"propensity_score\",\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "evaluator.evaluate(mab=mab, visualize=True, n_mc_experiments=1000)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "evaluator.update_and_evaluate(mab=mab, visualize=True, n_mc_experiments=1000)" ] } ], "metadata": { "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python" } }, "nbformat": 4, "nbformat_minor": 2 }