{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# sMAB Simulation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook shows a simulation framework for the stochastic multi-armed bandit (sMAB). It allows one to study the behaviour of the bandit algorithm, evaluate results, and run experiments on simulated data under different reward and action settings." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "from pybandits.model import Beta\n", "from pybandits.smab import SmabBernoulli\n", "from pybandits.smab_simulator import SmabSimulator" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we need to define the simulation parameters:\n", "- Number of update rounds\n", "- Number of samples per update batch\n", "- Seed for reproducibility\n", "- Verbosity enabler\n", "- Visualization enabler\n", "\n", "Data are processed in batches of size n >= 1. For each sample in a batch, the sMAB selects one action and collects the corresponding simulated reward. Then, the prior parameters are updated based on the rewards returned for the recommended actions." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# general simulator parameters\n", "n_updates = 10\n", "batch_size = 100\n", "random_seed = None\n", "verbose = True\n", "visualize = True" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we initialize the action model and the sMAB. We define three actions, each with a Beta model. The Beta distribution, defined by the two parameters alpha and beta, is the conjugate prior of the Bernoulli likelihood. The action model is a dictionary with the action names as keys and the Beta models as values."
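,
 "\n",
 "\n",
 "As a side note (independent of the pybandits API), the Beta-Bernoulli conjugacy gives a closed-form update: after observing k positive rewards out of n trials for an action, a Beta(alpha, beta) prior becomes Beta(alpha + k, beta + n - k). A minimal standalone sketch of this update:\n",
 "\n",
 "```python\n",
 "# standalone illustration of the Beta-Bernoulli conjugate update\n",
 "alpha, beta = 1, 1  # uniform Beta prior\n",
 "rewards = [1, 0, 1, 1, 0]  # simulated Bernoulli rewards for one action\n",
 "alpha += sum(rewards)  # add successes\n",
 "beta += len(rewards) - sum(rewards)  # add failures\n",
 "print(alpha, beta)  # Beta(4, 3) posterior\n",
 "```"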
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# define action model\n", "actions = {\n", " \"a1\": Beta(),\n", " \"a2\": Beta(),\n", " \"a3\": Beta(),\n", "}\n", "# init stochastic Multi-Armed Bandit model\n", "smab = SmabBernoulli(actions=actions)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we need to define the probability of a positive reward for each action, i.e. the ground truth (e.g. 'a2': 0.80 means that whenever the bandit selects action 'a2', the environment returns a positive reward with 80% probability).\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# init probability of rewards from the environment\n", "probs_reward = pd.DataFrame(\n", " [[0.05, 0.80, 0.05]],\n", " columns=actions.keys(),\n", ")\n", "print(\"Probability of positive reward for each action:\")\n", "probs_reward" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "Now, we initialize the SmabSimulator with the sMAB and the simulation parameters defined above."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# init simulation\n", "smab_simulator = SmabSimulator(\n", " mab=smab,\n", " batch_size=batch_size,\n", " n_updates=n_updates,\n", " probs_reward=probs_reward,\n", " random_seed=random_seed,\n", " verbose=verbose,\n", " visualize=visualize,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we can start the simulation process by executing run(), which performs the following steps:\n", "```\n", "For i=0 to n_updates:\n", "    Generate batch[i] of simulated samples\n", "    For each sample in batch[i], recommend the action with the highest sampled reward probability and collect the corresponding simulated reward\n", "    Update the model priors using the recommended actions and the returned rewards\n", "```\n", "Finally, we can visualize the results of the simulation. As expected from the ground truth, 'a2' is the action recommended most often." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "smab_simulator.run()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Furthermore, we can examine the number of times each action was selected and the proportion of positive rewards for each action."
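,
 "\n",
 "\n",
 "For intuition, the loop performed by run() can also be sketched independently of pybandits as a toy Thompson sampling simulation (an illustration under the same ground truth, not the SmabSimulator internals):\n",
 "\n",
 "```python\n",
 "# toy sketch of the simulation loop (Thompson sampling); not the SmabSimulator internals\n",
 "import random\n",
 "\n",
 "probs = {\"a1\": 0.05, \"a2\": 0.80, \"a3\": 0.05}  # ground-truth reward probabilities\n",
 "params = {a: [1, 1] for a in probs}  # Beta(alpha, beta) per action\n",
 "counts = {a: 0 for a in probs}\n",
 "\n",
 "random.seed(0)\n",
 "for _ in range(10):  # update rounds\n",
 "    batch = []\n",
 "    for _ in range(100):  # samples per batch\n",
 "        sampled = {a: random.betavariate(*params[a]) for a in probs}\n",
 "        action = max(sampled, key=sampled.get)  # highest sampled reward probability\n",
 "        reward = 1 if random.random() < probs[action] else 0\n",
 "        batch.append((action, reward))\n",
 "        counts[action] += 1\n",
 "    for action, reward in batch:  # update priors once per batch\n",
 "        params[action][0] += reward\n",
 "        params[action][1] += 1 - reward\n",
 "print(counts)  # 'a2' should dominate\n",
 "```"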
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "smab_simulator.selected_actions_count" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "smab_simulator.positive_reward_proportion" ] } ], "metadata": { "hide_input": false, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.5" }, "pycharm": { "stem_cell": { "cell_type": "raw", "metadata": { "collapsed": false }, "source": [] } } }, "nbformat": 4, "nbformat_minor": 4 }