{ "cells": [ { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "# Stochastic Multi-Armed Bandit" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the stochastic multi-armed bandit (sMAB), we implemented a Bernoulli multi-armed bandit based on Thompson sampling algorithm ([Agrawal and Goyal, 2012](http://proceedings.mlr.press/v23/agrawal12/agrawal12.pdf))." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![title](img/smab.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following notebook contains an example of usage of the class Smab, which implements the algorithm above." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we need to define the list of possible actions $a_i \\in A$ and the priors parameters for each Beta distibution $\\alpha, \\beta$. By setting them all to 1, all actions have the same probability to be selected by the bandit at the beginning before the first update." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# define actions\n", "action_ids = [\"Action A\", \"Action B\", \"Action C\"]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "from pybandits.model import Beta\n", "from pybandits.smab import SmabBernoulli" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "n_samples = 1000" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we need to define the mapping of possible actions $a_i \\in A$ to their associated model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# define action model\n", "actions = {\n", " \"a1\": Beta(),\n", " \"a2\": Beta(),\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now init the bandit given the mapping of actions $a_i$ to their model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# init stochastic Multi-Armed Bandit model\n", "smab = SmabBernoulli(actions=actions)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The predict function below returns the action selected by the bandit at time $t$: $a_t = argmax_k \\theta_k^t$, where $\\theta_k^t$ is the sample from the Beta distribution $k$ at time $t$. The bandit selects one action at time when n_samples=1, or it selects batches of samples when n_samples>1." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# predict actions\n", "pred_actions, _ = smab.predict(n_samples=n_samples)\n", "print(\"Recommended action: {}\".format(pred_actions[:10]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we observe the rewards from the environment. In this example rewards are randomly simulated. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# simulate reward from environment\n", "simulated_rewards = np.random.randint(2, size=n_samples)\n", "print(\"Simulated rewards: {}\".format(simulated_rewards[:10]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we update the model providing per each action sample: (i) the action $a_t$ selected by the bandit, (ii) the corresponding reward $r_t$." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "smab.update(actions=pred_actions, rewards=simulated_rewards)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "hide_input": false, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" }, "pycharm": { "stem_cell": { "cell_type": "raw", "metadata": { "collapsed": false }, "source": [] } } }, "nbformat": 4, "nbformat_minor": 4 }