{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Contextual Multi-Armed Bandit with Zooming Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook demonstrates the usage of the Zooming model for quantitative action spaces with the contextual multi-armed bandit (CMAB) implementation in `pybandits`.\n", "\n", "The Zooming model adaptively partitions a continuous action space and fits a model (e.g., Bayesian Neural Network) to each segment. This allows efficient exploration and exploitation in continuous or high-cardinality action spaces through an adaptive discretization approach. Unlike the SMAB zooming model, the CMAB version uses contextual information to predict rewards.\n", "\n", "References:\n", "- [Multi-Armed Bandits in Metric Spaces (Kleinberg, Slivkins, and Upfal, 2008)](https://arxiv.org/pdf/0809.4882)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "from sklearn.preprocessing import StandardScaler\n", "\n", "from pybandits.cmab import CmabBernoulli\n", "from pybandits.quantitative_model import CmabZoomingModel" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup\n", "\n", "First, we'll define actions with quantitative parameters. In this example, we'll use two actions, each with a one-dimensional quantitative parameter (e.g., price point or dosage level) ranging from 0 to 1. Unlike the SMAB model, here we also need to define contextual features." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# For reproducibility\n", "np.random.seed(42)\n", "\n", "# Define number of features for the context\n", "n_features = 3\n", "# Define number of segments for each action\n", "n_max_segments = 16 # Maximum number of segments for each action\n", "# Define cold start parameters for the base model\n", "base_model_cold_start_kwargs = {\n", " \"n_features\": n_features, # Number of context features\n", " \"update_method\": \"VI\", # Variational Inference for Bayesian updates\n", "}\n", "\n", "\n", "# Define actions with zooming models\n", "actions = {\n", " \"action_1\": CmabZoomingModel.cold_start(\n", " dimension=1, n_max_segments=n_max_segments, base_model_cold_start_kwargs=base_model_cold_start_kwargs\n", " ),\n", " \"action_2\": CmabZoomingModel.cold_start(\n", " dimension=1, n_max_segments=n_max_segments, base_model_cold_start_kwargs=base_model_cold_start_kwargs\n", " ),\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can initialize the CmabBernoulli bandit with our zooming models:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Initialize the bandit\n", "cmab = CmabBernoulli(actions=actions)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Simulate Environment\n", "\n", "Let's create a reward function that depends on both the action, its quantitative parameter, and the context. For illustration purposes, we'll define that:\n", "\n", "- `action_1` performs better when the first context feature is high and when the quantitative parameter is around 0.25\n", "- `action_2` performs better when the second context feature is high and when the quantitative parameter is around 0.75\n", "\n", "The reward probability follows a bell curve for the quantitative parameter and is also influenced by the context features." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def reward_function(action, quantity, context):\n", " if action == \"action_1\":\n", " # Bell curve centered at 0.25 for the quantity\n", " # Influenced by first context feature\n", " quantity_component = np.exp(-((quantity - 0.25) ** 2) / 0.02)\n", " context_component = 0.5 + 0.5 * (context[0] / 2) # First feature has influence\n", " prob = quantity_component * context_component\n", " else: # action_2\n", " # Bell curve centered at 0.75 for the quantity\n", " # Influenced by second context feature\n", " quantity_component = np.exp(-((quantity - 0.75) ** 2) / 0.02)\n", " context_component = 0.5 + 0.5 * (context[1] / 2) # Second feature has influence\n", " prob = quantity_component * context_component\n", "\n", " # Ensure probability is between 0 and 1\n", " prob = max(0, min(1, prob))\n", "\n", " return np.random.binomial(1, prob)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's visualize our reward functions to understand what the bandit needs to learn. We'll show the reward surfaces for different values of context:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x = np.linspace(0, 1, 100)\n", "\n", "# Plot for three different contexts\n", "contexts = [\n", " np.array([1.0, 0.0, 0.0]), # High first feature\n", " np.array([0.0, 1.0, 0.0]), # High second feature\n", " np.array([0.5, 0.5, 0.0]), # Mixed features\n", "]\n", "\n", "plt.figure(figsize=(16, 5))\n", "for i, context in enumerate(contexts, 1):\n", " plt.subplot(1, 3, i)\n", "\n", " y1 = [np.exp(-((xi - 0.25) ** 2) / 0.02) * (0.5 + 0.5 * (context[0] / 2)) for xi in x]\n", " y2 = [np.exp(-((xi - 0.75) ** 2) / 0.02) * (0.5 + 0.5 * (context[1] / 2)) for xi in x]\n", "\n", " plt.plot(x, y1, \"b-\", label=\"action_1\")\n", " plt.plot(x, y2, \"r-\", label=\"action_2\")\n", " plt.xlabel(\"Quantitative Parameter\")\n", " plt.ylabel(\"Reward Probability\")\n", "\n", " if i == 1:\n", " title = \"Context: High Feature 1\"\n", " elif i == 2:\n", " title = \"Context: High Feature 2\"\n", " else:\n", " title = \"Context: Mixed Features\"\n", "\n", " plt.title(title)\n", " plt.legend()\n", " plt.grid(True)\n", "\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generate Synthetic Context Data\n", "\n", "Let's create synthetic context data for our experiment:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Generate random context data\n", "n_batches = 10\n", "batch_size = 100\n", "n_rounds = n_batches * batch_size\n", "raw_context_data = np.random.normal(0, 1, (n_rounds, n_features))\n", "\n", "# Standardize the context data\n", "scaler = StandardScaler()\n", "context_data = scaler.fit_transform(raw_context_data)\n", "\n", "# Preview the context data\n", "pd.DataFrame(context_data[:5], columns=[f\"Feature {i + 1}\" for i in range(n_features)])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Bandit Training Loop\n", "\n", "Now, let's train our bandit by simulating interactions for several rounds:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for t in range(n_batches):\n", " # Get context for this round\n", " current_context = context_data[t * batch_size : (t + 1) * batch_size]\n", "\n", " # Predict best action\n", " pred_actions, probs, weighted_sums = cmab.predict(context=current_context)\n", " chosen_actions = 
[a[0] for a in pred_actions]\n", "    chosen_quantities = [a[1][0] for a in pred_actions]\n", "\n", "    # Observe rewards, using the context row that corresponds to each sample in the batch\n", "    rewards = [\n", "        reward_function(chosen_action, chosen_quantity, current_context[i])\n", "        for i, (chosen_action, chosen_quantity) in enumerate(zip(chosen_actions, chosen_quantities))\n", "    ]\n", "\n", "    # Update bandit\n", "    cmab.update(actions=chosen_actions, rewards=rewards, context=current_context, quantities=chosen_quantities)\n", "\n", "    # Print progress\n", "    print(f\"Completed {t + 1} batches\")" ] },
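{ "cell_type": "markdown", "metadata": {}, "source": [ "Before inspecting the learned segmentation, we can run a quick, informal sanity check: `chosen_actions` and `chosen_quantities` still hold the choices made in the final training batch, so we can see how often each action was selected and where its chosen quantities concentrate. If the bandit has picked up the reward structure, the quantities should cluster roughly around 0.25 for `action_1` and 0.75 for `action_2`. This is only a rough, illustrative check, not a formal evaluation." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Illustrative check on the final training batch (uses only variables defined above)\n", "last_batch = pd.DataFrame({\"action\": chosen_actions, \"quantity\": chosen_quantities})\n", "print(last_batch[\"action\"].value_counts())\n", "print(last_batch.groupby(\"action\")[\"quantity\"].agg([\"count\", \"mean\", \"min\", \"max\"]))\n", "\n", "# Histogram of the chosen quantities per action in the final batch\n", "for action_name, group in last_batch.groupby(\"action\"):\n", "    plt.hist(group[\"quantity\"], bins=20, range=(0, 1), alpha=0.5, label=action_name)\n", "plt.axvline(0.25, color=\"blue\", linestyle=\"--\")\n", "plt.axvline(0.75, color=\"red\", linestyle=\"--\")\n", "plt.xlabel(\"Chosen Quantitative Parameter\")\n", "plt.ylabel(\"Count\")\n", "plt.title(\"Quantities chosen in the final training batch\")\n", "plt.legend()\n", "plt.show()" ] },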
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Examining Segment Adaptation\n", "\n", "Let's look at the adaptive segmentation for both actions to see how the model has split the quantitative space:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Extract the segmentation for each action\n", "action1_segments = list(cmab.actions[\"action_1\"].segmented_actions.keys())\n", "action2_segments = list(cmab.actions[\"action_2\"].segmented_actions.keys())\n", "\n", "# Print the number of segments\n", "print(f\"Number of segments for action_1: {len(action1_segments)}\")\n", "print(f\"Number of segments for action_2: {len(action2_segments)}\")\n", "\n", "# Create a figure to visualize the segments\n", "fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 8))\n", "\n", "# Plot action_1 segments\n", "ax1.set_title(\"action_1 Segments\")\n", "for i, segment in enumerate(action1_segments):\n", "    ax1.plot([segment.mins[0], segment.maxs[0]], [i, i], linewidth=5)\n", "ax1.set_xlim(0, 1)\n", "ax1.set_xlabel(\"Quantitative Parameter\")\n", "ax1.axvline(0.25, color=\"red\", linestyle=\"--\", label=\"Optimal Value\")\n", "ax1.legend()\n", "\n", "# Plot action_2 segments\n", "ax2.set_title(\"action_2 Segments\")\n", "for i, segment in enumerate(action2_segments):\n", "    ax2.plot([segment.mins[0], segment.maxs[0]], [i, i], linewidth=5)\n", "ax2.set_xlim(0, 1)\n", "ax2.set_xlabel(\"Quantitative Parameter\")\n", "ax2.axvline(0.75, color=\"red\", linestyle=\"--\", label=\"Optimal Value\")\n", "ax2.legend()\n", "\n", "plt.tight_layout()\n", "plt.show()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Testing with Specific Contexts\n", "\n", "Finally, let's test our trained bandit with specific contexts to see if it has learned the optimal policy:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Define test contexts\n", "test_contexts = [\n", "    [2.0, -1.0, 0.0],  # High feature 1, low feature 2\n", "    [-1.0, 2.0, 0.0],  # Low feature 1, high feature 2\n", "    [1.0, 1.0, 0.0],  # High feature 1 and 2\n", "    [-1.0, -1.0, 0.0],  # Low feature 1 and 2\n", "]\n", "test_contexts = scaler.transform(test_contexts)\n", "\n", "# Test predictions\n", "results = []\n", "for i, context in enumerate(test_contexts):\n", "    context_reshaped = context.reshape(1, -1)\n", "    pred_actions, probs, weighted_sums = cmab.predict(context=context_reshaped)\n", "    chosen_action_quantity = pred_actions[0]\n", "    chosen_action = chosen_action_quantity[0]\n", "    chosen_quantity = chosen_action_quantity[1][0]\n", "    chosen_action_prob = probs[0][chosen_action_quantity]\n", "\n", "    # Look up the known optimal quantity for the chosen action\n", "    # In a real application, you would have a method to test different quantities\n", "    # Here we'll use our knowledge of the true optimal values\n", "    if chosen_action == \"action_1\":\n", "        optimal_quantity = 0.25\n", "    else:\n", "        optimal_quantity = 0.75\n", "\n", "    # Sample a reward at the known optimal quantity (a single Bernoulli draw, not a probability)\n", "    sampled_reward = reward_function(chosen_action, optimal_quantity, context)\n", "\n", "    results.append(\n", "        {\n", "            \"Context\": context,\n", "            \"Chosen Action\": chosen_action,\n", "            \"Chosen Quantity\": chosen_quantity,\n", "            \"Chosen Action Probability\": chosen_action_prob,\n", "            \"Optimal Quantity\": optimal_quantity,\n", "            \"Sampled Reward\": sampled_reward,\n", "        }\n", "    )\n", "\n", "# Display results\n", "for i, result in enumerate(results):\n", "    context_type = \"\"\n", "    if i == 0:\n", "        context_type = \"High feature 1, low feature 2\"\n", "    elif i == 1:\n", "        context_type = \"Low feature 1, high feature 2\"\n", "    elif i == 2:\n", "        context_type = \"High feature 1 and 2\"\n", "    elif i == 3:\n", "        context_type = \"Low feature 1 and 2\"\n", "\n", "    print(f\"\\nTest {i + 1}: {context_type}\")\n", "    print(f\"Context: {result['Context']}\")\n", "    print(f\"Chosen Action: {result['Chosen Action']}\")\n", "    print(f\"Chosen Quantity: {result['Chosen Quantity']}\")\n", "    print(f\"Chosen Action Probability: {result['Chosen Action Probability']}\")\n", "    print(f\"Optimal Quantity: {result['Optimal Quantity']:.2f}\")\n", "    print(f\"Sampled Reward at Optimal Quantity: {result['Sampled Reward']}\")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion\n", "\n", "The CMAB Zooming model extends the concept of adaptive discretization to contextual bandits. This approach allows efficient exploration and exploitation of continuous action parameters while taking context into account. It adaptively refines the segmentation of the parameter space, concentrating more segments in high-reward regions for finer discretization.\n", "\n", "This approach is particularly useful when:\n", "1. Actions have continuous parameters that affect rewards\n", "2. The reward function depends on both context and action parameters\n", "3. The optimal parameter values may vary across different contexts\n", "4. The action space needs to be adaptively discretized for efficient exploration\n", "\n", "Real-world applications include:\n", "- Personalized pricing: Find optimal prices (continuous parameter) based on customer features (context)\n", "- Content recommendation: Optimize content parameters (e.g., length, complexity) based on user demographics\n", "- Medical dosing: Determine optimal medication dosages based on patient characteristics\n", "- Ad campaign optimization: Find best bid values based on ad placement and target audience" ] } ], "metadata": { "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python" } }, "nbformat": 4, "nbformat_minor": 4 }