# Contextual Multi-Armed Bandit with Zooming Model

This notebook demonstrates the usage of the Zooming model for quantitative action spaces with the contextual multi-armed bandit (CMAB) implementation in `pybandits`.

The Zooming model adaptively partitions a continuous action space and fits a model (e.g., a Bayesian neural network) to each segment. This adaptive discretization allows efficient exploration and exploitation in continuous or high-cardinality action spaces. Unlike the zooming model for the stochastic multi-armed bandit (SMAB), the CMAB version uses contextual information to predict rewards.
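
To build intuition for adaptive discretization, here is a toy sketch on the interval [0, 1]. It is not the algorithm `pybandits` implements: it splits a segment purely by sample count, whereas the zooming algorithm in the reference below relies on confidence radii. The names `Segment`, `zoom`, and the `split_after` threshold are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    lo: float
    hi: float
    pulls: int = 0

    def contains(self, x: float) -> bool:
        # Half-open intervals, except that the last segment also includes 1.0
        return self.lo <= x < self.hi or (x == 1.0 and self.hi == 1.0)


def zoom(samples: List[float], n_max_segments: int = 16, split_after: int = 8) -> List[Segment]:
    """Toy rule: halve a segment once it has been sampled `split_after` times."""
    segments = [Segment(0.0, 1.0)]
    for x in samples:
        seg = next(s for s in segments if s.contains(x))
        seg.pulls += 1
        if seg.pulls >= split_after and len(segments) < n_max_segments:
            mid = (seg.lo + seg.hi) / 2
            segments.remove(seg)
            segments.extend([Segment(seg.lo, mid), Segment(mid, seg.hi)])
    return sorted(segments, key=lambda s: s.lo)


# Samples concentrated around 0.25, as if a policy kept choosing quantities there
samples = [0.25 + 0.01 * ((i % 7) - 3) for i in range(200)]
segments = zoom(samples)
narrowest = min(segments, key=lambda s: s.hi - s.lo)
print(len(segments), (narrowest.lo, narrowest.hi))
```

Because splits only happen where samples land, the narrowest segments end up around the heavily sampled region, while the rest of the interval stays coarse.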

References:
- [Multi-Armed Bandits in Metric Spaces (Kleinberg, Slivkins, and Upfal, 2008)](https://arxiv.org/pdf/0809.4882)

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

from pybandits.cmab import CmabBernoulli
from pybandits.quantitative_model import CmabZoomingModel

## Setup

First, we'll define actions with quantitative parameters. In this example, we'll use two actions, each with a one-dimensional quantitative parameter (e.g., price point or dosage level) ranging from 0 to 1. Unlike the SMAB model, here we also need to define contextual features.

In [None]:
# For reproducibility
np.random.seed(42)

# Number of context features
n_features = 3
# Maximum number of segments for each action
n_max_segments = 16
# Cold start parameters for the base model fitted on each segment
base_model_cold_start_kwargs = {
    "n_features": n_features,  # Number of context features
    "update_method": "VI",  # Variational Inference for Bayesian updates
}


# Define actions with zooming models
actions = {
    "action_1": CmabZoomingModel.cold_start(
        dimension=1, n_max_segments=n_max_segments, base_model_cold_start_kwargs=base_model_cold_start_kwargs
    ),
    "action_2": CmabZoomingModel.cold_start(
        dimension=1, n_max_segments=n_max_segments, base_model_cold_start_kwargs=base_model_cold_start_kwargs
    ),
}

Now we can initialize the CmabBernoulli bandit with our zooming models:

In [None]:
# Initialize the bandit
cmab = CmabBernoulli(actions=actions)

## Simulate Environment

Let's create a reward function that depends on the action, its quantitative parameter, and the context. For illustration purposes, we'll define that:

- `action_1` performs better when the first context feature is high and when the quantitative parameter is around 0.25
- `action_2` performs better when the second context feature is high and when the quantitative parameter is around 0.75

The reward probability follows a bell curve for the quantitative parameter and is also influenced by the context features.
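
Written out, the reward probability used by the simulation in this notebook is the following, where the symbols $q$ (quantity), $x$ (context vector), $q_a^{*}$ (peak location), and $j_a$ (index of the influential feature) are notation introduced here for illustration:

$$
p(a, q, x) = \min\!\Big(1,\ \max\!\Big(0,\ \exp\!\Big(-\frac{(q - q_a^{*})^2}{0.02}\Big)\,\big(0.5 + 0.25\, x_{j_a}\big)\Big)\Big)
$$

with $q_a^{*} = 0.25,\ j_a = 1$ for `action_1` and $q_a^{*} = 0.75,\ j_a = 2$ for `action_2`. The denominator $0.02 = 2\sigma^2$ corresponds to a bell curve with $\sigma = 0.1$. For example, at the optimal quantity with $x_{j_a} = 1$ the probability is $1 \cdot (0.5 + 0.25) = 0.75$, and for $x_{j_a} \le -2$ it is clipped to $0$.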

In [None]:
def reward_function(action, quantity, context):
    if action == "action_1":
        # Bell curve centered at 0.25 for the quantity
        # Influenced by first context feature
        quantity_component = np.exp(-((quantity - 0.25) ** 2) / 0.02)
        context_component = 0.5 + 0.5 * (context[0] / 2)  # First feature has influence
        prob = quantity_component * context_component
    else:  # action_2
        # Bell curve centered at 0.75 for the quantity
        # Influenced by second context feature
        quantity_component = np.exp(-((quantity - 0.75) ** 2) / 0.02)
        context_component = 0.5 + 0.5 * (context[1] / 2)  # Second feature has influence
        prob = quantity_component * context_component

    # Ensure probability is between 0 and 1
    prob = max(0, min(1, prob))

    return np.random.binomial(1, prob)

Let's visualize our reward functions to understand what the bandit needs to learn. We'll show the reward surfaces for different values of context:

In [None]:
x = np.linspace(0, 1, 100)

# Plot for three different contexts
contexts = [
    np.array([1.0, 0.0, 0.0]),  # High first feature
    np.array([0.0, 1.0, 0.0]),  # High second feature
    np.array([0.5, 0.5, 0.0]),  # Mixed features
]

plt.figure(figsize=(16, 5))
for i, context in enumerate(contexts, 1):
    plt.subplot(1, 3, i)

    y1 = [np.exp(-((xi - 0.25) ** 2) / 0.02) * (0.5 + 0.5 * (context[0] / 2)) for xi in x]
    y2 = [np.exp(-((xi - 0.75) ** 2) / 0.02) * (0.5 + 0.5 * (context[1] / 2)) for xi in x]

    plt.plot(x, y1, "b-", label="action_1")
    plt.plot(x, y2, "r-", label="action_2")
    plt.xlabel("Quantitative Parameter")
    plt.ylabel("Reward Probability")

    if i == 1:
        title = "Context: High Feature 1"
    elif i == 2:
        title = "Context: High Feature 2"
    else:
        title = "Context: Mixed Features"

    plt.title(title)
    plt.legend()
    plt.grid(True)

plt.tight_layout()
plt.show()

## Generate Synthetic Context Data

Let's create synthetic context data for our experiment:

In [None]:
# Generate random context data
n_batches = 10
batch_size = 100
n_rounds = n_batches * batch_size
raw_context_data = np.random.normal(0, 1, (n_rounds, n_features))

# Standardize the context data
scaler = StandardScaler()
context_data = scaler.fit_transform(raw_context_data)

# Preview the context data
pd.DataFrame(context_data[:5], columns=[f"Feature {i + 1}" for i in range(n_features)])

## Bandit Training Loop

Now, let's train our bandit by simulating interactions over several batches of rounds:

In [None]:
for t in range(n_batches):
    # Get the contexts for this batch
    current_context = context_data[t * batch_size : (t + 1) * batch_size]

    # Predict the best action and quantity for each context in the batch
    pred_actions, probs, weighted_sums = cmab.predict(context=current_context)
    chosen_actions = [a[0] for a in pred_actions]
    chosen_quantities = [a[1][0] for a in pred_actions]

    # Observe rewards, pairing each prediction with its own context row
    rewards = [
        reward_function(chosen_action, chosen_quantity, context_row)
        for chosen_action, chosen_quantity, context_row in zip(chosen_actions, chosen_quantities, current_context)
    ]

    # Update bandit
    cmab.update(actions=chosen_actions, rewards=rewards, context=current_context, quantities=chosen_quantities)

    # Print progress
    print(f"Completed {t + 1} batches")

## Examining Segment Adaptation

Let's look at the adaptive segmentation for each action to see how the model has split the quantitative space:

In [None]:
# Extract the segmentation for each action
action1_segments = list(cmab.actions["action_1"].segmented_actions.keys())
action2_segments = list(cmab.actions["action_2"].segmented_actions.keys())

# Print the number of segments
print(f"Number of segments for action_1: {len(action1_segments)}")
print(f"Number of segments for action_2: {len(action2_segments)}")

# Create a figure to visualize the segments
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 8))

# Plot action_1 segments
ax1.set_title("action_1 Segments")
for i, segment in enumerate(action1_segments):
    ax1.plot([segment.mins[0], segment.maxs[0]], [i, i], linewidth=5)
ax1.set_xlim(0, 1)
ax1.set_xlabel("Quantitative Parameter")
ax1.axvline(0.25, color="red", linestyle="--", label="Optimal Value")
ax1.legend()

# Plot action_2 segments
ax2.set_title("action_2 Segments")
for i, segment in enumerate(action2_segments):
    ax2.plot([segment.mins[0], segment.maxs[0]], [i, i], linewidth=5)
ax2.set_xlim(0, 1)
ax2.set_xlabel("Quantitative Parameter")
ax2.axvline(0.75, color="red", linestyle="--", label="Optimal Value")
ax2.legend()

plt.tight_layout()
plt.show()
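
To quantify the refinement rather than only read it off the plot, one option is to measure the width of the segment that contains a known optimum. The sketch below works on plain `(min, max)` tuples; `segment_width_at` is a hypothetical helper, not part of `pybandits`, and the interval list is made-up illustrative data rather than output from this notebook.

```python
from typing import List, Optional, Tuple

Interval = Tuple[float, float]


def segment_width_at(intervals: List[Interval], point: float) -> Optional[float]:
    """Return the width of the first interval containing `point`, or None if none does."""
    for lo, hi in intervals:
        if lo <= point <= hi:
            return hi - lo
    return None


# Made-up partition of [0, 1] that is finer around 0.25
intervals = [(0.0, 0.125), (0.125, 0.25), (0.25, 0.3125), (0.3125, 0.375), (0.375, 0.5), (0.5, 1.0)]
width_near_optimum = segment_width_at(intervals, 0.26)
width_far_away = segment_width_at(intervals, 0.8)
print(width_near_optimum, width_far_away)
```

With the segments extracted above, the tuples could be built from `segment.mins[0]` and `segment.maxs[0]`, the same attributes the plotting code uses.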

## Testing with Specific Contexts

Finally, let's test our trained bandit with specific contexts to see if it has learned the optimal policy:

In [None]:
# Define test contexts
test_contexts = [
    [2.0, -1.0, 0.0],  # High feature 1, low feature 2
    [-1.0, 2.0, 0.0],  # Low feature 1, high feature 2
    [1.0, 1.0, 0.0],  # High feature 1 and 2
    [-1.0, -1.0, 0.0],  # Low feature 1 and 2
]
test_contexts = scaler.transform(test_contexts)

# Test predictions
results = []
for context in test_contexts:
    context_reshaped = context.reshape(1, -1)
    pred_actions, probs, weighted_sums = cmab.predict(context=context_reshaped)
    chosen_action_quantity = pred_actions[0]
    chosen_action = chosen_action_quantity[0]
    chosen_quantity = chosen_action_quantity[1][0]
    chosen_action_probs = probs[0][chosen_action_quantity]

    # Look up the true optimal quantity for the chosen action
    # In a real application, you would have a method to test different quantities
    # Here we'll use our knowledge of the true optimal values
    if chosen_action == "action_1":
        optimal_quantity = 0.25
    else:
        optimal_quantity = 0.75

    # Estimate the expected reward by averaging repeated Bernoulli draws
    # (a single call to reward_function returns only one 0/1 sample)
    expected_reward = np.mean([reward_function(chosen_action, optimal_quantity, context) for _ in range(1000)])

    results.append(
        {
            "Context": context,
            "Chosen Action": chosen_action,
            "Chosen Quantity": chosen_quantity,
            "Action Probabilities": chosen_action_probs,
            "Optimal Quantity": optimal_quantity,
            "Expected Reward": expected_reward,
        }
    )

# Display results
for i, result in enumerate(results):
    context_type = ""
    if i == 0:
        context_type = "High feature 1, low feature 2"
    elif i == 1:
        context_type = "Low feature 1, high feature 2"
    elif i == 2:
        context_type = "High feature 1 and 2"
    elif i == 3:
        context_type = "Low feature 1 and 2"

    print(f"\nTest {i + 1}: {context_type}")
    print(f"Context: {result['Context']}")
    print(f"Chosen Action: {result['Chosen Action']}")
    print(f"Action Probabilities: {result['Action Probabilities']}")
    print(f"Optimal Quantity: {result['Optimal Quantity']:.2f}")
    print(f"Expected Reward: {result['Expected Reward']}")
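
One way to judge a chosen action and quantity is its regret: the gap between the best achievable reward probability in a context and the probability of the choice actually made. The sketch below is a standalone illustration that mirrors the shape of the simulated environment in this notebook; `bell_probability`, `ENVIRONMENT`, and `regret` are hypothetical names, and the best value is approximated by a grid search over quantities, which is an assumption of this example.

```python
import math


def bell_probability(peak, feature_idx):
    """Build a reward-probability function shaped like the simulated environment."""

    def prob(quantity, context):
        quantity_component = math.exp(-((quantity - peak) ** 2) / 0.02)
        context_component = 0.5 + 0.5 * (context[feature_idx] / 2)
        return max(0.0, min(1.0, quantity_component * context_component))

    return prob


# Hypothetical stand-in for the environment: one probability function per action
ENVIRONMENT = {"action_1": bell_probability(0.25, 0), "action_2": bell_probability(0.75, 1)}


def regret(chosen_action, chosen_quantity, context, n_grid=101):
    """Best reward probability over all actions and a quantity grid, minus the chosen one."""
    grid = [i / (n_grid - 1) for i in range(n_grid)]
    best = max(prob(q, context) for prob in ENVIRONMENT.values() for q in grid)
    return best - ENVIRONMENT[chosen_action](chosen_quantity, context)


context = [2.0, -1.0, 0.0]  # High feature 1, low feature 2
regret_best_choice = regret("action_1", 0.25, context)
regret_wrong_action = regret("action_2", 0.75, context)
print(regret_best_choice, regret_wrong_action)
```

A regret close to zero for the bandit's choices across the test contexts would suggest that it has learned both which action to prefer and roughly where the optimal quantity lies.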

## Conclusion

The CMAB Zooming model extends the concept of adaptive discretization to contextual bandits. This approach allows efficient exploration and exploitation of continuous action parameters while taking context into account. It adaptively refines the segmentation of the parameter space, concentrating more segments in high-reward regions for finer discretization.

This approach is particularly useful when:
1. Actions have continuous parameters that affect rewards
2. The reward function depends on both context and action parameters
3. The optimal parameter values may vary across different contexts
4. The action space needs to be adaptively discretized for efficient exploration

Real-world applications include:
- Personalized pricing: Find optimal prices (continuous parameter) based on customer features (context)
- Content recommendation: Optimize content parameters (e.g., length, complexity) based on user demographics
- Medical dosing: Determine optimal medication dosages based on patient characteristics
- Ad campaign optimization: Find best bid values based on ad placement and target audience