pybandits

pybandits.smab

class pybandits.smab.BaseSmabBernoulli(epsilon: Float_0_1 | None = None, default_action: ActionId | None = None, *, actions_manager: SmabActionsManager[BaseBeta], strategy: Strategy)

Bases: BaseMab, ABC

Base model for a Stochastic Bernoulli Multi-Armed Bandit with Thompson Sampling.

Parameters:
  • actions (Dict[ActionId, BaseBeta]) – The dictionary of possible actions and their associated model.

  • strategy (Strategy) – The strategy used to select actions.

actions_manager: SmabActionsManager[BaseBeta]
model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

predict(n_samples: Annotated[int, Gt(gt=0)] = 1, forbidden_actions: Set[ActionId] | None = None) SmabPredictions

Predict actions.

Parameters:
  • n_samples (int > 0, default=1) – Number of samples to predict.

  • forbidden_actions (Optional[Set[ActionId]], default=None) – Set of forbidden actions. If specified, the model discards the forbidden_actions and only considers the remaining allowed_actions. By default, the model considers all actions as allowed. Note that actions = allowed_actions ∪ forbidden_actions.

Returns:

  • actions (List[ActionId] of shape (n_samples,)) – The actions selected by the multi-armed bandit model.

  • probs (List[Dict[ActionId, Probability]] of shape (n_samples,)) – The probabilities of getting a positive reward for each action.

update(actions: List[ActionId], rewards: List[BinaryReward] | List[List[BinaryReward]], actions_memory: List[ActionId] | None = None, rewards_memory: List[BinaryReward] | List[List[BinaryReward]] | None = None)

Update the stochastic Bernoulli bandit given the list of selected actions and their corresponding binary rewards.

Parameters:
  • actions (List[ActionId] of shape (n_samples,), e.g. ['a1', 'a2', 'a3', 'a4', 'a5']) – The selected action for each sample.

  • rewards (List[Union[BinaryReward, List[BinaryReward]]] of shape (n_samples, n_objectives)) –

    The binary reward for each sample.
    If strategy is not MultiObjectiveBandit, rewards should be a list, e.g.

    rewards = [1, 0, 1, 1, 1, …]

    If strategy is MultiObjectiveBandit, rewards should be a list of lists, e.g. (with n_objectives=2):

    rewards = [[1, 1], [1, 0], [1, 1], [1, 0], [1, 1], …]

  • actions_memory (Optional[List[ActionId]]) – List of previously selected actions.

  • rewards_memory (Optional[Union[List[BinaryReward], List[List[BinaryReward]]]]) – List of previously collected rewards.
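
For concreteness, a minimal sketch of the two documented reward layouts, assuming mab is an already-built SmabBernoulli-style instance and the action ids are illustrative:

# Single-objective strategy: one binary reward per selected action.
mab.update(
    actions=["a1", "a2", "a1", "a3", "a2"],
    rewards=[1, 0, 1, 1, 0],
)

# Multi-objective strategy (n_objectives=2): one list of binary rewards per sample.
mab.update(
    actions=["a1", "a2"],
    rewards=[[1, 1], [1, 0]],
)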

class pybandits.smab.SmabBernoulli(epsilon: Float_0_1 | None = None, default_action: ActionId | None = None, *, actions_manager: SmabActionsManager[Beta], strategy: ClassicBandit)

Bases: BaseSmabBernoulli

Stochastic Bernoulli Multi-Armed Bandit with Thompson Sampling.

References

Analysis of Thompson Sampling for the Multi-armed Bandit Problem (Agrawal and Goyal, 2012) http://proceedings.mlr.press/v23/agrawal12/agrawal12.pdf

Parameters:
  • actions_manager (SmabActionsManagerSO) – The manager for actions and their associated models.

  • strategy (ClassicBandit) – The strategy used to select actions.

actions_manager: SmabActionsManager[Beta]
model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

strategy: ClassicBandit
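
A minimal end-to-end sketch based on the signatures documented above; it assumes the SmabActionsManagerSO alias (documented in pybandits.actions_manager) accepts the actions dictionary directly and that predict returns the (actions, probs) pair described above:

from pybandits.actions_manager import SmabActionsManagerSO
from pybandits.model import Beta
from pybandits.smab import SmabBernoulli
from pybandits.strategy import ClassicBandit

# One Beta prior per action (defaults: n_successes=1, n_failures=1).
mab = SmabBernoulli(
    actions_manager=SmabActionsManagerSO(actions={"a1": Beta(), "a2": Beta(), "a3": Beta()}),
    strategy=ClassicBandit(),
)

# Recommend actions for 5 samples, then feed back the observed binary rewards.
actions, probs = mab.predict(n_samples=5)
mab.update(actions=actions, rewards=[1, 0, 1, 1, 0])
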
class pybandits.smab.SmabBernoulliBAI(epsilon: Float_0_1 | None = None, default_action: ActionId | None = None, *, actions_manager: SmabActionsManager[Beta], strategy: BestActionIdentificationBandit)

Bases: BaseSmabBernoulli

Stochastic Bernoulli Multi-Armed Bandit with Thompson Sampling, and Best Action Identification strategy.

References

Analysis of Thompson Sampling for the Multi-armed Bandit Problem (Agrawal and Goyal, 2012) http://proceedings.mlr.press/v23/agrawal12/agrawal12.pdf

Parameters:
  • actions_manager (SmabActionsManagerSO) – The manager for actions and their associated models.

  • strategy (BestActionIdentificationBandit) – The strategy used to select actions.

actions_manager: SmabActionsManager[Beta]
model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

strategy: BestActionIdentificationBandit
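
A short sketch of the Best Action Identification variant, under the same assumptions as the SmabBernoulli example above; exploit_p=0.8 is an illustrative value:

from pybandits.actions_manager import SmabActionsManagerSO
from pybandits.model import Beta
from pybandits.smab import SmabBernoulliBAI
from pybandits.strategy import BestActionIdentificationBandit

# With probability 0.8 recommend the best action, otherwise the runner-up.
mab = SmabBernoulliBAI(
    actions_manager=SmabActionsManagerSO(actions={"a1": Beta(), "a2": Beta()}),
    strategy=BestActionIdentificationBandit(exploit_p=0.8),
)
actions, probs = mab.predict(n_samples=10)
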
class pybandits.smab.SmabBernoulliCC(epsilon: Float_0_1 | None = None, default_action: ActionId | None = None, *, actions_manager: SmabActionsManager[BetaCC], strategy: CostControlBandit)

Bases: BaseSmabBernoulli

Stochastic Bernoulli Multi-Armed Bandit with Thompson Sampling, and Cost Control strategy.

The sMAB is extended to include a control of the action cost. Each action is associated with a predefined “cost”. At prediction time, the model considers the actions whose expected rewards are above a pre-defined lower bound. Among these actions, the one with the lowest associated cost is recommended. The expected reward interval for feasible actions is defined as [(1-subsidy_factor) * max_p, max_p], where max_p is the highest sampled expected reward.

References

Thompson Sampling for Contextual Bandit Problems with Auxiliary Safety Constraints (Daulton et al., 2019) https://arxiv.org/abs/1911.00638

Multi-Armed Bandits with Cost Subsidy (Sinha et al., 2021) https://arxiv.org/abs/2011.01488

Parameters:
  • actions_manager (SmabActionsManagerCC) – The manager for actions and their associated models.

  • strategy (CostControlBandit) – The strategy used to select actions.

actions_manager: SmabActionsManager[BetaCC]
model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

strategy: CostControlBandit
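
A sketch of the cost-control variant, assuming the SmabActionsManagerCC alias accepts the actions dictionary directly; the costs and subsidy factor are illustrative:

from pybandits.actions_manager import SmabActionsManagerCC
from pybandits.model import BetaCC
from pybandits.smab import SmabBernoulliCC
from pybandits.strategy import CostControlBandit

# Feasible actions are those whose sampled reward lies in
# [(1 - subsidy_factor) * max_p, max_p]; the cheapest feasible action is recommended.
mab = SmabBernoulliCC(
    actions_manager=SmabActionsManagerCC(
        actions={"cheap": BetaCC(cost=1.0), "expensive": BetaCC(cost=5.0)}
    ),
    strategy=CostControlBandit(subsidy_factor=0.3),
)
actions, probs = mab.predict(n_samples=3)
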
class pybandits.smab.SmabBernoulliMO(epsilon: Float_0_1 | None = None, default_action: ActionId | None = None, *, actions_manager: SmabActionsManager[BetaMO], strategy: MultiObjectiveBandit)

Bases: BaseSmabBernoulli

Stochastic Bernoulli Multi-Armed Bandit with Thompson Sampling, and Multi-Objectives strategy.

The reward pertaining to an action is a multidimensional vector instead of a scalar value. In this setting, different actions are compared according to Pareto order between their expected reward vectors, and those actions whose expected rewards are not inferior to that of any other actions are called Pareto optimal actions, all of which constitute the Pareto front.

References

Thompson Sampling for Multi-Objective Multi-Armed Bandits Problem (Yahyaa and Manderick, 2015) https://www.researchgate.net/publication/272823659_Thompson_Sampling_for_Multi-Objective_Multi-Armed_Bandits_Problem

Parameters:
  • actions_manager (SmabActionsManagerMO) – The manager for actions and their associated models.

  • strategy (MultiObjectiveBandit) – The strategy used to select actions.

actions_manager: SmabActionsManager[BetaMO]
model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

strategy: MultiObjectiveBandit
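
A sketch of the multi-objective variant with two objectives, assuming the SmabActionsManagerMO alias accepts the actions dictionary directly:

from pybandits.actions_manager import SmabActionsManagerMO
from pybandits.model import Beta, BetaMO
from pybandits.smab import SmabBernoulliMO
from pybandits.strategy import MultiObjectiveBandit

# Each action holds one Beta model per objective (here: 2 objectives).
mab = SmabBernoulliMO(
    actions_manager=SmabActionsManagerMO(
        actions={
            "a1": BetaMO(models=[Beta(), Beta()]),
            "a2": BetaMO(models=[Beta(), Beta()]),
        }
    ),
    strategy=MultiObjectiveBandit(),
)
actions, probs = mab.predict(n_samples=2)
mab.update(actions=actions, rewards=[[1, 0], [1, 1]])  # one reward per objective per sample
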
class pybandits.smab.SmabBernoulliMOCC(epsilon: Float_0_1 | None = None, default_action: ActionId | None = None, *, actions_manager: SmabActionsManager[BetaMOCC], strategy: MultiObjectiveCostControlBandit)

Bases: BaseSmabBernoulli

Stochastic Bernoulli Multi-Armed Bandit with Thompson Sampling implementation for Multi-Objective (MO) with Cost Control (CC) strategy.

This bandit allows the reward to be a multidimensional vector and adds a control of the action cost. It merges the Multi-Objective and Cost Control strategies.

Parameters:
  • actions_manager (SmabActionsManagerMOCC) – The manager for actions and their associated models.

  • strategy (MultiObjectiveCostControlBandit) – The strategy used to select actions.

actions_manager: SmabActionsManager[BetaMOCC]
model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

strategy: MultiObjectiveCostControlBandit

pybandits.cmab

class pybandits.cmab.BaseCmabBernoulli(epsilon: Float_0_1 | None = None, default_action: ActionId | None = None, *, actions_manager: CmabActionsManager[BaseBayesianLogisticRegression], strategy: Strategy, predict_with_proba: bool, predict_actions_randomly: bool)

Bases: BaseMab, ABC

Base model for a Contextual Multi-Armed Bandit for Bernoulli bandits with Thompson Sampling.

Parameters:
  • actions (Dict[ActionId, BayesianLogisticRegression]) – The dictionary of possible actions and their associated model.

  • strategy (Strategy) – The strategy used to select actions.

  • predict_with_proba (bool) – If True predict with sampled probabilities, else predict with weighted sums.

  • predict_actions_randomly (bool) – If True predict actions randomly (where each action has equal probability to be selected), else predict with the bandit strategy.

actions_manager: CmabActionsManager[BaseBayesianLogisticRegression]
model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

predict(context: _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes], forbidden_actions: Set[ActionId] | None = None) CmabPredictions

Predict actions.

Parameters:
  • context (ArrayLike of shape (n_samples, n_features)) – Matrix of contextual features.

  • forbidden_actions (Optional[Set[ActionId]], default=None) – Set of forbidden actions. If specified, the model discards the forbidden_actions and only considers the remaining allowed_actions. By default, the model considers all actions as allowed. Note that actions = allowed_actions ∪ forbidden_actions.

Returns:

  • actions (List[ActionId] of shape (n_samples,)) – The actions selected by the multi-armed bandit model.

  • probs (List[Dict[ActionId, Probability]] of shape (n_samples,)) – The probabilities of getting a positive reward for each action.

  • ws (List[Dict[ActionId, float]]) – The weighted sum of logistic regression logits.

predict_actions_randomly: bool
predict_with_proba: bool
update(actions: List[ActionId], rewards: List[BinaryReward] | List[List[BinaryReward]], context: _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes], actions_memory: List[ActionId] | None = None, rewards_memory: List[BinaryReward] | List[List[BinaryReward]] | None = None, context_memory: _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes] | None = None)

Update the contextual Bernoulli bandit given the list of selected actions and their corresponding binary rewards.

Parameters:
  • context (ArrayLike of shape (n_samples, n_features)) – Matrix of contextual features.

  • actions (List[ActionId] of shape (n_samples,), e.g. ['a1', 'a2', 'a3', 'a4', 'a5']) – The selected action for each sample.

  • rewards (List[Union[BinaryReward, List[BinaryReward]]] of shape (n_samples, n_objectives)) –

    The binary reward for each sample.
    If strategy is not MultiObjectiveBandit, rewards should be a list, e.g.

    rewards = [1, 0, 1, 1, 1, …]

    If strategy is MultiObjectiveBandit, rewards should be a list of lists, e.g. (with n_objectives=2):

    rewards = [[1, 1], [1, 0], [1, 1], [1, 0], [1, 1], …]

  • actions_memory (Optional[List[ActionId]]) – List of previously selected actions.

  • rewards_memory (Optional[Union[List[BinaryReward], List[List[BinaryReward]]]]) – List of previously collected rewards.

  • context_memory (Optional[ArrayLike] of shape (n_samples, n_features)) – Matrix of contextual features.

class pybandits.cmab.CmabBernoulli(epsilon: Float_0_1 | None = None, default_action: ActionId | None = None, *, actions_manager: CmabActionsManager[BayesianLogisticRegression], strategy: ClassicBandit, predict_with_proba: bool = False, predict_actions_randomly: bool = False)

Bases: BaseCmabBernoulli

Contextual Bernoulli Multi-Armed Bandit with Thompson Sampling.

References

Thompson Sampling for Contextual Bandits with Linear Payoffs (Agrawal and Goyal, 2014) https://arxiv.org/pdf/1209.3352.pdf

Parameters:
  • actions_manager (CmabActionsManagerSO) – The manager for actions and their associated models.

  • strategy (ClassicBandit) – The strategy used to select actions.

  • predict_with_proba (bool) – If True predict with sampled probabilities, else predict with weighted sums

  • predict_actions_randomly (bool) – If True predict actions randomly (where each action has equal probability to be selected), else predict with the bandit strategy.

actions_manager: CmabActionsManager[BayesianLogisticRegression]
model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

predict_actions_randomly: bool
predict_with_proba: bool
strategy: ClassicBandit
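
A minimal contextual sketch based on the documented signatures; it assumes the CmabActionsManagerSO alias accepts the actions dictionary directly and that predict returns the (actions, probs, ws) triple described above:

import numpy as np

from pybandits.actions_manager import CmabActionsManagerSO
from pybandits.cmab import CmabBernoulli
from pybandits.model import BayesianLogisticRegression
from pybandits.strategy import ClassicBandit

n_features = 3
mab = CmabBernoulli(
    actions_manager=CmabActionsManagerSO(
        actions={
            "a1": BayesianLogisticRegression.cold_start(n_features=n_features),
            "a2": BayesianLogisticRegression.cold_start(n_features=n_features),
        }
    ),
    strategy=ClassicBandit(),
)

context = np.random.uniform(-1.0, 1.0, size=(5, n_features))  # 5 samples, 3 features
actions, probs, weighted_sums = mab.predict(context=context)
mab.update(context=context, actions=actions, rewards=[1, 0, 1, 1, 0])
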
class pybandits.cmab.CmabBernoulliBAI(epsilon: Float_0_1 | None = None, default_action: ActionId | None = None, *, actions_manager: CmabActionsManager[BayesianLogisticRegression], strategy: BestActionIdentificationBandit, predict_with_proba: bool = False, predict_actions_randomly: bool = False)

Bases: BaseCmabBernoulli

Contextual Bernoulli Multi-Armed Bandit with Thompson Sampling, and Best Action Identification strategy.

References

Analysis of Thompson Sampling for the Multi-armed Bandit Problem (Agrawal and Goyal, 2012) http://proceedings.mlr.press/v23/agrawal12/agrawal12.pdf

Parameters:
  • actions_manager (CmabActionsManagerSO) – The manager for actions and their associated models.

  • strategy (BestActionIdentificationBandit) – The strategy used to select actions.

  • predict_with_proba (bool) – If True predict with sampled probabilities, else predict with weighted sums

  • predict_actions_randomly (bool) – If True predict actions randomly (where each action has equal probability to be selected), else predict with the bandit strategy.

actions_manager: CmabActionsManager[BayesianLogisticRegression]
model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

predict_actions_randomly: bool
predict_with_proba: bool
strategy: BestActionIdentificationBandit
class pybandits.cmab.CmabBernoulliCC(epsilon: Float_0_1 | None = None, default_action: ActionId | None = None, *, actions_manager: CmabActionsManager[BayesianLogisticRegressionCC], strategy: CostControlBandit, predict_with_proba: bool = True, predict_actions_randomly: bool = False)

Bases: BaseCmabBernoulli

Contextual Bernoulli Multi-Armed Bandit with Thompson Sampling, and Cost Control strategy.

The cMAB is extended to include a control of the action cost. Each action is associated with a predefined “cost”. At prediction time, the model considers the actions whose expected rewards are above a pre-defined lower bound. Among these actions, the one with the lowest associated cost is recommended. The expected reward interval for feasible actions is defined as [(1-subsidy_factor) * max_p, max_p], where max_p is the highest sampled expected reward.

References

Thompson Sampling for Contextual Bandit Problems with Auxiliary Safety Constraints (Daulton et al., 2019) https://arxiv.org/abs/1911.00638

Multi-Armed Bandits with Cost Subsidy (Sinha et al., 2021) https://arxiv.org/abs/2011.01488

Parameters:
  • actions_manager (CmabActionsManagerCC) – The manager for actions and their associated models.

  • strategy (CostControlBandit) – The strategy used to select actions.

  • predict_with_proba (bool) – If True predict with sampled probabilities, else predict with weighted sums

  • predict_actions_randomly (bool) – If True predict actions randomly (where each action has equal probability to be selected), else predict with the bandit strategy.

actions_manager: CmabActionsManager[BayesianLogisticRegressionCC]
model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

predict_actions_randomly: bool
predict_with_proba: bool
strategy: CostControlBandit
pybandits.cmab.choice(a, size=None, replace=True, p=None)

Generates a random sample from a given 1-D array

Added in version 1.7.0.

Note

New code should use the numpy.random.Generator.choice method of a numpy.random.Generator instance instead; see the NumPy random Quick start guide.

Parameters:
  • a (1-D array-like or int) – If an ndarray, a random sample is generated from its elements. If an int, the random sample is generated as if it were np.arange(a)

  • size (int or tuple of ints, optional) – Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. Default is None, in which case a single value is returned.

  • replace (boolean, optional) – Whether the sample is with or without replacement. Default is True, meaning that a value of a can be selected multiple times.

  • p (1-D array-like, optional) – The probabilities associated with each entry in a. If not given, the sample assumes a uniform distribution over all entries in a.

Returns:

samples – The generated random samples

Return type:

single item or ndarray

Raises:

ValueError – If a is an int and less than zero, if a or p are not 1-dimensional, if a is an array-like of size 0, if p is not a vector of probabilities, if a and p have different lengths, or if replace=False and the sample size is greater than the population size

See also

randint, shuffle, permutation

random.Generator.choice

which should be used in new code

Notes

Setting user-specified probabilities through p uses a more general but less efficient sampler than the default. The general sampler produces a different sample than the optimized sampler even if each element of p is 1 / len(a).

Sampling random rows from a 2-D array is not possible with this function, but is possible with Generator.choice through its axis keyword.

Examples

Generate a uniform random sample from np.arange(5) of size 3:

>>> np.random.choice(5, 3)
array([0, 3, 4]) # random
>>> #This is equivalent to np.random.randint(0,5,3)

Generate a non-uniform random sample from np.arange(5) of size 3:

>>> np.random.choice(5, 3, p=[0.1, 0, 0.3, 0.6, 0])
array([3, 3, 0]) # random

Generate a uniform random sample from np.arange(5) of size 3 without replacement:

>>> np.random.choice(5, 3, replace=False)
array([3,1,0]) # random
>>> #This is equivalent to np.random.permutation(np.arange(5))[:3]

Generate a non-uniform random sample from np.arange(5) of size 3 without replacement:

>>> np.random.choice(5, 3, replace=False, p=[0.1, 0, 0.3, 0.6, 0])
array([2, 3, 0]) # random

Any of the above can be repeated with an arbitrary array-like instead of just integers. For instance:

>>> aa_milne_arr = ['pooh', 'rabbit', 'piglet', 'Christopher']
>>> np.random.choice(aa_milne_arr, 5, p=[0.5, 0.1, 0.1, 0.3])
array(['pooh', 'pooh', 'pooh', 'Christopher', 'piglet'], # random
      dtype='<U11')

pybandits.model

class pybandits.model.BaseBayesianLogisticRegression(*, n_successes: Annotated[int, Gt(gt=0)] = 1, n_failures: Annotated[int, Gt(gt=0)] = 1, alpha: StudentT, betas: Annotated[List[StudentT], MinLen(min_length=1)], update_method: Literal['MCMC', 'VI'] = 'MCMC', update_kwargs: dict | None = None)

Bases: Model, ABC

Base Bayesian Logistic Regression model.

It is modeled as:

y = sigmoid(alpha + beta1 * x1 + beta2 * x2 + … + betaN * xN)

where the alpha and betas coefficients are Student’s t-distributions.

Parameters:
  • alpha (StudentT) – Student’s t-distribution of the alpha coefficient.

  • betas (List[StudentT]) – Student’s t-distributions of the beta coefficients.

  • update_method (UpdateMethods, defaults to "MCMC") – The strategy for computing posterior quantities of the Bayesian models in the update function, such as Markov chain Monte Carlo (“MCMC”) or Variational Inference (“VI”). Check UpdateMethods in pybandits.model for the full list.

  • update_kwargs (Optional[dict], uses default values if not specified) – Additional arguments to pass to the update method.

alpha: StudentT
arrange_update_kwargs()
betas: List[StudentT]
check_context_matrix(context: ndarray)

Check and cast context matrix.

Parameters:

context (np.ndarray of shape (n_samples, n_features)) – Matrix of contextual features.

Returns:

context – Matrix of contextual features.

Return type:

pandas DataFrame of shape (n_samples, n_features)

classmethod cold_start(n_features: Annotated[int, Gt(gt=0)], update_method: Literal['MCMC', 'VI'] = 'MCMC', update_kwargs: dict | None = None, **kwargs) BayesianLogisticRegression

Utility function to create a Bayesian Logistic Regression model or child model with cost control, with default parameters.

It is modeled as:

y = sigmoid(alpha + beta1 * x1 + beta2 * x2 + … + betaN * xN)

where the alpha and betas coefficients are Student’s t-distributions.

Parameters:
  • n_features (PositiveInt) – The number of betas of the Bayesian Logistic Regression model. This is also the number of features expected in the context matrix.

  • update_method (UpdateMethods, defaults to "MCMC") – The strategy for computing posterior quantities of the Bayesian models in the update function, such as Markov chain Monte Carlo (“MCMC”) or Variational Inference (“VI”). Check UpdateMethods in pybandits.model for the full list.

  • update_kwargs (Optional[dict], uses default values if not specified) – Additional arguments to pass to the update method.

  • kwargs (Dict[str, Any]) – Additional arguments for the Bayesian Logistic Regression child model.

Returns:

blr – The Bayesian Logistic Regression model.

Return type:

BayesianLogisticRegression
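
A brief sketch of cold_start usage; the cost value is illustrative, update_method is one of the documented "MCMC"/"VI" options, and passing the cost through **kwargs is an assumption based on the kwargs description above:

from pybandits.model import BayesianLogisticRegression, BayesianLogisticRegressionCC

# Default priors: alpha and each of the n_features betas start as StudentT distributions.
blr = BayesianLogisticRegression.cold_start(n_features=3)

# Variational Inference instead of MCMC for the posterior update.
blr_vi = BayesianLogisticRegression.cold_start(n_features=3, update_method="VI")

# Extra fields of child models (e.g. the cost of the cost-control variant) go through **kwargs.
blr_cc = BayesianLogisticRegressionCC.cold_start(n_features=3, cost=2.0)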

model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_post_init(context: Any, /) None

This function is meant to behave like a BaseModel method to initialise private attributes.

It takes context as an argument since that’s what pydantic-core passes when calling it.

Parameters:
  • self – The BaseModel instance.

  • context – The context.

sample_proba(context: ndarray) Tuple[Probability, float]

Compute the probability of getting a positive reward from the sampled regression coefficients and the context.

Parameters:

context (np.ndarray) – Context matrix of shape (n_samples, n_features).

Returns:

  • prob (ndarray of shape (n_samples)) – Probability of getting a positive reward.

  • weighted_sum (ndarray of shape (n_samples)) – Weighted sums between contextual feature values and sampled coefficients.

update_kwargs: dict | None
update_method: Literal['MCMC', 'VI']
class pybandits.model.BaseBeta(*, n_successes: Annotated[int, Gt(gt=0)] = 1, n_failures: Annotated[int, Gt(gt=0)] = 1)

Bases: Model

Beta Distribution model for Bernoulli multi-armed bandits.

classmethod both_or_neither_counters_are_defined(values)
property mean: float

The success rate i.e. n_successes / (n_successes + n_failures).

model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

sample_proba() Probability

Sample the probability of getting a positive reward.

Returns:

prob – Probability of getting a positive reward.

Return type:

Probability

property std: float

The corrected standard deviation (Bessel’s correction) of the binary distribution of successes and failures.

class pybandits.model.BaseModel

Bases: PyBanditsBaseModel, ABC

model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

abstract reset()

Reset the model.

abstract sample_proba() Probability

Sample the probability of getting a positive reward.

abstract update(rewards: List[BinaryReward] | List[List[BinaryReward]], **kwargs)

Update the model.

Parameters:

rewards (Union[List[BinaryReward], List[List[BinaryReward]]]) – A list of binary rewards.

class pybandits.model.BayesianLogisticRegression(*, n_successes: Annotated[int, Gt(gt=0)] = 1, n_failures: Annotated[int, Gt(gt=0)] = 1, alpha: StudentT, betas: Annotated[List[StudentT], MinLen(min_length=1)], update_method: Literal['MCMC', 'VI'] = 'MCMC', update_kwargs: dict | None = None)

Bases: BaseBayesianLogisticRegression

Bayesian Logistic Regression model.

It is modeled as:

y = sigmoid(alpha + beta1 * x1 + beta2 * x2 + … + betaN * xN)

where the alpha and betas coefficients are Student’s t-distributions.

Parameters:
  • alpha (StudentT) – Student’s t-distribution of the alpha coefficient.

  • betas (List[StudentT]) – Student’s t-distributions of the beta coefficients.

  • update_method (UpdateMethods, defaults to "MCMC") – The strategy for computing posterior quantities of the Bayesian models in the update function, such as Markov chain Monte Carlo (“MCMC”) or Variational Inference (“VI”). Check UpdateMethods in pybandits.model for the full list.

  • update_kwargs (Optional[dict], uses default values if not specified) – Additional arguments to pass to the update method.

model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_post_init(context: Any, /) None

This function is meant to behave like a BaseModel method to initialise private attributes.

It takes context as an argument since that’s what pydantic-core passes when calling it.

Parameters:
  • self – The BaseModel instance.

  • context – The context.

class pybandits.model.BayesianLogisticRegressionCC(*, n_successes: Annotated[int, Gt(gt=0)] = 1, n_failures: Annotated[int, Gt(gt=0)] = 1, alpha: StudentT, betas: Annotated[List[StudentT], MinLen(min_length=1)], update_method: Literal['MCMC', 'VI'] = 'MCMC', update_kwargs: dict | None = None, cost: Annotated[float, Ge(ge=0)])

Bases: BaseBayesianLogisticRegression

Bayesian Logistic Regression model with cost control.

It is modeled as:

y = sigmoid(alpha + beta1 * x1 + beta2 * x2 + … + betaN * xN)

where the alpha and betas coefficients are Student’s t-distributions.

Parameters:
  • alpha (StudentT) – Student’s t-distribution of the alpha coefficient.

  • betas (List[StudentT]) – Student’s t-distributions of the beta coefficients.

  • update_method (UpdateMethods, defaults to "MCMC") – The strategy for computing posterior quantities of the Bayesian models in the update function, such as Markov chain Monte Carlo (“MCMC”) or Variational Inference (“VI”). Check UpdateMethods in pybandits.model for the full list.

  • update_kwargs (Optional[dict], uses default values if not specified) – Additional arguments to pass to the update method.

  • cost (NonNegativeFloat) – Cost associated to the Bayesian Logistic Regression model.

cost: Annotated[float, Ge(ge=0)]
model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_post_init(context: Any, /) None

This function is meant to behave like a BaseModel method to initialise private attributes.

It takes context as an argument since that’s what pydantic-core passes when calling it.

Parameters:
  • self – The BaseModel instance.

  • context – The context.

class pybandits.model.Beta(*, n_successes: Annotated[int, Gt(gt=0)] = 1, n_failures: Annotated[int, Gt(gt=0)] = 1)

Bases: BaseBeta

Beta Distribution model for Bernoulli multi-armed bandits.

model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
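
A small sketch of the Beta counters in use; the counts are illustrative:

from pybandits.model import Beta

beta = Beta(n_successes=10, n_failures=5)
beta.mean            # 10 / (10 + 5) = 0.666...
beta.sample_proba()  # one random draw from the Beta posterior, as used by Thompson Sampling
beta.update(rewards=[1, 1, 0])  # counters become n_successes=12, n_failures=6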

class pybandits.model.BetaCC(*, n_successes: Annotated[int, Gt(gt=0)] = 1, n_failures: Annotated[int, Gt(gt=0)] = 1, cost: Annotated[float, Ge(ge=0)])

Bases: BaseBeta

Beta Distribution model for Bernoulli multi-armed bandits with cost control.

Parameters:

cost (NonNegativeFloat) – Cost associated to the Beta distribution.

cost: Annotated[float, Ge(ge=0)]
model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class pybandits.model.BetaMO(*, models: List[Beta])

Bases: ModelMO

Beta Distribution model for Bernoulli multi-armed bandits with multi-objectives.

Parameters:

models (List[Beta] of shape (n_objectives,)) – List of Beta distributions.

classmethod cold_start(n_objectives: Annotated[int, Gt(gt=0)], **kwargs) BetaMO

Utility function to create a multi-objective Beta model with cold start.

Parameters:
  • n_objectives (PositiveInt) – The number of objectives.

Returns:

beta_mo – The multi-objective Beta model.

Return type:

BetaMO

model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

models: List[Beta]
sample_proba() List[Probability]

Sample the probability of getting a positive reward.

Returns:

prob – Probabilities of getting a positive reward for each objective.

Return type:

List[Probability]
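
A brief sketch, assuming three objectives; the sampled values in the comment are illustrative:

from pybandits.model import BetaMO

beta_mo = BetaMO.cold_start(n_objectives=3)  # one fresh Beta model per objective
beta_mo.sample_proba()                       # e.g. [0.42, 0.71, 0.13]: one draw per objective
beta_mo.update(rewards=[[1, 0, 1]])          # one reward list per sample, one entry per objective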

class pybandits.model.BetaMOCC(*, models: List[Beta], cost: Annotated[float, Ge(ge=0)])

Bases: BetaMO

Beta Distribution model for Bernoulli multi-armed bandits with multi-objectives and cost control.

Parameters:
  • models (List[BetaCC] of shape (n_objectives,)) – List of Beta distributions.

  • cost (NonNegativeFloat) – Cost associated to the Beta distribution.

cost: Annotated[float, Ge(ge=0)]
model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class pybandits.model.Model(*, n_successes: Annotated[int, Gt(gt=0)] = 1, n_failures: Annotated[int, Gt(gt=0)] = 1)

Bases: BaseModel, ABC

Class to model the prior distributions.

Parameters:
  • n_successes (PositiveInt = 1) – Counter of the number of successes.

  • n_failures (PositiveInt = 1) – Counter of the number of failures.

property count: int

The total amount of successes and failures collected.

model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

n_failures: Annotated[int, Gt(gt=0)]
n_successes: Annotated[int, Gt(gt=0)]
reset()

Reset the model.

update(rewards: List[BinaryReward], **kwargs)

Update n_successes and n_failures.

Parameters:

rewards (List[BinaryReward]) – A list of binary rewards.

class pybandits.model.ModelMO(*, models: Annotated[List[Model], MinLen(min_length=1)])

Bases: BaseModel, ABC

Multi-objective extension of Model.

Parameters:

models (List[Model]) – List of models.

model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

models: List[Model]
reset()

Reset the model.

sample_proba(**kwargs) List[Probability]

Sample the probability of getting a positive reward.

Returns:

prob – Probabilities of getting a positive reward for each objective.

Return type:

List[Probability]

update(rewards: List[List[BinaryReward]], **kwargs)

Update the Beta model using the provided rewards.

Parameters:
  • rewards (List[List[BinaryReward]]) – A list of rewards, where each reward is in turn a list containing the reward of the Beta model associated to each objective. For example, [[1, 1], [1, 0], [1, 1], [1, 0], [1, 1]].

  • kwargs (Dict[str, Any]) – Additional arguments for the Bayesian Logistic Regression MO child model.

class pybandits.model.StudentT(*, mu: Annotated[float, None, Interval(gt=None, ge=None, lt=None, le=None), None, AllowInfNan(allow_inf_nan=False)] = 0.0, sigma: Annotated[float, None, Interval(gt=None, ge=None, lt=None, le=None), None, AllowInfNan(allow_inf_nan=False)] = 10.0, nu: Annotated[float, None, Interval(gt=None, ge=None, lt=None, le=None), None, AllowInfNan(allow_inf_nan=False)] = 5.0)

Bases: PyBanditsBaseModel

Student’s t-distribution.

Parameters:
  • mu (float) – Mean of the Student’s t-distribution.

  • sigma (float) – Standard deviation of the Student’s t-distribution.

  • nu (float) – Degrees of freedom.

model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

mu: Annotated[float, None, Interval(gt=None, ge=None, lt=None, le=None), None, AllowInfNan(allow_inf_nan=False)]
nu: Annotated[float, None, Interval(gt=None, ge=None, lt=None, le=None), None, AllowInfNan(allow_inf_nan=False)]
sigma: Annotated[float, None, Interval(gt=None, ge=None, lt=None, le=None), None, AllowInfNan(allow_inf_nan=False)]

pybandits.strategy

class pybandits.strategy.BestActionIdentificationBandit(*, exploit_p: Float_0_1 | None = 0.5)

Bases: Strategy

Best-Action Identification (BAI) strategy for multi-armed bandits.

References

Simple Bayesian Algorithms for Best-Arm Identification (Russo, 2018) https://arxiv.org/pdf/1602.08448.pdf

Parameters:

exploit_p (Optional[Float01], 0.5 if not specified) – Tuning parameter taking value in [0, 1] which specifies the probability of selecting the best or an alternative action. If exploit_p is 1, the bandit always selects the action with the highest probability of getting a positive reward. That is, it behaves as a Greedy strategy. If exploit_p is 0, the bandit always selects the action with the second-highest probability of getting a positive reward.

compare_best_actions(actions: Dict[ActionId, Beta]) float

Compare the two best actions, i.e. the two actions with the highest expected means of getting a positive reward.

Parameters:

actions (Dict[ActionId, Beta])

Returns:

pvalue – p-value result of the statistical test.

Return type:

float

exploit_p: Float_0_1 | None
model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

classmethod numerize_exploit_p(v)
select_action(p: Dict[ActionId, float], actions: Dict[ActionId, Model] | None = None) ActionId

With probability self.exploit_p, select the best action (i.e. the action with the highest probability of getting a positive reward); with probability 1-self.exploit_p, select the second-best action (i.e. the action with the second-highest probability of getting a positive reward).

Parameters:
  • p (Dict[ActionId, Probability]) – The dictionary of actions and their sampled probability of getting a positive reward.

  • actions (Optional[Dict[ActionId, Model]]) – The dictionary of actions and their associated model.

Returns:

selected_action – The selected action.

Return type:

ActionId

with_exploit_p(exploit_p: Float_0_1 | None) Self

Instantiate a mutated best action identification strategy with an altered exploit_p.

Parameters:

exploit_p (Optional[Float01], 0.5 if not specified) – Tuning parameter taking value in [0, 1] which specifies the probability of selecting the best or an alternative action. If exploit_p is 1, the bandit always selects the action with the highest probability of getting a positive reward. That is, it behaves as a Greedy strategy. If exploit_p is 0, the bandit always selects the action with the second-highest probability of getting a positive reward.

Returns:

mutated_best_action_identification – The mutated best action identification strategy.

Return type:

BestActionIdentificationBandit

class pybandits.strategy.ClassicBandit

Bases: Strategy

Classic multi-armed bandits strategy.

References

Analysis of Thompson Sampling for the Multi-armed Bandit Problem (Agrawal and Goyal, 2012) http://proceedings.mlr.press/v23/agrawal12/agrawal12.pdf

Thompson Sampling for Contextual Bandits with Linear Payoffs (Agrawal and Goyal, 2014) https://arxiv.org/pdf/1209.3352.pdf

model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

select_action(p: Dict[ActionId, float], actions: Dict[ActionId, Model] | None = None) ActionId

Select the action with the highest probability of getting a positive reward.

Parameters:
  • p (Dict[ActionId, Probability]) – The dictionary of actions and their sampled probability of getting a positive reward.

  • actions (Optional[Dict[ActionId, Model]]) – The dictionary of actions and their associated model.

Returns:

selected_action – The selected action.

Return type:

ActionId
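
A small sketch contrasting the two strategies documented so far; the sampled probabilities are illustrative:

from pybandits.strategy import BestActionIdentificationBandit, ClassicBandit

p = {"a1": 0.62, "a2": 0.55, "a3": 0.18}  # sampled probabilities of a positive reward

ClassicBandit().select_action(p=p)                                # -> "a1" (highest probability)
BestActionIdentificationBandit(exploit_p=0.0).select_action(p=p)  # -> "a2" (second-highest)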

class pybandits.strategy.CostControlBandit(*, subsidy_factor: Float_0_1 | None = 0.5)

Bases: CostControlStrategy

Cost Control (CC) strategy for multi-armed bandits.

Bandits are extended to include a control of the action cost. Each action is associated with a predefined “cost”. At prediction time, the model considers the actions whose expected rewards are above a pre-defined lower bound. Among these actions, the one with the lowest associated cost is recommended. The expected reward interval for feasible actions is defined as [(1-subsidy_factor)*max_p, max_p], where max_p is the highest expected reward sampled value.

References

Thompson Sampling for Contextual Bandit Problems with Auxiliary Safety Constraints (Daulton et al., 2019) https://arxiv.org/abs/1911.00638

Multi-Armed Bandits with Cost Subsidy (Sinha et al., 2021) https://arxiv.org/abs/2011.01488

Parameters:

subsidy_factor (Optional[Float01], 0.5 if not specified) – Number in [0, 1] that defines the smallest tolerated reward probability, and hence the set of feasible actions. If subsidy_factor is 1, the bandit always selects the action with the minimum cost. If subsidy_factor is 0, the bandit always selects the action with the highest probability of getting a positive reward (it behaves as a classic Bernoulli bandit).

model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

classmethod numerize_subsidy_factor(v)
select_action(p: Dict[ActionId, Probability], actions: Dict[ActionId, Model]) ActionId

Select the action with the minimum cost among the set of feasible actions (the actions whose expected rewards are above a certain lower bound defined as [(1-subsidy_factor)*max_p, max_p], where max_p is the highest sampled expected reward).

Parameters:
  • p (Dict[ActionId, Probability]) – The dictionary of actions and their sampled probability of getting a positive reward.

  • actions (Dict[ActionId, BetaCC]) – The dictionary of actions and their cost.

Returns:

selected_action – The selected action.

Return type:

ActionId

subsidy_factor: Float_0_1 | None
with_subsidy_factor(subsidy_factor: Float_0_1 | None) Self

Instantiate a mutated cost control bandit strategy with an altered subsidy factor.

Parameters:

subsidy_factor (Optional[Float01], 0.5 if not specified) – Number in [0, 1] that defines the smallest tolerated reward probability, and hence the set of feasible actions. If subsidy_factor is 1, the bandit always selects the action with the minimum cost. If subsidy_factor is 0, the bandit always selects the action with the highest probability of getting a positive reward (it behaves as a classic Bernoulli bandit).

Returns:

mutated_cost_control_bandit – The mutated cost control bandit strategy.

Return type:

CostControlBandit

class pybandits.strategy.CostControlStrategy

Bases: Strategy, ABC

Cost Control (CC) strategy for multi-armed bandits.

Bandits are extended to include a control of the action cost. Each action is associated with a predefined “cost”.

model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class pybandits.strategy.MultiObjectiveBandit

Bases: MultiObjectiveStrategy

Multi-Objective (MO) strategy for multi-armed bandits.

The reward pertaining to an action is a multidimensional vector instead of a scalar value. In this setting, different actions are compared according to Pareto order between their expected reward vectors, and those actions whose expected rewards are not inferior to that of any other actions are called Pareto optimal actions, all of which constitute the Pareto front.

References

Thompson Sampling for Multi-Objective Multi-Armed Bandits Problem (Yahyaa and Manderick, 2015) https://www.researchgate.net/publication/272823659_Thompson_Sampling_for_Multi-Objective_Multi-Armed_Bandits_Problem

model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

select_action(p: Dict[ActionId, List[Probability]], **kwargs) ActionId

Select an action at random from the Pareto optimal set of actions. The Pareto optimal action set (Pareto front) A* is the set of actions not dominated by any other action not in A*. The dominance relation is established based on the vectors of objective reward probabilities.

Parameters:

p (Dict[ActionId, List[Probability]]) – The dictionary of actions and their sampled probability of getting a positive reward for each objective.

Returns:

selected_action – The selected action.

Return type:

ActionId

class pybandits.strategy.MultiObjectiveCostControlBandit

Bases: MultiObjectiveStrategy, CostControlStrategy

Multi-Objective (MO) with Cost Control (CC) strategy for multi-armed bandits.

This strategy allows the reward to be a multidimensional vector and adds a control of the action cost. It merges the Multi-Objective and Cost Control strategies.

model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

select_action(p: Dict[ActionId, List[Probability]], actions: Dict[ActionId, BetaMOCC]) ActionId

Select the action with the minimum cost among the Pareto optimal set of actions. The Pareto optimal action set (Pareto front) A* is the set of actions not dominated by any other action not in A*. The dominance relation is established based on the vectors of objective reward probabilities.

Parameters:

p (Dict[ActionId, List[Probability]]) – The dictionary of actions and their sampled probability of getting a positive reward for each objective.

Returns:

selected_action – The selected action.

Return type:

ActionId

class pybandits.strategy.MultiObjectiveStrategy

Bases: Strategy, ABC

Multi Objective Strategy to select actions in multi-armed bandits.

classmethod get_pareto_front(p: Dict[ActionId, List[Probability]]) List[ActionId]

Create the Pareto optimal set of actions (Pareto front) A*, identified as the actions that are not dominated by any action outside the set A*.

Parameters:

p (Dict[ActionId, List[Probability]]) – The dictionary of actions and their sampled probability of getting a positive reward for each objective.

Returns:

pareto_front – The list of Pareto optimal actions.

Return type:

List[ActionId]
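
A small sketch of the Pareto front computation; the probability vectors are illustrative and no ordering of the returned actions is assumed:

from pybandits.strategy import MultiObjectiveStrategy

p = {
    "a1": [0.9, 0.10],  # best on objective 1
    "a2": [0.1, 0.90],  # best on objective 2
    "a3": [0.5, 0.05],  # dominated by a1 (worse on every objective)
}
MultiObjectiveStrategy.get_pareto_front(p=p)  # -> ["a1", "a2"]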

model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class pybandits.strategy.Strategy

Bases: PyBanditsBaseModel, ABC

Strategy to select actions in multi-armed bandits.

classmethod get_expected_value_from_state(state: Dict[str, Any], field_name: str) float
model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

classmethod numerize_field(v, field_name: str)
abstract select_action(p: Dict[ActionId, Probability], actions: Dict[ActionId, BaseModel] | None) ActionId

Select the action.

pybandits.strategy.random() x in the interval [0, 1).

pybandits.actions_manager

class pybandits.actions_manager.ActionsManager(delta: PositiveProbability | None = None, actions: Dict[ActionId, Model] | None = None, action_ids: Set[ActionId] | None = None, kwargs: Dict[str, Any] | None = None)

Bases: PyBanditsBaseModel, ABC

Base class for managing actions and their associated models. The class accounts for non-stationarity via an adaptive windowing scheme for action updates, which is also used for change point detection.

References

Scaling Multi-Armed Bandit Algorithms (Fouché et al., 2019) https://edouardfouche.com/publications/S-MAB_FOUCHE_KDD19.pdf

Parameters:
  • actions (Dict[ActionId, Model]) – The dictionary of possible actions and their associated model.

  • delta (Optional[PositiveProbability]) – The confidence level for the adaptive window. Use None to skip change point detection.

actions: Dict[ActionId, BaseModel]
classmethod at_least_one_action_is_defined(v)
delta: PositiveProbability | None
property maximum_memory_length: Annotated[int, Ge(ge=0)]

Get maximum possible memory length based on current action statistics.

Returns:

Maximum memory length allowed.

Return type:

NonNegativeInt

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'json_encoders': {<class 'collections.deque'>: <class 'list'>}}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

update(actions: List[ActionId], rewards: List[BinaryReward] | List[List[BinaryReward]], actions_memory: List[ActionId] | None = None, rewards_memory: List[BinaryReward] | List[List[BinaryReward]] | None = None, **kwargs)

Update the models associated with the given actions using the provided rewards. For an adaptive window size, the update is performed by resetting the action models and retraining them on the new data.

Parameters:
  • actions (List[ActionId]) – The selected action for each sample.

  • rewards (Union[List[BinaryReward], List[List[BinaryReward]]]) – The reward for each sample.

  • actions_memory (Optional[List[ActionId]]) – List of previously selected actions.

  • rewards_memory (Optional[Union[List[BinaryReward], List[List[BinaryReward]]]]) – List of previously collected rewards.

class pybandits.actions_manager.CmabActionsManager(delta: PositiveProbability | None = None, actions: Dict[ActionId, Model] | None = None, action_ids: Set[ActionId] | None = None, kwargs: Dict[str, Any] | None = None)

Bases: ActionsManager, BaseModel, Generic[CmabModelType]

Manages actions and their associated models for cMAB models. The class accounts for non-stationarity by providing an adaptive window scheme for action updates.

Parameters:
  • actions (Dict[ActionId, BayesianLogisticRegression]) – The dictionary of possible actions and their associated model.

  • delta (Optional[PositiveProbability], 0.1 if not specified) – The confidence level for the adaptive window.

actions: Dict[ActionId, CmabModelType]
classmethod check_bayesian_logistic_regression_models(v)
delta: PositiveProbability | None
model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'json_encoders': {<class 'collections.deque'>: <class 'list'>}}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

update(actions: List[ActionId], rewards: List[BinaryReward] | List[List[BinaryReward]], context: _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes], actions_memory: List[ActionId] | None = None, rewards_memory: List[BinaryReward] | List[List[BinaryReward]] | None = None, context_memory: _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes] | None = None)

Update the models associated with the given actions using the provided rewards. For an adaptive window size, the update is performed by resetting the action models and retraining them on the new data.

Parameters:
  • actions (List[ActionId]) – The selected action for each sample.

  • rewards (Union[List[BinaryReward], List[List[BinaryReward]]]) – The reward for each sample.

  • context (ArrayLike of shape (n_samples, n_features)) – Matrix of contextual features.

  • actions_memory (Optional[List[ActionId]]) – List of previously selected actions.

  • rewards_memory (Optional[Union[List[BinaryReward], List[List[BinaryReward]]]]) – List of previously collected rewards.

  • context_memory (Optional[ArrayLike] of shape (n_samples, n_features)) – Matrix of contextual features.

pybandits.actions_manager.CmabActionsManagerCC

alias of CmabActionsManager[BayesianLogisticRegressionCC]

pybandits.actions_manager.CmabActionsManagerSO

alias of CmabActionsManager[BayesianLogisticRegression]

class pybandits.actions_manager.SmabActionsManager(delta: PositiveProbability | None = None, actions: Dict[ActionId, Model] | None = None, action_ids: Set[ActionId] | None = None, kwargs: Dict[str, Any] | None = None)

Bases: ActionsManager, BaseModel, Generic[SmabModelType]

Manages actions and their associated models for sMAB models. The class accounts for non-stationarity by providing an adaptive window scheme for action updates.

Parameters:
  • actions (Dict[ActionId, BaseBeta]) – The dictionary of possible actions and their associated model.

  • delta (Optional[PositiveProbability], 0.1 if not specified) – The confidence level for the adaptive window.

actions: Dict[ActionId, SmabModelType]
classmethod all_actions_have_same_number_of_objectives(actions: Dict[ActionId, SmabModelType])
delta: PositiveProbability | None
model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'json_encoders': {<class 'collections.deque'>: <class 'list'>}}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

update(actions: List[ActionId], rewards: List[BinaryReward] | List[List[BinaryReward]], actions_memory: List[ActionId] | None = None, rewards_memory: List[BinaryReward] | List[List[BinaryReward]] | None = None)

Update the models associated with the given actions using the provided rewards. For an adaptive window size, the update is performed by resetting the action models and retraining them on the new data.

Parameters:
  • actions (List[ActionId]) – The selected action for each sample.

  • rewards (Union[List[BinaryReward], List[List[BinaryReward]]]) – The reward for each sample.

  • actions_memory (Optional[List[ActionId]]) – List of previously selected actions.

  • rewards_memory (Optional[Union[List[BinaryReward], List[List[BinaryReward]]]]) – List of previously collected rewards.
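
A brief sketch of enabling the adaptive window via delta (the value is illustrative); leaving delta as None keeps the standard stationary update:

from pybandits.actions_manager import SmabActionsManagerSO
from pybandits.model import Beta

manager = SmabActionsManagerSO(
    delta=0.1,  # confidence level for adaptive-window change point detection
    actions={"a1": Beta(), "a2": Beta()},
)
manager.update(actions=["a1", "a2", "a1"], rewards=[1, 0, 1])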

pybandits.actions_manager.SmabActionsManagerCC

alias of SmabActionsManager[BetaCC]

pybandits.actions_manager.SmabActionsManagerMO

alias of SmabActionsManager[BetaMO]

pybandits.actions_manager.SmabActionsManagerMOCC

alias of SmabActionsManager[BetaMOCC]

pybandits.actions_manager.SmabActionsManagerSO

alias of SmabActionsManager[Beta]

pybandits.smab_simulator

class pybandits.smab_simulator.SmabSimulator(*, smab: BaseSmabBernoulli, n_updates: Annotated[int, Gt(gt=0)] = 10, batch_size: Annotated[int, Gt(gt=0)] = 100, probs_reward: DataFrame | None = None, save: bool = False, path: str = '', file_prefix: str = '', random_seed: Annotated[int, Ge(ge=0)] | None = None, verbose: bool = False, visualize: bool = False)

Bases: Simulator

Simulate environment for stochastic multi-armed bandits.

This class performs simulation of stochastic Multi-Armed Bandits (sMAB). Data are processed in batches of size n >= 1. For each batch of simulated samples, the sMAB selects an action and collects the corresponding simulated reward for each sample. Prior parameters are then updated based on the rewards returned for the recommended actions.

Parameters:
  • mab (BaseSmabBernoulli) – sMAB model.

mab: BaseSmabBernoulli
model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'populate_by_name': True, 'validate_by_alias': True, 'validate_by_name': True}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_post_init(_Simulator__context: Any) None

Override this method to perform additional initialization after __init__ and model_construct. This is useful if you want to do some validation that requires the entire model to be initialized.

classmethod replace_null_and_validate_probs_reward(values)
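
A minimal end-to-end sketch: build a two-action sMAB from the classes documented above and let the simulator drive the predict/update loop. The pybandits.model and pybandits.strategy import paths and the run() entry point are assumptions, not taken from this page.

from pybandits.actions_manager import SmabActionsManagerSO
from pybandits.model import Beta              # assumed import path
from pybandits.smab import SmabBernoulli
from pybandits.smab_simulator import SmabSimulator
from pybandits.strategy import ClassicBandit  # assumed import path

# Two-action stochastic Bernoulli bandit with uniform Beta(1, 1) priors.
smab = SmabBernoulli(
    actions_manager=SmabActionsManagerSO(actions={"a1": Beta(), "a2": Beta()}),
    strategy=ClassicBandit(),
)

# Five update rounds of 50 simulated samples each.
simulator = SmabSimulator(smab=smab, n_updates=5, batch_size=50)
simulator.run()  # assumed entry point that runs the predict / reward / update loop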

pybandits.cmab_simulator

class pybandits.cmab_simulator.CmabSimulator(*, cmab: BaseCmabBernoulli, n_updates: Annotated[int, Gt(gt=0)] = 10, batch_size: Annotated[int, Gt(gt=0)] = 100, probs_reward: DataFrame | None = None, save: bool = False, path: str = '', file_prefix: str = '', random_seed: Annotated[int, Ge(ge=0)] | None = None, verbose: bool = False, visualize: bool = False, context: ndarray, group: List | None = None)

Bases: Simulator

Simulate environment for contextual multi-armed bandit models.

This class simulates the information required by the contextual bandit. Generated data are processed by the bandit in batches of size n >= 1. For each batch of samples, actions are recommended by the bandit and the corresponding simulated rewards are collected. Bandit policy parameters are then updated based on the rewards returned for the recommended actions.

Parameters:
  • mab (BaseCmabBernoulli) – Contextual multi-armed bandit model.

  • context (np.ndarray of shape (n_samples, n_features)) – Context matrix of sample features.

  • group (Optional[List] with length=n_samples) – Group to which each sample belongs. Samples that belong to the same group have features drawn from the same distribution and the same probability of receiving a positive/negative feedback from each action. If not supplied, all samples are assigned to the same group.

context: ndarray
group: List | None
mab: BaseCmabBernoulli
model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'populate_by_name': True, 'validate_by_alias': True, 'validate_by_name': True}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_post_init(_Simulator__context: Any) None

Override this method to perform additional initialization after __init__ and model_construct. This is useful if you want to do some validation that requires the entire model to be initialized.

classmethod replace_nulls_and_validate_sizes_and_dtypes(values)
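
A corresponding sketch for the contextual simulator. The cmab object is left as a placeholder for an already-constructed BaseCmabBernoulli instance, the context size is assumed to be one row per simulated sample, and run() is the same assumed entry point as in the sMAB example.

import numpy as np
from pybandits.cmab_simulator import CmabSimulator

cmab = ...  # placeholder: an already-constructed BaseCmabBernoulli instance

n_updates, batch_size, n_features = 5, 100, 4
n_samples = n_updates * batch_size  # one context row per simulated sample (assumed)
context = np.random.default_rng(0).normal(size=(n_samples, n_features))

simulator = CmabSimulator(
    cmab=cmab,            # placeholder bandit (see note above)
    context=context,      # per-sample feature matrix, shape (n_samples, n_features)
    n_updates=n_updates,
    batch_size=batch_size,
)
simulator.run()  # assumed entry point, as for SmabSimulator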