pybandits
pybandits.smab
- class pybandits.smab.BaseSmabBernoulli(epsilon: Float_0_1 | None = None, default_action: ActionId | None = None, *, actions_manager: SmabActionsManager[BaseBeta], strategy: Strategy)
Bases:
BaseMab, ABC
Base model for a Stochastic Bernoulli Multi-Armed Bandit with Thompson Sampling.
- Parameters:
actions_manager (SmabActionsManager[BaseBeta]) – The manager for actions and their associated models.
strategy (Strategy) – The strategy used to select actions.
- actions_manager: SmabActionsManager[BaseBeta]
- model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- predict(n_samples: Annotated[int, Gt(gt=0)] = 1, forbidden_actions: Set[ActionId] | None = None) SmabPredictions
Predict actions.
- Parameters:
n_samples (int > 0, default=1) – Number of samples to predict.
forbidden_actions (Optional[Set[ActionId]], default=None) – Set of forbidden actions. If specified, the model discards the forbidden_actions and considers only the remaining allowed_actions. By default, the model considers all actions as allowed_actions. Note that actions = allowed_actions ∪ forbidden_actions.
- Returns:
actions (List[ActionId] of shape (n_samples,)) – The actions selected by the multi-armed bandit model.
probs (List[Dict[ActionId, Probability]] of shape (n_samples,)) – The probabilities of getting a positive reward for each action.
- update(actions: List[ActionId], rewards: List[BinaryReward] | List[List[BinaryReward]], actions_memory: List[ActionId] | None = None, rewards_memory: List[BinaryReward] | List[List[BinaryReward]] | None = None)
Update the stochastic Bernoulli bandit given the list of selected actions and their corresponding binary rewards.
- Parameters:
actions (List[ActionId] of shape (n_samples,), e.g. ['a1', 'a2', 'a3', 'a4', 'a5']) – The selected action for each sample.
rewards (List[Union[BinaryReward, List[BinaryReward]]] of shape (n_samples, n_objectives)) –
- The binary reward for each sample.
- If strategy is not MultiObjectiveBandit, rewards should be a list, e.g.
rewards = [1, 0, 1, 1, 1, …]
- If strategy is MultiObjectiveBandit, rewards should be a list of lists, e.g. (with n_objectives=2):
rewards = [[1, 1], [1, 0], [1, 1], [1, 0], [1, 1], …]
actions_memory (Optional[List[ActionId]]) – List of previously selected actions.
rewards_memory (Optional[Union[List[BinaryReward], List[List[BinaryReward]]]]) – List of previously collected rewards.
- class pybandits.smab.SmabBernoulli(epsilon: Float_0_1 | None = None, default_action: ActionId | None = None, *, actions_manager: SmabActionsManager[Beta], strategy: ClassicBandit)
Bases:
BaseSmabBernoulli
Stochastic Bernoulli Multi-Armed Bandit with Thompson Sampling.
References
Analysis of Thompson Sampling for the Multi-armed Bandit Problem (Agrawal and Goyal, 2012) http://proceedings.mlr.press/v23/agrawal12/agrawal12.pdf
- Parameters:
actions_manager (SmabActionsManagerSO) – The manager for actions and their associated models.
strategy (ClassicBandit) – The strategy used to select actions.
- actions_manager: SmabActionsManager[Beta]
- model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- strategy: ClassicBandit
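Example
A minimal usage sketch based on the signatures above; the action ids and reward values are hypothetical, SmabActionsManagerSO is the documented alias of SmabActionsManager[Beta], and SmabPredictions is assumed to unpack into the documented actions and probabilities:

from pybandits.actions_manager import SmabActionsManagerSO
from pybandits.model import Beta
from pybandits.smab import SmabBernoulli
from pybandits.strategy import ClassicBandit

# Two actions, each modelled by a Beta prior with default counters.
actions_manager = SmabActionsManagerSO(actions={"a1": Beta(), "a2": Beta()})
mab = SmabBernoulli(actions_manager=actions_manager, strategy=ClassicBandit())

# Sample four recommendations, then feed back the observed binary rewards.
actions, probs = mab.predict(n_samples=4)
mab.update(actions=actions, rewards=[1, 0, 1, 1])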
- class pybandits.smab.SmabBernoulliBAI(epsilon: Float_0_1 | None = None, default_action: ActionId | None = None, *, actions_manager: SmabActionsManager[Beta], strategy: BestActionIdentificationBandit)
Bases:
BaseSmabBernoulli
Stochastic Bernoulli Multi-Armed Bandit with Thompson Sampling, and Best Action Identification strategy.
References
Analysis of Thompson Sampling for the Multi-armed Bandit Problem (Agrawal and Goyal, 2012) http://proceedings.mlr.press/v23/agrawal12/agrawal12.pdf
- Parameters:
actions_manager (SmabActionsManagerSO) – The manager for actions and their associated models.
strategy (BestActionIdentificationBandit) – The strategy used to select actions.
- actions_manager: SmabActionsManager[Beta]
- model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- strategy: BestActionIdentificationBandit
- class pybandits.smab.SmabBernoulliCC(epsilon: Float_0_1 | None = None, default_action: ActionId | None = None, *, actions_manager: SmabActionsManager[BetaCC], strategy: CostControlBandit)
Bases:
BaseSmabBernoulli
Stochastic Bernoulli Multi-Armed Bandit with Thompson Sampling, and Cost Control strategy.
The sMAB is extended to include a control of the action cost. Each action is associated with a predefined “cost”. At prediction time, the model considers the actions whose expected reward is above a predefined lower bound. Among these actions, the one with the lowest associated cost is recommended. The expected reward interval for feasible actions is defined as [(1 - subsidy_factor) * max_p, max_p], where max_p is the highest expected reward sampled value.
References
Thompson Sampling for Contextual Bandit Problems with Auxiliary Safety Constraints (Daulton et al., 2019) https://arxiv.org/abs/1911.00638
Multi-Armed Bandits with Cost Subsidy (Sinha et al., 2021) https://arxiv.org/abs/2011.01488
- Parameters:
actions_manager (SmabActionsManagerCC) – The manager for actions and their associated models.
strategy (CostControlBandit) – The strategy used to select actions.
- actions_manager: SmabActionsManager[BetaCC]
- model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- strategy: CostControlBandit
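Example
A sketch of the cost control variant, assuming costs are attached to each action through BetaCC as documented in pybandits.model; action ids, costs and rewards are hypothetical:

from pybandits.actions_manager import SmabActionsManagerCC
from pybandits.model import BetaCC
from pybandits.smab import SmabBernoulliCC
from pybandits.strategy import CostControlBandit

# Each action carries a fixed cost; among feasible actions the cheapest is preferred.
actions_manager = SmabActionsManagerCC(
    actions={"cheap": BetaCC(cost=1.0), "expensive": BetaCC(cost=5.0)}
)
mab = SmabBernoulliCC(
    actions_manager=actions_manager,
    strategy=CostControlBandit(subsidy_factor=0.3),
)
actions, probs = mab.predict(n_samples=2)
mab.update(actions=actions, rewards=[1, 0])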
- class pybandits.smab.SmabBernoulliMO(epsilon: Float_0_1 | None = None, default_action: ActionId | None = None, *, actions_manager: SmabActionsManager[BetaMO], strategy: MultiObjectiveBandit)
Bases:
BaseSmabBernoulli
Stochastic Bernoulli Multi-Armed Bandit with Thompson Sampling, and Multi-Objectives strategy.
The reward pertaining to an action is a multidimensional vector instead of a scalar value. In this setting, different actions are compared according to Pareto order between their expected reward vectors, and those actions whose expected rewards are not inferior to that of any other actions are called Pareto optimal actions, all of which constitute the Pareto front.
References
Thompson Sampling for Multi-Objective Multi-Armed Bandits Problem (Yahyaa and Manderick, 2015) https://www.researchgate.net/publication/272823659_Thompson_Sampling_for_Multi-Objective_Multi-Armed_Bandits_Problem
- Parameters:
actions_manager (SmabActionsManagerMO) – The manager for actions and their associated models.
strategy (MultiObjectiveBandit) – The strategy used to select actions.
- actions_manager: SmabActionsManager[BetaMO]
- model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- strategy: MultiObjectiveBandit
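Example
A sketch of the multi-objective variant, assuming BetaMO.cold_start builds one Beta model per objective as documented in pybandits.model; rewards are then passed as a list of lists, one inner list per sample:

from pybandits.actions_manager import SmabActionsManagerMO
from pybandits.model import BetaMO
from pybandits.smab import SmabBernoulliMO
from pybandits.strategy import MultiObjectiveBandit

actions_manager = SmabActionsManagerMO(
    actions={
        "a1": BetaMO.cold_start(n_objectives=2),
        "a2": BetaMO.cold_start(n_objectives=2),
    }
)
mab = SmabBernoulliMO(actions_manager=actions_manager, strategy=MultiObjectiveBandit())
actions, probs = mab.predict(n_samples=3)
mab.update(actions=actions, rewards=[[1, 0], [1, 1], [0, 1]])  # two objectives per sample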
- class pybandits.smab.SmabBernoulliMOCC(epsilon: Float_0_1 | None = None, default_action: ActionId | None = None, *, actions_manager: SmabActionsManager[BetaMOCC], strategy: MultiObjectiveCostControlBandit)
Bases:
BaseSmabBernoulli
Stochastic Bernoulli Multi-Armed Bandit with Thompson Sampling implementation for Multi-Objective (MO) with Cost Control (CC) strategy.
This Bandit allows the reward to be a multidimensional vector and include a control of the action cost. It merges the Multi-Objective and Cost Control strategies.
- Parameters:
actions_manager (SmabActionsManagerMOCC) – The manager for actions and their associated models.
strategy (MultiObjectiveCostControlBandit) – The strategy used to select actions.
- actions_manager: SmabActionsManager[BetaMOCC]
- model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- strategy: MultiObjectiveCostControlBandit
pybandits.cmab
- class pybandits.cmab.BaseCmabBernoulli(epsilon: Float_0_1 | None = None, default_action: ActionId | None = None, *, actions_manager: CmabActionsManager[BaseBayesianLogisticRegression], strategy: Strategy, predict_with_proba: bool, predict_actions_randomly: bool)
Bases:
BaseMab, ABC
Base model for a Contextual Multi-Armed Bandit for Bernoulli bandits with Thompson Sampling.
- Parameters:
actions (Dict[ActionId, BayesianLogisticRegression]) – The dictionary of possible actions and their associated models.
strategy (Strategy) – The strategy used to select actions.
predict_with_proba (bool) – If True, predict with sampled probabilities; otherwise, predict with weighted sums.
predict_actions_randomly (bool) – If True, predict actions randomly (each action has equal probability of being selected); otherwise, predict with the bandit strategy.
- actions_manager: CmabActionsManager[BaseBayesianLogisticRegression]
- model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- predict(context: _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes], forbidden_actions: Set[ActionId] | None = None) CmabPredictions
Predict actions.
- Parameters:
context (ArrayLike of shape (n_samples, n_features)) – Matrix of contextual features.
forbidden_actions (Optional[Set[ActionId]], default=None) – Set of forbidden actions. If specified, the model discards the forbidden_actions and considers only the remaining allowed_actions. By default, the model considers all actions as allowed_actions. Note that actions = allowed_actions ∪ forbidden_actions.
- Returns:
actions (List[ActionId] of shape (n_samples,)) – The actions selected by the multi-armed bandit model.
probs (List[Dict[ActionId, Probability]] of shape (n_samples,)) – The probabilities of getting a positive reward for each action.
ws (List[Dict[ActionId, float]]) – The weighted sum of logistic regression logits.
- predict_actions_randomly: bool
- predict_with_proba: bool
- update(actions: List[ActionId], rewards: List[BinaryReward] | List[List[BinaryReward]], context: _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes], actions_memory: List[ActionId] | None = None, rewards_memory: List[BinaryReward] | List[List[BinaryReward]] | None = None, context_memory: _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes] | None = None)
Update the contextual Bernoulli bandit given the list of selected actions and their corresponding binary rewards.
- Parameters:
context (ArrayLike of shape (n_samples, n_features)) – Matrix of contextual features.
actions (List[ActionId] of shape (n_samples,), e.g. ['a1', 'a2', 'a3', 'a4', 'a5']) – The selected action for each sample.
rewards (List[Union[BinaryReward, List[BinaryReward]]] of shape (n_samples, n_objectives)) –
- The binary reward for each sample.
- If strategy is not MultiObjectiveBandit, rewards should be a list, e.g.
rewards = [1, 0, 1, 1, 1, …]
- If strategy is MultiObjectiveBandit, rewards should be a list of lists, e.g. (with n_objectives=2):
rewards = [[1, 1], [1, 0], [1, 1], [1, 0], [1, 1], …]
actions_memory (Optional[List[ActionId]]) – List of previously selected actions.
rewards_memory (Optional[Union[List[BinaryReward], List[List[BinaryReward]]]]) – List of previously collected rewards.
context_memory (Optional[ArrayLike] of shape (n_samples, n_features)) – Matrix of contextual features.
- class pybandits.cmab.CmabBernoulli(epsilon: Float_0_1 | None = None, default_action: ActionId | None = None, *, actions_manager: CmabActionsManager[BayesianLogisticRegression], strategy: ClassicBandit, predict_with_proba: bool = False, predict_actions_randomly: bool = False)
Bases:
BaseCmabBernoulli
Contextual Bernoulli Multi-Armed Bandit with Thompson Sampling.
References
Thompson Sampling for Contextual Bandits with Linear Payoffs (Agrawal and Goyal, 2014) https://arxiv.org/pdf/1209.3352.pdf
- Parameters:
actions_manager (CmabActionsManagerSO) – The manager for actions and their associated models.
strategy (ClassicBandit) – The strategy used to select actions.
predict_with_proba (bool) – If True, predict with sampled probabilities; otherwise, predict with weighted sums.
predict_actions_randomly (bool) – If True, predict actions randomly (each action has equal probability of being selected); otherwise, predict with the bandit strategy.
- actions_manager: CmabActionsManager[BayesianLogisticRegression]
- model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- predict_actions_randomly: bool
- predict_with_proba: bool
- strategy: ClassicBandit
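Example
A sketch of the contextual bandit, assuming BayesianLogisticRegression.cold_start initialises one model per action with n_features betas as documented in pybandits.model, and that CmabPredictions unpacks into the three documented outputs; the context matrix and rewards are hypothetical:

import numpy as np

from pybandits.actions_manager import CmabActionsManagerSO
from pybandits.cmab import CmabBernoulli
from pybandits.model import BayesianLogisticRegression
from pybandits.strategy import ClassicBandit

n_features = 3
actions_manager = CmabActionsManagerSO(
    actions={
        "a1": BayesianLogisticRegression.cold_start(n_features=n_features),
        "a2": BayesianLogisticRegression.cold_start(n_features=n_features),
    }
)
mab = CmabBernoulli(actions_manager=actions_manager, strategy=ClassicBandit())

context = np.random.rand(5, n_features)  # one row of contextual features per sample
actions, probs, ws = mab.predict(context=context)
mab.update(context=context, actions=actions, rewards=[1, 0, 1, 1, 0])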
- class pybandits.cmab.CmabBernoulliBAI(epsilon: Float_0_1 | None = None, default_action: ActionId | None = None, *, actions_manager: CmabActionsManager[BayesianLogisticRegression], strategy: BestActionIdentificationBandit, predict_with_proba: bool = False, predict_actions_randomly: bool = False)
Bases:
BaseCmabBernoulli
Contextual Bernoulli Multi-Armed Bandit with Thompson Sampling, and Best Action Identification strategy.
References
Analysis of Thompson Sampling for the Multi-armed Bandit Problem (Agrawal and Goyal, 2012) http://proceedings.mlr.press/v23/agrawal12/agrawal12.pdf
- Parameters:
actions_manager (CmabActionsManagerSO) – The manager for actions and their associated models.
strategy (BestActionIdentificationBandit) – The strategy used to select actions.
predict_with_proba (bool) – If True, predict with sampled probabilities; otherwise, predict with weighted sums.
predict_actions_randomly (bool) – If True, predict actions randomly (each action has equal probability of being selected); otherwise, predict with the bandit strategy.
- actions_manager: CmabActionsManager[BayesianLogisticRegression]
- model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- predict_actions_randomly: bool
- predict_with_proba: bool
- strategy: BestActionIdentificationBandit
- class pybandits.cmab.CmabBernoulliCC(epsilon: Float_0_1 | None = None, default_action: ActionId | None = None, *, actions_manager: CmabActionsManager[BayesianLogisticRegressionCC], strategy: CostControlBandit, predict_with_proba: bool = True, predict_actions_randomly: bool = False)
Bases:
BaseCmabBernoulli
Contextual Bernoulli Multi-Armed Bandit with Thompson Sampling, and Cost Control strategy.
The cMAB is extended to include a control of the action cost. Each action is associated with a predefined “cost”. At prediction time, the model considers the actions whose expected reward is above a predefined lower bound. Among these actions, the one with the lowest associated cost is recommended. The expected reward interval for feasible actions is defined as [(1 - subsidy_factor) * max_p, max_p], where max_p is the highest expected reward sampled value.
References
Thompson Sampling for Contextual Bandit Problems with Auxiliary Safety Constraints (Daulton et al., 2019) https://arxiv.org/abs/1911.00638
Multi-Armed Bandits with Cost Subsidy (Sinha et al., 2021) https://arxiv.org/abs/2011.01488
- Parameters:
actions_manager (CmabActionsManagerCC) – The manager for actions and their associated models.
strategy (CostControlBandit) – The strategy used to select actions.
predict_with_proba (bool) – If True, predict with sampled probabilities; otherwise, predict with weighted sums.
predict_actions_randomly (bool) – If True, predict actions randomly (each action has equal probability of being selected); otherwise, predict with the bandit strategy.
- actions_manager: CmabActionsManager[BayesianLogisticRegressionCC]
- model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- predict_actions_randomly: bool
- predict_with_proba: bool
- strategy: CostControlBandit
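Example
A sketch of the contextual cost control variant; passing cost through cold_start relies on its **kwargs being forwarded to the child model, as described in pybandits.model, so treat this as an assumption:

from pybandits.actions_manager import CmabActionsManagerCC
from pybandits.cmab import CmabBernoulliCC
from pybandits.model import BayesianLogisticRegressionCC
from pybandits.strategy import CostControlBandit

actions_manager = CmabActionsManagerCC(
    actions={
        # cost is assumed to be forwarded to the child model through cold_start's **kwargs
        "a1": BayesianLogisticRegressionCC.cold_start(n_features=3, cost=1.0),
        "a2": BayesianLogisticRegressionCC.cold_start(n_features=3, cost=4.0),
    }
)
mab = CmabBernoulliCC(
    actions_manager=actions_manager,
    strategy=CostControlBandit(subsidy_factor=0.5),
)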
- pybandits.cmab.choice(a, size=None, replace=True, p=None)
Generates a random sample from a given 1-D array
Added in version 1.7.0.
Note
New code should use the numpy.random.Generator.choice method of a numpy.random.Generator instance instead; see the NumPy random sampling quick start.
- Parameters:
a (1-D array-like or int) – If an ndarray, a random sample is generated from its elements. If an int, the random sample is generated as if it were np.arange(a).
size (int or tuple of ints, optional) – Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. Default is None, in which case a single value is returned.
replace (boolean, optional) – Whether the sample is with or without replacement. Default is True, meaning that a value of a can be selected multiple times.
p (1-D array-like, optional) – The probabilities associated with each entry in a. If not given, the sample assumes a uniform distribution over all entries in a.
- Returns:
samples – The generated random samples
- Return type:
single item or ndarray
- Raises:
ValueError – If a is an int and less than zero, if a or p are not 1-dimensional, if a is an array-like of size 0, if p is not a vector of probabilities, if a and p have different lengths, or if replace=False and the sample size is greater than the population size
See also
randint, shuffle, permutation
random.Generator.choice – which should be used in new code
Notes
Setting user-specified probabilities through p uses a more general but less efficient sampler than the default. The general sampler produces a different sample than the optimized sampler even if each element of p is 1 / len(a).
Sampling random rows from a 2-D array is not possible with this function, but is possible with Generator.choice through its axis keyword.
Examples
Generate a uniform random sample from np.arange(5) of size 3:
>>> np.random.choice(5, 3)
array([0, 3, 4]) # random
>>> # This is equivalent to np.random.randint(0,5,3)
Generate a non-uniform random sample from np.arange(5) of size 3:
>>> np.random.choice(5, 3, p=[0.1, 0, 0.3, 0.6, 0])
array([3, 3, 0]) # random
Generate a uniform random sample from np.arange(5) of size 3 without replacement:
>>> np.random.choice(5, 3, replace=False)
array([3, 1, 0]) # random
>>> # This is equivalent to np.random.permutation(np.arange(5))[:3]
Generate a non-uniform random sample from np.arange(5) of size 3 without replacement:
>>> np.random.choice(5, 3, replace=False, p=[0.1, 0, 0.3, 0.6, 0])
array([2, 3, 0]) # random
Any of the above can be repeated with an arbitrary array-like instead of just integers. For instance:
>>> aa_milne_arr = ['pooh', 'rabbit', 'piglet', 'Christopher']
>>> np.random.choice(aa_milne_arr, 5, p=[0.5, 0.1, 0.1, 0.3])
array(['pooh', 'pooh', 'pooh', 'Christopher', 'piglet'], # random
      dtype='<U11')
pybandits.model
- class pybandits.model.BaseBayesianLogisticRegression(*, n_successes: Annotated[int, Gt(gt=0)] = 1, n_failures: Annotated[int, Gt(gt=0)] = 1, alpha: StudentT, betas: Annotated[List[StudentT], MinLen(min_length=1)], update_method: Literal['MCMC', 'VI'] = 'MCMC', update_kwargs: dict | None = None)
Bases:
Model, ABC
Base Bayesian Logistic Regression model.
It is modeled as:
y = sigmoid(alpha + beta1 * x1 + beta2 * x2 + … + betaN * xN)
where the alpha and betas coefficients are Student’s t-distributions.
- Parameters:
alpha (StudentT) – Student’s t-distribution of the alpha coefficient.
betas (List[StudentT]) – Student’s t-distributions of the beta coefficients.
update_method (UpdateMethods, defaults to "MCMC") – The strategy for computing posterior quantities of the Bayesian models in the update function, such as Markov chain Monte Carlo (“MCMC”) or Variational Inference (“VI”). Check UpdateMethods in pybandits.model for the full list.
update_kwargs (Optional[dict], uses default values if not specified) – Additional arguments to pass to the update method.
- arrange_update_kwargs()
- check_context_matrix(context: ndarray)
Check and cast context matrix.
- Parameters:
context (np.ndarray of shape (n_samples, n_features)) – Matrix of contextual features.
- Returns:
context – Matrix of contextual features.
- Return type:
pandas DataFrame of shape (n_samples, n_features)
- classmethod cold_start(n_features: Annotated[int, Gt(gt=0)], update_method: Literal['MCMC', 'VI'] = 'MCMC', update_kwargs: dict | None = None, **kwargs) BayesianLogisticRegression
Utility function to create a Bayesian Logistic Regression model or child model with cost control, with default parameters.
It is modeled as:
y = sigmoid(alpha + beta1 * x1 + beta2 * x2 + … + betaN * xN)
where the alpha and betas coefficients are Student’s t-distributions.
- Parameters:
n_features (PositiveInt) – The number of betas of the Bayesian Logistic Regression model. This is also the number of features expected in the context matrix.
update_method (UpdateMethods, defaults to "MCMC") – The strategy for computing posterior quantities of the Bayesian models in the update function, such as Markov chain Monte Carlo (“MCMC”) or Variational Inference (“VI”). Check UpdateMethods in pybandits.model for the full list.
update_kwargs (Optional[dict], uses default values if not specified) – Additional arguments to pass to the update method.
kwargs (Dict[str, Any]) – Additional arguments for the Bayesian Logistic Regression child model.
- Returns:
blr – The Bayesian Logistic Regression model.
- Return type:
BayesianLogisticRegression
- model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_post_init(context: Any, /) None
This function is meant to behave like a BaseModel method to initialise private attributes.
It takes context as an argument since that’s what pydantic-core passes when calling it.
- Parameters:
self – The BaseModel instance.
context – The context.
- sample_proba(context: ndarray) Tuple[Probability, float]
Compute the probability of getting a positive reward from the sampled regression coefficients and the context.
- Parameters:
context (np.ndarray) – Context matrix of shape (n_samples, n_features).
- Returns:
prob (ndarray of shape (n_samples)) – Probability of getting a positive reward.
weighted_sum (ndarray of shape (n_samples)) – Weighted sums between contextual feature values and sampled coefficients.
- update_kwargs: dict | None
- update_method: Literal['MCMC', 'VI']
- class pybandits.model.BaseBeta(*, n_successes: Annotated[int, Gt(gt=0)] = 1, n_failures: Annotated[int, Gt(gt=0)] = 1)
Bases:
Model
Beta Distribution model for Bernoulli multi-armed bandits.
- classmethod both_or_neither_counters_are_defined(values)
- property mean: float
The success rate, i.e. n_successes / (n_successes + n_failures).
- model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- sample_proba() Probability
Sample the probability of getting a positive reward.
- Returns:
prob – Probability of getting a positive reward.
- Return type:
Probability
- property std: float
The corrected standard deviation (Bessel’s correction) of the binary distribution of successes and failures.
- class pybandits.model.BaseModel
Bases:
PyBanditsBaseModel, ABC
- model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- abstract reset()
Reset the model.
- abstract sample_proba() Probability
Sample the probability of getting a positive reward.
- abstract update(rewards: List[BinaryReward] | List[List[BinaryReward]], **kwargs)
Update the model.
- Parameters:
rewards (Union[List[BinaryReward], List[List[BinaryReward]]]) – A list of binary rewards.
- class pybandits.model.BayesianLogisticRegression(*, n_successes: Annotated[int, Gt(gt=0)] = 1, n_failures: Annotated[int, Gt(gt=0)] = 1, alpha: StudentT, betas: Annotated[List[StudentT], MinLen(min_length=1)], update_method: Literal['MCMC', 'VI'] = 'MCMC', update_kwargs: dict | None = None)
Bases:
BaseBayesianLogisticRegression
Bayesian Logistic Regression model.
It is modeled as:
y = sigmoid(alpha + beta1 * x1 + beta2 * x2 + … + betaN * xN)
where the alpha and betas coefficients are Student’s t-distributions.
- Parameters:
alpha (StudentT) – Student’s t-distribution of the alpha coefficient.
betas (List[StudentT]) – Student’s t-distributions of the beta coefficients.
update_method (UpdateMethods, defaults to "MCMC") – The strategy for computing posterior quantities of the Bayesian models in the update function, such as Markov chain Monte Carlo (“MCMC”) or Variational Inference (“VI”). Check UpdateMethods in pybandits.model for the full list.
update_kwargs (Optional[dict], uses default values if not specified) – Additional arguments to pass to the update method.
- model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_post_init(context: Any, /) None
This function is meant to behave like a BaseModel method to initialise private attributes.
It takes context as an argument since that’s what pydantic-core passes when calling it.
- Parameters:
self – The BaseModel instance.
context – The context.
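Example
A sketch constructing a Bayesian Logistic Regression model with explicit Student’s t priors rather than cold_start; the prior values and the two-feature context are hypothetical:

import numpy as np

from pybandits.model import BayesianLogisticRegression, StudentT

# One intercept (alpha) plus one beta per contextual feature (two features here).
blr = BayesianLogisticRegression(
    alpha=StudentT(mu=0.0, sigma=10.0, nu=5.0),
    betas=[StudentT(), StudentT()],
    update_method="MCMC",
)
context = np.array([[0.1, 0.5], [0.9, 0.2]])
prob, weighted_sum = blr.sample_proba(context=context)  # one value per context row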
- class pybandits.model.BayesianLogisticRegressionCC(*, n_successes: Annotated[int, Gt(gt=0)] = 1, n_failures: Annotated[int, Gt(gt=0)] = 1, alpha: StudentT, betas: Annotated[List[StudentT], MinLen(min_length=1)], update_method: Literal['MCMC', 'VI'] = 'MCMC', update_kwargs: dict | None = None, cost: Annotated[float, Ge(ge=0)])
Bases:
BaseBayesianLogisticRegression
Bayesian Logistic Regression model with cost control.
It is modeled as:
y = sigmoid(alpha + beta1 * x1 + beta2 * x2 + … + betaN * xN)
where the alpha and betas coefficients are Student’s t-distributions.
- Parameters:
alpha (StudentT) – Student’s t-distribution of the alpha coefficient.
betas (List[StudentT]) – Student’s t-distributions of the beta coefficients.
update_method (UpdateMethods, defaults to "MCMC") – The strategy for computing posterior quantities of the Bayesian models in the update function, such as Markov chain Monte Carlo (“MCMC”) or Variational Inference (“VI”). Check UpdateMethods in pybandits.model for the full list.
update_kwargs (Optional[dict], uses default values if not specified) – Additional arguments to pass to the update method.
cost (NonNegativeFloat) – Cost associated to the Bayesian Logistic Regression model.
- cost: Annotated[float, Ge(ge=0)]
- model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_post_init(context: Any, /) None
This function is meant to behave like a BaseModel method to initialise private attributes.
It takes context as an argument since that’s what pydantic-core passes when calling it.
- Parameters:
self – The BaseModel instance.
context – The context.
- class pybandits.model.Beta(*, n_successes: Annotated[int, Gt(gt=0)] = 1, n_failures: Annotated[int, Gt(gt=0)] = 1)
Bases:
BaseBeta
Beta Distribution model for Bernoulli multi-armed bandits.
- model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
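Example
A small sketch of the Beta model; the counter values are hypothetical, and the comments describe the documented behaviour of mean, sample_proba and update:

from pybandits.model import Beta

beta = Beta(n_successes=10, n_failures=2)
print(beta.mean)                # success rate: 10 / (10 + 2) ≈ 0.83
prob = beta.sample_proba()      # one sampled probability of a positive reward
beta.update(rewards=[1, 1, 0])  # counters are incremented with the new successes/failures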
- class pybandits.model.BetaCC(*, n_successes: Annotated[int, Gt(gt=0)] = 1, n_failures: Annotated[int, Gt(gt=0)] = 1, cost: Annotated[float, Ge(ge=0)])
Bases:
BaseBeta
Beta Distribution model for Bernoulli multi-armed bandits with cost control.
- Parameters:
cost (NonNegativeFloat) – Cost associated to the Beta distribution.
- cost: Annotated[float, Ge(ge=0)]
- model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class pybandits.model.BetaMO(*, models: List[Beta])
Bases:
ModelMO
Beta Distribution model for Bernoulli multi-armed bandits with multi-objectives.
- Parameters:
models (List[Beta] of shape (n_objectives,)) – List of Beta distributions.
- classmethod cold_start(n_objectives: Annotated[int, Gt(gt=0)], **kwargs) BetaMO
Utility function to create a multi-objective Beta model with cold start.
- Parameters:
n_objectives (PositiveInt) – The number of objectives.
- Returns:
beta_mo – The multi-objective Beta model.
- Return type:
BetaMO
- model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- sample_proba() List[Probability]
Sample the probability of getting a positive reward.
- Returns:
prob – Probabilities of getting a positive reward for each objective.
- Return type:
List[Probability]
- class pybandits.model.BetaMOCC(*, models: List[Beta], cost: Annotated[float, Ge(ge=0)])
Bases:
BetaMO
Beta Distribution model for Bernoulli multi-armed bandits with multi-objectives and cost control.
- Parameters:
models (List[BetaCC] of shape (n_objectives,)) – List of Beta distributions.
cost (NonNegativeFloat) – Cost associated to the Beta distribution.
- cost: Annotated[float, Ge(ge=0)]
- model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class pybandits.model.Model(*, n_successes: Annotated[int, Gt(gt=0)] = 1, n_failures: Annotated[int, Gt(gt=0)] = 1)
Bases:
BaseModel, ABC
Class to model the prior distributions.
- Parameters:
n_successes (PositiveInt = 1) – Counter of the number of successes.
n_failures (PositiveInt = 1) – Counter of the number of failures.
- property count: int
The total number of successes and failures collected.
- model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- n_failures: Annotated[int, Gt(gt=0)]
- n_successes: Annotated[int, Gt(gt=0)]
- reset()
Reset the model.
- update(rewards: List[BinaryReward], **kwargs)
Update n_successes and n_failures.
- Parameters:
rewards (List[BinaryReward]) – A list of binary rewards.
- class pybandits.model.ModelMO(*, models: Annotated[List[Model], MinLen(min_length=1)])
Bases:
BaseModel, ABC
Multi-objective extension of Model.
- Parameters:
models (List[Model]) – List of models.
- model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- reset()
Reset the model.
- sample_proba(**kwargs) List[Probability]
Sample the probability of getting a positive reward.
- Returns:
prob – Probabilities of getting a positive reward for each objective.
- Return type:
List[Probability]
- update(rewards: List[List[BinaryReward]], **kwargs)
Update the Beta model using the provided rewards.
- Parameters:
rewards (List[List[BinaryReward]]) – A list of rewards, where each reward is in turn a list containing the reward of the Beta model associated to each objective. For example, [[1, 1], [1, 0], [1, 1], [1, 0], [1, 1]].
kwargs (Dict[str, Any]) – Additional arguments for the Bayesian Logistic Regression MO child model.
- class pybandits.model.StudentT(*, mu: Annotated[float, None, Interval(gt=None, ge=None, lt=None, le=None), None, AllowInfNan(allow_inf_nan=False)] = 0.0, sigma: Annotated[float, None, Interval(gt=None, ge=None, lt=None, le=None), None, AllowInfNan(allow_inf_nan=False)] = 10.0, nu: Annotated[float, None, Interval(gt=None, ge=None, lt=None, le=None), None, AllowInfNan(allow_inf_nan=False)] = 5.0)
Bases:
PyBanditsBaseModel
Student’s t-distribution.
- Parameters:
mu (float) – Mean of the Student’s t-distribution.
sigma (float) – Standard deviation of the Student’s t-distribution.
nu (float) – Degrees of freedom.
- model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- mu: Annotated[float, None, Interval(gt=None, ge=None, lt=None, le=None), None, AllowInfNan(allow_inf_nan=False)]
- nu: Annotated[float, None, Interval(gt=None, ge=None, lt=None, le=None), None, AllowInfNan(allow_inf_nan=False)]
- sigma: Annotated[float, None, Interval(gt=None, ge=None, lt=None, le=None), None, AllowInfNan(allow_inf_nan=False)]
pybandits.strategy
- class pybandits.strategy.BestActionIdentificationBandit(*, exploit_p: Float_0_1 | None = 0.5)
Bases:
Strategy
Best-Action Identification (BAI) strategy for multi-armed bandits.
References
Simple Bayesian Algorithms for Best-Arm Identification (Russo, 2018) https://arxiv.org/pdf/1602.08448.pdf
- Parameters:
exploit_p (Optional[Float01], 0.5 if not specified) – Tuning parameter taking value in [0, 1] which specifies the probability of selecting the best or an alternative action. If exploit_p is 1, the bandit always selects the action with the highest probability of getting a positive reward, i.e. it behaves as a greedy strategy. If exploit_p is 0, the bandit always selects the action with the second highest probability of getting a positive reward.
- compare_best_actions(actions: Dict[ActionId, Beta]) float
Compare the two best actions, i.e. the two actions with the highest expected means of getting a positive reward.
- Parameters:
actions (Dict[ActionId, Beta])
- Returns:
pvalue – p-value result of the statistical test.
- Return type:
float
- exploit_p: Float_0_1 | None
- model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- classmethod numerize_exploit_p(v)
- select_action(p: Dict[ActionId, float], actions: Dict[ActionId, Model] | None = None) ActionId
With probability self.exploit_p, select the best action (i.e. the action with the highest probability of getting a positive reward); with probability 1 - self.exploit_p, select the second best action (i.e. the action with the second highest probability of getting a positive reward).
- Parameters:
p (Dict[ActionId, Probability]) – The dictionary of actions and their sampled probability of getting a positive reward.
actions (Optional[Dict[ActionId, Model]]) – The dictionary of actions and their associated model.
- Returns:
selected_action – The selected action.
- Return type:
ActionId
- with_exploit_p(exploit_p: Float_0_1 | None) Self
Instantiate a mutated best action identification strategy with an altered exploit_p.
- Parameters:
exploit_p (Optional[Float01], 0.5 if not specified) – Tuning parameter taking value in [0, 1] which specifies the probability of selecting the best or an alternative action. If exploit_p is 1, the bandit always selects the action with the highest probability of getting a positive reward, i.e. it behaves as a greedy strategy. If exploit_p is 0, the bandit always selects the action with the second highest probability of getting a positive reward.
- Returns:
mutated_best_action_identification – The mutated best action identification strategy.
- Return type:
Self
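Example
A sketch of the BAI strategy in isolation; the sampled probabilities are hypothetical:

from pybandits.strategy import BestActionIdentificationBandit

strategy = BestActionIdentificationBandit(exploit_p=0.8)
# With probability 0.8 the best action ("a1") is returned, otherwise the runner-up ("a2").
selected = strategy.select_action(p={"a1": 0.7, "a2": 0.5, "a3": 0.1})
relaxed = strategy.with_exploit_p(exploit_p=0.5)  # returns a new, mutated strategy instance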
- class pybandits.strategy.ClassicBandit
Bases:
Strategy
Classic multi-armed bandits strategy.
References
Analysis of Thompson Sampling for the Multi-armed Bandit Problem (Agrawal and Goyal, 2012) http://proceedings.mlr.press/v23/agrawal12/agrawal12.pdf
Thompson Sampling for Contextual Bandits with Linear Payoffs (Agrawal and Goyal, 2014) https://arxiv.org/pdf/1209.3352.pdf
- model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- select_action(p: Dict[ActionId, float], actions: Dict[ActionId, Model] | None = None) ActionId
Select the action with the highest probability of getting a positive reward.
- Parameters:
p (Dict[ActionId, Probability]) – The dictionary of actions and their sampled probability of getting a positive reward.
actions (Optional[Dict[ActionId, Model]]) – The dictionary of actions and their associated model.
- Returns:
selected_action – The selected action.
- Return type:
ActionId
- class pybandits.strategy.CostControlBandit(*, subsidy_factor: Float_0_1 | None = 0.5)
Bases:
CostControlStrategy
Cost Control (CC) strategy for multi-armed bandits.
Bandits are extended to include a control of the action cost. Each action is associated with a predefined “cost”. At prediction time, the model considers the actions whose expected rewards are above a pre-defined lower bound. Among these actions, the one with the lowest associated cost is recommended. The expected reward interval for feasible actions is defined as [(1-subsidy_factor)*max_p, max_p], where max_p is the highest expected reward sampled value.
References
Thompson Sampling for Contextual Bandit Problems with Auxiliary Safety Constraints (Daulton et al., 2019) https://arxiv.org/abs/1911.00638
Multi-Armed Bandits with Cost Subsidy (Sinha et al., 2021) https://arxiv.org/abs/2011.01488
- Parameters:
subsidy_factor (Optional[Float01], 0.5 if not specified) – Number in [0, 1] defining the smallest tolerated reward probability, and hence the set of feasible actions. If subsidy_factor is 1, the bandit always selects the action with the minimum cost. If subsidy_factor is 0, the bandit always selects the action with the highest probability of getting a positive reward (it behaves as a classic Bernoulli bandit).
- model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- classmethod numerize_subsidy_factor(v)
- select_action(p: Dict[ActionId, Probability], actions: Dict[ActionId, Model]) ActionId
Select the action with the minimum cost among the set of feasible actions, i.e. the actions whose expected rewards fall in the interval [(1 - subsidy_factor) * max_p, max_p], where max_p is the highest expected reward sampled value.
- Parameters:
p (Dict[ActionId, Probability]) – The dictionary of actions and their sampled probability of getting a positive reward.
actions (Dict[ActionId, BetaCC]) – The dictionary of actions and their cost.
- Returns:
selected_action – The selected action.
- Return type:
ActionId
- subsidy_factor: Float_0_1 | None
- with_subsidy_factor(subsidy_factor: Float_0_1 | None) Self
Instantiate a mutated cost control bandit strategy with an altered subsidy factor.
- Parameters:
subsidy_factor (Optional[Float01], 0.5 if not specified) – Number in [0, 1] defining the smallest tolerated reward probability, and hence the set of feasible actions. If subsidy_factor is 1, the bandit always selects the action with the minimum cost. If subsidy_factor is 0, the bandit always selects the action with the highest probability of getting a positive reward (it behaves as a classic Bernoulli bandit).
- Returns:
mutated_cost_control_bandit – The mutated cost control bandit strategy.
- Return type:
Self
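Example
A plain-Python illustration of the selection rule described above, not the library’s internal code: keep the actions whose sampled reward probability is at least (1 - subsidy_factor) * max_p, then recommend the cheapest of them.

p = {"a1": 0.90, "a2": 0.80, "a3": 0.40}   # sampled reward probabilities
cost = {"a1": 5.0, "a2": 2.0, "a3": 1.0}   # predefined action costs
subsidy_factor = 0.2

max_p = max(p.values())                                        # 0.90
threshold = (1 - subsidy_factor) * max_p                       # 0.72
feasible = [a for a, prob in p.items() if prob >= threshold]   # ["a1", "a2"]
selected = min(feasible, key=lambda a: cost[a])                # "a2": feasible and cheapest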
- class pybandits.strategy.CostControlStrategy
Bases:
Strategy, ABC
Cost Control (CC) strategy for multi-armed bandits.
Bandits are extended to include a control of the action cost. Each action is associated with a predefined “cost”.
- model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class pybandits.strategy.MultiObjectiveBandit
Bases:
MultiObjectiveStrategy
Multi-Objective (MO) strategy for multi-armed bandits.
The reward pertaining to an action is a multidimensional vector instead of a scalar value. In this setting, different actions are compared according to Pareto order between their expected reward vectors, and those actions whose expected rewards are not inferior to that of any other actions are called Pareto optimal actions, all of which constitute the Pareto front.
References
Thompson Sampling for Multi-Objective Multi-Armed Bandits Problem (Yahyaa and Manderick, 2015) https://www.researchgate.net/publication/272823659_Thompson_Sampling_for_Multi-Objective_Multi-Armed_Bandits_Problem
- model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- select_action(p: Dict[ActionId, List[Probability]], **kwargs) ActionId
Select an action at random from the Pareto optimal set of actions. The Pareto optimal action set (Pareto front) A* is the set of actions not dominated by any action outside A*. The dominance relation is established based on the per-objective reward probability vectors.
- Parameters:
p (Dict[ActionId, List[Probability]]) – The dictionary of actions and their sampled probability of getting a positive reward for each objective.
- Returns:
selected_action – The selected action.
- Return type:
ActionId
- class pybandits.strategy.MultiObjectiveCostControlBandit
Bases:
MultiObjectiveStrategy, CostControlStrategy
Multi-Objective (MO) with Cost Control (CC) strategy for multi-armed bandits.
This strategy allows the reward to be a multidimensional vector and include a control of the action cost. It merges the Multi-Objective and Cost Control strategies.
- model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- select_action(p: Dict[ActionId, List[Probability]], actions: Dict[ActionId, BetaMOCC]) ActionId
Select the action with the minimum cost among the Pareto optimal set of actions. The Pareto optimal action set (Pareto front) A* is the set of actions not dominated by any action outside A*. The dominance relation is established based on the per-objective reward probability vectors.
- Parameters:
p (Dict[ActionId, List[Probability]]) – The dictionary of actions and their sampled probability of getting a positive reward for each objective.
- Returns:
selected_action – The selected action.
- Return type:
ActionId
- class pybandits.strategy.MultiObjectiveStrategy
Bases:
Strategy
,ABC
Multi Objective Strategy to select actions in multi-armed bandits.
- classmethod get_pareto_front(p: Dict[ActionId, List[Probability]]) List[ActionId]
Create the Pareto optimal set of actions (Pareto front) A*, identified as the actions that are not dominated by any action outside A*.
- Parameters:
p (Dict[ActionId, List[Probability]]) – The dictionary of actions and their sampled probability of getting a positive reward for each objective.
- Returns:
pareto_front – The list of Pareto optimal actions.
- Return type:
List[ActionId]
- model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
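Example
A plain-Python illustration of the Pareto front computation described above, not the library’s implementation: an action is dominated if another action is at least as good on every objective and strictly better on at least one.

from typing import Dict, List

def pareto_front(p: Dict[str, List[float]]) -> List[str]:
    def dominates(x: List[float], y: List[float]) -> bool:
        # x dominates y: no worse on all objectives, strictly better on at least one
        return all(xi >= yi for xi, yi in zip(x, y)) and any(xi > yi for xi, yi in zip(x, y))

    return [
        a for a, pa in p.items()
        if not any(dominates(pb, pa) for b, pb in p.items() if b != a)
    ]

pareto_front({"a1": [0.9, 0.2], "a2": [0.3, 0.8], "a3": [0.2, 0.1]})  # ["a1", "a2"]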
- class pybandits.strategy.Strategy
Bases:
PyBanditsBaseModel, ABC
Strategy to select actions in multi-armed bandits.
- classmethod get_expected_value_from_state(state: Dict[str, Any], field_name: str) float
- model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- classmethod numerize_field(v, field_name: str)
- pybandits.strategy.random() → x in the interval [0, 1).
pybandits.actions_manager
- class pybandits.actions_manager.ActionsManager(delta: PositiveProbability | None = None, actions: Dict[ActionId, Model] | None = None, action_ids: Set[ActionId] | None = None, kwargs: Dict[str, Any] | None = None)
Bases:
PyBanditsBaseModel, ABC
Base class for managing actions and their associated models. The class accounts for non-stationarity by providing an adaptive window scheme for action updates; change point detection is based on the adaptive windowing scheme.
References
Scaling Multi-Armed Bandit Algorithms (Fouché et al., 2019) https://edouardfouche.com/publications/S-MAB_FOUCHE_KDD19.pdf
- Parameters:
actions (Dict[ActionId, Model]) – The dictionary of possible actions and their associated models.
delta (Optional[PositiveProbability]) – The confidence level for the adaptive window. None for skipping the change point detection.
- classmethod at_least_one_action_is_defined(v)
- delta: PositiveProbability | None
- property maximum_memory_length: Annotated[int, Ge(ge=0)]
Get maximum possible memory length based on current action statistics.
- Returns:
Maximum memory length allowed.
- Return type:
NonNegativeInt
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'json_encoders': {<class 'collections.deque'>: <class 'list'>}}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- update(actions: List[ActionId], rewards: List[BinaryReward] | List[List[BinaryReward]], actions_memory: List[ActionId] | None = None, rewards_memory: List[BinaryReward] | List[List[BinaryReward]] | None = None, **kwargs)
Update the models associated with the given actions using the provided rewards. When an adaptive window is used, the update is performed by resetting the action models and retraining them on the new data.
- Parameters:
actions (List[ActionId]) – The selected action for each sample.
rewards (Union[List[BinaryReward], List[List[BinaryReward]]]) – The reward for each sample.
actions_memory (Optional[List[ActionId]]) – List of previously selected actions.
rewards_memory (Optional[Union[List[BinaryReward], List[List[BinaryReward]]]]) – List of previously collected rewards.
- class pybandits.actions_manager.CmabActionsManager(delta: PositiveProbability | None = None, actions: Dict[ActionId, Model] | None = None, action_ids: Set[ActionId] | None = None, kwargs: Dict[str, Any] | None = None)
Bases:
ActionsManager, BaseModel, Generic[CmabModelType]
Manages actions and their associated models for cMAB models. The class accounts for non-stationarity by providing an adaptive window scheme for action updates.
- Parameters:
actions (Dict[ActionId, BayesianLogisticRegression]) – The dictionary of possible actions and their associated models.
delta (Optional[PositiveProbability], 0.1 if not specified) – The confidence level for the adaptive window.
- actions: Dict[ActionId, CmabModelType]
- classmethod check_bayesian_logistic_regression_models(v)
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'json_encoders': {<class 'collections.deque'>: <class 'list'>}}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- update(actions: List[ActionId], rewards: List[BinaryReward] | List[List[BinaryReward]], context: _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes], actions_memory: List[ActionId] | None = None, rewards_memory: List[BinaryReward] | List[List[BinaryReward]] | None = None, context_memory: _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes] | None = None)
Update the models associated with the given actions using the provided rewards. When an adaptive window is used, the update is performed by resetting the action models and retraining them on the new data.
- Parameters:
actions (List[ActionId]) – The selected action for each sample.
rewards (Union[List[BinaryReward], List[List[BinaryReward]]]) – The reward for each sample.
context (ArrayLike of shape (n_samples, n_features)) – Matrix of contextual features.
actions_memory (Optional[List[ActionId]]) – List of previously selected actions.
rewards_memory (Optional[Union[List[BinaryReward], List[List[BinaryReward]]]]) – List of previously collected rewards.
context_memory (Optional[ArrayLike] of shape (n_samples, n_features)) – Matrix of contextual features.
- pybandits.actions_manager.CmabActionsManagerCC
alias of
CmabActionsManager[BayesianLogisticRegressionCC]
- pybandits.actions_manager.CmabActionsManagerSO
alias of
CmabActionsManager[BayesianLogisticRegression]
- class pybandits.actions_manager.SmabActionsManager(delta: PositiveProbability | None = None, actions: Dict[ActionId, Model] | None = None, action_ids: Set[ActionId] | None = None, kwargs: Dict[str, Any] | None = None)
Bases:
ActionsManager, BaseModel, Generic[SmabModelType]
Manages actions and their associated models for sMAB models. The class accounts for non-stationarity by providing an adaptive window scheme for action updates.
- Parameters:
actions (Dict[ActionId, BaseBeta]) – The dictionary of possible actions and their associated models.
delta (Optional[PositiveProbability], 0.1 if not specified) – The confidence level for the adaptive window.
- actions: Dict[ActionId, SmabModelType]
- classmethod all_actions_have_same_number_of_objectives(actions: Dict[ActionId, SmabModelType])
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'json_encoders': {<class 'collections.deque'>: <class 'list'>}}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- update(actions: List[ActionId], rewards: List[BinaryReward] | List[List[BinaryReward]], actions_memory: List[ActionId] | None = None, rewards_memory: List[BinaryReward] | List[List[BinaryReward]] | None = None)
Update the models associated with the given actions using the provided rewards. When an adaptive window is used, the update is performed by resetting the action models and retraining them on the new data.
- Parameters:
actions (List[ActionId]) – The selected action for each sample.
rewards (Union[List[BinaryReward], List[List[BinaryReward]]]) – The reward for each sample.
actions_memory (Optional[List[ActionId]]) – List of previously selected actions.
rewards_memory (Optional[Union[List[BinaryReward], List[List[BinaryReward]]]]) – List of previously collected rewards.
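Example
A sketch enabling the adaptive window on a stochastic actions manager; delta and the update data are hypothetical, and SmabActionsManagerSO is the documented alias of SmabActionsManager[Beta]:

from pybandits.actions_manager import SmabActionsManagerSO
from pybandits.model import Beta

# delta turns on change point detection via the adaptive window; None disables it.
manager = SmabActionsManagerSO(
    delta=0.1,
    actions={"a1": Beta(), "a2": Beta()},
)
manager.update(actions=["a1", "a2", "a1"], rewards=[1, 0, 1])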
- pybandits.actions_manager.SmabActionsManagerCC
alias of
SmabActionsManager[BetaCC]
- pybandits.actions_manager.SmabActionsManagerMO
alias of
SmabActionsManager[BetaMO]
- pybandits.actions_manager.SmabActionsManagerMOCC
alias of
SmabActionsManager[BetaMOCC]
- pybandits.actions_manager.SmabActionsManagerSO
alias of
SmabActionsManager[Beta]
- class pybandits.actions_manager.SmabActionsManager(delta: PositiveProbability | None = None, actions: Dict[ActionId, Model] | None = None, action_ids: Set[ActionId] | None = None, kwargs: Dict[str, Any] | None = None)
Bases:
ActionsManager
,BaseModel
,Generic
[SmabModelType
]Manages actions and their associated models for sMAB models. The class allows to account for non-stationarity by providing an adaptive window scheme for action update.
- Parameters:
actions (Dict[ActionId, BaseBeta]) – The list of possible actions, and their associated Model.
delta (Optional[PositiveProbability], 0.1 if not specified.) – The confidence level for the adaptive window.
- actions: Dict[ActionId, SmabModelType]
- classmethod all_actions_have_same_number_of_objectives(actions: Dict[ActionId, SmabModelType])
- delta: PositiveProbability | None
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'json_encoders': {<class 'collections.deque'>: <class 'list'>}}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- update(actions: List[ActionId], rewards: List[BinaryReward] | List[List[BinaryReward]], actions_memory: List[ActionId] | None = None, rewards_memory: List[BinaryReward] | List[List[BinaryReward]] | None = None)
Update the models associated with the given actions using the provided rewards. For adaptive window size, the update by resetting the action models and retraining them on the new data.
- Parameters:
actions (List[ActionId]) – The selected action for each sample.
rewards (Union[List[BinaryReward], List[List[BinaryReward]]]) – The reward for each sample.
actions_memory (Optional[List[ActionId]]) – List of previously selected actions.
rewards_memory (Optional[Union[List[BinaryReward], List[List[BinaryReward]]]]) – List of previously collected rewards.
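A construction sketch for the adaptive-window variant (referenced in the class description above). It assumes that passing action_ids cold-starts a default Beta model per action, that the memory arguments carry the previously collected history used when the window is reset, and that Beta is importable from pybandits.model:

    from pybandits.actions_manager import SmabActionsManager
    from pybandits.model import Beta  # module path for Beta is assumed

    # delta enables the adaptive window; action_ids is assumed to cold-start
    # a default Beta model for each listed action.
    manager = SmabActionsManager[Beta](action_ids={"a1", "a2"}, delta=0.1)

    # New batch plus previously collected history; with an adaptive window the
    # update resets the per-action models and retrains them on the retained data.
    manager.update(
        actions=["a1", "a2", "a1"],
        rewards=[1, 0, 1],
        actions_memory=["a2", "a1"],
        rewards_memory=[0, 1],
    )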
pybandits.smab_simulator
- class pybandits.smab_simulator.SmabSimulator(*, smab: BaseSmabBernoulli, n_updates: Annotated[int, Gt(gt=0)] = 10, batch_size: Annotated[int, Gt(gt=0)] = 100, probs_reward: DataFrame | None = None, save: bool = False, path: str = '', file_prefix: str = '', random_seed: Annotated[int, Ge(ge=0)] | None = None, verbose: bool = False, visualize: bool = False)
Bases:
Simulator
Simulate environment for stochastic multi-armed bandits.
This class performs simulation of stochastic Multi-Armed Bandits (sMAB). Data are processed in batches of size n>=1. For each batch of simulated samples, the sMAB selects one action per sample and collects the corresponding simulated reward. Prior parameters are then updated based on the rewards returned for the recommended actions; a usage sketch follows this class entry.
- Parameters:
mab (BaseSmabBernoulli) – The sMAB model (passed via the smab keyword in the constructor signature above).
- mab: BaseSmabBernoulli
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'populate_by_name': True, 'validate_by_alias': True, 'validate_by_name': True}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_post_init(_Simulator__context: Any) None
Override this method to perform additional initialization after __init__ and model_construct. This is useful if you want to do some validation that requires the entire model to be initialized.
- classmethod replace_null_and_validate_probs_reward(values)
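An end-to-end sketch of running the simulator (referenced above). The module paths for Beta and ClassicBandit, the no-argument ClassicBandit construction, and the run() entry point inherited from the Simulator base class are assumptions:

    from pybandits.actions_manager import SmabActionsManagerSO
    from pybandits.model import Beta              # assumed module path
    from pybandits.strategy import ClassicBandit  # assumed module path
    from pybandits.smab import SmabBernoulli
    from pybandits.smab_simulator import SmabSimulator

    # Single-objective Bernoulli bandit with two illustrative actions.
    smab = SmabBernoulli(
        actions_manager=SmabActionsManagerSO(actions={"a1": Beta(), "a2": Beta()}),
        strategy=ClassicBandit(),
    )

    # 10 update rounds of 100 simulated samples each (the documented defaults).
    simulator = SmabSimulator(smab=smab, n_updates=10, batch_size=100, verbose=True)
    simulator.run()  # run() is assumed to be provided by the Simulator base class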
pybandits.cmab_simulator
- class pybandits.cmab_simulator.CmabSimulator(*, cmab: BaseCmabBernoulli, n_updates: Annotated[int, Gt(gt=0)] = 10, batch_size: Annotated[int, Gt(gt=0)] = 100, probs_reward: DataFrame | None = None, save: bool = False, path: str = '', file_prefix: str = '', random_seed: Annotated[int, Ge(ge=0)] | None = None, verbose: bool = False, visualize: bool = False, context: ndarray, group: List | None = None)
Bases:
Simulator
Simulate environment for contextual multi-armed bandit models.
This class simulates the information required by the contextual bandit. Generated data are processed by the bandit in batches of size n>=1. For each batch of samples, actions are recommended by the bandit and the corresponding simulated rewards are collected. Bandit policy parameters are then updated based on the rewards returned for the recommended actions; a usage sketch follows this class entry.
- Parameters:
mab (BaseCmabBernoulli) – Contextual multi-armed bandit model
context (np.ndarray of shape (n_samples, n_features)) – Context matrix of sample features.
group (Optional[List] with length=n_samples) – Group to which each sample belongs. Samples that belong to the same group have features drawn from the same distribution and the same probability of receiving positive/negative feedback from each action. If not supplied, all samples are assigned to the same group.
- context: ndarray
- group: List | None
- mab: BaseCmabBernoulli
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'populate_by_name': True, 'validate_by_alias': True, 'validate_by_name': True}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_post_init(_Simulator__context: Any) None
Override this method to perform additional initialization after __init__ and model_construct. This is useful if you want to do some validation that requires the entire model to be initialized.
- classmethod replace_nulls_and_validate_sizes_and_dtypes(values)
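A sketch of driving the contextual simulator with a synthetic context matrix (referenced above). The construction of the underlying cmab (CmabBernoulli, CmabActionsManagerSO, ClassicBandit, and the cold-start kwargs) mirrors the sMAB API documented earlier but is an assumption about the pybandits.cmab API, as is the run() entry point:

    import numpy as np

    from pybandits.cmab import CmabBernoulli                     # assumed
    from pybandits.actions_manager import CmabActionsManagerSO   # assumed
    from pybandits.strategy import ClassicBandit                 # assumed
    from pybandits.cmab_simulator import CmabSimulator

    n_samples, n_features = 1000, 3  # 1000 = n_updates * batch_size below

    # Contextual bandit with one model per action; the action_ids/kwargs
    # cold-start pattern is assumed by analogy with SmabActionsManager above.
    cmab = CmabBernoulli(
        actions_manager=CmabActionsManagerSO(
            action_ids={"a1", "a2"},
            kwargs={"n_features": n_features},  # assumed cold-start argument
        ),
        strategy=ClassicBandit(),
    )

    # Context matrix of shape (n_samples, n_features) and optional group labels.
    rng = np.random.default_rng(0)
    context = rng.normal(size=(n_samples, n_features))
    group = [0] * (n_samples // 2) + [1] * (n_samples - n_samples // 2)

    simulator = CmabSimulator(cmab=cmab, context=context, group=group,
                              n_updates=10, batch_size=100)
    simulator.run()  # run() is assumed to be inherited from Simulator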