Internal Rewards

class ftw.tf.internal_reward.InternalRewards(num_events: int, init_value_or_range: Union[float, Tuple[float, float]], perturb_probability: float = 1.0, perturb_max_pct_change: float = 0.2, dtype=tf.float32, name: str = 'internal_rewards')

Bases: ftw.tf.hyperparameters.base.Hyperparameter

Internal rewards class, as introduced by the FTW paper (Jaderberg et al., 2019).

In addition to the vector-valued internal rewards used in the FTW paper, this class also supports the ‘base case’ of a scalar internal reward variable.

Can be initialized either with a concrete scalar value or a tuple (min, max) indicating a range from which to draw a random sample. More specifically, the sample is drawn from a log-uniform distribution defined over this range.
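The initialization rule above can be sketched in plain Python. This is a minimal illustration, not the library's code: the helper name init_internal_reward and the rng argument are hypothetical, and the check that the range is strictly positive is an assumption that follows from sampling in log space.

```python
import math
import random
from typing import Tuple, Union


def init_internal_reward(
    init_value_or_range: Union[float, Tuple[float, float]],
    rng: random.Random,
) -> float:
    """Hypothetical helper: return the exact value, or a log-uniform sample."""
    if isinstance(init_value_or_range, tuple):
        low, high = init_value_or_range
        # Assumption: a log-uniform range must be strictly positive.
        if not 0 < low <= high:
            raise ValueError("log-uniform sampling needs 0 < min <= max")
        # Sample uniformly in log space, then map back with exp().
        return math.exp(rng.uniform(math.log(low), math.log(high)))
    # A concrete scalar is used as-is.
    return float(init_value_or_range)


rng = random.Random(0)
samples = [init_internal_reward((0.1, 10.0), rng) for _ in range(1000)]
exact = init_internal_reward(0.5, rng)
```

Sampling in log space spreads the draws evenly across orders of magnitude, so values between 0.1 and 1 are about as likely as values between 1 and 10.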

This class is meant for environments that expose a vector of distinct environment events. The environment should supply this vector in place of the usual scalar reward, e.g. as the reward field of a dm-acme observation_action_reward.OAR NamedTuple. The reward() method of this class then computes the reward as a

  • dot product between environment events and internal reward weights, if the internal rewards are a vector and events is a (batch of) vector(s).
  • product between environment events and internal reward weights, if the internal reward is a scalar and events is a (batch of) vector(s).

In the unlikely case that events is a scalar but the internal rewards are a vector, reward() raises a ValueError.
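The shape rules above can be illustrated with a small pure-Python sketch. It is not the library's reward() implementation, which works on tf.Tensor values. The function name reward_sketch is hypothetical, and two behaviours are assumptions: the scalar case is treated as an element-wise product, and a length mismatch raises ValueError.

```python
from typing import List, Sequence, Union

Number = Union[int, float]


def _is_scalar(x) -> bool:
    return isinstance(x, (int, float))


def reward_sketch(events, weights):
    """Illustrative only: combine events with internal reward weights."""
    if not _is_scalar(weights):
        # Vector weights: scalar events are incompatible.
        if _is_scalar(events):
            raise ValueError("scalar events cannot be used with vector weights")
        is_batch = len(events) > 0 and not _is_scalar(events[0])
        batch = events if is_batch else [events]
        for row in batch:
            if len(row) != len(weights):
                raise ValueError("events and weights differ in length")
        # One dot product per event vector.
        rewards = [sum(e * w for e, w in zip(row, weights)) for row in batch]
        return rewards if is_batch else rewards[0]

    # Scalar weight (assumption: plain element-wise product).
    if _is_scalar(events):
        return events * weights
    if len(events) > 0 and not _is_scalar(events[0]):
        return [[e * weights for e in row] for row in events]
    return [e * weights for e in events]


single = reward_sketch([1.0, 0.0, 2.0], [0.5, -1.0, 0.25])
batched = reward_sketch([[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]], [0.5, -1.0, 0.25])
scaled = reward_sketch([1.0, 3.0], 2.0)
```

With vector weights, each event vector collapses to one reward value, so a batch of event vectors yields one reward per batch entry.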

Inherits from ftw.tf.hyperparameters.Hyperparameter, i.e. it offers get(), set() and perturb() methods. See docstring for ftw.tf.hyperparameters.Hyperparameter for more details.

__init__(num_events: int, init_value_or_range: Union[float, Tuple[float, float]], perturb_probability: float = 1.0, perturb_max_pct_change: float = 0.2, dtype=tf.float32, name: str = 'internal_rewards')

Initializes InternalRewards.

Args:
num_events: Number of different events supplied by the environment. Must be an int value > 0.
init_value_or_range: Either a scalar value of type float for initialization with an exact value, or a tuple (min, max), where min and max are of type float, for initialization by drawing a sample from a log-uniform distribution defined over (min, max).
perturb_probability: The probability of actually perturbing the internal rewards value(s) when calling perturb(). Must be a float value with 0 <= value <= 1.
perturb_max_pct_change: Maximum allowed change of the hyperparameter value in percent, applied when perturb() is called and actually perturbs the value (see perturb_probability). The resulting change lies in the range (-perturb_max_pct_change, perturb_max_pct_change). Must be a float value > 0.
dtype: tf.dtype used by the tf.Variable that holds the internal rewards value(s). Defaults to tf.float32.
name: Name for this InternalRewards instance. Defaults to ‘internal_rewards’.

get() → numpy.ndarray

Returns the value of the tf.Variable storing the hyperparameter value, as a numpy.ndarray.

perturb(verbose=False)

May perturb the internal rewards, depending on perturb_probability.
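A rough sketch of how perturb_probability and perturb_max_pct_change could interact is given below. The exact perturbation rule is not spelled out here, so everything in this block is an assumption: the function name perturb_sketch is hypothetical, a single random draw decides whether any perturbation happens, and each value is scaled by an independent relative change drawn uniformly from (-perturb_max_pct_change, perturb_max_pct_change).

```python
import random
from typing import List, Sequence


def perturb_sketch(
    values: Sequence[float],
    perturb_probability: float,
    perturb_max_pct_change: float,
    rng: random.Random,
) -> List[float]:
    """Illustrative only: maybe apply a bounded relative change to each value."""
    if not 0.0 <= perturb_probability <= 1.0:
        raise ValueError("perturb_probability must lie in [0, 1]")
    if perturb_max_pct_change <= 0.0:
        raise ValueError("perturb_max_pct_change must be > 0")

    # Assumption: one draw decides whether the whole vector is perturbed.
    if rng.random() >= perturb_probability:
        return list(values)

    # Assumption: independent multiplicative change per value.
    return [
        v * (1.0 + rng.uniform(-perturb_max_pct_change, perturb_max_pct_change))
        for v in values
    ]


rng = random.Random(42)
weights = [1.0, -0.5, 2.0]
unchanged = perturb_sketch(weights, 0.0, 0.2, rng)
perturbed = perturb_sketch(weights, 1.0, 0.2, rng)
```

Under these assumptions, the default perturb_probability of 1.0 perturbs on every call, and the default perturb_max_pct_change of 0.2 keeps each value within 20% of its previous magnitude.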

reward(events: Union[numpy.ndarray, tensorflow.python.framework.ops.Tensor]) → tensorflow.python.framework.ops.Tensor

Computes reward as a dot product between environment events and internal reward weights.

If internal reward is a scalar, then the reward is just the product between events and internal reward.

Args:
events: Environment events. Expected to be of type numpy.ndarray or tf.Tensor.
Returns:
Scalar reward of type tf.Tensor.
Raises:
ValueError: If shapes of events and internal rewards are incompatible.

set(value: Union[numpy.ndarray, tensorflow.python.framework.ops.Tensor])

Assigns value to the tf.Variable storing the hyperparameter value.

Args:
value: numpy.ndarray or tf.Tensor, containing the new value for the hyperparameter variable.

variable

Returns the tf.Variable storing the hyperparameter value.