For The Win (FTW) TensorFlow Agent

FTW Agent

class ftw.agents.tf.ftw.agent.FTW(environment_spec: acme.specs.EnvironmentSpec, sequence_length: int, num_environment_events: int = 1, embed: sonnet.src.base.Module = None, max_queue_size: int = 32, batch_size: int = 16, hidden_size: int = 256, use_pixel_cotrol: bool = True, use_reward_prediction: bool = True, reward_prediction_sequence_length: int = 3, reward_prediction_sequence_period: int = 1, num_dimensions: int = 256, dnc_clip_value=None, use_dnc_linear_projection: bool = True, init_scale: float = 0.1, min_scale: float = 1e-06, tanh_mean: bool = False, fixed_scale: bool = False, use_tfd_independent: bool = False, variational_unit_w_init: Union[sonnet.src.initializers.Initializer, tensorflow.python.keras.initializers.initializers_v2.Initializer, None] = None, variational_unit_b_init: Union[sonnet.src.initializers.Initializer, tensorflow.python.keras.initializers.initializers_v2.Initializer, None] = None, strict_period_order: bool = True, dnc_memory_size: int = 450, dnc_word_size: int = 32, dnc_num_reads: int = 4, core_type: str = 'rpth', slow_core_period_min_max: Tuple[int, int] = (5, 20), slow_core_period_init_value: Optional[int] = None, learning_rate: Union[float, Tuple[float, float]] = (1e-05, 0.005), entropy_cost: Union[float, Tuple[float, float]] = (0.0005, 0.01), reward_prediction_cost: Union[float, Tuple[float, float]] = (0.1, 1.0), pixel_control_cost: Union[float, Tuple[float, float]] = (0.01, 0.1), kld_prior_fixed_cost: Union[float, Tuple[float, float]] = (0.0001, 0.1), kld_prior_posterior_cost: Union[float, Tuple[float, float]] = (0.001, 1.0), scale_grads_fast_to_slow: Union[float, Tuple[float, float]] = (0.1, 1.0), internal_rewards: Union[float, Tuple[float, float]] = (0.1, 1.0), baseline_cost: float = 0.5, discount: float = 0.99, max_abs_reward: float = None, max_gradient_norm: float = None, rms_prop_epsilon: float = 1e-05, learning_rate_decay_steps: int = 0, uint_pixels_to_float: bool = True, agent_id: int = 0, counter: acme.utils.counting.Counter = None, logger: acme.utils.loggers.base.Logger = None)

Bases: acme.core.Actor, ftw.agents.tf.ftw.agent.FTWWithoutActor

__init__(environment_spec: acme.specs.EnvironmentSpec, sequence_length: int, num_environment_events: int = 1, embed: sonnet.src.base.Module = None, max_queue_size: int = 32, batch_size: int = 16, hidden_size: int = 256, use_pixel_cotrol: bool = True, use_reward_prediction: bool = True, reward_prediction_sequence_length: int = 3, reward_prediction_sequence_period: int = 1, num_dimensions: int = 256, dnc_clip_value=None, use_dnc_linear_projection: bool = True, init_scale: float = 0.1, min_scale: float = 1e-06, tanh_mean: bool = False, fixed_scale: bool = False, use_tfd_independent: bool = False, variational_unit_w_init: Union[sonnet.src.initializers.Initializer, tensorflow.python.keras.initializers.initializers_v2.Initializer, None] = None, variational_unit_b_init: Union[sonnet.src.initializers.Initializer, tensorflow.python.keras.initializers.initializers_v2.Initializer, None] = None, strict_period_order: bool = True, dnc_memory_size: int = 450, dnc_word_size: int = 32, dnc_num_reads: int = 4, core_type: str = 'rpth', slow_core_period_min_max: Tuple[int, int] = (5, 20), slow_core_period_init_value: Optional[int] = None, learning_rate: Union[float, Tuple[float, float]] = (1e-05, 0.005), entropy_cost: Union[float, Tuple[float, float]] = (0.0005, 0.01), reward_prediction_cost: Union[float, Tuple[float, float]] = (0.1, 1.0), pixel_control_cost: Union[float, Tuple[float, float]] = (0.01, 0.1), kld_prior_fixed_cost: Union[float, Tuple[float, float]] = (0.0001, 0.1), kld_prior_posterior_cost: Union[float, Tuple[float, float]] = (0.001, 1.0), scale_grads_fast_to_slow: Union[float, Tuple[float, float]] = (0.1, 1.0), internal_rewards: Union[float, Tuple[float, float]] = (0.1, 1.0), baseline_cost: float = 0.5, discount: float = 0.99, max_abs_reward: float = None, max_gradient_norm: float = None, rms_prop_epsilon: float = 1e-05, learning_rate_decay_steps: int = 0, uint_pixels_to_float: bool = True, agent_id: int = 0, counter: acme.utils.counting.Counter = None, logger: acme.utils.loggers.base.Logger = None)

Initialize self. See help(type(self)) for accurate signature.

observe(action: Any, next_timestep: dm_env._environment.TimeStep)

Make an observation of timestep data from the environment.

Args:
action: action taken in the environment.
next_timestep: timestep produced by the environment given the action.
observe_first(timestep: dm_env._environment.TimeStep)

Make a first observation from the environment.

Note that this need not be an initial state; it merely begins the recording of a trajectory.

Args:
timestep: first timestep.
run()
select_action(observation: numpy.ndarray) → int

Samples from the policy and returns an action.

update()

Perform an update of the actor parameters from past observations.

FTW Learner

class ftw.agents.tf.ftw.learning.FtwLearner(policy_network: sonnet.src.recurrent.RNNCore, dataset: tensorflow.python.data.ops.dataset_ops.DatasetV2, learning_rate: Union[float, tensorflow.python.ops.variables.Variable], slow_core_period: Union[float, tensorflow.python.ops.variables.Variable], internal_rewards: ftw.tf.internal_reward.ftw_internal_reward.InternalRewards, pixel_control_network: Optional[sonnet.src.recurrent.RNNCore] = None, reward_prediction_network: Optional[sonnet.src.base.Module] = None, pixel_control_dataset: Optional[tensorflow.python.data.ops.dataset_ops.DatasetV2] = None, nonzero_reward_prediction_dataset: Optional[tensorflow.python.data.ops.dataset_ops.DatasetV2] = None, zero_reward_prediction_dataset: Optional[tensorflow.python.data.ops.dataset_ops.DatasetV2] = None, entropy_cost: Union[float, tensorflow.python.ops.variables.Variable] = 0.0, kld_prior_fixed_cost: Union[float, tensorflow.python.ops.variables.Variable] = 0.0001, kld_prior_posterior_cost: Union[float, tensorflow.python.ops.variables.Variable] = 0.001, pixel_control_cost: Union[float, tensorflow.python.ops.variables.Variable] = 0.01, reward_prediction_cost: Union[float, tensorflow.python.ops.variables.Variable] = 0.1, baseline_cost: float = 0.5, discount: Union[float, tensorflow.python.ops.variables.Variable] = 0.99, max_abs_reward: Optional[float] = None, max_gradient_norm: Optional[float] = None, rms_prop_epsilon: float = 0.01, learning_rate_decay_steps: int = 0, can_sample=None, can_sample_auxiliary=None, uint_pixels_to_float: bool = True, learner_id: int = 0, counter: acme.utils.counting.Counter = None, logger: acme.utils.loggers.base.Logger = None)

Bases: acme.core.Learner, acme.tf.savers.TFSaveable

Learner for an importance-weighted advantage actor-critic with auxiliary tasks and recurrent processing with temporal hierarchy.

__init__(policy_network: sonnet.src.recurrent.RNNCore, dataset: tensorflow.python.data.ops.dataset_ops.DatasetV2, learning_rate: Union[float, tensorflow.python.ops.variables.Variable], slow_core_period: Union[float, tensorflow.python.ops.variables.Variable], internal_rewards: ftw.tf.internal_reward.ftw_internal_reward.InternalRewards, pixel_control_network: Optional[sonnet.src.recurrent.RNNCore] = None, reward_prediction_network: Optional[sonnet.src.base.Module] = None, pixel_control_dataset: Optional[tensorflow.python.data.ops.dataset_ops.DatasetV2] = None, nonzero_reward_prediction_dataset: Optional[tensorflow.python.data.ops.dataset_ops.DatasetV2] = None, zero_reward_prediction_dataset: Optional[tensorflow.python.data.ops.dataset_ops.DatasetV2] = None, entropy_cost: Union[float, tensorflow.python.ops.variables.Variable] = 0.0, kld_prior_fixed_cost: Union[float, tensorflow.python.ops.variables.Variable] = 0.0001, kld_prior_posterior_cost: Union[float, tensorflow.python.ops.variables.Variable] = 0.001, pixel_control_cost: Union[float, tensorflow.python.ops.variables.Variable] = 0.01, reward_prediction_cost: Union[float, tensorflow.python.ops.variables.Variable] = 0.1, baseline_cost: float = 0.5, discount: Union[float, tensorflow.python.ops.variables.Variable] = 0.99, max_abs_reward: Optional[float] = None, max_gradient_norm: Optional[float] = None, rms_prop_epsilon: float = 0.01, learning_rate_decay_steps: int = 0, can_sample=None, can_sample_auxiliary=None, uint_pixels_to_float: bool = True, learner_id: int = 0, counter: acme.utils.counting.Counter = None, logger: acme.utils.loggers.base.Logger = None)

Initialize self. See help(type(self)) for accurate signature.

get_step()
get_variables(names: List[str]) → List[List[numpy.ndarray]]

Return the named variables as a collection of (nested) numpy arrays.

Args:
names: args where each name is a string identifying a predefined subset of the variables.
Returns:
A list of (nested) numpy arrays, variables, such that variables[i] corresponds to the collection named by names[i].
get_weights() → Mapping[str, List[tensorflow.python.ops.variables.Variable]]
run()

Run the update loop; typically an infinite loop which calls step.

set_step(step)
set_weights(weights: Mapping[str, List[tensorflow.python.ops.variables.Variable]])
state

Returns the stateful objects for checkpointing.

step()

Does a step of SGD and logs the results.

FTW Actor

class ftw.agents.tf.ftw.acting.FtwActor(network: sonnet.src.recurrent.RNNCore, adder: acme.adders.base.Adder = None, reward_prediction_adder: acme.adders.base.Adder = None, variable_client: acme.tf.variable_utils.VariableClient = None, uint_pixels_to_float: bool = True)

Bases: acme.core.Actor

A recurrent actor.

__init__(network: sonnet.src.recurrent.RNNCore, adder: acme.adders.base.Adder = None, reward_prediction_adder: acme.adders.base.Adder = None, variable_client: acme.tf.variable_utils.VariableClient = None, uint_pixels_to_float: bool = True)

Initialize self. See help(type(self)) for accurate signature.

observe(action: Any, next_timestep: dm_env._environment.TimeStep)

Make an observation of timestep data from the environment.

Args:
action: action taken in the environment.
next_timestep: timestep produced by the environment given the action.
observe_first(timestep: dm_env._environment.TimeStep)

Make a first observation from the environment.

Note that this need not be an initial state; it merely begins the recording of a trajectory.

Args:
timestep: first timestep.
select_action(observation: Any) → Any

Samples from the policy and returns an action.

update()

Perform an update of the actor parameters from past observations.
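
The four methods above are typically called in a fixed order by an environment loop. The following is a minimal, self-contained sketch of that call order only. StubActor, StubEnv and the reduced TimeStep tuple are hypothetical stand-ins written for this illustration; they are not part of ftw, acme or dm_env, and the real FtwActor additionally writes to its adders and queries its network.

```python
import random
from typing import NamedTuple, Optional


class TimeStep(NamedTuple):
    """Hypothetical, reduced stand-in for dm_env.TimeStep."""
    reward: Optional[float]
    observation: int
    is_last: bool


class StubEnv:
    """Hypothetical environment that ends every episode after `length` steps."""

    def __init__(self, length: int = 5):
        self._length = length
        self._t = 0

    def reset(self) -> TimeStep:
        self._t = 0
        return TimeStep(reward=None, observation=0, is_last=False)

    def step(self, action: int) -> TimeStep:
        self._t += 1
        return TimeStep(reward=float(action), observation=self._t,
                        is_last=self._t >= self._length)


class StubActor:
    """Hypothetical actor exposing the same four methods as FtwActor."""

    def __init__(self, num_actions: int = 3, seed: int = 0):
        self._rng = random.Random(seed)
        self._num_actions = num_actions
        self.calls = []  # records the order of method calls

    def observe_first(self, timestep: TimeStep) -> None:
        self.calls.append("observe_first")

    def select_action(self, observation: int) -> int:
        self.calls.append("select_action")
        return self._rng.randrange(self._num_actions)

    def observe(self, action: int, next_timestep: TimeStep) -> None:
        self.calls.append("observe")

    def update(self) -> None:
        self.calls.append("update")


def run_episode(env: StubEnv, actor: StubActor) -> float:
    """Runs one episode and returns the sum of rewards."""
    timestep = env.reset()
    actor.observe_first(timestep)  # begin recording a trajectory
    episode_return = 0.0
    while not timestep.is_last:
        action = actor.select_action(timestep.observation)
        timestep = env.step(action)
        actor.observe(action, next_timestep=timestep)  # record the transition
        actor.update()  # give the actor a chance to refresh its parameters
        episode_return += timestep.reward
    return episode_return


actor = StubActor()
episode_return = run_episode(StubEnv(length=5), actor)
```

In this sketch observe_first is called once per episode, followed by one select_action, observe and update call per environment step.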

Utilities for Replay Buffers, Datasets, Hyperparameters & Internal Rewards

ftw.agents.tf.ftw.utils.create_adders(server_address: str, sequence_length: int, use_pixel_control: bool = False, use_reward_prediction: bool = False, reward_prediction_sequence_length: int = 3, reward_prediction_sequence_period: int = 1, pad_end_of_episode: bool = False, delta_encoded: bool = True)

Creates the reverb adders required by the FTW actor.

Args:
server_address: Address of the reverb server responsible for storing training data.
sequence_length: Length of unroll sequences used in training (for calculation of main losses and Pixel control auxiliary loss).
use_pixel_control: Whether to create an adder for the Pixel control auxiliary task.
use_reward_prediction: Whether to create an adder for the Reward prediction auxiliary task.
reward_prediction_sequence_length: Length of reward prediction sequences. Defaults to 3, as in the FTW and UNREAL agents.
reward_prediction_sequence_period: Period with which to add Reward prediction sequences to the respective replay buffer. Defaults to 1, i.e. at every step, the last reward_prediction_sequence_length steps are added to the replay buffer.
pad_end_of_episode: Whether to pad sequences with zero-like steps at the end of an episode, if necessary. Defaults to False.
delta_encoded: Whether to use compression for the adder. May lower RAM requirements. See documentation of dm-acme’s adders for more details. Defaults to True.
Returns:
A tuple (adder, rp_adder), where adder is the main adder and rp_adder is either None (if use_reward_prediction=False) or the adder required for the Reward prediction auxiliary task.
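
To illustrate how reward_prediction_sequence_length and reward_prediction_sequence_period interact, the sketch below lists the windows of step indices that would be written for a short episode. The helper reward_prediction_windows is hypothetical and not part of ftw; it assumes that the first window is emitted as soon as enough steps are available and that later windows follow every period steps. The actual adder may differ in details such as end-of-episode padding.

```python
from typing import List, Sequence


def reward_prediction_windows(steps: Sequence[int],
                              sequence_length: int = 3,
                              period: int = 1) -> List[List[int]]:
    """Returns the last `sequence_length` steps, taken every `period` steps."""
    windows = []
    for end in range(sequence_length, len(steps) + 1):
        # Emit a window only when `period` steps have passed since the last one.
        if (end - sequence_length) % period == 0:
            windows.append(list(steps[end - sequence_length:end]))
    return windows


episode = list(range(6))  # step indices 0..5
every_step = reward_prediction_windows(episode, sequence_length=3, period=1)
every_other = reward_prediction_windows(episode, sequence_length=3, period=2)
```

With period=1 consecutive windows overlap in all but one step; a larger period reduces this overlap and the number of items written to the replay buffer.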
ftw.agents.tf.ftw.utils.create_datasets(learner_client: reverb.TFClient, environment_spec: acme.specs.EnvironmentSpec, batch_size: int, sequence_length: int, extra_spec: Optional[Dict[KT, VT]] = None, use_pixel_control: bool = False, use_reward_prediction: bool = False, reward_prediction_sequence_length: int = 3)

Creates the dataset(s) required by the FTW agent.

Args:
learner_client: A reverb.TFClient connected to the reverb server holding the required reverb tables.
environment_spec: An acme.specs.EnvironmentSpec namedtuple containing the specs of the environment.
batch_size: Batch size used in training.
sequence_length: Length of unroll sequences used in training (for calculation of main losses and Pixel control auxiliary loss).
extra_spec: A dictionary containing extra specs required for training, such as logits or core state.
use_pixel_control: Whether to create a dataset for the Pixel control auxiliary task.
use_reward_prediction: Whether to create a dataset for the Reward prediction auxiliary task.
reward_prediction_sequence_length: Length of reward prediction sequences. Defaults to 3, as in the FTW and UNREAL agents.
Returns:
A 4-element tuple of tf.data.Dataset objects, one per task (the queue dataset is used in the calculation of the main losses):
(queue_dataset, pixel_control_dataset, nonzero_reward_prediction_dataset, zero_reward_prediction_dataset)
ftw.agents.tf.ftw.utils.create_reverb_tables(batch_size: int, max_queue_size: int, use_pixel_control: bool = False, use_reward_prediction: bool = False, max_pixel_control_buffer_size: int = 100, max_reward_pred_buffer_size: int = 800)

Creates the reverb table(s) required by the FTW agent.

Args:
batch_size: Batch size used in training.
max_queue_size: Maximum capacity of queue.
use_pixel_control: Whether to create a table for the Pixel control auxiliary task.
use_reward_prediction: Whether to create a table for the Reward prediction auxiliary task.
max_pixel_control_buffer_size: Maximum capacity of Pixel control replay buffer.
max_reward_pred_buffer_size: Maximum capacity of each Reward prediction replay buffer (one buffer for zero rewards, one for non-zero rewards).
Returns:
A triple (tables, can_sample_queue, can_sample_auxiliary), where tables is a list containing all created tables, can_sample_queue is a function that returns a bool indicating whether a batch of training data can be sampled from the queue (used in the calculation of the main losses), and can_sample_auxiliary is a function that returns a bool indicating whether a batch of training data can be sampled from the auxiliary replay buffer(s).
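
The role of the two returned predicates can be sketched without a reverb server. The example below is an illustration under stated assumptions, not the library's implementation: StubTable and make_can_sample are hypothetical names, and the rule "a batch can be sampled once every relevant table holds at least batch_size items" is an assumption about what the predicates check.

```python
from typing import Callable, List


class StubTable:
    """Hypothetical stand-in for a replay table that only tracks its size."""

    def __init__(self, name: str, max_size: int):
        self.name = name
        self.max_size = max_size
        self.size = 0

    def insert(self, n: int = 1) -> None:
        # Capacity is bounded, mirroring the max_*_size arguments above.
        self.size = min(self.size + n, self.max_size)


def make_can_sample(tables: List[StubTable], batch_size: int) -> Callable[[], bool]:
    """Builds a zero-argument predicate over the given tables."""
    def can_sample() -> bool:
        # Assumed rule: every table must hold at least one full batch.
        return all(table.size >= batch_size for table in tables)
    return can_sample


queue = StubTable("queue", max_size=32)
pixel_control = StubTable("pixel_control", max_size=100)
can_sample_queue = make_can_sample([queue], batch_size=16)
can_sample_auxiliary = make_can_sample([pixel_control], batch_size=16)

queue.insert(16)         # a full batch is available for the main losses
pixel_control.insert(4)  # the auxiliary buffer is still filling up
ready_for_main_losses = can_sample_queue()
ready_for_auxiliary_losses = can_sample_auxiliary()
```

A learner can poll such predicates before each step and skip or delay the update while either returns False.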
ftw.agents.tf.ftw.utils.initialize_hypers(slow_core_period_min_max: Tuple[int, int] = (5, 20), slow_core_period_init_value: Optional[int] = None, learning_rate: Union[float, Tuple[float, float]] = (1e-05, 0.005), entropy_cost: Union[float, Tuple[float, float]] = (0.0005, 0.01), reward_prediction_cost: Union[float, Tuple[float, float]] = (0.1, 1.0), pixel_control_cost: Union[float, Tuple[float, float]] = (0.01, 0.1), kld_prior_fixed_cost: Union[float, Tuple[float, float]] = (0.0001, 0.1), kld_prior_posterior_cost: Union[float, Tuple[float, float]] = (0.001, 1.0), scale_grads_fast_to_slow: Union[float, Tuple[float, float]] = (0.1, 1.0)) → Mapping[str, ftw.tf.hyperparameters.base.Hyperparameter]

Create and initialize all hyperparameters required by the FTW agent.

All arguments can either be supplied as a 2-tuple (min, max), indicating a range to be used in the random initialization of the corresponding hyperparameter, or as a scalar value, in which case the corresponding hyperparameter will be initialized with this exact value. Please note, however, that if a scalar value is used to initialize slow_core_period to an exact value, calling perturb() on the resulting hyperparameter will have no effect.

Args:
slow_core_period_min_max: (Inclusive) lower and upper bound for random initialization of the period used for the slow core of the RPTH module. See docstring for RPTH module (in ftw.tf.networks.recurrence) for more details.
slow_core_period_init_value: Optional. If not None, the period used for the slow core of the RPTH module will be initialized with this exact value, instead of being initialized randomly. See docstring for RPTH module (in ftw.tf.networks.recurrence) for more details.
learning_rate: Learning rate used in training.
entropy_cost: Multiplier for the entropy loss.
reward_prediction_cost: Multiplier for the Reward prediction loss.
pixel_control_cost: Multiplier for the Pixel control loss.
kld_prior_fixed_cost: Multiplier for the Kullback-Leibler divergence loss between a fixed Multivariate Normal Diagonal (MVNDiag) distribution and the prior (MVNDiag) distribution as produced by the RPTH module’s slow core.
kld_prior_posterior_cost: Multiplier for the Kullback-Leibler divergence loss between the prior (MVNDiag) distribution as produced by the RPTH module’s slow core and the posterior (MVNDiag) distribution as produced by the RPTH module’s fast core.
scale_grads_fast_to_slow: Scaling factor for the gradients flowing from fast to slow core of the RPTH module.

Returns:
A dictionary containing all created hyperparameters. Keys of this dictionary correspond to the argument names of this function, except for the key ‘slow_core_period’, which results from the argument slow_core_period_min_max (and possibly slow_core_period_init_value).
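
The scalar-or-range convention and the special handling of the slow core period can be sketched in plain Python. This is a minimal illustration, not the library's code: sketch_initialize_hypers and its helpers are hypothetical, the result holds plain numbers rather than Hyperparameter objects, and uniform sampling for float ranges is an assumption (the description above only states that initialization is random).

```python
import random
from typing import Dict, Optional, Tuple, Union

FloatOrRange = Union[float, Tuple[float, float]]


def init_float(value_or_range: FloatOrRange, rng: random.Random) -> float:
    """Returns the exact scalar, or a random draw from the (min, max) range."""
    if isinstance(value_or_range, tuple):
        low, high = value_or_range
        return rng.uniform(low, high)  # assumption: uniform sampling
    return float(value_or_range)


def init_slow_core_period(min_max: Tuple[int, int],
                          init_value: Optional[int],
                          rng: random.Random) -> int:
    """An explicit init value takes precedence over the (inclusive) range."""
    if init_value is not None:
        return int(init_value)
    low, high = min_max
    return rng.randint(low, high)  # randint includes both bounds


def sketch_initialize_hypers(slow_core_period_min_max: Tuple[int, int] = (5, 20),
                             slow_core_period_init_value: Optional[int] = None,
                             seed: int = 0,
                             **float_hypers: FloatOrRange) -> Dict[str, float]:
    rng = random.Random(seed)
    # Both period arguments collapse into the single key 'slow_core_period'.
    hypers: Dict[str, float] = {
        "slow_core_period": init_slow_core_period(
            slow_core_period_min_max, slow_core_period_init_value, rng)
    }
    for name, value_or_range in float_hypers.items():
        hypers[name] = init_float(value_or_range, rng)
    return hypers


sampled = sketch_initialize_hypers(learning_rate=(1e-05, 0.005), entropy_cost=0.001)
fixed = sketch_initialize_hypers(slow_core_period_init_value=10)
```

Here entropy_cost is passed as a scalar and therefore keeps its exact value, while learning_rate and the slow core period are drawn from their ranges.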
ftw.agents.tf.ftw.utils.initialize_internal_rewards(num_events: int = 1, init_value_or_range: Union[float, Tuple[float, float]] = (0.1, 1.0)) → ftw.tf.internal_reward.ftw_internal_reward.InternalRewards

Creates and initializes the internal rewards required by the FTW agent.

Args:
num_events: Number of events returned by the environment.
init_value_or_range: A scalar value of type float for initialization with an exact value, or a tuple (min, max), where min, max are of type float, for initialization by drawing a sample from a log-uniform distribution defined over (min, max).
Returns:
An InternalRewards object.
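
Log-uniform initialization, as described for init_value_or_range, amounts to sampling uniformly in log space and exponentiating. The sketch below illustrates this with the standard library only. The function names are hypothetical, the result is a plain list rather than an InternalRewards object, and drawing one independent value per event is an assumption.

```python
import math
import random
from typing import List, Tuple, Union


def sample_log_uniform(low: float, high: float, rng: random.Random) -> float:
    """Draws a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sketch_internal_rewards(
        num_events: int = 1,
        init_value_or_range: Union[float, Tuple[float, float]] = (0.1, 1.0),
        seed: int = 0) -> List[float]:
    rng = random.Random(seed)
    if isinstance(init_value_or_range, tuple):
        low, high = init_value_or_range
        # Assumption: each event gets its own independent draw.
        return [sample_log_uniform(low, high, rng) for _ in range(num_events)]
    # A scalar initializes every event with the same exact value.
    return [float(init_value_or_range)] * num_events


sampled_rewards = sketch_internal_rewards(num_events=4)
exact_rewards = sketch_internal_rewards(num_events=2, init_value_or_range=0.5)
```

Compared with plain uniform sampling, a log-uniform draw spreads samples evenly across orders of magnitude, so values near the lower bound are as likely as values near the upper bound on a logarithmic scale.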