For The Win (FTW) TensorFlow Agent¶

`FTW Agent`¶

class ftw.agents.tf.ftw.agent.FTW(environment_spec: acme.specs.EnvironmentSpec, sequence_length: int, num_environment_events: int = 1, embed: sonnet.src.base.Module = None, max_queue_size: int = 32, batch_size: int = 16, hidden_size: int = 256, use_pixel_cotrol: bool = True, use_reward_prediction: bool = True, reward_prediction_sequence_length: int = 3, reward_prediction_sequence_period: int = 1, num_dimensions: int = 256, dnc_clip_value=None, use_dnc_linear_projection: bool = True, init_scale: float = 0.1, min_scale: float = 1e-06, tanh_mean: bool = False, fixed_scale: bool = False, use_tfd_independent: bool = False, variational_unit_w_init: Union[sonnet.src.initializers.Initializer, tensorflow.python.keras.initializers.initializers_v2.Initializer, None] = None, variational_unit_b_init: Union[sonnet.src.initializers.Initializer, tensorflow.python.keras.initializers.initializers_v2.Initializer, None] = None, strict_period_order: bool = True, dnc_memory_size: int = 450, dnc_word_size: int = 32, dnc_num_reads: int = 4, core_type: str = 'rpth', slow_core_period_min_max: Tuple[int, int] = (5, 20), slow_core_period_init_value: Optional[int] = None, learning_rate: Union[float, Tuple[float, float]] = (1e-05, 0.005), entropy_cost: Union[float, Tuple[float, float]] = (0.0005, 0.01), reward_prediction_cost: Union[float, Tuple[float, float]] = (0.1, 1.0), pixel_control_cost: Union[float, Tuple[float, float]] = (0.01, 0.1), kld_prior_fixed_cost: Union[float, Tuple[float, float]] = (0.0001, 0.1), kld_prior_posterior_cost: Union[float, Tuple[float, float]] = (0.001, 1.0), scale_grads_fast_to_slow: Union[float, Tuple[float, float]] = (0.1, 1.0), internal_rewards: Union[float, Tuple[float, float]] = (0.1, 1.0), baseline_cost: float = 0.5, discount: float = 0.99, max_abs_reward: float = None, max_gradient_norm: float = None, rms_prop_epsilon: float = 1e-05, learning_rate_decay_steps: int = 0, uint_pixels_to_float: bool = True, agent_id: int = 0, counter: acme.utils.counting.Counter = None, logger: acme.utils.loggers.base.Logger = None)¶

Bases: acme.core.Actor, ftw.agents.tf.ftw.agent.FTWWithoutActor

__init__(environment_spec: acme.specs.EnvironmentSpec, sequence_length: int, num_environment_events: int = 1, embed: sonnet.src.base.Module = None, max_queue_size: int = 32, batch_size: int = 16, hidden_size: int = 256, use_pixel_cotrol: bool = True, use_reward_prediction: bool = True, reward_prediction_sequence_length: int = 3, reward_prediction_sequence_period: int = 1, num_dimensions: int = 256, dnc_clip_value=None, use_dnc_linear_projection: bool = True, init_scale: float = 0.1, min_scale: float = 1e-06, tanh_mean: bool = False, fixed_scale: bool = False, use_tfd_independent: bool = False, variational_unit_w_init: Union[sonnet.src.initializers.Initializer, tensorflow.python.keras.initializers.initializers_v2.Initializer, None] = None, variational_unit_b_init: Union[sonnet.src.initializers.Initializer, tensorflow.python.keras.initializers.initializers_v2.Initializer, None] = None, strict_period_order: bool = True, dnc_memory_size: int = 450, dnc_word_size: int = 32, dnc_num_reads: int = 4, core_type: str = 'rpth', slow_core_period_min_max: Tuple[int, int] = (5, 20), slow_core_period_init_value: Optional[int] = None, learning_rate: Union[float, Tuple[float, float]] = (1e-05, 0.005), entropy_cost: Union[float, Tuple[float, float]] = (0.0005, 0.01), reward_prediction_cost: Union[float, Tuple[float, float]] = (0.1, 1.0), pixel_control_cost: Union[float, Tuple[float, float]] = (0.01, 0.1), kld_prior_fixed_cost: Union[float, Tuple[float, float]] = (0.0001, 0.1), kld_prior_posterior_cost: Union[float, Tuple[float, float]] = (0.001, 1.0), scale_grads_fast_to_slow: Union[float, Tuple[float, float]] = (0.1, 1.0), internal_rewards: Union[float, Tuple[float, float]] = (0.1, 1.0), baseline_cost: float = 0.5, discount: float = 0.99, max_abs_reward: float = None, max_gradient_norm: float = None, rms_prop_epsilon: float = 1e-05, learning_rate_decay_steps: int = 0, uint_pixels_to_float: bool = True, agent_id: int = 0, counter: acme.utils.counting.Counter = None, logger: acme.utils.loggers.base.Logger = None)¶

Initialize self. See help(type(self)) for accurate signature.

observe(action: Any, next_timestep: dm_env._environment.TimeStep)¶

Make an observation of timestep data from the environment.

Args:

action: action taken in the environment.
next_timestep: timestep produced by the environment given the action.

observe_first(timestep: dm_env._environment.TimeStep)¶

Make a first observation from the environment.

Note that this need not be an initial state; it merely begins the recording of a trajectory.

Args:

timestep: first timestep.

run()¶

select_action(observation: numpy.ndarray) → int¶

Samples from the policy and returns an action.

update()¶

Perform an update of the actor parameters from past observations.
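
The methods above are typically called in a fixed order: observe_first at the start of a trajectory, then select_action and observe alternately, with update called to refresh the actor. The sketch below illustrates only that call order. StubEnv and StubActor are hypothetical stand-ins written for this example (they are not part of ftw, acme or dm_env), and timesteps are plain dictionaries rather than dm_env TimeStep objects.

```python
import random


class StubEnv:
    """Hypothetical toy environment: an episode lasts a fixed number of steps."""

    def __init__(self, episode_length=5):
        self.episode_length = episode_length
        self.t = 0

    def reset(self):
        self.t = 0
        return {"observation": self.t, "reward": None, "last": False}

    def step(self, action):
        self.t += 1
        return {
            "observation": self.t,
            "reward": float(action),
            "last": self.t >= self.episode_length,
        }


class StubActor:
    """Hypothetical actor exposing the same four methods as the FTW agent."""

    def __init__(self, num_actions=3, seed=0):
        self.rng = random.Random(seed)
        self.num_actions = num_actions
        self.trajectory = []
        self.num_updates = 0

    def observe_first(self, timestep):
        # Begin recording a new trajectory.
        self.trajectory = [(None, timestep)]

    def select_action(self, observation):
        # A uniform random policy stands in for the recurrent network.
        return self.rng.randrange(self.num_actions)

    def observe(self, action, next_timestep):
        # Record the action together with the timestep it produced.
        self.trajectory.append((action, next_timestep))

    def update(self):
        # A real actor would refresh its parameters here.
        self.num_updates += 1


def run_episode(env, actor):
    """Run one episode and return the number of recorded timesteps."""
    timestep = env.reset()
    actor.observe_first(timestep)
    while not timestep["last"]:
        action = actor.select_action(timestep["observation"])
        timestep = env.step(action)
        actor.observe(action, next_timestep=timestep)
        actor.update()
    return len(actor.trajectory)
```

Calling run_episode(StubEnv(), StubActor()) records the first timestep plus one entry per environment step.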

`FTW Learner`¶

class ftw.agents.tf.ftw.learning.FtwLearner(policy_network: sonnet.src.recurrent.RNNCore, dataset: tensorflow.python.data.ops.dataset_ops.DatasetV2, learning_rate: Union[float, tensorflow.python.ops.variables.Variable], slow_core_period: Union[float, tensorflow.python.ops.variables.Variable], internal_rewards: ftw.tf.internal_reward.ftw_internal_reward.InternalRewards, pixel_control_network: Optional[sonnet.src.recurrent.RNNCore] = None, reward_prediction_network: Optional[sonnet.src.base.Module] = None, pixel_control_dataset: Optional[tensorflow.python.data.ops.dataset_ops.DatasetV2] = None, nonzero_reward_prediction_dataset: Optional[tensorflow.python.data.ops.dataset_ops.DatasetV2] = None, zero_reward_prediction_dataset: Optional[tensorflow.python.data.ops.dataset_ops.DatasetV2] = None, entropy_cost: Union[float, tensorflow.python.ops.variables.Variable] = 0.0, kld_prior_fixed_cost: Union[float, tensorflow.python.ops.variables.Variable] = 0.0001, kld_prior_posterior_cost: Union[float, tensorflow.python.ops.variables.Variable] = 0.001, pixel_control_cost: Union[float, tensorflow.python.ops.variables.Variable] = 0.01, reward_prediction_cost: Union[float, tensorflow.python.ops.variables.Variable] = 0.1, baseline_cost: float = 0.5, discount: Union[float, tensorflow.python.ops.variables.Variable] = 0.99, max_abs_reward: Optional[float] = None, max_gradient_norm: Optional[float] = None, rms_prop_epsilon: float = 0.01, learning_rate_decay_steps: int = 0, can_sample=None, can_sample_auxiliary=None, uint_pixels_to_float: bool = True, learner_id: int = 0, counter: acme.utils.counting.Counter = None, logger: acme.utils.loggers.base.Logger = None)¶

Bases: acme.core.Learner, acme.tf.savers.TFSaveable

Learner for an importance-weighted advantage actor-critic with auxiliary tasks and recurrent processing with temporal hierarchy.

__init__(policy_network: sonnet.src.recurrent.RNNCore, dataset: tensorflow.python.data.ops.dataset_ops.DatasetV2, learning_rate: Union[float, tensorflow.python.ops.variables.Variable], slow_core_period: Union[float, tensorflow.python.ops.variables.Variable], internal_rewards: ftw.tf.internal_reward.ftw_internal_reward.InternalRewards, pixel_control_network: Optional[sonnet.src.recurrent.RNNCore] = None, reward_prediction_network: Optional[sonnet.src.base.Module] = None, pixel_control_dataset: Optional[tensorflow.python.data.ops.dataset_ops.DatasetV2] = None, nonzero_reward_prediction_dataset: Optional[tensorflow.python.data.ops.dataset_ops.DatasetV2] = None, zero_reward_prediction_dataset: Optional[tensorflow.python.data.ops.dataset_ops.DatasetV2] = None, entropy_cost: Union[float, tensorflow.python.ops.variables.Variable] = 0.0, kld_prior_fixed_cost: Union[float, tensorflow.python.ops.variables.Variable] = 0.0001, kld_prior_posterior_cost: Union[float, tensorflow.python.ops.variables.Variable] = 0.001, pixel_control_cost: Union[float, tensorflow.python.ops.variables.Variable] = 0.01, reward_prediction_cost: Union[float, tensorflow.python.ops.variables.Variable] = 0.1, baseline_cost: float = 0.5, discount: Union[float, tensorflow.python.ops.variables.Variable] = 0.99, max_abs_reward: Optional[float] = None, max_gradient_norm: Optional[float] = None, rms_prop_epsilon: float = 0.01, learning_rate_decay_steps: int = 0, can_sample=None, can_sample_auxiliary=None, uint_pixels_to_float: bool = True, learner_id: int = 0, counter: acme.utils.counting.Counter = None, logger: acme.utils.loggers.base.Logger = None)¶

Initialize self. See help(type(self)) for accurate signature.

get_step()¶

get_variables(names: List[str]) → List[List[numpy.ndarray]]¶

Return the named variables as a collection of (nested) numpy arrays.

Args:

names: args where each name is a string identifying a predefined subset of the variables.

Returns:

A list of (nested) numpy arrays, variables, such that variables[i] corresponds to the collection named by names[i].
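
The ordering contract described above (one collection per requested name, in the order requested) can be sketched as follows. This is an illustration only: the registry, the collection names "policy" and "pixel_control", and the use of plain lists in place of numpy arrays are all assumptions made for the example, not the learner's actual internals.

```python
from typing import Dict, List

# Hypothetical registry mapping a collection name to its variables.
# Plain lists of floats stand in for numpy arrays.
_COLLECTIONS: Dict[str, List[List[float]]] = {
    "policy": [[0.1, 0.2], [0.3]],
    "pixel_control": [[1.0]],
}


def get_variables(names: List[str]) -> List[List[List[float]]]:
    """Return one collection per requested name, in the order requested."""
    return [_COLLECTIONS[name] for name in names]


# variables[i] corresponds to the collection named by names[i].
variables = get_variables(["pixel_control", "policy"])
```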

get_weights() → Mapping[str, List[tensorflow.python.ops.variables.Variable]]¶

run()¶

Run the update loop; typically an infinite loop which calls step.

set_step(step)¶

set_weights(weights: Mapping[str, List[tensorflow.python.ops.variables.Variable]])¶

state¶

Returns the stateful objects for checkpointing.

step()¶

Does a step of SGD and logs the results.
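
The relationship between run and step can be sketched with a toy learner. Everything here is an assumption made for illustration: ToyLearner is hypothetical, its step only increments a counter, get_step and set_step are assumed to read and write that counter, and the num_steps argument to run does not appear in the signature documented above; it exists only so the example terminates.

```python
import itertools
from typing import Optional


class ToyLearner:
    """Hypothetical learner whose step() only advances a counter."""

    def __init__(self) -> None:
        self._step = 0

    def get_step(self) -> int:
        return self._step

    def set_step(self, step: int) -> None:
        self._step = step

    def step(self) -> None:
        # A real learner would sample a batch, apply gradients and log here.
        self._step += 1

    def run(self, num_steps: Optional[int] = None) -> None:
        # Loops forever unless the (hypothetical) num_steps bound is given.
        iterator = range(num_steps) if num_steps is not None else itertools.count()
        for _ in iterator:
            self.step()
```

For example, after set_step(10) followed by run(num_steps=3), get_step() returns 13.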

`FTW Actor`¶

class ftw.agents.tf.ftw.acting.FtwActor(network: sonnet.src.recurrent.RNNCore, adder: acme.adders.base.Adder = None, reward_prediction_adder: acme.adders.base.Adder = None, variable_client: acme.tf.variable_utils.VariableClient = None, uint_pixels_to_float: bool = True)¶

Bases: acme.core.Actor

A recurrent actor.

__init__(network: sonnet.src.recurrent.RNNCore, adder: acme.adders.base.Adder = None, reward_prediction_adder: acme.adders.base.Adder = None, variable_client: acme.tf.variable_utils.VariableClient = None, uint_pixels_to_float: bool = True)¶

Initialize self. See help(type(self)) for accurate signature.

observe(action: Any, next_timestep: dm_env._environment.TimeStep)¶

Make an observation of timestep data from the environment.

Args:

action: action taken in the environment.
next_timestep: timestep produced by the environment given the action.

observe_first(timestep: dm_env._environment.TimeStep)¶

Make a first observation from the environment.

Note that this need not be an initial state; it merely begins the recording of a trajectory.

Args:

timestep: first timestep.

select_action(observation: Any) → Any¶

Samples from the policy and returns an action.

update()¶

Perform an update of the actor parameters from past observations.

Utilities for Replay Buffers, Datasets, Hyperparameters & Internal Rewards¶

ftw.agents.tf.ftw.utils.create_adders(server_address: str, sequence_length: int, use_pixel_control: bool = False, use_reward_prediction: bool = False, reward_prediction_sequence_length: int = 3, reward_prediction_sequence_period: int = 1, pad_end_of_episode: bool = False, delta_encoded: bool = True)¶

Creates the reverb adders required by the FTW actor.

Args:

server_address: Address of the reverb server responsible for storing training data.
sequence_length: Length of unroll sequences used in training (for calculation of main losses and Pixel control auxiliary loss).
use_pixel_control: Whether to create an adder for the Pixel control auxiliary task.
use_reward_prediction: Whether to create an adder for the Reward prediction auxiliary task.
reward_prediction_sequence_length: Length of reward prediction sequences. Defaults to 3, as in the FTW and UNREAL agents.
reward_prediction_sequence_period: Period with which to add Reward prediction sequences to the respective replay buffer. Defaults to 1, i.e. at every step, the last reward_prediction_sequence_length steps are added to the replay buffer.
pad_end_of_episode: Whether to pad sequences with zero-like steps at the end of an episode, if necessary. Defaults to False.
delta_encoded: Whether to use compression for the adder. May lower RAM requirements. See documentation of dm-acme’s adders for more details. Defaults to True.

Returns:

A tuple (adder, rp_adder), where adder is the main adder and rp_adder is either None (if use_reward_prediction=False) or the adder required for the Reward prediction auxiliary task.
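
The interplay of reward_prediction_sequence_length and reward_prediction_sequence_period described above can be sketched as a sliding window over the steps of an episode. This is not the reverb adder itself: the function name is hypothetical, integers stand in for environment steps, and counting the period from the first complete window is an assumption.

```python
from typing import List, Sequence


def reward_prediction_windows(
    steps: Sequence[int], sequence_length: int = 3, period: int = 1
) -> List[List[int]]:
    """Collect the last `sequence_length` steps once every `period` steps."""
    windows = []
    for t in range(1, len(steps) + 1):
        # Emit only once enough steps exist, and only on every `period`-th step
        # counted from the first complete window (an assumption of this sketch).
        if t >= sequence_length and (t - sequence_length) % period == 0:
            windows.append(list(steps[t - sequence_length:t]))
    return windows
```

With the default period of 1, every step after the first two yields a window; with a period of 2, every other window is skipped.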

ftw.agents.tf.ftw.utils.create_datasets(learner_client: reverb.TFClient, environment_spec: acme.specs.EnvironmentSpec, batch_size: int, sequence_length: int, extra_spec: Optional[Dict[KT, VT]] = None, use_pixel_control: bool = False, use_reward_prediction: bool = False, reward_prediction_sequence_length: int = 3)¶

Creates the dataset(s) required by the FTW agent.

Args:

learner_client: A reverb.TFClient connected to the reverb server holding the required reverb tables.
environment_spec: An acme.specs.EnvironmentSpec namedtuple containing the specs of the environment.
batch_size: Batch size used in training.
sequence_length: Length of unroll sequences used in training (for calculation of main losses and Pixel control auxiliary loss).
extra_spec: A dictionary containing extra specs required for training, such as logits or core state.
use_pixel_control: Whether to create a dataset for the Pixel control auxiliary task.
use_reward_prediction: Whether to create a dataset for the Reward prediction auxiliary task.
reward_prediction_sequence_length: Length of reward prediction sequences. Defaults to 3, as in the FTW and UNREAL agents.

Returns:

A 4-element tuple of tf.data.Dataset objects, one for each respective task (where the queue dataset is used in the calculation of the main losses):

(queue_dataset, pixel_control_dataset, nonzero_reward_prediction_dataset, zero_reward_prediction_dataset)

ftw.agents.tf.ftw.utils.create_reverb_tables(batch_size: int, max_queue_size: int, use_pixel_control: bool = False, use_reward_prediction: bool = False, max_pixel_control_buffer_size: int = 100, max_reward_pred_buffer_size: int = 800)¶

Creates the reverb table(s) required by the FTW agent.

Args:

batch_size: Batch size used in training.
max_queue_size: Maximum capacity of queue.
use_pixel_control: Whether to create a table for the Pixel control auxiliary task.
use_reward_prediction: Whether to create a table for the Reward prediction auxiliary task.
max_pixel_control_buffer_size: Maximum capacity of Pixel control replay buffer.
max_reward_pred_buffer_size: Maximum capacity of each Reward prediction replay buffer (one buffer for zero rewards, one for non-zero rewards).

Returns:

A triple (tables, can_sample_queue, can_sample_auxiliary), where tables is a list containing all created tables, can_sample_queue is a function that returns a bool indicating whether a batch of training data can be sampled from the queue (used in the calculation of the main losses), and can_sample_auxiliary is a function that returns a bool indicating whether a batch of training data can be sampled from the auxiliary replay buffer(s).
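
The idea behind a can_sample function (a zero-argument callable reporting whether a full batch is available) can be sketched without reverb. ToyQueue and make_can_sample are hypothetical names introduced for this example, and rejecting inserts when the queue is full is a simplification, not a description of how a reverb table behaves.

```python
from typing import Any, Callable, List


class ToyQueue:
    """Hypothetical bounded FIFO standing in for the reverb queue table."""

    def __init__(self, max_size: int) -> None:
        self.max_size = max_size
        self.items: List[Any] = []

    def insert(self, item: Any) -> bool:
        # Reject inserts when full (a simplification made for this sketch).
        if len(self.items) >= self.max_size:
            return False
        self.items.append(item)
        return True

    def sample(self, batch_size: int) -> List[Any]:
        # Remove and return the oldest `batch_size` items.
        batch, self.items = self.items[:batch_size], self.items[batch_size:]
        return batch


def make_can_sample(queue: ToyQueue, batch_size: int) -> Callable[[], bool]:
    """Return a predicate telling whether a full batch is available."""
    return lambda: len(queue.items) >= batch_size
```

A learner could poll such a predicate before each step and skip the update while it returns False.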

ftw.agents.tf.ftw.utils.initialize_hypers(slow_core_period_min_max: Tuple[int, int] = (5, 20), slow_core_period_init_value: Optional[int] = None, learning_rate: Union[float, Tuple[float, float]] = (1e-05, 0.005), entropy_cost: Union[float, Tuple[float, float]] = (0.0005, 0.01), reward_prediction_cost: Union[float, Tuple[float, float]] = (0.1, 1.0), pixel_control_cost: Union[float, Tuple[float, float]] = (0.01, 0.1), kld_prior_fixed_cost: Union[float, Tuple[float, float]] = (0.0001, 0.1), kld_prior_posterior_cost: Union[float, Tuple[float, float]] = (0.001, 1.0), scale_grads_fast_to_slow: Union[float, Tuple[float, float]] = (0.1, 1.0)) → Mapping[str, ftw.tf.hyperparameters.base.Hyperparameter]¶

Create and initialize all hyperparameters required by the FTW agent.

All arguments can either be supplied as a 2-tuple (min, max), indicating a range to be used in the random initialization of the corresponding hyperparameter, or as a scalar value, in which case the corresponding hyperparameter will be initialized with this exact value. Please note, however, that if a scalar value is used to initialize slow_core_period to an exact value, calling perturb() on the resulting hyperparameter will have no effect.

Args:

slow_core_period_min_max: (Inclusive) lower and upper bound for random initialization of the period used for the slow core of the RPTH module. See docstring for RPTH module (in ftw.tf.networks.recurrence) for more details.
slow_core_period_init_value: Optional. If not None, the period used for the slow core of the RPTH module will be initialized with this exact value, instead of being initialized randomly. See docstring for RPTH module (in ftw.tf.networks.recurrence) for more details.
learning_rate: Learning rate used in training.
entropy_cost: Multiplier for the entropy loss.
reward_prediction_cost: Multiplier for the Reward prediction loss.
pixel_control_cost: Multiplier for the Pixel control loss.
kld_prior_fixed_cost: Multiplier for the Kullback-Leibler divergence loss between a fixed Multivariate Normal Diagonal (MVNDiag) distribution and the prior (MVNDiag) distribution as produced by the RPTH module’s slow core.
kld_prior_posterior_cost: Multiplier for the Kullback-Leibler divergence loss between the prior (MVNDiag) distribution as produced by the RPTH module’s slow core and the posterior (MVNDiag) distribution as produced by the RPTH module’s fast core.
scale_grads_fast_to_slow: Scaling factor for the gradients flowing from fast to slow core of the RPTH module.

Returns:

A dictionary containing all created hyperparameters. Keys of this dictionary correspond to the argument names of this function, except for the key ‘slow_core_period’, which results from the argument slow_core_period_min_max (and possibly slow_core_period_init_value).
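
The scalar-versus-range convention and the special handling of slow_core_period can be sketched as below. This is a simplified illustration under stated assumptions: init_value and init_hypers are hypothetical names, plain numbers stand in for Hyperparameter objects, and uniform sampling is assumed for float ranges because the sampling distribution is not specified here.

```python
import random
from typing import Dict, Optional, Tuple, Union

Range = Union[float, Tuple[float, float]]


def init_value(value_or_range: Range, rng: random.Random) -> float:
    """Scalar: use as-is. 2-tuple (min, max): draw a random value in the range."""
    if isinstance(value_or_range, tuple):
        low, high = value_or_range
        # Assumption: uniform sampling; the real distribution may differ.
        return rng.uniform(low, high)
    return float(value_or_range)


def init_hypers(
    slow_core_period_min_max: Tuple[int, int] = (5, 20),
    slow_core_period_init_value: Optional[int] = None,
    rng: Optional[random.Random] = None,
    **ranges: Range,
) -> Dict[str, float]:
    """Build a dict keyed by argument name, plus the key 'slow_core_period'."""
    rng = rng or random.Random(0)
    hypers = {name: init_value(spec, rng) for name, spec in ranges.items()}
    if slow_core_period_init_value is not None:
        # An exact initial value overrides random initialization.
        hypers["slow_core_period"] = slow_core_period_init_value
    else:
        # randint includes both endpoints, matching the inclusive bounds.
        hypers["slow_core_period"] = rng.randint(*slow_core_period_min_max)
    return hypers
```

For instance, init_hypers(learning_rate=(1e-5, 5e-3), entropy_cost=0.001) draws the learning rate from the range but keeps the entropy cost at exactly 0.001.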

ftw.agents.tf.ftw.utils.initialize_internal_rewards(num_events: int = 1, init_value_or_range: Union[float, Tuple[float, float]] = (0.1, 1.0)) → ftw.tf.internal_reward.ftw_internal_reward.InternalRewards¶

Creates and initializes the internal rewards required by the FTW agent.

Args:

num_events: Number of events returned by the environment.
init_value_or_range: A scalar value of type float for initialization with an exact value, or a tuple (min, max), where min, max are of type float, for initialization by drawing a sample from a log-uniform distribution defined over (min, max).

Returns:

An InternalRewards object.
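
Log-uniform initialization, as mentioned for init_value_or_range, means that the logarithm of the sampled value is uniformly distributed between log(min) and log(max). A minimal sketch follows; the function names are hypothetical, a plain list of floats stands in for the InternalRewards object, and drawing one independent sample per event is an assumption.

```python
import math
import random
from typing import List, Tuple, Union


def sample_log_uniform(low: float, high: float, rng: random.Random) -> float:
    """Draw x such that log(x) is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def init_internal_rewards(
    num_events: int = 1,
    init_value_or_range: Union[float, Tuple[float, float]] = (0.1, 1.0),
    seed: int = 0,
) -> List[float]:
    """Return one initial reward value per environment event."""
    rng = random.Random(seed)
    if isinstance(init_value_or_range, tuple):
        low, high = init_value_or_range
        # Assumption: each event gets its own independent sample.
        return [sample_log_uniform(low, high, rng) for _ in range(num_events)]
    # A scalar initializes every event with the same exact value.
    return [float(init_value_or_range)] * num_events
```

Compared with plain uniform sampling, this spreads samples evenly across orders of magnitude, so values near 0.1 are as likely as values near 1.0 on a log scale.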
