For The Win (FTW) TensorFlow Agent
FTW Agent
-
class
ftw.agents.tf.ftw.agent.FTW(environment_spec: acme.specs.EnvironmentSpec, sequence_length: int, num_environment_events: int = 1, embed: sonnet.src.base.Module = None, max_queue_size: int = 32, batch_size: int = 16, hidden_size: int = 256, use_pixel_cotrol: bool = True, use_reward_prediction: bool = True, reward_prediction_sequence_length: int = 3, reward_prediction_sequence_period: int = 1, num_dimensions: int = 256, dnc_clip_value=None, use_dnc_linear_projection: bool = True, init_scale: float = 0.1, min_scale: float = 1e-06, tanh_mean: bool = False, fixed_scale: bool = False, use_tfd_independent: bool = False, variational_unit_w_init: Union[sonnet.src.initializers.Initializer, tensorflow.python.keras.initializers.initializers_v2.Initializer, None] = None, variational_unit_b_init: Union[sonnet.src.initializers.Initializer, tensorflow.python.keras.initializers.initializers_v2.Initializer, None] = None, strict_period_order: bool = True, dnc_memory_size: int = 450, dnc_word_size: int = 32, dnc_num_reads: int = 4, core_type: str = 'rpth', slow_core_period_min_max: Tuple[int, int] = (5, 20), slow_core_period_init_value: Optional[int] = None, learning_rate: Union[float, Tuple[float, float]] = (1e-05, 0.005), entropy_cost: Union[float, Tuple[float, float]] = (0.0005, 0.01), reward_prediction_cost: Union[float, Tuple[float, float]] = (0.1, 1.0), pixel_control_cost: Union[float, Tuple[float, float]] = (0.01, 0.1), kld_prior_fixed_cost: Union[float, Tuple[float, float]] = (0.0001, 0.1), kld_prior_posterior_cost: Union[float, Tuple[float, float]] = (0.001, 1.0), scale_grads_fast_to_slow: Union[float, Tuple[float, float]] = (0.1, 1.0), internal_rewards: Union[float, Tuple[float, float]] = (0.1, 1.0), baseline_cost: float = 0.5, discount: float = 0.99, max_abs_reward: float = None, max_gradient_norm: float = None, rms_prop_epsilon: float = 1e-05, learning_rate_decay_steps: int = 0, uint_pixels_to_float: bool = True, agent_id: int = 0, counter: acme.utils.counting.Counter = 
None, logger: acme.utils.loggers.base.Logger = None)¶ Bases:
acme.core.Actor, ftw.agents.tf.ftw.agent.FTWWithoutActor
__init__(environment_spec: acme.specs.EnvironmentSpec, sequence_length: int, num_environment_events: int = 1, embed: sonnet.src.base.Module = None, max_queue_size: int = 32, batch_size: int = 16, hidden_size: int = 256, use_pixel_cotrol: bool = True, use_reward_prediction: bool = True, reward_prediction_sequence_length: int = 3, reward_prediction_sequence_period: int = 1, num_dimensions: int = 256, dnc_clip_value=None, use_dnc_linear_projection: bool = True, init_scale: float = 0.1, min_scale: float = 1e-06, tanh_mean: bool = False, fixed_scale: bool = False, use_tfd_independent: bool = False, variational_unit_w_init: Union[sonnet.src.initializers.Initializer, tensorflow.python.keras.initializers.initializers_v2.Initializer, None] = None, variational_unit_b_init: Union[sonnet.src.initializers.Initializer, tensorflow.python.keras.initializers.initializers_v2.Initializer, None] = None, strict_period_order: bool = True, dnc_memory_size: int = 450, dnc_word_size: int = 32, dnc_num_reads: int = 4, core_type: str = 'rpth', slow_core_period_min_max: Tuple[int, int] = (5, 20), slow_core_period_init_value: Optional[int] = None, learning_rate: Union[float, Tuple[float, float]] = (1e-05, 0.005), entropy_cost: Union[float, Tuple[float, float]] = (0.0005, 0.01), reward_prediction_cost: Union[float, Tuple[float, float]] = (0.1, 1.0), pixel_control_cost: Union[float, Tuple[float, float]] = (0.01, 0.1), kld_prior_fixed_cost: Union[float, Tuple[float, float]] = (0.0001, 0.1), kld_prior_posterior_cost: Union[float, Tuple[float, float]] = (0.001, 1.0), scale_grads_fast_to_slow: Union[float, Tuple[float, float]] = (0.1, 1.0), internal_rewards: Union[float, Tuple[float, float]] = (0.1, 1.0), baseline_cost: float = 0.5, discount: float = 0.99, max_abs_reward: float = None, max_gradient_norm: float = None, rms_prop_epsilon: float = 1e-05, learning_rate_decay_steps: int = 0, uint_pixels_to_float: bool = True, agent_id: int = 0, counter: acme.utils.counting.Counter = None, logger: 
acme.utils.loggers.base.Logger = None)¶ Initialize self. See help(type(self)) for accurate signature.
-
observe(action: Any, next_timestep: dm_env._environment.TimeStep)¶ Make an observation of timestep data from the environment.
- Args:
- action: action taken in the environment.
- next_timestep: timestep produced by the environment given the action.
-
observe_first(timestep: dm_env._environment.TimeStep)¶ Make a first observation from the environment.
Note that this need not be an initial state; it merely begins the recording of a trajectory.
- Args:
- timestep: first timestep.
-
run()¶
-
select_action(observation: numpy.ndarray) → int¶ Samples from the policy and returns an action.
-
update()¶ Perform an update of the actor parameters from past observations.
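The four methods above are typically called in a fixed order by an environment loop. The following is a minimal sketch of that order, not the library's own loop: `TimeStep`, `ToyEnvironment`, `RecordingActor` and `run_episode` are hypothetical stand-ins written only for this illustration, and the real `dm_env.TimeStep` exposes `last()` as a method rather than a field.

```python
from collections import namedtuple

# Hypothetical stand-in for dm_env.TimeStep; only the fields used here.
TimeStep = namedtuple("TimeStep", ["observation", "reward", "last"])


class ToyEnvironment:
    """Hypothetical fixed-length environment, used only to drive the loop."""

    def __init__(self, episode_length=3):
        self._episode_length = episode_length
        self._t = 0

    def reset(self):
        self._t = 0
        return TimeStep(observation=0, reward=None, last=False)

    def step(self, action):
        self._t += 1
        return TimeStep(observation=self._t, reward=float(action),
                        last=self._t >= self._episode_length)


class RecordingActor:
    """Hypothetical actor exposing the four methods documented above."""

    def __init__(self):
        self.calls = []

    def observe_first(self, timestep):
        self.calls.append("observe_first")

    def select_action(self, observation):
        self.calls.append("select_action")
        return 1  # A real actor would sample from its policy here.

    def observe(self, action, next_timestep):
        self.calls.append("observe")

    def update(self):
        self.calls.append("update")


def run_episode(environment, actor):
    """Runs one episode and returns the undiscounted episode return."""
    timestep = environment.reset()
    actor.observe_first(timestep)  # Starts recording a trajectory.
    episode_return = 0.0
    while not timestep.last:
        action = actor.select_action(timestep.observation)
        timestep = environment.step(action)
        actor.observe(action, next_timestep=timestep)
        actor.update()
        episode_return += timestep.reward
    return episode_return


actor = RecordingActor()
episode_return = run_episode(ToyEnvironment(), actor)
```

Each step of the episode produces one `select_action`, one `observe` and one `update` call, preceded by a single `observe_first` at the start of the trajectory.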
-
FTW Learner
-
class
ftw.agents.tf.ftw.learning.FtwLearner(policy_network: sonnet.src.recurrent.RNNCore, dataset: tensorflow.python.data.ops.dataset_ops.DatasetV2, learning_rate: Union[float, tensorflow.python.ops.variables.Variable], slow_core_period: Union[float, tensorflow.python.ops.variables.Variable], internal_rewards: ftw.tf.internal_reward.ftw_internal_reward.InternalRewards, pixel_control_network: Optional[sonnet.src.recurrent.RNNCore] = None, reward_prediction_network: Optional[sonnet.src.base.Module] = None, pixel_control_dataset: Optional[tensorflow.python.data.ops.dataset_ops.DatasetV2] = None, nonzero_reward_prediction_dataset: Optional[tensorflow.python.data.ops.dataset_ops.DatasetV2] = None, zero_reward_prediction_dataset: Optional[tensorflow.python.data.ops.dataset_ops.DatasetV2] = None, entropy_cost: Union[float, tensorflow.python.ops.variables.Variable] = 0.0, kld_prior_fixed_cost: Union[float, tensorflow.python.ops.variables.Variable] = 0.0001, kld_prior_posterior_cost: Union[float, tensorflow.python.ops.variables.Variable] = 0.001, pixel_control_cost: Union[float, tensorflow.python.ops.variables.Variable] = 0.01, reward_prediction_cost: Union[float, tensorflow.python.ops.variables.Variable] = 0.1, baseline_cost: float = 0.5, discount: Union[float, tensorflow.python.ops.variables.Variable] = 0.99, max_abs_reward: Optional[float] = None, max_gradient_norm: Optional[float] = None, rms_prop_epsilon: float = 0.01, learning_rate_decay_steps: int = 0, can_sample=None, can_sample_auxiliary=None, uint_pixels_to_float: bool = True, learner_id: int = 0, counter: acme.utils.counting.Counter = None, logger: acme.utils.loggers.base.Logger = None)¶ Bases:
acme.core.Learner, acme.tf.savers.TFSaveable
Learner for an importance-weighted advantage actor-critic with auxiliary tasks and recurrent processing with temporal hierarchy.
-
__init__(policy_network: sonnet.src.recurrent.RNNCore, dataset: tensorflow.python.data.ops.dataset_ops.DatasetV2, learning_rate: Union[float, tensorflow.python.ops.variables.Variable], slow_core_period: Union[float, tensorflow.python.ops.variables.Variable], internal_rewards: ftw.tf.internal_reward.ftw_internal_reward.InternalRewards, pixel_control_network: Optional[sonnet.src.recurrent.RNNCore] = None, reward_prediction_network: Optional[sonnet.src.base.Module] = None, pixel_control_dataset: Optional[tensorflow.python.data.ops.dataset_ops.DatasetV2] = None, nonzero_reward_prediction_dataset: Optional[tensorflow.python.data.ops.dataset_ops.DatasetV2] = None, zero_reward_prediction_dataset: Optional[tensorflow.python.data.ops.dataset_ops.DatasetV2] = None, entropy_cost: Union[float, tensorflow.python.ops.variables.Variable] = 0.0, kld_prior_fixed_cost: Union[float, tensorflow.python.ops.variables.Variable] = 0.0001, kld_prior_posterior_cost: Union[float, tensorflow.python.ops.variables.Variable] = 0.001, pixel_control_cost: Union[float, tensorflow.python.ops.variables.Variable] = 0.01, reward_prediction_cost: Union[float, tensorflow.python.ops.variables.Variable] = 0.1, baseline_cost: float = 0.5, discount: Union[float, tensorflow.python.ops.variables.Variable] = 0.99, max_abs_reward: Optional[float] = None, max_gradient_norm: Optional[float] = None, rms_prop_epsilon: float = 0.01, learning_rate_decay_steps: int = 0, can_sample=None, can_sample_auxiliary=None, uint_pixels_to_float: bool = True, learner_id: int = 0, counter: acme.utils.counting.Counter = None, logger: acme.utils.loggers.base.Logger = None)¶ Initialize self. See help(type(self)) for accurate signature.
-
get_step()¶
-
get_variables(names: List[str]) → List[List[numpy.ndarray]]¶ Return the named variables as a collection of (nested) numpy arrays.
- Args:
- names: args where each name is a string identifying a predefined subset of the variables.
- Returns:
- A list of (nested) numpy arrays variables such that variables[i] corresponds to the collection named by names[i].
-
get_weights() → Mapping[str, List[tensorflow.python.ops.variables.Variable]]¶
-
run()¶ Run the update loop; typically an infinite loop which calls step.
-
set_step(step)¶
-
set_weights(weights: Mapping[str, List[tensorflow.python.ops.variables.Variable]])¶
-
state¶ Returns the stateful objects for checkpointing.
-
step()¶ Does a step of SGD and logs the results.
-
FTW Actor
-
class
ftw.agents.tf.ftw.acting.FtwActor(network: sonnet.src.recurrent.RNNCore, adder: acme.adders.base.Adder = None, reward_prediction_adder: acme.adders.base.Adder = None, variable_client: acme.tf.variable_utils.VariableClient = None, uint_pixels_to_float: bool = True)¶ Bases:
acme.core.Actor
A recurrent actor.
-
__init__(network: sonnet.src.recurrent.RNNCore, adder: acme.adders.base.Adder = None, reward_prediction_adder: acme.adders.base.Adder = None, variable_client: acme.tf.variable_utils.VariableClient = None, uint_pixels_to_float: bool = True)¶ Initialize self. See help(type(self)) for accurate signature.
-
observe(action: Any, next_timestep: dm_env._environment.TimeStep)¶ Make an observation of timestep data from the environment.
- Args:
- action: action taken in the environment.
- next_timestep: timestep produced by the environment given the action.
-
observe_first(timestep: dm_env._environment.TimeStep)¶ Make a first observation from the environment.
Note that this need not be an initial state; it merely begins the recording of a trajectory.
- Args:
- timestep: first timestep.
-
select_action(observation: Any) → Any¶ Samples from the policy and returns an action.
-
update()¶ Perform an update of the actor parameters from past observations.
-
Utilities for Replay Buffers, Datasets, Hyperparameters & Internal Rewards
-
ftw.agents.tf.ftw.utils.create_adders(server_address: str, sequence_length: int, use_pixel_control: bool = False, use_reward_prediction: bool = False, reward_prediction_sequence_length: int = 3, reward_prediction_sequence_period: int = 1, pad_end_of_episode: bool = False, delta_encoded: bool = True)¶ Creates the reverb adders required by the FTW actor.
- Args:
- server_address: Address of the reverb server responsible for storing training data.
- sequence_length: Length of unroll sequences used in training (for calculation of main losses and Pixel control auxiliary loss).
- use_pixel_control: Whether to create an adder for the Pixel control auxiliary task.
- use_reward_prediction: Whether to create an adder for the Reward prediction auxiliary task.
- reward_prediction_sequence_length: Length of reward prediction sequences. Defaults to 3, as in the FTW and UNREAL agents.
- reward_prediction_sequence_period: Period with which to add Reward prediction sequences to the respective replay buffer. Defaults to 1, i.e. at every step, the last reward_prediction_sequence_length steps are added to the replay buffer.
- pad_end_of_episode: Whether to pad sequences with zero-like steps at the end of an episode, if necessary. Defaults to False.
- delta_encoded: Whether to use compression for the adder. May lower RAM requirements. See documentation of dm-acme’s adders for more details. Defaults to True.
- Returns:
- A tuple (adder, rp_adder), where adder is the main adder and rp_adder is either None (if use_reward_prediction=False) or the adder required for the Reward prediction auxiliary task.
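The interaction between reward_prediction_sequence_length and reward_prediction_sequence_period can be illustrated with a sliding window. This is a minimal sketch in plain Python, not the adder's implementation: the function name `reward_prediction_sequences` is hypothetical, and counting the period from the first step of the episode is an assumption.

```python
from collections import deque


def reward_prediction_sequences(steps, sequence_length=3, period=1):
    """Yields the last `sequence_length` steps once every `period` steps."""
    window = deque(maxlen=sequence_length)  # Keeps only the most recent steps.
    for index, step in enumerate(steps, start=1):
        window.append(step)
        # Emit only full windows, and only on steps that match the period.
        if len(window) == sequence_length and index % period == 0:
            yield tuple(window)


steps = list(range(6))  # Integers stand in for environment steps.
every_step = list(reward_prediction_sequences(steps, 3, 1))
every_other = list(reward_prediction_sequences(steps, 3, 2))
```

With a period of 1, consecutive sequences overlap in all but one step; a larger period reduces that overlap and the number of items written to the replay buffer.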
-
ftw.agents.tf.ftw.utils.create_datasets(learner_client: reverb.TFClient, environment_spec: acme.specs.EnvironmentSpec, batch_size: int, sequence_length: int, extra_spec: Optional[Dict[KT, VT]] = None, use_pixel_control: bool = False, use_reward_prediction: bool = False, reward_prediction_sequence_length: int = 3)¶ Creates the dataset(s) required by the FTW agent.
- Args:
- learner_client: A reverb.TFClient connected to the reverb server holding the required reverb tables.
- environment_spec: An acme.specs.EnvironmentSpec namedtuple containing the specs of the environment.
- batch_size: Batch size used in training.
- sequence_length: Length of unroll sequences used in training (for calculation of main losses and Pixel control auxiliary loss).
- extra_spec: A dictionary containing extra specs required for training, such as logits or core state.
- use_pixel_control: Whether to create a dataset for the Pixel control auxiliary task.
- use_reward_prediction: Whether to create a dataset for the Reward prediction auxiliary task.
- reward_prediction_sequence_length: Length of reward prediction sequences. Defaults to 3, as in the FTW and UNREAL agents.
- Returns:
- A 4-element tuple of tf.data.Dataset objects for each respective task (where queue is used in the calculation of the main losses): (queue_dataset, pixel_control_dataset, nonzero_reward_prediction_dataset, zero_reward_prediction_dataset)
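The separate zero and non-zero reward prediction datasets suggest that sequences are sorted by reward and then sampled from both groups. The sketch below shows one way this could work in plain Python; it is not the library's code. The helper names are hypothetical, and two points are assumptions: that the reward of the last step decides the group, and that each batch is drawn half from each group.

```python
import random


def split_by_reward(sequences):
    """Splits sequences of (observation, reward) steps by their last reward."""
    nonzero, zero = [], []
    for sequence in sequences:
        last_reward = sequence[-1][1]
        (nonzero if last_reward != 0 else zero).append(sequence)
    return nonzero, zero


def sample_balanced(nonzero, zero, batch_size, rng):
    """Draws half of the batch from each group, with replacement."""
    half = batch_size // 2
    return ([rng.choice(nonzero) for _ in range(half)]
            + [rng.choice(zero) for _ in range(batch_size - half)])


sequences = [
    [("o0", 0.0), ("o1", 0.0), ("o2", 1.0)],
    [("o1", 0.0), ("o2", 1.0), ("o3", 0.0)],
    [("o2", 1.0), ("o3", 0.0), ("o4", 0.0)],
    [("o3", 0.0), ("o4", 0.0), ("o5", -1.0)],
]
nonzero, zero = split_by_reward(sequences)
batch = sample_balanced(nonzero, zero, batch_size=4, rng=random.Random(0))
```

Sampling from both groups keeps rare rewarding sequences represented in a batch even when most steps carry zero reward.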
-
ftw.agents.tf.ftw.utils.create_reverb_tables(batch_size: int, max_queue_size: int, use_pixel_control: bool = False, use_reward_prediction: bool = False, max_pixel_control_buffer_size: int = 100, max_reward_pred_buffer_size: int = 800)¶ Creates the reverb table(s) required by the FTW agent.
- Args:
- batch_size: Batch size used in training.
- max_queue_size: Maximum capacity of queue.
- use_pixel_control: Whether to create a table for the Pixel control auxiliary task.
- use_reward_prediction: Whether to create a table for the Reward prediction auxiliary task.
- max_pixel_control_buffer_size: Maximum capacity of Pixel control replay buffer.
- max_reward_pred_buffer_size: Maximum capacity of each Reward prediction replay buffer (one buffer for zero rewards, one for non-zero rewards).
- Returns:
- A triple (tables, can_sample_queue, can_sample_auxiliary), where tables is a list containing all created tables, can_sample_queue is a function that returns a bool indicating whether a batch of training data can be sampled from the queue (used in the calculation of the main losses), and can_sample_auxiliary is a function that returns a bool indicating whether a batch of training data can be sampled from the auxiliary replay buffer(s).
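The two returned predicates can be pictured as closures over the underlying buffers. The following sketch uses plain deques instead of reverb tables, so it only illustrates the idea: `make_can_sample` is a hypothetical helper, and requiring every auxiliary buffer to hold a full batch is an assumption.

```python
from collections import deque


def make_can_sample(buffers, batch_size):
    """Returns a predicate that is True once every buffer holds a batch."""
    def can_sample():
        return all(len(buffer) >= batch_size for buffer in buffers)
    return can_sample


batch_size = 2
queue = deque(maxlen=4)            # Stand-in for the main queue table.
pixel_control = deque(maxlen=100)  # Stand-in for an auxiliary replay buffer.

can_sample_queue = make_can_sample([queue], batch_size)
can_sample_auxiliary = make_can_sample([pixel_control], batch_size)

before = (can_sample_queue(), can_sample_auxiliary())
queue.extend(["seq_a", "seq_b"])
pixel_control.append("seq_a")
after = (can_sample_queue(), can_sample_auxiliary())
```

A caller could poll such predicates before requesting a batch, so that it does not block on a table that has not yet received enough data.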
-
ftw.agents.tf.ftw.utils.initialize_hypers(slow_core_period_min_max: Tuple[int, int] = (5, 20), slow_core_period_init_value: Optional[int] = None, learning_rate: Union[float, Tuple[float, float]] = (1e-05, 0.005), entropy_cost: Union[float, Tuple[float, float]] = (0.0005, 0.01), reward_prediction_cost: Union[float, Tuple[float, float]] = (0.1, 1.0), pixel_control_cost: Union[float, Tuple[float, float]] = (0.01, 0.1), kld_prior_fixed_cost: Union[float, Tuple[float, float]] = (0.0001, 0.1), kld_prior_posterior_cost: Union[float, Tuple[float, float]] = (0.001, 1.0), scale_grads_fast_to_slow: Union[float, Tuple[float, float]] = (0.1, 1.0)) → Mapping[str, ftw.tf.hyperparameters.base.Hyperparameter]¶ Create and initialize all hyperparameters required by the FTW agent.
All arguments can either be supplied as a 2-tuple (min, max), indicating a range to be used in the random initialization of the corresponding hyperparameter, or as a scalar value, in which case the corresponding hyperparameter will be initialized with this exact value. Please note, however, that if a scalar value is used to initialize slow_core_period to an exact value, calling perturb() on the resulting hyperparameter will have no effect.
- Args:
- slow_core_period_min_max: (Inclusive) lower and upper bound for random initialization of the period used for the slow core of the RPTH module. See docstring for RPTH module (in ftw.tf.networks.recurrence) for more details.
- slow_core_period_init_value: Optional. If not None, the period used for the slow core of the RPTH module will be initialized with this exact value, instead of being initialized randomly. See docstring for RPTH module (in ftw.tf.networks.recurrence) for more details.
- learning_rate: Learning rate used in training.
- entropy_cost: Multiplier for the entropy loss.
- reward_prediction_cost: Multiplier for the Reward prediction loss.
- pixel_control_cost: Multiplier for the Pixel control loss.
- kld_prior_fixed_cost: Multiplier for the Kullback-Leibler divergence loss between a fixed Multivariate Normal Diagonal (MVNDiag) distribution and the prior (MVNDiag) distribution as produced by the RPTH module’s slow core.
- kld_prior_posterior_cost: Multiplier for the Kullback-Leibler divergence loss between the prior (MVNDiag) distribution as produced by the RPTH module’s slow core and the posterior (MVNDiag) distribution as produced by the RPTH module’s fast core.
- scale_grads_fast_to_slow: Scaling factor for the gradients flowing from fast to slow core of the RPTH module.
- Returns:
- A dictionary containing all created hyperparameters. Keys of this dictionary correspond to the argument names of this function, except for the key ‘slow_core_period’, which results from the argument slow_core_period_min_max (and possibly slow_core_period_init_value).
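The scalar-versus-range convention and the resulting dictionary keys can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: `init_scalar` and `init_hypers_sketch` are hypothetical names, the values are plain numbers rather than Hyperparameter objects, and the uniform draw is only a placeholder because the text above does not name the sampling distribution.

```python
import random


def init_scalar(value_or_range, rng):
    """Returns a scalar unchanged, or draws a value from a (min, max) range."""
    if isinstance(value_or_range, tuple):
        low, high = value_or_range
        return rng.uniform(low, high)  # Placeholder distribution (assumption).
    return value_or_range


def init_hypers_sketch(slow_core_period_min_max=(5, 20),
                       slow_core_period_init_value=None,
                       rng=None,
                       **value_or_range_by_name):
    """Builds a dict keyed by argument name, plus 'slow_core_period'."""
    rng = rng or random.Random()
    hypers = {name: init_scalar(value, rng)
              for name, value in value_or_range_by_name.items()}
    if slow_core_period_init_value is not None:
        # An exact initial value overrides the random draw.
        hypers["slow_core_period"] = slow_core_period_init_value
    else:
        low, high = slow_core_period_min_max
        hypers["slow_core_period"] = rng.randint(low, high)  # Inclusive bounds.
    return hypers


hypers = init_hypers_sketch(rng=random.Random(0),
                            learning_rate=(1e-5, 5e-3),
                            entropy_cost=0.001)
fixed = init_hypers_sketch(slow_core_period_init_value=10,
                           rng=random.Random(0))
```

Here `learning_rate` is drawn from its range, `entropy_cost` keeps its exact value, and the two slow-core arguments collapse into the single key `slow_core_period`.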
-
ftw.agents.tf.ftw.utils.initialize_internal_rewards(num_events: int = 1, init_value_or_range: Union[float, Tuple[float, float]] = (0.1, 1.0)) → ftw.tf.internal_reward.ftw_internal_reward.InternalRewards¶ Creates and initializes the internal rewards required by the FTW agent.
- Args:
- num_events: Number of events returned by the environment.
- init_value_or_range: A scalar value of type float for initialization with an exact value, or a tuple (min, max), where min and max are of type float, for initialization by drawing a sample from a log-uniform distribution defined over (min, max).
- Returns:
- An InternalRewards object.
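Drawing from a log-uniform distribution over (min, max) amounts to sampling uniformly in log space and exponentiating. The sketch below illustrates this with the standard library only; `log_uniform` and `init_internal_rewards_sketch` are hypothetical names, the result is a plain list rather than an InternalRewards object, and one independent draw per event is an assumption.

```python
import math
import random


def log_uniform(low, high, rng):
    """Samples uniformly in log space, then maps back with exp."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def init_internal_rewards_sketch(num_events=1,
                                 init_value_or_range=(0.1, 1.0),
                                 rng=None):
    """Returns one initial internal reward per environment event."""
    rng = rng or random.Random()
    if isinstance(init_value_or_range, tuple):
        low, high = init_value_or_range
        return [log_uniform(low, high, rng) for _ in range(num_events)]
    # A scalar initializes every event with the same exact value.
    return [float(init_value_or_range)] * num_events


sampled = init_internal_rewards_sketch(num_events=3, rng=random.Random(0))
exact = init_internal_rewards_sketch(num_events=2, init_value_or_range=0.5)
```

Compared with a uniform draw, this spreads the initial values evenly across orders of magnitude, so values near 0.1 are as likely as values near 1.0 on a logarithmic scale.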