# opennmt.layers.transformer module

Define layers related to Google’s Transformer model.

opennmt.layers.transformer.tile_sequence_length(sequence_length, num_heads)[source]

Tiles lengths num_heads times.

Parameters:
- sequence_length – The sequence length.
- num_heads – The number of heads.

Returns: A tf.Tensor where each length is replicated num_heads times.
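A minimal usage sketch (TensorFlow 1.x assumed; the lengths and head count are illustrative):

```python
import tensorflow as tf
from opennmt.layers import transformer

# A batch of 2 sequences of lengths 3 and 5.
sequence_length = tf.constant([3, 5])

# Replicate each length once per head so the lengths line up with a
# head-major batch dimension.
tiled = transformer.tile_sequence_length(sequence_length, num_heads=4)
# Expected: [3, 3, 3, 3, 5, 5, 5, 5]
```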
opennmt.layers.transformer.build_sequence_mask(sequence_length, num_heads=None, maximum_length=None, dtype=tf.float32)[source]

Builds the dot product mask for padded positions.

Parameters:
- sequence_length – The sequence length.
- num_heads – The number of heads.
- maximum_length – Optional size of the returned time dimension. Otherwise it is the maximum of sequence_length.
- dtype – The type of the mask tensor.

Returns: A broadcastable tf.Tensor of type dtype and shape [batch_size, 1, 1, max_length].
opennmt.layers.transformer.build_future_mask(sequence_length, num_heads=None, maximum_length=None, dtype=tf.float32)[source]

Builds the dot product mask for future positions.

Parameters:
- sequence_length – The sequence length.
- num_heads – The number of heads.
- maximum_length – Optional size of the returned time dimension. Otherwise it is the maximum of sequence_length.
- dtype – The type of the mask tensor.

Returns: A broadcastable tf.Tensor of type dtype and shape [batch_size, 1, max_length, max_length].
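For illustration, a sketch contrasting the two masks (imports as in the sketch above; the binary ones/zeros convention follows tf.sequence_mask and is an assumption here):

```python
lengths = tf.constant([3, 5])

# Padding mask: ones at valid positions, zeros at padding.
# Shape [batch_size, 1, 1, max_length], broadcastable over heads
# and query steps.
sequence_mask = transformer.build_sequence_mask(
    lengths, num_heads=4, maximum_length=5)

# Future mask for decoder self-attention: step t can only attend to
# positions <= t. Shape [batch_size, 1, max_length, max_length].
future_mask = transformer.build_future_mask(
    lengths, num_heads=4, maximum_length=5)
```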
opennmt.layers.transformer.cumulative_average_mask(sequence_length, maximum_length=None, dtype=tf.float32)[source]

Builds the mask to compute the cumulative average as described in https://arxiv.org/abs/1805.00631.

Parameters:
- sequence_length – The sequence length.
- maximum_length – Optional size of the returned time dimension. Otherwise it is the maximum of sequence_length.
- dtype – The type of the mask tensor.

Returns: A tf.Tensor of type dtype and shape [batch_size, max_length, max_length].
opennmt.layers.transformer.cumulative_average(inputs, mask_or_step, cache=None)[source]

Computes the cumulative average as described in https://arxiv.org/abs/1805.00631.

Parameters:
- inputs – The sequence to average. A tensor of shape $$[B, T, D]$$.
- mask_or_step – If cache is set, this is assumed to be the current step of the dynamic decoding. Otherwise, it is the mask matrix used to compute the cumulative average.
- cache – A dictionary containing the cumulative average of the previous step.

Returns: The cumulative average, a tensor of the same shape and type as inputs.
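A sketch of the training-time call, where the mask comes from cumulative_average_mask above (step-wise decoding with a cache is omitted since the cache layout is not documented here; shapes are illustrative):

```python
inputs = tf.random.normal([2, 5, 8])  # [B, T, D]
mask = transformer.cumulative_average_mask(
    tf.constant([3, 5]), maximum_length=5)
averaged = transformer.cumulative_average(inputs, mask)
# averaged[:, t] is the running mean of inputs[:, :t + 1].
```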
opennmt.layers.transformer.fused_projection(inputs, num_units, num_outputs=1)[source]

Projects the same input into multiple output spaces.

Parameters:
- inputs – The inputs to project.
- num_units – The number of output units of each space.
- num_outputs – The number of output spaces.

Returns: num_outputs tf.Tensors of depth num_units.
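A sketch of the typical use: computing the query, key, and value projections with a single matrix multiplication (shapes illustrative):

```python
x = tf.random.normal([2, 5, 64])  # [B, T, D]
# One fused projection instead of three separate dense layers.
queries, keys, values = transformer.fused_projection(x, 64, num_outputs=3)
# Each output has shape [2, 5, 64].
```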
opennmt.layers.transformer.split_heads(inputs, num_heads)[source]

Splits a tensor in depth.

Parameters:
- inputs – A tf.Tensor of shape $$[B, T, D]$$.
- num_heads – The number of heads $$H$$.

Returns: A tf.Tensor of shape $$[B, H, T, D / H]$$.
opennmt.layers.transformer.combine_heads(inputs)[source]

Concatenates heads back along the depth dimension.

Parameters:
- inputs – A tf.Tensor of shape $$[B, H, T, D]$$.

Returns: A tf.Tensor of shape $$[B, T, D * H]$$.
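The two functions are inverses of each other, as in this sketch (shapes illustrative):

```python
x = tf.random.normal([2, 5, 64])             # [B, T, D]
heads = transformer.split_heads(x, 8)        # [B, H, T, D / H] = [2, 8, 5, 8]
restored = transformer.combine_heads(heads)  # back to [2, 5, 64]
```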
opennmt.layers.transformer.dot_product_attention(queries, keys, values, mode, mask=None, dropout=0.0)[source]

Computes the dot product attention.

Parameters:
- queries – The sequence of queries. A tensor of shape $$[B, T_1, ...]$$.
- keys – The sequence used to calculate attention scores. A tensor of shape $$[B, T_2, ...]$$.
- values – The sequence to attend. A tensor of shape $$[B, T_2, ...]$$.
- mode – A tf.estimator.ModeKeys mode.
- mask – A tf.Tensor applied to the dot product.
- dropout – The probability to drop units from the inputs.

Returns: A tuple (context vector, attention vector).
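A minimal sketch (TensorFlow 1.x mode constant assumed; shapes illustrative):

```python
queries = tf.random.normal([2, 5, 64])  # [B, T_1, D]
keys = tf.random.normal([2, 7, 64])     # [B, T_2, D]
values = tf.random.normal([2, 7, 64])   # [B, T_2, D]

context, attention = transformer.dot_product_attention(
    queries, keys, values, tf.estimator.ModeKeys.TRAIN, dropout=0.1)
# context: [2, 5, 64]; attention: probabilities over the T_2 key positions.
```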
opennmt.layers.transformer.multi_head_attention(num_heads, queries, memory, mode, num_units=None, mask=None, cache=None, dropout=0.0, return_attention=False)[source]

Computes the multi-head attention as described in https://arxiv.org/abs/1706.03762.

Parameters:
- num_heads – The number of attention heads.
- queries – The sequence of queries. A tensor of shape $$[B, T_1, ...]$$.
- memory – The sequence to attend. A tensor of shape $$[B, T_2, ...]$$. If None, computes self-attention.
- mode – A tf.estimator.ModeKeys mode.
- num_units – The number of hidden units. If not set, it is set to the input dimension.
- mask – A tf.Tensor applied to the dot product.
- cache – A dictionary containing pre-projected keys and values.
- dropout – The probability to drop units from the inputs.
- return_attention – Return the attention head probabilities in addition to the context.

Returns: The concatenated attention context of each head and the attention probabilities (if return_attention is set).
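A sketch of a self-attention call (memory=None per the documented behavior; hyperparameters illustrative):

```python
x = tf.random.normal([2, 5, 64])  # [B, T, D]
context = transformer.multi_head_attention(
    8,                             # num_heads
    x,                             # queries
    None,                          # memory=None -> self-attention
    tf.estimator.ModeKeys.TRAIN,
    num_units=64,
    dropout=0.1)
```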
opennmt.layers.transformer.feed_forward(x, inner_dim, mode, dropout=0.0)[source]

Implements the Transformer’s “Feed Forward” layer.

$$\mathrm{ffn}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$
Parameters:
- x – The input.
- inner_dim – The number of units of the inner linear transformation.
- mode – A tf.estimator.ModeKeys mode.
- dropout – The probability to drop units from the inner transformation.

Returns: The transformed input.
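A minimal sketch, assuming the second transformation projects back to the input depth as in the Transformer paper (the inner dimension of 256 is illustrative):

```python
x = tf.random.normal([2, 5, 64])
y = transformer.feed_forward(
    x, 256, tf.estimator.ModeKeys.TRAIN, dropout=0.1)
# y has shape [2, 5, 64]; 256 is the inner (expansion) dimension.
```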
opennmt.layers.transformer.norm(inputs)[source]

Layer normalizes inputs.

opennmt.layers.transformer.drop_and_add(inputs, outputs, mode, dropout=0.1)[source]

Drops units in the outputs and adds the previous values.

Parameters:
- inputs – The input of the previous layer.
- outputs – The output of the previous layer.
- mode – A tf.estimator.ModeKeys mode.
- dropout – The probability to drop units in outputs.

Returns: The residual output.
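A sketch of the typical residual pattern around a sub-layer (continuing the feed_forward example above):

```python
outputs = transformer.feed_forward(
    x, 256, tf.estimator.ModeKeys.TRAIN, dropout=0.1)
y = transformer.drop_and_add(
    x, outputs, tf.estimator.ModeKeys.TRAIN, dropout=0.1)
# y = dropout(outputs) + x
```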
class opennmt.layers.transformer.FeedForwardNetwork(inner_dim, output_dim, dropout=0.1, activation=tf.nn.relu, **kwargs)[source]

Bases: tensorflow.python.keras.engine.base_layer.Layer

Implements the Transformer’s “Feed Forward” layer.

$$\mathrm{ffn}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$

Note

Object-oriented implementation for TensorFlow 2.0.

__init__(inner_dim, output_dim, dropout=0.1, activation=tf.nn.relu, **kwargs)[source]

Initializes this layer.

Parameters:
- inner_dim – The number of units of the inner linear transformation.
- output_dim – The number of units of the output linear transformation.
- dropout – The probability to drop units from the activation output.
- activation – The activation function to apply between the two linear transformations.
- kwargs – Additional layer arguments.
call(inputs, training=None)[source]

Runs the layer.
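A minimal TF2-style usage sketch (hyperparameter values are illustrative):

```python
import tensorflow as tf
from opennmt.layers import transformer

ffn = transformer.FeedForwardNetwork(2048, 512, dropout=0.1)
x = tf.random.normal([2, 5, 512])
y = ffn(x, training=True)  # [2, 5, 512]
```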

class opennmt.layers.transformer.MultiHeadAttention(num_heads, num_units, dropout=0.1, return_attention=False, **kwargs)[source]

Bases: tensorflow.python.keras.engine.base_layer.Layer

Computes the multi-head attention as described in https://arxiv.org/abs/1706.03762.

Note

Object-oriented implementation for TensorFlow 2.0.

__init__(num_heads, num_units, dropout=0.1, return_attention=False, **kwargs)[source]

Initializes this layer.

Parameters:
- num_heads – The number of attention heads.
- num_units – The number of hidden units.
- dropout – The probability to drop units from the inputs.
- return_attention – If True, also return the attention weights of the first head.
- kwargs – Additional layer arguments.
call(inputs, memory=None, mask=None, cache=None, training=None)[source]

Runs the layer.

Parameters:
- inputs – The sequence of queries. A tensor of shape $$[B, T_1, ...]$$.
- memory – The sequence to attend. A tensor of shape $$[B, T_2, ...]$$. If None, computes self-attention.
- mask – A tf.Tensor applied to the dot product.
- cache – A dictionary containing pre-projected keys and values.
- training – Run in training mode.

Returns: A tuple with the attention context, the updated cache, and the attention probabilities of the first head (if return_attention is True).
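A self-attention usage sketch (hyperparameters illustrative; the return value follows the documented tuple):

```python
attention = transformer.MultiHeadAttention(8, 512, dropout=0.1)
x = tf.random.normal([2, 5, 512])
# memory is None, so this computes self-attention.
context, cache = attention(x, training=True)
```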
class opennmt.layers.transformer.TransformerLayerWrapper(layer, output_dropout, **kwargs)[source]

Layer wrapper that applies a standard Transformer preprocessing and postprocessing:

y = layer_norm(x)
y = dropout(layer(y)) + x

__init__(layer, output_dropout, **kwargs)[source]

Initializes the wrapper.

Parameters:
- layer – The Transformer layer to wrap.
- output_dropout – The dropout to apply on the layer output.
- **kwargs – Additional layer arguments.
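A sketch of wrapping a sub-layer into the pre-norm residual block shown above (hyperparameters illustrative; passing training as a call argument assumes the standard Keras layer convention):

```python
ffn = transformer.FeedForwardNetwork(2048, 512, dropout=0.1)
block = transformer.TransformerLayerWrapper(ffn, output_dropout=0.1)
x = tf.random.normal([2, 5, 512])
y = block(x, training=True)  # layer_norm -> ffn -> dropout -> + x
```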