opennmt.layers.transformer module

Define layers related to Google's Transformer model.

opennmt.layers.transformer.tile_sequence_length(sequence_length, num_heads)[source]

Tiles lengths num_heads times.

Parameters:
  • sequence_length – The sequence length.
  • num_heads – The number of heads.
Returns:

A tf.Tensor where each length is replicated num_heads times.
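
A minimal usage sketch (an illustration, not part of the original documentation; it assumes OpenNMT-tf 1.x with TensorFlow 1.x):

import tensorflow as tf
from opennmt.layers import transformer

# Two sequences of lengths 3 and 5.
lengths = tf.constant([3, 5], dtype=tf.int32)

# Each length is repeated num_heads times, giving a tensor of
# shape [batch_size * num_heads].
tiled = transformer.tile_sequence_length(lengths, num_heads=4)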

opennmt.layers.transformer.build_sequence_mask(sequence_length, num_heads=None, maximum_length=None, dtype=tf.float32)[source]

Builds the dot product mask.

Parameters:
  • sequence_length – The sequence length.
  • num_heads – The number of heads.
  • maximum_length – Optional size of the returned time dimension. Otherwise it is the maximum of sequence_length.
  • dtype – The type of the mask tensor.
Returns:

A broadcastable tf.Tensor of type dtype and shape [batch_size, 1, 1, max_length].
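
A hedged usage sketch (illustrative values only):

import tensorflow as tf
from opennmt.layers import transformer

lengths = tf.constant([3, 5], dtype=tf.int32)

# Padding mask broadcastable against attention scores of shape
# [batch_size, num_heads, queries_length, keys_length].
mask = transformer.build_sequence_mask(
    lengths, num_heads=8, maximum_length=5)  # shape [2, 1, 1, 5]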

opennmt.layers.transformer.build_future_mask(sequence_length, num_heads=None, maximum_length=None, dtype=tf.float32)[source]

Builds the dot product mask for future positions.

Parameters:
  • sequence_length – The sequence length.
  • num_heads – The number of heads.
  • maximum_length – Optional size of the returned time dimension. Otherwise it is the maximum of sequence_length.
  • dtype – The type of the mask tensor.
Returns:

A broadcastable tf.Tensor of type dtype and shape [batch_size, 1, max_length, max_length].
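
A similar sketch for the decoder-side mask (illustrative values only):

import tensorflow as tf
from opennmt.layers import transformer

lengths = tf.constant([3, 5], dtype=tf.int32)

# Mask that hides both padding and future positions, broadcastable
# against scores of shape [batch_size, num_heads, queries_length, keys_length].
future_mask = transformer.build_future_mask(
    lengths, num_heads=8, maximum_length=5)  # shape [2, 1, 5, 5]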

opennmt.layers.transformer.cumulative_average_mask(sequence_length, maximum_length=None, dtype=tf.float32)[source]

Builds the mask to compute the cumulative average as described in https://arxiv.org/abs/1805.00631.

Parameters:
  • sequence_length – The sequence length.
  • maximum_length – Optional size of the returned time dimension. Otherwise it is the maximum of sequence_length.
  • dtype – The type of the mask tensor.
Returns:

A tf.Tensor of type dtype and shape [batch_size, max_length, max_length].

opennmt.layers.transformer.cumulative_average(inputs, mask_or_step, cache=None)[source]

Computes the cumulative average as described in https://arxiv.org/abs/1805.00631.

Parameters:
  • inputs – The sequence to average. A tensor of shape \([B, T, D]\).
  • mask_or_step – If cache is set, this is assumed to be the current step of the dynamic decoding. Otherwise, it is the mask matrix used to compute the cumulative average.
  • cache – A dictionary containing the cumulative average of the previous step.
Returns:

The cumulative average, a tensor of the same shape and type as inputs.
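
A sketch combining cumulative_average_mask and cumulative_average for the training case (the decoding case would instead pass the current step and a cache, whose layout is not documented above and is therefore not shown):

import tensorflow as tf
from opennmt.layers import transformer

inputs = tf.random.normal([2, 5, 64])            # [B, T, D]
lengths = tf.constant([3, 5], dtype=tf.int32)

# Build the [batch_size, max_length, max_length] averaging mask and
# compute the cumulative average over all positions at once.
mask = transformer.cumulative_average_mask(lengths, maximum_length=5)
averaged = transformer.cumulative_average(inputs, mask)  # [2, 5, 64]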

opennmt.layers.transformer.fused_projection(inputs, num_units, num_outputs=1)[source]

Projects the same input into multiple output spaces.

Parameters:
  • inputs – The inputs to project.
  • num_units – The number of output units of each space.
  • num_outputs – The number of output spaces.
Returns:

num_outputs tf.Tensors, each of depth num_units.
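
For example, a single fused projection can produce the queries, keys, and values of a self-attention layer (an illustrative sketch):

import tensorflow as tf
from opennmt.layers import transformer

inputs = tf.random.normal([2, 5, 64])   # [B, T, D]

# Project the same input into 3 output spaces of depth 128 each.
queries, keys, values = transformer.fused_projection(
    inputs, num_units=128, num_outputs=3)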

opennmt.layers.transformer.split_heads(inputs, num_heads)[source]

Splits a tensor in depth.

Parameters:
  • inputs – A tf.Tensor of shape \([B, T, D]\).
  • num_heads – The number of heads \(H\).
Returns:

A tf.Tensor of shape \([B, H, T, D / H]\).

opennmt.layers.transformer.combine_heads(inputs)[source]

Concatenates heads.

Parameters:
  • inputs – A tf.Tensor of shape \([B, H, T, D]\).
Returns:

A tf.Tensor of shape \([B, T, D * H]\).
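
The two functions are inverses of each other, as in this short sketch:

import tensorflow as tf
from opennmt.layers import transformer

x = tf.random.normal([2, 5, 64])                   # [B, T, D]

heads = transformer.split_heads(x, num_heads=8)    # [2, 8, 5, 8]
combined = transformer.combine_heads(heads)        # [2, 5, 64]
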
opennmt.layers.transformer.dot_product_attention(queries, keys, values, mode, mask=None, dropout=0.0)[source]

Computes the dot product attention.

Parameters:
  • queries – The sequence of queries. A tensor of shape \([B, T_1, ...]\).
  • keys – The sequence used to calculate attention scores. A tensor of shape \([B, T_2, ...]\).
  • values – The sequence to attend. A tensor of shape \([B, T_2, ...]\).
  • mode – A tf.estimator.ModeKeys mode.
  • mask – A tf.Tensor applied to the dot product.
  • dropout – The probability to drop units from the inputs.
Returns:

A tuple (context vector, attention vector).
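
A hedged sketch of the typical per-head call, with shapes as produced by split_heads (tf.estimator.ModeKeys is assumed available, as in TensorFlow 1.x):

import tensorflow as tf
from opennmt.layers import transformer

mode = tf.estimator.ModeKeys.TRAIN

# Per-head projections, e.g. as produced by split_heads.
queries = tf.random.normal([2, 8, 5, 64])    # [B, H, T_1, D/H]
keys = tf.random.normal([2, 8, 7, 64])       # [B, H, T_2, D/H]
values = tf.random.normal([2, 8, 7, 64])     # [B, H, T_2, D/H]

# Padding mask over the keys, broadcastable against the dot product.
mask = transformer.build_sequence_mask(
    tf.constant([7, 6]), num_heads=8, maximum_length=7)

context, attention = transformer.dot_product_attention(
    queries, keys, values, mode, mask=mask, dropout=0.1)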

opennmt.layers.transformer.multi_head_attention(num_heads, queries, memory, mode, num_units=None, mask=None, cache=None, dropout=0.0, return_attention=False)[source]

Computes the multi-head attention as described in https://arxiv.org/abs/1706.03762.

Parameters:
  • num_heads – The number of attention heads.
  • queries – The sequence of queries. A tensor of shape \([B, T_1, ...]\).
  • memory – The sequence to attend. A tensor of shape \([B, T_2, ...]\). If None, computes self-attention.
  • mode – A tf.estimator.ModeKeys mode.
  • num_units – The number of hidden units. If not set, it is set to the input dimension.
  • mask – A tf.Tensor applied to the dot product.
  • cache – A dictionary containing pre-projected keys and values.
  • dropout – The probability to drop units from the inputs.
  • return_attention – Return the attention head probabilities in addition to the context.
Returns:

The concatenated attention context of each head and the attention probabilities (if return_attention is set).
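
A sketch of an encoder-decoder attention call (illustrative dimensions; pass memory=None for self-attention):

import tensorflow as tf
from opennmt.layers import transformer

mode = tf.estimator.ModeKeys.TRAIN

queries = tf.random.normal([2, 5, 512])   # [B, T_1, D]
memory = tf.random.normal([2, 7, 512])    # [B, T_2, D]
mask = transformer.build_sequence_mask(
    tf.constant([7, 6]), num_heads=8, maximum_length=7)

# Returns the concatenated context of the 8 heads.
context = transformer.multi_head_attention(
    8, queries, memory, mode, mask=mask, dropout=0.1)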

opennmt.layers.transformer.feed_forward(x, inner_dim, mode, dropout=0.0)[source]

Implements the Transformer’s “Feed Forward” layer.

\[ffn(x) = max(0, x*W_1 + b_1)*W_2 + b_2\]
Parameters:
  • x – The input.
  • inner_dim – The number of units of the inner linear transformation.
  • mode – A tf.estimator.ModeKeys mode.
  • dropout – The probability to drop units from the inner transformation.
Returns:

The transformed input.
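
A short sketch (inner_dim=2048 is only an illustrative value):

import tensorflow as tf
from opennmt.layers import transformer

mode = tf.estimator.ModeKeys.TRAIN
x = tf.random.normal([2, 5, 512])   # [B, T, D]

# Position-wise feed-forward transformation with an inner dimension of 2048.
y = transformer.feed_forward(x, inner_dim=2048, mode=mode, dropout=0.1)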

opennmt.layers.transformer.norm(inputs)[source]

Layer normalizes inputs.

opennmt.layers.transformer.drop_and_add(inputs, outputs, mode, dropout=0.1)[source]

Drops units in the outputs and adds the previous values.

Parameters:
  • inputs – The input of the previous layer.
  • outputs – The output of the previous layer.
  • mode – A tf.estimator.ModeKeys mode.
  • dropout – The probability to drop units in outputs.
Returns:

The residual output.
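
A sketch of how norm, a sublayer, and drop_and_add are typically chained into a pre-norm residual block (the chaining itself is an illustrative assumption, not mandated by these functions):

import tensorflow as tf
from opennmt.layers import transformer

mode = tf.estimator.ModeKeys.TRAIN
inputs = tf.random.normal([2, 5, 512])

# Normalize, transform, then apply dropout and the residual connection.
normed = transformer.norm(inputs)
outputs = transformer.feed_forward(normed, inner_dim=2048, mode=mode)
outputs = transformer.drop_and_add(inputs, outputs, mode, dropout=0.1)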

class opennmt.layers.transformer.FeedForwardNetwork(inner_dim, output_dim, dropout=0.1, activation=tf.nn.relu, **kwargs)[source]

Bases: tensorflow.python.keras.engine.base_layer.Layer

Implements the Transformer’s “Feed Forward” layer.

\[ffn(x) = max(0, x*W_1 + b_1)*W_2 + b_2\]

Note

Object-oriented implementation for TensorFlow 2.0.

__init__(inner_dim, output_dim, dropout=0.1, activation=tf.nn.relu, **kwargs)[source]

Initializes this layer.

Parameters:
  • inner_dim – The number of units of the inner linear transformation.
  • output_dim – The number of units of the output linear transformation.
  • dropout – The probability to drop units from the activation output.
  • activation – The activation function to apply between the two linear transformations.
  • kwargs – Additional layer arguments.
call(inputs, training=None)[source]

Runs the layer.
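
A minimal sketch with this Keras-style layer (assuming TensorFlow 2.0 eager execution; dimensions are illustrative):

import tensorflow as tf
from opennmt.layers.transformer import FeedForwardNetwork

ffn = FeedForwardNetwork(inner_dim=2048, output_dim=512, dropout=0.1)

x = tf.random.normal([2, 5, 512])   # [B, T, D]
y = ffn(x, training=True)           # [2, 5, 512]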

class opennmt.layers.transformer.MultiHeadAttention(num_heads, num_units, dropout=0.1, return_attention=False, **kwargs)[source]

Bases: tensorflow.python.keras.engine.base_layer.Layer

Computes the multi-head attention as described in https://arxiv.org/abs/1706.03762.

Note

Object-oriented implementation for TensorFlow 2.0.

__init__(num_heads, num_units, dropout=0.1, return_attention=False, **kwargs)[source]

Initializes this layer.

Parameters:
  • num_heads – The number of attention heads.
  • num_units – The number of hidden units.
  • dropout – The probability to drop units from the inputs.
  • return_attention – If True, also return the attention weights of the first head.
  • kwargs – Additional layer arguments.
call(inputs, memory=None, mask=None, cache=None, training=None)[source]

Runs the layer.

Parameters:
  • inputs – The sequence of queries. A tensor of shape \([B, T_1, ...]\).
  • memory – The sequence to attend. A tensor of shape \([B, T_2, ...]\). If None, computes self-attention.
  • mask – A tf.Tensor applied to the dot product.
  • cache – A dictionary containing pre-projected keys and values.
  • training – Run in training mode.
Returns:

A tuple with the attention context, the updated cache and the attention probabilities of the first head (if return_attention is True).
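
A sketch of an encoder-decoder attention call with this Keras-style layer (assuming TensorFlow 2.0 eager execution; per the documented return value, the layer returns the context and the updated cache):

import tensorflow as tf
from opennmt.layers.transformer import MultiHeadAttention

attention = MultiHeadAttention(num_heads=8, num_units=512, dropout=0.1)

queries = tf.random.normal([2, 5, 512])   # [B, T_1, D]
memory = tf.random.normal([2, 7, 512])    # [B, T_2, D]

# Omit memory to compute self-attention over the queries.
context, cache = attention(queries, memory=memory, training=True)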

class opennmt.layers.transformer.TransformerLayerWrapper(layer, output_dropout, **kwargs)[source]

Bases: opennmt.layers.common.LayerWrapper

Layer wrapper that applies a standard Transformer preprocessing and postprocessing:

y = layer_norm(x)
y = dropout(layer(y)) + x
__init__(layer, output_dropout, **kwargs)[source]

Initializes the wrapper.

Parameters:
  • layer – The Transformer layer to wrap.
  • output_dropout – The dropout to apply on the layer output.
  • **kwargs – Additional layer arguments.
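
A sketch wrapping a feed-forward sublayer (assuming TensorFlow 2.0 eager execution; the wrapped FeedForwardNetwork and its dimensions are illustrative choices):

import tensorflow as tf
from opennmt.layers.transformer import FeedForwardNetwork, TransformerLayerWrapper

# layer_norm is applied to the wrapper input, then dropout and a
# residual connection are applied to the wrapped layer's output.
ffn = FeedForwardNetwork(inner_dim=2048, output_dim=512)
wrapped_ffn = TransformerLayerWrapper(ffn, output_dropout=0.1)

x = tf.random.normal([2, 5, 512])
y = wrapped_ffn(x, training=True)   # same shape as x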