Modules

Core Modules

class onmt.modules.Embeddings(word_vec_size, word_vocab_size, word_padding_idx, position_encoding=False, feat_merge='concat', feat_vec_exponent=0.7, feat_vec_size=-1, feat_padding_idx=[], feat_vocab_sizes=[], dropout=0, sparse=False, fix_word_vecs=False)[source]

Bases: torch.nn.modules.module.Module

Word embeddings for encoder/decoder.

Additionally includes ability to add sparse input features based on “Linguistic Input Features Improve Neural Machine Translation” [SH16].

[Diagram: the input indexes a word lookup and one lookup per feature; the lookups are merged (MLP/concat) into the output embedding.]
Parameters
  • word_vec_size (int) – dimension of the word embedding vectors.

  • word_padding_idx (int) – padding index for words in the embeddings.

  • feat_padding_idx (List[int]) – padding index for a list of features in the embeddings.

  • word_vocab_size (int) – size of dictionary of embeddings for words.

  • feat_vocab_sizes (List[int], optional) – list of size of dictionary of embeddings for each feature.

  • position_encoding (bool) – see PositionalEncoding

  • feat_merge (str) – merge action for the feature embeddings: concat, sum or mlp.

  • feat_vec_exponent (float) – when using -feat_merge concat, the feature embedding size is N^feat_vec_exponent, where N is the number of values the feature takes.

  • feat_vec_size (int) – embedding dimension for features when using -feat_merge mlp.

  • dropout (float) – dropout probability.

property emb_luts

Embedding look-up table.

forward(source, step=None)[source]

Computes the embeddings for words and features.

Parameters

source (LongTensor) – index tensor (len, batch, nfeat)

Returns

Word embeddings (len, batch, embedding_size)

Return type

FloatTensor

load_pretrained_vectors(emb_file)[source]

Load in pretrained embeddings.

Parameters

emb_file (str) – path to torch serialized embeddings

property word_lut

Word look-up table.
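
A minimal usage sketch of this module (a sketch only: the vocabulary sizes, the single extra feature, and the padding indices below are illustrative, not defaults):

import torch
from onmt.modules import Embeddings

emb = Embeddings(
    word_vec_size=128,        # dimension of each word vector
    word_vocab_size=10000,
    word_padding_idx=1,
    feat_merge="concat",
    feat_vocab_sizes=[40],    # one extra feature, e.g. POS tags
    feat_padding_idx=[1],
)

# Index tensor of shape (len, batch, nfeat): one word index plus one feature index.
source = torch.randint(2, 40, (7, 4, 2))
out = emb(source)             # -> (7, 4, embedding_size)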

Encoders

class onmt.encoders.EncoderBase[source]

Bases: torch.nn.modules.module.Module

Base encoder class. Specifies the interface used by different encoder types and required by onmt.models.NMTModel.

[Diagram: the RNN encodes the input position by position; every position feeds the memory bank, and the last position yields the final state.]
forward(src, lengths=None)[source]
Parameters
  • src (LongTensor) – padded sequences of sparse indices (src_len, batch, nfeat)

  • lengths (LongTensor) – length of each sequence (batch,)

Returns

  • final encoder state, used to initialize decoder

  • memory bank for attention, (src_len, batch, hidden)

Return type

(FloatTensor, FloatTensor)

class onmt.encoders.MeanEncoder(num_layers, embeddings)[source]

Bases: onmt.encoders.encoder.EncoderBase

A trivial non-recurrent encoder. Simply applies mean pooling.

Parameters
  • num_layers (int) – number of replicated layers

  • embeddings (onmt.modules.Embeddings) – embedding module to use

forward(src, lengths=None)[source]

See EncoderBase.forward()

classmethod from_opt(opt, embeddings)[source]

Alternate constructor.
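
The pooling itself reduces to a couple of tensor operations; a sketch of the idea (padding-aware averaging is omitted, and num_layers=2 is illustrative):

import torch

num_layers = 2
src_emb = torch.randn(10, 4, 256)                # (src_len, batch, dim) embedded source
memory_bank = src_emb                            # the bank is passed through unchanged
mean = src_emb.mean(dim=0)                       # (batch, dim) mean over time
encoder_final = mean.expand(num_layers, 4, 256)  # replicated to initialize the decoder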

class onmt.encoders.RNNEncoder(rnn_type, bidirectional, num_layers, hidden_size, dropout=0.0, embeddings=None, use_bridge=False)[source]

Bases: onmt.encoders.encoder.EncoderBase

A generic recurrent neural network encoder.

Parameters
  • rnn_type (str) – style of recurrent unit to use, one of [RNN, LSTM, GRU, SRU]

  • bidirectional (bool) – use a bidirectional RNN

  • num_layers (int) – number of stacked layers

  • hidden_size (int) – hidden size of each layer

  • dropout (float) – dropout value for torch.nn.Dropout

  • embeddings (onmt.modules.Embeddings) – embedding module to use

forward(src, lengths=None)[source]

See EncoderBase.forward()

classmethod from_opt(opt, embeddings)[source]

Alternate constructor.
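
An illustrative construction (sizes are arbitrary; emb stands for an onmt.modules.Embeddings instance such as the one sketched earlier):

from onmt.encoders import RNNEncoder

enc = RNNEncoder(
    rnn_type="LSTM",
    bidirectional=True,   # hidden_size is split across the two directions
    num_layers=2,
    hidden_size=256,
    dropout=0.3,
    embeddings=emb,
)
# encoder_final, memory_bank = enc(src, lengths), per EncoderBase.forward()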

Decoders

class onmt.decoders.DecoderBase(attentional=True)[source]

Bases: torch.nn.modules.module.Module

Abstract class for decoders.

Parameters

attentional (bool) – whether the decoder returns non-empty attention.

classmethod from_opt(opt, embeddings)[source]

Alternate constructor.

Subclasses should override this method.

class onmt.decoders.decoder.RNNDecoderBase(rnn_type, bidirectional_encoder, num_layers, hidden_size, attn_type='general', attn_func='softmax', coverage_attn=False, context_gate=None, copy_attn=False, dropout=0.0, embeddings=None, reuse_copy_attn=False, copy_attn_type='general')[source]

Bases: onmt.decoders.decoder.DecoderBase

Base recurrent attention-based decoder class.

Specifies the interface used by different decoder types and required by NMTModel.

[Diagram: embedded inputs are processed position by position; each position attends over the memory bank to produce the outputs and the next decoder state.]
Parameters
  • rnn_type (str) – style of recurrent unit to use, one of [RNN, LSTM, GRU, SRU]

  • bidirectional_encoder (bool) – use with a bidirectional encoder

  • num_layers (int) – number of stacked layers

  • hidden_size (int) – hidden size of each layer

  • attn_type (str) – see GlobalAttention

  • attn_func (str) – see GlobalAttention

  • coverage_attn (str) – see GlobalAttention

  • context_gate (str) – see ContextGate

  • copy_attn (bool) – setup a separate copy attention mechanism

  • dropout (float) – dropout value for torch.nn.Dropout

  • embeddings (onmt.modules.Embeddings) – embedding module to use

  • reuse_copy_attn (bool) – reuse the attention for copying

  • copy_attn_type (str) – The copy attention style. See GlobalAttention.

forward(tgt, memory_bank, memory_lengths=None, step=None)[source]
Parameters
  • tgt (LongTensor) – sequences of padded tokens (tgt_len, batch, nfeats).

  • memory_bank (FloatTensor) – vectors from the encoder (src_len, batch, hidden).

  • memory_lengths (LongTensor) – the padded source lengths (batch,).

Returns

  • dec_outs: output from the decoder (after attn) (tgt_len, batch, hidden).

  • attns: distribution over src at each tgt (tgt_len, batch, src_len).

Return type

(FloatTensor, dict[str, FloatTensor])

classmethod from_opt(opt, embeddings)[source]

Alternate constructor.

init_state(src, memory_bank, encoder_final)[source]

Initialize decoder state with last state of the encoder.

class onmt.decoders.StdRNNDecoder(rnn_type, bidirectional_encoder, num_layers, hidden_size, attn_type='general', attn_func='softmax', coverage_attn=False, context_gate=None, copy_attn=False, dropout=0.0, embeddings=None, reuse_copy_attn=False, copy_attn_type='general')[source]

Bases: onmt.decoders.decoder.RNNDecoderBase

Standard fully batched RNN decoder with attention.

Faster implementation that uses CuDNN. See RNNDecoderBase for options.

Based around the approach from “Neural Machine Translation By Jointly Learning To Align and Translate” [BCB14]

Implemented without input_feeding and currently with no coverage_attn or copy_attn support.

class onmt.decoders.InputFeedRNNDecoder(rnn_type, bidirectional_encoder, num_layers, hidden_size, attn_type='general', attn_func='softmax', coverage_attn=False, context_gate=None, copy_attn=False, dropout=0.0, embeddings=None, reuse_copy_attn=False, copy_attn_type='general')[source]

Bases: onmt.decoders.decoder.RNNDecoderBase

Input feeding based decoder.

See RNNDecoderBase for options.

Based around the input feeding approach from “Effective Approaches to Attention-based Neural Machine Translation” [LPM15]

[Diagram: at position n the decoder RNN consumes the current input together with the attentional output computed at position n-1.]
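
A schematic of the input-feeding loop in plain PyTorch: the previous attentional output is concatenated onto the current target embedding, so the RNN input size is emb_dim + hidden. The attention call is stubbed out and all names are illustrative:

import torch
import torch.nn as nn

emb_dim, hidden, batch = 64, 128, 4
cell = nn.LSTMCell(emb_dim + hidden, hidden)   # input is [embedding; attn_out]

h = c = torch.zeros(batch, hidden)             # decoder state
attn_out = torch.zeros(batch, hidden)          # initial input feed
for emb_t in torch.randn(7, batch, emb_dim):   # one step per target position
    h, c = cell(torch.cat([emb_t, attn_out], dim=-1), (h, c))
    attn_out = h                               # stand-in for attention(h, memory_bank)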

Attention

class onmt.modules.AverageAttention(model_dim, dropout=0.1, aan_useffn=False)[source]

Bases: torch.nn.modules.module.Module

Average Attention module from “Accelerating Neural Transformer via an Average Attention Network” [ZXS18].

Parameters
  • model_dim (int) – the dimension of keys/values/queries

  • dropout (float) – dropout parameter

cumulative_average(inputs, mask_or_step, layer_cache=None, step=None)[source]

Computes the cumulative average as described in [ZXS18] – Equations (1), (5) and (6).

Parameters
  • inputs (FloatTensor) – sequence to average (batch_size, input_len, dimension)

  • mask_or_step – if cache is set, this is assumed to be the current step of the dynamic decoding. Otherwise, it is the mask matrix used to compute the cumulative average.

  • layer_cache – a dictionary containing the cumulative average of the previous step.

Returns

a tensor of the same shape and type as inputs.

cumulative_average_mask(batch_size, inputs_len, device)[source]

Builds the mask to compute the cumulative average as described in [ZXS18] – Figure 3

Parameters
  • batch_size (int) – batch size

  • inputs_len (int) – length of the inputs

Returns

  • A Tensor of shape (batch_size, input_len, input_len)

Return type

(FloatTensor)
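
One way to realize this mask is a row-normalized lower-triangular matrix, so that position i averages inputs 1..i; a sketch (the module's exact construction may differ):

import torch

def cumulative_average_mask(batch_size, inputs_len):
    tri = torch.tril(torch.ones(inputs_len, inputs_len))  # position i sees inputs 1..i
    mask = tri / tri.sum(dim=1, keepdim=True)             # weights in row i are 1/i
    return mask.unsqueeze(0).expand(batch_size, -1, -1)

x = torch.randn(2, 5, 8)                                  # (batch, input_len, dim)
avg = cumulative_average_mask(2, 5) @ x                   # running mean over the steps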

forward(inputs, mask=None, layer_cache=None, step=None)[source]
Parameters

inputs (FloatTensor) – (batch_size, input_len, model_dim)

Returns

  • gating_outputs (batch_size, input_len, model_dim)

  • average_outputs, the average attention (batch_size, input_len, model_dim)

Return type

(FloatTensor, FloatTensor)

class onmt.modules.GlobalAttention(dim, coverage=False, attn_type='dot', attn_func='softmax')[source]

Bases: torch.nn.modules.module.Module

Global attention takes a matrix and a query vector. It then computes a parameterized convex combination of the matrix based on the input query.

Constructs a unit mapping a query q of size dim and a source matrix H of size n x dim, to an output of size dim.

[Diagram: the query is scored against each source state H 1..H N; the resulting attention weights combine the source states into the output.]

All models compute the output as \(c = \sum_{j=1}^{\text{SeqLength}} a_j H_j\) where \(a_j\) is the softmax of a score function. They then apply a projection layer to [q, c].

However, they differ in how they compute the attention score.

  • Luong Attention (dot, general):
    • dot: \(\text{score}(H_j,q) = H_j^T q\)

    • general: \(\text{score}(H_j, q) = H_j^T W_a q\)

  • Bahdanau Attention (mlp):
    • \(\text{score}(H_j, q) = v_a^T \text{tanh}(W_a q + U_a H_j)\)

Parameters
  • dim (int) – dimensionality of query and key

  • coverage (bool) – use coverage term

  • attn_type (str) – type of attention to use, options [dot,general,mlp]

  • attn_func (str) – attention function to use, options [softmax,sparsemax]

forward(source, memory_bank, memory_lengths=None, coverage=None)[source]
Parameters
  • source (FloatTensor) – query vectors (batch, tgt_len, dim)

  • memory_bank (FloatTensor) – source vectors (batch, src_len, dim)

  • memory_lengths (LongTensor) – the source context lengths (batch,)

  • coverage (FloatTensor) – None (not supported yet)

Returns

  • Computed vector (tgt_len, batch, dim)

  • Attention distributions for each query (tgt_len, batch, src_len)

Return type

(FloatTensor, FloatTensor)

score(h_t, h_s)[source]
Parameters
  • h_t (FloatTensor) – sequence of queries (batch, tgt_len, dim)

  • h_s (FloatTensor) – sequence of sources (batch, src_len, dim)

Returns

raw attention scores (unnormalized) for each src index

(batch, tgt_len, src_len)

Return type

FloatTensor
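
To make the score-then-combine pipeline concrete, a raw-tensor sketch of the dot variant (the learned projection over [q, c] is omitted):

import torch
import torch.nn.functional as F

batch, tgt_len, src_len, dim = 2, 3, 5, 8
q = torch.randn(batch, tgt_len, dim)        # queries, shaped as in forward()
H = torch.randn(batch, src_len, dim)        # memory bank

score = torch.bmm(q, H.transpose(1, 2))     # dot: score(H_j, q) = H_j^T q
a = F.softmax(score, dim=-1)                # (batch, tgt_len, src_len)
c = torch.bmm(a, H)                         # c = sum_j a_j H_j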

Architecture: Transformer

class onmt.modules.PositionalEncoding(dropout, dim, max_len=5000)[source]

Bases: torch.nn.modules.module.Module

Sinusoidal positional encoding for non-recurrent neural networks.

Implementation based on “Attention Is All You Need” [VSP+17]

Parameters
  • dropout (float) – dropout parameter

  • dim (int) – embedding size

forward(emb, step=None)[source]

Embed inputs.

Parameters
  • emb (FloatTensor) – Sequence of word vectors (seq_len, batch_size, self.dim)

  • step (int or NoneType) – If stepwise (seq_len = 1), use the encoding for this position.
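
A sketch of the sinusoidal table itself (dim is assumed even), added to a (seq_len, batch, dim) embedding the way forward() does:

import math
import torch

def sinusoid(max_len, dim):
    pe = torch.zeros(max_len, dim)
    pos = torch.arange(max_len).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, dim, 2).float() * -(math.log(10000.0) / dim))
    pe[:, 0::2] = torch.sin(pos * div)      # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)      # odd dimensions
    return pe                               # (max_len, dim)

emb = torch.randn(10, 4, 512)               # (seq_len, batch, dim)
emb = emb + sinusoid(10, 512).unsqueeze(1)  # broadcast over the batch axis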

class onmt.modules.position_ffn.PositionwiseFeedForward(d_model, d_ff, dropout=0.1)[source]

Bases: torch.nn.modules.module.Module

A two-layer Feed-Forward-Network with residual layer norm.

Parameters
  • d_model (int) – size of the input to the first layer of the FFN.

  • d_ff (int) – hidden size of the FFN's inner layer.

  • dropout (float) – dropout probability in \([0, 1)\).

forward(x)[source]

Layer definition.

Parameters

x (FloatTensor) – input tensor (batch_size, input_len, model_dim)

Returns

Output (batch_size, input_len, model_dim).

Return type

(FloatTensor)
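
A hedged sketch of such a block; the pre-norm ordering (layer norm before the first linear) is an assumption here, not a documented guarantee of this class:

import torch
import torch.nn as nn

class FFN(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        super().__init__()
        self.w_1 = nn.Linear(d_model, d_ff)
        self.w_2 = nn.Linear(d_ff, d_model)
        self.norm = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):                   # (batch_size, input_len, d_model)
        inner = self.drop(torch.relu(self.w_1(self.norm(x))))
        return self.w_2(inner) + x          # residual connection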

class onmt.encoders.TransformerEncoder(num_layers, d_model, heads, d_ff, dropout, attention_dropout, embeddings, max_relative_positions)[source]

Bases: onmt.encoders.encoder.EncoderBase

The Transformer encoder from “Attention is All You Need” [VSP+17]

[Diagram: input -> multi-head self-attn -> feed forward -> output.]
Parameters
  • num_layers (int) – number of encoder layers

  • d_model (int) – size of the model

  • heads (int) – number of heads

  • d_ff (int) – size of the inner FF layer

  • dropout (float) – dropout parameter

  • embeddings (onmt.modules.Embeddings) – embeddings to use, should have positional encodings

Returns

  • embeddings (src_len, batch_size, model_dim)

  • memory_bank (src_len, batch_size, model_dim)

Return type

(torch.FloatTensor, torch.FloatTensor)

forward(src, lengths=None)[source]

See EncoderBase.forward()

classmethod from_opt(opt, embeddings)[source]

Alternate constructor.

class onmt.decoders.TransformerDecoder(num_layers, d_model, heads, d_ff, copy_attn, self_attn_type, dropout, attention_dropout, embeddings, max_relative_positions, aan_useffn)[source]

Bases: onmt.decoders.decoder.DecoderBase

The Transformer decoder from “Attention is All You Need”. [VSP+17]

[Diagram: input -> multi-head self-attn -> multi-head src-attn -> feed forward -> output.]
Parameters
  • num_layers (int) – number of decoder layers.

  • d_model (int) – size of the model

  • heads (int) – number of heads

  • d_ff (int) – size of the inner FF layer

  • copy_attn (bool) – if using a separate copy attention

  • self_attn_type (str) – type of self-attention: scaled-dot or average

  • dropout (float) – dropout parameter

  • embeddings (onmt.modules.Embeddings) – embeddings to use, should have positional encodings

forward(tgt, memory_bank, step=None, **kwargs)[source]

Decode, possibly stepwise.

classmethod from_opt(opt, embeddings)[source]

Alternate constructor.

init_state(src, memory_bank, enc_hidden)[source]

Initialize decoder state.

class onmt.modules.MultiHeadedAttention(head_count, model_dim, dropout=0.1, max_relative_positions=0)[source]

Bases: torch.nn.modules.module.Module

Multi-Head Attention module from “Attention is All You Need” [VSP+17].

Similar to standard dot attention but uses multiple attention distributions simultaneously to select relevant items.

[Diagram: the key and query feed N parallel attention heads; the head outputs combine with the value to form the output.]

Also includes several additional tricks.

Parameters
  • head_count (int) – number of parallel heads

  • model_dim (int) – the dimension of keys/values/queries, must be divisible by head_count

  • dropout (float) – dropout parameter

forward(key, value, query, mask=None, layer_cache=None, attn_type=None)[source]

Compute the context vector and the attention vectors.

Parameters
  • key (FloatTensor) – set of key_len key vectors (batch, key_len, dim)

  • value (FloatTensor) – set of key_len value vectors (batch, key_len, dim)

  • query (FloatTensor) – set of query_len query vectors (batch, query_len, dim)

  • mask – binary mask 1/0 indicating which keys have zero / non-zero attention (batch, query_len, key_len)

Returns

  • output context vectors (batch, query_len, dim)

  • one of the attention vectors (batch, query_len, key_len)

Return type

(FloatTensor, FloatTensor)

update_dropout(dropout)[source]

Update the dropout probability.
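
A compact sketch of the head-splitting arithmetic behind forward(); masking, caching, and the final output projection are omitted:

import torch
import torch.nn.functional as F

batch, heads, dim = 2, 4, 32
key_len, query_len = 6, 3
d_head = dim // heads                      # model_dim must divide by head_count

def split(x):                              # (batch, len, dim) -> (batch, heads, len, d_head)
    return x.view(batch, -1, heads, d_head).transpose(1, 2)

k = torch.randn(batch, key_len, dim)
v = torch.randn(batch, key_len, dim)
q = torch.randn(batch, query_len, dim)

scores = split(q) @ split(k).transpose(-2, -1) / d_head ** 0.5
attn = F.softmax(scores, dim=-1)           # (batch, heads, query_len, key_len)
ctx = (attn @ split(v)).transpose(1, 2).reshape(batch, query_len, dim)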

Architecture: Conv2Conv

(These methods are from a user contribution and have not been thoroughly tested.)

class onmt.encoders.CNNEncoder(num_layers, hidden_size, cnn_kernel_width, dropout, embeddings)[source]

Bases: onmt.encoders.encoder.EncoderBase

Encoder based on “Convolutional Sequence to Sequence Learning” [GAG+17].

forward(input, lengths=None, hidden=None)[source]

See EncoderBase.forward()

classmethod from_opt(opt, embeddings)[source]

Alternate constructor.

class onmt.decoders.CNNDecoder(num_layers, hidden_size, attn_type, copy_attn, cnn_kernel_width, dropout, embeddings, copy_attn_type)[source]

Bases: onmt.decoders.decoder.DecoderBase

Decoder based on “Convolutional Sequence to Sequence Learning” [GAG+17].

Consists of residual convolutional layers, with ConvMultiStepAttention.

forward(tgt, memory_bank, step=None, **kwargs)[source]

See onmt.decoders.decoder.RNNDecoderBase.forward()

classmethod from_opt(opt, embeddings)[source]

Alternate constructor.

init_state(_, memory_bank, enc_hidden)[source]

Init decoder state.

class onmt.modules.ConvMultiStepAttention(input_size)[source]

Bases: torch.nn.modules.module.Module

Conv attention takes a key matrix, a value matrix and a query vector. Attention weights are computed from the key matrix and the query vector, then used in a weighted sum over the value matrix. The same operation is applied in each decoder conv layer.

apply_mask(mask)[source]

Apply mask.

forward(base_target_emb, input_from_dec, encoder_out_top, encoder_out_combine)[source]
Parameters
  • base_target_emb – target embedding tensor

  • input_from_dec – output of the decoder convolution

  • encoder_out_top – the key matrix for computing attention weights, i.e. the top output of the encoder convolution

  • encoder_out_combine – the value matrix for the attention-weighted sum, i.e. the combination of the base embeddings and the top encoder output

class onmt.modules.WeightNormConv2d(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, init_scale=1.0, polyak_decay=0.9995)[source]

Bases: torch.nn.modules.conv.Conv2d

forward(x, init=False)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

Architecture: SRU

class onmt.models.sru.SRU(input_size, hidden_size, num_layers=2, dropout=0, rnn_dropout=0, bidirectional=False, use_tanh=1, use_relu=0)[source]

Bases: torch.nn.modules.module.Module

Implementation of “Training RNNs as Fast as CNNs” [LZA17]

TODO: switch to PyTorch's implementation when it is available.

This implementation is adapted from the author of the paper: https://github.com/taolei87/sru/blob/master/cuda_functional.py.

Parameters
  • input_size (int) – dimension of the input to the model

  • hidden_size (int) – hidden dimension

  • num_layers (int) – number of layers

  • dropout (float) – dropout to use (stacked)

  • rnn_dropout (float) – dropout to use (recurrent)

  • bidirectional (bool) – bidirectional

  • use_tanh (bool) – activation

  • use_relu (bool) – activation

forward(input, c0=None, return_hidden=True)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
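
A single-layer SRU step written out in plain PyTorch, following the recurrence of [LZA17]; the module itself runs a fused CUDA kernel, so this is a sketch of the math, not the actual implementation:

import torch

d = 16
W, Wf, Wr = (torch.randn(d, d) for _ in range(3))
bf, br = torch.zeros(d), torch.zeros(d)

def sru_step(x_t, c_prev):
    f = torch.sigmoid(x_t @ Wf + bf)        # forget gate
    r = torch.sigmoid(x_t @ Wr + br)        # reset gate
    c = f * c_prev + (1 - f) * (x_t @ W)    # light recurrence: no matmul on c_prev
    h = r * torch.tanh(c) + (1 - r) * x_t   # highway connection
    return h, c

h, c = sru_step(torch.randn(4, d), torch.zeros(4, d))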

Alternative Encoders

onmt.encoders.AudioEncoder

class onmt.encoders.AudioEncoder(rnn_type, enc_layers, dec_layers, brnn, enc_rnn_size, dec_rnn_size, enc_pooling, dropout, sample_rate, window_size)[source]

Bases: onmt.encoders.encoder.EncoderBase

A simple encoder CNN -> RNN for audio input.

Parameters
  • rnn_type (str) – Type of RNN (e.g. GRU, LSTM, etc).

  • enc_layers (int) – Number of encoder layers.

  • dec_layers (int) – Number of decoder layers.

  • brnn (bool) – Bidirectional encoder.

  • enc_rnn_size (int) – Size of hidden states of the rnn.

  • dec_rnn_size (int) – Size of the decoder hidden states.

  • enc_pooling (str) – A comma separated list either of length 1 or of length enc_layers specifying the pooling amount.

  • dropout (float) – dropout probability.

  • sample_rate (float) – input audio sample rate

  • window_size (int) – input window size

forward(src, lengths=None)[source]

See onmt.encoders.encoder.EncoderBase.forward()

classmethod from_opt(opt, embeddings=None)[source]

Alternate constructor.

onmt.encoders.ImageEncoder

class onmt.encoders.ImageEncoder(num_layers, bidirectional, rnn_size, dropout, image_chanel_size=3)[source]

Bases: onmt.encoders.encoder.EncoderBase

A simple encoder CNN -> RNN for image src.

Parameters
  • num_layers (int) – number of encoder layers.

  • bidirectional (bool) – bidirectional encoder.

  • rnn_size (int) – size of hidden states of the rnn.

  • dropout (float) – dropout probability.

forward(src, lengths=None)[source]

See onmt.encoders.encoder.EncoderBase.forward()

classmethod from_opt(opt, embeddings=None)[source]

Alternate constructor.

load_pretrained_vectors(opt)[source]

Pass in needed options only when modifying the function definition.

Copy Attention

class onmt.modules.CopyGenerator(input_size, output_size, pad_idx)[source]

Bases: torch.nn.modules.module.Module

An implementation of pointer-generator networks [SLM17].

These networks consider copying words directly from the source sequence.

The copy generator is an extended version of the standard generator that computes three values.

  • \(p_{softmax}\) the standard softmax over tgt_dict

  • \(p(z)\) the probability of copying a word from the source

  • \(p_{copy}\) the probability of copying a particular word, taken directly from the attention distribution.

The model returns a distribution over the extended dictionary, computed as

\(p(w) = p(z=1) p_{copy}(w) + p(z=0) p_{softmax}(w)\)

[Diagram: the input feeds the softmax generator and the copy switch; the attention and src_map produce the copy distribution; the three are combined into the output.]
Parameters
  • input_size (int) – size of input representation

  • output_size (int) – size of output vocabulary

  • pad_idx (int) – padding token index

forward(hidden, attn, src_map)[source]

Compute a distribution over the target dictionary extended by the dynamic dictionary implied by copying source words.

Parameters
  • hidden (FloatTensor) – hidden outputs (batch x tlen, input_size)

  • attn (FloatTensor) – attention over the source for each output step (batch x tlen, src_len)

  • src_map (FloatTensor) – A sparse indicator matrix mapping each source word to its index in the “extended” vocab. (src_len, batch, extra_words)
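
A toy numeric sketch of the mixture \(p(w)\); the switch value, vocabulary sizes, and source-to-vocab alignment below are made up, and the real module computes \(p(z)\) from the hidden state:

import torch
import torch.nn.functional as F

vocab, extra, slen = 6, 2, 4                     # toy sizes
logits = torch.randn(1, vocab)                   # generator scores over tgt_dict
attn = F.softmax(torch.randn(1, slen), dim=-1)   # copy attention over the source
src_map = torch.zeros(slen, vocab + extra)       # src position -> extended-vocab id
src_map[torch.arange(slen), torch.tensor([2, 6, 7, 3])] = 1.0

p_z = torch.sigmoid(torch.randn(1, 1))           # p(z=1): probability of copying
p_softmax = F.softmax(logits, dim=-1)
p_copy = attn @ src_map                          # (1, vocab + extra)

p_w = torch.cat([(1 - p_z) * p_softmax,
                 torch.zeros(1, extra)], dim=-1) + p_z * p_copy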

Structured Attention

class onmt.modules.structured_attention.MatrixTree(eps=1e-05)[source]

Bases: torch.nn.modules.module.Module

Implementation of the matrix-tree theorem for computing marginals of non-projective dependency parsing. This attention layer is used in the paper “Learning Structured Text Representations” [LL17].

forward(input)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.