opennmt.inputters.text_inputter module

Define word-based embedders.

opennmt.inputters.text_inputter.visualize_embeddings(log_dir, embedding_var, vocabulary_file, num_oov_buckets=1)[source]

Registers an embedding variable for visualization in TensorBoard.

This function registers embedding_var in the projector_config.pbtxt file and generates metadata from vocabulary_file to attach a label to each word ID.

Parameters:
  • log_dir – The active log directory.
  • embedding_var – The embedding variable to visualize.
  • vocabulary_file – The associated vocabulary file.
  • num_oov_buckets – The number of additional unknown tokens.
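The metadata generation step described above can be sketched in plain Python: one label per vocabulary word, plus a placeholder label for each OOV bucket. The helper name and the OOV label format are illustrative assumptions, not the library's actual implementation (which also updates projector_config.pbtxt).

```python
def write_projector_metadata(vocabulary, num_oov_buckets=1):
    """Returns TensorBoard projector metadata labels for a vocabulary.

    Sketch only: in-vocabulary words keep their own label, and each OOV
    bucket gets a placeholder label (format assumed here).
    """
    labels = list(vocabulary)
    for i in range(num_oov_buckets):
        # With a single bucket, a plain "<unk>" label; otherwise index them.
        labels.append("<unk>%d" % i if num_oov_buckets > 1 else "<unk>")
    return labels

labels = write_projector_metadata(["the", "cat"], num_oov_buckets=1)
```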
opennmt.inputters.text_inputter.load_pretrained_embeddings(embedding_file, vocabulary_file, num_oov_buckets=1, with_header=True, case_insensitive_embeddings=True)[source]

Returns pretrained embeddings relative to the vocabulary.

The embedding_file must have the following format:

N M
word1 val1 val2 ... valM
word2 val1 val2 ... valM
...
wordN val1 val2 ... valM

or if with_header is False:

word1 val1 val2 ... valM
word2 val1 val2 ... valM
...
wordN val1 val2 ... valM

This function iterates over each embedding in embedding_file and assigns the pretrained vector to the matching word in vocabulary_file, if any. Embeddings of words not found in the vocabulary are ignored.

If case_insensitive_embeddings is True, word embeddings are assumed to be trained on lowercased data. In that case, word matching is case insensitive: the pretrained embedding for “the” will be assigned to “the”, “The”, “THE”, or any other case variant included in vocabulary_file.

Parameters:
  • embedding_file – Path to the embedding file. Entries will be matched against vocabulary_file.
  • vocabulary_file – The vocabulary file containing one word per line.
  • num_oov_buckets – The number of additional unknown tokens.
  • with_header – True if the embedding file starts with a header line like in GloVe embedding files.
  • case_insensitive_embeddings – True if embeddings are trained on lowercase data.
Returns:

A Numpy array of shape [vocabulary_size + num_oov_buckets, embedding_size].
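The matching behavior described above can be sketched in plain NumPy. The function name and defaults are illustrative, not the library implementation; the header line (when with_header is True) is assumed to have been stripped already, and unmatched rows keep a random initialization.

```python
import numpy as np

def load_embeddings_sketch(embedding_lines, vocabulary, num_oov_buckets=1,
                           case_insensitive=True):
    """Sketch: assign pretrained vectors to matching vocabulary words.

    Returns an array of shape [len(vocabulary) + num_oov_buckets,
    embedding_size]; words without a pretrained vector (and the OOV
    buckets) keep a random initialization.
    """
    pretrained = {}
    for line in embedding_lines:
        fields = line.split()
        word = fields[0].lower() if case_insensitive else fields[0]
        pretrained[word] = np.array(fields[1:], dtype=np.float32)
    embedding_size = len(next(iter(pretrained.values())))
    matrix = np.random.randn(len(vocabulary) + num_oov_buckets,
                             embedding_size).astype(np.float32)
    for i, word in enumerate(vocabulary):
        key = word.lower() if case_insensitive else word
        if key in pretrained:
            matrix[i] = pretrained[key]
    return matrix
```

With case_insensitive=True, the vocabulary entry "The" receives the pretrained vector of "the", as described above.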

opennmt.inputters.text_inputter.tokens_to_chars(tokens, padding_value='')[source]

Splits tokens into unicode characters.

Parameters:
  • tokens – A string tf.Tensor of shape \([T]\).
  • padding_value – The value to use for padding.
Returns:

The characters as a string tf.Tensor of shape \([T, W]\) and the length of each token as an int64 tf.Tensor of shape \([T]\).
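The \([T] \to [T, W]\) shape transformation can be sketched with Python lists; this is a plain-Python illustration of the padding behavior, not the tf.Tensor implementation.

```python
def tokens_to_chars_sketch(tokens, padding_value=""):
    """Splits each token into characters and pads rows to equal width W,
    mirroring the [T, W] output and per-token lengths described above."""
    chars = [list(token) for token in tokens]
    lengths = [len(c) for c in chars]
    max_width = max(lengths) if lengths else 0
    # Pad shorter tokens on the right with padding_value.
    padded = [c + [padding_value] * (max_width - len(c)) for c in chars]
    return padded, lengths
```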

class opennmt.inputters.text_inputter.TextInputter(tokenizer=None, dtype=tf.float32, vocabulary_file_key=None, num_oov_buckets=1)[source]

Bases: opennmt.inputters.inputter.Inputter

An abstract inputter that processes text.

initialize(metadata, asset_dir=None, asset_prefix='')[source]

Initializes the inputter.

Parameters:
  • metadata – A dictionary containing additional metadata set by the user.
  • asset_dir – The directory where assets can be written. If None, no assets are returned.
  • asset_prefix – The prefix to attach to asset filenames.
Returns:

A dictionary containing additional assets used by the inputter.

export_assets(asset_dir, asset_prefix='')[source]

Exports assets used by this inputter.

Parameters:
  • asset_dir – The directory where assets can be written.
  • asset_prefix – The prefix to attach to asset filenames.
Returns:

A dictionary containing additional assets used by the inputter.

vocabulary_lookup()[source]

Returns a lookup table mapping string to index.

vocabulary_lookup_reverse()[source]

Returns a lookup table mapping index to string.
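The pair of lookups can be sketched with plain dictionaries: in-vocabulary words map to their line index, and unknown words hash into one of the num_oov_buckets extra ids appended after the vocabulary. The helper and the "<unk>" reverse label are assumptions for illustration, not the tf.lookup-based implementation.

```python
def make_lookup_tables(vocabulary, num_oov_buckets=1):
    """Sketch of the forward (string -> index) and reverse
    (index -> string) lookups used by this inputter."""
    word_to_id = {word: i for i, word in enumerate(vocabulary)}
    id_to_word = {i: word for word, i in word_to_id.items()}

    def lookup(word):
        if word in word_to_id:
            return word_to_id[word]
        # Unknown words fall into one of the OOV buckets after the vocab.
        return len(vocabulary) + hash(word) % num_oov_buckets

    def reverse_lookup(index):
        return id_to_word.get(index, "<unk>")

    return lookup, reverse_lookup
```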

make_dataset(data_file, training=None)[source]

Creates the base dataset required by this inputter.

Parameters:
  • data_file – The data file.
  • training – Run in training mode.
Returns:

A tf.data.Dataset.

get_dataset_size(data_file)[source]

Returns the size of the dataset.

Parameters:data_file – The data file.
Returns:The total size.
make_features(element=None, features=None, training=None)[source]

Tokenizes raw text.

class opennmt.inputters.text_inputter.WordEmbedder(vocabulary_file_key, embedding_size=None, embedding_file_key=None, embedding_file_with_header=True, case_insensitive_embeddings=True, trainable=True, dropout=0.0, tokenizer=None, dtype=tf.float32)[source]

Bases: opennmt.inputters.text_inputter.TextInputter

Simple word embedder.

__init__(vocabulary_file_key, embedding_size=None, embedding_file_key=None, embedding_file_with_header=True, case_insensitive_embeddings=True, trainable=True, dropout=0.0, tokenizer=None, dtype=tf.float32)[source]

Initializes the parameters of the word embedder.

Parameters:
  • vocabulary_file_key – The data configuration key of the vocabulary file containing one word per line.
  • embedding_size – The size of the resulting embedding. If None, an embedding file must be provided.
  • embedding_file_key – The data configuration key of the embedding file.
  • embedding_file_with_header – True if the embedding file starts with a header line like in GloVe embedding files.
  • case_insensitive_embeddings – True if embeddings are trained on lowercase data.
  • trainable – If False, do not optimize embeddings.
  • dropout – The probability to drop units in the embedding.
  • tokenizer – An optional opennmt.tokenizers.tokenizer.Tokenizer to tokenize the input text. Defaults to a space tokenization.
  • dtype – The embedding type.
Raises:

ValueError – if neither embedding_size nor embedding_file_key are set.

See also

The opennmt.inputters.text_inputter.load_pretrained_embeddings() function for details about the pretrained embedding format and behavior.

initialize(metadata, asset_dir=None, asset_prefix='')[source]

Initializes the inputter.

Parameters:
  • metadata – A dictionary containing additional metadata set by the user.
  • asset_dir – The directory where assets can be written. If None, no assets are returned.
  • asset_prefix – The prefix to attach to asset filenames.
Returns:

A dictionary containing additional assets used by the inputter.

get_receiver_tensors()[source]

Returns the input placeholders for serving.

make_features(element=None, features=None, training=None)[source]

Converts word tokens to ids.

build(input_shape=None)[source]

Creates the variables of the layer (optional, for subclass implementers).

This is a method that implementers of subclasses of Layer or Model can override if they need a state-creation step in-between layer instantiation and layer call.

This is typically used to create the weights of Layer subclasses.

Parameters:input_shape – Instance of TensorShape, or list of instances of TensorShape if the layer expects a list of inputs (one instance per input).
make_inputs(features, training=None)[source]

Creates the model input from the features.

Parameters:
  • features – A dictionary of tf.Tensor.
  • training – Run in training mode.
Returns:

The model input.
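The lookup can be sketched in NumPy: word ids index rows of the embedding matrix, and dropout (with inverted scaling) applies only in training mode. This is an illustrative sketch of the described behavior, not the class's TensorFlow implementation.

```python
import numpy as np

def word_embedder_inputs_sketch(ids, embedding_matrix, dropout=0.0,
                                training=False):
    """Sketch: gather embedding rows by word id, then optionally apply
    (inverted) dropout when training."""
    inputs = embedding_matrix[ids]
    if training and dropout > 0.0:
        # Zero out units with probability `dropout`, rescale the rest.
        keep = np.random.rand(*inputs.shape) >= dropout
        inputs = inputs * keep / (1.0 - dropout)
    return inputs
```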

visualize(log_dir)[source]

Visualizes the transformation, usually embeddings.

Parameters:log_dir – The active log directory.
class opennmt.inputters.text_inputter.CharEmbedder(vocabulary_file_key, embedding_size, dropout=0.0, tokenizer=None, dtype=tf.float32)[source]

Bases: opennmt.inputters.text_inputter.TextInputter

Base class for character-aware inputters.

__init__(vocabulary_file_key, embedding_size, dropout=0.0, tokenizer=None, dtype=tf.float32)[source]

Initializes the parameters of the character embedder.

Parameters:
  • vocabulary_file_key – The meta configuration key of the vocabulary file containing one character per line.
  • embedding_size – The size of the character embedding.
  • dropout – The probability to drop units in the embedding.
  • tokenizer – An optional opennmt.tokenizers.tokenizer.Tokenizer to tokenize the input text. Defaults to a space tokenization.
  • dtype – The embedding type.
get_receiver_tensors()[source]

Returns the input placeholders for serving.

make_features(element=None, features=None, training=None)[source]

Converts words to characters.

build(input_shape=None)[source]

Creates the variables of the layer (optional, for subclass implementers).

This is a method that implementers of subclasses of Layer or Model can override if they need a state-creation step in-between layer instantiation and layer call.

This is typically used to create the weights of Layer subclasses.

Parameters:input_shape – Instance of TensorShape, or list of instances of TensorShape if the layer expects a list of inputs (one instance per input).
make_inputs(features, training=None)[source]

Creates the model input from the features.

Parameters:
  • features – A dictionary of tf.Tensor.
  • training – Run in training mode.
Returns:

The model input.

visualize(log_dir)[source]

Visualizes the transformation, usually embeddings.

Parameters:log_dir – The active log directory.
class opennmt.inputters.text_inputter.CharConvEmbedder(vocabulary_file_key, embedding_size, num_outputs, kernel_size=5, stride=3, dropout=0.0, tokenizer=None, dtype=tf.float32)[source]

Bases: opennmt.inputters.text_inputter.CharEmbedder

An inputter that applies a convolution on character embeddings.

__init__(vocabulary_file_key, embedding_size, num_outputs, kernel_size=5, stride=3, dropout=0.0, tokenizer=None, dtype=tf.float32)[source]

Initializes the parameters of the character convolution embedder.

Parameters:
  • vocabulary_file_key – The meta configuration key of the vocabulary file containing one character per line.
  • embedding_size – The size of the character embedding.
  • num_outputs – The dimension of the convolution output space.
  • kernel_size – Length of the convolution window.
  • stride – Length of the convolution stride.
  • dropout – The probability to drop units in the embedding.
  • tokenizer – An optional opennmt.tokenizers.tokenizer.Tokenizer to tokenize the input text. Defaults to a space tokenization.
  • dtype – The embedding type.
make_inputs(features, training=None)[source]

Creates the model input from the features.

Parameters:
  • features – A dictionary of tf.Tensor.
  • training – Run in training mode.
Returns:

The model input.
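One common realization of this design, sketched in NumPy below, is a strided 1D convolution over the character axis followed by max-pooling over time, yielding one num_outputs-sized vector per token. This mirrors the usual char-CNN pattern; the class's exact computation may differ.

```python
import numpy as np

def char_conv_sketch(char_embeddings, weights, kernel_size=5, stride=3):
    """Sketch of a strided 1D convolution over characters + max-pool.

    char_embeddings: [W, embedding_size] for one token.
    weights: [kernel_size * embedding_size, num_outputs].
    Returns a [num_outputs] vector for the token.
    """
    W, _ = char_embeddings.shape
    windows = []
    for start in range(0, W - kernel_size + 1, stride):
        # Flatten each window and project it to the output space.
        patch = char_embeddings[start:start + kernel_size].reshape(-1)
        windows.append(patch @ weights)
    # Max over time reduces the variable-length character axis.
    return np.max(windows, axis=0)
```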

class opennmt.inputters.text_inputter.CharRNNEmbedder(vocabulary_file_key, embedding_size, num_units, dropout=0.2, encoding='average', cell_class=None, tokenizer=None, dtype=tf.float32)[source]

Bases: opennmt.inputters.text_inputter.CharEmbedder

An inputter that runs a single RNN layer over character embeddings.

__init__(vocabulary_file_key, embedding_size, num_units, dropout=0.2, encoding='average', cell_class=None, tokenizer=None, dtype=tf.float32)[source]

Initializes the parameters of the character RNN embedder.

Parameters:
  • vocabulary_file_key – The meta configuration key of the vocabulary file containing one character per line.
  • embedding_size – The size of the character embedding.
  • num_units – The number of units in the RNN layer.
  • dropout – The probability to drop units in the embedding and the RNN outputs.
  • encoding – “average” or “last” (case insensitive), the encoding vector to extract from the RNN outputs.
  • cell_class – The inner cell class or a callable taking num_units as argument and returning a cell. Defaults to an LSTM cell.
  • tokenizer – An optional opennmt.tokenizers.tokenizer.Tokenizer to tokenize the input text. Defaults to a space tokenization.
  • dtype – The embedding type.
Raises:

ValueError – if encoding is invalid.

make_inputs(features, training=None)[source]

Creates the model input from the features.

Parameters:
  • features – A dictionary of tf.Tensor.
  • training – Run in training mode.
Returns:

The model input.
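The encoding option can be sketched in NumPy: per-character RNN outputs of shape [W, num_units] are reduced to a single vector per token, either by averaging over the valid time steps or by taking the last valid output. The helper below is an illustration of that reduction, not the class's RNN implementation.

```python
import numpy as np

def extract_encoding(rnn_outputs, length, encoding="average"):
    """Sketch of the `encoding` option: reduce [W, num_units] RNN outputs
    to one [num_units] vector, ignoring padded time steps past `length`."""
    encoding = encoding.lower()  # "average"/"last", case insensitive
    valid = rnn_outputs[:length]
    if encoding == "average":
        return valid.mean(axis=0)
    if encoding == "last":
        return valid[-1]
    raise ValueError("invalid encoding: %s" % encoding)
```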