opennmt.utils.vocab module

Vocabulary utilities for Python scripts.

class opennmt.utils.vocab.Vocab(special_tokens=None, from_file=None, from_format='default')

Bases: object

Vocabulary class.

__init__(special_tokens=None, from_file=None, from_format='default')

Initializes a vocabulary.

Parameters:
  • special_tokens – A list of special tokens (e.g. start of sentence).
  • from_file – Optionally initialize from an existing saved vocabulary.
  • from_format – The format of the saved vocabulary given in from_file. Can be: default, sentencepiece. “default” is simply one token per line.
Raises:

ValueError – if from_format is invalid.

size

Returns the number of entries of the vocabulary.

words

Returns the list of words.

__len__()

Returns the number of entries of the vocabulary.

__contains__(token)

Returns True if the vocabulary contains token.

add_from_text(filename, tokenizer=None)

Fills the vocabulary from a text file.

Parameters:
  • filename – The file to load from.
  • tokenizer – A callable to tokenize a line of text.
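
As a rough illustration of what filling a vocabulary from text involves, the sketch below counts token frequencies line by line. It is not the library's implementation: count_tokens is a hypothetical helper, and falling back to whitespace splitting when no tokenizer is given is an assumption made for this example.

```python
import collections
import os
import tempfile


def count_tokens(filename, tokenizer=None):
    """Counts token frequencies in a text file (hypothetical helper)."""
    # Assumption: without a tokenizer, each line is split on whitespace.
    if tokenizer is None:
        tokenizer = str.split
    counts = collections.Counter()
    with open(filename, encoding="utf-8") as text_file:
        for line in text_file:
            # The tokenizer is any callable mapping a line to a list of tokens.
            counts.update(tokenizer(line.strip()))
    return counts


# Write a tiny corpus to a temporary file and count its tokens.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("hello world\nhello there\n")
    corpus_path = tmp.name

word_counts = count_tokens(corpus_path)
# A custom callable changes the unit being counted, here single characters.
char_counts = count_tokens(corpus_path, tokenizer=list)
os.remove(corpus_path)
```

Any callable that maps a line to a list of tokens can be passed as tokenizer, which is why the same helper can count words or characters.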

serialize(path)

Writes the vocabulary to disk.

Parameters:
  • path – The path where the vocabulary will be saved.

load(path, file_format='default')

Loads a serialized vocabulary.

Parameters:
  • path – The path to the vocabulary to load.
  • file_format – The format of the vocabulary file. Can be: default, sentencepiece. “default” is simply one token per line.
Raises:

ValueError – if file_format is invalid.
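
To make the “default” format concrete, the sketch below writes and reads a one-token-per-line file with plain Python. write_tokens and read_tokens are hypothetical helpers for illustration only, and the token strings are placeholders; the sentencepiece format is not covered here.

```python
import os
import tempfile


def write_tokens(path, tokens):
    """Writes one token per line (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as vocab_file:
        for token in tokens:
            vocab_file.write(token + "\n")


def read_tokens(path):
    """Reads a one-token-per-line file back, preserving order."""
    with open(path, encoding="utf-8") as vocab_file:
        return [line.rstrip("\r\n") for line in vocab_file]


# Placeholder tokens; the line number of a token is its index.
tokens = ["<s>", "</s>", "hello", "world"]
vocab_path = os.path.join(tempfile.mkdtemp(), "vocab.txt")
write_tokens(vocab_path, tokens)
restored = read_tokens(vocab_path)
```

Because the file carries nothing but tokens, the order of the lines is what determines each token's index in this sketch.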

add(token)[source]

Adds a token or increases its frequency.

Parameters:
  • token – The string to add.

lookup(identifier, default=None)

Looks up an entry in the vocabulary.

Parameters:
  • identifier – A string or an index to look up.
  • default – The value to return if identifier is not found.
  • default – The value to return if identifier is not found.
Returns:

The value associated with identifier or default.
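
The two-way lookup can be pictured with a small stand-in class. TinyVocab below is illustrative and is not the library's Vocab; it assumes that a string maps to its index and an index maps back to its string, and it leaves out frequency counting.

```python
class TinyVocab:
    """Illustrative stand-in, not the library's Vocab class."""

    def __init__(self):
        self._token_to_id = {}
        self._id_to_token = []

    def add(self, token):
        # Only new tokens get an index; frequency counting is omitted here.
        if token not in self._token_to_id:
            self._token_to_id[token] = len(self._id_to_token)
            self._id_to_token.append(token)

    def lookup(self, identifier, default=None):
        # A string is mapped to its index.
        if isinstance(identifier, str):
            return self._token_to_id.get(identifier, default)
        # An integer index is mapped back to its string.
        if 0 <= identifier < len(self._id_to_token):
            return self._id_to_token[identifier]
        return default


vocab = TinyVocab()
for word in ["hello", "world"]:
    vocab.add(word)

index = vocab.lookup("world")         # 1
token = vocab.lookup(0)               # "hello"
missing = vocab.lookup("unseen", -1)  # -1, the supplied default
```

Passing an explicit default avoids a None check at the call site when an unknown identifier is expected.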

prune(max_size=0, min_frequency=1)

Creates a pruned version of the vocabulary.

Parameters:
  • max_size – The maximum vocabulary size.
  • min_frequency – The minimum frequency of each entry.
Returns:

A new vocabulary.
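
The effect of the two pruning parameters can be sketched on a plain frequency table. prune_counts is a hypothetical helper, not the library's code; it assumes that max_size=0 means no size limit and it does not model special tokens.

```python
import collections


def prune_counts(counts, max_size=0, min_frequency=1):
    """Returns a new Counter without rare or excess tokens (hypothetical helper)."""
    # Keep only tokens seen at least min_frequency times, most frequent first.
    kept = [
        (token, freq)
        for token, freq in counts.most_common()
        if freq >= min_frequency
    ]
    # Assumption: max_size=0 means "no size limit".
    if max_size > 0:
        kept = kept[:max_size]
    # A new object is returned; the input Counter is left untouched.
    return collections.Counter(dict(kept))


counts = collections.Counter({"the": 5, "cat": 3, "sat": 1, "mat": 1})
frequent = prune_counts(counts, min_frequency=2)
top_one = prune_counts(counts, max_size=1)
```

Like prune as documented above, the helper returns a new object rather than modifying its input.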

pad_to_multiple(multiple, num_oov_buckets=1)

Pads the vocabulary size to a multiple value.

More specifically, this method ensures that:

(vocab_size + num_oov_buckets) % multiple == 0

Parameters:
  • multiple – The multiple value.
  • num_oov_buckets – The number of OOV buckets added during the training. Usually just 1, for the unknown (out-of-vocabulary) token.
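
The arithmetic behind the padding condition can be checked with a few lines of Python. padding_needed is a hypothetical helper that only computes how many filler entries would be required; how the library actually fills those entries is not shown here.

```python
def padding_needed(vocab_size, multiple, num_oov_buckets=1):
    """Returns how many filler entries satisfy the condition (hypothetical helper)."""
    total = vocab_size + num_oov_buckets
    # Python's % is non-negative for a positive modulus, so the result
    # is 0 when total is already a multiple.
    return (-total) % multiple


vocab_size = 32000
extra = padding_needed(vocab_size, multiple=8)  # 7
padded_size = vocab_size + extra                # 32007
```

With one OOV bucket, 32000 entries are padded to 32007 so that 32007 + 1 = 32008 is divisible by 8.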