opennmt.tokenizers.tokenizer module

Define base tokenizers.

class opennmt.tokenizers.tokenizer.Tokenizer(configuration_file_or_key=None, params=None)[source]

Bases: object

Base class for tokenizers.

__init__(configuration_file_or_key=None, params=None)[source]

Initializes the tokenizer.

Parameters:configuration_file_or_key – The YAML configuration file or the key to the YAML configuration file.
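A minimal usage sketch of both construction modes, using the SpaceTokenizer subclass documented below (the file path and configuration key are hypothetical):

    from opennmt.tokenizers.tokenizer import SpaceTokenizer

    # Configure from a YAML file directly (hypothetical path).
    tokenizer = SpaceTokenizer(configuration_file_or_key="tokenization.yml")

    # Or pass a key that is resolved against the user metadata when
    # initialize() is called (hypothetical key name).
    tokenizer = SpaceTokenizer(configuration_file_or_key="source_tokenization")
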
initialize(metadata, asset_dir=None, asset_prefix='')[source]

Initializes the tokenizer (e.g. loads BPE models).

Parameters:
  • metadata – A dictionary containing additional metadata set by the user.
  • asset_dir – The directory where assets can be written. If None, no assets are returned.
  • asset_prefix – The prefix to attach to asset filenames.
Returns:

A dictionary containing additional assets used by the tokenizer.
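
A sketch of the initialization flow, assuming a tokenizer constructed with the hypothetical key "source_tokenization" that is looked up in the user metadata (the configuration value is also hypothetical):

    from opennmt.tokenizers.tokenizer import SpaceTokenizer

    tokenizer = SpaceTokenizer(configuration_file_or_key="source_tokenization")

    # The metadata value below is a hypothetical inline configuration.
    assets = tokenizer.initialize(
        {"source_tokenization": {"mode": "space"}},
        asset_dir="export/assets",
        asset_prefix="source_")

    # Tokenizers without external resources typically return an empty dict.
    print(assets)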

export_assets(asset_dir, asset_prefix='')[source]

Exports assets for this tokenizer.

Parameters:
  • asset_dir – The directory where assets can be written.
  • asset_prefix – The prefix to attach to asset filenames.
Returns:

A dictionary containing additional assets used by the tokenizer.
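
For example, to write the tokenizer resources under a directory and inspect the returned mapping (the directory and prefix are illustrative):

    assets = tokenizer.export_assets("export/assets", asset_prefix="source_")
    for name, path in assets.items():
        print(name, path)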

tokenize_stream(input_stream=sys.stdin, output_stream=sys.stdout, delimiter=' ')[source]

Tokenizes a stream of sentences.

Parameters:
  • input_stream – The input stream.
  • output_stream – The output stream.
  • delimiter – The token delimiter to use for text serialization.
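A minimal sketch that tokenizes a text file line by line (the file names are hypothetical; by default the method reads from standard input and writes to standard output):

    from opennmt.tokenizers.tokenizer import SpaceTokenizer

    tokenizer = SpaceTokenizer()
    with open("input.txt") as input_stream, \
         open("output.tok", "w") as output_stream:
        tokenizer.tokenize_stream(
            input_stream=input_stream,
            output_stream=output_stream,
            delimiter=" ")
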
detokenize_stream(input_stream=sys.stdin, output_stream=sys.stdout, delimiter=' ')[source]

Detokenizes a stream of sentences.

Parameters:
  • input_stream – The input stream.
  • output_stream – The output stream.
  • delimiter – The token delimiter used for text serialization.
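And the reverse direction, merging the tokens written above back into plain sentences with the same delimiter:

    with open("output.tok") as input_stream, \
         open("roundtrip.txt", "w") as output_stream:
        tokenizer.detokenize_stream(
            input_stream=input_stream,
            output_stream=output_stream,
            delimiter=" ")
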
tokenize(text)[source]

Tokenizes text.

Parameters:text – The text to tokenize as a tf.Tensor or Python string.
Returns:A 1-D string tf.Tensor if text is a tf.Tensor, or a list of Python unicode strings otherwise.
Raises:ValueError – if the rank of text is greater than 0.
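For example, with the SpaceTokenizer documented below (the tensor call returns a symbolic tensor in graph mode; the values shown assume eager evaluation):

    import tensorflow as tf
    from opennmt.tokenizers.tokenizer import SpaceTokenizer

    tokenizer = SpaceTokenizer()

    # Python string in, list of Python unicode strings out.
    tokens = tokenizer.tokenize("Hello world !")
    # tokens == ["Hello", "world", "!"]

    # 0-D string tensor in, 1-D string tensor out.
    tokens_t = tokenizer.tokenize(tf.constant("Hello world !"))
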
detokenize(tokens, sequence_length=None)[source]

Detokenizes tokens.

The Tensor version supports batches of tokens.

Parameters:
  • tokens – The tokens as a 1-D or 2-D tf.Tensor or list of Python strings.
  • sequence_length – The length of each sequence. Required if tokens is a tf.Tensor.
Returns:

A 0-D or 1-D string tf.Tensor if tokens is a tf.Tensor, or a Python unicode string otherwise.

Raises:
  • ValueError – if the rank of tokens is greater than 2.
  • ValueError – if tokens is a 2-D tf.Tensor and sequence_length is not set.
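For example (the empty-string padding in the batched case is an assumption for illustration):

    # List of Python strings in, a single unicode string out.
    text = tokenizer.detokenize(["Hello", "world", "!"])
    # text == "Hello world !"

    # Batched tensor version: sequence_length gives the number of valid
    # tokens in each row.
    tokens = tf.constant([["Hello", "world", "!"],
                          ["Goodbye", "", ""]])
    lengths = tf.constant([3, 1])
    texts = tokenizer.detokenize(tokens, sequence_length=lengths)
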
class opennmt.tokenizers.tokenizer.SpaceTokenizer(configuration_file_or_key=None, params=None)[source]

Bases: opennmt.tokenizers.tokenizer.Tokenizer

A tokenizer that splits on spaces.

class opennmt.tokenizers.tokenizer.CharacterTokenizer(configuration_file_or_key=None, params=None)[source]

Bases: opennmt.tokenizers.tokenizer.Tokenizer

A tokenizer that splits text into unicode characters.
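
A short round-trip sketch (the exact whitespace handling is implementation-defined; spaces may be mapped to a placeholder token so that detokenize can restore them):

    from opennmt.tokenizers.tokenizer import CharacterTokenizer

    tokenizer = CharacterTokenizer()
    tokens = tokenizer.tokenize("ab c")  # one token per character
    text = tokenizer.detokenize(tokens)  # "ab c"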