opennmt.utils.data module

Functions for reading data.

opennmt.utils.data.get_padded_shapes(dataset)[source]

Returns the padded shapes for tf.data.Dataset.padded_batch.

Parameters: dataset – The dataset that will be batched with padding.
Returns: The same structure as dataset.output_shapes containing the padded shapes.
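
For example, a minimal sketch with toy data (the dataset below is hypothetical) that combines this function with tf.data.Dataset.padded_batch:

    import tensorflow as tf

    from opennmt.utils.data import get_padded_shapes

    # Toy dataset of variable-length integer sequences.
    dataset = tf.data.Dataset.from_generator(
        lambda: [[1, 2, 3], [4, 5]],
        output_types=tf.int64,
        output_shapes=tf.TensorShape([None]))

    # get_padded_shapes mirrors dataset.output_shapes, so padded_batch
    # pads each dynamic dimension to the longest sequence in the batch.
    dataset = dataset.padded_batch(2, padded_shapes=get_padded_shapes(dataset))
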
opennmt.utils.data.filter_irregular_batches(multiple)[source]

Transformation that filters out batches whose size is not a multiple of multiple.

Parameters: multiple – The divisor of the batch size.
Returns: A tf.data.Dataset transformation.
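
For example, a minimal sketch with toy data that drops the final partial batch so every batch can be split evenly (e.g. across replicas):

    import tensorflow as tf

    from opennmt.utils.data import filter_irregular_batches

    # 10 examples batched by 4 produce batch sizes 4, 4, and 2.
    dataset = tf.data.Dataset.range(10).batch(4)

    # Keep only batches whose size is a multiple of 4: the trailing
    # partial batch of 2 examples is filtered out.
    dataset = dataset.apply(filter_irregular_batches(4))
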
opennmt.utils.data.prefetch_element(buffer_size=None)[source]

Transformation that prefetches elements from the dataset.

This is a small wrapper around tf.data.Dataset.prefetch to customize the case where buffer_size is None, whose behavior depends on the TensorFlow version.

Parameters: buffer_size – The number of batches to prefetch asynchronously. If None, use an automatically tuned value on TensorFlow 1.8+ and 1 on older versions.
Returns: A tf.data.Dataset transformation.
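
For example, a minimal sketch with toy data:

    import tensorflow as tf

    from opennmt.utils.data import prefetch_element

    dataset = tf.data.Dataset.range(100).batch(10)

    # With buffer_size=None, the prefetch buffer is automatically tuned
    # on TensorFlow 1.8+ and set to 1 batch on older versions.
    dataset = dataset.apply(prefetch_element(buffer_size=None))
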
opennmt.utils.data.filter_examples_by_length(maximum_features_length=None, maximum_labels_length=None, features_length_fn=None, labels_length_fn=None)[source]

Transformation that constrains the length of examples.

Parameters:
  • maximum_features_length – The maximum length or list of maximum lengths of the features sequence(s). None to not constrain the length.
  • maximum_labels_length – The maximum length of the labels sequence. None to not constrain the length.
  • features_length_fn – A callable mapping features to a sequence length.
  • labels_length_fn – A callable mapping labels to a sequence length.
Returns:

A tf.data.Dataset transformation.
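For example, a minimal sketch assuming a parallel dataset of (features, labels) pairs where each side stores its sequence length under a hypothetical "length" key:

    import tensorflow as tf

    from opennmt.utils.data import filter_examples_by_length

    # Toy parallel dataset of (features, labels) pairs.
    features_dataset = tf.data.Dataset.from_tensor_slices({"length": [3, 60, 12]})
    labels_dataset = tf.data.Dataset.from_tensor_slices({"length": [5, 58, 9]})
    dataset = tf.data.Dataset.zip((features_dataset, labels_dataset))

    # Keep only examples whose source and target are at most 50 tokens long.
    dataset = dataset.apply(filter_examples_by_length(
        maximum_features_length=50,
        maximum_labels_length=50,
        features_length_fn=lambda features: features["length"],
        labels_length_fn=lambda labels: labels["length"]))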

opennmt.utils.data.random_shard(shard_size, dataset_size)[source]

Transformation that shards the dataset in a random order.

Parameters:
  • shard_size – The number of examples in each shard.
  • dataset_size – The total number of examples in the dataset.
Returns:

A tf.data.Dataset transformation.
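For example, a minimal sketch with toy data:

    import tensorflow as tf

    from opennmt.utils.data import random_shard

    # Read 1000 examples as 10 shards of 100, visiting the shards in a
    # random order.
    dataset = tf.data.Dataset.range(1000)
    dataset = dataset.apply(random_shard(100, 1000))

Combined with a shuffle buffer smaller than the dataset, this can approximate a full shuffle without loading every example in memory.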

opennmt.utils.data.batch_dataset(batch_size, padded_shapes=None)[source]

Transformation that batches a dataset.

Parameters:
  • batch_size – The batch size.
  • padded_shapes – The padded shapes for this dataset. If None, the shapes are automatically inferred from the dataset output shapes.
Returns:

A tf.data.Dataset transformation.

opennmt.utils.data.batch_parallel_dataset(batch_size, batch_type='examples', batch_multiplier=1, bucket_width=None, padded_shapes=None, features_length_fn=None, labels_length_fn=None, batch_size_multiple=1)[source]

Transformation that batches a parallel dataset.

This implements an example-based and a token-based batching strategy with optional bucketing of sequences.

Bucketing makes batches contain sequences of similar lengths, which improves training efficiency. For example, if bucket_width is 5, sequences will be organized by length:

1 - 5 | 6 - 10 | 11 - 15 | …

where the assigned length is the maximum of the source and target lengths. Each batch then only contains sequences from the same bucket.

Parameters:
  • batch_size – The batch size.
  • batch_type – The training batching strategy to use: can be “examples” or “tokens”.
  • batch_multiplier – The batch size multiplier to prepare splitting across replicated graph parts.
  • bucket_width – The sequence length bucket width.
  • padded_shapes – The padded shapes for this dataset. If None, the shapes are automatically inferred from the dataset output shapes.
  • features_length_fn – A callable mapping features to a sequence length.
  • labels_length_fn – A callable mapping labels to a sequence length.
  • batch_size_multiple – When batch_type is “tokens”, ensure that the resulting batch size is a multiple of this value.
Returns:

A tf.data.Dataset transformation.

Raises:

ValueError – if batch_type is not one of “examples” or “tokens”.
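For example, a minimal sketch reusing the hypothetical (features, labels) structure from filter_examples_by_length above:

    import tensorflow as tf

    from opennmt.utils.data import batch_parallel_dataset

    # Toy parallel dataset storing sequence lengths under a "length" key.
    features_dataset = tf.data.Dataset.from_tensor_slices({"length": [3, 7, 12, 13]})
    labels_dataset = tf.data.Dataset.from_tensor_slices({"length": [4, 6, 11, 14]})
    dataset = tf.data.Dataset.zip((features_dataset, labels_dataset))

    # Build batches of about 1024 tokens, bucketing sequences into
    # length ranges of width 5 (1-5, 6-10, 11-15, ...).
    dataset = dataset.apply(batch_parallel_dataset(
        1024,
        batch_type="tokens",
        bucket_width=5,
        features_length_fn=lambda features: features["length"],
        labels_length_fn=lambda labels: labels["length"]))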

opennmt.utils.data.training_pipeline(dataset, batch_size, batch_type='examples', batch_multiplier=1, bucket_width=None, single_pass=False, process_fn=None, num_threads=None, shuffle_buffer_size=None, prefetch_buffer_size=None, dataset_size=None, maximum_features_length=None, maximum_labels_length=None, features_length_fn=None, labels_length_fn=None, batch_size_multiple=1, num_shards=1, shard_index=0)[source]

Defines a complete training data pipeline.

Parameters:
  • dataset – The base dataset.
  • batch_size – The batch size to use.
  • batch_type – The training batching strategy to use: can be “examples” or “tokens”.
  • batch_multiplier – The batch size multiplier to prepare splitting across replicated graph parts.
  • bucket_width – The width of the length buckets to select batch candidates from. None to not constrain batch formation.
  • single_pass – If True, makes a single pass over the training data.
  • process_fn – The processing function to apply on each element.
  • num_threads – The number of elements processed in parallel.
  • shuffle_buffer_size – The number of elements from which to sample.
  • prefetch_buffer_size – The number of batches to prefetch asynchronously. If None, use an automatically tuned value on TensorFlow 1.8+ and 1 on older versions.
  • dataset_size – The total size of the dataset, if known. It is recommended to set it when shuffle_buffer_size is smaller than the dataset size (or the shard size when sharding is configured).
  • maximum_features_length – The maximum length or list of maximum lengths of the features sequence(s). None to not constrain the length.
  • maximum_labels_length – The maximum length of the labels sequence. None to not constrain the length.
  • features_length_fn – A callable mapping features to a sequence length.
  • labels_length_fn – A callable mapping labels to a sequence length.
  • batch_size_multiple – When batch_type is “tokens”, ensure that the resulting batch size is a multiple of this value.
  • num_shards – The number of data shards (usually the number of workers in a distributed setting).
  • shard_index – The shard index this data pipeline should read from.
Returns:

A tf.data.Dataset.
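For example, a minimal sketch assuming the same hypothetical (features, labels) structure as above:

    import tensorflow as tf

    from opennmt.utils.data import training_pipeline

    # Toy parallel dataset storing sequence lengths under a "length" key.
    features_dataset = tf.data.Dataset.from_tensor_slices({"length": [3, 7, 12]})
    labels_dataset = tf.data.Dataset.from_tensor_slices({"length": [4, 6, 11]})
    dataset = tf.data.Dataset.zip((features_dataset, labels_dataset))

    # Shuffle, filter overlong examples, batch by tokens with bucketing,
    # and prefetch, in a single call.
    dataset = training_pipeline(
        dataset,
        1024,
        batch_type="tokens",
        bucket_width=5,
        shuffle_buffer_size=100000,
        maximum_features_length=100,
        maximum_labels_length=100,
        features_length_fn=lambda features: features["length"],
        labels_length_fn=lambda labels: labels["length"])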

opennmt.utils.data.inference_pipeline(dataset, batch_size, process_fn=None, num_threads=None, prefetch_buffer_size=None, bucket_width=None, length_fn=None)[source]

Defines a complete inference data pipeline.

Parameters:
  • dataset – The base dataset.
  • batch_size – The batch size to use.
  • process_fn – The processing function to apply on each element.
  • num_threads – The number of elements processed in parallel.
  • prefetch_buffer_size – The number of batches to prefetch asynchronously. If None, use an automatically tuned value on TensorFlow 1.8+ and 1 on older versions.
  • bucket_width – The width of the length buckets to select batch candidates from. If set, the examples will be reordered based on their length, and the application is then responsible for restoring the predictions in their original order. An “index” key will be inserted in the examples dict.
  • length_fn – A callable mapping features to a sequence length.
Returns:

A tf.data.Dataset.
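
For example, a minimal sketch assuming a features dataset that stores its sequence length under a hypothetical "length" key:

    import tensorflow as tf

    from opennmt.utils.data import inference_pipeline

    # Toy features dataset storing sequence lengths.
    dataset = tf.data.Dataset.from_tensor_slices({"length": [12, 3, 7]})

    # Bucket by length to speed up decoding. Because examples are
    # reordered, each example receives an "index" key that can be used
    # to restore the original order of the predictions.
    dataset = inference_pipeline(
        dataset,
        32,
        bucket_width=2,
        length_fn=lambda features: features["length"])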