# opennmt.inputters.inputter module¶

Define generic inputters.

class opennmt.inputters.inputter.Inputter(dtype=tf.float32)[source]

Bases: tensorflow.python.keras.engine.base_layer.Layer

Base class for inputters.

num_outputs

The number of parallel outputs produced by this inputter.

initialize(metadata, asset_dir=None, asset_prefix='')[source]

Initializes the inputter.

Parameters:
• metadata – A dictionary containing additional metadata set by the user.
• asset_dir – The directory where assets can be written. If None, no assets are returned.
• asset_prefix – The prefix to attach to asset filenames.

Returns: A dictionary containing additional assets used by the inputter.
export_assets(asset_dir, asset_prefix='')[source]

Exports assets used by this inputter.

Parameters:
• asset_dir – The directory where assets can be written.
• asset_prefix – The prefix to attach to asset filenames.

Returns: A dictionary containing additional assets used by the inputter.
make_dataset(data_file, training=None)[source]

Creates the base dataset required by this inputter.

Parameters:
• data_file – The data file.
• training – Run in training mode.

Returns: A tf.data.Dataset.
make_inference_dataset(features_file, batch_size, bucket_width=None, num_threads=1, prefetch_buffer_size=None)[source]

Builds a dataset to be used for inference.

For evaluation and training datasets, see opennmt.inputters.inputter.ExampleInputter.

Parameters:
• features_file – The test file.
• batch_size – The batch size to use.
• bucket_width – The width of the length buckets to select batch candidates from (for efficiency). Set None to not constrain batch formation.
• num_threads – The number of elements processed in parallel.
• prefetch_buffer_size – The number of batches to prefetch asynchronously. If None, use an automatically tuned value on TensorFlow 1.8+ and 1 on older versions.

Returns: A tf.data.Dataset.
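The bucket_width idea, drawing batch candidates from sequences of similar length, can be sketched in plain Python. This is a conceptual illustration with hypothetical helper names, not the tf.data implementation the method actually builds:

```python
from collections import defaultdict

def bucket_batches(examples, batch_size, bucket_width):
    """Group examples into batches of similar lengths.

    Conceptual sketch of length bucketing: each example is assigned to
    the bucket len(example) // bucket_width, and batches are formed
    within buckets so that sequences in a batch need little padding.
    """
    buckets = defaultdict(list)
    batches = []
    for example in examples:
        key = len(example) // bucket_width
        buckets[key].append(example)
        if len(buckets[key]) == batch_size:
            batches.append(buckets.pop(key))
    # Flush incomplete buckets as smaller batches.
    for bucket in buckets.values():
        batches.append(bucket)
    return batches

batches = bucket_batches(
    [[1], [2, 3], [4, 5], [6], [7, 8, 9], [10, 11, 12]],
    batch_size=2,
    bucket_width=2)
```

With bucket_width=2, the length-1 examples batch together and the longer examples batch together, which is the efficiency gain the parameter is for.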
get_dataset_size(data_file)[source]

Returns the size of the dataset.

Parameters:
• data_file – The data file.

Returns: The total size.
get_serving_input_receiver()[source]

Returns a serving input receiver for this inputter.

Returns: A tf.estimator.export.ServingInputReceiver.
get_receiver_tensors()[source]

Returns the input placeholders for serving.

get_length(features)[source]

Returns the length of the input features, if defined.

make_features(element=None, features=None, training=None)[source]

Creates features from data.

Parameters:
• element – An element from the dataset.
• features – An optional dictionary of features to augment.
• training – Run in training mode.

Returns: A dictionary of tf.Tensor.
call(features, training=None)[source]

Forwards the call to make_inputs().

make_inputs(features, training=None)[source]

Creates the model input from the features.

Parameters:
• features – A dictionary of tf.Tensor.
• training – Run in training mode.

Returns: The model input.
visualize(log_dir)[source]

Visualizes the transformation, usually embeddings.

Parameters: log_dir – The active log directory.
set_data_field(data, key, value, volatile=False)[source]

Sets a data field.

Parameters:
• data – The data dictionary.
• key – The value key.
• value – The value to assign.
• volatile – If True, the key/value pair will be removed once the processing is done.

Returns: The updated data dictionary.
remove_data_field(data, key)[source]

Removes a data field.

Parameters:
• data – The data dictionary.
• key – The value key.

Returns: The updated data dictionary.
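The set/remove pair and the volatile flag can be sketched in plain Python. This is a conceptual illustration only; the "_volatile" bookkeeping key is hypothetical, not the library's actual mechanism:

```python
def set_data_field(data, key, value, volatile=False):
    """Set a field, optionally marking it as volatile.

    Sketch of the volatile semantics: volatile keys are remembered
    so they can be dropped once processing is done.
    """
    data[key] = value
    if volatile:
        data.setdefault("_volatile", set()).add(key)
    return data

def remove_volatile_fields(data):
    """Drop every field that was registered as volatile."""
    for key in data.pop("_volatile", set()):
        data.pop(key, None)
    return data

data = set_data_field({}, "raw", "Hello", volatile=True)
data = set_data_field(data, "ids", [1, 2])
data = remove_volatile_fields(data)
```

After processing, only the non-volatile "ids" field remains.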
add_process_hooks(hooks)[source]

Processing hooks are additional, model-specific data processing functions applied after this inputter's opennmt.inputters.inputter.Inputter.process() function.

Parameters: hooks – A list of callables with the signature (inputter, data) -> data.
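The hook contract, callables mapping (inputter, data) to an updated data dictionary, can be illustrated with a minimal pure-Python sketch (the helper and hook names here are hypothetical):

```python
def apply_process_hooks(inputter, data, hooks):
    """Apply each hook in order; each receives and returns the data dict.

    Sketch of the (inputter, data) -> data contract described above,
    not the library's internal implementation.
    """
    for hook in hooks:
        data = hook(inputter, data)
    return data

# A hypothetical hook that lowercases a "text" field.
def lowercase_hook(inputter, data):
    data = dict(data)
    data["text"] = data["text"].lower()
    return data

result = apply_process_hooks(None, {"text": "Hello World"}, [lowercase_hook])
```

Because each hook returns the dictionary it was given (possibly replaced), hooks compose naturally: the output of one is the input of the next.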
process(data, training=None)[source]

Prepares raw data.

Parameters:
• data – The raw data.
• training – Run in training mode.

Returns: A dictionary of tf.Tensor.
transform_data(data, mode='train', log_dir=None)[source]

Transforms the processed data to an input.

This is usually a simple forward of a data field to opennmt.inputters.inputter.Inputter.transform().

Parameters:
• data – A dictionary of data fields.
• mode – A tf.estimator.ModeKeys mode.
• log_dir – The log directory. If set, visualization will be set up.

Returns: The transformed input.
class opennmt.inputters.inputter.MultiInputter(inputters, reducer=None)[source]

An inputter that gathers multiple inputters, possibly nested.

num_outputs

The number of parallel outputs produced by this inputter.

get_leaf_inputters()[source]

Returns a list of all leaf Inputter instances.

initialize(metadata, asset_dir=None, asset_prefix='')[source]

Initializes the inputter.

Parameters:
• metadata – A dictionary containing additional metadata set by the user.
• asset_dir – The directory where assets can be written. If None, no assets are returned.
• asset_prefix – The prefix to attach to asset filenames.

Returns: A dictionary containing additional assets used by the inputter.
export_assets(asset_dir, asset_prefix='')[source]

Exports assets used by this inputter.

Parameters:
• asset_dir – The directory where assets can be written.
• asset_prefix – The prefix to attach to asset filenames.

Returns: A dictionary containing additional assets used by the inputter.
make_dataset(data_file, training=None)[source]

Creates the base dataset required by this inputter.

Parameters:
• data_file – The data file.
• training – Run in training mode.

Returns: A tf.data.Dataset.
get_dataset_size(data_file)[source]

Returns the size of the dataset.

Parameters:
• data_file – The data file.

Returns: The total size.
visualize(log_dir)[source]

Visualizes the transformation, usually embeddings.

Parameters: log_dir – The active log directory.
class opennmt.inputters.inputter.ParallelInputter(inputters, reducer=None, share_parameters=False, combine_features=True)[source]

A multi inputter that processes parallel data.

__init__(inputters, reducer=None, share_parameters=False, combine_features=True)[source]

Initializes a parallel inputter.

Parameters:
• inputters – A list of opennmt.inputters.inputter.Inputter.
• reducer – A opennmt.layers.reducer.Reducer to merge all inputs. If set, parallel inputs are assumed to have the same length.
• share_parameters – Share the inputters' parameters.
• combine_features – Combine each inputter's features in a single dict or return them separately. This is typically True for multi source inputs but False for features/labels parallel data.

Raises: ValueError – If share_parameters is set but the child inputters are not of the same type.
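A rough pure-Python sketch of the combine_features behavior: the key-suffix scheme shown here is hypothetical (the real class operates on tf.Tensor features and has its own naming), but it illustrates why combining needs per-inputter key disambiguation:

```python
def combine_parallel_features(feature_dicts, combine=True):
    """Merge per-inputter feature dicts into one dict, or keep them separate.

    Conceptual sketch of the combine_features flag: when combining,
    keys are disambiguated with a per-inputter index suffix
    (hypothetical naming, not necessarily the library's scheme).
    """
    if not combine:
        # features/labels parallel data: keep each inputter's dict apart.
        return tuple(feature_dicts)
    combined = {}
    for i, features in enumerate(feature_dicts):
        for key, value in features.items():
            combined["%s_%d" % (key, i)] = value
    return combined

merged = combine_parallel_features(
    [{"ids": [1, 2]}, {"ids": [3, 4]}], combine=True)
```

With combine=True both inputters' "ids" features live in a single dict, which is the multi-source case; with combine=False the dicts stay separate, matching the features/labels case.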
make_dataset(data_file, training=None)[source]

Creates the base dataset required by this inputter.

Parameters:
• data_file – The data file.
• training – Run in training mode.

Returns: A tf.data.Dataset.
get_dataset_size(data_file)[source]

Returns the size of the dataset.

Parameters:
• data_file – The data file.

Returns: The total size.
get_receiver_tensors()[source]

Returns the input placeholders for serving.

get_length(features)[source]

Returns the length of the input features, if defined.

make_features(element=None, features=None, training=None)[source]

Creates features from data.

Parameters:
• element – An element from the dataset.
• features – An optional dictionary of features to augment.
• training – Run in training mode.

Returns: A dictionary of tf.Tensor.
build(input_shape=None)[source]

Creates the variables of the layer (optional, for subclass implementers).

This is a method that implementers of subclasses of Layer or Model can override if they need a state-creation step in-between layer instantiation and layer call.

This is typically used to create the weights of Layer subclasses.

Parameters: input_shape – Instance of TensorShape, or list of instances of TensorShape if the layer expects a list of inputs (one instance per input).
make_inputs(features, training=None)[source]

Creates the model input from the features.

Parameters:
• features – A dictionary of tf.Tensor.
• training – Run in training mode.

Returns: The model input.
class opennmt.inputters.inputter.MixedInputter(inputters, reducer=ConcatReducer(), dropout=0.0)[source]

A multi inputter that applies several transformations to the same data.

__init__(inputters, reducer=ConcatReducer(), dropout=0.0)[source]

Initializes a mixed inputter.

Parameters:
• inputters – A list of opennmt.inputters.inputter.Inputter.
• reducer – A opennmt.layers.reducer.Reducer to merge all inputs.
• dropout – The probability to drop units in the merged inputs.
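A minimal pure-Python sketch of the merge-then-dropout idea, assuming a concatenation reducer. This is an illustration only; the real class merges tf.Tensor inputs with an opennmt.layers.reducer.Reducer and uses TensorFlow's dropout:

```python
import random

def mixed_inputs(transformed, dropout=0.0, rng=None):
    """Concatenate parallel transformations of the same input, then
    apply inverted dropout to the merged vector.

    Conceptual sketch of a MixedInputter with a concatenation reducer;
    `transformed` holds one vector per child inputter.
    """
    rng = rng or random.Random(0)
    merged = [x for vec in transformed for x in vec]
    if dropout > 0.0:
        keep = 1.0 - dropout
        # Inverted dropout: zero units with probability `dropout`,
        # scale the survivors by 1 / keep.
        merged = [x / keep if rng.random() < keep else 0.0 for x in merged]
    return merged

out = mixed_inputs([[1.0, 2.0], [3.0]], dropout=0.0)
```

With dropout=0.0 the result is simply the concatenation of the per-inputter vectors.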
make_dataset(data_file, training=None)[source]

Creates the base dataset required by this inputter.

Parameters:
• data_file – The data file.
• training – Run in training mode.

Returns: A tf.data.Dataset.
get_dataset_size(data_file)[source]

Returns the size of the dataset.

Parameters:
• data_file – The data file.

Returns: The total size.
get_receiver_tensors()[source]

Returns the input placeholders for serving.

get_length(features)[source]

Returns the length of the input features, if defined.

make_features(element=None, features=None, training=None)[source]

Creates features from data.

Parameters:
• element – An element from the dataset.
• features – An optional dictionary of features to augment.
• training – Run in training mode.

Returns: A dictionary of tf.Tensor.
make_inputs(features, training=None)[source]

Creates the model input from the features.

Parameters:
• features – A dictionary of tf.Tensor.
• training – Run in training mode.

Returns: The model input.
class opennmt.inputters.inputter.ExampleInputter(features_inputter, labels_inputter, share_parameters=False)[source]

An inputter that returns training examples (parallel features and labels).

__init__(features_inputter, labels_inputter, share_parameters=False)[source]

Initializes this inputter.

Parameters:
• features_inputter – An inputter producing the features (source).
• labels_inputter – An inputter producing the labels (target).
• share_parameters – Share the inputters' parameters.
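What an ExampleInputter yields can be pictured as aligned (features, labels) pairs read from parallel files. The sketch below is a pure-Python illustration with hypothetical names; the real class builds a tf.data.Dataset from the two child inputters:

```python
def make_example_pairs(features_lines, labels_lines):
    """Pair source and target lines into (features, labels) examples.

    Conceptual sketch of parallel data: line i of the features file
    aligns with line i of the labels file.
    """
    if len(features_lines) != len(labels_lines):
        raise ValueError("features and labels files must be aligned")
    return list(zip(features_lines, labels_lines))

pairs = make_example_pairs(["a b", "c"], ["x", "y z"])
```

A length mismatch between the two files means the data is misaligned, hence the error.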
initialize(metadata, asset_dir=None, asset_prefix='')[source]

Initializes the inputter.

Parameters:
• metadata – A dictionary containing additional metadata set by the user.
• asset_dir – The directory where assets can be written. If None, no assets are returned.
• asset_prefix – The prefix to attach to asset filenames.

Returns: A dictionary containing additional assets used by the inputter.
export_assets(asset_dir, asset_prefix='')[source]

Exports assets used by this inputter.

Parameters:
• asset_dir – The directory where assets can be written.
• asset_prefix – The prefix to attach to asset filenames.

Returns: A dictionary containing additional assets used by the inputter.
make_inference_dataset(features_file, batch_size, bucket_width=None, num_threads=1, prefetch_buffer_size=None)[source]

Builds a dataset to be used for inference.

For evaluation and training datasets, see opennmt.inputters.inputter.ExampleInputter.

Parameters:
• features_file – The test file.
• batch_size – The batch size to use.
• bucket_width – The width of the length buckets to select batch candidates from (for efficiency). Set None to not constrain batch formation.
• num_threads – The number of elements processed in parallel.
• prefetch_buffer_size – The number of batches to prefetch asynchronously. If None, use an automatically tuned value on TensorFlow 1.8+ and 1 on older versions.

Returns: A tf.data.Dataset.
make_evaluation_dataset(features_file, labels_file, batch_size, num_threads=1, prefetch_buffer_size=None)[source]

Builds a dataset to be used for evaluation.

Parameters:
• features_file – The evaluation source file.
• labels_file – The evaluation target file.
• batch_size – The batch size to use.
• num_threads – The number of elements processed in parallel.
• prefetch_buffer_size – The number of batches to prefetch asynchronously. If None, use an automatically tuned value on TensorFlow 1.8+ and 1 on older versions.

Returns: A tf.data.Dataset.
make_training_dataset(features_file, labels_file, batch_size, batch_type='examples', batch_multiplier=1, batch_size_multiple=1, shuffle_buffer_size=None, bucket_width=None, maximum_features_length=None, maximum_labels_length=None, single_pass=False, num_shards=1, shard_index=0, num_threads=4, prefetch_buffer_size=None)[source]

Builds a dataset to be used for training. It supports the full training pipeline, including:

• sharding
• shuffling
• filtering
• bucketing
• prefetching
Parameters:
• features_file – The training source file.
• labels_file – The training target file.
• batch_size – The batch size to use.
• batch_type – The training batching strategy to use: can be “examples” or “tokens”.
• batch_multiplier – The batch size multiplier to prepare splitting across replicated graph parts.
• batch_size_multiple – When batch_type is “tokens”, ensure that the resulting batch size is a multiple of this value.
• shuffle_buffer_size – The number of elements from which to sample.
• bucket_width – The width of the length buckets to select batch candidates from (for efficiency). Set None to not constrain batch formation.
• maximum_features_length – The maximum length or list of maximum lengths of the features sequence(s). None to not constrain the length.
• maximum_labels_length – The maximum length of the labels sequence. None to not constrain the length.
• single_pass – If True, makes a single pass over the training data.
• num_shards – The number of data shards (usually the number of workers in a distributed setting).
• shard_index – The shard index this data pipeline should read from.
• num_threads – The number of elements processed in parallel.
• prefetch_buffer_size – The number of batches to prefetch asynchronously. If None, use an automatically tuned value on TensorFlow 1.8+ and 1 on older versions.

Returns: A tf.data.Dataset.
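The difference between the “examples” and “tokens” batching strategies can be sketched in plain Python. This is a conceptual illustration with hypothetical names, not the tf.data pipeline the method actually builds:

```python
def batch_examples(examples, batch_size, batch_type="examples"):
    """Group examples into batches by count or by total token count.

    Conceptual sketch of the batch_type strategies: "examples" caps
    the number of sequences per batch, "tokens" caps the sum of
    sequence lengths, so longer sequences yield smaller batches.
    """
    batches, current, size = [], [], 0
    for example in examples:
        cost = 1 if batch_type == "examples" else len(example)
        if current and size + cost > batch_size:
            batches.append(current)
            current, size = [], 0
        current.append(example)
        size += cost
    if current:
        batches.append(current)
    return batches

data = [[1, 2], [3, 4, 5], [6], [7, 8]]
example_batches = batch_examples(data, batch_size=2, batch_type="examples")
token_batches = batch_examples(data, batch_size=4, batch_type="tokens")
```

Token-based batching keeps the padded memory footprint of each batch roughly constant, which is why it is the common choice for variable-length sequence training.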