Preprocess

preprocess.py

usage: preprocess.py [-h] [-config CONFIG] [-save_config SAVE_CONFIG]
                     [--data_type DATA_TYPE] --train_src TRAIN_SRC
                     [TRAIN_SRC ...] --train_tgt TRAIN_TGT [TRAIN_TGT ...]
                     [--train_ids TRAIN_IDS [TRAIN_IDS ...]]
                     [--valid_src VALID_SRC] [--valid_tgt VALID_TGT]
                     [--src_dir SRC_DIR] --save_data SAVE_DATA
                     [--max_shard_size MAX_SHARD_SIZE]
                     [--shard_size SHARD_SIZE] [--overwrite]
                     [--src_vocab SRC_VOCAB] [--tgt_vocab TGT_VOCAB]
                     [--features_vocabs_prefix FEATURES_VOCABS_PREFIX]
                     [--src_vocab_size SRC_VOCAB_SIZE]
                     [--tgt_vocab_size TGT_VOCAB_SIZE]
                     [--vocab_size_multiple VOCAB_SIZE_MULTIPLE]
                     [--src_words_min_frequency SRC_WORDS_MIN_FREQUENCY]
                     [--tgt_words_min_frequency TGT_WORDS_MIN_FREQUENCY]
                     [--dynamic_dict] [--share_vocab]
                     [--src_seq_length SRC_SEQ_LENGTH]
                     [--src_seq_length_trunc SRC_SEQ_LENGTH_TRUNC]
                     [--tgt_seq_length TGT_SEQ_LENGTH]
                     [--tgt_seq_length_trunc TGT_SEQ_LENGTH_TRUNC] [--lower]
                     [--filter_valid] [--shuffle SHUFFLE] [--seed SEED]
                     [--report_every REPORT_EVERY] [--log_file LOG_FILE]
                     [--log_file_level {WARNING,NOTSET,CRITICAL,INFO,DEBUG,ERROR,30,0,50,20,10,40}]
                     [--sample_rate SAMPLE_RATE] [--window_size WINDOW_SIZE]
                     [--window_stride WINDOW_STRIDE] [--window WINDOW]
                     [--image_channel_size {3,1}]
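
As a rough illustration of how the required arguments fit together, the sketch below assembles a command line using only flags listed in the usage summary above. The file paths and the `python` interpreter name are hypothetical placeholders, and the command is only printed, not executed.

```python
import shlex

# Hypothetical paths for illustration only; substitute your own files.
args = [
    "python", "preprocess.py",
    "--train_src", "data/src-train.txt",
    "--train_tgt", "data/tgt-train.txt",
    "--valid_src", "data/src-val.txt",
    "--valid_tgt", "data/tgt-val.txt",
    "--save_data", "data/demo",
]

# Join the pieces into a single shell-safe command string.
command = shlex.join(args)
print(command)
```

Only --train_src, --train_tgt and --save_data are marked as required in the usage summary; the validation paths are optional.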

Named Arguments

-config, --config

config file path

-save_config, --save_config

config file save path

Data

--data_type, -data_type

Type of the source input. Options are [text|img|audio].

Default: “text”

--train_src, -train_src

Path(s) to the training source data

--train_tgt, -train_tgt

Path(s) to the training target data

--train_ids, -train_ids

IDs used to name the training shards, used for corpus weighting

Default: [None]

--valid_src, -valid_src

Path to the validation source data

--valid_tgt, -valid_tgt

Path to the validation target data

--src_dir, -src_dir

Source directory for image or audio files.

Default: “”

--save_data, -save_data

Output file for the prepared data

--max_shard_size, -max_shard_size

Deprecated; use shard_size instead.

Default: 0

--shard_size, -shard_size

Divide src_corpus and tgt_corpus into multiple smaller src_corpus and tgt_corpus files, then build shards. Each shard holds shard_size samples, except possibly the last one. shard_size=0 disables segmentation; shard_size>0 segments the dataset into multiple shards of shard_size samples each.

Default: 1000000
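
The sharding rule described above can be sketched as follows. This is a minimal illustration of the stated behaviour, not the script's actual implementation; the function name `shard_bounds` and the corpus size are hypothetical.

```python
def shard_bounds(num_samples, shard_size):
    """Return (start, end) sample index pairs, one pair per shard."""
    if shard_size <= 0:
        # shard_size=0 means no segmentation: one shard holds everything.
        return [(0, num_samples)]
    return [
        (start, min(start + shard_size, num_samples))
        for start in range(0, num_samples, shard_size)
    ]

# A hypothetical corpus of 2.5 million samples with the default shard size.
bounds = shard_bounds(2_500_000, 1_000_000)
print(len(bounds))  # number of shards
print(bounds[-1])   # the last shard is smaller than the others
```

With these numbers the corpus splits into two full shards and one final shard of 500,000 samples.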

--overwrite, -overwrite

Overwrite existing shards if any.

Default: False

Vocab

--src_vocab, -src_vocab

Path to an existing source vocabulary. Format: one word per line.

Default: “”

--tgt_vocab, -tgt_vocab

Path to an existing target vocabulary. Format: one word per line.

Default: “”

--features_vocabs_prefix, -features_vocabs_prefix

Path prefix for the feature vocabularies

Default: “”

--src_vocab_size, -src_vocab_size

Size of the source vocabulary

Default: 50000

--tgt_vocab_size, -tgt_vocab_size

Size of the target vocabulary

Default: 50000

--vocab_size_multiple, -vocab_size_multiple

Make the vocabulary size a multiple of this value

Default: 1
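
One plausible reading of this option is that the vocabulary is padded up to the next multiple of the given value. The sketch below shows that arithmetic only; the rounding-up behaviour and the helper name `round_up_to_multiple` are assumptions, not something this page confirms.

```python
def round_up_to_multiple(size, multiple):
    """Round size up to the nearest multiple (assumed behaviour)."""
    if multiple <= 1:
        # The default of 1 leaves the size unchanged.
        return size
    # Ceiling division, then scale back up.
    return -(-size // multiple) * multiple

padded = round_up_to_multiple(50004, 8)
print(padded)
```

A size that is already a multiple, such as 50000 with a multiple of 8, is left as it is.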

--src_words_min_frequency, -src_words_min_frequency

Default: 0

--tgt_words_min_frequency, -tgt_words_min_frequency

Default: 0

--dynamic_dict, -dynamic_dict

Create dynamic dictionaries

Default: False

--share_vocab, -share_vocab

Share source and target vocabulary

Default: False

Pruning

--src_seq_length, -src_seq_length

Maximum source sequence length

Default: 50

--src_seq_length_trunc, -src_seq_length_trunc

Truncate source sequence length.

--tgt_seq_length, -tgt_seq_length

Maximum target sequence length to keep.

Default: 50

--tgt_seq_length_trunc, -tgt_seq_length_trunc

Truncate target sequence length.
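
The difference between the length limit and the truncation options can be sketched as follows. This is an illustrative assumption about how the two interact (truncate first, then drop anything still too long); the function name `prune` and the order of operations are hypothetical.

```python
def prune(tokens, max_len=50, trunc=None):
    """Return the pruned token list, or None if the example is dropped."""
    if trunc is not None:
        # Truncation shortens the sequence but keeps the example.
        tokens = tokens[:trunc]
    if len(tokens) > max_len:
        # Sequences longer than the maximum are discarded (assumed).
        return None
    return tokens

long_sentence = ["tok"] * 80
dropped = prune(long_sentence, max_len=50)
kept = prune(long_sentence, max_len=50, trunc=40)
print(dropped, len(kept))
```

Under this reading, an 80-token sentence is dropped by the length limit alone, but survives as a 40-token sentence when truncation is applied first.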

--lower, -lower

Lowercase the data

Default: False

--filter_valid, -filter_valid

Filter validation data by src and/or tgt length

Default: False

Random

--shuffle, -shuffle

Shuffle data

Default: 0

--seed, -seed

Random seed

Default: 3435

Logging

--report_every, -report_every

Report status every this many sentences

Default: 100000

--log_file, -log_file

Output logs to a file under this path.

Default: “”

--log_file_level, -log_file_level

Possible choices: WARNING, NOTSET, CRITICAL, INFO, DEBUG, ERROR, 30, 0, 50, 20, 10, 40

Default: “0”
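
The numeric choices match the level constants of Python's standard logging module, so each name presumably has a numeric equivalent. The sketch below prints that correspondence; it only inspects the standard library and assumes the script treats the two forms as interchangeable.

```python
import logging

# Level names offered as choices, in ascending order of severity.
names = ["NOTSET", "DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"]

# Map each name to its numeric value in the standard logging module.
levels = {name: getattr(logging, name) for name in names}
for name, value in levels.items():
    print(f"{name:8s} {value}")
```

The default of "0" therefore corresponds to NOTSET.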

Speech

--sample_rate, -sample_rate

Sample rate.

Default: 16000

--window_size, -window_size

Window size for spectrogram in seconds.

Default: 0.02

--window_stride, -window_stride

Window stride for spectrogram in seconds.

Default: 0.01
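
Because the window size and stride are given in seconds, their length in samples depends on the sample rate. The sketch below works through that arithmetic for the default values; the variable names are illustrative and do not come from the script.

```python
sample_rate = 16000    # samples per second (default)
window_size = 0.02     # seconds (default)
window_stride = 0.01   # seconds (default)

# Convert durations in seconds to a whole number of samples.
window_samples = round(sample_rate * window_size)
stride_samples = round(sample_rate * window_stride)

print(window_samples, stride_samples)
```

With the defaults, each window covers 320 samples and consecutive windows start 160 samples apart, so adjacent windows overlap by half.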

--window, -window

Window type for spectrogram generation.

Default: “hamming”

--image_channel_size, -image_channel_size

Possible choices: 3, 1

Using grayscale images can make the model smaller and training faster.

Default: 3