Preprocess

preprocess.py

usage: preprocess.py [-h] [-config CONFIG] [-save_config SAVE_CONFIG]
                     [--data_type DATA_TYPE] --train_src TRAIN_SRC --train_tgt
                     TRAIN_TGT [--valid_src VALID_SRC] [--valid_tgt VALID_TGT]
                     [--src_dir SRC_DIR] --save_data SAVE_DATA
                     [--max_shard_size MAX_SHARD_SIZE]
                     [--shard_size SHARD_SIZE] [--src_vocab SRC_VOCAB]
                     [--tgt_vocab TGT_VOCAB]
                     [--features_vocabs_prefix FEATURES_VOCABS_PREFIX]
                     [--src_vocab_size SRC_VOCAB_SIZE]
                     [--tgt_vocab_size TGT_VOCAB_SIZE]
                     [--vocab_size_multiple VOCAB_SIZE_MULTIPLE]
                     [--src_words_min_frequency SRC_WORDS_MIN_FREQUENCY]
                     [--tgt_words_min_frequency TGT_WORDS_MIN_FREQUENCY]
                     [--dynamic_dict] [--share_vocab]
                     [--src_seq_length SRC_SEQ_LENGTH]
                     [--src_seq_length_trunc SRC_SEQ_LENGTH_TRUNC]
                     [--tgt_seq_length TGT_SEQ_LENGTH]
                     [--tgt_seq_length_trunc TGT_SEQ_LENGTH_TRUNC] [--lower]
                     [--filter_valid] [--shuffle SHUFFLE] [--seed SEED]
                     [--report_every REPORT_EVERY] [--log_file LOG_FILE]
                     [--log_file_level {WARNING,DEBUG,INFO,CRITICAL,NOTSET,ERROR,30,10,20,50,0,40}]
                     [--sample_rate SAMPLE_RATE] [--window_size WINDOW_SIZE]
                     [--window_stride WINDOW_STRIDE] [--window WINDOW]
                     [--image_channel_size {3,1}]
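
As an illustration of the synopsis above, the following is a minimal sketch that assembles an argument list containing the three required options (`--train_src`, `--train_tgt`, `--save_data`) plus a few optional ones. The helper name `build_preprocess_command` and all file paths are hypothetical; the list is only built and printed, not executed.

```python
def build_preprocess_command(train_src, train_tgt, save_data, **optional):
    """Assemble an argv list for preprocess.py (hypothetical helper)."""
    cmd = [
        "python", "preprocess.py",
        "--train_src", train_src,
        "--train_tgt", train_tgt,
        "--save_data", save_data,
    ]
    for name, value in optional.items():
        if value is True:
            # Boolean switches such as --share_vocab take no value.
            cmd.append(f"--{name}")
        elif value is not False and value is not None:
            cmd.extend([f"--{name}", str(value)])
    return cmd


# Placeholder paths, chosen only for the example.
cmd = build_preprocess_command(
    "data/src-train.txt",
    "data/tgt-train.txt",
    "data/demo",
    valid_src="data/src-val.txt",
    valid_tgt="data/tgt-val.txt",
    share_vocab=True,
)
print(" ".join(cmd))
```

The resulting list could be handed to `subprocess.run(cmd, check=True)` from a directory where `preprocess.py` is available.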

Named Arguments

-config, --config
 config file path
-save_config, --save_config
 config file save path

Data

--data_type, -data_type
 

Type of the source input. Options are [text|img|audio].

Default: “text”

--train_src, -train_src
 Path to the training source data
--train_tgt, -train_tgt
 Path to the training target data
--valid_src, -valid_src
 Path to the validation source data
--valid_tgt, -valid_tgt
 Path to the validation target data
--src_dir, -src_dir
 

Source directory for image or audio files.

Default: “”

--save_data, -save_data
 Output file for the prepared data
--max_shard_size, -max_shard_size
 

Deprecated; use shard_size instead.

Default: 0

--shard_size, -shard_size
 

Divide src_corpus and tgt_corpus into multiple smaller src_corpus and tgt_corpus files, then build shards. Each shard has opt.shard_size samples, except possibly the last one. shard_size=0 disables segmentation; shard_size>0 segments the dataset into multiple shards of shard_size samples each.

Default: 1000000
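
The sharding rule described for shard_size can be sketched as follows. This illustrates only the counting behavior (full shards followed by a possibly shorter last shard, and a single shard when the size is 0); it is not the script's actual implementation, and the function name `iter_shards` is hypothetical.

```python
from itertools import islice


def iter_shards(samples, shard_size):
    """Yield lists of at most shard_size samples; 0 means a single shard."""
    if shard_size < 0:
        raise ValueError("shard_size must be >= 0")
    iterator = iter(samples)
    if shard_size == 0:
        # No segmentation: everything goes into one shard.
        yield list(iterator)
        return
    while True:
        shard = list(islice(iterator, shard_size))
        if not shard:
            return
        yield shard


pairs = [(f"src {i}", f"tgt {i}") for i in range(7)]
print([len(s) for s in iter_shards(pairs, 3)])  # [3, 3, 1]
print([len(s) for s in iter_shards(pairs, 0)])  # [7]
```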

Vocab

--src_vocab, -src_vocab
 

Path to an existing source vocabulary. Format: one word per line.

Default: “”

--tgt_vocab, -tgt_vocab
 

Path to an existing target vocabulary. Format: one word per line.

Default: “”

--features_vocabs_prefix, -features_vocabs_prefix
 

Path prefix to the features vocabularies

Default: “”

--src_vocab_size, -src_vocab_size
 

Size of the source vocabulary

Default: 50000

--tgt_vocab_size, -tgt_vocab_size
 

Size of the target vocabulary

Default: 50000

--vocab_size_multiple, -vocab_size_multiple
 

Make the vocabulary size a multiple of this value

Default: 1
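
The arithmetic behind vocab_size_multiple can be sketched as below, assuming the size is rounded up to the next multiple (for example by padding with filler entries) rather than truncated. That rounding direction is an assumption, and `padded_vocab_size` is a hypothetical name.

```python
def padded_vocab_size(size, multiple):
    """Smallest value >= size that is a multiple of `multiple` (assumed round-up)."""
    if multiple < 1:
        raise ValueError("multiple must be >= 1")
    remainder = size % multiple
    return size if remainder == 0 else size + (multiple - remainder)


print(padded_vocab_size(50004, 8))  # 50008
print(padded_vocab_size(50004, 1))  # 50004 (the default multiple changes nothing)
```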

--src_words_min_frequency, -src_words_min_frequency
 Default: 0
--tgt_words_min_frequency, -tgt_words_min_frequency
 Default: 0
--dynamic_dict, -dynamic_dict
 

Create dynamic dictionaries

Default: False

--share_vocab, -share_vocab
 

Share source and target vocabulary

Default: False
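
How a size cap and a minimum frequency might interact when building a vocabulary can be sketched as below. The sketch assumes that the minimum frequency is a raw occurrence count, that tokens are whitespace-separated, and that the most frequent tokens are kept first; special tokens are ignored. `build_vocab` is a hypothetical helper, not part of preprocess.py.

```python
from collections import Counter


def build_vocab(sentences, max_size, min_frequency=0):
    """Keep the most frequent tokens, subject to a count threshold and a size cap."""
    counts = Counter(tok for line in sentences for tok in line.split())
    kept = [(tok, c) for tok, c in counts.items() if c >= min_frequency]
    # Most frequent first; ties broken alphabetically for determinism.
    kept.sort(key=lambda item: (-item[1], item[0]))
    return [tok for tok, _ in kept[:max_size]]


corpus = ["the cat sat", "the dog sat", "the bird flew"]
print(build_vocab(corpus, max_size=50000, min_frequency=2))  # ['the', 'sat']
print(build_vocab(corpus, max_size=1))                       # ['the']
```

Under the same assumptions, a shared vocabulary would correspond to counting source and target sentences together before applying the threshold and cap.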

Pruning

--src_seq_length, -src_seq_length
 

Maximum source sequence length

Default: 50

--src_seq_length_trunc, -src_seq_length_trunc
 Truncate source sequence length.
--tgt_seq_length, -tgt_seq_length
 

Maximum target sequence length to keep.

Default: 50

--tgt_seq_length_trunc, -tgt_seq_length_trunc
 Truncate target sequence length.
--lower, -lower
 

Lowercase the data

Default: False

--filter_valid, -filter_valid
 

Filter validation data by src and/or tgt length

Default: False
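
The difference between the length limits and the truncation options can be sketched as below. The sketch assumes that truncation is applied first and that pairs whose source or target is empty or longer than the limit are then dropped; that ordering is an assumption, and `prune_pairs` with its parameter names is hypothetical.

```python
def prune_pairs(pairs, src_seq_length=50, tgt_seq_length=50,
                src_trunc=None, tgt_trunc=None, lower=False):
    """Truncate, then drop pairs that are empty or exceed the length limits."""
    kept = []
    for src, tgt in pairs:
        if lower:
            src, tgt = src.lower(), tgt.lower()
        # A slice bound of None keeps every token (no truncation).
        src_toks = src.split()[:src_trunc or None]
        tgt_toks = tgt.split()[:tgt_trunc or None]
        if 0 < len(src_toks) <= src_seq_length and 0 < len(tgt_toks) <= tgt_seq_length:
            kept.append((src_toks, tgt_toks))
    return kept


pairs = [("A b c d", "X y"), ("e f", "Z"), ("", "w")]
# Without truncation, the 4-token source exceeds the limit of 3 and is dropped.
print(prune_pairs(pairs, src_seq_length=3, tgt_seq_length=3, lower=True))
# [(['e', 'f'], ['z'])]
# With truncation to 3 tokens, the same pair survives.
print(prune_pairs(pairs, src_seq_length=3, tgt_seq_length=3, src_trunc=3, lower=True))
# [(['a', 'b', 'c'], ['x', 'y']), (['e', 'f'], ['z'])]
```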

Random

--shuffle, -shuffle
 

Shuffle data

Default: 0

--seed, -seed

Random seed

Default: 3435

Logging

--report_every, -report_every
 

Report status every this many sentences

Default: 100000

--log_file, -log_file
 

Output logs to a file under this path.

Default: “”

--log_file_level, -log_file_level
 

Possible choices: WARNING, DEBUG, INFO, CRITICAL, NOTSET, ERROR, 30, 10, 20, 50, 0, 40

Default: “0”
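
The numeric choices listed above match the standard level constants of Python's `logging` module, so the default “0” corresponds to NOTSET. The following sketch shows the mapping in both directions; `resolve_level` is a hypothetical helper, and whether the script resolves levels this way is an assumption.

```python
import logging

LEVEL_NAMES = ("CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG", "NOTSET")


def resolve_level(value):
    """Map a level given as a name ("INFO") or numeric string ("20") to an int."""
    if value.isdigit():
        return int(value)
    if value not in LEVEL_NAMES:
        raise ValueError(f"unknown log level: {value!r}")
    return getattr(logging, value)


print(resolve_level("0"))         # 0
print(resolve_level("WARNING"))   # 30
print(logging.getLevelName(40))   # ERROR
```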

Speech

--sample_rate, -sample_rate
 

Sample rate.

Default: 16000

--window_size, -window_size
 

Window size for spectrogram in seconds.

Default: 0.02

--window_stride, -window_stride
 

Window stride for spectrogram in seconds.

Default: 0.01
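
Since the window size and stride are given in seconds, they translate into sample counts through the sample rate. The sketch below does that conversion for the defaults; the names `n_fft` and `hop_length` follow common spectrogram terminology and are assumptions here, as is the use of rounding.

```python
def window_params(sample_rate=16000, window_size=0.02, window_stride=0.01):
    """Convert window size and stride from seconds to samples."""
    n_fft = int(round(sample_rate * window_size))        # samples per window
    hop_length = int(round(sample_rate * window_stride))  # samples between windows
    return n_fft, hop_length


print(window_params())  # (320, 160)
```

With the defaults, each window covers 320 samples and consecutive windows start 160 samples apart, so adjacent windows overlap by half.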

--window, -window
 

Window type for spectrogram generation.

Default: “hamming”

--image_channel_size, -image_channel_size
 

Possible choices: 3, 1

Using grayscale images can make the model smaller and training faster.

Default: 3