# Embeddings

Word embeddings are learned with a lookup table: each word in the vocabulary is assigned a randomly initialized vector in this table, which is then updated with the gradients coming from the network.

## Pretrained

When training with small amounts of data, performance can be improved by starting from pretrained embeddings. The -pre_word_vecs_enc and -pre_word_vecs_dec options specify these files for the encoder and the decoder respectively.

The pretrained embeddings must be manually constructed serialized Torch tensors whose rows correspond to the entries of the source and target dictionary files. For example:

```lua
local vocab_size = 50004
local embedding_size = 500

-- Randomly initialized embeddings: one row per word in the vocabulary.
local embeddings = torch.Tensor(vocab_size, embedding_size):uniform()

torch.save('enc_embeddings.t7', embeddings)
```


where embeddings[i] is the embedding of the $i$-th word in the vocabulary.
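To seed some of these rows with known vectors before saving, one approach is sketched below. The `pretrained` table, mapping vocabulary indices to value tables, is a hypothetical placeholder for whatever source of vectors you have:

```lua
-- Sketch: overwrite selected rows of the randomly initialized tensor
-- with known vectors before serializing.
-- `pretrained` is a hypothetical table mapping a vocabulary index to a
-- table of embedding_size numbers.
for i, vec in pairs(pretrained) do
  embeddings[i]:copy(torch.Tensor(vec))
end

torch.save('enc_embeddings.t7', embeddings)
```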

To automate this process, OpenNMT provides the tools/embeddings.lua script, which can download pretrained embeddings from Polyglot or convert trained embeddings from word2vec, GloVe or FastText against the word vocabularies generated by preprocess.lua. Supported formats are:

- word2vec-bin (default): binary format generated by word2vec.
- word2vec-txt: textual word2vec format - starts with a header line containing the number of words and the embedding size, followed by one line per embedding: the first token is the word, and the following fields are the embedding values.
- glove: text format - same as word2vec-txt but without the header line.
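To make the glove format concrete, it can be read with a few lines of standard Lua. This is a sketch assuming a well-formed file with no header line; no Torch is required:

```lua
-- Sketch of a glove-format reader: one line per word, the word followed
-- by its embedding values, separated by whitespace.
local function read_glove(path)
  local vectors = {}
  for line in io.lines(path) do
    local fields = {}
    for token in line:gmatch("%S+") do
      table.insert(fields, token)
    end
    local word = table.remove(fields, 1)
    for i = 1, #fields do
      fields[i] = tonumber(fields[i])
    end
    vectors[word] = fields
  end
  return vectors
end
```

A word2vec-txt reader would be identical apart from skipping the first (header) line.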

Note

The script requires the lua-zlib package.

For example, to generate pretrained English word embeddings:

```bash
th tools/embeddings.lua -lang en -dict_file data/demo.src.dict -save_data data/demo-src-emb
```


Note

Language codes are Polyglot's Wikipedia language codes.

Or to map pretrained word2vec vectors to the built vocabulary:

```bash
th tools/embeddings.lua -embed_type word2vec -embed_file data/GoogleNews-vectors-negative300.bin \
  -dict_file data/demo.src.dict -save_data data/demo-src-emb
```


Tip

If a vocabulary entry is not found in the embeddings file as-is, you can use the -approximate option to also look for uppercase variants and variants without possible joiner marks. Entries that are still not found can be dumped by setting the -save_unknown_dict option.

## Fixed

By default these embeddings are updated during training, but they can be held fixed with the -fix_word_vecs_enc and -fix_word_vecs_dec options. These options can also be enabled or disabled when retraining.

Tip

When using pretrained word embeddings, if you declare a larger -word_vec_size, the extra dimensions are uniformly initialized; you can then use -fix_word_vecs_enc pretrained (or -fix_word_vecs_dec pretrained) to fix the pretrained part and optimize only the remaining part.
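Conceptually, the resulting lookup table is the pretrained tensor with extra uniformly initialized columns appended. The sketch below illustrates this with Torch; the 100 extra dimensions (500 pretrained plus a -word_vec_size of 600) are an assumed example:

```lua
-- Illustrative sketch only: pretrained vectors of size 500 extended to a
-- hypothetical -word_vec_size of 600. The training code does this
-- internally; you do not need to build the tensor yourself.
local pretrained = torch.load('enc_embeddings.t7')          -- vocab_size x 500
local extra = torch.Tensor(pretrained:size(1), 100):uniform()
local full = torch.cat(pretrained, extra, 2)                -- vocab_size x 600
```

With -fix_word_vecs_enc pretrained, only the first 500 columns stay fixed while the last 100 are trained.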

## Extraction

The tools/extract_embeddings.lua script extracts a model's word embeddings into text files, which can then be easily transformed into other formats for visualization or further processing.