Keras implementation of word2vec, including the full data-processing pipeline; the implementation closely follows the TensorFlow word2vec tutorial.
The implementation is intended primarily for building intuition about both Keras and word2vec. It includes both data-processing and model-estimation pipelines.
To run the data processing, run the submit.py script. The script reads a text file (specified via the path_to_text_file parameter) and then performs a grid search over the parameter grids defined at the top of the script.
To test a single combination of parameter values, work with the commands inside the for loop, which run preprocessing and model estimation by calling run.py with a fixed set of parameter values.
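The grid-search loop in submit.py can be sketched as follows. This is a minimal illustration, not the script itself: the grid values, flag names, and the build_commands helper are hypothetical, and the actual grids live at the top of submit.py.

```python
import itertools

# Hypothetical parameter grids for illustration; the real grids are
# defined at the top of submit.py.
PARAM_GRID = {
    "BATCH_SIZE": [512, 1024],
    "NUM_NS": [4],
    "WINDOW_SIZE": [2, 3],
}

def build_commands(path_to_text_file, grid):
    """Return one `python src/run.py ...` command per grid combination."""
    keys = sorted(grid)
    commands = []
    for values in itertools.product(*(grid[k] for k in keys)):
        args = [f"--{k.lower()}={v}" for k, v in zip(keys, values)]
        commands.append(
            ["python", "src/run.py", f"--path_to_text_file={path_to_text_file}"] + args
        )
    return commands

commands = build_commands("data/corpus.txt", PARAM_GRID)
print(len(commands))  # 2 * 1 * 2 = 4 combinations
```

Each command corresponds to one preprocessing + model-estimation run of run.py with a fixed set of parameter values; running a single command from this list is the way to test one combination.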
submit.py lets users set a series of parameters for word2vec runs. There are three groups of parameters: Data Processing, word2vec, and Model Tuning.
Data Processing parameters control the amount of training data used in each batch, as well as how data is pre-fetched for each batch while the model runs. word2vec parameters control word2vec-specific settings such as the number of negative samples. Model Tuning parameters control more general training settings such as the number of epochs for each model run.
BATCH_SIZE : number of training samples in each batch
BUFFER_SIZE : size of the buffer filled while the prior batch is running
AUTOTUNE : tuning parameter controlling how many data elements to prefetch into the buffer
NUM_NS : number of negative samples used in the noise-contrastive estimation (negative sampling) procedure
T_PARAM : threshold value used to down-weight high-frequency words (see equation 5)
VOCAB_SIZE : number of words in vocabulary
WINDOW_SIZE : number of words before and after target word to include in context
EMBEDDING_DIM : dimension of the embedding vectors (the width of the hidden layer)
SEQUENCE_LENGTH : number of tokens in each input sequence (sentences are padded or truncated to this length)
EPOCHS : number of full passes over the training data
SEED : random seed
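The T_PARAM down-weighting is commonly implemented as the subsampling rule from Mikolov et al., where a word with relative frequency f is kept with probability sqrt(t / f), capped at 1. A minimal sketch assuming that formulation (the function name is illustrative, and the exact variant used by run.py may differ):

```python
import numpy as np

T_PARAM = 1e-5  # threshold t used to down-weight high-frequency words

def keep_probability(word_freq):
    """Probability of keeping a word with relative frequency `word_freq`.

    P(keep w) = min(1, sqrt(t / f(w))): frequent words are sampled less
    often, while words with frequency at or below t are always kept.
    """
    return np.minimum(1.0, np.sqrt(T_PARAM / word_freq))

print(keep_probability(4e-5))  # sqrt(1e-5 / 4e-5) = 0.5
```

In the TensorFlow tutorial this kind of down-weighting is typically realized via a sampling table passed to the skip-gram generator, so rare words contribute relatively more training pairs than very common ones.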
submit.py runs run.py in the src folder, assuming the same folder structure as in this repository.