This chapter will help us understand how to develop word embeddings in Gensim.
Word embedding, an approach to represent words and documents, is a dense vector representation for text in which words with similar meanings have similar representations. Following are some characteristics of word embedding −
It is a class of techniques that represents individual words as real-valued vectors in a pre-defined vector space.
This technique is often lumped into the field of deep learning (DL) because every word is mapped to one vector, and the vector values are learned in the same way the weights of a neural network (NN) are.
The key idea of the word embedding technique is a dense, distributed representation for every word.
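To make the contrast concrete, the following toy sketch compares a sparse one-hot vector with a dense real-valued vector for the same word. The dense values here are invented for illustration only; in practice they come from training −

import numpy as np

# One-hot representation: as long as the vocabulary, almost all zeros.
vocab = ['this', 'is', 'gensim', 'tutorial']
one_hot_tutorial = np.zeros(len(vocab))
one_hot_tutorial[vocab.index('tutorial')] = 1.0
print(one_hot_tutorial)   # [0. 0. 0. 1.]

# Dense embedding: short, real-valued, every dimension carries information.
# These numbers are made up for illustration; a trained model learns them.
dense_tutorial = np.array([0.21, -0.47, 0.05, 0.88])
print(dense_tutorial)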
As discussed above, word embedding methods/algorithms learn a real-valued vector representation from a corpus of text. The learning process can either be carried out jointly with a neural network model on a task such as document classification, or it can be an unsupervised process based on document statistics. Here we are going to discuss two methods/algorithms that can be used to learn a word embedding from text −
Word2Vec, developed by Tomas Mikolov et al. at Google in 2013, is a statistical method for efficiently learning a word embedding from a text corpus. It was developed as a response to make neural-network-based training of word embeddings more efficient, and it has become the de facto standard for word embedding.
Word embedding by Word2Vec involves analysis of the learned vectors as well as exploration of vector maths on the representations of words. Following are the two different learning methods which can be used as part of the Word2Vec method (a Gensim sketch follows this list) −
CBOW (Continuous Bag of Words) − It learns the embedding by predicting the current word based on its surrounding context.
Continuous Skip-Gram − It learns the embedding by predicting the surrounding words given the current word.
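In Gensim, the choice between these two methods is a single constructor argument. The sketch below assumes a tokenised corpus named sentences, like the one we build later in this chapter; sg, window and min_count are genuine Word2Vec parameters −

from gensim.models import Word2Vec

# sg=0 (the default) selects CBOW; sg=1 selects continuous skip-gram.
cbow_model = Word2Vec(sentences, sg=0, window=5, min_count=1)
skipgram_model = Word2Vec(sentences, sg=1, window=5, min_count=1)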
GloVe (Global Vectors for Word Representation) is an extension to the Word2Vec method. It was developed by Pennington et al. at Stanford. The GloVe algorithm is a mix of both −
Global statistics of matrix factorisation techniques, such as LSA (Latent Semantic Analysis).
Local context-based learning, as in Word2Vec.
In terms of how it works, instead of using a window to define local context, GloVe constructs an explicit word co-occurrence matrix using statistics across the whole text corpus.
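The co-occurrence matrix itself is easy to picture. The following sketch counts, for a toy corpus, how often each pair of words appears within a fixed window. It is a conceptual illustration of the statistic GloVe factorises, not the GloVe algorithm itself −

from collections import Counter

corpus = [['you', 'can', 'read', 'technical', 'tutorials'],
          ['we', 'are', 'implementing', 'word2vec']]
window = 2
cooccur = Counter()
for sentence in corpus:
   for i, word in enumerate(sentence):
      # Count every neighbour within `window` positions of `word`.
      for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
         if i != j:
            cooccur[(word, sentence[j])] += 1

print(cooccur[('read', 'technical')])   # 1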
Here, we will develop a Word2Vec embedding by using Gensim. In order to work with a Word2Vec model, Gensim provides us the Word2Vec class, which can be imported from gensim.models. For a useful embedding, word2vec requires a lot of text, e.g. the entire Amazon review corpus. But here, we will apply the principle to a small in-memory text.
First we need to import the Word2Vec class from gensim.models as follows −
from gensim.models import Word2Vec
Next, we need to define the training data. Rather than taking a big text file, we are using some sentences to implement this principle.
sentences = [
   ['this', 'is', 'gensim', 'tutorial', 'for', 'free'],
   ['this', 'is', 'the', 'tutorialspoint', 'website'],
   ['you', 'can', 'read', 'technical', 'tutorials', 'for', 'free'],
   ['we', 'are', 'implementing', 'word2vec'],
   ['learn', 'full', 'gensim', 'tutorial']
]
Once the training data is provided, we need to train the model. It can be done as follows −
model = Word2Vec(sentences, min_count=1)
We can summarise the model as follows −
print(model)
We can summarise the vocabulary as follows −
words = list(model.wv.vocab)
print(words)
Next, let’s access the vector for one word. We are doing it for the word ‘tutorial’.
print(model['tutorial'])
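Beyond the raw vector, the trained model also supports similarity queries through model.wv. On a five-sentence corpus the neighbours are essentially random, but the calls are the same for a real corpus −

# Find the words whose vectors are closest to 'tutorial'.
print(model.wv.most_similar('tutorial', topn=3))

# Cosine similarity between two words from the vocabulary.
print(model.wv.similarity('gensim', 'word2vec'))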
Next, we need to save the model −
model.save('model.bin')
Next, we need to load the model −
new_model = Word2Vec.load('model.bin')
Finally, print the saved model as follows −
print(new_model)
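If we only need the vectors rather than the full trainable model, we can also export them in the standard word2vec text format; save_word2vec_format and load_word2vec_format are part of Gensim's KeyedVectors API −

# Save only the learned vectors, one word per line, in plain text.
model.wv.save_word2vec_format('vectors.txt', binary=False)

# Reload them later as KeyedVectors (no further training possible).
from gensim.models import KeyedVectors
vectors = KeyedVectors.load_word2vec_format('vectors.txt', binary=False)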
Following is the complete implementation example −

from gensim.models import Word2Vec

# Training data: a few tokenised sentences held in memory.
sentences = [
   ['this', 'is', 'gensim', 'tutorial', 'for', 'free'],
   ['this', 'is', 'the', 'tutorialspoint', 'website'],
   ['you', 'can', 'read', 'technical', 'tutorials', 'for', 'free'],
   ['we', 'are', 'implementing', 'word2vec'],
   ['learn', 'full', 'gensim', 'tutorial']
]

# Train the model (min_count=1 keeps every word; this example targets
# gensim 3.x, where the vocabulary is exposed as model.wv.vocab).
model = Word2Vec(sentences, min_count=1)
print(model)

# Summarise the vocabulary.
words = list(model.wv.vocab)
print(words)

# Access the vector for one word.
print(model['tutorial'])

# Save and reload the model.
model.save('model.bin')
new_model = Word2Vec.load('model.bin')
print(new_model)
When we run the above code, it will produce the following output −

Word2Vec(vocab=20, size=100, alpha=0.025)
['this', 'is', 'gensim', 'tutorial', 'for', 'free', 'the', 'tutorialspoint',
 'website', 'you', 'can', 'read', 'technical', 'tutorials', 'we', 'are',
 'implementing', 'word2vec', 'learn', 'full']
[ -2.5256255e-03 -4.5352755e-03  3.9024993e-03 -4.9509313e-03 -1.4255195e-03
  -4.0217536e-03  4.9407515e-03 -3.5925603e-03 -1.1933431e-03 -4.6682903e-03
   1.5440651e-03 -1.4101702e-03  3.5070938e-03  1.0914479e-03  2.3334436e-03
   2.4452661e-03 -2.5336299e-04 -3.9676363e-03 -8.5054158e-04  1.6443320e-03
  -4.9968651e-03  1.0974540e-03 -1.1123562e-03  1.5393364e-03  9.8941079e-04
  -1.2656028e-03 -4.4471184e-03  1.8309267e-03  4.9302122e-03 -1.0032534e-03
   4.6892050e-03  2.9563988e-03  1.8730218e-03  1.5343715e-03 -1.2685956e-03
   8.3664013e-04  4.1721235e-03  1.9445885e-03  2.4097660e-03  3.7517555e-03
   4.9687522e-03 -1.3598346e-03  7.1032363e-04 -3.6595813e-03  6.0000515e-04
   3.0872561e-03 -3.2115565e-03  3.2270295e-03 -2.6354722e-03 -3.4988276e-04
   1.8574356e-04 -3.5757164e-03  7.5391348e-04 -3.5205986e-03 -1.9795434e-03
  -2.8321696e-03  4.7155009e-03 -4.3349937e-04 -1.5320212e-03  2.7013756e-03
  -3.7055744e-03 -4.1658725e-03  4.8034848e-03  4.8594419e-03  3.7129463e-03
   4.2385766e-03  2.4612297e-03  5.4920948e-04 -3.8912550e-03 -4.8226118e-03
  -2.2763973e-04  4.5571579e-03 -3.4609400e-03  2.7903817e-03 -3.2709218e-03
  -1.1036445e-03  2.1492650e-03 -3.0384419e-04  1.7709908e-03  1.8429896e-03
  -3.4038599e-03 -2.4872608e-03  2.7693063e-03 -1.6352943e-03  1.9182395e-03
   3.7772327e-03  2.2769428e-03 -4.4629495e-03  3.3151123e-03  4.6509290e-03
  -4.8521687e-03  6.7615538e-04  3.1034781e-03  2.6369948e-05  4.1454583e-03
  -3.6932561e-03 -1.8769916e-03 -2.1958587e-04  6.3395966e-04 -2.4969708e-03 ]
Word2Vec(vocab=20, size=100, alpha=0.025)
We can also explore the word embedding with visualisation. It can be done by using a classical projection method (like PCA) to reduce the high-dimensional word vectors to 2-D plots. Once reduced, we can then plot them on a graph.
First, we need to retrieve all the vectors from a trained model as follows −
Z = model[model.wv.vocab]
Next, we need to create a 2-D PCA model of the word vectors by using the PCA class from scikit-learn as follows −
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
result = pca.fit_transform(Z)
Now, we can plot the resulting projection by using matplotlib as follows −
from matplotlib import pyplot

pyplot.scatter(result[:, 0], result[:, 1])
We can also annotate the points on the graph with the words themselves. Annotate the resulting projection as follows −
words = list(model.wv.vocab)
for i, word in enumerate(words):
   pyplot.annotate(word, xy=(result[i, 0], result[i, 1]))
Following is the complete implementation example −

from gensim.models import Word2Vec
from sklearn.decomposition import PCA
from matplotlib import pyplot

# Training data: a few tokenised sentences held in memory.
sentences = [
   ['this', 'is', 'gensim', 'tutorial', 'for', 'free'],
   ['this', 'is', 'the', 'tutorialspoint', 'website'],
   ['you', 'can', 'read', 'technical', 'tutorials', 'for', 'free'],
   ['we', 'are', 'implementing', 'word2vec'],
   ['learn', 'full', 'gensim', 'tutorial']
]

# Train the model and collect every word vector (gensim 3.x API).
model = Word2Vec(sentences, min_count=1)
X = model[model.wv.vocab]

# Reduce the 100-dimensional vectors to 2-D with PCA.
pca = PCA(n_components=2)
result = pca.fit_transform(X)

# Scatter-plot the projection and label each point with its word.
pyplot.scatter(result[:, 0], result[:, 1])
words = list(model.wv.vocab)
for i, word in enumerate(words):
   pyplot.annotate(word, xy=(result[i, 0], result[i, 1]))
pyplot.show()