Here, we shall learn about the core concepts of Gensim, with the main focus on vectors and models.
What if we want to infer the latent structure in our corpus? For this, we need to represent the documents in such a way that we can manipulate them mathematically. One popular approach is to represent every document in the corpus as a vector of features. That’s why we can say that a vector is a mathematically convenient representation of a document.
To give you an example, let’s represent each feature of a document as a question-answer (Q-A) pair −
Q − How many times does the word Hello appear in the document?
A − Zero (0).
Q − How many paragraphs are there in the document?
A − Two (2).
The question is generally represented by its integer id, hence the representation of this document becomes a series of pairs like (1, 0.0), (2, 2.0). Such a vector representation is known as a dense vector. Why dense? Because it comprises an explicit answer to every one of the questions above.
The representation can be as simple as (0, 2) if we know all the questions in advance. Such a sequence of answers (provided the questions are known in advance) is the vector for our document.
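As an aside of ours (not part of the original tutorial), here is a minimal Python sketch of this idea, using a hypothetical two-paragraph document: once the questions are fixed in advance, the document reduces to the ordered sequence of its answers −
# Hypothetical two-paragraph document; paragraphs separated by a blank line
document = "First paragraph.\n\nSecond paragraph."

# Question 1: how many times does a given word appear in the document?
def count_word(doc, word):
    return doc.lower().split().count(word.lower())

# Question 2: how many paragraphs are there in the document?
def count_paragraphs(doc):
    return len([p for p in doc.split("\n\n") if p.strip()])

dense_vector = (count_word(document, "Hello"), count_paragraphs(document))
print(dense_vector)   # (0, 2), matching the answers above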
Another popular kind of representation is the bag-of-words (BoW) model. In this approach, each document is represented by a vector containing the frequency count of every word in the dictionary.
To give you an example, suppose we have a dictionary that contains the words [‘Hello’, ‘How’, ‘are’, ‘you’]. A document consisting of the string “How are you how” would then be represented by the vector [0, 2, 1, 1]. Here, the entries of the vector follow the order of the dictionary words “Hello”, “How”, “are”, and “you”.
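To make the counting concrete, here is a hand-rolled sketch of the same example in plain Python (no Gensim involved); vocab stands in for the dictionary above −
# Count how often each vocabulary word occurs in the document
vocab = ['Hello', 'How', 'are', 'you']
document = "How are you how"

tokens = document.lower().split()
bow_vector = [tokens.count(word.lower()) for word in vocab]
print(bow_vector)   # [0, 2, 1, 1]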
From the above explanation of vectors, the distinction between a document and a vector should be almost clear. But to make it explicit: a document is text, and a vector is a mathematically convenient representation of that text. Unfortunately, many people use these terms interchangeably.
For example, given some arbitrary document A, instead of saying “the vector that corresponds to document A”, they say “the vector A” or “the document A”. This leads to great ambiguity. One more important thing to note here is that two different documents may have the same vector representation; under the BoW model, for instance, “How are you how” and “How how are you” map to exactly the same vector.
Before taking up an implementation example of converting a corpus into a list of vectors, we need to associate each word in the corpus with a unique integer ID. For this, we will extend the example from the previous chapter, which built a tokenised corpus named processed_corpus.
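A note for readers jumping in here: processed_corpus is the list of tokenised documents built in the previous chapter. The exact document strings below are our assumption, not taken from this page, but preprocessing them as shown reproduces the 25-token dictionary and the bag-of-words corpus printed later in this section −
# Assumed stand-in for the previous chapter's preprocessing: five short
# documents, lower-cased, with common stopwords removed
documents = [
    "A survey of user opinion of computer system response time",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
]
stoplist = set('a and for in of the to'.split())
processed_corpus = [
    [word for word in document.lower().split() if word not in stoplist]
    for document in documents
]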
from gensim import corpora

# processed_corpus is the tokenised corpus built in the previous chapter
dictionary = corpora.Dictionary(processed_corpus)
print(dictionary)
Dictionary(25 unique tokens: ['computer', 'opinion', 'response', 'survey', 'system']...)
This shows that our corpus contains 25 unique tokens, all of which are stored in the gensim.corpora.Dictionary.
We can use the dictionary to turn tokenised documents into these 25-dimensional vectors as follows −
import pprint
pprint.pprint(dictionary.token2id)
{'binary': 11,
 'computer': 0,
 'error': 7,
 'generation': 12,
 'graph': 16,
 'intersection': 17,
 'iv': 19,
 'measurement': 8,
 'minors': 20,
 'opinion': 1,
 'ordering': 21,
 'paths': 18,
 'perceived': 9,
 'quasi': 22,
 'random': 13,
 'relation': 10,
 'response': 2,
 'survey': 3,
 'system': 4,
 'time': 5,
 'trees': 14,
 'unordered': 15,
 'user': 6,
 'well': 23,
 'widths': 24}
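As a brief aside of ours (the example document is hypothetical), doc2bow records only tokens that are present in the dictionary, so unseen words simply disappear from the vector −
# 'interaction' and 'with' are not in the dictionary, so they are dropped;
# 'computer' has id 0 and 'trees' has id 14
new_doc = "computer interaction with trees"
print(dictionary.doc2bow(new_doc.lower().split()))
# [(0, 1), (14, 1)]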
And similarly, we can create the bag-of-words representation for every document in the corpus as follows −
BoW_corpus = [dictionary.doc2bow(text) for text in processed_corpus]
pprint.pprint(BoW_corpus)
[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1)],
 [(2, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1)],
 [(11, 1), (12, 1), (13, 1), (14, 1), (15, 1)],
 [(14, 1), (16, 1), (17, 1), (18, 1)],
 [(14, 1), (16, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1)]]
Once we have vectorised the corpus, what next? Now we can transform it using models. The term model refers to an algorithm used for transforming one document representation into another.
As we have discussed, documents in Gensim are represented as vectors, hence we can think of a model as a transformation between two vector spaces. There is always a training phase in which a model learns the details of this transformation; the model reads the training corpus during that phase.
Let’s initialise the tf-idf model. This model transforms vectors from the BoW (Bag of Words) representation to another vector space, where the frequency counts are weighted according to the relative rarity of each word in the corpus.
In the following example, we are going to initialise the tf-idf model. We will train it on our corpus and then transform the string “trees graph”.
from gensim import models

# Train the tf-idf model on the bag-of-words corpus
tfidf = models.TfidfModel(BoW_corpus)
words = "trees graph".lower().split()
print(tfidf[dictionary.doc2bow(words)])
[(14, 0.4869354917707381), (16, 0.8734379353188121)]
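Where do these two numbers come from? As a quick check of ours (not in the original tutorial), Gensim’s default tf-idf scheme weights each term by log2(total documents / documents containing the term) and then L2-normalises the vector. Reading the document frequencies off BoW_corpus above (‘trees’, id 14, occurs in three documents; ‘graph’, id 16, in two), the arithmetic reproduces the printed weights −
from math import log2, sqrt

num_docs = 5
w_trees = 1 * log2(num_docs / 3)   # 'trees' appears in 3 of the 5 documents
w_graph = 1 * log2(num_docs / 2)   # 'graph' appears in 2 of the 5 documents
norm = sqrt(w_trees ** 2 + w_graph ** 2)
print(w_trees / norm, w_graph / norm)   # ~0.4869 and ~0.8734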
Now, once we have created the model, we can transform the whole corpus via tfidf and index it, and then query the similarity of our query document (here the query document is ‘trees system’) against each document in the corpus −
from gensim import similarities

# The index needs one feature per token in the dictionary (25, not 5)
index = similarities.SparseMatrixSimilarity(tfidf[BoW_corpus], num_features=25)
query_document = 'trees system'.split()
query_bow = dictionary.doc2bow(query_document)
sims = index[tfidf[query_bow]]
print(list(enumerate(sims)))
[(0, 0.0), (1, 0.0), (2, 1.0), (3, 0.4869355), (4, 0.4869355)]
From the above output, the fourth and fifth documents (indices 3 and 4) each have a similarity score of around 49%, while the third document (index 2) matches the query exactly.
Moreover, we can also sort this output for better readability as follows −
for doc_number, score in sorted(enumerate(sims), key=lambda x: x[1], reverse=True):
    print(doc_number, score)
2 1.0
3 0.4869355
4 0.4869355
0 0.0
1 0.0