Gensim - Vector & Model


Advertisements

Here, we shall learn about the core concepts of Gensim, with main focus on the vector and the model.

What is Vector?

What if we want to infer the latent structure in our corpus? For this, we need to represent the documents in a such a way that we can manipulate the same mathematically. One popular kind of representation is to represent every document of corpus as a vector of features. That’s why we can say that vector is a mathematical convenient representation of a document.

To give you an example, let’s represent a single feature, of our above used corpus, as a Q-A pair −

Q − How many times does the word Hello appear in the document?

A − Zero(0).

Q − How many paragraphs are there in the document?

A − Two(2)

The question is generally represented by its integer id, hence the representation of this document is a series of pairs like (1, 0.0), (2, 2.0). Such vector representation is known as a dense vector. Why dense, because it comprises an explicit answer to all the questions written above.

The representation can be a simple like (0, 2), if we know all the questions in advance. Such sequence of the answers (of course if the questions are known in advance) is the vector for our document.

Another popular kind of representation is the bag-of-word (BoW) model. In this approach, each document is basically represented by a vector containing the frequency count of every word in the dictionary.

To give you an example, suppose we have a dictionary that contains the words [‘Hello’, ‘How’, ‘are’, ‘you’]. A document consisting of the string “How are you how” would then be represented by the vector [0, 2, 1, 1]. Here, the entries of the vector are in order of the occurrences of “Hello”, “How”, “are”, and “you”.

Vector Versus Document

From the above explanation of vector, the distinction between a document and a vector is almost understood. But, to make it clearer, document is text and vector is a mathematically convenient representation of that text. Unfortunately, sometimes many people use these terms interchangeably.

For example, suppose we have some arbitrary document A then instead of saying, “the vector that corresponds to document A”, they used to say, “the vector A” or “the document A”. This leads to great ambiguity. One more important thing to be noted here is that, two different documents may have the same vector representation.

Converting corpus into list of vectors

Before taking an implementation example of converting corpus into the list of vectors, we need to associate each word in the corpus with a unique integer ID. For this, we will be extending the example taken in above chapter.

Example

from gensim import corpora
dictionary = corpora.Dictionary(processed_corpus)
print(dictionary)

Output

Dictionary(25 unique tokens: ['computer', 'opinion', 'response', 'survey', 'system']...)

It shows that in our corpus there are 25 different tokens in this gensim.corpora.Dictionary.

Implementation Example

We can use the dictionary to turn tokenised documents into these 5-diemsional vectors as follows −

pprint.pprint(dictionary.token2id)

Output

{
   'binary': 11,
   'computer': 0,
   'error': 7,
   'generation': 12,
   'graph': 16,
   'intersection': 17,
   'iv': 19,
   'measurement': 8,
   'minors': 20,
   'opinion': 1,
   'ordering': 21,
   'paths': 18,
   'perceived': 9,
   'quasi': 22,
   'random': 13,
   'relation': 10,
   'response': 2,
   'survey': 3,
   'system': 4,
   'time': 5,
   'trees': 14,
   'unordered': 15,
   'user': 6,
   'well': 23,
   'widths': 24
}

And similarly, we can create the bag-of-word representation for a document as follows −

BoW_corpus = [dictionary.doc2bow(text) for text in processed_corpus]
pprint.pprint(BoW_corpus)

Output

[
   [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1)],
   [(2, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1)],
   [(11, 1), (12, 1), (13, 1), (14, 1), (15, 1)],
   [(14, 1), (16, 1), (17, 1), (18, 1)],
   [(14, 1), (16, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1)]
]

What is Model?

Once we have vectorised the corpus, next what? Now, we can transform it using models. Model may be referred to an algorithm used for transforming one document representation to other.

As we have discussed, documents, in Gensim, are represented as vectors hence, we can, though model as a transformation between two vector spaces. There is always a training phase where models learn the details of such transformations. The model reads the training corpus during training phase.

Initializing a Model

Let’s initialise tf-idf model. This model transforms vectors from the BoW (Bag of Words) representation to another vector space where the frequency counts are weighted according to the relative rarity of every word in corpus.

Implementation Example

In the following example, we are going to initialise the tf-idf model. We will train it on our corpus and then transform the string “trees graph”.

Example

from gensim import models
tfidf = models.TfidfModel(BoW_corpus)
words = "trees graph".lower().split()
print(tfidf[dictionary.doc2bow(words)])

Output

[(3, 0.4869354917707381), (4, 0.8734379353188121)]

Now, once we created the model, we can transform the whole corpus via tfidf and index it, and query the similarity of our query document (we are giving the query document ‘trees system’) against each document in the corpus −

Example

from gensim import similarities
index = similarities.SparseMatrixSimilarity(tfidf[BoW_corpus],num_features=5)
query_document = 'trees system'.split()
query_bow = dictionary.doc2bow(query_document)
simils = index[tfidf[query_bow]]
print(list(enumerate(simils)))

Output

[(0, 0.0), (1, 0.0), (2, 1.0), (3, 0.4869355), (4, 0.4869355)]

From the above output, document 4 and document 5 has a similarity score of around 49%.

Moreover, we can also sort this output for more readability as follows −

Example

for doc_number, score in sorted(enumerate(sims), key=lambda x: x[1], reverse=True):
   print(doc_number, score)

Output

2 1.0
3 0.4869355
4 0.4869355
0 0.0
1 0.0
Advertisements