Gensim - Transformations

Transforming Documents

Transforming documents means to represent the document in such a way that the document can be manipulated mathematically. Apart from deducing the latent structure of the corpus, transforming documents will also serve the following goals −

It discovers the relationship between words.
It brings out the hidden structure in the corpus.
It describes the documents in a new and more semantic way.
It makes the representation of the documents more compact.
It improves efficiency because new representation consumes less resources.
It improves efficacy because in new representation marginal data trends are ignored.
The noise is also reduced in new document representation.

Let’s see the implementation steps for transforming the documents from one vector space representation to another.

Implementation Steps

In order to transform documents, we must follow the following steps −

Step 1: Creating the Corpus

The very first and basic step is to create the corpus from the documents. We have already created the corpus in previous examples. Let’s create another one with some enhancements (removing common words and the words that appear only once) −

import gensim
import pprint
from collections import defaultdict
from gensim import corpora

Now provide the documents for creating the corpus −

t_corpus = ["CNTK formerly known as Computational Network Toolkit", "is a free easy-to-use open-source commercial-grade toolkit", "that enable us to train deep learning algorithms to learn like the human brain.", "You can find its free tutorial on howcodex.com", "Howcodex.com also provide best technical tutorials on technologies like AI deep learning machine learning for free"]

Next, we need to do tokenise and along with it we will remove the common words also −

stoplist = set('for a of the and to in'.split(' '))
processed_corpus = [
   [
      word for word in document.lower().split() if word not in stoplist
   ]
	for document in t_corpus
]

Following script will remove those words that appear only −

frequency = defaultdict(int)
for text in processed_corpus:
   for token in text:
      frequency[token] += 1
   processed_corpus = [
      [token for token in text if frequency[token] > 1] 
      for text in processed_corpus
   ]
pprint.pprint(processed_corpus)

Output

[
   ['toolkit'],
   ['free', 'toolkit'],
   ['deep', 'learning', 'like'],
   ['free', 'on', 'howcodex.com'],
   ['howcodex.com', 'on', 'like', 'deep', 'learning', 'learning', 'free']
]

Now pass it to the corpora.dictionary() object to get the unique objects in our corpus −

dictionary = corpora.Dictionary(processed_corpus)
print(dictionary)

Output

Dictionary(7 unique tokens: ['toolkit', 'free', 'deep', 'learning', 'like']...)

Next, the following line of codes will create the Bag of Word model for our corpus −

BoW_corpus = [dictionary.doc2bow(text) for text in processed_corpus]
pprint.pprint(BoW_corpus)

Output

[
   [(0, 1)],
   [(0, 1), (1, 1)],
   [(2, 1), (3, 1), (4, 1)],
   [(1, 1), (5, 1), (6, 1)],
   [(1, 1), (2, 1), (3, 2), (4, 1), (5, 1), (6, 1)]
]

Step 2: Creating a Transformation

The transformations are some standard Python objects. We can initialize these transformations i.e. Python objects by using a trained corpus. Here we are going to use tf-idf model to create a transformation of our trained corpus i.e. BoW_corpus.

First, we need to import the models package from gensim.

from gensim import models

Now, we need to initialise the model as follows −

tfidf = models.TfidfModel(BoW_corpus)

Step 3: Transforming Vectors

Now, in this last step, the vectors will be converted from old representation to new representation. As we have initialised the tfidf model in above step, the tfidf will now be treated as a read only object. Here, by using this tfidf object we will convert our vector from bag of word representation (old representation) to Tfidf real-valued weights (new representation).

doc_BoW = [(1,1),(3,1)]
print(tfidf[doc_BoW]

Output

[(1, 0.4869354917707381), (3, 0.8734379353188121)]

We applied the transformation on two values of corpus, but we can also apply it to the whole corpus as follows −

corpus_tfidf = tfidf[BoW_corpus]
for doc in corpus_tfidf:
   print(doc)

Output

[(0, 1.0)]
[(0, 0.8734379353188121), (1, 0.4869354917707381)]
[(2, 0.5773502691896257), (3, 0.5773502691896257), (4, 0.5773502691896257)]
[(1, 0.3667400603126873), (5, 0.657838022678017), (6, 0.657838022678017)]
[
   (1, 0.19338287240886842), (2, 0.34687949360312714), (3, 0.6937589872062543), 
   (4, 0.34687949360312714), (5, 0.34687949360312714), (6, 0.34687949360312714)
]

Complete Implementation Example

import gensim
import pprint
from collections import defaultdict
from gensim import corpora
t_corpus = [
   "CNTK formerly known as Computational Network Toolkit", 
   "is a free easy-to-use open-source commercial-grade toolkit", 
   "that enable us to train deep learning algorithms to learn like the human brain.", 
   "You can find its free tutorial on howcodex.com", 
   "Howcodex.com also provide best technical tutorials on 
   technologies like AI deep learning machine learning for free"
]
stoplist = set('for a of the and to in'.split(' '))
processed_corpus = [
   [word for word in document.lower().split() if word not in stoplist]
   for document in t_corpus
]
frequency = defaultdict(int)
for text in processed_corpus:
   for token in text:
      frequency[token] += 1
   processed_corpus = [
      [token for token in text if frequency[token] > 1] 
      for text in processed_corpus
   ]
pprint.pprint(processed_corpus)
dictionary = corpora.Dictionary(processed_corpus)
print(dictionary)
BoW_corpus = [dictionary.doc2bow(text) for text in processed_corpus]
pprint.pprint(BoW_corpus)
   from gensim import models
   tfidf = models.TfidfModel(BoW_corpus)
   doc_BoW = [(1,1),(3,1)]
   print(tfidf[doc_BoW])
   corpus_tfidf = tfidf[BoW_corpus]
   for doc in corpus_tfidf:
print(doc)

Various Transformations in Gensim

Using Gensim, we can implement various popular transformations, i.e. Vector Space Model algorithms. Some of them are as follows −

Tf-Idf(Term Frequency-Inverse Document Frequency)

During initialisation, this tf-idf model algorithm expects a training corpus having integer values (such as Bag-of-Words model). Then after that, at the time of transformation, it takes a vector representation and returns another vector representation.

The output vector will have the same dimensionality but the value of the rare features (at the time of training) will be increased. It basically converts integer-valued vectors into real-valued vectors. Following is the syntax of Tf-idf transformation −

Model=models.TfidfModel(corpus, normalize=True)

LSI(Latent Semantic Indexing)

LSI model algorithm can transform document from either integer valued vector model (such as Bag-of-Words model) or Tf-Idf weighted space into latent space. The output vector will be of lower dimensionality. Following is the syntax of LSI transformation −

Model=models.LsiModel(tfidf_corpus, id2word=dictionary, num_topics=300)

LDA(Latent Dirichlet Allocation)

LDA model algorithm is another algorithm that transforms document from Bag-of-Words model space into a topic space. The output vector will be of lower dimensionality. Following is the syntax of LSI transformation −

Model=models.LdaModel(corpus, id2word=dictionary, num_topics=100)

Random Projections (RP)

RP, a very efficient approach, aims to reduce the dimensionality of vector space. This approach is basically approximate the Tf-Idf distances between the documents. It does this by throwing in a little randomness.

Model=models.RpModel(tfidf_corpus, num_topics=500)

Hierarchical Dirichlet Process (HDP)

HDP is a non-parametric Bayesian method which is a new addition to Gensim. We should have to take care while using it.

Model=models.HdpModel(corpus, id2word=dictionary

Previous Page Print Page