Gensim - Creating a Dictionary



In the last chapter, where we discussed vectors and models, you got a first idea of the dictionary. Here, we discuss the Dictionary object in a bit more detail.

What is a Dictionary?

Before diving deep into the concept of the dictionary, let’s understand some simple NLP concepts −

  • Token − A token means a ‘word’.

  • Document − A document refers to a sentence or paragraph.

  • Corpus − A corpus refers to a collection of documents, here represented as bags of words (BoW).

For every document, a BoW corpus stores the id of each token that occurs in it, together with the token’s frequency count in that document.

Now let’s move to the concept of the dictionary in Gensim. To work on text documents, Gensim requires the words, i.e. tokens, to be converted to unique ids. For this, it provides the Dictionary object, which maps each word to a unique integer id. To build it, we convert the input text to lists of words and pass them to corpora.Dictionary().
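The word-to-id mapping described above can be pictured in plain Python. This is only a conceptual sketch, not Gensim’s implementation: build_token2id is a hypothetical helper, and it assigns ids in first-seen order, which may differ from the order Gensim uses.

```python
def build_token2id(tokenized_docs):
    """Map every distinct token to an integer id (first-seen order)."""
    token2id = {}
    for tokens in tokenized_docs:
        for token in tokens:
            if token not in token2id:
                # The next free id is simply the current size of the mapping.
                token2id[token] = len(token2id)
    return token2id


sentences = ["deep learning toolkit", "open source toolkit"]
tokenized = [sentence.split() for sentence in sentences]
mapping = build_token2id(tokenized)
print(mapping)
```

The repeated word "toolkit" receives a single id, which is the property the Dictionary object provides for a whole collection of documents.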

Need of Dictionary

Now the question arises: why do we need the dictionary object, and where is it used? In Gensim, the dictionary object is used to create a bag of words (BoW) corpus, which in turn is used as the input to topic modelling and other models.

Forms of Text Inputs

There are three different forms of input text that we can provide to Gensim −

  • As sentences stored in Python’s native list object (a list of str values in Python 3)

  • As one single text file (can be small or large one)

  • Multiple text files

Creating a Dictionary Using Gensim

As discussed, in Gensim the dictionary contains the mapping of all words, a.k.a. tokens, to their unique integer ids. We can create a dictionary from a list of sentences or from one or more text files (each containing multiple lines of text). So, let’s start by creating a dictionary from a list of sentences.

From a List of Sentences

In the following example, we create a dictionary from a list of sentences. When we have a list of sentences, we must first convert every sentence to a list of words; a list comprehension is one of the most common ways to do this.

Implementation Example

First, import the required packages as follows −

import gensim
from gensim import corpora
from pprint import pprint

Next, define the list of sentences (documents) from which we will create the dictionary −

doc = [
   "CNTK formerly known as Computational Network Toolkit",
   "is a free easy-to-use open-source commercial-grade toolkit",
   "that enable us to train deep learning algorithms to learn like the human brain."
]

Next, we need to split the sentences into words. This is called tokenisation.

text_tokens = [[token for token in sentence.split()] for sentence in doc]

Now, with the help of the following script, we can create the dictionary −

dict_LoS = corpora.Dictionary(text_tokens)

Now let’s print the dictionary to get some more information, such as the number of unique tokens it contains −

print(dict_LoS)

Output

Dictionary(27 unique tokens: ['CNTK', 'Computational', 'Network', 'Toolkit', 'as']...)

We can also see the mapping from each word to its unique integer id as follows −

print(dict_LoS.token2id)

Output

{
   'CNTK': 0, 'Computational': 1, 'Network': 2, 'Toolkit': 3, 'as': 4, 
   'formerly': 5, 'known': 6, 'a': 7, 'commercial-grade': 8, 'easy-to-use': 9,
   'free': 10, 'is': 11, 'open-source': 12, 'toolkit': 13, 'algorithms': 14,
   'brain.': 15, 'deep': 16, 'enable': 17, 'human': 18, 'learn': 19, 'learning': 20,
   'like': 21, 'that': 22, 'the': 23, 'to': 24, 'train': 25, 'us': 26
}

Complete Implementation Example

import gensim
from gensim import corpora
from pprint import pprint
doc = [
   "CNTK formerly known as Computational Network Toolkit",
   "is a free easy-to-use open-source commercial-grade toolkit",
   "that enable us to train deep learning algorithms to learn like the human brain."
]
text_tokens = [[token for token in sentence.split()] for sentence in doc]
dict_LoS = corpora.Dictionary(text_tokens)
print(dict_LoS.token2id)

From Single Text File

In the following example, we create a dictionary from a single text file. In a similar fashion, we can also create a dictionary from more than one text file (i.e. a directory of files).

For this, we have saved the document used in the previous example in a text file named doc.txt. The file is read line by line, and each line is processed with simple_preprocess as it is read. In this way, the complete file never has to be loaded into memory at once.

Implementation Example

First, import the required packages as follows −

import gensim
from gensim import corpora
from pprint import pprint
from gensim.utils import simple_preprocess
import os  # used later, when reading multiple files from a directory

The next lines of code build the Gensim dictionary from the single text file named doc.txt −

dict_STF = corpora.Dictionary(
   simple_preprocess(line, deacc=True) for line in open('doc.txt', encoding='utf-8')
)

Now let’s print the dictionary to get some more information, such as the number of unique tokens it contains −

print(dict_STF)

Output

Dictionary(28 unique tokens: ['as', 'cntk', 'computational', 'formerly', 'known']...)

The tokens differ from the list-of-sentences example because simple_preprocess lowercases the text, splits on punctuation such as hyphens, and drops one-character tokens such as 'a'.

We can also see the mapping from each word to its unique integer id as follows −

print(dict_STF.token2id)

Output

{
   'as': 0, 'cntk': 1, 'computational': 2, 'formerly': 3, 'known': 4,
   'network': 5, 'toolkit': 6, 'commercial': 7, 'easy': 8, 'free': 9,
   'grade': 10, 'is': 11, 'open': 12, 'source': 13, 'to': 14, 'use': 15,
   'algorithms': 16, 'brain': 17, 'deep': 18, 'enable': 19, 'human': 20,
   'learn': 21, 'learning': 22, 'like': 23, 'that': 24, 'the': 25,
   'train': 26, 'us': 27
}

Complete Implementation Example

import gensim
from gensim import corpora
from pprint import pprint
from gensim.utils import simple_preprocess
dict_STF = corpora.Dictionary(
   simple_preprocess(line, deacc=True) for line in open('doc.txt', encoding='utf-8')
)
print(dict_STF.token2id)

From Multiple Text Files

Now let’s create a dictionary from multiple files, i.e. more than one text file saved in the same directory. For this example, we have created three text files, namely first.txt, second.txt and third.txt, each containing one of the three lines of the text file (doc.txt) used in the previous example. All three text files are saved in a directory named ABC.

Implementation Example

To implement this, we need to define a class with a method that iterates through all three text files (first.txt, second.txt and third.txt) in the directory (ABC) and yields the processed list of word tokens for each line.

Let’s define the class named Read_files having a method named __iter__() as follows −

class Read_files(object):
   def __init__(self, directoryname):
      self.directoryname = directoryname
   def __iter__(self):
      # sorted() makes the file order, and therefore the ids, reproducible
      for fname in sorted(os.listdir(self.directoryname)):
         with open(os.path.join(self.directoryname, fname), encoding='latin') as f:
            for line in f:
               # yield one list of tokens per line
               yield simple_preprocess(line)

Next, we need to provide the path of the directory as follows −

path = "ABC"

#provide the path as per your computer system where you saved the directory.

The next steps are similar to the previous examples. The next lines of code build the Gensim dictionary from the directory containing the three text files −

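dict_MUL = corpora.Dictionary(Read_files(path))
print(dict_MUL)

Output

Dictionary(28 unique tokens: ['as', 'cntk', 'computational', 'formerly', 'known']...)

Because simple_preprocess lowercases the text, splits on punctuation and drops one-character tokens, the tokens are lowercase and hyphenated words are split. The ids below assume the files are read in the order first.txt, second.txt, third.txt; a different reading order changes which id each token receives.

Now we can also see the mapping from each word to its unique integer id as follows −

print(dict_MUL.token2id)

Output

{
   'as': 0, 'cntk': 1, 'computational': 2, 'formerly': 3, 'known': 4,
   'network': 5, 'toolkit': 6, 'commercial': 7, 'easy': 8, 'free': 9,
   'grade': 10, 'is': 11, 'open': 12, 'source': 13, 'to': 14, 'use': 15,
   'algorithms': 16, 'brain': 17, 'deep': 18, 'enable': 19, 'human': 20,
   'learn': 21, 'learning': 22, 'like': 23, 'that': 24, 'the': 25,
   'train': 26, 'us': 27
}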

Saving and Loading a Gensim Dictionary

Gensim provides its own native save() method to save a dictionary to disk and a load() method to load it back from disk.

For example, we can save a dictionary such as dict_LoS with the help of the following script −

dict_LoS.save(filename)

#provide the path where you want to save the dictionary.

Similarly, we can load the saved dictionary by using the load() method, which is called on the Dictionary class itself. The following script does this −

loaded_dict = corpora.Dictionary.load(filename)

#provide the path where you have saved the dictionary.
