We have understood how to create dictionary from a list of documents and from text files (from one as well as from more than one). Now, in this section, we will create a bag-of-words (BoW) corpus. In order to work with Gensim, it is one of the most important objects we need to familiarise with. Basically, it is the corpus that contains the word id and its frequency in each document.
As discussed, in Gensim, the corpus contains the word id and its frequency in every document. We can create a BoW corpus from a simple list of documents and from text files. What we need to do is, to pass the tokenised list of words to the object named Dictionary.doc2bow(). So first, let’s start by creating BoW corpus using a simple list of documents.
In the following example, we will create BoW corpus from a simple list containing three sentences.
First, we need to import all the necessary packages as follows −
import gensim import pprint from gensim import corpora from gensim.utils import simple_preprocess
Now provide the list containing sentences. We have three sentences in our list −
doc_list = [ "Hello, how are you?", "How do you do?", "Hey what are you doing? yes you What are you doing?" ]
Next, do tokenisation of the sentences as follows −
doc_tokenized = [simple_preprocess(doc) for doc in doc_list]
Create an object of corpora.Dictionary() as follows −
dictionary = corpora.Dictionary()
Now pass these tokenised sentences to dictionary.doc2bow() objectas follows −
BoW_corpus = [dictionary.doc2bow(doc, allow_update=True) for doc in doc_tokenized]
At last we can print Bag of word corpus −
print(BoW_corpus)
[ [(0, 1), (1, 1), (2, 1), (3, 1)], [(2, 1), (3, 1), (4, 2)], [(0, 2), (3, 3), (5, 2), (6, 1), (7, 2), (8, 1)] ]
The above output shows that the word with id=0 appears once in the first document (because we have got (0,1) in the output) and so on.
The above output is somehow not possible for humans to read. We can also convert these ids to words but for this we need our dictionary to do the conversion as follows −
id_words = [[(dictionary[id], count) for id, count in line] for line in BoW_corpus] print(id_words)
[ [('are', 1), ('hello', 1), ('how', 1), ('you', 1)], [('how', 1), ('you', 1), ('do', 2)], [('are', 2), ('you', 3), ('doing', 2), ('hey', 1), ('what', 2), ('yes', 1)] ]
Now the above output is somehow human readable.
import gensim import pprint from gensim import corpora from gensim.utils import simple_preprocess doc_list = [ "Hello, how are you?", "How do you do?", "Hey what are you doing? yes you What are you doing?" ] doc_tokenized = [simple_preprocess(doc) for doc in doc_list] dictionary = corpora.Dictionary() BoW_corpus = [dictionary.doc2bow(doc, allow_update=True) for doc in doc_tokenized] print(BoW_corpus) id_words = [[(dictionary[id], count) for id, count in line] for line in BoW_corpus] print(id_words)
In the following example, we will be creating BoW corpus from a text file. For this, we have saved the document, used in previous example, in the text file named doc.txt..
Gensim will read the file line by line and process one line at a time by using simple_preprocess. In this way, it doesn’t need to load the complete file in memory all at once.
First, import the required and necessary packages as follows −
import gensim from gensim import corpora from pprint import pprint from gensim.utils import simple_preprocess from smart_open import smart_open import os
Next, the following line of codes will make read the documents from doc.txt and tokenised it −
doc_tokenized = [ simple_preprocess(line, deacc =True) for line in open(‘doc.txt’, encoding=’utf-8’) ] dictionary = corpora.Dictionary()
Now we need to pass these tokenized words into dictionary.doc2bow() object(as did in the previous example)
BoW_corpus = [ dictionary.doc2bow(doc, allow_update=True) for doc in doc_tokenized ] print(BoW_corpus)
[ [(9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1)], [ (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1) ], [ (23, 2), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1) ], [(3, 1), (18, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1)], [ (18, 1), (27, 1), (31, 2), (32, 1), (38, 1), (41, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 1), (52, 1) ] ]
The doc.txt file have the following content −
CNTK formerly known as Computational Network Toolkit is a free easy-to-use open-source commercial-grade toolkit that enable us to train deep learning algorithms to learn like the human brain.
You can find its free tutorial on howcodex.com also provide best technical tutorials on technologies like AI deep learning machine learning for free.
import gensim from gensim import corpora from pprint import pprint from gensim.utils import simple_preprocess from smart_open import smart_open import os doc_tokenized = [ simple_preprocess(line, deacc =True) for line in open(‘doc.txt’, encoding=’utf-8’) ] dictionary = corpora.Dictionary() BoW_corpus = [dictionary.doc2bow(doc, allow_update=True) for doc in doc_tokenized] print(BoW_corpus)
We can save the corpus with the help of following script −
corpora.MmCorpus.serialize(‘/Users/Desktop/BoW_corpus.mm’, bow_corpus)
#provide the path and the name of the corpus. The name of corpus is BoW_corpus and we saved it in Matrix Market format.
Similarly, we can load the saved corpus by using the following script −
corpus_load = corpora.MmCorpus(‘/Users/Desktop/BoW_corpus.mm’) for line in corpus_load: print(line)