Doc2Vec model, as opposite to Word2Vec model, is used to create a vectorised representation of a group of words taken collectively as a single unit. It doesn’t only give the simple average of the words in the sentence.
Here to create document vectors using Doc2Vec, we will be using text8 dataset which can be downloaded from gensim.downloader.
We can download the text8 dataset by using the following commands −
import gensim import gensim.downloader as api dataset = api.load("text8") data = [d for d in dataset]
It will take some time to download the text8 dataset.
In order to train the model, we need the tagged document which can be created by using models.doc2vec.TaggedDcument() as follows −
def tagged_document(list_of_list_of_words): for i, list_of_words in enumerate(list_of_list_of_words): yield gensim.models.doc2vec.TaggedDocument(list_of_words, [i]) data_for_training = list(tagged_document(data))
We can print the trained dataset as follows −
print(data_for_training [:1])
[TaggedDocument(words=['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including', 'the', 'diggers', 'of', 'the', 'english', 'revolution', 'and', 'the', 'sans', 'culottes', 'of', 'the', 'french', 'revolution', 'whilst', 'the', 'term', 'is', 'still', 'used', 'in', 'a', 'pejorative', 'way', 'to', 'describe', 'any', 'act', 'that', 'used', 'violent', 'means', 'to', 'destroy', 'the', 'organization', 'of', 'society', 'it', 'has', 'also', 'been' , 'taken', 'up', 'as', 'a', 'positive', 'label', 'by', 'self', 'defined', 'anarchists', 'the', 'word', 'anarchism', 'is', 'derived', 'from', 'the', 'greek', 'without', 'archons', 'ruler', 'chief', 'king', 'anarchism', 'as', 'a', 'political', 'philosophy', 'is', 'the', 'belief', 'that', 'rulers', 'are', 'unnecessary', 'and', 'should', 'be', 'abolished', 'although', 'there', 'are', 'differing', 'interpretations', 'of', 'what', 'this', 'means', 'anarchism', 'also', 'refers', 'to', 'related', 'social', 'movements', 'that', 'advocate', 'the', 'elimination', 'of', 'authoritarian', 'institutions', 'particularly', 'the', 'state', 'the', 'word', 'anarchy', 'as', 'most', 'anarchists', 'use', 'it', 'does', 'not', 'imply', 'chaos', 'nihilism', 'or', 'anomie', 'but', 'rather', 'a', 'harmonious', 'anti', 'authoritarian', 'society', 'in', 'place', 'of', 'what', 'are', 'regarded', 'as', 'authoritarian', 'political', 'structures', 'and', 'coercive', 'economic', 'institutions', 'anarchists', 'advocate', 'social', 'relations', 'based', 'upon', 'voluntary', 'association', 'of', 'autonomous', 'individuals', 'mutual', 'aid', 'and', 'self', 'governance', 'while', 'anarchism', 'is', 'most', 'easily', 'defined', 'by', 'what', 'it', 'is', 'against', 'anarchists', 'also', 'offer', 'positive', 'visions', 'of', 'what', 'they', 'believe', 'to', 'be', 'a', 'truly', 'free', 'society', 'however', 'ideas', 'about', 'how', 'an', 'anarchist', 'society', 'might', 'work', 'vary', 'considerably', 'especially', 'with', 'respect', 'to', 'economics', 'there', 'is', 'also', 'disagreement', 'about', 'how', 'a', 'free', 'society', 'might', 'be', 'brought', 'about', 'origins', 'and', 'predecessors', 'kropotkin', 'and', 'others', 'argue', 'that', 'before', 'recorded', 'history', 'human', 'society', 'was', 'organized', 'on', 'anarchist', 'principles', 'most', 'anthropologists', 'follow', 'kropotkin', 'and', 'engels', 'in', 'believing', 'that', 'hunter', 'gatherer', 'bands', 'were', 'egalitarian', 'and', 'lacked', 'division', 'of', 'labour', 'accumulated', 'wealth', 'or', 'decreed', 'law', 'and', 'had', 'equal', 'access', 'to', 'resources', 'william', 'godwin', 'anarchists', 'including', 'the', 'the', 'anarchy', 'organisation', 'and', 'rothbard', 'find', 'anarchist', 'attitudes', 'in', 'taoism', 'from', 'ancient', 'china', 'kropotkin', 'found', 'similar', 'ideas', 'in', 'stoic', 'zeno', 'of', 'citium', 'according', 'to', 'kropotkin', 'zeno', 'repudiated', 'the', 'omnipotence', 'of', 'the', 'state', 'its', 'intervention', 'and', 'regimentation', 'and', 'proclaimed', 'the', 'sovereignty', 'of', 'the', 'moral', 'law', 'of', 'the', 'individual', 'the', 'anabaptists', 'of', 'one', 'six', 'th', 'century', 'europe', 'are', 'sometimes', 'considered', 'to', 'be', 'religious', 'forerunners', 'of', 'modern', 'anarchism', 'bertrand', 'russell', 'in', 'his', 'history', 'of', 'western', 'philosophy', 'writes', 'that', 'the', 'anabaptists', 'repudiated', 'all', 'law', 'since', 'they', 'held', 'that', 'the', 'good', 'man', 'will', 'be', 'guided', 'at', 'every', 'moment', 'by', 'the', 'holy', 'spirit', 'from', 'this', 'premise', 'they', 'arrive', 'at', 'communism', 'the', 'diggers', 'or', 'true', 'levellers', 'were', 'an', 'early', 'communistic', 'movement', (truncated…)
Once trained we now need to initialise the model. it can be done as follows −
model = gensim.models.doc2vec.Doc2Vec(vector_size=40, min_count=2, epochs=30)
Now, build the vocabulary as follows −
model.build_vocab(data_for_training)
Now, let’s train the Doc2Vec model as follows −
model.train(data_for_training, total_examples=model.corpus_count, epochs=model.epochs)
Finally, we can analyse the output by using model.infer_vector() as follows −
print(model.infer_vector(['violent', 'means', 'to', 'destroy', 'the','organization']))
import gensim import gensim.downloader as api dataset = api.load("text8") data = [d for d in dataset] def tagged_document(list_of_list_of_words): for i, list_of_words in enumerate(list_of_list_of_words): yield gensim.models.doc2vec.TaggedDocument(list_of_words, [i]) data_for_training = list(tagged_document(data)) print(data_for_training[:1]) model = gensim.models.doc2vec.Doc2Vec(vector_size=40, min_count=2, epochs=30) model.build_vocab(data_training) model.train(data_training, total_examples=model.corpus_count, epochs=model.epochs) print(model.infer_vector(['violent', 'means', 'to', 'destroy', 'the','organization']))
[ -0.2556166 0.4829361 0.17081228 0.10879577 0.12525807 0.10077011 -0.21383236 0.19294572 0.11864349 -0.03227958 -0.02207291 -0.7108424 0.07165232 0.24221905 -0.2924459 -0.03543589 0.21840079 -0.1274817 0.05455418 -0.28968817 -0.29146606 0.32885507 0.14689675 -0.06913587 -0.35173815 0.09340707 -0.3803535 -0.04030455 -0.10004586 0.22192696 0.2384828 -0.29779273 0.19236489 -0.25727913 0.09140676 0.01265439 0.08077634 -0.06902497 -0.07175519 -0.22583418 -0.21653089 0.00347822 -0.34096122 -0.06176808 0.22885063 -0.37295452 -0.08222228 -0.03148199 -0.06487323 0.11387568 ]