As the name implies, a unigram tagger is a tagger that uses only a single word as its context for determining the POS (Part-of-Speech) tag. In other words, UnigramTagger is a context-based tagger whose context is a single word, i.e., a unigram.
NLTK provides a class named UnigramTagger for this purpose. But before diving deep into its working, let us understand the hierarchy with the help of the following diagram −
From the above diagram, it is understood that UnigramTagger inherits from NgramTagger, which is a subclass of ContextTagger, which in turn inherits from SequentialBackoffTagger.
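If you want to verify this hierarchy yourself, a quick sketch like the one below can be used (it assumes the classes are imported from nltk.tag.sequential, where they are defined in current NLTK versions) −
from nltk.tag.sequential import (
   UnigramTagger, NgramTagger, ContextTagger, SequentialBackoffTagger
)
# Each check confirms one link in the inheritance chain from the diagram.
print(issubclass(UnigramTagger, NgramTagger))               # True
print(issubclass(NgramTagger, ContextTagger))               # True
print(issubclass(ContextTagger, SequentialBackoffTagger))   # True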
The working of UnigramTagger is explained with the help of the following steps −
As we have seen, UnigramTagger inherits from ContextTagger, which implements a context() method. This context() method takes the same three arguments as the choose_tag() method.
The return value of the context() method is the word token, which is used as the key when building the model. Once the model is created, the word token is also used to look up the best tag.
In this way, UnigramTagger will build a context model from the list of tagged sentences.
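As a rough, simplified sketch (not NLTK's actual source code; the function names unigram_context and build_context_model are made up for illustration), the idea can be pictured like this: the context key is simply the current word, and the model keeps the most frequent tag seen for each word −
from collections import Counter, defaultdict

def unigram_context(tokens, index, history):
    # For a unigram tagger, the context is just the word at this position.
    return tokens[index]

def build_context_model(tagged_sentences):
    # Count how often each word (context key) appears with each tag ...
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        words = [word for word, tag in sentence]
        for index, (word, tag) in enumerate(sentence):
            counts[unigram_context(words, index, None)][tag] += 1
    # ... and keep the single most frequent tag for every word.
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}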
NLTK’s UnigramTagger can be trained by providing a list of tagged sentences at the time of initialization. In the example below, we are going to use the tagged sentences of the treebank corpus. We will be using the first 2500 sentences from that corpus.
First, import the UnigramTagger class from nltk −
from nltk.tag import UnigramTagger
Next, import the corpus you want to use. Here we are using the treebank corpus −
from nltk.corpus import treebank
Now, take the sentences to be used for training. We are taking the first 2500 tagged sentences from the treebank corpus −
train_sentences = treebank.tagged_sents()[:2500]
Next, train a UnigramTagger on these training sentences −
Uni_tagger = UnigramTagger(train_sentences)
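Once trained, the tagger can be applied to any plain list of tokens via the standard tag() method. For instance, tagging the first (untagged) treebank sentence should produce something roughly like the output hinted at in the comment −
Uni_tagger.tag(treebank.sents()[0])
# e.g. [('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ...]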
Next, take some sentences for testing. Here we are evaluating the tagger on the tagged sentences from index 1500 onwards −
test_sentences = treebank.tagged_sents()[1500:]
Uni_tagger.evaluate(test_sentences)
0.8942306156033808
Here, we got around 89 percent accuracy for a tagger that uses a single-word lookup to determine the POS tag.
The complete implementation example is given below −
from nltk.tag import UnigramTagger
from nltk.corpus import treebank
train_sentences = treebank.tagged_sents()[:2500]
Uni_tagger = UnigramTagger(train_sentences)
test_sentences = treebank.tagged_sents()[1500:]
Uni_tagger.evaluate(test_sentences)
0.8942306156033808
From the above diagram showing the hierarchy for UnigramTagger, we know that all the taggers inheriting from ContextTagger can take a pre-built model instead of training their own. This pre-built model is simply a Python dictionary mapping a context key to a tag. For UnigramTagger, the context keys are individual words, while for other NgramTagger subclasses they are tuples.
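To see this difference in context keys directly, one option (a sketch that builds the taggers with empty models purely to call their context() method; the exact tuple layout may differ across NLTK versions) is the following −
from nltk.tag import UnigramTagger, BigramTagger
from nltk.corpus import treebank

tokens = treebank.sents()[0]            # ['Pierre', 'Vinken', ',', '61', ...]
uni = UnigramTagger(model={})           # empty models; we only call context()
bi = BigramTagger(model={})

# UnigramTagger uses the word itself as the context key ...
print(uni.context(tokens, 1, ['NNP']))  # 'Vinken'
# ... while NgramTagger subclasses use a tuple that also carries the
# preceding tag(s) passed in as history.
print(bi.context(tokens, 1, ['NNP']))   # (('NNP',), 'Vinken')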
We can override this context model by passing another simple model to the UnigramTagger class instead of passing a training set. Let us understand it with the help of an easy example below −
from nltk.tag import UnigramTagger
from nltk.corpus import treebank
Override_tagger = UnigramTagger(model={'Vinken': 'NN'})
Override_tagger.tag(treebank.sents()[0])
[('Pierre', None),
 ('Vinken', 'NN'),
 (',', None),
 ('61', None),
 ('years', None),
 ('old', None),
 (',', None),
 ('will', None),
 ('join', None),
 ('the', None),
 ('board', None),
 ('as', None),
 ('a', None),
 ('nonexecutive', None),
 ('director', None),
 ('Nov.', None),
 ('29', None),
 ('.', None)]
As our model contains ‘Vinken’ as the only context key, you can observe from the output above that only this word got a tag and every other word has None as its tag.
For deciding which tag is most likely for a given context, the ContextTagger class uses frequency of occurrence. By default it will do this even if a context word and tag occur only once, but we can set a minimum frequency threshold by passing a cutoff value to the UnigramTagger class. Word–tag combinations that do not occur more than the cutoff number of times are left out of the model, which typically lowers accuracy because more words end up untagged. In the example below, we pass a cutoff value while training a UnigramTagger as in the previous example −
from nltk.tag import UnigramTagger
from nltk.corpus import treebank
train_sentences = treebank.tagged_sents()[:2500]
Uni_tagger = UnigramTagger(train_sentences, cutoff=4)
test_sentences = treebank.tagged_sents()[1500:]
Uni_tagger.evaluate(test_sentences)
0.7357651629613641