Natural Language Toolkit - Text Classification



What is text classification?

Text classification, as the name implies, is the process of categorizing pieces of text or documents. But why do we need text classifiers? By examining the word usage in a document or piece of text, a classifier can decide which class label should be assigned to it.

Binary Classifier

As the name implies, a binary classifier decides between two labels, for example positive or negative. A piece of text or document can receive one label or the other, but not both.

Multi-label Classifier

In contrast to a binary classifier, a multi-label classifier can assign one or more labels to a piece of text or document.

Labeled Vs Unlabeled Feature set

A key-value mapping of feature names to feature values is called a feature set. Labeled feature sets (training data) are essential for training a classifier so that it can later classify unlabeled feature sets.

Labeled Feature Set

  • It is a tuple that looks like (feat, label).
  • It is an instance with a known class label.
  • It is used for training a classification algorithm.

Unlabeled Feature Set

  • It is the feat by itself, without an associated label.
  • Without an associated label, we simply call it an instance.
  • Once trained, a classification algorithm can classify an unlabeled feature set.
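
For instance, a labeled feature set is simply a (featureset, label) tuple, while an unlabeled feature set is the featureset dict by itself. A minimal sketch (the feature names below are made up for illustration) −

labeled_feat = ({'awesome': True, 'film': True}, 'pos')   # featureset plus a known label
unlabeled_feat = {'awesome': True, 'film': True}          # featureset only, label unknown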

Text Feature Extraction

Text feature extraction, as the name implies, is the process of transforming a list of words into a feature set that is usable by a classifier. We have to transform our text into ‘dict’ style feature sets because the Natural Language Toolkit (NLTK) expects ‘dict’ style feature sets.

Bag of Words (BoW) model

BoW, one of the simplest models in NLP, is used to extract features from a piece of text or document so that they can be used in modeling, for example in ML algorithms. It basically constructs a word-presence feature set from all the words of an instance. The concept behind this method is that it does not care how many times a word occurs or about the order of the words; it only cares whether a word is present in the list of words or not.

Example

For this example, we are going to define a function named bow() −

def bow(words):
   return dict([(word, True) for word in words])

We saved this function in a file named bagwords.py. Now, let us call the bow() function on a list of words −

from bagwords import bow
bow(['we', 'are', 'using', 'howcodex'])

Output

{'we': True, 'are': True, 'using': True, 'howcodex': True}
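
Note that because bow() only records presence, repeating a word or changing the order of the words produces an equivalent feature set −

bow(['howcodex', 'we', 'are', 'using', 'using'])
# {'howcodex': True, 'we': True, 'are': True, 'using': True}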

Training classifiers

In the previous sections, we learned how to extract features from text. So now we can train a classifier. The first and simplest classifier is the NaiveBayesClassifier class.

Naïve Bayes Classifier

The Naïve Bayes classifier uses Bayes theorem to predict the probability that a given feature set belongs to a particular label. The formula of Bayes theorem is as follows −

$$P(A|B)=\frac{P(B|A)P(A)}{P(B)}$$

Here,

P(A|B) − This is called the posterior probability, i.e. the probability of the first event A occurring given that the second event B has occurred.

P(B|A) − This is the probability of the second event B occurring given that the first event A has occurred.

P(A), P(B) − These are called the prior probabilities, i.e. the probability of the first event A or the second event B occurring on its own.
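
As a small numeric sketch (the probabilities below are made up purely for illustration), suppose the prior P(A) = 0.5, the likelihood P(B|A) = 0.2 and the evidence P(B) = 0.125; Bayes theorem then gives the posterior P(A|B) −

# Hypothetical probabilities, chosen only to illustrate the formula
p_a = 0.5          # P(A)
p_b_given_a = 0.2  # P(B|A)
p_b = 0.125        # P(B)
p_a_given_b = (p_b_given_a * p_a) / p_b
print(p_a_given_b) # 0.8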

To train the Naïve Bayes classifier, we will use the movie_reviews corpus from NLTK. This corpus has two categories of text, namely pos and neg. These categories make a classifier trained on them a binary classifier. Every file in the corpus belongs to one of these two categories, i.e. it is either a positive movie review or a negative movie review. In our example, we are going to use each file as a single instance for both training and testing the classifier.

Example

For training a classifier, we need a list of labeled feature sets, which will be in the form [(featureset, label)]. Here the featureset variable is a dict and label is the known class label for that featureset. We are going to create a function named label_feats_from_corpus() which takes a corpus, here movie_reviews, and also a function named feature_detector, which defaults to bag of words. It constructs and returns a mapping of the form {label: [featureset]}. After that we will use this mapping to create a list of labeled training instances and testing instances.

import collections
from bagwords import bow

def label_feats_from_corpus(corp, feature_detector=bow):
   # Map each category label to a list of feature sets, one per file
   label_feats = collections.defaultdict(list)
   for label in corp.categories():
      for fileid in corp.fileids(categories=[label]):
         feats = feature_detector(corp.words(fileids=[fileid]))
         label_feats[label].append(feats)
   return label_feats

With the help of the above function we will get a mapping of the form {label: [featureset]}. Now we are going to define one more function, named split_label_feats(), that takes the mapping returned by label_feats_from_corpus() and splits each list of feature sets into labeled training as well as testing instances.

def split_label_feats(lfeats, split=0.75):
   train_feats = []
   test_feats = []
   for label, feats in lfeats.items():
      # By default, the first 75% of each label's feature sets go to training
      cutoff = int(len(feats) * split)
      train_feats.extend([(feat, label) for feat in feats[:cutoff]])
      test_feats.extend([(feat, label) for feat in feats[cutoff:]])
   return train_feats, test_feats

We saved both functions in a file named featx.py. Now, let us use them on our corpus, i.e. movie_reviews −

from nltk.corpus import movie_reviews
from featx import label_feats_from_corpus, split_label_feats
movie_reviews.categories()

Output

['neg', 'pos']

Example

lfeats = label_feats_from_corpus(movie_reviews)
lfeats.keys()

Output

dict_keys(['neg', 'pos'])

Example

train_feats, test_feats = split_label_feats(lfeats, split = 0.75)
len(train_feats)

Output

1500

Example

len(test_feats)

Output

500

The movie_reviews corpus contains 1000 pos files and 1000 neg files. With a split of 0.75, we end up with 1500 labeled training instances and 500 labeled testing instances.

Now let us train the NaiveBayesClassifier class using its train() class method −

from nltk.classify import NaiveBayesClassifier
NBC = NaiveBayesClassifier.train(train_feats)
NBC.labels()

Output

['neg', 'pos']
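
The trained classifier can now label unlabeled feature sets. As a quick sketch (the review words below are made up), classify() returns either 'neg' or 'pos' for a new bag of words, and NaiveBayesClassifier additionally provides show_most_informative_features() to print the features that most strongly distinguish the two labels −

NBC.classify(bow(['a', 'truly', 'wonderful', 'movie']))
NBC.show_most_informative_features(5)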

Decision Tree Classifier

Another important classifier is the decision tree classifier. To train it, the DecisionTreeClassifier class builds a tree structure in which each node corresponds to a feature name and the branches correspond to the feature values. Following the branches down, we reach the leaves of the tree, i.e. the classification labels.

To train the decision tree classifier, we will use the same training and testing features, i.e. the train_feats and test_feats variables we created from the movie_reviews corpus.

Example

To train this classifier, we will call the DecisionTreeClassifier.train() class method as follows −

from nltk.classify import DecisionTreeClassifier
from nltk.classify.util import accuracy

# binary = True because every feature is binary (a word is present or not);
# the cutoffs control how far the tree refinement is allowed to go.
decisiontree_classifier = DecisionTreeClassifier.train(
   train_feats, binary = True, entropy_cutoff = 0.8,
   depth_cutoff = 5, support_cutoff = 30
)
accuracy(decisiontree_classifier, test_feats)

Output

0.725

Maximum Entropy Classifier

Another important classifier is MaxentClassifier, which is also known as a conditional exponential classifier or logistic regression classifier. To train it, the MaxentClassifier class converts labeled feature sets to vectors using encoding.

To train the maximum entropy classifier, we will use the same training and testing features, i.e. the train_feats and test_feats variables we created from the movie_reviews corpus.

Example

To train this classifier, we will call the MaxentClassifier.train() class method as follows −

from nltk.classify import MaxentClassifier
maxent_classifier = MaxentClassifier.train(
   train_feats, algorithm = 'gis', trace = 0, max_iter = 10, min_lldelta = 0.5
)
accuracy(maxent_classifier, test_feats)

Output

0.786
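
The algorithm parameter selects the optimization method. We used 'gis' (Generalized Iterative Scaling) above; NLTK also supports 'iis' (Improved Iterative Scaling), among others. A sketch with 'iis' and the same cutoffs would look like this (the resulting accuracy will differ from the value shown above) −

maxent_iis_classifier = MaxentClassifier.train(
   train_feats, algorithm = 'iis', trace = 0, max_iter = 10, min_lldelta = 0.5
)
accuracy(maxent_iis_classifier, test_feats)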

Scikit-learn Classifier

One of the best machine learning (ML) libraries is Scikit-learn. It contains all sorts of ML algorithms for various purposes, but they all follow the same fit design pattern −

  • Fit the model to the data
  • Use that model to make predictions

Rather than accessing scikit-learn models directly, here we are going to use NLTK’s SklearnClassifier class. This class is a wrapper class around a scikit-learn model to make it conform to NLTK’s Classifier interface.

We will follow the steps below to train a SklearnClassifier −

Step 1 − First we will create training features as we did in previous recipes.

Step 2 − Now, choose and import a Scikit-learn algorithm.

Step 3 − Next, we need to construct a SklearnClassifier class with the chosen algorithm.

Step 4 − Last, we will train the SklearnClassifier with our training features.

Let us implement these steps in the below Python recipe −

from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB
sklearn_classifier = SklearnClassifier(MultinomialNB())
sklearn_classifier.train(train_feats)
<SklearnClassifier(MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))>
accuracy(sklearn_classifier, test_feats)

Output

0.885
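
Because SklearnClassifier is only a wrapper, any scikit-learn estimator can be swapped in without changing the rest of the recipe. For example, a sketch using LogisticRegression instead of MultinomialNB (its accuracy is not shown here and will differ) −

from sklearn.linear_model import LogisticRegression
sklearn_lr_classifier = SklearnClassifier(LogisticRegression())
sklearn_lr_classifier.train(train_feats)
accuracy(sklearn_lr_classifier, test_feats)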

Measuring precision and recall

While training the various classifiers we also measured their accuracy. But apart from accuracy there are a number of other metrics which are used to evaluate classifiers. Two of these are precision and recall: for a given label, precision is the fraction of instances assigned that label which truly belong to it, and recall is the fraction of instances truly belonging to that label which the classifier actually assigned to it.
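
In terms of the true positives (TP), false positives (FP) and false negatives (FN) for a label, these metrics can be written as −

$$Precision=\frac{TP}{TP+FP}$$

$$Recall=\frac{TP}{TP+FN}$$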

Example

In this example, we are going to calculate the precision and recall of the NaiveBayesClassifier we trained earlier. To achieve this we will create a function named metrics_PR() which takes two arguments: the trained classifier and the labeled test features. Both arguments are the same as those we passed while calculating the accuracy of the classifiers −

import collections
from nltk import metrics

def metrics_PR(classifier, testfeats):
   # refsets holds the known labels, testsets the labels predicted by the classifier
   refsets = collections.defaultdict(set)
   testsets = collections.defaultdict(set)
   for i, (feats, label) in enumerate(testfeats):
      refsets[label].add(i)
      observed = classifier.classify(feats)
      testsets[observed].add(i)
   precisions = {}
   recalls = {}
   for label in classifier.labels():
      precisions[label] = metrics.precision(refsets[label], testsets[label])
      recalls[label] = metrics.recall(refsets[label], testsets[label])
   return precisions, recalls

Let us call this function to find the precision and recall −

from metrics_classification import metrics_PR
nb_precisions, nb_recalls = metrics_PR(NBC, test_feats)
nb_precisions['pos']

Output

0.6713532466435213

Example

nb_precisions['neg']

Output

0.9676271186440678

Example

nb_recalls['pos']

Output

0.96

Example

nb_recalls['neg']

Output

0.478
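
Precision and recall can also be combined into a single score, the F-measure, which is their harmonic mean. A minimal sketch computed from the values above (NLTK also provides nltk.metrics.f_measure(), which works directly on the reference and test sets built inside metrics_PR()) −

# Harmonic mean of precision and recall for each label
f_pos = 2 * nb_precisions['pos'] * nb_recalls['pos'] / (nb_precisions['pos'] + nb_recalls['pos'])
f_neg = 2 * nb_precisions['neg'] * nb_recalls['neg'] / (nb_precisions['neg'] + nb_recalls['neg'])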

Combination of classifier and voting

Combining classifiers is one of the best ways to improve classification performance, and voting is one of the best ways to combine multiple classifiers. For voting we need an odd number of classifiers. In the following Python recipe we are going to combine three classifiers, namely the NaiveBayesClassifier class, the DecisionTreeClassifier class and the MaxentClassifier class.

To achieve this we are going to define a class named Voting_classifiers as follows −

import itertools
from nltk.classify import ClassifierI
from nltk.probability import FreqDist
class Voting_classifiers(ClassifierI):
   def __init__(self, *classifiers):
      self._classifiers = classifiers
      # Collect every label known to any of the wrapped classifiers
      self._labels = sorted(set(itertools.chain(*[c.labels() for c in classifiers])))
   def labels(self):
      return self._labels
   def classify(self, feats):
      # Each classifier casts one vote; the label with the most votes wins
      counts = FreqDist()
      for classifier in self._classifiers:
         counts[classifier.classify(feats)] += 1
      return counts.max()

Let us use this class to combine the three classifiers and find the accuracy −

from vote_classification import Voting_classifiers
combined_classifier = Voting_classifiers(NBC, decisiontree_classifier, maxent_classifier)
combined_classifier.labels()

Output

['neg', 'pos']

Example

accuracy(combined_classifier, test_feats)

Output

0.948

From the above output, we can see that the combined classifier achieved higher accuracy than any of the individual classifiers.
