OpenNLP - Finding Parts of Speech



Using OpenNLP, you can also detect the parts of speech of a given sentence and print them. Instead of the full names of the parts of speech, OpenNLP uses short forms (tags). The following table lists some of the tags produced by OpenNLP and their meanings.

Tag    Meaning
NN     Noun, singular or mass
DT     Determiner
VB     Verb, base form
VBD    Verb, past tense
VBZ    Verb, third person singular present
IN     Preposition or subordinating conjunction
NNP    Proper noun, singular
TO     to
JJ     Adjective
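
When reading tagger output, it can help to turn the short tags back into their meanings. The following is a minimal sketch that uses only the JDK; the class PosTagMeanings and its describe() method are hypothetical helpers, not part of OpenNLP, and the map holds only the tags listed in the table above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of OpenNLP): maps the tag abbreviations
// from the table above to their meanings.
class PosTagMeanings {

   private static final Map<String, String> MEANINGS = new LinkedHashMap<>();

   static {
      MEANINGS.put("NN", "Noun, singular or mass");
      MEANINGS.put("DT", "Determiner");
      MEANINGS.put("VB", "Verb, base form");
      MEANINGS.put("VBD", "Verb, past tense");
      MEANINGS.put("VBZ", "Verb, third person singular present");
      MEANINGS.put("IN", "Preposition or subordinating conjunction");
      MEANINGS.put("NNP", "Proper noun, singular");
      MEANINGS.put("TO", "to");
      MEANINGS.put("JJ", "Adjective");
   }

   // Returns the meaning of a tag, or the tag itself if it is not in the table.
   static String describe(String tag) {
      return MEANINGS.getOrDefault(tag, tag);
   }

   public static void main(String[] args) {
      for (String tag : new String[] {"NNP", "JJ", "TO", "VB"}) {
         System.out.println(tag + " = " + describe(tag));
      }
   }
}
```

Tags that are missing from the map are returned unchanged, so the helper stays usable when the tagger produces a tag that the table does not cover.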

Tagging the Parts of Speech

To tag the parts of speech of a sentence, OpenNLP uses a model, a file named en-pos-maxent.bin. This is a predefined model which is trained to tag the parts of speech of the given raw text.

The POSModel class of the opennlp.tools.postag package is used to load this model, and the POSTaggerME class of the same package uses it to tag the parts of speech of the given raw text. To do so, you need to −

  • Load the en-pos-maxent.bin model using the POSModel class.

  • Instantiate the POSTaggerME class.

  • Tokenize the sentence.

  • Generate the tags using tag() method.

  • Print the tokens and tags using POSSample class.

Following are the steps to write a program which tags the parts of speech in the given raw text using the POSTaggerME class.

Step 1: Load the model

The model for POS tagging is represented by the class named POSModel, which belongs to the package opennlp.tools.postag.

To load the POS model −

  • Create an InputStream object of the model (Instantiate the FileInputStream and pass the path of the model in String format to its constructor).

  • Instantiate the POSModel class and pass the InputStream (object) of the model as a parameter to its constructor, as shown in the following code block −

//Loading Parts of speech-maxent model 
InputStream inputStream = new FileInputStream("C:/OpenNLP_models/en-pos-maxent.bin"); 
POSModel model = new POSModel(inputStream); 

Step 2: Instantiating the POSTaggerME class

The POSTaggerME class of the package opennlp.tools.postag is used to predict the parts of speech of the given raw text. It uses Maximum Entropy to make its decisions.

Instantiate this class and pass the model object created in the previous step, as shown below −

//Instantiating POSTaggerME class 
POSTaggerME tagger = new POSTaggerME(model);

Step 3: Tokenizing the sentence

The tokenize() method of the WhitespaceTokenizer class (package opennlp.tools.tokenize) tokenizes the raw text passed to it by splitting it on whitespace. This method accepts a String as a parameter and returns an array of Strings (tokens).

Obtain the shared WhitespaceTokenizer.INSTANCE object and invoke this method, passing the sentence to it as a String.

//Tokenizing the sentence using WhitespaceTokenizer class  
WhitespaceTokenizer whitespaceTokenizer= WhitespaceTokenizer.INSTANCE; 
String[] tokens = whitespaceTokenizer.tokenize(sentence); 
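
To see what whitespace tokenization means for the tagger's input, the following sketch approximates it with the JDK alone. It is an illustration, not the implementation inside OpenNLP; the class WhitespaceSplitSketch and its split() method are hypothetical names. Punctuation stays attached to the neighbouring word, so a token such as "Hello," reaches the tagger with its comma.

```java
import java.util.Arrays;

// Hypothetical JDK-only illustration of whitespace tokenization.
// It approximates the idea; it is not OpenNLP's own implementation.
class WhitespaceSplitSketch {

   // Splits a sentence on runs of whitespace; blank input yields no tokens.
   static String[] split(String sentence) {
      String trimmed = sentence.trim();
      if (trimmed.isEmpty()) {
         return new String[0];
      }
      return trimmed.split("\\s+");
   }

   public static void main(String[] args) {
      // Plain words are separated cleanly.
      System.out.println(Arrays.toString(split("Hi welcome to Howcodex")));
      // Punctuation remains glued to the word next to it.
      System.out.println(Arrays.toString(split("Hello, world.")));
   }
}
```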

Step 4: Generating the tags

The tag() method of the POSTaggerME class assigns a POS tag to each token of the sentence. This method accepts an array of tokens (String) as a parameter and returns an array of tags of the same length.

Invoke the tag() method by passing the tokens generated in the previous step to it.

//Generating tags 
String[] tags = tagger.tag(tokens); 

Step 5: Printing the tokens and the tags

The POSSample class represents the POS-tagged sentence. To instantiate this class, we would require an array of tokens (of the text) and an array of tags.

The toString() method of this class returns the tagged sentence. Instantiate this class by passing the token and the tag arrays created in the previous steps and invoke its toString() method, as shown in the following code block.

//Instantiating the POSSample class 
POSSample sample = new POSSample(tokens, tags); 
System.out.println(sample.toString());

Example

Following is the program which tags the parts of speech in a given raw text. Save this program in a file with the name PosTaggerExample.java.

import java.io.FileInputStream; 
import java.io.InputStream;  

import opennlp.tools.postag.POSModel; 
import opennlp.tools.postag.POSSample; 
import opennlp.tools.postag.POSTaggerME; 
import opennlp.tools.tokenize.WhitespaceTokenizer;  

public class PosTaggerExample { 
  
   public static void main(String args[]) throws Exception{ 
    
      //Loading Parts of speech-maxent model       
      InputStream inputStream = new 
         FileInputStream("C:/OpenNLP_models/en-pos-maxent.bin"); 
      POSModel model = new POSModel(inputStream); 
       
      //Instantiating POSTaggerME class 
      POSTaggerME tagger = new POSTaggerME(model); 
       
      String sentence = "Hi welcome to Howcodex"; 
       
      //Tokenizing the sentence using WhitespaceTokenizer class  
      WhitespaceTokenizer whitespaceTokenizer= WhitespaceTokenizer.INSTANCE; 
      String[] tokens = whitespaceTokenizer.tokenize(sentence); 
       
      //Generating tags 
      String[] tags = tagger.tag(tokens);
      
      //Instantiating the POSSample class 
      POSSample sample = new POSSample(tokens, tags); 
      System.out.println(sample.toString()); 
   
   } 
}       

Compile and execute the saved Java file from the Command prompt using the following commands −

javac PosTaggerExample.java 
java PosTaggerExample 

On executing, the above program reads the given text, detects the part of speech of each token, and displays the tagged sentence, as shown below.

Hi_NNP welcome_JJ to_TO Howcodex_VB 
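
The printed line joins each token and its tag with an underscore and separates the pairs with spaces. If this text has to be processed further, it can be split back into pairs. The following is a minimal JDK-only sketch; the class TaggedSentenceParser and its parse() method are hypothetical helpers written for this illustration, and they assume the token_TAG format shown above.

```java
// Hypothetical helper: splits a "token_TAG token_TAG ..." line into pairs.
class TaggedSentenceParser {

   // Returns one {token, tag} row per whitespace-separated item.
   static String[][] parse(String tagged) {
      String[] items = tagged.trim().split("\\s+");
      String[][] pairs = new String[items.length][];
      for (int i = 0; i < items.length; i++) {
         // Use the last underscore so tokens that contain '_' stay intact.
         int sep = items[i].lastIndexOf('_');
         if (sep < 0) {
            throw new IllegalArgumentException("No tag separator in: " + items[i]);
         }
         pairs[i] = new String[] {
            items[i].substring(0, sep), items[i].substring(sep + 1)
         };
      }
      return pairs;
   }

   public static void main(String[] args) {
      for (String[] pair : parse("Hi_NNP welcome_JJ to_TO Howcodex_VB")) {
         System.out.println(pair[0] + " -> " + pair[1]);
      }
   }
}
```

Splitting at the last underscore is a design choice: a token that itself contains an underscore still yields the correct tag.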

POS Tagger Performance

Following is the program which tags the parts of speech of a given raw text. It also monitors the performance and displays the performance of the tagger. Save this program in a file with the name PosTagger_Performance.java.

import java.io.FileInputStream; 
import java.io.InputStream;  

import opennlp.tools.cmdline.PerformanceMonitor; 
import opennlp.tools.postag.POSModel; 
import opennlp.tools.postag.POSSample; 
import opennlp.tools.postag.POSTaggerME; 
import opennlp.tools.tokenize.WhitespaceTokenizer;  

public class PosTagger_Performance { 
   public static void main(String args[]) throws Exception{ 
      //Loading Parts of speech-maxent model       
      InputStream inputStream = new 
         FileInputStream("C:/OpenNLP_models/en-pos-maxent.bin"); 
      POSModel model = new POSModel(inputStream); 
       
      //Creating an object of WhitespaceTokenizer class  
      WhitespaceTokenizer whitespaceTokenizer= WhitespaceTokenizer.INSTANCE; 
      
      //Tokenizing the sentence 
      String sentence = "Hi welcome to Howcodex"; 
      String[] tokens = whitespaceTokenizer.tokenize(sentence); 
       
      //Instantiating POSTaggerME class 
      POSTaggerME tagger = new POSTaggerME(model); 
       
      //Generating tags 
      String[] tags = tagger.tag(tokens); 
       
      //Instantiating POSSample class       
      POSSample sample = new POSSample(tokens, tags); 
      System.out.println(sample.toString()); 
       
      //Monitoring the performance of POS tagger 
      PerformanceMonitor perfMon = new PerformanceMonitor(System.err, "sent"); 
      perfMon.start(); 
      perfMon.incrementCounter(); 
      perfMon.stopAndPrintFinalResult();      
   } 
}

Compile and execute the saved Java file from the Command prompt using the following commands −

javac PosTagger_Performance.java 
java PosTagger_Performance 

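On executing, the above program reads the given text, tags the part of speech of each token, and displays the tagged sentence. In addition, it prints the counters of the performance monitor. Note that in this listing the monitor is started after tagging has already finished, so the reported runtime and average are effectively zero.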

Hi_NNP welcome_JJ to_TO Howcodex_VB  
Average: 0.0 sent/s  
Total: 1 sent 
Runtime: 0.0s 
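
A throughput figure such as "sent/s" is the number of processed units divided by the elapsed time, so the timer has to be running while the work is done. The following JDK-only sketch shows that calculation; the class ThroughputSketch and its unitsPerSecond() method are hypothetical names, and the loop body is placeholder work standing in for a call such as tagger.tag(tokens).

```java
// Hypothetical JDK-only sketch of a throughput measurement.
class ThroughputSketch {

   // Units processed per second; returns 0.0 when no time has elapsed.
   static double unitsPerSecond(long count, long elapsedNanos) {
      if (elapsedNanos <= 0) {
         return 0.0;
      }
      return count / (elapsedNanos / 1_000_000_000.0);
   }

   public static void main(String[] args) {
      String[] sentences = {
         "Hi welcome to Howcodex",
         "This is a second sentence"
      };

      // Start timing before the work, and count each processed sentence.
      long start = System.nanoTime();
      long processed = 0;
      for (String sentence : sentences) {
         // Placeholder work; a real program would tag the sentence here.
         sentence.split("\\s+");
         processed++;
      }
      long elapsed = System.nanoTime() - start;

      System.out.printf("Total: %d sent%n", processed);
      System.out.printf("Average: %.1f sent/s%n", unitsPerSecond(processed, elapsed));
   }
}
```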

POS Tagger Probability

The probs() method of the POSTaggerME class returns the probability of each tag assigned to the most recently tagged sentence.

//Getting the probabilities of the tags assigned by the last call to tag() 
double[] probs = tagger.probs(); 

Following is the program which displays the probabilities for each tag of the last tagged sentence. Save this program in a file with the name PosTaggerProbs.java.

import java.io.FileInputStream; 
import java.io.InputStream;  

import opennlp.tools.postag.POSModel; 
import opennlp.tools.postag.POSSample; 
import opennlp.tools.postag.POSTaggerME; 
import opennlp.tools.tokenize.WhitespaceTokenizer;  

public class PosTaggerProbs { 
   
   public static void main(String args[]) throws Exception{ 
      
      //Loading Parts of speech-maxent model       
      InputStream inputStream = new FileInputStream("C:/OpenNLP_models/en-pos-maxent.bin"); 
      POSModel model = new POSModel(inputStream); 
       
      //Creating an object of WhitespaceTokenizer class  
      WhitespaceTokenizer whitespaceTokenizer= WhitespaceTokenizer.INSTANCE; 
       
      //Tokenizing the sentence 
      String sentence = "Hi welcome to Howcodex"; 
      String[] tokens = whitespaceTokenizer.tokenize(sentence); 
       
      //Instantiating POSTaggerME class 
      POSTaggerME tagger = new POSTaggerME(model); 
             
      //Generating tags 
      String[] tags = tagger.tag(tokens);       
      
      //Instantiating the POSSample class 
      POSSample sample = new POSSample(tokens, tags);  
      System.out.println(sample.toString());
      
      //Probabilities for each tag of the last tagged sentence. 
      double [] probs = tagger.probs();       
      System.out.println("  ");       
      
      //Printing the probabilities  
      for(int i = 0; i<probs.length; i++) 
         System.out.println(probs[i]); 
   } 
}      

Compile and execute the saved Java file from the Command prompt using the following commands −

javac PosTaggerProbs.java 
java PosTaggerProbs

On executing, the above program reads the given raw text, tags the part of speech of each token in it, and displays the tagged sentence. In addition, it displays the probability of each assigned tag, one per line and in token order, as shown below.

Hi_NNP welcome_JJ to_TO Howcodex_VB    
0.6416834779738033 
0.42983612874819177 
0.8584513635863117 
0.4394784478206072 
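
The probabilities are easier to read next to the tokens and tags they belong to. The following JDK-only sketch lines the three arrays up and reports the tag the tagger was least sure about. The class TagConfidenceSketch and its leastConfident() method are hypothetical helpers, and the arrays are copied from the sample output above instead of being produced by a tagger.

```java
// Hypothetical helper: pairs tokens, tags and probabilities for display.
class TagConfidenceSketch {

   // Returns the index of the smallest probability, or -1 for an empty array.
   static int leastConfident(double[] probs) {
      int min = -1;
      for (int i = 0; i < probs.length; i++) {
         if (min < 0 || probs[i] < probs[min]) {
            min = i;
         }
      }
      return min;
   }

   public static void main(String[] args) {
      // Values copied from the sample output of the tutorial's program.
      String[] tokens = {"Hi", "welcome", "to", "Howcodex"};
      String[] tags = {"NNP", "JJ", "TO", "VB"};
      double[] probs = {
         0.6416834779738033, 0.42983612874819177,
         0.8584513635863117, 0.4394784478206072
      };

      // One row per token: token, tag, probability.
      for (int i = 0; i < tokens.length; i++) {
         System.out.printf("%-10s %-4s %.4f%n", tokens[i], tags[i], probs[i]);
      }

      int weakest = leastConfident(probs);
      System.out.println("Least confident: " + tokens[weakest] + "_" + tags[weakest]);
   }
}
```

In a real program the three arrays would come from tokenize(), tag(), and probs(); all three have one entry per token, so the same index refers to the same word.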