OpenNLP - Chunking Sentences


Advertisements

Chunking a sentences refers to breaking/dividing a sentence into parts of words such as word groups and verb groups.

Chunking a Sentence using OpenNLP

To detect the sentences, OpenNLP uses a model, a file named en-chunker.bin. This is a predefined model which is trained to chunk the sentences in the given raw text.

The opennlp.tools.chunker package contains the classes and interfaces that are used to find non-recursive syntactic annotation such as noun phrase chunks.

You can chunk a sentence using the method chunk() of the ChunkerME class. This method accepts tokens of a sentence and POS tags as parameters. Therefore, before starting the process of chunking, first of all you need to Tokenize the sentence and generate the parts POS tags of it.

To chunk a sentence using OpenNLP library, you need to −

  • Tokenize the sentence.

  • Generate POS tags for it.

  • Load the en-chunker.bin model using the ChunkerModel class

  • Instantiate the ChunkerME class.

  • Chunk the sentences using the chunk() method of this class.

Following are the steps to be followed to write a program to chunk sentences from the given raw text.

Step 1: Tokenizing the sentence

Tokenize the sentences using the tokenize() method of the whitespaceTokenizer class, as shown in the following code block.

//Tokenizing the sentence 
String sentence = "Hi welcome to Howcodex";       
WhitespaceTokenizer whitespaceTokenizer= WhitespaceTokenizer.INSTANCE; 
String[] tokens = whitespaceTokenizer.tokenize(sentence);

Step 2: Generating the POS tags

Generate the POS tags of the sentence using the tag() method of the POSTaggerME class, as shown in the following code block.

//Generating the POS tags 
File file = new File("C:/OpenNLP_models/en-pos-maxent.bin");     
POSModel model = new POSModelLoader().load(file);     
//Constructing the tagger 
POSTaggerME tagger = new POSTaggerME(model);        
//Generating tags from the tokens 
String[] tags = tagger.tag(tokens); 

Step 3: Loading the model

The model for chunking a sentence is represented by the class named ChunkerModel, which belongs to the package opennlp.tools.chunker.

To load a sentence detection model −

  • Create an InputStream object of the model (Instantiate the FileInputStream and pass the path of the model in String format to its constructor).

  • Instantiate the ChunkerModel class and pass the InputStream (object) of the model as a parameter to its constructor, as shown in the following code block −

//Loading the chunker model 
InputStream inputStream = new FileInputStream("C:/OpenNLP_models/en-chunker.bin"); 
ChunkerModel chunkerModel = new ChunkerModel(inputStream);  

Step 4: Instantiating the chunkerME class

The chunkerME class of the package opennlp.tools.chunker contains methods to chunk the sentences. This is a maximum-entropy-based chunker.

Instantiate this class and pass the model object created in the previous step.

//Instantiate the ChunkerME class 
ChunkerME chunkerME = new ChunkerME(chunkerModel); 

Step 5: Chunking the sentence

The chunk() method of the ChunkerME class is used to chunk the sentences in the raw text passed to it. This method accepts two String arrays representing tokens and tags, as parameters.

Invoke this method by passing the token array and tag array created in the previous steps as parameters.

//Generating the chunks 
String result[] = chunkerME.chunk(tokens, tags); 

Example

Following is the program to chunk the sentences in the given raw text. Save this program in a file with the name ChunkerExample.java.

import java.io.File; 
import java.io.FileInputStream; 
import java.io.IOException; 
import java.io.InputStream;  

import opennlp.tools.chunker.ChunkerME; 
import opennlp.tools.chunker.ChunkerModel; 
import opennlp.tools.cmdline.postag.POSModelLoader; 
import opennlp.tools.postag.POSModel; 
import opennlp.tools.postag.POSTaggerME; 
import opennlp.tools.tokenize.WhitespaceTokenizer;  

public class ChunkerExample{ 
   
   public static void main(String args[]) throws IOException { 
      //Tokenizing the sentence 
      String sentence = "Hi welcome to Howcodex";       
      WhitespaceTokenizer whitespaceTokenizer= WhitespaceTokenizer.INSTANCE; 
      String[] tokens = whitespaceTokenizer.tokenize(sentence); 
     
      //Generating the POS tags 
      //Load the parts of speech model 
      File file = new File("C:/OpenNLP_models/en-pos-maxent.bin"); 
      POSModel model = new POSModelLoader().load(file);     
      
      //Constructing the tagger 
      POSTaggerME tagger = new POSTaggerME(model);        
      
      //Generating tags from the tokens 
      String[] tags = tagger.tag(tokens);    
    
      //Loading the chunker model 
      InputStream inputStream = new 
         FileInputStream("C:/OpenNLP_models/en-chunker.bin"); 
      ChunkerModel chunkerModel = new ChunkerModel(inputStream);  
      
      //Instantiate the ChunkerME class 
      ChunkerME chunkerME = new ChunkerME(chunkerModel);
       
      //Generating the chunks 
      String result[] = chunkerME.chunk(tokens, tags); 
  
      for (String s : result) 
         System.out.println(s);         
   }    
}      

Compile and execute the saved Java file from the Command prompt using the following command −

javac ChunkerExample.java 
java ChunkerExample 

On executing, the above program reads the given String and chunks the sentences in it, and displays them as shown below.

Loading POS Tagger model ... done (1.040s) 
B-NP 
I-NP 
B-VP 
I-VP

Detecting the Positions of the Tokens

We can also detect the positions or spans of the chunks using the chunkAsSpans() method of the ChunkerME class. This method returns an array of objects of the type Span. The class named Span of the opennlp.tools.util package is used to store the start and end integer of sets.

You can store the spans returned by the chunkAsSpans() method in the Span array and print them, as shown in the following code block.

//Generating the tagged chunk spans 
Span[] span = chunkerME.chunkAsSpans(tokens, tags); 
       
for (Span s : span) 
   System.out.println(s.toString()); 

Example

Following is the program which detects the sentences in the given raw text. Save this program in a file with the name ChunkerSpansEample.java.

import java.io.File; 
import java.io.FileInputStream; 
import java.io.IOException; 
import java.io.InputStream;  

import opennlp.tools.chunker.ChunkerME; 
import opennlp.tools.chunker.ChunkerModel; 
import opennlp.tools.cmdline.postag.POSModelLoader; 
import opennlp.tools.postag.POSModel; 
import opennlp.tools.postag.POSTaggerME; 
import opennlp.tools.tokenize.WhitespaceTokenizer; 
import opennlp.tools.util.Span;  

public class ChunkerSpansEample{ 
   
   public static void main(String args[]) throws IOException { 
      //Load the parts of speech model 
      File file = new File("C:/OpenNLP_models/en-pos-maxent.bin");     
      POSModel model = new POSModelLoader().load(file); 
       
      //Constructing the tagger 
      POSTaggerME tagger = new POSTaggerME(model); 
  
      //Tokenizing the sentence 
      String sentence = "Hi welcome to Howcodex";       
      WhitespaceTokenizer whitespaceTokenizer= WhitespaceTokenizer.INSTANCE; 
      String[] tokens = whitespaceTokenizer.tokenize(sentence); 
       
      //Generating tags from the tokens 
      String[] tags = tagger.tag(tokens);       
   
      //Loading the chunker model 
      InputStream inputStream = new 
         FileInputStream("C:/OpenNLP_models/en-chunker.bin"); 
      ChunkerModel chunkerModel = new ChunkerModel(inputStream);
      ChunkerME chunkerME = new ChunkerME(chunkerModel);       
           
      //Generating the tagged chunk spans 
      Span[] span = chunkerME.chunkAsSpans(tokens, tags); 
       
      for (Span s : span) 
         System.out.println(s.toString());  
   }    
}

Compile and execute the saved Java file from the Command prompt using the following commands −

javac ChunkerSpansEample.java 
java ChunkerSpansEample

On executing, the above program reads the given String and spans of the chunks in it, and displays the following output −

Loading POS Tagger model ... done (1.059s) 
[0..2) NP 
[2..4) VP 

Chunker Probability Detection

The probs() method of the ChunkerME class returns the probabilities of the last decoded sequence.

//Getting the probabilities of the last decoded sequence       
double[] probs = chunkerME.probs(); 

Following is the program to print the probabilities of the last decoded sequence by the chunker. Save this program in a file with the name ChunkerProbsExample.java.

import java.io.File; 
import java.io.FileInputStream; 
import java.io.IOException; 
import java.io.InputStream; 
import opennlp.tools.chunker.ChunkerME; 
import opennlp.tools.chunker.ChunkerModel; 
import opennlp.tools.cmdline.postag.POSModelLoader; 
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME; 
import opennlp.tools.tokenize.WhitespaceTokenizer;  

public class ChunkerProbsExample{ 
   
   public static void main(String args[]) throws IOException { 
      //Load the parts of speech model 
      File file = new File("C:/OpenNLP_models/en-pos-maxent.bin");     
      POSModel model = new POSModelLoader().load(file); 
       
      //Constructing the tagger 
      POSTaggerME tagger = new POSTaggerME(model); 
  
      //Tokenizing the sentence 
      String sentence = "Hi welcome to Howcodex";       
      WhitespaceTokenizer whitespaceTokenizer= WhitespaceTokenizer.INSTANCE; 
      String[] tokens = whitespaceTokenizer.tokenize(sentence); 
       
      //Generating tags from the tokens 
      String[] tags = tagger.tag(tokens);       
   
      //Loading the chunker model 
      InputStream inputStream = new 
         FileInputStream("C:/OpenNLP_models/en-chunker.bin"); 
      ChunkerModel cModel = new ChunkerModel(inputStream); 
      ChunkerME chunkerME = new ChunkerME(cModel); 
       
      //Generating the chunk tags 
      chunkerME.chunk(tokens, tags); 
       
      //Getting the probabilities of the last decoded sequence       
      double[] probs = chunkerME.probs(); 
      for(int i = 0; i<probs.length; i++) 
         System.out.println(probs[i]);       
   }    
}   

Compile and execute the saved Java file from the Command prompt using the following commands −

javac ChunkerProbsExample.java 
java ChunkerProbsExample 

On executing, the above program reads the given String, chunks it, and prints the probabilities of the last decoded sequence.

0.9592746040797778 
0.6883933131241501 
0.8830563473996004 
0.8951150529746051 
Advertisements