OpenNLP - Sentence Detection


While processing a natural language, deciding the beginning and end of the sentences is one of the problems to be addressed. This process is known as Sentence Boundary Disambiguation (SBD) or simply sentence breaking.

The techniques used to detect sentences in a given text depend on the language of the text.

Sentence Detection Using Java

In Java, we can detect the sentences in a given text using regular expressions and a set of simple rules.

For example, if we assume that a period, a question mark, or an exclamation mark ends a sentence in the given text, we can split the text using the split() method of the String class. This method takes a regular expression in String format.

Following is the program which determines the sentences in a given text using Java regular expressions (split method). Save this program in a file with the name SentenceDetection_RE.java.

public class SentenceDetection_RE {  
   public static void main(String args[]){ 
     
      String sentence = " Hi. How are you? Welcome to Howcodex. " 
         + "We provide free tutorials on various technologies"; 
     
      String simple = "[.?!]";      
      String[] splitString = (sentence.split(simple));     
      for (String string : splitString)   
         System.out.println(string);      
   } 
}

Compile and execute the saved java file from the command prompt using the following commands.

javac SentenceDetection_RE.java 
java SentenceDetection_RE

On executing, the above program prints the following output. The split() method discards the punctuation marks matched by the regular expression.

Hi 
How are you 
Welcome to Howcodex 
We provide free tutorials on various technologies
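
If the terminating punctuation should stay attached to each sentence, one option is to split on the whitespace that follows a terminator instead of on the terminator itself. The following is a minimal sketch under that assumption; the class name KeepTerminatorsSplit and the method splitKeepingTerminators are illustrative names, not part of the program above.

```java
class KeepTerminatorsSplit {

    // Splits on whitespace that follows '.', '?' or '!'.
    // The lookbehind (?<=[.?!]) only checks the preceding character,
    // so the terminator itself is not consumed by the split.
    static String[] splitKeepingTerminators(String text) {
        return text.trim().split("(?<=[.?!])\\s+");
    }

    public static void main(String[] args) {
        String text = " Hi. How are you? Welcome to Howcodex. "
            + "We provide free tutorials on various technologies";

        for (String sentence : splitKeepingTerminators(text))
            System.out.println(sentence);
    }
}
```

This is still a fixed rule: any period followed by whitespace is treated as a sentence boundary.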

Sentence Detection Using OpenNLP

To detect sentences, OpenNLP uses a predefined model, a file named en-sent.bin. This predefined model is trained to detect sentences in a given raw text.
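
One motivation for a trained model is that fixed punctuation rules misfire on ordinary text. The sketch below is only an illustration: it applies the [.?!] pattern from the previous section to an invented sample containing an abbreviation and a decimal number. The class name NaiveSplitPitfall and the sample text are hypothetical.

```java
class NaiveSplitPitfall {

    // Same rule as in the regular-expression program:
    // every '.', '?' or '!' is treated as the end of a sentence.
    static String[] naiveSplit(String text) {
        return text.split("[.?!]");
    }

    public static void main(String[] args) {
        // Two sentences, but four periods.
        String text = "Dr. Smith paid 3.50 dollars. He left.";

        String[] pieces = naiveSplit(text);
        System.out.println(pieces.length + " pieces:");
        for (String piece : pieces)
            System.out.println("[" + piece.trim() + "]");
    }
}
```

The sample holds two sentences, yet the rule yields four fragments, because the periods in "Dr." and "3.50" are also treated as boundaries.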

The opennlp.tools.sentdetect package contains the classes and interfaces that are used to perform the sentence detection task.

To detect sentences using the OpenNLP library, you need to −

  • Load the en-sent.bin model using the SentenceModel class.

  • Instantiate the SentenceDetectorME class.

  • Detect the sentences using the sentDetect() method of this class.

Following are the steps to be followed to write a program which detects the sentences from the given raw text.

Step 1: Loading the model

The model for sentence detection is represented by the class named SentenceModel, which belongs to the package opennlp.tools.sentdetect.

To load a sentence detection model −

  • Create an InputStream object of the model (Instantiate the FileInputStream and pass the path of the model in String format to its constructor).

  • Instantiate the SentenceModel class and pass the InputStream (object) of the model as a parameter to its constructor as shown in the following code block −

//Loading sentence detector model 
InputStream inputStream = new FileInputStream("C:/OpenNLP_models/en-sent.bin"); 
SentenceModel model = new SentenceModel(inputStream);
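
The snippet above never closes the stream. As a general Java pattern, a try-with-resources block closes a stream automatically, and the same kind of block could wrap the FileInputStream that is passed to the SentenceModel constructor. The sketch below uses only the JDK so that it runs without OpenNLP; it merely counts the bytes of a file, and the class ModelFileCheck and its countBytes method are hypothetical names.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class ModelFileCheck {

    // Reads the file to the end and returns its size in bytes.
    // The stream is closed when the try block exits, even if reading fails.
    static long countBytes(Path file) {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = new byte[8192];
            long total = 0;
            int read;
            while ((read = in.read(buffer)) != -1)
                total += read;
            return total;
        } catch (IOException e) {
            // A missing or unreadable file ends up here.
            throw new UncheckedIOException("Cannot read " + file, e);
        }
    }

    public static void main(String[] args) {
        // Default path taken from the tutorial; pass another path as the first argument.
        Path model = Paths.get(args.length > 0 ? args[0] : "C:/OpenNLP_models/en-sent.bin");

        if (Files.isReadable(model))
            System.out.println(model + ": " + countBytes(model) + " bytes");
        else
            System.out.println(model + " is missing or not readable");
    }
}
```

Checking the path before loading gives a clearer message than the exception raised when the model file is not where the program expects it.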

Step 2: Instantiating the SentenceDetectorME class

The SentenceDetectorME class of the package opennlp.tools.sentdetect contains methods to split the raw text into sentences. This class uses the Maximum Entropy model to evaluate end-of-sentence characters in a string to determine if they signify the end of a sentence.

Instantiate this class and pass the model object created in the previous step, as shown below.

//Instantiating the SentenceDetectorME class 
SentenceDetectorME detector = new SentenceDetectorME(model);
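
To make the idea of evaluating end-of-sentence characters concrete, the sketch below lists the positions that a detector would have to judge. It is not OpenNLP's implementation: the class EosCandidates is hypothetical, and the candidate set of '.', '?' and '!' is an assumption made for illustration. A model would then decide, for each position, whether it really ends a sentence.

```java
import java.util.ArrayList;
import java.util.List;

class EosCandidates {

    // Assumed candidate characters; a real detector may use a different set.
    private static final String CANDIDATES = ".?!";

    // Returns the index of every candidate end-of-sentence character in the text.
    static List<Integer> candidatePositions(String text) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            if (CANDIDATES.indexOf(text.charAt(i)) >= 0)
                positions.add(i);
        }
        return positions;
    }

    public static void main(String[] args) {
        String text = "Hi. How are you? Welcome to Howcodex. "
            + "We provide free tutorials on various technologies";

        // Each printed index is a place where a yes/no decision is needed.
        System.out.println(candidatePositions(text));
    }
}
```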

Step 3: Detecting the sentence

The sentDetect() method of the SentenceDetectorME class is used to detect the sentences in the raw text passed to it. This method accepts a String variable as a parameter.

Invoke this method by passing the raw text to it as a String.

//Detecting the sentence 
String sentences[] = detector.sentDetect(sentence);

Example

Following is the program which detects the sentences in a given raw text. Save this program in a file named SentenceDetectionME.java.

import java.io.FileInputStream; 
import java.io.InputStream;  

import opennlp.tools.sentdetect.SentenceDetectorME; 
import opennlp.tools.sentdetect.SentenceModel;  

public class SentenceDetectionME { 
  
   public static void main(String args[]) throws Exception { 
   
      String sentence = "Hi. How are you? Welcome to Howcodex. " 
         + "We provide free tutorials on various technologies"; 
       
      //Loading sentence detector model 
      InputStream inputStream = new FileInputStream("C:/OpenNLP_models/en-sent.bin"); 
      SentenceModel model = new SentenceModel(inputStream); 
       
      //Instantiating the SentenceDetectorME class 
      SentenceDetectorME detector = new SentenceDetectorME(model);  
    
      //Detecting the sentence
      String sentences[] = detector.sentDetect(sentence); 
    
      //Printing the sentences 
      for(String sent : sentences)        
         System.out.println(sent);  
   } 
}

Compile and execute the saved Java file from the Command prompt using the following commands −

javac SentenceDetectionME.java 
java SentenceDetectionME

On executing, the above program reads the given String and detects the sentences in it and displays the following output.

Hi. How are you? 
Welcome to Howcodex. 
We provide free tutorials on various technologies

Detecting the Positions of the Sentences

We can also detect the positions of the sentences using the sentPosDetect() method of the SentenceDetectorME class.

Following are the steps to be followed to write a program which detects the positions of the sentences from the given raw text.

Step 1: Loading the model

The model for sentence detection is represented by the class named SentenceModel, which belongs to the package opennlp.tools.sentdetect.

To load a sentence detection model −

  • Create an InputStream object of the model (Instantiate the FileInputStream and pass the path of the model in String format to its constructor).

  • Instantiate the SentenceModel class and pass the InputStream (object) of the model as a parameter to its constructor, as shown in the following code block.

//Loading sentence detector model 
InputStream inputStream = new FileInputStream("C:/OpenNLP_models/en-sent.bin"); 
SentenceModel model = new SentenceModel(inputStream);

Step 2: Instantiating the SentenceDetectorME class

The SentenceDetectorME class of the package opennlp.tools.sentdetect contains methods to split the raw text into sentences. This class uses the Maximum Entropy model to evaluate end-of-sentence characters in a string to determine if they signify the end of a sentence.

Instantiate this class and pass the model object created in the previous step.

//Instantiating the SentenceDetectorME class 
SentenceDetectorME detector = new SentenceDetectorME(model); 

Step 3: Detecting the position of the sentence

The sentPosDetect() method of the SentenceDetectorME class is used to detect the positions of the sentences in the raw text passed to it. This method accepts a String variable as a parameter.

Invoke this method by passing the raw text to it as a String.

//Detecting the position of the sentences in the paragraph  
Span[] spans = detector.sentPosDetect(sentence); 

Step 4: Printing the spans of the sentences

The sentPosDetect() method of the SentenceDetectorME class returns an array of objects of the type Span. The Span class of the opennlp.tools.util package stores the start and end offsets of a range of text; the start offset is inclusive and the end offset is exclusive.

You can store the spans returned by the sentPosDetect() method in the Span array and print them, as shown in the following code block.

//Printing the spans of the sentences 
for (Span span : spans)         
   System.out.println(span); 

Example

Following is the program which detects the positions of the sentences in the given raw text. Save this program in a file named SentencePosDetection.java.

import java.io.FileInputStream; 
import java.io.InputStream; 
  
import opennlp.tools.sentdetect.SentenceDetectorME; 
import opennlp.tools.sentdetect.SentenceModel; 
import opennlp.tools.util.Span;

public class SentencePosDetection { 
  
   public static void main(String args[]) throws Exception { 
   
      String paragraph = "Hi. How are you? Welcome to Howcodex. " 
         + "We provide free tutorials on various technologies"; 
       
      //Loading sentence detector model 
      InputStream inputStream = new FileInputStream("C:/OpenNLP_models/en-sent.bin"); 
      SentenceModel model = new SentenceModel(inputStream); 
       
      //Instantiating the SentenceDetectorME class 
      SentenceDetectorME detector = new SentenceDetectorME(model);  
       
      //Detecting the position of the sentences in the raw text 
      Span spans[] = detector.sentPosDetect(paragraph); 
       
      //Printing the spans of the sentences in the paragraph 
      for (Span span : spans)         
         System.out.println(span);  
   } 
}

Compile and execute the saved Java file from the Command prompt using the following commands −

javac SentencePosDetection.java 
java SentencePosDetection

On executing, the above program reads the given String, detects the sentences in it, and displays their positions as follows.

[0..16) 
[17..37) 
[38..87)

Sentences along with their Positions

The substring() method of the String class accepts the begin and the end offsets and returns the respective string. We can use this method to print the sentences and their spans (positions) together, as shown in the following code block.

for (Span span : spans)         
   System.out.println(sen.substring(span.getStart(), span.getEnd())+" "+ span); 
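
The [start..end) notation printed for a span denotes a half-open range: the start offset is included and the end offset is not, which is the same convention substring() uses. The sketch below shows this with plain int offsets instead of Span objects so that it runs without OpenNLP; the class HalfOpenRangeDemo, the shortened sample text, and the hand-written offsets are illustrative.

```java
class HalfOpenRangeDemo {

    // Returns the text covered by the half-open range [start, end).
    static String covered(String text, int start, int end) {
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        String text = "Hi. How are you? Welcome.";

        // Offsets written by hand for this sample; a detector would supply them.
        int[][] ranges = { {0, 16}, {17, 25} };

        for (int[] range : ranges) {
            String sentence = covered(text, range[0], range[1]);
            // The length of the covered text always equals end - start.
            System.out.println(sentence + " [" + range[0] + ".." + range[1] + ")"
                + " length=" + sentence.length());
        }
    }
}
```

The space at offset 16 belongs to neither range, which is why the second range starts at 17.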

Following is the program to detect the sentences from the given raw text and display them along with their positions. Save this program in a file named SentencesAndPosDetection.java.

import java.io.FileInputStream; 
import java.io.InputStream;  

import opennlp.tools.sentdetect.SentenceDetectorME; 
import opennlp.tools.sentdetect.SentenceModel; 
import opennlp.tools.util.Span; 
   
public class SentencesAndPosDetection { 
  
   public static void main(String args[]) throws Exception { 
     
      String sen = "Hi. How are you? Welcome to Howcodex." 
         + " We provide free tutorials on various technologies"; 
      //Loading a sentence model 
      InputStream inputStream = new FileInputStream("C:/OpenNLP_models/en-sent.bin"); 
      SentenceModel model = new SentenceModel(inputStream); 
       
      //Instantiating the SentenceDetectorME class 
      SentenceDetectorME detector = new SentenceDetectorME(model);  
       
      //Detecting the position of the sentences in the paragraph  
      Span[] spans = detector.sentPosDetect(sen);  
      
      //Printing the sentences and their spans of a paragraph 
      for (Span span : spans)         
         System.out.println(sen.substring(span.getStart(), span.getEnd())+" "+ span);  
   } 
}  

Compile and execute the saved Java file from the Command prompt using the following commands −

javac SentencesAndPosDetection.java 
java SentencesAndPosDetection

On executing, the above program reads the given String and detects the sentences along with their positions and displays the following output.

Hi. How are you? [0..16) 
Welcome to Howcodex. [17..37)  
We provide free tutorials on various technologies [38..87)

Sentence Probability Detection

The getSentenceProbabilities() method of the SentenceDetectorME class returns the probabilities associated with the most recent calls to the sentDetect() method.

//Getting the probabilities of the last decoded sequence       
double[] probs = detector.getSentenceProbabilities(); 

Following is the program to print the probabilities associated with the calls to the sentDetect() method. Save this program in a file with the name SentenceDetectionMEProbs.java.

import java.io.FileInputStream; 
import java.io.InputStream;  

import opennlp.tools.sentdetect.SentenceDetectorME; 
import opennlp.tools.sentdetect.SentenceModel;  

public class SentenceDetectionMEProbs { 
  
   public static void main(String args[]) throws Exception { 
   
      String sentence = "Hi. How are you? Welcome to Howcodex. " 
         + "We provide free tutorials on various technologies"; 
       
      //Loading sentence detector model 
      InputStream inputStream = new FileInputStream("C:/OpenNLP_models/en-sent.bin");
      SentenceModel model = new SentenceModel(inputStream); 
       
      //Instantiating the SentenceDetectorME class
      SentenceDetectorME detector = new SentenceDetectorME(model);  
      
      //Detecting the sentence 
      String sentences[] = detector.sentDetect(sentence); 
    
      //Printing the sentences 
      for(String sent : sentences)        
         System.out.println(sent);   
         
      //Getting the probabilities of the last decoded sequence       
      double[] probs = detector.getSentenceProbabilities(); 
       
      System.out.println("  "); 
       
      for(int i = 0; i<probs.length; i++) 
         System.out.println(probs[i]); 
   } 
}       

Compile and execute the saved Java file from the Command prompt using the following commands −

javac SentenceDetectionMEProbs.java 
java SentenceDetectionMEProbs

On executing, the above program reads the given String, detects the sentences, and prints them. It then prints the probabilities associated with the most recent call to the sentDetect() method, as shown below.

Hi. How are you? 
Welcome to Howcodex. 
We provide free tutorials on various technologies 
   
0.9240246995179983 
0.9957680129995953 
1.0
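
One possible use of these values is to flag the sentences whose boundaries the model was less sure about. The sketch below pairs the sentences and probabilities printed above, hard-coded so that it runs without OpenNLP; the 0.95 threshold, the class LowConfidenceFilter, and its belowThreshold method are arbitrary choices made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class LowConfidenceFilter {

    // Returns the sentences whose probability is below the threshold.
    // sentences[i] is assumed to correspond to probs[i].
    static List<String> belowThreshold(String[] sentences, double[] probs, double threshold) {
        if (sentences.length != probs.length)
            throw new IllegalArgumentException("arrays must have the same length");

        List<String> flagged = new ArrayList<>();
        for (int i = 0; i < sentences.length; i++) {
            if (probs[i] < threshold)
                flagged.add(sentences[i]);
        }
        return flagged;
    }

    public static void main(String[] args) {
        // Values copied from the output shown above.
        String[] sentences = {
            "Hi. How are you?",
            "Welcome to Howcodex.",
            "We provide free tutorials on various technologies"
        };
        double[] probs = { 0.9240246995179983, 0.9957680129995953, 1.0 };

        // 0.95 is an arbitrary cut-off chosen for this example.
        System.out.println(belowThreshold(sentences, probs, 0.95));
    }
}
```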