OpenNLP - Tokenization



The process of chopping the given sentence into smaller parts (tokens) is known as tokenization. In general, the given raw text is tokenized based on a set of delimiters (mostly whitespaces).

Tokenization is used in tasks such as spell-checking, processing searches, identifying parts of speech, sentence detection, document classification, etc.

Tokenizing using OpenNLP

The opennlp.tools.tokenize package contains the classes and interfaces that are used to perform tokenization.

To tokenize the given sentences into simpler fragments, the OpenNLP library provides three different classes −

  • SimpleTokenizer − This class tokenizes the given raw text using character classes.

  • WhitespaceTokenizer − This class uses whitespaces to tokenize the given text.

  • TokenizerME − This class converts raw text into separate tokens. It uses Maximum Entropy to make its decisions.
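To make the difference between the first two strategies concrete, here is a minimal sketch in plain Java that does not use OpenNLP. The class TokenizingSketch and its methods are hypothetical, and the character-class rule (whitespace, letter, digit, other) is a simplification for illustration, not OpenNLP's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not part of OpenNLP. */
final class TokenizingSketch {

   // Splits on runs of whitespace, so punctuation stays attached to words.
   static List<String> byWhitespace(String text) {
      List<String> tokens = new ArrayList<>();
      for (String part : text.trim().split("\\s+")) {
         if (!part.isEmpty()) {
            tokens.add(part);
         }
      }
      return tokens;
   }

   // Assumed character classes: 0 = whitespace, 1 = letter, 2 = digit, 3 = other.
   static int charClass(char c) {
      if (Character.isWhitespace(c)) return 0;
      if (Character.isLetter(c)) return 1;
      if (Character.isDigit(c)) return 2;
      return 3;
   }

   // Starts a new token whenever the character class changes.
   static List<String> byCharacterClass(String text) {
      List<String> tokens = new ArrayList<>();
      StringBuilder current = new StringBuilder();
      int previous = 0;
      for (char c : text.toCharArray()) {
         int cls = charClass(c);
         if (cls != previous && current.length() > 0) {
            tokens.add(current.toString());
            current.setLength(0);
         }
         if (cls != 0) current.append(c);
         previous = cls;
      }
      if (current.length() > 0) tokens.add(current.toString());
      return tokens;
   }

   public static void main(String[] args) {
      String text = "Hi. How are you?";
      System.out.println(byWhitespace(text));     // [Hi., How, are, you?]
      System.out.println(byCharacterClass(text)); // [Hi, ., How, are, you, ?]
   }
}
```

The whitespace version keeps "Hi." as one token, while the character-class version separates the period, which mirrors the difference between the outputs of the WhitespaceTokenizer and SimpleTokenizer examples in this chapter.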

SimpleTokenizer

To tokenize a sentence using the SimpleTokenizer class, you need to −

  • Create an object of the respective class.

  • Tokenize the sentence using the tokenize() method.

  • Print the tokens.

The steps for writing a program that tokenizes the given raw text are as follows.

Step 1 − Instantiating the respective class

The SimpleTokenizer and WhitespaceTokenizer classes are not meant to be instantiated through a constructor. Instead, obtain the shared object of the class through its static variable INSTANCE.

SimpleTokenizer tokenizer = SimpleTokenizer.INSTANCE;   
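The INSTANCE variable follows the common shared-instance idiom: the class creates one object itself and hands the same object to every caller. The sketch below shows the shape of that idiom in plain Java. ToyTokenizer is a hypothetical class written for this illustration; it only splits on whitespace and is not how OpenNLP implements its tokenizers.

```java
/** Hypothetical class that mirrors the INSTANCE idiom; not part of OpenNLP. */
final class ToyTokenizer {

   // The single shared object, created once when the class is loaded.
   static final ToyTokenizer INSTANCE = new ToyTokenizer();

   // A private constructor prevents callers from writing "new ToyTokenizer()".
   private ToyTokenizer() { }

   // The method keeps no state, so sharing one instance between callers is safe.
   String[] tokenize(String text) {
      String trimmed = text.trim();
      return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
   }

   public static void main(String[] args) {
      ToyTokenizer first = ToyTokenizer.INSTANCE;
      ToyTokenizer second = ToyTokenizer.INSTANCE;
      System.out.println(first == second);                      // prints true
      System.out.println(first.tokenize("How are you").length); // prints 3
   }
}
```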

Step 2 − Tokenize the sentences

Both these classes contain a method called tokenize(). This method accepts a raw text in String format. On invoking, it tokenizes the given String and returns an array of Strings (tokens).

Tokenize the sentence using the tokenize() method as shown below.

//Tokenizing the given sentence 
String tokens[] = tokenizer.tokenize(sentence); 

Step 3 − Print the tokens

After tokenizing the sentence, you can print the tokens using a for loop, as shown below.

//Printing the tokens 
for(String token : tokens)       
   System.out.println(token);

Example

Following is the program which tokenizes the given sentence using the SimpleTokenizer class. Save this program in a file with the name SimpleTokenizerExample.java.

import opennlp.tools.tokenize.SimpleTokenizer;  
public class SimpleTokenizerExample { 
   public static void main(String args[]){ 
     
      String sentence = "Hi. How are you? Welcome to Howcodex. " 
         + "We provide free tutorials on various technologies"; 
    
      //Instantiating SimpleTokenizer class 
      SimpleTokenizer simpleTokenizer = SimpleTokenizer.INSTANCE;  
       
      //Tokenizing the given sentence 
      String tokens[] = simpleTokenizer.tokenize(sentence);  
       
      //Printing the tokens 
      for(String token : tokens) {         
         System.out.println(token);  
      }       
   }  
}

Compile and execute the saved Java file from the Command prompt using the following commands −

javac SimpleTokenizerExample.java 
java SimpleTokenizerExample

On executing, the above program reads the given String (raw text), tokenizes it, and displays the following output −

Hi 
. 
How 
are 
you 
? 
Welcome 
to 
Howcodex 
. 
We 
provide 
free 
tutorials 
on 
various 
technologies 

WhitespaceTokenizer

To tokenize a sentence using the WhitespaceTokenizer class, you need to −

  • Create an object of the respective class.

  • Tokenize the sentence using the tokenize() method.

  • Print the tokens.

The steps for writing a program that tokenizes the given raw text are as follows.

Step 1 − Instantiating the respective class

The SimpleTokenizer and WhitespaceTokenizer classes are not meant to be instantiated through a constructor. Instead, obtain the shared object of the class through its static variable INSTANCE.

WhitespaceTokenizer tokenizer = WhitespaceTokenizer.INSTANCE; 

Step 2 − Tokenize the sentences

Both these classes contain a method called tokenize(). This method accepts a raw text in String format. On invoking, it tokenizes the given String and returns an array of Strings (tokens).

Tokenize the sentence using the tokenize() method as shown below.

//Tokenizing the given sentence 
String tokens[] = tokenizer.tokenize(sentence); 

Step 3 − Print the tokens

After tokenizing the sentence, you can print the tokens using a for loop, as shown below.

//Printing the tokens 
for(String token : tokens)       
   System.out.println(token);

Example

Following is the program which tokenizes the given sentence using the WhitespaceTokenizer class. Save this program in a file with the name WhitespaceTokenizerExample.java.

import opennlp.tools.tokenize.WhitespaceTokenizer;  

public class WhitespaceTokenizerExample {  
   
   public static void main(String args[]){ 
     
      String sentence = "Hi. How are you? Welcome to Howcodex. " 
         + "We provide free tutorials on various technologies"; 
    
      //Instantiating WhitespaceTokenizer class 
      WhitespaceTokenizer whitespaceTokenizer = WhitespaceTokenizer.INSTANCE;  
       
      //Tokenizing the given sentence 
      String tokens[] = whitespaceTokenizer.tokenize(sentence);  
       
      //Printing the tokens 
      for(String token : tokens)     
         System.out.println(token);        
   } 
}

Compile and execute the saved Java file from the Command prompt using the following commands −

javac WhitespaceTokenizerExample.java 
java WhitespaceTokenizerExample 

On executing, the above program reads the given String (raw text), tokenizes it, and displays the following output.

Hi. 
How 
are 
you? 
Welcome 
to 
Howcodex. 
We 
provide 
free 
tutorials 
on 
various 
technologies

TokenizerME class

OpenNLP can also tokenize text with a predefined model, a file named en-token.bin, which has been trained to split raw text into tokens.

The TokenizerME class of the opennlp.tools.tokenize package uses this model to tokenize the given raw text. To do so, you need to −

  • Load the en-token.bin model using the TokenizerModel class.

  • Instantiate the TokenizerME class.

  • Tokenize the sentences using the tokenize() method of this class.

Following are the steps to be followed to write a program which tokenizes the sentences from the given raw text using the TokenizerME class.

Step 1 − Loading the model

The model for tokenization is represented by the class named TokenizerModel, which belongs to the package opennlp.tools.tokenize.

To load a tokenizer model −

  • Create an InputStream object of the model (Instantiate the FileInputStream and pass the path of the model in String format to its constructor).

  • Instantiate the TokenizerModel class and pass the InputStream (object) of the model as a parameter to its constructor, as shown in the following code block.

//Loading the Tokenizer model 
InputStream inputStream = new FileInputStream("C:/OpenNLP_models/en-token.bin"); 
TokenizerModel tokenModel = new TokenizerModel(inputStream);
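The snippet above leaves the stream open after the model has been read. A try-with-resources block closes it automatically, even when reading fails. The helper below is a hedged sketch in plain Java: StreamLoading, IOFunction, and load are hypothetical names introduced for this illustration, and the reader passed to load stands in for a constructor that consumes an InputStream, as the TokenizerModel constructor does.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper; not part of OpenNLP. */
final class StreamLoading {

   /** A function that may throw IOException while reading a stream. */
   interface IOFunction<T> {
      T apply(InputStream in) throws IOException;
   }

   // Opens the file, hands the stream to the reader, and always closes it.
   static <T> T load(Path path, IOFunction<T> reader) throws IOException {
      try (InputStream in = Files.newInputStream(path)) {
         return reader.apply(in);
      }
   }

   public static void main(String[] args) throws IOException {
      // A temporary file stands in for a model file such as en-token.bin.
      Path file = Files.createTempFile("model", ".bin");
      Files.write(file, "demo".getBytes(StandardCharsets.UTF_8));

      // With OpenNLP on the classpath, the reader could be TokenizerModel::new.
      int size = load(file, in -> in.readAllBytes().length);
      System.out.println(size); // prints 4

      Files.delete(file);
   }
}
```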

Step 2 − Instantiating the TokenizerME class

The TokenizerME class of the package opennlp.tools.tokenize contains methods to chop the raw text into smaller parts (tokens). It uses Maximum Entropy to make its decisions.

Instantiate this class and pass the model object created in the previous step as shown below.

//Instantiating the TokenizerME class 
TokenizerME tokenizer = new TokenizerME(tokenModel);
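As a rough intuition for the Maximum Entropy decision mentioned above: a maximum entropy classifier adds up the weights of the features that are active for each outcome, then normalizes the exponentiated sums into probabilities. The sketch below shows only that normalization step in plain Java. MaxentSketch, the two outcomes, and all numbers are made up for illustration; the real features and weights are stored in the model file and are not shown here.

```java
/** Hypothetical illustration of a two-outcome maximum entropy decision. */
final class MaxentSketch {

   // Turns one score per outcome into probabilities that sum to 1.
   static double[] probabilities(double[] scores) {
      double max = Double.NEGATIVE_INFINITY;
      for (double s : scores) max = Math.max(max, s);

      double total = 0.0;
      double[] result = new double[scores.length];
      for (int i = 0; i < scores.length; i++) {
         // Subtracting the maximum keeps Math.exp from overflowing.
         result[i] = Math.exp(scores[i] - max);
         total += result[i];
      }
      for (int i = 0; i < result.length; i++) result[i] /= total;
      return result;
   }

   public static void main(String[] args) {
      // Made-up summed feature weights at one candidate split position.
      double splitScore = 2.0;
      double noSplitScore = 0.5;

      double[] p = probabilities(new double[] {splitScore, noSplitScore});
      System.out.println("P(split)    = " + p[0]);
      System.out.println("P(no split) = " + p[1]);

      // The outcome with the higher probability wins.
      System.out.println(p[0] > p[1] ? "split" : "no split");
   }
}
```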

Step 3 − Tokenizing the sentence

The tokenize() method of the TokenizerME class is used to tokenize the raw text passed to it. This method accepts a String variable as a parameter, and returns an array of Strings (tokens).

Invoke this method by passing the sentence to it as a String, as follows.

//Tokenizing the given raw text 
String tokens[] = tokenizer.tokenize(sentence);

Example

Following is the program which tokenizes the given raw text. Save this program in a file with the name TokenizerMEExample.java.

import java.io.FileInputStream; 
import java.io.InputStream; 
import opennlp.tools.tokenize.TokenizerME; 
import opennlp.tools.tokenize.TokenizerModel;  

public class TokenizerMEExample { 
  
   public static void main(String args[]) throws Exception{     
     
      String sentence = "Hi. How are you? Welcome to Howcodex. " 
            + "We provide free tutorials on various technologies"; 
       
      //Loading the Tokenizer model 
      InputStream inputStream = new FileInputStream("C:/OpenNLP_models/en-token.bin"); 
      TokenizerModel tokenModel = new TokenizerModel(inputStream); 
       
      //Instantiating the TokenizerME class 
      TokenizerME tokenizer = new TokenizerME(tokenModel); 
       
      //Tokenizing the given raw text 
      String tokens[] = tokenizer.tokenize(sentence);       
          
      //Printing the tokens  
      for (String a : tokens) 
         System.out.println(a); 
   } 
} 

Compile and execute the saved Java file from the Command prompt using the following commands −

javac TokenizerMEExample.java 
java TokenizerMEExample

On executing, the above program reads the given String (raw text), tokenizes it, and displays the following output −

Hi 
. 
How 
are 
you 
? 
Welcome 
to 
Howcodex 
. 
We 
provide 
free 
tutorials 
on 
various 
technologies

Retrieving the Positions of the Tokens

We can also get the positions or spans of the tokens using the tokenizePos() method. This is the method of the Tokenizer interface of the package opennlp.tools.tokenize. Since all the (three) Tokenizer classes implement this interface, you can find this method in all of them.

This method accepts the sentence or raw text in the form of a string and returns an array of objects of the type Span.

You can get the positions of the tokens using the tokenizePos() method, as follows −

//Retrieving the spans of the tokens 
tokenizer.tokenizePos(sentence); 

Printing the positions (spans)

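The class named Span of the opennlp.tools.util package stores the start and end offsets of a token within the text. The start offset is inclusive and the end offset is exclusive, which is why a span is printed in the form [start..end).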

You can store the spans returned by the tokenizePos() method in the Span array and print them, as shown in the following code block.

//Retrieving the spans of the tokens 
Span[] tokens = tokenizer.tokenizePos(sentence);
//Printing the spans of tokens 
for( Span token : tokens)        
   System.out.println(token);

Printing tokens and their positions together

The substring() method of the String class accepts the begin and the end offsets and returns the respective string. We can use this method to print the tokens and their spans (positions) together, as shown in the following code block.

//Printing the spans of tokens 
for(Span token : tokens)  
   System.out.println(token +" "+sent.substring(token.getStart(), token.getEnd()));
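The offsets pair naturally with substring(), because substring(begin, end) also includes the character at begin and excludes the character at end. The following sketch computes such offsets for whitespace-separated tokens in plain Java. SpanSketch and whitespaceSpans are hypothetical names, and plain int pairs stand in for OpenNLP's Span objects.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not part of OpenNLP. */
final class SpanSketch {

   // Returns {start, end} pairs, end exclusive, one per whitespace-separated token.
   static List<int[]> whitespaceSpans(String text) {
      List<int[]> spans = new ArrayList<>();
      int start = -1; // -1 means "not inside a token"
      for (int i = 0; i < text.length(); i++) {
         boolean space = Character.isWhitespace(text.charAt(i));
         if (!space && start < 0) {
            start = i;                       // a token begins here
         } else if (space && start >= 0) {
            spans.add(new int[] {start, i}); // the token ended just before i
            start = -1;
         }
      }
      // Close a token that runs to the end of the text.
      if (start >= 0) spans.add(new int[] {start, text.length()});
      return spans;
   }

   public static void main(String[] args) {
      String sent = "Hi. How are you?";
      for (int[] span : whitespaceSpans(sent)) {
         System.out.println("[" + span[0] + ".." + span[1] + ") "
            + sent.substring(span[0], span[1]));
      }
   }
}
```

For the short sentence used here, the sketch prints [0..3) Hi., [4..7) How, [8..11) are, and [12..16) you?, one per line.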

Example(SimpleTokenizer)

Following is the program which retrieves the token spans of the raw text using the SimpleTokenizer class. It also prints the tokens along with their positions. Save this program in a file with the name SimpleTokenizerSpans.java.

import opennlp.tools.tokenize.SimpleTokenizer; 
import opennlp.tools.util.Span;  

public class SimpleTokenizerSpans {  
   public static void main(String args[]){ 
     
      String sent = "Hi. How are you? Welcome to Howcodex. " 
         + "We provide free tutorials on various technologies"; 
    
      //Instantiating SimpleTokenizer class 
      SimpleTokenizer simpleTokenizer = SimpleTokenizer.INSTANCE;  
       
      //Retrieving the boundaries of the tokens 
      Span[] tokens = simpleTokenizer.tokenizePos(sent);  
       
      //Printing the spans of tokens 
      for( Span token : tokens)
         System.out.println(token +" "+sent.substring(token.getStart(), token.getEnd()));          
   } 
}      

Compile and execute the saved Java file from the Command prompt using the following commands −

javac SimpleTokenizerSpans.java 
java SimpleTokenizerSpans 

On executing, the above program reads the given String (raw text), tokenizes it, and displays the following output −

[0..2) Hi 
[2..3) . 
[4..7) How 
[8..11) are 
[12..15) you 
[15..16) ? 
[17..24) Welcome 
[25..27) to 
[28..36) Howcodex 
[36..37) . 
[38..40) We 
[41..48) provide 
[49..53) free 
[54..63) tutorials 
[64..66) on 
[67..74) various 
[75..87) technologies 

Example (WhitespaceTokenizer)

Following is the program which retrieves the token spans of the raw text using the WhitespaceTokenizer class. It also prints the tokens along with their positions. Save this program in a file with the name WhitespaceTokenizerSpans.java.

import opennlp.tools.tokenize.WhitespaceTokenizer;
import opennlp.tools.util.Span; 
public class WhitespaceTokenizerSpans {  
   public static void main(String args[]){ 
     
      String sent = "Hi. How are you? Welcome to Howcodex. " 
         + "We provide free tutorials on various technologies"; 
    
      //Instantiating WhitespaceTokenizer class 
      WhitespaceTokenizer whitespaceTokenizer = WhitespaceTokenizer.INSTANCE;  
       
      //Retrieving the spans of the tokens 
      Span[] tokens = whitespaceTokenizer.tokenizePos(sent);  
       
      //Printing the spans of tokens 
      for( Span token : tokens) 
         System.out.println(token +" "+sent.substring(token.getStart(), token.getEnd()));        
   } 
} 

Compile and execute the saved Java file from the Command prompt using the following commands −

javac WhitespaceTokenizerSpans.java 
java WhitespaceTokenizerSpans

On executing, the above program reads the given String (raw text), tokenizes it, and displays the following output.

[0..3) Hi. 
[4..7) How 
[8..11) are 
[12..16) you? 
[17..24) Welcome 
[25..27) to 
[28..37) Howcodex. 
[38..40) We 
[41..48) provide 
[49..53) free 
[54..63) tutorials 
[64..66) on 
[67..74) various 
[75..87) technologies

Example (TokenizerME)

Following is the program which retrieves the token spans of the raw text using the TokenizerME class. It also prints the tokens along with their positions. Save this program in a file with the name TokenizerMESpans.java.

import java.io.FileInputStream; 
import java.io.InputStream; 
import opennlp.tools.tokenize.TokenizerME; 
import opennlp.tools.tokenize.TokenizerModel; 
import opennlp.tools.util.Span;  

public class TokenizerMESpans { 
   public static void main(String args[]) throws Exception{     
      String sent = "Hello John how are you welcome to Howcodex"; 
       
      //Loading the Tokenizer model 
      InputStream inputStream = new FileInputStream("C:/OpenNLP_models/en-token.bin"); 
      TokenizerModel tokenModel = new TokenizerModel(inputStream); 
       
      //Instantiating the TokenizerME class 
      TokenizerME tokenizer = new TokenizerME(tokenModel); 
       
      //Retrieving the positions of the tokens 
      Span tokens[] = tokenizer.tokenizePos(sent); 
       
      //Printing the spans of tokens 
      for(Span token : tokens) 
         System.out.println(token +" "+sent.substring(token.getStart(), token.getEnd()));      
   } 
} 

Compile and execute the saved Java file from the Command prompt using the following commands −

javac TokenizerMESpans.java 
java TokenizerMESpans

On executing, the above program reads the given String (raw text), tokenizes it, and displays the following output −

[0..5) Hello 
[6..10) John 
[11..14) how 
[15..18) are 
[19..22) you 
[23..30) welcome 
[31..33) to 
[34..42) Howcodex 

Tokenizer Probability

The getTokenProbabilities() method of the TokenizerME class returns the probabilities associated with the most recent call to the tokenizePos() method.

//Getting the probabilities of the most recent call to the tokenizePos() method 
double[] probs = tokenizer.getTokenProbabilities(); 
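Judging from the sample output at the end of this section, the returned array holds one probability per token, in token order. Under that assumption, a small helper can print each token next to its probability. The sketch is plain Java: ProbabilityReport and describe are hypothetical names, and the arrays in main are made-up stand-ins for the tokens and probabilities a tokenizer would return.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper; not part of OpenNLP. */
final class ProbabilityReport {

   // Pairs each token with its probability; assumes the two arrays are parallel.
   static List<String> describe(String[] tokens, double[] probs) {
      if (tokens.length != probs.length) {
         throw new IllegalArgumentException("expected one probability per token");
      }
      List<String> lines = new ArrayList<>();
      for (int i = 0; i < tokens.length; i++) {
         // Locale.ROOT keeps the decimal separator a dot on every machine.
         lines.add(String.format(Locale.ROOT, "%s %.2f", tokens[i], probs[i]));
      }
      return lines;
   }

   public static void main(String[] args) {
      String[] tokens = {"Hello", "John"}; // made-up tokens
      double[] probs = {1.0, 0.97};        // made-up probabilities
      for (String line : describe(tokens, probs)) {
         System.out.println(line);
      }
   }
}
```

Checking the lengths first turns a silent mismatch into an immediate error, which helps if the probabilities are read after a different call than the one that produced the tokens.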

Following is the program to print the probabilities associated with the calls to tokenizePos() method. Save this program in a file with the name TokenizerMEProbs.java.

import java.io.FileInputStream; 
import java.io.InputStream; 
import opennlp.tools.tokenize.TokenizerME; 
import opennlp.tools.tokenize.TokenizerModel; 
import opennlp.tools.util.Span;  

public class TokenizerMEProbs { 
   
   public static void main(String args[]) throws Exception{     
      String sent = "Hello John how are you welcome to Howcodex"; 
      
      //Loading the Tokenizer model 
      InputStream inputStream = new FileInputStream("C:/OpenNLP_models/en-token.bin"); 
      TokenizerModel tokenModel = new TokenizerModel(inputStream); 
      
      //Instantiating the TokenizerME class 
      TokenizerME tokenizer = new TokenizerME(tokenModel);
      
      //Retrieving the positions of the tokens 
      Span tokens[] = tokenizer.tokenizePos(sent); 
       
      //Getting the probabilities of the recent calls to tokenizePos() method 
      double[] probs = tokenizer.getTokenProbabilities(); 
       
      //Printing the spans of tokens 
      for(Span token : tokens) 
         System.out.println(token +" "+sent.substring(token.getStart(), token.getEnd()));      
      System.out.println("  "); 
      
      //Printing the probabilities of the tokens 
      for(int i = 0; i<probs.length; i++) 
         System.out.println(probs[i]);          
   } 
}      

Compile and execute the saved Java file from the Command prompt using the following commands −

javac TokenizerMEProbs.java 
java TokenizerMEProbs 

On executing, the above program reads the given String, tokenizes it, and prints the tokens with their spans. It then prints the probabilities associated with the most recent call to the tokenizePos() method.

[0..5) Hello 
[6..10) John 
[11..14) how 
[15..18) are 
[19..22) you 
[23..30) welcome 
[31..33) to 
[34..42) Howcodex 
   
1.0 
1.0 
1.0 
1.0 
1.0 
1.0 
1.0 
1.0