OpenNLP - Named Entity Recognition


Advertisements

The process of finding names, people, places, and other entities, from a given text is known as Named Entity Recognition (NER). In this chapter, we will discuss how to carry out NER through Java program using OpenNLP library.

Named Entity Recognition using open NLP

To perform various NER tasks, OpenNLP uses different predefined models namely, en-nerdate.bn, en-ner-location.bin, en-ner-organization.bin, en-ner-person.bin, and en-ner-time.bin. All these files are predefined models which are trained to detect the respective entities in a given raw text.

The opennlp.tools.namefind package contains the classes and interfaces that are used to perform the NER task. To perform NER task using OpenNLP library, you need to −

  • Load the respective model using the TokenNameFinderModel class.

  • Instantiate the NameFinder class.

  • Find the names and print them.

Following are the steps to be followed to write a program which detects the name entities from a given raw text.

Step 1: Loading the model

The model for sentence detection is represented by the class named TokenNameFinderModel, which belongs to the package opennlp.tools.namefind.

To load an NER model −

  • Create an InputStream object of the model (Instantiate the FileInputStream and pass the path of the appropriate NER model in String format to its constructor).

  • Instantiate the TokenNameFinderModel class and pass the InputStream (object) of the model as a parameter to its constructor, as shown in the following code block.

//Loading the NER-person model 
InputStream inputStreamNameFinder = new FileInputStream(".../en-nerperson.bin");       
TokenNameFinderModel model = new TokenNameFinderModel(inputStreamNameFinder);

Step 2: Instantiating the NameFinderME class

The NameFinderME class of the package opennlp.tools.namefind contains methods to perform the NER tasks. This class uses the Maximum Entropy model to find the named entities in the given raw text.

Instantiate this class and pass the model object created in the previous step as shown below −

//Instantiating the NameFinderME class 
NameFinderME nameFinder = new NameFinderME(model);

Step 3: Finding the names in the sentence

The find() method of the NameFinderME class is used to detect the names in the raw text passed to it. This method accepts a String variable as a parameter.

Invoke this method by passing the String format of the sentence to this method.

//Finding the names in the sentence 
Span nameSpans[] = nameFinder.find(sentence);

Step 4: Printing the spans of the names in the sentence

The find() method of the NameFinderME class returns an array of objects of the type Span. The class named Span of the opennlp.tools.util package is used to store the start and end integer of sets.

You can store the spans returned by the find() method in the Span array and print them, as shown in the following code block.

//Printing the sentences and their spans of a sentence 
for (Span span : spans)         
System.out.println(paragraph.substring(span);

NER Example

Following is the program which reads the given sentence and recognizes the spans of the names of the persons in it. Save this program in a file with the name NameFinderME_Example.java.

import java.io.FileInputStream; 
import java.io.InputStream;  

import opennlp.tools.namefind.NameFinderME; 
import opennlp.tools.namefind.TokenNameFinderModel; 
import opennlp.tools.util.Span;  

public class NameFinderME_Example { 
   public static void main(String args[]) throws Exception{ 
      /Loading the NER - Person model       InputStream inputStream = new 
         FileInputStream("C:/OpenNLP_models/en-ner-person.bin"); 
      TokenNameFinderModel model = new TokenNameFinderModel(inputStream);
      
      //Instantiating the NameFinder class 
      NameFinderME nameFinder = new NameFinderME(model); 
    
      //Getting the sentence in the form of String array  
      String [] sentence = new String[]{ 
         "Mike", 
         "and", 
         "Smith", 
         "are", 
         "good", 
         "friends" 
      }; 
       
      //Finding the names in the sentence 
      Span nameSpans[] = nameFinder.find(sentence); 
       
      //Printing the spans of the names in the sentence 
      for(Span s: nameSpans) 
         System.out.println(s.toString());    
   }    
}      

Compile and execute the saved Java file from the Command prompt using the following commands −

javac NameFinderME_Example.java 
java NameFinderME_Example

On executing, the above program reads the given String (raw text), detects the names of the persons in it, and displays their positions (spans), as shown below.

[0..1) person 
[2..3) person 

Names along with their Positions

The substring() method of the String class accepts the begin and the end offsets and returns the respective string. We can use this method to print the names and their spans (positions) together, as shown in the following code block.

for(Span s: nameSpans)        
   System.out.println(s.toString()+"  "+tokens[s.getStart()]);

Following is the program to detect the names from the given raw text and display them along with their positions. Save this program in a file with the name NameFinderSentences.java.

import java.io.FileInputStream; 
import java.io.InputStream;  

import opennlp.tools.namefind.NameFinderME; 
import opennlp.tools.namefind.TokenNameFinderModel; 
import opennlp.tools.tokenize.TokenizerME; 
import opennlp.tools.tokenize.TokenizerModel; 
import opennlp.tools.util.Span;  

public class NameFinderSentences {  
   public static void main(String args[]) throws Exception{        
      
      //Loading the tokenizer model 
      InputStream inputStreamTokenizer = new 
         FileInputStream("C:/OpenNLP_models/entoken.bin");
      TokenizerModel tokenModel = new TokenizerModel(inputStreamTokenizer); 
       
      //Instantiating the TokenizerME class 
      TokenizerME tokenizer = new TokenizerME(tokenModel); 
       
      //Tokenizing the sentence in to a string array 
      String sentence = "Mike is senior programming 
      manager and Rama is a clerk both are working at 
      Howcodex"; 
      String tokens[] = tokenizer.tokenize(sentence); 
       
      //Loading the NER-person model 
      InputStream inputStreamNameFinder = new 
         FileInputStream("C:/OpenNLP_models/enner-person.bin");       
      TokenNameFinderModel model = new TokenNameFinderModel(inputStreamNameFinder);
      
      //Instantiating the NameFinderME class 
      NameFinderME nameFinder = new NameFinderME(model);       
      
      //Finding the names in the sentence 
      Span nameSpans[] = nameFinder.find(tokens);        
      
      //Printing the names and their spans in a sentence 
      for(Span s: nameSpans)        
         System.out.println(s.toString()+"  "+tokens[s.getStart()]);      
   }    
} 

Compile and execute the saved Java file from the Command prompt using the following commands −

javac NameFinderSentences.java 
java NameFinderSentences 

On executing, the above program reads the given String (raw text), detects the names of the persons in it, and displays their positions (spans) as shown below.

[0..1) person  Mike

Finding the Names of the Location

By loading various models, you can detect various named entities. Following is a Java program which loads the en-ner-location.bin model and detects the location names in the given sentence. Save this program in a file with the name LocationFinder.java.

import java.io.FileInputStream; 
import java.io.InputStream;  

import opennlp.tools.namefind.NameFinderME; 
import opennlp.tools.namefind.TokenNameFinderModel; 
import opennlp.tools.tokenize.TokenizerME; 
import opennlp.tools.tokenize.TokenizerModel; 
import opennlp.tools.util.Span;  

public class LocationFinder { 
   public static void main(String args[]) throws Exception{
 
      InputStream inputStreamTokenizer = new 
         FileInputStream("C:/OpenNLP_models/entoken.bin"); 
      TokenizerModel tokenModel = new TokenizerModel(inputStreamTokenizer); 
       
      //String paragraph = "Mike and Smith are classmates"; 
      String paragraph = "Howcodex is located in Hyderabad"; 
        
      //Instantiating the TokenizerME class 
      TokenizerME tokenizer = new TokenizerME(tokenModel); 
      String tokens[] = tokenizer.tokenize(paragraph); 
       
      //Loading the NER-location moodel 
      InputStream inputStreamNameFinder = new 
         FileInputStream("C:/OpenNLP_models/en- ner-location.bin");       
      TokenNameFinderModel model = new TokenNameFinderModel(inputStreamNameFinder); 
        
      //Instantiating the NameFinderME class 
      NameFinderME nameFinder = new NameFinderME(model);      
        
      //Finding the names of a location 
      Span nameSpans[] = nameFinder.find(tokens);        
      //Printing the spans of the locations in the sentence 
      for(Span s: nameSpans)        
         System.out.println(s.toString()+"  "+tokens[s.getStart()]); 
   }    
}   

Compile and execute the saved Java file from the Command prompt using the following commands −

javac LocationFinder.java 
java LocationFinder

On executing, the above program reads the given String (raw text), detects the names of the persons in it, and displays their positions (spans), as shown below.

[4..5) location  Hyderabad

NameFinder Probability

The probs()method of the NameFinderME class is used to get the probabilities of the last decoded sequence.

double[] probs = nameFinder.probs(); 

Following is the program to print the probabilities. Save this program in a file with the name TokenizerMEProbs.java.

import java.io.FileInputStream; 
import java.io.InputStream; 
import opennlp.tools.tokenize.TokenizerME; 
import opennlp.tools.tokenize.TokenizerModel; 
import opennlp.tools.util.Span; 
public class TokenizerMEProbs { 
   public static void main(String args[]) throws Exception{     
      String sent = "Hello John how are you welcome to Howcodex"; 
       
      //Loading the Tokenizer model 
      InputStream inputStream = new 
         FileInputStream("C:/OpenNLP_models/en-token.bin"); 
      TokenizerModel tokenModel = new TokenizerModel(inputStream); 
       
      //Instantiating the TokenizerME class 
      TokenizerME tokenizer = new TokenizerME(tokenModel); 
       
      //Retrieving the positions of the tokens 
      Span tokens[] = tokenizer.tokenizePos(sent); 
       
      //Getting the probabilities of the recent calls to tokenizePos() method 
      double[] probs = tokenizer.getTokenProbabilities(); 
       
      //Printing the spans of tokens 
      for( Span token : tokens) 
         System.out.println(token +" 
            "+sent.substring(token.getStart(), token.getEnd()));      
         System.out.println("  "); 
      for(int i = 0; i<probs.length; i++) 
         System.out.println(probs[i]);          
   } 
}

Compile and execute the saved Java file from the Command prompt using the following commands −

javac TokenizerMEProbs.java 
java TokenizerMEProbs

On executing, the above program reads the given String, tokenizes the sentences, and prints them. In addition, it also returns the probabilities of the last decoded sequence, as shown below.

[0..5) Hello 
[6..10) John 
[11..14) how 
[15..18) are 
[19..22) you 
[23..30) welcome 
[31..33) to 
[34..48) Howcodex 
   
1.0 
1.0 
1.0 
1.0 
1.0 
1.0 
1.0 
1.0
Advertisements