Lucene - Quick Guide


Advertisements

Lucene - Overview

Lucene is a simple yet powerful Java-based Search library. It can be used in any application to add search capability to it. Lucene is an open-source project. It is scalable. This high-performance library is used to index and search virtually any kind of text. Lucene library provides the core operations which are required by any search application. Indexing and Searching.

How Search Application works?

A Search application performs all or a few of the following operations −

Step Title Description
1

Acquire Raw Content

The first step of any search application is to collect the target contents on which search application is to be conducted.

2

Build the document

The next step is to build the document(s) from the raw content, which the search application can understand and interpret easily.

3

Analyze the document

Before the indexing process starts, the document is to be analyzed as to which part of the text is a candidate to be indexed. This process is where the document is analyzed.

4

Indexing the document

Once documents are built and analyzed, the next step is to index them so that this document can be retrieved based on certain keys instead of the entire content of the document. Indexing process is similar to indexes at the end of a book where common words are shown with their page numbers so that these words can be tracked quickly instead of searching the complete book.

5

User Interface for Search

Once a database of indexes is ready then the application can make any search. To facilitate a user to make a search, the application must provide a user a mean or a user interface where a user can enter text and start the search process.

6

Build Query

Once a user makes a request to search a text, the application should prepare a Query object using that text which can be used to inquire index database to get the relevant details.

7

Search Query

Using a query object, the index database is then checked to get the relevant details and the content documents.

8

Render Results

Once the result is received, the application should decide on how to show the results to the user using User Interface. How much information is to be shown at first look and so on.

Apart from these basic operations, a search application can also provide administration user interface and help administrators of the application to control the level of search based on the user profiles. Analytics of search results is another important and advanced aspect of any search application.

Lucene's Role in Search Application

Lucene plays role in steps 2 to step 7 mentioned above and provides classes to do the required operations. In a nutshell, Lucene is the heart of any search application and provides vital operations pertaining to indexing and searching. Acquiring contents and displaying the results is left for the application part to handle.

In the next chapter, we will perform a simple Search application using Lucene Search library.

Lucene - Environment Setup

This tutorial will guide you on how to prepare a development environment to start your work with the Spring Framework. This tutorial will also teach you how to setup JDK, Tomcat and Eclipse on your machine before you set up the Spring Framework −

Step 1 - Java Development Kit (JDK) Setup

You can download the latest version of SDK from Oracle's Java site: Java SE Downloads. You will find instructions for installing JDK in downloaded files; follow the given instructions to install and configure the setup. Finally set the PATH and JAVA_HOME environment variables to refer to the directory that contains Java and javac, typically java_install_dir/bin and java_install_dir respectively.

If you are running Windows and installed the JDK in C:\jdk1.6.0_15, you would have to put the following line in your C:\autoexec.bat file.

set PATH = C:\jdk1.6.0_15\bin;%PATH%
set JAVA_HOME = C:\jdk1.6.0_15

Alternatively, on Windows NT/2000/XP, you could also right-click on My Computer, select Properties, then Advanced, then Environment Variables. Then, you would update the PATH value and press the OK button.

On Unix (Solaris, Linux, etc.), if the SDK is installed in /usr/local/jdk1.6.0_15 and you use the C shell, you would put the following into your .cshrc file.

setenv PATH /usr/local/jdk1.6.0_15/bin:$PATH
setenv JAVA_HOME /usr/local/jdk1.6.0_15

Alternatively, if you use an Integrated Development Environment (IDE) like Borland JBuilder, Eclipse, IntelliJ IDEA, or Sun ONE Studio, compile and run a simple program to confirm that the IDE knows where you installed Java, otherwise do proper setup as given in the document of the IDE.

Step 2 - Eclipse IDE Setup

All the examples in this tutorial have been written using Eclipse IDE. So I would suggest you should have the latest version of Eclipse installed on your machine.

To install Eclipse IDE, download the latest Eclipse binaries from https://www.eclipse.org/downloads/. Once you downloaded the installation, unpack the binary distribution into a convenient location. For example, in C:\eclipse on windows, or /usr/local/eclipse on Linux/Unix and finally set PATH variable appropriately.

Eclipse can be started by executing the following commands on windows machine, or you can simply double click on eclipse.exe

 %C:\eclipse\eclipse.exe

Eclipse can be started by executing the following commands on Unix (Solaris, Linux, etc.) machine −

$/usr/local/eclipse/eclipse

After a successful startup, it should display the following result −

Eclipse Home page

Step 3 - Setup Lucene Framework Libraries

If the startup is successful, then you can proceed to set up your Lucene framework. Following are the simple steps to download and install the framework on your machine.

https://archive.apache.org/dist/lucene/java/3.6.2/

  • Make a choice whether you want to install Lucene on Windows, or Unix and then proceed to the next step to download the .zip file for windows and .tz file for Unix.

  • Download the suitable version of Lucene framework binaries from https://archive.apache.org/dist/lucene/java/.

  • At the time of writing this tutorial, I downloaded lucene-3.6.2.zip on my Windows machine and when you unzip the downloaded file it will give you the directory structure inside C:\lucene-3.6.2 as follows.

Lucene Directories

You will find all the Lucene libraries in the directory C:\lucene-3.6.2. Make sure you set your CLASSPATH variable on this directory properly otherwise, you will face problem while running your application. If you are using Eclipse, then it is not required to set CLASSPATH because all the setting will be done through Eclipse.

Once you are done with this last step, you are ready to proceed for your first Lucene Example which you will see in the next chapter.

Lucene - First Application

In this chapter, we will learn the actual programming with Lucene Framework. Before you start writing your first example using Lucene framework, you have to make sure that you have set up your Lucene environment properly as explained in Lucene - Environment Setup tutorial. It is recommended you have the working knowledge of Eclipse IDE.

Let us now proceed by writing a simple Search Application which will print the number of search results found. We'll also see the list of indexes created during this process.

Step 1 - Create Java Project

The first step is to create a simple Java Project using Eclipse IDE. Follow the option File > New -> Project and finally select Java Project wizard from the wizard list. Now name your project as LuceneFirstApplication using the wizard window as follows −

Create Project Wizard

Once your project is created successfully, you will have following content in your Project Explorer

Lucene First Application Directories

Step 2 - Add Required Libraries

Let us now add Lucene core Framework library in our project. To do this, right click on your project name LuceneFirstApplication and then follow the following option available in context menu: Build Path -> Configure Build Path to display the Java Build Path window as follows −

Java Build Path

Now use Add External JARs button available under Libraries tab to add the following core JAR from the Lucene installation directory −

  • lucene-core-3.6.2

Step 3 - Create Source Files

Let us now create actual source files under the LuceneFirstApplication project. First we need to create a package called com.howcodex.lucene. To do this, right-click on src in package explorer section and follow the option : New -> Package.

Next we will create LuceneTester.java and other java classes under the com.howcodex.lucene package.

LuceneConstants.java

This class is used to provide various constants to be used across the sample application.

package com.howcodex.lucene;

public class LuceneConstants {
   public static final String CONTENTS = "contents";
   public static final String FILE_NAME = "filename";
   public static final String FILE_PATH = "filepath";
   public static final int MAX_SEARCH = 10;
}

TextFileFilter.java

This class is used as a .txt file filter.

package com.howcodex.lucene;

import java.io.File;
import java.io.FileFilter;

public class TextFileFilter implements FileFilter {

   @Override
   public boolean accept(File pathname) {
      return pathname.getName().toLowerCase().endsWith(".txt");
   }
}

Indexer.java

This class is used to index the raw data so that we can make it searchable using the Lucene library.

package com.howcodex.lucene;

import java.io.File;
import java.io.FileFilter;
import java.io.FileReader;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class Indexer {

   private IndexWriter writer;

   public Indexer(String indexDirectoryPath) throws IOException {
      //this directory will contain the indexes
      Directory indexDirectory = 
         FSDirectory.open(new File(indexDirectoryPath));

      //create the indexer
      writer = new IndexWriter(indexDirectory, 
         new StandardAnalyzer(Version.LUCENE_36),true, 
         IndexWriter.MaxFieldLength.UNLIMITED);
   }

   public void close() throws CorruptIndexException, IOException {
      writer.close();
   }

   private Document getDocument(File file) throws IOException {
      Document document = new Document();

      //index file contents
      Field contentField = new Field(LuceneConstants.CONTENTS, new FileReader(file));
      //index file name
      Field fileNameField = new Field(LuceneConstants.FILE_NAME,
         file.getName(),Field.Store.YES,Field.Index.NOT_ANALYZED);
      //index file path
      Field filePathField = new Field(LuceneConstants.FILE_PATH,
         file.getCanonicalPath(),Field.Store.YES,Field.Index.NOT_ANALYZED);

      document.add(contentField);
      document.add(fileNameField);
      document.add(filePathField);

      return document;
   }   

   private void indexFile(File file) throws IOException {
      System.out.println("Indexing "+file.getCanonicalPath());
      Document document = getDocument(file);
      writer.addDocument(document);
   }

   public int createIndex(String dataDirPath, FileFilter filter) 
      throws IOException {
      //get all files in the data directory
      File[] files = new File(dataDirPath).listFiles();

      for (File file : files) {
         if(!file.isDirectory()
            && !file.isHidden()
            && file.exists()
            && file.canRead()
            && filter.accept(file)
         ){
            indexFile(file);
         }
      }
      return writer.numDocs();
   }
}

Searcher.java

This class is used to search the indexes created by the Indexer to search the requested content.

package com.howcodex.lucene;

import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class Searcher {
	
   IndexSearcher indexSearcher;
   QueryParser queryParser;
   Query query;
   
   public Searcher(String indexDirectoryPath) 
      throws IOException {
      Directory indexDirectory = 
         FSDirectory.open(new File(indexDirectoryPath));
      indexSearcher = new IndexSearcher(indexDirectory);
      queryParser = new QueryParser(Version.LUCENE_36,
         LuceneConstants.CONTENTS,
         new StandardAnalyzer(Version.LUCENE_36));
   }
   
   public TopDocs search( String searchQuery) 
      throws IOException, ParseException {
      query = queryParser.parse(searchQuery);
      return indexSearcher.search(query, LuceneConstants.MAX_SEARCH);
   }

   public Document getDocument(ScoreDoc scoreDoc) 
      throws CorruptIndexException, IOException {
      return indexSearcher.doc(scoreDoc.doc);	
   }

   public void close() throws IOException {
      indexSearcher.close();
   }
}

LuceneTester.java

This class is used to test the indexing and search capability of lucene library.

package com.howcodex.lucene;

import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

public class LuceneTester {
	
   String indexDir = "E:\\Lucene\\Index";
   String dataDir = "E:\\Lucene\\Data";
   Indexer indexer;
   Searcher searcher;

   public static void main(String[] args) {
      LuceneTester tester;
      try {
         tester = new LuceneTester();
         tester.createIndex();
         tester.search("Mohan");
      } catch (IOException e) {
         e.printStackTrace();
      } catch (ParseException e) {
         e.printStackTrace();
      }
   }

   private void createIndex() throws IOException {
      indexer = new Indexer(indexDir);
      int numIndexed;
      long startTime = System.currentTimeMillis();	
      numIndexed = indexer.createIndex(dataDir, new TextFileFilter());
      long endTime = System.currentTimeMillis();
      indexer.close();
      System.out.println(numIndexed+" File indexed, time taken: "
         +(endTime-startTime)+" ms");		
   }

   private void search(String searchQuery) throws IOException, ParseException {
      searcher = new Searcher(indexDir);
      long startTime = System.currentTimeMillis();
      TopDocs hits = searcher.search(searchQuery);
      long endTime = System.currentTimeMillis();
   
      System.out.println(hits.totalHits +
         " documents found. Time :" + (endTime - startTime));
      for(ScoreDoc scoreDoc : hits.scoreDocs) {
         Document doc = searcher.getDocument(scoreDoc);
            System.out.println("File: "
            + doc.get(LuceneConstants.FILE_PATH));
      }
      searcher.close();
   }
}

Step 4 - Data & Index directory creation

We have used 10 text files from record1.txt to record10.txt containing names and other details of the students and put them in the directory E:\Lucene\Data. Test Data. An index directory path should be created as E:\Lucene\Index. After running this program, you can see the list of index files created in that folder.

Step 5 - Running the program

Once you are done with the creation of the source, the raw data, the data directory and the index directory, you are ready for compiling and running of your program. To do this, keep the LuceneTester.Java file tab active and use either the Run option available in the Eclipse IDE or use Ctrl + F11 to compile and run your LuceneTester application. If the application runs successfully, it will print the following message in Eclipse IDE's console −

Indexing E:\Lucene\Data\record1.txt
Indexing E:\Lucene\Data\record10.txt
Indexing E:\Lucene\Data\record2.txt
Indexing E:\Lucene\Data\record3.txt
Indexing E:\Lucene\Data\record4.txt
Indexing E:\Lucene\Data\record5.txt
Indexing E:\Lucene\Data\record6.txt
Indexing E:\Lucene\Data\record7.txt
Indexing E:\Lucene\Data\record8.txt
Indexing E:\Lucene\Data\record9.txt
10 File indexed, time taken: 109 ms
1 documents found. Time :0
File: E:\Lucene\Data\record4.txt

Once you've run the program successfully, you will have the following content in your index directory

Lucene Index Directory

Lucene - Indexing Classes

Indexing process is one of the core functionalities provided by Lucene. The following diagram illustrates the indexing process and the use of classes. IndexWriter is the most important and the core component of the indexing process.

Indexing Process

We add Document(s) containing Field(s) to IndexWriter which analyzes the Document(s) using the Analyzer and then creates/open/edit indexes as required and store/update them in a Directory. IndexWriter is used to update or create indexes. It is not used to read indexes.

Indexing Classes

Following is a list of commonly-used classes during the indexing process.

S.No. Class & Description
1 IndexWriter

This class acts as a core component which creates/updates indexes during the indexing process.

2 Directory

This class represents the storage location of the indexes.

3 Analyzer

This class is responsible to analyze a document and get the tokens/words from the text which is to be indexed. Without analysis done, IndexWriter cannot create index.

4 Document

This class represents a virtual document with Fields where the Field is an object which can contain the physical document's contents, its meta data and so on. The Analyzer can understand a Document only.

5 Field

This is the lowest unit or the starting point of the indexing process. It represents the key value pair relationship where a key is used to identify the value to be indexed. Let us assume a field used to represent contents of a document will have key as "contents" and the value may contain the part or all of the text or numeric content of the document. Lucene can index only text or numeric content only.

Lucene - Searching Classes

The process of Searching is again one of the core functionalities provided by Lucene. Its flow is similar to that of the indexing process. Basic search of Lucene can be made using the following classes which can also be termed as foundation classes for all search related operations.

Searching Classes

Following is a list of commonly-used classes during searching process.

S.No. Class & Description
1 IndexSearcher

This class act as a core component which reads/searches indexes created after the indexing process. It takes directory instance pointing to the location containing the indexes.

2 Term

This class is the lowest unit of searching. It is similar to Field in indexing process.

3 Query

Query is an abstract class and contains various utility methods and is the parent of all types of queries that Lucene uses during search process.

4 TermQuery

TermQuery is the most commonly-used query object and is the foundation of many complex queries that Lucene can make use of.

5 TopDocs

TopDocs points to the top N search results which matches the search criteria. It is a simple container of pointers to point to documents which are the output of a search result.

Lucene - Indexing Process

Indexing process is one of the core functionality provided by Lucene. Following diagram illustrates the indexing process and use of classes. IndexWriter is the most important and core component of the indexing process.

Indexing Process

We add Document(s) containing Field(s) to IndexWriter which analyzes the Document(s) using the Analyzer and then creates/open/edit indexes as required and store/update them in a Directory. IndexWriter is used to update or create indexes. It is not used to read indexes.

Now we'll show you a step by step process to get a kick start in understanding of indexing process using a basic example.

Create a document

  • Create a method to get a lucene document from a text file.

  • Create various types of fields which are key value pairs containing keys as names and values as contents to be indexed.

  • Set field to be analyzed or not. In our case, only contents is to be analyzed as it can contain data such as a, am, are, an etc. which are not required in search operations.

  • Add the newly created fields to the document object and return it to the caller method.

private Document getDocument(File file) throws IOException {
   Document document = new Document();
   
   //index file contents
   Field contentField = new Field(LuceneConstants.CONTENTS, 
      new FileReader(file));
   
   //index file name
   Field fileNameField = new Field(LuceneConstants.FILE_NAME,
      file.getName(),
      Field.Store.YES,Field.Index.NOT_ANALYZED);
   
   //index file path
   Field filePathField = new Field(LuceneConstants.FILE_PATH,
      file.getCanonicalPath(),
      Field.Store.YES,Field.Index.NOT_ANALYZED);

   document.add(contentField);
   document.add(fileNameField);
   document.add(filePathField);

   return document;
}   

Create a IndexWriter

IndexWriter class acts as a core component which creates/updates indexes during indexing process. Follow these steps to create a IndexWriter −

Step 1 − Create object of IndexWriter.

Step 2 − Create a Lucene directory which should point to location where indexes are to be stored.

Step 3 − Initialize the IndexWriter object created with the index directory, a standard analyzer having version information and other required/optional parameters.

private IndexWriter writer;

public Indexer(String indexDirectoryPath) throws IOException {
   //this directory will contain the indexes
   Directory indexDirectory = 
      FSDirectory.open(new File(indexDirectoryPath));
   
   //create the indexer
   writer = new IndexWriter(indexDirectory, 
      new StandardAnalyzer(Version.LUCENE_36),true,
      IndexWriter.MaxFieldLength.UNLIMITED);
}

Start Indexing Process

The following program shows how to start an indexing process −

private void indexFile(File file) throws IOException {
   System.out.println("Indexing "+file.getCanonicalPath());
   Document document = getDocument(file);
   writer.addDocument(document);
}

Example Application

To test the indexing process, we need to create a Lucene application test.

Step Description
1

Create a project with a name LuceneFirstApplication under a package com.howcodex.lucene as explained in the Lucene - First Application chapter. You can also use the project created in Lucene - First Application chapter as such for this chapter to understand the indexing process.

2

Create LuceneConstants.java,TextFileFilter.java and Indexer.java as explained in the Lucene - First Application chapter. Keep the rest of the files unchanged.

3

Create LuceneTester.java as mentioned below.

4

Clean and build the application to make sure the business logic is working as per the requirements.

LuceneConstants.java

This class is used to provide various constants to be used across the sample application.

package com.howcodex.lucene;

public class LuceneConstants {
   public static final String CONTENTS = "contents";
   public static final String FILE_NAME = "filename";
   public static final String FILE_PATH = "filepath";
   public static final int MAX_SEARCH = 10;
}

TextFileFilter.java

This class is used as a .txt file filter.

package com.howcodex.lucene;

import java.io.File;
import java.io.FileFilter;

public class TextFileFilter implements FileFilter {

   @Override
   public boolean accept(File pathname) {
      return pathname.getName().toLowerCase().endsWith(".txt");
   }
}

Indexer.java

This class is used to index the raw data so that we can make it searchable using the Lucene library.

package com.howcodex.lucene;

import java.io.File;
import java.io.FileFilter;
import java.io.FileReader;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class Indexer {

   private IndexWriter writer;

   public Indexer(String indexDirectoryPath) throws IOException {
      //this directory will contain the indexes
      Directory indexDirectory = 
         FSDirectory.open(new File(indexDirectoryPath));

      //create the indexer
      writer = new IndexWriter(indexDirectory, 
         new StandardAnalyzer(Version.LUCENE_36),true,
         IndexWriter.MaxFieldLength.UNLIMITED);
   }

   public void close() throws CorruptIndexException, IOException {
      writer.close();
   }

   private Document getDocument(File file) throws IOException {
      Document document = new Document();

      //index file contents
      Field contentField = new Field(LuceneConstants.CONTENTS, 
         new FileReader(file));
      
      //index file name
      Field fileNameField = new Field(LuceneConstants.FILE_NAME,
         file.getName(),
         Field.Store.YES,Field.Index.NOT_ANALYZED);
      
      //index file path
      Field filePathField = new Field(LuceneConstants.FILE_PATH,
         file.getCanonicalPath(),
         Field.Store.YES,Field.Index.NOT_ANALYZED);

      document.add(contentField);
      document.add(fileNameField);
      document.add(filePathField);

      return document;
   }   

   private void indexFile(File file) throws IOException {
      System.out.println("Indexing "+file.getCanonicalPath());
      Document document = getDocument(file);
      writer.addDocument(document);
   }

   public int createIndex(String dataDirPath, FileFilter filter) 
      throws IOException {
      //get all files in the data directory
      File[] files = new File(dataDirPath).listFiles();

      for (File file : files) {
         if(!file.isDirectory()
            && !file.isHidden()
            && file.exists()
            && file.canRead()
            && filter.accept(file)
         ){
            indexFile(file);
         }
      }
      return writer.numDocs();
   }
}

LuceneTester.java

This class is used to test the indexing capability of the Lucene library.

package com.howcodex.lucene;

import java.io.IOException;

public class LuceneTester {
	
   String indexDir = "E:\\Lucene\\Index";
   String dataDir = "E:\\Lucene\\Data";
   Indexer indexer;
   
   public static void main(String[] args) {
      LuceneTester tester;
      try {
         tester = new LuceneTester();
         tester.createIndex();
      } catch (IOException e) {
         e.printStackTrace();
      } 
   }

   private void createIndex() throws IOException {
      indexer = new Indexer(indexDir);
      int numIndexed;
      long startTime = System.currentTimeMillis();	
      numIndexed = indexer.createIndex(dataDir, new TextFileFilter());
      long endTime = System.currentTimeMillis();
      indexer.close();
      System.out.println(numIndexed+" File indexed, time taken: "
         +(endTime-startTime)+" ms");		
   }
}

Data & Index Directory Creation

We have used 10 text files from record1.txt to record10.txt containing names and other details of the students and put them in the directory E:\Lucene\Data. Test Data. An index directory path should be created as E:\Lucene\Index. After running this program, you can see the list of index files created in that folder.

Running the Program

Once you are done with the creation of the source, the raw data, the data directory and the index directory, you can proceed by compiling and running your program. To do this, keep the LuceneTester.Java file tab active and use either the Run option available in the Eclipse IDE or use Ctrl + F11 to compile and run your LuceneTester application. If your application runs successfully, it will print the following message in Eclipse IDE's console −

Indexing E:\Lucene\Data\record1.txt
Indexing E:\Lucene\Data\record10.txt
Indexing E:\Lucene\Data\record2.txt
Indexing E:\Lucene\Data\record3.txt
Indexing E:\Lucene\Data\record4.txt
Indexing E:\Lucene\Data\record5.txt
Indexing E:\Lucene\Data\record6.txt
Indexing E:\Lucene\Data\record7.txt
Indexing E:\Lucene\Data\record8.txt
Indexing E:\Lucene\Data\record9.txt
10 File indexed, time taken: 109 ms

Once you've run the program successfully, you will have the following content in your index directory −

Lucene Index Directory

Lucene - Indexing Operations

In this chapter, we'll discuss the four major operations of indexing. These operations are useful at various times and are used throughout of a software search application.

Indexing Operations

Following is a list of commonly-used operations during indexing process.

S.No. Operation & Description
1 Add Document

This operation is used in the initial stage of the indexing process to create the indexes on the newly available content.

2 Update Document

This operation is used to update indexes to reflect the changes in the updated contents. It is similar to recreating the index.

3 Delete Document

This operation is used to update indexes to exclude the documents which are not required to be indexed/searched.

4 Field Options

Field options specify a way or control the ways in which the contents of a field are to be made searchable.

Lucene - Search Operation

The process of searching is one of the core functionalities provided by Lucene. Following diagram illustrates the process and its use. IndexSearcher is one of the core components of the searching process.

Searching Process

We first create Directory(s) containing indexes and then pass it to IndexSearcher which opens the Directory using IndexReader. Then we create a Query with a Term and make a search using IndexSearcher by passing the Query to the searcher. IndexSearcher returns a TopDocs object which contains the search details along with document ID(s) of the Document which is the result of the search operation.

We will now show you a step-wise approach and help you understand the indexing process using a basic example.

Create a QueryParser

QueryParser class parses the user entered input into Lucene understandable format query. Follow these steps to create a QueryParser −

Step 1 − Create object of QueryParser.

Step 2 − Initialize the QueryParser object created with a standard analyzer having version information and index name on which this query is to be run.

QueryParser queryParser;

public Searcher(String indexDirectoryPath) throws IOException {

   queryParser = new QueryParser(Version.LUCENE_36,
      LuceneConstants.CONTENTS,
      new StandardAnalyzer(Version.LUCENE_36));
}

Create a IndexSearcher

IndexSearcher class acts as a core component which searcher indexes created during indexing process. Follow these steps to create a IndexSearcher −

Step 1 − Create object of IndexSearcher.

Step 2 − Create a Lucene directory which should point to location where indexes are to be stored.

Step 3 − Initialize the IndexSearcher object created with the index directory.

IndexSearcher indexSearcher;

public Searcher(String indexDirectoryPath) throws IOException {
   Directory indexDirectory = 
      FSDirectory.open(new File(indexDirectoryPath));
   indexSearcher = new IndexSearcher(indexDirectory);
}

Make search

Follow these steps to make search −

Step 1 − Create a Query object by parsing the search expression through QueryParser.

Step 2 − Make search by calling the IndexSearcher.search() method.

Query query;

public TopDocs search( String searchQuery) throws IOException, ParseException {
   query = queryParser.parse(searchQuery);
   return indexSearcher.search(query, LuceneConstants.MAX_SEARCH);
}

Get the Document

The following program shows how to get the document.

public Document getDocument(ScoreDoc scoreDoc) 
   throws CorruptIndexException, IOException {
   return indexSearcher.doc(scoreDoc.doc);	
}

Close IndexSearcher

The following program shows how to close the IndexSearcher.

public void close() throws IOException {
   indexSearcher.close();
}

Example Application

Let us create a test Lucene application to test searching process.

Step Description
1

Create a project with a name LuceneFirstApplication under a package com.howcodex.lucene as explained in the Lucene - First Application chapter. You can also use the project created in Lucene - First Application chapter as such for this chapter to understand the searching process.

2

Create LuceneConstants.java,TextFileFilter.java and Searcher.java as explained in the Lucene - First Application chapter. Keep the rest of the files unchanged.

3

Create LuceneTester.java as mentioned below.

4

Clean and Build the application to make sure business logic is working as per the requirements.

LuceneConstants.java

This class is used to provide various constants to be used across the sample application.

package com.howcodex.lucene;

public class LuceneConstants {
   public static final String CONTENTS = "contents";
   public static final String FILE_NAME = "filename";
   public static final String FILE_PATH = "filepath";
   public static final int MAX_SEARCH = 10;
}

TextFileFilter.java

This class is used as a .txt file filter.

package com.howcodex.lucene;

import java.io.File;
import java.io.FileFilter;

public class TextFileFilter implements FileFilter {

   @Override
   public boolean accept(File pathname) {
      return pathname.getName().toLowerCase().endsWith(".txt");
   }
}

Searcher.java

This class is used to read the indexes made on raw data and searches data using the Lucene library.

package com.howcodex.lucene;

import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class Searcher {
	
   IndexSearcher indexSearcher;
   QueryParser queryParser;
   Query query;

   public Searcher(String indexDirectoryPath) throws IOException {
      Directory indexDirectory = 
         FSDirectory.open(new File(indexDirectoryPath));
      indexSearcher = new IndexSearcher(indexDirectory);
      queryParser = new QueryParser(Version.LUCENE_36,
         LuceneConstants.CONTENTS,
         new StandardAnalyzer(Version.LUCENE_36));
   }

   public TopDocs search( String searchQuery) 
      throws IOException, ParseException {
      query = queryParser.parse(searchQuery);
      return indexSearcher.search(query, LuceneConstants.MAX_SEARCH);
   }

   public Document getDocument(ScoreDoc scoreDoc) 
      throws CorruptIndexException, IOException {
      return indexSearcher.doc(scoreDoc.doc);	
   }

   public void close() throws IOException {
      indexSearcher.close();
   }
}

LuceneTester.java

This class is used to test the searching capability of the Lucene library.

package com.howcodex.lucene;

import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

public class LuceneTester {
	
   String indexDir = "E:\\Lucene\\Index";
   String dataDir = "E:\\Lucene\\Data";
   Searcher searcher;

   public static void main(String[] args) {
      LuceneTester tester;
      try {
         tester = new LuceneTester();
         tester.search("Mohan");
      } catch (IOException e) {
         e.printStackTrace();
      } catch (ParseException e) {
         e.printStackTrace();
      }
   }

   private void search(String searchQuery) throws IOException, ParseException {
      searcher = new Searcher(indexDir);
      long startTime = System.currentTimeMillis();
      TopDocs hits = searcher.search(searchQuery);
      long endTime = System.currentTimeMillis();

      System.out.println(hits.totalHits +
         " documents found. Time :" + (endTime - startTime) +" ms");
      for(ScoreDoc scoreDoc : hits.scoreDocs) {
         Document doc = searcher.getDocument(scoreDoc);
         System.out.println("File: "+ doc.get(LuceneConstants.FILE_PATH));
      }
      searcher.close();
   }	
}

Data & Index Directory Creation

We have used 10 text files named record1.txt to record10.txt containing names and other details of the students and put them in the directory E:\Lucene\Data. Test Data. An index directory path should be created as E:\Lucene\Index. After running the indexing program in the chapter Lucene - Indexing Process, you can see the list of index files created in that folder.

Running the Program

Once you are done with the creation of the source, the raw data, the data directory, the index directory and the indexes, you can proceed by compiling and running your program. To do this, keep LuceneTester.Java file tab active and use either the Run option available in the Eclipse IDE or use Ctrl + F11 to compile and run your LuceneTesterapplication. If your application runs successfully, it will print the following message in Eclipse IDE's console −

1 documents found. Time :29 ms
File: E:\Lucene\Data\record4.txt

Lucene - Query Programming

We have seen in previous chapter Lucene - Search Operation, Lucene uses IndexSearcher to make searches and it uses the Query object created by QueryParser as the input. In this chapter, we are going to discuss various types of Query objects and the different ways to create them programmatically. Creating different types of Query object gives control on the kind of search to be made.

Consider a case of Advanced Search, provided by many applications where users are given multiple options to confine the search results. By Query programming, we can achieve the same very easily.

Following is the list of Query types that we'll discuss in due course.

S.No. Class & Description
1 TermQuery

This class acts as a core component which creates/updates indexes during the indexing process.

2 TermRangeQuery

TermRangeQuery is used when a range of textual terms are to be searched.

3 PrefixQuery

PrefixQuery is used to match documents whose index starts with a specified string.

4 BooleanQuery

BooleanQuery is used to search documents which are result of multiple queries using AND, OR or NOT operators.

5 PhraseQuery

Phrase query is used to search documents which contain a particular sequence of terms.

6 WildCardQuery

WildcardQuery is used to search documents using wildcards like '*' for any character sequence,? matching a single character.

7 FuzzyQuery

FuzzyQuery is used to search documents using fuzzy implementation that is an approximate search based on the edit distance algorithm.

8 MatchAllDocsQuery

MatchAllDocsQuery as the name suggests matches all the documents.

Lucene - Analysis

In one of our previous chapters, we have seen that Lucene uses IndexWriter to analyze the Document(s) using the Analyzer and then creates/open/edit indexes as required. In this chapter, we are going to discuss the various types of Analyzer objects and other relevant objects which are used during the analysis process. Understanding the Analysis process and how analyzers work will give you great insight over how Lucene indexes the documents.

Following is the list of objects that we'll discuss in due course.

S.No. Class & Description
1 Token

Token represents text or word in a document with relevant details like its metadata (position, start offset, end offset, token type and its position increment).

2 TokenStream

TokenStream is an output of the analysis process and it comprises of a series of tokens. It is an abstract class.

3 Analyzer

This is an abstract base class for each and every type of Analyzer.

4 WhitespaceAnalyzer

This analyzer splits the text in a document based on whitespace.

5 SimpleAnalyzer

This analyzer splits the text in a document based on non-letter characters and puts the text in lowercase.

6 StopAnalyzer

This analyzer works just as the SimpleAnalyzer and removes the common words like 'a', 'an', 'the', etc.

7 StandardAnalyzer

This is the most sophisticated analyzer and is capable of handling names, email addresses, etc. It lowercases each token and removes common words and punctuations, if any.

Lucene - Sorting

In this chapter, we will look into the sorting orders in which Lucene gives the search results by default or can be manipulated as required.

Sorting by Relevance

This is the default sorting mode used by Lucene. Lucene provides results by the most relevant hit at the top.

private void sortUsingRelevance(String searchQuery)
   throws IOException, ParseException {
   searcher = new Searcher(indexDir);
   long startTime = System.currentTimeMillis();
   
   //create a term to search file name
   Term term = new Term(LuceneConstants.FILE_NAME, searchQuery);
   //create the term query object
   Query query = new FuzzyQuery(term);
   searcher.setDefaultFieldSortScoring(true, false);
   //do the search
   TopDocs hits = searcher.search(query,Sort.RELEVANCE);
   long endTime = System.currentTimeMillis();

   System.out.println(hits.totalHits +
      " documents found. Time :" + (endTime - startTime) + "ms");
   for(ScoreDoc scoreDoc : hits.scoreDocs) {
      Document doc = searcher.getDocument(scoreDoc);
      System.out.print("Score: "+ scoreDoc.score + " ");
      System.out.println("File: "+ doc.get(LuceneConstants.FILE_PATH));
   }
   searcher.close();
}

Sorting by IndexOrder

This sorting mode is used by Lucene. Here, the first document indexed is shown first in the search results.

private void sortUsingIndex(String searchQuery)
   throws IOException, ParseException {
   searcher = new Searcher(indexDir);
   long startTime = System.currentTimeMillis();
   
   //create a term to search file name
   Term term = new Term(LuceneConstants.FILE_NAME, searchQuery);
   //create the term query object
   Query query = new FuzzyQuery(term);
   searcher.setDefaultFieldSortScoring(true, false);
   //do the search
   TopDocs hits = searcher.search(query,Sort.INDEXORDER);
   long endTime = System.currentTimeMillis();

   System.out.println(hits.totalHits +
      " documents found. Time :" + (endTime - startTime) + "ms");
   for(ScoreDoc scoreDoc : hits.scoreDocs) {
      Document doc = searcher.getDocument(scoreDoc);
      System.out.print("Score: "+ scoreDoc.score + " ");
      System.out.println("File: "+ doc.get(LuceneConstants.FILE_PATH));
   }
   searcher.close();
}

Example Application

Let us create a test Lucene application to test the sorting process.

Step Description
1

Create a project with a name LuceneFirstApplication under a package com.howcodex.lucene as explained in the Lucene - First Application chapter. You can also use the project created in Lucene - First Application chapter as such for this chapter to understand the searching process.

2

Create LuceneConstants.java and Searcher.java as explained in the Lucene - First Application chapter. Keep the rest of the files unchanged.

3

Create LuceneTester.java as mentioned below.

4

Clean and Build the application to make sure the business logic is working as per the requirements.

LuceneConstants.java

This class is used to provide various constants to be used across the sample application.

package com.howcodex.lucene;

public class LuceneConstants {
   public static final String CONTENTS = "contents";
   public static final String FILE_NAME = "filename";
   public static final String FILE_PATH = "filepath";
   public static final int MAX_SEARCH = 10;
}

Searcher.java

This class is used to read the indexes made on raw data and searches data using the Lucene library.

package com.howcodex.lucene;

import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class Searcher {
	
IndexSearcher indexSearcher;
   QueryParser queryParser;
   Query query;

   public Searcher(String indexDirectoryPath) throws IOException {
      Directory indexDirectory 
         = FSDirectory.open(new File(indexDirectoryPath));
      indexSearcher = new IndexSearcher(indexDirectory);
      queryParser = new QueryParser(Version.LUCENE_36,
         LuceneConstants.CONTENTS,
         new StandardAnalyzer(Version.LUCENE_36));
   }

   public TopDocs search( String searchQuery) 
      throws IOException, ParseException {
      query = queryParser.parse(searchQuery);
      return indexSearcher.search(query, LuceneConstants.MAX_SEARCH);
   }

   public TopDocs search(Query query) 
      throws IOException, ParseException {
      return indexSearcher.search(query, LuceneConstants.MAX_SEARCH);
   }

   public TopDocs search(Query query,Sort sort) 
      throws IOException, ParseException {
      return indexSearcher.search(query, 
         LuceneConstants.MAX_SEARCH,sort);
   }

   public void setDefaultFieldSortScoring(boolean doTrackScores, 
      boolean doMaxScores) {
      indexSearcher.setDefaultFieldSortScoring(
         doTrackScores,doMaxScores);
   }

   public Document getDocument(ScoreDoc scoreDoc) 
      throws CorruptIndexException, IOException {
      return indexSearcher.doc(scoreDoc.doc);	
   }

   public void close() throws IOException {
      indexSearcher.close();
   }
}

LuceneTester.java

This class is used to test the searching capability of the Lucene library.

package com.howcodex.lucene;

import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.TopDocs;

public class LuceneTester {
	
   String indexDir = "E:\\Lucene\\Index";
   String dataDir = "E:\\Lucene\\Data";
   Indexer indexer;
   Searcher searcher;

   public static void main(String[] args) {
      LuceneTester tester;
      try {
          tester = new LuceneTester();
          tester.sortUsingRelevance("cord3.txt");
          tester.sortUsingIndex("cord3.txt");
      } catch (IOException e) {
          e.printStackTrace();
      } catch (ParseException e) {
          e.printStackTrace();
      }		
   }

   private void sortUsingRelevance(String searchQuery)
      throws IOException, ParseException {
      searcher = new Searcher(indexDir);
      long startTime = System.currentTimeMillis();
      
      //create a term to search file name
      Term term = new Term(LuceneConstants.FILE_NAME, searchQuery);
      //create the term query object
      Query query = new FuzzyQuery(term);
      searcher.setDefaultFieldSortScoring(true, false);
      //do the search
      TopDocs hits = searcher.search(query,Sort.RELEVANCE);
      long endTime = System.currentTimeMillis();

      System.out.println(hits.totalHits +
         " documents found. Time :" + (endTime - startTime) + "ms");
      for(ScoreDoc scoreDoc : hits.scoreDocs) {
         Document doc = searcher.getDocument(scoreDoc);
         System.out.print("Score: "+ scoreDoc.score + " ");
         System.out.println("File: "+ doc.get(LuceneConstants.FILE_PATH));
      }
      searcher.close();
   }

   private void sortUsingIndex(String searchQuery)
      throws IOException, ParseException {
      searcher = new Searcher(indexDir);
      long startTime = System.currentTimeMillis();
      //create a term to search file name
      Term term = new Term(LuceneConstants.FILE_NAME, searchQuery);
      //create the term query object
      Query query = new FuzzyQuery(term);
      searcher.setDefaultFieldSortScoring(true, false);
      //do the search
      TopDocs hits = searcher.search(query,Sort.INDEXORDER);
      long endTime = System.currentTimeMillis();

      System.out.println(hits.totalHits +
      " documents found. Time :" + (endTime - startTime) + "ms");
      for(ScoreDoc scoreDoc : hits.scoreDocs) {
         Document doc = searcher.getDocument(scoreDoc);
         System.out.print("Score: "+ scoreDoc.score + " ");
         System.out.println("File: "+ doc.get(LuceneConstants.FILE_PATH));
      }
      searcher.close();
   }
}

Data & Index Directory Creation

We have used 10 text files from record1.txt to record10.txt containing names and other details of the students and put them in the directory E:\Lucene\Data. Test Data. An index directory path should be created as E:\Lucene\Index. After running the indexing program in the chapter Lucene - Indexing Process, you can see the list of index files created in that folder.

Running the Program

Once you are done with the creation of the source, the raw data, the data directory, the index directory and the indexes, you can compile and run your program. To do this, Keep the LuceneTester.Java file tab active and use either the Run option available in the Eclipse IDE or use Ctrl + F11 to compile and run your LuceneTester application. If your application runs successfully, it will print the following message in Eclipse IDE's console −

10 documents found. Time :31ms
Score: 1.3179655 File: E:\Lucene\Data\record3.txt
Score: 0.790779 File: E:\Lucene\Data\record1.txt
Score: 0.790779 File: E:\Lucene\Data\record2.txt
Score: 0.790779 File: E:\Lucene\Data\record4.txt
Score: 0.790779 File: E:\Lucene\Data\record5.txt
Score: 0.790779 File: E:\Lucene\Data\record6.txt
Score: 0.790779 File: E:\Lucene\Data\record7.txt
Score: 0.790779 File: E:\Lucene\Data\record8.txt
Score: 0.790779 File: E:\Lucene\Data\record9.txt
Score: 0.2635932 File: E:\Lucene\Data\record10.txt
10 documents found. Time :0ms
Score: 0.790779 File: E:\Lucene\Data\record1.txt
Score: 0.2635932 File: E:\Lucene\Data\record10.txt
Score: 0.790779 File: E:\Lucene\Data\record2.txt
Score: 1.3179655 File: E:\Lucene\Data\record3.txt
Score: 0.790779 File: E:\Lucene\Data\record4.txt
Score: 0.790779 File: E:\Lucene\Data\record5.txt
Score: 0.790779 File: E:\Lucene\Data\record6.txt
Score: 0.790779 File: E:\Lucene\Data\record7.txt
Score: 0.790779 File: E:\Lucene\Data\record8.txt
Score: 0.790779 File: E:\Lucene\Data\record9.txt
Advertisements