In general, indexing is an arrangement of documents or (other entities) systematically. Indexing enables users to locate information in a document.
Indexing collects, parses, and stores documents.
Indexing is done to increase the speed and performance of a search query while finding a required document.
In Apache Solr, we can index (add, delete, modify) various document formats such as xml, csv, pdf, etc. We can add data to Solr index in several ways.
In this chapter, we are going to discuss indexing −
In this chapter, we will discuss how to add data to the index of Apache Solr using various interfaces (command line, web interface, and Java client API)
Solr has a post command in its bin/ directory. Using this command, you can index various formats of files such as JSON, XML, CSV in Apache Solr.
Browse through the bin directory of Apache Solr and execute the –h option of the post command, as shown in the following code block.
[Hadoop@localhost bin]$ cd $SOLR_HOME [Hadoop@localhost bin]$ ./post -h
On executing the above command, you will get a list of options of the post command, as shown below.
Usage: post -c <collection> [OPTIONS] <files|directories|urls|-d [".."]> or post –help collection name defaults to DEFAULT_SOLR_COLLECTION if not specified OPTIONS ======= Solr options: -url <base Solr update URL> (overrides collection, host, and port) -host <host> (default: localhost) -p or -port <port> (default: 8983) -commit yes|no (default: yes) Web crawl options: -recursive <depth> (default: 1) -delay <seconds> (default: 10) Directory crawl options: -delay <seconds> (default: 0) stdin/args options: -type <content/type> (default: application/xml) Other options: -filetypes <type>[,<type>,...] (default: xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots, rtf,htm,html,txt,log) -params "<key> = <value>[&<key> = <value>...]" (values must be URL-encoded; these pass through to Solr update request) -out yes|no (default: no; yes outputs Solr response to console) -format Solr (sends application/json content as Solr commands to /update instead of /update/json/docs) Examples: * JSON file:./post -c wizbang events.json * XML files: ./post -c records article*.xml * CSV file: ./post -c signals LATEST-signals.csv * Directory of files: ./post -c myfiles ~/Documents * Web crawl: ./post -c gettingstarted http://lucene.apache.org/Solr -recursive 1 -delay 1 * Standard input (stdin): echo '{commit: {}}' | ./post -c my_collection - type application/json -out yes –d * Data as string: ./post -c signals -type text/csv -out yes -d $'id,value\n1,0.47'
Suppose we have a file named sample.csv with the following content (in the bin directory).
Student ID | First Name | Lasst Name | Phone | City |
---|---|---|---|---|
001 | Rajiv | Reddy | 9848022337 | Hyderabad |
002 | Siddharth | Bhattacharya | 9848022338 | Kolkata |
003 | Rajesh | Khanna | 9848022339 | Delhi |
004 | Preethi | Agarwal | 9848022330 | Pune |
005 | Trupthi | Mohanty | 9848022336 | Bhubaneshwar |
006 | Archana | Mishra | 9848022335 | Chennai |
The above dataset contains personal details like Student id, first name, last name, phone, and city. The CSV file of the dataset is shown below. Here, you must note that you need to mention the schema, documenting its first line.
id, first_name, last_name, phone_no, location 001, Pruthvi, Reddy, 9848022337, Hyderabad 002, kasyap, Sastry, 9848022338, Vishakapatnam 003, Rajesh, Khanna, 9848022339, Delhi 004, Preethi, Agarwal, 9848022330, Pune 005, Trupthi, Mohanty, 9848022336, Bhubaneshwar 006, Archana, Mishra, 9848022335, Chennai
You can index this data under the core named sample_Solr using the post command as follows −
[Hadoop@localhost bin]$ ./post -c Solr_sample sample.csv
On executing the above command, the given document is indexed under the specified core, generating the following output.
/home/Hadoop/java/bin/java -classpath /home/Hadoop/Solr/dist/Solr-core 6.2.0.jar -Dauto = yes -Dc = Solr_sample -Ddata = files org.apache.Solr.util.SimplePostTool sample.csv SimplePostTool version 5.0.0 Posting files to [base] url http://localhost:8983/Solr/Solr_sample/update... Entering auto mode. File endings considered are xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf, htm,html,txt,log POSTing file sample.csv (text/csv) to [base] 1 files indexed. COMMITting Solr index changes to http://localhost:8983/Solr/Solr_sample/update... Time spent: 0:00:00.228
Visit the homepage of Solr Web UI using the following URL −
http://localhost:8983/
Select the core Solr_sample. By default, the request handler is /select and the query is “:”. Without doing any modifications, click the ExecuteQuery button at the bottom of the page.
On executing the query, you can observe the contents of the indexed CSV document in JSON format (default), as shown in the following screenshot.
Note − In the same way, you can index other file formats such as JSON, XML, CSV, etc.
You can also index documents using the web interface provided by Solr. Let us see how to index the following JSON document.
[ { "id" : "001", "name" : "Ram", "age" : 53, "Designation" : "Manager", "Location" : "Hyderabad", }, { "id" : "002", "name" : "Robert", "age" : 43, "Designation" : "SR.Programmer", "Location" : "Chennai", }, { "id" : "003", "name" : "Rahim", "age" : 25, "Designation" : "JR.Programmer", "Location" : "Delhi", } ]
Open Solr web interface using the following URL −
http://localhost:8983/
Step 2
Select the core Solr_sample. By default, the values of the fields Request Handler, Common Within, Overwrite, and Boost are /update, 1000, true, and 1.0 respectively, as shown in the following screenshot.
Now, choose the document format you want from JSON, CSV, XML, etc. Type the document to be indexed in the text area and click the Submit Document button, as shown in the following screenshot.
Following is the Java program to add documents to Apache Solr index. Save this code in a file with the name AddingDocument.java.
import java.io.IOException; import org.apache.Solr.client.Solrj.SolrClient; import org.apache.Solr.client.Solrj.SolrServerException; import org.apache.Solr.client.Solrj.impl.HttpSolrClient; import org.apache.Solr.common.SolrInputDocument; public class AddingDocument { public static void main(String args[]) throws Exception { //Preparing the Solr client String urlString = "http://localhost:8983/Solr/my_core"; SolrClient Solr = new HttpSolrClient.Builder(urlString).build(); //Preparing the Solr document SolrInputDocument doc = new SolrInputDocument(); //Adding fields to the document doc.addField("id", "003"); doc.addField("name", "Rajaman"); doc.addField("age","34"); doc.addField("addr","vishakapatnam"); //Adding the document to Solr Solr.add(doc); //Saving the changes Solr.commit(); System.out.println("Documents added"); } }
Compile the above code by executing the following commands in the terminal −
[Hadoop@localhost bin]$ javac AddingDocument [Hadoop@localhost bin]$ java AddingDocument
On executing the above command, you will get the following output.
Documents added