Apache POI Word - Text Extraction


Advertisements

This chapter explains how to extract simple text data from a Word document using Java. In case you want to extract metadata from a Word document, make use of Apache Tika.

For .docx files, we use the class org.apache.poi.xwpf.extractor.XPFFWordExtractor that extracts and returns simple data from a Word file. In the same way, we have different methodologies to extract headings, footnotes, table data, etc. from a Word file.

The following code shows how to extract simple text from a Word file −

import java.io.FileInputStream;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class WordExtractor {

   public static void main(String[] args)throws Exception {

      XWPFDocument docx = new XWPFDocument(new FileInputStream("create_paragraph.docx"));
      
      //using XWPFWordExtractor Class
      XWPFWordExtractor we = new XWPFWordExtractor(docx);
      System.out.println(we.getText());
   }
}

Save the above code as WordExtractor.java. Compile and execute it from the command prompt as follows −

$javac WordExtractor.java
$java WordExtractor

It will generate the following output:

At howcodex.com, we strive hard to provide quality tutorials for self-learning
purpose in the domains of Academics, Information Technology, Management and Computer
Programming Languages.
Advertisements