Read PDF File in Java

Apache PDFBox, iText 5, iText 7

In this tutorial, we’ll learn how to read a PDF file in Java using different libraries.

Overview

Portable Document Format (PDF) is a popular and widely used file format for documents. It is a common choice for electronic distribution (e.g. email attachments) and for print media.

Unlike plain text files, PDF files are complex to read, and Java doesn’t provide native support for reading them. The good news is that there are many open-source Java libraries that we can use. In this article, we’ll look at some of the popular libraries for reading a PDF file in Java: Apache PDFBox, iText 5, and iText 7.
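To see why a PDF cannot simply be read like a text file, it helps to look at its first bytes: a PDF file is expected to start with the marker `%PDF-`, followed by structured data that is mostly binary. The following is a minimal sketch, assuming Java 11 or later, that checks for this marker before a file is handed to one of the libraries. The class and method names (`PdfHeaderCheck`, `startsWithPdfHeader`, `looksLikePdf`) are hypothetical and not part of any library covered here.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class PdfHeaderCheck {

  // A PDF file is expected to begin with the ASCII bytes "%PDF-"
  private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

  // Returns true if the given bytes start with the PDF header marker
  static boolean startsWithPdfHeader(byte[] head) {
    if (head.length < PDF_MAGIC.length) {
      return false;
    }
    for (int i = 0; i < PDF_MAGIC.length; i++) {
      if (head[i] != PDF_MAGIC[i]) {
        return false;
      }
    }
    return true;
  }

  // Reads only the first few bytes of the file and checks them
  static boolean looksLikePdf(Path path) throws IOException {
    try (InputStream in = Files.newInputStream(path)) {
      return startsWithPdfHeader(in.readNBytes(PDF_MAGIC.length));
    }
  }

  public static void main(String[] args) throws IOException {
    if (args.length == 0) {
      System.out.println("Usage: java PdfHeaderCheck <file>");
      return;
    }
    System.out.println(looksLikePdf(Path.of(args[0])));
  }
}
```

This check only tells us that a file claims to be a PDF. Parsing the content that follows the header is the hard part, and that is what the libraries below are for.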

Apache PDFBox

The Apache PDFBox library lets you create new PDF documents, extract content from PDFs, fill in PDF forms, save a PDF as an image, split and merge documents, digitally sign PDF files, print PDF files, and more.

Apache PDFBox is quite easy to use for basic text extraction. Let’s look at an example:

  1. Let’s first add the pdfbox dependency to the pom.xml
    <dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>pdfbox</artifactId>
        <version>2.0.27</version>
    </dependency>
    
    or build.gradle
    implementation group: 'org.apache.pdfbox', name: 'pdfbox', version: '2.0.27'
    
  2. Let’s look at a simple example of using Apache PDFBox to read text from a PDF file:
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;
    import java.io.File;
    import java.io.IOException;
    
    public class PdfDocumentReader {
    
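      public void readPdfFile(File file) throws IOException {
        // try-with-resources closes the document even if extraction fails
        try (PDDocument document = PDDocument.load(file)) {
            // Encrypted documents are skipped in this simple example
            if (!document.isEncrypted()) {
                PDFTextStripper textStripper = new PDFTextStripper();
                // Extracts the text of the whole document as a single string
                String text = textStripper.getText(document);
                System.out.println("Text:" + text);
            }
        }
      }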
    }
    

In this example, we created the PDDocument instance in a try-with-resources statement so that the document is closed automatically after the PDF file is read. The text is extracted only if the document is not encrypted.
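The auto-close behaviour can be observed without any PDF library. The following is a minimal sketch in which `TrackedResource` is a hypothetical stand-in for a document handle such as PDDocument. It records the order of events to show that `close()` runs whether or not reading fails, and that it runs before the `catch` block.

```java
import java.util.ArrayList;
import java.util.List;

public class TryWithResourcesDemo {

  // Hypothetical stand-in for a document handle; it only records what happens to it
  static class TrackedResource implements AutoCloseable {
    private final List<String> log;

    TrackedResource(List<String> log) {
      this.log = log;
      log.add("open");
    }

    void read(boolean fail) {
      log.add("read");
      if (fail) {
        throw new IllegalStateException("simulated read failure");
      }
    }

    @Override
    public void close() {
      log.add("close");
    }
  }

  // Runs one open/read cycle and returns the recorded events in order
  static List<String> run(boolean fail) {
    List<String> log = new ArrayList<>();
    try (TrackedResource resource = new TrackedResource(log)) {
      resource.read(fail);
    } catch (IllegalStateException e) {
      // The resource has already been closed by the time we get here
      log.add("caught");
    }
    return log;
  }

  public static void main(String[] args) {
    System.out.println(run(false)); // prints [open, read, close]
    System.out.println(run(true));  // prints [open, read, close, caught]
  }
}
```

The same guarantee applies to any class that implements `AutoCloseable`, which is why the reader method needs no explicit `close()` call.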

Please refer to the Apache PDFBox examples for more use cases.

iText 5

iText provides open-source libraries for reading and writing complex PDF files. It offers low-level support for reading a PDF file, but it is a bit complex to understand and work with.

Please note that iText 5 has reached end of life (EOL) and is in maintenance mode. It only receives security-related releases and fixes, so that users who built their solutions on iText 5 can safely continue using it. No new features will be added. For new implementations, iText 7 is recommended.

Let’s explore how to read a PDF using iText 5:

  1. Let’s first add the itextpdf dependency to the pom.xml
    <dependency>
        <groupId>com.itextpdf</groupId>
        <artifactId>itextpdf</artifactId>
        <version>5.5.13.3</version>
    </dependency>
    
    or build.gradle
    implementation group: 'com.itextpdf', name: 'itextpdf', version: '5.5.13.3'
    
  2. Let’s look at a simple example of using iText 5 to read text from a PDF file:
    import com.itextpdf.text.pdf.PdfReader;
    import com.itextpdf.text.pdf.parser.PdfTextExtractor;
    import java.io.IOException;
    
    public class PdfDocumentReader {
    
      public void readPdfFile(String fileName) throws IOException {
        PdfReader pdfReader = new PdfReader(fileName);
        try {
            int pages = pdfReader.getNumberOfPages();
            StringBuilder text = new StringBuilder();
            // iText page numbers start at 1
            for (int i = 1; i <= pages; i++) {
                text.append(PdfTextExtractor.getTextFromPage(pdfReader, i));
            }
            System.out.println("Text:" + text);
        } finally {
            // Release the resources held by the reader
            pdfReader.close();
        }
      }
    }
    

In this example, we created a PdfReader instance to load the PDF file and then looped through the pages, which are numbered from 1, to extract the content of each page.
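The loop above appends the text of each page directly after the previous one, so page boundaries are not visible in the output. The following is a minimal sketch of one way to keep them, assuming the per-page strings have already been collected into a list. `PageTextJoiner`, `joinPages`, and the separator format are hypothetical choices for illustration and are not part of iText.

```java
import java.util.List;

public class PageTextJoiner {

  // Joins per-page text and prefixes each page with a 1-based page marker
  static String joinPages(List<String> pageTexts) {
    StringBuilder text = new StringBuilder();
    for (int i = 0; i < pageTexts.size(); i++) {
      // List indexes are 0-based, page numbers are 1-based
      text.append("--- Page ").append(i + 1).append(" ---\n");
      text.append(pageTexts.get(i)).append('\n');
    }
    return text.toString();
  }

  public static void main(String[] args) {
    // Sample strings standing in for text returned by a PDF library
    System.out.print(joinPages(List.of("first page text", "second page text")));
  }
}
```

In the reader method, this would mean adding each page’s extracted text to a list inside the loop and calling `joinPages` once at the end.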

iText 7

iText 7 is the recommended iText library for reading a PDF file. Let’s explore it:

  1. Let’s first add the itext7-core dependency to the pom.xml
    <dependency>
      <groupId>com.itextpdf</groupId>
      <artifactId>itext7-core</artifactId>
      <version>7.2.5</version>
      <type>pom</type>
    </dependency>
    
    or build.gradle
    implementation group: 'com.itextpdf', name: 'itext7-core', version: '7.2.5'
    
  2. Then write code that uses iText 7 to read text from a PDF file:
    import com.itextpdf.kernel.pdf.PdfDocument;
    import com.itextpdf.kernel.pdf.PdfPage;
    import com.itextpdf.kernel.pdf.PdfReader;
    import com.itextpdf.kernel.pdf.canvas.parser.PdfTextExtractor;
    import java.io.File;
    import java.io.IOException;
    
    public class PdfDocumentReader {
    
      public void readPdfFile(File file) throws IOException {
        StringBuilder text = new StringBuilder();
        // try-with-resources closes the document even if extraction fails
        try (PdfDocument document = new PdfDocument(new PdfReader(file))) {
            // iText page numbers start at 1
            for (int i = 1; i <= document.getNumberOfPages(); ++i) {
                PdfPage page = document.getPage(i);
                text.append(PdfTextExtractor.getTextFromPage(page));
            }
        }
        System.out.println("Text:" + text);
      }
    }
    

In this example, we created the PdfDocument instance in a try-with-resources statement so that the document is closed automatically after the PDF file is read.